- The entire list of 179 visual scene labels in the MovieCLIP dataset can be found in the file MovieCLIP_taxonomy_split.txt.
- Download the original videos by requesting access to the Condensed Movies Dataset. Our video tagging was performed on the videos in the Condensed Movies dataset. We do not own the raw videos.
- The complete list of CLIP tags for the shots in the MovieCLIP dataset can be downloaded from this Drive Link.
- Load the CLIP tags using the following code snippet:

      import json

      with open('movieCLIP_dataset.json', 'r') as f:
          movieCLIP_tags = json.load(f)

- `movieCLIP_tags` is a dictionary whose keys are the video names (YouTube ids in Condensed Movies). Each value is a dictionary that maps the shot names in that video to the shot's frame range, time range, and CLIP tags:

      {
        "qM8jk56Vj9Y": {
          "qM8jk56Vj9Y-Scene-018.mp4": {
            "start_frame": 1059.0,
            "end_frame": 1137.0,
            "start_time": 44.169,
            "end_time": 47.422,
            "labels": {
              "banquet": 0.7861328125,
              "dining room": 0.07110595703125,
              "restaurant": 0.028594970703125,
              "penthouse": 0.01611328125,
              "salon": 0.01186370849609375
            }
          }
        },
        "B-tq7mbTvrA": {
          "B-tq7mbTvrA-Scene-003.mp4": {
            "start_frame": 54.0,
            "end_frame": 74.0,
            "start_time": 2.252,
            "end_time": 3.086,
            "labels": {
              "batting cage": 0.479248046875,
              "locker room": 0.160400390625,
              "baseball field": 0.11248779296875,
              "stadium": 0.0601806640625,
              "bowling alley": 0.040496826171875
            }
          }
        },
        "Ld2g77JckSk": {
          "Ld2g77JckSk-Scene-018.mp4": {
            "start_frame": 974.0,
            "end_frame": 1031.0,
            "start_time": 40.627,
            "end_time": 43.004,
            "labels": {
              "animal shelter": 0.640625,
              "zoo": 0.07684326171875,
              "farm": 0.04071044921875,
              "fair": 0.0256500244140625,
              "suburban": 0.0123748779296875
            }
          }
        }
      }

  For example, to access the CLIP tags associated with the shot `qM8jk56Vj9Y-Scene-018.mp4` in the video with id `qM8jk56Vj9Y`, use `movieCLIP_tags['qM8jk56Vj9Y']['qM8jk56Vj9Y-Scene-018.mp4']['labels']`. This gives the CLIP tags with their corresponding CLIPScene scores for the shot.
- The shots can be extracted from the original videos using the provided `start_time` and `end_time` with the following command (format the start and end times as `hh:mm:ss` strings):

      ffmpeg -ss <start_time> -i <video_name specified by video id> -to <end_time> -c copy <output_shot_name>

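The extraction step above can be scripted. The following is a minimal sketch, assuming each shot entry carries `start_time` and `end_time` in seconds as in the sample entries; `format_timestamp` and `build_ffmpeg_command` are hypothetical helper names (not part of the MovieCLIP code), and the video and output file names are placeholders. It converts the times into `hh:mm:ss.mmm` strings and assembles the same ffmpeg command as an argument list.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an hh:mm:ss.mmm string."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def build_ffmpeg_command(video_path, shot_info, output_path):
    """Assemble the ffmpeg call described above for one shot entry."""
    return [
        "ffmpeg",
        "-ss", format_timestamp(shot_info["start_time"]),
        "-i", video_path,
        "-to", format_timestamp(shot_info["end_time"]),
        "-c", "copy",
        output_path,
    ]


# Times copied from the sample shot; the file names are placeholders.
shot_info = {"start_time": 44.169, "end_time": 47.422}
command = build_ffmpeg_command(
    "qM8jk56Vj9Y.mp4", shot_info, "qM8jk56Vj9Y-Scene-018.mp4"
)
print(" ".join(command))

# To actually run it (requires ffmpeg and the downloaded video):
# import subprocess; subprocess.run(command, check=True)
```

Note that when `-ss` is placed before `-i`, ffmpeg may reset the output timestamps, so `-to` can end up being counted from the seek point rather than from the start of the video. If the extracted shots come out longer than expected, passing the shot duration with `-t` instead is one possible workaround.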
- Out of the 179 visual scene labels in MovieCLIP_taxonomy_split.txt, we use 150 visual scene labels after thresholding on the top-1 CLIPScene score (>= 0.4) and the top-k (k = 2 to 5) CLIPScene scores (>= 0.1). The list of 150 visual scene labels can be found in the file label_2_ind_multi_label_thresh_0_4_0_1_150_label_map.pkl. A sample of the mapping is shown below:

      {'courtroom': 0, 'police station': 1, 'mountain': 2, 'swamp': 3, 'train': 4,
       'corridor': 5, 'baseball field': 6, 'garage': 7, 'bakery': 8, 'stairs': 9,
       'pool': 10, 'road': 11, 'park': 12, 'church': 13, 'desert': 14, ...}

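One way to read the thresholding rule above is sketched below. This is an interpretation, not the authors' released code: it assumes a shot is kept only if its top-1 CLIPScene score is at least 0.4, and that the tags ranked 2 to 5 are added as extra labels when their score is at least 0.1. The names `select_labels` and `load_label_map` are hypothetical, and the label map is assumed to be a plain pickled dictionary like the sample shown.

```python
import pickle


def select_labels(labels, top1_threshold=0.4, topk_threshold=0.1, k=5):
    """Return the tags kept for one shot under the assumed thresholding rule.

    `labels` maps tag name -> CLIPScene score, as in the 'labels' field.
    An empty list means the shot does not pass the top-1 threshold.
    """
    ranked = sorted(labels.items(), key=lambda item: item[1], reverse=True)[:k]
    if not ranked or ranked[0][1] < top1_threshold:
        return []
    kept = [ranked[0][0]]
    # Tags ranked 2..k are kept only if they clear the lower threshold.
    kept.extend(tag for tag, score in ranked[1:] if score >= topk_threshold)
    return kept


def load_label_map(path):
    """Load the label-name -> index dictionary from the pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Scores copied from the sample shot B-tq7mbTvrA-Scene-003.mp4.
sample = {
    "batting cage": 0.479248046875,
    "locker room": 0.160400390625,
    "baseball field": 0.11248779296875,
    "stadium": 0.0601806640625,
    "bowling alley": 0.040496826171875,
}
print(select_labels(sample))
# ['batting cage', 'locker room', 'baseball field']

# With the dataset files available, the label map could be loaded like this:
# label_map = load_label_map(
#     "label_2_ind_multi_label_thresh_0_4_0_1_150_label_map.pkl")
```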
- The training, validation and test splits are present in the files train_multi_label_thresh_0_4_0_1_150_labels.txt, val_multi_label_thresh_0_4_0_1_150_labels.txt and test_multi_label_thresh_0_4_0_1_150_labels.txt respectively. Each line in these files contains a shot name followed by one or more label indices:

      <shot name> <label index>
      rAchA32z2zM/rAchA32z2zM-Scene-009.mp4 63
      Vje8Fp3yQFM/Vje8Fp3yQFM-Scene-002.mp4 88
      UxIpYfsTkew/UxIpYfsTkew-Scene-042.mp4 142 75
      z9SXvUdM_iw/z9SXvUdM_iw-Scene-001.mp4 35 50

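Reading these split files can be sketched as follows. The sketch assumes that fields are separated by whitespace and that shot paths contain no spaces, which is consistent with the sample lines above; `parse_split_line`, `to_multi_hot` and `read_split` are hypothetical helper names, and the 150-dimensional multi-hot target is just one common way to represent multi-label annotations.

```python
NUM_LABELS = 150  # number of visual scene labels after thresholding


def parse_split_line(line):
    """Split one line into (shot path, list of integer label indices)."""
    fields = line.split()
    return fields[0], [int(index) for index in fields[1:]]


def to_multi_hot(label_indices, num_labels=NUM_LABELS):
    """Encode label indices as a 0/1 list of length num_labels."""
    target = [0] * num_labels
    for index in label_indices:
        target[index] = 1
    return target


def read_split(path):
    """Read a split file into a list of (shot path, label indices) pairs."""
    samples = []
    with open(path) as f:
        for line in f:
            if line.strip():  # skip blank lines
                samples.append(parse_split_line(line))
    return samples


# Sample line copied from the split format shown above.
shot, indices = parse_split_line("UxIpYfsTkew/UxIpYfsTkew-Scene-042.mp4 142 75")
target = to_multi_hot(indices)
print(shot, indices, sum(target))
```

The part of the shot path before the `/` is the video id, so it can be used together with the shot file name to look up the corresponding entry in `movieCLIP_tags`.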
- After thresholding, interrogation room and room have been merged into a single label, room.
- Phone booth and booth have been merged into a single label, booth.