### Check Xray duplicates

This notebook checks for cases where two movies share the same X-ray file. This can happen because of network issues or other unknown factors. To find such cases, we look for pairs of movies that have exactly the same cast. The strategy is explained step by step throughout the notebook.

In [1]:
import pandas as pd

Read the validated metadata

In [2]:
meta = pd.read_csv("../../data/6_character_metadata/final_validated_metadata.csv", dtype={"movie_id": str})

Read the people file to check we have data for all movies.

In [3]:
people = pd.read_csv("../../data/6_character_metadata/all_people_with_duplicates.csv")

In [4]:
final_people = people[people['movie'].isin(meta['file'])]

In [5]:
final_people['movie'].nunique()

3374

In [6]:
meta['file'].nunique()

3374
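
Because `final_people` only contains movies that appear in `meta['file']`, equal unique counts mean every movie in the metadata has people data. The same check can be written as an explicit set difference. A minimal sketch on made-up data (the toy frames and their values are illustrative assumptions; only the column names `file`, `movie` and `name_id` come from the notebook):

```python
import pandas as pd

# Toy stand-ins for `meta` and `people`; values are illustrative only.
meta = pd.DataFrame({"file": ["1_A", "2_B", "3_C"]})
people = pd.DataFrame({
    "movie": ["1_A", "1_A", "2_B", "3_C", "4_D"],
    "name_id": ["nm1", "nm2", "nm3", "nm4", "nm5"],
})

# Keep only people rows that belong to a movie in the metadata.
final_people = people[people["movie"].isin(meta["file"])]

# Movies present in the metadata but absent from the people file.
missing = set(meta["file"]) - set(final_people["movie"])
print(missing)
```

An empty set means no movie is missing cast information.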

Combine all the people into a list. We check for duplicate X-Rays by using this cast information.

In [7]:
dup = pd.DataFrame(final_people.groupby('movie')['name_id'].agg(list)).reset_index()
dup

Unnamed: 0,movie,name_id
0,0_My_Fault,"[nm1799971, nm2338819, nm5913850, nm1293644, n..."
1,1000_On_The_Trail_of_UFOS_Dark_Sky,"[nm13818646, nm10541807, nm7379431, nm13818650..."
2,1001_Student_Of_The_Year,"[nm5023746, nm0438501, nm4765939, nm5208689, n..."
3,1003_Exit_Humanity,"[nm0004051, nm4056899, nm4224457, nm2495152, n..."
4,1005_The_Badge_The_Bible_and_Bigfoot,"[nm11004690, nm10977492, nm10977492, nm1100120..."
...,...,...
3369,998_Haunting_of_Helena,"[nm3246544, nm1074178, nm5021846, nm0581711, n..."
3370,999_Lost_Child,"[nm1628079, nm7748637, nm11619505, nm11619502,..."
3371,99_Cowgirls_N_Angels,"[nm1933128, nm1027429, nm4685196, nm1988958, n..."
3372,9_80_For_Brady,"[nm0005499, nm0000404, nm0000398, nm0001549, n..."


In [8]:
duplicate_entries = dup[dup.duplicated('name_id', keep=False)]

In [9]:
duplicate_entries.shape

(191, 2)
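
The check above treats two casts as identical only when their `name_id` lists match element for element, in the same order. Python lists are also not hashable, so one possible variation is to compare sorted tuples instead, which ignores the order in which cast members are listed. A minimal sketch on made-up data (the toy `people` frame is an assumption, and this is a variation rather than the method used in this notebook):

```python
import pandas as pd

# Toy data: movies 1_A and 2_B list the same two people in a different order.
people = pd.DataFrame({
    "movie": ["1_A", "1_A", "2_B", "2_B", "3_C"],
    "name_id": ["nm1", "nm2", "nm2", "nm1", "nm3"],
})

# Collapse each movie's cast into a sorted tuple (hashable, order-insensitive).
cast = (
    people.groupby("movie")["name_id"]
    .agg(lambda ids: tuple(sorted(ids)))
    .reset_index()
)

# keep=False flags every member of a duplicated group, not just the repeats.
same_cast = cast[cast.duplicated("name_id", keep=False)]
print(same_cast)
```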

These are all the movies whose cast is shared with at least one other movie.

Among these duplicated movies, we assume that if a movie was matched by the imdb_matching algorithm, its X-ray data is likely correct. Conversely, duplicated movies that were not matched are assumed to carry the X-ray data of a different movie.

In [10]:
meta_imdb_matched = pd.read_csv("../../data/6_character_metadata/metadata_for_validation.csv", dtype={"movie_id": str})

In [11]:
# movies not matched with IMDb (movie_id is missing)
empty = meta_imdb_matched[meta_imdb_matched['movie_id'].isnull()]
# movies matched by the imdb_matching algorithm
properly_matched = meta_imdb_matched[~meta_imdb_matched['movie_id'].isnull()]
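
The split relies on `movie_id` being missing for unmatched movies and on the IDs being read as strings. A minimal sketch of that behaviour, using an in-memory CSV with made-up rows (the IDs and file names below are illustrative assumptions, not values from the real file):

```python
import io

import pandas as pd

# Hypothetical CSV content; the second movie has no movie_id.
csv_text = "file,movie_id\n1_A,0111161\n2_B,\n3_C,0068646\n"

# dtype=str keeps leading zeros; empty fields are still read as missing values.
df = pd.read_csv(io.StringIO(csv_text), dtype={"movie_id": str})

unmatched = df[df["movie_id"].isna()]
matched = df[df["movie_id"].notna()]
print(unmatched["file"].tolist(), matched["movie_id"].tolist())
```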

In [24]:
unmatched_duplicates = duplicate_entries[duplicate_entries['movie'].isin(empty['file'])]
unmatched_duplicates

Unnamed: 0,movie,name_id
11,1013_MOM_BEHIND_THE_WHEEL,"[nm10056858, nm8257146, nm3559559, nm4204921, ..."
28,1030_The_Big_Easy,"[nm1486911, nm1267552, nm6143802, nm8511145, n..."
44,1052_Zero_Days,"[nm6932074, nm2408881, nm3535087, nm4049575, n..."
75,1091_Bird_Catcher_The,"[nm0033165, nm1030395, nm1401531, nm0444621, n..."
103,1135_The_March_Sisters_at_Christmas,"[nm1139455, nm8475457, nm0712486, nm0324872, n..."
...,...,...
3230,800_CLIMAX,"[nm12339812, nm0707399, nm10466463, nm12539241..."
3238,809_Well_Done_Baby,"[nm12339812, nm0707399, nm10466463, nm12539241..."
3257,837_Nanis_Gang_Leader,"[nm1633541, nm10976972, nm9171680, nm13373316,..."
3318,909_Tender_Mercies,"[nm9555673, nm7757668, nm4756769, nm9361072, n..."


The movies above are the duplicated entries that the imdb_matching algorithm could not match. Removing them leaves the entries that were matched properly and, under the assumption above, have the genuine X-ray data of their respective movies.

In [14]:
# remove the entries that couldn't be matched by algorithm (possibly because of duplicate xrays)
duplicate_entries_2 = duplicate_entries[~duplicate_entries['movie'].isin(empty['file'])]

In [34]:
# after we removed the entries that weren't matched,
# we may have removed duplicate cases for many movies

# now we can check for movies that are still duplicated
dup_2 = duplicate_entries_2[duplicate_entries_2['name_id'].duplicated(keep=False)]
# keep the genuine movie that has multiple titles
dup_2 = dup_2[dup_2['movie'] != '1697_Bigil']
dup_2

Unnamed: 0,movie,name_id
245,1317_Pata_Nahi_Par_Bolna_Hai,[nm7732132]
293,1375_Tathastu,[nm6321286]
340,1414_Dilli_Se_Hoon_BD,[nm7677658]
364,1436_Daddy_Issues,[nm0930149]
688,1850_Gaadi_Tera_Bhai_Chalayega,[nm7677658]
798,1974_Fun_Size,[nm0930149]
1003,2314_Whistle,"[nm10154962, nm9497172, nm6489058, nm1686962, ..."
1070,2400_Biswa_Kalyan_Rath_Biswa_Mast_Aadmi,[nm8271396]
1369,2897_Zakir_Khan_Haq_Se_Single,[nm6321286]
1643,3494_Kaksha_Gyarvi,[nm6321286]


Bigil and Whistle are the same movie under two titles, so one of them (1697_Bigil) is kept. The others are comedy specials; they are duplicated because they feature the same person.

Thus, if we remove these duplicates as well, no duplicates remain.
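
To see which titles share each cast, the remaining duplicates can be grouped by cast. A minimal sketch using a few rows copied from the output above (the helper column `cast_key` is a hypothetical name introduced here for illustration):

```python
import pandas as pd

# A few rows copied from the dup_2 output; each cast has a single person.
dup_2 = pd.DataFrame({
    "movie": [
        "1375_Tathastu",
        "1436_Daddy_Issues",
        "1974_Fun_Size",
        "2897_Zakir_Khan_Haq_Se_Single",
        "3494_Kaksha_Gyarvi",
    ],
    "name_id": [
        ["nm6321286"],
        ["nm0930149"],
        ["nm0930149"],
        ["nm6321286"],
        ["nm6321286"],
    ],
})

# Lists cannot be used as group keys, so join each cast into one string.
groups = (
    dup_2.assign(cast_key=dup_2["name_id"].map(",".join))
    .groupby("cast_key")["movie"]
    .agg(list)
)
print(groups)
```

Each row of `groups` lists the movies that share one cast, which makes it easier to review the groups by hand.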

In [37]:
# first remove "imdb unmatched" duplicates
meta_1 = meta[~meta['file'].isin(unmatched_duplicates['movie'])]
# then remove entries that are still duplicated after removing "imdb unmatched" duplicates, like the comedy specials
meta_2 = meta_1[~meta_1['file'].isin(dup_2['movie'])]
meta_2

Unnamed: 0,title,movie_id,file,dir,synopsis,year,link
0,My Fault,21909764,0_My_Fault,com,"Noah must leave her town, boyfriend and friend...",2023.0,/My-Fault-Nicole-Wallace/dp/B0B683GB78/ref=sr_...
1,On The Trail of UFOS: Dark Sky,14928972,1000_On_The_Trail_of_UFOS_Dark_Sky,com,On the Trail of UFOs: Dark Sky traces decades ...,2021.0,/Trail-UFOS-Dark-Sky/dp/B09BKF2WGQ/ref=sr_1_24...
2,Student Of The Year,2172071,1001_Student_Of_The_Year,com,"Introducing Alia Bhatt (Sharanya Singhania), S...",2012.0,/Student-Year-Sidharth-Malhotra/dp/B0BZTD87WK/...
3,"The Badge, The Bible and Bigfoot",11208026,1005_The_Badge_The_Bible_and_Bigfoot,com,"In a small coastal town Bigfoot is sighted, an...",2019.0,/Badge-Bible-Bigfoot-Ashley-Wright/dp/B09JMYV8...
4,Sharknado 5: Global Swarming,6298780,1009_Sharknado_5_Global_Swarming,com,"With much of North America lying in ruins, the...",2017.0,/Sharknado-Global-Swarming-Ian-Ziering/dp/B07M...
...,...,...,...,...,...,...,...
3366,Only,3984356,868_Only,after2020,After a mysterious plague threatens to kill al...,2020.0,/Only-Freida-Pinto/dp/B085PWDYFV/ref=sr_1_2357...
3367,Hustle,15693006,1742_Hustle,after2020,A con artist finds herself torn between a no-s...,2021.0,/Hustle-Nancy-Isime/dp/B0BZ57DWF9/ref=sr_1_528...
3368,No Man's Land,15686202,935_No_Mans_Land,after2020,"Working as a housekeeper, Sumitra trudges alon...",2021.0,/No-Mans-Land-Lukman-Avaran/dp/B09LT2P8NW/ref=...
3369,Ombatthane Dikku,12299992,975_Ombatthane_Dikku,after2020,"Varadappa, a sawmill owner and rural don sends...",2022.0,/Ombatthane-Dikku-Loose-Mada-Yogi/dp/B09SGV3NL...


In [38]:
meta_2.to_csv("../../data/6_character_metadata/filtered_final_validated_metadata.csv", index=False)
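
As a final sanity check, the duplicate test can be rerun on the filtered metadata to confirm that no shared casts remain. A minimal sketch on made-up data (the toy frames and the `to_drop` list are illustrative assumptions standing in for `meta`, `people` and the movies removed above):

```python
import pandas as pd

# Toy stand-ins; 1_A and 2_B share a cast, and 2_B is the entry to drop.
meta = pd.DataFrame({"file": ["1_A", "2_B", "3_C", "4_D"]})
people = pd.DataFrame({
    "movie": ["1_A", "2_B", "3_C", "4_D"],
    "name_id": ["nm1", "nm1", "nm2", "nm3"],
})
to_drop = ["2_B"]  # hypothetical list of movies flagged as duplicates

filtered = meta[~meta["file"].isin(to_drop)]

# Rebuild the casts for the remaining movies as hashable tuples.
cast = (
    people[people["movie"].isin(filtered["file"])]
    .groupby("movie")["name_id"]
    .agg(tuple)
)

# Any rows left here would be movies that still share a cast.
remaining = cast[cast.duplicated(keep=False)]
print(len(remaining))
```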