### Merge Validated IDs

This notebook fills in the entries of `metadata_for_validation` that are missing IMDb IDs, using the manually gathered movie IDs.

In [1]:
import pandas as pd

Import the manually validated metadata file.

In [2]:
METADATA_DIR = "../../data/6_character_metadata"
df = pd.read_csv(f"{METADATA_DIR}/validated_metadata.csv", dtype={"alt_id": str})
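
One likely reason for passing `dtype={"alt_id": str}` is that an ID column containing missing values would otherwise be inferred as a float. The sketch below illustrates this on a toy CSV; the titles and ID values are hypothetical and not taken from the real file.

```python
import io
import pandas as pd

# Toy CSV (hypothetical values) with one missing ID.
csv_text = "title,alt_id\nA,0092654\nB,\nC,21052498\n"

# Default inference: the missing value forces the column to float64.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit string dtype: IDs stay as text, leading zeros included.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"alt_id": str})

print(inferred["alt_id"].dtype)
print(as_text["alt_id"].tolist())
```

Keeping IDs as strings also means the two files can be compared on `movie_id` without type mismatches.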

In [3]:
df.head()

Unnamed: 0,title,movie_id,file,dir,synopsis,year,link,match,alt_id,notes
0,Chain,,5965_Chain,com,A man gets entangled in an extramarital affair...,2019.0,/Chain-Eddie-Watson/dp/B09KHGHG72/ref=sr_1_119...,0,21052498,
1,Kaala (Tamil),,3343_Kaala_Tamil,com,Kaala (aka) Karikaalan is a representative of ...,2018.0,/Kaala-Tamil-Super-Star-Rajinikanth/dp/B08KWP2...,0,6929642,
2,Custody (Tamil),,5972_Custody_Tamil,com,The movie “Custody” is a run action drama a...,2023.0,/Custody-Tamil-Naga-Chaitanya-Akkineni/dp/B0BZ...,0,23782248,
3,Tad: The Lost Explorer,,5973_Tad_The_Lost_Explorer,com,Tad's dreams of being a famous adventurer fina...,2013.0,/Tad-Lost-Explorer-Kerry-Shale/dp/B079P6FP7J/r...,0,1764625,
4,The Big Easy,,1030_The_Big_Easy,com,"Set in New Orleans, this crime thriller tells ...",1987.0,/Big-Easy-Dennis-Quaid/dp/B07VWYC9V8/ref=sr_1_...,0,92654,


Manually remove unfit entries (comedy specials, anthologies, etc.) flagged in the `notes` column.

In [4]:
df = df.drop([8, 54, 191, 206, 218, 228, 229, 236, 239, 299, 311, 319, 327, 331, 367, 379, 384, 387, 407, 435])
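
Note that `DataFrame.drop` with a list removes rows by index label, not by position, and leaves gaps in the index. A minimal sketch on a toy frame (titles are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B", "C", "D"]})

# Drop the rows labelled 1 and 3; the remaining labels are not renumbered.
kept = toy.drop([1, 3])
print(kept.index.tolist())

# Dropping a label that is already gone raises KeyError,
# which is what happens if such a cell is run twice.
try:
    kept.drop([1])
    rerun_failed = False
except KeyError:
    rerun_failed = True
print(rerun_failed)
```

This is why the hard-coded labels only make sense against the freshly loaded file, before any filtering or index reset.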

Remove rows with null or duplicate `alt_id` values.

In [5]:
df_rem = df.loc[~df['alt_id'].isnull()]

In [7]:
df_rem.shape, df.shape

((421, 10), (421, 10))

In [8]:
df_unique = df_rem[~df_rem['alt_id'].duplicated()]

In [9]:
df_unique.shape

(370, 10)
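
In the run above, no rows had a null `alt_id` (both shapes are 421 rows), and de-duplication removed 51 rows (421 to 370). The same two filters can also be written with pandas' built-in helpers; the sketch below checks the equivalence on a toy frame with made-up values.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "B (dub)", "C"],
    "alt_id": ["111", "222", "222", None],
})

# Same filters as the cells above, written as boolean masks.
masked = toy[~toy["alt_id"].isnull()]
masked = masked[~masked["alt_id"].duplicated()]

# Equivalent built-in helpers.
helpers = toy.dropna(subset=["alt_id"]).drop_duplicates(subset=["alt_id"])

print(helpers)
```

Both forms keep the first row for each ID, so the choice is mostly a matter of readability.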

Import the original validation file to merge.

In [10]:
df_meta = pd.read_csv(f"{METADATA_DIR}/metadata_for_validation.csv", dtype={"movie_id": str})

In [11]:
df_meta.head()

Unnamed: 0,title,movie_id,file,dir,synopsis,year,link
0,My Fault,21909764,0_My_Fault,com,"Noah must leave her town, boyfriend and friend...",2023.0,/My-Fault-Nicole-Wallace/dp/B0B683GB78/ref=sr_...
1,On The Trail of UFOS: Dark Sky,14928972,1000_On_The_Trail_of_UFOS_Dark_Sky,com,On the Trail of UFOs: Dark Sky traces decades ...,2021.0,/Trail-UFOS-Dark-Sky/dp/B09BKF2WGQ/ref=sr_1_24...
2,Student Of The Year,2172071,1001_Student_Of_The_Year,com,"Introducing Alia Bhatt (Sharanya Singhania), S...",2012.0,/Student-Year-Sidharth-Malhotra/dp/B0BZTD87WK/...
3,"The Badge, The Bible and Bigfoot",11208026,1005_The_Badge_The_Bible_and_Bigfoot,com,"In a small coastal town Bigfoot is sighted, an...",2019.0,/Badge-Bible-Bigfoot-Ashley-Wright/dp/B09JMYV8...
4,Sharknado 5: Global Swarming,6298780,1009_Sharknado_5_Global_Swarming,com,"With much of North America lying in ruins, the...",2017.0,/Sharknado-Global-Swarming-Ian-Ziering/dp/B07M...


Remove rows with a null movie ID. Duplicate IDs are inspected here and dropped during the merge below.

In [12]:
df_meta_rem = df_meta[~df_meta['movie_id'].isnull()]

We can also check the metadata file for duplicates by IMDb ID, which is a more reliable key for identifying a film.

In [15]:
df_meta_rem[df_meta_rem['movie_id'].duplicated()]

Unnamed: 0,title,movie_id,file,dir,synopsis,year,link
367,Ponniyin Selvan Part 2 (Hindi),22444570,3582_Ponniyin_Selvan_Part_2_Hindi,com,968 AD. The Pandyan assassins gather once agai...,2023.0,/Ponniyin-Selvan-Part-2-Hindi/dp/B0B791B4BJ/re...
430,Anni Manchi Sakunamule,14986032,1252_Anni_Manchi_Sakunamule,com,With the impact on interpersonal relationships...,2023.0,/Anni-Manchi-Sakunamule-Santosh-Sobhan/dp/B0BZ...
459,Vaarasudu,11998558,6194_Vaarasudu,com,"Vijay, the prodigal son of business tycoon Raj...",2023.0,/Vaarasudu-Thalapathy-Vijay/dp/B0B8S31J5G/ref=...
539,K.G.F: Chapter 1 (Telugu),7838252,1305_KGF_Chapter_1_Telugu,com,KGF Chapter 1 is a film based on the gold mine...,2018.0,/K-G-F-Chapter-1-Telugu-Yash/dp/B08KWQ4QMH/ref...
735,Raktha Sambandham,15175418,1484_Raktha_Sambandham,com,The story revolves around the sibling families...,2021.0,/Raktha-Sambandham-Jyotika/dp/B09HYB6T26/ref=s...
...,...,...,...,...,...,...,...
3472,A Lot Like Christmas (4K UHD),16150870,1409_A_Lot_Like_Christmas_4K_UHD,after2020,Jessica Roberts owns the most popular Christma...,2021.0,/Lot-Like-Christmas-4K-UHD/dp/B0BP1YD4HY/ref=s...
3509,Balaga,26690825,162_Balaga,after2020,"In a village in Telangana, a family head Komar...",2023.0,/Balaga-Priyadarshi/dp/B0C9P21961/ref=sr_1_171...
3519,The Protégé,6079772,907_The_Protege,after2020,"Michael Keaton, Maggie Q, and Samuel L. Jackso...",2021.0,/Prot%C3%A9g%C3%A9-Michael-Keaton/dp/B09FKSV56...
3532,13aam Number Veedu,13242968,953_13aam_Number_Veedu,after2020,5 IT professionals staying together vacate the...,2020.0,/13aam-Number-Veedu-Ramana/dp/B08P2HM7DW/ref=s...
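
The listing above shows only the repeated rows, because `duplicated()` defaults to `keep="first"` and does not flag the first occurrence of each ID. To see every row in a duplicate group side by side, `keep=False` can be used. A small sketch with hypothetical titles and IDs:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Film", "Film (Hindi)", "Other", "Film (4K UHD)"],
    "movie_id": ["100", "100", "200", "100"],
})

# Default keep="first": flags only the repeats, not the first occurrence.
repeats = toy[toy["movie_id"].duplicated()]

# keep=False: flags every row whose ID appears more than once.
groups = toy[toy["movie_id"].duplicated(keep=False)].sort_values("movie_id")

print(len(repeats), len(groups))
```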


Merge the original validation file with the manually collected data.

In [16]:
# Drop the original movie_id and the validation-only columns, then promote alt_id to movie_id.
df_meta_2 = df_unique.drop(columns=['movie_id', 'match', 'notes']).rename(columns={'alt_id': 'movie_id'}).reset_index(drop=True)

In [17]:
df_meta_merged = pd.concat([df_meta_rem.reset_index(drop=True), df_meta_2])
# duplicated() keeps the first occurrence, so rows from the original file win over manual ones.
df_meta_merged = df_meta_merged[~df_meta_merged['movie_id'].duplicated()]

In [18]:
df_meta_merged[df_meta_merged['movie_id'].duplicated()]

Unnamed: 0,title,movie_id,file,dir,synopsis,year,link


No duplicates remain.
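
The merge is a concatenation followed by de-duplication on `movie_id`. Since the first occurrence is kept, rows from the original file take precedence whenever a manually gathered ID already exists there. A minimal sketch with toy frames (names and values are illustrative only):

```python
import pandas as pd

original = pd.DataFrame({"title": ["A", "B"], "movie_id": ["1", "2"]})
manual = pd.DataFrame({"title": ["B (manual)", "C"], "movie_id": ["2", "3"]})

# Stack the two tables, original rows first.
merged = pd.concat([original, manual], ignore_index=True)

# Keep the first row seen for each movie_id.
merged = merged.drop_duplicates(subset="movie_id", keep="first")

print(merged["title"].tolist())
```

Reversing the order of the list passed to `pd.concat` would make the manual rows win instead.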

In [19]:
df_meta_merged.shape

(3374, 7)

In [20]:
df_meta_merged.to_csv(f"{METADATA_DIR}/final_validated_metadata.csv", index=False)