### 2: Clean Movie Links Metadata

This notebook parses and standardizes the tags and titles of movies in the movie listing. It also explores the kinds of duplicated movie entries found in the data and techniques for removing them.

In [1]:
import pandas as pd
import numpy as np
from unidecode import unidecode
import urllib.parse  # import the submodule explicitly; urllib.parse.unquote is used below

In [2]:
import sys
sys.path.append("../")
from consts import CLEAN_SCRAPE_DIR, BEFORE_2010_DIR, IN_2010S, AFTER_2020

In [3]:
# set batch variable here and run the cell and everything below for each
TARGET_DIR = AFTER_2020

In [4]:
METADATA_DIR = "../data/2_metadata"

In [5]:
# USEFUL ENCODINGS: cp1252, utf-8
# Try utf-8 first and fall back to cp1252 only when decoding fails.
try:
    df = pd.read_csv(f"{METADATA_DIR}/{TARGET_DIR}/meta_en_prime.csv", encoding='utf-8', header=None)
except UnicodeDecodeError:
    df = pd.read_csv(f"{METADATA_DIR}/{TARGET_DIR}/meta_en_prime.csv", encoding='cp1252', header=None)
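
The try/except above could be generalized to a small helper that walks a list of candidate encodings. The following is only a sketch: `read_text_with_fallback` is a hypothetical name, and it reads a plain file rather than calling `pd.read_csv`.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes the file."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo on a throwaway file written in cp1252 (illustrative data only).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("Amélie,2001\n".encode("cp1252"))
demo_text, demo_encoding = read_text_with_fallback(tmp.name)
os.remove(tmp.name)
```

The order of `encodings` matters: cp1252 decodes almost any byte sequence without error, so it is best kept as the last resort.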

In [6]:
df.tail()

Unnamed: 0,0,1,2
2890,Future Soldier,/Future-Soldier-Sean-Earl-McPherson/dp/B0BZJSK...,2023|TV-14|CC
2891,Pawankhind,/Pawankhind-Chinmay-Mandlekar/dp/B09QFHQSNX/re...,2022|CC
2892,French Biriyani,/French-Biriyani-Danish-Sait/dp/B08CZH1RWY/ref...,2020|CC
2893,Brutus vs Cesar,/Brutus-vs-Cesar-Kheiron/dp/B09FKJNRJ6/ref=sr_...,2020|PG-13|CC
2894,Deadlocked,/Deadlocked-Taylor-Tunes/dp/B08PC9YZCF/ref=sr_...,2020|CC


In [7]:
df.columns=['title', 'link', 'tags']

In [8]:
df.shape

(2895, 3)

In [9]:
df['year'] = pd.to_numeric(df['tags'].str.split('|').str[0], errors='coerce')
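
The `tags` column packs several fields into one pipe-delimited string. As a sketch of how the remaining fields could be pulled apart, the hypothetical helper `parse_tags` below assumes the layout seen in the sample rows (an optional leading year, an optional rating, and an optional `CC` marker); the real data may contain other tags.

```python
def parse_tags(tags):
    """Split a pipe-delimited tag string into year, rating and a CC flag."""
    parts = [p for p in tags.split("|") if p] if isinstance(tags, str) else []
    year = None
    if parts and parts[0].isdigit():
        year = int(parts.pop(0))  # the leading numeric field is taken as the year
    has_cc = "CC" in parts
    # Assumption: the first remaining field that is not "CC" is the rating.
    rest = [p for p in parts if p != "CC"]
    rating = rest[0] if rest else None
    return {"year": year, "rating": rating, "cc": has_cc}

parsed = parse_tags("2023|PG-13|CC")
```

Non-string input (such as a missing value) yields `None` for year and rating, which parallels the effect of `errors='coerce'` in the cell above.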

First, we'll reduce each title to ASCII alphanumeric characters, with spaces replaced by underscores.

In [10]:
df['clean_title'] = [unidecode(text) for text in df['title']]
df['clean_title'] = df['clean_title'].str.replace("[^0-9a-zA-Z ]", "", regex=True).str.replace(" ", "_")
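
To make the cleaning rule concrete, here is a standalone sketch of the same steps on single strings. It swaps in `unicodedata` from the standard library as a stand-in for `unidecode`, so it only strips accents from Latin letters and does not transliterate other scripts the way `unidecode` does; `clean_title` is a hypothetical helper name.

```python
import re
import unicodedata

def clean_title(title):
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Same character class as the notebook cell: keep digits, letters, spaces.
    stripped = re.sub(r"[^0-9a-zA-Z ]", "", ascii_title)
    return stripped.replace(" ", "_")

examples = [clean_title(t) for t in ("Top Gun: Maverick", "Nishabdham (Telugu)")]
```

One side effect worth knowing: removing punctuation that sits between two spaces leaves both spaces behind, so a title such as "Doc West - Part 1" becomes `Doc_West__Part_1` with a double underscore.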

In [11]:
df.head()

Unnamed: 0,title,link,tags,year,clean_title
0,Creed III,/Creed-III-Michael-B-Jordan/dp/B0B8TJ2897/ref=...,2023|PG-13|CC,2023.0,Creed_III
1,Jurassic World Dominion,/Jurassic-World-Dominion-Chris-Pratt/dp/B0B5MQ...,2022|PG-13|CC,2022.0,Jurassic_World_Dominion
2,Air,/AIR-Matt-Damon/dp/B0B8Q3JMCG/ref=sr_1_3?qid=1...,2023|R|CC,2023.0,Air
3,Top Gun: Maverick,/Top-Gun-Maverick-Tom-Cruise/dp/B0B215H8Y3/ref...,2022|PG-13|CC,2022.0,Top_Gun_Maverick
4,80 for Brady,/80-Brady-Lily-Tomlin/dp/B0B84Z7N5K/ref=sr_1_5...,2023|PG-13|CC,2023.0,80_for_Brady


We'll also prefix each cleaned title with its row index so that the resulting names are unique.

In [12]:
df['file'] = pd.Series(np.arange(0, df.shape[0]), dtype='str') + "_" + df['clean_title']

In [13]:
df.head()

Unnamed: 0,title,link,tags,year,clean_title,file
0,Creed III,/Creed-III-Michael-B-Jordan/dp/B0B8TJ2897/ref=...,2023|PG-13|CC,2023.0,Creed_III,0_Creed_III
1,Jurassic World Dominion,/Jurassic-World-Dominion-Chris-Pratt/dp/B0B5MQ...,2022|PG-13|CC,2022.0,Jurassic_World_Dominion,1_Jurassic_World_Dominion
2,Air,/AIR-Matt-Damon/dp/B0B8Q3JMCG/ref=sr_1_3?qid=1...,2023|R|CC,2023.0,Air,2_Air
3,Top Gun: Maverick,/Top-Gun-Maverick-Tom-Cruise/dp/B0B215H8Y3/ref...,2022|PG-13|CC,2022.0,Top_Gun_Maverick,3_Top_Gun_Maverick
4,80 for Brady,/80-Brady-Lily-Tomlin/dp/B0B84Z7N5K/ref=sr_1_5...,2023|PG-13|CC,2023.0,80_for_Brady,4_80_for_Brady


In [15]:
df[df['title'].duplicated()]

Unnamed: 0,title,link,tags,year,clean_title,file
352,Bad Cupid,/Bad-Cupid-John-Rhys-Davies/dp/B08V11VWGT/ref=...,2021|CC,2021.0,Bad_Cupid,352_Bad_Cupid
540,My True Fairytale,/My-True-Fairytale-D-Mitry/dp/B0B66ZSLRD/ref=s...,2020|TV-PG|CC,2020.0,My_True_Fairytale,540_My_True_Fairytale
562,Nishabdham (Telugu),/Nishabdham-Telugu-Anushka-Shetty/dp/B095HXX6X...,2020|CC,2020.0,Nishabdham_Telugu,562_Nishabdham_Telugu
697,Lost Transmissions,/Lost-Transmissions-Simon-Pegg/dp/B0BTQ4ZX2P/r...,2020|CC,2020.0,Lost_Transmissions,697_Lost_Transmissions
710,Halal Love Story,/Halal-Love-Story-Indrajith-Sukumaran/dp/B08KW...,2020|CC,2020.0,Halal_Love_Story,710_Halal_Love_Story
889,Las huellas de elBulli,/Las-huellas-elBulli-Eduard-Xatruch/dp/B0C86S9...,2021|CC,2021.0,Las_huellas_de_elBulli,889_Las_huellas_de_elBulli
918,Silence (Malayalam),/Silence-Malayalam-Anushka-Shetty/dp/B095HWYLB...,2020|CC,2020.0,Silence_Malayalam,918_Silence_Malayalam
921,Penguin (Malayalam),/Penguin-Malayalam-Keerthy-Suresh/dp/B095HTPPB...,2020|CC,2020.0,Penguin_Malayalam,921_Penguin_Malayalam
928,Ponmagal Vandhal,/Ponmagal-Vandhal-Jyotika/dp/B095HZDBMM/ref=sr...,2020|CC,2020.0,Ponmagal_Vandhal,928_Ponmagal_Vandhal
1098,She,/She-Sahasra-Kirthi/dp/B094C1JS67/ref=sr_1_255...,2021|CC,2021.0,She,1098_She


Multiple entries can share the same title, and the same movie can appear with slightly different tags (for example, a different year). If two entries also share the same short_url (the first path segment of the link), they are likely duplicates, with some exceptions.

In [16]:
df[df['title'] == "No Surrender"]

Unnamed: 0,title,link,tags,year,clean_title,file


In [17]:
df['short_url'] = df['link'].str.extract("(/[^/]*/)")

Some entries share the same short_url but are different movies. 
Example: *Doc West - Part 1* and *Doc West - Part 2* in the CLEAN_SCRAPE/com batch.
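
As a sketch of what the short_url extraction does on a single link, the snippet below applies the same pattern with the standard `re` module. The link is a made-up placeholder in the shape of the sample rows, not a real entry, and `short_url` is a hypothetical helper name.

```python
import re

# Same pattern as the notebook cell: the first "/.../" path segment.
SHORT_URL_RE = re.compile(r"(/[^/]*/)")

def short_url(link):
    """Return the first slash-delimited segment of a link, or None."""
    match = SHORT_URL_RE.search(link)
    return match.group(1) if match else None

# Hypothetical link used only for illustration.
slug = short_url("/Some-Movie-Lead-Actor/dp/XXXXXXXXXX/ref=sr_1_1")
```

Because the segment contains only the title and a credited name, two parts of a series with the same lead can collapse to one short_url, which is why the title is checked alongside it.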

In [113]:
# Inspect movie duplicates based on short_url
# df[df['short_url'].isin(df[df[['short_url']].duplicated()]['short_url'])].sort_values(by='short_url', ascending=True).to_csv("test.csv")

In [114]:
# df = pd.read_csv("test.csv")
# df[~df['title'].isin(df[~df[['short_url', 'year']].duplicated()]['title'])]

We can also check the title together with the short_url for validation.

In [18]:
dups = df[df[['title', 'year']].duplicated()]

In [19]:
dups.shape

(24, 7)

The following code can be used to inspect some of the duplicates.

In [20]:
# sample_param = 0.5
# dups['short_url'].sample(int(len(dups) * sample_param))

# for url in dups['short_url']:
#     print(df.loc[df['short_url'] == url])
#     input()

Manual inspection shows that matching on short_url and title is a good heuristic for removing duplicates. Later in the pipeline, we generate IMDb IDs for each movie, which we can use to remove any duplicates that remain.

We also find that entries with the same title and year are generally the same movie. Thus, we first remove duplicates based on title and year, and then remove duplicates based on title and short_url.

First, we remove entries with a duplicate title and year.

In [21]:
df_clean = df[~df[['title', 'year']].duplicated()]
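
The two de-duplication passes described in this section can be sketched without pandas to show what they keep. `~df[keys].duplicated()` keeps the first occurrence of each key combination; the hypothetical helper `drop_duplicates_on` below does the same over a list of dicts. The rows are invented for illustration, and missing values are not handled.

```python
def drop_duplicates_on(rows, keys):
    """Keep the first row for each distinct combination of the given keys."""
    seen = set()
    kept = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature in seen:
            continue  # a later occurrence of an already-seen combination
        seen.add(signature)
        kept.append(row)
    return kept

# Invented rows, not taken from the dataset.
rows = [
    {"title": "She", "year": 2021, "short_url": "/She-A/"},
    {"title": "She", "year": 2021, "short_url": "/She-B/"},  # dropped in pass 1
    {"title": "She", "year": 2020, "short_url": "/She-A/"},  # dropped in pass 2
    {"title": "Air", "year": 2023, "short_url": "/Air-X/"},
]
pass_one = drop_duplicates_on(rows, ("title", "year"))
pass_two = drop_duplicates_on(pass_one, ("title", "short_url"))
```

Running the passes in sequence means an entry is removed if it matches an earlier entry on either key pair, so the result depends on row order: the earliest listing always wins.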

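Next, we remove entries with a duplicate title and short_url. We then repeat the check on a URL-decoded, ASCII-normalized copy of short_url to catch entries that differ only in encoding.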

In [22]:
df_clean = df_clean[~df_clean[['title', 'short_url']].duplicated()]
df_clean_copy = df_clean.copy()
df_clean_copy['clean_short_url'] = [unidecode(urllib.parse.unquote(string)) for string in df_clean_copy['short_url']]
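
The reason for the extra `clean_short_url` column is that the same slug can appear percent-encoded in one link and as raw text in another. A minimal sketch of the normalization follows, again using `unicodedata` as a standard-library stand-in for `unidecode` (so only accents on Latin letters are handled); the slugs and the helper name `normalize_slug` are hypothetical.

```python
import unicodedata
from urllib.parse import unquote

def normalize_slug(short_url):
    """Percent-decode a slug, then strip accents so variants compare equal."""
    decoded = unquote(short_url)  # "%C3%A9" becomes "é"
    return (
        unicodedata.normalize("NFKD", decoded)
        .encode("ascii", "ignore")
        .decode("ascii")
    )

# Two hypothetical spellings of the same slug.
encoded_form = normalize_slug("/Am%C3%A9lie-Audrey-Tautou/")
raw_form = normalize_slug("/Amélie-Audrey-Tautou/")
```

Both spellings normalize to the same string, so a duplicate check on the normalized column treats them as one entry.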

In [23]:
df_clean_copy[df_clean_copy[['title', 'clean_short_url']].duplicated()]

Unnamed: 0,title,link,tags,year,clean_title,file,short_url,clean_short_url


In [24]:
df_clean_copy = df_clean_copy[~df_clean_copy[['title', 'clean_short_url']].duplicated()]

In [25]:
df_clean_copy.shape

(2870, 8)

In [123]:
df_clean_copy.to_csv(f"{METADATA_DIR}/{TARGET_DIR}/clean_meta_en_prime.csv", index=False)