# Improving Merge

Merging the IMDB core data with the ticket sales data has to rely on matching title and year. 
After fixing the problem of movies released in the same year under the same title, we ran into a new one. 
<br><br>
Many titles are spelled slightly differently in the two sources. 
Roughly 1,000 rows per data set do not match. 
<br><br>
The next goal is to analyze these mismatches and correct them. 

## 1. Importing packages and data from SQL

In [581]:
import pandas as pd
import numpy as np
import psycopg2
import sql_functions as sqlf

In [582]:
schema = "capstone_24_4_group1"
schema

'capstone_24_4_group1'

In [583]:
imdb_query = f'''   SELECT *
                    FROM {schema}."IMDB_data"
                    '''

eu_query = f'''   SELECT *
                    FROM {schema}."movie_data_EU"
                    '''

na_query = f'''   SELECT *
                    FROM {schema}."movie_data_NA"
                    '''

In [584]:
imdb_df = sqlf.get_dataframe(imdb_query)
display(imdb_df.head())
imdb_df.shape

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,acting5,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult
0,tt0013274,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,2021,94,73.0,6.7,1.0,Documentary,,...,,2.0,Nikolai Izvolov,Dziga Vertov,,,,,,0
1,tt0015414,La tierra de los toros,La tierra de los toros,2000,60,17.0,5.4,,,,...,,1.0,Musidora,,,,,,,0
2,tt0035423,Kate & Leopold,Kate & Leopold,2001,118,89944.0,6.4,3.0,Comedy,Fantasy,...,Natasha Lyonne,1.0,James Mangold,,,2.0,Steven Rogers,James Mangold,,0
3,tt0062336,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,2020,70,190.0,6.5,1.0,Drama,,...,Luis Vilches,2.0,Raúl Ruiz,Valeria Sarmiento,,2.0,Raúl Ruiz,Omar Saavedra Santis,,0
4,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122,8143.0,6.7,1.0,Drama,,...,Norman Foster,1.0,Orson Welles,,,2.0,Orson Welles,Oja Kodar,,0


(188163, 25)

In [585]:
eu_df = sqlf.get_dataframe(eu_query)
display(eu_df.head())
eu_df.shape

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold
0,(500) Days of Summer,US,2009,1713086,1684771
1,(Nie)znajomi,PL,2019,685075,684833
2,(T)Raumschiff Surprise - Periode 1,DE,2004,10763531,10731881
3,1 1/2 Ritter - Auf der Suche nach der hinreiße...,DE,2008,1986168,1986168
4,1 chance sur 2,FR,1998,1295620,1238175


(4956, 5)

In [586]:
na_df = sqlf.get_dataframe(na_query)
display(na_df.head())
na_df.shape

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,release_year
0,(500) Days of Summer,2009-08-07,Fox Searchlight,32425665,4323422,2009
1,10 Cloverfield Lane,2016-03-11,Paramount Pictures,72082999,8333294,2016
2,10 Things I Hate About You,1999-03-31,Walt Disney,38177966,7515347,1999
3,"10,000 B.C.",2008-03-07,Warner Bros.,94784201,13201142,2008
4,101 Dalmatians,1996-11-27,Walt Disney,136189294,30691447,1996


(4965, 6)

## 2. Find unmatched movies
### EU

In [587]:
eu_unmatched_df = pd.merge(eu_df,imdb_df, how="left", left_on=["title", "year"], right_on=["original_title", "year"])
eu_unmatched_df.shape

(4959, 29)
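
The merged frame has 4959 rows although `eu_df` has only 4956, which suggests that a few EU rows matched more than one IMDB row. A minimal sketch of how such duplicate keys could be located, using toy frames in place of the real tables (the names `left`, `right`, `dup_keys` and `duplicates_detected` are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for eu_df and imdb_df; only the key columns are kept.
left = pd.DataFrame({"title": ["Paparazzi", "15 Minutes"], "year": [1998, 2001]})
right = pd.DataFrame({
    "tconst": ["tt0133314", "tt0174105", "tt0179626"],
    "original_title": ["Paparazzi", "Paparazzi", "15 Minutes"],
    "year": [1998, 1998, 2001],
})

# Keys that occur more than once on the right side multiply left rows in a merge.
dup_keys = right[right.duplicated(["original_title", "year"], keep=False)]
print(dup_keys)

# validate="many_to_one" makes pandas raise instead of silently duplicating rows.
try:
    pd.merge(left, right, how="left",
             left_on=["title", "year"],
             right_on=["original_title", "year"],
             validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
print(duplicates_detected)
```

Whether the three extra rows in the real data come from exactly this kind of clash is an assumption; the check only shows where to look.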

In [588]:
eu_unmatched_mask = eu_unmatched_df["tconst"].isnull()

In [589]:
eu_unmatched_df[eu_unmatched_mask]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,tconst,primary_title,original_title,runtime,num_votes,...,acting5,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult
13,101 Dalmatians,US,1996,21623260,21532085,,,,,,...,,,,,,,,,,
20,13 Going On 30,US,2004,3579724,3566771,,,,,,...,,,,,,,,,,
22,15 Minutes (Fifteen Minutes),US,2001,2600419,2565397,,,,,,...,,,,,,,,,,
41,23,DE,1998,701787,693358,,,,,,...,,,,,,,,,,
47,28 Days Later,GB,2002,4252690,4066710,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4949,Çok filim hareketler bunlar,TR,2010,1223552,1223156,,,,,,...,,,,,,,,,,
4955,Účastníci zájezdu,CZ,2006,871439,830242,,,,,,...,,,,,,,,,,
4956,Śluby panieńskie,PL,2010,1001866,1000373,,,,,,...,,,,,,,,,,
4957,Świadectwo,PL,2008,1039901,1034911,,,,,,...,,,,,,,,,,


Currently 922 rows are unmatched.
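
As an alternative to testing `tconst` for nulls, the unmatched rows can be flagged by the merge itself. This is a small sketch on toy data, assuming the same key columns as above; `eu_toy` and `imdb_toy` are placeholders, not the real tables:

```python
import pandas as pd

# Toy data: one title matches exactly, one differs by trailing dots.
eu_toy = pd.DataFrame({"title": ["15 Minutes", "28 Days Later"], "year": [2001, 2002]})
imdb_toy = pd.DataFrame({
    "tconst": ["tt0179626", "tt0289043"],
    "original_title": ["15 Minutes", "28 Days Later..."],
    "year": [2001, 2002],
})

# indicator=True adds a "_merge" column: "both" for matches, "left_only" for misses.
merged = pd.merge(eu_toy, imdb_toy, how="left",
                  left_on=["title", "year"],
                  right_on=["original_title", "year"],
                  indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
print(unmatched[["title", "year"]])
```

The `_merge` column keeps working even if the right-hand frame has legitimate nulls in the column that would otherwise be used as the test.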

### Check how much lower-casing the titles helps (applied to IMDB, EU and NA)

In [590]:
imdb_df["original_title_merge"] = imdb_df["original_title"].str.lower()
eu_df["title_merge"] = eu_df["title"].str.lower()
na_df["title_merge"] = na_df["title"].str.lower()
display(imdb_df.head())
display(eu_df.head())
display(na_df.head())

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
0,tt0013274,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,2021,94,73.0,6.7,1.0,Documentary,,...,2.0,Nikolai Izvolov,Dziga Vertov,,,,,,0,istoriya grazhdanskoy voyny
1,tt0015414,La tierra de los toros,La tierra de los toros,2000,60,17.0,5.4,,,,...,1.0,Musidora,,,,,,,0,la tierra de los toros
2,tt0035423,Kate & Leopold,Kate & Leopold,2001,118,89944.0,6.4,3.0,Comedy,Fantasy,...,1.0,James Mangold,,,2.0,Steven Rogers,James Mangold,,0,kate & leopold
3,tt0062336,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,2020,70,190.0,6.5,1.0,Drama,,...,2.0,Raúl Ruiz,Valeria Sarmiento,,2.0,Raúl Ruiz,Omar Saavedra Santis,,0,el tango del viudo y su espejo deformante
4,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122,8143.0,6.7,1.0,Drama,,...,1.0,Orson Welles,,,2.0,Orson Welles,Oja Kodar,,0,the other side of the wind


Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
0,(500) Days of Summer,US,2009,1713086,1684771,(500) days of summer
1,(Nie)znajomi,PL,2019,685075,684833,(nie)znajomi
2,(T)Raumschiff Surprise - Periode 1,DE,2004,10763531,10731881,(t)raumschiff surprise - periode 1
3,1 1/2 Ritter - Auf der Suche nach der hinreiße...,DE,2008,1986168,1986168,1 1/2 ritter - auf der suche nach der hinreiße...
4,1 chance sur 2,FR,1998,1295620,1238175,1 chance sur 2


Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,release_year,title_merge
0,(500) Days of Summer,2009-08-07,Fox Searchlight,32425665,4323422,2009,(500) days of summer
1,10 Cloverfield Lane,2016-03-11,Paramount Pictures,72082999,8333294,2016,10 cloverfield lane
2,10 Things I Hate About You,1999-03-31,Walt Disney,38177966,7515347,1999,10 things i hate about you
3,"10,000 B.C.",2008-03-07,Warner Bros.,94784201,13201142,2008,"10,000 b.c."
4,101 Dalmatians,1996-11-27,Walt Disney,136189294,30691447,1996,101 dalmatians


In [591]:
eu_check_df = pd.merge(eu_df,imdb_df, how="left", left_on=["title_merge", "year"], right_on=["original_title_merge", "year"])

In [592]:
eu_check_mask = eu_check_df["tconst"].isnull()
eu_check_df[eu_check_mask]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
13,101 Dalmatians,US,1996,21623260,21532085,101 dalmatians,,,,,...,,,,,,,,,,
22,15 Minutes (Fifteen Minutes),US,2001,2600419,2565397,15 minutes (fifteen minutes),,,,,...,,,,,,,,,,
41,23,DE,1998,701787,693358,23,,,,,...,,,,,,,,,,
47,28 Days Later,GB,2002,4252690,4066710,28 days later,,,,,...,,,,,,,,,,
59,"4 luni, 3 saptamani si 2 zile",RO,2007,1090696,736957,"4 luni, 3 saptamani si 2 zile",,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4930,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,
4955,Účastníci zájezdu,CZ,2006,871439,830242,účastníci zájezdu,,,,,...,,,,,,,,,,
4956,Śluby panieńskie,PL,2010,1001866,1000373,śluby panieńskie,,,,,...,,,,,,,,,,
4957,Świadectwo,PL,2008,1039901,1034911,świadectwo,,,,,...,,,,,,,,,,


Lower-casing reduces the unmatched rows from 922 to 788,
i.e. 134 fewer problems.

### Check what's wrong with "101 Dalmatians"

In [593]:
imdb_df[imdb_df["tconst"] == "tt0115433"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge


Hmm, the IMDB data contains no row for "101 Dalmatians" (tt0115433). Why? 

In [594]:
imdb_df[imdb_df["original_title"].str.contains("101")].sort_values(by="original_title").head(15)

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
136405,tt3668280,101 Chodhyangal?,101 Chodhyangal?,2013,107,113.0,7.0,2.0,Drama,Family,...,1.0,Sidhartha Siva,,,1.0,Sidhartha Siva,,,0,101 chodhyangal?
145785,tt4512212,101 Reasons: Liberty Lives in New Hampshire,101 Reasons: Liberty Lives in New Hampshire,2014,64,14.0,8.6,2.0,Documentary,News,...,1.0,Beau Davis,,,2.0,Beau Davis,Vince Perfetto,,0,101 reasons: liberty lives in new hampshire
9269,tt0252802,101 Rent Boys,101 Rent Boys,2000,78,384.0,6.5,1.0,Documentary,,...,2.0,Fenton Bailey,Randy Barbato,,,,,,0,101 rent boys
8196,tt0237993,101 Reykjavík,101 Reykjavík,2000,88,9967.0,6.8,3.0,Comedy,Drama,...,1.0,Baltasar Kormákur,,,4.0,Hallgrímur Helgason,Baltasar Kormákur,,0,101 reykjavík
69922,tt14358208,101 Reys,101 Reys,2020,110,20.0,7.8,1.0,Biography,,...,1.0,Akrom Shohnazarov,,,1.0,Akrom Shohnazarov,,,0,101 reys
185698,tt9429520,101 Seconds,101 Seconds,2018,81,29.0,6.2,1.0,Documentary,,...,1.0,Skye Fitzgerald,,,,,,,0,101 seconds
130417,tt3219396,101 Secrets,101 Secrets,2015,95,15.0,5.3,3.0,Adventure,Drama,...,1.0,Tophy Cho,,,1.0,Tophy Cho,,,0,101 secrets
8418,tt0241142,101 Ways (the Things a Girl Will Do to Keep He...,101 Ways (The Things a Girl Will Do to Keep He...,2000,100,162.0,5.2,1.0,Comedy,,...,1.0,Jennifer B. Katz,,,1.0,Jennifer B. Katz,,,0,101 ways (the things a girl will do to keep he...
115606,tt2545176,101 Weddings,101 Weddings,2012,145,235.0,4.7,3.0,Comedy,Drama,...,1.0,Shafi,,,2.0,Kalavoor Ravikumar,Shafi,,0,101 weddings
84889,tt1674766,101 Proposals,101 ci qiu hun,2013,120,526.0,5.4,1.0,Romance,,...,1.0,Leste Chen,,,3.0,Shinji Nojima,Peng Ren,Wei Zhang,0,101 ci qiu hun


### Check whether the tconst exists in the basic data set

In [595]:
basic_df = pd.read_csv("Data/title.principals/title.basics.csv")
basic_df.shape

(11057208, 9)

In [596]:
basic_df[basic_df["tconst"] == "tt0115433"]

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
112776,tt0115433,movie,101 Dalmatians,101 Dalmatians,0.0,1996.0,\N,103,"Adventure,Comedy,Crime"


It is in there, which means we unintentionally drop it at some stage of the filtering process. <br><br>

First idea: maybe we filter for year > 1996 instead of year >= 1996?

### Solution: 

It is the other way round. We decided to look at the last 25 years (1998-2023) and filtered the IMDB data accordingly. However, the EU and NA data start in 1996, and we never filtered them to the same date range. 

Let's correct this. 
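
A quick comparison of the year ranges would have exposed this mismatch directly. The sketch below uses tiny toy frames; only the column names `year` and `release_year` are taken from the notebook, the values and the names `frames`, `year_col` and `ranges` are illustrative:

```python
import pandas as pd

# Toy frames standing in for imdb_df, eu_df and na_df.
frames = {
    "imdb": pd.DataFrame({"year": [1998, 2010, 2023]}),
    "eu": pd.DataFrame({"year": [1996, 2005, 2023]}),
    "na": pd.DataFrame({"release_year": [1996, 2012, 2023]}),
}
year_col = {"imdb": "year", "eu": "year", "na": "release_year"}

# One row per data set with its earliest and latest year.
ranges = pd.DataFrame({
    name: {"min_year": df[year_col[name]].min(),
           "max_year": df[year_col[name]].max()}
    for name, df in frames.items()
}).T
print(ranges)
```

If the minimum years differ between the sources, every row outside the shared range can never match, no matter how the titles are cleaned.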

### Filter EU and NA for Year >= 1998

In [597]:
eu_df[eu_df["year"] >= 1998].sort_values(by="year")

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
770,City of Angels,US,1998,8318763,8271916,city of angels
992,Desperate Measures,US,1998,1194818,1191026,desperate measures
3413,Rush Hour,US,1998,8120318,8007449,rush hour
2251,Lautrec,"FR, ES",1998,650948,560307,lautrec
4748,Virus,"US, GB, JP, DE, FR",1998,1719742,1715116,virus
...,...,...,...,...,...,...
97,A Haunting in Venice,US,2023,6170413,6170413,a haunting in venice
3858,Thanksgiving,"US, CA, AU",2023,1006977,1006977,thanksgiving
2967,"O psie, który jezdzil koleja",PL,2023,730994,730994,"o psie, który jezdzil koleja"
3828,Taylor Swift: The Eras Tour,US,2023,2120352,2120352,taylor swift: the eras tour


In [598]:
eu_df = eu_df[eu_df["year"] >= 1998].reset_index(drop=True)
eu_df

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
0,(500) Days of Summer,US,2009,1713086,1684771,(500) days of summer
1,(Nie)znajomi,PL,2019,685075,684833,(nie)znajomi
2,(T)Raumschiff Surprise - Periode 1,DE,2004,10763531,10731881,(t)raumschiff surprise - periode 1
3,1 1/2 Ritter - Auf der Suche nach der hinreiße...,DE,2008,1986168,1986168,1 1/2 ritter - auf der suche nach der hinreiße...
4,1 chance sur 2,FR,1998,1295620,1238175,1 chance sur 2
...,...,...,...,...,...,...
4536,Ölümlü Dünya 2,TR,2023,1476943,1476943,ölümlü dünya 2
4537,Účastníci zájezdu,CZ,2006,871439,830242,účastníci zájezdu
4538,Śluby panieńskie,PL,2010,1001866,1000373,śluby panieńskie
4539,Świadectwo,PL,2008,1039901,1034911,świadectwo


In [599]:
na_df[na_df["release_year"] >= 1998].sort_values(by="release_year")


Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,release_year,title_merge
2214,Little Voice,1998-12-04,Miramax,3714954,731290,1998,little voice
1688,Hilary and Jackie,1998-12-30,October Films,4739909,933052,1998,hilary and jackie
406,Babe: Pig in the City,1998-11-25,Universal,18319860,3870373,1998,babe: pig in the city
4107,The Man in the Iron Mask,1998-03-13,MGM,56968169,12146731,1998,the man in the iron mask
989,Deep Impact,1998-05-08,Paramount Pictures,140464664,29949821,1998,deep impact
...,...,...,...,...,...,...,...
2714,Oppenheimer,2023-07-21,Universal,326101370,30250590,2023,oppenheimer
4758,Waitress: The Musical,2023-12-07,Bleecker Street,5402148,501126,2023,waitress: the musical
593,Blue Beetle,2023-08-18,Warner Bros.,72541501,6729267,2023,blue beetle
4116,The Marvels,2023-11-10,Walt Disney,84479155,7836656,2023,the marvels


In [600]:
na_df = na_df[na_df["release_year"] >= 1998].reset_index(drop=True)
na_df

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,release_year,title_merge
0,(500) Days of Summer,2009-08-07,Fox Searchlight,32425665,4323422,2009,(500) days of summer
1,10 Cloverfield Lane,2016-03-11,Paramount Pictures,72082999,8333294,2016,10 cloverfield lane
2,10 Things I Hate About You,1999-03-31,Walt Disney,38177966,7515347,1999,10 things i hate about you
3,"10,000 B.C.",2008-03-07,Warner Bros.,94784201,13201142,2008,"10,000 b.c."
4,102 Dalmatians,2000-11-22,Walt Disney,66941559,12343421,2000,102 dalmatians
...,...,...,...,...,...,...,...
4539,earth,2009-04-22,Walt Disney,32011576,4268210,2009,earth
4540,jackass forever,2022-02-04,Paramount Pictures,57743451,5483709,2022,jackass forever
4541,mother!,2017-09-15,Paramount Pictures,17800004,1984392,2017,mother!
4542,xXx,2002-08-09,Sony Pictures,141930000,24428571,2002,xxx


### Check merge with reduced EU data (4541 rows)

In [601]:
eu_check_df = pd.merge(eu_df,imdb_df, how="left", left_on=["title_merge", "year"], right_on=["original_title_merge", "year"])

In [602]:
eu_check_mask = eu_check_df["tconst"].isnull()
eu_check_df[eu_check_mask]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
21,15 Minutes (Fifteen Minutes),US,2001,2600419,2565397,15 minutes (fifteen minutes),,,,,...,,,,,,,,,,
40,23,DE,1998,701787,693358,23,,,,,...,,,,,,,,,,
46,28 Days Later,GB,2002,4252690,4066710,28 days later,,,,,...,,,,,,,,,,
58,"4 luni, 3 saptamani si 2 zile",RO,2007,1090696,736957,"4 luni, 3 saptamani si 2 zile",,,,,...,,,,,,,,,,
65,5X2 cinq fois deux,FR,2004,1150178,814942,5x2 cinq fois deux,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4515,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,
4540,Účastníci zájezdu,CZ,2006,871439,830242,účastníci zájezdu,,,,,...,,,,,,,,,,
4541,Śluby panieńskie,PL,2010,1001866,1000373,śluby panieńskie,,,,,...,,,,,,,,,,
4542,Świadectwo,PL,2008,1039901,1034911,świadectwo,,,,,...,,,,,,,,,,


We are down to 373 unmatched rows (from 788),
an improvement of 415.

### Check the problem with "15 Minutes"

In [603]:
imdb_df[imdb_df["tconst"] == "tt0179626"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
4855,tt0179626,15 Minutes,15 Minutes,2001,120,52238.0,6.1,3.0,Action,Crime,...,1.0,John Herzfeld,,,1.0,John Herzfeld,,,0,15 minutes


OK, the EU title repeats the number written out in brackets; the IMDB title does not. <br> <br>

Does that happen for other movies, too? 

In [604]:
bracket_mask = eu_check_df["title_merge"].str.contains("(", regex=False)  # literal "(", avoids the invalid "\(" escape
eu_check_df[eu_check_mask & bracket_mask]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
21,15 Minutes (Fifteen Minutes),US,2001,2600419,2565397,15 minutes (fifteen minutes),,,,,...,,,,,,,,,,
411,Beast (US),"US, IS, JP",2022,1078911,1078729,beast (us),,,,,...,,,,,,,,,,
812,"Dangerous Beauty (The Honest Courtesan, A Dest...",US,1998,929520,928888,"dangerous beauty (the honest courtesan, a dest...",,,,,...,,,,,,,,,,
2809,Paparazzi (FR),FR,1998,994869,988887,paparazzi (fr),,,,,...,,,,,,,,,,
2810,Paparazzi (IT),IT,1998,1604573,1604573,paparazzi (it),,,,,...,,,,,,,,,,


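No, the written-out number is unique to this title. However, "Paparazzi (FR)", "Paparazzi (IT)" and "Beast (US)" carry a country tag in brackets and do not match either. <br><br>

Anyway, correct "15 Minutes" first.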

In [605]:
eu_df.loc[eu_df["title_merge"] == "15 minutes (fifteen minutes)", "title_merge"]

21    15 minutes (fifteen minutes)
Name: title_merge, dtype: object

In [606]:
eu_df.loc[eu_df["title_merge"] == "15 minutes (fifteen minutes)", "title_merge"] = "15 minutes"
eu_df.loc[eu_df["title_merge"] == "15 minutes"]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
21,15 Minutes (Fifteen Minutes),US,2001,2600419,2565397,15 minutes
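
Hand-checked corrections like this one could also be collected in a single lookup, so they are applied in one place and are easy to review. A minimal sketch, assuming a hypothetical dictionary `TITLE_FIXES` that maps an EU merge key to the IMDB merge key:

```python
import pandas as pd

# Hypothetical lookup of manual corrections: EU merge key -> IMDB merge key.
TITLE_FIXES = {
    "15 minutes (fifteen minutes)": "15 minutes",
}

eu_toy = pd.DataFrame({
    "title_merge": ["15 minutes (fifteen minutes)", "paparazzi (fr)"],
})

# Series.replace with a dict swaps exact matches only; other values stay untouched.
eu_toy["title_merge"] = eu_toy["title_merge"].replace(TITLE_FIXES)
print(eu_toy)
```

Titles such as "paparazzi (fr)" are left alone here on purpose, because their bracket carries information that is needed to tell the two 1998 films apart.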


### Repeat with the new eu_df and look into the Paparazzi problem

In [607]:
eu_check_df = pd.merge(eu_df,imdb_df, how="left", left_on=["title_merge", "year"], right_on=["original_title_merge", "year"])

In [608]:
eu_check_mask = eu_check_df["tconst"].isnull()
eu_check_df[eu_check_mask]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
40,23,DE,1998,701787,693358,23,,,,,...,,,,,,,,,,
46,28 Days Later,GB,2002,4252690,4066710,28 days later,,,,,...,,,,,,,,,,
58,"4 luni, 3 saptamani si 2 zile",RO,2007,1090696,736957,"4 luni, 3 saptamani si 2 zile",,,,,...,,,,,,,,,,
65,5X2 cinq fois deux,FR,2004,1150178,814942,5x2 cinq fois deux,,,,,...,,,,,,,,,,
67,666 - Traue keinem mit dem du schläfst,DE,2002,677829,677829,666 - traue keinem mit dem du schläfst,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4515,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,
4540,Účastníci zájezdu,CZ,2006,871439,830242,účastníci zájezdu,,,,,...,,,,,,,,,,
4541,Śluby panieńskie,PL,2010,1001866,1000373,śluby panieńskie,,,,,...,,,,,,,,,,
4542,Świadectwo,PL,2008,1039901,1034911,świadectwo,,,,,...,,,,,,,,,,


In [609]:
imdb_df[imdb_df["original_title"] == "Paparazzi"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
2350,tt0133314,Paparazzi (FR),Paparazzi,1998,111,1091.0,5.4,2.0,Comedy,Romance,...,1.0,Alain Berbérian,,,6.0,Alain Berbérian,Jean-François Halin,,0,paparazzi
4551,tt0174105,Paparazzi (IT),Paparazzi,1998,100,1156.0,4.0,1.0,Comedy,,...,1.0,Neri Parenti,,,,,,,0,paparazzi
16504,tt0338325,Paparazzi,Paparazzi,2004,84,15776.0,5.7,3.0,Action,Crime,...,1.0,Paul Abascal,,,1.0,Forry Smith,,,0,paparazzi
153851,tt5303564,Paparazzi,Paparazzi,2015,110,54.0,5.4,3.0,Action,Drama,...,1.0,Saad Hendawy,,,1.0,Ahmed Abdel Fattah,,,0,paparazzi


OK, we changed the primary title but not the original title. <br><br>

### Give Beast and Paparazzi the corrected name in original_title, too

Paparazzi

In [610]:
imdb_df.loc[imdb_df["original_title"] == "Paparazzi", "primary_title"]

2350      Paparazzi (FR)
4551      Paparazzi (IT)
16504          Paparazzi
153851         Paparazzi
Name: primary_title, dtype: object

In [611]:
imdb_df.loc[imdb_df["original_title"] == "Paparazzi", "original_title"] = imdb_df.loc[imdb_df["original_title"] == "Paparazzi", "primary_title"]

In [612]:
imdb_df[imdb_df["original_title"].str.contains("Paparazzi")]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
2350,tt0133314,Paparazzi (FR),Paparazzi (FR),1998,111,1091.0,5.4,2.0,Comedy,Romance,...,1.0,Alain Berbérian,,,6.0,Alain Berbérian,Jean-François Halin,,0,paparazzi
4551,tt0174105,Paparazzi (IT),Paparazzi (IT),1998,100,1156.0,4.0,1.0,Comedy,,...,1.0,Neri Parenti,,,,,,,0,paparazzi
16504,tt0338325,Paparazzi,Paparazzi,2004,84,15776.0,5.7,3.0,Action,Crime,...,1.0,Paul Abascal,,,1.0,Forry Smith,,,0,paparazzi
84736,tt1671678,Paparazzi: Full Throttle LA,Paparazzi: Full Throttle LA,2010,62,15.0,6.5,1.0,Documentary,,...,1.0,Daniel Ramos,,,1.0,Daniel Ramos,,,0,paparazzi: full throttle la
91523,tt1836097,Paparazzi Eye in the Dark,Paparazzi Eye in the Dark,2011,142,9.0,6.8,1.0,Mystery,,...,1.0,Bayo Akinfemi,,,1.0,Kojo Edu Ansah,,,0,paparazzi eye in the dark
153851,tt5303564,Paparazzi,Paparazzi,2015,110,54.0,5.4,3.0,Action,Drama,...,1.0,Saad Hendawy,,,1.0,Ahmed Abdel Fattah,,,0,paparazzi


Beast

In [613]:
imdb_df[imdb_df["original_title"] == "Beast"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
47684,tt11301946,Beast (IN),Beast,2022,155,36815.0,5.2,3.0,Action,Comedy,...,1.0,Nelson Dilipkumar,,,1.0,Nelson Dilipkumar,,,0,beast
61223,tt13223398,Beast (US),Beast,2022,93,43487.0,5.6,3.0,Action,Adventure,...,1.0,Baltasar Kormákur,,,2.0,Jaime Primak Sullivan,Ryan Engle,,0,beast
69002,tt1423333,Beast,Beast,2007,85,14.0,7.0,1.0,Horror,,...,1.0,Jack Bennett,,,1.0,Jack Bennett,,,0,beast
79841,tt1572501,Beast,Beast,2011,83,609.0,5.6,2.0,Drama,Thriller,...,1.0,Christoffer Boe,,,1.0,Christoffer Boe,,,0,beast
143254,tt4251006,Beast,Beast,2015,94,71.0,6.6,3.0,Crime,Drama,...,2.0,Sam McKeith,Tom McKeith,,3.0,Will Howarth,Sam McKeith,Tom McKeith,0,beast
144339,tt4359322,Beast,Beast,2009,87,10.0,5.5,1.0,Horror,,...,1.0,Chris Jupp,,,2.0,Chris Jupp,Michael J. Murphy,,0,beast
156764,tt5628302,Beast,Beast,2017,107,16190.0,6.8,3.0,Crime,Drama,...,1.0,Michael Pearce,,,1.0,Michael Pearce,,,0,beast
165200,tt6463468,Beast,Beast,2018,60,30.0,5.8,2.0,Adventure,Drama,...,1.0,Ben Strang,,,1.0,Ben Strang,,,0,beast


In [614]:
imdb_df.loc[imdb_df["original_title"] == "Beast", "original_title"] = imdb_df.loc[imdb_df["original_title"] == "Beast", "primary_title"]

In [615]:
imdb_df[(imdb_df["original_title"].str.startswith("Beast")) & (imdb_df["year"] == 2022)]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
47684,tt11301946,Beast (IN),Beast (IN),2022,155,36815.0,5.2,3.0,Action,Comedy,...,1.0,Nelson Dilipkumar,,,1.0,Nelson Dilipkumar,,,0,beast
61223,tt13223398,Beast (US),Beast (US),2022,93,43487.0,5.6,3.0,Action,Adventure,...,1.0,Baltasar Kormákur,,,2.0,Jaime Primak Sullivan,Ryan Engle,,0,beast
102369,tt21352688,Beast Mode On,Beast Mode On,2022,85,52.0,6.4,3.0,Biography,Documentary,...,2.0,Julian Alexander Oliver,Najia Khaan,,4.0,Adebayo Akinfenwa,Dele Akinfenwa,,0,beast mode on


Recreate the merging column

In [616]:
imdb_df["original_title_merge"] = imdb_df["original_title"].str.lower()
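
The Paparazzi and Beast fixes follow the same pattern, so they could be generalized to every title that clashes within a year. The sketch below is an assumption about how that might look, not what the notebook does; it presumes that `primary_title` has already been disambiguated for all clashing rows, which is only confirmed for these two titles:

```python
import pandas as pd

# Toy excerpt of the IMDB table (values taken from the Paparazzi rows above).
imdb_toy = pd.DataFrame({
    "tconst": ["tt0133314", "tt0174105", "tt0338325"],
    "primary_title": ["Paparazzi (FR)", "Paparazzi (IT)", "Paparazzi"],
    "original_title": ["Paparazzi", "Paparazzi", "Paparazzi"],
    "year": [1998, 1998, 2004],
})

# Rows whose (original_title, year) pair is not unique.
clash = imdb_toy.duplicated(["original_title", "year"], keep=False)

# Copy the disambiguated primary title for clashing rows only.
imdb_toy.loc[clash, "original_title"] = imdb_toy.loc[clash, "primary_title"]

# Rebuild the merge key so it reflects the edited titles.
imdb_toy["original_title_merge"] = imdb_toy["original_title"].str.lower()
print(imdb_toy[["tconst", "original_title_merge", "year"]])
```

On the full table this would also overwrite original titles whose primary title is a translation rather than a disambiguated name, so the clashing rows should be inspected before applying it.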

### Merge again and recheck now that Beast and Paparazzi are fixed

In [617]:
eu_check_df = pd.merge(eu_df,imdb_df, how="left", left_on=["title_merge", "year"], right_on=["original_title_merge", "year"])

In [618]:
eu_check_mask = eu_check_df["tconst"].isnull()
eu_check_df[eu_check_mask]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
40,23,DE,1998,701787,693358,23,,,,,...,,,,,,,,,,
46,28 Days Later,GB,2002,4252690,4066710,28 days later,,,,,...,,,,,,,,,,
58,"4 luni, 3 saptamani si 2 zile",RO,2007,1090696,736957,"4 luni, 3 saptamani si 2 zile",,,,,...,,,,,,,,,,
65,5X2 cinq fois deux,FR,2004,1150178,814942,5x2 cinq fois deux,,,,,...,,,,,,,,,,
67,666 - Traue keinem mit dem du schläfst,DE,2002,677829,677829,666 - traue keinem mit dem du schläfst,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4515,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,
4540,Účastníci zájezdu,CZ,2006,871439,830242,účastníci zájezdu,,,,,...,,,,,,,,,,
4541,Śluby panieńskie,PL,2010,1001866,1000373,śluby panieńskie,,,,,...,,,,,,,,,,
4542,Świadectwo,PL,2008,1039901,1034911,świadectwo,,,,,...,,,,,,,,,,


In [619]:
eu_check_df[eu_check_df["title"].str.startswith("Paparazzi") & eu_check_mask]  # the mask belongs to eu_check_df, not imdb_df

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge


Down from 372 to 369 (-3 for Paparazzi and Beast).

But what is wrong with the titles that contain diacritics (Czech, Polish, Turkish)?

In [620]:
imdb_df.loc[imdb_df["tconst"] == "tt0795488", "original_title_merge"]

32084    úcastníci zájezdu
Name: original_title_merge, dtype: object

In [621]:
eu_check_df.loc[eu_check_df["title"] == "Účastníci zájezdu", "title_merge"]

4540    účastníci zájezdu
Name: title_merge, dtype: object

In [622]:
imdb_df.loc[imdb_df["tconst"] == "tt1720223", "original_title_merge"]

86744    sluby panienskie
Name: original_title_merge, dtype: object

In [623]:
eu_check_df.loc[eu_check_df["title"] == "Śluby panieńskie", "title_merge"]

4541    śluby panieńskie
Name: title_merge, dtype: object

In [624]:
imdb_df.loc[imdb_df["tconst"] == "tt1627942", "original_title_merge"]

82746    zeny v pokusení
Name: original_title_merge, dtype: object

In [625]:
eu_check_df.loc[eu_check_df["title"] == "Ženy v pokušení", "title_merge"]

4543    ženy v pokušení
Name: title_merge, dtype: object

In the IMDB data, č, ś, ń, ž and š are stored as plain ASCII letters (c, s, n, z, s), while the EU data keeps the diacritics.

### Convert all Polish/special characters in IMDB, EU and NA to plain ASCII

Polish characters: ą, ć, ę, ł, ń, ó, ś, ź, ż

In [626]:
from unidecode import unidecode

In [627]:
# test new function:
test = eu_check_df.loc[eu_check_df["title"] == "Ženy v pokušení", "title_merge"].values[0]
display(test)
unidecode(test)

'ženy v pokušení'

'zeny v pokuseni'

In [628]:
imdb_df["original_title_merge"] = imdb_df["original_title_merge"].apply(unidecode)
eu_df["title_merge"] = eu_df["title_merge"].apply(unidecode)
na_df["title_merge"] = na_df["title_merge"].apply(unidecode)
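
For comparison, a similar folding can be sketched with the standard library only. This is not what the notebook uses (it relies on the third-party `unidecode` package); the helper name `fold_ascii` is made up for illustration:

```python
import unicodedata

def fold_ascii(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(fold_ascii("ženy v pokušení"))
print(fold_ascii("śluby panieńskie"))
# Letters without a Unicode decomposition, such as the Polish "ł", are dropped entirely.
print(repr(fold_ascii("ł")))
```

The last line shows why `unidecode` is the more convenient choice here: it transliterates letters such as "ł" instead of dropping them.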

Check for improvements.

In [629]:
eu_check_df = pd.merge(eu_df,imdb_df, how="left", left_on=["title_merge", "year"], right_on=["original_title_merge", "year"])

In [630]:
eu_check_mask = eu_check_df["tconst"].isnull()
eu_check_df[eu_check_mask]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
40,23,DE,1998,701787,693358,23,,,,,...,,,,,,,,,,
46,28 Days Later,GB,2002,4252690,4066710,28 days later,,,,,...,,,,,,,,,,
65,5X2 cinq fois deux,FR,2004,1150178,814942,5x2 cinq fois deux,,,,,...,,,,,,,,,,
67,666 - Traue keinem mit dem du schläfst,DE,2002,677829,677829,666 - traue keinem mit dem du schlafst,,,,,...,,,,,,,,,,
118,A Stork's Journey,"DE, BE, LU, NO",2017,1862112,1848892,a stork's journey,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4471,Ya Sonra?,TR,2011,904069,904069,ya sonra?,,,,,...,,,,,,,,,,
4498,Zeny v behu,CZ,2019,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4514,[REC]³ Génesis,ES,2011,828887,774431,[rec]3 genesis,,,,,...,,,,,,,,,,
4515,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,


We went from 369 to 345 unmatched rows, 
an improvement of 24. 

### Let's check the next one on the list: "23"

In [631]:
imdb_df[imdb_df["tconst"] == "tt0126765"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
1970,tt0126765,23,23 - Nichts ist so wie es scheint,1998,99,7404.0,7.2,2.0,Drama,Thriller,...,1.0,Hans-Christian Schmid,,,3.0,Hans-Christian Schmid,Michael Gutmann,Michael Dierking,0,23 - nichts ist so wie es scheint


"23" in the EU data is called "23 - Nichts ist so wie es scheint" in the IMDB data set. 
Let's quickly check NA for that title.

In [632]:
na_df[na_df["title_merge"].str.contains("23")]

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,release_year,title_merge
35,2023 Oscar Shorts,2023-02-17,ShortsHD,3023866,280507,2023,2023 oscar shorts
3824,The Number 23,2007-02-23,New Line,35193167,5115285,2007,the number 23
4017,The Taking of Pelham 123,2009-06-12,Sony Pictures,65452312,8726974,2009,the taking of pelham 123


There is no "23" in the NA data. Let's adjust the name for EU.

In [633]:
# The merge runs on title_merge (lower case), so the key itself has to be changed
eu_df.loc[eu_df["title_merge"] == "23", "title_merge"] = "23 - nichts ist so wie es scheint"

### Let's also check 28 Days Later

In [634]:
imdb_df[imdb_df["tconst"] == "tt0289043"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
12347,tt0289043,28 Days Later,28 Days Later...,2002,113,453540.0,7.5,3.0,Drama,Horror,...,1.0,Danny Boyle,,,1.0,Alex Garland,,,0,28 days later...


Hmm, the primary title would match, but the original title does not.  

Could we run a second merge on the primary title, using only the rows that are still unmatched? 
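
The idea can be sketched as a two-pass merge: first on the original title, then on the primary title for whatever is left. The frames below are toy data with the same column names as the notebook; `first`, `matched`, `leftover`, `second` and `combined` are illustrative names:

```python
import pandas as pd

eu_toy = pd.DataFrame({
    "title_merge": ["15 minutes", "28 days later"],
    "year": [2001, 2002],
})
imdb_toy = pd.DataFrame({
    "tconst": ["tt0179626", "tt0289043"],
    "original_title_merge": ["15 minutes", "28 days later..."],
    "primary_title_merge": ["15 minutes", "28 days later"],
    "year": [2001, 2002],
})

# Pass 1: match on the original title.
first = eu_toy.merge(imdb_toy, how="left",
                     left_on=["title_merge", "year"],
                     right_on=["original_title_merge", "year"])
matched = first[first["tconst"].notna()]

# Keep only the EU columns of the rows that did not match.
leftover = first.loc[first["tconst"].isna(), list(eu_toy.columns)]

# Pass 2: retry the leftovers on the primary title.
second = leftover.merge(imdb_toy, how="left",
                        left_on=["title_merge", "year"],
                        right_on=["primary_title_merge", "year"])

combined = pd.concat([matched, second], ignore_index=True)
print(combined[["title_merge", "tconst"]])
```

Using a left merge in the second pass keeps rows that are still unmatched, so no separate step is needed to remove and re-add them; that is a design choice of this sketch and differs from an inner merge followed by a drop.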

### Create a mergeable primary title as well, using lower case and unidecode

In [635]:
imdb_df["primary_title_merge"] = imdb_df["primary_title"].str.lower().apply(unidecode)

In [636]:
second_merge_eu = eu_check_df[eu_check_mask].copy()
second_merge_eu

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,directors_count,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge
40,23,DE,1998,701787,693358,23,,,,,...,,,,,,,,,,
46,28 Days Later,GB,2002,4252690,4066710,28 days later,,,,,...,,,,,,,,,,
65,5X2 cinq fois deux,FR,2004,1150178,814942,5x2 cinq fois deux,,,,,...,,,,,,,,,,
67,666 - Traue keinem mit dem du schläfst,DE,2002,677829,677829,666 - traue keinem mit dem du schlafst,,,,,...,,,,,,,,,,
118,A Stork's Journey,"DE, BE, LU, NO",2017,1862112,1848892,a stork's journey,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4471,Ya Sonra?,TR,2011,904069,904069,ya sonra?,,,,,...,,,,,,,,,,
4498,Zeny v behu,CZ,2019,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4514,[REC]³ Génesis,ES,2011,828887,774431,[rec]3 genesis,,,,,...,,,,,,,,,,
4515,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,


Drop the (empty) IMDB columns from the merged frame and merge again on the new primary title.

In [637]:
second_merge_eu.drop(columns=second_merge_eu.columns[6:], inplace=True)

In [638]:
second_merge_eu = second_merge_eu.merge(imdb_df, how="inner", left_on=["title_merge", "year"], right_on=["primary_title_merge", "year"])

In [639]:
second_merge_eu.shape

(46, 32)

In [640]:
eu_check_df.shape

(4544, 31)

In [641]:
second_merge_mask = eu_check_df["title"].isin(list(second_merge_eu["title"]))
second_merge_mask.sum()

eu_check_df.drop(eu_check_df[second_merge_mask].index, inplace=True)

In [642]:
eu_check_df = pd.concat([eu_check_df, second_merge_eu])
eu_check_df.reset_index(drop=True, inplace=True)
eu_check_df

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
0,(500) Days of Summer,US,2009,1713086,1684771,(500) days of summer,tt1022603,500 Days of Summer,(500) Days of Summer,95,...,Marc Webb,,,2.0,Scott Neustadter,Michael H. Weber,,0.0,(500) days of summer,
1,(Nie)znajomi,PL,2019,685075,684833,(nie)znajomi,tt10518924,(Nie)znajomi,(Nie)znajomi,103,...,Tadeusz Sliwa,,,10.0,Filippo Bologna,Paolo Costella,,0.0,(nie)znajomi,
2,(T)Raumschiff Surprise - Periode 1,DE,2004,10763531,10731881,(t)raumschiff surprise - periode 1,tt0349047,(T)Raumschiff Surprise - Periode 1,(T)Raumschiff Surprise - Periode 1,87,...,Michael Herbig,,,3.0,Michael Herbig,Alfons Biedermann,Rick Kavanian,0.0,(t)raumschiff surprise - periode 1,
3,1 1/2 Ritter - Auf der Suche nach der hinreiße...,DE,2008,1986168,1986168,1 1/2 ritter - auf der suche nach der hinreiss...,tt1187047,1½ Knights - In Search of the Ravishing Prince...,1 1/2 Ritter - Auf der Suche nach der hinreiße...,115,...,Til Schweiger,Torsten Künstler,Christof Wahl,2.0,Oliver Ziegenbalg,Oliver Philipp,,0.0,1 1/2 ritter - auf der suche nach der hinreiss...,
4,1 chance sur 2,FR,1998,1295620,1238175,1 chance sur 2,tt0119247,Half a Chance,1 chance sur 2,104,...,Patrice Leconte,,,3.0,Patrick Dewolf,Serge Frydman,Patrice Leconte,0.0,1 chance sur 2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4539,Tom & Jerry,US,2021,2988828,2982401,tom & jerry,tt1361336,Tom & Jerry,Tom and Jerry,101,...,Tim Story,,,3.0,William Hanna,Joseph Barbera,Kevin Costello,0.0,tom and jerry,tom & jerry
4540,Two Brothers,"FR, GB",2004,5029977,4897605,two brothers,tt0338512,Two Brothers,Deux frères,109,...,Jean-Jacques Annaud,,,3.0,Alain Godard,Jean-Jacques Annaud,Julian Fellowes,0.0,deux freres,two brothers
4541,Van Wilder,"US, DE",2002,2328440,2312108,van wilder,tt0283111,Van Wilder,National Lampoon's Van Wilder,92,...,Walt Becker,,,2.0,Brent Goldberg,David Wagner,,0.0,national lampoon's van wilder,van wilder
4542,Wolf Totem,"CN, FR",2015,2118880,2087865,wolf totem,tt2909116,Wolf Totem,Le dernier loup,121,...,Jean-Jacques Annaud,,,5.0,Jiang Rong,Alain Godard,,0.0,le dernier loup,wolf totem


In [643]:
eu_check_mask = eu_check_df["tconst"].isnull()
eu_check_df[eu_check_mask]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
63,5X2 cinq fois deux,FR,2004,1150178,814942,5x2 cinq fois deux,,,,,...,,,,,,,,,,
65,666 - Traue keinem mit dem du schläfst,DE,2002,677829,677829,666 - traue keinem mit dem du schlafst,,,,,...,,,,,,,,,,
116,A Stork's Journey,"DE, BE, LU, NO",2017,1862112,1848892,a stork's journey,,,,,...,,,,,,,,,,
122,A todo tren 2: Ahora son ellas,ES,2022,946416,669302,a todo tren 2: ahora son ellas,,,,,...,,,,,,,,,,
140,Adaptation,US,2002,1224151,1217905,adaptation,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4425,Ya Sonra?,TR,2011,904069,904069,ya sonra?,,,,,...,,,,,,,,,,
4452,Zeny v behu,CZ,2019,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4468,[REC]³ Génesis,ES,2011,828887,774431,[rec]3 genesis,,,,,...,,,,,,,,,,
4469,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,


After merging again on the primary title, the number of unmatched rows dropped from 345 to 299.
That is an improvement of 46.

But merging now is more complicated. 
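
To keep track of how many rows are still unmatched after each pass, one option is the `indicator` flag of `merge`. The sketch below uses small toy frames built from titles seen above (the `amen` row has no IMDB counterpart here on purpose), and `count_unmatched` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for the ticket data and the IMDB table (illustrative only)
tickets = pd.DataFrame({
    "title_merge": ["28 days later", "adaptation", "amen"],
    "year": [2002, 2002, 2002],
})
imdb = pd.DataFrame({
    "tconst": ["tt0289043", "tt0268126"],
    "original_title_merge": ["28 days later...", "adaptation."],
    "primary_title_merge": ["28 days later", "adaptation."],
    "year": [2002, 2002],
})

def count_unmatched(left, right, right_title_col):
    """Count ticket rows without a title/year partner in the IMDB frame."""
    merged = left.merge(
        right[[right_title_col, "year"]],
        how="left",
        left_on=["title_merge", "year"],
        right_on=[right_title_col, "year"],
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    return int((merged["_merge"] == "left_only").sum())

print(count_unmatched(tickets, imdb, "original_title_merge"))
print(count_unmatched(tickets, imdb, "primary_title_merge"))
```

If the IMDB side contains several rows with the same title and year, the left merge duplicates the ticket row, so the count is only reliable once those duplicates are resolved.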

### Let's write a merging function for original, then primary title, to use in the following tests.

In [644]:
def double_merge_func(data, imdb_base):
    '''
    Merges a ticket sales dataframe with the IMDB data, first on the newly created
    original title, then on the primary title, and keeps the hits from both.

    Input:
        data ... ticket sales data, either eu or na depending on what we are testing
        imdb_base ... the IMDB data (imdb_df)

    Output:
        Returns: Dataframe with all matches from both merges (unmatched rows keep empty IMDB columns)
        Displays: Unmatched rows
    '''
    ticket_data = data.copy()
    base_data = imdb_base.copy()

    # first merge (left) on original title and year
    check_df = pd.merge(ticket_data, base_data, how="left", left_on=["title_merge", "year"], right_on=["original_title_merge", "year"])

    # set up second df with the rows that did not match
    check_mask = check_df["tconst"].isnull()
    second_merge_df = check_df[check_mask].copy()

    # drop the empty IMDB columns from the first merge, keep only the first 6 (ticket data) columns
    second_merge_df.drop(columns=second_merge_df.columns[6:], inplace=True)

    # second merge (inner) on primary title and year
    second_merge_df = second_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=["primary_title_merge", "year"])

    # drop rows from the first merge table that matched on the second merge
    second_merge_mask = check_df["title"].isin(list(second_merge_df["title"]))
    check_df.drop(check_df[second_merge_mask].index, inplace=True)

    # add matching rows from the second merge to the first table
    check_df = pd.concat([check_df, second_merge_df])
    check_df.reset_index(drop=True, inplace=True)

    # show unmatched rows
    check_mask = check_df["tconst"].isnull()
    display(check_df[check_mask])

    return check_df

In [645]:
double_merge_func(eu_df, imdb_df)

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
63,5X2 cinq fois deux,FR,2004,1150178,814942,5x2 cinq fois deux,,,,,...,,,,,,,,,,
65,666 - Traue keinem mit dem du schläfst,DE,2002,677829,677829,666 - traue keinem mit dem du schlafst,,,,,...,,,,,,,,,,
116,A Stork's Journey,"DE, BE, LU, NO",2017,1862112,1848892,a stork's journey,,,,,...,,,,,,,,,,
122,A todo tren 2: Ahora son ellas,ES,2022,946416,669302,a todo tren 2: ahora son ellas,,,,,...,,,,,,,,,,
140,Adaptation,US,2002,1224151,1217905,adaptation,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4425,Ya Sonra?,TR,2011,904069,904069,ya sonra?,,,,,...,,,,,,,,,,
4452,Zeny v behu,CZ,2019,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4468,[REC]³ Génesis,ES,2011,828887,774431,[rec]3 genesis,,,,,...,,,,,,,,,,
4469,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,


Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
0,(500) Days of Summer,US,2009,1713086,1684771,(500) days of summer,tt1022603,500 Days of Summer,(500) Days of Summer,95,...,Marc Webb,,,2.0,Scott Neustadter,Michael H. Weber,,0.0,(500) days of summer,500 days of summer
1,(Nie)znajomi,PL,2019,685075,684833,(nie)znajomi,tt10518924,(Nie)znajomi,(Nie)znajomi,103,...,Tadeusz Sliwa,,,10.0,Filippo Bologna,Paolo Costella,,0.0,(nie)znajomi,(nie)znajomi
2,(T)Raumschiff Surprise - Periode 1,DE,2004,10763531,10731881,(t)raumschiff surprise - periode 1,tt0349047,(T)Raumschiff Surprise - Periode 1,(T)Raumschiff Surprise - Periode 1,87,...,Michael Herbig,,,3.0,Michael Herbig,Alfons Biedermann,Rick Kavanian,0.0,(t)raumschiff surprise - periode 1,(t)raumschiff surprise - periode 1
3,1 1/2 Ritter - Auf der Suche nach der hinreiße...,DE,2008,1986168,1986168,1 1/2 ritter - auf der suche nach der hinreiss...,tt1187047,1½ Knights - In Search of the Ravishing Prince...,1 1/2 Ritter - Auf der Suche nach der hinreiße...,115,...,Til Schweiger,Torsten Künstler,Christof Wahl,2.0,Oliver Ziegenbalg,Oliver Philipp,,0.0,1 1/2 ritter - auf der suche nach der hinreiss...,1 1/2 knights - in search of the ravishing pri...
4,1 chance sur 2,FR,1998,1295620,1238175,1 chance sur 2,tt0119247,Half a Chance,1 chance sur 2,104,...,Patrice Leconte,,,3.0,Patrick Dewolf,Serge Frydman,Patrice Leconte,0.0,1 chance sur 2,half a chance
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4539,Tom & Jerry,US,2021,2988828,2982401,tom & jerry,tt1361336,Tom & Jerry,Tom and Jerry,101,...,Tim Story,,,3.0,William Hanna,Joseph Barbera,Kevin Costello,0.0,tom and jerry,tom & jerry
4540,Two Brothers,"FR, GB",2004,5029977,4897605,two brothers,tt0338512,Two Brothers,Deux frères,109,...,Jean-Jacques Annaud,,,3.0,Alain Godard,Jean-Jacques Annaud,Julian Fellowes,0.0,deux freres,two brothers
4541,Van Wilder,"US, DE",2002,2328440,2312108,van wilder,tt0283111,Van Wilder,National Lampoon's Van Wilder,92,...,Walt Becker,,,2.0,Brent Goldberg,David Wagner,,0.0,national lampoon's van wilder,van wilder
4542,Wolf Totem,"CN, FR",2015,2118880,2087865,wolf totem,tt2909116,Wolf Totem,Le dernier loup,121,...,Jean-Jacques Annaud,,,5.0,Jiang Rong,Alain Godard,,0.0,le dernier loup,wolf totem


### Let's keep on checking the remaining problems ...

In [646]:
eu_check_df = double_merge_func(eu_df, imdb_df)

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
63,5X2 cinq fois deux,FR,2004,1150178,814942,5x2 cinq fois deux,,,,,...,,,,,,,,,,
65,666 - Traue keinem mit dem du schläfst,DE,2002,677829,677829,666 - traue keinem mit dem du schlafst,,,,,...,,,,,,,,,,
116,A Stork's Journey,"DE, BE, LU, NO",2017,1862112,1848892,a stork's journey,,,,,...,,,,,,,,,,
122,A todo tren 2: Ahora son ellas,ES,2022,946416,669302,a todo tren 2: ahora son ellas,,,,,...,,,,,,,,,,
140,Adaptation,US,2002,1224151,1217905,adaptation,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4425,Ya Sonra?,TR,2011,904069,904069,ya sonra?,,,,,...,,,,,,,,,,
4452,Zeny v behu,CZ,2019,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4468,[REC]³ Génesis,ES,2011,828887,774431,[rec]3 genesis,,,,,...,,,,,,,,,,
4469,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,


In [647]:
imdb_df[imdb_df["tconst"] == "tt0354356"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
17678,tt0354356,Five Times Two,5x2,2004,90,10498.0,6.6,2.0,Drama,Romance,...,François Ozon,,,2.0,François Ozon,Emmanuèle Bernheim,,0,5x2,five times two


In [648]:
eu_df.loc[eu_df["title_merge"].str.contains("5x2"), "title_merge"] = "5x2"
eu_df.loc[eu_df["title_merge"].str.contains("5x2"), :]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
65,5X2 cinq fois deux,FR,2004,1150178,814942,5x2


In [649]:
imdb_df[imdb_df["tconst"] == "tt0291167"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
12552,tt0291167,666: In Bed with the Devil,"666 - Traue keinem, mit dem Du schläfst!",2002,85,886.0,5.3,1.0,Comedy,,...,Rainer Matsutani,,,2.0,Johann Wolfgang von Goethe,Rainer Matsutani,,0,"666 - traue keinem, mit dem du schlafst!",666: in bed with the devil


In [650]:
eu_df.loc[eu_df["title_merge"].str.contains("666"),"title_merge"] = "666 - traue keinem, mit dem du schlafst!"
eu_df.loc[eu_df["title_merge"].str.contains("666"),:]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
67,666 - Traue keinem mit dem du schläfst,DE,2002,677829,677829,"666 - traue keinem, mit dem du schlafst!"


In [651]:
imdb_df[imdb_df["tconst"] == "tt3823116"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
138475,tt3823116,Little Bird's Big Adventure,"Überflieger - Kleine Vögel, großes Geklapper",2017,85,2377.0,5.8,3.0,Adventure,Animation,...,Toby Genkel,Reza Memari,,4.0,Reza Memari,Anne D. Bernstein,,0,"uberflieger - kleine vogel, grosses geklapper",little bird's big adventure


In [652]:
eu_df.loc[eu_df["title_merge"] == "a stork's journey", "title_merge"] = "uberflieger - kleine vogel, grosses geklapper"
eu_df.loc[eu_df["title_merge"].str.contains("uberflieger"), :]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
118,A Stork's Journey,"DE, BE, LU, NO",2017,1862112,1848892,"uberflieger - kleine vogel, grosses geklapper"


In [653]:
imdb_df[imdb_df["tconst"] == "tt21335908"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
102258,tt21335908,The Kids Are Alright 2,"A Todo Tren 2: Sí, les ha pasado otra vez",2022,84,4678.0,4.5,1.0,Comedy,,...,Inés de León,,,2.0,Marta González de Vega,Santiago Segura,,0,"a todo tren 2: si, les ha pasado otra vez",the kids are alright 2


In [654]:
eu_df.loc[eu_df["title_merge"].str.contains("a todo tren 2"), "title_merge"] = "a todo tren 2: si, les ha pasado otra vez"
eu_df.loc[eu_df["title_merge"].str.contains("a todo tren 2"), :]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
124,A todo tren 2: Ahora son ellas,ES,2022,946416,669302,"a todo tren 2: si, les ha pasado otra vez"


Corrected 5 manually ... this will be tedious if all 300 remaining unmatched rows have to be done by hand ...
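
One possible way to make the manual pass less tedious (not used in this notebook) is to let the standard library propose candidates and only confirm them by hand. The sketch below uses `difflib.get_close_matches`; `suggest_titles`, the `cutoff` value and the short candidate list are assumptions for illustration, and any suggestion would still need a manual check before overwriting `title_merge`.

```python
from difflib import get_close_matches

def suggest_titles(title, candidates, n=3, cutoff=0.6):
    """Return up to n candidate titles that look similar to `title`, best first."""
    return get_close_matches(title, candidates, n=n, cutoff=cutoff)

# A handful of IMDB merge titles from 2002 (taken from the outputs above)
imdb_titles_2002 = [
    "666 - traue keinem, mit dem du schlafst!",
    "adaptation.",
    "28 days later...",
]

for eu_title in ["666 - traue keinem mit dem du schlafst", "adaptation"]:
    print(eu_title, "->", suggest_titles(eu_title, imdb_titles_2002))
```

Restricting the candidates to the same year (or year +/- 1) keeps the list short and reduces false suggestions.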

### Let's check again after those 5 are gone -> strip all punctuation from the merge strings

In [655]:
eu_check_df = double_merge_func(eu_df, imdb_df)

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
140,Adaptation,US,2002,1224151,1217905,adaptation,,,,,...,,,,,,,,,,
147,African Cats,US,2011,803485,794169,african cats,,,,,...,,,,,,,,,,
159,Agir Roman,"TR, FR, HU",1998,850000,850000,agir roman,,,,,...,,,,,,,,,,
191,Allahin Sadik Kulu: Barla,TR,2011,2279419,2226422,allahin sadik kulu: barla,,,,,...,,,,,,,,,,
212,Amen,"FR, DE, RO",2002,1932487,1687019,amen,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4425,Ya Sonra?,TR,2011,904069,904069,ya sonra?,,,,,...,,,,,,,,,,
4452,Zeny v behu,CZ,2019,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4468,[REC]³ Génesis,ES,2011,828887,774431,[rec]3 genesis,,,,,...,,,,,,,,,,
4469,[Rec] 2,ES,2009,1387671,1013091,[rec] 2,,,,,...,,,,,,,,,,


In [656]:
imdb_df[imdb_df["tconst"] == "tt0268126"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
10377,tt0268126,Adaptation.,Adaptation.,2002,115,205067.0,7.7,2.0,Comedy,Drama,...,Spike Jonze,,,2.0,Susan Orlean,Charlie Kaufman,,0,adaptation.,adaptation.


OK, this is the third movie that does not match because IMDB added ".", "!", "..." or something similar to the end of the title.

In [657]:
import string

In [658]:
imdb_df["original_title_merge"] = imdb_df["original_title_merge"].str.translate(str.maketrans("","",string.punctuation))
imdb_df["primary_title_merge"] = imdb_df["primary_title_merge"].str.translate(str.maketrans("","",string.punctuation))

eu_df["title_merge"] = eu_df["title_merge"].str.translate(str.maketrans("","",string.punctuation))
na_df["title_merge"] = na_df["title_merge"].str.translate(str.maketrans("","",string.punctuation))
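
To see what this translation does to individual titles, here is a plain-Python sketch of the same `str.maketrans` table. `strip_punctuation` and `normalize` are hypothetical helpers; the whitespace collapsing in `normalize` is an optional extra step that the cells above do not perform.

```python
import string

# Same table as above: delete every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation, leaving everything else untouched."""
    return text.translate(PUNCT_TABLE)

def normalize(text):
    """Strip punctuation, then collapse runs of whitespace into single spaces."""
    return " ".join(strip_punctuation(text).split())

for raw in ["adaptation.", "28 days later...", "[rec] 2", "winx club - il segreto del regno perduto"]:
    print(repr(strip_punctuation(raw)), "|", repr(normalize(raw)))
```

Removing a " - " separator leaves two consecutive spaces behind. That is harmless as long as both sides use the same separator, but a title written with ":" on one side and " - " on the other would still differ by one space unless the whitespace is collapsed as in `normalize`.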

In [659]:
eu_check_df = double_merge_func(eu_df, imdb_df)

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
149,African Cats,US,2011,803485,794169,african cats,,,,,...,,,,,,,,,,
161,Agir Roman,"TR, FR, HU",1998,850000,850000,agir roman,,,,,...,,,,,,,,,,
225,American Pie 4,US,2012,12501353,12500536,american pie 4,,,,,...,,,,,,,,,,
260,Angry Birds,"US, FI",2016,10228756,10217533,angry birds,,,,,...,,,,,,,,,,
261,"Angus, Thongs and Full-Frontal Snogging","GBinc, US, DE",2008,1358397,1346801,angus thongs and fullfrontal snogging,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4396,Winx club - Il segreto del regno perduto,IT,2007,2217926,2132336,winx club il segreto del regno perduto,,,,,...,,,,,,,,,,
4459,Zeny v behu,CZ,2019,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4475,[REC]³ Génesis,ES,2011,828887,774431,rec3 genesis,,,,,...,,,,,,,,,,
4476,[Rec] 2,ES,2009,1387671,1013091,rec 2,,,,,...,,,,,,,,,,


Stripping punctuation left 255 unmatched rows (down from 295). That is an improvement of 40.

### Let's check the remaining rows again for new ideas

OK, idea: we merge the remaining rows again, but allow year +/- 1.

In [660]:
imdb_df[imdb_df["tconst"] == "tt1223236"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
54210,tt1223236,African Cats,African Cats,2010,89,6776.0,7.5,2.0,Adventure,Documentary,...,Keith Scholey,Alastair Fothergill,,3.0,Keith Scholey,John Truby,Owen Newman,0,african cats,african cats


Hmm, maybe I should create two new year columns (+1 and -1), only for the unmatched rows, and check again. There might be cases in which a movie has different release years in IMDB and in the ticket databases.
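
An alternative sketch of the same idea that avoids the two helper columns: merge on the title only and keep the pairs whose years differ by at most one. The frames are toy data; the `amen` IMDB row with `tt_placeholder` is made up to show a rejected candidate, and `year_imdb` is simply the name produced by the `suffixes` argument chosen here.

```python
import pandas as pd

# Unmatched ticket rows (toy data taken from the outputs above)
unmatched = pd.DataFrame({
    "title_merge": ["african cats", "arlington road", "amen"],
    "year": [2011, 1998, 2002],
})
# IMDB side; the last row is a made-up entry that is too far away in time
imdb = pd.DataFrame({
    "tconst": ["tt1223236", "tt0137363", "tt_placeholder"],
    "original_title_merge": ["african cats", "arlington road", "amen"],
    "year": [2010, 1999, 1990],
})

# Merge on title only; the IMDB year is kept as "year_imdb"
candidates = unmatched.merge(
    imdb,
    how="inner",
    left_on="title_merge",
    right_on="original_title_merge",
    suffixes=("", "_imdb"),
)

# Keep only pairs whose release years are at most one year apart
near = candidates[(candidates["year"] - candidates["year_imdb"]).abs() <= 1]
print(near[["title_merge", "year", "year_imdb", "tconst"]])
```

On the full IMDB table a title-only merge can produce many candidate pairs for common titles, so the year filter should be applied right after the merge.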

In [661]:
eu_unmatched_df = eu_check_df.loc[eu_check_df["tconst"].isnull(),:].copy()
eu_unmatched_df

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
149,African Cats,US,2011,803485,794169,african cats,,,,,...,,,,,,,,,,
161,Agir Roman,"TR, FR, HU",1998,850000,850000,agir roman,,,,,...,,,,,,,,,,
225,American Pie 4,US,2012,12501353,12500536,american pie 4,,,,,...,,,,,,,,,,
260,Angry Birds,"US, FI",2016,10228756,10217533,angry birds,,,,,...,,,,,,,,,,
261,"Angus, Thongs and Full-Frontal Snogging","GBinc, US, DE",2008,1358397,1346801,angus thongs and fullfrontal snogging,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4396,Winx club - Il segreto del regno perduto,IT,2007,2217926,2132336,winx club il segreto del regno perduto,,,,,...,,,,,,,,,,
4459,Zeny v behu,CZ,2019,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4475,[REC]³ Génesis,ES,2011,828887,774431,rec3 genesis,,,,,...,,,,,,,,,,
4476,[Rec] 2,ES,2009,1387671,1013091,rec 2,,,,,...,,,,,,,,,,


In [662]:
eu_unmatched_df.drop(columns=eu_unmatched_df.columns[6:], inplace=True)
display(eu_unmatched_df.head())
eu_unmatched_df.shape

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
149,African Cats,US,2011,803485,794169,african cats
161,Agir Roman,"TR, FR, HU",1998,850000,850000,agir roman
225,American Pie 4,US,2012,12501353,12500536,american pie 4
260,Angry Birds,"US, FI",2016,10228756,10217533,angry birds
261,"Angus, Thongs and Full-Frontal Snogging","GBinc, US, DE",2008,1358397,1346801,angus thongs and fullfrontal snogging


(255, 6)

In [663]:
eu_unmatched_df.loc[:,"year_plus"] = eu_unmatched_df["year"] + 1
eu_unmatched_df.loc[:,"year_minus"] = eu_unmatched_df["year"] - 1

In [664]:
eu_unmatched_df.head()

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,year_plus,year_minus
149,African Cats,US,2011,803485,794169,african cats,2012,2010
161,Agir Roman,"TR, FR, HU",1998,850000,850000,agir roman,1999,1997
225,American Pie 4,US,2012,12501353,12500536,american pie 4,2013,2011
260,Angry Birds,"US, FI",2016,10228756,10217533,angry birds,2017,2015
261,"Angus, Thongs and Full-Frontal Snogging","GBinc, US, DE",2008,1358397,1346801,angus thongs and fullfrontal snogging,2009,2007


In [665]:
eu_unmatched_df.merge(imdb_df, how="inner", left_on=["title_merge", "year_plus"], right_on=["original_title_merge", "year"])

Unnamed: 0,title,producing_country,year_x,tickets_sold_since_1996,tickets_sold,title_merge,year_plus,year_minus,tconst,primary_title,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
0,Arlington Road,US,1998,2030466,2029739,arlington road,1999,1997,tt0137363,Arlington Road,...,Mark Pellington,,,1.0,Ehren Kruger,,,0,arlington road,arlington road
1,Au coeur du mensonge,FR,1998,933602,828875,au coeur du mensonge,1999,1997,tt0164368,The Color of Lies,...,Claude Chabrol,,,2.0,Odile Barski,Claude Chabrol,,0,au coeur du mensonge,the color of lies
2,Baba Parasi,TR,2019,2071090,1896340,baba parasi,2020,2018,tt10549312,Baba Parasi,...,Selçuk Aydemir,,,1.0,Selçuk Aydemir,,,0,baba parasi,baba parasi
3,Belphégor - Le fantôme du Louvre,FR,2000,3053474,2563623,belphegor le fantome du louvre,2001,1999,tt0214529,Belphegor: Phantom of the Louvre,...,Jean-Paul Salomé,,,4.0,Arthur Bernède,Jean-Paul Salomé,,0,belphegor le fantome du louvre,belphegor phantom of the louvre
4,Bon Bini Holland 3,NL,2021,459342,459296,bon bini holland 3,2022,2020,tt11177150,Bon Bini Holland 3,...,Pieter van Rijn,,,4.0,Michel Bonset,Jandino Asporaat,,0,bon bini holland 3,bon bini holland 3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Une hirondelle a fait le printemps,"FR, BE",2000,2773596,2340748,une hirondelle a fait le printemps,2001,1999,tt0240149,The Girl from Paris,...,Christian Carion,,,2.0,Christian Carion,Eric Assous,,0,une hirondelle a fait le printemps,the girl from paris
59,Virus,"US, GB, JP, DE, FR",1998,1719742,1715116,virus,1999,1997,tt0120458,Virus,...,John Bruno,,,2.0,Chuck Pfarrer,Dennis Feldman,,0,virus,virus
60,Vizontele,TR,2000,3556294,3556294,vizontele,2001,1999,tt0270053,Vizontele,...,Yilmaz Erdogan,Ömer Faruk Sorak,,1.0,Yilmaz Erdogan,,,0,vizontele,vizontele
61,Vsetko alebo nic,"SK, CZ, PL",2016,831177,825927,vsetko alebo nic,2017,2015,tt3868240,All or Nothing,...,Marta Ferencova,,,2.0,Marta Ferencova,Eva Urbaníková,,0,vsetko alebo nic,all or nothing


In [666]:
eu_unmatched_df.merge(imdb_df, how="inner", left_on=["title_merge", "year_minus"], right_on=["original_title_merge", "year"])

Unnamed: 0,title,producing_country,year_x,tickets_sold_since_1996,tickets_sold,title_merge,year_plus,year_minus,tconst,primary_title,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
0,African Cats,US,2011,803485,794169,african cats,2012,2010,tt1223236,African Cats,...,Keith Scholey,Alastair Fothergill,,3.0,Keith Scholey,John Truby,Owen Newman,0,african cats,african cats
1,Agir Roman,"TR, FR, HU",1998,850000,850000,agir roman,1999,1997,tt0149601,Cholera Street,...,Mustafa Altioklar,,,2.0,Mustafa Altioklar,Metin Kaçan,,0,agir roman,cholera street
2,De l'autre côté du lit,FR,2009,2181416,1979965,de lautre cote du lit,2010,2008,tt1275590,De l'autre côté du lit,...,Pascale Pouzadoux,,,3.0,Alix Girod de l'Ain,Pascale Pouzadoux,Grégoire Vigneron,0,de lautre cote du lit,de lautre cote du lit
3,Die Wilden Hühner und die Liebe,DE,2007,1104115,1097403,die wilden huhner und die liebe,2008,2006,tt0844463,Wild Chicks in Love,...,Vivian Naefe,,,4.0,Cornelia Funke,Marie Graf,,0,die wilden huhner und die liebe,wild chicks in love
4,Die wilden Hühner,DE,2006,1289438,1284597,die wilden huhner,2007,2005,tt0466195,Wild Chicks,...,Vivian Naefe,,,3.0,Cornelia Funke,Güzin Kar,Uschi Reich,0,die wilden huhner,wild chicks
5,Die wilden Hühner und das Leben,DE,2009,1085680,1083857,die wilden huhner und das leben,2010,2008,tt1213660,Wild Chicks and Life,...,Vivian Naefe,,,4.0,Cornelia Funke,Vivian Naefe,,0,die wilden huhner und das leben,wild chicks and life
6,Drogówka,PL,2013,1024895,1015418,drogowka,2014,2012,tt2577150,Traffic Department,...,Wojciech Smarzowski,,,1.0,Wojciech Smarzowski,,,0,drogowka,traffic department
7,Eddie the Eagle,"GBinc, US, DE",2016,1785984,1774437,eddie the eagle,2017,2015,tt1083452,Eddie the Eagle,...,Dexter Fletcher,,,2.0,Simon Kelton,Sean Macaulay,,0,eddie the eagle,eddie the eagle
8,Ex Machina,"GBinc, US",2015,1281613,1270356,ex machina,2016,2014,tt0470752,Ex Machina,...,Alex Garland,,,1.0,Alex Garland,,,0,ex machina,ex machina
9,Fly Me to the Moon,"BE, US",2008,1851505,1278403,fly me to the moon,2009,2007,tt0486321,Fly Me to the Moon 3D,...,Ben Stassen,Mimi Maynard,,1.0,Domonic Paris,,,0,fly me to the moon,fly me to the moon 3d


OK, year +/- 1 on the original title gives around 85 movies. Let's implement that in the function and check the result.

In [667]:
def double_merge_func(data, imdb_base):
    '''
    Merges a ticket sales dataframe with the IMDB data, first on the newly created
    original title, then on the primary title.
    Afterwards, merges the remaining rows again with year +/- 1, on the original
    and then on the primary title.

    Input:
        data ... ticket sales data, either eu or na depending on what we are testing
        imdb_base ... the IMDB data (imdb_df)

    Output:
        Returns: Dataframe with all matches from all merges (unmatched rows keep empty IMDB columns)
        Displays: Unmatched rows
    '''
    ticket_data = data.copy()
    base_data = imdb_base.copy()

    # first merge (left) on original title and year
    check_df = pd.merge(ticket_data, base_data, how="left", left_on=["title_merge", "year"], right_on=["original_title_merge", "year"])

    # set up second df with the rows that did not match
    check_mask = check_df["tconst"].isnull()
    second_merge_df = check_df[check_mask].copy()

    # drop the empty IMDB columns from the first merge, keep only the first 6 (ticket data) columns
    second_merge_df.drop(columns=second_merge_df.columns[6:], inplace=True)

    # second merge (inner) on primary title and year
    second_merge_df = second_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=["primary_title_merge", "year"])

    # drop rows from the first merge table that matched on the second merge
    second_merge_mask = check_df["title"].isin(list(second_merge_df["title"]))
    check_df.drop(check_df[second_merge_mask].index, inplace=True)

    # add matching rows from the second merge to the first table
    check_df = pd.concat([check_df, second_merge_df])
    check_df.reset_index(drop=True, inplace=True)

    # --- TIME FOR YEAR +/- 1
    # set up third df with the rows that are still unmatched
    check_mask = check_df["tconst"].isnull()
    third_merge_df = check_df[check_mask].copy()

    # drop the empty IMDB columns, keep only the first 6 (ticket data) columns
    third_merge_df.drop(columns=third_merge_df.columns[6:], inplace=True)

    # add year plus and minus
    third_merge_df["year_plus"] = third_merge_df["year"] + 1
    third_merge_df["year_minus"] = third_merge_df["year"] - 1

    # 3_plus merge (inner) on original title
    third_merge_plus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year_plus"], right_on=["original_title_merge", "year"])
    # drop rows from the first merge table that matched on the 3_plus merge
    third_merge_plus_mask = check_df["title"].isin(list(third_merge_plus_df["title"]))
    check_df.drop(check_df[third_merge_plus_mask].index, inplace=True)
    # add matching rows from the 3_plus merge to the first table
    check_df = pd.concat([check_df, third_merge_plus_df])
    check_df.reset_index(drop=True, inplace=True)

    # 3_minus merge (inner) on original title
    third_merge_minus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year_minus"], right_on=["original_title_merge", "year"])
    # drop rows from the first merge table that matched on the 3_minus merge
    third_merge_minus_mask = check_df["title"].isin(list(third_merge_minus_df["title"]))
    check_df.drop(check_df[third_merge_minus_mask].index, inplace=True)
    # add matching rows from the 3_minus merge to the first table
    check_df = pd.concat([check_df, third_merge_minus_df])
    check_df.reset_index(drop=True, inplace=True)

    # -- Now do the same for PRIMARY TITLE
    # 3_plus merge (inner) on primary title
    third_merge_plus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year_plus"], right_on=["primary_title_merge", "year"])
    # drop rows from the first merge table that matched on the 3_plus merge
    third_merge_plus_mask = check_df["title"].isin(list(third_merge_plus_df["title"]))
    check_df.drop(check_df[third_merge_plus_mask].index, inplace=True)
    # add matching rows from the 3_plus merge to the first table
    check_df = pd.concat([check_df, third_merge_plus_df])
    check_df.reset_index(drop=True, inplace=True)

    # 3_minus merge (inner) on primary title
    third_merge_minus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year_minus"], right_on=["primary_title_merge", "year"])
    # drop rows from the first merge table that matched on the 3_minus merge
    third_merge_minus_mask = check_df["title"].isin(list(third_merge_minus_df["title"]))
    check_df.drop(check_df[third_merge_minus_mask].index, inplace=True)
    # add matching rows from the 3_minus merge to the first table
    check_df = pd.concat([check_df, third_merge_minus_df])
    check_df.reset_index(drop=True, inplace=True)

    # drop the four helper year columns left over from the year +/- 1 merges
    check_df.drop(columns=check_df.columns[-4:], inplace=True)

    # show unmatched rows
    check_mask = check_df["tconst"].isnull()
    display(check_df[check_mask])

    return check_df

In [668]:
eu_check_df = double_merge_func(eu_df, imdb_df)

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
223,American Pie 4,US,2012.0,12501353,12500536,american pie 4,,,,,...,,,,,,,,,,
258,Angry Birds,"US, FI",2016.0,10228756,10217533,angry birds,,,,,...,,,,,,,,,,
259,"Angus, Thongs and Full-Frontal Snogging","GBinc, US, DE",2008.0,1358397,1346801,angus thongs and fullfrontal snogging,,,,,...,,,,,,,,,,
265,Annabelle 2,US,2017.0,7298882,7291919,annabelle 2,,,,,...,,,,,,,,,,
299,Arthur et la guerre des deux mondes,FR,2010.0,3838378,3363498,arthur et la guerre des deux mondes,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4251,Warum Männer nicht zuhören und Frauen schlecht...,DE,2007.0,1452342,1068475,warum manner nicht zuhoren und frauen schlecht...,,,,,...,,,,,,,,,,
4306,Winx club - Il segreto del regno perduto,IT,2007.0,2217926,2132336,winx club il segreto del regno perduto,,,,,...,,,,,,,,,,
4369,Zeny v behu,CZ,2019.0,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4385,[Rec] 2,ES,2009.0,1387671,1013091,rec 2,,,,,...,,,,,,,,,,


We went from 255 to 169 unmatched rows with this new function! That is an improvement of 86!

### Let's check the next problem cases

In [669]:
imdb_df[imdb_df["tconst"] == "tt0149601"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
3080,tt0149601,Cholera Street,Agir Roman,1997,120,11994.0,7.6,3.0,Crime,Drama,...,Mustafa Altioklar,,,2.0,Mustafa Altioklar,Metin Kaçan,,0,agir roman,cholera street


In [670]:
basic_df[basic_df["tconst"] == "tt0149601"]

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
145007,tt0149601,movie,Cholera Street,Agir Roman,0.0,1997.0,\N,120,"Crime,Drama,Romance"


In [671]:
# let's delete basic_df again to free memory
del basic_df

In [672]:
eu_check_df[(eu_check_df["tconst"].isnull() ) & (eu_check_df["year"]==1998)]

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
776,"Crna mačka, beli mačor","FR, DE, YU",1998.0,2494989,2170497,crna macka beli macor,,,,,...,,,,,,,,,,
784,Cumhuriyet,TR,1998.0,726000,726000,cumhuriyet,,,,,...,,,,,,,,,,
800,"Dangerous Beauty (The Honest Courtesan, A Dest...",US,1998.0,929520,928888,dangerous beauty the honest courtesan a destin...,,,,,...,,,,,,,,,,
1077,Elisabeth I,GB,1998.0,3569152,3567062,elisabeth i,,,,,...,,,,,,,,,,
2319,"Martha, Meet Frank, Daniel and Laurence",GB,1998.0,666543,499834,martha meet frank daniel and laurence,,,,,...,,,,,,,,,,
2840,Pokémon the First Movie: Mewtwo Strikes Back,"JP, US",1998.0,12682197,12207823,pokemon the first movie mewtwo strikes back,,,,,...,,,,,,,,,,
3178,Sibirskij tsiryulnik,"RU, FR, IT, CZ",1998.0,1485657,1209107,sibirskij tsiryulnik,,,,,...,,,,,,,,,,


Placeholder: check whether the 11 movies from 1998 match if we change the filter and apply the year +/- 1 again

-> Gian-Luca on his branch

-> -> Done: this only matched 4 more movies, sadly :(

In [673]:
imdb_df[imdb_df["tconst"] == "tt1605630"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,director_name,director2_name,director3_name,writers_count,writer_name,writer2_name,writer3_name,is_adult,original_title_merge,primary_title_merge
81655,tt1605630,American Reunion,American Reunion,2012,113,225651.0,6.7,1.0,Comedy,,...,Jon Hurwitz,Hayden Schlossberg,,3.0,Jon Hurwitz,Hayden Schlossberg,Adam Herz,0,american reunion,american reunion


### Import newly created AKAs_df

In [674]:
aka_query = f'''   SELECT *
                    FROM {schema}."imdb_akas_data"
                    '''

In [675]:
aka_df = sqlf.get_dataframe(aka_query)
display(aka_df.head())
aka_df.shape

Unnamed: 0,tconst,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
0,tt0013274,,,,Histoire de la guerre civile,,,,,,
1,tt0015414,,,La tierra de los toros,La terre des taureaux,,,,,,
2,tt0035423,Kate et Léopold,Kate und Leopold,La Kate i en Leopold,Kate et Léopold,Kate & Leopold,Kate and Leopold,,Kate i Leopold,Büyülü çift,
3,tt0062336,,,,El Tango del Viudo y Su Espejo Deformante,The Tango of the Widower and Its Distorting Mi...,,,,,
4,tt0069049,The Other Side of the Wind,The Other Side of the Wind,Al otro lado del viento,De l'autre côté du vent,The Other Side of the Wind,L'altra faccia del vento,,Druga strona wiatru,,


(118555, 11)

### Transform all titles accordingly
- lower
- unidecode
- remove punctuation
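
The three steps listed above can be sketched as a single helper that also leaves missing values alone. This is a minimal sketch, not the notebook's own code: `normalize_title` is a hypothetical name, and it swaps in the standard library's `unicodedata` for `unidecode`, so characters without an ASCII decomposition (for example `ß` or `ı`) are dropped here instead of transliterated.

```python
import string
import unicodedata

# translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_title(value):
    """Lowercase, strip accents and remove punctuation; pass missing values through."""
    if not isinstance(value, str):
        return value  # keep None / NaN untouched instead of turning them into text
    lowered = value.lower()
    # split accented letters into base letter + combining mark, then drop the marks
    decomposed = unicodedata.normalize("NFKD", lowered)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.translate(_PUNCT_TABLE)


print(normalize_title("De l'autre côté du vent"))
print(repr(normalize_title("Kate & Leopold")))
```

Removing a character such as `&` or `-` leaves two consecutive spaces behind, which matters as soon as titles are compared character by character.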

In [676]:
for column in aka_df.iloc[:,1:]:
    aka_df.loc[:,column] = aka_df[column].str.lower()
    aka_df.loc[:,column] = aka_df[column].astype(str).apply(unidecode)
    aka_df.loc[:,column] = aka_df.loc[:,column].str.translate(str.maketrans("","",string.punctuation))

In [677]:
aka_df.head()

Unnamed: 0,tconst,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
0,tt0013274,,,,histoire de la guerre civile,,,,,,
1,tt0015414,,,la tierra de los toros,la terre des taureaux,,,,,,
2,tt0035423,kate et leopold,kate und leopold,la kate i en leopold,kate et leopold,kate leopold,kate and leopold,,kate i leopold,buyulu cift,
3,tt0062336,,,,el tango del viudo y su espejo deformante,the tango of the widower and its distorting mi...,,,,,
4,tt0069049,the other side of the wind,the other side of the wind,al otro lado del viento,de lautre cote du vent,the other side of the wind,laltra faccia del vento,,druga strona wiatru,,


In [678]:
imdb_aka_df = pd.merge(imdb_df, aka_df, how="left", on="tconst")

In [679]:
display(imdb_aka_df.shape)
imdb_df.shape

(188163, 37)

(188163, 27)

In [680]:
imdb_aka_df.head()

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
0,tt0013274,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,2021,94,73.0,6.7,1.0,Documentary,,...,,,,histoire de la guerre civile,,,,,,
1,tt0015414,La tierra de los toros,La tierra de los toros,2000,60,17.0,5.4,,,,...,,,la tierra de los toros,la terre des taureaux,,,,,,
2,tt0035423,Kate & Leopold,Kate & Leopold,2001,118,89944.0,6.4,3.0,Comedy,Fantasy,...,kate et leopold,kate und leopold,la kate i en leopold,kate et leopold,kate leopold,kate and leopold,,kate i leopold,buyulu cift,
3,tt0062336,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,2020,70,190.0,6.5,1.0,Drama,,...,,,,el tango del viudo y su espejo deformante,the tango of the widower and its distorting mi...,,,,,
4,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122,8143.0,6.7,1.0,Drama,,...,the other side of the wind,the other side of the wind,al otro lado del viento,de lautre cote du vent,the other side of the wind,laltra faccia del vento,,druga strona wiatru,,


### Write the mother of all functions:
- goes over each title column (original, primary, ...)
- for each title, checks the merge with year and year +/- 1
- after the first merge, runs the following merges on unmatched rows only and replaces them if necessary
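
The plan above boils down to a cascade: try one title column after the other, try each year offset, and only retry rows that are still unmatched. The sketch below shows that idea in plain Python with dictionaries instead of pandas merges. It is an illustration under assumptions, not the function used in this notebook: `cascade_match`, the toy rows and the `tt_a` / `tt_b` ids are all made up.

```python
def cascade_match(tickets, imdb_rows, title_columns, year_offsets=(0, 1, -1)):
    """Return {ticket position: tconst or None}, trying title columns in priority order."""
    matches = {i: None for i in range(len(tickets))}
    for column in title_columns:
        # build a lookup for this title column: (title, year) -> tconst
        index = {}
        for row in imdb_rows:
            title = row.get(column)
            if title:
                index.setdefault((title, row["year"]), row["tconst"])
        for offset in year_offsets:
            for i, ticket in enumerate(tickets):
                if matches[i] is not None:
                    continue  # earlier matches are kept, only unmatched rows are retried
                key = (ticket["title_merge"], ticket["year"] + offset)
                matches[i] = index.get(key)
    return matches


# toy data, purely hypothetical
imdb_rows = [
    {"tconst": "tt_a", "year": 2010, "original_title_merge": "film a", "DE": "film a de"},
    {"tconst": "tt_b", "year": 2011, "original_title_merge": "film b", "DE": None},
]
tickets = [
    {"title_merge": "film a", "year": 2010},     # exact match on the original title
    {"title_merge": "film b", "year": 2012},     # matches only with a year offset
    {"title_merge": "film a de", "year": 2010},  # matches only on the DE title
    {"title_merge": "film c", "year": 2010},     # stays unmatched
]
result = cascade_match(tickets, imdb_rows, ["original_title_merge", "DE"])
print(result)
```

Keying the bookkeeping on the row position rather than on the title avoids touching other rows that happen to share a title.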

In [681]:
eu_df.head()

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge
0,(500) Days of Summer,US,2009,1713086,1684771,500 days of summer
1,(Nie)znajomi,PL,2019,685075,684833,nieznajomi
2,(T)Raumschiff Surprise - Periode 1,DE,2004,10763531,10731881,traumschiff surprise periode 1
3,1 1/2 Ritter - Auf der Suche nach der hinreiße...,DE,2008,1986168,1986168,1 12 ritter auf der suche nach der hinreissen...
4,1 chance sur 2,FR,1998,1295620,1238175,1 chance sur 2


In [682]:
imdb_aka_df.shape

(188163, 37)

In [683]:
na_df.columns

Index(['title', 'release_date', 'distributor', 'gross_sales', 'tickets_sold',
       'release_year', 'title_merge'],
      dtype='object')

In [684]:
na_df.columns = ['title', 'release_date', 'distributor', 'gross_sales', 'tickets_sold',
       'year', 'title_merge']

In [685]:
def ultimate_merge_func(data, imdb_base, number_of_columns=6):
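    '''
    Merges a ticket sales dataframe with the IMDB data, trying each title column
    in turn for year and year +/- 1.

    Input:
        data ... ticket data, either eu or na depending on what we are testing
        imdb_base ... IMDB dataframe including the AKA title columns
        number_of_columns = 6 ... how many columns the ticket data has

    Output:
        Returns: Dataframe with all matches from all title columns
        Prints: Unmatched rows and the loop counter
    '''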

    ticket_data = data.copy()
    base_data = imdb_base.copy()

    list_titles = ['original_title_merge', 'primary_title_merge', 'CA' ,'DE', 'ES', 'FR', 'GB', 'IT', 'NL', 'PL', 'TR', 'ALTER']
    counter = 0

    base_data["year_plus"] = base_data["year"] + 1
    base_data["year_minus"] = base_data["year"] - 1

    list_years = ["year_minus", "year", "year_plus"]

    for title in list_titles:
        if counter == 0:
            check_df = pd.merge(ticket_data,base_data, how="left", left_on=["title_merge", "year"], right_on=[title, "year"])

            
            # set-up third df for further calculations
            check_mask = check_df["tconst"].isnull()
            third_merge_df = check_df[check_mask].copy()

            # drop columns from first merge
            third_merge_df.drop(columns = third_merge_df.iloc[:,number_of_columns:], inplace=True)
            
            # 3_plus merge (inner)
            third_merge_plus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, "year_plus"])
            # drop rows from first merge table that matched on 3_plus merge (inner)
            third_merge_plus_mask = check_df["title"].isin(list(third_merge_plus_df["title"]))
            check_df.drop(check_df[third_merge_plus_mask].index, inplace=True)
            # add fitting rows from 3_plus merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_plus_df])
            check_df.reset_index(drop=True, inplace=True)
            check_df.drop(columns = "year_x", inplace=True)

            # 3_minus merge (inner)
            third_merge_minus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, "year_minus"])
            # drop rows from first merge table that matched on 3_minus merge (inner)
            third_merge_minus_mask = check_df["title"].isin(list(third_merge_minus_df["title"]))
            check_df.drop(check_df[third_merge_minus_mask].index, inplace=True)
            # add matching rows from 3_minus merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_minus_df])
            check_df.reset_index(drop=True, inplace=True)
            check_df.drop(columns = "year_x", inplace=True)

            counter += 1

        for year in list_years:
            # set-up third df for further calculations
            check_mask = check_df["tconst"].isnull()
            third_merge_df = check_df[check_mask].copy()

            # drop columns from first merge
            third_merge_df.drop(columns = third_merge_df.iloc[:,number_of_columns:], inplace=True)
            
            # merge again
            third_merge_plus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, year])
            # drop rows from first merge table that matched on this merge (inner)
            third_merge_plus_mask = check_df["title"].isin(list(third_merge_plus_df["title"]))
            check_df.drop(check_df[third_merge_plus_mask].index, inplace=True)
            # add matching rows from this merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_plus_df])
            check_df.reset_index(drop=True, inplace=True)
        counter += 1

    # drop new year columns from final table
    check_df.drop(columns = check_df.iloc[:,-4:], inplace=True)

    # show unmatched rows
    check_mask = check_df["tconst"].isnull()
    display(check_df[check_mask])
    display(counter)
    
    return check_df

In [686]:
eu_check_df = ultimate_merge_func(eu_df, imdb_aka_df)

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
295,Arthur et la guerre des deux mondes,FR,2010.0,3838378,3363498,arthur et la guerre des deux mondes,,,,,...,,,,,,,,,,
298,Artificial Intelligence: AI,US,2001.0,8073605,8041431,artificial intelligence ai,,,,,...,,,,,,,,,,
314,Astérix & Obélix: Au service de sa Majesté,"FR, ES, IT, LT, BE",2012.0,6405518,6148156,asterix obelix au service de sa majeste,,,,,...,,,,,,,,,,
317,Astérix: Le secret de la potion magique,"FR, BE",2018.0,6513544,6499739,asterix le secret de la potion magique,,,,,...,,,,,,,,,,
320,Atatürk 1881 - 1919,TR,2023.0,1732649,1732649,ataturk 1881 1919,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4185,Warum Männer nicht zuhören und Frauen schlecht...,DE,2007.0,1452342,1068475,warum manner nicht zuhoren und frauen schlecht...,,,,,...,,,,,,,,,,
4240,Winx club - Il segreto del regno perduto,IT,2007.0,2217926,2132336,winx club il segreto del regno perduto,,,,,...,,,,,,,,,,
4303,Zeny v behu,CZ,2019.0,1705959,1675569,zeny v behu,,,,,...,,,,,,,,,,
4319,[Rec] 2,ES,2009.0,1387671,1013091,rec 2,,,,,...,,,,,,,,,,


13

After applying the mother of all functions with the AKA titles, the number of unmatched rows drops to 103 (from 169).
That is an improvement of 66 rows.

### Let's Check again what else we could do

In [687]:
eu_check_df[eu_check_df["tconst"].isnull()].sort_values(by="tickets_sold", ascending=False).head()

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
1249,Furious Seven,"US, JP, CN",2015.0,30936497,30914513,furious seven,,,,,...,,,,,,,,,,
4174,Wallace & Gromit in The Curse of the Were-Rabbit,"GBinc, US",2005.0,14014825,13251997,wallace gromit in the curse of the wererabbit,,,,,...,,,,,,,,,,
2791,Pokémon the First Movie: Mewtwo Strikes Back,"JP, US",1998.0,12682197,12207823,pokemon the first movie mewtwo strikes back,,,,,...,,,,,,,,,,
3049,Scooby Doo 2: Monsters Unleashed,US,2004.0,8179612,8175801,scooby doo 2 monsters unleashed,,,,,...,,,,,,,,,,
298,Artificial Intelligence: AI,US,2001.0,8073605,8041431,artificial intelligence ai,,,,,...,,,,,,,,,,


In [688]:
imdb_aka_df[imdb_aka_df["tconst"] == "tt2820852"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
122635,tt2820852,Furious 7,Fast & Furious 7,2015,137,418868.0,7.1,3.0,Action,Crime,...,furious 7,fast furious 7,a todo gas 7,fast furious 7,fast furious 7,fast furious 7,fast furious 7,szybcy i wsciekli 7,hizli ve ofkeli 7,


"Furious Seven" is written out in the EU data, but IMDB lists it as "Furious 7". <br><br>
Let's see if some of the AKAs have the other spelling.

In [689]:
basic_aka_df = pd.read_csv("Data/title.principals/title.akas.csv", na_values="\\N")


In [690]:
basic_aka_df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Carmencita,,,original,,1.0
1,tt0000001,2,Carmencita,DE,,,literal title,0.0
2,tt0000001,3,Carmencita,US,,imdbDisplay,,0.0
3,tt0000001,4,Carmencita - spanyol tánc,HU,,imdbDisplay,,0.0
4,tt0000001,5,Καρμενσίτα,GR,,imdbDisplay,,0.0


In [691]:
basic_aka_df[basic_aka_df["titleId"] == "tt2820852"].head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
29995145,tt2820852,1,Fast & Furious 7,,,original,,1.0
29995146,tt2820852,10,Fast & Furious 7,GB,,imdbDisplay,,0.0
29995147,tt2820852,11,Fast & Furious 7,HK,en,imdbDisplay,,0.0
29995148,tt2820852,12,Fast & Furious 7,IE,en,imdbDisplay,,0.0
29995149,tt2820852,13,Fast & Furious 7,IN,en,imdbDisplay,,0.0


In [692]:
del basic_aka_df

Only one of the alternative titles is spelled "Furious Seven".
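
Since the AKAs rarely cover the written-out spelling, another option would be to map number words to digits on both datasets before merging. The sketch below is an assumption and not part of this notebook: `NUMBER_WORDS` and `digits_for_number_words` are hypothetical names, and the mapping only covers one to ten.

```python
import re

# hypothetical mapping, limited to the words one..ten
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}
# \b keeps the pattern from firing inside longer words such as "someone"
_NUMBER_PATTERN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b")


def digits_for_number_words(title):
    """Replace standalone number words in an already normalized title with digits."""
    return _NUMBER_PATTERN.sub(lambda match: NUMBER_WORDS[match.group(1)], title)


print(digits_for_number_words("furious seven"))
print(digits_for_number_words("rogue one a star wars story"))
```

The second call shows the side effect: titles that legitimately contain a number word get rewritten too, so this only makes sense if the same function is applied to the ticket data and to every IMDB title column.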

### Let's check how we currently do with the NA data

Added a `number_of_columns` parameter to the merging function, since NA and EU don't have the same number of columns.

In [693]:
na_check = ultimate_merge_func(na_df, imdb_aka_df,number_of_columns=7)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
11,13 Hours: The Secret Soldie…,2016-01-15,Paramount Pictures,52853219,6110198,2016.0,13 hours the secret soldie,,,,...,,,,,,,,,,
19,2 For the Money,2005-10-07,Universal,22991379,3586798,2005.0,2 for the money,,,,...,,,,,,,,,,
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013.0,2013 oscar shorts,,,,...,,,,,,,,,,
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014.0,2014 oscar shorts,,,,...,,,,,,,,,,
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015.0,2015 oscar shorts,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4008,You Will Meet a Tall Dark S…,2010-09-22,Sony Pictures Cla…,3229586,409326,2010.0,you will meet a tall dark s,,,,...,,,,,,,,,,
4009,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,youre next,,,,...,,,,,,,,,,
4015,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yours mine and ours,,,,...,,,,,,,,,,
4017,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

In [694]:
na_check[na_check["tconst"].isnull()].sort_values(by="tickets_sold", ascending=False).head()

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
2754,Star Wars Ep. VII: The Forc…,2015-12-18,Walt Disney,936662225,110523913,2015.0,star wars ep vii the forc,,,,...,,,,,,,,,,
2751,Star Wars Ep. I: The Phanto…,1999-05-19,20th Century Fox,473901685,90192671,1999.0,star wars ep i the phanto,,,,...,,,,,,,,,,
2755,Star Wars Ep. VIII: The Las…,2017-12-15,Walt Disney,620181382,68963106,2017.0,star wars ep viii the las,,,,...,,,,,,,,,,
2279,Pirates of the Caribbean: D…,2006-07-07,Walt Disney,423315812,64628368,2006.0,pirates of the caribbean d,,,,...,,,,,,,,,,
3308,The Lord of the Rings: The …,2003-12-17,New Line,378203410,62218072,2003.0,the lord of the rings the,,,,...,,,,,,,,,,


### Add variable in ultimate function to shorten titles if needed (for all NAs shorten to 25 characters)

In [695]:
def ultimate_merge_func(data, imdb_base, number_of_columns=6, short=False):
    '''
    Merges a ticket sales dataframe with the IMDB data, trying each title column
    in turn for year and year +/- 1.

    Input:
        data ... ticket data, either eu or na depending on what we are testing
        imdb_base ... IMDB dataframe including the AKA title columns
        number_of_columns = 6 ... how many columns the ticket data has
        short = False ... if True, truncate all merge titles to 25 characters

    Output:
        Returns: Dataframe with all matches from all title columns
        Prints: Unmatched rows and the loop counter
    '''
    ticket_data = data.copy()
    base_data = imdb_base.copy()

    list_titles = ['original_title_merge', 'primary_title_merge', 'CA','DE', 'ES', 'FR', 'GB', 'IT', 'NL', 'PL', 'TR', 'ALTER']
    counter = 0

    base_data["year_plus"] = base_data["year"] + 1
    base_data["year_minus"] = base_data["year"] - 1

    list_years = ["year_minus", "year", "year_plus"]

    if short:
        for title in list_titles:
            base_data.loc[:,title] = base_data[title].str[:25]
        ticket_data.loc[:,"title_merge"] = ticket_data["title_merge"].str[:25]

    for title in list_titles:
        if counter == 0:
            check_df = pd.merge(ticket_data,base_data, how="left", left_on=["title_merge", "year"], right_on=[title, "year"])

            
            # set-up third df for further calculations
            check_mask = check_df["tconst"].isnull()
            third_merge_df = check_df[check_mask].copy()

            # drop columns from first merge
            third_merge_df.drop(columns = third_merge_df.iloc[:,number_of_columns:], inplace=True)
            
            # 3_plus merge (inner)
            third_merge_plus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, "year_plus"])
            # drop rows from first merge table that matched on 3_plus merge (inner)
            third_merge_plus_mask = check_df["title"].isin(list(third_merge_plus_df["title"]))
            check_df.drop(check_df[third_merge_plus_mask].index, inplace=True)
            # add fitting rows from 3_plus merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_plus_df])
            check_df.reset_index(drop=True, inplace=True)
            check_df.drop(columns = "year_x", inplace=True)

            # 3_minus merge (inner)
            third_merge_minus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, "year_minus"])
            # drop rows from first merge table that matched on 3_minus merge (inner)
            third_merge_minus_mask = check_df["title"].isin(list(third_merge_minus_df["title"]))
            check_df.drop(check_df[third_merge_minus_mask].index, inplace=True)
            # add matching rows from 3_minus merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_minus_df])
            check_df.reset_index(drop=True, inplace=True)
            check_df.drop(columns = "year_x", inplace=True)

            counter += 1

        for year in list_years:
            # set-up third df for further calculations
            check_mask = check_df["tconst"].isnull()
            third_merge_df = check_df[check_mask].copy()

            # drop columns from first merge
            third_merge_df.drop(columns = third_merge_df.iloc[:,number_of_columns:], inplace=True)
            
            # merge again
            third_merge_plus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, year])
            # drop rows from first merge table that matched on this merge (inner)
            third_merge_plus_mask = check_df["title"].isin(list(third_merge_plus_df["title"]))
            check_df.drop(check_df[third_merge_plus_mask].index, inplace=True)
            # add matching rows from this merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_plus_df])
            check_df.reset_index(drop=True, inplace=True)
        counter += 1

    # drop new year columns from final table
    check_df.drop(columns = check_df.iloc[:,-4:], inplace=True)

    # show unmatched rows
    check_mask = check_df["tconst"].isnull()
    display(check_df[check_mask])
    display(counter)
    
    return check_df

In [696]:
na_check = ultimate_merge_func(na_df,imdb_aka_df, number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
19,2 For the Money,2005-10-07,Universal,22991379,3586798,2005.0,2 for the money,,,,...,,,,,,,,,,
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013.0,2013 oscar shorts,,,,...,,,,,,,,,,
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014.0,2014 oscar shorts,,,,...,,,,,,,,,,
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015.0,2015 oscar shorts,,,,...,,,,,,,,,,
29,2017 Oscar Shorts,2017-02-10,ShortsHD,2835355,316093,2017.0,2017 oscar shorts,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3954,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,y tu mama tambien and you,,,,...,,,,,,,,,,
3967,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,youre next,,,,...,,,,,,,,,,
3973,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yours mine and ours,,,,...,,,,,,,,,,
3975,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

Shortening the titles reduced the number of unmatched NA rows from 505 to 257.
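
To make the effect of `short=True` concrete: both sides are cut to the same length, so a title that the NA source truncated can still equal the full IMDB title. A minimal sketch, assuming a hypothetical helper `same_prefix` and the 25-character cut used in the function above:

```python
def same_prefix(left, right, length=25):
    """Compare two normalized titles on their first `length` characters only."""
    return left[:length] == right[:length]


# truncated NA title vs. full normalized IMDB title
print(same_prefix("pirates of the caribbean d", "pirates of the caribbean dead mans chest"))

# caveat: different films can share the same 25-character prefix
print(same_prefix("the lord of the rings the two towers",
                  "the lord of the rings the return of the king"))
```

Both calls return `True`. The second one shows why the year has to stay part of the merge key: on the prefix alone, sequels with a long common title become indistinguishable.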

### Let's keep checking the other problem cases for the NA data... 

In [697]:
na_check[na_check["tconst"].isnull()].sort_values(by="tickets_sold", ascending=False).head(10)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
2729,Star Wars Ep. VII: The Forc…,2015-12-18,Walt Disney,936662225,110523913,2015.0,star wars ep vii the forc,,,,...,,,,,,,,,,
2726,Star Wars Ep. I: The Phanto…,1999-05-19,20th Century Fox,473901685,90192671,1999.0,star wars ep i the phanto,,,,...,,,,,,,,,,
2730,Star Wars Ep. VIII: The Las…,2017-12-15,Walt Disney,620181382,68963106,2017.0,star wars ep viii the las,,,,...,,,,,,,,,,
2728,Star Wars Ep. III: Revenge …,2005-05-19,20th Century Fox,380270577,59324582,2005.0,star wars ep iii revenge,,,,...,,,,,,,,,,
2727,Star Wars Ep. II: Attack of…,2002-05-16,20th Century Fox,310676740,53472760,2002.0,star wars ep ii attack of,,,,...,,,,,,,,,,
1047,Fast and Furious 6,2013-05-24,Universal,238679850,29357915,2013.0,fast and furious 6,,,,...,,,,,,,,,,
897,Dr. Seuss' The Lorax,2012-03-02,Universal,214030500,26888253,2012.0,dr seuss the lorax,,,,...,,,,,,,,,,
1963,Mission: Impossible—Ghost P…,2011-12-21,Paramount Pictures,209397903,26373370,2011.0,mission impossibleghost p,,,,...,,,,,,,,,,
1962,Mission: Impossible—Fallout,2018-07-27,Paramount Pictures,220159104,24166751,2018.0,mission impossiblefallout,,,,...,,,,,,,,,,
1964,Mission: Impossible—Rogue N…,2015-07-31,Paramount Pictures,195042377,23136699,2015.0,mission impossiblerogue n,,,,...,,,,,,,,,,


In [698]:
imdb_aka_df.loc[imdb_aka_df["tconst"] == "tt2488496", ["primary_title_merge", "original_title_merge"]]

Unnamed: 0,primary_title_merge,original_title_merge
114604,star wars episode vii the force awakens,star wars episode vii the force awakens


In [699]:
imdb_aka_df.loc[imdb_aka_df["tconst"] == "tt2488496"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
114604,tt2488496,Star Wars: Episode VII - The Force Awakens,Star Wars: Episode VII - The Force Awakens,2015,138,983277.0,7.8,3.0,Action,Adventure,...,star wars the force awakens,star wars episode vii das erwachen der macht,star wars el despertar de la fuerza,star wars episode vii le reveil de la force,star wars episode vii the force awakens,star wars il risveglio della forza,star wars episode vii the force awakens,gwiezdne wojny czesc vii przebudzenie mocy,star wars bolum vii guc uyaniyor,


In [700]:
na_df[na_df["title_merge"].str.contains("star wars")]

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge
2780,Rogue One: A Star Wars Story,2016-12-16,Walt Disney,533539991,61210724,2016,rogue one a star wars story
3017,Solo: A Star Wars Story,2018-05-25,Walt Disney,213767512,23465149,2018,solo a star wars story
3092,Star Wars Ep. I: The Phanto…,1999-05-19,20th Century Fox,473901685,90192671,1999,star wars ep i the phanto
3093,Star Wars Ep. II: Attack of…,2002-05-16,20th Century Fox,310676740,53472760,2002,star wars ep ii attack of
3094,Star Wars Ep. III: Revenge …,2005-05-19,20th Century Fox,380270577,59324582,2005,star wars ep iii revenge
3095,Star Wars Ep. VII: The Forc…,2015-12-18,Walt Disney,936662225,110523913,2015,star wars ep vii the forc
3096,Star Wars Ep. VIII: The Las…,2017-12-15,Walt Disney,620181382,68963106,2017,star wars ep viii the las
3097,Star Wars: The Clone Wars,2008-08-15,Warner Bros.,35161554,4897152,2008,star wars the clone wars
3098,Star Wars: The Rise of Skyw…,2019-12-20,Walt Disney,515202542,56215208,2019,star wars the rise of skyw


Change every " ep " to " episode ".

In [701]:
na_df.loc[:,"title_merge"] = na_df["title_merge"].str.replace(" ep ", " episode ")
na_df[na_df["title_merge"].str.contains("star wars")]

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge
2780,Rogue One: A Star Wars Story,2016-12-16,Walt Disney,533539991,61210724,2016,rogue one a star wars story
3017,Solo: A Star Wars Story,2018-05-25,Walt Disney,213767512,23465149,2018,solo a star wars story
3092,Star Wars Ep. I: The Phanto…,1999-05-19,20th Century Fox,473901685,90192671,1999,star wars episode i the phanto
3093,Star Wars Ep. II: Attack of…,2002-05-16,20th Century Fox,310676740,53472760,2002,star wars episode ii attack of
3094,Star Wars Ep. III: Revenge …,2005-05-19,20th Century Fox,380270577,59324582,2005,star wars episode iii revenge
3095,Star Wars Ep. VII: The Forc…,2015-12-18,Walt Disney,936662225,110523913,2015,star wars episode vii the forc
3096,Star Wars Ep. VIII: The Las…,2017-12-15,Walt Disney,620181382,68963106,2017,star wars episode viii the las
3097,Star Wars: The Clone Wars,2008-08-15,Warner Bros.,35161554,4897152,2008,star wars the clone wars
3098,Star Wars: The Rise of Skyw…,2019-12-20,Walt Disney,515202542,56215208,2019,star wars the rise of skyw


In [702]:
# check again

na_check = ultimate_merge_func(na_df, imdb_aka_df,number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
19,2 For the Money,2005-10-07,Universal,22991379,3586798,2005.0,2 for the money,,,,...,,,,,,,,,,
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013.0,2013 oscar shorts,,,,...,,,,,,,,,,
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014.0,2014 oscar shorts,,,,...,,,,,,,,,,
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015.0,2015 oscar shorts,,,,...,,,,,,,,,,
29,2017 Oscar Shorts,2017-02-10,ShortsHD,2835355,316093,2017.0,2017 oscar shorts,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3954,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,y tu mama tambien and you,,,,...,,,,,,,,,,
3967,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,youre next,,,,...,,,,,,,,,,
3973,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yours mine and ours,,,,...,,,,,,,,,,
3975,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

In [703]:
na_check[na_check["tconst"].isnull()].sort_values(by="tickets_sold", ascending=False).head(8)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
2729,Star Wars Ep. VII: The Forc…,2015-12-18,Walt Disney,936662225,110523913,2015.0,star wars episode vii the,,,,...,,,,,,,,,,
2726,Star Wars Ep. I: The Phanto…,1999-05-19,20th Century Fox,473901685,90192671,1999.0,star wars episode i the p,,,,...,,,,,,,,,,
2730,Star Wars Ep. VIII: The Las…,2017-12-15,Walt Disney,620181382,68963106,2017.0,star wars episode viii th,,,,...,,,,,,,,,,
2728,Star Wars Ep. III: Revenge …,2005-05-19,20th Century Fox,380270577,59324582,2005.0,star wars episode iii rev,,,,...,,,,,,,,,,
2727,Star Wars Ep. II: Attack of…,2002-05-16,20th Century Fox,310676740,53472760,2002.0,star wars episode ii atta,,,,...,,,,,,,,,,
1047,Fast and Furious 6,2013-05-24,Universal,238679850,29357915,2013.0,fast and furious 6,,,,...,,,,,,,,,,
897,Dr. Seuss' The Lorax,2012-03-02,Universal,214030500,26888253,2012.0,dr seuss the lorax,,,,...,,,,,,,,,,
1963,Mission: Impossible—Ghost P…,2011-12-21,Paramount Pictures,209397903,26373370,2011.0,mission impossibleghost p,,,,...,,,,,,,,,,


Star Wars Episode ... is written as Star Wars Ep. ... in the NA data, so we change every " ep " into " episode ". 
Sadly, it did not create any new matches. 
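
Replacing the literal `" ep "` only works when the abbreviation has a space on both sides. A word-boundary pattern is a slightly more general variant. The helper name `expand_ep` is an assumption for illustration, and it only addresses the abbreviation, not whatever else keeps these rows from matching.

```python
import re


def expand_ep(title):
    """Expand the standalone token "ep" to "episode" in a normalized title."""
    # \b matches at word boundaries, so "deep" or "step" are left alone
    return re.sub(r"\bep\b", "episode", title)


print(expand_ep("star wars ep vii the forc"))
print(expand_ep("deep impact"))
```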

### Let's keep going and look at spaces

In [704]:
imdb_aka_df[imdb_aka_df["original_title_merge"].str.contains("star wars episode vii the")]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER


In [705]:
imdb_aka_df[imdb_aka_df["primary_title_merge"].str.contains("star wars episode vii the")]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER


In [706]:
# pd.set_option("display.max_columns", 37)

In [707]:
imdb_aka_df[imdb_aka_df["primary_title_merge"].str.contains("star wars episode vii")]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
114604,tt2488496,Star Wars: Episode VII - The Force Awakens,Star Wars: Episode VII - The Force Awakens,2015,138,983277.0,7.8,3.0,Action,Adventure,...,star wars the force awakens,star wars episode vii das erwachen der macht,star wars el despertar de la fuerza,star wars episode vii le reveil de la force,star wars episode vii the force awakens,star wars il risveglio della forza,star wars episode vii the force awakens,gwiezdne wojny czesc vii przebudzenie mocy,star wars bolum vii guc uyaniyor,
115286,tt2527336,Star Wars: Episode VIII - The Last Jedi,Star Wars: Episode VIII - The Last Jedi,2017,152,680742.0,6.9,3.0,Action,Adventure,...,star wars episode viii the last jedi,star wars die letzten jedi,star wars los ultimos jedi,star wars episode viii les derniers jedi,star wars episode viii the last jedi,star wars gli ultimi jedi,star wars episode viii the last jedi,gwiezdne wojny czesc viii ostatni jedi,star wars bolum viii son jedi,
175405,tt7810706,That movie Is Not Star Wars. Episode VII. Last...,That movie Is Not Star Wars. Episode VII. Last...,2017,131,55.0,6.1,1.0,Drama,,...,,,,,,,,,,


In [708]:
imdb_aka_df[imdb_aka_df["primary_title_merge"].str.contains("star wars episode vii  ")]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
114604,tt2488496,Star Wars: Episode VII - The Force Awakens,Star Wars: Episode VII - The Force Awakens,2015,138,983277.0,7.8,3.0,Action,Adventure,...,star wars the force awakens,star wars episode vii das erwachen der macht,star wars el despertar de la fuerza,star wars episode vii le reveil de la force,star wars episode vii the force awakens,star wars il risveglio della forza,star wars episode vii the force awakens,gwiezdne wojny czesc vii przebudzenie mocy,star wars bolum vii guc uyaniyor,


In [709]:
imdb_aka_df[imdb_aka_df["primary_title_merge"].str.contains("  ")]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
2,tt0035423,Kate & Leopold,Kate & Leopold,2001,118,89944.0,6.4,3.0,Comedy,Fantasy,...,kate et leopold,kate und leopold,la kate i en leopold,kate et leopold,kate leopold,kate and leopold,,kate i leopold,buyulu cift,
53,tt0113031,Farmer & Chase,Farmer & Chase,1997,97,85.0,5.0,1.0,Comedy,,...,,ein heisses trio,,,,,,,,
60,tt0113646,Lewis & Clark & George,Lewis & Clark & George,1997,82,742.0,5.3,3.0,Comedy,Crime,...,lewis clark george,joyride die highwaykiller,,,lewis clark george,,,,,
81,tt0115447,Il tocco - La sfida,Il tocco - La sfida,1997,107,48.0,5.8,3.0,Action,Crime,...,,der todesstoss,a tres bandas,,,il tocco la sfida,,,,
86,tt0115512,All's Fair in Love & War,All's Fair in Love & War,1997,119,33.0,3.1,2.0,Drama,Thriller,...,,fair play,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188051,tt9893130,"2025: Blood, White & Blue","2025: Blood, White & Blue",2022,135,228.0,2.2,3.0,Action,Comedy,...,2025 blood white blue,,2025 la purga anual,,2025 blood white blue,2025 blood white blue,2025 blood white blue,2025 blood white blue,,
188058,tt9894394,Upin & Ipin: Keris Siamang Tunggal,Upin & Ipin: Keris siamang tunggal,2019,100,855.0,7.3,3.0,Adventure,Animation,...,,,,,upin ipin the lone gibbon kris,,,,,
188077,tt9898858,Coffee & Kareem,Coffee & Kareem,2020,88,14704.0,5.2,3.0,Action,Comedy,...,coffee et kareem,coffee kareem,coffee kareem,coffee kareem,coffee kareem,coffee kareem,,coffee i kareem,,
188106,tt9904530,Scream Returns - Fan Film Spin-Off,Scream Returns,2018,48,83.0,5.7,2.0,Horror,Thriller,...,scream returns,,,scream returns,scream returns,,,,,


Hypothesis: double spaces cause the mismatches.

Check how often they occur in each dataframe.

In [710]:
display(imdb_aka_df[imdb_aka_df["original_title_merge"].str.contains("  ")].shape)
display(imdb_aka_df[imdb_aka_df["primary_title_merge"].str.contains("  ")].shape)
display(eu_df[eu_df["title_merge"].str.contains("  ")].shape)
na_df[na_df["title_merge"].str.contains("  ")].shape

(4809, 37)

(4517, 37)

(157, 6)

(53, 7)

Replace all double spaces with single spaces (this takes two passes).

In [711]:
imdb_aka_df.loc[:,"original_title_merge"] = imdb_aka_df["original_title_merge"].str.replace("  ", " ")
imdb_aka_df.loc[:,"primary_title_merge"] = imdb_aka_df["primary_title_merge"].str.replace("  ", " ")
eu_df.loc[:,"title_merge"] = eu_df["title_merge"].str.replace("  ", " ")
na_df.loc[:,"title_merge"] = na_df["title_merge"].str.replace("  ", " ")

check again

In [712]:
display(imdb_aka_df[imdb_aka_df["original_title_merge"].str.contains("  ")].shape)
display(imdb_aka_df[imdb_aka_df["primary_title_merge"].str.contains("  ")].shape)
display(eu_df[eu_df["title_merge"].str.contains("  ")].shape)
na_df[na_df["title_merge"].str.contains("  ")].shape

(3, 37)

(1, 37)

(2, 6)

(0, 7)

Replace once more.

In [713]:
# replace IMDB primary, original, EU and NA
imdb_aka_df.loc[:,"original_title_merge"] = imdb_aka_df["original_title_merge"].str.replace("  ", " ")
imdb_aka_df.loc[:,"primary_title_merge"] = imdb_aka_df["primary_title_merge"].str.replace("  ", " ")
eu_df.loc[:,"title_merge"] = eu_df["title_merge"].str.replace("  ", " ")
na_df.loc[:,"title_merge"] = na_df["title_merge"].str.replace("  ", " ")

check again

In [714]:
display(imdb_aka_df[imdb_aka_df["original_title_merge"].str.contains("  ")].shape)
display(imdb_aka_df[imdb_aka_df["primary_title_merge"].str.contains("  ")].shape)
display(eu_df[eu_df["title_merge"].str.contains("  ")].shape)
na_df[na_df["title_merge"].str.contains("  ")].shape

(0, 37)

(0, 37)

(0, 6)

(0, 7)
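
A second pass was needed because a literal replacement of two spaces with one only handles non-overlapping pairs, so a run of three spaces still leaves two behind. As a sketch, a regular expression can collapse runs of any length in one pass. The helper name `collapse_spaces` is an assumption for illustration; the same pattern should also work with `Series.str.replace(..., regex=True)`.

```python
import re

def collapse_spaces(title: str) -> str:
    """Collapse any run of two or more spaces into a single space."""
    return re.sub(r" {2,}", " ", title)

# one literal pass leaves a double space behind for a run of three
print(repr("a   b".replace("  ", " ")))
# the regex version handles any run length at once
print(repr(collapse_spaces("a     b")))
```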

Do the same for the aka columns.

In [715]:
for column in imdb_aka_df.iloc[:,-12:-2]:
    display(imdb_aka_df[column].str.contains("  ").sum())

0

0

741

7110

338

1452

1489

3479

312

474

In [716]:
for column in imdb_aka_df.iloc[:,-12:-2]:
    imdb_aka_df.loc[:,column] = imdb_aka_df[column].str.replace("  ", " ")
    imdb_aka_df.loc[:,column] = imdb_aka_df[column].str.replace("  ", " ")

In [717]:
for column in imdb_aka_df.iloc[:,-12:-2]:
    display(imdb_aka_df[column].str.contains("  ").sum())

0

0

0

0

0

0

0

0

0

0

In [718]:
# check again how well the merge works now

In [719]:
na_check = ultimate_merge_func(na_df, imdb_aka_df, number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
19,2 For the Money,2005-10-07,Universal,22991379,3586798,2005.0,2 for the money,,,,...,,,,,,,,,,
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013.0,2013 oscar shorts,,,,...,,,,,,,,,,
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014.0,2014 oscar shorts,,,,...,,,,,,,,,,
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015.0,2015 oscar shorts,,,,...,,,,,,,,,,
29,2017 Oscar Shorts,2017-02-10,ShortsHD,2835355,316093,2017.0,2017 oscar shorts,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3956,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,y tu mama tambien and you,,,,...,,,,,,,,,,
3969,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,youre next,,,,...,,,,,,,,,,
3975,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yours mine and ours,,,,...,,,,,,,,,,
3977,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

In [720]:
na_check[na_check["tconst"].isnull()].sort_values(by="tickets_sold", ascending=False).head(8)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
1046,Fast and Furious 6,2013-05-24,Universal,238679850,29357915,2013.0,fast and furious 6,,,,...,,,,,,,,,,
896,Dr. Seuss' The Lorax,2012-03-02,Universal,214030500,26888253,2012.0,dr seuss the lorax,,,,...,,,,,,,,,,
1963,Mission: Impossible—Ghost P…,2011-12-21,Paramount Pictures,209397903,26373370,2011.0,mission impossibleghost p,,,,...,,,,,,,,,,
1962,Mission: Impossible—Fallout,2018-07-27,Paramount Pictures,220159104,24166751,2018.0,mission impossiblefallout,,,,...,,,,,,,,,,
1964,Mission: Impossible—Rogue N…,2015-07-31,Paramount Pictures,195042377,23136699,2015.0,mission impossiblerogue n,,,,...,,,,,,,,,,
1043,Fast & Furious Presents: Ho…,2019-08-02,Universal,173956935,18990932,2019.0,fast furious presents ho,,,,...,,,,,,,,,,
1583,John Wick: Chapter 3 — Para…,2019-05-17,Lionsgate,171015687,18669834,2019.0,john wick chapter 3 para,,,,...,,,,,,,,,,
2717,Spy Kids 3-D: Game Over,2003-07-25,Miramax/Dimension,111678621,18520500,2003.0,spy kids 3d game over,,,,...,,,,,,,,,,


Correcting double spaces to single spaces left 252 unmatched rows (down from 257).
That is an improvement of just 5.
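
The unmatched counts quoted in this notebook can be reproduced with a small helper. This is a sketch that assumes, as in the checks above, that a row is unmatched when its `tconst` is null. The name `count_unmatched` and the toy frame are illustrative only.

```python
import pandas as pd

def count_unmatched(merged: pd.DataFrame, id_column: str = "tconst") -> int:
    """Count the rows that did not receive an IMDb id in the merge."""
    return int(merged[id_column].isnull().sum())

# toy frame, not the real merge result
demo = pd.DataFrame({
    "title": ["Fast and Furious 6", "Men in Black II", "Everest"],
    "tconst": [None, "tt0120912", None],
})
print(count_unmatched(demo))
```

Calling `count_unmatched(na_check)` after each merge gives a single number to compare between cleaning steps.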

### Let's try and get rid of all spaces to test how that goes 

In [721]:
na_check = ultimate_merge_func(na_df, imdb_aka_df, number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
19,2 For the Money,2005-10-07,Universal,22991379,3586798,2005.0,2 for the money,,,,...,,,,,,,,,,
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013.0,2013 oscar shorts,,,,...,,,,,,,,,,
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014.0,2014 oscar shorts,,,,...,,,,,,,,,,
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015.0,2015 oscar shorts,,,,...,,,,,,,,,,
29,2017 Oscar Shorts,2017-02-10,ShortsHD,2835355,316093,2017.0,2017 oscar shorts,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3956,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,y tu mama tambien and you,,,,...,,,,,,,,,,
3969,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,youre next,,,,...,,,,,,,,,,
3975,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yours mine and ours,,,,...,,,,,,,,,,
3977,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

### Add space removal to the ultimate function

In [722]:
def ultimate_merge_func(data, imdb_base, number_of_columns=6, short=False):
    '''
    Merges a ticket sales dataframe with the IMDb data on each title column, for year and year +/- 1.

    Input:
        data ... ticket data, either eu or na depending on what we are testing
        imdb_base ... IMDb dataframe including the aka title columns (e.g. imdb_aka_df)
        number_of_columns = 6 ... how many columns the df with the ticket data has
        short = False ... if True, shorten the titles to merge on to 18 characters

    Output:
        Returns: dataframe with all matches from all title columns
        Displays: unmatched rows and the internal counter
    '''
    ticket_data = data.copy()
    base_data = imdb_base.copy()

    list_titles = ['original_title_merge', 'primary_title_merge','CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NL', 'PL', 'TR', "ALTER"]
    counter = 0

    # remove all spaces from the titles
    # imdb
    # NOTE: this slice covers only 10 of the 12 columns in list_titles;
    # the last two (TR, ALTER) keep their spaces. Looping over list_titles
    # instead would cover all of them.
    for column in base_data.iloc[:,-12:-2]:
        base_data.loc[:,column] = base_data[column].str.replace(" ", "")
    # ticket data
    ticket_data.loc[:,"title_merge"] = ticket_data["title_merge"].str.replace(" ", "")

    # add year plus one and year minus one as extra columns
    base_data["year_plus"] = base_data["year"] + 1
    base_data["year_minus"] = base_data["year"] + -1

    list_years = ["year_minus", "year", "year_plus"]

    if short == True:
        for title in list_titles:
            base_data.loc[:,title] = base_data[title].str[:18]
        ticket_data.loc[:,"title_merge"] = ticket_data["title_merge"].str[:18]

    for title in list_titles:
        if counter == 0:
            check_df = pd.merge(ticket_data,base_data, how="left", left_on=["title_merge", "year"], right_on=[title, "year"])

            
            # set-up third df for further calculations
            check_mask = check_df["tconst"].isnull()
            third_merge_df = check_df[check_mask].copy()

            # drop columns from first merge
            third_merge_df.drop(columns = third_merge_df.iloc[:,number_of_columns:], inplace=True)
            
            # 3_plus merge (inner)
            third_merge_plus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, "year_plus"])
            # drop columns from first merge table that matched on 3_plus merge (inner)
            third_merge_plus_mask = check_df["title"].isin(list(third_merge_plus_df["title"]))
            check_df.drop(check_df[third_merge_plus_mask].index, inplace=True)
            # add fitting rows from 3_plus merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_plus_df])
            check_df.reset_index(drop=True, inplace=True)
            check_df.drop(columns = "year_x", inplace=True)

            # 3_minus merge (inner)
            third_merge_minus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, "year_minus"])
            # drop columns from first merge table that matched on 3_minus merge (inner)
            third_merge_minus_mask = check_df["title"].isin(list(third_merge_minus_df["title"]))
            check_df.drop(check_df[third_merge_minus_mask].index, inplace=True)
            # add fitting rows from 3_minus merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_minus_df])
            check_df.reset_index(drop=True, inplace=True)
            check_df.drop(columns = "year_x", inplace=True)

            counter += 1

        for year in list_years:
            # set-up third df for further calculations
            check_mask = check_df["tconst"].isnull()
            third_merge_df = check_df[check_mask].copy()

            # drop columns from first merge
            third_merge_df.drop(columns = third_merge_df.iloc[:,number_of_columns:], inplace=True)
            
            # merge again
            third_merge_plus_df = third_merge_df.merge(base_data, how="inner", left_on=["title_merge", "year"], right_on=[title, year])
            # drop columns from first merge table that matched on 3_plus merge (inner)
            third_merge_plus_mask = check_df["title"].isin(list(third_merge_plus_df["title"]))
            check_df.drop(check_df[third_merge_plus_mask].index, inplace=True)
            # add fitting rows from 3_plus merge (inner) to first table
            check_df = pd.concat([check_df, third_merge_plus_df])
            check_df.reset_index(drop=True, inplace=True)
        counter += 1

    # drop new year columns from final table
    check_df.drop(columns = check_df.iloc[:,-4:], inplace=True)

    # show unmatched rows
    check_mask = check_df["tconst"].isnull()
    display(check_df[check_mask])
    display(counter)
    
    return check_df
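
The core idea of the function, matching on title with a tolerance of one year, can be illustrated without pandas. The following is a deliberately simplified sketch and not the implementation above: the dictionary index and the name `match_with_year_tolerance` are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

def match_with_year_tolerance(
    title: str,
    year: int,
    index: Dict[Tuple[str, int], str],
) -> Optional[str]:
    """Look up (title, year) first, then fall back to the neighbouring years."""
    for candidate_year in (year, year - 1, year + 1):
        tconst = index.get((title, candidate_year))
        if tconst is not None:
            return tconst
    return None

# toy index keyed by (space-free title, IMDb year)
index = {("jurassicparkthree", 2001): "tt0163025"}

print(match_with_year_tolerance("jurassicparkthree", 2002, index))
print(match_with_year_tolerance("jurassicparkthree", 2003, index))
```

Trying the exact year first mirrors the order used in the function, so a neighbouring-year match is only used when no exact match exists.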

In [723]:
na_check = ultimate_merge_func(na_df, imdb_aka_df, number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
19,2 For the Money,2005-10-07,Universal,22991379,3586798,2005.0,2forthemoney,,,,...,,,,,,,,,,
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013.0,2013oscarshorts,,,,...,,,,,,,,,,
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014.0,2014oscarshorts,,,,...,,,,,,,,,,
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015.0,2015oscarshorts,,,,...,,,,,,,,,,
29,2017 Oscar Shorts,2017-02-10,ShortsHD,2835355,316093,2017.0,2017oscarshorts,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3971,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,ytumamatambienandy,,,,...,,,,,,,,,,
3984,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,yourenext,,,,...,,,,,,,,,,
3990,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yoursmineandours,,,,...,,,,,,,,,,
3992,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

In [724]:
na_check[na_check["tconst"].isnull()].sort_values(by="tickets_sold", ascending=False).head(10)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
1629,Jurassic Park 3,2001-07-18,Universal,181166115,32008147,2001.0,jurassicpark3,,,,...,,,,,,,,,,
1050,Fast and Furious 6,2013-05-24,Universal,238679850,29357915,2013.0,fastandfurious6,,,,...,,,,,,,,,,
899,Dr. Seuss' The Lorax,2012-03-02,Universal,214030500,26888253,2012.0,drseussthelorax,,,,...,,,,,,,,,,
2729,Spy Kids 3-D: Game Over,2003-07-25,Miramax/Dimension,111678621,18520500,2003.0,spykids3dgameover,,,,...,,,,,,,,,,
851,Disney’s A Christmas Carol,2009-11-06,Walt Disney,137443917,18325856,2009.0,disneysachristmasc,,,,...,,,,,,,,,,
3467,The Road to Perdition,2002-07-12,Dreamworks SKG,104054514,17909554,2002.0,theroadtoperdition,,,,...,,,,,,,,,,
3630,The X Files: Fight the Future,1998-06-19,20th Century Fox,83898313,17888766,1998.0,thexfilesfightthef,,,,...,,,,,,,,,,
1000,Everest,1998-03-06,MacGillivray Free…,84941548,17815503,1998.0,everest,,,,...,,,,,,,,,,
900,Dr. Seuss’ The Cat in the H…,2003-11-21,Universal,99383495,16481508,2003.0,drseussthecatinthe,,,,...,,,,,,,,,,
3037,The Divergent Serires: Insu…,2015-03-20,Lionsgate,130179072,15442357,2015.0,thedivergentserire,,,,...,,,,,,,,,,


Removing all spaces reduced the unmatched rows from 257 to 221 (the titles are now cut down to 18 characters, since there are no more spaces).

### Let's check next thing for NA (numbers)

In [725]:
imdb_aka_df[imdb_aka_df["tconst"] == "tt0120912"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
1683,tt0120912,Men in Black II,Men in Black II,2002,88,409331.0,6.2,3.0,Action,Adventure,...,hommes en noir ii,men in black 2,homes de negre 2,men in black 2,men in black ii,men in black ii,men in black ii,faceci w czerni 2,siyah giyen adamlar 2,


Explore functions to convert numbers into words and Roman numerals into numbers.

In [726]:
from num2words import num2words

In [727]:
# roman_dict = {" ii ": "two",
#               " iii ": "three",
#               " iv ": "four",
#               " v ": "five",
#               " vi ": "six",
#               " vii ": "seven",
#               " viii ": "eight",
#               " ix ": "nine",
#               " x ": "ten"}

In [728]:
testing_dict = {r'(^ii | ii | ii$)': "two",
                r'(^iii | iii | iii$)': "three",
                r'(^iv | iv | iv$)': "four",
                r'(^v | v | v$)': "five",
                r'(^vi | vi | vi$)': "six",
                r'(^vii | vii | vii$)': "seven",
                r'(^viii | viii | viii$)': "eight",
                r'(^ix | ix | ix$)': "nine",
                r'(^x | x | x$)': "ten",}

In [729]:
testing_dict.keys()

dict_keys(['(^ii | ii | ii$)', '(^iii | iii | iii$)', '(^iv | iv | iv$)', '(^v | v | v$)', '(^vi | vi | vi$)', '(^vii | vii | vii$)', '(^viii | viii | viii$)', '(^ix | ix | ix$)', '(^x | x | x$)'])

Convert the Roman numerals from II to X into words.

In [730]:
imdb_aka_df[imdb_aka_df["original_title"].str.contains("Jurassic Park")]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
890,tt0119567,The Lost World: Jurassic Park,The Lost World: Jurassic Park,1997,129,449430.0,6.6,3.0,Action,Adventure,...,the lost world jurassic park,vergessene welt jurassic park,el mundo perdido jurassic park,le monde perdu,the lost world jurassic park,il mondo perduto jurassic park,the lost world jurassic park,park jurajski ii,jurassic park 2 kayip dunya,jurassic park ii
3807,tt0163025,Jurassic Park III,Jurassic Park III,2001,92,344970.0,5.9,3.0,Action,Adventure,...,le parc jurassique iii,jurassic park iii,jurassic park iii parque jurasico iii,jurassic park iii,jurassic park iii,jurassic park iii,jurassic park iii,jurassic park iii,jurassic park 3,jurassic park 3
141839,tt4130956,Jurassic Park: Operation Rebirth,Jurassic Park: Operation Rebirth,2014,70,106.0,6.7,1.0,Thriller,,...,jurassic park operation rebirth,,,,jurassic park operation rebirth,,,,,


In [731]:
list_titles = ['original_title_merge', 'primary_title_merge', 'CA','DE', 'ES', 'FR', 'GB', 'IT', 'NL', 'PL', 'TR', 'ALTER']

for title in list_titles:
    for key, value in testing_dict.items():
        imdb_aka_df.loc[:,title] = imdb_aka_df[title].str.replace(key, value, regex=True)

for key, value in testing_dict.items():
        eu_df.loc[:,"title_merge"] = eu_df["title_merge"].str.replace(key, value, regex=True)

for key, value in testing_dict.items():
        na_df.loc[:,"title_merge"] = na_df["title_merge"].str.replace(key, value, regex=True)

In [732]:
imdb_aka_df[imdb_aka_df["original_title"].str.contains("Jurassic Park")]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
890,tt0119567,The Lost World: Jurassic Park,The Lost World: Jurassic Park,1997,129,449430.0,6.6,3.0,Action,Adventure,...,the lost world jurassic park,vergessene welt jurassic park,el mundo perdido jurassic park,le monde perdu,the lost world jurassic park,il mondo perduto jurassic park,the lost world jurassic park,park jurajskitwo,jurassic park 2 kayip dunya,jurassic parktwo
3807,tt0163025,Jurassic Park III,Jurassic Park III,2001,92,344970.0,5.9,3.0,Action,Adventure,...,le parc jurassiquethree,jurassic parkthree,jurassic parkthreeparque jurasicothree,jurassic parkthree,jurassic parkthree,jurassic parkthree,jurassic parkthree,jurassic parkthree,jurassic park 3,jurassic park 3
141839,tt4130956,Jurassic Park: Operation Rebirth,Jurassic Park: Operation Rebirth,2014,70,106.0,6.7,1.0,Thriller,,...,jurassic park operation rebirth,,,,jurassic park operation rebirth,,,,,
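
The output above shows that the patterns also swallow the surrounding spaces ("jurassic parkthree"). That is harmless once all spaces are stripped before merging, but a word-boundary pattern would keep the titles readable. This is a sketch of that alternative; `ROMAN_WORDS` and `roman_to_words` are hypothetical names and are not used elsewhere in this notebook.

```python
import re

ROMAN_WORDS = {
    "ii": "two", "iii": "three", "iv": "four", "v": "five", "vi": "six",
    "vii": "seven", "viii": "eight", "ix": "nine", "x": "ten",
}

# \b matches at word boundaries without consuming the neighbouring spaces
ROMAN_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(ROMAN_WORDS, key=len, reverse=True)) + r")\b"
)

def roman_to_words(title: str) -> str:
    """Replace standalone Roman numerals II to X with English words."""
    return ROMAN_PATTERN.sub(lambda m: ROMAN_WORDS[m.group(1)], title)

print(roman_to_words("jurassic park iii parque jurasico iii"))
```

Note that standalone letters are still converted, so "the x files" becomes "the ten files". As long as the same conversion is applied to both sides of the merge, this does not prevent a match.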


Test how to convert numbers

In [733]:
imdb_aka_df["test_primary"] = imdb_aka_df["primary_title_merge"].str.replace(r'\d', lambda x: num2words(int(x.group())), regex=True)
imdb_aka_df["test_primary"]

0                               istoriya grazhdanskoy voyny
1                                    la tierra de los toros
2                                              kate leopold
3         the tango of the widower and its distorting mi...
4                                the other side of the wind
                                ...                        
188158                                                coven
188159                                  the secret of china
188160                                  kuambil lagi hatiku
188161                                      dankyavar danka
188162                                             six gunn
Name: test_primary, Length: 188163, dtype: object

In [734]:
imdb_aka_df.drop(columns="test_primary", inplace=True)

Convert all numbers into words (digit by digit).

In [735]:
list_titles = ['original_title_merge', 'primary_title_merge','CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NL', 'PL', 'TR', 'ALTER']

for title in list_titles:
    imdb_aka_df.loc[:,title] = imdb_aka_df[title].str.replace(r'\d', lambda x: num2words(int(x.group())), regex=True)

eu_df.loc[:,"title_merge"] = eu_df["title_merge"].str.replace(r'\d', lambda x: num2words(int(x.group())), regex=True)

na_df.loc[:,"title_merge"] = na_df["title_merge"].str.replace(r'\d', lambda x: num2words(int(x.group())), regex=True)
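
Because the pattern is `\d`, each digit is converted on its own, so "2013" becomes "twozeroonethree" rather than a spelled-out year. The sketch below reproduces that behaviour with a plain lookup list standing in for `num2words`, so it runs without the extra package. `DIGIT_WORDS` and `digits_to_words` are hypothetical names.

```python
import re

# stand-in for num2words, which is only called on single digits here
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def digits_to_words(title: str) -> str:
    """Replace every single digit with its English word, digit by digit."""
    return re.sub(r"\d", lambda m: DIGIT_WORDS[int(m.group())], title)

print(digits_to_words("2013 oscar shorts"))
print(digits_to_words("fast and furious 6"))
```

Matching `\d+` instead would pass whole numbers to the converter and produce multi-word forms, which would then need the same space cleanup. Either choice works for the merge as long as it is applied identically to all title columns.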

### Uncomment the space removal at the top and run it down here, after changing numbers into words

In [736]:
na_check = ultimate_merge_func(na_df, imdb_aka_df, number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013.0,twozeroonethreeosc,,,,...,,,,,,,,,,
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014.0,twozeroonefourosca,,,,...,,,,,,,,,,
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015.0,twozeroonefiveosca,,,,...,,,,,,,,,,
44,21 and Over,2013-03-01,Relativity,25682380,3158964,2013.0,twooneandover,,,,...,,,,,,,,,,
67,63 Up,2019-11-27,BritBox,183940,20037,2019.0,sixthreeup,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3985,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,ytumamatambienandy,,,,...,,,,,,,,,,
3998,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,yourenext,,,,...,,,,,,,,,,
4004,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yoursmineandours,,,,...,,,,,,,,,,
4006,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

Replacing all numbers and Roman numerals with words reduced the number of unmatched rows to 210 (from 221).

### Next Check/ improvement ...

In [737]:
na_check = ultimate_merge_func(na_df, imdb_aka_df, number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013.0,twozeroonethreeosc,,,,...,,,,,,,,,,
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014.0,twozeroonefourosca,,,,...,,,,,,,,,,
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015.0,twozeroonefiveosca,,,,...,,,,,,,,,,
44,21 and Over,2013-03-01,Relativity,25682380,3158964,2013.0,twooneandover,,,,...,,,,,,,,,,
67,63 Up,2019-11-27,BritBox,183940,20037,2019.0,sixthreeup,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3985,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,ytumamatambienandy,,,,...,,,,,,,,,,
3998,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,yourenext,,,,...,,,,,,,,,,
4004,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yoursmineandours,,,,...,,,,,,,,,,
4006,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

In [738]:
eu_check_df = ultimate_merge_func(eu_df, imdb_aka_df)

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
295,Arthur et la guerre des deux mondes,FR,2010.0,3838378,3363498,arthuretlaguerredesdeuxmondes,,,,,...,,,,,,,,,,
298,Artificial Intelligence: AI,US,2001.0,8073605,8041431,artificialintelligenceai,,,,,...,,,,,,,,,,
322,Atatürk 1881 - 1919,TR,2023.0,1732649,1732649,ataturkoneeighteightoneonenineonenine,,,,,...,,,,,,,,,,
371,Bambi II,US,2006.0,4484756,4473494,bambitwo,,,,,...,,,,,,,,,,
394,Battlefield Earth: A Saga of the Year 3000,US,2000.0,874660,874546,battlefieldearthasagaoftheyearthreezerozerozero,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4002,Tinker Bell and the Pirate Fairy,US,2014.0,5489166,5472379,tinkerbellandthepiratefairy,,,,,...,,,,,,,,,,
4198,Wallace & Gromit in The Curse of the Were-Rabbit,"GBinc, US",2005.0,14014825,13251997,wallacegromitinthecurseofthewererabbit,,,,,...,,,,,,,,,,
4209,Warum Männer nicht zuhören und Frauen schlecht...,DE,2007.0,1452342,1068475,warummannernichtzuhorenundfrauenschlechtereinp...,,,,,...,,,,,,,,,,
4327,Zeny v behu,CZ,2019.0,1705959,1675569,zenyfivebehu,,,,,...,,,,,,,,,,


13

Check the weird Oscar films.

In [739]:
na_df[na_df["title"].str.contains("Oscar")]

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013,twozeroonethree oscar shorts
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014,twozeroonefour oscar shorts
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015,twozeroonefive oscar shorts
29,2017 Oscar Shorts,2017-02-10,ShortsHD,2835355,316093,2017,twozerooneseven oscar shorts
30,2018 Oscar Shorts,2018-02-09,ShortsHD,3508777,385156,2018,twozerooneeight oscar shorts
31,2019 Oscar Shorts,2019-02-08,ShortsHD,3531093,385490,2019,twozeroonenine oscar shorts
32,2020 Oscar Shorts,2020-01-29,ShortsHD,3304748,359994,2020,twozerotwozero oscar shorts
33,2021 Oscar Shorts,2021-04-02,ShortsHD,443050,43564,2021,twozerotwoone oscar shorts
34,2022 Oscar Shorts,2022-02-25,ShortsHD,1801646,171096,2022,twozerotwotwo oscar shorts
35,2023 Oscar Shorts,2023-02-17,ShortsHD,3023866,280507,2023,twozerotwothree oscar shorts


These are all Oscar-nominated short films that were released in cinemas ... let's drop them.

In [740]:
na_df[na_df["distributor"].str.contains("Shorts")]

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge
25,2013 Oscar Shorts,2013-02-01,Shorts International,2142342,263510,2013,twozeroonethree oscar shorts
26,2014 Oscar Shorts,2014-01-31,ShortsHD,2357890,288603,2014,twozeroonefour oscar shorts
27,2015 Oscar Shorts,2015-01-30,ShortsHD,2412493,286179,2015,twozeroonefive oscar shorts
29,2017 Oscar Shorts,2017-02-10,ShortsHD,2835355,316093,2017,twozerooneseven oscar shorts
30,2018 Oscar Shorts,2018-02-09,ShortsHD,3508777,385156,2018,twozerooneeight oscar shorts
31,2019 Oscar Shorts,2019-02-08,ShortsHD,3531093,385490,2019,twozeroonenine oscar shorts
32,2020 Oscar Shorts,2020-01-29,ShortsHD,3304748,359994,2020,twozerotwozero oscar shorts
33,2021 Oscar Shorts,2021-04-02,ShortsHD,443050,43564,2021,twozerotwoone oscar shorts
34,2022 Oscar Shorts,2022-02-25,ShortsHD,1801646,171096,2022,twozerotwotwo oscar shorts
35,2023 Oscar Shorts,2023-02-17,ShortsHD,3023866,280507,2023,twozerotwothree oscar shorts


In [741]:
na_df.drop(na_df[na_df["distributor"].str.contains("Shorts")].index, inplace=True)
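
An equivalent way to express this filter is a boolean mask that returns a new frame instead of modifying `na_df` in place. This is a sketch on a toy frame; the helper name `drop_by_distributor` is an assumption, and `na=False` is added so that rows with a missing distributor are kept rather than raising an error.

```python
import pandas as pd

def drop_by_distributor(df: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return a copy of df without rows whose distributor contains keyword."""
    # na=False treats a missing distributor as "does not contain keyword"
    mask = df["distributor"].str.contains(keyword, regex=False, na=False)
    return df[~mask].copy()

# toy frame, not the real ticket data
demo = pd.DataFrame({
    "title": ["2013 Oscar Shorts", "2014 Oscar Shorts", "Fast and Furious 6"],
    "distributor": ["Shorts International", "ShortsHD", "Universal"],
})
print(drop_by_distributor(demo, "Shorts")["title"].tolist())
```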

check for improvements 

In [742]:
na_check = ultimate_merge_func(na_df, imdb_aka_df, number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
31,21 and Over,2013-03-01,Relativity,25682380,3158964,2013.0,twooneandover,,,,...,,,,,,,,,,
54,63 Up,2019-11-27,BritBox,183940,20037,2019.0,sixthreeup,,,,...,,,,,,,,,,
69,A Common Thread,2002-11-29,Odeon Films,5058187,838836,2002.0,acommonthread,,,,...,,,,,,,,,,
107,A Rescue of Little Eggs,2021-08-27,Lionsgate,927154,91166,2021.0,arescueoflittleegg,,,,...,,,,,,,,,,
114,A Stir of Echoes,1999-09-10,Artisan,21133087,4160056,1999.0,astirofechoes,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3972,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,ytumamatambienandy,,,,...,,,,,,,,,,
3985,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,yourenext,,,,...,,,,,,,,,,
3991,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yoursmineandours,,,,...,,,,,,,,,,
3993,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

Improvement by 3 (instead of 10) but we TAKE IT. Now at 202 for NA

### Let's check next

In [743]:
na_check[na_check["tconst"].isnull()].sort_values(by="tickets_sold", ascending=False).head(10)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
1042,Fast and Furious 6,2013-05-24,Universal,238679850,29357915,2013.0,fastandfurioussix,,,,...,,,,,,,,,,
893,Dr. Seuss' The Lorax,2012-03-02,Universal,214030500,26888253,2012.0,drseussthelorax,,,,...,,,,,,,,,,
2729,Spy Kids 3-D: Game Over,2003-07-25,Miramax/Dimension,111678621,18520500,2003.0,spykidsthreedgameo,,,,...,,,,,,,,,,
845,Disney’s A Christmas Carol,2009-11-06,Walt Disney,137443917,18325856,2009.0,disneysachristmasc,,,,...,,,,,,,,,,
3468,The Road to Perdition,2002-07-12,Dreamworks SKG,104054514,17909554,2002.0,theroadtoperdition,,,,...,,,,,,,,,,
3631,The X Files: Fight the Future,1998-06-19,20th Century Fox,83898313,17888766,1998.0,thetenfilesfightth,,,,...,,,,,,,,,,
994,Everest,1998-03-06,MacGillivray Free…,84941548,17815503,1998.0,everest,,,,...,,,,,,,,,,
894,Dr. Seuss’ The Cat in the H…,2003-11-21,Universal,99383495,16481508,2003.0,drseussthecatinthe,,,,...,,,,,,,,,,
3038,The Divergent Serires: Insu…,2015-03-20,Lionsgate,130179072,15442357,2015.0,thedivergentserire,,,,...,,,,,,,,,,
2728,Spy Kids 2: The Island of L…,2002-08-07,Miramax/Dimension,85570368,14728118,2002.0,spykidstwotheislan,,,,...,,,,,,,,,,


In [744]:
imdb_aka_df[imdb_aka_df["tconst"] == "tt0163025"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
3807,tt0163025,Jurassic Park III,Jurassic Park III,2001,92,344970.0,5.9,3.0,Action,Adventure,...,le parc jurassiquethree,jurassic parkthree,jurassic parkthreeparque jurasicothree,jurassic parkthree,jurassic parkthree,jurassic parkthree,jurassic parkthree,jurassic parkthree,jurassic park three,jurassic park three


Fixed the Roman numeral translator further up in the notebook. It did not noticeably improve the match rate for the entire DataFrame, but it did fix Jurassic Park III.
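The actual Roman numeral translator lives further up in the notebook and is not reproduced here. The following is only a sketch of one way such a translator could work, with the mapping `ROMAN_WORDS` and the function name `translate_roman` being hypothetical. It deliberately skips the single-letter numerals "i", "v" and "x": the merge outputs in this section show `thetenfilesfightth` for "The X Files: Fight the Future" and `zenyfivebehu` for "Zeny v behu", which suggests that translating single letters may create new mismatches.

```python
import re

# Hypothetical mapping: multi-letter numerals only, since single letters
# such as "x" or "v" are often ordinary words or parts of titles.
ROMAN_WORDS = {
    "ii": "two", "iii": "three", "iv": "four", "vi": "six",
    "vii": "seven", "viii": "eight", "ix": "nine",
}

# Match a listed numeral only when it stands alone as a whole word.
ROMAN_RE = re.compile(
    r"\b(" + "|".join(sorted(ROMAN_WORDS, key=len, reverse=True)) + r")\b"
)

def translate_roman(title: str) -> str:
    """Lowercase the title and spell out standalone Roman numerals."""
    return ROMAN_RE.sub(lambda m: ROMAN_WORDS[m.group(1)], title.lower())

print(translate_roman("Jurassic Park III"))
print(translate_roman("The X Files: Fight the Future"))
```

Even this restricted mapping can misfire, for example on titles in languages where "vi" is a regular word, so results would still need a manual check.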

### Let's check the Dr. Seuss movies

In [745]:
imdb_aka_df[imdb_aka_df["tconst"] == "tt2709692"]

Unnamed: 0,tconst,primary_title,original_title,year,runtime,num_votes,average_rating,genres_count,genre,genre2,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
119303,tt2709692,The Grinch,The Grinch,2018,85,92948.0,6.4,3.0,Animation,Comedy,...,dr seuss the grinch,der grinch,el grinch,le grinch,the grinch,il grinch,de grinch,grinch,grinc,dr seuss the grinch


In [746]:
na_df[na_df["title"].str.lower().str.contains("grinch")]

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge
1005,Dr. Seuss’ The Grinch,2018-11-09,Universal,271841570,29837128,2018,dr seuss the grinch
1593,How the Grinch Stole Christmas,2000-11-17,Universal,261238755,48384918,2000,how the grinch stole christmas


In [747]:
na_df[na_df["title"].str.contains("Seuss")]

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge
1003,Dr. Seuss' The Lorax,2012-03-02,Universal,214030500,26888253,2012,dr seuss the lorax
1004,Dr. Seuss’ The Cat in the H…,2003-11-21,Universal,99383495,16481508,2003,dr seuss the cat in the h
1005,Dr. Seuss’ The Grinch,2018-11-09,Universal,271841570,29837128,2018,dr seuss the grinch


Remove "dr seuss" from `title_merge` for all NA titles (replace it with an empty string).

In [748]:
na_df.loc[:,"title_merge"] = na_df["title_merge"].str.replace("dr seuss", "")
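Removing the prefix this way leaves a leading space in `title_merge` (for example " the lorax"). The merge output further down suggests that spaces are stripped later in the pipeline, so this may not matter here. If it did, a variant along the following lines could tidy the whitespace in the same step. The frame `toy` is a made-up stand-in for `na_df`.

```python
import pandas as pd

# Made-up stand-in for na_df, restricted to the column being cleaned.
toy = pd.DataFrame(
    {"title_merge": ["dr seuss the lorax", "dr seuss the grinch", "everest"]}
)

toy["title_merge"] = (
    toy["title_merge"]
    .str.replace("dr seuss", "", regex=False)  # literal match, no regex needed
    .str.replace(r"\s+", " ", regex=True)      # collapse repeated whitespace
    .str.strip()                               # drop leading/trailing spaces
)
print(toy["title_merge"].tolist())
```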

Check improvements (expected: 3 fewer unmatched titles):

In [749]:
na_check = ultimate_merge_func(na_df, imdb_aka_df, number_of_columns=7, short=True)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
31,21 and Over,2013-03-01,Relativity,25682380,3158964,2013.0,twooneandover,,,,...,,,,,,,,,,
54,63 Up,2019-11-27,BritBox,183940,20037,2019.0,sixthreeup,,,,...,,,,,,,,,,
69,A Common Thread,2002-11-29,Odeon Films,5058187,838836,2002.0,acommonthread,,,,...,,,,,,,,,,
107,A Rescue of Little Eggs,2021-08-27,Lionsgate,927154,91166,2021.0,arescueoflittleegg,,,,...,,,,,,,,,,
114,A Stir of Echoes,1999-09-10,Artisan,21133087,4160056,1999.0,astirofechoes,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3973,Y Tu Mama Tambien (And Your…,2002-03-15,IFC Films,13649881,2349377,2002.0,ytumamatambienandy,,,,...,,,,,,,,,,
3986,You're Next,2013-08-23,Lionsgate,18494006,2274785,2013.0,yourenext,,,,...,,,,,,,,,,
3992,"Yours, Mine and Ours",2005-11-23,Paramount Pictures,50733384,7914724,2005.0,yoursmineandours,,,,...,,,,,,,,,,
3994,Yu-Gi-Oh,2004-08-13,Warner Bros.,19762690,3182397,2004.0,yugioh,,,,...,,,,,,,,,,


13

In [752]:
na_check[na_check["tconst"].isnull()].sort_values(by="tickets_sold", ascending=False).head(15)

Unnamed: 0,title,release_date,distributor,gross_sales,tickets_sold,year,title_merge,tconst,primary_title,original_title,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
1043,Fast and Furious 6,2013-05-24,Universal,238679850,29357915,2013.0,fastandfurioussix,,,,...,,,,,,,,,,
2730,Spy Kids 3-D: Game Over,2003-07-25,Miramax/Dimension,111678621,18520500,2003.0,spykidsthreedgameo,,,,...,,,,,,,,,,
845,Disney’s A Christmas Carol,2009-11-06,Walt Disney,137443917,18325856,2009.0,disneysachristmasc,,,,...,,,,,,,,,,
3469,The Road to Perdition,2002-07-12,Dreamworks SKG,104054514,17909554,2002.0,theroadtoperdition,,,,...,,,,,,,,,,
3632,The X Files: Fight the Future,1998-06-19,20th Century Fox,83898313,17888766,1998.0,thetenfilesfightth,,,,...,,,,,,,,,,
995,Everest,1998-03-06,MacGillivray Free…,84941548,17815503,1998.0,everest,,,,...,,,,,,,,,,
894,Dr. Seuss’ The Cat in the H…,2003-11-21,Universal,99383495,16481508,2003.0,thecatintheh,,,,...,,,,,,,,,,
3039,The Divergent Serires: Insu…,2015-03-20,Lionsgate,130179072,15442357,2015.0,thedivergentserire,,,,...,,,,,,,,,,
2729,Spy Kids 2: The Island of L…,2002-08-07,Miramax/Dimension,85570368,14728118,2002.0,spykidstwotheislan,,,,...,,,,,,,,,,
1730,Lee Daniels' The Butler,2013-08-16,Weinstein Co.,116293662,14304263,2013.0,leedanielsthebutle,,,,...,,,,,,,,,,


In [750]:
eu_check_df = ultimate_merge_func(eu_df, imdb_aka_df)

Unnamed: 0,title,producing_country,year,tickets_sold_since_1996,tickets_sold,title_merge,tconst,primary_title,original_title,runtime,...,CA,DE,ES,FR,GB,IT,NL,PL,TR,ALTER
295,Arthur et la guerre des deux mondes,FR,2010.0,3838378,3363498,arthuretlaguerredesdeuxmondes,,,,,...,,,,,,,,,,
298,Artificial Intelligence: AI,US,2001.0,8073605,8041431,artificialintelligenceai,,,,,...,,,,,,,,,,
322,Atatürk 1881 - 1919,TR,2023.0,1732649,1732649,ataturkoneeighteightoneonenineonenine,,,,,...,,,,,,,,,,
371,Bambi II,US,2006.0,4484756,4473494,bambitwo,,,,,...,,,,,,,,,,
394,Battlefield Earth: A Saga of the Year 3000,US,2000.0,874660,874546,battlefieldearthasagaoftheyearthreezerozerozero,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4002,Tinker Bell and the Pirate Fairy,US,2014.0,5489166,5472379,tinkerbellandthepiratefairy,,,,,...,,,,,,,,,,
4198,Wallace & Gromit in The Curse of the Were-Rabbit,"GBinc, US",2005.0,14014825,13251997,wallacegromitinthecurseofthewererabbit,,,,,...,,,,,,,,,,
4209,Warum Männer nicht zuhören und Frauen schlecht...,DE,2007.0,1452342,1068475,warummannernichtzuhorenundfrauenschlechtereinp...,,,,,...,,,,,,,,,,
4327,Zeny v behu,CZ,2019.0,1705959,1675569,zenyfivebehu,,,,,...,,,,,,,,,,


13

### OK, let's add two more columns to the AKA dataframe: US-Alternative and CA