# Phase 1 Project Office Hours Notebook!

In [1]:
import pandas as pd

## Reading in our data

In [2]:
!ls zippedData/

bom.movie_gross.csv.gz       imdb.title.ratings.csv.gz
imdb.name.basics.csv.gz      rt.movie_info.tsv.gz
imdb.title.akas.csv.gz       rt.reviews.tsv.gz
imdb.title.basics.csv.gz     tmdb.movies.csv.gz
imdb.title.crew.csv.gz       tn.movie_budgets.csv.gz
imdb.title.principals.csv.gz


In [4]:
# box office mojo data
bom = pd.read_csv("zippedData/bom.movie_gross.csv.gz")

In [5]:
bom.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [2]:
# main imdb movie data
imdb = pd.read_csv("zippedData/imdb.title.basics.csv.gz")

In [3]:
imdb.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [3]:
# name details from imdb for crew
imdb_name = pd.read_csv("zippedData/imdb.name.basics.csv.gz")

In [4]:
imdb_name.head()

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


In [26]:
# rotten tomatoes reviews data - tab-separated, and needs an explicit encoding
rt_reviews = pd.read_csv("zippedData/rt.reviews.tsv.gz", sep="\t", encoding="iso8859-1")

In [25]:
rt_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [31]:
rt_reviews['review'].unique()

array(["A distinctly gallows take on contemporary financial mores, as one absurdly rich man's limo ride across town for a haircut functions as a state-of-the-nation discourse. ",
       "It's an allegory in search of a meaning that never arrives...It's just old-fashioned bad storytelling.",
       '... life lived in a bubble in financial dealings and digital communications and brief face-to-face conversations and sexual intermissions in a space shuttle of a limousine creeping through the gridlock of an anonymous New York City.',
       ...,
       "Despite Besson's high-profile name being Wasabi's big selling point, there is no doubt that Krawczyk deserves a huge amount of the credit for the film's thoroughly winning tone.",
       'The film lapses too often into sugary sentiment and withholds delivery on the pell-mell pyrotechnics its punchy style promises.',
       'The real charm of this trifle is the deadpan comic face of its star, Jean Reno, who resembles Sly Stallone in a hot sak

In [30]:
rt_reviews['review'][0]

"A distinctly gallows take on contemporary financial mores, as one absurdly rich man's limo ride across town for a haircut functions as a state-of-the-nation discourse. "

## Difficult merges!

The `bom` and `imdb` datasets don't share a unique identifier to match on. How can we think about merging these two dataframes?

### Using Movie Titles

In [36]:
bom['title'].duplicated().sum()

1

In [37]:
imdb['primary_title'].duplicated().sum()

10073
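
Note that `duplicated()` only flags repeats after the first occurrence, so the sums above count extra copies rather than the number of titles affected. A minimal sketch of the difference, using a made-up `toy_titles` Series (hypothetical, not taken from the datasets):

```python
import pandas as pd

# hypothetical toy data, not taken from the real files
toy_titles = pd.Series(["Cinderella", "Adam", "Cinderella", "Cinderella", "Prey"])

# default keep='first': only the 2nd and 3rd "Cinderella" are flagged
extra_copies = toy_titles.duplicated().sum()

# keep=False flags every row whose title appears more than once
rows_involved = toy_titles.duplicated(keep=False).sum()

# number of distinct titles that repeat at least once
repeated_titles = toy_titles[toy_titles.duplicated()].nunique()

print(extra_copies, rows_involved, repeated_titles)
```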

In [44]:
bom['title']

0                                       Toy Story 3
1                        Alice in Wonderland (2010)
2       Harry Potter and the Deathly Hallows Part 1
3                                         Inception
4                               Shrek Forever After
                           ...                     
3382                                      The Quake
3383                    Edward II (2018 re-release)
3384                                       El Pacto
3385                                       The Swan
3386                              An Actor Prepares
Name: title, Length: 3387, dtype: object

In [6]:
bom.loc[bom['title'].str.contains("Star Wars")]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
793,Star Wars: Episode I - The Phantom Menace (in 3D),Fox,43500000.0,59300000.0,2012
1872,Star Wars: The Force Awakens,BV,936700000.0,1131.6,2015
2323,Rogue One: A Star Wars Story,BV,532200000.0,523900000.0,2016
2758,Star Wars: The Last Jedi,BV,620200000.0,712400000.0,2017
3101,Solo: A Star Wars Story,BV,213800000.0,179200000.0,2018


In [50]:
bom['title'].str.split(": ")

0                                       [Toy Story 3]
1                        [Alice in Wonderland (2010)]
2       [Harry Potter and the Deathly Hallows Part 1]
3                                         [Inception]
4                               [Shrek Forever After]
                            ...                      
3382                                      [The Quake]
3383                    [Edward II (2018 re-release)]
3384                                       [El Pacto]
3385                                       [The Swan]
3386                              [An Actor Prepares]
Name: title, Length: 3387, dtype: object

In [49]:
imdb.loc[imdb['primary_title'].str.contains("Star Wars")]

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
2370,tt10239898,Star Wars: Battle for the Holocrons,Star Wars: Battle for the Holocrons,2020,,"Action,Adventure,Fantasy"
2947,tt10300394,Untitled Star Wars Film,Untitled Star Wars Film,2022,,
2948,tt10300396,Untitled Star Wars Film,Untitled Star Wars Film,2024,,
2949,tt10300398,Untitled Star Wars Film,Untitled Star Wars Film,2026,,Fantasy
3219,tt10321138,RiffTrax: Star Wars: The Force Awakens,RiffTrax: Star Wars: The Force Awakens,2016,,Comedy
34425,tt2275656,Star Wars: Threads of Destiny,Star Wars: Threads of Destiny,2014,110.0,"Action,Adventure,Sci-Fi"
41443,tt2488496,Star Wars: Episode VII - The Force Awakens,Star Wars: Episode VII - The Force Awakens,2015,136.0,"Action,Adventure,Fantasy"
42223,tt2527336,Star Wars: The Last Jedi,Star Wars: Episode VIII - The Last Jedi,2017,152.0,"Action,Adventure,Fantasy"
42224,tt2527338,Star Wars: The Rise of Skywalker,Star Wars: The Rise of Skywalker,2019,,"Action,Adventure,Fantasy"
63494,tt3648510,Plastic Galaxy: The Story of Star Wars Toys,Plastic Galaxy: The Story of Star Wars Toys,2014,70.0,"Documentary,History,Sci-Fi"
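
The two sources clearly format the same title differently (parenthetical notes, colons, episode numbers). One possible first step is to normalize titles before comparing them. The sketch below is only an illustration: `normalize_title` is a hypothetical helper, and the rules it applies (drop parenthetical notes, drop punctuation, collapse whitespace) are assumptions that may be too aggressive for some titles.

```python
import re

def normalize_title(title: str) -> str:
    """Hypothetical helper: lowercase, drop parenthetical notes and punctuation."""
    title = title.lower()
    title = re.sub(r"\([^)]*\)", " ", title)  # drop "(2010)", "(in 3D)", ...
    title = re.sub(r"[^\w\s]", " ", title)    # replace punctuation with spaces
    return " ".join(title.split())            # collapse repeated whitespace

print(normalize_title("Alice in Wonderland (2010)"))
print(normalize_title("Star Wars: Episode VII - The Force Awakens"))
```

Normalizing does not make "Episode VII" disappear, so exact matches on the cleaned titles will still miss some pairs.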


In [40]:
duplicated_titles = imdb.loc[imdb['primary_title'].duplicated(), 'primary_title'].tolist()

In [41]:
duplicated_titles

['Nemesis',
 'Untitled Disney Marvel Film',
 'Untitled Marvel Film',
 'Plushtubers: The Apocalypse',
 'Indemnity',
 'Cinderella',
 'Windfall',
 'Prey',
 'Olanda',
 'Rok Sako To Rok Lo',
 'Aitebaar',
 'Huway Hum Jin Kay Liye Barbaad',
 'Paradise',
 'Sapo: Live at the Avalon... Ritmo del Corazon',
 'Raggarjävlar (Swedish Greasers)',
 'Adam',
 'Cinema of Sleep',
 'Devour',
 'Untitled Marvel Film',
 'Untitled Star Wars Film',
 'Untitled Star Wars Film',
 'The Outsider',
 'Unitlted Disney Live Action Project',
 'Unitlted Disney Live Action Project',
 'Unitlted Disney Live Action Project',
 'Raffaele Sollecito',
 'A Resistance',
 'Me and Mr. Canadian',
 'Me and Mr. Canadian',
 'Between Two Worlds',
 'Magic',
 'Alone',
 'The Courier',
 'Camino del Triunfo',
 'Bloody Benders',
 'Agent Sai Srinivasa Athreya',
 'Diamond Anxiety',
 'Rising Star',
 'Immortal',
 'Drain Baby',
 'Antigone',
 'Sunday',
 'Grateful Dead: Meet-Up at the Movies',
 'Innocence Of A King',
 'Pororo, Dinosaur Island Adventure

## Genres

In [8]:
imdb.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [12]:
imdb['genres'].value_counts()

Documentary                   32185
Drama                         21486
Comedy                         9177
Horror                         4372
Comedy,Drama                   3519
                              ...  
Action,Documentary,Musical        1
Animation,Sci-Fi,War              1
Comedy,Music,War                  1
Drama,Mystery,Sport               1
Documentary,News,Western          1
Name: genres, Length: 1085, dtype: int64

In [14]:
imdb['genres'].unique()

array(['Action,Crime,Drama', 'Biography,Drama', 'Drama', ...,
       'Music,Musical,Reality-TV', 'Animation,Crime',
       'Adventure,History,War'], dtype=object)

In [22]:
imdb['genres'].isna().sum()

5408

In [23]:
imdb['genres'].isna().sum() / len(imdb)

0.037004598204510616

In [25]:
imdb['genres'] = imdb['genres'].fillna("Unknown")

In [32]:
# getting a unique genre list
unique_genres_list = []
for genre_details in imdb['genres']:
    genres_list = genre_details.split(",")
    for genre in genres_list:
        unique_genres_list.append(genre)

unique_genres_list = sorted(list(set(unique_genres_list)))

In [33]:
unique_genres_list

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'Unknown',
 'War',
 'Western']

In [30]:
# creating columns for each unique genre
for genre in unique_genres_list:
    imdb[genre] = 0

In [31]:
imdb.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,News,Crime,Animation,Family,...,Music,Comedy,History,Romance,Unknown,Short,Adult,Game-Show,Reality-TV,Talk-Show
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
"Crime" in imdb['genres'][0]

True

In [45]:
unique_genres_list

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'Unknown',
 'War',
 'Western']

In [42]:
# using a loop to populate those genre columns
for index, genre_details in enumerate(imdb['genres']):
    for genre in unique_genres_list:
        # caution: `in` on a string is a substring test,
        # so "Music" also matches rows tagged "Musical"
        if genre in genre_details:
            # if a genre is in the genres column for that row,
            # set that genre's column to 1
            imdb.at[index, genre] = 1
            # this has the same effect as imdb[genre][index] = 1,
            # but .at avoids chained indexing, which pandas warns about
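
One thing to watch with `genre in genre_details`: on strings, `in` is a substring test, so `"Music"` is also found inside `"Musical"`. A possible alternative, sketched here on a made-up `toy` frame (hypothetical data), is `Series.str.get_dummies`, which splits on the separator and matches whole genre names:

```python
import pandas as pd

# hypothetical toy frame mimicking the `genres` column
toy = pd.DataFrame({"genres": ["Action,Crime,Drama", "Musical", "Music,Drama", "Unknown"]})

# substring test: "Music" is found inside "Musical", so row 1 is flagged too
substring_flags = toy["genres"].apply(lambda g: "Music" in g).astype(int)

# exact matching: split on commas and build one 0/1 column per genre
dummies = toy["genres"].str.get_dummies(sep=",")

print(substring_flags.tolist())   # [0, 1, 1, 0]
print(dummies["Music"].tolist())  # [0, 0, 1, 0]
```

The resulting columns could then be attached with `pd.concat([toy, dummies], axis=1)`.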

In [43]:
imdb.head(10)

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,News,Crime,Animation,Family,...,Music,Comedy,History,Romance,Unknown,Short,Adult,Game-Show,Reality-TV,Talk-Show
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy",0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History",0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
imdb[unique_genres_list].sum()

Action         10335
Adult             25
Adventure       6465
Animation       2799
Biography       8722
Comedy         25312
Crime           6753
Documentary    51640
Drama          49883
Family          6227
Fantasy         3516
Game-Show          4
History         6225
Horror         10805
Music           5624
Musical         1430
Mystery         4659
News            1551
Reality-TV        98
Romance         9372
Sci-Fi          3365
Short             11
Sport           2234
Talk-Show         50
Thriller       11883
Unknown         5408
War             1405
Western          467
dtype: int64

In [46]:
imdb[unique_genres_list].mean()

Action         0.070718
Adult          0.000171
Adventure      0.044237
Animation      0.019152
Biography      0.059681
Comedy         0.173199
Crime          0.046208
Documentary    0.353350
Drama          0.341328
Family         0.042609
Fantasy        0.024058
Game-Show      0.000027
History        0.042595
Horror         0.073934
Music          0.038483
Musical        0.009785
Mystery        0.031880
News           0.010613
Reality-TV     0.000671
Romance        0.064129
Sci-Fi         0.023025
Short          0.000075
Sport          0.015286
Talk-Show      0.000342
Thriller       0.081310
Unknown        0.037005
War            0.009614
Western        0.003195
dtype: float64

In [62]:
imdb.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,News,Crime,Animation,Family,...,Music,Comedy,History,Romance,Unknown,Short,Adult,Game-Show,Reality-TV,Talk-Show
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [51]:
for genre in unique_genres_list:
    genre_mins = imdb.loc[imdb[genre] == 1, 'runtime_minutes'].mean()
    print(f"{genre}: {genre_mins:.2f} average runtime minutes")

Action: 100.02 average runtime minutes
Adult: 86.29 average runtime minutes
Adventure: 85.78 average runtime minutes
Animation: 80.67 average runtime minutes
Biography: 74.13 average runtime minutes
Comedy: 93.92 average runtime minutes
Crime: 95.51 average runtime minutes
Documentary: 72.11 average runtime minutes
Drama: 94.28 average runtime minutes
Family: 83.19 average runtime minutes
Fantasy: 91.92 average runtime minutes
Game-Show: 117.00 average runtime minutes
History: 78.76 average runtime minutes
Horror: 87.35 average runtime minutes
Music: 85.46 average runtime minutes
Musical: 95.42 average runtime minutes
Mystery: 93.23 average runtime minutes
News: 66.42 average runtime minutes
Reality-TV: 80.23 average runtime minutes
Romance: 100.22 average runtime minutes
Sci-Fi: 90.55 average runtime minutes
Short: 16.40 average runtime minutes
Sport: 80.94 average runtime minutes
Talk-Show: 86.74 average runtime minutes
Thriller: 94.35 average runtime minutes
Unknown: 82.33 average r
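
The same per-genre average can also be computed without the 0/1 columns, by reshaping to one row per movie-genre pair and grouping. This is a sketch on a made-up `toy` frame (hypothetical data), not the approach used above:

```python
import pandas as pd

# hypothetical toy frame with the same two columns used above
toy = pd.DataFrame({
    "genres": ["Action,Drama", "Drama", "Comedy,Drama", "Comedy"],
    "runtime_minutes": [120.0, 90.0, None, 100.0],
})

# one row per (movie, genre) pair
long_form = toy.assign(genre=toy["genres"].str.split(",")).explode("genre")

# mean() skips missing runtimes by default
avg_runtime = long_form.groupby("genre")["runtime_minutes"].mean()
print(avg_runtime)
```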

#### Creating a new copy of the imdb data to showcase another method to populate those genre columns

In [54]:
imdb_copy = pd.read_csv('zippedData/imdb.title.basics.csv.gz')

In [58]:
imdb_copy = imdb_copy.dropna(subset=['genres']).copy()

In [59]:
# shorter version of the above code using lambda
# (.copy() above avoids pandas' SettingWithCopyWarning when adding columns)
for genre in unique_genres_list:
    imdb_copy[genre] = imdb_copy['genres'].apply(lambda x: genre in x).astype('int')


In [60]:
imdb_copy.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,Action,Adult,Adventure,Animation,...,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,Unknown,War,Western
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### What about combination of genres?

In [63]:
from itertools import combinations

In [65]:
combos = list(combinations(unique_genres_list, 2))

In [68]:
combos[0]

('Action', 'Adult')

In [73]:
combo_name = f"{combos[1][0]}-{combos[1][1]}"

In [74]:
combo_name

'Action-Adventure'

In [70]:
[combos[0]]

[('Action', 'Adult')]

In [77]:
imdb.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,News,Crime,Animation,Family,...,Music,Comedy,History,Romance,Unknown,Short,Adult,Game-Show,Reality-TV,Talk-Show
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [76]:
# note: this loop only prints matches; it does not change the dataframe
# to store the results, first add one column per genre combo
for row in imdb.index[:5]:
    for genre_combo in combos:
        print(genre_combo)
        combo_name = f"{genre_combo[0]}-{genre_combo[1]}"
        if (imdb.at[row, genre_combo[0]] == 1) and (imdb.at[row, genre_combo[1]] == 1):
            print(f"{combo_name}: yes")

('Action', 'Adult')
('Action', 'Adventure')
('Action', 'Animation')
('Action', 'Biography')
('Action', 'Comedy')
('Action', 'Crime')
Action-Crime: yes
('Action', 'Documentary')
('Action', 'Drama')
Action-Drama: yes
('Action', 'Family')
('Action', 'Fantasy')
('Action', 'Game-Show')
('Action', 'History')
('Action', 'Horror')
('Action', 'Music')
('Action', 'Musical')
('Action', 'Mystery')
('Action', 'News')
('Action', 'Reality-TV')
('Action', 'Romance')
('Action', 'Sci-Fi')
('Action', 'Short')
('Action', 'Sport')
('Action', 'Talk-Show')
('Action', 'Thriller')
('Action', 'Unknown')
('Action', 'War')
('Action', 'Western')
('Adult', 'Adventure')
('Adult', 'Animation')
('Adult', 'Biography')
('Adult', 'Comedy')
('Adult', 'Crime')
('Adult', 'Documentary')
('Adult', 'Drama')
('Adult', 'Family')
('Adult', 'Fantasy')
('Adult', 'Game-Show')
('Adult', 'History')
('Adult', 'Horror')
('Adult', 'Music')
('Adult', 'Musical')
('Adult', 'Mystery')
('Adult', 'News')
('Adult', 'Reality-TV')
('Adult', 'Roma

('Reality-TV', 'Unknown')
('Reality-TV', 'War')
('Reality-TV', 'Western')
('Romance', 'Sci-Fi')
('Romance', 'Short')
('Romance', 'Sport')
('Romance', 'Talk-Show')
('Romance', 'Thriller')
('Romance', 'Unknown')
('Romance', 'War')
('Romance', 'Western')
('Sci-Fi', 'Short')
('Sci-Fi', 'Sport')
('Sci-Fi', 'Talk-Show')
('Sci-Fi', 'Thriller')
('Sci-Fi', 'Unknown')
('Sci-Fi', 'War')
('Sci-Fi', 'Western')
('Short', 'Sport')
('Short', 'Talk-Show')
('Short', 'Thriller')
('Short', 'Unknown')
('Short', 'War')
('Short', 'Western')
('Sport', 'Talk-Show')
('Sport', 'Thriller')
('Sport', 'Unknown')
('Sport', 'War')
('Sport', 'Western')
('Talk-Show', 'Thriller')
('Talk-Show', 'Unknown')
('Talk-Show', 'War')
('Talk-Show', 'Western')
('Thriller', 'Unknown')
('Thriller', 'War')
('Thriller', 'Western')
('Unknown', 'War')
('Unknown', 'Western')
('War', 'Western')
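
Checking all 378 possible pairs for every row produces a lot of output. If the goal is just to see which pairs actually occur, one option is to build pairs from each row's own genres and count them. A minimal sketch, assuming the same comma-separated format (`toy_genres` is made-up data):

```python
from collections import Counter
from itertools import combinations

# hypothetical genre strings in the same comma-separated format
toy_genres = ["Action,Crime,Drama", "Comedy,Drama", "Crime,Drama", "Drama"]

pair_counts = Counter()
for genre_details in toy_genres:
    genres = sorted(genre_details.split(","))
    # combinations of a sorted list yields each pair once, in alphabetical order
    pair_counts.update(combinations(genres, 2))

for (first, second), count in pair_counts.most_common():
    print(f"{first}-{second}: {count}")
```

Pairs that never occur simply never show up in the counter, so no columns need to be created up front.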


## Fuzzy Wuzzy (fuzzy string matching)

In [7]:
bom.loc[bom['title'].str.contains("Star Wars")]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
793,Star Wars: Episode I - The Phantom Menace (in 3D),Fox,43500000.0,59300000.0,2012
1872,Star Wars: The Force Awakens,BV,936700000.0,1131.6,2015
2323,Rogue One: A Star Wars Story,BV,532200000.0,523900000.0,2016
2758,Star Wars: The Last Jedi,BV,620200000.0,712400000.0,2017
3101,Solo: A Star Wars Story,BV,213800000.0,179200000.0,2018


In [9]:
imdb.loc[imdb['primary_title'].str.contains("Star Wars")]

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
2370,tt10239898,Star Wars: Battle for the Holocrons,Star Wars: Battle for the Holocrons,2020,,"Action,Adventure,Fantasy"
2947,tt10300394,Untitled Star Wars Film,Untitled Star Wars Film,2022,,
2948,tt10300396,Untitled Star Wars Film,Untitled Star Wars Film,2024,,
2949,tt10300398,Untitled Star Wars Film,Untitled Star Wars Film,2026,,Fantasy
3219,tt10321138,RiffTrax: Star Wars: The Force Awakens,RiffTrax: Star Wars: The Force Awakens,2016,,Comedy
34425,tt2275656,Star Wars: Threads of Destiny,Star Wars: Threads of Destiny,2014,110.0,"Action,Adventure,Sci-Fi"
41443,tt2488496,Star Wars: Episode VII - The Force Awakens,Star Wars: Episode VII - The Force Awakens,2015,136.0,"Action,Adventure,Fantasy"
42223,tt2527336,Star Wars: The Last Jedi,Star Wars: Episode VIII - The Last Jedi,2017,152.0,"Action,Adventure,Fantasy"
42224,tt2527338,Star Wars: The Rise of Skywalker,Star Wars: The Rise of Skywalker,2019,,"Action,Adventure,Fantasy"
63494,tt3648510,Plastic Galaxy: The Story of Star Wars Toys,Plastic Galaxy: The Story of Star Wars Toys,2014,70.0,"Documentary,History,Sci-Fi"


In [23]:
bom.iloc[1872]['title'].lower().replace(":", "")

'star wars the force awakens'

In [24]:
imdb.iloc[41443]['primary_title'].lower().replace(":", "").replace("-", "")

'star wars episode vii  the force awakens'

In [18]:
from fuzzywuzzy import fuzz

In [25]:
fuzz.ratio(bom.iloc[1872]['title'].lower().replace(":", ""), 
           imdb.iloc[41443]['primary_title'].lower().replace(":", "").replace("-", ""))

81
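
To get a feel for what a ratio like this measures, here is a rough standard-library analogue built on `difflib.SequenceMatcher`. The `similarity` function is a hypothetical stand-in, and its scores are not guaranteed to match `fuzz.ratio` exactly:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Hypothetical stand-in: 0-100 similarity from the standard library."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

bom_title = "star wars the force awakens"
imdb_title = "star wars episode vii  the force awakens"

print(similarity(bom_title, imdb_title))
print(similarity(bom_title, bom_title))    # identical strings score 100
print(similarity(bom_title, "inception"))  # unrelated titles score much lower
```

Any cutoff for "close enough" is a judgment call, so it is worth spot-checking pairs just above and below the threshold you choose.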

In [13]:
imdb.loc[imdb['runtime_minutes'].notna(), 'start_year'].sort_values(ascending=False)

134557    2022
1330      2022
4382      2022
3710      2021
3560      2021
          ... 
13194     2010
13193     2010
13192     2010
13191     2010
66995     2010
Name: start_year, Length: 114405, dtype: int64

### Testing joins without fuzzy string matching

In [26]:
test_join = pd.merge(
    imdb,
    bom,
    how='inner',
    left_on="primary_title",
    right_on="title")

In [None]:
test_join.head()
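
When testing a join like this, it can help to see which rows matched and which did not. `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A small sketch on made-up frames (`left` and `right` are hypothetical stand-ins for `imdb` and `bom`, with invented values):

```python
import pandas as pd

# hypothetical toy frames standing in for `imdb` and `bom`
left = pd.DataFrame({"primary_title": ["Inception", "Big Eyes", "Big Eyes"],
                     "tconst": ["tt1", "tt2", "tt3"]})
right = pd.DataFrame({"title": ["Inception", "Big Eyes", "Toy Story 3"],
                      "domestic_gross": [1.0, 2.0, 3.0]})

# how='outer' keeps unmatched rows; indicator=True adds a `_merge` column
checked = pd.merge(left, right, how="outer",
                   left_on="primary_title", right_on="title", indicator=True)

print(checked["_merge"].value_counts())
```

Here "Big Eyes" matches twice because the left frame has two rows with that title, which is the same duplication problem seen in the real data.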

### Creating a year_title column on each to see if that reduces duplicates

In [31]:
imdb['year_title'] = imdb['primary_title'] + " " + imdb['start_year'].astype('str')

In [32]:
imdb.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,year_title
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",Sunghursh 2013
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",One Day Before the Rainy Season 2019
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,The Other Side of the Wind 2018
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",Sabse Bada Sukh 2018
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",The Wandering Soap Opera 2017


In [33]:
bom['year_title'] = bom['title'] + " " + bom['year'].astype('str')

In [41]:
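# keep each bom row's original position as an 'index' column,
# so we can later spot bom rows that matched more than one imdb row
bom = bom.reset_index()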

In [42]:
bom.head()

Unnamed: 0,index,title,studio,domestic_gross,foreign_gross,year,year_title
0,0,Toy Story 3,BV,415000000.0,652000000,2010,Toy Story 3 2010
1,1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010,Alice in Wonderland (2010) 2010
2,2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010,Harry Potter and the Deathly Hallows Part 1 2010
3,3,Inception,WB,292600000.0,535700000,2010,Inception 2010
4,4,Shrek Forever After,P/DW,238700000.0,513900000,2010,Shrek Forever After 2010


In [37]:
len(bom)

3387

In [38]:
len(imdb)

146144

In [43]:
test_join = pd.merge(
    imdb,
    bom,
    how='inner',
    on='year_title')

In [39]:
len(test_join)

1873
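
Instead of concatenating title and year into one string, `pd.merge` can also join on two columns at once, which keeps the year numeric. A sketch with made-up frames (hypothetical data and studio code):

```python
import pandas as pd

# hypothetical toy frames; "XYZ" is an invented studio code
left = pd.DataFrame({"primary_title": ["Cinderella", "Cinderella"],
                     "start_year": [2015, 2021]})
right = pd.DataFrame({"title": ["Cinderella"], "year": [2015], "studio": ["XYZ"]})

# match only when both the title and the year agree
joined = pd.merge(left, right, how="inner",
                  left_on=["primary_title", "start_year"],
                  right_on=["title", "year"])
print(joined)
```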

In [44]:
test_join

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,year_title,index,title,studio,domestic_gross,foreign_gross,year
0,tt0315642,Wazir,Wazir,2016,103.0,"Action,Crime,Drama",Wazir 2016,2568,Wazir,Relbig.,1100000.0,,2016
1,tt0337692,On the Road,On the Road,2012,124.0,"Adventure,Drama,Romance",On the Road 2012,904,On the Road,IFC,744000.0,8000000,2012
2,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",The Secret Life of Walter Mitty 2013,1169,The Secret Life of Walter Mitty,Fox,58200000.0,129900000,2013
3,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.0,"Action,Crime,Drama",A Walk Among the Tombstones 2014,1577,A Walk Among the Tombstones,Uni.,26300000.0,26900000,2014
4,tt0369610,Jurassic World,Jurassic World,2015,124.0,"Action,Adventure,Sci-Fi",Jurassic World 2015,1873,Jurassic World,Uni.,652300000.0,1019.4,2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1868,tt8362680,Mountain,Mountain,2018,15.0,Documentary,Mountain 2018,3308,Mountain,Greenwich,365000.0,,2018
1869,tt8404272,How Long Will I Love U,Chao shi kong tong ju,2018,101.0,Romance,How Long Will I Love U 2018,3149,How Long Will I Love U,WGUSA,747000.0,82100000,2018
1870,tt8427036,Helicopter Eela,Helicopter Eela,2018,135.0,Drama,Helicopter Eela 2018,3354,Helicopter Eela,Eros,72000.0,,2018
1871,tt9078374,Last Letter,"Ni hao, Zhihua",2018,114.0,"Drama,Romance",Last Letter 2018,3319,Last Letter,CL,181000.0,,2018


In [46]:
test_join.duplicated(subset=['index']).sum()

40

In [47]:
test_join.duplicated(subset=['tconst']).sum()

0

### Always justify your decisions in your project notebooks!

This will help you remember what decisions you made AND why you made them!

Example for this case: 

> I can see that some movies were released under the same name in the same year, so I will drop all of these rows because the matches are not reliable

(note the "because" here - the why is important to write down!)

In [49]:
test_join.loc[test_join.duplicated(subset=['index'], keep=False)]

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,year_title,index,title,studio,domestic_gross,foreign_gross,year
140,tt1038919,The Bounty Hunter,The Bounty Hunter,2010,110.0,"Action,Comedy,Romance",The Bounty Hunter 2010,50,The Bounty Hunter,Sony,67099999.0,69300000,2010
141,tt1472211,The Bounty Hunter,The Bounty Hunter,2010,,,The Bounty Hunter 2010,50,The Bounty Hunter,Sony,67099999.0,69300000,2010
167,tt1126590,Big Eyes,Big Eyes,2014,106.0,"Biography,Crime,Drama",Big Eyes 2014,1606,Big Eyes,Wein.,14500000.0,14800000,2014
168,tt4317898,Big Eyes,Big Eyes,2014,,Documentary,Big Eyes 2014,1606,Big Eyes,Wein.,14500000.0,14800000,2014
169,tt1126591,Burlesque,Burlesque,2010,119.0,"Drama,Music,Musical",Burlesque 2010,69,Burlesque,SGem,39400000.0,50100000,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1807,tt9042690,The Negotiation,The Negotiation,2018,89.0,"Documentary,History,War",The Negotiation 2018,3333,The Negotiation,CJ,111000.0,,2018
1838,tt7348082,The Guardians,The Guardians,2018,104.0,Documentary,The Guardians 2018,3321,The Guardians,MBox,177000.0,,2018
1839,tt8150132,The Guardians,The Guardians,2018,70.0,Documentary,The Guardians 2018,3321,The Guardians,MBox,177000.0,,2018
1857,tt7905466,They Shall Not Grow Old,They Shall Not Grow Old,2018,99.0,"Documentary,History,War",They Shall Not Grow Old 2018,3209,They Shall Not Grow Old,WB,18000000.0,,2018
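
If the decision is to drop every ambiguous match, as in the example justification above, `drop_duplicates` with `keep=False` removes all rows in a duplicated group. A minimal sketch on a made-up `toy_join` frame (hypothetical data):

```python
import pandas as pd

# hypothetical joined frame; `index` is the original bom row number
toy_join = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "index": [50, 50, 69, 70],
})

# keep=False drops every row of a duplicated group, not just the repeats
unambiguous = toy_join.drop_duplicates(subset=["index"], keep=False)

print(len(toy_join), len(unambiguous))  # 4 2
```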


### Bonus!

If you're scrolling through this notebook, this Stack Overflow thread discusses fuzzy-match merges in pandas: https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas
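
As a rough illustration of the idea, here is one standard-library-only way to sketch a fuzzy merge: look up the closest candidate title with `difflib.get_close_matches`, then merge on that match. The frames, the `closest` helper, and the `0.75` cutoff are all assumptions for illustration, not a recommended setting:

```python
import difflib
import pandas as pd

# hypothetical, already-normalized titles
left = pd.DataFrame({"title": ["star wars the force awakens", "inception"]})
right = pd.DataFrame({"primary_title": ["star wars episode vii the force awakens",
                                        "inception", "toy story 3"]})

def closest(title, candidates, cutoff=0.75):
    """Hypothetical helper: best candidate at or above `cutoff`, else None."""
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

candidates = right["primary_title"].tolist()
left["match"] = [closest(title, candidates) for title in left["title"]]

# merge on the matched title rather than the original one
merged = left.merge(right, how="left", left_on="match", right_on="primary_title")
print(merged[["title", "primary_title"]])
```

This compares every title against every candidate, so it can be slow on frames as large as `imdb`.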