# Movie Industry Analysis 
## Part I - Data Preparation
---
I undertook this personal project to become proficient with Pandas on larger data sets and to practice basic machine learning tasks.

It consists of three notebooks:

* **Part I   - Data Preparation (This Notebook)**
* **Part II  - Data Exploration**
* **Part III - Predictions with Machine Learning**

The data mining process is followed throughout the project, both on a large scale and, in Parts II & III, on a case-by-case basis. The steps are:

1. Defining the Objective
2. Data Preparation
3. Data Exploration
4. Building (or Choosing) a Model
5. Evaluating Model Results

---
## 1. Defining the Objective
The overall objective of this project is to predict which directors will give a movie the best possible IMDB score for any given target year (between 1930 and 2015), using the prior 10 years' worth of data for analysis.

---
## 2. Data Preparation

I found an interface page on IMDB that contains subsets from their entire database. It is updated daily and contains millions of data points. 
The data they provide does not contain all metrics from the site, but many relevant fields are included. The ones I used are:

title, release year, genre, runtime, director, actor1, actor2, actor3, imdb_rating, and imdb_rating_count

[IMDB Data Sets](https://www.imdb.com/interfaces/)


**IMPORTANT NOTE ON THE DATA**
For this project my main variable of interest (the response variable) will be the IMDB Score. In the real world, box-office numbers would better reflect movie success, as IMDB scores themselves are highly subjective and susceptible to being artificially inflated. However, I do not have a reliable source for box-office data (inflation adjusted, etc.), so I am using the IMDB Score as the metric of success for a given movie.

Because the IMDB data contains more than just movie data, I have to create my population by weeding out records that are not relevant here. I am doing so by enforcing some rules:

* Only fictional movies (no documentaries, TV shows, etc.)
* Only feature-length films, >= 40 minutes in length (based on the feature-length threshold used by the Academy of Motion Picture Arts and Sciences)
* No adult films
* Films must have an IMDB score
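
The rules above can be sketched as a single boolean mask. This is only an illustration on made-up rows shaped like `title.basics`; the helper name `apply_population_rules` is hypothetical, and the IMDB score rule is left out because ratings are merged in later (section 2b).

```python
import pandas as pd

# Toy rows in the shape of title.basics; the values are made up for illustration.
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4', 'tt5'],
    'titleType': ['movie', 'movie', 'short', 'movie', 'movie'],
    'isAdult': [0, 0, 0, 1, 0],
    'runtimeMinutes': ['95', '30', '12', '88', '\\N'],
    'genres': ['Drama', 'Comedy', 'Short', 'Adult', 'Documentary'],
})

def apply_population_rules(df):
    """Hypothetical helper: keep fictional, feature-length, non-adult movies."""
    # '\N' marks a missing runtime; coerce turns it into NaN, which fails '>= 40'.
    runtime = pd.to_numeric(df['runtimeMinutes'], errors='coerce')
    keep = (
        (df['titleType'] == 'movie')
        & ~df['genres'].str.contains('Documentary', na=False)
        & (runtime >= 40)
        & (df['isAdult'] == 0)
    )
    return df[keep]

population = apply_population_rules(toy)
print(population['tconst'].tolist())
```

The cells below apply the same rules one at a time, which makes it easier to see how many rows each rule removes.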

In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

## 2a. Cleaning Basic Movie Data
Note: 'title.basics.tsv.gz' contains the basic movie data; I renamed it 'movies.tsv.gz'.

In [2]:
mdf = pd.read_table('data/movies.tsv.gz', low_memory=False)

In [3]:
mdf.shape

(6691705, 9)

In [4]:
mdf.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
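
The output above shows that missing values in these files are written as `\N`. As a possible alternative to filtering those strings out column by column, the reader can be told to treat `\N` as missing at load time. This is a minimal sketch on an in-memory sample (the two rows are copied from outputs in this notebook, trimmed to five columns), not the approach used in the cells below.

```python
import io
import pandas as pd

# Two rows in the tab-separated layout of title.basics, trimmed to five columns.
sample = (
    "tconst\ttitleType\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\t1894\t1\tDocumentary,Short\n"
    "tt0000335\tmovie\t1900\t\\N\tBiography,Drama\n"
)

# na_values tells pandas to read the literal string \N as NaN.
df = pd.read_csv(io.StringIO(sample), sep='\t', na_values=['\\N'])

print(df['runtimeMinutes'].isna().tolist())
print(df['runtimeMinutes'].dtype)
```

With this option the numeric columns arrive as numbers (floats where values are missing), so the later `!= '\\N'` filters and `pd.to_numeric` calls would become `dropna` calls instead.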


In [5]:
mdf['titleType'].unique()

array(['short', 'movie', 'tvMovie', 'tvSeries', 'tvEpisode', 'tvShort',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame'], dtype=object)

### Drop unwanted columns, only keep 'titleType' movie

In [6]:
mdf = mdf.drop(columns=['originalTitle','endYear']) # drop unwanted columns
mdf = mdf[mdf['titleType'] == 'movie']              # only keep rows where 'titleType' is 'movie'
mdf = mdf.drop(columns=['titleType'])               # drop 'titleType' column, no longer needed
print(mdf.shape)
mdf.head(2)

(547478, 6)


Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,genres
8,tt0000009,Miss Jerry,0,1894,45,Romance
145,tt0000147,The Corbett-Fitzsimmons Fight,0,1897,20,"Documentary,News,Sport"


### Remove rows where genres contains 'Documentary' or 'TV', or where genres is exactly 'Sport', 'Music' or 'Biography'
This takes care of documentaries and TV shows.

In [7]:
mdf = mdf[mdf['genres'] != '\\N']
mdf = mdf[mdf['genres'].str.contains(r'[Dd]ocumentary') == False]
mdf = mdf[mdf['genres'].str.contains(r'([Tt]V)') == False]
mdf = mdf[mdf['genres'] != 'Sport']
mdf = mdf[mdf['genres'] != 'Music']
mdf = mdf[mdf['genres'] != 'Biography']
print(mdf.shape)
mdf.head(2)

(369216, 6)


Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,genres
8,tt0000009,Miss Jerry,0,1894,45,Romance
332,tt0000335,Soldiers of the Cross,0,1900,\N,"Biography,Drama"
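
The same genre filter can be written with a single pattern that has no capture group, which avoids the "match groups" warning silenced at the top of this notebook. This is a sketch on a made-up Series; the variable names are only for illustration.

```python
import pandas as pd

genres = pd.Series([
    'Romance',
    'Documentary,News,Sport',
    'Drama,Reality-TV',
    'Sport',
    'Biography,Drama',
])

# Substring match: any genre string mentioning Documentary or TV.
is_doc_or_tv = genres.str.contains(r'[Dd]ocumentary|[Tt]V', regex=True)

# Exact match: rows whose ONLY genre is Sport, Music or Biography.
is_excluded_single = genres.isin(['Sport', 'Music', 'Biography'])

kept = genres[~is_doc_or_tv & ~is_excluded_single]
print(kept.tolist())
```

Note that 'Biography,Drama' survives, just like 'Soldiers of the Cross' in the output above, because the exact-match rule only removes rows with a single listed genre.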


### Remove all movies under 40 minutes

In [8]:
mdf = mdf[mdf['runtimeMinutes'] != '\\N']
mdf['runtimeMinutes'] = pd.to_numeric(mdf['runtimeMinutes'])
mdf = mdf[mdf['runtimeMinutes'] >= 40]
print(mdf.shape)
mdf.head(2)

(240833, 6)


Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,genres
8,tt0000009,Miss Jerry,0,1894,45,Romance
571,tt0000574,The Story of the Kelly Gang,0,1906,70,"Biography,Crime,Drama"


### Remove adult films
Adult films are flagged with 'isAdult' = 1, so I remove them by keeping only the rows where it is 0.

In [9]:
mdf = mdf[mdf['isAdult'] == 0]
mdf = mdf.drop(columns=['isAdult'])
print(mdf.shape)
mdf.head(2)

(235482, 5)


Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres
8,tt0000009,Miss Jerry,1894,45,Romance
571,tt0000574,The Story of the Kelly Gang,1906,70,"Biography,Crime,Drama"


### Count 'startYear' by decade to check that each date range has enough samples to be usable

In [10]:
mdf = mdf[mdf['startYear'] != '\\N']
mdf['startYear'] = pd.to_numeric(mdf['startYear'])

yr_chk = mdf.copy()
yr_chk = yr_chk.drop(columns = ['tconst', 'primaryTitle',
                                'runtimeMinutes', 'genres'])

decade = 10 * (yr_chk['startYear'] // 10)
yr_chk['decade'] = decade
yr_chk.groupby('decade')['startYear'].count()

decade
1890        1
1900        7
1910     2103
1920     5151
1930    10574
1940     9242
1950    13173
1960    17066
1970    19308
1980    20432
1990    21152
2000    36705
2010    74439
2020     2354
Name: startYear, dtype: int64
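
For reference, the decade count does not need a copy of the DataFrame. A shorter form, sketched here on a made-up Series of years, floors each year to its decade and counts the values directly.

```python
import pandas as pd

years = pd.Series([1894, 1906, 1910, 1915, 1999, 2019, 2020, 2021])

# Floor each year to its decade (1915 -> 1910), then count rows per decade.
decade_counts = (years // 10 * 10).value_counts().sort_index()
print(decade_counts)

# between() is inclusive on both ends, matching a '>= 1910 and <= 2020' filter.
in_range = years[years.between(1910, 2020)]
print(in_range.tolist())
```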

### The decade count above shows only 8 titles before 1910 (1 in the 1890s and 7 in the 1900s), so those years need to be removed. Also, the 2020s only have 1 year of data so far, so years after 2020 need to go as well.

In [11]:
mdf = mdf[(mdf['startYear'] >= 1910) & (mdf['startYear'] <= 2020)]
print(mdf.shape)
mdf.head(2)

(231580, 5)


Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres
1173,tt0001184,Don Juan de Serrallonga,1910,58,"Adventure,Drama"
1247,tt0001258,The White Slave Trade,1910,45,Drama


In [12]:
yr_chk = mdf.copy()
yr_chk = yr_chk.drop(columns = ['tconst', 'primaryTitle',
                                'runtimeMinutes', 'genres'])

decade = 10 * (yr_chk['startYear'] // 10)
yr_chk['decade'] = decade
yr_chk.groupby('decade')['startYear'].count()

decade
1910     2103
1920     5151
1930    10574
1940     9242
1950    13173
1960    17066
1970    19308
1980    20432
1990    21152
2000    36705
2010    74439
2020     2235
Name: startYear, dtype: int64

In [13]:
mdf[mdf.isna().any(axis=1)]

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres


### Convert genre column to only contain one ideal genre per movie
The 'genres' column often has more than one genre per movie, and I really just want the one genre that best fits a given movie.

In [14]:
mdf['genres'] = mdf['genres'].str.replace(' ', '')
mdf['genres'] = mdf['genres'].str.split(',')
mdf['genres'] = mdf['genres'].replace(' ', '') 

In [15]:
# Perform the logic that reduces each genre list to a single genre
genre_lst = mdf['genres'].tolist()
def convert_genre(lst):
    good_genres = ['Comedy', 'Drama', 'Action']
    
    for i, l in enumerate(lst):
        if 'Family' in l:
            lst[i] = 'Family'
        elif 'Horror' in l and 'Comedy' not in l:
            lst[i] = 'Horror'   
        elif l[0] in good_genres: 
            lst[i] = l[0]
        elif len(l) > 1 and l[1] in good_genres:
            lst[i] = l[1]
        elif len(l) > 2 and l[2] in good_genres:
            lst[i] = l[2]
        elif 'Adventure' in l:
            lst[i] = 'Action'
        else:
            lst[i] = 'Drama'
    return(lst)

mdf['genres'] = convert_genre(genre_lst)

In [16]:
mdf.head(2)

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres
1173,tt0001184,Don Juan de Serrallonga,1910,58,Drama
1247,tt0001258,The White Slave Trade,1910,45,Drama
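
The genre logic above can also be written as a function that takes one movie's genre list and returns one genre, applied with `Series.apply` instead of rewriting a list in place. This is a sketch that is meant to follow the same priority order (Family, then Horror without Comedy, then the first of Comedy/Drama/Action among the first three entries, then Adventure mapped to Action, else Drama); `pick_genre` and `GOOD_GENRES` are names introduced here for illustration.

```python
import pandas as pd

GOOD_GENRES = ('Comedy', 'Drama', 'Action')

def pick_genre(genres):
    """Hypothetical helper: return one genre for a list of genre strings."""
    if 'Family' in genres:
        return 'Family'
    if 'Horror' in genres and 'Comedy' not in genres:
        return 'Horror'
    # Check the first three listed genres in order.
    for g in genres[:3]:
        if g in GOOD_GENRES:
            return g
    if 'Adventure' in genres:
        return 'Action'
    return 'Drama'

raw = pd.Series([
    'Adventure,Drama', 'Horror,Comedy', 'Horror,Thriller',
    'Animation,Family', 'Adventure,Fantasy', 'Western',
])
single = raw.str.split(',').apply(pick_genre)
print(single.tolist())
```

Keeping the function free of side effects makes it easy to test each branch on a handful of rows before running it over the full column.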


### Rename columns and save data to csv file 

In [17]:
mdf = mdf.rename(columns = {'primaryTitle':'title', 'startYear' : 'year',
                            'runtimeMinutes': 'runtime', 'genres': 'genre'})
mdf.to_csv('data/movies_1.csv', index=False)

---
## 2b. Merging IMDB ratings with movie data
Note: The original ratings file was called 'title.ratings.tsv.gz'; I renamed it 'ratings.tsv.gz'.

In [18]:
rdf = pd.read_table('data/ratings.tsv.gz')
mdf = pd.read_csv('data/movies_1.csv')

In [19]:
rdf.head(2)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1592
1,tt0000002,6.1,194


In [20]:
mdf = mdf.merge(rdf, on='tconst', how='left')
print(mdf.shape)
mdf.head(2)

(231580, 7)


Unnamed: 0,tconst,title,year,runtime,genre,averageRating,numVotes
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0
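
When merging on an ID column, it can help to check that the key is unique on both sides and to see how many rows found no match. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on made-up frames; it illustrates the check and is not part of the pipeline above.

```python
import pandas as pd

movies = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'], 'title': ['A', 'B', 'C']})
ratings = pd.DataFrame({
    'tconst': ['tt1', 'tt3'],
    'averageRating': [7.1, 5.4],
    'numVotes': [120, 15],
})

# validate raises if 'tconst' is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from.
merged = movies.merge(ratings, on='tconst', how='left',
                      validate='one_to_one', indicator=True)
print(merged['_merge'].value_counts())

# Rows marked 'left_only' are movies without a rating.
rated = merged[merged['_merge'] == 'both'].drop(columns='_merge')
print(rated)
```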


### Remove films with no IMDB score

In [21]:
mdf = mdf.dropna(subset=['averageRating'])
print(mdf.shape)
mdf.head(2)

(177681, 7)


Unnamed: 0,tconst,title,year,runtime,genre,averageRating,numVotes
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0


### Rename columns and save data to csv file 

In [22]:
mdf = mdf.rename(columns = {'averageRating':'imdb_score', 'numVotes' : 'num_scores'})
mdf.to_csv('data/movies_2.csv', index=False)

---
## 2c. Merging director IDs with movie data
Note: The original file with directors was called 'title.crew.tsv.gz'; I renamed it 'directors.tsv.gz'.

In [2]:
mdf = pd.read_csv('data/movies_2.csv')
dir_df =  pd.read_table('data/directors.tsv.gz')

In [24]:
print(mdf.shape)
print(dir_df.shape)

(177681, 7)
(6691705, 3)


In [25]:
dir_df.head(2)

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N


In [26]:
dir_df = dir_df.drop(columns=['writers'])
mdf = mdf.merge(dir_df, on='tconst', how='left')
mdf = mdf[mdf['directors'] != '\\N']
print(mdf.shape)
mdf.head(2)

(177095, 8)


Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,directors
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0,"nm0550220,nm0063413"
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0,nm0088881


### Convert the director IDs to a list, then keep only the first entry, as this is usually the main director

In [27]:
# Convert 'directors' to a list, then keep only the first ID
mdf['directors'] = mdf['directors'].str.replace(' ', '')
mdf['directors'] = mdf['directors'].str.split(',')
mdf['directors'] = mdf['directors'].replace(' ', '')
mdf['directors'] = mdf['directors'].str[0]
print(mdf.shape)
mdf.head(2)

(177095, 8)


Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,directors
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0,nm0550220
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0,nm0088881
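
The split-and-take-first step can be seen in isolation on a small Series. This sketch uses the two director strings from the output above plus a `\N` placeholder, which is masked to NaN first so it stays missing rather than becoming an ID.

```python
import pandas as pd

directors = pd.Series(['nm0550220,nm0063413', 'nm0088881', '\\N'])

# Mask the \N placeholder to NaN, split on commas, keep the first ID.
first_director = (
    directors.where(directors != '\\N')
             .str.split(',')
             .str[0]
)
print(first_director.tolist())
```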


### Save data to csv file

In [28]:
mdf.to_csv('data/movies_3.csv', index=False)

## 2d. Merging director names with movie data
Note: The original name file was called 'name.basics.tsv.gz'; I renamed it 'names.tsv.gz'.

In [2]:
ndf = pd.read_table('data/names.tsv.gz')
mdf = pd.read_csv('data/movies_3.csv')

In [3]:
print(ndf.shape)
print(mdf.shape)
ndf.head(2)

(9997980, 6)
(177095, 8)


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0043044,tt0072308,tt0053137,tt0050419"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0038355,tt0037382,tt0117057,tt0071877"


In [4]:
#ndf = ndf.drop(columns = ['birthYear', 'deathYear', 'primaryProfession', 'knownForTitles'])
ndf = ndf.drop(columns = ['birthYear', 'primaryProfession', 'knownForTitles'])
ndf.head(2)

Unnamed: 0,nconst,primaryName,deathYear
0,nm0000001,Fred Astaire,1987
1,nm0000002,Lauren Bacall,2014


In [5]:
mdf = mdf.rename(columns={'directors': 'nconst'})
mdf = mdf.merge(ndf, on='nconst', how='left')            # Merge dataframes
mdf = mdf.drop(columns = ['nconst'])
mdf = mdf.rename(columns={'primaryName': 'director', 'deathYear' : 'dir_dth_yr'})

mdf['dir_dth_yr'] = mdf['dir_dth_yr'].fillna('9999')          # NaN: director ID had no match in the names file
mdf['dir_dth_yr'] = mdf['dir_dth_yr'].replace('\\N', '9999')  # '\N': no death year recorded
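
The cell above uses 9999 as a sentinel for "no death year". One possible alternative, sketched here on made-up values, is to keep the missing years as missing with pandas' nullable integer type. This is only an illustration: switching to it would mean any later code that compares against 9999 has to handle `<NA>` instead.

```python
import pandas as pd

death = pd.Series(['1956', '\\N', None, '1947'])

# errors='coerce' turns anything non-numeric (including '\N') into NaN;
# 'Int64' (capital I) is the nullable integer dtype, so NaN becomes <NA>.
death_year = pd.to_numeric(death, errors='coerce').astype('Int64')
print(death_year)

# A director with no recorded death year.
no_death_year = death_year.isna()
print(no_death_year.tolist())
```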

In [13]:

mdf[mdf['dir_dth_yr'].isnull()]
mdf.head(10)

Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,director,dir_dth_yr
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0,Alberto Marro,1956
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0,August Blom,1947
2,tt0001498,The Battle of Trafalgar,1911,51,Drama,7.2,5.0,J. Searle Dawley,1949
3,tt0001790,"Les Misérables, Part 1: Jean Valjean",1913,60,Drama,5.8,21.0,Albert Capellani,1931
4,tt0001812,Oedipus Rex,1911,56,Drama,5.8,6.0,Theo Frenkel,1956
5,tt0001892,Den sorte drøm,1911,53,Drama,5.9,179.0,Urban Gad,1947
6,tt0001911,Nell Gwynne,1911,50,Drama,4.1,7.0,Raymond Longford,1959
7,tt0001964,The Traitress,1911,48,Drama,6.1,51.0,Urban Gad,1947
8,tt0002026,Anny - Story of a Prostitute,1912,68,Drama,4.0,7.0,Adam Eriksen,9999
9,tt0002101,Cleopatra,1912,100,Drama,5.2,431.0,Charles L. Gaskill,1943


### Fixing director values
Only two titles have a NaN director, and both movies are fairly recent, so I looked the directors up on IMDB and filled the names into the table by hand.

In [14]:
mdf[mdf.isna().any(axis=1)]

Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,director,dir_dth_yr
123050,tt1572165,Planet Blood,2001,57,Horror,1.8,91.0,,9999
135064,tt2205589,Rise of the Black Bat,2012,80,Action,1.2,641.0,,9999


In [17]:
mdf.iloc[135064]

tconst                    tt2205589
title         Rise of the Black Bat
year                           2012
runtime                          80
genre                        Action
imdb_score                      1.2
num_scores                      641
director                        NaN
dir_dth_yr                     9999
Name: 135064, dtype: object

In [18]:
mdf.iat[135064, 7] = 'Brett Kelly'
mdf.iloc[135064]

tconst                    tt2205589
title         Rise of the Black Bat
year                           2012
runtime                          80
genre                        Action
imdb_score                      1.2
num_scores                      641
director                Brett Kelly
dir_dth_yr                     9999
Name: 135064, dtype: object
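
The fix above addresses the row by position (`iat[135064, 7]`), which silently points at a different cell if rows or columns are ever reordered. A more robust variant, sketched here on a two-row toy frame, selects the row by its `tconst` and the column by name.

```python
import pandas as pd

movies = pd.DataFrame({
    'tconst': ['tt1572165', 'tt2205589'],
    'title': ['Planet Blood', 'Rise of the Black Bat'],
    'director': [None, None],
})

# Select by movie ID and column label instead of integer position.
movies.loc[movies['tconst'] == 'tt2205589', 'director'] = 'Brett Kelly'
print(movies)
```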

In [19]:
mdf[mdf.isna().any(axis=1)]

Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,director,dir_dth_yr


### Save data to csv file

In [20]:
mdf.to_csv('data/movies_4.csv', index=False)

## 2e. Merging top 3 actors per movie into movie data
Note: The original file was called 'title.principals.tsv.gz'; I renamed it 'actors.tsv.gz'.

In [21]:
mdf = pd.read_csv('data/movies_4.csv')
act = pd.read_table('data/actors.tsv.gz')

In [22]:
print(mdf.shape)
print(act.shape)
act.head(2)

(177095, 9)
(38631701, 6)


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N


In [23]:
act['category'].unique()

array(['self', 'director', 'cinematographer', 'composer', 'producer',
       'editor', 'actor', 'actress', 'writer', 'production_designer',
       'archive_footage', 'archive_sound'], dtype=object)

### Only keep categories actor, actress, and self

In [24]:
act = act[(act['category'] == 'actor') | (act['category'] == 'actress') | (act['category'] == 'self')]
act = act.drop(columns = ['category', 'job', 'characters'])
act.head(2)

Unnamed: 0,tconst,ordering,nconst
0,tt0000001,1,nm1588970
11,tt0000005,1,nm0443482


### Create actor data sets based on billing order in the movie (1: lead, 2: second, 3: third)

In [25]:
act1 = act[act['ordering'] == 1]
act2 = act[act['ordering'] == 2]
act3 = act[act['ordering'] == 3]

### Merge actor names with each actor data set

In [26]:
ndf = pd.read_table('data/names.tsv.gz')
ndf.head(2)

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0043044,tt0072308,tt0053137,tt0050419"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0038355,tt0037382,tt0117057,tt0071877"


In [27]:
ndf = ndf.drop(columns = ['birthYear', 'deathYear', 'primaryProfession', 'knownForTitles'])
ndf.head(2)

Unnamed: 0,nconst,primaryName
0,nm0000001,Fred Astaire
1,nm0000002,Lauren Bacall


In [28]:
act1 = act1.merge(ndf, on='nconst', how='left')
act2 = act2.merge(ndf, on='nconst', how='left')
act3 = act3.merge(ndf, on='nconst', how='left')

act1 = act1.drop(columns = ['ordering', 'nconst'])
act2 = act2.drop(columns = ['ordering', 'nconst'])
act3 = act3.drop(columns = ['ordering', 'nconst'])

act1.head(2)

Unnamed: 0,tconst,primaryName
0,tt0000001,Carmencita
1,tt0000005,Charles Kayser


### Merge actor 1 (lead actor) with movie data set

In [29]:
mdf = mdf.merge(act1, on='tconst', how='left')
mdf = mdf.rename(columns={"primaryName": "actr1"})
mdf.head(2)

Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,director,dir_dth_yr,actr1
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0,Alberto Marro,1956,Dolores Puchol
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0,August Blom,1947,Ellen Diedrich


### Merge actor 2 with movie data set

In [30]:
mdf = mdf.merge(act2, on='tconst', how='left')
mdf = mdf.rename(columns={"primaryName": "actr2"})
mdf.head(2)

Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,director,dir_dth_yr,actr1,actr2
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0,Alberto Marro,1956,Dolores Puchol,Cecilio Rodríguez de la Vega
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0,August Blom,1947,Ellen Diedrich,Victor Fabian


### Merge actor 3 with movie data set

In [31]:
mdf = mdf.merge(act3, on='tconst', how='left')
mdf = mdf.rename(columns={"primaryName": "actr3"})
mdf.head(2)

Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,director,dir_dth_yr,actr1,actr2,actr3
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0,Alberto Marro,1956,Dolores Puchol,Cecilio Rodríguez de la Vega,
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0,August Blom,1947,Ellen Diedrich,Victor Fabian,Julie Henriksen
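
One thing to be aware of: `ordering` counts all principals, not only the cast (the sample of the principals file earlier shows a director at ordering 2). If a non-actor holds ordering 1, 2 or 3, filtering on those exact values leaves the matching actor column empty even when the movie has more actors listed. A possible alternative, sketched on made-up rows, ranks the cast within each movie and pivots the top three into columns.

```python
import pandas as pd

# Made-up principals: in 'tt1' the director is listed first.
principals = pd.DataFrame({
    'tconst':   ['tt1', 'tt1', 'tt1', 'tt1', 'tt2', 'tt2'],
    'ordering': [1, 2, 3, 4, 1, 2],
    'nconst':   ['nm9', 'nm1', 'nm2', 'nm3', 'nm4', 'nm8'],
    'category': ['director', 'actor', 'actress', 'actor', 'actress', 'writer'],
})

cast = principals[principals['category'].isin(['actor', 'actress', 'self'])].copy()
cast = cast.sort_values(['tconst', 'ordering'])

# Rank cast members within each movie: 1 = first billed actor, and so on.
cast['rank'] = cast.groupby('tconst').cumcount() + 1

top3 = (
    cast[cast['rank'] <= 3]
    .pivot(index='tconst', columns='rank', values='nconst')
    .rename(columns={1: 'actr1', 2: 'actr2', 3: 'actr3'})
    .reset_index()
)
print(top3)
```

The result has one row per movie, so it can be merged into the movie data in a single step; names can be attached by merging `nconst` against the names file before or after the pivot.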


In [32]:
mdf[mdf['title'] == 'Goodfellas']

Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,director,dir_dth_yr,actr1,actr2,actr3
48992,tt0099685,Goodfellas,1990,146,Drama,8.7,962223.0,Martin Scorsese,9999,Robert De Niro,Ray Liotta,Joe Pesci


### Save data to csv file

In [33]:
mdf.to_csv('data/movies_5.csv', index = False)


## 2f. Final data cleanup

In [3]:
mdf = pd.read_csv('data/movies_5.csv')
print(mdf.shape)
mdf.head(2)

(177095, 12)


Unnamed: 0,tconst,title,year,runtime,genre,imdb_score,num_scores,director,dir_dth_yr,actr1,actr2,actr3
0,tt0001184,Don Juan de Serrallonga,1910,58,Drama,3.1,11.0,Alberto Marro,1956,Dolores Puchol,Cecilio Rodríguez de la Vega,
1,tt0001258,The White Slave Trade,1910,45,Drama,5.7,79.0,August Blom,1947,Ellen Diedrich,Victor Fabian,Julie Henriksen


### Rename the ID column and reorder the columns

In [4]:
mdf = mdf.rename(columns={'tconst': 'mov_id'})
mdf = mdf.reindex(columns=['mov_id','title','genre','year','director', 'dir_dth_yr', 'actr1', 'actr2',
                          'actr3', 'runtime', 'imdb_score', 'num_scores'])

mdf.head(2)

Unnamed: 0,mov_id,title,genre,year,director,dir_dth_yr,actr1,actr2,actr3,runtime,imdb_score,num_scores
0,tt0001184,Don Juan de Serrallonga,Drama,1910,Alberto Marro,1956,Dolores Puchol,Cecilio Rodríguez de la Vega,,58,3.1,11.0
1,tt0001258,The White Slave Trade,Drama,1910,August Blom,1947,Ellen Diedrich,Victor Fabian,Julie Henriksen,45,5.7,79.0


### Clean up non-ASCII characters

In [5]:
# Decompose accented characters (NFKD), then drop everything that is not ASCII
for col in ['title', 'director', 'actr1', 'actr2', 'actr3']:
    mdf[col] = (mdf[col].str.normalize('NFKD')
                        .str.encode('ascii', errors='ignore')
                        .str.decode('ascii'))

In [6]:
mdf.head(2)

Unnamed: 0,mov_id,title,genre,year,director,dir_dth_yr,actr1,actr2,actr3,runtime,imdb_score,num_scores
0,tt0001184,Don Juan de Serrallonga,Drama,1910,Alberto Marro,1956,Dolores Puchol,Cecilio Rodriguez de la Vega,,58,3.1,11.0
1,tt0001258,The White Slave Trade,Drama,1910,August Blom,1947,Ellen Diedrich,Victor Fabian,Julie Henriksen,45,5.7,79.0
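
To make the effect of this step concrete, here is the same normalize/encode/decode chain on three strings taken from earlier outputs in this notebook. Accented letters such as 'í' and 'é' decompose into a base letter plus a combining mark, so the base letter survives. Letters with no decomposition, such as 'ø', are dropped entirely.

```python
import pandas as pd

names = pd.Series([
    'Cecilio Rodríguez de la Vega',
    'Les Misérables',
    'Den sorte drøm',
    None,
])

ascii_names = (
    names.str.normalize('NFKD')                 # 'í' -> 'i' + combining accent
         .str.encode('ascii', errors='ignore')  # drop anything non-ASCII
         .str.decode('ascii')                   # bytes back to str
)
print(ascii_names.tolist())
```

The third title comes out as 'Den sorte drm', so this cleanup is lossy for some names and titles; missing values stay missing.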


### Final NaN check
The checks below return no rows for any of the non-actor columns that are tested, so the only null values left are in the three actor columns.

In [7]:
print(mdf[mdf['mov_id'].isnull()])
print()
print(mdf[mdf['title'].isnull()])
print()
print(mdf[mdf['genre'].isnull()])
print()
print(mdf[mdf['year'].isnull()])
print()
print(mdf[mdf['director'].isnull()])
print()
print(mdf[mdf['runtime'].isnull()])
print()
print(mdf[mdf['imdb_score'].isnull()])
print()
print(mdf[mdf['num_scores'].isnull()])
print()

Empty DataFrame
Columns: [mov_id, title, genre, year, director, dir_dth_yr, actr1, actr2, actr3, runtime, imdb_score, num_scores]
Index: []

Empty DataFrame
Columns: [mov_id, title, genre, year, director, dir_dth_yr, actr1, actr2, actr3, runtime, imdb_score, num_scores]
Index: []

Empty DataFrame
Columns: [mov_id, title, genre, year, director, dir_dth_yr, actr1, actr2, actr3, runtime, imdb_score, num_scores]
Index: []

Empty DataFrame
Columns: [mov_id, title, genre, year, director, dir_dth_yr, actr1, actr2, actr3, runtime, imdb_score, num_scores]
Index: []

Empty DataFrame
Columns: [mov_id, title, genre, year, director, dir_dth_yr, actr1, actr2, actr3, runtime, imdb_score, num_scores]
Index: []

Empty DataFrame
Columns: [mov_id, title, genre, year, director, dir_dth_yr, actr1, actr2, actr3, runtime, imdb_score, num_scores]
Index: []

Empty DataFrame
Columns: [mov_id, title, genre, year, director, dir_dth_yr, actr1, actr2, actr3, runtime, imdb_score, num_scores]
Index: []

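Empty DataFrame
Columns: [mov_id, title, genre, year, director, dir_dth_yr, actr1, actr2, actr3, runtime, imdb_score, num_scores]
Index: []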
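
The per-column checks above can be condensed into one call that counts the nulls in every column. This is a sketch on a toy frame with made-up values; on the real data the same two lines would list only the columns that still contain nulls.

```python
import pandas as pd

toy = pd.DataFrame({
    'mov_id': ['tt1', 'tt2', 'tt3'],
    'director': ['A', 'B', 'C'],
    'actr1': ['X', None, None],
    'actr2': ['Y', None, 'Z'],
})

# Count missing values per column, then show only columns that have any.
null_counts = toy.isna().sum()
print(null_counts[null_counts > 0])
```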

### Movies usually have at least 1 actor, so I'm assuming that when all three actor columns are empty I'm dealing with either a really old silent film, an animated film, or a film so obscure that its actors aren't part of the Screen Actors Guild. At this stage, because I'm creating a hypothetical population, I am going to keep these in for completeness' sake. Most likely they will be weeded out during the sampling phase.

In [8]:
mdf[mdf['actr1'].isnull()]

Unnamed: 0,mov_id,title,genre,year,director,dir_dth_yr,actr1,actr2,actr3,runtime,imdb_score,num_scores
294,tt0007646,El apostol,Comedy,1917,Quirino Cristiani,1984,,,,70,6.8,39.0
545,tt0010108,The Fall of Babylon,Drama,1919,D.W. Griffith,1948,,,,82,6.8,93.0
598,tt0010484,The Mother and the Law,Drama,1919,D.W. Griffith,1948,,,,95,6.9,161.0
1350,tt0015256,The Radio Flyer,Drama,1924,Harry O. Hoyt,1961,,,,50,6.8,8.0
1398,tt0015532,The Adventures of Prince Achmed,Action,1926,Lotte Reiniger,1981,,,,80,7.8,5113.0
...,...,...,...,...,...,...,...,...,...,...,...,...
176552,tt9619150,Valley of Souls,Drama,2019,Nicolas Rincon Gille,9999,,,,137,7.5,44.0
176661,tt9666830,False Belief,Drama,2019,Lene Berg,9999,,,,105,7.0,6.0
176783,tt9741908,Breathless Animals,Drama,2019,Lei Lei,9999,,,,68,4.7,21.0
176865,tt9799044,Baggage,Drama,2018,Roopa Rao,9999,,,,115,7.8,6.0


In [9]:
yr_chk = mdf.copy()
yr_chk = yr_chk.drop(columns = ['mov_id', 'title', 'genre', 'director',
                                'actr1', 'actr2', 'actr3', 'runtime',
                                'imdb_score', 'num_scores'])

decade = 10 * (yr_chk['year'] // 10)
yr_chk['decade'] = decade
yr_chk.groupby('decade')['year'].count()

decade
1910      779
1920     2448
1930     7732
1940     7585
1950    10346
1960    13079
1970    15229
1980    16328
1990    17671
2000    30605
2010    54535
2020      758
Name: year, dtype: int64

In [10]:
mdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177095 entries, 0 to 177094
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   mov_id      177095 non-null  object 
 1   title       177095 non-null  object 
 2   genre       177095 non-null  object 
 3   year        177095 non-null  int64  
 4   director    177095 non-null  object 
 5   dir_dth_yr  177095 non-null  int64  
 6   actr1       176161 non-null  object 
 7   actr2       175497 non-null  object 
 8   actr3       174103 non-null  object 
 9   runtime     177095 non-null  int64  
 10  imdb_score  177095 non-null  float64
 11  num_scores  177095 non-null  float64
dtypes: float64(2), int64(3), object(7)
memory usage: 16.2+ MB


In [11]:
mdf = mdf.astype({'num_scores': 'int64'})

In [12]:
mdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177095 entries, 0 to 177094
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   mov_id      177095 non-null  object 
 1   title       177095 non-null  object 
 2   genre       177095 non-null  object 
 3   year        177095 non-null  int64  
 4   director    177095 non-null  object 
 5   dir_dth_yr  177095 non-null  int64  
 6   actr1       176161 non-null  object 
 7   actr2       175497 non-null  object 
 8   actr3       174103 non-null  object 
 9   runtime     177095 non-null  int64  
 10  imdb_score  177095 non-null  float64
 11  num_scores  177095 non-null  int64  
dtypes: float64(1), int64(4), object(7)
memory usage: 16.2+ MB


In [13]:
mdf.to_csv('data/movies_final.csv', index = False)

In [14]:
mdf = pd.read_csv('data/movies_final.csv')

In [15]:
mdf.shape

(177095, 12)

In [16]:
mdf.head(2)

Unnamed: 0,mov_id,title,genre,year,director,dir_dth_yr,actr1,actr2,actr3,runtime,imdb_score,num_scores
0,tt0001184,Don Juan de Serrallonga,Drama,1910,Alberto Marro,1956,Dolores Puchol,Cecilio Rodriguez de la Vega,,58,3.1,11
1,tt0001258,The White Slave Trade,Drama,1910,August Blom,1947,Ellen Diedrich,Victor Fabian,Julie Henriksen,45,5.7,79
