## Data cleaning and preparation

`data.tsv` was downloaded from the official IMDb Developer website (https://developer.imdb.com/non-commercial-datasets/). It contains over 10,000,000 rows, each representing one title (a movie, short, TV episode, video game, and so on).
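
The IMDb dataset documentation states that `\N` marks a missing field. The following is a minimal sketch, assuming `data.tsv` follows that convention, of letting pandas treat `\N` as missing at load time. The two-row `sample` string and the name `df_sample` are illustrative stand-ins, not the real file, and the `quoting` and `low_memory` settings are optional choices rather than something the notebook uses.

```python
import csv
import io

import pandas as pd

# Tiny illustrative sample standing in for data.tsv (not the real file).
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000502\tmovie\tBohemios\t1905\t\\N\t100\t\\N\n"
)

df_sample = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    na_values="\\N",         # treat the \N placeholder as a missing value
    quoting=csv.QUOTE_NONE,  # do not interpret quote characters inside titles
    low_memory=False,        # infer each column's dtype from the whole input
)

# Number of missing values per column.
print(df_sample.isna().sum())
```

With the real file, the same keyword arguments could be passed to `pd.read_csv('data.tsv', ...)`, so that a later `dropna()` also removes rows whose fields were `\N`.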

#### Filtering the data to include only movies and TV titles, removing unnecessary columns, and saving the data to a new file

In [9]:
import pandas as pd

df = pd.read_csv('data.tsv', sep='\t')
df.head()


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [10]:
print("Titles in dataset: {:,}".format(df.shape[0]))

Titles in dataset: 10,345,990


In [11]:
# get unique values in titleType column
df.titleType.unique()

array(['short', 'movie', 'tvShort', 'tvMovie', 'tvSeries', 'tvEpisode',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame', 'tvPilot'],
      dtype=object)
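
Before deciding which title types to keep, it can help to see how many rows each type contributes. The following is a small sketch on a hypothetical miniature frame (`toy` is made up for illustration); `keep` mirrors the list used in the filtering cell.

```python
import pandas as pd

# Hypothetical miniature frame; the real df has over 10 million rows.
toy = pd.DataFrame(
    {
        "titleType": [
            "short", "movie", "tvEpisode", "tvEpisode",
            "videoGame", "movie", "tvEpisode",
        ]
    }
)

# Rows per title type, most frequent first.
counts = toy["titleType"].value_counts()
print(counts)

# Share of rows that a filter on these title types would keep.
keep = ["movie", "tvMovie", "tvSeries", "tvEpisode", "tvMiniSeries", "tvSpecial"]
kept_share = toy["titleType"].isin(keep).mean()
print(f"Share kept: {kept_share:.1%}")
```

On the real data, `df.titleType.value_counts()` would give the same kind of breakdown.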

In [12]:
# drop the columns that are not needed
df = df.drop(columns=['tconst', 'endYear'])

# include only movie, tvMovie, tvSeries, tvEpisode, tvMiniSeries, tvSpecial
df = df[df.titleType.isin(['movie', 'tvMovie', 'tvSeries', 'tvEpisode', 'tvMiniSeries', 'tvSpecial'])]

print("Titles after filtering: {:,}".format(df.shape[0]))

Titles after filtering: 9,052,428


In [13]:
df.head()

Unnamed: 0,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
8,movie,Miss Jerry,Miss Jerry,0,1894,45,Romance
144,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,100,"Documentary,News,Sport"
498,movie,Bohemios,Bohemios,0,1905,100,\N
570,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,70,"Action,Adventure,Biography"
587,movie,The Prodigal Son,L'enfant prodigue,0,1907,90,Drama


In [14]:
# reset index
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,movie,Miss Jerry,Miss Jerry,0,1894,45,Romance
1,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,100,"Documentary,News,Sport"
2,movie,Bohemios,Bohemios,0,1905,100,\N
3,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,70,"Action,Adventure,Biography"
4,movie,The Prodigal Son,L'enfant prodigue,0,1907,90,Drama


In [19]:
# drop rows with missing values (NaN only; the literal '\N' placeholders are strings and are kept)
df = df.dropna()
print("Titles after dropping missing values: {:,}".format(df.shape[0]))

Titles after dropping missing values: 9,052,396
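
Only 32 rows are dropped here because `dropna()` looks for `NaN`, while the `\N` placeholders visible in the `head()` output are ordinary strings. The following is a hedged sketch of one way to remove those rows as well, using a hypothetical three-row frame (`toy`) modelled on that output; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows modelled on the head() output above.
toy = pd.DataFrame(
    {
        "primaryTitle": ["Miss Jerry", "Bohemios", "The Prodigal Son"],
        "startYear": ["1894", "1905", "1907"],
        "runtimeMinutes": ["45", "100", "\\N"],
        "genres": ["Romance", "\\N", "Drama"],
    }
)

# dropna() alone removes nothing, because "\N" is an ordinary string.
print(len(toy.dropna()))

# Convert the placeholder to NaN first, then drop the affected rows.
cleaned = toy.replace("\\N", np.nan).dropna()

# With the placeholders gone, the numeric columns can be cast to integers.
cleaned = cleaned.astype({"startYear": "int64", "runtimeMinutes": "int64"})
print(cleaned)
```

Applied to the real `df`, this would drop far more rows than the count printed above, so the trade-off between completeness and dataset size is worth checking first.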


In [18]:
df.head()

Unnamed: 0,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,movie,Miss Jerry,Miss Jerry,0,1894,45,Romance
1,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,100,"Documentary,News,Sport"
2,movie,Bohemios,Bohemios,0,1905,100,\N
3,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,70,"Action,Adventure,Biography"
4,movie,The Prodigal Son,L'enfant prodigue,0,1907,90,Drama


In [20]:
# save to .csv
df.to_csv('data.csv', index=False)
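
The notebook loads all rows into memory before filtering. If memory is tight, the same filter-and-save steps could be streamed in chunks. The following is a sketch under that assumption: `filter_in_chunks`, `KEEP`, and the sample rows are hypothetical names and data introduced for illustration.

```python
import io

import pandas as pd

KEEP = ["movie", "tvMovie", "tvSeries", "tvEpisode", "tvMiniSeries", "tvSpecial"]


def filter_in_chunks(src, dst, chunksize=1_000_000):
    """Stream a TSV from src to an open text handle dst as CSV.

    Keeps only rows whose titleType is in KEEP and drops the
    tconst and endYear columns. Returns the number of rows written.
    """
    total = 0
    reader = pd.read_csv(src, sep="\t", chunksize=chunksize)
    for i, chunk in enumerate(reader):
        chunk = chunk[chunk["titleType"].isin(KEEP)]
        chunk = chunk.drop(columns=["tconst", "endYear"])
        # Write the header only with the first chunk.
        chunk.to_csv(dst, index=False, header=(i == 0))
        total += len(chunk)
    return total


# Illustrative input standing in for data.tsv.
sample = (
    "tconst\ttitleType\tprimaryTitle\tendYear\n"
    "tt1\tshort\tCarmencita\t\\N\n"
    "tt2\tmovie\tMiss Jerry\t\\N\n"
    "tt3\tvideoGame\tExample Game\t\\N\n"
    "tt4\ttvEpisode\tExample Episode\t\\N\n"
)
out = io.StringIO()
kept = filter_in_chunks(io.StringIO(sample), out, chunksize=2)
print(kept)
print(out.getvalue())
```

For the real file, `src` could be `'data.tsv'` and `dst` a handle from `open('data.csv', 'w', newline='')`. The function expects an open handle rather than a path for `dst` so that successive chunks are appended instead of overwriting each other.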