## Genre Genie - Multi-label Classification with NLP
### Part 1.1: IMDb dataset

#### Tom Keith

---

**Goal:** Explore IMDb datasets.

IMDb offers datasets with loads of information. I explored these sets to see what I could use while still figuring out what direction to go with this project.

More information on these datasets and the features they have can be found here: https://www.imdb.com/interfaces/

I will be using the direct links for the compressed `.tsv` files: https://datasets.imdbws.com/

---

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

pd.set_option('display.max_rows', 200)

---
**Exploring IMDb Datasets**

Loop through the datasets to peek at what each one looks like. While these files are updated daily, the data used throughout this project was fetched on February 2, 2020.

An important note: `NULL` values are represented as `\N` in these sets.
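
Because `\N` is just a string as far as pandas is concerned, it may help to see how it could be mapped to a real missing value at load time. The following is a minimal sketch on a tiny in-memory sample (the `sample_tsv` text and variable names are made up for illustration, and this is not how the cells in this notebook load the data); it relies on the `na_values` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in for an IMDb .tsv file (illustrative rows, tab-separated)
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000502\tmovie\tBohemios\t1905\t\\N\t100\t\\N\n"
)

# na_values tells pandas to treat the literal string \N as missing
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# With the sentinel mapped to NaN, isna() reports the gaps per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

The cells in this notebook read the files without `na_values`, so `\N` stays a plain string and has to be compared against explicitly.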

In [47]:
%%time
imdb_api_file_list = ['title.basics.tsv.gz','title.ratings.tsv.gz','name.basics.tsv.gz','title.principals.tsv.gz','title.crew.tsv.gz','title.akas.tsv.gz','title.episode.tsv.gz']

for package in imdb_api_file_list:
    package_file_name = f'https://datasets.imdbws.com/{package}'
    display(pd.read_csv(package_file_name, sep='\t', low_memory=False))

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
6672689,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2010,\N,\N,"Action,Drama,Family"
6672690,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family"
6672691,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
6672692,tt9916856,short,The Wind,The Wind,0,2015,\N,27,Short


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1591
1,tt0000002,6.1,194
2,tt0000003,6.5,1264
3,tt0000004,6.2,120
4,tt0000005,6.1,2025
...,...,...,...
1019001,tt9916576,6.0,9
1019002,tt9916578,8.5,16
1019003,tt9916720,5.5,48
1019004,tt9916766,6.8,13


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0050419,tt0072308,tt0053137,tt0043044"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0117057,tt0038355,tt0037382,tt0071877"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0054452,tt0057345,tt0059956,tt0049189"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0072562,tt0077975,tt0080455,tt0078723"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0069467,tt0050986,tt0050976,tt0083922"
...,...,...,...,...,...,...
9982871,nm9993714,Romeo del Rosario,\N,\N,"animation_department,art_department",tt2455546
9982872,nm9993716,Essias Loberg,\N,\N,,\N
9982873,nm9993717,Harikrishnan Rajan,\N,\N,cinematographer,tt8736744
9982874,nm9993718,Aayush Nair,\N,\N,cinematographer,\N


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N
...,...,...,...,...,...,...
38538893,tt9916880,5,nm0996406,director,principal director,\N
38538894,tt9916880,6,nm1482639,writer,\N,\N
38538895,tt9916880,7,nm2586970,writer,books,\N
38538896,tt9916880,8,nm1594058,producer,producer,\N


Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N
...,...,...,...
6672689,tt9916848,"nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
6672690,tt9916850,"nm5519375,nm5519454","nm6182221,nm1628284,nm2921377"
6672691,tt9916852,"nm5519375,nm5519454","nm6182221,nm1628284,nm2921377"
6672692,tt9916856,nm10538645,nm6951431


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Carmencita,DE,\N,\N,literal title,0
1,tt0000001,2,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
2,tt0000001,3,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
3,tt0000001,4,Карменсита,RU,\N,imdbDisplay,\N,0
4,tt0000001,5,Carmencita,US,\N,\N,\N,0
...,...,...,...,...,...,...,...,...
20839948,tt9916852,3,Folge #3.20,DE,de,\N,\N,0
20839949,tt9916852,4,エピソード #3.20,JP,ja,\N,\N,0
20839950,tt9916852,5,Episódio #3.20,PT,pt,\N,\N,0
20839951,tt9916852,6,Episodio #3.20,IT,it,\N,\N,0


Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber
0,tt0041951,tt0041038,1,9
1,tt0042816,tt0989125,1,17
2,tt0042889,tt0989125,\N,\N
3,tt0043426,tt0040051,3,42
4,tt0043631,tt0989125,2,16
...,...,...,...,...
4737120,tt9916846,tt1289683,3,18
4737121,tt9916848,tt1289683,3,17
4737122,tt9916850,tt1289683,3,19
4737123,tt9916852,tt1289683,3,20


Wall time: 2min 40s


There is a lot of great data here! However, there isn't much text to work with: no plot summaries or even taglines. I will have to scrape for that information.

I only want movie results (no TV or people), so I'm going to focus on `title.basics.tsv.gz` and `title.ratings.tsv.gz`.

---

Save `title.basics` and `ratings` into their own dataframe, merge them together on `tconst`, and explore.

In [12]:
df_basics  = pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz',  sep='\t', low_memory=False)
df_ratings = pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', sep='\t', low_memory=False)
# Inner join (the default): only titles present in both files are kept
df_merge = pd.merge(df_basics, df_ratings, on='tconst')
df_merge

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.6,1591
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",6.1,194
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",6.5,1264
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short",6.2,120
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short",6.1,2025
...,...,...,...,...,...,...,...,...,...,...,...
1018999,tt9916576,tvEpisode,Destinee's Story,Destinee's Story,0,2019,\N,85,Reality-TV,6.0,9
1019000,tt9916578,tvEpisode,The Trial of Joan Collins,The Trial of Joan Collins,0,2019,\N,\N,"Adventure,Biography,Comedy",8.5,16
1019001,tt9916720,short,The Nun 2,The Nun 2,0,2019,\N,10,"Comedy,Horror,Mystery",5.5,48
1019002,tt9916766,tvEpisode,Episode #10.15,Episode #10.15,0,2019,\N,43,"Family,Reality-TV",6.8,13


We have 1,000,000+ rows with basic movie information, now including number of votes and rating. This additional information will help to pare down this dataframe.
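
The row count drops from the size of `title.basics` to the size of `title.ratings` because `pd.merge` performs an inner join by default, so titles without a rating row are discarded. A small sketch of that behavior on toy frames (the IDs `tt_a`, `tt_b`, `tt_c` and all values are placeholders, not real IMDb records):

```python
import pandas as pd

# Toy stand-ins for title.basics and title.ratings
basics = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c"],
    "primaryTitle": ["Title A", "Title B", "Title C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt_a", "tt_c"],
    "averageRating": [5.6, 6.5],
    "numVotes": [1591, 1264],
})

# Default how="inner": only tconst values found in both frames survive
inner = pd.merge(basics, ratings, on="tconst")

# A left join with indicator=True shows which titles had no rating row
left = pd.merge(basics, ratings, on="tconst", how="left", indicator=True)
unrated = left[left["_merge"] == "left_only"]

print(len(inner), unrated["tconst"].tolist())
```

For this project the inner join is convenient, since titles with no votes would be filtered out later anyway.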

There are still a LOT of rows that aren't needed. For example, anything where the `titleType` is not "movie" can be ignored.

In [13]:
df_merge.dtypes

tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult             int64
startYear          object
endYear            object
runtimeMinutes     object
genres             object
averageRating     float64
numVotes            int64
dtype: object

The data types of our new dataframe are mostly `object` (string) types. Years and runtime should be integers. However, we don't need to worry about these here.
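
For reference, one way these string columns could be converted later is sketched below. It assumes the `\N` placeholders are still present as strings, uses a toy frame rather than the real data, and picks pandas' nullable `Int64` dtype so that missing values do not force the column to float. This is an illustration, not the conversion used further down in this notebook.

```python
import pandas as pd

# Toy frame mimicking the object-typed columns, including the \N sentinel
toy = pd.DataFrame({
    "startYear": ["1894", "1905", "\\N"],
    "runtimeMinutes": ["1", "\\N", "100"],
})

for col in ["startYear", "runtimeMinutes"]:
    # errors="coerce" turns anything non-numeric (including \N) into NaN;
    # the nullable Int64 dtype keeps integers while allowing missing values
    toy[col] = pd.to_numeric(toy[col], errors="coerce").astype("Int64")

print(toy.dtypes)
```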

---

Create a slimmed-down dataframe with the following changes:
- Remove unreleased titles (where `startYear` is NULL, i.e. `\N`)
- Keep only titles of type 'movie'
- Exclude adult films, then drop the `isAdult` column
- Drop `endYear` as it only applies to TV
- Rename `startYear` to `year` and move it toward the front (right after `titleType`)

In [17]:
df_slim = df_merge
# Remove unreleased, non-movies, adult
df_slim = df_slim.drop(df_slim[df_slim.startYear == '\\N'].index)
df_slim = df_slim[ (df_slim['titleType'] == 'movie' ) & (df_slim['isAdult'] == 0) ]
df_slim = df_slim.drop(['endYear', 'isAdult'], axis=1)

# Reformat year column
df_slim.insert(loc=2, column='year', value=df_slim['startYear'])
df_slim = df_slim.drop(['startYear'], axis=1)
df_slim

Unnamed: 0,tconst,titleType,year,primaryTitle,originalTitle,runtimeMinutes,genres,averageRating,numVotes
8,tt0000009,movie,1894,Miss Jerry,Miss Jerry,45,Romance,5.4,89
144,tt0000147,movie,1897,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,20,"Documentary,News,Sport",5.2,333
251,tt0000335,movie,1900,Soldiers of the Cross,Soldiers of the Cross,\N,"Biography,Drama",6.1,40
327,tt0000502,movie,1905,Bohemios,Bohemios,100,\N,4.4,5
361,tt0000574,movie,1906,The Story of the Kelly Gang,The Story of the Kelly Gang,70,"Biography,Crime,Drama",6.1,562
...,...,...,...,...,...,...,...,...,...
1018959,tt9914942,movie,2019,La vida sense la Sara Amat,La vida sense la Sara Amat,74,Drama,6.7,76
1018974,tt9915790,movie,2019,Bobbyr Bondhura,Bobbyr Bondhura,\N,Family,7.6,13
1018987,tt9916160,movie,2019,Drømmeland,Drømmeland,72,Documentary,6.6,36
1018996,tt9916428,movie,2019,The Secret of China,The Secret of China,\N,"Adventure,History,War",3.3,11


In [65]:
df_slim.isna().sum()

tconst            0
titleType         0
year              0
primaryTitle      0
originalTitle     0
runtimeMinutes    0
genres            0
averageRating     0
numVotes          0
dtype: int64
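
The all-zero counts above should be read with care: the files were loaded without mapping `\N` to a missing value, so `isna()` cannot see those placeholders, and the `df_slim` preview still shows `\N` in `runtimeMinutes` and `genres`. A minimal sketch of counting the sentinel directly, on a toy frame with placeholder IDs:

```python
import pandas as pd

# Toy frame with the \N sentinel stored as an ordinary string
toy = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c"],
    "runtimeMinutes": ["45", "\\N", "100"],
    "genres": ["Romance", "Biography,Drama", "\\N"],
})

# isna() reports nothing, because \N is not NaN
nan_counts = toy.isna().sum()

# Comparing against the sentinel counts the placeholders per column
sentinel_counts = (toy == "\\N").sum()

print(nan_counts.sum(), sentinel_counts.to_dict())
```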

In [26]:
dft = df_slim[(df_slim['tconst'].isin(['tt8946378','tt0076759']))]
dft

Unnamed: 0,tconst,titleType,year,primaryTitle,originalTitle,runtimeMinutes,genres,averageRating,numVotes
51794,tt0076759,movie,1977,Star Wars: Episode IV - A New Hope,Star Wars,121,"Action,Adventure,Fantasy",8.6,1170498
998631,tt8946378,movie,2019,Knives Out,Knives Out,131,"Comedy,Crime,Drama",8.0,140682


Down to about 240,000 rows after those modifications.

I pulled up a quick sample to check. I've watched both of these movies, and the genre lists immediately look incomplete:
- Star Wars is missing Sci-Fi
- Knives Out is missing Mystery

It turns out that genres are limited to a count of 3, and it's only the first 3 alphabetically. That is not reliable enough to correctly classify genres. As you can see from the screenshots of the title page (https://www.imdb.com/title/tt0076759/), there are actually 4 genres associated with this title.
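
A quick way to check the "at most 3, in alphabetical order" observation is to split the comma-separated `genres` string. The sketch below uses three genre strings copied from the outputs in this notebook; the variable names are illustrative, and running the same check on the full dataframe is left as an exercise.

```python
import pandas as pd

# Genre strings as they appear in the dataset outputs above
toy = pd.DataFrame({
    "primaryTitle": [
        "Star Wars: Episode IV - A New Hope",
        "Knives Out",
        "The Shawshank Redemption",
    ],
    "genres": ["Action,Adventure,Fantasy", "Comedy,Crime,Drama", "Drama"],
})

# Split the comma-separated string into a list of genres per title
genre_lists = toy["genres"].str.split(",")

# Number of genres per title, and whether each list is already sorted
genre_counts = genre_lists.str.len()
is_sorted = genre_lists.apply(lambda g: g == sorted(g))

print(genre_counts.max(), is_sorted.all())
```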

I am going to scrape this page for all the information I want (storyline / plot summary, FULL genre list). All I need is IMDb's title ID - which is `tconst` in this dataset.

![](images/imdb-top.png)

![](images/imdb-bottom.png)

---

### New Plan:
#### Export list of IMDb IDs so I can make the scraping url

In [51]:
df_slim.dtypes

tconst             object
titleType          object
year               object
primaryTitle       object
originalTitle      object
runtimeMinutes     object
genres             object
averageRating     float64
numVotes            int64
dtype: object

I want to do integer comparison on `year`, but it is currently an `object`. I need to fix that now.

In [53]:
# Clean year column
df_slim['year'] = df_slim['year'].fillna(0.0).astype(int)

In [55]:
final_df = df_slim[(df_slim['year'] >= 1920) & (df_slim['numVotes'] > 1000)].sort_values(['numVotes'], ascending=False)
display(final_df)

Unnamed: 0,tconst,titleType,year,primaryTitle,originalTitle,runtimeMinutes,genres,averageRating,numVotes
80744,tt0111161,movie,1994,The Shawshank Redemption,The Shawshank Redemption,142,Drama,9.3,2203956
241990,tt0468569,movie,2008,The Dark Knight,The Dark Knight,152,"Action,Crime,Drama",9.0,2184629
530003,tt1375666,movie,2010,Inception,Inception,148,"Action,Adventure,Sci-Fi",8.8,1933557
96873,tt0137523,movie,1999,Fight Club,Fight Club,139,Drama,8.8,1759843
80528,tt0110912,movie,1994,Pulp Fiction,Pulp Fiction,154,"Crime,Drama",8.9,1731665
...,...,...,...,...,...,...,...,...,...
843975,tt5275476,movie,2017,Bedbugs,Fikkefuchs,101,"Comedy,Drama",6.2,1001
56437,tt0082210,movie,1981,El crack,El crack,131,"Crime,Drama,Mystery",7.3,1001
25596,tt0045992,movie,1952,The Lawless Breed,The Lawless Breed,83,Western,6.3,1001
520460,tt1327833,movie,2008,Sorry Bhai!,Sorry Bhai!,154,"Comedy,Drama,Romance",6.1,1001


This looks like a good amount, approx. 30,000 titles to scrape. I'm using an arbitrary threshold where `numVotes` is greater than 1000. I'm hoping those entries have some decent level of accuracy if they are that popular.

Additionally, I'm limiting it to 1920 and later for an even '100 years' of movies.
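
Since the vote threshold is arbitrary, it can be useful to see how many titles survive at a few different cutoffs before committing to one. A small sketch, using a handful of `numVotes` values taken from the outputs above and hypothetical cutoffs of 100, 1000, and 10000:

```python
import pandas as pd

# A few numVotes values as seen in the outputs above
votes = pd.Series([5, 40, 89, 333, 562, 1001, 140682, 1170498], name="numVotes")

# Candidate cutoffs (illustrative; the notebook uses numVotes > 1000)
thresholds = [100, 1000, 10000]

# Count how many titles remain above each cutoff
kept = {t: int((votes > t).sum()) for t in thresholds}
print(kept)
```

On the real dataframe the same comparison would be run against `df_slim['numVotes']`.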

---

I'm not going to scrape 2020, but this cell is for future use to see how much 'new' training data I will have.

In [61]:
final_df[final_df['year'] >= 2020].shape

(71, 9)

Finally, export this list to a `.csv` file so I can access it later for scraping using `tconst` id.

In [60]:
final_df.to_csv('imdb_movie_list.csv', header=True, index=False)
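
As a preview of how the exported IDs could be turned into scraping targets, here is a minimal sketch. It assumes the title page URL follows the pattern of the Star Wars link shown earlier (`https://www.imdb.com/title/<tconst>/`) and uses two IDs from this notebook instead of reading the `.csv` back in.

```python
import pandas as pd

# In practice these would come from pd.read_csv('imdb_movie_list.csv')['tconst']
ids = pd.Series(["tt0076759", "tt8946378"], name="tconst")

# Build one title-page URL per IMDb ID (URL pattern assumed from the link above)
urls = "https://www.imdb.com/title/" + ids + "/"

for url in urls:
    print(url)
```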

---

Genre Genie - Multi-label classification using NLP

Tom Keith - 2020