# Introduction to Data Science - Week 7: Random Forests and Feature Engineering

In [89]:
# Import libraries 
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
pd.set_option("display.max_rows",1000)
pd.set_option("display.max_columns",1000)

## Dataset

The training dataset for this week comes from [Netflix TV Shows and Movies](https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies). The dataset lists all shows available on Netflix streaming so that they can be analyzed for interesting facts. The data was acquired in July 2022 and covers titles available in the United States. The `titles.csv` file contains more than 5,000 unique titles with 15 columns of information:

* id: The title ID on JustWatch.
* title: The name of the title.
* type: TV show (SHOW) or movie (MOVIE).
* description: A brief description.
* release_year: The release year.
* age_certification: The age certification.
* runtime: The length of the episode (SHOW) or movie.
* genres: A list of genres.
* production_countries: A list of countries that produced the title.
* seasons: Number of seasons if it's a SHOW.
* imdb_id: The title ID on IMDB.
* imdb_score: Score on IMDB.
* imdb_votes: Votes on IMDB.
* tmdb_popularity: Popularity on TMDB.
* tmdb_score: Score on TMDB.

In [90]:
# Read in data 
df = pd.read_csv("../data/titles.csv")
df

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.010,7.300
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5845,tm1014599,Fine Wine,MOVIE,A beautiful love story that can happen between...,2021,,100,"['romance', 'drama']",['NG'],,tt13857480,6.8,45.0,1.466,
5846,tm898842,C/O Kaadhal,MOVIE,A heart warming film that explores the concept...,2021,,134,['drama'],[],,tt11803618,7.7,348.0,,
5847,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021,,90,['comedy'],['CO'],,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarrassing Me - The Afterparty,MOVIE,"Jamie Foxx, David Alan Grier and more from the...",2021,PG-13,37,[],['US'],,,,,1.296,10.000


In [91]:
df.columns

Index(['id', 'title', 'type', 'description', 'release_year',
       'age_certification', 'runtime', 'genres', 'production_countries',
       'seasons', 'imdb_id', 'imdb_score', 'imdb_votes', 'tmdb_popularity',
       'tmdb_score'],
      dtype='object')

We can see that there are quite a few null values in the **'seasons'** column. However, movies have no seasons, so a missing value there is expected rather than an error. Therefore, before removing null values overall, I replace the nulls in this column with **0**.
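The assumption that missing seasons belong to movies can be checked before filling. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from `titles.csv`): it cross-tabulates the title type against missingness and fills only the movie rows, so a SHOW with an unknown season count would stay NaN.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the `type` and `seasons` columns (values are made up)
toy = pd.DataFrame({
    "type": ["SHOW", "MOVIE", "MOVIE", "SHOW"],
    "seasons": [1.0, np.nan, np.nan, 4.0],
})

# Cross-tabulate title type against "seasons is missing"
missing_by_type = pd.crosstab(toy["type"], toy["seasons"].isna())
print(missing_by_type)

# Fill only the movie rows; a SHOW with a missing value would remain NaN
is_movie = toy["type"] == "MOVIE"
toy.loc[is_movie, "seasons"] = toy.loc[is_movie, "seasons"].fillna(0)
print(toy)
```

If the cross-tabulation shows missing values for SHOW rows as well, filling everything with 0 would hide a genuine gap in the data.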


In [92]:
# Replace NaN in "seasons" with 0 (plain assignment instead of a chained inplace fill)
df["seasons"] = df["seasons"].fillna(0)
df

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],0.0,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],0.0,tt0068473,7.7,107673.0,10.010,7.300
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],0.0,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",0.0,tt0061578,7.7,72662.0,20.398,7.600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5845,tm1014599,Fine Wine,MOVIE,A beautiful love story that can happen between...,2021,,100,"['romance', 'drama']",['NG'],0.0,tt13857480,6.8,45.0,1.466,
5846,tm898842,C/O Kaadhal,MOVIE,A heart warming film that explores the concept...,2021,,134,['drama'],[],0.0,tt11803618,7.7,348.0,,
5847,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021,,90,['comedy'],['CO'],0.0,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarrassing Me - The Afterparty,MOVIE,"Jamie Foxx, David Alan Grier and more from the...",2021,PG-13,37,[],['US'],0.0,,,,1.296,10.000


The **'genres'** and **'production_countries'** columns also contain some empty lists (**'[]'**). I remove the rows containing these empty brackets.
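As an alternative to matching the text `[]` in every column, the two list-like columns can be parsed and checked directly. This is only a sketch on a made-up toy frame; `list_columns` and `has_empty_list` are illustrative names, and it assumes every cell is a valid Python list literal.

```python
import ast
import pandas as pd

# Toy frame with list-like strings, as stored in the CSV (values are made up)
toy = pd.DataFrame({
    "genres": ["['drama', 'crime']", "[]", "['comedy']"],
    "production_countries": ["['US']", "['US']", "[]"],
})

list_columns = ["genres", "production_countries"]

# Parse each string such as "['drama', 'crime']" into a real Python list
parsed = toy[list_columns].apply(lambda col: col.map(ast.literal_eval))

# A row is flagged when either of its lists is empty
has_empty_list = parsed.apply(lambda col: col.map(len) == 0).any(axis=1)
cleaned = toy[~has_empty_list]
print(cleaned)
```

Restricting the check to these two columns avoids dropping a row just because some other text field, such as a description, happens to contain `[]`.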

In [93]:
# Find the rows where any column contains '[]' (raw string keeps the regex escapes valid)
rows_to_drop = df[df.apply(lambda x: x.astype(str).str.contains(r'\[\]').any(), axis=1)].index

# Remove them
df = df.drop(index=rows_to_drop)
df

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],0.0,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],0.0,tt0068473,7.7,107673.0,10.010,7.300
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],0.0,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",0.0,tt0061578,7.7,72662.0,20.398,7.600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5839,tm1165179,Kongsi Raya,MOVIE,Jack - a Chinese chef-manager who is in-line t...,2022,,102,['comedy'],['MY'],0.0,tt16806990,7.0,66.0,2.112,
5841,tm985215,Princess 'Daya'Reese,MOVIE,Reese is a con artist from Manila who dreams o...,2021,,115,"['comedy', 'romance']",['PH'],0.0,tt13399802,7.1,50.0,1.383,
5843,tm1097142,My Bride,MOVIE,The story follows a young man and woman who go...,2021,,93,"['romance', 'comedy', 'drama']",['EG'],0.0,tt14216488,5.0,327.0,2.545,5.300
5845,tm1014599,Fine Wine,MOVIE,A beautiful love story that can happen between...,2021,,100,"['romance', 'drama']",['NG'],0.0,tt13857480,6.8,45.0,1.466,


Now we can remove the remaining rows that contain NaN.

In [94]:
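# Take an explicit copy so that later column assignments do not raise SettingWithCopyWarning
df = df.dropna(axis=0).copy()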
df

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],0.0,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],0.0,tt0068473,7.7,107673.0,10.010,7.300
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],0.0,tt0071853,8.2,534486.0,15.461,7.811
5,ts22164,Monty Python's Flying Circus,SHOW,A British sketch comedy series with the shows ...,1969,TV-14,30,"['comedy', 'european']",['GB'],4.0,tt0063929,8.8,73424.0,17.617,8.306
6,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],0.0,tt0079470,8.0,395024.0,17.770,7.800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5798,tm1099320,Convergence: Courage in a Crisis,MOVIE,Activists and volunteers work through the dark...,2021,R,113,['documentation'],"['GB', 'US']",0.0,tt15398694,5.4,262.0,6.589,5.200
5800,tm982470,Stuck Apart,MOVIE,"Entrenched in a midlife crisis, Aziz seeks sol...",2021,R,96,"['comedy', 'drama']",['TR'],0.0,tt11213372,6.0,10515.0,7.403,6.200
5801,ts270616,We Are: The Brooklyn Saints,SHOW,A Brooklyn youth football program and its self...,2021,TV-14,47,"['documentation', 'sport']",['US'],1.0,tt13656220,6.5,162.0,2.523,10.000
5819,ts287729,Alma Matters: Inside the IIT Dream,SHOW,"In a ""nation of middle-class"" the IIT dream in...",2021,TV-MA,49,"['documentation', 'drama']",['IN'],1.0,tt14512938,8.3,2346.0,1.493,9.000


## Classify the genres and build the classification model

Now that I've processed my dataset, I want to build a classification model that predicts the genre of a film or TV show from its other attributes.

In [95]:
from sklearn.preprocessing import LabelEncoder
encoded_y = LabelEncoder().fit_transform(df["genres"])
print(encoded_y)
print(len(encoded_y), len(df["genres"]))

[619 556 820 ... 526 504 695]
2904 2904


In [96]:
print(np.unique(encoded_y))

[   0    1    2 ... 1241 1242 1243]


In [97]:
df["genres"].value_counts().head(30)

genres
['comedy']                           141
['documentation']                     96
['drama']                             88
['reality']                           78
['comedy', 'drama']                   65
['drama', 'romance']                  60
['drama', 'comedy']                   53
['documentation', 'crime']            40
['comedy', 'drama', 'romance']        35
['drama', 'comedy', 'romance']        28
['comedy', 'documentation']           26
['crime', 'drama', 'thriller']        25
['drama', 'crime']                    25
['documentation', 'sport']            23
['drama', 'thriller', 'crime']        22
['crime', 'documentation']            21
['comedy', 'romance']                 21
['comedy', 'family']                  20
['drama', 'sport']                    19
['drama', 'crime', 'thriller']        19
['thriller', 'drama']                 18
['drama', 'romance', 'comedy']        18
['horror', 'thriller']                16
['animation', 'family']               16
['drama',

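Each distinct genre list is encoded as its own class, which gives 1,244 classes for only 2,904 titles. That is too many to learn from, so I'm going to collapse the genres into 3 classes: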

1. Genres containing comedy;
2. Genres containing documentation (excluding comedy);
3. All other movies and TV shows.
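The three classes listed above can be sketched with `np.select`, which assigns the first matching condition so that comedy takes priority over documentation. The genre strings below are a made-up sample, and `baseline_accuracy` is an illustrative name for the share of the largest class, which is the accuracy a model gets by always guessing that class.

```python
import numpy as np
import pandas as pd

# Made-up sample of genre strings in the same format as the dataset
genres = pd.Series([
    "['comedy']",
    "['documentation', 'crime']",
    "['comedy', 'documentation']",
    "['drama', 'romance']",
    "['drama']",
])

# regex=False treats the pattern as a literal substring
is_comedy = genres.str.contains("comedy", regex=False).to_numpy()
is_doc = genres.str.contains("documentation", regex=False).to_numpy()

# First matching condition wins: 0 = comedy, 1 = documentation only, 2 = other
labels = pd.Series(np.select([is_comedy, is_doc], [0, 1], default=2), index=genres.index)

# Share of each class; the largest share is the majority-class baseline
class_share = labels.value_counts(normalize=True).sort_index()
baseline_accuracy = class_share.max()
print(class_share)
print("baseline accuracy", baseline_accuracy)
```

Comparing a model's accuracy against this baseline shows how much the features add beyond the class distribution alone.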

In [98]:
# Build the feature matrix X and the 3-class label vector y
def get_dataset(features):
    # Keep only the requested feature columns (copy so that df itself is not modified)
    df_features = df.loc[:, features].copy()
    X = df_features.values.copy()
    # Flag titles by genre; regex=False treats the pattern as a literal substring
    is_comedy = df["genres"].str.contains("comedy", regex=False)
    is_doc = df["genres"].str.contains("documentation", regex=False)
    # 0 = comedy, 1 = documentation without comedy, 2 = everything else
    df_features.loc[is_comedy, "label"] = 0
    df_features.loc[~is_comedy & is_doc, "label"] = 1
    df_features.loc[~is_comedy & ~is_doc, "label"] = 2

    return X, df_features["label"]

I want to compare the performance and characteristics of **Random Forest** and **AdaBoost**, and record how parameter changes affect the performance of each.
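In the `train` function that follows, the Random Forest is scored by its out-of-bag accuracy on all rows, while AdaBoost is scored on a held-out 20% split, so the two numbers are not measured on the same samples. One way to make them directly comparable is to score both models on one shared test split. This is a hedged sketch on synthetic data from `make_classification`, standing in for the Netflix features; the parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real features (an assumption)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# One shared, stratified split so both models see the same test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "adaboost": AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
}

# Fit each model on the training rows and score it on the shared test rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)
```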

In [99]:
# Train and evaluate the model
def train(dataset, rf=True):
    X, y = dataset
    if rf:
        # Random Forest: scored on out-of-bag samples, so no separate test split is used
        # Experiment with key parameters
        model = RandomForestClassifier(
            oob_score=True,
            random_state=42,
            n_estimators=500,
            max_depth=None,
            min_samples_split=2,
            n_jobs=-1,
        )
        model.fit(X, y)
        print("OOB accuracy", model.oob_score_)
    else:
        # AdaBoost: scored on a held-out 20% test split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        # Experiment with key parameters
        model = AdaBoostClassifier(n_estimators=500, random_state=42, learning_rate=0.1)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print("accuracy", accuracy)

## Experiment with different features

### Runtime (The length of the episode (SHOW) or movie)

In [100]:
features = ["runtime"]
train(get_dataset(features), rf=True)

OOB accuracy 0.556129476584022


### Internet Movie Database (IMDB)

In [101]:
features = features + ["imdb_score", "imdb_votes"]
train(get_dataset(features), rf=True)

OOB accuracy 0.553374655647383
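Adding the two IMDB columns barely changed the OOB accuracy (0.556 to 0.553). One way to see how much each column contributes is the fitted forest's `feature_importances_` attribute. The sketch below uses made-up data in which the column `signal` determines the label and `noise` is unrelated to it; both column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Made-up columns: "signal" determines the label, "noise" is unrelated to it
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (toy["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
model.fit(toy.values, y)

# Impurity-based importances, one value per column, summing to 1
importances = pd.Series(model.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
print("OOB accuracy", model.oob_score_)
```

Impurity-based importances can favor columns with many distinct values, so they are best read as a rough ranking rather than an exact measure.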


### The Movie Database (TMDB)

In [102]:
features = features + ["tmdb_popularity", "tmdb_score"]
train(get_dataset(features), rf=True)

OOB accuracy 0.59400826446281


### The age certification

In [103]:
print(len(df["age_certification"].unique()))
df["age_certification"].value_counts().head(20)

11


age_certification
TV-MA    809
R        523
PG-13    421
TV-14    413
PG       218
TV-PG    157
TV-Y7    105
G         86
TV-Y      85
TV-G      72
NC-17     15
Name: count, dtype: int64

In [104]:
df.loc[:, "13andOlder"] = df["age_certification"].str.contains("PG-13|TV-14|R|TV-MA|NC-17").astype(int)
df.loc[:, "17andOlder"] = df["age_certification"].str.contains("R|TV-MA|NC-17").astype(int)
df


Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,13andOlder,17andOlder
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],0.0,tt0075314,8.2,808582.0,40.965,8.179,1,1
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],0.0,tt0068473,7.7,107673.0,10.010,7.300,1,1
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],0.0,tt0071853,8.2,534486.0,15.461,7.811,0,0
5,ts22164,Monty Python's Flying Circus,SHOW,A British sketch comedy series with the shows ...,1969,TV-14,30,"['comedy', 'european']",['GB'],4.0,tt0063929,8.8,73424.0,17.617,8.306,1,0
6,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],0.0,tt0079470,8.0,395024.0,17.770,7.800,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5798,tm1099320,Convergence: Courage in a Crisis,MOVIE,Activists and volunteers work through the dark...,2021,R,113,['documentation'],"['GB', 'US']",0.0,tt15398694,5.4,262.0,6.589,5.200,1,1
5800,tm982470,Stuck Apart,MOVIE,"Entrenched in a midlife crisis, Aziz seeks sol...",2021,R,96,"['comedy', 'drama']",['TR'],0.0,tt11213372,6.0,10515.0,7.403,6.200,1,1
5801,ts270616,We Are: The Brooklyn Saints,SHOW,A Brooklyn youth football program and its self...,2021,TV-14,47,"['documentation', 'sport']",['US'],1.0,tt13656220,6.5,162.0,2.523,10.000,1,0
5819,ts287729,Alma Matters: Inside the IIT Dream,SHOW,"In a ""nation of middle-class"" the IIT dream in...",2021,TV-MA,49,"['documentation', 'drama']",['IN'],1.0,tt14512938,8.3,2346.0,1.493,9.000,1,1
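The two flags above rely on regex substring matching, where a short pattern such as `R` would also match any longer rating that contains that letter. A sketch of the same flags with exact matching via `isin` is shown below; the list names `older_13` and `older_17` are illustrative, and the ratings are the eleven values from the counts above.

```python
import pandas as pd

# The eleven ratings that appear in the value counts above
ratings = pd.Series(["TV-MA", "R", "PG-13", "TV-14", "PG", "TV-PG",
                     "TV-Y7", "G", "TV-Y", "TV-G", "NC-17"])

older_13 = ["PG-13", "TV-14", "R", "TV-MA", "NC-17"]
older_17 = ["R", "TV-MA", "NC-17"]

# isin compares whole values, so no rating can match by accident
flags = pd.DataFrame({
    "age_certification": ratings,
    "13andOlder": ratings.isin(older_13).astype(int),
    "17andOlder": ratings.isin(older_17).astype(int),
})

# For these eleven ratings the exact match agrees with the regex version
regex_13 = ratings.str.contains("PG-13|TV-14|R|TV-MA|NC-17").astype(int)
print(flags)
print("same as regex:", (flags["13andOlder"] == regex_13).all())
```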


In [105]:
features = features + ["13andOlder", "17andOlder"]
train(get_dataset(features), rf=True)

OOB accuracy 0.6208677685950413


### Countries that produced the titles

In [106]:
len(df["production_countries"].unique())

238

In [107]:
df["production_countries"].value_counts().head(20)

production_countries
['US']          1267
['IN']           213
['JP']           186
['GB']           137
['KR']           116
['ES']            77
['FR']            54
['CA']            53
['TR']            49
['MX']            46
['CN']            40
['BR']            37
['AU']            35
['GB', 'US']      34
['TW']            30
['CA', 'US']      27
['DE']            24
['IT']            20
['CO']            20
['US', 'GB']      20
Name: count, dtype: int64

In [108]:
# Label encoding: map each production-country list to an integer index
df["production_countries_index"] = LabelEncoder().fit_transform(df["production_countries"])


In [109]:
# Frequency encoding: replace each country list with the number of times it occurs
counts = df["production_countries"].value_counts()
df["production_countries_count"] = df["production_countries"].map(counts)


In [110]:
features = features + ["production_countries_index", "production_countries_count"]
train(get_dataset(features), rf=True)

OOB accuracy 0.6387741046831956
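The value counts above list `['GB', 'US']` and `['US', 'GB']` as separate entries, so both encodings treat the same pair of countries as two categories. A possible refinement, sketched here on a made-up sample, is to sort the country codes into one canonical key before encoding; `canonical_key` is a hypothetical helper and assumes each cell is a valid Python list literal.

```python
import ast
import pandas as pd

# Made-up sample in the same string format as production_countries
countries = pd.Series(["['GB', 'US']", "['US', 'GB']", "['US']", "['US']", "['IN']"])


def canonical_key(raw):
    """Parse the list string, sort the codes, and join them into one key."""
    return "|".join(sorted(ast.literal_eval(raw)))


keys = countries.map(canonical_key)

# Frequency encoding on the canonical keys
counts = keys.value_counts()
frequency = keys.map(counts)
print(pd.DataFrame({"key": keys, "count": frequency}))
```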


### Release year (offset from the earliest year)

In [111]:
df["release_year"] = df["release_year"] - df["release_year"].min()


In [112]:
features = features + ["release_year"]
train(get_dataset(features), rf=True)

OOB accuracy 0.6553030303030303
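Subtracting the minimum year shifts every value by the same constant. Decision trees split on thresholds that depend only on the ordering of the values, so a shift like this is not expected to change what a Random Forest learns. The gain above most likely comes from adding `release_year` as a feature at all. The sketch below checks this on made-up years and labels by comparing predictions from a forest trained on raw years with one trained on shifted years.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Made-up release years and a label that depends on the year
years = rng.integers(1945, 2023, size=(400, 1))
labels = (years[:, 0] > 2000).astype(int)
shifted_years = years - years.min()

# Same hyperparameters and seed; only the feature offset differs
raw_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(years, labels)
shifted_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(shifted_years, labels)

raw_pred = raw_model.predict(years)
shifted_pred = shifted_model.predict(shifted_years)
print("identical predictions:", np.array_equal(raw_pred, shifted_pred))
```

Shifting or rescaling matters for distance-based or gradient-based models, but for tree ensembles the raw year can usually be used as is.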
