# Get Started

Alex Schittko

Robert Waguespack

Run this block to set up the notebook

**You must enable Internet in Kaggle before running this notebook!**

**This requires you to verify your Kaggle account!**

* Download CSV Files test/training sets
* Import python dependencies

The goal is to [predict box office revenues](https://www.kaggle.com/c/tmdb-box-office-prediction/submit).

In [None]:
output_path = "./" # Where are outputs stored in your notebook? (With trailing slash, eg /output/)
input_path = "../input/tmdb-box-office-prediction/" # Where can inputs be stored? (With trailing slash, eg /input/)

# Uncomment these if the test.csv and train.csv don't exist in INPUT_PATH
#!wget -q --show-progress --no-check-certificate 'https://docs.google.com/uc?export=download&id=13f3n4H67RjbEHPl_A4i9R6oY9jUa2eOm' -O {input_path}test.csv
#!wget -q --show-progress --no-check-certificate 'https://docs.google.com/uc?export=download&id=1JxEPMg415Y6NIslXcL9mWGr8RMx86B6Y' -O {input_path}train.csv

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as mno
import math
import json
from multiprocessing import Pool
import multiprocessing
from tqdm import tqdm,trange,tqdm_notebook
from time import sleep
from sklearn.model_selection import train_test_split
print("Ready")

In [None]:
df_train = pd.read_csv(input_path + 'train.csv', parse_dates=["release_date"])
df_test = pd.read_csv(input_path + 'test.csv', parse_dates=["release_date"])
df_test.info()
df_all = pd.concat([df_train, df_test])
df_all.reset_index(inplace=True)

# Initialize your new features here
df_all['cast_json'] = ""
df_all['crew_json'] = ""
df_all.insert(0, 'popularity_cast', np.float64(0))
df_all.insert(0, 'popularity_crew', np.float64(0))
df_all.insert(0, 'has_homepage', 0)

In [None]:
# The JSON payloads are not valid JSON!!
# These functions help us parse the invalid JSON to native python objects
# Using regular expressions

import re
def repl_quotes(m):
  preq = m.group(1)
  qbody = m.group(2)
  qbody = re.sub(r'"', r"'", qbody)
  return preq + '"' + qbody + '"'


# Thanks user1384220 of StackOverflow
# https://stackoverflow.com/questions/62012736/regex-replace-double-quotes-in-json
# Takes an unsafe JSON s, and returns the native py object and safe json string
def to_json(s):
  safe = s.replace("'", '"')
  safe = re.sub(r'("[\s\w]*)"([\s\w]*")',r"\1'\2", safe)  # O'Brien
  safe = re.sub( r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])', repl_quotes, safe ) # Alex "Nickname" Schittko
  safe = safe.replace("None", 'null')
  safe = safe.replace("\\'", "'")
  safe = safe.replace("\\x92", "'")
  safe = safe.replace("\\xa0", "-")
  safe = safe.replace("\\xad", "-")

  #print(safe)
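  try:
    cast_json = json.loads(safe)
  except json.JSONDecodeError:
    # Report the failure and fall back to an empty list so callers can continue
    print("to_json() failed for string")
    print(safe)
    cast_json, safe = [], "[]"

  return cast_json, safe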
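The payloads in these columns look like Python literals (single-quoted keys, `None` instead of `null`), so an alternative to repairing them with regular expressions is `ast.literal_eval` from the standard library. The sketch below assumes that every payload is a valid Python literal; `parse_payload` and the sample string are hypothetical and not part of the original notebook.

```python
import ast
import json

def parse_payload(s):
    """Parse a Python-literal string into (objects, valid JSON string).

    Returns ([], "[]") for missing or unparseable input.
    """
    if not isinstance(s, str) or s.strip() == "":
        return [], "[]"
    try:
        parsed = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return [], "[]"
    return parsed, json.dumps(parsed)

# Hypothetical sample shaped like a cast entry
sample = "[{'id': 1, 'name': \"Conan O'Brien\", 'profile_path': None}]"
objs, safe = parse_payload(sample)
print(objs[0]['name'])
print(json.loads(safe) == objs)
```

Because `json.dumps` produces the safe string, apostrophes and nicknames in names need no special handling.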

In [None]:
# This code splits df_all back to df_test/df_train

def split_from_all(df):
    df_train = df.iloc[:3000,:]
    df_test = df.iloc[3000:,:]
    return df_train,df_test

# We can iterate df's faster as dicts.
def quickIt(df):
    data = df.to_dict('index')
    idxs = df.index.values
    return data, idxs


# Exploratory Data Analysis (EDA)

## Training Set

In [None]:
df_train.head()

In [None]:
df_train.info()

The info output shows that several features are incomplete and that we have about 3000 entries to train with.

This should be enough for some classical machine-learning models; however, I don't think it will be enough to train a neural network.

In [None]:
mno.matrix(df_train, figsize=(20,6))

The missingno (`mno`) matrix shows that we need to do something about these incomplete features:

* belongs_to_collection
* homepage
* overview
* genres
* poster_path
* production_companies
* production_countries
* runtime
* spoken_languages
* tagline
* Keywords
* cast
* crew

We have a few options:

1. Discard the feature. We should only do this if we believe the feature isn't correlated with the target.
2. Impute the missing values. We could use IterativeImputer or SimpleImputer to fill in the blanks.
3. Feature engineering. We can extract boolean facts, e.g. "has_homepage", and replace the current "homepage" feature with the new one. This only makes sense for features we can turn into categories. E.g., the presence of a homepage or tagline may have some influence on the target.
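As a rough illustration of option 2, the sketch below fills a missing `runtime` with the column median using scikit-learn's `SimpleImputer`. The toy values are made up, and the median strategy is an assumption rather than a choice made in this notebook. (`IterativeImputer` is still experimental in scikit-learn and needs `from sklearn.experimental import enable_iterative_imputer` before it can be imported.)

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing runtime (hypothetical values)
toy = pd.DataFrame({'runtime': [90.0, np.nan, 120.0, 100.0]})

# Replace missing values with the median of the observed values
imputer = SimpleImputer(strategy='median')
toy[['runtime']] = imputer.fit_transform(toy[['runtime']])
print(toy['runtime'].tolist())
```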

In [None]:
# How else do we tell if the data is correlated
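One simple way to approach this question is to compute the Pearson correlation between each numeric column and `revenue`. The sketch below uses a made-up toy frame; on the real data the same expression could be applied to `df_train`. Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of df_train
toy = pd.DataFrame({
    'budget':  [10, 20, 30, 40, 50],
    'runtime': [100, 95, 110, 90, 105],
    'revenue': [25, 45, 70, 85, 110],
})

# Correlation of every numeric column with revenue, strongest first
corr_with_revenue = (
    toy.select_dtypes('number')
       .corr()['revenue']
       .drop('revenue')
       .sort_values(ascending=False)
)
print(corr_with_revenue)
```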

### Categorical Features

#### Overview

Here we look at categorical columns in a pie chart to understand the spread of the dataset (original code from [this notebook](https://www.kaggle.com/sisharaneranjana/titanic-survival-prediction-complete-analysis))

In [None]:
categorical_cols_train= df_train.select_dtypes(include=['object'])
categorical_cols_test= df_test.select_dtypes(include=['object'])
print(f'The train dataset contains {len(categorical_cols_train.columns.tolist())} categorical columns')
for cols in categorical_cols_train.columns:
    print(cols,':', len(categorical_cols_train[cols].unique()),'labels')

print(f'The test dataset contains {len(categorical_cols_test.columns.tolist())} categorical columns')
for cols in categorical_cols_test.columns:
    print(cols,':', len(categorical_cols_test[cols].unique()),'labels')

categorical_cols_train.describe()

The features with more than 50 labels are close to unique per movie; perhaps we can use specifics about them later in our analysis for correlation. Perhaps a movie with Danny DeVito has more revenue than one with unknown actors?

It looks like original_language and status have few enough values that we could check whether they correlate with revenue.

We'll need an approach to bring some order to the categorical values before trying to model the problem.

In [None]:
print("Scalar features")
for col in df_all.columns:
    if col not in categorical_cols_train:
        print(col)

#### Status Feature

In [None]:
# Checking status' values
df_all['status'].unique()



In [None]:
import plotly.graph_objects as go

night_colors = ['#D3DBDD',  'navy',  '#57A7F3']
labels = [x for x in df_train.status.value_counts().index]
values = df_train.status.value_counts()

# Use `hole` to create a donut-like pie chart
fig=go.Figure(data=[go.Pie(labels=labels,values=values,hole=.3,pull=[0,0,0.06,0])])

fig.update_layout(
    title_text="Training Set - Movie Status")
fig.update_traces(marker=dict(colors=night_colors))
fig.show()

In [None]:

night_colors = ['#D3DBDD',  'navy',  '#57A7F3']
labels = [x for x in df_test.status.value_counts().index]
values = df_test.status.value_counts()

# Use `hole` to create a donut-like pie chart
fig=go.Figure(data=[go.Pie(labels=labels,values=values,hole=.3,pull=[0,0,0.06,0])])

fig.update_layout(
    title_text="Test Set - Movie Status")
fig.update_traces(marker=dict(colors=night_colors))
fig.show()

One consideration: are the Released and Post Production statuses related?

# Feature Engineering

### Collection Sequence Feature

We're going to engineer a feature named "collection_iteration_seq" that represents a movie's position in its series. E.g., the 3rd movie in the Dark Knight series will have the value "3", while the first movie will have the value "1".

We'll engineer another boolean (0/1) feature called "single" that marks whether the movie is a singleton.

We'll use `df_all` to make sure this feature is complete.

After we create the feature on `df_all`, `df_train` and `df_test` will be RECREATED with the new feature. `split_from_all` splits them by row position: the first 3000 rows of `df_all` are the training set (which should be the rows that have a `revenue` value) and the rest are the test set.

In [None]:
# Collection Iteration Feature

# Create both columns up front with a default of 0.
df_all['collection_iteration_seq'] = 0
df_all['single'] = 0

# Movies that don't belong to a collection are singles.
# They'll be the first movies in their series, for now.
is_single = df_all['belongs_to_collection'].isna()
df_all.loc[is_single, ['collection_iteration_seq', 'single']] = 1
totalSingle = int(is_single.sum())

print("Set " + str(totalSingle) + " singles")
df_all.info()
mno.matrix(df_all, figsize=(10,5))

In [None]:
df_all['collection_iteration_seq'].unique()

In [None]:
# Identifying movies that belong to a collection

collection_ids = []
# Create the column up front; 0 means "no collection"
df_all['collection_id'] = 0

moviesInSeries = 0
# Iterate all the rows again, and safely read the JSON string in belongs_to_collection
all = df_all.to_dict('index')

for k in all:
  v = all[k]
  safe = str(v['belongs_to_collection'])
  safe = safe.replace("n' ", 'n')
  safe = safe.replace("'", '"')
  safe = safe.replace("\"s ", "'s")
  safe = safe.replace("None", 'null')
  safe = safe.replace("N\"E", "N'E")
  safe = safe.replace("We\"re", "We're")
  safe = safe.replace("L\"a", "L'a")
  if safe != "nan": # Only get entries with a belongs_to_collection
    parsed = json.loads(safe)
    collection_id = parsed[0]['id']
    collection_ids.append(collection_id)
    # We show here there are only 0 or 1 collection entries on a movie object.
    if (len(parsed) > 1):
      print(parsed)

    # The movie belongs to a collection, so it is not a single
    df_all.at[k, 'collection_id'] = collection_id
    df_all.at[k, 'single'] = 0



In [None]:
df_all['collection_iteration_seq'].unique()

In [None]:
# Identifying series position for movies

moviesInSeries = 0
# Visit each collection once; collection_ids holds one entry per movie
for cid in set(collection_ids):
    # sort_values returns a sorted copy, so df_all itself is not modified here
    movies = df_all.loc[df_all['collection_id'] == cid].sort_values(by='release_date')

    counter = 1
    # Apply the value to collection_iteration_seq
    for k, v in movies.iterrows():
      df_all.at[k, 'collection_iteration_seq'] = counter
      counter += 1
      moviesInSeries += 1

print("Marked collection_iteration_seq on " + str(moviesInSeries) + " movies")
mno.matrix(df_all, figsize=(10,5))
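The series position can also be computed without explicit loops. The sketch below is one possible vectorized version on a made-up toy frame: it assumes that `collection_id` is 0 for movies without a collection and that `release_date` is already a datetime column. As in the loop above, the position only counts movies that are present in the dataset.

```python
import pandas as pd

# Hypothetical toy data: three movies in collection 7 and one single
toy = pd.DataFrame({
    'title': ['A1', 'Solo', 'A3', 'A2'],
    'collection_id': [7, 0, 7, 7],
    'release_date': pd.to_datetime(
        ['2005-06-15', '2010-01-01', '2012-07-20', '2008-07-18']),
})

in_series = toy['collection_id'] != 0
toy['collection_iteration_seq'] = 1  # singles stay at position 1

# Within each collection, number the movies by release date (1, 2, 3, ...)
ranked = (
    toy[in_series]
    .sort_values('release_date')
    .groupby('collection_id')
    .cumcount() + 1
)
# Assignment aligns on the row index, so the sort order does not matter here
toy.loc[in_series, 'collection_iteration_seq'] = ranked
print(toy[['title', 'collection_iteration_seq']])
```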

In [None]:
df_all['collection_iteration_seq'].unique()

In [None]:
# Now drop belongs_to_collection (collection_id stays in df_all as a numeric column)

df_all.drop(labels=['belongs_to_collection'], axis=1, inplace=True)

In [None]:
# Now create df_train and df_test again
df_train,df_test = split_from_all(df_all)
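`split_from_all` relies on the hard-coded row position 3000. A sketch of an alternative that splits on whether `revenue` is present is shown below; it assumes that test rows have a missing `revenue` after the concat, and `split_by_revenue` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_by_revenue(df):
    """Rows with a revenue value are training rows; the rest are test rows."""
    has_revenue = df['revenue'].notna()
    return df[has_revenue].copy(), df[~has_revenue].copy()

# Hypothetical toy frame: two training rows followed by two test rows
toy = pd.DataFrame({'id': [1, 2, 3, 4],
                    'revenue': [100.0, 250.0, np.nan, np.nan]})
toy_train, toy_test = split_by_revenue(toy)
print(len(toy_train), len(toy_test))
```

Returning copies also avoids pandas warnings if the two frames are modified later.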

In [None]:
mno.matrix(df_train, figsize=(10,5))

In [None]:
df_train['single'].unique()

In [None]:
df_train['collection_iteration_seq'].unique()

In [None]:
df_train.describe()

In [None]:
# Reduce fragmenting of DataFrame
df_all = df_all.copy()

### Cast Popularity Feature

We're going to build a dataset called `df_cast`, keyed by actor `name`, with two features:

* `rating`, a popularity rating
* `movies`, the number of films the actor appears in

Once we have this dataset, we can add a feature called `popularity_cast` to `df_all`: the average of the cast's popularity ratings.

Afterwards, we should be able to drop the cast feature.


In [None]:
# Here we build a list of all actors

actors=[]
all = df_all.to_dict('index')
for k in all:
  v = all[k]
  cast_str = v['cast']
  if str(cast_str) == "nan":
    continue
  cast_json, safe = to_json(cast_str)
  df_all.at[k,'cast_json'] = safe
  for actor in cast_json:
    #print(actor['name'])
    actors.append(actor['name'])
# The raw 'cast' column stays in df_all for now; it is dropped later when the model inputs are built
df_all = df_all.copy()

# Now we remove duplicates and create a dataframe to contain our actors

actors=list(set(actors))
actors_dict=[]
for actor in actors:
    actors_dict.append({'name':actor,'rating':0.0,'movies':0})  # rating is a float: it accumulates popularity

df_cast = pd.DataFrame(actors_dict)
df_cast = df_cast.drop_duplicates(subset=['name'], keep='first')
# Indexing by name speeds the loops below up from about 5 iterations per second to thousands, CPU only.
df_cast.set_index(['name'],inplace=True)
df_cast.info()
df_cast.index.name

Wow!  We have 76k unique actors!  This should give us some good insight!

We're going to make a "rating" for each actor, then use these ratings to extract the `popularity_cast` feature for the films.

For each actor, we'll sum the popularity of every film they appear in. We'll also keep track of how many films the actor has been in.

This lets us average an actor's score over the movies they've participated in.
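The same two-step average can be sketched with `explode` and `groupby` on a toy frame. The column `cast_names` (a list of actor names per film) and all values are hypothetical; in the notebook the names would come from the parsed cast JSON.

```python
import pandas as pd

# Hypothetical toy data: three films with their cast lists
films = pd.DataFrame({
    'title': ['F1', 'F2', 'F3'],
    'popularity': [10.0, 20.0, 6.0],
    'cast_names': [['Ann', 'Bob'], ['Ann'], ['Bob', 'Cy']],
})

# One row per (film, actor) pair; the original film index is preserved
pairs = films.explode('cast_names').rename(columns={'cast_names': 'name'})

# Actor rating = mean popularity of the films the actor appears in
actor_rating = pairs.groupby('name')['popularity'].mean()

# Film score = mean rating of its cast, grouped by the original film index
pairs['actor_rating'] = pairs['name'].map(actor_rating)
films['popularity_cast'] = pairs.groupby(level=0)['actor_rating'].mean()

print(actor_rating.to_dict())
print(films['popularity_cast'].tolist())
```

Here Ann averages (10 + 20) / 2 = 15 and Bob averages (10 + 6) / 2 = 8, so film F1 gets (15 + 8) / 2 = 11.5.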

In [None]:
%%time
# This part computes the sum of movies & ratings in the df_cast dataframe
#dict_cast = df_cast.to_dict('records')
all = df_all.to_dict('index')

idxs = df_all.index.values
for k in tqdm(idxs,desc="Computing sums",unit="Film"):
    #print("k: " + str(k))
    v = all[k]
    cast_json = v['cast_json']
    popularity = v['popularity']
    #print(cast_json)
    #print(popularity)
    if str(cast_json) == "nan" or str(cast_json) == "":
      #print("bail")
      continue
    actors = json.loads(cast_json)
    #print(type(actors))
    for actor in actors:
      #print(actor)
      idx = actor['name']
      df_cast.at[idx, 'rating'] += popularity
      df_cast.at[idx, 'movies'] += 1


df_cast.info()
df_cast.describe()


In [None]:
# Now we compute the average ratings per actor
cast_idxs = df_cast.index.values
cast = df_cast.to_dict('index')
for k in tqdm(cast_idxs,desc="Computing averages",unit="Actor"):
  v = cast[k]
  sum_movies = v['movies']
  sum_rating = v['rating']
  # Guard against actors with no counted movies (avoids dividing by zero)
  new_rating = sum_rating / sum_movies if sum_movies > 0 else 0
  
  df_cast.at[k,'rating'] = new_rating

In [None]:
df_cast.describe()

In [None]:
# Let's see what the new feature looks like
ax = plt.gca()

df_cast.plot(kind='scatter',x='movies',y='rating',color='blue',ax=ax)

plt.show()

This plot suggests that actors with more movies tend to have a lower average rating.

Now we're going to iterate over all the films once more and engineer the `popularity_cast` rating.

In [None]:
for idx in tqdm(idxs,desc="Computing film popularity_cast",unit="Film"):
    film = all[idx]
    popularity_cast = 0
    count = 0
    cast_json = df_all.at[idx,'cast_json']
    if cast_json == "":
        continue
    try:
        actors = json.loads(cast_json)
        if len(actors) > 0:
            for actor in actors:
                popularity_cast += df_cast.at[actor['name'],'rating']
                count+=1
    except Exception as e:
        print("Failed for film")
        print(film)
        print(e)
        
    # Films with no usable cast entries get a rating of 0
    rating = popularity_cast / count if count > 0 else 0
    
    if (rating > 100):
        print(film['original_title'] + " " + str(rating))
    
    df_all.at[idx,'popularity_cast'] = rating

df_all[df_all['original_title'] == 'Minions']['popularity_cast']

In [None]:
df_all.describe()

In [None]:
# Let's see what it looks like
ax = plt.gca()

df_all.plot(kind='scatter',x='revenue',y='popularity_cast',color='blue',ax=ax, figsize=(12,8))

plt.show()

ax = plt.gca()

df_all.plot(kind='scatter',x='revenue',y='popularity',color='red',ax=ax, figsize=(12,8))

plt.show()

Nice, this looks like a natural feature now: its distribution looks similar to that of popularity.

### Crew Popularity Feature

We're going to build a dataset called `df_crew`, keyed by crew member `name`, with two features:

* `rating`, a popularity rating
* `movies`, the number of films the crew member worked on

Once we have this dataset, we can add a feature called "popularity_crew" to the movie datasets: the average of the crew's popularity ratings.

Afterwards, we should be able to drop the crew feature.

This should pretty much mirror what happened in the cast popularity feature.

In [None]:
# Here we build a list of all crew members

crews=[]
all = df_all.to_dict('index')
for k in all:
  v = all[k]
  crew_str = v['crew']
  if str(crew_str) == "nan":
    continue
  crew_json, safe = to_json(crew_str)
  df_all.at[k,'crew_json'] = safe
  for crew in crew_json:
    crews.append(crew['name'])
# The raw 'crew' column stays in df_all for now; it is dropped later when the model inputs are built
df_all = df_all.copy()

# Now we remove duplicates and create a dataframe to contain our crew members

crews=list(set(crews))
crews_dict=[]
for crew in crews:
    crews_dict.append({'name':crew,'rating':0.0,'movies':0})  # rating is a float: it accumulates popularity

df_crew = pd.DataFrame(crews_dict)
df_crew = df_crew.drop_duplicates(subset=['name'], keep='first')
# Indexing by name speeds the loops below up from about 5 iterations per second to thousands, CPU only.
df_crew.set_index(['name'],inplace=True)
df_crew.info()
df_crew.index.name

In [None]:
%%time
# This part computes the sum of movies & ratings in the df_crew dataframe
all = df_all.to_dict('index')

idxs = df_all.index.values
for k in tqdm(idxs,desc="Computing sums",unit="Film"):
    v = all[k]
    crew_json = v['crew_json']
    popularity = v['popularity']
    if str(crew_json) == "nan" or str(crew_json) == "":
      continue
    crews = json.loads(crew_json)
    for crew in crews:
      idx = crew['name']
      df_crew.at[idx, 'rating'] += popularity
      df_crew.at[idx, 'movies'] += 1

df_crew.info()
df_crew.describe()


In [None]:
# Now we compute the average ratings per crew member
crew_idxs = df_crew.index.values
crew = df_crew.to_dict('index')
for k in tqdm(crew_idxs,desc="Computing averages",unit="Crew member"):
  v = crew[k]
  sum_movies = v['movies']
  sum_rating = v['rating']
  # Guard against crew members with no counted movies (avoids dividing by zero)
  new_rating = sum_rating / sum_movies if sum_movies > 0 else 0
  
  df_crew.at[k,'rating'] = new_rating
    
df_crew.describe()


In [None]:
ax = plt.gca()

df_crew.plot(kind='scatter',x='movies',y='rating',color='blue',ax=ax)

plt.show()

In [None]:
for idx in tqdm(idxs,desc="Computing film popularity_crew",unit="Film"):
    film = all[idx]
    popularity_crew = 0
    count = 0
    crew_json = df_all.at[idx,'crew_json']
    if crew_json == "":
        continue
    try:
        crews = json.loads(crew_json)
        if len(crews) > 0:
            for crew in crews:
                popularity_crew += df_crew.at[crew['name'],'rating']
                count+=1
    except Exception as e:
        print("Failed for film")
        print(film)
        print(e)
        
    # Films with no usable crew entries get a rating of 0
    rating = popularity_crew / count if count > 0 else 0
    
    if (rating > 100):
        print(film['original_title'] + " " + str(rating))
    
    df_all.at[idx,'popularity_crew'] = rating

df_all[df_all['original_title'] == 'Minions']['popularity_crew']

In [None]:
# Let's see what it looks like
ax = plt.gca()

df_all.plot(kind='scatter',x='revenue',y='popularity_crew',color='blue',ax=ax, figsize=(20,8))
df_all.plot(kind='scatter',x='revenue',y='popularity_cast',color='red',ax=ax, figsize=(20,8))
df_all.plot(kind='scatter',x='revenue',y='popularity',color='green',ax=ax, figsize=(20,8))

plt.show()

### Keywords Feature

We should be able to extract a keyword_rating feature the same way we do for cast and crew.

In [None]:
### 

### Homepage Feature

We can easily set a boolean "has_homepage" and replace the "homepage" feature with it.

In [None]:
data, idxs = quickIt(df_all)
#print(data)
for k in idxs:
    v = data[k]
    if v['homepage'] == "" or str(v['homepage']) == "nan":
      df_all.at[k,'has_homepage'] = 0
    else:
      df_all.at[k,'has_homepage'] = 1  
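The same flag can be computed in a single vectorized expression. The sketch below uses a made-up toy frame and treats both missing values and empty strings as "no homepage", matching the loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value and an empty string
toy = pd.DataFrame(
    {'homepage': ['http://example.com', np.nan, '', 'http://example.org']})

# Missing values become empty strings, then anything non-empty counts as a homepage
homepage = toy['homepage'].fillna('')
toy['has_homepage'] = (homepage != '').astype(int)
print(toy['has_homepage'].tolist())
```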

In [None]:
df_all['has_homepage'].unique()
df_test.info()
df_train, df_test = split_from_all(df_all)
df_test.info()


In [None]:
ax = plt.gca()

df_all.plot(kind='scatter',x='revenue',y='has_homepage',color='blue',ax=ax, figsize=(20,8))

plt.show()

In [None]:
df_all.drop(labels=['homepage'], axis=1, inplace=True)


In [None]:
df_all.describe()

In [None]:
df_all.info()

### original_language

We can replace `original_language` with dummy (one-hot) variables because its unique count is *low*.

In [None]:
len(df_all['original_language'].unique())
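A minimal sketch of the dummy encoding with `pd.get_dummies` follows. The toy frame and the `lang` prefix are assumptions made for illustration; on the real data the same call could be applied to `df_all`.

```python
import pandas as pd

# Hypothetical toy data with three distinct languages
toy = pd.DataFrame({'original_language': ['en', 'fr', 'en', 'ja'],
                    'budget': [1, 2, 3, 4]})

# Replace the column with one 0/1 column per language
encoded = pd.get_dummies(toy, columns=['original_language'],
                         prefix='lang', dtype=int)
print(encoded.columns.tolist())
```

Rare languages could be grouped into a single "other" bucket first to keep the number of new columns small.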

### spoken_languages

Not sure what to do with this one yet.

### Status

I think any movie in post-production should go to 'released' status to be included in the larger dataset.  Do you?


### Drops

Leave this last. We drop everything else we don't need for the model here.

In [None]:
# df_all.drop(["original_title","cast","crew","cast_json","crew_json","title","imdb_id"],axis=1,inplace=True)

# Understanding Test Dataset

# Predictions

In [None]:
df_train,df_test = split_from_all(df_all)
df_train.info()

In [None]:
df_test.info()

In [None]:
mno.matrix(df_test, figsize=(20,6))

In [None]:
df_test.describe()

### LinearRegression

In [None]:
# Training the model
X = df_train.drop(['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index'], axis=1)
X.dropna(inplace=True)
X.info()
y = X['revenue'].to_numpy()
X.drop(['revenue'],axis=1,inplace=True)
X = X.to_numpy()

# X_test/y_test are sampled from the same rows we train on, so the model is later tested on data it has already seen.
# That makes the test scores optimistic, but it maximizes the amount of data we use for training.

_, X_test, _, y_test = train_test_split(
     X, y, test_size=0.33)

X_train = X
y_train = y

import sklearn.ensemble as ske
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

train_scores = []
lr_models = []
max_depth_k = 2 # LinearRegression has no max_depth; the loop below runs exactly once
for k in tqdm(range(1,max_depth_k),desc="Training models",unit="LinearRegression"):
  #regr = DecisionTreeRegressor(max_depth=k)
  #rfeRegr = ske.RandomForestRegressor(max_depth=k)
  lr = LinearRegression()
  lr.fit(X_train, y_train)
  train_scores.append(lr.score(X_train, y_train))
  lr_models.append(lr)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(1, max_depth_k, step=1)
y = train_scores

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('model #')
plt.xticks(np.arange(0, max_depth_k, step=1)) #change from 0-based array index to 1-based human-readable label
plt.ylabel('R^2 score (training data)')
plt.title('LinearRegression')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
plt.axhline(y=0.80, color='r', linestyle='-')
plt.text(0.5, 0.70, '80% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()
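The cell above scores the model on rows it was trained on. For comparison, the sketch below shows a conventional hold-out evaluation on synthetic data from `make_regression`. The data and all variable names are hypothetical; on the real data the split would be applied to `X` and `y` before fitting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and revenue target
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=0)

# Keep a third of the rows aside; the model never sees them during fitting
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
train_r2 = model.score(X_tr, y_tr)   # R^2 on rows used for fitting
val_r2 = model.score(X_val, y_val)   # R^2 on held-out rows
print(f"train R^2 = {train_r2:.3f}, validation R^2 = {val_r2:.3f}")
```

The validation score is the more honest estimate of how the model would do on unseen movies.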


In [None]:
!pip install pydotplus

from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image

# This can show a decision tree for RandomForestRegressor
#gvz = export_graphviz(selected_model.estimators_[0]) 
#graph = pydotplus.graph_from_dot_data(gvz) 
#Image(graph.create_png())

### RandomForestRegressor

In [None]:
# Training the model
X = df_train.drop(['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index'], axis=1)
X.dropna(inplace=True)
X.info()
y = X['revenue'].to_numpy()
X.drop(['revenue'],axis=1,inplace=True)
X = X.to_numpy()

# X_test/y_test are sampled from the same rows we train on, so the model is later tested on data it has already seen.
# That makes the test scores optimistic, but it maximizes the amount of data we use for training.

_, X_test, _, y_test = train_test_split(
     X, y, test_size=0.33)

X_train = X
y_train = y

import sklearn.ensemble as ske
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

train_scores = []
random_forest_models = []
max_depth_k = 21 # After about depth 14 the score stays stagnant
for k in tqdm(range(1,max_depth_k),desc="Training models",unit="RandomForestRegressor"):
  #regr = DecisionTreeRegressor(max_depth=k)
  regr = ske.RandomForestRegressor(max_depth=k)
  #regr = ske.AdaBoostRegressor(base_estimator=rfeRegr,n_estimators=100)
  regr.fit(X_train, y_train)
  train_scores.append(regr.score(X_train, y_train))
  random_forest_models.append(regr)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(1, max_depth_k, step=1)
y = train_scores

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('max_depth')
plt.xticks(np.arange(0, max_depth_k, step=1)) #change from 0-based array index to 1-based human-readable label
plt.ylabel('R^2 score (training data)')
plt.title('RandomForestRegressor')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
plt.axhline(y=0.80, color='r', linestyle='-')
plt.text(0.5, 0.70, '80% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()
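Training scores alone tend to keep rising with `max_depth`, so they cannot show where the forest starts to overfit. The sketch below sweeps a few depths on synthetic data and records both training and validation scores. The data, the depth list and `n_estimators=20` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and revenue target
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

depths = [1, 2, 4, 8]
train_r2, val_r2 = [], []
for depth in depths:
    forest = RandomForestRegressor(max_depth=depth, n_estimators=20,
                                   random_state=0)
    forest.fit(X_tr, y_tr)
    train_r2.append(forest.score(X_tr, y_tr))   # score on fitted rows
    val_r2.append(forest.score(X_val, y_val))   # score on held-out rows

# Pick the depth with the best validation score, not the best training score
best_depth = depths[val_r2.index(max(val_r2))]
print(list(zip(depths, train_r2, val_r2)), best_depth)
```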


### DecisionTreeRegressor

In [None]:
# Training the model
X = df_train.drop(['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index'], axis=1)
X.dropna(inplace=True)
X.info()
y = X['revenue'].to_numpy()
X.drop(['revenue'],axis=1,inplace=True)
X = X.to_numpy()

# X_test/y_test are sampled from the same rows we train on, so the model is later tested on data it has already seen.
# That makes the test scores optimistic, but it maximizes the amount of data we use for training.

_, X_test, _, y_test = train_test_split(
     X, y, test_size=0.33)

X_train = X
y_train = y

import sklearn.ensemble as ske
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

train_scores = []
decision_tree_models = []
max_depth_k = 21 # After about depth 14 the score stays stagnant
for k in tqdm(range(1,max_depth_k),desc="Training models",unit="DecisionTreeRegressor"):
  regr = DecisionTreeRegressor(max_depth=k)
  #regr = ske.RandomForestRegressor(max_depth=k)
  #regr = ske.AdaBoostRegressor(base_estimator=rfeRegr,n_estimators=100)
  regr.fit(X_train, y_train)
  train_scores.append(regr.score(X_train, y_train))
  decision_tree_models.append(regr)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(1, max_depth_k, step=1)
y = train_scores

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('max_depth')
plt.xticks(np.arange(0, max_depth_k, step=1)) #change from 0-based array index to 1-based human-readable label
plt.ylabel('R^2 score (training data)')
plt.title('DecisionTreeRegressor')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
plt.axhline(y=0.80, color='r', linestyle='-')
plt.text(0.5, 0.70, '80% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()


### AdaBoost + LinearRegression

In [None]:
# Training the model
X = df_train.drop(['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index'], axis=1)
X.dropna(inplace=True)
X.info()
y = X['revenue'].to_numpy()
X.drop(['revenue'],axis=1,inplace=True)
X = X.to_numpy()

# X_test/y_test are sampled from the same rows we train on, so the model is later tested on data it has already seen.
# That makes the test scores optimistic, but it maximizes the amount of data we use for training.

_, X_test, _, y_test = train_test_split(
     X, y, test_size=0.33)

X_train = X
y_train = y

import sklearn.ensemble as ske
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

train_scores = []
ada_lr_models = []
max_depth_k = 2 # LinearRegression has no max_depth; the loop below runs exactly once
for k in tqdm(range(1,max_depth_k),desc="Training models",unit="LinearRegression"):
  #regr = DecisionTreeRegressor(max_depth=k)
  #rfeRegr = ske.RandomForestRegressor(max_depth=k)
  lr = LinearRegression()
  # Note: scikit-learn 1.2+ renames `base_estimator` to `estimator`
  regr = ske.AdaBoostRegressor(base_estimator=lr)
  regr.fit(X_train, y_train)
  train_scores.append(regr.score(X_train, y_train))
  ada_lr_models.append(regr)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(1, max_depth_k, step=1)
y = train_scores

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('model #')
plt.xticks(np.arange(0, max_depth_k, step=1)) #change from 0-based array index to 1-based human-readable label
plt.ylabel('R^2 score (training data)')
plt.title('AdaBoost + Linear Regression')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
plt.axhline(y=0.80, color='r', linestyle='-')
plt.text(0.5, 0.70, '80% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()


### AdaBoost + RandomForest

In [None]:
# SLOW! ~10-15 minutes
# Training the model
X = df_train.drop(['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index'], axis=1)
X.dropna(inplace=True)
X.info()
y = X['revenue'].to_numpy()
X.drop(['revenue'],axis=1,inplace=True)
X = X.to_numpy()

# X_test/y_test are sampled from the same rows we train on, so the model is later tested on data it has already seen.
# That makes the test scores optimistic, but it maximizes the amount of data we use for training.

_, X_test, _, y_test = train_test_split(
     X, y, test_size=0.33)

X_train = X
y_train = y

import sklearn.ensemble as ske
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

train_scores = []
ada_rf_models = []
max_depth_k = 14 # After about depth 15 it's useless to keep trying
for k in tqdm(range(0,max_depth_k),desc="Training models",unit="AdaBoostRegressor"):
  #regr = DecisionTreeRegressor(max_depth=k)
  #rfeRegr = ske.RandomForestRegressor(max_depth=k)
  # Note: scikit-learn 1.2+ renames `base_estimator` to `estimator`
  regr = ske.AdaBoostRegressor(base_estimator=random_forest_models[k])
  regr.fit(X_train, y_train)
  train_scores.append(regr.score(X_train, y_train))
  ada_rf_models.append(regr)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(0, max_depth_k, step=1)
y = train_scores

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Random Forest #')
plt.xticks(np.arange(0, max_depth_k, step=1)) #change from 0-based array index to 1-based human-readable label
plt.ylabel('R^2 score (training data)')
plt.title('AdaBoost + Random Forest')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
plt.axhline(y=0.80, color='r', linestyle='-')
plt.text(0.5, 0.70, '80% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()

### AdaBoost + DecisionTree

In [None]:
# Training the model
X = df_train.drop(['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index'], axis=1)
X.dropna(inplace=True)
X.info()
y = X['revenue'].to_numpy()
X.drop(['revenue'],axis=1,inplace=True)
X = X.to_numpy()

# X_test/y_test are sampled from the same rows we train on, so the model is later tested on data it has already seen.
# That makes the test scores optimistic, but it maximizes the amount of data we use for training.

_, X_test, _, y_test = train_test_split(
     X, y, test_size=0.33)

X_train = X
y_train = y

import sklearn.ensemble as ske
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

train_scores = []
ada_dt_models = []
max_depth_k = 20 # After about depth 14 the score stays stagnant
for k in tqdm(range(0,max_depth_k),desc="Training models",unit="AdaBoostRegressor"):
  #regr = DecisionTreeRegressor(max_depth=k)
  #rfeRegr = ske.RandomForestRegressor(max_depth=k)
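  # Base estimator is passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
  regr = ske.AdaBoostRegressor(decision_tree_models[k])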
  regr.fit(X_train, y_train)
  train_scores.append(regr.score(X_train, y_train))
  ada_dt_models.append(regr)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(0, max_depth_k, step=1)
y = train_scores

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Decision Tree # (0-based index into decision_tree_models)')
plt.xticks(np.arange(0, max_depth_k, step=1)) # one tick per 0-based model index
plt.ylabel('Training R^2 score') # regr.score() returns R^2 for regressors, not an accuracy percentage
plt.title('AdaBoost + Decision Tree')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
plt.axhline(y=0.80, color='r', linestyle='-')
plt.text(0.5, 0.70, '80% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()
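
The comment in the cell above notes that the test split is drawn from rows the model was also trained on. For comparison, here is a minimal sketch of an evaluation that keeps the held-out rows out of `fit()`. It uses synthetic data from `make_regression` rather than the TMDB features, and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical. The base estimator is passed positionally so the call does not depend on the keyword's name.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the notebook's feature matrix (assumption: not the TMDB data)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=25.0, random_state=0)

# Keep the held-out rows away from fit() so the test score is an honest estimate
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

demo_model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6), random_state=0)
demo_model.fit(X_tr, y_tr)

# score() returns the coefficient of determination (R^2) for regressors
train_r2 = demo_model.score(X_tr, y_tr)
test_r2 = demo_model.score(X_te, y_te)
print(f"train R^2 = {train_r2:.3f}, held-out R^2 = {test_r2:.3f}")
```

A large gap between the two numbers is the usual sign of overfitting, which a score computed on the training rows alone cannot reveal.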


### GradientBoostingRegressor + RandomForest

In [None]:
# Training the model
X = df_train.drop(['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index'], axis=1)
X.dropna(inplace=True)
X.info()
y = X['revenue'].to_numpy()
X.drop(['revenue'],axis=1,inplace=True)
X = X.to_numpy()

# Note: X_test/y_test are a subset of the data the model is trained on (X_train is all of X),
# so any score on the test split is optimistic. We accept this to maximize the amount of training data.

_, X_test, _, y_test = train_test_split(
     X, y, test_size=0.33)

X_train = X
y_train = y

import sklearn.ensemble as ske
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

train_scores = []
gb_rf_models = []
max_depth_k = 20 # After about depth 14 the score stays stagnant
for k in tqdm(range(0,max_depth_k),desc="Training models",unit="GradientBoostingRegressor"):
  #regr = DecisionTreeRegressor(max_depth=k)
  #rfeRegr = ske.RandomForestRegressor(max_depth=k)
  regr = ske.GradientBoostingRegressor(init=random_forest_models[k])
  regr.fit(X_train, y_train)
  train_scores.append(regr.score(X_train, y_train))
  gb_rf_models.append(regr)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(0, max_depth_k, step=1)
y = train_scores

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Random Forest # (0-based index into random_forest_models)')
plt.xticks(np.arange(0, max_depth_k, step=1)) # one tick per 0-based model index
plt.ylabel('Training R^2 score') # regr.score() returns R^2 for regressors, not an accuracy percentage
plt.title('GradientBoostingRegressor + Random Forest')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
plt.axhline(y=0.80, color='r', linestyle='-')
plt.text(0.5, 0.70, '80% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()


### GradientBoostingRegressor + DecisionTree

In [None]:
# Training the model
X = df_train.drop(['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index'], axis=1)
X.dropna(inplace=True)
X.info()
y = X['revenue'].to_numpy()
X.drop(['revenue'],axis=1,inplace=True)
X = X.to_numpy()

# Note: X_test/y_test are a subset of the data the model is trained on (X_train is all of X),
# so any score on the test split is optimistic. We accept this to maximize the amount of training data.

_, X_test, _, y_test = train_test_split(
     X, y, test_size=0.33)

X_train = X
y_train = y

import sklearn.ensemble as ske
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

train_scores = []
gb_dt_models = []
max_depth_k = 20 # After about depth 14 the score stays stagnant
for k in tqdm(range(0,max_depth_k),desc="Training models",unit="GradientBoostingRegressor"):
  #regr = DecisionTreeRegressor(max_depth=k)
  #rfeRegr = ske.RandomForestRegressor(max_depth=k)
  regr = ske.GradientBoostingRegressor(init=decision_tree_models[k])
  regr.fit(X_train, y_train)
  train_scores.append(regr.score(X_train, y_train))
  gb_dt_models.append(regr)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(0, max_depth_k, step=1)
y = train_scores

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Decision Tree # (0-based index into decision_tree_models)')
plt.xticks(np.arange(0, max_depth_k, step=1)) # one tick per 0-based model index
plt.ylabel('Training R^2 score') # regr.score() returns R^2 for regressors, not an accuracy percentage
plt.title('GradientBoostingRegressor + Decision Tree')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
plt.axhline(y=0.80, color='r', linestyle='-')
plt.text(0.5, 0.70, '80% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()


# Output

In [None]:
import datetime

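def generate_submission(X, model, name):
    # Predict revenue for every row of X and write a timestamped submission CSV to output_path
    y_submit = model.predict(X)
    submit_list = []
    idx = 3001 # Test-set ids are assumed to start at 3001 and to follow the row order of X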
    for y in y_submit:
        submit_list.append({'id': idx, 'revenue': y})
        idx+=1

    submission = pd.DataFrame(submit_list)
    timestamp = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S") # isoformat() would put ':' in the filename
    sanitized_name = "".join([c for c in name if re.match(r'\w', c)])
    submission.to_csv(output_path + sanitized_name + "-" + timestamp + ".csv", index=False)
    print(name + " output available!")
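
`generate_submission` numbers its rows from a hard-coded 3001. Below is a minimal sketch of an alternative, assuming the ids are available as a sequence aligned with the predictions. `build_submission` and its arguments are hypothetical names, not part of this notebook.

```python
import pandas as pd

def build_submission(ids, predictions):
    """Pair each id with its prediction; both sequences must have the same length."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"id": list(ids), "revenue": list(predictions)})

# Example with made-up ids and predictions
demo_submission = build_submission([3001, 3002, 3003], [1.5e6, 2.0e7, 3.2e5])
print(demo_submission)
```

Passing the ids explicitly keeps the output correct even if rows are filtered or reordered before prediction.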

In [None]:
selected_models = [
    {
        'name': 'RandomForest - max depth 5',
        'model': random_forest_models[4]
    },
    {
        'name': 'RandomForest - max depth 12',
        'model': random_forest_models[11]
    },
    {
        'name': 'DecisionTree - max depth 6',
        'model': decision_tree_models[5]
    },
    {
        'name': 'DecisionTree - max depth 11',
        'model': decision_tree_models[10]
    },
    {
        'name': 'AdaBoost + RandomForest 8',
        'model': ada_rf_models[7]
    },
    {
        'name': 'AdaBoost + RandomForest 11',
        'model': ada_rf_models[10]
    },
    {
        'name': 'AdaBoost + DecisionTree 6',
        'model': ada_dt_models[5]
    },
    {
        'name': 'AdaBoost + DecisionTree 9',
        'model': ada_dt_models[8]
    },
    {
        'name': 'GradientBoostRegressor + RandomForest 5',
        'model': gb_rf_models[4]
    },
    {
        'name': 'GradientBoostRegressor + RandomForest 11',
        'model': gb_rf_models[10]
    },
    {
        'name': 'GradientBoostRegressor + DecisionTree 5',
        'model': gb_dt_models[4]
    },
    {
        'name': 'GradientBoostRegressor + DecisionTree 10',
        'model': gb_dt_models[9]
    }
]

print("Loaded models for submission")
for m in selected_models:
    print(m['model'])

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df_test.info()
dropped_features = ['single','has_homepage','cast_json','crew_json','crew','cast','Keywords','title','tagline','status','spoken_languages','release_date','production_companies','production_countries','poster_path','overview','original_title','original_language','imdb_id','genres','id','index']
df_submit = df_test.drop(dropped_features, axis=1)
df_runtime_null = df_submit[df_submit['runtime'].isna()]
df_runtime_complete = df_submit[df_submit['runtime'].notna()]

runtime_imputer = IterativeImputer(random_state=42)

runtime_imputer.fit(df_runtime_complete)
runtimes = runtime_imputer.transform(df_runtime_null)
#print(runtimes)
# Write the imputed runtimes back in row order.
# Column 5 of the imputer output is assumed to be 'runtime'; this depends on the column order of df_submit.
for it, k in enumerate(df_runtime_null.index):
    df_submit.at[k,'runtime'] = runtimes[it][5]
    
df_submit.info()

X_submit = df_submit.drop(['revenue'],axis=1).to_numpy()
for m in selected_models:
    generate_submission(X_submit, m['model'], m['name'])
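
The cell above reads the imputed runtime from a fixed column position. As a minimal sketch of a name-based alternative, the example below runs `IterativeImputer` on a small made-up frame and writes back only the missing `runtime` entries. The frame `df_demo` and its values are hypothetical, and unlike the cell above it fits the imputer on all rows rather than only on the complete ones.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Made-up data: two complete columns and a 'runtime' column with gaps
df_demo = pd.DataFrame({
    "budget": [1.0, 2.0, 3.0, 4.0, 5.0],
    "popularity": [2.0, 4.1, 5.9, 8.2, 9.8],
    "runtime": [90.0, 100.0, np.nan, 120.0, np.nan],
})

imputer = IterativeImputer(random_state=42)
# Rebuild a DataFrame so columns can be addressed by name instead of position
imputed = pd.DataFrame(imputer.fit_transform(df_demo),
                       columns=df_demo.columns, index=df_demo.index)

# Overwrite only the rows where runtime was missing
missing = df_demo["runtime"].isna()
df_demo.loc[missing, "runtime"] = imputed.loc[missing, "runtime"]
print(df_demo)
```

Looking the column up by name avoids silently writing the wrong feature if the column order changes.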

In [None]:
timestamp = datetime.datetime.now().isoformat()
!zip -r9 outputs-{timestamp}.zip *.csv