# Box Office Prediction - Classification

This notebook continues the Mod-1 project from the Flatiron School Data Science program.

## Introduction

The original project was a simple exploratory data analysis (EDA) of the movie industry that tried to determine which factors are most instrumental in creating a hit movie. In this notebook, we use several machine learning classification methods to predict whether a movie will be a blockbuster hit, and then analyze the most important features according to the best-performing model.

### Imports

In [25]:
# Regular suspects
import numpy as np
import pandas as pd
from functools import reduce
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

# pandas display max columns
pd.set_option('display.max_columns', None)

### Functions

## Data

We rebuild the dataset from scratch here, on the assumption that the skills acquired since the project was first started will produce a better dataset than the original one.

In [26]:
# Import the IMDB datasets

title_aka_df = pd.read_csv('zippedData/imdb.title.akas.csv.gz', compression = 'gzip')
title_basics_df = pd.read_csv('zippedData/imdb.title.basics.csv.gz', compression = 'gzip')
title_crew_df = pd.read_csv('zippedData/imdb.title.crew.csv.gz', compression = 'gzip')
title_principals_df = pd.read_csv('zippedData/imdb.title.principals.csv.gz', compression = 'gzip')
title_ratings_df = pd.read_csv('zippedData/imdb.title.ratings.csv.gz', compression = 'gzip')
budget_ratings_df = pd.read_csv('budget_ratings.csv')

### Title AKA

In [27]:
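print(title_aka_df.shape)
# Copy title_id into a tconst column (appended last) so this table shares
# the join key used by the other tables, then drop the original column
title_aka_df['tconst'] = title_aka_df['title_id']
title_aka_df.drop('title_id', axis = 1, inplace = True)
title_aka_df.head()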

(331703, 8)


Unnamed: 0,ordering,title,region,language,types,attributes,is_original_title,tconst
0,10,Джурасик свят,BG,bg,,,0.0,tt0369610
1,11,Jurashikku warudo,JP,,imdbDisplay,,0.0,tt0369610
2,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0,tt0369610
3,13,O Mundo dos Dinossauros,BR,,,short title,0.0,tt0369610
4,14,Jurassic World,FR,,imdbDisplay,,0.0,tt0369610


### Title Basics

In [28]:
print(title_basics_df.shape)
title_basics_df.head()

(146144, 6)


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


### Title Crew

In [29]:
print(title_crew_df.shape)
title_crew_df.head()

(146144, 3)


Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


### Title Principals

In [30]:
print(title_principals_df.shape)
title_principals_df.head()

(1028186, 6)


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


### Title Ratings

In [31]:
print(title_ratings_df.shape)
title_ratings_df.head()

(73856, 3)


Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


### Scraped Data

In [32]:
print(budget_ratings_df.shape)
budget_ratings_df.drop('Unnamed: 0', axis = 1, inplace = True)
budget_ratings_df.head()

(52943, 6)


Unnamed: 0,tconst,budget,gross,ww_gross,rating
0,tt2200832,,,,NotRated
1,tt2200860,,,1924766.0,
2,tt2200908,,,105367.0,
3,tt2200926,,,5784.0,
4,tt2200955,,,,Comedy


In [34]:
df = pd.merge(budget_ratings_df, title_ratings_df, on = 'tconst')
df = pd.merge(df, title_principals_df, on = 'tconst')
df = pd.merge(df, title_aka_df, on = 'tconst')
df = pd.merge(df, title_crew_df, on = 'tconst')
df = pd.merge(df, title_basics_df, on = 'tconst')

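print(df.shape)
# tconst stays a regular column: the original df.set_index('tconst') call
# discarded its result, and later cells rely on df['tconst']
df.head()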

(1912998, 26)


Unnamed: 0,tconst,budget,gross,ww_gross,rating,averagerating,numvotes,ordering_x,nconst,category,job,characters,ordering_y,title,region,language,types,attributes,is_original_title,directors,writers,primary_title,original_title,start_year,runtime_minutes,genres
0,tt2200860,,,1924766.0,,6.8,48,10,nm2456371,cinematographer,director of photography,,1,Born to Love You,PH,,,,0.0,nm1760842,"nm1293423,nm2029519,nm1762121",Born to Love You,Born to Love You,2012,105.0,"Drama,Romance"
1,tt2200860,,,1924766.0,,6.8,48,1,nm2029519,actor,,"[""Rex Manrique""]",1,Born to Love You,PH,,,,0.0,nm1760842,"nm1293423,nm2029519,nm1762121",Born to Love You,Born to Love You,2012,105.0,"Drama,Romance"
2,tt2200860,,,1924766.0,,6.8,48,2,nm1403269,actress,,"[""Joey Liwanag""]",1,Born to Love You,PH,,,,0.0,nm1760842,"nm1293423,nm2029519,nm1762121",Born to Love You,Born to Love You,2012,105.0,"Drama,Romance"
3,tt2200860,,,1924766.0,,6.8,48,3,nm0553445,actor,,"[""Charles""]",1,Born to Love You,PH,,,,0.0,nm1760842,"nm1293423,nm2029519,nm1762121",Born to Love You,Born to Love You,2012,105.0,"Drama,Romance"
4,tt2200860,,,1924766.0,,6.8,48,4,nm0883648,actress,,"[""Sylvia""]",1,Born to Love You,PH,,,,0.0,nm1760842,"nm1293423,nm2029519,nm1762121",Born to Love You,Born to Love You,2012,105.0,"Drama,Romance"
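The merged frame has 1,912,998 rows even though `budget_ratings_df` has only 52,943, because `title_principals_df` and `title_aka_df` each hold several rows per `tconst` and every combination is kept. The sketch below reproduces the effect on toy data (all values are made up for illustration) and shows how the `validate` argument of `merge` can flag an unexpected key relationship. It is an illustration under those assumptions, not the notebook's own method.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration)
titles = pd.DataFrame({'tconst': ['tt1', 'tt2'], 'averagerating': [6.8, 7.1]})
principals = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt1', 'tt2'],
    'nconst': ['nm1', 'nm2', 'nm3', 'nm4'],
})
akas = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt2'],
    'region': ['PH', 'US', 'FR'],
})

# tt1: 1 title row x 3 principals x 2 alternate titles = 6 merged rows
merged = titles.merge(principals, on='tconst').merge(akas, on='tconst')
rows_per_title = merged['tconst'].value_counts().sort_index()
print(rows_per_title)

# validate= asks pandas to check the key relationship and raise
# pandas.errors.MergeError when the keys are not unique as claimed
try:
    titles.merge(principals, on='tconst', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

Passing `validate='one_to_many'` instead would succeed for the first merge here, since `tconst` is unique in `titles`.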


## Cleaning

In [40]:
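def get_info(df):
    """Print the shape, dtype summary, and per-column null counts of df."""
    print('DataFrame Shape\n-------------------------------')
    print(df.shape)
    print('\nDataFrame Info\n-------------------------------')
    # df.info() prints its own summary and returns None, so it is not wrapped in print()
    df.info()
    print('\nDataFrame Null Values\n-------------------------------')
    print(df.isna().sum())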

In [41]:
get_info(df)

DataFrame Shape
-------------------------------
(1912998, 26)

DataFrame Info
-------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1912998 entries, 0 to 1912997
Data columns (total 26 columns):
 #   Column             Dtype  
---  ------             -----  
 0   tconst             object 
 1   budget             float64
 2   gross              float64
 3   ww_gross           float64
 4   rating             object 
 5   averagerating      float64
 6   numvotes           int64  
 7   ordering_x         int64  
 8   nconst             object 
 9   category           object 
 10  job                object 
 11  characters         object 
 12  ordering_y         int64  
 13  title              object 
 14  region             object 
 15  language           object 
 16  types              object 
 17  attributes         object 
 18  is_original_title  float64
 19  directors          object 
 20  writers            object 
 21  primary_title      object 
 22  orig

In [42]:
df['tconst'].value_counts()

tt2488496    610
tt1201607    550
tt2310332    550
tt1790809    530
tt2948356    530
            ... 
tt1798640      1
tt6402046      1
tt7131678      1
tt7547516      1
tt1337191      1
Name: tconst, Length: 41466, dtype: int64
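The counts above show that a single title can occupy hundreds of rows, while a classifier usually expects one row per movie. One possible way to get there is to aggregate per `tconst`, as sketched below on toy data. The feature names `n_principals` and `n_regions` are hypothetical and are not taken from the notebook.

```python
import pandas as pd

# Toy frame mimicking the merged layout: title-level values repeat on every row
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt1', 'tt1', 'tt2'],
    'ww_gross': [100.0, 100.0, 100.0, 100.0, 50.0],
    'nconst': ['nm1', 'nm1', 'nm2', 'nm2', 'nm3'],
    'region': ['PH', 'US', 'PH', 'US', 'FR'],
})

per_title = toy.groupby('tconst').agg(
    ww_gross=('ww_gross', 'first'),      # constant within a title, so take the first value
    n_principals=('nconst', 'nunique'),  # hypothetical feature: distinct credited people
    n_regions=('region', 'nunique'),     # hypothetical feature: distinct regions
).reset_index()
print(per_title)
```

After this step each `tconst` appears exactly once, so the frame can be joined to other title-level tables without multiplying rows.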

In [45]:
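# Note: drop_duplicates() returns a new DataFrame. The result is not assigned
# here, so df is unchanged and the shape below matches the earlier one.
# Use df = df.drop_duplicates() for the call to take effect.
df.drop_duplicates()
get_info(df)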

DataFrame Shape
-------------------------------
(1912998, 26)

DataFrame Info
-------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1912998 entries, 0 to 1912997
Data columns (total 26 columns):
 #   Column             Dtype  
---  ------             -----  
 0   tconst             object 
 1   budget             float64
 2   gross              float64
 3   ww_gross           float64
 4   rating             object 
 5   averagerating      float64
 6   numvotes           int64  
 7   ordering_x         int64  
 8   nconst             object 
 9   category           object 
 10  job                object 
 11  characters         object 
 12  ordering_y         int64  
 13  title              object 
 14  region             object 
 15  language           object 
 16  types              object 
 17  attributes         object 
 18  is_original_title  float64
 19  directors          object 
 20  writers            object 
 21  primary_title      object 
 22  orig
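
The shape reported after the `drop_duplicates()` call is identical to the one before it. A minimal sketch on toy data (values invented for illustration) shows why: the method returns a new DataFrame and leaves the original untouched unless the result is assigned.

```python
import pandas as pd

toy = pd.DataFrame({'tconst': ['tt1', 'tt1', 'tt2'], 'region': ['PH', 'PH', 'FR']})

toy.drop_duplicates()            # result discarded, toy keeps all of its rows
rows_unassigned = len(toy)

deduped = toy.drop_duplicates()  # keep the returned frame instead
rows_assigned = len(deduped)

print(rows_unassigned, rows_assigned)
```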