<a href="https://colab.research.google.com/github/tijlk/tmdb_box_office/blob/master/TMDB_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TMDB Box Office Prediction

We are challenging you to beat the other teams in two competitions. This notebook covers one of them: predicting the box office results of movies. For more info, see the [competition page](https://www.kaggle.com/c/tmdb-box-office-prediction).

## Setting up Google Colab

*   First of all, it's probably a good idea to save this notebook in your Google Drive. To do that, go to 'File' and click on 'Save a copy in Drive'. Otherwise, you might lose your results if you're not careful.
*   Secondly, if you want to use GPUs, make sure to select a GPU runtime. Go to 'Runtime' -> 'Change runtime type'. Select 'GPU' as Hardware accelerator and click on 'Save'.

## Important! Configure your team name.
This will be used to identify your submissions in the Kaggle contests.

In [0]:
team_name = 'Avengers'

## Setting up the Kaggle API

All teams share the same Kaggle account, so we can easily keep track of each other's scores and you don't have to set up anything :).

In [0]:
import os
os.environ["KAGGLE_USERNAME"] = "uniteds"
os.environ["KAGGLE_KEY"] = "e2cc23b4870d3b069e2f8bf9d159847d"

## Downloading the data

In [0]:
# Download the dataset.
!kaggle competitions download -c tmdb-box-office-prediction
!unzip -o train.csv.zip
!unzip -o test.csv.zip
!chmod 644 *.csv

sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  train.csv.zip
  inflating: train.csv               
Archive:  test.csv.zip
  inflating: test.csv                


## Loading Data

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
import os

In [0]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print('train dataset size:', train.shape)
print('test dataset size:', test.shape)

train.sample(4)

train dataset size: (3000, 23)
test dataset size: (4398, 22)


Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
1825,1826,"[{'id': 111751, 'name': 'Texas Chainsaw Massac...",0,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",,tt0099994,en,Leatherface: Texas Chainsaw Massacre III,A couple encounters a perverted gas station at...,5.79907,...,1/12/90,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,He puts the teeth in terror.,Leatherface: Texas Chainsaw Massacre III,"[{'id': 9663, 'name': 'sequel'}, {'id': 11545,...","[{'cast_id': 1, 'character': 'Michelle', 'cred...","[{'credit_id': '52fe44b5c3a368484e0325ab', 'de...",5765562
393,394,"[{'id': 307637, 'name': 'Jingle All the Way Co...",60000000,"[{'id': 10751, 'name': 'Family'}, {'id': 35, '...",,tt0116705,en,Jingle All the Way,"Meet Howard Langston, a salesman for a mattres...",7.898202,...,11/15/96,89.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Two Dads, One Toy, No Prisoners.",Jingle All the Way,"[{'id': 65, 'name': 'holiday'}, {'id': 1441, '...","[{'cast_id': 12, 'character': 'Howard Langston...","[{'credit_id': '52fe44e0c3a36847f80af6f3', 'de...",129832389
229,230,,14000000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,tt0113463,en,Jefferson in Paris,"His wife having recently died, Thomas Jefferso...",1.596058,...,3/31/95,139.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A powerful man torn between his love for one w...,Jefferson in Paris,"[{'id': 254, 'name': 'france'}, {'id': 2020, '...","[{'cast_id': 2, 'character': 'Thomas Jefferson...","[{'credit_id': '52fe49d49251416c910ba1f5', 'de...",2474000
2399,2400,,0,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",http://www.kaiji-movie.jp,tt1904937,ja,カイジ2 人生奪回ゲーム,3 years after the ultimate life-or-death game ...,1.13376,...,11/5/11,133.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,Kaiji 2: The Ultimate Gambler,,"[{'cast_id': 1, 'character': 'Kaiji Ito', 'cre...","[{'credit_id': '52fe4a68c3a36847f81cc20f', 'de...",15


## Feature engineering

In [0]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
id                       3000 non-null int64
belongs_to_collection    604 non-null object
budget                   3000 non-null int64
genres                   2993 non-null object
homepage                 946 non-null object
imdb_id                  3000 non-null object
original_language        3000 non-null object
original_title           3000 non-null object
overview                 2992 non-null object
popularity               3000 non-null float64
poster_path              2999 non-null object
production_companies     2844 non-null object
production_countries     2945 non-null object
release_date             3000 non-null object
runtime                  2998 non-null float64
spoken_languages         2980 non-null object
status                   3000 non-null object
tagline                  2403 non-null object
title                    3000 non-null object
Keywords             

### Process JSON-style features

The dataset has 8 JSON-style features, 4 numerical features, 4 text features, and 1 date feature. First, we convert the JSON-style features into plain strings, comma-separated lists, or counts:

* **`belongs_to_collection`**: convert `name` into string
* **`genres`, `production_companies`**: convert `name` values into comma-separated string list
* **`production_countries`**: convert `iso_3166_1` values into comma-separated string list
* **`spoken_languages`**: convert `iso_639_1` values into comma-separated string list
* **`Keywords`**: convert `name` values into comma-separated string list
* **`cast`, `crew`**: keep only the number of entries, as their detailed information is very unlikely to be relevant to the revenue

In [0]:
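import ast

def proc_json(string, key):
    # The columns hold Python-literal lists of dicts. literal_eval parses them
    # without executing arbitrary code (unlike eval).
    try:
        data = ast.literal_eval(string)
        return ",".join([d[key] for d in data])
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Missing (NaN) or malformed entries become an empty string.
        return ''

def proc_json_len(string):
    try:
        data = ast.literal_eval(string)
        return len(data)
    except (ValueError, SyntaxError, TypeError):
        # Missing (NaN) or malformed entries count as zero.
        return 0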

train.belongs_to_collection = train.belongs_to_collection.apply(lambda x: proc_json(x, 'name'))
test.belongs_to_collection = test.belongs_to_collection.apply(lambda x: proc_json(x, 'name'))

train.genres = train.genres.apply(lambda x: proc_json(x, 'name'))
test.genres = test.genres.apply(lambda x: proc_json(x, 'name'))

train.production_companies = train.production_companies.apply(lambda x: proc_json(x, 'name'))
test.production_companies = test.production_companies.apply(lambda x: proc_json(x, 'name'))

train.production_countries = train.production_countries.apply(lambda x: proc_json(x, 'iso_3166_1'))
test.production_countries = test.production_countries.apply(lambda x: proc_json(x, 'iso_3166_1'))

train.spoken_languages = train.spoken_languages.apply(lambda x: proc_json(x, 'iso_639_1'))
test.spoken_languages = test.spoken_languages.apply(lambda x: proc_json(x, 'iso_639_1'))

train.Keywords = train.Keywords.apply(lambda x: proc_json(x, 'name'))
test.Keywords = test.Keywords.apply(lambda x: proc_json(x, 'name'))

train.cast = train.cast.apply(proc_json_len)
test.cast = test.cast.apply(proc_json_len)

train.crew = train.crew.apply(proc_json_len)
test.crew = test.crew.apply(proc_json_len)

### Genres
Because a movie can have multiple genres, converting the `genres` column to a category type is not a reasonable approach: the same genres in a different order, e.g. 'Drama,Romance' and 'Romance,Drama', would end up in different categories. Therefore we create a dummy column for each genre.

In [0]:
# get total genres list
genres = []
for idx, val in train.genres.items():  # iteritems() was removed in pandas 2.0
    gen_list = val.split(',')
    for gen in gen_list:
        if gen == '':
            continue

        if gen not in genres:
            genres.append(gen)
            

genre_column_names = []
for gen in genres:
    col_name = 'genre_' + gen.replace(' ', '_')
    train[col_name] = train.genres.str.contains(gen).astype('uint8')
    test[col_name] = test.genres.str.contains(gen).astype('uint8')
    genre_column_names.append(col_name)
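
The ordering problem and the dummy-column fix can be seen on a tiny made-up series. This sketch uses pandas' built-in `Series.str.get_dummies` as a compact alternative to the loop in the cell above; it is an illustration, not the code the benchmark runs.

```python
import pandas as pd

# Made-up genre strings; the first two hold the same genres in a different order.
sample_genres = pd.Series(["Drama,Romance", "Romance,Drama", "Action", ""])

# Treated as plain categories, every distinct string is its own category.
as_category = sample_genres.astype("category")
n_categories = len(as_category.cat.categories)

# One 0/1 indicator column per genre, independent of the order in the string.
dummies = sample_genres.str.get_dummies(sep=",").add_prefix("genre_")

print(n_categories)
print(dummies)
```

With dummy columns, the first two rows become identical, and the row with no genres is all zeros.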

### Normalize Revenue and Budget

Budget and revenue are highly skewed, so we log-transform them with `np.log1p`.

In [0]:
train['revenue'] = np.log1p(train['revenue'])
train['budget'] = np.log1p(train['budget'])
test['budget'] = np.log1p(test['budget'])

train.sample(5)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie
423,424,,17.073607,"Crime,Thriller,Mystery",,tt1486192,en,The Raven,A fictionalized account of the last days of Ed...,7.254242,...,0,1,0,1,0,0,0,0,0,0
2458,2459,,0.0,"Drama,Fantasy,Horror,Thriller",,tt3174376,en,Before I Wake,About an orphaned child whose dreams - and nig...,9.449103,...,0,0,0,0,0,1,0,0,0,0
1440,1441,Psycho Collection,0.0,"Horror,Thriller",,tt0091799,en,Psycho III,Norman Bates is still running his little motel...,6.345885,...,0,0,0,0,0,0,0,0,0,0
995,996,,15.60727,"Drama,Horror,Science Fiction",http://www.casshern.com/,tt0405821,ja,キャシャーン,Fifty years of war between the Great Eastern F...,4.664474,...,0,0,1,0,0,0,0,0,0,0
1643,1644,The Maze Runner Collection,17.926384,Action,http://mazerunnermovies.com,tt4046784,en,Maze Runner: The Scorch Trials,Thomas and his fellow Gladers face their great...,41.225769,...,0,0,0,0,0,0,0,0,0,0
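
As a quick illustration of the transform, the sketch below applies `np.log1p` to a zero and to three revenue values taken from the sample rows shown earlier, then inverts it with `np.expm1`. Since `log1p(x)` is `log(1 + x)`, a value of 0 maps to 0 instead of negative infinity.

```python
import numpy as np

# A zero plus three revenues from the sample rows earlier in the notebook.
values = np.array([0.0, 15.0, 2_474_000.0, 129_832_389.0])

# Compress the range: log(1 + x) is defined at x = 0.
log_values = np.log1p(values)

# expm1 is the exact inverse, exp(x) - 1.
restored = np.expm1(log_values)

print(log_values)
print(restored)
```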


## Select features

For this benchmark solution, we are manually selecting the following features:

In [0]:
features = ['budget', 'popularity', 'runtime', 'genre_Adventure', 'genre_Action', 'genre_Fantasy', 
            'genre_Drama', 'genre_Family', 'genre_Animation', 'genre_Science_Fiction']

## Fix missing values

The `runtime` column has 2 and 4 missing values for the train and test dataset respectively:

In [0]:
print('-'*30)
print(train[features].isnull().sum())
print('-'*30)
print(test[features].isnull().sum())

------------------------------
budget                   0
popularity               0
runtime                  2
genre_Adventure          0
genre_Action             0
genre_Fantasy            0
genre_Drama              0
genre_Family             0
genre_Animation          0
genre_Science_Fiction    0
dtype: int64
------------------------------
budget                   0
popularity               0
runtime                  4
genre_Adventure          0
genre_Action             0
genre_Fantasy            0
genre_Drama              0
genre_Family             0
genre_Animation          0
genre_Science_Fiction    0
dtype: int64


So let's fill the missing values in both datasets with the mean runtime of the training set.

In [0]:
train.runtime = train.runtime.fillna(np.mean(train.runtime))
test.runtime = test.runtime.fillna(np.mean(train.runtime))
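
The same idea on made-up numbers, as a minimal sketch: the mean is computed once from the training values (pandas skips NaN by default) and then reused for the test values, so no statistic is derived from the test set. All names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up runtimes with missing entries.
train_rt = pd.Series([90.0, np.nan, 110.0])
test_rt = pd.Series([np.nan, 100.0])

# Mean of the known training runtimes; NaN values are ignored.
train_mean = train_rt.mean()

# Fill both sets with the training mean.
train_filled = train_rt.fillna(train_mean)
test_filled = test_rt.fillna(train_mean)

print(train_mean)
print(test_filled.tolist())
```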

In [0]:
train[features].sample(5)

Unnamed: 0,budget,popularity,runtime,genre_Adventure,genre_Action,genre_Fantasy,genre_Drama,genre_Family,genre_Animation,genre_Science_Fiction
1801,15.068274,0.556435,90.0,0,0,0,0,0,0,0
1552,17.216708,8.040589,138.0,0,0,0,1,0,0,0
1745,15.671809,19.293562,108.0,0,1,0,0,0,0,1
2674,16.321037,6.561373,103.0,0,0,0,1,0,0,0
2729,18.643824,13.392824,97.0,0,0,0,0,1,1,0


## Train model

Let's use cross validation on a linear regression model using the selected features to check the explained variance:

In [0]:
X, y = train[features], train['revenue']
model = LinearRegression()
result = cross_validate(model, X, y, cv=10, scoring="explained_variance", verbose=False, n_jobs=-1)
print(f"The variance explained is {np.mean(result['test_score']):.1%}")

The variance explained is 32.2%
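
Explained variance is a handy sanity check, but the competition page lists RMSLE as its evaluation metric (worth confirming there). Because the target here is already `log1p(revenue)`, the plain RMSE of the model on this log scale corresponds to that metric. The sketch below shows how such a score could be computed with `cross_validate`; it runs on synthetic data (`X_demo`, `y_demo` are made up) so that it is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-ins for the feature matrix and the log-scale target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.1, size=200)

# scikit-learn reports the negated MSE so that higher is always better.
result_demo = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=10, scoring="neg_mean_squared_error"
)

# Flip the sign and take the square root to get the RMSE per fold.
rmse_per_fold = np.sqrt(-result_demo["test_score"])
print(f"Mean RMSE on the log scale: {rmse_per_fold.mean():.3f}")
```

In the notebook, `X_demo` and `y_demo` would be replaced by `X` and `y`.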


## Make predictions

Now let's train the model and make predictions for the test set:

In [0]:
model.fit(X, y)
predict = model.predict(test[features])

## Create submission file

Create a submission file from the predictions, using `np.expm1` to undo the log transform of the revenue:

In [0]:
submit = pd.DataFrame({'id': test.id, 'revenue':np.expm1(predict)})
submit.to_csv('submission.csv', index=False)
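
One optional safeguard, sketched here on made-up numbers: a linear model can predict a negative log-revenue, and `np.expm1` then returns a slightly negative revenue, which a log-based metric cannot handle. Clipping at zero avoids that. The ids and predictions below are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up log-scale predictions; the first one is negative on purpose.
log_pred = np.array([-0.5, 0.0, 15.0])

# Back-transform, then clip negative revenues to zero.
revenue_pred = np.clip(np.expm1(log_pred), 0, None)

demo_submit = pd.DataFrame({"id": [1, 2, 3], "revenue": revenue_pred})
print(demo_submit)
```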

----

## Submit your results to Kaggle

**IMPORTANT** Each team is allowed only **2** submissions per day per competition! So, be careful!

Run the following command:

In [18]:
if input("Are you sure you want to submit? (y)es or (n)o: ") == 'y':
    !kaggle competitions submit tmdb-box-office-prediction -f submission.csv -m "{team_name}"

Are you sure you want to submit? (y)es or (n)o: n


In [17]:
!kaggle competitions submissions tmdb-box-office-prediction

No submissions found


So, did you beat the other teams? :)