# TMDB Box Office Prediction

## Downloading Data

There are three ways to get the data. If you're running this notebook locally, you can download the data manually or through the Kaggle API. If you're running it as a kernel on Kaggle, the data is already there and you don't need to do anything. In all three cases you must first have [signed in](https://www.kaggle.com/) to Kaggle and [accepted the rules](https://www.kaggle.com/c/tmdb-box-office-prediction/rules) of the competition.

### Downloading manually

Download the [data](https://www.kaggle.com/c/10300/download-all) and save the zip file next to this notebook. Then unzip it; the archive extracts into a directory called `input`:

In [2]:
!unzip tmdb-box-office-prediction.zip

Archive:  tmdb-box-office-prediction.zip
  inflating: input/train.csv         
  inflating: input/sample_submission.csv  
  inflating: input/test.csv          


### Downloading with the Kaggle API

If you have already set up the [Kaggle API](https://github.com/Kaggle/kaggle-api), you can download the data for this competition with the following commands. If you get a *403 Permission Denied* error, you have not yet [accepted the rules](https://www.kaggle.com/c/tmdb-box-office-prediction/rules) of the competition.

In [15]:
!kaggle competitions download -c tmdb-box-office-prediction
!unzip train.csv.zip
!unzip test.csv.zip
!chmod 644 *.csv

Downloading sample_submission.csv to /Users/eq81tw/Developer/Kaggle/TMDB
  0%|                                               | 0.00/60.1k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 60.1k/60.1k [00:00<00:00, 34.8MB/s]
Downloading train.csv.zip to /Users/eq81tw/Developer/Kaggle/TMDB
100%|█████████████████████████████████████▉| 7.00M/7.02M [00:06<00:00, 1.01MB/s]
100%|██████████████████████████████████████| 7.02M/7.02M [00:06<00:00, 1.14MB/s]
Downloading test.csv.zip to /Users/eq81tw/Developer/Kaggle/TMDB
 97%|████████████████████████████████████▋ | 10.0M/10.3M [00:10<00:00, 1.05MB/s]
100%|██████████████████████████████████████| 10.3M/10.3M [00:10<00:00, 1.03MB/s]
Archive:  train.csv.zip
  inflating: train.csv               
Archive:  test.csv.zip
  inflating: test.csv                


### Running this notebook as a kernel on Kaggle

You don't need to do anything. The data is already there.

## Loading Data

In [16]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
import os

In [17]:
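if os.getcwd() == '/kaggle/working':
    print("Running on Kaggle kernel")
    data_folder = '../input/'
elif os.path.exists('input/train.csv'):
    # Manual download: the archive unzips into `input/`
    data_folder = 'input/'
else:
    # Kaggle API download: the CSV files sit next to this notebook
    data_folder = ''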

train = pd.read_csv(data_folder + 'train.csv')
test = pd.read_csv(data_folder + 'test.csv')

print('train dataset size:', train.shape)
print('test dataset size:', test.shape)

train.sample(4)

train dataset size: (3000, 23)
test dataset size: (4398, 22)


Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
344,345,"[{'id': 192492, 'name': 'The Jack Ryan Collect...",60000000,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,tt1205537,en,Jack Ryan: Shadow Recruit,"Jack Ryan, as a young covert CIA analyst, unco...",11.234862,...,1/15/14,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Trust no one.,Jack Ryan: Shadow Recruit,"[{'id': 212, 'name': 'london england'}, {'id':...","[{'cast_id': 14, 'character': 'Jack Ryan', 'cr...","[{'credit_id': '52fe4c25c3a368484e1a9da1', 'de...",50549107
762,763,,2000000,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name...",,tt0405094,de,Das Leben der Anderen,A tragic love story set in East Berlin with th...,9.02255,...,3/15/06,137.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,"Before the Fall of the Berlin Wall, East Germa...",The Lives of Others,"[{'id': 74, 'name': 'germany'}, {'id': 220, 'n...","[{'cast_id': 7, 'character': 'Christa-Maria Si...","[{'credit_id': '564fcf649251414b070058b5', 'de...",70000000
2528,2529,,6000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,tt0068421,en,The Cowboys,When his cattlemen abandon him for the gold fi...,3.189566,...,1/13/72,131.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,All they wanted was their chance to be men...a...,The Cowboys,"[{'id': 5202, 'name': 'boy'}, {'id': 5701, 'na...","[{'cast_id': 1, 'character': 'Wil Andersen', '...","[{'credit_id': '5769751592514153b200266a', 'de...",7500000
1422,1423,,32000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...",,tt2101473,en,The Physician,"The story of Rob Cole, a boy who is left a pen...",13.332387,...,12/25/13,155.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,The Physician,"[{'id': 1279, 'name': 'medicine'}, {'id': 1965...","[{'cast_id': 4, 'character': 'Rob Cole', 'cred...","[{'credit_id': '561389fd92514147d800de96', 'de...",57284237


## Feature engineering

In [18]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
id                       3000 non-null int64
belongs_to_collection    604 non-null object
budget                   3000 non-null int64
genres                   2993 non-null object
homepage                 946 non-null object
imdb_id                  3000 non-null object
original_language        3000 non-null object
original_title           3000 non-null object
overview                 2992 non-null object
popularity               3000 non-null float64
poster_path              2999 non-null object
production_companies     2844 non-null object
production_countries     2945 non-null object
release_date             3000 non-null object
runtime                  2998 non-null float64
spoken_languages         2980 non-null object
status                   3000 non-null object
tagline                  2403 non-null object
title                    3000 non-null object
Keywords             

### Process JSON-style features

There are 8 JSON-style features, 4 numerical features, 4 text features, and 1 date feature. First, convert the JSON-style features into string, category, or list features:

* **`belongs_to_collection`**: convert `name` into string
* **`genres`, `production_companies`**: convert `name` values into comma-separated string list
* **`production_countries`**: convert `iso_3166_1` values into comma-separated string list
* **`spoken_languages`**: convert `iso_639_1` values into comma-separated string list
* **`Keywords`**: convert `name` values into comma-separated string list
* **`cast`, `crew`**: keep only their lengths, on the assumption that the detailed information is unlikely to be relevant to the revenue

In [19]:
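import ast

def proc_json(string, key):
    """Join the `key` values of a JSON-style list string with commas."""
    try:
        # literal_eval only parses Python literals, unlike eval
        data = ast.literal_eval(string)
        return ",".join([d[key] for d in data])
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Missing (NaN) or malformed entries become an empty string
        return ''

def proc_json_len(string):
    """Return the number of entries in a JSON-style list string."""
    try:
        data = ast.literal_eval(string)
        return len(data)
    except (ValueError, SyntaxError, TypeError):
        # Missing (NaN) or malformed entries count as zero
        return 0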

train.belongs_to_collection = train.belongs_to_collection.apply(lambda x: proc_json(x, 'name'))
test.belongs_to_collection = test.belongs_to_collection.apply(lambda x: proc_json(x, 'name'))

train.genres = train.genres.apply(lambda x: proc_json(x, 'name'))
test.genres = test.genres.apply(lambda x: proc_json(x, 'name'))

train.production_companies = train.production_companies.apply(lambda x: proc_json(x, 'name'))
test.production_companies = test.production_companies.apply(lambda x: proc_json(x, 'name'))

train.production_countries = train.production_countries.apply(lambda x: proc_json(x, 'iso_3166_1'))
test.production_countries = test.production_countries.apply(lambda x: proc_json(x, 'iso_3166_1'))

train.spoken_languages = train.spoken_languages.apply(lambda x: proc_json(x, 'iso_639_1'))
test.spoken_languages = test.spoken_languages.apply(lambda x: proc_json(x, 'iso_639_1'))

train.Keywords = train.Keywords.apply(lambda x: proc_json(x, 'name'))
test.Keywords = test.Keywords.apply(lambda x: proc_json(x, 'name'))

train.cast = train.cast.apply(proc_json_len)
test.cast = test.cast.apply(proc_json_len)

train.crew = train.crew.apply(proc_json_len)
test.crew = test.crew.apply(proc_json_len)
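
To make the conversion concrete, here is a minimal sketch on a hand-written string in the same shape as the `genres` column. The helper name `names_from_json_like` and the sample value are made up for illustration; they are not part of the dataset or of the cell above.

```python
import ast

# Hand-written sample in the same shape as the `genres` column (not a real row)
sample = "[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]"

def names_from_json_like(value, key):
    """Return the comma-joined `key` values, or '' for missing/malformed input."""
    if not isinstance(value, str):
        # Missing values arrive as float NaN rather than as strings
        return ''
    try:
        return ",".join(str(d[key]) for d in ast.literal_eval(value))
    except (ValueError, SyntaxError, KeyError, TypeError):
        return ''

joined = names_from_json_like(sample, 'name')
missing = names_from_json_like(float('nan'), 'name')
print(repr(joined), repr(missing))
```

A missing value maps to an empty string, which is why the later genre loop skips empty entries.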

### Genres
Because a movie can have multiple genres, converting the `genres` column into a categorical type is not a reasonable approach: the same set of genres could end up in different categories, e.g. 'Drama,Romance' and 'Romance,Drama' would be treated as distinct values. Instead, we create one dummy column per genre.

In [20]:
# Collect the distinct genres that occur in the training set
# (Series.iteritems() was removed in pandas 2.0, so iterate the values directly)
genres = []
for val in train.genres:
    for gen in val.split(','):
        if gen and gen not in genres:
            genres.append(gen)

# Add one 0/1 indicator column per genre to both datasets
genre_column_names = []
for gen in genres:
    col_name = 'genre_' + gen.replace(' ', '_')
    train[col_name] = train.genres.str.contains(gen, regex=False).astype('uint8')
    test[col_name] = test.genres.str.contains(gen, regex=False).astype('uint8')
    genre_column_names.append(col_name)
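
As an alternative to the explicit loop, pandas can build the same kind of indicator columns in one call with `Series.str.get_dummies`. The following is only a sketch on a toy frame (the `toy` data is invented), not a drop-in replacement for the cell above:

```python
import pandas as pd

# Toy data: the first two rows hold the same genres in a different order
toy = pd.DataFrame({'genres': ['Drama,Romance', 'Romance,Drama', 'Action', '']})

# Split on commas and create one 0/1 column per distinct genre
dummies = (
    toy['genres']
    .str.get_dummies(sep=',')
    .add_prefix('genre_')
    .astype('uint8')
)
print(dummies)
```

Both orderings of 'Drama' and 'Romance' get identical rows, and the empty string yields all zeros. When applying this to the real data, the test columns would need to be aligned with the train columns, for example with `reindex(columns=..., fill_value=0)`.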

### Normalize Revenue and Budget

Budget and revenue are highly skewed, so we log-transform both with `np.log1p`, which also handles zero values.

In [21]:
train['revenue'] = np.log1p(train['revenue'])
train['budget'] = np.log1p(train['budget'])
test['budget'] = np.log1p(test['budget'])

train.sample(5)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie
2486,2487,,18.603002,"Action,Family,Science Fiction",http://www.speedracerthemovie.warnerbros.com/,tt0811080,en,Speed Racer,Speed Racer is the tale of a young and brillia...,7.134168,...,0,0,1,0,0,0,0,0,0,0
1189,1190,,13.927947,"Drama,History",,tt0027690,en,The Gorgeous Hussy,It's the early nineteenth century Washington. ...,0.948775,...,0,0,0,0,0,0,0,0,1,0
536,537,,14.914123,Comedy,,tt0327036,en,Martin Lawrence Live: Runteldat,The controversial bad-boy of comedy delivers a...,0.527297,...,0,0,0,0,0,0,0,0,0,0
1103,1104,,17.216708,"Comedy,Crime,Drama,Romance,Thriller",,tt0270288,en,Confessions of a Dangerous Mind,"Television made him famous, but his biggest hi...",7.645827,...,0,1,0,0,0,0,0,0,0,0
776,777,,18.683045,Drama,,tt0455824,en,Australia,"Set in northern Australia before World War II,...",9.566135,...,0,0,0,0,0,0,0,0,0,0
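
The transform is `log1p(x) = log(1 + x)`, and `np.expm1` is its inverse. A small sketch with a few illustrative amounts shows that zero stays zero and that the round trip recovers the original values:

```python
import numpy as np

# Illustrative amounts, including zero (which a plain log could not handle)
revenue = np.array([0.0, 1.0, 7_500_000.0, 70_000_000.0])

logged = np.log1p(revenue)    # log(1 + x), so 0 maps to 0
restored = np.expm1(logged)   # exp(x) - 1 undoes log1p

print(logged)
print(restored)
```

This inverse is what turns log-scale predictions back into revenue when the submission file is written.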


## Select features

For this benchmark solution, we are manually selecting the following features:

In [22]:
features = ['budget', 'popularity', 'runtime', 'genre_Adventure', 'genre_Action', 'genre_Fantasy', 
            'genre_Drama', 'genre_Family', 'genre_Animation', 'genre_Science_Fiction']

## Fix missing values

The `runtime` column has 2 and 4 missing values in the train and test datasets, respectively:

In [23]:
print('-'*30)
print(train[features].isnull().sum())
print('-'*30)
print(test[features].isnull().sum())

------------------------------
budget                   0
popularity               0
runtime                  2
genre_Adventure          0
genre_Action             0
genre_Fantasy            0
genre_Drama              0
genre_Family             0
genre_Animation          0
genre_Science_Fiction    0
dtype: int64
------------------------------
budget                   0
popularity               0
runtime                  4
genre_Adventure          0
genre_Action             0
genre_Fantasy            0
genre_Drama              0
genre_Family             0
genre_Animation          0
genre_Science_Fiction    0
dtype: int64


So let's fill the missing values in both datasets with the mean runtime of the training set.

In [24]:
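# Compute the training mean once (NaN values are skipped) and reuse it for both sets
runtime_mean = train.runtime.mean()
train.runtime = train.runtime.fillna(runtime_mean)
test.runtime = test.runtime.fillna(runtime_mean)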
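
The same idea, learning the fill value on the training data and reusing it for the test data, can also be expressed with scikit-learn's `SimpleImputer`. This is a sketch on invented toy frames (`toy_train`, `toy_test`), shown only to illustrate the pattern:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with one missing runtime each
toy_train = pd.DataFrame({'runtime': [90.0, 110.0, np.nan]})
toy_test = pd.DataFrame({'runtime': [np.nan, 120.0]})

imputer = SimpleImputer(strategy='mean')

# fit_transform learns the mean (100.0) from the training data only
train_filled = np.asarray(imputer.fit_transform(toy_train[['runtime']]))

# transform reuses that training mean for the test data
test_filled = np.asarray(imputer.transform(toy_test[['runtime']]))

print(train_filled.ravel(), test_filled.ravel())
```

Fitting on the training set alone keeps test-set statistics out of the preprocessing.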

In [25]:
train[features].sample(5)

Unnamed: 0,budget,popularity,runtime,genre_Adventure,genre_Action,genre_Fantasy,genre_Drama,genre_Family,genre_Animation,genre_Science_Fiction
250,16.346028,4.052991,105.0,0,0,0,0,0,0,0
1874,0.0,0.229233,0.0,0,0,0,1,0,0,0
60,18.826146,23.065078,144.0,1,1,0,0,0,0,0
1269,13.594865,6.165341,104.0,0,0,0,1,0,0,0
2515,15.70258,8.151346,124.0,0,0,0,1,0,0,0


## Train model

Let's use cross validation on a linear regression model using the selected features to check the explained variance:

In [26]:
X, y = train[features], train['revenue']
model = LinearRegression()
result = cross_validate(model, X, y, cv=10, scoring="explained_variance", verbose=False, n_jobs=-1)
print(f"The variance explained is {np.mean(result['test_score']):.1%}")

The variance explained is 32.2%
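
Explained variance is one way to judge the model. Since the target is already `log1p(revenue)`, a plain RMSE on this scale corresponds to an RMSLE on the raw revenue, which is, as far as we know, the metric this competition uses (an assumption worth checking on the competition's evaluation page). The sketch below shows the scoring call on synthetic data; `X_toy` and `y_toy` are made up and stand in for `X` and `y`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the real features and log-revenue
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.1, size=200)

# scikit-learn maximizes scores, so RMSE is reported as a negative number
scores = cross_val_score(
    LinearRegression(), X_toy, y_toy,
    cv=10, scoring='neg_root_mean_squared_error',
)
rmse = -scores.mean()
print(f"Cross-validated RMSE: {rmse:.3f}")
```

Lower is better for this score, in contrast to explained variance.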


## Make predictions

Now let's fit the model on the full training set and make predictions for the test set:

In [27]:
model.fit(X, y)
predict = model.predict(test[features])

## Create submission file

Create a submission file from the predictions, converting them back from the log scale with `np.expm1`:

In [28]:
submit = pd.DataFrame({'id': test.id, 'revenue': np.expm1(predict)})
submit.to_csv('submission.csv', index=False)

----

## Submit your results to Kaggle

### With the Kaggle API

Run the following command:

In [29]:
!kaggle competitions submit tmdb-box-office-prediction -f submission.csv -m "New submission"

100%|████████████████████████████████████████| 100k/100k [00:03<00:00, 27.5kB/s]
Successfully submitted to TMDB Box Office Prediction

Then go to [your submissions](https://www.kaggle.com/c/tmdb-box-office-prediction/submissions) to see your latest results.
 
### By uploading your submission file manually

Go to the [submission page](https://www.kaggle.com/c/tmdb-box-office-prediction/submit) on Kaggle and click the *'Upload Files'* button below to upload your .csv and have it scored, so you can see your new ranking!

### When running the notebook as a kernel on Kaggle

Click the *'Commit'* button at the top to commit (and save) your work and run the notebook. The notebook must run successfully from start to finish, so remove any cells that were meant to be run only on a local machine.

Once it has run, click the blue *'Open Version'* button at the right, then click on *'Output'* at the left, which takes you to the *'Output'* section of the page. There you will see a *'Submit to competition'* button. Clicking it will score your submission file and show you your new ranking!