# Project: Machine Learning Capstone - Predict Movie Revenue

This project works with the datasets from the Kaggle competition at https://www.kaggle.com/c/tmdb-box-office-prediction/data to create a model for predicting worldwide box office revenue for movies.

There are 3000 movies in the sample data set being used for training and testing. The model created from this will then be used on a separate dataset to score the accuracy of predictions. The sample set contains 23 columns, 7 of which contain JSON arrays of objects. These objects, along with other fields, will require pre-processing into individual table columns for use in modeling. The fields that will be focused on are:

* **belongs_to_collection**: Indicates whether this movie is part of a series, and if so which series. Will be one-hot encoded as collection_0 for none or collection_X where X is the id of the associated collection.
* **budget**: Budget for film as an integer. Some entries have a budget of 0. It will be interesting to try the model both with and without this column as a consideration.
* **genres**: Indicates the genres to which the film belongs. Will be one-hot encoded as genres_X where X is the id of the associated genre.
* **homepage**: Lists the URL for the homepage of the movie, if any. Will be encoded to 0 or 1 to indicate only whether the movie had a homepage.
* **original_language**: Gives the ISO language value for the film. Will be one-hot encoded to lan_ISO where ISO is the ISO language value.
* **popularity**: Floating-point value rating the film's popularity. This value would not be known for future film productions, so it will be interesting to try modeling both with and without it.
* **production_companies**: Indicates production companies involved in the movie. Will be one-hot encoded as pcomp_X where X is the id of the production company.
* **production_countries**: Indicates countries where the movie was filmed or produced. Will be one-hot encoded as pcoun_ISO where ISO is the ISO 3166-1 country code.
* **release_date**: The release date for the film in mm/dd/yyyy format. Will be encoded to r_year and r_week columns, where r_week is the number of the week in the year in which the film was released (0-51).
* **runtime**: Integer value for the runtime of the film in minutes.
* **spoken_languages**: Indicates the languages spoken in the movie. Will be one-hot encoded as spoken_ISO where ISO is the ISO 639-1 language code.
* **Keywords**: Indicates the keyword values associated with the movie. Keyword objects have the format {'id': int, 'name': ''}, so these will be encoded to key_X where X is the keyword id.
* **cast**: Indicates cast members associated with the movie. Will be one-hot encoded to cast_X where X is the id of the cast member.
* **crew**: Indicates crew members associated with the movie. Will be one-hot encoded to crew_X where X is the id of the crew member.
* **revenue**: Integer value for the worldwide revenue of the movie.
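
The two simplest encodings in the list above, the homepage flag and the release date split, can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual pre-processing: the column name `has_homepage` is an assumption (the text only says the field becomes 0 or 1), and `r_week` is computed here as `(day of year - 1) // 7`, capped at 51 so that the last days of the year stay inside the 0-51 range. Other week definitions are equally possible.

```python
import pandas as pd

# Toy rows standing in for train.csv; the values are made up for illustration.
df = pd.DataFrame({
    "homepage": ["http://example.com/film", None, ""],
    "release_date": ["01/01/2015", "07/04/1996", "12/31/2016"],
})

# homepage -> 1 if a non-empty URL is present, otherwise 0
# ("has_homepage" is a hypothetical column name)
df["has_homepage"] = (df["homepage"].fillna("").str.strip() != "").astype(int)

# release_date -> r_year and r_week, assuming the mm/dd/yyyy format described above
dates = pd.to_datetime(df["release_date"], format="%m/%d/%Y")
df["r_year"] = dates.dt.year
# Zero-based week of the year; day 365/366 would give 52, so cap at 51
df["r_week"] = ((dates.dt.dayofyear - 1) // 7).clip(upper=51)

print(df[["has_homepage", "r_year", "r_week"]])
```

Capping the week at 51 folds the one or two leftover days at the end of the year into the final week, which keeps the feature range consistent across leap and non-leap years.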

In [1]:
# Import libraries necessary for this project
import sys
sys.path.insert(0, 'utilities')

import numpy as np
import pandas as pd
import json
import ast
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from sklearn.model_selection import ShuffleSplit

from json_columnizer import jcolumnize, collection_columnize

# Import supplementary visualizations code visuals.py
#import visuals as vs

# Pretty display for notebooks
%matplotlib inline


# Load the movies dataset
train = pd.read_csv('data/train.csv')

# Drop columns that will not be used as features
train = train.drop(['imdb_id', 'poster_path'], axis = 1)
print("Column names {}".format(list(train)))

# Turn collections into columns
train = jcolumnize(train, 'belongs_to_collection', 'id', 'collection_')
# Turn cast into columns
train = jcolumnize(train, 'cast', 'id', 'cast_', True)
# Turn genres into columns
train = jcolumnize(train, 'genres', 'id', 'genres_')
# Turn production companies into columns
train = jcolumnize(train, 'production_companies', 'id', 'pcomp_')
# Turn production countries into columns
train = jcolumnize(train, 'production_countries', 'iso_3166_1', 'pcoun_')
# Turn spoken languages into columns
train = jcolumnize(train, 'spoken_languages', 'iso_639_1', 'spoken_')
# Turn keywords into columns
train = jcolumnize(train, 'Keywords', 'id', 'key_')

#revenue = data['revenue']
#features = data.drop('revenue', axis = 1)
    
# Success
print("Movies dataset has {} data points with {} variables each.".format(*train.shape))
#print("Column names {}".format(list(data)))


Column names ['id', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'Keywords', 'cast', 'crew', 'revenue']
jcolumnize shape: (3000, 443)
jcolumnize shape: (3000, 9519)
jcolumnize shape: (3000, 9539)
jcolumnize shape: (3000, 13251)
jcolumnize shape: (3000, 13325)
jcolumnize shape: (3000, 13404)
jcolumnize shape: (3000, 20804)
Movies dataset has 3000 data points with 20804 variables each.
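
The `jcolumnize` helper lives in the project's `utilities` folder and its source is not shown here. As a rough sketch of the kind of transformation the calls above suggest, the function below expands a column of stringified lists of objects into 0/1 indicator columns. Everything in it is an assumption for illustration: the name `one_hot_json_column`, the choice to drop the source column, and the treatment of missing cells as empty lists. It does not model the extra boolean argument passed for `cast`.

```python
import ast
import pandas as pd

def one_hot_json_column(df, column, key, prefix):
    """Hypothetical stand-in for jcolumnize: expand a column of stringified
    lists of dicts into 0/1 indicator columns named prefix + value."""
    def extract(cell):
        # Missing cells (NaN/None) or blank strings are treated as "no entries"
        if not isinstance(cell, str) or not cell.strip():
            return []
        return [str(item[key]) for item in ast.literal_eval(cell)]

    # One row per (movie, value) pair; movies with no entries drop out here
    exploded = df[column].apply(extract).explode().dropna()
    # Indicator per value, collapsed back to one row per movie
    indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
    # Restore movies that had no entries as all-zero rows
    indicators = indicators.reindex(df.index, fill_value=0).add_prefix(prefix)
    return df.drop(columns=[column]).join(indicators)

# Toy data in the {'id': int, 'name': ''} shape described earlier
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        "[{'id': 18, 'name': 'Drama'}]",
        None,
    ],
})
encoded = one_hot_json_column(toy, "genres", "id", "genres_")
print(encoded)
```

With one column per distinct id, high-cardinality fields such as cast and keywords add thousands of mostly-zero columns each, which is consistent with the growth from 443 to 20804 columns in the output above.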
