## IMDB 5000 Movie Dataset Introduction
In this notebook, we first inspect the IMDB 5000 movie dataset and find that the budget column does not use a single currency. Some values are also missing and appear as NaN. Instead of filling those NaN values with 0, we drop the affected rows to avoid biasing the budget, gross, and other columns.
After that, we visualize parts of the data, such as gross and budget, grouped by year. We also split the genres and languages to see which categories and languages are most common in the dataset.
Next, we use linear models to analyze gross, budget, and IMDB score and to see which factors are significant for each. The results are not very clear: the correlations are all below 0.5, so we can only tentatively conclude that some factors, such as the number of votes, are more strongly associated with gross or IMDB score.
Finally, we use another library (Graphlab, which only runs on Python 2.7 and below) to build a decision tree that predicts high versus low gross (high gross means a gross of 100 million or more; everything else is low gross). The resulting model works quite well (accuracy ~ 0.8, F1 score also around 0.8) and splits the data on nodes such as "number of voted users < 83K" and "budget > 69 million".

In [None]:
## Import libraries
# Keep __future__ imports first; Python requires them before other statements
from __future__ import division

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from IPython.display import display

mpl.rc('savefig', dpi=100)
plt.style.use('ggplot')
pd.set_option('display.max_rows', 10)

## Data Visualization

In [None]:
# Load data
data = pd.read_csv('../input/movie_metadata.csv')
data.head()

In [None]:
# Unify currency: convert budgets to approximate USD using rough per-language divisors
rate_by_language = {"Korean": 1000, "Mandarin": 7, "Japanese": 100, "Cantonese": 10, "Hindi": 68}
# Languages not listed above (or missing) keep their budget unchanged (divisor 1).
# A single vectorized assignment also avoids chained indexing such as data["TrueBudget"][i] = ...
data["TrueBudget"] = data["budget"] / data["language"].map(rate_by_language).fillna(1)

In [None]:
# Drop NaN data
data = data.dropna()
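
To illustrate why the NaN rows are dropped rather than filled with 0, here is a small sketch on made-up numbers (the `toy` frame and its values are hypothetical, not taken from the dataset). Filling with 0 pulls the mean budget down, while dropping the row leaves the mean of the known budgets untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from movie_metadata.csv
toy = pd.DataFrame({"budget": [100.0, 200.0, np.nan, 300.0]})

mean_filled = toy["budget"].fillna(0).mean()   # NaN counted as a 0 budget
mean_dropped = toy.dropna()["budget"].mean()   # NaN rows removed first

print("mean with fillna(0): %.1f" % mean_filled)
print("mean with dropna():  %.1f" % mean_dropped)
```

The trade-off is that `dropna()` discards every row with any missing column, so the remaining sample is smaller.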

In [None]:
# Build the genre indicator table: one "True"/"False" column per genre
genre_names = ["Action", "Adventure", "Animation", "Biography", "Comedy", "Crime",
               "Documentary", "Drama", "Family", "Fantasy", "Film-Noir", "History",
               "Horror", "Music", "Musical", "Mystery", "Romance", "Sci-Fi",
               "Sport", "Thriller", "War", "Western"]
data_Genres = pd.DataFrame()
for genre in genre_names:
    # genres are pipe-separated (e.g. "Action|Adventure"); match whole tags
    # so that "Music" does not also match "Musical"
    data_Genres[genre] = data["genres"].apply(
        lambda x, g=genre: "True" if g in x.split("|") else "False")
pd.set_option('display.max_columns', 30)
data_Genres.head()
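
As an alternative sketch, pandas can build the same kind of indicator table in one call with `Series.str.get_dummies`, which splits on the `|` separator and returns 0/1 columns instead of the "True"/"False" strings used above. The `toy_genres` values are made up for illustration.

```python
import pandas as pd

# Hypothetical pipe-separated genre strings in the dataset's format
toy_genres = pd.Series(["Action|Adventure", "Comedy|Music", "Musical|Comedy"])

# One 0/1 column per distinct genre; whole-tag match, so "Music" != "Musical"
genre_flags = toy_genres.str.get_dummies(sep="|")

print(genre_flags)
print(genre_flags.sum())  # number of movies per genre
```

Summing the columns gives the per-genre counts directly, which could replace the separate counting step.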

In [None]:
# Count "True"/"False" per genre and keep only the "True" row
data_Genres_Counts = data_Genres.apply(lambda col: col.value_counts())
data_Genres_Counts = data_Genres_Counts.drop("False")
Genres_title = list(data_Genres_Counts.columns.values)
Genres_index = list(range(1, len(Genres_title) + 1))
# One genre per row, so the counts can be plotted as a bar chart
data_Genres_Counts = data_Genres_Counts.transpose()
plt.bar(Genres_index, data_Genres_Counts["True"], align='center')
plt.xticks(Genres_index, Genres_title, rotation = 90)
plt.show()
pd.set_option('display.max_rows', 25)
display(data_Genres_Counts)

In [None]:
data.corr()

## Overview Data - Data Visualization

In [None]:
## See the gross and budget change by year
data_groupby_year = data.groupby(data["title_year"])
data_groupby_year_mean = data_groupby_year.mean()
Gross = plt.scatter(data_groupby_year_mean.index, data_groupby_year_mean["gross"], s = data_groupby_year["gross"].count())
Budget = plt.scatter(data_groupby_year_mean.index, data_groupby_year_mean["budget"],color = "r" ,s = data_groupby_year["budget"].count())
plt.legend((Gross, Budget), ('Gross', 'Budget'),)
plt.xlabel("Year")
plt.ylabel("Money($10,000,000)")
plt.show()

In [None]:
## See audience participation by year (marker size = number of movies that year)
plt.subplot(211)
num_critic_for_reviews = plt.scatter(data_groupby_year_mean.index, data_groupby_year_mean["num_critic_for_reviews"], s = data_groupby_year["gross"].count())
plt.xlabel("Year")
plt.ylabel("Number of critic reviews")
plt.subplot(212)
# Use the per-year counts for the marker size, as in the upper panel
num_voted_users = plt.scatter(data_groupby_year_mean.index, data_groupby_year_mean["num_voted_users"], marker = "x",color = "r" ,s = data_groupby_year["budget"].count())
plt.xlabel("Year")
plt.ylabel("Number of user votes")
plt.tight_layout()

In [None]:
# Box plot
data.boxplot(column="gross", by="language", rot= 90, grid=False)

## Model & Predicting for IMDB score

In [None]:
## Formula for regression on IMDb_score
linear_formula_imdb = "imdb_score ~ \
C(color) + num_critic_for_reviews + duration + gross + num_voted_users \
+ cast_total_facebook_likes + facenumber_in_poster + num_user_for_reviews \
+ TrueBudget + title_year + aspect_ratio + movie_facebook_likes \
+ actor_1_facebook_likes + actor_2_facebook_likes + actor_3_facebook_likes"

In [None]:
linear_model_imdb = smf.ols(formula=linear_formula_imdb, data=data)
linear_model_fit_imdb = linear_model_imdb.fit()
linear_model_fit_imdb.summary()

In [None]:
predicted_imdbScore = linear_model_fit_imdb.predict(data)
plt.scatter(data["imdb_score"], predicted_imdbScore)
plt.xlabel("Actual Score")
plt.ylabel("Predicted Score")

## Model & Predicting for Budget

In [None]:
# Separate the string columns from the numerical columns
str_list = list(data.select_dtypes(include=["object"]).columns)
# Keep "color" so it can still be used as the categorical term C(color) in the formula
str_list.remove("color")

num_list = data.columns.difference(str_list)

In [None]:
num_data = data[num_list]

In [None]:
#Create the linear fit model for budget
linear_model_formula_budget = "TrueBudget ~ director_facebook_likes + actor_1_facebook_likes + actor_2_facebook_likes + \
actor_3_facebook_likes + cast_total_facebook_likes + title_year + aspect_ratio + C(color)"
linear_model_budget = smf.ols(formula=linear_model_formula_budget, data=num_data)
linear_model_fit_budget = linear_model_budget.fit()

In [None]:
linear_model_fit_budget.summary()

In [None]:
predicted_budget = linear_model_fit_budget.predict()
plt.scatter(num_data["TrueBudget"],predicted_budget)
plt.xlabel("Actual Budget")
plt.ylabel("Predicted Budget")

In [None]:
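# Correlation between predicted data and original data
predicted_budget = linear_model_fit_budget.predict()
# Reuse num_data's index: Series.corr aligns on index labels, and the index has gaps after dropna()
predicted_budget = pd.Series(predicted_budget, index=num_data.index, name="PredictedBudget")
predicted_budget.corr(num_data["TrueBudget"])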
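
A caveat worth checking here: `Series.corr` aligns the two series on their index labels before computing anything. After `dropna()` the DataFrame index has gaps, so a prediction Series built with a default 0..n-1 index would be paired with the wrong rows, or with no row at all. A minimal sketch on made-up values (all names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column whose index has gaps, as after dropna()
actual = pd.Series([1.0, 2.0, 3.0, 4.0], index=[0, 2, 5, 7])
pred_values = np.array([1.1, 1.9, 3.2, 3.8])  # positional model output

# Default RangeIndex 0..3: only labels 0 and 2 overlap with `actual`
misaligned = pd.Series(pred_values)
# Reusing the original index pairs each prediction with its own row
aligned = pd.Series(pred_values, index=actual.index)

overlap = actual.index.intersection(misaligned.index)
print("rows compared with the default index:", len(overlap))
print("aligned correlation: %.3f" % aligned.corr(actual))
```

The same consideration applies wherever predictions are wrapped in a `pd.Series` and correlated with a column of the filtered data.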

In [None]:
#Visualization for budget change by years
plt.scatter(num_data["title_year"],num_data["budget"])
plt.gca().set_yscale('log')
plt.xlabel("Title Year")
plt.ylabel("Actual Budget")

## Model & Predicting for Gross
Build linear model

In [None]:
# check the data correlation to decide which variable use on the gross prediction
print(data.corr()["gross"])

In [None]:
import statsmodels.formula.api as smf
# Build a simple linear model using only the variables whose correlation with gross is above 0.3
simple_model_formula_Gross = "gross ~ num_voted_users + num_critic_for_reviews + num_user_for_reviews \
                                    + TrueBudget + movie_facebook_likes"
# set up the simple linear model
simple_model_Gross = smf.ols(formula=simple_model_formula_Gross, data=data)
simple_model_fit_Gross = simple_model_Gross.fit()
# observe the p-value and R-square
simple_model_fit_Gross.summary()

In [None]:
# Predict the gross with the simple linear model
predicted_gross_simple = simple_model_fit_Gross.predict()
# Reuse data's index so that Series.corr pairs each prediction with its own row
predicted_gross_simple = pd.Series(predicted_gross_simple, index=data.index, name="PredictedGross")
# Check the correlation
print("The correlation of simple linear model is : %.3f" % predicted_gross_simple.corr(data["gross"]))
# plt scatter chart of actual data and predicted gross
plt.scatter(data["gross"], predicted_gross_simple)
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
plt.xlabel("Actual Gross")
plt.ylabel("Predicted Gross")

In [None]:
# Build the whole model (each predictor listed once)
import statsmodels.formula.api as smf
linear_model_formula_Gross = "gross ~ num_voted_users + num_critic_for_reviews + num_user_for_reviews \
                        + cast_total_facebook_likes + actor_1_facebook_likes + actor_2_facebook_likes \
                        + actor_3_facebook_likes + director_facebook_likes + duration + imdb_score \
                        + aspect_ratio + TrueBudget + title_year + facenumber_in_poster + movie_facebook_likes"
linear_model_Gross = smf.ols(formula=linear_model_formula_Gross, data=data)
linear_model_fit_Gross = linear_model_Gross.fit()
linear_model_fit_Gross.summary()

In [None]:
# Predict the gross with the whole linear model
predicted_gross = linear_model_fit_Gross.predict()
# Reuse data's index so that Series.corr pairs each prediction with its own row
predicted_gross = pd.Series(predicted_gross, index=data.index, name="PredictedGross")
# Check the correlation
print("The correlation of the whole linear model is : %.3f" % predicted_gross.corr(data["gross"]))
# plt scatter chart of actual data and predicted gross
plt.scatter(data["gross"], predicted_gross)
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
plt.xlabel("Actual Gross")
plt.ylabel("Predicted Gross")

In [None]:
# Log scale
logGross = np.log(data["gross"])
logPredictGross = np.log(predicted_gross)
print("The correlation of log scale is : %.3f" %logGross.corr(logPredictGross))
plt.scatter(logGross,logPredictGross)
plt.xlabel("Actual Gross")
plt.ylabel("Predicted Gross")

In [None]:
## Build the model on training data and validate it on held-out test data
np.random.seed(0)

# Let N_test be 20% (or 1/5th) of the data, and use a random shuffle
# to partition the data.
N_rows = len(data)
N_test = N_rows//5
shuffled_row_indices = np.random.permutation(N_rows)
test_rows = shuffled_row_indices[:N_test]
train_rows = shuffled_row_indices[N_test:]

# shuffled_row_indices holds row positions, not index labels, so select with .iloc
test_data = data.iloc[test_rows,:]
train_data = data.iloc[train_rows,:]

# Plot the training data in blue and the test data in red
plt.plot(train_data["num_voted_users"], train_data['gross'],'x', c='blue', label='training')
plt.plot(test_data["num_voted_users"], test_data["gross"], 'x', c='red', label='testing')
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
plt.xlabel("num_voted_users")
plt.ylabel("gross")
plt.legend()
plt.show()
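
The same shuffle-and-slice split can be sketched on a toy frame. Because `np.random.permutation(N)` yields row positions rather than index labels, the sketch selects rows with `.iloc`; the `toy` frame and its gapped index are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Hypothetical frame with a gapped index, as left behind by dropna()
toy = pd.DataFrame({"gross": np.arange(10.0)}, index=range(0, 20, 2))

n_rows = len(toy)
n_test = n_rows // 5                      # 20% held out
shuffled = np.random.permutation(n_rows)  # row positions 0..n_rows-1

toy_test = toy.iloc[shuffled[:n_test]]
toy_train = toy.iloc[shuffled[n_test:]]

print(len(toy_train), len(toy_test))
```

Every row lands in exactly one of the two parts, regardless of what the index labels look like.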

In [None]:
train_data = train_data.dropna()
test_data = test_data.dropna()
simple_model_formula = "gross ~ num_voted_users"
simple_model = smf.ols(formula=simple_model_formula, data=train_data)
simple_model_fit = simple_model.fit()
# One-variable simple model
print('Test data true mean gross: $%.2f' % np.mean(test_data["gross"]))
test_price_predicted = simple_model_fit.predict(test_data)
print('Test data predicted mean gross: $%.2f' % np.mean(test_price_predicted))

In [None]:
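print('Test data true mean gross: $%.2f' % np.mean(test_data["gross"]))
# Note: linear_model_fit_Gross was fit on the full dataset, which includes these test rows,
# so this is not a true out-of-sample check
test_price_predicted_fulldata = linear_model_fit_Gross.predict(test_data)
print('Test data predicted mean gross: $%.2f' % np.mean(test_price_predicted_fulldata))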

In [None]:
# Root-mean-square error of each model on the test data
err = test_data["gross"] - test_price_predicted
err_rms = np.sqrt(np.mean(err ** 2))
print('RMSE of the simple model for gross: $%.2f' % err_rms)
err = test_data["gross"] - test_price_predicted_fulldata
err_rms = np.sqrt(np.mean(err ** 2))
print('RMSE of the full linear model for gross: $%.2f' % err_rms)
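
The RMSE computation is repeated for each model, so it can be wrapped in a small helper. This is a minimal sketch; the function name `rmse` and the sample numbers are hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Hypothetical values, only to show the call
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

With such a helper, each model's test error becomes a one-line call on the true and predicted gross.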

In [None]:
plt.plot(train_data["num_voted_users"],
         train_data["gross"],
         'x', c='blue', label="training data")
plt.plot(train_data["num_voted_users"],
         simple_model_fit.fittedvalues,
         c='red', label="simple fit")
plt.legend()
plt.xlabel("num_voted_users")
plt.ylabel("gross")
plt.show()

## Decision Tree

In [None]:
## Graphlab only works on Python 2.7 and below; it will not run on Python 3.0 or above
import graphlab
graphlab.canvas.set_target('ipynb')
# import data into graph lab
data_graphlab = graphlab.SFrame('movie_metadata.csv')

In [None]:
# data overview
data_graphlab.show()
data_decision_tree = data_graphlab.dropna()
len(data_decision_tree)

In [None]:
# based on above hist diagram, deciding the high box office is about top 20%
high_gross = data_decision_tree[data_decision_tree['gross'] >= 1*1e8 ]
low_gross = data_decision_tree[data_decision_tree['gross'] < 1*1e8 ]

print("Box office of at least $100,000,000 is : %s" % len(high_gross))
print("Box office below $100,000,000 is : %s" % len(low_gross))
print("Ratio of high to low box office is : %.2f" %(len(high_gross)/len(low_gross)))
data_decision_tree['gross'].show()

In [None]:
features = ['num_voted_users',
            'num_user_for_reviews',
            'num_critic_for_reviews',
            'movie_facebook_likes',
            'director_facebook_likes',
            'actor_1_facebook_likes',
            'actor_2_facebook_likes',
            'actor_3_facebook_likes',
            'cast_total_facebook_likes',
            'director_name',
            'actor_1_name',
            'actor_2_name',
            'actor_3_name',
            'imdb_score',
            'movie_title',
            'title_year',
            'content_rating',
            'language',
            'country',
            'genres',
            'color',
            'budget',
           ]
target = 'gross_HL'
# Divide the gross into high (+1) and low (-1); the column must exist before it is selected below
data_decision_tree['gross_HL'] = data_decision_tree['gross'].apply(lambda x : +1 if x>= 1*1e8 else -1)
data_dTree = data_decision_tree[ features + [target]]
High_gross_raw = data_decision_tree[data_decision_tree[target] == +1 ]
Low_gross_raw = data_decision_tree[data_decision_tree[target] == -1 ]
print("Box office of at least $100,000,000 is : %s" % len(High_gross_raw))
print("Box office below $100,000,000 is : %s" % len(Low_gross_raw))
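
For readers without Graphlab, the same +1 / -1 labeling can be sketched in pandas with `np.where`. The toy gross values are invented; the 1e8 threshold is the one used above.

```python
import numpy as np
import pandas as pd

# Hypothetical gross values in dollars
toy = pd.DataFrame({"gross": [2.5e8, 9.9e7, 1.0e8, 4.0e6]})

# +1 for gross >= $100 million, -1 otherwise
toy["gross_HL"] = np.where(toy["gross"] >= 1e8, 1, -1)

print(toy["gross_HL"].value_counts())
```

Note that a gross of exactly 100 million is labeled high, matching the `>=` comparison.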

In [None]:
percentage = len(High_gross_raw)/float(len(Low_gross_raw))

High_gross = High_gross_raw
Low_gross = Low_gross_raw.sample(percentage, seed=1)

# Append the high gross data with the downsampled version of low gross data
gross_data = High_gross.append(Low_gross)
print("Percentage of high gross data            : %s" % (len(High_gross) / float(len(gross_data))))
print("Percentage of low gross data             : %s" % (len(Low_gross) / float(len(gross_data))))
print("Total number of gross in our new dataset : %s" % len(gross_data))
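
The downsampling step can also be sketched in pandas, assuming a DataFrame instead of a Graphlab SFrame. The class sizes below are made up, and `DataFrame.sample` with `pd.concat` stands in for the SFrame `sample` and `append` calls.

```python
import pandas as pd

# Hypothetical imbalanced labels: 3 high-gross rows, 9 low-gross rows
toy = pd.DataFrame({"gross_HL": [1] * 3 + [-1] * 9})

high = toy[toy["gross_HL"] == 1]
low = toy[toy["gross_HL"] == -1]

# Keep about as many low rows as there are high rows
fraction = len(high) / float(len(low))
low_down = low.sample(frac=fraction, random_state=1)

balanced = pd.concat([high, low_down])
print(balanced["gross_HL"].value_counts())
```

Balancing the classes this way keeps a trivial "always low" prediction from looking accurate, at the cost of discarding most low-gross rows.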

In [None]:
# split the train and validation data
train_data, validation_data = gross_data.random_split(.75, seed=1)
print("train data has : %d" % len(train_data))
print("validation data has : %d" % len(validation_data))

In [None]:
decision_tree_model = graphlab.decision_tree_classifier.create(
    train_data, validation_set=None, target=target, features=features)

WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Decision tree classifier:
--------------------------------------------------------
Number of examples          : 899
Number of classes           : 2
Number of feature columns   : 22
Number of unpacked features : 22
+-----------+--------------+-------------------+-------------------+
| Iteration | Elapsed Time | Training-accuracy | Training-log_loss |
+-----------+--------------+-------------------+-------------------+
| 1         | 0.009019     | 0.927697          | 0.511922          |
+-----------+--------------+-------------------+-------------------+

In [None]:
decision_tree_model.show(view="Evaluation")

![Evaluation of decision tree model ][1]


  [1]: https://drive.google.com/file/d/0B4UueDDReN7QU3M2YmpOVUdEVm8/view?usp=sharing

In [None]:
small_model = graphlab.decision_tree_classifier.create(
    train_data, validation_set=None, target=target, features=features, max_depth=3)

![small model of decision tree figure][2]


  [2]: https://drive.google.com/file/d/0B4UueDDReN7QN1VBa2d4TlpMaGM/view?usp=sharing

In [None]:
# set up the validation data
validation_high_gross = validation_data[validation_data[target] == 1]
validation_low_gross = validation_data[validation_data[target] == -1]

sample_validation_data_high_gross = validation_high_gross[0:2]
sample_validation_data_low_gross = validation_low_gross[0:2]

sample_validation_data = sample_validation_data_low_gross.append(sample_validation_data_high_gross)
sample_validation_data["gross_HL"]

In [None]:
# check the sample validation result -> 50% accuracy
decision_tree_model.predict(sample_validation_data)

In [None]:
decision_tree_model.predict(sample_validation_data, output_type='probability')

In [None]:
print ("The accuracy of training data in small model is : %.2f" % small_model.evaluate(train_data)['accuracy'])
print ("The accuracy of validation data in small model is : %.2f" % small_model.evaluate(validation_data)['accuracy'])

The accuracy of training data in small model is : 0.84
The accuracy of validation data in small model is : 0.80

In [None]:
print("The accuracy of training data in decision tree model is : %.2f" % decision_tree_model.evaluate(train_data)['accuracy'])
print("The accuracy of validation data decision tree model is : %.2f" % decision_tree_model.evaluate(validation_data)['accuracy'])

The accuracy of training data in decision tree model is : 0.93
The accuracy of validation data decision tree model is : 0.81
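
The introduction quotes both accuracy and an F1 score. As a reminder of how the two relate, here is a small pure-Python sketch for +1 / -1 labels. The label lists are made up and the function names are hypothetical, not part of Graphlab.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: +1 = high gross, -1 = low gross
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

Accuracy counts both classes equally, while F1 focuses on how well the positive (high gross) class is recovered, so the two can diverge when the classes are imbalanced.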