# Predict the Oscars 

This example shows how to use supervised learning to predict the Oscars. We'll use a dataset that contains previous Oscar winners to build a model that guesses the next winner of the Best Picture award. Our model will predict only one winner!

In [1]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

pd.set_option('mode.chained_assignment', None)

In [4]:
train_file = "train.csv"
initial_train = pd.read_csv(train_file)

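# Keep only films after 1980; .copy() makes train an independent DataFrame
# so the column edits below do not act on a view of initial_train.
train = initial_train[initial_train['Year'] > 1980].copy()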

test_file = "test.csv"
test = pd.read_csv(test_file)

In [6]:
train.head(5)

In [7]:
# Encode the rating as an ordinal number.
# .loc replaces the .ix indexer, which was removed in pandas 1.0.
train.loc[train["Rate"] == "G", "Rate"] = 1
train.loc[train["Rate"] == "PG", "Rate"] = 2
train.loc[train["Rate"] == "PG-13", "Rate"] = 3
train.loc[train["Rate"] == "R", "Rate"] = 4

test.loc[test["Rate"] == "G", "Rate"] = 1
test.loc[test["Rate"] == "PG", "Rate"] = 2
test.loc[test["Rate"] == "PG-13", "Rate"] = 3
test.loc[test["Rate"] == "R", "Rate"] = 4

In [8]:
#train.head(5)

In [9]:
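# Fill missing scores with the training-set median, for both train and test,
# so the test set is imputed with the same values the model was trained on.
imdb_median = train["IMDB Rating"].median()
train["IMDB Rating"] = train["IMDB Rating"].fillna(imdb_median)
test["IMDB Rating"] = test["IMDB Rating"].fillna(imdb_median)

metascore_median = train["Metascore"].median()
train["Metascore"] = train["Metascore"].fillna(metascore_median)
test["Metascore"] = test["Metascore"].fillna(metascore_median)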

## Let's create our model...
### How would you choose the right variables to model?


In [10]:
target = train["Won?"].values

feature_names = [
    "Oscar Nominations",
    "Won Golden Globe",
    "Golden Globe Nominations",
    "Won Bafta",
    "Bafta Nominations",
    "Won Producers",
    "Won Actors",
    "Won Directors",
    "Metascore",
    "IMDB Rating"]

features = train[feature_names].values

my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(features, target)

In [16]:
# What is a Decision Tree?
# How do you interpret the results?
# What is a technique used to combat variability of results from Decision Trees?

tree_importances = pd.DataFrame(my_tree.feature_importances_, index=feature_names, columns=["Importances"])

#print(tree_importances)
#print('Score', my_tree.score(features, target))
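
One common answer to the last question in the cell above is a random forest, which averages many trees that are each fit on a bootstrap sample of the data. The notebook imports `RandomForestClassifier` but never uses it, so here is a minimal sketch of how it could be fit. The data is synthetic and the hyperparameters (`n_estimators=100`, `max_depth=3`) are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix (made-up data, not train.csv).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3)).astype(float)
# The label depends only on the first two columns; the third is pure noise.
y = (X[:, 0] + X[:, 1] >= 2).astype(int)

# Fit 100 shallow trees and average their votes.
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
forest.fit(X, y)

# Same attributes and methods as the single tree used in this notebook.
importances = forest.feature_importances_
proba = forest.predict_proba(X)[:, 1]
```

With the real data, `X` and `y` would be `features` and `target`, and `forest` could be used wherever `my_tree` is.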

## Now that we have a model, let's use it to predict

In [12]:
test_features = test[feature_names].values

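# Column 1 of predict_proba is the probability of the second class in
# my_tree.classes_ (the "won" class, assuming "Won?" is coded as 0/1).
pred_tree = my_tree.predict_proba(test_features)[:, 1]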

movie_name = np.array(test["Movie"])
year = np.array(test["Year"])
won = np.array(test["Won?"])

tree_prediction = pd.DataFrame(pred_tree.round(2), movie_name, columns=["Probability"])
tree_prediction["Year"] = year
tree_prediction["Actually Won?"] = won

In [14]:
tree_prediction[tree_prediction['Year'] != 2016]
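
The table above gives one probability per film, while the stated goal is a single winner. One possible way to get there is to keep the highest-probability film for each year. The sketch below uses a made-up frame with the same layout as `tree_prediction`. The film names and probabilities are hypothetical.

```python
import pandas as pd

# Hypothetical predictions shaped like tree_prediction (all values made up).
preds = pd.DataFrame(
    {
        "Probability": [0.2, 0.9, 0.4, 0.7, 0.1],
        "Year": [2014, 2014, 2014, 2015, 2015],
    },
    index=["Film A", "Film B", "Film C", "Film D", "Film E"],
)

# For each year, find the index label of the row with the highest probability.
winner_labels = preds.groupby("Year")["Probability"].idxmax()

# Select those rows: one predicted winner per year.
predicted_winners = preds.loc[winner_labels]
```

Note that `idxmax` returns the first label when several films tie. Ties are likely with a single fully grown decision tree, whose predicted probabilities are often exactly 0 or 1. The selection also assumes each film title appears only once in the index.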

### Are these reliable results?

### Are there any issues with these results?

### What can we do about it?
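
One way to start on these questions is to compare accuracy on the training data with accuracy on data the model has never seen. The sketch below does that on synthetic, deliberately noisy data. The dataset, the 70/30 split, and the variable names are all assumptions for illustration and are not part of the Oscar data.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on the first column plus random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Hold out 30% of the rows so the model is scored on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until every training row is classified.
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # accuracy on rows the tree has seen
test_acc = clf.score(X_test, y_test)     # accuracy on held-out rows
print("train accuracy:", train_acc)
print("held-out accuracy:", test_acc)
```

A large gap between the two numbers suggests the tree has memorized the training rows. For this notebook, the analogous check would be to hold out whole award years rather than random rows.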