# Thinkful Data Science Event:
## Predicting the Oscars- 
### 5/29/2018

In this practical workshop, you'll use a dataset of previous Oscar winners to build a model that predicts the winner of the Best Picture award. Along the way, you'll get an introduction to the data scientist's tools and methods, including an overview of basic machine learning concepts.

## Getting Started-

Before we can actually start making predictions, there are several steps that we must first take to load and pre-process the data.

In [26]:
# Import the required libraries
import requests
import os
import zipfile
import glob
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# train_test_split lives in model_selection (cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt



In [2]:
# Define working directory
path = '/Documents/Work/Thinkful/Code'

In [3]:
# Check current directory
print(os.getcwd())

# Change directory if needed
# os.chdir(path)

/Users/zansadiq/Documents/Work/Thinkful/Code


Our data is stored online as a .zip file, available at this [link](https://www.thinkful.com/workshops/washington-dc/oscars-prediction/). You can either download the files and import them manually, or pull them directly into Python as we do below:

In [4]:
# File location
url = 'https://www.dropbox.com/s/shg31hm4voydqnl/Thinkful%20Workshops%20-%20Predicting%20the%20Oscars.zip?dl=1'

r = requests.get(url)

In [5]:
# Create a staging directory (no error if it already exists)
staging_dir = "staging"
os.makedirs(staging_dir, exist_ok=True)

In [6]:
# Confirm the staging directory path
os.path.isdir(staging_dir)

# Machine independent path to create new files
zip_file = os.path.join(staging_dir, "Thinkful Workshops - Predicting the Oscars.zip")

In [7]:
# Write the file to the computer
with open(zip_file, "wb") as zf:
    zf.write(r.content)

# Unzip the files
with zipfile.ZipFile(zip_file, "r") as z:
    z.extractall(staging_dir)

# Collect the paths of the extracted .csv files
files = glob.glob(os.path.join(staging_dir, "oscars", "*.csv"))

In [8]:
# Create an empty dictionary to hold the dataframes from csvs
dict_ = {}

# Write the files into the dictionary
for file in files:
    fname = os.path.basename(file)
    fname = fname.replace('.csv', '')
    dict_[fname] = pd.read_csv(file, header = 0).fillna('')
    
# Extract the dataframes
train = dict_['train']
test = dict_['test']
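
As an alternative to writing the archive to disk, the downloaded bytes can also be unpacked in memory. The sketch below is not part of the workshop code: it builds a tiny stand-in archive (the file contents are made up) so that it runs without a network connection, then reads every `.csv` member straight into a dictionary of data frames. In the workshop, the bytes would come from `r.content` instead.

```python
import io
import zipfile

import pandas as pd

# Build a small stand-in archive in memory (hypothetical contents);
# in the workshop these bytes would be r.content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("oscars/train.csv", "Year,Movie,Won?\n2016,Arrival,0\n")
    archive.writestr("oscars/test.csv", "Year,Movie,Won?\n2017,Some Film,\n")

# Read each .csv member directly from the archive, keyed by file name
frames = {}
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    for member in archive.namelist():
        if member.endswith(".csv"):
            name = member.split("/")[-1].replace(".csv", "")
            with archive.open(member) as handle:
                frames[name] = pd.read_csv(handle)

print(sorted(frames))
```

This avoids leaving a staging directory behind, at the cost of holding the whole archive in memory.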

Our data has now been downloaded, unzipped, and written into pandas data frames. The next step is to inspect and manipulate the data to make sure it is in the correct format for our machine learning algorithms. 

## Formatting and Cleaning-

Before we can fit a model around our data, we have to make sure that certain objectives have been completed:

* Inspect column types and reformat if necessary
* Handle missing data 

In [9]:
print(train.head())

   Year               Movie  Won?   Budget Opening Weekend IMDB Rating  \
0  2016             Arrival     0  4.7e+07         2.4e+07         8.1   
1  2016              Fences     0  2.4e+07          129462         7.5   
2  2016       Hacksaw Ridge     0    4e+07     1.51908e+07         8.3   
3  2016  Hell or High Water     0  1.2e+07          621329         7.7   
4  2016      Hidden Figures     0  2.5e+07          515499         7.9   

                             Genres  Won Golden Globe  Won Bafta  \
0  Drama, Mystery, Sci-Fi, Thriller                 0          0   
1                             Drama                 0          0   
2               Drama, History, War                 0          0   
3     Action, Crime, Drama, Western                 0          0   
4         Biography, Drama, History                 0          0   

   Oscar Nominations  Golden Globe Nominations  Bafta Nominations    IMdB id  \
0                  8                         2                  9 

One thing that stands out is that our target variable, `Won?`, should be converted to a categorical type (a "factor" in R terminology). The string values, such as the ratings in `Rate`, also need to be transformed into numbers.

In [10]:
# Convert target variable to factor
train['Won?'] = train['Won?'].astype('category')
test['Won?'] = test['Won?'].astype('category')

In [11]:
# Fix the ratings
train.loc[train["Rate"] == "G", "Rate"] = 1
train.loc[train["Rate"] == "PG", "Rate"] = 2
train.loc[train["Rate"] == "PG-13", "Rate"] = 3
train.loc[train["Rate"] == "R", "Rate"] = 4

test.loc[test["Rate"] == "G", "Rate"] = 1
test.loc[test["Rate"] == "PG", "Rate"] = 2
test.loc[test["Rate"] == "PG-13", "Rate"] = 3
test.loc[test["Rate"] == "R", "Rate"] = 4
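
The same ordinal encoding can be written more compactly with a dictionary and `Series.map`. This is only a sketch on a toy frame with illustrative values, not the workshop data:

```python
import pandas as pd

# Toy frame standing in for the workshop data (values are illustrative)
films = pd.DataFrame({"Rate": ["PG-13", "R", "G", "PG", "R"]})

# Ordinal codes matching the ones used in the cell above
rate_codes = {"G": 1, "PG": 2, "PG-13": 3, "R": 4}

# map() returns NaN for any rating missing from the dictionary,
# which makes unexpected values easy to spot
films["Rate"] = films["Rate"].map(rate_codes)

print(films["Rate"].tolist())
```

One design difference to keep in mind: `.loc` assignment leaves unrecognized ratings untouched, while `map` turns them into `NaN`.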

Now let's check to see if we have any missing values:

In [12]:
# Check for missing values
train.isnull().sum()

Year                        0
Movie                       0
Won?                        0
Budget                      0
Opening Weekend             0
IMDB Rating                 0
Genres                      0
Won Golden Globe            0
Won Bafta                   0
Oscar Nominations           0
Golden Globe Nominations    0
Bafta Nominations           0
IMdB id                     0
Won Producers               0
Won Directors               0
Won Actors                  0
Rate                        0
Metascore                   0
dtype: int64

In [13]:
# Check for missing values
test.isnull().sum()

Year                        0
Movie                       0
Won?                        0
Budget                      0
Opening Weekend             0
IMDB Rating                 0
Genres                      0
Won Golden Globe            0
Won Bafta                   0
Oscar Nominations           0
Golden Globe Nominations    0
Bafta Nominations           0
IMdB id                     0
Won Producers               0
Won Directors               0
Won Actors                  0
Rate                        0
Metascore                   0
dtype: int64

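Every count is zero, but only because we replaced missing values with empty strings (`fillna('')`) when loading the files, and `isnull()` does not flag empty strings. The column affected by missing values here is `Opening Weekend`, so we can just go ahead and delete the entire column.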

In [14]:
train = train.drop(columns=['Opening Weekend'])
test = test.drop(columns=['Opening Weekend'])

Now, for the most important steps of the process:

1) Separate the dependent and independent variables

2) Check the data types of all columns

In [15]:
# List all of the column headers
train_vars = train.columns.values.tolist()
test_vars = test.columns.values.tolist()

# Select the names of the independent variables
train_features = [i for i in train_vars if i not in ['Won?']]
test_features = [i for i in test_vars if i not in ['Won?']]

# Split into independent variables (copied, so that later edits do not
# trigger SettingWithCopyWarning) and the dependent variable
x = train[train_features].copy()
y = train['Won?']

x_test = test[test_features].copy()
y_test = test['Won?']

# Column types
x.dtypes

Year                         int64
Movie                       object
Budget                      object
IMDB Rating                 object
Genres                      object
Won Golden Globe             int64
Won Bafta                    int64
Oscar Nominations            int64
Golden Globe Nominations     int64
Bafta Nominations            int64
IMdB id                     object
Won Producers                int64
Won Directors                int64
Won Actors                   int64
Rate                         int64
Metascore                   object
dtype: object

Several of our variables are listed as `object`. Before we can proceed, we must coerce these columns to numeric values so that they can be interpreted by `sklearn`.

In [18]:
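# Fix the genres
# Fit the encoder on train and test together so that the same string
# always maps to the same integer in both sets
lb = LabelEncoder()
lb.fit(pd.concat([x['Genres'], x_test['Genres']]))

x['Genres'] = lb.transform(x['Genres'])
x_test['Genres'] = lb.transform(x_test['Genres'])

# Fix the movie titles
lb.fit(pd.concat([x['Movie'], x_test['Movie']]))

x['Movie'] = lb.transform(x['Movie'])
x_test['Movie'] = lb.transform(x_test['Movie'])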
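
One caveat with label encoding is worth illustrating: an encoder assigns integers based only on the labels it was fitted on, so fitting separately on two sets can give the same string different codes. The sketch below uses made-up genre strings, not the workshop files, to compare separate fitting with a single shared fit.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical genre strings, not taken from the workshop files
train_genres = np.array(["Drama", "Comedy", "Drama", "War"])
test_genres = np.array(["War", "Drama"])

# Fitting separately: each encoder numbers only the labels it has seen
separate_train = LabelEncoder().fit_transform(train_genres)
separate_test = LabelEncoder().fit_transform(test_genres)

# Fitting once on all labels keeps the mapping identical for both sets
shared = LabelEncoder().fit(np.concatenate([train_genres, test_genres]))
shared_train = shared.transform(train_genres)
shared_test = shared.transform(test_genres)

print("separate:", separate_train, separate_test)
print("shared:  ", shared_train, shared_test)
```

With separate fitting, "War" is encoded as 2 in the training array but 1 in the test array; with the shared encoder it is 2 in both.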




There are also blank cells. Because they hold empty strings rather than nulls, they did not show up when we checked for missing values. Before we can convert the remaining object columns to numbers, we have to fill them.

In [21]:
# Remove the IMDB id
x = x.drop(columns = 'IMdB id')
x_test = x_test.drop(columns = 'IMdB id')

# Fill blank cells with 0's
x.replace(r'^\s*$', 0, regex = True, inplace = True)
x_test.replace(r'^\s*$', 0, regex = True, inplace = True)
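
To see why blank cells slip past `isnull()`, here is a small illustrative sketch on a toy column (the values are made up). It counts nulls before and after converting blank strings to `NaN` with the same regular expression used above:

```python
import numpy as np
import pandas as pd

# Toy frame with one empty string and one whitespace-only cell
scores = pd.DataFrame({"Metascore": ["81", "", "  ", "79"]})

# isnull() only flags NaN/None, so the blank strings go unnoticed
print(scores["Metascore"].isnull().sum())

# Turning blanks into NaN makes them visible to isnull()
cleaned = scores.replace(r"^\s*$", np.nan, regex=True)
print(cleaned["Metascore"].isnull().sum())
```

Replacing blanks with `NaN` rather than 0 keeps the option of imputing them later instead of treating them as real zeros.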

In [None]:
# Convert remaining object variables
cols = x.columns[x.dtypes.eq('object')]
test_cols = x_test.columns[x_test.dtypes.eq('object')]

x[cols] = x[cols].apply(pd.to_numeric)
x_test[test_cols] = x_test[test_cols].apply(pd.to_numeric)
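
If a column still contains values that cannot be parsed, `pd.to_numeric` raises an error by default. A hedged alternative, shown here on a made-up series, is to coerce unparseable entries to `NaN` and then fill them with the median of the values that did parse, rather than with 0:

```python
import pandas as pd

# Toy column mixing numeric strings with a blank and a stray label
budget = pd.Series(["4.7e+07", "2.4e+07", "", "unknown"])

# errors="coerce" converts anything unparseable to NaN
numeric_budget = pd.to_numeric(budget, errors="coerce")

# Fill the gaps with the median of the values that did parse
filled_budget = numeric_budget.fillna(numeric_budget.median())

print(filled_budget.tolist())
```

Median imputation is just one simple choice; whether it is appropriate depends on why the values are missing.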

## Machine Learning-

Our data is finally in the appropriate shape. We have inspected everything to make sure that the variables are properly formatted and there are no missing values. In many projects, this kind of preparation takes up most of the data scientist's time.

Now the fun part begins: we can start building our models. For this experiment, we will try two methods, the second of which builds on the first.

1. A simple decision tree
2. A random forest

In [None]:
# Create validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size = 0.3, random_state = 100)
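
Because only one nominee wins each year, the target is imbalanced, and a purely random split can leave very few winners in the validation set. One option, sketched here on synthetic labels rather than the workshop data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10 winners among 100 nominees (made up)
features = np.arange(200).reshape(100, 2)
labels = np.array([1] * 10 + [0] * 90)

# stratify=labels preserves the 10% winner rate in both subsets
x_tr, x_va, y_tr, y_va = train_test_split(
    features, labels, test_size=0.3, random_state=100, stratify=labels
)

print(y_tr.mean(), y_va.mean())
```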

In [22]:
# Decision Tree
my_tree = DecisionTreeClassifier()
tree_fit = my_tree.fit(x_train, y_train)

In [24]:
# Random Forest
rf = RandomForestClassifier()
rf_fit = rf.fit(x_train, y_train)

### Predict:

We have created two different models: a decision tree and a random forest. It is important to note that the random forest is an extension of the decision tree model.

A random forest is essentially a collection of decision trees, where each tree is trained on a random bootstrap sample of the training rows and considers only a random subset of the features at each split. The forest then combines the predictions of its individual trees.
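
As a minimal sketch of how this looks in `sklearn`, the example below fits a forest on synthetic data (a stand-in for the Oscar features) and spells out the parameters that control the randomness. The parameter values are illustrative choices, not the settings used in this workshop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the Oscar features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,      # number of decision trees in the ensemble
    max_features="sqrt",  # number of features considered at each split
    bootstrap=True,       # each tree sees a resampled copy of the rows
    random_state=0,
)
forest.fit(X, y)

# The fitted forest exposes its individual trees
print(len(forest.estimators_))
print(type(forest.estimators_[0]).__name__)
```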

In order to evaluate these models, we first make predictions and then calculate the accuracy of each one. Because we are comparing multiple models before making predictions on the test set, we need a separate validation set, which is why we split the training data above into a training set and a validation set. This is especially important because our test set is unlabeled, so we would otherwise have no way of measuring our accuracy before deploying the final model.

In [25]:
# Predict on the validation set
tree_val_preds = tree_fit.predict(x_val)
rf_val_preds = rf_fit.predict(x_val)


### Evaluate:

Now that we have created our models and made predictions for the validation set, we can compare our algorithms to see which one is better for use on the test set.  

There are many ways of assessing the predictions, such as F1 score, specificity, and sensitivity.

For the sake of simplicity, we will judge our models based on accuracy and also the ROC curve.

In [None]:
# Accuracy
print("Validation decision tree accuracy:", accuracy_score(y_val, tree_val_preds))
print("Validation random forest accuracy:", accuracy_score(y_val, rf_val_preds))
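
Accuracy alone can be misleading when winners are rare. The sketch below uses made-up labels to show a model that never predicts a winner: it still scores 80% accuracy, while its F1 score and confusion matrix reveal that it finds no winners at all.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels: 2 winners among 10 nominees
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that never predicts a winner
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))
# zero_division=0 avoids a warning when no positives are predicted
print(f1_score(y_true, y_pred, zero_division=0))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
```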

Surprisingly, our decision tree is actually slightly more accurate than our random forest. This is not uncommon, and it is the reason why data scientists generally run multiple models before settling on a final selection. There is no guarantee that one model or the other will always perform better in a given set of circumstances.

Let's go ahead and plot the ROC curves for both of our models.

In [None]:
# ROC Curve: Decision Tree
fpr, tpr, _ = roc_curve(y_val, tree_val_preds)
tree_roc_auc = auc(fpr, tpr)

# Plot ROC
plt.figure()
plt.plot(fpr, tpr, color = 'darkorange', label = 'ROC Curve (area = %0.2f)' % tree_roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle = '--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC: Decision Tree')
plt.legend(loc="lower right")
plt.show()

In [None]:
# ROC Curve: Random Forest
fpr, tpr, _ = roc_curve(y_val, rf_val_preds)
rf_roc_auc = auc(fpr, tpr)

# Plot ROC
plt.figure()
plt.plot(fpr, tpr, color = 'darkorange', label = 'ROC Curve (area = %0.2f)' % rf_roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle = '--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC: Random Forest')
plt.legend(loc="lower right")
plt.show()
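
The curves above are computed from hard 0/1 predictions, which yields only one interior point per curve. A common refinement, sketched here on synthetic data rather than the Oscar set, is to pass predicted probabilities from `predict_proba` so that `roc_curve` can sweep over many thresholds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data in place of the Oscar features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1
scores = model.predict_proba(X_va)[:, 1]
hard_labels = model.predict(X_va)

# Hard labels give at most one interior point; scores trace a fuller curve
fpr_labels, tpr_labels, _ = roc_curve(y_va, hard_labels)
fpr_scores, tpr_scores, _ = roc_curve(y_va, scores)

print(len(fpr_labels), len(fpr_scores))
print(roc_auc_score(y_va, scores))
```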

From these results, it is clear that our decision tree is actually the better model to deploy in these circumstances. 

All that's left to do is to make predictions on the test set, add these labels to our data, and save the output as a csv.

In [None]:
# Final Predictions
tree_test_preds = tree_fit.predict(x_test)

# Add predictions to data
x_test['Won_Preds'] = tree_test_preds

# Save output
x_test.to_csv('oscar_test_preds.csv')

## Conclusion-

In this workshop we have gone over several important concepts in data science and machine learning:

* Load and transform a dataset
* Build a model
* Assess Predictions
* Save output

The convenient thing about the `sklearn` package in Python is that it provides a uniform interface that allows the data scientist to build a variety of different models using similar syntax. This makes it easier to try more models in less time than you could otherwise.

There are still many ways in which we could improve the accuracy of our model. For example, we could have imputed the missing values in the dataset instead of filling them with 0's. We could also have used a larger training set. Feel free to work on this at home, and please reach out if you have any additional questions or concerns.