### Milestone 4: Deep learning, due Wednesday, April 26, 2017

For this milestone you will (finally) use deep learning to predict movie genres. You will train one small network from scratch on the posters only, and compare it to a pre-trained network that you fine-tune. [Here](https://keras.io/getting-started/faq/#how-can-i-use-pre-trained-models-in-keras) is a description of how to use pre-trained models in Keras.

You can try different architectures, initializations, parameter settings, optimization methods, etc. Be adventurous and explore deep learning! It can be fun to combine the features learned by the deep learning model with an SVM, or to incorporate metadata into your deep learning model.

**Note:** Be mindful of the longer training times for deep models. Budget time not only for the training itself, but also for parameter tuning. You need time to develop a feel for the different parameters: which settings work, which normalization to use, which model architecture to choose, etc.

It is great that we have GPUs via AWS to speed up the actual computation time, but you need to be mindful of your AWS credits. The GPU instances are not cheap and can accumulate costs rather quickly. Think about your model first and do some quick dry runs with a larger learning rate or large batch size on your local machine. 

The notebook to submit this week should at least include:

- Complete description of the deep network you trained from scratch, including parameter settings, performance, features learned, etc. 
- Complete description of the pre-trained network that you fine-tuned, including parameter settings, performance, features learned, etc.
- Discussion of the results, how much improvement you gained with fine-tuning, etc.
- Discussion of at least one additional exploratory idea you pursued.

In [1]:
import json
import urllib
import cStringIO
from PIL import Image
from imdb import IMDb
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import time
import ast
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.cross_validation import KFold
import difflib



In [2]:
# part 3 - most popular movies of 2016 from TMDb and their genres
# (the request itself is made once, after the genre list has been loaded)

# get genre list
genre_list = urllib.urlopen("https://api.themoviedb.org/3/genre/movie/list?api_key=2dc6c9f1d17bd39dcbaef83321e1b5a3&language=en-US")

genre_list_json = json.loads(genre_list.read()) 

genre_lst = {}
for i in genre_list_json['genres']:
    genre_lst[i['id']] = str(i['name'])
    
# top most popular movies of 2016
top_2016_1 = urllib.urlopen("https://api.themoviedb.org/3/discover/movie?api_key=2dc6c9f1d17bd39dcbaef83321e1b5a3&sort_by=popularity.desc&include_adult=false&include_video=false&page=1&primary_release_year=2016")
top_2016_1_json = json.loads(top_2016_1.read())


for i in top_2016_1_json['results']:
    print i['title'], [genre_lst[j] for j in i['genre_ids']]


Sing ['Animation', 'Comedy', 'Drama', 'Family', 'Music']
Split ['Horror', 'Thriller']
Fantastic Beasts and Where to Find Them ['Action', 'Adventure', 'Fantasy']
Rogue One: A Star Wars Story ['Action', 'Drama', 'Science Fiction', 'War']
Deadpool ['Action', 'Adventure', 'Comedy', 'Romance']
Arrival ['Thriller', 'Drama', 'Science Fiction', 'Mystery']
Boyka: Undisputed IV ['Action']
La La Land ['Comedy', 'Drama', 'Music', 'Romance']
Doctor Strange ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
Tomorrow Everything Starts ['Drama', 'Comedy']
Captain America: Civil War ['Adventure', 'Action', 'Science Fiction']
Finding Dory ['Adventure', 'Animation', 'Comedy', 'Family']
Collateral Beauty ['Drama', 'Romance']
X-Men: Apocalypse ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
Passengers ['Adventure', 'Drama', 'Romance', 'Science Fiction']
Why Him? ['Comedy']
Underworld: Blood Wars ['Action', 'Horror']
Suicide Squad ['Action', 'Crime', 'Fantasy', 'Science Fiction']
Hacksaw Ridge ['Dr

In [3]:
import ast

movie_2000_df = pd.read_csv('tmdb_metadata.csv')
movie_2000_df = movie_2000_df.drop('Unnamed: 0', axis=1)

movie_2000_df = movie_2000_df.dropna()

# multi-hot encode the genres: one column per genre id, in a fixed order
genre_ids = list(genre_lst.keys())
labels = []
for i in movie_2000_df.genre_ids:
    label_vector = np.zeros(len(genre_ids), dtype=int)
    for j in ast.literal_eval(i):
        if j in genre_lst:
            label_vector[genre_ids.index(j)] = 1
    labels.append(label_vector)
movie_2000_df['labels'] = labels

# convert dates
import datetime
def to_integer(dt_time):
    return 10000*dt_time.year + 100*dt_time.month + dt_time.day

int_dates =[]

for i in movie_2000_df.release_date:
    f = i.split('-')
    a = datetime.date(int(f[0]), int(f[1]), int(f[2]))
    int_dates.append(to_integer(a))

movie_2000_df['int_dates'] = int_dates
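
The label construction above is a multi-hot encoding: each movie gets a 0/1 vector with one slot per genre, and several slots can be set at once. A minimal, self-contained sketch of the idea follows. The three-entry `toy_genres` mapping and the helper name `multi_hot` are made up for illustration; they are not the data or code used in this notebook.

```python
import numpy as np

# hypothetical genre mapping, for illustration only
toy_genres = {28: 'Action', 35: 'Comedy', 18: 'Drama'}
# fix the column order once so every vector uses the same layout
toy_ids = sorted(toy_genres)   # [18, 28, 35]

def multi_hot(movie_genre_ids, known_ids):
    """Return a 0/1 vector with a 1 for every genre id the movie has."""
    vec = np.zeros(len(known_ids), dtype=int)
    for gid in movie_genre_ids:
        if gid in known_ids:          # ids outside the mapping are ignored
            vec[known_ids.index(gid)] = 1
    return vec

print(multi_hot([35, 18], toy_ids))   # comedy + drama -> [1 0 1]
print(multi_hot([28, 999], toy_ids))  # action + an unknown id -> [0 1 0]
```

Fixing the column order up front matters: every label vector must use the same position for the same genre, otherwise the network's output units would not have a stable meaning.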

In [4]:
data = movie_2000_df.drop(['genre_ids', 'movie_id', 'poster_path', 'overview', 'title', 'release_date'], axis = 1)

In [5]:
words = pd.read_csv('genre_words_pca.csv').drop('Unnamed: 0', axis = 1)

In [6]:
x = pd.concat([data[['popularity', 'vote_average', 'vote_count', 'int_dates']], words], axis = 1).values
y = data['labels']
y = np.asarray(y.tolist())
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

In [7]:
from __future__ import print_function

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

import matplotlib
sns.set_style('white')

Using TensorFlow backend.


In [8]:
# smaller batch size means noisier gradient, but more updates per epoch
batch_size = 512
# one output per genre (multi-label problem); 19 genres in this data set
num_classes = y_train.shape[1]
# number of iterations over the complete training data
epochs = 100

# x_train / x_test come from the train_test_split above
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# note: the features are neither centered nor rescaled here
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

3431 train samples
1471 test samples
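
The input columns live on very different scales: `int_dates` values are in the tens of millions, while the PCA word features and ratings are small numbers. That can make gradient descent harder. One common option, which the cell above does not apply, is to standardize each column using statistics from the training set only. A minimal NumPy sketch, with made-up toy arrays and a hypothetical helper name `standardize`:

```python
import numpy as np

def standardize(train, test):
    """Scale columns to zero mean / unit variance using train statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0               # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# toy data: column 0 looks like int_dates, column 1 like a rating
toy_train = np.array([[20160101., 6.0], [20161231., 8.0], [20150615., 7.0]])
toy_test = np.array([[20170426., 5.0]])

train_s, test_s = standardize(toy_train, toy_test)
print(train_s.mean(axis=0))   # approximately 0 for every column
print(train_s.std(axis=0))    # approximately 1 for every column
```

The test set is transformed with the training mean and standard deviation so that no information from the test data leaks into preprocessing.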


In [9]:
# create an empty network model
model = Sequential()
# first hidden layer; input_shape is the number of input features (304)
model.add(Dense(64, activation='relu', input_shape=(304,)))
# second hidden layer
model.add(Dense(64, activation='relu'))
# output layer with one unit per genre (19)
# a movie can have several genres at once, so each unit gets its own
# sigmoid instead of a softmax over mutually exclusive classes
model.add(Dense(19, activation='sigmoid'))

# prints out a summary of the model architecture
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 64)                19520     
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_3 (Dense)              (None, 19)                1235      
Total params: 24,915
Trainable params: 24,915
Non-trainable params: 0
_________________________________________________________________
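
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short sketch of that arithmetic (the helper name `dense_params` is ours, not a Keras function):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [304, 64, 64, 19]   # input features, two hidden layers, genres
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
print(counts)        # [19520, 4160, 1235]
print(sum(counts))   # 24915
```

Almost 80% of the parameters sit in the first layer, so the width of the input (304 features) dominates the model size.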


In [10]:
from keras import metrics
import keras.backend as K

def precision(y_true, y_pred):
    """Precision metric.
    Only computes a batch-wise average of precision.
    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision


def recall(y_true, y_pred):
    """Recall metric.
    Only computes a batch-wise average of recall.
    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

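def f1_score(y_true, y_pred):
    """F1 metric.
    Only computes a batch-wise average of the F1 score, the harmonic
    mean of precision and recall.
    """
    # Count true positives, predicted positives and actual positives.
    c1 = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    c2 = K.sum(K.round(K.clip(y_pred, 0, 1)))
    c3 = K.sum(K.round(K.clip(y_true, 0, 1)))

    # c1, c2 and c3 are backend tensors, not Python numbers, so a Python
    # `if c3 == 0` test does not inspect their values. K.epsilon() guards
    # against division by zero instead; the score is 0 without true positives.

    # How many selected items are relevant?
    precision = c1 / (c2 + K.epsilon())

    # How many relevant items are selected?
    recall = c1 / (c3 + K.epsilon())

    # Calculate f1_score
    return 2 * (precision * recall) / (precision + recall + K.epsilon())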
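
To make the three metrics concrete, here is a NumPy sketch of the same round-and-count logic on a tiny hand-made batch. The arrays and the function name `batch_prf` are illustrative assumptions; the Keras versions above operate on backend tensors instead of NumPy arrays.

```python
import numpy as np

def batch_prf(y_true, y_pred, eps=1e-7):
    """Precision, recall and F1 over all labels in one batch."""
    hard_pred = np.round(np.clip(y_pred, 0, 1))   # threshold probabilities at 0.5
    tp = np.sum(y_true * hard_pred)               # predicted 1 and truly 1
    predicted_pos = np.sum(hard_pred)
    actual_pos = np.sum(y_true)
    precision = tp / (predicted_pos + eps)
    recall = tp / (actual_pos + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# two movies, three genres
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.7]])

p, r, f = batch_prf(y_true, y_pred)
print('%.3f %.3f %.3f' % (p, r, f))   # 0.667 0.667 0.667
```

Here two of the three predicted genres are correct (precision 2/3) and two of the three true genres are found (recall 2/3). Because the counts are pooled over a batch, averaging these values across batches only approximates the metric on the full data set.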

In [11]:
sgd = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy',
              optimizer=sgd,
              metrics=['accuracy', precision, recall, f1_score])
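
`binary_crossentropy` pairs naturally with a sigmoid output layer: every genre is scored as its own yes/no problem and the per-genre losses are averaged. The NumPy sketch below illustrates that computation on made-up numbers. It is not Keras's implementation, and the function names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all labels."""
    p = np.clip(y_prob, eps, 1 - eps)   # keep log() finite
    return np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

logits = np.array([[2.0, -1.0, 0.5]])   # raw scores for three genres
probs = sigmoid(logits)
print(probs.sum())    # not constrained to sum to 1, unlike softmax

y_true = np.array([[1, 0, 1]])
loss = binary_crossentropy(y_true, probs)
# a confident, correct prediction gives a loss close to 0
confident_loss = binary_crossentropy(y_true, np.array([[0.99, 0.01, 0.99]]))
print(loss, confident_loss)
```

Keep in mind that with 19 mostly-zero labels per movie, the `accuracy` metric can look high even for a model that predicts no genres at all, which is why precision, recall and F1 are tracked alongside it.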

In [12]:
# this is now the actual training
# in addition to the training data we provide validation data, which is used
# to calculate the performance of the model after every epoch
# this is useful to determine when training should stop
# in our case we just use it to monitor the evolution of the model over the training epochs
# if we used the validation data to decide when to stop training or which model to save,
# we should not use the test data for that, but a separate validation set
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

# once training is complete, let's see how well we have done
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
print('Test precision:', score[2])
print('Test recall:', score[3])
print('Test f1_score:', score[4])

Train on 3431 samples, validate on 1471 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100