### Milestone 5: Final submission, report and screencast, due Wednesday, May 3, 2017

The grand finale! Gather all your experiences, ideas, results, and discussions into one coherent final report that tells a compelling story, and produce a 2-minute screencast that summarizes it.

Your report may be at most 6 pages long (no more!) and must include text and visualizations. Your audience is data scientists who have not spent any time pondering movie genre classification problems. These data scientists have the same background as you (e.g., you do not need to explain what PCA means), but they are not familiar with your data or the specific problems and questions you faced. Make sure to use good storytelling principles when writing your report.

The screencast is for the same audience and must be at most 2 minutes long (no longer!). Do not just scroll through your notebook while talking; that is boring and confusing. You can extract visualizations from your notebook or produce new visuals and slides for a narrated presentation. Please use a good microphone and test the sound quality. Do not underestimate the time it takes to do a good job on your screencast. Start early, write a script, and collect additional materials that you might want to show.

[Upload](https://support.google.com/youtube/answer/57407?co=GENIE.Platform%3DDesktop&hl=en) your screencast to YouTube.

What to submit this week:

- Up to date versions of all your notebooks
- A README to accompany the notebooks that explains how they have changed since the milestone submissions, to help your TF find the relevant updates
- The 6 page final report as a PDF
- The link to your 2 minute screencast on YouTube
- A link to a .zip file with all your cleaned data

#### Final Peer Assessment

It is important to provide positive feedback to people who truly worked hard for the good of the team and to also make suggestions to those you perceived not to be working as effectively on team tasks. We ask you to provide an honest assessment of the contributions of the members of your team, including yourself. The feedback you provide should reflect your judgment of each team member’s:

- Preparation – were they prepared during team meetings?
- Contribution – did they contribute productively to the team discussion and work?
- Respect for others’ ideas – did they encourage others to contribute their ideas?
- Flexibility – were they flexible when disagreements occurred?

Your teammates’ assessments of your contributions and the accuracy of your self-assessment will be considered as part of your overall project score.

Final Peer Assessment: [https://goo.gl/forms/YYFqGbDEfFWeNaSC2](https://goo.gl/forms/YYFqGbDEfFWeNaSC2)

In [1]:
import json
import urllib
import cStringIO
from PIL import Image
from imdb import IMDb
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import time
import ast
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# KFold lives in model_selection; sklearn.cross_validation is the deprecated location
from sklearn.model_selection import train_test_split, KFold
import difflib



In [2]:
# part 3 - most popular movies of 2016 from TMDb (first page of results) and their genres

# get genre list
genre_list = urllib.urlopen("https://api.themoviedb.org/3/genre/movie/list?api_key=2dc6c9f1d17bd39dcbaef83321e1b5a3&language=en-US")

genre_list_json = json.loads(genre_list.read()) 

genre_lst = {}
for i in genre_list_json['genres']:
    genre_lst[i['id']] = str(i['name'])
    
# top most popular movies of 2016
top_2016_1 = urllib.urlopen("https://api.themoviedb.org/3/discover/movie?api_key=2dc6c9f1d17bd39dcbaef83321e1b5a3&sort_by=popularity.desc&include_adult=false&include_video=false&page=1&primary_release_year=2016")
top_2016_1_json = json.loads(top_2016_1.read())


for i in top_2016_1_json['results']:
    print i['title'], [genre_lst[j] for j in i['genre_ids']]


Sing ['Animation', 'Comedy', 'Drama', 'Family', 'Music']
Split ['Horror', 'Thriller']
Fantastic Beasts and Where to Find Them ['Action', 'Adventure', 'Fantasy']
Arrival ['Thriller', 'Drama', 'Science Fiction', 'Mystery']
La La Land ['Comedy', 'Drama', 'Music', 'Romance']
Deadpool ['Action', 'Adventure', 'Comedy', 'Romance']
Rogue One: A Star Wars Story ['Action', 'Drama', 'Science Fiction']
Captain America: Civil War ['Adventure', 'Action', 'Science Fiction']
Tomorrow Everything Starts ['Drama', 'Comedy']
Doctor Strange ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
Passengers ['Adventure', 'Drama', 'Romance', 'Science Fiction']
X-Men: Apocalypse ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
Underworld: Blood Wars ['Action', 'Horror']
Finding Dory ['Adventure', 'Animation', 'Comedy', 'Family']
Hacksaw Ridge ['Drama', 'History', 'War']
Batman v Superman: Dawn of Justice ['Action', 'Adventure', 'Fantasy']
Gold ['Adventure', 'Drama']
Zootopia ['Animation', 'Adventure', 'Fam
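
The id-to-name lookup above can be tried without network access or an API key. This is a minimal sketch, assuming a response shaped like the fields the cell reads (`genres` with `id`/`name`, `results` with `title`/`genre_ids`). The two sample payloads and the helper name `genre_names` are made up for illustration and are not captured API output:

```python
import json

# Hypothetical payloads that mimic only the fields the notebook reads.
genre_list_json = json.loads(
    '{"genres": [{"id": 28, "name": "Action"},'
    ' {"id": 12, "name": "Adventure"},'
    ' {"id": 14, "name": "Fantasy"}]}'
)
movies_json = json.loads(
    '{"results": [{"title": "Example Movie", "genre_ids": [28, 14, 999]}]}'
)

# Same id -> name dictionary as genre_lst in the notebook.
genre_lst = {g["id"]: g["name"] for g in genre_list_json["genres"]}


def genre_names(movie, lookup):
    """Translate a movie's genre ids to names, skipping unknown ids."""
    return [lookup[j] for j in movie["genre_ids"] if j in lookup]


for movie in movies_json["results"]:
    print(movie["title"], genre_names(movie, genre_lst))
```

Unlike the cell above, the helper skips ids that are missing from the lookup instead of raising a `KeyError`.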

In [3]:
movie_20000_df = pd.read_csv('20000_movie_meta_good.csv')
movie_20000_df = movie_20000_df.drop('Unnamed: 0', axis=1)

movie_20000_df = movie_20000_df.dropna()

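# Build one multi-hot label vector per movie, with one slot per known genre id
genre_keys = list(genre_lst.keys())  # fixed column order shared by every row
labels = []
for i in movie_20000_df.genre_ids:
    label_matrix = np.zeros(len(genre_keys), dtype=int)
    for j in ast.literal_eval(i):  # genre_ids is stored as a string such as "[28, 18]"
        if j in genre_lst:
            label_matrix[genre_keys.index(j)] = 1
    labels.append(label_matrix)
movie_20000_df['labels'] = labels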
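
The label construction above is a multi-hot encoding: each movie gets a vector with a 1 in the slot of every genre it belongs to. A small self-contained sketch of the same idea, assuming a three-genre subset of the lookup and a sorted column order (both are illustrative choices, and `encode_genres` is a hypothetical helper name):

```python
import ast
import numpy as np

# Hypothetical subset of the id -> name lookup.
genre_lst = {28: "Action", 18: "Drama", 878: "Science Fiction"}

# Sorting the ids pins down the column order independently of dict ordering.
genre_keys = sorted(genre_lst)            # [18, 28, 878]
column_of = {g: k for k, g in enumerate(genre_keys)}


def encode_genres(raw):
    """Turn a string such as '[28, 18]' into a multi-hot vector."""
    vec = np.zeros(len(genre_keys), dtype=int)
    for g in ast.literal_eval(raw):
        if g in column_of:                # ids missing from the lookup are ignored
            vec[column_of[g]] = 1
    return vec


print(encode_genres("[28, 878]"))         # columns are ordered 18, 28, 878
```

A dictionary from id to column also avoids the repeated list search that `.index(j)` performs for every genre of every movie.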

In [4]:
movie_20000_df.head()

| | genre_ids | movie_id | overview | popularity | poster_path | release_date | title | vote_average | vote_count | labels | int_dates |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [14, 10402, 10749] | 321612 | A live-action adaptation of Disney's version o... | 149.54276 | /tWqifoYuwLETmmasnGHO7xBjEtt.jpg | 3/16/17 | Beauty and the Beast | 6.9 | 1770 | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ... | 20170316 |
| 1 | [28, 18, 878] | 263115 | In the near future, a weary Logan cares for an... | 79.627847 | /45Y1G5FEgttPAwjTYic6czC9xCn.jpg | 2/28/17 | Logan | 7.5 | 2429 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, ... | 20170228 |
| 2 | [16, 35, 18, 10751, 10402] | 335797 | A koala named Buster recruits his best friend ... | 77.930498 | /s9ye87pvq2IaDvjv9x4IOXVjvA7.jpg | 11/23/16 | Sing | 6.7 | 1170 | [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ... | 20161123 |
| 3 | [28, 12, 14] | 293167 | Explore the mysterious and dangerous home of t... | 61.012215 | /5wBbdNb0NdGiZQJYoKHRv6VbiOr.jpg | 3/8/17 | Kong: Skull Island | 6.0 | 1203 | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, ... | 20170308 |
| 4 | [28, 80, 53] | 337339 | When a mysterious woman seduces Dom into the w... | 60.623332 | /iNpz2DgTsTMPaDRZq2tnbqjL2vF.jpg | 4/12/17 | The Fate of the Furious | 7.2 | 482 | [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ... | 20170412 |


In [5]:
data = movie_20000_df['overview'].values

In [6]:
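# Collect every distinct whitespace-delimited token, in order of first appearance
unique_words = []
seen = set()  # set lookup is constant time; scanning the list grows with its length
for review in data:
    for word in review.split():
        if word not in seen:
            seen.add(word)
            unique_words.append(word)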

In [7]:
word_dic = {x: i for i, x in enumerate(unique_words, 1)}

In [8]:
sequence = []
for review in data:
    splited = review.split()
    seq = []
    for word in splited:
        seq.append(word_dic[word])
    sequence.append(seq)
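
The vocabulary and encoding steps above can be checked end to end on toy input. A minimal sketch; the two overview strings and the helper names `build_vocab` and `encode` are made up for illustration:

```python
def build_vocab(texts):
    """Map each distinct whitespace token to an integer id, starting at 1."""
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1   # 0 stays free for padding
    return vocab


def encode(texts, vocab):
    """Replace every token by its id."""
    return [[vocab[w] for w in t.split()] for t in texts]


overviews = ["a weary hero returns", "a hero falls"]   # made-up examples
word_dic = build_vocab(overviews)
sequence = encode(overviews, word_dic)
print(sequence)
```

Because `split()` only separates on whitespace, tokens stay case- and punctuation-sensitive, so `Hero,` and `hero` would receive different ids.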

In [9]:
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.layers import Dropout

Using TensorFlow backend.


In [10]:
maxwords = 100
padded = pad_sequences(sequence, maxlen = maxwords)
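
With its default arguments, `pad_sequences` pads on the left with zeros and, for sequences longer than `maxlen`, keeps the final `maxlen` entries. A NumPy-only sketch of that behavior follows; the helper `pad_left` is a hypothetical stand-in written for illustration, not Keras code:

```python
import numpy as np


def pad_left(seqs, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        tail = seq[-maxlen:]              # drop items from the front if too long
        if tail:
            out[row, -len(tail):] = tail  # right-align, padding sits on the left
    return out


padded = pad_left([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded)
```

One consequence for this notebook: an overview longer than `maxwords` loses its opening words rather than its closing ones.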

In [11]:
model = Sequential()
model.add(LSTM(100, input_shape = (maxwords,1), return_sequences = True, activation = 'relu'))
model.add(LSTM(50, return_sequences = True, activation = 'relu'))
model.add(LSTM(19, activation = 'sigmoid'))
model.compile(loss = 'binary_crossentropy',
             optimizer = 'adam',
             metrics = ['accuracy'])

In [12]:
# Shuffle the row indices, then split them 70% train / 30% test
index = np.random.permutation(padded.shape[0])
y = np.stack(movie_20000_df['labels'].values, axis = 0)
split = int(len(index) * 0.7)
train_index = index[:split]
test_index = index[split:]
x_train = padded[train_index,:]
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], 1)
y_train = y[train_index]
x_test = padded[test_index,:]
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], 1)
y_test = y[test_index]

In [13]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 100, 100)          40800     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 50)           30200     
_________________________________________________________________
lstm_3 (LSTM)                (None, 19)                5320      
Total params: 76,320
Trainable params: 76,320
Non-trainable params: 0
_________________________________________________________________
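
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector, which gives 4 × (units × (input_dim + units) + units) parameters. A short check (the function name `lstm_params` is just for illustration):

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


# (input_dim, units) for the three stacked layers in the summary
layers = [(1, 100), (100, 50), (50, 19)]
counts = [lstm_params(i, u) for i, u in layers]
print(counts, sum(counts))   # [40800, 30200, 5320] 76320
```

The first layer has an input dimension of 1 because each time step carries a single number, the integer id of one word.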


In [88]:
model.fit(x_train, y_train,
         batch_size = 50,
         epochs = 10,
         verbose = 1,
         validation_data = (x_test, y_test)) 

Train on 13559 samples, validate on 5811 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fb35cfb1e90>

In [90]:
# once training is complete, let's see how well we have done
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

('Test loss:', 1.397556677473492)
('Test accuracy:', 0.86377923737046558)
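
One caveat when reading this number: assuming Keras resolved `metrics = ['accuracy']` to per-label binary accuracy for this loss (its usual behavior with `binary_crossentropy`), the score is averaged over all 19 genre slots of every movie, and most of those slots are 0. The sketch below uses synthetic labels (random data, not the movie set) to show how a model that never predicts any genre fares under that metric:

```python
import numpy as np

rng = np.random.RandomState(0)
n_movies, n_genres = 1000, 19

# Synthetic multi-hot labels: each movie gets 1 to 4 distinct genres.
y_true = np.zeros((n_movies, n_genres), dtype=int)
for row in y_true:
    k = rng.randint(1, 5)                       # 1..4 genres
    row[rng.choice(n_genres, size=k, replace=False)] = 1

y_pred = np.zeros_like(y_true)                  # "no genre, ever" baseline

per_label_acc = (y_pred == y_true).mean()       # averaged over every label slot
exact_match = (y_pred == y_true).all(axis=1).mean()  # all 19 slots correct

print(round(per_label_acc, 3), exact_match)
```

By construction the baseline's per-label accuracy falls between roughly 0.79 and 0.95, while it gets no movie entirely right. Per-genre precision and recall, or an F1 score, may therefore be a more informative companion to the accuracy reported here.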


# Validation Results:

We kept the number of epochs small because the models took a long time to train (roughly 3 to 13 minutes per epoch in the logged runs).

100 maxwords, 3 layers, adam optimizer, binary crossentropy

Epoch 1/2
13559/13559 [==============================] - 339s - loss: 1.3498 - acc: 0.7840 - val_loss: 1.2309 - val_acc: 0.8106
Epoch 2/2
13559/13559 [==============================] - 341s - loss: 1.2125 - acc: 0.8223 - val_loss: 1.2347 - val_acc: 0.8315

200 maxwords, 3 layers, adam optimizer, binary crossentropy

Epoch 1/2
13559/13559 [==============================] - 798s - loss: 1.4799 - acc: 0.7569 - val_loss: 1.3712 - val_acc: 0.7763
Epoch 2/2
13559/13559 [==============================] - 715s - loss: 1.2938 - acc: 0.7900 - val_loss: 1.3022 - val_acc: 0.8058

('Test loss:', 1.302213639692767)
('Test accuracy:', 0.80584010901867142)

50 maxwords, 3 layers, adam optimizer, binary crossentropy, 2 epochs

Epoch 1/2
13559/13559 [==============================] - 176s - loss: 1.4532 - acc: 0.7875 - val_loss: 1.4410 - val_acc: 0.8151
Epoch 2/2
13559/13559 [==============================] - 171s - loss: 1.4644 - acc: 0.8365 - val_loss: 1.4366 - val_acc: 0.8514

('Test loss:', 1.4365563358995512)
('Test accuracy:', 0.85144328199182107)

50 maxwords, 4 layers, adam optimizer, binary crossentropy

Epoch 1/2
13559/13559 [==============================] - 410s - loss: 1.3655 - acc: 0.7785 - val_loss: 1.2769 - val_acc: 0.8307
Epoch 2/2
13559/13559 [==============================] - 391s - loss: 1.1986 - acc: 0.8320 - val_loss: 1.1068 - val_acc: 0.8500

('Test loss:', 1.1067506750240073)
('Test accuracy:', 0.85003941132143879)

50 maxwords, 3 layers, adam optimizer, binary crossentropy, 20 epochs

Accuracy starts at 0.85 and ends at 0.87.
('Test loss:', 1.1714805600524585)
('Test accuracy:', 0.87667674125728168)

75 maxwords, 3 layers, adam optimizer, binary crossentropy, 10 epochs

('Test loss:', 1.4473536369251192)
('Test accuracy:', 0.87377842147344975)

100 maxwords, 3 layers, adam optimizer, binary crossentropy, 10 epochs

('Test loss:', 1.397556677473492)
('Test accuracy:', 0.86377923737046558)
