## Assignment 2 - Movie Classification, the sequel
#### In this assignment, we will learn a little more about word2vec and then use the resulting vectors to make some predictions.

We will be working with a movie synopsis dataset, found here: http://www.cs.cmu.edu/~ark/personas/

The overall goal should sound a little familiar: based on the movie synopses, we will classify movie genre. Some of your favorites should be in this dataset, and hopefully the genre-specific terminology in each synopsis will let us figure out which movies belong to which genre.

### Task 1: clean your dataset!

For your input data:

1. Find the top 10 movie genres
2. Remove any synopses that don't fit into these genres
3. Take the top 10,000 movies in terms of "Movie box office revenue"
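
The three steps above can be sketched on a toy dataset. The `movies` records, their field names, and the `clean` helper are hypothetical stand-ins for the real metadata, and `top_n` and `keep` are shrunk so the result is easy to inspect. Treat it as a sketch of the logic, not as the assignment's solution.

```python
from collections import Counter

# Hypothetical toy records; the real data comes from movie.metadata.tsv.
movies = [
    {"name": "A", "gross": 900.0, "genres": ["Drama", "Comedy"]},
    {"name": "B", "gross": 500.0, "genres": ["Drama"]},
    {"name": "C", "gross": 700.0, "genres": ["Western"]},
    {"name": "D", "gross": 100.0, "genres": ["Comedy", "Horror"]},
    {"name": "E", "gross": None, "genres": ["Drama"]},
]

def clean(movies, top_n, keep):
    # Movies without a revenue figure cannot be ranked, so drop them first.
    movies = [m for m in movies if m["gross"] is not None]

    # Step 1: count every genre occurrence and keep the top_n most frequent.
    counts = Counter(g for m in movies for g in m["genres"])
    top_genres = {g for g, _ in counts.most_common(top_n)}

    # Step 2: strip genres outside the top list, then drop movies left with none.
    filtered = []
    for m in movies:
        kept = [g for g in m["genres"] if g in top_genres]
        if kept:
            filtered.append({**m, "genres": kept})

    # Step 3: keep the `keep` highest-grossing movies.
    filtered.sort(key=lambda m: m["gross"], reverse=True)
    return filtered[:keep]

subset = clean(movies, top_n=2, keep=2)
print([(m["name"], m["genres"]) for m in subset])
```

With the real data, `top_n` would be 10 and `keep` would be 10,000.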

Congrats, you've got a dataset! Some movies may have multiple classifications. To deal with this, look at the Reuters dataset classification code that we used previously and possibly this example: https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py

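Because a movie can belong to several genres at once, this is a multi-label problem. For a neural net, that usually means one sigmoid output per genre trained with binary cross-entropy (categorical cross-entropy assumes exactly one class per example). For an SVM, use a one vs. all (one-vs-rest) classifier.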
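
To make the multi-label setup concrete, here is a small dependency-free sketch of how a movie's genres can be turned into a 0/1 vector and scored with a per-genre binary cross-entropy. The `multi_hot` and `binary_cross_entropy` helpers and the three-genre list are illustrative assumptions, not code from the assignment. In the notebook, `MultiLabelBinarizer` does the encoding.

```python
import math

def multi_hot(labels, classes):
    """Encode a list of genre names as a 0/1 vector over `classes`."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean per-genre binary cross-entropy for one movie."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

classes = ["Comedy", "Drama", "Thriller"]  # hypothetical genre list
target = multi_hot(["Drama", "Comedy"], classes)
print(target)

# Probabilities close to the target give a lower loss than the reverse.
good = binary_cross_entropy(target, [0.9, 0.8, 0.1])
bad = binary_cross_entropy(target, [0.1, 0.2, 0.9])
print(good < bad)
```

Each genre is scored independently, so a movie can be both "Drama" and "Comedy" without the two labels competing for probability mass.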

In [2]:
import pandas as pd
import ast
from collections import Counter
import numpy as np

## DATA FORMATTING
# Step 1. Reading the data in!
# Metadata ==> id, name of movie, gross boxoffice, genre
metadata = pd.read_csv('./MovieSummaries/movie.metadata.tsv', sep='\t',
                       header=None,
                       usecols=[0, 2, 4, 8],
                       names=['id', 'name', 'gross', 'genre'])

# Plot summaries
with open('./MovieSummaries/plot_summaries.txt', encoding='utf8') as fp:
    plots = fp.readlines()

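# Step 2. Cleaning the null values from gross box office
# .copy() gives an independent DataFrame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
filter_metadata = metadata[metadata['gross'].notnull()].copy()

# Step 3. Reading in the genre column
# Each entry is a dict literal of {Freebase ID: genre name}; keep only the names
filter_metadata['genre'] = filter_metadata['genre'].map(lambda d: list(ast.literal_eval(d).values()))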

# Step 4: Finding the top genres
all_genres = list(filter_metadata['genre'])
all_genres_flat = [item for sublist in all_genres for item in sublist]
genre_counter = Counter(all_genres_flat)
top_genres = [x[0] for x in genre_counter.most_common(10)]

# Step 5: Filtering on top genres
keep_genres = []
for item in all_genres:
    gens = list(set(item).intersection(set(top_genres)))
    if len(gens)>0:
        keep_genres.append(gens)
    else:
        keep_genres.append(np.nan)

filter_metadata['genre'] = keep_genres
# .copy() so that adding the 'plots' column below works on an independent DataFrame
filter_metadata = filter_metadata[filter_metadata['genre'].notnull()].copy()

# Step 6: joining plots to metadata!
plots = {x.split('\t')[0]:x.split('\t')[1] for x in plots}
filter_metadata['plots'] = [plots[str(key)] if str(key) in plots else np.nan for key in filter_metadata['id']]
filter_metadata = filter_metadata[filter_metadata['plots'].notnull()]

# WHEW!


In [3]:
# Splitting data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(filter_metadata['plots'], 
                                                    filter_metadata['genre'], 
                                                    test_size = .30, 
                                                    random_state=10)

### Task 3a: Build a model using ONLY word2vec

Whoa, what? I don't think that's recommended...

In fact, it's a commonly accepted practice. Average the word vectors of all the words in a given synopsis (https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and use that averaged vector as the feature vector for a model. For this example, use a Support Vector Machine classifier. For your first time doing this, train a word2vec model in Gensim and use its output vectors.
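
A minimal sketch of the averaging step, assuming the synopsis has already been split into word tokens. The `average_vector` helper and the two-dimensional `toy_vectors` are made up for illustration. Plain Python is used to keep the example dependency-free; `numpy.mean(vectors, axis=0)` computes the same column-wise mean.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known word at all: fall back to a zero vector.
        return [0.0] * dim
    # zip(*known) walks the vectors column by column.
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical 2-dimensional embeddings, standing in for word2vec output.
toy_vectors = {"heist": [1.0, 0.0], "friends": [0.0, 1.0], "plan": [1.0, 1.0]}

tokens = "the friends plan a heist".split()
doc_vector = average_vector(tokens, toy_vectors, dim=2)
print(doc_vector)
```

Note that the input is a list of words. Iterating over a raw string would yield single characters, and the average would then be taken over whichever one-letter tokens happen to be in the vocabulary.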


In [25]:
# Building word2vec model
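from nltk.tokenize import PunktSentenceTokenizer
import string

# Build the tokenizer and the punctuation-stripping table once, not per synopsis
sent_tokenizer = PunktSentenceTokenizer()
strip_punct = str.maketrans('', '', string.punctuation)

sentences = []
for item in filter_metadata['plots']:
    # One list of lowercased, punctuation-free tokens per sentence
    sentences.extend([w.translate(strip_punct).strip().lower() for w in sent.split()]
                     for sent in sent_tokenizer.tokenize(item))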

In [26]:
import gensim

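# Parameter and attribute names below follow gensim >= 4.0
# (older releases used size=, wv.index2word and wv.syn0)
model = gensim.models.Word2Vec(sentences, vector_size=150, window=10, min_count=2, workers=10)
# The constructor has already trained on `sentences`; this call trains for 10 more epochs
model.train(sentences, total_examples=len(sentences), epochs=10)
w2v = dict(zip(model.wv.index_to_key, model.wv.vectors))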


In [27]:
# Checking word2vec model
model.wv.similar_by_word('tree')


[('lake', 0.6614973545074463),
 ('river', 0.6171631813049316),
 ('stone', 0.6153417229652405),
 ('creek', 0.6111149787902832),
 ('shack', 0.604758620262146),
 ('snow', 0.5995670557022095),
 ('rope', 0.5986873507499695),
 ('rocks', 0.5980300903320312),
 ('hole', 0.5960316061973572),
 ('trees', 0.5954447984695435)]

In [30]:
from sklearn.svm import SVC
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(list(word2vec.values())[0])

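    def fit(self, X, y=None):
        # The identity analyzer expects each document to already be a list of
        # tokens; a raw string would be iterated character by character.
        # stop_words is ignored when analyzer is a callable, so it is omitted.
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)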
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])

w2v_tfidf = Pipeline([("word2vec vectorizer", TfidfEmbeddingVectorizer(w2v)), 
                        ("svm", OneVsRestClassifier(SVC(probability = True)))])

mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform(y_train) 
test_labels = mlb.transform(y_test)

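# NOTE: X_train holds raw synopsis strings, so the vectorizer above iterates over
# characters; pass token lists (e.g. X_train.map(str.split)) to average over words
w2v_tfidf.fit(X_train, train_labels)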
predictions = w2v_tfidf.predict(X_test)
pred_probs = w2v_tfidf.predict_proba(X_test)

In [31]:
def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='micro')
    recall = recall_score(test_labels, predictions, average='micro')

    print("Precision: {:.4f}, Recall: {:.4f}".format(precision, recall)) 
    
evaluate(test_labels, predictions)

Precision: 0.5614, Recall: 0.2270


In [34]:
pred_invert = mlb.inverse_transform(predictions)
label_invert = mlb.inverse_transform(test_labels)

pred_data = list(zip(X_test, pred_invert , label_invert, test_labels, pred_probs))

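# `synopsis` rather than `input`, to avoid shadowing the built-in
for synopsis, prediction, label, lab_arr, pred_arr in pred_data[:5]:
    # Report the movie when its most probable genre is not one of its true genres
    if lab_arr[np.argmax(pred_arr)]!=1:
        print(synopsis[:500], '\n\n Movie has been classified as ', prediction, 'and should be ', label)
        print('\n')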

In Texas, Dignan  "rescues" Anthony  from a voluntary mental hospital, where he has been staying for self-described exhaustion. Dignan has an elaborate escape planned and has developed a 75-year plan that he shows to Anthony. The plan is to pull off several heists and then meet Mr. Henry, a landscaper and part-time criminal known to Dignan. As a practice heist, the two friends break into Anthony's house, stealing specific items from a list. Afterward, critiquing the heist, Dignan reveals that he 

 Movie has been classified as  ('Drama',) and should be  ('Comedy', 'Crime Fiction', 'Indie')


On an academic scholarship, Paul Tannek  is a fish out of water kid from the upstate New York who arrives in New York City. In the fall of 1999, attending college at NYU, Paul runs into repeated complications and mishaps, usually brought on by his roommates, three spoiled, obnoxious party animals. When Paul is branded a loser and kicked out by his roommates, he settles in a room at a veterinary cli

### Task 3b: Do the same thing but with pretrained embeddings

Now pull down the Google News word embeddings and do the same thing. Compare the results. Why was one better than the other?
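
When comparing the two sets of embeddings, one diagnostic worth a look is vocabulary coverage: how many of the tokens in the synopses actually have a vector. The sketch below is an assumption about how you might measure that, not part of the assignment. The `coverage` helper, the toy documents, and both vocabularies are hypothetical.

```python
def coverage(documents, vocabulary):
    """Fraction of token occurrences that have a vector in `vocabulary`."""
    total = 0
    covered = 0
    for tokens in documents:
        total += len(tokens)
        covered += sum(1 for t in tokens if t in vocabulary)
    return covered / total if total else 0.0

# Hypothetical tokenized synopses and vocabularies.
docs = [["dignan", "plans", "a", "heist"], ["paul", "attends", "college"]]
corpus_vocab = {"dignan", "plans", "heist", "paul", "attends", "college"}
pretrained_vocab = {"plans", "heist", "attends", "college", "a"}

corpus_cov = coverage(docs, corpus_vocab)
pretrained_cov = coverage(docs, pretrained_vocab)
print(corpus_cov, pretrained_cov)
```

A model trained on the synopses themselves tends to know corpus-specific words such as character names, while a large pretrained model has seen far more text per word. Which effect wins is something to check on the real data.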

### Task 3: Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras)

In [15]:
import numpy as np
import keras

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten
from keras.preprocessing.text import Tokenizer

max_words = 1000
batch_size = 64
epochs = 20

num_classes = 10
print(num_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
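# texts_to_matrix returns one max_words-wide bag-of-words row per synopsis
x_train = tokenizer.texts_to_matrix(X_train)
x_test = tokenizer.texts_to_matrix(X_test)
# NOTE: the two prints below report the raw text Series (capital X), not the matrices
print('x_train shape:', X_train.shape)
print('x_test shape:', X_test.shape)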

# Borrow our binarized labels from the previous model
y_train = train_labels
y_test = test_labels
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

print('Building model...')
model = Sequential()
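# NOTE: Embedding expects integer word indices (e.g. from texts_to_sequences plus
# padding); the bag-of-words rows used here only ever contain the values 0 and 1
model.add(Embedding(max_words, 100, input_length=x_train.shape[1]))
model.add(Flatten())
model.add(Dense(256))  # input shape is inferred from the previous layer
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation="sigmoid"))
# NOTE: with sigmoid outputs and multi-hot labels, 'binary_crossentropy' is the
# usual loss; the results recorded below were produced with this setting
model.compile(loss='categorical_crossentropy',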
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1)
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])


10 classes
Vectorizing sequence data...
x_train shape: (5037,)
x_test shape: (2159,)
y_train shape: (5037, 10)
y_test shape: (2159, 10)
Building model...
Train on 4533 samples, validate on 504 samples
Test score: 5.60009924512
Test accuracy: 0.429828624184


In [30]:
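# predict_classes was removed from recent Keras releases; taking the argmax of
# the predicted scores gives the same single most likely class per movie
preds = np.argmax(model.predict(x_test), axis=1)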



In [32]:
from sklearn.metrics import classification_report, accuracy_score

# One-hot encode the single predicted class index per movie
mlb2 = MultiLabelBinarizer(classes=[str(a) for a in range(10)])
pred_mat = mlb2.fit_transform([[str(a)] for a in preds])

print('Accuracy: \n')
# Label columns follow mlb.classes_ (sorted alphabetically), not top_genres
for i, genre in enumerate(mlb.classes_):
    print(genre, accuracy_score(y_test[:,i], pred_mat[:,i]))
print('\n')

Accuracy: 

Action 0.816118573414
Action/Adventure 0.816118573414
Adventure 0.861509958314
Comedy 0.72116720704
Crime Fiction 0.839740620658
Drama 0.666512274201
Indie 0.872163038444
Romance Film 0.754979157017
Romantic comedy 0.887447892543
Thriller 0.788791106994




In [34]:
evaluate(test_labels, pred_mat)

Precision: 0.7490, Recall: 0.3028


### Task 4: Change the architecture of your model and compare the result

### Task 5: For each model, do an error evaluation

You now have a bunch of classifiers. For each classifier, pick 2 good classifications and 2 bad classifications. Print the expected and predicted label, and also print the movie synopsis. From these results, can you spot some systematic errors from your models?
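
One way to organize the error evaluation is sketched below. The `pick_examples` and `show` helpers and the example rows are hypothetical. A "good" classification is taken here to mean that the predicted genre set matches the expected set exactly, which is only one possible definition.

```python
def pick_examples(rows, n=2):
    """Split (synopsis, expected, predicted) rows into up to n hits and n misses."""
    hits, misses = [], []
    for synopsis, expected, predicted in rows:
        # A hit means the predicted genre set matches the expected set exactly.
        bucket = hits if set(expected) == set(predicted) else misses
        if len(bucket) < n:
            bucket.append((synopsis, expected, predicted))
    return hits, misses

def show(title, rows):
    print(title)
    for synopsis, expected, predicted in rows:
        print("  expected :", sorted(expected))
        print("  predicted:", sorted(predicted))
        print("  synopsis :", synopsis[:80])

# Hypothetical rows; in the notebook they could come from zipping the test
# synopses with the inverse-transformed labels and predictions.
rows = [
    ("Two friends plan a heist.", ("Comedy", "Crime Fiction"), ("Drama",)),
    ("A detective hunts a killer.", ("Thriller",), ("Thriller",)),
    ("A couple falls in love in Paris.", ("Romance Film",), ("Romance Film",)),
    ("A soldier returns home.", ("Drama",), ("Action",)),
    ("A dog saves the day.", ("Comedy",), ("Comedy",)),
]

hits, misses = pick_examples(rows)
show("Good classifications", hits)
show("Bad classifications", misses)
```

Reading the misses side by side is where systematic errors tend to show up, for example a model that falls back to the most frequent genre whenever it is unsure.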