## Assignment 2 - Movie Classification, the sequel
![](https://images-na.ssl-images-amazon.com/images/S/sgp-catalog-images/region_US/paramount-01376-Full-Image_GalleryBackground-en-US-1484000188762._RI_SX940_.jpg)


#### In this assignment, we will learn a little more about word2vec and then use the resulting vectors to make some predictions.

We will be working with a movie synopsis dataset, found here: http://www.cs.cmu.edu/~ark/personas/

The overall goal should sound a little familiar: based on the movie synopses, we will classify movie genre. Some of your favorites should be in this dataset, and hopefully, based on the genre-specific terminology of the movie synopses, we will be able to figure out which movies are which type.

### Task 1: clean your dataset!

For your input data:

1. Find the top 10 movie genres
2. Remove any synopses that don't fit into these genres
3. Take the top 10,000 reviews in terms of "Movie box office revenue"

Congrats, you've got a dataset! Some movies may have multiple classifications. To deal with this, you'll have to look at the Reuters dataset classification code that we used previously and possibly this example: https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py

We want to use categorical cross-entropy as our loss function (or a one vs. all classifier in the case of SVM) because our data will potentially have multiple classes!

In [1]:
import pandas as pd
import numpy as np
import re
import csv
import warnings
warnings.filterwarnings("ignore")

In [2]:
movieData = pd.read_csv('MovieSummaries/movie.metadata.tsv', sep = '\t')
movieData.columns = ['Wikipedia movie ID', 'Freebase movie ID', 'Name', 'Date', 'BoxOffice', 'Runtime', 'Language', 'Country', 'genres']
print(movieData.shape)

#movieData[:42]


(81740, 9)


In [3]:
movieData = movieData.replace('{}', np.nan)
movieData = movieData.replace('[]', np.nan)

#movieData[:42]

In [4]:
movieData.dropna(subset=['Language'], how='any', inplace = True )
movieData.dropna(subset=['genres'], how='any', inplace = True )
movieData.dropna(subset=['BoxOffice'], how='any', inplace = True )
print(movieData.shape)
movieData = movieData.reset_index(drop=True) 

#movieData[:42]

(8066, 9)


In [5]:
def extractTags(string):
    # Pull every double-quoted run of word characters and spaces out of the raw field
    return re.findall(r'"([\w\s]+)"', str(string))

movieData["genres"] = movieData["genres"].apply(extractTags)
movieData["Language"] = movieData["Language"].apply(extractTags)
movieData["Country"] = movieData["Country"].apply(extractTags)

In [6]:
genres_col = movieData["genres"]
print(genres_col.shape)
print(len(genres_col))
print(genres_col.head(2))

(8066,)
8066
0                                    [Musical, Comedy]
1    [Costume drama, War film, Epic, Period piece, ...
Name: genres, dtype: object


In [7]:
my_list = []

for element in genres_col:
    #print(element)
    my_list.extend(element)

In [8]:
genres_dict = {}

for item in my_list:
    if item not in genres_dict:
        genres_dict[item] = 1
    else:
        genres_dict[item] += 1

In [9]:
for key, value in sorted(genres_dict.items(), key=lambda item: (item[1], item[0]), reverse=True):
    print ("%s: %s" % (key, value))

Drama: 4192
Comedy: 3133
Romance Film: 1992
Thriller: 1927
Action: 1710
Crime Fiction: 1255
Adventure: 1146
Indie: 1034
Romantic comedy: 883
Family Film: 834
Horror: 762
Romantic drama: 757
Fantasy: 695
Mystery: 672
Period piece: 664
Science Fiction: 643
Film adaptation: 601
World cinema: 566
Crime Thriller: 565
Musical: 532
Teen: 400
Psychological thriller: 396
War film: 391
Coming of age: 348
Black comedy: 348
Animation: 322
Parody: 318
Cult: 316
Sports: 299
Biography: 286
LGBT: 277
Family Drama: 269
Suspense: 268
Western: 258
Biographical film: 251
Buddy film: 221
Costume drama: 219
Slapstick: 213
Satire: 196
Slasher: 191
Supernatural: 186
Ensemble Film: 182
Japanese Movies: 181
Action Thrillers: 176
Documentary: 169
Political drama: 165
Martial Arts Film: 151
Gangster Film: 137
History: 133
Screwball comedy: 132
Music: 132
Sex comedy: 121
Superhero movie: 116
Road movie: 114
Comedy film: 114
Epic: 111
Domestic Comedy: 111
Fantasy Comedy: 110
Spy: 109
Comedy of manners: 103
Fantasy 

#### Top 10 Genres are: Drama, Comedy, Romance Film, Thriller, Action, Crime Fiction,  Adventure, Indie, Romantic comedy, Family Film
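The flatten-and-count steps in the cells above can also be sketched with `collections.Counter`. This is a minimal illustration on a made-up `genres_col`, not the notebook's data; `genre_counts` and `top_genres` are names chosen for the example.

```python
from collections import Counter

# Toy stand-in for movieData["genres"]: one list of genre tags per movie.
genres_col = [
    ["Musical", "Comedy"],
    ["Drama", "War film"],
    ["Comedy", "Drama"],
    ["Drama"],
]

# Flatten the per-movie lists and count each tag in one pass.
genre_counts = Counter(tag for tags in genres_col for tag in tags)

# most_common(n) returns the n most frequent tags, highest count first.
top_genres = [tag for tag, _ in genre_counts.most_common(2)]
print(genre_counts.most_common(2))
```

On the real column, `genre_counts.most_common(10)` would give the top-10 list directly, which avoids copying the names by hand.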

In [10]:
movieData = movieData.sort_values('BoxOffice', ascending=False)
movieData = movieData.reset_index(drop=True)

In [18]:
#573, 5773, 6513, 7217, 7349, 7994 ###start here
movieData = movieData.drop(movieData.index[[573]])
movieData = movieData.reset_index(drop=True)

In [41]:
movieData = movieData.drop(movieData.index[[5773]])
movieData = movieData.reset_index(drop=True)

In [45]:
movieData = movieData.drop(movieData.index[[6515]])
movieData = movieData.reset_index(drop=True)

In [50]:
movieData = movieData.drop(movieData.index[[7223]])
movieData = movieData.reset_index(drop=True)

In [53]:
movieData = movieData.drop(movieData.index[[7360]])
movieData = movieData.reset_index(drop=True)

In [55]:
movieData = movieData.drop(movieData.index[[8013]])
movieData = movieData.reset_index(drop=True)

In [56]:
movieData.shape

(8054, 9)
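The cells above remove rows one position at a time, and the row shown at position 8013 in the next output has an empty `genres` list. Assuming empty genre lists are the reason for those drops, a boolean mask could do the same thing in a single step that is safe to re-run. A minimal sketch on a toy frame (`movies` and `has_genre` are illustrative names):

```python
import pandas as pd

# Toy frame; the real movieData has the same list-valued "genres" column.
movies = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "BoxOffice": [300.0, 200.0, 100.0, 50.0],
    "genres": [["Drama"], [], ["Comedy", "Indie"], []],
})

# Keep rows whose genre list is non-empty, then renumber the index once.
has_genre = movies["genres"].map(len) > 0
movies = movies[has_genre].reset_index(drop=True)
print(movies.shape)
```

Because the mask is computed from the data rather than from hard-coded positions, the result does not depend on how many times the cell has been executed.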

In [54]:
movieData[8010:8022]

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Name,Date,BoxOffice,Runtime,Language,Country,genres
8010,16395026,/m/03y062v,The Leading Man,1996,18012.0,96.0,[English Language],[United Kingdom],"[Romantic comedy, Romance Film, Comedy]"
8011,21133929,/m/05c1msq,Here and There,2009-03-01,18000.0,81.0,"[English Language, Serbian language]","[United States of America, Serbia, Germany]","[Comedy film, Drama, Indie, World cinema]"
8012,13939322,/m/03cntbz,Cruel World,2005-09-01,17986.0,88.0,[English Language],[United States of America],"[Thriller, Comedy film, Horror]"
8013,16186514,/m/03wc964,Fanny by Gaslight,1944,17285.0,107.0,[English Language],[United Kingdom],[]
8014,20089961,/m/04yhbxd,Holding Trevor,2007,16814.0,88.0,[English Language],[United States of America],"[LGBT, Indie, Gay, Gay Interest, Drama, Gay Th..."
8015,1176717,/m/04dp5j,Gummo,1997-08-29,16799.0,95.0,[English Language],[United States of America],"[Ensemble Film, Indie, Experimental film, Dram..."
8016,5468633,/m/0dn731,Day Zero,2007-11-02,16659.0,93.0,[English Language],[United States of America],"[Drama, Political drama, Indie]"
8017,9054868,/m/027v_s6,Tennessee,2008-04-26,16100.0,99.0,[English Language],[United States of America],"[Road movie, Family Drama, Drama, Indie]"
8018,4032727,/m/0bdj76,Molly,1999,15593.0,89.0,[English Language],[United States of America],"[Romantic comedy, Drama]"
8019,2885102,/m/0892mn,Save the Green Planet!,2003,15516.0,118.0,[Korean Language],[South Korea],"[Thriller, Science Fiction, Horror, World cine..."


In [58]:
top10_genres = [u'Drama', u'Comedy', u'Romance Film', u'Thriller', u'Action', u'Crime Fiction',
                u'Adventure', u'Indie', u'Romantic comedy', u'Family Film']

top10_movie1 = []  # IDs of movies that carry at least one top-10 genre
genres_list = []   # per movie: its genres restricted to the top 10

for i in range(len(movieData)):
    movie_id = movieData["Wikipedia movie ID"][i]
    movie_genres = movieData["genres"][i]
    # Keep only the genres that are in the top-10 list (may be empty)
    top10_movie_genres = [g for g in movie_genres if g in top10_genres]
    if top10_movie_genres:
        top10_movie1.append(movie_id)
    genres_list.append(top10_movie_genres)

In [59]:
movieData['genres_list'] = genres_list

In [62]:
movieData.head(2)
#print(movieData.shape)

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Name,Date,BoxOffice,Runtime,Language,Country,genres,genres_list
0,4273140,/m/0bth54,Avatar,2009-12-10,2782275000.0,178.0,"[English Language, Spanish Language]","[United States of America, United Kingdom]","[Thriller, Science Fiction, Adventure, Compute...","[Thriller, Adventure, Action]"
1,52371,/m/0dr_4,Titanic,1997-11-01,2185372000.0,194.0,"[Italian Language, English Language, French La...",[United States of America],"[Tragedy, Costume drama, Historical fiction, P...","[Drama, Romance Film]"


In [64]:
#movieData.to_csv('SJ_HW2_metadata.csv', index=False, sep='\t')

In [68]:
top10_movie = movieData['Wikipedia movie ID']
top10_movie = list(top10_movie)

#### Finally, we have the top 4621 movies! We only need 1000, but we keep the extras in case some movies don't have a synopsis.

In [69]:
#now work on plot summary

plot = pd.read_csv("MovieSummaries/plot_summaries.txt", sep='\t')

plot.columns = ['Wikipedia movie ID', 'Synopsis']
print(plot.shape)

(42302, 2)


In [70]:
plot = plot[~plot['Wikipedia movie ID'].isin(top10_movie)]

In [71]:
plot.shape

(34984, 2)

In [72]:
data = pd.merge(movieData, plot, left_index=True, right_index=True, how='inner')
data.shape

(6714, 12)

In [73]:
data.head(3)

Unnamed: 0,Wikipedia movie ID_x,Freebase movie ID,Name,Date,BoxOffice,Runtime,Language,Country,genres,genres_list,Wikipedia movie ID_y,Synopsis
1,52371,/m/0dr_4,Titanic,1997-11-01,2185372000.0,194.0,"[Italian Language, English Language, French La...",[United States of America],"[Tragedy, Costume drama, Historical fiction, P...","[Drama, Romance Film]",20663735,Poovalli Induchoodan is sentenced for six yea...
4,25001260,/m/0872p_c,Transformers: Dark of the Moon,2011-06-23,1123747000.0,157.0,[English Language],[United States of America],"[Alien Film, Science Fiction, Action, Adventure]","[Action, Adventure]",5272176,The president is on his way to give a speech. ...
7,1213838,/m/04hwbq,Toy Story 3,2010-06-12,1063172000.0,102.0,"[English Language, Spanish Language]",[United States of America],"[Adventure, Computer Animation, Animation, Fan...","[Adventure, Comedy, Family Film]",2462689,Infuriated at being told to write one final co...
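In the output above, `Wikipedia movie ID_x` and `Wikipedia movie ID_y` differ on every displayed row (Titanic is 52371, but the synopsis attached to it belongs to 20663735). That suggests `left_index=True, right_index=True` paired rows by index position rather than by movie. Assuming the intent is to attach each movie's own synopsis, a join on the ID column might look like this. The frames below are toy stand-ins with made-up IDs and text.

```python
import pandas as pd

# Toy stand-ins for movieData and plot (IDs and text are made up).
movies = pd.DataFrame({
    "Wikipedia movie ID": [10, 20, 30],
    "Name": ["Alpha", "Beta", "Gamma"],
})
plots = pd.DataFrame({
    "Wikipedia movie ID": [30, 10, 99],
    "Synopsis": ["plot of Gamma", "plot of Alpha", "plot of an unlisted movie"],
})

# Join on the shared key column; "inner" keeps only movies that have a synopsis.
merged = pd.merge(movies, plots, on="Wikipedia movie ID", how="inner")
print(merged)
```

With a key-based inner join the separate `isin` filter is not needed. Note also that the `~` in front of `isin` in the earlier cell selects the IDs that are *not* in the list, which is the opposite of keeping the selected movies.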


In [74]:
data = data.sort_values('BoxOffice', ascending=False)
data = data[:1000] ###Finally, our dataset!!!

In [75]:
data.head(5)
#data.to_csv('SJ_HW2_data.csv', index=False, sep='\t')

Unnamed: 0,Wikipedia movie ID_x,Freebase movie ID,Name,Date,BoxOffice,Runtime,Language,Country,genres,genres_list,Wikipedia movie ID_y,Synopsis
1,52371,/m/0dr_4,Titanic,1997-11-01,2185372000.0,194.0,"[Italian Language, English Language, French La...",[United States of America],"[Tragedy, Costume drama, Historical fiction, P...","[Drama, Romance Film]",20663735,Poovalli Induchoodan is sentenced for six yea...
4,25001260,/m/0872p_c,Transformers: Dark of the Moon,2011-06-23,1123747000.0,157.0,[English Language],[United States of America],"[Alien Film, Science Fiction, Action, Adventure]","[Action, Adventure]",5272176,The president is on his way to give a speech. ...
7,1213838,/m/04hwbq,Toy Story 3,2010-06-12,1063172000.0,102.0,"[English Language, Spanish Language]",[United States of America],"[Adventure, Computer Animation, Animation, Fan...","[Adventure, Comedy, Family Film]",2462689,Infuriated at being told to write one final co...
8,24314116,/m/09v8clw,Pirates of the Caribbean: On Stranger Tides,2011-05-07,1043872000.0,136.0,[English Language],[United States of America],"[Swashbuckler films, Adventure, Costume Advent...","[Adventure, Action]",20532852,A line of people drool at the window of the s...
9,50793,/m/0ddt_,Star Wars Episode I: The Phantom Menace,1999-05-19,1027045000.0,136.0,[English Language],[United States of America],"[Science Fiction, Action, Fantasy, Adventure, ...","[Action, Adventure, Family Film]",15401493,Lola attempts to gain her father's trust fund...


### Task 2: Split the data

Make a dataset of 70% train and 30% test. Sweet.

In [77]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.3, random_state=42)

In [78]:
print(train.shape,test.shape)

(700, 12) (300, 12)


### Task 3a: Build a model using ONLY word2vec

Woah what? I don't think that's recommended...

In fact it's a commonly accepted practice. What you will want to do is average the word vectors that will be input for a given synopsis (https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and then input that averaged vector as your feature space into a model. For this example, use a Support Vector Machine classifier. For your first time doing this, train a model in Gensim and use the output vectors.

SVM: http://scikit-learn.org/stable/modules/svm.html
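To make the averaging step concrete, here is a small NumPy sketch. The 3-dimensional `toy_vectors` and the helper `mean_vector` are hypothetical; real word2vec vectors have far more dimensions, and how to treat unknown words is a design choice (here they are simply skipped).

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real word2vec vectors are much longer.
toy_vectors = {
    "ship": np.array([1.0, 0.0, 2.0]),
    "sinks": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "the" has no vector, so only "ship" and "sinks" contribute to the mean.
doc_vec = mean_vector(["the", "ship", "sinks"], toy_vectors)
print(doc_vec)
```

Every synopsis ends up as one fixed-length vector regardless of its word count, which is what lets a classifier such as an SVM consume it.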

In [106]:
x_train = train['Synopsis']
y_train = train['genres_list']
x_test = test['Synopsis']
y_test = test['genres_list']

Synopsis = data['Synopsis']
genres = data['genres_list']
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
print(Synopsis.shape, genres.shape)

(700,) (700,) (300,) (300,)
(1000,) (1000,)


In [309]:
import nltk

def w2v_tokenize_text(text):
    # Split into sentences, then words; skip single-character tokens (mostly punctuation)
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens

In [312]:
train_tokenized = train.apply(lambda train: w2v_tokenize_text(train['Synopsis']), axis=1).values
test_tokenized = test.apply(lambda test: w2v_tokenize_text(test['Synopsis']), axis=1).values

#overall:
tokenized_word = data.apply(lambda data: w2v_tokenize_text(data['Synopsis']), axis=1).values

In [325]:
### PUT GENSIM CODE HERE
import gensim

# Passing the corpus to the constructor already builds the vocabulary and trains the model.
# (gensim 3.x API; in gensim 4 `size` became `vector_size`.)
model = gensim.models.Word2Vec(tokenized_word, size=150, window=10, min_count=2, workers=10)
# Train for a further 10 epochs on the same corpus.
model.train(tokenized_word, total_examples=len(tokenized_word), epochs=10)
# Map each vocabulary word to its learned vector
# (gensim 3.x names; gensim 4 uses `index_to_key` and `vectors`).
w2v = {w: vec for w, vec in zip(model.wv.index2word, model.wv.syn0)}

2018-03-11 18:50:09,720 : INFO : collecting all words and their counts
2018-03-11 18:50:09,722 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-11 18:50:09,777 : INFO : collected 24293 word types from a corpus of 278978 raw words and 1000 sentences
2018-03-11 18:50:09,778 : INFO : Loading a fresh vocabulary
2018-03-11 18:50:09,810 : INFO : min_count=2 retains 13169 unique words (54% of original 24293, drops 11124)
2018-03-11 18:50:09,812 : INFO : min_count=2 leaves 267854 word corpus (96% of original 278978, drops 11124)
2018-03-11 18:50:09,859 : INFO : deleting the raw counts dictionary of 24293 items
2018-03-11 18:50:09,862 : INFO : sample=0.001 downsamples 38 most-common words
2018-03-11 18:50:09,864 : INFO : downsampling leaves estimated 204411 word corpus (76.3% of prior 267854)
2018-03-11 18:50:09,908 : INFO : estimated required memory for 13169 words and 150 dimensions: 22387300 bytes
2018-03-11 18:50:09,911 : INFO : resetting layer weights
2018

2018-03-11 18:50:11,428 : INFO : worker thread finished; awaiting finish of 5 more threads
2018-03-11 18:50:11,430 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-03-11 18:50:11,436 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-03-11 18:50:11,438 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-11 18:50:11,440 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-11 18:50:11,442 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-11 18:50:11,443 : INFO : EPOCH - 2 : training on 278978 raw words (204277 effective words) took 0.2s, 1028445 effective words/s
2018-03-11 18:50:11,593 : INFO : worker thread finished; awaiting finish of 9 more threads
2018-03-11 18:50:11,614 : INFO : worker thread finished; awaiting finish of 8 more threads
2018-03-11 18:50:11,617 : INFO : worker thread finished; awaiting finish of 7 more threads
2018-03-11 18:50:11,620 : INFO : worker threa

2018-03-11 18:50:12,947 : INFO : worker thread finished; awaiting finish of 6 more threads
2018-03-11 18:50:12,949 : INFO : worker thread finished; awaiting finish of 5 more threads
2018-03-11 18:50:12,952 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-03-11 18:50:12,954 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-03-11 18:50:12,956 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-11 18:50:12,967 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-11 18:50:12,968 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-11 18:50:12,970 : INFO : EPOCH - 10 : training on 278978 raw words (204435 effective words) took 0.2s, 1070316 effective words/s
2018-03-11 18:50:12,972 : INFO : training on a 2789780 raw words (2043421 effective words) took 1.9s, 1052726 effective words/s


In [326]:
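from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# The analyzer is the identity because the documents are already token lists.
svc = Pipeline([("count_vectorizer", 
                 CountVectorizer(analyzer=lambda x: x)), 
                ("linear svc", OneVsRestClassifier(SVC(kernel="linear")))])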

In [327]:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(list(word2vec.values())[0])

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

In [328]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedShuffleSplit
from collections import defaultdict

etree_w2v = Pipeline([
    ("word2vec vectorizer", MeanEmbeddingVectorizer(w2v)),
    ("extra trees", ExtraTreesClassifier(n_estimators=10))])

In [329]:
w2v_mean = MeanEmbeddingVectorizer(w2v)
print(w2v_mean)

<__main__.MeanEmbeddingVectorizer object at 0x257fe0e48>


In [333]:
from tabulate import tabulate
%matplotlib inline

all_models = [("svc", svc)]

unsorted_scores = [(name, cross_val_score(model, train_tokenized, y_train_mat, cv=5).mean()) for name, model in all_models]
scores = sorted(unsorted_scores, key=lambda x: -x[1])


print (tabulate(scores, floatfmt=".4f", headers=("model", 'score')))

model      score
-------  -------
svc       0.0171
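Note that the `svc` pipeline scored above feeds `CountVectorizer` token counts to the SVM, so this number reflects bag-of-words features rather than averaged word vectors. Following the pattern of `etree_w2v`, an averaged-vector version could be wired up roughly as below. This is a sketch on made-up data: `ToyMeanEmbedding`, `toy_w2v`, `docs` and `labels` are hypothetical, and the transformer is a simplified stand-in for `MeanEmbeddingVectorizer`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

class ToyMeanEmbedding(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: averages per-token vectors for each document."""
    def __init__(self, word2vec):
        self.word2vec = word2vec

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dim = len(next(iter(self.word2vec.values())))
        return np.array([
            np.mean([self.word2vec[w] for w in doc if w in self.word2vec]
                    or [np.zeros(dim)], axis=0)
            for doc in X
        ])

# Made-up 2-d vectors and tokenized documents with two binary genre labels.
toy_w2v = {"gun": np.array([1.0, 0.0]), "kiss": np.array([0.0, 1.0])}
docs = [["gun"], ["gun", "gun"], ["kiss"], ["kiss", "kiss"]]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # columns: Action, Romance

svc_w2v = Pipeline([
    ("word2vec vectorizer", ToyMeanEmbedding(toy_w2v)),
    ("linear svc", OneVsRestClassifier(SVC(kernel="linear"))),
])
svc_w2v.fit(docs, labels)
pred = svc_w2v.predict([["gun"], ["kiss"]])
print(pred)
```

In the notebook, the same idea would mean putting `MeanEmbeddingVectorizer(w2v)` in place of the `CountVectorizer` step before calling `cross_val_score`.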


### Task 3b: Do the same thing but with pretrained embeddings

Now pull down the Google News word embeddings and do the same thing. Compare the results. Why was one better than the other?

In [568]:
import gzip
import gensim 
import logging
logging.root.handlers = []  
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import nltk
import warnings
warnings.filterwarnings("ignore")

In [606]:
from sklearn.preprocessing import MultiLabelBinarizer
y = data['genres_list']
y_train, y_test = train_test_split(y, test_size=0.3, random_state=42)

y_matrix = MultiLabelBinarizer().fit_transform(y)
print(y_matrix.shape)

y_train_mat, y_test_mat = train_test_split(y_matrix, test_size=0.3, random_state=42)

print(y_train_mat.shape, y_test_mat.shape)

print(y_train.shape, y_test.shape)

(1000, 10)
(700, 10) (300, 10)
(700,) (300,)
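The cell above calls `fit_transform` on a throwaway `MultiLabelBinarizer`, so the mapping from matrix columns back to genre names is not kept. Holding on to the fitted object makes that mapping available, which helps when decoding predictions later. A small sketch with made-up genre lists (`mlb`, `y_toy` and `decoded` are illustrative names):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists in the same shape as data['genres_list'].
genre_lists = [["Thriller", "Action"], ["Drama"], []]

mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(genre_lists)

# classes_ holds the column order (sorted label names).
print(mlb.classes_)
print(y_toy)

# inverse_transform maps indicator rows back to tuples of label names.
decoded = mlb.inverse_transform(y_toy)
print(decoded)
```

A movie with an empty genre list becomes an all-zero row, so such rows carry no positive label for any classifier to learn from.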


In [570]:
X_train = train['Synopsis']
X_test = test['Synopsis']

In [571]:
wv = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
wv.init_sims(replace=True)

2018-03-13 12:30:04,483 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin.gz
2018-03-13 12:31:43,769 : INFO : loaded (3000000, 300) matrix from GoogleNews-vectors-negative300.bin.gz
2018-03-13 12:31:50,577 : INFO : precomputing L2-norms of word weight vectors


In [572]:
def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens

In [573]:
train_tokenized = train.apply(lambda train: w2v_tokenize_text(train['Synopsis']), axis=1).values
test_tokenized = test.apply(lambda test: w2v_tokenize_text(test['Synopsis']), axis=1).values
#train_tokenized = train.apply(lambda r: w2v_tokenize_text(r['Synopsis']), axis=1).values

In [574]:
def word_averaging(wv, words):
    all_words, mean = set(), []
    
    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, review) for review in text_list ])

In [575]:
X_train_word_average = word_averaging_list(wv,train_tokenized)
X_test_word_average = word_averaging_list(wv,test_tokenized)

In [576]:
print(X_train_word_average.shape, y_train_mat.shape)

(700, 300) (700, 10)


In [578]:
#print(X_train_word_average[:1], y_train[:1])
#data[data['summary]]notnull()]

https://github.com/RaRe-Technologies/movie-plots-by-genre/blob/5a2d9157f9bf1bf908794051597b7851333dcfca/ipynb_with_output/Document%20classification%20with%20word%20embeddings%20tutorial%20-%20with%20output.ipynb

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [613]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
#one vs rest classifier python svm
#http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html


model3b = OneVsRestClassifier(SVC(decision_function_shape='ovo', kernel = 'sigmoid', probability=True))
model3b.fit(X_train_word_average, y_train_mat)

OneVsRestClassifier(estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          n_jobs=1)

In [653]:
from sklearn.metrics import accuracy_score
pred1 = model3b.predict(X_test_word_average)

In [655]:
print(accuracy_score(y_test_mat, pred1))

0.023333333333333334
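For a multilabel indicator matrix, `accuracy_score` computes subset accuracy: a movie counts as correct only if all ten labels match exactly, which is one reason the score above looks so low. The sketch below compares it with two per-label metrics on made-up predictions, so the numbers say nothing about this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Made-up ground truth and predictions for 3 movies and 4 genres.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],   # one label missed
                   [0, 1, 0, 0],   # exact match
                   [1, 0, 0, 0]])  # one label missed

subset_acc = accuracy_score(y_true, y_pred)   # exact-match ratio
label_error = hamming_loss(y_true, y_pred)    # fraction of wrong label slots
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(subset_acc, label_error, micro_f1)
```

Reporting a per-label metric next to subset accuracy makes it easier to tell a model that is nearly right on most movies from one that is wrong everywhere.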


### Task 4a: Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras)

Use Tokenizer from Keras: call Tokenizer.fit_on_texts on x_train, then Tokenizer.texts_to_sequences on x_train. The fitted tokenizer (here `t`) then exposes its vocabulary as t.word_index.
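As a rough picture of what those two calls produce, here is a plain-Python sketch of the idea. It is not the Keras implementation: the real Tokenizer also filters punctuation and supports a vocabulary cap, both omitted here. The point is only that words get integer ids by descending frequency, starting at 1, and each text becomes a list of ids.

```python
from collections import Counter

texts = ["the ship sinks", "the ship sails", "the end"]

# Count lowercase whitespace-separated words across all texts.
counts = Counter(word for text in texts for word in text.lower().split())

# Most frequent word gets id 1; id 0 is left free for padding.
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Replace each word by its id, skipping words that were never seen.
sequences = [[word_index[w] for w in text.lower().split() if w in word_index]
             for text in texts]
print(word_index)
print(sequences)
```

Sequences built this way have different lengths, which is why a padding step to a fixed `max_length` follows before the data goes into a network.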

In [170]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.layers import Dropout, Activation

In [337]:
# integer encode the documents
vocab_size = 3000
encoded_docs = [one_hot(d, vocab_size) for d in train['Synopsis']]
#print(encoded_docs)

In [338]:
vocab_size = 3000
test_encoded = [one_hot(d, vocab_size) for d in test['Synopsis']]

In [339]:
# pad documents to a max length of 1500 tokens
max_length = 1500
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
#print(padded_docs)

In [340]:
# pad documents to a max length of 1500 tokens
max_length = 1500
test_padded = pad_sequences(test_encoded, maxlen=max_length, padding='post')
#print(padded_docs)

In [341]:
print('Building model...')
num_classes = 10
model = Sequential()
model.add(Dense(512, input_shape=(max_length,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
print(model.summary())

Building model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_55 (Dense)             (None, 512)               768512    
_________________________________________________________________
activation_64 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_51 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_56 (Dense)             (None, 10)                5130      
_________________________________________________________________
activation_65 (Activation)   (None, 10)                0         
Total params: 773,642
Trainable params: 773,642
Non-trainable params: 0
_________________________________________________________________
None


In [342]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
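Because a movie can carry several genres at once, the output activation matters when it is paired with `binary_crossentropy`. Softmax forces the ten outputs to sum to 1, so at most one of them can exceed 0.5, while sigmoid scores each genre independently. A NumPy sketch with made-up logits (not taken from this model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up scores for 4 genres; the first two are both strongly indicated.
logits = np.array([3.0, 3.0, -2.0, -2.0])

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

# Threshold at 0.5 to turn probabilities into predicted label sets.
labels_softmax = (p_softmax > 0.5).astype(int)
labels_sigmoid = (p_sigmoid > 0.5).astype(int)
print(labels_softmax, labels_sigmoid)
```

As far as I know, Keras 2.x reports element-wise binary accuracy when `'accuracy'` is combined with `binary_crossentropy`, so a prediction that misses both true genres in this toy case would still be credited for the two correct zeros. If that holds here, the accuracies printed for Task 4 are not directly comparable with the exact-match scores from Task 3.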

In [343]:
padded_docs.shape

(700, 1500)

In [344]:
y_train_mat.shape

(700, 10)

In [345]:
batch_size = 32
epochs = 5

model4a = model.fit(padded_docs, y_train_mat, batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1)

Train on 630 samples, validate on 70 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [346]:
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, y_train_mat, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 69.242857


In [347]:
pred4a = model.evaluate(test_padded, y_test_mat,
                       batch_size=batch_size, verbose=1)
print('Test score:', pred4a[0])
print('Test accuracy:', pred4a[1])

Test score: 5.035177866617839
Test accuracy: 0.6860000141461691


In [656]:
pred4a = model.predict(test_padded, batch_size = batch_size)
#print(pred4a[210:220])
#print(y_test_mat[210:220])
#print(x_test[12:13])
#print(y_test[29:30])
#Action, Adventure, Comedy, CrimeFiction, Drama, Family Film, Indie, RomanceFilm, RomanticComedy, Thriller

### Task 4b: Change the architecture of your model and compare the result

In [541]:
print('Building another model...')
num_classes = 10
model = Sequential()
model.add(Dense(1024, input_shape=(max_length,)))
model.add(Activation('relu'))
model.add(Dropout(0.85))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.add(Dropout(0.5))
print(model.summary())

Building another model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_112 (Dense)            (None, 1024)              1537024   
_________________________________________________________________
activation_116 (Activation)  (None, 1024)              0         
_________________________________________________________________
dropout_82 (Dropout)         (None, 1024)              0         
_________________________________________________________________
dense_113 (Dense)            (None, 10)                10250     
_________________________________________________________________
activation_117 (Activation)  (None, 10)                0         
_________________________________________________________________
dropout_83 (Dropout)         (None, 10)                0         
Total params: 1,547,274
Trainable params: 1,547,274
Non-trainable params: 0
________________________________________

In [542]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [543]:
batch_size = 32
epochs = 5

model4b = model.fit(padded_docs, y_train_mat, batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1)

Train on 630 samples, validate on 70 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [544]:
loss, accuracy = model.evaluate(padded_docs, y_train_mat, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 72.328571


In [545]:
pred4b = model.evaluate(test_padded, y_test_mat,
                       batch_size=batch_size, verbose=1)
print('Test score:', pred4b[0])
print('Test accuracy:', pred4b[1])

Test score: 4.524834753672282
Test accuracy: 0.7186666584014892


In [567]:
pred4b = model.predict(test_padded, batch_size = batch_size)


### Task 5: For each model, do an error evaluation

You now have a bunch of classifiers. For each classifier, pick 2 good classifications and 2 bad classifications. Print the expected and predicted label, and also print the movie synopsis. From these results, can you spot some systematic errors from your models?

#### Task 3a: Build a model using ONLY word2vec: Accuracy = 1.7%

I'm not sure what went wrong with this model: it only ever predicts Action, and otherwise the prediction is blank.

In [657]:
from IPython.display import HTML, display
import tabulate
pd.set_option('display.max_colwidth', -1)

table_3a = [["Movie Synopsis", "Actual Genres", "Predicted Genres"],
            [x_test[11:12], "Adventure, Romance Film, Action", "Action"],
            [x_test[27:28], "Crime Fiction, Comedy, Adventure", "Action"],
            [x_test[244:245], "Action, Drama, Thriller", "Action"],
            [x_test[22:23], "Romance Film, Family Film", " "]
    
]

display(HTML(tabulate.tabulate(table_3a, tablefmt='html')))

0,1,2
Movie Synopsis,Actual Genres,Predicted Genres
"98 It is the late 1970s, and smuggler David Swansey specialises in importing goods to war-torn Southern Rhodesia, defying international sanctions imposed on the doomed nation. Swansey is eventually contracted by the Ian Smith administration to arrange an illicit purchase of American-made Iroquois helicopters for counter-insurgency operations against black African nationalists. However, word of his plan soon reaches the latter, who apply strong political pressure to kill the deal in its cradle - the aircraft shipment in question is impounded upon reaching neighbouring South-West Africa. Meanwhile, one of the many indigenous guerillas resisting the white supremacist policies of the Rhodesian regime is Gideon Marunga , veteran combatant and reluctant participant in atrocities directed against unarmed civilians by his fellow insurgents. Marunga discovers that Swansey, with the aid of the Rhodesian Security Forces and South African sympathizers, hopes to lead an armed raid on the airfield where the Iroquois are being temporarily held - with the intention of stealing them across the border into Rhodesia. On the day of the assault, Marunga arrives at the airfield and stalls the attacking troopers, while his accomplices succeed in destroying some of the helicopters. In the firefight which ensues he comes face to face with Swansey, and the two men subsequently share a weary moment of reflection on their stalemate. Both abruptly part ways; the smuggler permits his enemy to escape unarmed into the night. As word of the foiled transaction spreads, Swansey finds himself unable to continue conducting business on the global scale and is restricted to Rhodesia, where he faces conscription into active duty with the armed forces. The film's storyline closes as Marunga and Swansey confront each other on the battlefield again - this time through the sights of their rifles. Name: Synopsis, dtype: object","Adventure, Romance Film, Action",Action
"1193 Lia and Tina are two beautiful girls who meet and realize that they have a lot in common. They are both young, beautiful and pissed off, so they decide to hitchhike their way to Rome to find a commune where they can stay and live the life of free love. . . or so they think. Things don't go as they have planned though, and soon they become entangled with prostitution, the police and an aggressive gang. Name: Synopsis, dtype: object","Crime Fiction, Comedy, Adventure",Action
"375 Mostly the same as the original biblical story, but with notable differences such as, once again, the expanded role of Delilah , the introduction of the garrison commander who is friends with Samson , more focus upon Samson's relationship with his first wife, a different handling of the 30 garments bet, and, perhaps the most crucial alteration is to the climax. In the original story, maintained in the 1949 film and the 1996 TV remake, is that Samson only regains his strength after his hair has grown long again, thus allowing him to tear down the Philistine temple. Here, however, Samson is taken to the Philistine temple just after his hair has been cut short, and he prays to God to restore his immense strength despite his short hair, and God complies, allowing Samson enough strength to tear down the stone pillars, thus destroying the temple. Unlike the 1949 and 1996 adaptations, Delilah survives to mourn Samson alongside his followers. Name: Synopsis, dtype: object","Action, Drama, Thriller",Action
"454 Impecunious bookmaker's clerk Arnold Grierson, seeing a way to easy money, forces his daughter Margaret to marry wealthy but obnoxious songwriter Nevern, ignoring her romance with local newspaper editor Michael Hardwick. Soon after the wedding, Grierson requests the loan of a significant sum of money from Nevern and is furious and humiliated to be flatly turned down. He begins to make elaborate plans to murder Nevern on the assumption that Margaret will then inherit her husband's estate. Meanwhile the desperately unhappy Margaret has rekindled her relationship with Hardwick. Nevern finds them in a café together and causes a public scene. Margaret determines that her only course of action is to divorce Nevern, a prospect which horrifies her father. Nevern is in the process of composing a new song, and lodges a draft manuscript with his publisher. Making sure he has set up a foolproof alibi, Grierson goes to Nevern's house and kills him as he is finalising his new composition. As he leaves through one door, Hardwick, intending to ask Nevern to divorce Margaret, arrives through another. Hardwick finds the body and alerts the police, who in the circumstances do not believe his story and arrest him on suspicion of murder. The interested parties later gather at Nevern's home to hear the reading of the will. Margaret is declared the sole inheritor of all her husband's money and assets, to the delight of her father. He is so happy that he begins to whistle, and gives himself away because it is Nevern's finished composition, which he could only have heard by being in the house on the night of the murder. Name: Synopsis, dtype: object","Romance Film, Family Film",


#### Task 3b: Do the same thing but with pretrained embeddings: Accuracy = 23%

This model correctly predicted most of the Drama and Action labels.
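
A claim like this can be checked per genre. The following is a minimal sketch, not the notebook's own evaluation code: it assumes multi-hot label arrays, and the names `GENRES`, `encode`, and `per_genre_recall` are hypothetical helpers. For illustration it uses only the four actual/predicted pairs listed in the table for this task, so the numbers say nothing about the full test set.

```python
import numpy as np

# Assumed class order, taken from the genre list used in this notebook.
GENRES = ["Action", "Adventure", "Comedy", "Crime Fiction", "Drama",
          "Family Film", "Indie", "Romance Film", "Romantic Comedy", "Thriller"]

def encode(labels):
    """Multi-hot encode a comma-separated genre string (hypothetical helper)."""
    names = [name.strip() for name in labels.split(",")]
    return [int(genre in names) for genre in GENRES]

def per_genre_recall(y_true, y_pred):
    """Return {genre: recall}; recall is NaN for genres absent from y_true."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    true_pos = (y_true & y_pred).sum(axis=0)
    support = y_true.sum(axis=0)
    recall = np.divide(true_pos, support,
                       out=np.full(len(GENRES), np.nan), where=support > 0)
    return dict(zip(GENRES, recall))

# The four example rows from the Task 3b table (actual, predicted).
rows = [("Action, Crime Fiction, Thriller", "Action, Drama, Thriller"),
        ("Romance Film, Drama, Comedy", "Romance Film, Drama"),
        ("Action, Thriller", "Action, Comedy, Drama"),
        ("Action, Drama, Thriller", "Action, Thriller")]
y_true = [encode(actual) for actual, _ in rows]
y_pred = [encode(predicted) for _, predicted in rows]

recall = per_genre_recall(y_true, y_pred)
for genre, value in recall.items():
    print(f"{genre:16s} recall = {value:.2f}")
```

On the full test set, `y_true` would be the encoded test labels and `y_pred` the thresholded model outputs.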

In [658]:
# Genre classes: Action, Adventure, Comedy, Crime Fiction, Drama, Family Film, Indie, Romance Film, Romantic Comedy, Thriller
from IPython.display import HTML, display
import tabulate
# None disables column-width truncation (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)

table_3b = [["Movie Synopsis", "Actual Genres", "Predicted Genres"],
            [x_test[:1], "Action, Crime Fiction, Thriller", "Action, Drama, Thriller"],
            [x_test[203:204], "Romance Film, Drama, Comedy", "Romance Film, Drama"],
            [x_test[258:259], "Action, Thriller", "Action, Comedy, Drama"],
            [x_test[244:245], "Action, Drama, Thriller", "Action, Thriller"]
]

display(HTML(tabulate.tabulate(table_3b, tablefmt='html')))

Movie Synopsis,Actual Genres,Predicted Genres
"635 {{Plot}} The story opens in 1941 Poland, with Dymitr Mirga, a prominent Gypsy violin player, entertaining a group of Nazis in a restaurant. At first the Nazis enjoy the entertainment and assure the musicians that the ongoing removal of the region's Jews is being conducted for the sake of the Romani. However, Dymitr Mirga soon realizes the truth, and asks the head of the Gypsy community to lead its evacuation into Hungary, which that time has no Nazis. The leader is reluctant to comply, and the community's council eventually forces him to resign, giving his position instead to Dymitr Mirga. The son of the deposed leader has been betrothed to a beautiful Romani named Zoya Natkin; but she now chooses to marry Dymitr Mirga's son, Roman Mirga. On their ensuing journey to Hungary, some of the Gypsies desert and are massacred by the Nazis. Others voluntarily split off, in hopes that in smaller numbers they will appear to be merchants rather than Gypsies. Dymitr Mirga's small company eventually sells their jewels to buy horses from another Romani community - a great sacrifice, but necessary to enable them to move quickly. Many are nevertheless killed by the Nazis. The sympathetic population gives them burials and provides a chance for their comrades to meet and mourn their loss. In time, the resolute Dymitr Mirga reaches Hungary with his much diminished group of followers, including his wife, his son and daughter-in-law Roman and Zoya, Zoya's family and Roman's ""rival,"" the son of the former leader, who has been killed by Nazis. All Dymitr Mirga's efforts go for nought, however, when the Nazis finally invade Hungary in 1944. A Nazi column takes the Romani in cattle trucks to concentration camps, where the infamous Col. Kruger conducts horrifying experiments on prisoners. Before their arrival, Dymitr Mirga's daughter escapes out through the window of one of the cattle trucks. At the camp. 
Dymitr Mirga is forced to play for the Nazis, whilst his son Roman receives minor privileges because of his skill as a translator. However, when Roman's wife Zoya dies, the young man begins to consider his father's urging that he escape. Roman approaches his friend and former rival, and recognizing that their families are marked for death, the two agree to make an attempt. The attempt succeeds, and they manage to reconnect with Roman's younger sister who escaped from the cattle truck. The film ends with the war over. As three Romani carriages head off into a sunset, carrying—we assume—Roman, his friend and his younger sister, the narrator concludes that the ""Gypsy nation has yet to receive any compensation."" Name: Synopsis, dtype: object","Action, Crime Fiction, Thriller","Action, Drama, Thriller"
"679 Dholakpur is suddenly attacked by two fire spitting dragon monsters. As they spew fire and create havoc, King Inderaverma places the responsibility of saving his kingdm on the mighty Chhota Bheem`s shoulders. Meanwhile, Bheem and his friends save a mouse`s life, who happens to be a mushik, Lord Ganesh`s companion mouse. But due to some unfortunate events, Mushik is taken away by the dragons. Lord Ganesh comes down on earth to help his companion. He and Bheem pair-up against the dragons to save humanity.http://woobooks.in/chhota-bheem-and-ganesh.html Name: Synopsis, dtype: object","Romance Film, Drama, Comedy","Romance Film, Drama"
"1116 The wives of several top doctors feel neglected by their husbands, so they turn to drink, drugs and sex for solace. Name: Synopsis, dtype: object","Action, Thriller","Action, Comedy, Drama"
"375 Mostly the same as the original biblical story, but with notable differences such as, once again, the expanded role of Delilah , the introduction of the garrison commander who is friends with Samson , more focus upon Samson's relationship with his first wife, a different handling of the 30 garments bet, and, perhaps the most crucial alteration is to the climax. In the original story, maintained in the 1949 film and the 1996 TV remake, is that Samson only regains his strength after his hair has grown long again, thus allowing him to tear down the Philistine temple. Here, however, Samson is taken to the Philistine temple just after his hair has been cut short, and he prays to God to restore his immense strength despite his short hair, and God complies, allowing Samson enough strength to tear down the stone pillars, thus destroying the temple. Unlike the 1949 and 1996 adaptations, Delilah survives to mourn Samson alongside his followers. Name: Synopsis, dtype: object","Action, Drama, Thriller","Action, Thriller"


#### Model 4a: Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras): Accuracy = 69%

This model tends to predict Romance Film and Action; it rarely predicts the other genres.
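
One way to make this tendency visible is to count how often each genre appears among the predictions versus the actual labels. The sketch below does that for the four example rows of this model only; `genre_counts` is a hypothetical helper, and the label strings are copied from the table, not computed from the model.

```python
from collections import Counter

def genre_counts(label_strings):
    """Count genre occurrences across comma-separated label strings."""
    counts = Counter()
    for labels in label_strings:
        counts.update(name.strip() for name in labels.split(","))
    return counts

# Actual and predicted labels for the four Model 4a example rows.
actual = ["Adventure, Romance Film, Action",
          "Romance Film, Family Film",
          "Crime Fiction, Comedy, Adventure",
          "Action, Drama, Thriller"]
predicted = ["Romance Film", "Romance Film", "Action", "Action, Drama"]

actual_counts = genre_counts(actual)
predicted_counts = genre_counts(predicted)

# A genre that is often actual but never predicted points to a skewed model.
for genre in sorted(set(actual_counts) | set(predicted_counts)):
    print(f"{genre:14s} actual={actual_counts[genre]} "
          f"predicted={predicted_counts[genre]}")
```

In these four rows Adventure appears twice among the actual labels but never among the predictions, which matches the observation above.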

In [651]:
from IPython.display import HTML, display
import tabulate
# None disables column-width truncation (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)

table_4a = [["Movie Synopsis", "Actual Genres", "Predicted Genres"],
            [x_test[11:12], "Adventure, Romance Film, Action", "Romance Film"],
            [x_test[22:23], "Romance Film, Family Film", "Romance Film"],
            [x_test[27:28], "Crime Fiction, Comedy, Adventure", "Action"],
            [x_test[244:245], "Action, Drama, Thriller", "Action, Drama"]
]

display(HTML(tabulate.tabulate(table_4a, tablefmt='html')))


Movie Synopsis,Actual Genres,Predicted Genres
"98 It is the late 1970s, and smuggler David Swansey specialises in importing goods to war-torn Southern Rhodesia, defying international sanctions imposed on the doomed nation. Swansey is eventually contracted by the Ian Smith administration to arrange an illicit purchase of American-made Iroquois helicopters for counter-insurgency operations against black African nationalists. However, word of his plan soon reaches the latter, who apply strong political pressure to kill the deal in its cradle - the aircraft shipment in question is impounded upon reaching neighbouring South-West Africa. Meanwhile, one of the many indigenous guerillas resisting the white supremacist policies of the Rhodesian regime is Gideon Marunga , veteran combatant and reluctant participant in atrocities directed against unarmed civilians by his fellow insurgents. Marunga discovers that Swansey, with the aid of the Rhodesian Security Forces and South African sympathizers, hopes to lead an armed raid on the airfield where the Iroquois are being temporarily held - with the intention of stealing them across the border into Rhodesia. On the day of the assault, Marunga arrives at the airfield and stalls the attacking troopers, while his accomplices succeed in destroying some of the helicopters. In the firefight which ensues he comes face to face with Swansey, and the two men subsequently share a weary moment of reflection on their stalemate. Both abruptly part ways; the smuggler permits his enemy to escape unarmed into the night. As word of the foiled transaction spreads, Swansey finds himself unable to continue conducting business on the global scale and is restricted to Rhodesia, where he faces conscription into active duty with the armed forces. The film's storyline closes as Marunga and Swansey confront each other on the battlefield again - this time through the sights of their rifles. Name: Synopsis, dtype: object","Adventure, Romance Film, Action",Romance Film
"454 Impecunious bookmaker's clerk Arnold Grierson, seeing a way to easy money, forces his daughter Margaret to marry wealthy but obnoxious songwriter Nevern, ignoring her romance with local newspaper editor Michael Hardwick. Soon after the wedding, Grierson requests the loan of a significant sum of money from Nevern and is furious and humiliated to be flatly turned down. He begins to make elaborate plans to murder Nevern on the assumption that Margaret will then inherit her husband's estate. Meanwhile the desperately unhappy Margaret has rekindled her relationship with Hardwick. Nevern finds them in a café together and causes a public scene. Margaret determines that her only course of action is to divorce Nevern, a prospect which horrifies her father. Nevern is in the process of composing a new song, and lodges a draft manuscript with his publisher. Making sure he has set up a foolproof alibi, Grierson goes to Nevern's house and kills him as he is finalising his new composition. As he leaves through one door, Hardwick, intending to ask Nevern to divorce Margaret, arrives through another. Hardwick finds the body and alerts the police, who in the circumstances do not believe his story and arrest him on suspicion of murder. The interested parties later gather at Nevern's home to hear the reading of the will. Margaret is declared the sole inheritor of all her husband's money and assets, to the delight of her father. He is so happy that he begins to whistle, and gives himself away because it is Nevern's finished composition, which he could only have heard by being in the house on the night of the murder. Name: Synopsis, dtype: object","Romance Film, Family Film",Romance Film
"1193 Lia and Tina are two beautiful girls who meet and realize that they have a lot in common. They are both young, beautiful and pissed off, so they decide to hitchhike their way to Rome to find a commune where they can stay and live the life of free love. . . or so they think. Things don't go as they have planned though, and soon they become entangled with prostitution, the police and an aggressive gang. Name: Synopsis, dtype: object","Crime Fiction, Comedy, Adventure",Action
"375 Mostly the same as the original biblical story, but with notable differences such as, once again, the expanded role of Delilah , the introduction of the garrison commander who is friends with Samson , more focus upon Samson's relationship with his first wife, a different handling of the 30 garments bet, and, perhaps the most crucial alteration is to the climax. In the original story, maintained in the 1949 film and the 1996 TV remake, is that Samson only regains his strength after his hair has grown long again, thus allowing him to tear down the Philistine temple. Here, however, Samson is taken to the Philistine temple just after his hair has been cut short, and he prays to God to restore his immense strength despite his short hair, and God complies, allowing Samson enough strength to tear down the stone pillars, thus destroying the temple. Unlike the 1949 and 1996 adaptations, Delilah survives to mourn Samson alongside his followers. Name: Synopsis, dtype: object","Action, Drama, Thriller","Action, Drama"


#### Model 4b: Change the architecture of your model and compare the result: Accuracy = 72%

This model tends to predict Drama; it rarely predicts the other genres.
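
A model that mostly predicts Drama can still score well under some accuracy definitions. The sketch below compares two of them on the four example rows shown for this model. It is an assumption that the reported figure may be an element-wise (per-label) accuracy; this section does not say which metric was used. `GENRES` and `encode` are hypothetical helpers.

```python
import numpy as np

# Assumed class order, taken from the genre list used in this notebook.
GENRES = ["Action", "Adventure", "Comedy", "Crime Fiction", "Drama",
          "Family Film", "Indie", "Romance Film", "Romantic Comedy", "Thriller"]

def encode(labels):
    """Multi-hot encode a comma-separated genre string (hypothetical helper)."""
    names = [name.strip() for name in labels.split(",")]
    return [int(genre in names) for genre in GENRES]

# Actual and predicted labels for the four Model 4b example rows.
y_true = np.array([encode(s) for s in
                   ["Drama", "Drama", "Family Film", "Drama, Indie"]])
y_pred = np.array([encode(s) for s in
                   ["Drama", "Drama", "Drama", "Drama"]])

# Per-label accuracy: fraction of individual genre slots that agree.
per_label_accuracy = (y_true == y_pred).mean()

# Exact-match accuracy: fraction of movies whose full genre set is correct.
exact_match_accuracy = (y_true == y_pred).all(axis=1).mean()

print(f"per-label accuracy:   {per_label_accuracy:.3f}")
print(f"exact-match accuracy: {exact_match_accuracy:.3f}")
```

On these four rows the per-label figure is 0.925 while the exact-match figure is 0.5, because most genre slots are zero in both arrays. This is why a per-genre breakdown is a useful complement to a single accuracy number.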

In [659]:
from IPython.display import HTML, display
import tabulate
# None disables column-width truncation (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)

table_4b = [["Movie Synopsis", "Actual Genres", "Predicted Genres"], 
            [x_test[109:110], "Drama", "Drama"],
            [x_test[209:210], "Drama", "Drama"],
            [x_test[132:133], "Family Film", "Drama"],
            [x_test[223:224], "Drama, Indie", "Drama"]
]

display(HTML(tabulate.tabulate(table_4b, tablefmt='html')))

Movie Synopsis,Actual Genres,Predicted Genres
"1111 {{Plot}} Aircraft factory worker Barry Kane is wrongly accused of starting a fire at a Glendale, California airplane plant during World War II, an act of fifth columnist sabotage that killed his friend Mason. Kane believes that the real culprit is a man named Fry who had handed him a fire extinguisher filled with gasoline, at the plant when the fire broke out, causing Mason's death. When the investigators find no one named “Fry” on the list of plant workers, they assume Kane is the real saboteur. They visit the home of Mason's mother, to ask if she knows where Kane is, but he has gone to get her some brandy, in an attempt to ease her suffering from the loss of her son. They come back as he returns, but she tells him to leave, breaking down in tears. Kane and Mason had seen Fry's name on an envelope the saboteur had dropped before the fire, so Kane heads to the address, a ranch in the High Desert, catching a ride from a garrulous truck driver. The ranch owner, Charles Tobin , appears to be a well-respected citizen, playing in the pool with his granddaughter, although it is later revealed that he is secretly in league with the saboteurs. The granddaughter hands some mail to Kane, when Tobin goes indoors to call the sheriff to arrest Kane. He returns to gloat, seeing Kane returning the letters, but Kane escapes on horseback, although he doesn't make it very far. In handcuffs, on the way to town, Kane manages to escape from the police, at a bridge blocked by the same truckdriver's vehicle. Kane escapes by jumping off the bridge, and manages to tumble one of the searching sheriff's officers into the river. The helpful truck driver misdirects the searchers, then watches as Kane climbs out of the river below on the other side of the bridge. Kane takes refuge with a kind blind man whose visiting niece is a billboard model, Patricia ""Pat"" Martin . 
Although her uncle asks her to take Kane to the local blacksmith shop to have his handcuffs removed, she instead attempts to take him to the police, believing it is the right thing to do. Despite her attempt to control Kane by wrapping his handcuffed arms around the steering wheel, Kane manages to turn the tables and kidnaps Martin, protesting his innocence to her. When she stops the car, and gets out, threatening to stop the first car that comes by, he uses the fan-belt pulley of her car's generator to cut off his handcuffs, causing the car to overheat shortly after. They arrive in the abandoned Soda City and stumble into an abandoned mine building, which turns out to be a staging area for the saboteurs' plan to blow up Boulder Dam. Kane is discovered by the saboteurs, but he manages to conceal Martin, and he convinces them the newspaper and radio accounts are true, that he is, in fact, a saboteur in league with them. After finding their plans to destroy the dam foiled, although the storyline does not explain why, Kane convinces the saboteurs to take him with them to New York City. He learns of their plans to sabotage the launching of a new U.S. Navy ship {{USS}} at the Brooklyn shipyard. Kane's performance has fooled Martin as well; she flees and contacts the authorities, hoping to get to New York in time to foil their plans for the next bit of sabotage. The saboteurs arrive in New York City, only to find the phone at their office disconnected, a sign the police are on to them. They drive to a Cut Rate Drugs drugstore, the site of Hitchcock's cameo appearance, where they walk through to a door, into a back room, then into a kitchen and out, into a ballroom, into the mansion of a New York dowager. When they walk into the library, to meet the dowager and other conspirators, Kane finds the captured Martin, who had gone to the police but was betrayed by a corrupt sheriff, part of the conspiracy. 
As Kane attempts to signal her that she should escape, Tobin arrives, immediately recognizing Kane and denouncing him as a foe of the conspiracy. He sneers at Kane's patriotism, causing Kane to question why someone who has benefited most from living in a free country would work to bring it down. Tobin ridicules Kane's simple-minded belief in good and evil, and claims he is in it for the ""power"". The saboteurs lock Kane in the cellar and Martin in an office at Rockefeller Center. Martin drops a note from her window, alerting cabbies on the street to ""watch for the flickering lights above"". They notify the FBI who rescue her. Meanwhile, desperate to escape, Kane triggers a fire alarm at the mansion and escapes in the pandemonium. Across the street, watching all the servants fleeing the mansion, Kane asks a man on the street if he knows whose place it is. The man answers that it is the ""Sutton mansion"", home of a well-known philanthropic older woman. Kane races to the shipyard, abandoning his taxi when it becomes stalled in traffic, because time is running out. At the gate to the Navy Yard, he is stopped by the guard, who turns him over to a Sergeant-of-the-Guard. He then eludes the MP Sergeant taking him to his superior. Desperate to warn someone of the impending sabotage, Kane runs into the Yard, then stumbles onto Fry, at the controls inside a fake newsreel truck. They struggle long enough for Kane to prevent Fry from pushing the bomb's detonator. The ship is safely out of the dock before Fry can detonate the bomb. Coming up with a pistol, Fry holds Kane prisoner, and has his accomplice drive them to Rockefeller Center. When they arrive, they find the police and FBI waiting to arrest them. Fry's flight from the officers takes him into a movie theatre, where he shoots a spectator to cause confusion to allow him to escape in the crowd. As he exits, Kane and Martin are exiting the building, Kane in the custody of an FBI agent. 
Seeing Fry getting into a taxi, Kane tells her to follow the spy wherever he goes. In a taxi herself, she follows Fry to Battery Park, as he smugly notes a capsized ship in the river while traveling by. At the Battery, she sees him get on a boat to Liberty Island. Martin follows him onto the boat, attracting his attention, then sees him walk into the pedestal. She calls the FBI office, then goes into the Statue herself, climbing to the top of the Statue of Liberty, where she strikes up a conversation with Fry, to stall the spy until Kane and the FBI arrive. An aggressive agent in the FBI office insists on taking Kane with him to the island, where Kane escapes his escort, racing into the pedestal. Martin calls down to Kane that Fry is getting away, so Kane, brought along to identify the spy, follows Fry up the narrow tunnel, onto the torch viewing platform. When Kane emerges from the tunnel, he confronts Fry, who backs up against the railing and loses his balance. Fry falls over the torch's railing, but manages to grab hold of the statue's hand. Kane climbs down in an attempt to rescue Fry. The police and FBI agent finally arrive at the torch, looking over the railing. When Fry's grip slips, Kane quickly grabs the sleeve of Fry's jacket. The camera focuses on the stitching of Fry's jacket sleeve as it parts, opening a tear and eventually giving way, as Fry tumbles to his death, with a loud, dramatic cry, leaving Kane holding the empty sleeve. Kane climbs carefully back up the thumb to the torch railing to embrace the waiting Martin. Name: Synopsis, dtype: object",Drama,Drama
"994 The story tells of a North Korean father and husband who decides to illegally cross into China to buy medicine for his pregnant wife who is suffering from tuberculosis. However, once he crosses into China, he realises that it's not as easy as he thought. He finds himself working as an illegal immigrant under the constant threat of being captured by the Chinese authorities and deported back to North Korea. He eventually finds his way to South Korea by entering the embassy in China. Meanwhile, his wife has already died, leaving their son homeless and wandering around trying to find a way back to his father. Scenes switch between those of the father who is outside North Korea trying to find medicine, and those of the son, who ends up homeless and tries to defect also. Name: Synopsis, dtype: object",Drama,Drama
"318 Dil Ne Jise Apna Kahaa begins with Rishabh and Pari who are deeply in love. He is a wealthy young man, working in an advertising agency while she is a hardworking, dedicated doctor. They marry and soon Pari is pregnant. Pari has a dream to create a hospital for children. Tragically, she is involved in an accident and dies in hospital. Pari's last wish was to donate her heart to her patient Dhani . Rishabh is devastated and opposes the plan to donate the heart; he goes ahead with Pari's last request: the creation of a children's hospital. Dhani is cured, much to the joy of her family and her grandmother ([[Helen . Rishabh has gone into depression but soon comes across Pari's project to build a hospital for children. He begins to develop the hospital. Soon enough Rishabh and Dhani come across each other, and she feels an instant attraction to him. Rishabh ignores her advances as he is still very much in love with Pari. Rishabh does not know that Pari's heart was given to Dhani but soon he realizes that. When he does, Dhani faints. While she is in the hospital, with doctors struggling to restart her broken heart, he falls in love with her. He tells her that he loves her, and if she loves him, she will pull through. She does, and the two get together. Name: Synopsis, dtype: object",Family Film,Drama
"65 Ralph Bellamy is ""The Healer"" in this 1930s morally uplifting pot-boiler. Bellamy is a doctor that has come home to a warm springs to try to heal children from the unnamed crippling disease . He runs a destitute camp for these children, assisted by Evelyn who looks upon the Doc as a great man. Mickey Rooney is Jimmy, a paraplegic kid whom the Doc promises to cure. This little triangle is interrupted by rich girl Joan who cons the good Doc into building a sanitorium for the wealthy with her father's money. Doc is momentarily swayed, but comes to his senses just as a forest fire threatens his original cabins around the warm spring. His treatment of Jimmy pays off as Jimmy rides a bicycle to save the day. Doc realizes that his true love is Evelyn, not the self-interested Joan. Name: Synopsis, dtype: object","Drama, Indie",Drama
