**Author**: Siddhant Sutar

Import libraries.

In [62]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
import numpy as np

Import data and remove movies with no plot.

In [63]:
df = pd.read_csv('plot_data.csv')
df = df[pd.notnull(df['FullPlot'])]

In [64]:
df.head()

Unnamed: 0,ID,Liked,FullPlot
0,9,0,"Geraldine (Jerry) Holbrook, a girl of Eastern ..."
1,679,0,L. Frank Baum would appear in a white suit and...
2,966,0,"Theseus, the Duke of Athens, is engaged to be ..."
5,1409,0,This is a completely bogus title; no film bear...
6,1482,0,A father's drinking leads him to neglect his f...


Initialize TfidfVectorizer.

In [65]:
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stopset)
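As a rough illustration of what these options do to a plot before any weighting happens, the sketch below lowercases the text, strips accents to ASCII, tokenizes, and drops stop words. It is not the library's code: `STOPWORDS` is a hypothetical, tiny stand-in for the NLTK list, `preprocess` is a made-up helper name, and the regular expression assumes the vectorizer's default `token_pattern` (tokens of two or more word characters).

```python
import re
import unicodedata

# Hypothetical, tiny stand-in for stopwords.words('english').
STOPWORDS = {"the", "a", "is", "of", "to"}

# Assumed to match TfidfVectorizer's default token_pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip accents to ASCII, tokenize, and drop stop words."""
    lowered = text.lower()
    # Decompose accented characters, then discard the non-ASCII marks.
    ascii_text = (
        unicodedata.normalize("NFKD", lowered)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    tokens = TOKEN_RE.findall(ascii_text)
    return [token for token in tokens if token not in stopwords]


tokens = preprocess(u"The Café is a scene of Murder")
```

Here `tokens` ends up as `['cafe', 'scene', 'murder']`: single-character tokens fall out through the pattern, the rest through the stop word set.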

Dependent variable (y): whether the movie plot is treated as positive or negative. For our purposes, movies rated above 7.5 are labeled positive (1), and all others are labeled negative (0).

In [66]:
y = df.Liked
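The `Liked` column arrives precomputed, and the columns shown in `df.head()` do not include the rating itself. As a minimal sketch of how such a label could be derived from the 7.5 threshold, assuming a hypothetical list of ratings and a made-up helper `label_from_rating`:

```python
# Hypothetical ratings; plot_data.csv as shown only carries the Liked column.
ratings = [8.1, 7.5, 6.2, 9.0]


def label_from_rating(rating, threshold=7.5):
    """Return 1 for ratings strictly above the threshold, otherwise 0."""
    return 1 if rating > threshold else 0


liked = [label_from_rating(r) for r in ratings]
```

A rating of exactly 7.5 falls into the negative class here, matching the "more than 7.5" wording.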

Independent variable (X): TF-IDF vectors of the movie plots.

In [67]:
X = vectorizer.fit_transform(df.FullPlot)

In [68]:
print(y.shape)
print(X.shape)

(51132L,)
(51132, 70553)


In [69]:
idf = vectorizer.idf_

In [70]:
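# Map each vocabulary term to its IDF weight
# (use get_feature_names_out() on scikit-learn 1.0 and later).
my_dict = dict(zip(vectorizer.get_feature_names(), idf))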
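To make the `idf_` values less opaque: with its default `smooth_idf=True`, `TfidfVectorizer` weights a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. A small pure-Python sketch on a hypothetical, already tokenized three-document corpus (the names `docs`, `smoothed_idf`, and `idf_toy` are illustrative only):

```python
import math

# Hypothetical toy corpus, not the movie plots.
docs = [
    ["killer", "hunts", "reporter"],
    ["reporter", "investigates", "murder"],
    ["horse", "ranch"],
]


def smoothed_idf(term, documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0


# One weight per distinct term, analogous to my_dict above.
idf_toy = {term: smoothed_idf(term, docs) for doc in docs for term in doc}
```

Rarer terms receive larger weights: `horse` (one document) scores higher than `reporter` (two documents).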

Initialize and fit the multinomial Naive Bayes model.

In [71]:
clf = naive_bayes.MultinomialNB()
clf.fit(X, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Calculate the positive probabilities for the plot keywords.

In [72]:
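prob_dict = {}
for key in sorted(my_dict, key=my_dict.get, reverse=True):
    # Score each vocabulary term as a one-word document.
    my_vector = vectorizer.transform(np.array([key]))
    # predict_proba columns follow clf.classes_, so [0][0] is the probability
    # of the first class (label 0 if Liked is coded 0/1); [0][1] gives label 1.
    prob_dict[key] = clf.predict_proba(my_vector)[0][0]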
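For intuition about what scoring a single keyword computes, here is a pure-Python sketch of the multinomial Naive Bayes posterior for a one-term document. It assumes the term is represented by a weight of 1 (which is what TF-IDF with the default L2 normalization gives a single in-vocabulary term), class priors equal to class frequencies (`fit_prior=True`), and additive smoothing with `alpha=1.0`. All counts below are hypothetical, and `one_term_posterior` is a made-up helper, not part of scikit-learn.

```python
import math

# Hypothetical summed feature weights per class (label -> term -> total).
feature_totals = {
    0: {"killer": 6.0, "horse": 3.0, "love": 1.0},
    1: {"killer": 1.0, "horse": 1.0, "love": 2.0},
}
# Hypothetical number of training documents per class.
class_doc_counts = {0: 8, 1: 2}
ALPHA = 1.0  # additive smoothing, as in MultinomialNB(alpha=1.0)


def one_term_posterior(term):
    """Return {label: P(label | one-term document)} under multinomial NB."""
    vocab = {t for totals in feature_totals.values() for t in totals}
    n_docs = float(sum(class_doc_counts.values()))
    log_joint = {}
    for label, totals in feature_totals.items():
        prior = class_doc_counts[label] / n_docs
        # Smoothed likelihood of the term within this class.
        theta = (totals.get(term, 0.0) + ALPHA) / (
            sum(totals.values()) + ALPHA * len(vocab)
        )
        log_joint[label] = math.log(prior) + math.log(theta)
    # Normalize in a numerically stable way.
    max_log = max(log_joint.values())
    unnormalized = {k: math.exp(v - max_log) for k, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


posterior = one_term_posterior("killer")
```

Because the prior is the same for every keyword, the ordering of keywords is driven by the per-class term likelihoods, while the absolute probabilities are pulled toward the majority class.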

List the top 100 favorable plot keywords.

In [73]:
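# Slice the sorted keys instead of counting manually; the formatted print
# produces the same output under Python 2 and also runs under Python 3.
for key in sorted(prob_dict, key=prob_dict.get, reverse=True)[:100]:
    print('%s %s' % (key, prob_dict[key]))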

gene 0.987034323615
scientist 0.98699651539
vampires 0.986402333281
framed 0.985173555853
serial 0.985034548799
killer 0.984629942026
psychic 0.983595787067
hunted 0.983298003708
bloody 0.983132511294
hoppy 0.983007750574
murders 0.982915914691
kills 0.98285965299
gang 0.982824379984
reporter 0.982769916105
murdered 0.982684068084
killing 0.982662446781
victims 0.982452413522
investigates 0.98242682129
cia 0.982386584924
horse 0.982354711421
bikers 0.98223657818
cops 0.982152930139
undercover 0.982133440874
crooked 0.982106631389
kidnapped 0.982106026629
cattle 0.982014605208
aliens 0.981771160401
agent 0.9815530767
gangster 0.981379160208
outlaw 0.981377474285
deadly 0.981315690049
zombies 0.98122655911
rapist 0.98097135025
publicity 0.980940815448
ranch 0.98086227357
horror 0.98056905273
raped 0.980568754823
creature 0.980502288108
resort 0.980298488081
supposedly 0.980285797345
escaped 0.980203057276
gangsters 0.980198796529
nevada 0.980091285618
indians 0.980078716091
murder 0.9800