**Author**: Siddhant Sutar

Import Pandas, NumPy, and the ordered logistic regression module by Fabian Pedregosa (with a few tweaks), obtained from https://github.com/fabianp/minirank/blob/master/minirank/logistic.py, since Scikit-learn doesn't support ordered logistic regression yet.

In [83]:
import pandas as pd
import numpy as np
from logistic import ordinal_logistic_fit, ordinal_logistic_predict
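
For orientation, an ordered (cumulative) logit model turns a linear score and a set of increasing thresholds into class probabilities. The sketch below is a generic illustration of that idea, not the code in `logistic.py`; the function name `ordered_logit_probs` and the values of `w`, `theta`, and `x` are made up for the example.

```python
import numpy as np

def ordered_logit_probs(w, theta, x):
    """Class probabilities under a cumulative logit model.

    P(y <= j | x) = sigmoid(theta[j] - w.x) for each threshold theta[j];
    the last class absorbs the remaining probability mass.
    """
    score = np.dot(w, x)
    cum = 1.0 / (1.0 + np.exp(-(np.asarray(theta) - score)))  # P(y <= j)
    cum = np.append(cum, 1.0)            # P(y <= last class) = 1
    return np.diff(cum, prepend=0.0)     # P(y = j)

# Hypothetical values: 2 features, 3 thresholds -> 4 ordered classes.
w = np.array([0.8, -0.5])
theta = np.array([-1.0, 0.0, 1.5])
probs = ordered_logit_probs(w, theta, np.array([1.0, 1.0]))
predicted_class = int(np.argmax(probs))
```

Taking the most probable class is only one possible decision rule; the imported module may pick the class differently.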

Read training and test data.

In [84]:
train = pd.read_csv("train_data.csv")
test = pd.read_csv("test_data.csv")

Import the NLTK library and TfidfVectorizer to handle the plot data. Vectorize the plot text into binary keyword indicators and store them in a sparse CSR matrix.

In [85]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
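# English stop words from NLTK (requires the NLTK stopwords corpus to be downloaded)
stopset = set(stopwords.words('english'))
# With use_idf=False, norm=None and binary=True, each entry is 1 if the term occurs in the plot, else 0
vectorizer = TfidfVectorizer(use_idf=False, norm=None, strip_accents='ascii', stop_words=list(stopset), binary=True)
train["FullPlot"] = train["FullPlot"].fillna('')
plot_keywords = vectorizer.fit_transform(train.FullPlot)
# get_feature_names() was removed in newer scikit-learn releases; keep a plain list so .index() works later
keywords_list = vectorizer.get_feature_names_out().tolist()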
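
With IDF weighting and normalization turned off and `binary=True`, each matrix entry simply records whether a term occurs in a plot. A rough plain-Python equivalent of that idea is sketched below. It uses made-up plots and a tiny hypothetical stop word set, and its simplified tokenizer is only assumed to approximate the vectorizer's default behavior.

```python
import re

def binary_term_matrix(docs, stop_words):
    """Return (vocabulary, rows) where rows[i][j] is 1 if term j occurs in docs[i]."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [
        {t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in stop_words}
        for doc in docs
    ]
    vocabulary = sorted(set().union(*tokenized))
    rows = [[1 if term in tokens else 0 for term in vocabulary] for tokens in tokenized]
    return vocabulary, rows

# Made-up plots and stop words, for illustration only.
plots = ["A girl exposes government corruption.", "The government falls.", ""]
vocab, matrix = binary_term_matrix(plots, stop_words={"a", "the"})
```

An empty plot, like the ones produced by `fillna('')`, yields an all-zero row.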

Read the feature vectors from CSV files.

In [86]:
ratings = pd.read_csv("ratings.csv")
genres = pd.read_csv("genres.csv")
directors = pd.read_csv("directors.csv", usecols=['Row', 'Column'])
actors = pd.read_csv("actors.csv", usecols=['Row', 'Column'])

Get the list of unique actors and directors present in the sparse COO feature matrix.

In [87]:
actors_list = pd.unique(actors["Column"].values.ravel()).tolist()
directors_list = pd.unique(directors["Column"].values.ravel()).tolist()

Function to convert an actor or director predictor, stored as a sparse COO matrix, into a feature column of 1s and 0s in a readable format (Pandas) for the ordered logistic regression model.

In [88]:
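def get_feature_col(df, name, prefix=''):
    # Start from the training IDs with the indicator set to 0 for every movie
    fc = pd.DataFrame(train["ID"], columns=['ID'], dtype=np.int64)
    fc[name] = 0
    # Rows of the COO listing that mention this actor/director
    temp = df.loc[df['Column'].isin([name])]
    # .ix has been removed from pandas; .loc does the same label-based assignment here
    fc.loc[fc.ID.isin(temp.Row.tolist()), name] = 1
    fc = fc.rename(columns={name: prefix + name})
    return fc[prefix + name]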
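
The same indicator logic can be sketched on toy data. The example below uses made-up movie IDs and a made-up `Row`/`Column` listing, and it assumes (as the function above does) that `Row` holds movie IDs and `Column` holds person names. `indicator_for` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

def indicator_for(ids, rows, cols, name):
    """1 where the movie ID has an entry for `name` in the (row, col) pairs, else 0."""
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    matching_ids = rows[cols == name]          # IDs of movies featuring `name`
    return np.isin(ids, matching_ids).astype(np.int64)

# Made-up data: four movies and a COO-style listing of (movie ID, actor) pairs.
movie_ids = np.array([10, 11, 12, 13])
coo_rows = [10, 10, 12, 13]
coo_cols = ["Actor A", "Actor B", "Actor A", "Actor C"]
actor_a = indicator_for(movie_ids, coo_rows, coo_cols, "Actor A")
```

A name that never appears in the listing produces an all-zero column, which is also what `get_feature_col` returns in that case.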

Ordered logit regression model

In [93]:
g = ['Biography', 'Comedy', 'Crime']
d = ['Martin Scorsese']
c = ['Leonardo DiCaprio', 'Jonah Hill', 'Margot Robbie']
keywords = ["corruption", "government", "girl"]
feature_cols = [genres[g]]
for each in d:
    feature_cols.append(get_feature_col(directors, each, 'dir_'))
for each in c:
    feature_cols.append(get_feature_col(actors, each))
for each in keywords:
    if each in keywords_list:
        feature_cols.append(pd.DataFrame(plot_keywords[:, keywords_list.index(each)].toarray().flatten(), columns=[each]))
X = pd.concat(feature_cols, axis=1)
y = train.OrderedRating
w, theta = ordinal_logistic_fit(X, y)

In [94]:
X

,Biography,Comedy,Crime,dir_Martin Scorsese,Leonardo DiCaprio,Jonah Hill,Margot Robbie,corruption,government,girl
0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0
7,0,1,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0


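Ordered ratings: 0 = [0, 5), 1 = [5, 6), 2 = [6, 7), 3 = [7, 8), 4 = [8, 9), 5 = [9, 10)

Predict the ordered rating for a hypothetical movie in which every selected feature is set to 1.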

In [95]:
pred = ordinal_logistic_predict(w, theta, np.ones(len(X.columns)))
print(pred)

4
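
To read the prediction back as a rating range, the class index can be mapped to the bins listed above. A small sketch follows, where `rating_to_class` and `class_to_range` are hypothetical helper names and the bin edges are taken from the list of ordered ratings.

```python
from bisect import bisect_right

# Lower edges of classes 1..5; class 0 covers everything below 5.
BIN_EDGES = [5, 6, 7, 8, 9]

def rating_to_class(rating):
    """Map a numeric rating to its ordered class (0-5)."""
    return bisect_right(BIN_EDGES, rating)

def class_to_range(cls):
    """Return the (low, high) rating interval for an ordered class; high is exclusive."""
    edges = [0] + BIN_EDGES + [10]
    return edges[cls], edges[cls + 1]

low, high = class_to_range(4)   # the class predicted above
```

Under these bins, a predicted class of 4 corresponds to a rating of at least 8 and below 9.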
