**Author**: Siddhant Sutar

Import pandas, NumPy, and the linear regression model from scikit-learn.

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

Read training and test data.

In [3]:
train = pd.read_csv("train_data.csv")
test = pd.read_csv("test_data.csv")

Read the feature data from CSV files. For directors and actors, only the `Row` and `Column` columns are loaded.

In [4]:
ratings = pd.read_csv("ratings.csv")
genres = pd.read_csv("genres.csv")
directors = pd.read_csv("directors.csv", usecols=['Row', 'Column'])
actors = pd.read_csv("actors.csv", usecols=['Row', 'Column'])

Get the lists of unique actors and directors present in the sparse COO feature matrices.

In [5]:
actors_list = actors["Column"].unique().tolist()
directors_list = directors["Column"].unique().tolist()

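The actor and director features are stored as sparse COO matrices, that is, as (`Row`, `Column`) pairs. The function below converts one actor or director into a dense pandas column of 1s and 0s, indexed like the training data, so that it can be used as a predictor in the linear regression model.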

In [6]:
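def get_feature_col(df, name, prefix=''):
    # Start from the training IDs (global `train`) with the indicator set to 0.
    fc = pd.DataFrame(train["ID"], columns=['ID'], dtype=np.int64)
    fc[name] = 0
    # Rows of the COO table that mention this actor/director.
    temp = df.loc[df['Column'].isin([name])]
    # .ix has been removed from pandas; .loc performs the same assignment.
    fc.loc[fc.ID.isin(temp.Row.tolist()), name] = 1
    fc = fc.rename(columns = {name : prefix + name})
    return fc[prefix + name]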
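
As a minimal, self-contained sketch of the same idea, the snippet below builds an indicator column from (`Row`, `Column`) pairs on toy data. The `movies` and `credits` tables and the `indicator_column` helper are hypothetical stand-ins for illustration, not part of the notebook's data, and the `act_` prefix is likewise made up.

```python
import pandas as pd

# Hypothetical toy data: three movies and a COO-style table of actor credits.
movies = pd.DataFrame({"ID": [1, 2, 3]})
credits = pd.DataFrame({
    "Row": [1, 1, 3],
    "Column": ["Anne Hathaway", "Matthew McConaughey", "Anne Hathaway"],
})

def indicator_column(ids, coo, name, prefix=""):
    """Return a 0/1 Series marking which ids are paired with `name` in `coo`."""
    rows = coo.loc[coo["Column"] == name, "Row"]
    return ids.isin(rows).astype(int).rename(prefix + name)

hathaway = indicator_column(movies["ID"], credits, "Anne Hathaway", "act_")
print(hathaway.tolist())  # [1, 0, 1]
```

Passing the ID column in explicitly, rather than reading a global, is one possible design choice that makes the helper reusable for the test set as well.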

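Linear regression model: build the feature matrix from the selected genres, directors, and actors, then fit the model on the training ratings.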

In [7]:
g = ['Adventure', 'Drama', 'Sci-Fi']
d = ['Christopher Nolan']
c = ['Matthew McConaughey', 'Anne Hathaway', 'Jessica Chastain']
feature_cols = [genres[g]]
for each in d:
    feature_cols.append(get_feature_col(directors, each, 'dir_'))
for each in c:
    feature_cols.append(get_feature_col(actors, each))

In [8]:
X = pd.concat(feature_cols, axis=1)
y = train.imdbRating
lm = LinearRegression()
lm.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:
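# Predict the rating for a movie with every selected feature set to 1.
# predict() expects a 2D input: one row per sample.
lm.predict([[1] * len(X.columns)])[0]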

8.5972831954396955
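
For a linear model, the prediction for a row in which every feature equals 1 is simply the intercept plus the sum of the coefficients. The sketch below illustrates that arithmetic on made-up data, using NumPy's least-squares solver as a stand-in for `LinearRegression`; the toy features and ratings are assumptions and are unrelated to the notebook's dataset.

```python
import numpy as np

# Hypothetical toy data: two binary features and made-up ratings.
X_toy = np.array([[1, 1], [0, 1], [1, 0], [0, 0]], dtype=float)
y_toy = np.array([8.0, 7.0, 6.5, 5.5])

# Prepend a column of ones so the first fitted weight acts as the intercept.
A = np.column_stack([np.ones(len(X_toy)), X_toy])
weights, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
intercept, coefs = weights[0], weights[1:]

# Prediction for a row with every feature set to 1.
all_ones = np.ones(X_toy.shape[1])
pred = intercept + coefs @ all_ones
print(round(float(pred), 3))
```

With a fitted scikit-learn model, the same quantity can be read off as `lm.intercept_ + lm.coef_.sum()`.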