**Author**: Siddhant Sutar

Import pandas, NumPy, and the linear regression model from scikit-learn.

In [90]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

Read the CSV data obtained from OMDb.

In [91]:
data = pd.read_csv("omdbMovies.csv")

The dataset contains multiple entries for movies with the same imdbID, so drop the duplicates. Also drop all rows with NaN votes.

In [92]:
df = data.drop_duplicates(subset='imdbID')
df = df[pd.notnull(df['imdbVotes'])]
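The two cleaning steps above can be tried on a small made-up frame. This is only a sketch: the rows are hypothetical, and `dropna(subset=...)` is used here as an alternative to the `pd.notnull` mask, which should have the same effect for this purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names mirror the OMDb CSV.
toy = pd.DataFrame({
    'imdbID': ['tt001', 'tt001', 'tt002', 'tt003'],
    'imdbVotes': [1200.0, 1200.0, np.nan, 45.0],
})

# Keep the first row for each imdbID, then drop rows with missing votes.
deduped = toy.drop_duplicates(subset='imdbID')
cleaned = deduped.dropna(subset=['imdbVotes'])

remaining_ids = cleaned['imdbID'].tolist()
print(remaining_ids)
```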

Calculate the threshold: the nth percentile of the IMDb vote counts of all movies left after the previous step (those with a non-NaN vote count). The training dataset will include only movies with more votes than this value. A percentile gives a standardized cutoff instead of a hard-coded constant. I chose 97 because it appears to be the lowest percentile that still runs quickly and without issues on my machine.

In [93]:
THRESHOLD = np.percentile(df['imdbVotes'].tolist(), 97)
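To see what a 97th-percentile cutoff does, here is a minimal sketch on hypothetical vote counts 1 through 100 (not the real data). With NumPy's default linear interpolation, roughly the top 3% of values end up above the threshold.

```python
import numpy as np

# Hypothetical vote counts, one movie per value from 1 to 100.
votes = list(range(1, 101))

# 97th percentile with NumPy's default (linear) interpolation.
threshold = np.percentile(votes, 97)

# Only values strictly above the threshold are kept, as in the filter used later.
kept = [v for v in votes if v > threshold]
print(kept)
```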

Clean up the data: get only USA movies, drop rows with NaN as genre, remove documentaries and shorts, and get the movies that pass the threshold.

In [94]:
df = df.loc[df['Country'] == "USA"]
df = df[pd.notnull(df['Genre'])]
df = df[~df['Genre'].str.contains("Documentary|Short")]
df = df.loc[df['imdbVotes'] > THRESHOLD]
print('Movie count: ' + str(len(df.index)))

Movie count: 5078
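The country and genre filters can be checked on a few made-up rows. This is a sketch with hypothetical titles. Note that the exact match on "USA" would also exclude co-productions whose Country field lists more than one country, and `str.contains` treats "Documentary|Short" as a regular expression matching either word.

```python
import pandas as pd

# Hypothetical rows; the column names follow the notebook's dataframe.
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Country': ['USA', 'USA', 'France', 'USA'],
    'Genre': ['Comedy, Drama', 'Documentary, Biography', 'Drama', 'Animation, Short'],
})

# Exact match on the country, then drop documentaries and shorts.
usa = toy.loc[toy['Country'] == 'USA']
is_doc_or_short = usa['Genre'].str.contains('Documentary|Short')
kept_titles = usa[~is_doc_or_short]['Title'].tolist()
print(kept_titles)
```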


Create feature vectors for ratings, genres, directors, and actors. Use str.get_dummies(sep=', ') to account for movies with multiple genres, directors, or actors.

In [95]:
ratings = df['Rating'].str.get_dummies(sep=', ')
genres = df['Genre'].str.get_dummies(sep=', ')
directors = df['Director'].str.get_dummies(sep=', ')

In [96]:
actors = df['Cast'].str.get_dummies(sep=', ')
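As a small illustration of what `str.get_dummies(sep=', ')` produces, here is a sketch on three hypothetical genre strings. Each distinct value becomes its own 0/1 column, and a movie with several genres gets a 1 in each matching column.

```python
import pandas as pd

# Hypothetical genre strings, comma-separated like the Genre column.
genre = pd.Series(['Comedy, Drama', 'Drama', 'Comedy, Romance'])

# One indicator column per distinct genre.
dummies = genre.str.get_dummies(sep=', ')
print(dummies)
```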

Add a 'dir_' prefix to the director columns before concatenating them to the original dataframe. Some directors are also actors, and the prefix keeps their column names from colliding.

In [97]:
directors = directors.rename(columns = lambda x : 'dir_' + x)
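The collision this avoids can be shown on hypothetical data where the same person appears as both director and actor. The sketch uses `DataFrame.add_prefix`, which should be equivalent to the `rename` with a lambda above.

```python
import pandas as pd

# Hypothetical director and cast strings for two movies.
toy_directors = pd.Series(['Tom Hanks', 'Nora Ephron']).str.get_dummies(sep=', ')
toy_actors = pd.Series(['Tom Hanks, Meg Ryan', 'Tom Hanks, Meg Ryan']).str.get_dummies(sep=', ')

# Prefix the director columns so 'Tom Hanks' the director and
# 'Tom Hanks' the actor stay separate features.
toy_directors = toy_directors.add_prefix('dir_')
combined = pd.concat([toy_directors, toy_actors], axis=1)

columns = combined.columns.tolist()
print(columns)
```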

Concatenate the feature vectors to the original dataframe.

In [98]:
df = pd.concat([df, ratings, genres, directors, actors], axis=1)

The rating, genre(s), director(s), and actor(s) go into the feature_cols list as features (independent categorical variables). imdbRating is the dependent variable, the value being predicted. Instantiate the model and fit it.

In [99]:
feature_cols = ['PG-13', 'Comedy', 'Drama', 'Romance', 'dir_Tom Hanks', 'Tom Hanks', 'Julia Roberts', 'Sarah Mahoney']
X = df[feature_cols]
y = df.imdbRating
lm = LinearRegression()
lm.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
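To make the fitting step concrete, here is a minimal sketch on four hypothetical movies with two binary features. The toy ratings are chosen so that they follow rating = 5 + 1*x1 - 1*x2 exactly, so the fitted intercept and coefficients should come out close to those values.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical data: two 0/1 features per movie and a made-up rating.
toy_X = [[0, 0], [1, 0], [0, 1], [1, 1]]
toy_y = [5.0, 6.0, 4.0, 5.0]

toy_lm = LinearRegression()
toy_lm.fit(toy_X, toy_y)

toy_intercept = float(toy_lm.intercept_)
toy_coefs = [float(c) for c in toy_lm.coef_]
print(toy_intercept)
print(toy_coefs)
```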

Print the intercept and coefficients.

In [100]:
print(lm.intercept_)
print(lm.coef_)

6.29271061457
[-0.35594073 -0.22644887  0.58879227  0.11824794  0.14494599  0.47784332
 -0.18842884 -1.04015054]


Predict the rating for a movie with every feature in feature_cols set to 1.

In [101]:
# predict expects a 2D array: one row per sample, one column per feature
lm.predict([[1] * len(feature_cols)])[0]

5.8115711598762543
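The prediction is just the intercept plus the sum of the coefficients for the features that are set to 1. The sketch below redoes that arithmetic by hand, using the rounded intercept and coefficients printed earlier, so the result agrees with the model's output only up to rounding.

```python
# Values copied from the printed intercept and coefficients above.
intercept = 6.29271061457
coefs = [-0.35594073, -0.22644887, 0.58879227, 0.11824794,
         0.14494599, 0.47784332, -0.18842884, -1.04015054]

# Every feature is set to 1, as in the predict call.
features = [1] * len(coefs)

# Linear model: intercept + sum of coefficient * feature value.
prediction = intercept + sum(c * x for c, x in zip(coefs, features))
print(prediction)
```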