**Author**: Siddhant Sutar

Import pandas, NumPy, and the linear regression model from scikit-learn.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

Read the CSV data obtained from OMDb. Link to dataset: https://www.dropbox.com/s/f3ce7lj2baqty7h/omdbMovies.csv?dl=0

In [2]:
data = pd.read_csv("omdbMovies.csv")

The dataset contains multiple entries for movies with the same imdbID, so drop the duplicates. Also drop all rows with NaN votes.

In [3]:
df = data.drop_duplicates(subset='imdbID')
df = df[pd.notnull(df['imdbVotes'])]
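
As a minimal sketch of these two cleaning steps, the following runs them on a small toy frame. The rows and IDs are hypothetical and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    'imdbID': ['tt001', 'tt001', 'tt002', 'tt003'],
    'imdbVotes': [1500.0, 1500.0, np.nan, 800.0],
})

# Keep the first row for each imdbID.
deduped = toy.drop_duplicates(subset='imdbID')

# The mask is computed on the deduplicated frame, so it lines up with its rows.
cleaned = deduped[pd.notnull(deduped['imdbVotes'])]

print(cleaned['imdbID'].tolist())
```

Here the duplicate tt001 row and the tt002 row with missing votes are dropped, leaving tt001 and tt003.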



Set the minimum number of IMDb votes a movie needs (the threshold) to 1000.

In [4]:
THRESHOLD = 1000

Clean up the data: keep only movies whose country is USA, drop rows with NaN as genre, remove documentaries and shorts, and keep the movies with more votes than the threshold.

In [5]:
df = df.loc[df['Country'] == "USA"]
df = df[pd.notnull(df['Genre'])]
df = df[~df['Genre'].str.contains("Documentary|Short")]
df = df.loc[df['imdbVotes'] > THRESHOLD]
print('Movie count: ' + str(len(df.index)))

Movie count: 10200
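
To see what each filter removes, here is a small sketch on hypothetical toy rows (the titles and values are made up for illustration). Note that the equality test on Country keeps only rows where the value is exactly "USA", so a co-production listed as "USA, UK" is excluded.

```python
import pandas as pd

THRESHOLD = 1000

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Country': ['USA', 'USA', 'USA, UK', 'USA', 'USA', 'USA'],
    'Genre': ['Comedy, Drama', 'Documentary', 'Drama', None,
              'Short, Comedy', 'Action'],
    'imdbVotes': [5000, 9000, 7000, 3000, 200, 2500],
})

toy = toy.loc[toy['Country'] == 'USA']                      # drops C
toy = toy[pd.notnull(toy['Genre'])]                         # drops D
toy = toy[~toy['Genre'].str.contains('Documentary|Short')]  # drops B and E
toy = toy.loc[toy['imdbVotes'] > THRESHOLD]                 # A and F remain

print(toy['Title'].tolist())
```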


Create feature vectors for ratings, genres, directors, and actors. Use str.get_dummies(sep=', ') to account for multiple genres, directors, or actors of the same movie.

In [6]:
ratings = df['Rating'].str.get_dummies(sep=', ')
genres = df['Genre'].str.get_dummies(sep=', ')
directors = df['Director'].str.get_dummies(sep=', ')

In [7]:
actors = df['Cast'].str.get_dummies(sep=', ')
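
As a small illustration of how str.get_dummies(sep=', ') expands a multi-valued column into one indicator column per value, consider this sketch on a hypothetical three-row Series:

```python
import pandas as pd

# Hypothetical genre strings in the same comma-separated format.
genre = pd.Series(['Comedy, Drama', 'Drama', 'Comedy, Romance'])

# One 0/1 column per distinct value; a row can have several 1s.
dummies = genre.str.get_dummies(sep=', ')

print(dummies)
```

Each row gets a 1 in every column whose value appears in its string, so the first row is marked as both Comedy and Drama.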

Add a 'dir_' prefix to the column names of the directors feature vector before concatenating it to the original dataframe. This accounts for actors who are also directors and avoids identical column names.

In [8]:
directors = directors.rename(columns = lambda x : 'dir_' + x)

Concatenate the feature vectors to the original dataframe.

In [9]:
df = pd.concat([df, ratings], axis = 1)
df = pd.concat([df, genres], axis = 1)
df = pd.concat([df, directors], axis = 1)
df = pd.concat([df, actors], axis = 1)
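
The sketch below shows, on a hypothetical two-movie frame, why the prefix matters: without it, a person who both directs and acts would produce two columns with the same name. "Jane Doe" is a made-up name for illustration. It also passes all frames to a single pd.concat call, which gives the same columns as chaining the calls one at a time.

```python
import pandas as pd

# Hypothetical rows; 'Jane Doe' is a made-up director.
toy = pd.DataFrame({
    'Director': ['Tom Hanks', 'Jane Doe'],
    'Cast': ['Tom Hanks, Julia Roberts', 'Julia Roberts'],
})

directors = toy['Director'].str.get_dummies(sep=', ')
actors = toy['Cast'].str.get_dummies(sep=', ')

# Prefix director columns so they cannot collide with actor columns.
directors = directors.rename(columns=lambda name: 'dir_' + name)

# A single concat over a list of frames joins them column-wise.
features = pd.concat([directors, actors], axis=1)

print(list(features.columns))
print(features.columns.is_unique)
```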

Rating, genre(s), director(s), and actor(s) go into the feature_cols list as features (independent categorical variables). The response imdbRating is the dependent variable. Instantiate the model and fit it.

In [10]:
feature_cols = ['PG-13', 'Comedy', 'Drama', 'Romance', 'dir_Tom Hanks', 'Tom Hanks', 'Julia Roberts', 'Sarah Mahoney']
X = df[feature_cols]
y = df.imdbRating
lm = LinearRegression()
lm.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Print the intercept and coefficients.

In [11]:
print(lm.intercept_)
print(lm.coef_)

5.72579165788
[-0.22355859  0.07891088  0.78171608  0.28768594  0.21358138  0.53613133
 -0.01152758 -1.40025868]


Predict the rating!

In [12]:
# predict expects a 2D input: one row with every feature set to 1
lm.predict([[1] * len(feature_cols)])[0]

5.9884724183949771
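
The prediction of a linear model is the intercept plus the dot product of the coefficients and the feature vector. As a check, this sketch recomputes the value by hand from the intercept and coefficients printed above (copied at their printed precision, so the result agrees only to about eight decimal places).

```python
import numpy as np

# Values copied from the printed output above.
intercept = 5.72579165788
coef = np.array([-0.22355859, 0.07891088, 0.78171608, 0.28768594,
                 0.21358138, 0.53613133, -0.01152758, -1.40025868])

# Every feature is set to 1 for this movie.
x = np.ones(len(coef))

# Linear model: y_hat = intercept + coef . x
prediction = intercept + np.dot(coef, x)

print(round(prediction, 4))
```

With all features equal to 1, the prediction is simply the intercept plus the sum of the coefficients.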

The feature vector above, with every feature set to 1, corresponds to Larry Crowne (2011), which has an IMDb rating of 6.0.

**Reference**: http://nbviewer.ipython.org/github/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb

In [13]:
print('Actor count: ' + str(len(actors.columns)))
print('Director count: ' + str(len(directors.columns)))

Actor count: 15129
Director count: 4528
