# HW3: Netflix Data Analysis

In this homework assignment, you will analyze the Netflix Prize data. The data consist of 100,480,507 movie ratings on a scale from 1 to 5 stars. The reviews are distributed across 17,770 movies and 480,189 users. We have provided the training data as a sparse matrix where the row corresponds to the movie ID and the column corresponds to the user ID. A separate file contains the title and year of release for each movie. The original, raw data consists of multiple lists of tuples; each list is a separate movie and each tuple is User ID, Rating, and Rating Year.
The original data can be downloaded here: https://archive.org/download/nf_prize_dataset.tar
Further information about the Netflix Prize is available online:
https://en.wikipedia.org/wiki/Netflix_Prize
https://www.netflixprize.com/
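
As a rough illustration of how the raw per-movie lists relate to the sparse matrix, the sketch below builds a small movie-by-user matrix from made-up tuples. The `raw` dictionary and the name `toy_csr` are hypothetical and are not part of the provided files. Because IDs start at 1, the sketch allocates one extra row and column so that an ID can be used directly as an index, which may be why the provided matrix has one more row than there are movies.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data in the raw layout: one list of
# (user_id, rating, rating_year) tuples per movie ID.
raw = {
    1: [(6, 3, 2005), (7, 5, 2004)],
    2: [(6, 4, 2005)],
    3: [(8, 1, 2003), (7, 2, 2005), (6, 5, 2005)],
}

rows, cols, vals = [], [], []
for movie_id, reviews in raw.items():
    for user_id, rating, _year in reviews:
        rows.append(movie_id)   # row index = movie ID
        cols.append(user_id)    # column index = user ID
        vals.append(rating)

# IDs start at 1, so allocate one extra row and column; index 0 stays empty.
shape = (max(rows) + 1, max(cols) + 1)
toy_csr = csr_matrix((vals, (rows, cols)), shape=shape, dtype=np.float64)

print(toy_csr.shape)   # movies (plus unused row 0) by users (plus unused columns)
print(toy_csr.nnz)     # number of stored ratings
```

Entries that were never set read back as 0, which is how the provided matrix encodes a nonresponse.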

In [2]:
import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import scipy.sparse
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


In [116]:
# This file consists of titles and release years associated with each ID
movie_titles = pd.read_csv('movie_titles.txt', header = None, names = ['ID','Year','Name'])
print(movie_titles.head())
print(movie_titles.shape)

movie_by_id = {}
for id, name, year in zip(movie_titles['ID'], movie_titles['Name'], movie_titles['Year']):
    if not (np.isnan(year)):
        year = str(int(year))
    else:
        year = 'NaN'
    movie_by_id[id] = name + ' ' + '(' + year + ')'

   ID    Year                          Name
0   1  2003.0               Dinosaur Planet
1   2  2004.0    Isle of Man TT 2004 Review
2   3  1997.0                     Character
3   4  1994.0  Paula Abdul's Get Up & Dance
4   5  2004.0      The Rise and Fall of ECW
(17770, 3)


In [4]:
# This file is a sparse matrix of movies by user, with each element a rating (1-5) or nonresponse (0)
ratings_csr = scipy.sparse.load_npz('netflix_full_csr.npz')
print(ratings_csr.shape)

(17771, 2649430)


To avoid out-of-memory errors, we randomly subsample the data. Some computers can handle the full dataset (e.g., a 2017 MacBook Pro can perform SVD on the full dataset), but older computers will likely need to subsample. You can also consider using Princeton computing resources and clusters for more computationally expensive analyses.
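
To get a feel for whether a matrix will fit in memory, it can help to compare the size of its sparse storage with the size of the equivalent dense array. The sketch below does this on a randomly generated toy matrix. The helper `csr_nbytes` and the toy dimensions are assumptions made for illustration, not values from the assignment data.

```python
import numpy as np
import scipy.sparse

def csr_nbytes(m):
    """Bytes used by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# Hypothetical toy matrix: 1,000 movies x 20,000 users, about 1% filled.
rng = np.random.RandomState(0)
toy = scipy.sparse.random(1000, 20000, density=0.01, format='csr', random_state=rng)

# Size the dense equivalent would need, computed without allocating it.
dense_bytes = toy.shape[0] * toy.shape[1] * toy.dtype.itemsize
print('sparse: %.1f MB' % (csr_nbytes(toy) / 1e6))
print('dense:  %.1f MB' % (dense_bytes / 1e6))

# Subsample user columns without replacement so that no user appears twice.
keep = rng.choice(toy.shape[1], size=5000, replace=False)
toy_small = toy[:, keep]
print(toy_small.shape)
```

Calling `.toarray()` on a large slice materializes the dense version, so it is worth checking the dense size first.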

In [13]:
#n_samples = 5000
n_viewers = 500000
#random_sample_movies = np.random.choice(17771, size = n_samples)
# Sample user columns without replacement so that no user is selected twice
random_sample_viewers = np.random.choice(2649430, size = n_viewers, replace = False)
ratings_small = ratings_csr[:,random_sample_viewers]

In [22]:
# Filter the matrix to remove users (rows after transposing) with NO REVIEWS
start = time.time()
ratings_csc = ratings_csr.T
print('before removing users with no reviews:  %s' % (ratings_csc.shape,))
non_zero_users_csc = ratings_csc[(ratings_csc.getnnz(axis=1) != 0)]
print(non_zero_users_csc.shape)

finish = time.time()
print('finished in %.2f seconds' % (finish - start))

before removing users with no reviews:  (2649430, 17771)
(480189, 17771)
finished in 21.74 seconds


A common method for analyzing large datasets is dimension reduction. Here we perform a truncated SVD, which is suited to sparse datasets, and analyze which movies are associated with the different latent dimensions.
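
The sketch below runs `TruncatedSVD` on a small random matrix to show what the two outputs look like. The toy matrix and the names `svd_toy`, `Z_toy`, and `V_toy` are assumptions for illustration only. With movies as rows, `fit_transform` returns one row of scores per movie, and `components_` holds one row per latent dimension with one entry per user.

```python
import numpy as np
import scipy.sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy matrix: 50 movies x 200 users, about 10% filled.
toy = scipy.sparse.random(50, 200, density=0.1, format='csr', random_state=0)

svd_toy = TruncatedSVD(n_components=3, random_state=0)
Z_toy = svd_toy.fit_transform(toy)   # movie scores, shape (n_movies, n_components)
V_toy = svd_toy.components_          # user loadings, shape (n_components, n_users)
print(Z_toy.shape, V_toy.shape)

# The product of the two factors is a rank-3 approximation of the input.
approx = Z_toy.dot(V_toy)
dense = toy.toarray()
err = np.linalg.norm(dense - approx) / np.linalg.norm(dense)
print('relative reconstruction error: %.3f' % err)
```

Note that `TruncatedSVD` does not center the data before factorizing, so the leading component often tracks how heavily each movie was rated overall.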

In [6]:
from sklearn.decomposition import TruncatedSVD

In [7]:
n_components = 5
svd = TruncatedSVD(n_components = n_components)

In [8]:
Z = svd.fit_transform(ratings_small)

In [9]:
components = svd.components_

In [10]:
print(svd.explained_variance_ratio_)

[0.22315634 0.02998073 0.01984643 0.01672574 0.01252159]


In [11]:
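for i in range(0,n_components):
    # Rank movies (rows of Z) by the magnitude of their score on component i
    Z_sort = np.argsort(np.abs(Z[:,i]))
    print('Component ' + str(i))
    # Print the nine highest-magnitude movies, largest first
    for j in range(1,10):
        movie_index = Z_sort[-j]
        # The row index is used directly as the movie ID
        movie_title = movie_titles[movie_titles['ID'] == movie_index]['Name']
        movie_weight = Z[movie_index,i]
        print(str(movie_title) + ': ' + str(movie_weight))
    print(' ')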

Component 0
1904    Pirates of the Caribbean: The Curse of the Bla...
Name: Name, dtype: object: 628.2421679719206
11282    Forrest Gump
Name: Name, dtype: object: 618.6151597228289
2451    Lord of the Rings: The Fellowship of the Ring
Name: Name, dtype: object: 580.2945735548218
11520    Lord of the Rings: The Two Towers
Name: Name, dtype: object: 578.3100591346567
4305    The Sixth Sense
Name: Name, dtype: object: 569.0264968651688
16376    The Green Mile
Name: Name, dtype: object: 561.3186664751792
14549    The Shawshank Redemption: Special Edition
Name: Name, dtype: object: 558.6274883101519
15123    Independence Day
Name: Name, dtype: object: 551.1692711931979
13727    Gladiator
Name: Name, dtype: object: 543.994579121776
 
Component 1
5316    Miss Congeniality
Name: Name, dtype: object: 232.95887472280333
15204    The Day After Tomorrow
Name: Name, dtype: object: 227.55474730855732
9339    Pearl Harbor
Name: Name, dtype: object: 222.18553361372844
2121    Being John Malkovich
Nam

In [12]:
svd.explained_variance_

array([3268.33614216,  439.09618756,  290.66973813,  244.96436827,
        183.39048866])
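
In the component listing above, each title prints as a whole pandas Series (index, name, and dtype) because the boolean filter returns a Series, not a string. One possible way to get a plain title is to index the names by ID once and look titles up from that. The sketch below uses a three-row stand-in for `movie_titles`; the names `toy_titles`, `title_by_id`, and `title_for` are hypothetical.

```python
import pandas as pd

# Hypothetical three-row stand-in for the movie_titles DataFrame.
toy_titles = pd.DataFrame({
    'ID': [1, 2, 3],
    'Year': [2003.0, 2004.0, 1997.0],
    'Name': ['Dinosaur Planet', 'Isle of Man TT 2004 Review', 'Character'],
})

# A Series of names keyed by movie ID gives scalar lookups.
title_by_id = toy_titles.set_index('ID')['Name']

def title_for(movie_id):
    """Return the title as a plain string, or a placeholder for unknown IDs."""
    return title_by_id.get(movie_id, 'unknown ID %s' % movie_id)

print(title_for(3))
print(title_for(0))   # ID 0 does not exist, so the placeholder is returned
```

The `movie_by_id` dictionary built earlier in the notebook serves the same purpose and also includes the release year.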

In [23]:
ratings_small = non_zero_users_csc

In [71]:
movie_results_dict = {}
for movie_id in range(1, 4):
    num_reviews = ratings_small[:,movie_id].count_nonzero()
    movie_name = movie_titles[movie_titles['ID'] == movie_id]['Name']
    print('Movie #%s, %s, average rating: %.2f in %i reviews' % (
        movie_id,
        movie_name,
        np.sum(ratings_small[:,movie_id]) / num_reviews,
        int(num_reviews)
    ))

    # Keep only the users who rated this movie
    start = time.time()
    filter_by = np.ravel((ratings_small[:,movie_id] != 0.0).toarray())
    filtered_ratings = ratings_small[filter_by,:]

    # Features: ratings of every other movie; target: ratings of this movie
    movie_mask = np.ravel(np.full((filtered_ratings.shape[1], 1), True))
    movie_mask[movie_id] = False
    X = filtered_ratings[:,movie_mask]
    y = filtered_ratings[:,movie_id]

    X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y.toarray(), test_size=0.2)
    finish = time.time()
    data_time = (finish - start)
    #print('finished choosing data in %.2f seconds' % data_time)
    for i, name in zip([X_train, X_test, y_train, y_test], ['X_train', 'X_test', 'y_train', 'y_test']):
        print(name, ':', i.shape, '\t', end='')
    print()

    # Run the regression on the film
    start = time.time()
    regr = LinearRegression()
    regr.fit(X_train, y_train)

    y_pred = regr.predict(X_test)

    #print('Coefficients: \n', regr.coef_)
    #print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
    print('Variance score: %.2f' % r2_score(y_test, y_pred))
    finish = time.time()
    regr_time = finish - start
    print('finished regression in %.2f seconds' % regr_time)
    movie_results_dict[movie_id] = {
        'name': str(movie_titles[movie_titles['ID'] == movie_id]['Name']),
        'regr_time': regr_time,
        'data_time': data_time,
        'regr': regr,
        'mse': mean_squared_error(y_test, y_pred),
        'r2': r2_score(y_test, y_pred)
    }

Movie #1, 0    Dinosaur Planet
Name: Name, dtype: object, average rating: 3.75 in 547 reviews
X_train : (437, 17770) 	X_test : (110, 17770) 	y_train : (437, 1) 	y_test : (110, 1) 	
Variance score: -0.00
finished regression in 1.94 seconds
Movie #2, 1    Isle of Man TT 2004 Review
Name: Name, dtype: object, average rating: 3.56 in 145 reviews
X_train : (116, 17770) 	X_test : (29, 17770) 	y_train : (116, 1) 	y_test : (29, 1) 	
Variance score: 0.11
finished regression in 0.63 seconds
Movie #3, 2    Character
Name: Name, dtype: object, average rating: 3.64 in 2012 reviews
X_train : (1609, 17770) 	X_test : (403, 17770) 	y_train : (1609, 1) 	y_test : (403, 1) 	
Variance score: -0.88
finished regression in 17.76 seconds


In [74]:
movie_results_dict[3]

{'data_time': 0.9145989418029785,
 'mse': 1.4984351875293063,
 'name': '2    Character\nName: Name, dtype: object',
 'r2': -0.8808786113140292,
 'regr': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 'regr_time': 17.756766080856323}
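
The variance score reported above is the coefficient of determination, 1 - SS_res / SS_tot, so a negative value means the model predicted the held-out ratings worse than simply predicting their mean. One plausible reason here is that each regression has far more features (17,770) than training rows (437 to 1,609), which lets ordinary least squares fit the training set closely without generalizing. The sketch below only illustrates how the score behaves on made-up numbers; `r2_manual` is a hypothetical helper written out from the formula.

```python
import numpy as np

def r2_manual(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)       # squared error of the predictions
    ss_tot = np.sum((y - y.mean()) ** 2)    # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

# Hypothetical held-out ratings.
y_true = np.array([3.0, 4.0, 5.0, 4.0, 2.0])

# Predicting the mean for everyone scores 0 by construction.
mean_pred = np.full_like(y_true, y_true.mean())
# Predictions further from the truth than the mean score below 0.
bad_pred = np.array([5.0, 2.0, 1.0, 1.0, 5.0])

print('mean baseline: %.2f' % r2_manual(y_true, mean_pred))
print('bad predictor: %.2f' % r2_manual(y_true, bad_pred))
```

Comparing each model against this mean baseline is a quick check on whether the regression has learned anything useful.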

In [122]:
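review_nums = []
# Movie IDs start at 1 and column 0 holds no ratings, so start at 1 to avoid dividing by zero.
# Each entry is (movie ID, number of reviews, average rating); the loop covers IDs 1 to 17699.
for i in range(1, 17700):
    num_reviews = ratings_small[:,i].count_nonzero()
    review_nums.append((i, num_reviews, np.sum(ratings_small[:,i]) / num_reviews))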
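
The per-movie counts and averages can also be computed for all columns at once with sparse matrix operations, which avoids slicing one column per iteration. The sketch below is a minimal version on a made-up users-by-movies matrix; the names `toy`, `counts`, `sums`, and `means` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy users-by-movies matrix; 0 means "no rating".
# Column 0 is unused because movie IDs start at 1.
toy = csr_matrix(np.array([
    [0, 5, 0, 3],
    [0, 4, 0, 0],
    [0, 0, 0, 1],
], dtype=np.float64))

counts = toy.getnnz(axis=0)                   # number of ratings per movie
sums = np.asarray(toy.sum(axis=0)).ravel()    # total stars per movie

# Leave the mean as NaN where a movie has no ratings instead of dividing by zero.
means = np.full(toy.shape[1], np.nan)
rated = counts > 0
means[rated] = sums[rated] / counts[rated]

print(counts)
print(means)
```

Movies with only a handful of ratings can reach extreme averages, so it may be worth filtering on `counts` before ranking by mean.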



In [125]:
s = sorted(review_nums, key=lambda x: x[2])
for movie_id, num, avg_review in s[-20:]:
    print('%s\t%s\t%s' % (num, avg_review, movie_by_id[movie_id]))

681	4.538913362701909	Fruits Basket (2001)
17292	4.542563034929447	The Simpsons: Season 5 (1993)
92470	4.543700659673408	Star Wars: Episode V: The Empire Strikes Back (1980)
134284	4.5451207887760265	Lord of the Rings: The Return of the King (2003)
125	4.552	Lord of the Rings: The Return of the King: Extended Edition: Bonus Material (2003)
1883	4.554434413170473	Inu-Yasha (2000)
8426	4.581295988606693	The Simpsons: Season 6 (1994)
6621	4.582389367165081	Arrested Development: Season 2 (2004)
220	4.586363636363636	Ghost in the Shell: Stand Alone Complex: 2nd Gig (2005)
1238	4.592084006462035	Veronica Mars: Season 1 (2004)
139660	4.593383932407275	The Shawshank Redemption: Special Edition (1994)
89	4.595505617977528	Tenchi Muyo! Ryo Ohki (1995)
25	4.6	Trailer Park Boys: Season 4 (2003)
75	4.6	Trailer Park Boys: Season 3 (2003)
1633	4.605021432945499	Fullmetal Alchemist (2004)
1747	4.638809387521466	Battlestar Galactica: Season 1 (2004)
7249	4.6709891019450955	Lost: Season 1 (2004)
74912	4