# HW3: Netflix Data Analysis

In this homework assignment, you will analyze the Netflix Prize data. The data consist of 100,480,507 movie ratings on a scale from 1 to 5 stars. The reviews are distributed across 17,770 movies and 480,189 users. We have provided the training data as a sparse matrix where the row corresponds to the movie ID and the column corresponds to the user ID. A separate file contains the title and year of release for each movie. The original, raw data consists of multiple lists of tuples; each list is a separate movie and each tuple is User ID, Rating, and Rating Year.
The original data can be downloaded here: https://archive.org/download/nf_prize_dataset.tar
Further information about the Netflix Prize is available online:
https://en.wikipedia.org/wiki/Netflix_Prize
https://www.netflixprize.com/
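
The layout described above (rows indexed by movie ID, columns by user ID) can be sketched on a toy example. The `raw` dictionary and its values below are hypothetical and only stand in for the per-movie lists of (User ID, Rating, Rating Year) tuples; this is an illustration of the format, not the script that produced the provided matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical raw data: one list of (user_id, rating, year) tuples per movie ID
raw = {
    1: [(3, 4, 2004), (5, 2, 2005)],
    2: [(3, 5, 2005)],
}

rows, cols, vals = [], [], []
for movie_id, tuples in raw.items():
    for user_id, rating, _year in tuples:
        rows.append(movie_id)   # row index = movie ID
        cols.append(user_id)    # column index = user ID
        vals.append(rating)

# IDs start at 1, so row 0 and column 0 stay empty
n_movies = max(rows) + 1
n_users = max(cols) + 1
toy_ratings = csr_matrix((vals, (rows, cols)), shape=(n_movies, n_users))

print(toy_ratings.shape)   # (3, 6)
print(toy_ratings[1, 3])   # 4
print(toy_ratings.nnz)     # 3
```

Indexing directly by ID leaves an unused row 0, which would be consistent with a matrix that has one more row than there are movies.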

In [88]:
import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import scipy.sparse as sparse
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


In [3]:
# This file consists of titles and release years associated with each ID
movie_titles = pd.read_csv('movie_titles.txt', header = None, names = ['ID','Year','Name'])
print(movie_titles.head())
print(movie_titles.shape)

movie_by_id = {}
for movie_id, name, year in zip(movie_titles['ID'], movie_titles['Name'], movie_titles['Year']):
    # Some titles have no release year; label those 'NaN'
    year = 'NaN' if np.isnan(year) else str(int(year))
    movie_by_id[movie_id] = name + ' (' + year + ')'

   ID    Year                          Name
0   1  2003.0               Dinosaur Planet
1   2  2004.0    Isle of Man TT 2004 Review
2   3  1997.0                     Character
3   4  1994.0  Paula Abdul's Get Up & Dance
4   5  2004.0      The Rise and Fall of ECW
(17770, 3)


In [4]:
# This file is a sparse matrix of movies by user, with each element a rating (1-5) or nonresponse (0)
ratings_csr = sparse.load_npz('netflix_full_csr.npz')
print(ratings_csr.shape)

(17771, 2649430)


To avoid out-of-memory errors, we randomly subsample the data. Some computers can handle the full dataset (for example, a 2017 MacBook Pro can perform the SVD on the full dataset), but older computers will likely need to subsample. You can also consider using Princeton computing resources and clusters for more computationally expensive analyses.
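
A minimal sketch of one way to subsample users, shown on a small random matrix rather than the real data. Restricting the draw to users with at least one rating and sampling without replacement are assumptions of this sketch, and the names `toy`, `active_users`, and `toy_small` are hypothetical.

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)

# Toy stand-in for the ratings matrix: 20 movies x 1000 users, mostly empty
toy = sparse_random(20, 1000, density=0.01, format='csr', random_state=0)

# Columns (users) that hold at least one stored rating
active_users = np.flatnonzero(toy.getnnz(axis=0) > 0)

# Draw users without replacement so that no column is duplicated
n_sample = min(50, active_users.size)
sampled_users = rng.choice(active_users, size=n_sample, replace=False)
toy_small = toy[:, sampled_users]

print(toy_small.shape)
```

Sampling only from non-empty columns keeps every selected user informative, which matters when most column indices in the full matrix have no ratings at all.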

In [5]:
#n_samples = 5000
n_viewers = 500000
#random_sample_movies = np.random.choice(17771, size = n_samples)
# replace = False keeps each sampled user column unique
random_sample_viewers = np.random.choice(ratings_csr.shape[1], size = n_viewers, replace = False)
ratings_small = ratings_csr[:,random_sample_viewers]

In [6]:
# Filter the matrix to remove users (rows of the transpose) with no reviews
start = time.time()
ratings_csc = ratings_csr.T
print('before removing users with no reviews: ', ratings_csc.shape)
non_zero_users_csc = ratings_csc[(ratings_csc.getnnz(axis=1) != 0)]
print(non_zero_users_csc.shape)

finish = time.time()
print('finished in %.2f seconds' % (finish - start))

before removing users with no reviews:  (2649430, 17771)
(480189, 17771)
finished in 24.09 seconds


A common method for analyzing large datasets is dimension reduction. Here we perform a truncated SVD, which is suited to sparse datasets, and analyze which movies are associated with the different latent dimensions.
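
As a small illustration of what the truncated SVD returns, the sketch below fits `TruncatedSVD` to a hypothetical 4 x 6 movies-by-users matrix. Because the rows are movies, `fit_transform` gives one row of scores per movie, while `components_` holds one column of loadings per user. The toy values and the names `movie_scores` and `user_loadings` are assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy movies-by-users matrix; 0 means "no rating"
toy = csr_matrix(np.array([
    [5, 4, 0, 0, 5, 0],
    [4, 5, 0, 0, 4, 0],
    [0, 0, 5, 4, 0, 5],
    [0, 0, 4, 5, 0, 4],
], dtype=float))

toy_svd = TruncatedSVD(n_components=2, random_state=0)
movie_scores = toy_svd.fit_transform(toy)   # one row per movie
user_loadings = toy_svd.components_         # one column per user

print(movie_scores.shape)    # (4, 2)
print(user_loadings.shape)   # (2, 6)
```

Ranking the rows of `movie_scores` by absolute value within a column is one way to see which movies are most strongly associated with that latent dimension.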

In [7]:
n_components = 5
svd = TruncatedSVD(n_components = n_components)

In [8]:
Z = svd.fit_transform(ratings_small)

In [9]:
components = svd.components_

In [10]:
print(svd.explained_variance_ratio_)

[0.22315634 0.02998073 0.01984643 0.01672574 0.01252159]


In [None]:
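for i in range(n_components):
    # Rank movies by the magnitude of their score on component i
    Z_sort = np.argsort(np.abs(Z[:,i]))
    print('Component ' + str(i))
    for j in range(1,10):
        # Row index of the ratings matrix is the movie ID
        movie_index = Z_sort[-j]
        # Look up the title string directly instead of printing a pandas Series
        movie_title = movie_by_id.get(movie_index, 'Unknown ID ' + str(movie_index))
        movie_weight = Z[movie_index,i]
        print(movie_title + ': ' + str(movie_weight))
    print(' ')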

In [8]:
non_zero_users_csr = csr_matrix(non_zero_users_csc)
# Count the stored ratings of every user in one vectorized call instead of a Python loop
reviews_per_user = non_zero_users_csr.getnnz(axis=1)
reviews_by_user = dict(enumerate(reviews_per_user.tolist()))

In [19]:
# Indices of the ten users with the most reviews
s = sorted(reviews_by_user.keys(), key=lambda x: reviews_by_user[x], reverse=True)[:10]
print([reviews_by_user[i] for i in s])
print(s)

[17653, 17436, 16565, 15813, 14831, 9821, 9768, 9739, 9064, 8881]
[55373, 70466, 442139, 301823, 383961, 265129, 297513, 238656, 472465, 350357]


In [105]:
ratings_small = non_zero_users_csc
n_viewers = 100000
ratings_small = ratings_small.transpose()
ratings_small.shape

(17771, 480189)

In [None]:
max_reviews = 500
movie_results_dict = {}
# choosing_mechanism is one of: 'top_reviewers', 'random_sample'
choosing_mechanism = 'random_sample'

for movie_id in range(1, 100):
    num_reviews = ratings_small[movie_id].count_nonzero()
    movie_name = movie_by_id[movie_id]
    print('Movie #%s, %s, average rating: %.2f in %i reviews' % (
        movie_id,
        movie_name,
        ratings_small[movie_id].sum() / float(num_reviews),
        int(num_reviews)
    ))

    start = time.time()

    # column indices of the users who rated this movie
    user_idx = sparse.find(ratings_small[movie_id])[1]

    if choosing_mechanism == 'top_reviewers':
        # if there are many reviews, keep only the highest-volume reviewers
        if num_reviews > max_reviews:
            user_idx = np.array(sorted(user_idx, key=lambda u: reviews_by_user[u], reverse=True)[:max_reviews])

    elif choosing_mechanism == 'random_sample':
        # sample without replacement so that no user is selected twice
        if num_reviews > max_reviews:
            user_idx = np.random.choice(user_idx, size=max_reviews, replace=False)

    filtered_ratings = ratings_small[:, user_idx]

    # mask that drops the row of the movie being predicted
    movie_mask = np.full(filtered_ratings.shape[0], True)
    movie_mask[movie_id] = False

    # generate X and y, then split into training and testing data
    X = filtered_ratings[movie_mask]
    y = filtered_ratings[movie_id]
    # toarray() returns plain ndarrays, which scikit-learn expects
    X_train, X_test, y_train, y_test = train_test_split(
        X.transpose().toarray(), y.toarray().ravel(), test_size=0.2)

    finish = time.time()
    data_time = (finish - start)
    #print('finished choosing data in %.2f seconds' % data_time)

    # Run the regression on the film
    start = time.time()
    regr = LinearRegression()
    regr.fit(X_train, y_train)

    y_pred = regr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    #print('Coefficients: \n', regr.coef_)
    print('MSE: %.2f' % mse)
    print('r2 value: %.2f' % r2)
    finish = time.time()
    regr_time = (finish - start)
    #print('finished regression in %.2f seconds' % regr_time)
    movie_results_dict[movie_id] = {
        'name': movie_name,
        'regr_time': regr_time,
        'data_time': data_time,
        'regr': regr,
        'mse': mse,
        'r2': r2
    }
    print('time: %.2f ... (%.2f + %.2f)' % (data_time + regr_time, data_time, regr_time))
    print()

Movie #1, Dinosaur Planet (2003), average rating: 3.75 in 547 reviews
MSE: 1.21
r2 value: 0.10
time: 2.16 ... (0.81 + 1.35)

Movie #2, Isle of Man TT 2004 Review (2004), average rating: 3.56 in 145 reviews
MSE: 1.43
r2 value: 0.08
time: 1.31 ... (0.72 + 0.60)

Movie #3, Character (1997), average rating: 3.64 in 2012 reviews
MSE: 1.26
r2 value: -0.05
time: 2.67 ... (0.71 + 1.96)

Movie #4, Paula Abdul's Get Up & Dance (1994), average rating: 2.74 in 142 reviews
MSE: 2.07
r2 value: -0.35
time: 1.46 ... (0.73 + 0.73)

Movie #5, The Rise and Fall of ECW (2004), average rating: 3.92 in 1140 reviews
MSE: 1.29
r2 value: 0.31
time: 2.13 ... (0.74 + 1.39)

Movie #6, Sick (1997), average rating: 3.08 in 1019 reviews
MSE: 1.17
r2 value: 0.43
time: 2.19 ... (0.76 + 1.43)

Movie #7, 8 Man (1992), average rating: 2.13 in 93 reviews
MSE: 1.56
r2 value: -0.36
time: 1.11 ... (0.73 + 0.38)

Movie #8, What the #$*! Do We Know!? (2004), average rating: 3.19 in 14910 reviews
MSE: 1.39
r2 value: 0.07
time: 

In [74]:
movie_results_dict[3]

{'data_time': 0.9145989418029785,
 'mse': 1.4984351875293063,
 'name': '2    Character\nName: Name, dtype: object',
 'r2': -0.8808786113140292,
 'regr': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 'regr_time': 17.756766080856323}
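
Several of the r2 values reported above are negative. As a small illustration with made-up numbers (not taken from the Netflix data), `r2_score` is 0 for a model that always predicts the mean of the held-out ratings and drops below 0 for predictions that do worse than that baseline.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical held-out ratings and two sets of predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_mean = np.full_like(y_true, y_true.mean())   # always predict the mean rating
y_bad = np.array([5.0, 4.0, 3.0, 2.0, 1.0])    # predictions in reverse order

print(r2_score(y_true, y_mean))             # 0.0
print(r2_score(y_true, y_bad))              # -3.0
print(mean_squared_error(y_true, y_bad))    # 8.0
```

Read this way, a negative r2 for a movie suggests that the fitted regression generalizes worse to the test users than simply predicting that movie's average rating.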

In [122]:
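review_nums = []
# ratings_small is movies x users, so row i holds the ratings for movie ID i (row 0 is unused)
for i in range(1, ratings_small.shape[0]):
    num_reviews = ratings_small[i].count_nonzero()
    if num_reviews == 0:
        continue  # skip movies without ratings to avoid dividing by zero
    review_nums.append((i, num_reviews, ratings_small[i].sum() / float(num_reviews)))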



In [129]:
# Sort by average rating and show the 20 highest-rated titles
s = sorted(review_nums, key=lambda x: x[2])
for movie_id, num, avg_review in s[-20:]:
    print('%s\t%0.4f\t%s' % (num, avg_review, movie_by_id[movie_id]))

681	4.5389	Fruits Basket (2001)
17292	4.5426	The Simpsons: Season 5 (1993)
92470	4.5437	Star Wars: Episode V: The Empire Strikes Back (1980)
134284	4.5451	Lord of the Rings: The Return of the King (2003)
125	4.5520	Lord of the Rings: The Return of the King: Extended Edition: Bonus Material (2003)
1883	4.5544	Inu-Yasha (2000)
8426	4.5813	The Simpsons: Season 6 (1994)
6621	4.5824	Arrested Development: Season 2 (2004)
220	4.5864	Ghost in the Shell: Stand Alone Complex: 2nd Gig (2005)
1238	4.5921	Veronica Mars: Season 1 (2004)
139660	4.5934	The Shawshank Redemption: Special Edition (1994)
89	4.5955	Tenchi Muyo! Ryo Ohki (1995)
25	4.6000	Trailer Park Boys: Season 4 (2003)
75	4.6000	Trailer Park Boys: Season 3 (2003)
1633	4.6050	Fullmetal Alchemist (2004)
1747	4.6388	Battlestar Galactica: Season 1 (2004)
7249	4.6710	Lost: Season 1 (2004)
74912	4.7026	Lord of the Rings: The Two Towers: Extended Edition (2002)
73422	4.7166	The Lord of the Rings: The Fellowship of the Ring: Extended Edition (20
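
The per-movie averages can also be computed without a Python loop. The sketch below does this on a hypothetical 4 x 4 movies-by-users matrix; the `min_reviews` threshold is an assumption added for illustration, since titles with only a handful of ratings can otherwise dominate a ranking by average.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy movies-by-users matrix; 0 means "no rating"
toy = csr_matrix(np.array([
    [0, 0, 0, 0],
    [5, 0, 4, 0],
    [0, 3, 0, 0],
    [1, 2, 3, 0],
], dtype=float))

counts = toy.getnnz(axis=1)                   # number of ratings per movie
sums = np.asarray(toy.sum(axis=1)).ravel()    # sum of ratings per movie

# Leave the average undefined (NaN) for movies without ratings
averages = np.full(toy.shape[0], np.nan)
rated = counts > 0
averages[rated] = sums[rated] / counts[rated]

min_reviews = 2   # hypothetical threshold
eligible = np.flatnonzero(counts >= min_reviews)
ranking = eligible[np.argsort(averages[eligible])[::-1]]   # best first

print(averages)   # [nan 4.5 3.  2. ]
print(ranking)    # [1 3]
```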

In [33]:
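# ratings_small[1] is a 1 x n_users sparse matrix, so the chained index [8] asks for row 8 and raises the error below.
# A single element is read with ratings_small[1, 8] instead.
ratings_small[1][8]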

IndexError: index out of bounds: 0 <= 8 <= 1, 0 <= 9 <= 1, 8 <= 9

In [68]:
ks = movie_results_dict.keys()
# sort by r2 value (highest first)
print('r2')
s = sorted(ks, key=lambda x: movie_results_dict[x]['r2'], reverse=True)
for i in s[:10]:
    print('%.3f\t%s' % (movie_results_dict[i]['r2'], movie_results_dict[i]['name']))

# sort by mse (lowest first)
print('\nMSE')
s = sorted(ks, key=lambda x: movie_results_dict[x]['mse'])
for i in s[:10]:
    print('%.3f\t%s' % (movie_results_dict[i]['mse'], movie_results_dict[i]['name']))

r2
0.537	Haven (2001)
0.424	You're Invited to Mary-Kate and Ashley's Vacation Parties (1996)
0.420	Gentlemen of Fortune (1971)
0.415	Dark Shadows: Vol. 9 (1968)
0.399	The Beverly Hillbillies (1962)
0.392	Devil in the Flesh 2 (2000)
0.384	The Matchmaker (1958)
0.355	Danielle Steel's Heartbeat (1993)
0.354	A Touch of Frost: Seasons 7 & 8 (2003)
0.351	Gupt (1997)

MSE
0.296	Avia Vampire Hunter (2005)
0.459	House of Whipcord (1974)
0.541	The Bravados (1958)
0.548	The Giallo Collection: Who Saw Her Die? (1972)
0.559	Hot War (1998)
0.571	Panic in the Streets (1950)
0.585	Goddess of Mercy (2004)
0.617	Haven (2001)
0.658	Cadfael: A Morbid Taste for Bones (1996)
0.665	Whisper Kill (1988)


In [71]:
import pickle
with open('movie_results_dict_100_1000.pickle', 'wb') as f:
    pickle.dump(movie_results_dict, f)