# HW3: Netflix Data Analysis

In this homework assignment, you will analyze the Netflix Prize data. The data consist of 100,480,507 movie ratings on a scale from 1 to 5 stars. The reviews are distributed across 17,770 movies and 480,189 users. We have provided the training data as a sparse matrix where the row corresponds to the movie ID and the column corresponds to the user ID; a missing rating is stored as 0. A separate file contains the title and year of release for each movie. The original, raw data consists of multiple lists of tuples; each list is a separate movie and each tuple is User ID, Rating, and Rating Year. 
The original data can be downloaded here: https://archive.org/download/nf_prize_dataset.tar
Further information about the Netflix Prize is available online: 
https://en.wikipedia.org/wiki/Netflix_Prize
https://www.netflixprize.com/
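
The layout described above can be illustrated with a small sketch. The `raw_data` dictionary below is made-up toy data, not the real file format or a parser for it. It only shows how per-movie lists of (User ID, Rating, Rating Year) tuples map onto a movies-by-users sparse matrix in which 0 means "no rating". Because IDs start at 1, the sketch allocates one extra row and column, which is presumably why the provided matrix has 17,771 rows for 17,770 movies.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: movie ID -> list of (user ID, rating, rating year)
raw_data = {
    1: [(2, 5, 2004), (3, 3, 2005)],
    2: [(1, 4, 2003)],
    3: [(1, 1, 2005), (2, 2, 2005), (3, 5, 2004)],
}

rows, cols, vals = [], [], []
for movie_id, reviews in raw_data.items():
    for user_id, rating, _year in reviews:
        rows.append(movie_id)   # row index = movie ID
        cols.append(user_id)    # column index = user ID
        vals.append(rating)

# IDs start at 1, so index 0 stays empty in both dimensions
n_movies = max(rows) + 1
n_users = max(cols) + 1
toy_ratings = csr_matrix((vals, (rows, cols)),
                         shape=(n_movies, n_users), dtype=np.float32)

print(toy_ratings.shape)      # (4, 4)
print(toy_ratings.toarray())
```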

In [2]:
import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import scipy.sparse
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.model_selection import train_test_split


In [3]:
# This file consists of titles and release years associated with each ID
movie_titles = pd.read_csv('movie_titles.txt', header = None, names = ['ID','Year','Name'], encoding='latin-1')
print(movie_titles.head())
print(movie_titles.shape)

movie_by_id = {}
for id, name, year in zip(movie_titles['ID'], movie_titles['Name'], movie_titles['Year']):
    if not (np.isnan(year)):
        year = str(int(year))
    else:
        year = 'NaN'
    movie_by_id[id] = name + ' ' + '(' + year + ')'

   ID    Year                          Name
0   1  2003.0               Dinosaur Planet
1   2  2004.0    Isle of Man TT 2004 Review
2   3  1997.0                     Character
3   4  1994.0  Paula Abdul's Get Up & Dance
4   5  2004.0      The Rise and Fall of ECW
(17770, 3)


In [4]:
# This file is a sparse matrix of movies by user, with each element a rating (1-5) or nonresponse (0)
ratings_csr = scipy.sparse.load_npz('netflix_full_csr.npz')
print(ratings_csr.shape)

(17771, 2649430)


To avoid memory overflow errors, we randomly subsample the data. Some computers can handle the full dataset (e.g., a 2017 MacBook Pro can perform SVD on the full dataset), but older computers will likely need to subsample. You can also consider using Princeton computing resources and clusters for more computationally expensive analyses.

In [5]:
#n_samples = 5000
n_viewers = 500000
#random_sample_movies = np.random.choice(17771, size = n_samples)
# Sample user columns without replacement so that no user appears twice
random_sample_viewers = np.random.choice(ratings_csr.shape[1], size = n_viewers, replace = False)
ratings_small = ratings_csr[:,random_sample_viewers]

In [6]:
# Transpose to users x movies, then drop users (rows) with NO REVIEWS
start = time.time()
ratings_csc = ratings_csr.T
print('before removing users with no reviews: ', ratings_csc.shape)
non_zero_users_csc = ratings_csc[(ratings_csc.getnnz(axis=1) != 0)]
print(non_zero_users_csc.shape)

finish = time.time()
print('finished in %.2f seconds' % (finish - start))

before removing users with no reviews:  (2649430, 17771)
(480189, 17771)
finished in 24.99 seconds


A common method for analyzing large datasets is dimension reduction. Here we perform a truncated SVD, which is suited to sparse datasets, and analyze which movies are associated with the different latent dimensions.
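
As a rough illustration of what the decomposition returns, the sketch below runs `TruncatedSVD` on a small random sparse matrix. The toy matrix and the names `toy`, `toy_svd`, and `Z_toy` are assumptions made for this example only. With movies as rows, the transformed array has one row per movie, and `components_` has one row per latent dimension.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the ratings matrix: 30 "movies" x 200 "users", about 10% filled
toy = sparse_random(30, 200, density=0.1, format='csr', random_state=0)

toy_svd = TruncatedSVD(n_components=3, random_state=0)
Z_toy = toy_svd.fit_transform(toy)

print(Z_toy.shape)                # (30, 3): one row per movie
print(toy_svd.components_.shape)  # (3, 200): one row per latent dimension
```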

In [8]:
from sklearn.decomposition import TruncatedSVD

In [63]:
n_components = 5
svd = TruncatedSVD(n_components = n_components)

In [None]:
Z = svd.fit_transform(ratings_small)

In [11]:
components = svd.components_

In [12]:
print(svd.explained_variance_ratio_)

[ 0.22171354  0.0300368   0.02018774  0.01675508  0.01244614]
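
To read these ratios, it can help to accumulate them. The short sketch below copies the five values printed above and sums them: the first component alone accounts for about 22% of the variance, and all five together for about 30%.

```python
import numpy as np

# Ratios copied from the output above
explained_variance_ratio = np.array(
    [0.22171354, 0.0300368, 0.02018774, 0.01675508, 0.01244614]
)

cumulative = np.cumsum(explained_variance_ratio)
for k, total in enumerate(cumulative, start=1):
    print('first %d component(s): %.1f%% of variance' % (k, 100 * total))
```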


In [13]:
for i in range(0,n_components):
    Z_sort = np.argsort(np.abs(Z[:,i]))
    print('Component ' + str(i))
    for j in range(1,10):
        movie_index = Z_sort[-j]
        movie_title = movie_titles[movie_titles['ID'] == movie_index]['Name']
        movie_weight = Z[movie_index,i]
        print(str(movie_title) + ': ' + str(movie_weight))
    print(' ')

Component 0
1904    Pirates of the Caribbean: The Curse of the Bla...
Name: Name, dtype: object: 618.684
11282    Forrest Gump
Name: Name, dtype: object: 613.529
2451    Lord of the Rings: The Fellowship of the Ring
Name: Name, dtype: object: 571.28
4305    The Sixth Sense
Name: Name, dtype: object: 569.455
11520    Lord of the Rings: The Two Towers
Name: Name, dtype: object: 568.664
14549    The Shawshank Redemption: Special Edition
Name: Name, dtype: object: 556.065
16376    The Green Mile
Name: Name, dtype: object: 555.738
15123    Independence Day
Name: Name, dtype: object: 542.54
13727    Gladiator
Name: Name, dtype: object: 538.47
 
Component 1
5316    Miss Congeniality
Name: Name, dtype: object: 230.629
15204    The Day After Tomorrow
Name: Name, dtype: object: 225.835
2121    Being John Malkovich
Name: Name, dtype: object: -218.055
9339    Pearl Harbor
Name: Name, dtype: object: 217.384
4995    Gone in 60 Seconds
Name: Name, dtype: object: 210.44
6971    Armageddon
Name: Name, 

In [15]:
svd.explained_variance_

array([ 3216.24267578,   435.7227478 ,   292.8493042 ,   243.0541687 ,
         180.5473938 ], dtype=float32)
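
The component listing above prints each title as a pandas Series, which is why lines such as `Name: Name, dtype: object` appear in the output. A minimal sketch of a tidier ranking is shown below. The helper `top_rows_for_component` and the toy arrays are hypothetical and only illustrate the idea: rank rows by absolute weight, then look titles up in a plain dictionary such as `movie_by_id`.

```python
import numpy as np

def top_rows_for_component(Z, component, n_top=3):
    """Return (row_index, weight) pairs with the largest |weight| in one column of Z."""
    order = np.argsort(np.abs(Z[:, component]))[::-1][:n_top]
    return [(int(idx), float(Z[idx, component])) for idx in order]

# Hypothetical toy projection: 5 rows (row index = movie ID) x 2 components
Z_toy = np.array([
    [0.0, 0.0],    # ID 0 is unused
    [3.0, -0.5],
    [1.0, 2.5],
    [4.0, -3.0],
    [0.5, 0.1],
])
titles_toy = {1: 'Movie A (2001)', 2: 'Movie B (1999)',
              3: 'Movie C (2004)', 4: 'Movie D (1987)'}

for component in range(Z_toy.shape[1]):
    print('Component %d' % component)
    for movie_id, weight in top_rows_for_component(Z_toy, component):
        title = titles_toy.get(movie_id, 'unknown ID %d' % movie_id)
        print('  %s: %.3f' % (title, weight))
```

Negative weights are kept in the output because the sign shows which end of the latent dimension a movie sits on.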

In [7]:
# From here on, ratings_small is the users x movies matrix (users with at least one rating)
ratings_small = non_zero_users_csc



In [25]:
from sklearn.cluster import KMeans, DBSCAN, SpectralClustering, AgglomerativeClustering
from scipy import stats

In [26]:
movie_results_dict = {}
for id in range(10, 15):
    movie_id = id
    num_reviews = ratings_small[:,movie_id].count_nonzero()
    movie_name = movie_titles[movie_titles['ID'] == movie_id]['Name']
    print ('Movie #%s, %s, average rating: %.2f in %i reviews' % (
        movie_id,
        movie_name,
        np.sum(ratings_small[:,movie_id]) / num_reviews,
        int(num_reviews)
    ))

    start = time.time()
    filter_by = np.ravel((ratings_small[:,movie_id] != 0.0).toarray())
    filtered_ratings = ratings_small[filter_by,:]

    movie_mask = np.ravel(np.full((filtered_ratings.shape[1], 1), True))
    movie_mask[movie_id] = False
    X = filtered_ratings[:,movie_mask]
    y = filtered_ratings[:,movie_id]

    X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y.toarray(), test_size=0.2, random_state=0)
    finish = time.time()
    data_time = (finish - start)
    #print('finished choosing data in %.2f seconds' % data_time)
    for i, name in zip([X_train, X_test, y_train, y_test], ['X_train', 'X_test', 'y_train', 'y_test']):
        print (name, ':', i.shape, '\t',)
    print()
    
    # Run the regression on the film
    start = time.time()
    regr = LinearRegression()
    regr.fit(X_train, y_train)

    y_pred_reg = regr.predict(X_test)

    #print('Coefficients: \n', regr.coef_)
    #print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred_reg))
    print('No Clustering')
    print('Variance score: %.2f' % r2_score(y_test, y_pred_reg))
    print()
    finish = time.time()
    regr_time = finish - start
   # print ('finished regression in %.2f seconds' % regr_time)
   # print()
    movie_results_dict[id] = {
        'name': str(movie_titles[movie_titles['ID'] == movie_id]['Name']),
        'regr_time': regr_time,
        'data_time': data_time,
        'regr': regr,
        'mse': mean_squared_error(y_test, y_pred_reg),
        'r2': r2_score(y_test, y_pred_reg)
    }
    
    # algorithm="full" (classic EM-style k-means) was renamed "lloyd" in scikit-learn 1.1
    kmeans = KMeans(n_clusters = np.max([int(X_train.shape[0]/50)+1, 3]) , random_state=0, algorithm="lloyd")
    kmeans.fit(X_train)
    cluster_members = kmeans.labels_
    pred_clusters = kmeans.predict(X_test)
    
    y_pred_avg_km = []
    y_pred_rand_km = []
    y_pred_clus_km = []
    y_pred_mode_km = []
    for k in range(len(pred_clusters)):
        person = pred_clusters[k]
        lookalike_y = y_train[np.where(cluster_members==person)]
        y_pred_avg_km.append([np.mean(lookalike_y)])
        y_pred_rand_km.append([np.random.choice(lookalike_y.flatten())])
        y_pred_mode_km.append(stats.mode(lookalike_y)[0][0])
        
        lookalike_x = X_train[np.where(cluster_members==person)]
        regr = LinearRegression()
        regr.fit(lookalike_x, lookalike_y)
        y_pred_clus_km.append(regr.predict(X_test[k].reshape(1, -1))[0])
        
    
    y_pred_avg_km = np.asmatrix(y_pred_avg_km)
    y_pred_mode_km = np.asmatrix(y_pred_mode_km)
    y_pred_rand_km = np.asmatrix(y_pred_rand_km)
    y_pred_clus_km = np.asmatrix(y_pred_clus_km)
    
    print('KMeans Clustering')
    print('Variance score mean: %.2f' % r2_score(y_test, y_pred_avg_km))
    print('Variance score mode: %.2f' % r2_score(y_test, y_pred_mode_km))
    print('Variance score random: %.2f' % r2_score(y_test, y_pred_rand_km))
    print('Variance score regress: %.2f' % r2_score(y_test, y_pred_clus_km))
    print()
    
    dbscan = DBSCAN()
    dbscan.fit(X_train)
    cluster_members = dbscan.labels_
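    # NOTE: DBSCAN has no predict(); fit_predict re-clusters X_test from scratch,
    # so these labels are not guaranteed to match the labels in cluster_members
    pred_clusters = dbscan.fit_predict(X_test)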
    
    y_pred_avg_db = []
    y_pred_mode_db = []
    y_pred_rand_db = []
    y_pred_clus_db = []
    
    for k in range(len(pred_clusters)):
        person = pred_clusters[k]
        lookalike_y = y_train[np.where(cluster_members==person)]
        y_pred_avg_db.append([np.mean(lookalike_y)])
        y_pred_mode_db.append(stats.mode(lookalike_y)[0][0])
        y_pred_rand_db.append([np.random.choice(lookalike_y.flatten())])
        
        lookalike_x = X_train[np.where(cluster_members==person)]
        regr = LinearRegression()
        regr.fit(lookalike_x, lookalike_y)
        y_pred_clus_db.append(regr.predict(X_test[k].reshape(1, -1))[0])
        
    
    y_pred_avg_db = np.asmatrix(y_pred_avg_db)
    y_pred_mode_db = np.asmatrix(y_pred_mode_db)
    y_pred_rand_db = np.asmatrix(y_pred_rand_db)
    y_pred_clus_db = np.asmatrix(y_pred_clus_db)
    
    print('DBSCAN Clustering')
    print('Variance score mean: %.2f' % r2_score(y_test, y_pred_avg_db))
    print('Variance score mode: %.2f' % r2_score(y_test, y_pred_mode_db))
    print('Variance score random: %.2f' % r2_score(y_test, y_pred_rand_db))
    print('Variance score regress: %.2f' % r2_score(y_test, y_pred_clus_db))
    print()
    
    
    spect = SpectralClustering(n_clusters = np.max([int(X_train.shape[0]/50)+1, 3]) , random_state=0)
    spect.fit(X_train)
    cluster_members = spect.labels_
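    # NOTE: SpectralClustering has no predict(); fit_predict re-clusters X_test,
    # so these labels need not match the labels in cluster_members
    pred_clusters = spect.fit_predict(X_test)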
    
    y_pred_avg_sc = []
    y_pred_mode_sc = []
    y_pred_rand_sc = []
    y_pred_clus_sc = []
    
    for k in range(len(pred_clusters)):
        person = pred_clusters[k]
        lookalike_y = y_train[np.where(cluster_members==person)]
        y_pred_avg_sc.append([np.mean(lookalike_y)])
        y_pred_mode_sc.append(stats.mode(lookalike_y)[0][0])
        y_pred_rand_sc.append([np.random.choice(lookalike_y.flatten())])
        
        lookalike_x = X_train[np.where(cluster_members==person)]
        regr = LinearRegression()
        regr.fit(lookalike_x, lookalike_y)
        y_pred_clus_sc.append(regr.predict(X_test[k].reshape(1, -1))[0])
        
    
    y_pred_avg_sc = np.asmatrix(y_pred_avg_sc)
    y_pred_mode_sc = np.asmatrix(y_pred_mode_sc)
    y_pred_rand_sc = np.asmatrix(y_pred_rand_sc)
    y_pred_clus_sc = np.asmatrix(y_pred_clus_sc)
    
    print('Spectral Clustering')
    print('Variance score mean: %.2f' % r2_score(y_test, y_pred_avg_sc))
    print('Variance score mode: %.2f' % r2_score(y_test, y_pred_mode_sc))
    print('Variance score random: %.2f' % r2_score(y_test, y_pred_rand_sc))
    print('Variance score regress: %.2f' % r2_score(y_test, y_pred_clus_sc))
    print()
    
    agg = AgglomerativeClustering(n_clusters = np.max([int(X_train.shape[0]/50)+1, 3]))
    agg.fit(X_train)
    cluster_members = agg.labels_
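    # NOTE: AgglomerativeClustering has no predict(); fit_predict re-clusters X_test,
    # so these labels need not match the labels in cluster_members
    pred_clusters = agg.fit_predict(X_test)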
    
    y_pred_avg_ac = []
    y_pred_mode_ac = []
    y_pred_rand_ac = []
    y_pred_clus_ac = []
    
    for k in range(len(pred_clusters)):
        person = pred_clusters[k]
        lookalike_y = y_train[np.where(cluster_members==person)]
        y_pred_avg_ac.append([np.mean(lookalike_y)])
        y_pred_mode_ac.append(stats.mode(lookalike_y)[0][0])
        y_pred_rand_ac.append([np.random.choice(lookalike_y.flatten())])
        
        lookalike_x = X_train[np.where(cluster_members==person)]
        regr = LinearRegression()
        regr.fit(lookalike_x, lookalike_y)
        y_pred_clus_ac.append(regr.predict(X_test[k].reshape(1, -1))[0])
        
    
    y_pred_avg_ac = np.asmatrix(y_pred_avg_ac)
    y_pred_mode_ac = np.asmatrix(y_pred_mode_ac)
    y_pred_rand_ac = np.asmatrix(y_pred_rand_ac)
    y_pred_clus_ac = np.asmatrix(y_pred_clus_ac)
    
    print('Agglomerative Clustering')
    print('Variance score mean: %.2f' % r2_score(y_test, y_pred_avg_ac))
    print('Variance score mode: %.2f' % r2_score(y_test, y_pred_mode_ac))
    print('Variance score random: %.2f' % r2_score(y_test, y_pred_rand_ac))
    print('Variance score regress: %.2f' % r2_score(y_test, y_pred_clus_ac))
    print()
    
    
    
        
    
    
    
    

Movie #10, 9    Fighter
Name: Name, dtype: object, average rating: 3.18 in 249 reviews
X_train : (199, 17770) 	
X_test : (50, 17770) 	
y_train : (199, 1) 	
y_test : (50, 1) 	

No Clustering
Variance score: 0.16

KMeans Clustering
Variance score mean: 0.01
Variance score mode cluster: -0.01
Variance score random cluster: -1.11
Variance score regress cluster: 0.19

DBSCAN Clustering
Variance score mean: -0.01
Variance score mode cluster: -0.05
Variance score random cluster: -0.87
Variance score regress cluster: 0.16





Spectral Clustering
Variance score mean: -0.24
Variance score mode cluster: -0.60
Variance score random cluster: -1.18
Variance score regress cluster: -0.57

Agglomerative Clustering
Variance score mean: -0.86
Variance score mode cluster: -1.94
Variance score random cluster: -1.22
Variance score regress cluster: -3.35

Movie #11, 10    Full Frame: Documentary Shorts
Name: Name, dtype: object, average rating: 3.03 in 198 reviews
X_train : (158, 17770) 	
X_test : (40, 17770) 	
y_train : (158, 1) 	
y_test : (40, 1) 	

No Clustering
Variance score: -0.34

KMeans Clustering
Variance score mean: -0.23
Variance score mode cluster: -0.21
Variance score random cluster: -0.73
Variance score regress cluster: -0.36

DBSCAN Clustering
Variance score mean: -0.18
Variance score mode cluster: -0.13
Variance score random cluster: -0.30
Variance score regress cluster: -0.34





Spectral Clustering
Variance score mean: -0.12
Variance score mode cluster: -0.13
Variance score random cluster: -1.01
Variance score regress cluster: -0.08

Agglomerative Clustering
Variance score mean: -0.26
Variance score mode cluster: -0.21
Variance score random cluster: -0.71
Variance score regress cluster: -0.15

Movie #12, 11    My Favorite Brunette
Name: Name, dtype: object, average rating: 3.42 in 546 reviews
X_train : (436, 17770) 	
X_test : (110, 17770) 	
y_train : (436, 1) 	
y_test : (110, 1) 	

No Clustering
Variance score: 0.22

KMeans Clustering
Variance score mean: -0.05
Variance score mode cluster: -0.26
Variance score random cluster: -0.84
Variance score regress cluster: 0.05

DBSCAN Clustering
Variance score mean: -0.00
Variance score mode cluster: -0.15
Variance score random cluster: -0.98
Variance score regress cluster: 0.22





Spectral Clustering
Variance score mean: -0.12
Variance score mode cluster: -0.06
Variance score random cluster: -0.52
Variance score regress cluster: -0.11

Agglomerative Clustering
Variance score mean: -0.05
Variance score mode cluster: -0.09
Variance score random cluster: -0.62
Variance score regress cluster: 0.03

Movie #13, 12    Lord of the Rings: The Return of the King: Ext...
Name: Name, dtype: object, average rating: 4.55 in 125 reviews
X_train : (100, 17770) 	
X_test : (25, 17770) 	
y_train : (100, 1) 	
y_test : (25, 1) 	

No Clustering
Variance score: -0.09

KMeans Clustering
Variance score mean: 0.03
Variance score mode cluster: -0.66
Variance score random cluster: -1.34
Variance score regress cluster: -0.16

DBSCAN Clustering
Variance score mean: -0.02
Variance score mode cluster: -0.66
Variance score random cluster: -2.03
Variance score regress cluster: -0.09





Spectral Clustering
Variance score mean: -0.03
Variance score mode cluster: -0.66
Variance score random cluster: -1.34
Variance score regress cluster: -0.13

Agglomerative Clustering
Variance score mean: 0.06
Variance score mode cluster: -0.66
Variance score random cluster: -0.56
Variance score regress cluster: -0.06

Movie #14, 13    Nature: Antarctica
Name: Name, dtype: object, average rating: 3.03 in 118 reviews
X_train : (94, 17770) 	
X_test : (24, 17770) 	
y_train : (94, 1) 	
y_test : (24, 1) 	

No Clustering
Variance score: 0.06

KMeans Clustering
Variance score mean: -0.01
Variance score mode cluster: -0.21
Variance score random cluster: -0.21
Variance score regress cluster: 0.02

DBSCAN Clustering
Variance score mean: -0.00
Variance score mode cluster: -0.00
Variance score random cluster: -0.57
Variance score regress cluster: 0.06





Spectral Clustering
Variance score mean: -0.03
Variance score mode cluster: -0.00
Variance score random cluster: -1.21
Variance score regress cluster: 0.06

Agglomerative Clustering
Variance score mean: -0.84
Variance score mode cluster: -2.16
Variance score random cluster: -1.49
Variance score regress cluster: -1.82
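
One caveat when reading the DBSCAN, spectral, and agglomerative scores above: those estimators have no `predict` method, and the cell calls `fit_predict(X_test)`, which clusters the test set on its own. Test label `k` therefore need not refer to the same group as training label `k`. One possible workaround, sketched here as an assumption rather than as part of the assignment, is to assign each test point to the training cluster with the nearest centroid. The helper name `assign_to_nearest_cluster` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def assign_to_nearest_cluster(X_train, train_labels, X_new):
    """Label each row of X_new with the training cluster whose centroid is closest."""
    cluster_ids = np.unique(train_labels)
    centroids = np.vstack(
        [X_train[train_labels == c].mean(axis=0) for c in cluster_ids]
    )
    # Squared Euclidean distance from every new point to every centroid
    dists = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return cluster_ids[np.argmin(dists, axis=1)]

# Toy data: two well-separated blobs
rng = np.random.RandomState(0)
X_train_toy = np.vstack([rng.normal(0, 0.1, (20, 2)),
                         rng.normal(5, 0.1, (20, 2))])
X_test_toy = np.array([[0.1, -0.1], [4.9, 5.2]])

agg_toy = AgglomerativeClustering(n_clusters=2).fit(X_train_toy)
test_labels = assign_to_nearest_cluster(X_train_toy, agg_toy.labels_, X_test_toy)

print(test_labels)
```

For DBSCAN, the label -1 marks noise points rather than a real cluster, so this sketch would treat the noise points as one more group. The broadcasted distance array also grows with the number of features and clusters, so it would need batching on the full 17,770-column matrices.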



In [26]:
movie_results_dict[3]

{'data_time': 1.1898272037506104,
 'mse': 1.6863325,
 'name': '2    Character\nName: Name, dtype: object',
 'r2': -0.74125852269030212,
 'regr': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 'regr_time': 4.725301027297974}

In [22]:
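review_nums = []
# Movie IDs start at 1 (column 0 is empty); note that this loop stops at ID 17699
for i in range(1, 17700):
    num_reviews = ratings_small[:,i].count_nonzero()
    if num_reviews == 0:
        continue  # skip movies with no ratings to avoid dividing by zero
    review_nums.append((i, num_reviews, np.sum(ratings_small[:,i]) / num_reviews))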
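
The loop above slices the sparse matrix once per movie, which is slow, and a column with no ratings leads to a division by zero. A vectorized alternative is sketched below on a toy users-by-movies matrix. The toy data and variable names are assumptions for illustration. It counts the stored ratings per column, sums the stars per column, and divides only where a movie has at least one rating.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy users x movies matrix; 0 means "not rated". Column 0 is an unused ID.
toy = csr_matrix(np.array([
    [0, 5, 0, 3],
    [0, 4, 0, 0],
    [0, 0, 0, 1],
], dtype=np.float32))

num_reviews = toy.getnnz(axis=0)                    # ratings per movie
rating_sums = np.asarray(toy.sum(axis=0)).ravel()   # total stars per movie

# Divide only where a movie has at least one rating; leave NaN elsewhere
avg_rating = np.full(toy.shape[1], np.nan)
rated = num_reviews > 0
avg_rating[rated] = rating_sums[rated] / num_reviews[rated]

print(num_reviews)   # [0 2 0 2]
print(avg_rating)    # NaN for unrated columns, 4.5 and 2.0 for the rated ones
```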



In [24]:
s = sorted(review_nums, key=lambda x: x[2])
for movie_id, num, avg_review in s[-20:]:
    print ('%s\t%s\t%s' % (num, avg_review, movie_by_id[movie_id]))

681	4.5389133627	Fruits Basket (2001)
17292	4.54256303493	The Simpsons: Season 5 (1993)
92470	4.54370065967	Star Wars: Episode V: The Empire Strikes Back (1980)
134284	4.54512078878	Lord of the Rings: The Return of the King (2003)
125	4.552	Lord of the Rings: The Return of the King: Extended Edition: Bonus Material (2003)
1883	4.55443441317	Inu-Yasha (2000)
8426	4.58129598861	The Simpsons: Season 6 (1994)
6621	4.58238936717	Arrested Development: Season 2 (2004)
220	4.58636363636	Ghost in the Shell: Stand Alone Complex: 2nd Gig (2005)
1238	4.59208400646	Veronica Mars: Season 1 (2004)
139660	4.59338393241	The Shawshank Redemption: Special Edition (1994)
89	4.59550561798	Tenchi Muyo! Ryo Ohki (1995)
25	4.6	Trailer Park Boys: Season 4 (2003)
75	4.6	Trailer Park Boys: Season 3 (2003)
1633	4.60502143295	Fullmetal Alchemist (2004)
1747	4.63880938752	Battlestar Galactica: Season 1 (2004)
7249	4.67098910195	Lost: Season 1 (2004)
74912	4.70261106365	Lord of the Rings: The Two Towers: Extended Ed