# HW3: Netflix Data Analysis

In this homework assignment, you will analyze the Netflix Prize data. The data consist of 100,480,507 movie ratings on a scale from 1 to 5 stars. The reviews are distributed across 17,770 movies and 480,189 users. We have provided the training data as a sparse matrix in which each row corresponds to a movie ID and each column corresponds to a user ID; a missing rating is stored as 0. A separate file contains the title and year of release for each movie. The original, raw data consist of multiple lists of tuples; each list is a separate movie and each tuple is User ID, Rating, and Rating Year. 
The original data can be downloaded here: https://archive.org/download/nf_prize_dataset.tar
Further information about the Netflix Prize is available online: 
https://en.wikipedia.org/wiki/Netflix_Prize
https://www.netflixprize.com/
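
To make the link between the raw tuple lists and the sparse matrix concrete, here is a minimal sketch of how a movie-by-user matrix could be assembled. The `raw` dictionary and its values are made up for illustration, and this is not necessarily how the provided file was built. Because rows and columns are indexed directly by ID, and IDs start at 1, row 0 and column 0 stay empty, which would be consistent with the provided matrix having 17,771 rows for 17,770 movies.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: movie ID -> list of (user ID, rating, rating year)
raw = {
    1: [(6, 3, 2005), (7, 5, 2004)],
    2: [(6, 4, 2005)],
    3: [(2, 1, 2003), (7, 2, 2005)],
}

# Collect coordinate triplets: row = movie ID, column = user ID, value = rating
rows, cols, vals = [], [], []
for movie_id, reviews in raw.items():
    for user_id, rating, _year in reviews:
        rows.append(movie_id)
        cols.append(user_id)
        vals.append(rating)

# IDs are used directly as indices, so each dimension is (largest ID + 1)
n_movies = max(rows) + 1
n_users = max(cols) + 1
toy_ratings = csr_matrix((vals, (rows, cols)), shape=(n_movies, n_users), dtype=np.int8)

print(toy_ratings.shape)
print(toy_ratings.toarray())
```

Entries that were never set read back as 0, which is how a nonresponse is distinguished from a rating of 1 to 5.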

In [44]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import scipy.sparse
from sklearn import metrics
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans, MiniBatchKMeans


In [28]:
# This file consists of titles and release years associated with each ID
movie_titles = pd.read_csv('movie_titles.txt', header = None, names = ['ID','Year','Name'])
print(movie_titles.head())
print(movie_titles.shape)

   ID    Year                          Name
0   1  2003.0               Dinosaur Planet
1   2  2004.0    Isle of Man TT 2004 Review
2   3  1997.0                     Character
3   4  1994.0  Paula Abdul's Get Up & Dance
4   5  2004.0      The Rise and Fall of ECW
(17770, 3)


In [3]:
# This file is a sparse matrix of movies by user, with each element a rating (1-5) or nonresponse (0)
ratings_csr = scipy.sparse.load_npz('netflix_full_csr.npz')
print(ratings_csr.shape)

(17771, 2649430)


To avoid out-of-memory errors, we randomly subsample the data. Some computers can handle the full dataset (e.g., a 2017 MacBook Pro can perform SVD on the full dataset), but older computers will likely need to subsample. You can also consider using Princeton computing resources and clusters for more computationally expensive analyses.
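
To get a feel for the memory involved, the sketch below measures the footprint of a CSR matrix by adding up its three underlying arrays. The helper `csr_nbytes` and the toy matrix are illustrative assumptions, not part of the assignment files. For comparison, a dense 17771 x 2649430 array would need about 47 GB even at one byte per entry.

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_nbytes(m):
    """Bytes used by the three arrays that back a CSR matrix (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# Hypothetical toy matrix: 200 movies x 500 users, about 1% of entries rated
rng = np.random.default_rng(0)
mask = rng.random((200, 500)) < 0.01
dense = (rng.integers(1, 6, size=(200, 500)) * mask).astype(np.int8)
toy = csr_matrix(dense)

print('sparse bytes:', csr_nbytes(toy))
print('dense bytes: ', dense.nbytes)

# Size of a dense one-byte-per-entry array with the full dataset's shape
full_dense_bytes = 17771 * 2649430
print('full dense size in GB:', full_dense_bytes / 1e9)
```

Running the same helper on `ratings_csr` or `ratings_small` is one way to check whether a given subsample is likely to fit in memory before starting an expensive computation.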

In [4]:
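#n_samples = 5000
n_viewers = 10000
#random_sample_movies = np.random.choice(17771, size = n_samples)
# Draw user columns without replacement so that no column is selected twice
random_sample_viewers = np.random.choice(ratings_csr.shape[1], size = n_viewers, replace = False)
# Keep every movie (row) but only the sampled user columns
ratings_small = ratings_csr[:,random_sample_viewers]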

In [38]:
ratings_small.shape


(17771, 10000)
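
Only 480,189 of the 2,649,430 columns belong to users who rated anything, so a uniform draw over all columns returns mostly empty columns. One possible refinement, sketched here on a made-up toy matrix, is to restrict the draw to columns with at least one rating. The names `toy`, `active_users`, and `toy_small` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 4 movies x 10 user columns; only columns 2, 5, 7 hold ratings
rows = [0, 1, 3, 2, 1]
cols = [2, 2, 5, 7, 7]
vals = [4, 5, 3, 1, 2]
toy = csr_matrix((vals, (rows, cols)), shape=(4, 10))

# In CSC form, consecutive differences of indptr give the stored entries per column
ratings_per_user = np.diff(toy.tocsc().indptr)
active_users = np.flatnonzero(ratings_per_user)

# Sample only among users who rated at least one movie
sample = rng.choice(active_users, size=2, replace=False)
toy_small = toy[:, sample]

print(active_users)
print(toy_small.shape)
```

The trade-off is that the sampled columns no longer represent the full ID range, which only matters if the empty columns carry meaning for your analysis.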

A common method for analyzing large datasets is dimension reduction. Here we perform a truncated SVD, which is suited to sparse datasets, and analyze which movies are associated with the different latent dimensions.
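
As a small illustration of what the decomposition returns, the sketch below runs `TruncatedSVD` on a made-up 20 x 15 matrix. The toy data and all `*_toy` names are assumptions, and `algorithm='arpack'` is chosen here only so that the result is exact rather than randomized. Each row of the transformed matrix gives one movie's coordinates in the latent space, and `components_` holds the corresponding directions in user space.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy ratings: 20 movies x 15 users, roughly 30% of entries filled
rng = np.random.default_rng(0)
mask = rng.random((20, 15)) < 0.3
dense = (rng.integers(1, 6, size=(20, 15)) * mask).astype(float)
X_toy = csr_matrix(dense)

toy_svd = TruncatedSVD(n_components=3, algorithm='arpack')
Z_toy = toy_svd.fit_transform(X_toy)   # movie scores, shape (20, 3)
V_t = toy_svd.components_              # latent directions, shape (3, 15)

# Multiplying the two factors gives a rank-3 approximation of the original matrix
X_approx = Z_toy @ V_t
rel_error = np.linalg.norm(dense - X_approx) / np.linalg.norm(dense)

print(Z_toy.shape, V_t.shape)
print('relative reconstruction error:', rel_error)
```

Unlike PCA, this decomposition does not center the data first, which is what lets it operate on the sparse matrix directly.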

In [6]:
from sklearn.decomposition import TruncatedSVD

In [76]:
n_components = 5
svd = TruncatedSVD(n_components = n_components)

In [77]:
Z = svd.fit_transform(ratings_small)

In [78]:
components = svd.components_

In [61]:
print(svd.explained_variance_ratio_)

[0.22393838 0.03198849 0.02131863 0.01907654 0.01460467 0.01018762
 0.01034069 0.00751837 0.00695364 0.00632032]


In [79]:
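for i in range(0,n_components):
    # Rank movies (rows of Z) by the absolute value of their score on component i
    Z_sort = np.argsort(np.abs(Z[:,i]))
    print('Component ' + str(i))
    # Walk backwards from the end of the ascending sort: the 9 largest scores
    for j in range(1,10):
        movie_index = Z_sort[-j]
        # The row index equals the movie ID; this lookup returns a pandas Series, not a plain string
        movie_title = movie_titles[movie_titles['ID'] == movie_index]['Name']
        movie_weight = Z[movie_index,i]
        print(str(movie_title) + ': ' + str(movie_weight))
    print(' ')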

Component 0
11282    Forrest Gump
Name: Name, dtype: object: 88.68420397956177
1904    Pirates of the Caribbean: The Curse of the Bla...
Name: Name, dtype: object: 88.47268211312237
14549    The Shawshank Redemption: Special Edition
Name: Name, dtype: object: 86.2505252966183
16376    The Green Mile
Name: Name, dtype: object: 83.72260873929498
4305    The Sixth Sense
Name: Name, dtype: object: 82.45326222018026
2451    Lord of the Rings: The Fellowship of the Ring
Name: Name, dtype: object: 80.6694396742571
15123    Independence Day
Name: Name, dtype: object: 80.28584134349208
11520    Lord of the Rings: The Two Towers
Name: Name, dtype: object: 80.18793303418474
14690    The Matrix
Name: Name, dtype: object: 76.97298653160429
 
Component 1
5316    Miss Congeniality
Name: Name, dtype: object: 33.14044785539319
9339    Pearl Harbor
Name: Name, dtype: object: 32.086210934873
6971    Armageddon
Name: Name, dtype: object: 30.990235609518397
14537    How to Lose a Guy in 10 Days
Name: Name,
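
The `Name: Name, dtype: object` lines in the output above appear because a pandas Series, rather than a string, is passed to `str()`. A minimal sketch of one way to print only the title, using a three-row stand-in for `movie_titles`; the names `toy_titles`, `title_by_id`, and `format_row`, and the weight of 1.5, are hypothetical:

```python
import pandas as pd

# Stand-in for the movie_titles table, using its first three rows
toy_titles = pd.DataFrame({
    'ID': [1, 2, 3],
    'Year': [2003.0, 2004.0, 1997.0],
    'Name': ['Dinosaur Planet', 'Isle of Man TT 2004 Review', 'Character'],
})

# Index by movie ID so that a lookup returns the title string itself
title_by_id = toy_titles.set_index('ID')['Name']

def format_row(movie_id, weight):
    """Format one output line as 'title: weight' (hypothetical helper)."""
    return f"{title_by_id.loc[movie_id]}: {weight:.2f}"

line = format_row(2, 1.5)
print(line)
```

Looking titles up through an ID-indexed Series also avoids rescanning the whole table for every movie.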

In [84]:
svd.explained_variance_

array([67.65818843,  9.66463672,  6.44097229,  5.76357164,  4.41248925])
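
To relate `explained_variance_` to `explained_variance_ratio_`, the sketch below recomputes both on a made-up matrix: the former is the variance of each column of the transformed data, and the latter divides that by the summed variance of the original columns. The toy data and variable names are assumptions, and `algorithm='arpack'` is used only to keep the check exact.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy ratings: 30 movies x 12 users, roughly 30% of entries filled
rng = np.random.default_rng(0)
mask = rng.random((30, 12)) < 0.3
dense = (rng.integers(1, 6, size=(30, 12)) * mask).astype(float)
X_toy = csr_matrix(dense)

toy_svd = TruncatedSVD(n_components=4, algorithm='arpack')
Z_toy = toy_svd.fit_transform(X_toy)

# Variance of each latent coordinate across movies
var_per_component = Z_toy.var(axis=0)
# Total variance of the original data: sum of the per-column variances
total_var = dense.var(axis=0).sum()
ratio = var_per_component / total_var

# Running total of the variance captured by the first k components
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

print(var_per_component)
print(ratio)
print(cumulative)
```

The cumulative sum is a convenient way to judge how many components are worth keeping.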