# HW3: Netflix Data Analysis

In this homework assignment, you will analyze the Netflix Prize data. The data consist of 100,480,507 movie ratings on a scale from 1 to 5 stars. The reviews are distributed across 17,770 movies and 480,189 users. We have provided the training data as a sparse matrix where the row corresponds to the movie ID and the column corresponds to the user ID; an entry of 0 means the user did not rate that movie. A separate file contains the title and year of release for each movie. The original, raw data consist of multiple lists of tuples; each list is a separate movie and each tuple is User ID, Rating, and Rating Year.
The original data can be downloaded here: https://archive.org/download/nf_prize_dataset.tar
Further information about the Netflix Prize is available online:
https://en.wikipedia.org/wiki/Netflix_Prize
https://www.netflixprize.com/
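
To make the layout of the sparse matrix concrete, here is a minimal sketch of how per-movie lists of (User ID, Rating, Rating Year) tuples could be packed into a movie-by-user CSR matrix. The toy dictionary `raw` and the name `toy_csr` are hypothetical, and this is not necessarily how the provided file was built. It assumes that row and column indices equal the 1-based IDs, so row 0 and column 0 stay empty, which is consistent with the provided matrix having 17,771 rows for 17,770 movies.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data in the raw layout: one list per movie ID,
# each tuple is (user ID, rating, rating year).
raw = {
    1: [(3, 4, 2004), (5, 1, 2005)],
    2: [(3, 5, 2004)],
    4: [(1, 2, 2003), (5, 3, 2005), (6, 4, 2005)],
}

# Flatten into coordinate lists: one (row, column, value) triple per rating
rows, cols, vals = [], [], []
for movie_id, ratings in raw.items():
    for user_id, rating, _year in ratings:
        rows.append(movie_id)
        cols.append(user_id)
        vals.append(rating)

# Assumption: indices equal the 1-based IDs, so index 0 is left unused
n_movies = max(rows) + 1
n_users = max(cols) + 1
toy_csr = csr_matrix((vals, (rows, cols)), shape=(n_movies, n_users), dtype=np.int8)

print(toy_csr.shape)   # (5, 7)
print(toy_csr.nnz)     # 6 stored ratings
print(toy_csr[4, 5])   # rating that user 5 gave movie 4
```

Only the six ratings are stored; every other cell is an implicit 0, which is what the assignment treats as a nonresponse.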

In [14]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import scipy.sparse

In [15]:
# This file consists of titles and release years associated with each ID
movie_titles = pd.read_csv('movie_titles.txt', header = None, names = ['ID','Year','Name'])
print(movie_titles.head())

   ID    Year                          Name
0   1  2003.0               Dinosaur Planet
1   2  2004.0    Isle of Man TT 2004 Review
2   3  1997.0                     Character
3   4  1994.0  Paula Abdul's Get Up & Dance
4   5  2004.0      The Rise and Fall of ECW


In [16]:
# This file is a sparse matrix of movies by user, with each element a rating (1-5) or nonresponse (0)
ratings_csr = scipy.sparse.load_npz('netflix_full_csr.npz')
print(ratings_csr.shape)

(17771, 2649430)
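
The shape alone hides how empty this matrix is. The sketch below computes the density and the storage footprint of a small hypothetical stand-in (`toy`), then repeats the density arithmetic for the full shape under the assumption that the file stores all 100,480,507 ratings. On the real data the same expressions can be applied to `ratings_csr` directly.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for ratings_csr: 4 movies x 6 users, 5 ratings
dense = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 5, 0],
    [0, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0, 2],
], dtype=np.int8)
toy = csr_matrix(dense)

n_cells = toy.shape[0] * toy.shape[1]
density = toy.nnz / n_cells

# CSR keeps three arrays: the values, their column indices, and row pointers
csr_bytes = toy.data.nbytes + toy.indices.nbytes + toy.indptr.nbytes
dense_bytes = n_cells * toy.data.itemsize

print(f"stored ratings: {toy.nnz} of {n_cells} cells ({density:.1%})")
print(f"CSR bytes: {csr_bytes}, dense bytes: {dense_bytes}")

# Same arithmetic for the full matrix shape, assuming every rating is stored
full_cells = 17771 * 2649430
full_density = 100_480_507 / full_cells
print(f"full matrix: {full_cells:,} cells, density {full_density:.2%}")
```

On a matrix this small the CSR bookkeeping costs more than the dense array. The savings appear only when the density is very low, as it is for the full dataset.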


To avoid memory overflow errors, we randomly subsample the data. Some computers can handle the full dataset (e.g. a 2017 MacBook Pro can perform SVD on the full dataset), but older computers will likely need to subsample. You can also consider using Princeton computing resources and clusters for more computationally expensive analyses.

In [17]:
#n_samples = 5000
n_viewers = 10000
#random_sample_movies = np.random.choice(17771, size = n_samples)
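# Sample distinct user columns (replace = False avoids drawing the same user twice)
random_sample_viewers = np.random.choice(ratings_csr.shape[1], size = n_viewers, replace = False)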
ratings_small = ratings_csr[:,random_sample_viewers]

In [18]:
ratings_small.shape

(17771, 10000)
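
Because the matrix has 2,649,430 columns but only 480,189 users, most column indices drawn uniformly at random belong to no user and contribute an empty column. One possible refinement, sketched here on a hypothetical toy matrix, is to sample only from columns that contain at least one rating. The names `toy`, `active_users`, and `toy_small` are illustrative, and the seeded generator is there only to make the sketch reproducible.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Hypothetical stand-in for ratings_csr; columns 0 and 2 have no ratings
dense = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 5, 0],
    [0, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0, 2],
], dtype=np.int8)
toy = csr_matrix(dense)

# In CSC format, consecutive indptr entries delimit one column each,
# so their differences count the stored ratings per user
toy_csc = toy.tocsc()
ratings_per_user = np.diff(toy_csc.indptr)
active_users = np.flatnonzero(ratings_per_user > 0)

# Draw distinct users from the non-empty columns only
n_viewers = 3
sample = rng.choice(active_users, size=n_viewers, replace=False)
toy_small = toy_csc[:, sample].tocsr()

print(active_users)
print(toy_small.shape)
```

Column slicing is also cheaper on a CSC matrix than on a CSR matrix, which is why the slice is taken before converting back.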

A common method for analyzing large datasets is dimension reduction. Here we perform a truncated SVD, which is suited to sparse datasets, and analyze which movies are associated with different latent dimensions.

In [19]:
from sklearn.decomposition import TruncatedSVD

In [20]:
n_components = 5
svd = TruncatedSVD(n_components = n_components)

In [21]:
Z = svd.fit_transform(ratings_small)

In [22]:
components = svd.components_

In [23]:
print(svd.explained_variance_ratio_)

[0.23049945 0.02934592 0.02127687 0.01800105 0.01420581]
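
To see what `Z` and `components` represent, the following sketch reproduces the same decomposition with a full SVD in NumPy on a small hypothetical dense matrix. With X = U S V^T, keeping the top k singular values gives movie scores U_k S_k (the role played by `fit_transform`) and user loadings V_k^T (the role played by `svd.components_`). This is an illustration under those assumptions, not the solver `TruncatedSVD` uses internally. Like `TruncatedSVD`, it does not center the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings: 8 movies x 5 users, values 0-5
X = rng.integers(0, 6, size=(8, 5)).astype(float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Z_k = U[:, :k] * s[:k]       # movie scores, one column per component
components_k = Vt[:k]        # user loadings, one row per component
X_k = Z_k @ components_k     # rank-k approximation of X

# The scores are the projection of X onto the top-k right singular vectors
print(np.allclose(Z_k, X @ components_k.T))

# Squared reconstruction error equals the sum of the discarded s**2
print(np.linalg.norm(X - X_k) ** 2, (s[k:] ** 2).sum())

# Rank movies (rows) by absolute score on component 0, largest first
top = np.argsort(np.abs(Z_k[:, 0]))[::-1][:3]
print(top)
```

The sign of each singular vector is arbitrary, so ranking by absolute score is the safer way to find the movies most strongly tied to a component.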


In [24]:
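# For each latent component, print the four movies with the largest absolute score
for i in range(0,n_components):
    Z_sort = np.argsort(np.abs(Z[:,i]))  # ascending order, so the largest are at the end
    print('Component ' + str(i))
    for j in range(1,5):
        movie_index = Z_sort[-j]  # row index of the matrix, which equals the movie ID
        # Selecting by ID returns a one-row Series, so str() below also prints its index and dtype
        movie_title = movie_titles[movie_titles['ID'] == movie_index]['Name']
        movie_weight = Z[movie_index,i]
        print(str(movie_title) + ': ' + str(movie_weight))
    print(' ')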

Component 0
1904    Pirates of the Caribbean: The Curse of the Bla...
Name: Name, dtype: object: 88.78162789618499
11282    Forrest Gump
Name: Name, dtype: object: 85.88032850708947
4305    The Sixth Sense
Name: Name, dtype: object: 81.1467148710099
11520    Lord of the Rings: The Two Towers
Name: Name, dtype: object: 80.09479985050731
 
Component 1
5316    Miss Congeniality
Name: Name, dtype: object: 36.60423551376191
15204    The Day After Tomorrow
Name: Name, dtype: object: 34.53614102489464
4995    Gone in 60 Seconds
Name: Name, dtype: object: 31.699897842086326
15123    Independence Day
Name: Name, dtype: object: 31.188924989118423
 
Component 2
12231    Lost in Translation
Name: Name, dtype: object: 30.27038541655473
570    American Beauty
Name: Name, dtype: object: 26.345542047122937
5861    Memento
Name: Name, dtype: object: 25.728237532983837
12581    Mystic River
Name: Name, dtype: object: 25.589537428172793
 
Component 3
4576    Steel Magnolias
Name: Name, dtype: object: 28.