# HW3: Netflix Data Analysis

In this homework assignment, you will analyze the netflix prize data. The data consist of 100,480,50 movie ratings on a scale from 0 to 5 stars. The reveiws are distributed across 17,770 movies and 480,189. We have provided the training data as a sparse matrix where the row corresponds to the movie ID and the column corresponds to the user ID. A seperate file contains the title and year of release for each movie. The original, raw data consists of multiple lists of tuples; each list is a separate movie and each tuple is User ID, Rating, and Rating Year. 
The original data can be downloaded here: https://archive.org/download/nf_prize_dataset.tar
Further information about the netflix prize is available online: 
https://en.wikipedia.org/wiki/Netflix_Prize
https://www.netflixprize.com/

In [2]:
import os, pickle, time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, coo_matrix, csc_matrix
import scipy.sparse as sparse
from scipy import stats
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN, SpectralClustering, AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.model_selection import train_test_split


In [3]:
# This file consists of titles and release years associated with each ID
movie_titles = pd.read_csv('movie_titles.txt', header = None, names = ['ID','Year','Name'], encoding='latin-1')
print(movie_titles.head())
print(movie_titles.shape)

movie_by_id = {}
for id, name, year in zip(movie_titles['ID'], movie_titles['Name'], movie_titles['Year']):
    if not (np.isnan(year)):
        year = str(int(year))
    else:
        year = 'NaN'
    movie_by_id[id] = name + ' ' + '(' + year + ')'

   ID    Year                          Name
0   1  2003.0               Dinosaur Planet
1   2  2004.0    Isle of Man TT 2004 Review
2   3  1997.0                     Character
3   4  1994.0  Paula Abdul's Get Up & Dance
4   5  2004.0      The Rise and Fall of ECW
(17770, 3)


In [4]:
# This file is a sparse matrix of movies by user, with each element a rating (1-5) or nonresponse (0)
ratings_csr = sparse.load_npz('netflix_full_csr.npz')
print(ratings_csr.shape)

(17771, 2649430)


To avoid memory overflow errors we have randomly subsampled the data. Some computers can handle the full dataset (e.g. 2017 Macbook Pro can perform SVD on the full dataset). Older computers likely need to subsample the data. You can consider using Princeton computing resources and clusters to perform more computationally expensive analysis.

In [5]:
#n_samples = 5000
#n_viewers = 500000
#random_sample_movies = np.random.choice(17771, size = n_samples)
#random_sample_viewers = np.random.choice(2649430, size = n_viewers)
#ratings_small = ratings_csr[:,random_sample_viewers]

In [6]:
# Filter the matrix to remove rows with NO REVIEWS
start = time.time()
ratings_csc = ratings_csr.T
print('before removing users with no reviews: ', ratings_csc.shape)
non_zero_users_csc = ratings_csc[(ratings_csc.getnnz(axis=1) != 0)]
print(non_zero_users_csc.shape)

finish = time.time()
print('finished reduction in %.2f seconds' % (finish - start))
ratings_small = non_zero_users_csc

('before removing users with no reviews: ', (2649430, 17771))
(480189, 17771)
finished reduction in 27.70 seconds


In [176]:
# count the number of reviews for each film and store in review_nums list
reviews_by_movie = {}
csc_t = non_zero_users_csc.transpose()
for i in range(1, csc_t.shape[0]):
    movie_col = csc_t[i]
    num_reviews = movie_col.nnz
    reviews_by_movie[i] = {
        'num_reviews': num_reviews,
        'avg_review': movie_col.sum() / num_reviews
    }

In [7]:
if os.path.isfile('../../cache/cluster_cache.pickle'):
    with open('../../cache/cluster_cache.pickle', 'r') as f:
        [kmeans_all_users, clusters_all_users, clusters, counts] = pickle.load(f)
else:
    start = time.time()
    svd = TruncatedSVD(n_components = 15, algorithm="arpack", random_state=0)
    all_users_small = svd.fit_transform(ratings_small)
    finish = time.time()
    print(all_users_small.shape)
    print('finished svd in %.2f seconds' % (finish - start))



In [8]:
if os.path.isfile('../../cache/svd_cache.pickle'):
    with open('../../cache/svd_cache.pickle', 'r') as f:
        [svd, all_users_small] = pickle.load(f)
else:
    raise
    start = time.time()
    kmeans_all_users = KMeans(n_clusters = 20 , random_state=0, algorithm="full")
    kmeans_all_users.fit(all_users_small)
    finish = time.time()
    print('finished clustering in %.2f seconds' % (finish - start))
    clusters_all_users = kmeans_all_users.labels_
    clusters, counts = np.unique(clusters_all_users, return_counts=True)
    print(counts)



A common methods for analyzing large datasets is dimension reduction. Here we perform a truncated SVD suited for sparse datasets and analyze which movies are associated with different latent dimensions

In [52]:
y_pred_mode = []
y_pred_mean = []
y_test_all = []

results_dict = {}
max_reviews = 1000
for movie_id in range(5315, 5320):
    num_reviews = ratings_small[:,movie_id].count_nonzero()
    movie_name = movie_titles[movie_titles['ID'] == movie_id]['Name']
    print ('Movie #%s, %s, average rating: %.2f in %i reviews' % (
        movie_id,
        movie_name,
        np.sum(ratings_small[:,movie_id]) / num_reviews,
        int(num_reviews)
    ))
    
    start = time.time()
    filter_by = np.ravel((ratings_small[:,movie_id] != 0.0).toarray())
    filtered_clusters = clusters_all_users[filter_by]
    filtered_ratings = ratings_small[filter_by,:]
    
    movie_mask = np.ravel(np.full((filtered_ratings.shape[1], 1), True))
    movie_mask[movie_id] = False
    X = filtered_ratings[:,movie_mask]
    y = filtered_ratings[:,movie_id].toarray()

    
    #y_and_c = np.transpose(np.vstack([np.ravel(y), filtered_clusters]))
#     print(X.shape)
#     print(y.shape)
    #print(y_and_c.shape)
    
    X_train, X_test, y_train, y_test, c_train, c_test = train_test_split(X, y, filtered_clusters, test_size=0.2, random_state=0)
    
    ###
    finish = time.time()
    data_time = finish - start
    
    start = time.time()
    clusters_for_this_movie = np.unique(c_test)
    regrs = []
    user_masks = []
    total_y_pred = []
    total_y_test = []
    lens = []
    for cluster in clusters_for_this_movie:
        regr = LinearRegression()
        new_X_train = X_train[c_train==cluster]
        new_y_train = y_train[c_train==cluster]
        if new_y_train.shape[0] == 0:
            new_X_train = X_train
            new_y_train = y_train
        if new_X_train.shape[0] > max_reviews:
            random_mask = np.random.choice(new_X_train.shape[0], size=max_reviews, replace=False)
            new_X_train = new_X_train[random_mask]
            new_y_train = new_y_train[random_mask]
        lens.append(len(new_y_train))
        regr.fit(new_X_train.toarray(), new_y_train)
        regrs.append(regr)
        user_masks.append(user_mask)
        
    for i, cluster in enumerate(clusters_for_this_movie):
        new_X_test = X_test[c_test==cluster]
        new_y_test = y_test[c_test==cluster]
        new_y_pred = regrs[i].predict(new_X_test.toarray())
        total_y_pred = np.append(total_y_pred, new_y_pred)
        total_y_test = np.append(total_y_test, new_y_test)

    finish = time.time()
    regr_time = finish - start
    print 'regr took', regr_time
    
    current_r2 = r2_score(total_y_test, total_y_pred)
    current_mse = mean_squared_error(total_y_test, total_y_pred)
    
    results_dict[movie_id] = {
        'name': movie_by_id[movie_id],
        'id': movie_id,
        'regr_time': regr_time,
        'data_time': data_time,
        #'regr': regr,
        'mse': current_mse,
        'r2': current_r2,
    }
    print "MSE: %.2f \t\t r2: %.2f \t\t time: %.2f ... (%.2f + %.2f)" % (
        current_mse,
        current_r2,
        data_time + regr_time, data_time, regr_time
    )
    print
    ###
    
#     for (cluster, review) in zip(c_test, y_test):
#         y_test_all.append(review)
#         cluster_reviews = y_train[c_train == cluster]
#         mode, _ = stats.mode(y_train, axis =None)
#         mode_all = mode[0]
#         mean_all = np.mean(y_train)
        
#         if len(cluster_reviews) != 0:
#             mode, _ = stats.mode(cluster_reviews, axis =None)
#             y_pred_mode.append(mode[0])
#             y_pred_mean.append(np.mean(cluster_reviews))
#         else:
#             mode, _ = stats.mode(y_train, axis =None)
#             y_pred_mode.append(mode_all)
#             y_pred_mean.append(mean_all)
    
    #print'r2: %.2f' % r2_score(total_y_test, total_y_pred), '\t', 'MSE: %.2f' % mean_squared_error(total_y_test, total_y_pred)
    
        
    


Movie #5315, 5314    The Lost Boys: Special Edition: Bonus Material
Name: Name, dtype: object, average rating: 3.92 in 213 reviews
[9, 4, 7, 9, 3, 8, 15, 14, 5, 17, 13, 16, 7, 9, 8, 6]
regr took 0.159423112869
MSE: 0.84 		 r2: -0.19 		 time: 0.87 ... (0.71 + 0.16)

Movie #5316, 5315    The Lives of a Bengal Lancer
Name: Name, dtype: object, average rating: 3.33 in 110 reviews
[5, 7, 2, 9, 2, 88, 29, 3]
regr took 0.599139928818
MSE: 0.82 		 r2: 0.17 		 time: 1.30 ... (0.70 + 0.60)

Movie #5317, 5316    Miss Congeniality
Name: Name, dtype: object, average rating: 3.36 in 232944 reviews
[1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
regr took 139.868797064
MSE: 2.38 		 r2: -0.95 		 time: 154.52 ... (14.65 + 139.87)

Movie #5318, 5317    Tommy Boy
Name: Name, dtype: object, average rating: 3.60 in 78584 reviews
[861, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000,

In [17]:
ratings_csc = ratings_small
ratings_csr = csr_matrix(ratings_small)

In [117]:
# Caluclate variance of user and movie reviews
def variance_of_sparse_rows(sparse_matrix):
    squared_sums = np.zeros(sparse_matrix.shape[0])
    for row in range(sparse_matrix.shape[0]):
        if row % 100000 == 0:
            print row, ' ',
        squared_sums[row] = np.sum(np.square(np.ravel(sparse.find(sparse_matrix[row])[2])))
    squared_sums 
    print squared_sums.shape
    num = np.ravel(sparse_matrix.getnnz(axis=1))
    mean = np.ravel(sparse_matrix.sum(axis=1)) / num
    return (squared_sums / num) - np.square(mean)

user_variance = variance_of_sparse_rows(ratings_csr)
movie_variance = variance_of_sparse_rows(ratings_csc.transpose())
movie_variance[0] = 0.0

user_average = np.ravel(ratings_csc.sum(axis=1)) / np.ravel(ratings_csc.getnnz(axis=1))
movie_average = np.ravel(ratings_csr.sum(axis=0)) / np.ravel(ratings_csr.getnnz(axis=0))
movie_average[0] = 0

0   100000   200000   300000   400000   (480189,)
0   (17771,)




In [115]:
# average_index = clusters_all_users.shape[0]

# new_data = coo_matrix(
#     (np.ones(clusters_all_users.shape[0]),
#     (
#         np.arange(clusters_all_users.shape[0]),
#         clusters_all_users)
#     )
# ) 
# csr_new_data = csr_matrix(new_data)

# csr_new_data.shape

(480189, 20)

In [268]:
# find the average movie review
rating_average = ratings_csr.sum() / ratings_csc.getnnz()
summ = 0
# find the average movie review per cluster
cluster_averages = []
for cluster in clusters:
    clust_data = ratings_csr[clusters_all_users == cluster]
    cluster_averages.append(clust_data.sum() / clust_data.getnnz())
    summ += clust_data.getnnz()


In [306]:
ratings_csr

<480189x17771 sparse matrix of type '<type 'numpy.float32'>'
	with 100479926 stored elements in Compressed Sparse Row format>

In [299]:
def user_avg_without_movie(user_id, movie_id):
    sums = ratings_csr[user_id,:movie_id].sum() + ratings_csr[user_id,movie_id+1:].sum()
    nums = ratings_csr[user_id,:movie_id].getnnz() + ratings_csr[user_id,movie_id+1:].getnnz()
    return sums / nums
print user_avg_without_movie(4, 12918)
print user_average[4]
print

def cluster_avg_without_user(cluster_id, user_id):
    sums = ratings_csr[clusters_all_users == cluster_id].sum() - ratings_csr[user_id].sum()
    nums = ratings_csr[clusters_all_users == cluster_id].getnnz() - ratings_csr[user_id].getnnz()
    return sums / nums

print cluster_avg_without_user(7, 55373)
print cluster_averages[7]

3.5384615384615383
3.4814814814814814

3.5805869946980584
3.5744080425719202


In [292]:
clusters_all_users[55373]

7

In [305]:
results_dict = {}
max_reviews = 1000
master_pred = []
master_test = []
for movie_id in range(100,200):
    num_reviews = ratings_small[:,movie_id].count_nonzero()
    movie_name = movie_titles[movie_titles['ID'] == movie_id]['Name']
    print ('Movie #%s, %s, average rating: %.2f in %i reviews' % (
        movie_id,
        movie_name,
        np.sum(ratings_small[:,movie_id]) / num_reviews,
        int(num_reviews)
    ))
    
    start = time.time()
    
    movie_mask = np.ravel(np.full((ratings_small.shape[1], 1), True))
    movie_mask[movie_id] = False
    
    filter_by = np.ravel((ratings_small[:,movie_id] != 0.0).toarray())
    filtered_clusters = clusters_all_users[filter_by]
    filtered_ratings = ratings_small[filter_by,:]
    
    indexes = np.ravel(np.where(filter_by)[0])
    
    X = filtered_ratings
    y = filtered_ratings[:,movie_id].toarray()
    
    X_train, X_test, y_train, y_test, c_train, c_test, i_train, i_test = train_test_split(X, y, filtered_clusters, indexes, test_size=0.2, random_state=0)
    
    y_pred_mean = []
    y_test_all = []
    
    mean_all = np.mean(y_train)
    cluster_predictions = [
        np.mean(y_train[c_train == cluster]) if len(y_train[c_train == cluster]) else None for cluster in clusters
    ]
    
    for (index, cluster, review) in zip(i_test, c_test, y_test):
        y_test_all.append(review)
        this_user_avg = user_avg_without_movie(index, movie_id)
        if cluster_predictions[cluster] is not None:
            if np.isnan(this_user_avg):
                this_user_avg = cluster_averages[cluster]
            user_difference = this_user_avg - cluster_averages[cluster]
            y_pred_mean.append(cluster_predictions[cluster] + user_difference)
        else:
            if np.isnan(this_user_avg):
                this_user_avg = mean_all
            user_difference = this_user_avg - mean_all
            y_pred_mean.append(mean_all + user_difference)


    # compute metrics
    master_pred.extend(y_pred_mean)
    master_test.extend(y_test)
        
    current_r2 = r2_score(y_test, y_pred_mean)
    current_mse = mean_squared_error(y_test, y_pred_mean)
    
    finish = time.time()
    results_dict[movie_id] = {
        'name': movie_by_id[movie_id],
        'id': movie_id,
#         'regr_time': regr_time,
#         'data_time': data_time,
        #'regr': regr,
        'mse': current_mse,
        'r2': current_r2,
    }
    print "MSE: %.2f \t\t r2: %.2f \t\t time: %.2f" % (
        current_mse,
        current_r2,
#         data_time + regr_time, data_time, regr_time
        finish - start,
    )
    print

print 'FINAL'
print 'r2:', r2_score(master_test, master_pred)
print 'mse:', r2_score(master_test, master_pred)

Movie #100, 99    Sam the Iron Bridge
Name: Name, dtype: object, average rating: 2.65 in 78 reviews
MSE: 0.99 		 r2: 0.27 		 time: 0.83

Movie #101, 100    Complete Shamanic Princess
Name: Name, dtype: object, average rating: 3.37 in 330 reviews
MSE: 1.25 		 r2: 0.11 		 time: 0.90

Movie #102, 101    Notre Musique
Name: Name, dtype: object, average rating: 2.74 in 330 reviews
MSE: 1.36 		 r2: 0.11 		 time: 0.70

Movie #103, 102    Sanford and Son: Season 6
Name: Name, dtype: object, average rating: 3.66 in 209 reviews
MSE: 1.15 		 r2: 0.35 		 time: 0.84

Movie #104, 103    The Great Race
Name: Name, dtype: object, average rating: 3.70 in 3500 reviews
MSE: 0.75 		 r2: 0.19 		 time: 1.58

Movie #105, 104    Obsessed
Name: Name, dtype: object, average rating: 3.13 in 192 reviews
MSE: 0.90 		 r2: -0.19 		 time: 0.75

Movie #106, 105    Stevie Ray Vaughan and Double Trouble: Live at...
Name: Name, dtype: object, average rating: 4.10 in 940 reviews
MSE: 0.98 		 r2: 0.17 		 time: 1.39

Movie 

In [303]:
def average_nested_dict_key(d, k):
    sum = np.sum([d[i][k] for i in sorted(d.keys())])
    return sum / len(d.keys())

def list_nested_dict_key(d, k):
    return [d[i][k] for i in sorted(d.keys())]

print average_nested_dict_key(results_dict, 'mse')

results = results_dict
print len(results.keys())

values = np.array(list_nested_dict_key(results, 'mse'))
weights = np.array(list_nested_dict_key(reviews_by_movie, 'num_reviews')[:4])
weighted = np.average(values, weights=weights)

print weighted

1.172661578058318
4
0.873209503856999


In [173]:
review_nums = []
for i in range(17700):
    num_reviews = ratings_small[:,i].count_nonzero()
    review_nums.append((i, num_reviews, np.sum(ratings_small[:,i]) / num_reviews))



In [24]:
s = sorted(review_nums, key=lambda x: x[2])
for movie_id, num, avg_review in s[-20:]:
    print ('%s\t%s\t%s' % (num, avg_review, movie_by_id[movie_id]))

681	4.5389133627	Fruits Basket (2001)
17292	4.54256303493	The Simpsons: Season 5 (1993)
92470	4.54370065967	Star Wars: Episode V: The Empire Strikes Back (1980)
134284	4.54512078878	Lord of the Rings: The Return of the King (2003)
125	4.552	Lord of the Rings: The Return of the King: Extended Edition: Bonus Material (2003)
1883	4.55443441317	Inu-Yasha (2000)
8426	4.58129598861	The Simpsons: Season 6 (1994)
6621	4.58238936717	Arrested Development: Season 2 (2004)
220	4.58636363636	Ghost in the Shell: Stand Alone Complex: 2nd Gig (2005)
1238	4.59208400646	Veronica Mars: Season 1 (2004)
139660	4.59338393241	The Shawshank Redemption: Special Edition (1994)
89	4.59550561798	Tenchi Muyo! Ryo Ohki (1995)
25	4.6	Trailer Park Boys: Season 4 (2003)
75	4.6	Trailer Park Boys: Season 3 (2003)
1633	4.60502143295	Fullmetal Alchemist (2004)
1747	4.63880938752	Battlestar Galactica: Season 1 (2004)
7249	4.67098910195	Lost: Season 1 (2004)
74912	4.70261106365	Lord of the Rings: The Two Towers: Extended Ed

(20,)