# HW3: Netflix Data Analysis

In this homework assignment, you will analyze the Netflix Prize data. The data consist of 100,480,507 movie ratings on a scale from 1 to 5 stars. The reviews are distributed across 17,770 movies and 480,189 users. We have provided the training data as a sparse matrix where the row corresponds to the movie ID and the column corresponds to the user ID. A separate file contains the title and year of release for each movie. The original, raw data consists of multiple lists of tuples; each list is a separate movie and each tuple is User ID, Rating, and Rating Year.
The original data can be downloaded here: https://archive.org/download/nf_prize_dataset.tar
Further information about the Netflix Prize is available online:
https://en.wikipedia.org/wiki/Netflix_Prize
https://www.netflixprize.com/
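
The sparse layout described above (movie ID as row, user ID as column) can be illustrated with a minimal sketch. The toy tuples and the names `raw` and `toy_ratings` are made up for illustration and are not part of the provided files. The sketch assumes IDs are used directly as indices, which leaves row 0 and column 0 empty; that would be consistent with the provided matrix having one more row than there are movies, but it is an assumption about how the file was built.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data in the raw layout:
# one list of (user_id, rating, rating_year) tuples per movie ID.
raw = {
    1: [(6, 3, 2005), (7, 5, 2004)],
    2: [(6, 4, 2005)],
    3: [(2, 1, 2003), (7, 2, 2005)],
}

rows, cols, vals = [], [], []
for movie_id, tuples in raw.items():
    for user_id, rating, _year in tuples:
        rows.append(movie_id)   # row index = movie ID
        cols.append(user_id)    # column index = user ID
        vals.append(rating)     # stored value = star rating

# IDs are used directly as indices, so row 0 and column 0 stay empty.
shape = (max(rows) + 1, max(cols) + 1)
toy_ratings = csr_matrix((vals, (rows, cols)), shape=shape, dtype=np.int8)

print(toy_ratings.shape)            # (movies + 1, largest user ID + 1)
print(toy_ratings[1, 7])            # rating that user 7 gave movie 1
print(toy_ratings.getnnz(axis=1))   # number of ratings per movie
```

Entries that are never set are implicit zeros, which is why a stored 0 can stand for "no rating" without using any memory.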

In [2]:
import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import scipy.sparse
from scipy import stats
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN, SpectralClustering, AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.model_selection import train_test_split


In [3]:
# This file consists of titles and release years associated with each ID
movie_titles = pd.read_csv('movie_titles.txt', header = None, names = ['ID','Year','Name'], encoding='latin-1')
print(movie_titles.head())
print(movie_titles.shape)

movie_by_id = {}
for id, name, year in zip(movie_titles['ID'], movie_titles['Name'], movie_titles['Year']):
    if not (np.isnan(year)):
        year = str(int(year))
    else:
        year = 'NaN'
    movie_by_id[id] = name + ' ' + '(' + year + ')'

   ID    Year                          Name
0   1  2003.0               Dinosaur Planet
1   2  2004.0    Isle of Man TT 2004 Review
2   3  1997.0                     Character
3   4  1994.0  Paula Abdul's Get Up & Dance
4   5  2004.0      The Rise and Fall of ECW
(17770, 3)


In [4]:
# This file is a sparse matrix of movies by user, with each element a rating (1-5) or nonresponse (0)
ratings_csr = scipy.sparse.load_npz('netflix_full_csr.npz')
print(ratings_csr.shape)

(17771, 2649430)


To avoid out-of-memory errors, you may need to randomly subsample the data; the next cell shows one way to do so (commented out here). Some computers can handle the full dataset (e.g. a 2017 MacBook Pro can perform SVD on the full dataset), but older computers will likely need to subsample. You can also consider using Princeton computing resources and clusters for more computationally expensive analyses.
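
As a rough sketch of the subsampling idea, the snippet below draws a random subset of user columns from a small synthetic matrix. The matrix `toy` is a hypothetical stand-in for `ratings_csr`, and the sizes are arbitrary. It samples without replacement (`replace=False`) so that no user is drawn twice; this is a design choice of the sketch, not something the assignment prescribes.

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)

# Hypothetical stand-in for ratings_csr: 50 movies x 1000 users, about 2% filled.
toy = sparse_random(50, 1000, density=0.02, format="csr", random_state=0)

n_viewers = 200
# Sample user columns without replacement so each user appears at most once.
sampled_users = rng.choice(toy.shape[1], size=n_viewers, replace=False)
sampled_users.sort()  # keep the original column order

# Column slicing is cheaper in CSC format; convert back to CSR afterwards.
toy_small = toy.tocsc()[:, sampled_users].tocsr()
print(toy_small.shape)
```

Fixing the seed makes the subsample reproducible, which helps when comparing results between runs.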

In [5]:
#n_samples = 5000
#n_viewers = 500000
#random_sample_movies = np.random.choice(17771, size = n_samples)
#random_sample_viewers = np.random.choice(2649430, size = n_viewers)
#ratings_small = ratings_csr[:,random_sample_viewers]

In [80]:
# Transpose to users x movies, then remove users (rows) with no reviews
start = time.time()
ratings_csc = ratings_csr.T
print('before removing users with no reviews: ', ratings_csc.shape)
non_zero_users_csc = ratings_csc[(ratings_csc.getnnz(axis=1) != 0)]
print(non_zero_users_csc.shape)

finish = time.time()
print('finished reduction in %.2f seconds' % (finish - start))
ratings_small = non_zero_users_csc

before removing users with no reviews:  (2649430, 17771)
(480189, 17771)
finished reduction in 23.21 seconds


In [144]:
start = time.time()
svd = TruncatedSVD(n_components = 15, algorithm="arpack", random_state=0)
all_users_small = svd.fit_transform(ratings_small)
finish = time.time()
print(all_users_small.shape)
print('finished svd in %.2f seconds' % (finish - start))

(480189, 15)
finished svd in 29.56 seconds


In [145]:
start = time.time()
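# algorithm="full" is the classic Lloyd algorithm; scikit-learn 1.1 renamed it to "lloyd",
# and newer releases require algorithm="lloyd" instead
kmeans_all_users = KMeans(n_clusters = 20 , random_state=0, algorithm="full")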
kmeans_all_users.fit(all_users_small)
finish = time.time()
print('finished clustering in %.2f seconds' % (finish - start))
clusters_all_users = kmeans_all_users.labels_
clusters, counts = np.unique(clusters_all_users, return_counts=True)
print(counts)

finished clustering in 254.06 seconds
[  6760 160940   6160   5905  16404   2944  21565   7315   4781  18507
  11498  44379  13283   7477  13423  27054  62848   2864  34190  11892]


A common method for analyzing large datasets is dimension reduction. Here we perform a truncated SVD, which is suited to sparse datasets, and analyze which movies are associated with the different latent dimensions.
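
One way to see which movies are associated with a latent dimension is to look at the largest-magnitude entries of each row of `components_` in the fitted `TruncatedSVD`. The sketch below does this on a tiny made-up users x movies matrix; the titles, the ratings, and the names `toy`, `svd_toy` and `user_factors` are hypothetical and only illustrate the idea.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy ratings: 6 users (rows) x 4 movies (columns), 0 = no rating.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy = csr_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [5, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
    [0, 0, 5, 5],
], dtype=float))

svd_toy = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0)
user_factors = svd_toy.fit_transform(toy)  # shape (n_users, n_components)

# components_ has shape (n_components, n_movies): one loading per movie
# for each latent dimension. Large absolute loadings mark the movies that
# drive that dimension.
for k, component in enumerate(svd_toy.components_):
    order = np.argsort(-np.abs(component))
    top = [(titles[j], round(float(component[j]), 2)) for j in order[:2]]
    print("dimension", k, top)
```

On the real data the same loop would index movie titles by column, since the matrix passed to `fit_transform` in this notebook has users as rows and movie IDs as columns. The sign of a component is arbitrary, so only the relative signs and magnitudes within a dimension are meaningful.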

In [149]:
y_pred_mode = []
y_pred_mean = []
y_test_all = []
for id in range(1777):
    movie_id = id
    num_reviews = ratings_small[:,movie_id].count_nonzero()
    movie_name = movie_titles[movie_titles['ID'] == movie_id]['Name']
    print ('Movie #%s, %s, average rating: %.2f in %i reviews' % (
        movie_id,
        movie_name,
        np.sum(ratings_small[:,movie_id]) / num_reviews,
        int(num_reviews)
    ))
    
    start = time.time()
    filter_by = np.ravel((ratings_small[:,movie_id] != 0.0).toarray())
    filtered_clusters = clusters_all_users[filter_by]
    filtered_ratings = ratings_small[filter_by,:]
    
    movie_mask = np.ravel(np.full((filtered_ratings.shape[1], 1), True))
    movie_mask[movie_id] = False
    X = filtered_ratings[:,movie_mask]
    y = filtered_ratings[:,movie_id].toarray()

    
    #y_and_c = np.transpose(np.vstack([np.ravel(y), filtered_clusters]))
    print(X.shape)
    print(y.shape)
    #print(y_and_c.shape)
    
    X_train, X_test, y_train, y_test, c_train, c_test = train_test_split(X, y, filtered_clusters, test_size=0.2, random_state=0)
    
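    # Baseline predictions from all training reviews of this movie (computed once per movie)
    mode, _ = stats.mode(y_train, axis=None)
    mode_all = np.ravel(mode)[0]  # np.ravel handles both array and scalar results across SciPy versions
    mean_all = np.mean(y_train)

    for (cluster, review) in zip(c_test, y_test):
        y_test_all.append(review)
        cluster_reviews = y_train[c_train == cluster]

        if len(cluster_reviews) != 0:
            # Predict from training reviews by users in the same cluster
            mode, _ = stats.mode(cluster_reviews, axis=None)
            y_pred_mode.append(np.ravel(mode)[0])
            y_pred_mean.append(np.mean(cluster_reviews))
        else:
            # No training reviews in this cluster: fall back to the movie-wide baseline
            y_pred_mode.append(mode_all)
            y_pred_mean.append(mean_all)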
    
print('Variance score mean: %.2f' % r2_score(y_test_all, y_pred_mean))
print('Variance score mode: %.2f' % r2_score(y_test_all, y_pred_mode))
print('MSE mean: %.2f' % mean_squared_error(y_test_all, y_pred_mean))
print('MSE mode: %.2f' % mean_squared_error(y_test_all, y_pred_mode))
print()
    
        
    




Movie #0, Series([], Name: Name, dtype: object), average rating: nan in 0 reviews
(0, 17770)
(0, 1)
Movie #1, 0    Dinosaur Planet
Name: Name, dtype: object, average rating: 3.75 in 547 reviews
(547, 17770)
(547, 1)
Movie #2, 1    Isle of Man TT 2004 Review
Name: Name, dtype: object, average rating: 3.56 in 145 reviews
(145, 17770)
(145, 1)
Movie #3, 2    Character
Name: Name, dtype: object, average rating: 3.64 in 2012 reviews
(2012, 17770)
(2012, 1)
Movie #4, 3    Paula Abdul's Get Up & Dance
Name: Name, dtype: object, average rating: 2.74 in 142 reviews
(142, 17770)
(142, 1)
Movie #5, 4    The Rise and Fall of ECW
Name: Name, dtype: object, average rating: 3.92 in 1140 reviews
(1140, 17770)
(1140, 1)
Movie #6, 5    Sick
Name: Name, dtype: object, average rating: 3.08 in 1019 reviews
(1019, 17770)
(1019, 1)
Movie #7, 6    8 Man
Name: Name, dtype: object, average rating: 2.13 in 93 reviews
(93, 17770)
(93, 1)
Movie #8, 7    What the #$*! Do We Know!?
Name: Name, dtype: object, average

(258, 17770)
(258, 1)
Movie #66, 65    Barbarian Queen 2
Name: Name, dtype: object, average rating: 2.30 in 166 reviews
(166, 17770)
(166, 1)
Movie #67, 66    Vampire Journals
Name: Name, dtype: object, average rating: 2.56 in 289 reviews
(289, 17770)
(289, 1)
Movie #68, 67    Invader Zim
Name: Name, dtype: object, average rating: 4.14 in 2216 reviews
(2216, 17770)
(2216, 1)
Movie #69, 68    WWE: Armageddon 2003
Name: Name, dtype: object, average rating: 3.29 in 116 reviews
(116, 17770)
(116, 1)
Movie #70, 69    Tai Chi: The 24 Forms
Name: Name, dtype: object, average rating: 2.73 in 343 reviews
(343, 17770)
(343, 1)
Movie #71, 70    Maya Lin: A Strong Clear Vision
Name: Name, dtype: object, average rating: 3.77 in 1561 reviews
(1561, 17770)
(1561, 1)
Movie #72, 71    At Home Among Strangers
Name: Name, dtype: object, average rating: 3.15 in 179 reviews
(179, 17770)
(179, 1)
Movie #73, 72    Davy Crockett: 50th Anniversary Double Feature
Name: Name, dtype: object, average rating: 3.89 

(830, 17770)
(830, 1)
Movie #133, 132    Viva La Bam: Season 1
Name: Name, dtype: object, average rating: 3.74 in 2563 reviews
(2563, 17770)
(2563, 1)
Movie #134, 133    Spirit Lost
Name: Name, dtype: object, average rating: 2.13 in 99 reviews
(99, 17770)
(99, 1)
Movie #135, 134    GTO: Great Teacher Onizuka: Set 2
Name: Name, dtype: object, average rating: 4.14 in 384 reviews
(384, 17770)
(384, 1)
Movie #136, 135    Cat and the Canary
Name: Name, dtype: object, average rating: 2.98 in 120 reviews
(120, 17770)
(120, 1)
Movie #137, 136    Naked Lies
Name: Name, dtype: object, average rating: 2.39 in 204 reviews
(204, 17770)
(204, 1)
Movie #138, 137    Star Trek: Voyager: Season 1
Name: Name, dtype: object, average rating: 3.94 in 6007 reviews
(6007, 17770)
(6007, 1)
Movie #139, 138    Allergies: A Natural Approach
Name: Name, dtype: object, average rating: 2.51 in 96 reviews
(96, 17770)
(96, 1)
Movie #140, 139    Lost in the Wild
Name: Name, dtype: object, average rating: 2.53 in 119 re

(81260, 17770)
(81260, 1)
Movie #198, 197    Gupt
Name: Name, dtype: object, average rating: 3.08 in 194 reviews
(194, 17770)
(194, 1)
Movie #199, 198    The Deer Hunter
Name: Name, dtype: object, average rating: 3.89 in 35509 reviews
(35509, 17770)
(35509, 1)
Movie #200, 199    The Fall of the Roman Empire
Name: Name, dtype: object, average rating: 3.01 in 129 reviews
(129, 17770)
(129, 1)
Movie #201, 200    Home Movie
Name: Name, dtype: object, average rating: 3.42 in 2724 reviews
(2724, 17770)
(2724, 1)
Movie #202, 201    Ruby's Bucket of Blood
Name: Name, dtype: object, average rating: 3.10 in 382 reviews
(382, 17770)
(382, 1)
Movie #203, 202    Sports Illustrated Swimsuit Edition: 1997
Name: Name, dtype: object, average rating: 3.16 in 96 reviews
(96, 17770)
(96, 1)
Movie #204, 203    Venus Boyz
Name: Name, dtype: object, average rating: 2.71 in 230 reviews
(230, 17770)
(230, 1)
Movie #205, 204    Troy: Bonus Material
Name: Name, dtype: object, average rating: 3.30 in 513 reviews


(4378, 17770)
(4378, 1)
Movie #263, 262    Dragon Ball: Tournament Saga
Name: Name, dtype: object, average rating: 3.98 in 900 reviews
(900, 17770)
(900, 1)
Movie #264, 263    Angelina Ballerina: Lights
Name: Name, dtype: object, average rating: 2.89 in 134 reviews
(134, 17770)
(134, 1)
Movie #265, 264    The Jeff Corwin Experience: Costa Rica and the...
Name: Name, dtype: object, average rating: 3.25 in 308 reviews
(308, 17770)
(308, 1)
Movie #266, 265    Saudade Do Futuro
Name: Name, dtype: object, average rating: 3.06 in 126 reviews
(126, 17770)
(126, 1)
Movie #267, 266    Touched by an Angel: Season 1
Name: Name, dtype: object, average rating: 3.69 in 1162 reviews
(1162, 17770)
(1162, 1)
Movie #268, 267    The Final Countdown
Name: Name, dtype: object, average rating: 3.71 in 5545 reviews
(5545, 17770)
(5545, 1)
Movie #269, 268    Parenthood
Name: Name, dtype: object, average rating: 3.76 in 21467 reviews
(21467, 17770)
(21467, 1)
Movie #270, 269    Sex and the City: Season 4
Name:

(165, 17770)
(165, 1)
Movie #327, 326    Storefront Hitchcock
Name: Name, dtype: object, average rating: 3.19 in 178 reviews
(178, 17770)
(178, 1)
Movie #328, 327    Deftones: Live in Hawaii
Name: Name, dtype: object, average rating: 2.90 in 197 reviews
(197, 17770)
(197, 1)
Movie #329, 328    Dogma
Name: Name, dtype: object, average rating: 3.59 in 66266 reviews
(66266, 17770)
(66266, 1)
Movie #330, 329    Wild Things
Name: Name, dtype: object, average rating: 3.46 in 27038 reviews
(27038, 17770)
(27038, 1)
Movie #331, 330    Chasing Amy
Name: Name, dtype: object, average rating: 3.62 in 47167 reviews
(47167, 17770)
(47167, 1)
Movie #332, 331    They Crawl
Name: Name, dtype: object, average rating: 2.26 in 149 reviews
(149, 17770)
(149, 1)
Movie #333, 332    Mail Call: The Best of Season 2
Name: Name, dtype: object, average rating: 3.79 in 393 reviews
(393, 17770)
(393, 1)
Movie #334, 333    The Pacifier
Name: Name, dtype: object, average rating: 3.56 in 39875 reviews
(39875, 17770)
(

(60, 17770)
(60, 1)
Movie #393, 392    The Replacement Killers
Name: Name, dtype: object, average rating: 3.32 in 11893 reviews
(11893, 17770)
(11893, 1)
Movie #394, 393    20
Name: Name, dtype: object, average rating: 3.57 in 1617 reviews
(1617, 17770)
(1617, 1)
Movie #395, 394    Captain Blood
Name: Name, dtype: object, average rating: 3.99 in 1822 reviews
(1822, 17770)
(1822, 1)
Movie #396, 395    Arjuna: Complete Collection
Name: Name, dtype: object, average rating: 3.10 in 204 reviews
(204, 17770)
(204, 1)
Movie #397, 396    A Night in Casablanca
Name: Name, dtype: object, average rating: 3.93 in 748 reviews
(748, 17770)
(748, 1)
Movie #398, 397    In the Realm of the Senses
Name: Name, dtype: object, average rating: 2.75 in 3565 reviews
(3565, 17770)
(3565, 1)
Movie #399, 398    Fangs
Name: Name, dtype: object, average rating: 2.34 in 125 reviews
(125, 17770)
(125, 1)
Movie #400, 399    Rio Lobo
Name: Name, dtype: object, average rating: 3.83 in 3693 reviews
(3693, 17770)
(3693, 

(116762, 17770)
(116762, 1)
Movie #458, 457    Blast
Name: Name, dtype: object, average rating: 3.55 in 355 reviews
(355, 17770)
(355, 1)
Movie #459, 458    Basquiat
Name: Name, dtype: object, average rating: 3.48 in 7271 reviews
(7271, 17770)
(7271, 1)
Movie #460, 459    Pink Narcissus
Name: Name, dtype: object, average rating: 2.19 in 344 reviews
(344, 17770)
(344, 1)
Movie #461, 460    Nightwalker #1: Midnight Detective
Name: Name, dtype: object, average rating: 3.47 in 544 reviews
(544, 17770)
(544, 1)
Movie #462, 461    Classic Cartoon Favorites: Starring Donald
Name: Name, dtype: object, average rating: 4.04 in 778 reviews
(778, 17770)
(778, 1)
Movie #463, 462    The Twilight Zone: Vol. 12
Name: Name, dtype: object, average rating: 4.09 in 2286 reviews
(2286, 17770)
(2286, 1)
Movie #464, 463    The Return of Ruben Blades
Name: Name, dtype: object, average rating: 2.76 in 82 reviews
(82, 17770)
(82, 1)
Movie #465, 464    Coolie No.1
Name: Name, dtype: object, average rating: 2.92 

(132, 17770)
(132, 1)
Movie #522, 521    Love Songs
Name: Name, dtype: object, average rating: 3.30 in 320 reviews
(320, 17770)
(320, 1)
Movie #523, 522    My Side of the Mountain
Name: Name, dtype: object, average rating: 3.03 in 294 reviews
(294, 17770)
(294, 1)
Movie #524, 523    Mumford
Name: Name, dtype: object, average rating: 3.28 in 6136 reviews
(6136, 17770)
(6136, 1)
Movie #525, 524    The Last Seduction II
Name: Name, dtype: object, average rating: 2.03 in 145 reviews
(145, 17770)
(145, 1)
Movie #526, 525    Angie
Name: Name, dtype: object, average rating: 2.91 in 581 reviews
(581, 17770)
(581, 1)
Movie #527, 526    Barbarian Queen
Name: Name, dtype: object, average rating: 2.28 in 271 reviews
(271, 17770)
(271, 1)
Movie #528, 527    The Hitchhiker's Guide to the Galaxy
Name: Name, dtype: object, average rating: 2.98 in 28454 reviews
(28454, 17770)
(28454, 1)
Movie #529, 528    Summer of the Monkeys
Name: Name, dtype: object, average rating: 3.36 in 489 reviews
(489, 17770)


(528, 17770)
(528, 1)
Movie #588, 587    Blue Planet: IMAX
Name: Name, dtype: object, average rating: 3.45 in 2053 reviews
(2053, 17770)
(2053, 1)
Movie #589, 588    Northanger Abbey
Name: Name, dtype: object, average rating: 2.35 in 599 reviews
(599, 17770)
(599, 1)
Movie #590, 589    Michael McDonald: A Gathering of Friends
Name: Name, dtype: object, average rating: 3.40 in 177 reviews
(177, 17770)
(177, 1)
Movie #591, 590    Particles of Truth
Name: Name, dtype: object, average rating: 2.89 in 514 reviews
(514, 17770)
(514, 1)
Movie #592, 591    House of Whipcord
Name: Name, dtype: object, average rating: 2.00 in 137 reviews
(137, 17770)
(137, 1)
Movie #593, 592    Baby Genius: Mozart and Friends
Name: Name, dtype: object, average rating: 2.73 in 92 reviews
(92, 17770)
(92, 1)
Movie #594, 593    By Hook or By Crook
Name: Name, dtype: object, average rating: 2.56 in 294 reviews
(294, 17770)
(294, 1)
Movie #595, 594    Monarch of the Glen: Series 2
Name: Name, dtype: object, average r

(144, 17770)
(144, 1)
Movie #652, 651    Marvin's Room
Name: Name, dtype: object, average rating: 3.32 in 5129 reviews
(5129, 17770)
(5129, 1)
Movie #653, 652    American Tragedy
Name: Name, dtype: object, average rating: 2.61 in 136 reviews
(136, 17770)
(136, 1)
Movie #654, 653    Return of the Chinese Boxer
Name: Name, dtype: object, average rating: 2.56 in 85 reviews
(85, 17770)
(85, 1)
Movie #655, 654    Green Acres: Season 1
Name: Name, dtype: object, average rating: 3.51 in 817 reviews
(817, 17770)
(817, 1)
Movie #656, 655    Veronica 2030
Name: Name, dtype: object, average rating: 2.74 in 119 reviews
(119, 17770)
(119, 1)
Movie #657, 656    Highlander: Season 5
Name: Name, dtype: object, average rating: 4.05 in 1930 reviews
(1930, 17770)
(1930, 1)
Movie #658, 657    Robin Hood: Prince of Thieves
Name: Name, dtype: object, average rating: 3.63 in 44001 reviews
(44001, 17770)
(44001, 1)
Movie #659, 658    The Last House on the Left
Name: Name, dtype: object, average rating: 2.75 i

(102, 17770)
(102, 1)
Movie #717, 716    Testosterone
Name: Name, dtype: object, average rating: 2.89 in 1951 reviews
(1951, 17770)
(1951, 1)
Movie #718, 717    44 Minutes
Name: Name, dtype: object, average rating: 3.32 in 1747 reviews
(1747, 17770)
(1747, 1)
Movie #719, 718    City of Industry
Name: Name, dtype: object, average rating: 2.99 in 814 reviews
(814, 17770)
(814, 1)
Movie #720, 719    Roger & Me
Name: Name, dtype: object, average rating: 3.84 in 26718 reviews
(26718, 17770)
(26718, 1)
Movie #721, 720    Royal Deceit
Name: Name, dtype: object, average rating: 2.48 in 343 reviews
(343, 17770)
(343, 1)
Movie #722, 721    The Wire: Season 1
Name: Name, dtype: object, average rating: 4.09 in 5955 reviews
(5955, 17770)
(5955, 1)
Movie #723, 722    Curly Sue
Name: Name, dtype: object, average rating: 2.99 in 14038 reviews
(14038, 17770)
(14038, 1)
Movie #724, 723    Yu Yu Hakusho
Name: Name, dtype: object, average rating: 4.41 in 1223 reviews
(1223, 17770)
(1223, 1)
Movie #725, 72

(741, 17770)
(741, 1)
Movie #782, 781    Soldier of Fortune
Name: Name, dtype: object, average rating: 2.44 in 95 reviews
(95, 17770)
(95, 1)
Movie #783, 782    For Love or Country
Name: Name, dtype: object, average rating: 3.37 in 1203 reviews
(1203, 17770)
(1203, 1)
Movie #784, 783    The Outer Limits: The New Series: Aliens Among Us
Name: Name, dtype: object, average rating: 3.53 in 224 reviews
(224, 17770)
(224, 1)
Movie #785, 784    American Pop
Name: Name, dtype: object, average rating: 3.16 in 387 reviews
(387, 17770)
(387, 1)
Movie #786, 785    Tokyo Drifter
Name: Name, dtype: object, average rating: 3.38 in 508 reviews
(508, 17770)
(508, 1)
Movie #787, 786    La Vallee
Name: Name, dtype: object, average rating: 2.50 in 82 reviews
(82, 17770)
(82, 1)
Movie #788, 787    Clerks
Name: Name, dtype: object, average rating: 3.76 in 58527 reviews
(58527, 17770)
(58527, 1)
Movie #789, 788    Boyz N the Hood
Name: Name, dtype: object, average rating: 3.79 in 31537 reviews
(31537, 17770)

(245, 17770)
(245, 1)
Movie #848, 847    La Separation
Name: Name, dtype: object, average rating: 3.05 in 369 reviews
(369, 17770)
(369, 1)
Movie #849, 848    Mr. & Mrs. Bridge
Name: Name, dtype: object, average rating: 3.26 in 1527 reviews
(1527, 17770)
(1527, 1)
Movie #850, 849    Stoked: The Rise and Fall of Gator
Name: Name, dtype: object, average rating: 3.42 in 2095 reviews
(2095, 17770)
(2095, 1)
Movie #851, 850    Back to the Future Part III
Name: Name, dtype: object, average rating: 3.89 in 25100 reviews
(25100, 17770)
(25100, 1)
Movie #852, 851    Disorganized Crime
Name: Name, dtype: object, average rating: 3.00 in 569 reviews
(569, 17770)
(569, 1)
Movie #853, 852    Dragonball: The Magic Begins
Name: Name, dtype: object, average rating: 2.82 in 465 reviews
(465, 17770)
(465, 1)
Movie #854, 853    Wes Craven's Invitation to Hell
Name: Name, dtype: object, average rating: 2.17 in 157 reviews
(157, 17770)
(157, 1)
Movie #855, 854    Bruiser
Name: Name, dtype: object, average r

(539, 17770)
(539, 1)
Movie #912, 911    Angelina Ballerina: Friends Forever
Name: Name, dtype: object, average rating: 3.15 in 322 reviews
(322, 17770)
(322, 1)
Movie #913, 912    Dead Kennedys: In God We Trust
Name: Name, dtype: object, average rating: 3.38 in 168 reviews
(168, 17770)
(168, 1)
Movie #914, 913    R.E.M.: Road Movie
Name: Name, dtype: object, average rating: 3.06 in 122 reviews
(122, 17770)
(122, 1)
Movie #915, 914    Haven
Name: Name, dtype: object, average rating: 2.84 in 43 reviews
(43, 17770)
(43, 1)
Movie #916, 915    Earth 2: The Complete Series
Name: Name, dtype: object, average rating: 3.59 in 442 reviews
(442, 17770)
(442, 1)
Movie #917, 916    String Cheese Incident: Live at the Fillmore
Name: Name, dtype: object, average rating: 3.54 in 112 reviews
(112, 17770)
(112, 1)
Movie #918, 917    A Hard Day's Night: Collector's Series
Name: Name, dtype: object, average rating: 3.92 in 7574 reviews
(7574, 17770)
(7574, 1)
Movie #919, 918    Comedian
Name: Name, dtype

(166, 17770)
(166, 1)
Movie #977, 976    Our Lady of the Assassins
Name: Name, dtype: object, average rating: 2.79 in 2076 reviews
(2076, 17770)
(2076, 1)
Movie #978, 977    Yi Yi
Name: Name, dtype: object, average rating: 3.33 in 2255 reviews
(2255, 17770)
(2255, 1)
Movie #979, 978    A Moment of Romance
Name: Name, dtype: object, average rating: 2.94 in 128 reviews
(128, 17770)
(128, 1)
Movie #980, 979    The Swan Princess
Name: Name, dtype: object, average rating: 3.63 in 1227 reviews
(1227, 17770)
(1227, 1)
Movie #981, 980    How I Got into College
Name: Name, dtype: object, average rating: 3.03 in 314 reviews
(314, 17770)
(314, 1)
Movie #982, 981    Dil Chahta Hai
Name: Name, dtype: object, average rating: 3.80 in 1563 reviews
(1563, 17770)
(1563, 1)
Movie #983, 982    Wishful Thinking
Name: Name, dtype: object, average rating: 2.50 in 497 reviews
(497, 17770)
(497, 1)
Movie #984, 983    The Young Lions
Name: Name, dtype: object, average rating: 3.40 in 1713 reviews
(1713, 17770)


(1021, 17770)
(1021, 1)
Movie #1042, 1041    Forgotten Silver
Name: Name, dtype: object, average rating: 3.33 in 722 reviews
(722, 17770)
(722, 1)
Movie #1043, 1042    Outrageous Fortune
Name: Name, dtype: object, average rating: 3.29 in 4847 reviews
(4847, 17770)
(4847, 1)
Movie #1044, 1043    The Party
Name: Name, dtype: object, average rating: 3.45 in 2997 reviews
(2997, 17770)
(2997, 1)
Movie #1045, 1044    For Keeps
Name: Name, dtype: object, average rating: 3.56 in 2582 reviews
(2582, 17770)
(2582, 1)
Movie #1046, 1045    Uptown Girls
Name: Name, dtype: object, average rating: 3.37 in 40744 reviews
(40744, 17770)
(40744, 1)
Movie #1047, 1046    Aimee and Jaguar
Name: Name, dtype: object, average rating: 3.64 in 2617 reviews
(2617, 17770)
(2617, 1)
Movie #1048, 1047    Year of the Horse: Neil Young & Crazy Horse Live
Name: Name, dtype: object, average rating: 2.99 in 804 reviews
(804, 17770)
(804, 1)
Movie #1049, 1048    The Object of Beauty
Name: Name, dtype: object, average rati

(1796, 17770)
(1796, 1)
Movie #1106, 1105    Virtual Girl
Name: Name, dtype: object, average rating: 2.52 in 115 reviews
(115, 17770)
(115, 1)
Movie #1107, 1106    NBA Street Series: Ankle Breakers: Vol. 2
Name: Name, dtype: object, average rating: 3.28 in 122 reviews
(122, 17770)
(122, 1)
Movie #1108, 1107    Dr. Andrew Weil: 8 Weeks to Optimum Health & S...
Name: Name, dtype: object, average rating: 3.35 in 426 reviews
(426, 17770)
(426, 1)
Movie #1109, 1108    My Fair Lady: Special Edition: Bonus Material
Name: Name, dtype: object, average rating: 4.09 in 158 reviews
(158, 17770)
(158, 1)
Movie #1110, 1109    Secondhand Lions
Name: Name, dtype: object, average rating: 4.07 in 98700 reviews
(98700, 17770)
(98700, 1)
Movie #1111, 1110    Cries and Whispers
Name: Name, dtype: object, average rating: 3.59 in 2041 reviews
(2041, 17770)
(2041, 1)
Movie #1112, 1111    The Delicate Delinquent
Name: Name, dtype: object, average rating: 3.28 in 167 reviews
(167, 17770)
(167, 1)
Movie #1113, 1

(383, 17770)
(383, 1)
Movie #1169, 1168    Bjork: Shepherds Bush Empire
Name: Name, dtype: object, average rating: 3.79 in 114 reviews
(114, 17770)
(114, 1)
Movie #1170, 1169    Truth or Consequences
Name: Name, dtype: object, average rating: 3.06 in 1025 reviews
(1025, 17770)
(1025, 1)
Movie #1171, 1170    RoboCop: Crash and Burn
Name: Name, dtype: object, average rating: 3.08 in 620 reviews
(620, 17770)
(620, 1)
Movie #1172, 1171    Krippendorf's Tribe
Name: Name, dtype: object, average rating: 2.78 in 2570 reviews
(2570, 17770)
(2570, 1)
Movie #1173, 1172    Walking with Dinosaurs
Name: Name, dtype: object, average rating: 3.75 in 3867 reviews
(3867, 17770)
(3867, 1)
Movie #1174, 1173    The Sandlot
Name: Name, dtype: object, average rating: 3.96 in 35537 reviews
(35537, 17770)
(35537, 1)
Movie #1175, 1174    Repo Man
Name: Name, dtype: object, average rating: 3.44 in 12683 reviews
(12683, 17770)
(12683, 1)
Movie #1176, 1175    The Devil's Backbone
Name: Name, dtype: object, average

(1135, 17770)
(1135, 1)
Movie #1232, 1231    It
Name: Name, dtype: object, average rating: 3.71 in 164 reviews
(164, 17770)
(164, 1)
Movie #1233, 1232    Repli-Kate
Name: Name, dtype: object, average rating: 3.01 in 571 reviews
(571, 17770)
(571, 1)
Movie #1234, 1233    Crooklyn
Name: Name, dtype: object, average rating: 3.48 in 3290 reviews
(3290, 17770)
(3290, 1)
Movie #1235, 1234    The Great Battles of World War II: Battle of R...
Name: Name, dtype: object, average rating: 2.70 in 121 reviews
(121, 17770)
(121, 1)
Movie #1236, 1235    Ali G Indahouse
Name: Name, dtype: object, average rating: 2.95 in 2563 reviews
(2563, 17770)
(2563, 1)
Movie #1237, 1236    The Stars of Star Wars
Name: Name, dtype: object, average rating: 2.99 in 773 reviews
(773, 17770)
(773, 1)
Movie #1238, 1237    The Pope of Greenwich Village
Name: Name, dtype: object, average rating: 3.39 in 1932 reviews
(1932, 17770)
(1932, 1)
Movie #1239, 1238    The Arena
Name: Name, dtype: object, average rating: 2.63 in 1

Movie #1296, 1295    No Ordinary Love
Name: Name, dtype: object, average rating: 2.57 in 1118 reviews
(1118, 17770)
(1118, 1)
Movie #1297, 1296    Agnes Browne
Name: Name, dtype: object, average rating: 3.44 in 3178 reviews
(3178, 17770)
(3178, 1)
Movie #1298, 1297    Hellraiser V: Inferno
Name: Name, dtype: object, average rating: 3.18 in 2543 reviews
(2543, 17770)
(2543, 1)
Movie #1299, 1298    Scooby-Doo Meets Batman
Name: Name, dtype: object, average rating: 3.22 in 2986 reviews
(2986, 17770)
(2986, 1)
Movie #1300, 1299    Return of the Dragon
Name: Name, dtype: object, average rating: 3.98 in 5755 reviews
(5755, 17770)
(5755, 1)
Movie #1301, 1300    Chasing Sleep
Name: Name, dtype: object, average rating: 2.66 in 370 reviews
(370, 17770)
(370, 1)
Movie #1302, 1301    The Indomitable Teddy Roosevelt
Name: Name, dtype: object, average rating: 3.24 in 176 reviews
(176, 17770)
(176, 1)
Movie #1303, 1302    Horse Crazy
Name: Name, dtype: object, average rating: 3.40 in 573 reviews
(573

(623, 17770)
(623, 1)
Movie #1361, 1360    Play'd: A Hip-Hop Story
Name: Name, dtype: object, average rating: 2.79 in 153 reviews
(153, 17770)
(153, 1)
Movie #1362, 1361    NHL Stanley Cup 2003
Name: Name, dtype: object, average rating: 3.30 in 122 reviews
(122, 17770)
(122, 1)
Movie #1363, 1362    Leprechaun
Name: Name, dtype: object, average rating: 2.75 in 3287 reviews
(3287, 17770)
(3287, 1)
Movie #1364, 1363    Normal
Name: Name, dtype: object, average rating: 3.51 in 4411 reviews
(4411, 17770)
(4411, 1)
Movie #1365, 1364    Kurt & Courtney
Name: Name, dtype: object, average rating: 2.87 in 1846 reviews
(1846, 17770)
(1846, 1)
Movie #1366, 1365    Tempted
Name: Name, dtype: object, average rating: 2.82 in 537 reviews
(537, 17770)
(537, 1)
Movie #1367, 1366    The Piano
Name: Name, dtype: object, average rating: 3.60 in 24880 reviews
(24880, 17770)
(24880, 1)
Movie #1368, 1367    Ozzy Osbourne: Double O: Unauthorized
Name: Name, dtype: object, average rating: 2.34 in 79 reviews
(79

(858, 17770)
(858, 1)
Movie #1425, 1424    No Way Out
Name: Name, dtype: object, average rating: 3.70 in 18126 reviews
(18126, 17770)
(18126, 1)
Movie #1426, 1425    Go Fish
Name: Name, dtype: object, average rating: 2.73 in 1146 reviews
(1146, 17770)
(1146, 1)
Movie #1427, 1426    Sweepers
Name: Name, dtype: object, average rating: 2.54 in 225 reviews
(225, 17770)
(225, 1)
Movie #1428, 1427    The Recruit
Name: Name, dtype: object, average rating: 3.54 in 113674 reviews
(113674, 17770)
(113674, 1)
Movie #1429, 1428    The Halfway House: Special Edition (Unrated Di...
Name: Name, dtype: object, average rating: 2.21 in 95 reviews
(95, 17770)
(95, 1)
Movie #1430, 1429    The House Next Door
Name: Name, dtype: object, average rating: 2.41 in 186 reviews
(186, 17770)
(186, 1)
Movie #1431, 1430    Legalese
Name: Name, dtype: object, average rating: 2.95 in 151 reviews
(151, 17770)
(151, 1)
Movie #1432, 1431    Ride with the Devil
Name: Name, dtype: object, average rating: 3.13 in 2100 revie

(1460, 17770)
(1460, 1)
Movie #1489, 1488    American Adobo
Name: Name, dtype: object, average rating: 2.83 in 479 reviews
(479, 17770)
(479, 1)
Movie #1490, 1489    Knight Hunters Eternity
Name: Name, dtype: object, average rating: 3.38 in 225 reviews
(225, 17770)
(225, 1)
Movie #1491, 1490    Dream Theater: Live at Budokan
Name: Name, dtype: object, average rating: 3.84 in 302 reviews
(302, 17770)
(302, 1)
Movie #1492, 1491    Rhinestone
Name: Name, dtype: object, average rating: 2.21 in 755 reviews
(755, 17770)
(755, 1)
Movie #1493, 1492    John Lee Hooker: Live in Montreal: Montreal Ja...
Name: Name, dtype: object, average rating: 3.12 in 135 reviews
(135, 17770)
(135, 1)
Movie #1494, 1493    Appalachian Journey: Yo-Yo Ma/Edgar Meyer/Mark...
Name: Name, dtype: object, average rating: 3.47 in 240 reviews
(240, 17770)
(240, 1)
Movie #1495, 1494    Alias: Season 1
Name: Name, dtype: object, average rating: 4.34 in 16683 reviews
(16683, 17770)
(16683, 1)
Movie #1496, 1495    Korn: Korn

(6258, 17770)
(6258, 1)
Movie #1554, 1553    The 5th Musketeer
Name: Name, dtype: object, average rating: 2.70 in 103 reviews
(103, 17770)
(103, 1)
Movie #1555, 1554    The Little Fugitive
Name: Name, dtype: object, average rating: 3.16 in 185 reviews
(185, 17770)
(185, 1)
Movie #1556, 1555    A Farewell to Arms
Name: Name, dtype: object, average rating: 3.20 in 1045 reviews
(1045, 17770)
(1045, 1)
Movie #1557, 1556    Dolls
Name: Name, dtype: object, average rating: 3.05 in 428 reviews
(428, 17770)
(428, 1)
Movie #1558, 1557    Rocky V
Name: Name, dtype: object, average rating: 3.03 in 8170 reviews
(8170, 17770)
(8170, 1)
Movie #1559, 1558    AFI's 100 Years
Name: Name, dtype: object, average rating: 3.27 in 361 reviews
(361, 17770)
(361, 1)
Movie #1560, 1559    Disney Princess Stories: Vol. 1: A Gift From t...
Name: Name, dtype: object, average rating: 3.33 in 643 reviews
(643, 17770)
(643, 1)
Movie #1561, 1560    American Wedding
Name: Name, dtype: object, average rating: 3.48 in 61

(652, 17770)
(652, 1)
Movie #1618, 1617    Nausicaa of the Valley of the Wind
Name: Name, dtype: object, average rating: 4.19 in 6674 reviews
(6674, 17770)
(6674, 1)
Movie #1619, 1618    The Journey of Natty Gann
Name: Name, dtype: object, average rating: 3.76 in 1044 reviews
(1044, 17770)
(1044, 1)
Movie #1620, 1619    The Saddle Club: The First Adventure
Name: Name, dtype: object, average rating: 3.41 in 318 reviews
(318, 17770)
(318, 1)
Movie #1621, 1620    Richard Kern: Hardcore Collection
Name: Name, dtype: object, average rating: 2.34 in 181 reviews
(181, 17770)
(181, 1)
Movie #1622, 1621    Zhou Yu's Train
Name: Name, dtype: object, average rating: 2.89 in 875 reviews
(875, 17770)
(875, 1)
Movie #1623, 1622    Pokemon Advanced
Name: Name, dtype: object, average rating: 3.45 in 253 reviews
(253, 17770)
(253, 1)
Movie #1624, 1623    The Minion
Name: Name, dtype: object, average rating: 2.46 in 127 reviews
(127, 17770)
(127, 1)
Movie #1625, 1624    Aliens: Collector's Edition
Name:

(147, 17770)
(147, 1)
Movie #1681, 1680    The Tesseract
Name: Name, dtype: object, average rating: 2.54 in 214 reviews
(214, 17770)
(214, 1)
Movie #1682, 1681    Absolute Power
Name: Name, dtype: object, average rating: 3.69 in 25961 reviews
(25961, 17770)
(25961, 1)
Movie #1683, 1682    The Lion King: Circle of Life: Sing-Along Songs
Name: Name, dtype: object, average rating: 3.01 in 686 reviews
(686, 17770)
(686, 1)
Movie #1684, 1683    What Alice Found
Name: Name, dtype: object, average rating: 2.76 in 571 reviews
(571, 17770)
(571, 1)
Movie #1685, 1684    First Option
Name: Name, dtype: object, average rating: 2.50 in 74 reviews
(74, 17770)
(74, 1)
Movie #1686, 1685    Riding the Rails: American Experience
Name: Name, dtype: object, average rating: 3.62 in 551 reviews
(551, 17770)
(551, 1)
Movie #1687, 1686    Eloise at the Plaza
Name: Name, dtype: object, average rating: 3.46 in 1579 reviews
(1579, 17770)
(1579, 1)
Movie #1688, 1687    To Sleep With a Vampire
Name: Name, dtype: o

(8588, 17770)
(8588, 1)
Movie #1744, 1743    Beverly Hills Cop
Name: Name, dtype: object, average rating: 3.82 in 73004 reviews
(73004, 17770)
(73004, 1)
Movie #1745, 1744    Love Me
Name: Name, dtype: object, average rating: 3.01 in 309 reviews
(309, 17770)
(309, 1)
Movie #1746, 1745    Bob Roberts
Name: Name, dtype: object, average rating: 3.41 in 11039 reviews
(11039, 17770)
(11039, 1)
Movie #1747, 1746    Hello Down There
Name: Name, dtype: object, average rating: 3.27 in 129 reviews
(129, 17770)
(129, 1)
Movie #1748, 1747    Happily Ever After: Fairy Tales for Every Child
Name: Name, dtype: object, average rating: 3.15 in 407 reviews
(407, 17770)
(407, 1)
Movie #1749, 1748    Angel and the Badman
Name: Name, dtype: object, average rating: 3.39 in 1319 reviews
(1319, 17770)
(1319, 1)
Movie #1750, 1749    Beast From Haunted Cave/The Brain That Wouldn'...
Name: Name, dtype: object, average rating: 2.40 in 114 reviews
(114, 17770)
(114, 1)
Movie #1751, 1750    L'Ennui
Name: Name, dtyp

In [141]:
movie_results_dict = {}
for id in range(10, 15):
    movie_id = id
    num_reviews = ratings_small[:,movie_id].count_nonzero()
    movie_name = movie_titles[movie_titles['ID'] == movie_id]['Name']
    print ('Movie #%s, %s, average rating: %.2f in %i reviews' % (
        movie_id,
        movie_name,
        np.sum(ratings_small[:,movie_id]) / num_reviews,
        int(num_reviews)
    ))

    start = time.time()
    filter_by = np.ravel((ratings_small[:,movie_id] != 0.0).toarray())
    filtered_ratings = ratings_small[filter_by,:]

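    # Boolean mask that keeps every movie column except the target movie
    movie_mask = np.ones(filtered_ratings.shape[1], dtype=bool)
    movie_mask[movie_id] = False
    X = filtered_ratings[:, movie_mask]
    y = filtered_ratings[:, movie_id].toarray()
    # Project onto the largest number of components ARPACK allows, min(X.shape) - 1.
    # This keeps almost as many features as there are reviewers, so the regression
    # below can overfit. The SVD is also fit before the train/test split.
    svd = TruncatedSVD(n_components=max(min(X.shape) - 1, 1), algorithm="arpack", random_state=0)
    Z = svd.fit_transform(X)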
    #Z[Z == 0] = 3.6
    
    
    X_train, X_test, y_train, y_test = train_test_split(Z, y, test_size=0.2, random_state=0)
    print(Z.shape)
    print(y.shape)
    
    
    
    finish = time.time()
    data_time = finish - start
    # print('finished choosing data in %.2f seconds' % data_time)
    for name, arr in zip(['X_train', 'X_test', 'y_train', 'y_test'], [X_train, X_test, y_train, y_test]):
        print(name, ':', arr.shape)
    print()
    
    # Run the regression on the film
    start = time.time()
    regr = LinearRegression()
    regr.fit(X_train, y_train)

    y_pred_reg = regr.predict(X_test)

    # print('Coefficients: \n', regr.coef_)
    # print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred_reg))
    print('No Clustering')
    # The "variance score" reported here is r2_score, the coefficient of determination (R^2)
    print('Variance score: %.2f' % r2_score(y_test, y_pred_reg))
    print()
    finish = time.time()
    regr_time = finish - start
    # print('finished regression in %.2f seconds' % regr_time)
    # print()
    movie_results_dict[id] = {
        'name': str(movie_titles[movie_titles['ID'] == movie_id]['Name']),
        'regr_time': regr_time,
        'data_time': data_time,
        'regr': regr,
        'mse': mean_squared_error(y_test, y_pred_reg),
        'r2': r2_score(y_test, y_pred_reg)
    }
    
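    # One cluster per ~50 training rows, with a minimum of 3. The algorithm argument is left
    # at its default: algorithm="full" was deprecated in scikit-learn 1.1 in favor of "lloyd".
    kmeans = KMeans(n_clusters=max(int(X_train.shape[0] / 50) + 1, 3), random_state=0)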
    kmeans.fit(X_train)
    cluster_members = kmeans.labels_
    pred_clusters = kmeans.predict(X_test)
    
    y_pred_avg_km = []
    y_pred_rand_km = []
    y_pred_clus_km = []
    y_pred_mode_km = []
    for k in range(len(pred_clusters)):
        person = pred_clusters[k]
        lookalike_y = y_train[np.where(cluster_members==person)]
        y_pred_avg_km.append([np.mean(lookalike_y)])
        y_pred_rand_km.append([np.random.choice(lookalike_y.flatten())])
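        # axis=None takes the mode over all ratings in the cluster; np.ravel copes with
        # both the array result of older SciPy and the scalar result of newer SciPy
        y_pred_mode_km.append([np.ravel(stats.mode(lookalike_y, axis=None)[0])[0]])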
        
        lookalike_x = X_train[np.where(cluster_members==person)]
        regr = LinearRegression()
        regr.fit(lookalike_x, lookalike_y)
        y_pred_clus_km.append(regr.predict(X_test[k].reshape(1, -1))[0])
        
    
    y_pred_avg_km = np.asmatrix(y_pred_avg_km)
    y_pred_mode_km = np.asmatrix(y_pred_mode_km)
    y_pred_rand_km = np.asmatrix(y_pred_rand_km)
    y_pred_clus_km = np.asmatrix(y_pred_clus_km)
    
    print('KMeans Clustering')
    print('Variance score mean: %.2f' % r2_score(y_test, y_pred_avg_km))
    print('Variance score mode: %.2f' % r2_score(y_test, y_pred_mode_km))
    print('Variance score random: %.2f' % r2_score(y_test, y_pred_rand_km))
    print('Variance score regress: %.2f' % r2_score(y_test, y_pred_clus_km))
    print()
    
    dbscan = DBSCAN()
    dbscan.fit(X_train)
    cluster_members = dbscan.labels_
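    # Caution: DBSCAN has no predict(); fit_predict(X_test) re-clusters the test rows on
    # their own, so these labels are not guaranteed to match the training labels above.
    pred_clusters = dbscan.fit_predict(X_test)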
    
    y_pred_avg_db = []
    y_pred_mode_db = []
    y_pred_rand_db = []
    y_pred_clus_db = []
    
    for k in range(len(pred_clusters)):
        person = pred_clusters[k]
        lookalike_y = y_train[np.where(cluster_members==person)]
        y_pred_avg_db.append([np.mean(lookalike_y)])
        y_pred_mode_db.append([np.ravel(stats.mode(lookalike_y, axis=None)[0])[0]])
        y_pred_rand_db.append([np.random.choice(lookalike_y.flatten())])
        
        lookalike_x = X_train[np.where(cluster_members==person)]
        regr = LinearRegression()
        regr.fit(lookalike_x, lookalike_y)
        y_pred_clus_db.append(regr.predict(X_test[k].reshape(1, -1))[0])
        
    
    y_pred_avg_db = np.asmatrix(y_pred_avg_db)
    y_pred_mode_db = np.asmatrix(y_pred_mode_db)
    y_pred_rand_db = np.asmatrix(y_pred_rand_db)
    y_pred_clus_db = np.asmatrix(y_pred_clus_db)
    
    print('DBSCAN Clustering')
    print('Variance score mean: %.2f' % r2_score(y_test, y_pred_avg_db))
    print('Variance score mode: %.2f' % r2_score(y_test, y_pred_mode_db))
    print('Variance score random: %.2f' % r2_score(y_test, y_pred_rand_db))
    print('Variance score regress: %.2f' % r2_score(y_test, y_pred_clus_db))
    print()
    
    
    spect = SpectralClustering(n_clusters = np.max([int(X_train.shape[0]/50)+1, 3]) , random_state=0)
    spect.fit(X_train)
    cluster_members = spect.labels_
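    # Caution: SpectralClustering has no predict(); this call clusters X_test independently,
    # so test label k need not refer to the same group as training label k.
    pred_clusters = spect.fit_predict(X_test)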
    
    y_pred_avg_sc = []
    y_pred_mode_sc = []
    y_pred_rand_sc = []
    y_pred_clus_sc = []
    
    for k in range(len(pred_clusters)):
        person = pred_clusters[k]
        lookalike_y = y_train[np.where(cluster_members==person)]
        y_pred_avg_sc.append([np.mean(lookalike_y)])
        y_pred_mode_sc.append([np.ravel(stats.mode(lookalike_y, axis=None)[0])[0]])
        y_pred_rand_sc.append([np.random.choice(lookalike_y.flatten())])
        
        lookalike_x = X_train[np.where(cluster_members==person)]
        regr = LinearRegression()
        regr.fit(lookalike_x, lookalike_y)
        y_pred_clus_sc.append(regr.predict(X_test[k].reshape(1, -1))[0])
        
    
    y_pred_avg_sc = np.asmatrix(y_pred_avg_sc)
    y_pred_mode_sc = np.asmatrix(y_pred_mode_sc)
    y_pred_rand_sc = np.asmatrix(y_pred_rand_sc)
    y_pred_clus_sc = np.asmatrix(y_pred_clus_sc)
    
    print('Spectral Clustering')
    print('Variance score mean: %.2f' % r2_score(y_test, y_pred_avg_sc))
    print('Variance score mode: %.2f' % r2_score(y_test, y_pred_mode_sc))
    print('Variance score random: %.2f' % r2_score(y_test, y_pred_rand_sc))
    print('Variance score regress: %.2f' % r2_score(y_test, y_pred_clus_sc))
    print()
    
    agg = AgglomerativeClustering(n_clusters = np.max([int(X_train.shape[0]/50)+1, 3]))
    agg.fit(X_train)
    cluster_members = agg.labels_
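    # Caution: AgglomerativeClustering has no predict(); the test rows are clustered
    # separately here, so their labels are not tied to the training clusters.
    pred_clusters = agg.fit_predict(X_test)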
    
    y_pred_avg_ac = []
    y_pred_mode_ac = []
    y_pred_rand_ac = []
    y_pred_clus_ac = []
    
    for k in range(len(pred_clusters)):
        person = pred_clusters[k]
        lookalike_y = y_train[np.where(cluster_members==person)]
        y_pred_avg_ac.append([np.mean(lookalike_y)])
        y_pred_mode_ac.append([np.ravel(stats.mode(lookalike_y, axis=None)[0])[0]])
        y_pred_rand_ac.append([np.random.choice(lookalike_y.flatten())])
        
        lookalike_x = X_train[np.where(cluster_members==person)]
        regr = LinearRegression()
        regr.fit(lookalike_x, lookalike_y)
        y_pred_clus_ac.append(regr.predict(X_test[k].reshape(1, -1))[0])
        
    
    y_pred_avg_ac = np.asmatrix(y_pred_avg_ac)
    y_pred_mode_ac = np.asmatrix(y_pred_mode_ac)
    y_pred_rand_ac = np.asmatrix(y_pred_rand_ac)
    y_pred_clus_ac = np.asmatrix(y_pred_clus_ac)
    
    print('Agglomerative Clustering')
    print('Variance score mean: %.2f' % r2_score(y_test, y_pred_avg_ac))
    print('Variance score mode: %.2f' % r2_score(y_test, y_pred_mode_ac))
    print('Variance score random: %.2f' % r2_score(y_test, y_pred_rand_ac))
    print('Variance score regress: %.2f' % r2_score(y_test, y_pred_clus_ac))
    print()

Movie #10, 9    Fighter
Name: Name, dtype: object, average rating: 3.18 in 249 reviews
(249, 248)
(249, 1)
X_train : (199, 248) 	
X_test : (50, 248) 	
y_train : (199, 1) 	
y_test : (50, 1) 	

No Clustering
Variance score: 0.07

KMeans Clustering
Variance score mean: 0.01
Variance score mode: -0.01
Variance score random: -0.71
Variance score regress: 0.15

DBSCAN Clustering
Variance score mean: -0.01
Variance score mode: -0.05
Variance score random: -0.68
Variance score regress: 0.07

Spectral Clustering
Variance score mean: -0.08
Variance score mode: -0.26
Variance score random: -1.80
Variance score regress: 0.11

Agglomerative Clustering
Variance score mean: -0.86
Variance score mode: -1.94
Variance score random: -1.15
Variance score regress: -1.57

Movie #11, 10    Full Frame: Documentary Shorts
Name: Name, dtype: object, average rating: 3.03 in 198 reviews
(198, 197)
(198, 1)
X_train : (158, 197) 	
X_test : (40, 197) 	
y_train : (158, 1) 	
y_test : (40, 1) 	

No Clustering
Variance score: -0.34

KMeans Clustering
Variance score mean: -0.23
Variance score mode: -0.21
Variance score random: -0.82
Variance score regress: -0.44

DBSCAN Clustering
Variance score mean: -0.18
Variance score mode: -0.13
Variance score random: -0.71
Variance score regress: -0.34

Spectral Clustering
Variance score mean: -0.19
Variance score mode: -0.21
Variance score random: -1.45
Variance score regress: -0.32

Agglomerative Clustering
Variance score mean: -0.26
Variance score mode: -0.21
Variance score random: -0.66
Variance score regress: -0.14

Movie #12, 11    My Favorite Brunette
Name: Name, dtype: object, average rating: 3.42 in 546 reviews
(546, 545)
(546, 1)
X_train : (436, 545) 	
X_test : (110, 545) 	
y_train : (436, 1) 	
y_test : (110, 1) 	

No Clustering
Variance score: -11.52

KMeans Clustering
Variance score mean: -0.05
Variance score mode: -0.26
Variance score random: -0.47
Variance score regress: -1.00

DBSCAN Clustering
Variance score mean: -0.00
Variance score mode: -0.15
Variance score random: -1.02
Variance score regress: -11.52

Spectral Clustering
Variance score mean: -0.43
Variance score mode: -0.60
Variance score random: -1.21
Variance score regress: -1.05

Agglomerative Clustering
Variance score mean: -0.05
Variance score mode: -0.09
Variance score random: -0.54
Variance score regress: -0.16

Movie #13, 12    Lord of the Rings: The Return of the King: Ext...
Name: Name, dtype: object, average rating: 4.55 in 125 reviews
(125, 124)
(125, 1)
X_train : (100, 124) 	
X_test : (25, 124) 	
y_train : (100, 1) 	
y_test : (25, 1) 	

No Clustering
Variance score: -2.51

KMeans Clustering
Variance score mean: 0.03
Variance score mode: -0.66
Variance score random: -1.34
Variance score regress: -0.20

DBSCAN Clustering
Variance score mean: -0.02
Variance score mode: -0.66
Variance score random: -0.46
Variance score regress: -2.51

Spectral Clustering
Variance score mean: -0.02
Variance score mode: -0.66
Variance score random: -1.54
Variance score regress: -0.31

Agglomerative Clustering
Variance score mean: 0.06
Variance score mode: -0.66
Variance score random: -1.34
Variance score regress: -0.11

Movie #14, 13    Nature: Antarctica
Name: Name, dtype: object, average rating: 3.03 in 118 reviews
(118, 117)
(118, 1)
X_train : (94, 117) 	
X_test : (24, 117) 	
y_train : (94, 1) 	
y_test : (24, 1) 	

No Clustering
Variance score: -2073.74

KMeans Clustering
Variance score mean: -0.01
Variance score mode: -0.21
Variance score random: -1.16
Variance score regress: -801.08

DBSCAN Clustering
Variance score mean: -0.00
Variance score mode: -0.00
Variance score random: -0.87
Variance score regress: -2073.74

Spectral Clustering
Variance score mean: -0.09
Variance score mode: -0.00
Variance score random: -0.80
Variance score regress: -13.29

Agglomerative Clustering
Variance score mean: -0.84
Variance score mode: -2.16
Var


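DBSCAN, SpectralClustering, and AgglomerativeClustering do not provide a predict method, so calling fit_predict on the test rows produces labels that are unrelated to the training clusters. One possible workaround, sketched here on synthetic data, is to assign each test row to the training cluster with the nearest centroid. The helper name assign_to_train_clusters and the demo arrays are illustrative assumptions, not part of the assignment code, and nearest-centroid assignment is only one of several reasonable choices. The sketch also assumes at least one non-noise cluster exists.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances_argmin


def assign_to_train_clusters(X_train, train_labels, X_test):
    """Map each test row to the training cluster with the nearest centroid."""
    cluster_ids = np.unique(train_labels)
    # DBSCAN marks noise points with the label -1; leave them out of the centroids
    cluster_ids = cluster_ids[cluster_ids >= 0]
    centroids = np.vstack(
        [X_train[train_labels == c].mean(axis=0) for c in cluster_ids]
    )
    nearest = pairwise_distances_argmin(X_test, centroids)
    return cluster_ids[nearest]


# Synthetic stand-in data: two well-separated blobs (hypothetical, for illustration)
rng = np.random.RandomState(0)
X_train_demo = np.vstack([
    rng.normal(0, 0.5, (30, 2)),
    rng.normal(10, 0.5, (30, 2)),
])
X_test_demo = np.array([[0.2, -0.1], [9.8, 10.3]])

agg_demo = AgglomerativeClustering(n_clusters=2).fit(X_train_demo)
test_labels = assign_to_train_clusters(X_train_demo, agg_demo.labels_, X_test_demo)
print(test_labels)
```

With labels obtained this way, lookalike_y for a test row is drawn from the training cluster that the row is actually closest to.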

In [26]:
movie_results_dict[3]

{'data_time': 1.1898272037506104,
 'mse': 1.6863325,
 'name': '2    Character\nName: Name, dtype: object',
 'r2': -0.74125852269030212,
 'regr': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 'regr_time': 4.725301027297974}
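
A negative r2 value such as the one above means the model predicts worse than a constant prediction equal to the mean of the test ratings. r2_score computes 1 - SS_res / SS_tot, so it is 0 for the mean predictor and drops below 0 when the squared error exceeds the variance of the targets. The small sketch below uses made-up ratings (not Netflix data) to illustrate this.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical ratings, used only to illustrate the metric
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Predicting the mean of y_true for every sample gives R^2 = 0
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Reversed predictions are worse than the mean:
# SS_res = 40 and SS_tot = 10, so R^2 = 1 - 40 / 10 = -3
bad_pred = y_true[::-1].copy()
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean, r2_bad)
```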

In [22]:
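review_nums = []
# Note: range(17700) stops short of the 17,770 movies in the data set
for i in range(17700):
    num_reviews = ratings_small[:,i].count_nonzero()
    if num_reviews == 0:
        # Skip columns with no ratings to avoid a 0/0 division (NaN average)
        continue
    review_nums.append((i, num_reviews, np.sum(ratings_small[:,i]) / num_reviews))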


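The per-movie loop above slices one column at a time, which can be slow on a large sparse matrix. As a rough alternative, the review counts and average ratings for all columns can be computed in a few sparse operations. This is a minimal sketch on a tiny made-up matrix; the function name column_rating_stats is an assumption introduced for illustration, and it assumes (as the loop does) that a stored value of 0 means "no rating".

```python
import numpy as np
from scipy.sparse import csr_matrix


def column_rating_stats(ratings):
    """Return (counts, means) per column; means are NaN where a column has no ratings."""
    ratings = csr_matrix(ratings)
    # Number of non-zero ratings in each column
    counts = np.asarray((ratings != 0).astype(np.int64).sum(axis=0)).ravel()
    # Sum of ratings in each column
    sums = np.asarray(ratings.sum(axis=0)).ravel()
    # Divide only where there is at least one rating, leaving NaN elsewhere
    means = np.full(counts.shape, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    return counts, means


# Tiny hypothetical matrix: rows are users, columns are movies
demo = csr_matrix(np.array([[0, 5, 3],
                            [0, 4, 0],
                            [0, 3, 0]]))
counts, means = column_rating_stats(demo)
print(counts, means)
```

Leaving empty columns as NaN instead of dividing by zero avoids the runtime warning and makes it easy to filter them out before sorting.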

In [24]:
s = sorted(review_nums, key=lambda x: x[2])
for movie_id, num, avg_review in s[-20:]:
    print ('%s\t%s\t%s' % (num, avg_review, movie_by_id[movie_id]))

681	4.5389133627	Fruits Basket (2001)
17292	4.54256303493	The Simpsons: Season 5 (1993)
92470	4.54370065967	Star Wars: Episode V: The Empire Strikes Back (1980)
134284	4.54512078878	Lord of the Rings: The Return of the King (2003)
125	4.552	Lord of the Rings: The Return of the King: Extended Edition: Bonus Material (2003)
1883	4.55443441317	Inu-Yasha (2000)
8426	4.58129598861	The Simpsons: Season 6 (1994)
6621	4.58238936717	Arrested Development: Season 2 (2004)
220	4.58636363636	Ghost in the Shell: Stand Alone Complex: 2nd Gig (2005)
1238	4.59208400646	Veronica Mars: Season 1 (2004)
139660	4.59338393241	The Shawshank Redemption: Special Edition (1994)
89	4.59550561798	Tenchi Muyo! Ryo Ohki (1995)
25	4.6	Trailer Park Boys: Season 4 (2003)
75	4.6	Trailer Park Boys: Season 3 (2003)
1633	4.60502143295	Fullmetal Alchemist (2004)
1747	4.63880938752	Battlestar Galactica: Season 1 (2004)
7249	4.67098910195	Lost: Season 1 (2004)
74912	4.70261106365	Lord of the Rings: The Two Towers: Extended Ed