# Final Project - Movie Recommender
# DSC478 
# Eric Vistnes, Xiaojing Shen, Robert Kaszubski

Our goal is to explore ways in which we can recommend movies to users.
We have used different methods including Clustering, Classification, and SVD to do so.

Our dataset consisted of 17770 different movies rated by 480,189 with a grand total of 8,532,958,530 ratings. This proved difficult to process due to numerous issues with memory usage. We took several steps to make this dataset more manageable.


Please refer to the other attached Jupyter Notebooks for a breakdown on how we:

Combined the four txt files the dataset came in: Combine_Files.ipynb

Initially preprocessed and pivoted the entire dataset: CollaborativeFiltering.ipynd

Finally preprocesses and transformed the dataset to our final reduced version: Preprocessing and transformation.ipynd

### Begin Application:

We ask our user to rate movies they have seen from the most frequently rated in our dataset with the assumption that they have seen at least 5 of them.

In [1]:
import numpy as np
import pandas as pd

Read in our reduced and cleaned pickle file containing our dataset:

In [2]:
object = pd.read_pickle('data/cleanedMovie.pkl')
movies = pd.DataFrame(object)

Dataframe containing the most frequently rated films in order:

In [3]:
popular_movies = pd.DataFrame(movies["MovieID"].value_counts())

terms contains all of our movie titles, we remove unnecessary columns such as the release year:

In [4]:
terms = pd.read_csv('data/movie_titles.txt', sep='\t', encoding = "ISO-8859-1", header=None, index_col=0)
terms = terms.iloc[:,:2]
terms = terms.iloc[:,1:]

Function to query movie title data and return the title for the given movieID:

In [5]:
def toTitle(MovieID):
    return terms.loc[ MovieID , : ][2]

Function to ask users to rate n number of movies to get ratings to use for recommendations:

Returns an array of the movies the user chose to rate in the format [[MovieID, rating]]

In [6]:
def initial_rating(n):
    print("These are the top",n, "most watched movies on our system:")
    print("Please rate at least 5 movies on a scale of 1 to 5 (1,2,3,4,5), you haven't seen it just hit Enter.")
    print("We can then start providing you recommendations based on your initial input!")
    print("")
    ratings = []
    for mov in range(0,n):
        ID = popular_movies.index[mov]
        title = toTitle(ID)
        print("You are rating", title)
        try:
            while True:
                b = int(input("Rate from 1 to 5: "))
                if b < 1 or b > 5:
                    print("Sorry, your response must be on a scale of 1 to 5.")
                    continue
                else:
                    break
            print("You rated", title, "as a", b)
            ratings.append([ID,b])
        except:
            print("You skipped", title)
            print("")
            continue
        print("")
        
        
    return ratings

#### Please follow the instructions below and try to rate at least 5 movies

In [7]:
user_ratings = initial_rating(20)

These are the top 20 most watched movies on our system:
Please rate at least 5 movies on a scale of 1 to 5 (1,2,3,4,5), you haven't seen it just hit Enter.
We can then start providing you recommendations based on your initial input!

You are rating Forrest Gump
Rate from 1 to 5: 5
You rated Forrest Gump as a 5

You are rating The Sixth Sense
Rate from 1 to 5: 
You skipped The Sixth Sense

You are rating Pirates of the Caribbean: The Curse of the Black Pearl
Rate from 1 to 5: 4
You rated Pirates of the Caribbean: The Curse of the Black Pearl as a 4

You are rating The Matrix
Rate from 1 to 5: 4
You rated The Matrix as a 4

You are rating Spider-Man
Rate from 1 to 5: 5
You rated Spider-Man as a 5

You are rating Men in Black
Rate from 1 to 5: 
You skipped Men in Black

You are rating The Silence of the Lambs
Rate from 1 to 5: 5
You rated The Silence of the Lambs as a 5

You are rating Independence Day
Rate from 1 to 5: 4
You rated Independence Day as a 4

You are rating Jurassic Park
Rat

Here is the data we just captured from your responses
Returned array of ratings in the format of: [MovieID, Rating] :

In [8]:
user_ratings

[[11283, 5],
 [1905, 4],
 [14691, 4],
 [14410, 5],
 [2862, 5],
 [15124, 4],
 [14312, 4],
 [6971, 4],
 [15107, 4],
 [10042, 4],
 [2452, 3],
 [8387, 2],
 [607, 1],
 [17088, 1],
 [16384, 1],
 [11064, 4]]

### Now we are going to CLUSTER you based on your ratings

In [9]:
object = pd.read_pickle('data/cleanedMovie.pkl')
movies = pd.DataFrame(object)

Pivoting data to create a Customer by Movie Matrix of ratings to use for doc term clustering:

In [10]:
movieMatrix = movies.pivot_table(values='Rating', index='CustomerID', columns='MovieID')

Replacing missing values with 0's

In [11]:
movieMatrix = movieMatrix.fillna(0)
movie_arr = np.array(movieMatrix)

In [12]:
from sklearn.cluster import KMeans

Determining best k value:
Uncomment to run (takes a while) - look into Clustering.ipynb for output

In [13]:
#import matplotlib.pyplot as plt
#sse = {}
#for k in range(1, 20):
    #kmeans = KMeans(n_clusters=k).fit(movie_arr)
    #sse[k] = kmeans.inertia_

#plt.figure()
#plt.plot(list(sse.keys()), list(sse.values()))
#plt.xlabel("Number of cluster")
#plt.ylabel("SSE")
#plt.show()

#sse = {}
#for k in range(1, 10):
    #kmeans = KMeans(n_clusters=k).fit(movie_arr)
    #sse[k] = kmeans.inertia_
#plt.figure()
#plt.plot(list(sse.keys()), list(sse.values()))
#plt.xlabel("Number of cluster")
#plt.ylabel("SSE")
#plt.show()

In [14]:
kmeans = KMeans(n_clusters=10)
kmeans.fit(movie_arr)

KMeans(n_clusters=10)

In [15]:
def top_movies(df, n):
    for mov in range(0,n):
        print(toTitle(df.index[mov]),df.loc[df.index[mov]][0] )

In [16]:
np.set_printoptions(precision=2,suppress=True)

In [17]:
def print_clusters(kmeans, k, n):
    for cluster in range(0,k):
        clust = pd.DataFrame(kmeans.cluster_centers_[cluster])
        clust.index = movieMatrix.columns
        #print(clust)
        #sortDF = pd.DataFrame(clust,terms)
        #print(sortDF)
        sortDF = clust.sort_values(by=[0],ascending=False)
        #print(sortDF.loc[sortDF.index[0]][0])
        #print(sortDF)
        print("Top movies in Cluster", cluster+1)
        top_movies(sortDF, n)
        print("")

So these are the top movies found in each cluster:

In [18]:
print_clusters(kmeans,10,10)

Top movies in Cluster 1
Pulp Fiction 4.612186788154897
Raiders of the Lost Ark 4.598519362186787
The Matrix 4.5461275626423685
The Shawshank Redemption: Special Edition 4.502277904328017
Lord of the Rings: The Fellowship of the Ring 4.488041002277903
The Usual Suspects 4.466970387243738
Fight Club 4.458428246013669
Lord of the Rings: The Two Towers 4.44362186788155
The Silence of the Lambs 4.407744874715261
Braveheart 4.376993166287016

Top movies in Cluster 2
Raiders of the Lost Ark 4.040355125100888
Lord of the Rings: The Fellowship of the Ring 3.970944309927361
Lord of the Rings: The Two Towers 3.9281678773204196
The Matrix 3.87409200968523
Pirates of the Caribbean: The Curse of the Black Pearl 3.8006456820016146
Indiana Jones and the Last Crusade 3.7473769168684425
Lord of the Rings: The Return of the King 3.7046004842615012
The Sixth Sense 3.619854721549637
The Terminator 3.565778853914447
Star Wars: Episode V: The Empire Strikes Back 3.4761904761904763

Top movies in Cluster 3
Fo

Function to print top n movies in a cluster ignoring movies the user has already seen and rated:

In [19]:
def top_movies_user(df, n):
    watched = []
    for movie in user_ratings:
        watched.append([movie][0][0])
    moviesreturned = 0
    movieidx = 0
    while moviesreturned != n:
        if df.index[movieidx] not in watched:
            print(toTitle(df.index[movieidx]),df.loc[df.index[movieidx]][0])
            movieidx += 1
            moviesreturned +=1
        else:
            movieidx +=1

Function to print a singular cluster:

In [20]:
def print_clust(kmeans, k, n):
        clust = pd.DataFrame(kmeans.cluster_centers_[k-1])
        clust.index = movieMatrix.columns
        #print(clust)
        #sortDF = pd.DataFrame(clust,terms)
        #print(sortDF)
        sortDF = clust.sort_values(by=[0],ascending=False)
        #print(sortDF.loc[sortDF.index[0]][0])
        #print(sortDF)
        print("Top movies in Cluster", k)
        top_movies_user(sortDF, n)
        print("")

Function to convert user input from the start of the application into a row of data, passable into kmeans

In [21]:
def user_row(user_data):
    out = pd.DataFrame(np.zeros(len(movieMatrix.columns)), index=movieMatrix.columns).T
    for rating in user_data:
        out[rating[0]] = rating[1]
    out = np.array(out)
    return out

In [22]:
user_data = user_row(user_ratings)

Cluster prediction for new user:

In [23]:
user_cluster = kmeans.predict(user_data)

Adding a 1 to the user cluster because we don't want a cluster 0

In [24]:
print("User belongs to cluster:", user_cluster[0]+1)

User belongs to cluster: 2


Recommended movies from that cluster excluding movies already seen by the user:

In [25]:
print_clust(kmeans, user_cluster[0]+1, 10)

Top movies in Cluster 2
Lord of the Rings: The Two Towers 3.9281678773204196
Indiana Jones and the Last Crusade 3.7473769168684425
Lord of the Rings: The Return of the King 3.7046004842615012
The Sixth Sense 3.619854721549637
The Terminator 3.565778853914447
Star Wars: Episode V: The Empire Strikes Back 3.4761904761904763
Gladiator 3.455205811138014
Men in Black 3.4463276836158196
Die Hard 3.38498789346247
X-Men 3.375302663438257



Lets try it again using average instead of 0's. 0's favor the most often rated movies too much

In [26]:
movieMatrix = movies.pivot_table(values='Rating', index='CustomerID', columns='MovieID')

Using the global average + the average of each movie divided by the number of ratings

In [27]:
globalavg = movieMatrix.mean().mean()
rating_sums= movieMatrix.sum(axis=0) 
rating_count = movieMatrix.count(axis=0)
missing_ratings = (rating_sums + globalavg*25) / (25+rating_count)

In [28]:
movieMatrix = movieMatrix.fillna(missing_ratings)
movieMatrix

MovieID,1,2,3,4,5,6,8,10,11,12,...,17761,17762,17763,17764,17765,17766,17767,17768,17769,17770
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
769,3.441501,2.648875,3.333578,2.582291,3.214042,2.890785,2.986066,2.798204,2.783513,3.341931,...,2.833597,2.000000,3.210411,3.000000,2.534275,3.063489,2.770233,2.827028,2.532746,2.823785
1333,3.441501,2.648875,4.000000,2.582291,3.214042,2.890785,3.000000,2.798204,2.783513,3.341931,...,4.000000,3.000000,1.000000,4.000000,2.534275,3.063489,4.000000,2.827028,2.532746,2.823785
1442,3.441501,2.648875,3.333578,2.582291,3.214042,2.890785,2.986066,2.798204,2.783513,3.341931,...,2.833597,3.461434,3.210411,4.000000,2.534275,3.063489,2.770233,2.827028,2.532746,2.823785
2213,3.441501,2.648875,3.333578,2.582291,3.214042,2.890785,2.986066,2.798204,2.783513,3.341931,...,2.833597,4.000000,3.210411,4.000000,2.534275,3.063489,2.770233,3.000000,2.532746,4.000000
2455,3.441501,2.648875,3.333578,2.582291,3.214042,2.890785,2.986066,2.798204,2.783513,3.341931,...,2.833597,3.000000,3.210411,3.000000,2.534275,3.063489,2.770233,3.000000,2.532746,3.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2648589,3.441501,2.648875,3.333578,2.582291,3.214042,2.890785,2.986066,2.798204,2.783513,3.341931,...,2.833597,4.000000,3.210411,5.000000,2.534275,3.063489,2.770233,2.827028,2.532746,2.823785
2648734,3.441501,2.648875,3.333578,2.582291,3.214042,2.890785,2.986066,2.798204,2.783513,3.341931,...,2.833597,4.000000,3.210411,5.000000,2.534275,3.063489,2.770233,2.827028,2.532746,2.823785
2648869,3.441501,2.648875,3.333578,2.582291,3.214042,2.890785,2.986066,2.798204,2.783513,3.341931,...,2.833597,4.000000,3.210411,5.000000,2.534275,3.063489,2.770233,2.827028,2.532746,2.823785
2648885,3.441501,2.648875,3.333578,2.582291,3.214042,2.890785,2.986066,2.798204,2.783513,3.341931,...,2.833597,3.461434,3.210411,3.755668,2.534275,3.063489,2.770233,2.827028,2.532746,2.823785


In [29]:
movie_arr = np.array(movieMatrix)

In [30]:
kmeans = KMeans(n_clusters=10)
kmeans.fit(movie_arr)

KMeans(n_clusters=10)

In [31]:
user_cluster = kmeans.predict(user_data)

In [32]:
print("User belongs to cluster:", user_cluster[0]+1)

User belongs to cluster: 5


In [33]:
print_clust(kmeans, user_cluster[0]+1, 10)

Top movies in Cluster 5
The Passion of the Christ 5.0
Titanic 5.0
Black Hawk Down 5.0
We Were Soldiers 5.0
Saving Private Ryan 5.0
E.T. the Extra-Terrestrial: The 20th Anniversary (Rerelease) 4.666666666666667
O Brother 4.666666666666667
Bringing Down the House 4.666666666666667
Lord of the Rings: The Return of the King 4.666666666666667
Remember the Titans 4.666666666666667



### Now we will recommend movies using K Nearest Neighbor Classification:

### user based recommender 
#### find best key

In [34]:
from sklearn.neighbors import NearestNeighbors
from math import sqrt
from sklearn.metrics import mean_squared_error
import math

In [35]:

import os
cwd = os.getcwd()

In [36]:
df = pd.read_pickle('data/cleanedMovie.pkl')

In [37]:
user_movie_df = df.pivot(index='CustomerID',columns ='MovieID' ,values='Rating').fillna(0)
user_movie_df

MovieID,1,2,3,4,5,6,8,10,11,12,...,17761,17762,17763,17764,17765,17766,17767,17768,17769,17770
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
1333,0.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,4.0,3.0,1.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0
1442,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
2213,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,4.0
2455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2648589,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
2648734,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
2648869,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
2648885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
movie_names = list(user_movie_df.columns) 
customer_names = list(user_movie_df.index.values)

In [39]:
user_movie_df_rename = user_movie_df

In [40]:
user_movie_df_rename.columns = range(16795)

In [41]:
user_movie_df_rename = user_movie_df_rename.reset_index()

In [42]:
user_movie_df_rename = user_movie_df_rename.drop(columns=['CustomerID'])
user_movie_df_rename

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16785,16786,16787,16788,16789,16790,16791,16792,16793,16794
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,4.0,3.0,1.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,4.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
13137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
13138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
13139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:

users_movie_array = user_movie_df_rename.values
users_movie_array

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 4., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [44]:
from sklearn.neighbors import NearestNeighbors

In [45]:

def cross_validate_user(dataMat, movie, test_ratio, k):

    number_of_users = np.shape(dataMat)[0] 
    rated_items_by_user = np.array([i for i in range(number_of_users) if dataMat[i,movie]>0])
    test_size = math.ceil(test_ratio * len(rated_items_by_user))  # round up the test_size
    test_indices = np.random.randint(0, len(rated_items_by_user), test_size)
    withheld_users = rated_items_by_user[test_indices]
    original_movie_profile = np.copy(dataMat[:, movie])
    dataMat[withheld_users, movie] = 0 # So that the withheld test items is not used in the rating estimation below
    error_u = 0.0
    count_u = len(withheld_users)
    # Compute absolute error for user u over all test items
    knn = NearestNeighbors(metric='cosine',algorithm = 'brute', n_neighbors=k)
    knn.fit(dataMat)
    for user in withheld_users:
        sum_predict = 0
        count_predict = 0 

        # Estimate rating on the withheld item
        neigh_dist, neigh_ind= knn.kneighbors(dataMat[user, :].reshape(1, 16795), n_neighbors=k+1)
        distLst = neigh_dist.tolist()[0][1:]
        indLst = neigh_ind.tolist()[0][1:]

        for j in indLst:
            if original_movie_profile[j] != 0:
                sum_predict += original_movie_profile[j]
                count_predict += 1
        if count_predict == 0:
            continue
        else:
            estimatedScore = sum_predict/count_predict
            error_u = error_u + abs(estimatedScore - original_movie_profile[user])

    # Now restore ratings of the withheld items to the user profile
    for user in withheld_users:
        dataMat[user, movie] = original_movie_profile[user]

    # Return sum of absolute errors and the count of test cases for this user
    # Note that these will have to be accumulated for each user to compute MAE
    return error_u, count_u

In [46]:
def test(dataMat, test_ratio, k):
# Write this function to iterate over all users and for each perform evaluation by calling
# the above cross_validate_user function on each user. MAE will be the ratio of total error 
# across all test cases to the total number of test cases, for all users
    total_error = 0
    cases_number = 0
    movies_number = np.shape(dataMat)[1]
    test_size = math.ceil(test_ratio * movies_number)
    test_indices = np.random.randint(0, movies_number, test_size)
    for movie in test_indices:
        error_user, count_user = cross_validate_user(dataMat, movie, test_ratio, k)
        total_error += error_user
        cases_number += count_user
    MAE = total_error/cases_number
    return MAE

Uncomment to run to determine the best K: takes close to an hour to complete
View results on seperate notebook:  KNN_3.ipynb

In [47]:
#from tqdm import tqdm

#KLst = [10, 30, 50]
#for k in tqdm(KLst):
    #MAE = test(users_movie_array, 0.005, k)
    #print(MAE)
    #print('-'*40)

make recommendation

In [48]:
user_movie_df = df.pivot(index='CustomerID',columns ='MovieID' ,values='Rating').fillna(0)
user_movie_df

MovieID,1,2,3,4,5,6,8,10,11,12,...,17761,17762,17763,17764,17765,17766,17767,17768,17769,17770
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
1333,0.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,4.0,3.0,1.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0
1442,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
2213,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,4.0
2455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2648589,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
2648734,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
2648869,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
2648885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [49]:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(metric='cosine',algorithm = 'brute', n_neighbors=10)

In [50]:
movie_title = pd.read_csv("data/movie_titles.csv", encoding='unicode_escape', usecols=[2], header=None)
movie_title.columns = ['title']
movie_title

Unnamed: 0,title
0,Dinosaur Planet
1,Isle of Man TT 2004 Review
2,Character
3,Paula Abdul's Get Up & Dance
4,The Rise and Fall of ECW
...,...
17765,Where the Wild Things Are and Other Maurice Se...
17766,Fidel Castro: American Experience
17767,Epoch
17768,The Company


In [51]:
# create a function which takes a movie name and make recommedation for it
recommendation = {}
def make_recommendation(input_user,data,model,n_recommendation):
    model.fit(data)
    input_user_array = input_user.to_numpy()
    similar_users_list = (model.kneighbors(input_user_array,n_neighbors=n_recommendation+1,return_distance=False)).tolist()
    for i in similar_users_list[0][1:]:
        for j in data.columns:
            ratingLst = []
            if int(input_user[j]) == 0 and data.iloc[i][j] > 3:
                if j not in recommendation:
                    ratingLst.append(data.iloc[i][j])
                    recommendation[j] = ratingLst
                else:
                    recommendation.get(j).append(data.iloc[i][j])
    print("The new user who will like following movies.")
    number = 0
    for k in sorted(recommendation, key=lambda k: len(recommendation[k]), reverse=True):
        if number < 5:
            res = movie_title.loc[k-1]['title']
            print(res)
            number += 1 #recommend top 5 movies

In [52]:
#convert user_ratings from list to dataframe
# create a empty dataframe
column_names = movie_names
df_newUser = pd.DataFrame(columns = column_names)

In [53]:
for i in user_ratings:
    df_newUser.at[0 , i[0]] = i[1]
df_newUser

Unnamed: 0,1,2,3,4,5,6,8,10,11,12,...,17761,17762,17763,17764,17765,17766,17767,17768,17769,17770
0,,,,,,,,,,,...,,,,,,,,,,


In [54]:
#check input:
df_newUser.isna().sum().sum()

16779

In [55]:
newUser_df = df_newUser.fillna(0)
newUser_df

Unnamed: 0,1,2,3,4,5,6,8,10,11,12,...,17761,17762,17763,17764,17765,17766,17767,17768,17769,17770
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
make_recommendation(newUser_df, user_movie_df, knn, 10)

The new user who will like following movies.
Indiana Jones and the Last Crusade
Seven
Gladiator
The Sixth Sense
Lord of the Rings: The Two Towers


#### item based recommender
#### find best key

In [57]:
# pivot data to movie-users matrix
movie_users_df = df.pivot(index='MovieID',columns = 'CustomerID',values='Rating').fillna(0)
movie_users_df_rename = movie_users_df
movie_users_df_rename.columns = range(13141)
movie_users_df_rename = movie_users_df_rename.reset_index()
movie_users_df_rename = movie_users_df_rename.drop(columns=['MovieID'])

In [58]:
movie_users_array = movie_users_df_rename.values

In [59]:
from sklearn.neighbors import NearestNeighbors

In [60]:
def cross_validate_user(dataMat, user, test_ratio, k):

    number_of_items = np.shape(dataMat)[0] 
    rated_items_by_user = np.array([i for i in range(number_of_items) if dataMat[i,user]>0])
    test_size = math.ceil(test_ratio * len(rated_items_by_user))  # round up the test_size
    test_indices = np.random.randint(0, len(rated_items_by_user), test_size)
    withheld_items = rated_items_by_user[test_indices]
    original_user_profile = np.copy(dataMat[:,user])
    dataMat[withheld_items, user] = 0 # So that the withheld test items is not used in the rating estimation below
    error_u = 0.0
    count_u = len(withheld_items)
    # Compute absolute error for user u over all test items
    knn = NearestNeighbors(metric='cosine',algorithm = 'brute', n_neighbors=k)
    knn.fit(dataMat)
    for item in withheld_items:
        sum_predict = 0
        count_predict = 0
        # Estimate rating on the withheld item
        neigh_dist, neigh_ind= knn.kneighbors(dataMat[item, :].reshape(1, 13141), n_neighbors=k+1)
        distLst = neigh_dist.tolist()[0][1:]
        indLst = neigh_ind.tolist()[0][1:]
        for j in indLst:
            if original_user_profile[j] != 0:
                sum_predict += original_user_profile[j]
                count_predict += 1
        if count_predict == 0:
            continue
        else:
            estimatedScore = sum_predict/count_predict   
            error_u = error_u + abs(estimatedScore - original_user_profile[item])

    # Now restore ratings of the withheld items to the user profile
    for item in withheld_items:
        dataMat[item, user] = original_user_profile[item]

    # Return sum of absolute errors and the count of test cases for this user
    # Note that these will have to be accumulated for each user to compute MAE
    return error_u, count_u

In [61]:
def test(dataMat, test_ratio, k):
# Write this function to iterate over all users and for each perform evaluation by calling
# the above cross_validate_user function on each user. MAE will be the ratio of total error 
# across all test cases to the total number of test cases, for all users
    total_error = 0
    cases_number = 0
    users_number = np.shape(dataMat)[1]
    test_size = math.ceil(test_ratio * users_number)
    test_indices = np.random.randint(0, users_number, test_size)
    for user in test_indices:
        error_user, count_user = cross_validate_user(dataMat, user, test_ratio, k)
        total_error += error_user
        cases_number += count_user
    MAE = total_error/cases_number
    return MAE

In [62]:
#KLst = [10, 30, 50]
#for k in KLst:
    #MAE = test(movie_users_array, 0.005, k)
    #print(MAE)
    #print('-'*40)

k = 30 turns out to be the best result

make recommendation:

In [63]:
movie_users_df = df.pivot(index='MovieID',columns = 'CustomerID',values='Rating').fillna(0)

In [64]:
newUser_dict = {}
for i in user_ratings:
    newUser_dict[i[0]] = i[1]
newUser_dict

{11283: 5,
 1905: 4,
 14691: 4,
 14410: 5,
 2862: 5,
 15124: 4,
 14312: 4,
 6971: 4,
 15107: 4,
 10042: 4,
 2452: 3,
 8387: 2,
 607: 1,
 17088: 1,
 16384: 1,
 11064: 4}

In [65]:
#find top 3 favorite movies of the new user
newUser_dict_sort = dict(sorted(newUser_dict.items(), key=lambda item: item[1],reverse=True))
top3movies = {k: newUser_dict_sort[k] for k in list(newUser_dict_sort)[:3]}
top3movies

{11283: 5, 14410: 5, 2862: 5}

In [66]:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(metric='cosine',algorithm = 'brute', n_neighbors=30)

In [67]:

def find_similarity(topMovies,data,model,n_recommendation):
    dict_similarity = {}
    model.fit(data)
    for index in topMovies:
        array_index = data.iloc[index].to_numpy()
        neigh_dist, neigh_ind = model.kneighbors(array_index.reshape(1, 13141),n_neighbors=n_recommendation+1)
        neigh_dist_Lst = neigh_dist.tolist()[0][1:] #ignore itself
        neigh_ind_Lst = neigh_ind.tolist()[0][1:]
        for i in range(len(neigh_dist_Lst)):
            if neigh_ind_Lst[i] not in dict_similarity:
                dict_similarity[neigh_ind_Lst[i]] = (1-neigh_dist_Lst[i]) * topMovies.get(index)
            else:
                ratio = dict_similarity.get(neigh_ind_Lst[i])
                new_ratio = (ratio + (1-neigh_dist_Lst[i]) * topMovies.get(index))/2
                dict_similarity[neigh_ind_Lst[i]] = new_ratio
            
    sort_dict = dict(sorted(dict_similarity.items(), key=lambda item: item[1],reverse=True))
    return sort_dict

In [68]:
def make_recommendation(topMovies, newUser_rating, data, model, n_recommendation):    
    count = 0
    sort_dict = find_similarity(topMovies,data,model,n_recommendation)
    print("The new user who will like following movies:")
    for movie_ind in sort_dict:
        if count < 5:
            if int(newUser_rating[movie_ind]) == 0:
                res = movie_title.loc[movie_ind-1]['title']
            print(res)
            count += 1
    return

In [69]:
make_recommendation(top3movies, newUser_df, movie_users_df,knn,10)

The new user who will like following movies:
Outrage
The Rachel Papers
Layer Cake
The Fassbinder Collection: Pioneer in Ingolstadt
Love Affair


# SVD Predictor

Creating and testing the SVD Predictor model on our dataframe. First we need to split the data in train and test set, and then run GridSearchCV on the training set in order to find the best n factors value to pass into the SVD model creation. 

In [70]:
import pickle
from datetime import datetime
from tqdm import tqdm
import numpy as np
import pandas as pd
import os
import sklearn
from sklearn.metrics.pairwise import cosine_similarity
import random
from surprise import Reader, Dataset
from surprise import SVD
from surprise import SVDpp
from surprise.model_selection import GridSearchCV
#import xgboost as xbg
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [71]:
movie = pd.read_pickle("data/cleanedMovie.pkl")
movie.head()

Unnamed: 0,MovieID,CustomerID,Rating
0,1,1488844,3
3,1,30878,4
7,1,1248029,3
19,1,372233,5
20,1,1080361,3


In [72]:
movie_train, movie_test = sklearn.model_selection.train_test_split(movie, train_size = 0.8)

In [73]:
print(movie_test.shape)
movie_test.head()

(3751486, 3)


Unnamed: 0,MovieID,CustomerID,Rating
99727773,17586,2132252,4
21404896,3999,2356911,5
61705056,11236,1106280,5
38540143,6847,1194911,3
92225761,16377,220512,5


In [74]:
print(movie_train.shape)
movie_train.head()

(15005940, 3)


Unnamed: 0,MovieID,CustomerID,Rating
26339577,4906,1493697,3
54475320,9939,981753,4
34580137,6206,673187,5
38081724,6786,546017,5
54548614,9950,1508625,3


In [75]:
reader = Reader(rating_scale=(1,5))
movieInput = pd.DataFrame()
movieInput['CustomerID'] = movie_train['CustomerID']
movieInput['MovieID'] = movie_train['MovieID']
movieInput['Rating'] = movie_train['Rating']

train_data = Dataset.load_from_df(movieInput, reader)
trainset = train_data.build_full_trainset()

In [76]:
testset = list(zip(movie_test["CustomerID"].values, movie_test["MovieID"].values, movie_test["Rating"].values))

In [77]:
error_table = pd.DataFrame(columns = ["Model", "Train_RMSE", "Test_RMSE"])

In [78]:
trainset.global_mean

3.418664942016295

## Creating Fit and Prediction Method

In [79]:
def run_surprise(algo, trainset, testset, model_name):
    start = datetime.now()
    algo.fit(trainset)
    
    pred_train = algo.test(trainset.build_testset())
    
    trainActual = np.array([p.r_ui for p in pred_train])
    trainPred = np.array([p.est for p in pred_train]) 
    trainRMSE = np.sqrt(mean_squared_error(trainActual, trainPred))
    
    print("Train Data RMSE: {}".format(trainRMSE))
    print("\n")
    
    train = {"RMSE": trainRMSE, "Prediction": trainPred}
    
    pred_test = algo.test(testset)
    testActual = np.array([p.r_ui for p in pred_test])
    testPred = np.array([p.est for p in pred_test])
    testRMSE = np.sqrt(mean_squared_error(testActual, testPred))
    
    print("Test Data RMSE: {}".format(testRMSE))
    print("\n")
    
    test = {"RMSE": testRMSE, "Prediction": testPred}
    
    print("Time Taken = " + str(datetime.now() - start))
        
    return train, test

## Finding N Factors

GridSearchCV cannot handle the amount of data we are passing through, so we will run GridSearchCV on a smaller portion of the dataset in order to return the best n_factors to pass into SVD. SVD will itself get the full dataset we are using. Only run the code below for GridSearchCV if you want to run through the whole code. The fit command can take up to an hour, and the results are always the same, so we can hard code in the parameter for SVD.

In [80]:
params = { 'n_factors': [5, 10, 15, 20, 25, 30, 35, 40, 50]}
grid = GridSearchCV(SVD, params, measures=['rmse'], cv=3, refit=True)
grid.fit(Dataset.load_from_df(movieInput.iloc[:1500000], reader))
print(grid.best_score['rmse'])

0.9153264033462928


Of the N factors passed in, we can find the one that had the best RMSE and use that in the SVD model. Below, we use that directly from the calculation above. In the following class file, we use the value as a static variable in order to minimize processing time on unnecessary calculations. The results of grid.best_params['rmse']['n_factors'] are always 5, so feel free to input that into the parameters.

In [81]:
algo = SVD(n_factors = grid.best_params['rmse']['n_factors'], biased=True, verbose=True)
train_result, test_result = run_surprise(algo, trainset, testset, "SVD")

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Train Data RMSE: 0.8193422511672376


Test Data RMSE: 0.8286722145400238


Time Taken = 0:08:00.550029


The best results for the test data are RMSE of 0.829, which is an extremely good RMSE for the Netflix Prize Data! Of course, the data may be somewhat skewed from only taking the most active users and popular movies. However, for our data of active users, this is highly effective! We can test below on less active users to see the RMSE for users we didn't include.

## Testing on Unused Data

Here we read in some of the rest of the data that was not passed into the pickle file in the beginning. We test for the RMSE of this test data using the SVD algorithm above.

In [82]:
cwd = os.getcwd()
movie = pd.read_csv(cwd + "/data/final.csv")
movie.describe()

Unnamed: 0,MovieID,CustomerID,Rating
count,100480500.0,100480500.0,100480500.0
mean,9070.915,1322489.0,3.60429
std,5131.891,764536.8,1.085219
min,1.0,6.0,1.0
25%,4677.0,661198.0,3.0
50%,9051.0,1319012.0,4.0
75%,13635.0,1984455.0,4.0
max,17770.0,2649429.0,5.0


In [83]:
reduced_data = movie.drop(columns=['Date'])

reduced_data['MovieID'] = reduced_data['MovieID'].astype('int16')
reduced_data['CustomerID'] = reduced_data['CustomerID'].astype('int32')
reduced_data['Rating'] = reduced_data['Rating'].astype('int8')

In [84]:
movie_freq = pd.DataFrame(reduced_data.groupby('MovieID').size(),columns=['count'])
threshold = 100

popular_movies = list(set(movie_freq.query('count>=@threshold').index))

# ratings df after dropping non popular movies
data_popular_movies = reduced_data[reduced_data.MovieID.isin(popular_movies)]

print('shape of original data:', reduced_data.shape)
print('shape of data_popular_movies', data_popular_movies.shape)
print("No. of movies which are rated more than 100 times:", len(popular_movies))

shape of original data: (100480507, 3)
shape of data_popular_movies (100400918, 3)
No. of movies which are rated more than 100 times: 16795


In [85]:
# reduce data_popular_movie to only have movies that are in movie
movieList = movie.MovieID.tolist()
popMoviesTest = data_popular_movies[data_popular_movies.MovieID.isin(movieList)]

The below code takes a very long time to run as well, as it is 100,000,000 rows of data being passed in. You can uncomment the line near the top to reduce the dataset to a more manageable size. If not, it takes about an hour to run.

In [86]:
start = datetime.now()

reducedtestset = list(zip(popMoviesTest["CustomerID"].values, popMoviesTest["MovieID"].values, popMoviesTest["Rating"].values))
#reducedtestset = reducedtestset.iloc[:2000000]

pred_test = algo.test(reducedtestset)
testActual = np.array([p.r_ui for p in pred_test])
testPred = np.array([p.est for p in pred_test])
testRMSE = np.sqrt(mean_squared_error(testActual, testPred))
    
print("Test Data RMSE: {}".format(testRMSE))
print("\n")
    
test = {"RMSE": testRMSE, "Prediction": testPred}
    
print("Time Taken = " + str(datetime.now() - start))

AttributeError: 'list' object has no attribute 'iloc'

Based on the RMSE from the test set we just ran, our algorithm is definitely biased toward users who have rated very frequently. There is likely a high level of correlation between those users, which affects our results. This proves that our algorithm is highly effective for those users, but only average for users who do not rate in the top percentile.

However, the testing we ran on the full data set of users may not have provided accurate results as the model can not accurately predict ratings for users that were not at all in the training data. For those users, they would receive a rating that does not take into account their personal preference or bias. 

## Creating a Mini-Recommender

Though we already have 2 models for recommenders, we can also use the SVD model as a recommender for a Customer at a time. Below is code to recommend 10 movies to a given Customer.

In [87]:
def recommendFor(customerID, model):
    predictions = []
    ids = []
    for mov in movie.MovieID.unique().tolist():
        predictions.append(model.predict(customerID, mov).est)
        ids.append(mov)
    return predictions, ids

In [None]:
preds, movIds = recommendFor(1, svd)

In [88]:
def recommendedMovies(count, preds, movs):
    movieAndRating = {}
    copyPreds = preds[:]
    for i in range(count):
        index = copyPreds.index(max(copyPreds))
        maxPred = max(copyPreds)
        mov = movs[index]
        title = movie_title.iloc[mov-1:mov]['title'][mov-1]
        movieAndRating[title] = maxPred
        copyPreds.pop(index)
    return movieAndRating

In [None]:
recommendedMovies(10, preds, movIds)

The final product is a recommender that returns the count of top movies for the chosen Customer, and gives the predicted rating. The recommender is also a part of the class above, so let's test it out.

## Putting it all together

Take all the information we gathered, the functions we built, and the models we created, and put them all into one class. The class saves algorithms as pickle files to be reused later without having to calculate the model and algorithm all over again. The class also has no testset, as there is no need for verification at this stage- only fitting the model and predicting values for the given user.

In [89]:
import pickle
from tqdm import tqdm
import numpy as np
import pandas as pd
import os
import pathlib
from surprise import Reader, Dataset
from surprise import SVD
from surprise import SVDpp
from surprise.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

class SVDPredictor:
    error_table = pd.DataFrame(columns = ["Model", "Train_RMSE", "Test_RMSE"])
    
    #Class takes in final.csv as a whole as a DataFrame
    def __init__(self, data, titles):
        self.movie = data
        self.titles = titles
        self._createAlgorithmFromData()
        
    def _createAlgorithmFromData(self):
        #check if algo and trainset/train_data files are already created
        file = pathlib.Path('svd.pickle')
        if not file.exists():
            self._reduceDataSize()
            self._splitMovie()
            self._createTrainSet()
        self._run_surprise()
        
    def recommendFor(self, customerID, count):
        preds = []
        ids = []
        for mov in self.movie.MovieID.unique().tolist():
            preds.append(self.predict(customerID, mov).est)
            ids.append(mov)
            
        movieAndRating = {}
        copyPreds = preds[:]
        for i in range(count):
            index = copyPreds.index(max(copyPreds))
            maxPred = max(copyPreds)
            mov = ids[index]
            title = movie_title.iloc[mov-1:mov]['title'][mov-1]
            movieAndRating[title] = maxPred
            copyPreds.pop(index)
        return movieAndRating
        
    def predict(self, userID, movieID):
        #use algo to predict rating. Return predicted rating
        return self.algo.predict(userID, movieID)
    
    def _splitMovie(self):
        self.movie = self.movie.iloc[:1500000]
        
    def _createTrainSet(self):
        reader = Reader(rating_scale=(1,5))
        movieInput = pd.DataFrame()
        movieInput['CustomerID'] = self.movie['CustomerID']
        movieInput['MovieID'] = self.movie['MovieID']
        movieInput['Rating'] = self.movie['Rating']

        self.train_data = Dataset.load_from_df(movieInput, reader)
        self.trainset = self.train_data.build_full_trainset()
        #write to a file
    
    def _reduceDataSize(self):
        self.movie['Date'] = self.movie['Date'].astype('category')
        self.movie['MovieID'] = self.movie['MovieID'].astype('int16')
        self.movie['CustomerID'] = self.movie['CustomerID'].astype('int32')
        self.movie['Rating'] = self.movie['Rating'].astype('int8')
    
    def _run_surprise(self):
        file = pathlib.Path('svd.pickle')
        if file.exists():
            with open('svd.pickle', 'rb') as f:
                self.algo = pickle.load(f)
        else:
            self.algo = SVD(n_factors = 5, biased=True, verbose=True)
            self.algo = self.algo.fit(self.trainset)
            with open('svd.pickle', 'wb') as f:
                pickle.dump(self.algo, f)


Let's now instantiate the class we just created above with our curated data set, and then test it on a random CustomerID and MovieID! In the actual application of a recommender, we would want to set a specific CustomerID, but allow the MovieID to vary to get our ratings.

In [90]:
svd = SVDPredictor(movie, movie_title)

In [91]:
print("Prediction for Customer 1 and Movie 5: ", svd.predict(1, 12).est)
print("Prediction for Customer 1 and Movie 16378: ", svd.predict(7, 16368).est)

Prediction for Customer 1 and Movie 5:  3.498212944014612
Prediction for Customer 1 and Movie 16378:  4.3407102202735075


In [92]:
svd.recommendFor(2, 10)

{'Lord of the Rings: The Return of the King: Extended Edition: Bonus Material': 4.442524144832818,
 "ABC Primetime: Mel Gibson's The Passion of the Christ": 4.301467176230366,
 'Denise Austin: Ultimate Fat Burner': 4.291070054781686,
 'Barbarian Queen 2': 4.281939892932682,
 'Sanford and Son: Season 6': 4.265293509764156,
 'Arachnid': 4.26295462519234,
 'The Frogmen': 4.256601089435565,
 'Animation Legend: Winsor McCay': 4.225473557748273,
 'The Three Stooges: Sing a Song of Six Pants': 4.217292538722636,
 'The Big Clock': 4.209427945385035}

### This code is slighly modified and broken down into seperate .py files and used in conjunction with main.py to run the application. Extra code is also written in the initial.py and addrating.py files that main.py calls.

### Please read readme when running application