In [3]:
import numpy as np
import pandas as pd
from surprise import Dataset
from surprise import Reader
from surprise import KNNWithMeans
from surprise.model_selection import GridSearchCV

# Introduction
The catalog of available films keeps growing, and viewers face an overwhelming selection across genres. In this vast ocean of choices, finding the next film that matches a viewer's taste can be daunting. A movie recommendation system simplifies this challenge by suggesting movies based on patterns in user preferences. This project uses collaborative filtering, which predicts a user's interest from preferences collected from many users (collaborating). The method assumes that if person A shares person B's opinion on one item, A is more likely to share B's opinion on a different item than the opinion of a randomly chosen person. With this approach we aim to provide more personalized movie recommendations and a better viewing experience.



# Methodology

Our movie recommendation system is built on collaborative filtering with the k-Nearest Neighbors algorithm with means. First, the datasets 'movies.csv' and 'ratings.csv' are imported and merged on the 'movieId' column, giving a combined dataset that maps each user's rating to the corresponding movie. We define a rating scale of 1 to 5 using the Reader class. To optimize the model, we explore two similarity measures (mean squared difference and cosine similarity), several values of the minimum number of common ratings (min_support), and both item-based and user-based collaborative filtering. A grid search with 10-fold cross-validation is run over these hyperparameters, and the best configuration is chosen by Root Mean Squared Error (RMSE). We then train the best model on the complete dataset. A function rec_top10 recommends the top 10 unwatched movies for any given user based on their estimated ratings. Finally, the movies recommended for a sample user are extracted and displayed.



# Data Exploration & Preprocessing

The dataset we use is the MovieLens dataset. As shown below, movies.csv consists of three columns: movieId, title and genres. ratings.csv consists of four columns: userId, movieId, rating and timestamp. Since our recommendation algorithm is collaborative filtering, we only need userId, movieId and rating. Collaborative filtering, whether user-user or item-item, measures the similarity (here the mean squared difference or the cosine similarity) between the rating vectors of two users or two items, so title, genres and timestamp are not needed.
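To make the two similarity measures concrete, here is a minimal sketch in plain Python. It assumes the usual definitions: the MSD similarity is 1 / (1 + mean squared difference) and the cosine similarity is the normalized dot product, both computed over the movies that two users have rated in common. The helper names and the toy ratings for `alice` and `bob` are hypothetical and are not taken from ratings.csv.

```python
import math


def common_items(a, b):
    """Movie ids rated by both users (each user is a dict movieId -> rating)."""
    return sorted(set(a) & set(b))


def msd_similarity(a, b):
    """1 / (1 + mean squared difference) over co-rated movies."""
    items = common_items(a, b)
    if not items:
        return 0.0  # no overlap: treat the users as not similar
    msd = sum((a[i] - b[i]) ** 2 for i in items) / len(items)
    return 1.0 / (msd + 1.0)


def cosine_similarity(a, b):
    """Cosine of the angle between the two co-rated rating vectors."""
    items = common_items(a, b)
    if not items:
        return 0.0
    dot = sum(a[i] * b[i] for i in items)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in items))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in items))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy ratings (hypothetical values for illustration only).
alice = {16: 4.0, 24: 1.5, 32: 4.0}
bob = {16: 5.0, 24: 1.5, 47: 3.0}

print(msd_similarity(alice, bob))     # only movies 16 and 24 are compared
print(cosine_similarity(alice, bob))
```

Only userId, movieId and rating enter these computations, which is why the remaining columns can be dropped.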

In [11]:
movies=pd.read_csv('movies.csv')
ratings=pd.read_csv('ratings.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [12]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


We merge the two data frames on the movieId column. Keeping only the userId, movieId and rating columns then gives the dataset we need to build our recommendation system with the Python package surprise.

In [57]:
df=pd.merge(ratings,movies,on='movieId',how='inner')
df.sort_values('userId').head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,16,4.0,1217897793,Casino (1995),Crime|Drama
12291,1,4993,4.5,1217895872,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
12172,1,4963,3.5,1217896313,Ocean's Eleven (2001),Crime|Thriller
12014,1,4306,4.0,1217895903,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...
11965,1,4262,5.0,1217897697,Scarface (1983),Action|Crime|Drama


In [30]:
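# Declare the rating scale; surprise clips estimated ratings to this range.
reader = Reader(rating_scale=(1, 5))
# load_from_df expects the columns in the order: user id, item id, rating.
data = Dataset.load_from_df(df[["userId", "movieId", "rating"]], reader)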

# Results

The algorithm chosen for our recommendation system is centered KNN (KNNWithMeans in surprise). It computes either the mean squared difference similarity or the cosine similarity between pairs of users or pairs of items. A rating is then estimated as the user's (or item's) mean rating plus a similarity-weighted average of the neighbors' deviations from their own mean ratings. To optimize the model, we tuned the hyperparameters with a grid search evaluated by 10-fold cross-validation, using the Root Mean Squared Error (RMSE) as the primary metric for each hyperparameter combination. The best configuration found by the grid search reached an RMSE of 0.8899, which we consider a satisfactory level of accuracy.
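As a rough illustration of how a centered KNN forms an estimate and how RMSE scores it, the sketch below implements the mean-centered weighted average and the RMSE formula in plain Python. The function names, the neighbor tuples and all numbers are hypothetical; this is a simplified stand-in, not the surprise implementation.

```python
import math


def predict_centered(user_mean, neighbors):
    """Mean-centered KNN estimate.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples
    for the nearest neighbors that rated the target movie.
    """
    weight_sum = sum(sim for sim, _, _ in neighbors)
    if weight_sum == 0:
        return user_mean  # no usable neighbors: fall back to the user's mean
    offset = sum(sim * (rating - mean) for sim, rating, mean in neighbors) / weight_sum
    return user_mean + offset


def rmse(estimates, truths):
    """Root mean squared error between estimated and true ratings."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical neighbors: one rated the movie above their mean, one below.
neighbors = [(0.8, 4.5, 4.0), (0.4, 3.0, 3.5)]
estimate = predict_centered(3.2, neighbors)
print(estimate)

print(rmse([3.5, 4.0], [4.0, 4.0]))
```

Centering on each neighbor's mean compensates for users who rate systematically high or low, so only the deviation from their usual rating influences the estimate.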

In [27]:
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [2, 3, 4, 5, 6, 7, 8],
    "user_based": [False, True],
}
param_grid = {"sim_options": sim_options}

gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=10)

gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Computing the msd similarity matrix...
Done computing similarity matrix.

Computing the cosine similarity matrix...
Done computing similarity matrix.

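The similarity-matrix messages printed by the grid search appear once per model fit. A small sketch, using only the standard library and the same grid as the cell above, counts how many fits that implies. The variable names `combos` and `n_folds` are introduced here for illustration, and the count assumes one fit per fold for each combination.

```python
from itertools import product

# Same search space as the sim_options grid in the grid search cell.
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [2, 3, 4, 5, 6, 7, 8],
    "user_based": [False, True],
}

# Every combination of one value per hyperparameter.
combos = [dict(zip(sim_options, values)) for values in product(*sim_options.values())]

n_folds = 10  # cv=10 in the grid search
print(len(combos))            # number of hyperparameter combinations
print(len(combos) * n_folds)  # number of model fits across all folds
```

With 2 similarity measures, 7 values of min_support and 2 filtering modes, the grid holds 28 combinations, or 280 fits under 10-fold cross-validation.
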
In [58]:
sim_options = {'name': 'msd', 'min_support': 4, 'user_based': True}
knn = KNNWithMeans(sim_options=sim_options)
trainingSet = data.build_full_trainset()
knn.fit(trainingSet)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1e8b50cb1c0>

Based on our model, we can recommend to a sample user the top 10 movies he or she may like, chosen from the movies that the user has not yet watched.

In [62]:
def rec_top10(df, userid):
    # Movies the user has already rated are excluded from the candidates.
    user = df[df['userId'] == userid]
    unwatched = movies[~movies['movieId'].isin(user['movieId'])]

    # Estimate the user's rating for every unwatched movie with the fitted model.
    pred_dic = {i: knn.predict(userid, i).est for i in unwatched['movieId']}

    # Keep the ids of the 10 movies with the highest estimated rating.
    top10_ids = sorted(pred_dic, key=pred_dic.get, reverse=True)[:10]

    # Note: isin() returns rows in the order of `movies`, not in ranked order.
    selected_movies = movies.loc[movies['movieId'].isin(top10_ids)]

    return selected_movies

In [63]:
user1=rec_top10(df, 1)
user1

Unnamed: 0,movieId,title,genres
366,418,Being Human (1993),Drama
565,649,Cold Fever (Á köldum klaka) (1995),Comedy|Drama
675,831,Stonewall (1995),Drama
1245,1546,Schizopolis (1996),Comedy
1388,1757,Fallen Angels (Duo luo tian shi) (1995),Drama|Romance
2081,2607,Get Real (1998),Drama|Romance
2368,2964,Julien Donkey-Boy (1999),Drama
2635,3311,"Man from Laramie, The (1955)",Western
2653,3340,Bride of the Monster (1955),Horror|Sci-Fi
2896,3670,Story of G.I. Joe (1945),War
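
The table above is ordered by movieId because the final isin lookup in rec_top10 returns rows in the order of the movies data frame, so the ranking by estimated rating is not visible. A minimal sketch of one way to keep the ranking and the estimates is shown below. The function `top_n_ranked` and the estimate values are hypothetical; the estimates are invented for illustration and are not outputs of the fitted model.

```python
import heapq


def top_n_ranked(candidate_ids, estimate, n=10):
    """Return [(movie_id, estimated_rating)] from highest to lowest estimate.

    estimate: any callable mapping a movie id to an estimated rating.
    """
    scored = ((movie_id, estimate(movie_id)) for movie_id in candidate_ids)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Invented estimates standing in for the model's predictions.
fake_estimates = {418: 4.8, 649: 5.0, 831: 4.9, 1546: 3.1}

ranked = top_n_ranked(fake_estimates, fake_estimates.get, n=3)
print(ranked)
```

With the fitted model from this notebook, the callable could be `lambda movie_id: knn.predict(userid, movie_id).est`, and the ranked ids could then be joined back to movies to recover the titles.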
