% DES431 Project 2: Recommendation System

# Background

**MovieLens** is a movie recommendation system operated by GroupLens, a research group at the University of Minnesota. 

# Task

1. Propose and implement your own recommendation system based on the MovieLens dataset. Use `ratings_train.csv` as the training set, `ratings_valid.csv` as the validation set. Your system may use information from `movies.csv` and `tags.csv` to conduct recommendations. The undisclosed test set will be used to evaluate your system.
   - The data file structure is available at https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html. 
   - The main goal of the recommendation system is to minimize the root-mean-square error.
   - The implementation should include a function named `predict_rating`. This function accepts a DataFrame with two columns `userId` and `movieId`. Then, the function adds a column named `rating` storing a predicted rating of a `movieId` by a `userId`.
   - Your program must return a root-mean-square error value when the validation set is changed to another file. Otherwise, your score will be deducted by 50%.
   - You must modify the given program to make better recommendations. Submitting the original program without modification is considered plagiarism.
2. Prepare slides for a 7-minute presentation to explain your proposed technique and algorithm to conduct recommendation, and show your RMSE results on the validation set.
3. Submit all required documents by April 30, 2023; 23:59. Late submission will not be accepted and will be marked 0. Do not wait until the last minute. Plagiarism and code duplication will be checked. 
4. Present your work on May 1, 2023 within 7 minutes. Exceeding 7 minutes will be subject to point deduction.

In [84]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [85]:
import imdb
import requests
import tmdbv3api as tmdb
from surprise import Dataset, Reader, SVD, accuracy

# Loading data

In [86]:
ratings_train = pd.read_csv('ratings_train.csv')
ratings_valid = pd.read_csv('ratings_valid.csv')
movies = pd.read_csv('movies.csv')
tags = pd.read_csv('tags.csv',usecols=["userId","movieId","tag"])
links = pd.read_csv('links.csv')

In [87]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [88]:
tags

Unnamed: 0,userId,movieId,tag
0,2,60756,funny
1,2,60756,Highly quotable
2,2,60756,will ferrell
3,2,89774,Boxing story
4,2,89774,MMA
...,...,...,...
3678,606,7382,for katie
3679,606,7936,austere
3680,610,3265,gun fu
3681,610,3265,heroic bloodshed


In [89]:
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9737,193581,5476944,432131.0
9738,193583,5914996,445030.0
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0


In [90]:
ratings_train

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
96459,610,166534,4.0,1493848402
96460,610,168248,5.0,1493850091
96461,610,168250,5.0,1494273047
96462,610,168252,5.0,1493846352


In [91]:
ratings_valid

Unnamed: 0,userId,movieId,rating,timestamp
0,4,45,3.0,986935047
1,4,52,3.0,964622786
2,4,58,3.0,964538444
3,4,222,1.0,945629040
4,4,247,3.0,986848894
...,...,...,...,...
2349,561,139385,3.5,1491092337
2350,561,146656,3.5,1491095479
2351,561,149406,3.5,1491091520
2352,561,160438,2.0,1491091498


In [92]:
print("Number of users = "+str(ratings_train["userId"].nunique()))
print("Number of movies = "+str(ratings_train["movieId"].nunique()))

Number of users = 610
Number of movies = 9690


# CONTENT BASED

In [93]:
m_df=ratings_train[["userId","movieId","rating"]].merge(movies,on="movieId")
m_df

Unnamed: 0,userId,movieId,rating,title,genres
0,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...
96459,610,160341,2.5,Bloodmoon (1997),Action|Thriller
96460,610,160527,4.5,Sympathy for the Underdog (1971),Action|Crime|Drama
96461,610,160836,3.0,Hazard (2005),Action|Drama|Thriller
96462,610,163937,3.5,Blair Witch (2016),Horror|Thriller


# Singular Value Decomposition (SVD)

In [94]:
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

# Set the rating scale
reader = Reader(rating_scale=(0, 5))

# Load the train dataset
train_data = Dataset.load_from_df(ratings_train[['userId', 'movieId', 'rating']], reader)

# Split the dataset into train and test sets
trainset, testset = train_test_split(train_data, test_size=0.2)

# Initialize the SVD model
model = SVD(n_factors=100, n_epochs=100, lr_all=0.005, reg_all=0.02)

# Train the model on the trainset
model.fit(trainset)

# Make predictions on the testset
predictions = model.test(testset)

# Compute and print the RMSE metric
rmse1 = accuracy.rmse(predictions)
print('RMSE:', rmse1)

# Print the estimated ratings for each prediction
for prediction in predictions:
    print(prediction.est)


RMSE: 0.8800
RMSE: 0.8800188971407694
3.341220417816451
2.781010159710963
2.672373294655474
3.1253026076886945
4.351260990817812
3.1533476305986845
3.800137920354785
3.1631576469742537
4.1161331342563825
2.0221798436734546
3.7923974713091275
2.793194472438056
3.15090714846876
3.414247127161882
3.620290298267471
2.6909103727338897
3.0747485865721202
2.9658047509466363
3.8713920871003604
3.1800813961573793
4.1964176637653
2.8929467786464422
3.0652208865076354
3.8879462825328357
3.0701861277686615
4.22193334807305
3.3211923987074656
4.194131822420453
3.3322015888487013
4.217230915883162
3.3935225730769236
3.0632298272646317
4.374540160230122
2.816593599658443
3.123200287851577
3.9773403504292832
3.7231332063176645
3.293407732634811
3.606994890262169
2.988459412422489
4.316895139715522
2.981906773894288
3.213765997148991
3.5111608212945753
2.998555297748736
4.018049780418625
3.8522802643989076
4.504666441958366
3.5197883502432927
3.9373775521226313
3.64435990600812
3.826615145121336
3.7280

In [95]:
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

# Set the rating scale
reader = Reader(rating_scale=(0, 5))

# Load the train dataset
train_data = Dataset.load_from_df(ratings_train[['userId', 'movieId', 'rating']], reader)

# Split the dataset into train and test sets
trainset, testset = train_test_split(train_data, test_size=0.2)

# Initialize the SVD model
model = SVD(n_factors=100, n_epochs=100, lr_all=0.005, reg_all=0.02)

# Train the model on the trainset
model.fit(trainset)

# Make predictions on the testset
predictions = model.test(testset)

# Compute and print the RMSE metric
rmse = accuracy.rmse(predictions)
print('RMSE:', rmse)

# Print the estimated ratings for each prediction
for prediction in predictions:
    print(round(prediction.est, 4))


RMSE: 0.8807
RMSE: 0.8807099660314365
2.6506
4.3182
3.4647
3.4868
4.1709
2.8222
3.7798
2.8386
2.9993
3.0652
4.4819
4.2702
3.9775
1.566
3.7333
3.1538
3.1248
3.1167
3.7752
2.8092
3.4096
4.197
3.3564
2.5975
2.685
2.2177
3.3329
4.3835
3.3521
4.3612
3.0274
5
3.8159
2.9366
3.0923
4.1735
3.9751
2.6264
2.735
3.3585
4.168
3.6245
3.9588
2.6072
3.1487
3.5375
3.5127
3.9936
3.5324
3.513
2.7126
3.8328
3.8862
2.7733
3.4274
2.1581
3.1629
3.9613
2.7577
3.3636
3.9071
4.1702
3.5673
3.0943
3.3974
3.8026
3.4065
3.043
3.0373
3.5906
3.4095
3.0991
2.8783
3.9143
4.0402
3.3965
2.9871
3.2532
3.8866
3.171
2.4044
2.7615
2.7835
4.1059
3.5475
3.4683
3.4271
3.1643
4.5061
4.1726
3.619
2.919
3.5483
4.3103
3.7213
3.3701
3.5781
2.7883
4.0346
2.7728
4.5039
2.5697
2.3434
2.6876
4.6312
2.7275
3.9875
4.1202
3.7797
3.9165
3.3319
3.6327
2.5802
3.6457
2.3136
2.4104
4.4482
3.0294
2.743
2.9332
3.6138
3.8414
3.5543
3.4648
3.7843
3.0261
4.8472
4.3158
3.7272
3.0833
3.3006
3.3537
3.1876
3.6035
4.3808
3.6377
4.2905
3.1862
3.1333
4.250

# Global Effect
# This one has lower RSME

In [100]:
from surprise import BaselineOnly
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

# Set the rating scale
reader = Reader(rating_scale=(0, 5))

# Load the train dataset
train_data = Dataset.load_from_df(ratings_train[['userId', 'movieId', 'rating']], reader)

# Split the dataset into train and test sets
trainset, testset = train_test_split(train_data, test_size=0.2)

# Initialize the BaselineOnly model
model = BaselineOnly()

# Train the model on the trainset
model.fit(trainset)

# Make predictions on the testset
predictions = model.test(testset)

# Compute and print the RMSE metric
rmse_globaleffect = accuracy.rmse(predictions)
print('RMSE:', rmse_globaleffect)

# Print the estimated ratings for each prediction
for prediction in predictions:
    print(prediction.est)


Estimating biases using als...
RMSE: 0.8700
RMSE: 0.8700457079166488
3.458614340792209
3.9809553773619824
4.032260524041416
4.314664569556474
4.152772638855562
3.4964689010288184
3.6069750215074725
2.6944949158099547
4.520389523624672
3.575876792389024
3.6765149717945285
3.6948057611262017
3.4311729099319574
2.8485764245654446
3.5582779018197774
3.965050317746969
3.9287967049908645
3.2641467747733963
3.8287247885617828
3.7253307772044773
3.9737956204023943
4.484256232860387
2.7521132760077918
3.5034609836747976
4.16090633214064
4.453150917757318
4.953233611321702
3.994889323375132
3.805462265184313
3.8110109224837645
4.272418127724354
3.8473099143348755
3.887383309774206
4.116657295556728
2.407756013034066
3.5326332064125863
3.127772446779063
3.4042544063280227
4.336662383622064
3.917810633933675
3.598738658394571
3.8953753175169488
3.7414583556820475
2.2644725992233803
4.214206622260717
3.8567191467683686
3.328811371644915
3.5004915748012784
4.467541286912825
3.6395630512441555
3.6306

# Constructing model and predicting ratings

In [97]:
# Model construction
avg_rating = ratings_train[['movieId', 'rating']].groupby(by='movieId').mean()
	    
# Prediction
def predict_rating(df):
    # Input: 
	# 	df = a dataframe with two columns: userId, movieId
	# Output:
	#   a dataframe with three columns: userId, movieId, rating
	return df.join(avg_rating, on='movieId')


In [98]:
# Prepare df for prediction
r = ratings_valid[['userId', 'movieId']]

# Predict ratings
ratings_pred = predict_rating(r)
ratings_pred

Unnamed: 0,userId,movieId,rating
0,4,45,3.366667
1,4,52,3.520000
2,4,58,4.062500
3,4,222,3.928571
4,4,247,3.975000
...,...,...,...
2349,561,139385,3.860000
2350,561,146656,3.916667
2351,561,149406,3.416667
2352,561,160438,2.916667


In [99]:
from sklearn.metrics import mean_squared_error

r_true = ratings_valid['rating'].to_numpy()
r_pred = ratings_pred['rating'].to_numpy()

rmse = mean_squared_error(r_true, r_pred, squared=False)
print(f"RMSE = {rmse_globaleffect:.4f}")

RMSE = 0.8764
