# Movie Recommendation System

### Data Description
The data comes from [movielens dataset](https://grouplens.org/datasets/movielens/25m/).
There are 2 files that will be used:


**Movie data**: `movie.csv`

|Feature|Description|Data Type|
|:--|:--|:--:|
|`movieId`|Movie ID|`int`|
|`title`|Movie title|`int`|
|`genres`|Genre movies|`object`|

**User data**: `ratings.csv`

|Feature|Description|Data Type|
|:--|:--|:--:|
|`userId`|User ID|`int`|
|`movieId`|Movie title|`int`|
|`rating`|Rating movies|`float`|
|`timestamp`|Timestamp|`int`|

#### Import Data

In [1]:
#load library
import pandas as pd
import numpy as np

In [2]:
#initialize data path and movie path
data_path = '../data/ml-25m/ratings.csv'
movie_path = '../data/ml-25m/movies.csv'

In [3]:
#read user data and movie data from the CSV file defined in the 'data_path' and 'movie_path' variable
user_data = pd.read_csv(data_path, delimiter=',')
movie_data = pd.read_csv(movie_path, delimiter=',')

In [4]:
#display user data
user_data

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434


In [5]:
#display movie data
movie_data

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


### **Check data and handle duplicated**

In [6]:
#display detailed information about 'user_data'
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB


In [7]:
#counts the number of null values of 'user_data'
user_data.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [8]:
#display detailed information about 'movie_data'
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [9]:
#remove the 'timestamp' column from the 'user_data' 
user_data = user_data.drop(['timestamp'], axis = 1)
user_data

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
3,1,665,5.0
4,1,899,3.5
...,...,...,...
25000090,162541,50872,4.5
25000091,162541,55768,2.5
25000092,162541,56176,2.0
25000093,162541,58559,4.0


**user_data** and **rating_data** has the correct type and feature. There is no null data and duplicated in user_data and rating_data.

#### create load function

In [10]:
def read_and_analyze_data(data_path, movie_path):
    """
    Function to read, analyze, and clean user data and movie data from CSV files.

    Parameters:
    - data_path (str): Path to the CSV file containing user data.
    - movie_path (str): Path to the CSV file containing movie data.

    Output:
    - user_data (DataFrame): DataFrame containing user data after removing the 'timestamp' column.
    - movie_data (DataFrame): DataFrame containing movie data.
    """
    #read user data and movie data from the CSV file defined in the 'data_path' and 'movie_path' variable
    user_data = pd.read_csv(data_path, delimiter=',')
    movie_data = pd.read_csv(movie_path, delimiter=',')

    #show info user
    print("Information about Data User:")
    user_data.info()

    #calculate null value
    print("\nNumber of Null Values in User Data:")
    print(user_data.isnull().sum())

    #show about movie info
    print("\nInformation about Data Movie:")
    movie_data.info()

    #drop colomn 'timestamp' from user data
    user_data = user_data.drop(['timestamp'], axis=1)

    return user_data, movie_data


In [11]:
data_path = '../data/ml-25m/ratings.csv'
movie_path = '../data/ml-25m/movies.csv'
user_data, movie_data = read_and_analyze_data(data_path, movie_path)

Information about Data User:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

Number of Null Values in User Data:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

Information about Data Movie:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


### Non-personalized: popularity-based recommendation

In [12]:
#count the number of duplicate rows based on the combination of 'userId' and 'movieId' columns in 'user_data'
user_data.duplicated(subset=['userId','movieId']).sum()

0

In [13]:
#count the number of duplicate rows in 'movie_data'
movie_data.duplicated().sum()

0

In [14]:
#group the data in 'user_data' based on the 'movieId' column and calculate the total rating for each movie
rating_total = user_data.groupby('movieId').count()['rating'].reset_index()
#rename the calculated column to 'total_rating' and display it
rating_total.rename(columns={'rating':'total_rating'}, inplace=True)
rating_total

Unnamed: 0,movieId,total_rating
0,1,57309
1,2,24228
2,3,11804
3,4,2523
4,5,11714
...,...,...
59042,209157,1
59043,209159,1
59044,209163,1
59045,209169,1


In [15]:
#group the data in 'user_data' based on the 'movieId' column and calculate the average rating for each movie.
avg_rating = user_data.groupby('movieId').mean().round(2)['rating'].reset_index()
#rename the calculated column to 'average_rating' and display it
avg_rating.rename(columns={'rating':'average_rating'}, inplace=True)
avg_rating

Unnamed: 0,movieId,average_rating
0,1,3.89
1,2,3.25
2,3,3.14
3,4,2.85
4,5,3.06
...,...,...
59042,209157,1.50
59043,209159,3.00
59044,209163,4.50
59045,209169,3.00


In [16]:
#concatenates two dataFrames, 'rating_total' and 'avg_rating', based on the 'movieId' column and displays them
popularity = rating_total.merge(avg_rating, on='movieId')
popularity

Unnamed: 0,movieId,total_rating,average_rating
0,1,57309,3.89
1,2,24228,3.25
2,3,11804,3.14
3,4,2523,2.85
4,5,11714,3.06
...,...,...,...
59042,209157,1,1.50
59043,209159,1,3.00
59044,209163,1,4.50
59045,209169,1,3.00


In [17]:
#merge 'popularity' with 'movie_data' based on 'movieId' column
#removing duplicates based on the 'movieId' column and selecting the column is then displayed
popularity = popularity.merge(movie_data, on='movieId').drop_duplicates('movieId')[['movieId','total_rating','average_rating','title']]
popularity

Unnamed: 0,movieId,total_rating,average_rating,title
0,1,57309,3.89,Toy Story (1995)
1,2,24228,3.25,Jumanji (1995)
2,3,11804,3.14,Grumpier Old Men (1995)
3,4,2523,2.85,Waiting to Exhale (1995)
4,5,11714,3.06,Father of the Bride Part II (1995)
...,...,...,...,...
59042,209157,1,1.50,We (2018)
59043,209159,1,3.00,Window of the Soul (2001)
59044,209163,1,4.50,Bad Poems (2018)
59045,209169,1,3.00,A Girl Thing (2001)


In [18]:
#sort 'popularity' by the 'total_rating' column in descending order and take the top 15 rows to get the 30 movies with the highest total rating.
top_15 = popularity.sort_values("total_rating", ascending=False).head(15)
top_15

Unnamed: 0,movieId,total_rating,average_rating,title
351,356,81491,4.05,Forrest Gump (1994)
314,318,81482,4.41,"Shawshank Redemption, The (1994)"
292,296,79672,4.19,Pulp Fiction (1994)
585,593,74127,4.15,"Silence of the Lambs, The (1991)"
2480,2571,72674,4.15,"Matrix, The (1999)"
257,260,68717,4.12,Star Wars: Episode IV - A New Hope (1977)
475,480,64144,3.68,Jurassic Park (1993)
522,527,60411,4.25,Schindler's List (1993)
108,110,59184,4.0,Braveheart (1995)
2867,2959,58773,4.23,Fight Club (1999)


### Non-personalized: Filter by Genre

In [19]:
# Function to create a list of unique movie genres from the 'movie_df' DataFrame.
def create_genre_list(movie_df):
    unique_genres = set()
    for genres in movie_df['genres']:
        genre_list = genres.split('|')
        unique_genres.update(genre_list)
    return list(unique_genres)

# Function to filter movies by genre.
def filter_movies_by_genre(user_df, movie_df, genre):
    # Merge 'user_df' and 'movie_df' based on the 'movieId' column.
    merged_df = pd.merge(user_df, movie_df, on='movieId', how='inner')

    # Filter movies that match the specified genre.
    filtered_movies = merged_df[merged_df['genres'].str.contains(genre, case=False, na=False)]
    filtered_movies = filtered_movies.drop_duplicates(subset='title')
    return filtered_movies

In [20]:
unique_genres = create_genre_list(movie_data)
print("Available Genres:", unique_genres)
genre = 'Crime'
filtered_genre_movies = filter_movies_by_genre(user_data, movie_data, genre)
print(f"Movies with the genre '{genre}':")

Available Genres: ['Musical', 'Action', 'Children', '(no genres listed)', 'War', 'Adventure', 'Mystery', 'Film-Noir', 'Crime', 'IMAX', 'Fantasy', 'Horror', 'Sci-Fi', 'Thriller', 'Romance', 'Comedy', 'Documentary', 'Drama', 'Animation', 'Western']
Movies with the genre 'Crime':


In [21]:
filtered_genre_movies

Unnamed: 0,userId,movieId,rating,title,genres
0,1,296,5.0,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
147663,1,1260,3.5,M (1931),Crime|Film-Noir|Thriller
233297,1,2692,5.0,Run Lola Run (Lola rennt) (1998),Action|Crime
346525,1,5767,5.0,Teddy Bear (Mis) (1981),Comedy|Crime
350395,1,5912,3.0,Hit the Bank (Vabank) (1981),Comedy|Crime
...,...,...,...,...,...
25000052,162047,108802,3.5,Cry in the Woods (Den som frykter ulven) (2004),Crime|Thriller
25000053,162047,117644,3.5,Dr. Socrates (1935),Crime|Drama|Romance
25000054,162047,123425,3.0,The Last Gangster (1937),Crime|Drama|Thriller
25000074,162271,92648,3.0,BookWars (2000),Comedy|Crime|Documentary


### Non-personalized: Filter by Release Year

In [22]:
# Function to filter movies by release year.
def filter_movies_by_year(movie_df, year):
    # Extract the year from the 'title' column and store it in a new 'year' column.
    movie_df['year'] = movie_df['title'].str.extract(r'\((\d{4})\)')

    # Filter movies released in a specific year.
    filtered_movies = movie_df[movie_df['year'] == year]

    return filtered_movies

In [23]:
#show filtered_movies
filtered_movies = filter_movies_by_year(movie_data, '2019')
filtered_movies

Unnamed: 0,movieId,title,genres,year
25068,122914,Avengers: Infinity War - Part II (2019),Action|Adventure|Sci-Fi,2019
33520,143345,Shazam! (2019),Action|Adventure|Fantasy|Sci-Fi,2019
57039,195473,Les Invisibles (2019),(no genres listed),2019
57371,196223,Hellboy (2019),Action|Adventure|Fantasy,2019
57462,196417,How to Train Your Dragon: The Hidden World (2019),Adventure|Animation|Children,2019
...,...,...,...,...
62387,209051,Jeff Garlin: Our Man in Chicago (2019),(no genres listed),2019
62398,209085,The Mistletoe Secret (2019),Romance,2019
62412,209143,The Painting (2019),Animation|Documentary,2019
62413,209145,Liberté (2019),Drama,2019


check again about shape data, duplicated and make sampling data (due to computational limitations)

In [24]:
user_data['userId'].sum()

2029739741827

In [25]:
movie_data['movieId'].sum()

7629363258

In [26]:
user_data.shape

(25000095, 3)

In [27]:
movie_data.shape

(62423, 4)

In [28]:
user_data.head(3)

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0


In [29]:
movie_data.head(3)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995


In [30]:
movie_data.duplicated(subset=['movieId','title']).sum()

0

In [31]:
#sampling the data
sampled_movies  = movie_data.drop(movie_data[movie_data['movieId'] > 624].copy().index)
sampled_users = user_data.copy()
sampled_users = user_data.loc[(user_data['userId'] <= 5000) & (user_data['movieId'] <= 624)]

In [32]:
sampled_movies

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995
...,...,...,...,...
610,618,Two Much (1995),Comedy|Romance,1995
611,619,Ed (1996),Comedy,1996
612,620,Scream of Stone (Cerro Torre: Schrei aus Stein...,Drama,1991
613,621,My Favorite Season (1993),Drama,1993


In [33]:
sampled_users

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
70,2,1,3.5
71,2,62,0.5
...,...,...,...
733133,5000,104,3.0
733134,5000,140,3.0
733135,5000,141,4.0
733136,5000,260,4.0


### Personalized: Content Based Recommendation

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

#initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

#calculate TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(sampled_movies['genres'])

#check dimensions TF-IDF matrix
print(tfidf_matrix.shape)

(615, 21)


In [35]:
from sklearn.metrics.pairwise import linear_kernel

#calculate cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

#cosine similarity matrix is a matrix that stores the similarity score between each pair of movies.
print(cosine_sim.shape)


(615, 615)


In [36]:
#function for content-based movie recommendations
def content_based_recommendations(title, cosine_sim=cosine_sim):
    #get the index of the movie that matches the given title
    idx = sampled_movies[sampled_movies['title'] == title].index[0]
    
    #get similarity scores between the movie and all other movies
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    #sort movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    #get the top 10 (or as needed) similar movies
    sim_scores = sim_scores[1:11]
    
    #get the indices of the related movies
    movie_indices = [i[0] for i in sim_scores]
    
    #return the titles of recommended movies
    return sampled_movies['title'].iloc[movie_indices]

In [37]:
recommended_movies = content_based_recommendations("Toy Story (1995)")
print(recommended_movies)

551                  Pagemaster, The (1994)
12                             Balto (1995)
55           Kids of the Round Table (1995)
1                            Jumanji (1995)
59       Indian in the Cupboard, The (1995)
124       NeverEnding Story III, The (1994)
255    Kid in King Arthur's Court, A (1995)
241                 Gumby: The Movie (1995)
309               Swan Princess, The (1994)
608                  Aristocats, The (1970)
Name: title, dtype: object


### Personalized: Collaborative Filtering User to User

In [182]:
#load library
import surprise
from surprise import accuracy, Dataset, Reader, BaselineOnly, KNNBasic, KNNBaseline, SVD, NMF
from surprise.model_selection.search import RandomizedSearchCV
from surprise.model_selection import cross_validate, train_test_split

In [183]:
#initialize a Reader object in the Surprise library to read rating data on a scale of 1-5
reader = Reader(rating_scale = (1, 5))

In [184]:
#reads the rating data and converts it into a format that can be used to load the recommendation dataset from df 'rating_data'
dataset = Dataset.load_from_df(sampled_users[['userId', 'movieId', 'rating']].copy(), reader)
dataset

<surprise.dataset.DatasetAutoFolds at 0x20a49addcd0>

In [185]:
#show data
dataset.df

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
70,2,1,3.5
71,2,62,0.5
...,...,...,...
733133,5000,104,3.0
733134,5000,140,3.0
733135,5000,141,4.0
733136,5000,260,4.0


In [186]:
#split dataset into training data and test data
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

In [187]:
#validate splitting
train_data.n_ratings, len(test_data)

(105335, 26334)

In [188]:
#initialize
model_baseline = BaselineOnly()
model_baseline

<surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x20a10b29df0>

In [189]:
#perform cross-validation on the initialized recommendation model using the 'BaselineOnly'
cv_baseline = cross_validate(algo=model_baseline, data=dataset, cv=5,measures=['rmse'])

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


In [190]:
#cv result
cv_baseline_rmse = cv_baseline['test_rmse'].mean()
cv_baseline_rmse

0.8717889750098969

#### Hyperparameter candidate

In [191]:
#initialization of parameters that will be used in a randomized search
#for hyperparameters in the recommendation model with the KNNBaseline method
param_dist = {'k':list(np.arange(start=20, stop=40, step=5)),
          'sim_options':{'name':['pearson','pearson_baseline','cosine'],'user_based':['True']}, 'min_k': [1, 2, 3]}

In [192]:
#randomized search for hyperparameters in the recommendation model with the KNNBaseline method
knn_search = RandomizedSearchCV(algo_class=KNNBaseline, param_distributions = param_dist, cv=5)

In [193]:
#process search hyperparams
knn_search.fit(data=dataset)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
C

In [212]:
pickle.dump(knn_search.best_params["rmse"],open('../model/knn_baseline.pkl','wb'))

In [194]:
#dictionary containing hyperparameter values for SVD (Singular Value Decomposition) recommendation model
params_SVD = {'lr_all' : [1,0.1,0.01,0.001], 'n_factors' : [50,100],
              'reg_all' : [1,0.1,0.01, 0.02]
              }  

In [None]:
# Create a RandomizedSearchCV object for hyperparameter tuning of the SVD recommendation model.
# - 'algo_class=SVD': Specifies that the algorithm being tuned is SVD.
# - 'param_distributions=params_SVD': Specifies the hyperparameter search space defined in 'params_SVD'.
# - 'cv=5': Performs 5-fold cross-validation during hyperparameter tuning.
svd_search = RandomizedSearchCV(algo_class=SVD, param_distributions=params_SVD, cv=5)

# Fit the RandomizedSearchCV object to the dataset.
# This will perform a randomized search for the best hyperparameters of the SVD model.
svd_search.fit(data=dataset)

In [196]:
#dictionary containing hyperparameter values for NMF recommendation model
params_NMF = {'n_factors': np.arange(5, 50, 5),
              'n_epochs': np.arange(10, 100, 10)
             }

In [197]:
nmf_search = RandomizedSearchCV(algo_class=NMF, param_distributions = params_NMF, cv=5)
nmf_search.fit(data=dataset)

In [198]:
#summarize performance
summary_df = pd.DataFrame({'Model': ['Baseline', 'KNN Baseline', 'SVD', 'NMF'],
                           'CV Performance - RMSE': [cv_baseline_rmse,knn_search.best_score['rmse'],svd_search.best_score['rmse'],nmf_search.best_score['rmse']],
                           'Model Condiguration':['N/A',f'{knn_search.best_params["rmse"]}',f'{svd_search.best_params["rmse"]}',f'{nmf_search.best_params["rmse"]}']})

summary_df

Unnamed: 0,Model,CV Performance - RMSE,Model Condiguration
0,Baseline,0.871789,
1,KNN Baseline,0.848675,"{'k': 25, 'sim_options': {'name': 'pearson_bas..."
2,SVD,0.865616,"{'lr_all': 0.01, 'n_factors': 50, 'reg_all': 0..."
3,NMF,0.869731,"{'n_factors': 35, 'n_epochs': 90}"


In [199]:
#best hyperparams combination
knn_search.best_params["rmse"]

{'k': 25,
 'sim_options': {'name': 'pearson_baseline', 'user_based': 'True'},
 'min_k': 2}

In [200]:
#intialize ber hyperparams
best_params = knn_search.best_params['rmse']

In [201]:
#create obj. and retrain whole train data
model_best = KNNBaseline(**best_params)
model_best.fit(train_data)

Estimating biases using als...


Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x20a11c90cd0>

In [202]:
#predict test data using best model
test_pred = model_best.test(test_data)
test_rmse = accuracy.rmse(test_pred)
test_rmse

RMSE: 0.8406


0.8405597869115303

In [203]:
#summarize RMSE tuning dan test
summary_test_df = pd.DataFrame({'Model' : ['User to User CF'],
                                'RMSE-Tuning': [knn_search.best_score['rmse']],
                                'RMSE-Test': [test_rmse]})

summary_test_df

Unnamed: 0,Model,RMSE-Tuning,RMSE-Test
0,User to User CF,0.848675,0.84056


In [204]:
#predict user_id = 2 and movie_id = 4
sample_prediction = model_best.predict(uid = 2,
                                      iid = 4)

In [205]:
sample_prediction

Prediction(uid=2, iid=4, r_ui=None, est=2.468213535221938, details={'actual_k': 15, 'was_impossible': False})

In [206]:
#make function
def get_unrated_movie_ids(sampled_users, user_id):
    """
    Gets a list of movie IDs that a user has not rated yet.

    Parameters
    ----------
    rating_data : DataFrame
        The DataFrame containing the rating data.
    user_id : int
        The ID of the user for whom we want to find unrated movie IDs.

    Returns
    -------
    unrated_movie_ids : set
        A set of movie IDs that the user has not rated.
    """
    #get unique movie_id
    unique_movie_ids = set(sampled_users['movieId'])
    #get movie_id that is rated by user_id = 2
    rated_movie_ids = set(sampled_users.loc[sampled_users['userId'] == user_id, 'movieId'])
    #find unrated movie_id
    unrated_movie_ids = unique_movie_ids.difference(rated_movie_ids)
    
    return unrated_movie_ids

In [207]:
#check result function and manual
unrated_movie = get_unrated_movie_ids(sampled_users, 2)
print(unrated_movie)

{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95, 96, 97, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 144, 145, 146, 147, 148, 149, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 23

In [208]:
#make function
def predict_and_sort_ratings(model, userId, unrated_movie_ids):
    """
    Predicts and sorts unrated movie based on predicted ratings for a given user.

    Parameters
    ----------
    model : object
        The collaborative filtering model used for predictions.
    user_id : int
        The ID of the user for whom we want to predict and sort unrated movie.
    unrated_movie_ids : list
        A list of movie IDs that the user has not rated yet.

    Returns
    -------
    predicted_unrated_movie_df : DataFrame
        A DataFrame containing the predicted ratings and movie IDs,
        sorted in descending order of predicted ratings.
    """

    #initialize
    predicted_unrated_movie = {
        'userId': userId,
        'movieId': [],
        'predicted_rating': []
    }
    
    #loop all unrated movie
    for movie_id in unrated_movie_ids:
        #make predict
        pred_id = model.predict(uid=predicted_unrated_movie['userId'],
                                iid=movie_id)
        #append
        predicted_unrated_movie['movieId'].append(movie_id)
        predicted_unrated_movie['predicted_rating'].append(pred_id.est)

    #create df
    predicted_unrated_movie_df = pd.DataFrame(predicted_unrated_movie).sort_values('predicted_rating',
                                                                                  ascending=False)

    return predicted_unrated_movie_df


In [209]:
predicted_movie_df = predict_and_sort_ratings(model_best,2,unrated_movie_ids=unrated_movie)
predicted_movie_df

Unnamed: 0,userId,movieId,predicted_rating
301,2,320,4.775114
413,2,443,4.553606
94,2,99,4.401919
87,2,90,4.341444
48,2,50,4.294119
...,...,...,...
268,2,285,1.511848
464,2,496,1.496867
536,2,577,1.449763
442,2,473,1.418920


In [210]:
def get_top_predicted_movie(model, k, user_id, rating_data, movie_data):
    """
    Gets the top predicted movie for a given user based on a collaborative filtering model.

    Parameters
    ----------
    model : object
        The collaborative filtering model used for predictions
    k : int
        The number of top predicted movie to retrieve
    user_id : int
        The ID of the user for whom to get top predicted movie
    rating_data : DataFrame
        The DataFrame containing the rating data
    movie_data : DataFrame
        The DataFrame containing the movie details

    Returns
    -------
    top_movie_df : DataFrame
        A DataFrame containing the top predicted movie along with their details
    """

    # Get unrated movie IDs for the user
    unrated_movie_ids = get_unrated_movie_ids(rating_data, user_id)

    # Predict and sort unrated movie
    predicted_movie_df = predict_and_sort_ratings(model, user_id, unrated_movie_ids)

    # Get the top k predicted movie
    top_predicted_movie = predicted_movie_df.head(k).copy()

    # Add movie details to the top predicted movie
    top_predicted_movie['title'] = sampled_movies.loc[top_predicted_movie['movieId'], 'title'].values
    top_predicted_movie['genres'] = sampled_movies.loc[top_predicted_movie['movieId'], 'genres'].values

    return top_predicted_movie

In [211]:
# Example usage
predicted_movie = get_top_predicted_movie(model_best, 9, 2, sampled_users, sampled_movies)
predicted_movie

Unnamed: 0,userId,movieId,predicted_rating,title,genres
301,2,320,4.775114,National Lampoon's Senior Trip (1995),Comedy
413,2,443,4.553606,Fearless (1993),Drama
94,2,99,4.401919,Bottle Rocket (1996),Adventure|Comedy|Crime|Romance
87,2,90,4.341444,Mary Reilly (1996),Drama|Horror|Thriller
48,2,50,4.294119,Guardian Angel (1994),Action|Drama|Thriller
306,2,326,4.231077,Tom & Viv (1994),Drama
456,2,488,4.180438,Menace II Society (1993),Action|Crime|Drama
517,2,556,4.169549,Germinal (1993),Drama|Romance
537,2,580,4.169468,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical
