# Implementing Recommender Systems - Lab

## Introduction

In this lab, you'll practice creating a recommender system model using `surprise`. You'll also get the chance to create a more complete recommender system pipeline to obtain the top recommendations for a specific user.


## Objectives

In this lab you will: 

- Use surprise's built-in reader class to process data to work with recommender algorithms 
- Obtain a prediction for a specific user for a particular item 
- Introduce a new user with rating to a rating matrix and make recommendations for them 
- Create a function that will return the top n recommendations for a user 


For this lab, we will be using the famous 1M movie dataset. It contains a collection of user ratings for many different movies. In the last lesson, you were exposed to working with `surprise` datasets. In this lab, you will also go through the process of reading in a dataset into the `surprise` dataset format. To begin with, load the dataset into a Pandas DataFrame. Determine which columns are necessary for your recommendation system and drop any extraneous ones.

In [1]:
import pandas as pd
df = pd.read_csv('./ml-latest-small/ratings.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [2]:
# Drop unnecessary columns
new_df = df.drop(columns=['timestamp'])

It's now time to transform the dataset into something compatible with `surprise`. In order to do this, you're going to need `Reader` and `Dataset` classes. There's a method in `Dataset` specifically for loading dataframes.

In [3]:
from surprise import Reader, Dataset
# read in values as Surprise dataset 
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(new_df[['userId','movieId','rating']],reader)



Let's look at how many users and items we have in our dataset. If using neighborhood-based methods, this will help us determine whether or not we should perform user-user or item-item similarity

In [4]:
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724


## Determine the best model 

Now, compare the different models and see which ones perform best. For consistency sake, use RMSE to evaluate models. Remember to cross-validate! Can you get a model with a higher average RMSE on test data than 0.869?

In [5]:
# importing relevant libraries
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
import numpy as np

In [6]:
## Perform a gridsearch with SVD
# ⏰ This cell may take several minutes to run
param_grid = {'n_factors':[50, 100],'n_epochs': [10, 20], 'lr_all': [0.02, 0.008],
             'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD,param_grid,measures=['rmse','mae'],cv=3,n_jobs = -1,joblib_verbose=5)
gs.fit(data)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    7.2s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   29.6s finished


In [7]:
# print out optimal parameters for SVD after GridSearch
rms_score = {gs.best_score['rmse']}
opt_params_rms = {'rmse': gs.best_params['rmse']}
mae_score = {gs.best_score['mae']}
opt_params_mae = {'mae': gs.best_params['mae']}
print(f'rmse: {rms_score,opt_params_rms}')
print('\n')
print(f'mae: {mae_score,opt_params_mae}')

rmse: ({0.8830349460709405}, {'rmse': {'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.008, 'reg_all': 0.4}})


mae: ({0.6813410423868201}, {'mae': {'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.02, 'reg_all': 0.4}})


In [8]:
# cross validating with KNNBasic
sim_pearson = {'name':'pearson', 'user_based':True}
basic = KNNBasic(sim_options=sim_pearson)

cv = cross_validate(basic,data,measures=['RMSE','MAE'],verbose=True)

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9715  0.9667  0.9753  0.9707  0.9769  0.9722  0.0036  
MAE (testset)     0.7521  0.7464  0.7543  0.7518  0.7508  0.7511  0.0026  
Fit time          0.34    0.35    0.36    0.35    0.42    0.37    0.03    
Test time         1.45    1.54    1.57    1.51    1.40    1.49    0.06    


In [9]:
# print out the average RMSE score for the test set
print(f"Average RMSE: {np.mean(cv['test_rmse'])}")

Average RMSE: 0.972213784188828


In [10]:
# cross validating with KNNBaselin
sim_pearson = {'name':'pearson', 'user_based':True}
baseline = KNNBaseline(sim_options=sim_pearson)

cv = cross_validate(baseline,data,measures=['RMSE','MAE'],verbose=True)


Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8780  0.8746  0.8828  0.8783  0.8704  0.8768  0.0041  
MAE (testset)     0.6702  0.6691  0.6716  0.6719  0.6659  0.6698  0.0022  
Fit time          0.61    0.61    0.62    0.67    0.75    0.65    0.06    
Test time         1.94    1.80    1.95    2.40    2.42    2.10    0.26    


In [11]:
# print out the average score for the test set
print(f"Average RMSE: {np.mean(cv['test_rmse'])}")

Average RMSE: 0.8768390394005008


Based off these outputs, it seems like the best performing model is the SVD model with `n_factors = 50` and a regularization rate of 0.05. Use that model or if you found one that performs better, feel free to use that to make some predictions.

## Making Recommendations

It's important that the output for the recommendation is interpretable to people. Rather than returning the `movie_id` values, it would be far more valuable to return the actual title of the movie. As a first step, let's read in the movies to a dataframe and take a peek at what information we have about them.

In [12]:
df_movies = pd.read_csv('./ml-latest-small/movies.csv')

In [13]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Making simple predictions
Just as a reminder, let's look at how you make a prediction for an individual user and item. First, we'll fit the SVD model we had from before.

In [14]:
svd = SVD(n_factors= 50, reg_all=0.05)
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x793ab4b68ad0>

In [15]:
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=2.928683972681593, details={'was_impossible': False})

This prediction value is a tuple and each of the values within it can be accessed by way of indexing. Now let's put our knowledge of recommendation systems to do something interesting: making predictions for a new user!

## Obtaining User Ratings 

It's great that we have working models and everything, but wouldn't it be nice to get to recommendations specifically tailored to your preferences? That's what we'll be doing now. The first step is to create a function that allows us to pick randomly selected movies. The function should present users with a movie and ask them to rate it. If they have not seen the movie, they should be able to skip rating it. 

The function `movie_rater()` should take as parameters: 

* `movie_df`: DataFrame - a dataframe containing the movie ids, name of movie, and genres
* `num`: int - number of ratings
* `genre`: string - a specific genre from which to draw movies

The function returns:
* rating_list : list - a collection of dictionaries in the format of {'userId': int , 'movieId': int , 'rating': float}

#### This function is optional, but fun :) 

In [16]:
def movie_rater(movie_df, num, genre=None):
    rating_list = []
    
    # filtering the dataframe by genre
    if genre:
        filtered_df = movie_df[movie_df['genres'].str.contains(genre, case=False, na=False)]
        if filtered_df.empty:
            print(f"No movies found for the genre '{genre}'.")
            return rating_list
    else:
        filtered_df = movie_df.copy()

     # Use a random seed for reproducibility in sampling
    random_movies = filtered_df.sample(n=min(num, len(filtered_df)), random_state=42)
    user_id = 1700  # a new, unique user ID for the new user

    for _, row in random_movies.iterrows():
        movie_id = row['movieId']
        movie_title = row['title']

        while True:
            user_input = input(
                f"Rate the movie '{movie_title}'(1-5), or 's' to skip: "
            ).lower()

            if user_input == 's':
                print(f"Skipped rating for '{movie_title}'.")
                break
            try:
                rating = float(user_input)
                if 1.0 <= rating <= 5.0:
                    rating_list.append({
                        'userId': user_id,
                        'movieId': movie_id,
                        'rating': rating
                    })
                    break
                else:
                    print("Invalid rating. Please enter a value between 1 and 5.") 
            except ValueError:
                print("Invalid input. Please enter a number or 's' to skip.") 
    return rating_list


In [18]:
# try out the new function here!
user_rating = movie_rater(df_movies, num=5,genre='Action')

If you're struggling to come up with the above function, you can use this list of user ratings to complete the next segment

In [19]:
user_rating

[{'userId': 1700, 'movieId': 2094, 'rating': 5.0},
 {'userId': 1700, 'movieId': 79224, 'rating': 4.0},
 {'userId': 1700, 'movieId': 111663, 'rating': 3.0},
 {'userId': 1700, 'movieId': 54908, 'rating': 4.0},
 {'userId': 1700, 'movieId': 61248, 'rating': 1.0}]

### Making Predictions With the New Ratings
Now that you have new ratings, you can use them to make predictions for this new user. The proper way this should work is:

* add the new ratings to the original ratings DataFrame, read into a `surprise` dataset 
* train a model using the new combined DataFrame
* make predictions for the user
* order those predictions from highest rated to lowest rated
* return the top n recommendations with the text of the actual movie (rather than just the index number) 

In [22]:
## add the new ratings to the original ratings DataFrame
# convert the new ratings into  a dataframe
user_ratings = pd.DataFrame(user_rating)
combined_df_ratings = pd.concat([new_df,user_ratings], ignore_index=True)
combined_df_ratings.head()
# read in values as Surprise dataset 
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(combined_df_ratings[['userId','movieId','rating']],reader)
trainset = data.build_full_trainset()


In [24]:
# train a model using the new combined DataFrame
# cross validating with KNNBaselin
sim_pearson = {'name':'pearson', 'user_based':True}
baseline = KNNBaseline(sim_options=sim_pearson)
baseline.fit(trainset)



Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x793a87814690>

In [26]:
# make predictions for the user
# you'll probably want to create a list of tuples in the format (movie_id, predicted_score)
new_user_id = 1700
# Get a set of all movie IDs from your movies DataFrame
all_movie_ids = set(df_movies['movieId'])
# Get a set of movie IDs that the new user has already rated
# First, convert the list of rating dictionaries to a list of movie IDs
rated_movie_ids = {r['movieId'] for r in user_rating}
# Find all movies the new user has NOT rated
unrated_movies_ids = all_movie_ids -rated_movie_ids
# Create a list to store the predictions
predictions_list = []
# Looping thru all unrated movies to make predictions of the ratings
for movie_id in unrated_movies_ids:
    # Use the trained model to predict the rating for the new user and this movie
    # Note: Surprise's predict method requires user and item IDs to be strings.
    predicted_rating_obj = baseline.predict(str(new_user_id),str(movie_id))
     # Store the movie ID and the predicted score in a tuple
    predictions_list.append((predicted_rating_obj.iid,predicted_rating_obj.est))
print(f'Generated predictions for {len(predictions_list)} unrated movies')

Generated predictions for 9737 unrated movies


In [29]:
# order the predictions from highest to lowest rated

ranked_movies = sorted(predictions_list,key=lambda x:x[1],reverse=True)
# Print the top 10 ranked movie IDs and their predicted scores
print("\n--- Top 10 Ranked Movie Predictions ---")
for movie_id, score in ranked_movies[:10]:
    print(f"Movie ID: {movie_id:<5} | Predicted Score: {score:.2f}")



--- Top 10 Ranked Movie Predictions ---
Movie ID: 1     | Predicted Score: 3.50
Movie ID: 2     | Predicted Score: 3.50
Movie ID: 3     | Predicted Score: 3.50
Movie ID: 4     | Predicted Score: 3.50
Movie ID: 5     | Predicted Score: 3.50
Movie ID: 6     | Predicted Score: 3.50
Movie ID: 7     | Predicted Score: 3.50
Movie ID: 8     | Predicted Score: 3.50
Movie ID: 9     | Predicted Score: 3.50
Movie ID: 10    | Predicted Score: 3.50


 For the final component of this challenge, it could be useful to create a function `recommended_movies()` that takes in the parameters:
* `user_ratings`: list - list of tuples formulated as (user_id, movie_id) (should be in order of best to worst for this individual)
* `movie_title_df`: DataFrame 
* `n`: int - number of recommended movies 

The function should use a `for` loop to print out each recommended *n* movies in order from best to worst

In [30]:
# return the top n recommendations using the 
def recommended_movies(user_ratings, movie_title_df, n):
    """
    Prints the top n recommended movies with their titles.

    Args:
        user_ratings (list): A ranked list of tuples in the format (movie_id, predicted_score),
                             ordered from best to worst.
        movie_title_df (pd.DataFrame): The DataFrame containing movie IDs and titles.
        n (int): The number of recommended movies to print.
    """
    print(f"\n--- Top {n} Movie Recommendations ---")
    
    # Check for empty ratings
    if not user_ratings:
        print("No ratings available to make recommendations.")
        return

    # Use a for loop to iterate through the top n recommendations
    for i in range(min(n, len(user_ratings))):
        movie_id, predicted_score = user_ratings[i]
        
        # Look up the movie title in the movie_title_df using the movie_id
        movie_title = movie_title_df[movie_title_df['movieId'] == int(movie_id)]['title'].iloc[0]
        
        print(f"{i + 1}. Movie: {movie_title} (Predicted Score: {predicted_score:.2f})")

# Example usage (assuming `ranked_movies` and `df_movies` are available)
recommended_movies(ranked_movies, df_movies, 5)



--- Top 5 Movie Recommendations ---
1. Movie: Toy Story (1995) (Predicted Score: 3.50)
2. Movie: Jumanji (1995) (Predicted Score: 3.50)
3. Movie: Grumpier Old Men (1995) (Predicted Score: 3.50)
4. Movie: Waiting to Exhale (1995) (Predicted Score: 3.50)
5. Movie: Father of the Bride Part II (1995) (Predicted Score: 3.50)


## Level Up (Optional)

* Try and chain all of the steps together into one function that asks users for ratings for a certain number of movies, then all of the above steps are performed to return the top $n$ recommendations
* Make a recommender system that only returns items that come from a specified genre

In [31]:
import pandas as pd
import random
from surprise import Dataset, Reader, SVD

# The movie_rater() function from previous turns is a dependency for this general function.
def movie_rater(movie_df: pd.DataFrame, num: int, genre: str = None) -> list:
    """
    Prompts a new user to rate a specified number of movies.
    
    Args:
        movie_df (pd.DataFrame): DataFrame containing movie IDs, titles, and genres.
        num (int): Number of movie ratings to collect.
        genre (str, optional): A specific genre to draw movies from for user ratings.
    
    Returns:
        list: A list of dictionaries, where each dictionary contains the new user's rating.
    """
    rating_list = []
    
    if genre:
        filtered_df = movie_df[movie_df['genres'].str.contains(genre, case=False, na=False)]
        if filtered_df.empty:
            print(f"No movies found for the genre '{genre}'.")
            return rating_list
    else:
        filtered_df = movie_df.copy()

    random_movies = filtered_df.sample(n=min(num, len(filtered_df)), random_state=42)
    user_id = 9999 # A high ID to avoid collision with existing users

    for _, row in random_movies.iterrows():
        movie_id = row['movieId']
        movie_title = row['title']
        
        while True:
            user_input = input(
                f"Rate the movie '{movie_title}' (1-5), or 's' to skip: "
            ).lower()
            
            if user_input == 's':
                print(f"Skipped rating for '{movie_title}'.")
                break
            
            try:
                rating = float(user_input)
                if 1.0 <= rating <= 5.0:
                    rating_list.append({
                        'userId': user_id,
                        'movieId': movie_id,
                        'rating': rating
                    })
                    break
                else:
                    print("Invalid rating. Please enter a value between 1 and 5.")
            except ValueError:
                print("Invalid input. Please enter a number or 's' to skip.") 

    return rating_list


def get_new_user_recommendations(
    df_ratings: pd.DataFrame,
    df_movies: pd.DataFrame,
    num_ratings: int = 5,
    n_recommendations: int = 10,
    rating_genre: str = None,
    recommend_genre: str = None,
    model_params: dict = {'n_factors': 50, 'reg_all': 0.05}
):
    """
    Guides a new user through rating movies, then provides personalized recommendations.

    Args:
        df_ratings (pd.DataFrame): The original ratings dataset.
        df_movies (pd.DataFrame): The movie metadata dataset.
        num_ratings (int): The number of movies to ask the user to rate.
        n_recommendations (int): The number of top recommendations to return.
        rating_genre (str, optional): An optional genre to draw movies from for user ratings.
        recommend_genre (str, optional): An optional genre to filter recommendations by.
        model_params (dict, optional): Parameters for the SVD model.

    Returns:
        pd.DataFrame: A DataFrame of the top n recommendations for the new user.
    """
    
    # --- 1. Get new user ratings ---
    user_new_ratings_list = movie_rater(df_movies, num=num_ratings, genre=rating_genre)

    if not user_new_ratings_list:
        print("No ratings were provided by the user. Cannot generate recommendations.")
        return pd.DataFrame()

    # Get the ID of the new user from the ratings list
    new_user_id = user_new_ratings_list[0]['userId']

    # --- 2. Combine the ratings ---
    df_new_user_ratings = pd.DataFrame(user_new_ratings_list)
    combined_df_ratings = pd.concat([df_ratings, df_new_user_ratings], ignore_index=True)

    # --- 3. Prepare and train the SVD model ---
    reader = Reader(rating_scale=(1, 5))
    data_surprise = Dataset.load_from_df(combined_df_ratings[['userId', 'movieId', 'rating']], reader)
    trainset = data_surprise.build_full_trainset()
    
    model = SVD(**model_params)
    model.fit(trainset)
    
    # --- 4. Identify potential movies to recommend ---
    # Filter by a specific genre for recommendations if requested
    if recommend_genre:
        potential_movies = df_movies[df_movies['genres'].str.contains(recommend_genre, case=False, na=False)].copy()
    else:
        potential_movies = df_movies.copy()
        
    # Exclude movies the user has already rated
    rated_movie_ids = {r['movieId'] for r in user_new_ratings_list}
    unrated_movie_ids = list(set(potential_movies['movieId']) - rated_movie_ids)

    # --- 5. Make predictions ---
    predictions = [
        model.predict(str(new_user_id), str(movie_id))
        for movie_id in unrated_movie_ids
    ]
    
    # --- 6. Rank and get top n recommendations ---
    predictions.sort(key=lambda x: x.est, reverse=True)
    top_n_predictions = predictions[:n_recommendations]

    # --- 7. Format and return recommendations ---
    top_n_movie_ids = [int(p.iid) for p in top_n_predictions]
    
    recommended_movies_df = df_movies[df_movies['movieId'].isin(top_n_movie_ids)].copy()
    recommended_movies_df['predicted_rating'] = recommended_movies_df['movieId'].map(
        {int(p.iid): p.est for p in top_n_predictions}
    )
    
    return recommended_movies_df.sort_values('predicted_rating', ascending=False)

# --- Example Usage ---
# Assume df_ratings and df_movies are already loaded
# df_ratings = pd.read_csv('./ml-latest-small/ratings.csv')
# df_movies = pd.read_csv('./ml-latest-small/movies.csv')

# Example 1: Recommend 5 movies from the 'Comedy' genre after rating 3 random movies
# recommendations_comedy = get_new_user_recommendations(
#     df_ratings, 
#     df_movies, 
#     num_ratings=3, 
#     n_recommendations=5, 
#     recommend_genre='Comedy'
# )
# print(recommendations_comedy)

# Example 2: Recommend 10 movies from any genre after rating 5 movies from the 'Thriller' genre
# recommendations_thriller = get_new_user_recommendations(
#     df_ratings, 
#     df_movies, 
#     num_ratings=5, 
#     n_recommendations=10, 
#     rating_genre='Thriller'
# )
# print(recommendations_thriller)


## Summary

In this lab, you got the chance to implement a collaborative filtering model as well as retrieve recommendations from that model. You also got the opportunity to add your own recommendations to the system to get new recommendations for yourself! Next, you will learn how to use Spark to make recommender systems.