# MOVIE RECOMMENDATION SYSTEM

## 1. BUSINESS UNDERSTANDING 



### Objective

Leveraging the MovieLens dataset is to develop a robust movie recommendation system that enhances user engagement and satisfaction within our online movie streaming platform. By effectively recommending movies that align with users' preferences, we aim to increase user retention, drive user-generated content, and boost overall revenue.

### Data Description 

The MovieLens dataset, curated by the GroupLens research lab at the University of Minnesota, is a well-established and widely used resource in the field of recommendation systems. It contains a wealth of information, including user ratings, movie metadata, and user profiles, collected over a significant period of time.

### Problem Definition 

Our primary business problem is to overcome the challenge of content discovery for users. With an ever-expanding catalog of movies, users often face decision making issues when choosing what to watch. We need to address this by providing tailored movie recommendations based on user preferences, thereby simplifying the selection process and improving user satisfaction.

### Key Stakeholders

#### Users: 
Our end-users are at the core of our business. We aim to provide them with an enjoyable and personalized movie-watching experience.
#### Platform Owners: 
The success of our recommendation system directly impacts platform owners (MovieLens) by increasing user engagement and revenue.
#### Content Providers:
Enhanced user engagement can attract content providers to collaborate with the platform, enriching their movie catalog.
#### Data Scientists and Engineers: 
The data science and engineering teams play a crucial role in developing, deploying, and maintaining the recommendation system.

### Solution Approach

Our approach is centered around collaborative filtering, a proven recommendation technique. We will analyze user behavior and preferences within the dataset to build models that identify similarities between users and movies. This will enable us to provide personalized movie recommendations.

### Evaluation Metrics

To assess the effectiveness of our recommendation system, we will employ metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Precision, Recall, F1-score, Coverage, and Diversity. These metrics will help us quantify the system's performance in terms of accuracy and relevance.

### Research Questions?

1. What movies might I enjoy watching?
    Users can receive personalized movie recommendations based on their past viewing and rating history.

2. What are the most popular or highly-rated movies?
    The system can provide lists of top-rated or trending movies, helping users discover popular titles.

3. Are there movies similar to the ones I've enjoyed in the past?
    Users can receive recommendations for movies similar to those they've rated highly, expanding their viewing options.

4. How can I discover new movies from genres I like?
    The system can suggest movies from specific genres that align with a user's preferences.

5. What movies have received critical acclaim or awards?
    Users can access recommendations for award-winning or critically acclaimed films.

6. What are the top recommendations for a specific user, given their unique tastes?
    The system tailors recommendations for individual users based on their historical ratings and preferences.

7. How can we improve user engagement and retention on our platform?
    For businesses, the recommendation system can increase user engagement by providing relevant content, reducing churn, and increasing user satisfaction.

8. What is the diversity and coverage of our recommendations?
    Businesses can assess the diversity of recommendations to ensure users are exposed to a wide range of movie genres and styles. Additionally, they can measure how many unique movies in their catalog are being recommended.

9. How accurate are our recommendations?
    Businesses can evaluate the effectiveness of the recommendation system using metrics such as RMSE, MAE, or precision-recall, determining how closely the system's predictions align with user preferences.

10. How can we increase revenue through movie recommendations?
    Businesses can leverage the recommendation system to drive movie rentals, subscriptions, or sales, thereby increasing revenue and ROI.

11. How can we personalize the user experience and increase user-generated content?
    By offering tailored recommendations, businesses can encourage users to rate and review movies, contributing to a richer database of user-generated content.

### Success Criteria

The success of our recommendation system will be measured by improvements in key performance indicators (KPIs) including:

1. User Engagement: Increased user engagement through higher interaction with recommended movies.
2. User Retention: A decrease in user churn rates, indicating improved user satisfaction.
3. Revenue: A significant boost in revenue through increased user subscriptions and movie rentals.
4. Content Utilization: A broader range of movies being watched, leading to better utilization of the movie catalog.

## 2. DATA UNDERSTANDING

- The dataset is named "ml-latest-small" and is from MovieLens, a movie recommendation service.
- It includes 100,836 ratings and 3,683 tag applications across 9,742 movies.
- The data was generated by 610 users between March 29, 1996, and September 24, 2018.
- The dataset was last generated on September 26, 2018.
- Users were selected at random, and their demographic information is not included.

Movie file :
1. movieId - movie reference indicator
2. title - this is the movie titles
3. genres - movie types

Rating file :
1. userId - users reference indicator
2. movieId 
3. rating - movie rating 
4. timestamp - movie online information 

## 3. IMPORTING LIBRARIES

In [13]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from surprise import Dataset, Reader
from surprise import SVD
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise import Dataset, Reader


## 4. READING DATA 

Start by opening the data preferably by using the ratings and movies data

In [14]:
# Load the ratings data
ratings = pd.read_csv("C:\\Users\\Administrator\\Desktop\\Moringa\\Phase 4\\Phase 4 Project Recommendation System\\ml-latest-small\\ml-latest-small\\ratings.csv")
# Load the movies data 
movies = pd.read_csv("C:\\Users\\Administrator\\Desktop\\Moringa\\Phase 4\\Phase 4 Project Recommendation System\\ml-latest-small\\ml-latest-small\\movies.csv")

In [15]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [16]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Check for missing data on the dataset

In [17]:
print("Ratings Data - Missing Values:")
print(ratings.isnull().sum())

print("\nMovies Data - Missing Values:")
print(movies.isnull().sum())

Ratings Data - Missing Values:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

Movies Data - Missing Values:
movieId    0
title      0
genres     0
dtype: int64


From the illustration above there are no missing values in the Ratings and Movies data

Check for duplicates 

In [18]:
ratings.drop_duplicates(inplace=True)
# outlier handling: Identifying and capping outlier ratings
outlier_threshold = 5.0
ratings['rating'] = ratings['rating'].clip(0.5, outlier_threshold)

In [19]:
# Check data consistency: Ensure that movie IDs are consistent across datasets
if movies['movieId'].isin(ratings['movieId']).all():
    print("Inconsistent movie IDs between movies and ratings datasets.")
else:
    print("Consistent movie IDs between movies and ratings datasets. ")

Consistent movie IDs between movies and ratings datasets. 


In [20]:
ratings.drop(columns=['timestamp'], inplace=True)
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


### Feature Engineering

In [21]:
# Calculate the average rating for each movie
average_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()
average_ratings.rename(columns={'rating': 'avg_rating'}, inplace=True)

In [22]:
# Assuming genres are in the "genres" column and are pipe-separated
unique_genres = set('|'.join(movies['genres']).split('|'))
for genre in unique_genres:
    movies[genre] = movies['genres'].apply(lambda x: 1 if genre in x else 0)

In [23]:
# Display the first few rows of the resulting DataFrame
print("Average Ratings for Movies:")
print(average_ratings.head())

print("\nMovies DataFrame with Genre-Based Features:")
print(movies.head())

Average Ratings for Movies:
   movieId  avg_rating
0        1    3.920930
1        2    3.431818
2        3    3.259615
3        4    2.357143
4        5    3.071429

Movies DataFrame with Genre-Based Features:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  Adventure  Drama  Horror  \
0  Adventure|Animation|Children|Comedy|Fantasy          1      0       0   
1                   Adventure|Children|Fantasy          1      0       0   
2                               Comedy|Romance          0      0       0   
3                         Comedy|Drama|Romance          0      1       0   
4                                       Comedy          0      0       0   

   Sci-Fi  Crime  Romance  Musi

Create a feature that counts the number of ratings each movie has received. Movies with a higher number of ratings may be more popular or well-known.

In [24]:
# Calculate the number of ratings for each movie
movie_rating_counts = ratings['movieId'].value_counts().reset_index()
movie_rating_counts.columns = ['movieId', 'num_ratings']

# Display the first few rows of the resulting DataFrame
print("Number of Ratings for Each Movie:")
print(movie_rating_counts.head())

Number of Ratings for Each Movie:
   movieId  num_ratings
0      356          329
1      318          317
2      296          307
3      593          279
4     2571          278


Genre Based:

Create binary columns for each genre (e.g., Action, Comedy, Romance) and indicate whether a movie belongs to a particular genre. These binary indicators can be used in content-based filtering.
Calculate the proportion of each genre in a movie's genre list (e.g., the percentage of Action movies).

In [25]:
# Extract unique genres
unique_genres = set('|'.join(movies['genres']).split('|'))

# Create binary genre-based columns
for genre in unique_genres:
    movies[genre] = movies['genres'].apply(lambda x: 1 if genre in x else 0)

# Calculate the proportion of each genre in a movie's genre list
genre_columns = list(unique_genres)
movies[genre_columns] = movies[genre_columns].div(movies[genre_columns].sum(axis=1), axis=0)

# Display the first few rows of the resulting DataFrame
print("Movies DataFrame with Binary Genre-Based Columns:")
print(movies.head())

Movies DataFrame with Binary Genre-Based Columns:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  Adventure     Drama  Horror  \
0  Adventure|Animation|Children|Comedy|Fantasy   0.200000  0.000000     0.0   
1                   Adventure|Children|Fantasy   0.333333  0.000000     0.0   
2                               Comedy|Romance   0.000000  0.000000     0.0   
3                         Comedy|Drama|Romance   0.000000  0.333333     0.0   
4                                       Comedy   0.000000  0.000000     0.0   

   Sci-Fi  Crime   Romance  Musical  ...  IMAX   Fantasy  Film-Noir  \
0     0.0    0.0  0.000000      0.0  ...   0.0  0.200000        0.0   
1     0.0    0.0  0.000000      

Release Year:

Extract the release year from movie titles and create a feature for the movie's release year. This can be used to recommend recent movies or movies from a specific era.

In [26]:
# Assuming that the release year is enclosed in parentheses at the end of the title
movies['release_year'] = movies['title'].str.extract(r'\((\d{4})\)')

# Display the first few rows of the resulting DataFrame
print("Movies DataFrame with Release Year:")
print(movies.head())

Movies DataFrame with Release Year:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  Adventure     Drama  Horror  \
0  Adventure|Animation|Children|Comedy|Fantasy   0.200000  0.000000     0.0   
1                   Adventure|Children|Fantasy   0.333333  0.000000     0.0   
2                               Comedy|Romance   0.000000  0.000000     0.0   
3                         Comedy|Drama|Romance   0.000000  0.333333     0.0   
4                                       Comedy   0.000000  0.000000     0.0   

   Sci-Fi  Crime   Romance  Musical  ...   Fantasy  Film-Noir  \
0     0.0    0.0  0.000000      0.0  ...  0.200000        0.0   
1     0.0    0.0  0.000000      0.0  ...  0.333333        

User-Based Features:

For collaborative filtering, you can create user-based features such as the average rating given by each user or the number of movies each user has rated.

In [27]:
# Calculate the average rating given by each user
user_avg_rating = ratings.groupby('userId')['rating'].mean().reset_index()
user_avg_rating.rename(columns={'rating': 'avg_rating_by_user'}, inplace=True)

# Calculate the number of movies each user has rated
user_rating_counts = ratings['userId'].value_counts().reset_index()
user_rating_counts.columns = ['userId', 'num_movies_rated']

# Display the first few rows of the resulting DataFrames
print("Average Rating Given by Each User:")
print(user_avg_rating.head())

print("\nNumber of Movies Rated by Each User:")
print(user_rating_counts.head())

Average Rating Given by Each User:
   userId  avg_rating_by_user
0       1            4.366379
1       2            3.948276
2       3            2.435897
3       4            3.555556
4       5            3.636364

Number of Movies Rated by Each User:
   userId  num_movies_rated
0     414              2698
1     599              2478
2     474              2108
3     448              1864
4     274              1346


## 5. MODELLING

Generate latent factors or embeddings for movies and users using matrix factorization techniques like Singular Value Decomposition (SVD) or matrix factorization models. These embeddings can be used to capture complex relationships.

Also by training the SVD Model set and fit the model 

In [36]:
# Define the Reader object
reader = Reader(rating_scale=(0.5, 5.0))  # Define your rating scale as appropriate

# Load the ratings data using the Reader
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Split the dataset into a trainset and testset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Create and train an SVD model
svd_model = SVD(n_factors=50, random_state=42)
svd_model.fit(trainset)

# Make predictions on the test set
predictions = svd_model.test(testset)

# Evaluate the model using RMSE
rmse = accuracy.rmse(predictions)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# Get latent factors for movies and users
movie_factors = svd_model.qi
user_factors = svd_model.pu

RMSE: 0.8775
Root Mean Squared Error (RMSE): 0.8774680781839199


 RMSE value of 0.8775 indicates the average error (or the average difference) between the predicted ratings by the recommendation system and the actual ratings given by users. Lower RMSE values indicate better accuracy, while higher values suggest that the predictions are less accurate.

Lets make a simple prediction to see the outcome.

In [37]:
# Predict the rating for user 1 and movie 100
predicted_rating = svd_model.predict(1, 100)

# Extract the predicted rating value
predicted_rating_value = predicted_rating.est

print(f"Predicted rating for user 1 and movie 100: {predicted_rating_value}")

Predicted rating for user 1 and movie 100: 3.661648275505914


to determine the best model.Compare the different models and see which ones perform best. For consistency sake, use RMSE to evaluate models. Can it get a model with a higher average RMSE on test data than 0.877?

In [38]:
from surprise.model_selection import GridSearchCV
# Define the parameter grid for SVD
param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30, 40],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.04, 0.06]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1, joblib_verbose=5)

# Fit the grid search on the data
grid_search.fit(data)

# Get the best RMSE score and corresponding parameters
best_rmse = grid_search.best_score['rmse']
best_params = grid_search.best_params['rmse']

print(f"Best RMSE: {best_rmse}")
print(f"Best Parameters: {best_params}")

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    6.7s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:   45.3s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  2.2min


Best RMSE: 0.8630795161670323
Best Parameters: {'n_factors': 150, 'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.06}


[Parallel(n_jobs=-1)]: Done 243 out of 243 | elapsed:  4.2min finished


It is going to apply KNN basic model to evaluate

In [39]:
from surprise import KNNBasic
from surprise.model_selection import cross_validate
# Define the KNNBasic model
knn_basic_model = KNNBasic(sim_options={'user_based': True})

# Perform cross-validation
results = cross_validate(knn_basic_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Extract and print the average RMSE and MAE scores
average_rmse = results['test_rmse'].mean()
average_mae = results['test_mae'].mean()

print(f"Average RMSE: {average_rmse}")
print(f"Average MAE: {average_mae}")

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9399  0.9513  0.9377  0.9508  0.9533  0.9466  0.0064  
MAE (testset)     0.7213  0.7277  0.7211  0.7294  0.7287  0.7256  0.0037  
Fit time          0.53    0.53    0.54    0.55    0.55    0.54    0.01    
Test time         3.33    3.83    3.46    3.40    3.92    3.59    0.24    
Average RMSE: 0.9465899460188847
Average MAE: 0.7256387075421818


Lower RMSE(Root Mean Squared Error) and MAE (Mean Absolute Error) values are desirable because they indicate that the model's predictions are closer to the actual ratings. 

Lets use the KNNBaseline model

In [40]:
from surprise import KNNBaseline
from surprise.model_selection import cross_validate

# Define the KNNBaseline model
knn_baseline_model = KNNBaseline(sim_options={'user_based': True})

# Perform cross-validation
results = cross_validate(knn_baseline_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Extract and print the average RMSE and MAE scores
average_rmse = results['test_rmse'].mean()
average_mae = results['test_mae'].mean()

print(f"Average RMSE: {average_rmse}")
print(f"Average MAE: {average_mae}")

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8766  0.8721  0.8727  0.8708  0.8817  0.8748  0.0040  
MAE (testset)     0.6672  0.6696  0.6653  0.6641  0.6739  0.6680  0.0035  
Fit time          0.98    1.04    1.09    1.15    1.09    1.07    0.06    
Test time         4.77    4.22    4.19    4.20    4.20    4.32    0.23    
Average RMSE: 0.874789475108738
Average MAE

It appears to have lower RMSE and MAE compared to the KNNBasic model, suggesting better accuracy. The fit time and test time measurements give you an idea of the computational efficiency of the model.

### Testing the models with Random data 

we have working models and everything, but wouldn't it be nice to get to recommendations specifically tailored to your preferences? That's what we'll be doing now. The first step is to create a function that allows us to pick randomly selected movies. The function should present users with a movie and ask them to rate it. If they have not seen the movie, they should be able to skip rating it.

In [42]:
def movie_rater(movie_df, num, genre=None):
    """
    Rate randomly selected movies and return a list of ratings in the format of {'userId': int, 'movieId': int, 'rating': float}.

    Parameters:
    - movie_df: DataFrame - DataFrame containing movie ids, movie names, and genres.
    - num: int - Number of ratings to collect from the user.
    - genre: str (optional) - Specify a genre to filter movies. If None, all movies will be considered.

    Returns:
    - rating_list: list - A list of dictionaries with user ratings.
    """
    rating_list = []
    user_id = 9999  # Assign a unique user ID for the user (you can adjust this as needed)

    # Filter movies by genre if specified
    if genre:
        movie_subset = movie_df[movie_df['genres'].str.contains(genre, case=False, na=False)]
    else:
        movie_subset = movie_df

    # Randomly select movies for rating
    random_movies = random.sample(range(len(movie_subset)), num)

    for idx in random_movies:
        movie = movie_subset.iloc[idx]
        movie_id = movie['movieId']
        movie_title = movie['title']

        # Prompt the user for a rating (0.5 to 5.0) or skip (None)
        rating = None
        while rating is None:
            try:
                rating = float(input(f"Rate '{movie_title}' (Movie ID: {movie_id}): "))
                if rating < 0.5 or rating > 5.0:
                    print("Please enter a rating between 0.5 and 5.0.")
                    rating = None
            except ValueError:
                print("Invalid input. Please enter a valid rating or 'skip'.")

        # Add the user rating to the list
        rating_dict = {'userId': user_id, 'movieId': movie_id, 'rating': rating}
        rating_list.append(rating_dict)

    return rating_list


In [50]:
import random 
new_ratings = movie_rater(movies, num=5, genre="Action")

# Display the collected ratings
print("User Ratings:")
print(new_ratings)

Rate 'Jason Bourne (2016)' (Movie ID: 160438): 5
Rate 'Legend, The (Legend of Fong Sai-Yuk, The) (Fong Sai Yuk) (1993)' (Movie ID: 7844): 3
Rate 'Get Carter (1971)' (Movie ID: 3947): 4
Rate 'Five Element Ninjas (1982)' (Movie ID: 135803): 4
Rate 'Victory (a.k.a. Escape to Victory) (1981)' (Movie ID: 5915): 4
User Ratings:
[{'userId': 9999, 'movieId': 160438, 'rating': 5.0}, {'userId': 9999, 'movieId': 7844, 'rating': 3.0}, {'userId': 9999, 'movieId': 3947, 'rating': 4.0}, {'userId': 9999, 'movieId': 135803, 'rating': 4.0}, {'userId': 9999, 'movieId': 5915, 'rating': 4.0}]


In [54]:
# Assuming new_ratings is a list of dictionaries
new_ratings_df = pd.DataFrame(new_ratings)

# Check the format of new_ratings_df
print(new_ratings_df.head())

   userId  movieId  rating
0    9999   160438     5.0
1    9999     7844     3.0
2    9999     3947     4.0
3    9999   135803     4.0
4    9999     5915     4.0


In [57]:
ratings = pd.DataFrame(ratings)

# Assuming new_ratings is a list of dictionaries
new_ratings = pd.DataFrame(new_ratings)

In [58]:
updated_ratings = pd.concat([ratings, new_ratings], ignore_index=True)


In [59]:
# Create a Reader object to specify the rating scale
reader = Reader(rating_scale=(1, 5))  # Assuming ratings are on a scale from 1 to 5

# Load the dataset from the DataFrame using the Reader
data = Dataset.load_from_df(updated_ratings[['userId', 'movieId', 'rating']], reader)

### Making predicitions with the new ratings

In [64]:
# Use the best parameters obtained from grid search (replace with your actual best parameters)
best_svd_params = {'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}

# Initialize the SVD model with the best parameters
svd_model = SVD(n_factors=50, random_state=42)

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(updated_ratings[['userId', 'movieId', 'rating']], reader)

# Split the dataset into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train the SVD model on the training set
svd_model.fit(trainset)

# Make predictions on the test set
predictions = svd_model.test(testset)

# Calculate and print RMSE (Root Mean Squared Error) to evaluate the model's performance
rmse = accuracy.rmse(predictions)
print(f"RMSE: {rmse}")

RMSE: 1.3723
RMSE: 1.3722659857632973


In [65]:
#user example to use for this test
user_id = 9999

# Get a list of all movieIds in your dataset
all_movie_ids = ratings['movieId'].unique()

# Initialize an empty list to store movie recommendations in the format (movie_id, predicted_score)
user_recommendations = []

# Generate predictions for the selected user
for movie_id in all_movie_ids:
    # Predict the user's rating for this movie using your trained SVD model
    predicted_rating = svd_model.predict(user_id, movie_id).est
    
    # Append the movie_id and predicted_score to the recommendations list
    user_recommendations.append((movie_id, predicted_rating))

# Sort the recommendations by predicted_score in descending order
user_recommendations.sort(key=lambda x: x[1], reverse=True)

# Display the top N movie recommendations (e.g., top 10)
top_n_recommendations = user_recommendations[:10]
for i, (movie_id, predicted_score) in enumerate(top_n_recommendations, start=1):
    print(f"Recommendation {i}: Movie ID {movie_id}, Predicted Score {predicted_score}")

Recommendation 1: Movie ID 160565, Predicted Score 4.604336084467107
Recommendation 2: Movie ID 3404, Predicted Score 4.41122272900279
Recommendation 3: Movie ID 2196, Predicted Score 4.372265985763297
Recommendation 4: Movie ID 111362, Predicted Score 4.290131213012642
Recommendation 5: Movie ID 161594, Predicted Score 4.266474093639642


In [66]:
def recommended_movies(user_ratings, movie_title_df, n):
    # Create a dictionary to map movie IDs to movie titles
    movie_id_to_title = dict(zip(movie_title_df['movieId'], movie_title_df['title']))

    # Print a header
    print(f"Top {n} Recommended Movies for User")

    # Iterate through the user_ratings and print the top N recommendations
    for i, (user_id, movie_id) in enumerate(user_ratings, 1):
        # Check if the movie ID exists in the movie title DataFrame
        if movie_id in movie_id_to_title:
            movie_title = movie_id_to_title[movie_id]
            print(f"{i}. {movie_title}")

            # Stop when we have printed the top N recommendations
            if i >= n:
                break