# MOVIELENS RECOMMENDATION SYSTEM

# A. INTRODUCTION: DATASET DESCRIPTION

## MovieLens Dataset: Content and File Structure

The MovieLens dataset is provided as comma-separated values (CSV) files, formatted with a single header row. Below is a detailed breakdown of the structure and content of each file, along with formatting and encoding notes.

### 1. Formatting and Encoding
- Files are encoded in **UTF-8**. Ensure your text editor or analysis script is configured to handle UTF-8, especially for accented characters (e.g., *Misérables, Les (1995)*).
- Files use **comma-separated values (CSV)** format; fields that contain commas (`,`) are enclosed in double quotes (`"`).
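As a small illustration of the two formatting notes above, the sketch below parses a made-up sample in the `movies.csv` layout with Python's standard `csv` module. The sample rows and their `movieId` values are assumptions for the demo, not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the movies.csv layout (ids are made up).
sample = (
    "movieId,title,genres\n"
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
    '73,"Misérables, Les (1995)",Drama|War\n'
)

# For a real file, open(path, encoding="utf-8", newline="") replaces StringIO.
rows = list(csv.DictReader(io.StringIO(sample)))

# The quoted commas stay inside the title field, and the accent survives.
titles = [row["title"] for row in rows]
print(titles)
```

`pd.read_csv` handles quoted fields the same way and reads UTF-8 by default, so the loading code in this notebook needs no extra arguments.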

### 2. User IDs
- User IDs are anonymized and consistent across `ratings.csv` and `tags.csv`.
- Each ID uniquely identifies a user, ensuring user consistency between the rating and tagging data.

### 3. Movie IDs
- Only movies with at least one rating or tag are included in the dataset.
- Movie IDs are consistent across all files (`ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv`).
- These IDs correspond to the same movies used on the [MovieLens website](https://movielens.org).

---

## File Structures

### 1. `ratings.csv`
This file contains explicit user ratings for movies on a **5-star scale**. The data is structured as:

| Column   | Description                                   |
|----------|-----------------------------------------------|
| `userId` | Anonymized ID representing each user          |
| `movieId`| ID representing each movie                    |
| `rating` | User rating for the movie (0.5 to 5.0 stars, in half-star increments) |
| `timestamp` | UNIX timestamp when the rating was made     |

Ratings are sorted first by `userId`, then by `movieId`.
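The `timestamp` column stores seconds since the UNIX epoch (midnight UTC, 1 January 1970). A minimal sketch of converting one value to a readable date follows; the helper name `to_utc` is ours, and the sample value is one that appears in the merged data later in this notebook.

```python
from datetime import datetime, timezone

def to_utc(timestamp: int) -> datetime:
    """Convert a UNIX timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)

rated_at = to_utc(964982703)
print(rated_at.isoformat())
```

With pandas, `pd.to_datetime(df_ratings["timestamp"], unit="s")` performs the same conversion for the whole column.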

### 2. `tags.csv`
Tags represent user-generated metadata (e.g., short descriptions or labels). The structure is:

| Column   | Description                                    |
|----------|------------------------------------------------|
| `userId` | Anonymized ID representing each user           |
| `movieId`| ID representing each movie                     |
| `tag`    | User-assigned tag for the movie                |
| `timestamp` | UNIX timestamp when the tag was added       |

Like ratings, tags are sorted by `userId` and then by `movieId`.

### 3. `movies.csv`
This file includes movie titles and their associated genres. The data is structured as follows:

| Column    | Description                                            |
|-----------|--------------------------------------------------------|
| `movieId` | ID representing each movie                             |
| `title`   | Movie title, including the year of release (e.g., *Toy Story (1995)*) |
| `genres`  | Pipe-separated list of genres (e.g., *Animation\|Children\|Comedy*)   |

Errors or inconsistencies may exist in movie titles due to manual entry.
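Because the release year is embedded in the title and genres are pipe-separated, both fields usually need light parsing before use. The sketch below is one possible approach; `split_title` and `split_genres` are hypothetical helpers, and titles that do not end in a four-digit year are returned with `None` rather than guessed.

```python
import re

# A title is expected to end with a four-digit year in parentheses.
TITLE_YEAR = re.compile(r"^(?P<name>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title(title):
    """Return (name, year); year is None when the title has no trailing year."""
    match = TITLE_YEAR.match(title)
    if match is None:
        return title.strip(), None
    return match.group("name"), int(match.group("year"))

def split_genres(genres):
    """Return the genres as a list; the placeholder value maps to an empty list."""
    if genres == "(no genres listed)":
        return []
    return genres.split("|")

print(split_title("Toy Story (1995)"))
print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
```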

### 4. `links.csv`
This file contains identifiers that link MovieLens movies to external databases (IMDb and TMDb). The structure is:

| Column    | Description                                            |
|-----------|--------------------------------------------------------|
| `movieId` | ID representing each movie in the MovieLens dataset    |
| `imdbId`  | Corresponding movie ID from IMDb                       |
| `tmdbId`  | Corresponding movie ID from The Movie Database (TMDb)  |

### Available Genres:
Movies are categorized into the following genres (separated by pipes `|` in the dataset):

- Action
- Adventure
- Animation
- Children
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
- (no genres listed)

---

# B. OBJECTIVES

## 1. Build a Collaborative Filtering Model
-  Implement and test a collaborative filtering model using the user-item interaction data from `ratings.csv` to provide top 5 movie recommendations for users.
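Once a model can score every movie for a user, producing a top 5 list is a selection problem. The sketch below assumes a dictionary of predicted ratings keyed by `movieId`; the function name `top_n` and the scores are made up for illustration.

```python
import heapq

def top_n(predicted_scores, already_rated, n=5):
    """Return the n highest-scoring movie ids the user has not rated yet."""
    candidates = (
        (score, movie_id)
        for movie_id, score in predicted_scores.items()
        if movie_id not in already_rated
    )
    return [movie_id for _, movie_id in heapq.nlargest(n, candidates)]

# Made-up predicted ratings for one user, keyed by movieId.
scores = {1: 4.5, 2: 3.0, 3: 4.9, 4: 2.0, 5: 4.1, 6: 3.8, 7: 4.7}
recommendations = top_n(scores, already_rated={3})
print(recommendations)
```

Movie 3 has the highest score but is excluded because the user has already rated it.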

## 2. Implement a Hybrid Model (if applicable)
-  Mitigate the cold start problem by incorporating content-based filtering that utilizes movie genres and user-generated tags.
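One simple content-based signal is genre overlap between two movies. The sketch below uses Jaccard similarity on genre sets; this is one possible choice rather than the method used later in this notebook, and the second movie's genre string is hypothetical.

```python
def genre_set(genres):
    """Turn a pipe-separated genre string into a set."""
    return set(genres.split("|"))

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

toy_story = genre_set("Adventure|Animation|Children|Comedy|Fantasy")
other = genre_set("Adventure|Children|Fantasy")  # hypothetical second movie
similarity = jaccard(toy_story, other)
print(similarity)
```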

## 3. Evaluate Model Performance
-  Measure the model’s predictive accuracy using metrics such as RMSE (Root Mean Square Error) and MAE (Mean Absolute Error), along with other relevant ranking-based metrics.

## 4. Optimize Model for Better Performance
-  Tune the model's hyperparameters to enhance the quality of recommendations and overall user satisfaction.

---


# C. DATA CLEANING AND PREPROCESSING

In this section, we prepare our datasets for modeling. Specifically, we will:

1. **Examine for Missing Values:** 
   - Identifying any missing or null values in the datasets is essential as they can significantly affect the performance of our recommendation system. We will analyze each dataset to determine the extent and impact of any missing data.



2. **Initial Data Overview:**
   - We will familiarize ourselves with the structure and contents of the data. This step will provide insight into the types of variables we are dealing with.

3. **Statistical Summary:**
   - Generating descriptive statistics for the numerical variables will allow us to assess their central tendencies and variabilities. This information is crucial for understanding the distribution of ratings and detecting potential outliers.

4. **Data Visualization:**
   - Visualizing data distributions through plots (e.g., histograms) will enhance our understanding of how ratings are spread across different values. Visualization is a powerful tool for uncovering hidden insights and patterns in the data.

After these preliminary analyses, we will document any findings that necessitate further cleaning, such as filling in missing values, removing duplicates, or handling outliers.
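The checks listed above can be sketched on a toy frame as follows. The frame and its planted problems (one missing rating, one duplicated user-movie pair, one off-scale rating) are assumptions made for the demo.

```python
import pandas as pd

# Hypothetical toy frame in the ratings.csv layout, with planted problems.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 10, 10, 20],
    "rating": [4.0, 4.0, None, 7.0],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

# Repeated (userId, movieId) pairs beyond the first occurrence.
duplicate_pairs = int(toy.duplicated(subset=["userId", "movieId"]).sum())

# Ratings must lie in [0.5, 5.0] and be multiples of 0.5.
valid = toy["rating"].between(0.5, 5.0) & (toy["rating"] * 2 % 1 == 0)
out_of_scale = int((toy["rating"].notna() & ~valid).sum())

print(missing_per_column["rating"], duplicate_pairs, out_of_scale)
```

Note that `isna()` alone only detects missing values; duplicates and off-scale ratings need their own checks.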




In [1]:
#Read csv files
import pandas as pd
df_links = pd.read_csv("../ml-latest-small/links.csv")
df_movies = pd.read_csv("../ml-latest-small/movies.csv")
df_ratings = pd.read_csv("../ml-latest-small/ratings.csv")
df_tags = pd.read_csv("../ml-latest-small/tags.csv")

## Cleaning: `ratings.csv`

In [2]:
#checking for missing values...
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [3]:
df_ratings.describe()

|       | userId | movieId | rating | timestamp |
|-------|--------|---------|--------|-----------|
| count | 100836.0 | 100836.0 | 100836.0 | 100836.0 |
| mean  | 326.127564 | 19435.295718 | 3.501557 | 1205946000.0 |
| std   | 182.618491 | 35530.987199 | 1.042529 | 216261000.0 |
| min   | 1.0 | 1.0 | 0.5 | 828124600.0 |
| 25%   | 177.0 | 1199.0 | 3.0 | 1019124000.0 |
| 50%   | 325.0 | 2991.0 | 3.5 | 1186087000.0 |
| 75%   | 477.0 | 8122.0 | 4.0 | 1435994000.0 |
| max   | 610.0 | 193609.0 | 5.0 | 1537799000.0 |


In [4]:
df_ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

- No missing values were found, so `ratings.csv` needs no cleaning.

## Cleaning: `movies.csv`

In [5]:
##checking for missing values...
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [6]:
df_movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

- No missing values were found, so `movies.csv` needs no cleaning.

## Cleaning: `tags.csv`

In [7]:
##checking for missing values...
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [8]:
df_tags.isna().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

- None of the three files contains missing values, so the data is ready to work with. `links.csv` is not checked because it only maps movies to external databases and is not used in the models below.


---
# D. EXPLORATORY DATA ANALYSIS (EDA)


In this phase, we will explore and clean the data before preparing it for modeling. Our primary datasets, `ratings.csv` and `movies.csv`, contain essential information for building a recommendation system:

- **ratings.csv** provides user ratings for various movies, containing columns such as `userId`, `movieId`, `rating`, and `timestamp`.
- **movies.csv** contains details about movies, such as `movieId`, `title`, and `genres`.

To create a more informative dataset, we will **merge the `ratings.csv` and `movies.csv` files** using the `movieId` column, which is common to both datasets. This process allows us to combine user ratings with movie metadata, specifically the genres associated with each movie.

## Why Merge These DataFrames?

Merging the two DataFrames gives us a more comprehensive dataset that includes:
- **User-specific ratings** (from `ratings.csv`)
- **Movie information** like titles and genres (from `movies.csv`)

By bringing in the `genres` column, we enable additional insights that could be valuable in recommendation strategies, particularly when extending collaborative filtering approaches with content-based filtering. This is especially useful for:
1. **Cold start problems**, where new users who have not provided ratings can receive recommendations based on their preferred genres.
2. **Enhancing collaborative filtering** by incorporating genre preferences to fine-tune recommendations.

The resulting dataset, which includes `userId`, `movieId`, `rating`, and `genres`, will be key in building both collaborative and hybrid recommendation models.
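When merging, it can help to confirm that every rating finds exactly one movie. The sketch below uses two pandas options for this, `validate` and `indicator`, on made-up toy frames; it uses a left join from the ratings side, which is an assumption that differs from the default inner join used in this notebook.

```python
import pandas as pd

# Made-up toy frames in the movies.csv and ratings.csv layouts.
movies = pd.DataFrame({"movieId": [1, 2, 3], "genres": ["Comedy", "Drama", "Horror"]})
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 2, 1], "rating": [4.0, 3.5, 5.0]})

# validate raises if a movieId appears more than once in movies;
# indicator adds a _merge column showing where each row came from.
merged = ratings.merge(
    movies, on="movieId", how="left", validate="many_to_one", indicator=True
)

# Ratings whose movieId has no match in movies.
unmatched = int((merged["_merge"] == "left_only").sum())

# Movies that nobody rated never appear in the merged frame.
unrated_movies = sorted(int(m) for m in set(movies["movieId"]) - set(ratings["movieId"]))
print(len(merged), unmatched, unrated_movies)
```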


## Feature Engineering
### Merging the 2 datasets


In [10]:
# Merge movie metadata with ratings on the shared movieId key (inner join by default)
df_movies_ratings = pd.merge(df_movies, df_ratings, on='movieId')
df_movies_ratings.head()

|   | movieId | title | genres | userId | rating | timestamp |
|---|---------|-------|--------|--------|--------|-----------|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1 | 4.0 | 964982703 |
| 1 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 5 | 4.0 | 847434962 |
| 2 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 7 | 4.5 | 1106635946 |
| 3 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 15 | 2.5 | 1510577970 |
| 4 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 17 | 4.5 | 1305696483 |


In [11]:
# Every rating should match exactly one movie, so the merge must preserve the row count
assert len(df_ratings) == len(df_movies_ratings)

In [12]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [13]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB



### Drop Unnecessary Columns
To streamline the analysis, we remove columns that are not needed for the user-item interaction matrix. Specifically, we drop the `timestamp` and `title` columns, since the matrix is built from `userId`, `movieId`, and `rating` only.

In [14]:
df_movies_ratings.columns

Index(['movieId', 'title', 'genres', 'userId', 'rating', 'timestamp'], dtype='object')

In [15]:
df_movies_ratings = df_movies_ratings.drop(columns=['title', 'timestamp'])

In [16]:
df_movies_ratings.columns

Index(['movieId', 'genres', 'userId', 'rating'], dtype='object')

In [17]:
df_movies_ratings.head()

|   | movieId | genres | userId | rating |
|---|---------|--------|--------|--------|
| 0 | 1 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1 | 4.0 |
| 1 | 1 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 5 | 4.0 |
| 2 | 1 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 7 | 4.5 |
| 3 | 1 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 15 | 2.5 |
| 4 | 1 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 17 | 4.5 |


In [18]:
df_movies_ratings = df_movies_ratings[['movieId','userId', 'genres','rating']]
df_movies_ratings.head()

|   | movieId | userId | genres | rating |
|---|---------|--------|--------|--------|
| 0 | 1 | 1 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 4.0 |
| 1 | 1 | 5 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 4.0 |
| 2 | 1 | 7 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 4.5 |
| 3 | 1 | 15 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 2.5 |
| 4 | 1 | 17 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 4.5 |


### Pivot the Data
Next, we will create our user-item matrix by pivoting the DataFrame. This pivot table will have the following structure:
- **Rows:** `userId`
- **Columns:** `movieId`
- **Values:** `rating`

This pivot table will serve as our user-item interaction matrix, which allows us to see how each user has rated different movies.

In [19]:
# Pivot the data to create the user-item interaction matrix
user_item_matrix = df_movies_ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Display the pivoted user-item matrix
user_item_matrix.head()


| userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


### Verify the User-Item Matrix
To confirm that the pivot worked, we check the matrix dimensions: it should have one row per distinct `userId` and one column per distinct `movieId` that has at least one rating. The largest column label should also equal the maximum `movieId` in the dataset (193609). The column count can be lower than the 9742 rows of `movies.csv`, because a movie without ratings produces no column.
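A minimal sketch of such a shape check on a made-up toy frame is shown below. The same assertions could be run against the real matrix; the sparsity figure is an extra we add for context.

```python
import pandas as pd

# Made-up ratings: 3 users, 3 distinct movies, 4 ratings.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 30, 10, 500],
    "rating": [4.0, 3.0, 5.0, 2.5],
})
matrix = toy.pivot_table(index="userId", columns="movieId", values="rating")

n_users, n_movies = matrix.shape
assert n_users == toy["userId"].nunique()            # one row per user
assert n_movies == toy["movieId"].nunique()          # one column per rated movie
assert matrix.columns.max() == toy["movieId"].max()  # last label is the largest id

# Fraction of user-movie pairs that have no rating.
sparsity = 1 - len(toy) / (n_users * n_movies)
print(matrix.shape, round(sparsity, 3))
```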



In [20]:
df_movies.describe()

|       | movieId |
|-------|---------|
| count | 9742.0 |
| mean  | 42200.353623 |
| std   | 52160.494854 |
| min   | 1.0 |
| 25%   | 3248.25 |
| 50%   | 7300.0 |
| 75%   | 76232.0 |
| max   | 193609.0 |


-   After creating the user-item matrix, we fill `NaN` values with zeros to mark user-movie pairs that have no rating. Note that 0 is a placeholder outside the 0.5 to 5.0 rating scale, not a real rating.
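Zero-filling is convenient, but any computation that treats the zeros as ratings will be biased downward. The sketch below illustrates this on one made-up user row.

```python
import numpy as np

# One hypothetical user: two real ratings and two unrated movies.
ratings_row = np.array([4.0, np.nan, 5.0, np.nan])

# Mean over the observed ratings only.
mean_observed = float(np.nanmean(ratings_row))

# Mean after replacing missing entries with 0.
zero_filled = np.where(np.isnan(ratings_row), 0.0, ratings_row)
mean_zero_filled = float(zero_filled.mean())

print(mean_observed, mean_zero_filled)  # 4.5 versus 2.25
```

For that reason it is often safer to keep the long-form ratings (or the unfilled matrix) for model training and use the zero-filled matrix only where a dense array is required.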


In [21]:
# Replace NaNs (unrated user-movie pairs) with 0
user_item_matrix_filled = user_item_matrix.fillna(0)
user_item_matrix_filled.shape

(610, 9724)

In [22]:
user_item_matrix_filled

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


-    The user-item matrix is now complete, so we can proceed to modeling.

---

# E. MODELING

# Baseline Model

-   We shall create a simple baseline model with SVD, using the Surprise library.

## Baseline Collaborative Filtering model

### 1. Introduction

In this section, we will build a baseline recommendation model using the **Surprise** library, specifically applying **Singular Value Decomposition (SVD)** for collaborative filtering. Our goal is to create a model that predicts user ratings for movies from historical ratings.

To evaluate the model's performance, we will utilize **RMSE** (Root Mean Squared Error) as our primary metric. Additionally, we will ensure that our predictions are rounded to the nearest 0.5-star increment to align with the rating scale used in the dataset.
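For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, and MAE is the mean absolute difference. A minimal sketch on made-up numbers follows; the function names are ours.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

# Errors are 0.5, 0, 1, -1: RMSE = sqrt(2.25 / 4) = 0.75, MAE = 2.5 / 4 = 0.625.
print(rmse(actual, predicted), mae(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, which is why the two values differ here.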

### 2. Objectives
- Apply **SVD** for collaborative filtering.
- Tune model hyperparameters using **GridSearchCV**.
- Round predictions to the nearest 0.5-star.
- Evaluate the model using **RMSE**.

### 3. Train-Test Split
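We begin by splitting the data into a training and a testing set so that we can later evaluate the model on unseen data. Because `train_test_split` is applied to the user-item matrix, it splits by row: 80% of the users go to the training set and the remaining 20% are held out entirely.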

In [23]:
from sklearn.model_selection import train_test_split

# Perform the train-test split (80% train, 20% test)
train_data, test_data = train_test_split(user_item_matrix_filled, test_size=0.2, random_state=42)
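The split above operates on matrix rows, so it holds out whole users. An alternative that is common for rating prediction is to hold out individual ratings, so that test users still have some ratings in the training set. The sketch below shows that idea on made-up long-form rows; `split_ratings` is a hypothetical helper, not part of this notebook's pipeline.

```python
import random

def split_ratings(rows, test_fraction=0.2, seed=42):
    """Shuffle (userId, movieId, rating) rows and hold out a fraction as test."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# 20 made-up ratings: 5 users x 4 movies, all rated 3.0.
rows = [(u, m, 3.0) for u in range(1, 6) for m in range(1, 5)]
train_rows, test_rows = split_ratings(rows)
print(len(train_rows), len(test_rows))
```

Surprise's own `train_test_split` works at this rating level once the data is loaded in long form.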


### 4. Applying SVD for Collaborative Filtering

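Surprise's `SVD` is a matrix-factorization model: it learns latent factor vectors for users and items, plus bias terms, by stochastic gradient descent rather than by computing an exact singular value decomposition. We fit it with default hyperparameters here and tune it with GridSearchCV in the Model Tuning section.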
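To make the idea of latent factors concrete, the sketch below fits a small biased matrix-factorization model by stochastic gradient descent in NumPy. It is an illustration under our own assumptions (function names, toy data, and hyperparameter values are made up) and not Surprise's implementation.

```python
import numpy as np

def fit_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u[u] + b_i[i] + P[u] . Q[i] on observed ratings only."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))      # global mean rating
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)      # user and item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))         # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))         # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()                           # use the pre-update user vector
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def predict(model, u, i):
    mu, b_u, b_i, P, Q = model
    return float(mu + b_u[u] + b_i[i] + P[u] @ Q[i])

def train_rmse(model, ratings):
    return float(np.sqrt(np.mean([(r - predict(model, u, i)) ** 2 for u, i, r in ratings])))

# Made-up ratings as (user index, item index, rating); unrated pairs are simply absent.
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.5)]
baseline = fit_mf(toy, 3, 3, n_epochs=0)   # no training: roughly predicts the global mean
trained = fit_mf(toy, 3, 3, n_epochs=200)
print(train_rmse(baseline, toy), train_rmse(trained, toy))
```

The key design point is that only observed ratings enter the loss; missing user-movie pairs are skipped rather than treated as zeros.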

In [24]:
# Import the necessary libraries
from surprise import Dataset, Reader, SVD
# Aliased so it does not shadow sklearn's train_test_split imported above
from surprise.model_selection import train_test_split as surprise_train_test_split

# Convert the training matrix to long form (userId, movieId, rating) for Surprise.
# Note: stack() on the zero-filled matrix keeps every user-movie pair, so unrated
# pairs are passed to Surprise as ratings of 0, outside the declared rating scale.
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(train_data.stack().reset_index(), reader)

# Split the Surprise dataset into train and test sets
trainset, testset = surprise_train_test_split(data, test_size=0.2)

# Initialize and train the SVD model
svd = SVD()
svd.fit(trainset)

# Predict ratings for the test set
predictions = svd.test(testset)


In [25]:
# Importing library
from surprise import accuracy

# Calculating accuracy
accuracy.rmse(predictions)

RMSE: 0.6135


0.613506398856177

### 5. Rounding Predictions to the Nearest 0.5

Since ratings are given in half-star increments, we round predictions to the nearest 0.5-star.

In [33]:
# Function to round to nearest 0.5
def round_to_half(x):
    return round(x * 2) / 2

# Apply rounding to predictions
rounded_predictions = []
for pred in predictions:
    rounded_rating = round_to_half(pred.est)
    rounded_predictions.append(rounded_rating)
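One subtlety: Python's built-in `round` rounds exact ties to the nearest even integer, so `round(4.5)` is `4` and a prediction of 2.25 would round down to 2.0. The variant below is a sketch that rounds ties upward and also clips to the 0.5 to 5.0 scale; the function name and the clipping bounds as parameters are our assumptions.

```python
import math

def round_to_half_clipped(x, low=0.5, high=5.0):
    """Round to the nearest 0.5 (ties go up), then clip to the rating scale."""
    rounded = math.floor(x * 2 + 0.5) / 2
    return min(max(rounded, low), high)

print(round(2.25 * 2) / 2)          # built-in round: tie goes to even, giving 2.0
print(round_to_half_clipped(2.25))  # tie goes up, giving 2.5
print(round_to_half_clipped(0.1))   # clipped up to the minimum rating, 0.5
```

Whether ties should go up or to even is a modeling choice; with real-valued predictions exact ties are rare, so the effect on RMSE is usually small.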

### 6. Evaluating Model Performance using RMSE

Finally, we evaluate the model performance by comparing the rounded predicted ratings to the actual ratings.

#### Unrounded predicted ratings

In [41]:
# Importing library
from surprise import accuracy

# Calculating accuracy
accuracy.rmse(predictions)

RMSE: 0.6135


0.613506398856177

#### Rounded predicted ratings


In [42]:
# Apply rounding to predictions while keeping the Prediction object structure
rounded_predictions = []
for pred in predictions:
    # Create a new Prediction object with the rounded rating
    rounded_pred = pred._replace(est=round_to_half(pred.est))
    rounded_predictions.append(rounded_pred)

# Calculate accuracy with rounded predictions
accuracy.rmse(rounded_predictions)

RMSE: 0.6141


0.6140796167031595

## Model Tuning

The grid search takes a very long time to run, so the cells below are commented out (wrapped in string literals).

In [43]:
"""from surprise.model_selection import GridSearchCV
# Define Hyperparameter Grid
param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30, 40],
    'lr_all': [0.005, 0.01, 0.02]
}
# Perform GridSearchCV
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
gs.fit(data)

# Output the best score and parameters
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])
"""

KeyboardInterrupt: 
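The run time follows from the size of the search: every parameter combination is trained once per cross-validation fold. A quick count for the grid above, using only the standard library:

```python
from itertools import product

# Same grid as in the commented-out cell above.
param_grid = {
    "n_factors": [50, 100, 150],
    "n_epochs": [20, 30, 40],
    "lr_all": [0.005, 0.01, 0.02],
}
cv_folds = 5

combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)  # 27 combinations, 135 model fits
```

Reducing the grid, lowering the fold count, or sampling combinations at random are common ways to shorten the search.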

### Evaluation

In [29]:
"""import numpy as np

# Function to round predictions to the nearest 0.5 (with a floor of 0.5)
def round_to_nearest_half(rating):
    return max((round(rating * 2) / 2), 0.5)

# Extract the best estimator for 'rmse' from the grid search object `gs`
best_svd = gs.best_estimator['rmse']

# Fit the best SVD model on the full training set
trainset = data.build_full_trainset()
best_svd.fit(trainset)

# Predict every rated entry of the test matrix and round the predictions
predicted_ratings = np.zeros(test_data.shape)
for user_pos, user_id in enumerate(test_data.index):
    for movie_pos, movie_id in enumerate(test_data.columns):
        if test_data.iloc[user_pos, movie_pos] > 0:
            # Surprise expects the raw userId/movieId values, not positional indices
            predicted_rating = best_svd.predict(user_id, movie_id).est
            predicted_ratings[user_pos, movie_pos] = round_to_nearest_half(predicted_rating)
"""

In [30]:
"""from sklearn.metrics import mean_squared_error
import numpy as np

# Positions of the rated (non-zero) entries in the test matrix
non_zero_indices = np.where(test_data.values != 0)

# Extract true and predicted ratings at those positions
true_ratings = test_data.values[non_zero_indices]
predicted_ratings_nonzero = predicted_ratings[non_zero_indices]

# Compute RMSE between the true ratings and the rounded predictions
rmse = np.sqrt(mean_squared_error(true_ratings, predicted_ratings_nonzero))
print(f'RMSE: {rmse:.4f}')
"""