# Movielens Recommendation System


# Introduction

## Data Inspection

-   Since we do not yet know what the dataset contains, we load the files and inspect them briefly to get an idea of what we are dealing with.

-   Before setting objectives, we first inspect the dataset to look for possible connections between the files.


In [1]:
#Imports
import pandas as pd

In [2]:
#Read csv files
df_links = pd.read_csv("../ml-latest-small/links.csv")
df_movies = pd.read_csv("../ml-latest-small/movies.csv")
df_ratings = pd.read_csv("../ml-latest-small/ratings.csv")
df_tags = pd.read_csv("../ml-latest-small/tags.csv")

## Dataset Inspection
-    We inspect each file to see which features we can use.

In [3]:
df_links.head()


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [4]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [7]:
df_tags[df_tags['movieId']==1]

Unnamed: 0,userId,movieId,tag,timestamp
629,336,1,pixar,1139045764
981,474,1,pixar,1137206825
2886,567,1,fun,1525286013


In [8]:
list(df_tags['tag'].sample(n=20))

['irreverent',
 'understated',
 'lord of the rings',
 'England',
 'thought-provoking',
 'Nudity (Topless)',
 'Shakespeare',
 'Aardman',
 'predictible plot',
 'James Cameron',
 'bad plot',
 'sexuality',
 'aviation',
 'psychology',
 'Anthony Hopkins',
 'dark humor',
 'mental illness',
 'movie business',
 'child abuse',
 'British gangster']

- We now have a rough idea of what the files contain. Let us officially start the project.

# MOVIELENS RECOMMENDATION SYSTEM

# A.    INTRODUCTION : DATASET DESCRIPTION

## MovieLens Dataset: Content and File Structure

The MovieLens dataset is provided as comma-separated values (CSV) files, formatted with a single header row. Below is a detailed breakdown of the structure and content of each file, along with formatting and encoding notes.

### 1. Formatting and Encoding
- Files are encoded in **UTF-8**. Ensure your text editor or analysis script is configured to handle UTF-8, especially for accented characters (e.g., *Misérables, Les (1995)*).
- **Comma-separated** values (CSV) format is used, with columns that contain commas (`,`) enclosed in double-quotes (`"`).
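
As a minimal sketch of the two formatting notes above, the snippet below parses a small in-memory sample rather than the real files. The two rows mimic the `movies.csv` layout; the `movieId` given to the *Misérables* row is an illustrative assumption, not a value confirmed from the dataset.

```python
import io

import pandas as pd

# Two rows in the movies.csv layout. The second title contains a comma,
# so it is wrapped in double quotes, and it also has an accented character.
# The movieId 73 is illustrative only.
raw_bytes = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    '73,"Misérables, Les (1995)",Drama|War\n'
).encode("utf-8")

# Passing encoding explicitly documents the assumption, even though
# UTF-8 is already the pandas default.
df_sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")
print(df_sample)
```

The quoted title is read as a single field, so the frame still has three columns.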

### 2. User IDs
- User IDs are anonymized and consistent across `ratings.csv` and `tags.csv`.
- Each ID uniquely identifies a user, ensuring user consistency between the rating and tagging data.

### 3. Movie IDs
- Only movies with at least one rating or tag are included in the dataset.
- Movie IDs are consistent across all files (`ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv`).
- These IDs correspond to the same movies used on the [MovieLens website](https://movielens.org).

---

## File Structures

### 1. `ratings.csv`
This file contains explicit user ratings for movies on a **5-star scale**. The data is structured as:

| Column   | Description                                   |
|----------|-----------------------------------------------|
| `userId` | Anonymized ID representing each user          |
| `movieId`| ID representing each movie                    |
| `rating` | User rating for the movie (0.5 to 5.0 stars, in half-star increments) |
| `timestamp` | UNIX timestamp when the rating was made     |

Ratings are sorted first by `userId`, then by `movieId`.
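
The `timestamp` column is easier to read once converted to a datetime. The sketch below uses two timestamp values taken from the `df_ratings.head()` output earlier in this notebook; the column name `rated_at` is a hypothetical choice.

```python
import pandas as pd

# Two UNIX timestamps (seconds since 1970-01-01 UTC) copied from df_ratings.head().
toy = pd.DataFrame({"timestamp": [964982703, 964981247]})

# unit="s" tells pandas the integers are seconds, not nanoseconds.
toy["rated_at"] = pd.to_datetime(toy["timestamp"], unit="s", utc=True)
print(toy)
```

The same call should apply unchanged to the `timestamp` column of `tags.csv`.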

### 2. `tags.csv`
Tags represent user-generated metadata (e.g., short descriptions or labels). The structure is:

| Column   | Description                                    |
|----------|------------------------------------------------|
| `userId` | Anonymized ID representing each user           |
| `movieId`| ID representing each movie                     |
| `tag`    | User-assigned tag for the movie                |
| `timestamp` | UNIX timestamp when the tag was added       |

Like ratings, tags are sorted by `userId` and then by `movieId`.

### 3. `movies.csv`
This file includes movie titles and their associated genres. The data is structured as follows:

| Column    | Description                                            |
|-----------|--------------------------------------------------------|
| `movieId` | ID representing each movie                             |
| `title`   | Movie title, including the year of release (e.g., *Toy Story (1995)*) |
| `genres`  | Pipe-separated list of genres (e.g., *Animation\|Children\|Comedy*)   |

Errors or inconsistencies may exist in movie titles due to manual entry.
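
Because the release year is embedded in `title` and the genres are packed into one string, both usually need to be unpacked before use. The following is a rough sketch on two rows copied from the `df_movies.head()` output; the new column names `year` and `genre_list` are hypothetical.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})

# Capture a four-digit year in parentheses at the end of the title.
# Titles without such a suffix become missing values instead of raising.
year_text = toy_movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
toy_movies["year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")

# Turn the pipe-separated string into a Python list of genre names.
toy_movies["genre_list"] = toy_movies["genres"].str.split("|")
print(toy_movies[["title", "year", "genre_list"]])
```

Using the nullable `Int64` dtype keeps the years as integers even if some titles turn out to have no year.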

### 4. `links.csv`
Contains identifiers linking MovieLens movies to external databases (IMDb and TMDb). The structure is:

| Column    | Description                                            |
|-----------|--------------------------------------------------------|
| `movieId` | ID representing each movie in the MovieLens dataset    |
| `imdbId`  | Corresponding movie ID from IMDb                       |
| `tmdbId`  | Corresponding movie ID from The Movie Database (TMDb)  |

### Available Genres:
Movies are categorized into the following genres (separated by pipes `|` in the dataset):

- Action
- Adventure
- Animation
- Children
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
- (no genres listed)
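
For modeling, the pipe-separated genre strings are commonly expanded into one indicator column per genre. This is a small sketch of that idea using `Series.str.get_dummies`, with genre strings and movie IDs copied from the `df_movies.head()` output; it is not necessarily the encoding used later in this project.

```python
import pandas as pd

# Index is movieId, values are the raw genres strings.
genres = pd.Series(
    ["Adventure|Children|Fantasy", "Comedy|Romance", "Comedy"],
    index=pd.Index([2, 3, 5], name="movieId"),
    name="genres",
)

# One 0/1 column per distinct genre; columns come out in sorted order.
genre_flags = genres.str.get_dummies(sep="|")
print(genre_flags)
```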

---

#   B.  OBJECTIVES

## 1. Build a Collaborative Filtering Model
-  Implement and test a collaborative filtering model using the user-item interaction data from `ratings.csv` to provide top 5 movie recommendations for users.
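
To make the objective concrete, here is a deliberately small item-based sketch, assuming cosine similarity between movie columns of a user-item matrix. The ratings are made-up toy values and the helper `top_n_for_user` is hypothetical; the model actually built in this project may differ.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same shape as ratings.csv (values are illustrative only).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20, 30],
    "rating":  [5.0, 4.0, 4.0, 5.0, 5.0, 4.5, 2.0],
})

# Rows are users, columns are movies; unrated cells are filled with 0.
user_item = toy_ratings.pivot_table(
    index="userId", columns="movieId", values="rating"
).fillna(0.0)


def top_n_for_user(user_id, n=5):
    """Return up to n movieIds the user has not rated, best score first."""
    matrix = user_item.to_numpy()
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty columns
    similarity = (matrix.T @ matrix) / np.outer(norms, norms)

    user_vector = user_item.loc[user_id].to_numpy()
    scores = similarity @ user_vector   # similarity-weighted sum of the user's ratings
    scores[user_vector > 0] = -np.inf   # never recommend already-rated movies

    order = np.argsort(scores)[::-1]
    picks = [int(user_item.columns[i]) for i in order if np.isfinite(scores[i])]
    return picks[:n]


print(top_n_for_user(1))
```

With `n=5` this matches the "top 5" wording of the objective, though on the toy data each user has at most one unseen movie.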

## 2. Implement a Hybrid Model (if applicable)
-  Mitigate the cold start problem by incorporating content-based filtering that utilizes movie genres and user-generated tags.

## 3. Evaluate Model Performance
-  Measure the model’s predictive accuracy using metrics such as RMSE (Root Mean Square Error) and MAE (Mean Absolute Error), along with other relevant ranking-based metrics.
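
For reference, both error metrics can be written in a few lines of NumPy. The function names `rmse` and `mae` are hypothetical helpers, and the input values below are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


actual = [4.0, 3.0]
predicted = [3.0, 5.0]
print(rmse(actual, predicted), mae(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, which is why the two are often reported together.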

## 4. Optimize Model for Better Performance
-  Tune the model's hyperparameters to enhance the quality of recommendations and overall user satisfaction.

---


# C.  DATA CLEANING AND PREPROCESSING

In this section, we take the following steps to prepare our datasets for modeling:

1. **Examine for Missing Values:** 
   - Identifying any missing or null values in the datasets is essential as they can significantly affect the performance of our recommendation system. We will analyze each dataset to determine the extent and impact of any missing data.



2. **Initial Data Overview:**
   - We will familiarize ourselves with the structure and contents of the data. This step will provide insight into the types of variables we are dealing with.

3. **Statistical Summary:**
   - Generating descriptive statistics for the numerical variables will allow us to assess their central tendencies and variabilities. This information is crucial for understanding the distribution of ratings and detecting potential outliers.

4. **Data Visualization:**
   - Visualizing data distributions through plots (e.g., histograms) will enhance our understanding of how ratings are spread across different values. Visualization is a powerful tool for uncovering hidden insights and patterns in the data.

After these preliminary analyses, we will document any findings that necessitate further cleaning, such as filling in missing values, removing duplicates, or handling outliers.
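
The checks listed above can be bundled into one small helper so every file is inspected the same way. This is a sketch under assumptions: `quality_report` is a hypothetical function name, and the toy frame deliberately contains one missing rating and one duplicated row.

```python
import numpy as np
import pandas as pd


def quality_report(df):
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Toy data with one NaN and one fully duplicated row (illustrative only).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 10, 20, 30],
    "rating": [4.0, 4.0, np.nan, 3.5],
})

report = quality_report(toy)
n_duplicate_rows = int(toy.duplicated().sum())
print(report)
print("duplicate rows:", n_duplicate_rows)
```

The same helper could be called on `df_ratings`, `df_movies`, and `df_tags` in turn.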




## Cleaning: `ratings.csv`

In [9]:
#checking for missing values...
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
df_ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [11]:
df_ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

- No missing values were found, so this file needs no cleaning. Next...
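
Missing values are only one kind of problem. Two further checks that may be worth running on ratings data are repeated `(userId, movieId)` pairs and ratings outside the expected scale. The sketch below assumes the scale is 0.5 to 5.0 in half-star steps and uses made-up toy rows that contain one problem of each kind.

```python
import numpy as np
import pandas as pd

# Toy rows: user 2 rated movie 1 twice, and 5.5 is off the scale.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2],
    "movieId": [1, 3, 1, 1],
    "rating":  [4.0, 4.5, 5.5, 3.0],
})

# Allowed values: 0.5, 1.0, ..., 5.0 (assumed half-star scale).
valid_values = np.arange(0.5, 5.5, 0.5)

# keep=False marks every row of a repeated pair, not just the later copies.
repeat_pairs = toy_ratings[
    toy_ratings.duplicated(subset=["userId", "movieId"], keep=False)
]
off_scale = toy_ratings[~toy_ratings["rating"].isin(valid_values)]

print(repeat_pairs)
print(off_scale)
```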

## Cleaning: `movies.csv`

In [12]:
##checking for missing values...
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [13]:
df_movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

- No missing values were found, so this file needs no cleaning. Next...

## Cleaning: `tags.csv`

In [14]:
##checking for missing values...
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [15]:
df_tags.isna().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

- None of the three files contains missing values, so the data appears ready to work with. We do not check `links.csv`, since it only maps movies to external databases.

---
# D.  EXPLORATORY DATA ANALYSIS (EDA)  


In this phase, we will explore and clean the data before preparing it for modeling. Our primary datasets, `ratings.csv` and `movies.csv`, contain essential information for building a recommendation system:

- **ratings.csv** provides user ratings for various movies, containing columns such as `userId`, `movieId`, `rating`, and `timestamp`.
- **movies.csv** contains details about movies, such as `movieId`, `title`, and `genres`.

To create a more informative dataset, we will **merge the `ratings.csv` and `movies.csv` files** using the `movieId` column, which is common to both datasets. This process allows us to combine user ratings with movie metadata, specifically the genres associated with each movie.

## Why Merge These DataFrames?

Merging the two DataFrames gives us a more comprehensive dataset that includes:
- **User-specific ratings** (from `ratings.csv`)
- **Movie information** like titles and genres (from `movies.csv`)

By bringing in the `genres` column, we enable additional insights that could be valuable in recommendation strategies, particularly when extending collaborative filtering approaches with content-based filtering. This is especially useful for:
1. **Cold start problems**, where new users who have not provided ratings can receive recommendations based on their preferred genres.
2. **Enhancing collaborative filtering** by incorporating genre preferences to fine-tune recommendations.

The resulting dataset, which includes `userId`, `movieId`, `rating`, and `genres`, will be key in building both collaborative and hybrid recommendation models.
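
As a hedged variant of the merge described here, the sketch below keeps ratings on the left and asks pandas to verify that each `movieId` appears at most once in the movies table. The toy frames reuse a few values from the outputs above, plus one made-up rating for a non-existent `movieId` (999) to show how unmatched rows surface.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 5, 7],
    "movieId": [1, 3, 1, 999],   # 999 is a made-up ID with no movie record
    "rating":  [4.0, 4.0, 4.0, 3.0],
})

merged = pd.merge(
    toy_ratings,
    toy_movies,
    on="movieId",
    how="left",               # keep every rating, matched or not
    validate="many_to_one",   # raise if movieId is duplicated in toy_movies
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```

A left join with `indicator=True` makes lost or unmatched ratings visible, whereas a default inner join would drop them silently.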


## Feature Engineering
### Merging the Two Datasets


In [17]:
# Merge the movies and ratings DataFrames on their shared movieId column
df_movies_ratings = pd.merge(df_movies, df_ratings, on='movieId')
df_movies_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [22]:
# Check that the merge kept every rating: the merged DataFrame
# should have the same number of rows as df_ratings
assert len(df_ratings) == len(df_movies_ratings)

In [19]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [20]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


### Drop Unnecessary Columns

-   Drop `timestamp` and `title`, which are not needed for this analysis.


In [23]:
df_movies_ratings.columns

Index(['movieId', 'title', 'genres', 'userId', 'rating', 'timestamp'], dtype='object')

In [24]:
df_movies_ratings = df_movies_ratings.drop(columns=['title', 'timestamp'])

In [25]:
df_movies_ratings.columns

Index(['movieId', 'genres', 'userId', 'rating'], dtype='object')

In [26]:
df_movies_ratings.head()

Unnamed: 0,movieId,genres,userId,rating
0,1,Adventure|Animation|Children|Comedy|Fantasy,1,4.0
1,1,Adventure|Animation|Children|Comedy|Fantasy,5,4.0
2,1,Adventure|Animation|Children|Comedy|Fantasy,7,4.5
3,1,Adventure|Animation|Children|Comedy|Fantasy,15,2.5
4,1,Adventure|Animation|Children|Comedy|Fantasy,17,4.5


In [28]:
df_movies_ratings = df_movies_ratings[['movieId','userId', 'genres','rating']]
df_movies_ratings.head()

Unnamed: 0,movieId,userId,genres,rating
0,1,1,Adventure|Animation|Children|Comedy|Fantasy,4.0
1,1,5,Adventure|Animation|Children|Comedy|Fantasy,4.0
2,1,7,Adventure|Animation|Children|Comedy|Fantasy,4.5
3,1,15,Adventure|Animation|Children|Comedy|Fantasy,2.5
4,1,17,Adventure|Animation|Children|Comedy|Fantasy,4.5
