# MovieLens Recommendation System


# Introduction

## Data Inspection

- Since we do not yet know what the dataset contains, we load the files and inspect them briefly to get an idea of what we are dealing with.

- Before setting objectives, we first inspect the dataset to look for possible connections between the files.


In [1]:
# Imports
import pandas as pd

In [4]:
# Read the four MovieLens CSV files (paths are relative to the notebook)
df_links = pd.read_csv("../ml-latest-small/links.csv")
df_movies = pd.read_csv("../ml-latest-small/movies.csv")
df_ratings = pd.read_csv("../ml-latest-small/ratings.csv")
df_tags = pd.read_csv("../ml-latest-small/tags.csv")

## Dataset Inspection
- We inspect each file to decide which features to use.

In [7]:
df_links.head()


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [6]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [11]:
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [None]:
df_tags[df_tags['movieId']==1]

In [15]:
list(df_tags['tag'].sample(n=20))

['David Fincher',
 'lies',
 'suspense',
 'visually appealing',
 'Disney',
 'big budget',
 'samurai',
 'E.M. Forster',
 'Comedy',
 'helena bonham carter',
 'In Netflix queue',
 'jon hamm',
 'Mrs. DeWinter',
 'mobster',
 'will ferrell',
 'understated',
 'adventure',
 'Tom Hanks',
 'court',
 'FBI']

- We now have a rough idea of what the files contain. Let us formally start the project.

# MOVIELENS RECOMMENDATION SYSTEM

# A. INTRODUCTION: DATASET DESCRIPTION

## MovieLens Dataset: Content and File Structure

The MovieLens dataset is provided as comma-separated values (CSV) files, formatted with a single header row. Below is a detailed breakdown of the structure and content of each file, along with formatting and encoding notes.

### 1. Formatting and Encoding
- Files are encoded in **UTF-8**. Ensure your text editor or analysis script is configured to handle UTF-8, especially for accented characters (e.g., *Misérables, Les (1995)*).
- Files use the **comma-separated** values (CSV) format. Fields that contain commas (`,`) are enclosed in double quotes (`"`).
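
As a minimal sketch of these two points, the snippet below parses a small in-memory sample instead of the real file. The three rows are hand-written for illustration, and `sample_movies` is a name introduced here, not part of the notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for movies.csv (illustrative rows, not read from disk).
raw = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
    '73,"Misérables, Les (1995)",Drama|War\n'
).encode("utf-8")

# encoding="utf-8" is already the pandas default; it is spelled out for clarity.
sample_movies = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# Quoted titles keep their internal comma and accented characters survive.
print(sample_movies["title"].tolist())
```

If the accented title prints as garbled characters, the file was most likely decoded with a different encoding.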

### 2. User IDs
- User IDs are anonymized and consistent across `ratings.csv` and `tags.csv`.
- Each ID uniquely identifies a user, ensuring user consistency between the rating and tagging data.

### 3. Movie IDs
- Only movies with at least one rating or tag are included in the dataset.
- Movie IDs are consistent across all files (`ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv`).
- These IDs correspond to the same movies used on the [MovieLens website](https://movielens.org).
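
The ID consistency described above can be checked rather than assumed. The following is a rough sketch on toy frames; `find_orphans` is a hypothetical helper, and the values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Toy frames standing in for movies.csv and ratings.csv (illustrative values).
movies = pd.DataFrame({"movieId": [1, 2, 3]})
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 3, 3]})


def find_orphans(child, parent, key):
    """Return the key values present in `child` but missing from `parent`."""
    return set(child[key]) - set(parent[key])


# An empty set means every rated movie has a row in the movies table.
orphans = find_orphans(ratings, movies, "movieId")
print(orphans)
```

The same helper could be pointed at `df_tags` and `df_movies`, or used with `"userId"` to compare the users in `df_tags` and `df_ratings`.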

---

## File Structures

### 1. `ratings.csv`
This file contains explicit user ratings for movies on a **5-star scale**. The data is structured as:

| Column      | Description                                                            |
|-------------|------------------------------------------------------------------------|
| `userId`    | Anonymized ID representing each user                                   |
| `movieId`   | ID representing each movie                                             |
| `rating`    | User rating for the movie (0.5 to 5.0 stars, in half-star increments)  |
| `timestamp` | UNIX timestamp (seconds since the epoch) when the rating was made      |

Ratings are sorted first by `userId`, then by `movieId`.
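
The raw `timestamp` values are hard to read. A minimal sketch of converting them, assuming they are seconds since the UNIX epoch in UTC, is shown below on the first rows of the `df_ratings.head()` output; `ratings_head` and `rated_at` are names introduced here.

```python
import pandas as pd

# First three rows from the df_ratings.head() output shown earlier.
ratings_head = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 3, 6],
    "rating": [4.0, 4.0, 4.0],
    "timestamp": [964982703, 964981247, 964982224],
})

# unit="s" interprets the integers as seconds since 1970-01-01 00:00 UTC.
ratings_head["rated_at"] = pd.to_datetime(
    ratings_head["timestamp"], unit="s", utc=True
)
print(ratings_head[["movieId", "rating", "rated_at"]])
```

A datetime column makes it straightforward to group ratings by year or month later on.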

### 2. `tags.csv`
Tags represent user-generated metadata (e.g., short descriptions or labels). The structure is:

| Column   | Description                                    |
|----------|------------------------------------------------|
| `userId` | Anonymized ID representing each user           |
| `movieId`| ID representing each movie                     |
| `tag`    | User-assigned tag for the movie                |
| `timestamp` | UNIX timestamp when the tag was added       |

Like ratings, tags are sorted by `userId` and then by `movieId`.

### 3. `movies.csv`
This file includes movie titles and their associated genres. The data is structured as follows:

| Column    | Description                                            |
|-----------|--------------------------------------------------------|
| `movieId` | ID representing each movie                             |
| `title`   | Movie title, including the year of release (e.g., *Toy Story (1995)*) |
| `genres`  | Pipe-separated list of genres (e.g., *Adventure\|Children\|Fantasy*)   |

Errors or inconsistencies may exist in movie titles due to manual entry.
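
Both columns pack more than one value into a string, so they usually need to be split before modeling. The sketch below is one possible way to do that, using the first rows of the `df_movies.head()` output; the column names `year` and `genre_list` are assumptions introduced here.

```python
import pandas as pd

# First three rows from the df_movies.head() output shown earlier.
movies_head = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Year: four digits in parentheses at the end of the title; NaN when absent,
# which can happen because titles are entered manually.
movies_head["year"] = (
    movies_head["title"]
    .str.extract(r"\((\d{4})\)\s*$", expand=False)
    .astype("float")
)

# Genres: a Python list per movie, plus a one-hot table with one column per genre.
movies_head["genre_list"] = movies_head["genres"].str.split("|")
genre_dummies = movies_head["genres"].str.get_dummies(sep="|")
print(genre_dummies)
```

Keeping `year` as a float lets titles without a parsable year remain as missing values instead of raising an error.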

### 4. `links.csv`
This file contains identifiers that link MovieLens movies to external databases (IMDb and TMDb). The structure is:

| Column    | Description                                            |
|-----------|--------------------------------------------------------|
| `movieId` | ID representing each movie in the MovieLens dataset    |
| `imdbId`  | Corresponding movie ID from IMDb                       |
| `tmdbId`  | Corresponding movie ID from The Movie Database (TMDb)  |
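
The `head()` output earlier shows `tmdbId` as a float (for example `862.0`), which typically happens when a column contains missing values. The sketch below tidies the IDs and builds an IMDb link. It assumes the common IMDb URL pattern of `tt` followed by the ID zero-padded to seven digits; `links_head` and `imdb_url` are names introduced here.

```python
import pandas as pd

# First two rows from the df_links.head() output shown earlier.
links_head = pd.DataFrame({
    "movieId": [1, 2],
    "imdbId": [114709, 113497],
    "tmdbId": [862.0, 8844.0],
})

# Assumed URL pattern: "tt" plus the imdbId left-padded with zeros to 7 digits.
links_head["imdb_url"] = (
    "https://www.imdb.com/title/tt"
    + links_head["imdbId"].astype(str).str.zfill(7)
    + "/"
)

# The nullable integer dtype keeps missing TMDb IDs as <NA> instead of forcing floats.
links_head["tmdbId"] = links_head["tmdbId"].astype("Int64")
print(links_head)
```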

### Available Genres:
Movies are categorized into the following genres (separated by pipes `|` in the dataset):

- Action
- Adventure
- Animation
- Children
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
- (no genres listed)

---

# B. OBJECTIVES

## 1. Build a Collaborative Filtering Model
- Implement and test a collaborative filtering model that uses the user-item interaction data from `ratings.csv` to provide the top five movie recommendations for each user.
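
To make the idea concrete, here is a rough sketch of one simple variant, item-based collaborative filtering with cosine similarity. It is not necessarily the model this project ends up using. The toy ratings are hand-made, `recommend` is a hypothetical helper, and treating unrated cells as 0 is a deliberate simplification.

```python
import numpy as np
import pandas as pd

# Hand-made toy ratings (not taken from ratings.csv).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 20, 30, 40],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 2.0, 5.0, 4.0],
})

# User-item matrix; unrated cells become 0 for the similarity computation.
matrix = toy.pivot_table(index="userId", columns="movieId", values="rating").fillna(0.0)

# Cosine similarity between item columns, with self-similarity zeroed out.
items = matrix.to_numpy().T
norms = np.linalg.norm(items, axis=1, keepdims=True)
sim = (items @ items.T) / (norms @ norms.T)
np.fill_diagonal(sim, 0.0)


def recommend(user_id, n=5):
    """Return up to n unseen movieIds ranked by similarity-weighted rating."""
    user = matrix.loc[user_id].to_numpy()
    scores = sim @ user
    scores[user > 0] = -np.inf  # never recommend movies the user already rated
    order = np.argsort(scores)[::-1]
    ranked = [int(matrix.columns[i]) for i in order if np.isfinite(scores[i])]
    return ranked[:n]


print(recommend(2))
```

With so few movies the list is shorter than five; on the full dataset the same function would return the five highest-scoring unseen titles.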

## 2. Implement a Hybrid Model (if applicable)
-  Mitigate the cold start problem by incorporating content-based filtering that utilizes movie genres and user-generated tags.
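
A content-based component can score movies that have no ratings yet, because it relies only on metadata. The sketch below uses Jaccard similarity over genre sets as one illustrative choice, with the five movies from the `df_movies.head()` output; `jaccard` and `most_similar` are hypothetical helpers.

```python
# Genres of the five movies shown in the df_movies.head() output.
genres = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
    "Waiting to Exhale (1995)": "Comedy|Drama|Romance",
    "Father of the Bride Part II (1995)": "Comedy",
}
genre_sets = {title: set(g.split("|")) for title, g in genres.items()}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(title, n=3):
    """Return the n movies whose genre sets overlap most with `title`."""
    target = genre_sets[title]
    scored = [
        (jaccard(target, other_set), other)
        for other, other_set in genre_sets.items()
        if other != title
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, round(score, 2)) for score, other in scored[:n]]


print(most_similar("Jumanji (1995)", n=1))
```

User-generated tags could be added to each movie's set in the same way, after lowercasing them so that variants such as `Comedy` and `comedy` match.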

## 3. Evaluate Model Performance
-  Measure the model’s predictive accuracy using metrics such as RMSE (Root Mean Square Error) and MAE (Mean Absolute Error), along with other relevant ranking-based metrics.
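
Both error metrics compare predicted ratings with held-out true ratings. A minimal pure-Python sketch follows; the function names and the sample numbers are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of the error in stars."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Illustrative true ratings and model predictions for three user-movie pairs.
true_ratings = [4.0, 3.0, 5.0]
predictions = [3.5, 3.0, 4.0]

print(f"RMSE: {rmse(true_ratings, predictions):.3f}")
print(f"MAE:  {mae(true_ratings, predictions):.3f}")
```

RMSE is never smaller than MAE on the same data, and a wide gap between the two suggests that a few predictions are far off.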

## 4. Optimize Model for Better Performance
-  Tune the model's hyperparameters to enhance the quality of recommendations and overall user satisfaction.

---


# C. DATA CLEANING AND PREPROCESSING

In this section, we prepare the datasets for modeling through the following steps:

1. **Examine for Missing Values:** 
   - Identifying any missing or null values in the datasets is essential as they can significantly affect the performance of our recommendation system. We will analyze each dataset to determine the extent and impact of any missing data.

2. **Exploratory Data Analysis (EDA):** 
   - Conducting EDA will help us understand the characteristics of our data, including the distribution of ratings, the types of movies available, and any patterns or trends that may be present. This analysis will guide us in making informed decisions about data cleaning and feature engineering.

3. **Initial Data Overview:**
   - We will familiarize ourselves with the structure and contents of the data. This step will provide insight into the types of variables we are dealing with.

4. **Statistical Summary:**
   - Generating descriptive statistics for the numerical variables will allow us to assess their central tendencies and variabilities. This information is crucial for understanding the distribution of ratings and detecting potential outliers.

5. **Data Visualization:**
   - Visualizing data distributions through plots (e.g., histograms) will enhance our understanding of how ratings are spread across different values. Visualization is a powerful tool for uncovering hidden insights and patterns in the data.

After these preliminary analyses, we will document any findings that necessitate further cleaning, such as filling in missing values, removing duplicates, or handling outliers.
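
The checks listed above can be sketched with a few pandas calls. The frame below is a hand-made toy with one missing rating and one repeated user-movie pair, so that each check has something to report; on the real data the same calls would run against `df_ratings`.

```python
import numpy as np
import pandas as pd

# Toy ratings with one missing value and one repeated (userId, movieId) pair.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "movieId": [1, 3, 1, 1, 6],
    "rating": [4.0, np.nan, 5.0, 5.0, 3.5],
})

# Missing values per column.
missing_per_column = toy_ratings.isna().sum()

# A user should rate a given movie only once, so repeated pairs count as duplicates.
duplicate_rows = toy_ratings.duplicated(subset=["userId", "movieId"]).sum()

# Descriptive statistics and the rating distribution (NaN is ignored by both).
summary = toy_ratings["rating"].describe()
rating_counts = toy_ratings["rating"].value_counts().sort_index()

print(missing_per_column, duplicate_rows, summary, rating_counts, sep="\n\n")
```

`rating_counts` is also the natural input for a bar chart or histogram of the rating distribution.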

---
