*Reva Bharara*

*Email : revabharara@gmail.com*

*Linkedin : https://www.linkedin.com/in/reva-bharara-a83a78241/*


### Objective: To build a movie recommendation system based on collaborative filtering using K-means clustering.

### *Index:*
1. Importing the dependencies
2. Importing relevant datasets
3. Data exploration
4. Data preprocessing
5. Data analysis
6. Collaborative filtering recommendation system (K-Means)
7. Model evaluation
8. Conclusion
9. Credits

### --------------------------------------------------------------------------------------------------------
### *1. Importing the dependencies*

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report

### --------------------------------------------------------------------------------------------------------
### *2. Importing relevant datasets*

In [2]:
# importing the movies dataset
df_movies=pd.read_csv('movie_names.csv')

# importing the rating dataset
df_rating=pd.read_csv('rating.csv')
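
The rating file is large (about 20 million rows and roughly 610 MB once loaded, as the exploration step reports), so it can help to read only the columns that are needed and to use narrower dtypes. The following is a minimal sketch under assumptions, not the loading code used in this notebook: `sample_csv` is a tiny in-memory stand-in for `rating.csv`, and the choice of `int32`/`float32` is an illustrative guess that should be checked against the real ID ranges.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for rating.csv (hypothetical sample, same column names).
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,2,3.5,2005-04-02 23:53:47\n"
    "1,29,3.5,2005-04-02 23:31:16\n"
    "5,2,3.0,1996-12-25 15:26:09\n"
)

# Read only the columns needed for collaborative filtering and
# request narrower dtypes to reduce memory use (assumed to be wide enough).
ratings_small = pd.read_csv(
    sample_csv,
    usecols=["userId", "movieId", "rating"],
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

print(ratings_small.dtypes)
print(ratings_small.memory_usage(deep=True))
```

With the real file, `sample_csv` would be replaced by the path `'rating.csv'`. Skipping `timestamp` at load time only makes sense if the column is not needed later.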

### --------------------------------------------------------------------------------------------------------
### *3. Exploring the data*

In [3]:
df_movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
df_rating.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | 2005-04-02 23:53:47 |
| 1 | 1 | 29 | 3.5 | 2005-04-02 23:31:16 |
| 2 | 1 | 32 | 3.5 | 2005-04-02 23:33:39 |
| 3 | 1 | 47 | 3.5 | 2005-04-02 23:32:07 |
| 4 | 1 | 50 | 3.5 | 2005-04-02 23:29:40 |


In [5]:
# let us look at the columns present in the datasets and their general features
print(df_movies.columns)
print(df_rating.columns)


Index(['movieId', 'title', 'genres'], dtype='object')
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')


In [6]:
print(f'shape of movies dataset: {df_movies.shape}')
print(f'shape of rating dataset: {df_rating.shape}')

shape of movies dataset: (27278, 3)
shape of rating dataset: (20000263, 4)


In [7]:
print('info about the movies dataset:')
df_movies.info()  # info() prints its report itself and returns None
print()
print('info about the rating dataset:')
df_rating.info()

info about the movies dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB

info about the rating dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 610.4+ MB


In [8]:
df_rating.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

### --------------------------------------------------------------------------------------------------------
### *4. Data preprocessing*

#### Neither dataframe contains null values, so no missing-value handling is needed.

#### We now need to add the *'title'* column from df_movies to df_rating by merging on movieId. The genres column is not used, so we drop it first.


In [9]:
df_movies.drop(['genres'], axis=1, inplace=True)
df_movies.head()

| | movieId | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | Jumanji (1995) |
| 2 | 3 | Grumpier Old Men (1995) |
| 3 | 4 | Waiting to Exhale (1995) |
| 4 | 5 | Father of the Bride Part II (1995) |


In [10]:
df_rating=df_rating.merge(df_movies, on='movieId')
df_rating.head()

| | userId | movieId | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | 2005-04-02 23:53:47 | Jumanji (1995) |
| 1 | 5 | 2 | 3.0 | 1996-12-25 15:26:09 | Jumanji (1995) |
| 2 | 13 | 2 | 3.0 | 1996-11-27 08:19:02 | Jumanji (1995) |
| 3 | 29 | 2 | 3.0 | 1996-06-23 20:36:14 | Jumanji (1995) |
| 4 | 34 | 2 | 3.0 | 1996-10-28 13:29:44 | Jumanji (1995) |
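
`merge` performs an inner join by default, so any rating whose movieId is missing from df_movies would be dropped silently. The sketch below shows one way to check for that on toy data. The frames `movies` and `ratings` and the unmatched movieId 99 are hypothetical, and the checks are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frames with the same column names as the notebook's data.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
ratings = pd.DataFrame({
    "userId": [1, 5, 13, 7],
    "movieId": [2, 2, 2, 99],   # 99 has no entry in `movies`
    "rating": [3.5, 3.0, 3.0, 4.0],
})

# A left join keeps every rating; validate raises if movieId is duplicated in
# `movies`; indicator adds a `_merge` column showing where each row came from.
merged = ratings.merge(
    movies, on="movieId", how="left", validate="many_to_one", indicator=True
)

# Ratings that found no matching title.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["userId", "movieId", "rating"]])
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.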


#### The timestamp column is not needed for the recommendations, so we drop it.

In [13]:
df_rating.drop(['timestamp'],axis=1, inplace=True)
df_rating.head()

| | userId | movieId | rating | title |
|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | Jumanji (1995) |
| 1 | 5 | 2 | 3.0 | Jumanji (1995) |
| 2 | 13 | 2 | 3.0 | Jumanji (1995) |
| 3 | 29 | 2 | 3.0 | Jumanji (1995) |
| 4 | 34 | 2 | 3.0 | Jumanji (1995) |


In [20]:
# number of distinct movie titles that have at least one rating
df_rating['title'].nunique()

26729

In [23]:
# checking if there are any null values still
df_rating.isnull().sum()

userId     0
movieId    0
rating     0
title      0
dtype: int64

## Collaborative filtering based recommendation system
A content-based recommendation engine has significant limitations. It can only suggest movies that are similar to a given movie, so it cannot recommend across genres. It also ignores the tastes and biases of individual users: everyone who queries it for a particular movie gets the same recommendations.

To overcome these limitations, we will use a technique called Collaborative Filtering, which uses the behavior and preferences of similar users to predict how much a user will like a particular movie that they have not yet watched. This approach leverages the idea that users who have similar preferences can be used to make accurate recommendations and tailor them to each user's individual tastes.
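
The idea of predicting a rating from similar users can be sketched in a few lines of NumPy. This is an illustrative user-based scheme under assumptions, not the model built in this notebook: the matrix `R` is hypothetical toy data, 0 stands for "not rated", and cosine similarity over co-rated movies is just one possible similarity measure.

```python
import numpy as np

# Rows are users, columns are movies; 0 means "not rated" (hypothetical toy data).
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # target user, has not rated movie 2
    [5.0, 5.0, 4.0, 1.0],   # user with similar taste
    [1.0, 2.0, 1.0, 5.0],   # user with different taste
])

def cosine_similarity(a, b):
    """Cosine similarity computed only over movies both users have rated."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    a_c, b_c = a[mask], b[mask]
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

def predict_rating(R, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    sims, vals = [], []
    for other in range(R.shape[0]):
        if other == user or R[other, movie] == 0:
            continue
        sims.append(cosine_similarity(R[user], R[other]))
        vals.append(R[other, movie])
    if not sims or np.sum(sims) == 0:
        return float("nan")
    return float(np.dot(sims, vals) / np.sum(sims))

pred = predict_rating(R, user=0, movie=2)
print(f"predicted rating: {pred:.2f}")
```

Because the second user is more similar to the target user than the third, the prediction is pulled toward the second user's rating rather than sitting at the plain average of the two.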

In this collaborative filtering system, we will create a user-movie rating matrix that records how each user rated each movie. We will then use the nearest neighbor model to form user clusters and recommend movies based on those clusters.
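
The matrix-then-cluster pipeline described above can be sketched on toy data. This sketch makes several assumptions: it clusters with scikit-learn's `KMeans` (the technique named in the objective) rather than a nearest neighbor model, the frame `toy` is hypothetical, and filling unrated movies with 0 is a simplification. On the full dataset a dense matrix of this kind may not fit in memory.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy ratings with the same columns as df_rating.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 4, 4],
    "title": ["Toy Story (1995)", "Jumanji (1995)"] * 4,
    "rating": [5.0, 1.0, 4.5, 1.5, 1.0, 5.0, 1.5, 4.5],
})

# User-movie rating matrix: one row per user, one column per title.
# Unrated movies are filled with 0 (a simplifying assumption).
user_movie = toy.pivot_table(index="userId", columns="title", values="rating").fillna(0)

# Group users with similar rating patterns (n_clusters=2 chosen for the toy data).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(user_movie)

# Average rating of each movie within each cluster, and each cluster's top title.
cluster_means = user_movie.groupby(labels).mean()
top_title = cluster_means.idxmax(axis=1)

print(user_movie.assign(cluster=labels))
print(top_title)
```

A user would then be offered the highest-rated titles of their own cluster that they have not yet rated.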