> ## **Movie Recommender System**

>> ## **Team Members:** 
>> ### 1. Srinivasan Govindarajan
>> ### 2. Vijayalakshmi Ramesh

**Dataset:**

https://grouplens.org/datasets/movielens/latest/

## **Executive Summary**

> **Project Goal:**

The goal of the project is to develop a recommendation system that suggests the top 5 movies to users based on collaborative filtering techniques: Item-based Collaborative filtering and User-based Collaborative filtering.

> **Methodologies:**

>> **Data Cleaning:**

Merged 3 datasets (movies.csv, ratings.csv and links.csv) and dropped the columns not needed for the recommender. The resulting dataset has 100836 rows and 4 columns.

>> **Data Preparation:**
* Prepared the rating data so that rows are user IDs, columns are movies, and the values are the corresponding ratings
* Prepared the movie data with two columns: movie title and ID

>> **Helper functions:**
All helper functions are defined here: standEst(), svdEst(), pearsSim(), distCosine(), cross_validate_user(), print_most_similar_movies_cosine(), get_similar_movies_cosine(), print_most_similar_movies_pearson(), get_similar_movies_pearson()

>> **Predicting missing rating:**
* Predicted missing rating using Standard Estimate Method
* Predicted missing rating using SVD Estimate Method

Since the Standard Estimate method took more than 2 hours to compute the missing ratings, movies with fewer than 100 ratings were removed, reducing 9719 movies to 138. With 138 movies, the Standard Estimate method took approximately 13 minutes to compute the missing ratings.

>> **Recommendation Engines:**

>>> **Item based Collaborative Filtering:**
>>>> Using predicted ratings using **Standard Estimate** Method:

>>>>> **Cosine Similarity** as similarity measure:
>>>>>> Using **manually calculated** function
>>>>>> Using **inbuilt** function

>>>>> **Pearson correlation** as Similarity measure:
>>>>>> Using **manually calculated** function
>>>>>> Using **inbuilt** function

>>>> Using predicted ratings using **SVD Estimate** Method:

>>>>> **Cosine Similarity** as similarity measure:
>>>>>> Using **manually calculated** function
>>>>>> Using **inbuilt** function

>>>>> **Pearson correlation** as Similarity measure:
>>>>>> Using **manually calculated** function
>>>>>> Using **inbuilt** function

Used 8 different combinations of approaches for item-based recommender system:

1. Standard Estimate Method | Cosine Similarity | Manual function
2. Standard Estimate Method | Cosine Similarity | Inbuilt function
3. Standard Estimate Method | Pearson Correlation | Manual function
4. Standard Estimate Method | Pearson Correlation | Inbuilt function
5. SVD Estimate Method | Cosine Similarity | Manual function
6. SVD Estimate Method | Cosine Similarity | Inbuilt function
7. SVD Estimate Method | Pearson Correlation | Manual function
8. SVD Estimate Method | Pearson Correlation | Inbuilt function

(1) and (5) have recommended exactly same set of movies.

(2), (3), (4), (6), (7) and (8) have recommended exactly same set of movies.

>>> **User based Collaborative filtering**

>> **Evaluation:**
Ran test function to calculate the MAE score for both Standard Estimate Method and SVD Estimate Method

>> **Conclusions:**

**I.**
The SVD Estimate method outperforms the Standard Estimate method in terms of computational efficiency. When predicting ratings for 138 movies by 610 users, the Standard Estimate method took approximately 14 minutes to complete, while the SVD Estimate method accomplished the same task in just 99 milliseconds. This significant difference in computational time highlights the superiority of the SVD Estimate method in terms of efficiency.

**II.**

The item-based recommendation engines were evaluated using four different combinations: 

1. Standard Estimate Method with Cosine Similarity
2. Standard Estimate Method with Pearson Correlation
3. SVD Estimate Method with Cosine Similarity
4. SVD Estimate Method with Pearson Correlation

For each combination, a selected movie (in this case, "Alien (1979)") was used as a query to recommend similar movies. The recommendations were obtained using manual calculation and inbuilt functions for similarity measurement.

In all four combinations, the recommended movies were listed based on their similarity to the selected movie. The recommendations were provided using both manual calculations and inbuilt functions for cosine similarity and Pearson correlation.

From the results, it can be observed that the recommended movies are consistent across all combinations. The top recommended movies are mostly the same, regardless of the similarity measurement method or the estimation method used. This indicates that the recommendation engines are providing reliable and consistent results.

The item-based recommendation engines showed effectiveness in identifying similar movies based on the selected movie. Whether using the Standard Estimate Method or the SVD Estimate Method, and whether employing Cosine Similarity or Pearson Correlation, the engines produced comparable recommendations.

Overall, these item-based recommendation engines demonstrate their capability to generate accurate and relevant movie recommendations. The choice between the Standard Estimate Method and the SVD Estimate Method, as well as the similarity measurement method (Cosine Similarity or Pearson Correlation), can be based on factors such as computational efficiency and specific requirements of the application.

**III.**

Based on the results obtained, we can conclude that both the `standEst` (Standard Estimate) and `svdEst` (SVD Estimate) methods perform reasonably well in terms of Mean Absolute Error (MAE) for the item-based recommendation system.

The MAE for the `standEst` method is 0.1794, while the MAE for the `svdEst` method is slightly higher at 0.1802. The lower the MAE value, the better the accuracy of the recommendation system. 

Comparing the computational times, the `standEst` method took approximately 5 minutes and 21 seconds, while the `svdEst` method took significantly longer at approximately 10 minutes and 18 seconds. The increased computational time of the `svdEst` method can be attributed to the more complex calculations involved in the SVD-based estimation.

Overall, both methods provide similar performance in terms of recommendation accuracy, with the `standEst` method having a slightly lower MAE. However, the trade-off is that the `svdEst` method requires significantly more computational time.

---


## **Contributions:**

> **Srinivasan Govindarajan**
- Performed data cleaning
- Performed data preparation
- Created helper functions
- Predicted missing rating using Standard Estimate method (without any dimensionality reduction technique)
- Predicted missing rating using Singular Value Decomposition (SVD) Estimate method
- Performed User-based Collaborative Filtering


> **Vijayalakshmi Ramesh**
- Performed Item-based Collaborative Filtering using 8 different combinations of approaches
- Performed evaluation on test data using both Standard Estimate Method and SVD Estimate Method
- Editing and Preparation of Final Report

---

## **Dataset Description:**

**Dataset:**

The MovieLens Data Set (ml-latest-small): https://grouplens.org/datasets/movielens/latest/

**Summary:**

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follow.

**Structure of each dataset:**

1. Movies Data File Structure (movies.csv):

9742 rows and 3 columns (movieId, title, genres)

Numerical: movieId (int)

Categorical: title, genres

Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,title,genres

Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
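Because all genres of a movie sit in one pipe-separated string, any per-genre analysis needs them split first. A minimal sketch, assuming a small hand-written frame in the movies.csv layout (the rows are illustrative and are not read from the real file):

```python
import pandas as pd

# Toy rows in the movies.csv layout (illustrative values, not loaded from disk)
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Movie B (2000)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Action|Crime|Thriller"],
})

# Split the pipe-separated string into a list, then emit one row per genre
genre_rows = (
    toy_movies.assign(genre=toy_movies["genres"].str.split("|"))
    .explode("genre")[["movieId", "genre"]]
    .reset_index(drop=True)
)
print(genre_rows)
```

The result has one row per (movie, genre) pair, which is the shape most group-by style summaries expect.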

2. Links Data File Structure (links.csv):

9742 rows and 3 columns (movieId, imdbId, tmdbId)

Numerical: movieId (int), imdbId (int), tmdbId (float)

Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,imdbId,tmdbId

movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.

Use of the resources listed above is subject to the terms of each provider.

3. Ratings Data File Structure (ratings.csv):

100836 rows and 4 columns (userId, movieId, rating, timestamp)

Numerical: userId (int), movieId (int), rating (float), timestamp (int)

All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
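Since the timestamps are plain epoch seconds, they can be turned into readable dates with pandas. The sketch below converts a single value, 964982703, which is one of the timestamps that appears in the ratings data; the variable names are illustrative.

```python
import pandas as pd

# One raw timestamp in seconds since 1970-01-01 00:00:00 UTC
raw_ts = pd.Series([964982703])

# unit="s" says the integers are seconds; utc=True returns timezone-aware values
rated_at = pd.to_datetime(raw_ts, unit="s", utc=True)
print(rated_at.iloc[0])
```

The same call applied to a whole `timestamp` column would convert every rating at once.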

4. Tags Data File Structure (tags.csv):

3683 rows and 4 columns (userId, movieId, tag, timestamp)

Numerical: userId (int), movieId (int), timestamp (int)

Categorical: tag

All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

**Data Dictionary:**

userId -

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

movieId -

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

**Importing necessary packages:**

In [1]:
import pandas as pd
import numpy as np
import scipy.stats 
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from numpy import nonzero
from numpy import logical_and
from numpy import corrcoef
from numpy import linalg as la
from numpy import *

In [2]:
#from google.colab import files
#uploaded = files.upload()

**Connecting to google drive:**

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## **Data Cleaning**

There are 4 datasets in our project: links.csv, movies.csv, ratings.csv, tags.csv

**Getting the file path:**

In [4]:
import os
os.listdir('/content/gdrive/MyDrive/Colab Notebooks/DSC478_PML/Recommendation Systems/DSC478 Final Project')

['ratings.csv',
 'tags.csv',
 'movies.csv',
 'links.csv',
 'README.txt',
 'amazon_prime_titles.csv']

In [5]:
movies = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/DSC478_PML/Recommendation Systems/DSC478 Final Project/movies.csv')
links = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/DSC478_PML/Recommendation Systems/DSC478 Final Project/links.csv')
ratings = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/DSC478_PML/Recommendation Systems/DSC478 Final Project/ratings.csv')
tags = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/DSC478_PML/Recommendation Systems/DSC478 Final Project/tags.csv')

In [6]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [7]:
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


In [8]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [9]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


**Merging datasets**

In [10]:
df1 = pd.merge(movies, links, how='inner', on='movieId')

In [11]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 0 to 9741
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   title    9742 non-null   object 
 2   genres   9742 non-null   object 
 3   imdbId   9742 non-null   int64  
 4   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 456.7+ KB


In [12]:
df2 = pd.merge(df1, ratings, how='inner', on='movieId')

In [13]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   movieId    100836 non-null  int64  
 1   title      100836 non-null  object 
 2   genres     100836 non-null  object 
 3   imdbId     100836 non-null  int64  
 4   tmdbId     100823 non-null  float64
 5   userId     100836 non-null  int64  
 6   rating     100836 non-null  float64
 7   timestamp  100836 non-null  int64  
dtypes: float64(2), int64(4), object(2)
memory usage: 6.9+ MB


In [14]:
df2.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,17,4.5,1305696483


**Dropping inappropriate columns**

In [15]:
data = df2.drop(columns=['timestamp','tmdbId','imdbId','genres'])
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   movieId  100836 non-null  int64  
 1   title    100836 non-null  object 
 2   userId   100836 non-null  int64  
 3   rating   100836 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 3.8+ MB


So, our final dataset is stored in the variable **data**. There are 100836 rows and 4 columns.

**Checking for missing values**

In [16]:
data.isnull().sum()

movieId    0
title      0
userId     0
rating     0
dtype: int64

There are no missing values.

## **Data Preparation**

**Preparing movie data**



In [17]:
data.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),1,4.0
1,1,Toy Story (1995),5,4.0
2,1,Toy Story (1995),7,4.5
3,1,Toy Story (1995),15,2.5
4,1,Toy Story (1995),17,4.5


In [18]:
data1 = data.sort_values(['userId', 'movieId'], ascending=[True, True])
data1.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),1,4.0
325,3,Grumpier Old Men (1995),1,4.0
433,6,Heat (1995),1,4.0
2107,47,Seven (a.k.a. Se7en) (1995),1,5.0
2379,50,"Usual Suspects, The (1995)",1,5.0


In [19]:
unique_movies = data1['title'].unique()
unique_movies

array(['Toy Story (1995)', 'Grumpier Old Men (1995)', 'Heat (1995)', ...,
       'Hazard (2005)', 'Blair Witch (2016)', '31 (2016)'], dtype=object)

In [20]:
data1['title'].nunique()

9719

In [21]:
data1['userId'].nunique()

610

There are 610 users and 9719 movies. Next, we reshape the data into a matrix with users as the row index, movies as the column index, and ratings as the values, so the shape of the matrix should be (610, 9719).
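One thing worth checking when movies are counted by title rather than by movieId: two different movieIds that share a title would be folded into a single column. The sketch below shows one way to detect such titles. It runs on a hypothetical toy frame, so the titles and IDs are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: movieIds 10 and 11 share the same title
toy = pd.DataFrame({
    "movieId": [10, 11, 12, 12],
    "title": ["Movie A (2000)", "Movie A (2000)", "Movie B (1995)", "Movie B (1995)"],
    "userId": [1, 2, 1, 2],
    "rating": [4.0, 3.0, 5.0, 4.5],
})

# Count distinct movieIds per title; a count above 1 marks an ambiguous title
ids_per_title = toy.groupby("title")["movieId"].nunique()
ambiguous_titles = ids_per_title[ids_per_title > 1].index.tolist()
print(ambiguous_titles)
```

Running the same group-by on the notebook's `data1` frame would show whether any titles are affected there.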

In [22]:
movie = pd.DataFrame({'title': unique_movies, 'id': range(0, len(unique_movies))})
movie.head()

Unnamed: 0,title,id
0,Toy Story (1995),0
1,Grumpier Old Men (1995),1
2,Heat (1995),2
3,Seven (a.k.a. Se7en) (1995),3
4,"Usual Suspects, The (1995)",4


In [23]:
movie.shape

(9719, 2)

**Preparing rating data matrix**

In [24]:
ratings = data1.pivot_table(index=['userId'], columns=['title'], values='rating')
ratings.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


Null values mark movies that the user did not rate (or did not watch).
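To make the reshaping concrete, here is a small sketch of the same `pivot_table` call on toy data (three made-up ratings). It shows how an unrated user-movie pair ends up as NaN in the wide matrix.

```python
import pandas as pd

# Toy long-format ratings: user 2 never rated "Heat (1995)"
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "title": ["Alien (1979)", "Heat (1995)", "Alien (1979)"],
    "rating": [4.0, 5.0, 3.5],
})

# Users become rows, titles become columns, ratings become cell values
wide = toy.pivot_table(index="userId", columns="title", values="rating")
print(wide)
```

Note that `pivot_table` aggregates with the mean by default, so if a user had two ratings under the same title they would be averaged into one cell.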

In [25]:
ratings.isnull().sum().sum()

5827758
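The null count is easier to interpret as a fraction of the whole matrix. A short worked calculation, using only the figures reported above (610 users, 9719 movies, 5827758 nulls):

```python
n_users, n_movies = 610, 9719
n_missing = 5827758  # value reported by ratings.isnull().sum().sum()

total_cells = n_users * n_movies   # every user-movie pair
sparsity = n_missing / total_cells # share of pairs with no rating
print(f"{total_cells} cells, {sparsity:.2%} missing")
```

Roughly 98% of the user-movie pairs have no rating, which is why predicting every missing entry is expensive.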

Since there are a lot of null values, predicting all of the missing ratings takes a lot of computational time. So we remove movies rated by fewer than 100 users and fill the remaining null values with 0, so that each 0 can later be replaced with a rating predicted by either the Standard Estimate method or the SVD Estimate method. Neither method works on NaNs.

In [26]:
# Remove movies rated by fewer than 100 users, then fill the remaining null values with 0
ratings = ratings.dropna(thresh=100, axis=1).fillna(0)
ratings.head()

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),American Pie (1999),Apocalypse Now (1979),...,"Truman Show, The (1998)",Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Up (2009),"Usual Suspects, The (1995)",V for Vendetta (2006),WALL·E (2008),Waterworld (1995),Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
userId,,,,,,,,,,,,,,,,,,,,,
1,0.0,0.0,0.0,4.0,0.0,0.0,5.0,5.0,0.0,4.0,...,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,5.0,5.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,4.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
5,0.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0


9719 movies have been reduced to 138 movies.
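The filtering relies on `dropna(thresh=N, axis=1)`, which keeps a column only if it has at least N non-null entries. A toy sketch with a threshold of 2 instead of 100 (the movie names and ratings are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Movie A": [4.0, np.nan, 3.0],     # 2 ratings -> kept
    "Movie B": [np.nan, np.nan, 5.0],  # 1 rating  -> dropped
    "Movie C": [2.0, 1.0, np.nan],     # 2 ratings -> kept
})

# Keep columns with at least 2 non-null values, then mark "not rated" as 0
kept = toy.dropna(thresh=2, axis=1).fillna(0)
print(kept)
```

After this step a 0 no longer means a rating of zero; it is a placeholder for "not rated", which is how the estimation functions treat it.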

Now creating new movie list with 138 movies

In [27]:
# Build an ID-to-title lookup from the columns of the filtered ratings DataFrame
column_names = ratings.columns.tolist()  # Movie titles, in column order
column_ids = range(1, len(column_names) + 1)  # 1-based ID for each title

# Create a new DataFrame with the IDs and their titles
new_movies = pd.DataFrame({'ID': column_ids, 'Title': column_names})

new_movies


Unnamed: 0,ID,Title
0,1,2001: A Space Odyssey (1968)
1,2,Ace Ventura: Pet Detective (1994)
2,3,Aladdin (1992)
3,4,Alien (1979)
4,5,Aliens (1986)
...,...,...
133,134,V for Vendetta (2006)
134,135,WALL·E (2008)
135,136,Waterworld (1995)
136,137,Willy Wonka & the Chocolate Factory (1971)


Getting movie rating matrix

In [28]:
rat_mat = np.matrix(ratings)
rat_mat

matrix([[0. , 0. , 0. , ..., 0. , 5. , 5. ],
        [0. , 0. , 0. , ..., 0. , 0. , 0. ],
        [0. , 0. , 0. , ..., 0. , 0. , 0. ],
        ...,
        [3. , 3.5, 3. , ..., 3. , 3.5, 4. ],
        [0. , 0. , 0. , ..., 3. , 0. , 0. ],
        [4.5, 3. , 0. , ..., 0. , 0. , 3.5]])

We now have two datasets: the movie titles, stored in the variable **movie** (the reduced 138-movie list is in **new_movies**), and the rating matrix, stored in the variable **rat_mat**, with users as rows and movies as columns.

## **Helper Functions**

**Standard Estimate Method:**

 The standEst function implements a basic collaborative filtering approach to estimate ratings by considering the similarity between the item of interest and other items that the user has rated. It provides a simple way to generate personalized recommendations based on user-item ratings.

In [29]:
import numpy as np

def standEst(dataMat, user, simMeas, item):
    """Estimate the user's rating for `item` as a similarity-weighted average of their other ratings."""
    n = dataMat.shape[1]
    simTotal = 0.0
    ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user, j]
        # Skip movies the user has not rated and the target movie itself
        if userRating == 0 or j == item:
            continue
        # Rows (users) that rated both the target movie and movie j
        overLap = np.nonzero(np.logical_and(dataMat[:, item] > 0, dataMat[:, j] > 0))[0]
        if len(overLap) == 0:
            similarity = 0
        else:
            similarity = simMeas(dataMat[overLap, item], dataMat[overLap, j])
            similarity = np.nan_to_num(similarity)  # Replace NaN with 0
        simTotal += similarity
        ratSimTotal += similarity * userRating
    # No usable neighbours: return 0, meaning "no estimate"
    if simTotal == 0:
        return 0
    return ratSimTotal / simTotal
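The value returned by `standEst` is a similarity-weighted average: the sum of similarity times rating, divided by the sum of similarities. A tiny worked example with made-up numbers (two rated movies, hypothetical similarities of 0.8 and 0.6 to the target movie):

```python
import numpy as np

user_ratings = np.array([4.0, 5.0])  # ratings the user gave to two other movies
similarities = np.array([0.8, 0.6])  # hypothetical similarity of each to the target movie

# (0.8 * 4.0 + 0.6 * 5.0) / (0.8 + 0.6)
estimate = np.dot(similarities, user_ratings) / similarities.sum()
print(round(estimate, 3))
```

Because the weights are non-negative here, the estimate always falls between the lowest and highest of the user's ratings that were used.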

**SVD Estimate Method:**

The svdEst function applies Singular Value Decomposition to reduce the dimensionality of the user-item ratings matrix and uses the transformed representation to calculate similarities between items. It then estimates ratings by considering the similarity between the transformed item of interest and other transformed items that the user has rated. This approach can provide improved recommendations by capturing latent factors in the data through dimensionality reduction.

In [30]:
import numpy as np
from numpy import linalg as la

def svdEst(dataMat, user, simMeas, item):
    n = dataMat.shape[1]
    simTotal = 0.0
    ratSimTotal = 0.0
    data = np.asmatrix(dataMat)
    # Note: the SVD is recomputed on every call
    U, Sigma, VT = la.svd(data)
    Sig4 = np.asmatrix(np.eye(4) * Sigma[:4])  # arrange the top 4 singular values into a diagonal matrix
    xformedItems = data.T * U[:, :4] * Sig4.I  # project items into the 4-dimensional latent space
    for j in range(n):
        userRating = data[user, j]
        # Skip movies the user has not rated and the target movie itself
        if userRating == 0 or j == item:
            continue
        similarity = simMeas(xformedItems[item, :].T, xformedItems[j, :].T)
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    return ratSimTotal / simTotal
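The item transform inside `svdEst` can be checked on a small example. The sketch below uses plain NumPy arrays and a random toy matrix (6 users by 5 items, k = 2 rather than 4); these sizes and names are assumptions made for illustration. It suggests that, when the leading singular values are non-zero, the transformed items are simply the first k right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 6, size=(6, 5)).astype(float)  # toy 6 users x 5 items

k = 2
U, sigma, Vt = np.linalg.svd(data)

# Same transform as in svdEst, written with arrays: data^T * U_k * inv(Sigma_k)
items_k = data.T @ U[:, :k] @ np.linalg.inv(np.diag(sigma[:k]))

# Compare against the first k right singular vectors (one row per item)
matches_right_vectors = np.allclose(items_k, Vt[:k].T)
print(items_k.shape, matches_right_vectors)
```

One practical consequence is that the decomposition depends only on the ratings matrix, so it could be computed once and reused across calls instead of being recomputed for every (user, item) pair.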

**Pearson Similarity**

The pearsSim function provides a measure of similarity between two vectors based on their Pearson correlation coefficient, allowing for comparison and similarity-based analysis of data.

In [31]:
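def pearsSim(inA, inB):
    # With fewer than 3 points the correlation is not meaningful; treat as fully similar
    if len(inA) < 3:
        return 1.0
    # Rescale the correlation from [-1, 1] to [0, 1]
    return 0.5 + 0.5 * np.corrcoef(inA, inB, rowvar=False)[0, 1]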
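The `0.5 + 0.5 * r` rescaling maps a Pearson correlation from the range [-1, 1] onto [0, 1]. A short sketch with made-up vectors shows the two extremes; `scaled_pearson` is an illustrative stand-in for the rescaling step, not the notebook's function.

```python
import numpy as np

def scaled_pearson(a, b):
    # Same rescaling as pearsSim: map correlation from [-1, 1] to [0, 1]
    return 0.5 + 0.5 * np.corrcoef(a, b)[0, 1]

a = np.array([1.0, 2.0, 3.0, 4.0])
same_trend = scaled_pearson(a, 2 * a + 1)  # perfectly correlated
opposite_trend = scaled_pearson(a, -a)     # perfectly anti-correlated
print(same_trend, opposite_trend)
```

Perfect agreement lands at (approximately) 1 and perfect disagreement at (approximately) 0, so the result can be used directly as a non-negative weight.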

**Cosine Similarity**

The distCosine function quantifies the dissimilarity between two vectors as 1 minus the cosine of the angle between them. Because it returns a distance rather than a similarity, smaller values mean the two vectors are more alike.

In [32]:
def distCosine(inA, inB):
    # Cosine distance: 1 - cosine similarity (0 means the vectors point the same way)
    normA = np.linalg.norm(inA)
    normB = np.linalg.norm(inB)
    sims = np.dot(inA, inB) / (normA * normB)
    dists = 1 - sims
    return dists
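The difference between cosine similarity and cosine distance matters when ranking. The sketch below uses made-up vectors and a hypothetical helper `cosine_sim` to show the two extremes of the distance.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between a and b
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 2.0])
parallel = 3 * a                        # same direction as a
orthogonal = np.array([0.0, 5.0, 0.0])  # no overlap with a

dist_parallel = 1 - cosine_sim(a, parallel)
dist_orthogonal = 1 - cosine_sim(a, orthogonal)
print(dist_parallel, dist_orthogonal)
```

Since the most similar vector has the smallest distance, a list ranked by distance needs to be sorted in ascending order to put the closest items first.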

**Cross-Validation Method:**

The cross_validate_user function performs cross-validation by withholding a subset of items for testing from a specific user. It estimates ratings for the withheld items using a given rating estimation method and calculates the MAE by comparing the estimated ratings with the original ratings. This process allows for the evaluation and comparison of different recommendation algorithms or models.

In [33]:
def cross_validate_user(dataMat, user, test_ratio, estMethod, simMeas):
	number_of_items = np.shape(dataMat)[1]
	rated_items_by_user = np.array([i for i in range(number_of_items) if dataMat[user,i]>0])
	test_size = int(test_ratio * len(rated_items_by_user))
	# Note: randint samples with replacement, so the same item can be withheld more than once
	test_indices = np.random.randint(0, len(rated_items_by_user), test_size)
	withheld_items = rated_items_by_user[test_indices]
	original_user_profile = np.copy(dataMat[user])
	dataMat[user, withheld_items] = 0 # So that the withheld test items is not used in the rating estimation below
	error_u = 0.0
	count_u = len(withheld_items)

	# Compute absolute error for user u over all test items
	for item in withheld_items:
		# Estimate rating on the withheld item
		estimatedScore = estMethod(dataMat, user, simMeas, item)
		error_u = error_u + abs(estimatedScore - original_user_profile[0, item])	
	
	# Now restore ratings of the withheld items to the user profile
	for item in withheld_items:
		dataMat[user, item] = original_user_profile[0, item]
		
	# Return sum of absolute errors and the count of test cases for this user
	# Note that these will have to be accumulated for each user to compute MAE
	return error_u, count_u
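Since `cross_validate_user` returns a sum of absolute errors and a count for a single user, the overall MAE is the total error divided by the total count across users. A minimal sketch, assuming a hypothetical list of per-user results:

```python
# Hypothetical (error_sum, count) pairs, one per user, in the shape that
# cross_validate_user returns
per_user = [(1.5, 3), (0.4, 2), (0.0, 0)]

total_error = sum(err for err, _ in per_user)
total_count = sum(cnt for _, cnt in per_user)

# Guard against the case where no test items were withheld at all
mae = total_error / total_count if total_count else float("nan")
print(mae)
```

Pooling the sums this way weights every withheld rating equally, which differs slightly from averaging each user's own MAE.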

**print_most_similar_movies_cosine function using cosine similarity**

This function calculates the similarity between a given movie and all other movies in the dataset, and then prints the top k most similar movies based on the provided similarity metric.

In [34]:
def print_most_similar_movies_cosine(dataMat, movie, querymovie, k, metric):
    # Getting the ratings vector for the querymovie
    queryMovie_Vector = dataMat[:, querymovie].flatten()  # Flatten the column vector
    
    # Computing the similarity between the querymovie and all other movies
    similar = []
    for i in range(dataMat.shape[1]):
        m_ratings = dataMat[:, i]  
        similarity = metric(queryMovie_Vector, m_ratings)
        similar.append((i, similarity))
    
    # Sort the movies based on similarity in descending order
    similar.sort(key=lambda x: x[1], reverse=True)
    
    # Printing the query movie details
    print("Selected Movie:")
    print(movie.iloc[querymovie, 1])  # Column index 1 is expected to hold the movie title (as in new_movies)
    
    print("\n")
    
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    j = 1
    for i in range(0, k):
        m_Index = similar[i][0]
        print('{}. {}'.format(j, movie.iloc[m_Index, 1]))  # Column index 1 is expected to hold the movie title
        print('-'*125)
        j=j+1

**Using code with cosine_similarity inbuilt function**

The function `get_similar_movies_cosine` takes a query movie, user ratings, and a parameter `k` as inputs. It calculates the similarity scores between the query movie and all other movies based on cosine similarity, weighted by the user ratings. 

The similarity scores are sorted in descending order, and the top `k` similar movies are selected. The function then prints the details of the query movie and lists the top recommended movies along with their rankings.

Overall, this function provides a way to find and recommend movies that are most similar to a given query movie, taking into account user ratings and using cosine similarity as the similarity metric.

In [35]:
def get_similar_movies_cosine(queryMovie, user_rating, k):
    sim_score = movie_similarity_df[queryMovie] * user_rating
    sim_score = sim_score.sort_values(ascending=False)
    similar_movies_ids = sim_score.index[1:k+1]  # Exclude the first movie and select the next k movies
    # Printing the query movie details
    print("Selected Movie:")
    print(new_movies.iloc[queryMovie, 1])
    print("\n")
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    j = 1
    for i in similar_movies_ids:
        print('{}. {}'.format(j, new_movies.iloc[i, 1]))
        print('-' * 125)
        j += 1
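`get_similar_movies_cosine` reads from a global `movie_similarity_df` that is not defined in this part of the notebook. One plausible construction, offered purely as an assumption, is an item-by-item cosine similarity matrix over the columns of the ratings table. The sketch below builds such a matrix with NumPy on toy data; scikit-learn's `cosine_similarity` applied to the transposed ratings should give the same values.

```python
import numpy as np
import pandas as pd

# Toy ratings: rows are users, columns are movies (0 = not rated)
toy_ratings = pd.DataFrame(
    [[5.0, 4.0, 0.0], [4.0, 5.0, 1.0], [0.0, 1.0, 5.0]],
    columns=["Movie A", "Movie B", "Movie C"],
)

X = toy_ratings.to_numpy()
norms = np.linalg.norm(X, axis=0)         # one norm per movie column
sim = (X.T @ X) / np.outer(norms, norms)  # pairwise cosine similarity

toy_similarity_df = pd.DataFrame(sim, index=toy_ratings.columns, columns=toy_ratings.columns)

# Most similar movie to "Movie A", excluding itself
most_similar_to_a = toy_similarity_df["Movie A"].drop("Movie A").idxmax()
print(most_similar_to_a)
```

The sketch labels rows and columns by title for readability, whereas the notebook's function indexes by integer position, so the real matrix presumably uses integer labels.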


**print_most_similar_movies_pearson function using pearson similarity**

The function `print_most_similar_movies_pearson` takes a data matrix `dataMat`, a movie DataFrame `movie`, a query movie index `querymovie`, a parameter `k`, and a similarity metric `metric` as inputs. It computes the similarity between the query movie and all other movies based on the given similarity metric.

The movies are then sorted based on their similarity to the query movie in descending order. The function prints the details of the query movie and lists the top `k` recommended movies along with their rankings.

Overall, this function provides a way to find and print the most similar movies to a given query movie based on the chosen similarity metric, using Pearson correlation in this case.

In [36]:
def print_most_similar_movies_pearson(dataMat, movie, querymovie, k, metric):
    # Getting the ratings vector for the querymovie
    queryMovie_Vector = dataMat[:, querymovie]
    
    # Computing the similarity between the querymovie and all other movies
    similar = []
    for i in range(dataMat.shape[1]):
        m_ratings = dataMat[:, i]  
        similarity = metric(queryMovie_Vector, m_ratings)
        similar.append((i, similarity))
    
    # Sort the movies based on similarity in descending order
    similar.sort(key=lambda x: x[1], reverse=True)
    
    # Printing the query movie details
    print("Selected Movie:")
    print(movie.iloc[querymovie, 1])  # The movie title is in the second column (position 1)
    
    print("\n")
    
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    j = 1
    for i in range(1, k+1):
        m_Index = similar[i][0]
        print('{}. {}'.format(j, movie.iloc[m_Index, 1]))  # The movie title is in the second column (position 1)
        print('-'*125)
        j=j+1

**Using code with .corr(method='pearson') inbuilt function**

The function `get_similar_movies_pearson` takes a query movie index `queryMovie`, user ratings `user_rating`, and a parameter `k` as inputs. It calculates the similarity scores between the query movie and all other movies based on Pearson correlation.

The similarity scores are then sorted in descending order, excluding the first movie (which is the query movie itself). The function prints the details of the query movie and lists the top `k` recommended movies along with their rankings.

This function provides a way to find and print the most similar movies to a given query movie based on Pearson correlation, using the movie similarity matrix and user ratings.

In [37]:
def get_similar_movies_pearson(queryMovie, user_rating, k):
    movie_name = new_movies.iloc[queryMovie, 1]
    sim_score = movie_similarity_pear.iloc[:, queryMovie] * user_rating
    sim_score = sim_score.sort_values(ascending=False)
    
    # Printing the query movie details
    print("Selected Movie:")
    print(movie_name)
    print("\n")
    
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, (movie_id, score) in enumerate(sim_score.iloc[1:k+1].items(), 1):
        movie_name = new_movies.iloc[movie_id, 1]
        print("{}. {}".format(i, movie_name))
        print("-" * 125)

## **Predicting missing rating using Standard Estimate Method**

Let's see how the Standard Estimate Method works. I'm going to use **ratings**, which has many missing ratings (shown as 0.0 below) and holds 610 users in rows and 138 movies in columns.

In [38]:
ratings.head()

userId,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),American Pie (1999),Apocalypse Now (1979),...,"Truman Show, The (1998)",Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Up (2009),"Usual Suspects, The (1995)",V for Vendetta (2006),WALL·E (2008),Waterworld (1995),Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
1,0.0,0.0,0.0,4.0,0.0,0.0,5.0,5.0,0.0,4.0,...,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,5.0,5.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,4.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
5,0.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0


In [39]:
print('There are {} levels of movie ratings: \n{}'. format(data['rating'].nunique(), data['rating'].unique()))

There are 10 levels of movie ratings: 
[4.  4.5 2.5 3.5 3.  5.  0.5 2.  1.5 1. ]


Let's compute the rating for the movie **2001: A Space Odyssey (1968)** (column 0 of the filtered matrix) by **User 1** (row 0)

In [40]:
print(ratings.iloc[0,0]) # User 1

0.0


> **Using Pearson Correlation as similarity measure**

standEst() doesn't work on null values, so all null values were replaced with 0 and the result stored in **rat_mat**. The goal now is to predict a rating for every 0 entry and replace the 0 with the prediction.

In [41]:
rat_mat

matrix([[0. , 0. , 0. , ..., 0. , 5. , 5. ],
        [0. , 0. , 0. , ..., 0. , 0. , 0. ],
        [0. , 0. , 0. , ..., 0. , 0. , 0. ],
        ...,
        [3. , 3.5, 3. , ..., 3. , 3.5, 4. ],
        [0. , 0. , 0. , ..., 3. , 0. , 0. ],
        [4.5, 3. , 0. , ..., 0. , 0. , 3.5]])

In [42]:
rat_mat[0,0]

0.0

In [43]:
# User 1 sits at row index 0 (indices are zero-based), so pass 0 as the user argument

standEst(rat_mat, 0, pearsSim, 0)

4.4467668845433375

Since there are 10 rating levels, from 0.5 to 5.0 in increments of 0.5, we can round 4.447 to the nearest level, 4.5.
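
As a quick check of this rounding step, the sketch below snaps an estimate to the nearest of the ten rating levels. The names `LEVELS` and `snap_to_level` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# The ten rating levels: 0.5, 1.0, ..., 5.0
LEVELS = np.linspace(0.5, 5.0, 10)

def snap_to_level(value):
    """Return the rating level closest to `value` (hypothetical helper)."""
    return float(LEVELS[np.abs(LEVELS - value).argmin()])

print(snap_to_level(4.4467668845433375))  # 4.5
```

Values outside the rating range are clamped to the nearest end, so an estimate below 0.5 becomes 0.5 and one above 5.0 becomes 5.0.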

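Applying the standEst() method to every 0 rating in **rat_mat** and storing the result in the variable **rating_std**. Note that the assignment below does not copy the matrix: **rating_std** and **rat_mat** refer to the same object, so filling **rating_std** also fills **rat_mat**.

In [44]:
rating_std = rat_mat  # alias, not a copy; rat_mat.copy() would leave rat_mat untouched
rating_std 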

matrix([[0. , 0. , 0. , ..., 0. , 5. , 5. ],
        [0. , 0. , 0. , ..., 0. , 0. , 0. ],
        [0. , 0. , 0. , ..., 0. , 0. , 0. ],
        ...,
        [3. , 3.5, 3. , ..., 3. , 3.5, 4. ],
        [0. , 0. , 0. , ..., 3. , 0. , 0. ],
        [4.5, 3. , 0. , ..., 0. , 0. , 3.5]])
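
The assignment `rating_std = rat_mat` binds a second name to the same matrix rather than creating a new one. The toy sketch below (plain NumPy arrays, with illustrative variable names) shows the difference between such an alias and an explicit `.copy()`.

```python
import numpy as np

original = np.array([[0.0, 5.0],
                     [3.0, 0.0]])

alias = original                # same object, no data is copied
independent = original.copy()   # separate buffer with the same values

alias[0, 0] = 4.5               # also changes `original`
independent[1, 1] = 2.0         # leaves `original` untouched

print(original)
```

After these two writes, `original[0, 0]` is 4.5 while `original[1, 1]` is still 0.0. The same reasoning applies to any later step that expects `rat_mat` to still contain zeros.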

The below code demonstrates an iteration over each row (user) and column (item) in a matrix called `rating_std`. It checks if a value in `rating_std` is zero and, if so, applies the `standEst` function with the `pearsSim` similarity metric to estimate the rating for that user-item pair.

After obtaining the estimated rating, the code uses a custom rounding function, `custom_round`, to round the estimated rating to the nearest value in the `ratings_levels` array. Finally, it replaces the zero value in `rating_std` with the rounded rating.

This process allows for the estimation and rounding of missing or unrated values in the `rating_std` matrix based on collaborative filtering techniques.

In [45]:
%%time

import numpy as np

# Ratings levels
ratings_levels = np.array([4.0, 4.5, 2.5, 3.5, 3.0, 5.0, 0.5, 2.0, 1.5, 1.0])

# Custom rounding function
def custom_round(value):
    if np.isnan(value):
        return np.nan
    else:
        return ratings_levels[np.abs(ratings_levels - value).argmin()]

# Iterate over each row (user)
for i in range(rating_std.shape[0]):
    # Iterate over each column (item)
    for j in range(rating_std.shape[1]):
        if rating_std[i, j] == 0:
            # Apply standEst() to zero values
            estimated_rating = standEst(rating_std, i, pearsSim, j)
            # Round the estimated rating using the custom rounding function
            rounded_rating = custom_round(estimated_rating)
            # Replace the zero value with the rounded rating
            rating_std[i, j] = rounded_rating

CPU times: user 13min 30s, sys: 1.8 s, total: 13min 32s
Wall time: 13min 43s
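
The loop above writes each prediction back into the matrix it is reading from, so later estimates can be influenced by earlier predictions. A minimal sketch of an alternative is shown below, assuming a stand-in estimator: `fill_missing`, `snap` and `user_mean` are hypothetical names, and `user_mean` is only a placeholder for `standEst`, not a reimplementation of it.

```python
import numpy as np

LEVELS = np.linspace(0.5, 5.0, 10)

def snap(value):
    """Round to the nearest rating level."""
    return float(LEVELS[np.abs(LEVELS - value).argmin()])

def user_mean(mat, user, item):
    """Stand-in estimator: the user's mean over rated items (not standEst)."""
    rated = mat[user][mat[user] > 0]
    return rated.mean() if rated.size else LEVELS.mean()

def fill_missing(mat, estimate):
    """Fill zero entries, always estimating from the unmodified input."""
    source = np.asarray(mat, dtype=float)   # never written to
    filled = source.copy()                  # predictions go here
    for user, item in np.argwhere(source == 0):
        filled[user, item] = snap(estimate(source, user, item))
    return filled

toy = np.array([[5.0, 0.0, 4.0],
                [0.0, 3.0, 0.0]])
print(fill_missing(toy, user_mean))
```

Because every estimate is computed from `source`, the order in which entries are visited does not affect the result, and the input matrix keeps its zeros for any later comparison.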


In [46]:
rating_std

matrix([[4.5, 4.5, 4.5, ..., 4.5, 5. , 5. ],
        [4. , 4. , 4. , ..., 4. , 4. , 4. ],
        [0.5, 0.5, 0.5, ..., 0.5, 0.5, 0.5],
        ...,
        [3. , 3.5, 3. , ..., 3. , 3.5, 4. ],
        [3.5, 3.5, 3.5, ..., 3. , 3.5, 3.5],
        [4.5, 3. , 4.5, ..., 4.5, 4.5, 3.5]])

## **Predicting missing rating using SVD Estimate Method**

> **Using Pearson Correlation as similarity measure**

Let's see how SVD Estimate Method works

In [47]:
svdEst(rat_mat, 0, pearsSim, 0)

4.482010896330537
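
To give a feel for what an SVD-based estimate works with, the sketch below projects items into a low-dimensional latent space using `np.linalg.svd`. It is only an illustration on a toy matrix: `item_factors` is a hypothetical helper, and the notebook's own `svdEst()` may differ in detail.

```python
import numpy as np

def item_factors(ratings, k):
    """Coordinates of each item in the top-k latent space (hypothetical helper)."""
    _, s, vt = np.linalg.svd(ratings, full_matrices=False)
    return vt[:k].T * s[:k]          # shape: (n_items, k)

toy = np.array([[5.0, 4.0, 1.0],
                [4.0, 5.0, 1.0],
                [1.0, 1.0, 5.0],
                [2.0, 1.0, 4.0]])

reduced = item_factors(toy, 2)
print(reduced.shape)                 # (3, 2)
```

Item similarities can then be computed between the rows of `reduced` instead of between full rating columns, which is cheaper when there are many users.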

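Applying the svdEst() method to every 0 rating in rat_mat and storing the result in the variable rating_svd. Because rat_mat was already filled in place by the Standard Estimate loop (it is the same object as rating_std), it has no zeros left at this point, as the output below shows. Starting both steps from `rat_mat.copy()` would keep the two estimates independent.

In [48]:
rating_svd = rat_mat  # alias of the already-filled matrix, not a fresh copy
rating_svd 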

matrix([[4.5, 4.5, 4.5, ..., 4.5, 5. , 5. ],
        [4. , 4. , 4. , ..., 4. , 4. , 4. ],
        [0.5, 0.5, 0.5, ..., 0.5, 0.5, 0.5],
        ...,
        [3. , 3.5, 3. , ..., 3. , 3.5, 4. ],
        [3.5, 3.5, 3.5, ..., 3. , 3.5, 3.5],
        [4.5, 3. , 4.5, ..., 4.5, 4.5, 3.5]])

The below code demonstrates an iteration over each row (user) and column (item) in a matrix called `rating_svd`. It checks if a value in `rating_svd` is zero and, if so, applies the `svdEst` function with the `pearsSim` similarity metric to estimate the rating for that user-item pair.

After obtaining the estimated rating, the code uses a custom rounding function, `custom_round`, to round the estimated rating to the nearest value in the `ratings_levels` array. Finally, it replaces the zero value in `rating_svd` with the rounded rating.

This process allows for the estimation and rounding of missing or unrated values in the `rating_svd` matrix based on collaborative filtering techniques, specifically using singular value decomposition (SVD) and the Pearson similarity metric.

In [49]:
%%time

import numpy as np

# Ratings levels
ratings_levels = np.array([4.0, 4.5, 2.5, 3.5, 3.0, 5.0, 0.5, 2.0, 1.5, 1.0])

# Custom rounding function
def custom_round(value):
    if np.isnan(value):
        return np.nan
    else:
        return ratings_levels[np.abs(ratings_levels - value).argmin()]

# Iterate over each row (user)
for i in range(rating_svd.shape[0]):
    # Iterate over each column (item)
    for j in range(rating_svd.shape[1]):
        if rating_svd[i, j] == 0:
            # Apply svdEst() to zero values
            estimated_rating = svdEst(rating_svd, i, pearsSim, j)
            # Round the estimated rating using the custom rounding function
            rounded_rating = custom_round(estimated_rating)
            # Replace the zero value with the rounded rating
            rating_svd[i, j] = rounded_rating

CPU times: user 97.2 ms, sys: 62.9 ms, total: 160 ms
Wall time: 86.1 ms


In [50]:
rating_svd

matrix([[4.5, 4.5, 4.5, ..., 4.5, 5. , 5. ],
        [4. , 4. , 4. , ..., 4. , 4. , 4. ],
        [0.5, 0.5, 0.5, ..., 0.5, 0.5, 0.5],
        ...,
        [3. , 3.5, 3. , ..., 3. , 3.5, 4. ],
        [3.5, 3.5, 3.5, ..., 3. , 3.5, 3.5],
        [4.5, 3. , 4.5, ..., 4.5, 4.5, 3.5]])

## **Approach 1: Item-to-Item CF (Item based Collaborative Filtering)**

> **Find similar movies based on ratings given by other users**

Item-based collaborative filtering is a technique used in recommender systems to provide personalized recommendations based on the similarity between items. It relies on the assumption that if two items are frequently rated similarly by users, they are likely to be related or have similar characteristics.

The item-based collaborative filtering algorithm operates in two main steps:

1. **Similarity Calculation:** The first step computes the similarity between items, using a measure such as cosine similarity or Pearson correlation. The similarity is computed from the ratings that users have given to the items.

2. **Recommendation Generation:** Once the item similarities are available, the algorithm identifies the items most similar to a given item. When a user expresses a preference for a particular item, the algorithm looks up the items most highly correlated with it and recommends those to the user.

Item-based collaborative filtering has several advantages. It is relatively simple to implement, and item-to-item similarities tend to be more stable than user-to-user similarities when there are many more users than items, because each item accumulates ratings from many users. It also copes reasonably well with new users, since a single rated or selected movie is enough to look up similar items. It does not, however, solve the cold-start problem for new items: an item with no ratings has no rating vector from which a similarity can be computed.
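
The two steps can be sketched on a toy user-by-item matrix as follows. This is a minimal illustration with hypothetical helper names (`item_cosine_similarity`, `most_similar_items`), not the notebook's implementation.

```python
import numpy as np

def item_cosine_similarity(ratings):
    """Step 1: cosine similarity between every pair of item columns."""
    unit = ratings / np.linalg.norm(ratings, axis=0)
    return unit.T @ unit

def most_similar_items(similarity, item, k):
    """Step 2: indices of the k items most similar to `item`, excluding itself."""
    ranked = np.argsort(-similarity[item])
    return [int(j) for j in ranked if j != item][:k]

# Rows are users, columns are items
toy = np.array([[5.0, 4.0, 1.0, 1.0],
                [4.0, 5.0, 1.0, 2.0],
                [1.0, 1.0, 5.0, 4.0],
                [2.0, 1.0, 4.0, 5.0]])

similarity = item_cosine_similarity(toy)
print(most_similar_items(similarity, 0, 2))  # [1, 3]
```

Item 1 is rated almost identically to item 0 by every user, so it ranks first. Items 2 and 3 are liked by the opposite group of users and score lower.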



### **Using predicted ratings using Standard Estimate Method**

**Function to fetch movie**

The function returns the value in the first column (position 0) of the movie dataframe for the row at position id. What that value represents depends on the column layout of the dataframe passed in.

In [51]:
def get_movie(movie, id):
    return movie.iloc[id,0]

#### **Option 1: Cosine Distance (Cosine Similarity)**
> **Using code with manual calculation**

**print_most_similar_movies_cosine function using cosine similarity**

This function calculates the similarity between a given movie and all other movies in the dataset, and then prints the top k most similar movies based on the provided similarity metric.

In [52]:
def print_most_similar_movies_cosine(dataMat, movie, querymovie, k, metric):
    # Getting the ratings vector for the querymovie
    queryMovie_Vector = dataMat[:, querymovie].flatten()  # Flatten the column vector
    
    # Computing the similarity between the querymovie and all other movies
    similar = []
    for i in range(dataMat.shape[1]):
        m_ratings = dataMat[:, i]  
        similarity = metric(queryMovie_Vector, m_ratings)
        similar.append((i, similarity))
    
    # Sort the movies based on similarity in descending order
    similar.sort(key=lambda x: x[1], reverse=True)
    
    # Printing the query movie details
    print("Selected Movie:")
    print(movie.iloc[querymovie, 1])  # The movie title is in the second column (position 1)
    
    print("\n")
    
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    j = 1
    for i in range(0, k):
        m_Index = similar[i][0]
        print('{}. {}'.format(j, movie.iloc[m_Index, 1]))  # The movie title is in the second column (position 1)
        print('-'*125)
        j=j+1

In [53]:
print_most_similar_movies_cosine(rating_std, new_movies, 1, 5, distCosine)

Selected Movie:
Ace Ventura: Pet Detective (1994)


Top 5 recommended movies are:
1. Braveheart (1995)
-----------------------------------------------------------------------------------------------------------------------------
2. Forrest Gump (1994)
-----------------------------------------------------------------------------------------------------------------------------
3. Pulp Fiction (1994)
-----------------------------------------------------------------------------------------------------------------------------
4. Schindler's List (1993)
-----------------------------------------------------------------------------------------------------------------------------
5. Star Wars: Episode I - The Phantom Menace (1999)
-----------------------------------------------------------------------------------------------------------------------------


> **Using code with inbuilt function**

In [54]:
rating_std_df = pd.DataFrame(rating_std)
rating_std_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,128,129,130,131,132,133,134,135,136,137
0,4.5,4.5,4.5,4.0,4.5,4.5,5.0,5.0,4.5,4.0,...,4.5,4.5,3.0,4.5,5.0,4.5,4.5,4.5,5.0,5.0
1,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
2,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
3,3.5,3.5,4.0,3.5,3.5,3.5,5.0,3.5,3.5,3.5,...,3.5,2.0,3.5,3.5,3.5,3.5,3.5,3.5,4.0,3.5
4,3.5,3.0,4.0,3.5,3.5,3.5,3.5,3.5,3.5,3.5,...,3.5,3.5,3.5,3.5,4.0,3.5,3.5,3.5,3.5,3.5


In [55]:
# Similarity matrix using cosine similarity
movie_similarity = cosine_similarity(rating_std_df.T)
print(movie_similarity)

[[1.         0.98020239 0.9893871  ... 0.98293205 0.98970202 0.98742738]
 [0.98020239 1.         0.98242113 ... 0.98114321 0.98157178 0.9814903 ]
 [0.9893871  0.98242113 1.         ... 0.98584472 0.99066491 0.99034309]
 ...
 [0.98293205 0.98114321 0.98584472 ... 1.         0.98558126 0.98532828]
 [0.98970202 0.98157178 0.99066491 ... 0.98558126 1.         0.99001515]
 [0.98742738 0.9814903  0.99034309 ... 0.98532828 0.99001515 1.        ]]


In [56]:
movie_similarity_df = pd.DataFrame(movie_similarity, index=rating_std_df.columns, columns=rating_std_df.columns)
movie_similarity_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,128,129,130,131,132,133,134,135,136,137
0,1.000000,0.980202,0.989387,0.989695,0.990155,0.989906,0.985682,0.991328,0.987321,0.992588,...,0.990189,0.988494,0.984507,0.991055,0.987828,0.990756,0.990281,0.982932,0.989702,0.987427
1,0.980202,1.000000,0.982421,0.978822,0.980277,0.979604,0.976453,0.982837,0.981557,0.983364,...,0.983346,0.979686,0.980438,0.983679,0.978428,0.983739,0.983070,0.981143,0.981572,0.981490
2,0.989387,0.982421,1.000000,0.988777,0.989954,0.990728,0.986044,0.993109,0.989789,0.991531,...,0.993170,0.989078,0.987755,0.993363,0.988858,0.992731,0.992268,0.985845,0.990665,0.990343
3,0.989695,0.978822,0.988777,1.000000,0.994547,0.990621,0.986237,0.991443,0.987969,0.992289,...,0.991401,0.989441,0.986393,0.990794,0.988802,0.992339,0.991703,0.984717,0.989752,0.989118
4,0.990155,0.980277,0.989954,0.994547,1.000000,0.990972,0.985965,0.992161,0.988632,0.991859,...,0.992129,0.989585,0.986791,0.991834,0.988736,0.992855,0.991770,0.984766,0.990189,0.989962
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133,0.990756,0.983739,0.992731,0.992339,0.992855,0.993319,0.988655,0.995027,0.991283,0.993655,...,0.994731,0.992201,0.989770,0.995343,0.991239,1.000000,0.994618,0.987902,0.992119,0.993226
134,0.990281,0.983070,0.992268,0.991703,0.991770,0.992118,0.988224,0.994068,0.990651,0.993476,...,0.993495,0.992484,0.988088,0.995855,0.990848,0.994618,1.000000,0.986671,0.991791,0.991440
135,0.982932,0.981143,0.985845,0.984717,0.984766,0.985144,0.979537,0.987131,0.985742,0.986815,...,0.987401,0.984030,0.984900,0.987607,0.980443,0.987902,0.986671,1.000000,0.985581,0.985328
136,0.989702,0.981572,0.990665,0.989752,0.990189,0.991331,0.986045,0.993451,0.988837,0.992344,...,0.992027,0.990678,0.987065,0.993309,0.989456,0.992119,0.991791,0.985581,1.000000,0.990015


The get_similar_movies_cosine function takes three parameters: queryMovie, user_rating, and k. It assumes that movie_similarity_df is a dataframe containing cosine similarity scores between movies, and that new_movies is a dataframe containing movie information.

The function scales the cosine similarities between queryMovie and every other movie by user_rating. Since user_rating is a single number, a positive rating changes the scores but not their order. The function then sorts the scores in descending order, skips the first entry (the query movie itself, whose self-similarity is 1), and selects the next k movies.

The function prints the title of the queryMovie and then the top k recommended movies, numbered by rank. Each recommended movie's title is looked up in the new_movies dataframe.

In [57]:
def get_similar_movies_cosine(queryMovie, user_rating, k):
    sim_score = movie_similarity_df[queryMovie] * user_rating
    sim_score = sim_score.sort_values(ascending=False)
    similar_movies_ids = sim_score.index[1:k+1]  # Exclude the first movie and select the next k movies
    # Printing the query movie details
    print("Selected Movie:")
    print(new_movies.iloc[queryMovie, 1])
    print("\n")
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    j = 1
    for i in similar_movies_ids:
        print('{}. {}'.format(j, new_movies.iloc[i, 1]))
        print('-' * 125)
        j += 1


In [58]:
get_similar_movies_cosine(1,5,5)

Selected Movie:
Ace Ventura: Pet Detective (1994)


Top 5 recommended movies are:
1. Stargate (1994)
-----------------------------------------------------------------------------------------------------------------------------
2. Ocean's Eleven (2001)
-----------------------------------------------------------------------------------------------------------------------------
3. Heat (1995)
-----------------------------------------------------------------------------------------------------------------------------
4. Catch Me If You Can (2002)
-----------------------------------------------------------------------------------------------------------------------------
5. Indiana Jones and the Temple of Doom (1984)
-----------------------------------------------------------------------------------------------------------------------------
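
The slice `[1:k+1]` in `get_similar_movies_cosine` relies on the query movie sorting to the first position. A small sketch of an alternative that removes the query by label before ranking is given below. `top_k_similar` and the toy similarity table are hypothetical and only illustrate the idea.

```python
import pandas as pd

def top_k_similar(similarity_df, query, k):
    """Top-k most similar items, dropping the query by label (hypothetical helper)."""
    return similarity_df[query].drop(query).nlargest(k)

toy_similarity = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.5],
     [0.2, 0.5, 1.0]]
)

print(top_k_similar(toy_similarity, 0, 2))
```

Dropping by label keeps the result correct even if another movie ties with the query at the top of the sorted scores.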


#### **Option 2: Pearson Correlation**
> **Using code with manual calculation**

**print_most_similar_movies_pearson function using pearson similarity**

This function is an item-based collaborative filtering approach. It calculates the similarity between a query movie and all other movies based on a chosen metric. Then, it sorts the movies based on their similarity to the query movie in descending order. Finally, it prints the top k recommended movies based on their similarity scores.

In [59]:
def print_most_similar_movies_pearson(dataMat, movie, querymovie, k, metric):
    # Getting the ratings vector for the querymovie
    queryMovie_Vector = dataMat[:, querymovie]
    
    # Computing the similarity between the querymovie and all other movies
    similar = []
    for i in range(dataMat.shape[1]):
        m_ratings = dataMat[:, i]  
        similarity = metric(queryMovie_Vector, m_ratings)
        similar.append((i, similarity))
    
    # Sort the movies based on similarity in descending order
    similar.sort(key=lambda x: x[1], reverse=True)
    
    # Printing the query movie details
    print("Selected Movie:")
    print(movie.iloc[querymovie, 1])  # The movie title is in the second column (position 1)
    
    print("\n")
    
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    j = 1
    for i in range(1, k+1):
        m_Index = similar[i][0]
        print('{}. {}'.format(j, movie.iloc[m_Index, 1]))  # The movie title is in the second column (position 1)
        print('-'*125)
        j=j+1

In [60]:
print_most_similar_movies_pearson(rating_std, new_movies, 1, 5, pearsSim)

Selected Movie:
Ace Ventura: Pet Detective (1994)


Top 5 recommended movies are:
1. Stargate (1994)
-----------------------------------------------------------------------------------------------------------------------------
2. Ocean's Eleven (2001)
-----------------------------------------------------------------------------------------------------------------------------
3. Heat (1995)
-----------------------------------------------------------------------------------------------------------------------------
4. Mask, The (1994)
-----------------------------------------------------------------------------------------------------------------------------
5. Indiana Jones and the Temple of Doom (1984)
-----------------------------------------------------------------------------------------------------------------------------


> **Using code with inbuilt function**

In [61]:
# Similarity matrix using Pearson correlation
movie_similarity_pear = rating_std_df.corr(method='pearson')
movie_similarity_pear.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,128,129,130,131,132,133,134,135,136,137
0,1.0,0.651131,0.766565,0.772486,0.782361,0.767673,0.698064,0.796909,0.727505,0.828525,...,0.774091,0.739207,0.684017,0.788284,0.724392,0.784502,0.773774,0.668739,0.768142,0.71656
1,0.651131,1.0,0.691277,0.623863,0.650664,0.63003,0.59123,0.690783,0.678938,0.702455,...,0.703001,0.634819,0.668211,0.70686,0.61101,0.709428,0.696358,0.689305,0.671891,0.670202
2,0.766565,0.691277,1.0,0.750903,0.776707,0.785324,0.7044,0.838114,0.779552,0.802319,...,0.842154,0.751004,0.749763,0.843013,0.746276,0.82993,0.819293,0.725357,0.788627,0.781129
3,0.772486,0.623863,0.750903,1.0,0.878321,0.781792,0.707548,0.797023,0.739305,0.819255,...,0.79991,0.758257,0.720858,0.778973,0.743908,0.819608,0.804868,0.702258,0.766927,0.752274
4,0.782361,0.650664,0.776707,0.878321,1.0,0.789621,0.701375,0.81394,0.753367,0.808695,...,0.816623,0.761178,0.728822,0.804063,0.741981,0.831595,0.806081,0.70297,0.77654,0.771142
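
The Pearson values here are spread much more widely than the cosine values computed earlier, which all sit close to 1. One reason is that every rating is positive, so the raw rating columns point in similar directions. Pearson correlation subtracts each column's mean first and is equivalent to cosine similarity on the mean-centered columns. The sketch below checks that equivalence on a toy table. The data and the helper `cosine_matrix` are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [4.5, 4.0, 0.5, 3.5],
    "b": [4.5, 4.0, 0.5, 3.0],
    "c": [3.0, 4.0, 0.5, 4.0],
})

def cosine_matrix(values):
    """Cosine similarity between the columns of a 2-D array."""
    unit = values / np.linalg.norm(values, axis=0)
    return unit.T @ unit

raw = toy.to_numpy()
centered = raw - raw.mean(axis=0)        # subtract each column's mean

pearson_via_cosine = cosine_matrix(centered)
pearson_builtin = toy.corr(method="pearson").to_numpy()

print(np.allclose(pearson_via_cosine, pearson_builtin))  # True
```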


The get_similar_movies_pearson function takes three parameters: queryMovie, user_rating, and k. It assumes that movie_similarity_pear is a dataframe containing Pearson correlation similarity scores between movies, and new_movies is a dataframe containing movie information.

The function calculates the similarity scores by multiplying the Pearson correlation similarity between the queryMovie and all other movies with the corresponding user_rating. It then sorts the similarity scores in descending order.

The function prints the details of the queryMovie, including its title, and then prints the top k recommended movies along with their corresponding index. Each recommended movie's details are printed using the new_movies dataframe.

In [62]:
def get_similar_movies_pearson(queryMovie, user_rating, k):
    movie_name = new_movies.iloc[queryMovie, 1]
    sim_score = movie_similarity_pear.iloc[:, queryMovie] * user_rating
    sim_score = sim_score.sort_values(ascending=False)
    
    # Printing the query movie details
    print("Selected Movie:")
    print(movie_name)
    print("\n")
    
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, (movie_id, score) in enumerate(sim_score.iloc[1:k+1].items(), 1):
        movie_name = new_movies.iloc[movie_id, 1]
        print("{}. {}".format(i, movie_name))
        print("-" * 125)


In [63]:
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

In [66]:
get_similar_movies_pearson(1,5,5)

Selected Movie:
Ace Ventura: Pet Detective (1994)


Top 5 recommended movies are:
1. Stargate (1994)
-----------------------------------------------------------------------------------------------------------------------------
2. Ocean's Eleven (2001)
-----------------------------------------------------------------------------------------------------------------------------
3. Heat (1995)
-----------------------------------------------------------------------------------------------------------------------------
4. Mask, The (1994)
-----------------------------------------------------------------------------------------------------------------------------
5. Indiana Jones and the Temple of Doom (1984)
-----------------------------------------------------------------------------------------------------------------------------


### **Using predicted ratings using SVD Estimate Method**

#### **Option 1: Cosine Distance (Cosine Similarity)**

> **Using code with manual calculation**

In [67]:
print_most_similar_movies_cosine(rating_svd, new_movies, 1, 5, distCosine)

Selected Movie:
Ace Ventura: Pet Detective (1994)


Top 5 recommended movies are:
1. Braveheart (1995)
-----------------------------------------------------------------------------------------------------------------------------
2. Forrest Gump (1994)
-----------------------------------------------------------------------------------------------------------------------------
3. Pulp Fiction (1994)
-----------------------------------------------------------------------------------------------------------------------------
4. Schindler's List (1993)
-----------------------------------------------------------------------------------------------------------------------------
5. Star Wars: Episode I - The Phantom Menace (1999)
-----------------------------------------------------------------------------------------------------------------------------


> **Using code with inbuilt function**

In [68]:
rating_svd_df = pd.DataFrame(rating_svd)
rating_svd_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,128,129,130,131,132,133,134,135,136,137
0,4.5,4.5,4.5,4.0,4.5,4.5,5.0,5.0,4.5,4.0,...,4.5,4.5,3.0,4.5,5.0,4.5,4.5,4.5,5.0,5.0
1,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
2,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
3,3.5,3.5,4.0,3.5,3.5,3.5,5.0,3.5,3.5,3.5,...,3.5,2.0,3.5,3.5,3.5,3.5,3.5,3.5,4.0,3.5
4,3.5,3.0,4.0,3.5,3.5,3.5,3.5,3.5,3.5,3.5,...,3.5,3.5,3.5,3.5,4.0,3.5,3.5,3.5,3.5,3.5


In [69]:
# Similarity matrix using cosine similarity
movie_similarity_svd = cosine_similarity(rating_svd_df.T)
print(movie_similarity_svd)

[[1.         0.98020239 0.9893871  ... 0.98293205 0.98970202 0.98742738]
 [0.98020239 1.         0.98242113 ... 0.98114321 0.98157178 0.9814903 ]
 [0.9893871  0.98242113 1.         ... 0.98584472 0.99066491 0.99034309]
 ...
 [0.98293205 0.98114321 0.98584472 ... 1.         0.98558126 0.98532828]
 [0.98970202 0.98157178 0.99066491 ... 0.98558126 1.         0.99001515]
 [0.98742738 0.9814903  0.99034309 ... 0.98532828 0.99001515 1.        ]]


In [70]:
movie_similarity_svd_df = pd.DataFrame(movie_similarity_svd, index=rating_svd_df.columns, columns=rating_svd_df.columns)
movie_similarity_svd_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,128,129,130,131,132,133,134,135,136,137
0,1.000000,0.980202,0.989387,0.989695,0.990155,0.989906,0.985682,0.991328,0.987321,0.992588,...,0.990189,0.988494,0.984507,0.991055,0.987828,0.990756,0.990281,0.982932,0.989702,0.987427
1,0.980202,1.000000,0.982421,0.978822,0.980277,0.979604,0.976453,0.982837,0.981557,0.983364,...,0.983346,0.979686,0.980438,0.983679,0.978428,0.983739,0.983070,0.981143,0.981572,0.981490
2,0.989387,0.982421,1.000000,0.988777,0.989954,0.990728,0.986044,0.993109,0.989789,0.991531,...,0.993170,0.989078,0.987755,0.993363,0.988858,0.992731,0.992268,0.985845,0.990665,0.990343
3,0.989695,0.978822,0.988777,1.000000,0.994547,0.990621,0.986237,0.991443,0.987969,0.992289,...,0.991401,0.989441,0.986393,0.990794,0.988802,0.992339,0.991703,0.984717,0.989752,0.989118
4,0.990155,0.980277,0.989954,0.994547,1.000000,0.990972,0.985965,0.992161,0.988632,0.991859,...,0.992129,0.989585,0.986791,0.991834,0.988736,0.992855,0.991770,0.984766,0.990189,0.989962
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133,0.990756,0.983739,0.992731,0.992339,0.992855,0.993319,0.988655,0.995027,0.991283,0.993655,...,0.994731,0.992201,0.989770,0.995343,0.991239,1.000000,0.994618,0.987902,0.992119,0.993226
134,0.990281,0.983070,0.992268,0.991703,0.991770,0.992118,0.988224,0.994068,0.990651,0.993476,...,0.993495,0.992484,0.988088,0.995855,0.990848,0.994618,1.000000,0.986671,0.991791,0.991440
135,0.982932,0.981143,0.985845,0.984717,0.984766,0.985144,0.979537,0.987131,0.985742,0.986815,...,0.987401,0.984030,0.984900,0.987607,0.980443,0.987902,0.986671,1.000000,0.985581,0.985328
136,0.989702,0.981572,0.990665,0.989752,0.990189,0.991331,0.986045,0.993451,0.988837,0.992344,...,0.992027,0.990678,0.987065,0.993309,0.989456,0.992119,0.991791,0.985581,1.000000,0.990015


The get_similar_movies_cosine_svd function takes three parameters: queryMovie, user_rating, and k. It assumes that movie_similarity_svd_df is a dataframe of cosine similarity scores between movies, computed from the SVD-estimated ratings, and that new_movies is a dataframe containing movie information.

The function calculates the similarity scores by multiplying the cosine similarity between the queryMovie and all other movies with the corresponding user_rating. It then sorts the similarity scores in descending order.

The function prints the details of the queryMovie, including its title, and then prints the top k recommended movies along with their corresponding index. Each recommended movie's details are printed using the new_movies dataframe.

In [71]:
def get_similar_movies_cosine_svd(queryMovie, user_rating, k):
    sim_score = movie_similarity_svd_df[queryMovie] * user_rating
    sim_score = sim_score.sort_values(ascending=False)
    similar_movies_ids = sim_score.index[1:k+1]  # Exclude the first movie and select the next k movies
    # Printing the query movie details
    print("Selected Movie:")
    print(new_movies.iloc[queryMovie, 1])
    print("\n")
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    j = 1
    for i in similar_movies_ids:
        print('{}. {}'.format(j, new_movies.iloc[i, 1]))
        print('-' * 125)
        j += 1

In [72]:
get_similar_movies_cosine_svd(1,5,5)

Selected Movie:
Ace Ventura: Pet Detective (1994)


Top 5 recommended movies are:
1. Stargate (1994)
-----------------------------------------------------------------------------------------------------------------------------
2. Ocean's Eleven (2001)
-----------------------------------------------------------------------------------------------------------------------------
3. Heat (1995)
-----------------------------------------------------------------------------------------------------------------------------
4. Catch Me If You Can (2002)
-----------------------------------------------------------------------------------------------------------------------------
5. Indiana Jones and the Temple of Doom (1984)
-----------------------------------------------------------------------------------------------------------------------------
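
One property of the scoring above is worth spelling out: user_rating is a single positive number, so multiplying every similarity by it rescales the scores without changing their order. The sketch below illustrates this on made-up similarity values. The names `sim_to_query` and `top_k_similar` are hypothetical and not part of the notebook. It also drops the query movie by label rather than by position, which avoids relying on the query sorting first.

```python
import pandas as pd

# Toy similarity column for one query movie (values are made up for illustration).
sim_to_query = pd.Series({0: 0.65, 1: 1.00, 2: 0.69, 3: 0.62, 4: 0.71})

def top_k_similar(sim_column, query_id, user_rating, k):
    """Return the ids of the k most similar items, skipping the query item itself."""
    scores = sim_column * user_rating        # same scaling as in the function above
    scores = scores.drop(index=query_id)     # exclude the query by label, not by position
    return list(scores.sort_values(ascending=False).index[:k])

ranking_at_5 = top_k_similar(sim_to_query, query_id=1, user_rating=5, k=3)
ranking_at_1 = top_k_similar(sim_to_query, query_id=1, user_rating=1, k=3)
print(ranking_at_5 == ranking_at_1)
```

In this toy case both calls return the same ranking, which suggests that user_rating affects only the magnitude of the scores, not which movies are recommended.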


#### **Option 2: Pearson Correlation**

> **Using code with manual calculation**

In [73]:
print_most_similar_movies_pearson(rating_svd, new_movies, 1, 5, pearsSim)

Selected Movie:
Ace Ventura: Pet Detective (1994)


Top 5 recommended movies are:
1. Stargate (1994)
-----------------------------------------------------------------------------------------------------------------------------
2. Ocean's Eleven (2001)
-----------------------------------------------------------------------------------------------------------------------------
3. Heat (1995)
-----------------------------------------------------------------------------------------------------------------------------
4. Mask, The (1994)
-----------------------------------------------------------------------------------------------------------------------------
5. Indiana Jones and the Temple of Doom (1984)
-----------------------------------------------------------------------------------------------------------------------------


> **Using code with inbuilt function**

In [74]:
# Similarity matrix using Pearson correlation
movie_similarity_pear_svd = rating_svd_df.corr(method='pearson')
movie_similarity_pear_svd.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,128,129,130,131,132,133,134,135,136,137
0,1.0,0.651131,0.766565,0.772486,0.782361,0.767673,0.698064,0.796909,0.727505,0.828525,...,0.774091,0.739207,0.684017,0.788284,0.724392,0.784502,0.773774,0.668739,0.768142,0.71656
1,0.651131,1.0,0.691277,0.623863,0.650664,0.63003,0.59123,0.690783,0.678938,0.702455,...,0.703001,0.634819,0.668211,0.70686,0.61101,0.709428,0.696358,0.689305,0.671891,0.670202
2,0.766565,0.691277,1.0,0.750903,0.776707,0.785324,0.7044,0.838114,0.779552,0.802319,...,0.842154,0.751004,0.749763,0.843013,0.746276,0.82993,0.819293,0.725357,0.788627,0.781129
3,0.772486,0.623863,0.750903,1.0,0.878321,0.781792,0.707548,0.797023,0.739305,0.819255,...,0.79991,0.758257,0.720858,0.778973,0.743908,0.819608,0.804868,0.702258,0.766927,0.752274
4,0.782361,0.650664,0.776707,0.878321,1.0,0.789621,0.701375,0.81394,0.753367,0.808695,...,0.816623,0.761178,0.728822,0.804063,0.741981,0.831595,0.806081,0.70297,0.77654,0.771142


The get_similar_movies_pearson_svd function mirrors get_similar_movies_cosine_svd but uses Pearson correlation instead of cosine similarity. It assumes that movie_similarity_pear_svd is a dataframe of Pearson correlation scores between movies, computed from the SVD-estimated ratings.

The function scales the Pearson correlation between the queryMovie and every other movie by user_rating, then sorts the scores in descending order. As before, the first entry is skipped on the assumption that it is the queryMovie itself.

The function prints the title of the queryMovie and then the top k recommended movies, numbered by rank. Titles are looked up in the new_movies dataframe.

In [75]:
def get_similar_movies_pearson_svd(queryMovie, user_rating, k):
    movie_name = new_movies.iloc[queryMovie, 1]
    sim_score = movie_similarity_pear_svd.iloc[:, queryMovie] * user_rating
    sim_score = sim_score.sort_values(ascending=False)
    
    # Printing the query movie details
    print("Selected Movie:")
    print(movie_name)
    print("\n")
    
    # Print the most similar movies
    print("Top {} recommended movies are:".format(k))
    for i, (movie_id, score) in enumerate(sim_score.iloc[1:k+1].items(), 1):
        movie_name = new_movies.iloc[movie_id, 1]
        print("{}. {}".format(i, movie_name))
        print("-" * 125)

In [76]:
get_similar_movies_pearson_svd(1, 5, 5)

Selected Movie:
Ace Ventura: Pet Detective (1994)


Top 5 recommended movies are:
1. Stargate (1994)
-----------------------------------------------------------------------------------------------------------------------------
2. Ocean's Eleven (2001)
-----------------------------------------------------------------------------------------------------------------------------
3. Heat (1995)
-----------------------------------------------------------------------------------------------------------------------------
4. Mask, The (1994)
-----------------------------------------------------------------------------------------------------------------------------
5. Indiana Jones and the Temple of Doom (1984)
-----------------------------------------------------------------------------------------------------------------------------


## **Approach 2:  User-to-User CF (User based Collaborative Filtering)**

> **Find similar users and recommend movies that they like**

User-based collaborative filtering is a recommendation technique that utilizes user-product interactions to make personalized recommendations. The underlying assumption is that users who have similar preferences tend to like similar products.

The algorithm for user-based collaborative filtering typically involves the following steps:

1. Discover users who exhibit similar preferences by examining their interactions with common items.
2. Determine the items that are highly rated by these similar users but have not been experienced by the user of interest.
3. Compute a weighted average score for each item, considering the ratings provided by the similar users.
4. Rank the items based on their scores and select the top "k" items to recommend.
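
The four steps above can be sketched end to end on a tiny, invented rating matrix. This is a minimal illustration under stated assumptions, not the notebook's implementation: the function name `recommend`, its parameters, and the toy data are hypothetical, and the score is normalized by the sum of absolute similarities (a common variant), whereas the notebook's code further down divides by the number of raters.

```python
import numpy as np
import pandas as pd

# Toy user x movie ratings (NaN = not rated); values are invented for illustration.
ratings = pd.DataFrame(
    {"A": [5.0, 4.0, 1.0, np.nan],
     "B": [3.0, 2.0, 5.0, 1.0],
     "C": [4.0, 3.0, 2.0, 5.0],
     "D": [np.nan, 5.0, 1.0, 4.0]},
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

def recommend(ratings, target, n_neighbors=2, k=1):
    # Step 1: Pearson similarity between users, then keep the closest neighbors.
    sims = ratings.T.corr()[target].drop(index=target).dropna()
    neighbors = sims.sort_values(ascending=False).head(n_neighbors)
    # Step 2: candidate items are those the target user has not rated.
    unseen = ratings.columns[ratings.loc[target].isna().to_numpy()]
    # Step 3: similarity-weighted average of the neighbors' ratings per candidate.
    scores = {}
    for movie in unseen:
        rated = ratings.loc[neighbors.index, movie].dropna()
        if rated.empty:
            continue
        weights = neighbors.loc[rated.index]
        scores[movie] = (weights * rated).sum() / weights.abs().sum()
    # Step 4: rank the candidates and return the top k.
    return pd.Series(scores, dtype=float).sort_values(ascending=False).head(k)

top = recommend(ratings, target=1)
print(top)
```

Here user 1 has not rated movie D, and the two most similar users rated it 5 and 4, so D is the only candidate and receives a score close to 4.5.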

In [77]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   movieId  100836 non-null  int64  
 1   title    100836 non-null  object 
 2   userId   100836 non-null  int64  
 3   rating   100836 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 3.8+ MB


In [78]:
# Aggregating by movie
agg_ratings = data.groupby('title').agg(mean_rating = ('rating','mean'), number_of_ratings = ('rating', 'count')).reset_index()

# Keeping the movies with over 100 ratings
agg_ratings_gt100 = agg_ratings[agg_ratings['number_of_ratings']>100]
agg_ratings_gt100.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134 entries, 74 to 9615
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   title              134 non-null    object 
 1   mean_rating        134 non-null    float64
 2   number_of_ratings  134 non-null    int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 4.2+ KB


Let's check what the most popular movies and their ratings are.

In [79]:
# Checking popular movies
agg_ratings_gt100.sort_values(by='number_of_ratings', ascending = False).head()

Unnamed: 0,title,mean_rating,number_of_ratings
3158,Forrest Gump (1994),4.164134,329
7593,"Shawshank Redemption, The (1994)",4.429022,317
6865,Pulp Fiction (1994),4.197068,307
7680,"Silence of the Lambs, The (1991)",4.16129,279
5512,"Matrix, The (1999)",4.192446,278


In [80]:
# Merge data
df_gt100 = pd.merge(data, agg_ratings_gt100[['title']], on='title', how='inner')
df_gt100.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19788 entries, 0 to 19787
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  19788 non-null  int64  
 1   title    19788 non-null  object 
 2   userId   19788 non-null  int64  
 3   rating   19788 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 773.0+ KB


In [81]:
print(df_gt100['userId'].nunique()) # unique users
print(df_gt100['movieId'].nunique()) # unique movies
print(df_gt100['rating'].nunique()) # unique ratings
print(sorted(df_gt100['rating'].unique()))

597
134
10
[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]


**User-Movie Matrix**

In [82]:
matrix = df_gt100.pivot_table(index='userId', columns='title', values='rating')
matrix.head()

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),American Pie (1999),Apocalypse Now (1979),...,True Lies (1994),"Truman Show, The (1998)",Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Up (2009),"Usual Suspects, The (1995)",WALL·E (2008),Waterworld (1995),Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,4.0,,,5.0,5.0,,4.0,...,,,,3.0,,5.0,,,5.0,5.0
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,4.0,,,,5.0,,,,...,,,2.0,,,,,,4.0,
5,,3.0,4.0,,,,,,,,...,2.0,,,,,4.0,,,,


**Normalizing user_ratings matrix**

By subtracting the row-wise mean from the original ratings, the normalization operation centers the ratings around zero. This can be useful for collaborative filtering algorithms, as it removes the inherent bias introduced by different users having different rating scales or tendencies. Normalization allows for a more accurate comparison of ratings across users and items, enabling meaningful similarity calculations and personalized recommendations.

In [83]:
# Normalizing user-item matrix

matrix_norm = matrix.subtract(matrix.mean(axis=1), axis='rows')
matrix_norm.head()

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),American Pie (1999),Apocalypse Now (1979),...,True Lies (1994),"Truman Show, The (1998)",Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Up (2009),"Usual Suspects, The (1995)",WALL·E (2008),Waterworld (1995),Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,-0.392857,,,0.607143,0.607143,,-0.392857,...,,,,-1.392857,,0.607143,,,0.607143,0.607143
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,0.617647,,,,1.617647,,,,...,,,-1.382353,,,,,,0.617647,
5,,-0.461538,0.538462,,,,,,,,...,-1.461538,,,,,0.538462,,,,
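
To make the effect of mean-centering concrete, here is a small sketch with two hypothetical users who rank the same movies in the same order but use different parts of the rating scale. The data and the names `toy` and `toy_norm` are invented for illustration. The operation is the same row-wise subtraction used above.

```python
import numpy as np
import pandas as pd

# Two hypothetical users with the same tastes but different rating scales.
toy = pd.DataFrame(
    {"Movie X": [5.0, 3.0],
     "Movie Y": [4.0, 2.0],
     "Movie Z": [3.0, 1.0],
     "Movie W": [np.nan, 2.0]},
    index=pd.Index(["generous", "strict"], name="userId"),
)

# Subtract each user's own mean; unrated entries are skipped by mean() and stay NaN.
toy_norm = toy.subtract(toy.mean(axis=1), axis="index")
print(toy_norm)
```

After centering, both users show the same pattern of 1, 0, -1 on the three shared movies, even though their raw ratings differ by two points.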


Next, we identify similar users using Pearson correlation.

**User Similarity matrix using Pearson correlation**

In [84]:
user_similarity_pearson = matrix_norm.T.corr()
user_similarity_pearson.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,,,0.391797,0.180151,-0.439941,-0.029894,0.464277,1.0,-0.037987,...,0.091574,0.254514,0.101482,-0.5,0.78002,0.303854,-0.012077,0.242309,-0.175412,0.071553
2,,1.0,,,,,,,,1.0,...,-0.583333,,-1.0,,,0.583333,,-0.229416,,0.765641
3,,,,,,,,,,,...,,,,,,,,,,
4,0.391797,,,1.0,-0.394823,0.421927,0.704669,0.055442,,0.360399,...,-0.239325,0.5625,0.162301,-0.158114,0.905134,0.021898,-0.020659,-0.286872,,-0.050868
5,0.180151,,,-0.394823,1.0,-0.006888,0.328889,0.030168,,-0.777714,...,0.0,0.231642,0.131108,0.068621,-0.245026,0.377341,0.228218,0.263139,0.384111,0.040582


We illustrate the process of finding similar users with user ID 1 as an example. First we exclude user ID 1 from the list of candidate users, then we choose how many similar users to consider.

In [85]:
# Desired User
desired_user = 1

# Removing desired user from the similar user list
user_similarity_pearson.drop(index=desired_user, inplace=True)

user_similarity_pearson.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,,1.0,,,,,,,,1.0,...,-0.583333,,-1.0,,,0.583333,,-0.229416,,0.765641
3,,,,,,,,,,,...,,,,,,,,,,
4,0.391797,,,1.0,-0.394823,0.421927,0.704669,0.055442,,0.360399,...,-0.239325,0.5625,0.162301,-0.158114,0.905134,0.021898,-0.020659,-0.286872,,-0.050868
5,0.180151,,,-0.394823,1.0,-0.006888,0.328889,0.030168,,-0.777714,...,0.0,0.231642,0.131108,0.068621,-0.245026,0.377341,0.228218,0.263139,0.384111,0.040582
6,-0.439941,,,0.421927,-0.006888,1.0,0.0,-0.127385,,0.957427,...,-0.29277,-0.030599,-0.123983,-0.176327,0.063861,-0.468008,0.541386,-0.337129,0.158255,-0.030567


In the `user_similarity_pearson` matrix, the values range from -1 to 1, where -1 indicates opposite movie preferences and 1 indicates the same movie preferences. 

To perform user-based collaborative filtering, we set the number of similar users to consider to 10 (`n = 10`), so we select the 10 users most similar to user ID 1.

We also define a positive threshold, `user_similarity_threshold = 0.3`, intended as the minimum Pearson correlation a user must have with user ID 1 to count as similar. The cell below defines this threshold but does not filter on it; the printed list is simply the top `n` users by correlation, all of which are well above 0.3.

We then sort the user similarity values in descending order and print the IDs of the most similar users along with their Pearson correlation coefficients.

In [86]:
# No. of similar users
n = 10

# User similarity threshold
user_similarity_threshold = 0.3

# Retrieving correlations of desired user with all users
similar_users = user_similarity_pearson[desired_user].sort_values(ascending=False)

# Selecting top n similar users
top_similar_users = similar_users.head(n)

# Printing the ID and correlation
print(f'The similar users for user {desired_user} are:')
for user_id, correlation in top_similar_users.items():
    print(f'User ID: {user_id}, Correlation: {correlation}')


The similar users for user 1 are:
User ID: 598, Correlation: 1.0
User ID: 9, Correlation: 1.0
User ID: 502, Correlation: 1.0
User ID: 108, Correlation: 1.0
User ID: 550, Correlation: 1.0
User ID: 401, Correlation: 0.9428090415820632
User ID: 511, Correlation: 0.9258200997725515
User ID: 366, Correlation: 0.8728715609439694
User ID: 154, Correlation: 0.8660254037844387
User ID: 595, Correlation: 0.8660254037844385
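
As a sketch of how the similarity threshold could be applied before taking the top `n` users, consider the following. The function name `select_neighbors` and the correlation values are hypothetical; this is one possible way to combine the two settings, not the notebook's code.

```python
import numpy as np
import pandas as pd

# Hypothetical correlations of other users with the target user (NaN = no overlap).
corr_with_target = pd.Series(
    {2: np.nan, 4: 0.39, 5: 0.18, 6: -0.44, 8: 0.46, 9: 1.00},
    name="similarity",
)

def select_neighbors(similarities, threshold=0.3, n=10):
    """Keep users at or above the threshold, then take the n most similar."""
    kept = similarities[similarities >= threshold]  # NaN compares False, so it is dropped
    return kept.sort_values(ascending=False).head(n)

neighbors = select_neighbors(corr_with_target, threshold=0.3, n=2)
print(neighbors)
```

Filtering first has the side effect of removing users whose correlation is undefined, so only users with a usable similarity remain as neighbors.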


Next, we remove the movies the desired user has already watched and keep only the movies watched by similar users.

In [87]:
desired_user_watched = matrix_norm[matrix_norm.index == desired_user].dropna(axis=1, how='all')
desired_user_watched

title,Alien (1979),American Beauty (1999),American History X (1998),Apocalypse Now (1979),Back to the Future (1985),Batman (1989),"Big Lebowski, The (1998)",Braveheart (1995),Clear and Present Danger (1994),Clerks (1994),...,Star Wars: Episode IV - A New Hope (1977),Star Wars: Episode V - The Empire Strikes Back (1980),Star Wars: Episode VI - Return of the Jedi (1983),Stargate (1994),"Terminator, The (1984)",Toy Story (1995),Twister (1996),"Usual Suspects, The (1995)",Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,-0.392857,0.607143,0.607143,-0.392857,0.607143,-0.392857,0.607143,-0.392857,-0.392857,-1.392857,...,0.607143,0.607143,0.607143,-1.392857,0.607143,-0.392857,-1.392857,0.607143,0.607143,0.607143


In [88]:
similar_user_movies = matrix_norm[matrix_norm.index.isin(similar_users.index)].dropna(axis=1, how='all')
similar_user_movies

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),American Pie (1999),Apocalypse Now (1979),...,True Lies (1994),"Truman Show, The (1998)",Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Up (2009),"Usual Suspects, The (1995)",WALL·E (2008),Waterworld (1995),Willy Wonka & the Chocolate Factory (1971),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,0.617647,,,,1.617647,,,,...,,,-1.382353,,,,,,0.617647,
5,,-0.461538,0.538462,,,,,,,,...,-1.461538,,,,,0.538462,,,,
6,,-0.877551,1.122449,,,,,,,,...,0.122449,,0.122449,1.122449,,-2.877551,,-0.877551,-0.877551,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,1.097826,,,0.097826,-0.402174,0.597826,0.597826,0.097826,-2.902174,0.597826,...,,0.597826,0.097826,,,0.597826,0.097826,,,
607,,,,-0.900000,,,-0.900000,,,,...,0.100000,,,1.100000,,,,-0.900000,,-0.900000
608,-0.533613,-0.033613,-0.533613,0.466387,0.966387,,1.466387,0.466387,-1.033613,-0.533613,...,-0.533613,0.966387,-0.033613,-0.533613,,0.966387,,-0.533613,-0.033613,0.466387
609,,,,,,,,,,,...,,,,,,,,-0.333333,,


Now, we drop the movies that user ID 1 watched from the similar user movie list. Passing errors='ignore' drops the listed columns that are present and silently skips any that are not, instead of raising an error.

In [89]:
# Removing the watched movie from the movie list
similar_user_movies.drop(desired_user_watched.columns, axis=1, inplace = True, errors = 'ignore')
similar_user_movies

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Pie (1999),Apollo 13 (1995),Austin Powers: The Spy Who Shagged Me (1999),Babe (1995),Batman Begins (2005),...,Terminator 2: Judgment Day (1991),There's Something About Mary (1998),Titanic (1997),Trainspotting (1996),True Lies (1994),"Truman Show, The (1998)",Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Up (2009),WALL·E (2008),Waterworld (1995)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,0.617647,,,,,0.617647,,,...,,-0.382353,,,,,-1.382353,,,
5,,-0.461538,0.538462,,,,-0.461538,,0.538462,,...,-0.461538,,,,-1.461538,,,,,
6,,-0.877551,1.122449,,,,0.122449,,0.122449,,...,-0.877551,,,,0.122449,,0.122449,,,-0.877551
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,1.097826,,,-0.402174,0.597826,-2.902174,,,,,...,-0.402174,-0.902174,0.097826,0.097826,,0.597826,0.097826,,0.097826,
607,,,,,,,1.100000,,-0.900000,,...,0.100000,,,,0.100000,,,,,-0.900000
608,-0.533613,-0.033613,-0.533613,0.966387,,-1.033613,-1.533613,-0.533613,-0.033613,,...,-0.533613,-0.033613,-1.533613,-0.533613,-0.533613,0.966387,-0.033613,,,-0.533613
609,,,,,,,-0.333333,,,,...,-0.333333,,,,,,,,,-0.333333


In the recommendation process, we determine which movies to recommend to the target user based on a weighted average of user similarity scores and movie ratings. The similarity scores are used to assign weights to the movie ratings, giving higher weights to users who are more similar to the target user.

The following code iterates through the items and users to calculate the item score. The item scores are then ranked in descending order, and the top 10 movies are selected as recommendations for user 1.

In [90]:
# Dictionary to store item scores
item_score = {}

# Looping through items
for i in similar_user_movies.columns:
  # Getting the ratings for movie i
  movie_rating = similar_user_movies[i]
  # Creating a variable to store the score
  total = 0
  # Creating a variable to store the number of scores
  count=0
  # Looping through similar users
  for u in similar_users.index:
    # If the movie has a rating from this user
    if not pd.isna(movie_rating[u]):
      # Score is the user similarity score multiplied by the movie rating
      score = similar_users[u]*movie_rating[u]
      # Adding the score to the total score for the movie so far
      total += score
      # Adding 1 to the count
      count +=1
  # Getting the average score for the item
  item_score[i] = total/count 

# Converting dictionary to pandas dataframe
item_score = pd.DataFrame(item_score.items(), columns = ['movie', 'movie_score'])

# Sorting the movies by score
ranked_item_score = item_score.sort_values(by='movie_score', ascending = False)

# Select top 10 movies
m = 10
ranked_item_score['movie'].head(m)


68                    Terminator 2: Judgment Day (1991)
71                                 Trainspotting (1996)
3                                         Aliens (1986)
28                            Fifth Element, The (1997)
69                  There's Something About Mary (1998)
23                                      Die Hard (1988)
42    Interview with the Vampire: The Vampire Chroni...
30                                     Firm, The (1993)
35                                     GoldenEye (1995)
19                                  Crimson Tide (1995)
Name: movie, dtype: object
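
The nested loop above can also be written with vectorized pandas operations. The sketch below is a hypothetical alternative on invented data, with made-up names (`score_items`, `sims`, `centered`). Unlike the loop, it skips users whose similarity is NaN; in the loop, a NaN similarity multiplied by a rating makes the movie's total NaN.

```python
import numpy as np
import pandas as pd

# Toy inputs (invented): similarities to the target user and mean-centered ratings.
sims = pd.Series({2: 0.8, 3: 0.5, 4: np.nan})
centered = pd.DataFrame(
    {"Movie A": [1.0, -1.0, 0.5],
     "Movie B": [np.nan, 2.0, 1.0],
     "Movie C": [np.nan, np.nan, 1.5]},
    index=[2, 3, 4],
)

def score_items(centered, sims):
    sims = sims.dropna()                      # users with undefined similarity contribute nothing
    rows = centered.loc[sims.index]
    weighted = rows.mul(sims, axis=0)         # similarity x rating, NaN where unrated
    counts = rows.notna().sum(axis=0)         # number of raters per movie
    scores = weighted.sum(axis=0, min_count=1) / counts
    return scores.dropna().sort_values(ascending=False)

ranked = score_items(centered, sims)
print(ranked)
```

Movies rated only by users without a defined similarity (Movie C here) drop out of the ranking instead of appearing with an undefined score.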

## **Recommendation Engines**

> ### **Item-Based Collaborative Filtering**

In [91]:
print("Combination1: Standard Estimate Method | Cosine Similarity")
print("-"*50)

# Using code with manual calculation using cosine similarity as similarity measure
print('Cosine Similarity by manual calculation:')
print("="*40)
print_most_similar_movies_cosine(rating_std, new_movies, 3, 5, distCosine)
print('\n')

print('Cosine Similarity by inbuilt function:')
print("="*40)
# Using code with inbuilt function using cosine_similarity as distance measure
get_similar_movies_cosine(3,5,5)
print('\n')

print("Combination2: Standard Estimate Method | Pearson Correlation")
print("-"*50)

print('Pearson Similarity by manual calculation:')
print("="*40)
# Using code with manual calculation using pearson correlation as similarity measure
print_most_similar_movies_pearson(rating_std, new_movies, 3, 5, pearsSim)
print('\n')

print('Pearson Similarity by inbuilt function:')
print("="*40)
# Using code with inbuilt function using Pearson correlation as distance measure
get_similar_movies_pearson(3, 5, 5)
print('\n')

print("Combination3: SVD Estimate Method | Cosine Similarity")
print("-"*50)

# Using code with manual calculation using cosine similarity as similarity measure
print('Cosine Similarity by manual calculation:')
print("="*40)
print_most_similar_movies_cosine(rating_svd, new_movies, 3, 5, distCosine)
print('\n')

print('Cosine Similarity by inbuilt function:')
print("="*40)
# Using code with inbuilt function using cosine_similarity as distance measure
get_similar_movies_cosine_svd(3,5,5)
print('\n')

print("Combination4: SVD Estimate Method | Pearson Correlation")
print("-"*50)

print('Pearson Similarity by manual calculation:')
print("="*40)
# Using code with manual calculation using pearson correlation as similarity measure
print_most_similar_movies_pearson(rating_svd, new_movies, 3, 5, pearsSim)
print('\n')

print('Pearson Similarity by inbuilt function:')
print("="*40)
# Using code with inbuilt function using Pearson correlation as distance measure
get_similar_movies_pearson_svd(3, 5, 5)
print('\n')

Combination1: Standard Estimate Method | Cosine Similarity
--------------------------------------------------
Cosine Similarity by manual calculation:
Selected Movie:
Alien (1979)


Top 5 recommended movies are:
1. Ace Ventura: Pet Detective (1994)
-----------------------------------------------------------------------------------------------------------------------------
2. Star Wars: Episode I - The Phantom Menace (1999)
-----------------------------------------------------------------------------------------------------------------------------
3. Pulp Fiction (1994)
-----------------------------------------------------------------------------------------------------------------------------
4. Dumb & Dumber (Dumb and Dumber) (1994)
-----------------------------------------------------------------------------------------------------------------------------
5. Austin Powers: The Spy Who Shagged Me (1999)
----------------------------------------------------------------------------------

> ### **User-Based Collaborative Filtering**

In [92]:
print("Using Pearson Similarity:")
print(ranked_item_score['movie'].head(10))



Using Pearson Similarity:
68                    Terminator 2: Judgment Day (1991)
71                                 Trainspotting (1996)
3                                         Aliens (1986)
28                            Fifth Element, The (1997)
69                  There's Something About Mary (1998)
23                                      Die Hard (1988)
42    Interview with the Vampire: The Vampire Chroni...
30                                     Firm, The (1993)
35                                     GoldenEye (1995)
19                                  Crimson Tide (1995)
Name: movie, dtype: object


## **Comparison of rating predictions**


The test function evaluates a collaborative filtering recommendation system for a given estimation method. It takes the following parameters:

* dataMat: the user-item rating matrix.
* test_ratio: the fraction of ratings held out for testing.
* estMethod: the estimation method used for rating prediction.

The function iterates over each user in dataMat and calls cross_validate_user, one of the helper functions listed earlier, with Pearson similarity (pearsSim) as the similarity measure. That helper is not reproduced in this section; judging from how it is called, it presumably splits the user's ratings into training and testing sets, predicts the held-out ratings with the estimation method, and returns the accumulated error together with the number of test ratings.

The function sums the errors and rating counts over all users, computes the Mean Absolute Error (MAE) by dividing the total error by the total count, and prints the MAE for the given estimation method.

In [93]:
def test(dataMat, test_ratio, estMethod):
    total_error = 0.0
    total_count = 0

    for user in range(dataMat.shape[0]):
        error_u, count_u = cross_validate_user(dataMat, user, test_ratio, estMethod, simMeas=pearsSim)
        total_error += error_u
        total_count += count_u

    MAE = total_error / total_count
    print('Mean Absolute Error for', estMethod.__name__, ':', MAE)
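
For readers who want a feel for the per-user step, the sketch below shows one way a hold-out evaluation could work. It is not the notebook's cross_validate_user: the function `holdout_mae_for_user`, the placeholder estimator `user_mean`, and the toy data are all hypothetical. Only the accumulate-then-divide pattern matches the test function above.

```python
import numpy as np

def holdout_mae_for_user(user_ratings, test_ratio, predict, rng):
    """Hide a share of one user's ratings, predict them, return (abs error sum, count)."""
    rated = np.flatnonzero(~np.isnan(user_ratings))
    n_test = max(1, int(len(rated) * test_ratio))
    test_idx = rng.choice(rated, size=n_test, replace=False)
    train = user_ratings.copy()
    train[test_idx] = np.nan                  # withhold the test ratings
    error = sum(abs(predict(train, i) - user_ratings[i]) for i in test_idx)
    return error, n_test

def user_mean(train, item):
    # Placeholder estimator: predict the user's mean over the remaining ratings.
    return np.nanmean(train)

rng = np.random.default_rng(0)
users = np.array([[4.0, 4.0, 4.0, 4.0, np.nan],
                  [2.0, 2.0, np.nan, 2.0, 2.0]])

total_error, total_count = 0.0, 0
for u in users:
    err, cnt = holdout_mae_for_user(u, 0.25, user_mean, rng)
    total_error += err
    total_count += cnt

mae = total_error / total_count
print("MAE:", mae)
```

Because each toy user gives the same rating to every movie, the mean estimator predicts the hidden rating exactly and the MAE is zero; real data would of course produce a nonzero error.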

### **1. Standard Estimate | Pearson Similarity**

In [94]:
%%time

user = 3
test(rating_std, 0.2, standEst)

Mean Absolute Error for standEst : 0.17549976708385379
CPU times: user 4min 57s, sys: 724 ms, total: 4min 58s
Wall time: 5min 1s


In [95]:
%%time

user = 90
test(rating_std, 0.25, standEst)

Mean Absolute Error for standEst : 0.1701405334150917
CPU times: user 5min 54s, sys: 749 ms, total: 5min 54s
Wall time: 5min 57s


### **2. Singular Value Decomposition | Pearson Similarity**

In [96]:
%%time
test(rating_svd, 0.2, svdEst)

Mean Absolute Error for svdEst : 0.18618882629860567
CPU times: user 13min 44s, sys: 4min 39s, total: 18min 24s
Wall time: 10min 36s


In [97]:
%%time

user = 90
# Note: this call passes rating_std, whereas In [96] passes rating_svd
test(rating_std, 0.25, svdEst)

Mean Absolute Error for svdEst : 0.1875064192755259
CPU times: user 17min 5s, sys: 5min 41s, total: 22min 46s
Wall time: 13min 7s


---