
# Lesson 135: Collaborative Filtering I - Pearson Correlation

---

### Teacher-Student Activities

In the previous class, we designed recommender systems based on popularity and weighted ratings to attract new customers to our media hosting platform.

In this class, we will build a movie recommender using **Collaborative Filters**. This type of recommender suggests movies to a user based on the ratings and watch history of other users who have watched a similar set of movies.

Let us understand collaborative filters in more detail and revisit the problem statement.




---

#### What are Collaborative Filters?

While shopping through e-commerce platforms, you must have encountered:

**Customers who bought Macbook Pro also purchased: 'ProDisplay XR' | 'LG Gaming Monitor' | 'AirPods'**

Some movie hosting/OTT platforms make similar suggestions.

Say you are watching: **Inception**

**Customers also watched: 'The Matrix' | 'Gravity' | 'Tenet'**

Such suggestions are given to a user on the basis of the likes and dislikes of similar users. This is exactly what Collaborative filters do.

**Collaborative filtering** builds a model from the user's past behaviour (i.e. items purchased or searched by the user) as well as similar decisions made by other users. This model is then used to predict items that users may have an interest in.

Let us now understand the problem statement in more detail.

**Problem Statement:**

- We will build an intelligent recommender that recommends movies to a customer, say **X**, based on the customer's watch history.
- To do this, we first find other users who have watched the same movies as **X** along with some other movies, and then suggest to customer **X** the movies that those users appreciated.

<center><img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/8346c283-08f7-46c5-b37f-96a3eac57800.png></center>

In this way the customers are likely to appreciate the recommendation and as a result stay connected to the streaming platform.

Let us now explore the datasets that will be used to solve this problem statement.

---

#### Datasets

We will use the following three datasets to set up a recommender system that recommends movies to a user based on ratings given by other users:

**1. The `movies_metadata.csv` file:**

- This is the main Movies Metadata file.
- It contains information on 45,000 movies featured in the Full [MovieLens](https://movielens.org) database.

  **Note:** This is the same dataset that we used to build simple movie recommenders in the previous lesson.

- The features are described below:

  **Attribute Information:**
  ```
    adult: Indicates if the movie is X-Rated or Adult.
    belongs_to_collection: A stringified dictionary that gives information on the movie series the particular film belongs to.
    budget: The budget of the movie in dollars.
    genres: A stringified list of dictionaries that list out all the genres associated with the movie.
    homepage: The Official Homepage of the movie.
    id: The TMDB ID of the movie.
    imdb_id: The IMDB ID of the movie.
    original_language: The language in which the movie was originally shot in.
    original_title: The original title of the movie.
    overview: A brief blurb of the movie.
    popularity: The Popularity Score assigned by TMDB.
    poster_path: The URL of the poster image.
    production_companies: A stringified list of production companies involved with the making of the movie.
    production_countries: A stringified list of countries where the movie was shot/produced in.
    release_date: Theatrical Release Date of the movie.
    revenue: The total revenue of the movie in dollars.
    runtime: The runtime of the movie in minutes.
    spoken_languages: A stringified list of spoken languages in the film.
    status: The status of the movie (Released, To Be Released, Announced, etc.)
    tagline: The tagline of the movie.
    title: The Official Title of the movie.
    video: Indicates if there is a video present of the movie with TMDB.
    vote_average: The average rating of the movie.
    vote_count: The number of votes by users, as counted by TMDB.
 ```

**2. The `links.csv` file:**

- This file contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
- The features are described below:
  ```
  movieId: A unique identifier for each movie
  imdbId: The IMDB ID of the movie
  tmdbId: The TMDB ID of the movie
  ```


**3. The `ratings_small.csv` file:**

- This file is a subset of 100,000 ratings from 700 users on 9,000 movies.
- The features are described below:
  ```
  userId: The user ID of the subscriber
  movieId: A unique identifier for each movie
  rating: Rating given by a subscriber (Out of 5)
  timestamp: Time at which the rating was recorded
  ```



**Acknowledgement:** These datasets are an ensemble created by Rounak Banik using the data collected from TMDB and GroupLens.

**Dataset Source:** https://www.kaggle.com/rounakbanik/the-movies-dataset

---

#### Activity 1: Importing Modules and Reading Data

Let us load the first dataset `movies_metadata.csv` into a pandas DataFrame.

**The `movies_metadata.csv` Dataset link:** https://drive.google.com/uc?id=1Fa9Y8jOD1H0sa0AdQrj-C2taPdxrUl1q




In [None]:
# S1.1: Import the required modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Load the Movies Metadata Dataset
df = pd.read_csv('https://drive.google.com/uc?id=1Fa9Y8jOD1H0sa0AdQrj-C2taPdxrUl1q')
df.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


For our recommender, we need only three columns from the above dataset: `id`, `imdb_id` and `title`.

Let us create a subset of the above DataFrame that consists of only these three columns.

In [None]:
# S1.2: Create 'movies_df' DataFrame consisting of columns: 'id', 'imdb_id', 'title'
movies_df = df[['id', 'imdb_id', 'title']].copy()  # copy so that later in-place edits do not act on a slice of 'df'
movies_df.head()

Unnamed: 0,id,imdb_id,title
0,862,tt0114709,Toy Story
1,8844,tt0113497,Jumanji
2,15602,tt0113228,Grumpier Old Men
3,31357,tt0114885,Waiting to Exhale
4,11862,tt0113041,Father of the Bride Part II


Next, find the number of rows, columns and data types of columns and determine whether there are any missing values in this DataFrame.

In [None]:
# S1.3: Get the total number of rows and columns, data types of columns and missing values in the dataset
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       45466 non-null  object
 1   imdb_id  45449 non-null  object
 2   title    45460 non-null  object
dtypes: object(3)
memory usage: 1.0+ MB


You may observe that the `imdb_id` and `title` columns have some missing values. Let us simply drop the rows that contain missing values from the above DataFrame.

In [None]:
# S1.4: Drop missing values from the DataFrame.
movies_df.dropna(inplace=True)
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45443 entries, 0 to 45465
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       45443 non-null  object
 1   imdb_id  45443 non-null  object
 2   title    45443 non-null  object
dtypes: object(3)
memory usage: 1.4+ MB


Now there are no missing values in this DataFrame. However, you may observe that the data type of the `id` column is `object`. Since the IDs are numbers, the column should be either `int` or `float`.

Let us convert the data type of the `id` column to `float` using the `astype()` function. This also matches the `tmdbId` column of the `links.csv` dataset that we load next, which pandas reads as `float`.

**Note:** You can also convert the `id` column to the `int` data type.

In [None]:
# S1.5: Convert data type of 'id' column to float
movies_df['id']=movies_df['id'].astype(float)
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45443 entries, 0 to 45465
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       45443 non-null  float64
 1   imdb_id  45443 non-null  object 
 2   title    45443 non-null  object 
dtypes: float64(1), object(2)
memory usage: 1.4+ MB
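
As a small illustration of the conversion above, the sketch below applies `astype()` to a hypothetical toy Series (`toy_ids`, not part of the lesson data). It also shows `pd.to_numeric()` with `errors='coerce'`, which is one possible way to handle a column that contains stray non-numeric text; whether that is needed for the lesson dataset is an assumption and not something the outputs above confirm.

```python
import pandas as pd

# Hypothetical toy data standing in for an 'id' column stored as strings
toy_ids = pd.Series(['862', '8844', '15602'], name='id')

ids_as_float = toy_ids.astype(float)  # values become 862.0, 8844.0, 15602.0
ids_as_int = toy_ids.astype(int)      # values become 862, 8844, 15602

# Assumption for illustration: a column with one unparseable entry.
# errors='coerce' replaces such entries with NaN instead of raising an error.
messy_ids = pd.Series(['862', 'not-an-id', '15602'])
cleaned_ids = pd.to_numeric(messy_ids, errors='coerce')

print(ids_as_float.dtype, ids_as_int.dtype)
print(cleaned_ids.isna().sum(), "entry could not be converted")
```

Rows that end up as `NaN` could then be removed with `dropna()` before converting to `int`.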


Please note that the above `id` column represents the TMDB ID and not the movie ID of a movie. For building a movie recommender, we need the movie ID, which can be obtained from the second dataset `links.csv`.

Let us now load the `links.csv` dataset, which contains the movie ID, IMDB ID and TMDB ID of each movie.

**The `links.csv` Dataset Link:** https://drive.google.com/uc?id=1bcxKJJMhU15qH77BmmdPf4lZwi5k5Cnb

In [None]:
# S1.6: Load 'links.csv' file into 'links_df' DataFrame.
links_df=pd.read_csv('https://drive.google.com/uc?id=1bcxKJJMhU15qH77BmmdPf4lZwi5k5Cnb')
links_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [None]:
movies_df.head()

Unnamed: 0,id,imdb_id,title
0,862.0,tt0114709,Toy Story
1,8844.0,tt0113497,Jumanji
2,15602.0,tt0113228,Grumpier Old Men
3,31357.0,tt0114885,Waiting to Exhale
4,11862.0,tt0113041,Father of the Bride Part II


You may observe that:
- The values of the `id` column of the `movies_df` DataFrame match the values of the `tmdbId` column of the `links_df` DataFrame.
- The values of the `imdb_id` column of the `movies_df` DataFrame correspond to the values of the `imdbId` column of the `links_df` DataFrame, which stores the same number without the `tt` prefix and leading zeros.
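
The second observation can be checked with a short sketch. The two toy DataFrames below (`toy_movies`, `toy_links`) are hypothetical and only copy a couple of rows from the outputs shown above; the helper column `imdb_num` is likewise an illustrative name.

```python
import pandas as pd

# Toy rows copied from the outputs shown above
toy_movies = pd.DataFrame({'imdb_id': ['tt0114709', 'tt0113497'],
                           'title': ['Toy Story', 'Jumanji']})
toy_links = pd.DataFrame({'movieId': [1, 2],
                          'imdbId': [114709, 113497]})

# Drop the two-character 'tt' prefix and convert the rest to an integer,
# which also removes the leading zeros
toy_movies['imdb_num'] = toy_movies['imdb_id'].str[2:].astype(int)

# Compare row by row with the 'imdbId' column
ids_match = (toy_movies['imdb_num'] == toy_links['imdbId']).all()
print(ids_match)
```

The lesson itself merges on the TMDB ID, so this conversion is not required for the steps that follow.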

Let us now merge both these DataFrames using the `merge()` function of `pandas` module.

In [None]:
# S1.7: Merge 'movies_df' and 'links_df' DataFrames
ml_df=pd.merge(movies_df,links_df,left_on='id',right_on='tmdbId')
ml_df.head()

Unnamed: 0,id,imdb_id,title,movieId,imdbId,tmdbId
0,862.0,tt0114709,Toy Story,1,114709,862.0
1,8844.0,tt0113497,Jumanji,2,113497,8844.0
2,15602.0,tt0113228,Grumpier Old Men,3,113228,15602.0
3,31357.0,tt0114885,Waiting to Exhale,4,114885,31357.0
4,11862.0,tt0113041,Father of the Bride Part II,5,113041,11862.0


From the above DataFrame, we need only two columns i.e. `movieId` and `title`. Let us create a new DataFrame that consists of only these two columns.

In [None]:
# S1.8: Obtain the final DataFrame consisting of only 'movieId' and 'title' columns.
final_df=ml_df[['movieId','title']]
final_df.head()

Unnamed: 0,movieId,title
0,1,Toy Story
1,2,Jumanji
2,3,Grumpier Old Men
3,4,Waiting to Exhale
4,5,Father of the Bride Part II
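
The `merge()` call in S1.7 uses the default inner join, so a movie is kept only when its TMDB ID appears in both DataFrames. The sketch below, on hypothetical toy data, shows one way to see which rows an inner join would discard by using the `how` and `indicator` parameters of `pd.merge()`. The title `'Unlisted Movie'` and the ID `99999.0` are made up for illustration.

```python
import pandas as pd

toy_movies = pd.DataFrame({'id': [862.0, 8844.0, 99999.0],
                           'title': ['Toy Story', 'Jumanji', 'Unlisted Movie']})
toy_links = pd.DataFrame({'movieId': [1, 2, 3],
                          'tmdbId': [862.0, 8844.0, 15602.0]})

# Inner join (the default): only IDs present in both DataFrames survive
inner = pd.merge(toy_movies, toy_links, left_on='id', right_on='tmdbId')

# Outer join with indicator=True adds a '_merge' column that says whether
# each row came from the left table, the right table, or both
outer = pd.merge(toy_movies, toy_links, left_on='id', right_on='tmdbId',
                 how='outer', indicator=True)

print(len(inner), "rows after the inner join")
print(outer['_merge'].value_counts())
```

Rows marked `left_only` or `right_only` are the ones that the inner join silently drops.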


Now that we have finally obtained a DataFrame containing the movie ID and title of each movie, it's time to get the ratings of these movies.

For this, load the third dataset `ratings_small.csv`, which contains user IDs and the ratings given to movies by the users.

**The `ratings_small.csv` Dataset Link:** https://drive.google.com/uc?id=1DKT6CcjHsdKY9TKKAfk50ic2khf9JbJA

In [None]:
# S1.9: Load 'ratings_small.csv' file into 'ratings_df' DataFrame.
ratings_df=pd.read_csv('https://drive.google.com/uc?id=1DKT6CcjHsdKY9TKKAfk50ic2khf9JbJA')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


Let us drop the `timestamp` column from the above DataFrame as it is not needed for building the collaborative recommender.

In [None]:
# S1.10: Drop 'timestamp' column from 'ratings_df' DataFrame.
ratings_df=ratings_df.drop('timestamp',axis=1)
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


You may observe that both the `final_df` and `ratings_df` DataFrames contain a matching column `movieId`. Hence, we can merge these DataFrames using the `merge()` function to obtain a final DataFrame that contains the user ID, movie ID, title and rating.

In [None]:
# S1.11: Merge 'final_df' and 'ratings_df' DataFrames.
fin_mov_df=pd.merge(final_df,ratings_df,on='movieId')
fin_mov_df.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story,7,3.0
1,1,Toy Story,9,4.0
2,1,Toy Story,13,5.0
3,1,Toy Story,15,2.0
4,1,Toy Story,19,3.0


Now that we have obtained the final DataFrame that will be used to build a movie recommender based on user's ratings, let's explore the data a bit and get a look at some of the best rated movies.

---

#### Activity 2: Data Analysis

Let us first find out the average rating of each movie by grouping movies based on their title.

In [None]:
# S2.1: Group the DataFrame by 'title' column and use 'mean()' function to determine average rating.
fin_mov_df.groupby(by='title')['rating'].mean()

title
$9.99                       3.833333
'Neath the Arizona Skies    0.500000
'night, Mother              5.000000
(500) Days of Summer        3.755556
...And God Created Woman    5.000000
                              ...   
À Nous la Liberté           4.500000
Æon Flux                    2.538462
İtirazım Var                3.500000
Želary                      5.000000
’Round Midnight             2.250000
Name: rating, Length: 8754, dtype: float64

If we wish to see the top 5 highest rated movies, we can sort the above Series in descending order.

In [None]:
# S2.2: Print top 5 movies having highest mean rating.
fin_mov_df.groupby(by='title')['rating'].mean().sort_values(ascending=False).head()

title
Female Perversions    5.0
Lake of Fire          5.0
Lamerica              5.0
The Family Stone      5.0
Riding Giants         5.0
Name: rating, dtype: float64

Similarly, we can determine how many users had given their ratings to each movie by grouping movies based on their title and then using `count()` function.  

In [None]:
# S2.3: Count the number of ratings given to each movie.
fin_mov_df.groupby(by='title')['rating'].count()

title
$9.99                        3
'Neath the Arizona Skies     1
'night, Mother               3
(500) Days of Summer        45
...And God Created Woman     1
                            ..
À Nous la Liberté            1
Æon Flux                    13
İtirazım Var                 1
Želary                       1
’Round Midnight              2
Name: rating, Length: 8754, dtype: int64

If we wish to see which 5 movies have received the highest number of ratings (probably the most watched movies), we can sort the above Series in descending order.

In [None]:
# S2.4: Print top 5 movies having highest count of ratings.
fin_mov_df.groupby(by='title')['rating'].count().sort_values(ascending=False).head()

title
Forrest Gump                341
Pulp Fiction                324
The Shawshank Redemption    311
The Silence of the Lambs    304
Star Wars                   291
Name: rating, dtype: int64

Let's create a DataFrame that consists of the average rating of each movie and the number of ratings given to each movie. We will need this DataFrame to check the total number of ratings of the recommended movies.

In [None]:
# T2.1: Create a DataFrame with average rating and number of ratings for each movie.
movie_ratings = pd.DataFrame(fin_mov_df.groupby('title')['rating'].mean())
# Add the number of ratings per title as a second column (aligned on the 'title' index)
movie_ratings['no.ratings'] = fin_mov_df.groupby('title')['rating'].count()
movie_ratings.head()

Unnamed: 0_level_0,rating,no.ratings
title,Unnamed: 1_level_1,Unnamed: 2_level_1
$9.99,3.833333,3
'Neath the Arizona Skies,0.5,1
"'night, Mother",5.0,3
(500) Days of Summer,3.755556,45
...And God Created Woman,5.0,1


Hence, we obtained a DataFrame which contains the average rating and number of ratings for each movie.
Let's move on to creating a collaborative filtering based recommendation system.


---

#### Activity 3: Setting up the Movie Recommender

Let us again print the first 5 rows of the final movies DataFrame.



In [None]:
# S3.1: Print first 5 rows of final movies DataFrame.
fin_mov_df.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story,7,3.0
1,1,Toy Story,9,4.0
2,1,Toy Story,13,5.0
3,1,Toy Story,15,2.0
4,1,Toy Story,19,3.0



Say we have a **user X** as our target person, for whom we want to recommend the best movie to watch. Suppose the following user data is available:

<center><img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/602074ed-3289-4349-89b3-cce587168baf.png>

`Table 3.1: Users rating and watch history database`</center>

The above table shows the rating given by each user to each of the movies.

Now let's create a matrix similar to Table $3.1$ for our final movies DataFrame, with the user IDs on the vertical axis (rows) and the movie titles on the horizontal axis (columns). This can be done by using the `pivot_table()` function of the `pandas` module.

**Note:** There will be a lot of `NaN` values in the obtained pivot table, because most people have not seen most of the movies.


In [None]:
# T3.1: Create a pivot table with index ='userId', columns ='title', values ='rating'
user_rating=fin_mov_df.pivot_table(index='userId',columns ='title', values ='rating')
user_rating

title,$9.99,'Neath the Arizona Skies,"'night, Mother",(500) Days of Summer,...And God Created Woman,...And Justice for All,1-900,10,10 Attitudes,10 Cloverfield Lane,...,eXistenZ,loudQUIETloud: A Film About the Pixies,xXx,xXx: State of the Union,¡Three Amigos!,À Nous la Liberté,Æon Flux,İtirazım Var,Želary,’Round Midnight
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,,,,,,,,,,,...,,,,,,,,,,
668,,,,,,,,,,,...,,,,,,,,,,
669,,,,,,,,,,,...,,,,,,,,,,
670,,,,,,,,,,,...,,,,,,,,,,


The above DataFrame is similar to Table $3.1$. Here, each cell contains the rating that the user gave to that movie. A `NaN` value indicates that the movie was not rated by that particular user.
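
To make the structure of the pivot table concrete, here is a minimal sketch on a hypothetical toy DataFrame (`toy`) with three users and three movies. It also measures what fraction of the cells are `NaN`. Note that `pivot_table()` averages the values by default if the same user and title combination occurs more than once.

```python
import pandas as pd

toy = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'title':  ['A', 'B', 'A', 'C', 'B'],
    'rating': [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Rows are users, columns are movie titles, cells are ratings
toy_matrix = toy.pivot_table(index='userId', columns='title', values='rating')

# Fraction of user-movie pairs that have no rating
missing_fraction = toy_matrix.isna().sum().sum() / toy_matrix.size

print(toy_matrix)
print(f"Missing fraction: {missing_fraction:.2f}")
```

The same two lines for `missing_fraction` could be applied to `user_rating` to see how sparse the real matrix is.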

To obtain recommendations for similar movies based on the ratings given by other users, we will compute a **similarity score**. Collaborative filters can use a variety of similarity scores and related techniques, for example:

1. Pearson Correlation Coefficient

2. Cosine Similarity

3. Singular Value Decomposition (a matrix factorisation technique rather than a similarity score in the strict sense), and many more.

For our recommender, we will use the **Pearson Correlation Coefficient** to obtain the similarity score of the movies. Let us first recall the correlation coefficient and the `corr()` function that we studied in one of the previous classes.

**Correlation:**

- Correlation measures the strength and direction of a linear relationship between two variables.
- A correlation coefficient is a number between -1 and 1. Negative values indicate a negative correlation, positive values indicate a positive correlation, and a value of zero indicates no linear correlation.
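
For two lists of ratings $x$ and $y$, the Pearson correlation coefficient is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$. The sketch below computes it by hand and compares the result with `np.corrcoef()`. The function name `pearson` and the two rating lists are illustrative choices, not part of the lesson code.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical ratings given by the same four users to two movies
ratings_a = [5.0, 4.0, 3.0, 1.0]
ratings_b = [4.5, 4.0, 2.5, 1.5]

r_manual = pearson(ratings_a, ratings_b)
r_numpy = np.corrcoef(ratings_a, ratings_b)[0, 1]

print(r_manual, r_numpy)
```

Both values agree up to floating-point rounding, and ratings that move in exactly opposite directions, such as `[1, 2, 3]` and `[3, 2, 1]`, give a coefficient of -1.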

**The corr() Function:**
  
  To calculate the correlation coefficients between all pairs of numeric columns in a DataFrame, use the `corr()` function of the `pandas` module. By default, it computes the Pearson correlation and returns an N × N DataFrame, where N is the number of numeric columns.

Let us obtain the similarity score between each pair of movies by using the `corr()` function on the above pivot table.

In [None]:
# S3.2: Calculate correlation coefficient between each pair of movies using 'corr()' function.
similarity_df=user_rating.corr()
similarity_df

title,$9.99,'Neath the Arizona Skies,"'night, Mother",(500) Days of Summer,...And God Created Woman,...And Justice for All,1-900,10,10 Attitudes,10 Cloverfield Lane,...,eXistenZ,loudQUIETloud: A Film About the Pixies,xXx,xXx: State of the Union,¡Three Amigos!,À Nous la Liberté,Æon Flux,İtirazım Var,Želary,’Round Midnight
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
$9.99,1.0,,,1.000000,,,,,,,...,,,,,,,,,,
'Neath the Arizona Skies,,,,,,,,,,,...,,,,,,,,,,
"'night, Mother",,,,,,,,,,,...,,,,,,,,,,
(500) Days of Summer,1.0,,,1.000000,,-0.327327,,-0.188982,,-0.5,...,,,0.424179,,-0.617213,,0.866025,,,1.0
...And God Created Woman,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
À Nous la Liberté,,,,,,,,,,,...,,,,,,,,,,
Æon Flux,,,,0.866025,,,,,,1.0,...,,,0.251952,,1.000000,,1.000000,,,
İtirazım Var,,,,,,,,,,,...,,,,,,,,,,
Želary,,,,,,,,,,,...,,,,,,,,,,


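Hence, we obtained the correlation coefficient for each pair of movies. You may note that many of the coefficients are `NaN`. The `corr()` function ignores missing ratings and uses only the users who rated both movies, so the result is `NaN` whenever a pair has fewer than two such users or when the common ratings of either movie are all identical.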
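
The two situations that produce `NaN` can be reproduced on a small example. The toy matrix below is hypothetical: rows stand for four users and columns for four movies named `A` to `D`.

```python
import numpy as np
import pandas as pd

toy_matrix = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, np.nan],
    'B': [4.0, 5.0, 2.0, np.nan],     # shares three raters with A
    'C': [np.nan, np.nan, 3.0, 4.0],  # shares only one rater with A
    'D': [3.0, 3.0, 3.0, np.nan],     # every rating is identical
})

toy_corr = toy_matrix.corr()

print(toy_corr.loc['A', 'B'])  # a regular coefficient
print(toy_corr.loc['A', 'C'])  # NaN: only one common rater
print(toy_corr.loc['A', 'D'])  # NaN: the ratings of D do not vary
```

With a single common rater there is no spread to measure, and with identical ratings the denominator of the Pearson formula is zero, so no coefficient can be computed in either case.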

Now we need to select a movie to test our recommender system. Choose any movie title from the data.

Let us choose a movie `"Toy Story"`.

<center><img src="https://static.wikia.nocookie.net/logopedia/images/a/a2/Toy_Story_Logo.svg" height = 300/>

`Image Source: https://logos.fandom.com/wiki/Toy_Story`
</center>

To find the correlation of `"Toy Story"` with all other movies in the data, select the `"Toy Story"` column of the `similarity_df` correlation DataFrame.



In [None]:
# T3.2: Create a DataFrame containing the correlation coefficients of other movies with 'Toy Story'
similarity_toy=similarity_df['Toy Story']
similarity_toy_df=pd.DataFrame(similarity_toy)
similarity_toy_df

Unnamed: 0_level_0,Toy Story
title,Unnamed: 1_level_1
$9.99,
'Neath the Arizona Skies,
"'night, Mother",
(500) Days of Summer,0.407521
...And God Created Woman,
...,...
À Nous la Liberté,
Æon Flux,0.031627
İtirazım Var,
Želary,


Let us rename the column `Toy Story` to `'correlation'`.

In [None]:
# S3.3: Rename the column to 'correlation'.
similarity_toy_df.rename(columns={'Toy Story':'correlation'},inplace=True)
similarity_toy_df

Unnamed: 0_level_0,correlation
title,Unnamed: 1_level_1
$9.99,
'Neath the Arizona Skies,
"'night, Mother",
(500) Days of Summer,0.407521
...And God Created Woman,
...,...
À Nous la Liberté,
Æon Flux,0.031627
İtirazım Var,
Želary,


Hence, we obtained the similarity score of each movie with `'Toy Story'` movie. Let us display top 10 most similar movies to `'Toy Story'` by simply sorting the above DataFrame by correlation in descending order.


In [None]:
# S3.4: Sort the above DataFrame by 'correlation' column to find top 10 highly correlated movies.
similarity_toy_df.sort_values('correlation',ascending=False).head(10)

Unnamed: 0_level_0,correlation
title,Unnamed: 1_level_1
Weekend at Bernie's II,1.0
Step Up,1.0
Where the Heart Is,1.0
Ghost Town,1.0
Mother Night,1.0
Dirty Dancing: Havana Nights,1.0
The Internship,1.0
Full Frontal,1.0
The Yes Men,1.0
Bloodsport II,1.0


Here, every movie in the top 10 has a correlation coefficient of `1.0` with `'Toy Story'`. A coefficient of `1.0` means that, among the users who rated both movies, the ratings of the two movies rise and fall together perfectly. Taken at face value, a user who recently watched `'Toy Story'` could be recommended the above movies. But wait, are the above recommendations correct? 🤔

Many of the above movies were probably rated by only a few users who had also rated `'Toy Story'`, and a correlation computed from so few common ratings is not reliable. Hence the above results don't really make sense unless we consider the total number of ratings each movie has. Let's fix this by keeping only the movies that have more than 100 ratings.

For this, let us join or merge the total number of ratings of each movie obtained in **Activity 2: Data Analysis** to the correlation DataFrame using `join()` function.


In [None]:
# T3.3: Display the number of ratings of each movie along with the correlation coefficients
# by joining the 'no.ratings' column of 'movie_ratings' with the above DataFrame.
corr_toy=similarity_toy_df.join(movie_ratings['no.ratings'])
corr_toy.head()

Unnamed: 0_level_0,correlation,no.ratings
title,Unnamed: 1_level_1,Unnamed: 2_level_1
$9.99,,3
'Neath the Arizona Skies,,1
"'night, Mother",,3
(500) Days of Summer,0.407521,45
...And God Created Woman,,1


Let us display the top 20 most correlated (similar) movies, keeping only those whose `no.ratings` value is greater than `100`.

In [None]:
# S3.5: Display only those movies whose number of ratings are greater than 100.
# Sort them in descending order and print first 20 values.
corr_toy[corr_toy['no.ratings']>100].sort_values('correlation',ascending=False).head(20)


Unnamed: 0_level_0,correlation,no.ratings
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Toy Story,1.0,247
Toy Story 2,0.743352,125
A Bug's Life,0.677299,105
"Monsters, Inc.",0.549582,130
The Dark Knight,0.540978,121
Finding Nemo,0.537958,122
Austin Powers: The Spy Who Shagged Me,0.519847,112
The Lion King,0.517524,200
Spider-Man,0.512995,134
The Incredibles,0.508661,126


Hence, these are the movies that can be recommended as they all have high similarity to `'Toy Story'` movie and have more than 100 ratings from the users.
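
To see why the unfiltered list was full of `1.0` values, consider the hypothetical toy matrix below, where the movie `Obscure` shares only two raters with the movie `Popular`. Two points always lie on a straight line, so the coefficient is forced to +1 or -1. The sketch also shows the `min_periods` parameter of `corr()`, which is a different safeguard from the one used in this lesson: it requires a minimum number of common raters for each pair instead of a minimum total number of ratings per movie.

```python
import numpy as np
import pandas as pd

toy_matrix = pd.DataFrame({
    'Popular': [5.0, 4.0, 2.0, 1.0, 3.0],
    'Obscure': [np.nan, np.nan, 4.0, 2.0, np.nan],  # only two shared raters
})

# Default behaviour: two shared raters are enough to produce a coefficient
loose = toy_matrix.corr()

# Require at least three users who rated both movies of a pair
strict = toy_matrix.corr(min_periods=3)

print(loose.loc['Popular', 'Obscure'])
print(strict.loc['Popular', 'Obscure'])
```

With the default settings the pair gets a perfect score from just two users, whereas the stricter call returns `NaN` for it.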

Similar to `'Toy Story'` movie, we can find list of recommended movies for other movies as well.

Let us create a user-defined function to recommend movies that are similar to a particular movie using the Pearson Correlation Coefficient. Follow the steps given below (these are the same steps that we followed in the previous code cells for `'Toy Story'`):

Create a user-defined function `recommend_movies` and pass the `movie_name` string as input. Inside this function:
1. Obtain the correlation coefficients of all the movies with the input  `movie_name` by specifying the `movie_name` as index to the correlation table `similarity_df`.
  
  This will return a pandas series containing the correlation coefficients. Store this pandas Series in a new DataFrame. Rename the column name to `'correlation'`.

2. Add the number of ratings for each movie to the new DataFrame obtained in the above step using `join()` function.

3. Return the top 20 most correlated (similar) movies, keeping only those whose number of ratings is greater than 100.

In [None]:
# T3.4: Define 'recommend_movies()' function.
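def recommend_movies(movie_name):
    # Step 1: correlation of every movie with 'movie_name', as a one-column DataFrame
    similar_movies_df = pd.DataFrame(similarity_df[movie_name])
    similar_movies_df.rename(columns={similar_movies_df.columns[0]: 'correlation'}, inplace=True)
    # Step 2: attach the number of ratings each movie received
    corr_num_ratings = similar_movies_df.join(movie_ratings['no.ratings'])
    # Step 3: keep movies with more than 100 ratings and return the 20 most correlated ones
    popular = corr_num_ratings[corr_num_ratings['no.ratings'] > 100]
    return popular.sort_values('correlation', ascending=False).head(20)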

In [None]:
movie_ratings

Unnamed: 0_level_0,rating,no.ratings
title,Unnamed: 1_level_1,Unnamed: 2_level_1
$9.99,3.833333,3
'Neath the Arizona Skies,0.500000,1
"'night, Mother",5.000000,3
(500) Days of Summer,3.755556,45
...And God Created Woman,5.000000,1
...,...,...
À Nous la Liberté,4.500000,1
Æon Flux,2.538462,13
İtirazım Var,3.500000,1
Želary,5.000000,1




Let us recommend some movies to a user who recently watched `'Star Wars'` based on ratings given by other users who had also watched `'Star Wars'`.

In [None]:
# S3.6: Call 'recommend_movies()' function and pass 'Star Wars' as input.
recommend_movies('Star Wars')

Unnamed: 0_level_0,correlation,no.ratings
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Star Wars,1.0,291
Return of the Jedi,0.747774,217
The Empire Strikes Back,0.70079,234
The Dark Knight,0.549486,121
The Lord of the Rings: The Fellowship of the Ring,0.477582,200
Raiders of the Lost Ark,0.476442,220
The Incredibles,0.450914,126
The Lord of the Rings: The Two Towers,0.448153,188
E.T. the Extra-Terrestrial,0.428289,160
Star Trek: Generations,0.413682,114


We can see that the top recommendations are sensible. The movie with the highest correlation to `Star Wars` is `Star Wars` itself, since every movie is perfectly correlated with itself. Its sequels `Return of the Jedi` and `The Empire Strikes Back` come next, followed by `The Dark Knight`. Similarly, you can obtain recommendations for any other movie of your choice.
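
Since the queried movie always appears at the top of its own list, one possible refinement is to drop it from the result. The sketch below is a hypothetical variant of `recommend_movies()`, not the function defined in this lesson: the name `recommend_similar`, its extra parameters and the toy inputs are all assumptions made for illustration. It takes the similarity table and rating counts as arguments so that it can run on its own.

```python
import numpy as np
import pandas as pd

def recommend_similar(movie_name, similarity, rating_counts,
                      min_ratings=100, top_n=20):
    """Hypothetical variant of recommend_movies() that skips the query movie."""
    if movie_name not in similarity.columns:
        raise KeyError(f"{movie_name!r} not found in the similarity table")
    # Correlations with the query movie, joined with the rating counts
    result = (similarity[movie_name]
              .rename('correlation')
              .to_frame()
              .join(rating_counts.rename('no.ratings')))
    result = result.drop(index=movie_name)               # remove the query movie
    result = result[result['no.ratings'] > min_ratings]  # keep well-rated movies
    return result.sort_values('correlation', ascending=False).head(top_n)

# Toy inputs standing in for 'similarity_df' and 'movie_ratings["no.ratings"]'
toy_similarity = pd.DataFrame(
    [[1.0, 0.8, 0.3],
     [0.8, 1.0, np.nan],
     [0.3, np.nan, 1.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
toy_counts = pd.Series({'A': 150, 'B': 120, 'C': 5})

toy_recs = recommend_similar('A', toy_similarity, toy_counts, min_ratings=100)
print(toy_recs)
```

With the lesson's objects, a call such as `recommend_similar('Star Wars', similarity_df, movie_ratings['no.ratings'])` would follow the same pattern.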

Thus, we have successfully built a movie recommender by performing collaborative filtering using Pearson correlation.

We will stop here. In the next class, we will build another collaborative filter based recommender using Cosine similarity.

---