# Simple Recommender Systems

Let's start with asking the most basic question: 
## What is a recommender system?
If you could give an answer in one line, it would be: a system designed to match users with items they will like.

As we saw in the earlier notebook, the items could be anything: a product, an article, a video, or even search results. The goal is to rank all the items and present only the top ones, maximizing the probability that the user will like what they are shown.
![Daily_picks_netflix](images/Daily_picks_netflix.PNG)

### Data Types

Recommender systems generally work with two types of data, explicit and implicit.
Some examples of each are given below:

#### Explicit
- Explicitly given (e.g. movie rating)
- Clear signals (e.g. rating a driver at the end of a ride)
- Limited and sparse data because of this (think of how many users ACTUALLY bother to give a rating unless they are motivated, by a voucher or some other incentive)


#### Implicit
- Provided/collected passively (e.g. click, watch time, scrolls, etc)
- Signals can be difficult to interpret
- Enormous quantities, [an incredible 2.5 quintillion bytes of data being created every day](https://www.the-next-tech.com/blockchain-technology/how-much-data-is-produced-every-day-2019/)

In this project, we are working with explicit data for the most part; the only implicit data that we have is the `timestamp` feature in our merged dataframe from the last notebook.

# In this notebook:

* We will go through the types of recommendation engines
* We will build and evaluate a simple content-based and a simple memory-based collaborative filtering recommendation engine.

Without further ado, let's dive in!

---

This is the **SECOND** of five notebooks:<br>
[1. Data Exploration and EDA](1_Data_Exploration_and_EDA.ipynb)<br>
**2. Simple Content and Collaborative Filtering Methods (current notebook)**<br>
[3. Surprise Library Models](3_Surprise_Library_Models.ipynb)<br>
[4. Deep Learning Part 1](4_Deep_Learning_Part_1_(Basic).ipynb)<br>
[5. Deep Learning Part 2](5_Deep_Learning_Part_2_(Introducing_more_features_and_layers).ipynb)<br>

---

# Contents of this notebook:
[1. Imports](#Imports)<br>
[2. Reading in the data](#Reading-in-the-data)<br>
[3. Types of Recommendation Engines](#Types-of-Recommendation-Engines)<br>
* [Content Based model](#Content-Based-model)<br>
* [Implementation of Content Model](#Implmentation-of-Content-Model)<br>
* [Content-Based Engine Evaluation](#Content-Based-Engine-Evaluation)<br>

[4. Collaborative Filtering Recommendation Engine](#Collaborative-Filtering-Recommendation-Engine)<br>
* [Implementation of CF model](#Implmentation-of-CF-model)<br>
* [Memory Based Collaborative Filtering Engine Evaluation](#Memory-Based-Collaborative-Filtering-Engine-Evaluation)<br>

[5. Alternative Approach](#Alternative-Approach)<br>

# Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.metrics import mean_squared_error

In [3]:
from math import sqrt
from scipy import sparse

---

# Reading in the data

In [4]:
df = pd.read_csv('./datasets/merged_users+movies.csv')
movies = pd.read_csv('./datasets/movies.csv')

---

# Types of Recommendation Engines

- **Content-Based Filtering** _(similar items)_
    - If you enjoy certain characteristics of movies (e.g. certain actors, genre, etc.), you'll enjoy other movies with those characteristics.
    - Note this can easily be done using machine learning methods! Each movie can be decomposed into features. Then, for each user we compute a model -- the target can be a binary classifier (e.g. "LIKE"/"DISLIKE") or regression (e.g. star rating).


- **Collaborative Filtering**: _(similar people)_
    - If you like the same 5 movies as someone else, you'll likely enjoy other movies they like.
    - There are two main types: 
    (a) Find users who are similar and recommend what they like (**user-based**), or 
    (b) recommend items that are similar to already-liked items (**item-based**).
    
    Collaborative filtering engines can also be split into **Memory-Based Collaborative Filtering** and **Model-Based Collaborative Filtering**. This notebook will only cover the **Memory-Based Collaborative Filtering** engine.
![rec-systems](images/rec-systems.png)

---

## Content Based model

The way this model works is essentially by breaking down each item into "feature baskets". These feature baskets that represent the characteristics of the item are then mapped into a vector space. The **Vector Space Model** computes the proximity based on the angle between the vectors. 

In this model, each item is stored as a vector of its attributes in an **n-dimensional space**, and the angles between the vectors are calculated to **determine the similarity between the vectors**. User profile vectors are then built from each user's past interactions with item attributes, and the similarity between an item and a user is determined in the same way.
![vector-space](images/vector_space_item.png)

The above image only shows a three-dimensional model, which is why we can visualize it. In practice the vectors are mapped to an **n-dimensional space**, and thankfully the engine does the heavy lifting for us.

The similarity from the angle can be calculated via a **similarity metric**. In our case, we are going to use the **cosine similarity** metric, although there are some other metrics out there like the Jaccard Similarity, and Pearson Similarity too. [This post on medium covers some of the other metrics.](https://medium.com/bag-of-words/what-similarity-metric-should-you-use-for-your-recommendation-system-b45eb7e6ebd0)

Cosine similarity uses the cosine between two vectors to compute a scalar value that represents how closely related these vectors are. It is literally something that we learnt in math in school! The full formula is given below:

$$
\cos(\theta) = \frac{\vec{item1} \cdot \vec{item2}}{\left\| \vec{item1}\right\| \left\| \vec{item2}\right\|}
= \frac{\sum_i{item1_i \, item2_i}}{\sqrt{\sum_i{item1_i^2}}\sqrt{\sum_i{item2_i^2}}}
$$

- Angle of $0^{\circ}$ (same direction): $\cos(0^{\circ}) = 1$. Perfectly similar.
- Angle of $90^{\circ}$ (orthogonal): $\cos(90^{\circ}) = 0$. Totally dissimilar.
- Angle of $180^{\circ}$ (opposite direction): $\cos(180^{\circ}) = -1$. Opposite.
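To make the three cases above concrete, here is a minimal sketch that evaluates the formula with NumPy. The helper `cosine_sim` and the toy vectors are made up for illustration and are not part of the dataset.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical feature vectors, e.g. [adventure, comedy, romance]
item1 = [1, 1, 0]
item2 = [2, 2, 0]    # same direction as item1, different length
item3 = [0, 0, 1]    # shares nothing with item1 (orthogonal)
item4 = [-1, -1, 0]  # points the opposite way

same = cosine_sim(item1, item2)
orthogonal = cosine_sim(item1, item3)
opposite = cosine_sim(item1, item4)
print(same, orthogonal, opposite)
```

Because TF-IDF weights are never negative, the similarities computed on them fall between 0 and 1, so the "opposite" case cannot occur with that kind of vector.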

## Implmentation of Content Model

With all this out of the way, let's start building the engine.

In [5]:
# getting first 5 rows of movies dataset
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy


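By using the `TfidfVectorizer`, we skip having to normalize the scores for each `feature_name` ourselves, because by default it scales every row to unit (L2) length. We can pass its output directly into our `cosine_similarity` method to get the similarity scores. Note that cosine similarity only coincides with the Pearson Correlation Coefficient when the vectors are mean-centered, which TF-IDF vectors are not, so the scores below should be read as cosine similarities.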
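As a quick check of what the vectorizer's default normalization gives us, the sketch below uses three made-up genre strings (hypothetical, in the same format as `movies['genres']`). It shows that every TF-IDF row has unit length, so a plain dot product (`linear_kernel`) returns the same numbers as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# hypothetical genre strings, not read from the dataset
toy_genres = [
    "Adventure Animation Children Comedy Fantasy",
    "Adventure Children Fantasy",
    "Comedy Romance",
]

# norm='l2' is the default, so each row is scaled to length 1
toy_tfidf = TfidfVectorizer().fit_transform(toy_genres)

# length of every row vector
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)

# with unit-length rows, dot product and cosine similarity agree
same_result = np.allclose(linear_kernel(toy_tfidf), cosine_similarity(toy_tfidf))
print(row_norms, same_result)
```

On a large matrix `linear_kernel` skips the normalization step that `cosine_similarity` repeats, which is why it is sometimes preferred for TF-IDF input.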

In [6]:
tvec = TfidfVectorizer(lowercase=True, # using vectorizer to set all words to lowercase
                      analyzer='word', # we will vectorize based on words
                      stop_words=None, # every single genre should be important, 
                                       # in the next iteration we can try using 
                                       # stopwords
                      ngram_range=(1, 1), # considering that each word in the genre
                                          # was meant to be used as 1 word, we will
                                          # do the same
                      min_df=1         # keep every word that appears in at least
                                       # one movie (recent scikit-learn versions
                                       # reject an integer min_df of 0)
                      )

In [7]:
# vectorizing the genres in movies
tvec_genres = tvec.fit_transform(movies['genres'])
tvec_genres.shape

(9730, 22)

In [8]:
movies.shape

(9730, 3)

In [9]:
# converting our vectorized genres to dense_matrix
dense_matrix = pd.DataFrame(
    tvec_genres.todense(),
    columns=tvec.get_feature_names_out(),
    index=movies['title'],
)

In [10]:
# computing the similarity matrix
sim_matrix = cosine_similarity(dense_matrix)

# converting the matrix to a DataFrame for readability
movies_sim = pd.DataFrame(
    sim_matrix,
    columns=dense_matrix.index,
    index=dense_matrix.index)

Now let's take a look at the first 5 rows of our similarity matrix. 

In [11]:
movies_sim.head()

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
Toy Story (1995),1.0,0.813587,0.152709,0.13508,0.267516,0.0,0.152709,0.654702,0.0,0.262366,...,0.360275,0.46565,0.196508,0.516247,0.0,0.680239,0.755887,0.0,0.421045,0.267516
Jumanji (1995),0.813587,1.0,0.0,0.0,0.0,0.0,0.0,0.804711,0.0,0.32248,...,0.0,0.0,0.0,0.0,0.0,0.34138,0.379344,0.0,0.0,0.0
Grumpier Old Men (1995),0.152709,0.0,1.0,0.884557,0.570841,0.0,1.0,0.0,0.0,0.0,...,0.162737,0.0,0.419319,0.0,0.0,0.181808,0.202027,0.0,0.0,0.570841
Waiting to Exhale (1995),0.13508,0.0,0.884557,1.0,0.504942,0.0,0.884557,0.0,0.0,0.0,...,0.14395,0.201385,0.687404,0.0,0.0,0.16082,0.178704,0.466431,0.0,0.504942
Father of the Bride Part II (1995),0.267516,0.0,0.570841,0.504942,1.0,0.0,0.570841,0.0,0.0,0.0,...,0.285082,0.0,0.734563,0.0,0.0,0.318492,0.353911,0.0,0.0,1.0


### Let's check if our matrix is working

In [12]:
# creating a demo series to get titles similar to 'Toy Story (1995)'
# with a similarity score of more than 95%
demo = movies_sim['Toy Story (1995)'].sort_values(ascending=False)
demo[demo>0.95]

title
Toy Story (1995)                                             1.000000
Toy Story 2 (1999)                                           1.000000
Tale of Despereaux, The (2008)                               1.000000
Asterix and the Vikings (Astérix et les Vikings) (2006)      1.000000
Shrek the Third (2007)                                       1.000000
Turbo (2013)                                                 1.000000
Monsters, Inc. (2001)                                        1.000000
The Good Dinosaur (2015)                                     1.000000
Antz (1998)                                                  1.000000
Emperor's New Groove, The (2000)                             1.000000
Moana (2016)                                                 1.000000
Adventures of Rocky and Bullwinkle, The (2000)               1.000000
Wild, The (2006)                                             1.000000
Inside Out (2015)                                            0.970798
Atlantis: The 

Let's create a function that will return the top `n` number of movies that are similar to the movie that has been input.

In [13]:
def movie_genre_recommender(title, n):
    """This function returns a table, 
    with the recommended movie titles 
    and their respective similarity 
    scores to the input movie.
    
    Accepts a string title, and n, the 
    number of movies to be recommended."""
    
    if title in movies_sim.index:
        # remove the input movie before taking the top n, so that exactly n
        # other titles are returned even when many movies tie at 100%
        reco_series = movies_sim[title].sort_values(ascending=False).drop(title).head(n)
        reco_df = pd.DataFrame({
            'title': reco_series.index,
            'similarity_score': reco_series.values
        })
        return reco_df.style.format({'similarity_score': "{:.1%}"})
    else:
        print('Please input a movie title that is in the available list of movies.')

With this, we have a simple, working, **content-based recommendation engine**. Let's test a few use cases, with varying number of recommended movies.

In [14]:
movie_genre_recommender('Toy Story (1995)', 15)

Unnamed: 0,title,similarity_score
0,Toy Story 2 (1999),100.0%
1,"Tale of Despereaux, The (2008)",100.0%
2,Asterix and the Vikings (Astérix et les Vikings) (2006),100.0%
3,Shrek the Third (2007),100.0%
4,Turbo (2013),100.0%
5,"Monsters, Inc. (2001)",100.0%
6,The Good Dinosaur (2015),100.0%
7,Antz (1998),100.0%
8,"Emperor's New Groove, The (2000)",100.0%
9,Moana (2016),100.0%


In [15]:
movie_genre_recommender('Shrek the Third (2007)', 5)

Unnamed: 0,title,similarity_score
0,Toy Story (1995),100.0%
1,Toy Story 2 (1999),100.0%
2,"Tale of Despereaux, The (2008)",100.0%
3,Asterix and the Vikings (Astérix et les Vikings) (2006),100.0%
4,Turbo (2013),100.0%


In [16]:
movie_genre_recommender('Matrix, The (1999)', 25)

Unnamed: 0,title,similarity_score
0,Chronicle (2012),100.0%
1,Garm Wars: The Last Druid (2014),100.0%
2,Universal Soldier: Day of Reckoning (2012),100.0%
3,eXistenZ (1999),100.0%
4,Predator (1987),100.0%
5,Predator 2 (1990),100.0%
6,Hangar 18 (1980),100.0%
7,Firefox (1982),100.0%
8,Déjà Vu (Deja Vu) (2006),100.0%
9,Insurgent (2015),100.0%


## Content-Based Engine Evaluation

Truth be told, we have gotten a decent list of recommendations here, and if we add the `mean_rating` for each of the movies by the side, the user might be able to make a more informed choice by themselves. However this approach also has its drawbacks.

#### Pros<br>
* No need for a large number of users
* No cold-start or sparsity problems
* Can recommend to users with unique tastes in terms of genres
* Data transformation is explainable 

#### Cons<br>
* Difficult to identify and pass in the right features
* Hard to create cross-content recommendations (if the feature spaces are different)
* Cannot exploit qualitative judgements of other users (e.g. 2 horror films, 1 might be really good, the other might be really bad)

Now let's attempt building a **Memory-Based Collaborative Filtering** engine.

---

# Collaborative Filtering Recommendation Engine

### So what is a collaborative filtering recommendation engine?

This recommendation engine is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

We will be focusing on **Memory-Based** in this notebook, and we will build an engine for both types shown below. 

1. **User-User Collaborative Filtering**: Here we find *look-alike* users based on similarity and recommend movies that the first user's *look-alike* has chosen in the past. This algorithm is very effective but takes a lot of time and resources, because it requires computing similarity information for every pair of users. For platforms with a large user base, it is therefore hard to implement without a very strong parallelizable system.

2. **Item-Item Collaborative Filtering**: It is quite similar to the user-user algorithm, but instead of finding a user's *look-alike*, we try to find a movie's *look-alike*. Once we have the movie *look-alike* matrix, we can easily recommend similar movies to any user who has rated a movie from the dataset. This algorithm consumes far fewer resources than user-user collaborative filtering: for a new user, it takes far less time because we don't need all the similarity scores between users. And with a fixed number of movies, the movie-movie look-alike matrix stays fixed over time.

As mentioned before, we will continue to use the **cosine similarity** for building these engines.
![mem_based_cf](images/mem_based_cf.PNG)

In the image above, let's take a look at the **User-User** filtering. Nancy and Chuck are found by the engine to be similar on some feature space. Based on this similarity, since Nancy has consumed all three fruits, but Chuck has only consumed the Watermelon, we can recommend the Orange and the Grape to Chuck.

Similarly for the **Item-Item** filtering. Grape and Watermelon are found by the engine to be similar on some feature space. Based on this similarity, since Chuck has consumed a Watermelon, we can recommend a similar item, based purely on item similarity, to be consumed by Chuck.
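The two views can be sketched on a tiny, made-up ratings table that loosely follows the fruit picture (the ratings and the third user `Dana` are hypothetical). Comparing rows gives user-user similarity, and comparing columns gives item-item similarity.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical ratings, 0 = not consumed
toy_ratings = pd.DataFrame(
    {"Grape": [5, 0, 1], "Orange": [4, 0, 5], "Watermelon": [5, 4, 0]},
    index=["Nancy", "Chuck", "Dana"],
)

# user-user: each row (user) is a vector over the items
toy_user_sim = pd.DataFrame(
    cosine_similarity(toy_ratings),
    index=toy_ratings.index,
    columns=toy_ratings.index,
)

# item-item: each column (item) is a vector over the users
toy_item_sim = pd.DataFrame(
    cosine_similarity(toy_ratings.T),
    index=toy_ratings.columns,
    columns=toy_ratings.columns,
)

# most similar other user to Chuck, most similar other item to Watermelon
chuck_lookalike = toy_user_sim["Chuck"].drop("Chuck").idxmax()
watermelon_lookalike = toy_item_sim["Watermelon"].drop("Watermelon").idxmax()
print(chuck_lookalike, watermelon_lookalike)
```

With these toy numbers Chuck's closest user is Nancy and Watermelon's closest item is Grape, which mirrors the two recommendations described above.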

---

## Implmentation of CF model

With all this out of the way, let's start building the engine.

Let us start off by first building our **User-to-User** engine.<br>
For this, we will only be using the features `userId`, `title`, and `rating`.

In [17]:
# defining a DataFrame collab df, that only has the features we will use
collab_df = df.drop(columns=['timestamp', 'genres', 'movieId'])
collab_df

Unnamed: 0,userId,rating,title
0,1,4.0,Toy Story (1995)
1,5,4.0,Toy Story (1995)
2,7,4.5,Toy Story (1995)
3,15,2.5,Toy Story (1995)
4,17,4.5,Toy Story (1995)
...,...,...,...
100814,610,2.5,Bloodmoon (1997)
100815,610,4.5,Sympathy for the Underdog (1971)
100816,610,3.0,Hazard (2005)
100817,610,3.5,Blair Witch (2016)


In [18]:
# train_test_split on our dataframe
train_collab, test_collab = train_test_split(collab_df,
                                             test_size=0.2,
                                             random_state=42
                                            )

We will first create a pivot table that we will then use to calculate the cosine similarity for our recommender.

In [19]:
# creating a pivot table for the user-user CF
user_pivot = pd.pivot_table(train_collab, index='userId', columns='title', values='rating')
user_pivot.fillna(0, inplace=True)
user_pivot.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# create a sparse matrix for cosine similarity calculation
sparse_user_pivot = sparse.csr_matrix(user_pivot)

In [21]:
# calculating the cosine similarity
user_sim = cosine_similarity(sparse_user_pivot)

# checking to ensure the output is as expected
user_sim

array([[1.        , 0.01969851, 0.02485907, ..., 0.21982512, 0.07751294,
        0.11823357],
       [0.01969851, 1.        , 0.        , ..., 0.02336901, 0.03992355,
        0.08842625],
       [0.02485907, 0.        , 1.        , ..., 0.00748982, 0.        ,
        0.01777359],
       ...,
       [0.21982512, 0.02336901, 0.00748982, ..., 1.        , 0.12310072,
        0.25384692],
       [0.07751294, 0.03992355, 0.        , ..., 0.12310072, 1.        ,
        0.03922679],
       [0.11823357, 0.08842625, 0.01777359, ..., 0.25384692, 0.03922679,
        1.        ]])

In [22]:
# creating the user_recommendation DataFrame
user_reco_df = pd.DataFrame(data=user_sim, 
                           columns=user_pivot.index,
                           index=user_pivot.index)

In [23]:
# viewing the user_recommendation DataFrame
user_reco_df.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.0,0.019699,0.024859,0.159338,0.092765,0.079797,0.122478,0.08154,0.045207,0.010076,...,0.033164,0.10654,0.180013,0.061623,0.110487,0.129565,0.189991,0.219825,0.077513,0.118234
2,0.019699,1.0,0.0,0.005192,0.023794,0.015619,0.009434,0.036926,0.0,0.088147,...,0.151634,0.023143,0.007792,0.0,0.0,0.023728,0.018732,0.023369,0.039924,0.088426
3,0.024859,0.0,1.0,0.0026,0.005958,0.003285,0.0,0.005548,0.0,0.0,...,0.006105,0.005563,0.028617,0.0,0.0,0.008364,0.019699,0.00749,0.0,0.017774
4,0.159338,0.005192,0.0026,1.0,0.07563,0.075669,0.077034,0.045329,0.0,0.02282,...,0.061613,0.105014,0.264199,0.052774,0.066285,0.173303,0.077606,0.108943,0.004376,0.086312
5,0.092765,0.023794,0.005958,0.07563,1.0,0.237731,0.066937,0.37468,0.0,0.012622,...,0.046495,0.402231,0.085245,0.234912,0.147258,0.068518,0.164663,0.136691,0.260706,0.042291


Now that we have the `user_reco_df`, let's now calculate the predicted ratings so that we can compute the `rmse` score.

In [24]:
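# sorting values by userId (reassigning instead of sorting in place, so that
# the new column below is added to a fresh DataFrame rather than a slice)
test_collab = test_collab.sort_values('userId')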

# creating 'user_based_preds' column, and setting the value as nan
# for now
test_collab['user_based_preds'] = np.nan

# instantiating row number for the while loop
row_num = 0

# starting while loop to keep running until every row in the test collab is complete
while row_num < len(test_collab):
    # getting the userId of the current row (userId is the first column)
    user_id = test_collab.iloc[row_num,0]
    
    # obtaining similarity df for current userId
    user_i_sim = user_reco_df[user_id].drop(user_id)
    user_i_sim = user_i_sim[user_i_sim > 0]
    
    # obtaining weights for current userId (normalized to sum to 1)
    user_i_weights = user_i_sim.values/np.sum(user_i_sim.values)
    
    # movie x user ratings, restricted to movies the current user has not
    # rated in the training data and to the similar users selected above
    get_ratings_useri = user_pivot.T
    get_ratings_useri = get_ratings_useri[get_ratings_useri[user_id] == 0]
    get_ratings_useri = get_ratings_useri.drop(user_id, axis=1)
    get_ratings_useri = get_ratings_useri[user_i_sim.index]
    
    # predicted rating = similarity-weighted sum of the other users' ratings
    # (a movie a neighbour has not rated counts as 0), indexed by title
    ratings_useri = np.dot(get_ratings_useri.fillna(0).values, user_i_weights)
    ratings_useri_df = pd.DataFrame(ratings_useri, index=get_ratings_useri.index, columns=['rating'])
    
    temp_df = test_collab[test_collab['userId'] == user_id]
    
    # instantiating for loop to add all ratings for all instances 
    # of this userId before moving on to the next userId
    for _ in range(0, len(temp_df)):
        if row_num < len(test_collab):
            try:
                movie_title = test_collab.iloc[row_num, 2]
                # column in pos 3 is the 'user_based_preds' column
                test_collab.iloc[row_num, 3] = ratings_useri_df.loc[movie_title, 'rating']
                row_num += 1
            except KeyError:
                # no prediction available for this title, so fall back to 0
                test_collab.iloc[row_num, 3] = 0
                row_num += 1

*Side note, the reason why I opted to use a while loop here instead of the `.apply()` method is because the `.apply()` method would have to get the `similarity_df`, `weights` and all the movie `ratings` again and again each time for the `userId` even if it is the same `userId`. It would be more computationally efficient to obtain the scores for a `userId`, finish all instances of that `userId`, then move on to the next.*
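In the loop above the weights are normalized over all similar users, so a neighbour who has not rated a movie contributes a 0 and pulls the prediction towards 0. A common alternative, shown here only as a sketch on made-up data and not as the method used in this notebook, divides by the similarity of just those neighbours who actually rated the movie. It is also vectorized, so all users are handled in one matrix product. The names `R`, `sim_mass` and `toy_preds` are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical user x movie ratings, 0 = not rated
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 3.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

sim = cosine_similarity(R)
np.fill_diagonal(sim, 0.0)      # a user is not their own neighbour

rated = (R > 0).astype(float)   # 1 where a rating exists
weighted_sum = sim @ R          # sum over neighbours v of sim(u, v) * r_vi
sim_mass = sim @ rated          # sum of sim(u, v) over neighbours who rated i

# weighted average of the neighbours' actual ratings;
# stays 0 where no neighbour has rated the movie
toy_preds = np.divide(
    weighted_sum,
    sim_mass,
    out=np.zeros_like(weighted_sum),
    where=sim_mass > 0,
)
print(np.round(toy_preds, 2))
```

Because each prediction is now an average of real ratings, it stays inside the range of the ratings it was built from. Whether this would improve the score on the actual dataset has not been tested here.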

Now that we have all the predicted ratings, we can actually compute the `rmse` score.

In [25]:
# computing the rmse score
sqrt(mean_squared_error(test_collab['user_based_preds'], test_collab['rating']))

3.2215290024864838
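An `rmse` value is easier to judge next to a trivial baseline. The sketch below uses a handful of made-up ratings (not the notebook's test set) and compares two naive predictors: always predicting 0 and always predicting the mean rating. In practice the mean would come from the training data; here it is taken from the toy values themselves to keep the example short.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical held-out ratings on a 0.5 to 5 scale
toy_true = np.array([4.0, 3.5, 5.0, 2.0, 3.0, 4.5])

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# baseline 1: predict 0 for every rating
rmse_zero = rmse(toy_true, np.zeros_like(toy_true))

# baseline 2: predict the mean rating for every rating
rmse_mean = rmse(toy_true, np.full_like(toy_true, toy_true.mean()))

print(round(rmse_zero, 3), round(rmse_mean, 3))
```

A model whose score is close to the "always 0" baseline is adding little, while a useful model should at least beat the "always the mean" baseline.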

Now let's move on to building our **Item-to-Item** engine.
We will follow the exact same technique we used to create the user to user engine.

We will first create a pivot table that we will then use to calculate the cosine similarity for our recommender.

In [26]:
# creating a pivot table for the item-item CF
item_pivot = pd.pivot_table(train_collab, columns='userId', index='title', values='rating')
item_pivot.fillna(0, inplace=True)
item_pivot.head()

title,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# create a sparse matrix for cosine similarity calculation
sparse_item_pivot = sparse.csr_matrix(item_pivot)

In [28]:
# calculating the cosine similarity
item_sim = cosine_similarity(sparse_item_pivot)

# checking to ensure the output is as expected
item_sim

array([[1.        , 0.        , 0.        , ..., 0.32732684, 0.        ,
        0.        ],
       [0.        , 1.        , 0.70710678, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.70710678, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.32732684, 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [29]:
# creating the item_recommendation DataFrame
item_reco_df = pd.DataFrame(data=item_sim, 
                           columns=item_pivot.index,
                           index=item_pivot.index)

In [30]:
# viewing the item_recommendation DataFrame
item_reco_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.157993,0.0,...,0.0,0.0,0.543305,1.0,0.0,0.0,0.156532,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.191083,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,1.0,0.857493,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.857493,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now that we have the `item_reco_df`, let's now calculate the predicted ratings so that we can compute the `rmse` score. The following method is **exactly the same method as before.**

In [31]:
test_collab.sort_values('title', inplace=True)

test_collab['item_based_preds'] = np.nan

row_num = 0

while row_num < len(test_collab):
    try:
        item_name = test_collab.iloc[row_num,2]
        item_name_sim = item_reco_df[item_name].drop(item_name)
        item_name_sim = item_name_sim[item_name_sim > 0]

        item_name_weights = item_name_sim.values/np.sum(item_name_sim.values)

        get_ratings_item_name = item_pivot.T
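        # keep only the users who have not rated the current item in the
        # training data (the columns of item_pivot.T are titles, so the
        # filter must use item_name rather than a userId)
        get_ratings_item_name = get_ratings_item_name[get_ratings_item_name[item_name] == 0]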
        get_ratings_item_name = get_ratings_item_name.drop(item_name, axis=1)
        get_ratings_item_name = get_ratings_item_name[item_name_sim.index]

        ratings_item_name = np.dot(get_ratings_item_name.fillna(0).values, item_name_weights)
        ratings_item_name_df = pd.DataFrame(ratings_item_name, index=get_ratings_item_name.index, columns=['rating'])

        temp_df = test_collab[test_collab['title'] == item_name]

        for _ in range(0, len(temp_df)):
            if row_num < len(test_collab):
                try:
                    user_id = test_collab.iloc[row_num, 0]
                    test_collab.iloc[row_num, 4] = ratings_item_name_df.loc[user_id, 'rating']
                    row_num += 1
                except KeyError:
                    test_collab.iloc[row_num, 4] = 0
                    row_num += 1
    except KeyError:
        test_collab.iloc[row_num, 4] = 0
        row_num += 1

In [32]:
# computing the rmse score
sqrt(mean_squared_error(test_collab['item_based_preds'], test_collab['rating']))

3.6502707831501184

I also decided to test mean-centering the ratings; however, the `rmse` score barely changed (3.6522 versus 3.6503), as we can see below. As such, I removed those cells and only kept the output below.

In [33]:
# with mean centering
print("sqrt(mean_squared_error(test_collab['item_based_preds'], test_collab['rating'])")
print("RMSE: 3.6521818368988557")

sqrt(mean_squared_error(test_collab['item_based_preds'], test_collab['rating'])
RMSE: 3.6521818368988557
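Since the mean-centering cells were removed, the following is only a sketch of one common way to mean-center a ratings pivot table, and not necessarily the code that produced the score above. It assumes a pivot with `NaN` for "not rated" (that is, before `fillna(0)`), and the toy values are made up.

```python
import numpy as np
import pandas as pd

# hypothetical user x movie pivot, NaN = not rated
toy_pivot = pd.DataFrame(
    {
        "Movie A": [5.0, 2.0, np.nan],
        "Movie B": [4.0, np.nan, 3.0],
        "Movie C": [np.nan, 1.0, 5.0],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)

# each user's mean over the movies they actually rated (NaN is skipped)
user_means = toy_pivot.mean(axis=1)

# subtract the user's own mean from each of their ratings
centered = toy_pivot.sub(user_means, axis=0)

# after centering, filling with 0 means "average for this user"
# instead of "worst possible rating"
centered_filled = centered.fillna(0)
print(centered_filled)
```

Predictions made on the centered scale need the user's mean added back before they are compared with the original `rating` values.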


Now that we have the `rmse` score for both methods, let's compile them and take a look.

---

## Memory Based Collaborative Filtering Engine Evaluation


In [35]:
mem_cf_scores = pd.DataFrame({'model': ['CF_user-user', 
                                        'CF_item-item'],
                              'score': [3.2250661900359954, 
                                        3.6521818368988557]}
                            )
mem_cf_scores

Unnamed: 0,model,score
0,CF_user-user,3.225066
1,CF_item-item,3.652182


As we can clearly see in the above table, the **user-user** engine performs better. However, even though at first glance this `rmse` score may seem alright, we must think about it from the perspective of the range of the `rating` feature. The max `rating` is 5, so our `rmse` as a percentage of the max `rating` is 64.5%. In other words, the predictions of our **user-user** model are typically off by about 3.2 stars on a 5-star scale, which is a very large error. Let's complete the evaluation of these models.

#### Pros<br>
* Easy to implement
* Can actually output a predicted rating which makes it easier to measure the model on a metric
* Data transformation is explainable 

#### Cons<br>
* Doesn't address the cold start problem, when a new user or item enters the system
* Doesn't deal with sparse data *(our dataset consists of users who have rated at least 20 movies; in reality this is rarely the case, and data is usually much more sparse.)*
* If there are users or items that don't have any ratings, the model suffers *(the main feature this model uses is the `rating` feature)*
* The model tends to recommend popular items *(e.g. all the highest rated items for each user)*

---

# Alternative Approach

Even though we were able to successfully implement the above model, it is clear that we need to implement another method that will have a higher degree of accuracy. In the next notebook, we will explore **Model-Based Collaborative Filtering**, with the help of the `Surprise` library.