<h1>Recommender systems using Collaborative filtering</h1>

___

A Recommender System is a class of algorithms which aims at enhancing user experience by producing refined and relevant suggestions specific to the target user. 

Many commercial applications benefit from the use of such algorithms.

> - Video/Song Sharing Services like <u>YouTube, Netflix, Gaana, Saavn</u>, for suggesting content.
> - Social Media Websites like <u>Facebook, Twitter, Instagram</u>, for suggesting other links and advertisements.
> - E-commerce sites like <u>Amazon, Flipkart</u>, for suggested future purchases.
> - Matrimonial & Dating Sites like <u>Jeevansathi, Tinder</u>, for suggesting suitable matches. 
> - News, Blog & Q&A sites like <u>Forbes, Medium, Quora</u>, for suggesting future reads.

There are two types of recommender systems:

- <b>Collaborative Filtering</b> : 

This approach relies on the user's past behaviour. 

Collaborative Filtering is the most common technique used when it comes to building intelligent recommender systems that can learn to give better recommendations as more information about users is collected.

Most websites like Amazon, YouTube, and Netflix use collaborative filtering as a part of their sophisticated recommendation systems. You can use this technique to build recommenders that give suggestions to a user on the basis of the likes and dislikes of similar users.
  
   > - The model finds n users whose tastes are similar to those of the target user. 
   > - A score is calculated for all items that the target user has not rated, using the ratings of the similar users. 
   > - Top N items based on the calculated score are suggested to the target user.
  
- <b>Content-based Filtering</b> : This approach relies on item properties. 

   > - The model builds a profile of the target user from the properties of the items the user has rated or liked. 
   > - A score is calculated for every item the target user has not rated, based on how closely its properties match that profile. 
   > - Top N items based on the calculated score are suggested to the target user.
   
A final commercial solution will have a mix of both kinds of techniques in many different forms.


## Can you think of some problems in building an accurate Recommendation System?

- User preferences can change with time.
- Not just ratings but search history also needs to be modelled.
- Even the most active users usually rate very few items. The data is very sparse.
- For new users and items, the suggestions will not be good unless substantial information is available to build a user profile. Often, a naive approach of suggesting the most popular items is followed in the beginning.
- Data collection is very important for building a good recommendation engine. Some sites ask users to pick some preferences while creating a profile. 
- Feedback shared by users has to be analysed with text analytics, for example sentiment analysis and aspect analysis.
- Computation requirement is high.
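
The naive fallback for new users mentioned in the list above, suggesting the most popular items, can be sketched in a few lines. This is only an illustration: the ratings, the `most_popular` helper and the `min_ratings` threshold are hypothetical and are not part of the Jester data or of any library.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) triples, invented for illustration.
toy_ratings = [
    ("u1", "A", 4.0), ("u2", "A", 5.0), ("u3", "A", 3.0),
    ("u1", "B", 5.0), ("u2", "B", 4.5),
    ("u3", "C", 5.0),
]

def most_popular(ratings, n=2, min_ratings=2):
    """Rank items by mean rating, ignoring items with too few ratings."""
    by_item = defaultdict(list)
    for _, item, rating in ratings:
        by_item[item].append(rating)
    # Keep only items that enough users have rated.
    means = {item: sum(rs) / len(rs)
             for item, rs in by_item.items() if len(rs) >= min_ratings}
    return sorted(means, key=means.get, reverse=True)[:n]

print(most_popular(toy_ratings))
```

Item `C` has a perfect score but only one rating, so the threshold keeps it out of the list. Some such guard is usually needed, because a plain mean favours items that very few people have rated.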


<div class="alert alert-block alert-success">
<b>Therefore, there is no single technique that will give great recommendations. The design is complex and there is constant research towards developing a more personalised recommendation engine.</b>
</div>    

___

<div class="alert alert-block alert-warning">
    
<b>FUN FACT</b> 

From 2006 to 2009, Netflix sponsored a competition, offering a grand prize of $1,000,000 to the team that could take an offered dataset of over 100 million movie ratings and return recommendations that were 10 percent more accurate than those offered by the company's existing recommender system. 

On 21 September 2009, the grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team using tiebreaking rules.

The most accurate algorithm in 2007 used an ensemble method of 107 different algorithmic approaches, blended into a single prediction. As stated by the winners, Bell et al.,

> Predictive accuracy is substantially improved when blending multiple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique. Consequently, our solution is an ensemble of many methods.

Source : Wikipedia
</div>


### We will be using the `surprise` package, a Python library for building recommender systems
http://surpriselib.com/
<br>https://github.com/NicolasHug/Surprise
<br>http://surprise.readthedocs.io/en/stable/index.html
    
> - It deals with explicit ratings only.
> - It is distributed as the separate `scikit-surprise` package (a SciPy toolkit, or SciKit). It is not part of the scikit-learn library.

### Install surprise package

In [1]:
!pip install scikit-surprise



<b>We will use the publicly available [Jokes](http://eigentaste.berkeley.edu/dataset/) dataset. Check the [dataset page](http://eigentaste.berkeley.edu/dataset/) for more details about the dataset.</b>

## Jokes Dataset

This dataset is available at http://eigentaste.berkeley.edu/dataset/

**Over 1.7 million ratings from 59,132 users.**

`jester_jokes.tsv` contains the text of the jokes. 

Format:
> - A tab-separated file with 149 rows.
> - The row number corresponds to the joke ID referred to in the ratings files below

`jester_ratings.csv` contains the ratings data.

Format:
> - The data is a CSV file representing a 1761439 by 3 matrix. Each row contains one user's rating of one joke, in the format `UserID`, `ItemID`, `Rating`.
> - The left-most column is the UserID, followed by the ItemID (joke) and the Rating. There are a total of 59132 users and 140 jokes in this dataset.
> - Some of the jokes don't have ratings. Their IDs are: 1, 2, 3, 4, 6, 9, 10, 11, 12, 14.
> - Each rating is from (-10.00 to +10.00).

**Note that the ratings are real values ranging from -10.00 to +10.00. The jokes 1, 2, 3, 4, 6, 9, 10, 11, 12, 14 have been removed (i.e. they are never displayed or rated).**

#### Specify the file paths/names

In [2]:
import os
datasets_path = "./datasets/"
jokes_dataset_path = os.path.join(datasets_path, "jester_jokes.tsv")
ratings_dataset_path = os.path.join(datasets_path, "jester_ratings.csv")
print(jokes_dataset_path)
print(ratings_dataset_path)

./datasets/jester_jokes.tsv
./datasets/jester_ratings.csv


## Reading the files

In [3]:
import pandas as pd

In [4]:
# Print first few rows from jokes and ratings datasets.
!head ./datasets/jester_jokes.tsv

1:	A man visits the doctor. The doctor says, "I have bad news for you. You have cancer and Alzheimer's disease". The man replies, "Well, thank God I don't have cancer!"
2:	This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing. He asked her why she was leaving him and she told him that she had heard awful things about him. "What could they possibly have said to make you move out?" "They told me that you were a pedophile." He replied, "That's an awfully big word for a ten year old."
3:	Q. What's 200 feet long and has 4 teeth? A. The front row at a Willie Nelson concert.
4:	Q. What's the difference between a man and a toilet? A. A toilet doesn't follow you around after you use it.
5:	Q. What's O. J. Simpson's web address? A. Slash, slash, backslash, slash, slash, escape.
6:	Bill and Hillary Clinton are on a trip back to Arkansas. They're almost out of gas, so Bill pulls into a service station on the outskirts of town. The atten

In [5]:
!head ./datasets/jester_ratings.csv

UserID,ItemID,Rating
1,5,0.219
1,7,-9.281
1,8,-9.281
1,13,-6.7810000000000015
1,15,0.875
1,16,-9.656
1,17,-9.031
1,18,-7.4689999999999985
1,19,-8.719


### Read the jokes dataset.

This dataset is `tab` delimited and has two columns, `ItemID` and `Joke`.

Load data from jokes dataset (`jokes_dataset_path`) into the dataframe `jokes`.

In [6]:
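# Read the jokes dataset: tab-delimited with no header row, so name the columns explicitly.
jokes = pd.read_csv(jokes_dataset_path, sep="\t", header=None, names=["ItemID", "Joke"])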

# print first 5 rows.
print(jokes.head())

# print last 5 rows.
print(jokes.tail())

  ItemID                                               Joke
0     1:  A man visits the doctor. The doctor says, "I h...
1     2:  This couple had an excellent relationship goin...
2     3:  Q. What's 200 feet long and has 4 teeth? A. Th...
3     4:  Q. What's the difference between a man and a t...
4     5:  Q. What's O. J. Simpson's web address? A. Slas...
    ItemID                                               Joke
144   145:  A blonde, brunette, and a red head are all lin...
145   146:  America: 8:00 - Welcome to work! 12:00 - Lunch...
146   147:  It was the day of the big sale. Rumors of the ...
147   148:  Recently a teacher, a garbage collector, and a...
148   149:  A little girl asked her father, "Daddy? Do all...


In [7]:
# Dimensions of the jokes dataframe
jokes.shape

(149, 2)

### Read the ratings dataset.

This dataset is `comma` delimited and has three columns, `UserID`, `ItemID` and `Rating`.

Load data from ratings dataset (`ratings_dataset_path`) into the dataframe `ratings`.

In [8]:
# Read the ratings dataset: comma-delimited, with a header row.
ratings = pd.read_csv(ratings_dataset_path)

# print first 5 rows.
print(ratings.head())

# print last 5 rows.
print(ratings.tail())

   UserID  ItemID  Rating
0       1       5   0.219
1       1       7  -9.281
2       1       8  -9.281
3       1      13  -6.781
4       1      15   0.875
         UserID  ItemID  Rating
1761434   63978      57  -8.531
1761435   63978      24  -9.062
1761436   63978     124  -9.031
1761437   63978      58  -8.656
1761438   63978      44  -8.438


In [9]:
# Dimensions of the ratings dataframe.
ratings.shape

(1761439, 3)

In [10]:
# Number of unique items (ItemID) in the dataset
ratings.ItemID.nunique()


140

### Convert the dataframes into surprise objects

> - Surprise algorithms do not run directly on a Pandas dataframe. 
> - They require us to describe the data format with a `Reader` object and load the data into a Surprise `Dataset` object.

In [11]:
from surprise import Dataset, Reader

Let us get familiar with the Reader and Dataset API in Surprise.

To use Surprise, you should first know some of the basic modules and classes available in it:

The Dataset module is used to load data from files, Pandas dataframes, or even built-in datasets available for experimentation. 
<br>(`ml-100k`, `ml-1m` and `jester` are the built-in datasets in Surprise.) 

To load a dataset, some of the available methods are:
> - Dataset.load_builtin()
> - Dataset.load_from_file()
> - Dataset.load_from_df()

The Reader class is used to parse a file containing ratings. 
<br>The default format in which it accepts data is that each rating is stored in a separate line in the order user item rating. 
<br>This order and the separator can be configured using parameters:
> - **line_format** is a string that stores the order of the data with field names separated by a space, as in "item user rating".
> - **sep** is used to specify separator between fields, such as ','.
> - **rating_scale** is used to specify the rating scale. The default is (1, 5).
> - **skip_lines** is used to indicate the number of lines to skip at the beginning of the file. The default is 0.
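
To make the roles of `line_format` and `sep` concrete, here is a minimal sketch of how one line of a ratings file maps to a (user, item, rating) triple. The `parse_line` helper is hypothetical and written only for illustration; it is not Surprise's implementation, and its default separator is an assumption chosen to match the CSV file used here.

```python
def parse_line(line, line_format="user item rating", sep=","):
    """Split one line on `sep` and reorder the fields to (user, item, rating)."""
    field_names = line_format.split()            # e.g. ["item", "user", "rating"]
    values = [value.strip() for value in line.split(sep)]
    record = dict(zip(field_names, values))
    return record["user"], record["item"], float(record["rating"])

# Same rating, stored with two different layouts.
print(parse_line("1,5,0.219"))
print(parse_line("5;1;0.219", line_format="item user rating", sep=";"))
```

Both calls describe user `1` rating item `5`; only the on-disk layout differs, and that layout is what the two parameters describe.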

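<b>Defining the parser to read the data into a Surprise dataset.</b>
<b><br>The parser needs the scale of the ratings, given through rating_scale. The column order given through line_format only matters when reading from a file; a dataframe must already have its columns in the order user, item, rating.
</b>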

**Limit to users with a `UserID` below 1000, to avoid a memory error.**

In [12]:
# Keep only users with UserID < 1000 to avoid a memory error.
no_of_users = 1000

reader = Reader(line_format="user item rating", rating_scale = (-10, 10))
jokes_data = Dataset.load_from_df(ratings[ratings.UserID < no_of_users], reader)

In [13]:
# Verify the jokes_data object, This is a surprise object.
print(jokes_data)
print(type(jokes_data))

<surprise.dataset.DatasetAutoFolds object at 0x7f9ac92e1b10>
<class 'surprise.dataset.DatasetAutoFolds'>


In [14]:
# Verify first 10 ratings from the raw_ratings attribute on surprise object.
jokes_data.raw_ratings[:10]

[(1, 5, 0.219, None),
 (1, 7, -9.281, None),
 (1, 8, -9.281, None),
 (1, 13, -6.7810000000000015, None),
 (1, 15, 0.875, None),
 (1, 16, -9.656, None),
 (1, 17, -9.031, None),
 (1, 18, -7.4689999999999985, None),
 (1, 19, -8.719, None),
 (1, 20, -9.156, None)]

This returns a Surprise Dataset object which can be simply used to run algorithms. 

This means we don't have to explicitly convert the data into a user-item matrix.

Let us see what we mean by converting into a user-item matrix. 
Our data has 3 columns: `UserID`, `ItemID` and `Rating`. We can use pivot_table to spread this data.

In [15]:
# Limit to 1000 users only
# Convert UserID into rows.
# Convert ItemID into columns.
# Convert Rating to the cell value.
df1 = ratings[ratings.UserID < no_of_users].pivot_table(index = ['UserID'],
                          columns = ['ItemID'],
                          values = 'Rating')
df1.head()

| UserID / ItemID | 5 | 7 | 8 | 13 | 15 | 16 | 17 | 18 | 19 | 20 | ... | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.219 | -9.281 | -9.281 | -6.781 | 0.875 | -9.656 | -9.031 | -7.469 | -8.719 | -9.156 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | -9.688 | 9.938 | 9.531 | 9.938 | 0.406 | 3.719 | 9.656 | -2.688 | -9.562 | -9.125 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | -9.844 | -9.844 | -7.219 | -2.031 | -9.938 | -9.969 | -9.875 | -9.812 | -9.781 | -6.844 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | -5.812 | -4.5 | -4.906 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 6.906 | 4.75 | -5.906 | -0.406 | -4.031 | 3.875 | 6.219 | 5.656 | 6.094 | 5.406 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [16]:
# Dimensions of the above dataframe.
df1.shape

(905, 140)

## Collaborative filtering

**Once the data is loaded, we can build a model on it.**

<div class="alert alert-block alert-success">
    
The scope of our learning is <b>Collaborative Filtering</b>. There are two techniques in CF.

- <b>User Based</b> : If user u likes item j, then recommend item k which was liked by other users who are similar to user u.
- <b>Item based</b> : If user u likes item j, then recommend item k which is similar to item j.

</div>

**Steps Involved in Collaborative Filtering**

To build a system that can automatically recommend items to users based on the preferences of other users, the first step is to find similar users or items. The second step is to predict the ratings of the items that are not yet rated by a user. So, you will need the answers to these questions:

- How do you determine which users or items are similar to one another?
- Given that you know which users are similar, how do you determine the rating that a user would give to an item based on the ratings of similar users?
- How do you measure the accuracy of the ratings you calculate?
    
The first two questions don’t have single answers. Collaborative filtering is a family of algorithms where there are multiple ways to find similar users or items and multiple ways to calculate rating based on ratings of similar users. Depending on the choices you make, you end up with a type of collaborative filtering approach. 

The third question for how to measure the accuracy of your predictions also has multiple answers, which include error calculation techniques that can be used in many places and not just recommenders based on collaborative filtering.

One of the approaches to measure the accuracy of your result is the Root Mean Square Error (RMSE), in which you predict ratings for a test dataset of user-item pairs whose rating values are already known. The difference between the known value and the predicted value would be the error. Square all the error values for the test set, find the average (or mean), and then take the square root of that average to get the RMSE.

Another metric to measure the accuracy is Mean Absolute Error (MAE), in which you find the magnitude of error by finding its absolute value and then taking the average of all error values.
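
Both metrics can be computed directly from their definitions. The sketch below uses only the standard library; the true and predicted ratings are made up for illustration and do not come from the Jester data.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Square the errors, average them, then take the square root."""
    errors = [t - p for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(true_ratings, predicted_ratings):
    """Average the absolute values of the errors."""
    errors = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(errors) / len(errors)

# Hypothetical known ratings and model predictions for four user-item pairs.
true_r = [3.0, -2.0, 7.5, 0.0]
pred_r = [2.0, -4.0, 7.5, 1.0]

print("RMSE:", rmse(true_r, pred_r))
print("MAE :", mae(true_r, pred_r))
```

The errors here are 1, 2, 0 and -1. Squaring gives more weight to the error of 2, which is why the RMSE (about 1.22) comes out larger than the MAE (1.0).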

### Different types of algorithms in the family of collaborative filtering.

#### Memory Based

The first category includes algorithms that are memory based, in which statistical techniques are applied to the entire dataset to calculate the predictions.

To find the rating R that a user U would give to an item I, the approach includes:
> - Finding users similar to U who have rated the item I
> - Calculating the rating R based on the ratings of the users found in the previous step
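
One common way to carry out the second step is a similarity-weighted average of the neighbours' ratings. The sketch below assumes the similarities have already been computed; the user names, similarity values and ratings are hypothetical.

```python
def predict_rating(similarities, neighbour_ratings):
    """Similarity-weighted average of the neighbours' ratings for one item."""
    weighted_sum = sum(similarities[v] * r for v, r in neighbour_ratings.items())
    total_weight = sum(similarities[v] for v in neighbour_ratings)
    return weighted_sum / total_weight

# Hypothetical similarity of user U to three users who have rated item I,
# and the ratings those users gave to I.
similarities = {"v1": 0.9, "v2": 0.5, "v3": 0.1}
neighbour_ratings = {"v1": 8.0, "v2": 2.0, "v3": -6.0}

print(predict_rating(similarities, neighbour_ratings))
```

The prediction (about 5.07) sits close to the rating given by `v1`, the most similar user, while the dissimilar `v3` barely moves it.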

**How to Find Similar Users on the Basis of Ratings**

To understand the concept of similarity, let’s create a simple dataset first.

The data includes four users A, B, C, and D, who have rated two movies. The ratings are stored in lists, and each list contains two numbers indicating the rating of each movie:

> - Ratings by A are [1.0, 2.0].
- Ratings by B are [2.0, 4.0].
- Ratings by C are [2.5, 4.0].
- Ratings by D are [4.5, 5.0].

Plot the ratings of two movies given by the users on a graph and look for a pattern. 

The graph looks like this:
![vectors](./images/sim-vectors.png)

In the graph above, each point represents a user and is plotted against the ratings they gave to two movies.

Looking at the distance between the points seems to be a good way to estimate similarity,  right? 

You can find the distance using the formula for Euclidean distance between two points. 

You can use the function available in scipy as shown in the following code:

In [17]:
# Data points
A = [1.0, 2.0]
B = [2.0, 4.0]
C = [2.5, 4.0]
D = [4.5, 5.0]

**Find the distance between A, B and D to C**

In [18]:
from scipy import spatial
print("Euclidean Distances")
print("===================")
print("C-A : ", spatial.distance.euclidean(C, A))
print("C-B : ", spatial.distance.euclidean(C, B))
print("C-D : ", spatial.distance.euclidean(C, D))

Euclidean Distances
C-A :  2.5
C-B :  0.5
C-D :  2.23606797749979


You can see that user C is closest to B even by looking at the graph. 

But **out of A and D only, who is C closer to?**

You could say C is closer to D in terms of distance. 
But looking at the ratings, the choices of C seem to align with those of A more than with those of D: both A and C like the second movie much more than the first (A twice as much, C 1.6 times as much), while D likes both movies almost equally.

So, what can you use to identify such patterns that Euclidean distance cannot? 

Can the angle between the lines joining the points to the origin be used to make a decision? 

You can take a look at the angle between the lines joining the origin of the graph to the respective points as shown:

![cosine_distance](./images/sim-cosine.png)

The graph shows four lines joining each point to the origin. The lines for A and B are coincident, making the angle between them zero.

You can consider that, if the angle between the lines is increased, then the similarity decreases, and if the angle is zero, then the users are very similar.

To calculate similarity using angle, you need a function that returns a higher similarity or smaller distance for a lower angle and a lower similarity or larger distance for a higher angle. 

**The cosine of an angle is a function that decreases from 1 to -1 as the angle increases from 0 to 180 degrees.**

You can use the cosine of the angle to find the similarity between two users. 

The higher the angle, the lower the cosine and thus the lower the similarity of the users. You can also turn the cosine of the angle into the cosine distance between the users by subtracting it from 1.

scipy has a function that calculates the cosine distance of vectors. It returns a higher value for higher angle:

In [19]:
from scipy import spatial
print("Cosine Distances")
print("================")
print("C-A : ", spatial.distance.cosine(C, A))
print("C-B : ", spatial.distance.cosine(C, B))
print("C-D : ", spatial.distance.cosine(C, D))

Cosine Distances
C-A :  0.004504527406047898
C-B :  0.004504527406047898
C-D :  0.015137225946083022


The lower angle between the vectors of C and A gives a lower cosine distance value. If you want to rank user similarities in this way, use cosine distance.
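
The cosine distance can also be worked out by hand from its definition, which makes it clearer what the scipy call computes. This is a minimal sketch using only the standard library; the helper names are chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    return 1.0 - cosine_similarity(u, v)

A = [1.0, 2.0]
B = [2.0, 4.0]
C = [2.5, 4.0]
D = [4.5, 5.0]

for name, other in (("A", A), ("B", B), ("D", D)):
    print("C-" + name, ":", round(cosine_distance(C, other), 6))
```

Because A and B lie on the same line through the origin, C is at the same cosine distance from both, even though the Euclidean distances differ by a factor of five.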

## Algorithms Based on K-Nearest Neighbours (k-NN)

For the memory-based approaches discussed above, the algorithm that fits the bill is centered k-NN: it weights neighbours with a similarity measure such as the cosine similarity shown above and takes each user's mean rating into account when predicting. It is available in Surprise as `KNNWithMeans`.

To find the similarity, you simply have to configure the function by passing a dictionary as an argument to the recommender function. The dictionary should have the required keys, such as the following:

> - **name** contains the similarity metric to use. Options are cosine, msd, pearson, or pearson_baseline. The default is msd.
- **user_based** is a boolean that tells whether the approach will be user-based or item-based. The default is True, which means the user-based approach will be used.
- **min_support** is the minimum number of common items needed between users to consider them for similarity. For the item-based approach, this corresponds to the minimum number of common users for two items.
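
The idea behind centered k-NN is that each neighbour contributes how far its rating is above or below its own mean, and the result is added to the target user's mean. The sketch below follows that idea for a single prediction; the `centered_prediction` helper and all the numbers are hypothetical, and the code is an illustration rather than Surprise's implementation.

```python
def centered_prediction(user_mean, neighbours):
    """Predict a rating from (similarity, neighbour_rating, neighbour_mean) triples.

    Each neighbour contributes its deviation from its own mean, weighted by
    similarity; the weighted deviation is added to the target user's mean.
    """
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    denominator = sum(sim for sim, _, _ in neighbours)
    return user_mean + numerator / denominator

# Hypothetical neighbours: (similarity to the user, rating of the item, own mean rating).
neighbours = [(0.8, 6.0, 4.0), (0.4, -1.0, 2.0), (0.2, 3.0, 3.0)]

# The target user is a harsh rater with a mean rating of -1.0.
print(centered_prediction(-1.0, neighbours))
```

Centering corrects for users who rate everything high or everything low: the prediction stays near the target user's own scale (about -0.71 here) instead of near the neighbours' raw ratings.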

### Training the model

In [20]:
similarity_parameters = {
    'name' : 'cosine',
    'user_based': True,
    'min_support' : 3
}

**How many neighbours should we consider?**

The number of neighbours, `k`, is a hyperparameter. Surprise's default is 40; the example below uses `k=3`.

In [21]:
from surprise import KNNWithMeans

KNN_Algo = KNNWithMeans(k=3, sim_options = similarity_parameters)

#### Training using cross validation API

In [22]:
from surprise.model_selection import cross_validate

cross_validate(KNN_Algo, 
               jokes_data, 
               measures=['RMSE', 'MAE'], 
               cv=5, 
               verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    5.1991  5.3297  5.3089  5.2390  5.3423  5.2838  0.0554  
MAE (testset)     3.9965  4.0826  4.0945  4.0168  4.0944  4.0570  0.0418  
Fit time          0.51    0.53    0.52    0.52    0.52    0.52    0.00    
Test time         1.07    1.07    1.06    1.08    1.10    1.08    0.01    


{'test_rmse': array([5.19909587, 5.32971846, 5.30887175, 5.23897823, 5.34225415]),
 'test_mae': array([3.99654182, 4.082614  , 4.0944657 , 4.01678773, 4.09435684]),
 'fit_time': (0.5107889175415039,
  0.5251379013061523,
  0.5161430835723877,
  0.5207030773162842,
  0.5194323062896729),
 'test_time': (1.071791172027588,
  1.0700039863586426,
  1.0624070167541504,
  1.0799298286437988,
  1.0991268157958984)}

#### Training the model on complete data

In [23]:
# Use full data for training

trainset = jokes_data.build_full_trainset()

KNN_Algo.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f9ac3d9d210>

#### Filter the instances which can be used for predictions

**build_anti_testset**: Return a list of ratings that can be used as a testset in the test() method.

The ratings are all the ratings that are not in the trainset, i.e. all the ratings $r_{ui}$ where the user $u$ is known, the item $i$ is known, but the rating $r_{ui}$ is not in the trainset. As $r_{ui}$ is unknown, it is replaced by the `fill` value if one is given, and otherwise assumed to be equal to `global_mean`, the mean of all ratings.
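
The idea can be shown on a toy example. The sketch below is not Surprise's code: the `anti_testset_sketch` helper and the three ratings are hypothetical, and the global mean is used as the fill value.

```python
def anti_testset_sketch(known_ratings, fill):
    """List every (user, item, fill) for known users and items with no rating yet."""
    users = sorted({u for u, _ in known_ratings})
    items = sorted({i for _, i in known_ratings})
    return [(u, i, fill)
            for u in users for i in items
            if (u, i) not in known_ratings]

# Hypothetical trainset: user 1 rated jokes 5 and 7, user 2 rated only joke 5.
known = {(1, 5): 0.2, (1, 7): -9.3, (2, 5): -9.7}
global_mean = sum(known.values()) / len(known)

print(anti_testset_sketch(known, global_mean))
```

Only the pair (user 2, joke 7) is missing, so it is the only entry returned. On real data the anti-testset is much larger than the trainset, because most user-item pairs are unrated.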

In [24]:
# Getting data points where predictions can be made
testset = trainset.build_anti_testset()

### Making Predictions

Surprise provides a [Prediction](https://surprise.readthedocs.io/en/stable/predictions_module.html#surprise.prediction_algorithms.predictions.Prediction) object, a named tuple for storing the results of a prediction.

It’s wrapped in a class, but only for documentation and printing purposes.

Parameters:

> - uid – The (raw) user id.
> - iid – The (raw) item id.
> - r_ui (float) – The true rating $r_{ui}$.
> - est (float) – The estimated rating $\hat{r}_{ui}$.
> - details (dict) – Stores additional details about the prediction that might be useful for later analysis.

For the k-NN algorithms, `details` contains two keys:

> - actual_k : the number of neighbours actually used to calculate the rating (at most the `k` provided while defining the algo).
> - was_impossible : a flag that is True when the prediction was impossible. In that case, the estimation $\hat{r}_{ui}$ is set to the global mean of all ratings $\mu$.

In [25]:
# Making predictions
predictions = KNN_Algo.test(testset)

In [26]:
# Verify few predictions
predictions[0:4]

[Prediction(uid=1, iid=28, r_ui=1.3216032090880538, est=2.6352948949376396, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=1, iid=30, r_ui=1.3216032090880538, est=1.1482308258829255, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=1, iid=48, r_ui=1.3216032090880538, est=4.834358515828926, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=1, iid=33, r_ui=1.3216032090880538, est=-2.429074772927871, details={'actual_k': 3, 'was_impossible': False})]

#### Function to calculate top 10 predictions for each user

The following code finds the top 10 joke recommendations for each user. 

These are the joke Ids the user hasn't rated yet.

In [27]:
# Fetching top 10 predictions for each user
from collections import defaultdict

def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

top_n = get_top_n(predictions, n=10)
take(10, top_n.items())

[(1,
  [(145, 9.220575376144698),
   (56, 7.455773317925031),
   (137, 6.6595656010858795),
   (136, 6.609596458285943),
   (147, 6.542035610113534),
   (122, 6.039233244935499),
   (114, 5.874395361657554),
   (139, 5.731022402892997),
   (78, 5.5808023924561665),
   (113, 5.372598101303155)]),
 (2,
  [(121, 8.250787765350049),
   (126, 7.941681006504824),
   (130, 7.836405825603136),
   (122, 7.335861745593466),
   (105, 7.143615137560851),
   (52, 7.008628702709968),
   (98, 6.950833379616794),
   (116, 6.788894011795453),
   (124, 6.656663139292329),
   (146, 6.593037144129996)]),
 (3,
  [(138, 3.137880796773408),
   (139, 2.9304844258730123),
   (110, 1.9402686101904942),
   (35, 0.7841973859761717),
   (36, -0.35630174577875007),
   (72, -0.8699021947568584),
   (114, -1.6131133145698158),
   (142, -1.7084131601259251),
   (76, -1.7554968221795617),
   (123, -2.0684151247102642)]),
 (4,
  [(106, 0.8182247684630441),
   (108, 0.5531405187544545),
   (69, 0.03836103361967691),
   (

####  Top Predictions Matrix

Create a matrix of top-10 recommendations/suggestions for each user.

In [28]:
# Printing top predictions
for uid, user_ratings in take(10,top_n.items()):
    print(uid, [iid for (iid, _) in user_ratings])

1 [145, 56, 137, 136, 147, 122, 114, 139, 78, 113]
2 [121, 126, 130, 122, 105, 52, 98, 116, 124, 146]
3 [138, 139, 110, 35, 36, 72, 114, 142, 76, 123]
4 [106, 108, 69, 63, 93, 116, 32, 48, 40, 21]
5 [143, 117, 149, 114, 134, 145, 71, 140, 126, 148]
6 [102, 113, 76, 130, 127, 112, 117, 106, 88, 125]
7 [127, 77, 43, 116, 115, 105, 108, 118, 25, 117]
8 [143, 131, 132, 134, 114, 128, 147, 142, 118, 138]
9 [27, 116, 40, 66, 104, 148, 60, 81, 31, 43]
10 [135, 113, 127, 148, 138, 150, 140, 149, 114, 112]


In [29]:
# Display a joke text
jokes.iloc[116]

ItemID                                                 117:
Joke      A man joins a big corporate empire as a traine...
Name: 116, dtype: object

#### Recommended jokes for each user

In [30]:
# Printing top predictions
for uid, user_ratings in take(5,top_n.items()):
    print("For User",uid)
    for  (iid, _) in user_ratings:
        print(iid)
        ids = iid-1
        print(jokes.loc[ids,"Joke"])

For User 1
145
A blonde, brunette, and a red head are all lined up to be shot to death by a firing squad. The brunette shouts, "Tornado!" and the riflemen turn around to see the tornado. It isn't there, and the brunette uses that time to escape. The red head yells, "Lightning!" and the riflemen again turn to see the disaster, yet there is no disaster and the red head escapes. The blonde yells, "Fire!" The riflemen do.
56
A man and Cindy Crawford get stranded on a desert island. After a couple of days they fall in love and start sleeping together. Time passes and the man seems frustrated. Cindy asks if there is anything she can do, and he says there is one thing: "Could you put on this baseball cap and go to the other side of the island and answer me when I call you Bob?" She agrees. The next day he is walking on the other side of the island and runs into her. He says, "Hi Bob!" She says, "Hello, what's up?" He replies, "Bob, you won't believe it: I've been sleeping with Cindy Crawford 

**An example to find out how the user 100 would rate the Joke 123:**

In [31]:
pred = KNN_Algo.predict(100, 123)
pred

Prediction(uid=100, iid=123, r_ui=None, est=-5.651491439108761, details={'actual_k': 3, 'was_impossible': False})

In [32]:
pred.est

-5.651491439108761

### Tuning the Algorithm Parameters

Surprise provides a GridSearchCV class analogous to GridSearchCV from scikit-learn.

With a dict of all parameters, GridSearchCV tries all the combinations of parameters and reports the best parameters for any accuracy measure.
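
To see what "all the combinations" means, the sketch below expands a small grid with `itertools.product`. It only enumerates the candidate settings and does not train anything; the variable names are chosen for illustration.

```python
from itertools import product

# A grid with 2 x 3 x 2 candidate values.
sim_grid = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

keys = list(sim_grid)
combinations = [dict(zip(keys, values)) for values in product(*sim_grid.values())]

print(len(combinations))
print(combinations[0])
print(combinations[-1])
```

Each of the 12 settings is evaluated on every cross-validation fold, so the cost of a grid search grows quickly with the number of values per parameter.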

For example, you can check which similarity metric works best for your data in memory-based approaches:

In [33]:
from surprise.model_selection import GridSearchCV

sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

jokes_gs = GridSearchCV(KNNWithMeans, 
                  param_grid, 
                  measures=["rmse", "mae"], 
                        cv=3)

jokes_gs.fit(jokes_data)

print(jokes_gs.best_score["rmse"])
print(jokes_gs.best_params["rmse"])

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

## Model Based

The second category covers the Model based approaches, which involve a step to reduce or compress the large but sparse user-item matrix.

**Dimensionality Reduction**

In the user-item matrix, there are two dimensions:

> 1. The number of users
> 2. The number of items

If the matrix is mostly empty, reducing dimensions can improve the performance of the algorithm in terms of both space and time. You can use various methods like matrix factorization or autoencoders to do this.

Matrix factorization can be seen as breaking down a large matrix into a product of smaller ones. This is similar to the factorization of integers, where 12 can be written as 6 x 2 or 4 x 3. In the case of matrices, a matrix A with dimensions m x n can be reduced to a product of two matrices X and Y with dimensions m x p and p x n respectively.

The reduced matrices actually represent the users and items individually. The m rows in the first matrix represent the m users, and the p columns tell you about the features or characteristics of the users. The same goes for the item matrix with n items and p characteristics.
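As a small illustration of the shapes involved, the sketch below factorizes a hypothetical, fully observed 4 x 3 ratings matrix into a 4 x 2 user matrix and a 2 x 3 item matrix using NumPy's truncated SVD. The matrix values and the choice p = 2 are made up for the example. Real rating matrices have missing entries, so a recommender cannot apply a plain SVD like this directly; treat it only as a picture of what "m x p times p x n" means.

```python
import numpy as np

# Hypothetical ratings: 4 users (rows) x 3 items (columns), no missing values.
A = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

p = 2  # number of latent features to keep (assumed for this sketch)

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the p largest singular values.
X = U[:, :p] * s[:p]  # m x p: one row of p features per user
Y = Vt[:p, :]         # p x n: one column of p features per item

A_approx = X @ Y      # rank-p approximation of A

print(X.shape, Y.shape)  # (4, 2) (2, 3)
print(np.round(A_approx, 2))
```

Each user is now described by p numbers instead of n ratings, and each item by p numbers instead of m ratings, which is where the space saving comes from when p is much smaller than m and n.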

One of the popular algorithms to factorize a matrix is the [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) algorithm. 

In [34]:
from surprise import SVD
from surprise import Dataset, accuracy

# Load the built-in MovieLens 100k dataset (downloaded on first use).
ml_data = Dataset.load_builtin('ml-100k')
raw_ratings = ml_data.raw_ratings

# Split the raw ratings: first 75% and remaining 25%.
threshold = int(.75 * len(raw_ratings))
trainset_raw_ratings = raw_ratings[:threshold]
test_raw_ratings = raw_ratings[threshold:]

# SVD with default hyperparameters: n_epochs=20, lr_all=0.005, reg_all=0.02.
SVD_Algo = SVD()

# Caution: build_full_trainset() uses every rating in ml_data, including
# those in test_raw_ratings, so the "testset" RMSE below is not a true
# held-out estimate.
trainset = ml_data.build_full_trainset()
SVD_Algo.fit(trainset)

# Evaluate on the first 75% of the ratings.
train_eval_set = ml_data.construct_testset(trainset_raw_ratings)
predictions = SVD_Algo.test(train_eval_set)
print('Accuracy on the trainset:')
accuracy.rmse(predictions)

# Evaluate on the remaining 25% of the ratings.
testset = ml_data.construct_testset(test_raw_ratings)
predictions = SVD_Algo.test(testset)
print('Accuracy on the testset:')
accuracy.rmse(predictions)

Accuracy on the trainset:
RMSE: 0.6749
Accuracy on the testset:
RMSE: 0.6694


0.6693703976669881

For model-based approaches, we can use Surprise to check which values of the following hyperparameters work best:

- **n_epochs** is the number of iterations of stochastic gradient descent (SGD), an iterative optimization method used to minimize a function.
- **lr_all** is the learning rate for all parameters; it controls how much the parameters are adjusted in each iteration.
- **reg_all** is the regularization term for all parameters, a penalty added to prevent overfitting.
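To show where these three values enter the training loop, here is a minimal NumPy sketch of matrix factorization trained with SGD. The function names `sgd_factorize` and `rmse` and the toy ratings are hypothetical. The sketch also leaves out the bias terms and global mean that Surprise's `SVD` uses, so it is a simplified illustration and not the library's implementation.

```python
import numpy as np


def sgd_factorize(ratings, n_users, n_items, n_factors=2,
                  n_epochs=20, lr_all=0.005, reg_all=0.02, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, (n_users, n_factors))
    Q = rng.normal(0.0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):                 # n_epochs: passes over the ratings
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]             # prediction error for this rating
            p_u = P[u].copy()                 # keep the old value for Q's update
            # lr_all scales each step; reg_all pulls the factors toward zero.
            P[u] += lr_all * (err * Q[i] - reg_all * P[u])
            Q[i] += lr_all * (err * p_u - reg_all * Q[i])
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error of the factor model on the given ratings."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


# Toy data: (user index, item index, rating) for 4 users and 3 items.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0), (3, 0, 2.0), (3, 2, 4.0)]

P0, Q0 = sgd_factorize(ratings, n_users=4, n_items=3, n_epochs=0)
P, Q = sgd_factorize(ratings, n_users=4, n_items=3, n_epochs=200)

print("RMSE before training:", rmse(ratings, P0, Q0))
print("RMSE after training: ", rmse(ratings, P, Q))
```

More epochs or a larger learning rate fit the training ratings more closely, while a larger `reg_all` holds the factors back. The grid search explores that trade-off.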

**Note: Keep in mind that there won’t be any similarity metrics in matrix factorization algorithms as the latent factors take care of similarity among users or items.**

In [35]:
%%time
import random

# Load the full dataset.
ml_data = Dataset.load_builtin('ml-100k')
raw_ratings = ml_data.raw_ratings

# Shuffle the ratings so the split is random.
random.shuffle(raw_ratings)

# 75% trainset, 25% testset
threshold = int(.75 * len(raw_ratings))
trainset_raw_ratings = raw_ratings[:threshold]
test_raw_ratings = raw_ratings[threshold:]

ml_data.raw_ratings = trainset_raw_ratings  # ml_data now holds only the training ratings

# Select the best hyperparameters with grid search.
print('GRID SEARCH BEGIN...')
param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6]
}

movie_gs = GridSearchCV(SVD,
                        param_grid,
                        measures=["rmse", "mae"],
                        cv=3)

movie_gs.fit(ml_data)
print('GRID SEARCH END...')

GRID SEARCH BEGIN...
GRID SEARCH END...
CPU times: user 25.9 s, sys: 64.2 ms, total: 26 s
Wall time: 26 s


In [37]:
# Algorithm instance configured with the parameters that gave the best RMSE.
ml_final = movie_gs.best_estimator['rmse']

# Retrain on the whole training set (ml_data holds only the training ratings).
trainset = ml_data.build_full_trainset()
ml_final.fit(trainset)

# Evaluate on the training ratings.
train_eval_set = ml_data.construct_testset(trainset_raw_ratings)
predictions = ml_final.test(train_eval_set)
print('Accuracy on the trainset:')
accuracy.rmse(predictions)

# Evaluate on the held-out test ratings.
testset = ml_data.construct_testset(test_raw_ratings)
predictions = ml_final.test(testset)
print('Accuracy on the testset:')
accuracy.rmse(predictions)

Accuracy on the trainset:
RMSE: 0.9375
Accuracy on the testset:
RMSE: 0.9691


0.9691089967100166

In this activity, we learnt:
> - Collaborative filtering
>     - User Based Collaborative Filtering
>     - Item Based Collaborative Filtering
> - Distance Measures
>     - Euclidean distance
>     - Cosine similarity
> - Collaborative Filtering Algorithms
>     - Memory Based
>         - KNN
>     - Model Based
>         - SVD
> - Recommendations using Collaborative Filtering.
> - Hyperparameter tuning using cross validation and grid search.

References:

> - https://realpython.com/build-recommendation-engine-collaborative-filtering/
> - https://surprise.readthedocs.io/en/stable/getting_started.html

Datasets:
> - http://eigentaste.berkeley.edu/dataset/
> - https://github.com/caserec/Datasets-for-Recommneder-Systems