<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Collaborative Filtering based Recommender System using K Nearest Neighbor**


Estimated time needed: **60** minutes


Collaborative filtering is probably the most commonly used recommendation algorithm. There are two main types of methods:
 - **User-based** collaborative filtering is based on the user similarity or neighborhood
 - **Item-based** collaborative filtering is based on similarity among items


They both work similarly, so let's briefly explain how user-based collaborative filtering works.


User-based collaborative filtering looks for users who are similar. This is very similar to the user clustering method done previously, where we employed explicit user profiles to calculate user similarity. However, explicit user profiles may not be available, so how can we determine whether two users are similar?


#### User-item interaction matrix 


For most collaborative filtering-based recommender systems, the main dataset format is a 2-D matrix called the user-item interaction matrix. In the matrix, each row is labeled with a user id/index and each column with an item id/index, and the element `(i, j)` represents the rating of user `i` for item `j`.

Below is a simple example of a user-item interaction matrix:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_4/images/user_item_matrix.png)


#### KNN-based collaborative filtering


As we can see from above, each row vector represents the rating history of a user, and each column vector represents the ratings that an item received from users. A user-item interaction matrix is usually very sparse, because one user very likely interacts with only a very small subset of items, and one item is very likely interacted with by only a small subset of users.
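To make the row, column, and sparsity ideas concrete, here is a minimal sketch on a hypothetical 3-user by 4-item matrix. The matrix values and the variable names (`toy_matrix`, `user_history`, `item_ratings`) are made up for illustration and are not part of the lab dataset; `0.0` is assumed to mean "no interaction".

```python
import numpy as np

# Hypothetical interaction matrix: rows are users, columns are items,
# and 0.0 is assumed to mean "no interaction"
toy_matrix = np.array([
    [3.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [2.0, 0.0, 3.0, 3.0],
])

# Element (i, j): the rating of user i for item j
rating_u2_i3 = toy_matrix[2, 3]

# A row vector is one user's rating history
user_history = toy_matrix[0, :]

# A column vector holds the ratings one item received from all users
item_ratings = toy_matrix[:, 0]

# Sparsity: the fraction of user-item pairs without any interaction
sparsity = 1.0 - np.count_nonzero(toy_matrix) / toy_matrix.size

print(rating_u2_i3, user_history, item_ratings, sparsity)
```

Real interaction matrices are typically far sparser than this toy one, since most learners interact with only a handful of courses.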


Now to determine if two users are similar, we can simply calculate the similarities between their row vectors in the interaction matrix. Then, based on the similarity measurements, we can find the `k` nearest neighbors and treat them as the similar users.
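As a rough sketch of this step, the snippet below measures cosine similarity between row vectors and picks the `k` most similar users. Cosine similarity is only one possible choice of measure, and the toy matrix, the helper `cosine_similarity`, and `k = 2` are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Hypothetical interaction matrix: one row per user
toy_matrix = np.array([
    [3.0, 2.0, 0.0, 3.0],  # user 0 (the target user)
    [3.0, 2.0, 0.0, 2.0],  # user 1
    [0.0, 0.0, 3.0, 0.0],  # user 2
    [3.0, 0.0, 0.0, 3.0],  # user 3
])

target = 0
k = 2

# Similarity of the target user's row vector to every row vector
sims = np.array([cosine_similarity(toy_matrix[target], row) for row in toy_matrix])

# Exclude the target user itself, then keep the k most similar users
sims[target] = -np.inf
neighbors = np.argsort(sims)[::-1][:k]

print(neighbors, sims[neighbors])
```

Here user 1 and user 3 share most of the target user's rating pattern, so they come out as the two nearest neighbors, while user 2 has no overlap at all.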


Item-based collaborative filtering works similarly; we just need to look at the user-item matrix vertically. Instead of finding similar users, we are trying to find similar items (courses). If two courses are enrolled in by two groups of similar users, then we can consider the two courses similar and use the user's known ratings of the similar courses to predict the unknown rating.
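Looking at the matrix "vertically" can be sketched by transposing it, so that each row becomes an item's rating vector. The example below builds an item-item cosine similarity matrix for a hypothetical toy matrix; the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical interaction matrix: rows are users, columns are items
toy_matrix = np.array([
    [3.0, 3.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Transpose so that each row is one item's vector of user ratings
item_vectors = toy_matrix.T

# Normalize each item vector to unit length (guarding against all-zero items)
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit_vectors = item_vectors / norms

# Entry (i, j) is the cosine similarity between item i and item j
item_sims = unit_vectors @ unit_vectors.T

print(np.round(item_sims, 3))
```

Items 0 and 1 were rated identically by the same users, so their similarity is 1, while item 2 shares no users with them and gets a similarity of 0.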


If we formulate KNN-based collaborative filtering, the predicted rating of user $u$ for item $i$, $\hat{r}_{ui}$, is given by:


**User-based** collaborative filtering:


$$\hat{r}_{ui} = \frac{
\sum\limits_{v \in N^k_i(u)} \text{similarity}(u, v) \cdot r_{vi}}
{\sum\limits_{v \in N^k_i(u)} \text{similarity}(u, v)}$$


**Item-based** collaborative filtering:


$$\hat{r}_{ui} = \frac{
\sum\limits_{j \in N^k_u(i)} \text{similarity}(i, j) \cdot r_{uj}}
{\sum\limits_{j \in N^k_u(i)} \text{similarity}(i, j)}$$


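Here $N^k_i(u)$ denotes the $k$ nearest neighbors of user $u$ that have rated item $i$, and $N^k_u(i)$ denotes the $k$ nearest neighbors of item $i$ that have been rated by user $u$.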


Let's illustrate how the equation works using a simple example. From the above figure, suppose we want to predict the rating of `user6` for the `Machine Learning Capstone` course. After some similarity measurements, we found the k = 4 nearest neighbors: `user2, user3, user4, user5`, with similarities in the array `knn_sims`:


In [1]:
import numpy as np
import math

In [2]:
# An example similarity array stores the similarity of user2, user3, user4, and user5 to user6
knn_sims = np.array([0.8, 0.92, 0.75, 0.83])

Also, their ratings of the `Machine Learning Capstone` course are:


In [3]:
# 2.0 means audit and 3.0 means complete the course
knn_ratings = np.array([3.0, 3.0, 2.0, 3.0]) 

So the predicted rating of `user6` for the `Machine Learning Capstone` course can be calculated as:


In [4]:
r_u6_ml =  np.dot(knn_sims, knn_ratings)/ sum(knn_sims)
r_u6_ml

2.7727272727272725

If we already know the true rating to be 3.0, then we get a prediction error RMSE (Root Mean Squared Error) of:


In [5]:
true_rating = 3.0
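# Square the error first, then take the square root
# (for a single prediction this equals the absolute error)
rmse = math.sqrt((true_rating - r_u6_ml) ** 2)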
rmse

0.22727272727272751

The predicted rating is around 2.77 (close to 3.0, with an RMSE of about 0.23), which indicates that `user6` is also likely to complete the `Machine Learning Capstone` course. As such, we may recommend it to `user6` with high confidence.
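With only one prediction, the RMSE is simply the absolute error. In practice it is averaged over many predictions. Below is a minimal sketch of that general form; the helper name `rmse_score` and the three-rating example are hypothetical and only illustrate the calculation.

```python
import numpy as np

def rmse_score(true_ratings, predicted_ratings):
    """Root mean squared error over any number of (true, predicted) rating pairs."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_ratings = np.asarray(predicted_ratings, dtype=float)
    return float(np.sqrt(np.mean((true_ratings - predicted_ratings) ** 2)))

# A single prediction: RMSE reduces to the absolute error
single_rmse = rmse_score([3.0], [2.7727272727272725])

# Several hypothetical predictions: errors are squared, averaged, then rooted
several_rmse = rmse_score([3.0, 2.0, 3.0], [2.8, 2.4, 2.5])

print(single_rmse, several_rmse)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.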


## Objectives


After completing this lab you will be able to:


* Perform KNN-based collaborative filtering on the user-item interaction matrix


----


### Load and explore the dataset


Let's first load our dataset, i.e., a user-item (learner-course) interaction matrix:


In [6]:
import pandas as pd

In [7]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/ratings.csv"
rating_df = pd.read_csv(rating_url)

In [8]:
rating_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,3.0
1,1342067,CL0101EN,3.0
2,1990814,ML0120ENv3,3.0
3,380098,BD0211EN,3.0
4,779563,DS0101EN,3.0


The dataset contains three columns: `user` (learner id), `item` (course id), and `rating` (enrollment mode).

Note that this matrix is presented in the dense or vertical form (one row per rating), and you may convert it to a sparse matrix (one row per user, one column per item) using `pivot`:


In [9]:
rating_sparse_df = rating_df.pivot(index='user', columns='item', values='rating').fillna(0).reset_index().rename_axis(index=None, columns=None)
rating_sparse_df.head()

Unnamed: 0,user,AI0111EN,BC0101EN,BC0201EN,BC0202EN,BD0101EN,BD0111EN,BD0115EN,BD0121EN,BD0123EN,...,SW0201EN,TA0105,TA0105EN,TA0106EN,TMP0101EN,TMP0105EN,TMP0106,TMP107,WA0101EN,WA0103EN
0,2,0.0,3.0,0.0,0.0,3.0,2.0,0.0,2.0,2.0,...,0.0,2.0,0.0,3.0,0.0,2.0,2.0,0.0,3.0,0.0
1,4,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,...,0.0,2.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,2.0
2,5,2.0,2.0,2.0,0.0,2.0,0.0,0.0,0.0,2.0,...,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0
3,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Usually, the dense format is preferred because it saves a lot of storage and memory space. The benefit of the sparse matrix is that it is in the natural matrix format, so you can apply computations such as cosine similarity directly.
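The trade-off can be seen on a small hypothetical example. The sketch below uses made-up toy data (`toy_long` and the other names are illustrative) and compares how many cells each format stores, then applies a matrix operation directly to the pivoted form.

```python
import pandas as pd

# Hypothetical ratings in the dense (vertical) form: one row per rating
toy_long = pd.DataFrame({
    "user": [1, 2, 3, 4, 5, 1],
    "item": ["A", "B", "C", "D", "E", "F"],
    "rating": [3.0, 2.0, 3.0, 2.0, 3.0, 2.0],
})

# Pivot to the sparse matrix form: one row per user, one column per item
toy_wide = toy_long.pivot(index="user", columns="item", values="rating").fillna(0)

# Number of stored cells in each format
long_cells = toy_long.shape[0] * toy_long.shape[1]
wide_cells = toy_wide.shape[0] * toy_wide.shape[1]

# Fraction of the pivoted matrix that is just zero filler
zero_fraction = 1.0 - (toy_wide.to_numpy() != 0).sum() / wide_cells

# The matrix form supports linear algebra directly, e.g. user-user dot products
wide_values = toy_wide.to_numpy()
user_dot_products = wide_values @ wide_values.T

print(long_cells, wide_cells, zero_fraction)
```

Even on this tiny example most of the pivoted matrix is filler, and the gap grows quickly as the number of users and items increases.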


Next, you need to perform KNN-based collaborative filtering on the user-item interaction matrix. 
You may choose one of the two following implementation options of KNN-based collaborative filtering. 
- The first one is to use `scikit-surprise` which is a popular and easy-to-use Python recommendation system library. 
- The second way is to implement it with standard `numpy`, `pandas`, and `sklearn`. You may need to write a lot of low-level implementation code along the way.


## Implementation Option 1: Use **Surprise** library (recommended)


*Surprise* is a Python scikit library for recommender systems. It offers a simple and comprehensive way to build and test different recommendation algorithms.

First, let's install it:


In [2]:
#!pip install scikit-surprise==1.1.1



Now we import the required classes and methods:


In [10]:
from surprise import KNNBasic
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

Then, let's take a look at a code example showing how easy it is to perform KNN collaborative filtering on a sample movie review dataset, which contains about 100k movie ratings from users.


In [11]:
import numpy
numpy.version.version

'1.23.5'

In [12]:
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k', prompt=False)

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous KNNBasic algorithm.
algo = KNNBasic()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9791


0.979056350705886

In [15]:
#predictions

[Prediction(uid='458', iid='694', r_ui=4.0, est=3.973337413437159, details={'actual_k': 30, 'was_impossible': False}),
 Prediction(uid='256', iid='1150', r_ui=5.0, est=2.704852875992944, details={'actual_k': 4, 'was_impossible': False}),
 Prediction(uid='363', iid='859', r_ui=4.0, est=2.577845771226977, details={'actual_k': 11, 'was_impossible': False}),
 Prediction(uid='6', iid='22', r_ui=3.0, est=3.8191019375895214, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='698', iid='86', r_ui=2.0, est=3.92029443885759, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='503', iid='173', r_ui=5.0, est=4.3214321625370475, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='684', iid='98', r_ui=4.0, est=4.573645948645771, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='100', iid='313', r_ui=5.0, est=4.172849878192331, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='321', iid='124', r_ui=3.0, est=4.07

As you can see, with just a couple of lines you can apply KNN collaborative filtering to the sample MovieLens dataset. The main evaluation metric is `Root Mean Square Error (RMSE)`, which is a very popular rating estimation error metric used in recommender systems as well as in many regression model evaluations.


Now, let's load our own course rating dataset:


In [24]:
rating_df.to_csv("course_ratings.csv", index=False)
# Read the course rating dataset with columns user item rating
reader = Reader(
        line_format='user item rating', sep=',', skip_lines=1, rating_scale=(2, 3))

course_dataset = Dataset.load_from_file("course_ratings.csv", reader=reader)

We split it into trainset and testset:


In [25]:
trainset, testset = train_test_split(course_dataset, test_size=.3)

In [23]:
print(trainset)

<surprise.trainset.Trainset object at 0x000001AC503BDA30>


Then check how many users and items we can use to fit a KNN model:


In [19]:
print(f"Total {trainset.n_users} users and {trainset.n_items} items in the trainingset")

Total 31256 users and 122 items in the trainingset


### TASK: Perform KNN-based collaborative filtering on the user-item interaction matrix


_TODO: Fit the KNN-based collaborative filtering model using the trainset and evaluate the results using the testset:_


In [26]:
## WRITE YOUR CODE HERE:

# - Define a KNNBasic() model
# Note there are some arguments such as:
# k and min_k, representing the max and min number of neighbors for rating estimations
# sim_options, representing the similarity measurement such as cosine and whether you want it to be user-based or item-based
# e.g., sim_options = {
#        'name': 'cosine', 'user_based': False,
#    }
#
# more KNN model hyperparameters can be found here:
# https://surprise.readthedocs.io/en/stable/knn_inspired.html
# 
# You may try different hyperparameter combinations to see which one has the best performance
algo = KNNBasic()

# - Train the KNNBasic model on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# - Then compute RMSE
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.1923


0.19229086894280187


<details>
    <summary>Click here for Hints</summary>

* Create a model by calling the `KNNBasic()` class.
* Fit it with `trainset` by using `model.fit(trainset)`.
* Record predictions for the `testset` by using `model.test(testset)`.
* Compute the accuracy by using `accuracy.rmse(predictions)`.

</details>


To learn more about using the _Surprise_ library, visit its website [here](https://surprise.readthedocs.io/en/stable/getting_started.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01).


## Implementation Option 2: Use `numpy`, `pandas`, and `sklearn`


If you prefer not to use the one-stop Surprise solution and want more hardcore coding practice, you may implement the KNN model using `numpy`, `pandas`, and possibly `sklearn`:


In [None]:
## WRITE YOUR CODE HERE:

## One solution could be:
## - Calculate the similarity between two users using their rating history (the row vectors of interaction matrix)

## - Build a similarity matrix for each pair of users with the training dataset

## - For each user, find its k nearest neighbors in the sim matrix

## - For each rating in the test dataset, estimate its rating using the KNN collaborative filtering equations shown before

## - Calculate RMSE for the entire test dataset



## Summary



In this lab, you have learned and implemented KNN-based collaborative filtering. It is probably the simplest collaborative filtering algorithm, yet it is very effective and intuitive. Since it is based on KNN, it inherits the main characteristics of KNN, such as being memory-intensive, because you need to maintain a huge similarity matrix among users or items. In future labs, we will learn other types of collaborative filtering that do not rely on such a huge similarity matrix to make rating predictions.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
