# User-based Collaborative Filtering
In this exercise, we practice **user-based collaborative filtering**, a recommendation algorithm that relies on the similarity between users.


We use the following Python libraries for the exercise. 

* numpy, scipy
    * Basic libraries for data science
* pandas
    * A library for efficiently handling tabular data

---
## Load data for this exercise
In this exercise, we apply a collaborative filtering technique to the simple dataset (the Alice example) used in the lecture.
Before starting the exercise, run the following cells to load the necessary libraries.


In [None]:
import numpy as np # `as` creates an alias for the library
import pandas as pd 
from scipy.stats import rankdata

Let's load the data now.
On Google Colaboratory, we need to mount Google Drive in order to access files stored on it. Run the following code to mount Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

After executing the above code, you are prompted to enter a temporary code for authentication. After you enter it, Google Drive is mounted and you can see the ``drive`` directory in the ``Files`` tab of the left sidebar.

In this exercise, we sometimes access data and code in the ``drive/My Drive/Colab Notebooks/recommender-system-2019`` directory. For convenience, let's define constants for the directory paths.

In [None]:
# Path to a data directory
DATADIR = 'drive/My Drive/Colab Notebooks/recommender-system-2019/data/'

# Path to the directory containing our own Python library
LIBDIR = 'drive/My Drive/Colab Notebooks/recommender-system-2019/lib/'


The `pandas` library makes it easy to load and access table-like data.
Use the `read_csv` function to load the data into the variable `df`.

The data we will use for this exercise is located in the `data` directory.
The filename is **small-example.tsv**.
In this file, each row holds one user's ratings for all items, and the ratings are separated by tabs.
Please note that the first line is a header.

In [None]:
# The parameter index_col=0 uses the first column of the data as the row index (user names)
df = pd.read_csv(DATADIR + 'small-example.tsv', delimiter='\t', index_col=0)
df

user,item1,item2,item3,item4,item5
Alice,5,3,4,4,NaN
user1,3,1,2,3,3.0
user2,4,3,4,3,5.0
user3,3,3,1,5,4.0
user4,1,5,5,2,1.0


Using the `read_csv` function of `pandas`, we can load data as a **data frame** object.
In the variable `df` (the data frame object), each row corresponds to a user and each column corresponds to an item.


## Calculation of Pearson correlation coefficient
Let's calculate the similarity between users, which is the basis of user-based collaborative filtering.
Here, we use the **Pearson correlation coefficient** as the user similarity.

`pandas` dataframe objects provide the `corr` method for calculating various types of correlation coefficients.
The `corr` method calculates correlation coefficients between every pair of columns, ignoring NA/null data in the dataframe.
Since users are stored as rows in `df`, we transpose it first so that users become columns, and then use this method to calculate user similarity.


In [None]:
# The T attribute transposes the dataframe.
# `corr` compares columns, so transpose first to get correlations between users.
df.T.corr(method='pearson')

user,Alice,user1,user2,user3,user4
Alice,1.0,0.852803,0.707107,0.0,-0.792118
user1,0.852803,1.0,0.467707,0.489956,-0.900149
user2,0.707107,0.467707,1.0,-0.161165,-0.466569
user3,0.0,0.489956,-0.161165,1.0,-0.641503
user4,-0.792118,-0.900149,-0.466569,-0.641503,1.0
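
To see what "ignoring NA/null data" means here, the value for Alice and user 1 can be reproduced by hand using only the items that both users rated. The following is a minimal sketch, not part of the course library; the helper name `pearson_on_co_rated` is hypothetical, and the two rating vectors are copied from the table above.

```python
import numpy as np

# Ratings copied from the small example (NaN = not rated)
alice = np.array([5.0, 3.0, 4.0, 4.0, np.nan])
user1 = np.array([3.0, 1.0, 2.0, 3.0, 3.0])

def pearson_on_co_rated(x, y):
    """Pearson correlation over the items rated by both users (hypothetical helper)."""
    mask = ~np.isnan(x) & ~np.isnan(y)   # keep co-rated items only
    xd = x[mask] - x[mask].mean()        # deviations from the mean over co-rated items
    yd = y[mask] - y[mask].mean()
    return float(np.dot(xd, yd) / np.sqrt(np.dot(xd, xd) * np.dot(yd, yd)))

sim_alice_user1 = pearson_on_co_rated(alice, user1)
print(round(sim_alice_user1, 6))
```

Item 5 is dropped for this pair because Alice has not rated it, so the means inside the correlation are taken over items 1 to 4 only.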


We have obtained the correlation coefficients between all pairs of users.
From here on, we apply matrix/vector operations to this similarity matrix to perform collaborative filtering.
For matrix/vector operations, it is more convenient to work with `numpy` arrays than with dataframe objects.
So we convert the similarity dataframe into a `numpy` array (the **similarity matrix**).


In [None]:
sim_matrix = df.T.corr(method='pearson').values
sim_matrix

array([[ 1.        ,  0.85280287,  0.70710678,  0.        , -0.79211803],
       [ 0.85280287,  1.        ,  0.46770717,  0.48995593, -0.9001488 ],
       [ 0.70710678,  0.46770717,  1.        , -0.16116459, -0.46656947],
       [ 0.        ,  0.48995593, -0.16116459,  1.        , -0.6415029 ],
       [-0.79211803, -0.9001488 , -0.46656947, -0.6415029 ,  1.        ]])

## Handling numpy matrix objects
Let's get used to handling numpy matrix objects.
We can access elements of a matrix in various ways (indices start at 0):


In [None]:
print("0th row, 2nd column datum: ", sim_matrix[0, 2])
print("1st row, 2nd and 3rd columns data (vector): ", sim_matrix[1, [2, 3]])
print("0th row, from 1st to the last columns: ", sim_matrix[0, 1:])

0th row, 2nd column datum:  0.7071067811865475
1st row, 2nd and 3rd columns data (vector):  [0.46770717 0.48995593]
0th row, from 1st to the last columns:  [ 0.85280287  0.70710678  0.         -0.79211803]


## Predict item rating scores based on user similarity
Let's predict Alice's ($u_a$) rating for item 5 ($i_5$) using the above similarity matrix.
The approach is as follows:
1. Assume that the nearest neighbor users are user 1 and user 2, whose similarity to Alice is over 0.7.
2. Calculate the average rating of each nearest neighbor user.
3. For each nearest neighbor, calculate the difference between their rating for item 5 and their average rating.
4. Predict Alice's rating score for item 5 using the following equation:

\begin{equation}
rating(u_a, i_5) = \overline{r_{u_a}} + \frac{\sum_{u \in K}sim(u_a, u) \times (r_{u, i_5} - \overline{r_u})}{\sum_{u \in K}sim(u_a, u)}
\end{equation}

Here, $r_{u, i}$ is the rating score of user $u$ for item $i$, $\overline{r_u}$ is user $u$'s average rating score, $sim(u_x, u_y)$ is the similarity score between users $u_x$ and $u_y$, and $K$ is the set of nearest neighbor users.

First, let's obtain the similarity scores between Alice and each of user 1 and user 2.

In [None]:
# In sim_matrix, Alice, user 1, and user 2 correspond to indices 0, 1, and 2, respectively.
sim_vec = sim_matrix[0, [1,2]]
sim_vec

array([0.85280287, 0.70710678])
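
The indices `[1, 2]` are hard-coded above. As a small sketch of how the same neighbors could be derived from the threshold instead, a boolean mask can be applied to Alice's row of the similarity matrix. The variable names here are illustrative, and the similarity values are copied from the matrix shown earlier.

```python
import numpy as np

# Alice's row of the similarity matrix (copied from the output above)
sim_row = np.array([1.0, 0.85280287, 0.70710678, 0.0, -0.79211803])
target_user = 0   # Alice
threshold = 0.7

# True where the similarity reaches the threshold
neighbor_mask = sim_row >= threshold
# A user is always perfectly similar to themselves, so exclude the target
neighbor_mask[target_user] = False

neighbor_idx = np.flatnonzero(neighbor_mask)
print(neighbor_idx)
```

Indexing with `sim_row[neighbor_idx]` then gives the same vector as `sim_matrix[0, [1,2]]`.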

As seen in the denominator of the above equation, we need the sum of the user similarities to predict rating scores.
We can obtain the sum of the elements by applying the built-in `sum` function to the vector.


In [None]:
# Summation of elements in a similarity vector
sum(sim_vec)

1.5599096466089892

For the rating prediction, we need the similar users' rating scores for the target item.
To perform matrix/vector calculations efficiently, let's convert the rating data into a numpy matrix object.

In [None]:
# Transform a rating dataframe into a matrix object
rating_matrix = df.values

# Access rating scores of user 1 and user 2 for item 5
rating_matrix[[1,2], 4]

array([3., 5.])

Furthermore, we need the average rating scores of Alice and the similar users ($\overline{r}$).
The `pandas` library provides a useful method, `mean`, for calculating the average of row values (or column values) of a dataframe.
Conveniently, this method ignores NA/null data in the dataframe when calculating the average.
Let's use it.

In [None]:
# Average rating scores of each user
# If we set the parameter axis=1, we can obtain average scores by rows.
df.mean(axis=1)

# Transform data into a numpy vector object for easy vector calculation
mean_vec = df.mean(axis=1).values

# Obtain the average rating scores of user 1 and user 2
mean_vec[[1, 2]]

array([2.4, 3.8])
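
The NA handling matters for Alice, whose rating for item 5 is missing. The short sketch below, using Alice's row copied from the table, contrasts the default behavior of `mean` with `skipna=False` and with the equivalent `numpy` function `np.nanmean`.

```python
import numpy as np
import pandas as pd

# Alice's ratings; item 5 is not rated
alice = pd.Series([5, 3, 4, 4, np.nan])

mean_skipna = alice.mean()                 # default: NaN is skipped, (5+3+4+4)/4
mean_with_nan = alice.mean(skipna=False)   # NaN propagates into the result
mean_numpy = np.nanmean(alice.values)      # numpy equivalent of the default

print(mean_skipna, mean_with_nan, mean_numpy)
```

Note that a plain `np.mean` on an array containing NaN returns NaN, so `np.nanmean` is the function to reach for once the data has been converted to a numpy array.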

Now we are ready to predict Alice's rating score for item 5.
Let's calculate it using the equation below.

\begin{equation*}
rating(u_a, i_5) = \overline{r_{u_a}} + \frac{\sum_{u \in K}sim(u_a, u) \times (r_{u, i_5} - \overline{r_u})}{\sum_{u \in K}sim(u_a, u)}
\end{equation*}

In [None]:
# np.dot(v1, v2) calculates the inner product of vectors v1 and v2
np.dot(sim_vec, (rating_matrix[[1,2], 4] - mean_vec[[1,2]])) / sum(sim_vec) + mean_vec[0]

4.871979899370592
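
To make the arithmetic behind this number explicit, the values obtained above can be substituted into the equation. Alice's average rating is $(5+3+4+4)/4 = 4.0$, and the similarities are rounded to four decimals here, so the intermediate values are approximate.

\begin{equation*}
rating(u_a, i_5) \approx 4.0 + \frac{0.8528 \times (3 - 2.4) + 0.7071 \times (5 - 3.8)}{0.8528 + 0.7071} \approx 4.0 + \frac{1.3602}{1.5599} \approx 4.87
\end{equation*}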

## Generalization for the above calculation
The above calculation only predicts Alice's rating for item 5, with similar users defined as those whose similarity to Alice is over 0.7.
To generalize it, I have prepared a function that predicts an arbitrary user's rating for an arbitrary item.
The similarity threshold can be set to an arbitrary value when calling the function.
The function is defined as the `predict_rating` method of the `UserBasedCF` class in the file `cf.py` in the `lib` directory.

Let's run the following code.

In [None]:
# A setting for loading the lib directory
import sys
sys.path.append(LIBDIR)

# Import the UserBasedCF class
from cf import UserBasedCF 

ubcf = UserBasedCF() # Create an instance of the UserBasedCF class
ubcf.predict_rating(df, target_user=0, target_item=4, sim_threshold=0.7)


4.871979899370592
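
The contents of `cf.py` are not shown in this notebook. As a rough sketch of what such a threshold-based prediction could look like, the function below combines the steps from the previous section. The name `predict_rating_sketch`, the fallback to the user's mean when no neighbor qualifies, and the exclusion of neighbors who have not rated the item are all assumptions made for this illustration, not a description of the actual `UserBasedCF` implementation.

```python
import numpy as np
import pandas as pd

# The small Alice example, rebuilt inline so the sketch is self-contained
ratings = pd.DataFrame(
    {
        "item1": [5, 3, 4, 3, 1],
        "item2": [3, 1, 3, 3, 5],
        "item3": [4, 2, 4, 1, 5],
        "item4": [4, 3, 3, 5, 2],
        "item5": [np.nan, 3, 5, 4, 1],
    },
    index=["Alice", "user1", "user2", "user3", "user4"],
)

def predict_rating_sketch(df, target_user, target_item, sim_threshold):
    """Hypothetical re-implementation of a threshold-based prediction."""
    sim_vec = df.T.corr(method="pearson").values[target_user]
    rating_vec = df.values[:, target_item]
    mean_vec = df.mean(axis=1).values

    # Neighbors: similar enough, not the target, and have rated the item
    nn = sim_vec >= sim_threshold
    nn[target_user] = False
    nn &= ~np.isnan(rating_vec)
    if not nn.any():
        # Assumption: fall back to the user's own average rating
        return mean_vec[target_user]

    weighted = np.dot(sim_vec[nn], rating_vec[nn] - mean_vec[nn])
    return mean_vec[target_user] + weighted / sim_vec[nn].sum()

pred = predict_rating_sketch(ratings, target_user=0, target_item=4, sim_threshold=0.7)
print(pred)
```

On the small example this reproduces the value computed step by step in the previous section.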

---
## Method to select nearest neighbor users by neighbor number
As I explained in the lecture, we can also select nearest neighbor users by fixing the number of neighbors, instead of using a threshold on user similarity.

In the `predict_rating_with_k_nn` method of the `UserBasedCF` class, the users ranked in the top k by similarity to the target user are regarded as nearest neighbors (similar users) when calculating the prediction.

Let's run the following code to predict Alice's rating score for item 5.
Here, the number of nearest neighbors is set to 2.

In [None]:
ubcf.predict_rating_with_k_nn(df, target_user=0, target_item=4, k=2)

4.871979899370592
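
The result matches the threshold-based prediction because Alice's two most similar users are again user 1 and user 2. One way the top-k selection could be sketched is with `np.argsort`; the helper `top_k_neighbors` below is hypothetical, and the actual method in `cf.py` may rank users differently (for example, in how it breaks ties).

```python
import numpy as np

# Alice's row of the similarity matrix (copied from the output above)
sim_row = np.array([1.0, 0.85280287, 0.70710678, 0.0, -0.79211803])

def top_k_neighbors(sim_row, target_user, k):
    """Indices of the k users most similar to the target (hypothetical helper)."""
    candidates = sim_row.astype(float).copy()
    candidates[target_user] = -np.inf            # never pick the target itself
    candidates[np.isnan(candidates)] = -np.inf   # undefined similarities rank last
    order = np.argsort(-candidates, kind="stable")  # descending similarity
    return order[:k]

neighbors = top_k_neighbors(sim_row, target_user=0, k=2)
print(neighbors)
```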

---
## Assignment 1
In this assignment, we apply user-based collaborative filtering to one of the best-known recommender system datasets, the [MovieLens dataset](https://grouplens.org/datasets/movielens/).

The MovieLens dataset is a collection of rating scores for a large number of movies.
In the dataset, each rating score ranges from 1 to 5.
In this assignment, we use the **MovieLens Latest Datasets (small)**, one of the MovieLens datasets.
The data file `ratings.csv` is located in the directory ``data/ml-latest-small-transformed``.
Each row of the file contains a userID, a movieID, a rating score, and a timestamp, separated by commas.

Complete the following assignments.

### Assignment 1-1
Load the MovieLens data into the variable `ml_df` using the following `get_movie_lens_dataframe` function.


In [None]:
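def get_movie_lens_dataframe(filepath=DATADIR + 'ml-latest-small-transformed/ratings.csv'):
    """Load the ratings file into a user x movie dataframe (NaN = not rated)."""
    user_num = 610    # number of rows (users) in the rating matrix
    movie_num = 9724  # number of columns (movies) in the rating matrix
    df = pd.read_csv(filepath)

    # Start from a matrix filled with NaN, meaning "not rated"
    rating_matrix = np.full((user_num, movie_num), np.nan)

    # Shift the IDs by one to obtain 0-based matrix indices
    for _, row in df.iterrows():
        rating_matrix[int(row['userId'])-1, int(row['movieId'])-1] = row['rating']

    rating_df = pd.DataFrame(rating_matrix)
    rating_df.columns = ['item{}'.format(i) for i in range(movie_num)]
    rating_df.index = ['user{}'.format(i) for i in range(user_num)]
    return rating_df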
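
The row-by-row loop in this function can be slow on larger files. As an optional aside, the same fill step can be written with NumPy integer-array indexing. The sketch below uses a tiny made-up table (the values are not from MovieLens) and assumes, as in the function above, that IDs start at 1 and that each (user, movie) pair appears at most once.

```python
import numpy as np
import pandas as pd

# Toy long-format ratings with hypothetical values and 1-based IDs
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [1, 3, 2, 3],
    "rating": [4.0, 5.0, 3.5, 2.0],
})
user_num, movie_num = 3, 3

# NaN everywhere means "not rated"
toy_matrix = np.full((user_num, movie_num), np.nan)

# Assign all ratings at once using paired row/column index arrays
rows = toy["userId"].to_numpy() - 1
cols = toy["movieId"].to_numpy() - 1
toy_matrix[rows, cols] = toy["rating"].to_numpy()

print(toy_matrix)
```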

### Assignment 1-2
The `ml_df` loaded in Assignment 1-1 contains the rating scores of user 413.
According to `ml_df`, user 413 did not rate the movies with the following IDs:

```
unrated_movies = [5, 76, 83, 242, 319, 351, 391, 473, 492, 597, 618, 634, 659, 733, 779, 1105, 1236, 1642, 1804, 2315]
```

Use a user-based collaborative filtering technique to decide which movies to recommend to user 413.
Then, make a list of the recommended movies' IDs and their predicted rating scores, sorted in descending order of predicted score.
Here, nearest neighbor users are defined as the k users with the highest similarity to the target user.
The number of nearest neighbors k should be 20.

(Hint) use the function `ubcf.predict_rating_with_k_nn`.


### Assignment 1-3
For the same task as in Assignment 1-2, apply user-based collaborative filtering **where the nearest neighbors are selected by a threshold on user similarity**.
The similarity threshold should be 0.5.

(Hint) use the function `ubcf.predict_rating`.