# User-based Collaborative Filtering

## Assignment 1
In this assignment, we apply user-based collaborative filtering to one of the best-known recommender system datasets, the [MovieLens dataset](https://grouplens.org/datasets/movielens/).

The MovieLens dataset is a collection of rating scores that users gave to movies.
In the dataset, each rating score ranges from 1 to 5.
In this assignment, we use the **MovieLens Latest Datasets (small)**, one of the MovieLens datasets.
The MovieLens Latest Datasets (small) data file is located [here](https://github.com/trycycle/recommender-system-2020/raw/main/data/ml-latest-small-transformed/ratings.csv).
Each row of the file contains a user ID, a movie ID, a rating score, and a timestamp, separated by commas.
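
To illustrate the row format, here is a minimal sketch that parses a few rows and pivots them into a user-by-movie table. The sample values are made up, and `timestamp` is an assumed name for the fourth column; `userId`, `movieId`, and `rating` are the column names used by the loading code later in this notebook.

```python
import io
import pandas as pd

# Hypothetical sample rows in the described format (values are invented).
# `timestamp` is an assumed header name for the fourth column.
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "2,1,3.0,1445714835\n"
)
sample_df = pd.read_csv(sample_csv)

# Pivot the long table into a user x movie matrix; unrated cells become NaN.
sample_matrix = sample_df.pivot(index="userId", columns="movieId", values="rating")
print(sample_matrix)
```

Missing ratings show up as `NaN`, which is also how the rating matrix built in Assignment 1-1 marks movies a user has not rated.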

Complete the following assignments.

### Assignment 1-1
The following `get_movie_lens_dataframe` function downloads the MovieLens Latest Datasets (small) data file and converts it to a pandas dataframe.
Load the MovieLens data into the variable `ml_df` using the `get_movie_lens_dataframe` function.


In [1]:
import numpy as np
import pandas as pd 

In [2]:
def get_movie_lens_dataframe():
    """Download the ratings file and return a user x movie rating matrix."""
    user_num = 610
    movie_num = 9724
    df = pd.read_csv("https://github.com/trycycle/recommender-system-2020/raw/main/data/ml-latest-small-transformed/ratings.csv")

    # Start from an all-NaN matrix: NaN marks "not rated".
    rating_matrix = np.full((user_num, movie_num), np.nan)

    # IDs in the file are 1-based, so ID n is stored at position n-1.
    rows = df['userId'].to_numpy(dtype=int) - 1
    cols = df['movieId'].to_numpy(dtype=int) - 1
    rating_matrix[rows, cols] = df['rating'].to_numpy()

    rating_df = pd.DataFrame(rating_matrix)
    rating_df.columns = ['item{}'.format(i) for i in range(movie_num)]
    rating_df.index = ['user{}'.format(i) for i in range(user_num)]
    return rating_df

In [3]:
ml_df = get_movie_lens_dataframe()

### Assignment 1-2
The `ml_df` loaded in Assignment 1-1 contains the rating scores of user 413.
According to `ml_df`, user 413 has not rated the movies with the following IDs:

```
unrated_movies = [5, 76, 83, 242, 319, 351, 391, 473, 492, 597, 618, 634, 659, 733, 779, 1105, 1236, 1642, 1804, 2315]
```

Using user-based collaborative filtering, decide which of these movies to recommend to user 413.
Then list the IDs of the recommended movies with their predicted rating scores, in descending order of predicted score.
Here, the nearest neighbors are the k users with the highest similarity to the target user.
Set k to 20.

(Hint) use the function `ubcf.predict_rating_with_k_nn`.
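
To make the neighbor selection concrete, here is a minimal sketch of top-k user-based prediction on a toy rating matrix. It is not the code in `lib/cf.py`, whose internals are not shown in this notebook. The helper names `pearson_sim` and `predict_with_knn`, the use of Pearson correlation over co-rated items, the mean-centered weighted average, and the choice to ignore neighbors with non-positive similarity are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def pearson_sim(u, v):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    both = u.notna() & v.notna()
    if both.sum() < 2:
        return 0.0
    a = u[both] - u[both].mean()
    b = v[both] - v[both].mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def predict_with_knn(df, target_user, target_item, k):
    """Mean-centered weighted average over the k most similar users who rated the item."""
    target = df.loc[target_user]
    sims = {}
    for other in df.index:
        # Only users who actually rated the target item can contribute.
        if other == target_user or pd.isna(df.loc[other, target_item]):
            continue
        sims[other] = pearson_sim(target, df.loc[other])

    # Keep the k highest-similarity users, then drop non-positive ones (assumption).
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    neighbors = [n for n in neighbors if sims[n] > 0]

    weight_sum = sum(sims[n] for n in neighbors)
    if weight_sum == 0:
        return float(target.mean())  # fall back to the target user's mean rating
    weighted_dev = sum(
        sims[n] * (df.loc[n, target_item] - df.loc[n].mean()) for n in neighbors
    )
    return float(target.mean() + weighted_dev / weight_sum)


# Toy data (invented): NaN means "not rated".
toy = pd.DataFrame(
    {
        "item0": [5, 4, 3, 1, 5],
        "item1": [3, 2, 2, 5, 4],
        "item2": [4, 3, 5, 2, 4],
        "item3": [np.nan, 5, 1, 1, 3],
    },
    index=["user0", "user1", "user2", "user3", "user4"],
    dtype=float,
)

knn_pred = predict_with_knn(toy, target_user="user0", target_item="item3", k=3)
print(knn_pred)
```

In this toy example the three neighbors are `user1`, `user4`, and `user2`; `user3` correlates negatively with `user0` and falls outside the top 3. The library function may use a different similarity measure or aggregation, so its numbers need not match this sketch.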


In [None]:
!wget -P lib https://raw.githubusercontent.com/trycycle/recommender-system-2020/main/lib/cf.py

In [5]:
# Import the UserBasedCF class
from lib.cf import UserBasedCF

ubcf = UserBasedCF()  # Create an instance of the UserBasedCF class

unrated_movies = [5, 76, 83, 242, 319, 351, 391, 473, 492, 597, 618, 634, 659, 733, 779, 1105, 1236, 1642, 1804, 2315]

In [6]:
scores = {}
for unrated_movie in unrated_movies:
    score = ubcf.predict_rating_with_k_nn(ml_df, target_user=413, target_item=unrated_movie, k=20)
    scores[unrated_movie] = score
    
for movie_id, score in sorted(scores.items(), key=lambda x:-x[1]):
    print(movie_id, "\t", score)

1105 	 4.27500757499568
351 	 4.158368047927345
76 	 4.149126911447925
83 	 3.9514898422195834
391 	 3.9502611066953994
242 	 3.7420145899657937
779 	 3.694275892230307
2315 	 3.6097115304316896
319 	 3.1532412588896217
473 	 3.0222303792614635
5 	 2.948289328605953
597 	 2.8903441124503204
618 	 2.881144122122124
659 	 2.8479462842268957
1236 	 2.8463868427674095
492 	 2.5963242360259864
1804 	 2.44234669105025
1642 	 2.3657659395438158
634 	 1.9822581891575723
733 	 1.8975797324892267


### Assignment 1-3
For the same task as in Assignment 1-2, apply user-based collaborative filtering **with a threshold on user similarity**.
Set the similarity threshold to 0.5.

(Hint) use the function `ubcf.predict_rating`.
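
As a rough illustration of threshold-based neighbor selection, the sketch below keeps every user whose similarity to the target user reaches the threshold, regardless of how many there are. It is a sketch under stated assumptions, not the implementation behind `ubcf.predict_rating`: the function name `predict_with_threshold` is hypothetical, similarity is taken to be Pearson correlation over co-rated items (computed with `DataFrame.corr`), the comparison is assumed to be `>=`, and the prediction is assumed to be a mean-centered weighted average.

```python
import numpy as np
import pandas as pd


def predict_with_threshold(df, target_user, target_item, sim_threshold):
    """Mean-centered weighted average over users with similarity >= sim_threshold."""
    # Pairwise Pearson correlation between users, using only co-rated items.
    # Pairs with fewer than 2 co-rated items get NaN and are never selected.
    sims = df.T.corr(min_periods=2)[target_user]

    # Neighbors must pass the threshold, have rated the item, and not be the target.
    mask = (
        (sims >= sim_threshold)
        & df[target_item].notna()
        & (sims.index != target_user)
    )
    neighbors = sims[mask]

    target_mean = df.loc[target_user].mean()
    if neighbors.empty or neighbors.sum() == 0:
        return float(target_mean)  # no usable neighbor: fall back to the mean

    # Each neighbor contributes its deviation from its own mean rating.
    deviations = (
        df.loc[neighbors.index, target_item]
        - df.loc[neighbors.index].mean(axis=1)
    )
    return float(target_mean + (neighbors * deviations).sum() / neighbors.sum())


# Toy data (invented): NaN means "not rated".
toy = pd.DataFrame(
    {
        "item0": [5, 4, 3, 1, 5],
        "item1": [3, 2, 2, 5, 4],
        "item2": [4, 3, 5, 2, 4],
        "item3": [np.nan, 5, 1, 1, 3],
    },
    index=["user0", "user1", "user2", "user3", "user4"],
    dtype=float,
)

threshold_pred = predict_with_threshold(
    toy, target_user="user0", target_item="item3", sim_threshold=0.5
)
print(threshold_pred)
```

Unlike the top-k variant, the number of neighbors here depends on the data: a strict threshold can leave few or no neighbors for some items, which is one reason the two rankings in this assignment differ.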

In [7]:
scores = {}
for unrated_movie in unrated_movies:
    score = ubcf.predict_rating(ml_df, target_user=413, target_item=unrated_movie, sim_threshold=0.5)
    scores[unrated_movie] = score
    
for movie_id, score in sorted(scores.items(), key=lambda x:-x[1]):
    print(movie_id, "\t", score)

1105 	 4.731987552591785
76 	 4.174370992650663
83 	 4.1355762853680975
391 	 4.1297855685058265
779 	 4.025916564128748
2315 	 3.6369641537837296
242 	 3.4566317835841103
351 	 3.4147595539643234
319 	 3.2148393485104507
5 	 3.07269070356819
659 	 3.0700139096038175
473 	 3.0062880057940644
618 	 2.9819683488665225
492 	 2.6979284244540924
597 	 2.4910918224917995
1236 	 2.3049199420937656
1642 	 1.8010538077617115
634 	 1.5344602011033874
1804 	 1.4147595539643234
733 	 0.9495186135370406
