# Item-based collaborative filtering and parameter selection for finding nearest neighbors

## Assignment 2
In this assignment, we will apply collaborative filtering to **MovieLens Latest Datasets (small)**, the dataset used in assignment 1.
As in assignment 1, load the MovieLens data into the variable `ml_df` using the function `get_movie_lens_dataframe`, and complete the following assignments.


### Assignment 2-1
User 413 in `ml_df` has not yet rated the following movies (the numbers are movie IDs).
```
unrated_movies = [5, 76, 83, 242, 319, 351, 391, 473, 492, 597, 618, 634, 659, 733, 779, 1105, 1236, 1642, 1804, 2315]
```

By applying **item-based collaborative filtering** (the `predict_rating_with_k_nn` method of `ItemBasedCF`) to the dataset, show the top 20 recommended movies for user 413 along with their predicted rating scores.
Note that the number of nearest neighbor items (k) is 20.
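
To make the idea concrete, here is a minimal sketch of one common form of item-based k-NN prediction: the predicted rating is a similarity-weighted average of the target user's ratings on the k items most similar to the target item. The helper names (`cosine_similarity`, `predict_item_based`) and the toy matrix are illustrative assumptions. The implementation in `lib/cf.py` is not shown here and may use a different similarity measure or normalization.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity over the users who rated both items (NaN = unrated)."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return 0.0
    num = np.dot(a[both], b[both])
    den = np.linalg.norm(a[both]) * np.linalg.norm(b[both])
    return num / den if den > 0 else 0.0

def predict_item_based(ratings, user, item, k):
    """Similarity-weighted average of the user's ratings on the k most similar items."""
    # Candidate neighbors: other items that the target user has already rated.
    candidates = [j for j in range(ratings.shape[1])
                  if j != item and not np.isnan(ratings[user, j])]
    sims = [(cosine_similarity(ratings[:, item], ratings[:, j]), j) for j in candidates]
    # Keep the k most similar items that have a positive similarity.
    top = sorted((s for s in sims if s[0] > 0), reverse=True)[:k]
    if not top:
        return np.nan
    weights = np.array([s for s, _ in top])
    values = np.array([ratings[user, j] for _, j in top])
    return float(np.dot(weights, values) / weights.sum())

# Toy matrix (hypothetical data): rows are users, columns are items.
toy = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])
pred = predict_item_based(toy, user=0, item=2, k=2)
print(round(pred, 3))
```

Because the prediction is a weighted average of ratings the user has already given, it always falls within the range of those ratings.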

In [None]:
!wget -P lib https://raw.githubusercontent.com/trycycle/recommender-system-2020/main/lib/cf.py

In [2]:
import numpy as np
import pandas as pd 

# Import the user-based and item-based collaborative filtering classes
from lib.cf import UserBasedCF, ItemBasedCF

In [3]:
def get_movie_lens_dataframe():
    """Load the MovieLens ratings into a user x item DataFrame (NaN = unrated)."""
    user_num = 610
    movie_num = 9724
    df = pd.read_csv("https://github.com/trycycle/recommender-system-2020/raw/main/data/ml-latest-small-transformed/ratings.csv")

    # Start from an all-NaN matrix so that unrated cells stay missing.
    rating_matrix = np.full((user_num, movie_num), np.nan)

    # Shift the IDs by one to get 0-based row and column positions.
    user_pos = df['userId'].astype(int).to_numpy() - 1
    movie_pos = df['movieId'].astype(int).to_numpy() - 1
    rating_matrix[user_pos, movie_pos] = df['rating'].to_numpy()

    rating_df = pd.DataFrame(rating_matrix)
    rating_df.columns = ['item{}'.format(i) for i in range(movie_num)]
    rating_df.index = ['user{}'.format(i) for i in range(user_num)]
    return rating_df

In [4]:
ml_df = get_movie_lens_dataframe()
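
The loader places the rating for ID `n` at position `n - 1` and labels that row or column `user{n-1}` / `item{n-1}`. The small sketch below reproduces this mapping on a hypothetical two-row miniature of `ratings.csv` (the data and the name `toy_df` are assumptions for illustration). The later cells use positional access such as `iloc[413, 401]`. How `predict_rating_with_k_nn` interprets `target_user` and `target_item` is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of ratings.csv.
toy = pd.DataFrame({"userId": [1, 3], "movieId": [2, 1], "rating": [4.0, 2.5]})

# Same construction as the loader: ID n goes to position n - 1.
matrix = np.full((3, 2), np.nan)
matrix[toy["userId"].to_numpy() - 1, toy["movieId"].to_numpy() - 1] = toy["rating"].to_numpy()

toy_df = pd.DataFrame(
    matrix,
    index=["user{}".format(i) for i in range(3)],
    columns=["item{}".format(i) for i in range(2)],
)

# userId 1 / movieId 2 lands at position (0, 1), which is labeled user0 / item1.
by_position = toy_df.iloc[0, 1]
by_label = toy_df.loc["user0", "item1"]
print(by_position, by_label)
```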

In [5]:
unrated_movies = [5, 76, 83, 242, 319, 351, 391, 473, 492,
                  597, 618, 634, 659, 733, 779, 1105, 1236, 1642, 1804, 2315]

In [6]:
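ibcf = ItemBasedCF()

# Predict a rating for each movie that user 413 has not rated yet.
scores = {}
for unrated_movie in unrated_movies:
    score = ibcf.predict_rating_with_k_nn(ml_df, target_user=413, target_item=unrated_movie, k=20)
    scores[unrated_movie] = score

# Print the movies in descending order of predicted rating.
for movie_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(movie_id, "\t", score)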

76 	 4.606423682279282
83 	 4.594012616142038
391 	 4.278522487229257
1105 	 4.23365772743011
242 	 4.175590304748099
351 	 4.163537127429494
2315 	 3.5816218742719363
319 	 3.420120852827823
779 	 3.4199579479526494
5 	 3.344680567104595
1642 	 3.224534952716217
1236 	 2.9569680003215493
492 	 2.8314425160512635
733 	 2.8271404602246855
473 	 2.8137895007554814
618 	 2.72559890234164
597 	 2.710503049534268
659 	 2.694032720253541
1804 	 2.6884885543673867
634 	 2.1157836577952187


### Assignment 2-2
User 413 rated movie 401 with a score of 3 in `ml_df`.
Assume that the user has not yet rated the movie, and predict user 413's rating for movie 401 using **item-based collaborative filtering**.
Also, calculate the absolute difference (delta) between the actual value (3) and the predicted value.
Note that the number of nearest neighbor items (k) is 20.

In [7]:
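# Work on a copy so that the original ratings stay intact,
# and hide the known rating to simulate an unrated movie.
_ml_df = ml_df.copy()
_ml_df.iloc[413, 401] = np.nan

score = ibcf.predict_rating_with_k_nn(_ml_df, target_user=413, target_item=401, k=20)
delta = abs(score - 3.0)  # absolute error against the actual rating (3)
print(score, delta)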

4.560602443867915 1.5606024438679151


### Assignment 2-3
In assignment 2-2, you calculated the delta for k = 20.
Repeat the same calculation while varying k from 10 to 200 in steps of 10.
Then, check how the absolute delta changes.

In [8]:
ibcf_scores = []
for k in range(10, 201, 10):
    score = ibcf.predict_rating_with_k_nn(_ml_df, target_user=413, target_item=401, k=k)
    delta = abs(score - 3.0)
    ibcf_scores.append((k, score, delta))
    print(k, score, delta)

10 4.440927849879402 1.4409278498794018
20 4.560602443867915 1.5606024438679151
30 4.545395455296045 1.5453954552960454
40 4.451648390556956 1.4516483905569562
50 4.426893348951836 1.426893348951836
60 4.382375990403819 1.382375990403819
70 4.372678242193163 1.372678242193163
80 4.375555360462801 1.3755553604628012
90 4.332442530216759 1.3324425302167588
100 4.305459003246238 1.3054590032462379
110 4.305074695726723 1.3050746957267227
120 4.297740995588388 1.2977409955883878
130 4.291711038193224 1.291711038193224
140 4.275279210210138 1.2752792102101376
150 4.2608527556707125 1.2608527556707125
160 4.243360816537545 1.2433608165375452
170 4.227488124328039 1.2274881243280387
180 4.217827188776154 1.2178271887761536
190 4.20120901638749 1.20120901638749
200 4.205023005667773 1.2050230056677727
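
To read off the best k from such a sweep, we can take the entry with the smallest absolute delta. The sketch below uses a subset of the (k, delta) pairs from the output above, rounded to four decimals. The variable name `sample` is an assumption for illustration. In the notebook, the same `min` call could be applied to `ibcf_scores` with `key=lambda t: t[2]`.

```python
# (k, absolute delta) pairs copied from the output above, rounded to 4 decimals.
sample = [(10, 1.4409), (20, 1.5606), (50, 1.4269), (100, 1.3055),
          (150, 1.2609), (190, 1.2012), (200, 1.2050)]

# Pick the pair whose delta (second element) is smallest.
best_k, best_delta = min(sample, key=lambda pair: pair[1])
print(best_k, best_delta)  # 190 1.2012
```

Since this sweep evaluates a single held-out rating, the selected k is only indicative. A more reliable choice would average the error over many held-out ratings.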


### Assignment 2-4
Repeat the calculations from assignment 2-3 using **user-based collaborative filtering**.
Then, check how the absolute delta changes.

In [9]:
ubcf = UserBasedCF()

ubcf_scores = []
for k in range(10, 201, 10):
    score = ubcf.predict_rating_with_k_nn(_ml_df, target_user=413, target_item=401, k=k)
    delta = abs(score - 3.0)
    ubcf_scores.append((k, score, delta))
    print(k, score, delta)

10 4.0907352113980044 1.0907352113980044
20 3.911921944512341 0.9119219445123412
30 3.786168301773572 0.7861683017735719
40 3.8363315068057373 0.8363315068057373
50 3.914248248307037 0.9142482483070369
60 3.876752399186387 0.8767523991863868
70 3.880964016256943 0.880964016256943
80 3.875050949706342 0.8750509497063419
90 3.87351160841324 0.8735116084132399
100 3.8483383567245184 0.8483383567245184
110 3.825438755943037 0.8254387559430372
120 3.7859430988752796 0.7859430988752796
130 3.7714959339408964 0.7714959339408964
140 3.761224332535325 0.761224332535325
150 3.7527542669654523 0.7527542669654523
160 3.7498923488290457 0.7498923488290457
170 3.7657360541895026 0.7657360541895026
180 3.8012884088928565 0.8012884088928565
190 3.8012884088928565 0.8012884088928565
200 3.8012884088928565 0.8012884088928565
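
The two sweeps can be compared side by side. The sketch below copies a few absolute deltas from the item-based and user-based outputs above (rounded to four decimals) and prints the gap at each k. The dictionary names are assumptions for illustration.

```python
# Absolute deltas copied from the two outputs above (rounded), keyed by k.
ibcf_delta = {10: 1.4409, 50: 1.4269, 100: 1.3055, 160: 1.2434, 200: 1.2050}
ubcf_delta = {10: 1.0907, 50: 0.9142, 100: 0.8483, 160: 0.7499, 200: 0.8013}

for k in sorted(ibcf_delta):
    gap = ibcf_delta[k] - ubcf_delta[k]
    print(f"k={k:3d}  item-based={ibcf_delta[k]:.4f}  "
          f"user-based={ubcf_delta[k]:.4f}  gap={gap:+.4f}")

# True if user-based filtering has the smaller error at every sampled k.
user_based_wins = all(ubcf_delta[k] < ibcf_delta[k] for k in ibcf_delta)
print(user_based_wins)
```

For this one held-out rating, the user-based prediction is closer to the actual value at every sampled k, but a single rating is not enough to conclude that one method is better in general. The user-based output is also identical for k = 180, 190, and 200. One possible explanation, not verified against `lib/cf.py`, is that fewer than 180 usable neighbor users exist for this prediction.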
