# Neighborhood-based Collaborative Filtering
(by Tevfik Aytekin)

In the following, we implement item-based collaborative filtering.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from scipy.sparse import csr_matrix
from collections import Counter
from sklearn.metrics import pairwise_distances
from operator import itemgetter
from tqdm.notebook import tqdm
import copy
import heapq
import sys, os
import pickle
import itertools
import operator


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Movielens ml-latest-small dataset

In [None]:
with open('/content/drive/My Drive/datasets/ml-latest-small/README.txt', 'r') as f:
    print(f.read())

Summary

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for down

In [None]:
ratings = pd.read_csv("/content/drive/My Drive/datasets/ml-latest-small/ratings.csv", sep=",")
#ratings =pd.read_csv("/content/drive/My Drive/datasets/ml-25m/ratings.csv")
print(ratings.shape)
ratings.head(10)

(100836, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


In [None]:
links = pd.read_csv("/content/drive/My Drive/datasets/ml-latest-small/links.csv", sep=",")
print(links.shape)
links.head()

(9742, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [None]:
movies = pd.read_csv("/content/drive/My Drive/datasets/ml-latest-small/movies.csv", sep=",")
print(movies.shape)
movies.head()

(9742, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
tags = pd.read_csv("/content/drive/My Drive/datasets/ml-latest-small/tags.csv", sep=",")
print(tags.shape)
tags.head()

(3683, 4)


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## Create User Item Rating Map
It might take some time but will be useful later.

In [None]:
rating_map = {}
for i in range(len(ratings)):
    key = str(ratings.iloc[i,0]) + '_' +str(ratings.iloc[i,1])
    rating_map[key]=ratings.iloc[i,2]

In [None]:
rating_map["1_101"]

np.float64(5.0)

In [None]:
iterator = iter(rating_map.items())
for i in range(5):
    print(next(iterator))

('1_1', np.float64(4.0))
('1_3', np.float64(4.0))
('1_6', np.float64(4.0))
('1_47', np.float64(5.0))
('1_50', np.float64(5.0))


## Create User Ratings Map
It might take some time but will be useful later.

In [None]:
ratings.query("movieId == 2")

Unnamed: 0,userId,movieId,rating,timestamp
560,6,2,4.0,845553522
1026,8,2,4.0,839463806
1773,18,2,3.0,1455617462
2275,19,2,3.0,965704331
2977,20,2,3.0,1054038313
...,...,...,...,...
95102,600,2,4.0,1237764627
95965,602,2,4.0,840875851
97044,604,2,5.0,832080293
97144,605,2,3.5,1277176522


In [None]:
# user_ratings_map[i] stores a tuple: the users who rated item i and their corresponding ratings
user_ratings_map = {}

items = ratings.movieId.unique()
for i in items:
    # filter once per item and reuse the rows for both arrays
    item_rows = ratings.query("movieId == @i")
    user_ratings_map[i] = (item_rows.userId.array, item_rows.rating.array)

In [None]:
user_ratings_map[10]

(<NumpyExtensionArray>
 [  np.int64(6),   np.int64(8),  np.int64(11),  np.int64(19),  np.int64(21),
   np.int64(26),  np.int64(31),  np.int64(34),  np.int64(42),  np.int64(43),
  ...
  np.int64(580), np.int64(584), np.int64(588), np.int64(590), np.int64(592),
  np.int64(597), np.int64(599), np.int64(602), np.int64(608), np.int64(609)]
 Length: 132, dtype: int64,
 <NumpyExtensionArray>
 [np.float64(3.0), np.float64(2.0), np.float64(3.0), np.float64(2.0),
  np.float64(5.0), np.float64(3.0), np.float64(4.0), np.float64(5.0),
  np.float64(5.0), np.float64(4.0),
  ...
  np.float64(3.5), np.float64(5.0), np.float64(3.0), np.float64(3.5),
  np.float64(3.0), np.float64(3.0), np.float64(3.5), np.float64(3.0),
  np.float64(4.0), np.float64(4.0)]
 Length: 132, dtype: float64)

## Rating Prediction

### Algorithm

Predict rating of user $u$ for item $i$
- Calculate the similarity of items that are rated by $u$ with $i$.
- Use these similarities to calculate a weighted average of the ratings.
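
Written as a formula, the weighted average looks as follows. The notation is introduced here for illustration only: $I_u$ is the set of items rated by $u$, $r_{uj}$ is the rating of user $u$ for item $j$, and $sim(i, j)$ is the similarity between items $i$ and $j$.

$$
\hat{r}_{ui} = \frac{\sum_{j \in I_u} sim(i, j) \, r_{uj}}{\sum_{j \in I_u} sim(i, j)}
$$

When the denominator is zero, the implementation below returns 0.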

Similarity between items i and j will be calculated using the ratings of i and j (no content information will be used). One can view these ratings as a vector of values as shown below and use different metrics such as Jaccard or cosine.

In [None]:
# np.nan marks a missing rating, so the columns stay numeric
df = pd.DataFrame([[1, np.nan, 5, 3, np.nan],
                   [5, 3, np.nan, 2, 4],
                   [np.nan, 4, 2, np.nan, 1],
                   [3, np.nan, 4, 2, 3],
                   [4, 1, np.nan, 2, 4],
                   [4, 1, 5, 3, np.nan],
                   [np.nan, 4, 5, np.nan, 1],
                   [2, 5, np.nan, 1, 4]],
                 index = ['user 1','user 2','user 3','user 4','user 5','user 6','user 7','user 8'],
                 columns = ['movie 1','movie 2','movie 3','movie 4','movie 5'])
df

Unnamed: 0,movie 1,movie 2,movie 3,movie 4,movie 5
user 1,1.0,,5.0,3.0,
user 2,5.0,3.0,,2.0,4.0
user 3,,4.0,2.0,,1.0
user 4,3.0,,4.0,2.0,3.0
user 5,4.0,1.0,,2.0,4.0
user 6,4.0,1.0,5.0,3.0,
user 7,,4.0,5.0,,1.0
user 8,2.0,5.0,,1.0,4.0


### Jaccard Similarity

Given two sets $A$ and $B$,

$Jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|}$

For example if $A = \{a, b, c, d\}$ and $B = \{b, d, e ,f, g\}$ then

$Jaccard(A, B) = \frac{2}{7}$

We can apply Jaccard similarity by ignoring the rating values.
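
A minimal sketch of this computation with Python sets. The helper name `jaccard` is only illustrative. The first call reproduces the $\frac{2}{7}$ example; the second applies the same idea to movie 1 and movie 2 of the toy table above, using only who rated each movie and ignoring the rating values.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets have no defined similarity; 0.0 is used here by convention
    return len(a & b) / len(union) if union else 0.0

A = {"a", "b", "c", "d"}
B = {"b", "d", "e", "f", "g"}
sim_ab = jaccard(A, B)  # 2 common elements out of 7 distinct ones
print(sim_ab)

# Users who rated movie 1 and movie 2 in the toy table (rating values ignored)
movie1_users = {"user 1", "user 2", "user 4", "user 5", "user 6", "user 8"}
movie2_users = {"user 2", "user 3", "user 5", "user 6", "user 7", "user 8"}
sim_movies = jaccard(movie1_users, movie2_users)  # 4 common users out of 8
print(sim_movies)
```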

### Cosine Similarity

$Cosine(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_k A_k B_k}{\sqrt{\sum_k A_k^2} \, \sqrt{\sum_k B_k^2}}$



In [None]:
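def NNCF_based_rating_prediction(u, i, metric):
    # weighted average of the ratings of u, weighted by the similarity of each rated item to i
    r = 0
    sum_sim = 0
    # find movies rated by u (read from the global ratings frame)
    rated_movies = ratings[ratings["userId"]==u].movieId
    for j in rated_movies:
        sim = calc_sim(i, j, metric)
        key = str(u)+"_"+str(j)
        r += sim*rating_map[key]
        sum_sim += sim
    # no rated item is similar to i: fall back to 0
    if sum_sim == 0:
        return 0
    else:
        return r / sum_sim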

In [None]:
# finds the similarity of items i and j
def calc_sim(i, j, metric):
    # users who rated item i
    users_rated_i = user_ratings_map[i][0]
    ratings_i = user_ratings_map[i][1]
    # users who rated item j
    users_rated_j = user_ratings_map[j][0]
    ratings_j = user_ratings_map[j][1]

    # Jaccard ignores rating values.
    if metric == "Jaccard":
        intersection_size = len(set(users_rated_i).intersection(users_rated_j))
        union_size = len(set(users_rated_i).union(users_rated_j))
        return intersection_size / union_size
    elif metric == "Cosine":
        # Cosine is computed only over the users who rated both items.
        inter, ind1, ind2 = np.intersect1d(users_rated_i, users_rated_j, return_indices=True)
        dot = np.dot(ratings_i[ind1], ratings_j[ind2])
        norm = np.linalg.norm(ratings_i[ind1])*np.linalg.norm(ratings_j[ind2])
        # With no common users the norm is zero; return 0 instead of NaN.
        return dot/norm if norm > 0 else 0
    else:
        raise ValueError("Unknown metric: " + str(metric))


### Example

In [None]:
users_i = np.array([1, 3, 6, 9, 12, 15])
ratings_i =  np.array([4, 3, 2, 4, 2, 5])
users_j = np.array([1, 6, 8, 12])
ratings_j = np.array([2, 4, 2, 5])

In [None]:
inter, ind1, ind2 = np.intersect1d(users_i,users_j,return_indices=True)
print(inter)
print(ind1)
print(ind2)
print(ratings_i[ind1])
print(ratings_j[ind2])
print(np.dot(ratings_i[ind1],ratings_j[ind2]))

[ 1  6 12]
[0 2 4]
[0 1 3]
[4 2 2]
[2 4 5]
26
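
Continuing the example, a minimal sketch of the cosine similarity restricted to the users who rated both items, which is how `calc_sim` handles the "Cosine" metric. The guard for a zero denominator is an addition made here for safety.

```python
import numpy as np

users_i = np.array([1, 3, 6, 9, 12, 15])
ratings_i = np.array([4, 3, 2, 4, 2, 5])
users_j = np.array([1, 6, 8, 12])
ratings_j = np.array([2, 4, 2, 5])

# Positions of the common users (1, 6 and 12) in each array
_, ind1, ind2 = np.intersect1d(users_i, users_j, return_indices=True)
common_i = ratings_i[ind1]  # ratings of item i by the common users
common_j = ratings_j[ind2]  # ratings of item j by the common users

denom = np.linalg.norm(common_i) * np.linalg.norm(common_j)
# Guard against an empty intersection, where both norms are zero
cosine = float(np.dot(common_i, common_j) / denom) if denom > 0 else 0.0
print(cosine)
```

The dot product is 26 as computed above, and the two norms are $\sqrt{24}$ and $\sqrt{45}$, so the similarity is $26 / \sqrt{1080} \approx 0.79$.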


## Evaluation of Rating Prediction

How can we measure the performance of a recommender algorithm? This is similar to the evaluation used in machine learning.

- Make a train/test split
- Build the model on the training set
- Make predictions for the ratings in the test set
- Find the mean absolute error (MAE)

For metrics other than MAE, look at the "Metrics for Regression" section of [this notebook](../../data_science/evaluation.ipynb). For ranking metrics, see [this notebook](ranking_evaluation.ipynb).


In [None]:
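ratings = shuffle(ratings)

X_train, X_test = train_test_split(ratings, test_size=100)
train_size = X_train.shape[0]
test_size = X_test.shape[0]
print("Test size:", test_size)
# sums of absolute errors for Cosine, Jaccard, and the global-average baseline
error1 = 0
error2 = 0
error3 = 0

# Caveat: rating_map, user_ratings_map and the ratings frame used inside
# NNCF_based_rating_prediction were built from the full data, so the held-out
# rating of (u, i) is still visible to the predictor. For a strict evaluation,
# rebuild these structures from X_train only.
avg_rating = X_train["rating"].mean()
for k in tqdm(range(test_size)):
    u = X_test.iloc[k,0]
    i = X_test.iloc[k,1]
    r = X_test.iloc[k,2]

    error1 += np.abs(r - NNCF_based_rating_prediction(u,i,"Cosine"))
    error2 += np.abs(r - NNCF_based_rating_prediction(u,i,"Jaccard"))
    error3 += np.abs(r - avg_rating)


print("Cosine:", error1/test_size)
print("Jaccard:",error2/test_size)
print("Average:",error3/test_size)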


Test size: 100


  0%|          | 0/100 [00:00<?, ?it/s]

Cosine: 0.7124765717818193
Jaccard: 0.6737920440215371
Average: 0.7847800190597205
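
The four evaluation steps listed above can be sketched end to end on a toy data set. In this minimal sketch every lookup structure is built from the training rows only, so the held-out ratings cannot influence the predictions. The data and the helper names (`users_by_item`, `ratings_by_user`, `jaccard`, `predict`) are hypothetical and not part of the notebook.

```python
from collections import defaultdict

# Hypothetical toy ratings as (user, item, rating), already split by hand
train = [(1, "a", 4.0), (1, "b", 5.0), (1, "c", 1.0),
         (2, "a", 4.0), (2, "b", 4.0),
         (3, "a", 5.0),
         (4, "b", 2.0), (4, "c", 5.0)]
test = [(2, "c", 2.0), (3, "b", 5.0)]

# Build the model: lookup structures from the training rows only
users_by_item = defaultdict(set)
ratings_by_user = defaultdict(dict)
for u, i, r in train:
    users_by_item[i].add(u)
    ratings_by_user[u][i] = r

def jaccard(i, j):
    a, b = users_by_item[i], users_by_item[j]
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def predict(u, i):
    """Similarity-weighted average of the training ratings of u."""
    num = den = 0.0
    for j, r in ratings_by_user[u].items():
        if j == i:
            continue  # never use the target item itself
        s = jaccard(i, j)
        num += s * r
        den += s
    return num / den if den else 0.0

# Predict the test ratings and compute the mean absolute error
mae = sum(abs(r - predict(u, i)) for u, i, r in test) / len(test)
print(mae)
```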


#### Question: How can you explain the difference between using Cosine vs. Jaccard?

## Top-N recommendation Algorithm - Predict and Sort
The task in top-$N$ recommendation is to recommend $N$ items to a user.


Recommend $N$ movies to user $u$
- Predict the ratings of all items which are not watched by $u$
- Sort the predicted ratings
- Recommend the movies with the highest predicted ratings

In [None]:
def top_N_pred_sort(N, u):
    preds = pd.Series([], dtype='float')
    # find the movies not rated by u
    all_items = set(ratings['movieId'].unique())
    rated_by_user = set(ratings[ratings["userId"]==u].movieId.unique())
    not_rated_by_user = list(all_items - rated_by_user)
    # score a random sample of at most 500 distinct candidates to keep the demo fast
    sample_size = min(500, len(not_rated_by_user))
    sample_movies = np.random.choice(not_rated_by_user, sample_size, replace=False)
    for m in tqdm(sample_movies):
        preds[m] = NNCF_based_rating_prediction(u, m, "Jaccard")
    return preds.sort_values(ascending=False)[:N]

In [None]:
topn = top_N_pred_sort(10, 100)
topn

  0%|          | 0/500 [00:00<?, ?it/s]

Unnamed: 0,0
110773,4.318191
167634,4.318191
78316,4.273657
57843,4.26629
7184,4.22764
31184,4.198869
72696,4.174946
4689,4.172884
6578,4.172884
6973,4.172884


## Efficiency Issues

There are important inefficiencies in this algorithm:

- The algorithm predicts the rating of every item that the user has not rated. With millions of items this is practically infeasible. Numerous techniques have been developed to remedy this problem. Can you suggest a solution?
- In rating prediction, the similarities between the target item and the items rated by the user are calculated. To make a recommendation to another user, the similarity calculations are done again, so across all users many similarity calculations are repeated. A general solution is to precalculate the similarities between items. Moreover, you don't need to store all similarities; storing only the $k$ most similar items for each item is enough. The value of $k$ can be chosen according to the needs of the application.


## Top-N recommendation Algorithm - kNN Map
The task in top-$N$ recommendation is to recommend $N$ items to a user.

- Build a knn-map (a map which stores the $k$ nearest neighbors of each item)

Recommend $N$ movies to user $u$
- Get the neighbors of the movies watched by $u$ and put them into a list $C$.
- Choose $N$ movies from $C$. Different methods are possible here: the most frequently repeated movies in $C$ can be chosen, or the movies with the highest total similarity (or maximum similarity). These methods will be implemented.
- Recommend the $N$ movies that are chosen.
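
The three ways of choosing from $C$ can be sketched as follows. The candidate list and the helper `top` are hypothetical; each entry of `C` is a `(similarity, item)` pair, and the same item may appear several times because it can be a neighbor of several watched movies.

```python
from collections import Counter, defaultdict

# Hypothetical candidate list C of (similarity, item) pairs
C = [(0.9, "x"), (0.2, "y"), (0.3, "y"), (0.3, "z"), (0.25, "y"), (0.5, "z")]

counts = Counter(item for _, item in C)  # how often each item appears in C
total_sim = defaultdict(float)           # sum of similarities per item
max_sim = defaultdict(float)             # largest similarity per item
for sim, item in C:
    total_sim[item] += sim
    max_sim[item] = max(max_sim[item], sim)

def top(scores, n):
    """Return the n items with the highest score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:n]]

by_count = top(counts, 2)     # most repeated items
by_total = top(total_sim, 2)  # highest total similarity
by_max = top(max_sim, 2)      # highest single similarity
print(by_count, by_total, by_max)
```

In this toy list the item `y` wins by count because it appears three times, while `x` wins by total and by maximum similarity thanks to a single strong neighbor relation, so the choice of method can change the recommendations.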

## Building a knn map
This table will hold the $k$ most similar items for each item. To build it we need to calculate all pairwise similarities, which takes $O(n^2)$ time. There is no escape from this $O(n^2)$ time unless you use an approximate nearest neighbor search algorithm such as LSH (Locality Sensitive Hashing). Another approach is to calculate the similarities with a matrix multiplication and run it on a GPU for acceleration. Yet another approach, developed by Amazon researchers, can lead to huge speedups for very sparse matrices; we will look at it below.
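
As a rough illustration of the matrix multiplication idea, the sketch below computes all pairwise Jaccard similarities at once from a small binary user-item matrix. The matrix `B` is hypothetical, and the code runs with NumPy on the CPU; moving the same product to a GPU is not shown.

```python
import numpy as np

# Hypothetical binary user-item matrix: rows are users, columns are items,
# and a 1 means that the user rated the item
B = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 0]])

# co_counts[i, j] = number of users who rated both item i and item j
co_counts = B.T @ B
# The diagonal holds the number of users who rated each item
item_counts = np.diag(co_counts)
# |A union B| = |A| + |B| - |A intersect B|
union = item_counts[:, None] + item_counts[None, :] - co_counts
# Divide only where the union is non-empty; other entries stay 0
jaccard = np.divide(co_counts, union, out=np.zeros(union.shape), where=union > 0)
print(jaccard)
```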

We will use a heap-based priority queue for storing the nearest neighbors. You can look at this [animation](https://www.cs.usfca.edu/~galles/visualization/Heap.html).

In [None]:
pq =[(10,"a"),(8, "b"), (5, "c"), (3, "d")]
type(pq)
heapq.heapify(pq)
heapq.nsmallest(2,pq)

[(3, 'd'), (5, 'c')]
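
For nearest neighbors we want the $k$ largest similarities, not the smallest. A common pattern, sketched here with a hypothetical helper `k_largest`, is to keep a min-heap of size $k$: its root is the weakest neighbor kept so far, and it is replaced whenever a more similar item arrives.

```python
import heapq

def k_largest(scored_items, k):
    """Keep the k pairs with the largest similarity from a stream of (sim, item)."""
    heap = []  # min-heap: heap[0] is the weakest neighbor kept so far
    for sim, item in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, (sim, item))
        elif sim > heap[0][0]:
            # pop the weakest neighbor and push the new one in a single step
            heapq.heapreplace(heap, (sim, item))
    return sorted(heap, reverse=True)

sims = [(0.10, "a"), (0.80, "b"), (0.35, "c"), (0.60, "d"), (0.05, "e")]
top2 = k_largest(sims, 2)
print(top2)
```

Each item costs at most $O(\log k)$ heap work, so scanning $n$ candidates takes $O(n \log k)$ time and only $O(k)$ memory.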

In [None]:
movies[:10]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [None]:
def build_knn_map(movies, K=30):
    knn_map = {}
    movie_ids = ratings['movieId'].unique()
    print(len(movie_ids))
    for i in tqdm(movie_ids):
        # pq is a min-heap of (similarity, item): pq[0] is the weakest of the K neighbors kept so far
        pq = []
        knn_map[i] = pq
        for j in movie_ids:
            if i == j:
                continue
            sim = calc_sim(i, j, "Jaccard")
            if len(pq) >= K:
                # replace the weakest neighbor only if j is more similar
                if sim > pq[0][0]:
                    heapq.heapreplace(pq, (sim, j))
            else:
                heapq.heappush(pq, (sim, j))
    return knn_map

In [None]:
knn_map = build_knn_map(movies)

9724


  0%|          | 0/9724 [00:00<?, ?it/s]

KeyboardInterrupt: 

### Amazon Algorithm

The following is the algorithm used at Amazon. Note that it does not help much for the MovieLens dataset. For very sparse datasets such as the one at Amazon, however, it helps a lot by skipping the many item pairs for which no user has rated or bought both items.

Linden, Greg, Brent Smith, and Jeremy York. "Amazon.com recommendations: Item-to-item collaborative filtering." IEEE Internet Computing 7.1 (2003): 76-80.
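
The idea of skipping item pairs without a common user can be sketched with two lookup tables. This is a minimal illustration on hypothetical toy data, not the exact procedure from the paper: for a target item, only the items that share at least one user with it are ever visited, and Jaccard is used as the similarity.

```python
from collections import defaultdict

# Hypothetical toy data: user -> set of items the user rated or bought
items_by_user = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"c", "d"},
}

# Inverted index: item -> set of users who rated or bought it
users_by_item = defaultdict(set)
for user, items in items_by_user.items():
    for item in items:
        users_by_item[item].add(user)

def similar_items(i):
    """Jaccard similarity of i to every item that shares at least one user with i."""
    co_counts = defaultdict(int)
    for user in users_by_item[i]:
        for j in items_by_user[user]:
            if j != i:
                co_counts[j] += 1  # one more user who has both i and j
    sims = {}
    for j, both in co_counts.items():
        union = len(users_by_item[i]) + len(users_by_item[j]) - both
        sims[j] = both / union
    return sims

sims_a = similar_items("a")
print(sims_a)
```

Item `d` never appears in the result for `a` because no user has both, so its similarity is never computed.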


In [None]:
def build_knn_map_amazon(movies, K=30):
    knn_map = {}
    movie_ids = ratings['movieId'].unique()
    print(len(movie_ids))
    for i in tqdm(movie_ids):
        pq = []
        knn_map[i] = pq
        # find users who rated i
        users = ratings.query("movieId == @i").userId.unique()
        # find items rated by those users; only these can share a user with i
        candidates = ratings.query("userId in @users").movieId.unique()
        # For speed up (this does not exist in the Amazon algorithm):
        # sample at most 500 distinct candidates
        candidates = np.random.choice(candidates, min(500, len(candidates)), replace=False)
        for j in candidates:
            if i == j:
                continue
            sim = calc_sim(i, j, "Jaccard")
            if len(pq) >= K:
                # pq[0] is the weakest neighbor kept so far
                if sim > pq[0][0]:
                    heapq.heapreplace(pq, (sim, j))
            else:
                heapq.heappush(pq, (sim, j))
    # return only the map so that the global movies DataFrame is not overwritten
    return knn_map

In [None]:
knn_map = build_knn_map_amazon(movies)

9724


  0%|          | 0/9724 [00:00<?, ?it/s]

In [None]:
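def add_sims_and_sort(l):
    li = []
    # groupby only merges adjacent equal keys, so sort by item id first
    l = sorted(l, key=itemgetter(1))
    it = itertools.groupby(l, operator.itemgetter(1))
    for key, subiter in it:
        # total similarity of this item over all its occurrences
        li.append((key, sum(item[0] for item in subiter)))
    li = sorted(li, key=itemgetter(1), reverse=True)
    return li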


In [None]:
def top_N_knn_map(ratings, N, u):
    C = []
    # find the movies rated by u
    movies_rated = ratings.query("userId == @u").movieId
    for m in movies_rated:
        # collect the stored (similarity, item) neighbors of every rated movie
        C.extend(knn_map[m])
    return add_sims_and_sort(C)[:N]

In [None]:
topn = top_N_knn_map(ratings, 10, 100)
topn

[(6785, 0.75),
 (6183, 0.75),
 (1650, 0.6666666666666666),
 (2201, 0.6666666666666666),
 (7092, 0.6666666666666666),
 (4796, 0.6666666666666666),
 (6970, 0.6666666666666666),
 (4896, 0.6129032258064516),
 (1380, 0.5625),
 (780, 0.548936170212766)]

In [None]:
topn = [i[0] for i in topn]
movies[movies.movieId.isin(topn)]

Unnamed: 0,movieId,title,genres
615,780,Independence Day (a.k.a. ID4) (1996),Action|Adventure|Sci-Fi|Thriller
1063,1380,Grease (1978),Comedy|Musical|Romance
1241,1650,Washington Square (1997),Drama
1648,2201,"Paradine Case, The (1947)",Drama
3509,4796,"Grass Is Greener, The (1960)",Comedy|Romance
3574,4896,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy
4240,6183,Pillow Talk (1959),Comedy|Musical|Romance
4566,6785,Seven Brides for Seven Brothers (1954),Comedy|Musical|Romance|Western
4666,6970,Desk Set (1957),Comedy|Romance
4766,7092,Anna Karenina (1935),Drama|Romance


## Evaluation of top-N recommendation

Evaluation of rating prediction is rather easy: find the mean absolute error between rating predictions and true ratings. How can we evaluate the accuracy of a top-N recommendation? There are several techniques which we will look at in more detail later. Below is one common way to evaluate top-N recommendation:

- Randomly sub-sample some portion of positive preferences in order to create a test set $T$. Positive preferences might be 5-star ratings, movies watched more than a certain threshold, or items purchased.
- Put the rest of the preferences into the training set and build the model.

- For each preference $(u,i)$ in the test set:
    - Make a top-N recommendation to user $u$.
    - If the test item $i$ occurs among the top-N items, then we have a hit, otherwise we have a miss.

Hit ratio is then defined as:

$$
\text{Hit Ratio} = \frac{\#\text{hits}}{|T|}
$$
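
A minimal sketch of this definition, assuming a hypothetical `recommend(u, n)` function that returns the top-$n$ item list for user $u$. Here it is backed by a fixed toy dictionary instead of a trained model.

```python
def hit_ratio(test_pairs, recommend, n):
    """Fraction of held-out (user, item) pairs whose item is in the user's top-n list."""
    if not test_pairs:
        return 0.0
    hits = sum(1 for u, i in test_pairs if i in recommend(u, n))
    return hits / len(test_pairs)

# Hypothetical ranked recommendations per user
recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}

def recommend(u, n):
    return recs[u][:n]

# Hypothetical test set T of positive preferences
T = [("u1", "b"), ("u2", "a"), ("u1", "c"), ("u2", "d")]
ratio = hit_ratio(T, recommend, 2)  # hits: ("u1", "b") and ("u2", "d")
print(ratio)
```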



In [None]:
N = 100
X_train, X_test = train_test_split(ratings, test_size=1000)
X_test = X_test.query("rating > 4")
train_size = X_train.shape[0]
test_size = X_test.shape[0]
print("Test size:", test_size)
hit_count = 0
for k in range(test_size):
    u = X_test.iloc[k,0]
    i = X_test.iloc[k,1]
    r = X_test.iloc[k,2]
    top_N = top_N_knn_map(X_train, N, u)
    hit_list = [item for item in top_N if item[0] == i]
    if len(hit_list) > 0:
        hit_count +=1
print("Hit Ratio", hit_count/test_size)

Test size: 217
Hit Ratio 0.15668202764976957


### OPTIONAL

The following is a comparison of the running times of similarity calculations performed on the CPU and GPU.

https://github.com/erogluegemen/Performance-and-Scalability-Analysis-of-KNN-Implementations-for-Content-Based-Filtering