# Neighborhood-based Collaborative Filtering
(by Tevfik Aytekin)

In the following we will implement item based collaborative filtering.

In [21]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from scipy.sparse import csr_matrix
from collections import Counter
from sklearn.metrics import pairwise_distances
from operator import itemgetter
from tqdm.notebook import tqdm
import copy
import heapq
import sys, os
import pickle
import itertools
import operator


### Movielens ml-latest-small dataset

In [2]:
with open('../../datasets/ml-latest-small/README.txt', 'r') as f:
    print(f.read())

Summary

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for down

In [3]:
ratings = pd.read_csv("../../datasets/ml-latest-small/ratings.csv", sep=",")
print(ratings.shape)
ratings.head()

(100836, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
links = pd.read_csv("../../datasets/ml-latest-small/links.csv", sep=",")
print(links.shape)
links.head()

(9742, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [5]:
movies = pd.read_csv("../../datasets/ml-latest-small/movies.csv", sep=",")
print(movies.shape)
movies.head()

(9742, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
tags = pd.read_csv("../../datasets/ml-latest-small/tags.csv", sep=",")
print(tags.shape)
tags.head()

(3683, 4)


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## Create User Item Rating Map
It might take some time but will be useful later.

In [7]:
rating_map = {}
for i in range(len(ratings)):
    key = str(ratings.iloc[i,0]) + '_' +str(ratings.iloc[i,1])
    rating_map[key]=ratings.iloc[i,2]

In [8]:
rating_map["1_47"]

5.0

In [9]:
iterator = iter(rating_map.items())
for i in range(5):
    print(next(iterator))

('1_1', 4.0)
('1_3', 4.0)
('1_6', 4.0)
('1_47', 5.0)
('1_50', 5.0)


## Create User Ratings Map
It might take some time but will be useful later.

In [10]:
ratings.query("movieId == 2")

Unnamed: 0,userId,movieId,rating,timestamp
560,6,2,4.0,845553522
1026,8,2,4.0,839463806
1773,18,2,3.0,1455617462
2275,19,2,3.0,965704331
2977,20,2,3.0,1054038313
...,...,...,...,...
95102,600,2,4.0,1237764627
95965,602,2,4.0,840875851
97044,604,2,5.0,832080293
97144,605,2,3.5,1277176522


In [11]:
# user_ratings_map[i] stores a tuple: a list of users who rated item i and a list of corresponding ratings
user_ratings_map = {}

items = ratings.movieId.unique()
for i in items:
    userids = ratings.query("movieId == @i").userId.array
    user_ratings = ratings.query("movieId == @i").rating.array
    user_ratings_map[i] = (userids,user_ratings)

In [15]:
user_ratings_map[10]

(<PandasArray>
 [  6,   8,  11,  19,  21,  26,  31,  34,  42,  43,
  ...
  580, 584, 588, 590, 592, 597, 599, 602, 608, 609]
 Length: 132, dtype: int64,
 <PandasArray>
 [3.0, 2.0, 3.0, 2.0, 5.0, 3.0, 4.0, 5.0, 5.0, 4.0,
  ...
  3.5, 5.0, 3.0, 3.5, 3.0, 3.0, 3.5, 3.0, 4.0, 4.0]
 Length: 132, dtype: float64)

## Rating Prediction

### Algorithm 
 
Predict rating of user $u$ for item $i$
- Calculate the similarity of items that are rated by $u$ with $i$.
- Use these similarities to calculate a weighted average of the ratings.

Similarity between items i and j will be calculated using the ratings of i and j (no content infırmation will be used). One can view these ratings as a vector of values as shown below and use different metrics such as Jaccard or cosine.

In [16]:
df = pd.DataFrame([[1, "", 5, 3, ""],
                   [5, 3, "", 2, 4],
                   ["", 4, 2, "", 1],
                   [3, "", 4, 2, 3],
                   [4, 1, "", 2, 4],
                   [4, 1, 5, 3, ""],
                   ["", 4, 5, "", 1],
                   [2, 5, "", 1, 4]],
                 index = ['user 1','user 2','user 3','user 4','user 5','user 6','user 7','user 8'],
                 columns = ['movie 1','movie 2','movie 3','movie 4',' movie 5'])
df

Unnamed: 0,movie 1,movie 2,movie 3,movie 4,movie 5
user 1,1.0,,5.0,3.0,
user 2,5.0,3.0,,2.0,4.0
user 3,,4.0,2.0,,1.0
user 4,3.0,,4.0,2.0,3.0
user 5,4.0,1.0,,2.0,4.0
user 6,4.0,1.0,5.0,3.0,
user 7,,4.0,5.0,,1.0
user 8,2.0,5.0,,1.0,4.0


### Jaccard Similarity

Given two sets $A$ and $B$,

$Jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|}$

For example if $A = \{a, b, c, d\}$ and $B = \{b, d, e ,f, g\}$ then

$Jaccard(A, B) = \frac{2}{7}$

We can apply Jaccard similarity by ignoring the rating values.

### Cosine Similarity

Cosine(A, B) = 
<img src="../images/cosine.png" width="200">



In [17]:
def NNCF_based_rating_prediction(u, i, metric):
    r = 0
    sum_sim = 0
    # find the movies rated by u
    movies = ratings[ratings["userId"]==u].movieId
    for j in movies:
        sim = calc_sim(i, j, metric)
        key = str(u)+"_"+str(j)
        r += sim*rating_map[key]
        sum_sim += sim
    if sum_sim == 0:
        return 0
    else:
        return r / sum_sim        

In [54]:
# finds the similary of items i and j 
def calc_sim(i,j, metric):
    # users who rated item i
    users_rated_i = user_ratings_map[i][0]
    ratings_i = user_ratings_map[i][1]
    # users who rated item j
    users_rated_j = user_ratings_map[j][0]
    ratings_j = user_ratings_map[j][1]

    # Jaccard ignores rating values.
    if metric == "Jaccard":
        intersection_size = len(set(users_rated_i).intersection(users_rated_j))
        union_size = len(set(users_rated_i).union(users_rated_j))
        return intersection_size / union_size
    elif metric == "Cosine":
        inter, ind1, ind2 = np.intersect1d(users_rated_i, users_rated_j, return_indices=True)
        dot = np.dot(ratings_i[ind1], ratings_j[ind2])
        return dot/(np.linalg.norm(ratings_i[ind1])*np.linalg.norm(ratings_j[ind2]))
    

### Example

In [19]:
users_i = np.array([1, 3, 6, 9, 12, 15])
ratings_i =  np.array([4, 3, 2, 4, 2, 5])
users_j = np.array([1, 6, 8, 12])
ratings_j = np.array([2, 4, 2, 5])

In [20]:
inter, ind1, ind2 = np.intersect1d(users_i,users_j,return_indices=True)
print(inter)
print(ind1)
print(ind2)
print(ratings_i[ind1])
print(ratings_j[ind2])
print(np.dot(ratings_i[ind1],ratings_j[ind2]))

[ 1  6 12]
[0 2 4]
[0 1 3]
[4 2 2]
[2 4 5]
26


## Evaluation of Rating Prediction

How can we measure the performance of a recommender algorithm? This is similar to the evaluation used in machine learning.

- Make a train/test split
- Build the model on the training set
- Make predictions for the ratings in the test set
- Find the mean absolute error (MAE)

For more metrics other then MAE loo at the "Metrics for Regression" section of [this notebook](http://localhost:8888/notebooks/PycharmProjects/data_science/evaluation.ipynb)


In [26]:
X_train, X_test = train_test_split(ratings, test_size=200)
train_size = X_train.shape[0]
test_size = X_test.shape[0]
print("Test size:", test_size)
error = 0
c = 1
for k in tqdm(range(test_size)): 
    c+=1
    u = X_test.iloc[k,0]
    i = X_test.iloc[k,1]
    r = X_test.iloc[k,2]
    error += np.abs(r - NNCF_based_rating_prediction(u,i,"Cosine"))
print(error/test_size)

Test size: 200


  0%|          | 0/200 [00:00<?, ?it/s]

0.794878370362171


#### Question: Change the metric to Jaccard in the above code and run again. How can you explain this difference in the error.

## Top-N recommendation Algorithm - Predict and Sort
The task in top-$N$ recommendation is to recommend $N$ items to a user. 


Recommend $N$ movies to user $u$
- Predict the ratings of all items which are not watched by $u$
- Sort the predicted ratings
- Recommend the movies with the highest predicted ratings

In [36]:
def top_N_pred_sort(N, u):
    preds = pd.Series([], dtype='float')
    # find the movies not rated by u
    movies_not_rated = ratings.query("userId != @u").movieId.unique()
    sample_movies = np.random.choice(movies_not_rated, 500)
    for m in tqdm(sample_movies):
        preds[m] = NNCF_based_rating_prediction(u, m, "Jaccard")
    return preds.sort_values(ascending=False)[:N]    

In [37]:
top_N_pred_sort(10, 1)

  0%|          | 0/500 [00:00<?, ?it/s]

158035    5.000000
70932     4.952339
4945      4.925652
148775    4.840775
26429     4.840775
119714    4.840775
139640    4.840775
32243     4.840775
126088    4.840775
81087     4.840775
dtype: float64

## Efficiency Issues

There are important inefficiencies in this algorithm:

- The algorithm predicts the rating of all items which are not rated by the user. In the case of millions of items this algorithm is practically infeasible. Numerous techniques have been developed to remedy this problem. Can you suggest a solution? 
- In rating prediction, similarity between target item and items rated by the user are calculated. To make a recommendation to another user similarity calculations will be done again. For making recommendations to users in general many similarity calculations will be repeated. A general solution to this problem is to precalculate the similarities between items. Moreover, you don't need to store all similarities, only storing $k$ most similar items to every item will be enough. Size of $k$ can be determined according to the needs.


## Top-N recommendation Algorithm - kNN Map
The task in top-$N$ recommendation is to recommend $N$ items to a user. 

- Build a knn-map (a map which stores the $k$ nearest neighbors of each item)

Recommend $N$ movies to user $u$
- Get the neigbors of movies which are watched by $u$ and put them into a list $C$.
- Choose $N$ movies form $C$. There can be different methods here. Most repeated movies in C can be chosen, movies with the highest total similarity (or maximum similarity) can be chosen. These methods will be implemented.
- Recommend the $N$ movies that are chosen.

## Building a knn map
This table will hold the most similar $k$ items for each item. In order to build this table we need to calculate all pairwise similarities which takes $O(n^2)$ time. There is no escape from this $O(n^2)$ time unless you use an approximation algorithm such as LSH (Locality Sensitive Hashing) for nearest neighbor search.

We will use a heap based priority queue for storing the nearest neighbor. You can look at this [animation](https://www.cs.usfca.edu/~galles/visualization/Heap.html).

In [38]:
pq =[(10,"a"),(8, "b"), (5, "c")]
type(pq)
heapq.heapify(pq)
heapq.nsmallest(2,pq)

[(5, 'c'), (8, 'b')]

In [39]:
movies[:10]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [52]:
def build_knn_map(movies, K=10):
    knn_map = {}
    movie_ids = ratings['movieId'].unique()
    print(len(movie_ids))
    for i in tqdm(range(len(movie_ids))):
        pq = []
        knn_map[movie_ids[i]] = pq
        for j in range(len(movie_ids)):
            if (i == j):
                continue
            sim = calc_sim(movie_ids[i],movie_ids[j],"Jaccard")
            if (len(pq) >= K):
                smallest = pq[0]
                if (sim > smallest[0]):
                    heapq.heappop(pq)
                    heapq.heappush(pq, (sim, movie_ids[j]))
            else:
                heapq.heappush(pq, (sim, movie_ids[j]))
    return knn_map

In [53]:
knn_map = build_knn_map(movies,K=30)

9724


  0%|          | 0/9724 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [56]:
def top_N_knn_map(ratings, N, u):
    C = []
    # find the movies rated by u
    movies_rated = ratings.query("userId == @u").movieId
    for m in movies_rated:
        C = C + knn_map[m]
    return add_sims_and_sort(C)[:N]    

In [57]:
def add_sims_and_sort(l):
    li = []
    it = itertools.groupby(l, operator.itemgetter(1))
    for key, subiter in it:
        li.append((key, sum(item[0] for item in subiter)))
    li = sorted(li, key=itemgetter(0), reverse=True)
    return li
    

In [58]:
top_N_knn_map(ratings, 10, 1)

[(552, 0.1),
 (552, 0.1),
 (552, 0.1),
 (552, 0.1),
 (552, 0.1),
 (552, 0.1),
 (552, 0.1),
 (552, 0.1),
 (552, 0.1),
 (552, 0.1)]

## Evaluation of top-N recommendation 

Evaluation of rating prediction is rather easy: find the mean absolute error between rating predictions and true ratings. How can we evaluate the accuracy of a top-N recommendation? There are several techniques which we will look at in more detail later. Below is one common way to evaluate top-N recommendation:

- Randomly sub-sample some portion of positive preferences in order to create a test set $T$. Positive preferences might be 5-star ratings, movies watched more than a certain threshold, or items purchased.
- Put the rest of the preferences into the training set and build model.

- For each preference $(u,i)$ in the test set:
    - Make a top-N recommendation tu user $u$.
    - If the test item i occurs among the top-N items, then we have a hit, otherwise we have a miss. 

Hit ratio is then defined as: 

$$
Hit Ratio: \frac{\#hits}{|T|}
$$



In [179]:
N = 1000
X_train, X_test = train_test_split(ratings, test_size=1000)
X_test = X_test.query("rating > 4")
train_size = X_train.shape[0]
test_size = X_test.shape[0]
print("Test size:", test_size)
hit_count = 0
for k in range(test_size): 
    u = X_test.iloc[k,0]
    i = X_test.iloc[k,1]
    r = X_test.iloc[k,2]
    top_N = top_N_knn_map(X_train, N, u)
    hit_list = [item for item in top_N if item[0] == i]
    if len(hit_list) > 0:
        hit_count +=1
print("Hit Ratio", hit_count/test_size)

Test size: 210
Hit Ratio 0.0761904761904762
