# Content-based Recommendation
(by Tevfik Aytekin)

Content-based recommender algorithms use the content of the items to make recommendations. For example, in movie recommendation, item content such as plot summary, director, cast, genres, and release date, together with user information such as previously watched movies, gender, and age, can be used to decide which movies to recommend to each user.

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from scipy.sparse import csr_matrix
from collections import Counter
from sklearn.metrics import pairwise_distances
from operator import itemgetter
import copy
import heapq
import sys, os
import pickle
import itertools
import operator
from tqdm.notebook import tqdm

### MovieLens ml-latest-small dataset

In [None]:
with open('../../datasets/ml-latest-small/README.txt', 'r') as f:
    print(f.read())

In [4]:
ratings = pd.read_csv("../../datasets/ml-latest-small/ratings.csv", sep=",")
print(ratings.shape)
ratings.head()

(100837, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1000,858,5.0,964982703
1,1,1,4.0,964982703
2,1,3,4.0,964981247
3,1,6,4.0,964982224
4,1,47,5.0,964983815


In [5]:
links = pd.read_csv("../../datasets/ml-latest-small/links.csv", sep=",")
print(links.shape)
links.head()

(9742, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [6]:
movies = pd.read_csv("../../datasets/ml-latest-small/movies.csv", sep=",")
print(movies.shape)
movies.head()

(9742, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
tags = pd.read_csv("../../datasets/ml-latest-small/tags.csv", sep=",")
print(tags.shape)
tags.head()

(3683, 4)


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## Create User Item Rating Map
It might take some time but will be useful later.

In [8]:
rating_map = {}
for i in range(len(ratings)):
    key = str(ratings.iloc[i,0]) + '_' +str(ratings.iloc[i,1])
    rating_map[key]=ratings.iloc[i,2]

In [9]:
rating_map["1_47"]

5.0

In [10]:
iterator = iter(rating_map.items())
for i in range(5):
    print(next(iterator))

('1000_858', 5.0)
('1_1', 4.0)
('1_3', 4.0)
('1_6', 4.0)
('1_47', 5.0)
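
The row-by-row `iloc` loop above is slow on roughly 100k rows. One possible faster construction is sketched below on small stand-in lists; the helper `build_rating_map` and the lists `user_ids`, `movie_ids`, `values` are illustrative assumptions, not part of the notebook. It iterates the columns in lockstep and keeps the same `userId_movieId` key format.

```python
def build_rating_map(user_ids, movie_ids, values):
    """Map 'userId_movieId' -> rating by walking the three columns together."""
    return {f"{u}_{m}": r for u, m, r in zip(user_ids, movie_ids, values)}

# Stand-in columns copied from a few of the rows shown by ratings.head()
user_ids = [1, 1, 1, 1]
movie_ids = [1, 3, 6, 47]
values = [4.0, 4.0, 4.0, 5.0]

rating_map_demo = build_rating_map(user_ids, movie_ids, values)
print(rating_map_demo["1_47"])  # 5.0
```

With the DataFrame, the same helper could presumably be called as `build_rating_map(ratings["userId"], ratings["movieId"], ratings["rating"])`.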


## Create movie genre map
This map will also be useful later

In [11]:
movie_genres = {}
for i in range(len(movies)):
    key = movies.iloc[i,0]
    movie_genres[key]=movies.iloc[i,2].split('|')

In [12]:
iterator = iter(movie_genres.items())
for i in range(5):
    print(next(iterator))

(1, ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'])
(2, ['Adventure', 'Children', 'Fantasy'])
(3, ['Comedy', 'Romance'])
(4, ['Comedy', 'Drama', 'Romance'])
(5, ['Comedy'])


In [13]:
movie_genres[1]

['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']

## Rating Prediction

### Algorithm 
 
Predict rating of user $u$ for item $i$
- Calculate the similarity of items that are rated by $u$ with $i$.
- Use these similarities to calculate a weighted average of the ratings.
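
Written as a formula, the two steps above amount to the weighted average below. The symbols $I_u$ (the set of items rated by $u$), $r_{uj}$ (the rating of $u$ for item $j$) and $\hat{r}_{ui}$ (the predicted rating) are notation introduced here for illustration.

$$
\hat{r}_{ui} = \frac{\sum_{j \in I_u} sim(i, j) \, r_{uj}}{\sum_{j \in I_u} sim(i, j)}
$$

If every similarity is zero the denominator vanishes, so some fallback value is needed in that case.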

Below we use only the genres to calculate the content similarity between movies, but in general one can use other information as well, such as the director, release date, cast, and plot summary. To do this, one needs a way to represent this information and a similarity metric to quantify how similar two items are.

Let us first see an example. 

In [14]:
movies_ratings_join = movies.merge(ratings, left_on="movieId", right_on="movieId")

In [15]:
movies_ratings_join.query("userId == 1").head(5)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
325,3,Grumpier Old Men (1995),Comedy|Romance,1,4.0,964981247
433,6,Heat (1995),Action|Crime|Thriller,1,4.0,964982224
2107,47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,1,5.0,964983815
2379,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,1,5.0,964982931


What would be the rating of this user for the following movie? 

In [16]:
movies_ratings_join.query("userId == 1 and movieId == 70")

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
2859,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,1,3.0,964982400


The following is the calculation for this example, using the Jaccard similarity between movie genres:

Jaccard(Toy Story, From Dusk Till Dawn) = 1/8 <br>
Jaccard(Grumpier Old Men, From Dusk Till Dawn) = 1/5 <br>
Jaccard(Heat, From Dusk Till Dawn) = 2/5 <br>
Jaccard(Seven (a.k.a. Se7en), From Dusk Till Dawn) = 1/5 <br>
Jaccard(Usual Suspects, From Dusk Till Dawn) = 1/6<br>


In [17]:
prediction = (1/8*4+1/5*4+2/5*4+1/5*5+1/6*5)/(1/8+1/5+2/5+1/5+1/6)
prediction

4.335877862595419

### Jaccard Similarity

Given two sets $A$ and $B$,

$Jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|}$

For example, if $A = \{a, b, c, d\}$ and $B = \{b, d, e, f, g\}$ then

$Jaccard(A, B) = \frac{2}{7}$
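
The definition can be checked with Python sets. This is a minimal sketch: the helper name `jaccard` and the use of `Fraction` for exact results are choices made for this example, and returning 0 for two empty sets is a convention assumed here.

```python
from fractions import Fraction

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets empty: the ratio is undefined; 0 is a convention chosen here
        return Fraction(0)
    return Fraction(len(a & b), len(union))

A = {"a", "b", "c", "d"}
B = {"b", "d", "e", "f", "g"}
print(jaccard(A, B))  # 2/7

# The same function applied to two genre lists from the example above
heat = "Action|Crime|Thriller".split("|")
dusk = "Action|Comedy|Horror|Thriller".split("|")
print(jaccard(heat, dusk))  # 2/5
```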

In [25]:
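def content_based_rating_prediction(u, i):
    r = 0
    sum_sim = 0
    # find the movies rated by u (named rated_movies so the global movies DataFrame is not shadowed)
    rated_movies = ratings[ratings["userId"]==u].movieId
    for j in rated_movies:
        # weight each rating of u by the genre similarity between j and the target item i
        sim = genre_sim(i, j)
        key = str(u)+"_"+str(j)
        r += sim*rating_map[key]
        sum_sim += sim
    if sum_sim == 0:
        # no rated movie shares a genre with i, so no weighted average can be formed
        return 0
    else:
        return r / sum_sim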

In [27]:
content_based_rating_prediction(5,10)

3.3031470777135508

In [21]:
# finds the genre similarity of items i and j using Jaccard similarity
def genre_sim(i,j):
    genres_i = movie_genres[i]
    genres_j = movie_genres[j]
    #print(genres_i)
    #print(genres_j)
    intersection_size = len(set(genres_i).intersection(genres_j))
    union_size = len(set(genres_i).union(genres_j))
    return intersection_size / union_size
    

In [24]:
genre_sim(10,8)

0.25

In [None]:
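# finds the genre similarity of items i and j using Jaccard similarity
# NOTE: item_content_matrix is not defined in this notebook. This function assumes it is a
# DataFrame indexed by movieId with one 0/1 integer (or boolean) column per genre,
# since np.bitwise_and / np.bitwise_or require integer or boolean arrays.
def genre_sim2(i,j):
    x = item_content_matrix.loc[i].to_numpy()
    y = item_content_matrix.loc[j].to_numpy()
    intersection_size = np.count_nonzero(np.bitwise_and(x,y))
    union_size = np.count_nonzero(np.bitwise_or(x,y))
    return intersection_size / union_size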
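
The matrix-based variant needs a binary item-by-genre matrix, which the notebook does not construct. The sketch below shows one way such a matrix might be built with `Series.str.get_dummies`. The names `movies_demo`, `item_content_demo` and `genre_sim_binary` are hypothetical, and the three rows are copied from the examples earlier in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies DataFrame
movies_demo = pd.DataFrame({
    "movieId": [1, 6, 70],
    "title": ["Toy Story (1995)", "Heat (1995)", "From Dusk Till Dawn (1996)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Action|Crime|Thriller",
               "Action|Comedy|Horror|Thriller"],
})

# One row per movie, one 0/1 column per genre, indexed by movieId
item_content_demo = movies_demo["genres"].str.get_dummies(sep="|")
item_content_demo.index = movies_demo["movieId"]

def genre_sim_binary(matrix, i, j):
    """Jaccard similarity of two rows of a binary item-by-genre matrix."""
    x = matrix.loc[i].to_numpy().astype(bool)
    y = matrix.loc[j].to_numpy().astype(bool)
    union_size = np.count_nonzero(x | y)
    if union_size == 0:
        return 0.0
    return np.count_nonzero(x & y) / union_size

print(genre_sim_binary(item_content_demo, 6, 70))  # 0.4
print(genre_sim_binary(item_content_demo, 1, 70))  # 0.125
```

The two printed values agree with the hand-computed similarities 2/5 and 1/8 from the worked example.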


## Evaluation of Rating Prediction

How can we measure the performance of a recommender algorithm? This is similar to the evaluation used in machine learning.

- Make a train/test split
- Build the model on the training set
- Make predictions for the ratings in the test set
- Find the mean absolute error (MAE)

For metrics other than MAE, see the "Metrics for Regression" section of [this notebook](http://localhost:8888/notebooks/PycharmProjects/data_science/evaluation.ipynb)
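
As a small illustration of the last step, MAE and one further regression metric (RMSE) can be computed as below. The helper names and the four ratings are made up for this example.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference; penalizes large errors more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

true_ratings = [4.0, 3.0, 5.0, 2.0]   # illustrative values only
predicted = [3.5, 3.0, 4.0, 3.0]

print(mean_absolute_error(true_ratings, predicted))      # 0.625
print(root_mean_squared_error(true_ratings, predicted))  # 0.75
```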


In [28]:
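X_train, X_test = train_test_split(ratings, test_size=1000)
train_size = X_train.shape[0]
test_size = X_test.shape[0]
print("Test size:", test_size)
# Note: content_based_rating_prediction reads the global ratings and rating_map, which were
# built from all ratings, so the held-out rating of (u, i) itself enters the weighted average
# (with similarity 1). The MAE below is therefore optimistic.
error = 0
for k in range(test_size): 
    u = X_test.iloc[k,0]   # userId
    i = X_test.iloc[k,1]   # movieId
    r = X_test.iloc[k,2]   # true rating
    error += np.abs(r - content_based_rating_prediction(u,i))
print(error/test_size)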

Test size: 1000
0.6679542055397085


## Top-N recommendation Algorithm - Predict and Sort
The task in top-$N$ recommendation is to recommend $N$ items to a user. 


Recommend $N$ movies to user $u$
- Predict the ratings of all items which are not watched by $u$
- Sort the predicted ratings
- Recommend the movies with the highest predicted ratings

In [29]:
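def top_N_pred_sort(N, u):
    preds = pd.Series([], dtype='float')
    # find the movies not rated by u: all rated movies minus those rated by u
    # (the filter "userId != @u" would also keep movies that both u and other users rated)
    rated_by_u = set(ratings.query("userId == @u").movieId)
    movies_not_rated = [m for m in ratings.movieId.unique() if m not in rated_by_u]
    for m in movies_not_rated:
        preds[m] = content_based_rating_prediction(u, m)
    return preds.sort_values(ascending=False)[:N]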

In [30]:
top_N_pred_sort(10, 1)

7335      5.000000
4426      5.000000
2066      5.000000
175401    4.718136
163112    4.718136
170597    4.718136
160718    4.718136
153236    4.718136
163386    4.718136
102058    4.718136
dtype: float64

## Efficiency Issues

There are important inefficiencies in this algorithm:

- The algorithm predicts the rating of every item that the user has not rated. With millions of items this is practically infeasible. Numerous techniques have been developed to remedy this problem. Can you suggest a solution?
- In rating prediction, the similarities between the target item and the items rated by the user are calculated. To make a recommendation to another user, these similarity calculations are done again, so across many users a large number of calculations are repeated. A general solution is to precalculate the similarities between items. Moreover, you don't need to store all similarities; storing only the $k$ most similar items for every item is enough. The value of $k$ can be chosen according to the needs of the application.


## Top-N recommendation Algorithm - kNN Map
The task in top-$N$ recommendation is to recommend $N$ items to a user. 

- Build a knn-map (a map which stores the $k$ nearest neighbors of each item)

Recommend $N$ movies to user $u$
- Get the neighbors of the movies watched by $u$ and put them into a list $C$.
- Choose $N$ movies from $C$. Different methods can be used here: for example, the movies that occur most often in $C$, or the movies with the highest total similarity (or maximum similarity), can be chosen.
- Recommend the $N$ movies that are chosen.
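
The three selection methods mentioned above (frequency, total similarity, maximum similarity) can be sketched as follows. The helper names and the candidate list `C_demo` are made up for illustration; the `(similarity, movie_id)` tuple layout is assumed to match the entries stored in the knn map.

```python
from collections import Counter, defaultdict

def score_candidates(C):
    """C is a list of (similarity, movie_id) pairs gathered from the neighbor lists."""
    counts = Counter(movie_id for _, movie_id in C)
    total_sim = defaultdict(float)
    max_sim = defaultdict(float)
    for sim, movie_id in C:
        total_sim[movie_id] += sim
        max_sim[movie_id] = max(max_sim[movie_id], sim)
    return counts, dict(total_sim), dict(max_sim)

def top_n(scores, n):
    """Return the n movie ids with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Movie 20 appears twice with moderate similarity, movie 30 once with high similarity
C_demo = [(0.5, 20), (0.25, 20), (1.0, 30), (0.5, 40)]
counts, total_sim, max_sim = score_candidates(C_demo)
print(top_n(counts, 1))     # [20]
print(top_n(total_sim, 1))  # [30]
print(top_n(max_sim, 1))    # [30]
```

The example shows that the methods can disagree: counting favors movie 20, while both similarity-based scores favor movie 30.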

## Building a knn map
This table will hold the most similar $k$ items for each item. In order to build this table we need to calculate all pairwise similarities which takes $O(n^2)$ time. There is no escape from this $O(n^2)$ time unless you use an approximation algorithm such as LSH (Locality Sensitive Hashing) for nearest neighbor search.

We will use a heap based priority queue for storing the nearest neighbors. You can look at this [animation](https://www.cs.usfca.edu/~galles/visualization/Heap.html).

In [None]:
pq =[(10,"a"),(8, "b"), (5, "c")]
heapq.heapify(pq)
heapq.nsmallest(2,pq)
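
Since `heapq` implements a min-heap, the smallest element is always at index 0. Keeping the $k$ largest values of a stream therefore comes down to holding a heap of size $k$ and replacing its root whenever a larger value arrives. A small sketch of this pattern follows; the function name `k_largest` and the sample data are illustrative.

```python
import heapq

def k_largest(pairs, k):
    """Return the k largest (score, label) pairs using a bounded min-heap."""
    if k <= 0:
        return []
    heap = []
    for pair in pairs:
        if len(heap) < k:
            heapq.heappush(heap, pair)
        elif pair > heap[0]:
            # Drop the current smallest and insert the new pair in one step
            heapq.heapreplace(heap, pair)
    return sorted(heap, reverse=True)

stream = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.1, "e")]
print(k_largest(stream, 3))  # [(0.9, 'b'), (0.7, 'd'), (0.5, 'c')]
```

Each element costs at most $O(\log k)$, so one pass over $n$ candidates takes $O(n \log k)$ time instead of the $O(n \log n)$ of a full sort.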

In [None]:
movies[:10]

In [None]:
def build_knn_map(movies, K=10):
    knn_map = {}
    movie_ids = movies['movieId'].unique()
    for i in tqdm(range(len(movie_ids))):
        pq = []
        knn_map[movie_ids[i]] = pq
        for j in range(len(movie_ids)):
            if (i == j):
                continue
            sim = genre_sim(movie_ids[i],movie_ids[j])
            if (len(pq) >= K):
                smallest = pq[0]
                if (sim > smallest[0]):
                    heapq.heappop(pq)
                    heapq.heappush(pq, (sim, movie_ids[j]))
            else:
                heapq.heappush(pq, (sim, movie_ids[j]))
    return knn_map

In [None]:
knn_map = build_knn_map(movies,K=30)

In [None]:
knn_map[1]

## Top-N recommendation using knn map

In [None]:
def top_N_knn_map(ratings, N, u):
    C = []
    # find the movies rated by u
    movies_rated = ratings.query("userId == @u").movieId
    for m in movies_rated:
        C = C + knn_map[m]
    return add_sims_and_sort(C)[:N]    

In [None]:
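def add_sims_and_sort(l):
    li = []
    # groupby only merges consecutive items, so sort by movie id first
    l = sorted(l, key=operator.itemgetter(1))
    it = itertools.groupby(l, operator.itemgetter(1))
    for key, subiter in it:
        # (movie id, total similarity over all of its occurrences)
        li.append((key, sum(item[0] for item in subiter)))
    # rank by total similarity (second element), highest first
    li = sorted(li, key=itemgetter(1), reverse=True)
    return li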
    

In [None]:
top_N_knn_map(ratings, 10, 1)

## Evaluation of top-N recommendation 

Evaluation of rating prediction is rather easy: find the mean absolute error between rating predictions and true ratings. How can we evaluate the accuracy of a top-N recommendation? There are several techniques which we will look at in more detail later. Below is one common way to evaluate top-N recommendation:

- Randomly sub-sample some portion of positive preferences in order to create a test set $T$. Positive preferences might be 5-star ratings, movies watched more than a certain threshold, or items purchased.
- Put the rest of the preferences into the training set and build the model.

- For each preference $(u,i)$ in the test set:
    - Make a top-N recommendation to user $u$.
    - If the test item $i$ occurs among the top-N items, then we have a hit; otherwise we have a miss. 

Hit ratio is then defined as: 

$$
\text{Hit Ratio} = \frac{\#\text{hits}}{|T|}
$$
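
The procedure can be sketched independently of any particular recommender. In the example below, `hit_ratio`, `recommend_demo` and the fixed recommendation lists are hypothetical stand-ins for a trained model and a real test set.

```python
def hit_ratio(test_pairs, recommend, n):
    """test_pairs: iterable of (user, item); recommend(user, n) returns a list of item ids."""
    test_pairs = list(test_pairs)
    if not test_pairs:
        return 0.0
    hits = sum(1 for u, i in test_pairs if i in recommend(u, n))
    return hits / len(test_pairs)

# Made-up fixed recommendation lists standing in for a trained recommender
fixed_lists = {1: [10, 20, 30], 2: [40, 50, 60]}

def recommend_demo(u, n):
    return fixed_lists[u][:n]

# Test set T of held-out (user, item) positive preferences
T = [(1, 20), (1, 99), (2, 60), (2, 40)]
print(hit_ratio(T, recommend_demo, 2))  # 0.5
```

With $N = 2$, items 20 and 40 are hits while 99 and 60 are misses, giving 2 hits out of $|T| = 4$.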



In [None]:
N = 1000
X_train, X_test = train_test_split(ratings, test_size=1000)
X_test = X_test.query("rating > 4")
train_size = X_train.shape[0]
test_size = X_test.shape[0]
print("Test size:", test_size)
hit_count = 0
for k in range(test_size): 
    u = X_test.iloc[k,0]
    i = X_test.iloc[k,1]
    r = X_test.iloc[k,2]
    top_N = top_N_knn_map(X_train, N, u)
    hit_list = [item for item in top_N if item[0] == i]
    if len(hit_list) > 0:
        hit_count +=1
print("Hit Ratio", hit_count/test_size)