# WHAT'S THIS NOTEBOOK ABOUT?


This notebook is a practical introduction to the main Recommender System (RecSys) techniques. The objective of a RecSys is to recommend relevant items to users, based on their preferences. Preference and relevance are subjective, and they are generally inferred from the items users have consumed previously.

The dataset I'm using is a popular one called [movielens100k](https://www.kaggle.com/rajmehra03/movielens100k) that contains 3 tables: movies, ratings and tags.

Since the data is **explicit** (users give actual ratings), for simplicity we will use the Python package [Surprise](http://surpriselib.com/).

First, we will explore our data to get an idea of what it contains. Then we will define a class to evaluate our models and, finally, we will train every model and compare them at the end.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm

from sklearn.feature_extraction.text import CountVectorizer

from sklearn import preprocessing
import scipy
import random

from surprise import AlgoBase, Reader
from surprise import Dataset
from surprise.model_selection import cross_validate,train_test_split, GridSearchCV

from sklearn.model_selection import train_test_split as sklearn_train_test_split

# models
from surprise import KNNWithZScore, SVD


pd.set_option('display.max_colwidth', None)

# EXPLORATORY DATA ANALYSIS

## THE MOVIES DATASET

Let's start exploring our movies dataset.
First we check if there are any NaN values.

In [None]:
movies_data = pd.read_csv("../input/movielens100k/movies.csv")
movies_data.drop_duplicates(subset ="title",keep='first',inplace=True,ignore_index=True) 
movies_data.shape

In [None]:
# we check if there are empty values
plt.figure(figsize=(8,4))
sns.heatmap(movies_data.isna(), cbar=False, yticklabels=False)

In [None]:
# we remove ('no genre listed') from the genres list.
NO_GENRE_LISTED = "(no genres listed)"
movies_data["genres"] = movies_data["genres"].apply(lambda genres: [ genre for genre in genres.split('|') if genre != NO_GENRE_LISTED ])

In [None]:
# we drop rows with na values.
movies_data = movies_data.dropna()
movies_data.shape

In [None]:
genres_merged = movies_data["genres"].apply(lambda genres: " ".join(genres))
genres_vectorizer = CountVectorizer(token_pattern="(?u)\\b[\\w-]+\\b")
genres_count_matrix = genres_vectorizer.fit_transform(genres_merged.tolist())

In [None]:
print("there are {} genres of movies".format(len(genres_vectorizer.vocabulary_)))

In [None]:
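# vocabulary_ maps each genre to its column index, not to a count,
# so we sum the count matrix column-wise to get the number of movies per genre.
genre_totals = np.asarray(genres_count_matrix.sum(axis=0)).ravel()
summed_movie_genres = {genre: genre_totals[index] for genre, index in genres_vectorizer.vocabulary_.items()}
plt.figure(figsize=(40,10))
plt.bar(list(summed_movie_genres.keys()), list(summed_movie_genres.values()))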

## THE RATINGS DATASET

Let's explore the ratings dataset.

In [None]:
ratings_data = pd.read_csv("../input/movielens100k/ratings.csv")
ratings_data.columns

In [None]:
ratings_data.shape

In [None]:
ratings_data = ratings_data[ratings_data["movieId"].isin(movies_data["movieId"])]
ratings_data.shape

Let's see the distribution of mean ratings in our dataset.

In [None]:
movies_mean_ratings = ratings_data.groupby(['movieId'],as_index=False)["rating"].mean()
plt.figure(figsize=(12, 6))
# histplot replaces the deprecated distplot
sns.histplot(movies_mean_ratings["rating"], bins=10, kde=True)

As we can see, most of the movies have a mean rating between 3 and 4.

Let's finish this data analysis by exploring the distribution of interactions per user.

In [None]:
# we count the number of ratings (interactions) of each user
user_interactions = ratings_data.groupby("userId")["movieId"].count().reset_index(name="interactions")

In [None]:
plt.figure(figsize=(15, 5))
# histplot replaces the deprecated distplot
sns.histplot(user_interactions["interactions"], kde=True)

# EVALUATING OUR MODELS, HOW?

To evaluate our recommender systems, we will use the **Recall@N** evaluation metric used in this [paper](https://www.researchgate.net/publication/221141030_Performance_of_recommender_algorithms_on_top-N_recommendation_tasks).

This is how we will proceed:

```
- for each user

    - for each item rated as "good" from our test set
    
        - sample X other items the user has never interacted with ( we assume that they are irrelevant).
        - we merge the X items and the targeted item in one list.
        - using the model, we rank our X+1 items.
        - we form top-N recommendations.
        - if our target item belongs to the top-N recommendation, it's a hit, otherwise it's a miss.
    
    - calculate the metric for the user
    
- calculate the mean metric of all users.

```
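As a rough illustration of the steps above, here is a minimal, library-free sketch of the hit counting for a single user. The function name `recall_at_n`, the toy scoring function and the item ids are assumptions made up for this example; the real evaluator in this notebook works with a Surprise model instead.

```python
import random

def recall_at_n(liked_items, candidate_pool, score, n, sample_size, seed=0):
    """Fraction of liked test items ranked in the top n among sampled negatives."""
    rng = random.Random(seed)
    hits = 0
    for target in liked_items:
        # sample items the user never interacted with (assumed irrelevant)
        negatives = rng.sample(candidate_pool, sample_size)
        # rank the sampled items plus the target by predicted score, best first
        ranked = sorted(negatives + [target], key=score, reverse=True)
        # it's a hit when the target made it into the top n
        hits += int(target in ranked[:n])
    return hits / len(liked_items)

# hypothetical model: the score of an item is simply its id
toy_score = lambda item: item
pool = list(range(100))   # items the user never interacted with
liked = [150, 200]        # liked items held out in the test set

print(recall_at_n(liked, pool, toy_score, n=5, sample_size=50))  # 1.0
```

Both liked items score higher than every sampled item, so each one is a hit and the recall is 1.0.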

Let's code our evaluator class.

In [None]:
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 50

class RecallEvaluator:
    
    def __init__(self,items,trainset,testset):
        self.items = items
        self.trainset = trainset
        self.testset = testset
        
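    def get_favorite_items_from_testset(self,user_id):
        
        # the test set holds raw ids while trainset.ur is keyed by inner ids, so we convert first.
        # a test item is a "favorite" when its rating is above the user's mean rating in the train set.
        inner_user_id = self.trainset.to_inner_uid(user_id)
        mean_rating = np.mean([rating for (_,rating) in self.trainset.ur[inner_user_id]])
        return [item for (user,item,rating) in self.testset if rating > mean_rating and user == user_id]

    def get_interacted_items_from_testset(self,user_id):
        return [item for (user,item,_) in self.testset if user == user_id]
    
    def get_interacted_items_from_trainset(self,user_id):
        # trainset.ur stores inner item ids, we map them back to raw movie ids.
        inner_user_id = self.trainset.to_inner_uid(user_id)
        return [self.trainset.to_raw_iid(item) for (item,_) in self.trainset.ur[inner_user_id]]
    
    def get_not_interacted_items(self, user_id, size, seed=42):
            
        # items the user interacted with in either the train set or the test set are excluded.
        interacted_items = set(self.get_interacted_items_from_testset(user_id)) | set(self.get_interacted_items_from_trainset(user_id))
        non_interacted_items = sorted(set(self.items) - interacted_items)
        random.seed(seed)
        non_interacted_items = random.sample(non_interacted_items,size)
        return set(non_interacted_items)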
    
    def verify_top_n_hits(self, item_id, recommendations, topn):
        # it's a hit when the item is among the first topn recommendations
        return int(item_id in recommendations[:topn])
    
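    def recommend_to_evaluate(self,user_id,model):
         
            # get all the items that the user didn't interact with YET from the full item list
            non_interacted_items = list(set(self.items) - set(self.get_interacted_items_from_trainset(user_id)))
            # predict(uid, iid, r_ui, clip): clip=False keeps the raw scores, otherwise scores outside
            # the rating scale (e.g. the view counts of the popularity model) all collapse to 5.
            # the arguments are positional because GridSearchCV.predict only forwards positional arguments.
            recommendations = [model.predict(user_id, item, None, False) for item in non_interacted_items]
            recommendations.sort(key=lambda x: x.est, reverse=True)
            return recommendations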
            
    
    def evaluate_model_for_user(self,user_id,model,topn):
        
        #Getting the items in test set that the user "like"
        favorite_items_testset = self.get_favorite_items_from_testset(user_id)
        
        favorite_items_count_testset = len(favorite_items_testset)
        
        if favorite_items_count_testset == 0:
            return [(0,0,0)] * len(topn)
        
        user_recommendations = [prediction.iid for prediction in self.recommend_to_evaluate(user_id,model)]        
        
        # we initialize our hits count
        hits_count = [0] * len(topn)
        
        #For each item the user likes in the test set
        for item_id in favorite_items_testset:
            
            # we generate a random sample of #EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS movies that the user didn't interact with.
            non_interacted_items = self.get_not_interacted_items(user_id,size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS,seed=item_id%(2**32))
            
            # we combine them with the relevant item.
            items_to_filter_recs = non_interacted_items.union(set([item_id]))
            
            # we recommend movies to the user, these recommendations are sorted, we pick only the X+1 items.
            valid_recommendations = [recommended_item for recommended_item in user_recommendations if recommended_item in items_to_filter_recs]
                        
            #Verifying if the current interacted item is among the Top-N recommended items
            hits = [self.verify_top_n_hits(item_id,valid_recommendations,t) for t in topn]
            hits_count = np.add(hits_count, hits)
            
            
        recall = [hit_count/float(favorite_items_count_testset) for hit_count in hits_count]
        return [(rec,hit_count,favorite_items_count_testset) for rec,hit_count in zip(recall,hits_count)]
    
    
    def evaluate_model(self,model,topn,model_name):
            
        #key names of the user metrics
        keys = ['recall@{}'.format(t) for t in topn]
        users_metrics = []
        users_in_testset = set([user for (user,_,_) in self.testset])
        
        for user_id in tqdm(users_in_testset,total=len(users_in_testset)):
          
            user_metrics = [user_id] + self.evaluate_model_for_user(user_id,model,topn)  
            users_metrics.append(user_metrics)
            
        user_recalls = pd.DataFrame(users_metrics,columns=["user_id"] + keys)
        
        global_recall = {}
        
        for key in keys:
            
            hits_sum = np.sum([hit_count for _,(_,hit_count,_) in user_recalls[key].items()])
            interaction_counts = np.sum([interaction_count for _,(_,_,interaction_count) in user_recalls[key].items()])
            global_recall[key] = hits_sum / float(interaction_counts)
            
        global_metrics = {**{'model': model_name}, **global_recall} 
        return global_metrics, user_recalls

# RECOMMENDATIONS MODELS AND EVALUATIONS

## SPLITTING THE DATA

Let's first split our data. We use scikit-learn's **train_test_split**, stratified by user so that every user appears in both sets, then load each part into a Surprise dataset.

In [None]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_data[['userId', 'movieId', 'rating']], reader)

traindf, testdf = sklearn_train_test_split(ratings_data[['userId', 'movieId', 'rating']],
                                   stratify=ratings_data['userId'], 
                                   test_size=0.25,
                                   random_state=42)


train = Dataset.load_from_df(traindf, reader)
test = Dataset.load_from_df(testdf, reader)

trainset = train.build_full_trainset()
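# we keep every held-out rating: surprise's train_test_split would only return 20% of them by default
testset = test.build_full_trainset().build_testset()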

movies_id = movies_data["movieId"].unique().tolist()

We then initialize our evaluator.

In [None]:
recall_evaluator = RecallEvaluator(movies_id,trainset,testset)

# POPULARITY BASED MODEL

This model recommends the most popular movies (the ones with the most ratings) to all users.
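Before the Surprise version, here is a minimal sketch of the idea in plain Python. The interaction list, the movie names and the helper `recommend_popular` are made up for illustration and are not part of the dataset or of Surprise.

```python
from collections import Counter

# toy (user, movie) interactions, invented for this example
interactions = [("u1", "Heat"), ("u2", "Heat"), ("u3", "Heat"),
                ("u1", "Alien"), ("u2", "Alien"), ("u3", "Up")]

def recommend_popular(interactions, user, n):
    """Rank movies by number of ratings, skipping those the user already rated."""
    views = Counter(movie for _, movie in interactions)
    seen = {movie for u, movie in interactions if u == user}
    ranked = [movie for movie, _ in views.most_common() if movie not in seen]
    return ranked[:n]

print(recommend_popular(interactions, "u3", n=2))  # ['Alien']
print(recommend_popular(interactions, "u4", n=2))  # ['Heat', 'Alien']
```

The ranking is the same for everyone; only the movies a user has already rated are filtered out.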

In [None]:
class PopularRecSys(AlgoBase):

    def __init__(self):
        AlgoBase.__init__(self)

    def fit(self, trainset):
        
        AlgoBase.fit(self, trainset)
        # we count how many ratings (views) each item received in the train set.
        # the results are stored in a popular attribute
        
        self.popular = {}  # a dict where the key is the item inner id, the value is the views count
        
        for item_id in trainset.all_items():

            views = len(trainset.ir[item_id])
            self.popular[item_id] = views

        return self

    def estimate(self, u, i):

        if self.trainset.knows_item(i):
            return self.popular[i]
        return 0

In [None]:
popularRecSys = PopularRecSys()
popularRecSys.fit(trainset)

In [None]:
popularRecSys_metrics,_ = recall_evaluator.evaluate_model(topn=[5,8,10],model=popularRecSys,model_name='popular')

In [None]:
print(popularRecSys_metrics)

Since this model doesn't estimate ratings, we won't show the accuracy.

# TOP-RATED BASED MODEL

How does the top-rated model work?

```
- we get the top-rated movies and we recommend them to the user

- to get the top-rated movies:
    - we calculate the adjusted average rating of each movie.
    - we rank the movies by this average.
```
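To see why the average is adjusted, here is a small worked example using the same formula as the model in this section, `(sum + 5) / (count + 5)`. The helper name `adjusted_mean` and its `prior_weight` parameter are assumptions introduced only for this sketch.

```python
def adjusted_mean(ratings, prior_weight=5):
    # adds prior_weight to both the sum and the count of the ratings,
    # which behaves like prior_weight extra ratings of 1
    return (sum(ratings) + prior_weight) / (len(ratings) + prior_weight)

one_vote = [5.0]            # a single 5-star rating
many_votes = [4.5] * 100    # a hundred 4.5-star ratings

print(round(adjusted_mean(one_vote), 2))    # 1.67
print(round(adjusted_mean(many_votes), 2))  # 4.33
```

With a plain average the movie with a single 5-star rating would come first; with the adjusted average it falls well below the movie that many users rated highly.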

In [None]:
class TopRatedRecSys(AlgoBase):

    def __init__(self):
        AlgoBase.__init__(self)

    def fit(self, trainset):
        
        AlgoBase.fit(self, trainset)
        # we compute the adjusted ratings mean for each item.
        # the results are stored in a top_rating attribute
        
        self.top_rating = {}  # a dict where the key is the item inner id, the value is the estimated rating
        
        for item_id in trainset.all_items():

            # adding 5 to both the sum and the count acts like five extra ratings of 1,
            # which pushes items with very few ratings down the ranking.
            item_ratings = [r for (_,r) in trainset.ir[item_id]]
            adjusted_mean = (np.sum(item_ratings) + 5) / (len(item_ratings) + 5)
            self.top_rating[item_id] = adjusted_mean

        return self

    def estimate(self, u, i):

        if self.trainset.knows_item(i) and self.trainset.knows_user(u) :
            return self.top_rating[i]
        return 0

In [None]:
topRatedRS = TopRatedRecSys()
topRatedRS.fit(trainset)
topRatedRS_metrics , _ = recall_evaluator.evaluate_model(topn=[5,8,10],model=topRatedRS,model_name='top rated')

In [None]:
print(topRatedRS_metrics)

In [None]:
topRatedRS_cv = cross_validate(topRatedRS, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# COLLABORATIVE FILTERING


According to this [website](https://www.sciencedirect.com/topics/computer-science/collaborative-filtering), there are 2 CF RecSys approaches: memory-based and model-based.

## MEMORY-BASED CF

We will implement 2 types of memory-based CF: user-to-user and item-to-item. Both are based on K-NN with z-score normalization of the ratings (Surprise's `KNNWithZScore`). You can read about K-NN in this [article](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761).

Let's start with the user-to-user method. To measure how close two users are, we will use cosine similarity.
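As a minimal sketch of what cosine similarity between two users means, the snippet below computes it over the movies both users rated. The users, movie names and the helper `cosine_similarity` are invented for illustration; Surprise computes its similarity matrix internally.

```python
from math import sqrt

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users, using only co-rated items."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0  # no movie in common, we treat the users as unrelated
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

# toy rating dictionaries {movie: rating}
alice = {"Heat": 5.0, "Alien": 3.0, "Up": 1.0}
bob = {"Heat": 4.0, "Alien": 2.0}
carol = {"Heat": 1.0, "Up": 5.0}

# bob rates like alice, carol rates the opposite way
print(cosine_similarity(alice, bob) > cosine_similarity(alice, carol))  # True
```

The K-NN model then predicts a rating from the ratings of the `k` most similar users (or items, in the item-to-item case).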

In [None]:
sim_options_ub = {'name': 'cosine','user_based': True}

In [None]:
KNNWithZscore_ub = KNNWithZScore(k=4,min_k=3,sim_options=sim_options_ub)
KNNWithZscore_ub.fit(trainset)

In [None]:
KNNWithZscore_ub_metrics , _ = recall_evaluator.evaluate_model(topn=[5,8,10],model=KNNWithZscore_ub,model_name='K-NN with z-score [user-based]')

In [None]:
print(KNNWithZscore_ub_metrics)

In [None]:
KNNWithZscore_ub_cv = cross_validate(KNNWithZscore_ub, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Now, let's implement the item-to-item model.

In [None]:
sim_options_ib = {'name': 'cosine','user_based': False}
KNNWithZscore_ib = KNNWithZScore(k=4,min_k=3,sim_options=sim_options_ib)
KNNWithZscore_ib.fit(trainset)

In [None]:
KNNWithZscore_ib_metrics , _ = recall_evaluator.evaluate_model(topn=[5,8,10],model=KNNWithZscore_ib,model_name='K-NN with z-score [item-based]')

In [None]:
print(KNNWithZscore_ib_metrics)

In [None]:
KNNWithZscore_ib_cv = cross_validate(KNNWithZscore_ib, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

As we can see, the user-based CF performed better than the item-based one.

## MODEL-BASED CF

One of the most popular model-based algorithms is SVD. More information about this algorithm [here](https://developers.google.com/machine-learning/recommendation/collaborative/matrix).
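To give an intuition, here is a minimal sketch of the biased matrix factorization that Surprise's documentation describes for its `SVD` class: a rating is predicted as the global mean plus a user bias, an item bias and the dot product of two factor vectors, and the parameters are learned by stochastic gradient descent. The function names, the toy numbers and the regularization value `reg=0.02` are assumptions for illustration; the learning rate 0.005 is one of the values tried in the grid search of this section.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Predicted rating: global mean + user bias + item bias + factor dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One stochastic gradient descent update on a single observed rating."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# toy parameters for one user and one item, with 2 latent factors
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.2], [0.3, 0.1]

before = abs(4.0 - predict(mu, b_u, b_i, p_u, q_i))
b_u, b_i, p_u, q_i = sgd_step(4.0, mu, b_u, b_i, p_u, q_i)
after = abs(4.0 - predict(mu, b_u, b_i, p_u, q_i))

print(after < before)  # True: the update moves the prediction toward the true rating
```

Training repeats this update over all the ratings for `n_epochs` passes, which is why the number of epochs and the learning rate are the parameters we tune.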

Thanks to the surprise library, we can test different parameters and pick the best ones.

In [None]:
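param_grid = {'n_epochs': [10,20,30], 'lr_all': [0.002, 0.005]}
# refit=False: refitting on a whole dataset would let the model see the test ratings,
# so we search on the training data and fit the best model ourselves.
grid_search = GridSearchCV(SVD, param_grid,measures=['MAE','RMSE'],cv=3,refit=False)
grid_search.fit(train)

In [None]:
# we train the best configuration (lowest RMSE) on the train set only
svd = grid_search.best_estimator['rmse']
svd.fit(trainset)
svd_metrics , _ = recall_evaluator.evaluate_model(topn=[5,8,10],model=svd,model_name='SVD')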

In [None]:
print(svd_metrics)

As we can see, the SVD algorithm performed better than both versions of K-NN, with 63% recall@8 and 68% recall@10.

# FINAL RESULTS

Let's plot the algorithms along with their recall values.

In [None]:
global_metrics = pd.DataFrame([popularRecSys_metrics,topRatedRS_metrics,KNNWithZscore_ub_metrics,KNNWithZscore_ib_metrics,svd_metrics]).set_index('model')

In [None]:
%matplotlib inline
ax = global_metrics.transpose().plot(kind='bar', figsize=(15,8))
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# CONCLUSION

In this notebook we explored different algorithms using the **surprise** library and concluded that:

- K-NN algorithms performed poorly.
- the top-rated algorithm has the highest recall because we have a small dataset (only 9k movies and 100k ratings).

What to do now?

- create a content-based model and compare it to the others
- create a hybrid model by combining different algorithms
- take rating time into consideration (time-aware RecSys). More information in this [paper](https://www.scitepress.org/Papers/2017/63126/63126.pdf)