# Hands-on implementation for various recommender systems.

![img](https://images.unsplash.com/photo-1601944179066-29786cb9d32a?ixid=MnwxMjA3fDB8MHxzZWFyY2h8MTN8fG5ldGZsaXh8ZW58MHwwfDB8fA%3D%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=60)

Inspired by : https://github.com/alanchn31/recommender-system

**What kind of data do we need to implement the recommendation system?**

For our target variable
1. Explicit rating - A rating given by a user to an item on a scale (could be score 1 to 5, or 1 to 10).
2. Implicit rating - A measurement to indicate the user preference indirectly. It could be a view, click, like, how long they read, how much they bought, etc.

For other features
1. Content feature - Genre, Type, Number of subscribers, Age, Published channel, etc.

Here we use `Anime-recommendations-database` as the input data for our tutorial.

## Let's see what we have in our data source.

In [None]:
import numpy as np
import pandas as pd
import os
pd.set_option('display.max_columns',None)
import seaborn as sns
from collections import defaultdict
from tqdm.notebook import tqdm
import gc

def basic_summary(df):
    
    '''
    Report the basic information about the input dataframe
    
    Args:
    df -> pd.DataFrame
    
    Returns:
    None
    '''
    
    print(f"Samples : {df.shape[0]:,} \nColumns : {df.shape[1]} : {list(df.columns.values)}")
    print("\nHeads")
    display(df.head(3))
    print("\nData types")
    display(pd.DataFrame(df.dtypes, columns=['dtypes']).transpose())
    print("\nNull values")
    display(pd.concat([df.isna().sum(),df.isna().mean() * 100],axis=1).rename({0:'count',1:'pct'},axis=1).transpose())
    print("\nBasic statistics")
    display(df.describe().transpose())
      

if __name__ == "__main__":
    
    BASE_PATH = '/kaggle/input/anime-recommendations-database/'
    ANIME_DTYPES = {'anime_id': str, 'name': str, 'genre': str, 'type': str, 'episodes': str, 'rating': float, 'members': int}
    RATING_DTYPES = {'user_id': str, 'anime_id': str, 'rating': int}
    ANIME_PATH = os.path.join(BASE_PATH, 'anime.csv')
    RATING_PATH = os.path.join(BASE_PATH, 'rating.csv')

    anime = pd.read_csv(ANIME_PATH, dtype = ANIME_DTYPES)
    rating = pd.read_csv(RATING_PATH,  dtype = RATING_DTYPES)

# Basic summary

## Anime

- The `anime` dataframe contains the data related to the anime. 

### Metadata

- anime_id - myanimelist.net's unique id identifying an anime.
- name - full name of anime.
- genre - comma separated list of genres for this anime.
- type - movie, TV, OVA, etc.
- episodes - how many episodes in this show. (1 if movie).
- rating - average rating out of 10 for this anime.
- members - number of community members that are in this anime's "group".

In [None]:
basic_summary(anime)

## Rating

- The `rating` dataframe contains the raw data of how each user rated each anime.
- The rating score ranges from 1 to 10, and a value of -1 marks an anime the user watched but did not rate.

### Metadata

- user_id - non identifiable randomly generated user id.
- anime_id - the anime that this user has rated.
- rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).

In [None]:
basic_summary(rating)

# 1. Popularity-based recommendation

![img](https://images.unsplash.com/photo-1583258292688-d0213dc5a3a8?ixid=MnwxMjA3fDB8MHxzZWFyY2h8NXx8bWFya2V0fGVufDB8fDB8fA%3D%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=60)

**Introduction**

For any machine learning problem, we need a **baseline model or method** as a reference to judge whether our approach is good or not.

Our machine learning prediction or sophisticated analysis should, at least, beat the baseline performance.

For a recommendation system, we can build a simple baseline with **popular item recommendation**.

**To define the popularity of the item**

IMDB uses a metric called the `weighted rating` to score each movie.

Here is the formula:
```
(WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C 
```
where:
- R = average rating for the anime (`avg_rating`).
- v = number of votes for the anime (`vote_count`).
- m = minimum votes required to be listed (IMDB uses this for its Top 250; here it is the 70th percentile of the vote counts).
- C = the average rating across the whole dataset.
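
To make the formula concrete, here is a small worked example. The function name `imdb_weighted_rating` and all the numbers are hypothetical, chosen only to show how a title with few votes is pulled toward the dataset average `C`.

```python
def imdb_weighted_rating(v, m, R, C):
    """IMDB-style weighted rating: shrink R toward C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only.
C = 6.5   # assumed dataset-wide average rating
m = 100   # assumed vote threshold

# 100 votes at 9.0: half of the weight goes to the dataset average.
few_votes = imdb_weighted_rating(v=100, m=m, R=9.0, C=C)

# 9,900 votes at 8.5: almost all of the weight stays on the title's own rating.
many_votes = imdb_weighted_rating(v=9900, m=m, R=8.5, C=C)

print(round(few_votes, 2), round(many_votes, 2))  # 7.75 8.48
```

Even though the first title has the higher raw average, the second one ranks higher because its rating is backed by far more votes.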

**Drawback**
- It's not personalized. All the users will get the same exact list of popularity based recommendation.

**Actions**

- For new users, if we don't have any information about them, we can provide a list ranked by `vote_count` or `weighted_rating` as a best guess.
- In the real world, this is what you see in a section like "Popular on Netflix".

**Reference**
- https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV?ref_=helpms_helpart_inline#

In [None]:
def weighted_rating(v,m,R,C):
    '''
    Calculate the weighted rating
    
    Args:
    v -> number of votes for each anime (pd.Series)
    m -> minimum votes required to be classified as popular (float)
    R -> average rating for each anime (pd.Series)
    C -> average rating for the whole dataset (float)
    
    Returns:
    pd.Series
    '''
    return ( (v / (v + m)) * R) + ( (m / (v + m)) * C )

def assign_popular_based_score(rating):
    '''
    Assign a popularity score based on the IMDB weighted rating.
    
    Args:
    rating -> pd.DataFrame contains ['user_id', 'anime_id', 'rating'] for each user.
    
    Returns
    popular_anime -> pd.DataFrame contains anime name and IMDB weighted score.
    '''
    
    # pre processing: drop the -1 placeholder (watched but not rated)
    filter_rating = rating[rating['rating'] != -1]
    vote_count = filter_rating.groupby('anime_id',as_index=False).agg({'user_id':'count', 'rating':'mean'})
    vote_count.columns = ['anime_id','vote_count', 'avg_rating']
    
    # calculate input parameters
    C = np.mean(vote_count['avg_rating'])
    m = np.percentile(vote_count['vote_count'], 70)
    # .copy() so that the new column is added to an independent frame, not to a slice
    vote_count = vote_count[vote_count['vote_count'] >= m].copy()
    R = vote_count['avg_rating']
    v = vote_count['vote_count']
    vote_count['weighted_rating'] = weighted_rating(v,m,R,C)
    
    # post processing (uses the global `anime` dataframe to look up the names)
    vote_count = vote_count.merge(anime[['anime_id','name']],on=['anime_id'],how='left')
    vote_count = vote_count.drop('anime_id', axis=1)
    popular_anime = vote_count.loc[:,['name', 'vote_count', 'avg_rating', 'weighted_rating']]
    
    return popular_anime

In [None]:
popular_anime = assign_popular_based_score(rating)

## Popularity based on the number of votes count

In [None]:
sns.barplot(data = popular_anime.sort_values('vote_count',ascending=False).head(10),
            x = 'vote_count', y = 'name', palette='mako');
sns.despine()

## Popularity based on the weighted score

In [None]:
sns.barplot(data = popular_anime.sort_values('weighted_rating',ascending=False).head(10),
            x = 'weighted_rating', y = 'name', palette = 'mako');
sns.despine()

# 2. Content-based recommendation

**Introduction**

For example, if a person has liked the movie “Inception”, then this algorithm will recommend movies that fall under the same genre.

Here we improve the recommendation by introducing other features of the content into our engine.

It is an improvement over the popularity-based recommendation we mentioned earlier.

Now, a customer who reads, watches, or likes specific products gets recommendations based on the products they interacted with in the past.

![img](https://i.ibb.co/S5GWr1r/Content-recommendation.png)

> Consider the example of Netflix. They save all the information related to each user in a vector form. This vector contains the past behavior of the user, i.e. the movies liked/disliked by the user and the ratings given by them. This vector is known as the profile vector. All the information related to movies is stored in another vector called the item vector. Item vector contains the details of each movie, like genre, cast, director, etc. The content-based filtering algorithm finds **the cosine of the angle between the profile vector and item vector**, i.e. cosine similarity.



**Drawback**
- A major drawback of this algorithm is that it is limited to recommending items of the same type.
- It will **never recommend products unlike those the user has bought or liked** in the past. So if a user has watched or liked only action movies, the system will recommend only action movies.
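
The profile-vector idea and its drawback can be sketched on a toy example. Everything here is an assumption made for illustration: three hypothetical genre axes, three made-up items, and a profile built by averaging the vectors of the items the user liked (one simple choice among many).

```python
import numpy as np

# Hypothetical genre axes: [action, comedy, romance]
item_vectors = np.array([
    [1.0, 0.0, 0.0],   # item 0: pure action
    [1.0, 1.0, 0.0],   # item 1: action comedy
    [0.0, 0.0, 1.0],   # item 2: pure romance
])

# Assumed profile: the user liked items 0 and 1, so we average their vectors.
profile = item_vectors[[0, 1]].mean(axis=0)

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(profile, item) for item in item_vectors]
ranking = np.argsort(scores)[::-1]   # item indices, most similar first

print(ranking, np.round(scores, 3))
```

The romance item gets a similarity of exactly 0 because it shares no genre with the profile, which is the "same type only" limitation described above.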

**Reference**
- [Content based recommender system](https://towardsdatascience.com/content-based-recommender-systems-28a1dbd858f5)

In [None]:
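def top_k_similar_anime(anime_id, top_k, corr_mat, map_name):
    '''
    Return the names of the top_k items most similar to anime_id.
    The query item itself usually ranks first because its self-similarity is maximal.
    '''
    
    # undefined similarities (NaN) are sorted last by argsort and would win the ranking, so push them to the bottom
    scores = np.where(np.isnan(corr_mat[anime_id,:]), -np.inf, corr_mat[anime_id,:])
    
    # sort similarity in ascending order, take the last top_k indices, then reverse to descending order
    top_anime = scores.argsort()[-top_k:][::-1] 
    
    # convert the matrix index to anime name
    top_anime = [map_name[e] for e in top_anime] 

    return top_anime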

In [None]:
# extract the genre
genre = anime['genre'].str.split(",", expand=True)

# get all possible genres (missing values are dropped before collecting them)
all_genre = set()
for c in genre.columns:
    distinct_genre = genre[c].str.lower().str.strip().dropna().unique()
    all_genre.update(distinct_genre)

print(f"The number of possible genre is : {len(all_genre)}")

# create item-genre matrix
item_genre_mat = anime[['name','genre']].copy()
# normalise to ",genre a,genre b," so that each genre can be matched as a whole token
item_genre_mat['genre'] = (',' + item_genre_mat['genre'].fillna('').str.lower()
                           .str.replace(r'\s*,\s*', ',', regex=True).str.strip() + ',')
for g in tqdm(sorted(all_genre)):
    # literal match on ",g," avoids partial hits when one genre name contains another
    item_genre_mat[g] = item_genre_mat['genre'].str.contains(f',{g},', regex=False).astype(int)
    
item_genre_mat = item_genre_mat.drop(['genre'], axis=1)
item_genre_mat = item_genre_mat.set_index('name')

# compute similarity matix
from sklearn.metrics.pairwise import cosine_similarity

ind2name = {ind:name for ind,name in enumerate(item_genre_mat.index)}
name2ind = {v:k for k,v in ind2name.items()}
cosine_mat = cosine_similarity(item_genre_mat)

In [None]:
similar_anime = top_k_similar_anime(name2ind['Naruto'],
                                    top_k = 10,
                                    corr_mat = cosine_mat,
                                    map_name = ind2name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

In [None]:
similar_anime = top_k_similar_anime(name2ind['Death Note'],
                                    top_k = 10,
                                    corr_mat = cosine_mat,
                                    map_name = ind2name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

In [None]:
del cosine_mat
gc.collect()

# 3. Collaborative filtering

**Introduction**

> The collaborative filtering algorithm uses **“User Behavior”** for recommending items. This is one of the most commonly used algorithms in the industry as it is not dependent on any additional information. There are different types of collaborating filtering techniques. [comprehensive-guide-recommendation-engine-python](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/)[1]

There are 2 types of memory based collaborative filtering 

![img](https://predictivehacks.com/wp-content/uploads/2020/06/recommenders_systems.png) <br>
1. User based - The user-similarity matrix will consist of some distance metric that measures the similarity between any two pairs of users.
> This algorithm is useful when the number of users is less. Its **not effective when there are a large number of users** as it will take a lot of time to compute the similarity between all user pairs. This leads us to item-item collaborative filtering, which is effective when the number of users is more than the items being recommended. [1]
2. Item based - Likewise, the item-similarity matrix will measure the similarity between any two pairs of items.
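
As a minimal sketch of the two similarity matrices, assuming cosine similarity as the metric and a made-up 4 x 3 rating matrix (the helper `cosine_sim_matrix` is hypothetical, written with NumPy only):

```python
import numpy as np

def cosine_sim_matrix(x):
    """Pairwise cosine similarity between the rows of x (all-zero rows get similarity 0)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Toy user-item matrix: rows are 4 users, columns are 3 items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0],
    [4, 5, 0],
    [0, 1, 5],
    [1, 0, 4],
], dtype=float)

user_sim = cosine_sim_matrix(ratings)     # user-based: compares rows, shape (n_users, n_users)
item_sim = cosine_sim_matrix(ratings.T)   # item-based: compares columns, shape (n_items, n_items)

print(user_sim.shape, item_sim.shape)
print(np.round(user_sim[0], 3))           # how similar user 0 is to every user
```

The user-based matrix grows with the square of the number of users and the item-based one with the square of the number of items, which is why the larger of the two dimensions decides which variant is cheaper.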

**Drawback**
- What happens when a new user or a new item is added to the dataset? This is called a **cold start**, and there are two types.
    1. Visitor - Since there is no history for that user, the system does not know their preferences.
        - These can be approximated by what has been **popular recently overall or regionally**.
    2. Product - A new product has no interactions yet. The more interactions a product receives, the easier it is for our model to recommend it to the right user.
        - We can make use of **content-based filtering** to solve this problem. 

**References**
- [comprehensive-guide-recommendation-engine-python](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/)
- [intro-to-collaborative-filtering](https://www.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/)
- [intro-to-recommender-system-collaborative-filtering](https://towardsdatascience.com/intro-to-recommender-system-collaborative-filtering-64a238194a26)

In [None]:
# replace the rating -1 with 0 so that "watched but not rated" becomes an empty cell once the sparse matrix drops its zeros
collab_rating = rating.copy()
collab_rating['rating'] = collab_rating['rating'].replace(-1, 0)

n_users = collab_rating['user_id'].nunique()
n_animes = collab_rating['anime_id'].nunique()
print(f"Unique users : {n_users:,} \nUnique anime : {n_animes:,}")

In [None]:
# create ordered user_id, and anime_id
map_user_id = {v:int(i) for i,v in enumerate(sorted(collab_rating['user_id'].unique()))}
map_anime_id = {v:int(i) for i,v in enumerate(sorted(collab_rating['anime_id'].unique()))}

collab_rating['csr_user_id'] = collab_rating['user_id'].map(map_user_id)
collab_rating['csr_anime_id'] = collab_rating['anime_id'].map(map_anime_id)
collab_rating = collab_rating.merge(anime[['anime_id', 'name']], on='anime_id', how='left')

# create another mapping from the anime name to the new defined indexes
map_csr_anime_id_to_name = {ind:name for ind, name in zip(collab_rating['csr_anime_id'], collab_rating['name'])}
map_name_to_csr_anime_id = {name:ind for ind, name in map_csr_anime_id_to_name.items()}

In [None]:
from scipy.sparse import csr_matrix
from tqdm.notebook import tqdm

row = collab_rating['csr_user_id']
col = collab_rating['csr_anime_id']
data = collab_rating['rating']

mat = csr_matrix((data, (row, col)), shape=(n_users, n_animes))
mat.eliminate_zeros()

# share of user-item cells that hold a rating (density); sparsity is its complement
density = mat.nnz / (mat.shape[0] * mat.shape[1]) * 100

print(f'Density: {density:4.2f}%. This means that only {density:4.2f}% of the user-item cells have a rating (sparsity: {100 - density:4.2f}%).')

In [None]:
def train_test_split(mat, test_size = 0.2):
    
    train = mat.copy()
    
    test_row = []
    test_col = []
    test_data = []
    
    for user in tqdm(range(mat.shape[0])):
        
        # extract the csr_anime_id that has a rating > 0
        user_ratings = mat[user, :].nonzero()[1] 
        
        # random test label based on each user_ratings size.
        test_ratings = np.random.choice(user_ratings,
                                        size = int(test_size * len(user_ratings)), 
                                        replace = False)
        
        # because the changing of the csr_matrix is expensive, we store the data and create new csr_matrix instead.
        test_row.extend([user] * len(test_ratings))
        test_col.extend(list(test_ratings))
        test_data.extend(list(train[user, test_ratings].toarray()[0]))
        
        train[user, test_ratings] = 0
    
    # drop the explicit zeros left behind by the assignments above
    train.eliminate_zeros()
    
    test = csr_matrix((test_data, (test_row, test_col)), shape=(mat.shape[0], mat.shape[1]))
    test.eliminate_zeros()
    
    return train, test

In [None]:
train, test = train_test_split(mat)

## 3.1 Memory-based approach

**Introduction**

> The key difference of memory-based approach from the model-based techniques is that we are **not learning any parameter** using gradient descent (or any other optimization algorithm). The closest user or items are calculated only by using **Cosine similarity** or **Pearson correlation coefficients**, which are only based on arithmetic operations. [various-implementations-of-collaborative-filtering](https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0) 

> Memory-based methods use user rating historical data to compute the similarity between users or items. The idea behind these methods is to define a similarity measure between users or items, and find the most similar to recommend unseen items. [Building a memory based collaborative filtering recommender](https://towardsdatascience.com/how-does-collaborative-filtering-work-da56ea94e331)

**Implementation**
- Due to the size of the user-item matrix, it is impractical to compute the user similarity matrix with shape `n_users x n_users` or the anime similarity matrix with shape `n_animes x n_animes` on the current instance.
- Therefore, we reduce the number of rows for demonstration purposes.
- Another option is to project the matrix to a lower dimension with a technique such as `TruncatedSVD`.

**User-based**
- We find the group of similar users (the group size is arbitrary) based on `pearson correlation`, `cosine similarity`, or `k-nearest neighbours`.
- We average the rating of each item over the group of similar users.
- We rank the items by average rating in descending order and recommend to the target user the `top_k` anime they have not rated yet.
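
The three steps above can be sketched on a toy matrix. This is only an illustration under stated assumptions: cosine similarity as the metric, a made-up 4 x 4 rating matrix in which every user has at least one rating, and a hypothetical helper `recommend_user_based` that is not part of the notebook.

```python
import numpy as np

def recommend_user_based(ratings, user, n_neighbors=2, top_k=2):
    """Rank the items `user` has not rated by the average rating of the most similar users."""
    # Step 1: cosine similarity between the target user and every user.
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user])
    sims[user] = -np.inf                              # never pick the user as their own neighbour
    neighbors = np.argsort(sims)[::-1][:n_neighbors]

    # Step 2: average each item over the neighbours who actually rated it.
    neighbor_ratings = ratings[neighbors]
    n_rated = (neighbor_ratings > 0).sum(axis=0)
    avg = neighbor_ratings.sum(axis=0) / np.maximum(n_rated, 1)

    # Step 3: drop items the user already rated, then rank in descending order.
    avg[ratings[user] > 0] = -np.inf
    ranked = np.argsort(avg)[::-1][:top_k]
    return [int(i) for i in ranked if avg[i] > 0]

ratings = np.array([
    [5, 4, 0, 0],   # target user 0 has not rated items 2 and 3
    [5, 5, 4, 0],
    [4, 5, 5, 1],
    [0, 1, 0, 5],
], dtype=float)

print(recommend_user_based(ratings, user=0))  # [2, 3]
```

Users 1 and 2 are the closest neighbours of user 0, and they rate item 2 much higher than item 3, so item 2 comes first.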

**Item-based**

- We find the group of similar items based on `pearson correlation`, `cosine similarity`, or `k-nearest neighbours`.
- We select up to the `top_k` most similar items to recommend.
- Note that the items are **not similar in terms of content** (as in the content-based recommendation) but **similar in terms of the explicit ratings from user behavior** (the similarity between items in the user-item matrix).

**Drawback**
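- It does not scale well: the similarity matrix grows quadratically with the number of users or items, while the underlying data is very sparse.
- We need to rebuild the similarity matrix every time a new user arrives, which makes it hard to maintain and operationalize.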

**Actions**
- The user-based result can be shown in the front-end application as "Made for you", providing the list of recommended anime.
- The item-based result can be shown in the front-end application as "Because you like Naruto", providing the list of similar anime.

**Reference**
- [various-implementations-of-collaborative-filtering](https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0)
- [Building a memory based collaborative filtering recommender](https://towardsdatascience.com/how-does-collaborative-filtering-work-da56ea94e331)

## Compute similarity

- The full user-item matrix has a shape of roughly 75K x 11K. Computing the similarity on it in dense form does not fit in the memory of the current instance.
- So, for illustration purposes, we undersample the users and keep only 10% of them.
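
A quick back-of-the-envelope calculation shows why. The sketch below assumes dense `float64` storage (8 bytes per value) and the rounded sizes quoted above. The helper `dense_gib` is hypothetical.

```python
def dense_gib(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to hold a dense matrix of the given shape, in GiB."""
    return n_rows * n_cols * bytes_per_value / 1024**3

n_users, n_animes = 75_000, 11_000   # rounded sizes, not the exact counts

print(f"user-item   : {dense_gib(n_users, n_animes):5.1f} GiB")
print(f"user-user   : {dense_gib(n_users, n_users):5.1f} GiB")
print(f"anime-anime : {dense_gib(n_animes, n_animes):5.1f} GiB")
```

Under these assumptions the dense user-item matrix alone needs about 6 GiB and a user-user similarity matrix about 42 GiB, while an anime-anime matrix stays under 1 GiB.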

In [None]:
# we need to reduce the size of user-item matrix so that we can compute the similarity based on the raw value.

def under_sampling(mat, ratio):
    
    sample_inds = np.random.choice(mat.shape[0],
                    size = int(ratio * mat.shape[0]), 
                    replace = False)
    
    return mat[sample_inds,:]

### Pearson correlation

In [None]:
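small_train = under_sampling(train, ratio = .1).toarray()
# anime without any rating in the sample have zero variance, so their correlation is undefined (NaN);
# adding a constant epsilon cannot change that, so the NaNs are mapped to 0 (no correlation) instead
with np.errstate(invalid='ignore', divide='ignore'):
    item_corr_mat = np.nan_to_num(np.corrcoef(small_train.T), copy=False)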

In [None]:
similar_anime = top_k_similar_anime(map_name_to_csr_anime_id['Naruto'],
                                    top_k = 10,
                                    corr_mat = item_corr_mat,
                                    map_name = map_csr_anime_id_to_name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

In [None]:
similar_anime = top_k_similar_anime(map_name_to_csr_anime_id['Death Note'],
                                    top_k = 10,
                                    corr_mat = item_corr_mat,
                                    map_name = map_csr_anime_id_to_name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

**Summary**

- With data of this size and limited computing resources, this approach is hard to scale in the real world.
- Because we have to sample the data down, we cannot leverage all the patterns inside it.

In [None]:
del small_train, item_corr_mat
gc.collect()

## 3.2 Model-based approach

**Introduction**

> Model-based CF uses machine learning algorithms to predict users’ rating of unrated items. There are many model-based CF algorithms, the most commonly used are matrix factorization models such as to applying a SVD to reconstruct the rating matrix, latent Dirichlet allocation or Markov decision process based models. [Building a memory based collaborative filtering recommender](https://towardsdatascience.com/how-does-collaborative-filtering-work-da56ea94e331)


**Type**

1. Matrix factorization based
    - TruncatedSVD
    - Funk Matrix Factorization (SVD-like algorithm) (Surprise)
      > Note that, in Funk MF **no singular value decomposition is applied**, it is a SVD-like machine learning model. [Wiki/Matrix_factorization](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems))
    - Probabilistic Matrix Factorization (fastai)
    - Non negative Matrix Factorization (Surprise)
2. Deep learning based
    - Embedding (fastai)
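
To illustrate why Funk MF is called "SVD-like" even though no SVD is computed, here is a minimal NumPy sketch that learns the factors by stochastic gradient descent on the observed ratings only. It is not Surprise's implementation: it omits the bias terms that Surprise's `SVD` includes, and the function names, hyperparameters, and toy ratings are all assumptions.

```python
import numpy as np

def funk_mf(triples, n_users, n_items, n_factors=2, lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Fit user and item factors by SGD on observed (user, item, rating) triples only."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))   # item factors
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - P[u] @ Q[i]                  # error on this single observed rating
            p_u = P[u].copy()                      # keep the old value for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

def rmse(triples, P, Q):
    """Root mean squared error of the factor model on the given triples."""
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in triples])))

# Toy data: 4 users, 3 items, only 8 of the 12 cells are observed.
triples = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5),
           (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]

P0, Q0 = funk_mf(triples, n_users=4, n_items=3, n_epochs=0)   # untrained factors
P, Q = funk_mf(triples, n_users=4, n_items=3)

print(f"RMSE before training: {rmse(triples, P0, Q0):.3f}")
print(f"RMSE after training : {rmse(triples, P, Q):.3f}")
```

Unlike TruncatedSVD, the unobserved cells never enter the loss, so missing ratings are not treated as zeros.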
    
**Implementation**

**Item-based**

- We can extract the `user_features` and `anime_features` with a matrix factorization technique. The purpose is to recover the underlying latent matrices that generate the user-item interaction matrix.
- After that, we take the `anime_features`, calculate the `pearson correlation`, `cosine similarity`, or `k-nearest neighbours` between items, and pick the `top_k` most similar items to recommend.
- Note that the items are **not similar in terms of content** (as in the content-based recommendation) and **not similar in terms of explicit user behavior** (as in the memory-based CF recommendation), but in terms of **latent factors** (underlying factors that we can't interpret explicitly) obtained from the matrix decomposition.

**User-based**
- We can extract the `user_features` and `anime_features` with a matrix factorization technique. The purpose is to recover the underlying latent matrices that generate the user-item interaction matrix.
- We find the group of similar users (the group size is arbitrary) based on `pearson correlation`, `cosine similarity`, or `k-nearest neighbours`.
- We average the rating of each item over the group of similar users.
- We rank the items by average rating in descending order and recommend to the target user the top-k anime they have not rated yet.

**References**
- [Building a memory based collaborative filtering recommender](https://towardsdatascience.com/how-does-collaborative-filtering-work-da56ea94e331)
- [Wiki/Matrix_factorization](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems))



## 3.2.1 Matrix factorization - TruncatedSVD (sklearn)

### TruncatedSVD

![img](https://www.researchgate.net/profile/Jun-Xu-67/publication/321344494/figure/fig1/AS:702109309751298@1544407312766/Diagram-of-matrix-factorization.png)

> Truncated SVD shares similarity with PCA while SVD is produced from the data matrix and the factorization of PCA is generated from the covariance matrix. Unlike regular SVDs, truncated SVD produces a factorization where the number of columns can be specified for a number of truncation. [recommender-system-singular-value-decomposition-svd-truncated-svd](https://towardsdatascience.com/recommender-system-singular-value-decomposition-svd-truncated-svd-97096338f361)

**Reference**
- [recommender-system-singular-value-decomposition-svd-truncated-svd](https://towardsdatascience.com/recommender-system-singular-value-decomposition-svd-truncated-svd-97096338f361)

In [None]:
from sklearn.decomposition import TruncatedSVD

epsilon = 1e-9
n_latent_factors = 10

anime_svd = TruncatedSVD(n_components = n_latent_factors)
anime_features = anime_svd.fit_transform(train.transpose()) + epsilon

user_svd = TruncatedSVD(n_components = n_latent_factors)
user_features = user_svd.fit_transform(train) + epsilon

print(f"anime_features shape : {anime_features.shape}\nuser_feature shape : {user_features.shape}")

## Item-based Collaborative filtering

### Pearson correlation

In [None]:
corr_mat = np.corrcoef(anime_features)

In [None]:
similar_anime = top_k_similar_anime(map_name_to_csr_anime_id['Naruto'],
                                    top_k = 10,
                                    corr_mat = corr_mat,
                                    map_name = map_csr_anime_id_to_name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

In [None]:
similar_anime = top_k_similar_anime(map_name_to_csr_anime_id['Death Note'],
                                    top_k = 10,
                                    corr_mat = corr_mat,
                                    map_name = map_csr_anime_id_to_name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

### Cosine similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_mat = cosine_similarity(anime_features)

In [None]:
similar_anime = top_k_similar_anime(map_name_to_csr_anime_id['Naruto'],
                                    top_k = 10,
                                    corr_mat = cosine_mat,
                                    map_name = map_csr_anime_id_to_name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

In [None]:
similar_anime = top_k_similar_anime(map_name_to_csr_anime_id['Death Note'],
                                    top_k = 10,
                                    corr_mat = cosine_mat,
                                    map_name = map_csr_anime_id_to_name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

**Summary**

With this method, we compute the similarity on a fixed number of latent factors.
- The recommendation is now based on latent factors that we cannot explain directly.
- Mathematically speaking, they are the top latent factors that minimize the loss between the actual rating and the reconstructed rating.
- The feature matrix is much smaller than the one used in the memory-based approach. However, it is still not a good enough approach: the similarity matrix keeps growing as the number of users or items grows over time, and it will eventually hit the limits of the computing resources as well.
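
The link between the number of latent factors and the reconstruction loss can be checked on a toy matrix. This sketch uses a dense `np.linalg.svd` and keeps the top `k` singular triplets, rather than sklearn's `TruncatedSVD`. The helper name and the ratings are made up, and unrated cells are treated as zeros, as they are when a sparse matrix is decomposed directly.

```python
import numpy as np

def truncated_reconstruction(mat, k):
    """Rank-k approximation of mat built from its top-k singular triplets."""
    U, s, Vt = np.linalg.svd(mat, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

errors = {}
for k in (1, 2, 4):
    # Frobenius norm of the difference between actual and reconstructed ratings
    errors[k] = float(np.linalg.norm(ratings - truncated_reconstruction(ratings, k)))
    print(f"k={k}: reconstruction error = {errors[k]:.3f}")
```

The error can only go down as `k` grows and it vanishes at full rank, so the number of latent factors trades compression against fidelity.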

In [None]:
del user_features, anime_features
gc.collect()

## 3.2.2 Matrix factorization - Funk MF (SVD-like algorithm in surprise)

In [None]:
from surprise import SVD, accuracy
from surprise import Dataset, Reader
# aliased so that it does not shadow the sparse-matrix train_test_split defined earlier
from surprise.model_selection import train_test_split as surprise_train_test_split

def pred2dict(predictions):
    '''Group the predictions by user: {user_id: [(anime_id, predicted_rating), ...]}'''
    
    rec_dict = defaultdict(list)
    for user_id, anime_id, actual_rating, pred_rating, _ in tqdm(predictions):
        rec_dict[user_id].append((anime_id, pred_rating))        
        
    return rec_dict

def get_top_k_recommendation(rec_dict, user_id, top_k, animeid2name):
    
    pred_ratings = rec_dict[user_id]
    pred_ratings = sorted(pred_ratings, key=lambda x: x[1], reverse=True) # sort in descending order of pred_rating
    pred_ratings = pred_ratings[:top_k]
    recs = [animeid2name[e[0]] for e in pred_ratings]
    
    return recs

# keep explicit ratings only: the 0s stand for "watched but not rated" and fall outside the 1-10 scale
explicit_rating = collab_rating.loc[collab_rating['rating'] > 0, ['user_id','anime_id','rating']]

reader = Reader(rating_scale=(1,10))
data = Dataset.load_from_df(explicit_rating, reader)
train, test = surprise_train_test_split(data, test_size=.2, random_state=42)

algo = SVD(random_state = 42)
algo.fit(train)
pred = algo.test(test)
accuracy.rmse(pred)

## Prediction - Funk MF (SVD-like algorithm in surprise)

In [None]:
animeid2name = {ind:name for ind,name in zip(collab_rating['anime_id'], collab_rating['name'])}

rec_dict = pred2dict(pred)

### Item-based recommendation

In [None]:
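# the rows of algo.qi follow Surprise's inner item ids (assigned in order of appearance in the trainset),
# not csr_anime_id, so we build dedicated name <-> inner id mappings
trainset = algo.trainset
map_inner_iid_to_name = {i: animeid2name[trainset.to_raw_iid(i)] for i in trainset.all_items()}
map_name_to_inner_iid = {name: i for i, name in map_inner_iid_to_name.items()}

corr_mat = np.corrcoef(algo.qi)

In [None]:
similar_anime = top_k_similar_anime(map_name_to_inner_iid['Naruto'],
                                    top_k = 10,
                                    corr_mat = corr_mat,
                                    map_name = map_inner_iid_to_name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]

In [None]:
similar_anime = top_k_similar_anime(map_name_to_inner_iid['Death Note'],
                                    top_k = 10,
                                    corr_mat = corr_mat,
                                    map_name = map_inner_iid_to_name)

anime.loc[anime['name'].isin(similar_anime), ['name','genre']]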

### User-based recommendation

In [None]:
# Example rating of user#3
collab_rating[ (collab_rating['user_id'] == '3') & (collab_rating['rating'] > 0)]\
.sort_values('rating',ascending=False)\
.head(10)

In [None]:
# Recommendation for user#3
recs = get_top_k_recommendation(rec_dict, '3', 10, animeid2name)
anime.loc[anime['name'].isin(recs), ['name','genre']]

**Summary**

- Here we use the Funk MF algorithm to learn the latent factor matrices, which lets us build both user-based and item-based recommendations.
- We also randomly hold out 20% of the ratings (not of the users) as a test set for validation. Note that `rec_dict` is built from the test-set predictions, so the list above ranks only the held-out anime of each user.

In [None]:
del corr_mat
gc.collect()

# What to do next?

- Hybrid MF
- Deep learning MF
- Evaluation