# Introduction

If you have used YouTube, Netflix, HBO, or similar services, you have seen content recommended to you based on your viewing history.\
Recommendation systems built on machine learning help users find content that matches their interests.\
In this notebook, we will walk through a content-based recommender using simple NLP and a collaborative recommender using SVD.

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import svd
from sklearn.metrics.pairwise import cosine_similarity

# Data Analysis

First, we read two CSV files. Because of limited computing power, we load only the first 100,000 rows of `animelist.csv`. We also rename a column so that the two tables can be merged later.

We also drop rows with missing values, since the data will be fed into a model later.


In [2]:
anime_df = pd.read_csv('anime.csv')
anime_df = anime_df.rename(columns={'MAL_ID': 'anime_id'}) # rename the key column used for the merge below

animelist_df = pd.read_csv('animelist.csv', nrows=100000)

# filter out missing values
anime_df = anime_df.dropna()
animelist_df = animelist_df.dropna()

Here we merge the two tables by matching on `anime_id`. The merged table gives us more features to choose from as model inputs.

In [3]:
# left merge keeps every anime; dropna then removes rows that found no matching list entry
anime_df = anime_df.merge(animelist_df, on='anime_id', how='left')
anime_df = anime_df.dropna()
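
To see what the left merge followed by `dropna()` does, here is a minimal sketch on made-up data. The two toy frames and the `rating` column are assumptions for illustration only, not the contents of the real CSV files.

```python
import pandas as pd

# Toy stand-ins for the two tables (all values are made up)
anime = pd.DataFrame({"anime_id": [1, 5, 6], "Name": ["A", "B", "C"]})
ratings = pd.DataFrame({
    "user_id": [10, 11, 10],
    "anime_id": [1, 1, 6],
    "rating": [9, 7, 8],
})

# Left merge keeps every anime; anime 5 has no list entries,
# so its user_id and rating come back as NaN
left = anime.merge(ratings, on="anime_id", how="left")

# Dropping the NaN rows leaves only anime that appear in some user's list
left_clean = left.dropna()

# An inner merge reaches the same rows in one step
inner = anime.merge(ratings, on="anime_id", how="inner")

print(left)
print(left_clean)
```

Because the unmatched row introduces NaN, pandas stores `user_id` as a float after the left merge, which is one reason an explicit cast back to `int` can be needed afterwards.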

Let's make a copy of the dataframe for the SVD recommender so that we don't alter the original. This is good practice whenever we might need the original dataframe again.

In [4]:
# We will use this dataframe for the SVD recommender
anime_dfcopy = anime_df.copy()
# Keep only the columns we need. SVD becomes very slow with many features or a large dataset.
anime_dfcopy = anime_dfcopy[['anime_id', 'user_id', 'Score']]
# The Score column contains the string 'Unknown'; filter those rows out before modeling
anime_dfcopy = anime_dfcopy[anime_dfcopy['Score'] != 'Unknown']

# Convert column dtypes; the model only works with numeric values
anime_dfcopy['user_id'] = anime_dfcopy['user_id'].astype('int')
anime_dfcopy['Score'] = pd.to_numeric(anime_dfcopy['Score']).astype('int')

# Count the distinct users and animes that will form the matrix axes.
# Each user can appear on many rows, so unique() collapses the repeats.
# Note: these counts come from anime_df, i.e. before the 'Unknown' filter above.
users = anime_df['user_id'].unique()
animes = anime_df['anime_id'].unique()

print("Users: {}, animes: {}".format(len(users), len(animes)))
print(anime_dfcopy)

Users: 310, animes: 7916
        anime_id  user_id  Score
0              1        3      8
1              1        6      8
2              1        8      8
3              1       13      8
4              1       14      8
...          ...      ...    ...
109566     47398      148      6
109567     47398      208      6
109575     47614      322      6
109635     48456      311      6
109636     48456      322      6

[99161 rows x 3 columns]


## Recommendation system with SVD
A brief introduction to matrix factorization in recommendation systems: given a user-item matrix, we want to decompose it into a separate user matrix and item matrix.

user-by-item matrix = (user-by-k matrix) × (k-by-item matrix)\
Each of the k columns is a latent factor, and users and items can then be compared through these factors. For example, if the items are anime, a factor might capture a trade-off such as watch duration versus rating, or rating versus number of times watched.

The idea is that, given an input matrix A, we apply the following SVD formula:
$$A = U \Sigma V^{T}$$

A: $m \times n$ matrix that we want to decompose. In this case, m is the number of users (`user_id`) and n is the number of animes (`anime_id`)\
U: $m \times m$ matrix (left singular vectors)\
$\Sigma$: $m \times n$ rectangular diagonal matrix (holds the singular values)\
V: $n \times n$ matrix (right singular vectors)
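
The shapes listed above can be checked on a small example. This is a minimal sketch on a made-up 3 × 4 matrix; the variable names are chosen for illustration. Note that `numpy.linalg.svd` returns the singular values as a 1-D array and returns V already transposed.

```python
import numpy as np

# Small made-up matrix: 3 rows by 4 columns
A = np.array([
    [8.0, 0.0, 7.0, 0.0],
    [0.0, 6.0, 0.0, 9.0],
    [8.0, 6.0, 0.0, 0.0],
])

# U is m x m, s holds min(m, n) singular values, Vt is n x n (V transposed)
U, s, Vt = np.linalg.svd(A)

# Rebuild the rectangular diagonal matrix Sigma (m x n) from the 1-D array
m, n = A.shape
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

# Multiplying the three factors recovers A up to floating-point error
A_rebuilt = U @ Sigma @ Vt

print(U.shape, Sigma.shape, Vt.shape)
print(np.allclose(A, A_rebuilt))
```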

This notebook uses the SVD implementation from the numpy library, but to compute SVD from scratch we would follow these steps.
1. Multiply $A A^{T}$ to get a square matrix
2. Calculate the eigenvalues and eigenvectors of this square matrix. Take the square roots of the nonzero eigenvalues to build the $\Sigma$ matrix
3. Build the U matrix from the unit-length eigenvectors, one per column
4. Find each column of V through the formula $v_i = \frac{1}{\sigma_i} A^{T} u_i$
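
The four steps can be sketched in a few lines of numpy. This is an illustrative sketch on a made-up matrix, not the routine numpy uses internally; the tolerance `1e-10` for "nonzero" is an assumption.

```python
import numpy as np

# Small made-up matrix: 3 rows by 4 columns
A = np.array([
    [8.0, 0.0, 7.0, 0.0],
    [0.0, 6.0, 0.0, 9.0],
    [8.0, 6.0, 0.0, 0.0],
])

# Step 1: square symmetric matrix A A^T (m x m)
AAt = A @ A.T

# Step 2: eigh is meant for symmetric matrices; it returns unit eigenvectors
# with eigenvalues in ascending order, so reorder them to descending
eigvals, eigvecs = np.linalg.eigh(AAt)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the clearly nonzero eigenvalues; their square roots are the singular values
keep = eigvals > 1e-10
sigma = np.sqrt(eigvals[keep])

# Step 3: the columns of U are the unit eigenvectors
U = eigvecs[:, keep]

# Step 4: v_i = A^T u_i / sigma_i, computed for all columns at once
V = (A.T @ U) / sigma

A_rebuilt = U @ np.diag(sigma) @ V.T
print(np.allclose(A, A_rebuilt))
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))
```

The eigenvectors are only determined up to sign, so `U` and `V` here may differ in sign from what `numpy.linalg.svd` returns while still reconstructing the same matrix.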

Finally, after SVD, we will use cosine similarity to identify animes a user would be interested in.





In [5]:
# create a user-item matrix whose values are the Score associated with each anime
# note: you don't have to use the rating; view counts, frequency, duration, etc. would also work
# item -> anime_id, user -> user_id
user_col = 'user_id'
item_col = 'anime_id'
rate_col = 'Score'

mat = anime_dfcopy.pivot_table(
    index = user_col,
    columns = item_col,
    values = rate_col,
    fill_value = 0
)


In [6]:
mat

anime_id   1   5   6   7   8  15  16  17  18  19  ...  44192  44208  44248  45567  45598  45753  45999  47398  47614  48456
user_id
0          0   0   0   0   0   0   0   0   0   8  ...      0      0      0      0      0      0      0      0      0      0
1          0   0   0   0   0   0   0   0   0   8  ...      0      0      0      0      0      0      0      0      0      0
2          0   0   0   0   0   7   0   0   0   8  ...      0      0      0      0      0      0      0      0      0      0
3          8   0   0   0   0   0   0   0   8   0  ...      0      0      0      0      0      0      0      0      0      0
4          0   0   0   0   0   0   8   0   0   0  ...      0      0      0      0      0      0      0      0      0      0
...       ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
335        0   0   0   0   0   0   0   0   0   0  ...      0      0      0      0      0      0      0      0      0      0
336        8   8   0   0   0   0   0   0   0   0  ...      0      0      0      0      0      0      0      0      0      0
338        8   0   0   0   0   0   0   0   0   0  ...      0      0      0      0      0      0      0      0      0      0
339        0   0   0   0   0   0   0   0   0   0  ...      0      0      0      0      0      0      0      0      0      0


The value at location (i, j) of the matrix is the `Score` recorded for user $i$ and anime $j$, or 0 where user $i$ has no entry for anime $j$.
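
A minimal sketch of the same `pivot_table` call on a made-up table shows how long-format records become a user-item matrix. The toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up long-format records: one row per (user, anime) pair
toy = pd.DataFrame({
    "user_id": [0, 0, 1, 2],
    "anime_id": [1, 5, 5, 6],
    "Score": [8, 7, 9, 6],
})

# Users become rows, animes become columns,
# and (user, anime) pairs with no record are filled with 0
toy_mat = toy.pivot_table(
    index="user_id",
    columns="anime_id",
    values="Score",
    fill_value=0,
)

print(toy_mat)
# Label-based lookup: the value for user 1 and anime 5
print(toy_mat.loc[1, 5])
```

If the same (user, anime) pair appears more than once, `pivot_table` averages the values by default, since its default `aggfunc` is the mean.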

In [7]:
# numpy returns the right singular vectors already transposed, hence the name vt
u, sigma, vt = svd(mat.values)

In [8]:
# rows -> user_id, cols -> anime_id
mat.shape

(310, 7634)

In [9]:
def cosine_simi(a, b):
    """Calculate the cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_similarities(mat, id_):
    """
    params:
    mat -> a 2-D array for the user / item matrix after SVD
    id_ -> the column position of the user / item to find similarities for;
            it must be within the range of mat.shape[1]

    returns:
        a dictionary mapping column position to similarity,
        sorted by similarity in descending order
    """
    # compare the chosen column against every column of the matrix
    sim_dct = {col: cosine_simi(mat[:, id_], mat[:, col]) for col in range(mat.shape[1])}

    # sort the dictionary by similarity, highest first
    return dict(sorted(sim_dct.items(), key=lambda item: item[1], reverse=True))

def SVD_recommend(mat, id_, n_recs):
    """
    Get the top n_recs recommendations associated with an id.
    """
    lst = []

    # if the input id is not found, there is no point in calculating further;
    # compare against the column values, since `in` on a Series checks its index
    if id_ not in anime_df['anime_id'].values:
        return 'id not found'

    # display what the original input anime is
    input_anime = anime_df.loc[anime_df['anime_id'] == id_]
    if not input_anime.empty:
        print('Your input anime show is:', input_anime['Name'].iloc[0])

    # rank every column by similarity and skip the first entry (expected to be the input itself)
    sim_dct = get_similarities(mat, id_)
    similar_ids = list(sim_dct.keys())[1:n_recs+1]

    # look up the anime name for each id and add it to the list
    for similar_id in similar_ids:
        find_anime = anime_df.loc[anime_df['anime_id'] == similar_id]
        if not find_anime.empty:
            lst.append(find_anime['Name'].iloc[0])

    # convert the list into a column array to fit the later hybrid recommender
    arr = np.array(lst)
    arr = arr[..., None]

    return arr


In [30]:
SVD_recommend(u, 100, 10)

Your input anime show is: Shin Shirayuki-hime Densetsu Prétear


array([['Aishiteruze Baby★★'],
       ['Mobile Suit Gundam: The 08th MS Team'],
       ['Mai-HiME'],
       ['Gakuen Alice'],
       ['Full Moon wo Sagashite'],
       ['Air'],
       ['Akazukin Chacha'],
       ['Mahoromatic: Motto Utsukushii Mono'],
       ['Futakoi Alternative'],
       ['Hana yori Dango']], dtype='<U36')

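The recommender returns ten titles for the input anime. Treat the ranking with caution, though: `u` is the full 310 × 310 matrix of left singular vectors, whose columns are orthonormal, so the cosine similarity between any two distinct columns is zero up to rounding error, and `id_` is used as a column position rather than as an `anime_id` label.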
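
One way to make the lookup label-aware is sketched below, as an alternative rather than the notebook's own method. It assumes item vectors are taken from the top `k` rows of the transposed right-singular matrix, scaled by the singular values, and that the matrix columns are used to translate between `anime_id` labels and column positions. The function name `similar_items`, the parameter `k`, and the toy ratings are all hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up user-item matrix: rows are user_id, columns are anime_id labels
toy_mat = pd.DataFrame(
    [[8, 7, 0, 0],
     [9, 8, 0, 1],
     [0, 0, 6, 7],
     [1, 0, 7, 8]],
    index=[0, 1, 2, 3],
    columns=[1, 5, 100, 205],
)

def similar_items(mat_df, anime_id, n_recs, k=2):
    """Return the n_recs anime_id labels closest to anime_id in a k-factor space."""
    _, s, vt = np.linalg.svd(mat_df.to_numpy(dtype=float), full_matrices=False)
    # One row per anime: its coordinates on the top-k factors, scaled by singular value
    item_vecs = (vt[:k, :] * s[:k, None]).T
    # Translate the anime_id label into a column position
    pos = mat_df.columns.get_loc(anime_id)
    # Cosine similarity of every item against the query item (epsilon avoids 0/0)
    norms = np.linalg.norm(item_vecs, axis=1)
    sims = item_vecs @ item_vecs[pos] / (norms * norms[pos] + 1e-12)
    # Rank by similarity, drop the query item itself, map positions back to labels
    ranked = [p for p in np.argsort(-sims) if p != pos][:n_recs]
    return [mat_df.columns[p] for p in ranked]

print(similar_items(toy_mat, 100, 1))
```

In the toy data, anime 100 and 205 are rated by the same users, so they end up closest to each other. On the real matrix the call would look like `similar_items(mat, 100, 10)`, assuming anime 100 is still among the columns of `mat`, with `k` chosen by experiment.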

# Conclusion

In this notebook, we walked through an implementation of a collaborative recommender.

One downside of collaborative recommenders is that they cannot generate good predictions for new users or new items, which have few interactions (the cold-start problem).\
A newly released item simply has too little interaction data to compare against.

Another problem is that highly popular items are almost always recommended because of their high interaction counts.

Scalability is also a concern: as the dataset grows, the user-item matrix becomes higher-dimensional, which increases computation time, although more data may improve recommendation quality.
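
One common way to ease the cost is to avoid building the full square factors and to keep only the leading factors. The sketch below uses a made-up random matrix; its size and the choice `k = 10` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up ratings matrix: 50 users by 400 animes, scores from 0 to 10
A = rng.integers(0, 11, size=(50, 400)).astype(float)

# The full SVD builds a 400 x 400 V^T even though at most 50 singular values exist
_, _, vt_full = np.linalg.svd(A, full_matrices=True)

# The reduced form stops at min(m, n) = 50 rows and still reconstructs A exactly
u, s, vt = np.linalg.svd(A, full_matrices=False)

# Keeping only the top k factors shrinks the factors further,
# at the cost of an approximate rather than exact reconstruction
k = 10
A_k = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

print(vt_full.shape, vt.shape, A_k.shape)
```

For the matrix in this notebook, the reduced form would store a 310 × 7634 factor in place of a 7634 × 7634 one.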