In [61]:
## Import block
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import numpy as np
import pandas as pd

import math
from math import sqrt

from sklearn.metrics.pairwise import cosine_similarity

# ignore warnings
import warnings
warnings.filterwarnings('ignore')


# Recommendation System 

A recommendation system is a machine learning application that uses data about users and items to suggest new content based on each user's preferences. It aims to maximize the usefulness of the recommended items. Such systems are widely used in practice; well-known examples include the YouTube and Netflix recommendation algorithms. There are two basic types of recommendation system: **content-based filtering** and **collaborative filtering**. The former recommends a new item based on the similarity between items and the user's own history, while the latter finds other users who show similar preferences to the current user and suggests items those users consumed. For example, content-based filtering can recommend an action movie to a user who watched and liked another action movie, whereas collaborative filtering finds another user who watched and liked that same action movie and recommends what that user went on to watch. In this project, both approaches are treated as unsupervised learning: there is no "true" class or answer to compare the predictions against.


In this project, I will use the **anime recommendation database** from Kaggle to build two recommendation systems and compare their results to identify the advantages and disadvantages of each method. As someone who enjoys watching anime, I also see this project as an opportunity to get anime recommendations to watch during my break.

The **anime recommendation database** includes an **anime.csv** file with each anime's unique **anime id**, **name**, **genre**, **type**, **number of episodes**, **average rating**, and the **number of members in the anime's group** on myanimelist.net, one of the best-known social networking and cataloging communities for anime and manga. 
There is also a separate **rating.csv** file that records individual users' ratings of different anime, with the variables **user_id**, **anime_id**, and **rating**. In this file, the rating is recorded as **-1** when the user watched the anime but did not rate it. This file is used in the collaborative filtering approach.

Dataset: https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database

## Content-based Filtering

### 1. Explore & Preprocess Data


In [62]:
anime = pd.read_csv("anime.csv", sep = ",")
rating = pd.read_csv("rating.csv", sep = ",")

In [63]:
anime.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


In [64]:
print("Anime set shape:", anime.shape)

Anime set shape: (12294, 7)


There are 12294 different anime with 7 variables. 

In [65]:
# explore null values
print("Null values in Anime:\n" , anime.isnull().sum())
index = anime["episodes"] == "Unknown"
print("Unknown episodes:", anime[index].shape[0])

Null values in Anime:
 anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64
Unknown episodes: 340


In [66]:
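# drop observations with NA or "Unknown" values
anime = anime.dropna()
# filter on the current frame so the boolean mask stays aligned after dropna
anime = anime[anime["episodes"] != "Unknown"]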
# check if all null values are dropped
print("Null values:\n",anime.isnull().sum())
print("\nAnime set shape after dropping NA values:", anime.shape)


Null values:
 anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

Anime set shape after dropping NA values: (11830, 7)


In [67]:
# convert rating, members, and episodes to numeric (float) values
anime["rating"] = anime["rating"].astype(float)
anime["members"] = anime["members"].astype(float)
anime["episodes"] = anime["episodes"].astype(float)

# reset the index
anime = anime.reset_index(drop=True)

We can filter the content by calculating the similarity of the genre, type, number of episodes, rating, and number of members of each anime. For a more robust representation of these traits, we can weight the rating by the number of members in the community. This reduces the bias of ratings that are based on only a few members, and gives a Bayesian-style estimate of the rating. 

In this project, I'm using the formula attributed to IMDb, an online database of information about films, TV series, video games, and streaming content: 

**weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C**

**R** = average rating for the movie

**v** = number of votes for the movie

**m** = minimum votes required to be listed in the Top 250 

**C** = the mean vote across the whole report

I treat the number of members as the number of votes and set the minimum number of members, m, to 10000. This is approximately the 76th percentile cutoff, meaning that an anime needs more members than about 76% of the anime in the list before its own rating R carries more weight than the global mean C.


Source: https://www.quora.com/How-does-IMDbs-rating-system-work

In [68]:
# function for calculating the weighted rating
def weighted_rating(data, m, C):
    wr = (data["members"]/(data["members"]+m))*data["rating"] + (m/(data["members"]+m))*C
    return wr

In [69]:
# test the function
weighted = weighted_rating(anime, 10000, anime.rating.mean())
weighted.shape
type(weighted)

pandas.core.series.Series
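
To see what the weighting does, here is a small sanity check on made-up numbers. The titles, ratings, member counts, and the global mean `C = 6.0` are all hypothetical and are not taken from the dataset. The function is repeated so that the snippet runs on its own.

```python
import pandas as pd

def weighted_rating(data, m, C):
    # v / (v + m) weights the item's own rating; m / (v + m) weights the global mean C
    v = data["members"]
    return (v / (v + m)) * data["rating"] + (m / (v + m)) * C

# hypothetical titles: identical raw rating, very different audience size
toy = pd.DataFrame({
    "name": ["popular title", "niche title"],
    "rating": [9.0, 9.0],
    "members": [90000, 100],
})

m, C = 10000, 6.0  # C is an assumed global mean, for illustration only
toy["weighted_rating"] = weighted_rating(toy, m, C)
print(toy)
```

Both titles have a raw rating of 9.0, but the popular one keeps a weighted rating of about 8.7 while the niche one is pulled down to just above the assumed mean of 6.0.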

In [70]:
# add the weighted rating as a new column (vectorized, so no row-wise apply is needed)
anime["weighted_rating"] = weighted_rating(anime, 10000, anime["rating"].mean())
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,weighted_rating
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1.0,9.37,200630.0,9.233011
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.26,793665.0,9.225466
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.25,114262.0,9.027455
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.17,673572.0,9.130715
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.16,151266.0,8.994101


We can then drop the rating and members columns, because the new weighted_rating column takes both into account.

In [71]:
# drop the rating and members columns
anime.drop(["rating", "members"], axis=1, inplace=True)
anime.head(20)


Unnamed: 0,anime_id,name,genre,type,episodes,weighted_rating
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1.0,9.233011
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.225466
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.027455
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.130715
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,8.994101
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10.0,8.892103
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148.0,9.069306
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110.0,8.820474
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1.0,8.783113
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13.0,8.821841


Since genre and type are categorical variables, we can create one column per category and use binary (one-hot style) encoding, marking the categories that apply to an anime as 1 and all others as 0.

In [73]:
# append one binary indicator column per genre and per type
anime_concat = pd.concat([anime, anime["genre"].str.get_dummies(sep=','), 
                     anime["type"].str.get_dummies()], axis=1)
anime_concat.head()

Unnamed: 0,anime_id,name,genre,type,episodes,weighted_rating,Adventure,Cars,Comedy,Dementia,...,Supernatural,Thriller,Vampire,Yaoi,Movie,Music,ONA,OVA,Special,TV
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1.0,9.233011,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.225466,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.027455,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.130715,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,8.994101,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
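
One detail is worth checking here: the genre strings separate entries with a comma followed by a space, and `str.get_dummies` does not strip whitespace. The sketch below uses two made-up genre strings to compare `sep=","` with `sep=", "`. With the bare comma, a genre lands in a different column depending on whether it appears first in the string. This would be consistent with the output above, where the displayed `Supernatural` column is 0 for "Kimi no Na wa." even though Supernatural is one of its listed genres. Whether to change the separator is a judgment call, since it would change the number of features and the similarity scores that follow.

```python
import pandas as pd

# two made-up genre strings in the same "comma + space" style as the dataset
genres = pd.Series(["Drama, Romance", "Action, Drama"])

# a bare comma keeps the leading space, so " Drama" and "Drama" become separate columns
loose = genres.str.get_dummies(sep=",")

# splitting on comma + space yields exactly one column per genre
tight = genres.str.get_dummies(sep=", ")

print(list(loose.columns))
print(list(tight.columns))
```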


In [74]:
# keep only the feature columns (episodes onward) used to compute the similarity
anime_features = anime_concat.iloc[:, 4:].copy()
print(anime_features.shape)
anime_features.head()

(11830, 90)


Unnamed: 0,episodes,weighted_rating,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,...,Supernatural,Thriller,Vampire,Yaoi,Movie,Music,ONA,OVA,Special,TV
0,1.0,9.233011,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,64.0,9.225466,1,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,1
2,51.0,9.027455,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,24.0,9.130715,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,51.0,8.994101,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### 2. Generate and Fit a Model 

We then calculate the cosine similarity to create a matrix that shows how similar the items are:

**Cosine similarity =  ( A . B ) / ( ||A|| * ||B|| )**

where **A** and **B** are two vectors, **(A . B)** is the dot product of A and B, and **||A|| * ||B||** is the product of the lengths of the two vectors. This is the same as the cosine of the angle between vectors A and B. 
The output lies in the range [-1, 1], where -1 means that the two vectors point in exactly opposite directions and 1 means that they point in the same direction. 0 means that the vectors are orthogonal. Because every feature here is non-negative, the values in our matrix fall between 0 and 1. 

About cosine similarity: https://www.geeksforgeeks.org/cosine-similarity/
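
As a minimal sketch of the formula, the snippet below computes the cosine similarity by hand with NumPy on three made-up vectors, then builds the full pairwise matrix by normalising each row to unit length. The vectors and the helper name `cosine` are illustrative assumptions; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    # dot product divided by the product of the two vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
c = np.array([2.0, 0.0, 2.0])  # same direction as a, twice the length

print(cosine(a, b))
print(cosine(a, c))

# pairwise matrix: normalise each row to unit length, then take all dot products
X = np.vstack([a, b, c])
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X_unit @ X_unit.T
print(sim.round(3))
```

Up to floating-point rounding, `a` and `b` score 0.5 and `a` and `c` score 1.0, which shows that only the direction of a vector matters, not its length. The resulting matrix is symmetric with ones on the diagonal.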

In [75]:
cos_sim = cosine_similarity(anime_features.values, anime_features.values)
cos_sim

array([[1.        , 0.24128676, 0.27115064, ..., 0.86177518, 0.9492321 ,
        0.96494682],
       [0.24128676, 1.        , 0.99794015, ..., 0.63139999, 0.28554635,
        0.28531688],
       [0.27115064, 0.99794015, 1.        , ..., 0.65482031, 0.31513737,
        0.31491293],
       ...,
       [0.86177518, 0.63139999, 0.65482031, ..., 1.        , 0.92317918,
        0.90374453],
       [0.9492321 , 0.28554635, 0.31513737, ..., 0.92317918, 1.        ,
        0.97767333],
       [0.96494682, 0.28531688, 0.31491293, ..., 0.90374453, 0.97767333,
        1.        ]])

In [76]:
cos_sim.shape

(11830, 11830)

Note that the generated matrix is 11830 x 11830, a square matrix whose size is the number of anime in the feature data. This is because each row holds the cosine similarity of one anime to every anime in the data. For example, the first row starts with 1 because its first entry compares the first anime with itself. The rest of the first row stores the cosine similarity of the first anime to each of the other anime. This process is repeated for each anime, which is why the second element in the first row is the same as the first element in the second row: one is the cosine similarity of the first anime to the second, the other is that of the second anime to the first, and the two are identical.

### 3. Give Recommendations

We can use this matrix to build a recommendation function that, given the name of an anime, recommends the anime with the highest cosine similarity values in the row corresponding to that anime.

In [77]:
# get the index of anime with names
anime_index = pd.Series(anime.index, index=anime.name)
anime_index

name
Kimi no Na wa.                                            0
Fullmetal Alchemist: Brotherhood                          1
Gintama°                                                  2
Steins;Gate                                               3
Gintama&#039;                                             4
                                                      ...  
Toushindai My Lover: Minami tai Mecha-Minami          11825
Under World                                           11826
Violence Gekiga David no Hoshi                        11827
Violence Gekiga Shin David no Hoshi: Inma Densetsu    11828
Yasuji no Pornorama: Yacchimae!!                      11829
Length: 11830, dtype: int64

In [78]:
# recommendation function
def recommend(anime, anime_index, anime_name, similarity=cos_sim):
    # look up the row index of the requested anime
    ind = anime_index[anime_name]

    # pair every anime's index with its similarity to the requested anime
    sim_scores = list(enumerate(similarity[ind]))

    # sort the anime by similarity score, highest first
    sim_scores = sorted(sim_scores, reverse=True, key=lambda x: x[1])

    # keep the top 6: normally the requested anime itself plus its 5 most similar anime
    sim_scores = sim_scores[0:6]

    # extract the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # return the requested anime followed by the 5 most similar anime
    result = anime[["name", "genre", "weighted_rating", "episodes", "type"]].iloc[anime_indices]
    return result

In [79]:
# recommend
recommend(anime, anime_index, "Kimi no Na wa.")


Unnamed: 0,name,genre,weighted_rating,episodes,type
0,Kimi no Na wa.,"Drama, Romance, School, Supernatural",9.233011,1.0,Movie
207,Kokoro ga Sakebitagatterunda.,"Drama, Romance, School",8.056491,1.0,Movie
1487,Harmonie,"Drama, School, Supernatural",7.254712,1.0,Movie
1950,Air Movie,"Drama, Romance, Supernatural",7.222889,1.0,Movie
60,Hotarubi no Mori e,"Drama, Romance, Shoujo, Supernatural",8.507541,1.0,Movie
891,Momo e no Tegami,"Drama, Supernatural",7.4603,1.0,Movie


Notice that the 5 closest anime are listed after the anime that I entered, "Kimi no Na wa.". 

Looking at the genre, weighted_rating, episodes, and type columns, the recommended anime are clearly very similar to "Kimi no Na wa.": the genres and ratings are close, and the number of episodes and the type are identical. It therefore makes sense that the algorithm recommended these items. 
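
The query anime appears first in the table because its similarity with itself is 1. To see the lookup in isolation, here is a minimal sketch on a made-up 4 x 4 similarity matrix. The matrix, the item names, and the helper `top_k_similar` are hypothetical. Unlike `recommend`, this variant filters out the query item before taking the top k.

```python
import numpy as np

# hypothetical similarity matrix (symmetric, ones on the diagonal)
names = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.9, 0.2, 0.6],
    [0.9, 1.0, 0.3, 0.5],
    [0.2, 0.3, 1.0, 0.1],
    [0.6, 0.5, 0.1, 1.0],
])

def top_k_similar(sim, query, k):
    # order all items by decreasing similarity to the query item
    order = np.argsort(-sim[query])
    # drop the query itself, then keep the k best remaining items
    order = order[order != query]
    return order[:k]

best = top_k_similar(sim, query=0, k=2)
print([names[i] for i in best])
```

For item "A", the two most similar items are "B" and then "D".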

Below are some more examples.

In [80]:
recommend(anime, anime_index, "Kuroko no Basket")

Unnamed: 0,name,genre,weighted_rating,episodes,type
121,Kuroko no Basket,"Comedy, School, Shounen, Sports",8.403287,25.0,TV
72,Kuroko no Basket 2nd Season,"Comedy, School, Shounen, Sports",8.497284,25.0,TV
58,Kuroko no Basket 3rd Season,"Comedy, School, Shounen, Sports",8.510225,25.0,TV
43,Haikyuu!!,"Comedy, Drama, School, Shounen, Sports",8.605453,25.0,TV
14,Haikyuu!! Second Season,"Comedy, Drama, School, Shounen, Sports",8.800848,25.0,TV
507,Sakigake!! Cromartie Koukou,"Comedy, School, Shounen",7.821254,26.0,TV


In [81]:
recommend(anime, anime_index, "Fullmetal Alchemist: Brotherhood")

Unnamed: 0,name,genre,weighted_rating,episodes,type
1,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",9.225466,64.0,TV
199,Fullmetal Alchemist,"Action, Adventure, Comedy, Drama, Fantasy, Mag...",8.299767,51.0,TV
2459,Digimon Frontier,"Action, Adventure, Comedy, Drama, Fantasy, Sho...",7.159879,50.0,TV
2820,Dragon Quest: Abel Yuusha Densetsu,"Action, Adventure, Fantasy, Shounen",6.600271,43.0,TV
3349,Gyouten Ningen Batsealer,"Action, Adventure, Fantasy, Magic",6.494961,52.0,TV
111,Hunter x Hunter,"Action, Adventure, Shounen, Super Power",8.36679,62.0,TV


Reference: https://www.kaggle.com/code/alsojmc/movie-recommender-systems/notebook

## Collaborative Filtering


### Idea

Unlike content-based filtering, collaborative filtering recommends items based on the similarity between users, not between items. Here, we take the ratings given by different users and try to find the underlying relationships between them by performing dimension reduction on the data. The low-dimensional reconstruction fills in each user's missing ratings with predictions that draw on the rating patterns of similar users. We can then recommend, for each user, the items with the highest filled-in ratings.

The quality of the recommendations will be judged by contextual analysis. We can look at the genres of the anime that a user originally rated (genre being a variable that is likely to influence the rating) and see whether a pattern emerges. If it does, that common genre can be read as a latent preference of the user that influences the ratings. We can then check whether the genres of the three recommended anime fit this preference, which helps us decide whether the recommendation is reasonable.


### 1. Explore & Preprocess Data

In [82]:
anime2 = pd.read_csv("anime.csv", sep = ",")
rating = pd.read_csv("rating.csv", sep = ",")

In [83]:
anime2.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


In [84]:
rating.head(10)

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1
5,1,355,-1
6,1,356,-1
7,1,442,-1
8,1,487,-1
9,1,846,-1


Note that a rating of -1 means that the user watched the anime but has not rated it.

In [85]:
print("Anime Shape: " ,anime2.shape )
print("Rating Shape: " ,rating.shape )

Anime Shape:  (12294, 7)
Rating Shape:  (7813737, 3)


In [86]:
print("Null values in Anime:\n" ,anime2.isnull().sum())
print("Null values in Rating:\n ",rating.isnull().sum())

Null values in Anime:
 anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64
Null values in Rating:
  user_id     0
anime_id    0
rating      0
dtype: int64


In [87]:
#merge anime and rating dataset 
anime_merged = anime2.merge(rating,on="anime_id",suffixes= ["", "_user"])
anime_merged


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,user_id,rating_user
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,99,5
1,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,152,10
2,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,244,10
3,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,271,10
4,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,278,-1
...,...,...,...,...,...,...,...,...,...
7813722,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175,39532,-1
7813723,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175,48766,-1
7813724,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175,60365,4
7813725,26081,Yasuji no Pornorama: Yacchimae!!,Hentai,Movie,1,5.46,142,27364,-1


Because there is a lot of rating data from the users, I don't think it is necessary to impute the -1 ratings with some average rating value. For computation speed and cleaner data, I will drop all the -1 ratings, as well as the rows with null values in the merged dataset. 

In [88]:
# drop the observations with no rating (-1 = watched but not rated)
index = anime_merged["rating_user"] == -1
anime_merged = anime_merged[~index]

anime_merged.shape

(6337239, 9)

In [89]:
# drop all NAs
anime_feature = anime_merged.dropna(axis = 0) 
anime_feature.isnull().sum()
anime_feature.shape

(6337146, 9)

This is still a lot. 

Therefore, I will only treat a user's ratings as valuable input if the user has completed at least 200 ratings. Some users have rated only one or two anime, which can't really be considered a meaningful record: such a rating might have been an accidental click or an attempt at review bombing.


In [90]:
# keep only the ratings from users who have at least 200 ratings
counts = anime_feature["user_id"].value_counts()
counts = counts[counts >= 200]
index = anime_feature["user_id"].isin(counts.index)
anime_feature = anime_feature[index]
anime_feature.shape

(3179693, 9)

In [91]:
# pivot the data so that each row represents each user and each column represents each anime rating
wide_user_anime = anime_feature.pivot_table(index="user_id", columns="anime_id", values="rating_user")
print(wide_user_anime.shape)
wide_user_anime


(8713, 9785)


user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,34238,34239,34240,34252,34283,34324,34325,34349,34367,34475
5,,,8.0,,,6.0,,6.0,6.0,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
17,,,7.0,,,,,,,10.0,...,,,8.0,,,,,,,
38,,,,,,,,,,,...,,,,,,,,,,
43,10.0,,,,,7.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73476,,,,,,,,,,9.0,...,,,,,,,,,,
73499,9.0,,9.0,,,10.0,,,,,...,,,,,,,,,,
73502,,,,9.0,,,10.0,,,,...,,,,,,,,,,
73503,9.0,7.0,9.0,,,,,,,,...,,,,,,,,,,


In [92]:
# to numpy array
wide_user_anime_np = wide_user_anime.to_numpy()
wide_user_anime_np[:10,:10]

array([[nan, nan,  8., nan, nan,  6., nan,  6.,  6., nan],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan,  7., nan, nan, nan, nan, nan, nan, 10.],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [10., nan, nan, nan, nan,  7., nan, nan, nan, nan],
       [10., nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan,  9., nan, nan, nan],
       [ 9., nan,  9., nan, nan, nan, nan, nan,  9., 10.],
       [10.,  9., nan, nan, nan, nan, nan, nan, nan, 10.],
       [ 9.,  9.,  7.,  8., nan, nan,  9., nan, nan,  9.]])

In [93]:
# create a binary matrix: 1 = missing (not yet rated), 0 = already rated
# multiplying predictions by this mask keeps only the anime that a user hasn't rated
binary_matrix = wide_user_anime.isna().to_numpy().astype(float)
binary_matrix[:10,:10]

array([[1., 1., 0., 1., 1., 0., 1., 0., 0., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 0., 1., 1., 1., 1., 1., 1., 0.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1., 0., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
       [0., 1., 0., 1., 1., 1., 1., 1., 0., 0.],
       [0., 0., 1., 1., 1., 1., 1., 1., 1., 0.],
       [0., 0., 0., 0., 1., 1., 0., 1., 1., 0.]])
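
The pivot and the mask can be illustrated on a tiny, made-up ratings table. The user ids, anime ids, and ratings below are hypothetical. Note that `pivot_table` aggregates duplicate user-anime pairs with the mean by default, and that any pair without a rating becomes NaN.

```python
import pandas as pd

# hypothetical long-format ratings (one row per user-anime pair)
long = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "anime_id": [10, 20, 10, 30],
    "rating_user": [8, 6, 9, 7],
})

# one row per user, one column per anime; unrated pairs become NaN
wide = long.pivot_table(index="user_id", columns="anime_id", values="rating_user")
print(wide)

# 1 where the user has NOT rated the anime, 0 where a rating exists
unrated = wide.isna().to_numpy().astype(float)
print(unrated)

# share of empty cells, a quick measure of how sparse the matrix is
print(wide.isna().to_numpy().mean())
```

Even in this toy case 5 of the 9 cells are empty. With thousands of users and anime the share of empty cells is far higher, which is why sparsity matters for the choice of method.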

### 2. Perform SVD

To apply dimension reduction to the data for this recommendation algorithm, I need to choose a tool: PCA or SVD. I chose SVD rather than PCA because the merged data of user ratings turned out to be quite sparse: pivoting the data into the wide format, where each row represents a user and each column represents an anime, leaves a lot of NaN values. SVD is commonly used on sparse rating data, and it places fewer requirements on the input than PCA, in that we don't have to standardize the data or keep every component. Keeping only the largest few singular values (a truncated SVD) gives a compact low-rank approximation, and computing only those components can save a lot of time on large matrices.

After performing SVD, we obtain the lower-dimensional representation of the data by multiplying the three truncated components U, s, and Vt. The result is a matrix with the same size as the original data in which all the NaN positions are filled with low-rank predictions. I will take just the newly filled ratings of a few users and select three new recommendations for each of them by taking the three highest predicted ratings among the anime they had not originally rated.
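
The low-rank reconstruction described above can be sketched on a small, made-up matrix. The values in `M` and the choice `k = 1` are assumptions for illustration only. The last lines check a known property of the truncated SVD: the Frobenius-norm error of the rank-k approximation equals the norm of the singular values that were dropped.

```python
import numpy as np

# hypothetical centered utility matrix: 4 users x 3 items
M = np.array([
    [ 3.0,  2.0, -1.0],
    [ 2.0,  3.0, -2.0],
    [-2.0, -1.0,  3.0],
    [-3.0, -2.0,  2.0],
])

# thin SVD: M = U @ diag(s) @ Vt, singular values sorted in decreasing order
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 1  # number of singular values (latent factors) to keep
M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Frobenius error of the rank-k approximation vs. norm of the dropped singular values
err = np.linalg.norm(M - M_k)
print(M_k.round(2))
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```

`M_k` has the same shape as `M` but rank `k`, so every entry, including entries that were only placeholders in the input, receives a value driven by the dominant rating pattern.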


In [94]:
# function for performing svd
def SVD(wide_data, binary_data):

    # 1. center the data around a neutral rating:
    # replace nan entries with 5.0 (an assumed neutral rating, not the computed mean)
    temp_wide_data = wide_data.fillna(5.0)
    # get a centered utility matrix by subtracting 5.0 from all the elements
    utility_matrix = temp_wide_data - 5.0

    # 2. compute singular value decomposition
    U, s, Vt = np.linalg.svd(utility_matrix.to_numpy(), full_matrices=False)

    # 3. keep the k largest singular values to get a low-rank approximation of the data
    k = 3
    low_d = np.dot(U[:, :k] * s[:k], Vt[:k, :])

    # 4. more refinement
    # restore the original rating scale by adding 5.0 back
    low_d = low_d + 5.0
    # convert np to pd
    low_d = pd.DataFrame(low_d, index=wide_data.index, columns=wide_data.columns)
    # screen the dimension-reduced matrix with the binary matrix passed in:
    # the prediction for any anime that a user has already rated becomes 0
    low_final = low_d * binary_data

    return low_final

In [57]:
rating_prediction = SVD(wide_user_anime, binary_matrix)
print("done")

done


In [95]:
rating_prediction

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,34238,34239,34240,34252,34283,34324,34325,34349,34367,34475
5,6.507016,5.913695,0.000000,5.069315,4.906203,0.000000,5.524807,0.000000,0.000000,6.032539,...,5.014951,5.000888,5.236587,4.998965,5.001660,5.000996,5.001348,5.000529,4.999614,5.001378
7,5.219586,4.983640,5.321254,5.027164,5.037238,5.235078,5.044505,5.055272,5.111842,5.001937,...,5.006304,4.999934,5.183262,5.001589,5.001944,5.000570,5.003878,5.000876,5.000398,5.000241
17,7.328183,6.136068,0.000000,5.151006,4.955130,5.187281,5.759948,5.004167,5.167864,0.000000,...,5.030165,5.000986,0.000000,5.001620,5.005668,5.002271,5.008788,5.002322,5.000251,5.002142
38,6.388416,5.633134,5.623618,5.031370,4.956532,5.089655,5.402408,4.988948,5.069614,5.801560,...,5.021946,5.000664,5.469990,5.001424,5.004061,5.001742,5.006735,5.001456,5.000227,5.001597
43,0.000000,4.987652,5.293236,4.977363,5.018935,0.000000,5.026159,5.041550,5.087538,5.074158,...,5.011110,5.000047,5.293617,5.002053,5.002773,5.001015,5.005558,5.001018,5.000472,5.000623
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73476,6.897401,6.016191,5.997651,5.199528,4.975982,5.129457,5.694873,5.008370,5.154743,0.000000,...,5.018371,5.000739,5.355075,5.000349,5.003350,5.001238,5.004391,5.001635,5.000009,5.001303
73499,0.000000,6.358629,0.000000,5.628500,5.151297,0.000000,6.168168,5.206334,5.589048,6.064164,...,5.010099,5.000229,5.217917,5.001009,5.003754,5.000266,5.004142,5.003236,5.000390,5.000049
73502,7.971117,6.695724,7.772492,0.000000,5.223562,5.957273,0.000000,5.278341,5.770152,6.167679,...,5.000396,5.000044,4.983689,4.999881,5.002413,4.999204,5.000562,5.003549,5.000238,4.999141
73503,0.000000,0.000000,0.000000,5.235869,4.997072,5.152513,5.671336,5.024158,5.174761,5.932229,...,5.013036,5.000582,5.239057,4.999983,5.002470,5.000773,5.002754,5.001450,4.999962,5.000873


### 3. Give & Evaluate Recommendation

Using the dimension-reduced data from the previous section, we can now write a recommendation function that, given a user id, returns the 3 anime with the highest predicted ratings for that user.
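
As a minimal sketch of this selection step, the snippet below picks the top three predictions for one user from made-up values. The anime ids, predicted ratings, and the `already_rated` flags are hypothetical. The notebook hides already-rated anime by multiplying predictions by a 0/1 mask; this variant swaps in NaN for those entries instead, so they can never be mistaken for a real prediction of 0.

```python
import pandas as pd

# hypothetical predicted ratings for one user, indexed by anime_id
predicted = pd.Series([7.2, 6.1, 8.4, 5.0, 6.9], index=[10, 20, 30, 40, 50])
# True where the user has already rated the anime
already_rated = pd.Series([False, False, True, False, False], index=predicted.index)

# hide the anime the user has already rated, then take the three highest predictions
candidates = predicted.where(~already_rated)
top3 = candidates.sort_values(ascending=False).head(3)
print(top3)
```

Anime 30 has the highest prediction but is skipped because the user already rated it, so the recommendations are anime 10, 50, and 20, in that order.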

In [97]:
# recommender function
def recommend2(anime, rating_prediction, user_id):

    # get the row of predicted ratings for the selected user
    current_user = rating_prediction.loc[user_id]

    # sort the predictions in descending order and keep the three highest
    top3 = current_user.sort_values(ascending=False).head(3)
    # anime ids and predicted ratings, both ordered from highest to lowest
    rec_anime_id = list(top3.index)
    rec_max = top3.to_numpy()

    # look up the three anime (rows keep their order in `anime`, not the rating order)
    recommendation = anime.loc[anime["anime_id"].isin(rec_anime_id)]
    for i in range(len(rec_anime_id)):
        print("{}. For anime: {} , the predicted rating from this user is: {:.2f}.".format(i+1,anime.loc[anime["anime_id"] == rec_anime_id[i]].name.iloc[0], rec_max[i]))

    return recommendation

# https://stackoverflow.com/questions/10337533/a-fast-way-to-find-the-largest-n-elements-in-an-numpy-array
# Formating string: https://pyformat.info/

In [98]:
# function that returns a user's 15 highest ratings, in descending order, for a given user_id
def user_rating(user_id):
    
    user_rating_previous = wide_user_anime.loc[user_id]
    s = user_rating_previous.dropna()
    s = s.to_frame()
    #print(s.shape)

    join = s.join(anime.set_index("anime_id"), on="anime_id")
    #print(join.shape)
    user_r = join.sort_values(by=[user_id], ascending = False)
    return user_r.head(15)

Let's make some recommendations for the users with user_id = 5, 7, and 73499.

In [99]:
user_rating(5)

anime_id,5,name,genre,type,episodes,weighted_rating
245,10.0,Great Teacher Onizuka,"Comedy, Drama, School, Shounen, Slice of Life",TV,43.0,8.687935
15335,10.0,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1.0,8.783113
9969,9.0,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,8.994101
28891,9.0,Haikyuu!! Second Season,"Comedy, Drama, School, Shounen, Sports",TV,25.0,8.800848
2418,9.0,Stranger: Mukou Hadan,"Action, Adventure, Historical, Samurai",Movie,1.0,8.286432
918,9.0,Gintama,"Action, Comedy, Historical, Parody, Samurai, S...",TV,201.0,8.966225
3702,9.0,Detroit Metal City,"Comedy, Music",OVA,12.0,8.105474
14719,9.0,JoJo no Kimyou na Bouken (TV),"Action, Adventure, Shounen, Supernatural, Vampire",TV,26.0,8.40883
9253,9.0,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.130715
32182,9.0,Mob Psycho 100,"Action, Comedy, Slice of Life, Supernatural",TV,12.0,8.448614


In [100]:
# Note: the returned recommendation table is not sorted by predicted rating
recommend2(anime,rating_prediction, 5)

1. For anime: Mononoke Hime , the predicted rating from this user is: 6.70.
2. For anime: Baccano! , the predicted rating from this user is: 6.64.
3. For anime: Howl no Ugoku Shiro , the predicted rating from this user is: 6.61.


Unnamed: 0,anime_id,name,genre,type,episodes,weighted_rating
24,164,Mononoke Hime,"Action, Adventure, Fantasy",Movie,1.0,8.743476
35,431,Howl no Ugoku Shiro,"Adventure, Drama, Fantasy, Romance",Movie,1.0,8.674281
83,2251,Baccano!,"Action, Comedy, Historical, Mystery, Seinen, S...",TV,13.0,8.484927
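
The table is out of ranking order because `isin` keeps the row order of the `anime` table. One way to get the rows in ranked order is to index by `anime_id` with `.loc`, which follows the order of the id list. This is a minimal sketch on a hypothetical three-row stand-in (`anime_demo`), not a change to `recommend2` itself.

```python
import pandas as pd

# hypothetical stand-in for the `anime` table, in catalog order
anime_demo = pd.DataFrame({
    "anime_id": [164, 431, 2251],
    "name": ["Mononoke Hime", "Howl no Ugoku Shiro", "Baccano!"],
})

# ids in ranked order (highest predicted rating first)
ranked_ids = [164, 2251, 431]

# isin keeps the catalog order of the rows ...
unordered = anime_demo.loc[anime_demo["anime_id"].isin(ranked_ids)]

# ... while label-based .loc on an anime_id index follows ranked_ids
ordered = anime_demo.set_index("anime_id").loc[ranked_ids].reset_index()

print(ordered)
```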


Judging by the genres of the anime that user 5 already rated highly, and by what I know about some of them, user 5 seems to enjoy Comedy anime that also have action and adventure elements. The recommendation therefore makes sense, because the recommended anime are also Action and Adventure focused. However, it is hard to judge from genre alone whether the recommendation is convincing, because this user watched and liked anime from quite a lot of genres.

In [101]:
user_rating(7)

anime_id,7,name,genre,type,episodes,weighted_rating
14653,10.0,Hayate no Gotoku! Can't Take My Eyes Off You,"Comedy, Harem, Parody, Shounen",TV,12.0,6.959067
16982,10.0,Hayate no Gotoku! Cuties,"Comedy, Harem, Parody, Romance, Shounen",TV,12.0,6.95455
30,10.0,Neon Genesis Evangelion,"Action, Dementia, Drama, Mecha, Psychological,...",TV,26.0,8.28111
20159,10.0,Pokemon: The Origin,"Action, Adventure, Comedy, Fantasy, Kids",Special,4.0,7.788313
19815,10.0,No Game No Life,"Adventure, Comedy, Ecchi, Fantasy, Game, Super...",TV,12.0,8.437574
4896,10.0,Umineko no Naku Koro ni,"Horror, Mystery, Psychological, Supernatural",TV,26.0,7.277501
2026,9.0,Hayate no Gotoku!,"Action, Comedy, Harem, Parody, Romance",TV,52.0,7.65967
2759,9.0,Evangelion: 1.0 You Are (Not) Alone,"Action, Mecha, Sci-Fi",Movie,1.0,8.125654
2213,9.0,Black Jack (TV),Drama,TV,61.0,7.248692
23421,9.0,Re:␣Hamatora,"Comedy, Mystery, Super Power",TV,12.0,7.361364


In [103]:
recommend2(anime,rating_prediction, 7)

1. For anime: Angel Beats! , the predicted rating from this user is: 7.13.
2. For anime: Toradora! , the predicted rating from this user is: 6.89.
3. For anime: Shingeki no Kyojin , the predicted rating from this user is: 6.89.


Unnamed: 0,anime_id,name,genre,type,episodes,weighted_rating
85,16498,Shingeki no Kyojin,"Action, Drama, Fantasy, Shounen, Super Power",TV,25.0,8.517319
130,4224,Toradora!,"Comedy, Romance, School, Slice of Life",TV,25.0,8.419473
158,6547,Angel Beats!,"Action, Comedy, Drama, School, Supernatural",TV,13.0,8.36382


User 7 also likes Comedy anime: 4 of the 6 anime that he/she rated 10 are Comedy. However, this user also enjoys Supernatural, Fantasy, and Sci-Fi anime, since several titles in the table fall into those categories. A recommendation spanning Action, Comedy, Fantasy, Super Power, and Supernatural therefore sounds fair. From what I know, "Shingeki no Kyojin" from the recommendation is similar to "Neon Genesis Evangelion / Evangelion" in that both are built around battle (war) scenes with many action components, so the recommendation at least sounds relevant.
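
The genre tally behind this reading can be made explicit with a small count. The sketch below copies the genre strings of the six anime that user 7 rated 10 from the table above; where the table truncates a genre list with "...", only the visible genres are used, so the counts are a lower bound. `top_rated_genres` and `genre_counts` are hypothetical names.

```python
from collections import Counter

# genre strings of the six anime user 7 rated 10, copied from the table above;
# lists that the table cuts off with "..." are left cut off here
top_rated_genres = [
    "Comedy, Harem, Parody, Shounen",
    "Comedy, Harem, Parody, Romance, Shounen",
    "Action, Dementia, Drama, Mecha, Psychological",
    "Action, Adventure, Comedy, Fantasy, Kids",
    "Adventure, Comedy, Ecchi, Fantasy, Game",
    "Horror, Mystery, Psychological, Supernatural",
]

# split each comma-separated string and count every genre once per anime
genre_counts = Counter(
    genre.strip()
    for genres in top_rated_genres
    for genre in genres.split(",")
)

print(genre_counts.most_common(5))
```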

In [104]:
user_rating(73499)

anime_id,73499,name,genre,type,episodes,weighted_rating
19363,10.0,Gin no Saji 2nd Season,"Comedy, School, Shounen, Slice of Life",TV,11.0,8.116544
11111,10.0,Another,"Horror, Mystery, School, Supernatural, Thriller",TV,12.0,7.85438
16918,10.0,Gin no Saji,"Comedy, School, Shounen, Slice of Life",TV,11.0,8.045723
8841,10.0,Kore wa Zombie Desu ka?,"Action, Comedy, Ecchi, Harem, Magic, Supernatural",TV,12.0,7.631234
18679,10.0,Kill la Kill,"Action, Comedy, School, Super Power",TV,24.0,8.196313
4224,10.0,Toradora!,"Comedy, Romance, School, Slice of Life",TV,25.0,8.419473
5341,10.0,Ookami to Koushinryou II,"Adventure, Fantasy, Historical, Romance",TV,12.0,8.370409
5258,10.0,Hajime no Ippo: New Challenger,"Comedy, Drama, Shounen, Sports",TV,26.0,8.521161
237,10.0,Eureka Seven,"Adventure, Drama, Mecha, Romance, Sci-Fi",TV,50.0,8.121667
5205,10.0,Kara no Kyoukai 7: Satsujin Kousatsu (Kou),"Action, Mystery, Romance, Supernatural, Thriller",Movie,1.0,8.372628


In [105]:
recommend2(anime,rating_prediction, 73499)

1. For anime: Death Note , the predicted rating from this user is: 9.39.
2. For anime: Darker than Black: Kuro no Keiyakusha , the predicted rating from this user is: 8.34.
3. For anime: Durarara!! , the predicted rating from this user is: 8.30.


Unnamed: 0,anime_id,name,genre,type,episodes,weighted_rating
40,1535,Death Note,"Mystery, Police, Psychological, Supernatural, ...",TV,37.0,8.688266
165,6746,Durarara!!,"Action, Mystery, Supernatural",TV,24.0,8.346538
250,2025,Darker than Black: Kuro no Keiyakusha,"Action, Mystery, Sci-Fi, Super Power",TV,25.0,8.210798


Lastly, user 73499 also likes Supernatural and Action anime, which is well reflected in the recommendation. From what I know, "Death Note" and "Durarara!!" from the recommendation give off a similar vibe to "Fullmetal Alchemist: Brotherhood" and "No Game No Life", in that they are all about risking one's life and using one's wits to solve mysteries or game tricks.
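
A rough way to put a number on this kind of genre match is the Jaccard similarity between genre sets (shared genres divided by all genres across both titles). The sketch below uses a few titles from the tables above; "Death Note" is left out because its genre list is truncated in the output. The helper names (`genre_set`, `jaccard`, `best_match`) are hypothetical, and this is only an illustration, not part of the recommendation method used in this notebook.

```python
def genre_set(genres):
    """Turn a comma-separated genre string into a set of genre names."""
    return {g.strip() for g in genres.split(",")}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

# a few anime that user 73499 rated 10 (genres copied from the table above)
liked = {
    "Another": "Horror, Mystery, School, Supernatural, Thriller",
    "Kore wa Zombie Desu ka?": "Action, Comedy, Ecchi, Harem, Magic, Supernatural",
    "Kill la Kill": "Action, Comedy, School, Super Power",
    "Kara no Kyoukai 7: Satsujin Kousatsu (Kou)": "Action, Mystery, Romance, Supernatural, Thriller",
}

# recommended anime whose genre lists are shown in full
recommended = {
    "Durarara!!": "Action, Mystery, Supernatural",
    "Darker than Black: Kuro no Keiyakusha": "Action, Mystery, Sci-Fi, Super Power",
}

# for each recommendation, find the liked anime with the most similar genre set
best_match = {}
for rec_name, rec_genres in recommended.items():
    scores = {name: jaccard(genre_set(rec_genres), genre_set(g))
              for name, g in liked.items()}
    closest = max(scores, key=scores.get)
    best_match[rec_name] = (closest, scores[closest])
    print("{} -> {} ({:.2f})".format(rec_name, closest, scores[closest]))
```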

In [106]:
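# take a small 10 x 10 subset of the user-anime rating matrix
new_wide = wide_user_anime.iloc[0:10,0:10]

# build a mask of the same shape: 1 where the rating is missing, 0 where the user already rated the anime
binary_matrix = new_wide.copy()
nan_inds = new_wide.isna()
rating_inds = new_wide.notna()
nan_inds = nan_inds.to_numpy()
rating_inds = rating_inds.to_numpy()
binary_matrix[nan_inds] = 1
binary_matrix[rating_inds] = 0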

binary_matrix

user_id / anime_id,1,5,6,7,8,15,16,17,18,19
5,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
17,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
38,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
43,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
46,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
123,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
129,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
139,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
160,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0
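
The same kind of mask (1 for a missing rating, 0 for an existing one) can be produced in one step with `isna().astype(float)`. The sketch below shows this on a tiny made-up rating matrix; `ratings_demo` and `unrated_mask` are hypothetical names.

```python
import numpy as np
import pandas as pd

# hypothetical 2 x 3 rating matrix; NaN means the user has not rated the anime
ratings_demo = pd.DataFrame(
    [[np.nan, 8.0, np.nan],
     [10.0, np.nan, 7.0]],
    index=pd.Index([5, 7], name="user_id"),
    columns=pd.Index([1, 6, 15], name="anime_id"),
)

# True/False from isna() becomes 1.0/0.0: 1.0 marks an unrated entry
unrated_mask = ratings_demo.isna().astype(float)

print(unrated_mask)
```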


In [113]:
# run SVD on the 10 x 10 subset and recommend for user 5
x = SVD(new_wide, binary_matrix)
recom = recommend2(anime, x, user_id=5)

# flag any recommended anime whose predicted rating for user 5 is exactly 0
# (taken here to mean that the anime was already rated by this user);
# x.loc[...] returns the values, whereas iterating over a DataFrame would yield column labels
yes_already_rated = bool((x.loc[5, recom["anime_id"].tolist()] == 0).any())


1. For anime: Monster , the predicted rating from this user is: 6.33.
2. For anime: Beet the Vandel Buster , the predicted rating from this user is: 5.00.
3. For anime: Cowboy Bebop , the predicted rating from this user is: 4.94.
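
An alternative to inspecting the predicted values is to check the original rating matrix directly: a recommended anime was already rated if the user's entry for it is not NaN. The sketch below illustrates this on made-up data; `ratings_demo` and `already_rated` are hypothetical names and not functions from this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical rating matrix; NaN means the user has not rated the anime
ratings_demo = pd.DataFrame(
    [[np.nan, 8.0, np.nan],
     [10.0, np.nan, 7.0]],
    index=pd.Index([5, 7], name="user_id"),
    columns=pd.Index([1, 6, 15], name="anime_id"),
)

def already_rated(ratings, user_id, recommended_ids):
    """Return the recommended anime ids that the user has already rated."""
    user_row = ratings.loc[user_id, recommended_ids]
    return user_row[user_row.notna()].index.tolist()

# user 5 has rated anime 6 but not anime 1
overlap = already_rated(ratings_demo, 5, [1, 6])
print(overlap)
```

An empty list means that none of the recommended anime were rated before.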
