In [98]:
## Import block
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import numpy as np
import pandas as pd

import math
from math import sqrt

from sklearn.metrics.pairwise import cosine_similarity


# Recommendation System 

A recommendation system is a machine learning method that uses data about users and items to suggest new content that matches a user's preferences. Its goal is to maximize the usefulness of the recommended items. Recommendation systems are widely used in practice; well-known examples include the YouTube and Netflix recommendation algorithms. There are two basic types of recommendation system: **content-based filtering** and **collaborative filtering**. The former recommends a new item based on the similarity between items and the user's own information, while the latter finds other users whose preferences resemble the current user's and suggests items those users consumed. For example, content-based filtering can recommend an action movie to a user who watched and liked another action movie, whereas collaborative filtering finds another user who watched and liked that same action movie and follows that user's path. In this project, both approaches are treated as unsupervised learning, since there is no "true" class or answer to compare the predictions against.


In this project, I will use the **anime recommendation database** from Kaggle to build two recommendation systems and compare their results to identify the advantages and disadvantages of each method. As someone who enjoys watching anime, I also see this project as an opportunity to get anime recommendations to watch during my break.

The **anime recommendation database** includes each anime's unique **anime id**, **name**, **genres**, **type**, **number of episodes**, **average rating**, and the **number of members in the anime's group** on myanimelist.net, one of the best-known social networking and cataloging communities for anime and manga.
There is also a separate **rating** dataset with user ids and the ratings those users gave. This separate dataset will be used in the collaborative filtering approach.

Dataset: https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database

## Content-based Filtering

### 1. Explore & Preprocess Data


In [99]:
anime = pd.read_csv("anime.csv", sep = ",")
rating = pd.read_csv("rating.csv", sep = ",")

In [100]:
anime.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


In [101]:
print("Anime set shape:", anime.shape)

Anime set shape: (12294, 7)


There are 12294 different anime with 7 variables. 

In [103]:
print("Null values in Anime:\n" , anime.isnull().sum())
index = anime["episodes"] == "Unknown"
print("Unknown episodes:", anime[index].shape[0])

Null values in Anime:
 anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64
Unknown episodes: 340


In [104]:
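# drop observations with missing values or an unknown episode count
anime = anime.dropna()
# filter on the current frame so the boolean mask stays aligned after dropna()
anime = anime[anime["episodes"] != "Unknown"]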
#check if all null values are dropped
print("Null values:\n",anime.isnull().sum())
print("\nAnime set shape after dropping NA values:", anime.shape)

Null values:
 anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

Anime set shape after dropping NA values: (11830, 7)




In [105]:
# convert rating, members, and episodes to numeric values
anime["rating"] = anime["rating"].astype(float)
anime["members"] = anime["members"].astype(float)
anime["episodes"] = anime["episodes"].astype(float)

We can filter the contents by calculating the similarity of the genre, type, number of episodes, rating, and members of two anime. For a more reliable representation of these traits, we can weight the rating by the number of members in the community. This reduces the bias that comes from ratings tending to drop as more people rate an anime, pulls the rating of an anime with few members toward the overall mean, and gives a Bayesian estimate of the rating.

In this project, I'm using the weighted rating formula used by IMDb, an online database of information about films, TV series, video games, and streaming content:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

R = average rating for the movie

v = number of votes for the movie

m = minimum votes required to be listed in the Top 250 

C = the mean vote across the whole report

I'm treating the number of members as the number of votes and setting the minimum number of members to 10000. This is approximately the 76th percentile, meaning that an anime needs more members than about 76% of the anime in the list to reach this minimum.


Source: https://www.quora.com/How-does-IMDbs-rating-system-work
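
To get a feel for how the formula behaves, here is a minimal sketch on single values. The helper name `weighted_rating_scalar` and all the numbers (including the global mean `C = 6.5`) are made up for illustration and are not taken from the dataset.

```python
def weighted_rating_scalar(R, v, m, C):
    """IMDb-style weighted rating for a single item."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical numbers, chosen only to illustrate the behaviour
C = 6.5      # assumed global mean rating
m = 10_000   # minimum number of votes (members)

popular = weighted_rating_scalar(R=9.0, v=500_000, m=m, C=C)
niche = weighted_rating_scalar(R=9.0, v=100, m=m, C=C)
at_threshold = weighted_rating_scalar(R=9.0, v=m, m=m, C=C)

print(round(popular, 3), round(niche, 3), round(at_threshold, 3))
```

With the same raw rating of 9.0, the item with many votes stays close to 9.0, the item with very few votes is pulled almost all the way to `C`, and an item with exactly `m` votes lands halfway between `R` and `C`.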

In [106]:
# function for calculating the weighted rating
def weighted_rating(data, m, C):
    wr = (data['members']/(data['members']+m))*data['rating'] + (m/(data['members']+m))*C
    return wr

In [107]:
# function testing
weighted_rating(anime, 10000, anime.rating.mean())

0        9.233011
1        9.225466
2        9.027455
3        9.130715
4        8.994101
           ...   
12289    6.436366
12290    6.444989
12291    6.450221
12292    6.458731
12293    6.470263
Length: 11830, dtype: float64

In [108]:
# combining the weighted_rating with the data
anime['weighted_rating'] = anime.apply(weighted_rating, axis=1, args=(10000,anime.rating.mean()))
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,weighted_rating
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1.0,9.37,200630.0,9.233011
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.26,793665.0,9.225466
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.25,114262.0,9.027455
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.17,673572.0,9.130715
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.16,151266.0,8.994101


Then, we can drop the rating and members columns because the new weighted_rating column takes both into account.

In [109]:
anime.drop(['rating', 'members'], axis=1, inplace=True)
anime.head()


Unnamed: 0,anime_id,name,genre,type,episodes,weighted_rating
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1.0,9.233011
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.225466
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.027455
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.130715
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,8.994101


Since genre and type are categorical variables, we can create a column for each category and use binary (one-hot) encoding, marking the categories that apply to an anime as 1 and all others as 0.

In [110]:
# one-hot encode genres and types, then concatenate them with the original columns
anime_concat= pd.concat([anime, anime['genre'].str.get_dummies(sep=','), 
                     anime['type'].str.get_dummies()], axis=1)
anime_concat.head()

Unnamed: 0,anime_id,name,genre,type,episodes,weighted_rating,Adventure,Cars,Comedy,Dementia,...,Supernatural,Thriller,Vampire,Yaoi,Movie,Music,ONA,OVA,Special,TV
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1.0,9.233011,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.225466,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.027455,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.130715,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,8.994101,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
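
One detail in the output above is worth a second look: Kimi no Na wa. has a 0 in the Supernatural column even though Supernatural is one of its genres. A likely cause is that the genre strings are separated by a comma followed by a space, so splitting on `','` alone leaves a leading space on every genre after the first. The sketch below uses a small made-up Series to show the effect and how `sep=', '` avoids it; it is an illustration, not a rerun of the cell above.

```python
import pandas as pd

# Hypothetical genre strings in the same "A, B" format as the dataset
genres = pd.Series(["Drama, Romance", "Action, Drama"])

# sep=',' keeps the space after each comma, so "Drama" and " Drama" become different columns
with_comma = genres.str.get_dummies(sep=",")

# sep=', ' splits on the full separator and gives one column per genre
with_comma_space = genres.str.get_dummies(sep=", ")

print(list(with_comma.columns))
print(list(with_comma_space.columns))
```

If the encoding is changed this way, the feature matrix and every similarity score computed from it would change as well, so the recommendations would need to be regenerated.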


In [111]:
anime_features = anime_concat.iloc[:, 4:].copy()
anime_features.head()

Unnamed: 0,episodes,weighted_rating,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,...,Supernatural,Thriller,Vampire,Yaoi,Movie,Music,ONA,OVA,Special,TV
0,1.0,9.233011,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,64.0,9.225466,1,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,1
2,51.0,9.027455,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,24.0,9.130715,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,51.0,8.994101,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


We then calculate the cosine similarity to create a matrix that shows how similar the items are.

In [114]:
cos_sim = cosine_similarity(anime_features.values, anime_features.values)
cos_sim

array([[1.        , 0.24128676, 0.27115064, ..., 0.86177518, 0.9492321 ,
        0.96494682],
       [0.24128676, 1.        , 0.99794015, ..., 0.63139999, 0.28554635,
        0.28531688],
       [0.27115064, 0.99794015, 1.        , ..., 0.65482031, 0.31513737,
        0.31491293],
       ...,
       [0.86177518, 0.63139999, 0.65482031, ..., 1.        , 0.92317918,
        0.90374453],
       [0.9492321 , 0.28554635, 0.31513737, ..., 0.92317918, 1.        ,
        0.97767333],
       [0.96494682, 0.28531688, 0.31491293, ..., 0.90374453, 0.97767333,
        1.        ]])

In [116]:
cos_sim.shape

(11830, 11830)
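
Cosine similarity is sensitive to the scale of each column. In the feature matrix above, the episode count can be in the tens or hundreds while the genre and type flags are only 0 or 1, so the episode count may dominate the score. The sketch below uses made-up rows and min-max scaling (my own choice for illustration, not something the notebook applies) to show how the ranking can flip once the columns are put on a comparable scale.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical rows: [episodes, weighted_rating, genre_a, genre_b]
features = np.array([
    [24.0, 9.1, 1, 0],   # 24 episodes, genre_a
    [24.0, 6.4, 0, 1],   # 24 episodes, genre_b, much lower rating
    [12.0, 9.0, 1, 0],   # 12 episodes, genre_a, rating close to the first row
])

raw = cosine_similarity(features)

# Min-max scale each column to [0, 1] so no single column dominates the angle
col_min = features.min(axis=0)
col_range = features.max(axis=0) - col_min
scaled = (features - col_min) / np.where(col_range == 0, 1, col_range)
rescaled = cosine_similarity(scaled)

print(raw.round(3))
print(rescaled.round(3))
```

On the raw features, the first row is closer to the second row (same episode count, different genre) than to the third; after scaling, the third row (same genre, similar rating) comes out as the closer match.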

In [118]:
# Map each title to its row position in `anime`, which is also its row in cos_sim.
# Positions are used instead of index labels because dropping rows left gaps in the labels.
anime_index = pd.Series(np.arange(len(anime)), index=anime["name"])
anime_index = anime_index[~anime_index.index.duplicated()]


In [126]:
def recommend(anime_name, similarity = cos_sim):
    pos = anime_index[anime_name]
    
    # Get the pairwise similarity scores of all anime with that anime
    sim_scores = list(enumerate(similarity[pos]))

    # Sort the anime based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the positions of the 10 most similar anime, skipping the anime itself
    anime_positions = [i for i, _ in sim_scores if i != pos][:10]

    # Return the top 10 most similar anime
    result = anime[['name', 'genre', 'weighted_rating']].iloc[anime_positions]
    return result

In [133]:
recommend("Kimi no Na wa.")

Unnamed: 0,name,genre,weighted_rating
208,Kokoro ga Sakebitagatterunda.,"Drama, Romance, School",8.056491
1494,Harmonie,"Drama, School, Supernatural",7.254712
1959,Air Movie,"Drama, Romance, Supernatural",7.222889
60,Hotarubi no Mori e,"Drama, Romance, Shoujo, Supernatural",8.507541
894,Momo e no Tegami,"Drama, Supernatural",7.4603
6119,Shisha no Sho,"Drama, Supernatural",6.468811
5697,Shiranpuri (Movie),"Drama, School",6.45983
10123,Samurai,"Drama, Romance",6.452721
1199,&quot;Bungaku Shoujo&quot; Movie,"Drama, Mystery, Romance, School",7.405343
2103,Clannad Movie,"Drama, Fantasy, Romance, School",7.270973


In [141]:
recommend("Gintama")

Unnamed: 0,name,genre,weighted_rating
175,Katekyo Hitman Reborn!,"Action, Comedy, Shounen, Super Power",8.299677
629,Lupin III: Part II,"Action, Adventure, Comedy, Shounen",7.237531
482,Prince of Tennis,"Action, Comedy, School, Shounen, Sports",7.880706
9896,Otoko Ippiki Gaki Daishou,"Action, Drama, Shounen",6.495736
1519,Yu☆Gi☆Oh! 5D&#039;s,"Action, Game, Shounen",7.338731
2565,Yu☆Gi☆Oh!: Duel Monsters GX,"Action, Comedy, Fantasy, Game, Shounen",7.141611
9895,Otoko Doahou! Koushien,"Action, Sports",6.485593
288,Fairy Tail,"Action, Adventure, Comedy, Fantasy, Magic, Sho...",8.190814
9803,Obocchama-kun,"Comedy, Parody",6.484485
577,Kindaichi Shounen no Jikenbo (TV),"Mystery, Shounen",7.169117


In [144]:
recommend("Steins;Gate")

Unnamed: 0,name,genre,weighted_rating
7446,Pacusi,Comedy,6.462201
10805,Yasamura Yasashi no Yasashii Sekai,Comedy,6.411847
54,Re:Zero kara Hajimeru Isekai Seikatsu,"Drama, Fantasy, Psychological, Thriller",8.581084
110,Shirobako,"Comedy, Drama",8.362183
144,Higurashi no Naku Koro ni Kai,"Mystery, Psychological, Supernatural, Thriller",8.32559
8135,Anime Document: München e no Michi,Sports,6.484735
2892,Papa to Odorou,Comedy,6.519775
541,Shiki,"Mystery, Supernatural, Thriller, Vampire",7.932343
680,Michiko to Hatchin,"Action, Adventure",7.747191
53,Rainbow: Nisha Rokubou no Shichinin,"Drama, Historical, Seinen, Thriller",8.495802


In [149]:
recommend("Fullmetal Alchemist: Brotherhood")

Unnamed: 0,name,genre,weighted_rating
200,Fullmetal Alchemist,"Action, Adventure, Comedy, Drama, Fantasy, Mag...",8.299767
2472,Digimon Frontier,"Action, Adventure, Comedy, Drama, Fantasy, Sho...",7.159879
2838,Dragon Quest: Abel Yuusha Densetsu,"Action, Adventure, Fantasy, Shounen",6.600271
3371,Gyouten Ningen Batsealer,"Action, Adventure, Fantasy, Magic",6.494961
112,Hunter x Hunter,"Action, Adventure, Shounen, Super Power",8.36679
5028,Kouya no Shounen Isamu,"Action, Adventure, Shounen",6.487248
5970,Bakugan Battle Brawlers: Mechtanium Surge,"Action, Adventure, Fantasy, Game, Shounen",6.394281
4248,Kuusou Kagaku Sekai Gulliver Boy,"Action, Adventure, Fantasy, Magic, Mecha",6.499076
5416,Getter Robo Go,"Action, Adventure, Mecha, Military, Shounen",6.483007
2829,Battle Spirits: Brave,"Action, Shounen",6.56308


Source: https://www.kaggle.com/code/alsojmc/movie-recommender-systems/notebook

## Collaborative Filtering


### Idea

I'm investigating a recommendation algorithm that recommends a product by comparing users and their ratings. We take the ratings given by different users and try to find the underlying relationships between them by performing dimension reduction on the data. The dimension reduction fills in each user's missing ratings by drawing on the information from the most similar users. Then, for each user, we can recommend products in order of the filled-in ratings, from highest to lowest.

To apply dimension reduction for this recommendation algorithm, I had to decide which tool to use: PCA or SVD. I chose SVD rather than PCA because my data was quite sparse. There were many NaN values when I pivoted the data into wide format, with each row representing a user and each column an anime. These empty entries made the data very sparse, so I chose SVD, which works better on sparse data. SVD also has fewer requirements than PCA, such as normalization and a full matrix. Computing on the full matrix takes a long time with data this large, so running time was also a factor in my decision.

After performing SVD, we can obtain a lower-dimensional representation of the data by multiplying the three truncated components U, s, and V. This yields a matrix of the same size as the original data, in which all the NaN entries are filled with lower-dimensional predictions. I will take just the newly filled ratings of two random users and select three new recommendations for each of them by taking the three highest predicted ratings among the anime they had not originally rated.

The quality of the recommendations will be assessed through contextual analysis. We can look at the genres of the anime that the two users originally rated (a variable that is likely to influence the rating) and see whether a pattern emerges. If it does, that common genre can be read as a latent preference of the user that influences their ratings. We can then check whether the genres of the three recommended anime are consistent with it, which helps us decide whether the recommendations are reasonable.
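
One way to make this genre comparison more concrete is to count how often each genre appears among a user's rated anime and how the user rated it on average. The sketch below does this on a tiny made-up table; the column name `user_rating` and the rows are hypothetical, and the `", "` separator is assumed from the genre strings shown in the data.

```python
import pandas as pd

# Hypothetical slice of one user's rated anime
rated = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Drama", "Drama, Sci-Fi", "Comedy, Drama"],
    "user_rating": [9.0, 8.0, 6.0],
})

# One row per (anime, genre) pair
exploded = rated.assign(genre=rated["genre"].str.split(", ")).explode("genre")

# How often each genre appears, and the user's mean rating within it
genre_profile = (
    exploded.groupby("genre")["user_rating"]
    .agg(["count", "mean"])
    .sort_values("count", ascending=False)
)
print(genre_profile)
```

A genre that is both frequent and highly rated in such a profile would be a natural candidate for the user's latent preference.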


### 1. Explore & Preprocess Data

In [178]:
anime = pd.read_csv("anime.csv", sep = ",")
rating = pd.read_csv("rating.csv", sep = ",")

In [179]:
anime.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


In [180]:
rating.head(10)

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1
5,1,355,-1
6,1,356,-1
7,1,442,-1
8,1,487,-1
9,1,846,-1


In [181]:
print("Anime Shape: " ,anime.shape )
print("Rating Shape: " ,rating.shape )

Anime Shape:  (12294, 7)
Rating Shape:  (7813737, 3)


In [182]:
print("Null values in Anime:\n" ,anime.isnull().sum())
print("Null values in Rating:\n ",rating.isnull().sum())

Null values in Anime:
 anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64
Null values in Rating:
  user_id     0
anime_id    0
rating      0
dtype: int64


In [200]:
#merge anime and rating dataset 
anime_merged = anime.merge(rating,on="anime_id",suffixes= ['', '_user'])
anime_merged


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,user_id,rating_user
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,99,5
1,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,152,10
2,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,244,10
3,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,271,10
4,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,278,-1
...,...,...,...,...,...,...,...,...,...
7813722,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175,39532,-1
7813723,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175,48766,-1
7813724,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175,60365,4
7813725,26081,Yasuji no Pornorama: Yacchimae!!,Hentai,Movie,1,5.46,142,27364,-1


In [205]:
# drop the observations with no rating (-1 means the user watched the anime but did not rate it)
index = anime_merged["rating_user"] == -1
anime_merged = anime_merged[~index]

anime_merged.shape

(6337239, 9)

In [209]:
# drop all NAs
anime_feature = anime_merged.dropna(axis = 0) 
anime_feature.isnull().sum()
anime_feature.shape

(6337146, 9)

Some users have rated only a single anime; even if they rated it 5, that is not a valuable record for recommendation. I therefore kept only users with at least 200 ratings. You can experiment with the threshold to get better results, but this value worked fine.

In [212]:
counts = anime_feature['user_id'].value_counts()
anime_feature = anime_feature[anime_feature['user_id'].isin(counts[counts >= 200].index)]
anime_feature.shape

(3179693, 9)

In [215]:
#pivot the data so that each row represents each user and each column represents each anime rating
wide_user_anime = anime_feature.pivot_table(index='user_id', columns='anime_id', values='rating_user')
print(wide_user_anime.shape)
wide_user_anime


(8713, 9785)


anime_id,1,5,6,7,8,15,16,17,18,19,...,34238,34239,34240,34252,34283,34324,34325,34349,34367,34475
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5,,,8.0,,,6.0,,6.0,6.0,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
17,,,7.0,,,,,,,10.0,...,,,8.0,,,,,,,
38,,,,,,,,,,,...,,,,,,,,,,
43,10.0,,,,,7.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73476,,,,,,,,,,9.0,...,,,,,,,,,,
73499,9.0,,9.0,,,10.0,,,,,...,,,,,,,,,,
73502,,,,9.0,,,10.0,,,,...,,,,,,,,,,
73503,9.0,7.0,9.0,,,,,,,,...,,,,,,,,,,


In [216]:
wide_user_anime_np = wide_user_anime.to_numpy()
wide_user_anime_np[:10,:10]

array([[nan, nan,  8., nan, nan,  6., nan,  6.,  6., nan],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan,  7., nan, nan, nan, nan, nan, nan, 10.],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [10., nan, nan, nan, nan,  7., nan, nan, nan, nan],
       [10., nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan,  9., nan, nan, nan],
       [ 9., nan,  9., nan, nan, nan, nan, nan,  9., 10.],
       [10.,  9., nan, nan, nan, nan, nan, nan, nan, 10.],
       [ 9.,  9.,  7.,  8., nan, nan,  9., nan, nan,  9.]])

In [217]:
#create binary matrix with 1 = NaN (not rated), 0 = rated
binary_matrix = wide_user_anime_np.copy()
nan_inds = wide_user_anime.isna()
rating_inds = wide_user_anime.notna()
nan_inds = nan_inds.to_numpy()
rating_inds = rating_inds.to_numpy()
binary_matrix[nan_inds] = 1
binary_matrix[rating_inds] = 0
binary_matrix[:10,:10]
#this allows us to only recommend the anime that users haven't rated

array([[1., 1., 0., 1., 1., 0., 1., 0., 0., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 0., 1., 1., 1., 1., 1., 1., 0.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1., 0., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
       [0., 1., 0., 1., 1., 1., 1., 1., 0., 0.],
       [0., 0., 1., 1., 1., 1., 1., 1., 1., 0.],
       [0., 0., 0., 0., 1., 1., 0., 1., 1., 0.]])
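
As a rough sketch of how a mask like this can be used later, the example below picks the three highest predicted ratings among the anime one user has not rated. The arrays `predicted` and `unrated_mask` are made up for illustration; only the 1 = not rated, 0 = rated convention is taken from `binary_matrix`.

```python
import numpy as np

# Hypothetical predicted ratings for one user over five anime
predicted = np.array([7.2, 8.9, 6.1, 9.4, 8.0])
# 1 = not yet rated (eligible), 0 = already rated, same convention as binary_matrix
unrated_mask = np.array([1, 0, 1, 1, 1])

# Push already-rated items to -inf so they can never be selected
candidates = np.where(unrated_mask == 1, predicted, -np.inf)

# Column positions of the three highest predictions among unrated anime
top3 = np.argsort(candidates)[::-1][:3]
print(top3)  # positions 3, 4, 0
```

The second anime has a high predicted rating but is skipped because the user already rated it.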

### 2. Perform SVD

In [218]:
#Perform svd

#Fill nan values with temporary 0
temp0_wide_user_anime = wide_user_anime.fillna(0)
temp0_wide_user_anime.head(5)

anime_id,1,5,6,7,8,15,16,17,18,19,...,34238,34239,34240,34252,34283,34324,34325,34349,34367,34475
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5,0.0,0.0,8.0,0.0,0.0,6.0,0.0,6.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,...,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
43,10.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [219]:
# replace nan entries with a neutral placeholder rating of 5.0
temp_wide_user_anime = wide_user_anime.fillna(5.0)
# center the data by subtracting 5.0 from every entry
utility_matrix = temp_wide_user_anime - 5.0
utility_matrix.shape

(8713, 9785)

In [220]:
#perform SVD to get U, s, Vt (s is returned as a 1-D array of singular values)
U, s, Vt=np.linalg.svd(utility_matrix, full_matrices=False)
print("Svd done")

Svd done


In [221]:
#shape checks
print(U.shape)
print(Vt.shape)
print(s.shape) 
s[:5]

(8713, 8713)
(8713, 9785)
(8713,)


array([2644.46968427, 1075.40928979,  850.1834485 ,  725.09577006,
        618.98762882])

In [222]:
#use svd to get a lower dimension approximation of data
k=3
low_d = np.dot(U[:,:k] * s[:k], Vt[:k,:])
print(low_d[:5,:5])
low_d.shape

#restore original non-mean-centered numbers by adding 5
low_d2 = low_d + 5
print(low_d2[:5,:5])

[[ 1.50701562  0.91369456  0.42760368  0.06931542 -0.09379742]
 [ 0.2195865  -0.01636031  0.32125442  0.02716375  0.03723829]
 [ 2.3281831   1.13606783  1.16699734  0.15100596 -0.04486996]
 [ 1.38841599  0.63313386  0.6236182   0.03136977 -0.04346828]
 [ 0.30826845 -0.01234809  0.29323607 -0.02263655  0.01893536]]
[[6.50701562 5.91369456 5.42760368 5.06931542 4.90620258]
 [5.2195865  4.98363969 5.32125442 5.02716375 5.03723829]
 [7.3281831  6.13606783 6.16699734 5.15100596 4.95513004]
 [6.38841599 5.63313386 5.6236182  5.03136977 4.95653172]
 [5.30826845 4.98765191 5.29323607 4.97736345 5.01893536]]
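
The truncated reconstruction above can be sanity-checked on a small matrix. The sketch below builds a rank-`k` approximation the same way and compares its error with the singular values that were discarded. The matrix `M` is made up for illustration and is not part of the dataset.

```python
import numpy as np

# Hypothetical centered utility matrix (users x items); 0 stands in for "not rated"
M = np.array([
    [ 3.0,  2.0,  0.0, -1.0],
    [ 2.0,  3.0, -2.0,  0.0],
    [-2.0,  0.0,  3.0,  2.0],
    [ 0.0, -1.0,  2.0,  3.0],
])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
# Scale the first k columns of U by the first k singular values, then project back
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# The Frobenius error of the rank-k approximation equals the norm of the discarded singular values
error = np.linalg.norm(M - approx)
print(approx.round(2))
print(error, np.sqrt(np.sum(s[k:] ** 2)))
```

The two printed error values should agree, which is a quick way to confirm that the slicing of `U`, `s`, and `Vt` is consistent. A larger `k` keeps more detail but also reproduces more of the placeholder values that were filled in for missing ratings.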


In [227]:
two_users = wide_user_anime.sample(n=2, random_state=21, axis = 0)
#user id of 31226 and 72511
two_users


anime_id,1,5,6,7,8,15,16,17,18,19,...,34238,34239,34240,34252,34283,34324,34325,34349,34367,34475
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
31226,,,,,,,,,,,...,,,,,,,,,,
72511,,,,,,,,,,,...,,,,,,,,,,


In [228]:
two_users = wide_user_anime.sample(n=2, random_state=21)
two_users_np = two_users.to_numpy()

# boolean mask of the anime that the first user rated
one = two_users_np[0, :]
one_inds = ~np.isnan(one)
# boolean mask of the anime that the second user rated
two = two_users_np[1, :]
two_inds = ~np.isnan(two)
print(one[one_inds])
print(two[two_inds])
print(two_users.shape)

[ 8.  7.  8.  8.  8.  8.  8.  8.  6.  8.  7.  9.  7.  7.  6.  8.  8.  8.
  6.  8.  7.  8.  8.  9.  9.  7.  6.  8.  7.  7.  9.  3.  6.  8.  9.  8.
  8.  9.  7.  7.  7.  5.  7.  8.  8.  6.  7.  8.  7.  7.  8.  7.  6.  7.
  8.  8.  7.  7.  9.  7.  7.  7.  6. 10.  7.  8.  7.  7.  8.  9.  8.  8.
  7.  7.  6.  7.  7.  7.  8.  6.  6.  6.  7.  7.  6.  8.  7.  7.  9.  8.
  6.  8.  8.  8.  9.  8.  8.  8.  6.  7.  8.  7.  6.  8.  6.  8.  5.  7.
  9.  8.  7.  7.  9.  8.  6.  7.  6.  8.  8.  8.  8.  7.  8.  7.  8.  8.
  8.  7.  7.  6.  7.  8.  7.  6.  6.  8.  7.  8.  7.  7.  9.  8.  9.  6.
  7.  8.  8.  7.  7.  8.  9.  7.  8.  6.  7.  7.  7.  8.  7.  8.  7.  7.
  9.  7.  8.  7.  7.  7.  7.  7.  7.  8.  9.  7.  7.  7.  8.  8.  7.  7.
  7.  7.  7.  7.  7.  8.  8.  7.  7.  8.  8.  7.  8.  7.  8.  9.  8.  6.
  7.  6.  7.  7.  7.  6.  7.  8.  6.  7.  7.  8.  8.  8.  9.  6.  8.  7.
  7.  7.  8.  6.  7.  7.  8.  8.  8.  6.  7.  6.  8.  8.  7.  8.  8.  8.
  6.  7.  8.  8.  8.  8.  7.  6.  7.  8.  6.  7.  8

### 3. Evaluating the Recommendation

In [229]:



anime_cut2 = anime[['anime_id', 'name','genre']]
anime_cut2

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural"
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S..."
3,9253,Steins;Gate,"Sci-Fi, Thriller"
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S..."
...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai
12290,5543,Under World,Hentai
12291,5621,Violence Gekiga David no Hoshi,Hentai
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai


In [230]:
#join two data to have the name of the anime and two users' ratings
two_users = two_users.T
two_rating = two_users.join(anime_cut2.set_index('anime_id'), on='anime_id')

In [233]:

#show all the anime that the first user rated
first_user_rating = two_rating[one_inds]
first_user_rating = first_user_rating[[31226,'name','genre']]
first_user_rating

Unnamed: 0_level_0,31226,name,genre
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
30,8.0,Neon Genesis Evangelion,"Action, Dementia, Drama, Mecha, Psychological,..."
48,7.0,.hack//Sign,"Adventure, Fantasy, Game, Magic, Mystery, Sci-Fi"
59,8.0,Chobits,"Comedy, Drama, Ecchi, Romance, Sci-Fi, Seinen"
60,8.0,Chrno Crusade,"Action, Demons, Historical, Romance, Supernatural"
64,8.0,Rozen Maiden,"Action, Comedy, Drama, Magic, Seinen"
...,...,...,...
32093,8.0,Tanaka-kun wa Itsumo Kedaruge,"Comedy, School, Slice of Life"
32094,7.0,Reikenzan: Hoshikuzu-tachi no Utage,"Comedy, Fantasy, Magic"
32360,6.0,Qualidea Code,"Action, Magic, Supernatural"
32595,6.0,Seisen Cerberus: Ryuukoku no Fatalités,"Adventure, Fantasy"


In [248]:
second_user_rating = two_rating[two_inds]
second_user_rating = second_user_rating[[72511,'name','genre']]
#show the anime that the second user rated
second_user_rating


Unnamed: 0_level_0,72511,name,genre
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
20,4.0,Naruto,"Action, Comedy, Martial Arts, Shounen, Super P..."
30,10.0,Neon Genesis Evangelion,"Action, Dementia, Drama, Mecha, Psychological,..."
32,10.0,Neon Genesis Evangelion: The End of Evangelion,"Dementia, Drama, Mecha, Psychological, Sci-Fi"
68,7.0,Black Cat,"Adventure, Comedy, Sci-Fi, Shounen, Super Power"
121,8.0,Fullmetal Alchemist,"Action, Adventure, Comedy, Drama, Fantasy, Mag..."
...,...,...,...
31181,8.0,Owarimonogatari,"Comedy, Mystery, Supernatural"
31240,9.0,Re:Zero kara Hajimeru Isekai Seikatsu,"Drama, Fantasy, Psychological, Thriller"
31442,7.0,Musaigen no Phantom World,"Action, Comedy, Fantasy, Slice of Life, Supern..."
31715,8.0,Working!!!: Lord of the Takanashi,"Comedy, Romance, Slice of Life"


In [252]:
#take the anime that user rated 10 & 9
inds = (second_user_rating[72511] == 10 & 9)
second_user_rating[inds] 

Unnamed: 0_level_0,72511,name,genre
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
121,8.0,Fullmetal Alchemist,"Action, Adventure, Comedy, Drama, Fantasy, Mag..."
223,8.0,Dragon Ball,"Adventure, Comedy, Fantasy, Martial Arts, Shou..."
226,8.0,Elfen Lied,"Action, Drama, Horror, Psychological, Romance,..."
269,8.0,Bleach,"Action, Comedy, Shounen, Super Power, Supernat..."
2236,8.0,Toki wo Kakeru Shoujo,"Adventure, Drama, Romance, Sci-Fi"
...,...,...,...
30240,8.0,Prison School,"Comedy, Ecchi, Romance, School, Seinen"
30503,8.0,Noragami Aragoto,"Action, Adventure, Shounen, Supernatural"
31043,8.0,Boku dake ga Inai Machi,"Mystery, Psychological, Seinen, Supernatural"
31181,8.0,Owarimonogatari,"Comedy, Mystery, Supernatural"
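
One detail in the cell above is worth checking: the table lists only ratings of 8.0, although the comment asks for 10 and 9. In Python, `10 & 9` is a bitwise AND that evaluates to 8, and `&` binds more tightly than `==`, so the comparison is effectively `== 8`. The sketch below uses a small hypothetical Series to show the difference, along with one way to select the 9s and 10s using `isin`.

```python
import pandas as pd

# Hypothetical ratings indexed by anime_id
ratings = pd.Series([10.0, 9.0, 8.0, 7.0], index=[30, 31240, 121, 68])

# Bitwise AND of the two integers: 0b1010 & 0b1001 == 0b1000
print(10 & 9)  # 8

# `&` is evaluated before `==`, so this compares every rating with 8
as_written = ratings == 10 & 9

# Selecting ratings of 9 or 10 explicitly
nine_or_ten = ratings.isin([9, 10])

print(ratings[as_written])
print(ratings[nine_or_ten])
```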
