### kNNs

This is the fallback approach. If a new user who has never been in the database arrives with a list of anime they liked, the SVD model cannot score them: we would have to add the new data point to the dataset and retrain, which is slow. We therefore need a separate fallback method. To keep things simple, we use only the genre of each anime as its features and compute a distance between anime on those features. Anime that are close to the ones the user liked are treated as similar, and those are what we recommend.
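To make the idea of "distance on genre features" concrete, here is a minimal sketch in plain Python. The titles `A`, `B`, `C` and their genre sets are made up for illustration and are not taken from the dataset; Euclidean distance is used only because it is the default metric of the kNN model fitted later in this notebook.

```python
# Toy genre sets; the titles and genres are illustrative, not from the dataset.
anime_genres = {
    "A": {"Action", "Adventure", "Shounen"},
    "B": {"Action", "Adventure", "Comedy", "Shounen"},
    "C": {"Romance", "School"},
}

# Fix a genre order so every anime maps to a vector of the same length.
all_genres = sorted(set().union(*anime_genres.values()))

def to_vector(genres):
    """Return a 0/1 vector with one entry per genre in all_genres."""
    return [1 if g in genres else 0 for g in all_genres]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

vectors = {name: to_vector(g) for name, g in anime_genres.items()}
dist_ab = euclidean(vectors["A"], vectors["B"])  # differ in one genre
dist_ac = euclidean(vectors["A"], vectors["C"])  # share no genres
print(dist_ab, dist_ac)
```

`A` and `B` differ in a single genre, so they end up closer to each other than `A` and `C`, which share none.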

In [1]:
import pandas as pd
import pickle


In [2]:
userID2userNameMap = {}
userName2userIDMap = {}

animeID2animeNameMap = {}
animeName2animeIDMap = {}

with open('userName2userIDMap.pkl', 'rb') as f:
    userName2userIDMap = pickle.load(f)

with open('userID2userNameMap.pkl', 'rb') as f:
    userID2userNameMap = pickle.load(f)


with open('animeID2animeNameMap.pkl', 'rb') as f:
    animeID2animeNameMap = pickle.load(f)

with open('animeName2animeIDMap.pkl', 'rb') as f:
    animeName2animeIDMap = pickle.load(f)

In [4]:
# First, just use the anime genre information.
# Later, see if we can incorporate the tags information, which will require more preprocessing.

df_anime = pd.read_csv("./data/anime_cleaned.csv")
df_anime

Unnamed: 0,anime_id,title,title_english,title_japanese,title_synonyms,image_url,type,source,episodes,status,...,broadcast,related,producer,licensor,studio,genre,opening_theme,ending_theme,duration_min,aired_from_year
0,11013,Inu x Boku SS,Inu X Boku Secret Service,妖狐×僕SS,Youko x Boku SS,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,12,Finished Airing,...,Fridays at Unknown,"{'Adaptation': [{'mal_id': 17207, 'type': 'man...","Aniplex, Square Enix, Mainichi Broadcasting Sy...",Sentai Filmworks,David Production,"Comedy, Supernatural, Romance, Shounen","['""Nirvana"" by MUCC']","['#1: ""Nirvana"" by MUCC (eps 1, 11-12)', '#2: ...",24.0,2012.0
1,2104,Seto no Hanayome,My Bride is a Mermaid,瀬戸の花嫁,The Inland Sea Bride,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,26,Finished Airing,...,Unknown,"{'Adaptation': [{'mal_id': 759, 'type': 'manga...","TV Tokyo, AIC, Square Enix, Sotsu",Funimation,Gonzo,"Comedy, Parody, Romance, School, Shounen","['""Romantic summer"" by SUN&LUNAR']","['#1: ""Ashita e no Hikari (明日への光)"" by Asuka Hi...",24.0,2007.0
2,5262,Shugo Chara!! Doki,Shugo Chara!! Doki,しゅごキャラ！！どきっ,"Shugo Chara Ninenme, Shugo Chara! Second Year",https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,51,Finished Airing,...,Unknown,"{'Adaptation': [{'mal_id': 101, 'type': 'manga...","TV Tokyo, Sotsu",,Satelight,"Comedy, Magic, School, Shoujo","['#1: ""Minna no Tamago (みんなのたまご)"" by Shugo Cha...","['#1: ""Rottara Rottara (ロッタラ ロッタラ)"" by Buono! ...",24.0,2008.0
3,721,Princess Tutu,Princess Tutu,プリンセスチュチュ,,https://myanimelist.cdn-dena.com/images/anime/...,TV,Original,38,Finished Airing,...,Fridays at Unknown,"{'Adaptation': [{'mal_id': 1581, 'type': 'mang...","Memory-Tech, GANSIS, Marvelous AQL",ADV Films,Hal Film Maker,"Comedy, Drama, Magic, Romance, Fantasy","['""Morning Grace"" by Ritsuko Okazaki']","['""Watashi No Ai Wa Chiisaikeredo"" by Ritsuko ...",16.0,2002.0
4,12365,Bakuman. 3rd Season,Bakuman.,バクマン。,Bakuman Season 3,https://myanimelist.cdn-dena.com/images/anime/...,TV,Manga,25,Finished Airing,...,Unknown,"{'Adaptation': [{'mal_id': 9711, 'type': 'mang...","NHK, Shueisha",,J.C.Staff,"Comedy, Drama, Romance, Shounen","['#1: ""Moshimo no Hanashi (もしもの話)"" by nano.RIP...","['#1: ""Pride on Everyday"" by Sphere (eps 1-13)...",24.0,2012.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6663,37405,Dokidoki Little Ooyasan,,dokidokiりとる大家さん,,https://myanimelist.cdn-dena.com/images/anime/...,OVA,Other,0,Currently Airing,...,,[],,,Collaboration Works,Hentai,[],[],30.0,2018.0
6664,37886,Wo Shi Jiang Xiaobai (2018),I&#039;m Joybo OVA,我是江小白 小剧场,Wo Shi Jiang Xiao Bai: Xiao Ju Chang,https://myanimelist.cdn-dena.com/images/anime/...,ONA,Original,1,Finished Airing,...,,"{'Prequel': [{'mal_id': 36775, 'type': 'anime'...",,,2:10 Animation,"Slice of Life, Drama, Romance",[],[],0.0,2018.0
6665,37255,Genki Genki Non-tan: Obake Mura Meiro,,げんきげんきノンタン　おばけむらめいろ,,https://myanimelist.cdn-dena.com/images/anime/...,OVA,Original,1,Finished Airing,...,,"{'Prequel': [{'mal_id': 25619, 'type': 'anime'...",,,Polygon Pictures,"Music, Kids",[],[],35.0,2015.0
6666,35229,Mr. Men Little Miss,Mr. Men Little Miss,Mr. Men Little Miss / ミスターメン リトルミス,,https://myanimelist.cdn-dena.com/images/anime/...,ONA,Picture book,0,Currently Airing,...,,[],,,Sanrio,Kids,[],[],2.0,2013.0


In [5]:
# the easiest ones to use will be type, genre, and maybe studio

display(df_anime['type'].value_counts())
display(df_anime['genre'].value_counts())
display(df_anime['studio'].value_counts())

type
TV         2980
OVA        1345
Special     929
Movie       908
ONA         408
Music        98
Name: count, dtype: int64

genre
Hentai                                                                 244
Comedy                                                                 216
Music                                                                   80
Slice of Life, Comedy                                                   67
Comedy, Slice of Life                                                   48
                                                                      ... 
Adventure, Ecchi, Fantasy, Magic, Mystery, Shoujo Ai                     1
Comedy, Mecha, Shounen                                                   1
Action, Sci-Fi, Dementia, Psychological, Drama, Mecha                    1
Action, Adventure, Fantasy, Magic, Comedy, Military, Drama, Shounen      1
Horror, Parody, Supernatural                                             1
Name: count, Length: 3203, dtype: int64

studio
Toei Animation            403
Sunrise                   277
Madhouse                  243
Studio Pierrot            235
J.C.Staff                 233
                         ... 
Blade                       1
Toei Animation, Bridge      1
Studio Unicorn              1
MooGoo                      1
G-angle                     1
Name: count, Length: 711, dtype: int64

In [137]:
# genre first
df = df_anime[['anime_id', 'genre', 'type']]
df

Unnamed: 0,anime_id,genre,type
0,11013,"Comedy, Supernatural, Romance, Shounen",TV
1,2104,"Comedy, Parody, Romance, School, Shounen",TV
2,5262,"Comedy, Magic, School, Shoujo",TV
3,721,"Comedy, Drama, Magic, Romance, Fantasy",TV
4,12365,"Comedy, Drama, Romance, Shounen",TV
...,...,...,...
6663,37405,Hentai,OVA
6664,37886,"Slice of Life, Drama, Romance",ONA
6665,37255,"Music, Kids",OVA
6666,35229,Kids,ONA


In [138]:
df.genre.isna().sum()

4

In [139]:
# which anime have no genre? Maybe we can give them genres manually.
df[df['genre'].isna()].anime_id.apply(lambda x : animeID2animeNameMap[x])

2357                                 Genbanojou
3301                               Match Shoujo
5111                Kyoto Animation: Megane-hen
6642    Season&#039;s Greetings 2017 from Dwarf
Name: anime_id, dtype: object

In [140]:
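# Just remove them. Take an explicit copy so that later column assignments
# modify this frame directly and do not trigger SettingWithCopyWarning.
df = df[df['genre'].notna()].copy()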

In [141]:
# Are there any rows whose type is NaN?
df.type.isna().sum()

0

That's great, no empty types.

In [142]:
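all_genre_set = set()

def collect_genres(row):
    """Add every genre listed in this row to all_genre_set."""
    x = row.genre

    if not isinstance(x, str):
        print(f'anomaly genre: {x}')
        return  # nothing to split for a non-string value

    if ',' not in x:
        all_genre_set.add(x)
        return

    # NOTE: replace(' ', '') also strips spaces inside multi-word genres, so
    # 'Slice of Life' becomes 'SliceofLife' here but keeps its spaces in the
    # single-genre branch above. Both spellings end up in the set.
    for g in x.replace(' ', '').split(','):
        all_genre_set.add(g)


df.apply(collect_genres, axis=1)

all_genre_set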

{'Action',
 'Adventure',
 'Cars',
 'Comedy',
 'Dementia',
 'Demons',
 'Drama',
 'Ecchi',
 'Fantasy',
 'Game',
 'Harem',
 'Hentai',
 'Historical',
 'Horror',
 'Josei',
 'Kids',
 'Magic',
 'MartialArts',
 'Mecha',
 'Military',
 'Music',
 'Mystery',
 'Parody',
 'Police',
 'Psychological',
 'Romance',
 'Samurai',
 'School',
 'Sci-Fi',
 'Seinen',
 'Shoujo',
 'ShoujoAi',
 'Shounen',
 'ShounenAi',
 'Slice of Life',
 'SliceofLife',
 'Space',
 'Sports',
 'SuperPower',
 'Supernatural',
 'Thriller',
 'Vampire',
 'Yaoi',
 'Yuri'}
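The set above contains both `'Slice of Life'` and `'SliceofLife'`, and multi-word genres such as `'MartialArts'` have lost their spaces. One way to avoid this, sketched below on a few made-up genre strings, is to trim only the whitespace around each label. The helper name `collect_genre_labels` is hypothetical and not part of the notebook.

```python
def collect_genre_labels(genre_strings):
    """Collect unique genre labels, trimming only the whitespace around each label."""
    genres = set()
    for text in genre_strings:
        if not isinstance(text, str):
            continue  # skip missing values such as NaN
        for label in text.split(","):
            label = label.strip()
            if label:
                genres.add(label)
    return genres

# Illustrative inputs, including a missing value.
sample = ["Slice of Life", "Comedy, Slice of Life", "Action, Martial Arts", float("nan")]
genres = collect_genre_labels(sample)
print(sorted(genres))
```

With this normalisation each genre has exactly one spelling, whether it appears alone or in a comma-separated list.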

In [143]:
# Function to clean and split the genre string into a list
def split_genres(genre):
    if not isinstance(genre, str):
        return []
    return [g.strip() for g in genre.split(',')]

# Apply the function to the genre column to split the genres
df['genre_list'] = df['genre'].apply(split_genres)

# Initialize a DataFrame with zeros for each genre in all_genre_set, for each anime
genre_df = pd.DataFrame(0, index=df.index, columns=list(all_genre_set))

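# Populate the DataFrame: for each anime, set 1 for genres it has
for index, row in df.iterrows():
    genres = row['genre_list']
    for genre in genres:
        # This check is not redundant: genre_list keeps internal spaces
        # ('Martial Arts') while all_genre_set holds the space-stripped form
        # ('MartialArts'), so such multi-word genres are skipped here.
        if genre in all_genre_set:
            genre_df.at[index, genre] = 1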

# Join the one-hot encoded genres back to the original DataFrame
df_encoded = df.join(genre_df)



In [144]:
df_encoded

Unnamed: 0,anime_id,genre,type,genre_list,Shounen,Kids,MartialArts,Ecchi,Samurai,Cars,...,Space,Horror,Shoujo,School,SuperPower,Demons,Historical,Game,Josei,Thriller
0,11013,"Comedy, Supernatural, Romance, Shounen",TV,"[Comedy, Supernatural, Romance, Shounen]",1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2104,"Comedy, Parody, Romance, School, Shounen",TV,"[Comedy, Parody, Romance, School, Shounen]",1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,5262,"Comedy, Magic, School, Shoujo",TV,"[Comedy, Magic, School, Shoujo]",0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
3,721,"Comedy, Drama, Magic, Romance, Fantasy",TV,"[Comedy, Drama, Magic, Romance, Fantasy]",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,12365,"Comedy, Drama, Romance, Shounen",TV,"[Comedy, Drama, Romance, Shounen]",1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6663,37405,Hentai,OVA,[Hentai],0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6664,37886,"Slice of Life, Drama, Romance",ONA,"[Slice of Life, Drama, Romance]",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6665,37255,"Music, Kids",OVA,"[Music, Kids]",0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6666,35229,Kids,ONA,[Kids],0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
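As an alternative to the explicit loop, the same kind of one-hot table could be built with scikit-learn's `MultiLabelBinarizer`. This is only a sketch on a three-row toy frame (the IDs and genres are invented), and it assumes genre labels are split with `strip()` so multi-word genres keep their spaces.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the anime table; values are illustrative only.
toy = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "genre": ["Comedy, Slice of Life", "Action, Martial Arts", "Comedy"],
})

# Split each genre string into a list of trimmed labels.
genre_lists = toy["genre"].apply(lambda s: [g.strip() for g in s.split(",")])

# fit_transform returns a 0/1 matrix; classes_ holds the sorted genre labels.
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(genre_lists)

toy_encoded = pd.concat(
    [toy[["anime_id"]], pd.DataFrame(genre_matrix, index=toy.index, columns=mlb.classes_)],
    axis=1,
)
print(toy_encoded)
```

Because the columns come from the labels themselves, there is no separate genre set that can drift out of sync with the split logic.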


In [145]:
df_encoded.drop(columns=['genre', 'genre_list', 'type'], inplace=True)
df_encoded

Unnamed: 0,anime_id,Shounen,Kids,MartialArts,Ecchi,Samurai,Cars,Yuri,Supernatural,Military,...,Space,Horror,Shoujo,School,SuperPower,Demons,Historical,Game,Josei,Thriller
0,11013,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2104,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,5262,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
3,721,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,12365,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6663,37405,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6664,37886,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6665,37255,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6666,35229,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [146]:
# df_encoded.drop(columns=['genre', 'genre_list'], inplace=True)
# df_encoded = pd.concat([df_encoded, pd.get_dummies(df_encoded['type'], prefix='type', dtype=int)], axis=1)
# df_encoded.drop(columns=['type'], inplace=True)
# df_encoded

In [147]:
# With the genres one-hot encoded, we can now try running kNN
from sklearn.neighbors import NearestNeighbors

# Drop non-genre columns to get the feature set X
X = df_encoded.drop(['anime_id'], axis=1)

# Initialize and fit the KNN model
knn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree')
knn.fit(X)
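To see what the fitted model returns, here is a small sketch on a made-up 4x4 one-hot matrix (`X_toy` and `toy_knn` are illustrative names, not part of the notebook). `kneighbors` returns two arrays, distances and row positions, each with one row per query.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy one-hot genre matrix: one row per anime, one column per genre.
X_toy = np.array([
    [1, 1, 0, 0],  # row 0
    [1, 1, 1, 0],  # row 1: one extra genre compared with row 0
    [0, 0, 1, 1],  # row 2
    [0, 0, 0, 1],  # row 3
])

toy_knn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X_toy)

# Query with row 0; a 2-D input (one row) is required.
distances, indices = toy_knn.kneighbors(X_toy[[0]])
print(distances, indices)
```

The nearest hit is row 0 itself at distance 0, followed by row 1 at distance 1. The indices are positions in the fitted matrix, so they have to be mapped back to `anime_id` values afterwards.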

In [148]:
def recommend_similar_anime(liked_anime_ids, df_encoded, knn_model, top_m=5):
    """Return IDs of anime whose genre vectors are nearest to each liked anime."""
    recommendations = []

    for anime_id in liked_anime_ids:
        # Find the one-hot encoded vector for the liked anime
        anime_vector = df_encoded[df_encoded['anime_id'] == anime_id].drop(['anime_id', 'genre', 'genre_list'], axis=1, errors='ignore')

        # Skip IDs that are not in df_encoded (e.g. anime dropped for having no genre)
        if anime_vector.empty:
            continue

        # Use the KNN model to find similar anime
        distances, indices = knn_model.kneighbors(anime_vector, n_neighbors=top_m + 1)

        # Get the anime_ids of the recommended anime, dropping the first hit.
        # This assumes the first hit is the query itself, which may not hold when
        # several anime share an identical genre vector (all at distance 0).
        similar_anime_ids = df_encoded.iloc[indices[0], :]['anime_id'].values[1:]

        recommendations.extend(similar_anime_ids)

    # Remove duplicates (note: a set does not preserve neighbour order)
    recommendations = list(set(recommendations))

    # Remove anime (if any in list) that were in the user input
    return [x for x in recommendations if x not in liked_anime_ids]
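Dropping the first neighbour assumes the query anime is always returned first. With one-hot genre vectors many anime are exact duplicates of each other, so that is not guaranteed. A possible alternative, sketched here on invented data, is to filter the query out by ID. The names `ids`, `features`, `model`, and `neighbours_excluding_self` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data: the first three anime share an identical genre vector.
ids = np.array([10, 20, 30, 40])
features = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
model = NearestNeighbors(algorithm="ball_tree").fit(features)

def neighbours_excluding_self(query_id, top_m):
    """Return up to top_m neighbour IDs, removing the query by ID rather than by position."""
    row = np.flatnonzero(ids == query_id)
    if row.size == 0:
        return []  # unknown ID: nothing to recommend
    # Ask for one extra neighbour so top_m remain after removing the query.
    k = min(top_m + 1, len(ids))
    _, idx = model.kneighbors(features[row[:1]], n_neighbors=k)
    neighbour_ids = [int(i) for i in ids[idx[0]] if i != query_id]
    return neighbour_ids[:top_m]

result = neighbours_excluding_self(10, top_m=2)
print(result)
```

Here the two other anime with the same vector are returned regardless of the order in which the tied neighbours come back.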


In [149]:
# Example liked anime titles
liked_anime = ['One Piece', 'Naruto']

# Convert from name to ID
liked_anime_ids = [animeName2animeIDMap[x] for x in liked_anime]

# Get recommendations
recommendations = recommend_similar_anime(liked_anime_ids, df_encoded, knn, top_m=5)

# Convert from id to name
recommendations = [animeID2animeNameMap[x] for x in recommendations]

print("Recommended anime:")
for x in recommendations:
    print('\t' + x)

Recommended anime:
	Digimon Frontier
	Naruto: Takigakure no Shitou - Ore ga Eiyuu Dattebayo!
	One Piece: Episode of Merry - Mou Hitori no Nakama no Monogatari
	One Piece: Long Ring Long Land-hen
	Duel Masters VSR
	One Piece: Episode of Sabo - 3 Kyoudai no Kizuna Kiseki no Saikai to Uketsugareru Ishi
	Duel Masters Charge
	Ninkuu: Knife no Bohyou
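The list above comes out of a `set`, so its order carries no meaning. One possible refinement, shown as a sketch with made-up IDs, is to rank candidates by how many of the liked anime they are a neighbour of. The helper `rank_by_votes` is hypothetical and not part of the notebook.

```python
from collections import Counter

def rank_by_votes(neighbour_lists, liked_ids):
    """Rank candidate IDs by the number of liked anime that list them as a neighbour."""
    liked = set(liked_ids)
    votes = Counter(
        anime_id
        for neighbours in neighbour_lists
        for anime_id in set(neighbours)  # count each candidate once per liked anime
        if anime_id not in liked
    )
    # Sort by vote count (descending), then by ID so the order is deterministic.
    return [anime_id for anime_id, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]

# One neighbour list per liked anime; the IDs are illustrative.
ranked = rank_by_votes([[3, 4, 5], [4, 6, 1]], liked_ids=[1, 2])
print(ranked)
```

Candidate `4` appears in both neighbour lists and is ranked first; `1` is excluded because the user already liked it.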
