# Cosine TF-IDF (Term Frequency-Inverse Document Frequency) similarity

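TF-IDF measures how important a term is to a document. Term frequency (TF) counts how often the term appears in that document, and inverse document frequency (IDF) down-weights terms that appear in many documents across the collection.

The TF-IDF score is the product TF × IDF. A higher score means the term is more distinctive for that document.

After computing the TF-IDF vectors, we take the cosine of the angle between each pair of document vectors (here, anime synopses) as their similarity.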
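
To make the two steps concrete, here is a minimal plain-Python sketch on three made-up synopses. It uses a textbook-style IDF, `ln(N / df) + 1`; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ. The documents and function names are illustrative assumptions, not part of the dataset.

```python
import math
from collections import Counter

# Three hypothetical synopses (illustrative only).
docs = [
    "space bounty hunters chase criminals",
    "bounty hunters travel through space",
    "high school students join a music club",
]

def tfidf_vectors(texts):
    """Return (vocabulary, list of TF-IDF vectors), one vector per text."""
    tokenized = [text.split() for text in texts]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n_docs / doc_freq[word]) + 1.0 for word in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # TF is the relative frequency of the word within this text.
        vectors.append([counts[word] / len(tokens) * idf[word] for word in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vocab, vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # the two texts share several words
sim_02 = cosine(vectors[0], vectors[2])  # the two texts share no words
print(round(sim_01, 3), round(sim_02, 3))
```

Texts with no words in common get a similarity of exactly 0, and a text compared with itself gets 1.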

In [2]:
import pandas as pd
import numpy as np

df_anime = pd.read_csv('../data/anime-dataset-2023.csv')
df_anime.head()

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",...,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...


In [3]:
df_anime.shape

(24905, 24)

In [4]:
#basic filtering for duplicates

duplicates_all = df_anime[df_anime.duplicated()]
print("All Duplicates:")
print(len(duplicates_all))

duplicates = df_anime[df_anime.duplicated(['Name'])].sort_values(by='Name')
print("Duplicates based on Name:")
print(len(duplicates))
duplicates = duplicates[['anime_id', 'Name']]
print(duplicates)

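# note: df_anime_new is only used for the shape comparison below;
# the later cells continue from df_anime, not from this de-duplicated frame
df_anime_new = df_anime.drop_duplicates(['Name'])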
print("Cleaned anime shape: {} \n".format(df_anime_new.shape))
print("Old anime shape: {}".format(df_anime.shape))

All Duplicates:
0
Duplicates based on Name:
4
       anime_id       Name
24840     55658  Awakening
24586     55351  Azur Lane
24807     55610   Souseiki
24781     55582     Utopia
Cleaned anime shape: (24901, 24) 

Old anime shape: (24905, 24)


In [5]:
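#filter out certain genre
to_exclude = df_anime[df_anime['Genres'].str.contains('Hentai', case=False, na=False)]
# .copy() makes filtered_df an independent DataFrame, so adding a column to it later
# does not trigger pandas' SettingWithCopyWarning
filtered_df = df_anime[~df_anime.index.isin(to_exclude.index)].copy()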
filtered_df.shape

(23419, 24)

In [6]:
# Lowercase the Name column to build a case-insensitive comparison key.
# Note: Series.replace(' ', '') only replaces values that are exactly ' ', so spaces
# inside titles are kept; stripping them would need .str.replace(' ', '').
filtered_df['Processed_Name'] = filtered_df['Name'].str.lower().replace(' ', '')

# Select rows whose lowercased name is duplicated, plus non-duplicated rows whose name contains no space
duplicate_rows = filtered_df[filtered_df.duplicated(subset='Processed_Name', keep=False) | ~filtered_df.duplicated(subset='Processed_Name', keep=False) & ~filtered_df['Processed_Name'].str.contains(' ')]

print(duplicate_rows)
# Intended to keep e.g. Death Note and drop DEATHNOTE. As written, this drops every row
# whose lowercased name occurs more than once (keep=False marks all copies as duplicates).
filtered_df = filtered_df[~((filtered_df['Processed_Name'].isin(duplicate_rows['Processed_Name'])) & (filtered_df.duplicated(subset='Processed_Name', keep=False)))]

# Drop the intermediate 'Processed_Name' column
filtered_df = filtered_df.drop(columns='Processed_Name')

filtered_df.shape

       anime_id         Name English name   Other name    Score  \
2             6       Trigun       Trigun        トライガン     8.22   
9            19      Monster      Monster        モンスター     8.87   
10           20       Naruto       Naruto          ナルト     7.99   
15           25    Sunabouzu  Desert Punk         砂ぼうず     7.38   
16           26   Texhnolyze   Texhnolyze   TEXHNOLYZE     7.76   
...         ...          ...          ...          ...      ...   
24880     55707       Kokoro      UNKNOWN            心  UNKNOWN   
24885     55716  Mechronicle  Mechronicle  Mechronicle  UNKNOWN   
24896     55727         Miru      UNKNOWN           未ル  UNKNOWN   
24898     55729     Thailand      UNKNOWN     Thailand  UNKNOWN   
24899     55730       Energy      UNKNOWN       Energy  UNKNOWN   

                                         Genres  \
2                     Action, Adventure, Sci-Fi   
9                      Drama, Mystery, Suspense   
10                   Action, Adventure, Fa


(23409, 24)
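
The comments in the cell above describe treating titles such as Death Note and DEATHNOTE as the same title and keeping the first one. A minimal plain-Python sketch of that intent follows; `normalize_title` and `drop_name_variants` are hypothetical helper names, and the title list is made up for illustration.

```python
def normalize_title(name):
    """Lowercase a title and drop all whitespace, so 'Death Note' matches 'DEATHNOTE'."""
    return "".join(name.lower().split())

def drop_name_variants(names):
    """Keep the first occurrence of each title, ignoring case and spacing."""
    seen = set()
    kept = []
    for name in names:
        key = normalize_title(name)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept

titles = ["Death Note", "DEATHNOTE", "Monster", "monster", "Trigun"]
kept_titles = drop_name_variants(titles)
print(kept_titles)
```

In pandas, the same key could be built with `Series.str.lower()` followed by `Series.str.replace(' ', '')`, and the first occurrence kept with `drop_duplicates` on that key column.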

In [7]:
#drop rows with unknown genres
unknown_rows = filtered_df[filtered_df['Genres'].str.lower() == 'unknown']
filtered_df = filtered_df.drop(unknown_rows.index)
filtered_df.shape

(18486, 24)

## Create TF-IDF Matrix and Encoders

In [8]:
#create the tf-idf matrix for text comparison
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', 
                        max_features=10000,
                        max_df=0.9,
                        min_df=2)
synopsis_vectors = tfidf.fit_transform(filtered_df['Synopsis'])

In [9]:
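#use one-hot encoder to include genre in the recommendation
#note: Genres is a comma-separated string, not a list, so explode() leaves it unchanged
#and each distinct genre combination is encoded as a single category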
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack

encoder = OneHotEncoder(sparse_output=True)

genre_encoded_sparse = encoder.fit_transform(filtered_df[['Genres']].explode('Genres'))
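
If the goal is for two titles to share genre features whenever they share at least one genre, the comma-separated string has to be split before encoding. The following is a minimal plain-Python sketch of such a multi-hot encoding; `multi_hot_genres` is a hypothetical helper, and the three genre strings are taken from the first rows of the dataset shown earlier.

```python
def multi_hot_genres(genre_strings):
    """Split comma-separated genre strings and build one 0/1 column per genre."""
    split_rows = [
        [genre.strip() for genre in text.split(",") if genre.strip()]
        for text in genre_strings
    ]
    vocab = sorted({genre for row in split_rows for genre in row})
    column = {genre: index for index, genre in enumerate(vocab)}
    matrix = []
    for row in split_rows:
        vector = [0] * len(vocab)
        for genre in row:
            vector[column[genre]] = 1
        matrix.append(vector)
    return vocab, matrix

genres = [
    "Action, Award Winning, Sci-Fi",
    "Action, Sci-Fi",
    "Adventure, Fantasy, Supernatural",
]
genre_vocab, genre_matrix = multi_hot_genres(genres)
for vector in genre_matrix:
    print(vector)
```

With this encoding the first two rows overlap on Action and Sci-Fi, whereas one-hot encoding the whole string treats them as unrelated categories. scikit-learn's `MultiLabelBinarizer` offers a ready-made version of the same idea.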



In [10]:
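#include Studios in the recommendation

#filter out words from the similarity comparison
#note: this only blanks entries whose whole Studios value equals one of these words,
#and the resulting `studios` list is not passed to the encoder below
exclude_studios = ['Animation', 'Studio', 'UNKNOWN']
studios = [studio if studio not in exclude_studios else '' for studio in filtered_df['Studios']]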

studios_encoder = OneHotEncoder(sparse_output=True)

studios_encoded_sparse = studios_encoder.fit_transform(filtered_df[['Studios']].explode('Studios'))


In [11]:
#include Rating (PG, etc) in the recommendation
rating_encoder = OneHotEncoder(sparse_output=True)
rating_encoded_sparse = rating_encoder.fit_transform(filtered_df[['Rating']].explode('Rating'))

In [12]:
# weight each feature block by its importance
weight_synopsis = 3.0
weight_genres = 2.0
weight_studios = 1.0
weight_rating = 1.5

weighted_synopsis = weight_synopsis * synopsis_vectors
weighted_genres = weight_genres * genre_encoded_sparse
weighted_studios = weight_studios * studios_encoded_sparse
weighted_rating = weight_rating * rating_encoded_sparse


# combine the sparse matrices horizontally (hstack)
combined_sparse_matrix = hstack([weighted_synopsis, weighted_genres, weighted_studios, weighted_rating])

# display the combined sparse matrix
print("Combined Sparse Matrix:")
print(combined_sparse_matrix)


Combined Sparse Matrix:
  (0, 9808)	0.3209755583432745
  (0, 6175)	0.15805491682044426
  (0, 7343)	0.23433285738615384
  (0, 3051)	0.16115743904093013
  (0, 5987)	0.2435006534695524
  (0, 5085)	0.1284744239471969
  (0, 1500)	0.26589137108572464
  (0, 9376)	0.26300062253251594
  (0, 1862)	0.2077419711607731
  (0, 6604)	0.22873133029278017
  (0, 7423)	0.223816263959283
  (0, 5518)	0.29313594765598405
  (0, 2428)	0.31248852547728645
  (0, 5146)	0.163336363431806
  (0, 1991)	0.22896969341541484
  (0, 1338)	0.2268737012618897
  (0, 1671)	0.2690064253055781
  (0, 1351)	0.2463453158808384
  (0, 9790)	0.2109685575338856
  (0, 1094)	0.2690064253055781
  (0, 2318)	0.27928575081925683
  (0, 4481)	0.318679843907097
  (0, 2633)	0.3209755583432745
  (0, 1475)	0.20488646970792113
  (0, 6144)	0.24708911617982074
  :	:
  (18482, 8219)	1.1230494800226918
  (18482, 5822)	1.033018348543619
  (18482, 2749)	1.3996368158996542
  (18482, 9520)	1.0018106657958648
  (18482, 10494)	2.0
  (18482, 12149)	1.0
  (18

In [13]:
# Compute pairwise cosine similarity between all anime, using the combined weighted feature matrix
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(combined_sparse_matrix)
similarity_df = pd.DataFrame(similarity, 
                             index=filtered_df['Name'], 
                             columns=filtered_df['Name'])
similarity_df.head(10)

Name,Cowboy Bebop,Cowboy Bebop: Tengoku no Tobira,Trigun,Witch Hunter Robin,Bouken Ou Beet,Eyeshield 21,Hachimitsu to Clover,Hungry Heart: Wild Striker,Initial D Fourth Stage,Monster,...,Beauty and the Brawn,4 Week Lovers,"Die, Please!",Miru,Wo Mengjian ni Mengjian wo,Thailand,Energy,Wu Nao Monu,Bu Xing Si: Yuan Qi,Di Yi Xulie
Cowboy Bebop,1.0,0.290224,0.013966,0.088382,0.001004,0.011986,0.0,0.003452,0.0,0.006069,...,0.0,0.014172,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cowboy Bebop: Tengoku no Tobira,0.290224,1.0,0.0267,0.010819,0.002696,0.015679,0.00789,0.007841,0.005038,0.008427,...,0.0,0.003639,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Trigun,0.013966,0.0267,1.0,0.141856,0.008381,0.144653,0.140709,0.138462,0.138462,0.076583,...,0.0,0.024941,0.007177,0.0,0.0,0.138462,0.138462,0.138462,0.138462,0.138462
Witch Hunter Robin,0.088382,0.010819,0.141856,1.0,0.009193,0.222709,0.138462,0.147785,0.143097,0.0,...,0.002,0.00256,0.002575,0.0,0.0,0.138462,0.138462,0.138462,0.138462,0.138462
Bouken Ou Beet,0.001004,0.002696,0.008381,0.009193,1.0,0.038466,0.001348,0.0,0.0,0.005868,...,0.0,0.002842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Eyeshield 21,0.011986,0.015679,0.144653,0.222709,0.038466,1.0,0.146615,0.147616,0.149528,0.007006,...,0.003052,0.029993,0.006829,0.0,0.0,0.138462,0.138462,0.138462,0.138462,0.138462
Hachimitsu to Clover,0.0,0.00789,0.140709,0.138462,0.001348,0.146615,1.0,0.138462,0.156013,0.0,...,0.007798,0.018634,0.0,0.0,0.0,0.138462,0.138462,0.138462,0.138462,0.138462
Hungry Heart: Wild Striker,0.003452,0.007841,0.138462,0.147785,0.0,0.147616,0.138462,1.0,0.154368,0.0,...,0.0,0.007718,0.016633,0.0,0.0,0.138462,0.138462,0.138462,0.138462,0.138462
Initial D Fourth Stage,0.0,0.005038,0.138462,0.143097,0.0,0.149528,0.156013,0.154368,1.0,0.0,...,0.0,0.004035,0.012452,0.0,0.0,0.138462,0.138462,0.138462,0.138462,0.138462
Monster,0.006069,0.008427,0.076583,0.0,0.005868,0.007006,0.0,0.0,0.0,1.0,...,0.0,0.00999,0.012997,0.0,0.0,0.0,0.0,0.0,0.0,0.0
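
Because cosine similarity is computed on the concatenated vector, each block weight enters the score squared. As a check, assume every TF-IDF row has unit length (the `TfidfVectorizer` default L2 normalization) and every one-hot block contributes a single 1. Two titles that share only their rating then score 1.5² / (3² + 2² + 1² + 1.5²) = 2.25 / 16.25 ≈ 0.138462, which matches the value that recurs in the table above. The sketch below reproduces that arithmetic in plain Python with made-up two-dimensional blocks; the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def weighted_concat(blocks, weights):
    """Scale each feature block by its weight and concatenate the results."""
    combined = []
    for block, weight in zip(blocks, weights):
        combined.extend(weight * value for value in block)
    return combined

# Block order and weights as in the cell above: synopsis, genres, studios, rating.
weights = [3.0, 2.0, 1.0, 1.5]

# Two hypothetical titles that differ in synopsis, genre and studio but share a rating.
title_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0]]
title_b = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0]]

shared_rating_only = cosine(
    weighted_concat(title_a, weights), weighted_concat(title_b, weights)
)
print(round(shared_rating_only, 6))
```

A practical consequence is that a shared rating alone gives a non-trivial baseline similarity, so many unrelated titles tie at the same score.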


In [14]:
# anime list 
anime_list = similarity_df.columns.values


# sample anime
anime = 'Death Note'

# number of top recommendations to return
top_n = 10

# get anime similarity records
anime_sim = similarity_df[similarity_df.index == anime].values[0]

# get animes sorted by similarity
sorted_anime_ids = np.argsort(anime_sim)[::-1]

# get recommended anime names
recommended_anime = anime_list[sorted_anime_ids[1:top_n+1]]

print('\n\nTop Recommended Anime for:', anime, 'are:-\n', recommended_anime)



Top Recommended Anime for: Death Note are:-
 ['Death Note: Rewrite' 'Munou na Nana' 'Death Parade'
 'Majin Tantei Nougami Neuro' 'Warau Salesman Special Program'
 'Rainbow: Nisha Rokubou no Shichinin' 'Kamisama no Inai Nichiyoubi'
 'Hunter x Hunter Movie 2: The Last Mission' 'ChäoS;HEAd'
 'Touhai Densetsu Akagi: Yami ni Maiorita Tensai']
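
The cell above skips the first sorted position on the assumption that it is the query title itself. When another title ties with a similarity of 1.0, that assumption may not hold. A minimal plain-Python sketch that excludes the query by name instead is shown below; `top_n_similar` and the toy scores are hypothetical.

```python
def top_n_similar(title, titles, scores, n=10):
    """Return the n titles with the highest scores, excluding `title` itself."""
    candidates = [index for index, name in enumerate(titles) if name != title]
    # sorted() is stable, so tied scores keep their original column order.
    ranked = sorted(candidates, key=lambda index: scores[index], reverse=True)
    return [titles[index] for index in ranked[:n]]

# Hypothetical similarity row for the query "A"; "A: Recap" ties with the query itself.
toy_titles = ["A", "A: Recap", "B", "C"]
toy_scores = [1.0, 1.0, 0.2, 0.6]

top_two = top_n_similar("A", toy_titles, toy_scores, n=2)
print(top_two)
```

Excluding by name keeps the query out of its own recommendations regardless of how ties are ordered.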


In [15]:
def content_anime_recommender(
    input_anime, similarity_database=similarity_df, anime_database_list=anime_list, top_n=10):
    
    # get anime similarity records
    anime_sim = similarity_database[similarity_database.index == input_anime].values[0]
    
    # get anime sorted by similarity
    sorted_anime_ids = np.argsort(anime_sim)[::-1]
    
    # get recommended anime names
    recommended_anime = anime_database_list[sorted_anime_ids[1:top_n+1]]
    
    print('\n\nTop Recommended Anime for:', input_anime, 'are:-\n', recommended_anime)

sample_anime = ['Death Note', 'Cowboy Bebop', 'Bleach', 
                 'Fruits Basket', 'Monster']
                 
for i in sample_anime:
    content_anime_recommender(i)



Top Recommended Anime for: Death Note are:-
 ['Death Note: Rewrite' 'Munou na Nana' 'Death Parade'
 'Majin Tantei Nougami Neuro' 'Warau Salesman Special Program'
 'Rainbow: Nisha Rokubou no Shichinin' 'Kamisama no Inai Nichiyoubi'
 'Hunter x Hunter Movie 2: The Last Mission' 'ChäoS;HEAd'
 'Touhai Densetsu Akagi: Yami ni Maiorita Tensai']


Top Recommended Anime for: Cowboy Bebop are:-
 ['Koukaku Kidoutai: Stand Alone Complex' 'Cowboy Bebop: Tengoku no Tobira'
 'Hate no issen EPISODE ZERO' 'SSSS.Gridman'
 'Cowboy Bebop: Yose Atsume Blues' 'Kidou Senshi Gundam SEED'
 'Kidou Senshi Gundam 00 Second Season' 'Kakumeiki Valvrave'
 'Double Decker! Doug & Kirill' 'Uchuu no Senshi']


Top Recommended Anime for: Bleach are:-
 ['Bleach Movie 3: Fade to Black - Kimi no Na wo Yobu'
 'Bleach Movie 1: Memories of Nobody' 'Bleach Movie 4: Jigoku-hen'
 'Bleach: The Sealed Sword Frenzy' 'Bleach: Sennen Kessen-hen'
 'Bleach Movie 2: The DiamondDust Rebellion - Mou Hitotsu no Hyourinmaru'
 'Juuni Kokuki

In [16]:
df_users_ratings = pd.read_csv('../data/users-score-2023.csv')
df_users_ratings[df_users_ratings['Anime Title'] == 'Death Note'].head()
print(df_users_ratings.shape)

(24325191, 5)


### Deriving ground truth using threshold-based approach

In [17]:
# relevance threshold for the ground truth: titles with an average rating above 7 count as liked
threshold = 7

sample_size = 10000

# take sample from df_users_ratings
sample_data = df_users_ratings.sample(n=sample_size, random_state=42)  #set random_state for reproducibility

#create ground truth based on the threshold
avg_ratings = sample_data.groupby('Anime Title')['rating'].mean()
print(avg_ratings)

# Keep titles whose average rating is greater than the threshold
liked_anime = avg_ratings[avg_ratings > threshold].index.tolist()

# group by anime and create the ground_truths dictionary
# note: every title in df_users_ratings is mapped to the same liked_anime list
ground_truths = df_users_ratings.groupby('Anime Title')['Anime Title'].apply(lambda x: liked_anime).to_dict()


# print the dictionary keys (every title in df_users_ratings, not only the liked ones)
print (set(ground_truths))

Anime Title
"Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi     8.000000
"Bungaku Shoujo" Movie                        7.750000
"Oshi no Ko"                                  8.500000
.hack//G.U. Returner                          7.000000
.hack//G.U. Trilogy: Parody Mode              6.000000
                                               ...    
xxxHOLiC                                      8.714286
xxxHOLiC Movie: Manatsu no Yoru no Yume       6.833333
xxxHOLiC Rou                                 10.000000
xxxHOLiC Shunmuki                             7.500000
xxxHOLiC◆Kei                                  7.333333
Name: rating, Length: 3148, dtype: float64
{'Tottemo! Luckyman', 'Norakuro Nitouhei: Kyouren no Maki', 'Doukyuusei 2 (OVA)', 'Kingdom 3rd Season', 'Kaitou Lupin: 813 no Nazo', 'Dinner Bell', 'Momoya Norihei Anime CM', 'Tales of Crestoria: Toga Waga wo Shoite Kare wa Tatsu', 'AI no Idenshi', 'Rifle Is Beautiful', 'Love Letter (Music)', 'Overlord III', 'Koi☆Sento', 'Kyoto Animation:
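
Since every key of `ground_truths` maps to the same list, the ground truth can be kept as a single set of liked titles. A minimal plain-Python sketch of the threshold step follows; `liked_titles` is a hypothetical helper and the ratings are made up for illustration.

```python
from collections import defaultdict

def liked_titles(ratings, threshold=7):
    """Return the titles whose mean rating is strictly above `threshold`.

    `ratings` is an iterable of (title, rating) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of ratings, count]
    for title, rating in ratings:
        totals[title][0] += rating
        totals[title][1] += 1
    return {
        title
        for title, (rating_sum, count) in totals.items()
        if rating_sum / count > threshold
    }

# Hypothetical (title, rating) pairs.
sample_ratings = [
    ("Death Note", 9), ("Death Note", 8),
    ("Pet", 7), ("Pet", 6),
    ("Trigun", 8),
]
liked = liked_titles(sample_ratings)
print(sorted(liked))
```

Membership tests against this set then check whether a title is liked, rather than whether it appears in the ratings data at all.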

In [26]:
#precision at n: measures the proportion of relevant items among the top n
def content_anime_recommender2(
    top_n, input_anime, ground_truths, similarity_database=similarity_df, anime_database_list=anime_list):
    
    # get anime similarity records
    anime_sim = similarity_database[similarity_database.index == input_anime].values[0]
    
    # get anime sorted by similarity
    sorted_anime_ids = np.argsort(anime_sim)[::-1]
    
    # get recommended anime names
    recommended_anime = anime_database_list[sorted_anime_ids[1:top_n+1]]
    
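    # calculate precision at n
    # note: set(ground_truths) is the set of dict keys, i.e. every title in the ratings data,
    # so this counts recommendations that have any user rating rather than only liked titles
    intersection = set(recommended_anime) & set(ground_truths)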
    precision_at_n = len(intersection) / top_n
    rounded_precision = round(precision_at_n, 2)
    
    print('\n\nTop Recommended Anime for ', input_anime, recommended_anime)
    print('\nPrecision at', top_n, '=', rounded_precision)

#select 2 highly rated and 2 lower rated titles
sample_anime2 = ['Death Note', 'Jigoku Shoujo Mitsuganae', 'Xian Yu Ge', 'Higenashi Gogejabaru']

for i in sample_anime2:
    content_anime_recommender2(10, i, ground_truths)



Top Recommended Anime for  Death Note ['Death Note: Rewrite' 'Munou na Nana' 'Death Parade'
 'Majin Tantei Nougami Neuro' 'Warau Salesman Special Program'
 'Rainbow: Nisha Rokubou no Shichinin' 'Kamisama no Inai Nichiyoubi'
 'Hunter x Hunter Movie 2: The Last Mission' 'ChäoS;HEAd'
 'Touhai Densetsu Akagi: Yami ni Maiorita Tensai']

Precision at 10 = 1.0


Top Recommended Anime for  Jigoku Shoujo Mitsuganae ['Jigoku Shoujo: Yoi no Togi' 'Jigoku Shoujo' '18if' "Le Chevalier D'Eon"
 'Pet' 'Un-Go: Inga-ron' 'Jigoku Shoujo Futakomori' 'Tactics'
 'Muhyo to Rouji no Mahouritsu Soudan Jimusho 2nd Season'
 'Muhyo to Rouji no Mahouritsu Soudan Jimusho']

Precision at 10 = 1.0


Top Recommended Anime for  Xian Yu Ge ['Mao Zhi Ming Episode 5.5' 'Chinkoroheibei Tamatebako'
 'Atelier Petros Joukuu Gekijou: Sentaku Shima no Sentaku Tori'
 'Minna Tomodachi' 'Seaside-sou no Aquakko'
 'Qin Shi Mingyue: Xiao Chuangjianghu' 'Nulu-chan to Boku'
 'Kaeru Ouji to Imomushi Henry' 'Higenashi Gogejabaru' 'Happ
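
For comparison, here is a minimal plain-Python sketch of precision at n measured against a set of liked titles rather than against the dictionary keys. `precision_at_n` is a hypothetical helper, and the `relevant` set below is invented for illustration; it is not derived from the ratings data.

```python
def precision_at_n(recommended, relevant, n):
    """Fraction of the first n recommendations that appear in `relevant`."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = list(recommended)[:n]
    hits = sum(1 for title in top if title in relevant)
    return hits / n

# Recommendation order taken from the Death Note output above, plus one extra title.
recommended = ["Death Note: Rewrite", "Munou na Nana", "Death Parade", "Pet"]
# Hypothetical liked set (illustrative only).
relevant = {"Death Parade", "Pet", "Monster"}

p_at_4 = precision_at_n(recommended, relevant, 4)
p_at_2 = precision_at_n(recommended, relevant, 2)
print(p_at_4, p_at_2)
```

Dividing by n rather than by the number of returned items matches the function above, so a list shorter than n is penalized.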

In [27]:
for i in sample_anime2:
    content_anime_recommender2(5, i, ground_truths)



Top Recommended Anime for  Death Note ['Death Note: Rewrite' 'Munou na Nana' 'Death Parade'
 'Majin Tantei Nougami Neuro' 'Warau Salesman Special Program']

Precision at 5 = 1.0


Top Recommended Anime for  Jigoku Shoujo Mitsuganae ['Jigoku Shoujo: Yoi no Togi' 'Jigoku Shoujo' '18if' "Le Chevalier D'Eon"
 'Pet']

Precision at 5 = 1.0


Top Recommended Anime for  Xian Yu Ge ['Mao Zhi Ming Episode 5.5' 'Chinkoroheibei Tamatebako'
 'Atelier Petros Joukuu Gekijou: Sentaku Shima no Sentaku Tori'
 'Minna Tomodachi' 'Seaside-sou no Aquakko']

Precision at 5 = 0.6


Top Recommended Anime for  Higenashi Gogejabaru ['Gokuu no Daibouken Pilot' 'Kaijuu no Ballad' 'Nulu-chan to Boku'
 "DS Anime Soushuuhen '98" 'Xian Yu Ge']

Precision at 5 = 0.8


In [30]:
for i in sample_anime2:
    content_anime_recommender2(15, i, ground_truths)



Top Recommended Anime for  Death Note ['Death Note: Rewrite' 'Munou na Nana' 'Death Parade'
 'Majin Tantei Nougami Neuro' 'Warau Salesman Special Program'
 'Rainbow: Nisha Rokubou no Shichinin' 'Kamisama no Inai Nichiyoubi'
 'Hunter x Hunter Movie 2: The Last Mission' 'ChäoS;HEAd'
 'Touhai Densetsu Akagi: Yami ni Maiorita Tensai' 'Btooom!'
 'Juubee Ninpuuchou: Ryuuhougyoku-hen' 'Boogiepop wa Warawanai (2019)'
 'Vampire Hunter D (2000)' 'Vampire Hunter']

Precision at 15 = 1.0


Top Recommended Anime for  Jigoku Shoujo Mitsuganae ['Jigoku Shoujo: Yoi no Togi' 'Jigoku Shoujo' '18if' "Le Chevalier D'Eon"
 'Pet' 'Un-Go: Inga-ron' 'Jigoku Shoujo Futakomori' 'Tactics'
 'Muhyo to Rouji no Mahouritsu Soudan Jimusho 2nd Season'
 'Muhyo to Rouji no Mahouritsu Soudan Jimusho'
 'Honto ni Atta Gakkou Kaidan' 'Saint Luminous Jogakuin' 'Sakurada Reset'
 'xxxHOLiC Rou' 'Mahoutsukai no Yoru']

Precision at 15 = 0.93


Top Recommended Anime for  Xian Yu Ge ['Mao Zhi Ming Episode 5.5' 'Chinkoroheibei T