# Cosine TF-IDF (Term Frequency-Inverse Document Frequency) similarity

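TF-IDF measures how important a term is to a document. Term frequency (TF) counts how often the term appears in that document, and inverse document frequency (IDF) down-weights terms that appear in many documents across the collection.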

The TF-IDF score is the product TF × IDF. A higher score means the term is frequent in the document but rare across the collection, so it is more distinctive for that document.
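
As a rough illustration of the TF × IDF product, the sketch below computes the textbook form of the score on a toy corpus. The three synopses are made up for the example, and the formula (`count / length * log(N / df)`) is an assumption chosen for readability. scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical synopses, not taken from the dataset).
docs = [
    "space bounty hunter chases space criminals",
    "high school students join a music club",
    "bounty hunter wanders the desert",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: (count / document length) * log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

score_space = tf_idf("space", tokenized[0])    # appears twice, in one document only
score_bounty = tf_idf("bounty", tokenized[0])  # appears once, in two documents

print(f"space:  {score_space:.3f}")
print(f"bounty: {score_bounty:.3f}")
```

"space" scores higher than "bounty" in the first synopsis because it is both more frequent there and rarer across the corpus.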

After calculating the TF-IDF scores, each document is represented as a vector of term weights. The similarity between two documents is the cosine of the angle between their vectors.
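
A minimal sketch of that cosine computation, assuming three hypothetical TF-IDF vectors over a four-term vocabulary (the values are illustrative, not derived from the dataset). The helper name `cosine` is introduced here for the example only.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0

# Hypothetical TF-IDF vectors over a four-term vocabulary.
doc_a = np.array([0.5, 0.0, 0.8, 0.0])
doc_b = np.array([0.4, 0.0, 0.6, 0.1])  # shares most terms with doc_a
doc_c = np.array([0.0, 0.9, 0.0, 0.3])  # shares no terms with doc_a

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)

print(round(sim_ab, 3), round(sim_ac, 3))
```

Documents with no terms in common have a similarity of 0, and a document compared with itself has a similarity of 1, which is why the diagonal of the similarity matrix later in the notebook is all ones.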

In [33]:
import pandas as pd
import numpy as np

df_anime = pd.read_csv('../data/anime-dataset-2023.csv')
df_anime.head()

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",...,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...


In [34]:
df_anime.shape

(24905, 24)

In [35]:
#basic filtering for duplicates

duplicates_all = df_anime[df_anime.duplicated()]
print("All Duplicates:")
print(len(duplicates_all))

duplicates = df_anime[df_anime.duplicated(['Name'])].sort_values(by='Name')
print("Duplicates based on Name:")
print(len(duplicates))
duplicates = duplicates[['anime_id', 'Name']]
print(duplicates)

df_anime_new = df_anime.drop_duplicates(['Name'])
print("Cleaned anime shape: {} \n".format(df_anime_new.shape))
print("Old anime shape: {}".format(df_anime.shape))

All Duplicates:
0
Duplicates based on Name:
4
       anime_id       Name
24840     55658  Awakening
24586     55351  Azur Lane
24807     55610   Souseiki
24781     55582     Utopia
Cleaned anime shape: (24901, 24) 

Old anime shape: (24905, 24)


In [36]:
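# exclude titles tagged with the Hentai genre
# note: this filters df_anime, so the Name de-duplication stored in df_anime_new is not carried forward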
to_exclude = df_anime[df_anime['Genres'].str.contains('Hentai', case=False, na=False)]
filtered_df = df_anime[~df_anime.index.isin(to_exclude.index)]
filtered_df.shape

(23419, 24)

In [37]:
# Lowercase the Name column. Series.replace(' ', '') only matches values that are exactly ' ',
# so spaces inside titles are kept; removing them would need .str.replace(' ', '').
filtered_df['Processed_Name'] = filtered_df['Name'].str.lower().replace(' ', '')

# Select rows whose lowercased name is duplicated, plus non-duplicated names that contain no space
duplicate_rows = filtered_df[filtered_df.duplicated(subset='Processed_Name', keep=False) | ~filtered_df.duplicated(subset='Processed_Name', keep=False) & ~filtered_df['Processed_Name'].str.contains(' ')]

print(duplicate_rows)
# Intended to keep e.g. Death Note over DEATHNOTE; as written, keep=False marks every copy,
# so all rows that share a lowercased name are dropped
filtered_df = filtered_df[~((filtered_df['Processed_Name'].isin(duplicate_rows['Processed_Name'])) & (filtered_df.duplicated(subset='Processed_Name', keep=False)))]

# Drop the intermediate 'Processed_Name' column
filtered_df = filtered_df.drop(columns='Processed_Name')

filtered_df.shape

       anime_id         Name English name   Other name    Score  \
2             6       Trigun       Trigun        トライガン     8.22   
9            19      Monster      Monster        モンスター     8.87   
10           20       Naruto       Naruto          ナルト     7.99   
15           25    Sunabouzu  Desert Punk         砂ぼうず     7.38   
16           26   Texhnolyze   Texhnolyze   TEXHNOLYZE     7.76   
...         ...          ...          ...          ...      ...   
24880     55707       Kokoro      UNKNOWN            心  UNKNOWN   
24885     55716  Mechronicle  Mechronicle  Mechronicle  UNKNOWN   
24896     55727         Miru      UNKNOWN           未ル  UNKNOWN   
24898     55729     Thailand      UNKNOWN     Thailand  UNKNOWN   
24899     55730       Energy      UNKNOWN       Energy  UNKNOWN   

                                         Genres  \
2                     Action, Adventure, Sci-Fi   
9                      Drama, Mystery, Suspense   
10                   Action, Adventure, Fa

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['Processed_Name'] = filtered_df['Name'].str.lower().replace(' ', '')


(23409, 24)
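
The intent described in the comments above (treating variants such as Death Note and DEATHNOTE as one title and keeping a single row) could be sketched as follows. This is an illustration on made-up rows, not the notebook's method: the `toy` frame, the `normalized` key, and the choice to keep the first occurrence are all assumptions.

```python
import pandas as pd

# Hypothetical titles; 'DEATHNOTE' stands in for a casing/spacing variant.
toy = pd.DataFrame({
    "anime_id": [1, 2, 3, 4],
    "Name": ["Death Note", "DEATHNOTE", "Monster", "Trigun"],
})

# .str.replace works on substrings inside each title, whereas
# Series.replace(' ', '') only matches values that are exactly ' '.
normalized = toy["Name"].str.lower().str.replace(" ", "", regex=False)

# keep='first' retains the earliest row of each group of matching keys.
# .copy() gives an independent frame, so later column assignments
# do not raise SettingWithCopyWarning.
deduped = toy.loc[~normalized.duplicated(keep="first")].copy()

print(deduped["Name"].tolist())
```

Keeping the key in a separate Series also avoids adding and then dropping a temporary column on the filtered frame.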

In [38]:
#drop rows with unknown genres
unknown_rows = filtered_df[filtered_df['Genres'].str.lower() == 'unknown']
filtered_df = filtered_df.drop(unknown_rows.index)
filtered_df.shape

(18486, 24)

## Create TF-IDF Matrix and Encoders

In [39]:
#create the tf-idf matrix for text comparison
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
synopsis_vectors = tfidf.fit_transform(filtered_df['Synopsis'])

In [40]:
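#use one-hot encoder to include genre in the recommendation
#note: explode() leaves comma-separated strings unchanged, so each distinct genre combination becomes one category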
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack

encoder = OneHotEncoder(sparse_output=True)

genre_encoded_sparse = encoder.fit_transform(filtered_df[['Genres']].explode('Genres'))
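
Because `Genres` holds comma-separated strings rather than lists, one alternative is a multi-hot encoding with one column per individual genre, so that titles sharing some but not all genres still overlap. The sketch below uses pandas' `Series.str.get_dummies` on made-up rows; the `genres` Series and the `", "` separator are assumptions based on the format shown in `df_anime.head()`.

```python
import pandas as pd

# Hypothetical rows using the same comma-separated format as the Genres column.
genres = pd.Series(
    ["Action, Sci-Fi", "Action, Adventure, Sci-Fi", "Drama, Mystery"],
    name="Genres",
)

# One indicator column per individual genre, so overlapping genre lists share columns.
genre_multi_hot = genres.str.get_dummies(sep=", ")

print(genre_multi_hot)
```

The result is a dense DataFrame; wrapping its values in `scipy.sparse.csr_matrix` would make it usable with `hstack` alongside the TF-IDF matrix.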



In [41]:
#include Studios in the recommendation

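#blank out placeholder studio values before encoding
#note: the comparison is against the whole Studios string, and `studios` is not passed to the encoder below
exclude_studios = ['Animation', 'Studio', 'UNKNOWN']
studios = [studio if studio not in exclude_studios else '' for studio in filtered_df['Studios']]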

studios_encoder = OneHotEncoder(sparse_output=True)

studios_encoded_sparse = studios_encoder.fit_transform(filtered_df[['Studios']].explode('Studios'))


In [42]:
# apply weights of importance to feature
weight_synopsis = 2.0
weight_genres = 3.0
weight_studios = 1.0

weighted_synopsis = weight_synopsis * synopsis_vectors
weighted_genres = weight_genres * genre_encoded_sparse
weighted_studios = weight_studios * studios_encoded_sparse

# combine the sparse matrices horizontally (hstack)
combined_sparse_matrix = hstack([weighted_synopsis, weighted_genres, weighted_studios])

# display the combined sparse matrix
print("Combined Sparse Matrix:")
print(combined_sparse_matrix)


Combined Sparse Matrix:
  (0, 44873)	0.1894713513833813
  (0, 28966)	0.09329956099252311
  (0, 33878)	0.13832630556560896
  (0, 12900)	0.09513097482617006
  (0, 27918)	0.14373804072106613
  (0, 23360)	0.07583824403677338
  (0, 6799)	0.15695524500626815
  (0, 43046)	0.1552488408248624
  (0, 8000)	0.12262974856416774
  (0, 31010)	0.13501973320953856
  (0, 24511)	0.22377604151932118
  (0, 34191)	0.1321183775264036
  (0, 25364)	0.1730376743578338
  (0, 10306)	0.1844614696507848
  (0, 23591)	0.09641719035931648
  (0, 8463)	0.1351604385741422
  (0, 6083)	0.13392317780616605
  (0, 7417)	0.1587940564588116
  (0, 6125)	0.14541724032765874
  (0, 44823)	0.12453439726584556
  (0, 4927)	0.1587940564588116
  (0, 9838)	0.1648619256337705
  (0, 8150)	0.22377604151932118
  (0, 44457)	0.24020971854486872
  (0, 4538)	0.24020971854486872
  :	:
  (18482, 34705)	0.9646242154530061
  (18482, 37622)	0.9646242154530061
  (18482, 18035)	0.8986305374960677
  (18482, 21494)	0.7407527119264334
  (18482, 37964)	0.4
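
To see how the weights influence the final similarity, the sketch below applies two weightings to a pair of hypothetical titles with identical genres and unrelated synopses. Scaling a feature block by a weight w scales its contribution to the dot product and to the squared norm by w², so raising the genre weight pulls the cosine toward the genre agreement. The vectors and the helper `combined_similarity` are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical feature blocks for two titles: identical genres, unrelated synopses.
synopsis_a, synopsis_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
genres_a, genres_b = np.array([1.0, 0.0]), np.array([1.0, 0.0])

def combined_similarity(w_synopsis, w_genres):
    """Weight each block, concatenate the blocks, and compare the two titles."""
    a = np.hstack([w_synopsis * synopsis_a, w_genres * genres_a])
    b = np.hstack([w_synopsis * synopsis_b, w_genres * genres_b])
    return cosine(a, b)

equal_weights = combined_similarity(1.0, 1.0)  # 1 / 2
genre_heavy = combined_similarity(2.0, 3.0)    # 9 / 13

print(round(equal_weights, 3), round(genre_heavy, 3))
```

With the 2.0 and 3.0 weights, the matching genres contribute 9 of the 13 units of squared norm, so the similarity rises from 0.5 to roughly 0.69 even though the synopses share nothing.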

In [43]:
# Compute cosine similarity between the combined feature vectors (synopsis, genres, studios) of all anime
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(combined_sparse_matrix)
similarity_df = pd.DataFrame(similarity, 
                             index=filtered_df['Name'], 
                             columns=filtered_df['Name'])
similarity_df.head(10)

Name,Cowboy Bebop,Cowboy Bebop: Tengoku no Tobira,Trigun,Witch Hunter Robin,Bouken Ou Beet,Eyeshield 21,Hachimitsu to Clover,Hungry Heart: Wild Striker,Initial D Fourth Stage,Monster,...,Beauty and the Brawn,4 Week Lovers,"Die, Please!",Miru,Wo Mengjian ni Mengjian wo,Thailand,Energy,Wu Nao Monu,Bu Xing Si: Yuan Qi,Di Yi Xulie
Cowboy Bebop,1.0,0.075336,0.005578,0.083021,0.000422,0.004759,0.0,0.001451,0.0,0.00266,...,0.0,0.003882,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cowboy Bebop: Tengoku no Tobira,0.075336,1.0,0.010711,0.004693,0.001139,0.006253,0.003151,0.003311,0.002216,0.003709,...,0.0,0.001001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Trigun,0.005578,0.010711,1.0,0.001447,0.003481,0.002427,0.000882,0.0,0.0,0.077939,...,0.0,0.006746,0.002366,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Witch Hunter Robin,0.083021,0.004693,0.001447,1.0,0.004129,0.035714,0.0,0.004185,0.002167,0.0,...,0.000499,0.000749,0.000918,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bouken Ou Beet,0.000422,0.001139,0.003481,0.004129,1.0,0.015882,0.000557,0.0,0.0,0.002674,...,0.0,0.00081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Eyeshield 21,0.004759,0.006253,0.002427,0.035714,0.015882,1.0,0.003182,0.003777,0.004757,0.003014,...,0.0007,0.008065,0.002238,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hachimitsu to Clover,0.0,0.003151,0.000882,0.0,0.000557,0.003182,1.0,0.0,0.007555,0.0,...,0.00179,0.005017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hungry Heart: Wild Striker,0.001451,0.003311,0.0,0.004185,0.0,0.003777,0.0,1.0,0.007239,0.0,...,0.0,0.002197,0.005771,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Initial D Fourth Stage,0.0,0.002216,0.0,0.002167,0.0,0.004757,0.007555,0.007239,1.0,0.0,...,0.0,0.001197,0.004501,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Monster,0.00266,0.003709,0.077939,0.0,0.002674,0.003014,0.0,0.0,0.0,1.0,...,0.0,0.002965,0.004701,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
# anime list 
anime_list = similarity_df.columns.values


# sample anime
anime = 'Death Note'

# number of recommendations to return
top_n = 10

# get anime similarity records
anime_sim = similarity_df[similarity_df.index == anime].values[0]

# get animes sorted by similarity
sorted_anime_ids = np.argsort(anime_sim)[::-1]

# get recommended anime names
recommended_anime = anime_list[sorted_anime_ids[1:top_n+1]]

print('\n\nTop Recommended Anime for:', anime, 'are:-\n', recommended_anime)



Top Recommended Anime for: Death Note are:-
 ['Death Note: Rewrite' 'Munou na Nana' 'Warau Salesman Special Program'
 'Otogi Juushi Akazukin OVA'
 'Nezumi Monogatari: George to Gerald no Bouken' 'Death Parade'
 'MapleStory' 'Majin Tantei Nougami Neuro'
 'Hiroshima ni Ichiban Densha ga Hashitta' 'Kamisama no Inai Nichiyoubi']
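
Slicing `sorted_anime_ids[1:top_n+1]` assumes the query title always ranks first. If another title has an identical feature vector, the tie could place the query elsewhere in the ordering. One alternative, sketched here on a made-up 4×4 similarity matrix, is to drop the query by label before ranking. The function name `top_similar` and the toy titles are assumptions for illustration.

```python
import pandas as pd

# Hypothetical similarity matrix, indexed by title on both axes.
titles = ["A", "B", "C", "D"]
sim = pd.DataFrame(
    [[1.0, 0.2, 0.9, 0.4],
     [0.2, 1.0, 0.1, 0.3],
     [0.9, 0.1, 1.0, 0.5],
     [0.4, 0.3, 0.5, 1.0]],
    index=titles, columns=titles,
)

def top_similar(title, similarity, n=2):
    """Return the n most similar titles, excluding the query itself."""
    scores = similarity.loc[title].drop(labels=title)
    return scores.nlargest(n).index.tolist()

print(top_similar("A", sim))
```

This version relies on unique index labels, so it assumes duplicate names have already been removed from the frame.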


In [45]:
def content_anime_recommender(
    input_anime, similarity_database=similarity_df, anime_database_list=anime_list, top_n=10):
    
    # get anime similarity records
    anime_sim = similarity_database[similarity_database.index == input_anime].values[0]
    
    # get anime sorted by similarity
    sorted_anime_ids = np.argsort(anime_sim)[::-1]
    
    # get recommended anime names
    recommended_anime = anime_database_list[sorted_anime_ids[1:top_n+1]]
    
    print('\n\nTop Recommended Anime for:', input_anime, 'are:-\n', recommended_anime)

sample_anime = ['Death Note', 'Cowboy Bebop', 'Bleach', 
                 'Fruits Basket', 'Monster']
                 
for i in sample_anime:
    content_anime_recommender(i)



Top Recommended Anime for: Death Note are:-
 ['Death Note: Rewrite' 'Munou na Nana' 'Warau Salesman Special Program'
 'Otogi Juushi Akazukin OVA'
 'Nezumi Monogatari: George to Gerald no Bouken' 'Death Parade'
 'MapleStory' 'Majin Tantei Nougami Neuro'
 'Hiroshima ni Ichiban Densha ga Hashitta' 'Kamisama no Inai Nichiyoubi']


Top Recommended Anime for: Cowboy Bebop are:-
 ['Koukaku Kidoutai: Stand Alone Complex' 'Hate no issen EPISODE ZERO'
 'SSSS.Gridman' 'Cowboy Bebop: Ein no Natsuyasumi'
 'Cowboy Bebop: Yose Atsume Blues' 'Kidou Shinseiki Gundam X'
 'Seihou Bukyou Outlaw Star' 'Seihou Tenshi Angel Links'
 'Witch Hunter Robin' 'Kidou Senshi Gundam II: Ai Senshi-hen']


Top Recommended Anime for: Bleach are:-
 ['Bleach: Sennen Kessen-hen'
 'Bleach Movie 3: Fade to Black - Kimi no Na wo Yobu'
 'Bleach Movie 1: Memories of Nobody' 'Bleach Movie 4: Jigoku-hen'
 'Bleach: The Sealed Sword Frenzy'
 'Bleach Movie 2: The DiamondDust Rebellion - Mou Hitotsu no Hyourinmaru'
 'Juuni Kokuki' '

In [46]:
df_users_ratings = pd.read_csv('../data/users-score-2023.csv')
df_users_ratings[df_users_ratings['Anime Title'] == 'Death Note'].head()
print(df_users_ratings.shape)

(24325191, 5)


### Deriving ground truth using a threshold-based approach

In [47]:
# a title counts as relevant (liked) when its average rating in the sample is above 7
threshold = 7

sample_size = 10000

# take sample from df_users_ratings
sample_data = df_users_ratings.sample(n=sample_size, random_state=42)  #set random_state for reproducibility

#create ground truth based on the threshold
avg_ratings = sample_data.groupby('Anime Title')['rating'].mean()
print(avg_ratings)

# Keep titles whose average rating is greater than the threshold
liked_anime = avg_ratings[avg_ratings > threshold].index.tolist()

# map every title in the ratings file to the same liked_anime list
ground_truths = df_users_ratings.groupby('Anime Title')['Anime Title'].apply(lambda x: liked_anime).to_dict()


# print the dictionary keys (every rated title, not only the liked ones)
print(set(ground_truths))

Anime Title
"Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi     8.000000
"Bungaku Shoujo" Movie                        7.750000
"Oshi no Ko"                                  8.500000
.hack//G.U. Returner                          7.000000
.hack//G.U. Trilogy: Parody Mode              6.000000
                                               ...    
xxxHOLiC                                      8.714286
xxxHOLiC Movie: Manatsu no Yoru no Yume       6.833333
xxxHOLiC Rou                                 10.000000
xxxHOLiC Shunmuki                             7.500000
xxxHOLiC◆Kei                                  7.333333
Name: rating, Length: 3148, dtype: float64
{'Sekiei Ayakashi Mangatan', 'A.I.C.O. Incarnation', 'Bubuki Buranki', 'Soul Worker: Your Destiny Awaits', '12-sai. 2nd Season', 'Double Hard', 'Hakugei Densetsu', 'Fire Emblem Heroes: Chibi Playhouse', 'Hetalia: The World Twinkle Specials', 'Jin Hou Xiang Yao', 'Kamisama ni Natta Hi', 'PetoPeto-san', 'Obey Me!', 'X Densha de Ikou', 'Boue
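
Because `ground_truths` maps every rated title to the same `liked_anime` list, calling `set()` on the dictionary yields its keys, that is, every title in the ratings file rather than only the liked ones. A minimal sketch of precision at n that compares recommendations directly against a set of liked titles might look like this. The helper `precision_at_n` and the placeholder titles are assumptions for illustration, not values from the dataset.

```python
def precision_at_n(recommended, relevant, n=10):
    """Fraction of the top-n recommendations that appear in the relevant set.

    The denominator is n, so a list shorter than n is penalized.
    """
    top = list(recommended)[:n]
    if not top:
        return 0.0
    hits = sum(1 for title in top if title in relevant)
    return hits / n

# Hypothetical liked titles and recommendations.
liked = {"Title A", "Title C", "Title E"}
recommended = ["Title A", "Title B", "Title C", "Title D"]

score = precision_at_n(recommended, liked, n=4)  # 2 hits out of 4
print(score)
```

In the notebook's terms, `relevant` would be `set(liked_anime)`, so a recommendation only counts as a hit when its average sampled rating exceeds the threshold.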

In [48]:
#precision at n: measures the proportion of relevant items among the top n
def content_anime_recommender2(
    input_anime, ground_truths, similarity_database=similarity_df, anime_database_list=anime_list, top_n=10):
    
    # get anime similarity records
    anime_sim = similarity_database[similarity_database.index == input_anime].values[0]
    
    # get anime sorted by similarity
    sorted_anime_ids = np.argsort(anime_sim)[::-1]
    
    # get recommended anime names
    recommended_anime = anime_database_list[sorted_anime_ids[1:top_n+1]]
    
    # calculate precision at n
    # note: set(ground_truths) is the set of dictionary keys, i.e. every title in the ratings file
    intersection = set(recommended_anime) & set(ground_truths)
    precision_at_n = len(intersection) / top_n
    rounded_precision = round(precision_at_n, 2)
    
    print('\n\nTop Recommended Anime for ', input_anime, recommended_anime)
    print('\nPrecision at', top_n, '=', rounded_precision)

#select 2 highly rated and 2 lower rated titles
sample_anime2 = ['Death Note', 'Jigoku Shoujo Mitsuganae', 'Xian Yu Ge', 'Higenashi Gogejabaru']

for i in sample_anime2:
    content_anime_recommender2(i, ground_truths)



Top Recommended Anime for  Death Note ['Death Note: Rewrite' 'Munou na Nana' 'Warau Salesman Special Program'
 'Otogi Juushi Akazukin OVA'
 'Nezumi Monogatari: George to Gerald no Bouken' 'Death Parade'
 'MapleStory' 'Majin Tantei Nougami Neuro'
 'Hiroshima ni Ichiban Densha ga Hashitta' 'Kamisama no Inai Nichiyoubi']

Precision at 10 = 1.0


Top Recommended Anime for  Jigoku Shoujo Mitsuganae ['Tactics' 'Muhyo to Rouji no Mahouritsu Soudan Jimusho 2nd Season'
 'Muhyo to Rouji no Mahouritsu Soudan Jimusho'
 'Honto ni Atta Gakkou Kaidan' '18if' 'Sakurada Reset' 'xxxHOLiC Rou'
 'Saint Luminous Jogakuin' 'Mahoutsukai no Yoru' 'Kai Byoui Ramune']

Precision at 10 = 0.9


Top Recommended Anime for  Xian Yu Ge ['Friends: Mononoke Shima no Naki' 'Mao Zhi Ming Episode 5.5'
 'Chinkoroheibei Tamatebako'
 'Atelier Petros Joukuu Gekijou: Sentaku Shima no Sentaku Tori'
 'Minna Tomodachi' 'Seaside-sou no Aquakko'
 'Qin Shi Mingyue: Xiao Chuangjianghu' 'Nulu-chan to Boku'
 'Jian Yu Yuanzheng: Insom