# Cosine TF-IDF (Term Frequency-Inverse Document Frequency) similarity

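TF-IDF measures how important a term is to one document within a collection. Term frequency (TF) counts how often the term appears in the document, and inverse document frequency (IDF) down-weights terms that appear in many documents across the collection.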

The TF-IDF score is the product TF × IDF. A higher score means the term is more distinctive for that document: it occurs often there but rarely elsewhere.
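
To make the product concrete, here is a minimal sketch of the textbook formula on a made-up three-document corpus (the documents and the helper `tf_idf` are illustrative, not part of the dataset or notebook). Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from this plain variant.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the anime dataset).
docs = [
    "space bounty hunter chases bounty",
    "high school romance",
    "space pirates and romance",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: raw count in the document times log(N / df).

    Assumes the term occurs in at least one document of the corpus.
    """
    tf = tokens.count(term)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

bounty_score = tf_idf("bounty", tokenized[0])  # frequent here, absent elsewhere
space_score = tf_idf("space", tokenized[0])    # also appears in another document
print(bounty_score, space_score)
```

"bounty" scores higher than "space" for the first document because it occurs twice there and in no other document.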

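After computing the TF-IDF scores, each synopsis is represented as a vector with one weight per term, and the similarity between two synopses is the cosine of the angle between their vectors. Because TF-IDF weights are non-negative, this value ranges from 0 (no shared terms) to 1 (vectors pointing in the same direction).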
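
The cosine step can be sketched in plain Python. The vectors below are invented weights over a shared four-term vocabulary, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights, one position per vocabulary term.
doc_a = [0.0, 2.2, 0.4, 0.0]
doc_b = [0.0, 1.1, 0.2, 0.0]  # same direction as doc_a, half the length
doc_c = [1.5, 0.0, 0.0, 0.7]  # shares no terms with doc_a

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Vector length does not matter, only direction: `doc_b` is a scaled copy of `doc_a`, so their similarity is 1 (up to rounding), while `doc_a` and `doc_c` share no terms and score 0.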

In [1]:
import pandas as pd
import numpy as np

df_anime = pd.read_csv('../data/anime-dataset-2023.csv')
df_anime['Synopsis'].head()

0    Crime is timeless. By the year 2071, humanity ...
1    Another day, another bounty—such is the life o...
2    Vash the Stampede is the man with a $$60,000,0...
3    Robin Sena is a powerful craft user drafted in...
4    It is the dark century and the people are suff...
Name: Synopsis, dtype: object

In [4]:
df_anime.shape

(24905, 24)

In [32]:
#basic filtering for duplicates

duplicates_all = df_anime[df_anime.duplicated()]
print("All Duplicates:")
print(len(duplicates_all))

duplicates = df_anime[df_anime.duplicated(['Name'])].sort_values(by='Name')
print("Duplicates based on Name:")
print(len(duplicates))
duplicates = duplicates[['anime_id', 'Name']]
print(duplicates)

df_anime_new = df_anime.drop_duplicates(['Name'])
print("Cleaned anime shape: {} \n".format(df_anime_new.shape))
print("Old anime shape: {}".format(df_anime.shape))

All Duplicates:
0
Duplicates based on Name:
4
       anime_id       Name
24840     55658  Awakening
24586     55351  Azur Lane
24807     55610   Souseiki
24781     55582     Utopia
Cleaned anime shape: (24901, 24) 

Old anime shape: (24905, 24)


In [33]:
# Exclude titles tagged with the 'Hentai' genre.
# Note: this filters df_anime, so the name de-duplication in df_anime_new is not carried over.
filtered_df = df_anime[~df_anime['Genres'].str.contains('Hentai', case=False, na=False)].copy()
filtered_df.shape

(23419, 24)

In [35]:
# Lowercase the Name column. Series.replace(' ', '') only swaps cells that are exactly ' ',
# so spaces inside titles are kept; .str.replace(' ', '') would be needed to strip them.
filtered_df['Processed_Name'] = filtered_df['Name'].str.lower().replace(' ', '')

# Collect rows whose processed name is duplicated, plus unique rows whose processed name has no spaces
duplicate_rows = filtered_df[filtered_df.duplicated(subset='Processed_Name', keep=False) | ~filtered_df.duplicated(subset='Processed_Name', keep=False) & ~filtered_df['Processed_Name'].str.contains(' ')]

# Drop every row whose processed name occurs more than once. All copies are removed,
# and because spaces were not stripped, 'Death Note' and 'DEATHNOTE' are not matched as the same title.
filtered_df = filtered_df[~((filtered_df['Processed_Name'].isin(duplicate_rows['Processed_Name'])) & (filtered_df.duplicated(subset='Processed_Name', keep=False)))]

# Drop the intermediate 'Processed_Name' column
filtered_df = filtered_df.drop(columns='Processed_Name')

filtered_df.shape

(23409, 24)
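
Title normalisation is easy to get subtly wrong, because `Series.replace` and `Series.str.replace` behave differently. A minimal sketch on made-up titles (not rows from the dataset) shows the difference, along with one possible way to keep the first spelling of each title:

```python
import pandas as pd

# Hypothetical titles for illustration.
names = pd.Series(['Death Note', 'DEATHNOTE', 'Monster'])

# Series.replace matches whole cell values, so spaces inside a title survive.
whole_value = names.str.lower().replace(' ', '')

# Series.str.replace works on substrings and strips the spaces.
substring = names.str.lower().str.replace(' ', '', regex=False)

# With a space-free key, duplicated(keep='first') flags only the later copies.
deduplicated = names[~substring.duplicated(keep='first')]

print(whole_value.tolist())
print(substring.tolist())
print(deduplicated.tolist())
```

Here `keep='first'` retains whichever spelling appears first in the frame, so the row order decides which variant is kept.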

In [36]:
# Build the TF-IDF matrix from the synopses (one row per anime, one column per term)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(filtered_df['Synopsis'])

In [37]:
# Compute pairwise cosine similarity between all anime synopses
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(tfidf_matrix)
similarity_df = pd.DataFrame(similarity, 
                             index=filtered_df['Name'], 
                             columns=filtered_df['Name'])
similarity_df.head(10)

Name,Cowboy Bebop,Cowboy Bebop: Tengoku no Tobira,Trigun,Witch Hunter Robin,Bouken Ou Beet,Eyeshield 21,Hachimitsu to Clover,Hungry Heart: Wild Striker,Initial D Fourth Stage,Monster,...,"Die, Please!",Miru,Wo Mengjian ni Mengjian wo,Thailand,Energy,Wu Nao Monu,Bu Xing Si: Yuan Qi,Di Yi Xulie,Bokura no Saishuu Sensou,Shijuuku Nichi
Cowboy Bebop,1.0,0.264449,0.020127,0.041113,0.001561,0.017063,0.0,0.005259,0.0,0.009696,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cowboy Bebop: Tengoku no Tobira,0.264449,1.0,0.038074,0.016667,0.004182,0.022341,0.011285,0.011991,0.008144,0.013338,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Trigun,0.020127,0.038074,1.0,0.005106,0.012456,0.00889,0.003228,0.0,0.0,0.023628,...,0.008486,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Witch Hunter Robin,0.041113,0.016667,0.005106,1.0,0.014834,0.12298,0.0,0.014636,0.007969,0.0,...,0.003449,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bouken Ou Beet,0.001561,0.004182,0.012456,0.014834,1.0,0.056531,0.002034,0.0,0.0,0.009737,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Eyeshield 21,0.017063,0.022341,0.00889,0.12298,0.056531,1.0,0.01138,0.013734,0.017101,0.010976,...,0.008156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hachimitsu to Clover,0.0,0.011285,0.003228,0.0,0.002034,0.01138,1.0,0.0,0.025935,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hungry Heart: Wild Striker,0.005259,0.011991,0.0,0.014636,0.0,0.013734,0.0,1.0,0.026332,0.0,...,0.020187,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Initial D Fourth Stage,0.0,0.008144,0.0,0.007969,0.0,0.017101,0.025935,0.026332,1.0,0.0,...,0.016023,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Monster,0.009696,0.013338,0.023628,0.0,0.009737,0.010976,0.0,0.0,0.0,1.0,...,0.016344,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
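
The full matrix is dense: 23409 × 23409 entries at 8 bytes each is roughly 4.4 GB. When only a few queries are needed, a single row can be computed on demand instead. The sketch below uses a small made-up weight matrix in place of `tfidf_matrix` and relies on the fact that, for L2-normalised rows, the dot product equals the cosine.

```python
import numpy as np

# Hypothetical stand-in for a TF-IDF matrix: 4 documents, 3 terms.
weights = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 0.0, 1.0],
])

# L2-normalise each row so that a dot product equals the cosine.
unit_rows = weights / np.linalg.norm(weights, axis=1, keepdims=True)

query_index = 0
# One row of the similarity matrix, computed without building the rest.
query_sim = unit_rows @ unit_rows[query_index]
print(query_sim)
```

With scikit-learn, `cosine_similarity(tfidf_matrix[i], tfidf_matrix)` should give the same single row as a 1 × n array while keeping the input sparse.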


In [38]:
# all anime names, in the same order as the similarity columns
anime_list = similarity_df.columns.values

# sample anime
anime = 'Death Note'

# number of recommendations to return
top_n = 10

# similarity of the sample anime to every anime
anime_sim = similarity_df[similarity_df.index == anime].values[0]

# column positions sorted from most to least similar
sorted_anime_ids = np.argsort(anime_sim)[::-1]

# skip position 0 (normally the anime itself, with similarity 1.0) and take the next top_n names
recommended_anime = anime_list[sorted_anime_ids[1:top_n+1]]

print('\n\nTop Recommended Anime for:', anime, 'are:-\n', recommended_anime)



Top Recommended Anime for: Death Note are:-
 ['Death Note: Rewrite' 'Mugen no Hi' 'Sekaikei Sekai Ron' 'gdMen'
 'WONDER LiGHT' 'Dia Horizon (Kabu)'
 'Ore no Nounai Sentakushi ga, Gakuen Love Comedy wo Zenryoku de Jama Shiteiru OVA'
 'Ji Jia Shou Shen: Baolie Feiche' 'Dead Mount Death Play Part 2'
 'Hikari: Be My Light']
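
Slicing from position 1 assumes the query title sorts first. That usually holds, but another title with an identical synopsis can tie at 1.0 and take its place. A hedged alternative is to drop the query by name. The helper `top_n_similar` and the similarity values below are hypothetical and only illustrate the idea.

```python
import numpy as np

def top_n_similar(names, sims, query, top_n=3):
    """Return the top_n names most similar to `query`, never the query itself."""
    names = np.asarray(names)
    sims = np.asarray(sims, dtype=float)
    order = np.argsort(sims, kind="stable")[::-1]  # most similar first
    order = order[names[order] != query]           # drop the query explicitly
    return names[order[:top_n]].tolist()

# Hypothetical similarity row for the query 'A'; 'B' ties with it at 1.0.
names = ['A', 'B', 'C', 'D']
sims = [1.0, 1.0, 0.2, 0.5]
result = top_n_similar(names, sims, 'A', top_n=2)
print(result)
```

In this toy case the positional slice `[1:top_n+1]` would return the query 'A' itself, because 'B' sorts ahead of it.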


In [39]:
def content_anime_recommender(
    input_anime, similarity_database=similarity_df, anime_database_list=anime_list, top_n=10):
    """Print the top_n anime whose synopses are most similar to input_anime."""
    # similarity of the input anime to every anime
    anime_sim = similarity_database[similarity_database.index == input_anime].values[0]

    # column positions sorted from most to least similar
    sorted_anime_ids = np.argsort(anime_sim)[::-1]

    # skip position 0 (normally the input anime itself) and take the next top_n names
    recommended_anime = anime_database_list[sorted_anime_ids[1:top_n+1]]

    print('\n\nTop Recommended Anime for:', input_anime, 'are:-\n', recommended_anime)

sample_anime = ['Death Note', 'Cowboy Bebop', 'Bleach',
                'Fruits Basket', 'Monster']

for title in sample_anime:
    content_anime_recommender(title)



Top Recommended Anime for: Death Note are:-
 ['Death Note: Rewrite' 'Mugen no Hi' 'Sekaikei Sekai Ron' 'gdMen'
 'WONDER LiGHT' 'Dia Horizon (Kabu)'
 'Ore no Nounai Sentakushi ga, Gakuen Love Comedy wo Zenryoku de Jama Shiteiru OVA'
 'Ji Jia Shou Shen: Baolie Feiche' 'Dead Mount Death Play Part 2'
 'Hikari: Be My Light']


Top Recommended Anime for: Cowboy Bebop are:-
 ['Cowboy Bebop: Tengoku no Tobira' 'Cowboy Bebop: Ein no Natsuyasumi'
 'Saru Getchu Movie: Ougon no Pipo Helmet - Ukki Battle'
 'Kurogane Communication' 'Kandagawa Jet Girls Recap'
 'Phantasy Star Online 2: Episode Oracle' 'Kandagawa Jet Girls'
 'Umeboshi Denka' 'Bounty Hunter: The Hard'
 'Saraba Uchuu Senkan Yamato: Ai no Senshi-tachi']


Top Recommended Anime for: Bleach are:-
 ['Bleach: Sennen Kessen-hen'
 'Bleach Movie 3: Fade to Black - Kimi no Na wo Yobu'
 'Bleach Movie 1: Memories of Nobody' 'Bleach Movie 4: Jigoku-hen'
 'Yume-iro Pâtissière SP Professional' 'Tokyo Mew Mew New ♡'
 'Aikatsu! Movie' 'Tokyo Mew Mew'