# Cosine TF-IDF (Term Frequency-Inverse Document Frequency) similarity

TF-IDF measures how often a term appears in a document (term frequency) relative to how often it appears across the whole collection of documents (inverse document frequency).

The TF-IDF score is the product TF × IDF. A higher score means the term is frequent in that document but rare in the collection, so it is more distinctive for that document.
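As a rough illustration of the TF × IDF product, here is a minimal sketch that uses the plain textbook form `tf * log(N / df)` on a made-up three-document corpus. The function name `tf_idf` and the toy documents are assumptions for illustration only. scikit-learn's `TfidfVectorizer` defaults to a smoothed IDF and L2-normalised rows, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Term frequency (share of the document) times inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus, not taken from the anime dataset
docs = ["space bounty hunter", "space pirate crew", "high school romance"]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "space" occurs in two of the three documents, so it scores lower than "bounty", which occurs in only one.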

After computing the TF-IDF vector for each synopsis, we measure how similar two anime are by taking the cosine of the angle between their vectors.
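To make the cosine step concrete, here is a small sketch with NumPy on hand-written vectors. The helper name `cosine` and the three vectors are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # shares no non-zero dimension with a

print(cosine(a, b), cosine(a, c))
```

Vectors pointing in the same direction score close to 1 regardless of length, and vectors with no shared non-zero terms score 0. That is why the measure suits synopses of different lengths.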

In [1]:
import pandas as pd
import numpy as np

df_anime = pd.read_csv('../data/anime-dataset-2023.csv')
df_anime.head()

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,"Jul 3, 2002 to Dec 25, 2002",...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...
4,8,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,6.94,"Adventure, Fantasy, Supernatural",It is the dark century and the people are suff...,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",...,Toei Animation,Manga,23 min per ep,PG - Children,4240.0,5126,14,6413.0,15001,https://cdn.myanimelist.net/images/anime/7/215...


In [2]:
df_anime.shape

(24905, 24)

In [3]:
#basic filtering for duplicates

duplicates_all = df_anime[df_anime.duplicated()]
print("All Duplicates:")
print(len(duplicates_all))

duplicates = df_anime[df_anime.duplicated(['Name'])].sort_values(by='Name')
print("Duplicates based on Name:")
print(len(duplicates))
duplicates = duplicates[['anime_id', 'Name']]
print(duplicates)

df_anime_new = df_anime.drop_duplicates(['Name'])
print("Cleaned anime shape: {} \n".format(df_anime_new.shape))
print("Old anime shape: {}".format(df_anime.shape))

All Duplicates:
0
Duplicates based on Name:
4
       anime_id       Name
24840     55658  Awakening
24586     55351  Azur Lane
24807     55610   Souseiki
24781     55582     Utopia
Cleaned anime shape: (24901, 24) 

Old anime shape: (24905, 24)


In [4]:
# exclude titles whose Genres column contains 'Hentai' (case-insensitive)
to_exclude = df_anime[df_anime['Genres'].str.contains('Hentai', case=False, na=False)]
filtered_df = df_anime[~df_anime.index.isin(to_exclude.index)]
filtered_df.shape

(23419, 24)

In [5]:
# Work on an explicit copy so adding a column does not raise SettingWithCopyWarning
filtered_df = filtered_df.copy()

# Lowercase the Name column. Series.replace(' ', '') only replaces values that are
# exactly ' ', so spaces inside names are kept here; stripping them would need
# .str.replace(' ', '').
filtered_df['Processed_Name'] = filtered_df['Name'].str.lower().replace(' ', '')

# Collect rows whose lowercased name is duplicated, plus rows whose name contains no space
duplicate_rows = filtered_df[filtered_df.duplicated(subset='Processed_Name', keep=False) | (~filtered_df.duplicated(subset='Processed_Name', keep=False) & ~filtered_df['Processed_Name'].str.contains(' '))]

print(duplicate_rows)
# Drop every row whose lowercased name occurs more than once (all copies are removed)
filtered_df = filtered_df[~((filtered_df['Processed_Name'].isin(duplicate_rows['Processed_Name'])) & (filtered_df.duplicated(subset='Processed_Name', keep=False)))]

# Drop the intermediate 'Processed_Name' column
filtered_df = filtered_df.drop(columns='Processed_Name')

filtered_df.shape

       anime_id         Name English name   Other name    Score  \
2             6       Trigun       Trigun        トライガン     8.22   
9            19      Monster      Monster        モンスター     8.87   
10           20       Naruto       Naruto          ナルト     7.99   
15           25    Sunabouzu  Desert Punk         砂ぼうず     7.38   
16           26   Texhnolyze   Texhnolyze   TEXHNOLYZE     7.76   
...         ...          ...          ...          ...      ...   
24880     55707       Kokoro      UNKNOWN            心  UNKNOWN   
24885     55716  Mechronicle  Mechronicle  Mechronicle  UNKNOWN   
24896     55727         Miru      UNKNOWN           未ル  UNKNOWN   
24898     55729     Thailand      UNKNOWN     Thailand  UNKNOWN   
24899     55730       Energy      UNKNOWN       Energy  UNKNOWN   

                                         Genres  \
2                     Action, Adventure, Sci-Fi   
9                      Drama, Mystery, Suspense   
10                   Action, Adventure, Fa


(23409, 24)
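If the goal is to treat names such as "Death Note" and "DEATHNOTE" as the same title and keep one of them, one possible approach is sketched below on a toy frame. The frame `toy` and the variable `name_key` are hypothetical. It keeps the first occurrence of each normalised name, which is an assumption about which copy should survive.

```python
import pandas as pd

# Hypothetical toy data standing in for the anime table
toy = pd.DataFrame({
    'anime_id': [1, 2, 3, 4],
    'Name': ['Death Note', 'DEATHNOTE', 'Monster', 'Trigun'],
})

# Normalised key: lowercase with all whitespace removed
name_key = toy['Name'].str.lower().str.replace(r'\s+', '', regex=True)

# keep='first' retains the earliest row for each key and flags later copies
deduped = toy[~name_key.duplicated(keep='first')].reset_index(drop=True)
print(deduped)
```

Because the key is held in a separate Series, no intermediate column has to be added to the DataFrame and dropped afterwards.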

In [6]:
#drop rows with unknown genres
unknown_rows = filtered_df[filtered_df['Genres'].str.lower() == 'unknown']
filtered_df = filtered_df.drop(unknown_rows.index)
filtered_df.shape

(18486, 24)

In [7]:
#create the tf-idf matrix for text comparison
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
synopsis_vectors = tfidf.fit_transform(filtered_df['Synopsis'])

In [8]:
# One-hot encode the Genres column so that genre is included in the recommendation.
# Genres holds comma-separated strings rather than lists, so explode() leaves the rows
# unchanged and each distinct genre combination becomes one category.
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack

encoder = OneHotEncoder(sparse_output=True)

genre_encoded_sparse = encoder.fit_transform(filtered_df[['Genres']].explode('Genres'))

# Combine the genre and synopsis sparse matrices horizontally (hstack)
combined_sparse_matrix = hstack([genre_encoded_sparse, synopsis_vectors])

# Display the combined sparse matrix
print("Combined Sparse Matrix:")
print(combined_sparse_matrix)


Combined Sparse Matrix:
  (0, 143)	1.0
  (0, 45812)	0.09473567569169065
  (0, 29905)	0.046649780496261554
  (0, 34817)	0.06916315278280448
  (0, 13839)	0.04756548741308503
  (0, 28857)	0.07186902036053307
  (0, 24299)	0.03791912201838669
  (0, 7738)	0.07847762250313407
  (0, 43985)	0.0776244204124312
  (0, 8939)	0.06131487428208387
  (0, 31949)	0.06750986660476928
  (0, 25450)	0.11188802075966059
  (0, 35130)	0.0660591887632018
  (0, 26303)	0.0865188371789169
  (0, 11245)	0.0922307348253924
  (0, 24530)	0.04820859517965824
  (0, 9402)	0.0675802192870711
  (0, 7022)	0.06696158890308303
  (0, 8356)	0.0793970282294058
  (0, 7064)	0.07270862016382937
  (0, 45762)	0.06226719863292278
  (0, 5866)	0.0793970282294058
  (0, 10777)	0.08243096281688525
  (0, 9089)	0.11188802075966059
  (0, 45396)	0.12010485927243436
  :	:
  (18481, 29771)	0.4587973646408326
  (18481, 38903)	0.28961719018115994
  (18481, 27978)	0.266399545908467
  (18481, 44692)	0.2583515644523988
  (18482, 494)	1.0
  (18482, 3564
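An alternative way to encode genres is one column per individual genre, so that "Action, Sci-Fi" and "Action, Adventure, Sci-Fi" share features. The sketch below uses scikit-learn's `MultiLabelBinarizer` on a toy frame. The frame `toy` and the splitting rule (comma-separated, whitespace stripped) are assumptions about the data format, and this is not the encoding used for the results shown in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data in the same comma-separated format as the Genres column
toy = pd.DataFrame({'Genres': ['Action, Sci-Fi', 'Action, Adventure, Sci-Fi', 'Drama']})

# Split each string into a list of individual genre labels
genre_lists = toy['Genres'].apply(lambda s: [g.strip() for g in s.split(',')])

# One binary column per genre; sparse output so it can be hstack-ed with TF-IDF vectors
mlb = MultiLabelBinarizer(sparse_output=True)
genre_matrix = mlb.fit_transform(genre_lists)

print(mlb.classes_)
print(genre_matrix.toarray())
```

With this encoding two anime that share some but not all genres get a non-zero genre overlap, whereas one-hot encoding the full string only matches identical genre combinations.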

In [9]:
# Compute pairwise cosine similarity between all anime using the combined genre and synopsis features
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(combined_sparse_matrix)
similarity_df = pd.DataFrame(similarity, 
                             index=filtered_df['Name'], 
                             columns=filtered_df['Name'])
similarity_df.head(10)

Name,Cowboy Bebop,Cowboy Bebop: Tengoku no Tobira,Trigun,Witch Hunter Robin,Bouken Ou Beet,Eyeshield 21,Hachimitsu to Clover,Hungry Heart: Wild Striker,Initial D Fourth Stage,Monster,...,Beauty and the Brawn,4 Week Lovers,"Die, Please!",Miru,Wo Mengjian ni Mengjian wo,Thailand,Energy,Wu Nao Monu,Bu Xing Si: Yuan Qi,Di Yi Xulie
Cowboy Bebop,1.0,0.131838,0.009761,0.020287,0.000739,0.008328,0.0,0.002539,0.0,0.004654,...,0.0,0.006794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cowboy Bebop: Tengoku no Tobira,0.131838,1.0,0.018745,0.008214,0.001993,0.010943,0.005514,0.005794,0.003878,0.006492,...,0.0,0.001752,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Trigun,0.009761,0.018745,1.0,0.002533,0.006092,0.004248,0.001544,0.0,0.0,0.011393,...,0.0,0.011806,0.00414,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Witch Hunter Robin,0.020287,0.008214,0.002533,1.0,0.007225,0.0625,0.0,0.007323,0.003793,0.0,...,0.000873,0.00131,0.001606,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bouken Ou Beet,0.000739,0.001993,0.006092,0.007225,1.0,0.027794,0.000975,0.0,0.0,0.00468,...,0.0,0.001417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Eyeshield 21,0.008328,0.010943,0.004248,0.0625,0.027794,1.0,0.005568,0.00661,0.008325,0.005274,...,0.001224,0.014114,0.003916,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hachimitsu to Clover,0.0,0.005514,0.001544,0.0,0.000975,0.005568,1.0,0.0,0.01322,0.0,...,0.003132,0.00878,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hungry Heart: Wild Striker,0.002539,0.005794,0.0,0.007323,0.0,0.00661,0.0,1.0,0.012669,0.0,...,0.0,0.003845,0.010099,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Initial D Fourth Stage,0.0,0.003878,0.0,0.003793,0.0,0.008325,0.01322,0.012669,1.0,0.0,...,0.0,0.002094,0.007876,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Monster,0.004654,0.006492,0.011393,0.0,0.00468,0.005274,0.0,0.0,0.0,1.0,...,0.0,0.005189,0.008227,0.0,0.0,0.0,0.0,0.0,0.0,0.0
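The full similarity matrix has one row and one column per anime, so it grows quadratically with the number of titles. When only a few queries are needed, a lighter option is to compare a single row against the feature matrix on demand. The sketch below shows this on a toy corpus. The function `top_similar`, the labels, and the texts are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical labels and synopses, not taken from the dataset
names = np.array(['A', 'B', 'C', 'D'])
texts = [
    'bounty hunter crew travels through space',
    'space crew chases a bounty across planets',
    'high school students start a band',
    'a detective hunts a killer',
]
vectors = TfidfVectorizer(stop_words='english').fit_transform(texts)

def top_similar(query_idx, matrix, labels, top_n=2):
    """Return the top_n labels most similar to row query_idx, excluding the row itself."""
    # Compare one row against every row: result has shape (1, n_items)
    sims = cosine_similarity(matrix[[query_idx]], matrix).ravel()
    sims[query_idx] = -np.inf  # make sure the query never recommends itself
    order = np.argsort(sims)[::-1][:top_n]
    return list(labels[order])

print(top_similar(0, vectors, names))
```

Masking the query index explicitly avoids relying on the query always sorting into first place.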


In [10]:
# anime list 
anime_list = similarity_df.columns.values


# sample anime
anime = 'Death Note'

# number of recommendations to return
top_n = 10

# get the similarity row for the sample anime
anime_sim = similarity_df[similarity_df.index == anime].values[0]

# get anime indices sorted by similarity, highest first
sorted_anime_ids = np.argsort(anime_sim)[::-1]

# skip position 0 (expected to be the sample anime itself) and take the next top_n names
recommended_anime = anime_list[sorted_anime_ids[1:top_n+1]]

print('\n\nTop Recommended Anime for:', anime, 'are:-\n', recommended_anime)



Top Recommended Anime for: Death Note are:-
 ['Death Note: Rewrite' 'Munou na Nana' 'Warau Salesman Special Program'
 'Mugen no Hi' 'Sekaikei Sekai Ron' 'gdMen'
 'Ore no Nounai Sentakushi ga, Gakuen Love Comedy wo Zenryoku de Jama Shiteiru OVA'
 'Dia Horizon (Kabu)' 'Ji Jia Shou Shen: Baolie Feiche'
 'Hikari: Be My Light']


In [11]:
def content_anime_recommender(
    input_anime, similarity_database=similarity_df, anime_database_list=anime_list, top_n=10):
    """Print the top_n anime most similar to input_anime."""
    
    # get the similarity row for the input anime
    anime_sim = similarity_database[similarity_database.index == input_anime].values[0]
    
    # get anime indices sorted by similarity, highest first
    sorted_anime_ids = np.argsort(anime_sim)[::-1]
    
    # skip position 0 (expected to be the input anime itself) and take the next top_n names
    recommended_anime = anime_database_list[sorted_anime_ids[1:top_n+1]]
    
    print('\n\nTop Recommended Anime for:', input_anime, 'are:-\n', recommended_anime)

sample_anime = ['Death Note', 'Cowboy Bebop', 'Bleach', 
                 'Fruits Basket', 'Monster']
                 
for i in sample_anime:
    content_anime_recommender(i)



Top Recommended Anime for: Death Note are:-
 ['Death Note: Rewrite' 'Munou na Nana' 'Warau Salesman Special Program'
 'Mugen no Hi' 'Sekaikei Sekai Ron' 'gdMen'
 'Ore no Nounai Sentakushi ga, Gakuen Love Comedy wo Zenryoku de Jama Shiteiru OVA'
 'Dia Horizon (Kabu)' 'Ji Jia Shou Shen: Baolie Feiche'
 'Hikari: Be My Light']


Top Recommended Anime for: Cowboy Bebop are:-
 ['Koukaku Kidoutai: Stand Alone Complex' 'Hate no issen EPISODE ZERO'
 'SSSS.Gridman' 'Cowboy Bebop: Tengoku no Tobira'
 'Cowboy Bebop: Ein no Natsuyasumi'
 'Saru Getchu Movie: Ougon no Pipo Helmet - Ukki Battle'
 'Kurogane Communication' 'Kandagawa Jet Girls Recap' 'Umeboshi Denka'
 'Phantasy Star Online 2: Episode Oracle']


Top Recommended Anime for: Bleach are:-
 ['Bleach: Sennen Kessen-hen'
 'Bleach Movie 3: Fade to Black - Kimi no Na wo Yobu'
 'Bleach Movie 1: Memories of Nobody' 'Bleach Movie 4: Jigoku-hen'
 'Bleach: The Sealed Sword Frenzy'
 'Bleach Movie 2: The DiamondDust Rebellion - Mou Hitotsu no Hyourinm

In [12]:
df_users_ratings = pd.read_csv('../data/users-score-2023.csv')
df_users_ratings[df_users_ratings['Anime Title'] == 'Death Note'].head()
print(df_users_ratings.shape)

(24325191, 5)


### Deriving ground truth using threshold-based approach

In [13]:
# Ground truth: treat anime with an average rating above the threshold as relevant
threshold = 7

sample_size = 10000

# Take a random sample from df_users_ratings (random_state set for reproducibility)
sample_data = df_users_ratings.sample(n=sample_size, random_state=42)

# Average rating per title within the sample
avg_ratings = sample_data.groupby('Anime Title')['rating'].mean()
print(avg_ratings)

# Keep titles whose average rating is greater than the threshold
liked_anime = avg_ratings[avg_ratings > threshold].index.tolist()

# Build the ground_truths dictionary: every title in the full ratings data becomes a key,
# and each key maps to the same liked_anime list
ground_truths = df_users_ratings.groupby('Anime Title')['Anime Title'].apply(lambda x: liked_anime).to_dict()


# Print the dictionary keys (set() over a dict yields its keys)
print(set(ground_truths))

Anime Title
"Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi     8.000000
"Bungaku Shoujo" Movie                        7.750000
"Oshi no Ko"                                  8.500000
.hack//G.U. Returner                          7.000000
.hack//G.U. Trilogy: Parody Mode              6.000000
                                               ...    
xxxHOLiC                                      8.714286
xxxHOLiC Movie: Manatsu no Yoru no Yume       6.833333
xxxHOLiC Rou                                 10.000000
xxxHOLiC Shunmuki                             7.500000
xxxHOLiC◆Kei                                  7.333333
Name: rating, Length: 3148, dtype: float64
{'Glass no Kamen (2005)', 'Watashi, Nouryoku wa Heikinchi de tte Itta yo ne!', 'Eiyuu Densetsu: Head On! Master Senshi', 'Madonna (Movie)', 'Boku no Chikyuu wo Mamotte: Kiniro no Toki Nagarete', 'Musashi no Ken', 'Yoshimaho', 'Naeil-eun Pyeongbeomhae Jilgeoya', 'Teekyuu 5 Specials', 'Ikoku Meiro no Croisée Picture Drama', 'Shoutai Hanmei 
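A threshold-based ground truth can also be held directly as a set of relevant titles. The sketch below does this on toy ratings. The frame `ratings` is made up, and the `min_ratings` cut-off is an extra assumption that is not part of this notebook. It is included only to show how titles whose average rests on a single rating could be excluded.

```python
import pandas as pd

# Hypothetical toy ratings using the same column names as the ratings file
ratings = pd.DataFrame({
    'Anime Title': ['Monster', 'Monster', 'Trigun', 'Trigun', 'Bleach'],
    'rating': [9, 10, 6, 7, 8],
})

threshold = 7      # average rating must exceed this value
min_ratings = 2    # assumed minimum number of ratings per title

# Mean rating and number of ratings per title
stats = ratings.groupby('Anime Title')['rating'].agg(['mean', 'count'])

# Relevant titles: high enough average and enough ratings behind it
liked = set(stats[(stats['mean'] > threshold) & (stats['count'] >= min_ratings)].index)
print(liked)
```

A set gives constant-time membership checks, which is all a precision calculation needs from the ground truth.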

In [16]:
#precision at n: measures the proportion of relevant items among the top n
def content_anime_recommender2(
    input_anime, ground_truths, similarity_database=similarity_df, anime_database_list=anime_list, top_n=10):
    
    # get anime similarity records
    anime_sim = similarity_database[similarity_database.index == input_anime].values[0]
    
    # get anime sorted by similarity
    sorted_anime_ids = np.argsort(anime_sim)[::-1]
    
    # get recommended anime names
    recommended_anime = anime_database_list[sorted_anime_ids[1:top_n+1]]
    
    # Calculate precision at n; set(ground_truths) is the set of dictionary keys
    intersection = set(recommended_anime) & set(ground_truths)
    precision_at_n = len(intersection) / top_n
    rounded_precision = round(precision_at_n, 2)
    
    print('\n\nTop Recommended Anime for ', input_anime, recommended_anime)
    print('\nPrecision at', top_n, '=', rounded_precision)

sample_anime2 = ['Death Note', 'InuYasha', 'Chobits', 'Hikari: Be My Light']

for i in sample_anime2:
    content_anime_recommender2(i, ground_truths)



Top Recommended Anime for  Death Note ['Death Note: Rewrite' 'Munou na Nana' 'Warau Salesman Special Program'
 'Mugen no Hi' 'Sekaikei Sekai Ron' 'gdMen'
 'Ore no Nounai Sentakushi ga, Gakuen Love Comedy wo Zenryoku de Jama Shiteiru OVA'
 'Dia Horizon (Kabu)' 'Ji Jia Shou Shen: Baolie Feiche'
 'Hikari: Be My Light']

Precision at 10 = 0.8


Top Recommended Anime for  InuYasha ['InuYasha Movie 1: Toki wo Koeru Omoi' 'InuYasha: Kanketsu-hen'
 'InuYasha Movie 3: Tenka Hadou no Ken'
 'InuYasha Movie 4: Guren no Houraijima' 'InuYasha: Kuroi Tessaiga'
 'InuYasha Movie 2: Kagami no Naka no Mugenjo' 'Yao Shen Ji'
 'Doupo Cangqiong 2nd Season' 'MY WIFE IS A DEMON QUEEN' 'Seirei Gensouki']

Precision at 10 = 0.9


Top Recommended Anime for  Chobits ['Kono Minikuku mo Utsukushii Sekai' 'Chobits Recap'
 'Mahoromatic: Motto Utsukushii Mono' 'Chobits: Chibits'
 'Ai no Sungekijou' 'Chii-chan no Kageokuri' 'Chii-chan to Hige Ojisan'
 'Koneko no Chi: Ponponra Dairyokou' 'Uchi no 3 Shimai'
 'Tsurezure
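For reference, precision at n can be written as a small standalone function that takes a ranked list and a set of relevant titles. The name `precision_at_n` and the example titles are hypothetical. Like the cell above, the sketch divides by `n` rather than by the number of items actually returned.

```python
def precision_at_n(recommended, relevant, n=10):
    """Share of the first n recommended items that appear in the relevant set."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = list(recommended)[:n]
    hits = sum(1 for title in top if title in relevant)
    return hits / n

# Hypothetical ranked recommendations and relevant set
recommended = ['a', 'b', 'c', 'd']
relevant = {'a', 'c', 'x'}

print(precision_at_n(recommended, relevant, n=4))
```

Keeping the metric separate from the recommender makes it easy to evaluate the same recommendations against different definitions of relevance.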