The goal here is to preprocess our anime data and build the content-based recommendation model.

In [1]:
import pandas as pd

In [2]:
anime_filtered_df = pd.read_csv("data/anime_filtered.csv")

In [3]:
anime_filtered_df.head()

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,popularity,members,studios,source,favorites,rating,year
0,1,cowboy bebop,8.75,41.0,"action, award winning, sci-fi","crime is timeless. by the year 2071, humanity ...",tv,26.0,43,1771505,sunrise,original,78525,rated 17,1998
1,5,cowboy bebop: tengoku no tobira,8.38,189.0,"action, sci-fi","another day, another bounty—such is the life o...",movie,1.0,602,360978,bones,original,1448,rated 17,2001
2,6,trigun,8.22,328.0,"action, adventure, sci-fi","vash the stampede is the man with a $$60,000,0...",tv,26.0,246,727252,madhouse,manga,15035,parental guidance 13,1998
3,7,witch hunter robin,7.25,2764.0,"action, drama, mystery, supernatural",robin sena is a powerful craft user drafted in...,tv,26.0,1795,111931,sunrise,original,613,parental guidance 13,2002
4,8,bouken ou beet,6.94,4240.0,"adventure, fantasy, supernatural",it is the dark century and the people are suff...,tv,52.0,5126,15001,toei animation,manga,14,parental guidance,2004


In [4]:
# Shorter alias for the same DataFrame (this is not a copy)
anime = anime_filtered_df

In [9]:
anime['anime_id'].nunique()

10048

In [10]:
user_clean = pd.read_csv("data/user_clean_processed_2.csv")

In [11]:
user_clean['anime_id'].nunique()

1920

Since we are now using a stronger embedding model, we can keep all of the textual columns and embed them together.

In [14]:
# Combine textual features
anime['combined_text'] = (
    anime['genres'].fillna('') + " " +
    anime['name'].fillna('') + " " +
    anime['synopsis'].fillna('') + " " +
    anime['type'].fillna('') + " " +
    anime['studios'].fillna('') + " " +
    anime['rating'].fillna('')
)
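The `fillna('')` calls matter because concatenating a string with a missing value yields a missing result for the whole row. A minimal sketch on a toy frame (the rows are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame with one missing synopsis (values are illustrative only)
toy = pd.DataFrame({
    "genres": ["action, sci-fi", "comedy"],
    "synopsis": ["a bounty hunter crew", None],
})

# Without fillna, any missing piece makes the whole combined string missing
without_fill = toy["genres"] + " " + toy["synopsis"]

# With fillna(''), the missing piece simply contributes an empty string
with_fill = toy["genres"].fillna("") + " " + toy["synopsis"].fillna("")

print(without_fill.isna().tolist())
print(with_fill.tolist())
```

The second row keeps its genre text only in the `fillna` version, which is the behavior we want before embedding.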

In [15]:
anime.head()

Unnamed: 0,anime_id,name,score,rank,genres,synopsis,type,episodes,popularity,members,studios,source,favorites,rating,year,combined_text
0,1,cowboy bebop,8.75,41.0,"action, award winning, sci-fi","crime is timeless. by the year 2071, humanity ...",tv,26.0,43,1771505,sunrise,original,78525,rated 17,1998,"action, award winning, sci-fi cowboy bebop cri..."
1,5,cowboy bebop: tengoku no tobira,8.38,189.0,"action, sci-fi","another day, another bounty—such is the life o...",movie,1.0,602,360978,bones,original,1448,rated 17,2001,"action, sci-fi cowboy bebop: tengoku no tobira..."
2,6,trigun,8.22,328.0,"action, adventure, sci-fi","vash the stampede is the man with a $$60,000,0...",tv,26.0,246,727252,madhouse,manga,15035,parental guidance 13,1998,"action, adventure, sci-fi trigun vash the stam..."
3,7,witch hunter robin,7.25,2764.0,"action, drama, mystery, supernatural",robin sena is a powerful craft user drafted in...,tv,26.0,1795,111931,sunrise,original,613,parental guidance 13,2002,"action, drama, mystery, supernatural witch hun..."
4,8,bouken ou beet,6.94,4240.0,"adventure, fantasy, supernatural",it is the dark century and the people are suff...,tv,52.0,5126,15001,toei animation,manga,14,parental guidance,2004,"adventure, fantasy, supernatural bouken ou bee..."


Now we will preprocess the combined text by lowercasing it, stripping punctuation, tokenizing, and removing stopwords.

In [16]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import nltk

In [17]:
# Download stopwords and punkt for tokenization if not already done
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mahmu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\mahmu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mahmu/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [18]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Join back into a single string
    return ' '.join(tokens)

In [19]:
# Apply preprocessing to the 'combined_text' column
anime['combined_text'] = anime['combined_text'].apply(preprocess_text)
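One side effect of stripping punctuation before tokenizing is that hyphenated words are fused, which is why `sci-fi` becomes `scifi`. The sketch below reproduces the same steps without NLTK. The tiny stopword set and the whitespace split are stand-ins (assumptions for illustration) for NLTK's English stopword list and `word_tokenize`, and `preprocess_text_sketch` is a hypothetical name.

```python
import string

# Stand-in stopword set; the notebook uses NLTK's English list instead
TOY_STOP_WORDS = {"the", "is", "a", "of", "by", "in"}

def preprocess_text_sketch(text):
    # Lowercase, then delete every punctuation character
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Whitespace split as a simple stand-in for word_tokenize
    tokens = text.split()
    # Drop stopwords and rebuild the string
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    return " ".join(tokens)

sample = "Action, Sci-Fi: Crime is timeless. By the year 2071"
print(preprocess_text_sketch(sample))
# action scifi crime timeless year 2071
```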

In [41]:
anime[['combined_text']]

Unnamed: 0,combined_text
0,action award winning scifi cowboy bebop crime ...
1,action scifi cowboy bebop tengoku tobira anoth...
2,action adventure scifi trigun vash stampede ma...
3,action drama mystery supernatural witch hunter...
4,adventure fantasy supernatural bouken ou beet ...
...,...
10043,comedy kanojo okarishimasu petit special speci...
10044,comedy mystery li shi zhentan shiwusuo day lun...
10045,action adventure comedy fantasy one piece dai ...
10046,action comedy fantasy mashle mash burnedead fu...


In this run we use a pre-trained Sentence Transformer model to convert combined_text into dense embeddings. We then scale year to the [0, 1] range with Min-Max scaling and append it to the embeddings as one extra feature column.

In [12]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

In [13]:
# Load the Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# Encode the text into embeddings (pass a plain list of strings)
text_embeddings = model.encode(anime['combined_text'].tolist(), show_progress_bar=True)

Encoding text with Sentence Transformer...


Batches:   0%|          | 0/314 [00:00<?, ?it/s]

In [22]:
# Normalize the 'year' feature
scaler = MinMaxScaler()
anime['year_scaled'] = scaler.fit_transform(anime[['year']]).ravel()

In [23]:
# Convert the scaled 'year' column to a NumPy array
year_feature = anime['year_scaled'].values.reshape(-1, 1)

In [24]:
# Concatenate the embeddings with the year feature
final_features = np.hstack([text_embeddings, year_feature])

In [25]:
# Check the shape of the final combined features
print(f"Final features shape: {final_features.shape}")

Final features shape: (10048, 385)
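The 385 columns are the 384 embedding dimensions produced by `all-MiniLM-L6-v2` plus the single scaled year column. The sketch below shows the same scaling and stacking with NumPy only. The random array is a stand-in for the real embeddings and the years are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the output of model.encode: 4 items, 384 dimensions each
fake_embeddings = rng.normal(size=(4, 384))
years = np.array([1998.0, 2001.0, 2004.0, 2021.0])

# Min-max scaling: (x - min) / (max - min) maps the years into [0, 1]
year_scaled = (years - years.min()) / (years.max() - years.min())

# Append the scaled year as one extra column
features = np.hstack([fake_embeddings, year_scaled.reshape(-1, 1)])
print(features.shape)  # (4, 385)
```

Because the year is only one column out of 385, it acts as a mild nudge rather than a dominant signal in the similarity computation.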


In [27]:
# Save the final features for later use
np.save("data/final_features_st.npy", final_features)

Now we can compute cosine similarity.

In [28]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [29]:
# Load the combined features
final_features = np.load("data/final_features_st.npy")

In [None]:
# Compute the cosine similarity matrix
cosine_sim_matrix = cosine_similarity(final_features)

Computing cosine similarity...
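Cosine similarity between two rows is their dot product divided by the product of their lengths, so the full matrix can be obtained by normalizing each row and multiplying the result by its own transpose. A minimal NumPy sketch of that idea follows; `cosine_similarity_sketch` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

def cosine_similarity_sketch(features):
    # Length of each row vector; clip to avoid dividing by zero
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    # Dot products of unit vectors are the cosine similarities
    return unit @ unit.T

X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
S = cosine_similarity_sketch(X)
print(np.round(S, 3))
```

The first two rows point in the same direction, so their similarity is 1 even though their lengths differ, while the third row is orthogonal to both. Note that for 10048 items the full matrix has about 101 million entries, roughly 0.8 GB if it is stored as float64.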


In [31]:
# Save the cosine similarity matrix for later use
np.save("data/cosine_sim_st.npy", cosine_sim_matrix)

In [33]:
# Load the cosine similarity matrix
cosine_sim_matrix = np.load("data/cosine_sim_st.npy")

Now we can make a recommendation function and try it out.

In [None]:
def get_recommendations_by_id(anime_id, anime_df, cosine_sim, top_n=10):
    # Find the row position (not the index label) of the requested anime,
    # because cosine_sim and iloc are both addressed by position
    matches = np.flatnonzero((anime_df['anime_id'] == anime_id).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Anime ID {anime_id} not found in the dataset.")
    idx = matches[0]

    # Get the pairwise similarity scores for this anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the anime based on the similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Drop the query anime itself, then keep the top N similar anime
    sim_scores = [s for s in sim_scores if s[0] != idx][:top_n]
    top_indices = [i for i, _ in sim_scores]

    # Retrieve the recommended anime together with their similarity scores
    recommendations = anime_df.iloc[top_indices][['anime_id', 'name', 'genres', 'year', 'studios']].copy()
    recommendations['similarity_score'] = [score for _, score in sim_scores]

    return recommendations
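The ranking step in the function can also be written with NumPy instead of a Python list of tuples, which tends to be faster on a row of about ten thousand scores. A minimal sketch, where `top_n_similar` is a hypothetical helper name and the similarity row is made up:

```python
import numpy as np

def top_n_similar(sim_row, query_idx, top_n):
    # Positions sorted by descending similarity (stable for ties)
    order = np.argsort(-sim_row, kind="stable")
    # Exclude the query item itself before taking the top N
    order = order[order != query_idx]
    return order[:top_n]

row = np.array([1.0, 0.2, 0.9, 0.5])  # similarities of item 0 to all items
print(top_n_similar(row, query_idx=0, top_n=2))  # [2 3]
```

Filtering out the query position explicitly avoids relying on the item always being ranked first against itself.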

In [None]:
# Get recommendations by anime_id
anime_id = 1  # Trying out 'Cowboy Bebop'
recommendations = get_recommendations_by_id(anime_id, anime, cosine_sim_matrix, top_n=10)

recommendations

Unnamed: 0,anime_id,name,genres,year,studios,similarity_score
1,5,cowboy bebop: tengoku no tobira,"action, sci-fi",2001,bones,0.803224
1810,2158,terra e... (tv),"action, drama, sci-fi",2007,"tokyo kids, minami machi bugyousho",0.777441
363,400,seihou bukyou outlaw star,"action, adventure, comedy, sci-fi",1998,sunrise,0.748444
3098,5074,tetsuwan birdy decode:02,"action, comedy, sci-fi",2009,a-1 pictures,0.746236
1674,2001,tengen toppa gurren lagann,"action, adventure, award winning, sci-fi",2007,gainax,0.745735
2796,4037,cowboy bebop: yose atsume blues,sci-fi,1998,sunrise,0.745614
839,974,dead leaves,"action, comedy, sci-fi",2004,production i.g,0.744082
8016,37578,planet with,"action, sci-fi",2018,j.c.staff,0.73939
5252,17205,cowboy bebop: ein no natsuyasumi,comedy,2012,sunrise,0.738221
9354,48453,super crooks,"action, drama, suspense",2021,bones,0.737716


We are getting excellent results with this model, so we will use it for our hybrid recommendation. Unlike the previous models, we do not even have to balance it out with extra weights.

| Name                             | Genres                                      | Similarity Score |
|----------------------------------|---------------------------------------------|------------------|
| Cowboy Bebop: Tengoku no Tobira  | Action, Sci-Fi                              | 0.803224         |
| Terra e... (TV)                  | Action, Drama, Sci-Fi                       | 0.777441         |
| Seihou Bukyou Outlaw Star        | Action, Adventure, Comedy, Sci-Fi           | 0.748444         |
| Tetsuwan Birdy Decode:02         | Action, Comedy, Sci-Fi                      | 0.746236         |
| Tengen Toppa Gurren Lagann       | Action, Adventure, Award Winning, Sci-Fi    | 0.745735         |
| Cowboy Bebop: Yose Atsume Blues  | Sci-Fi                                      | 0.745614         |
| Dead Leaves                      | Action, Comedy, Sci-Fi                      | 0.744082         |
| Planet With                      | Action, Sci-Fi                              | 0.739390         |
| Cowboy Bebop: Ein no Natsuyasumi | Comedy                                      | 0.738221         |
| Super Crooks                     | Action, Drama, Suspense                     | 0.737716         |


Our previous model run performed poorly: we had to introduce weights for popularity, score and rank to make its recommendations relevant.

| Name                                     | Genres                                | Similarity Score |
|------------------------------------------|---------------------------------------|------------------|
| Tetsujin 28-gou                          | Adventure, Sci-Fi                     | 0.638336         |
| Kumo to Tulip                            | Adventure                             | 0.636089         |
| Fuku-chan no Sensuikan                   | Comedy                                | 0.634850         |
| Wonder 3                                 | Action, Adventure, Comedy, Sci-Fi     | 0.632865         |
| Wan Wan Chuushingura                     | Action, Adventure, Drama, Fantasy     | 0.632822         |
| Momotarou: Umi no Shinpei                | Action                                | 0.632235         |
| Obake no Q-Tarou                         | Comedy, Slice of Life, Supernatural   | 0.631495         |
| Eightman                                 | Action, Drama, Sci-Fi                 | 0.630711         |
| Tetsuwan Atom: Uchuu no Yuusha           | Action, Adventure, Drama, Sci-Fi      | 0.630436         |
| Arabian Night: Sindbad no Bouken         | Action, Adventure, Fantasy            | 0.629994         |


Even after the weighted adjustment using popularity, score and rank, it is still not as good as our current model.

| Name                           | Genres                                             | Popularity | Score | Rank   | Weighted Score | Similarity Score |
|--------------------------------|----------------------------------------------------|------------|-------|--------|----------------|------------------|
| Ginga Eiyuu Densetsu           | Drama, Sci-Fi                                      | 728        | 9.02  | 12.0   | 0.424677       | 0.571545         |
| Ashita no Joe 2                | Drama, Sports                                      | 2971       | 8.71  | 50.0   | 0.416602       | 0.596032         |
| Ashita no Joe                  | Drama, Sports                                      | 2138       | 8.29  | 251.0  | 0.414423       | 0.619340         |
| Versailles no Bara             | Drama, Romance                                     | 2036       | 8.33  | 220.0  | 0.407327       | 0.599311         |
| Mirai Shounen Conan            | Adventure, Drama, Sci-Fi                           | 2978       | 8.10  | 463.0  | 0.405780       | 0.608205         |
| Lupin III: Cagliostro no Shiro | Action, Adventure, Award Winning, Comedy, Mystery  | 1807       | 8.15  | 409.0  | 0.405377       | 0.604446         |
| Kaze no Tani no Nausicaä       | Adventure, Award Winning, Fantasy                  | 611        | 8.36  | 197.0  | 0.404715       | 0.590445         |
| Uchuu Senkan Yamato            | Action, Adventure, Award Winning, Drama, Sci-Fi    | 3976       | 7.59  | 1413.0 | 0.401975       | 0.624959         |
| Lupin III                      | Action, Adventure, Comedy, Mystery                 | 1948       | 7.63  | 1296.0 | 0.401710       | 0.622134         |
| Tenkuu no Shiro Laputa         | Adventure, Award Winning, Fantasy, Romance, Sci-Fi | 451        | 8.26  | 286.0  | 0.401172       | 0.587081         |


Saving items:

In [39]:
anime.to_csv("data/anime_filtered_processed_st.csv", index=False)