# **CSCI 2026: Individual Term Project**
## **Mapping the Anime Universe: A Semantic Analysis of 20,000+ Stories**

**Student:** Ignasi Bonmati  
**Date:** January 2026  
**Institution:** The College of Idaho  

---

### **Project Overview**
This project explores the relationship between different anime series by analyzing their textual descriptions. Instead of relying on human-made tags, we use **Natural Language Processing (NLP)** and **Machine Learning** to identify hidden thematic clusters.

**The Workflow Includes:**
1.  **Data Cleaning:** Removing metadata, website UI text, and stop words.
2.  **TF-IDF Vectorization:** Converting descriptions into a 5,000-dimensional TF-IDF matrix built from unigrams and bigrams.
3.  **UMAP Dimensionality Reduction:** Squashing those 5,000 dimensions into a 2D map while preserving neighborhood relationships as far as possible.
4.  **Interactive Visualization:** A dynamic scatter plot to explore the "islands" of the anime universe.

In [1]:
# 1. INSTALL ALL NECESSARY LIBRARIES (ensure that your environment is ready)
#%pip install pandas numpy scikit-learn umap-learn plotly tqdm ipywidgets nbformat hdbscan

In [2]:
# CENTRALIZED IMPORTS
import pandas as pd
print("pandas imported successfully!")
import numpy as np
print("numpy imported successfully!")
import re
print("re imported successfully!")
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
print("ENGLISH_STOP_WORDS and TfidfVectorizer imported successfully!")
import umap
print("umap imported successfully!")
import plotly.express as px
print("plotly.express imported successfully!")
from sklearn.cluster import KMeans
print("sklearn.cluster imported successfully!")
import hdbscan
print("hdbscan imported successfully!")

pandas imported successfully!
numpy imported successfully!
re imported successfully!
ENGLISH_STOP_WORDS and TfidfVectorizer imported successfully!
umap imported successfully!
plotly.express imported successfully!
sklearn.cluster imported successfully!
hdbscan imported successfully!


In [3]:

# LOAD THE DATASET
file_path_1 = 'data/mal_anime_2025.csv'

df = pd.read_csv(file_path_1)

# Drop duplicate titles and rows without a description, then reset the index
# so DataFrame row positions line up with the rows of the TF-IDF matrix
df = df.drop_duplicates(subset=['title']).dropna(subset=['description']).reset_index(drop=True)

# Quick Preview
print(f"Loaded {len(df)} unique anime with descriptions.")
print(f"Available columns: {list(df.columns)}")

Loaded 19874 unique anime with descriptions.
Available columns: ['myanimelist_id', 'title', 'description', 'image', 'Type', 'Episodes', 'Status', 'Premiered', 'Released_Season', 'Released_Year', 'Source', 'Genres', 'Themes', 'Studios', 'Producers', 'Demographic', 'Duration', 'Rating', 'Score', 'Ranked', 'Popularity', 'Members', 'Favorites', 'characters', 'source_url']


In [4]:

# CLEANING FUNCTION
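def clean_description(text):
    """Lowercase a description, strip non-letters, and drop English stop words."""
    if pd.isna(text) or text == "":
        return ""
    text = text.lower()
    # Delete everything except lowercase letters and whitespace
    # (digits and punctuation are removed, not replaced by spaces)
    text = re.sub(r'[^a-z\s]', '', text)
    words = text.split()
    cleaned_words = [w for w in words if w not in ENGLISH_STOP_WORDS]
    return " ".join(cleaned_words)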

# CLEANING
df['clean_desc'] = df['description'].apply(clean_description)

# PREVIEW THE DIFFERENCE
print("\n--- BEFORE ---")
print(df['description'].iloc[12][:1500])
print("\n--- AFTER ---")
print(df['clean_desc'].iloc[12][:1500])




--- BEFORE ---
At the request of his father, tennis prodigy Ryouma Echizen has returned from America and is ready to take the Japanese tennis scene by storm. Aiming to become the best tennis player in the country, he enrolls in Seishun Academy—home to one of the best middle school tennis teams in Japan.After Ryouma catches the captain's eye, he finds himself playing for a spot on the starting lineup in the intra-school ranking matches despite only being a freshman. Due to his age, the rest of the Seishun Boys' Tennis Team are initially reluctant to accept him, but his skill and determination convinces them to let him in.Armed with their new "super rookie," Seishun sets out to claim a spot in the National Tournament, hoping to take the coveted title for themselves. In order to do so, the team must qualify by playing through the Tokyo Prefectural and Kanto Regionals. Yet, the road ahead of them is shared by a plethora of strong schools, each playing tennis in unique ways for their own r
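
One detail in the preview above deserves a check. The source text runs some sentences together without a space ("Japan.After"), and deleting punctuation outright fuses them into one token. The sketch below compares the two behaviours. `STOP_WORDS` is a tiny stand-in for `ENGLISH_STOP_WORDS` so the example runs without scikit-learn, and `clean_with_spaces` is a hypothetical variant, not part of the notebook.

```python
import re

# Tiny stand-in for sklearn's ENGLISH_STOP_WORDS (assumption, keeps the demo dependency-free)
STOP_WORDS = {"the", "in", "of", "a", "an", "to", "he", "his", "after"}

def clean_delete(text):
    """Mirrors the notebook's approach: non-letters are deleted."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def clean_with_spaces(text):
    """Hypothetical variant: non-letters become spaces, so fused sentences split apart."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "teams in Japan.After Ryouma catches the captain's eye"
print(clean_delete(sample))       # teams japanafter ryouma catches captains eye
print(clean_with_spaces(sample))  # teams japan ryouma catches captain s eye
```

The space-based variant leaves a stray `s` from the possessive. That is usually harmless here, because `TfidfVectorizer`'s default token pattern ignores single-character tokens.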

In [5]:

# INITIALIZE THE VECTORIZER
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

# MATRIX GENERATION
tfidf_matrix = tfidf.fit_transform(df['clean_desc'])

# CHECK THE SHAPE
print(f"Matrix Shape: {tfidf_matrix.shape}")

Matrix Shape: (19874, 5000)
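
To make the numbers in this matrix less abstract, here is a minimal pure-Python sketch of how one TF-IDF row can be computed. It follows the smoothed IDF and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The three toy descriptions and the helper `tfidf_vector` are made up for illustration, and bigrams are left out for brevity.

```python
import math
from collections import Counter

# Three toy "descriptions" (made up for illustration, already cleaned)
docs = [
    "pilot robot angels tokyo",
    "tennis prodigy tokyo tournament",
    "ninja village ninja exam",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many descriptions does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """TF-IDF with smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation."""
    counts = Counter(tokens)
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, score in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {score:.3f}")
```

In the first toy description, "tokyo" scores lower than "pilot" because it also appears in a second description. The same effect pushes character names such as "shinji" to the top of a real keyword list.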


In [6]:
# SANITY CHECK: INSPECT THE TOP TF-IDF KEYWORDS FOR ONE ANIME
feature_names = tfidf.get_feature_names_out()
anime_pos = 20
anime_title = df['title'].iloc[anime_pos]
anime_vector = tfidf_matrix[anime_pos]

# Convert the sparse row to a dense column and sort by score
df_tfidf = pd.DataFrame(anime_vector.T.toarray(), index=feature_names, columns=["tfidf_score"])
top_keywords = df_tfidf.sort_values(by="tfidf_score", ascending=False).head(70)

print(f"\n--- Top Keywords for: {anime_title} ---")
print(top_keywords)


--- Top Keywords for: Shinseiki Evangelion ---
          tfidf_score
shinji       0.488836
angels       0.260058
tokyo        0.240436
piloting     0.196474
trauma       0.194627
...               ...
led          0.068868
attack       0.067248
hope         0.066653
robot        0.066365
child        0.066004

[70 rows x 1 columns]


In [7]:


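# UMAP (2D projection)
reducer = umap.UMAP(
    n_neighbors=15,    # size of the local neighborhood used to build the graph
    min_dist=0.1,      # minimum spacing between points in the 2D layout
    metric='cosine',   # compare TF-IDF vectors by angle
    random_state=42    # fixed seed for a reproducible layout
)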
embedding = reducer.fit_transform(tfidf_matrix)

# SAVE TO DATAFRAME
df['x'] = embedding[:, 0]
df['y'] = embedding[:, 1]

print("\n✅ Success! The Anime Universe has been mapped.")
print(df[['title', 'x', 'y']].head())


✅ Success! The Anime Universe has been mapped.
                             title         x          y
0                     Cowboy Bebop -1.293747   9.562624
1  Cowboy Bebop: Tengoku no Tobira  0.138290  12.822880
2                           Trigun -2.947133   8.378293
3               Witch Hunter Robin -0.662802  11.889046
4                   Bouken Ou Beet -0.455099   9.761750
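
The `metric='cosine'` setting means UMAP judges two descriptions by the angle between their TF-IDF vectors, so what counts is which terms they share, not how large the weights are. A small dependency-free sketch follows. The three-term vectors and the helper `cosine_distance` are made up for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for non-negative vectors sharing no terms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical weights over the terms ("ninja", "village", "tennis")
short_synopsis = [1.0, 1.0, 0.0]
long_synopsis = [3.0, 3.0, 0.0]    # same theme, three times the weight
sports_synopsis = [0.0, 0.0, 2.0]  # no terms in common with the other two

print(cosine_distance(short_synopsis, long_synopsis))    # approximately 0
print(cosine_distance(short_synopsis, sports_synopsis))  # 1.0
```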


In [8]:
# 1. CLEAN THE NUMERIC COLUMNS
# Remove commas from 'Members' and convert to integer
# We use errors='coerce' to turn any messy non-numbers into NaN (blanks), then fill with 0
df['Members'] = df['Members'].astype(str).str.replace(',', '')
df['Members'] = pd.to_numeric(df['Members'], errors='coerce').fillna(0)

# Ensure Score is also numeric just in case
df['Score'] = pd.to_numeric(df['Score'], errors='coerce').fillna(0)

print("✅ 'Members' and 'Score' columns converted to numbers.")

✅ 'Members' and 'Score' columns converted to numbers.


In [9]:
# CREATE AN INTERACTIVE MAP
fig = px.scatter(
    df,
    x='x',
    y='y',
    hover_name='title',
    hover_data={
        'x': False,
        'y': False,
        'Score': ':.2f',
        'Genres': True,
        'Released_Year': True,
    },
    title='The Thematic Anime Universe',
    template='plotly_dark',
    opacity=0.6
)

# ENHANCE THE UI
fig.update_traces(marker=dict(size=2))
fig.update_layout(
    font=dict(family="Times New Roman, monospace", size=12),
    margin=dict(l=0, r=0, b=0, t=80)
)

fig.show()

In [10]:
# PARTITION THE 2D MAP INTO 100 K-MEANS CLUSTERS (k is fixed in advance, not detected)
kmeans = KMeans(n_clusters=100, random_state=100)
df['Cluster'] = kmeans.fit_predict(embedding)
# Cast to string so Plotly treats the labels as categories rather than a numeric scale
df['Cluster'] = df['Cluster'].astype(str)

# CREATE MAP
fig = px.scatter(
    df, x='x', y='y',
    color='Cluster',
    hover_name='title',
    hover_data={
        'x': False,
        'y': False,
        'Score': ':.2f',
        'Genres': True,
        'Released_Year': True,
    },
    title='<b>The Anime Universe: 100 Thematic Clusters</b>',
    template='plotly_dark',
    opacity=0.5
)

fig.update_traces(marker=dict(size=2))
fig.update_layout(
    font=dict(family="Times New Roman, monospace", size=12),
    margin=dict(l=0, r=0, b=0, t=80)
)
fig.show()
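
K-Means does not discover how many clusters exist. The value of `n_clusters` is chosen up front, and every point is assigned to its nearest centroid. The sketch below is a bare-bones version of Lloyd's algorithm on a few made-up 2D points. The `kmeans` helper is hypothetical, and scikit-learn's `KMeans` adds smarter initialisation and other refinements on top of this idea.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid (squared Euclidean distance)
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids

# Made-up map coordinates forming two tight groups
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because every point must belong to some cluster, K-Means with a fixed k of 100 will split dense regions and absorb outliers alike. This is one reason to also try a density-based method.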

In [13]:


# 1. AUTONOMOUS CLUSTER DISCOVERY (HDBSCAN chooses the number of clusters itself)
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10,
    min_samples=None,
    prediction_data=True
)

df['Cluster'] = clusterer.fit_predict(embedding)

# 2. CLEAN UP FOR VISUALS (HDBSCAN labels unclustered points as -1)
df['Cluster_Name'] = df['Cluster'].apply(lambda x: f"Cluster {x}" if x != -1 else "Background Noise")

# 3. CREATE MAP
# Color by the string label so Plotly uses the discrete palette instead of a numeric color scale
fig = px.scatter(
    df, x='x', y='y',
    color='Cluster_Name',
    hover_name='title',
    hover_data={'x': False, 'y': False, 'Genres': True, 'Score': True, 'Cluster_Name': False, 'Released_Year': True},
    title='<b>The Anime Universe: Autonomous Thematic Clustering</b>',
    template='plotly_dark',
    color_discrete_sequence=px.colors.qualitative.Prism,
    opacity=0.4
)

fig.update_traces(marker=dict(size=2))
fig.update_layout(
    font=dict(family="Times New Roman, monospace", size=12),
    margin=dict(l=0, r=0, b=0, t=80)
)
fig.show()
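
Before reading too much into the map, it helps to check how many clusters HDBSCAN found and how many titles it left as noise (label `-1`). A minimal sketch with a hypothetical label list follows. In the notebook the same summary could be run on `df['Cluster']`.

```python
from collections import Counter

# Hypothetical labels in the form HDBSCAN returns them (-1 marks noise)
labels = [0, 0, 1, -1, 1, 1, -1, 2, 0, -1]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)          # separate the noise points from real clusters
n_clusters = len(counts)
noise_share = n_noise / len(labels)

print(f"Clusters found: {n_clusters}")
print(f"Noise points:   {n_noise} ({noise_share:.0%})")
for cluster_id, size in counts.most_common():
    print(f"Cluster {cluster_id}: {size} titles")
```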

In [16]:
naruto_mask = df['title'].str.contains("naruto", case=False, na=False)
print(df.loc[naruto_mask, ['title', 'x', 'y']])

                                                   title          x          y
10                                                Naruto  -0.270957   7.350733
414    Naruto Movie 1: Dai Katsugeki!! Yuki Hime Ninp...  -0.311670   7.373579
556    Naruto: Takigakure no Shitou - Ore ga Eiyuu Da...  -0.276285   7.346038
694            Naruto: Akaki Yotsuba no Clover wo Sagase  -0.357186  10.390189
846    Naruto Movie 2: Dai Gekitotsu! Maboroshi no Ch...  -0.257501   7.354449
970    Naruto Narutimate Hero 3: Tsuini Gekitotsu! Jo...  -0.275211   7.393491
1566                                  Naruto: Shippuuden  -0.276355   7.349566
1950   Naruto Movie 3: Dai Koufun! Mikazuki Jima no A...  -0.270878   7.346636
2045   Naruto: Dai Katsugeki!! Yuki Hime Ninpouchou D...  -0.258852   7.333493
2252                          Naruto: Shippuuden Movie 1  -0.358988   7.664964
3457   Naruto: Shippuuden - Shippuu! "Konoha Gakuen" Den  -0.290861   7.360126
3569                 Naruto: Shippuuden Movie 2 - Ki