# **CSCI 2026: Individual Term Project**
## **Mapping the Anime Universe: A Semantic Analysis of 20,000+ Stories**

**Student:** Ignasi Bonmati  
**Date:** January 2026  
**Institution:** The College of Idaho  

---

### **Project Overview**
This project explores how anime series relate to one another by analyzing their textual descriptions. Instead of relying on human-made tags, we use **Natural Language Processing (NLP)** and **Machine Learning** to surface thematic clusters that those tags may not capture.

**The Workflow Includes:**
1.  **Data Cleaning:** Removing metadata, website UI text, and stop words.
2.  **TF-IDF Vectorization:** Converting descriptions into a sparse matrix of 8,000 TF-IDF features (unigrams and bigrams), one row per anime.
3.  **UMAP Dimensionality Reduction:** Projecting the 8,000-dimensional vectors onto a 2D map while preserving as much of the local neighborhood structure as possible.
4.  **Interactive Visualization:** A dynamic scatter plot to explore the "islands" of the anime universe.

In [30]:
# 1. INSTALL ALL NECESSARY LIBRARIES (ensure that your environment is ready)
%pip install pandas numpy scikit-learn umap-learn plotly tqdm ipywidgets


Note: you may need to restart the kernel to use updated packages.


In [31]:
# CENTRALIZED IMPORTS
import pandas as pd
print("pandas imported successfully!")
import numpy as np
print("numpy imported successfully!")
import re
print("re imported successfully!")
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
print("ENGLISH_STOP_WORDS and TfidfVectorizer imported successfully!")
import umap
print("umap imported successfully!")
import plotly.express as px
print("plotly.express imported successfully!")


pandas imported successfully!
numpy imported successfully!
re imported successfully!
ENGLISH_STOP_WORDS and TfidfVectorizer imported successfully!
umap imported successfully!
plotly.express imported successfully!


In [32]:

# LOAD THE DATASET
file_path = 'mal_anime_2025.csv'

df = pd.read_csv(file_path)

# Drop duplicates and reset index to ensure proper matrix indexing
df = df.drop_duplicates(subset=['title']).dropna(subset=['description']).reset_index(drop=True)

# Quick Preview
print(f"Loaded {len(df)} unique animes with descriptions.")
print(f"Available columns: {list(df.columns)}")

Loaded 19874 unique animes with descriptions.
Available columns: ['myanimelist_id', 'title', 'description', 'image', 'Type', 'Episodes', 'Status', 'Premiered', 'Released_Season', 'Released_Year', 'Source', 'Genres', 'Themes', 'Studios', 'Producers', 'Demographic', 'Duration', 'Rating', 'Score', 'Ranked', 'Popularity', 'Members', 'Favorites', 'characters', 'source_url']


In [33]:

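# CLEANING FUNCTION
def clean_description(text):
    """Lowercase a description, keep only letters and whitespace, and drop stop words."""
    if pd.isna(text) or text == "":
        return ""
    text = text.lower()
    # Delete every character that is not a-z or whitespace. Nothing is inserted
    # in its place, so text with no space after a period ("hunters.Spike")
    # fuses into a single token ("huntersspike").
    text = re.sub(r'[^a-z\s]', '', text)
    words = text.split()
    # Remove scikit-learn's built-in English stop words
    cleaned_words = [w for w in words if w not in ENGLISH_STOP_WORDS]
    return " ".join(cleaned_words)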

# CLEANING
df['clean_desc'] = df['description'].apply(clean_description)

# PREVIEW THE DIFFERENCE
print("\n--- BEFORE ---")
print(df['description'].iloc[0][:1500])
print("\n--- AFTER ---")
print(df['clean_desc'].iloc[0][:1500])




--- BEFORE ---
Crime is timeless. By the year 2071, humanity has expanded across the galaxy, filling the surface of other planets with settlements like those on Earth. These new societies are plagued by murder, drug use, and theft, and intergalactic outlaws are hunted by a growing number of tough bounty hunters.Spike Spiegel and Jet Black pursue criminals throughout space to make a humble living. Beneath his goofy and aloof demeanor, Spike is haunted by the weight of his violent past. Meanwhile, Jet manages his own troubled memories while taking care of Spike and the Bebop, their ship. The duo is joined by the beautiful con artist Faye Valentine, odd child Edward Wong Hau Pepelu Tivrusky IV, and Ein, a bioengineered Welsh corgi.While developing bonds and working to catch a colorful cast of criminals, the Bebop crew's lives are disrupted by a menace from Spike's past. As a rival's maniacal plot continues to unravel, Spike must choose between life with his newfound family or revenge for
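
The raw description above has several sentences joined without a space ("hunters.Spike", "ship.The", "corgi.While"). One possible variant of the cleaning step, sketched here under assumptions, substitutes a space for each removed character so those words stay separate. The function name `clean_description_spaced` and the tiny `DEMO_STOP_WORDS` set are hypothetical and exist only to keep the example self-contained; the notebook itself uses scikit-learn's `ENGLISH_STOP_WORDS`.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook
# uses scikit-learn's much larger ENGLISH_STOP_WORDS instead.
DEMO_STOP_WORDS = frozenset({"a", "and", "are", "by", "is", "of", "the", "to"})

def clean_description_spaced(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, turn every non-letter into a space, then drop stop words."""
    if not isinstance(text, str) or text == "":
        return ""
    text = text.lower()
    # Substituting a space (not an empty string) keeps "hunters.spike"
    # as two tokens instead of fusing it into "huntersspike".
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "tough bounty hunters.Spike Spiegel and Jet Black"
print(clean_description_spaced(sample))
```

A side effect of this variant is that possessives split into a stray letter ("crew's" becomes "crew s"). With `TfidfVectorizer`'s default token pattern, single-character tokens are ignored, so this should not add features, but it is worth checking on the real data.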

In [34]:

# INITIALIZE THE VECTORIZER
tfidf = TfidfVectorizer(max_features=8000, ngram_range=(1, 2))

# MATRIX GENERATION
tfidf_matrix = tfidf.fit_transform(df['clean_desc'])

# CHECK THE SHAPE
print(f"Matrix Shape: {tfidf_matrix.shape}")

Matrix Shape: (19874, 8000)
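
To make the numbers inside that matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, then L2-normalized per row. It is a simplification, not the library's implementation. It assumes whitespace tokenization and unigrams only, whereas the notebook also uses bigrams and caps the vocabulary at 8,000 features. The function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy, already-cleaned descriptions (hypothetical data)
toy_docs = [
    "space bounty hunters",
    "space pilots fight angels",
    "high school romance",
]
rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, "->", {term: round(weight, 3) for term, weight in row.items()})
```

In the first toy document, "space" receives a lower weight than "bounty" because it also appears in the second document. The same effect pushes rare, distinctive words to the top of each anime's keyword list.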


In [39]:
# SANITY CHECK: INSPECT THE TOP TF-IDF KEYWORDS FOR ONE ANIME
anime_pos = 20
anime_title = df['title'].iloc[anime_pos]
first_vector = tfidf_matrix[anime_pos]

# Vocabulary terms, in the same order as the matrix columns
feature_names = tfidf.get_feature_names_out()

# Convert the sparse row to a dense format and sort by score
df_tfidf = pd.DataFrame(first_vector.T.todense(), index=feature_names, columns=["tfidf_score"])
top_keywords = df_tfidf.sort_values(by="tfidf_score", ascending=False).head(70)

print(f"\n--- Top Keywords for: {anime_title} ---")
print(top_keywords)


--- Top Keywords for: Shinseiki Evangelion ---
            tfidf_score
shin           0.462599
angels         0.246100
tokugawa       0.227532
pinnacle       0.185929
transports     0.184181
...                 ...
beings         0.068659
combine        0.068474
shoutarou      0.068113
million        0.067420
ultra          0.067309

[70 rows x 1 columns]
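
The check above builds a dense 8,000-entry column just to read off a few dozen scores. A lighter alternative, sketched here as an assumption rather than the notebook's method, ranks only the nonzero entries of the row. The helper `top_keywords_sparse` and the toy inputs are hypothetical; the lists stand in for the `indices` and `data` attributes of a single-row SciPy CSR matrix.

```python
import heapq

def top_keywords_sparse(indices, data, feature_names, k=10):
    """Return the k highest-scoring (term, score) pairs of one sparse row."""
    # Pair each nonzero column index with its term and score
    pairs = ((feature_names[i], float(score)) for i, score in zip(indices, data))
    # nlargest avoids sorting the full vocabulary
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])

# Toy stand-ins for a vocabulary and one sparse row (hypothetical values)
demo_features = ["angels", "city", "pilot", "robot", "school"]
demo_indices = [0, 2, 3]      # columns with nonzero scores
demo_data = [0.7, 0.5, 0.2]   # the corresponding scores

print(top_keywords_sparse(demo_indices, demo_data, demo_features, k=2))
```

In the notebook this could be called as `top_keywords_sparse(first_vector.indices, first_vector.data, feature_names, k=70)`, assuming `first_vector` is the single-row CSR matrix returned by `tfidf_matrix[anime_pos]`.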


In [None]:


# UMAP (2D PROJECTION)
reducer = umap.UMAP(
    n_neighbors=15, 
    min_dist=0.1, 
    metric='cosine', 
    random_state=42
)
embedding = reducer.fit_transform(tfidf_matrix)

# SAVE TO DATAFRAME
df['x'] = embedding[:, 0]
df['y'] = embedding[:, 1]

print("\n✅ Success! The Anime Universe has been mapped.")
print(df[['title', 'x', 'y']].head())

🚀 Step 2: Generating new Map Coordinates... (Wait 1-3 mins)

✅ Success! The Anime Universe has been mapped.
                             title         x         y
0                     Cowboy Bebop  9.885464  3.544915
1  Cowboy Bebop: Tengoku no Tobira  7.632524 -0.412028
2                           Trigun  3.277765  3.035740
3               Witch Hunter Robin  7.859602  0.782811
4                   Bouken Ou Beet  8.133928  4.032320
