# **CSCI 2026: Individual Term Project**
## **Mapping the Anime Universe: A Semantic Analysis of 20,000+ Stories**

**Student:** Ignasi Bonmati  
**Date:** January 2026  
**Institution:** The College of Idaho  

---

### **Project Overview**
This project explores the relationship between different anime series by analyzing their textual descriptions. Instead of relying on human-made tags, we use **Natural Language Processing (NLP)** and **Machine Learning** to identify hidden thematic clusters.

**The Workflow Includes:**
1.  **Data Cleaning:** Removing metadata, website UI text, and stop words.
2.  **TF-IDF Vectorization:** Converting descriptions into a sparse matrix of up to 5,000 unigram and bigram features.
3.  **UMAP Dimensionality Reduction:** Squashing those 5,000 dimensions into a 2D map while preserving topological relationships.
4.  **Interactive Visualization:** A dynamic scatter plot to explore the "islands" of the anime universe.

In [None]:
# 1. INSTALL ALL NECESSARY LIBRARIES (Ensure that your environment is ready)
#%pip install pandas numpy scikit-learn umap-learn plotly tqdm ipywidgets nbformat hdbscan spacy

In [None]:
# CENTRALIZED IMPORTS
import pandas as pd
print("pandas imported successfully!")
import numpy as np
print("numpy imported successfully!")
import re
print("re imported successfully!")
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
print("ENGLISH_STOP_WORDS and TfidfVectorizer imported successfully!")
import umap
print("umap imported successfully!")
import plotly.express as px
print("plotly.express imported successfully!")
from sklearn.cluster import KMeans
print("sklearn.cluster imported successfully!")
import hdbscan
print("hdbscan imported successfully!")
import spacy
print("spacy imported successfully!")


pandas imported successfully!
numpy imported successfully!
re imported successfully!
ENGLISH_STOP_WORDS and TfidfVectorizer imported successfully!
umap imported successfully!
plotly.express imported successfully!
sklearn.cluster imported successfully!
hdbscan imported successfully!
spacy imported successfully!


In [69]:
# LOAD DATASETS
df_main = pd.read_csv('data/mal_anime_2025.csv')
df_5ya  = pd.read_csv('data/animes_5ya.csv')
df_6ya  = pd.read_csv('data/animes_6ya.csv')

In [None]:
### COMING UP WITH THIS SOLUTION TO MERGE THE DATASETS ALMOST MADE ME GIVE UP ###
# LONG LOAD CELL #

# FORCE LOWERCASE COLUMNS
df_main.columns = df_main.columns.str.strip().str.lower()
df_5ya.columns  = df_5ya.columns.str.strip().str.lower()
df_6ya.columns  = df_6ya.columns.str.strip().str.lower()

# HELPER FUNCTIONS
def get_merge_key(title_series):
    return title_series.astype(str).str.lower().str.replace(r'[^a-z0-9]', '', regex=True)

def prep_dataset(df, title_col, desc_col, source_name):
    df = df.copy()
    df['merge_key'] = get_merge_key(df[title_col])
    
    rename_map = {
        title_col: f'title_{source_name}',
        desc_col:  f'desc_{source_name}'
    }
    df = df.rename(columns=rename_map)
    df[f'desc_{source_name}'] = df[f'desc_{source_name}'].fillna("").astype(str)
    
    return df[['merge_key', f'title_{source_name}', f'desc_{source_name}']]

# PREPARE DATASETS
# Note: We use lowercase 'title', 'description', 'synopsis' because the column names were lowercased above
clean_main = prep_dataset(df_main, title_col='title', desc_col='description', source_name='main')
clean_5ya  = prep_dataset(df_5ya,  title_col='title', desc_col='synopsis',    source_name='5ya')
clean_6ya  = prep_dataset(df_6ya,  title_col='title', desc_col='synopsis',    source_name='6ya')

# MERGE BY NORMALIZED TITLE
print("Merging datasets by title...")
merged = clean_main.merge(clean_5ya, on='merge_key', how='outer')
merged = merged.merge(clean_6ya,  on='merge_key', how='outer')

# RESOLVE TEXT CONTENT
def resolve_final_text(row):
    # TITLE RESOLUTION
    title = row.get('title_main')
    if pd.isna(title) or str(title) == 'nan':
        title = row.get('title_5ya')
    if pd.isna(title) or str(title) == 'nan':
        title = row.get('title_6ya')
        
    # DESCRIPTION RESOLUTION
    desc = ""
    def is_valid(t): return isinstance(t, str) and len(t) > 20 and t.lower() != 'nan'
    
    if is_valid(row.get('desc_main')):
        desc = row['desc_main']
    elif is_valid(row.get('desc_5ya')):
        desc = row['desc_5ya']
    elif is_valid(row.get('desc_6ya')):
        desc = row['desc_6ya']
        
    return pd.Series([title, desc], index=['title', 'description'])

merged[['title', 'description']] = merged.apply(resolve_final_text, axis=1)

# CREATE MASTER DATAFRAME AND FILTER
master_df = merged[['title', 'description']].copy()

master_df = master_df.dropna(subset=['title'])
master_df = master_df[master_df['title'] != ""]
master_df = master_df[master_df['description'] != ""]

# GENERATE NEW UNIFIED IDS
master_df = master_df.reset_index(drop=True)
master_df['id'] = master_df.index + 1

# PREPARE METADATA BACKFILL
def get_meta_source(df, title_col, cols_to_grab):
    df = df.copy()
    df['merge_key'] = get_merge_key(df[title_col])
    
    # SAFETY: If column doesn't exist, create it as NaN (Fixes KeyError)
    for c in cols_to_grab:
        if c not in df.columns:
            df[c] = np.nan
            
    return df[['merge_key'] + cols_to_grab]

# EXTRACT METADATA
# We ask for lowercase 'score', 'genres', 'aired' because the column names were lowercased above
meta_main = get_meta_source(df_main, 'title', ['score', 'genres', 'aired'])
meta_5ya  = get_meta_source(df_5ya,  'title', ['score']) 
meta_6ya  = get_meta_source(df_6ya,  'title', ['score'])

meta_main.rename(columns={'score': 'score_main', 'genres': 'genres_main', 'aired': 'aired_main'}, inplace=True)
meta_5ya.rename(columns={'score': 'score_5ya'}, inplace=True)
meta_6ya.rename(columns={'score': 'score_6ya'}, inplace=True)

# MERGE METADATA
master_df['merge_key'] = get_merge_key(master_df['title'])
master_df = master_df.merge(meta_main, on='merge_key', how='left')
master_df = master_df.merge(meta_5ya,  on='merge_key', how='left')
master_df = master_df.merge(meta_6ya,  on='merge_key', how='left')

# SCORES AND RENAMING

master_df['score'] = master_df['score_main'].fillna(master_df['score_5ya']).fillna(master_df['score_6ya'])
master_df.rename(columns={'genres_main': 'genres', 'aired_main': 'aired'}, inplace=True)

# FINAL CLEANUP
master_df = master_df[['id', 'title', 'description', 'score', 'genres', 'aired']]
master_df['score'] = pd.to_numeric(master_df['score'], errors='coerce')
master_df.sort_values(by='score', ascending=False, na_position='last', inplace=True)
master_df.drop_duplicates(subset=['title'], keep='first', inplace=True)

# SAVE TO CSV
master_df.to_csv('data/anime_master_final_clean.csv', index=False)

print(f"Final Row count: {len(master_df)}")
df = master_df

Merging datasets by title...
Final Row count: 23590
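
The merge above hinges on the normalized title key. The snippet below is a minimal pure-Python sketch of the same regex (the `merge_key` helper and the sample titles are hypothetical, not taken from the datasets). It shows how spelling variants collapse onto one key, and one caveat worth checking for: a title with no ASCII letters or digits normalizes to an empty string, so any such rows would share a key.

```python
import re

def merge_key(title: str) -> str:
    """Lowercase the title and keep only ASCII letters and digits."""
    return re.sub(r'[^a-z0-9]', '', str(title).lower())

# Hypothetical spellings of the same show, as different sources might list it
variants = ["Steins;Gate", "STEINS GATE", "steins-gate"]
keys = {merge_key(t) for t in variants}
print(keys)  # {'steinsgate'}

# Caveat: a title with no ASCII letters or digits becomes an empty key
print(repr(merge_key("進撃の巨人")))  # ''
```

If empty keys do occur in the real data, filtering them out before the outer merge is probably the simplest safeguard.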


In [None]:

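# LONG LOAD CELL #

# Load the spaCy pipeline that clean_text_pipeline uses for named-entity recognition.
# ASSUMPTION: the notebook never shows which model was loaded; the small English
# model is used here (install with: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")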

def clean_text_pipeline(df, text_col):
    """
    Performs full NLP cleaning:
    1. Removes Boilerplate (Source: X, Written by MAL Rewrite)
    2. Removes Personal Names (using spaCy NER)
    3. Lowercases & Removes Special Chars
    4. Removes Stopwords
    """
   
    raw_docs = df[text_col].fillna("").astype(str).tolist()
    
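    # Strip MAL attribution boilerplate before running NER
    clean_docs = [re.sub(r'\[Written by MAL Rewrite\]|\(Source:.*?\)', '', text) for text in raw_docs]

    # Drop every token that spaCy tags as part of a PERSON entity
    final_texts = []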

    for doc in nlp.pipe(clean_docs, batch_size=100, disable=["parser"]):
        tokens_to_keep = []
        for token in doc:
            if token.ent_type_ == "PERSON":
                continue

            tokens_to_keep.append(token.text)
        
        text_no_names = " ".join(tokens_to_keep)
        final_texts.append(text_no_names)

    
    
    # Lowercase, strip non-letters, then drop stop words and tokens of two characters or fewer
    processed_output = []
    for text in final_texts:
        text = text.lower()
        text = re.sub(r'[^a-z\s]', '', text)
        words = text.split()
        cleaned_words = [w for w in words if w not in ENGLISH_STOP_WORDS and len(w) > 2]
        
        processed_output.append(" ".join(cleaned_words))
        
    return processed_output

df['clean_description'] = clean_text_pipeline(df, 'description')
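
To see what the regex and stop-word stages do to a single synopsis, here is a small standalone sketch. It skips the spaCy name-removal step and uses a tiny hypothetical stop-word set in place of scikit-learn's `ENGLISH_STOP_WORDS`, so the `basic_clean` helper and the sample sentence are illustrations only.

```python
import re

# Small hypothetical stop-word set; the notebook uses scikit-learn's ENGLISH_STOP_WORDS
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "his", "by", "with"}

def basic_clean(text: str) -> str:
    # Drop MAL attribution boilerplate
    text = re.sub(r'\[Written by MAL Rewrite\]|\(Source:.*?\)', '', text)
    # Lowercase and keep only letters and whitespace
    text = re.sub(r'[^a-z\s]', '', text.lower())
    # Remove stop words and tokens of two characters or fewer
    return " ".join(w for w in text.split() if w not in STOP_WORDS and len(w) > 2)

sample = "A boy and his robot fight to save the city in 2099. [Written by MAL Rewrite]"
print(basic_clean(sample))  # boy robot fight save city
```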

In [74]:

# INITIALIZE THE VECTORIZER
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

# MATRIX GENERATION
tfidf_matrix = tfidf.fit_transform(df['clean_description'])
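
For intuition about the numbers inside `tfidf_matrix`, the sketch below computes TF-IDF by hand on three toy documents. It follows what I understand to be scikit-learn's default weighting (smoothed IDF, then L2 normalization), but it is a simplification: it uses whitespace tokens and unigrams only, and the `tfidf_rows` helper and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (unigrams only)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        # L2-normalize so each document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ["demon lord village", "demon hunter", "school romance club"]
rows = tfidf_rows(docs)
for term, weight in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "demon" also appears in the second document, so it ends up with a lower weight than "lord" and "village", which are unique to that description.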


In [90]:
# CHECK IF THE ANIMES LOOK RIGHT
feature_names = tfidf.get_feature_names_out()
anime_pos = 123

anime_title = df['title'].iloc[anime_pos]
first_vector = tfidf_matrix[anime_pos]

# Convert the sparse row to a dense format and sort by score
df_tfidf = pd.DataFrame(first_vector.T.todense(), index=feature_names, columns=["tfidf_score"])
top_keywords = df_tfidf.sort_values(by="tfidf_score", ascending=False).head(10)

print(f"\n Top Keywords for: {anime_title}")
print(top_keywords)


 Top Keywords for: Kono Subarashii Sekai ni Shukufuku wo!: Kurenai Densetsu
                   tfidf_score
crimson               0.383615
demons                0.289257
kazuma                0.281712
demon                 0.271623
demon lord            0.252688
lord                  0.207273
village               0.181658
just                  0.138528
generals              0.135806
misunderstandings     0.135190


In [None]:
# LONG LOAD CELL #

#UMAP (2D Projection)
reducer = umap.UMAP(
    n_neighbors=15, 
    min_dist=0.1, 
    metric='cosine', 
    random_state=42
)
embedding = reducer.fit_transform(tfidf_matrix)

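# SAVE TO DATAFRAME
df['x'] = embedding[:, 0]
df['y'] = embedding[:, 1]

# Preview the coordinates of the first few titles
print("✅ Success! The Anime Universe has been mapped.")
print(df[['title', 'x', 'y']].head())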




n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.




✅ Success! The Anime Universe has been mapped.
                                           title          x          y
50294                          Sousou no Frieren  10.050826   9.380158
3746                Chainsaw Man Movie: Reze-hen  15.954387   6.082365
9080            Fullmetal Alchemist: Brotherhood  10.404366   9.168086
44742  Quiz de Manabu Pinocchio no Koutsuu Ansen  13.655251   6.329363
50493                                Steins;Gate  11.570064  10.421191
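
UMAP is configured with `metric='cosine'`, so neighbors are chosen by the angle between TF-IDF vectors rather than by their length. A minimal sketch of that distance on sparse vectors follows; the `cosine_distance` helper and the three weight dictionaries are hypothetical examples, not values from the notebook.

```python
import math

def cosine_distance(a: dict, b: dict) -> float:
    """1 - cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights for three descriptions
fantasy_a = {"demon": 0.6, "lord": 0.8}
fantasy_b = {"demon": 0.8, "village": 0.6}
romance = {"school": 0.7, "club": 0.7}

print(round(cosine_distance(fantasy_a, fantasy_b), 2))  # 0.52
print(cosine_distance(fantasy_a, romance))              # 1.0
```

Descriptions that share no vocabulary sit at the maximum distance of 1.0, which is why titles with unusual wording tend to drift away from the main "islands".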


In [77]:

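# Unscored titles become 0 so the hover data always shows a number
df['score'] = pd.to_numeric(df['score'], errors='coerce').fillna(0)

# Take the first four-digit number in 'aired' as the release year
df['year'] = df['aired'].astype(str).str.extract(r'(\d{4})', expand=False).fillna("Unknown")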
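
The year rule above can be checked on a few strings in isolation. This is a small sketch with a hypothetical `extract_year` helper and made-up `aired` values; the real column format may differ.

```python
import re

def extract_year(aired) -> str:
    """Return the first four-digit run in the value, or 'Unknown'."""
    match = re.search(r'\d{4}', str(aired))
    return match.group(0) if match else "Unknown"

# Hypothetical 'aired' values; the real column format may differ
samples = ["Apr 6, 2011 to Sep 14, 2011", "2019", "Not available", None]
years = [extract_year(value) for value in samples]
print(years)  # ['2011', '2019', 'Unknown', 'Unknown']
```

For a date range, the first match is the start year, so long-running series are filed under the year they began.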

In [91]:
# AUTONOMOUS CLUSTER DISCOVERY
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,
    min_samples=15,
    metric='euclidean',
    cluster_selection_method='eom' 
)

# CLUSTERING
df['Cluster'] = clusterer.fit_predict(embedding)

# LABEL BACKGROUND NOISE (HDBSCAN assigns -1 to points that belong to no cluster)
df['Cluster_Name'] = df['Cluster'].apply(lambda x: f"Cluster {x}" if x != -1 else "Background Noise")

# CREATE MAP
fig = px.scatter(
    df, x='x', y='y',
    color='Cluster_Name',  
    hover_name='title',
    hover_data={
        'x': False, 
        'y': False, 
        'genres': True,       
        'score': True,        
        'Cluster_Name': True, 
        'aired': True         
    },
    title='<b>The Anime Universe: Autonomous Thematic Clustering</b>',
    template='plotly_dark',
    color_discrete_sequence=px.colors.qualitative.Prism, 
    opacity=0.4
)

fig.update_traces(marker=dict(size=2))
fig.update_layout(
    # Colors are discrete cluster labels, so the legend (not a colorbar) carries the title
    legend_title_text="Cluster",
    font=dict(family="Times New Roman, serif", size=12),
    margin=dict(l=0, r=0, b=0, t=80)
)
fig.show()
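
Before reading too much into the map, it can help to check how many points HDBSCAN left as noise and how large each cluster is. The sketch below does this on a hypothetical label list in the same shape as `df['Cluster']` (with -1 meaning noise); the numbers are invented for illustration.

```python
from collections import Counter

# Hypothetical labels in the shape HDBSCAN returns: -1 means noise
labels = [0, 0, 1, -1, 1, 1, -1, 2, 0, -1]

counts = Counter(labels)
noise_share = counts[-1] / len(labels)
cluster_sizes = {label: n for label, n in sorted(counts.items()) if label != -1}

print(f"Noise share: {noise_share:.0%}")  # Noise share: 30%
print(cluster_sizes)                      # {0: 3, 1: 3, 2: 1}
```

A very high noise share would suggest revisiting `min_cluster_size` or `min_samples`.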