# **CSCI 2026: Individual Term Project**
## **Mapping the Anime Universe: A Semantic Analysis of 20,000 Stories**

**Student:** Ignasi Bonmati Gonzalvez
**Date:** January 29, 2026  
**Institution:** The College of Idaho  

---

### **Project Overview**
This project explores the relationship between different anime series by analyzing their textual descriptions. Instead of relying on human-made tags, we use **Natural Language Processing (NLP)** and **Machine Learning** to identify hidden thematic clusters.

**The Workflow Includes:**
1.  **Data Cleaning:** Removing metadata, website UI text, and stop words.
2.  **TF-IDF Vectorization:** Converting descriptions into an 3,000-dimensional mathematical matrix.
3.  **UMAP Dimensionality Reduction:** Squashing 8,000 dimensions into a 2D map while preserving topological relationships(semantic neighborhoods).
4.  **Interactive Visualization:** A dynamic scatter plot to explore the "islands" of the anime universe.
5.  **Neural Search Engine:** A semantic search tool that allows users to query the database using natural language descriptions (e.g., *"sad robot in space"*) via Cosine Similarity.

In [1]:
# 1. INSTALL ALL NECESSARY LIBRARIES (Ensure that your enviroment is ready)
#%pip install pandas numpy scikit-learn umap-learn plotly tqdm ipywidgets nbformat hdbscan spacy

In [2]:
# CENTRALIZED IMPORTS
import pandas as pd
print("pandas importeded succesfully!")
import numpy as np
print("numpy importeded succesfully!")
from IPython.display import display, HTML
print("display and HTML importeded succesfully! from IPython.display")
import re
print("re importeded succesfully!")
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
print("ENGLISH_STOP_WORDS and TfidfVectorizer importeded succesfully!")
import umap
print("umap importeded succesfully!!")
import plotly.express as px
print("plotly. express importeded succesfully!")
from sklearn.cluster import KMeans
print("sklearn.cluster importeded succesfully!")
import hdbscan
print("hdbscan importeded succesfully!")
import spacy
print("spacy importeded succesfully!")
import ipywidgets as widgets
print("ipywidgets importeded succesfully!")
from sentence_transformers import SentenceTransformer, util
print("SentenceTransformer and util importeded succesfully! from sentence_transformers")
import torch
print("torch importeded succesfully!")

# USSED LATTER

nlp = spacy.load("en_core_web_sm")

pandas importeded succesfully!
numpy importeded succesfully!
display and HTML importeded succesfully! from IPython.display
re importeded succesfully!
ENGLISH_STOP_WORDS and TfidfVectorizer importeded succesfully!
umap importeded succesfully!!
plotly. express importeded succesfully!
sklearn.cluster importeded succesfully!
hdbscan importeded succesfully!
spacy importeded succesfully!
ipywidgets importeded succesfully!
SentenceTransformer and util importeded succesfully! from sentence_transformers
torch importeded succesfully!


In [3]:
# LOAD DATASETS
df_main = pd.read_csv('data/mal_anime_2025.csv')
df_5ya  = pd.read_csv('data/animes_5ya.csv')
df_6ya  = pd.read_csv('data/animes_6ya.csv')

In [4]:
### COMING UP WITH THIS SOLUTION TO MERGE THE DATASETS ALMOS MAKES ME GIVE UP ###
# LONG LOAD CELL #

# FORCE LOWERCASE COLUMNS
df_main.columns = df_main.columns.str.strip().str.lower()
df_5ya.columns  = df_5ya.columns.str.strip().str.lower()
df_6ya.columns  = df_6ya.columns.str.strip().str.lower()

# HELPER FUNCTIONS
def get_merge_key(title_series):
    return title_series.astype(str).str.lower().str.replace(r'[^a-z0-9]', '', regex=True)

def prep_dataset(df, title_col, desc_col, source_name):
    df = df.copy()
    df['merge_key'] = get_merge_key(df[title_col])
    
    rename_map = {
        title_col: f'title_{source_name}',
        desc_col:  f'desc_{source_name}'
    }
    df = df.rename(columns=rename_map)
    df[f'desc_{source_name}'] = df[f'desc_{source_name}'].fillna("").astype(str)
    
    return df[['merge_key', f'title_{source_name}', f'desc_{source_name}']]

# PREPARE DATASETS
# Note: We use lowercase 'title', 'description', 'synopsis' now because we ran step 1
clean_main = prep_dataset(df_main, title_col='title', desc_col='description', source_name='main')
clean_5ya  = prep_dataset(df_5ya,  title_col='title', desc_col='synopsis',    source_name='5ya')
clean_6ya  = prep_dataset(df_6ya,  title_col='title', desc_col='synopsis',    source_name='6ya')

# MERGE BY NORMALIZED TITLE
merged = clean_main.merge(clean_5ya, on='merge_key', how='outer')
merged = merged.merge(clean_6ya,  on='merge_key', how='outer')

# RESOLVE TEXT CONTENT
def resolve_final_text(row):
    # TITLE RESOLUTION
    title = row.get('title_main')
    if pd.isna(title) or str(title) == 'nan':
        title = row.get('title_5ya')
    if pd.isna(title) or str(title) == 'nan':
        title = row.get('title_6ya')
        
    # DESCRIPTION RESOLUTION
    desc = ""
    
    def is_valid(t):
        if not isinstance(t, str): return False
        if t.lower() == 'nan': return False
        # Count words by splitting spaces. Require at least 15 words.
        return len(t.split()) > 15
    
    if is_valid(row.get('desc_main')):
        desc = row['desc_main']
    elif is_valid(row.get('desc_5ya')):
        desc = row['desc_5ya']
    elif is_valid(row.get('desc_6ya')):
        desc = row['desc_6ya']
        
    return pd.Series([title, desc], index=['title', 'description'])

merged[['title', 'description']] = merged.apply(resolve_final_text, axis=1)

# CREATE MASTER DATAFRAME AND FILTER
master_df = merged[['title', 'description']].copy()

master_df = master_df.dropna(subset=['title'])
master_df = master_df[master_df['title'] != ""]
master_df = master_df[master_df['description'] != ""]

# GENERATE NEW UNIFIED IDS
master_df = master_df.reset_index(drop=True)
master_df['id'] = master_df.index + 1

# PREPARE METADATA BACKFILL
def get_meta_source(df, title_col, cols_to_grab):
    df = df.copy()
    df['merge_key'] = get_merge_key(df[title_col])
    
    # SAFETY: If column doesn't exist, create it as NaN (Fixes KeyError)
    for c in cols_to_grab:
        if c not in df.columns:
            df[c] = np.nan
            
    return df[['merge_key'] + cols_to_grab]

# EXTRACT METADATA
# We ask for 'score', 'genres', 'aired' because we lowercased everything in Step 1
meta_main = get_meta_source(df_main, 'title', ['score', 'genres', 'aired'])
meta_5ya  = get_meta_source(df_5ya,  'title', ['score']) 
meta_6ya  = get_meta_source(df_6ya,  'title', ['score'])

meta_main.rename(columns={'score': 'score_main', 'genres': 'genres_main', 'aired': 'aired_main'}, inplace=True)
meta_5ya.rename(columns={'score': 'score_5ya'}, inplace=True)
meta_6ya.rename(columns={'score': 'score_6ya'}, inplace=True)

# MERGE METADATA
master_df['merge_key'] = get_merge_key(master_df['title'])
master_df = master_df.merge(meta_main, on='merge_key', how='left')
master_df = master_df.merge(meta_5ya,  on='merge_key', how='left')
master_df = master_df.merge(meta_6ya,  on='merge_key', how='left')

# SCORES AND RENAMING

master_df['score'] = master_df['score_main'].fillna(master_df['score_5ya']).fillna(master_df['score_6ya'])
master_df.rename(columns={'genres_main': 'genres', 'aired_main': 'aired'}, inplace=True)

# FINAL CLEANUP
master_df = master_df[['id', 'title', 'description', 'score', 'genres', 'aired']]
master_df['score'] = pd.to_numeric(master_df['score'], errors='coerce')
master_df.sort_values(by='score', ascending=False, na_position='last', inplace=True)
master_df.drop_duplicates(subset=['title'], keep='first', inplace=True)

# SAVE TO CSV
master_df.to_csv('data/anime_master_final_clean.csv', index=False)

print(f"Final Row count: {len(master_df)}")
df = master_df

Final Row count: 19007


In [None]:

# LONG LOAD CELL # (3 minutes aprox.)
# CLEANING PROCESS

def clean_text_pipeline(df, text_col):
    """
    Performs full NLP cleaning:
    1. Removes Boilerplate (Source: X, Written by MAL Rewrite)
    2. Removes Personal Names (using spaCy NER)
    3. Lowercases & Removes Special Chars
    4. Removes Stopwords
    """
   
    raw_docs = df[text_col].fillna("").astype(str).tolist()
    
    clean_docs = [re.sub(r'\[Written by MAL Rewrite\]|\(Source:.*?\)', '', text) for text in raw_docs]
    
    final_texts = []

    for doc in nlp.pipe(clean_docs, batch_size=100, disable=["parser"]):
        tokens_to_keep = []
        for token in doc:
            if token.ent_type_ == "PERSON":
                continue

            tokens_to_keep.append(token.text)
        
        text_no_names = " ".join(tokens_to_keep)
        final_texts.append(text_no_names)

    
    
    processed_output = []
    for text in final_texts:
        text = text.lower()
        text = re.sub(r'[^a-z\s]', '', text)
        words = text.split()
        cleaned_words = [w for w in words if w not in ENGLISH_STOP_WORDS and len(w) > 2]
        
        processed_output.append(" ".join(cleaned_words))
        
    return processed_output

df['clean_description'] = clean_text_pipeline(df, 'description')
df['word_count'] = df['clean_description'].apply(lambda x: len(str(x).split()))
df = df[df['word_count'] > 5].copy()
df = df.reset_index(drop=True)

In [None]:
######## NEW ADDITION
print(f"Rows after removing empty descriptions: {len(df)}")
# Check if Shounen Ninja is still there
print("Shounen Ninja status:", df[df['title'].str.contains("Shounen Ninja", case=False)].shape[0] > 0)

Rows after removing empty descriptions: 18983
Shounen Ninja status: True


In [None]:
custom_stops = list(ENGLISH_STOP_WORDS) + [
    'anime', 'series', 'episode', 'episodes', 'tv', 'movie', 'film', 
    'story', 'world', 'japan', 'japanese', 'character', 'characters',
    'life', 'new', 'young', 'school', 'girl', 'boy', 'man',
    'aired', 'animated', 'animation', 'based', 'blu', 'channel', 'dvd', 
    'edition', 'follows', 'included', 'information', 'limited', 'ova', 
    'picture', 'program', 'recap', 'release', 'short', 'shorts', 'synopsis', 
    'synopsishere', 'title', 'video', 'youtube', 'nhk', 'tv',
    'conan', 'goku', 'kitty', 'mikiya', 'naruto', 'nobita', 'pikachu', 
    'pokmon', 'yuuta', 'chan', 'san', 'kun','help','year',
    'added', 'adding', 'bonus', 'bundled', 'collaboration', 'color', 
    'commercial', 'commercials', 'database', 'depicting', 'directed', 
    'educational', 'featured', 'great', 'improve', 'like', 'magazine', 
    'motion', 'original', 'prefecture', 'project', 'promote', 'dog',
    'promotional', 'real', 'released', 'screened', 'season', 
    'second', 'specials', 'studio', 'unit', 'version', 'volume',
    'adaptation', 'anniversary', 'aoi', 'old', 'year','gag',
    'broadcast', 'comic', 'company', 'does', 'double', 'dull', 
    'feature', 'footage', 'going', 'grand', 'happens', 'hiro', 
    'known', 'leaders', 'leading', 'learns', 'likes', 'lots', 'shojo'
    'minute', 'official', 'online', 'particularly', 'piece', 'car',
    'rei', 'shown', 'single', 'special', 'summary', 'tell', 'shonen',
    'types', 'voice', 'volumes', 'wants','luffy', 'uta', 'rei',
    'announced', 'collection', 'compilation', 'featuring', 'includes',
    'independent', 'published', 'releases', 'streamed', 'television', ' sshop',
    'unclassified', 'videos', 'vol', 'daily', 'day', 'events', 'future',
    'period', 'seasons', 'time', 'years', 'focus', 'manner', 'members',
    'place', 'scenes', 'takes', 'album', 'book', 'books', 'flash', 'live', 'monsters', 'dragons',
    'account', 'aimed', 'april', 'artist', 'box', 'case', 'debut', 'different', 'english', 'event',
    'half', 'join', 'later', 'main', 'manga', 'march', 'named', 'originally', 'point', 'shoujo',
    'stories', 'style', 'theme', 'unclassified', 'view', 'web', 'week', 'work','city', 'korea', 'people'
]

# INITIALIZE THE VECTORIZER
tfidf = TfidfVectorizer(
    max_features=3000, 
    ngram_range=(1, 2),
    min_df=10,
    max_df=0.5,
    stop_words=custom_stops 
)

# MATRIX GENERATION
tfidf_matrix = tfidf.fit_transform(df['clean_description'])

print(f"Matrix Shape: {tfidf_matrix.shape}")




Matrix Shape: (18983, 3000)


In [9]:
# LONG LOAD CELL # (2 minutes aprox.)

#UMAP (2D Projection)
reducer = umap.UMAP(
    n_neighbors=30, 
    min_dist=0.05, 
    metric='cosine', 
    random_state=42
)
embedding = reducer.fit_transform(tfidf_matrix)

# SAVE TO DATAFRAME
df['x'] = embedding[:, 0]
df['y'] = embedding[:, 1]



  warn(


In [10]:

df['score'] = pd.to_numeric(df['score'], errors='coerce').fillna(0)

df['year'] = df['aired'].astype(str).str.extract(r'(\d{4})').fillna("Unknown")

In [11]:
# AUTONOMOUS CLUSTER 
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=60,      
    min_samples=10,           
    metric='euclidean',
    cluster_selection_method='leaf' 
)
# CLUSTERING
df['Cluster'] = clusterer.fit_predict(embedding)

# BAKGROUND NOISE
df['Cluster_Name'] = df['Cluster'].apply(lambda x: f"Cluster {x}" if x != -1 else "Background Noise")

# AUTO-NAMING CLUSTERS
def auto_name_clusters(df, tfidf_matrix, vectorizer, top_n=3):
    feature_names = np.array(vectorizer.get_feature_names_out())
    cluster_names = {}
    unique_clusters = sorted(df['Cluster'].unique())
    
    for label in unique_clusters:
        if label == -1:
            cluster_names[label] = "Unclassified Noise"
            continue
            
        indices = df.index[df['Cluster'] == label].tolist()
        cluster_slice = tfidf_matrix[indices]
        word_scores = np.asarray(cluster_slice.sum(axis=0)).flatten()
        top_indices = word_scores.argsort()[::-1][:top_n]
        top_words = feature_names[top_indices]
        name = " • ".join([w.title() for w in top_words])
        cluster_names[label] = name
        
    return cluster_names

name_map = auto_name_clusters(df, tfidf_matrix, tfidf)
df['Cluster_Name'] = df['Cluster'].map(name_map)


In [None]:


# PREPARE DATA
df['visual_size'] = (df['score'].fillna(5) ** 2) / 10 

# CREATE GALAXY MAP
fig = px.scatter(
    df, 
    x='x', 
    y='y',
    color='Cluster_Name', 
    size='visual_size',       
    size_max=10,               
    hover_name='title',
    hover_data={
        'x': False, 'y': False, 'visual_size': False,
        'Cluster_Name': True,
        'genres': True,
        'score': True,
        'aired': True
    },
    title='<b>THE ANIME UNIVERSE</b><br><span style="font-size: 14px; color: grey;">A Thematic Map</span>',
    template='plotly_dark',
    color_discrete_sequence=px.colors.qualitative.Bold, 
    opacity=0.7
)

# CONFIGURE LAYOUT
fig.update_layout(
    title={
        'x': 0.98,
        'y': 0.95,
        'xanchor': 'right',
        'yanchor': 'top'
    },
    margin=dict(l=300, r=20, b=20, t=110),
    height=600, 
    xaxis=dict(visible=False),
    yaxis=dict(visible=False),
    font=dict(family="Segoe UI, sans-serif"),
    legend=dict(
        title="THEMES",
        itemsizing='constant',
        bgcolor='rgba(0,0,0,0)', 
        bordercolor="rgba(0,0,0,0)",
        x=-0.4,       
        y=1,         
        xanchor="left",
        yanchor="top",
        font=dict(size=11),
        tracegroupgap=5
    ),
    updatemenus=[
        dict(
            type="buttons",
            direction="left",
            buttons=list([
                dict(
                    args=[{"visible": True}],
                    label="Show All",
                    method="restyle"
                ),
                dict(
                    args=[{"visible": "legendonly"}],
                    label="Clear Map",
                    method="restyle"
                )
            ]),
            pad={"r": 10, "t": 10},
            showactive=True,
            x=-0.4,
            xanchor="left",
            y=1.15,  
            yanchor="top",
            bgcolor="#444444",
            font=dict(color="white")
        ),
    ]
)

fig.show()

In [13]:
num_clusters = df['Cluster'].max() + 1
print(f"Discovered {num_clusters} distinct themes.")
print(sorted(list(set(" ".join(df['Cluster_Name'].astype(str).unique()).replace(" • ", " ").lower().split()))))


Discovered 75 distinct themes.
['academy', 'agency', 'airing', 'ancient', 'animals', 'anpanman', 'arc', 'arts', 'baby', 'ball', 'band', 'baseball', 'basketball', 'birthday', 'black', 'blue', 'bomb', 'brother', 'brothers', 'cafe', 'candy', 'car', 'card', 'cards', 'cat', 'cats', 'celebrate', 'centers', 'chapter', 'chibi', 'child', 'childhood', 'children', 'chocolate', 'christmas', 'clan', 'class', 'club', 'combat', 'comedy', 'convenience', 'council', 'count', 'country', 'created', 'crew', 'cup', 'cute', 'cyborg', 'dark', 'demon', 'demons', 'detective', 'detectives', 'disc', 'doctor', 'dolls', 'dragon', 'drama', 'dream', 'early', 'earth', 'ending', 'evil', 'extra', 'fairy', 'family', 'fan', 'fashion', 'father', 'features', 'federation', 'films', 'flower', 'friend', 'friends', 'fuji', 'galaxy', 'game', 'gang', 'ghost', 'girls', 'gods', 'grandmother', 'group', 'hell', 'hero', 'high', 'hiroshima', 'hospital', 'hotel', 'house', 'husband', 'ichigo', 'idol', 'idols', 'island', 'kid', 'kids', 'k

In [14]:
# LONG LOAD TIME (6 minutes aprox.)
model = SentenceTransformer('all-MiniLM-L6-v2')
corpus_embeddings = model.encode(df['description'].fillna('').tolist(), convert_to_tensor=True)

def render_output(sender):
    output_widget.clear_output()
    query = search_bar.value
    
    if len(query) < 5:
        return
        
    with output_widget:
        query_embedding = model.encode(query, convert_to_tensor=True)
        
        cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
        top_results = torch.topk(cos_scores, k=10)
        
        indices = top_results.indices.tolist()
        similarities = top_results.values.tolist()
        
        recs = df.iloc[indices].copy()
        recs['similarity_score'] = similarities
        
        if recs.empty:
            print(f"No results found for description: '{query}'")
            return

        target = recs.iloc[0]
        cluster_name = target.get('Cluster_Name', 'Unknown')
        
        html = f"""
        <style>
            .rec-card {{ background-color: #1e1e1e; color: #e0e0e0; padding: 20px; border-radius: 12px; font-family: 'Segoe UI', sans-serif; box-shadow: 0 4px 15px rgba(0,0,0,0.5); }}
            .rec-header {{ border-bottom: 1px solid #333; padding-bottom: 10px; margin-bottom: 15px; }}
            .rec-title {{ font-size: 1.6em; color: #4facfe; font-weight: bold; }}
            .rec-meta {{ font-size: 0.9em; color: #aaa; margin-top: 5px; }}
            .rec-badge {{ background-color: #333; padding: 2px 8px; border-radius: 4px; font-size: 0.8em; margin-right: 10px; }}
            
            .rec-table {{ width: 100%; border-collapse: separate; border-spacing: 0 8px; }}
            .rec-table th {{ text-align: left; color: #888; font-size: 0.85em; padding-bottom: 5px; }}
            .rec-table td {{ background-color: #252525; padding: 12px; border-top: 1px solid #333; border-bottom: 1px solid #333; }}
            .rec-table tr td:first-child {{ border-left: 1px solid #333; border-top-left-radius: 8px; border-bottom-left-radius: 8px; }}
            .rec-table tr td:last-child {{ border-right: 1px solid #333; border-top-right-radius: 8px; border-bottom-right-radius: 8px; }}
            
            .score-high {{ color: #00e676; font-weight: bold; }}
            .score-med {{ color: #ffea00; }}
            .cluster-tag {{ font-size: 0.75em; text-transform: uppercase; letter-spacing: 1px; color: #ff9800; }}
            
            .bar-container {{ background-color: #444; height: 6px; width: 100px; border-radius: 3px; overflow: hidden; display: inline-block; vertical-align: middle; margin-right: 5px; }}
            .bar-fill {{ height: 100%; background: linear-gradient(90deg, #4facfe, #00f2fe); }}
        </style>
        
        <div class="rec-card">
            <div class="rec-header">
                <div class="rec-title">{target['title']}</div>
                <div class="rec-meta">
                    <span class="rec-badge">📍 {cluster_name}</span>
                    <span class="rec-badge">📅 {str(target['aired'])}</span>
                    <span class="rec-badge">⭐ Score: {target['score']}</span>
                </div>
                <div style="margin-top:10px; font-style:italic; color:#888; font-size:0.9em;">
                    "{str(target['description'])[:150]}..."
                </div>
            </div>
            
            <h3 style="margin:0 0 10px 0; color:#ddd;">Semantic Recommendations</h3>
            <table class="rec-table">
                <thead>
                    <tr>
                        <th>Title</th>
                        <th>Source Cluster</th> <th>Score</th>
                        <th>Similarity</th>
                    </tr>
                </thead>
                <tbody>
        """
        
        for _, row in recs.iterrows():
            fill_pct = max(5, min(100, row['similarity_score'] * 100))
            score_color = "score-high" if row['score'] > 7.5 else "score-med"
            
            c_name = row.get('Cluster_Name', f"Cluster {row.get('Cluster', '?')}")

            html += f"""
                <tr>
                    <td style="font-weight:bold; color:#fff;">
                        {row['title']}
                        <br><span style="font-size:0.75em; color:#888; font-weight:normal;">{str(row['genres'])[:30]}...</span>
                    </td>
                    <td class="cluster-tag">{c_name}</td>
                    <td class="{score_color}">{row['score']}</td>
                    <td>
                        <div class="bar-container">
                            <div class="bar-fill" style="width: {fill_pct}%;"></div>
                        </div>
                    </td>
                </tr>
            """
            
        html += """
                </tbody>
            </table>
        </div>
        """
        display(HTML(html))

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [15]:
search_bar = widgets.Text(
    value='',
    placeholder='Describe an anime (e.g. sad robot in space)...',
    description='🔍 Search:',
    continuous_update=False
)

output_widget = widgets.Output()

search_bar.observe(render_output, names='value')

display(search_bar, output_widget)

✅ Semantic Search Engine Loaded.


Text(value='', continuous_update=False, description='🔍 Search:', placeholder='Describe an anime (e.g. sad robo…

Output()