# **CSCI 2026: Individual Term Project**
## **Mapping the Anime Universe: A Semantic Analysis of 20,000 Stories**

**Student:** Ignasi Bonmati Gonzalvez
**Date:** January 29, 2026  
**Institution:** The College of Idaho  

---

### **Project Overview**
This project explores the relationship between different anime series by analyzing their textual descriptions. Instead of relying on human-made tags, we use **Natural Language Processing (NLP)** and **Machine Learning** to identify hidden thematic clusters.

**The Workflow Includes:**
1.  **Data Cleaning:** Removing metadata, website UI text, and stop words.
2.  **TF-IDF Vectorization:** Converting descriptions into an 3,000-dimensional mathematical matrix.
3.  **UMAP Dimensionality Reduction:** Squashing 8,000 dimensions into a 2D map while preserving topological relationships(semantic neighborhoods).
4.  **Interactive Visualization:** A dynamic scatter plot to explore the "islands" of the anime universe.
5.  **Neural Search Engine:** A semantic search tool that allows users to query the database using natural language descriptions (e.g., *"sad robot in space"*) via Cosine Similarity.

In [6]:
# 1. INSTALL ALL NECESSARY LIBRARIES (Ensure that your enviroment is ready)
#%pip install pandas numpy scikit-learn umap-learn plotly tqdm ipywidgets nbformat hdbscan spacy

In [7]:
# CENTRALIZED IMPORTS
import pandas as pd
print("pandas importeded succesfully!")
import numpy as np
print("numpy importeded succesfully!")
from IPython.display import display, HTML
print("display and HTML importeded succesfully from IPython.display!")
import re
print("re importeded succesfully!")
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
print("ENGLISH_STOP_WORDS and TfidfVectorizer importeded succesfully!")
import umap
print("umap importeded succesfully!!")
import plotly.express as px
print("plotly. express importeded succesfully!")
from sklearn.cluster import KMeans
print("sklearn.cluster importeded succesfully!")
import hdbscan
print("hdbscan importeded succesfully!")
import spacy
print("spacy importeded succesfully!")
import ipywidgets as widgets
print("ipywidgets importeded succesfully!")
from sentence_transformers import SentenceTransformer, util
print("SentenceTransformer and util importeded succesfully! from sentence_transformers")
import torch
print("torch importeded succesfully!")

# USSED LATTER

nlp = spacy.load("en_core_web_sm")

pandas importeded succesfully!
numpy importeded succesfully!
display and HTML importeded succesfully from IPython.display!
re importeded succesfully!
ENGLISH_STOP_WORDS and TfidfVectorizer importeded succesfully!
umap importeded succesfully!!
plotly. express importeded succesfully!
sklearn.cluster importeded succesfully!
hdbscan importeded succesfully!
spacy importeded succesfully!
ipywidgets importeded succesfully!
SentenceTransformer and util importeded succesfully! from sentence_transformers
torch importeded succesfully!


In [8]:

anime_killwords = [
    # ==============================================================================
    # 1. METADATA, FORMATS & RELEASE INFO (Aggressive)
    # ==============================================================================
    'anime', 'series', 'episode', 'episodes', 'ep', 'tv', 'movie', 'film', 'ova', 
    'ona', 'special', 'specials', 'short', 'shorts', 'video', 'videos', 'dvd', 
    'blu', 'ray', 'bluray', 'bd', 'box', 'edition', 'release', 'releases', 
    'released', 'airing', 'aired', 'broadcast', 'broadcasting', 'channel', 
    'station', 'program', 'programming', 'screened', 'streamed', 'streaming', 
    'television', 'network', 'online', 'web', 'internet', 'youtube', 'nhk', 
    'atx', 'tokyo', 'mx', 'bs11', 'wowow', 'theaters', 'theatrical', 'cinema',
    'volume', 'volumes', 'vol', 'chapter', 'ch', 'collection', 'compilation', 
    'bonus', 'bundled', 'footage', 'disc', 'cassette', 'digital', 'download',
    'simulcast', 'cour', 'split', 'season', 'seasons', 'sequel', 'prequel', 
    'franchise', 'adaptation', 'spin', 'off', 'spinoff', 'side', 'story', 
    'recap', 'summary', 'preview', 'trailer', 'teaser', 'pv', 'cm', 'commercial',
    'promo', 'promotional', 'advertisement', 'copyright', 'licensed', 'license',
    'official', 'db', 'database', 'mal', 'ann', 'wiki', 'website', 'site', 
    'link', 'click', 'visit', 'source', 'material', 'original', 'work', 'project',
    
    # ==============================================================================
    # 2. PRODUCTION, STAFF & INDUSTRY TERMS
    # ==============================================================================
    'studio', 'studios', 'animation', 'animated', 'animator', 'production', 
    'produced', 'producer', 'direct', 'directed', 'director', 'direction', 
    'write', 'writer', 'written', 'screenplay', 'script', 'composition', 
    'music', 'musical', 'sound', 'soundtrack', 'ost', 'bgm', 'op', 'ed', 
    'opening', 'ending', 'theme', 'song', 'insert', 'voice', 'cast', 'staff', 
    'actor', 'actress', 'seiyuu', 'dub', 'sub', 'caption', 'featuring', 
    'feature', 'features', 'collab', 'collaborate', 'collaboration', 'unit', 
    'idol', 'group', 'band', 'artist', 'illustrator', 'design', 'designer',
    'character', 'characters', 'setting', 'background', 'art', 'arts', 
    'visual', 'key', 'image', 'images', 'picture', 'pictures', 'photo', 
    'photos', 'gallery', 'scene', 'scenes', 'view', 'views', 'shot', 'shots',
    'manga', 'mangaka', 'comic', 'comics', 'novel', 'novels', 'ln', 'light', 
    'webtoon', 'manhwa', 'manhua', 'magazine', 'serialized', 'serialization', 
    'publish', 'published', 'publisher', 'weekly', 'monthly', 'shonen', 
    'shoujo', 'seinen', 'josei', 'jump', 'sunday', 'festa', 'event', 'ticket',
    
    # ==============================================================================
    # 3. GENERIC NOUNS (The "Noise" of Synopses)
    # ==============================================================================
    # These words appear in almost every synopsis but distinguish nothing.
    'new', 'old', 'young', 'adult', 'child', 'children', 'kid', 'kids', 
    'man', 'men', 'woman', 'women', 'boy', 'boys', 'girl', 'girls', 'guy', 
    'guys', 'male', 'female', 'person', 'people', 'human', 'humans', 
    'friend', 'friends', 'friendship', 'member', 'members', 'main', 'protagonist', 
    'hero', 'heroine', 'antagonist', 'villain', 'rival', 'enemy', 'ally', 
    'partner', 'duo', 'trio', 'leader', 'boss', 'subordinate', 'student', 
    'students', 'classmate', 'classmates', 'teacher', 'sensei', 'master', 
    'pupil', 'transfer', 'school', 'high', 'middle', 'elementary', 'college', 
    'university', 'academy', 'campus', 'class', 'classroom', 'club', 
    'activity', 'activities', 'life', 'lives', 'living', 'daily', 'everyday', 
    'world', 'worlds', 'earth', 'planet', 'universe', 'space', 'place', 
    'places', 'city', 'town', 'village', 'country', 'nation', 'state', 
    'region', 'area', 'zone', 'land', 'ground', 'sky', 'sea', 'ocean',
    'thing', 'things', 'stuff', 'object', 'item', 'piece', 'part', 'lot', 
    'lots', 'bit', 'bunch', 'type', 'types', 'kind', 'kinds', 'sort', 'way', 
    'style', 'manner', 'method', 'system', 'program', 'plan', 'goal', 
    'dream', 'hope', 'wish', 'desire', 'fate', 'destiny', 'chance', 
    'opportunity', 'incident', 'accident', 'event', 'situation', 'circumstance', 
    'problem', 'issue', 'trouble', 'danger', 'threat', 'crisis', 'chaos', 
    'peace', 'war', 'battle', 'fight', 'conflict', 'struggle', 'journey', 
    'adventure', 'quest', 'mission', 'task', 'job', 'work', 'challenge', 
    'game', 'match', 'tournament', 'competition', 'race', 'test', 'exam',
    
    # ==============================================================================
    # 4. VERBS OF NARRATION & FILLER (Crucial for Clustering)
    # ==============================================================================
    'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 
    'having', 'do', 'does', 'did', 'doing', 'can', 'could', 'will', 'would', 
    'shall', 'should', 'may', 'might', 'must', 'get', 'gets', 'getting', 'got', 
    'gotten', 'go', 'goes', 'going', 'gone', 'went', 'come', 'comes', 'coming', 
    'came', 'make', 'makes', 'making', 'made', 'take', 'takes', 'taking', 
    'took', 'taken', 'know', 'knows', 'knowing', 'known', 'knew', 'think', 
    'thinks', 'thinking', 'thought', 'want', 'wants', 'wanting', 'wanted', 
    'feel', 'feels', 'feeling', 'felt', 'see', 'sees', 'seeing', 'saw', 'seen', 
    'look', 'looks', 'looking', 'looked', 'watch', 'watches', 'watching', 
    'watched', 'hear', 'hears', 'hearing', 'heard', 'say', 'says', 'saying', 
    'said', 'tell', 'tells', 'telling', 'told', 'speak', 'speaks', 'speaking', 
    'spoke', 'spoken', 'talk', 'talks', 'talking', 'talked', 'ask', 'asks', 
    'asking', 'asked', 'answer', 'answers', 'answering', 'answered', 'give', 
    'gives', 'giving', 'gave', 'given', 'receive', 'receives', 'receiving', 
    'received', 'find', 'finds', 'finding', 'found', 'lose', 'loses', 'losing', 
    'lost', 'use', 'uses', 'using', 'used', 'try', 'tries', 'trying', 'tried', 
    'attempt', 'attempts', 'attempting', 'attempted', 'decide', 'decides', 
    'deciding', 'decided', 'start', 'starts', 'starting', 'started', 'begin', 
    'begins', 'beginning', 'began', 'begun', 'end', 'ends', 'ending', 'ended', 
    'finish', 'finishes', 'finishing', 'finished', 'stop', 'stops', 'stopping', 
    'stopped', 'continue', 'continues', 'continuing', 'continued', 'happen', 
    'happens', 'happening', 'happened', 'occur', 'occurs', 'occurring', 
    'occurred', 'appear', 'appears', 'appearing', 'appeared', 'disappear', 
    'disappears', 'disappearing', 'disappeared', 'meet', 'meets', 'meeting', 
    'met', 'join', 'joins', 'joining', 'joined', 'follow', 'follows', 
    'following', 'followed', 'lead', 'leads', 'leading', 'led', 'help', 
    'helps', 'helping', 'helped', 'save', 'saves', 'saving', 'saved', 
    'protect', 'protects', 'protecting', 'protected', 'fight', 'fights', 
    'fighting', 'fought', 'attack', 'attacks', 'attacking', 'attacked', 
    'defend', 'defends', 'defending', 'defended', 'win', 'wins', 'winning', 
    'won', 'lose', 'loses', 'losing', 'lost', 'defeat', 'defeats', 'defeating', 
    'defeated', 'destroy', 'destroys', 'destroying', 'destroyed', 'create', 
    'creates', 'creating', 'created', 'build', 'builds', 'building', 'built', 
    'change', 'changes', 'changing', 'changed', 'become', 'becomes', 'becoming', 
    'became', 'turn', 'turns', 'turning', 'turned', 'return', 'returns', 
    'returning', 'returned', 'bring', 'brings', 'bringing', 'brought', 'carry', 
    'carries', 'carrying', 'carried', 'hold', 'holds', 'holding', 'held', 
    'keep', 'keeps', 'keeping', 'kept', 'stay', 'stays', 'staying', 'stayed', 
    'leave', 'leaves', 'leaving', 'left', 'move', 'moves', 'moving', 'moved', 
    'live', 'lives', 'living', 'lived', 'die', 'dies', 'dying', 'died', 
    'kill', 'kills', 'killing', 'killed', 'survive', 'survives', 'surviving', 
    'survived', 'revolve', 'revolves', 'centering', 'centered', 'focus', 
    'focuses', 'focused', 'focusing', 'depict', 'depicts', 'depicting', 
    'portray', 'portrays', 'portraying', 'tell', 'tells', 'telling',
    
    # ==============================================================================
    # 5. ADJECTIVES & ADVERBS (Empty Descriptors)
    # ==============================================================================
    'good', 'bad', 'great', 'best', 'worst', 'better', 'worse', 'nice', 'kind', 
    'mean', 'happy', 'sad', 'angry', 'afraid', 'scared', 'strong', 'weak', 
    'big', 'small', 'large', 'little', 'huge', 'tiny', 'long', 'short', 
    'high', 'low', 'fast', 'slow', 'hard', 'soft', 'difficult', 'easy', 
    'simple', 'complex', 'complicated', 'strange', 'weird', 'odd', 'unusual', 
    'normal', 'ordinary', 'average', 'common', 'special', 'unique', 'rare', 
    'mysterious', 'secret', 'unknown', 'famous', 'popular', 'real', 'true', 
    'false', 'fake', 'different', 'same', 'similar', 'various', 'several', 
    'many', 'much', 'more', 'most', 'less', 'least', 'few', 'some', 'any', 
    'all', 'every', 'each', 'other', 'another', 'own', 'suddenly', 'sudden', 
    'immediately', 'instantly', 'quickly', 'slowly', 'finally', 'eventually', 
    'currently', 'recently', 'lately', 'soon', 'later', 'then', 'now', 
    'today', 'tomorrow', 'yesterday', 'ago', 'already', 'yet', 'still', 
    'just', 'only', 'even', 'also', 'too', 'very', 'really', 'so', 'quite', 
    'rather', 'somewhat', 'almost', 'nearly', 'hardly', 'barely', 'scarcely', 
    'always', 'usually', 'often', 'frequently', 'sometimes', 'occasionally', 
    'rarely', 'never', 'ever', 'perhaps', 'maybe', 'probably', 'possibly', 
    'likely', 'unlikely', 'certainly', 'definitely', 'surely', 'actually', 
    'truly', 'really', 'simply', 'merely', 'basically', 'essentially',
    
    # ==============================================================================
    # 6. TIME, DATES & NUMBERS
    # ==============================================================================
    'time', 'times', 'year', 'years', 'month', 'months', 'week', 'weeks', 
    'day', 'days', 'hour', 'hours', 'minute', 'minutes', 'second', 'seconds', 
    'moment', 'moments', 'period', 'era', 'age', 'century', 'decade', 
    'date', 'calendar', 'schedule', 'future', 'past', 'present', 'history', 
    'memory', 'memories', 'morning', 'afternoon', 'evening', 'night', 
    'midnight', 'dawn', 'dusk', 'spring', 'summer', 'autumn', 'fall', 
    'winter', 'january', 'february', 'march', 'april', 'may', 'june', 
    'july', 'august', 'september', 'october', 'november', 'december', 
    'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 
    'ten', 'eleven', 'twelve', 'hundred', 'thousand', 'million', 'billion', 
    'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 
    'eighth', 'ninth', 'tenth', 'last', 'next', 'previous', 'current',
    
 # ==============================================================================
    # 7. JAPAN-SPECIFIC FILLER, CULTURAL TERMS & HONORIFICS (EXPANDED)
    # ==============================================================================
    
    # --- Locations & Administrative ---
    'japan', 'japanese', 'nippon', 'nihon', 'tokyo', 'osaka', 'kyoto', 
    'hokkaido', 'okinawa', 'nagoya', 'fukuoka', 'hiroshima', 'nara', 
    'akihabara', 'akiba', 'shibuya', 'shinjuku', 'ikebukuro', 'ginza', 
    'roppongi', 'harajuku', 'prefecture', 'city', 'ward', 'district', 
    'village', 'countryside', 'mount', 'mt', 'fuji', 'yen', 'town',
    
    # --- Honorifics & Social Titles (High Frequency Noise) ---
    'san', 'kun', 'chan', 'sama', 'dono', 'tan', 'chama', 'shi', 
    'senpai', 'semipai', 'kohai', 'kouhai', 'sensei', 'shishou', 'hakase',
    'onii', 'onee', 'niisan', 'neesan', 'aniki', 'imouto', 'otouto', 
    'kaicho', 'bucho', 'bancho', 'ojousama', 'bocchan', 'ojisan', 'obasan',
    
    # --- School Life (The "School" Cluster Trap) ---
    # These often merge distinct genres (Horror vs Romance) because both happen at school
    'gakuen', 'highschool', 'school', 'academy', 'university', 'college',
    'classroom', 'homeroom', 'club', 'clubroom', 'president', 'council',
    'uniform', 'seifuku', 'sailor', 'blazer', 'gym', 'dojo', 'festival', 
    'cultural', 'sports', 'bunkasai', 'undokai', 'trip', 'excursion',
    'cram', 'juku', 'exam', 'exams', 'test', 'tests', 'semester', 'term',
    'entrance', 'ceremony', 'graduation', 'transfer',
    
    # --- Daily Life & Housing ---
    'apartment', 'mansion', 'dorm', 'dormitory', 'room', 'house', 'home',
    'bath', 'bathhouse', 'onsen', 'hotspring', 'springs', 'sauna', 'sento',
    'kotatsu', 'futon', 'tatami', 'sliding', 'door', 'porch', 'veranda',
    'convini', 'convenience', 'store', 'supermarket', 'market', 'station',
    'train', 'subway', 'bullet', 'shinkansen', 'ticket', 'gate',
    
    # --- Food & Dining (Specific items that are just "noise") ---
    'bento', 'lunch', 'box', 'ramen', 'sushi', 'sake', 'tea', 'matcha',
    'rice', 'ball', 'onigiri', 'curry', 'bread', 'melonpan', 'yakisoba',
    'okonomiyaki', 'takoyaki', 'dango', 'mochi', 'sweets', 'snack', 'bar',
    'izakaya', 'restaurant', 'cafe', 'maid', 'cook', 'cooking', 'meal',
    
    # --- Traditional Culture & Religion ---
    'shrine', 'temple', 'torii', 'gate', 'miko', 'priest', 'monk', 
    'prayer', 'charm', 'omamori', 'fortune', 'omikuji', 'new', 'year',
    'hatsumode', 'kimono', 'yukata', 'haori', 'happi', 'festival', 
    'matsuri', 'fireworks', 'hanabi', 'cherry', 'blossom', 'sakura', 
    'flower', 'viewing', 'hanami', 'golden', 'week', 'obon', 'tanabata',
    'christmas', 'eve', 'valentine', 'white', 'day',
    
    # --- Anime/Manga Industry Meta-Terms ---
    'anime', 'manga', 'mangaka', 'light', 'novel', 'visual', 'novel', 'game', 
    'eroge', 'galge', 'otome', 'bl', 'gl', 'yuri', 'yaoi', 'doujin', 
    'doujinshi', 'comiket', 'convention', 'cosplay', 'costume', 'otaku', 
    'neet', 'hikikomori', 'shut-in', 'fan', 'fandom', 'figure', 'merch',
    'idol', 'debut', 'unit', 'center', 'live', 'concert', 'handshake',
    
    # --- Character Archetypes & Slang ---
    'tsundere', 'yandere', 'kuudere', 'dandere', 'himedere', 'deredere',
    'moe', 'kawaii', 'sugoi', 'baka', 'aho', 'chibi', 'bishounen', 'bishoujo',
    'ikemen', 'megane', 'loli', 'shota', 'trap', 'reverse', 'harem', 
    'fan', 'service', 'fanservice', 'ecchi', 'hentai', 'pantsu', 'nosebleed',
    'gag', 'slapstick', 'parody', 'filler',
    
    # --- Yakuza & Delinquent Terms (Optional: Remove if focusing on Crime genre) ---
    'yakuza', 'gang', 'gangster', 'delinquent', 'yankee', 'sukeban', 
    'bosozoku', 'fight', 'brawl', 'territory', 'turf',
    # ==============================================================================
    # 8. FRANCHISE NOISE (Specific Names that distort clusters)
    # ==============================================================================
    'goku', 'vegeta', 'gohan', 'trunks', 'piccolo', 'frieza', 'cell', 'buu', 
    'saiyan', 'dragonball', 'naruto', 'sasuke', 'sakura', 'kakashi', 'hokage', 
    'ninja', 'shinobi', 'chakra', 'luffy', 'zoro', 'nami', 'usopp', 'sanji', 
    'chopper', 'robin', 'franky', 'brook', 'piece', 'pirate', 'ichigo', 
    'rukia', 'renji', 'soul', 'reaper', 'shinigami', 'hollow', 'bleach', 
    'natsu', 'lucy', 'happy', 'gray', 'erza', 'fairy', 'tail', 'guild', 
    'gon', 'killua', 'kurapika', 'leorio', 'hunter', 'nen', 'deku', 
    'midoriya', 'bakugo', 'todoroki', 'allmight', 'quirk', 'hero', 
    'academia', 'tanjiro', 'nezuko', 'zenitsu', 'inosuke', 'demon', 
    'slayer', 'kimetsu', 'yaiba', 'eren', 'mikasa', 'armin', 'levi', 
    'titan', 'shingeki', 'kyojin', 'edward', 'alphonse', 'elric', 
    'alchemist', 'alchemy', 'fullmetal', 'ash', 'misty', 'brock', 
    'pikachu', 'pokemon', 'pocket', 'monster', 'digimon', 'digital', 
    'yugioh', 'duel', 'card', 'cards', 'gundam', 'suit', 'mobile', 
    'zeon', 'zaku', 'federation', 'evangelion', 'eva', 'nerv', 'angel', 
    'angels', 'sailor', 'moon', 'scout', 'senshi', 'precure', 'cure', 
    'magical', 'girl', 'fate', 'stay', 'night', 'grand', 'order', 
    'saber', 'archer', 'lancer', 'rider', 'caster', 'assassin', 
    'berserker', 'servant', 'grail', 'war', 'jojo', 'dio', 'stand', 
    'bizarre', 'adventure', 'gintoki', 'shinpachi', 'kagura', 'gintama', 
    'yorozuya', 'conan', 'detective', 'case', 'closed', 'lupin', 'iii', 
    'doraemon', 'nobita', 'shin', 'chan', 'crayon', 'totoro', 'ghibli', 
    'miyazaki', 'akira', 'ghost', 'shell', 'bebop', 'cowboy', 'champloo'
]




In [9]:
# LOAD DATASETS
df_main = pd.read_csv('data/mal_anime_2025.csv')
df_5ya  = pd.read_csv('data/animes_5ya.csv')
df_6ya  = pd.read_csv('data/animes_6ya.csv')

In [10]:


# Cargar modelo SpaCy (asegúrate de tenerlo instalado: python -m spacy download en_core_web_sm)
try:
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"]) # Cargamos NER solo si lo usamos, abajo lo reactivo en el pipe
except:
    print("Modelo SpaCy no encontrado. Instalando/descargando modelo liviano...")
    import os
    os.system("python -m spacy download en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

# ==========================================
# 1. CARGA Y PREPARACIÓN DE DATOS (FUSION OPTIMIZADA)
# ==========================================
print("--- Cargando Datasets ---")
df_main = pd.read_csv('data/mal_anime_2025.csv')
df_5ya  = pd.read_csv('data/animes_5ya.csv')
df_6ya  = pd.read_csv('data/animes_6ya.csv')

# Normalización de columnas
for df in [df_main, df_5ya, df_6ya]:
    df.columns = df.columns.str.strip().str.lower()

# Función Key (vectorizada)
def get_merge_key(series):
    return series.astype(str).str.lower().str.replace(r'[^a-z0-9]', '', regex=True)

# Preparamos los 3 DFs con índice unificado
def prep_source(df, title_col, desc_col, score_col, suffix):
    df = df.copy()
    df['merge_key'] = get_merge_key(df[title_col])
    
    # Renombrar para consistencia interna
    rename_map = {title_col: 'title', desc_col: 'description', score_col: 'score'}
    # Solo renombramos si existen, si no, creamos NaN
    for k, v in rename_map.items():
        if k not in df.columns:
            df[k] = np.nan
        else:
            df.rename(columns={k: v}, inplace=True)
            
    # Asegurar columnas extra si existen
    if 'genres' not in df.columns: df['genres'] = np.nan
    if 'aired' not in df.columns: df['aired'] = np.nan

    # Establecer índice para usar combine_first
    return df.set_index('merge_key')[['title', 'description', 'score', 'genres', 'aired']]

# Crear fuentes estandarizadas
src_main = prep_source(df_main, 'title', 'description', 'score', 'main')
src_5ya  = prep_source(df_5ya,  'title', 'synopsis',    'score', '5ya')
src_6ya  = prep_source(df_6ya,  'title', 'synopsis',    'score', '6ya')

print("--- Fusionando Fuentes (Prioridad: Main > 5ya > 6ya) ---")
# combine_first rellena los NaNs del caller con valores del argumento
# Esto es MUCHO más rápido que apply(axis=1)
master_df = src_main.combine_first(src_5ya).combine_first(src_6ya)

# Reset index para recuperar el merge_key si fuera necesario (o crear ID)
master_df = master_df.reset_index(drop=True)
master_df['id'] = master_df.index + 1

# Limpieza básica de filas vacías
master_df = master_df.dropna(subset=['title'])
master_df = master_df[master_df['description'].str.len() > 20] # Primer filtro burdo por caracteres

print(f"Total animes tras fusión: {len(master_df)}")

# ==========================================
# 2. PIPELINE NLP AVANZADO (C/ KILLWORDS)
# ==========================================

# Tu lista de Killwords (Versión compacta para la celda, idealmente impórtala)
anime_killwords = set(anime_killwords)

def advanced_cleaning(texts_list, nlp_model, batch_size=50):
    docs = []
    # 1. Regex Rápido (Boilerplate)
    # Eliminamos patrones comunes de MAL y enlaces
    texts_clean = [re.sub(r'\[Written by MAL Rewrite\]|\(Source:.*?\)|https?://\S+', '', str(t), flags=re.IGNORECASE) for t in texts_list]
    
    print("--- Iniciando Tokenización SpaCy y Filtrado de Entidades ---")
    # Usamos nlp.pipe para velocidad. Habilitamos NER para quitar Nombres.
    # Deshabilitamos parser para velocidad.
    for doc in nlp_model.pipe(texts_clean, batch_size=batch_size, disable=["parser"]):
        
        tokens_kept = []
        for token in doc:
            # Filtros:
            # 1. No es Stopword (SpaCy default)
            # 2. No es Puntuación
            # 3. No es un Nombre Propio (PERSON) detectado por NER
            # 4. Longitud > 2
            if not token.is_stop and not token.is_punct and len(token.text) > 2:
                if token.ent_type_ != "PERSON": 
                    lemma = token.lemma_.lower()
                    # 5. Filtro FINAL: Killwords de Dominio Anime
                    if lemma not in anime_killwords:
                        tokens_kept.append(lemma)
        
        docs.append(" ".join(tokens_kept))
    return docs

# Ejecutar limpieza
master_df['clean_desc'] = advanced_cleaning(master_df['description'].tolist(), nlp)

# ==========================================
# 3. FILTRADO ESTADÍSTICO (EL CORTE DINÁMICO)
# ==========================================

# Contamos tokens REALES (semánticos) que quedaron
master_df['n_tokens'] = master_df['clean_desc'].apply(lambda x: len(x.split()))

# Calculamos estadísticas
stats = master_df['n_tokens'].describe()
cutoff_threshold = int(master_df['n_tokens'].quantile(0.10)) # Eliminamos el 10% inferior

print("\n--- Estadísticas de Tokens Semánticos ---")
print(stats[['mean', 'min', '25%', '50%', '75%', 'max']])
print(f"Umbral de corte sugerido (Percentil 10): {cutoff_threshold} tokens")

# APLICAR CORTE (O MÍNIMO DE SEGURIDAD 8)
final_threshold = max(cutoff_threshold, 8) # Nunca aceptar menos de 10 palabras semánticas
df_final = master_df[master_df['n_tokens'] >= final_threshold].copy()
df_final = df_final.reset_index(drop=True)

print(f"\nAnimes eliminados por falta de información: {len(master_df) - len(df_final)}")
print(f"Dataset final listo: {len(df_final)} animes.")

# Guardar
df_final.to_csv('data/anime_master_nlp_ready.csv', index=False)
df = df_final # Para compatibilidad con tu código posterior

--- Cargando Datasets ---
--- Fusionando Fuentes (Prioridad: Main > 5ya > 6ya) ---
Total animes tras fusión: 27277
--- Iniciando Tokenización SpaCy y Filtrado de Entidades ---

--- Estadísticas de Tokens Semánticos ---
mean     22.838105
min       0.000000
25%       7.000000
50%      15.000000
75%      35.000000
max     245.000000
Name: n_tokens, dtype: float64
Umbral de corte sugerido (Percentil 10): 3 tokens

Animes eliminados por falta de información: 9543
Dataset final listo: 17734 animes.


In [11]:
'''
### COMING UP WITH THIS SOLUTION TO MERGE THE DATASETS ALMOS MAKES ME GIVE UP ###

# LONG LOAD CELL #

# ####### # Improvement_ clean for really old animes, por que el 10% inferior de corte.

# FORCE LOWERCASE COLUMNS
df_main.columns = df_main.columns.str.strip().str.lower()
df_5ya.columns  = df_5ya.columns.str.strip().str.lower()
df_6ya.columns  = df_6ya.columns.str.strip().str.lower()

# HELPER FUNCTIONS
def get_merge_key(title_series):
    return title_series.astype(str).str.lower().str.replace(r'[^a-z0-9]', '', regex=True)

def prep_dataset(df, title_col, desc_col, source_name):
    df = df.copy()
    df['merge_key'] = get_merge_key(df[title_col])
    
    rename_map = {
        title_col: f'title_{source_name}',
        desc_col:  f'desc_{source_name}'
    }
    df = df.rename(columns=rename_map)
    df[f'desc_{source_name}'] = df[f'desc_{source_name}'].fillna("").astype(str)
    
    return df[['merge_key', f'title_{source_name}', f'desc_{source_name}']]

# PREPARE DATASETS
clean_main = prep_dataset(df_main, title_col='title', desc_col='description', source_name='main')
clean_5ya  = prep_dataset(df_5ya,  title_col='title', desc_col='synopsis',    source_name='5ya')
clean_6ya  = prep_dataset(df_6ya,  title_col='title', desc_col='synopsis',    source_name='6ya')

# MERGE BY NORMALIZED TITLE
merged = clean_main.merge(clean_5ya, on='merge_key', how='outer')
merged = merged.merge(clean_6ya,  on='merge_key', how='outer')

# RESOLVE TEXT CONTENT
def resolve_final_text(row):
    # TITLE RESOLUTION
    title = row.get('title_main')
    if pd.isna(title) or str(title) == 'nan':
        title = row.get('title_5ya')
    if pd.isna(title) or str(title) == 'nan':
        title = row.get('title_6ya')
        
    # DESCRIPTION RESOLUTION
    desc = ""
    
    def is_valid(t):
        if not isinstance(t, str): return False
        if t.lower() == 'nan': return False
        return len(t.split()) > 15
    
    if is_valid(row.get('desc_main')):
        desc = row['desc_main']
    elif is_valid(row.get('desc_5ya')):
        desc = row['desc_5ya']
    elif is_valid(row.get('desc_6ya')):
        desc = row['desc_6ya']
        
    return pd.Series([title, desc], index=['title', 'description'])

merged[['title', 'description']] = merged.apply(resolve_final_text, axis=1)

# CREATE MASTER DATAFRAME AND FILTER
master_df = merged[['title', 'description']].copy()

master_df = master_df.dropna(subset=['title'])
master_df = master_df[master_df['title'] != ""]
master_df = master_df[master_df['description'] != ""]

# GENERATE NEW UNIFIED IDS
master_df = master_df.reset_index(drop=True)
master_df['id'] = master_df.index + 1

# PREPARE METADATA BACKFILL
def get_meta_source(df, title_col, cols_to_grab):
    df = df.copy()
    df['merge_key'] = get_merge_key(df[title_col])
    
    # SAFETY: If column doesn't exist, create it as NaN (Fixes KeyError)
    for c in cols_to_grab:
        if c not in df.columns:
            df[c] = np.nan
            
    return df[['merge_key'] + cols_to_grab]

# EXTRACT METADATA

meta_main = get_meta_source(df_main, 'title', ['score', 'genres', 'aired'])
meta_5ya  = get_meta_source(df_5ya,  'title', ['score']) 
meta_6ya  = get_meta_source(df_6ya,  'title', ['score'])

meta_main.rename(columns={'score': 'score_main', 'genres': 'genres_main', 'aired': 'aired_main'}, inplace=True)
meta_5ya.rename(columns={'score': 'score_5ya'}, inplace=True)
meta_6ya.rename(columns={'score': 'score_6ya'}, inplace=True)

# MERGE METADATA
master_df['merge_key'] = get_merge_key(master_df['title'])
master_df = master_df.merge(meta_main, on='merge_key', how='left')
master_df = master_df.merge(meta_5ya,  on='merge_key', how='left')
master_df = master_df.merge(meta_6ya,  on='merge_key', how='left')

# SCORES AND RENAMING

master_df['score'] = master_df['score_main'].fillna(master_df['score_5ya']).fillna(master_df['score_6ya'])
master_df.rename(columns={'genres_main': 'genres', 'aired_main': 'aired'}, inplace=True)

# FINAL CLEANUP
master_df = master_df[['id', 'title', 'description', 'score', 'genres', 'aired']]
master_df['score'] = pd.to_numeric(master_df['score'], errors='coerce')
master_df.sort_values(by='score', ascending=False, na_position='last', inplace=True)
master_df.drop_duplicates(subset=['title'], keep='first', inplace=True)

# SAVE TO CSV
master_df.to_csv('data/anime_master_final_clean.csv', index=False)

print(f"Final Row count: {len(master_df)}")
df = master_df
'''

'\n### COMING UP WITH THIS SOLUTION TO MERGE THE DATASETS ALMOS MAKES ME GIVE UP ###\n\n# LONG LOAD CELL #\n\n# ####### # Improvement_ clean for really old animes, por que el 10% inferior de corte.\n\n# FORCE LOWERCASE COLUMNS\ndf_main.columns = df_main.columns.str.strip().str.lower()\ndf_5ya.columns  = df_5ya.columns.str.strip().str.lower()\ndf_6ya.columns  = df_6ya.columns.str.strip().str.lower()\n\n# HELPER FUNCTIONS\ndef get_merge_key(title_series):\n    return title_series.astype(str).str.lower().str.replace(r\'[^a-z0-9]\', \'\', regex=True)\n\ndef prep_dataset(df, title_col, desc_col, source_name):\n    df = df.copy()\n    df[\'merge_key\'] = get_merge_key(df[title_col])\n\n    rename_map = {\n        title_col: f\'title_{source_name}\',\n        desc_col:  f\'desc_{source_name}\'\n    }\n    df = df.rename(columns=rename_map)\n    df[f\'desc_{source_name}\'] = df[f\'desc_{source_name}\'].fillna("").astype(str)\n\n    return df[[\'merge_key\', f\'title_{source_name}\', f\'desc_{s

In [12]:

# LONG LOAD CELL # (3 minutes aprox.)
# CLEANING PROCESS

def clean_text_pipeline(df, text_col):
    """
    Performs full NLP cleaning:
    1. Removes Boilerplate (Source: X, Written by MAL Rewrite)
    2. Removes Personal Names (using spaCy NER)
    3. Lowercases & Removes Special Chars
    4. Removes Stopwords
    """
   
    raw_docs = df[text_col].fillna("").astype(str).tolist()
    
    clean_docs = [re.sub(r'\[Written by MAL Rewrite\]|\(Source:.*?\)', '', text) for text in raw_docs]
    
    final_texts = []

    for doc in nlp.pipe(clean_docs, batch_size=100, disable=["parser"]):
        tokens_to_keep = []
        for token in doc:
            if token.ent_type_ == "PERSON":
                continue

            tokens_to_keep.append(token.text)
        
        text_no_names = " ".join(tokens_to_keep)
        final_texts.append(text_no_names)

    
    
    processed_output = []
    for text in final_texts:
        text = text.lower()
        text = re.sub(r'[^a-z\s]', '', text)
        words = text.split()
        cleaned_words = [w for w in words if w not in ENGLISH_STOP_WORDS and len(w) > 2]
        
        processed_output.append(" ".join(cleaned_words))
        
    return processed_output

df['clean_description'] = clean_text_pipeline(df, 'description')
df['word_count'] = df['clean_description'].apply(lambda x: len(str(x).split()))
df = df[df['word_count'] > 5].copy()
df = df.reset_index(drop=True)

In [13]:
######## NEW ADDITION
print(f"Rows after removing empty descriptions: {len(df)}")
# Check if Shounen Ninja is still there
print("Shounen Ninja status:", df[df['title'].str.contains("Shounen Ninja", case=False)].shape[0] > 0)

Rows after removing empty descriptions: 17733
Shounen Ninja status: True


In [16]:
custom_stops = list(ENGLISH_STOP_WORDS) + list(anime_killwords)




# INITIALIZE THE VECTORIZER
tfidf = TfidfVectorizer(
    max_features=3000, 
    ngram_range=(1, 2),
    min_df=10,
    max_df=0.5,
    stop_words=custom_stops 
)

# MATRIX GENERATION
tfidf_matrix = tfidf.fit_transform(df['clean_description'])

print(f"Matrix Shape: {tfidf_matrix.shape}")




Matrix Shape: (17733, 3000)


In [17]:
# LONG LOAD CELL # (2 minutes aprox.)

#UMAP (2D Projection)
reducer = umap.UMAP(
    n_neighbors=30, 
    min_dist=0.05, 
    metric='cosine', 
    random_state=42
)
embedding = reducer.fit_transform(tfidf_matrix)

# SAVE TO DATAFRAME
df['x'] = embedding[:, 0]
df['y'] = embedding[:, 1]



  warn(


In [18]:

df['score'] = pd.to_numeric(df['score'], errors='coerce').fillna(0)

df['year'] = df['aired'].astype(str).str.extract(r'(\d{4})').fillna("Unknown")

In [19]:
# AUTONOMOUS CLUSTER 
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=60,      
    min_samples=10,           
    metric='euclidean',
    cluster_selection_method='leaf' 
)
# CLUSTERING
df['Cluster'] = clusterer.fit_predict(embedding)

# BAKGROUND NOISE
df['Cluster_Name'] = df['Cluster'].apply(lambda x: f"Cluster {x}" if x != -1 else "Background Noise")

# AUTO-NAMING CLUSTERS
def auto_name_clusters(df, tfidf_matrix, vectorizer, top_n=3):
    feature_names = np.array(vectorizer.get_feature_names_out())
    cluster_names = {}
    unique_clusters = sorted(df['Cluster'].unique())
    
    for label in unique_clusters:
        if label == -1:
            cluster_names[label] = "Unclassified Noise"
            continue
            
        indices = df.index[df['Cluster'] == label].tolist()
        cluster_slice = tfidf_matrix[indices]
        word_scores = np.asarray(cluster_slice.sum(axis=0)).flatten()
        top_indices = word_scores.argsort()[::-1][:top_n]
        top_words = feature_names[top_indices]
        name = " • ".join([w.title() for w in top_words])
        cluster_names[label] = name
        
    return cluster_names

name_map = auto_name_clusters(df, tfidf_matrix, tfidf)
df['Cluster_Name'] = df['Cluster'].map(name_map)


In [20]:


# PREPARE DATA
df['visual_size'] = (df['score'].fillna(5) ** 2) / 10 

# CREATE GALAXY MAP
fig = px.scatter(
    df, 
    x='x', 
    y='y',
    color='Cluster_Name', 
    size='visual_size',       
    size_max=10,               
    hover_name='title',
    hover_data={
        'x': False, 'y': False, 'visual_size': False,
        'Cluster_Name': True,
        'genres': True,
        'score': True,
        'aired': True
    },
    title='<b>THE ANIME UNIVERSE</b><br><span style="font-size: 14px; color: grey;">A Thematic Map</span>',
    template='plotly_dark',
    color_discrete_sequence=px.colors.qualitative.Bold, 
    opacity=0.7
)

# CONFIGURE LAYOUT
fig.update_layout(
    title={
        'x': 0.98,
        'y': 0.95,
        'xanchor': 'right',
        'yanchor': 'top'
    },
    margin=dict(l=300, r=20, b=20, t=110),
    height=600, 
    xaxis=dict(visible=False),
    yaxis=dict(visible=False),
    font=dict(family="Segoe UI, sans-serif"),
    legend=dict(
        title="THEMES",
        itemsizing='constant',
        bgcolor='rgba(0,0,0,0)', 
        bordercolor="rgba(0,0,0,0)",
        x=-0.4,       
        y=1,         
        xanchor="left",
        yanchor="top",
        font=dict(size=11),
        tracegroupgap=5
    ),
    updatemenus=[
        dict(
            type="buttons",
            direction="left",
            buttons=list([
                dict(
                    args=[{"visible": True}],
                    label="Show All",
                    method="restyle"
                ),
                dict(
                    args=[{"visible": "legendonly"}],
                    label="Clear Map",
                    method="restyle"
                )
            ]),
            pad={"r": 10, "t": 10},
            showactive=True,
            x=-0.4,
            xanchor="left",
            y=1.15,  
            yanchor="top",
            bgcolor="#444444",
            font=dict(color="white")
        ),
    ]
)

fig.show()

In [21]:
num_clusters = df['Cluster'].max() + 1
print(f"Discovered {num_clusters} distinct themes.")
print(sorted(list(set(" ".join(df['Cluster_Name'].astype(str).unique()).replace(" • ", " ").lower().split()))))


Discovered 85 distinct themes.
['akane', 'akari', 'album', 'alice', 'alien', 'aliens', 'amusement', 'anniversary', 'announced', 'aoi', 'araragi', 'baseball', 'based', 'bear', 'beautiful', 'bicycle', 'biscotti', 'black', 'blade', 'blue', 'book', 'books', 'boxing', 'breasts', 'brother', 'candy', 'cat', 'cats', 'chinatsu', 'cinque', 'civilization', 'clan', 'cloud', 'colony', 'commercials', 'company', 'corps', 'cream', 'crew', 'culture', 'daisuke', 'dance', 'dark', 'death', 'debt', 'defense', 'demons', 'dinosaurs', 'doctor', 'dragon', 'dragons', 'drama', 'dreams', 'duke', 'eagle', 'educational', 'empire', 'energy', 'erotic', 'extra', 'family', 'farm', 'fashion', 'father', 'featured', 'fighters', 'fish', 'forest', 'fox', 'frog', 'galaxy', 'gear', 'giant', 'guardians', 'harusame', 'heaven', 'hell', 'heroes', 'hikaru', 'hime', 'hiro', 'hiroshi', 'hiyori', 'hospital', 'hotaru', 'husband', 'ice', 'idols', 'immortal', 'included', 'inn', 'interact', 'introduces', 'island', 'jack', 'jin', 'joe', '

In [22]:
# LONG LOAD TIME (6 minutes aprox.)
model = SentenceTransformer('all-MiniLM-L6-v2')
corpus_embeddings = model.encode(df['description'].fillna('').tolist(), convert_to_tensor=True)

def render_output(sender):
    output_widget.clear_output()
    query = search_bar.value
    
    if len(query) < 5:
        return
        
    with output_widget:
        query_embedding = model.encode(query, convert_to_tensor=True)
        
        cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
        top_results = torch.topk(cos_scores, k=10)
        
        indices = top_results.indices.tolist()
        similarities = top_results.values.tolist()
        
        recs = df.iloc[indices].copy()
        recs['similarity_score'] = similarities
        
        if recs.empty:
            print(f"No results found for description: '{query}'")
            return

        target = recs.iloc[0]
        cluster_name = target.get('Cluster_Name', 'Unknown')
        
        html = f"""
        <style>
            .rec-card {{ background-color: #1e1e1e; color: #e0e0e0; padding: 20px; border-radius: 12px; font-family: 'Segoe UI', sans-serif; box-shadow: 0 4px 15px rgba(0,0,0,0.5); }}
            .rec-header {{ border-bottom: 1px solid #333; padding-bottom: 10px; margin-bottom: 15px; }}
            .rec-title {{ font-size: 1.6em; color: #4facfe; font-weight: bold; }}
            .rec-meta {{ font-size: 0.9em; color: #aaa; margin-top: 5px; }}
            .rec-badge {{ background-color: #333; padding: 2px 8px; border-radius: 4px; font-size: 0.8em; margin-right: 10px; }}
            
            .rec-table {{ width: 100%; border-collapse: separate; border-spacing: 0 8px; }}
            .rec-table th {{ text-align: left; color: #888; font-size: 0.85em; padding-bottom: 5px; }}
            .rec-table td {{ background-color: #252525; padding: 12px; border-top: 1px solid #333; border-bottom: 1px solid #333; }}
            .rec-table tr td:first-child {{ border-left: 1px solid #333; border-top-left-radius: 8px; border-bottom-left-radius: 8px; }}
            .rec-table tr td:last-child {{ border-right: 1px solid #333; border-top-right-radius: 8px; border-bottom-right-radius: 8px; }}
            
            .score-high {{ color: #00e676; font-weight: bold; }}
            .score-med {{ color: #ffea00; }}
            .cluster-tag {{ font-size: 0.75em; text-transform: uppercase; letter-spacing: 1px; color: #ff9800; }}
            
            .bar-container {{ background-color: #444; height: 6px; width: 100px; border-radius: 3px; overflow: hidden; display: inline-block; vertical-align: middle; margin-right: 5px; }}
            .bar-fill {{ height: 100%; background: linear-gradient(90deg, #4facfe, #00f2fe); }}
        </style>
        
        <div class="rec-card">
            <div class="rec-header">
                <div class="rec-title">{target['title']}</div>
                <div class="rec-meta">
                    <span class="rec-badge">📍 {cluster_name}</span>
                    <span class="rec-badge">📅 {str(target['aired'])}</span>
                    <span class="rec-badge">⭐ Score: {target['score']}</span>
                </div>
                <div style="margin-top:10px; font-style:italic; color:#888; font-size:0.9em;">
                    "{str(target['description'])[:150]}..."
                </div>
            </div>
            
            <h3 style="margin:0 0 10px 0; color:#ddd;">Semantic Recommendations</h3>
            <table class="rec-table">
                <thead>
                    <tr>
                        <th>Title</th>
                        <th>Source Cluster</th> <th>Score</th>
                        <th>Similarity</th>
                    </tr>
                </thead>
                <tbody>
        """
        
        for _, row in recs.iterrows():
            fill_pct = max(5, min(100, row['similarity_score'] * 100))
            score_color = "score-high" if row['score'] > 7.5 else "score-med"
            
            c_name = row.get('Cluster_Name', f"Cluster {row.get('Cluster', '?')}")

            html += f"""
                <tr>
                    <td style="font-weight:bold; color:#fff;">
                        {row['title']}
                        <br><span style="font-size:0.75em; color:#888; font-weight:normal;">{str(row['genres'])[:30]}...</span>
                    </td>
                    <td class="cluster-tag">{c_name}</td>
                    <td class="{score_color}">{row['score']}</td>
                    <td>
                        <div class="bar-container">
                            <div class="bar-fill" style="width: {fill_pct}%;"></div>
                        </div>
                    </td>
                </tr>
            """
            
        html += """
                </tbody>
            </table>
        </div>
        """
        display(HTML(html))



Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [23]:
search_bar = widgets.Text(
    value='',
    placeholder='Describe an anime (e.g. sad robot in space)...',
    description='🔍 Search:',
    continuous_update=False
)

output_widget = widgets.Output()

search_bar.observe(render_output, names='value')

display(search_bar, output_widget)

Text(value='', continuous_update=False, description='🔍 Search:', placeholder='Describe an anime (e.g. sad robo…

Output()