# FLSE distillation in Colab (fastText es)

Notes and basic blocks we use to experiment with 5 layers, per-layer temperatures, and nearest neighbors.

## 1) Repository setup and deps
Clone the repo and install it in editable mode.

In [None]:
!git clone https://github.com/santiagoNieva/FLSE.git
%cd FLSE
!pip install -e .

# (optional) restart the kernel if previous versions were already loaded

## 2) Download the fastText teacher (es) and generate the .npy
We keep the top 200k words to speed up runs.

In [None]:
!wget -O /content/cc.es.300.bin.gz https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.bin.gz
!gunzip /content/cc.es.300.bin.gz

import fasttext, numpy as np
ft = fasttext.load_model('/content/cc.es.300.bin')
vocab = ft.words[:200000]
vecs = np.vstack([ft.get_word_vector(w) for w in vocab])
np.save('/content/FLSE/teacher_fasttext_es_200k.npy', vecs)

## 3) Train FLSE (5 layers, logit temps rising with depth to sharpen)
"Soft" block that left per-layer entropies of ~[1.59, 0.99, 0.94, 0.64, 0.14] on the 200k vocab.

In [None]:
!python experiments/distill_playground.py \
  --teacher-path /content/FLSE/teacher_fasttext_es_200k.npy \
  --vocab-size 200000 --teacher-dim 300 \
  --num-layers 5 --verts-per-layer 12 --dim 12 \
  --epochs 12 --batch-size 64 \
  --lambda-ent 9.0 \
  --lambda-entropies 1.0 1.5 2.5 3.0 3.0 \
  --target-entropies 1.6 1.0 0.8 0.5 0.40 \
  --logit-temps 1.0 1.5 2.0 2.5 2.7 \
  --device cuda \
  --save-path /content/FLSE/data/flse_fasttext_5c_200k_soft.pt
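
To put the entropy numbers in context: with `--verts-per-layer 12`, a uniform distribution over a layer's vertices has entropy ln(12) ≈ 2.48, so a value near 1.6 is fairly spread out and a value near 0.14 is close to one-hot. The snippet below is a minimal sketch of that relationship, not the FLSE implementation. It assumes the reported entropies are Shannon entropies in nats of a softmax over vertex logits, and that a logit temp multiplies the logits before the softmax (inferred from the tuning tips in section 5, not checked against the repo). The logits are made-up illustrative values.

```python
import math

def softmax(logits, scale=1.0):
    # Assumption: the per-layer "logit temp" multiplies the logits (larger = sharper).
    scaled = [scale * x for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for one word over 12 vertices (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.0, 0.0, -0.5, -0.5, -1.0, -1.0, -1.5, -2.0, -2.0]

uniform_entropy = entropy([1.0 / 12] * 12)    # upper bound for 12 vertices: ln(12)
soft = entropy(softmax(logits, scale=1.0))    # same logits, scale of the first layer
sharp = entropy(softmax(logits, scale=2.7))   # same logits, scale of the last layer
print(uniform_entropy, soft, sharp)
```

Under these assumptions, a larger scale on the same logits always gives a lower entropy, which is why the deeper layers can reach smaller target entropies.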

## 4) Load the checkpoint and inspect entropies/top-k vertices
Includes vocab filtering so that neighbors are alphabetic tokens only.

In [None]:
import torch, torch.nn.functional as F, numpy as np, fasttext
from flse.model import FLSEModel
from flse.geometry import generate_vertices
from experiments.inspect_word_geometry import inspect_word

ckpt = torch.load('/content/FLSE/data/flse_fasttext_5c_200k_soft.pt', map_location='cpu')
cfg = ckpt['config']
temps = None if cfg.get('logit_temps') is None else torch.tensor(cfg['logit_temps'])
vertices = generate_vertices(cfg['num_layers'], cfg['verts_per_layer'], cfg['dim'])
model = FLSEModel(cfg['vocab_size'], vertices, cfg['teacher_dim'], logit_temps=temps)
model.load_state_dict(ckpt['state_dict'])
model.eval()

ft = fasttext.load_model('/content/cc.es.300.bin')
words = ft.words[:cfg['vocab_size']]
keep = [i for i,w in enumerate(words) if w.isalpha() and len(w)>2]
idx_keep = torch.tensor(keep)

with torch.no_grad():
    reps = F.normalize(model(idx_keep), dim=1)
    teacher_np = np.stack([ft.get_word_vector(words[i]) for i in keep])
    teacher = F.normalize(torch.from_numpy(teacher_np), dim=1)

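def top_flse(word, k=10):
    # k nearest FLSE neighbors of `word` by cosine similarity (rows of `reps` are L2-normalized)
    idx = ft.get_word_id(word)
    if idx < 0 or idx >= cfg['vocab_size'] or idx not in keep:
        return []
    pos = keep.index(idx)
    sims = torch.mv(reps, reps[pos])
    sims[pos] = float('-inf')  # exclude the query word from its own neighbor list
    top = sims.topk(k).indices.tolist()
    return [words[keep[i]] for i in top]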

for w in ['perro', 'gato', 'animal', 'policía', 'gorra']:
    print(w, '→', top_flse(w))
    idx = ft.get_word_id(w)
    # only inspect words that fall inside the distilled vocab
    if 0 <= idx < cfg['vocab_size']:
        inspect_word(model, word_idx=idx, topk=5)
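
The cell above builds a normalized `teacher` matrix but `top_flse` never uses it. One possible use is to compare FLSE neighbors against teacher neighbors for the same word. The sketch below only shows the comparison step. `overlap_at_k` is a hypothetical helper, and the two neighbor lists are made-up placeholders, not results from the checkpoint. In the notebook they would come from `top_flse(w)` and from an analogous (hypothetical) `top_teacher(w)` that ranks against `teacher` instead of `reps`.

```python
def overlap_at_k(neighbors_a, neighbors_b):
    # Fraction of words shared by two neighbor lists; 0.0 if either list is empty.
    a, b = set(neighbors_a), set(neighbors_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Placeholder neighbor lists (illustrative only, not real model output).
flse_neighbors = ['perros', 'gato', 'cachorro', 'can']
teacher_neighbors = ['perros', 'cachorro', 'mascota', 'gato']

score = overlap_at_k(flse_neighbors, teacher_neighbors)
print(score)
```

A score near 1.0 would mean the student preserves the teacher's local neighborhood for that word. Low scores on common words are a hint that the layers are too sharp or too flat.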

## 5) Quick adjustments
- If the micro (last) layer collapses: lower the last temp (e.g. 2.3) or raise the final target (0.35–0.4).
- If it stays flat: raise the deep-layer temps or the `lambda-entropies` weights.
- For a clean vocab, generate a filtered teacher (alphabetic tokens only) and distill on that subset.
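
The last tip can be sketched as follows. This is a minimal illustration that reuses the same rule as the `keep` filter in section 4 (alphabetic, longer than 2 characters). `alpha_indices` is a hypothetical helper name, and the toy word list stands in for `ft.words[:200000]`.

```python
def alpha_indices(words, min_len=3):
    # Indices of tokens that are purely alphabetic and at least `min_len` characters long.
    return [i for i, w in enumerate(words) if w.isalpha() and len(w) >= min_len]

# Toy vocabulary standing in for ft.words[:200000].
toy_words = ['perro', 'de', '2020', 'gato', 'policía', 'http://x', 'el']

keep = alpha_indices(toy_words)
clean_words = [toy_words[i] for i in keep]
print(keep, clean_words)
```

With the real data, `vecs[keep]` would give the filtered teacher matrix to pass to `np.save`, and `--vocab-size` would presumably need to match `len(keep)`. Row `i` of the filtered teacher no longer corresponds to fastText id `i`, so it is worth saving `clean_words` alongside the `.npy` and looking words up in that list rather than through `ft.get_word_id`.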