# Transformer from Scratch for Nahuatl (ncx) ↔ Spanish (es) Translation


## Detailed documentation: Nahuatl (ncx) ↔ Spanish (es) Transformer

**Project:** *articulo traduccion Nahuatl-español*  
**Main implementation:** Transformer written from scratch in PyTorch, plus an optional **BERT2BERT (mBERT)** baseline.  
**Target:** a single notebook, `Transformer_Nahuatl_Espanol_FromScratch.ipynb`

### 1) Goal and scope
Build a **bidirectional** `ncx ↔ es` machine translation system with:

- **Language-specific preprocessing:** sentence segmentation, lowercasing, and preservation of **diacritics, the glottal stop, punctuation and numbers**.  
- **Tokenization:** SentencePiece **Unigram** with a **shared vocabulary** (ncx+es) of about **10k** pieces.  
- **Model from scratch:** **Pre-Norm** Transformer, `d_model=512`, `n_heads=8`, `d_ff=2048`, `n_layers=6/6` (encoder/decoder), `dropout=0.1`.  
- **Training:** Adam + **Noam** schedule, **label smoothing=0.1**, length **bucketing**, **grad clip=1.0**.  
- **Evaluation:** *sacreBLEU* and **chrF++**.  
- **Inference:** *greedy* and **beam search** (configurable beam size).  
- **UI:** **Gradio**, embedded in the notebook.  
- **Optional baseline:** **BERT2BERT (mBERT)** with HuggingFace, for comparison.

### 2) Requirements
**Software**
- Python 3.9–3.11  
- Minimum packages: `torch`, `numpy`, `sentencepiece`, `sacrebleu`, `gradio`, `tqdm`, `pyyaml`  
- Optional: `transformers`, `accelerate`, `datasets` (baseline), `torch-directml` (Windows/AMD), `spacy` + `es_core_news_sm`

**Installation (cell 0)**
```bash
%pip install torch numpy sentencepiece sacrebleu gradio tqdm pyyaml
%pip install transformers accelerate datasets  # optional baseline
%pip install torch-directml                    # optional, AMD/DirectML
# Optional (spaCy sentence splitter for Spanish); run as two separate lines:
# %pip install spacy
# !python -m spacy download es_core_news_sm
```

**Hardware**
- Runs on **CPU**. With an **AMD GPU** on Windows, try **DirectML** (`torch-directml`): the notebook uses it automatically when the package can be imported and falls back to CPU otherwise.

### 3) Data
**Base folder:** `C:\Users\Samuel Perez\Desktop\articulo`

Expected layout:
```
articulo/
├─ salida/
│  └─ parallel_ncx_es.jsonl       # parallel corpus (required)
├─ spm/                           # SentencePiece models
├─ checkpoints/                   # model weights
└─ logs/
```

**Corpus format (`parallel_ncx_es.jsonl`)**
Each line is a JSON object such as:
```json
{"src": "<Nahuatl text>", "tgt": "<Spanish text>", "libro":"Génesis", "capitulo":1, "versiculo":"1"}
```
`src` holds the Nahuatl text and `tgt` the Spanish text. The fields `libro` (book), `capitulo` (chapter) and `versiculo` (verse) are optional and serve only for traceability.
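
Before training, it can help to check that every line really follows this format. The sketch below is one possible validator, not part of the notebook: the function name `validate_jsonl_lines` and the sample lines are hypothetical, and it only checks what the format above requires (a JSON object with non-empty string fields `src` and `tgt`).

```python
import json

def validate_jsonl_lines(lines):
    """Return (valid_pairs, errors) for an iterable of JSONL lines.

    valid_pairs is a list of (src, tgt) tuples; errors is a list of
    (line_number, reason) tuples with 1-based line numbers.
    """
    valid, errors = [], []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # blank lines are skipped, not reported
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(obj, dict):
            errors.append((lineno, "line is not a JSON object"))
            continue
        src, tgt = obj.get("src"), obj.get("tgt")
        if not (isinstance(src, str) and isinstance(tgt, str) and src.strip() and tgt.strip()):
            errors.append((lineno, "missing or empty 'src'/'tgt'"))
            continue
        valid.append((src.strip(), tgt.strip()))
    return valid, errors

# Hypothetical sample: one good line, one with an empty source, one that is not JSON.
sample = [
    '{"src": "<Nahuatl text>", "tgt": "<Spanish text>", "libro": "Génesis", "capitulo": 1, "versiculo": "1"}',
    '{"src": "", "tgt": "<Spanish text>"}',
    "not json at all",
]
pairs_ok, problems = validate_jsonl_lines(sample)
print(len(pairs_ok), "valid;", len(problems), "rejected")
for lineno, reason in problems:
    print(f"  line {lineno}: {reason}")
```

To check the real corpus, pass an open file handle (opened with `encoding="utf-8"`) instead of `sample`.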

If you ever need to convert **CSV→JSONL**:
```python
import csv, json, pathlib

# Assumes the CSV has a header row with the columns `ncx` (Nahuatl) and `es` (Spanish).
csv_path = pathlib.Path(r"C:\Users\Samuel Perez\Desktop\articulo\corpus_ncx_es.csv")
jsonl_path = pathlib.Path(r"C:\Users\Samuel Perez\Desktop\articulo\salida\parallel_ncx_es.jsonl")
with open(csv_path, newline='', encoding='utf-8') as f, open(jsonl_path, 'w', encoding='utf-8') as w:
    for row in csv.DictReader(f):
        obj = {"src": row["ncx"].strip(), "tgt": row["es"].strip()}
        # ensure_ascii=False keeps diacritics as readable UTF-8 instead of \uXXXX escapes
        w.write(json.dumps(obj, ensure_ascii=False) + "\n")
```

**Normalization and segmentation**
- Lowercase on both sides (preserving diacritics and the glottal stop).  
- **Spanish:** spaCy if available, otherwise a regex.  
- **Nahuatl:** punctuation-based rules.  
- If both sides yield the same number of sentences (more than one and fewer than ten), the verse is split into sentence pairs; otherwise the verse is kept whole.

> Suggestions: deduplicate pairs, filter out extreme lengths, and check for UTF‑8 problems and zero-width spaces (ZWSP).
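
The suggestions above could be sketched as follows. This is an illustrative filter, not code from the notebook: the names `clean_text` and `filter_pairs` and the thresholds (`max_words=100`, `max_ratio=3.0`) are assumptions to tune on the real corpus. NFC normalization composes characters without removing diacritics.

```python
import unicodedata

# Zero-width characters that often survive copy/paste from web pages.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def clean_text(text):
    """NFC-normalize, drop zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ZERO_WIDTH)
    return " ".join(text.split())

def filter_pairs(pairs, min_words=1, max_words=100, max_ratio=3.0):
    """Clean, length-filter and deduplicate (src, tgt) pairs, keeping order."""
    seen, kept = set(), []
    for src, tgt in pairs:
        src, tgt = clean_text(src), clean_text(tgt)
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if not (min_words <= n_src <= max_words and min_words <= n_tgt <= max_words):
            continue  # empty or extremely long side
        if max(n_src, n_tgt) / max(min(n_src, n_tgt), 1) > max_ratio:
            continue  # suspicious length mismatch, likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate after cleaning
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

# Hypothetical toy input: a clean pair, a ZWSP/whitespace variant of it,
# a badly mismatched pair, and a pair with an empty source.
raw_pairs = [
    ("a b c", "x y z"),
    ("a\u200b b c", "x y  z"),
    ("a", "x y z w v u"),
    ("", "x"),
]
cleaned = filter_pairs(raw_pairs)
print(cleaned)
```

Word counts are used here only because they are cheap; filtering on SentencePiece token counts after tokenization would match `MAX_LEN` more closely.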

### 4) Tokenization: SentencePiece (Unigram, shared vocabulary)
- `VOCAB_SIZE = 10000` (reasonable range: 8k–12k).  
- Special symbols, registered as `user_defined_symbols`: `<pad>`, `<bos>`, `<eos>`, `<lang_ncx>`, `<lang_es>`.  
- `character_coverage = 1.0` so that every character, diacritics included, is kept in the vocabulary.  
- Output: `spm/ncx_es_unigram.model`, `spm/ncx_es_unigram.vocab`.

### 5) Model "from scratch" (PyTorch)
**Small (default):** `d_model=512, n_heads=8, d_ff=2048, n_layers=6/6, dropout=0.1`  
**Light (low memory):** `d_model=256, n_heads=4, d_ff=1024, n_layers=4/4`  
- Pre‑Norm, sinusoidal positional encoding, **padding mask**, and **causal mask** in the decoder.
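
To make the two masks concrete, here is a small pure-Python sketch of which key positions each decoder query position may attend to. It uses booleans for readability; the notebook itself works on tensors, with an additive `-inf` causal mask and a boolean key-padding mask. The `PAD` value and token IDs below are hypothetical.

```python
def causal_mask(length):
    """allowed[i][j] is True when query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def combine_with_padding(allowed, token_ids, pad_id):
    """Additionally forbid attention to any key position that holds padding."""
    return [
        [ok and token_ids[j] != pad_id for j, ok in enumerate(row)]
        for row in allowed
    ]

PAD = 0                      # hypothetical padding ID
tokens = [5, 7, 9, PAD]      # three real tokens followed by one pad
mask = combine_with_padding(causal_mask(len(tokens)), tokens, PAD)

# One row per query position; "x" = may attend, "." = masked out.
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

Each row gains one more visible position than the previous one (the causal part), while the last column stays masked for every query because it holds padding.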

### 6) Training
- `MAX_LEN=128`, length **bucketing**.  
- Adam(β1=0.9, β2=0.98, eps=1e-9) + **Noam** schedule (`warmup=4000`).  
- Regularization: **label smoothing=0.1**, **dropout=0.1**, **grad_clip=1.0**.  
- **Gradient accumulation**: 1 (raise it to 2–4 if you reduce the batch size).  
- Recommended starting configuration: `CFG_SMALL`, `EPOCHS=2`, `BATCH=32`, `WARMUP=4000`, `BEAM=5`, `length_penalty=0.7`.
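
The Noam schedule raises the learning rate linearly for `warmup` optimizer steps and then decays it with the inverse square root of the step. The sketch below reproduces the formula used by the notebook's `NoamWrapper` as a standalone function (the name `noam_lr` is introduced here only for illustration).

```python
def noam_lr(step, d_model=512, warmup=4000):
    """Learning rate at a given optimizer step (steps are counted from 1)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Print a few points of the schedule for the default configuration.
for step in (1, 1000, 2000, 4000, 8000, 16000):
    print(f"step {step:>6}: lr = {noam_lr(step):.6f}")

peak = noam_lr(4000)  # the maximum is reached exactly at step == warmup
```

One practical consequence: if the corpus yields fewer than `warmup` optimizer steps in total (steps per epoch × epochs, divided by `grad_accum`), training ends while the rate is still ramping up. In that case a smaller `warmup` may be worth trying.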

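### 7) Evaluation
- **sacreBLEU** and **chrF++**, reported per direction (`ncx→es` and `es→ncx`).  
- In low-resource settings, **chrF++** is usually the more informative metric: it scores character n-grams, so it gives partial credit to near-miss word forms that BLEU counts as plain errors.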
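
The partial-credit behaviour can be illustrated with a deliberately simplified character n-gram F-score. This is a sketch of the idea behind chrF, not a replacement for sacreBLEU: it works on a single sentence pair, omits the word n-grams that chrF++ adds, and the function name `simple_chrf` is hypothetical. Use `sacrebleu` for any reported number.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of the text with whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hyp, ref, max_n=6, beta=2.0):
    """Average n-gram precision/recall over n = 1..max_n, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # one side is shorter than n characters
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

reference = "los niños cantaban"
hypothesis = "los niños cantaron"   # one word differs only in its ending

exact_word_matches = sum(h == r for h, r in zip(hypothesis.split(), reference.split()))
score = simple_chrf(hypothesis, reference)
print(f"exact word matches: {exact_word_matches}/3")
print(f"simplified character F-score: {score:.3f}")
```

The wrong verb form shares most of its characters with the reference, so the character-level score stays high even though a word-level comparison treats the whole word as an error.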

### 8) Inference and UI
- *Greedy* (fast) and **beam search** (beam size 4–8 recommended).  
- **Gradio**: direction selector, beam-size slider, and input/output text boxes. Run `demo.launch()` to start it.
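
Beam search in this notebook ranks hypotheses by their summed log-probability divided by `length ** length_penalty`. The sketch below shows why that exponent matters; the two hypotheses and their scores are invented for illustration, and the helper names are hypothetical.

```python
def normalized_score(sum_logprob, length, length_penalty=0.7):
    """Length-normalized score: higher is better."""
    return sum_logprob / (length ** length_penalty)

def pick_best(hypotheses, length_penalty):
    return max(
        hypotheses,
        key=lambda h: normalized_score(h["logprob"], h["length"], length_penalty),
    )

# Invented candidates: a short one and a longer, slightly less probable one.
hypotheses = [
    {"name": "short", "logprob": -4.0, "length": 5},
    {"name": "long", "logprob": -5.5, "length": 12},
]

for lp in (0.0, 0.7, 1.0):
    best = pick_best(hypotheses, lp)
    print(f"length_penalty={lp}: best = {best['name']}")
```

With `length_penalty=0.0` the raw log-probability decides, and since every extra token can only lower that sum, short outputs win. Larger values compensate for length, so if translations come out truncated, raising the penalty is one knob to try; if they ramble, lower it.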

### 9) Optional baseline: BERT2BERT (mBERT)
- `EncoderDecoderModel` built from `bert-base-multilingual-cased` on both sides, with partial freezing (embeddings and the first encoder layer).  
- Dataset built with `datasets` and tokenized with `BertTokenizerFast`. Useful as a point of comparison.

### 10) Workflow (step by step)
1. Check that `parallel_ncx_es.jsonl` exists in `...\articulo\salida\`.  
2. Run cells 0–7 (dependencies, paths, loading/segmentation, splits, SentencePiece).  
3. Review `VOCAB_SIZE=10k` and `MAX_LEN=128`.  
4. Train both directions (cell 11); start with 2 epochs.  
5. Evaluate (cell 12) with BLEU/chrF++.  
6. UI (cell 13): run `demo.launch()` and try it out.  
7. (Optional) mBERT baseline (cells 14.x).

### 11) Advanced configuration (summary)
- Data: length filters, optional language detection, conservative re-segmentation.  
- Decoding: beam size 4–8, length penalty 0.6–1.0, and a repetition penalty if the output loops.  
- Performance: use `CFG_LIGHT`, lower `BATCH`, raise `grad_accum`.  
- Resuming: `resume_direction(...)` reloads the model weights from a checkpoint. The optimizer state and the Noam step counter are not restored, so the learning-rate schedule restarts from warm-up.

### 12) Output layout
- `spm/*.model`, `spm/*.vocab`  
- `checkpoints/scratch_ncx2es_best.pt`, `checkpoints/scratch_es2ncx_best.pt`  
- `hf_bert2bert/*` (opcional)

### 13) Quick FAQ
1. **AssertionError** (`parallel_ncx_es.jsonl` not found) → check the path; if your corpus is a CSV, convert it with the snippet in section 3.  
2. **unicodeescape** error in Windows paths → use raw strings (`r"C:\..."`) or forward slashes (`/`).  
3. `sentencepiece`/`sacrebleu`/`gradio` missing → install them in cell 0.  
4. Out of memory → use `CFG_LIGHT`, lower `BATCH`, raise `grad_accum`, reduce `MAX_LEN`.  
5. Gradio does not open → run `demo.launch(share=False)` and browse to the `127.0.0.1` address it prints.  
6. DirectML is not used → install `torch-directml`; otherwise the notebook continues on CPU.
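
For item 2, the short sketch below shows that the raw-string and forward-slash spellings name the same Windows path. It uses `PureWindowsPath`, which only manipulates the path text and therefore behaves the same on any operating system; the variable names are illustrative.

```python
from pathlib import PureWindowsPath

# A plain literal such as "C:\Users\..." fails to compile because "\U" starts
# a unicode escape. Either of the two spellings below avoids the problem.
raw = r"C:\Users\Samuel Perez\Desktop\articulo"
fwd = "C:/Users/Samuel Perez/Desktop/articulo"

same = PureWindowsPath(raw) == PureWindowsPath(fwd)
corpus = PureWindowsPath(raw) / "salida" / "parallel_ncx_es.jsonl"

print("same path:", same)
print("corpus   :", corpus)
print("as posix :", corpus.as_posix())
```

In the notebook itself, `Path(r"...")` plus the `/` operator (as in cell 1) is enough; no manual string concatenation with backslashes is needed.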

### 14) Linguistic considerations
- Preserve diacritics and the glottal stop.  
- Rich morphology: **Unigram** often segments such languages better than BPE, though this is worth verifying on your own data.  
- Punctuation-based segmentation is conservative; a dedicated sentencizer can be trained later.

### 15) Ethics and licensing
Respect the terms of use of your sources (e.g., JW.org), document the system's limitations, and avoid sensitive uses without human review.

### 16) Suggested roadmap
- **v0.1 (current):** full pipeline + UI + mBERT baseline.  
- **v0.2:** automatic per-epoch metrics (chrF++ on dev), early stopping and CSV/JSON logs; cell 10b of this notebook already covers the chrF++ early stopping and the CSV log.  
- **v0.3:** mBART‑50/mT5, data cleaning/alignment, error analysis.  
- **v0.4:** assisted post-editing and a glossary.


> Project: **articulo traduccion Nahuatl-español**. Unified notebook with preprocessing, tokenization, Transformer model, training, evaluation, inference/UI and an mBERT baseline.  
> **Stack:** pure PyTorch + SentencePiece + sacreBLEU/chrF++ + Gradio.  
> **Base path**: `C:\Users\Samuel Perez\Desktop\articulo` (adjust it if needed).

### Contents
1. Objective and requirements  
2. Data and segmentation  
3. Tokenization (SentencePiece)  
4. Transformer model (Pre‑Norm)  
5. Training (Noam, label smoothing, bucketing)  
6. CSV logging + early stopping (chrF++) + resuming  
7. Evaluation (BLEU/chrF++)  
8. Inference (greedy/beam) + Gradio UI  
9. Optional BERT2BERT (mBERT) baseline

---

## 1) Objective
- **Bidirectional** ncx↔es NMT with a **shared** vocabulary.  
- Preserve diacritics, the glottal stop and punctuation; lowercase on both sides.

## 2) Requirements
Install if needed:
```bash
%pip install torch numpy sentencepiece sacrebleu gradio tqdm pyyaml
# Optional (mBERT baseline):
%pip install transformers accelerate datasets
# Optional (Windows + AMD/DirectML):
%pip install torch-directml
# Optional (better Spanish sentence segmentation); run as two separate lines:
# %pip install spacy
# !python -m spacy download es_core_news_sm
```

## 3) Data
The notebook requires `salida/parallel_ncx_es.jsonl`, with lines such as:
```json
{"src":"<Nahuatl text>", "tgt":"<Spanish text>", "libro":"Génesis", "capitulo":1, "versiculo":"1"}
```
If the number of sentences matches on both sides (ncx and es), the verse is split into several sentence pairs; otherwise it is left as is.
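
The splitting rule can be sketched in a few lines. This version uses the same punctuation regex on both sides for simplicity, whereas the notebook prefers spaCy for Spanish when it is installed; the function names and the toy strings are illustrative only.

```python
import re

def split_sentences(text):
    """Split after '.', '?' or '!' followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def expand_verse(src, tgt, max_sentences=10):
    """Return sentence pairs when both sides split into the same number of
    sentences (more than 1, fewer than max_sentences); else the verse itself."""
    s, t = split_sentences(src), split_sentences(tgt)
    if 1 < len(s) == len(t) < max_sentences:
        return list(zip(s, t))
    return [(src, tgt)]

# Toy placeholders, not real corpus text.
matched = expand_verse("a b. c d.", "x y. z w.")
unmatched = expand_verse("a b. c d.", "x y z w.")
print(matched)
print(unmatched)
```

Equal sentence counts do not guarantee that sentence *i* on one side translates sentence *i* on the other, since translators sometimes reorder or merge clauses. Spot-checking a sample of the expanded pairs is a cheap safeguard.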

## 4) Architecture and training
- **Pre‑Norm**, `d_model=512`, `n_heads=8`, `d_ff=2048`, `n_layers=6/6`, `dropout=0.1`.  
- **Adam + Noam (warmup=4000)**, **label smoothing=0.1**, **grad_clip=1.0**.  
- **MAX_LEN=128**, length **bucketing**.  
- **Early stopping** on dev **chrF++**; per-epoch CSV logging in `logs/`.

> Tips: start with `EPOCHS=2` to validate the pipeline; raise it to 6–10 if everything works. Use `CFG_LIGHT` if RAM is limited.
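
The early-stopping rule (stop after `patience` consecutive epochs without a new best dev chrF++) can be simulated on a list of scores. The scores below are invented purely to exercise the logic, and `early_stopping_epoch` is a hypothetical helper, not a function from the notebook.

```python
def early_stopping_epoch(dev_scores, patience=3):
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    An epoch counts as an improvement only if its score is strictly higher
    than the best score seen so far.
    """
    best, best_epoch, bad_epochs = float("-inf"), 0, 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best:
            best, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # patience exhausted: stop here
    return best_epoch, len(dev_scores)

# Invented dev chrF++ values, one per epoch.
scores = [10.0, 12.5, 12.4, 12.1, 12.5, 13.0]

stopped = early_stopping_epoch(scores, patience=3)
patient = early_stopping_epoch(scores, patience=4)
print("patience=3 ->", stopped)
print("patience=4 ->", patient)
```

In this made-up run, `patience=3` stops one epoch before the score would have improved again, which is the usual trade-off: a small patience saves compute but can cut off a noisy metric too early, especially when the dev metric is computed on only a few hundred samples.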


In [None]:
# 1) Paths, seeds and split ratios
from pathlib import Path
import os, random, numpy as np, torch

BASE_DIR = Path(r"C:\Users\Samuel Perez\Desktop\articulo")
for p in [BASE_DIR, BASE_DIR/"salida", BASE_DIR/"checkpoints", BASE_DIR/"spm", BASE_DIR/"logs"]:
    p.mkdir(parents=True, exist_ok=True)

DATA_DIR = BASE_DIR / "salida"
CHECK_DIR = BASE_DIR / "checkpoints"
TOK_DIR = BASE_DIR / "spm"
LOG_DIR = BASE_DIR / "logs"
PARALLEL_JSONL = DATA_DIR / "parallel_ncx_es.jsonl"
assert PARALLEL_JSONL.exists(), f"No se encontró {PARALLEL_JSONL}. Coloca el corpus en esa ruta."

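SEED=42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
SPLIT_TRAIN=0.8; SPLIT_DEV=0.1   # the remaining 0.1 becomes the test split
MAX_SAMPLES=0                    # 0 = use every pair; set >0 to truncate for quick smoke tests
MAX_LEN=128                      # maximum sequence length in tokens (longer sequences are truncated)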


In [None]:
# 2) Device
import torch
DEVICE = torch.device("cpu")
try:
    import torch_directml
    DEVICE = torch_directml.device()
    print("Usando DirectML (GPU AMD) si está disponible.")
except Exception as e:
    print("DirectML no disponible; usando CPU.\n", str(e))
print("DEVICE =", DEVICE)


In [None]:
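# 3) Data loading and sentence segmentation
import json, re
from functools import lru_cache
from typing import List

@lru_cache(maxsize=1)
def _load_es_nlp():
    """Load the Spanish spaCy pipeline once (not on every call); None if spaCy is unavailable."""
    try:
        import spacy
    except Exception:
        return None
    try:
        return spacy.load("es_core_news_sm")
    except Exception:
        nlp = spacy.blank("es"); nlp.add_pipe("sentencizer"); return nlp

def sent_split_es(text: str) -> List[str]:
    nlp = _load_es_nlp()
    if nlp is not None:
        return [s.text.strip() for s in nlp(text).sents if s.text.strip()]
    # Regex fallback: split after closing punctuation only, never after the opening ¡ or ¿.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def sent_split_ncx(text: str) -> List[str]:
    # Nahuatl: punctuation-based rule
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]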

pairs = []
with open(PARALLEL_JSONL, "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        s = obj["src"].strip(); t = obj["tgt"].strip()
        if s and t: pairs.append((s, t, obj.get("libro",""), obj.get("capitulo",0), obj.get("versiculo",0)))
print(f"Pares cargados (verso): {len(pairs):,}")

expanded = []
for s, t, libro, cap, ver in pairs:
    ss = sent_split_ncx(s.lower()); tt = sent_split_es(t.lower())
    if 1 < len(ss) == len(tt) < 10:
        for i in range(len(ss)): expanded.append((ss[i], tt[i], libro, cap, f"{ver}.{i+1}"))
    else:
        expanded.append((s.lower(), t.lower(), libro, cap, ver))
if MAX_SAMPLES and MAX_SAMPLES>0: expanded = expanded[:MAX_SAMPLES]
print(f"Pares tras segmentación: {len(expanded):,}")


In [None]:
# 4) Splits
from math import floor
idx = list(range(len(expanded))); random.shuffle(idx)
n_tr = floor(len(idx)*SPLIT_TRAIN); n_de = floor(len(idx)*SPLIT_DEV)
def take(idxs): return [expanded[i] for i in idxs]
train_pairs = take(idx[:n_tr]); dev_pairs = take(idx[n_tr:n_tr+n_de]); test_pairs = take(idx[n_tr+n_de:])
print(f"Train={len(train_pairs):,} | Dev={len(dev_pairs):,} | Test={len(test_pairs):,}")


In [None]:
# 5) SentencePiece
import sentencepiece as spm
VOCAB_SIZE = 10000
raw_corpus = (TOK_DIR / "spm_raw.txt")
with open(raw_corpus, "w", encoding="utf-8") as w:
    for s, t, *_ in train_pairs + dev_pairs: w.write(s+"\n"); w.write(t+"\n")
SPM_MODEL_PREFIX = str((TOK_DIR / "ncx_es_unigram").as_posix())
spm.SentencePieceTrainer.Train(
    input=str(raw_corpus), model_prefix=SPM_MODEL_PREFIX, vocab_size=VOCAB_SIZE, model_type="unigram",
    user_defined_symbols=["<pad>","<bos>","<eos>","<lang_ncx>","<lang_es>"], character_coverage=1.0,
    input_sentence_size=1000000, shuffle_input_sentence=True)
SPM_MODEL = TOK_DIR / "ncx_es_unigram.model"; SPM_VOCAB = TOK_DIR / "ncx_es_unigram.vocab"
assert SPM_MODEL.exists(), "No se generó el modelo SentencePiece."
print("Tokenizador listo:", SPM_MODEL)


In [None]:
# 6) Tokenization helpers
sp = spm.SentencePieceProcessor(model_file=str(SPM_MODEL))
PAD_ID=sp.piece_to_id("<pad>"); BOS_ID=sp.piece_to_id("<bos>"); EOS_ID=sp.piece_to_id("<eos>")
LNCX_ID=sp.piece_to_id("<lang_ncx>"); LES_ID=sp.piece_to_id("<lang_es>"); VOCAB=sp.get_piece_size()
def encode_with_lang(text, lang_tok_id):
    ids = sp.encode(text, out_type=int); return [BOS_ID, lang_tok_id] + ids + [EOS_ID]
def collate_batch(batch, pad_id=PAD_ID):
    src_lens=[len(b[0]) for b in batch]; tgt_lens=[len(b[1]) for b in batch]
    max_src=min(max(src_lens), MAX_LEN); max_tgt=min(max(tgt_lens), MAX_LEN)
    def pad_seq(seq,L): seq=seq[:L]; return seq+[pad_id]*(L-len(seq))
    import torch
    src=torch.tensor([pad_seq(b[0],max_src) for b in batch],dtype=torch.long)
    tgt=torch.tensor([pad_seq(b[1],max_tgt) for b in batch],dtype=torch.long)
    return src,tgt
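
The helpers above frame every sequence as `<bos> <lang_xx> … <eos>`. The sketch below shows how such a target sequence is shifted for teacher forcing, which is what `tgt[:, :-1]` and `tgt[:, 1:]` do in the training loop. The numeric IDs are hypothetical stand-ins; the real ones come from the trained SentencePiece model.

```python
# Hypothetical IDs for the control tokens; real values come from sp.piece_to_id(...).
BOS, EOS, LANG_NCX, LANG_ES = 4, 5, 6, 7

def frame(piece_ids, lang_id):
    """Wrap subword IDs as <bos> <lang> ... <eos>, as encode_with_lang does."""
    return [BOS, lang_id] + list(piece_ids) + [EOS]

target = frame([101, 102, 103], LANG_ES)   # 101..103 stand for subword pieces

decoder_input = target[:-1]   # what the decoder sees
labels = target[1:]           # what it must predict at each position

for seen, expected in zip(decoder_input, labels):
    print(f"input {seen:>3} -> predict {expected}")
```

The first prediction target is the language tag itself. At inference time the notebook skips that step by starting the decoder from `[<bos>, <lang_tgt>]`, which is how the desired output language is forced.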


In [None]:
# 7) Dataset + bucketing
from torch.utils.data import Dataset, DataLoader
import random
class ParallelDataset(Dataset):
    def __init__(self, pairs, direction="ncx2es"):
        self.items=[]
        for s,t,*_ in pairs:
            if direction=="ncx2es": self.items.append((encode_with_lang(s,LNCX_ID), encode_with_lang(t,LES_ID)))
            else: self.items.append((encode_with_lang(t,LES_ID), encode_with_lang(s,LNCX_ID)))
    def __len__(self): return len(self.items)
    def __getitem__(self,i): return self.items[i]
def make_loader(pairs, direction, batch_size=32, shuffle=True):
    ds=ParallelDataset(pairs,direction=direction)
    # Sort by source length so each batch groups sequences of similar length (less padding).
    order=sorted(range(len(ds)),key=lambda i: len(ds.items[i][0]))
    # Cut the sorted order into contiguous batches, then shuffle whole batches rather than items;
    # strided buckets (order[i::B]) would mix all lengths again and undo the bucketing.
    batches=[order[i:i+batch_size] for i in range(0,len(order),batch_size)]
    if shuffle: random.shuffle(batches)
    return DataLoader(ds, batch_sampler=batches, collate_fn=collate_batch)
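
The benefit of length bucketing is easy to measure without any model: count how many padding tokens each batching strategy produces. The sketch below does this on synthetic sequence lengths (random integers, not corpus statistics); the helper names are hypothetical.

```python
import random

def make_batches(lengths, batch_size, bucketed, seed=0):
    """Return a list of index batches, either length-sorted or fully random."""
    rng = random.Random(seed)
    order = list(range(len(lengths)))
    if bucketed:
        order.sort(key=lambda i: lengths[i])   # neighbours have similar length
    else:
        rng.shuffle(order)
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    rng.shuffle(batches)                       # shuffle batch order, keep contents
    return batches

def padding_tokens(lengths, batches):
    """Total pad positions when each batch is padded to its longest sequence."""
    return sum(
        max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b)
        for b in batches
    )

rng = random.Random(42)
lengths = [rng.randint(5, 128) for _ in range(1000)]   # synthetic lengths

random_batches = make_batches(lengths, 32, bucketed=False)
bucketed_batches = make_batches(lengths, 32, bucketed=True)
pad_random = padding_tokens(lengths, random_batches)
pad_bucketed = padding_tokens(lengths, bucketed_batches)

print("padding tokens, random batches  :", pad_random)
print("padding tokens, bucketed batches:", pad_bucketed)
```

Fewer padding tokens mean less wasted computation per step, which matters most on CPU. The price is that batches are no longer random samples of the data, which is why the batch order is still shuffled.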


In [None]:
# 8) Transformer model (Pre‑Norm)
import math, torch, torch.nn as nn
DROPOUT = 0.1
class PositionalEncoding(nn.Module):
    def __init__(self,d_model,max_len=2048):
        super().__init__()
        pe=torch.zeros(max_len,d_model); pos=torch.arange(0,max_len,dtype=torch.float).unsqueeze(1)
        div=torch.exp(torch.arange(0,d_model,2).float()*(-math.log(10000.0)/d_model))
        pe[:,0::2]=torch.sin(pos*div); pe[:,1::2]=torch.cos(pos*div); self.register_buffer('pe',pe.unsqueeze(0))
    def forward(self,x): return x + self.pe[:, :x.size(1)]
class MultiHeadAttention(nn.Module):
    def __init__(self,d_model,n_heads,dropout=0.1):
        super().__init__(); assert d_model % n_heads == 0; self.d_k=d_model//n_heads; self.n=n_heads
        self.q=nn.Linear(d_model,d_model); self.k=nn.Linear(d_model,d_model); self.v=nn.Linear(d_model,d_model); self.o=nn.Linear(d_model,d_model); self.drop=nn.Dropout(dropout)
    def forward(self,q,k,v,attn_mask=None,key_padding_mask=None):
        B,Lq,D=q.shape; B,Lk,_=k.shape
        q=self.q(q).view(B,Lq,self.n,self.d_k).transpose(1,2); k=self.k(k).view(B,Lk,self.n,self.d_k).transpose(1,2); v=self.v(v).view(B,Lk,self.n,self.d_k).transpose(1,2)
        scores=torch.matmul(q,k.transpose(-2,-1))/math.sqrt(self.d_k)
        if attn_mask is not None:
            if attn_mask.dim()==2: scores = scores + attn_mask.unsqueeze(0).unsqueeze(0)
            elif attn_mask.dim()==4: scores = scores + attn_mask
        if key_padding_mask is not None: scores=scores.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), float('-inf'))
        attn=torch.softmax(scores,dim=-1); attn=self.drop(attn); out=torch.matmul(attn,v).transpose(1,2).contiguous().view(B,Lq,D)
        return self.o(out)
class FeedForward(nn.Module):
    def __init__(self,d_model,d_ff,dropout=0.1): super().__init__(); self.net=nn.Sequential(nn.Linear(d_model,d_ff), nn.ReLU(), nn.Dropout(dropout), nn.Linear(d_ff,d_model))
    def forward(self,x): return self.net(x)
class EncoderLayer(nn.Module):
    def __init__(self,d_model,n_heads,d_ff,dropout=0.1):
        super().__init__(); self.norm1=nn.LayerNorm(d_model); self.attn=MultiHeadAttention(d_model,n_heads,dropout); self.drop1=nn.Dropout(dropout); self.norm2=nn.LayerNorm(d_model); self.ff=FeedForward(d_model,d_ff,dropout); self.drop2=nn.Dropout(dropout)
    def forward(self,x,src_pad_mask):
        y=self.attn(self.norm1(x),self.norm1(x),self.norm1(x),key_padding_mask=src_pad_mask); x=x+self.drop1(y); y=self.ff(self.norm2(x)); x=x+self.drop2(y); return x
class DecoderLayer(nn.Module):
    def __init__(self,d_model,n_heads,d_ff,dropout=0.1):
        super().__init__(); self.norm1=nn.LayerNorm(d_model); self.self_attn=MultiHeadAttention(d_model,n_heads,dropout); self.drop1=nn.Dropout(dropout); self.norm2=nn.LayerNorm(d_model); self.cross_attn=MultiHeadAttention(d_model,n_heads,dropout); self.drop2=nn.Dropout(dropout); self.norm3=nn.LayerNorm(d_model); self.ff=FeedForward(d_model,d_ff,dropout); self.drop3=nn.Dropout(dropout)
    def forward(self,x,mem,tgt_pad_mask,tgt_causal_mask,mem_pad_mask):
        y=self.self_attn(self.norm1(x),self.norm1(x),self.norm1(x),attn_mask=tgt_causal_mask,key_padding_mask=tgt_pad_mask); x=x+self.drop1(y)
        y=self.cross_attn(self.norm2(x),mem,mem,key_padding_mask=mem_pad_mask); x=x+self.drop2(y); y=self.ff(self.norm3(x)); x=x+self.drop3(y); return x
class TransformerModel(nn.Module):
    def __init__(self,vocab_size,d_model,n_heads,d_ff,n_enc,n_dec,dropout=0.1,pad_id=0):
        super().__init__(); self.pad_id=pad_id; self.emb=nn.Embedding(vocab_size,d_model,padding_idx=pad_id); self.pos=PositionalEncoding(d_model)
        self.encoder=nn.ModuleList([EncoderLayer(d_model,n_heads,d_ff,dropout) for _ in range(n_enc)])
        self.decoder=nn.ModuleList([DecoderLayer(d_model,n_heads,d_ff,dropout) for _ in range(n_dec)])
        self.proj=nn.Linear(d_model,vocab_size)
    def make_pad_mask(self,seq): return seq.eq(self.pad_id)
    def make_causal_mask(self,L):
        m=torch.triu(torch.ones(L,L,device=self.emb.weight.device),diagonal=1); return m.masked_fill(m==1,float('-inf'))
    def encode(self,src):
        src_pad=self.make_pad_mask(src); x=self.pos(self.emb(src))
        for layer in self.encoder: x=layer(x,src_pad)
        return x,src_pad
    def decode(self,tgt,mem,mem_pad):
        tgt_pad=self.make_pad_mask(tgt); x=self.pos(self.emb(tgt)); causal=self.make_causal_mask(tgt.size(1)).unsqueeze(0).unsqueeze(0)
        for layer in self.decoder: x=layer(x,mem,tgt_pad,causal,mem_pad)
        return self.proj(x)
    def forward(self,src,tgt_in): mem,src_pad=self.encode(src); return self.decode(tgt_in,mem,src_pad)


In [None]:
# 9) LabelSmoothing + Noam + metrics
import torch.nn as nn, torch
class LabelSmoothingLoss(nn.Module):
    def __init__(self,classes,smoothing=0.1,ignore_index=0):
        super().__init__(); self.ignore_index=ignore_index; self.confidence=1.0-smoothing; self.smoothing=smoothing; self.cls=classes
    def forward(self,pred,target):
        pred=pred.view(-1,pred.size(-1)); target=target.reshape(-1); log_probs=torch.log_softmax(pred,dim=-1)
        nll=-log_probs.gather(dim=-1,index=target.unsqueeze(1)).squeeze(1); smooth=-log_probs.mean(dim=-1); pad_mask=target.eq(self.ignore_index)
        loss=self.confidence*nll + self.smoothing*smooth
        return (loss.masked_fill(pad_mask,0).sum()/torch.clamp((~pad_mask).sum(),min=1))
class NoamWrapper:
    def __init__(self,opt,d_model,warmup=4000): self.opt=opt; self.d_model=d_model; self.warm=warmup; self.step_num=0
    def step(self):
        self.step_num+=1; lr=(self.d_model**-0.5)*min(self.step_num**-0.5, self.step_num*(self.warm**-1.5))
        for pg in self.opt.param_groups: pg['lr']=lr
        self.opt.step()
    def zero_grad(self): self.opt.zero_grad()
    @property
    def lr(self): return self.opt.param_groups[0]['lr']
try:
    from sacrebleu.metrics import BLEU, CHRF
    bleu=BLEU(force=True); chrf=CHRF(word_order=2)
except Exception as e:
    print("sacrebleu no disponible:", e); bleu=chrf=None


In [None]:
# 10) Eval/greedy + CFGs
import torch
def batch_to_device(b,device): return b[0].to(device), b[1].to(device)
def evaluate(model,loader,device):
    model.eval(); total=0.0; n=0
    with torch.no_grad():
        for src,tgt in loader:
            src,tgt=batch_to_device((src,tgt),device)
            logits=model(src,tgt[:,:-1])
            crit=LabelSmoothingLoss(VOCAB,0.1,PAD_ID)
            loss=crit(logits,tgt[:,1:]); total+=loss.item(); n+=1
    return total/max(n,1)
def ids_to_text(ids):
    # Drop all control tokens, language tags included, so they never reach the metrics or the UI.
    specials={PAD_ID,BOS_ID,EOS_ID,LNCX_ID,LES_ID}
    return sp.decode([i for i in ids if i not in specials])
@torch.no_grad()  # inference only: no autograd graph needed
def translate_greedy(model, src_ids, max_len=MAX_LEN, tgt_lang_id=LES_ID):
    # tgt_lang_id selects the output language: LES_ID for ncx→es (default), LNCX_ID for es→ncx.
    model.eval(); src=torch.tensor([src_ids],dtype=torch.long,device=model.emb.weight.device)
    mem,src_pad=model.encode(src); ys=torch.tensor([[BOS_ID, tgt_lang_id]],dtype=torch.long,device=src.device)
    for _ in range(max_len):
        logits=model.decode(ys,mem,src_pad); nxt=logits[:,-1,:].argmax(dim=-1,keepdim=True)
        ys=torch.cat([ys,nxt],dim=1)
        if nxt.item()==EOS_ID: break
    return ys[0].tolist()
CFG_SMALL=dict(d_model=512,n_heads=8,d_ff=2048,n_enc=6,n_dec=6)
CFG_LIGHT=dict(d_model=256,n_heads=4,d_ff=1024,n_enc=4,n_dec=4)


In [None]:
# 10b) Logging CSV + Early Stopping + Resume
import csv, pathlib
from datetime import datetime
def ensure_logfile(direction):
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    path = LOG_DIR / f"train_log_{direction}.csv"
    if not path.exists():
        with open(path, "w", encoding="utf-8", newline="") as w:
            csv.writer(w).writerow(["timestamp","direction","epoch","steps","lr","train_loss","dev_loss","dev_bleu","dev_chrf","best_so_far"])
    return path
@torch.no_grad()  # inference only: avoids building autograd graphs during decoding
def translate_greedy_dir(model, src_ids, direction="ncx2es", max_len=MAX_LEN):
    tgt_lang = LES_ID if direction=="ncx2es" else LNCX_ID
    model.eval(); src=torch.tensor([src_ids],dtype=torch.long,device=model.emb.weight.device)
    mem,src_pad=model.encode(src); ys=torch.tensor([[BOS_ID, tgt_lang]],dtype=torch.long,device=src.device)
    for _ in range(max_len):
        logits=model.decode(ys,mem,src_pad); nxt=logits[:,-1,:].argmax(dim=-1,keepdim=True); ys=torch.cat([ys,nxt],dim=1)
        if nxt.item()==EOS_ID: break
    return ys[0].tolist()
def compute_dev_metrics(model, direction="ncx2es", max_samples=200):
    ds=ParallelDataset(dev_pairs,direction=direction)
    if len(ds)==0: return (0.0,0.0)
    up=min(len(ds),max_samples); refs,hyps=[],[]
    for i in range(up):
        src_ids,tgt_ids=ds[i]; out_ids=translate_greedy_dir(model,src_ids,direction=direction,max_len=MAX_LEN)
        refs.append([ids_to_text(tgt_ids)]); hyps.append(ids_to_text(out_ids))
    try:
        from sacrebleu.metrics import BLEU, CHRF
        return (BLEU(force=True).corpus_score(hyps, list(zip(*refs))).score,
                CHRF(word_order=2).corpus_score(hyps, list(zip(*refs))).score)
    except Exception: return (0.0,0.0)
def save_checkpoint(model, save_prefix, d_model, n_heads, d_ff, n_enc, n_dec, pad_id, vocab, direction, epoch, noam_step=0):
    CHECK_DIR.mkdir(parents=True, exist_ok=True)
    path = (CHECK_DIR / f"{save_prefix}_best.pt").as_posix()
    torch.save({"model":model.state_dict(),
                "cfg":{"d_model":d_model,"n_heads":n_heads,"d_ff":d_ff,"n_enc":n_enc,"n_dec":n_dec,"pad_id":pad_id,"vocab":vocab},
                "meta":{"direction":direction,"epoch":epoch,"noam_step":noam_step,"spm_model":str(SPM_MODEL)}},
               path); return path
def train_direction(direction="ncx2es", epochs=3, batch_size=32, grad_accum=1,
                    d_model=512, n_heads=8, d_ff=2048, n_enc=6, n_dec=6,
                    warmup=4000, save_prefix="scratch_ncx2es",
                    patience=3, dev_metric_samples=200, init_model=None, start_epoch=1):
    print(f"\n=== Entrenando dirección: {direction} ===")
    train_loader=make_loader(train_pairs,direction,batch_size=batch_size,shuffle=True)
    dev_loader=make_loader(dev_pairs,direction,batch_size=batch_size,shuffle=False)
    model=TransformerModel(VOCAB,d_model,n_heads,d_ff,n_enc,n_dec,DROPOUT,PAD_ID) if init_model is None else init_model
    model.to(DEVICE)
    opt=torch.optim.Adam(model.parameters(),betas=(0.9,0.98),eps=1e-9); noam=NoamWrapper(opt,d_model,warmup=warmup); crit=LabelSmoothingLoss(VOCAB,0.1,PAD_ID)
    log_path=ensure_logfile(direction); best_dev_chrf=-1e9; epochs_no_improve=0; best_path=None; global_step=0
    for ep in range(start_epoch, start_epoch+epochs):
        model.train(); total=0.0; n=0; opt.zero_grad()
        for i,(src,tgt) in enumerate(train_loader,1):
            src,tgt=src.to(DEVICE),tgt.to(DEVICE)
            logits=model(src,tgt[:,:-1])
            loss=crit(logits,tgt[:,1:])/grad_accum; loss.backward()
            if i % grad_accum == 0:
                # Clip once per optimizer step, on the fully accumulated gradients.
                torch.nn.utils.clip_grad_norm_(model.parameters(),1.0); noam.step(); noam.zero_grad()
            total+=loss.item()*grad_accum; n+=1; global_step+=1
            if i % 100 == 0: print(f"ep{ep} step{i} lr={noam.lr:.6f} loss={total/max(n,1):.4f}")
        dev_loss=evaluate(model,dev_loader,DEVICE); dev_bleu,dev_chrf=compute_dev_metrics(model,direction=direction,max_samples=dev_metric_samples)
        print(f"[Ep {ep}] dev_loss={dev_loss:.4f} | BLEU={dev_bleu:.2f} | chrF++={dev_chrf:.2f}")
        with open(log_path,"a",encoding="utf-8",newline="") as w:
            csv.writer(w).writerow([datetime.utcnow().isoformat(),direction,ep,global_step,f"{noam.lr:.8f}",f"{total/max(n,1):.6f}",f"{dev_loss:.6f}",f"{dev_bleu:.4f}",f"{dev_chrf:.4f}","yes" if dev_chrf>best_dev_chrf else "no"])
        if dev_chrf>best_dev_chrf:
            best_dev_chrf=dev_chrf; best_path=save_checkpoint(model,save_prefix,d_model,n_heads,d_ff,n_enc,n_dec,PAD_ID,VOCAB,direction,ep,noam.step_num)
            print("Mejora en chrF++; guardado mejor modelo en", best_path); epochs_no_improve=0
        else:
            epochs_no_improve+=1; print(f"Sin mejora de chrF++ por {epochs_no_improve}/{patience} épocas")
            if epochs_no_improve>=patience: print("Early stopping activado (paciencia agotada)."); break
    return best_path, best_dev_chrf
def load_for_resume(ckpt_path):
    data=torch.load(ckpt_path,map_location="cpu"); cfg=data["cfg"]; meta=data.get("meta",{})
    model=TransformerModel(cfg["vocab"],cfg["d_model"],cfg["n_heads"],cfg["d_ff"],cfg["n_enc"],cfg["n_dec"],pad_id=cfg["pad_id"])
    model.load_state_dict(data["model"]); model.to(DEVICE); return model,cfg,meta
def resume_direction(ckpt_path,more_epochs=2,batch_size=32,grad_accum=1,warmup=4000,patience=3,dev_metric_samples=200,save_prefix=None):
    import pathlib
    model,cfg,meta=load_for_resume(ckpt_path); direction=meta.get("direction","ncx2es")
    if save_prefix is None: save_prefix=pathlib.Path(ckpt_path).stem + "_cont"
    print(f"Reanudando {direction} desde {ckpt_path}")
    return train_direction(direction=direction,epochs=more_epochs,batch_size=batch_size,grad_accum=grad_accum,
                           d_model=cfg["d_model"],n_heads=cfg["n_heads"],d_ff=cfg["d_ff"],n_enc=cfg["n_enc"],n_dec=cfg["n_dec"],
                           warmup=warmup,save_prefix=save_prefix,patience=patience,dev_metric_samples=dev_metric_samples,
                           init_model=model,start_epoch=meta.get("epoch",1)+1)


In [None]:
# 11) Train both directions (adjust epochs/batch)
EPOCHS=2; BATCH=32; ACCUM=1; WARMUP=4000
best_ncx2es,_ = train_direction("ncx2es", epochs=EPOCHS, batch_size=BATCH, grad_accum=ACCUM, warmup=WARMUP, save_prefix="scratch_ncx2es", **CFG_SMALL)
best_es2ncx,_ = train_direction("es2ncx", epochs=EPOCHS, batch_size=BATCH, grad_accum=ACCUM, warmup=WARMUP, save_prefix="scratch_es2ncx", **CFG_SMALL)
print("Mejores:", best_ncx2es, best_es2ncx)


In [None]:
# 12) Evaluation
from pathlib import Path
def load_model(path):
    data=torch.load(path,map_location="cpu"); cfg=data["cfg"]
    model=TransformerModel(cfg["vocab"],cfg["d_model"],cfg["n_heads"],cfg["d_ff"],cfg["n_enc"],cfg["n_dec"],pad_id=cfg["pad_id"])
    model.load_state_dict(data["model"]); model.to(DEVICE); model.eval(); return model
def eval_direction(best_path, direction="ncx2es", max_samples=200):
    if best_path is None or not Path(best_path).exists(): print("Checkpoint no encontrado:", best_path); return
    model=load_model(best_path); ds=ParallelDataset(test_pairs, direction=direction)
    refs=[]; hyps=[]
    for i in range(min(len(ds),max_samples)):
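        # Use the direction-aware decoder: translate_greedy always forces <lang_es> as the target tag.
        src_ids,tgt_ids=ds[i]; out_ids=translate_greedy_dir(model, src_ids, direction=direction, max_len=MAX_LEN)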
        refs.append([ids_to_text(tgt_ids)]); hyps.append(ids_to_text(out_ids))
    try:
        from sacrebleu.metrics import BLEU, CHRF
        print(direction,"BLEU:", BLEU(force=True).corpus_score(hyps, list(zip(*refs))))
        print(direction,"chrF++:", CHRF(word_order=2).corpus_score(hyps, list(zip(*refs))))
    except Exception as e: print("sacrebleu no disponible:", e)
eval_direction(best_ncx2es,"ncx2es"); eval_direction(best_es2ncx,"es2ncx")


In [None]:
# 13) Gradio UI
import gradio as gr, torch
def translate_beam(model, src_ids, beam=5, lp=0.7, max_len=MAX_LEN, tgt_lang_id=LES_ID):
    model.eval(); device=model.emb.weight.device
    src=torch.tensor([src_ids],dtype=torch.long,device=device); mem,src_pad=model.encode(src)
    beams=[([BOS_ID,tgt_lang_id],0.0)]; finished=[]
    for _ in range(max_len):
        new=[]
        for seq,sc in beams:
            if seq[-1]==EOS_ID: finished.append((seq,sc)); continue
            ys=torch.tensor([seq],dtype=torch.long,device=device)
            logits=model.decode(ys,mem,src_pad)[:,-1,:].squeeze(0); logp=torch.log_softmax(logits,dim=-1).detach().cpu()
            topk=torch.topk(logp,beam).indices.tolist()
            for tok in topk: new.append((seq+[tok], sc+logp[tok].item()))
        beams=sorted(new,key=lambda x: x[1]/((len(x[0])**lp)), reverse=True)[:beam]
        if not beams: break
    if not finished: finished=beams
    best=max(finished,key=lambda x: x[1]/((len(x[0])**lp))); return best[0]
BEST_NCX2ES=best_ncx2es if 'best_ncx2es' in globals() else None
BEST_ES2NCX=best_es2ncx if 'best_es2ncx' in globals() else None
def _load_model_(path):
    data=torch.load(path,map_location="cpu"); cfg=data["cfg"]
    m=TransformerModel(cfg["vocab"],cfg["d_model"],cfg["n_heads"],cfg["d_ff"],cfg["n_enc"],cfg["n_dec"],pad_id=cfg["pad_id"]); m.load_state_dict(data["model"]); m.eval(); return m
def load_scratch(direction):
    path = BEST_NCX2ES if direction=="ncx2es" else BEST_ES2NCX
    from pathlib import Path
    if path is None or not Path(path).exists(): return None, f"Checkpoint no encontrado: {path}"
    return _load_model_(path), f"Cargado: {path}"
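_MODEL_CACHE = {}
def infer_scratch(text, direction="ncx2es", beam=5):
    if not text.strip(): return ""
    # Load each direction's checkpoint once and reuse it across clicks.
    if direction not in _MODEL_CACHE:
        model, msg = load_scratch(direction)
        if model is None: return msg  # show the "checkpoint not found" message instead of crashing
        _MODEL_CACHE[direction] = model
    model = _MODEL_CACHE[direction]
    lang_id = LES_ID if direction=="ncx2es" else LNCX_ID
    src_lang = LNCX_ID if direction=="ncx2es" else LES_ID
    src_ids = encode_with_lang(text.lower(), src_lang)
    out_ids = translate_beam(model, src_ids, beam=int(beam), tgt_lang_id=lang_id)  # the slider may deliver a float
    return ids_to_text(out_ids)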
with gr.Blocks() as demo:
    gr.Markdown("## Traductor (Transformer desde cero)")
    direction = gr.Radio(choices=["ncx2es","es2ncx"], value="ncx2es", label="Dirección")
    beam = gr.Slider(1,10, step=1, value=5, label="Beam size")
    inp = gr.Textbox(lines=3, label="Texto de entrada")
    out = gr.Textbox(lines=3, label="Traducción")
    btn = gr.Button("Traducir")
    btn.click(fn=infer_scratch, inputs=[inp, direction, beam], outputs=[out])
print("Para lanzar la UI: demo.launch()")


## 14) Optional baseline: BERT2BERT (mBERT) with HuggingFace (partial freezing)

In [None]:
# 14.1) Dataset HuggingFace
from datasets import Dataset as HFDataset
from transformers import BertTokenizerFast
HF_DIR = BASE_DIR / "hf_bert2bert"; HF_DIR.mkdir(parents=True, exist_ok=True)
tok_hf = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
def build_hf_split(pairs, direction="ncx2es"):
    srcs, tgts = [], []
    for s, t, *_ in pairs:
        if direction=="ncx2es": srcs.append(s); tgts.append(t)
        else: srcs.append(t); tgts.append(s)
    return HFDataset.from_dict({"src":srcs, "tgt":tgts})
hf_train = build_hf_split(train_pairs, "ncx2es"); hf_dev = build_hf_split(dev_pairs, "ncx2es")
def tok_map(batch):
    model_inputs = tok_hf(batch["src"], truncation=True, max_length=MAX_LEN)
    # `text_target=` replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tok_hf(text_target=batch["tgt"], truncation=True, max_length=MAX_LEN)
    model_inputs["labels"] = labels["input_ids"]; return model_inputs
hf_train_tok = hf_train.map(tok_map, batched=True, remove_columns=["src","tgt"])
hf_dev_tok   = hf_dev.map(tok_map,   batched=True, remove_columns=["src","tgt"])
print(hf_train_tok)


In [None]:
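# 14.2) BERT2BERT training (optional)
from transformers import EncoderDecoderModel, TrainingArguments, Trainer, DataCollatorForSeq2Seq
enc_dec = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-multilingual-cased", "bert-base-multilingual-cased")
# A model assembled from two BERT checkpoints needs these IDs set explicitly before training or generation.
enc_dec.config.decoder_start_token_id = tok_hf.cls_token_id
enc_dec.config.pad_token_id = tok_hf.pad_token_id
enc_dec.config.eos_token_id = tok_hf.sep_token_id
# Partial freezing: encoder embeddings, first encoder layer, decoder embeddings.
# The decoder is a BertLMHeadModel, so its embeddings live under "decoder.bert.embeddings".
for name, param in enc_dec.named_parameters():
    if "encoder.embeddings" in name or "encoder.encoder.layer.0." in name or "decoder.bert.embeddings" in name:
        param.requires_grad = False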
data_collator = DataCollatorForSeq2Seq(tokenizer=tok_hf, model=enc_dec)
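# Note: recent transformers releases rename `evaluation_strategy` to `eval_strategy`; adjust if your version rejects it.
args = TrainingArguments(output_dir=str(HF_DIR), per_device_train_batch_size=4, per_device_eval_batch_size=4,
                         evaluation_strategy="epoch", learning_rate=5e-5, num_train_epochs=1,
                         save_total_limit=1, logging_steps=50, report_to="none")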
trainer = Trainer(model=enc_dec, args=args, data_collator=data_collator, tokenizer=tok_hf,
                  train_dataset=hf_train_tok, eval_dataset=hf_dev_tok)
# trainer.train()


In [None]:
# 14.3) Inference with mBERT (optional)
def infer_hf(text, max_new_tokens=64):
    if not text.strip(): return ""
    enc_dec.eval()
    # Move the inputs to the model's device (it may no longer be the CPU after Trainer.train()).
    inputs = tok_hf(text, return_tensors="pt").to(enc_dec.device)
    outputs = enc_dec.generate(**inputs, max_new_tokens=max_new_tokens)
    return tok_hf.decode(outputs[0], skip_special_tokens=True)


In [None]:
# 15) Resume training from a checkpoint (example)
# ckpt = CHECK_DIR / "scratch_ncx2es_best.pt"
# best_path, best_chrf = resume_direction(str(ckpt), more_epochs=2, patience=3)
# print("Nuevo mejor checkpoint:", best_path, " | chrF++:", best_chrf)
