# Fine-tuning Qwen2.5-0.5B with Unsloth

This notebook fine-tunes the Qwen2.5-0.5B-Instruct model to generate medical epicrises (discharge summaries) using **Unsloth**.

**Advantages of Unsloth** (figures as advertised for Unsloth; actual gains depend on the model, hardware, and settings):
- 2-5x faster than conventional fine-tuning
- 80% less memory use
- Runs on Colab's free T4 GPU

**Dataset:**
- ~1200 epicrisis examples in ChatML format
- 90% train / 10% validation
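
To make the dataset description concrete, the sketch below builds records in the ChatML layout that Qwen2.5 chat models use and performs a 90/10 split. The helper names (`to_chatml`, `split_examples`), the toy records, and the fixed shuffle seed are assumptions for illustration; the actual preprocessing script is not part of this notebook. Each record carries a single `text` field, which is the field the format check later in the notebook reads.

```python
import json
import random

def to_chatml(system, user, assistant):
    """Render one conversation as a ChatML string."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>"
    )

def split_examples(examples, valid_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and split off a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

# Toy records standing in for the real clinical data (hypothetical content)
records = [
    {
        "text": to_chatml(
            "Genera una epicrisis.",
            json.dumps({"dx": [f"caso {i}"]}, ensure_ascii=False),
            f"Epicrisis {i}.",
        )
    }
    for i in range(20)
]

train, valid = split_examples(records)
print(f"train={len(train)} validation={len(valid)}")
print(train[0]["text"])
```

Writing each record with `json.dumps(record, ensure_ascii=False)` on its own line would produce the `train.jsonl` and `validation.jsonl` files that the notebook expects.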

## 1. Install Unsloth

In [None]:
# Check the GPU
!nvidia-smi

In [None]:
%%capture
# Install Unsloth from PyPI first so that pip resolves its dependencies
!pip install unsloth
# Then replace it with the latest version from GitHub, leaving the dependencies untouched (--no-deps)
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [None]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Configuration

In [None]:
import os
import json
from pathlib import Path

# Configuration
MODEL_NAME = "unsloth/Qwen2.5-0.5B-Instruct"
OUTPUT_DIR = "./epicrisis-unsloth"
DATASET_DIR = "./datasets"

# Hyperparameters
EPOCHS = 3
BATCH_SIZE = 4  # Unsloth's lower memory use allows a larger batch size
GRADIENT_ACCUMULATION = 2
LEARNING_RATE = 2e-4  # Unsloth recommends a higher learning rate
MAX_SEQ_LENGTH = 1024

# LoRA
LORA_RANK = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0

# System instruction
SYSTEM_INSTRUCTION = (
    "Genera una epicrisis narrativa en UN SOLO PARRAFO. "
    "USA SOLO la informacion del JSON, NO inventes datos. "
    "IMPORTANTE: Incluye TODOS los codigos entre parentesis: "
    "diagnostico de ingreso con codigo CIE-10 (ej: I20.0), "
    "procedimientos con codigo K (ej: K492, K493), "
    "medicacion con dosis y codigo ATC (ej: B01AC06). "
    "Estructura: dx ingreso -> procedimientos -> evolucion -> dx alta -> medicacion alta. "
    "Abreviaturas: DA=descendente anterior, CD=coronaria derecha, CX=circunfleja, "
    "SDST=supradesnivel ST, IAM=infarto agudo miocardio."
)

print("="*60)
print("UNSLOTH CONFIGURATION")
print("="*60)
print(f"  Model: {MODEL_NAME}")
print(f"  Epochs: {EPOCHS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Gradient accumulation: {GRADIENT_ACCUMULATION}")
print(f"  Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Max seq length: {MAX_SEQ_LENGTH}")
print(f"  LoRA rank: {LORA_RANK}")
print("="*60)

## 3. Load the model with Unsloth

In [None]:
from unsloth import FastLanguageModel

# Load the model with Unsloth; load_in_4bit=True loads the weights 4-bit quantized
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)

print(f"Model loaded: {MODEL_NAME}")
print(f"Vocab size: {tokenizer.vocab_size}")

In [None]:
# Apply LoRA with Unsloth
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized gradient checkpointing
    random_state=42,
    max_seq_length=MAX_SEQ_LENGTH,
)

print("LoRA applied successfully")
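
The size of the adapter follows from the LoRA construction: each adapted linear layer with shape d_in x d_out gains two low-rank matrices holding r * (d_in + d_out) parameters in total, and the update is scaled by `lora_alpha / r`. The sketch below applies that formula to the seven target modules. The layer dimensions (hidden size 896, MLP size 4864, 2 key/value heads of dimension 64, 24 layers) are assumptions based on the published Qwen2.5-0.5B configuration; check them against `model.config` before relying on the numbers.

```python
LORA_RANK = 16
LORA_ALPHA = 16

# Assumed Qwen2.5-0.5B dimensions (verify with model.config in the notebook)
HIDDEN = 896            # hidden_size
INTERMEDIATE = 4864     # intermediate_size of the MLP
KV_DIM = 2 * 64         # num_key_value_heads * head_dim
NUM_LAYERS = 24         # num_hidden_layers

# (d_in, d_out) of every linear layer listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, rank):
    """Parameters of one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, LORA_RANK) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_RANK

print(f"Trainable LoRA parameters per layer: {per_layer:,}")
print(f"Trainable LoRA parameters in total:  {total:,}")
print(f"LoRA scaling factor (alpha / r):     {scaling}")
```

Under these assumptions the adapter has roughly 8.8 million trainable parameters, and because alpha equals the rank the scaling factor is 1.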

## 4. Upload and prepare the dataset

In [None]:
from google.colab import files

os.makedirs(DATASET_DIR, exist_ok=True)

print("="*60)
print("UPLOAD THE UNIFIED DATASET FILES:")
print("  - unified_data/train.jsonl")
print("  - unified_data/validation.jsonl")
print("="*60)

uploaded = files.upload()

# files.upload() returns {filename: bytes}; store each file under DATASET_DIR
for filename, content in uploaded.items():
    filepath = f"{DATASET_DIR}/{filename}"
    with open(filepath, "wb") as f:
        f.write(content)
    # One non-blank JSONL line is one example
    with open(filepath, "r", encoding="utf-8") as f:
        lines = sum(1 for line in f if line.strip())
    print(f"Saved: {filepath} ({lines} examples)")

In [None]:
from datasets import Dataset, DatasetDict

def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_datasets(dataset_dir):
    """Load the unified train and validation datasets."""
    train_examples = read_jsonl(Path(dataset_dir) / "train.jsonl")
    print(f"Train: {len(train_examples)} examples")

    valid_examples = read_jsonl(Path(dataset_dir) / "validation.jsonl")
    print(f"Validation: {len(valid_examples)} examples")

    return DatasetDict({
        "train": Dataset.from_list(train_examples),
        "validation": Dataset.from_list(valid_examples),
    })

dataset = load_datasets(DATASET_DIR)
print(f"\nDataset: {dataset}")

In [None]:
# Check the dataset format
print("="*60)
print("CHECKING FORMAT")
print("="*60)

example = dataset["train"][0]["text"]
print("\nFirst example (first 1000 chars):")
print("-"*60)
print(example[:1000])
print("...")

# Check the ChatML structure
print("\nChecks:")
print(f"  - '<|im_start|>system': {'<|im_start|>system' in example}")
print(f"  - '<|im_start|>user': {'<|im_start|>user' in example}")
print(f"  - '<|im_start|>assistant': {'<|im_start|>assistant' in example}")
print(f"  - Ends with '<|im_end|>': {example.strip().endswith('<|im_end|>')}")
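
The cell above inspects only the first example. A minimal sketch for running the same checks over every record is shown below; the function name `chatml_problems` and the sample strings are hypothetical, and the checks mirror the four conditions printed above (plus a balance check on turn delimiters) rather than a full ChatML parser.

```python
REQUIRED_MARKERS = (
    "<|im_start|>system",
    "<|im_start|>user",
    "<|im_start|>assistant",
)

def chatml_problems(text):
    """Return a list of structural problems found in one ChatML string."""
    problems = [f"missing {marker!r}" for marker in REQUIRED_MARKERS if marker not in text]
    if not text.strip().endswith("<|im_end|>"):
        problems.append("does not end with '<|im_end|>'")
    # Every opened turn should also be closed
    if text.count("<|im_start|>") != text.count("<|im_end|>"):
        problems.append("unbalanced <|im_start|> / <|im_end|>")
    return problems

# Hypothetical samples; in the notebook this could be dataset["train"]["text"]
good = (
    "<|im_start|>system\nS<|im_end|>\n"
    "<|im_start|>user\nU<|im_end|>\n"
    "<|im_start|>assistant\nA<|im_end|>"
)
bad = (
    "<|im_start|>system\nS<|im_end|>\n"
    "<|im_start|>user\nU<|im_end|>\n"
    "<|im_start|>assistant\nA"
)
samples = [good, bad]

report = {i: chatml_problems(text) for i, text in enumerate(samples)}
invalid = {i: problems for i, problems in report.items() if problems}

print(f"{len(invalid)} of {len(samples)} examples have problems")
for i, problems in invalid.items():
    print(f"  example {i}: " + "; ".join(problems))
```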

## 5. Configure training

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Configure the trainer with Unsloth
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION,
        learning_rate=LEARNING_RATE,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        weight_decay=0.01,
        logging_steps=10,
        save_steps=100,
        eval_strategy="steps",
        eval_steps=100,
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="adamw_8bit",
        seed=42,
        report_to="none",
    ),
)

print("Trainer configured")
print(f"  - bf16: {is_bfloat16_supported()}")
print(f"  - Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
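
The arguments above determine how many optimizer steps the run takes and how the learning rate evolves. The sketch below works this out in plain Python, assuming 1080 training examples (90% of the approximate 1200) and a single GPU. `lr_at` is a hypothetical helper that reproduces the usual linear warmup followed by a half-cosine decay; treat it as an approximation of the scheduler's shape, not a copy of the library code.

```python
import math

N_TRAIN = 1080              # assumption: 90% of ~1200 examples
EPOCHS = 3
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 2
LEARNING_RATE = 2e-4
WARMUP_RATIO = 0.1
EVAL_STEPS = 100

effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION
steps_per_epoch = math.ceil(N_TRAIN / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
eval_points = list(range(EVAL_STEPS, total_steps + 1, EVAL_STEPS))

def lr_at(step):
    """Learning rate after `step` optimizer steps (linear warmup, then cosine decay)."""
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Total steps:          {total_steps}")
print(f"Warmup steps:         {warmup_steps}")
print(f"Evaluations at steps: {eval_points}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

With these assumptions the run takes 135 steps per epoch and 405 in total, so `eval_steps=100` triggers evaluation at steps 100, 200, 300, and 400.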

## 6. Train

In [None]:
# Memory statistics before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU: {gpu_stats.name}")
print(f"Reserved GPU memory: {start_gpu_memory} GB")
print(f"Total GPU memory: {max_memory} GB")

In [None]:
print("="*60)
print("STARTING TRAINING")
print("="*60)

trainer_stats = trainer.train()

print("\n" + "="*60)
print("TRAINING COMPLETE")
print("="*60)

In [None]:
# Memory statistics after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print("\nMemory statistics:")
print(f"  - Peak reserved memory: {used_memory} GB ({used_percentage}%)")
print(f"  - Peak reserved memory for LoRA training: {used_memory_for_lora} GB ({lora_percentage}%)")
print(f"  - Total time: {trainer_stats.metrics['train_runtime']:.2f} seconds")

## 7. Save the LoRA adapters

In [None]:
# Save the LoRA adapters
LORA_OUTPUT = f"{OUTPUT_DIR}/lora"
model.save_pretrained(LORA_OUTPUT)
tokenizer.save_pretrained(LORA_OUTPUT)

print(f"LoRA adapters saved to: {LORA_OUTPUT}")

## 8. Test the model

In [None]:
# Test example
test_input = {
    "dx": ["Angina inestable (I20.0)"],
    "proc": ["Coronariografia (K492)", "Angioplastia DA (K493)"],
    "tto": [
        "Aspirina 300mg carga (B01AC06)",
        "Enoxaparina 60mg SC c/12h (B01AB05)",
    ],
    "evo": "SDST V1-V4. Oclusion DA proximal. Angioplastia exitosa con stent.",
    "dx_alta": ["IAM pared anterior (I21.0)"],
    "med": [
        "Aspirina 100mg VO c/24h (B01AC06)",
        "Clopidogrel 75mg VO c/24h 12m (B01AC04)",
        "Atorvastatina 80mg VO c/noche (C10AA05)",
        "Bisoprolol 2.5mg VO c/24h (C07AB07)",
    ],
}

print("Test input:")
print(json.dumps(test_input, indent=2, ensure_ascii=False))

In [None]:
# Enable Unsloth's inference mode (advertised as about 2x faster)
FastLanguageModel.for_inference(model)

# Build the prompt
json_str = json.dumps(test_input, ensure_ascii=False, indent=2)
messages = [
    {"role": "system", "content": SYSTEM_INSTRUCTION},
    {"role": "user", "content": json_str},
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print("Generated prompt:")
print("-"*60)
print(prompt[:500])
print("...")

In [None]:
# Generate the response
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.15,
    use_cache=True,
)

# Decode only the newly generated tokens (skip the prompt tokens)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)

print("\n" + "="*60)
print("GENERATED EPICRISIS:")
print("="*60)
print(response)
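
Because the system instruction requires every code given in parentheses to appear in the output, a quick automated check can complement reading the generated text. The sketch below extracts the parenthesised codes from the input JSON and reports any that are absent from a response. The function names, the regular expression, and the sample response are assumptions for illustration; in the notebook, `test_input` and `response` from the cells above would be passed in instead.

```python
import re

# A code is an uppercase letter followed by uppercase letters, digits, or dots,
# wrapped in parentheses, e.g. (I20.0), (K492), (B01AC06)
CODE_PATTERN = re.compile(r"\(([A-Z][A-Z0-9.]+)\)")

def extract_codes(value):
    """Recursively collect parenthesised codes from strings, lists, and dicts."""
    if isinstance(value, str):
        return set(CODE_PATTERN.findall(value))
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, (list, tuple)):
        codes = set()
        for item in value:
            codes |= extract_codes(item)
        return codes
    return set()

def missing_codes(input_data, generated_text):
    """Return the input codes that do not occur anywhere in the generated text."""
    return sorted(code for code in extract_codes(input_data) if code not in generated_text)

# Hypothetical input and response, shortened from the test example
sample_input = {
    "dx": ["Angina inestable (I20.0)"],
    "proc": ["Coronariografia (K492)", "Angioplastia DA (K493)"],
    "med": ["Aspirina 100mg VO c/24h (B01AC06)"],
}
sample_response = (
    "Paciente ingresa por angina inestable (I20.0). Se realiza coronariografia (K492). "
    "Alta con aspirina 100 mg VO cada 24 h (B01AC06)."
)

missing = missing_codes(sample_input, sample_response)
print(f"Missing codes: {missing}")
```

The pattern would also pick up a parenthesised uppercase abbreviation in the input, so it is a coarse filter; a stricter check would validate each code family (CIE-10, K codes, ATC) separately.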

## 9. Merge and save the full model

In [None]:
# Save the merged model in float16 format
MERGED_OUTPUT = f"{OUTPUT_DIR}/merged-f16"

model.save_pretrained_merged(
    MERGED_OUTPUT,
    tokenizer,
    save_method="merged_16bit",
)

print(f"Merged model saved to: {MERGED_OUTPUT}")

## 13. Save to Google Drive

Run this section last, after sections 10 and 11: it copies the GGUF and ONNX exports that those sections produce.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create the output directory in Google Drive
DRIVE_OUTPUT = "/content/drive/MyDrive/fine-tuning"
os.makedirs(DRIVE_OUTPUT, exist_ok=True)

print("Google Drive mounted")
print(f"Output directory: {DRIVE_OUTPUT}")

In [None]:
import shutil

# Copy every exported artifact to Google Drive
print("="*60)
print("SAVING MODELS TO GOOGLE DRIVE")
print("="*60)

# Source paths are rebuilt from OUTPUT_DIR so that this cell does not raise a
# NameError when GGUF_OUTPUT / ONNX_OUTPUT (sections 10 and 11) are not defined yet.
# Each entry: (folder name, label, hint shown when the folder is missing)
artifacts = [
    ("merged-f16", "Merged float16 model", None),
    ("gguf", "GGUF model", "run the GGUF export cell first"),
    ("onnx", "ONNX model", "run the ONNX export cell first"),
    ("lora", "LoRA adapters", None),
]

for i, (folder, label, hint) in enumerate(artifacts, start=1):
    src = f"{OUTPUT_DIR}/{folder}"
    dst = f"{DRIVE_OUTPUT}/{folder}"
    if os.path.exists(src):
        print(f"\n{i}. Copying {label} to {dst}...")
        # Replace any previous copy so that stale files do not linger
        if os.path.exists(dst):
            shutil.rmtree(dst)
        shutil.copytree(src, dst)
        print(f"   ✓ {label} saved")
    else:
        suffix = f" ({hint})" if hint else ""
        print(f"\n{i}. ⚠️ {src} not found{suffix}")

print("\n" + "="*60)
print("SUMMARY - Files in Google Drive:")
print("="*60)
print(f"  📁 {DRIVE_OUTPUT}/")

for folder in ["merged-f16", "gguf", "onnx", "lora"]:
    folder_path = f"{DRIVE_OUTPUT}/{folder}"
    if os.path.exists(folder_path):
        # Compute the total size of the folder
        total_size = 0
        for dirpath, dirnames, filenames in os.walk(folder_path):
            for f in filenames:
                fp = os.path.join(dirpath, f)
                total_size += os.path.getsize(fp)
        size_mb = total_size / (1024 * 1024)
        print(f"     ├── {folder}/ ({size_mb:.1f} MB)")
    else:
        print(f"     ├── {folder}/ (does not exist)")

print("="*60)

## 10. Export to GGUF (optional, for llama.cpp)

In [None]:
# Export to GGUF with Q4_K_M quantization
GGUF_OUTPUT = f"{OUTPUT_DIR}/gguf"

model.save_pretrained_gguf(
    GGUF_OUTPUT,
    tokenizer,
    quantization_method="q4_k_m",
)

print(f"GGUF model saved to: {GGUF_OUTPUT}")
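
A quick way to confirm that the export produced a GGUF file is to read its header: GGUF files start with the four ASCII bytes `GGUF`, followed by a little-endian 32-bit format version. The sketch below checks exactly that. `read_gguf_header` is a hypothetical helper, and the demo writes a tiny stand-in file (not a real model) so the block runs on its own; in the notebook the function would instead be pointed at the `.gguf` file written by the export, whose exact name and location may vary with the Unsloth version.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"

def read_gguf_header(path):
    """Return the GGUF format version, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version

# Demo on a stand-in file that contains only a header (hypothetical, not a model)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.gguf"
    demo_path.write_bytes(GGUF_MAGIC + struct.pack("<I", 3))
    demo_version = read_gguf_header(demo_path)
    print(f"GGUF version: {demo_version}")
```

For the real export, something like `for p in Path(GGUF_OUTPUT).rglob("*.gguf"): print(p, read_gguf_header(p))` would list each file with its version.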

## 11. Export to ONNX (for the browser)

In [None]:
# Install optimum for the ONNX export (quoted so the shell does not treat the brackets as a glob)
!pip install -q "optimum[exporters]" onnx onnxruntime

In [None]:
from optimum.onnxruntime import ORTModelForCausalLM

ONNX_OUTPUT = f"{OUTPUT_DIR}/onnx"

# Export to ONNX from the merged model
ort_model = ORTModelForCausalLM.from_pretrained(
    MERGED_OUTPUT,
    export=True,
    provider="CPUExecutionProvider",
)

ort_model.save_pretrained(ONNX_OUTPUT)
tokenizer.save_pretrained(ONNX_OUTPUT)

print(f"ONNX model saved to: {ONNX_OUTPUT}")

## 12. Download the models

In [None]:
# Compress and download
!zip -r epicrisis-lora.zip {LORA_OUTPUT}
!zip -r epicrisis-merged.zip {MERGED_OUTPUT}

from google.colab import files

print("Downloading LoRA adapters...")
files.download("epicrisis-lora.zip")

In [None]:
# Download the merged model
print("Downloading merged model...")
files.download("epicrisis-merged.zip")

In [None]:
# Download the GGUF export if it exists
if os.path.exists(GGUF_OUTPUT):
    !zip -r epicrisis-gguf.zip {GGUF_OUTPUT}
    print("Downloading GGUF model...")
    files.download("epicrisis-gguf.zip")

In [None]:
# Download the ONNX export if it exists
if os.path.exists(ONNX_OUTPUT):
    !zip -r epicrisis-onnx.zip {ONNX_OUTPUT}
    print("Downloading ONNX model...")
    files.download("epicrisis-onnx.zip")

## Summary

This notebook covers:

1. **Fine-tuning with Unsloth** - advertised as 2-5x faster than conventional methods
2. **Optimized LoRA** - using Unsloth's gradient checkpointing
3. **Multiple export formats**:
   - LoRA adapters (to load on top of the base model)
   - Merged float16 model
   - GGUF Q4_K_M (for llama.cpp)
   - ONNX (for the browser)

### Generated files:
- `epicrisis-unsloth/lora/` - LoRA adapters
- `epicrisis-unsloth/merged-f16/` - Merged model
- `epicrisis-unsloth/gguf/` - GGUF model
- `epicrisis-unsloth/onnx/` - ONNX model