# Lab: T5-base vs FLAN-T5 Comparison - COMPLETE ANSWER KEY

**Objective**: Compare task-specific fine-tuning (T5-base) with zero-shot use of an instruction-tuned model (FLAN-T5) on 3 classification datasets

**Estimated time**: 1.5 hours

---

## Lab structure:

1. **Setup and data loading** (Complete)
2. **Preprocessing** (Complete solutions)
3. **Fine-tuning T5-base** (Complete)
4. **FLAN-T5 zero-shot** (Prompt solutions)
5. **Comparison and analysis** (With answers)

---

## Datasets we will use:

- **SST-2**: Movie review sentiment (positive/negative)
- **Amazon Polarity**: Product reviews (positive/negative)
- **AG News**: News classification (4 categories)

## SECTION 1: Setup and Data Loading

This section is complete. Just run the cells.

In [34]:
# Install dependencies
!pip install transformers datasets evaluate accelerate -q
!pip install torch -q
!pip install scikit-learn -q

In [35]:
# Required imports
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import numpy as np
import evaluate
from datetime import datetime

print("✓ Dependencies loaded")

✓ Dependencies loaded


In [36]:
# Load datasets with fixed subsets: the first 5000 training examples of each dataset.
# Validation: the full 872-example SST-2 validation split, and the first 1000 test examples for the other two.
print("Loading datasets...\n")

# Dataset 1: SST-2 (movie review sentiment)
print("  - SST-2 (movie review sentiment)...")
dataset_sst2 = load_dataset("glue", "sst2")
dataset_sst2_train = dataset_sst2["train"].select(range(5000))
dataset_sst2_val = dataset_sst2["validation"].select(range(872))

# Dataset 2: Amazon Polarity (product reviews)
print("  - Amazon Polarity (product reviews)...")
dataset_amazon = load_dataset("amazon_polarity")
dataset_amazon_train = dataset_amazon["train"].select(range(5000))
dataset_amazon_val = dataset_amazon["test"].select(range(1000))

# Dataset 3: AG News (news classification into 4 categories)
print("  - AG News (news in 4 categories)...")
dataset_agnews = load_dataset("ag_news")
dataset_agnews_train = dataset_agnews["train"].select(range(5000))
dataset_agnews_val = dataset_agnews["test"].select(range(1000))

print("\n✓ Datasets loaded")

Loading datasets...

  - SST-2 (movie review sentiment)...
  - Amazon Polarity (product reviews)...
  - AG News (news in 4 categories)...

✓ Datasets loaded


In [37]:
# Explore the datasets
print("--- SST-2 (movie review sentiment) ---")
print(f"Train: {len(dataset_sst2_train)} examples")
print(f"Validation: {len(dataset_sst2_val)} examples")
print("Classes: 0=negative, 1=positive")
print(f"Example: {dataset_sst2_train[0]}")

print("\n--- Amazon Polarity (product reviews) ---")
print(f"Train: {len(dataset_amazon_train)} examples")
print(f"Validation: {len(dataset_amazon_val)} examples")
print("Classes: 0=negative, 1=positive")
print(f"Example: {dataset_amazon_train[0]}")

print("\n--- AG News (news) ---")
print(f"Train: {len(dataset_agnews_train)} examples")
print(f"Validation: {len(dataset_agnews_val)} examples")
print("Classes: 0=World, 1=Sports, 2=Business, 3=Sci/Tech")
print(f"Example: {dataset_agnews_train[0]}")

--- SST-2 (movie review sentiment) ---
Train: 5000 examples
Validation: 872 examples
Classes: 0=negative, 1=positive
Example: {'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0}

--- Amazon Polarity (product reviews) ---
Train: 5000 examples
Validation: 1000 examples
Classes: 0=negative, 1=positive
Example: {'label': 1, 'title': 'Stuning even for the non-gamer', 'content': 'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'}

--- AG News (news) ---
Train: 5000 examples
Validation: 1000 examples
Classes: 0=World, 1=Sports, 2=Business, 3=Sci/Tech
Example: {'text': "Wall St. Bears Claw Back Into the Black (Reut
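
Because `select(range(n))` keeps the first n rows rather than a random sample, it can be worth checking how balanced each subset is before reading accuracy and macro F1. A minimal sketch, assuming each example is a dict with an integer `label` key as in the outputs above; the `label_distribution` helper and the toy rows are hypothetical and not part of the lab:

```python
from collections import Counter

def label_distribution(examples, label_names):
    """Return {label_name: fraction} for an iterable of dicts with a 'label' key."""
    counts = Counter(example["label"] for example in examples)
    total = sum(counts.values())
    return {label_names[label]: counts[label] / total for label in sorted(counts)}

# Toy rows standing in for a dataset subset (hypothetical data)
toy_subset = [{"label": 0}, {"label": 1}, {"label": 1}, {"label": 1}]
print(label_distribution(toy_subset, {0: "negative", 1: "positive"}))
# {'negative': 0.25, 'positive': 0.75}
```

The same call should also work on `dataset_sst2_train` and the other subsets, since iterating over them yields one dict per example.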

In [38]:
# Load tokenizer
model_checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
print("✓ T5-base tokenizer loaded")

✓ T5-base tokenizer loaded


## SECTION 2: Data Preprocessing - SOLUTIONS

Here are the complete solutions for the 3 datasets.
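
T5 treats every task as text in, text out, so a classification example becomes a prefixed input string with a label word as its target. A minimal sketch of that conversion, leaving the tokenizer out; `to_text_to_text` is a hypothetical helper, and the prefix and label words are the ones the SST-2 solution uses:

```python
def to_text_to_text(sentence, label, prefix="sst2 sentence: "):
    """Turn a (sentence, integer label) pair into (input text, target text)."""
    label_map = {0: "negative", 1: "positive"}
    return prefix + sentence, label_map[label]

# First SST-2 training example, as printed in Section 1
source, target = to_text_to_text("hide new secretions from the parental units ", 0)
print(repr(source))  # 'sst2 sentence: hide new secretions from the parental units '
print(repr(target))  # 'negative'
```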

### Solution: SST-2 (Reference example)

In [39]:
def preprocess_function_sst2(examples):
    """
    Preprocesses the SST-2 dataset for T5.
    Converts the classification task to text-to-text format.
    """
    prefix = "sst2 sentence: "
    label_map = {0: "negative", 1: "positive"}

    inputs = [prefix + doc for doc in examples["sentence"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding=False)

    labels_text = [label_map[label] for label in examples["label"]]
    labels = tokenizer(text_target=labels_text, max_length=2, truncation=True, padding=False)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
tokenized_sst2_train = dataset_sst2_train.map(
    preprocess_function_sst2,
    batched=True,
    remove_columns=dataset_sst2_train.column_names
)
tokenized_sst2_val = dataset_sst2_val.map(
    preprocess_function_sst2,
    batched=True,
    remove_columns=dataset_sst2_val.column_names
)

print("✓ SST-2 preprocessed")

✓ SST-2 preprocessed


### Solution: Amazon Polarity

**Changes from SST-2:**
- Field: `'title'` + `'content'` instead of `'sentence'`
- Prefix: `'amazon review: '` instead of `'sst2 sentence: '`
- Labels: the same (0=negative, 1=positive)

In [40]:
def preprocess_function_amazon(examples):
    """
    Preprocesses Amazon Polarity for T5, following the SST-2 example.
    """
    prefix = "amazon review: "
    label_map = {0: "negative", 1: "positive"}

    # Concatenate title + content and add the prefix
    inputs = [prefix + title + " " + content
              for title, content in zip(examples["title"], examples["content"])]

    # Tokenize the inputs
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding=False)

    # Convert labels to text
    labels_text = [label_map[label] for label in examples["label"]]

    # Tokenize the labels
    labels = tokenizer(text_target=labels_text, max_length=2, truncation=True, padding=False)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
tokenized_amazon_train = dataset_amazon_train.map(
    preprocess_function_amazon,
    batched=True,
    remove_columns=dataset_amazon_train.column_names
)
tokenized_amazon_val = dataset_amazon_val.map(
    preprocess_function_amazon,
    batched=True,
    remove_columns=dataset_amazon_val.column_names
)

print("✓ Amazon Polarity preprocessed")

✓ Amazon Polarity preprocessed


### Solution: AG News

**Changes from SST-2:**
- Field: `'text'` instead of `'sentence'`
- Prefix: `'ag news: '`
- Labels: 4 classes (0=World, 1=Sports, 2=Business, 3=Sci/Tech)
- max_length for labels: 3 (longer words)
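
Since the three preprocessing functions differ only in input fields, prefix, and label map, those differences can be written down as configuration. A sketch, assuming batches are dicts of lists as `map(batched=True)` provides them; `TASKS` and `build_pairs` are hypothetical names, and tokenization is left out so the example runs without downloading a model:

```python
# Per-task settings, taken from the "changes from SST-2" lists in this section
TASKS = {
    "sst2": {"prefix": "sst2 sentence: ", "fields": ["sentence"],
             "label_map": {0: "negative", 1: "positive"}},
    "amazon": {"prefix": "amazon review: ", "fields": ["title", "content"],
               "label_map": {0: "negative", 1: "positive"}},
    "agnews": {"prefix": "ag news: ", "fields": ["text"],
               "label_map": {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}},
}

def build_pairs(batch, task):
    """Build (inputs, targets) string lists for one batch of the given task."""
    cfg = TASKS[task]
    columns = [batch[field] for field in cfg["fields"]]
    # Join the configured fields with a space, then prepend the task prefix
    inputs = [cfg["prefix"] + " ".join(parts) for parts in zip(*columns)]
    targets = [cfg["label_map"][label] for label in batch["label"]]
    return inputs, targets

# Toy batch (hypothetical data)
batch = {"title": ["Great"], "content": ["Loved it."], "label": [1]}
inputs, targets = build_pairs(batch, "amazon")
print(inputs, targets)  # ['amazon review: Great Loved it.'] ['positive']
```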

In [41]:
def preprocess_function_agnews(examples):
    """
    Preprocesses AG News (4 classes) for T5.
    """
    prefix = "ag news: "

    # Label mapping (4 classes now)
    label_map = {
        0: "World",
        1: "Sports",
        2: "Business",
        3: "Sci/Tech"
    }

    # Add the prefix to the text
    inputs = [prefix + text for text in examples["text"]]

    # Tokenize the inputs
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding=False)

    # Convert labels to text
    labels_text = [label_map[label] for label in examples["label"]]

    # Tokenize the labels (max_length=3 because the words are longer).
    # Note: a label that needs more than max_length tokens (including EOS) is truncated.
    # Training and evaluation labels are truncated the same way, so the metrics stay consistent.
    labels = tokenizer(text_target=labels_text, max_length=3, truncation=True, padding=False)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
tokenized_agnews_train = dataset_agnews_train.map(
    preprocess_function_agnews,
    batched=True,
    remove_columns=dataset_agnews_train.column_names
)
tokenized_agnews_val = dataset_agnews_val.map(
    preprocess_function_agnews,
    batched=True,
    remove_columns=dataset_agnews_val.column_names
)

print("✓ AG News preprocessed")

✓ AG News preprocessed


## SECTION 3: Fine-tuning T5-base

This section is complete. The code trains T5-base on the 3 datasets automatically.

In [42]:
# Metrics
metric_accuracy = evaluate.load("accuracy")
metric_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    """
    Computes accuracy and F1 score for the predictions.
    Accuracy compares the decoded strings directly instead of using the evaluate library.
    """
    predictions, labels = eval_pred

    # Decode predictions - replace any -100 padding with pad_token_id first
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Decode labels - replace -100 with pad_token_id
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Normalize strings (strip and lowercase)
    decoded_preds_normalized = [pred.strip().lower() for pred in decoded_preds]
    decoded_labels_normalized = [label.strip().lower() for label in decoded_labels]

    # Compute accuracy manually by comparing strings
    correct = sum(p == l for p, l in zip(decoded_preds_normalized, decoded_labels_normalized))
    accuracy = correct / len(decoded_labels_normalized)

    # For F1: convert strings to numeric IDs
    unique_labels = sorted(list(set(decoded_labels_normalized)))
    label_to_id = {label: idx for idx, label in enumerate(unique_labels)}

    pred_ids = [label_to_id.get(pred, -1) for pred in decoded_preds_normalized]
    label_ids = [label_to_id.get(label, -1) for label in decoded_labels_normalized]

    # Filter out invalid predictions (those that match no known label).
    # F1 is therefore computed only over valid predictions, while accuracy counts invalid ones as errors.
    valid_indices = [i for i, pred_id in enumerate(pred_ids) if pred_id != -1]

    if len(valid_indices) > 0:
        pred_ids_valid = [pred_ids[i] for i in valid_indices]
        label_ids_valid = [label_ids[i] for i in valid_indices]

        f1 = metric_f1.compute(
            predictions=pred_ids_valid,
            references=label_ids_valid,
            average='macro'
        )
    else:
        f1 = {"f1": 0.0}

    return {
        "accuracy": accuracy,  # a plain float, not a dict
        "f1": f1["f1"]
    }
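
`compute_metrics` drops predictions that match no known label before computing F1, so F1 is measured only on parseable outputs while accuracy counts them as errors. For comparison, here is a sketch of macro F1 in plain Python that keeps invalid predictions as errors; `macro_f1` is a hypothetical helper written for illustration, not the `evaluate` implementation:

```python
def macro_f1(predictions, references):
    """Macro-averaged F1 over the labels that appear in `references`.

    Predictions outside the reference label set are not dropped: each one
    counts as a false negative for its true class.
    """
    labels = sorted(set(references))
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical): "pos" is an output that matches no label
refs = ["positive", "negative", "positive", "negative"]
preds = ["positive", "negative", "pos", "negative"]
print(round(macro_f1(preds, refs), 4))  # 0.8333
```

On this toy data, filtering out the invalid third prediction first would leave three correct predictions and a macro F1 of 1.0, which illustrates how much the two conventions can differ.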

In [43]:
def train_t5_on_dataset(train_dataset, val_dataset, output_dir, dataset_name):
    """
    Trains T5-base on a specific dataset.
    """
    print(f"\n🚀 Training T5-base on {dataset_name}...")

    # Load a fresh model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

    # Configure training arguments
    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=3e-4,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        weight_decay=0.01,
        save_total_limit=1,
        num_train_epochs=1,  # Only 1 epoch for speed
        predict_with_generate=True,
        report_to="none",
        fp16=False,
        logging_steps=100,
    )

    # Data collator
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        label_pad_token_id=-100
    )

    # Trainer
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    # Train
    start_time = datetime.now()
    trainer.train()
    end_time = datetime.now()

    # Evaluate
    eval_results = trainer.evaluate()

    # total_seconds() covers the whole duration; .seconds would drop any full days
    print(f"✓ Training completed in {(end_time - start_time).total_seconds():.0f}s")
    print(f"  Accuracy: {eval_results['eval_accuracy']:.4f}")
    print(f"  F1 Score: {eval_results['eval_f1']:.4f}")

    return trainer, eval_results
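
The settings above amount to a small training budget. A quick calculation, assuming a single device and no gradient accumulation (neither is set in `training_args`); `optimizer_steps` is a hypothetical helper:

```python
import math

def optimizer_steps(num_examples, batch_size, epochs=1):
    """Number of optimizer steps when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size) * epochs

# 5000 training examples, batch size 16, 1 epoch
print(optimizer_steps(5000, 16))  # 313
```

With `logging_steps=100`, that leaves only a handful of logged loss values per run.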

In [44]:
# Train on the 3 datasets (roughly 10-20 minutes, depending on hardware)
print("Training T5-base on the 3 datasets...\n")

trainer_sst2, results_sst2 = train_t5_on_dataset(
    tokenized_sst2_train,
    tokenized_sst2_val,
    "t5-sst2-finetuned",
    "SST-2"
)

trainer_amazon, results_amazon = train_t5_on_dataset(
    tokenized_amazon_train,
    tokenized_amazon_val,
    "t5-amazon-finetuned",
    "Amazon Polarity"
)

trainer_agnews, results_agnews = train_t5_on_dataset(
    tokenized_agnews_train,
    tokenized_agnews_val,
    "t5-agnews-finetuned",
    "AG News"
)

print("\n✓ T5-base fine-tuning complete for the 3 datasets")

Training T5-base on the 3 datasets...


🚀 Training T5-base on SST-2...


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.0858,0.132353,0.924312,0.924168


✓ Training completed in 178s
  Accuracy: 0.9243
  F1 Score: 0.9242

🚀 Training T5-base on Amazon Polarity...


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.0972,0.083099,0.939,0.938998


✓ Training completed in 231s
  Accuracy: 0.9390
  F1 Score: 0.9390

🚀 Training T5-base on AG News...


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.1225,0.117947,0.916,0.911369


✓ Training completed in 238s
  Accuracy: 0.9160
  F1 Score: 0.9114

✓ T5-base fine-tuning complete for the 3 datasets


## SECTION 4: FLAN-T5 Zero-Shot Evaluation - SOLUTIONS

Here are the solutions: effective prompts for FLAN-T5.

In [45]:
# Load FLAN-T5
print("Loading FLAN-T5...")
model_flan = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer_flan = AutoTokenizer.from_pretrained("google/flan-t5-base")
print("✓ FLAN-T5 loaded")

Loading FLAN-T5...

✓ FLAN-T5 loaded


In [51]:
def evaluate_flan_t5(dataset, prompt_template, label_map):
    """
    Evaluates FLAN-T5 zero-shot on a dataset.
    Accuracy is computed manually, without metric_accuracy.compute().
    """
    predictions = []
    references = []

    print(f"  Evaluating {len(dataset)} examples...")

    for i, example in enumerate(dataset):
        if i % 200 == 0:
            print(f"    Progress: {i}/{len(dataset)}")

        # Build the prompt
        prompt = prompt_template(example)

        # Generate the prediction
        input_ids = tokenizer_flan(prompt, return_tensors="pt", max_length=512, truncation=True).input_ids
        outputs = model_flan.generate(input_ids, max_length=10)
        prediction = tokenizer_flan.decode(outputs[0], skip_special_tokens=True)

        predictions.append(prediction)
        references.append(label_map[example["label"]])

    # Normalize strings (strip and lowercase)
    predictions_normalized = [pred.strip().lower() for pred in predictions]
    references_normalized = [ref.strip().lower() for ref in references]

    # Compute accuracy manually
    correct = sum(p == r for p, r in zip(predictions_normalized, references_normalized))
    accuracy = correct / len(references_normalized)

    # F1 score - convert strings to numeric IDs
    unique_labels = sorted(list(set(references_normalized)))
    label_to_id = {label: idx for idx, label in enumerate(unique_labels)}

    pred_ids = [label_to_id.get(pred, -1) for pred in predictions_normalized]
    ref_ids = [label_to_id.get(ref, -1) for ref in references_normalized]

    # Filter out invalid predictions (F1 covers only outputs that match a known label)
    valid_indices = [i for i, pred_id in enumerate(pred_ids) if pred_id != -1]

    if len(valid_indices) > 0:
        pred_ids_valid = [pred_ids[i] for i in valid_indices]
        ref_ids_valid = [ref_ids[i] for i in valid_indices]

        f1 = metric_f1.compute(
            predictions=pred_ids_valid,
            references=ref_ids_valid,
            average='macro'
        )
    else:
        f1 = {"f1": 0.0}

    return {
        "accuracy": accuracy,  # a plain float, not a dict
        "f1": f1["f1"]
    }
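
The evaluation above only strips whitespace and lowercases, so an answer wrapped in quotes or ending in a period counts as wrong. A possible, more lenient normalization is sketched below; `normalize_prediction` is a hypothetical helper, and the numbers reported in this lab were obtained with the stricter comparison, not with this function:

```python
def normalize_prediction(text, valid_labels):
    """Map raw generated text to a known label, or None if nothing matches."""
    # Drop surrounding whitespace, then surrounding quotes and periods
    cleaned = text.strip().strip("'\".").lower()
    lookup = {label.lower(): label for label in valid_labels}
    return lookup.get(cleaned)

labels = ["World", "Sports", "Business", "Sci/Tech"]
print(normalize_prediction(" 'Sci/Tech'. ", labels))  # Sci/Tech
print(normalize_prediction("Politics", labels))       # None
```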

### Prompt Solutions for FLAN-T5

**Characteristics of a good prompt:**
- A clear, direct instruction
- States the exact answer options
- Minimal but sufficient context
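
The three characteristics above can be folded into one template helper. A sketch only; `build_prompt` and its parameters are hypothetical and belong neither to the lab code nor to any library:

```python
def build_prompt(instruction, options, field_name, text, answer_cue):
    """Build a zero-shot classification prompt that names the exact answer options."""
    if len(options) == 2:
        option_text = f"either '{options[0]}' or '{options[1]}'"
    else:
        option_text = "one of: " + ", ".join(options)
    return (
        f"{instruction} Answer with {option_text}.\n\n"   # clear instruction + exact options
        f'{field_name}: "{text}"\n\n'                      # minimal context: just the input
        f"{answer_cue}:"                                   # cue for the model's answer
    )

# Uses the first SST-2 training sentence printed in Section 1
print(build_prompt(
    "Classify the sentiment of the following movie review.",
    ["positive", "negative"],
    "Review",
    "hide new secretions from the parental units",
    "Sentiment",
))
```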

In [52]:
# SOLUTION: Prompt for SST-2
def prompt_sst2(example):
    """
    SOLUTION: Prompt for sentiment classification on SST-2.
    """
    prompt = f"""Classify the sentiment of the following movie review as either 'positive' or 'negative'.

Review: "{example['sentence']}"

Sentiment:"""
    return prompt

# SOLUTION: Prompt for Amazon Polarity
def prompt_amazon(example):
    """
    SOLUTION: Prompt for Amazon product reviews.
    """
    # Concatenate title and content
    review_text = f"{example['title']} {example['content']}"

    prompt = f"""Classify the sentiment of the following Amazon product review as either 'positive' or 'negative'.

Review: "{review_text}"

Sentiment:"""
    return prompt

# SOLUTION: Prompt for AG News
def prompt_agnews(example):
    """
    SOLUTION: Prompt for news classification into 4 categories.
    """
    prompt = f"""Classify the following news article into one of these categories: World, Sports, Business, or Sci/Tech.

Article: "{example['text']}"

Category:"""
    return prompt

In [53]:
# Variant 1: More concise, for SST-2
def prompt_sst2_alternative(example):
    """
    Variant: more concise
    """
    return f"Is this movie review positive or negative?\n\n{example['sentence']}\n\nAnswer:"

# Variant 2: More explicit, for AG News
def prompt_agnews_alternative(example):
    """
    Variant: with a short description of each category (more explicit)
    """
    prompt = f"""Classify this news article into exactly one category:
- World: International news and politics
- Sports: Sports events and athletes
- Business: Economy, companies, markets
- Sci/Tech: Science and technology news

Article: "{example['text']}"

Category:"""
    return prompt

print("✓ Alternative prompts defined (optional)")

✓ Alternative prompts defined (optional)


In [54]:
# Label maps
label_map_sst2 = {0: "negative", 1: "positive"}
label_map_amazon = {0: "negative", 1: "positive"}
label_map_agnews = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

In [55]:
# Evaluate FLAN-T5 on the 3 datasets
print("\n📊 Evaluating FLAN-T5 on SST-2...")
results_flan_sst2 = evaluate_flan_t5(dataset_sst2_val, prompt_sst2, label_map_sst2)
print(f"✓ SST-2 - Accuracy: {results_flan_sst2['accuracy']:.4f}, F1: {results_flan_sst2['f1']:.4f}")

print("\n📊 Evaluating FLAN-T5 on Amazon Polarity...")
results_flan_amazon = evaluate_flan_t5(dataset_amazon_val, prompt_amazon, label_map_amazon)
print(f"✓ Amazon - Accuracy: {results_flan_amazon['accuracy']:.4f}, F1: {results_flan_amazon['f1']:.4f}")

print("\n📊 Evaluating FLAN-T5 on AG News...")
results_flan_agnews = evaluate_flan_t5(dataset_agnews_val, prompt_agnews, label_map_agnews)
print(f"✓ AG News - Accuracy: {results_flan_agnews['accuracy']:.4f}, F1: {results_flan_agnews['f1']:.4f}")

print("\n✓ FLAN-T5 evaluation complete")


📊 Evaluating FLAN-T5 on SST-2...
  Evaluating 872 examples...
    Progress: 0/872
    Progress: 200/872
    Progress: 400/872
    Progress: 600/872
    Progress: 800/872
✓ SST-2 - Accuracy: 0.9358, F1: 0.9358

📊 Evaluating FLAN-T5 on Amazon Polarity...
  Evaluating 1000 examples...
    Progress: 0/1000
    Progress: 200/1000
    Progress: 400/1000
    Progress: 600/1000
    Progress: 800/1000
✓ Amazon - Accuracy: 0.9620, F1: 0.9620

📊 Evaluating FLAN-T5 on AG News...
  Evaluating 1000 examples...
    Progress: 0/1000
    Progress: 200/1000
    Progress: 400/1000
    Progress: 600/1000
    Progress: 800/1000
✓ AG News - Accuracy: 0.8720, F1: 0.8719

✓ FLAN-T5 evaluation complete


## SECTION 5: Comparison and Analysis

Now we compare the results and discuss their implications.

In [56]:
# Comparison table - Accuracy (difference = fine-tuned T5 minus FLAN-T5 zero-shot)
print("\n📊 COMPARISON TABLE - ACCURACY")
print("-" * 80)
print(f"{'Dataset':<20} | {'T5 Fine-tuned':<15} | {'FLAN-T5 Zero-shot':<18} | {'Difference':<12}")
print("-" * 80)

diff_sst2 = results_sst2['eval_accuracy'] - results_flan_sst2['accuracy']
print(f"{'SST-2':<20} | {results_sst2['eval_accuracy']:>14.2%} | {results_flan_sst2['accuracy']:>17.2%} | {diff_sst2:>+11.2%}")

diff_amazon = results_amazon['eval_accuracy'] - results_flan_amazon['accuracy']
print(f"{'Amazon Polarity':<20} | {results_amazon['eval_accuracy']:>14.2%} | {results_flan_amazon['accuracy']:>17.2%} | {diff_amazon:>+11.2%}")

diff_agnews = results_agnews['eval_accuracy'] - results_flan_agnews['accuracy']
print(f"{'AG News':<20} | {results_agnews['eval_accuracy']:>14.2%} | {results_flan_agnews['accuracy']:>17.2%} | {diff_agnews:>+11.2%}")

print("-" * 80)


📊 COMPARISON TABLE - ACCURACY
--------------------------------------------------------------------------------
Dataset              | T5 Fine-tuned   | FLAN-T5 Zero-shot  | Difference  
--------------------------------------------------------------------------------
SST-2                |         92.43% |            93.58% |      -1.15%
Amazon Polarity      |         93.90% |            96.20% |      -2.30%
AG News              |         91.60% |            87.20% |      +4.40%
--------------------------------------------------------------------------------


In [57]:
# Comparison table - F1 Score (difference = fine-tuned T5 minus FLAN-T5 zero-shot)
print("\n📊 COMPARISON TABLE - F1 SCORE (MACRO)")
print("-" * 80)
print(f"{'Dataset':<20} | {'T5 Fine-tuned':<15} | {'FLAN-T5 Zero-shot':<18} | {'Difference':<12}")
print("-" * 80)

diff_f1_sst2 = results_sst2['eval_f1'] - results_flan_sst2['f1']
print(f"{'SST-2':<20} | {results_sst2['eval_f1']:>14.2%} | {results_flan_sst2['f1']:>17.2%} | {diff_f1_sst2:>+11.2%}")

diff_f1_amazon = results_amazon['eval_f1'] - results_flan_amazon['f1']
print(f"{'Amazon Polarity':<20} | {results_amazon['eval_f1']:>14.2%} | {results_flan_amazon['f1']:>17.2%} | {diff_f1_amazon:>+11.2%}")

diff_f1_agnews = results_agnews['eval_f1'] - results_flan_agnews['f1']
print(f"{'AG News':<20} | {results_agnews['eval_f1']:>14.2%} | {results_flan_agnews['f1']:>17.2%} | {diff_f1_agnews:>+11.2%}")

print("-" * 80)


📊 COMPARISON TABLE - F1 SCORE (MACRO)
--------------------------------------------------------------------------------
Dataset              | T5 Fine-tuned   | FLAN-T5 Zero-shot  | Difference  
--------------------------------------------------------------------------------
SST-2                |         92.42% |            93.58% |      -1.16%
Amazon Polarity      |         93.90% |            96.20% |      -2.30%
AG News              |         91.14% |            87.19% |      +3.95%
--------------------------------------------------------------------------------
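
Before reading much into these differences, it helps to estimate how large sampling noise is for validation sets of this size. A rough sketch using the normal approximation to a binomial proportion; it treats each accuracy as an independent estimate, which is a simplifying assumption because both models were scored on the same examples. `accuracy_margin` is a hypothetical helper:

```python
import math

def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

# (dataset, fine-tuned T5 accuracy, FLAN-T5 zero-shot accuracy, validation size)
# Values copied from the accuracy table in this section.
results = [
    ("SST-2", 0.9243, 0.9358, 872),
    ("Amazon Polarity", 0.9390, 0.9620, 1000),
    ("AG News", 0.9160, 0.8720, 1000),
]
for name, t5_acc, flan_acc, n in results:
    print(f"{name}: T5 {t5_acc:.2%} ± {accuracy_margin(t5_acc, n):.2%}, "
          f"FLAN-T5 {flan_acc:.2%} ± {accuracy_margin(flan_acc, n):.2%}")
```

Under this approximation the SST-2 gap of about one point is smaller than either model's margin, so the ranking on that dataset is best treated as tentative. The AG News gap is larger than both margins combined.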



## ✓ Lab Completed - COMPLETE ANSWER KEY

### Summary of Key Takeaways:

1. **T5-base** is a capable pretrained model, but it requires fine-tuning for each specific task (here, one epoch on 5,000 examples per dataset)
2. **FLAN-T5** shows the power of instruction tuning for zero-shot generalization: with no task-specific training it scored higher than the fine-tuned T5-base on SST-2 and Amazon Polarity, and 4.4 accuracy points lower on AG News
3. **Prompt quality** is crucial to FLAN-T5's performance
4. **Fine-tuning vs zero-shot** is a decision that depends on context and resources
5. Understanding these trade-offs is essential for production deployment

---

### Final Conclusion:

**Instruction tuning** (FLAN) represents a qualitative leap in the evolution of LLMs:

- **Before (T5)**: One base model per task, requiring task-specific fine-tuning
- **Now (FLAN-T5)**: One general model for multiple tasks via prompting

This innovation opened the door to the **conversational assistants** and **general-purpose models** we use today (ChatGPT, Claude, etc.).

The ability to **"program" a model with natural language** (prompting) is a direct consequence of this