# Spanish-Native GenAI Quality Benchmark (Lakehouse Edition)

## What is this notebook?

This notebook demonstrates a simple, reproducible way to evaluate how language models perform across different Spanish dialects and noisy real-world text.

Instead of building a new model, we focus on measuring model performance in realistic conditions before deployment.

---

## Why does this matter?

Most AI systems are evaluated primarily on English-language benchmarks.

However, real users in Latin America:
- Use regional slang (El Salvador vs Perú)
- Mix formal and informal language
- Send noisy text (OCR errors, typos, web copy)

If we only test in English, we may miss performance gaps that affect real users.

This notebook shows how to measure those differences using Databricks and Delta Lake.

---

## Lakehouse Architecture (Simple Version)

We organize the data into three layers:

**Bronze (Raw Data)**
- Original dialect examples
- Includes region and noise level

**Silver (Evaluation Items)**
- Structured prompts created from the raw text
- Ready to be sent to a model

**Gold (Evaluation Results)**
- Model outputs and quality metrics
- Aggregated performance by region and noise
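
To make the three layers concrete, here is a minimal plain-Python sketch of the row shape at each stage. The class names (`BronzeRow`, `SilverRow`, `GoldRow`) and the helpers `to_silver` and `to_gold` are illustrative assumptions, not part of the notebook; the field names and the prompt prefix mirror the columns used in the cells below.

```python
from dataclasses import asdict, dataclass

# Prompt prefix used by the Silver layer in this notebook.
PROMPT_PREFIX = "Explain in neutral Spanish: "


@dataclass(frozen=True)
class BronzeRow:
    """Raw dialect example (hypothetical class name)."""
    region: str
    text: str
    task_type: str
    noise_level: str


@dataclass(frozen=True)
class SilverRow(BronzeRow):
    """Bronze row plus the prompt that would be sent to a model."""
    prompt: str


@dataclass(frozen=True)
class GoldRow(SilverRow):
    """Silver row plus the evaluated model name and its score."""
    model_name: str
    accuracy: float


def to_silver(row: BronzeRow) -> SilverRow:
    # Bronze -> Silver: attach a structured prompt to the raw text.
    return SilverRow(**asdict(row), prompt=PROMPT_PREFIX + row.text)


def to_gold(row: SilverRow, model_name: str, accuracy: float) -> GoldRow:
    # Silver -> Gold: attach the model name and its per-item score.
    return GoldRow(**asdict(row), model_name=model_name, accuracy=accuracy)
```

Each layer only adds columns to the previous one, which is why the Silver and Gold tables in this notebook repeat the Bronze columns.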

---

## What we measure

We compare model accuracy across:
- Region (SV vs PE)
- Noise level (clean vs OCR noise)

The goal is to detect performance gaps before production deployment.

---

This is a simplified demo of what could become a production-grade GenAI evaluation pipeline. The accuracy values computed below are simulated placeholders, not the output of a real model.
In production systems, this type of evaluation can be used to gate model deployment, detect regressions, and reduce the risk of regional bias.
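
As a rough illustration of what "gating deployment" could look like, the sketch below checks a per-segment summary against two thresholds. The function name `passes_gate` and the default thresholds (`min_accuracy=0.70`, `max_gap=0.10`) are assumptions chosen for the example, not values taken from this notebook.

```python
def passes_gate(summary, min_accuracy=0.70, max_gap=0.10):
    """Return True if every segment meets the accuracy floor and the
    spread between the best and worst segment stays within max_gap.

    summary: list of dicts with keys region, noise_level, avg_accuracy.
    Thresholds are hypothetical defaults for illustration only.
    """
    scores = [row["avg_accuracy"] for row in summary]
    if not scores:
        # No evidence means no approval.
        return False
    worst = min(scores)
    gap = max(scores) - worst
    return worst >= min_accuracy and gap <= max_gap


# Illustrative values, rounded from the kind of summary this notebook produces.
example_summary = [
    {"region": "SV", "noise_level": "clean", "avg_accuracy": 0.788},
    {"region": "PE", "noise_level": "clean", "avg_accuracy": 0.781},
    {"region": "SV", "noise_level": "ocr_noise", "avg_accuracy": 0.773},
    {"region": "PE", "noise_level": "ocr_noise", "avg_accuracy": 0.705},
]

deploy_ok = passes_gate(example_summary)
deploy_ok_strict = passes_gate(example_summary, min_accuracy=0.75)
```

With the default thresholds the example passes; raising the floor to 0.75 makes it fail because the noisy PE segment falls below it.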



In [0]:
# --------------------------------------
# Environment Check: Confirm Compute Works
# --------------------------------------

spark.range(5).display()


id
0
1
2
3
4


In [0]:
# --------------------------------------
# Bronze Layer: Raw Dialect Data
# --------------------------------------

data = [
    # El Salvador (SV) - clean
    ("SV", "Está yuca.", "slang", "clean"),
    ("SV", "Vaya pues, chero.", "slang", "clean"),
    ("SV", "¡Puchica! No tengo pisto.", "slang", "clean"),
    ("SV", "¿Qué onda? Todo chivo.", "slang", "clean"),

    # Perú (PE) - clean
    ("PE", "Ese pata es de mi barrio.", "slang", "clean"),
    ("PE", "Está bien chévere.", "slang", "clean"),
    ("PE", "Me paltea un montón.", "slang", "clean"),
    ("PE", "Ya pues, causa.", "slang", "clean"),

    # Noise variants (OCR/web-like)
    ("SV", "N0 teng0 p1st0", "slang", "ocr_noise"),
    ("PE", "ESE pata es de mi barri0", "slang", "ocr_noise"),
]

df = spark.createDataFrame(
    data,
    ["region", "text", "task_type", "noise_level"]
)

df.write.format("delta").mode("overwrite").saveAsTable("latam_bronze_dialect_raw")

spark.table("latam_bronze_dialect_raw").display()


region,text,task_type,noise_level
SV,Está yuca.,slang,clean
SV,"Vaya pues, chero.",slang,clean
SV,¡Puchica! No tengo pisto.,slang,clean
SV,¿Qué onda? Todo chivo.,slang,clean
PE,Ese pata es de mi barrio.,slang,clean
PE,Está bien chévere.,slang,clean
PE,Me paltea un montón.,slang,clean
PE,"Ya pues, causa.",slang,clean
SV,N0 teng0 p1st0,slang,ocr_noise
PE,ESE pata es de mi barri0,slang,ocr_noise
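
The noisy rows above are written by hand. One possible way to generate such variants automatically is a simple character substitution, sketched below. The helper `add_ocr_noise` and its substitution map are assumptions; the map reproduces the SV example but not the PE one, which also changes letter case and leaves some characters untouched.

```python
# Hypothetical substitution map imitating common OCR confusions.
OCR_SUBSTITUTIONS = str.maketrans({"o": "0", "i": "1"})


def add_ocr_noise(text: str) -> str:
    """Replace lowercase 'o' and 'i' with the digits '0' and '1'."""
    return text.translate(OCR_SUBSTITUTIONS)


noisy = add_ocr_noise("No tengo pisto")
```

Here `noisy` equals `"N0 teng0 p1st0"`, matching the SV `ocr_noise` row in the Bronze table.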


In [0]:
# --------------------------------------
# Silver Layer: Structured Evaluation Items
# --------------------------------------

from pyspark.sql import functions as F

bronze_df = spark.table("latam_bronze_dialect_raw")

silver_df = bronze_df.withColumn(
    "prompt",
    F.concat(F.lit("Explain in neutral Spanish: "), F.col("text"))
)

silver_df.write.format("delta").mode("overwrite").saveAsTable("latam_silver_eval_items")

spark.table("latam_silver_eval_items").display()


region,text,task_type,noise_level,prompt
SV,Está yuca.,slang,clean,Explain in neutral Spanish: Está yuca.
SV,"Vaya pues, chero.",slang,clean,"Explain in neutral Spanish: Vaya pues, chero."
SV,¡Puchica! No tengo pisto.,slang,clean,Explain in neutral Spanish: ¡Puchica! No tengo pisto.
SV,¿Qué onda? Todo chivo.,slang,clean,Explain in neutral Spanish: ¿Qué onda? Todo chivo.
PE,Ese pata es de mi barrio.,slang,clean,Explain in neutral Spanish: Ese pata es de mi barrio.
PE,Está bien chévere.,slang,clean,Explain in neutral Spanish: Está bien chévere.
PE,Me paltea un montón.,slang,clean,Explain in neutral Spanish: Me paltea un montón.
PE,"Ya pues, causa.",slang,clean,"Explain in neutral Spanish: Ya pues, causa."
SV,N0 teng0 p1st0,slang,ocr_noise,Explain in neutral Spanish: N0 teng0 p1st0
PE,ESE pata es de mi barri0,slang,ocr_noise,Explain in neutral Spanish: ESE pata es de mi barri0


In [0]:
# --------------------------------------
# Gold Layer: Model Evaluation Results
# --------------------------------------

from pyspark.sql import functions as F

silver_df = spark.table("latam_silver_eval_items")

gold_df = (
    silver_df
    .withColumn("model_name", F.lit("baseline_model"))
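    # simulated base accuracy in [0.75, 0.95); placeholder for a real model score
    # (rand(42) is seeded, but values can shift if the data is repartitioned)
    .withColumn("base_acc", F.expr("0.75 + rand(42) * 0.20"))
    # subtract a fixed 0.15 penalty from OCR-noise items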
    .withColumn(
        "noise_penalty",
        F.when(F.col("noise_level") == "ocr_noise", F.lit(0.15))
         .otherwise(F.lit(0.0))
    )
    .withColumn("accuracy", F.col("base_acc") - F.col("noise_penalty"))
    .drop("base_acc", "noise_penalty")
)

gold_df.write.format("delta").mode("overwrite").saveAsTable("latam_gold_eval_runs")

spark.table("latam_gold_eval_runs").display()


region,text,task_type,noise_level,prompt,model_name,accuracy
SV,Está yuca.,slang,clean,Explain in neutral Spanish: Está yuca.,baseline_model,0.7671511190590922
SV,"Vaya pues, chero.",slang,clean,"Explain in neutral Spanish: Vaya pues, chero.",baseline_model,0.812082279145421
SV,¡Puchica! No tengo pisto.,slang,clean,Explain in neutral Spanish: ¡Puchica! No tengo pisto.,baseline_model,0.7625139563126428
SV,¿Qué onda? Todo chivo.,slang,clean,Explain in neutral Spanish: ¿Qué onda? Todo chivo.,baseline_model,0.8112922645307346
PE,Ese pata es de mi barrio.,slang,clean,Explain in neutral Spanish: Ese pata es de mi barrio.,baseline_model,0.7500085917718475
PE,Está bien chévere.,slang,clean,Explain in neutral Spanish: Está bien chévere.,baseline_model,0.810613062260997
PE,Me paltea un montón.,slang,clean,Explain in neutral Spanish: Me paltea un montón.,baseline_model,0.7675219214807449
PE,"Ya pues, causa.",slang,clean,"Explain in neutral Spanish: Ya pues, causa.",baseline_model,0.795406747745314
SV,N0 teng0 p1st0,slang,ocr_noise,Explain in neutral Spanish: N0 teng0 p1st0,baseline_model,0.7725590934255248
PE,ESE pata es de mi barri0,slang,ocr_noise,Explain in neutral Spanish: ESE pata es de mi barri0,baseline_model,0.704650790271013
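
The `accuracy` column above is simulated with `rand`. If real model responses were available, one simple replacement would be keyword matching against a reference gloss, sketched below. The function `keyword_score`, the example responses, and the expected keyword are all hypothetical; a production pipeline would more likely use human ratings or a stronger automatic metric.

```python
def keyword_score(response: str, expected_keywords: list[str]) -> float:
    """Return 1.0 if the response contains any expected keyword, else 0.0.

    Matching is case-insensitive substring matching, which is crude but
    easy to audit.
    """
    normalized = response.casefold()
    hit = any(keyword.casefold() in normalized for keyword in expected_keywords)
    return 1.0 if hit else 0.0


def mean_accuracy(scores: list[float]) -> float:
    """Average per-item scores; returns 0.0 for an empty list."""
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical responses to the prompt about "No tengo pisto"
# ("pisto" is Salvadoran slang for money).
good = keyword_score("Significa: no tengo dinero.", ["dinero"])
bad = keyword_score("No entiendo la frase.", ["dinero"])
overall = mean_accuracy([good, bad])
```

The resulting per-item scores could be written to the `accuracy` column in place of the simulated values, leaving the summary queries unchanged.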


In [0]:
# --------------------------------------
# Performance Summary: Region × Noise Level
# --------------------------------------

spark.sql("""
SELECT region,
       noise_level,
       AVG(accuracy) as avg_accuracy,
       COUNT(*) as n_items
FROM latam_gold_eval_runs
GROUP BY region, noise_level
""").display()


region,noise_level,avg_accuracy,n_items
SV,clean,0.7882599047619726,4
PE,clean,0.7808875808147259,4
SV,ocr_noise,0.7725590934255248,1
PE,ocr_noise,0.704650790271013,1
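
To turn the summary into an explicit gap measurement, the sketch below subtracts the PE average from the SV average at each noise level, using the rows shown above. The helper name `region_gaps` is an assumption. Because the scores are simulated and the `ocr_noise` segments contain a single item each, the resulting numbers illustrate the calculation rather than a real finding.

```python
# Rows copied from the summary output: (region, noise_level, avg_accuracy, n_items).
summary_rows = [
    ("SV", "clean", 0.7882599047619726, 4),
    ("PE", "clean", 0.7808875808147259, 4),
    ("SV", "ocr_noise", 0.7725590934255248, 1),
    ("PE", "ocr_noise", 0.704650790271013, 1),
]


def region_gaps(rows, region_a="SV", region_b="PE"):
    """Return {noise_level: avg_accuracy(region_a) - avg_accuracy(region_b)}.

    Noise levels missing for either region are skipped.
    """
    by_key = {(region, noise): acc for region, noise, acc, _ in rows}
    noise_levels = sorted({noise for _, noise, _, _ in rows})
    return {
        noise: by_key[(region_a, noise)] - by_key[(region_b, noise)]
        for noise in noise_levels
        if (region_a, noise) in by_key and (region_b, noise) in by_key
    }


gaps = region_gaps(summary_rows)
```

A positive value means SV scored higher than PE at that noise level; here the gap is larger on the noisy items than on the clean ones.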


In [0]:
# --------------------------------------
# Gold Summary Table (Dashboard Ready)
# --------------------------------------

spark.sql("""
CREATE OR REPLACE TABLE latam_gold_eval_summary AS
SELECT region,
       noise_level,
       AVG(accuracy) as avg_accuracy,
       COUNT(*) as n_items
FROM latam_gold_eval_runs
GROUP BY region, noise_level
""")


DataFrame[num_affected_rows: bigint, num_inserted_rows: bigint]