# 03: Real GPTQ Quantization Test

**Goal:** Test whether our findings from simulated INT4 quantization carry over to a real GPTQ-quantized model

**Time:** ~45 minutes on T4

**Key question:** Does GPTQ's calibration-based optimization already mitigate multilingual disparity?
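
To make "disparity" concrete, here is a minimal sketch of one plausible definition: the relative perplexity increase of each language divided by the relative increase for English. This is an assumption, not the actual contents of `src.disparity`; `relative_degradation` and `disparity_report` are hypothetical names, and the perplexity values are made up for illustration.

```python
from statistics import mean


def relative_degradation(baseline_ppl, quantized_ppl):
    """Fractional perplexity increase caused by quantization."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


def disparity_report(baseline, quantized, ref_lang="en"):
    """Hypothetical stand-in for measure_disparity, working on perplexity dicts.

    Each language's disparity is its relative degradation divided by the
    reference language's relative degradation (assumed definition).
    """
    ref_deg = relative_degradation(baseline[ref_lang], quantized[ref_lang])
    if ref_deg <= 0:
        raise ValueError("reference language shows no degradation; ratio undefined")
    per_lang = {
        lang: relative_degradation(baseline[lang], quantized[lang]) / ref_deg
        for lang in baseline
        if lang != ref_lang
    }
    return {"per_lang": per_lang, "avg_disparity": mean(per_lang.values())}


# Made-up perplexities, for illustration only
example_baseline = {"en": 10.0, "he": 20.0, "zh": 40.0}
example_quantized = {"en": 11.0, "he": 26.0, "zh": 52.0}

report = disparity_report(example_baseline, example_quantized)
print(f"avg disparity: {report['avg_disparity']:.2f}x")  # prints "avg disparity: 3.00x"
```

Under this definition, a value of 1.0x means a language degrades exactly as much as English, and larger values mean it is hit harder.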

In [None]:
!pip install -q transformers accelerate auto-gptq optimum

import sys
sys.path.append('..')

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
from src.disparity import measure_disparity, perplexity, DEFAULT_TEXTS
from src.utils import print_results

In [None]:
# Load pre-quantized GPTQ model
GPTQ_MODEL = "TheBloke/Llama-2-7B-GPTQ"

print(f"Loading {GPTQ_MODEL}...")
tokenizer = AutoTokenizer.from_pretrained(GPTQ_MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoGPTQForCausalLM.from_quantized(
    GPTQ_MODEL,
    device="cuda:0",
    use_triton=False,
)
print("GPTQ model loaded")

In [None]:
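# Also load the FP16 baseline for comparison.
# The GPTQ model is already on the GPU, so on a T4 both models may not fit at once;
# device_map="auto" lets accelerate place overflow layers on CPU (slower, but avoids an OOM).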
from transformers import AutoModelForCausalLM

FP16_MODEL = "NousResearch/Llama-2-7b-hf"
print(f"Loading {FP16_MODEL} for baseline...")

model_fp16 = AutoModelForCausalLM.from_pretrained(
    FP16_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)
model_fp16.eval()
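
A rough back-of-the-envelope check on why memory is tight when both models are resident at once. The parameter count is approximate, and the 4-bit figure ignores quantization scales, zero-points, and any layers kept in FP16, so treat it as a lower bound. `weight_memory_gib` is a hypothetical helper, not part of this repo.

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


LLAMA2_7B_PARAMS = 6.74e9  # approximate parameter count for Llama-2-7B

fp16_gib = weight_memory_gib(LLAMA2_7B_PARAMS, 16)
int4_gib = weight_memory_gib(LLAMA2_7B_PARAMS, 4)

print(f"FP16 weights:  ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{int4_gib:.1f} GiB (lower bound)")
print(f"Both resident: ~{fp16_gib + int4_gib:.1f} GiB")
```

The combined figure lands close to the capacity of a 16 GB card before counting activations, which is why the FP16 model is released as soon as the baseline has been computed.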

In [None]:
TEXTS = {k: v for k, v in DEFAULT_TEXTS.items() if k in ['en', 'he', 'ar', 'zh', 'de', 'fr']}

# Baseline (FP16)
print("Computing FP16 baseline...")
baseline = {lang: perplexity(model_fp16, tokenizer, text) for lang, text in TEXTS.items()}
print("Baseline: " + ", ".join(f"{lang}={ppl:.1f}" for lang, ppl in baseline.items()))

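# Free the FP16 model before measuring GPTQ
import gc

del model_fp16
gc.collect()  # drop lingering references so the CUDA cache can actually be released
torch.cuda.empty_cache()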
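
The baseline relies on `perplexity` from `src.disparity`, whose implementation is not shown in this notebook. A minimal sketch of the standard definition, assuming it applies here: perplexity is the exponential of the mean per-token negative log-likelihood. With a Hugging Face causal LM, that mean is typically the `loss` returned when `labels` are passed to the model. `perplexity_from_nll` is a hypothetical helper.

```python
import math


def perplexity_from_nll(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: a model that assigns probability 1/4 to every token
# has a per-token NLL of log(4), so its perplexity should be about 4.
uniform_nlls = [math.log(4)] * 8
ppl = perplexity_from_nll(uniform_nlls)
print(f"{ppl:.2f}")
```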

In [None]:
# Measure GPTQ disparity
print("\nMeasuring GPTQ disparity...")
gptq_results = measure_disparity(model, tokenizer, TEXTS, baseline)

print_results(gptq_results, "GPTQ Quantization Results")

In [None]:
# Interpretation
print("\n" + "=" * 60)
print("INTERPRETATION")
print("=" * 60)

avg_disp = gptq_results['avg_disparity']

if avg_disp < 2.0:
    print(f"""
✓ GPTQ shows low disparity ({avg_disp:.2f}x)

This suggests GPTQ's calibration-based optimization may partially
address multilingual disparity. However:

1. Calibration data matters: was it English-only or multilingual?
2. Our layer protection insight is still valuable for understanding WHY
3. Further experiment: compare English-only vs. multilingual calibration
""")
elif avg_disp < 10.0:
    print(f"""
~ GPTQ shows moderate disparity ({avg_disp:.2f}x)

GPTQ reduces but doesn't eliminate disparity. Our findings suggest:

1. Layer protection could further improve GPTQ
2. Multilingual calibration might help
3. Hybrid approach: GPTQ + layer protection
""")
else:
    print(f"""
✗ GPTQ shows high disparity ({avg_disp:.2f}x)

GPTQ doesn't solve the multilingual disparity problem. Our findings
are directly applicable:

1. Layer protection strategy adds value
2. Consider layer-aware GPTQ modification
3. Practical contribution for multilingual deployment
""")
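
The calibration follow-up mentioned in the interpretation could be set up roughly as below. This is a sketch under stated assumptions, not a tested pipeline: `build_calibration_texts` and `quantize_with_calibration` are hypothetical helpers, the corpus is expected to map each language code to a list of strings (unlike `DEFAULT_TEXTS`, which appears to hold one text per language), and the quantization settings are common defaults rather than the ones used for the pre-quantized model above.

```python
import random


def build_calibration_texts(texts_by_lang, langs, n_samples, seed=0):
    """Draw n_samples texts, cycling round-robin over the given languages.

    texts_by_lang must map each language code to a list of strings.
    """
    rng = random.Random(seed)
    pools = {lang: list(texts_by_lang[lang]) for lang in langs}
    return [rng.choice(pools[langs[i % len(langs)]]) for i in range(n_samples)]


def quantize_with_calibration(model_id, calibration_texts, out_dir):
    """Quantize model_id to 4-bit GPTQ using the given calibration texts."""
    # Imported lazily so the sampling helper above works without a GPU stack.
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    examples = [tokenizer(text) for text in calibration_texts]

    # Assumed settings; adjust to match the model being compared against.
    config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
    model = AutoGPTQForCausalLM.from_pretrained(model_id, config)
    model.quantize(examples)
    model.save_quantized(out_dir)


# Toy corpus, for illustration only; real calibration needs far more text.
toy_corpus = {"en": ["Hello world."], "he": ["שלום עולם."], "zh": ["你好，世界。"]}
english_only = build_calibration_texts(toy_corpus, ["en"], n_samples=4)
multilingual = build_calibration_texts(toy_corpus, ["en", "he", "zh"], n_samples=6)
```

Quantizing once with `english_only` and once with `multilingual`, then running `measure_disparity` on each result, would show how much of the disparity comes from the calibration data.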