# Quantization Disparity: GPU Experiments

**Purpose:** Validate simulation findings with real models

**Priority Order (Epistemic Value):**
1. **G1: Real Tokenization Analysis** - Verify alignment metric assumptions
2. **G2: True Quantization Effects** - Validate degradation patterns
3. **G3: Layer Importance Probing** - Confirm gateway hypothesis

**Output Format:** Results are formatted for copy-paste back to local analysis.

---

In [None]:
# === SETUP ===
# Run this cell first to install dependencies

!pip install -q transformers accelerate bitsandbytes sentencepiece
!pip install -q torch --upgrade

import torch
import json
import numpy as np
from datetime import datetime

# Check GPU availability
print("=" * 60)
print("GPU CHECK")
print("=" * 60)
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"Memory: {gpu_mem:.1f} GB")
    print("Status: READY")
else:
    print("WARNING: No GPU detected. Go to Runtime > Change runtime type > GPU")
print("=" * 60)

---
## G1: Real Tokenization Analysis

**Question:** Does our alignment metric match real tokenizer behavior?

**Epistemic Value:** HIGH - validates core assumption

**Method:**
1. Load Llama-2 tokenizer
2. Tokenize parallel sentences in multiple languages
3. Compute token-to-word ratios (word counts serve as a rough proxy for morpheme counts)
4. Compare to our simulated alignment values
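
Step 3 comes down to simple arithmetic: average tokens divided by average linguistic units, with the alignment proxy as the inverse of that ratio. A minimal sketch, assuming the helper name `alignment_proxy` and made-up counts that are illustrative only and not tokenizer output:

```python
from statistics import mean

def alignment_proxy(token_counts, word_counts):
    """Return (tokens-per-word ratio, alignment proxy = 1 / ratio)."""
    ratio = mean(token_counts) / mean(word_counts)
    return ratio, (1 / ratio if ratio > 0 else 0.0)

# Hypothetical counts for two sentences in one language (not real tokenizer output).
ratio, align = alignment_proxy(token_counts=[12, 8], word_counts=[5, 5])
print(f"ratio={ratio:.2f}, align={align:.3f}")  # ratio=2.00, align=0.500
```

A ratio of 2.0 means the tokenizer spends two tokens per word on average, so the proxy falls to 0.5. Languages that fragment more get lower values.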

In [None]:
# === G1: REAL TOKENIZATION ANALYSIS ===

from transformers import AutoTokenizer
import json

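print("Loading Llama-2 tokenizer...")
try:
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True)
except Exception as e:
    # Gated repo or network failure; the results below then describe GPT-2, not Llama-2
    print(f"Llama-2 tokenizer unavailable ({type(e).__name__}). Falling back to GPT-2 tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")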

# Parallel test sentences (same meaning, different languages)
TEST_SENTENCES = {
    'en': [
        "The computer processes data quickly.",
        "Scientists discovered new evidence.",
        "The weather is beautiful today.",
        "Students learn mathematics in school.",
        "The government announced new policies.",
    ],
    'de': [
        "Der Computer verarbeitet Daten schnell.",
        "Wissenschaftler entdeckten neue Beweise.",
        "Das Wetter ist heute wunderschön.",
        "Schüler lernen Mathematik in der Schule.",
        "Die Regierung kündigte neue Maßnahmen an.",
    ],
    'fr': [
        "L'ordinateur traite les données rapidement.",
        "Les scientifiques ont découvert de nouvelles preuves.",
        "Le temps est magnifique aujourd'hui.",
        "Les étudiants apprennent les mathématiques à l'école.",
        "Le gouvernement a annoncé de nouvelles politiques.",
    ],
    'he': [
        "המחשב מעבד נתונים במהירות.",
        "מדענים גילו ראיות חדשות.",
        "מזג האוויר יפה היום.",
        "תלמידים לומדים מתמטיקה בבית הספר.",
        "הממשלה הכריזה על מדיניות חדשה.",
    ],
    'ar': [
        "يقوم الحاسوب بمعالجة البيانات بسرعة.",
        "اكتشف العلماء أدلة جديدة.",
        "الطقس جميل اليوم.",
        "يتعلم الطلاب الرياضيات في المدرسة.",
        "أعلنت الحكومة عن سياسات جديدة.",
    ],
    'zh': [
        "计算机快速处理数据。",
        "科学家发现了新证据。",
        "今天天气很好。",
        "学生在学校学习数学。",
        "政府宣布了新政策。",
    ],
    'ja': [
        "コンピュータはデータを素早く処理します。",
        "科学者たちは新しい証拠を発見しました。",
        "今日は天気がいいです。",
        "生徒は学校で数学を学びます。",
        "政府は新しい政策を発表しました。",
    ],
    'ko': [
        "컴퓨터가 데이터를 빠르게 처리합니다.",
        "과학자들이 새로운 증거를 발견했습니다.",
        "오늘 날씨가 아름답습니다.",
        "학생들은 학교에서 수학을 배웁니다.",
        "정부가 새로운 정책을 발표했습니다.",
    ],
}

# Expected word counts per sentence (whitespace-delimited words; approximate
# word segmentation for zh/ja, which have no spaces). Used as a rough stand-in
# for morpheme counts.
EXPECTED_WORDS = {
    'en': [5, 4, 5, 5, 5],
    'de': [5, 4, 5, 6, 6],
    'fr': [5, 7, 5, 7, 7],
    'he': [4, 4, 4, 5, 5],
    'ar': [5, 4, 3, 5, 5],
    'zh': [4, 4, 4, 5, 4],
    'ja': [5, 5, 4, 5, 5],
    'ko': [4, 4, 3, 4, 4],
}

print("\n" + "=" * 60)
print("G1: TOKENIZATION ANALYSIS RESULTS")
print("=" * 60)

results = {}

for lang, sentences in TEST_SENTENCES.items():
    token_counts = []
    for sent in sentences:
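        # Exclude special tokens (e.g. Llama's BOS) so they do not inflate the counts
        tokens = tokenizer.encode(sent, add_special_tokens=False)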
        token_counts.append(len(tokens))
    
    avg_tokens = np.mean(token_counts)
    avg_words = np.mean(EXPECTED_WORDS.get(lang, [5]*5))
    
    # Token-to-word ratio (inverse of alignment)
    ratio = avg_tokens / avg_words
    
    # Compute alignment proxy (higher = better tokenization)
    alignment_proxy = 1 / ratio if ratio > 0 else 0
    
    results[lang] = {
        'avg_tokens': avg_tokens,
        'avg_words': avg_words,
        'token_word_ratio': ratio,
        'alignment_proxy': alignment_proxy,
        'token_counts': token_counts,
    }
    
    print(f"{lang}: tokens={avg_tokens:.1f}, words={avg_words:.1f}, ratio={ratio:.2f}, align={alignment_proxy:.3f}")

# Output in parseable format
print("\n" + "=" * 60)
print("G1_RESULTS_JSON_START")
print(json.dumps(results, indent=2))
print("G1_RESULTS_JSON_END")
print("=" * 60)

---
## G2: True Quantization Effects

**Question:** Does quantization affect languages differently in real models?

**Epistemic Value:** CRITICAL - core hypothesis validation

**Method:**
1. Load small model (GPT-2 or similar)
2. Compute perplexity on parallel sentences (FP32)
3. Quantize to INT8 (the cell below does not cover INT4)
4. Compare degradation across languages
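
The two quantities behind steps 2 and 4 are perplexity (the exponential of the mean negative log-likelihood per token) and the relative perplexity increase after quantization. A minimal sketch of that arithmetic, assuming hypothetical helper names and hand-picked log-probabilities rather than model output:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def degradation_pct(ppl_base, ppl_quant):
    """Relative perplexity increase of the quantized model, in percent."""
    return (ppl_quant - ppl_base) / ppl_base * 100

# Hypothetical per-token natural-log probabilities (not model output).
base = perplexity([math.log(0.5)] * 4)   # every token predicted with p = 0.5
quant = perplexity([math.log(0.4)] * 4)  # every token predicted with p = 0.4
print(f"{base:.2f} -> {quant:.2f}: {degradation_pct(base, quant):+.1f}%")
```

Because degradation is relative to each language's own baseline, it can be compared across languages even when their absolute perplexities differ widely.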

In [None]:
# === G2: TRUE QUANTIZATION EFFECTS ===

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import json

print("Loading model for quantization experiment...")
print("(Using GPT-2 for accessibility - Llama requires auth)")

model_name = "gpt2"  # Small, accessible model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load FP32 model
model_fp32 = AutoModelForCausalLM.from_pretrained(model_name).cuda()
model_fp32.eval()

def compute_perplexity(model, tokenizer, text):
    """Compute perplexity for a given text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
    
    return torch.exp(loss).item()

# Test sentences (one per language, extending the first G1 sentence)
TEST_SENTENCES = {
    'en': "The computer processes data quickly and efficiently.",
    'de': "Der Computer verarbeitet Daten schnell und effizient.",
    'fr': "L'ordinateur traite les données rapidement et efficacement.",
    'he': "המחשב מעבד נתונים במהירות וביעילות.",
    'ar': "يقوم الحاسوب بمعالجة البيانات بسرعة وكفاءة.",
    'zh': "计算机快速高效地处理数据。",
    'ja': "コンピュータはデータを迅速かつ効率的に処理します。",
    'ko': "컴퓨터는 데이터를 빠르고 효율적으로 처리합니다.",
}

print("\n" + "=" * 60)
print("G2: QUANTIZATION EFFECTS - FP32 BASELINE")
print("=" * 60)

fp32_results = {}
for lang, text in TEST_SENTENCES.items():
    try:
        ppl = compute_perplexity(model_fp32, tokenizer, text)
        fp32_results[lang] = ppl
        print(f"{lang}: PPL = {ppl:.2f}")
    except Exception as e:
        print(f"{lang}: ERROR - {e}")
        fp32_results[lang] = None

# Clean up FP32 model
del model_fp32
torch.cuda.empty_cache()

print("\nLoading INT8 quantized model...")
try:
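    # Passing load_in_8bit directly is deprecated in recent transformers releases;
    # the quantization settings go through BitsAndBytesConfig instead
    from transformers import BitsAndBytesConfig
    model_int8 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto"
    )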
    model_int8.eval()
    
    print("\n" + "=" * 60)
    print("G2: QUANTIZATION EFFECTS - INT8")
    print("=" * 60)
    
    int8_results = {}
    for lang, text in TEST_SENTENCES.items():
        try:
            ppl = compute_perplexity(model_int8, tokenizer, text)
            int8_results[lang] = ppl
            base = fp32_results.get(lang)
            if base is not None:
                # Test against None so a 0.0% change is still reported
                degradation = (ppl - base) / base * 100
                print(f"{lang}: PPL = {ppl:.2f}, degradation = {degradation:+.1f}%")
            else:
                print(f"{lang}: PPL = {ppl:.2f}")
        except Exception as e:
            print(f"{lang}: ERROR - {e}")
            int8_results[lang] = None
    
    del model_int8
    torch.cuda.empty_cache()
except Exception as e:
    print(f"INT8 loading failed: {e}")
    int8_results = {}

# Compute degradation
print("\n" + "=" * 60)
print("G2: DEGRADATION ANALYSIS")
print("=" * 60)

degradation_results = {}
for lang in TEST_SENTENCES.keys():
    if fp32_results.get(lang) and int8_results.get(lang):
        deg = (int8_results[lang] - fp32_results[lang]) / fp32_results[lang] * 100
        degradation_results[lang] = {
            'fp32_ppl': fp32_results[lang],
            'int8_ppl': int8_results[lang],
            'degradation_pct': deg,
        }
        print(f"{lang}: {deg:+.1f}% degradation")

# Output in parseable format
print("\n" + "=" * 60)
print("G2_RESULTS_JSON_START")
print(json.dumps(degradation_results, indent=2))
print("G2_RESULTS_JSON_END")
print("=" * 60)

---
## G3: Layer Importance Probing

**Question:** Are gateway layers (L0, Llast) more critical for low-resource languages?

**Epistemic Value:** HIGH - tests architectural hypothesis

**Method:**
1. For each layer, apply noise/perturbation
2. Measure perplexity degradation per language
3. Compare layer importance profiles across languages
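
Step 3 compares the first and last layers against the middle of the stack. A minimal sketch of that aggregation, assuming a hypothetical helper `gateway_middle_ratio` and an invented degradation profile (not measured values):

```python
from statistics import mean

def gateway_middle_ratio(deg_by_layer, margin=2):
    """Compare first/last layer degradation with the middle of the stack.

    deg_by_layer: per-layer degradation in percent, index 0 = first layer.
    margin: layers skipped at each end when forming the middle group.
    """
    n = len(deg_by_layer)
    gateway = mean([deg_by_layer[0], deg_by_layer[-1]])
    middle = mean(deg_by_layer[margin:n - margin])
    # A ratio is only meaningful when the middle layers degrade at all.
    ratio = gateway / middle if middle > 0 else None
    return gateway, middle, ratio

# Hypothetical degradation profile for a 6-layer model (not measured).
profile = [40.0, 12.0, 10.0, 10.0, 14.0, 20.0]
print(gateway_middle_ratio(profile))  # (30.0, 10.0, 3.0)
```

The model needs more than `2 * margin` layers, otherwise the middle group is empty and `mean` raises an error. Returning `None` instead of dividing keeps a near-zero or negative middle average from producing a misleading ratio.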

In [None]:
# === G3: LAYER IMPORTANCE PROBING ===

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import torch.nn as nn
import json

print("Loading model for layer probing...")

model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained(model_name).cuda()
model.eval()

n_layers = model.config.n_layer
print(f"Model has {n_layers} layers")

# Test sentences
TEST_SENTENCES = {
    'en': "The computer processes data quickly and efficiently.",
    'de': "Der Computer verarbeitet Daten schnell und effizient.",
    'he': "המחשב מעבד נתונים במהירות וביעילות.",
    'ar': "يقوم الحاسوب بمعالجة البيانات بسرعة وكفاءة.",
}

def compute_perplexity_with_noise(model, tokenizer, text, layer_idx, noise_scale=0.1):
    """Compute perplexity with noise added to specific layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    
    # Hook that adds Gaussian noise to the block's output hidden states
    def add_noise_hook(module, input, output):
        # GPT-2 blocks usually return a tuple (hidden_states, ...); handle a bare tensor too
        if isinstance(output, tuple):
            noisy_output = output[0] + torch.randn_like(output[0]) * noise_scale
            return (noisy_output,) + output[1:]
        return output + torch.randn_like(output) * noise_scale
    
    # Register hook
    layer = model.transformer.h[layer_idx]
    handle = layer.register_forward_hook(add_noise_hook)
    
    try:
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
        ppl = torch.exp(loss).item()
    finally:
        handle.remove()
    
    return ppl

def compute_baseline_ppl(model, tokenizer, text):
    """Compute baseline perplexity without noise."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
    
    return torch.exp(loss).item()

print("\n" + "=" * 60)
print("G3: LAYER IMPORTANCE PROBING")
print("=" * 60)

torch.manual_seed(0)  # make the injected noise reproducible across runs
layer_importance = {lang: {} for lang in TEST_SENTENCES}

for lang, text in TEST_SENTENCES.items():
    print(f"\nProcessing {lang}...")
    baseline_ppl = compute_baseline_ppl(model, tokenizer, text)
    print(f"  Baseline PPL: {baseline_ppl:.2f}")
    
    for layer_idx in range(n_layers):
        noisy_ppl = compute_perplexity_with_noise(model, tokenizer, text, layer_idx, noise_scale=0.1)
        degradation = (noisy_ppl - baseline_ppl) / baseline_ppl * 100
        layer_importance[lang][f"L{layer_idx}"] = {
            'baseline_ppl': baseline_ppl,
            'noisy_ppl': noisy_ppl,
            'degradation_pct': degradation,
        }
    
    # Print gateway vs middle comparison
    gateway_layers = [0, n_layers-1]
    middle_layers = list(range(2, n_layers-2))
    
    gateway_deg = np.mean([layer_importance[lang][f"L{i}"]['degradation_pct'] for i in gateway_layers])
    middle_deg = np.mean([layer_importance[lang][f"L{i}"]['degradation_pct'] for i in middle_layers])
    
    print(f"  Gateway layers (L0, L{n_layers-1}): {gateway_deg:.1f}% avg degradation")
    print(f"  Middle layers: {middle_deg:.1f}% avg degradation")
    if middle_deg > 0:
        print(f"  Gateway/Middle ratio: {gateway_deg/middle_deg:.2f}x")
    else:
        print("  Gateway/Middle ratio: undefined (middle-layer degradation <= 0)")

# Output in parseable format
print("\n" + "=" * 60)
print("G3_RESULTS_JSON_START")
print(json.dumps(layer_importance, indent=2))
print("G3_RESULTS_JSON_END")
print("=" * 60)

# Clean up
del model
torch.cuda.empty_cache()

---
## Summary and Export

Run this cell to get a complete summary of all results for copy-paste.
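
On the local side, the pasted output can be parsed by searching for the start and end markers that each cell prints. A minimal sketch, assuming a hypothetical helper `extract_results` and a shortened sample string in place of real notebook output:

```python
import json
import re

def extract_results(pasted, name="COMPLETE_RESULTS"):
    """Return the JSON object printed between <name>_JSON_START and <name>_JSON_END."""
    pattern = rf"{re.escape(name)}_JSON_START\s*(.*?)\s*{re.escape(name)}_JSON_END"
    match = re.search(pattern, pasted, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no {name} block found in pasted output")
    return json.loads(match.group(1))

# Hypothetical pasted output, shortened for illustration.
sample = """
======
COMPLETE_RESULTS_JSON_START
{"gpu": "None", "G1_results": "Not run"}
COMPLETE_RESULTS_JSON_END
======
"""
print(extract_results(sample)["G1_results"])  # Not run
```

Passing `name="G1_RESULTS"`, `"G2_RESULTS"`, or `"G3_RESULTS"` picks out the per-experiment blocks the same way.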

In [None]:
# === FINAL SUMMARY ===

print("=" * 70)
print("GPU EXPERIMENT SUMMARY")
print(f"Timestamp: {datetime.now().isoformat()}")
print("=" * 70)

summary = {
    'timestamp': datetime.now().isoformat(),
    'gpu': torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None',
    'experiments': ['G1_tokenization', 'G2_quantization', 'G3_layer_importance'],
}

# Add results if available (NameError means the corresponding cell was not run)
try:
    summary['G1_results'] = results  # From G1
except NameError:
    summary['G1_results'] = 'Not run'

try:
    summary['G2_results'] = degradation_results  # From G2
except NameError:
    summary['G2_results'] = 'Not run'

try:
    summary['G3_results'] = layer_importance  # From G3
except NameError:
    summary['G3_results'] = 'Not run'

print("\n" + "=" * 70)
print("COMPLETE_RESULTS_JSON_START")
print(json.dumps(summary, indent=2, default=str))
print("COMPLETE_RESULTS_JSON_END")
print("=" * 70)
print("\nCopy everything between COMPLETE_RESULTS_JSON_START and COMPLETE_RESULTS_JSON_END")
print("and paste it back for local analysis.")