# 🎨 GGUF Token Embedding Visualizer

**Complementary to [Transformer Explainer](https://poloclub.github.io/transformer-explainer/)** - Embedding Layer Analysis

---

## Overview

This notebook visualizes **how GGUF models represent tokens as high-dimensional vectors** and explores the **semantic structure** of the embedding space using GPU-accelerated dimensionality reduction.

### What Transformer Explainer Shows

- **Token Embedding**: Shows 768-dimensional vectors as colored rectangles
- **Positional Encoding**: Displays GPT-2's learned position embeddings
- **Combined Input**: Token + Position → Transformer input

### What This Notebook Adds

1. **Extract actual embeddings** from GGUF models (768-4096 dimensions)
2. **GPU-accelerated UMAP/t-SNE** for 2D/3D projections
3. **Semantic clustering**: Visualize similar words in embedding space
4. **Quantization impact**: Compare FP32 → Q4_K_M embedding quality
5. **Interactive 3D exploration** with Graphistry

---

## Architecture

```
GGUF Model (GPU 0)           RAPIDS + Graphistry (GPU 1)
┌──────────────────┐         ┌─────────────────────────┐
│ Token Embeddings │────────>│ cuML UMAP (GPU-accel)   │
│ (50K × d_model)  │         │ ├─ 768D → 3D projection │
│                  │         │ └─ Distance matrix      │
│ Vocab: 50,257    │         │                         │
│ Dimensions:      │         │ Graphistry 3D Plot      │
│ - Gemma: 2048    │         │ ├─ Semantic clusters    │
│ - Llama: 4096    │         │ ├─ Word similarity      │
│ - Qwen: 2048     │         │ └─ Interactive explore  │
└──────────────────┘         └─────────────────────────┘
```

---

## Learning Objectives

1. **Understand embeddings**: How models represent discrete tokens as continuous vectors
2. **Semantic structure**: Why similar words cluster together
3. **Dimensionality**: Explore 768D-4096D embedding spaces
4. **Quantization trade-offs**: Impact of Q4_K_M on embedding quality
5. **GPU acceleration**: RAPIDS cuML for fast UMAP/t-SNE

In [1]:
# Kaggle environment
import os

In [2]:
# ==============================================================================
# SECRET MANAGEMENT: Graphistry API Key
# ==============================================================================
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("Graphistry_Personal_Key_ID")
secret_value_1 = user_secrets.get_secret("Graphistry_Personal_Secret_Key")
hf_token = user_secrets.get_secret("HF_TOKEN")

In [3]:
# ==============================================================================
# Step 1: Verify Dual GPU Environment
# ==============================================================================
import subprocess
print("="*70)
print("🎮 VERIFYING DUAL TESLA T4 ENVIRONMENT")
print("="*70)
subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total,compute_cap", "--format=csv"])

🎮 VERIFYING DUAL TESLA T4 ENVIRONMENT
name, memory.total [MiB], compute_cap
Tesla T4, 15360 MiB, 7.5
Tesla T4, 15360 MiB, 7.5


CompletedProcess(args=['nvidia-smi', '--query-gpu=name,memory.total,compute_cap', '--format=csv'], returncode=0)
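
The CSV that `nvidia-smi` prints above can also be checked in code rather than by eye. The following is a minimal sketch, assuming the same `--query-gpu=name,memory.total,compute_cap --format=csv` output; `parse_gpu_csv` and `SAMPLE` are illustrative names and are not part of the notebook.

```python
import csv
import io

# Sample text copied from the nvidia-smi output shown above.
SAMPLE = """name, memory.total [MiB], compute_cap
Tesla T4, 15360 MiB, 7.5
Tesla T4, 15360 MiB, 7.5
"""

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv` text into a list of dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]
    records = rows[1:]  # skip the header row
    gpus = []
    for name, memory, compute_cap in records:
        gpus.append({
            "name": name.strip(),
            "memory_mib": int(memory.split()[0]),  # "15360 MiB" -> 15360
            "compute_cap": float(compute_cap),
        })
    return gpus

gpus = parse_gpu_csv(SAMPLE)
if len(gpus) < 2:
    raise RuntimeError("Expected two GPUs: GPU 0 for inference, GPU 1 for RAPIDS.")
print(f"Found {len(gpus)} GPUs: {[g['name'] for g in gpus]}")
```

In the notebook itself, the text could come from `subprocess.run([...], capture_output=True, text=True).stdout` instead of the hard-coded sample.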

In [4]:
# ==============================================================================
# Step 2: Install llamatelemetry v0.1.0
# ==============================================================================
print("📦 Installing dependencies...")

# Install llamatelemetry v0.1.0
!pip install -q https://github.com/llamatelemetry/llamatelemetry/releases/download/v0.1.0/llamatelemetry-v0.1.0-source.tar.gz
#!pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Install cuGraph for GPU-accelerated graph algorithms
!pip install -q --extra-index-url=https://pypi.nvidia.com "cugraph-cu12==25.6.*"

# Install Graphistry for visualization
!pip install -q "graphistry[ai]"

# Install additional utilities
!pip install -q pyarrow pandas numpy scipy huggingface_hub

# Verify installations
import llamatelemetry
print(f"\n✅ llamatelemetry {llamatelemetry.__version__} installed")

try:
    import cudf, cugraph
    print(f"✅ cuDF {cudf.__version__}")
    print(f"✅ cuGraph {cugraph.__version__}")
except ImportError as e:
    print(f"⚠️ RAPIDS: {e}")

try:
    import graphistry
    print(f"✅ Graphistry {graphistry.__version__}")
except ImportError as e:
    print(f"⚠️ Graphistry: {e}")

📦 Installing dependencies...
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for llamatelemetry (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take

🎯 llamatelemetry v0.1.0 First-Time Setup - Kaggle 2× T4 Multi-GPU

🎮 GPU Detected: Tesla T4 (Compute 7.5)
  ✅ Tesla T4 detected - Perfect for llamatelemetry v0.1.0!
🌐 Platform: Colab

📦 Downloading Kaggle 2× T4 binaries (~961 MB)...
    Features: FlashAttention + Tensor Cores + Multi-GPU tensor-split

➡️  Attempt 1: HuggingFace (llamatelemetry-v0.1.0-cuda12-kaggle-t4x2.tar.gz)
📥 Downloading v0.1.0 from HuggingFace Hub...
   Repo: waqasm86/llamatelemetry-binaries
   File: v0.1.0/llamatelemetry-v0.1.0-cuda12-kaggle-t4x2.tar.gz

🔐 Verifying SHA256 checksum...
   ✅ Checksum verified
📦 Extracting llamatelemetry-v0.1.0-cuda12-kaggle-t4x2.tar.gz...
Found 21 files in archive
Extracted 21 files to /root/.cache/llamatelemetry/extract_0.1.0
✅ Extraction complete!
  Found bin/ and lib/ under /root/.cache/llamatelemetry/extract_0.1.0/llamatelemetry-v0.1.0-cuda12-kaggle-t4x2
  Copied 13 binaries to /usr/local/lib/python3.12/dist-packages/llamatelemetry/binaries/cuda12
  Copied 2 libraries to /usr/local/lib/python3.12/dist-packages/llamatelemetry/lib
✅ Binaries installed successfully!

✅ llamatelemetry 0.1.0 installed
✅ cuDF 25.06.00
✅ cuGraph 25.06.00
✅ Graphistry 0.50.6


In [6]:
!pip install -q seaborn networkx plotly plotly-express 

In [7]:
from huggingface_hub import login
import os

# Method 1: Using the login function
login(token=hf_token)


In [8]:
import requests, numpy, pandas
print("llamatelemetry:", llamatelemetry.__version__)
print("requests:", requests.__version__)
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)


llamatelemetry: 0.1.0
requests: 2.32.5
numpy: 2.0.2
pandas: 2.2.2


In [9]:
# First, let's see what's actually available in llamatelemetry
import llamatelemetry
print(f"llamatelemetry version: {llamatelemetry.__version__}")
print("\nAvailable attributes in llamatelemetry:")
print([attr for attr in dir(llamatelemetry) if not attr.startswith('_')])

llamatelemetry version: 0.1.0

Available attributes in llamatelemetry:
['Any', 'Dict', 'InferResult', 'InferenceEngine', 'List', 'Optional', 'Path', 'ServerManager', 'bootstrap', 'check_cuda_available', 'check_gpu_compatibility', 'create_config_file', 'detect_cuda', 'find_gguf_models', 'get_cuda_device_info', 'get_llama_cpp_cuda_path', 'get_recommended_gpu_layers', 'load_config', 'logging', 'nullcontext', 'os', 'print_system_info', 'quick_infer', 'requests', 'server', 'setup_environment', 'subprocess', 'sys', 'time', 'utils', 'validate_model_path']


In [10]:
# ==============================================================================
# Step 3: Download GGUF Model (Fixed - No GGUF Parsing Errors)
# ==============================================================================

from huggingface_hub import hf_hub_download
import os

MODEL_REPO = "bartowski/Llama-3.2-3B-Instruct-GGUF"
MODEL_FILE = "Llama-3.2-3B-Instruct-Q4_K_M.gguf"

print(f"📥 Downloading {MODEL_FILE}...")

model_path = hf_hub_download(
    repo_id=MODEL_REPO,
    filename=MODEL_FILE,
    local_dir="/kaggle/working/models"
)

size_gb = os.path.getsize(model_path) / (1024**3)
print(f"\n✅ Model downloaded: {model_path}")
print(f"   Size: {size_gb:.2f} GB")

# Show file exists
print(f"\n📁 File verification:")
print(f"   File exists: {os.path.exists(model_path)}")
print(f"   File size: {size_gb:.2f} GB")

# Instead of parsing GGUF, use known architecture for Llama-3.2-3B
print("\n🔍 Using known architecture for Llama-3.2-3B:")

# Hand-entered architecture for Llama-3.2-3B (not read from the GGUF header).
# NOTE: the published model config differs on several entries: it lists
# 24 attention heads (head_dim 128) with 8 KV heads, an FFN width of 8192,
# a 131,072-token context window, and tied input/output embeddings.
ARCHITECTURE = {
    'model': 'Llama-3.2-3B-Instruct',
    'format': 'GGUF Q4_K_M',
    'layers': 28,                 # Number of transformer blocks
    'attention_heads': 32,        # Attention heads per layer (as assumed here)
    'hidden_dimension': 3072,     # Model dimension
    'vocabulary_size': 128256,    # Token vocabulary
    'context_length': 8192,       # Max context length (as assumed here)
    'feedforward_multiplier': 4,  # FFN assumed to be 4× hidden_dim
    'quantization': 'Q4_K_M',     # Quantization type
    'estimated_params': 2.8e9,    # Approximately 2.8 billion parameters
    'file_size_gb': 1.88,         # Actual file size
    'attention_dim_per_head': 96, # 3072 / 32 = 96
    'rope_theta': 500000,         # RoPE base frequency
}

print("\n📊 Architecture Summary:")
for key, value in ARCHITECTURE.items():
    if isinstance(value, (int, float)) and value >= 1000:
        print(f"   {key}: {value:,}")
    else:
        print(f"   {key}: {value}")

# Derived calculations
print("\n🧮 Derived Architecture Values:")
n_layers = ARCHITECTURE['layers']
n_heads = ARCHITECTURE['attention_heads']
hidden_dim = ARCHITECTURE['hidden_dimension']
vocab_size = ARCHITECTURE['vocabulary_size']

print(f"   Total transformer layers: {n_layers}")
print(f"   Total attention heads: {n_layers} × {n_heads} = {n_layers * n_heads:,}")
print(f"   Attention dimension per head: {hidden_dim} ÷ {n_heads} = {hidden_dim // n_heads}")
print(f"   Feed-forward hidden dimension: {hidden_dim} × {ARCHITECTURE['feedforward_multiplier']} = {hidden_dim * ARCHITECTURE['feedforward_multiplier']:,}")

# Parameter breakdown (simplified)
print("\n📈 Parameter Distribution (Approximate):")
embedding_params = vocab_size * hidden_dim
attention_params = 4 * hidden_dim * hidden_dim * n_layers  # Q, K, V, O
ffn_params = 2 * 4 * hidden_dim * hidden_dim * n_layers    # FFN (Swiglu)
output_params = hidden_dim * vocab_size                    # Output layer
total_params = embedding_params + attention_params + ffn_params + output_params

print(f"   Embedding layer: {embedding_params:,} ({embedding_params/total_params*100:.1f}%)")
print(f"   Attention layers: {attention_params:,} ({attention_params/total_params*100:.1f}%)")
print(f"   Feed-forward layers: {ffn_params:,} ({ffn_params/total_params*100:.1f}%)")
print(f"   Output layer: {output_params:,} ({output_params/total_params*100:.1f}%)")
print(f"   Total estimated: {total_params:,} parameters")

# Quantization impact
print(f"\n⚖️ Quantization Impact (Q4_K_M):")
full_precision_gb = (total_params * 4) / (1024**3)  # 4 bytes per float32
quantized_gb = size_gb
compression_ratio = full_precision_gb / quantized_gb

print(f"   Full precision (FP32): {full_precision_gb:.1f} GB")
print(f"   Quantized (Q4_K_M): {quantized_gb:.1f} GB")
print(f"   Compression ratio: {compression_ratio:.1f}×")
print(f"   Average bits per parameter: {32 / compression_ratio:.1f} bits")

print(f"\n✅ Architecture ready for visualization")
print(f"   Will visualize: {n_layers} layers × {n_heads} heads = {n_layers * n_heads:,} attention heads")

📥 Downloading Llama-3.2-3B-Instruct-Q4_K_M.gguf...

Llama-3.2-3B-Instruct-Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

✅ Model downloaded: /kaggle/working/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
   Size: 1.88 GB

📁 File verification:
   File exists: True
   File size: 1.88 GB

🔍 Using known architecture for Llama-3.2-3B:

📊 Architecture Summary:
   model: Llama-3.2-3B-Instruct
   format: GGUF Q4_K_M
   layers: 28
   attention_heads: 32
   hidden_dimension: 3,072
   vocabulary_size: 128,256
   context_length: 8,192
   feedforward_multiplier: 4
   quantization: Q4_K_M
   estimated_params: 2,800,000,000.0
   file_size_gb: 1.88
   attention_dim_per_head: 96
   rope_theta: 500,000

🧮 Derived Architecture Values:
   Total transformer layers: 28
   Total attention heads: 28 × 32 = 896
   Attention dimension per head: 3072 ÷ 32 = 96
   Feed-forward hidden dimension: 3072 × 4 = 12,288

📈 Parameter Distribution (Approximate):
   Embedding layer: 394,002,432 (10.0%)
   Attention layers: 1,056,964,608 (26.7%)
   Feed-forward layers: 2,113,929,216 (53.4%)
   Output layer: 394,002,432 (10.0%)
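
The breakdown above counts attention as four `d × d` matrices per layer and the FFN as `2 × 4 × d²`. A Llama-style block with grouped-query attention and a SwiGLU FFN is counted a little differently, as the sketch below illustrates. It assumes the values from the published Llama-3.2-3B config (24 heads, 8 KV heads, head dimension 128, FFN width 8192, tied embeddings), which are not read from the GGUF file here; `llama_param_count` is a hypothetical helper name.

```python
def llama_param_count(vocab, d_model, n_layers, n_heads, n_kv_heads,
                      head_dim, d_ff, tied_embeddings=True):
    """Count weights in a Llama-style decoder (no bias terms)."""
    embed = vocab * d_model
    q_and_o = 2 * d_model * (n_heads * head_dim)      # Q and O projections
    k_and_v = 2 * d_model * (n_kv_heads * head_dim)   # K and V shrink under GQA
    ffn = 3 * d_model * d_ff                          # SwiGLU: gate, up, down
    norms = 2 * d_model                               # two RMSNorm weights per block
    per_layer = q_and_o + k_and_v + ffn + norms
    output_head = 0 if tied_embeddings else vocab * d_model
    total = embed + n_layers * per_layer + d_model + output_head  # d_model = final norm
    return {"embedding": embed, "per_layer": per_layer, "total": total}

# Assumed values from the published model config, not from the GGUF header.
counts = llama_param_count(vocab=128_256, d_model=3072, n_layers=28,
                           n_heads=24, n_kv_heads=8, head_dim=128, d_ff=8192)
print(f"Embedding: {counts['embedding']:,}")
print(f"Per layer: {counts['per_layer']:,}")
print(f"Total:     {counts['total']:,}")
```

Under these assumptions the total comes to about 3.21 billion weights, compared with roughly 3.96 billion implied by the simplified breakdown above.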
 

In [23]:
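# Stop any llama-server left over from an earlier run.
# This cell needs an existing `server` object, so skip it on a fresh kernel.

print("\n" + "="*70)
print("🛑 CLEANUP")
print("="*70)

# Stop server
server.stop_server()
print("✅ Server stopped")


🛑 CLEANUP
✅ Server stopped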


In [None]:
#import inspect, llamatelemetry
#from llamatelemetry.server import ServerManager
#print(inspect.getsource(ServerManager.start_server))


In [24]:
# ==============================================================================
# Step 4: Start llama-server on GPU 0 Only
# ==============================================================================

from llamatelemetry.server import ServerManager

print("="*70)
print("🚀 STARTING LLAMA-SERVER ON GPU 0")
print("="*70)

print("\n📋 Configuration:")
print("   GPU 0: 100% (llama-server for model inference)")
print("   GPU 1: 0% (reserved for RAPIDS/Graphistry)")
print("   Model: Llama-3.2-3B-Instruct (Q4_K_M)")
print("   Context: 4096 tokens")

server = ServerManager()
server.start_server(
    model_path=model_path,
    host="127.0.0.1",
    port=8090,
    gpu_layers=99,
    tensor_split="1.0,0.0",
    ctx_size=4096,
    flash_attn=1,
    embeddings=True,   # enable the embeddings endpoints
    verbose=False,
    pooling="mean"
)

if server.check_server_health():
    print("\n✅ llama-server running on GPU 0!")
    print("   URL: http://127.0.0.1:8090")
else:
    print("\n❌ Server failed to start")

🚀 STARTING LLAMA-SERVER ON GPU 0

📋 Configuration:
   GPU 0: 100% (llama-server for model inference)
   GPU 1: 0% (reserved for RAPIDS/Graphistry)
   Model: Llama-3.2-3B-Instruct (Q4_K_M)
   Context: 4096 tokens

✅ llama-server running on GPU 0!
   URL: http://127.0.0.1:8090


In [25]:
# ==============================================================================
# Step 4b Pre-check: Ensure /v1/embeddings is active
# ==============================================================================
import requests

SERVER = "http://127.0.0.1:8090"
MODEL = "Llama-3.2-3B-Instruct-Q4_K_M.gguf"

r = requests.post(f"{SERVER}/v1/embeddings", json={"input": "test", "model": MODEL}, timeout=5)
print("status:", r.status_code)
print("body:", r.text[:200])

if r.status_code != 200:
    raise RuntimeError(
        "❌ /v1/embeddings not active.\n"
        "Make sure server was started with embeddings=True and pooling='mean'."
    )


status: 200
body: {"model":"Llama-3.2-3B-Instruct-Q4_K_M.gguf","object":"list","usage":{"prompt_tokens":2,"total_tokens":2},"data":[{"embedding":[-0.010422893799841404,-0.016974087804555893,0.005642905365675688,-0.0046


In [26]:
# ==============================================================================
# Step 5: Extract Token Embeddings (SDK-only, no native fallback)
# ==============================================================================
from llamatelemetry.embeddings import EmbeddingEngine
from llamatelemetry import InferenceEngine
import numpy as np
import pandas as pd
import requests

print("="*70)
print("📊 EXTRACTING TOKEN EMBEDDINGS (SDK-ONLY)")
print("="*70)

test_words = [
    "red","blue","green","yellow","orange","purple",
    "cat","dog","bird","fish","lion","tiger",
    "computer","software","algorithm","neural","network","GPU",
    "happy","sad","angry","excited","calm","peaceful",
    "one","two","three","four","five","six",
    "run","jump","swim","fly","walk","dance",
    "USA","China","India","France","Germany","Japan"
]

# SDK setup
engine = InferenceEngine(server_url="http://127.0.0.1:8090")
embedder = EmbeddingEngine(engine, pooling="mean", normalize=True)

# Ensure /v1/embeddings is active (OAI-compatible)
r = requests.post(
    "http://127.0.0.1:8090/v1/embeddings",
    json={"input": "test", "model": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"},
    timeout=5
)
if r.status_code != 200:
    raise RuntimeError(
        "❌ /v1/embeddings is not active or not OAI-compatible.\n"
        "Start llama-server with embeddings=True and pooling='mean'."
    )

embeddings = []
valid_words = []

for w in test_words:
    emb = embedder.embed(w)
    embeddings.append(emb)
    valid_words.append(w)

embeddings_array = np.vstack(embeddings)

print(f"✅ Extracted {len(embeddings_array)} embeddings")
print(f"   Shape: {embeddings_array.shape}")

embeddings_df = pd.DataFrame(embeddings_array)
embeddings_df["word"] = valid_words
cats = ["Colors","Animals","Technology","Emotions","Numbers","Verbs","Countries"]
embeddings_df["category"] = [cats[i//6] for i in range(len(valid_words))]


📊 EXTRACTING TOKEN EMBEDDINGS (SDK-ONLY)
✅ Extracted 42 embeddings
   Shape: (42, 3072)
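
Step 5 requests `pooling="mean"` with `normalize=True`, and the earlier test request reported two prompt tokens for a single word. Each 3072-dimensional vector is therefore presumably an average over several per-token vectors, scaled to unit length, rather than a single row of the embedding table. The sketch below illustrates only that arithmetic on toy data; it is not llama-server's implementation, and `mean_pool` and `l2_normalize` are illustrative names.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into a single vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return vector if norm == 0 else [x / norm for x in vector]

# Toy example: two tokens with 3-dimensional vectors (hypothetical values).
token_vectors = [[1.0, 2.0, 2.0], [3.0, 0.0, 2.0]]
pooled = mean_pool(token_vectors)
unit = l2_normalize(pooled)
unit_norm = math.sqrt(sum(x * x for x in unit))
print("pooled:", pooled)
print("unit:  ", unit)
print("norm:  ", unit_norm)
```

Normalizing every vector this way would explain why all L2 norms in Step 6 come out as 1.0000.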


In [27]:
# ==============================================================================
# Step 6: Analyze Embedding Statistics (Robust + Safe)
# ==============================================================================
print("="*70)
print("📈 EMBEDDING STATISTICS")
print("="*70)

# Guard
if 'embeddings_array' not in globals() or embeddings_array is None or len(embeddings_array) == 0:
    raise RuntimeError("Embeddings not available. Run Step 5 first.")

# Basic statistics
print(f"\nBasic Statistics:")
print(f"Mean: {embeddings_array.mean():.4f}")
print(f"Std:  {embeddings_array.std():.4f}")
print(f"Min:  {embeddings_array.min():.4f}")
print(f"Max:  {embeddings_array.max():.4f}")

# L2 norms
norms = np.linalg.norm(embeddings_array, axis=1)
print(f"\n📊 L2 Norms:")
print(f"  Mean: {norms.mean():.4f}")
print(f"  Std:  {norms.std():.4f}")
print(f"  Min:  {norms.min():.4f}")
print(f"  Max:  {norms.max():.4f}")

# Pairwise cosine similarities
from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(embeddings_array)

print(f"\n📊 Cosine Similarity Matrix Statistics:")
print(f"  Mean: {sim_matrix.mean():.4f}")
print(f"  Std:  {sim_matrix.std():.4f}")
print(f"  Min:  {sim_matrix.min():.4f}")
print(f"  Max:  {sim_matrix.max():.4f}")

# Categories (safe assignment)
categories = ["Colors","Animals","Technology","Emotions","Numbers","Verbs","Countries"]
category_groups = {}

for idx, word in enumerate(valid_words):
    category_idx = idx // 6
    category = categories[category_idx] if category_idx < len(categories) else "Other"
    category_groups.setdefault(category, []).append((word, idx))

# Most similar pairs within each category
print(f"\n🔍 Similar Word Pairs by Category:")
sim_matrix_copy = sim_matrix.copy()
np.fill_diagonal(sim_matrix_copy, -1)

for category, words_indices in category_groups.items():
    print(f"\n  {category}:")
    indices = [idx for _, idx in words_indices]
    words = [word for word, _ in words_indices]

    if len(indices) > 1:
        sub_matrix = sim_matrix_copy[np.ix_(indices, indices)]
        max_val = sub_matrix.max()
        if max_val > -1:
            max_pos = np.unravel_index(np.argmax(sub_matrix), sub_matrix.shape)
            word1 = words[max_pos[0]]
            word2 = words[max_pos[1]]
            print(f"    Most similar: '{word1}' ↔ '{word2}': {max_val:.3f}")
    else:
        print("    Only one word in this category")

# Cross-category similarities (top 5)
print(f"\n🔍 Most Similar Cross-Category Pairs:")
cross_pairs = []
for i in range(len(valid_words)):
    for j in range(i+1, len(valid_words)):
        cat_i = categories[i // 6] if (i // 6) < len(categories) else "Other"
        cat_j = categories[j // 6] if (j // 6) < len(categories) else "Other"
        if cat_i != cat_j:
            sim = sim_matrix_copy[i, j]
            if sim > 0.3:
                cross_pairs.append((valid_words[i], valid_words[j], sim, cat_i, cat_j))

cross_pairs.sort(key=lambda x: x[2], reverse=True)
for word1, word2, sim, cat1, cat2 in cross_pairs[:5]:
    print(f"  '{word1}' ({cat1}) ↔ '{word2}' ({cat2}): {sim:.3f}")

# Category-wise statistics
print(f"\n📊 Category-wise Statistics:")
for category, words_indices in category_groups.items():
    indices = [idx for _, idx in words_indices]
    if len(indices) > 1:
        sub_matrix = sim_matrix[np.ix_(indices, indices)]
        mask = np.triu(np.ones_like(sub_matrix), k=1).astype(bool)
        intra = sub_matrix[mask]
        if len(intra) > 0:
            print(f"  {category}: mean={intra.mean():.3f}, std={intra.std():.3f}, n={len(indices)}")

print(f"\n✅ Embedding analysis completed!")


📈 EMBEDDING STATISTICS

Basic Statistics:
Mean: 0.0005
Std:  0.0180
Min:  -0.3472
Max:  0.2864

📊 L2 Norms:
  Mean: 1.0000
  Std:  0.0000
  Min:  1.0000
  Max:  1.0000

📊 Cosine Similarity Matrix Statistics:
  Mean: 0.5884
  Std:  0.1170
  Min:  0.3322
  Max:  1.0000

🔍 Similar Word Pairs by Category:

  Colors:
    Most similar: 'yellow' ↔ 'purple': 0.778

  Animals:
    Most similar: 'cat' ↔ 'dog': 0.787

  Technology:
    Most similar: 'computer' ↔ 'software': 0.771

  Emotions:
    Most similar: 'happy' ↔ 'sad': 0.775

  Numbers:
    Most similar: 'two' ↔ 'three': 0.956

  Verbs:
    Most similar: 'jump' ↔ 'walk': 0.786

  Countries:
    Most similar: 'France' ↔ 'Germany': 0.905

🔍 Most Similar Cross-Category Pairs:
  'network' (Technology) ↔ 'jump' (Verbs): 0.760
  'peaceful' (Emotions) ↔ 'dance' (Verbs): 0.730
  'network' (Technology) ↔ 'walk' (Verbs): 0.721
  'computer' (Technology) ↔ 'dance' (Verbs): 0.718
  'one' (Numbers) ↔ 'dance' (Ve
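
In the output above, the strongest cross-category pairs (0.72 to 0.76) score almost as high as the best pairs inside most categories (0.77 to 0.79). One way to summarize how well categories separate is the gap between the mean similarity within categories and the mean similarity across them. The sketch below shows that calculation on toy data using only the standard library; `category_separation` is a hypothetical helper, and in the notebook the same idea could be applied to `embeddings_array` and the category labels.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def category_separation(vectors, labels):
    """Return (mean intra-category similarity, mean inter-category similarity)."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        (intra if labels[i] == labels[j] else inter).append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Toy data (hypothetical 2D vectors, not model output).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
intra_mean, inter_mean = category_separation(vectors, labels)
print(f"intra={intra_mean:.3f}  inter={inter_mean:.3f}  gap={intra_mean - inter_mean:.3f}")
```

A gap close to zero would suggest that the categories are not well separated in the embedding space.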

In [28]:
# ==============================================================================
# Step 7: GPU-Accelerated UMAP Dimensionality Reduction (GPU 1)
# ==============================================================================
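# Restrict this process to GPU 1. This only takes effect if no CUDA context
# has been created yet in this process (for example by earlier cudf/cupy work).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"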

print("="*70)
print("🚀 GPU-ACCELERATED UMAP (GPU 1)")
print("="*70)

from cuml import UMAP
import cupy as cp

# Get the original dimension from embeddings_array
original_dim = embeddings_array.shape[1]

# Transfer embeddings to GPU
embeddings_gpu = cp.array(embeddings_array)

# UMAP to 3D (GPU-accelerated)
umap = UMAP(n_components=3, n_neighbors=15, min_dist=0.1, random_state=42)
embeddings_3d = umap.fit_transform(embeddings_gpu)

# Convert back to CPU for visualization
embeddings_3d_cpu = cp.asnumpy(embeddings_3d)

print(f"\n✅ Reduced {original_dim}D → 3D")
print(f"   Shape: {embeddings_3d_cpu.shape}")

🚀 GPU-ACCELERATED UMAP (GPU 1)
[2026-02-06 14:54:16.546] [CUML] [info] build_algo set to brute_force_knn because random_state is given

✅ Reduced 3072D → 3D
   Shape: (42, 3)


In [29]:
# ==============================================================================
# Step 8: Prepare Visualization Data
# ==============================================================================
print("="*70)
print("📊 PREPARING VISUALIZATION DATA")
print("="*70)

# Create DataFrame with embeddings and metadata
viz_df = pd.DataFrame({
    'word': valid_words,
    'x': embeddings_3d_cpu[:, 0],
    'y': embeddings_3d_cpu[:, 1],
    'z': embeddings_3d_cpu[:, 2],
    'norm': norms[:len(valid_words)]
})

# Add semantic categories
categories = []
for word in valid_words:
    if word in ["red", "blue", "green", "yellow", "orange", "purple"]:
        categories.append("color")
    elif word in ["cat", "dog", "bird", "fish", "lion", "tiger"]:
        categories.append("animal")
    elif word in ["computer", "software", "algorithm", "neural", "network", "GPU"]:
        categories.append("technology")
    elif word in ["happy", "sad", "angry", "excited", "calm", "peaceful"]:
        categories.append("emotion")
    elif word in ["one", "two", "three", "four", "five", "six"]:
        categories.append("number")
    elif word in ["run", "jump", "swim", "fly", "walk", "dance"]:
        categories.append("verb")
    elif word in ["USA", "China", "India", "France", "Germany", "Japan"]:
        categories.append("country")
    else:
        categories.append("other")

viz_df['category'] = categories

print(f"\n✅ Visualization data ready")
print(viz_df.head())

print(f"\nCategories:")
print(viz_df['category'].value_counts())

📊 PREPARING VISUALIZATION DATA

✅ Visualization data ready
     word         x         y         z  norm category
0     red -0.160643  1.772264  0.270527   1.0    color
1    blue  0.148235  1.963772  0.361129   1.0    color
2   green -0.208186  1.687818 -0.302713   1.0    color
3  yellow -0.197520  2.141205  0.219569   1.0    color
4  orange  0.213803  2.369639  0.133816   1.0    color

Categories:
category
color         6
animal        6
technology    6
emotion       6
number        6
verb          6
country       6
Name: count, dtype: int64


In [42]:
# ==============================================================================
# Step 9: Single Combined Visualization (3D + 2D side-by-side)
# ==============================================================================
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import numpy as np

print("="*70)
print("🎨 CREATING COMBINED VISUALIZATION (3D + 2D)")
print("="*70)

pio.renderers.default = "iframe_connected"

model_name = ARCHITECTURE.get('model', 'Llama-3.2-3B-Instruct')
model_format = ARCHITECTURE.get('format', 'GGUF Q4_K_M')

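# Scale marker size by embedding norm. The embeddings here are unit-normalized
# (all norms are 1.0), so sizes carry no information beyond floating-point noise.
norms = viz_df["norm"].values
size_scaled = 6 + 14 * (norms - norms.min()) / (np.ptp(norms) + 1e-8)
viz_df["size_scaled"] = size_scaled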

# Create subplots: 3D (left) + 2D (right)
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{'type': 'scene'}, {'type': 'xy'}]],
    subplot_titles=('3D UMAP Projection', '2D UMAP Projection'),
    horizontal_spacing=0.12
)

color_palette = px.colors.qualitative.Vivid

# Add 3D trace by category
for i, category in enumerate(sorted(viz_df['category'].unique())):
    cat_df = viz_df[viz_df['category'] == category]
    fig.add_trace(
        go.Scatter3d(
            x=cat_df['x'],
            y=cat_df['y'],
            z=cat_df['z'],
            mode='markers+text',
            text=cat_df['word'],
            name=category,
            marker=dict(
                size=cat_df['size_scaled'],
                color=color_palette[i % len(color_palette)],
                line=dict(width=0.6, color='white'),
                opacity=0.9
            ),
            textposition='top center',
            showlegend=True
        ),
        row=1, col=1
    )

# Add 2D trace by category (x and y are the first two components of the 3D projection)
for i, category in enumerate(sorted(viz_df['category'].unique())):
    cat_df = viz_df[viz_df['category'] == category]
    fig.add_trace(
        go.Scatter(
            x=cat_df['x'],
            y=cat_df['y'],
            mode='markers+text',
            text=cat_df['word'],
            name=category,
            marker=dict(
                size=cat_df['size_scaled'],
                color=color_palette[i % len(color_palette)],
                line=dict(width=0.6, color='white'),
                opacity=0.9
            ),
            textposition='top center',
            showlegend=False
        ),
        row=1, col=2
    )

fig.update_layout(
    title_text=f'{model_name} Token Embeddings (3D + 2D)',
    height=650,
    showlegend=True,
    legend=dict(
        title="Category",
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=0.01
    )
)

fig.update_scenes(
    xaxis_title='UMAP 1',
    yaxis_title='UMAP 2',
    zaxis_title='UMAP 3',
    row=1, col=1
)

fig.update_xaxes(title_text='UMAP 1', row=1, col=2)
fig.update_yaxes(title_text='UMAP 2', row=1, col=2)

fig.show()


🎨 CREATING COMBINED VISUALIZATION (3D + 2D)


In [31]:
## Debugging Step only

import requests, json
SERVER="http://127.0.0.1:8090"
r = requests.post(f"{SERVER}/v1/embeddings", json={"input": "test", "model": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"})
print("status:", r.status_code)
print("body:", r.text[:400])

print("--------------------------------------------------------------------------------")

import requests
SERVER="http://127.0.0.1:8090"
r = requests.post(f"{SERVER}/embedding", json={"content": "test", "pooling": "mean"})
print("status:", r.status_code)
print("body:", r.text[:400])

print("--------------------------------------------------------------------------------")

!/usr/local/lib/python3.12/dist-packages/llamatelemetry/binaries/cuda12/llama-embedding \
  -m /kaggle/working/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p "test" | head -n 5


status: 200
body: {"model":"Llama-3.2-3B-Instruct-Q4_K_M.gguf","object":"list","usage":{"prompt_tokens":2,"total_tokens":2},"data":[{"embedding":[-0.010422893799841404,-0.016974087804555893,0.005642905365675688,-0.0046997517347335815,0.010445218533277512,0.0010216771624982357,0.0057672252878546715,0.020786091685295105,-0.02553054876625538,-0.025748029351234436,-0.03341703489422798,-0.006107734981924295,0.0243347603
--------------------------------------------------------------------------------
status: 200
body: [{"index":0,"embedding":[[-0.010422893799841404,-0.016974087804555893,0.005642905365675688,-0.0046997517347335815,0.010445218533277512,0.0010216771624982357,0.0057672252878546715,0.020786091685295105,-0.02553054876625538,-0.025748029351234436,-0.03341703489422798,-0.006107734981924295,0.024334760382771492,0.01650906354188919,-0.016228539869189262,-0.004944522399455309,-0.00352452858351171,0.003654
--------------------------------------------------------------------------------
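
The two responses above carry the same vector in different shapes: `/v1/embeddings` wraps it in an object with a `data` list, while `/embedding` returns a bare list whose `embedding` field is nested one level deeper. The sketch below handles both, assuming only the shapes visible in the truncated output above; `extract_embedding` is a hypothetical helper and the sample bodies are shortened stand-ins.

```python
import json

def extract_embedding(body):
    """Return one embedding vector from either response shape shown above."""
    payload = json.loads(body)
    if isinstance(payload, dict):
        # /v1/embeddings: {"data": [{"embedding": [...]}], ...}
        vector = payload["data"][0]["embedding"]
    else:
        # /embedding: [{"index": 0, "embedding": [[...]]}]
        vector = payload[0]["embedding"]
    # The native endpoint nests the pooled vector inside an outer list.
    if vector and isinstance(vector[0], list):
        vector = vector[0]
    return vector

# Shortened stand-ins for the real response bodies.
oai_body = '{"object": "list", "data": [{"embedding": [0.1, 0.2, 0.3]}]}'
native_body = '[{"index": 0, "embedding": [[0.1, 0.2, 0.3]]}]'

oai_vector = extract_embedding(oai_body)
native_vector = extract_embedding(native_body)
print(oai_vector == native_vector, oai_vector)
```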


---

## 🎯 Key Insights

### Semantic Clustering

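**Expected Observations:**

1. **Category Clustering**: Words from the same semantic category (e.g., colors) cluster together
2. **Synonyms Close**: Closely related words have high cosine similarity; in the run above, the top within-category pairs scored between 0.77 and 0.96
3. **Antonyms Apart**: Opposite meanings are expected to occupy different regions, although in the run above 'happy' and 'sad' were the most similar pair among the emotion words (0.775)
4. **Hierarchical Structure**: Broader categories contain subclusters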

### Comparison with Transformer Explainer

| Feature | Transformer Explainer | This Notebook |
|---------|-----------------------|---------------|
| **Embeddings** | Shows 768D vectors as rectangles | **3D UMAP projection** |
| **Positional** | Learned position embeddings (GPT-2) | Not visualized (focus on tokens) |
| **Interactivity** | Fixed web interface | **3D rotate/zoom + Graphistry** |
| **Semantic Analysis** | Not shown | **Cosine similarity network** |
| **Quantization** | FP32 only | **Q4_K_M quantized embeddings** |
| **Vocabulary Size** | GPT-2 (50,257) | **GGUF (varies by model)** |

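### Quantization Impact

**Q4_K_M vs FP32 (expected, not measured in this notebook):**
- **Precision**: about 4.85 bits/weight vs 32 bits
- **Similarity Preservation**: Cosine similarities are expected to be mostly preserved
- **Clustering**: Semantic clusters are expected to remain intact
- **Trade-off**: roughly 6.6× smaller than FP32 (32 / 4.85 ≈ 6.6); the stated <1% accuracy loss is not measured here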
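
The 6.6× figure follows directly from the bits-per-weight numbers. The sketch below works through that arithmetic; the parameter count passed in is a round illustrative number rather than a measured one, and both function names are hypothetical.

```python
def compression_ratio(bits_per_weight, baseline_bits=32):
    """How many times smaller the weights are than at the baseline precision."""
    return baseline_bits / bits_per_weight

def estimated_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

ratio = compression_ratio(4.85)
print(f"FP32 -> 4.85 bits/weight: {ratio:.1f}x smaller")

# 3e9 is a round illustrative parameter count, not a measured one.
print(f"3e9 weights at 4.85 bits: {estimated_size_gb(3e9, 4.85):.2f} GB")
print(f"3e9 weights at 32 bits:   {estimated_size_gb(3e9, 32):.2f} GB")
```

A real GGUF file will usually be somewhat larger than this estimate, since it also stores metadata and the tokenizer, and mixed quantization schemes keep some tensors at higher precision.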

---

## 🔬 Advanced Analysis

### Embedding Space Geometry

```python
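# Intrinsic dimensionality estimation
import numpy as np
from sklearn.decomposition import PCA

# PCA cannot return more components than min(n_samples, n_features)
n_components = min(50, *embeddings_array.shape)
pca = PCA(n_components=n_components)
pca.fit(embeddings_array)
explained_var = pca.explained_variance_ratio_.cumsum()
# argmax gives a 0-based index, so add 1 (assumes 95% is reached within n_components)
print(f"Dimensions for 95% variance: {np.argmax(explained_var >= 0.95) + 1}")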
```

### Analogies (King - Man + Woman ≈ Queen)

```python
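# Test word analogies against the running llama-server
import numpy as np
import requests

SERVER = "http://127.0.0.1:8090"
MODEL = "Llama-3.2-3B-Instruct-Q4_K_M.gguf"

def get_embedding(word):
    r = requests.post(f"{SERVER}/v1/embeddings", json={"input": word, "model": MODEL}, timeout=30)
    r.raise_for_status()
    return np.array(r.json()["data"][0]["embedding"])

king = get_embedding("king")
man = get_embedding("man")
woman = get_embedding("woman")
result = king - man + woman

# Compare result to get_embedding("queen") with cosine similarity
queen = get_embedding("queen")
print(result @ queen / (np.linalg.norm(result) * np.linalg.norm(queen)))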
```

---

## 🛠️ Customization Tips

### Add More Words
```python
# Also append a category name to `cats`: Step 5 labels by position (cats[i // 6])
test_words += ["science", "math", "physics", "biology"]
```
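
Because Step 5 assigns categories by position (`cats[i // 6]`), appending four words without extending `cats` would raise an `IndexError`, and uneven groups would be mislabelled. A possible alternative, sketched below, keeps words grouped under their category so that labels never depend on list position. The `word_groups` dictionary and the "Science" category are illustrative and do not appear in the notebook.

```python
# Hypothetical replacement for positional labelling: group words by category.
word_groups = {
    "Colors": ["red", "blue", "green"],
    "Animals": ["cat", "dog"],
    "Science": ["science", "math", "physics", "biology"],  # new, uneven group
}

# Flatten into parallel lists while preserving insertion order.
test_words = [word for words in word_groups.values() for word in words]
word_to_category = {word: cat for cat, words in word_groups.items() for word in words}
labels = [word_to_category[word] for word in test_words]

for word, label in zip(test_words, labels):
    print(f"{word:10s} {label}")
```

With this layout, `labels` could be assigned to the `category` column directly, and groups may have any size.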

### Adjust UMAP Parameters
```python
from cuml import UMAP

umap = UMAP(
    n_components=3,
    n_neighbors=30,    # Higher = smoother manifold
    min_dist=0.05,     # Lower = tighter clusters
    metric='cosine'    # Use cosine distance
)
```

### Change Similarity Threshold
```python
threshold = 0.5  # More edges (lower threshold)
```
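
The cell that consumes `threshold` is not part of this section; presumably it filters the pairwise cosine similarities into the edges of a similarity network. Under that assumption, the sketch below shows how lowering the threshold admits more edges. `similarity_edges` is a hypothetical helper, and the matrix holds toy values rather than model output.

```python
def similarity_edges(words, sim, threshold):
    """List (word_a, word_b, similarity) for every pair at or above threshold."""
    edges = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):  # upper triangle: no self-pairs or duplicates
            if sim[i][j] >= threshold:
                edges.append((words[i], words[j], sim[i][j]))
    return edges

# Toy similarity matrix (hypothetical values, not model output).
words = ["cat", "dog", "GPU"]
sim = [
    [1.0, 0.8, 0.4],
    [0.8, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]

for threshold in (0.7, 0.5):
    edges = similarity_edges(words, sim, threshold)
    print(f"threshold={threshold}: {len(edges)} edges -> {edges}")
```

The same function should also accept the NumPy `sim_matrix` from Step 6, since it only uses `sim[i][j]` indexing.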

---

## 📚 Next Notebooks

- **Notebook 14**: Layer-by-Layer Inference Tracker
- **Notebook 15**: Multi-Head Attention Comparator
- **Notebook 16**: Quantization Impact Analyzer