# DAC Codebook Projections Concatenated (12,288D)

This notebook extracts **all 12 codebook projections** and **concatenates** them instead of summing.

## Strategy:
```
Codebook 0: [time, 8] → out_proj → [time, 1024]
Codebook 1: [time, 8] → out_proj → [time, 1024]
...
Codebook 11: [time, 8] → out_proj → [time, 1024]
                            ↓
              CONCATENATE (not sum!)
                            ↓
              [time, 12,288]  (12 × 1024)
                            ↓
                   Mean pool time
                            ↓
                    [12,288D] embedding
```

**Hypothesis**: Concatenation keeps each codebook's projection in its own 1024D block instead of mixing the 12 projections by summation. If per-codebook structure carries word identity, this should cluster better than the summed 1024D latent.
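
The diagram concatenates first and pools second, while the extraction function in Step 2 pools each codebook first and concatenates afterwards. The time mean is taken independently per dimension, so both orders should yield the same vector. A minimal sketch that checks this on toy shapes (3 codebooks, 4 dimensions, 5 time steps); the random values and the helper `mean_over_time` are illustrative assumptions, not DAC outputs:

```python
import random

random.seed(0)
N_CODEBOOKS, DIM, T = 3, 4, 5  # toy stand-ins for 12, 1024, and the frame count

# proj[k][t] is the DIM-dimensional projection of codebook k at time step t
proj = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(T)]
        for _ in range(N_CODEBOOKS)]

def mean_over_time(frames):
    """Average a list of equal-length vectors element-wise."""
    return [sum(col) / len(frames) for col in zip(*frames)]

# Order A (diagram): concatenate per time step, then mean pool
concat_frames = [[x for k in range(N_CODEBOOKS) for x in proj[k][t]]
                 for t in range(T)]
concat_then_pool = mean_over_time(concat_frames)

# Order B (extraction function): mean pool each codebook, then concatenate
pool_then_concat = [x for k in range(N_CODEBOOKS)
                    for x in mean_over_time(proj[k])]

same = all(abs(a - b) < 1e-12 for a, b in zip(concat_then_pool, pool_then_concat))
print(len(concat_then_pool), same)
```

Pooling first only ever holds 12 small vectors in memory rather than a full [time, 12,288] tensor, which is a plausible reason to prefer that order in practice.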

In [1]:
import numpy as np
import torch
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, davies_bouldin_score
from tqdm import tqdm

# Import our utilities
from dac_utils import DACProcessor, SpeechCommandsLoader

print("Imports successful!")

Imports successful!


## Step 1: Initialize DAC Model

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

dac_processor = DACProcessor(model_type="16khz", device=device)

Using device: cuda
Loading DAC model (16khz)...


Model loaded successfully!
  - Sample rate: 16000Hz
  - Codebooks: 12
  - Codebook size: 1024
  - Codebook dim: 8


## Step 2: Custom Extraction Function - Concatenate All Codebook Projections

In [3]:
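@torch.no_grad()  # inference only, so skip building the autograd graph
def extract_codebook_projections_concat(dac_processor, audio_path):
    """
    Project each of the 12 codebooks to 1024D, mean pool over time, and concatenate.

    Pooling each codebook before concatenating gives the same vector as
    concatenating first and pooling afterwards, because the time mean is
    taken independently per dimension.

    Returns:
        np.ndarray of shape (12288,): 12 codebooks × 1024D each, in codebook order
    """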
    # Encode audio
    encoded = dac_processor.encode_audio(audio_path)
    codes = encoded['codes']  # [1, 12, time]
    
    # Move to device
    codes = codes.to(dac_processor.device)
    
    B, N, T = codes.shape  # batch=1, n_codebooks=12, time
    
    # Get projected embeddings from each codebook separately
    codebook_projections = []
    
    for i in range(N):
        # Get indices for this codebook
        indices = codes[:, i, :]  # [1, time]
        
        # Get quantizer for this codebook
        quantizer = dac_processor.model.quantizer.quantizers[i]
        
        # Get 8D embeddings from codebook
        z_e = quantizer.embed_code(indices)  # [1, time, 8]
        z_e = z_e.transpose(1, 2)  # [1, 8, time] - for conv
        
        # Apply out_proj: 8D → 1024D
        z_q = quantizer.out_proj(z_e)  # [1, 1024, time]
        
        # Mean pool across time
        z_q_pooled = z_q.mean(dim=2)  # [1, 1024]
        
        codebook_projections.append(z_q_pooled)
    
    # Concatenate all 12 projections
    concatenated = torch.cat(codebook_projections, dim=1)  # [1, 12288]
    
    # Convert to numpy
    vector = concatenated.squeeze(0).detach().cpu().numpy()  # [12288]
    
    return vector

print("Custom extraction function defined!")
print("This extracts 12,288D embeddings (12 codebooks × 1024D each)")

Custom extraction function defined!
This extracts 12,288D embeddings (12 codebooks × 1024D each)
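
Because the blocks are concatenated in codebook order, codebook `k` occupies indices `k * 1024` through `(k + 1) * 1024 - 1` of the embedding. A minimal sketch of slicing the blocks back out for per-codebook analysis; `split_by_codebook` is a hypothetical helper (not part of `dac_utils`), and the demo vector is a placeholder whose values equal their own index:

```python
def split_by_codebook(vector, n_codebooks=12, dim=1024):
    """Split a concatenated embedding into one block per codebook."""
    if len(vector) != n_codebooks * dim:
        raise ValueError(f"expected {n_codebooks * dim} values, got {len(vector)}")
    return [vector[k * dim:(k + 1) * dim] for k in range(n_codebooks)]

# Placeholder vector: each value is its own index, so the slicing is easy to verify
demo = list(range(12 * 1024))
blocks = split_by_codebook(demo)
print(len(blocks), len(blocks[0]), blocks[3][0])  # block 3 starts at index 3 * 1024
```

The helper only uses `len` and slicing, so it should also accept the NumPy vector returned by the extraction function.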


## Step 3: Test on Single Sample

In [4]:
# Test the extraction
loader = SpeechCommandsLoader()
file_paths, labels = loader.load_word_samples(['zero'], samples_per_word=1)

if len(file_paths) > 0:
    test_file = file_paths[0]
    print(f"Testing with: {test_file}\n")
    
    embedding = extract_codebook_projections_concat(dac_processor, test_file)
    
    print(f"Embedding shape: {embedding.shape}  # Should be (12288,)")
    print(f"\nBreakdown:")
    print(f"  12 codebooks × 1024D per codebook = 12,288D")
    print(f"\nFirst 10 values: {embedding[:10]}")
    print(f"\nStatistics:")
    print(f"  Min: {embedding.min():.4f}")
    print(f"  Max: {embedding.max():.4f}")
    print(f"  Mean: {embedding.mean():.4f}")
    print(f"  Std: {embedding.std():.4f}")
    print(f"\n✅ This preserves ALL codebook projections separately!")
else:
    print("No audio files found!")

Loaded 1 audio files from 1 words
Testing with: /data/aman/speech_commands/speech_commands_v0.02/zero/004ae714_nohash_0.wav

Embedding shape: (12288,)  # Should be (12288,)

Breakdown:
  12 codebooks × 1024D per codebook = 12,288D

First 10 values: [-1.2694625  -0.6316965  -2.4743695  -0.67949146  2.135565    1.2123672
 -0.76041275  0.5555004   1.5093648  -0.22564861]

Statistics:
  Min: -3.7400
  Max: 3.6076
  Mean: -0.0034
  Std: 0.4766

✅ This preserves ALL codebook projections separately!


## Step 4: Load Dataset - 5 Words, 10 Samples Each

In [5]:
# Select 5 words for visualization
words = ['zero', 'one', 'two', 'yes', 'no']
samples_per_word = 10

# Load audio paths
file_paths, file_labels = loader.load_word_samples(words, samples_per_word=samples_per_word)

print(f"Total samples: {len(file_paths)}")
print(f"\nLabel distribution:")
for word in words:
    count = file_labels.count(word)
    print(f"  {word}: {count} samples")

Loaded 50 audio files from 5 words
Total samples: 50

Label distribution:
  zero: 10 samples
  one: 10 samples
  two: 10 samples
  yes: 10 samples
  no: 10 samples


## Step 5: Extract Concatenated Codebook Projections for All Samples

In [6]:
# Extract embeddings for all samples
embeddings_list = []
valid_labels = []

for file_path, label in tqdm(zip(file_paths, file_labels), total=len(file_paths), desc="Extracting concatenated projections"):
    try:
        embedding = extract_codebook_projections_concat(dac_processor, file_path)
        embeddings_list.append(embedding)
        valid_labels.append(label)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        import traceback
        traceback.print_exc()
        continue

embeddings = np.array(embeddings_list)
print(f"\n" + "="*70)
print(f"Final embeddings shape: {embeddings.shape}  # Should be (50, 12288)")
print(f"Embedding dimension: {embeddings.shape[1]}D")
print(f"  = 12 codebooks × 1024D per codebook")
print(f"  = CONCATENATED (not summed!)")
print(f"="*70)

Extracting concatenated projections: 100%|██████████| 50/50 [00:00<00:00, 52.56it/s]


Final embeddings shape: (50, 12288)  # Should be (50, 12288)
Embedding dimension: 12288D
  = 12 codebooks × 1024D per codebook
  = CONCATENATED (not summed!)





## Step 6: PCA Visualizations

In [7]:
# PCA - 2D
pca_2d = PCA(n_components=2, random_state=42)
pca_2d_result = pca_2d.fit_transform(embeddings)
variance_2d = pca_2d.explained_variance_ratio_.sum()

print(f"PCA 2D variance explained: {variance_2d:.2%}")

# Create color map
color_map = {word: px.colors.qualitative.Plotly[i] for i, word in enumerate(words)}

# 2D Plot
fig_2d = go.Figure()

for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig_2d.add_trace(go.Scatter(
        x=pca_2d_result[mask, 0],
        y=pca_2d_result[mask, 1],
        mode='markers',
        name=word,
        marker=dict(size=12, color=color_map[word], opacity=0.7, line=dict(width=0.5, color='white'))
    ))

fig_2d.update_layout(
    title=f'PCA 2D: DAC Concatenated Codebook Projections (12,288D) - Variance: {variance_2d:.1%}',
    xaxis_title='PC 1',
    yaxis_title='PC 2',
    width=900,
    height=700,
    xaxis=dict(scaleanchor='y', scaleratio=1),
    template='plotly_white'
)

fig_2d.write_html('dac_codebook_proj_concat_pca_2d.html')
fig_2d.show()

print("Saved: dac_codebook_proj_concat_pca_2d.html")

PCA 2D variance explained: 37.67%


Saved: dac_codebook_proj_concat_pca_2d.html


In [8]:
# PCA - 3D
pca_3d = PCA(n_components=3, random_state=42)
pca_3d_result = pca_3d.fit_transform(embeddings)
variance_3d = pca_3d.explained_variance_ratio_.sum()

print(f"PCA 3D variance explained: {variance_3d:.2%}")

# 3D Plot
fig_3d = go.Figure()

for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig_3d.add_trace(go.Scatter3d(
        x=pca_3d_result[mask, 0],
        y=pca_3d_result[mask, 1],
        z=pca_3d_result[mask, 2],
        mode='markers',
        name=word,
        marker=dict(size=6, color=color_map[word], opacity=0.7, line=dict(width=0))
    ))

fig_3d.update_layout(
    title=f'PCA 3D: DAC Concatenated Codebook Projections (12,288D) - Variance: {variance_3d:.1%}',
    scene=dict(
        xaxis_title='PC 1',
        yaxis_title='PC 2',
        zaxis_title='PC 3',
        aspectmode='cube',
        camera=dict(eye=dict(x=1.5, y=1.5, z=1.5))
    ),
    width=900,
    height=700,
    template='plotly_white'
)

fig_3d.write_html('dac_codebook_proj_concat_pca_3d.html')
fig_3d.show()

print("Saved: dac_codebook_proj_concat_pca_3d.html")

PCA 3D variance explained: 44.09%


Saved: dac_codebook_proj_concat_pca_3d.html
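
The two variance figures are cumulative sums of per-component ratios, so their difference is the share captured by PC 3 alone. The sketch below works through that arithmetic with the values printed above; it also notes that 50 centered samples can span at most 49 directions, however many features there are. Variable names are illustrative:

```python
# Values copied from the PCA outputs above
variance_2d = 0.3767
variance_3d = 0.4409

# explained_variance_ratio_ is per component, so the difference of the
# cumulative sums is the share captured by PC 3 alone
pc3_share = variance_3d - variance_2d
unexplained_3d = 1.0 - variance_3d

# Centering removes one degree of freedom, so n samples span at most n - 1 directions
n_samples, n_features = 50, 12288
max_informative_components = min(n_samples - 1, n_features)

print(f"PC 3 alone: {pc3_share:.2%}")
print(f"Left out of the 3D plot: {unexplained_3d:.2%}")
print(f"Components with non-zero variance: at most {max_informative_components}")
```

More than half of the variance lies outside the 3D projection, so class structure that is invisible in these plots could still exist in the remaining components.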


## Step 7: t-SNE Visualizations

In [9]:
# t-SNE - 2D
perplexity = min(30, len(embeddings) - 1)
tsne_2d = TSNE(n_components=2, random_state=42, metric='cosine', perplexity=perplexity)
tsne_2d_result = tsne_2d.fit_transform(embeddings)

print(f"t-SNE 2D completed with perplexity={perplexity}")

# 2D Plot
fig_tsne_2d = go.Figure()

for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig_tsne_2d.add_trace(go.Scatter(
        x=tsne_2d_result[mask, 0],
        y=tsne_2d_result[mask, 1],
        mode='markers',
        name=word,
        marker=dict(size=12, color=color_map[word], opacity=0.7, line=dict(width=0.5, color='white'))
    ))

fig_tsne_2d.update_layout(
    title=f't-SNE 2D: DAC Concatenated Codebook Projections (12,288D) - Perplexity={perplexity}',
    xaxis_title='t-SNE 1',
    yaxis_title='t-SNE 2',
    width=900,
    height=700,
    xaxis=dict(scaleanchor='y', scaleratio=1),
    template='plotly_white'
)

fig_tsne_2d.write_html('dac_codebook_proj_concat_tsne_2d.html')
fig_tsne_2d.show()

print("Saved: dac_codebook_proj_concat_tsne_2d.html")

t-SNE 2D completed with perplexity=30


Saved: dac_codebook_proj_concat_tsne_2d.html
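
The `min(30, len(embeddings) - 1)` guard keeps the perplexity below the sample count, which recent scikit-learn versions require. A small sketch of the same rule as a reusable function; `clamp_perplexity` is a hypothetical name, and the default of 30 mirrors the value used in this notebook:

```python
def clamp_perplexity(n_samples, target=30):
    """Return the target perplexity, reduced if the dataset is too small for it."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return min(target, n_samples - 1)

for n in (50, 31, 30, 10):
    print(n, clamp_perplexity(n))
```

Perplexity roughly sets the effective neighbourhood size, so a value of 30 with only 10 samples per word reaches well beyond a single class; trying smaller values may be worthwhile when reading these plots.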


In [10]:
# t-SNE - 3D
tsne_3d = TSNE(n_components=3, random_state=42, metric='cosine', perplexity=perplexity)
tsne_3d_result = tsne_3d.fit_transform(embeddings)

print(f"t-SNE 3D completed with perplexity={perplexity}")

# 3D Plot
fig_tsne_3d = go.Figure()

for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig_tsne_3d.add_trace(go.Scatter3d(
        x=tsne_3d_result[mask, 0],
        y=tsne_3d_result[mask, 1],
        z=tsne_3d_result[mask, 2],
        mode='markers',
        name=word,
        marker=dict(size=6, color=color_map[word], opacity=0.7, line=dict(width=0))
    ))

fig_tsne_3d.update_layout(
    title=f't-SNE 3D: DAC Concatenated Codebook Projections (12,288D) - Perplexity={perplexity}',
    scene=dict(
        xaxis_title='t-SNE 1',
        yaxis_title='t-SNE 2',
        zaxis_title='t-SNE 3',
        aspectmode='cube',
        camera=dict(eye=dict(x=1.5, y=1.5, z=1.5))
    ),
    width=900,
    height=700,
    template='plotly_white'
)

fig_tsne_3d.write_html('dac_codebook_proj_concat_tsne_3d.html')
fig_tsne_3d.show()

print("Saved: dac_codebook_proj_concat_tsne_3d.html")

t-SNE 3D completed with perplexity=30


Saved: dac_codebook_proj_concat_tsne_3d.html


## Step 8: Combined Dashboard View

In [11]:
# Create a 2x2 grid
fig = make_subplots(
    rows=2, cols=2,
    specs=[
        [{'type': 'scatter'}, {'type': 'scatter3d'}],
        [{'type': 'scatter'}, {'type': 'scatter3d'}]
    ],
    subplot_titles=(
        f'PCA 2D (Var: {variance_2d:.1%})',
        f'PCA 3D (Var: {variance_3d:.1%})',
        f't-SNE 2D (Perp: {perplexity})',
        f't-SNE 3D (Perp: {perplexity})'
    ),
    vertical_spacing=0.12,
    horizontal_spacing=0.1
)

# PCA 2D
for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig.add_trace(
        go.Scatter(
            x=pca_2d_result[mask, 0],
            y=pca_2d_result[mask, 1],
            mode='markers',
            name=word,
            marker=dict(size=8, color=color_map[word], opacity=0.7),
            showlegend=True,
            legendgroup=word
        ),
        row=1, col=1
    )

# PCA 3D
for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig.add_trace(
        go.Scatter3d(
            x=pca_3d_result[mask, 0],
            y=pca_3d_result[mask, 1],
            z=pca_3d_result[mask, 2],
            mode='markers',
            name=word,
            marker=dict(size=5, color=color_map[word], opacity=0.7),
            showlegend=False,
            legendgroup=word
        ),
        row=1, col=2
    )

# t-SNE 2D
for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig.add_trace(
        go.Scatter(
            x=tsne_2d_result[mask, 0],
            y=tsne_2d_result[mask, 1],
            mode='markers',
            name=word,
            marker=dict(size=8, color=color_map[word], opacity=0.7),
            showlegend=False,
            legendgroup=word
        ),
        row=2, col=1
    )

# t-SNE 3D
for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig.add_trace(
        go.Scatter3d(
            x=tsne_3d_result[mask, 0],
            y=tsne_3d_result[mask, 1],
            z=tsne_3d_result[mask, 2],
            mode='markers',
            name=word,
            marker=dict(size=5, color=color_map[word], opacity=0.7),
            showlegend=False,
            legendgroup=word
        ),
        row=2, col=2
    )

fig.update_layout(
    title_text='DAC Concatenated Codebook Projections (12,288D) - Complete Visualization',
    title_x=0.5,
    title_font_size=20,
    width=1400,
    height=1000,
    showlegend=True,
    template='plotly_white'
)

fig.update_xaxes(title_text='PC 1', row=1, col=1, scaleanchor='y', scaleratio=1)
fig.update_yaxes(title_text='PC 2', row=1, col=1)
fig.update_xaxes(title_text='t-SNE 1', row=2, col=1, scaleanchor='y2', scaleratio=1)  # (2,1) is the second xy subplot, so its y-axis is 'y2'
fig.update_yaxes(title_text='t-SNE 2', row=2, col=1)

fig.update_scenes(aspectmode='cube', row=1, col=2)
fig.update_scenes(aspectmode='cube', row=2, col=2)

fig.write_html('dac_codebook_proj_concat_complete.html')
fig.show()

print("Saved: dac_codebook_proj_concat_complete.html")

Saved: dac_codebook_proj_concat_complete.html


## Step 9: Clustering Quality Metrics

In [12]:
# Convert labels to numeric
label_to_idx = {word: i for i, word in enumerate(words)}
numeric_labels = np.array([label_to_idx[label] for label in valid_labels])

print("=" * 80)
print("CLUSTERING QUALITY METRICS - CONCATENATED CODEBOOK PROJECTIONS (12,288D)")
print("=" * 80)

# Original embeddings
if len(np.unique(numeric_labels)) > 1 and len(embeddings) > len(np.unique(numeric_labels)):
    sil_orig = silhouette_score(embeddings, numeric_labels)
    db_orig = davies_bouldin_score(embeddings, numeric_labels)
    print(f"\n🚀 Original Embeddings (12,288D - CONCATENATED PROJECTIONS):")
    print(f"  Silhouette Score: {sil_orig:.4f}  (higher is better, range: -1 to 1)")
    print(f"  Davies-Bouldin Score: {db_orig:.4f}  (lower is better, >0)")

# PCA 2D
sil_pca2d = silhouette_score(pca_2d_result, numeric_labels)
db_pca2d = davies_bouldin_score(pca_2d_result, numeric_labels)
print(f"\nPCA 2D:")
print(f"  Silhouette Score: {sil_pca2d:.4f}")
print(f"  Davies-Bouldin Score: {db_pca2d:.4f}")

# PCA 3D
sil_pca3d = silhouette_score(pca_3d_result, numeric_labels)
db_pca3d = davies_bouldin_score(pca_3d_result, numeric_labels)
print(f"\nPCA 3D:")
print(f"  Silhouette Score: {sil_pca3d:.4f}")
print(f"  Davies-Bouldin Score: {db_pca3d:.4f}")

# t-SNE 2D
sil_tsne2d = silhouette_score(tsne_2d_result, numeric_labels)
db_tsne2d = davies_bouldin_score(tsne_2d_result, numeric_labels)
print(f"\nt-SNE 2D:")
print(f"  Silhouette Score: {sil_tsne2d:.4f}")
print(f"  Davies-Bouldin Score: {db_tsne2d:.4f}")

# t-SNE 3D
sil_tsne3d = silhouette_score(tsne_3d_result, numeric_labels)
db_tsne3d = davies_bouldin_score(tsne_3d_result, numeric_labels)
print(f"\nt-SNE 3D:")
print(f"  Silhouette Score: {sil_tsne3d:.4f}")
print(f"  Davies-Bouldin Score: {db_tsne3d:.4f}")

print("\n" + "=" * 80)
print("\n📊 COMPREHENSIVE COMPARISON - ALL DAC APPROACHES:")
print("\n1️⃣  Codebook 8D (averaging across codebooks):")
print("    Silhouette: -0.0731  |  Davies-Bouldin: 4.99")
print("\n2️⃣  Codebook 96D (concatenating 12×8D):")
print("    Silhouette: ~0.1     |  Davies-Bouldin: ~3.5  [Expected]")
print("\n3️⃣  Latent 1024D (summed projections):")
print("    Silhouette: -0.0358  |  Davies-Bouldin: 4.06")
print(f"\n4️⃣  Codebook Projections 12,288D (concatenated projections) - THIS NOTEBOOK:")
print(f"    Silhouette: {sil_orig:.4f}  |  Davies-Bouldin: {db_orig:.4f}")

print("\n💡 Key Insight:")
print("   This approach keeps each codebook's time-averaged 1024D projection in its own block.")
print("   12× more dimensions than the summed latent (12,288D vs 1024D).")
print("   Compare approaches 3 and 4: a gain here would mean per-codebook structure matters.")
print("=" * 80)

CLUSTERING QUALITY METRICS - CONCATENATED CODEBOOK PROJECTIONS (12,288D)

🚀 Original Embeddings (12,288D - CONCATENATED PROJECTIONS):
  Silhouette Score: -0.0351  (higher is better, range: -1 to 1)
  Davies-Bouldin Score: 4.0823  (lower is better, >0)

PCA 2D:
  Silhouette Score: -0.1274
  Davies-Bouldin Score: 5.0247

PCA 3D:
  Silhouette Score: -0.1082
  Davies-Bouldin Score: 5.3021

t-SNE 2D:
  Silhouette Score: -0.1001
  Davies-Bouldin Score: 3.9569

t-SNE 3D:
  Silhouette Score: -0.0976
  Davies-Bouldin Score: 5.2545


📊 COMPREHENSIVE COMPARISON - ALL DAC APPROACHES:

1️⃣  Codebook 8D (averaging across codebooks):
    Silhouette: -0.0731  |  Davies-Bouldin: 4.99

2️⃣  Codebook 96D (concatenating 12×8D):
    Silhouette: ~0.1     |  Davies-Bouldin: ~3.5  [Expected]

3️⃣  Latent 1024D (summed projections):
    Silhouette: -0.0358  |  Davies-Bouldin: 4.06

4️⃣  Codebook Projections 12,288D (concatenated projections) - THIS NOTEBOOK:
    Silhouette: -0.0351  |  Davies-Bouldin: 4.0823
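
To help read these numbers: for each sample the silhouette compares `a`, the mean distance to its own class, with `b`, the mean distance to the nearest other class, as `(b - a) / max(a, b)`. The sketch below is a plain-Python version with Euclidean distance (the default metric of `silhouette_score`), run on made-up 2D points; it is meant to illustrate the definition, not to replace the scikit-learn call:

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the other members of each class
        mean_to = {}
        for lab in set(labels):
            others = [dist(p, q) for j, q in enumerate(points)
                      if labels[j] == lab and j != i]
            if others:
                mean_to[lab] = sum(others) / len(others)
        own = labels[i]
        if own not in mean_to:  # singleton class: score defined as 0
            scores.append(0.0)
            continue
        a = mean_to[own]                                        # cohesion
        b = min(v for lab, v in mean_to.items() if lab != own)  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy 2D data: two tight, well-separated groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]   # labels match the geometry
mixed_labels = [0, 1, 0, 1, 0, 1]  # labels cut across the groups

print(f"matching labels: {silhouette(points, good_labels):.3f}")
print(f"mixed labels:    {silhouette(points, mixed_labels):.3f}")
```

Labels that follow the geometry score close to 1, while labels that cut across it land near or below 0. The slightly negative scores reported above fall in the second regime.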



## Summary

This notebook demonstrated DAC **concatenated codebook projections (12,288D)**:

### **Approach**:
- Look up the 12 codebook embeddings: each [time, 8]
- Apply each codebook's `out_proj`: 8D → 1024D
- Mean pool each projection across time (equivalent to pooling after concatenation)
- **Concatenate all 12**: 12 × 1024D = 12,288D (instead of summing to 1024D)

### **Hypothesis**:
By concatenating instead of summing, we preserve:
- ✅ Each codebook's unique projection pattern, in its own 1024D block
- ✅ Which codebook contributed which values
- ✅ Detail that summation would merge into a single 1024D vector
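
The information argument can be made concrete: summation maps different inputs to the same output, while concatenation does not. A toy sketch with 2 codebooks × 3 dimensions (made-up values; `concat` and `summed` are illustrative helpers):

```python
# Toy per-codebook pooled projections (2 codebooks x 3 dims instead of 12 x 1024)
sample_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
sample_b = [[4.0, 5.0, 6.0], [1.0, 2.0, 3.0]]  # same blocks, codebooks swapped

def concat(blocks):
    """Flatten the blocks in codebook order."""
    return [x for block in blocks for x in block]

def summed(blocks):
    """Add the blocks element-wise, as the summed latent does."""
    return [sum(col) for col in zip(*blocks)]

print(summed(sample_a) == summed(sample_b))  # the sum cannot tell them apart
print(concat(sample_a) == concat(sample_b))  # the concatenation can
```

This only shows that the concatenation retains more information; whether the extra information helps separate words is an empirical question that the metrics have to answer.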

### **Comparison Table**:

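| Approach | Dimension | Operation | Expected Quality | Silhouette | Davies-Bouldin |
|----------|-----------|-----------|------------------|------------|----------------|
| Codebook (avg) | 8D | Average codebooks | ❌ Poor | -0.0731 | 4.99 |
| Codebook (concat) | 96D | Concat 12×8D | ⚠️ Medium | ~0.1 (expected, not measured) | ~3.5 (expected, not measured) |
| Latent (summed) | 1024D | Sum 12×1024D projs | ❌ Still poor | -0.0358 | 4.06 |
| **Projections (concat)** | **12,288D** | **Concat 12×1024D projs** | **?** | **-0.0351** | **4.0823** |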

### **Key Question**:
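Does preserving each codebook's projection separately (12,288D) improve clustering vs summing them (1024D)?

**On this 50-sample set, no.** The concatenated projections score a silhouette of -0.0351 and a Davies-Bouldin of 4.0823, against -0.0358 and 4.06 for the summed latent. Both silhouettes are slightly negative, so neither representation separates the five words under Euclidean distance. 🔬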