# DAC Latent Representation (1024D) Visualization

This notebook uses the **latent representation `z`** from DAC for visualization.

## Why Latent `z`?
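- **1024D**: Highest-dimensional of the three DAC representations compared here (128× the 8D averaged codebook embedding, roughly 10.7× the 96D concatenated one)
- **Full audio features**: The model's internal representation after the encoder and quantizer
- **Hypothesis for clustering**: Expected to separate words better than the codebook embeddings; Step 9 reports the measured scores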

## Strategy:
```
audio → encoder → quantizer → z [1, 1024, time]
                                ↓
                        mean pool time
                                ↓
                        embedding [1024D]
```
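
As a minimal sketch of the pooling step in the diagram, the snippet below uses a random NumPy array as a stand-in for `z` (the array, its length of 50 frames, and the seed are assumptions for illustration; the notebook itself pools a torch tensor with `z.mean(dim=2)`). It also shows one property worth keeping in mind when reading the clustering scores: mean pooling ignores the order of the frames.

```python
import numpy as np

# Stand-in for the DAC latent: random values laid out as [batch, channels, time].
rng = np.random.default_rng(0)
z = rng.standard_normal((1, 1024, 50))

# Mean pool over the time axis, then drop the batch axis -> one vector per clip.
embedding = z.mean(axis=2).squeeze(0)

# Reversing the frame order leaves the pooled vector unchanged,
# so any information carried by temporal order is lost at this step.
embedding_reversed = z[:, :, ::-1].mean(axis=2).squeeze(0)

print(embedding.shape)
print(np.allclose(embedding, embedding_reversed))
```

Two clips that contain the same frames in a different order would therefore map to the same embedding under this strategy.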

In [1]:
import sys
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, davies_bouldin_score
from tqdm import tqdm

# Import our utilities
from dac_utils import DACProcessor, SpeechCommandsLoader

print("Imports successful!")

Imports successful!


## Step 1: Initialize DAC Model

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

dac_processor = DACProcessor(model_type="16khz", device=device)

Using device: cuda
Loading DAC model (16khz)...




Model loaded successfully!
  - Sample rate: 16000Hz
  - Codebooks: 12
  - Codebook size: 1024
  - Codebook dim: 8


## Step 2: Custom Extraction Function Using Latent `z`

In [3]:
def extract_latent_embedding(dac_processor, audio_path):
    """
    Extract DAC latent representation z
    
    Returns:
        1024D vector (latent representation averaged over time)
    """
    # Encode audio
    encoded = dac_processor.encode_audio(audio_path)
    z = encoded['z']  # [1, 1024, time]
    
    # Mean pool across time dimension only
    time_pooled = z.mean(dim=2)  # [1, 1024]
    
    # Convert to numpy
    vector = time_pooled.squeeze(0).detach().cpu().numpy()  # [1024]
    
    return vector

print("Latent extraction function defined!")

Latent extraction function defined!


## Step 3: Test on Single Sample

In [4]:
# Test the extraction
loader = SpeechCommandsLoader()
file_paths, labels = loader.load_word_samples(['zero'], samples_per_word=1)

if len(file_paths) > 0:
    test_file = file_paths[0]
    print(f"Testing with: {test_file}\n")
    
    embedding = extract_latent_embedding(dac_processor, test_file)
    
    print(f"Embedding shape: {embedding.shape}  # Should be (1024,)")
    print(f"\nFirst 10 values: {embedding[:10]}")
    print(f"\nStatistics:")
    print(f"  Min: {embedding.min():.4f}")
    print(f"  Max: {embedding.max():.4f}")
    print(f"  Mean: {embedding.mean():.4f}")
    print(f"  Std: {embedding.std():.4f}")
    print(f"\n✅ This is the RICHEST representation - 1024D!")
else:
    print("No audio files found!")

Loaded 1 audio files from 1 words
Testing with: /data/aman/speech_commands/speech_commands_v0.02/zero/004ae714_nohash_0.wav

Embedding shape: (1024,)  # Should be (1024,)

First 10 values: [-0.09022599 -1.2304747  -4.6205683  -1.6760498   1.7532446   1.8854003
 -2.3499246   0.16611435  1.1464992   0.31649902]

Statistics:
  Min: -7.3352
  Max: 6.8034
  Mean: -0.0403
  Std: 2.0114

✅ This is the RICHEST representation - 1024D!


## Step 4: Load Dataset - 5 Words, 10 Samples Each

In [5]:
# Select 5 words for visualization
words = ['zero', 'one', 'two', 'yes', 'no']
samples_per_word = 10

# Load audio paths
file_paths, file_labels = loader.load_word_samples(words, samples_per_word=samples_per_word)

print(f"Total samples: {len(file_paths)}")
print(f"\nLabel distribution:")
for word in words:
    count = file_labels.count(word)
    print(f"  {word}: {count} samples")

Loaded 50 audio files from 5 words
Total samples: 50

Label distribution:
  zero: 10 samples
  one: 10 samples
  two: 10 samples
  yes: 10 samples
  no: 10 samples


## Step 5: Extract Latent Embeddings for All Samples

In [6]:
# Extract latent embeddings for all samples
embeddings_list = []
valid_labels = []

for file_path, label in tqdm(zip(file_paths, file_labels), total=len(file_paths), desc="Extracting latent embeddings"):
    try:
        embedding = extract_latent_embedding(dac_processor, file_path)
        embeddings_list.append(embedding)
        valid_labels.append(label)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        continue

embeddings = np.array(embeddings_list)
print("\n" + "=" * 60)
print(f"Final embeddings shape: {embeddings.shape}  # Should be (50, 1024)")
print(f"Embedding dimension: {embeddings.shape[1]}D (LATENT REPRESENTATION)")
print("=" * 60)

Extracting latent embeddings: 100%|██████████| 50/50 [00:00<00:00, 63.41it/s]


Final embeddings shape: (50, 1024)  # Should be (50, 1024)
Embedding dimension: 1024D (LATENT REPRESENTATION)





## Step 6: PCA Visualizations

In [7]:
# PCA - 2D
pca_2d = PCA(n_components=2, random_state=42)
pca_2d_result = pca_2d.fit_transform(embeddings)
variance_2d = pca_2d.explained_variance_ratio_.sum()

print(f"PCA 2D variance explained: {variance_2d:.2%}")

# Create color map
color_map = {word: px.colors.qualitative.Plotly[i] for i, word in enumerate(words)}

# 2D Plot
fig_2d = go.Figure()

for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig_2d.add_trace(go.Scatter(
        x=pca_2d_result[mask, 0],
        y=pca_2d_result[mask, 1],
        mode='markers',
        name=word,
        marker=dict(size=12, color=color_map[word], opacity=0.7, line=dict(width=0.5, color='white'))
    ))

fig_2d.update_layout(
    title=f'PCA 2D: DAC Latent Embeddings (1024D) - Variance: {variance_2d:.1%}',
    xaxis_title='PC 1',
    yaxis_title='PC 2',
    width=900,
    height=700,
    xaxis=dict(scaleanchor='y', scaleratio=1),
    template='plotly_white'
)

fig_2d.write_html('dac_latent_pca_2d.html')
fig_2d.show()

print("Saved: dac_latent_pca_2d.html")

PCA 2D variance explained: 42.36%


Saved: dac_latent_pca_2d.html
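
To make the "variance explained" figure concrete, here is a rough sketch of how it can be computed by hand with an SVD of the centered data. The matrix `X` is synthetic random data standing in for `embeddings`, so the percentage it prints is not comparable to the notebook's result; the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the (50, 1024) embeddings matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1024))

# PCA centers each feature, then decomposes the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each squared singular value is proportional to the variance along
# that component; dividing by the total gives the explained ratio.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal axes (component signs are arbitrary).
projected_2d = Xc @ Vt[:2].T

variance_2d = explained_variance_ratio[:2].sum()
print(f"2D variance explained: {variance_2d:.2%}")
```

With 50 samples, centering leaves at most 49 directions with nonzero variance, so the 1024 input dimensions can never contribute more than 49 components to the decomposition.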


In [8]:
# PCA - 3D
pca_3d = PCA(n_components=3, random_state=42)
pca_3d_result = pca_3d.fit_transform(embeddings)
variance_3d = pca_3d.explained_variance_ratio_.sum()

print(f"PCA 3D variance explained: {variance_3d:.2%}")

# 3D Plot
fig_3d = go.Figure()

for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig_3d.add_trace(go.Scatter3d(
        x=pca_3d_result[mask, 0],
        y=pca_3d_result[mask, 1],
        z=pca_3d_result[mask, 2],
        mode='markers',
        name=word,
        marker=dict(size=6, color=color_map[word], opacity=0.7, line=dict(width=0))
    ))

fig_3d.update_layout(
    title=f'PCA 3D: DAC Latent Embeddings (1024D) - Variance: {variance_3d:.1%}',
    scene=dict(
        xaxis_title='PC 1',
        yaxis_title='PC 2',
        zaxis_title='PC 3',
        aspectmode='cube',
        camera=dict(eye=dict(x=1.5, y=1.5, z=1.5))
    ),
    width=900,
    height=700,
    template='plotly_white'
)

fig_3d.write_html('dac_latent_pca_3d.html')
fig_3d.show()

print("Saved: dac_latent_pca_3d.html")

PCA 3D variance explained: 49.54%


Saved: dac_latent_pca_3d.html


## Step 7: t-SNE Visualizations

In [9]:
# t-SNE - 2D
perplexity = min(30, len(embeddings) - 1)
tsne_2d = TSNE(n_components=2, random_state=42, metric='cosine', perplexity=perplexity)
tsne_2d_result = tsne_2d.fit_transform(embeddings)

print(f"t-SNE 2D completed with perplexity={perplexity}")

# 2D Plot
fig_tsne_2d = go.Figure()

for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig_tsne_2d.add_trace(go.Scatter(
        x=tsne_2d_result[mask, 0],
        y=tsne_2d_result[mask, 1],
        mode='markers',
        name=word,
        marker=dict(size=12, color=color_map[word], opacity=0.7, line=dict(width=0.5, color='white'))
    ))

fig_tsne_2d.update_layout(
    title=f't-SNE 2D: DAC Latent Embeddings (1024D) - Perplexity={perplexity}',
    xaxis_title='t-SNE 1',
    yaxis_title='t-SNE 2',
    width=900,
    height=700,
    xaxis=dict(scaleanchor='y', scaleratio=1),
    template='plotly_white'
)

fig_tsne_2d.write_html('dac_latent_tsne_2d.html')
fig_tsne_2d.show()

print("Saved: dac_latent_tsne_2d.html")

t-SNE 2D completed with perplexity=30


Saved: dac_latent_tsne_2d.html
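
The t-SNE call above passes `metric='cosine'`, while `silhouette_score` and `davies_bouldin_score` in Step 9 are called with their default Euclidean distance. The sketch below, on synthetic data standing in for `embeddings`, shows what the cosine distance matrix looks like and why the perplexity is capped at `n_samples - 1` (scikit-learn requires perplexity to be smaller than the number of samples).

```python
import numpy as np

# Synthetic stand-in for the (50, 1024) embeddings matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1024))

# Cosine distance is 1 minus the cosine similarity of L2-normalized rows,
# so it depends only on direction, not on vector length.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
cosine_distance = 1.0 - X_unit @ X_unit.T

# Scaling a vector leaves its cosine distances unchanged,
# whereas Euclidean distances would change.
scaled_unit = (3.0 * X[0]) / np.linalg.norm(3.0 * X[0])
same_after_scaling = np.allclose(1.0 - X_unit @ scaled_unit, cosine_distance[:, 0])

# Same cap as in the cell above: perplexity must stay below n_samples.
n_samples = X.shape[0]
perplexity = min(30, n_samples - 1)

print(cosine_distance.shape, same_after_scaling, perplexity)
```

Because the two distances can rank neighbors differently, scores computed on the original embeddings and the layout seen in the t-SNE plots are not measuring exactly the same geometry.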


In [10]:
# t-SNE - 3D
tsne_3d = TSNE(n_components=3, random_state=42, metric='cosine', perplexity=perplexity)
tsne_3d_result = tsne_3d.fit_transform(embeddings)

print(f"t-SNE 3D completed with perplexity={perplexity}")

# 3D Plot
fig_tsne_3d = go.Figure()

for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig_tsne_3d.add_trace(go.Scatter3d(
        x=tsne_3d_result[mask, 0],
        y=tsne_3d_result[mask, 1],
        z=tsne_3d_result[mask, 2],
        mode='markers',
        name=word,
        marker=dict(size=6, color=color_map[word], opacity=0.7, line=dict(width=0))
    ))

fig_tsne_3d.update_layout(
    title=f't-SNE 3D: DAC Latent Embeddings (1024D) - Perplexity={perplexity}',
    scene=dict(
        xaxis_title='t-SNE 1',
        yaxis_title='t-SNE 2',
        zaxis_title='t-SNE 3',
        aspectmode='cube',
        camera=dict(eye=dict(x=1.5, y=1.5, z=1.5))
    ),
    width=900,
    height=700,
    template='plotly_white'
)

fig_tsne_3d.write_html('dac_latent_tsne_3d.html')
fig_tsne_3d.show()

print("Saved: dac_latent_tsne_3d.html")

t-SNE 3D completed with perplexity=30


Saved: dac_latent_tsne_3d.html


## Step 8: Combined Dashboard View

In [11]:
# Create a 2x2 grid
fig = make_subplots(
    rows=2, cols=2,
    specs=[
        [{'type': 'scatter'}, {'type': 'scatter3d'}],
        [{'type': 'scatter'}, {'type': 'scatter3d'}]
    ],
    subplot_titles=(
        f'PCA 2D (Var: {variance_2d:.1%})',
        f'PCA 3D (Var: {variance_3d:.1%})',
        f't-SNE 2D (Perp: {perplexity})',
        f't-SNE 3D (Perp: {perplexity})'
    ),
    vertical_spacing=0.12,
    horizontal_spacing=0.1
)

# PCA 2D
for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig.add_trace(
        go.Scatter(
            x=pca_2d_result[mask, 0],
            y=pca_2d_result[mask, 1],
            mode='markers',
            name=word,
            marker=dict(size=8, color=color_map[word], opacity=0.7),
            showlegend=True,
            legendgroup=word
        ),
        row=1, col=1
    )

# PCA 3D
for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig.add_trace(
        go.Scatter3d(
            x=pca_3d_result[mask, 0],
            y=pca_3d_result[mask, 1],
            z=pca_3d_result[mask, 2],
            mode='markers',
            name=word,
            marker=dict(size=5, color=color_map[word], opacity=0.7),
            showlegend=False,
            legendgroup=word
        ),
        row=1, col=2
    )

# t-SNE 2D
for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig.add_trace(
        go.Scatter(
            x=tsne_2d_result[mask, 0],
            y=tsne_2d_result[mask, 1],
            mode='markers',
            name=word,
            marker=dict(size=8, color=color_map[word], opacity=0.7),
            showlegend=False,
            legendgroup=word
        ),
        row=2, col=1
    )

# t-SNE 3D
for word in words:
    mask = np.array([label == word for label in valid_labels])
    fig.add_trace(
        go.Scatter3d(
            x=tsne_3d_result[mask, 0],
            y=tsne_3d_result[mask, 1],
            z=tsne_3d_result[mask, 2],
            mode='markers',
            name=word,
            marker=dict(size=5, color=color_map[word], opacity=0.7),
            showlegend=False,
            legendgroup=word
        ),
        row=2, col=2
    )

fig.update_layout(
    title_text='DAC Latent Embeddings (1024D) - Complete Visualization',
    title_x=0.5,
    title_font_size=20,
    width=1400,
    height=1000,
    showlegend=True,
    template='plotly_white'
)

fig.update_xaxes(title_text='PC 1', row=1, col=1, scaleanchor='y', scaleratio=1)
fig.update_yaxes(title_text='PC 2', row=1, col=1)
fig.update_xaxes(title_text='t-SNE 1', row=2, col=1, scaleanchor='y2', scaleratio=1)  # second 2D subplot uses x2/y2; 3D scenes are numbered separately
fig.update_yaxes(title_text='t-SNE 2', row=2, col=1)

fig.update_scenes(aspectmode='cube', row=1, col=2)
fig.update_scenes(aspectmode='cube', row=2, col=2)

fig.write_html('dac_latent_complete.html')
fig.show()

print("Saved: dac_latent_complete.html")

Saved: dac_latent_complete.html


## Step 9: Clustering Quality Metrics

In [12]:
# Convert labels to numeric
label_to_idx = {word: i for i, word in enumerate(words)}
numeric_labels = np.array([label_to_idx[label] for label in valid_labels])

print("=" * 70)
print("CLUSTERING QUALITY METRICS - LATENT REPRESENTATION (1024D)")
print("=" * 70)

# Original embeddings
if len(np.unique(numeric_labels)) > 1 and len(embeddings) > len(np.unique(numeric_labels)):
    sil_orig = silhouette_score(embeddings, numeric_labels)
    db_orig = davies_bouldin_score(embeddings, numeric_labels)
    print(f"\n🚀 Original Embeddings (1024D - LATENT):")
    print(f"  Silhouette Score: {sil_orig:.4f}  (higher is better, range: -1 to 1)")
    print(f"  Davies-Bouldin Score: {db_orig:.4f}  (lower is better, >0)")
else:
    # Keep both names defined so the comparison printout below cannot raise NameError
    sil_orig = db_orig = float('nan')

# PCA 2D
sil_pca2d = silhouette_score(pca_2d_result, numeric_labels)
db_pca2d = davies_bouldin_score(pca_2d_result, numeric_labels)
print(f"\nPCA 2D:")
print(f"  Silhouette Score: {sil_pca2d:.4f}")
print(f"  Davies-Bouldin Score: {db_pca2d:.4f}")

# PCA 3D
sil_pca3d = silhouette_score(pca_3d_result, numeric_labels)
db_pca3d = davies_bouldin_score(pca_3d_result, numeric_labels)
print(f"\nPCA 3D:")
print(f"  Silhouette Score: {sil_pca3d:.4f}")
print(f"  Davies-Bouldin Score: {db_pca3d:.4f}")

# t-SNE 2D
sil_tsne2d = silhouette_score(tsne_2d_result, numeric_labels)
db_tsne2d = davies_bouldin_score(tsne_2d_result, numeric_labels)
print(f"\nt-SNE 2D:")
print(f"  Silhouette Score: {sil_tsne2d:.4f}")
print(f"  Davies-Bouldin Score: {db_tsne2d:.4f}")

# t-SNE 3D
sil_tsne3d = silhouette_score(tsne_3d_result, numeric_labels)
db_tsne3d = davies_bouldin_score(tsne_3d_result, numeric_labels)
print(f"\nt-SNE 3D:")
print(f"  Silhouette Score: {sil_tsne3d:.4f}")
print(f"  Davies-Bouldin Score: {db_tsne3d:.4f}")

print("\n" + "=" * 70)
print("\n📊 COMPARISON WITH OTHER APPROACHES:")
print("\n1️⃣  Codebook Averaging (8D):")
print("    Silhouette: -0.0731  |  Davies-Bouldin: 4.99")
print("\n2️⃣  Codebook Concatenation (96D):")
print("    Silhouette: ~0.1     |  Davies-Bouldin: ~3.5  [Expected]")
print(f"\n3️⃣  Latent Representation (1024D) - THIS NOTEBOOK:")
print(f"    Silhouette: {sil_orig:.4f}  |  Davies-Bouldin: {db_orig:.4f}")

print("\n💡 Expected: Latent (1024D) should significantly outperform both!")
print("   Target: Silhouette > 0.3, Davies-Bouldin < 2.0")
print("=" * 70)

CLUSTERING QUALITY METRICS - LATENT REPRESENTATION (1024D)

🚀 Original Embeddings (1024D - LATENT):
  Silhouette Score: -0.0358  (higher is better, range: -1 to 1)
  Davies-Bouldin Score: 4.0643  (lower is better, >0)

PCA 2D:
  Silhouette Score: -0.1250
  Davies-Bouldin Score: 6.0486

PCA 3D:
  Silhouette Score: -0.1079
  Davies-Bouldin Score: 5.5799

t-SNE 2D:
  Silhouette Score: -0.0829
  Davies-Bouldin Score: 3.5198

t-SNE 3D:
  Silhouette Score: -0.0740
  Davies-Bouldin Score: 5.1930


📊 COMPARISON WITH OTHER APPROACHES:

1️⃣  Codebook Averaging (8D):
    Silhouette: -0.0731  |  Davies-Bouldin: 4.99

2️⃣  Codebook Concatenation (96D):
    Silhouette: ~0.1     |  Davies-Bouldin: ~3.5  [Expected]

3️⃣  Latent Representation (1024D) - THIS NOTEBOOK:
    Silhouette: -0.0358  |  Davies-Bouldin: 4.0643

💡 Expected: Latent (1024D) should significantly outperform both!
   Target: Silhouette > 0.3, Davies-Bouldin < 2.0
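
To help interpret the scores above, here is a small NumPy sketch of both metrics written out by hand with Euclidean distance. It is an illustrative reimplementation on a four-point toy dataset, not the scikit-learn code, and the function names are made up for this example.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    classes = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster scores 0 by convention
        # a: mean distance to the other members of the same class
        a = dist[i, same].sum() / (n_same - 1)
        # b: mean distance to the nearest other class
        b = min(dist[i, labels == c].mean() for c in classes if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

def davies_bouldin(X, labels):
    """Mean over classes of the worst (spread_i + spread_j) / centroid distance."""
    classes = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in classes])
    spread = np.array([
        np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(classes)
    ])
    centroid_dist = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    np.fill_diagonal(centroid_dist, np.inf)  # a class is never compared with itself
    ratios = (spread[:, None] + spread[None, :]) / centroid_dist
    return ratios.max(axis=1).mean()

# Two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels_matching = np.array([0, 0, 1, 1])    # labels follow the geometry
labels_mismatched = np.array([0, 1, 0, 1])  # labels cut across the geometry

for name, labels in [("matching", labels_matching), ("mismatched", labels_mismatched)]:
    print(name, silhouette(X, labels), davies_bouldin(X, labels))
```

The silhouette is positive when labels follow the geometry and negative when they cut across it. A negative value, as in the results above, means samples are on average closer to another word's samples than to their own.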


## Summary

This notebook demonstrated DAC **latent representation visualization (1024D)**:

### **Why Latent `z` Was Expected to Be Best**:
- ✅ **1024D**: Highest-dimensional representation (128× the 8D codebook embeddings)
- ✅ **Complete features**: Full internal audio representation after encoder + quantization
- ✅ **Similar to Wav2Vec2/Whisper**: Comparable dimensionality (768-1024D)
- ⚠️ **Clustering**: Expected to show much better word separation; the Step 9 scores do not confirm this

### **Comparison Table**:

| Approach | Dimension | Silhouette | Davies-Bouldin | Quality |
|----------|-----------|------------|----------------|----------|
| Codebook (avg) | 8D | -0.07 | 4.99 | ❌ Poor |
| Codebook (concat) | 96D | ~0.1 (expected) | ~3.5 (expected) | ⚠️ Medium (not measured here) |
| **Latent `z`** | **1024D** | **-0.04** | **4.06** | ❌ Poor (slightly better than 8D) |

### **Key Findings**:
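- On the original embeddings, latent `z` scores slightly better than the 8D codebook average (silhouette -0.0358 vs -0.0731, Davies-Bouldin 4.0643 vs 4.99) but misses the target (silhouette > 0.3, Davies-Bouldin < 2.0)
- Every silhouette score in Step 9 is negative, so the five words do not form separated clusters in this 50-sample run
- The 96D figures are expected values rather than measurements, so the ranking against codebook concatenation is still open
- Visual inspection: check whether clusters look more separated in the PCA and t-SNE plots
- Latent `z` remains the intended approach for DAC embedding visualization, but these scores do not yet support recommending it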

### **Next Steps**:
1. ✅ Compare all three DAC approaches: 8D vs 96D vs 1024D
2. ✅ Compare DAC latent (1024D) with Wav2Vec2 (768D) and Whisper (384-1280D)
3. Consider integrating best approach into main dashboard