# DAC Temporal Slice Visualization

This notebook explores DAC embeddings at **specific time steps** without temporal averaging.

## Goal:
- Extract 12,288D concatenated projections (12 codebooks × 1024D each) at a **single time index**
- Compare clustering across different time positions (beginning, middle, end)
- Understand how temporal position affects word discrimination
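
To make the contrast with temporal averaging concrete, here is a minimal sketch using random data in place of real DAC projections. The array `z` and its shape `[12, 1024, T]` are stand-ins chosen to match the dimensions stated above; nothing here calls the DAC model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_codebooks, proj_dim, T = 12, 1024, 50

# Stand-in for per-codebook projections: [codebooks, 1024, time]
z = rng.normal(size=(n_codebooks, proj_dim, T))

# Temporal averaging: collapse the time axis, then concatenate codebooks
averaged = z.mean(axis=2).reshape(-1)   # (12288,)

# Temporal slice: keep one time step, then concatenate codebooks
t = 10
single = z[:, :, t].reshape(-1)         # (12288,)

print(averaged.shape, single.shape)
```

Both vectors have the same dimensionality; the difference is whether each entry summarizes the whole clip or a single frame.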

In [1]:
import sys
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import plotly.graph_objects as go
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Import our utilities
from dac_utils import DACProcessor, SpeechCommandsLoader

print("Imports successful!")

Imports successful!


## Step 1: Initialize DAC and Load Data

In [2]:
# Initialize DAC processor
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

dac_processor = DACProcessor(model_type="16khz", device=device)

# Load dataset
loader = SpeechCommandsLoader()
words = ['zero', 'one', 'two', 'yes', 'no']
samples_per_word = 10

file_paths, file_labels = loader.load_word_samples(words, samples_per_word=samples_per_word)
print(f"\nLoaded {len(file_paths)} audio files from {len(words)} words")

Using device: cuda
Loading DAC model (16khz)...


  WeightNorm.apply(module, name, dim)


Model loaded successfully!
  - Sample rate: 16000Hz
  - Codebooks: 12
  - Codebook size: 1024
  - Codebook dim: 8

Loaded 50 audio files from 5 words


## Step 2: Extract Embeddings at Specific Time Index

In [3]:
def extract_temporal_slice(dac_processor, audio_path, time_index):
    """
    Extract 12,288D embedding at a specific time index.
    
    Args:
        dac_processor: DACProcessor instance
        audio_path: Path to audio file
        time_index: Time step to extract (0 to T-1)
    
    Returns:
        12,288D numpy array representing concatenated projections at time_index
    """
    # Encode audio
    encoded = dac_processor.encode_audio(audio_path)
    codes = encoded['codes'][0]  # [n_codebooks, time]
    
    N = codes.shape[0]  # Number of codebooks (12)
    T = codes.shape[1]  # Number of time steps
    
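    # Validate time index (a negative value would otherwise silently wrap around)
    if time_index < 0:
        raise ValueError(f"time_index {time_index} must be non-negative")
    if time_index >= T:
        raise ValueError(f"time_index {time_index} exceeds sequence length {T}")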
    
    # Extract projection at specific time index for each codebook
    codebook_projections = []
    
    for i in range(N):
        quantizer = dac_processor.model.quantizer.quantizers[i]
        
        # Get indices for this codebook [time]
        indices = codes[i:i+1, :].to(dac_processor.device)  # [1, time]
        
        # Get 8D embeddings [1, time, 8]
        z_e = quantizer.embed_code(indices)
        
        # Project to 1024D [1, 1024, time]
        z_q = quantizer.out_proj(z_e.transpose(1, 2))
        
        # Extract specific time index [1, 1024]
        z_q_slice = z_q[:, :, time_index]
        
        codebook_projections.append(z_q_slice)
    
    # Concatenate all codebooks [1, 12288]
    concatenated = torch.cat(codebook_projections, dim=1)
    
    return concatenated.squeeze(0).detach().cpu().numpy()

# Test extraction
test_embedding = extract_temporal_slice(dac_processor, file_paths[0], time_index=10)
print(f"\nExtracted embedding shape: {test_embedding.shape}")
print(f"Embedding stats: mean={test_embedding.mean():.4f}, std={test_embedding.std():.4f}")


Extracted embedding shape: (12288,)
Embedding stats: mean=0.0037, std=1.1050
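
The lookup, projection, and slicing steps inside `extract_temporal_slice` can be sketched without loading DAC. The modules below are hypothetical stand-ins: an `nn.Embedding` for the codebook lookup and a kernel-size-1 `nn.Conv1d` for `out_proj`. The real quantizer may be built differently, so treat this as a shape walkthrough under those assumptions rather than a description of DAC internals.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_codebooks, codebook_size, codebook_dim, proj_dim, T = 12, 1024, 8, 1024, 50

# Hypothetical stand-ins for each quantizer's codebook and output projection
codebooks = [nn.Embedding(codebook_size, codebook_dim) for _ in range(n_codebooks)]
projections = [nn.Conv1d(codebook_dim, proj_dim, kernel_size=1) for _ in range(n_codebooks)]

codes = torch.randint(0, codebook_size, (n_codebooks, T))  # [codebooks, time]
t = 10

with torch.no_grad():
    full, sliced = [], []
    for i in range(n_codebooks):
        z_e = codebooks[i](codes[i:i + 1])            # [1, T, 8]
        z_q = projections[i](z_e.transpose(1, 2))     # [1, 1024, T]
        full.append(z_q[:, :, t])                     # [1, 1024]

        # Alternative order: slice the 8D embedding first, then project one frame
        z_e_t = z_e[:, t:t + 1, :]                    # [1, 1, 8]
        sliced.append(projections[i](z_e_t.transpose(1, 2))[:, :, 0])

    project_then_slice = torch.cat(full, dim=1).squeeze(0)    # [12288]
    slice_then_project = torch.cat(sliced, dim=1).squeeze(0)  # [12288]

print(project_then_slice.shape)
print(torch.allclose(project_then_slice, slice_then_project, atol=1e-5))
```

If the projection really is pointwise in time, as assumed here, projecting only the requested frame gives the same vector while doing less work per call.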


## Step 3: Inspect Temporal Dimension

In [4]:
# Check sequence lengths across samples
sequence_lengths = []
for file_path in file_paths[:5]:  # Check first 5
    encoded = dac_processor.encode_audio(file_path)
    T = encoded['codes'].shape[2]
    sequence_lengths.append(T)

print(f"Sample sequence lengths: {sequence_lengths}")
print(f"Min: {min(sequence_lengths)}, Max: {max(sequence_lengths)}")
print(f"\n→ We'll use time indices that fit all samples (0 to {min(sequence_lengths)-1})")

Sample sequence lengths: [50, 40, 50, 50, 50]
Min: 40, Max: 50

→ We'll use time indices that fit all samples (0 to 39)


## Step 4: Visualize at Different Time Positions

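We'll compare 3 temporal positions:
- **Early** (time_index=5): Beginning of the word
- **Middle** (time_index=25): Core phonetic content
- **Late** (time_index=45): End of the word. This index lies beyond the 0 to 39 range shared by all samples, so clips shorter than 46 frames are skipped at this position.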
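
One way to check which absolute indices fit every clip, and to pick positions relative to each clip's own length instead, is sketched below. The helper `relative_index` is hypothetical and not part of `dac_utils`; the sequence lengths are the five values printed in Step 3. Relative positions are an alternative design, not what the cells in this notebook do.

```python
def relative_index(seq_len, fraction):
    """Map a fraction in [0, 1] to a valid frame index for a clip of seq_len frames."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return int(fraction * (seq_len - 1))

sequence_lengths = [50, 40, 50, 50, 50]   # from Step 3
common_max = min(sequence_lengths) - 1    # largest index valid for every clip

absolute = {'Early': 5, 'Middle': 25, 'Late': 45}
fits_all = {name: t <= common_max for name, t in absolute.items()}
print(fits_all)

# Relative alternative: "90% of the way through" each clip
late_indices = [relative_index(T, 0.9) for T in sequence_lengths]
print(late_indices)
```

With relative positions no clip is dropped, at the cost of comparing frames that sit at different absolute times.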

In [5]:
# Define time indices to explore
time_positions = {
    'Early (t=5)': 5,
    'Middle (t=25)': 25,
    'Late (t=45)': 45
}

# Extract embeddings for each time position
results = {}

for position_name, time_idx in time_positions.items():
    print(f"\nExtracting embeddings at {position_name}...")
    
    embeddings = []
    valid_labels = []
    
    for file_path, label in zip(file_paths, file_labels):
        try:
            emb = extract_temporal_slice(dac_processor, file_path, time_idx)
            embeddings.append(emb)
            valid_labels.append(label)
        except Exception as e:
            print(f"  Skipped {file_path}: {e}")
    
    embeddings = np.array(embeddings)
    print(f"  Extracted {len(embeddings)} embeddings, shape: {embeddings.shape}")
    
    results[position_name] = {
        'embeddings': embeddings,
        'labels': valid_labels,
        'time_index': time_idx
    }

print("\n✅ Extraction complete!")


Extracting embeddings at Early (t=5)...
  Extracted 50 embeddings, shape: (50, 12288)

Extracting embeddings at Middle (t=25)...
  Extracted 50 embeddings, shape: (50, 12288)

Extracting embeddings at Late (t=45)...
  Skipped /data/aman/speech_commands/speech_commands_v0.02/zero/004ae714_nohash_1.wav: time_index 45 exceeds sequence length 40
  Extracted 49 embeddings, shape: (49, 12288)

✅ Extraction complete!
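
The loop above re-encodes every audio file once per time position. A possible refinement is to compute all frames for each file once and slice the cached result afterwards. In the sketch below, `embed_all_frames` is a hypothetical callable returning a `[T, D]` array per file (for example, a variant of `extract_temporal_slice` that keeps every time step); the fake embedder and file names exist only to make the example runnable.

```python
import numpy as np

def build_frame_cache(paths, embed_all_frames):
    """Compute the [T, D] frame matrix for each file exactly once."""
    return {p: np.asarray(embed_all_frames(p)) for p in paths}

def slice_cache(cache, paths, labels, time_index):
    """Collect the frame at time_index from every clip long enough to have one."""
    rows, kept = [], []
    for p, label in zip(paths, labels):
        frames = cache[p]
        if 0 <= time_index < frames.shape[0]:
            rows.append(frames[time_index])
            kept.append(label)
    embeddings = np.stack(rows) if rows else np.empty((0, 0))
    return embeddings, kept

# Fake data standing in for real DAC output (16D instead of 12,288D)
fake_lengths = {'a.wav': 50, 'b.wav': 40, 'c.wav': 50}
rng = np.random.default_rng(0)

def fake_embed_all_frames(path):
    return rng.normal(size=(fake_lengths[path], 16))

paths = list(fake_lengths)
labels = ['zero', 'one', 'two']
cache = build_frame_cache(paths, fake_embed_all_frames)

early, early_labels = slice_cache(cache, paths, labels, 5)
late, late_labels = slice_cache(cache, paths, labels, 45)
print(early.shape, late.shape, late_labels)
```

Short clips are still dropped at late indices, as in the output above, but each file is processed only once regardless of how many positions are compared.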


## Step 5: PCA Visualization Across Time Positions

In [6]:
# Create color map
color_map = {word: px.colors.qualitative.Plotly[i] for i, word in enumerate(words)}

# PCA for each time position
fig = go.Figure()

for position_name, data in results.items():
    embeddings = data['embeddings']
    labels = data['labels']
    
    # Apply PCA
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(embeddings)
    
    variance_explained = pca.explained_variance_ratio_.sum()
    print(f"{position_name}: PCA variance explained = {variance_explained:.2%}")
    
    # Store for metrics
    data['pca_result'] = pca_result
    data['pca_variance'] = variance_explained

# Create subplots for comparison
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=[f"{name}<br>Var: {data['pca_variance']:.1%}" 
                    for name, data in results.items()]
)

for idx, (position_name, data) in enumerate(results.items(), 1):
    pca_result = data['pca_result']
    labels = data['labels']
    
    for word in words:
        mask = np.array([label == word for label in labels])
        fig.add_trace(
            go.Scatter(
                x=pca_result[mask, 0],
                y=pca_result[mask, 1],
                mode='markers',
                name=word,
                marker=dict(size=8, color=color_map[word], opacity=0.7),
                legendgroup=word,
                showlegend=(idx == 1)  # Only show legend once
            ),
            row=1, col=idx
        )

fig.update_layout(
    title_text='PCA: DAC Embeddings at Different Time Positions (12,288D)',
    height=400,
    width=1400
)

fig.write_html('dac_temporal_slice_pca.html')
fig.show()

print("\nSaved: dac_temporal_slice_pca.html")

Early (t=5): PCA variance explained = 11.75%
Middle (t=25): PCA variance explained = 11.84%
Late (t=45): PCA variance explained = 10.93%



Saved: dac_temporal_slice_pca.html
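
Two components capture only about 11 to 12% of the variance, so the scatter plots discard most of the structure. A quick way to see how many components a given variance target would need is to look at the cumulative explained-variance curve. The sketch below uses NumPy's SVD on synthetic data; the squared singular values of the centered matrix, normalized to sum to one, are the same ratios PCA reports. The function name and the 90% target are illustrative choices.

```python
import numpy as np

def components_for_variance(X, target=0.90):
    """Return the smallest k whose top-k components reach the target variance ratio."""
    Xc = X - X.mean(axis=0, keepdims=True)       # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    ratio = s**2 / np.sum(s**2)                  # explained variance ratio per component
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, target) + 1)
    return k, ratio

# Synthetic stand-in: 50 samples, 200 features of isotropic noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))

k, ratio = components_for_variance(X, target=0.90)
print(f"Top-2 ratio: {ratio[:2].sum():.2%}, components for 90%: {k}")
```

In the notebook, passing `data['embeddings']` for one time position would show how far the 2D view is from a faithful summary.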


## Step 6: t-SNE Visualization Across Time Positions

In [7]:
# t-SNE for each time position
perplexity = min(30, len(file_paths) - 1)

for position_name, data in results.items():
    embeddings = data['embeddings']
    
    # Apply t-SNE
    tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity)
    tsne_result = tsne.fit_transform(embeddings)
    
    # Store for metrics
    data['tsne_result'] = tsne_result
    print(f"{position_name}: t-SNE completed")

# Create subplots
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=[name for name in results.keys()]
)

for idx, (position_name, data) in enumerate(results.items(), 1):
    tsne_result = data['tsne_result']
    labels = data['labels']
    
    for word in words:
        mask = np.array([label == word for label in labels])
        fig.add_trace(
            go.Scatter(
                x=tsne_result[mask, 0],
                y=tsne_result[mask, 1],
                mode='markers',
                name=word,
                marker=dict(size=8, color=color_map[word], opacity=0.7),
                legendgroup=word,
                showlegend=(idx == 1)
            ),
            row=1, col=idx
        )

fig.update_layout(
    title_text='t-SNE: DAC Embeddings at Different Time Positions (12,288D)',
    height=400,
    width=1400
)

fig.write_html('dac_temporal_slice_tsne.html')
fig.show()

print("\nSaved: dac_temporal_slice_tsne.html")

Early (t=5): t-SNE completed
Middle (t=25): t-SNE completed
Late (t=45): t-SNE completed



Saved: dac_temporal_slice_tsne.html
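
The perplexity above is derived from `len(file_paths)`, but the Late position has fewer embeddings because one clip was skipped. Since t-SNE expects the perplexity to be smaller than the number of samples it is fitted on, a small helper that works from the actual per-position count may be safer. `safe_perplexity` is a hypothetical name, and 30 simply mirrors the cap used above.

```python
def safe_perplexity(n_samples, target=30.0):
    """Cap the perplexity so it stays strictly below the number of samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return min(target, n_samples - 1)

print(safe_perplexity(49))  # Late position: 49 embeddings
print(safe_perplexity(20))  # a smaller subset would need a lower value
```

In the loop above this would be called as `safe_perplexity(len(embeddings))` for each position.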


## Step 7: Clustering Metrics Comparison

In [8]:
# Convert labels to numeric
label_to_idx = {word: i for i, word in enumerate(words)}

print("=" * 80)
print("CLUSTERING METRICS: TEMPORAL SLICE COMPARISON")
print("=" * 80)

metrics_table = []

for position_name, data in results.items():
    labels = data['labels']
    numeric_labels = np.array([label_to_idx[label] for label in labels])
    
    # PCA metrics
    pca_sil = silhouette_score(data['pca_result'], numeric_labels)
    pca_db = davies_bouldin_score(data['pca_result'], numeric_labels)
    
    # t-SNE metrics
    tsne_sil = silhouette_score(data['tsne_result'], numeric_labels)
    tsne_db = davies_bouldin_score(data['tsne_result'], numeric_labels)
    
    print(f"\n{position_name}:")
    print(f"  PCA:   Silhouette = {pca_sil:+.4f}  |  Davies-Bouldin = {pca_db:.4f}")
    print(f"  t-SNE: Silhouette = {tsne_sil:+.4f}  |  Davies-Bouldin = {tsne_db:.4f}")
    
    metrics_table.append({
        'position': position_name,
        'time_idx': data['time_index'],
        'pca_sil': pca_sil,
        'pca_db': pca_db,
        'tsne_sil': tsne_sil,
        'tsne_db': tsne_db
    })

print("\n" + "=" * 80)
print("SUMMARY:")
print("=" * 80)

# Find best position
best_pca = max(metrics_table, key=lambda x: x['pca_sil'])
best_tsne = max(metrics_table, key=lambda x: x['tsne_sil'])

print(f"\n🏆 Best PCA clustering: {best_pca['position']} (Silhouette: {best_pca['pca_sil']:+.4f})")
print(f"🏆 Best t-SNE clustering: {best_tsne['position']} (Silhouette: {best_tsne['tsne_sil']:+.4f})")

print("\n💡 Interpretation:")
if max(m['pca_sil'] for m in metrics_table) > 0.1:
    print("  ✅ Specific time positions show better clustering!")
    print("  → Certain phonetic moments are more discriminative")
else:
    print("  ❌ Even at specific time slices, clustering remains poor")
    print("  → Single time steps lack sufficient context for word discrimination")
    print("  → Need to preserve full temporal sequences (RNN/LSTM/Transformer)")

print("=" * 80)

CLUSTERING METRICS: TEMPORAL SLICE COMPARISON

Early (t=5):
  PCA:   Silhouette = -0.1302  |  Davies-Bouldin = 9.6585
  t-SNE: Silhouette = -0.1439  |  Davies-Bouldin = 8.5370

Middle (t=25):
  PCA:   Silhouette = -0.1376  |  Davies-Bouldin = 19.2873
  t-SNE: Silhouette = -0.1251  |  Davies-Bouldin = 12.9015

Late (t=45):
  PCA:   Silhouette = -0.1185  |  Davies-Bouldin = 8.7317
  t-SNE: Silhouette = -0.1281  |  Davies-Bouldin = 9.4580

SUMMARY:

🏆 Best PCA clustering: Late (t=45) (Silhouette: -0.1185)
🏆 Best t-SNE clustering: Middle (t=25) (Silhouette: -0.1251)

💡 Interpretation:
  ❌ Even at specific time slices, clustering remains poor
  → Single time steps lack sufficient context for word discrimination
  → Need to preserve full temporal sequences (RNN/LSTM/Transformer)
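
For reference when reading these numbers: silhouette scores range from -1 to +1, with values near +1 for tight, well-separated classes and values near or below zero when classes overlap. The sketch below scores synthetic data twice, once with true labels and once with shuffled labels, to give a feel for both ends. The cluster layout is invented for illustration and says nothing about DAC itself.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n_classes, per_class, dim = 5, 10, 20

# Five well-separated synthetic classes, 10 samples each
labels = np.repeat(np.arange(n_classes), per_class)
centers = rng.normal(scale=10.0, size=(n_classes, dim))
X = centers[labels] + rng.normal(size=(labels.size, dim))

separated = silhouette_score(X, labels)                  # labels match the structure
shuffled = silhouette_score(X, rng.permutation(labels))  # labels ignore the structure

print(f"Separated: {separated:+.4f}")
print(f"Shuffled:  {shuffled:+.4f}")
```

The metrics above are computed on the 2D projections. Passing `data['embeddings']` instead would score the full 12,288D space, which avoids relying on t-SNE distances that are not designed to be preserved globally.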


## Step 8: Custom Time Index Exploration

Use this cell to explore any specific time index:

In [9]:
# Set your desired time index here
CUSTOM_TIME_INDEX = 10  # Change this value (0 to 49; clips with fewer frames are skipped)

print(f"Extracting embeddings at time index: {CUSTOM_TIME_INDEX}")

custom_embeddings = []
custom_labels = []

for file_path, label in zip(file_paths, file_labels):
    try:
        emb = extract_temporal_slice(dac_processor, file_path, CUSTOM_TIME_INDEX)
        custom_embeddings.append(emb)
        custom_labels.append(label)
    except Exception as e:
        print(f"Skipped: {e}")

custom_embeddings = np.array(custom_embeddings)
print(f"Shape: {custom_embeddings.shape}")

# Quick PCA visualization
pca_custom = PCA(n_components=2)
pca_custom_result = pca_custom.fit_transform(custom_embeddings)

fig = go.Figure()
for word in words:
    mask = np.array([label == word for label in custom_labels])
    fig.add_trace(go.Scatter(
        x=pca_custom_result[mask, 0],
        y=pca_custom_result[mask, 1],
        mode='markers',
        name=word,
        marker=dict(size=10, color=color_map[word], opacity=0.7)
    ))

fig.update_layout(
    title=f'PCA at Time Index {CUSTOM_TIME_INDEX}',
    xaxis_title='PC1',
    yaxis_title='PC2',
    width=800,
    height=600
)

fig.show()

# Metrics
numeric_custom = np.array([label_to_idx[label] for label in custom_labels])
sil_custom = silhouette_score(pca_custom_result, numeric_custom)
db_custom = davies_bouldin_score(pca_custom_result, numeric_custom)

print(f"\nMetrics for time index {CUSTOM_TIME_INDEX}:")
print(f"  Silhouette: {sil_custom:+.4f}")
print(f"  Davies-Bouldin: {db_custom:.4f}")

Extracting embeddings at time index: 10
Shape: (50, 12288)



Metrics for time index 10:
  Silhouette: -0.1234
  Davies-Bouldin: 11.2723


## Summary

This notebook explored DAC embeddings **without temporal averaging**:
1. ✅ Extracted 12,288D concatenated projections at specific time indices
2. ✅ Compared clustering quality across Early/Middle/Late positions
3. ✅ Provided custom time index exploration capability

**Key Finding**: Single time slices cluster poorly here: every silhouette score reported above is negative (roughly -0.12 to -0.14). Likely reasons:
- Words are temporal sequences, not static snapshots
- Phonetic information unfolds over time
- DAC is optimized for compression, not phonetic discrimination

**Conclusion**: Using DAC for speech tasks likely requires sequence models that preserve full temporal context (RNN/LSTM/Transformer) rather than single-frame or averaged representations.
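
As a rough illustration of what a sequence model over per-frame embeddings could look like, here is a minimal GRU classifier that handles clips of different lengths. It is a sketch under assumptions, not a tested recipe: the class name, hidden size, and the 64D random inputs (standing in for 12,288D frame vectors) are all placeholders, and no training is shown.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

class FrameSequenceClassifier(nn.Module):
    """GRU over a variable-length sequence of frame embeddings, then a linear head."""

    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, padded, lengths):
        # Packing lets the GRU stop at each clip's true length instead of reading padding
        packed = pack_padded_sequence(
            padded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, h_n = self.gru(packed)      # h_n: [num_layers, batch, hidden]
        return self.head(h_n[-1])      # logits: [batch, n_classes]

torch.manual_seed(0)

# Three fake clips with 50, 40, and 50 frames of 64D embeddings
seqs = [torch.randn(T, 64) for T in (50, 40, 50)]
lengths = torch.tensor([s.shape[0] for s in seqs])
padded = pad_sequence(seqs, batch_first=True)   # [3, 50, 64]

model = FrameSequenceClassifier(input_dim=64, hidden_dim=32, n_classes=5)
logits = model(padded, lengths)
print(logits.shape)
```

Unlike a single slice, the final hidden state depends on every frame in the clip, which is the property the conclusion argues for.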