# 🔍 Embedding Analysis & Visualization

This notebook visualizes document embeddings to understand the semantic relationships between documents in our corpus. We'll explore:

1. **Dimensionality Reduction** - Using t-SNE and UMAP to project high-dimensional embeddings into 2D space
2. **Similarity Heatmaps** - Visualizing query-document and document-document similarities
3. **Cluster Analysis** - Identifying natural groupings in our document space

In [1]:
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
import umap

# Add the src directory to the path
import sys
sys.path.append(os.path.abspath('..'))

from src.embedder import Embedder

# Set paths
DATA_DIR = Path("../data")
PROCESSED_DIR = DATA_DIR / "processed"
OUTPUT_DIR = Path("../outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # plt.savefig fails if the folder is missing

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("viridis")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100

## 1. Load Documents and Generate Embeddings

First, we'll load our processed documents and generate embeddings for them.

In [2]:
# Load processed documents
with open(PROCESSED_DIR / "processed_chunks.json", 'r', encoding='utf-8') as f:
    documents = json.load(f)

print(f"Loaded {len(documents)} document chunks")

# Initialize embedder
embedder = Embedder()

# Generate embeddings for all documents
doc_texts = [doc['content'] for doc in documents]
doc_embeddings = embedder.embed_documents(doc_texts)

print(f"Generated embeddings with shape: {doc_embeddings.shape}")
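
The cells that follow assume `doc_embeddings` is a 2-D NumPy array (the code reads `.shape` and slices columns). What `Embedder.embed_documents` actually returns is not shown here, so the following is only a minimal sketch of a defensive conversion; `as_embedding_matrix` is a hypothetical helper, not part of `src.embedder`.

```python
import numpy as np


def as_embedding_matrix(vectors):
    """Convert a sequence of vectors to a 2-D float array and sanity-check it.

    Hypothetical helper: it does not depend on the Embedder class.
    """
    matrix = np.asarray(vectors, dtype=np.float32)
    if matrix.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {matrix.shape}")
    if not np.isfinite(matrix).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return matrix


# Synthetic stand-in for the embedder output: two 3-dimensional vectors.
demo_matrix = as_embedding_matrix([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(demo_matrix.shape)
```

In the notebook this could be applied as `doc_embeddings = as_embedding_matrix(doc_embeddings)` right after the embedding call, if the embedder returns plain lists.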

## 2. Dimensionality Reduction with t-SNE

We'll use t-SNE to reduce our high-dimensional embeddings to 2D for visualization.

In [3]:
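# Apply t-SNE for dimensionality reduction.
# The iteration count is left at its default of 1000: the n_iter keyword was
# renamed to max_iter in newer scikit-learn releases, so passing it explicitly
# breaks on some versions. Perplexity must be smaller than the number of chunks.
tsne = TSNE(n_components=2, perplexity=15, random_state=42)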
embeddings_2d = tsne.fit_transform(doc_embeddings)

# Create a DataFrame for easier plotting
plot_df = pd.DataFrame({
    'x': embeddings_2d[:, 0],
    'y': embeddings_2d[:, 1],
    'document_id': [doc['id'] for doc in documents],
    'title': [doc['title'] for doc in documents],
    'category': [doc['metadata'].get('category', 'unknown') for doc in documents],
    'parent_id': [doc['metadata'].get('parent_id', doc['id']) for doc in documents]
})

# Plot t-SNE visualization colored by category
plt.figure(figsize=(14, 10))
sns.scatterplot(data=plot_df, x='x', y='y', hue='category', style='category', s=100, alpha=0.7)
plt.title('t-SNE Visualization of Document Embeddings by Category', fontsize=16)
plt.xlabel('t-SNE Dimension 1', fontsize=12)
plt.ylabel('t-SNE Dimension 2', fontsize=12)
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')

# Add annotations for a few points (not all to avoid clutter)
for _, row in plot_df.sample(min(10, len(plot_df)), random_state=42).iterrows():
    plt.annotate(row['title'][:20] + '...', 
                 (row['x'], row['y']),
                 xytext=(5, 5),
                 textcoords='offset points',
                 fontsize=8,
                 bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.7))

plt.tight_layout()
plt.savefig(OUTPUT_DIR / "tsne_visualization.png", dpi=300)
plt.show()
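
scikit-learn's t-SNE rejects a perplexity that is not smaller than the number of samples, so the fixed `perplexity=15` above fails on a corpus of 15 chunks or fewer. The sketch below shows one way to guard against that. It runs on synthetic data, and `safe_perplexity` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE


def safe_perplexity(n_samples: int, preferred: float = 15.0) -> float:
    """Return a perplexity strictly below n_samples (hypothetical helper)."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return float(min(preferred, n_samples - 1))


# Synthetic stand-in for doc_embeddings: 12 chunks with 32 dimensions each.
rng = np.random.default_rng(42)
demo_embeddings = rng.normal(size=(12, 32))

perplexity = safe_perplexity(len(demo_embeddings))
demo_2d = TSNE(
    n_components=2, perplexity=perplexity, random_state=42
).fit_transform(demo_embeddings)
print(perplexity, demo_2d.shape)
```

Capping at `n_samples - 1` only keeps the call valid. For a corpus that small, the resulting layout should be read with caution.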

## 3. Document Similarity Heatmap

Let's visualize the similarity between documents to identify clusters and relationships.

In [4]:
# Calculate cosine similarity between all document pairs
similarity_matrix = cosine_similarity(doc_embeddings)

# For visualization, use a subset of documents if there are too many
max_docs_for_heatmap = 20
np.random.seed(42)  # make the sampled subset reproducible across runs
if len(documents) > max_docs_for_heatmap:
    # Sample documents from different categories for better visualization
    categories = plot_df['category'].unique()
    sampled_indices = []
    
    # Try to get an equal number from each category
    docs_per_category = max(1, max_docs_for_heatmap // len(categories))
    
    for category in categories:
        category_indices = plot_df[plot_df['category'] == category].index.tolist()
        sampled_indices.extend(np.random.choice(category_indices, 
                                               size=min(docs_per_category, len(category_indices)),
                                               replace=False))
    
    # If we still need more, sample randomly
    if len(sampled_indices) < max_docs_for_heatmap:
        remaining = list(set(range(len(documents))) - set(sampled_indices))
        additional = np.random.choice(remaining, 
                                     size=min(max_docs_for_heatmap - len(sampled_indices), len(remaining)),
                                     replace=False)
        sampled_indices.extend(additional)
else:
    sampled_indices = list(range(len(documents)))

# Extract the subset for visualization
subset_similarity = similarity_matrix[np.ix_(sampled_indices, sampled_indices)]
subset_titles = [documents[i]['title'][:30] + '...' if len(documents[i]['title']) > 30 else documents[i]['title'] for i in sampled_indices]
subset_categories = [documents[i]['metadata'].get('category', 'unknown') for i in sampled_indices]

# Create a DataFrame for the heatmap
similarity_df = pd.DataFrame(subset_similarity, 
                             index=subset_titles,
                             columns=subset_titles)

# Plot the heatmap
plt.figure(figsize=(14, 12))
heatmap = sns.heatmap(similarity_df, annot=False, cmap='viridis', vmin=0, vmax=1)
plt.title('Document Similarity Heatmap', fontsize=16)
plt.tight_layout()
plt.savefig(OUTPUT_DIR / "document_similarity_heatmap.png", dpi=300)
plt.show()
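
The per-category sampling above can be sketched as a small reusable function. This version is an illustration under assumptions, not the notebook's method: `stratified_sample` is a hypothetical name, it uses a seeded `numpy.random.Generator` so the subset is reproducible, and it trims the result when there are more categories than `max_items`, a case in which one pick per category would otherwise exceed the limit.

```python
import numpy as np


def stratified_sample(categories, max_items, seed=42):
    """Pick up to max_items row indices, spreading picks across categories."""
    rng = np.random.default_rng(seed)
    categories = np.asarray(categories)
    labels = np.unique(categories)
    per_label = max(1, max_items // len(labels))

    picked = []
    for label in labels:
        pool = np.flatnonzero(categories == label)
        take = min(per_label, len(pool))
        picked.extend(rng.choice(pool, size=take, replace=False).tolist())

    # More categories than max_items: one pick each would overshoot the limit.
    picked = picked[:max_items]

    # Small categories may leave room: top up from the rows not yet picked.
    if len(picked) < max_items:
        leftovers = np.setdiff1d(np.arange(len(categories)), picked)
        extra = min(max_items - len(picked), len(leftovers))
        picked.extend(rng.choice(leftovers, size=extra, replace=False).tolist())
    return sorted(picked)


# Made-up category labels for 15 chunks.
demo_categories = ["hrv"] * 10 + ["nutrition"] * 3 + ["sleep"] * 2
demo_indices = stratified_sample(demo_categories, max_items=9)
print(demo_indices)
```

Returning a sorted list keeps the heatmap rows in corpus order, which makes repeated runs easier to compare.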

## 4. Query-Document Similarity Visualization

Let's visualize how different queries relate to our document corpus.

In [5]:
# Define some sample queries
sample_queries = [
    "How does HRV relate to recovery?",
    "What are the best sleep optimization strategies?",
    "How to monitor training load effectively?",
    "What causes overtraining syndrome?",
    "How does nutrition affect recovery?"
]

# Generate embeddings for queries
query_embeddings = embedder.embed_query(sample_queries)

# Calculate similarity between queries and documents
query_doc_similarity = cosine_similarity(query_embeddings, doc_embeddings)

# Create a DataFrame for visualization
# Use the same sampled documents as before for consistency
subset_query_similarity = query_doc_similarity[:, sampled_indices]
query_similarity_df = pd.DataFrame(subset_query_similarity,
                                  index=sample_queries,
                                  columns=subset_titles)

# Plot the heatmap
plt.figure(figsize=(16, 8))
sns.heatmap(query_similarity_df, annot=True, cmap='YlGnBu', vmin=0, vmax=1, fmt='.2f')
plt.title('Query-Document Similarity Heatmap', fontsize=16)
plt.tight_layout()
plt.savefig(OUTPUT_DIR / "query_document_similarity.png", dpi=300)
plt.show()

# For each query, show the top 3 most similar documents
for i, query in enumerate(sample_queries):
    # Get similarities for all documents
    similarities = query_doc_similarity[i]
    
    # Get indices of top 3 most similar documents
    top_indices = np.argsort(similarities)[::-1][:3]
    
    print(f"\nQuery: {query}")
    print("Top 3 most similar documents:")
    for idx in top_indices:
        print(f"- {documents[idx]['title']} (Similarity: {similarities[idx]:.3f})")
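
For reference, cosine similarity is the dot product of two vectors divided by the product of their L2 norms, and the ranking above is a sort on that score. The sketch below reproduces both steps in plain NumPy on hand-made vectors. `cosine_top_k` is a hypothetical helper, and it assumes no vector has zero length.

```python
import numpy as np


def cosine_top_k(query_vec, doc_matrix, k=3):
    """Rank rows of doc_matrix by cosine similarity to query_vec.

    Hypothetical helper; assumes every vector has a non-zero norm.
    """
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    sims = (doc_matrix @ query_vec) / (doc_norms * query_norm)
    order = np.argsort(sims)[::-1][: min(k, len(sims))]
    return [(int(i), float(sims[i])) for i in order]


# Four toy "documents" in 2-D and a query pointing along the first axis.
demo_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
demo_query = np.array([1.0, 0.0])

demo_top = cosine_top_k(demo_query, demo_docs, k=2)
print(demo_top)  # documents 0 and 2 rank highest
```

Capping `k` at the number of documents avoids surprises when the corpus is smaller than the requested list.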

## 5. UMAP Visualization with Category Coloring

UMAP often retains more of the global structure than t-SNE while still keeping local neighborhoods together, although the result depends on `n_neighbors` and `min_dist`.

In [6]:
# Apply UMAP for dimensionality reduction
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
umap_embeddings = reducer.fit_transform(doc_embeddings)

# Update the DataFrame for plotting
plot_df['umap_x'] = umap_embeddings[:, 0]
plot_df['umap_y'] = umap_embeddings[:, 1]

# Plot UMAP visualization
plt.figure(figsize=(14, 10))
sns.scatterplot(data=plot_df, x='umap_x', y='umap_y', hue='category', style='category', s=100, alpha=0.7)
plt.title('UMAP Visualization of Document Embeddings by Category', fontsize=16)
plt.xlabel('UMAP Dimension 1', fontsize=12)
plt.ylabel('UMAP Dimension 2', fontsize=12)
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')

# Add annotations for a few points
for _, row in plot_df.sample(min(10, len(plot_df)), random_state=42).iterrows():
    plt.annotate(row['title'][:20] + '...', 
                 (row['umap_x'], row['umap_y']),
                 xytext=(5, 5),
                 textcoords='offset points',
                 fontsize=8,
                 bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.7))

plt.tight_layout()
plt.savefig(OUTPUT_DIR / "umap_visualization.png", dpi=300)
plt.show()
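
Scatter plots alone make it hard to judge how faithfully a projection keeps neighborhoods intact. One way to put a number on local structure is scikit-learn's `trustworthiness` score, which ranges from 0 to 1. The sketch below compares PCA and t-SNE on synthetic clusters. The data and variable names are made up for illustration, and the same call could be applied to `umap_embeddings` from the cell above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# Synthetic stand-in for doc_embeddings: three separated groups in 16-D.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 16))
demo_embeddings = np.vstack([c + rng.normal(size=(20, 16)) for c in centers])

projections = {
    "PCA": PCA(n_components=2, random_state=42).fit_transform(demo_embeddings),
    "t-SNE": TSNE(
        n_components=2, perplexity=10, random_state=42
    ).fit_transform(demo_embeddings),
}

# Higher means fewer points that are close in 2-D but far apart originally.
scores = {
    name: trustworthiness(demo_embeddings, projected, n_neighbors=5)
    for name, projected in projections.items()
}
for name, score in scores.items():
    print(f"{name}: trustworthiness = {score:.3f}")
```

Trustworthiness only measures local neighborhoods, so it says nothing about whether distances between clusters are meaningful.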

## 6. Visualize Document Clusters

Let's use K-means clustering to identify natural groups in our document space.

In [7]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Determine optimal number of clusters using silhouette score
silhouette_scores = []
K_range = range(2, min(10, len(documents)))

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(doc_embeddings)
    silhouette_avg = silhouette_score(doc_embeddings, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    print(f"For n_clusters = {k}, the silhouette score is {silhouette_avg:.3f}")

# Plot silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(K_range, silhouette_scores, 'o-')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Different Numbers of Clusters')
plt.grid(True)
plt.savefig(OUTPUT_DIR / "silhouette_scores.png", dpi=300)
plt.show()

# Choose optimal k (highest silhouette score)
optimal_k = K_range[np.argmax(silhouette_scores)]
print(f"Optimal number of clusters: {optimal_k}")

# Apply K-means with optimal k
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(doc_embeddings)

# Add cluster labels to the DataFrame
plot_df['cluster'] = cluster_labels

# Visualize clusters on UMAP projection
plt.figure(figsize=(14, 10))
sns.scatterplot(data=plot_df, x='umap_x', y='umap_y', hue='cluster', palette='tab10', s=100, alpha=0.7)
plt.title(f'Document Clusters (K={optimal_k}) Visualized with UMAP', fontsize=16)
plt.xlabel('UMAP Dimension 1', fontsize=12)
plt.ylabel('UMAP Dimension 2', fontsize=12)
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.savefig(OUTPUT_DIR / "document_clusters.png", dpi=300)
plt.show()

# Compare clusters with categories
cluster_category_counts = pd.crosstab(plot_df['cluster'], plot_df['category'])
print("\nCluster-Category Distribution:")
display(cluster_category_counts)
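
The cross-tabulation shows the overlap between clusters and categories but leaves the degree of agreement to the eye. A single-number summary such as the adjusted Rand index (ARI) can complement it: values near 1 mean the two labelings match, and values near 0 mean agreement is no better than chance. The sketch below uses synthetic, well-separated groups with made-up category names, so its score says nothing about the real corpus.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in: three tight groups, each tagged with a made-up category.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
demo_embeddings = np.vstack([c + rng.normal(size=(15, 2)) for c in centers])
demo_categories = ["sleep"] * 15 + ["hrv"] * 15 + ["nutrition"] * 15

demo_clusters = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(
    demo_embeddings
)

# ARI ignores how clusters are numbered, so no label matching is needed.
ari = adjusted_rand_score(demo_categories, demo_clusters)
print(f"Adjusted Rand index: {ari:.3f}")
```

In the notebook, the equivalent call would be `adjusted_rand_score(plot_df['category'], plot_df['cluster'])`.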

## 7. Embedding Space Analysis for a Specific Query

Let's visualize how a specific query relates to the document space.

In [8]:
# Choose a specific query for detailed analysis
focus_query = "How does sleep quality affect HRV and recovery?"
focus_query_embedding = embedder.embed_query([focus_query])[0]

# Calculate similarity to all documents
similarities = cosine_similarity([focus_query_embedding], doc_embeddings)[0]

# Add similarity scores to the DataFrame
plot_df['query_similarity'] = similarities

# Visualize on UMAP projection with similarity as color intensity
plt.figure(figsize=(14, 10))
scatter = plt.scatter(plot_df['umap_x'], plot_df['umap_y'], 
                      c=plot_df['query_similarity'], cmap='YlOrRd', 
                      s=100, alpha=0.8, edgecolors='k', linewidths=0.5)

plt.colorbar(scatter, label='Similarity to Query')
plt.title(f'Document Similarity to Query: "{focus_query}"', fontsize=16)
plt.xlabel('UMAP Dimension 1', fontsize=12)
plt.ylabel('UMAP Dimension 2', fontsize=12)

# Annotate top 5 most similar documents
top5_indices = np.argsort(similarities)[::-1][:5]
for idx in top5_indices:
    row = plot_df.iloc[idx]
    plt.annotate(f"{row['title'][:20]}... ({similarities[idx]:.2f})", 
                 (row['umap_x'], row['umap_y']),
                 xytext=(10, 10),
                 textcoords='offset points',
                 fontsize=9,
                 bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.8),
                 arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))

plt.tight_layout()
plt.savefig(OUTPUT_DIR / "query_similarity_map.png", dpi=300)
plt.show()

# Print top 10 most similar documents
print(f"Top 10 documents most similar to query: '{focus_query}'\n")
top10_indices = np.argsort(similarities)[::-1][:10]
for i, idx in enumerate(top10_indices):
    print(f"{i+1}. {documents[idx]['title']} (Similarity: {similarities[idx]:.3f})")

## 8. Comparing Embedding Quality

Let's compare how well our embeddings capture semantic relationships by testing with related queries.

In [9]:
# Define semantically related queries
related_queries = [
    "What is HRV?",
    "How is heart rate variability measured?",
    "What factors affect heart rate variability?",
    "How does HRV relate to stress?",
    "Can HRV predict overtraining?"
]

# Generate embeddings
related_embeddings = embedder.embed_query(related_queries)

# Calculate similarity between queries
query_similarity = cosine_similarity(related_embeddings)

# Visualize query similarity
plt.figure(figsize=(10, 8))
sns.heatmap(query_similarity, annot=True, cmap='Blues', vmin=0, vmax=1,
            xticklabels=[q[:20] + '...' for q in related_queries],
            yticklabels=[q[:20] + '...' for q in related_queries])
plt.title('Semantic Similarity Between Related Queries', fontsize=16)
plt.tight_layout()
plt.savefig(OUTPUT_DIR / "query_semantic_similarity.png", dpi=300)
plt.show()
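
The diagonal of this heatmap is always 1, so the informative part is the off-diagonal entries. Averaging them gives one number that could be compared with the same statistic for a set of unrelated queries. That baseline is a suggestion, not something computed in this notebook. The sketch below is NumPy-only, and `mean_off_diagonal` is a hypothetical helper.

```python
import numpy as np


def mean_off_diagonal(similarity: np.ndarray) -> float:
    """Average pairwise similarity, excluding each item's match with itself."""
    n = similarity.shape[0]
    if similarity.ndim != 2 or similarity.shape[1] != n:
        raise ValueError("expected a square similarity matrix")
    if n < 2:
        raise ValueError("need at least two items to compare")
    off_diagonal = ~np.eye(n, dtype=bool)
    return float(similarity[off_diagonal].mean())


# Made-up similarity matrix for three related queries.
demo_similarity = np.array(
    [
        [1.0, 0.8, 0.6],
        [0.8, 1.0, 0.7],
        [0.6, 0.7, 1.0],
    ]
)
demo_mean = mean_off_diagonal(demo_similarity)
print(f"Mean pairwise similarity: {demo_mean:.2f}")
```

Applied to the notebook's data, `mean_off_diagonal(query_similarity)` would summarize the related queries in a single value.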

## Conclusion

In this notebook, we've visualized and analyzed the embedding space of our document corpus. Key findings include:

1. **Document Clustering**: Using K-means with k chosen by silhouette score, we identified clusters in our document space and compared them with the document categories through a cross-tabulation.

2. **Query-Document Relationships**: We visualized how different queries relate to our documents, showing which documents are most semantically similar to specific queries.

3. **Embedding Quality**: We evaluated the quality of our embeddings by examining how well they capture semantic relationships between related queries.

These visualizations provide valuable insights into the behavior of our retrieval system and can help identify areas for improvement.