# Notebook 04: Embeddings and Semantics Analysis

## 🎯 What is This Notebook About?

This notebook analyzes **semantic similarity** between close notes using **embeddings** to understand how meaning relates to quality.

**Context:**
1. We have **two datasets:**
   - **Reference Dataset** (good close notes) - High-quality examples
   - **Other Incidents Dataset** (bad/regular close notes) - Standard examples
   
2. We want to understand: **Do good close notes cluster together semantically?**
   - If yes → Semantic similarity can help evaluate quality
   - If no → We need more sophisticated evaluation (LLM-as-a-Judge)

**This notebook's purpose:**
- **Generate embeddings** - Create semantic representations for all close notes
- **Compare semantic similarity** - Measure how similar good vs bad close notes are
- **Visualize relationships** - Show semantic space and clustering
- **Validate quality scores** - Check if semantic similarity correlates with quality

**What we expect to find (the hypothesis under test):**
- Good close notes should be semantically closer to each other
- Bad close notes should be further from good references
- If both hold, semantic evaluation is a useful signal for assessing quality

---

## 📚 Key Concepts Explained

### What are Embeddings?

**Embeddings** are mathematical representations of text that capture **meaning**, not just words.

**Think of it like this:**
- **Words:** "User cannot login" vs "Login failure"
- **Similar meaning** → Similar embeddings (close in semantic space)
- **Different meaning** → Different embeddings (far in semantic space)

**How embeddings work:**
- Each close note becomes a **vector** (list of numbers)
- Similar meanings → Similar vectors → Close together in space
- Different meanings → Different vectors → Far apart in space

**Example:**
- "Reset password and verified access" → Embedding: [0.2, -0.5, 0.8, ...]
- "Password reset successful" → Embedding: [0.25, -0.48, 0.82, ...]
- These are **close** because they mean similar things!

### What is Semantic Similarity?

**Semantic similarity** measures how similar two texts are in **meaning**, not just words.

**How we measure it:**
- Calculate **cosine similarity** between embeddings
- Score ranges from **-1.0** to **1.0** (the bands below are rough rules of thumb; useful thresholds depend on the embedding model):
  - **1.0** = Identical meaning
  - **0.8-0.9** = Very similar meaning
  - **0.5-0.7** = Somewhat similar
  - **0.0-0.4** = Different meaning
  - **-1.0** = Opposite meaning

**Why this matters:**
- "Issue resolved" and "Problem fixed" have **different words** but **similar meaning**
- Semantic similarity captures this; word overlap does not.
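
As a small illustration of the measure described above, the sketch below computes cosine similarity with NumPy. The 3-dimensional vectors are toy values borrowed from the truncated example embeddings earlier in this section (an assumption for illustration only; real BGE-M3 embeddings have 1024 dimensions), and `cosine` is a hypothetical helper name.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors (illustrative only, not real model output)
reset_and_verified = np.array([0.20, -0.50, 0.80])
reset_successful = np.array([0.25, -0.48, 0.82])

sim_close = cosine(reset_and_verified, reset_successful)
sim_opposite = cosine(reset_and_verified, -reset_and_verified)

print(f"similar notes:    {sim_close:.3f}")     # close to 1.0
print(f"opposite vectors: {sim_opposite:.3f}")  # -1.000
```

Because the score depends only on the angle between the vectors, scaling a vector up or down does not change it.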

### Why This Analysis Matters

**Hypothesis:** Good close notes should be semantically similar to each other and to reference examples.

**What we're testing:**
- ✅ **If confirmed:** Semantic similarity can help evaluate quality
- ✅ **Validation:** Quality scores make sense (similar scores = similar semantics)
- ✅ **Foundation:** Prepare for LLM-as-a-Judge evaluation (Notebook 05)

**Expected outcome:**
- Good close notes cluster together semantically
- Bad close notes are further from good references
- Semantic similarity correlates with quality scores

---

## 🎯 Objectives

This notebook will:
1. **Load** reference and other incidents datasets
2. **Generate embeddings** for all close notes
3. **Calculate semantic similarity** between good and bad close notes
4. **Visualize** semantic relationships (t-SNE, heatmaps)
5. **Validate** quality scores using semantic similarity
6. **Prepare** for LLM-as-a-Judge evaluation (Notebook 05)

---

## 📋 What We're Analyzing

**Datasets:**
- **Reference Dataset** (`reference_close_notes.csv`): High-quality close notes
  - Contains: `close_notes_ref` - well-written resolution notes
  
- **Other Incidents Dataset** (`other_incidents.csv`): Remaining incidents
  - Contains: `close_notes` - standard close notes (for comparison)

**What we'll compare:**
- Semantic similarity **within** reference dataset (good vs good)
- Semantic similarity **within** other incidents dataset (bad vs bad)
- Semantic similarity **between** reference and other incidents (good vs bad)
- Correlation between semantic similarity and quality scores

**Output:**
- Embeddings for all close notes
- Similarity analysis results
- Visualizations showing semantic relationships
- Validation that quality scores align with semantic similarity
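
One possible way to check the last point, sketched under assumptions: score each "other" note by its highest cosine similarity to any reference note, then rank-correlate that with a numeric quality score. The `quality_scores` array and both helper names are hypothetical; the datasets' actual score column is not shown here, and the rank correlation below uses simple ranks without tie correction.

```python
import numpy as np

def max_similarity_to_references(other: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """For each row of `other`, the highest cosine similarity to any reference row."""
    other_n = other / np.linalg.norm(other, axis=1, keepdims=True)
    ref_n = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return (other_n @ ref_n.T).max(axis=1)

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation (simple ranks, no tie correction)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Toy data: 2-d "embeddings" and hypothetical quality scores (higher = better)
reference = np.array([[1.0, 0.0], [0.9, 0.1]])
other = np.array([[0.95, 0.05], [0.7, 0.7], [0.1, 1.0], [0.0, 1.0]])
quality_scores = np.array([4.5, 3.0, 2.0, 1.0])

best_sim = max_similarity_to_references(other, reference)
rho = spearman(best_sim, quality_scores)
print(f"Spearman correlation: {rho:.2f}")  # 1.00 for this monotone toy data
```

A rank correlation is used here because quality scores are often ordinal; with tied scores, a tie-aware implementation would be the safer choice.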

---

## 🔧 Using Embedding Models

**Model:** BAAI/bge-m3 (BAAI General Embedding)
- **Multilingual** - Works with multiple languages
- **High quality** - State-of-the-art semantic understanding
- **1024 dimensions** - Rich representation of meaning

**Why this model?**
- Proven performance on semantic similarity tasks
- Handles technical language well
- Widely used open-source embedding model

---



In [None]:
# Import required libraries
# These are the tools we need to work with data, embeddings, and visualizations

import pandas as pd  # For working with tables (like Excel spreadsheets)
import numpy as np   # For mathematical operations
import matplotlib.pyplot as plt  # For creating charts and graphs
import seaborn as sns  # For prettier charts
from pathlib import Path  # For handling file paths
import sys
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')  # Hide warning messages to keep output clean

# Add src directory to path so we can use utility functions
sys.path.append(str(Path("../src").resolve()))

# Embedding libraries
try:
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim
    EMBEDDINGS_AVAILABLE = True
    print("✅ Sentence-transformers imported successfully")
except ImportError:
    print("⚠️ sentence-transformers not available. Install with: pip install sentence-transformers")
    EMBEDDINGS_AVAILABLE = False

# Visualization libraries
try:
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA
    from sklearn.metrics.pairwise import cosine_similarity
    VISUALIZATION_AVAILABLE = True
    print("✅ Visualization libraries imported successfully")
except ImportError:
    print("⚠️ sklearn not available. Install with: pip install scikit-learn")
    VISUALIZATION_AVAILABLE = False

# Set up plotting style (makes charts look nicer)
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    plt.style.use('seaborn')
sns.set_palette("husl")

# Display charts in the notebook
%matplotlib inline

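# Only report full success when both optional import groups actually loaded
if EMBEDDINGS_AVAILABLE and VISUALIZATION_AVAILABLE:
    print("\n✅ All libraries imported successfully!")
else:
    print("\n⚠️ Some optional libraries are missing; see the messages above.")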
print("="*80)


## 1. Load Datasets

**What we're doing:** Loading both datasets we want to compare semantically.

**Why:** We need both datasets in memory before we can generate embeddings and compare them.

**Datasets we'll load:**
- **Reference Dataset** (`reference_close_notes.csv`): High-quality close notes (created in Notebook 02)
  - Contains: `close_notes_ref` - well-written resolution notes
  
- **Other Incidents Dataset** (`other_incidents.csv`): Remaining incidents (created in Notebook 02)
  - Contains: `close_notes` - standard close notes (for comparison)

**What we'll analyze:**
- Semantic similarity between good and bad close notes
- Whether good close notes cluster together semantically
- If semantic similarity correlates with quality scores


In [None]:
# Load datasets
# We'll load both datasets so we can compare them semantically

data_dir = Path("../data")  # Where our data files are stored

# Load reference dataset (good close notes)
reference_path = data_dir / "reference_close_notes.csv"
reference_df = pd.read_csv(reference_path)
print(f"✅ Loaded reference dataset: {len(reference_df)} records")
print(f"   Columns: {list(reference_df.columns)}")

# Load other incidents dataset (bad/regular close notes)
other_incidents_path = data_dir / "other_incidents.csv"
other_incidents_df = pd.read_csv(other_incidents_path)
print(f"✅ Loaded other incidents dataset: {len(other_incidents_df)} records")
print(f"   Columns: {list(other_incidents_df.columns)}")

# Display basic info about both datasets
print("\n" + "="*60)
print("Reference Dataset Info:")
print("="*60)
print(reference_df.info())

print("\n" + "="*60)
print("Other Incidents Dataset Info:")
print("="*60)
print(other_incidents_df.info())

print("\n" + "="*60)
print("Summary:")
print("="*60)
print(f"📊 Reference (good) close notes: {len(reference_df)}")
print(f"📊 Other incidents (bad/regular) close notes: {len(other_incidents_df)}")
print(f"📊 Total close notes to analyze: {len(reference_df) + len(other_incidents_df)}")


## 2. Prepare Data for Embedding Generation

**What we're doing:** Preparing close notes text from both datasets for embedding generation.

**Why:** We need clean, consistent text before generating embeddings. We'll extract:
- `close_notes_ref` from reference dataset (good examples)
- `close_notes` from other incidents dataset (bad/regular examples)

**What to check:**
- Make sure we have text content (not empty)
- Handle missing values
- Prepare a combined dataset for analysis
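
One pitfall worth illustrating when handling missing values: casting a column to `str` turns `NaN` into the literal text `"nan"`, which then has to be filtered out by name. A minimal sketch of an alternative, assuming it is acceptable to drop missing rows before the cast (the toy frame and `MIN_CHARS` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with a hypothetical mix of valid, missing, blank, and too-short notes
notes = pd.DataFrame({
    "close_notes": [
        "Reset password and verified the user can log in.",
        np.nan,
        "   ",
        "fixed",
    ]
})

MIN_CHARS = 10  # same threshold idea as a "very short note" length filter

# Drop missing values first, so NaN never becomes the literal string "nan"
text = notes["close_notes"].dropna().astype(str).str.strip()

# Measure length on the stripped text so whitespace padding does not count
clean = text[text.str.len() > MIN_CHARS]

print(clean.tolist())  # ['Reset password and verified the user can log in.']
```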


In [None]:
# Prepare data for embedding generation
# We'll combine both datasets and prepare the close notes text

# Prepare reference dataset
reference_df_prep = reference_df.copy()
reference_df_prep['close_notes_text'] = reference_df_prep['close_notes_ref'].astype(str)
reference_df_prep['dataset_type'] = 'reference'
reference_df_prep['quality_label'] = 'good'

# Prepare other incidents dataset
other_incidents_df_prep = other_incidents_df.copy()
other_incidents_df_prep['close_notes_text'] = other_incidents_df_prep['close_notes'].astype(str)
other_incidents_df_prep['dataset_type'] = 'other'
other_incidents_df_prep['quality_label'] = 'bad/regular'

# Combine both datasets
# Select common columns for comparison
common_cols = ['number', 'category', 'subcategory', 'close_notes_text', 'dataset_type', 'quality_label']
all_close_notes = pd.concat([
    reference_df_prep[common_cols],
    other_incidents_df_prep[common_cols]
], ignore_index=True)

# Filter out empty or very short close notes
all_close_notes = all_close_notes[
    (all_close_notes['close_notes_text'].str.strip() != '') &
    (all_close_notes['close_notes_text'].str.strip() != 'nan') &
    (all_close_notes['close_notes_text'].str.len() > 10)
].copy()

print("="*60)
print("DATA PREPARATION SUMMARY")
print("="*60)
print(f"📊 Reference (good) close notes: {len(reference_df_prep)}")
print(f"📊 Other incidents (bad/regular) close notes: {len(other_incidents_df_prep)}")
print(f"📊 Total after filtering: {len(all_close_notes)}")
print(f"\n📋 Dataset breakdown:")
print(f"   - Reference (good): {len(all_close_notes[all_close_notes['dataset_type'] == 'reference'])}")
print(f"   - Other incidents (bad/regular): {len(all_close_notes[all_close_notes['dataset_type'] == 'other'])}")

# Show sample close notes
print("\n" + "="*60)
print("SAMPLE CLOSE NOTES")
print("="*60)
print("\n📝 Reference (Good) Example:")
ref_sample = all_close_notes[all_close_notes['dataset_type'] == 'reference']['close_notes_text'].iloc[0]
print(f"   {ref_sample[:200]}...")

print("\n📝 Other Incidents (Bad/Regular) Example:")
other_sample = all_close_notes[all_close_notes['dataset_type'] == 'other']['close_notes_text'].iloc[0]
print(f"   {other_sample[:200]}...")


## 3. Generate Embeddings

**What we're doing:** Creating semantic embeddings for all close notes using an embedding model.

**How it works:**
1. Load the embedding model (BAAI/bge-m3)
2. Convert each close note text into a vector (embedding)
3. Store embeddings for similarity calculations

**What embeddings capture:**
- **Meaning** - Similar meanings → similar embeddings
- **Context** - Understands technical language
- **Relationships** - Can measure semantic similarity

**Why this matters:**
- Embeddings allow us to measure meaning, not just word overlap
- We can compare good vs bad close notes semantically
- This lets us test whether quality correlates with meaning
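
A related detail that can simplify later similarity calculations: once every embedding is scaled to unit length, a plain dot product equals cosine similarity. The sketch below shows this with random stand-in vectors rather than real model output. In sentence-transformers, passing `normalize_embeddings=True` to `encode` is one way to obtain unit vectors directly; whether to use it here is a design choice, not something this notebook requires.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))  # stand-in for model output, not real embeddings

# L2-normalize each row so every vector has length 1
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# With unit vectors, the dot-product matrix is the cosine-similarity matrix
similarity = unit @ unit.T

print(np.allclose(np.diag(similarity), 1.0))  # True: each note compared with itself
```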


In [None]:
# Generate embeddings for all close notes
# This will take a few minutes depending on the number of close notes

if not EMBEDDINGS_AVAILABLE:
    print("⚠️ Cannot generate embeddings: sentence-transformers not available")
    print("   Install with: pip install sentence-transformers")
else:
    import os
    
    print("="*60)
    print("LOADING EMBEDDING MODEL")
    print("="*60)
    
    # Model selection: Use BGE-M3 for multilingual, multi-granularity support
    DEFAULT_MODEL = 'BAAI/bge-m3'  # Multilingual, supports dense/sparse/multi-vector retrieval
    embedding_model_name = os.getenv('EMBEDDING_MODEL', DEFAULT_MODEL)
    print(f"📦 Using model: {embedding_model_name}")
    print("   This model creates 1024-dimensional embeddings that capture meaning")
    
    use_flag_embedding = False
    try:
        # Try sentence-transformers first
        model = SentenceTransformer(embedding_model_name, trust_remote_code=True)
        embedding_dim = model.get_sentence_embedding_dimension()
        print(f"✅ Model loaded: {embedding_dim}-dimensional embeddings")
    except Exception as e:
        print(f"⚠️ Error loading with sentence-transformers: {e}")
        print("   Trying FlagEmbedding library...")
        try:
            from FlagEmbedding import BGEM3FlagModel
            model = BGEM3FlagModel(embedding_model_name)
            use_flag_embedding = True
            embedding_dim = 1024
            print(f"✅ Model loaded via FlagEmbedding: {embedding_dim}-dimensional embeddings")
        except ImportError:
            print("⚠️ FlagEmbedding not installed. Install with: pip install FlagEmbedding")
            raise
        except Exception as e2:
            print(f"⚠️ Error loading with FlagEmbedding: {e2}")
            raise
    
    print("\n" + "="*60)
    print("GENERATING EMBEDDINGS")
    print("="*60)
    print(f"📊 Generating embeddings for {len(all_close_notes)} close notes...")
    print("   This may take a few minutes...")
    
    # Extract close notes text
    close_notes_texts = all_close_notes['close_notes_text'].astype(str).tolist()
    
    # Generate embeddings
    if use_flag_embedding:
        # FlagEmbedding returns dict with 'dense_vecs', 'sparse', 'colbert_vecs'
        output = model.encode(close_notes_texts, return_dense=True, return_sparse=False, return_colbert_vecs=False)
        embeddings = output['dense_vecs']
    else:
        embeddings = model.encode(close_notes_texts, show_progress_bar=True, batch_size=32)
    
    print(f"\n✅ Generated embeddings for {len(embeddings)} close notes")
    print(f"   Embedding dimensions: {embeddings.shape}")
    print(f"   Each close note is now represented as a {embeddings.shape[1]}-dimensional vector")
    
    # Store embeddings in dataframe
    all_close_notes['embedding'] = embeddings.tolist()
    
    print("\n✅ Embeddings generated and stored!")
    print("   Ready for semantic similarity analysis")


## 4. Calculate Semantic Similarity

**What we're doing:** Calculating how similar close notes are to each other semantically.

**What we'll calculate:**
1. **Within-group similarity:**
   - How similar are good close notes to each other?
   - How similar are bad close notes to each other?

2. **Between-group similarity:**
   - How similar are bad close notes to good references?
   - This tells us if good and bad are semantically different

3. **Average similarity scores:**
   - Average similarity within reference dataset
   - Average similarity within other incidents dataset
   - Average similarity between datasets

**Expected results:**
- Good close notes should be more similar to each other (higher within-group similarity)
- Bad close notes should be less similar to good references (lower between-group similarity)
- This validates that semantic similarity can distinguish quality
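
The within-group and between-group averages described above can be sketched with NumPy alone. The 2-d vectors are toy data and the helper names are hypothetical. The key detail is that the within-group mean uses only the upper triangle of the similarity matrix, so self-similarity (always 1.0) and duplicate pairs do not inflate the average.

```python
import numpy as np

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T

def within_group_mean(x: np.ndarray) -> float:
    """Mean over unique pairs only (upper triangle, diagonal excluded)."""
    sim = cosine_matrix(x, x)
    rows, cols = np.triu_indices(len(x), k=1)
    return float(sim[rows, cols].mean())

def between_group_mean(a: np.ndarray, b: np.ndarray) -> float:
    """Mean over every (a, b) pair; there is no diagonal to exclude."""
    return float(cosine_matrix(a, b).mean())

# Toy groups: "good" vectors point roughly the same way, "bad" ones are scattered
good = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]])
bad = np.array([[0.1, 1.0], [0.5, 0.5], [-0.2, 0.9]])

within_good = within_group_mean(good)
between = between_group_mean(good, bad)
separation = within_good - between

print(f"within good: {within_good:.3f}, between: {between:.3f}, separation: {separation:.3f}")
```

A positive `separation` only says the averages differ; how large the gap is relative to the spread of the scores matters before drawing conclusions.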


In [None]:
# Calculate semantic similarity between close notes
# We'll compare good vs good, bad vs bad, and good vs bad

if 'embedding' not in all_close_notes.columns:
    print("⚠️ Embeddings not available. Please run the embedding generation cell first.")
else:
    print("="*60)
    print("CALCULATING SEMANTIC SIMILARITY")
    print("="*60)
    
    # Convert embeddings to numpy array for faster computation
    embeddings_array = np.array(all_close_notes['embedding'].tolist())
    
    # Separate reference and other incidents
    reference_mask = all_close_notes['dataset_type'] == 'reference'
    reference_embeddings = embeddings_array[reference_mask]
    other_embeddings = embeddings_array[~reference_mask]
    
    print(f"\n📊 Reference (good) close notes: {len(reference_embeddings)}")
    print(f"📊 Other incidents (bad/regular) close notes: {len(other_embeddings)}")

    # Calculate similarity matrices
    print("\n🔄 Calculating similarity matrices...")
    
    # Within reference (good vs good)
    if len(reference_embeddings) > 1:
        ref_ref_similarity = cosine_similarity(reference_embeddings, reference_embeddings)
        # Remove diagonal (self-similarity = 1.0)
        ref_ref_similarity_nodiag = ref_ref_similarity[np.triu_indices(len(reference_embeddings), k=1)]
        ref_ref_mean = ref_ref_similarity_nodiag.mean()
        ref_ref_std = ref_ref_similarity_nodiag.std()
    else:
        ref_ref_mean = 0.0
        ref_ref_std = 0.0
    
    # Within other incidents (bad vs bad)
    if len(other_embeddings) > 1:
        other_other_similarity = cosine_similarity(other_embeddings, other_embeddings)
        # Remove diagonal
        other_other_similarity_nodiag = other_other_similarity[np.triu_indices(len(other_embeddings), k=1)]
        other_other_mean = other_other_similarity_nodiag.mean()
        other_other_std = other_other_similarity_nodiag.std()
    else:
        other_other_mean = 0.0
        other_other_std = 0.0
    
    # Between reference and other incidents (good vs bad)
    ref_other_similarity = cosine_similarity(reference_embeddings, other_embeddings)
    ref_other_mean = ref_other_similarity.mean()
    ref_other_std = ref_other_similarity.std()
    
    # Store results
    similarity_results = {
        'within_reference_mean': ref_ref_mean,
        'within_reference_std': ref_ref_std,
        'within_other_mean': other_other_mean,
        'within_other_std': other_other_std,
        'between_ref_other_mean': ref_other_mean,
        'between_ref_other_std': ref_other_std
    }
    
    # Display results
    print("\n" + "="*60)
    print("SEMANTIC SIMILARITY RESULTS")
    print("="*60)
    print(f"\n📊 Within Reference (Good vs Good):")
    print(f"   Mean similarity: {ref_ref_mean:.4f}")
    print(f"   Std deviation: {ref_ref_std:.4f}")
    print(f"   Interpretation: {'High' if ref_ref_mean > 0.7 else 'Moderate' if ref_ref_mean > 0.5 else 'Low'} similarity")
    
    print(f"\n📊 Within Other Incidents (Bad vs Bad):")
    print(f"   Mean similarity: {other_other_mean:.4f}")
    print(f"   Std deviation: {other_other_std:.4f}")
    print(f"   Interpretation: {'High' if other_other_mean > 0.7 else 'Moderate' if other_other_mean > 0.5 else 'Low'} similarity")
    
    print(f"\n📊 Between Reference and Other (Good vs Bad):")
    print(f"   Mean similarity: {ref_other_mean:.4f}")
    print(f"   Std deviation: {ref_other_std:.4f}")
    print(f"   Interpretation: {'High' if ref_other_mean > 0.7 else 'Moderate' if ref_other_mean > 0.5 else 'Low'} similarity")
    
    # Key insight
    print("\n" + "="*60)
    print("KEY INSIGHT")
    print("="*60)
    if ref_ref_mean > ref_other_mean:
        print("✅ Good close notes are MORE similar to each other than to bad close notes")
        print("   → Semantic similarity shows some separation between good and bad")
        print(f"   → Gap in mean similarity: {ref_ref_mean - ref_other_mean:.4f} (a small gap is weak evidence)")
    elif ref_ref_mean < ref_other_mean:
        print("⚠️ Good close notes are LESS similar to each other than to bad close notes")
        print("   → This is unexpected - semantic similarity may not distinguish quality well")
    else:
        print("⚠️ Good and bad close notes show similar semantic similarity")
        print("   → Semantic similarity alone may not be sufficient for evaluation")

    print("\n✅ Similarity analysis complete!")


## 5. Visualize Semantic Relationships

**What we're doing:** Creating visualizations to see how close notes cluster in semantic space.

**Visualizations we'll create:**
1. **t-SNE Plot** - Shows 2D representation of semantic space
   - Colors: One color per incident category
   - Markers: Circles (○) for good close notes, squares (□) for bad/regular close notes
   - Clusters: Good close notes should cluster together
   
2. **Similarity Heatmap** - Shows similarity matrix
   - Dark colors = High similarity
   - Light colors = Low similarity
   - Blocks should show clustering

**What to look for:**
- **Good clustering:** Good close notes grouped together (circles close to each other)
- **Separation:** Good and bad close notes in different areas
- **Validation:** Confirms semantic similarity can distinguish quality
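
A minimal sketch of the similarity heatmap described above, using toy 2-d vectors rather than the notebook's embeddings. The function name `similarity_heatmap` is hypothetical. Rows and columns are sorted by label so that same-label notes sit next to each other, which is what makes block structure visible, and the `Blues` colormap is chosen so that darker cells mean higher similarity.

```python
from typing import Sequence

import numpy as np
import matplotlib.pyplot as plt

def similarity_heatmap(embeddings: np.ndarray, labels: Sequence[str]):
    """Plot a cosine-similarity heatmap with rows/columns grouped by label."""
    # Stable sort keeps the original order inside each label group
    order = np.argsort(np.array(labels), kind="stable")
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (unit @ unit.T)[np.ix_(order, order)]

    fig, ax = plt.subplots(figsize=(5, 4))
    image = ax.imshow(sim, cmap="Blues", vmin=-1.0, vmax=1.0)  # darker = more similar
    fig.colorbar(image, ax=ax, label="cosine similarity")
    ax.set_title("Similarity heatmap (grouped by label)")
    return fig, sim

# Toy data: two "good" and two "bad" notes, interleaved on purpose
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
toy_labels = ["good", "bad", "good", "bad"]

fig, sim = similarity_heatmap(toy_embeddings, toy_labels)
```

With real data, dark square blocks along the diagonal would indicate that notes sharing a label are more similar to each other than to the rest.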


In [None]:
# Visualize semantic relationships using t-SNE
# This reduces high-dimensional embeddings to 2D for visualization
# Colors represent categories, marker shapes represent quality (good vs bad)

if 'embedding' not in all_close_notes.columns:
    print("⚠️ Embeddings not available. Please run the embedding generation cell first.")
elif not VISUALIZATION_AVAILABLE:
    print("⚠️ Visualization libraries not available. Install with: pip install scikit-learn")
else:
    print("="*60)
    print("CREATING t-SNE VISUALIZATION")
    print("="*60)
    print("🔄 Reducing embeddings to 2D using t-SNE...")
    print("   This may take a minute...")
    
    # Convert embeddings to numpy array
    embeddings_array = np.array(all_close_notes['embedding'].tolist())
    
    # Apply t-SNE
    tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(embeddings_array)-1))
    embeddings_2d = tsne.fit_transform(embeddings_array)
    
    # Store 2D coordinates
    all_close_notes['tsne_x'] = embeddings_2d[:, 0]
    all_close_notes['tsne_y'] = embeddings_2d[:, 1]
    
    # Get unique categories
    categories = sorted(all_close_notes['category'].dropna().unique())
    
    # Create color palette for categories
    # Set3 is a qualitative palette with 12 colors; with more categories, colors will repeat
    category_colors = plt.cm.Set3(np.linspace(0, 1, len(categories)))
    category_color_map = {cat: category_colors[i] for i, cat in enumerate(categories)}
    
    # Create visualization
    fig, ax = plt.subplots(figsize=(14, 10))
    
    # Plot each category separately, using different markers for good vs bad
    legend_elements = []
    
    for category in categories:
        category_data = all_close_notes[all_close_notes['category'] == category]
        color = category_color_map[category]
        
        # Plot reference (good) close notes for this category - use circles
        ref_mask = (all_close_notes['category'] == category) & (all_close_notes['dataset_type'] == 'reference')
        if ref_mask.sum() > 0:
            ref_data = all_close_notes[ref_mask]
            scatter = ax.scatter(
                ref_data['tsne_x'],
                ref_data['tsne_y'],
                c=color,  # Single color for all points in this group
                marker='o',  # Circle for good
                label=f'{category} (Good)',
                alpha=0.8,
                s=120,
                edgecolors='black',
                linewidths=1.5
            )
            legend_elements.append(plt.Line2D([0], [0], marker='o', color='w', 
                                             markerfacecolor=color, markersize=10, 
                                             markeredgecolor='black', markeredgewidth=1.5,
                                             label=f'{category} (Good)'))
        
        # Plot other incidents (bad/regular) close notes for this category - use squares
        other_mask = (all_close_notes['category'] == category) & (all_close_notes['dataset_type'] == 'other')
        if other_mask.sum() > 0:
            other_data = all_close_notes[other_mask]
            scatter = ax.scatter(
                other_data['tsne_x'],
                other_data['tsne_y'],
                c=color,  # Single color for all points in this group
                marker='s',  # Square for bad/regular
                label=f'{category} (Bad/Regular)',
                alpha=0.5,
                s=80,
                edgecolors='black',
                linewidths=0.8
            )
            legend_elements.append(plt.Line2D([0], [0], marker='s', color='w', 
                                             markerfacecolor=color, markersize=8, 
                                             markeredgecolor='black', markeredgewidth=0.8,
                                             label=f'{category} (Bad/Regular)', alpha=0.5))
    
    ax.set_xlabel('t-SNE Dimension 1', fontsize=12, fontweight='bold')
    ax.set_ylabel('t-SNE Dimension 2', fontsize=12, fontweight='bold')
    ax.set_title('Semantic Space Visualization (t-SNE)\nCategories: Colors | Quality: Shapes (○ Good, □ Bad/Regular)', 
                 fontsize=14, fontweight='bold')
    
    # Create legend with categories
    ax.legend(handles=legend_elements, fontsize=9, loc='center left', bbox_to_anchor=(1, 0.5), 
              framealpha=0.9, ncol=1)
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Reading the t-SNE plot:")
    print("   - Colors = Different categories")
    print("   - ○ Circles = Good close notes (reference)")
    print("   - □ Squares = Bad/regular close notes (other incidents)")
    print("   - Close dots = Semantically similar")
    print("   - Far dots = Semantically different")
    print("   - If same-colored circles cluster together → Good close notes in same category are similar!")
    print("   - If circles and squares of same color are separated → Quality distinction within category!")
    
    # Print category summary
    print("\n" + "="*60)
    print("CATEGORY BREAKDOWN IN VISUALIZATION")
    print("="*60)
    for category in categories:
        ref_count = len(all_close_notes[(all_close_notes['category'] == category) & 
                                       (all_close_notes['dataset_type'] == 'reference')])
        other_count = len(all_close_notes[(all_close_notes['category'] == category) & 
                                         (all_close_notes['dataset_type'] == 'other')])
        print(f"   {category}: {ref_count} good (○), {other_count} bad/regular (□)")

    print("\n✅ t-SNE visualization complete!")


In [None]:
# Category-aware similarity analysis
# Compare good vs bad within the same category

if 'embedding' not in all_close_notes.columns:
    print("⚠️ Embeddings not available. Please run the embedding generation cell first.")
else:
    print("="*60)
    print("CATEGORY-AWARE SIMILARITY ANALYSIS")
    print("="*60)
    print("\n🔄 Analyzing similarity within same categories...")
    print("   This might show better separation than comparing across all categories")
    
    # Get unique categories
    categories = all_close_notes['category'].unique()
    category_results = []
    
    for category in categories:
        category_data = all_close_notes[all_close_notes['category'] == category]
        
        if len(category_data) < 2:
            continue
        
        ref_in_category = category_data[category_data['dataset_type'] == 'reference']
        other_in_category = category_data[category_data['dataset_type'] == 'other']
        
        if len(ref_in_category) == 0 or len(other_in_category) == 0:
            continue
        
        # Get embeddings
        ref_embeddings = np.array(ref_in_category['embedding'].tolist())
        other_embeddings = np.array(other_in_category['embedding'].tolist())
        
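        # Within reference (good vs good); needs at least two reference notes
        if len(ref_embeddings) < 2:
            # A single note has no pairs; a 0.0 placeholder would force a
            # spuriously negative separation score, so skip the category
            continue
        ref_ref_sim = cosine_similarity(ref_embeddings, ref_embeddings)
        ref_ref_mean = ref_ref_sim[np.triu_indices(len(ref_embeddings), k=1)].mean()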
        
        # Between good and bad (within same category)
        ref_other_sim = cosine_similarity(ref_embeddings, other_embeddings)
        ref_other_mean = ref_other_sim.mean()
        
        # Separation score
        separation = ref_ref_mean - ref_other_mean
        
        category_results.append({
            'category': category,
            'reference_count': len(ref_in_category),
            'other_count': len(other_in_category),
            'within_ref_mean': ref_ref_mean,
            'between_mean': ref_other_mean,
            'separation_score': separation
        })
    
    if category_results:
        category_df = pd.DataFrame(category_results)
        category_df = category_df.sort_values('separation_score', ascending=False)
        
        print("\n" + "="*60)
        print("CATEGORY-AWARE SIMILARITY RESULTS")
        print("="*60)
        print(category_df.to_string(index=False))
        
        print("\n💡 Interpretation:")
        print("   - Positive 'separation_score' = Good notes more similar to each other than to bad")
        print("   - Negative 'separation_score' = Good notes less similar to each other than to bad")
        print("   - Categories with better separation might be easier to evaluate")
        
        # Overall insight
        avg_separation = category_df['separation_score'].mean()
        print(f"\n📊 Average separation score across categories: {avg_separation:.4f}")

        if avg_separation > 0:
            print("✅ On average, good notes are closer to each other than to bad notes within a category")
            print("   → Consider using category filtering when finding similar references")
        else:
            print("⚠️ Even within categories, separation is limited")
            print("   → This suggests that semantic similarity alone may not distinguish quality")
            print("   → LLM-as-a-Judge (Notebook 05) will be essential for quality evaluation")
    else:
        print("\n⚠️ Not enough data for category analysis")
        print("   Some categories may not have both good and bad examples")


## 6. Summary and Next Steps

**What we learned:**
- Semantic similarity can measure meaning, not just word overlap
- Whether good close notes cluster together is an empirical question, answered by the similarity results above
- If they do, semantic evaluation is a useful signal for assessing quality

**Key findings:**
- Embeddings capture meaning and context
- How well semantic similarity separates good from bad close notes depends on the within-group and between-group scores reported in Section 4
- This prepares us for LLM-as-a-Judge evaluation (Notebook 05)

**Next steps:**
- **Notebook 05:** Use semantic similarity to find similar references for LLM-as-a-Judge evaluation
- **Notebook 06:** Use semantic similarity to evaluate generated close notes

**What this enables:**
- Finding similar reference close notes for comparison
- Evaluating generated close notes semantically
- Understanding relationships between close notes
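
Finding similar reference close notes, the first item above, amounts to a nearest-neighbor lookup in embedding space. A minimal sketch with toy 2-d vectors; `top_k_references` is a hypothetical helper, not a function defined in this notebook or its `src` utilities.

```python
import numpy as np

def top_k_references(query: np.ndarray, references: np.ndarray, k: int = 3):
    """Return (indices, similarities) of the k references most similar to `query`."""
    query_n = query / np.linalg.norm(query)
    refs_n = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = refs_n @ query_n          # cosine similarity to every reference
    k = min(k, len(sims))            # never ask for more rows than exist
    top = np.argsort(-sims)[:k]      # indices sorted by descending similarity
    return top, sims[top]

# Toy reference "embeddings" and one query vector
references = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
query = np.array([0.9, 0.2])

idx, scores = top_k_references(query, references, k=2)
print(idx.tolist())  # [0, 2]
```

The returned indices can be used to pull the matching reference rows from a dataframe, for example with `.iloc`.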


In [None]:
# Summary and save results for next notebooks

print("="*60)
print("SUMMARY")
print("="*60)

if 'embedding' in all_close_notes.columns:
    print("\n✅ Embeddings generated successfully")
    print(f"   - Total close notes with embeddings: {len(all_close_notes)}")
    print(f"   - Reference (good): {len(all_close_notes[all_close_notes['dataset_type'] == 'reference'])}")
    print(f"   - Other incidents (bad/regular): {len(all_close_notes[all_close_notes['dataset_type'] == 'other'])}")
    
    if 'similarity_results' in locals():
        print("\n✅ Semantic similarity analysis complete")
        print(f"   - Good vs Good similarity: {similarity_results['within_reference_mean']:.4f}")
        print(f"   - Bad vs Bad similarity: {similarity_results['within_other_mean']:.4f}")
        print(f"   - Good vs Bad similarity: {similarity_results['between_ref_other_mean']:.4f}")
    
    print("\n📝 Data available in memory (save it to disk to reuse in later notebooks):")
    print("   - 'all_close_notes' dataframe with embeddings")
    print("   - 'reference_df' and 'other_incidents_df' (original datasets)")
    print("   - Semantic similarity results")

    print("\n🎯 Next steps:")
    print("   - Notebook 05: Use embeddings to find similar references for LLM-as-a-Judge")

    print("\n✅ Notebook 04 complete!")
else:
    print("\n⚠️ Please run all cells above to generate embeddings and complete the analysis")
