# Notebook 02: Define Quality and Separate Datasets

## 🎯 What Is This Notebook About?

This notebook helps you understand what makes a "good" close note and separates your incident dataset into two groups:
- **Reference Dataset** (high-quality close notes) - These will serve as examples of what good close notes look like
- **Other Incidents Dataset** - All remaining incidents (for comparison)

**Why this matters:**
- We need clear examples of good close notes to compare against
- These reference examples will be used to evaluate other close notes (existing or AI-generated)
- By separating them, we can see the difference between high-quality and standard close notes

---

## 📚 Key Concepts Explained

### What Makes a "Good" Close Note?

A good close note should:
1. **Be informative** - Contains specific details about the problem and solution
2. **Be complete** - Covers what happened, what was done, and the outcome
3. **Avoid generic phrases** - Doesn't just say "Issue resolved" without explanation
4. **Be professional** - Clear, well-structured, and easy to understand

**Example of a GOOD close note:**
> "Investigated the reported issue with Workday crashing when saving files. Cleared browser cache and cookies, updated browser to latest version. Verified user can now save files successfully. Issue resolved and confirmed with user."

**Example of a BAD/Generic close note:**
> "Issue resolved."

**Why the difference matters:**
- Good close notes help others understand what happened and how it was fixed
- Generic close notes don't provide useful information
- We'll use good examples as "reference" to evaluate other close notes

---

## 🎯 Objectives

This notebook will:
1. **Define quality criteria** - Explain what makes a close note "good"
2. **Examine quality scores** - Look at how close notes are scored
3. **Filter high-quality examples** - Identify close notes that meet our criteria
4. **Separate into two datasets:**
   - **Reference Dataset** - High-quality close notes (ground truth)
   - **Other Incidents Dataset** - Remaining incidents (for comparison)
5. **Save datasets** - Save both for use in later notebooks

---

## 📋 How We'll Separate the Data

**Separation Strategy:**
- **Reference Dataset** (`reference_close_notes.csv`):
  - High-quality close notes that meet all criteria
  - Will be used as examples/references for evaluation

- **Other Incidents Dataset** (`other_incidents.csv`):
  - All incidents not selected for the Reference Dataset (those that miss the criteria, plus high-quality notes left out by balanced sampling)
  - Will be used for comparison in later notebooks

**Selection Criteria for Reference Dataset:**
- High information score (`info_score_close_notes` ≥ 0.8)
- Low generic content score (`info_score_poor_close_notes` ≤ 0.1)
- Complete and informative text (not just "Issue resolved")
- Minimum length of 100 characters (ensures sufficient detail)
- Diverse examples across different categories


## 1. Import Libraries and Setup

**What we're doing:** Loading the tools we need to work with data and create visualizations.

**Why:** Just like a carpenter needs a hammer and saw, we need specific tools (libraries) to work with data.

**What to expect:** You'll see a success message when everything is loaded correctly.


In [None]:
# Import required libraries
# Think of these like tools in a toolbox - each one does a specific job

import pandas as pd  # For working with data tables (like Excel spreadsheets)
import numpy as np   # For doing math calculations
import matplotlib.pyplot as plt  # For creating charts and graphs
import seaborn as sns  # For making prettier charts
from pathlib import Path  # For handling file paths
import sys
import re  # For text pattern matching (finding generic phrases)

# Add src directory to path so we can use our helper functions
sys.path.append(str(Path("../src").resolve()))

# Import our custom helper functions
from utils import load_incident_dataset, calculate_basic_stats

# Set up plotting style (makes our charts look nicer)
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    try:
        plt.style.use('seaborn')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")
# Make charts appear inside the notebook
# (line magics do not accept a trailing comment on the same line)
%matplotlib inline

print("✅ Libraries imported successfully!")
print("📚 Ready to start analyzing close notes quality!")


## 2. Load Prepared Dataset

**What we're doing:** Loading the incident dataset from Notebook 01.

**Why:** We need all incidents with their close notes so we can identify which ones are high-quality and separate them.

**What to expect:** The dataset contains incidents with various quality levels of close notes. We'll filter to find the best ones.


In [None]:
# Load prepared dataset from notebook 01
# This is the data we prepared in the previous notebook

data_dir = Path("../data")
prepared_path = data_dir / "incidents_prepared.csv"

if prepared_path.exists():
    # Load the prepared dataset (faster - we already processed it)
    df = pd.read_csv(prepared_path)
    print(f"✅ Loaded prepared dataset: {len(df)} records")
else:
    # If the prepared file doesn't exist, load from scratch
    print("⚠️ Prepared dataset not found. Loading from Hugging Face...")
    df = load_incident_dataset(sample_size=200, random_state=42)
    # Filter for records with close_notes (we need these!)
    if 'close_notes' in df.columns:
        df = df[df['close_notes'].notna()].copy()
    print(f"✅ Loaded dataset: {len(df)} records with close_notes")

# Let's see what we're working with
print(f"\n📊 Dataset Overview:")
print(f"   Total incidents: {len(df)}")
print(f"   Columns (pieces of information): {df.shape[1]}")
print(f"\n📋 Key columns we'll use:")
key_columns = ['close_notes', 'info_score_close_notes', 'info_score_poor_close_notes', 'category']
for col in key_columns:
    if col in df.columns:
        print(f"   ✅ {col}")
    else:
        print(f"   ❌ {col} (missing!)")


## 3. Understand Quality Scores

**What we're doing:** Looking at how close notes are scored for quality.

**What are quality scores?**
- **`info_score_close_notes`**: Measures how informative a close note is (0.0 to 1.0)
  - **High score (≥0.8)**: Contains detailed, useful information
  - **Low score (<0.8)**: Lacks detail or information

- **`info_score_poor_close_notes`**: Measures how "generic" a close note is (0.0 to 1.0)
  - **Low score (≤0.1)**: Not generic, contains specific information
  - **High score (>0.1)**: Generic phrases like "Issue resolved"

**Why this matters:** We want close notes that are informative (high score) and not generic (low poor score).

**What to look for:** 
- How many close notes have high information scores?
- How many have low generic scores?
- This tells us how many "good" examples we can use as references
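
As a minimal sketch of how the two scores combine, the snippet below builds a tiny made-up table and flags the rows that satisfy both thresholds. The notes and scores are invented for illustration (they are not taken from the dataset); only the column names and the 0.8 / 0.1 thresholds come from the description above.

```python
import pandas as pd

# Hypothetical toy data: the scores below are invented for illustration only
toy = pd.DataFrame({
    "close_notes": [
        "Cleared browser cache, updated browser, verified user can save files.",
        "Issue resolved.",
        "Reset password and confirmed login with user.",
    ],
    "info_score_close_notes": [0.92, 0.15, 0.85],
    "info_score_poor_close_notes": [0.02, 0.90, 0.30],
})

INFO_THRESHOLD = 0.8  # informative if score >= 0.8
POOR_THRESHOLD = 0.1  # not generic if score <= 0.1

is_informative = toy["info_score_close_notes"] >= INFO_THRESHOLD
is_not_generic = toy["info_score_poor_close_notes"] <= POOR_THRESHOLD

# A note qualifies only when BOTH conditions hold
toy["meets_score_criteria"] = is_informative & is_not_generic

print(toy[["info_score_close_notes", "info_score_poor_close_notes", "meets_score_criteria"]])
```

Note that the third row is informative enough but still fails, because its generic score is above 0.1. Both conditions must hold at once.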


In [None]:
# Check quality score columns
# This shows us how the close notes are scored for quality

print("="*80)
print("📊 QUALITY SCORES ANALYSIS")
print("="*80)
print("\n💡 Understanding the scores:")
print("   • Higher info_score_close_notes = More informative (better!)")
print("   • Lower info_score_poor_close_notes = Less generic (better!)")
print("   • We want: High info score (≥0.8) AND Low poor score (≤0.1)")

if 'info_score_close_notes' in df.columns:
    high_quality_count = (df['info_score_close_notes'] >= 0.8).sum()
    high_quality_pct = (high_quality_count / len(df)) * 100
    print(f"\n📊 info_score_close_notes (How informative?):")
    print(f"   Mean (average): {df['info_score_close_notes'].mean():.3f}")
    print(f"   Median (middle value): {df['info_score_close_notes'].median():.3f}")
    print(f"   Range: {df['info_score_close_notes'].min():.3f} to {df['info_score_close_notes'].max():.3f}")
    print(f"   ✅ High quality (≥0.8): {high_quality_count} incidents ({high_quality_pct:.1f}%)")

if 'info_score_poor_close_notes' in df.columns:
    low_generic_count = (df['info_score_poor_close_notes'] <= 0.1).sum()
    low_generic_pct = (low_generic_count / len(df)) * 100
    print(f"\n📊 info_score_poor_close_notes (How generic?):")
    print(f"   Mean (average): {df['info_score_poor_close_notes'].mean():.3f}")
    print(f"   Median (middle value): {df['info_score_poor_close_notes'].median():.3f}")
    print(f"   ✅ Low generic (≤0.1): {low_generic_count} incidents ({low_generic_pct:.1f}%)")

# Visualize score distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

if 'info_score_close_notes' in df.columns:
    axes[0].hist(df['info_score_close_notes'].dropna(), bins=20, edgecolor='black', alpha=0.7, color='steelblue')
    axes[0].axvline(0.8, color='red', linestyle='--', label='Threshold (≥0.8)')
    axes[0].set_xlabel('Info Score (close_notes)', fontsize=11)
    axes[0].set_ylabel('Frequency', fontsize=11)
    axes[0].set_title('Close Notes Quality Score Distribution', fontsize=12, fontweight='bold')
    axes[0].grid(axis='y', alpha=0.3)
    axes[0].legend()

if 'info_score_poor_close_notes' in df.columns:
    axes[1].hist(df['info_score_poor_close_notes'].dropna(), bins=20, edgecolor='black', alpha=0.7, color='coral')
    axes[1].axvline(0.1, color='red', linestyle='--', label='Threshold (≤0.1)')
    axes[1].set_xlabel('Info Score (poor_close_notes)', fontsize=11)
    axes[1].set_ylabel('Frequency', fontsize=11)
    axes[1].set_title('Poor Close Notes Score Distribution', fontsize=12, fontweight='bold')
    axes[1].grid(axis='y', alpha=0.3)
    axes[1].legend()

plt.tight_layout()
plt.show()

print("="*80)


## 4. Define Quality Criteria and Filter

**What we're doing:** Applying filters to identify high-quality close notes that will become our "Reference Dataset".

**Our quality criteria:**
1. **High information score** (`info_score_close_notes` ≥ 0.8)
   - *Why?* Ensures the close note contains useful, detailed information
   
2. **Low generic score** (`info_score_poor_close_notes` ≤ 0.1)
   - *Why?* Ensures it's not just generic phrases like "Issue resolved"
   
3. **Not generic phrases** - Excludes notes that only say things like:
   - "No changes noted"
   - "Issue resolved"
   - "Resolved per user"
   
4. **Minimum length** (≥100 characters)
   - *Why?* Ensures there's enough detail to be useful

**What happens:**
- We'll filter the dataset step by step
- Records that pass all filters → candidates for the **Reference Dataset** (high-quality examples)
- Records that don't pass → **Other Incidents Dataset** (for comparison)
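
To make the four criteria concrete, here is a rough sketch that applies them to a single record in plain Python. The helper `passes_all_criteria` and the `GENERIC_ONLY` set are hypothetical names introduced only for this illustration; the notebook's own filtering works on the whole DataFrame, step by step.

```python
# Hypothetical helper: applies the four criteria above to one record at a time
GENERIC_ONLY = {"no changes noted", "issue resolved", "resolved per user"}


def passes_all_criteria(note, info_score, poor_score, min_length=100):
    """Return True when a close note meets every reference criterion."""
    if not isinstance(note, str):
        return False  # a missing note can never be a reference example
    # Normalize so "Issue resolved." and "issue resolved" compare equal
    normalized = note.strip().lower().rstrip(".!")
    return (
        info_score >= 0.8                     # 1. high information score
        and poor_score <= 0.1                 # 2. low generic score
        and normalized not in GENERIC_ONLY    # 3. not just a generic phrase
        and len(note.strip()) >= min_length   # 4. minimum length
    )


detailed = (
    "Investigated the reported issue with Workday crashing when saving files. "
    "Cleared browser cache and cookies, updated browser to latest version. "
    "Verified user can now save files successfully."
)

print(passes_all_criteria(detailed, 0.92, 0.02))           # detailed note, good scores
print(passes_all_criteria("Issue resolved.", 0.92, 0.02))  # generic and too short
```

Writing the criteria as one predicate makes it easy to see that they are combined with AND: failing any single check excludes the note.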


In [None]:
# Define generic phrases to exclude
# These are phrases that don't provide useful information
# Think of them like "filler words" - they don't tell us anything useful

GENERIC_PHRASES = [
    'no changes noted',
    'issue resolved',
    'resolved',
    'closed',
    'no further action',
    'resolved per user',
    'user confirmed resolved'
]

def is_generic_close_note(text):
    """
    Check if close note contains only generic phrases.
    
    This function looks for close notes that are too short or only contain
    generic phrases like "Issue resolved" without any details.
    """
    if pd.isna(text) or not isinstance(text, str):
        return True  # Missing or not text = consider it generic
    
    text_lower = text.lower().strip()
    
    # Check if text is too short (likely generic)
    # Good close notes need detail, so very short ones are probably generic
    if len(text_lower) < 50:
        return True
    
    # Check if text contains only generic phrases
    # Tokenize with a regex so punctuation ("resolved.") does not hide generic words
    words = set(re.findall(r"[a-z0-9']+", text_lower))
    generic_words = set()
    for phrase in GENERIC_PHRASES:
        generic_words.update(phrase.split())
    
    # If the note uses only a handful of distinct words and all of them are generic,
    # treat it as a generic note
    if len(words) <= 5 and words.issubset(generic_words):
        return True
    
    return False  # Not generic - has useful information!

# Apply filters step by step
# We'll filter the data multiple times, keeping only the best close notes
print("🔍 Applying quality filters...")
print("="*80)
print(f"📦 Initial records: {len(df)}")

# Filter 1: Must have close_notes
# We can't evaluate quality if there's no close note!
df_filtered = df[df['close_notes'].notna()].copy()
print(f"✅ Filter 1 - Has close_notes: {len(df_filtered)} records")

# Filter 2: High quality score (informative)
# We want close notes that are informative (score ≥ 0.8)
if 'info_score_close_notes' in df_filtered.columns:
    before = len(df_filtered)
    df_filtered = df_filtered[df_filtered['info_score_close_notes'] >= 0.8].copy()
    removed = before - len(df_filtered)
    print(f"✅ Filter 2 - High info score (≥0.8): {len(df_filtered)} records (removed {removed})")

# Filter 3: Low poor quality score (not generic)
# We want close notes that are NOT generic (score ≤ 0.1)
if 'info_score_poor_close_notes' in df_filtered.columns:
    before = len(df_filtered)
    df_filtered = df_filtered[df_filtered['info_score_poor_close_notes'] <= 0.1].copy()
    removed = before - len(df_filtered)
    print(f"✅ Filter 3 - Low generic score (≤0.1): {len(df_filtered)} records (removed {removed})")

# Filter 4: Exclude generic notes (text-based check)
# Double-check: remove any that are just generic phrases
df_filtered['is_generic'] = df_filtered['close_notes'].apply(is_generic_close_note)
before = len(df_filtered)
df_filtered = df_filtered[~df_filtered['is_generic']].copy()
removed = before - len(df_filtered)
print(f"✅ Filter 4 - Exclude generic phrases: {len(df_filtered)} records (removed {removed})")

# Filter 5: Minimum text length (ensure informative)
# Very short close notes probably don't have enough detail
df_filtered['close_notes_length'] = df_filtered['close_notes'].astype(str).str.len()
before = len(df_filtered)
df_filtered = df_filtered[df_filtered['close_notes_length'] >= 100].copy()
removed = before - len(df_filtered)
print(f"✅ Filter 5 - Minimum length (≥100 chars): {len(df_filtered)} records (removed {removed})")

print("="*80)
print(f"\n🎯 Final filtered dataset: {len(df_filtered)} high-quality records")
reduction_pct = ((len(df) - len(df_filtered))/len(df)*100)
print(f"   📉 Filtered out: {len(df) - len(df_filtered)} records ({reduction_pct:.1f}%)")
print(f"   ✅ Kept: {len(df_filtered)} records ({100-reduction_pct:.1f}%)")
print("="*80)


## 5. Analyze Diversity of Filtered Dataset

**What we're doing:** Checking how diverse our filtered high-quality close notes are across different categories.

**Why this matters:**
- We want examples from different types of incidents (not just SOFTWARE)
- Diverse examples help when evaluating different categories later
- This shows us if we need to balance the dataset better

**What we'll check:**
- **Categories**: How many incidents per category (SOFTWARE, ACCOUNT, etc.)
- **Subcategories**: How many per subcategory (ERROR, MALFUNCTION, etc.)
- **Contact Types**: How incidents were reported (Email, Phone, Chat, etc.)

**What to look for:**
- If one category dominates (e.g., 90% SOFTWARE), we may want to balance
- Good diversity means we have examples across different incident types
- This helps when evaluating close notes for different types of problems
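
The "one category dominates" check can be sketched in a few lines of plain Python. The function names and the 0.9 cutoff (taken from the "90% SOFTWARE" example above) are assumptions for illustration; pick whatever cutoff suits your data.

```python
from collections import Counter


def category_shares(categories):
    """Return {category: share of total} for a list of category labels."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: count / total for cat, count in counts.items()}


def dominant_category(categories, threshold=0.9):
    """Return the category whose share reaches `threshold`, or None if there is none."""
    shares = category_shares(categories)
    if not shares:
        return None
    top_cat, top_share = max(shares.items(), key=lambda item: item[1])
    return top_cat if top_share >= threshold else None


# Hypothetical labels: 9 SOFTWARE incidents and 1 ACCOUNT incident
labels = ["SOFTWARE"] * 9 + ["ACCOUNT"]
print(category_shares(labels))
print(dominant_category(labels))
```

If `dominant_category` returns a label, that is a hint that the balanced sampling step later in this notebook is worth applying.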


In [None]:
# Analyze diversity
print("="*80)
print("DIVERSITY ANALYSIS")
print("="*80)

if 'category' in df_filtered.columns:
    print(f"\n📊 Categories:")
    category_counts = df_filtered['category'].value_counts()
    for cat, count in category_counts.items():
        print(f"   {cat}: {count} ({count/len(df_filtered)*100:.1f}%)")

if 'subcategory' in df_filtered.columns:
    print(f"\n📋 Subcategories:")
    subcat_counts = df_filtered['subcategory'].value_counts()
    print(f"   Total unique: {df_filtered['subcategory'].nunique()}")
    for subcat, count in subcat_counts.head(10).items():
        print(f"   {subcat}: {count}")

if 'contact_type' in df_filtered.columns:
    print(f"\n📞 Contact Types:")
    contact_counts = df_filtered['contact_type'].value_counts()
    for contact, count in contact_counts.items():
        print(f"   {contact}: {count} ({count/len(df_filtered)*100:.1f}%)")

# Visualize diversity
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

if 'category' in df_filtered.columns:
    category_counts = df_filtered['category'].value_counts()
    axes[0].bar(range(len(category_counts)), category_counts.values, 
                color=sns.color_palette("husl", len(category_counts)))
    axes[0].set_xticks(range(len(category_counts)))
    axes[0].set_xticklabels(category_counts.index, rotation=45, ha='right')
    axes[0].set_ylabel('Count', fontsize=10)
    axes[0].set_title('Category Distribution (Filtered)', fontsize=12, fontweight='bold')
    axes[0].grid(axis='y', alpha=0.3)
    for i, v in enumerate(category_counts.values):
        axes[0].text(i, v, str(v), ha='center', va='bottom', fontsize=9)

if 'subcategory' in df_filtered.columns:
    subcat_counts = df_filtered['subcategory'].value_counts().head(10)
    axes[1].barh(range(len(subcat_counts)), subcat_counts.values,
                color=sns.color_palette("viridis", len(subcat_counts)))
    axes[1].set_yticks(range(len(subcat_counts)))
    axes[1].set_yticklabels(subcat_counts.index)
    axes[1].set_xlabel('Count', fontsize=10)
    axes[1].set_title('Top 10 Subcategories (Filtered)', fontsize=12, fontweight='bold')
    axes[1].grid(axis='x', alpha=0.3)

if 'contact_type' in df_filtered.columns:
    contact_counts = df_filtered['contact_type'].value_counts()
    axes[2].bar(contact_counts.index, contact_counts.values,
               color=sns.color_palette("muted", len(contact_counts)))
    axes[2].set_ylabel('Count', fontsize=10)
    axes[2].set_title('Contact Type Distribution (Filtered)', fontsize=12, fontweight='bold')
    axes[2].grid(axis='y', alpha=0.3)
    for i, v in enumerate(contact_counts.values):
        axes[2].text(i, v, str(v), ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

print("="*80)


## 6. Ensure Balanced Sampling

**What we're doing:** Making sure our Reference Dataset has examples from different categories.

**Why this matters:**
- We want diverse examples (not just SOFTWARE incidents)
- This helps when evaluating different types of incidents later
- Balanced sampling ensures we have good examples across categories

**How it works:**
- We try to get a target number of samples per category (e.g., 20)
- If a category has fewer samples, we take what's available
- If a category has too few samples (< 5), we skip it

**Result:** A balanced Reference Dataset with diverse examples.

**Note:** After sampling, the remaining high-quality records (that didn't make it into the balanced sample) will still be part of the "Other Incidents Dataset" - they're still good quality, just not selected for the reference set.
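
The sampling rules above can also be expressed compactly with `groupby`. This is an alternative sketch on made-up data, not the notebook's own implementation; `balanced_sample` is a hypothetical helper name, and the defaults mirror the example values of 20 and 5.

```python
import pandas as pd


def balanced_sample(df, target=20, minimum=5, seed=42):
    """Sample up to `target` rows per category, skipping categories below `minimum`."""
    parts = []
    for _, group in df.groupby("category"):
        if len(group) < minimum:
            continue  # too few examples to represent this category
        # Take `target` rows, or every row when the category is smaller than that
        parts.append(group.sample(n=min(target, len(group)), random_state=seed))
    if not parts:
        return df.iloc[0:0].copy()  # empty frame with the same columns
    return pd.concat(parts, ignore_index=True)


# Hypothetical toy data: 30 SOFTWARE, 8 ACCOUNT, and 2 HARDWARE rows
toy = pd.DataFrame({
    "number": range(40),
    "category": ["SOFTWARE"] * 30 + ["ACCOUNT"] * 8 + ["HARDWARE"] * 2,
})

sample = balanced_sample(toy)
print(sample["category"].value_counts().to_dict())
```

In this toy run SOFTWARE is capped at the target, ACCOUNT is kept whole, and HARDWARE is skipped for falling under the minimum. Fixing `random_state` keeps the selection reproducible between runs.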


In [None]:
# Strategy: Sample balanced examples if one category dominates
TARGET_SAMPLES_PER_CATEGORY = 20  # Adjust based on dataset size
MIN_SAMPLES_PER_CATEGORY = 5     # Minimum samples per category

print("Applying balanced sampling strategy...")

if 'category' in df_filtered.columns:
    sampled_records = []
    
    for category in df_filtered['category'].unique():
        category_df = df_filtered[df_filtered['category'] == category].copy()
        
        if len(category_df) >= TARGET_SAMPLES_PER_CATEGORY:
            # Randomly pick TARGET_SAMPLES_PER_CATEGORY records (reproducible via random_state)
            sampled = category_df.sample(TARGET_SAMPLES_PER_CATEGORY, random_state=42)
        elif len(category_df) >= MIN_SAMPLES_PER_CATEGORY:
            # Take all available if less than target but above minimum
            sampled = category_df
        else:
            # Skip categories with too few samples
            print(f"   ⚠️ Skipping {category}: only {len(category_df)} samples")
            continue
        
        sampled_records.append(sampled)
        print(f"   ✅ {category}: {len(sampled)} samples")
    
    if sampled_records:
        df_ground_truth = pd.concat(sampled_records, ignore_index=True)
    else:
        # pd.concat raises on an empty list, so fall back to all filtered records
        print("   ⚠️ No category reached the minimum; using all filtered records instead")
        df_ground_truth = df_filtered.reset_index(drop=True)
    print(f"\n✅ Balanced ground truth dataset: {len(df_ground_truth)} records")
else:
    # If no category column, use all filtered records
    df_ground_truth = df_filtered.copy()
    print(f"\n✅ Using all filtered records: {len(df_ground_truth)} records")


## 7. Create Reference Dataset Structure

**What we're doing:** Preparing the Reference Dataset with the columns we need.

**Structure we need:**
- `number` - Incident identifier
- `content` - Original incident description
- `close_notes_ref` - High-quality close note (reference example)
- Metadata: category, subcategory, contact_type, info_score

**Why these columns:**
- We need both `content` and `close_notes_ref` from the same incident (for Notebook 03)
- Metadata helps us find similar incidents for evaluation


In [None]:
# Create ground truth dataset with required structure
gt_dataset = pd.DataFrame({
    'number': df_ground_truth['number'].values,
    'content': df_ground_truth['content'].values,
    'close_notes_ref': df_ground_truth['close_notes'].values
})

# Add optional metadata columns for reference
if 'category' in df_ground_truth.columns:
    gt_dataset['category'] = df_ground_truth['category'].values
if 'subcategory' in df_ground_truth.columns:
    gt_dataset['subcategory'] = df_ground_truth['subcategory'].values
if 'contact_type' in df_ground_truth.columns:
    gt_dataset['contact_type'] = df_ground_truth['contact_type'].values
if 'info_score_close_notes' in df_ground_truth.columns:
    gt_dataset['info_score'] = df_ground_truth['info_score_close_notes'].values

print("="*80)
print("GROUND TRUTH DATASET STRUCTURE")
print("="*80)
print(f"\nTotal records: {len(gt_dataset)}")
print(f"\nColumns: {list(gt_dataset.columns)}")
print(f"\nFirst few records:")
print(gt_dataset.head())
print("="*80)


## 8. Separate into Two Datasets

**What we're doing:** Splitting our incidents into two groups - the "good examples" and everything else.

**Why separate:**
- **Reference Dataset** (the "good" ones): High-quality close notes that will serve as our standards
  - Think of these like "model answers" - examples of what good close notes should look like
  - We'll use these to evaluate other close notes (existing or AI-generated)
  
- **Other Incidents Dataset** (the rest): All remaining incidents that didn't meet our high-quality criteria
  - These are standard close notes - not bad, just not exceptional
  - We'll compare these against the reference to see the difference

**How we separate:**
- Records that passed all quality filters and were selected by balanced sampling → **Reference Dataset** ✅
- All other records → **Other Incidents Dataset** 📋

**Think of it like:**
- Sorting apples into "premium" (perfect, beautiful) and "regular" (good, but not perfect)
- The premium ones become our reference for what "good" looks like
- The regular ones help us see the difference

**This separation allows us to:**
- Compare how "good" close notes differ from standard ones
- Use references to evaluate other close notes (like grading against a rubric)
- Understand quality differences between datasets
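
The split itself boils down to a membership test on the incident number. The sketch below uses a made-up four-row table (the incident numbers and notes are hypothetical) to show the `isin` pattern and a sanity check that the two parts do not overlap.

```python
import pandas as pd

# Hypothetical toy data standing in for the full incident table
all_incidents = pd.DataFrame({
    "number": ["INC001", "INC002", "INC003", "INC004"],
    "close_notes": ["detailed note A", "Issue resolved.", "detailed note B", None],
})
reference_numbers = {"INC001", "INC003"}  # incidents selected as references

# Boolean mask: True where the incident belongs to the reference set
in_reference = all_incidents["number"].isin(reference_numbers)
reference_part = all_incidents[in_reference]
other_part = all_incidents[~in_reference]

# The two parts should be disjoint and together cover every incident
overlap = set(reference_part["number"]) & set(other_part["number"])
print(len(reference_part), len(other_part), len(overlap))
```

This relies on `number` being unique per incident. If the same number appeared on several rows, all of them would be excluded from the "other" part together, and the row counts of the two datasets might no longer add up to the total.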


In [None]:
# Separate into two datasets
# This is the key step: creating Reference Dataset and Other Incidents Dataset

print("="*80)
print("DATASET SEPARATION")
print("="*80)

# Reference Dataset: High-quality close notes (what we've filtered)
# Contains: number, content, close_notes_ref, and metadata
reference_dataset = gt_dataset.copy()

# Other Incidents Dataset: All incidents NOT in reference dataset
# These are incidents that didn't meet the high-quality criteria
reference_numbers = set(reference_dataset['number'].values)
other_incidents = df[~df['number'].isin(reference_numbers)].copy()

# Keep same structure for other incidents (for consistency)
# Keep: number, content, close_notes (if available), and whichever metadata columns exist
other_columns = [
    col for col in ['number', 'content', 'close_notes', 'category', 'subcategory', 'contact_type']
    if col in other_incidents.columns
]
other_incidents_dataset = other_incidents[other_columns].copy()

print(f"\n📊 Separation Summary:")
print(f"   Total incidents: {len(df)}")
print(f"   ────────────────────────────────────────")
print(f"   ✅ Reference Dataset (high-quality): {len(reference_dataset)} incidents")
print(f"      - These are our 'good' examples")
print(f"      - Will be used as references for evaluation")
print(f"   ────────────────────────────────────────")
print(f"   📋 Other Incidents Dataset: {len(other_incidents_dataset)} incidents")
print(f"      - Remaining incidents (standard quality)")
print(f"      - Will be used for comparison")
print(f"   ────────────────────────────────────────")
print(f"   Total: {len(reference_dataset) + len(other_incidents_dataset)} incidents")
print(f"   Match check: {'✅ Match' if len(reference_dataset) + len(other_incidents_dataset) == len(df) else '⚠️ Mismatch'}")

# Show examples of the difference
print(f"\n📝 Example from Reference Dataset (High-Quality):")
if len(reference_dataset) > 0:
    example_ref = reference_dataset.iloc[0]
    print(f"   Incident: {example_ref['number']}")
    print(f"   Close Note: {str(example_ref['close_notes_ref'])[:200]}...")
    print(f"   Quality Score: {example_ref.get('info_score', 'N/A')}")

print(f"\n📝 Example from Other Incidents Dataset (Standard):")
if len(other_incidents_dataset) > 0 and 'close_notes' in other_incidents_dataset.columns:
    # Prefer a record that actually has a close note; otherwise fall back to the first record
    with_notes = other_incidents_dataset[other_incidents_dataset['close_notes'].notna()]
    example_other = with_notes.iloc[0] if len(with_notes) > 0 else other_incidents_dataset.iloc[0]
    print(f"   Incident: {example_other['number']}")
    if pd.notna(example_other['close_notes']):
        print(f"   Close Note: {str(example_other['close_notes'])[:200]}...")
    else:
        print(f"   Close Note: (not available)")

print("\n💡 Key Difference:")
print("   - Reference Dataset: Detailed, informative, complete close notes")
print("   - Other Incidents: Standard close notes (may be shorter, less detailed, or generic)")

print("="*80)


## 9. Display Sample Examples

**What we're doing:** Showing examples from both datasets so you can see the difference.

**What to look for:**
- Reference examples are detailed and informative
- Other incidents may be shorter or less detailed
- This visual comparison helps understand what makes a "good" close note


In [None]:
# Display sample examples
print("="*80)
print("SAMPLE GROUND TRUTH EXAMPLES")
print("="*80)

# Number the examples with enumerate so the labels stay 1, 2, 3 even if the index is not 0-based
for example_num, (_, row) in enumerate(gt_dataset.head(3).iterrows(), start=1):
    print(f"\n{'='*80}")
    print(f"Example {example_num}")
    print(f"{'='*80}")
    print(f"\n📋 Incident Number: {row['number']}")
    if 'category' in row:
        print(f"🏷️  Category: {row['category']}")
    if 'subcategory' in row:
        print(f"🏷️  Subcategory: {row['subcategory']}")
    if 'info_score' in row:
        print(f"⭐ Quality Score: {row['info_score']:.2f}")
    
    # str() guards against missing (NaN) text, which cannot be sliced
    print(f"\n📝 Original Content (Input):")
    print(f"   {str(row['content'])[:300]}...")
    
    print(f"\n✅ Reference Close Notes (Ground Truth):")
    print(f"   {str(row['close_notes_ref'])[:400]}...")

print(f"\n{'='*80}")


## 10. Save Both Datasets

**What we're doing:** Saving both datasets so we can use them in later notebooks.

**What gets saved:**
1. **Reference Dataset** (`reference_close_notes.csv`) - **Good Samples**
   - High-quality close notes with both `content` and `close_notes_ref`
   - Used as references for evaluation
   
2. **Other Incidents Dataset** (`other_incidents.csv`) - **Remaining Samples**
   - All remaining incidents (those that didn't meet high-quality criteria)
   - Includes `content` and `close_notes` (if available)
   - Can be used for comparison in Notebook 03

**Why save both:**
- Reference Dataset: Used in Notebooks 03, 04, and 05 for evaluation
- Other Incidents Dataset: Used for comparison to see quality differences
- Saving both allows us to work with consistent data across notebooks
- Both files are ready to use without re-running this notebook
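
As a small sanity check on the save step, the sketch below writes a made-up two-row reference table to a temporary folder and reads it back. The data is hypothetical; the point is that `index=False` keeps the saved columns identical to the in-memory ones, so later notebooks see the same structure.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature reference dataset
reference = pd.DataFrame({
    "number": ["INC001", "INC002"],
    "content": ["Workday crashes on save", "Cannot log in"],
    "close_notes_ref": ["Cleared cache, verified fix.", "Reset password, confirmed login."],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reference_close_notes.csv"
    # index=False avoids writing the row index as an extra unnamed column
    reference.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(len(reloaded))
```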


In [None]:
# Save Reference Dataset (high-quality close notes)
# This is the main dataset we'll use as references for evaluation

reference_final = reference_dataset[[
    'number',
    'content',
    'close_notes_ref'
]].copy()

# Add metadata columns if they exist
for col in ['category', 'subcategory', 'contact_type', 'info_score']:
    if col in reference_dataset.columns:
        reference_final[col] = reference_dataset[col]

# Save Reference Dataset (good samples)
reference_path = data_dir / "reference_close_notes.csv"
reference_final.to_csv(reference_path, index=False)

# Prepare Other Incidents Dataset for saving
# This dataset already has the structure we need: number, content, close_notes, and metadata
other_incidents_final = other_incidents_dataset.copy()

# Save Other Incidents Dataset (remaining incidents - not high-quality)
other_incidents_path = data_dir / "other_incidents.csv"
other_incidents_final.to_csv(other_incidents_path, index=False)

print("="*80)
print("DATASETS SAVED")
print("="*80)

print(f"\n✅ Reference Dataset (Good Samples) saved:")
print(f"   File: {reference_path}")
print(f"   Total records: {len(reference_final)}")
print(f"   File size: {reference_path.stat().st_size / 1024:.1f} KB")
print(f"   Columns: {list(reference_final.columns)}")
print(f"   Use: High-quality examples for evaluation (Notebooks 03, 04, 05)")

print(f"\n✅ Other Incidents Dataset (Remaining Samples) saved:")
print(f"   File: {other_incidents_path}")
print(f"   Total records: {len(other_incidents_final)}")
print(f"   File size: {other_incidents_path.stat().st_size / 1024:.1f} KB")
print(f"   Columns: {list(other_incidents_final.columns)}")
print(f"   Use: For comparison in Notebook 03")

print("\n💡 Summary:")
print(f"   - Reference Dataset (good): {len(reference_final)} high-quality examples → {reference_path.name}")
print(f"   - Other Incidents (remaining): {len(other_incidents_final)} remaining incidents → {other_incidents_path.name}")
print(f"   - Total: {len(reference_final) + len(other_incidents_final)} incidents")
print("="*80)


## 11. Summary Statistics

**What we're doing:** Looking at the final results - what we created and how the two datasets compare.

**What we'll show:**
- **Size of each dataset** - How many incidents in each group?
- **Distribution across categories** - Do we have examples from different types of problems?
- **Quality differences** - How do the scores compare between "good" and "regular"?

**This helps us understand:**
- How many reference examples we have (enough to evaluate with?)
- Whether we have good coverage across categories (not just SOFTWARE?)
- The difference between reference and other incidents (is the quality gap clear?)

**Think of it like:** A final report card showing what we accomplished and what we have to work with.


In [None]:
# Final summary comparing both datasets
# This shows us what we accomplished and what we have to work with

print("="*80)
print("🎯 FINAL SUMMARY: DATASET SEPARATION COMPLETE")
print("="*80)

print(f"\n✅ Reference Dataset (High-Quality Examples):")
print(f"   📦 Total records: {len(reference_final)}")
print(f"   🎯 Purpose: Examples of good close notes for evaluation")
print(f"   💡 Think of these as 'model answers' - what good close notes should look like")

if 'category' in reference_final.columns:
    print(f"\n   📊 Category Distribution:")
    for cat, count in reference_final['category'].value_counts().items():
        pct = count/len(reference_final)*100
        print(f"      • {cat}: {count} incidents ({pct:.1f}%)")

if 'info_score' in reference_final.columns:
    print(f"\n   ⭐ Quality Scores:")
    print(f"      Mean (average): {reference_final['info_score'].mean():.3f}")
    print(f"      Range: {reference_final['info_score'].min():.3f} to {reference_final['info_score'].max():.3f}")
    print(f"      💡 All scores are ≥ 0.8 (high quality!)")

print(f"\n📋 Other Incidents Dataset (Remaining Samples):")
print(f"   📦 Total records: {len(other_incidents_dataset)}")
print(f"   🎯 Purpose: Remaining incidents for comparison")
print(f"   💡 These are standard close notes - not bad, just not exceptional")

if 'category' in other_incidents_dataset.columns:
    print(f"\n   📊 Category Distribution:")
    for cat, count in other_incidents_dataset['category'].value_counts().items():
        pct = count/len(other_incidents_dataset)*100
        print(f"      • {cat}: {count} incidents ({pct:.1f}%)")

print(f"\n📊 Overall Statistics:")
print(f"   📦 Total incidents: {len(df)}")
ref_pct = len(reference_final)/len(df)*100
other_pct = len(other_incidents_dataset)/len(df)*100
print(f"   ✅ Reference Dataset: {len(reference_final)} incidents ({ref_pct:.1f}%)")
print(f"   📋 Other Incidents: {len(other_incidents_dataset)} incidents ({other_pct:.1f}%)")

print(f"\n💾 Files Saved:")
print(f"   ✅ Reference Dataset: data/reference_close_notes.csv")
print(f"   ✅ Other Incidents: data/other_incidents.csv")

print(f"\n🚀 Next Steps:")
print(f"   → Notebook 03: Compare n-gram scores between reference and other incidents")
print(f"   → Notebook 04: Analyze semantic similarity using embeddings")
print(f"   → Notebook 05: Use LLM-as-a-Judge to evaluate close notes")
print("="*80)


## 12. Optional: Generate Embeddings for All Incidents

**What we're doing:** (Optional) Creating semantic embeddings for ALL incidents (both reference and other) to validate quality scores.

**What are embeddings?**
- Mathematical representations of text that capture meaning
- Similar meanings → similar embeddings
- Allows us to measure semantic similarity between close notes
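
To make "similar embeddings" concrete, here is a minimal sketch of cosine similarity, the measure used later in this notebook. The three-dimensional vectors are invented for illustration (real embedding models produce hundreds of dimensions), and `cosine_sim` is a hypothetical helper, not part of any library.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated (orthogonal)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up "embeddings" for three close notes
note_cache_fix = [0.9, 0.1, 0.0]   # e.g. "cleared browser cache"
note_cookie_fix = [0.8, 0.2, 0.1]  # e.g. "cleared cookies" (related topic)
note_printer = [0.0, 0.2, 0.9]     # e.g. "replaced printer toner" (unrelated topic)

sim_related = cosine_sim(note_cache_fix, note_cookie_fix)
sim_unrelated = cosine_sim(note_cache_fix, note_printer)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the behaviour the validation steps rely on.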

**Why generate embeddings for all incidents?**
- **Validate quality scores**: Check if incidents with similar quality scores are semantically closer
- **Understand relationships**: See how close notes relate to each other semantically
- **Quality assurance**: Verify that our quality scoring makes sense (similar scores = similar content)

**What we'll validate:**
- Incidents with similar `info_score_close_notes` should be semantically similar
- High-quality close notes should cluster together in semantic space
- This confirms our quality criteria are meaningful

**Note:** This is optional but useful for validating our quality scoring approach.


In [None]:
# Import embedding library
try:
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim
    EMBEDDINGS_AVAILABLE = True
except ImportError:
    print("⚠️ sentence-transformers not available. Install with: pip install sentence-transformers")
    EMBEDDINGS_AVAILABLE = False

if EMBEDDINGS_AVAILABLE:
    import os
    print("Loading embedding model...")
    
    # Model selection: Use BGE-M3 for multilingual, multi-granularity support
    # Can be overridden via EMBEDDING_MODEL environment variable
    DEFAULT_MODEL = 'BAAI/bge-m3'  # Multilingual, supports dense/sparse/multi-vector retrieval
    # Alternative models:
    # - 'BAAI/bge-small-en-v1.5' (faster, English-only)
    # - 'BAAI/bge-base-en-v1.5' (slower, higher accuracy, English-only)
    # - 'sentence-transformers/all-mpnet-base-v2' (proven alternative, English-only)
    
    embedding_model_name = os.getenv('EMBEDDING_MODEL', DEFAULT_MODEL)
    print(f"   Using model: {embedding_model_name}")
    
    use_flag_embedding = False
    try:
        # Try sentence-transformers first
        model = SentenceTransformer(embedding_model_name, trust_remote_code=True)
        embedding_dim = model.get_sentence_embedding_dimension()
        print(f"✅ Model loaded: {embedding_dim}-dimensional embeddings")
    except Exception as e:
        print(f"⚠️ Error loading with sentence-transformers: {e}")
        print("   Trying FlagEmbedding library...")
        try:
            from FlagEmbedding import BGEM3FlagModel
            model = BGEM3FlagModel(embedding_model_name)
            use_flag_embedding = True
            print(f"✅ Model loaded via FlagEmbedding (BGE-M3)")
        except ImportError:
            print("⚠️ FlagEmbedding not installed. Install with: pip install FlagEmbedding")
            raise
    
    # Combine all incidents for embedding generation
    # We want to generate embeddings for ALL incidents to validate quality scores
    print("\nPreparing all incidents for embedding generation...")
    
    # Prepare reference dataset (has close_notes_ref)
    reference_with_notes = reference_final.copy()
    reference_with_notes['close_notes_for_embedding'] = reference_with_notes['close_notes_ref']
    reference_with_notes['dataset_type'] = 'reference'
    
    # Prepare other incidents dataset (has close_notes)
    other_with_notes = other_incidents_final.copy()
    if 'close_notes' in other_with_notes.columns:
        # Filter to only incidents that have close_notes
        other_with_notes = other_with_notes[other_with_notes['close_notes'].notna()].copy()
        other_with_notes['close_notes_for_embedding'] = other_with_notes['close_notes']
    else:
        print("⚠️ Other incidents dataset doesn't have close_notes column")
        other_with_notes = pd.DataFrame()  # Empty if no close_notes
    
    other_with_notes['dataset_type'] = 'other'
    
    # Combine both datasets
    all_incidents = pd.concat([reference_with_notes, other_with_notes], ignore_index=True)
    print(f"   Reference incidents: {len(reference_with_notes)}")
    print(f"   Other incidents: {len(other_with_notes)}")
    print(f"   Total incidents for embedding: {len(all_incidents)}")
    
    # Generate embeddings for all close notes
    print("\nGenerating embeddings for all close notes...")
    close_notes_texts = all_incidents['close_notes_for_embedding'].astype(str).tolist()
    
    if use_flag_embedding:
        # FlagEmbedding returns dict with 'dense_vecs', 'sparse', 'colbert_vecs'
        output = model.encode(close_notes_texts, return_dense=True, return_sparse=False, return_colbert_vecs=False)
        embeddings = output['dense_vecs']
    else:
        embeddings = model.encode(close_notes_texts, show_progress_bar=True, batch_size=32)
    
    print(f"✅ Generated embeddings for {len(embeddings)} close notes")
    print(f"   Embedding dimensions: {embeddings.shape}")
    
    # Store embeddings in the combined dataframe
    all_incidents['embedding'] = embeddings.tolist()
    
    # Also store in original dataframes for convenience
    # Split back into reference and other
    reference_mask = all_incidents['dataset_type'] == 'reference'
    reference_with_embeddings = all_incidents[reference_mask].copy()
    other_with_embeddings = all_incidents[~reference_mask].copy()
    
    # Store embeddings in original dataframes
    if len(reference_with_embeddings) > 0:
        reference_final['embedding'] = reference_with_embeddings['embedding'].values
    
    print("\n✅ Embeddings generated and stored for all incidents!")
    print(f"   Use 'all_incidents' dataframe for full analysis")
    print(f"   Use 'reference_final' for reference dataset with embeddings")
else:
    print("⚠️ Skipping embeddings generation")


## 13. Validate Quality Scores with Semantic Similarity

**What we're doing:** Checking if incidents with similar quality scores are semantically closer to each other.

**Why this matters:**
- **Validates our scoring**: If quality scores are meaningful, similar scores should mean similar content
- **Quality assurance**: Confirms that our filtering criteria make sense
- **Understanding relationships**: See how close notes cluster based on quality

**What we'll check:**
- Incidents with similar `info_score_close_notes` should be semantically similar
- High-quality close notes (reference dataset) should cluster together
- This validates that our quality criteria capture meaningful differences

**Note:** This analysis uses ALL incidents (both reference and other) to get a complete picture.
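
The check boils down to comparing two averages: similarity between pairs in the same score group versus pairs in different groups. The sketch below shows that idea on toy data. The vectors, the labels, and the helper name `within_between_means` are all made up for illustration; it mirrors the idea, not necessarily the exact grouping used in the next cell.

```python
import numpy as np

def within_between_means(sim, labels):
    """Mean pairwise similarity for same-label pairs vs different-label pairs."""
    labels = np.asarray(labels)
    # Upper triangle with k=1: every unordered pair once, diagonal excluded
    rows, cols = np.triu_indices(len(labels), k=1)
    same = labels[rows] == labels[cols]
    pair_sims = sim[rows, cols]
    return pair_sims[same].mean(), pair_sims[~same].mean()

# Toy unit-length vectors standing in for embeddings
vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
labels = ["high", "high", "low", "low"]
sim = vectors @ vectors.T  # equals cosine similarity because each row has length 1

avg_within, avg_between = within_between_means(sim, labels)
print(f"within: {avg_within:.3f}, between: {avg_between:.3f}")
```

A within-group average that is clearly higher than the between-group average is the pattern the validation looks for. A small gap is weak evidence either way, since notes about different topics can share the same quality score.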


In [None]:
# Validate quality scores using semantic similarity
# Check if incidents with similar quality scores are semantically closer

if EMBEDDINGS_AVAILABLE and 'embedding' in all_incidents.columns:
    from sklearn.metrics.pairwise import cosine_similarity
    
    print("="*80)
    print("VALIDATING QUALITY SCORES WITH SEMANTIC SIMILARITY")
    print("="*80)
    
    # Convert embeddings to numpy array
    embedding_array = np.array(all_incidents['embedding'].tolist())
    
    # Calculate pairwise cosine similarities (measures how similar each pair is)
    print("\nCalculating semantic similarities between all incidents...")
    similarity_matrix = cosine_similarity(embedding_array)
    
    # Get quality scores if available
    if 'info_score' in all_incidents.columns:
        quality_scores = all_incidents['info_score'].values
    elif 'info_score_close_notes' in all_incidents.columns:
        quality_scores = all_incidents['info_score_close_notes'].values
    else:
        # No score column found: skip the score-based validation below
        quality_scores = None
        print("⚠️ Quality scores not found in all_incidents")
    
    if quality_scores is not None:
        print("\n📊 Validation: Do similar quality scores mean semantic similarity?")
        
        # Group incidents by quality score ranges
        score_ranges = [
            (0.8, 1.0, "High (0.8-1.0)"),
            (0.6, 0.8, "Medium-High (0.6-0.8)"),
            (0.4, 0.6, "Medium (0.4-0.6)"),
            (0.0, 0.4, "Low (0.0-0.4)")
        ]
        
        within_group_similarities = []
        between_group_similarities = []
        
        def score_mask(low, high):
            # Bins are half-open [low, high); the top bin also includes a score of exactly 1.0
            if high >= 1.0:
                return (quality_scores >= low) & (quality_scores <= high)
            return (quality_scores >= low) & (quality_scores < high)
        
        for low1, high1, label1 in score_ranges:
            mask1 = score_mask(low1, high1)
            if mask1.sum() == 0:
                continue
            
            indices1 = np.where(mask1)[0]
            
            # Within-group similarity (incidents in same score range)
            if len(indices1) > 1:
                within_sim = similarity_matrix[np.ix_(indices1, indices1)]
                # Get upper triangle (avoid diagonal and duplicates)
                within_sim_flat = within_sim[np.triu_indices(len(indices1), k=1)]
                within_group_similarities.extend(within_sim_flat.tolist())
            
            # Between-group similarity (incidents in different score ranges)
            for low2, high2, label2 in score_ranges:
                if low2 <= low1:  # Avoid duplicate comparisons
                    continue
                mask2 = score_mask(low2, high2)
                if mask2.sum() == 0:
                    continue
                
                indices2 = np.where(mask2)[0]
                between_sim = similarity_matrix[np.ix_(indices1, indices2)]
                between_group_similarities.extend(between_sim.flatten().tolist())
        
        if within_group_similarities and between_group_similarities:
            avg_within = np.mean(within_group_similarities)
            avg_between = np.mean(between_group_similarities)
            
            print(f"\n   Average similarity WITHIN same score range: {avg_within:.3f}")
            print(f"   Average similarity BETWEEN different score ranges: {avg_between:.3f}")
            print(f"   Difference: {avg_within - avg_between:.3f}")
            
            if avg_within > avg_between:
                print(f"\n   ✅ Validation PASSED: Similar scores = similar content")
                print(f"      (Incidents with similar quality scores are semantically closer)")
            else:
                print(f"\n   ⚠️ Validation WARNING: Similar scores don't mean similar content")
                print(f"      (This might indicate issues with quality scoring)")
    
    # Category analysis (if available)
    if 'category' in all_incidents.columns:
        print("\n📊 Category-wise Semantic Similarity:")
        categories = all_incidents['category'].dropna().unique()
        
        def category_positions(cat):
            # Row positions in all_incidents, which match the rows/columns of similarity_matrix
            return np.where((all_incidents['category'] == cat).to_numpy())[0]
        
        for cat in categories:
            cat_indices = category_positions(cat)
            if len(cat_indices) > 1:
                # Get similarity matrix for this category
                cat_similarity = similarity_matrix[np.ix_(cat_indices, cat_indices)]
                # Exclude diagonal (self-similarity = 1.0)
                mask = np.ones_like(cat_similarity, dtype=bool)
                np.fill_diagonal(mask, False)
                within_similarity = cat_similarity[mask].mean()
                print(f"   {cat}: Mean within-category similarity: {within_similarity:.3f} (n={len(cat_indices)})")
        
        # Analyze between-category similarity
        if len(categories) > 1:
            print(f"\n📊 Between-Category Semantic Similarity:")
            for i, cat1 in enumerate(categories):
                for cat2 in categories[i+1:]:
                    cat1_indices = category_positions(cat1)
                    cat2_indices = category_positions(cat2)
                    
                    # Get similarity between categories
                    between_similarity = similarity_matrix[np.ix_(cat1_indices, cat2_indices)].mean()
                    print(f"   {cat1} ↔ {cat2}: {between_similarity:.3f}")
    
    # Overall statistics (upper triangle = each pair once, no self-similarity)
    upper_triangle = similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]
    print(f"\n📊 Overall Statistics:")
    print(f"   Mean similarity (all pairs): {upper_triangle.mean():.3f}")
    print(f"   Min similarity: {upper_triangle.min():.3f}")
    print(f"   Max similarity: {upper_triangle.max():.3f}")
    
    print("="*80)
else:
    print("⚠️ Embeddings not available - skipping validation")


## 14. Visualize All Incidents with t-SNE

**What we're doing:** Visualizing all incidents in 2D space to see how they cluster based on quality scores.

**Why this matters:**
- **Visual validation**: See if high-quality incidents cluster together
- **Understand patterns**: Identify groups of similar incidents
- **Quality assurance**: Verify that our scoring creates meaningful groupings

**What you'll see:**
- 2D visualization of all incidents (both reference and other)
- Color-coded by quality score to see if similar scores cluster together
- Category information to see if incident types group together


In [None]:
if EMBEDDINGS_AVAILABLE and 'embedding' in all_incidents.columns:
    try:
        from sklearn.manifold import TSNE
        import matplotlib.pyplot as plt
        
        print("Generating t-SNE visualization...")
        
        # Prepare embeddings for ALL incidents
        embedding_array = np.array(all_incidents['embedding'].tolist())
        
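        # Apply t-SNE (reduce to 2D for visualization)
        # Perplexity must be smaller than the number of samples, hence the min(...)
        # Note: `max_iter` needs scikit-learn >= 1.5; older versions name this parameter `n_iter`
        print("   Running t-SNE for all incidents (this may take a moment)...")
        tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(all_incidents)-1), max_iter=1000)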
        embeddings_2d = tsne.fit_transform(embedding_array)
        
        # Create visualization with 2 plots
        fig, axes = plt.subplots(1, 2, figsize=(18, 6))
        
        # Get quality scores
        if 'info_score' in all_incidents.columns:
            quality_scores = all_incidents['info_score'].values
        elif 'info_score_close_notes' in all_incidents.columns:
            quality_scores = all_incidents['info_score_close_notes'].values
        else:
            quality_scores = None
        
        # Plot 1: Color by dataset type (reference vs other)
        if 'dataset_type' in all_incidents.columns:
            ref_mask = all_incidents['dataset_type'] == 'reference'
            axes[0].scatter(
                embeddings_2d[ref_mask, 0],
                embeddings_2d[ref_mask, 1],
                label='Reference (High-Quality)',
                color='green',
                alpha=0.6,
                s=100
            )
            axes[0].scatter(
                embeddings_2d[~ref_mask, 0],
                embeddings_2d[~ref_mask, 1],
                label='Other Incidents',
                color='orange',
                alpha=0.6,
                s=100
            )
            axes[0].set_title('t-SNE: Reference vs Other Incidents', fontsize=14, fontweight='bold')
            axes[0].set_xlabel('t-SNE Dimension 1', fontsize=11)
            axes[0].set_ylabel('t-SNE Dimension 2', fontsize=11)
            axes[0].legend()
            axes[0].grid(alpha=0.3)
        
        # Plot 2: Color by quality score
        if quality_scores is not None:
            scatter = axes[1].scatter(
                embeddings_2d[:, 0],
                embeddings_2d[:, 1],
                c=quality_scores,
                cmap='viridis',
                alpha=0.6,
                s=100
            )
            axes[1].set_title('t-SNE: Colored by Quality Score', fontsize=14, fontweight='bold')
            axes[1].set_xlabel('t-SNE Dimension 1', fontsize=11)
            axes[1].set_ylabel('t-SNE Dimension 2', fontsize=11)
            plt.colorbar(scatter, ax=axes[1], label='Quality Score')
            axes[1].grid(alpha=0.3)
        else:
            # If no quality score, just show all points
            axes[1].scatter(
                embeddings_2d[:, 0],
                embeddings_2d[:, 1],
                alpha=0.6,
                s=100,
                color='steelblue'
            )
            axes[1].set_title('t-SNE: All Close Notes', fontsize=14, fontweight='bold')
            axes[1].set_xlabel('t-SNE Dimension 1', fontsize=11)
            axes[1].set_ylabel('t-SNE Dimension 2', fontsize=11)
            axes[1].grid(alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Store 2D embeddings in all_incidents for potential future use
        all_incidents['tsne_x'] = embeddings_2d[:, 0]
        all_incidents['tsne_y'] = embeddings_2d[:, 1]
        
        print("\n💡 Interpretation:")
        print("   - If high-quality incidents cluster together → Quality scores are meaningful")
        print("   - If reference incidents are close → Our filtering worked well")
        print("   - Color gradient in Plot 2 shows quality score distribution")
        
        print("✅ t-SNE visualization complete!")
        
    except ImportError:
        print("⚠️ scikit-learn or matplotlib not available. Install with: pip install scikit-learn matplotlib")
    except Exception as e:
        print(f"⚠️ Error generating t-SNE visualization: {e}")
else:
    print("⚠️ Embeddings not available")


## 15. Save Embeddings for Future Use

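Save the reference-dataset embeddings so the evaluation notebooks can reuse them without recomputing. Embeddings are too large to store comfortably in a CSV, so the array goes into a NumPy `.npy` file, and a small pickle file records which incident number each row belongs to, along with the model name and embedding dimension.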
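
The save-and-reload round trip can be sketched as follows. This is a minimal illustration under stated assumptions: the incident numbers, vectors, and file names are made up, and a temporary directory stands in for the notebook's `data` folder.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Stand-in data: three fake incidents with 4-dimensional vectors
incident_numbers = ["INC001", "INC002", "INC003"]
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    array_path = Path(tmp) / "embeddings.npy"
    meta_path = Path(tmp) / "embeddings_metadata.pkl"

    # Save: the array as .npy, the row-to-incident mapping as a pickle
    np.save(array_path, vectors)
    with open(meta_path, "wb") as f:
        pickle.dump({"indices": incident_numbers}, f)

    # Load both files back into memory
    loaded = np.load(array_path)
    with open(meta_path, "rb") as f:
        metadata = pickle.load(f)

# Row i of the array belongs to metadata["indices"][i]
row = metadata["indices"].index("INC002")
inc002_vector = loaded[row]
print(inc002_vector)
```

Because the array itself carries no incident identifiers, the metadata file is what ties each row back to an incident. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.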


In [None]:
import pickle

if EMBEDDINGS_AVAILABLE and 'embedding' in reference_final.columns:
    print("="*80)
    print("SAVING EMBEDDINGS (OPTIONAL)")
    print("="*80)
    
    # Save embeddings as numpy array (more efficient than storing in CSV)
    embeddings_path = data_dir / "gt_close_notes_embeddings.npy"
    embedding_array = np.array(reference_final['embedding'].tolist())
    np.save(embeddings_path, embedding_array)
    print(f"✅ Saved embeddings array to: {embeddings_path}")
    print(f"   Shape: {embedding_array.shape}")
    print(f"   Size: {embeddings_path.stat().st_size / 1024:.1f} KB")
    
    # Save mapping between indices and incident numbers
    import os
    current_model = os.getenv('EMBEDDING_MODEL', 'BAAI/bge-m3')
    embeddings_metadata = {
        'indices': reference_final['number'].tolist(),
        'model_name': current_model,
        'embedding_dimension': embedding_array.shape[1],
        'num_samples': len(embedding_array)
    }
    
    metadata_path = data_dir / "gt_close_notes_embeddings_metadata.pkl"
    with open(metadata_path, 'wb') as f:
        pickle.dump(embeddings_metadata, f)
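    print(f"✅ Saved embeddings metadata to: {metadata_path}")
    
    # Save updated CSV with t-SNE coordinates (if available)
    # The t-SNE cell stores coordinates on all_incidents, so copy over the reference rows
    # (they keep the same order as reference_final)
    if 'tsne_x' in all_incidents.columns:
        ref_rows = all_incidents[all_incidents['dataset_type'] == 'reference']
        # Save a version without embeddings column (too large for CSV)
        reference_final_export = reference_final.drop(columns=['embedding']).copy()
        reference_final_export['tsne_x'] = ref_rows['tsne_x'].values
        reference_final_export['tsne_y'] = ref_rows['tsne_y'].values
        output_path_with_coords = data_dir / "gt_close_notes_with_coords.csv"
        reference_final_export.to_csv(output_path_with_coords, index=False)
        print(f"✅ Saved CSV with t-SNE coordinates to: {output_path_with_coords}")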
    
    print("\n💡 To load embeddings later:")
    print("   embeddings = np.load('data/gt_close_notes_embeddings.npy')")
    print("   with open('data/gt_close_notes_embeddings_metadata.pkl', 'rb') as f:")
    print("       metadata = pickle.load(f)")
    print("="*80)
else:
    print("⚠️ Embeddings not available - skipping save")
