# Notebook 03: N-gram Baseline Analysis

## 🎯 What is This Notebook About?

This notebook performs a **baseline exploration** to test whether n-gram metrics (word/phrase overlap) are useful for evaluating the quality of close notes.

**Context:**
1. We have an **incident dataset** with original problem descriptions
2. We extracted some **high-quality close notes** from that dataset to serve as **ground truth references**
3. Our goal is to evaluate close notes (existing ones or LLM-generated ones) against these ground truth references

**This notebook's purpose:**
- **Hypothesis:** Incident descriptions and close notes might use very different language, making n-gram metrics less useful
- **Test:** Compare ground-truth close notes vs incident descriptions using n-gram metrics
- **Goal:** Determine if n-grams are relevant, or if we should focus on LLM-as-a-Judge evaluation instead

**What we're comparing:**
- **From Ground Truth Dataset:** Pairs of (`close_notes_ref`, `content`) from the **same incident**
- **From Incidents Dataset:** Pairs of (`close_notes`, `content`) from the **same incident**
- **Then compare:** N-gram scores between ground truth pairs vs incidents pairs

**Expected outcome:**
- If n-gram scores are very low for both datasets, it confirms that incident descriptions and close notes use different language
- Comparing scores between datasets shows whether ground truth close notes differ more or less from their descriptions than regular close notes do
- Low scores would support using **LLM-as-a-Judge** (semantic evaluation) rather than n-grams for the main evaluation

---

## 📚 Key Concepts Explained

### What are N-grams?

**N-grams** are sequences of N words. For example:
- **1-gram (unigram)**: Single words → "the", "user", "reported"
- **2-gram (bigram)**: Pairs of words → "the user", "user reported", "reported error"
- **3-gram (trigram)**: Three words → "the user reported", "user reported error"

**Why we're testing this:** N-grams measure **lexical overlap** (shared words/phrases). If incident descriptions and close notes use completely different vocabulary, n-grams won't be useful for evaluation.
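To make the definition concrete, here is a minimal sketch of n-gram extraction. It assumes simple lowercasing and whitespace tokenization; the helper name `ngrams` is illustrative and is not part of this notebook or of Unitxt.

```python
from typing import List, Tuple


def ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-word sequences in `text`.

    Assumption: tokens are lowercased and split on whitespace only.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the user reported error"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A text with fewer than N words has no N-grams at all, which is one reason very short close notes tend to score zero on the higher-order metrics.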

### What are ROUGE Metrics?

**ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) measures how well a text matches a reference by counting overlapping n-grams.

**The metrics we'll use:**

1. **ROUGE-1**: Measures word overlap (unigrams)
   - *Example:* "User reported error" vs "User saw error" → Shares 2 words: "User" and "error"
   - *What it tells us:* Do the texts use similar vocabulary?

2. **ROUGE-2**: Measures two-word phrase overlap (bigrams)
   - *Example:* "User reported error" vs "User reported issue" → Shares 1 phrase: "User reported"
   - *What it tells us:* Do the texts use similar word combinations?

3. **ROUGE-L**: Measures the longest common subsequence (LCS)
   - *Example:* Finds the longest sequence of words that appear in the same order in both texts (gaps are allowed)
   - *What it tells us:* Do the texts present their shared words in the same order?

4. **ROUGE-Lsum**: ROUGE-L applied at the summary level: each text is split into sentences and the LCS is computed sentence by sentence
   - *What it tells us:* The same as ROUGE-L, but for multi-sentence texts; for single-sentence texts the two scores typically coincide
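The two overlap examples above can be reproduced with a small hand-rolled ROUGE-N. This is a sketch under stated assumptions (lowercasing, whitespace tokenization, no stemming), not the Unitxt implementation; library implementations typically apply their own tokenization and may stem words, so their numbers can differ slightly. The function name `rouge_n` is illustrative.

```python
from collections import Counter
from typing import Dict


def rouge_n(reference: str, prediction: str, n: int = 1) -> Dict[str, float]:
    """Clipped n-gram overlap reported as precision, recall, and F1."""

    def counts(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, pred_counts = counts(reference), counts(prediction)
    # Counter intersection keeps the minimum count per n-gram (clipping)
    overlap = sum((ref_counts & pred_counts).values())
    ref_total = sum(ref_counts.values())
    pred_total = sum(pred_counts.values())

    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


r1 = rouge_n("User reported error", "User saw error", n=1)
r2 = rouge_n("User reported error", "User reported issue", n=2)
print(r1)  # 2 of 3 unigrams shared
print(r2)  # 1 of 2 bigrams shared
```

Two of three words shared gives a ROUGE-1 F1 of about 0.67, and one of two bigrams shared gives a ROUGE-2 F1 of 0.5.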

**Score interpretation:**
- **0.0** = No overlap (completely different texts)
- **0.5** = Moderate similarity (roughly half of the words/phrases match)
- **1.0** = Perfect match (identical texts)

**Our hypothesis:** Scores will be **low (0.1-0.3)** because:
- Incident descriptions describe **problems** ("User cannot login")
- Close notes describe **solutions** ("Reset password and verified access")
- They use different vocabulary and structure

---

## 🎯 Objectives

This notebook will:
1. **Load** ground truth dataset and incidents dataset
2. **Compare** ground-truth close notes vs incident descriptions using n-gram metrics
3. **Calculate** ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores using Unitxt
4. **Analyze** results to test our hypothesis
5. **Conclude** whether n-grams are useful or if we should focus on LLM-as-a-Judge

---

## 📋 What We're Comparing

**Dataset Comparison:**
- **Ground Truth** (`gt_close_notes.csv`): High-quality close notes extracted from incidents
  - Contains: `close_notes_ref` - well-written resolution notes
  - Also contains: `content` - the original problem description for the same incident

- **Incidents** (`incidents_prepared.csv`): Original incident dataset
  - Contains: `content` - the original problem description
  - Also contains: `close_notes` - existing close notes (paired with `content` to compute the incidents-side scores)

**Why compare ground-truth close notes vs incident descriptions?**
- **To test if n-grams are relevant:** If scores are very low, it confirms that incident descriptions and close notes use different language
- **Baseline for comparison:** Once we have LLM-generated close notes, we can compare them against ground truth using more sophisticated methods (LLM-as-a-Judge)

**Note:** The **real evaluation** will happen in the next notebook using **LLM-as-a-Judge**, which evaluates:
- Topic coverage
- Accuracy of facts
- Text structure
- Completeness
- And other semantic criteria

---

## 🔧 Using Unitxt

**Unitxt** is an open-source library for preparing text data and evaluating text quality. It provides pre-built metrics, including ROUGE.

**Why Unitxt?** It standardizes how we compute metrics, making results reproducible and comparable across different evaluation phases.


In [None]:
# Import required libraries
# These are the tools we need to work with data and create visualizations

import pandas as pd  # For working with tables (like Excel spreadsheets)
import numpy as np   # For mathematical operations
import matplotlib.pyplot as plt  # For creating charts and graphs
import seaborn as sns  # For prettier charts
from pathlib import Path  # For handling file paths
import sys
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')  # Hide warning messages to keep output clean

# Add src directory to path so we can use utility functions
sys.path.append(str(Path("../src").resolve()))

# Unitxt imports - REQUIRED
# Unitxt provides the ROUGE metrics we'll use to compare texts
try:
    from unitxt.metrics import Rouge
    print("✅ Unitxt imported successfully")
except ImportError as e:
    # Chain the original exception so the root cause stays in the traceback
    raise ImportError(
        f"Unitxt is required but not available: {e}\n"
        "Please install Unitxt: pip install unitxt or uv add unitxt"
    ) from e

# Set up plotting style (makes charts look nicer)
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    try:
        plt.style.use('seaborn')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")  # Use a nice color palette
# Display charts in the notebook
%matplotlib inline

print("✅ All libraries imported successfully!")
print("✅ Using Unitxt for n-gram metrics evaluation")


## 1. Load Datasets

**What we're doing:** Loading the two datasets we want to compare.

**Why:** We need both datasets in memory before we can compare them.


In [None]:
# Load datasets
# We'll load both datasets so we can compare them

data_dir = Path("../data")  # Where our data files are stored

# Load ground truth dataset
# This contains high-quality close notes (our reference texts)
gt_path = data_dir / "gt_close_notes.csv"
gt_df = pd.read_csv(gt_path)  # Read CSV file into a table (DataFrame)
print(f"✅ Loaded ground truth dataset: {len(gt_df)} records")
print(f"   Columns: {list(gt_df.columns)}")  # Show what information is in each row

# Load incidents dataset
# This contains the original incident descriptions
incidents_path = data_dir / "incidents_prepared.csv"
if incidents_path.exists():
    incidents_df = pd.read_csv(incidents_path)
    print(f"✅ Loaded incidents dataset: {len(incidents_df)} records")
else:
    # Fallback to sample dataset if the prepared one doesn't exist
    incidents_path = data_dir / "incidents_sample.csv"
    incidents_df = pd.read_csv(incidents_path)
    print(f"✅ Loaded incidents sample dataset: {len(incidents_df)} records")
print(f"   Columns: {list(incidents_df.columns)}")

# Display basic info about both datasets
# This helps us understand the data structure and verify everything loaded correctly
print("\n" + "="*60)
print("Ground Truth Dataset Info:")
print("="*60)
gt_df.info()  # Prints number of rows, columns, data types, and missing values (returns None, so no print() wrapper)
print("\n" + "="*60)
print("Incidents Dataset Info:")
print("="*60)
incidents_df.info()  # info() prints directly; wrapping it in print() would also print "None"


## 2. Prepare Data for Comparison

**What we're doing:** Checking that our datasets have the right columns to create pairs from the same incident.

**Why:** Before comparing, we need to verify:
- The ground truth dataset has both `close_notes_ref` AND `content` columns (same incident)
- The incidents dataset has both `close_notes` AND `content` columns (same incident)
- The data looks correct

**What to look for:** Make sure the sample data shows actual text content, not empty values. Each row should have both close notes and content from the same incident.


In [None]:
# Check required columns
# We need BOTH close notes AND content from the same incident in each dataset

required_gt_cols = ['close_notes_ref', 'content']  # Ground truth must have both close notes and content
required_incident_cols = ['close_notes', 'content']  # Incidents must have both close notes and content

# Verify columns exist in our datasets
# This checks if the columns we need are actually present
missing_gt = [col for col in required_gt_cols if col not in gt_df.columns]
missing_incident = [col for col in required_incident_cols if col not in incidents_df.columns]

# Report any missing columns
if missing_gt:
    print(f"❌ Missing columns in ground truth: {missing_gt}")
    print(f"   Available columns: {list(gt_df.columns)}")
if missing_incident:
    print(f"❌ Missing columns in incidents: {missing_incident}")
    print(f"   Available columns: {list(incidents_df.columns)}")

# If all columns are present, show sample data
if not missing_gt and not missing_incident:
    print("✅ All required columns found!")
    
    # Identifier columns are optional: show them only if present so that a
    # missing 'number' or 'category' column does not raise a KeyError here
    id_cols = ['number', 'category']
    gt_sample_cols = [c for c in id_cols if c in gt_df.columns] + ['content', 'close_notes_ref']
    inc_sample_cols = [c for c in id_cols if c in incidents_df.columns] + ['content', 'close_notes']
    
    # Show sample data so you can see what we're working with
    # This helps verify the data looks correct before we start comparing
    print("\n" + "="*60)
    print("Sample Ground Truth (same incident):")
    print("="*60)
    print("Each row has both close_notes_ref and content from the same incident:")
    print(gt_df[gt_sample_cols].head(2))
    
    print("\n" + "="*60)
    print("Sample Incidents (same incident):")
    print("="*60)
    print("Each row has both close_notes and content from the same incident:")
    print(incidents_df[inc_sample_cols].head(2))
else:
    print("\n⚠️  Cannot proceed: Missing required columns")
    print("   Please ensure both datasets have the required columns to create pairs from the same incident")


## 3. Prepare Text Pairs from Same Incident

**What we're doing:** Creating pairs of texts from the **same incident** in each dataset.

**Why:** We want to compare how similar close notes are to their incident descriptions, and then compare this similarity between ground truth vs regular incidents.

**Pair Creation Strategy:**
- **Ground Truth Dataset:** For each row, create pair (`close_notes_ref`, `content`) - both from the same incident
- **Incidents Dataset:** For each row, create pair (`close_notes`, `content`) - both from the same incident

**What happens:**
1. From ground truth dataset: Extract pairs where both fields come from the same row (same incident)
2. From incidents dataset: Extract pairs where both fields come from the same row (same incident)
3. Filter out rows where either field is missing or empty
4. We'll compute n-gram metrics for each pair
5. Then compare the distribution of scores between ground truth pairs vs incidents pairs

**Example from Ground Truth:**
- Same incident (e.g., INC009427):
  - Content: "Customer has an issue with Palo Alto Prisma Cloud..."
  - Close Notes Ref: "The customer reported a SocketException: Connection..."
  - Pair: (content, close_notes_ref) from same incident

**Example from Incidents:**
- Same incident (e.g., INC0047192):
  - Content: "The customer reports that Google Workspace crashes..."
  - Close Notes: "Resolved issue with Google Workspace by clearing cache..."
  - Pair: (content, close_notes) from same incident
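The filtering rule in step 3 can be sketched without pandas. This is a minimal illustration on made-up rows: the incident numbers and texts in `SAMPLE_CSV` are hypothetical, and `build_pairs` is an illustrative helper, not the function used in this notebook.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical sample data; the real CSV files contain more columns and rows
SAMPLE_CSV = """number,content,close_notes
INC001,User cannot login,Reset password and verified access
INC002,Printer offline,
INC003,,Replaced network cable
INC004,VPN drops every hour,"   "
"""


def build_pairs(csv_text: str, close_notes_col: str, content_col: str) -> List[Tuple[str, str]]:
    """Return (content, close_notes) pairs, one per row, skipping incomplete rows."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        content = (row.get(content_col) or "").strip()
        notes = (row.get(close_notes_col) or "").strip()
        # Keep the row only if both fields contain non-whitespace text
        if content and notes:
            pairs.append((content, notes))
    return pairs


pairs = build_pairs(SAMPLE_CSV, close_notes_col="close_notes", content_col="content")
print(pairs)
```

Only the first row survives: the other three have a missing description, missing close notes, or close notes that contain nothing but whitespace.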


In [None]:
def prepare_text_pairs_from_same_incident(
    df: pd.DataFrame,
    close_notes_col: str,
    content_col: str,
    dataset_name: str = "dataset"
) -> Tuple[List[str], List[str], pd.DataFrame]:
    """
    Prepare text pairs from the same incident.
    
    For each row in the dataframe, creates a pair of (content, close_notes) 
    where both come from the same incident.
    
    Args:
        df: Dataframe with both close_notes and content columns
        close_notes_col: Name of the close notes column
        content_col: Name of the content/description column
        dataset_name: Name of dataset (for logging)
    
    Returns:
        Tuple of (references, predictions, metadata_df)
        - references: List of content texts (incident descriptions)
        - predictions: List of close notes texts (paired with content from same row)
        - metadata_df: DataFrame with incident numbers, category, etc.
    """
    references = []  # Will store content (incident descriptions)
    predictions = []  # Will store close notes (from same incident)
    metadata = []  # Will store information about each pair
    
    # Clean text data - make copy so we don't modify original
    df_clean = df.copy()
    
    # Remove rows where either field is missing
    df_clean = df_clean.dropna(subset=[close_notes_col, content_col])
    
    # Filter out empty or whitespace-only strings
    # (astype(str) guards against non-string values such as numbers)
    df_clean = df_clean[
        (df_clean[close_notes_col].astype(str).str.strip() != '') & 
        (df_clean[content_col].astype(str).str.strip() != '')
    ]
    
    print(f"📊 {dataset_name} records: {len(df)}")
    print(f"📊 Records with both fields: {len(df_clean)}")
    
    # Create pairs from same incident (same row)
    for _, row in df_clean.iterrows():
        content_text = str(row[content_col]).strip()
        close_notes_text = str(row[close_notes_col]).strip()
        
        # Store the pair (both from same incident)
        references.append(content_text)
        predictions.append(close_notes_text)
        metadata.append({
            'number': row.get('number', ''),
            'category': row.get('category', 'UNKNOWN'),
            'subcategory': row.get('subcategory', ''),
        })
    
    # Convert metadata list to DataFrame for easier analysis
    metadata_df = pd.DataFrame(metadata)
    print(f"✅ Created {len(references)} pairs from same incidents")
    
    return references, predictions, metadata_df

# Prepare pairs from ground truth dataset
# Each pair: (content, close_notes_ref) from the same incident
print("="*60)
print("GROUND TRUTH DATASET PAIRS")
print("="*60)
gt_references, gt_predictions, gt_metadata = prepare_text_pairs_from_same_incident(
    gt_df,
    close_notes_col='close_notes_ref',
    content_col='content',
    dataset_name="Ground truth"
)

# Prepare pairs from incidents dataset
# Each pair: (content, close_notes) from the same incident
print("\n" + "="*60)
print("INCIDENTS DATASET PAIRS")
print("="*60)
inc_references, inc_predictions, inc_metadata = prepare_text_pairs_from_same_incident(
    incidents_df,
    close_notes_col='close_notes',
    content_col='content',
    dataset_name="Incidents"
)

# Show examples
print("\n" + "="*60)
print("EXAMPLE PAIRS")
print("="*60)
print("\n📝 Ground Truth Example (same incident):")
print(f"  Content: {gt_references[0][:150]}...")
print(f"  Close Notes: {gt_predictions[0][:150]}...")

print("\n📝 Incidents Example (same incident):")
print(f"  Content: {inc_references[0][:150]}...")
print(f"  Close Notes: {inc_predictions[0][:150]}...")
print("\nWe'll compare these pairs to see how similar close notes are to their incident descriptions!")


## 4. Compute N-gram Metrics Using Unitxt

**What we're doing:** Calculating ROUGE scores for each text pair in both datasets separately, then comparing them.

**How it works:**
1. For each pair (content, close_notes) from the same incident:
   - **ROUGE-1**: Counts how many individual words appear in both texts
   - **ROUGE-2**: Counts how many two-word phrases appear in both texts
   - **ROUGE-L**: Finds the longest sequence of words that appear in order in both texts
   - **ROUGE-Lsum**: Like ROUGE-L, but computed sentence by sentence and then aggregated over the whole text

2. Each metric returns a score between 0.0 and 1.0:
   - **0.0** = No words/phrases in common
   - **0.5** = Roughly half the words/phrases match
   - **1.0** = Perfect match (all words/phrases match)

3. We compute metrics for:
   - **Ground Truth pairs:** (content, close_notes_ref) from ground truth dataset
   - **Incidents pairs:** (content, close_notes) from incidents dataset

4. Then we compare the distributions to see if ground truth pairs show different patterns

**What to expect:**
- Scores are expected to be low (0.1-0.3) because incident descriptions and close notes use different language
- Comparing ground truth vs incidents shows whether high-quality close notes differ more or less from their descriptions
- Low scores in both datasets would support the view that n-grams aren't suitable for evaluating close notes quality
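ROUGE-L is the least intuitive of these metrics, so here is a minimal sketch of the idea: find the longest common subsequence (LCS) of words, then turn its length into precision, recall, and F1. It assumes lowercased whitespace tokens and is not the Unitxt implementation; `lcs_length` and `rouge_l_f1` are illustrative names.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """F1 built from LCS length (assumption: lowercase, whitespace tokens)."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Shared words in the same order: "user", "login", "error"
similar = rouge_l_f1("user reported login error", "user saw a login error")
# A problem description vs. a resolution note with no shared words
different = rouge_l_f1("User cannot login", "Reset password and verified access")
print(round(similar, 3), different)
```

The second pair is the kind of description/resolution pair the hypothesis is about: the texts describe the same incident, yet they share no words, so the score is zero.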


In [None]:
def compute_ngram_metrics_unitxt(references: List[str], predictions: List[str]) -> pd.DataFrame:
    """
    Compute n-gram metrics using Unitxt.
    
    This function compares each pair of texts and calculates ROUGE scores.
    ROUGE scores measure how many words/phrases the two texts share.
    
    Args:
        references: List of reference texts (incident descriptions, i.e. `content`)
        predictions: List of prediction texts (close notes from the same incident)
    
    Returns:
        DataFrame with ROUGE scores for each pair
        - rouge1: Word overlap score (0.0 to 1.0)
        - rouge2: Two-word phrase overlap score (0.0 to 1.0)
        - rougeL: Longest common subsequence score (0.0 to 1.0)
        - rougeLsum: Summary-level LCS score (0.0 to 1.0)
    """
    results = []  # Will store scores for each pair
    
    # Initialize Unitxt ROUGE metric
    # This is the tool that calculates the similarity scores
    rouge_metric = Rouge()
    
    print("📊 Computing ROUGE metrics using Unitxt...")
    print("   This compares each pair and counts shared words/phrases")
    print("   Note: Using ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum metrics")
    
    # Process each pair one by one
    for i, (ref, pred) in enumerate(zip(references, predictions)):
        try:
            # Compute ROUGE scores for this pair
            # Unitxt expects: references (as a list), prediction (as a string), task_data (empty dict)
            rouge_scores = rouge_metric.compute(references=[ref], prediction=pred, task_data={})
            
            # Extract the scores (they come back as numbers between 0.0 and 1.0)
            result = {
                'rouge1': rouge_scores.get('rouge1', 0.0),  # Word overlap
                'rouge2': rouge_scores.get('rouge2', 0.0),  # Two-word phrase overlap
                'rougeL': rouge_scores.get('rougeL', 0.0),  # Longest common subsequence
                'rougeLsum': rouge_scores.get('rougeLsum', 0.0),  # Summary-level LCS
            }
            results.append(result)
            
            # Show progress every 10 pairs
            if (i + 1) % 10 == 0:
                print(f"   Processed {i + 1}/{len(references)} pairs...")
        except Exception as e:
            # If something goes wrong with this pair, record zero scores.
            # Note: these zeros pull the mean down, so check for warnings
            # before interpreting the summary statistics.
            print(f"⚠️  Error processing pair {i+1}: {e}")
            results.append({
                'rouge1': 0.0,
                'rouge2': 0.0,
                'rougeL': 0.0,
                'rougeLsum': 0.0,
            })
    
    # Convert results to a DataFrame (table) for easier analysis
    return pd.DataFrame(results)

# Compute metrics for Ground Truth dataset
print("="*60)
print("COMPUTING METRICS FOR GROUND TRUTH DATASET")
print("="*60)
gt_metrics_df = compute_ngram_metrics_unitxt(gt_references, gt_predictions)

print(f"\n✅ Computed metrics for {len(gt_metrics_df)} ground truth pairs")
print("\n📊 Ground Truth Metrics Summary:")
print(gt_metrics_df.describe())

# Compute metrics for Incidents dataset
print("\n" + "="*60)
print("COMPUTING METRICS FOR INCIDENTS DATASET")
print("="*60)
inc_metrics_df = compute_ngram_metrics_unitxt(inc_references, inc_predictions)

print(f"\n✅ Computed metrics for {len(inc_metrics_df)} incidents pairs")
print("\n📊 Incidents Metrics Summary:")
print(inc_metrics_df.describe())

# Compare the two datasets
print("\n" + "="*60)
print("COMPARISON: GROUND TRUTH vs INCIDENTS")
print("="*60)
metric_cols = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
comparison_df = pd.DataFrame({
    'Metric': metric_cols,
    'Ground Truth Mean': [gt_metrics_df[col].mean() for col in metric_cols],
    'Incidents Mean': [inc_metrics_df[col].mean() for col in metric_cols],
    'Difference': [gt_metrics_df[col].mean() - inc_metrics_df[col].mean() for col in metric_cols]
})
print("\nMean scores comparison:")
print(comparison_df.to_string(index=False))
print("\n💡 Interpretation:")
print("   - Positive difference = Ground truth pairs have higher similarity")
print("   - Negative difference = Incidents pairs have higher similarity")
print("   - Close to zero = Similar patterns in both datasets")


## 5. Combine Results with Metadata

**What we're doing:** Combining the ROUGE scores with metadata for both datasets separately.

**Why:** This lets us analyze results by category and compare patterns between ground truth and incidents datasets.

**What we're adding:**
- Dataset source (ground truth vs incidents)
- Category information (SOFTWARE, NETWORK, etc.)
- Incident numbers (for tracking)
- Text lengths (to see if length affects similarity)


In [None]:
# Combine metrics with metadata for Ground Truth dataset
gt_results_df = pd.concat([gt_metadata, gt_metrics_df], axis=1)
gt_results_df['dataset'] = 'ground_truth'  # Mark as ground truth
gt_results_df['ref_length'] = [len(ref) for ref in gt_references]
gt_results_df['pred_length'] = [len(pred) for pred in gt_predictions]
gt_results_df['length_diff'] = gt_results_df['pred_length'] - gt_results_df['ref_length']

# Combine metrics with metadata for Incidents dataset
inc_results_df = pd.concat([inc_metadata, inc_metrics_df], axis=1)
inc_results_df['dataset'] = 'incidents'  # Mark as incidents
inc_results_df['ref_length'] = [len(ref) for ref in inc_references]
inc_results_df['pred_length'] = [len(pred) for pred in inc_predictions]
inc_results_df['length_diff'] = inc_results_df['pred_length'] - inc_results_df['ref_length']

# Combine both datasets into one dataframe for comparison
results_df = pd.concat([gt_results_df, inc_results_df], ignore_index=True)

print("✅ Combined results with metadata for both datasets")
print(f"\n📊 Ground Truth Results: {len(gt_results_df)} pairs")
print(f"📊 Incidents Results: {len(inc_results_df)} pairs")
print(f"📊 Total Results DataFrame shape: {results_df.shape}")
print("   (rows = pairs, columns = information about each pair)")
print(f"\n📋 Columns: {list(results_df.columns)}")
print("\n📊 First few results from each dataset:")
print("\nGround Truth:")
print(gt_results_df.head(2))
print("\nIncidents:")
print(inc_results_df.head(2))


## 6. Overall Statistics & Comparison

**What we're doing:** Calculating summary statistics for each dataset separately, then comparing them to test our hypothesis.

**What the statistics mean:**
- **Mean (Average)**: The typical similarity score
  - *Example:* Mean ROUGE-1 of 0.25 means that, on average, roughly a quarter of the words overlap
  
- **Median**: The middle value (half scores are above, half below)
  - *Why useful?* Less affected by outliers than mean
  
- **Standard Deviation**: How much scores vary
  - *Low std dev:* Scores are consistent
  - *High std dev:* Some pairs are very similar, others very different

- **Min/Max**: The lowest and highest scores
  - Shows the range of similarity
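A small numeric example shows why the median is worth reporting next to the mean. The scores below are made up for illustration: four low values and one outlier, such as a close note that copies most of its description.

```python
import statistics

# Made-up ROUGE-1 scores: four typical pairs and one near-duplicate pair
scores = [0.12, 0.15, 0.18, 0.20, 0.95]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
std_dev = statistics.stdev(scores)  # sample standard deviation (n - 1), like pandas .std()

print(f"mean={mean_score:.2f}  median={median_score:.2f}  std={std_dev:.2f}")
print(f"min={min(scores):.2f}  max={max(scores):.2f}")
```

The single outlier lifts the mean to 0.32, above the 0.1-0.3 band, while the median stays at 0.18. When the two disagree like this, the median is usually the better description of a typical pair.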

**Comparison Focus:**
- Are ground truth pairs more/less similar than incidents pairs?
- Do both datasets show similar patterns (low scores)?
- Does this confirm our hypothesis that n-grams aren't suitable?

**How to interpret the results:**

**If scores are LOW for BOTH datasets (0.1-0.3):**
- ✅ **Confirms our hypothesis:** Incident descriptions and close notes use different vocabulary
- ✅ **Conclusion:** N-grams are **not suitable** for evaluating close notes quality
- ✅ **Action:** Proceed with **LLM-as-a-Judge** evaluation (semantic understanding)

**If scores differ significantly between datasets:**
- ⚠️ **Interesting finding:** Ground truth close notes might use different language patterns
- ⚠️ **Likely conclusion:** N-grams are still a weak signal, because overlap with the description would then depend on how a note is written rather than on whether it is correct

**If scores are HIGH (0.5+):**
- ⚠️ **Surprising result:** Incident descriptions and close notes share significant vocabulary
- ⚠️ **Action:** Investigate further - this might indicate the incident descriptions already contain resolution language
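The notebook compares means and medians by eye and does not define what "differ significantly" means. One possible way to make that judgment less subjective is a permutation test on the difference in means. This is a sketch of a technique that is not part of this notebook; the function name, the number of permutations, and the sample scores are all assumptions for illustration.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(a: Sequence[float], b: Sequence[float],
                        n_permutations: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test for a difference in means.

    Shuffles the pooled scores and counts how often a random split produces
    a mean difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (extreme + 1) / (n_permutations + 1)


# Made-up scores for illustration only
gt_scores = [0.21, 0.25, 0.19, 0.23, 0.27, 0.22, 0.24, 0.20]
inc_scores = [0.11, 0.14, 0.09, 0.12, 0.16, 0.10, 0.13, 0.15]

p_value = permutation_p_value(gt_scores, inc_scores)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the gap between the two datasets is unlikely to be a sampling accident. With the real data, `gt_results_df['rouge1']` and `inc_results_df['rouge1']` would take the place of the two lists.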


In [None]:
# Overall statistics - compute separately for each dataset
metric_cols = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']  # The metrics we calculated

print("="*60)
print("GROUND TRUTH DATASET STATISTICS")
print("="*60)
gt_stats = gt_results_df[metric_cols].describe()
print(gt_stats)

print("\n" + "="*60)
print("INCIDENTS DATASET STATISTICS")
print("="*60)
inc_stats = inc_results_df[metric_cols].describe()
print(inc_stats)

print("\n" + "="*60)
print("COMPARISON: MEAN SCORES")
print("="*60)
comparison_stats = pd.DataFrame({
    'Metric': metric_cols,
    'Ground Truth Mean': [gt_results_df[col].mean() for col in metric_cols],
    'Incidents Mean': [inc_results_df[col].mean() for col in metric_cols],
    'Difference': [gt_results_df[col].mean() - inc_results_df[col].mean() for col in metric_cols],
    'GT Std Dev': [gt_results_df[col].std() for col in metric_cols],
    'Inc Std Dev': [inc_results_df[col].std() for col in metric_cols]
})
print(comparison_stats.to_string(index=False))

print("\n" + "="*60)
print("COMPARISON: MEDIAN SCORES")
print("="*60)
median_comparison = pd.DataFrame({
    'Metric': metric_cols,
    'Ground Truth Median': [gt_results_df[col].median() for col in metric_cols],
    'Incidents Median': [inc_results_df[col].median() for col in metric_cols],
    'Difference': [gt_results_df[col].median() - inc_results_df[col].median() for col in metric_cols]
})
print(median_comparison.to_string(index=False))

print("\n💡 Interpretation:")
print("   - Low scores (0.1-0.3) for BOTH datasets = Confirms hypothesis")
print("   - Similar patterns = Both datasets show same language differences")
print("   - Different patterns = Interesting finding about ground truth quality")
print("   - If the scores above are low, n-grams are NOT suitable for evaluating close notes!")


## 7. Optional: Analysis by Category

**What we're doing:** (Optional) Grouping results by category to see if patterns vary by incident type.

**Why this is optional:**
- The main comparison (Ground Truth vs Incidents) is more important
- Category analysis can provide additional insights but isn't critical for hypothesis testing

**What to look for (if reviewing):**
- **Consistent low scores across categories** = Strong confirmation of hypothesis
- **Variable scores** = Some categories might have different patterns (interesting but doesn't change main conclusion)

**Main takeaway:** Even if categories vary, the overall low scores confirm that n-grams aren't suitable for evaluating close notes quality.


In [None]:
# Optional: Group by category
# This is optional - the main comparison is Ground Truth vs Incidents

if 'category' in results_df.columns and len(results_df['category'].unique()) > 1:
    print("="*60)
    print("OPTIONAL: METRICS BY CATEGORY")
    print("="*60)
    print("(This analysis is optional - main focus is dataset comparison above)\n")
    
    # Quick summary by category for both datasets
    print("📊 Quick Summary by Category (per dataset):")
    category_summary = results_df.groupby(['dataset', 'category'])[metric_cols].mean()
    print(category_summary)
    
    print("\n💡 Note: Main conclusion from dataset comparison is more important than category analysis")
else:
    print("⚠️  Category column missing or only one category present - skipping optional category analysis")


## 8. Visualizations: Comparing Datasets

**What we're doing:** Creating charts to compare n-gram scores between Ground Truth and Incidents datasets.

**Focus:** We want to see if both datasets show similar patterns (low scores), which would confirm our hypothesis that n-grams aren't suitable for evaluating close notes.

**Charts we'll create:**
1. **Comparison Bar Chart**: Mean scores for each metric, side-by-side for both datasets
2. **Box Plot Comparison**: Distribution comparison showing if patterns are similar
3. **Summary Table**: Quick visual comparison of key statistics

**What to look for:**
- **Similar low scores** in both datasets = Confirms hypothesis
- **Different patterns** = Interesting finding about ground truth quality
- **Overall conclusion**: Low scores validate that n-grams aren't suitable
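When reading the box plots, it helps to know what the box edges represent: the first and third quartiles, with the median as the line inside. The sketch below computes them for made-up scores using the standard library; `method="inclusive"` is chosen on the assumption that it matches the linear interpolation Matplotlib uses by default.

```python
import statistics

# Made-up ROUGE scores for one metric in one dataset
scores = [0.1, 0.2, 0.3, 0.4, 0.5]

# n=4 returns the three cut points: Q1, median, Q3
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # height of the box: the middle 50% of scores

print(f"Q1={q1:.2f}  median={q2:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
```

A box that sits entirely inside the 0.1-0.3 band means at least half of the pairs score in the range the hypothesis predicts.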


In [None]:
# Create comparison visualizations
# Focus on comparing Ground Truth vs Incidents datasets

metric_cols = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
metric_labels = ['ROUGE-1\n(Word Overlap)', 'ROUGE-2\n(Phrase Overlap)', 
                 'ROUGE-L\n(Sequence)', 'ROUGE-Lsum\n(Summary)']

# Prepare data for comparison
gt_means = [gt_results_df[col].mean() for col in metric_cols]
inc_means = [inc_results_df[col].mean() for col in metric_cols]

# Chart 1: Comparison Bar Chart
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(metric_labels))
width = 0.35

bars1 = ax.bar(x - width/2, gt_means, width, label='Ground Truth', alpha=0.8, color='#2ecc71')
bars2 = ax.bar(x + width/2, inc_means, width, label='Incidents', alpha=0.8, color='#3498db')

ax.set_xlabel('Metrics', fontsize=12, fontweight='bold')
ax.set_ylabel('Mean Score', fontsize=12, fontweight='bold')
ax.set_title('Comparison: Mean N-gram Scores\n(Ground Truth vs Incidents)', 
             fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metric_labels)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')
ax.set_ylim([0, max(max(gt_means), max(inc_means)) * 1.2])

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}',
                ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

print("\n💡 Reading the bar chart:")
print("   - Compare the height of bars for each metric")
print("   - Similar heights = Both datasets show similar patterns")
print("   - Low scores (0.1-0.3) = Confirms hypothesis that n-grams aren't suitable")


In [None]:
# Chart 2: Box Plot Comparison - Distribution of scores by dataset
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Distribution Comparison: Ground Truth vs Incidents', 
             fontsize=14, fontweight='bold')

# Prepare data for box plots
data_to_plot = []
for col in metric_cols:
    data_to_plot.append([
        gt_results_df[col].values,
        inc_results_df[col].values
    ])

colors = ['#2ecc71', '#3498db']
labels = ['Ground Truth', 'Incidents']

for idx, (col, label) in enumerate(zip(metric_cols, metric_labels)):
    ax = axes[idx // 2, idx % 2]
    bp = ax.boxplot(data_to_plot[idx], positions=[1, 2], widths=0.6, 
                    patch_artist=True)
    # Set tick labels separately: the boxplot `labels` keyword was renamed
    # `tick_labels` in Matplotlib 3.9, so this form works across versions
    ax.set_xticklabels(labels)
    
    # Color the boxes
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.7)
    
    ax.set_title(f'{label}', fontsize=11, fontweight='bold')
    ax.set_ylabel('Score', fontsize=10)
    ax.grid(True, alpha=0.3, axis='y')
    ax.set_ylim([0, max(max(gt_results_df[col]), max(inc_results_df[col])) * 1.1])

plt.tight_layout()
plt.show()

print("\n💡 Reading the box plots:")
print("   - Compare the boxes side-by-side for each metric")
print("   - Similar box positions = Both datasets show similar patterns")
print("   - Low boxes (scores 0.1-0.3) = Confirms hypothesis")
print("   - The box shows middle 50% of scores, line shows median")


In [None]:
# Chart 3: Summary Comparison Table Visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.axis('tight')
ax.axis('off')

# Create comparison table
comparison_data = []
for col, label in zip(metric_cols, ['ROUGE-1', 'ROUGE-2', 'ROUGE-L', 'ROUGE-Lsum']):
    comparison_data.append([
        label,
        f"{gt_results_df[col].mean():.4f}",
        f"{inc_results_df[col].mean():.4f}",
        f"{gt_results_df[col].mean() - inc_results_df[col].mean():+.4f}",
        f"{gt_results_df[col].median():.4f}",
        f"{inc_results_df[col].median():.4f}"
    ])

table = ax.table(cellText=comparison_data,
                colLabels=['Metric', 'GT Mean', 'Inc Mean', 'Difference', 'GT Median', 'Inc Median'],
                cellLoc='center',
                loc='center',
                colWidths=[0.2, 0.15, 0.15, 0.15, 0.15, 0.15])

table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)

# Style the header
for i in range(6):
    table[(0, i)].set_facecolor('#34495e')
    table[(0, i)].set_text_props(weight='bold', color='white')

# Color code differences
for i in range(1, len(comparison_data) + 1):
    diff_val = float(comparison_data[i-1][3])
    if abs(diff_val) < 0.01:
        table[(i, 3)].set_facecolor('#ecf0f1')  # Very similar (gray)
    elif diff_val > 0:
        table[(i, 3)].set_facecolor('#d5f4e6')  # GT higher (light green)
    else:
        table[(i, 3)].set_facecolor('#fadbd8')  # Inc higher (light red)

ax.set_title('Summary Comparison: Ground Truth vs Incidents', 
             fontsize=12, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\n💡 Summary:")
print("   - Low scores (< 0.3) in both datasets = Confirms hypothesis")
print("   - Similar patterns = Both show same language differences")
print("   - If both hold, n-grams are NOT suitable for evaluating close notes quality")


## 9. Save Results

**What we're doing:** Saving all the computed scores and metadata to a CSV file.

**Why save:**
- Can analyze results later without re-running the notebook
- Can compare these n-gram results with semantic similarity results (from next notebook)
- Can share results with team members
- Can track improvements over time

**What gets saved:**
- All ROUGE scores for each pair
- Category and metadata information
- Text lengths and other analysis columns

**File location:** `data/ngram_comparison_results.csv`
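
A minimal sketch of how the saved file could be reloaded and summarised later without re-running the notebook. The `dataset` column and the toy values are assumptions made for illustration; the real CSV may use different column names. A temporary directory stands in for `data/` so the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the saved results. The `dataset` column is hypothetical.
toy_results = pd.DataFrame({
    "dataset": ["ground_truth", "ground_truth", "incidents", "incidents"],
    "rouge1": [0.20, 0.30, 0.10, 0.20],
    "rouge2": [0.05, 0.07, 0.02, 0.04],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "ngram_comparison_results.csv"
    toy_results.to_csv(csv_path, index=False)  # same call style as the save cell
    reloaded = pd.read_csv(csv_path)           # read back while the file still exists

# Mean score per dataset, computed from the reloaded file
mean_by_dataset = reloaded.groupby("dataset")[["rouge1", "rouge2"]].mean()
print(mean_by_dataset)
```

With the real file, only the `pd.read_csv` and `groupby` lines would be needed, pointed at `data/ngram_comparison_results.csv`.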

---

## 📊 Summary: What Did We Learn?

### Key Findings from N-gram Analysis

1. **N-gram metrics measure word/phrase overlap** between two texts
2. **ROUGE scores tell us** how similar the vocabulary and phrases are
3. **Expected result:** Low scores confirm that incident descriptions and close notes use different language
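
To make "word overlap" concrete, here is a simplified sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is not the implementation used in this notebook: it assumes lowercase whitespace tokenisation and no stemming, and the function name and example texts are made up for illustration.

```python
from collections import Counter


def unigram_overlap_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1-style F1: clipped unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical incident description and close note
description = "user cannot log in to vpn"
close_note = "reset vpn certificate and confirmed user access"

score = unigram_overlap_f1(close_note, description)
print(f"{score:.3f}")  # 0.308: only "user" and "vpn" are shared
```

Even though both texts concern the same incident, only two words overlap, which is the kind of low score the hypothesis predicts.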

### Interpretation of Results

**If scores are LOW (0.1-0.3):**
- ✅ **Confirms our hypothesis:** Incident descriptions and close notes use different vocabulary
- ✅ **Conclusion:** N-grams are **not suitable** for evaluating close notes quality
- ✅ **Next step:** Focus on **LLM-as-a-Judge** evaluation (semantic understanding)

**If scores are HIGH (0.5+):**
- ⚠️ **Surprising result:** Incident descriptions and close notes share significant vocabulary
- ⚠️ **Conclusion:** N-grams might be useful, but we still need semantic evaluation
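
The interpretation bands can be sketched as a small helper. The 0.3 and 0.5 cut-offs mirror the thresholds used in the final code cell of this notebook, which also covers the band in between; the function name and the returned labels are illustrative assumptions.

```python
def interpret_mean_rouge1(mean_score: float) -> str:
    """Map a mean ROUGE-1 score to an interpretation band (illustrative labels)."""
    if mean_score < 0.3:
        return "low"      # descriptions and close notes use different vocabulary
    if mean_score < 0.5:
        return "partial"  # some shared vocabulary, significant differences remain
    return "high"         # more overlap than expected; worth investigating


for example_score in (0.18, 0.42, 0.61):
    print(example_score, interpret_mean_rouge1(example_score))
```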

### Why This Matters

This baseline analysis helps us understand:
- **What metrics NOT to use:** If n-gram scores are low, n-gram metrics won't help evaluate close notes
- **What to focus on:** LLM-as-a-Judge will evaluate semantic quality, not just word overlap
- **Baseline for comparison:** We can compare future LLM-generated close notes against these ground truth references

---

## 🎯 Next Steps: LLM-as-a-Judge Evaluation

The **real evaluation** will happen in the next phase using **LLM-as-a-Judge**, which will:

1. **Compare close notes** (existing or LLM-generated) against ground truth references
2. **Evaluate multiple criteria:**
   - **Topic coverage:** Does the close note cover the same topics as the reference?
   - **Profile data accuracy:** Is client/system information correct?
   - **Supporting facts:** Are the facts consistent with the reference?
   - **No invented facts:** Does it avoid making up information?
   - **Text structure:** Is it well-organized and clear?
   - **Conclusion quality:** Does it provide a clear resolution summary?

3. **Provide scores (0-5)** for each criterion, similar to the example provided:
   ```json
   {
     "check_topic_coverage": 4,
     "check_profile_data": 5,
     "check_supporting_facts": 5,
     "check_facts_are_not_invented": 5,
     "check_text_structure": 4,
     "check_conclusion": 5,
     "general_score": 4.67
   }
   ```

4. **Handle context differences:** Each incident has different context, so evaluation will be relative to similar incidents/categories

**This approach is more suitable** because:
- It evaluates **meaning and quality**, not just word overlap
- It can handle **different contexts** (each incident is unique)
- It provides **explainable scores** with reasoning
- It's **scalable** and doesn't require human labeling
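
In the JSON example above, the `general_score` of 4.67 is consistent with an unweighted mean of the six criterion scores. Whether the judge actually aggregates this way is an assumption; the sketch below only reproduces the arithmetic.

```python
from statistics import mean

# Criterion scores copied from the JSON example above
judge_scores = {
    "check_topic_coverage": 4,
    "check_profile_data": 5,
    "check_supporting_facts": 5,
    "check_facts_are_not_invented": 5,
    "check_text_structure": 4,
    "check_conclusion": 5,
}

# Assumed aggregation: unweighted mean, rounded to two decimals
general_score = round(mean(judge_scores.values()), 2)
print(general_score)  # 4.67, i.e. 28 / 6 rounded
```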


In [None]:
# Save results
# This saves all our computed scores and metadata to a CSV file
# You can open this file in Excel or any spreadsheet program later

output_path = data_dir / "ngram_comparison_results.csv"
results_df.to_csv(output_path, index=False)  # Save as CSV (comma-separated values)
print(f"✅ Saved results to: {output_path}")
print(f"   Total pairs: {len(results_df)}")
print(f"   Columns: {list(results_df.columns)}")
print("\n💡 You can now:")
print("   - Open this file in Excel or Google Sheets")
print("   - Filter/sort by category or score")
print("   - Use this as a baseline for future comparisons")

# Summary statistics
# Print a final summary comparing both datasets
print("\n" + "="*60)
print("FINAL SUMMARY")
print("="*60)
print(f"Ground Truth pairs evaluated: {len(gt_results_df)}")
print(f"Incidents pairs evaluated: {len(inc_results_df)}")
print(f"Total pairs: {len(results_df)}")

print(f"\nMean Scores Comparison:")
print(f"{'Metric':<15} {'Ground Truth':<15} {'Incidents':<15} {'Difference':<15}")
print("-" * 60)
for metric in metric_cols:
    gt_mean = gt_results_df[metric].mean()
    inc_mean = inc_results_df[metric].mean()
    diff = gt_mean - inc_mean
    print(f"{metric:<15} {gt_mean:<15.4f} {inc_mean:<15.4f} {diff:+.4f}")

# Conclusion based on comparison of both datasets
print("\n" + "="*60)
print("CONCLUSION & RECOMMENDATION")
print("="*60)

gt_mean_rouge1 = gt_results_df['rouge1'].mean()
inc_mean_rouge1 = inc_results_df['rouge1'].mean()
overall_mean = (gt_mean_rouge1 + inc_mean_rouge1) / 2

if overall_mean < 0.3:
    print("✅ Hypothesis supported: mean ROUGE-1 averaged over both datasets is below 0.3")
    print(f"   - Ground Truth mean ROUGE-1: {gt_mean_rouge1:.4f}")
    print(f"   - Incidents mean ROUGE-1: {inc_mean_rouge1:.4f}")
    print("   - Incident descriptions and close notes use different vocabulary")
    print("   - N-grams are NOT suitable for evaluating close notes quality")
    print("   - Recommendation: Proceed with LLM-as-a-Judge evaluation")
elif overall_mean < 0.5:
    print("⚠️  Partial overlap detected in both datasets")
    print(f"   - Ground Truth mean ROUGE-1: {gt_mean_rouge1:.4f}")
    print(f"   - Incidents mean ROUGE-1: {inc_mean_rouge1:.4f}")
    print("   - Some vocabulary is shared, but significant differences remain")
    print("   - Recommendation: Use LLM-as-a-Judge for semantic evaluation")
else:
    print("⚠️  Higher overlap than expected")
    print(f"   - Ground Truth mean ROUGE-1: {gt_mean_rouge1:.4f}")
    print(f"   - Incidents mean ROUGE-1: {inc_mean_rouge1:.4f}")
    print("   - This might indicate incident descriptions contain resolution language")
    print("   - Recommendation: Investigate further, but still use LLM-as-a-Judge")

# Add comparison insight
if abs(gt_mean_rouge1 - inc_mean_rouge1) < 0.05:
    print("\n💡 Insight: Both datasets show similar patterns - validates comparison approach")
else:
    print(f"\n💡 Insight: Datasets differ by {abs(gt_mean_rouge1 - inc_mean_rouge1):.4f} - interesting finding!")

print(f"\nResults saved to: {output_path}")
print("\n🎉 Baseline analysis complete!")
print("   Next step: Implement LLM-as-a-Judge evaluation (Notebook 05)")
