# Notebook 01: Load and Explore Dataset

## 🎯 What is This Notebook About?

Welcome! This is your first step in learning how AI can help improve IT incident documentation. Think of this notebook like a **guided tour** - we're going to open up a dataset (a collection of IT support tickets) and take a look around to understand what we're working with.

**What we'll do:**
1. **Load the data** - Get our IT incident tickets from an online repository
2. **Take a look around** - See what information each ticket contains
3. **Understand the data** - Learn about categories, types, and patterns
4. **Check quality** - See which tickets have good documentation (close notes)
5. **Prepare for next steps** - Get everything ready for the rest of the workshop

**Why this matters:**
- Before we can use AI to improve close notes, we need to understand what data we have
- This is like checking your ingredients before cooking - you need to know what you're working with!
- The better we understand the data, the better we can evaluate AI-generated improvements

---

## 📚 Key Concepts Explained

### What is a Dataset?

A **dataset** is simply a collection of data organized in a structured way. Think of it like a spreadsheet with rows (each row is one incident) and columns (each column is a piece of information about that incident).

**Example:** Like a customer database with names, emails, and phone numbers - but here we have IT incidents with descriptions, categories, and resolution notes.
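
To make the spreadsheet analogy concrete, here is a minimal sketch of a tiny table built with pandas. The three rows are made up for illustration; only the column names (`number`, `category`, `content`) mirror fields used later in this notebook.

```python
import pandas as pd

# Hypothetical rows, invented for illustration (not taken from the real dataset)
tickets = pd.DataFrame(
    {
        "number": ["INC0001", "INC0002", "INC0003"],
        "category": ["SOFTWARE", "NETWORK", "EMAIL"],
        "content": [
            "Application crashes when opening a report.",
            "Cannot connect to the office Wi-Fi.",
            "Outgoing emails stay stuck in the outbox.",
        ],
    }
)

# shape is (number of rows, number of columns)
n_rows, n_cols = tickets.shape
print(f"{n_rows} rows (incidents) x {n_cols} columns (fields)")

# Each row is one incident; each column is one piece of information about it
first_ticket = tickets.iloc[0]
print(first_ticket["category"])
```

Reading `tickets.shape` is the same check the notebook runs on the real data right after loading it.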

### What is Hugging Face?

**Hugging Face** is like a library or app store for AI datasets and models. Instead of creating our own data from scratch, we're using a pre-made dataset of IT support tickets that someone has already prepared and shared.

**Think of it like:** Using a recipe book instead of inventing recipes - it saves time and gives us something proven to work with.

### What are "Close Notes"?

**Close notes** are the documentation that IT support agents write when they resolve an incident. They explain:
- What the problem was
- What they did to fix it
- How they confirmed it was resolved

**Why they matter:** Good close notes help other agents understand similar problems in the future. Bad close notes (like just saying "Issue resolved") don't help anyone.
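
The three parts listed above can be pictured as a small data structure. This is only a sketch: the `CloseNote` class and its field names are hypothetical and are not part of the dataset, where close notes are stored as free text.

```python
from dataclasses import dataclass


@dataclass
class CloseNote:
    """Hypothetical structure for a close note; not the dataset's schema."""

    problem: str        # what the problem was
    action_taken: str   # what the agent did to fix it
    verification: str   # how the fix was confirmed

    def render(self) -> str:
        """Join the three parts into a single note."""
        return (
            f"Problem: {self.problem} "
            f"Fix: {self.action_taken} "
            f"Verified: {self.verification}"
        )


# Invented example texts, for illustration only
weak_note = "Issue resolved"
strong_note = CloseNote(
    problem="User could not send email from Outlook.",
    action_taken="Rebuilt the mail profile and cleared the stuck outbox item.",
    verification="User sent a test email and confirmed delivery.",
).render()

print(len(weak_note.split()), "words vs", len(strong_note.split()), "words")
```

Length alone does not make a note good, but a note that covers all three parts is necessarily longer than "Issue resolved".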

### What is "Ground Truth"?

**Ground truth** means the "correct" or "reference" answer. In our case, these are the high-quality close notes that we'll use as examples of what "good" looks like.

**Think of it like:** When learning to write, you look at examples of good essays. Here, we'll use good close notes as examples to compare against.
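
One crude way to picture "comparing against a reference" is word overlap between a candidate note and a ground-truth note. The sketch below uses Jaccard overlap purely as an illustration; it is an assumption made here, and the later notebooks may evaluate notes quite differently. The example sentences are invented.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_overlap(candidate: str, reference: str) -> float:
    """Share of distinct words the two notes have in common (0.0 to 1.0)."""
    a, b = word_set(candidate), word_set(reference)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


reference = "Reset the user's password and confirmed the user could log in."
good_candidate = "Reset the password; user confirmed they could log in."
weak_candidate = "Issue resolved."

print(jaccard_overlap(good_candidate, reference))
print(jaccard_overlap(weak_candidate, reference))
```

Word overlap ignores meaning and word order, so treat it as a first intuition for "closeness to the reference" and not as a quality metric.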

---

## 📋 Dataset Overview

**Source:** [Hugging Face - KameronB/synthetic-it-callcenter-tickets](https://huggingface.co/datasets/KameronB/synthetic-it-callcenter-tickets)

This dataset contains **synthetic** (artificially created but realistic) IT support tickets. They simulate real-world incidents and requests, which makes them perfect for learning and experimenting without using real customer data.

**What's in each ticket?**
- Incident number and date
- Category (like SOFTWARE, NETWORK, EMAIL)
- Description of the problem
- Close notes (how it was resolved)
- Quality scores (how informative the close notes are)


In [None]:
# Import required libraries
# Think of these like tools in a toolbox - each one does a specific job

import pandas as pd  # For working with data tables (like Excel spreadsheets)
import numpy as np   # For doing math calculations
import matplotlib.pyplot as plt  # For creating charts and graphs
import seaborn as sns  # For making prettier charts
from pathlib import Path  # For handling file paths
import sys
import json

# Add src directory to path so we can use our helper functions
sys.path.append(str(Path("../src").resolve()))

# Import our custom helper functions
# These are functions we created to make loading data easier
from utils import load_incident_dataset, calculate_basic_stats

# Set up plotting style (makes our charts look nicer)
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    try:
        plt.style.use('seaborn')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

print("✅ Libraries imported successfully!")
print("📚 Ready to start working with data!")


## 1. Load Dataset

**What we're doing:** Loading our IT incident tickets from Hugging Face (an online repository).

**Why:** We need the data before we can analyze it. Think of this like opening a file on your computer.

**What to expect:** 
- The dataset might be large, so we'll load a sample (200 tickets) for faster experimentation
- You'll see a message showing how many records were loaded
- The data will be stored in a variable called `df` (short for "dataframe" - think of it as a spreadsheet)
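
The `load_incident_dataset` helper lives in `src/utils` and its code is not shown in this notebook. As a rough sketch of what such a loader might do, the example below separates the download from the sampling. The function names `fetch_tickets` and `take_sample` are hypothetical, and the download step assumes that the `datasets` package is installed and that the repository exposes a `train` split.

```python
import pandas as pd


def fetch_tickets(repo_id: str = "KameronB/synthetic-it-callcenter-tickets") -> pd.DataFrame:
    """Download the dataset from the Hugging Face Hub as a DataFrame.

    Assumes the `datasets` package is installed and that the repository
    exposes a "train" split; neither is confirmed by this notebook.
    """
    from datasets import load_dataset  # imported lazily: needs network access

    return load_dataset(repo_id, split="train").to_pandas()


def take_sample(df: pd.DataFrame, sample_size=None, random_state: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample, or the full table if sample_size is None."""
    if sample_size is None or sample_size >= len(df):
        return df.reset_index(drop=True)
    return df.sample(n=sample_size, random_state=random_state).reset_index(drop=True)


# Offline demonstration on a made-up table (no download needed)
demo = pd.DataFrame({"number": [f"INC{i:04d}" for i in range(1000)]})
first = take_sample(demo, sample_size=200, random_state=42)
second = take_sample(demo, sample_size=200, random_state=42)
print(len(first), first.equals(second))
```

Fixing `random_state` is what makes two runs pick the same 200 rows, which keeps results comparable between workshop participants.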


In [None]:
# Load dataset
# We're loading 200 incidents as a sample (you can change this number if you want)
# RANDOM_STATE = 42 ensures we get the same random sample each time (for consistency)

SAMPLE_SIZE = 200  # Number of incidents to load (use None to load all)
RANDOM_STATE = 42  # Random seed (keeps results consistent)

# This function downloads and loads the data from Hugging Face
df = load_incident_dataset(sample_size=SAMPLE_SIZE, random_state=RANDOM_STATE)

# Let's see what we got!
print(f"\n📊 Dataset loaded successfully!")
print(f"   Shape: {df.shape[0]} rows (incidents) × {df.shape[1]} columns (pieces of information)")
print(f"\n📋 Available columns (information fields):")
for i, col in enumerate(df.columns, 1):
    print(f"   {i}. {col}")


## 2. Basic Dataset Overview

**What we're doing:** Taking a first look at the data to see what it actually contains.

**Why:** Before diving deep, we want to see:
- What does a real incident look like?
- What types of information do we have?
- Are there any missing pieces of information?

**Think of it like:** Opening a book and reading the first few pages to get a sense of what it's about.


In [None]:
# Display first few rows
# This shows us what the actual data looks like - like looking at the first few rows of a spreadsheet

print("👀 First 5 incidents in the dataset:")
print("="*80)
df.head()


In [None]:
# Check data types and missing values
# This helps us understand:
# - What kind of information is in each column (text, numbers, dates)?
# - Are there any missing pieces of information?

print("📝 Data Types:")
print("   (This tells us what kind of information each column contains)")
print(df.dtypes)
print("\n" + "="*50)
print("\n❓ Missing Values:")
print("   (This shows how many incidents are missing information in each column)")
print("   (Only columns with at least one missing value are listed)")
missing = df.isnull().sum()
missing_only = missing[missing > 0]  # Keep only columns with missing values
if missing_only.empty:
    print("   ✅ No missing values!")
else:
    print(missing_only)
print("\n" + "="*50)
print("\n📊 Dataset Summary Info:")
df.info()


## 3. Dataset Statistics

**What we're doing:** Getting a summary of what's in our dataset - like a quick overview.

**Why:** This gives us the "big picture" before we dive into details. It's like reading the summary on the back of a book.

**What we'll learn:**
- How many incidents vs requests we have
- What categories are most common
- How long it typically takes to resolve issues
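
The `calculate_basic_stats` helper is defined in `src/utils` and is not shown here. Judging only from the keys this notebook reads from its result, a stand-in might look like the sketch below. The function name `basic_stats`, the assumption that the `type` column holds labels such as "Incident" and "Request", and the demo rows are all hypothetical.

```python
import pandas as pd


def basic_stats(df: pd.DataFrame) -> dict:
    """Summarize a ticket table (hypothetical stand-in for calculate_basic_stats)."""
    # Assumption: the 'type' column holds labels such as "Incident" / "Request"
    if "type" in df.columns:
        types = df["type"].astype(str).str.lower()
    else:
        types = pd.Series([], dtype="object")
    has_time = "resolution_time" in df.columns and df["resolution_time"].notna().any()
    return {
        "total_incidents": len(df),
        "incidents": int((types == "incident").sum()),
        "requests": int((types == "request").sum()),
        # None when no resolution times are available; mean() skips missing values
        "avg_resolution_time": float(df["resolution_time"].mean()) if has_time else None,
        "categories": df["category"].value_counts().to_dict() if "category" in df.columns else {},
    }


# Made-up rows for illustration
demo = pd.DataFrame(
    {
        "type": ["Incident", "Request", "Incident", "Incident"],
        "category": ["SOFTWARE", "SOFTWARE", "NETWORK", "EMAIL"],
        "resolution_time": [30.0, 90.0, None, 60.0],
    }
)
summary = basic_stats(demo)
print(summary)
```

Returning a plain dictionary keeps the printing code separate from the counting code, which is how the cell below uses the real helper.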


In [None]:
# Calculate basic statistics
# This function counts things up and gives us summary numbers

stats = calculate_basic_stats(df)

print("📊 Dataset Statistics:")
print("="*80)
print(f"📦 Total records: {stats['total_incidents']}")
print(f"🚨 Incidents: {stats['incidents']} (problems that need fixing)")
print(f"📋 Requests: {stats['requests']} (requests for something new)")
if stats['avg_resolution_time']:
    hours = stats['avg_resolution_time'] / 60
    print(f"⏱️  Average resolution time: {stats['avg_resolution_time']:.2f} minutes ({hours:.1f} hours)")
print(f"\n🏷️  Categories (types of problems):")
for category, count in stats['categories'].items():
    percentage = (count / stats['total_incidents']) * 100
    print(f"   • {category}: {count} incidents ({percentage:.1f}%)")
print("="*80)


## 4. Visualizing the Data

**What we're doing:** Creating charts and graphs to "see" patterns in our data.

**Why:** Sometimes it's easier to understand data by looking at pictures rather than numbers. These visualizations help us see:
- What types of problems are most common?
- How do people usually contact support?
- How long does it take to resolve issues?
- What does the quality of close notes look like?

**Think of it like:** Looking at a map instead of reading a list of addresses - the visual helps you understand patterns quickly.

**What you'll see:**
- Pie charts showing proportions (like "what percentage are software issues?")
- Bar charts showing counts (like "how many incidents per category?")
- Histograms showing distributions (like "how long do most incidents take to resolve?")
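
Behind a pie chart and a bar chart sit the same numbers in two forms: raw counts and shares of the total. A minimal sketch with made-up category labels shows how pandas produces both.

```python
import pandas as pd

# Made-up category labels for illustration
categories = pd.Series(["SOFTWARE", "SOFTWARE", "NETWORK", "EMAIL", "SOFTWARE"])

counts = categories.value_counts()                      # bar chart heights
shares = categories.value_counts(normalize=True) * 100  # pie chart percentages

for name in counts.index:
    print(f"{name}: {counts[name]} tickets ({shares[name]:.1f}%)")
```

`value_counts()` sorts from most to least frequent, which is why the most common category comes first in the charts.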


In [None]:
# Create comprehensive visualization dashboard
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.35, wspace=0.3)

# 1. Category Distribution (Pie Chart)
ax1 = fig.add_subplot(gs[0, 0])
if 'category' in df.columns:
    category_counts = df['category'].value_counts()
    colors = sns.color_palette("husl", len(category_counts))
    wedges, texts, autotexts = ax1.pie(
        category_counts.values, 
        labels=category_counts.index, 
        autopct='%1.1f%%',
        colors=colors,
        startangle=90
    )
    ax1.set_title('Incident Categories Distribution', fontsize=12, fontweight='bold')
    # Improve text readability
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontweight('bold')

# 2. Category Distribution (Bar Chart with counts)
ax2 = fig.add_subplot(gs[0, 1])
if 'category' in df.columns:
    category_counts = df['category'].value_counts()
    bars = ax2.bar(range(len(category_counts)), category_counts.values, color=colors)
    ax2.set_xticks(range(len(category_counts)))
    ax2.set_xticklabels(category_counts.index, rotation=45, ha='right')
    ax2.set_ylabel('Count', fontsize=10)
    ax2.set_title('Categories by Count', fontsize=12, fontweight='bold')
    ax2.grid(axis='y', alpha=0.3)
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontsize=9)

# 3. Top 10 Subcategories
ax3 = fig.add_subplot(gs[0, 2])
if 'subcategory' in df.columns:
    subcat_counts = df['subcategory'].value_counts().head(10)
    ax3.barh(range(len(subcat_counts)), subcat_counts.values, 
             color=sns.color_palette("viridis", len(subcat_counts)))
    ax3.set_yticks(range(len(subcat_counts)))
    ax3.set_yticklabels(subcat_counts.index)
    ax3.set_xlabel('Count', fontsize=10)
    ax3.set_title('Top 10 Subcategories', fontsize=12, fontweight='bold')
    ax3.grid(axis='x', alpha=0.3)
    # Add value labels
    for i, v in enumerate(subcat_counts.values):
        ax3.text(v + 0.5, i, str(v), va='center', fontsize=9)

# 4. Contact Type Distribution
ax4 = fig.add_subplot(gs[1, 0])
if 'contact_type' in df.columns:
    contact_counts = df['contact_type'].value_counts()
    bars = ax4.bar(contact_counts.index, contact_counts.values, 
                   color=sns.color_palette("muted", len(contact_counts)))
    ax4.set_ylabel('Count', fontsize=10)
    ax4.set_title('Contact Channel Distribution', fontsize=12, fontweight='bold')
    ax4.grid(axis='y', alpha=0.3)
    # Add value labels
    for bar in bars:
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontsize=10)

# 5. Type Distribution (Incident vs Request)
ax5 = fig.add_subplot(gs[1, 1])
if 'type' in df.columns:
    type_counts = df['type'].value_counts()
    colors_type = sns.color_palette("Set2", len(type_counts))
    wedges, texts, autotexts = ax5.pie(
        type_counts.values,
        labels=type_counts.index,
        autopct='%1.1f%%',
        colors=colors_type,
        startangle=90
    )
    ax5.set_title('Incident vs Request Distribution', fontsize=12, fontweight='bold')
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontweight('bold')

# 6. Resolution Time Distribution
ax6 = fig.add_subplot(gs[1, 2])
if 'resolution_time' in df.columns and df['resolution_time'].notna().any():
    resolution_times = df['resolution_time'].dropna()
    ax6.hist(resolution_times, bins=40, edgecolor='black', alpha=0.7, color='steelblue')
    ax6.set_xlabel('Resolution Time (minutes)', fontsize=10)
    ax6.set_ylabel('Frequency', fontsize=10)
    ax6.set_title('Resolution Time Distribution', fontsize=12, fontweight='bold')
    ax6.set_yscale('log')
    ax6.grid(axis='y', alpha=0.3)
    # Add statistics
    ax6.axvline(resolution_times.median(), color='red', linestyle='--', 
                label=f'Median: {resolution_times.median():.1f} min')
    ax6.axvline(resolution_times.mean(), color='orange', linestyle='--', 
                label=f'Mean: {resolution_times.mean():.1f} min')
    ax6.legend(fontsize=8)

# 7. Content Length Analysis
ax7 = fig.add_subplot(gs[2, 0])
if 'content' in df.columns:
    df['content_length'] = df['content'].astype(str).str.len()
    ax7.hist(df['content_length'], bins=30, edgecolor='black', alpha=0.7, color='teal')
    ax7.set_xlabel('Content Length (characters)', fontsize=10)
    ax7.set_ylabel('Frequency', fontsize=10)
    ax7.set_title('Incident Content Length Distribution', fontsize=12, fontweight='bold')
    ax7.grid(axis='y', alpha=0.3)
    ax7.axvline(df['content_length'].median(), color='red', linestyle='--', 
                label=f'Median: {df["content_length"].median():.0f} chars')
    ax7.legend(fontsize=8)

# 8. Ground Truth Quality (Info Score)
ax8 = fig.add_subplot(gs[2, 1])
if 'info_score_close_notes' in df.columns:
    info_scores = df['info_score_close_notes'].dropna()
    if len(info_scores) > 0:
        ax8.hist(info_scores, bins=20, edgecolor='black', alpha=0.7, color='purple')
        ax8.set_xlabel('Info Score', fontsize=10)
        ax8.set_ylabel('Frequency', fontsize=10)
        ax8.set_title('Ground Truth Quality Score\n(close_notes info_score)', fontsize=12, fontweight='bold')
        ax8.grid(axis='y', alpha=0.3)
        ax8.axvline(info_scores.mean(), color='red', linestyle='--', 
                    label=f'Mean: {info_scores.mean():.2f}')
        ax8.legend(fontsize=8)

# 9. Reassignment Analysis
ax9 = fig.add_subplot(gs[2, 2])
if 'reassigned_count' in df.columns:
    reassign_counts = df['reassigned_count'].value_counts().sort_index()
    bars = ax9.bar(reassign_counts.index, reassign_counts.values, 
                   color=sns.color_palette("rocket", len(reassign_counts)))
    ax9.set_xlabel('Number of Reassignments', fontsize=10)
    ax9.set_ylabel('Count', fontsize=10)
    ax9.set_title('Incident Reassignment Frequency', fontsize=12, fontweight='bold')
    ax9.grid(axis='y', alpha=0.3)
    # Add value labels
    for bar in bars:
        height = bar.get_height()
        if height > 0:
            ax9.text(bar.get_x() + bar.get_width()/2., height,
                    f'{int(height)}', ha='center', va='bottom', fontsize=9)

plt.suptitle('Dataset Overview Dashboard', fontsize=16, fontweight='bold', y=0.995)
plt.show()

# Print summary statistics
print("\n" + "="*80)
print("DATASET SUMMARY STATISTICS")
print("="*80)
if 'category' in df.columns:
    print(f"\n📊 Categories: {df['category'].nunique()} unique categories")
    print(f"   Most common: {df['category'].value_counts().index[0]} ({df['category'].value_counts().iloc[0]} incidents)")
if 'subcategory' in df.columns:
    print(f"\n📋 Subcategories: {df['subcategory'].nunique()} unique subcategories")
    print(f"   Most common: {df['subcategory'].value_counts().index[0]} ({df['subcategory'].value_counts().iloc[0]} incidents)")
if 'contact_type' in df.columns:
    print(f"\n📞 Contact Channels: {df['contact_type'].nunique()} channels")
    print(f"   Most used: {df['contact_type'].value_counts().index[0]} ({df['contact_type'].value_counts().iloc[0]} incidents)")
if 'resolution_time' in df.columns and df['resolution_time'].notna().any():
    rt = df['resolution_time'].dropna()
    print(f"\n⏱️  Resolution Time:")
    print(f"   Mean: {rt.mean():.1f} minutes ({rt.mean()/60:.1f} hours)")
    print(f"   Median: {rt.median():.1f} minutes ({rt.median()/60:.1f} hours)")
    print(f"   Range: {rt.min():.1f} - {rt.max():.1f} minutes")
if 'reassigned_count' in df.columns:
    print(f"\n🔄 Reassignments:")
    print(f"   Mean: {df['reassigned_count'].mean():.2f} reassignments per incident")
    print(f"   Max: {df['reassigned_count'].max()} reassignments")
    no_reassign = (df['reassigned_count'] == 0).sum()
    print(f"   {no_reassign} incidents ({no_reassign/len(df)*100:.1f}%) had no reassignments")
print("="*80)


## 5. Look at Real Examples

**What we're doing:** Looking at actual incidents from the dataset to see what real data looks like.

**Why:** Numbers and charts are great, but sometimes you need to see the actual text to understand what we're working with. This helps us understand:
- What does a real incident description look like?
- What do good close notes actually say?
- How detailed are the problem descriptions?

**Think of it like:** Reading a few example essays to understand what "good writing" looks like, rather than just seeing scores.
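
When printing free text from a table, two details are easy to get wrong: missing values are stored as `NaN` (a number, which cannot be sliced like a string), and a trailing "..." is misleading when nothing was cut off. The helper below is a hypothetical sketch of one way to handle both; `preview` is not a function from this workshop's code.

```python
import pandas as pd


def preview(value, limit: int = 500) -> str:
    """Return a printable preview: 'N/A' for missing values, '...' only if cut."""
    if value is None or (not isinstance(value, str) and pd.isna(value)):
        return "N/A"
    text = str(value)
    return text if len(text) <= limit else text[:limit] + "..."


print(preview("Printer offline", limit=500))  # short text is returned unchanged
print(preview(float("nan")))                  # missing value
print(preview("x" * 600, limit=10))           # long text is cut and marked
```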


In [None]:
# Display a sample incident in detail
# This picks one random incident and shows us all its details

sample_incident = df.sample(1).iloc[0]

print("="*80)
print("📋 EXAMPLE INCIDENT - Let's see what real data looks like!")
print("="*80)
print(f"\n🔢 Incident Number: {sample_incident.get('number', 'N/A')}")
print(f"📅 Date: {sample_incident.get('date', 'N/A')}")
print(f"📞 Contact Type: {sample_incident.get('contact_type', 'N/A')} (how the user contacted support)")
print(f"🏷️  Category: {sample_incident.get('category', 'N/A')} (type of problem)")
print(f"🏷️  Subcategory: {sample_incident.get('subcategory', 'N/A')} (more specific type)")
print(f"👤 Customer: {sample_incident.get('customer', 'N/A')}")
print(f"\n📝 Short Description:")
print(f"   {sample_incident.get('short_description', 'N/A')}")
print(f"\n📄 Full Content (the problem description):")
# str() guards against missing (NaN) values, which cannot be sliced
print(f"   {str(sample_incident.get('content', 'N/A'))[:500]}...")
if 'close_notes' in sample_incident and pd.notna(sample_incident.get('close_notes')):
    print(f"\n✅ Close Notes (how it was resolved - this is what we want to improve!):")
    print(f"   {str(sample_incident.get('close_notes', 'N/A'))[:500]}...")
else:
    print(f"\n⚠️  No close notes available for this incident")
print("="*80)


## 6. Content Quality Analysis

**What we're doing:** Analyzing the quality and characteristics of the text in our incidents.

**Why:** Before we can improve close notes with AI, we need to understand:
- How long are the problem descriptions? (This affects how much context the AI has)
- Do most incidents have close notes? (We need these as examples)
- How detailed are the close notes compared to the problem descriptions?
- What's the quality score of the close notes? (This tells us which ones are "good")

**Key Questions:**
- **Content length**: Are problem descriptions detailed enough for AI to understand?
- **Ground truth availability**: Do we have enough examples of good close notes?
- **Quality scores**: Which close notes are high-quality and can serve as references?

**Think of it like:** Checking the quality of your ingredients before cooking - you want to know what you're working with!
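
The key questions above boil down to a handful of numbers. The sketch below computes them on three made-up rows so that each step is easy to follow. The column names match the ones used in this notebook and the 0.8 cut-off is the one used in the cell below; the texts and scores are invented.

```python
import pandas as pd

# Made-up rows for illustration; scores are invented
demo = pd.DataFrame(
    {
        "content": ["VPN drops every hour", "Cannot print", "Laptop will not boot after update"],
        "close_notes": [
            "Updated the VPN client and confirmed a stable two-hour session.",
            None,
            "Issue resolved",
        ],
        "info_score_close_notes": [0.9, None, 0.2],
    }
)

# Content length: characters and words per problem description
demo["content_length"] = demo["content"].str.len()
demo["content_word_count"] = demo["content"].str.split().str.len()

# Ground truth availability: rows that actually have close notes
with_notes = demo[demo["close_notes"].notna()].copy()
with_notes["close_notes_length"] = with_notes["close_notes"].str.len()
availability = len(with_notes) / len(demo) * 100

# Ratio of average close-note length to average content length
length_ratio = with_notes["close_notes_length"].mean() / with_notes["content_length"].mean()

# Quality scores: how many notes reach the 0.8 threshold
high_quality = int((with_notes["info_score_close_notes"] >= 0.8).sum())

print(f"{availability:.1f}% of tickets have close notes")
print(f"Close notes are {length_ratio:.2f}x the length of the content on average")
print(f"{high_quality} note(s) at or above the 0.8 threshold")
```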


In [None]:
# Comprehensive content analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Content Quality Analysis', fontsize=16, fontweight='bold')

# Calculate content metrics
if 'content' in df.columns:
    df['content_length'] = df['content'].astype(str).str.len()
    df['content_word_count'] = df['content'].astype(str).str.split().str.len()
    
    # 1. Content Length Distribution
    axes[0, 0].hist(df['content_length'], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[0, 0].axvline(df['content_length'].median(), color='red', linestyle='--', 
                       label=f'Median: {df["content_length"].median():.0f} chars')
    axes[0, 0].set_xlabel('Content Length (characters)', fontsize=10)
    axes[0, 0].set_ylabel('Frequency', fontsize=10)
    axes[0, 0].set_title('Input Content Length Distribution', fontsize=12, fontweight='bold')
    axes[0, 0].grid(axis='y', alpha=0.3)
    axes[0, 0].legend(fontsize=9)
    
    # 2. Word Count Distribution
    axes[0, 1].hist(df['content_word_count'], bins=30, edgecolor='black', alpha=0.7, color='teal')
    axes[0, 1].axvline(df['content_word_count'].median(), color='red', linestyle='--', 
                       label=f'Median: {df["content_word_count"].median():.0f} words')
    axes[0, 1].set_xlabel('Word Count', fontsize=10)
    axes[0, 1].set_ylabel('Frequency', fontsize=10)
    axes[0, 1].set_title('Content Word Count Distribution', fontsize=12, fontweight='bold')
    axes[0, 1].grid(axis='y', alpha=0.3)
    axes[0, 1].legend(fontsize=9)
    
    # Check if close_notes exist (ground truth)
    if 'close_notes' in df.columns:
        has_close_notes = df['close_notes'].notna()
        df_with_gt = df[has_close_notes].copy()
        
        if len(df_with_gt) > 0:
            df_with_gt['close_notes_length'] = df_with_gt['close_notes'].astype(str).str.len()
            df_with_gt['close_notes_word_count'] = df_with_gt['close_notes'].astype(str).str.split().str.len()
            
            # 3. Content vs Close Notes Length Comparison
            axes[1, 0].scatter(df_with_gt['content_length'], df_with_gt['close_notes_length'], 
                              alpha=0.6, color='purple', s=50)
            axes[1, 0].plot([0, max(df_with_gt['content_length'].max(), df_with_gt['close_notes_length'].max())],
                            [0, max(df_with_gt['content_length'].max(), df_with_gt['close_notes_length'].max())],
                            'r--', alpha=0.5, label='y=x line')
            axes[1, 0].set_xlabel('Content Length (chars)', fontsize=10)
            axes[1, 0].set_ylabel('Close Notes Length (chars)', fontsize=10)
            axes[1, 0].set_title('Content vs Resolution Notes Length', fontsize=12, fontweight='bold')
            axes[1, 0].grid(alpha=0.3)
            axes[1, 0].legend(fontsize=9)
            
            # 4. Info Score Distribution
            if 'info_score_close_notes' in df_with_gt.columns:
                info_scores = df_with_gt['info_score_close_notes'].dropna()
                if len(info_scores) > 0:
                    axes[1, 1].hist(info_scores, bins=20, edgecolor='black', alpha=0.7, color='orange')
                    axes[1, 1].axvline(info_scores.mean(), color='red', linestyle='--', 
                                      label=f'Mean: {info_scores.mean():.2f}')
                    axes[1, 1].axvline(info_scores.median(), color='blue', linestyle='--', 
                                      label=f'Median: {info_scores.median():.2f}')
                    axes[1, 1].set_xlabel('Info Score', fontsize=10)
                    axes[1, 1].set_ylabel('Frequency', fontsize=10)
                    axes[1, 1].set_title('Ground Truth Quality Score Distribution', fontsize=12, fontweight='bold')
                    axes[1, 1].grid(axis='y', alpha=0.3)
                    axes[1, 1].legend(fontsize=9)

plt.tight_layout()
plt.show()

# Print detailed statistics
print("\n" + "="*80)
print("CONTENT QUALITY STATISTICS")
print("="*80)
if 'content' in df.columns:
    print(f"\n📝 Input Content (for LLM enrichment):")
    print(f"   Average length: {df['content_length'].mean():.0f} characters")
    print(f"   Median length: {df['content_length'].median():.0f} characters")
    print(f"   Average word count: {df['content_word_count'].mean():.0f} words")
    print(f"   Range: {df['content_length'].min()} - {df['content_length'].max()} characters")
    
    if 'close_notes' in df.columns:
        has_close_notes = df['close_notes'].notna().sum()
        print(f"\n✅ Ground Truth (close_notes) Availability:")
        print(f"   Incidents with close_notes: {has_close_notes} ({has_close_notes/len(df)*100:.1f}%)")
        
        if has_close_notes > 0:
            df_with_gt = df[df['close_notes'].notna()].copy()
            df_with_gt['close_notes_length'] = df_with_gt['close_notes'].astype(str).str.len()
            df_with_gt['close_notes_word_count'] = df_with_gt['close_notes'].astype(str).str.split().str.len()
            
            print(f"\n📋 Resolution Notes (close_notes) Statistics:")
            print(f"   Average length: {df_with_gt['close_notes_length'].mean():.0f} characters")
            print(f"   Median length: {df_with_gt['close_notes_length'].median():.0f} characters")
            print(f"   Average word count: {df_with_gt['close_notes_word_count'].mean():.0f} words")
            print(f"   Range: {df_with_gt['close_notes_length'].min()} - {df_with_gt['close_notes_length'].max()} characters")
            
            # Expansion ratio
            expansion_ratio = df_with_gt['close_notes_length'].mean() / df_with_gt['content_length'].mean()
            print(f"\n📈 Content Expansion:")
            print(f"   Resolution notes are {expansion_ratio:.2f}x the length of the input content on average")
            
            if 'info_score_close_notes' in df_with_gt.columns:
                info_scores = df_with_gt['info_score_close_notes'].dropna()
                if len(info_scores) > 0:
                    print(f"\n⭐ Information Quality Score:")
                    print(f"   Mean: {info_scores.mean():.2f}")
                    print(f"   Median: {info_scores.median():.2f}")
                    print(f"   Range: {info_scores.min():.2f} - {info_scores.max():.2f}")
                    high_quality = (info_scores >= 0.8).sum()
                    print(f"   High quality (≥0.8): {high_quality} ({high_quality/len(info_scores)*100:.1f}%)")
print("="*80)


## 7. Prepare Data for Next Steps

**What we're doing:** Filtering and preparing the data for the rest of the workshop.

**Why:** 
- We want to focus on incidents that have close notes (so we can compare AI-generated ones to real ones)
- We need clean, ready-to-use data for the next notebooks
- This is like organizing your workspace before starting a project

**What we'll do:**
- Keep only incidents that have close notes (our "ground truth" examples)
- These will be used in later notebooks to evaluate AI-generated close notes
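
The filter in the next cell keeps every row whose `close_notes` value is not missing. A slightly stricter variant, sketched below as an assumption and not as this notebook's method, also drops notes that contain only whitespace. The function name `keep_rows_with_notes` and the demo rows are hypothetical.

```python
import pandas as pd


def keep_rows_with_notes(df: pd.DataFrame, column: str = "close_notes") -> pd.DataFrame:
    """Keep rows whose notes are present and not just whitespace."""
    notes = df[column]
    has_text = notes.notna() & (notes.astype(str).str.strip() != "")
    return df[has_text].copy()


# Made-up rows: one real note, one missing note, one blank note
demo = pd.DataFrame(
    {
        "number": ["INC0001", "INC0002", "INC0003"],
        "close_notes": ["Restarted the print spooler; test page printed.", None, "   "],
    }
)
kept = keep_rows_with_notes(demo)
print(f"With close notes: {len(kept)}, without: {len(demo) - len(kept)}")
```

Returning a `.copy()` means later changes to the filtered table do not touch the original one.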


In [None]:
# Filter incidents that have close_notes (ground truth) for evaluation
# We only want incidents that have close notes because we'll use those as examples
# to compare against AI-generated close notes

if 'close_notes' in df.columns:
    df_with_ground_truth = df[df['close_notes'].notna()].copy()
    print("📊 Filtering incidents:")
    print("="*80)
    print(f"✅ Incidents WITH close notes (ground truth): {len(df_with_ground_truth)}")
    print(f"❌ Incidents WITHOUT close notes: {len(df) - len(df_with_ground_truth)}")
    
    # For experiments, we'll use incidents with ground truth
    df_experiments = df_with_ground_truth.copy()
else:
    print("⚠️  No close_notes column found - will use all incidents")
    df_experiments = df.copy()

print(f"\n🎯 Total incidents prepared for experiments: {len(df_experiments)}")
print("="*80)


## 8. Save Prepared Dataset

**What we're doing:** Saving our prepared data to a file so we can use it in the next notebooks.

**Why:** The next notebooks need this data, and it's easier to load from a file than to reload everything each time.

**Think of it like:** Saving your work so you can come back to it later.
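
A quick way to trust a saved file is to read it straight back and compare. The sketch below does this round trip in a temporary folder, using made-up rows, so it does not touch the workshop's `data` directory. The names `demo_dir` and `demo_path` are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame(
    {
        "number": ["INC0001", "INC0002"],
        "close_notes": ["Reset password, user logged in.", "Replaced cable; link is up."],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "data"
    demo_dir.mkdir(parents=True, exist_ok=True)  # create missing folders too

    demo_path = demo_dir / "incidents_prepared.csv"
    demo.to_csv(demo_path, index=False)  # index=False avoids an extra unnamed column

    reloaded = pd.read_csv(demo_path)  # read back before the folder is deleted

print(reloaded.equals(demo))
```

Text containing commas, like the first note here, is quoted automatically by `to_csv`, so it comes back as a single field.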


In [None]:
# Create data directory if it doesn't exist
# This is where we'll save our prepared data files

data_dir = Path("../data")
data_dir.mkdir(exist_ok=True)

# Save the prepared dataset (all incidents with close notes)
output_path = data_dir / "incidents_prepared.csv"
df_experiments.to_csv(output_path, index=False)
print("💾 Saving datasets:")
print("="*80)
print(f"✅ Saved FULL prepared dataset to: {output_path}")
print(f"   Total records: {len(df_experiments)} incidents")

# Also save a small sample for quick testing (useful for faster experiments)
df_sample = df_experiments.sample(min(10, len(df_experiments)), random_state=42)
sample_path = data_dir / "incidents_sample.csv"
df_sample.to_csv(sample_path, index=False)
print(f"\n✅ Saved SAMPLE dataset to: {sample_path}")
print(f"   Sample records: {len(df_sample)} incidents (for quick testing)")
print("="*80)


## 9. Summary - What We Accomplished

**Congratulations!** 🎉 You've completed your first notebook. Here's what we did:

✅ **Loaded the dataset** - Got 200 IT incident tickets from Hugging Face  
✅ **Explored the data** - Saw what information each incident contains  
✅ **Analyzed patterns** - Learned about categories, types, and resolution times  
✅ **Checked quality** - Identified which incidents have good close notes  
✅ **Prepared the data** - Filtered and saved it for the next notebooks  

**What you learned:**
- How to load and explore a dataset
- What information IT incidents contain
- How to visualize data patterns
- What makes a good close note (detailed, informative)

---

## 🚀 Next Steps

**Ready for Notebook 02!** 

In the next notebook, we'll:
- Define what makes a "good" close note
- Separate high-quality close notes from regular ones
- Create reference examples that we'll use to evaluate AI-generated close notes

**Files ready for next notebook:**
- `data/incidents_prepared.csv` - All incidents with close notes
- `data/incidents_sample.csv` - Small sample for quick testing


In [None]:
# Display final summary
print("\n" + "="*80)
print("🎯 NOTEBOOK 01 COMPLETE - FINAL SUMMARY")
print("="*80)
print(f"\n📊 Dataset loaded: {len(df)} total records")
print(f"📝 Prepared for experiments: {len(df_experiments)} records (with close notes)")
print(f"💾 Saved to: {output_path}")
print(f"\n✨ What's next?")
print(f"   → Move to Notebook 02: Create Ground Truth")
print(f"   → We'll identify high-quality close notes to use as reference examples")
print("\n✅ Ready for the next notebook!")
print("="*80)
