# Notebook 01: Load and Explore Dataset

## 🎯 Objectives

This notebook demonstrates how to:
1. **Load** the synthetic IT call center tickets dataset from Hugging Face
2. **Explore** dataset structure, data quality, and completeness
3. **Analyze** incident characteristics (categories, types, contact channels)
4. **Examine** content quality and ground truth availability
5. **Visualize** key metrics and distributions
6. **Prepare** data for enrichment experiments

---

## 📋 Dataset Overview

**Source:** [Hugging Face - KameronB/synthetic-it-callcenter-tickets](https://huggingface.co/datasets/KameronB/synthetic-it-callcenter-tickets)

This dataset contains synthetic IT support tickets simulating real-world incidents and requests, ideal for demonstrating LLM-based incident enrichment tasks.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys
import json

# Add src directory to path
sys.path.append(str(Path("../src").resolve()))

from utils import load_incident_dataset, calculate_basic_stats, prepare_incident_for_enrichment

# Set up plotting style
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    try:
        plt.style.use('seaborn')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

print("Libraries imported successfully!")


## 1. Load Dataset

We'll load the synthetic IT call center tickets dataset from Hugging Face. For faster experimentation, we can sample a subset of the data.
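
The loader itself lives in `src/utils.py` and is not shown in this notebook. As a rough illustration of how the `sample_size` and `random_state` arguments could behave, here is a minimal sketch on a toy DataFrame. The helper name `sample_frame` and the toy columns are assumptions made for this example, not the project's actual implementation.

```python
import pandas as pd


def sample_frame(df, sample_size=None, random_state=42):
    """Hypothetical helper: return a reproducible random subset of df.

    sample_size=None (or a value >= len(df)) returns the full frame.
    """
    if sample_size is None or sample_size >= len(df):
        return df.reset_index(drop=True)
    return df.sample(n=sample_size, random_state=random_state).reset_index(drop=True)


# Toy data standing in for the real tickets (column values are made up)
toy = pd.DataFrame({"number": range(1000), "category": ["Network", "Software"] * 500})

subset_a = sample_frame(toy, sample_size=200, random_state=42)
subset_b = sample_frame(toy, sample_size=200, random_state=42)

print(subset_a.shape)             # 200 rows, 2 columns
print(subset_a.equals(subset_b))  # same seed, same rows
```

Fixing the seed is what makes later notebooks comparable: every run works on the same subset of tickets.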


In [None]:
# Load dataset - adjust sample_size as needed
# Set sample_size=None to load full dataset, or specify a number (e.g., 200)
SAMPLE_SIZE = 200  # Use smaller sample for faster experiments
RANDOM_STATE = 42

df = load_incident_dataset(sample_size=SAMPLE_SIZE, random_state=RANDOM_STATE)

print(f"\nDataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")


## 2. Basic Dataset Overview

Let's examine the first few rows and check data types.
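
A compact alternative to reading `dtypes` and `isnull().sum()` separately is a single per-column table. The sketch below uses a hypothetical helper, `missing_summary`, on made-up data; it is one possible way to present the same information, not part of the project's utilities.

```python
import pandas as pd


def missing_summary(df):
    """Per-column dtype, missing count, and missing percentage."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(1),
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)


# Toy data: two of the four tickets have no close_notes
toy = pd.DataFrame({
    "number": ["INC001", "INC002", "INC003", "INC004"],
    "close_notes": ["Reset password", None, None, "Replaced cable"],
})
print(missing_summary(toy))
```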


In [None]:
# Display first few rows
df.head()


In [None]:
# Check data types and missing values
print("Data Types:")
print(df.dtypes)
print("\n" + "="*50)
print("\nMissing Values:")
print(df.isnull().sum())
print("\n" + "="*50)
print("\nDataset Info:")
df.info()


## 3. Dataset Statistics

Calculate and display basic statistics about the dataset.
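
`calculate_basic_stats` is defined in `src/utils.py` and is not reproduced here. Judging only by the keys the next cell reads (`total_incidents`, `incidents`, `requests`, `avg_resolution_time`, `categories`), a function of this kind might look roughly like the sketch below. The name `basic_stats_sketch` and the `type` values `"Incident"` and `"Request"` are assumptions for illustration.

```python
import pandas as pd


def basic_stats_sketch(df):
    """Hypothetical stand-in for calculate_basic_stats (assumed behavior)."""
    if "type" in df.columns:
        type_col = df["type"].astype(str).str.lower()
    else:
        type_col = pd.Series(dtype=str)

    # Leave the average as None when no resolution times are available
    avg_time = None
    if "resolution_time" in df.columns and df["resolution_time"].notna().any():
        avg_time = float(df["resolution_time"].mean())

    categories = {}
    if "category" in df.columns:
        categories = df["category"].value_counts().to_dict()

    return {
        "total_incidents": len(df),
        "incidents": int((type_col == "incident").sum()),
        "requests": int((type_col == "request").sum()),
        "avg_resolution_time": avg_time,
        "categories": categories,
    }


# Toy data (values are made up)
toy = pd.DataFrame({
    "type": ["Incident", "Request", "Incident"],
    "category": ["Network", "Software", "Network"],
    "resolution_time": [30.0, None, 90.0],
})
stats_sketch = basic_stats_sketch(toy)
print(stats_sketch)
```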


In [None]:
# Calculate basic statistics
stats = calculate_basic_stats(df)

print("📊 Dataset Statistics:")
print(f"Total incidents: {stats['total_incidents']}")
print(f"Incidents: {stats['incidents']}")
print(f"Requests: {stats['requests']}")
if stats['avg_resolution_time']:
    print(f"Average resolution time: {stats['avg_resolution_time']:.2f} minutes")
print(f"\nCategories distribution:")
for category, count in stats['categories'].items():
    print(f"  - {category}: {count}")


## 4. Incident Characteristics Analysis

Create comprehensive visualizations to understand the dataset distribution across multiple dimensions:
- **Categories & Subcategories**: Distribution of incident types
- **Contact Channels**: How users report incidents
- **Incident Types**: Incident vs Request breakdown
- **Resolution Metrics**: Time to resolution and reassignments
- **Content Quality**: Text length and ground truth quality scores
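
Before plotting, it can help to see the same distributions as numbers. The sketch below builds a small count-and-share table for one categorical column. The helper `distribution_table` and the channel names are assumptions for illustration; the real values come from the dataset.

```python
import pandas as pd


def distribution_table(series, top=10):
    """Counts and percentage share for the most frequent values of a column."""
    counts = series.value_counts().head(top)  # missing values are excluded
    return pd.DataFrame({
        "count": counts,
        "percent": (counts / series.notna().sum() * 100).round(1),
    })


# Toy contact channels, including one missing value
toy_channels = pd.Series(["Phone", "Email", "Phone", "Chat", "Phone", None])
channel_table = distribution_table(toy_channels)
print(channel_table)
```

The percentages are computed over non-missing values only, which matches what a pie chart of `value_counts()` would show.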


In [None]:
# Create comprehensive visualization dashboard
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.35, wspace=0.3)

# 1. Category Distribution (Pie Chart)
ax1 = fig.add_subplot(gs[0, 0])
if 'category' in df.columns:
    category_counts = df['category'].value_counts()
    colors = sns.color_palette("husl", len(category_counts))
    wedges, texts, autotexts = ax1.pie(
        category_counts.values, 
        labels=category_counts.index, 
        autopct='%1.1f%%',
        colors=colors,
        startangle=90
    )
    ax1.set_title('Incident Categories Distribution', fontsize=12, fontweight='bold')
    # Improve text readability
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontweight('bold')

# 2. Category Distribution (Bar Chart with counts)
ax2 = fig.add_subplot(gs[0, 1])
if 'category' in df.columns:
    category_counts = df['category'].value_counts()
    bars = ax2.bar(range(len(category_counts)), category_counts.values, color=colors)
    ax2.set_xticks(range(len(category_counts)))
    ax2.set_xticklabels(category_counts.index, rotation=45, ha='right')
    ax2.set_ylabel('Count', fontsize=10)
    ax2.set_title('Categories by Count', fontsize=12, fontweight='bold')
    ax2.grid(axis='y', alpha=0.3)
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontsize=9)

# 3. Top 10 Subcategories
ax3 = fig.add_subplot(gs[0, 2])
if 'subcategory' in df.columns:
    subcat_counts = df['subcategory'].value_counts().head(10)
    ax3.barh(range(len(subcat_counts)), subcat_counts.values, 
             color=sns.color_palette("viridis", len(subcat_counts)))
    ax3.set_yticks(range(len(subcat_counts)))
    ax3.set_yticklabels(subcat_counts.index)
    ax3.set_xlabel('Count', fontsize=10)
    ax3.set_title('Top 10 Subcategories', fontsize=12, fontweight='bold')
    ax3.grid(axis='x', alpha=0.3)
    # Add value labels
    for i, v in enumerate(subcat_counts.values):
        ax3.text(v + 0.5, i, str(v), va='center', fontsize=9)

# 4. Contact Type Distribution
ax4 = fig.add_subplot(gs[1, 0])
if 'contact_type' in df.columns:
    contact_counts = df['contact_type'].value_counts()
    bars = ax4.bar(contact_counts.index, contact_counts.values, 
                   color=sns.color_palette("muted", len(contact_counts)))
    ax4.set_ylabel('Count', fontsize=10)
    ax4.set_title('Contact Channel Distribution', fontsize=12, fontweight='bold')
    ax4.grid(axis='y', alpha=0.3)
    # Add value labels
    for bar in bars:
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontsize=10)

# 5. Type Distribution (Incident vs Request)
ax5 = fig.add_subplot(gs[1, 1])
if 'type' in df.columns:
    type_counts = df['type'].value_counts()
    colors_type = sns.color_palette("Set2", len(type_counts))
    wedges, texts, autotexts = ax5.pie(
        type_counts.values,
        labels=type_counts.index,
        autopct='%1.1f%%',
        colors=colors_type,
        startangle=90
    )
    ax5.set_title('Incident vs Request Distribution', fontsize=12, fontweight='bold')
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontweight('bold')

# 6. Resolution Time Distribution
ax6 = fig.add_subplot(gs[1, 2])
if 'resolution_time' in df.columns and df['resolution_time'].notna().any():
    resolution_times = df['resolution_time'].dropna()
    ax6.hist(resolution_times, bins=40, edgecolor='black', alpha=0.7, color='steelblue')
    ax6.set_xlabel('Resolution Time (minutes)', fontsize=10)
    ax6.set_ylabel('Frequency', fontsize=10)
    ax6.set_title('Resolution Time Distribution', fontsize=12, fontweight='bold')
    ax6.set_yscale('log')
    ax6.grid(axis='y', alpha=0.3)
    # Add statistics
    ax6.axvline(resolution_times.median(), color='red', linestyle='--', 
                label=f'Median: {resolution_times.median():.1f} min')
    ax6.axvline(resolution_times.mean(), color='orange', linestyle='--', 
                label=f'Mean: {resolution_times.mean():.1f} min')
    ax6.legend(fontsize=8)

# 7. Content Length Analysis
ax7 = fig.add_subplot(gs[2, 0])
if 'content' in df.columns:
    # Treat missing content as empty text so it is not counted as a placeholder string
    df['content_length'] = df['content'].fillna('').astype(str).str.len()
    ax7.hist(df['content_length'], bins=30, edgecolor='black', alpha=0.7, color='teal')
    ax7.set_xlabel('Content Length (characters)', fontsize=10)
    ax7.set_ylabel('Frequency', fontsize=10)
    ax7.set_title('Incident Content Length Distribution', fontsize=12, fontweight='bold')
    ax7.grid(axis='y', alpha=0.3)
    ax7.axvline(df['content_length'].median(), color='red', linestyle='--', 
                label=f'Median: {df["content_length"].median():.0f} chars')
    ax7.legend(fontsize=8)

# 8. Ground Truth Quality (Info Score)
ax8 = fig.add_subplot(gs[2, 1])
if 'info_score_close_notes' in df.columns:
    info_scores = df['info_score_close_notes'].dropna()
    if len(info_scores) > 0:
        ax8.hist(info_scores, bins=20, edgecolor='black', alpha=0.7, color='purple')
        ax8.set_xlabel('Info Score', fontsize=10)
        ax8.set_ylabel('Frequency', fontsize=10)
        ax8.set_title('Ground Truth Quality Score\n(close_notes info_score)', fontsize=12, fontweight='bold')
        ax8.grid(axis='y', alpha=0.3)
        ax8.axvline(info_scores.mean(), color='red', linestyle='--', 
                    label=f'Mean: {info_scores.mean():.2f}')
        ax8.legend(fontsize=8)

# 9. Reassignment Analysis
ax9 = fig.add_subplot(gs[2, 2])
if 'reassigned_count' in df.columns:
    reassign_counts = df['reassigned_count'].value_counts().sort_index()
    bars = ax9.bar(reassign_counts.index, reassign_counts.values, 
                   color=sns.color_palette("rocket", len(reassign_counts)))
    ax9.set_xlabel('Number of Reassignments', fontsize=10)
    ax9.set_ylabel('Count', fontsize=10)
    ax9.set_title('Incident Reassignment Frequency', fontsize=12, fontweight='bold')
    ax9.grid(axis='y', alpha=0.3)
    # Add value labels
    for bar in bars:
        height = bar.get_height()
        if height > 0:
            ax9.text(bar.get_x() + bar.get_width()/2., height,
                    f'{int(height)}', ha='center', va='bottom', fontsize=9)

plt.suptitle('Dataset Overview Dashboard', fontsize=16, fontweight='bold', y=0.995)
plt.show()

# Print summary statistics
print("\n" + "="*80)
print("DATASET SUMMARY STATISTICS")
print("="*80)
if 'category' in df.columns:
    print(f"\n📊 Categories: {df['category'].nunique()} unique categories")
    print(f"   Most common: {df['category'].value_counts().index[0]} ({df['category'].value_counts().iloc[0]} incidents)")
if 'subcategory' in df.columns:
    print(f"\n📋 Subcategories: {df['subcategory'].nunique()} unique subcategories")
    print(f"   Most common: {df['subcategory'].value_counts().index[0]} ({df['subcategory'].value_counts().iloc[0]} incidents)")
if 'contact_type' in df.columns:
    print(f"\n📞 Contact Channels: {df['contact_type'].nunique()} channels")
    print(f"   Most used: {df['contact_type'].value_counts().index[0]} ({df['contact_type'].value_counts().iloc[0]} incidents)")
if 'resolution_time' in df.columns and df['resolution_time'].notna().any():
    rt = df['resolution_time'].dropna()
    print(f"\n⏱️  Resolution Time:")
    print(f"   Mean: {rt.mean():.1f} minutes ({rt.mean()/60:.1f} hours)")
    print(f"   Median: {rt.median():.1f} minutes ({rt.median()/60:.1f} hours)")
    print(f"   Range: {rt.min():.1f} - {rt.max():.1f} minutes")
if 'reassigned_count' in df.columns:
    print(f"\n🔄 Reassignments:")
    print(f"   Mean: {df['reassigned_count'].mean():.2f} reassignments per incident")
    print(f"   Max: {df['reassigned_count'].max()} reassignments")
    no_reassign = (df['reassigned_count'] == 0).sum()
    print(f"   {no_reassign} incidents ({no_reassign/len(df)*100:.1f}%) had no reassignments")
print("="*80)


## 5. Examine Sample Incidents

Let's look at a few sample incidents to understand the structure and content quality.
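
Text fields in a sampled row may be missing, and slicing a missing value fails. A small guard like the one sketched below avoids that and only appends an ellipsis when text was actually cut. The helper name `preview` is an assumption for this example.

```python
def preview(text, limit=500):
    """Return a printable preview of a possibly missing text field."""
    # None or NaN (NaN is the only value that is not equal to itself)
    if text is None or text != text:
        return "N/A"
    text = str(text)
    return text if len(text) <= limit else text[:limit] + "..."


print(preview("Laptop will not boot", limit=10))  # truncated, with ellipsis
print(preview(float("nan")))                      # missing value
print(preview("short"))                           # returned unchanged
```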


In [None]:
# Display a sample incident in detail
sample_incident = df.sample(1, random_state=RANDOM_STATE).iloc[0]  # fixed seed for a reproducible sample

print("="*80)
print("SAMPLE INCIDENT")
print("="*80)
print(f"\n📋 Number: {sample_incident.get('number', 'N/A')}")
print(f"📅 Date: {sample_incident.get('date', 'N/A')}")
print(f"📞 Contact Type: {sample_incident.get('contact_type', 'N/A')}")
print(f"🏷️  Category: {sample_incident.get('category', 'N/A')}")
print(f"🏷️  Subcategory: {sample_incident.get('subcategory', 'N/A')}")
print(f"👤 Customer: {sample_incident.get('customer', 'N/A')}")
print(f"\n📝 Short Description:")
print(f"   {sample_incident.get('short_description', 'N/A')}")
print(f"\n📄 Content:")
# str() guards against non-string values (e.g. NaN) before slicing
print(f"   {str(sample_incident.get('content', 'N/A'))[:500]}...")
if 'close_notes' in sample_incident and pd.notna(sample_incident.get('close_notes')):
    print(f"\n✅ Close Notes (Ground Truth):")
    print(f"   {str(sample_incident.get('close_notes', 'N/A'))[:500]}...")
print("="*80)


## 6. Content Quality Analysis

Analyze the text content characteristics to understand:
- **Content length**: Input text size for LLM processing
- **Ground truth availability**: Quality and completeness of close_notes
- **Content vs Resolution**: Compare input content with resolution notes length
- **Information quality scores**: Assess the informational value of ground truth data
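
One detail matters when measuring text length: depending on the pandas version, `astype(str)` can turn a missing value into a literal placeholder string, which is then counted as a few characters and one word. The sketch below fills missing text with an empty string first. The helper `text_metrics` and the sample sentences are assumptions for illustration.

```python
import pandas as pd


def text_metrics(series):
    """Character and word counts, treating missing text as empty."""
    text = series.fillna("").astype(str)
    return pd.DataFrame({
        "length": text.str.len(),
        "word_count": text.str.split().str.len(),
    })


# Toy content column with one missing entry
toy_content = pd.Series(["VPN drops every hour", None, "Printer offline"])
metrics = text_metrics(toy_content)
print(metrics)
```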


In [None]:
# Comprehensive content analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Content Quality Analysis', fontsize=16, fontweight='bold')

# Calculate content metrics
if 'content' in df.columns:
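    # Treat missing content as empty text so it is not counted as a placeholder string
    content_text = df['content'].fillna('').astype(str)
    df['content_length'] = content_text.str.len()
    df['content_word_count'] = content_text.str.split().str.len()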
    
    # 1. Content Length Distribution
    axes[0, 0].hist(df['content_length'], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[0, 0].axvline(df['content_length'].median(), color='red', linestyle='--', 
                       label=f'Median: {df["content_length"].median():.0f} chars')
    axes[0, 0].set_xlabel('Content Length (characters)', fontsize=10)
    axes[0, 0].set_ylabel('Frequency', fontsize=10)
    axes[0, 0].set_title('Input Content Length Distribution', fontsize=12, fontweight='bold')
    axes[0, 0].grid(axis='y', alpha=0.3)
    axes[0, 0].legend(fontsize=9)
    
    # 2. Word Count Distribution
    axes[0, 1].hist(df['content_word_count'], bins=30, edgecolor='black', alpha=0.7, color='teal')
    axes[0, 1].axvline(df['content_word_count'].median(), color='red', linestyle='--', 
                       label=f'Median: {df["content_word_count"].median():.0f} words')
    axes[0, 1].set_xlabel('Word Count', fontsize=10)
    axes[0, 1].set_ylabel('Frequency', fontsize=10)
    axes[0, 1].set_title('Content Word Count Distribution', fontsize=12, fontweight='bold')
    axes[0, 1].grid(axis='y', alpha=0.3)
    axes[0, 1].legend(fontsize=9)
    
    # Check if close_notes exist (ground truth)
    if 'close_notes' in df.columns:
        has_close_notes = df['close_notes'].notna()
        df_with_gt = df[has_close_notes].copy()
        
        if len(df_with_gt) > 0:
            df_with_gt['close_notes_length'] = df_with_gt['close_notes'].astype(str).str.len()
            df_with_gt['close_notes_word_count'] = df_with_gt['close_notes'].astype(str).str.split().str.len()
            
            # 3. Content vs Close Notes Length Comparison
            axes[1, 0].scatter(df_with_gt['content_length'], df_with_gt['close_notes_length'], 
                              alpha=0.6, color='purple', s=50)
            axes[1, 0].plot([0, max(df_with_gt['content_length'].max(), df_with_gt['close_notes_length'].max())],
                            [0, max(df_with_gt['content_length'].max(), df_with_gt['close_notes_length'].max())],
                            'r--', alpha=0.5, label='y=x line')
            axes[1, 0].set_xlabel('Content Length (chars)', fontsize=10)
            axes[1, 0].set_ylabel('Close Notes Length (chars)', fontsize=10)
            axes[1, 0].set_title('Content vs Resolution Notes Length', fontsize=12, fontweight='bold')
            axes[1, 0].grid(alpha=0.3)
            axes[1, 0].legend(fontsize=9)
            
            # 4. Info Score Distribution
            if 'info_score_close_notes' in df_with_gt.columns:
                info_scores = df_with_gt['info_score_close_notes'].dropna()
                if len(info_scores) > 0:
                    axes[1, 1].hist(info_scores, bins=20, edgecolor='black', alpha=0.7, color='orange')
                    axes[1, 1].axvline(info_scores.mean(), color='red', linestyle='--', 
                                      label=f'Mean: {info_scores.mean():.2f}')
                    axes[1, 1].axvline(info_scores.median(), color='blue', linestyle='--', 
                                      label=f'Median: {info_scores.median():.2f}')
                    axes[1, 1].set_xlabel('Info Score', fontsize=10)
                    axes[1, 1].set_ylabel('Frequency', fontsize=10)
                    axes[1, 1].set_title('Ground Truth Quality Score Distribution', fontsize=12, fontweight='bold')
                    axes[1, 1].grid(axis='y', alpha=0.3)
                    axes[1, 1].legend(fontsize=9)

plt.tight_layout()
plt.show()

# Print detailed statistics
print("\n" + "="*80)
print("CONTENT QUALITY STATISTICS")
print("="*80)
if 'content' in df.columns:
    print(f"\n📝 Input Content (for LLM enrichment):")
    print(f"   Average length: {df['content_length'].mean():.0f} characters")
    print(f"   Median length: {df['content_length'].median():.0f} characters")
    print(f"   Average word count: {df['content_word_count'].mean():.0f} words")
    print(f"   Range: {df['content_length'].min()} - {df['content_length'].max()} characters")
    
    if 'close_notes' in df.columns:
        has_close_notes = df['close_notes'].notna().sum()
        print(f"\n✅ Ground Truth (close_notes) Availability:")
        print(f"   Incidents with close_notes: {has_close_notes} ({has_close_notes/len(df)*100:.1f}%)")
        
        if has_close_notes > 0:
            df_with_gt = df[df['close_notes'].notna()].copy()
            df_with_gt['close_notes_length'] = df_with_gt['close_notes'].astype(str).str.len()
            df_with_gt['close_notes_word_count'] = df_with_gt['close_notes'].astype(str).str.split().str.len()
            
            print(f"\n📋 Resolution Notes (close_notes) Statistics:")
            print(f"   Average length: {df_with_gt['close_notes_length'].mean():.0f} characters")
            print(f"   Median length: {df_with_gt['close_notes_length'].median():.0f} characters")
            print(f"   Average word count: {df_with_gt['close_notes_word_count'].mean():.0f} words")
            print(f"   Range: {df_with_gt['close_notes_length'].min()} - {df_with_gt['close_notes_length'].max()} characters")
            
            # Expansion ratio
            expansion_ratio = df_with_gt['close_notes_length'].mean() / df_with_gt['content_length'].mean()
            print(f"\n📈 Content Expansion:")
            print(f"   Resolution notes are {expansion_ratio:.2f}x the length of the input content on average")
            
            if 'info_score_close_notes' in df_with_gt.columns:
                info_scores = df_with_gt['info_score_close_notes'].dropna()
                if len(info_scores) > 0:
                    print(f"\n⭐ Information Quality Score:")
                    print(f"   Mean: {info_scores.mean():.2f}")
                    print(f"   Median: {info_scores.median():.2f}")
                    print(f"   Range: {info_scores.min():.2f} - {info_scores.max():.2f}")
                    high_quality = (info_scores >= 0.8).sum()
                    print(f"   High quality (≥0.8): {high_quality} ({high_quality/len(info_scores)*100:.1f}%)")
print("="*80)


## 7. Prepare Data for Experiments

Select and prepare incidents for enrichment experiments. We'll focus on incidents that have ground truth (close_notes) for evaluation.
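
A `notna()` filter keeps rows whose `close_notes` are present but blank. If such rows exist, a stricter mask could exclude them, as in the sketch below. The helper `has_usable_text` and the toy rows are assumptions; whether the real dataset contains blank notes is not established here.

```python
import pandas as pd


def has_usable_text(series):
    """Boolean mask: True where the value is present and not just whitespace."""
    return series.notna() & (series.fillna("").astype(str).str.strip() != "")


# Toy data: one missing note and one whitespace-only note
toy = pd.DataFrame({
    "number": ["INC001", "INC002", "INC003", "INC004"],
    "close_notes": ["Reset the user's password", None, "   ", "Replaced network cable"],
})

mask = has_usable_text(toy["close_notes"])
with_gt = toy[mask].copy()
print(with_gt["number"].tolist())
```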


In [None]:
# Filter incidents that have close_notes (ground truth) for evaluation
if 'close_notes' in df.columns:
    df_with_ground_truth = df[df['close_notes'].notna()].copy()
    print(f"Incidents with ground truth: {len(df_with_ground_truth)}")
    print(f"Incidents without ground truth: {len(df) - len(df_with_ground_truth)}")
    
    # For experiments, we'll use incidents with ground truth
    df_experiments = df_with_ground_truth.copy()
else:
    print("No close_notes column found - will use all incidents")
    df_experiments = df.copy()

print(f"\nTotal incidents prepared for experiments: {len(df_experiments)}")


## 8. Save Prepared Dataset

Save the prepared dataset for use in subsequent notebooks.
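
Since later notebooks depend on the saved file, a quick round-trip check can confirm that the CSV reloads with the same rows and columns. The sketch below writes to a temporary directory so it does not touch `../data`; the toy rows are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data with a multi-line note containing a comma
toy = pd.DataFrame({
    "number": ["INC001", "INC002"],
    "close_notes": ["Reset password", "Line one\nline two, with a comma"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # also creates missing parents
    out_path = out_dir / "incidents_prepared.csv"

    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(len(reloaded) == len(toy))
print(list(reloaded.columns) == list(toy.columns))
```

Free-text fields with commas and line breaks are quoted by `to_csv`, so they should come back intact from `read_csv`.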


In [None]:
# Create data directory if it doesn't exist
data_dir = Path("../data")
data_dir.mkdir(exist_ok=True)

# Save the prepared dataset
output_path = data_dir / "incidents_prepared.csv"
df_experiments.to_csv(output_path, index=False)
print(f"✅ Saved prepared dataset to: {output_path}")
print(f"   Total records: {len(df_experiments)}")

# Also save a sample of incidents for quick testing
df_sample = df_experiments.sample(min(10, len(df_experiments)), random_state=RANDOM_STATE)
sample_path = data_dir / "incidents_sample.csv"
df_sample.to_csv(sample_path, index=False)
print(f"✅ Saved sample dataset to: {sample_path}")
print(f"   Sample records: {len(df_sample)}")


## 9. Summary

This notebook has:
- ✅ Loaded the synthetic IT call center tickets dataset
- ✅ Explored dataset structure and characteristics
- ✅ Analyzed content and ground truth availability
- ✅ Prepared data for enrichment experiments
- ✅ Saved prepared datasets for next steps

**Next Steps:**
- Move to `02_llm_eval_with_trustyai.ipynb` to test different prompts and LLM models
- Use the prepared dataset to generate enriched incident reports


In [None]:
# Display final summary
print("="*80)
print("NOTEBOOK SUMMARY")
print("="*80)
print(f"\n📊 Dataset loaded: {len(df)} total records")
print(f"📝 Prepared for experiments: {len(df_experiments)} records")
print(f"💾 Saved to: {output_path}")
print("\n✅ Ready for prompt experiments in the next notebook!")
print("="*80)
