# Error Analysis Notebook

**Purpose:** Manual review and error analysis of generated specs/plans

**Process:**
1. **Open Coding:** Review outputs and note issues
2. **Axial Coding:** Group similar failures into categories
3. **Prioritization:** Count frequency and prioritize fixes

**Time:** 30-60 minutes per session

---

In [1]:
# Setup
import os
import glob
import pandas as pd
from pathlib import Path
import re

# Configuration
DATASET_DIR = Path('../datasets/real-specs')
RESULTS_DIR = Path('../datasets/analysis-results')
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

print("✅ Setup complete")
print(f"📁 Dataset directory: {DATASET_DIR}")
print(f"📊 Results directory: {RESULTS_DIR}")

✅ Setup complete
📁 Dataset directory: ../datasets/real-specs
📊 Results directory: ../datasets/analysis-results


## 1. Load Test Specs

In [12]:
# Load all spec files
spec_files = sorted(glob.glob(str(DATASET_DIR / 'spec-*.md')))

print(f"📝 Found {len(spec_files)} spec files\n")

# Load into dataframe
data = []
for spec_file in spec_files:
    with open(spec_file, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Extract prompt
    prompt_match = re.search(r'## Prompt\n(.+?)\n\n', content, re.DOTALL)
    prompt = prompt_match.group(1).strip() if prompt_match else ''
    
    # Extract generated spec
    spec_match = re.search(r'## Generated Spec\n(.+?)\n\n## Review Notes', content, re.DOTALL)
    spec = spec_match.group(1).strip() if spec_match else ''
    
    data.append({
        'file': os.path.basename(spec_file),
        'prompt': prompt,
        'spec': spec,
        'pass_fail': '',  # To be filled during review
        'issues': '',     # To be filled during review
        'category': ''    # To be filled during review
    })

df = pd.DataFrame(data)
print(f"✅ Loaded {len(df)} specs into dataframe\n")
df.head()

📝 Found 17 spec files

✅ Loaded 17 specs into dataframe



| | file | prompt | spec | pass_fail | issues | category |
|---|---|---|---|---|---|---|
| 0 | spec-001.md | Build a simple todo list web application with ... | # Feature Specification Document\n\n## 1. Over... | | | |
| 1 | spec-002.md | Create a personal blog platform with posts, co... | # Feature Specification Document\n\n## 1. Over... | | | |
| 2 | spec-003.md | Develop a simple calculator web app with basic... | # Feature Specification Document\n\n## 1. Over... | | | |
| 3 | spec-004.md | Build an e-commerce platform with product cata... | # Feature Specification Document\n\n## 1. Over... | | | |
| 4 | spec-005.md | Create a customer relationship management (CRM... | # Feature Specification Document\n\n## 1. Over... | | | |
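
The loading cell falls back to an empty string whenever a regex does not match, so a malformed file loads without any warning. The following is a minimal sketch of a parser that also reports which sections were missing. It assumes the `## Prompt` / `## Generated Spec` / `## Review Notes` layout implied by the regexes above; `parse_spec_text` and the sample text are hypothetical and only illustrate the idea.

```python
import re

# Same patterns as the loading cell, compiled once
PROMPT_RE = re.compile(r'## Prompt\n(.+?)\n\n', re.DOTALL)
SPEC_RE = re.compile(r'## Generated Spec\n(.+?)\n\n## Review Notes', re.DOTALL)


def parse_spec_text(content):
    """Return (prompt, spec, missing); missing lists sections that did not match."""
    missing = []
    prompt_match = PROMPT_RE.search(content)
    spec_match = SPEC_RE.search(content)
    if not prompt_match:
        missing.append('prompt')
    if not spec_match:
        missing.append('spec')
    prompt = prompt_match.group(1).strip() if prompt_match else ''
    spec = spec_match.group(1).strip() if spec_match else ''
    return prompt, spec, missing


# Hypothetical file content, used only to exercise the parser
sample = (
    "# spec-000\n\n"
    "## Prompt\nBuild a demo app\n\n"
    "## Generated Spec\n# Feature Specification Document\n\nBody text\n\n"
    "## Review Notes\n"
)
prompt, spec, missing = parse_spec_text(sample)
print(prompt, missing)
```

A non-empty `missing` list is a hint to inspect the file before reviewing it, since an empty `spec` column would otherwise look like a model failure.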


## 2. Review Process (Open Coding)

**Instructions for Domain Expert:**
- Review each spec
- Mark Pass/Fail (binary, no Likert scale)
- Note specific issues found
- Don't categorize yet - just note what you see

**Example issues to look for:**
- Missing requirements
- Ambiguous user stories
- Over-engineering
- Missing constraints
- Incorrect tech stack assumptions

In [13]:
# Display specs one at a time for review
# Change idx to review different specs

idx = 2  # Change this to review different specs (0 to len(df)-1)

if idx < len(df):
    row = df.iloc[idx]
    print(f"\n{'='*80}")
    print(f"📄 File: {row['file']}")
    print(f"\n📝 Prompt:\n{row['prompt']}")
    print(f"\n{'='*80}")
    print(f"\n📋 Generated Spec:\n{row['spec']}")
    print(f"\n{'='*80}")
    print("\n✍️ Review (fill in manually):")
    print(f"   Pass/Fail: {row['pass_fail'] or '[TODO]'}")
    print(f"   Issues: {row['issues'] or '[TODO]'}")
else:
    print(f"❌ Index {idx} out of range (0-{len(df)-1})")


📄 File: spec-003.md

📝 Prompt:
Develop a simple calculator web app with basic arithmetic operations


📋 Generated Spec:
# Feature Specification Document

## 1. Overview Section

**Feature Name:** [NEEDS CLARIFICATION - No feature described in user requirements]

**Description:** 
No user requirements were provided in the input. This template demonstrates the structure for a comprehensive feature specification document.

**Purpose:**
[NEEDS CLARIFICATION - Awaiting user requirements to define feature purpose and value proposition]

**Target Users:**
[NEEDS CLARIFICATION - User personas and roles to be defined based on requirements]

**Business Value:**
[NEEDS CLARIFICATION - Expected business impact and ROI to be defined]

---

## 2. User Stories

### Priority 1 (P1) - Must Have

**US-001: [NEEDS CLARIFICATION - Core user story needed]**
- **As a** [role]
- **I want** [feature/capability]
- **So that** [business value/benefit]

**Acceptance Criteria:**
- **Given** [initial con

In [6]:
# Manual annotation helper
# Update this cell to record your review

# Example:
# df.loc[0, 'pass_fail'] = 'Fail'
# df.loc[0, 'issues'] = 'Missing authentication requirements, vague data model'

# Your annotations here:
# df.loc[idx, 'pass_fail'] = 'Pass' or 'Fail'
# df.loc[idx, 'issues'] = 'Description of issues found'

print("💾 Remember to save your work!")

💾 Remember to save your work!
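
Because the review is binary, it can help to reject anything other than `Pass` or `Fail` at annotation time, so typos such as `fail` do not silently drop out of the later counts. The following is a minimal sketch; the `annotate` helper and the `demo` dataframe are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

VALID_VERDICTS = {'Pass', 'Fail'}


def annotate(df, idx, pass_fail, issues=''):
    """Record a binary verdict and free-text issues for row idx."""
    if pass_fail not in VALID_VERDICTS:
        raise ValueError(f"pass_fail must be 'Pass' or 'Fail', got {pass_fail!r}")
    if idx not in df.index:
        raise IndexError(f"Index {idx} not in dataframe")
    df.loc[idx, 'pass_fail'] = pass_fail
    df.loc[idx, 'issues'] = issues


# Stand-in dataframe with the same annotation columns as the notebook's df
demo = pd.DataFrame({
    'file': ['spec-001.md', 'spec-002.md'],
    'pass_fail': ['', ''],
    'issues': ['', ''],
})
annotate(demo, 0, 'Fail', 'Missing authentication requirements')
print(demo[['file', 'pass_fail', 'issues']])
```

In the notebook this would be called as `annotate(df, idx, 'Fail', '...')` in place of the two `df.loc[...]` assignments.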


## 3. Categorization (Axial Coding)

After reviewing all specs, group similar issues into categories.

**Common failure categories:**
- Incomplete requirements
- Ambiguous specifications
- Over-engineering
- Missing constraints
- Tech stack mismatch
- Missing edge cases
- Unclear acceptance criteria
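
Category labels are typed by hand, and `value_counts()` treats `Over-engineering` and `over-engineering ` as two different categories. The following is a minimal sketch of a normalizer that maps a typed label to a canonical spelling. It assumes the list above is the working taxonomy; `CATEGORIES` and `normalize_category` are hypothetical names, and the list should be extended as new categories emerge during axial coding.

```python
CATEGORIES = [
    'Incomplete requirements',
    'Ambiguous specifications',
    'Over-engineering',
    'Missing constraints',
    'Tech stack mismatch',
    'Missing edge cases',
    'Unclear acceptance criteria',
]

# Case-insensitive lookup from lowercased label to canonical spelling
_LOOKUP = {name.lower(): name for name in CATEGORIES}


def normalize_category(label):
    """Map a hand-typed label to its canonical spelling, or raise if unknown."""
    key = ' '.join(label.split()).lower()  # collapse whitespace, ignore case
    if key not in _LOOKUP:
        raise ValueError(f"Unknown category: {label!r}")
    return _LOOKUP[key]


print(normalize_category('  over-engineering '))
```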

In [14]:
# View all failed specs
failed_specs = df[df['pass_fail'] == 'Fail']
print(f"❌ Failed specs: {len(failed_specs)} / {len(df)}\n")

if len(failed_specs) > 0:
    print("Issues found:")
    # Use a throwaway loop variable so the review cell's `idx` is not overwritten
    for _, row in failed_specs.iterrows():
        print(f"\n[{row['file']}]")
        print(f"  {row['issues']}")
else:
    print("✅ No failed specs (or not yet reviewed)")

❌ Failed specs: 0 / 17

✅ No failed specs (or not yet reviewed)


In [8]:
# After reviewing issues, assign categories
# Example:
# df.loc[0, 'category'] = 'Incomplete requirements'
# df.loc[1, 'category'] = 'Over-engineering'

# Your categorization here:

print("💾 Update categories based on your error analysis")

💾 Update categories based on your error analysis


## 4. Failure Taxonomy & Prioritization

In [15]:
# Count failures by category
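# Recompute failed_specs so the counts reflect the latest annotations
failed_specs = df[df['pass_fail'] == 'Fail']
category_counts = failed_specs['category'].value_counts()

print("📊 Failure Taxonomy:\n")
print(category_counts)
print(f"\nTotal failures: {len(failed_specs)}")
# Note: unreviewed specs (empty pass_fail) count as passing in this rate
print(f"Pass rate: {(len(df) - len(failed_specs)) / len(df) * 100:.1f}%")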

📊 Failure Taxonomy:

Series([], Name: count, dtype: int64)

Total failures: 0
Pass rate: 100.0%
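
The 100.0% above is reported with zero specs reviewed, because every row whose `pass_fail` is not `Fail` counts as a pass. The following is a minimal sketch of a summary that restricts the rate to reviewed rows and returns no rate when nothing has been reviewed. `review_summary` is a hypothetical helper, and the sample series stand in for `df['pass_fail']`.

```python
import pandas as pd


def review_summary(pass_fail):
    """Summarize verdicts, treating anything other than Pass/Fail as unreviewed."""
    reviewed = pass_fail[pass_fail.isin(['Pass', 'Fail'])]
    n_reviewed = len(reviewed)
    n_failed = int((reviewed == 'Fail').sum())
    # No meaningful rate exists until at least one spec has a verdict
    pass_rate = None if n_reviewed == 0 else (n_reviewed - n_failed) / n_reviewed * 100
    return {
        'total': len(pass_fail),
        'reviewed': n_reviewed,
        'failed': n_failed,
        'pass_rate': pass_rate,
    }


# Stand-ins for df['pass_fail'] before and after some review
print(review_summary(pd.Series(['', '', ''])))
print(review_summary(pd.Series(['Pass', 'Fail', 'Pass', ''])))
```

Reporting `reviewed` alongside `total` also makes it obvious how far through the session the numbers were taken.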


In [10]:
# Visualize failure categories
import matplotlib.pyplot as plt

if len(category_counts) > 0:
    plt.figure(figsize=(10, 6))
    category_counts.plot(kind='barh')
    plt.title('Failure Categories (Frequency)')
    plt.xlabel('Count')
    plt.tight_layout()
    plt.show()
else:
    print("No failures to visualize (or not yet categorized)")

No failures to visualize (or not yet categorized)


## 5. Save Results

In [11]:
# Save annotated data
output_file = RESULTS_DIR / 'error-analysis-results.csv'
df.to_csv(output_file, index=False)
print(f"💾 Saved results to: {output_file}")

# Save summary
summary_file = RESULTS_DIR / 'summary.txt'
with open(summary_file, 'w', encoding='utf-8') as f:
    f.write("Error Analysis Summary\n")
    f.write(f"{'='*50}\n\n")
    f.write(f"Total specs reviewed: {len(df)}\n")
    f.write(f"Failed specs: {len(failed_specs)}\n")
    f.write(f"Pass rate: {(len(df) - len(failed_specs)) / len(df) * 100:.1f}%\n\n")
    f.write("Failure Categories:\n")
    for category, count in category_counts.items():
        f.write(f"  - {category}: {count}\n")

print(f"📄 Saved summary to: {summary_file}")

💾 Saved results to: ../datasets/analysis-results/error-analysis-results.csv
📄 Saved summary to: ../datasets/analysis-results/summary.txt
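
To resume a review in a later session, the saved CSV can be read back in. By default `pandas.read_csv` turns empty fields into `NaN`, which would break expressions such as `row['pass_fail'] or '[TODO]'`. The following is a minimal sketch of a loader that keeps blanks as empty strings; `load_annotations` is a hypothetical helper and the in-memory CSV is made-up sample data.

```python
import io
import pandas as pd


def load_annotations(source):
    """Read a saved results CSV, keeping blank annotations as empty strings."""
    # keep_default_na=False stops pandas from turning empty fields into NaN,
    # so expressions like `row['pass_fail'] or '[TODO]'` behave as before saving.
    return pd.read_csv(source, dtype=str, keep_default_na=False)


# In-memory stand-in for error-analysis-results.csv (sample rows are made up)
saved = io.StringIO(
    "file,prompt,spec,pass_fail,issues,category\n"
    "spec-001.md,Build a todo app,# Spec,Fail,Vague data model,Incomplete requirements\n"
    "spec-002.md,Create a blog,# Spec,,,\n"
)
restored = load_annotations(saved)
print(restored[['file', 'pass_fail', 'category']])
```

In the notebook, `source` would be the `output_file` path written by the save cell.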


## 6. Next Steps

Based on this error analysis:

1. **Prioritize fixes** by failure frequency
2. **Update prompts/templates** to address top issues
3. **Add automated checks** for frequent failure modes
4. **Re-run evaluation** after fixes
5. **Repeat monthly** or when making major changes
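
As an illustration of step 3, the spec-003 output shown earlier is filled with `[NEEDS CLARIFICATION ...]` placeholders, a pattern simple enough to detect mechanically. The following is a minimal sketch of such a check, worth adding only once manual review confirms the pattern recurs. `count_placeholders`, `flag_unfilled`, and the `max_allowed` threshold are assumptions for illustration, not part of the existing pipeline.

```python
import re

# Matches markers such as "[NEEDS CLARIFICATION - Core user story needed]"
PLACEHOLDER_RE = re.compile(r'\[NEEDS CLARIFICATION[^\]]*\]')


def count_placeholders(spec_text):
    """Count '[NEEDS CLARIFICATION ...]' markers left in a generated spec."""
    return len(PLACEHOLDER_RE.findall(spec_text))


def flag_unfilled(spec_text, max_allowed=0):
    """Return True when the spec has more placeholders than max_allowed."""
    return count_placeholders(spec_text) > max_allowed


# Excerpt copied from the spec-003 output in the review section
excerpt = (
    "**Feature Name:** [NEEDS CLARIFICATION - No feature described in user requirements]\n"
    "**Purpose:**\n"
    "[NEEDS CLARIFICATION - Awaiting user requirements to define feature purpose and value proposition]\n"
)
print(count_placeholders(excerpt), flag_unfilled(excerpt))
```

A check like this only flags candidates for review; whether a flagged spec actually fails remains the reviewer's call.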

---

**💡 Tips:**
- Aim for a 70% pass rate (not 100%)
- Focus on high-frequency failure modes
- Don't build automated checks until you understand the pattern
- Use LLMs to help categorize, but have a human review what matters