# NotebookValidator Tutorial

The **NotebookValidator** validates Jupyter notebooks for reproducibility, documentation quality, and data science best practices.

## Overview

- **Type**: Simple Agent (single input/output)
- **Use Case**: Notebook quality assurance, reproducibility checks
- **Features**: Security scanning, documentation scoring, best practices validation

## Setup

In [None]:
from agent_workshop.agents.data_science import NotebookValidator
from agent_workshop import Config

config = Config()
validator = NotebookValidator(config)

print(f"Provider: {validator.provider_name}")
print(f"Model: {validator.model_name}")

## Input/Output Format

**Input**: Jupyter notebook content (JSON string or cell text)

**Output**:
```python
{
    "valid": bool,        # True if score >= 70 and no critical issues
    "score": int,         # 0-100 quality score
    "issues": [
        {
            "severity": "critical|high|medium|low",
            "category": "reproducibility|documentation|security|quality",
            "cell_index": int | None,
            "message": str
        }
    ],
    "suggestions": [str],   # Improvement suggestions
    "summary": str,         # Brief assessment
    "timestamp": str        # ISO timestamp
}
```
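
Downstream code may want to confirm that a result matches this shape before using it. The following is a minimal sketch built only from the schema above; the `Issue` and `ValidationResult` types and the `check_shape` helper are hypothetical names, not part of `agent_workshop`.

```python
from typing import List, Optional, TypedDict

SEVERITIES = {"critical", "high", "medium", "low"}
CATEGORIES = {"reproducibility", "documentation", "security", "quality"}
REQUIRED_KEYS = ("valid", "score", "issues", "suggestions", "summary", "timestamp")


class Issue(TypedDict):
    severity: str
    category: str
    cell_index: Optional[int]
    message: str


class ValidationResult(TypedDict):
    valid: bool
    score: int
    issues: List[Issue]
    suggestions: List[str]
    summary: str
    timestamp: str


def check_shape(result: dict) -> List[str]:
    """Return a list of problems found in a result dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in result]

    score = result.get("score")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        problems.append("score must be an int between 0 and 100")

    for i, issue in enumerate(result.get("issues", [])):
        if issue.get("severity") not in SEVERITIES:
            problems.append(f"issues[{i}]: unknown severity")
        if issue.get("category") not in CATEGORIES:
            problems.append(f"issues[{i}]: unknown category")
    return problems
```

An empty list from `check_shape(result)` means every documented key is present and the enumerated fields hold one of the documented values.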

## Validation Categories

The validator checks four key areas:

### 1. Reproducibility
- Non-linear cell execution (cells run out of order)
- Hardcoded absolute paths
- Missing dependency imports
- Environment-specific code
- Random seeds not set
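
How NotebookValidator detects these issues is not shown in this tutorial. As a rough, deterministic illustration of what two of the checks mean (out-of-order execution and hardcoded absolute paths), the sketch below reads a notebook's JSON directly. The function names and the path pattern are assumptions made for this example.

```python
import json
import re

# Quoted strings that start like an absolute path on macOS, Linux, or Windows.
# Deliberately narrow; an illustrative pattern, not an exhaustive one.
ABSOLUTE_PATH = re.compile(r"""["'](/(?:Users|home)/|[A-Za-z]:\\)""")


def cell_source(cell: dict) -> str:
    """The notebook format stores `source` as a string or a list of lines."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def reproducibility_flags(notebook_json: str) -> list:
    """Return (cell_index, message) pairs for two simple checks."""
    cells = json.loads(notebook_json).get("cells", [])
    flags = []
    last_count = 0
    for index, cell in enumerate(cells):
        if cell.get("cell_type") != "code":
            continue
        count = cell.get("execution_count")  # None if the cell was never run
        if count is not None:
            if count < last_count:
                flags.append((index, "cell executed out of order"))
            last_count = max(last_count, count)
        if ABSOLUTE_PATH.search(cell_source(cell)):
            flags.append((index, "hardcoded absolute path"))
    return flags
```

A code cell whose `execution_count` is lower than that of an earlier cell was run before it, which is what "non-linear execution" refers to.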

### 2. Documentation
- Missing or inadequate markdown cells
- No clear narrative structure
- Undocumented data transformations
- Missing section headers
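
A simple proxy for documentation coverage is the balance of markdown cells to code cells, plus a count of section headers. The sketch below computes these from notebook JSON. It illustrates the idea only; `documentation_stats` is a hypothetical helper, and nothing here reflects how the validator itself scores documentation.

```python
import json


def documentation_stats(notebook_json: str) -> dict:
    """Count markdown cells, code cells, and markdown header lines."""
    cells = json.loads(notebook_json).get("cells", [])
    markdown = [c for c in cells if c.get("cell_type") == "markdown"]
    code = [c for c in cells if c.get("cell_type") == "code"]

    headers = 0
    for cell in markdown:
        src = cell.get("source", "")
        text = "".join(src) if isinstance(src, list) else src
        # A markdown header line starts with one or more '#' characters
        headers += sum(1 for line in text.splitlines() if line.lstrip().startswith("#"))

    return {
        "markdown_cells": len(markdown),
        "code_cells": len(code),
        "section_headers": headers,
        # None when there are no code cells, to avoid dividing by zero
        "markdown_per_code": len(markdown) / len(code) if code else None,
    }
```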

### 3. Security
- Hardcoded credentials or API keys
- Sensitive data exposed in outputs
- Insecure network calls
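
To make the first and third items concrete, here is a small pattern-based scan over cell source. The patterns are assumptions chosen for this example and are far from complete; dedicated secret scanners use much larger rule sets. `find_security_flags` is a hypothetical name.

```python
import re

SECURITY_PATTERNS = [
    # NAME = "literal", where NAME looks like a credential
    re.compile(r"""(?i)\b\w*(api_?key|secret|token|password)\w*\s*=\s*["'][^"']+["']"""),
    # Plain-HTTP URLs, flagged as insecure network calls
    re.compile(r"""["']http://[^"']+["']"""),
]


def find_security_flags(source: str) -> list:
    """Return 1-based line numbers in `source` that match any pattern."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECURITY_PATTERNS):
            hits.append(number)
    return hits
```

A literal assignment such as `API_KEY = "sk-secret-12345"` is flagged, while reading the key with `os.environ.get(...)` is not, because the right-hand side is no longer a string literal.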

### 4. Quality
- Long cells that should be split
- Unused imports
- Poor variable naming
- No error handling
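
Two of these checks can be approximated with the standard library alone. The sketch below uses `ast` to find imports that are never referenced and a line count to flag long cells. The 30-line threshold and both function names are assumptions for illustration. Run `unused_imports` on the concatenated source of all code cells, since imports are normally used in later cells; source containing IPython magics such as `%matplotlib inline` will not parse and would need those lines stripped first.

```python
import ast


def unused_imports(source: str) -> list:
    """Return imported names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds the name `a`; `import x as y` binds `y`
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


def is_long_cell(source: str, max_lines: int = 30) -> bool:
    """True if the cell has more than `max_lines` non-blank lines."""
    non_blank = [line for line in source.splitlines() if line.strip()]
    return len(non_blank) > max_lines
```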

## Default Validation Criteria

In [None]:
# View the default criteria
print("Default Validation Criteria:")
print("=" * 50)
for i, criterion in enumerate(validator.validation_criteria, 1):
    print(f"{i}. {criterion}")

## Example: Validate a Notebook with Issues

In [None]:
# Sample notebook content with issues (simplified representation)
problematic_notebook = '''
# Cell 1 (code)
import pandas as pd
import numpy as np

# Cell 2 (code)
API_KEY = "sk-secret-12345"  # Security issue: hardcoded secret
data = pd.read_csv("/Users/john/data/dataset.csv")  # Reproducibility issue: absolute path

# Cell 3 (code)
# Using variable defined in cell 5 - execution order issue
result = model.predict(data)  # Error: model not defined yet

# Cell 4 (code)
# Very long cell with many operations - quality issue
data = data.dropna()
data = data[data["value"] > 0]
data["normalized"] = (data["value"] - data["value"].mean()) / data["value"].std()
data["binned"] = pd.cut(data["normalized"], bins=10)
data["encoded"] = data["category"].astype("category").cat.codes
# ... many more transformations without explanation

# Cell 5 (code)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()  # Reproducibility: no random_state set
model.fit(X, y)

# No markdown cells explaining the analysis - documentation issue
'''

print("Notebook with issues:")
print(problematic_notebook)

In [None]:
# Run the validation (uncomment to execute)
# result = await validator.run(problematic_notebook)
#
# print("Validation Result:")
# print("=" * 50)
# print(f"Valid: {result.get('valid')}")
# print(f"Score: {result.get('score')}/100")
#
# print("\nIssues Found:")
# for issue in result.get('issues', []):
#     print(f"  [{issue['severity'].upper()}] {issue['category']}: {issue['message']}")
#
# print("\nSuggestions:")
# for suggestion in result.get('suggestions', []):
#     print(f"  - {suggestion}")
#
# print(f"\nSummary: {result.get('summary')}")

## Example: Well-Structured Notebook

In [None]:
# Sample well-structured notebook
clean_notebook = '''
# Markdown Cell 1
# Customer Churn Analysis

This notebook analyzes customer churn patterns and builds a predictive model.

## Setup and Imports

# Cell 1 (code)
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import logging

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Markdown Cell 2
## Data Loading

Load the customer data from a relative path.

# Cell 2 (code)
# Use relative path for reproducibility
DATA_PATH = "./data/customers.csv"

# API key from environment variable
API_KEY = os.environ.get("ANALYSIS_API_KEY")
if not API_KEY:
    logger.warning("API key not set - some features disabled")

try:
    data = pd.read_csv(DATA_PATH)
    logger.info(f"Loaded {len(data)} records")
except FileNotFoundError:
    logger.error(f"Data file not found: {DATA_PATH}")
    raise

# Markdown Cell 3
## Data Preprocessing

Clean and prepare the data for modeling.

# Cell 3 (code)
def preprocess_data(df):
    """Clean and transform customer data."""
    df = df.dropna()
    df = df[df["tenure"] > 0]
    return df

data = preprocess_data(data)

# Markdown Cell 4
## Model Training

Train a Random Forest classifier with fixed random state.

# Cell 4 (code)
X = data.drop("churned", axis=1)
y = data["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

model = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
model.fit(X_train, y_train)
logger.info(f"Training accuracy: {model.score(X_train, y_train):.3f}")
'''

print("Well-structured notebook example:")
print(clean_notebook[:1000] + "...")

## Severity Levels

| Severity | Description | Example |
|----------|-------------|--------|
| **critical** | Security exposure or completely broken reproducibility | Exposed API keys, credentials in output |
| **high** | Major reproducibility or documentation gaps | Absolute paths, undefined variables |
| **medium** | Quality issues, minor documentation gaps | Long cells, missing docstrings |
| **low** | Style suggestions, nice-to-haves | Variable naming, formatting |
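
When reporting results, it is often useful to list the most severe issues first. The sketch below orders and counts issue dicts of the shape described in the output format; `sort_by_severity` and `count_by_severity` are hypothetical helpers, not library functions.

```python
from collections import Counter

# Most severe first, matching the table above
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def sort_by_severity(issues: list) -> list:
    """Return issues ordered from most to least severe (stable within a level)."""
    rank = {name: position for position, name in enumerate(SEVERITY_ORDER)}
    # Unknown severities sort last
    return sorted(issues, key=lambda issue: rank.get(issue.get("severity"), len(rank)))


def count_by_severity(issues: list) -> dict:
    """Return a count per severity level, including levels with zero issues."""
    counts = Counter(issue.get("severity") for issue in issues)
    return {name: counts.get(name, 0) for name in SEVERITY_ORDER}
```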

## Custom Validation Criteria

In [None]:
# Custom validator for ML-specific notebooks
ml_validator = NotebookValidator(
    config=config,
    system_prompt="""You are an ML notebook reviewer specializing in:
    - Model reproducibility (random seeds, versioning)
    - Experiment tracking (metrics, parameters)
    - Data leakage prevention
    - Model deployment readiness
    
    Focus on ML engineering best practices.""",
    validation_criteria=[
        "Random seeds set for all stochastic operations",
        "Train/test split before any preprocessing",
        "Model metrics logged and tracked",
        "Hyperparameters clearly documented",
        "No data leakage between train and test sets",
        "Model serialization demonstrated",
        "requirements.txt or environment.yml present",
    ]
)

print("ML-focused Notebook Validator:")
print("Criteria:")
for i, c in enumerate(ml_validator.validation_criteria, 1):
    print(f"  {i}. {c}")

## Scoring Guidelines

The quality score (0-100) falls into one of four bands:

- **90-100**: Excellent - Ready for sharing/publication
- **70-89**: Good - Minor improvements needed
- **50-69**: Fair - Several issues to address
- **Below 50**: Needs Work - Significant issues

A notebook is considered **valid** only if both conditions hold:
1. The score is at least 70
2. No issue has `critical` severity
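
The bands and the validity rule can be written out as two small functions. This is a sketch of the rules as stated in this section, not the validator's own implementation; `score_band` and `is_valid` are hypothetical names.

```python
def score_band(score: int) -> str:
    """Map a 0-100 score to the band names used in this section."""
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Needs Work"


def is_valid(score: int, issues: list) -> bool:
    """Valid means a score of at least 70 and no critical issues."""
    has_critical = any(issue.get("severity") == "critical" for issue in issues)
    return score >= 70 and not has_critical
```

Note that a high score alone is not enough: a notebook scoring 95 with one critical issue is still invalid under these rules.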

## Integration with CI/CD

Use NotebookValidator in your pipeline:

```python
from pathlib import Path

from agent_workshop import Config
from agent_workshop.agents.data_science import NotebookValidator


async def validate_notebooks(notebook_dir: str) -> bool:
    """Validate every notebook in a directory; return True only if all pass."""
    validator = NotebookValidator(Config())
    all_valid = True

    # Sort for a stable order in CI logs
    for notebook_path in sorted(Path(notebook_dir).glob("*.ipynb")):
        content = notebook_path.read_text(encoding="utf-8")
        result = await validator.run(content)

        # Use .get() so a result without a score does not raise KeyError
        status = "PASS" if result.get("valid", False) else "FAIL"
        print(f"{status}: {notebook_path.name} (score: {result.get('score')})")
        if status == "FAIL":
            all_valid = False

    return all_valid
```
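
A CI job fails a build through the process exit code, so the boolean returned by `validate_notebooks` has to be turned into one. The sketch below shows one way to do that outside a notebook, where top-level `await` is not available. `run_check` is a hypothetical helper; it accepts any async function that takes a directory and returns a bool.

```python
import asyncio
from typing import Awaitable, Callable


def run_check(check: Callable[[str], Awaitable[bool]], notebook_dir: str) -> int:
    """Run an async check to completion and map its result to an exit code."""
    all_valid = asyncio.run(check(notebook_dir))
    return 0 if all_valid else 1


# In a CI script, assuming `validate_notebooks` from above is importable:
#     import sys
#     sys.exit(run_check(validate_notebooks, sys.argv[1]))
```

An exit code of 0 lets the pipeline continue, and any nonzero code marks the step as failed.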

## Next Steps

- **[07_blueprint_system.ipynb](./07_blueprint_system.ipynb)** - Create agents from YAML blueprints
- **[01_deliverable_validator.ipynb](./01_deliverable_validator.ipynb)** - Document validation
- **[00_getting_started.ipynb](./00_getting_started.ipynb)** - Framework overview