# KATO Hierarchical Training v2.0 - Educational Architecture

**Purpose**: Train a hierarchical concept learner with a transparent, layer-based API.

## Key Changes in v2.0

### Architecture
- **Full text processing**: Take the complete text → tokenize → chunk → feed to node0
- **Natural abstraction**: Let hierarchy learn naturally (no forced segmentation)
- **Pattern name flow**: Explicit flow between layers visible to users
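
The tokenize → chunk step can be pictured with a minimal sketch. The helper name `chunk_tokens_sketch` is hypothetical, and keeping a shorter final chunk is an assumption; the notebook itself relies on `model.chunk_tokens()`, whose exact behavior may differ.

```python
from typing import List, Sequence


def chunk_tokens_sketch(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: a shorter trailing chunk is kept rather than dropped or padded.
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


tokens = ["The", "cat", "sat", "on", "the", "mat", "today"]
chunks = chunk_tokens_sketch(tokens, chunk_size=3)
print(chunks)
```

Each chunk would then be sent to node0 as one unit, so `chunk_size` controls how many tokens make up a single node0 pattern.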

### API Design
- **TensorFlow/PyTorch-style**: Use `add_layer()` to build hierarchy
- **Explicit KATO calls**: Show `observe()`, `observe_sequence()`, `learn()`, `get_predictions()`
- **Educational focus**: Users see exactly what's happening

### Configuration
- **Flexible metadata**: Choose which layers capture source metadata
- **Per-layer settings**: Chunk size, max predictions, recall threshold, STM mode
- **Transparent**: All settings visible in notebook

## What You'll Learn

1. How to configure hierarchical layers explicitly
2. How pattern names flow between layers
3. How KATO API calls work (observe, learn, predict)
4. How to handle metadata at specific layers
5. How to process full documents through the hierarchy

## 1. Setup and Imports

In [None]:
# Install required packages
!pip install -q datasets transformers requests numpy matplotlib tqdm

In [None]:
# Core imports
from tools.hierarchical_builder import (
    HierarchicalBuilder,
    process_chunk_at_layer,
    accumulate_in_stm,
    learn_from_stm,
    extract_prediction_field
)

# For profiling and analysis
from tools import (
    ProfilingEngine,
    StreamingDatasetLoader
)

from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

print("✓ All modules imported successfully")
print("✓ Ready for hierarchical training v2.0")

## 2. Service Configuration

Configure KATO server URL.

**Note**: KATO has migrated from MongoDB to ClickHouse + Redis (Nov 2025). The KATO API remains backward compatible, so training works as before. Post-training analysis tools are being updated.

**Multi-machine support**: Change KATO_URL if running KATO on a separate machine.
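
For multi-machine setups, one option is to read the server address from the environment instead of editing the notebook. This is only a sketch: the environment variable name `KATO_URL` and the helper `resolve_kato_url` are assumptions, not part of the KATO tooling.

```python
import os

# Default taken from this notebook's configuration cell.
DEFAULT_KATO_URL = "http://kato:8000"


def resolve_kato_url(env=os.environ):
    """Return the KATO base URL from the environment, or the default if unset."""
    # Hypothetical convention: a KATO_URL environment variable overrides the default.
    # A trailing slash is stripped so f"{url}/health" does not produce "//health".
    return env.get("KATO_URL", DEFAULT_KATO_URL).rstrip("/")


print(resolve_kato_url({}))
print(resolve_kato_url({"KATO_URL": "http://192.168.1.100:8000/"}))
```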

In [None]:
# ========================================
# SERVICE CONFIGURATION
# ========================================

KATO_URL = 'http://kato:8000'  # KATO server

# NOTE: MongoDB has been removed from KATO (migrated to ClickHouse + Redis)
# KATO API remains backward compatible - training will work as before
# Post-training analysis tools are being updated for the new storage backend

# For multi-machine setups:
# KATO_URL = 'http://192.168.1.100:8000'

print("✓ Service URLs configured")
print(f"  KATO: {KATO_URL}")
print("  Note: KATO now uses ClickHouse + Redis (MongoDB removed)")

In [None]:
# ========================================
# VERIFY KATO SERVER CONNECTION
# ========================================

import requests
from datetime import datetime

def check_kato_server(url):
    """Check if KATO server is responding."""
    try:
        response = requests.get(f"{url}/health", timeout=5)
        if response.status_code == 200:
            print(f"✓ KATO server is healthy at {url}")
            return True
        else:
            print(f"⚠️  KATO server responded with status {response.status_code}")
            return False
    except requests.exceptions.ConnectionError:
        print(f"✗ Cannot connect to KATO server at {url}")
        print(f"  Make sure KATO is running (check: docker ps | grep kato)")
        return False
    except Exception as e:
        print(f"✗ Error checking KATO server: {e}")
        return False

print(f"Checking KATO server at {KATO_URL}...")
if check_kato_server(KATO_URL):
    print(f"\n✓ Ready to begin training!")
    print(f"  Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
else:
    print(f"\n⚠️  WARNING: Training will fail without KATO server")
    print(f"  Please start KATO before continuing")

## 3. Training Configuration

Configure dataset and training parameters.

In [None]:
# ========================================
# TRAINING CONFIGURATION
# ========================================

# Dataset
DATASET_KEY = 'wikitext'  # Options: 'c4', 'refinedweb', 'wikitext', 'openwebtext'
MAX_SAMPLES = 100000  # Full run; lower this (e.g. 100) for a quick test, then scale up
CHUNK_SIZES = [8, 8, 8, 8]      # [node0, node1, node2, node3]

# Workers (for parallel training)
NUM_WORKERS = 1  # Start with 1 for educational single-threaded mode

# Checkpoint configuration
CHECKPOINT_INTERVAL = 1000  # Save checkpoint every N samples
RESUME_FROM_CHECKPOINT = False  # Set True to resume interrupted training

print("✓ Training configuration set")
print(f"  Dataset: {DATASET_KEY}")
print(f"  Max samples: {MAX_SAMPLES:,}")
print(f"  Workers: {NUM_WORKERS}")
print(f"  Checkpoint interval: {CHECKPOINT_INTERVAL:,}")
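
The batch loop later in this notebook does not show how `CHECKPOINT_INTERVAL` and `RESUME_FROM_CHECKPOINT` are consumed. As a rough sketch of what interval-based checkpointing could look like, the following writes progress to a JSON file. The helper names, the file name, and the payload layout are all hypothetical, and this only records client-side progress, not KATO's own state.

```python
import json
import os
import tempfile


def should_checkpoint(samples_processed, interval):
    """True when samples_processed is a positive multiple of interval."""
    return interval > 0 and samples_processed > 0 and samples_processed % interval == 0


def save_checkpoint(path, samples_processed, stats):
    """Write progress atomically: dump to a temp file, then rename over the target."""
    payload = {"samples_processed": samples_processed, "stats": stats}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved progress, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


# Simulated run: 2,500 samples with a checkpoint every 1,000.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
for n in range(1, 2501):
    if should_checkpoint(n, 1000):
        save_checkpoint(checkpoint_path, n, {"node0_patterns": n * 3})

resumed = load_checkpoint(checkpoint_path)
print(resumed["samples_processed"])
```

A resumed run would also need to skip the samples already processed before continuing; how that interacts with a streaming dataset depends on the loader and is not covered here.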

## 4. Hierarchical Layer Configuration (NEW!)

**TensorFlow/PyTorch-style API**: Build hierarchy by adding layers.

**Key Parameters**:
- `chunk_size`: How many inputs per chunk
- `max_predictions`: Top N predictions to pass to next layer
- `prediction_field`: Which field to extract ('name')
- `recall_threshold`: Pattern matching strictness (0.0-1.0)
- `capture_metadata`: Whether this layer captures source metadata (True/False)
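
To make `max_predictions` and `prediction_field` concrete, here is a small sketch of selecting the top N predictions and extracting one field from each. The prediction dictionaries, the `similarity` ranking key, and the `PTRN|...` names are illustrative assumptions; the real response format and ranking used by KATO may differ.

```python
def top_prediction_fields(predictions, field="name", max_predictions=10,
                          rank_key="similarity"):
    """Rank predictions by rank_key (descending) and return `field` from the top N."""
    ranked = sorted(predictions, key=lambda p: p.get(rank_key, 0.0), reverse=True)
    return [p[field] for p in ranked[:max_predictions] if field in p]


# Hypothetical predictions, shaped only for this illustration.
example = [
    {"name": "PTRN|aaa", "similarity": 0.62},
    {"name": "PTRN|bbb", "similarity": 0.91},
    {"name": "PTRN|ccc", "similarity": 0.75},
]
print(top_prediction_fields(example, field="name", max_predictions=2))
```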

In [None]:
# Build hierarchy with explicit layer configuration
hierarchy = HierarchicalBuilder(
    tokenizer_name='gpt2',
    base_url=KATO_URL
)

# Add node0: Chunk-level patterns
hierarchy.add_layer(
    name='node0',
    chunk_size=CHUNK_SIZES[0],
    max_predictions=10,
    prediction_field='name',
    recall_threshold=0.6,
    stm_mode='CLEAR',
    max_pattern_length=0,
    process_predictions=False,
    capture_metadata=False  # Don't capture metadata here
)

# Add node1: Paragraph-level patterns
hierarchy.add_layer(
    name='node1',
    chunk_size=CHUNK_SIZES[1],
    max_predictions=8,
    prediction_field='name',
    recall_threshold=0.6,
    stm_mode='CLEAR',
    max_pattern_length=0,
    process_predictions=False,
    capture_metadata=False  # Don't capture metadata here either
)

# Add node2: Chapter-level patterns (capture metadata)
hierarchy.add_layer(
    name='node2',
    chunk_size=CHUNK_SIZES[2],
    max_predictions=6,
    prediction_field='name',
    recall_threshold=0.6,
    stm_mode='CLEAR',
    max_pattern_length=0,
    process_predictions=False,
    capture_metadata=True  # START capturing metadata at this layer
)

# Add node3: Book-level patterns (capture metadata)
hierarchy.add_layer(
    name='node3',
    chunk_size=CHUNK_SIZES[3],
    max_predictions=4,
    prediction_field='name',
    recall_threshold=0.6,
    stm_mode='CLEAR',
    max_pattern_length=0,
    process_predictions=False,
    capture_metadata=True  # Capture metadata here too
)

# Build the model
model = hierarchy.build(verbose=True)
model.summary()

## 5. Explicit KATO Helper Functions (Educational)

These functions show the EXACT KATO API calls being made.

In [None]:
def process_text_sample(text, metadata=None, verbose=True):
    """
    Process one text sample through the hierarchy.
    
    Shows explicit KATO calls at each step.
    """
    if verbose:
        print(f"\n{'='*60}")
        print(f"PROCESSING SAMPLE")
        print(f"{'='*60}")
    
    # Step 1: Tokenize
    tokens = model.tokenize(text)
    if verbose:
        print(f"\n1. Tokenized: {len(tokens)} tokens")
        print(f"   First 10: {tokens[:10]}")
    
    # Step 2: Chunk tokens for node0
    chunks = model.chunk_tokens(tokens, model.layers[0].chunk_size)
    if verbose:
        print(f"\n2. Chunked into {len(chunks)} chunks (size={model.layers[0].chunk_size})")
    
    # Step 3: Process each chunk at node0
    if verbose:
        print(f"\n3. Processing chunks at node0...")
    
    node0_patterns = []
    for i, chunk in enumerate(chunks):
        # EXPLICIT KATO API CALL
        pattern = process_chunk_at_layer(
            chunk,
            model.layers[0].client,
            metadata=None,  # No metadata at node0
            verbose=False
        )
        node0_patterns.append(pattern)
        if verbose and i < 3:  # Show first 3
            print(f"   Chunk {i+1}: {pattern[:40]}...")
    
    if verbose:
        print(f"   → node0 produced {len(node0_patterns)} patterns")
    
    # Step 4: Send node0 patterns to node1
    if verbose:
        print(f"\n4. Sending {len(node0_patterns)} patterns to node1...")
    
    # EXPLICIT KATO API CALL
    count = accumulate_in_stm(
        node0_patterns,
        model.layers[1].client,
        metadata=None,
        verbose=False
    )
    
    # Learn at node1 (sample complete)
    # EXPLICIT KATO API CALL
    node1_pattern = learn_from_stm(model.layers[1].client, verbose=False)
    if verbose:
        print(f"   → node1 learned: {node1_pattern[:40]}...")
    
    # Step 5: Send node1 pattern to node2 (with metadata if node2 is configured to capture it)
    should_capture_metadata = model.layers[2].should_capture_metadata()
    meta = metadata if should_capture_metadata else None
    
    if verbose:
        print(f"\n5. Sending node1 pattern to node2...")
        if meta:
            print(f"   → Metadata attached: {meta}")
    
    # EXPLICIT KATO API CALL
    accumulate_in_stm([node1_pattern], model.layers[2].client, metadata=meta, verbose=False)
    
    # Learn at node2
    # EXPLICIT KATO API CALL
    node2_pattern = learn_from_stm(model.layers[2].client, verbose=False)
    if verbose:
        print(f"   → node2 learned: {node2_pattern[:40]}...")
    
    # Step 6: Send node2 pattern to node3 (with metadata if node3 is configured to capture it)
    should_capture_metadata = model.layers[3].should_capture_metadata()
    meta = metadata if should_capture_metadata else None
    
    if verbose:
        print(f"\n6. Sending node2 pattern to node3...")
    
    # EXPLICIT KATO API CALL
    accumulate_in_stm([node2_pattern], model.layers[3].client, metadata=meta, verbose=False)
    
    # Learn at node3
    # EXPLICIT KATO API CALL
    node3_pattern = learn_from_stm(model.layers[3].client, verbose=False)
    if verbose:
        print(f"   → node3 learned: {node3_pattern[:40]}...")
    
    if verbose:
        print(f"\n{'='*60}")
        print(f"SAMPLE COMPLETE")
        print(f"{'='*60}\n")
    
    return {
        'node0_patterns': len(node0_patterns),
        'node1_pattern': node1_pattern,
        'node2_pattern': node2_pattern,
        'node3_pattern': node3_pattern
    }

print("✓ Helper functions defined")
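
The flow of pattern names between layers can also be tried without a KATO server. The toy below is an in-memory stand-in, not the real client: `ToyLayer`, `flow_through`, and the hash-based `PTRN|...` naming are assumptions made purely for illustration.

```python
import hashlib


class ToyLayer:
    """In-memory stand-in for one node (not the real KATO client)."""

    def __init__(self, name):
        self.name = name
        self.stm = []        # short-term memory: symbols observed so far
        self.learned = {}    # pattern name -> sequence it was learned from

    def observe(self, symbol):
        self.stm.append(symbol)

    def learn(self):
        """Name the STM contents by hash, store them, clear STM, return the name."""
        sequence = tuple(self.stm)
        digest = hashlib.sha1("|".join(sequence).encode("utf-8")).hexdigest()
        pattern_name = f"PTRN|{digest}"
        self.learned[pattern_name] = sequence
        self.stm = []
        return pattern_name


def flow_through(layers, chunks):
    """Learn each chunk at layers[0], then pass pattern names up one layer at a time."""
    names = []
    for chunk in chunks:
        for token in chunk:
            layers[0].observe(token)
        names.append(layers[0].learn())
    for layer in layers[1:]:
        for name in names:
            layer.observe(name)
        names = [layer.learn()]
    return names[0]


layers = [ToyLayer(f"node{i}") for i in range(4)]
top = flow_through(layers, [["the", "cat"], ["sat", "down"]])
print(top)
```

In this toy, node0 learns one pattern per chunk, and each higher layer learns a single pattern whose content is the list of names produced by the layer below, mirroring steps 3 to 6 of `process_text_sample`.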

## 6. Single-Sample Educational Demo

Process ONE sample to see exactly what happens at each step.

In [None]:
# Load one sample for demonstration using static method API
stream_iterator = StreamingDatasetLoader.load_streaming(
    dataset_key=DATASET_KEY,
    max_samples=1
)

# Get first sample
sample = next(iter(stream_iterator))

print(f"Sample text preview:")
print(f"{sample['text'][:200]}...")
print()

# Process with verbose output
result = process_text_sample(
    sample['text'],
    metadata={'source': DATASET_KEY, 'sample_id': 0},
    verbose=True
)

print(f"\nResult summary:")
print(f"  node0 patterns: {result['node0_patterns']}")
print(f"  node1 pattern: {result['node1_pattern'][:50]}...")
print(f"  node2 pattern: {result['node2_pattern'][:50]}...")
print(f"  node3 pattern: {result['node3_pattern'][:50]}...")

## 7. Batch Training (Resilient to Service Restarts)

### Automatic Retry Behavior

The `KATOClient` (v3.6.0+) now includes **production-grade retry logic** that handles KATO service restarts automatically:

**What Happens**:
1. **Service Restarts**: KATO restarts every ~10,000 requests (uvicorn `--limit-max-requests`) to prevent memory leaks
2. **Auto-Detection**: Client detects `ConnectionError` when service is restarting
3. **Health Check**: Waits for service to become healthy (up to 30 seconds)
4. **Session Recreation**: Creates new session with same configuration
5. **Transparent Retry**: Retries failed request automatically

**You'll See** (when restart occurs):
```
⚠️  KATO service connection lost (attempt 1/3)
   Likely cause: Service restart (uvicorn --limit-max-requests)
   Waiting for service to become healthy...
   ✓ Service healthy, recreating session and retrying...
```

**No Action Required**: Training continues automatically!

**Configuration**:
- Max retry attempts: 3
- Health check timeout: 30 seconds
- Exponential backoff: 0.5s → 1s → 2s

**Technical Details**:
- See `kato_client.py::_request()` for implementation
- See `kato_client.py::_wait_for_kato_healthy()` for health check logic
- Connection retry is separate from session recreation (404 errors)
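
The retry behavior described above can be sketched in a few lines. This is not the code in `kato_client.py`: the function `call_with_retry`, its parameters, and the use of Python's built-in `ConnectionError` as a stand-in for whatever the real client catches are all assumptions. Here `max_attempts` counts total attempts, so three attempts involve at most two backoff sleeps.

```python
import time


def call_with_retry(request_fn, is_healthy, on_recover=None, max_attempts=3,
                    base_delay=0.5, health_timeout=30.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call request_fn, retrying on ConnectionError once the service is healthy."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
            # Poll the health check until it passes or the timeout expires.
            deadline = clock() + health_timeout
            while not is_healthy():
                if clock() >= deadline:
                    raise TimeoutError("service did not become healthy in time")
                sleep(0.5)
            # Hook for recreating the session before the next attempt.
            if on_recover is not None:
                on_recover()


# Simulated service that fails twice (as if restarting), then succeeds.
calls = {"count": 0}
delays = []


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service restarting")
    return "ok"


result = call_with_retry(flaky_request, is_healthy=lambda: True, sleep=delays.append)
print(result, delays)
```

Passing `sleep` and `clock` as parameters keeps the sketch testable without real waiting; a production client would normally use the defaults.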

In [None]:
# Clear all STM before batch training
model.clear_all_stm()

# Training statistics
stats = {
    'samples_processed': 0,
    'total_tokens': 0,
    'node0_patterns': 0,
    'service_restarts': 0,
    'errors': 0
}

print(f"\n{'='*60}")
print(f"BATCH TRAINING WITH AUTO-RETRY")
print(f"{'='*60}\n")
print(f"ℹ️  KATOClient now handles service restarts automatically!")
print(f"   - Detects connection failures (service restart)")
print(f"   - Waits for service health check (up to 30s)")
print(f"   - Recreates session and retries transparently")
print(f"   - You'll see status messages if restarts occur\n")

# Load streaming dataset using static method API
# Note: Initial streaming info will be displayed, then clean progress bar
stream_iterator = StreamingDatasetLoader.load_streaming(
    dataset_key=DATASET_KEY,
    max_samples=MAX_SAMPLES
)

# Process samples with clean progress bar
# NOTE: Retry logic is now built into KATOClient - no manual handling needed!
for i, sample in enumerate(tqdm(stream_iterator, total=MAX_SAMPLES, desc="Training", unit="sample")):
    try:
        # Process sample (verbose=False for batch mode)
        # If KATO restarts (every ~10k requests), KATOClient will handle it automatically
        result = process_text_sample(
            sample['text'],
            metadata={'source': DATASET_KEY, 'sample_id': i},
            verbose=False
        )
        
        # Update stats
        stats['samples_processed'] += 1
        stats['node0_patterns'] += result['node0_patterns']
        
    except Exception as e:
        # Catch any errors that couldn't be auto-recovered
        print(f"\n⚠️  Error processing sample {i+1}: {str(e)[:150]}")
        stats['errors'] += 1
        
        # If too many errors, abort
        if stats['errors'] > 10:
            print(f"\n✗ Too many errors ({stats['errors']}), aborting training")
            print(f"  Check KATO server: docker logs kato --tail 50")
            break

print(f"\n{'='*60}")
print(f"TRAINING COMPLETE")
print(f"{'='*60}\n")

print(f"Statistics:")
print(f"  Samples processed: {stats['samples_processed']:,}")
print(f"  Node0 patterns: {stats['node0_patterns']:,}")
if stats['samples_processed'] > 0:
    print(f"  Avg patterns/sample: {stats['node0_patterns'] / stats['samples_processed']:.1f}")
if stats['errors'] > 0:
    print(f"  ⚠️  Failed samples: {stats['errors']}")
    
print(f"\nℹ️  Note: KATO service restarts every ~10,000 requests (uvicorn limit)")
print(f"   This is normal and handled automatically by KATOClient")

## 8. Training Complete

Training is done! Patterns have been learned and stored in KATO.

**Note**: Pattern analysis is temporarily unavailable during the KATO storage migration (MongoDB → ClickHouse + Redis). New analysis tools are coming soon.

In [None]:
print("\n" + "="*60)
print("✓ TRAINING COMPLETE!")
print("="*60)

print("\n📊 Pattern Analysis Status:")
print("   • Patterns successfully stored in KATO")
print("   • KATO storage: ClickHouse + Redis (MongoDB removed)")
print("   • Post-training analysis tools coming soon")

print("\n🔍 Next Steps:")
print("   • Use generation.ipynb to test text generation with learned patterns")
print("   • Try different prompts to see hierarchical predictions in action")
print("   • Experiment with generation parameters")

print("\n📝 Note:")
print("   Pattern counting and frequency analysis temporarily unavailable.")
print("   New analysis tools for ClickHouse + Redis are being developed.")

## 🔧 Troubleshooting: KATO Server Issues

### ✅ Good News: Most Issues Auto-Resolve!

**KATOClient (v3.6.0+)** automatically handles:
- ✅ Service restarts (every ~10k requests)
- ✅ Connection failures (detects and waits for healthy service)
- ✅ Session recreation (recreates with same config)
- ✅ Exponential backoff retry (3 attempts, 30s health check)

**You should see** automatic recovery messages like:
```
⚠️  KATO service connection lost (attempt 1/3)
   Waiting for service to become healthy...
   ✓ Service healthy, recreating session and retrying...
```

---

### 🔍 When You Still Need to Troubleshoot

**If training fails AFTER retry attempts**, try these steps:

#### 1. Check KATO Server Status

```bash
# In terminal:
docker ps | grep kato
```

If KATO is not running:
```bash
cd /path/to/kato
docker-compose up -d kato
```

#### 2. Check KATO Logs

```bash
docker logs kato --tail 50
```

Look for errors like:
- `OOM` (out of memory)
- `Exception` or `Error` (application crash)
- `Connection refused` (network issues)

#### 3. Verify Service Health

```bash
# Should return {"status": "healthy"}
curl http://kato:8000/health
```

#### 4. Restart KATO Server (Last Resort)

```bash
docker restart kato
```

Wait 10-20 seconds for it to become healthy, then re-run training.

#### 5. Check Resource Usage

```bash
# Check memory/CPU usage
docker stats kato --no-stream
```

If KATO is using >90% memory:
- Reduce `MAX_SAMPLES` in training config
- Reduce `chunk_size` in layer configuration
- Increase Docker memory limit

---

### 🛠️ Common Issues (Now Auto-Resolved)

| Issue | Old Behavior | New Behavior (v3.6.0+) |
|-------|--------------|------------------------|
| **Service restart (10k requests)** | ❌ Training fails with ConnectionError | ✅ Auto-detects, waits, recreates session, continues |
| **Session expiration** | ❌ 404 error, training stops | ✅ Auto-recreates session, restores STM, continues |
| **Network blip** | ❌ ConnectionError, immediate failure | ✅ Retries with exponential backoff (3 attempts) |
| **Service temporarily unhealthy** | ❌ Immediate failure | ✅ Waits up to 30s for health check |

---

### 📊 Understanding Service Restart Behavior

**Why does KATO restart?**
- Configured with `uvicorn --limit-max-requests 10000`
- Prevents memory leaks from accumulating
- Forces periodic cleanup (production best practice)

**What happens during restart?**
1. KATO processes 10,000th request
2. Uvicorn logs: `WARNING: Maximum request limit of 10000 exceeded. Terminating process.`
3. Container restarts (takes ~5-10 seconds)
4. Health check passes
5. Service accepts new connections

**Training impact?**
- ❌ **Before (v3.5.0)**: Training would fail at ~424 samples
- ✅ **After (v3.6.0)**: Training continues seamlessly through restart

---

### 🎯 When to Contact Support

Contact KATO developers if:
1. **Persistent failures**: Retry logic fails all 3 attempts repeatedly
2. **Service crashes**: KATO container keeps restarting (check `docker logs kato`)
3. **OOM errors**: Service running out of memory despite normal workload
4. **Data corruption**: Patterns not being stored correctly

Otherwise, the automatic retry logic should handle all transient issues!

## 9. Next Steps

### Analysis
- Once the ClickHouse + Redis analysis tools are available, open `analysis.ipynb` to visualize frequency distributions
- Inspect high-frequency patterns
- Compare training runs

### Scaling Up
- Increase `MAX_SAMPLES` (100 → 1000 → 10000)
- Add parallel workers (`NUM_WORKERS = 3`)
- Use larger datasets (C4, RefinedWeb)

### Advanced Features
- Add profiling with `ProfilingEngine`
- Enable checkpointing for long runs
- Experiment with different `chunk_size` values
- Try different recall thresholds per layer

### Generation
- Open `generation.ipynb` to generate text using learned patterns
- See how hierarchical predictions work
- Experiment with different generation strategies

## 📋 Migration Note: KATO Storage Update

**Date**: November 2025  
**Change**: KATO migrated from MongoDB to ClickHouse + Redis

### What Changed

**Storage Backend**:
- ❌ **Removed**: MongoDB (deprecated)
- ✅ **Added**: ClickHouse (pattern data) + Redis (metadata)
- 🎯 **Result**: 100-300x faster pattern queries

**KATO API**:
- ✅ **No changes** - HTTP API remains backward compatible
- ✅ Training works exactly as before
- ✅ Pattern learning and storage fully functional

### What Works Now

✅ **Training Pipeline**: Fully functional
- Tokenization, chunking, and pattern learning work
- All KATO API calls (`observe`, `learn`, `predict`) work
- Hierarchical pattern flow intact

✅ **Text Generation**: Use `generation.ipynb`
- Pattern retrieval works
- Hierarchical predictions work
- Text generation fully functional

### What's Temporarily Unavailable

⏳ **Post-Training Analysis**: Being updated
- Pattern counting (previously used MongoDB queries)
- Frequency distributions
- Pattern inspection tools

These features will be restored with new ClickHouse + Redis analysis tools.

### For Developers

**If you need pattern statistics now**:
- Option 1: Use `generation.ipynb` to verify patterns work
- Option 2: Access ClickHouse/Redis directly (see KATO server docs)
- Option 3: Wait for new analysis tools (recommended)

**Future enhancements**:
- New `tools/clickhouse_redis_analyzer.py` module
- Updated `analysis.ipynb` with new storage queries
- Restored pattern counting and frequency analysis

### Resources

- KATO migration docs: `/Users/sevakavakians/PROGRAMMING/kato/planning-docs/completed/features/2025-11-13-mongodb-removal-complete.md`
- This notebook focuses on **training** (fully working)
- Use `generation.ipynb` for **testing** (fully working)
- Watch for updates to `analysis.ipynb` (coming soon)