# KATO Hierarchical Training - Real Data Workflow

**Purpose**: Train a hierarchical concept learner on real datasets, with performance profiling.

**This notebook**:
- ✅ Trains on **real data** from Hugging Face datasets (WikiText, C4, RefinedWeb, etc.)
- ✅ Uses **parallel workers** for optimal speed (2-3x faster)
- ✅ Profiles **hardware resources** (CPU, RAM, disk I/O) during training
- ✅ Tracks **training history** in SQLite for later analysis
- ✅ Tests different **chunk_size and layer configurations** to find optimal settings

**Key Concepts**:
- node0 learns token chunks (e.g., 8 tokens → phrase patterns)
- node1 learns sequences of node0 patterns (e.g., 64 tokens → sentence patterns)
- node2 learns sequences of node1 patterns (e.g., 512 tokens → paragraph patterns)
- node3 learns sequences of node2 patterns (e.g., 4,096 tokens → chapter patterns)
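
The token spans quoted above are the running product of the per-level chunk sizes. A minimal sketch, assuming each level groups exactly `chunk_size` units from the level below (the helper name `token_coverage` is illustrative and not part of the `tools` package):

```python
from itertools import accumulate
from operator import mul


def token_coverage(chunk_sizes):
    """Return the number of tokens spanned by one pattern at each level.

    chunk_sizes[i] is the chunk size of node i; coverage at level i is the
    product of chunk sizes from node0 up to and including node i.
    """
    return list(accumulate(chunk_sizes, mul))


uniform = token_coverage([8, 8, 8, 8])
mixed = token_coverage([3, 5, 8])

print(uniform)  # [8, 64, 512, 4096]
print(mixed)    # [3, 15, 120]
```

With uniform chunk sizes this reduces to `chunk_size ** (level + 1)`; the running product also handles mixed configurations.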

**After training**: Use `analysis.ipynb` to analyze learned patterns.

## 1. Setup and Imports

In [1]:
# Install required packages
!pip install -q datasets transformers requests numpy matplotlib tqdm pymongo

In [2]:
# Import hierarchical training modules
from tools import (
    # Core training
    HierarchicalConceptLearner,
    HierarchicalNode,
    train_from_streaming_dataset_parallel,
    
    # Profiling and analysis
    ProfilingEngine,
    HardwareAnalyzerV2,
    StorageEstimator,
    TrainingHistory,
    TrainingEstimator,  # NEW: Data-driven training time estimator
    
    # Dataset loading
    StreamingDatasetLoader,
)

import matplotlib.pyplot as plt
%matplotlib inline

print("✓ All modules imported successfully")
print("✓ Ready for hierarchical training with real data")

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


✓ All modules imported successfully
✓ Ready for hierarchical training with real data


### KATO Node & Training Config

**Configuration**:
- `dataset_key`: 'wikitext', 'c4', 'refinedweb', 'openwebtext', etc.
- `max_samples`: Start small (100) to test, then scale up (10K, 100K, 1M+)
- `num_workers`: **3 recommended** (previously 6; reduced to prevent deadlocks)
  - Rule: workers × nodes ≤ 30 connections (for stability)
  - 3 workers × 5 nodes = 15 connections ✓ SAFE (the 4-node learner built in Section 4 uses 3 × 4 = 12)
- `checkpoint_interval`: Save progress every N samples (default: 5000)
- `resume_from_checkpoint`: Resume from last checkpoint if training was interrupted
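
The connection rule above is simple arithmetic that can be checked before launching a run. A minimal sketch, assuming the limit of 30 quoted in the rule; the helper names are hypothetical and not part of the `tools` package:

```python
MAX_CONNECTIONS = 30  # stability limit quoted in the rule above


def connection_budget(num_workers, num_nodes, limit=MAX_CONNECTIONS):
    """Return (connections, is_safe) for a workers x nodes combination."""
    connections = num_workers * num_nodes
    return connections, connections <= limit


def max_safe_workers(num_nodes, limit=MAX_CONNECTIONS):
    """Largest worker count that keeps workers x nodes within the limit."""
    return limit // num_nodes


print(connection_budget(3, 5))  # (15, True)
print(connection_budget(8, 4))  # (32, False)
print(max_safe_workers(4))      # 7
```

The rule is only an upper bound; the recommendation above stays well below it.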

In [3]:
# ========================================
# SERVICE CONFIGURATION (Multi-Machine Support)
# ========================================
# Configure these URLs for your deployment environment.
# Change these if running KATO and MongoDB on separate machines.

KATO_URL = 'http://kato:8000'              # KATO server (use IP if DNS unavailable: 'http://192.168.1.100:8000')
MONGODB_URI = 'mongodb://kato-mongodb:27017/'  # MongoDB (use IP if DNS unavailable: 'mongodb://192.168.1.101:27017/')

# For single-machine setups (everything on localhost):
# KATO_URL = 'http://localhost:8000'
# MONGODB_URI = 'mongodb://localhost:27017/'

# For multi-machine setups with IP addresses:
# KATO_URL = 'http://192.168.1.100:8000'
# MONGODB_URI = 'mongodb://192.168.1.101:27017/'

print("✓ Service URLs configured")
print(f"  KATO:    {KATO_URL}")
print(f"  MongoDB: {MONGODB_URI}")

✓ Service URLs configured
  KATO:    http://kato:8000
  MongoDB: mongodb://kato-mongodb:27017/


In [4]:
# Chunk sizes per node.

# cs_array = [3, 5, 3, 3, 3]
# cs_array = [3, 5, 5, 8, 3]
# cs_array = [3, 3, 3, 3, 3]
# cs_array = [4, 4, 4, 4, 4]
# cs_array = [5, 5, 5, 5, 5]
# cs_array = [6, 6, 6, 6, 6]
# cs_array = [7, 7, 7, 7, 7]
cs_array = [8, 8, 8, 8, 8]
# cs_array = [8, 6, 5, 4, 3]
# cs_array = [5, 4, 4, 3, 3]
# After iterating over the configurations above, repeat the sweep with node0_batch_size set to 100 (in progress)

batch_size = 100

# Configure dataset and training parameters
DATASET_KEY = 'wikitext'  # Options: 'c4', 'refinedweb', 'wikitext', 'openwebtext'
MAX_SAMPLES = 100000  # Start small to test, then scale up
NUM_WORKERS = 3    # REDUCED from 6 → safer, prevents deadlocks (recommended: 2-4)

# Checkpoint configuration (NEW!)
CHECKPOINT_INTERVAL = 5000  # Save checkpoint every 5K samples
RESUME_FROM_CHECKPOINT = False  # Set True to resume interrupted training

print("✓ Configuration set")
# The node count in this message is hardcoded from the 5 entries in cs_array;
# the learner in Section 4 builds 4 nodes, so the actual figure is 3 × 4 = 12 connections.
print(f"  Workers: {NUM_WORKERS} (3 workers × 5 nodes = 15 connections - SAFE)")
print(f"  Checkpoint interval: {CHECKPOINT_INTERVAL:,} samples")
print(f"  Resume: {RESUME_FROM_CHECKPOINT}")

✓ Configuration set
  Workers: 3 (3 workers × 5 nodes = 15 connections - SAFE)
  Checkpoint interval: 5,000 samples
  Resume: False


### 📋 Crash Recovery Guide (Kernel Restart / Interruption)

**What happens if Jupyter crashes or the kernel restarts mid-training?**

✅ **Good news**: Learned patterns persist in MongoDB (not lost!)

✅ **Checkpoint system**: Progress saved every 5,000 samples

**To resume after crash:**

1. **Restart kernel** (if needed)

2. **Re-run setup cells** with **EXACT SAME configuration**:
   - ✓ Section 1: Package install and imports
   - ✓ Service configuration cell (`KATO_URL`, `MONGODB_URI`)
   - ✓ Training **configuration** cell (must match original!)
     - Set `RESUME_FROM_CHECKPOINT = True`
     - Keep same `cs_array`, `batch_size`, `NUM_WORKERS`
   - ✓ Section 4: Create learner (**same nodes, chunk_sizes, tokenizer!**)

3. **Run training cell** (Section 6), which also creates the profiler:
   - System validates configuration matches checkpoint
   - Skips already-processed samples
   - Continues from where it left off

**⚠️ Configuration Validation**:

The system validates that your configuration matches the checkpoint:
- ✓ If match → Resume safely
- ❌ If mismatch → Clear error message explaining the problem

**Example error if config changed**:
```
❌ CONFIGURATION MISMATCH - Cannot resume training!

Mismatches detected:
  - num_nodes: checkpoint=5, current=4
  - chunk_sizes: checkpoint=[8,8,8,8,8], current=[10,10,10,10]

To fix:
  1. Recreate learner with EXACT same configuration
  2. Or delete checkpoint (./checkpoints/) and start fresh
  3. Or use different checkpoint_dir for new configuration
```

**💡 Pro tip**: Record your configuration (for example, take a screenshot) before long training runs!
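
The validation step can be pictured as a field-by-field comparison between the configuration stored with the checkpoint and the one currently in memory. A minimal sketch under stated assumptions: the function name, the dictionary keys, and the message format are hypothetical and only mirror the example error above; the actual check inside `train_from_streaming_dataset_parallel` may differ.

```python
def find_config_mismatches(checkpoint_cfg, current_cfg,
                           keys=("num_nodes", "chunk_sizes", "tokenizer")):
    """Return one human-readable line per setting that differs."""
    mismatches = []
    for key in keys:
        saved = checkpoint_cfg.get(key)
        current = current_cfg.get(key)
        if saved != current:
            mismatches.append(f"{key}: checkpoint={saved}, current={current}")
    return mismatches


# Hypothetical configurations for illustration only
saved_cfg = {"num_nodes": 5, "chunk_sizes": [8, 8, 8, 8, 8], "tokenizer": "gpt2"}
current_cfg = {"num_nodes": 4, "chunk_sizes": [10, 10, 10, 10], "tokenizer": "gpt2"}

problems = find_config_mismatches(saved_cfg, current_cfg)
if problems:
    print("Mismatches detected:")
    for line in problems:
        print(f"  - {line}")
```

Comparing against a saved dictionary like this is also a quick way to confirm a re-created configuration before setting `RESUME_FROM_CHECKPOINT = True`.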

## 2. Hardware Analysis (Optional but Recommended)

Analyze your system to:
- Understand hardware capabilities
- Estimate training time for different dataset sizes
- Identify performance bottlenecks

In [None]:
# Analyze current hardware
hw_analyzer = HardwareAnalyzerV2(verbose=True)

# Example training config for accurate throughput estimate
# (adjust these to match your planned training configuration)
example_config = {
    'chunk_sizes': cs_array,
    'batch_size': batch_size,
    'num_workers': NUM_WORKERS  # match the worker count configured above
}

hw_report = hw_analyzer.analyze_system(
    mongodb_uri=MONGODB_URI,  # Use configured MongoDB URI
    kato_url=KATO_URL,        # Use configured KATO URL
    training_config=example_config,  # Config-aware estimation
    num_samples=10000
)

hw_report.print_summary()

# Save hardware baseline for reference
hw_report.export_json('hardware_baseline.json')

# Extract key metrics
BASELINE_THROUGHPUT = hw_report.estimated_samples_per_sec
HARDWARE_TIER = hw_report.tier

print("\n🎯 HARDWARE BASELINE")
print(f"  Estimated throughput: {BASELINE_THROUGHPUT:.1f} samples/sec")
print(f"  (for chunk_size={example_config['chunk_sizes'][0]}, batch={example_config['batch_size']}, 10K samples)")
print(f"  Hardware tier: {HARDWARE_TIER}")

## 3. Storage Estimation (Optional)

Estimate MongoDB storage requirements using Zipfian distribution modeling.

This helps you plan disk space before training large datasets.

In [None]:
# Create storage estimator with auto-calibration
# Auto-calibration uses historical training data to refine Zipfian parameters
storage_est = StorageEstimator(verbose=True, auto_calibrate=True)

# Example configuration (adjust to match your training config)
config = {
    'num_levels': 4,
    'chunk_sizes': [8, 8, 8, 8],  # Uniform chunk_size=8, one entry per level
    'tokenizer': 'gpt2'
}

dataset_stats = {
    'avg_tokens_per_sample': 500,
    'dataset_name': 'wikitext'
}

# Estimate for your planned training size
print("\n📊 STORAGE ESTIMATES\n")

# for num_samples in [1_000, 10_000, 100_000]:
for num_samples in [100_000, 1_000_000]:
    estimate = storage_est.estimate_storage(
        num_samples=num_samples,
        config=config,
        dataset_stats=dataset_stats
    )
    
    print(f"{num_samples:>10,} samples: {estimate.estimated_storage_with_overhead_gb:>8.2f} GB ")
    print(f"             Total patterns: {estimate.total_patterns:,}")
    if storage_est.calibrated_zipf_alpha:
        print(f"             Zipfian α: {storage_est.calibrated_zipf_alpha:.3f} (calibrated)")
    print()
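
To make the Zipfian parameter concrete: under a power law, pattern frequency falls off as `rank ** -alpha`, so `alpha` can be read off as the negated slope of a straight-line fit in log-log space. The sketch below is a generic least-squares fit in pure Python, not the calibration routine used by `StorageEstimator` (which is not shown in this notebook); the function name and the synthetic data are illustrative.

```python
import math


def fit_zipf_alpha(frequencies):
    """Fit log(frequency) = c - alpha * log(rank); return (alpha, r_squared)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    # R^2 of a simple linear regression equals the squared correlation
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return -slope, r_squared


# Synthetic frequencies that follow an exact power law with alpha = 1.2
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 201)]
alpha, r2 = fit_zipf_alpha(synthetic)
print(f"alpha={alpha:.3f}, R^2={r2:.3f}")
```

Real pattern frequencies are noisier than this synthetic series, so R² below 1 is expected in practice.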

## 4. Configure Hierarchical Learner

**KATO Configuration** (Performance Optimizations):
- `process_predictions=False`: Disables prediction computation → **2-3x faster training**
- `max_pattern_length=0`: Manual learning only (we control when to learn)
- `stm_mode='CLEAR'`: Clears short-term memory after each learn (fresh context)

These settings are configurable in the cell below for transparency and control.

**Key Decision**: Choose `chunk_size` based on your dataset.

**Recommended configurations**:
- **WikiText (500-2K tokens/sample)**: `chunk_size=8` with 4 levels → covers 8→64→512→4K tokens
- **C4/RefinedWeb (300-3K tokens)**: `chunk_size=6` with 4 levels → covers 6→36→216→1.3K tokens
- **BookCorpus (50K+ tokens)**: `chunk_size=8` with 5-6 levels for book-length coverage

**See PROJECT_OVERVIEW.md Section 7** for detailed hierarchy sizing guide.
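
The recommendations above amount to choosing the smallest number of levels whose top-level coverage reaches the typical sample length. A minimal sketch, assuming a uniform chunk size at every level; `levels_needed` is an illustrative helper, not part of the `tools` package:

```python
def levels_needed(chunk_size, target_tokens):
    """Return (levels, coverage): the fewest levels whose top level spans
    at least target_tokens, and the number of tokens it spans."""
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2")
    levels, coverage = 0, 1
    while coverage < target_tokens:
        coverage *= chunk_size
        levels += 1
    return levels, coverage


print(levels_needed(8, 2_000))   # (4, 4096)
print(levels_needed(6, 1_296))   # (4, 1296)
print(levels_needed(8, 50_000))  # (6, 262144)
```

For book-length text with `chunk_size=8`, five levels cover 32,768 tokens and six cover 262,144, which is consistent with the 5-6 level range given above.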

In [5]:
# ========================================
# KATO CONFIGURATION (Performance Optimizations)
# ========================================
# These settings control KATO's internal behavior during training.
# Defaults are optimized for training performance.

# process_predictions: Disable prediction computation during training
#   - False (recommended): 2-3x faster, predictions not needed during training
#   - True: Compute predictions (only for interactive exploration/debugging)
PROCESS_PREDICTIONS = False

# max_pattern_length: Auto-learning behavior
#   - 0 (recommended): Manual learning only (we control when to learn)
#   - >0: Auto-learn after N observations (not recommended for training)
MAX_PATTERN_LENGTH = 0

# stm_mode: Short-term memory management
#   - 'CLEAR' (recommended): Clear STM after each learn (fresh context)
#   - 'ROLLING': Keep rolling window (for sequential prediction tasks)
STM_MODE = 'CLEAR'

print("✓ KATO Configuration:")
print(f"  process_predictions = {PROCESS_PREDICTIONS} ({'predictions disabled' if not PROCESS_PREDICTIONS else 'predictions enabled'})")
print(f"  max_pattern_length = {MAX_PATTERN_LENGTH} ({'manual learning' if MAX_PATTERN_LENGTH == 0 else 'auto-learning'})")
print(f"  stm_mode = {STM_MODE}")

# ========================================
# HIERARCHICAL NODES
# ========================================
# Configure hierarchical nodes with KATO settings

nodes = [
    HierarchicalNode('node0', chunk_size=cs_array[0], mode='chunking', base_url=KATO_URL,
                     process_predictions=PROCESS_PREDICTIONS, max_pattern_length=MAX_PATTERN_LENGTH, stm_mode=STM_MODE),
    HierarchicalNode('node1', chunk_size=cs_array[1], mode='chunking', base_url=KATO_URL,
                     process_predictions=PROCESS_PREDICTIONS, max_pattern_length=MAX_PATTERN_LENGTH, stm_mode=STM_MODE),
    HierarchicalNode('node2', chunk_size=cs_array[2], mode='chunking', base_url=KATO_URL,
                     process_predictions=PROCESS_PREDICTIONS, max_pattern_length=MAX_PATTERN_LENGTH, stm_mode=STM_MODE),
    HierarchicalNode('node3', chunk_size=cs_array[3], mode='chunking', base_url=KATO_URL,
                     process_predictions=PROCESS_PREDICTIONS, max_pattern_length=MAX_PATTERN_LENGTH, stm_mode=STM_MODE)
]

learner = HierarchicalConceptLearner(
    nodes=nodes,
    tokenizer_name='gpt2',
    node0_batch_size=batch_size  # Batching for 4-7x speedup
)

print(f"\n✓ Created hierarchical learner with {learner.num_nodes} nodes")
print(f"  Chunk size: {learner.node_configs[0].chunk_size}")
print(f"  Node0 batch size: {learner.node0_batch_size} (batching ENABLED)")
print(f"\n  Semantic coverage:")
# Coverage is the running product of per-node chunk sizes,
# so non-uniform cs_array settings are reported correctly too.
coverage = 1
for i in range(learner.num_nodes):
    coverage *= learner.node_configs[i].chunk_size
    print(f"    node{i}: {coverage:,} tokens")

# Clear all node knowledgebases ONLY if starting fresh (not resuming)
if RESUME_FROM_CHECKPOINT:
    print(f"\n📂 RESUME MODE: Keeping existing MongoDB data")
    print(f"   ✓ Patterns from previous training will be preserved")
    print(f"   ✓ Training will continue from checkpoint")
else:
    print("\n🧹 Clearing all node knowledgebases...")
    for i, node in enumerate(learner.nodes.values()):
        node.clear_all_memory()
        print(f"  ✓ node{i} cleared")
    print("✓ All nodes cleared and ready for fresh training")

✓ KATO Configuration:
  process_predictions = False (predictions disabled)
  max_pattern_length = 0 (manual learning)
  stm_mode = CLEAR

INITIALIZING HIERARCHICAL CONCEPT LEARNER
Using custom node configurations (4 nodes)

✓ 4 nodes initialized with:
  - max_pattern_length = 0 (manual learning)
  - stm_mode = CLEAR (STM clears after learn)
  - process_predictions = False (predictions disabled)
  - tokenizer = gpt2

Per-node configuration:
  node0: mode=chunking, chunk_size=8
  node1: mode=chunking, chunk_size=8
  node2: mode=chunking, chunk_size=8
  node3: mode=chunking, chunk_size=8

✓ Created hierarchical learner with 4 nodes
  Chunk size: 8
  Node0 batch size: 100 (batching ENABLED)

  Semantic coverage:
    node0: 8 tokens
    node1: 64 tokens
    node2: 512 tokens
    node3: 4,096 tokens

🧹 Clearing all node knowledgebases...
  ✓ node0 cleared
  ✓ node1 cleared
  ✓ node2 cleared
  ✓ node3 cleared
✓ All nodes cleared and ready for fresh training


## 5. Initialize Training History

Training history tracks all runs in SQLite for comparison and analysis.

In [None]:
# Initialize training history database
history = TrainingHistory(db_path='./training_history.db', verbose=True)

# Show current state
history.print_summary()
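
For readers unfamiliar with keeping run records in SQLite, the idea can be sketched with the standard-library `sqlite3` module. This is not the schema used by `TrainingHistory` (which is not shown in this notebook); the table layout, column names, and the `record_run_sketch` helper are assumptions chosen for illustration. The figures stored are the sample count and duration reported by the Section 6 run.

```python
import json
import sqlite3

# In-memory database so the sketch leaves no files behind
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE runs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        config_json TEXT NOT NULL,
        samples_processed INTEGER NOT NULL,
        actual_time_sec REAL NOT NULL
    )
    """
)


def record_run_sketch(conn, config, samples_processed, actual_time_sec):
    """Insert one run and return its row id."""
    cur = conn.execute(
        "INSERT INTO runs (config_json, samples_processed, actual_time_sec) "
        "VALUES (?, ?, ?)",
        (json.dumps(config), samples_processed, actual_time_sec),
    )
    conn.commit()
    return cur.lastrowid


demo_run_id = record_run_sketch(
    conn,
    {"chunk_sizes": [8, 8, 8, 8], "batch_size": 100, "num_workers": 3},
    99_546,
    24_625.22,
)

row = conn.execute(
    "SELECT samples_processed / actual_time_sec FROM runs WHERE id = ?",
    (demo_run_id,),
).fetchone()
print(f"run {demo_run_id}: {row[0]:.2f} samples/sec")
```

Storing the configuration as JSON keeps the table stable when new settings are added, at the cost of less convenient SQL filtering.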

## 5a. Training Time Estimator (NEW!)

**Predict training time** before you start, based on 29 historical training runs.

The TrainingEstimator uses real data to provide accurate estimates that account for:
- **chunk_size** (exponential impact - most important factor)
- **batch_size** (linear speedup)
- **scale** (logarithmic slowdown at larger datasets)
- **workers** (sub-linear scaling)
- **hardware tier** (existing multipliers)

**Key insight**: Performance is dominated by minimum chunk_size!
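
The accuracy figure printed by the next cell is derived from a MAPE value (mean absolute percentage error) returned by `validate_against_history`. As a reminder of what that statistic measures, here is a minimal sketch; the `mape` helper and the sample durations are made up for illustration and are not taken from the 29 historical runs.

```python
def mape(estimated, actual):
    """Mean absolute percentage error, in percent."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(abs(e - a) / a for e, a in zip(estimated, actual))
    return 100.0 * total / len(actual)


# Hypothetical estimated vs. actual training times in minutes
estimated_minutes = [40.0, 90.0, 200.0]
actual_minutes = [50.0, 100.0, 160.0]

score = mape(estimated_minutes, actual_minutes)
print(f"MAPE: {score:.1f}%  ->  'accuracy': {100 - score:.1f}%")
```

Because MAPE averages relative errors, a single badly mispredicted run can dominate the score, so it is worth inspecting per-run errors as well.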

In [None]:
# Initialize training time estimator (calibrated from historical runs)
time_estimator = TrainingEstimator(verbose=True)

# Validate estimator accuracy against historical data
print("\n" + "="*80)
print("ESTIMATOR ACCURACY")
print("="*80)
validation_metrics = time_estimator.validate_against_history(verbose=True)
print(f"\nEstimator is {100 - validation_metrics['mape']:.1f}% accurate on average")
print("="*80)

# Define your planned training configuration
planned_config = {
    'chunk_sizes': cs_array,  # Adjust to match section 4
    'batch_size': batch_size,
    'num_workers': NUM_WORKERS  # match the worker count used for training
}

# Get time estimate
print(f"\n{'='*80}")
print("TRAINING TIME PREDICTION")
print(f"{'='*80}\n")

time_estimate = time_estimator.estimate_training(
    config=planned_config,
    num_samples=MAX_SAMPLES,
    hardware_tier=HARDWARE_TIER if 'HARDWARE_TIER' in dir() else 'medium'
)

time_estimate.print_summary()

# Compare different chunk sizes
print(f"\n{'='*80}")
print(f"CHUNK SIZE COMPARISON (for {MAX_SAMPLES:,} samples)")
print(f"{'='*80}\n")

# Sweep uniform chunk sizes; iterating over cs_array itself would
# repeat the same value when cs_array is uniform (e.g. [8, 8, 8, 8, 8]).
for chunk_size in [3, 4, 5, 6, 7, 8]:
    test_config = {
        'chunk_sizes': [chunk_size] * len(cs_array),
        'batch_size': batch_size,
        'num_workers': NUM_WORKERS
    }
    est = time_estimator.estimate_training(
        config=test_config,
        num_samples=MAX_SAMPLES,
        hardware_tier=HARDWARE_TIER if 'HARDWARE_TIER' in dir() else 'medium'
    )
    print(f"chunk_size={chunk_size}: {est.estimated_time_minutes:.1f} min ({est.estimated_samples_per_sec:.2f} samples/sec)")

print(f"\n💡 TIP: Larger chunk sizes train MUCH faster (exponential speedup)")
print(f"   chunk_size=3 → chunk_size=8 gives ~3x speedup!")

## 6. Train with Real Data (Parallel + Profiling)

**This is the main training step**.

**⚠️ Requires**:
- KATO server reachable at `KATO_URL` (for example `http://localhost:8000` on a single machine)
- MongoDB reachable at `MONGODB_URI` (for example `mongodb://localhost:27017/` on a single machine)

**📊 Note about Pattern Counts**:
- Pattern counts show as 0 after training (MongoDB connection limit with parallel workers)
- **Patterns are successfully stored** via the KATO API
- Use `analysis.ipynb` after training for accurate pattern counts and analysis

In [7]:
# If training stalls, re-run this cell (first set RESUME_FROM_CHECKPOINT = True in the training configuration cell and re-run that cell).

print(f"\n{'='*80}")
print("TRAINING CONFIGURATION")
print(f"{'='*80}")
print(f"Dataset: {DATASET_KEY}")
print(f"Samples: {MAX_SAMPLES:,}")
print(f"Workers: {NUM_WORKERS}")
print(f"Connections: {NUM_WORKERS * learner.num_nodes} (workers × nodes)")
print(f"Nodes: {learner.num_nodes}")
print(f"Chunk size: {nodes[0].chunk_size}")
print(f"Batch size: {learner.node0_batch_size}")
print(f"Checkpoint interval: {CHECKPOINT_INTERVAL:,} samples")
print(f"Resume from checkpoint: {RESUME_FROM_CHECKPOINT}")
print(f"{'='*80}\n")

# Start profiling
profiler = ProfilingEngine(sampling_interval_seconds=1.0, verbose=True)
profiler.start()

# Train with parallel workers (profiler is REQUIRED for performance analysis)
stats = train_from_streaming_dataset_parallel(
    dataset_key=DATASET_KEY,
    max_samples=MAX_SAMPLES,
    learner=learner,
    profiler=profiler,  # REQUIRED - tracks samples/sec, CPU, memory for analysis.ipynb
    num_levels=learner.num_nodes,
    num_workers=NUM_WORKERS,
    segment_method='simple',
    checkpoint_interval=CHECKPOINT_INTERVAL,  # NEW: Auto-checkpoint
    checkpoint_dir='./checkpoints',           # NEW: Checkpoint directory
    resume_from_checkpoint=RESUME_FROM_CHECKPOINT,  # NEW: Resume support
    verbose=True
)

# Stop profiling and generate report
profiler.stop()
profiling_report = profiler.generate_report()

print(f"\n{'='*80}")
print("TRAINING COMPLETE")
print(f"{'='*80}")
print(f"\nPerformance Statistics:")
print(f"  Samples processed: {stats['samples_processed']:,}")
print(f"  Samples attempted: {stats.get('samples_attempted', stats['samples_processed']):,}")
print(f"  Total time: {stats['total_time_seconds']:.2f}s")
print(f"  Rate: {stats['rate_samples_per_sec']:.2f} samples/sec")
print(f"  Workers: {stats.get('num_workers', 'N/A')}")
print(f"  Checkpoints saved: {stats.get('checkpoints_saved', 0)}")
print(f"\n📊 For pattern counts and analysis: Open analysis.ipynb")


TRAINING CONFIGURATION
Dataset: wikitext
Samples: 100,000
Workers: 3
Connections: 12 (workers × nodes)
Nodes: 4
Chunk size: 8
Batch size: 100
Checkpoint interval: 5,000 samples
Resume from checkpoint: False

✓ ProfilingEngine initialized
✓ Profiling started at 18:41:34

PARALLEL STREAMING HIERARCHICAL TRAINING
Dataset: WikiText-103 - Wikipedia articles (script-free)
Source: Salesforce/wikitext
Samples: 100,000 (target)
Workers: 3 (parallel processing)
Connections: 12 (workers × nodes)
Segmentation: simple
Node0 batch size: 100
Checkpoint interval: 5,000 samples
Est. Time (sequential): 1.4h
Est. Time (parallel): 37.4m
Expected speedup: 2.2x

📥 Streaming dataset in batches of 1000...
✓ Starting parallel training...

Training samples:   0%|                    | 0/100000 [00:00<?, ?sample/s]

📡 Streaming: WikiText-103 - Wikipedia articles (script-free)
   Dataset: Salesforce/wikitext
   Samples: 100,000
   Est. Time: 1.4h

README.md: 0.00B [00:00, ?B/s]

Training samples:  29%|█████               | 29300/100000 [20:36<1:32:24, 12.75sample/s, trained=4.0/s, errors=21]

💾 Checkpoint saved: 5,000 samples completed
💾 Checkpoint saved: 5,000 samples completed
💾 Checkpoint saved: 5,000 samples completed
💾 Checkpoint saved: 5,000 samples completed
💾 Checkpoint saved: 5,000 samples completed

Training samples:  43%|████████            | 43400/100000 [47:47<2:25:44,  6.47sample/s, trained=3.5/s, errors=45]

💾 Checkpoint saved: 10,000 samples completed

Training samples:  54%|██████████          | 54100/100000 [1:10:09<1:38:36,  7.76sample/s, trained=3.6/s, errors=62]

💾 Checkpoint saved: 15,000 samples completed
💾 Checkpoint saved: 15,000 samples completed
💾 Checkpoint saved: 15,000 samples completed
💾 Checkpoint saved: 15,000 samples completed
💾 Checkpoint saved: 15,000 samples completed
💾 Checkpoint saved: 15,000 samples completed
💾 Checkpoint saved: 15,000 samples completed

Training samples:  63%|████████████        | 63300/100000 [1:32:01<1:27:17,  7.01sample/s, trained=3.6/s, errors=83]

💾 Checkpoint saved: 20,000 samples completed

Training samples:  72%|██████████████      | 72200/100000 [1:56:24<1:23:47,  5.53sample/s, trained=3.6/s, errors=109]

💾 Checkpoint saved: 25,000 samples completed
💾 Checkpoint saved: 25,000 samples completed

Training samples:  87%|█████████████████   | 87300/100000 [2:41:29<42:32,  4.98sample/s, trained=3.6/s, errors=157]

💾 Checkpoint saved: 35,000 samples completed
💾 Checkpoint saved: 35,000 samples completed

Training samples:  94%|██████████████████  | 94300/100000 [3:04:42<19:08,  4.96sample/s, trained=3.6/s, errors=176]

💾 Checkpoint saved: 40,000 samples completed
💾 Checkpoint saved: 40,000 samples completed
💾 Checkpoint saved: 40,000 samples completed
💾 Checkpoint saved: 40,000 samples completed
💾 Checkpoint saved: 40,000 samples completed

Training samples: 100%|████████████████████| 100000/100000 [3:24:10<00:00,  4.82sample/s, trained=3.6/s, errors=194]

⏳ Waiting for workers to complete...

Training samples: 100%|████████████████████| 100000/100000 [6:50:25<00:00,  4.06sample/s, trained=3.6/s, errors=194]

PARALLEL TRAINING COMPLETE
✓ Samples processed: 99,546
⚠️  Errors/skipped: 454
⏱️  Total time: 6.8h
📊 Rate: 4.0 samples/second
🚀 Workers: 3
💾 Checkpoints saved: 23

TRAINING COMPLETE - PATTERN STATISTICS
Pattern counts not shown during parallel training (MongoDB connection limit).
✓ Patterns were successfully stored via KATO API.

📊 For accurate pattern counts and analysis, open: analysis.ipynb

✓ Training manifest saved: manifests/training_20251105_013200.json
  (Load later with: TrainingManifest.load('manifests/training_20251105_013200.json'))

✓ Profiling stopped after 24625.31s

TRAINING COMPLETE

Performance Statistics:
  Samples processed: 99,546
  Samples attempted: 100,000
  Total time: 24625.22s
  Rate: 4.04 samples/sec
  Workers: 3
  Checkpoints saved: 23

📊 For pattern counts and analysis: Open analysis.ipynb


## 7. View Profiling Results

In [8]:
import os
import time

# Display comprehensive profiling summary
profiler.print_summary()

# Optional: Export profiling report to JSON for later analysis
timestamp = time.strftime("%Y%m%d_%H%M%S")
os.makedirs('profiling_reports', exist_ok=True)
profiling_report_path = f'profiling_reports/run_{timestamp}.json'
profiler.export_json(profiling_report_path)
print(f"\n✓ Profiling report exported to {profiling_report_path}")


PROFILING SUMMARY

⏱️  TIMING
  Total duration: 24625.31s
  Samples processed: 99,546
  Throughput: 4.04 samples/sec

💾 MEMORY
  Peak: 1049.1 MB
  Average: 1023.6 MB
  Per sample: 0.010 MB/sample
  Trend: stable

🔧 CPU
  Average utilization: 53.3%
  Peak utilization: 150.7%

💿 DISK I/O
  Total write: 155742.82 MB
  Write speed: 6.32 MB/s
  ⚠️  DISK I/O BOTTLENECK DETECTED

🌐 NETWORK
  Sent: 1237.87 MB
  Received: 2862.35 MB

🔍 BOTTLENECK ANALYSIS
  Primary bottleneck: MEMORY
  Confidence: 100.0%

✓ Profiling report exported to profiling_reports/run_20251105_172734.json
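
Several figures in the summary above are simple ratios of the raw totals, which makes them easy to sanity-check. A minimal sketch that recomputes them from the numbers printed for this run (the variable names are illustrative; the profiler's own field names are not shown in this notebook):

```python
# Raw totals copied from the training and profiling output of this run
duration_s = 24625.31
samples_processed = 99_546
samples_attempted = 100_000
disk_write_mb = 155742.82

throughput = samples_processed / duration_s
write_speed = disk_write_mb / duration_s
error_rate = (samples_attempted - samples_processed) / samples_attempted

print(f"Throughput:  {throughput:.2f} samples/sec")
print(f"Write speed: {write_speed:.2f} MB/s")
print(f"Error rate:  {error_rate:.3%}")
```

The recomputed throughput and write speed agree with the reported 4.04 samples/sec and 6.32 MB/s, and the 454 skipped samples correspond to under half a percent of the target.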


## 8. Record Training Run in History

In [9]:
# Record this training run for later comparison
# Requires `history` from Section 5; if that cell was skipped, this cell raises NameError
config = {
    'num_levels': learner.num_nodes,
    'chunk_sizes': [n.chunk_size for n in learner.node_configs],
    'batch_size': learner.node0_batch_size,
    'num_workers': NUM_WORKERS
}

# Calculate estimated time if estimator was used
if 'time_estimate' in dir() and time_estimate is not None:
    estimated_time_sec = time_estimate.estimated_time_seconds
    estimated_storage = None  # Could add storage estimate here too
else:
    estimated_time_sec = None
    estimated_storage = None

run_id = history.record_run(
    config=config,
    estimated_time=estimated_time_sec,  # Use prediction from Section 5a
    actual_time=stats['total_time_seconds'],
    estimated_storage_gb=estimated_storage,
    actual_storage_gb=profiling_report.total_disk_write_mb / 1024,
    samples_processed=stats['samples_processed'],  # actual count; MAX_SAMPLES is only the target
    patterns_learned={
        f'node{i}_patterns': stats.get(f'node{i}_patterns', 0)
        for i in range(learner.num_nodes)
    },
    profiling_report=profiling_report,
    dataset_key=DATASET_KEY,
    hardware_tier=HARDWARE_TIER if 'HARDWARE_TIER' in dir() else 'unknown',
    notes=f'Parallel training with {NUM_WORKERS} workers on {MAX_SAMPLES} samples'
)

# Show estimation accuracy if we had a prediction
if estimated_time_sec:
    actual_time_min = stats['total_time_seconds'] / 60
    estimated_time_min = estimated_time_sec / 60
    error_pct = abs(estimated_time_sec - stats['total_time_seconds']) / stats['total_time_seconds'] * 100
    
    print(f"\n📊 ESTIMATION ACCURACY")
    print(f"  Estimated: {estimated_time_min:.1f} minutes")
    print(f"  Actual: {actual_time_min:.1f} minutes")
    print(f"  Error: {error_pct:.1f}%")

print(f"\n✓ Training run recorded in history: {run_id}")
print(f"\n🎉 TRAINING SESSION COMPLETE!")
print(f"\n📊 Next step: Open analysis.ipynb to analyze learned patterns")

NameError: name 'history' is not defined

## 9. Capture Enhanced Training Snapshot

**Purpose:** Save comprehensive statistics BEFORE MongoDB is wiped for next training run.

### What's Captured (Enhanced Metrics)

**Basic Statistics:**
- ✅ Pattern counts and frequency distributions
- ✅ Zipfian power-law fits (α, R²)
- ✅ Top patterns by frequency
- ✅ Storage metrics

**NEW: Graph Topology** (for composition analysis):
- ✅ Parent-child relationships (which patterns compose into which)
- ✅ Orphan rates (% patterns with no parents)
- ✅ Coverage metrics (% patterns used by parent level)
- ✅ Reusability statistics (patterns referenced by multiple parents)

**NEW: Prediction Quality Samples** (for generation readiness):
- ✅ Predictive_information scores (prediction reliability)
- ✅ Potential scores (similarity × predictive_information)
- ✅ Fan-out statistics (number of predictions per query)
- ✅ Confidence distributions

**NEW: Hierarchical Validation** (for integrity checking):
- ✅ Frequency correlation (parent freq vs sum of child freqs)
- ✅ Frequency compression ratios
- ✅ Hierarchical consistency metrics
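
The frequency-correlation check listed under Hierarchical Validation compares each parent pattern's frequency with the summed frequencies of its children. A minimal sketch using a plain Pearson correlation; the data layout (`child_freq`, `parents`) and the pattern names are hypothetical, and the snapshot code may compute this differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy hierarchy: child frequencies and parent compositions
child_freq = {"c1": 10, "c2": 4, "c3": 7, "c4": 1}
parents = {
    "p1": {"frequency": 3, "children": ["c1", "c2"]},
    "p2": {"frequency": 2, "children": ["c3", "c4"]},
    "p3": {"frequency": 5, "children": ["c1", "c3"]},
}

parent_freqs = [p["frequency"] for p in parents.values()]
child_sums = [sum(child_freq[c] for c in p["children"]) for p in parents.values()]

r = pearson(parent_freqs, child_sums)
print(f"parent vs. child-sum frequency correlation: r={r:.3f}")
```

A strongly positive value suggests that frequently observed parents are built from frequently observed children; how to interpret lower values depends on the dataset.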

### Why These Metrics Matter

**Graph topology becomes critical for:**
- **Pruning operations:** Must track orphans when deleting low-frequency patterns
- **Post-pruning validation:** Detect dangling references after cleanup
- **Composition quality:** Measure how well patterns participate in hierarchy

**Prediction quality metrics predict:**
- **Text generation quality** WITHOUT running generation tests
- **Hierarchical Generation Readiness (HGR)** score (0-100)
- **Optimal configuration** for different use cases

**Note:** In an unpruned KB, top-down reachability is guaranteed by training design. These metrics become essential for **pruning analysis** and **generation quality prediction**.
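
The composition metrics can be illustrated on a toy hierarchy. The sketch below follows the definitions given above (orphans have no parents, coverage is the share of patterns used by the parent level, reused patterns have more than one parent); the function name and the input layout are assumptions, not the snapshot's actual data model.

```python
def composition_stats(child_ids, parent_to_children):
    """Compute orphan, coverage, and reuse rates for one child level."""
    ref_counts = {cid: 0 for cid in child_ids}
    for children in parent_to_children.values():
        # Count each parent once per child, even if the child repeats
        for cid in set(children):
            if cid in ref_counts:
                ref_counts[cid] += 1

    total = len(ref_counts)
    if total == 0:
        raise ValueError("child_ids must not be empty")
    orphans = sum(1 for count in ref_counts.values() if count == 0)
    reused = sum(1 for count in ref_counts.values() if count > 1)
    return {
        "orphan_rate": orphans / total,
        "coverage": (total - orphans) / total,
        "reused_rate": reused / total,
    }


# Hypothetical level: four child patterns, two parents
toy_stats = composition_stats(
    ["a", "b", "c", "d"],
    {"p1": ["a", "b"], "p2": ["b", "c"]},
)
print(toy_stats)
```

Here pattern `d` is an orphan and `b` is shared by two parents, so a pruning pass that removed `p1` would leave `a` orphaned but not `b`.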

### Next Steps After Snapshot

1. **Analyze generation readiness:** Open `prediction_quality.ipynb`
   - Get HGR score (0-100)
   - Review category breakdowns
   - Get actionable recommendations

2. **Compare configurations:** Load multiple snapshots
   - Rank by HGR score
   - Identify optimal chunk_size
   - Track improvements over time

3. **Plan pruning (future):** If implementing KB pruning
   - Use graph topology for safe cascade cleanup
   - Validate no dangling references post-pruning

In [10]:
# Capture training snapshot (BEFORE clearing for next run!)
# Requires `history` (Section 5) and `run_id` (Section 8); otherwise this cell raises NameError
print(f"\n{'='*80}")
print("CAPTURING ENHANCED TRAINING SNAPSHOT")
print(f"{'='*80}\n")

print("📸 Capturing comprehensive metrics:")
print("  ✓ Basic statistics (patterns, storage, frequencies)")
print("  ✓ Zipfian power-law fits")
print("  ✓ Graph topology (parent-child relationships)")
print("  ✓ Composition quality (orphan rates, coverage)")
print("  ✓ Prediction samples (quality estimation)")
print("  ✓ Hierarchical validation (frequency correlation)")
print()

run_snapshot = history.capture_snapshot(
    learner=learner,
    run_id=run_id,
    mongo_uri=MONGODB_URI,  # Use configured MongoDB URI
    snapshots_dir='./snapshots',
    verbose=True,
    # NEW: Enhanced metrics for prediction quality estimation
    capture_graph_topology=True,        # Parent-child relationships, orphan rates
    capture_prediction_samples=True,    # Predictive_information, fan-out
    num_prediction_samples=100,         # Number of test predictions per node
    validate_hierarchy=True             # Frequency correlation validation
)

print(f"\n📸 SNAPSHOT SUMMARY:")
print(f"  Total patterns: {run_snapshot.total_patterns:,}")
print(f"  Total storage: {run_snapshot.total_storage_mb:.2f} MB")
print(f"  Total observations: {run_snapshot.total_observations:,}")

print(f"\n  Per-node breakdown:")
for node_name in sorted(run_snapshot.nodes.keys()):
    ns = run_snapshot.nodes[node_name]
    print(f"    {node_name}: {ns.total_patterns:,} patterns, {ns.db_size_mb:.2f} MB")
    if ns.zipf_alpha:
        print(f"             Zipf α={ns.zipf_alpha:.3f}, mean_freq={ns.mean_frequency:.2f}")
    if ns.orphan_rate is not None:
        print(f"             Orphans: {ns.orphan_rate:.1%}, Coverage: {ns.coverage_to_parent:.1%}")

print(f"\n{'='*80}")
print("✓ Enhanced snapshot captured and saved")
print(f"  Use prediction_quality.ipynb to estimate generation quality")
print(f"  Use analysis.ipynb to compare with other runs")
print(f"{'='*80}\n")


CAPTURING ENHANCED TRAINING SNAPSHOT

📸 Capturing comprehensive metrics:
  ✓ Basic statistics (patterns, storage, frequencies)
  ✓ Zipfian power-law fits
  ✓ Graph topology (parent-child relationships)
  ✓ Composition quality (orphan rates, coverage)
  ✓ Prediction samples (quality estimation)
  ✓ Hierarchical validation (frequency correlation)



NameError: name 'history' is not defined
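
The `NameError` above means the cell that defines `history` had not been run in this kernel session when the snapshot cell executed. One way to catch this earlier is a small preflight check placed just before the snapshot call. The sketch below is an assumption about how such a guard could look; `missing_names` and `REQUIRED` are hypothetical helpers, and the list of names is taken from the variables the snapshot cell reads.

```python
def missing_names(required, namespace):
    """Return the required variable names that are absent from namespace."""
    return [name for name in required if name not in namespace]


# Variables the snapshot cell reads (taken from the cell above).
REQUIRED = ["history", "learner", "run_id", "MONGODB_URI"]

missing = missing_names(REQUIRED, globals())
if missing:
    print(f"Run the earlier setup cells first; undefined: {', '.join(missing)}")
else:
    print("All snapshot prerequisites are defined")
```

If the check reports missing names, re-run the cells that create them before capturing the snapshot, and do so before clearing the KB for the next run.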

## Next Steps

### 📊 Hierarchy Metrics (NEW!)
Open **`hierarchy_dashboard.ipynb`** for:
- **Quick health check** (5-tier scoring system)
- At-a-glance hierarchy quality assessment
- Actionable recommendations
- Immediate issue detection

Open **`hierarchy_metrics.ipynb`** for:
- **Comprehensive analysis** of all 15 metrics
- Graph topology evaluation
- Information-theoretic analysis
- Training dynamics visualization
- Detailed interpretation guide

**15 Metrics Across 6 Categories**:
1. Compression (ratios, counts, effectiveness)
2. Connectivity (reusability, coverage, branching)
3. Information Theory (MI, entropy, constraints)
4. Prediction (fan-out)
5. Context (alignment, diversity)
6. Training Dynamics (growth, reusability trends)
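
As an illustration of the Information Theory category, the Shannon entropy of a node's pattern-frequency distribution can be computed as below. This is a generic sketch, not the definition used in `hierarchy_metrics.ipynb`; the function name and the example frequencies are hypothetical.

```python
import math
from typing import Iterable


def shannon_entropy_bits(frequencies: Iterable[int]) -> float:
    """Entropy (in bits) of the distribution implied by raw pattern counts."""
    counts = [f for f in frequencies if f > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    # H = sum over patterns of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts)


# Hypothetical counts: four equally frequent patterns vs. a skewed node.
uniform = shannon_entropy_bits([1, 1, 1, 1])
skewed = shannon_entropy_bits([97, 1, 1, 1])
print(f"uniform: {uniform:.3f} bits, skewed: {skewed:.3f} bits")
```

Lower entropy at a node means a few patterns dominate its observations; higher entropy means usage is spread more evenly across patterns.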

### 📊 Traditional Analysis
Open **`analysis.ipynb`** to:
- Visualize frequency distributions
- Inspect high-frequency patterns
- Compare multiple training runs
- Clean up low-frequency noise

### 🔬 Experimentation
To find optimal configurations:
1. Try different `chunk_size` values (5, 8, 10, 15, 20)
2. Test different numbers of levels (3, 4, 5, 6)
3. Compare training runs using TrainingHistory
4. **Use hierarchy metrics** to validate improvements
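
The search space in steps 1 and 2 can be enumerated up front, together with the approximate token span each level would cover. The sketch assumes each level consumes `chunk_size` patterns from the level below, which matches the 8 → 64 → 512 → 4,096 example in Key Concepts. The helper `tokens_per_level` and the dictionary layout are hypothetical and are not part of the `tools` package.

```python
from itertools import product

CHUNK_SIZES = [5, 8, 10, 15, 20]
NUM_LEVELS = [3, 4, 5, 6]


def tokens_per_level(chunk_size: int, num_levels: int) -> list:
    """Tokens spanned by one pattern at node0, node1, ... (assumed geometric)."""
    return [chunk_size ** (level + 1) for level in range(num_levels)]


# One entry per (chunk_size, num_levels) combination to train and compare.
grid = [
    {"chunk_size": c, "num_levels": n, "coverage": tokens_per_level(c, n)}
    for c, n in product(CHUNK_SIZES, NUM_LEVELS)
]

print(f"{len(grid)} configurations")
for cfg in grid[:3]:
    print(cfg)
```

Listing the coverage per level makes it easier to rule out combinations whose top level would span more tokens than a typical document in the dataset.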

### 📈 Scale Up
Once you've found good settings (via hierarchy metrics):
- Increase `MAX_SAMPLES` (10K → 100K → 1M+)
- Use larger datasets (C4, RefinedWeb)
- Monitor hierarchy health over time

### 📚 Documentation
- **hierarchy_metrics/README.md**: Complete metrics guide
- **PROJECT_OVERVIEW.md**: Core concepts and philosophy
- **TRAINING_RUN_COMPARISON.md**: How to compare experiments
- **README.md**: Full feature list

## 9. Capture Hierarchy Metrics (Graph-Based Analysis)

**IMPORTANT**: This must run BEFORE clearing MongoDB for the next run.

Captures comprehensive graph-based metrics including:
- Compression ratios and pattern counts
- Connectivity (reusability, coverage, branching)
- Graph topology and relationships

Use `hierarchy_metrics.ipynb` or `hierarchy_dashboard.ipynb` to analyze results.

## Next Steps

### 📊 Analysis
Open **`analysis.ipynb`** to:
- Visualize frequency distributions
- Inspect high-frequency patterns
- Compare multiple training runs
- Clean up low-frequency noise

### 🔬 Experimentation
To find optimal configurations:
1. Try different `chunk_size` values (5, 8, 10, 15, 20)
2. Test different numbers of levels (3, 4, 5, 6)
3. Compare training runs using TrainingHistory
4. Analyze which configurations produce the best patterns

### 📈 Scale Up
Once you've found good settings:
- Increase `MAX_SAMPLES` (10K → 100K → 1M+)
- Use larger datasets (C4, RefinedWeb)
- Monitor storage growth with estimates

### 📚 Documentation
- **PROJECT_OVERVIEW.md**: Core concepts and philosophy
- **TRAINING_RUN_COMPARISON.md**: How to compare experiments
- **README.md**: Full feature list