# 02 - Data Preprocessing

This notebook walks through the data preprocessing pipeline, from raw G-code and sensor data to train/validation/test splits, and then explores the processed output.

## Learning Objectives
- Understand the preprocessing pipeline
- Learn about feature extraction from sensor data
- Create vocabulary from G-code tokens
- Generate train/validation/test splits
- Visualize preprocessed data

## Preprocessing Steps
1. Load raw G-code and sensor data
2. Tokenize G-code sequences
3. Build vocabulary
4. Extract and normalize sensor features
5. Create sliding windows
6. Split into train/val/test sets
7. Save processed data
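
Steps 4 and 5 can be sketched on synthetic data as follows. This is a minimal illustration, not the project's implementation: the helper names `zscore_normalize` and `sliding_windows`, the window size, and the stride are all assumptions chosen for the example.

```python
import numpy as np

def zscore_normalize(features, eps=1e-8):
    """Standardize each feature column to zero mean and unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

def sliding_windows(array, window, stride):
    """Return overlapping windows stacked as (n_windows, window, ...).

    Assumes len(array) >= window; otherwise there is nothing to stack.
    """
    starts = range(0, len(array) - window + 1, stride)
    return np.stack([array[s:s + window] for s in starts])

# Synthetic stand-in for sensor readings: 10 timesteps, 3 channels
rng = np.random.default_rng(0)
sensor = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

normalized = zscore_normalize(sensor)
windows = sliding_windows(normalized, window=4, stride=2)
print(windows.shape)  # (4, 4, 3)
```

In a real pipeline the mean and standard deviation would normally be computed on the training split only and reused for validation and test data, so that no statistics leak across splits.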

In [None]:
# Setup
import sys
from pathlib import Path

project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

print(f"Project root: {project_root}")

In [None]:
# Imports
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

sns.set_style('whitegrid')
print("✓ Imports successful")

## 1. Understanding the Preprocessing Module

In [None]:
# View preprocessing module structure
preprocessing_module = project_root / 'src' / 'miracle' / 'dataset' / 'preprocessing.py'

if preprocessing_module.exists():
    print("Preprocessing module found!")
    print(f"Location: {preprocessing_module}")
    print("\nKey functions:")
    print("  - load_gcode(): Load G-code files")
    print("  - tokenize(): Convert G-code to tokens")
    print("  - build_vocabulary(): Create token-to-ID mapping")
    print("  - extract_features(): Process sensor data")
    print("  - create_sequences(): Generate training samples")
else:
    print("Preprocessing module not found")
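
To make the idea of a token-to-ID mapping concrete, here is a minimal sketch of how a vocabulary could be built from tokenized sequences. It does not reproduce `build_vocabulary()` from the module; the function names `build_vocab_sketch` and `encode`, the special tokens `<PAD>` and `<UNK>`, and the frequency-based ordering are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical special tokens; the project's vocabulary may use different ones
SPECIAL_TOKENS = ["<PAD>", "<UNK>"]

def build_vocab_sketch(token_sequences, min_count=1):
    """Map each token to an integer ID, most frequent tokens first."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    # Sort by descending count, then alphabetically, so IDs are reproducible
    for tok, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Convert tokens to IDs, falling back to <UNK> for unseen tokens."""
    unk = vocab["<UNK>"]
    return [vocab.get(tok, unk) for tok in tokens]

sequences = [["G0", "X10.5", "Y20.3"], ["G1", "X10.5", "F1500"]]
vocab_sketch = build_vocab_sketch(sequences)
encoded = encode(["G1", "X10.5", "Z5.0"], vocab_sketch)
print(encoded)  # [5, 2, 1]
```

`Z5.0` never appears in the toy sequences, so it maps to the `<UNK>` ID.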

## 2. Running Preprocessing

The preprocessing script can be run from the command line or imported as a module.

In [None]:
# Command-line preprocessing (shown for reference)
print("To run preprocessing from command line:")
print()
print("PYTHONPATH=src .venv/bin/python -m miracle.dataset.preprocessing \\")
print("    --data-dir data \\")
print("    --output-dir outputs/processed_demo \\")
print("    --vocab-path data/vocabulary.json \\")
print("    --max-seq-length 64 \\")
print("    --train-ratio 0.7 \\")
print("    --val-ratio 0.15")
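
With `--train-ratio 0.7` and `--val-ratio 0.15`, the remaining 15% of samples presumably form the test set. The sketch below shows one way such a split could be computed; the function `split_indices`, the shuffling, and the seed are assumptions, and the actual script may split differently (for example by file rather than by sample).

```python
import numpy as np

def split_indices(n_samples, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Shuffle sample indices and cut them into train/val/test parts."""
    if train_ratio + val_ratio >= 1.0:
        raise ValueError("train_ratio + val_ratio must leave room for a test set")
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(n_samples * train_ratio))
    n_val = int(round(n_samples * val_ratio))
    # Whatever is left after train and val becomes the test set
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```

When samples are overlapping sliding windows, a per-sample shuffle like this can place near-identical windows in different splits, which is one reason a pipeline might split at the file level instead.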

## 3. Exploring Preprocessed Data

In [None]:
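# Find preprocessed data
import glob

# Sort the matches so the dataset picked below is the same on every run,
# and keep only directories
processed_dirs = sorted(
    d for d in glob.glob(str(project_root / 'outputs' / 'processed*'))
    if Path(d).is_dir()
)

if processed_dirs:
    print(f"Found {len(processed_dirs)} preprocessed dataset(s):")
    for d in processed_dirs:
        print(f"  - {Path(d).name}")
    
    # Use the first one (alphabetical order)
    processed_dir = Path(processed_dirs[0])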
    print(f"\nUsing: {processed_dir.name}")
else:
    print("No preprocessed data found. Run preprocessing first.")
    processed_dir = None

In [None]:
# Load preprocessed data
if processed_dir:
    train_data = np.load(processed_dir / 'train_sequences.npz', allow_pickle=True)
    val_data = np.load(processed_dir / 'val_sequences.npz', allow_pickle=True)
    test_data = np.load(processed_dir / 'test_sequences.npz', allow_pickle=True)
    
    print("Data loaded successfully!")
    print(f"\nTrain set:")
    print(f"  Samples: {len(train_data['tokens'])}")
    print(f"  Keys: {list(train_data.keys())}")
    
    print(f"\nValidation set: {len(val_data['tokens'])} samples")
    print(f"Test set: {len(test_data['tokens'])} samples")

In [None]:
# Examine data structure
if processed_dir:
    print("Sample data structure:")
    print(f"\nTokens shape: {train_data['tokens'][0].shape}")
    print(f"Continuous features shape: {train_data['continuous'][0].shape}")
    print(f"Categorical features shape: {train_data['categorical'][0].shape}")
    
    print(f"\nFirst token sequence (truncated):")
    print(train_data['tokens'][0][:10])
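
For reference, an `.npz` file with the same three keys can be written and read back as sketched below. The array shapes and dtypes here are illustrative assumptions, not the shapes produced by the preprocessing script.

```python
import tempfile
from pathlib import Path

import numpy as np

# Synthetic arrays; shapes and dtypes are assumptions for illustration
n_samples, seq_len = 8, 64
demo = {
    "tokens": np.zeros((n_samples, seq_len), dtype=np.int64),
    "continuous": np.zeros((n_samples, seq_len, 5), dtype=np.float32),
    "categorical": np.zeros((n_samples, seq_len, 2), dtype=np.int64),
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_sequences.npz"
    np.savez_compressed(path, **demo)
    # The context manager closes the underlying zip file when done
    with np.load(path) as loaded:
        shapes = {key: loaded[key].shape for key in loaded.files}

print(shapes)
```

Plain numeric arrays like these load without `allow_pickle=True`. That flag is only needed for object arrays, and it should only be enabled for files from a trusted source because unpickling can execute arbitrary code.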

## 4. Visualizing Sensor Features

In [None]:
# Analyze continuous features
if processed_dir:
    sample_continuous = train_data['continuous'][0]
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Plot up to four features (every 30th column) against the timestep
    for i, ax in enumerate(axes.flat):
        feature_idx = i * 30  # Spread the panels across the feature columns
        if feature_idx < sample_continuous.shape[1]:
            ax.plot(sample_continuous[:, feature_idx], alpha=0.7)
            ax.set_title(f"Continuous Feature {feature_idx}", fontweight='bold')
            ax.set_xlabel('Timestep')
            ax.set_ylabel('Value')
            ax.grid(alpha=0.3)
        else:
            ax.set_visible(False)  # Hide panels with no matching feature
    
    plt.tight_layout()
    plt.show()

## 5. Analyzing Sequence Lengths

In [None]:
# Sequence length distribution
if processed_dir:
    seq_lengths = [len(seq) for seq in train_data['tokens']]
    
    plt.figure(figsize=(12, 5))
    plt.hist(seq_lengths, bins=30, color='steelblue', alpha=0.7, edgecolor='black')
    plt.axvline(np.mean(seq_lengths), color='red', linestyle='--', label=f'Mean: {np.mean(seq_lengths):.1f}')
    plt.axvline(np.median(seq_lengths), color='green', linestyle='--', label=f'Median: {np.median(seq_lengths):.1f}')
    plt.xlabel('Sequence Length', fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.title('Distribution of Token Sequence Lengths', fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print(f"\nSequence length statistics:")
    print(f"  Min: {min(seq_lengths)}")
    print(f"  Max: {max(seq_lengths)}")
    print(f"  Mean: {np.mean(seq_lengths):.1f}")
    print(f"  Std: {np.std(seq_lengths):.1f}")
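
If the preprocessing step pads every sequence to `--max-seq-length`, `len(seq)` is the same for all samples and the histogram collapses to a single bar. In that case, counting non-padding tokens gives a more informative picture. The sketch below assumes a padding ID of 0, which should be checked against the vocabulary file.

```python
import numpy as np

PAD_ID = 0  # assumption: check the vocabulary file for the real padding ID

def effective_lengths(token_batch, pad_id=PAD_ID):
    """Count the non-padding tokens in each row of a 2-D token array."""
    return (np.asarray(token_batch) != pad_id).sum(axis=1)

# Toy batch of three padded sequences
padded = np.array([
    [5, 2, 7, 0, 0, 0],
    [3, 3, 0, 0, 0, 0],
    [9, 4, 6, 8, 1, 2],
])
lengths = effective_lengths(padded)
print(lengths)  # [3 2 6]
```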

## 6. Vocabulary Statistics

In [None]:
# Load and analyze vocabulary
vocab_path = project_root / 'data' / 'vocabulary.json'

if vocab_path.exists():
    with open(vocab_path, 'r') as f:
        vocab = json.load(f)
    
    # Count token usage in the first 100 training sequences
    if processed_dir:
        all_tokens = []
        for seq in train_data['tokens'][:100]:
            # Convert to plain Python ints so they match the vocabulary IDs
            all_tokens.extend(np.asarray(seq).ravel().tolist())
        
        token_counts = Counter(all_tokens)
        
        print(f"Vocabulary size: {len(vocab)}")
        print(f"Tokens used in sample: {len(token_counts)}")
        print(f"\nTop 20 most common tokens:")
        
        # Get reverse mapping
        id_to_token = {v: k for k, v in vocab.items()}
        
        for token_id, count in token_counts.most_common(20):
            token_str = id_to_token.get(token_id, f'<ID:{token_id}>')
            print(f"  {token_str:15s}: {count}")
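
The same reverse mapping can turn a whole sequence of IDs back into readable G-code, which is a quick sanity check on the encoding. The sketch below uses a toy vocabulary and assumes, as the cell above does, that the vocabulary file is a flat token-to-ID dictionary; the `decode` helper and the `<UNK>` placeholder are hypothetical.

```python
def decode(token_ids, id_to_token, unknown="<UNK>"):
    """Join the tokens for a sequence of IDs into one string."""
    return " ".join(id_to_token.get(int(i), unknown) for i in token_ids)

# Toy vocabulary standing in for data/vocabulary.json
toy_vocab = {"<PAD>": 0, "G0": 1, "X10.5": 2, "Y20.3": 3}
toy_id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

decoded = decode([1, 2, 3, 99], toy_id_to_token)
print(decoded)  # G0 X10.5 Y20.3 <UNK>
```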

## 7. Hands-On: Custom Preprocessing

As a first step toward building a mini-dataset from a small subset of data, write a simple tokenizer for individual G-code lines.

In [None]:
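# Example: Simple tokenization function
import re

# One G-code word: an address letter followed by a signed decimal number
GCODE_WORD = re.compile(r'[A-Z][+-]?(?:\d+\.?\d*|\.\d+)')

def simple_tokenize(gcode_line):
    """Tokenize a single whitespace-separated G-code line."""
    # Remove comments: ';' to end of line, and inline '(...)' comments
    line = re.sub(r';.*', '', gcode_line)
    line = re.sub(r'\(.*?\)', '', line)
    
    # Keep only tokens that are complete G-code words, e.g. G0 or X10.5
    tokens = []
    for token in line.upper().split():
        # fullmatch rejects tokens with trailing junk such as 'X10abc'
        if GCODE_WORD.fullmatch(token):
            tokens.append(token)
    
    return tokens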

# Test
test_line = "G0 X10.5 Y20.3 Z5.0 F1500 ; Move to position"
tokens = simple_tokenize(test_line)
print(f"Input: {test_line}")
print(f"Tokens: {tokens}")
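
The tokenizer above splits on whitespace, so it would miss words in compact lines such as `G1X10.5Y-2.0`, which some G-code writers emit. A pattern-based variant can be sketched as follows; `tokenize_compact` and its regular expression are illustrative assumptions, not the project's `tokenize()` function.

```python
import re

# An address letter, optional spaces, then a signed decimal number
WORD_PATTERN = re.compile(r'([A-Za-z])\s*([+-]?(?:\d+\.?\d*|\.\d+))')

def tokenize_compact(gcode_line):
    """Extract G-code words even when they are not separated by spaces."""
    # Remove comments: ';' to end of line, and inline '(...)' comments
    line = re.sub(r';.*', '', gcode_line)
    line = re.sub(r'\(.*?\)', '', line)
    # Normalize the address letter to upper case and rejoin it with its value
    return [letter.upper() + value
            for letter, value in WORD_PATTERN.findall(line)]

compact_tokens = tokenize_compact("g1x10.5y-2.0f1500 (feed move)")
print(compact_tokens)  # ['G1', 'X10.5', 'Y-2.0', 'F1500']
```

This variant is deliberately lenient: it extracts anything that looks like a letter followed by a number, so it is a starting point rather than a validating parser.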

## Summary

You learned:
- How the preprocessing pipeline works
- How to load and explore preprocessed data
- How sensor features are structured
- How the vocabulary is built and used
- How token sequence lengths are distributed

## Next Steps

Continue to **03_training_models.ipynb** to learn how to train models on this preprocessed data.

## Troubleshooting

- **No preprocessed data**: Run `PYTHONPATH=src .venv/bin/python -m miracle.dataset.preprocessing ...` with the arguments shown in Section 2
- **Vocabulary mismatch**: Ensure you use the same vocabulary file for both preprocessing and training
- **Memory errors**: Reduce `--max-samples` in preprocessing arguments
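
A vocabulary mismatch can often be detected before training by checking that every token ID in the data exists in the vocabulary. The sketch below uses toy arrays and a hypothetical helper, `check_token_ids`; it assumes the vocabulary is a flat token-to-ID dictionary.

```python
import numpy as np

def check_token_ids(token_array, vocab):
    """Return the token IDs that appear in the data but not in the vocabulary."""
    known_ids = set(vocab.values())
    used_ids = set(np.unique(token_array).tolist())
    return sorted(used_ids - known_ids)

# Toy vocabulary and token arrays for illustration
mini_vocab = {"<PAD>": 0, "G0": 1, "G1": 2}
ok_tokens = np.array([[1, 2, 0], [2, 2, 0]])
bad_tokens = np.array([[1, 7, 0], [2, 9, 0]])

print(check_token_ids(ok_tokens, mini_vocab))   # []
print(check_token_ids(bad_tokens, mini_vocab))  # [7, 9]
```

A non-empty result suggests the data was encoded with a different vocabulary file than the one being loaded.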