# 02 - Data Preprocessing

**Complete walkthrough of the data preprocessing pipeline.**

## Learning Objectives
- Load and inspect raw sensor + G-code data
- Understand the 4-digit hybrid G-code tokenization
- Configure windowing and stride parameters
- Create multilabel stratified train/validation/test splits
- Validate data quality and coverage
- Visualize the preprocessing pipeline

## Table of Contents
1. [Raw Data Loading](#1.-Raw-Data-Loading)
2. [Sensor Feature Extraction](#2.-Sensor-Feature-Extraction)
3. [G-code Tokenization (4-Digit Hybrid)](#3.-G-code-Tokenization-(4-Digit-Hybrid))
4. [Vocabulary Management](#4.-Vocabulary-Management)
5. [Window Configuration](#5.-Window-Configuration)
6. [Multilabel Stratified Splitting](#6.-Multilabel-Stratified-Splitting)
7. [Data Quality Validation](#7.-Data-Quality-Validation)
8. [Pipeline Visualization](#8.-Pipeline-Visualization)

In [None]:
# Setup and Environment Check
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

# Environment info
print(f"Python: {sys.version.split()[0]}")
print(f"Project root: {project_root}")

# Reproducibility
SEED = 42

In [None]:
# Imports
import os
import json
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

np.random.seed(SEED)
print("✓ Imports successful")

---
## 1. Raw Data Loading

Load and inspect the raw CSV files containing aligned sensor + G-code data.

In [None]:
# Find all raw data files
data_dir = project_root / 'data'
csv_files = sorted(data_dir.glob('*.csv'))

print(f"Found {len(csv_files)} data files in {data_dir}")
print("\nFiles by operation type:")

# Categorize by operation
files_by_op = defaultdict(list)
for f in csv_files:
    name = f.stem
    if 'Face' in name:
        files_by_op['Face'].append(f)
    elif 'Pocket' in name:
        files_by_op['Pocket'].append(f)
    elif 'Adaptive' in name:
        files_by_op['Adaptive'].append(f)
    elif 'Damaged' in name:
        files_by_op['Damaged'].append(f)
    else:
        files_by_op['Other'].append(f)

for op, files in files_by_op.items():
    print(f"  {op}: {len(files)} files")

In [None]:
# Load a sample file to inspect structure
if csv_files:
    sample_file = csv_files[0]
    df = pd.read_csv(sample_file)
    
    print(f"Sample file: {sample_file.name}")
    print(f"Shape: {df.shape}")
    print(f"\nColumns ({len(df.columns)}):")
    
    # Categorize columns
    gcode_cols = [c for c in df.columns if 'gcode' in c.lower() or 'token' in c.lower()]
    sensor_cols = [c for c in df.columns if c not in gcode_cols]
    
    print(f"  G-code columns: {len(gcode_cols)}")
    print(f"  Sensor columns: {len(sensor_cols)}")
    
    print(f"\nFirst few rows:")
    display(df.head())

In [None]:
# Inspect column data types
if csv_files:
    print("Column Data Types:")
    print("="*50)
    
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    string_cols = df.select_dtypes(include=['object']).columns.tolist()
    
    print(f"\nNumeric columns ({len(numeric_cols)}):")
    for col in numeric_cols[:10]:
        print(f"  {col}: {df[col].dtype}")
    if len(numeric_cols) > 10:
        print(f"  ... and {len(numeric_cols) - 10} more")
    
    print(f"\nString columns ({len(string_cols)}):")
    for col in string_cols:
        print(f"  {col}: {df[col].dtype}")

---
## 2. Sensor Feature Extraction

Extract and normalize continuous and categorical sensor features.
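
When the normalized features later feed train/validation/test splits, a common practice is to compute the statistics on the training portion only and reuse them for the other splits. The sketch below illustrates that idea on synthetic data; `fit_zscore` and `apply_zscore` are hypothetical helper names, and whether the project's preprocessing scripts follow this pattern is an assumption, not something this notebook confirms.

```python
import numpy as np

def fit_zscore(train):
    """Return per-feature mean and std computed over the rows of `train`."""
    # Note: NumPy's std defaults to ddof=0, while pandas' Series.std uses ddof=1.
    return train.mean(axis=0), train.std(axis=0)

def apply_zscore(x, mean, std, eps=1e-8):
    """Standardize `x` with previously fitted statistics."""
    return (x - mean) / (std + eps)

# Synthetic stand-ins for three continuous sensor columns (not real project data).
rng = np.random.default_rng(42)
train = rng.normal(loc=50.0, scale=5.0, size=(1000, 3))
val = rng.normal(loc=50.0, scale=5.0, size=(200, 3))

mean, std = fit_zscore(train)
train_z = apply_zscore(train, mean, std)
val_z = apply_zscore(val, mean, std)   # reuses the training statistics

print("train mean:", train_z.mean(axis=0).round(3))
print("train std: ", train_z.std(axis=0).round(3))
print("val mean:  ", val_z.mean(axis=0).round(3))
```

The validation columns end up close to, but not exactly at, zero mean, which is expected because their statistics were never used for fitting.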

In [None]:
# Identify sensor feature types
if csv_files:
    # Typical sensor columns
    continuous_candidates = [
        'SpindleLoad', 'ActFeedRate', 'Xact', 'Yact', 'Zact',
        'ActSpindleSpeed', 'ToolLength', 'ToolRadius'
    ]
    
    # Find actual columns
    continuous_cols = [c for c in df.columns if any(cc in c for cc in continuous_candidates)]
    
    print(f"Continuous sensor features ({len(continuous_cols)}):")
    for col in continuous_cols[:10]:
        stats = df[col].describe()
        print(f"  {col}: mean={stats['mean']:.2f}, std={stats['std']:.2f}")
    
    if len(continuous_cols) > 10:
        print(f"  ... and {len(continuous_cols) - 10} more features")

In [None]:
# Visualize sensor feature distributions
if csv_files and continuous_cols:
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    
    plot_cols = continuous_cols[:6]
    
    for ax, col in zip(axes.flat, plot_cols):
        data = df[col].dropna()
        ax.hist(data, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
        ax.axvline(data.mean(), color='red', linestyle='--', label=f'Mean: {data.mean():.1f}')
        ax.set_xlabel(col, fontsize=10)
        ax.set_ylabel('Frequency')
        ax.legend(fontsize=8)
        ax.grid(alpha=0.3)
    
    plt.suptitle('Sensor Feature Distributions', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

In [None]:
# Feature normalization demonstration
def normalize_features(df, cols, method='zscore'):
    """Normalize features using z-score or min-max."""
    normalized = df[cols].copy()
    
    if method == 'zscore':
        for col in cols:
            mean = normalized[col].mean()
            std = normalized[col].std()
            normalized[col] = (normalized[col] - mean) / (std + 1e-8)
    elif method == 'minmax':
        for col in cols:
            min_val = normalized[col].min()
            max_val = normalized[col].max()
            normalized[col] = (normalized[col] - min_val) / (max_val - min_val + 1e-8)
    
    return normalized

if csv_files and continuous_cols:
    # Demonstrate normalization
    sample_cols = continuous_cols[:3]
    normalized_df = normalize_features(df, sample_cols, method='zscore')
    
    print("Before normalization:")
    print(df[sample_cols].describe().round(2))
    
    print("\nAfter z-score normalization:")
    print(normalized_df.describe().round(2))

---
## 3. G-code Tokenization (4-Digit Hybrid)

The current system uses **4-digit hybrid tokenization** for better numeric precision.

### Token Structure

Each G-code sequence is tokenized into 7 positions:
```
[Type, Command, Param, Sign, Digit1, Digit2, Digit3]
```

- **Type**: SPECIAL, COMMAND, PARAM, NUMERIC (4 classes)
- **Command**: G0, G1, G3, G53, M30, NONE (6 classes)
- **Param Type**: X, Y, Z, F, R, NONE, etc. (10 classes)
- **Sign**: +, -, NONE (3 classes)
- **Digits**: 0-9 for each of the three digit positions (10 classes each)
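
To make the 7-position layout concrete, the sketch below turns one token description into integer indices and rebuilds the numeric value from the sign and digit slots. The index tables are hypothetical: the real mappings come from the project's vocabulary file, and only the parameter classes named above are listed here.

```python
# Hypothetical index tables for illustration; the project's vocabulary may order them differently.
TYPES = ['SPECIAL', 'COMMAND', 'PARAM', 'NUMERIC']
COMMANDS = ['G0', 'G1', 'G3', 'G53', 'M30', 'NONE']
PARAMS = ['X', 'Y', 'Z', 'F', 'R', 'NONE']   # only the classes named in the text
SIGNS = ['+', '-', 'NONE']

def encode_token(tok):
    """Map a token dict to [type, command, param, sign, d1, d2, d3] indices."""
    return [
        TYPES.index(tok['type']),
        COMMANDS.index(tok['command']),
        PARAMS.index(tok['param_type']),
        SIGNS.index(tok['sign']),
        *tok['digits'],
    ]

def decode_value(sign, digits):
    """Rebuild the integer value carried by the sign and the three digit slots."""
    magnitude = digits[0] * 100 + digits[1] * 10 + digits[2]
    return -magnitude if sign == '-' else magnitude

# Token description for a parameter such as "Y-45.2" (integer part only).
tok = {'type': 'PARAM', 'command': 'NONE', 'param_type': 'Y',
       'sign': '-', 'digits': [0, 4, 5]}

vec = encode_token(tok)
print(vec)                                        # [2, 5, 1, 1, 0, 4, 5]
print(decode_value(tok['sign'], tok['digits']))   # -45
```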

In [None]:
# 4-digit hybrid tokenization (used by the model)
def tokenize_gcode_4digit(gcode_str):
    """Tokenize G-code using 4-digit hybrid encoding.
    
    Returns a dict with token components:
    - type: SPECIAL, COMMAND, PARAM, NUMERIC
    - command: G0, G1, G3, G53, M30, NONE
    - param_type: X, Y, Z, F, R, S, NONE
    - sign: +, -, NONE
    - digits: [d1, d2, d3] for numeric values
    """
    if pd.isna(gcode_str) or not isinstance(gcode_str, str):
        return {'type': 'SPECIAL', 'command': 'NONE', 'param_type': 'NONE', 
                'sign': 'NONE', 'digits': [0, 0, 0]}
    
    # Remove comments
    gcode_str = re.sub(r';.*', '', gcode_str)
    gcode_str = re.sub(r'\(.*?\)', '', gcode_str)
    gcode_str = gcode_str.strip().upper()
    
    result = {
        'type': 'SPECIAL',
        'command': 'NONE',
        'param_type': 'NONE',
        'sign': 'NONE',
        'digits': [0, 0, 0]
    }
    
    # Match G/M commands
    cmd_match = re.match(r'([GM])(\d+)', gcode_str)
    if cmd_match:
        result['type'] = 'COMMAND'
        cmd_num = int(cmd_match.group(2))
        cmd_letter = cmd_match.group(1)
        result['command'] = f"{cmd_letter}{cmd_num}"
        return result
    
    # Match parameters with values
    param_match = re.match(r'([XYZFRS])(-?\d+\.?\d*)', gcode_str)
    if param_match:
        result['type'] = 'PARAM'
        result['param_type'] = param_match.group(1)
        
        value_str = param_match.group(2)
        value = float(value_str)
        
        # Sign
        result['sign'] = '-' if value < 0 else '+'
        
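        # Keep only the integer part and encode its last three digits.
        # The fractional part is dropped and values >= 1000 wrap modulo 1000
        # (e.g. F1500 -> [5, 0, 0]).
        abs_val = int(abs(value)) % 1000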
        result['digits'] = [
            (abs_val // 100) % 10,
            (abs_val // 10) % 10,
            abs_val % 10
        ]
        return result
    
    return result

# Test tokenization
test_gcodes = [
    "G0",
    "G1",
    "X123.5",
    "Y-45.2",
    "Z5.0",
    "F1500",
    "M30"
]

print("4-Digit Hybrid Tokenization Examples:")
print("="*70)
for gcode in test_gcodes:
    tokens = tokenize_gcode_4digit(gcode)
    print(f"\nInput: {gcode:12s}")
    print(f"  Type: {tokens['type']}, Command: {tokens['command']}, "
          f"Param: {tokens['param_type']}, Sign: {tokens['sign']}, Digits: {tokens['digits']}")

---
## 4. Vocabulary Management

The current vocabulary uses **4-digit hybrid encoding** (`vocabulary_4digit_hybrid.json`).

In [None]:
# Load the 4-digit hybrid vocabulary
vocab_path = project_root / 'data' / 'vocabulary_4digit_hybrid.json'

if vocab_path.exists():
    with open(vocab_path, 'r') as f:
        vocab = json.load(f)
    
    print(f"Loaded vocabulary: {vocab_path.name}")
    print(f"Total entries: {len(vocab)}")
    
    # The vocabulary contains mappings for:
    # - Types (4): SPECIAL, COMMAND, PARAM, NUMERIC
    # - Commands (6): G0, G1, G3, G53, M30, NONE
    # - Param types (10+): X, Y, Z, F, R, NONE, etc.
    # - Digits (10): 0-9
    # - Signs (3): +, -, NONE
    
    print("\nVocabulary structure:")
    for key, value in list(vocab.items())[:20]:
        print(f"  {key}: {value}")
else:
    print(f"Vocabulary not found: {vocab_path}")
    # Try alternative paths
    alt_paths = list(project_root.glob('data/*.json'))
    print(f"Available vocab files: {[p.name for p in alt_paths]}")

In [None]:
# Model output head dimensions
print("Multi-Head Model Output Dimensions:")
print("="*50)
print(f"  Token Types:   4 classes (SPECIAL, COMMAND, PARAM, NUMERIC)")
print(f"  Commands:      6 classes (G0, G1, G3, G53, M30, NONE)")
print(f"  Param Types:  10 classes (X, Y, Z, F, R, NONE, ...)")
print(f"  Signs:         3 classes (+, -, NONE)")
print(f"  Digits:       10 classes (0-9) x 3 positions")
print(f"\nOperation Types: 9 classes (from frozen encoder)")

---
## 5. Window Configuration

Configure sliding window parameters for sequence creation.
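
For a sequence of length `L`, window size `W` and stride `S`, the number of full windows is `(L - W) // S + 1` when `L >= W`, and zero otherwise. The sketch below checks that count against NumPy's `sliding_window_view`; `count_windows` is a hypothetical helper name and the 500-point signal is synthetic.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def count_windows(seq_len, window_size, stride):
    """Number of full windows that fit in a sequence of length seq_len."""
    if seq_len < window_size:
        return 0
    return (seq_len - window_size) // stride + 1

signal = np.arange(500)   # synthetic stand-in for one sensor column

# Build every length-64 window, then keep every 16th start position.
windows = sliding_window_view(signal, 64)[::16]

print(count_windows(500, 64, 16))   # 28
print(windows.shape)                # (28, 64)
print(count_windows(63, 64, 16))    # 0
```

`sliding_window_view` returns a read-only view, so this avoids copying the signal once per window.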

In [None]:
# Window/stride configuration
WINDOW_SIZE = 64   # Number of timesteps per sample
STRIDE = 16        # Step between windows

print("Window Configuration:")
print("="*50)
print(f"  Window size: {WINDOW_SIZE} timesteps")
print(f"  Stride: {STRIDE} timesteps")
print(f"  Overlap: {WINDOW_SIZE - STRIDE} timesteps ({(WINDOW_SIZE - STRIDE)/WINDOW_SIZE*100:.0f}%)")

# Calculate samples from a sequence
if csv_files:
    seq_len = len(df)
    # No full window fits when the sequence is shorter than WINDOW_SIZE
    n_windows = (seq_len - WINDOW_SIZE) // STRIDE + 1 if seq_len >= WINDOW_SIZE else 0
    
    print(f"\nFor sample file ({seq_len} rows):")
    print(f"  Number of windows: {n_windows}")
    print(f"  Total samples: ~{n_windows} per file")

In [None]:
# Visualize windowing
def create_windows(data, window_size, stride):
    """Create sliding windows from data."""
    windows = []
    for start in range(0, len(data) - window_size + 1, stride):
        windows.append(data[start:start + window_size])
    return windows

# Visualize window creation
if csv_files and continuous_cols:
    fig, axes = plt.subplots(2, 1, figsize=(14, 8))
    
    # Get sample sensor data
    sensor_col = continuous_cols[0]
    full_signal = df[sensor_col].values[:500]  # First 500 points
    
    # Plot full signal
    ax1 = axes[0]
    ax1.plot(full_signal, 'b-', linewidth=1, alpha=0.7)
    ax1.set_xlabel('Timestep')
    ax1.set_ylabel(sensor_col)
    ax1.set_title('Full Sensor Signal with Window Positions', fontsize=14, fontweight='bold')
    
    # Highlight windows
    colors = plt.cm.Set2(np.linspace(0, 1, 5))
    for i in range(5):
        start = i * STRIDE
        end = start + WINDOW_SIZE
        if end <= len(full_signal):
            ax1.axvspan(start, end, alpha=0.2, color=colors[i], label=f'Window {i+1}')
    ax1.legend(loc='upper right', fontsize=9)
    
    # Plot individual windows
    ax2 = axes[1]
    windows = create_windows(full_signal, WINDOW_SIZE, STRIDE)[:5]
    
    for i, window in enumerate(windows):
        ax2.plot(window + i*20, color=colors[i], linewidth=1.5, label=f'Window {i+1}')
    
    ax2.set_xlabel('Position in Window')
    ax2.set_ylabel('Value (offset for visibility)')
    ax2.set_title('Individual Windows (Offset for Clarity)', fontsize=14, fontweight='bold')
    ax2.legend(loc='upper right', fontsize=9)
    
    plt.tight_layout()
    plt.show()

---
## 6. Multilabel Stratified Splitting

The current preprocessing uses **multilabel stratified splitting** to ensure proper class coverage across train/val/test sets.

### Why Multilabel Stratification?

Standard stratification on a single label may not ensure coverage of:
- All 9 operation types in each split
- All 6 G-code commands in each split
- All parameter types in each split

Multilabel stratification treats each sample as having multiple labels and ensures balanced distribution.
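
The idea can be sketched with a simplified greedy procedure in the spirit of iterative stratification: samples carrying rare labels are placed first, and each sample goes to the split that still needs its rarest label the most. This is an illustration only, with the hypothetical function name `greedy_multilabel_split` and synthetic labels; it is not the algorithm used by the project's splitting script.

```python
import numpy as np

def greedy_multilabel_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Assign samples to splits so each label is spread roughly by `fractions`.

    labels: (n_samples, n_labels) binary indicator matrix.
    Returns an array of split ids (0=train, 1=val, 2=test).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    n_samples = labels.shape[0]
    fractions = np.asarray(fractions, dtype=float)

    label_freq = labels.sum(axis=0)
    desired = np.outer(fractions, label_freq)   # positives each split still wants
    capacity = fractions * n_samples            # samples each split still wants

    # Process samples with rare labels first (random order among ties).
    rarest = np.where(labels, label_freq, n_samples + 1).min(axis=1)
    order = np.lexsort((rng.random(n_samples), rarest))

    assignment = np.empty(n_samples, dtype=int)
    for i in order:
        active = labels[i]
        if active.any():
            # The rarest label carried by this sample decides where it goes.
            key = np.flatnonzero(active)[np.argmin(label_freq[active])]
            need = desired[:, key]
        else:
            need = np.zeros(len(fractions))
        # Largest remaining need wins; remaining capacity breaks ties.
        best = np.lexsort((capacity, need))[-1]
        assignment[i] = best
        desired[best, active] -= 1
        capacity[best] -= 1
    return assignment

# Synthetic labels: three common, mutually exclusive classes plus one rare label.
n = 200
labels = np.zeros((n, 4), dtype=int)
labels[np.arange(n), np.arange(n) % 3] = 1
labels[::20, 3] = 1   # rare label on 10 of 200 samples

split = greedy_multilabel_split(labels)
for s, name in enumerate(['train', 'val', 'test']):
    print(name, (split == s).sum(), labels[split == s].sum(axis=0))
```

On this toy input, the rare label appears in every split, which a purely random 70/15/15 split does not guarantee.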

In [None]:
# Load split info from preprocessed data
split_dirs = [
    project_root / 'outputs' / 'multilabel_stratified_splits',
    project_root / 'outputs' / 'stratified_splits_v2'
]

split_info_path = None
for sd in split_dirs:
    si_path = sd / 'split_info.json'
    if si_path.exists():
        split_info_path = si_path
        break

if split_info_path:
    with open(split_info_path, 'r') as f:
        split_info = json.load(f)
    
    print(f"Split Information ({split_info_path.parent.name}):")
    print("="*60)
    print(f"Method: {split_info.get('method', 'unknown')}")
    total = sum(split_info[f'{name}_samples'] for name in ('train', 'val', 'test'))
    print(f"\nSample Counts:")
    for name in ('train', 'val', 'test'):
        count = split_info[f'{name}_samples']
        label = f"{name.capitalize()}:"
        print(f"  {label:<6} {count:,} samples ({count / total * 100:.1f}%)")
    print(f"  Total: {total:,} samples")
else:
    print("No split_info.json found. Run preprocessing first.")

In [None]:
# Visualize class coverage across splits
if split_info_path and 'coverage_results' in split_info:
    coverage = split_info['coverage_results']
    
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))
    
    # Operation type distribution
    ax1 = axes[0]
    splits_data = ['train', 'val', 'test']
    x = np.arange(9)  # 9 operation types
    width = 0.25
    
    for i, split in enumerate(splits_data):
        op_counts = coverage[split]['operation_counts']
        counts = [op_counts.get(str(j), 0) for j in range(9)]
        ax1.bar(x + i*width, counts, width, label=split.capitalize())
    
    ax1.set_xlabel('Operation Type')
    ax1.set_ylabel('Count')
    ax1.set_title('Operation Type Distribution', fontsize=12, fontweight='bold')
    ax1.set_xticks(x + width)
    ax1.set_xticklabels([f'Op{i}' for i in range(9)])
    ax1.legend()
    
    # Command distribution
    ax2 = axes[1]
    commands = ['G0', 'G1', 'G3', 'G53', 'M30', 'NONE']
    x = np.arange(len(commands))
    
    for i, split in enumerate(splits_data):
        cmd_counts = coverage[split]['command_counts']
        counts = [cmd_counts.get(c, 0) for c in commands]
        ax2.bar(x + i*width, counts, width, label=split.capitalize())
    
    ax2.set_xlabel('Command')
    ax2.set_ylabel('Count')
    ax2.set_title('Command Distribution', fontsize=12, fontweight='bold')
    ax2.set_xticks(x + width)
    ax2.set_xticklabels(commands)
    ax2.legend()
    
    # Parameter type distribution
    ax3 = axes[2]
    params = ['X', 'Y', 'Z', 'F', 'R', 'NONE']
    x = np.arange(len(params))
    
    for i, split in enumerate(splits_data):
        param_counts = coverage[split]['param_counts']
        counts = [param_counts.get(p, 0) for p in params]
        ax3.bar(x + i*width, counts, width, label=split.capitalize())
    
    ax3.set_xlabel('Parameter Type')
    ax3.set_ylabel('Count')
    ax3.set_title('Parameter Type Distribution', fontsize=12, fontweight='bold')
    ax3.set_xticks(x + width)
    ax3.set_xticklabels(params)
    ax3.legend()
    
    plt.suptitle('Multilabel Stratified Split Coverage', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Print coverage validation
    print("\nCoverage Validation:")
    for split in splits_data:
        passed = coverage[split].get('passed')
        issues = coverage[split].get('issues', [])
        # A missing 'passed' flag is reported as unknown rather than as a pass
        status = '? UNKNOWN' if passed is None else ('✓ PASS' if passed else '✗ FAIL')
        print(f"  {split.capitalize()}: {status}")
        if issues:
            for issue in issues:
                print(f"    - {issue}")

In [None]:
# Load and inspect preprocessed data (NPZ format)
split_dir = None
for sd in split_dirs:
    if (sd / 'train_sequences.npz').exists():
        split_dir = sd
        break

if split_dir:
    print(f"Loading preprocessed data from: {split_dir.name}")
    
    # Load train data to inspect structure
    train_data = np.load(split_dir / 'train_sequences.npz', allow_pickle=True)
    
    print(f"\nData Structure (NPZ format):")
    print("="*60)
    for key in train_data.files:
        arr = train_data[key]
        print(f"  {key:20s}: shape={str(arr.shape):15s} dtype={arr.dtype}")
    
    # Key fields:
    # - continuous: [N, 64, 155] - continuous sensor features
    # - categorical: [N, 64, 4] - categorical sensor features
    # - tokens: [N, 7] - tokenized G-code (type, cmd, param, sign, d1, d2, d3)
    # - operation_type: [N] - operation class (0-8)
    # - gcode_texts: [N] - original G-code strings
else:
    print("No preprocessed data found. Run preprocessing first.")

---
## 7. Data Quality Validation

Validate the preprocessed data for common issues.
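
As a minimal sketch of the kind of structural check that applies to one split, the function below verifies that all arrays describe the same number of samples, that the continuous features have the expected window length and contain only finite values, and that operation labels fall in the expected range. The key names follow the NPZ layout described in this notebook; `check_split_arrays` is a hypothetical helper and the arrays are synthetic.

```python
import numpy as np

def check_split_arrays(arrays, window_size=64, n_operation_types=9):
    """Return a list of human-readable problems found in one split's arrays."""
    problems = []
    n = len(arrays['operation_type'])

    # Every array should describe the same N samples.
    for key, arr in arrays.items():
        if len(arr) != n:
            problems.append(f"{key}: expected {n} samples, found {len(arr)}")

    cont = arrays['continuous']
    if cont.ndim != 3 or cont.shape[1] != window_size:
        problems.append(f"continuous: unexpected shape {cont.shape}")
    if not np.isfinite(cont).all():
        problems.append("continuous: contains NaN or infinite values")

    ops = arrays['operation_type']
    if ops.min() < 0 or ops.max() >= n_operation_types:
        problems.append("operation_type: label outside the expected range")

    return problems

# Synthetic arrays shaped like the preprocessed data (8 samples only).
rng = np.random.default_rng(0)
good = {
    'continuous': rng.normal(size=(8, 64, 155)),
    'tokens': rng.integers(0, 4, size=(8, 7)),
    'operation_type': rng.integers(0, 9, size=8),
}
bad = dict(good, continuous=good['continuous'].copy())
bad['continuous'][0, 0, 0] = np.nan   # inject a single NaN

print(check_split_arrays(good))   # []
print(check_split_arrays(bad))    # ['continuous: contains NaN or infinite values']
```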

In [None]:
# Data quality checks on raw data
def validate_data_quality(df):
    """Run quality checks on data."""
    issues = []
    
    # Check for missing values
    missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
    high_missing = missing_pct[missing_pct > 5]
    if len(high_missing) > 0:
        issues.append(f"High missing values in: {list(high_missing.index)}")
    
    # Check for constant columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    constant_cols = [c for c in numeric_cols if df[c].std() < 1e-8]
    if constant_cols:
        issues.append(f"Constant columns: {constant_cols[:5]}")
    
    # Check for outliers (>5 std from mean)
    outlier_cols = []
    for col in numeric_cols:
        z_scores = np.abs((df[col] - df[col].mean()) / (df[col].std() + 1e-8))
        if (z_scores > 5).sum() > len(df) * 0.01:  # >1% outliers
            outlier_cols.append(col)
    if outlier_cols:
        issues.append(f"Columns with many outliers: {outlier_cols[:5]}")
    
    return issues

if csv_files:
    print("Data Quality Report")
    print("="*50)
    
    issues = validate_data_quality(df)
    
    if issues:
        print("\n⚠️  Issues found:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("\n✓ No major issues found")
    
    # Summary statistics
    print(f"\nData Summary:")
    print(f"  Total rows: {len(df):,}")
    print(f"  Total columns: {len(df.columns)}")
    print(f"  Missing values: {df.isnull().sum().sum():,} ({df.isnull().sum().sum()/df.size*100:.2f}%)")
    print(f"  Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

In [None]:
# Validate preprocessed NPZ data
if split_dir:
    print("Preprocessed Data Validation:")
    print("="*60)
    
    for split_name in ['train', 'val', 'test']:
        npz_path = split_dir / f'{split_name}_sequences.npz'
        if npz_path.exists():
            data = np.load(npz_path, allow_pickle=True)
            n_samples = len(data['operation_type'])
            
            # Check for NaN values
            has_nan = any(np.isnan(data[k]).any() for k in ['continuous'] if k in data.files)
            
            # Check operation type distribution
            op_types = data['operation_type']
            unique_ops = len(np.unique(op_types))
            
            status = '✓' if not has_nan and unique_ops == 9 else '⚠️'
            print(f"  {split_name:5s}: {n_samples:5,} samples, {unique_ops} operation types {status}")

---
## 8. Pipeline Visualization

Visualize the complete preprocessing pipeline.

In [None]:
# Create pipeline diagram
fig, ax = plt.subplots(figsize=(16, 8))
ax.set_xlim(0, 16)
ax.set_ylim(0, 8)
ax.axis('off')

# Colors
colors = {
    'input': '#E8F4F8',
    'process': '#D4E6F1',
    'output': '#D5F5E3',
    'arrow': '#555555'
}

# Draw pipeline stages
stages = [
    (1, 6, 'Raw CSV\nFiles', colors['input']),
    (4, 6, 'Load &\nParse', colors['process']),
    (7, 7, 'Sensor\nFeatures', colors['process']),
    (7, 5, '4-Digit\nTokens', colors['process']),
    (10, 6, 'Window\nCreation', colors['process']),
    (13, 6, 'Stratified\nSplits (NPZ)', colors['output']),
]

for x, y, text, color in stages:
    rect = plt.Rectangle((x-0.8, y-0.6), 1.6, 1.2, facecolor=color, 
                          edgecolor='black', linewidth=2, zorder=2)
    ax.add_patch(rect)
    ax.text(x, y, text, ha='center', va='center', fontsize=10, fontweight='bold', zorder=3)

# Draw arrows
arrows = [
    (1.8, 6, 3.2, 6),
    (4.8, 6, 6.2, 7),
    (4.8, 6, 6.2, 5),
    (7.8, 7, 9.2, 6),
    (7.8, 5, 9.2, 6),
    (10.8, 6, 12.2, 6),
]

for x1, y1, x2, y2 in arrows:
    ax.annotate('', xy=(x2, y2), xytext=(x1, y1),
                arrowprops=dict(arrowstyle='->', color=colors['arrow'], lw=2))

# Add details below
details = [
    (1, 4, '145 CSV files\n9 operation types'),
    (4, 4, 'Pandas\nDataFrame'),
    (7, 3.5, '155 continuous\n4 categorical'),
    (10, 4, f'Window: {WINDOW_SIZE}\nStride: {STRIDE}'),
    (13, 4, '70% / 15% / 15%\nMultilabel'),
]

for x, y, text in details:
    ax.text(x, y, text, ha='center', va='center', fontsize=9, style='italic', color='gray')

ax.set_title('Data Preprocessing Pipeline (4-Digit Hybrid)', fontsize=16, fontweight='bold', y=0.95)

plt.tight_layout()
plt.show()

In [None]:
# Command-line preprocessing reference
print("="*70)
print("PREPROCESSING COMMANDS")
print("="*70)

print("\n1. Create multilabel stratified splits:")
print("-"*50)
print("""
PYTHONPATH=src .venv/bin/python scripts/create_multilabel_stratified_splits.py \\
    --data-dir outputs/processed_v2 \\
    --output-dir outputs/multilabel_stratified_splits \\
    --val-size 0.15 \\
    --test-size 0.15 \\
    --seed 42
""")

print("\n2. Alternative: Standard stratified splits:")
print("-"*50)
print("""
PYTHONPATH=src .venv/bin/python scripts/create_stratified_splits.py \\
    --data-dir outputs/processed_v2 \\
    --output-dir outputs/stratified_splits_v2 \\
    --val-size 0.15 \\
    --test-size 0.15
""")

print("\n3. Output files:")
print("-"*50)
print("""
outputs/multilabel_stratified_splits/
├── train_sequences.npz      # Training data
├── val_sequences.npz        # Validation data  
├── test_sequences.npz       # Test data
├── split_info.json          # Split statistics
└── split_indices.npz        # Original indices
""")

---
## Summary

In this notebook, you learned:

1. **Raw Data Loading**: CSV files with aligned sensor + G-code data
2. **Sensor Features**: 155 continuous + 4 categorical features
3. **G-code Tokenization**: 4-digit hybrid encoding for better precision
4. **Vocabulary**: Multi-head output structure (types, commands, params, digits)
5. **Windowing**: Sliding windows with configurable size and stride
6. **Data Splits**: Multilabel stratified 70/15/15 train/val/test split
7. **Quality Checks**: Coverage validation for all splits

### Key Parameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| Window Size | 64 | Timesteps per sample |
| Stride | 16 | Step between windows |
| Vocabulary | 4-digit hybrid | Multi-head output structure |
| Train/Val/Test | 70/15/15 | Multilabel stratified |
| Operation Types | 9 | Classification target |
| Commands | 6 | G0, G1, G3, G53, M30, NONE |
| Param Types | 10 | X, Y, Z, F, R, NONE, etc. |

### Output Format (NPZ)

| Key | Shape | Description |
|-----|-------|-------------|
| `continuous` | [N, 64, 155] | Continuous sensor features |
| `categorical` | [N, 64, 4] | Categorical sensor features |
| `tokens` | [N, 7] | Tokenized G-code |
| `operation_type` | [N] | Operation class (0-8) |
| `gcode_texts` | [N] | Original G-code strings |
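
The layout in the table can be reproduced with a small round trip through NumPy's NPZ format. The sketch writes to an in-memory buffer instead of a real file, uses only four samples, and the dtypes are assumptions rather than values taken from the project's data.

```python
import io
import numpy as np

n = 4  # tiny sample count for illustration
arrays = {
    'continuous': np.zeros((n, 64, 155), dtype=np.float32),   # dtype is an assumption
    'categorical': np.zeros((n, 64, 4), dtype=np.int64),      # dtype is an assumption
    'tokens': np.zeros((n, 7), dtype=np.int64),
    'operation_type': np.arange(n) % 9,
    'gcode_texts': np.array(['G0', 'G1', 'X123.5', 'M30'], dtype=object),
}

buffer = io.BytesIO()   # stands in for a file such as train_sequences.npz
np.savez_compressed(buffer, **arrays)
buffer.seek(0)

# Object arrays (the G-code strings) are stored via pickle, hence allow_pickle=True.
loaded = np.load(buffer, allow_pickle=True)
for key in loaded.files:
    print(f"{key:15s} {loaded[key].shape}")
```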

---
**Navigation:**
← [Previous: 01_getting_started](01_getting_started.ipynb) |
[Next: 03_training_models](03_training_models.ipynb) →

**Related:** [00_raw_data_analysis](00_raw_data_analysis.ipynb) | [08_model_evaluation](08_model_evaluation.ipynb)