# 01 - Getting Started with G-code Fingerprinting

Welcome! This notebook introduces the G-code fingerprinting project: what it does, how the pipeline is built, how to check your environment, and how to run a first inference pass.

## Table of Contents
1. [What is G-code Fingerprinting?](#1.-What-is-G-code-Fingerprinting?)
2. [Project Architecture](#2.-Project-Architecture)
3. [Environment Setup](#3.-Environment-Setup)
4. [Project Structure](#4.-Project-Structure)
5. [Understanding the Model](#5.-Understanding-the-Model)
6. [Data Format](#6.-Data-Format)
7. [Quick Inference Demo](#7.-Quick-Inference-Demo)
8. [Next Steps](#8.-Next-Steps)

---

## 1. What is G-code Fingerprinting?

This project uses deep learning to:

1. **Predict G-code sequences** from CNC machine sensor data
2. **Classify operation types** with near-perfect accuracy
3. **Extract machine fingerprints** - unique embeddings that identify specific operations

### Use Cases
- **Manufacturing Quality Control**: Verify machine operations match expected G-code
- **Security**: Detect unauthorized modifications to machine programs
- **Process Monitoring**: Understand what a machine is doing from sensor data alone
- **Digital Twins**: Create models that map sensor patterns to machine commands
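
The quality-control and security use cases both come down to comparing the G-code the model predicts from sensor data against the G-code the machine was supposed to run. The sketch below shows one simple way to do that, assuming both programs are already available as token lists. The function names, the position-by-position comparison, and the `max_mismatch_rate` threshold are illustrative assumptions, not part of the project's API.

```python
from typing import List, Sequence


def mismatch_positions(expected: Sequence[str], predicted: Sequence[str]) -> List[int]:
    """Return the positions where the two token sequences disagree.

    A position that exists in only one sequence counts as a mismatch.
    """
    longest = max(len(expected), len(predicted))
    positions = []
    for i in range(longest):
        exp_tok = expected[i] if i < len(expected) else None
        pred_tok = predicted[i] if i < len(predicted) else None
        if exp_tok != pred_tok:
            positions.append(i)
    return positions


def matches_program(expected: Sequence[str], predicted: Sequence[str],
                    max_mismatch_rate: float = 0.0) -> bool:
    """True if the fraction of mismatched positions is within the threshold."""
    longest = max(len(expected), len(predicted))
    if longest == 0:
        return True  # two empty programs trivially match
    rate = len(mismatch_positions(expected, predicted)) / longest
    return rate <= max_mismatch_rate


expected = ["G1", "X", "NUM_X_1234"]
predicted = ["G1", "X", "NUM_X_9999"]
print(mismatch_positions(expected, predicted))
print(matches_program(expected, predicted))
```

Because the model's token accuracy is below 100%, a non-zero threshold (or an alignment-based comparison) is likely needed in practice to avoid false alarms.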

### Key Innovations
- **Two-Stage Architecture**: Frozen MM-DTAE-LSTM encoder + SensorMultiHeadDecoder
- **Multi-Head Prediction**: Separate heads for type, command, param_type, and digit values
- **Operation Conditioning**: Uses the encoder's near-perfect (~100% accurate) operation classification to guide token generation
- **Digit-by-Digit Value Prediction**: Predicts numeric values one digit at a time (4-digit precision)
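
To make the digit-by-digit idea concrete, the sketch below splits a numeric value into a sign class plus a fixed number of digit classes, and reassembles it. It is an illustration under stated assumptions, not the project's encoding: the fixed-point layout (`n_int` integer digits, `n_frac` fractional digits) and the sign class IDs are hypothetical. The digit count is left as a parameter because it depends on the model configuration.

```python
from typing import List, Tuple

# Hypothetical class IDs for a 3-way sign head
SIGN_NEGATIVE, SIGN_ZERO, SIGN_POSITIVE = 0, 1, 2


def encode_value(value: float, n_int: int = 2, n_frac: int = 2) -> Tuple[int, List[int]]:
    """Split a value into a sign class and one digit class (0-9) per position."""
    scaled = round(abs(value) * 10 ** n_frac)  # fixed-point magnitude as an integer
    if scaled >= 10 ** (n_int + n_frac):
        raise ValueError(f"{value} does not fit in {n_int}+{n_frac} digits")
    if scaled == 0:
        sign = SIGN_ZERO
    elif value < 0:
        sign = SIGN_NEGATIVE
    else:
        sign = SIGN_POSITIVE
    digits = [int(ch) for ch in str(scaled).zfill(n_int + n_frac)]
    return sign, digits


def decode_value(sign: int, digits: List[int], n_frac: int = 2) -> float:
    """Rebuild the value from its sign class and digit classes."""
    if sign == SIGN_ZERO:
        return 0.0
    magnitude = int("".join(str(d) for d in digits)) / 10 ** n_frac
    return -magnitude if sign == SIGN_NEGATIVE else magnitude


sign, digits = encode_value(-12.5)
print(sign, digits)
print(decode_value(sign, digits))
```

Each digit position becomes a small 10-way classification problem, so the model never has to regress a continuous value directly.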

## 2. Project Architecture

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                    G-code Fingerprinting Pipeline (v3)                          │
└─────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────┐    ┌────────────────────────┐    ┌─────────────────────────┐
│   Sensor Data   │    │   MM-DTAE-LSTM v2      │    │   Operation Type        │
│ (155 cont + 4   │───▶│      (FROZEN)          │───▶│   Classification        │
│   categorical)  │    │   Encoder              │    │   (100% accurate!)      │
└─────────────────┘    └────────────────────────┘    └─────────────────────────┘
                               │                                │
                               │ sensor_embeddings              │ operation_type
                               │ [B, 64, 128]                   │ [B]
                               ▼                                ▼
                       ┌─────────────────────────────────────────────────┐
                       │        SensorMultiHeadDecoder v3                │
                       │  ┌─────────────────────────────────────────┐   │
                       │  │  Operation Embedding + Sensor Projection │   │
                       │  └─────────────────────────────────────────┘   │
                       │                      │                          │
                       │  ┌───────────────────▼──────────────────────┐  │
                       │  │  Transformer Decoder (4 layers, 8 heads) │  │
                       │  │     d_model=192, dropout=0.3             │  │
                       │  └──────────────────────────────────────────┘  │
                       │                      │                          │
                       │           ┌─────────┴─────────┐                │
                       │           ▼                   ▼                │
                       │   ┌───────────────┐  ┌─────────────────┐       │
                       │   │  Multi-Head   │  │  Digit Value    │       │
                       │   │  Outputs:     │  │  Head:          │       │
                       │   │  - type (4)   │  │  - sign (3)     │       │
                       │   │  - cmd (6)    │  │  - 6 digits     │       │
                       │   │  - param (10) │  │  - aux_value    │       │
                       │   └───────────────┘  └─────────────────┘       │
                       └─────────────────────────────────────────────────┘
                                              │
                                              ▼
                               ┌─────────────────────────┐
                               │  G-code Token Sequence  │
                               │  (max 32 tokens)        │
                               └─────────────────────────┘
```

### Components
- **Sensor Data**: 155 continuous features + 4 categorical features (64 timesteps)
- **MM-DTAE-LSTM v2**: Frozen encoder providing 128-dim embeddings + operation classification
- **SensorMultiHeadDecoder v3**: Transformer decoder with operation conditioning and multi-head outputs
- **Token Types**: SPECIAL, COMMAND, PARAM_LETTER, NUMERIC
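
The head sizes shown in the diagram can be turned into a quick reference for the logit shapes to expect from each output head. This is a sketch only: the head names and the assumption that every head emits one set of logits per token position (`[B, T, n_classes]`) are hypothetical, so compare against the shapes printed by the inference demo rather than treating this as the decoder's actual interface.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class HeadSizes:
    """Class counts per head, taken from the pipeline diagram."""
    n_types: int = 4        # SPECIAL, COMMAND, PARAM_LETTER, NUMERIC
    n_commands: int = 6
    n_param_types: int = 10
    n_sign: int = 3
    n_digits: int = 6       # digit positions in the digit value head
    digit_classes: int = 10  # digits 0-9 (assumption)


def expected_logit_shapes(batch: int, seq_len: int,
                          sizes: HeadSizes = HeadSizes()) -> Dict[str, Tuple[int, ...]]:
    """Shapes of per-position logits for each head (hypothetical head names)."""
    return {
        "type": (batch, seq_len, sizes.n_types),
        "command": (batch, seq_len, sizes.n_commands),
        "param_type": (batch, seq_len, sizes.n_param_types),
        "sign": (batch, seq_len, sizes.n_sign),
        "digits": (batch, seq_len, sizes.n_digits, sizes.digit_classes),
    }


for name, shape in expected_logit_shapes(batch=2, seq_len=16).items():
    print(f"{name:12s} {shape}")
```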

## 3. Environment Setup

Let's verify your environment is properly configured.

In [None]:
# ============================================================
# Environment Setup and Verification
# ============================================================

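import os
import sys
from pathlib import Path

# Enable the MPS CPU fallback before any cell imports torch
# (the package check below imports it first).
os.environ.setdefault('PYTORCH_ENABLE_MPS_FALLBACK', '1')

# Project root setup (assumes the notebook is run from the notebooks/ directory)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))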

# Reproducibility
SEED = 42

print("="*60)
print("G-CODE FINGERPRINTING - ENVIRONMENT CHECK")
print("="*60)
print(f"\n✓ Project root: {project_root}")
print("✓ Python path configured")

In [None]:
# Python version check
print(f"\nPython Version: {sys.version}")

# Check Python version is compatible
if sys.version_info >= (3, 9):
    print("✓ Python version compatible (3.9+)")
else:
    print("⚠ Python 3.9+ recommended")

In [None]:
# Package version check
packages_to_check = [
    ('torch', 'PyTorch'),
    ('numpy', 'NumPy'),
    ('pandas', 'Pandas'),
    ('matplotlib', 'Matplotlib'),
    ('seaborn', 'Seaborn'),
    ('sklearn', 'Scikit-learn'),
    ('wandb', 'Weights & Biases'),
    ('flask', 'Flask'),
]

print("\nPackage Versions:")
print("-" * 40)

import importlib

for pkg_name, display_name in packages_to_check:
    try:
        module = importlib.import_module(pkg_name)
        version = getattr(module, '__version__', 'installed')
        print(f"  ✓ {display_name:20s} {version}")
    except ImportError:
        print(f"  ✗ {display_name:20s} NOT INSTALLED")

In [None]:
# Device check
import os
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

import torch
import numpy as np

# Set seeds for reproducibility
torch.manual_seed(SEED)
np.random.seed(SEED)

print("\nCompute Device:")
print("-" * 40)

if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"  ✓ CUDA available: {torch.cuda.get_device_name(0)}")
    print(f"    Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')
    print("  ✓ Apple MPS available (Metal Performance Shaders)")
else:
    device = torch.device('cpu')
    print("  ℹ Using CPU (no GPU acceleration)")

print(f"\n  Selected device: {device}")
print(f"\n✓ Reproducibility seeds set (SEED={SEED})")

## 4. Project Structure

Understanding the project layout helps you navigate the codebase.

In [None]:
# Verify key directories exist
print("\nProject Structure:")
print("="*60)

structure = {
    'data/': 'Raw G-code files and vocabulary files',
    'src/miracle/': 'Core Python package',
    'src/miracle/model/': 'Model architectures (MM-DTAE-LSTM, SensorMultiHeadDecoder)',
    'src/miracle/dataset/': 'Data loading and preprocessing',
    'src/miracle/training/': 'Training utilities and losses',
    'scripts/': 'Training and evaluation scripts',
    'configs/': 'Configuration files',
    'outputs/': 'Model checkpoints and results',
    'outputs/sensor_multihead_v3/': 'Latest trained model (decoder)',
    'outputs/mm_dtae_lstm_v2/': 'Frozen encoder model',
    'outputs/stratified_splits_v2/': 'Train/val/test splits',
    'notebooks/': 'Tutorial notebooks (you are here!)',
}

for path, description in structure.items():
    full_path = project_root / path.rstrip('/')
    exists = "✓" if full_path.exists() else "✗"
    print(f"  {exists} {path:40s} {description}")

In [None]:
# Check for critical files
print("\nCritical Files:")
print("-" * 70)

critical_files = {
    'data/vocabulary_4digit_hybrid.json': 'G-code vocabulary (4-digit hybrid)',
    'outputs/stratified_splits_v2/train_sequences.npz': 'Training data',
    'outputs/stratified_splits_v2/val_sequences.npz': 'Validation data',
    'outputs/stratified_splits_v2/test_sequences.npz': 'Test data',
    'outputs/sensor_multihead_v3/best_model.pt': 'Best decoder model',
    'outputs/mm_dtae_lstm_v2/best_model.pt': 'Frozen encoder model',
}

for path, description in critical_files.items():
    full_path = project_root / path
    if full_path.exists():
        size_mb = full_path.stat().st_size / (1024 * 1024)
        print(f"  ✓ {path:50s} ({size_mb:.1f} MB)")
    else:
        print(f"  ✗ {path:50s} (MISSING)")

## 5. Understanding the Model

### Model Architecture

The model uses a two-stage architecture:

**Stage 1: MM-DTAE-LSTM Encoder (Frozen)**
- Processes raw sensor data (155 continuous + 4 categorical features)
- Outputs 128-dimensional embeddings
- Classifies operation type with ~100% accuracy

**Stage 2: SensorMultiHeadDecoder**
- Receives sensor embeddings and operation type
- Generates G-code tokens autoregressively
- Multi-head outputs for structured prediction
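
Autoregressive generation means the decoder produces one token at a time, feeding everything generated so far back in as input until it emits an end-of-sequence token or hits the length limit. The loop below sketches greedy decoding with a stand-in for the model. `next_token` is a hypothetical callable, not the decoder's real interface, and the limit of 32 tokens follows the pipeline diagram (whether that limit counts `BOS` is an assumption).

```python
from typing import Callable, List

BOS, EOS = "BOS", "EOS"


def greedy_decode(next_token: Callable[[List[str]], str], max_len: int = 32) -> List[str]:
    """Generate tokens one at a time until EOS or max_len tokens are produced."""
    tokens = [BOS]
    while len(tokens) < max_len:
        tok = next_token(tokens)  # the model sees the full prefix at every step
        tokens.append(tok)
        if tok == EOS:
            break
    return tokens


def scripted_model(prefix: List[str]) -> str:
    """Stand-in for the decoder: replays a fixed G-code line, then EOS."""
    script = ["G1", "X", "NUM_X_1234", EOS]
    return script[len(prefix) - 1]


print(greedy_decode(scripted_model))
```

In the real pipeline, each step would also receive the sensor embeddings and the operation type, and the next token would be assembled from the multi-head outputs rather than read from a script.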

### Token Types

| Type ID | Type Name | Examples |
|---------|-----------|----------|
| 0 | SPECIAL | PAD, BOS, EOS, UNK |
| 1 | COMMAND | G0, G1, G2, G3, G53, M30 |
| 2 | PARAM_LETTER | X, Y, Z, F, R |
| 3 | NUMERIC | NUM_X_1234 (4-digit values) |
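
The type IDs in the table can be reproduced from token strings with a few rules. The sketch below infers those rules from the example tokens in the table (special names, a `NUM_` prefix, single parameter letters, and `G` or `M` followed by digits). The token sets are limited to the table's examples, and the project's real vocabulary may assign types differently.

```python
SPECIAL, COMMAND, PARAM_LETTER, NUMERIC = 0, 1, 2, 3

# Token sets copied from the examples in the table above
SPECIAL_TOKENS = {"PAD", "BOS", "EOS", "UNK"}
PARAM_LETTERS = {"X", "Y", "Z", "F", "R"}


def token_type(token: str) -> int:
    """Map a token string to its type ID using rules inferred from the table."""
    if token in SPECIAL_TOKENS:
        return SPECIAL
    if token.startswith("NUM_"):
        return NUMERIC
    if token in PARAM_LETTERS:
        return PARAM_LETTER
    if token[:1] in ("G", "M") and token[1:].isdigit():
        return COMMAND
    raise ValueError(f"Unrecognized token: {token!r}")


for tok in ["BOS", "G53", "X", "NUM_X_1234"]:
    print(f"{tok:12s} -> type {token_type(tok)}")
```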

### Operation Types (9 classes)

The model identifies 9 distinct machining operations from sensor patterns.

In [None]:
# Load and display model configuration
import json

results_path = project_root / 'outputs' / 'sensor_multihead_v3' / 'results.json'

if results_path.exists():
    with open(results_path, 'r') as f:
        results = json.load(f)
    
    args = results.get('args', {})
    
    print("\nModel Configuration (SensorMultiHeadDecoder v3):")
    print("="*50)
    print(f"  d_model:        {args.get('d_model', 'N/A')}")
    print(f"  n_heads:        {args.get('n_heads', 'N/A')}")
    print(f"  n_layers:       {args.get('n_layers', 'N/A')}")
    print(f"  dropout:        {args.get('dropout', 'N/A')}")
    print(f"  sensor_dim:     {args.get('sensor_dim', 'N/A')}")
    print(f"  n_operations:   {args.get('n_operations', 'N/A')}")
    print(f"  n_types:        {args.get('n_types', 'N/A')}")
    print(f"  n_commands:     {args.get('n_commands', 'N/A')}")
    print(f"  n_param_types:  {args.get('n_param_types', 'N/A')}")
    print(f"  max_seq_len:    {args.get('max_seq_len', 'N/A')}")
    
    print(f"\nTraining Configuration:")
    print("-"*50)
    print(f"  batch_size:     {args.get('batch_size', 'N/A')}")
    print(f"  max_epochs:     {args.get('max_epochs', 'N/A')}")
    print(f"  learning_rate:  {args.get('learning_rate', 'N/A')}")
    print(f"  use_focal_loss: {args.get('use_focal_loss', 'N/A')}")
    print(f"  focal_gamma:    {args.get('focal_gamma', 'N/A')}")
    print(f"  curriculum:     {args.get('curriculum', 'N/A')}")
    
    print(f"\nPerformance:")
    print("-"*50)
    print(f"  Best val token accuracy: {results.get('best_val_metric', 0):.2%}")
    test_metrics = results.get('test_metrics', {})
    print(f"  Test token accuracy:     {test_metrics.get('token', 0):.2%}")
    print(f"  Test loss:               {test_metrics.get('loss', 0):.4f}")
else:
    print("⚠ Results file not found. Train a model first.")

In [None]:
# Load vocabulary and show statistics
vocab_path = project_root / 'data' / 'vocabulary_4digit_hybrid.json'

if vocab_path.exists():
    with open(vocab_path, 'r') as f:
        vocab_data = json.load(f)
    
    vocab = vocab_data.get('vocab', vocab_data)  # Handle nested or flat format
    config = vocab_data.get('config', {})
    
    print("\nVocabulary Statistics (4-digit hybrid):")
    print("="*50)
    print(f"  Total tokens: {len(vocab)}")
    
    # Categorize tokens
    special = [t for t in vocab if t in ('PAD', 'BOS', 'EOS', 'UNK', 'MASK')]
    # Commands are G or M followed by digits; a plain prefix check would also count MASK
    commands = [t for t in vocab if t[:1] in ('G', 'M') and t[1:].isdigit()]
    param_letters = [t for t in vocab if t in ('X', 'Y', 'Z', 'F', 'R', 'S', 'I', 'J', 'K', 'A', 'B', 'C')]
    numeric = [t for t in vocab if t.startswith('NUM_')]
    
    print(f"  Special tokens:     {len(special)}")
    print(f"  Command tokens:     {len(commands)}")
    print(f"  Parameter letters:  {len(param_letters)}")
    print(f"  Numeric tokens:     {len(numeric)}")
    
    print(f"\nSample tokens:")
    print(f"  Special:    {special}")
    print(f"  Commands:   {commands[:8]}")
    print(f"  Param:      {param_letters[:8]}")
    print(f"  Numeric:    {list(numeric)[:5]}...")
    
    if config:
        print(f"\nVocab Config:")
        print(f"  Mode:           {config.get('mode', 'N/A')}")
        print(f"  Bucket digits:  {config.get('bucket_digits', 'N/A')}")
else:
    print("⚠ Vocabulary not found.")

## 6. Data Format

The preprocessed data is stored in `.npz` format with the following structure.

In [None]:
# Load and inspect data format
split_dir = project_root / 'outputs' / 'stratified_splits_v2'

if (split_dir / 'train_sequences.npz').exists():
    train_data = np.load(split_dir / 'train_sequences.npz', allow_pickle=True)
    
    print("\nTraining Data Structure:")
    print("="*60)
    
    for key in train_data.keys():
        arr = train_data[key]
        if hasattr(arr, 'shape'):
            print(f"  {key:25s} shape={str(arr.shape):20s} dtype={arr.dtype}")
        else:
            print(f"  {key:25s} type={type(arr).__name__}")
    
    print(f"\nData Description:")
    print("-"*60)
    print(f"  continuous:       Sensor continuous features [N, 64, 155]")
    print(f"  categorical:      Sensor categorical features [N, 64, 4]")
    print(f"  tokens:           G-code token IDs [N, seq_len]")
    print(f"  param_value_raw:  Raw numeric values [N, seq_len]")
    print(f"  operation_type:   Operation type labels [N]")
    print(f"  gcode_texts:      Original G-code strings [N]")
    print(f"  lengths:          Sequence lengths [N]")
    
    print(f"\nSample sizes:")
    print(f"  Training:   {len(train_data['continuous']):,} samples")
    
    if (split_dir / 'val_sequences.npz').exists():
        val_data = np.load(split_dir / 'val_sequences.npz', allow_pickle=True)
        print(f"  Validation: {len(val_data['continuous']):,} samples")
    
    if (split_dir / 'test_sequences.npz').exists():
        test_data = np.load(split_dir / 'test_sequences.npz', allow_pickle=True)
        print(f"  Test:       {len(test_data['continuous']):,} samples")
else:
    print("⚠ Training data not found. Run preprocessing first.")

In [None]:
# Show sample data
if (split_dir / 'train_sequences.npz').exists():
    print("\nSample Data (first 3 samples):")
    print("="*60)
    
    for i in range(min(3, len(train_data['gcode_texts']))):
        print(f"\nSample {i}:")
        print(f"  Operation type: {train_data['operation_type'][i]}")
        if 'operation_type_names' in train_data:
            print(f"  Operation name: {train_data['operation_type_names'][i]}")
        print(f"  Sensor shape:   continuous={train_data['continuous'][i].shape}")
        print(f"  Token IDs:      {train_data['tokens'][i][:10]}...")
        print(f"  G-code text:    {train_data['gcode_texts'][i]}")

## 7. Quick Inference Demo

Let's load a trained model and run inference on sample data.

In [None]:
# Find available checkpoints
import glob

checkpoint_patterns = [
    'outputs/sensor_multihead_v3/best_model.pt',
    'outputs/*/best_model.pt',
    'outputs/*/checkpoint_best.pt',
]

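checkpoints = []
for pattern in checkpoint_patterns:
    checkpoints.extend(sorted(glob.glob(str(project_root / pattern))))
# The patterns overlap, so drop duplicate paths while preserving order
checkpoints = list(dict.fromkeys(checkpoints))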

print("\nAvailable Checkpoints:")
print("-" * 60)

if checkpoints:
    for cp in checkpoints[:5]:
        cp_path = Path(cp)
        size_mb = cp_path.stat().st_size / (1024 * 1024)
        print(f"  • {cp_path.relative_to(project_root)} ({size_mb:.1f} MB)")
    
    # Prefer sensor_multihead_v3
    preferred = project_root / 'outputs' / 'sensor_multihead_v3' / 'best_model.pt'
    checkpoint_path = str(preferred) if preferred.exists() else checkpoints[0]
    print(f"\n  Using: {Path(checkpoint_path).relative_to(project_root)}")
else:
    print("  ⚠ No checkpoints found. Train a model first!")
    checkpoint_path = None

In [None]:
# Load checkpoint and model
if checkpoint_path and vocab_path.exists():
    print("\nLoading checkpoint...")
    # weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust
    checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)
    
    print("\nCheckpoint Contents:")
    print("-" * 40)
    for key in checkpoint.keys():
        if 'state_dict' in key:
            # len() counts top-level entries (e.g. tensors), not individual parameters
            print(f"  {key}: {len(checkpoint[key])} entries")
        elif isinstance(checkpoint[key], dict):
            print(f"  {key}: dict with {len(checkpoint[key])} keys")
        else:
            print(f"  {key}: {type(checkpoint[key]).__name__}")
    
    print("\n✓ Checkpoint loaded successfully!")
else:
    print("⚠ Skipping model loading - missing checkpoint or vocabulary")

In [None]:
# Load model architecture
model = None  # stays None if loading is skipped or fails
if checkpoint_path and 'checkpoint' in globals():
    try:
        from miracle.model.sensor_multihead_decoder import SensorMultiHeadDecoder
        
        # Get model config from checkpoint or use defaults
        config = checkpoint.get('config', {}) or {}
        
        model = SensorMultiHeadDecoder(
            vocab_size=config.get('vocab_size', 668),
            d_model=config.get('d_model', 192),
            n_heads=config.get('n_heads', 8),
            n_layers=config.get('n_layers', 4),
            sensor_dim=config.get('sensor_dim', 128),
            n_operations=config.get('n_operations', 9),
            n_types=config.get('n_types', 4),
            n_commands=config.get('n_commands', 6),
            n_param_types=config.get('n_param_types', 10),
            dropout=0.0,  # No dropout for inference
        ).to(device)
        
        # Load weights
        if 'model_state_dict' in checkpoint:
            model.load_state_dict(checkpoint['model_state_dict'])
        else:
            print("⚠ No 'model_state_dict' in checkpoint; model keeps its random initial weights")
        
        model.eval()
        
        # Count parameters
        param_counts = model.count_parameters()
        print("\nModel Parameter Counts:")
        print("-" * 40)
        for name, count in param_counts.items():
            print(f"  {name:25s} {count:,}")
        
        print(f"\n✓ Model loaded successfully!")
        
    except ImportError as e:
        print(f"⚠ Could not import model: {e}")
        model = None
    except Exception as e:
        print(f"⚠ Could not load model: {e}")
        model = None

In [None]:
# Run quick inference with synthetic data
if checkpoint_path and globals().get('model') is not None:
    import time
    
    print("\nRunning inference on synthetic data...")
    print("-" * 50)
    
    # Generate synthetic inputs
    batch_size = 2
    sensor_seq_len = 64  # matches the encoder's 64 timesteps
    token_seq_len = 16
    sensor_dim = 128
    n_operations = 9
    vocab_size = 668
    
    # Synthetic sensor embeddings (normally from MM-DTAE-LSTM encoder)
    sensor_embeddings = torch.randn(batch_size, sensor_seq_len, sensor_dim).to(device)
    operation_type = torch.randint(0, n_operations, (batch_size,)).to(device)
    tokens = torch.randint(0, vocab_size, (batch_size, token_seq_len)).to(device)
    
    print(f"  Input shapes:")
    print(f"    sensor_embeddings: {sensor_embeddings.shape}")
    print(f"    operation_type:    {operation_type.shape}")
    print(f"    tokens:            {tokens.shape}")
    
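    # Forward pass (single run with no warm-up, so treat the timing as a rough indication)
    start_time = time.perf_counter()
    with torch.no_grad():
        outputs = model(tokens, sensor_embeddings, operation_type)
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the timer
    inference_time = (time.perf_counter() - start_time) * 1000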
    
    print(f"\n  Output shapes:")
    for name, tensor in outputs.items():
        if tensor is not None:
            print(f"    {name:20s} {tensor.shape}")
    
    print(f"\n  Inference time: {inference_time:.2f} ms")
    print(f"\n✓ Inference successful!")
else:
    print("⚠ Skipping inference demo - model not loaded")

## 8. Next Steps

### Recommended Tutorial Sequence

| # | Notebook | Description |
|---|----------|-------------|
| 00 | [Raw Data Analysis](00_raw_data_analysis.ipynb) | Explore the raw dataset |
| **01** | **Getting Started** | **You are here!** |
| 02 | [Data Preprocessing](02_data_preprocessing.ipynb) | Prepare data for training |
| 03 | [Training Models](03_training_models.ipynb) | Train from scratch |
| 04 | [Inference & Prediction](04_inference_prediction.ipynb) | Run predictions |
| 05 | [API Usage](05_api_usage.ipynb) | Use the REST API |
| 06 | [Dashboard Usage](06_dashboard_usage.ipynb) | Interactive dashboard |
| 07 | [Hyperparameter Sweeps](07_hyperparameter_sweeps.ipynb) | Optimize with W&B |
| 08 | [Model Evaluation](08_model_evaluation.ipynb) | Comprehensive eval |
| 09 | [Ablation Studies](09_ablation_studies.ipynb) | Component analysis |
| 10 | [Visualization Experiments](10_visualization_experiments.ipynb) | Publication figures |

### Quick Start Commands

```bash
# Create stratified splits
PYTHONPATH=src .venv/bin/python scripts/create_stratified_splits.py \
    --data-dir data \
    --output-dir outputs/stratified_splits_v2

# Train sensor multi-head decoder
PYTORCH_ENABLE_MPS_FALLBACK=1 PYTHONPATH=src .venv/bin/python scripts/train_sensor_multihead.py \
    --split-dir outputs/stratified_splits_v2 \
    --vocab-path data/vocabulary_4digit_hybrid.json \
    --encoder-path outputs/mm_dtae_lstm_v2/best_model.pt \
    --output-dir outputs/sensor_multihead_v3 \
    --use-wandb

# Start dashboard
PYTHONPATH=src .venv/bin/python flask_dashboard.py
```

## Summary

In this notebook, you learned:

- **Project purpose**: Predict G-code from sensor data at roughly 90% token accuracy or better
- **Architecture**: Frozen MM-DTAE-LSTM encoder + SensorMultiHeadDecoder
- **Data format**: NPZ files with sensor, tokens, operation types
- **Token types**: SPECIAL, COMMAND, PARAM_LETTER, NUMERIC (4 types)
- **Operations**: 9 distinct machining operation classes

### Key Model Features

| Feature | Value |
|---------|-------|
| d_model | 192 |
| n_heads | 8 |
| n_layers | 4 |
| sensor_dim | 128 |
| Token accuracy | Roughly 90% or higher |

### Troubleshooting

| Issue | Solution |
|-------|----------|
| Import errors | Activate venv, set `PYTHONPATH=src` |
| No checkpoints | Train a model first |
| Vocabulary missing | Check `data/vocabulary_4digit_hybrid.json` |
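| CUDA errors | Fall back to CPU (`device = torch.device('cpu')`) |
| Unsupported-op errors on Apple MPS | Set `PYTORCH_ENABLE_MPS_FALLBACK=1` before importing torch |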
| Memory errors | Reduce batch size |

---

**Navigation:**
← [Previous: 00_raw_data_analysis](00_raw_data_analysis.ipynb) |
[Next: 02_data_preprocessing](02_data_preprocessing.ipynb) →

**Related:** [03_training_models](03_training_models.ipynb) | [08_model_evaluation](08_model_evaluation.ipynb)