# 01 - Getting Started with G-code Fingerprinting

Welcome! This notebook introduces the G-code fingerprinting project: how the repository is laid out, how tokens are decomposed, and how to load a trained model.

## What is G-code Fingerprinting?

This project uses deep learning to:
1. **Predict G-code sequences** from CNC machine sensor data
2. **Extract machine fingerprints** - unique embeddings that identify specific machines
3. **Enable reverse engineering** of machine operations from sensor readings
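
To make the fingerprint idea concrete, the sketch below shows one way two embeddings could be compared: cosine similarity against a small set of reference embeddings. The vectors, the machine names, and the `closest_machine` helper are all hypothetical, and the project itself may compare fingerprints differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_machine(query, references):
    """Return the name of the reference embedding most similar to `query`."""
    return max(references, key=lambda name: cosine_similarity(query, references[name]))

# Hypothetical 3-dimensional fingerprints; real embeddings would be much larger.
references = {
    "machine_a": [0.9, 0.1, 0.0],
    "machine_b": [0.0, 0.2, 0.95],
}
query = [0.8, 0.2, 0.1]
print(closest_machine(query, references))
```

Cosine similarity ignores vector magnitude, which is one reason it is a common choice for comparing learned embeddings.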

## Project Architecture

```
Sensor Data → MM-DTAE-LSTM Backbone → Embeddings → Multi-Head LM → G-code Predictions
                                              ↓
                                    Machine Fingerprint
```

## Learning Objectives
- Understand the project structure
- Verify environment setup
- Run a complete inference example
- Explore the token decomposition system

In [None]:
# Setup
import sys
from pathlib import Path

# Assumes the notebook is run from the notebooks/ directory, so the project root is one level up
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

print(f"✓ Project root: {project_root}")

## 1. Project Structure

```
gcode_fingerprinting/
├── data/                      # Raw G-code and vocabulary files
├── src/miracle/               # Core package
│   ├── model/                 # Model architectures
│   │   ├── model.py          # MM-DTAE-LSTM backbone
│   │   └── multihead_lm.py   # Multi-head language model
│   ├── dataset/               # Data processing
│   │   ├── preprocessing.py  # Data preparation
│   │   └── target_utils.py   # Token decomposition
│   └── api/                   # FastAPI server
├── scripts/                   # Training and evaluation
├── configs/                   # Configuration files
├── outputs/                   # Model checkpoints and results
└── notebooks/                 # Tutorial notebooks (you are here!)
```

In [None]:
# Verify key directories exist (uses project_root from the setup cell)
key_dirs = ['data', 'src', 'scripts', 'configs', 'outputs']
print("Directory check:")
for d in key_dirs:
    path = project_root / d
    exists = "✓" if path.exists() else "✗"
    print(f"  {exists} {d}/")

## 2. Understanding Token Decomposition

G-code tokens are hierarchically decomposed into:
- **Commands**: G0, G1, M3, etc.
- **Parameter Types**: X, Y, Z, F, etc.
- **Parameter Values**: Numeric values

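Splitting tokens this way lets the model predict the command, the parameter type, and the parameter value as separate targets, rather than treating every distinct token (such as `X10.5`) as an unrelated vocabulary entry.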
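
As a rough illustration of the idea, the sketch below splits raw tokens into a kind, a symbol, and a numeric value using a regular expression. It is not the project's `TokenDecomposer`; the `decompose` function, the token pattern, and the rule that only `G` and `M` words count as commands are assumptions made for this example.

```python
import re

# Hypothetical token pattern: one letter followed by a number, e.g. "G1" or "X10.5".
TOKEN_RE = re.compile(r"^([A-Za-z])(-?\d+(?:\.\d+)?)$")

# Assumption for this sketch: only G and M words are treated as commands.
COMMAND_LETTERS = {"G", "M"}

def decompose(token):
    """Split one token into a (kind, symbol, value) triple."""
    match = TOKEN_RE.match(token)
    if match is None:
        # Anything that is not letter+number (e.g. "<UNK>") is passed through.
        return ("special", token, None)
    letter, number = match.group(1).upper(), match.group(2)
    if letter in COMMAND_LETTERS:
        return ("command", letter + number, None)
    return ("param", letter, float(number))

for tok in ["G0", "X10.5", "Y20.3", "G1", "F1500", "<UNK>"]:
    print(tok, "->", decompose(tok))
```

The real decomposer works on token IDs from `vocabulary.json`, as the cells below show; this sketch only mirrors the three-way split described above.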

In [None]:
# Import and create TokenDecomposer
from miracle.dataset.target_utils import TokenDecomposer

vocab_path = project_root / 'data' / 'vocabulary.json'

if vocab_path.exists():
    decomposer = TokenDecomposer(str(vocab_path))
    
    print(f"Vocabulary size: {decomposer.vocab_size}")
    print(f"Number of commands: {decomposer.n_commands}")
    print(f"Number of parameter types: {decomposer.n_param_types}")
    print(f"Number of parameter values: {decomposer.n_param_values}")
else:
    print("Vocabulary not found. Run preprocessing first.")

In [None]:
# Example: Decompose a sample sequence
if vocab_path.exists():
    # Sample G-code tokens
    sample_tokens = ['G0', 'X10.5', 'Y20.3', 'G1', 'F1500']
    
    print("Sample G-code sequence:")
    print(f"  {' '.join(sample_tokens)}")
    
    # Convert tokens to IDs
    token_ids = [decomposer.token_to_id.get(t, decomposer.token_to_id['<UNK>']) for t in sample_tokens]
    
    print(f"\nToken IDs: {token_ids}")
    
    # Decompose
    decomposed = decomposer.decompose_batch([token_ids])
    
    print("\nDecomposed structure:")
    print(f"  Commands: {decomposed['commands'][0]}")
    print(f"  Param types: {decomposed['param_types'][0]}")
    print(f"  Param values: {decomposed['param_values'][0]}")

## 3. Model Architecture Overview

### MM-DTAE-LSTM Backbone
- Processes multi-modal sensor data (continuous + categorical)
- Uses LSTM for temporal modeling
- Outputs embeddings for each timestep

### Multi-Head Language Model
- Three prediction heads:
  1. **Command Head**: Predicts G/M commands
  2. **Parameter Type Head**: Predicts X, Y, Z, F, etc.
  3. **Parameter Value Head**: Predicts numeric values
- Uses transformer architecture for autoregressive generation
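
One way to picture the three heads is as three separate score vectors per timestep, each over its own label set. The sketch below takes the top-scoring label from each head with a plain argmax. The label lists, the scores, and the `pick` helper are made up for illustration and do not reflect the actual interface of `MultiHeadGCodeLM`.

```python
def pick(labels, scores):
    """Return the label with the highest score (argmax)."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]

# Hypothetical label sets, one per head.
commands = ["G0", "G1", "M3"]
param_types = ["X", "Y", "Z", "F"]
param_values = ["10.5", "20.3", "1500"]

# Hypothetical per-head scores for a single timestep.
head_scores = {
    "command": [0.1, 2.3, -0.4],
    "param_type": [1.7, 0.2, 0.0, -1.1],
    "param_value": [0.9, 0.3, -0.2],
}

prediction = (
    pick(commands, head_scores["command"]),
    pick(param_types, head_scores["param_type"]),
    pick(param_values, head_scores["param_value"]),
)
print(prediction)
```

Keeping the heads separate means each one chooses among a small label set instead of the full token vocabulary.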

In [None]:
# Load model configuration
import json

config_path = project_root / 'configs' / 'config.json'

if config_path.exists():
    with open(config_path, 'r') as f:
        config = json.load(f)
    
    print("Model Configuration:")
    print(f"  Hidden dimension: {config.get('hidden_dim', 128)}")
    print(f"  Number of layers: {config.get('num_layers', 2)}")
    print(f"  Attention heads: {config.get('num_heads', 4)}")
    print(f"  Dropout: {config.get('dropout', 0.1)}")
else:
    print("Config file not found at", config_path)

## 4. Quick Inference Demo

Let's find a trained checkpoint and restore the backbone from it, which is the first step toward running inference on sample data.

In [None]:
# Check for available checkpoints
import glob

checkpoint_patterns = [
    'outputs/*/checkpoint_best.pt',
    'outputs/training_*/checkpoint_best.pt',
]

# Both patterns can match the same file, so collect into a set to avoid duplicates
checkpoints = set()
for pattern in checkpoint_patterns:
    checkpoints.update(glob.glob(str(project_root / pattern)))
checkpoints = sorted(checkpoints)

if checkpoints:
    print(f"Found {len(checkpoints)} checkpoint(s):")
    for cp in checkpoints[:5]:
        print(f"  - {Path(cp).relative_to(project_root)}")
    
    checkpoint_path = checkpoints[0]
    print(f"\nUsing: {Path(checkpoint_path).relative_to(project_root)}")
else:
    print("No checkpoints found. Train a model first.")
    checkpoint_path = None

In [None]:
# Load checkpoint and restore the backbone (this cell does not generate predictions yet)
if checkpoint_path and vocab_path.exists():
    import torch
    import numpy as np
    from miracle.model.model import MM_DTAE_LSTM, ModelConfig
    from miracle.model.multihead_lm import MultiHeadGCodeLM
    from miracle.utilities.device import get_device
    
    device = get_device()
    print(f"Using device: {device}")
    
    # Load checkpoint
    checkpoint = torch.load(checkpoint_path, map_location=device)
    config_dict = checkpoint.get('config', {})
    
    # Create models
    backbone_config = ModelConfig(
        sensor_dims=[155, 4],  # From preprocessed data
        d_model=config_dict.get('hidden_dim', 128),
        lstm_layers=config_dict.get('num_layers', 2),
        gcode_vocab=decomposer.vocab_size,
        n_heads=config_dict.get('num_heads', 4),
    )
    
    backbone = MM_DTAE_LSTM(backbone_config).to(device)
    backbone.load_state_dict(checkpoint['backbone_state_dict'])
    backbone.eval()
    
    print("✓ Model loaded successfully!")
    print(f"  Epoch: {checkpoint.get('epoch', 'unknown')}")
    print(f"  Val Accuracy: {checkpoint.get('val_acc', 'unknown')}")
else:
    print("Skipping inference demo - missing checkpoint or vocabulary")

## 5. Environment Verification

In [None]:
# Check Python version
import sys
print(f"Python version: {sys.version}")

# Check key packages
packages_to_check = [
    'torch',
    'numpy',
    'pandas',
    'matplotlib',
    'wandb',
    'fastapi',
]

print("\nPackage versions:")
for pkg in packages_to_check:
    try:
        module = __import__(pkg)
        version = getattr(module, '__version__', 'unknown')
        print(f"  ✓ {pkg}: {version}")
    except ImportError:
        print(f"  ✗ {pkg}: NOT INSTALLED")

## 6. Next Steps

### Tutorial Sequence:
1. ✓ **00_raw_data_analysis.ipynb** - Explore raw data
2. ✓ **01_getting_started.ipynb** - You are here!
3. **02_data_preprocessing.ipynb** - Prepare data for training
4. **03_training_models.ipynb** - Train models
5. **04_inference_prediction.ipynb** - Run predictions
6. **05_api_usage.ipynb** - Use the FastAPI server
7. **06_dashboard_usage.ipynb** - Interactive dashboard
8. **07_hyperparameter_sweeps.ipynb** - Optimize with W&B
9. **08_model_evaluation.ipynb** - Evaluate and compare models

### Quick Commands:

```bash
# Preprocess data
PYTHONPATH=src .venv/bin/python -m miracle.dataset.preprocessing \
    --data-dir data --output-dir outputs/processed --vocab-path data/vocabulary.json

# Train a model
PYTHONPATH=src .venv/bin/python scripts/train_multihead.py \
    --data-dir outputs/processed --vocab-path data/vocabulary.json \
    --output-dir outputs/training --max-epochs 10

# Start API server
PYTHONPATH=src .venv/bin/python src/miracle/api/server.py
```

## Summary

You now understand:
- Project architecture and structure
- Token decomposition system
- Model components (backbone + multi-head LM)
- How to verify your environment

## Troubleshooting

- **Import errors**: Ensure virtual environment is activated and `PYTHONPATH` includes `src/`
- **No checkpoints**: Train a model first using `scripts/train_multihead.py`
- **Vocabulary missing**: Run preprocessing first
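- **CUDA errors**: Fall back to CPU by setting `device='cpu'`. On Apple Silicon, setting the `PYTORCH_ENABLE_MPS_FALLBACK=1` environment variable lets operations that MPS does not support run on the CPU instead.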
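
For device problems in particular, a small selection helper can make the fallback order explicit. The sketch below is a hypothetical stand-in, not the project's `get_device` from `miracle.utilities.device`, whose behavior is not shown in this notebook. It prefers CUDA, then MPS, then CPU, and returns `"cpu"` if PyTorch is not installed.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu', preferring accelerators when available."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; report CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(f"Selected device: {pick_device()}")
```

The returned string can be passed as `map_location` to `torch.load` or to a module's `.to(...)` call.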