# 00 - Raw Data Analysis

This notebook explores the raw G-code and sensor data before any preprocessing.

## Learning Objectives
- Understand raw G-code file structure
- Explore sensor data format and features
- Perform statistical analysis on raw datasets
- Visualize data distributions
- Identify data quality issues

## Prerequisites
- Python 3.8+
- Project virtual environment activated
- `numpy`, `pandas`, `matplotlib`, and `seaborn` installed in that environment

In [None]:
# Setup
import sys
from pathlib import Path

# Add src to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

print(f"Project root: {project_root}")

In [None]:
# Imports
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Imports successful!")

## 1. Exploring Raw G-code Files

G-code is the numerical control programming language used to drive CNC machines. Each line, called a block, contains one or more words: an address letter such as `G`, `M`, `X`, or `F` followed by a numeric value.
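
To make that structure concrete, the sketch below splits one illustrative line into (letter, value) pairs. The sample line and the helper name `split_words` are made up for this example and are not taken from the project data.

```python
import re

# Hypothetical sample line; not taken from the project data.
sample = "G1 X10.5 Y-2 F1500 ; linear move"

def split_words(line):
    """Return (letter, value) pairs for one G-code line, ignoring comments."""
    line = re.sub(r";.*", "", line)       # strip ';' end-of-line comments
    line = re.sub(r"\(.*?\)", "", line)   # strip '(...)' inline comments
    pairs = re.findall(r"([A-Z])\s*([-+]?\d*\.?\d+)", line.upper())
    return [(letter, float(number)) for letter, number in pairs]

print(split_words(sample))
```

Here `G1` is the command (a linear move) and `X`, `Y`, and `F` are its parameters, which is the distinction the command analysis below relies on.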

In [None]:
# List available G-code files
data_dir = project_root / 'data'
gcode_files = list(data_dir.glob('*.gcode')) + list(data_dir.glob('*.nc'))

print(f"Found {len(gcode_files)} G-code files")
for f in gcode_files[:10]:  # Show first 10
    print(f"  - {f.name}")

In [None]:
# Load and display a sample G-code file
if gcode_files:
    sample_file = gcode_files[0]
    print(f"\nSample G-code file: {sample_file.name}")
    print("-" * 80)
    
    # Read only the first 20 lines instead of loading the whole file
    with open(sample_file, 'r', errors='replace') as f:
        for i, line in zip(range(1, 21), f):
            print(f"{i:3d}: {line.rstrip()}")
else:
    print("No G-code files found. Please add sample files to the data/ directory.")

### G-code Command Analysis

In [None]:
def parse_gcode_line(line):
    """Return the address letters (G, M, X, ...) of every word on a line."""
    # Remove ';' end-of-line comments and '(...)' inline comments
    line = re.sub(r';.*', '', line)
    line = re.sub(r'\(.*?\)', '', line)

    # Match words such as G0, M3, X10.5 or Y-2, even when they are not
    # separated by spaces (e.g. "G1X10Y20"); keep only the letter
    return re.findall(r'([A-Z])\s*[-+]?\d*\.?\d+', line.upper())

# Analyze all G-code files
all_commands = []

for gcode_file in gcode_files[:10]:  # Analyze first 10 files
    with open(gcode_file, 'r') as f:
        for line in f:
            all_commands.extend(parse_gcode_line(line))

# Count command frequencies
command_counts = Counter(all_commands)

print(f"\nTotal commands analyzed: {len(all_commands)}")
print(f"Unique command types: {len(command_counts)}")
print("\nTop 10 most common commands:")
for cmd, count in command_counts.most_common(10):
    print(f"  {cmd}: {count:,} ({count/len(all_commands)*100:.1f}%)")

In [None]:
# Visualize command distribution
top_commands = dict(command_counts.most_common(15))

plt.figure(figsize=(12, 6))
plt.bar(top_commands.keys(), top_commands.values(), color='steelblue', alpha=0.7)
plt.xlabel('Command Type', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Top 15 G-code Commands Distribution', fontsize=14, fontweight='bold')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 2. Exploring Sensor Data

Sensor data captures machine states during G-code execution.

In [None]:
# Check for sensor data files
sensor_files = list(data_dir.glob('*sensor*.json')) + list(data_dir.glob('*sensor*.csv'))

if sensor_files:
    print(f"Found {len(sensor_files)} sensor data files:")
    for f in sensor_files[:5]:
        print(f"  - {f.name}")
else:
    print("No sensor files found. Sensor data might be embedded with G-code.")
    print("Check the data/ directory structure.")

In [None]:
# Example: Load sensor data (adjust based on your data format)
# This is a template - modify based on your actual data structure

# Option 1: If sensor data is in JSON format
# with open(sensor_files[0], 'r') as f:
#     sensor_data = json.load(f)

# Option 2: If sensor data is in CSV format
# sensor_df = pd.read_csv(sensor_files[0])
# print(sensor_df.head())

print("Adjust this cell based on your actual sensor data format")
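
As a starting point, a loader can dispatch on the file extension. This is a minimal sketch, assuming CSV files have a header row and JSON files hold a list of records. The function name `load_sensor_records` and both layouts are assumptions, so adapt them to your actual data.

```python
import csv
import json
from pathlib import Path

def load_sensor_records(path):
    """Load a sensor file into a list of dicts (assumed layouts, see above)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # Assumes the first row is a header; values are returned as strings.
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    if suffix == ".json":
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Assumes a top-level list of records; wrap a single object in a list.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"Unsupported sensor file type: {path.suffix}")
```

For example, `pd.DataFrame(load_sensor_records(sensor_files[0])).head()` gives a quick look at the available columns once sensor files are present.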

## 3. Vocabulary Analysis

The vocabulary file maps G-code tokens to integer IDs.
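
The mapping can be illustrated with a toy vocabulary. The tokens and IDs below are invented for the example and the helper `encode` is hypothetical. The project's real `vocabulary.json` may use different tokens and a different unknown-token convention.

```python
# Toy vocabulary; tokens and IDs are made up for illustration.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "G1": 2, "X10.5": 3, "F1500": 4}

def encode(tokens, vocab, unk_token="<UNK>"):
    """Map tokens to integer IDs, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]

ids = encode(["G1", "X10.5", "Y2", "F1500"], toy_vocab)
print(ids)  # "Y2" is not in the toy vocabulary, so it maps to the <UNK> ID
```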

In [None]:
# Load vocabulary
vocab_path = project_root / 'data' / 'vocabulary.json'

if vocab_path.exists():
    with open(vocab_path, 'r') as f:
        vocab = json.load(f)
    
    print(f"Vocabulary size: {len(vocab)}")
    print("\nFirst 20 tokens:")
    for token, idx in list(vocab.items())[:20]:
        print(f"  {token:15s} -> {idx}")
else:
    print("Vocabulary file not found at", vocab_path)
    print("Run preprocessing to generate vocabulary.")

In [None]:
# Analyze vocabulary composition
if vocab_path.exists():
    # Categorize tokens
    special = ['<PAD>', '<UNK>', '<SOS>', '<EOS>']
    g_commands = [t for t in vocab if t.startswith('G')]
    m_commands = [t for t in vocab if t.startswith('M')]
    # Check the length first so an empty token cannot raise IndexError
    params = [t for t in vocab if len(t) > 1 and t[0] in 'XYZFIJKR']
    # Count the special tokens actually present instead of assuming all four
    n_special = sum(1 for t in vocab if t in special)
    n_other = len(vocab) - len(g_commands) - len(m_commands) - len(params) - n_special
    
    print("Token categories:")
    print(f"  G-commands: {len(g_commands)}")
    print(f"  M-commands: {len(m_commands)}")
    print(f"  Parameters: {len(params)}")
    print(f"  Special tokens: {n_special}")
    print(f"  Other: {n_other}")

## 4. Data Quality Assessment

In [None]:
# Analyze G-code file sizes and line counts
if gcode_files:
    file_stats = []
    
    for gcode_file in gcode_files:
        with open(gcode_file, 'r') as f:
            lines = f.readlines()
            # Keep lines that are neither blank nor pure comments
            non_empty = [line for line in lines
                         if line.strip() and not line.strip().startswith((';', '('))]
            
            file_stats.append({
                'file': gcode_file.name,
                'total_lines': len(lines),
                'code_lines': len(non_empty),
                'size_kb': gcode_file.stat().st_size / 1024
            })
    
    stats_df = pd.DataFrame(file_stats)
    
    print("G-code File Statistics:")
    print(f"\nTotal files: {len(stats_df)}")
    print("\nCode lines per file:")
    print(f"  Mean: {stats_df['code_lines'].mean():.0f}")
    print(f"  Median: {stats_df['code_lines'].median():.0f}")
    print(f"  Min: {stats_df['code_lines'].min()}")
    print(f"  Max: {stats_df['code_lines'].max()}")
    print("\nFile size (KB):")
    print(f"  Mean: {stats_df['size_kb'].mean():.1f}")
    print(f"  Total: {stats_df['size_kb'].sum():.1f}")

In [None]:
# Visualize file statistics
if gcode_files:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Lines distribution
    axes[0].hist(stats_df['code_lines'], bins=20, color='steelblue', alpha=0.7, edgecolor='black')
    axes[0].set_xlabel('Number of Code Lines', fontsize=12)
    axes[0].set_ylabel('Frequency', fontsize=12)
    axes[0].set_title('Distribution of Code Lines per File', fontsize=13, fontweight='bold')
    axes[0].grid(axis='y', alpha=0.3)
    
    # Size distribution
    axes[1].hist(stats_df['size_kb'], bins=20, color='coral', alpha=0.7, edgecolor='black')
    axes[1].set_xlabel('File Size (KB)', fontsize=12)
    axes[1].set_ylabel('Frequency', fontsize=12)
    axes[1].set_title('Distribution of File Sizes', fontsize=13, fontweight='bold')
    axes[1].grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
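
Beyond size statistics, two simple checks often surface problems early: files with no content and files that are byte-for-byte duplicates. The sketch below is one possible approach and is not part of the project code. The helper name `find_empty_and_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_empty_and_duplicates(paths):
    """Return (empty_files, duplicate_groups) for an iterable of file paths."""
    empty = []
    by_digest = defaultdict(list)
    for path in map(Path, paths):
        data = path.read_bytes()
        if not data.strip():
            empty.append(path)  # zero bytes or whitespace only
            continue
        by_digest[hashlib.sha256(data).hexdigest()].append(path)
    # Keep only digests shared by more than one file.
    duplicates = [group for group in by_digest.values() if len(group) > 1]
    return empty, duplicates
```

Calling `empty, duplicates = find_empty_and_duplicates(gcode_files)` returns two lists that can be reviewed before preprocessing.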

## 5. Hands-On Exercise

**Task**: Analyze a G-code file and extract key statistics.

For a G-code file of your choice:
1. Count the number of different G-commands (G0, G1, G2, etc.)
2. Count the number of different M-commands
3. Find the most common parameter (X, Y, Z, F, etc.)
4. Calculate the average line length

In [None]:
# Your solution here

# Example starter code:
if gcode_files:
    target_file = gcode_files[0]  # Pick any file
    
    # TODO: Implement the analysis
    pass

## Summary

In this notebook, you learned:
- How to load and parse raw G-code files
- How to analyze command distributions
- How to explore vocabulary structure
- How to assess data quality

## Next Steps

Continue to **01_getting_started.ipynb** for an overview of the entire project.

## Troubleshooting

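- **No G-code files found**: Ensure your `.gcode` or `.nc` files are in the `data/` directory under the project root. The setup cell uses `Path.cwd().parent` as the project root, so run the notebook from a direct subdirectory of the project.
- **Vocabulary not found**: Run preprocessing first to generate `data/vocabulary.json`.
- **Import errors**: Make sure your virtual environment is activated and the packages imported in the second cell are installed.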