# Shuttle Tracker ML Training Exploration

This notebook explores the training data for the Shubble shuttle tracking ML component. We inspect the preprocessed data, the segment distributions, and the speed calculation to find the source of the high mean squared error (MSE) and the extreme speed values.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

%matplotlib inline
sns.set_theme(style="whitegrid")

## 1. Load Data

In [None]:
data_path = Path("../data/preprocessed_vehicle_locations.csv")
if not data_path.exists():
    # Try relative to project root if notebook is run from root
    data_path = Path("ml/data/preprocessed_vehicle_locations.csv")

print(f"Loading data from {data_path}...")
df = pd.read_csv(data_path)
df['timestamp'] = pd.to_datetime(df['timestamp'])
print(f"Loaded {len(df)} records.")

## 2. Inspect Speed Statistics

We previously observed a very high MSE. Because a few extreme target values can dominate a squared-error metric, we start with the distribution of the `speed_kmh` column.

In [None]:
print("Speed Statistics (km/h):")
print(df['speed_kmh'].describe())

plt.figure(figsize=(10, 6))
sns.histplot(df['speed_kmh'].dropna(), bins=100, kde=True)
plt.title("Distribution of Speed (km/h)")
plt.xlabel("Speed (km/h)")
plt.yscale('log')  # Log scale since outliers are extreme
plt.show()
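
To see why a handful of extreme speeds could be enough to explain a very high MSE, here is a minimal sketch with made-up numbers (none of them come from the dataset). The helper `mse` and the toy lists are illustrative assumptions: five reasonable predictions are scored against clean targets, then against the same targets with a single corrupted value.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up speeds in km/h; every prediction is off by exactly 1 km/h.
clean_targets = [20.0, 25.0, 30.0, 22.0, 28.0]
predictions = [21.0, 24.0, 29.0, 23.0, 27.0]

# Same targets, but the last one is replaced by a single corrupted value.
corrupt_targets = clean_targets[:-1] + [100_000.0]

print("MSE, clean targets: ", mse(clean_targets, predictions))
print("MSE, one bad target:", mse(corrupt_targets, predictions))
```

Because the error is squared, the one corrupted target contributes almost all of the second result, which is why the outliers are worth removing or repairing before judging the model.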

## 3. Identify Extreme Outliers

Shuttles shouldn't be traveling at 100,000+ km/h. Let's see some of these records.

In [None]:
# Treat anything above 120 km/h as implausible for a shuttle
outliers = df[df['speed_kmh'] > 120].sort_values('speed_kmh', ascending=False)
print(f"Found {len(outliers)} records with speed > 120 km/h")
print("Top 10 most extreme outliers:")
display(outliers.head(10))
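
A quick back-of-the-envelope check suggests how speeds of this magnitude can arise from the speed formula alone. The sketch below uses assumed, illustrative values: a 10 m displacement (the kind of jitter a stationary GPS receiver can report) divided by progressively smaller time deltas. The helper `speed_kmh` is hypothetical and not taken from the project's preprocessing code.

```python
def speed_kmh(distance_km: float, dt_seconds: float) -> float:
    """Speed in km/h from a distance in km and a time delta in seconds."""
    return distance_km / (dt_seconds / 3600.0)

# Illustrative values only: 10 m of movement over shrinking time deltas.
distance_km = 0.010
for dt in (5.0, 1.0, 0.01, 0.0001):
    print(f"dt = {dt:>8} s -> {speed_kmh(distance_km, dt):>12,.1f} km/h")
```

With a fixed 10 m displacement, the computed speed grows in inverse proportion to the time delta, so near-duplicate timestamps are a plausible source of six-digit speeds even when the vehicle barely moved.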

## 4. Investigate the Cause

Is it GPS drift (large distance jump) or tiny time deltas?

In [None]:
# Let's look at consecutive points for an outlier vehicle
if len(outliers) > 0:
    example_vehicle = outliers.iloc[0]['vehicle_id']
    example_time = outliers.iloc[0]['timestamp']
    
    # Select a +/- 5 minute window around the outlier for the same vehicle
    half_window = pd.Timedelta(minutes=5)
    mask = (df['vehicle_id'] == example_vehicle) & df['timestamp'].between(
        example_time - half_window, example_time + half_window
    )
    
    window = df[mask].sort_values('timestamp')
    print(f"Sequence around outlier for vehicle {example_vehicle}:")
    display(window[['timestamp', 'latitude', 'longitude', 'distance_km', 'speed_kmh', 'epoch_seconds']])
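
Inspecting one window by eye answers the question for a single vehicle. The sketch below is one possible way to label every over-limit row at once. It assumes `epoch_seconds` is the ping time in seconds and `distance_km` is the distance from the previous ping of the same vehicle. The function name `diagnose_outliers` and the thresholds `min_dt_s` and `max_jump_km` are hypothetical choices for illustration, not project settings.

```python
import numpy as np
import pandas as pd

def diagnose_outliers(df, speed_limit_kmh=120.0, min_dt_s=1.0, max_jump_km=0.5):
    """Label each over-limit row by its most likely cause.

    min_dt_s and max_jump_km are illustrative thresholds (assumptions).
    """
    out = df.sort_values(["vehicle_id", "epoch_seconds"]).copy()
    # Seconds since the previous ping from the same vehicle (NaN for the first).
    out["dt_seconds"] = out.groupby("vehicle_id")["epoch_seconds"].diff()

    is_outlier = out["speed_kmh"] > speed_limit_kmh
    tiny_dt = out["dt_seconds"] < min_dt_s
    big_jump = out["distance_km"] > max_jump_km

    # The first matching condition wins, so tiny time deltas take priority.
    out["cause"] = np.select(
        [is_outlier & tiny_dt, is_outlier & big_jump, is_outlier],
        ["tiny_time_delta", "large_distance_jump", "other"],
        default="ok",
    )
    return out

# Toy data (made up): one near-duplicate timestamp and one 2 km jump.
toy = pd.DataFrame({
    "vehicle_id": ["a", "a", "a", "a"],
    "epoch_seconds": [0.0, 5.0, 5.001, 10.0],
    "distance_km": [0.0, 0.03, 0.01, 2.0],
    "speed_kmh": [0.0, 21.6, 36000.0, 1440.3],
})
diagnosed = diagnose_outliers(toy)
print(diagnosed[["dt_seconds", "distance_km", "speed_kmh", "cause"]])
```

Running `diagnose_outliers(df)["cause"].value_counts()` on the real data would then show which of the two explanations accounts for most outliers, provided the column assumptions above hold.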

## 5. Segment Analysis

Let's look at how many points each segment contains.

In [None]:
from ml.data.preprocess import segment_by_consecutive

if 'segment_id' not in df.columns:
    print("Segmenting data...")
    df = segment_by_consecutive(df, max_timedelta=300, segment_column='segment_id')

segment_counts = df['segment_id'].value_counts()
print("Segment Length Statistics:")
print(segment_counts.describe())

plt.figure(figsize=(10, 6))
sns.histplot(segment_counts, bins=50)
plt.title("Distribution of Segment Lengths")
plt.xlabel("Number of points per segment")
plt.show()
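
For readers without the project code at hand, the idea behind gap-based segmentation can be sketched as follows. This is an illustrative stand-in, not the implementation of `segment_by_consecutive`: the function name `segment_by_gap` is hypothetical, and it assumes that `max_timedelta=300` above means a maximum gap of 300 seconds between consecutive points of the same vehicle.

```python
import pandas as pd

def segment_by_gap(df, max_gap_s=300, segment_column="segment_id"):
    """Start a new segment when the vehicle changes or the gap exceeds max_gap_s.

    Illustrative stand-in; not the project's segment_by_consecutive.
    """
    out = df.sort_values(["vehicle_id", "timestamp"]).copy()
    gap_s = out.groupby("vehicle_id")["timestamp"].diff().dt.total_seconds()
    # The first row of each vehicle has a NaN gap and always opens a segment.
    new_segment = gap_s.isna() | (gap_s > max_gap_s)
    out[segment_column] = new_segment.cumsum() - 1
    return out

# Toy data (made up): vehicle "a" has a ~10 minute gap before its third ping.
toy = pd.DataFrame({
    "vehicle_id": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00:00", "2024-01-01 08:00:05", "2024-01-01 08:10:00",
        "2024-01-01 08:00:00", "2024-01-01 08:00:05",
    ]),
})
segmented = segment_by_gap(toy)
print(segmented)
```

Under this scheme a very short segment means an isolated ping or two between long gaps, so the lower tail of the segment length histogram is where sparse or interrupted tracking would show up.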

## 6. Correlation Analysis

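Let's see if speed correlates with other features. `DataFrame.corr()` computes the Pearson correlation by default, which is sensitive to extreme values, so read these coefficients with the speed outliers from Section 3 in mind.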

In [None]:
plt.figure(figsize=(10, 8))
corr = df[['latitude', 'longitude', 'dist_to_route', 'distance_km', 'speed_kmh']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
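
Since Pearson correlation can be distorted by a few extreme values, it may help to compare it with a rank-based (Spearman) correlation and with the Pearson correlation after dropping implausible speeds. The sketch below uses made-up toy data, and the 120 km/h cutoff simply reuses the threshold from Section 3. Spearman is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy data (made up): speed rises with distance, except one corrupted reading.
toy = pd.DataFrame({
    "distance_km": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
    "speed_kmh": [7.0, 14.0, 22.0, 29.0, 36.0, 43.0],
})
toy.loc[2, "speed_kmh"] = 100_000.0  # single extreme outlier

# Pearson on the raw values is dominated by the outlier.
pearson = toy.corr().loc["distance_km", "speed_kmh"]

# Spearman correlation is the Pearson correlation of the ranks.
spearman = toy.rank().corr().loc["distance_km", "speed_kmh"]

# Pearson again, after removing speeds above the plausibility threshold.
filtered = toy[toy["speed_kmh"] <= 120]
pearson_filtered = filtered.corr().loc["distance_km", "speed_kmh"]

print(f"Pearson (raw):      {pearson:.2f}")
print(f"Spearman (raw):     {spearman:.2f}")
print(f"Pearson (filtered): {pearson_filtered:.2f}")
```

If the three numbers disagree strongly on the real data, the heatmap above is probably reflecting the outliers more than the underlying relationship between features.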