# Segment Pipeline Demo

This notebook demonstrates the segmentation functions in `ml/data/segment.py`:

1. **`segment_by_consecutive`** - Group consecutive GPS points into trip segments
2. **`filter_segments_by_length`** - Remove segments with too few points
3. **`clean_closest_route`** - Fill NaN route values using majority vote from neighbors
4. **`add_closest_points_educated`** - Refine geometric points for inferred routes

These functions are used by `segment_pipeline()` in `ml/pipelines.py`.

In [1]:
import pandas as pd
import numpy as np
import sys
from pathlib import Path

# Add project root to Python path
project_root = Path.cwd().parent.parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import segment functions
from ml.data.segment import (
    segment_by_consecutive,
    filter_segments_by_length,
    clean_closest_route,
    add_closest_points_educated
)

print("Segment functions loaded successfully!")

2026-02-04 17:18:55 - ml - INFO - ML logging level: debug
Segment functions loaded successfully!


---
## 1. `segment_by_consecutive`

Groups consecutive GPS points into segments based on:
- **Vehicle ID changes** - New vehicle = new segment
- **Time gaps** - Gap > `max_timedelta` = new segment
- **Distance from route** (optional) - Off-route = new segment

### Function Signature
```python
def segment_by_consecutive(
    df: pd.DataFrame,
    max_timedelta: float,
    segment_column: str,
    distance_column: str = None,
    max_distance_to_route: float = None
) -> pd.DataFrame:
```

**Returns a new DataFrame with segment IDs.**
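
The vehicle-change and time-gap rules can be illustrated with a short pandas sketch. This is not the code in `ml/data/segment.py`. `sketch_segment_ids` is a hypothetical helper, and it assumes rows are already sorted by vehicle and then by time and that segment IDs start at 1.

```python
import pandas as pd


def sketch_segment_ids(df, max_timedelta,
                       vehicle_col="vehicle_id", time_col="epoch_seconds"):
    """Hypothetical helper: label runs of consecutive rows with segment IDs."""
    # A row starts a new segment when the vehicle differs from the previous row...
    new_vehicle = df[vehicle_col] != df[vehicle_col].shift()
    # ...or when the time since the previous row exceeds the threshold.
    big_gap = df[time_col].diff() > max_timedelta
    # Each True increments the running count, which serves as the segment ID.
    return (new_vehicle | big_gap).cumsum()


demo = pd.DataFrame({
    "vehicle_id": [1, 1, 1, 1, 1, 2, 2, 2],
    "epoch_seconds": [0, 5, 10, 100, 105, 0, 5, 10],
})
demo["segment_id"] = sketch_segment_ids(demo, max_timedelta=15)
print(demo["segment_id"].tolist())  # [1, 1, 1, 2, 2, 3, 3, 3]
```

The cumulative sum over a boolean "starts a new segment" mask avoids an explicit Python loop, which matters at the scale of millions of GPS points.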

In [2]:
# Create sample DataFrame demonstrating different segmentation triggers
df_seg = pd.DataFrame({
    'vehicle_id': [1, 1, 1, 1, 1, 2, 2, 2],
    'epoch_seconds': [
        0, 5, 10,    # Vehicle 1, consecutive (5s gaps)
        100, 105,    # Vehicle 1, after 90s gap → new segment
        0, 5, 10     # Vehicle 2 → new segment
    ],
    'latitude': [42.73, 42.731, 42.732, 42.735, 42.736, 42.74, 42.741, 42.742],
    'longitude': [-73.67, -73.67, -73.67, -73.68, -73.68, -73.69, -73.69, -73.69]
})

print("BEFORE segment_by_consecutive:")
print(df_seg)
print(f"\nColumns: {list(df_seg.columns)}")

BEFORE segment_by_consecutive:
   vehicle_id  epoch_seconds  latitude  longitude
0           1              0    42.730     -73.67
1           1              5    42.731     -73.67
2           1             10    42.732     -73.67
3           1            100    42.735     -73.68
4           1            105    42.736     -73.68
5           2              0    42.740     -73.69
6           2              5    42.741     -73.69
7           2             10    42.742     -73.69

Columns: ['vehicle_id', 'epoch_seconds', 'latitude', 'longitude']


In [3]:
# Apply segmentation with 15-second max gap
df_seg_result = segment_by_consecutive(df_seg, max_timedelta=15, segment_column='segment_id')

print("AFTER segment_by_consecutive (max_timedelta=15):")
print(df_seg_result)
print(f"\nNumber of segments: {df_seg_result['segment_id'].nunique()}")

AFTER segment_by_consecutive (max_timedelta=15):
   vehicle_id  epoch_seconds  latitude  longitude  segment_id
0           1              0    42.730     -73.67           1
1           1              5    42.731     -73.67           1
2           1             10    42.732     -73.67           1
3           1            100    42.735     -73.68           2
4           1            105    42.736     -73.68           2
5           2              0    42.740     -73.69           3
6           2              5    42.741     -73.69           3
7           2             10    42.742     -73.69           3

Number of segments: 3


In [4]:
# Explain the segmentation
print("Segmentation Explanation:")
print("="*60)
print("Segment 1: Rows 0-2 (vehicle 1, consecutive points, 5s gaps)")
print("Segment 2: Rows 3-4 (vehicle 1, after 90s gap > 15s threshold)")
print("Segment 3: Rows 5-7 (vehicle 2, new vehicle)")

Segmentation Explanation:
============================================================
Segment 1: Rows 0-2 (vehicle 1, consecutive points, 5s gaps)
Segment 2: Rows 3-4 (vehicle 1, after 90s gap > 15s threshold)
Segment 3: Rows 5-7 (vehicle 2, new vehicle)


### Distance-Based Segmentation

Optionally, segments can also be split when a vehicle goes off-route, that is, when its distance to the route exceeds `max_distance_to_route`.

In [5]:
# Create DataFrame with distance to route
df_dist = pd.DataFrame({
    'vehicle_id': [1, 1, 1, 1, 1, 1],
    'epoch_seconds': [0, 5, 10, 15, 20, 25],
    'latitude': [42.73, 42.731, 42.732, 42.735, 42.736, 42.737],
    'longitude': [-73.67, -73.67, -73.67, -73.67, -73.67, -73.67],
    'dist_to_route': [
        0.005,   # On route (5m)
        0.008,   # On route (8m)
        0.025,   # OFF route (25m) - exceeds 20m threshold
        0.030,   # OFF route (30m)
        0.010,   # Back on route (10m)
        0.006    # On route (6m)
    ]
})

print("Data with distance to route:")
print(df_dist)

Data with distance to route:
   vehicle_id  epoch_seconds  latitude  longitude  dist_to_route
0           1              0    42.730     -73.67          0.005
1           1              5    42.731     -73.67          0.008
2           1             10    42.732     -73.67          0.025
3           1             15    42.735     -73.67          0.030
4           1             20    42.736     -73.67          0.010
5           1             25    42.737     -73.67          0.006


In [6]:
# Segment with distance threshold
df_dist_result = segment_by_consecutive(
    df_dist,
    max_timedelta=30,
    segment_column='segment_id',
    distance_column='dist_to_route',
    max_distance_to_route=0.020  # 20 meters
)

print("After segmentation with distance threshold (20m):")
print(df_dist_result)
print(f"\nNumber of segments: {df_dist_result['segment_id'].nunique()}")
print("\n→ Off-route points (rows 2-3) form separate segments")

After segmentation with distance threshold (20m):
   vehicle_id  epoch_seconds  latitude  longitude  dist_to_route  segment_id
0           1              0    42.730     -73.67          0.005           1
1           1              5    42.731     -73.67          0.008           1
2           1             10    42.732     -73.67          0.025           2
3           1             15    42.735     -73.67          0.030           3
4           1             20    42.736     -73.67          0.010           4
5           1             25    42.737     -73.67          0.006           4

Number of segments: 4

→ Off-route points (rows 2-3) form separate segments
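
One rule that reproduces the segment IDs shown above: a point starts a new segment if it is off-route, or if the previous point was off-route. The sketch below encodes that rule. It is an inference from this output only, not a description of what `segment_by_consecutive` does internally, and `sketch_distance_breaks` is a hypothetical name.

```python
import pandas as pd


def sketch_distance_breaks(dist, max_distance_to_route):
    """Hypothetical helper: segment IDs from a distance-to-route Series."""
    off_route = dist > max_distance_to_route
    # Was the previous point off-route? The first row has no predecessor.
    prev_off = off_route.shift(fill_value=False)
    # Off-route points stand alone; the first on-route point after them restarts.
    starts = off_route | prev_off
    starts.iloc[0] = True  # the first row always opens segment 1
    return starts.cumsum()


dist = pd.Series([0.005, 0.008, 0.025, 0.030, 0.010, 0.006])
print(sketch_distance_breaks(dist, 0.020).tolist())  # [1, 1, 2, 3, 4, 4]
```

Under this rule every off-route point becomes a one-point segment, so a later length filter would drop it.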


---
## 2. `filter_segments_by_length`

Removes segments with fewer than a minimum number of points.

### Function Signature
```python
def filter_segments_by_length(
    df: pd.DataFrame,
    segment_column: str,
    min_length: int
) -> pd.DataFrame:
```

**Returns a filtered DataFrame.**
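
A length filter of this kind can be written in one line with `groupby(...).transform("size")`. The sketch below is an illustration under that assumption, not the project's implementation. `sketch_filter_by_length` is a hypothetical name.

```python
import pandas as pd


def sketch_filter_by_length(df, segment_column, min_length):
    """Hypothetical helper: keep rows whose segment has >= min_length rows."""
    # transform("size") broadcasts each group's row count back onto its rows.
    sizes = df.groupby(segment_column)[segment_column].transform("size")
    return df[sizes >= min_length]


toy = pd.DataFrame({"segment_id": [1, 1, 1, 1, 1, 2, 2, 3, 3, 3]})
kept = sketch_filter_by_length(toy, "segment_id", min_length=3)
print(kept.index.tolist())  # [0, 1, 2, 3, 4, 7, 8, 9]
```

Boolean indexing keeps the original index labels, so the surviving rows can be traced back to their positions in the input.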

In [7]:
# Create DataFrame with segments of varying sizes
df_filter = pd.DataFrame({
    'segment_id': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    'value': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
})

print("BEFORE filter_segments_by_length:")
print(df_filter)
print("\nSegment sizes:")
# Compute the sizes from the data instead of hard-coding them
for seg_id, size in df_filter.groupby('segment_id').size().items():
    print(f"  Segment {seg_id}: {size} points")

BEFORE filter_segments_by_length:
   segment_id value
0           1     a
1           1     b
2           1     c
3           1     d
4           1     e
5           2     f
6           2     g
7           3     h
8           3     i
9           3     j

Segment sizes:
  Segment 1: 5 points
  Segment 2: 2 points
  Segment 3: 3 points


In [8]:
# Filter to keep only segments with >= 3 points
df_filtered = filter_segments_by_length(df_filter, 'segment_id', min_length=3)

print("AFTER filter_segments_by_length (min_length=3):")
print(df_filtered)
print("\nRemoved segment 2 (only 2 points)")
print("Kept segments 1 and 3 (3+ points)")

AFTER filter_segments_by_length (min_length=3):
   segment_id value
0           1     a
1           1     b
2           1     c
3           1     d
4           1     e
7           3     h
8           3     i
9           3     j

Removed segment 2 (only 2 points)
Kept segments 1 and 3 (3+ points)


---
## 3. `clean_closest_route`

Fills NaN route values using a majority vote over a surrounding window of points in the same segment.

**Why?** Sometimes route matching fails for individual points even when surrounding points clearly indicate the route.

### Function Signature
```python
def clean_closest_route(
    df: pd.DataFrame,
    route_column: str = 'route',
    polyline_idx_column: str = 'polyline_idx',
    segment_column: str = 'segment_id',
    window_size: int = 5,
    require_majority_valid: bool = False
) -> pd.DataFrame:
```

**Returns a new DataFrame with filled values.**
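
The majority-vote idea can be sketched with a plain loop. This is a simplified illustration, not the vectorized code in `ml/data/segment.py`. It assumes the window spans `window_size` rows on each side of the missing value, breaks ties by first occurrence, and fills only the route column. `sketch_fill_routes` is a hypothetical name.

```python
from collections import Counter

import pandas as pd


def sketch_fill_routes(df, route_col="route", segment_col="segment_id",
                       window_size=3):
    """Hypothetical helper: fill missing routes by majority vote per segment."""
    out = df.copy()
    for _, seg in df.groupby(segment_col):
        routes = seg[route_col].tolist()
        labels = seg.index.tolist()
        for pos, value in enumerate(routes):
            if pd.notna(value):
                continue  # only missing values are filled
            # The window is clipped to the segment, so votes never cross it.
            lo = max(0, pos - window_size)
            hi = min(len(routes), pos + window_size + 1)
            votes = Counter(r for r in routes[lo:hi] if pd.notna(r))
            if votes:
                out.loc[labels[pos], route_col] = votes.most_common(1)[0][0]
    return out


toy = pd.DataFrame({
    "segment_id": [1, 1, 1, 2, 2, 2],
    "route": ["WEST", None, "WEST", "NORTH", None, "NORTH"],
})
filled = sketch_fill_routes(toy, window_size=2)
print(filled["route"].tolist())
```

Votes are counted from the original values rather than from values filled earlier in the loop, so a long run of NaNs cannot propagate a guess from one end to the other.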

In [9]:
# Create DataFrame with NaN routes surrounded by valid routes
df_clean = pd.DataFrame({
    'segment_id': [1, 1, 1, 1, 1, 1, 1],
    'route': ['WEST', 'WEST', None, None, 'WEST', 'WEST', 'WEST'],
    'polyline_idx': [0, 0, np.nan, np.nan, 0, 0, 0],
    'latitude': [42.730, 42.731, 42.732, 42.733, 42.734, 42.735, 42.736],
    'longitude': [-73.676, -73.676, -73.676, -73.676, -73.676, -73.676, -73.676]
})

print("BEFORE clean_closest_route:")
print(df_clean)
print(f"\nNaN routes: {df_clean['route'].isna().sum()}")

BEFORE clean_closest_route:
   segment_id route  polyline_idx  latitude  longitude
0           1  WEST           0.0    42.730    -73.676
1           1  WEST           0.0    42.731    -73.676
2           1  None           NaN    42.732    -73.676
3           1  None           NaN    42.733    -73.676
4           1  WEST           0.0    42.734    -73.676
5           1  WEST           0.0    42.735    -73.676
6           1  WEST           0.0    42.736    -73.676

NaN routes: 2


In [10]:
# Clean NaN routes using majority vote
df_cleaned = clean_closest_route(df_clean, 'route', 'polyline_idx', 'segment_id', window_size=3)

print("AFTER clean_closest_route (window_size=3):")
print(df_cleaned)
print(f"\nNaN routes: {df_cleaned['route'].isna().sum()}")
print("\n→ NaN values filled based on surrounding WEST route values")

Cleaning 2 NaN route values using window size 3 (Vectorized)...
  ✓ Filled 2/2 NaN values (100.0%)
AFTER clean_closest_route (window_size=3):
   segment_id route  polyline_idx  latitude  longitude
0           1  WEST           0.0    42.730    -73.676
1           1  WEST           0.0    42.731    -73.676
2           1  WEST           0.0    42.732    -73.676
3           1  WEST           0.0    42.733    -73.676
4           1  WEST           0.0    42.734    -73.676
5           1  WEST           0.0    42.735    -73.676
6           1  WEST           0.0    42.736    -73.676

NaN routes: 0

→ NaN values filled based on surrounding WEST route values


In [11]:
# Demonstrate segment boundary respect
df_boundary = pd.DataFrame({
    'segment_id': [1, 1, 1, 2, 2, 2],  # Two segments
    'route': ['WEST', None, 'WEST', 'NORTH', None, 'NORTH'],
    'polyline_idx': [0, np.nan, 0, 0, np.nan, 0]
})

print("Segment boundary example:")
print("BEFORE:")
print(df_boundary)

df_boundary_cleaned = clean_closest_route(df_boundary, 'route', 'polyline_idx', 'segment_id', window_size=2)
print("\nAFTER:")
print(df_boundary_cleaned)
print("\n→ Row 1 filled with WEST (from segment 1)")
print("→ Row 4 filled with NORTH (from segment 2)")
print("→ Windows don't cross segment boundaries!")

Segment boundary example:
BEFORE:
   segment_id  route  polyline_idx
0           1   WEST           0.0
1           1   None           NaN
2           1   WEST           0.0
3           2  NORTH           0.0
4           2   None           NaN
5           2  NORTH           0.0
Cleaning 2 NaN route values using window size 2 (Vectorized)...
  ✓ Filled 2/2 NaN values (100.0%)

AFTER:
   segment_id  route  polyline_idx
0           1   WEST           0.0
1           1   WEST           0.0
2           1   WEST           0.0
3           2  NORTH           0.0
4           2  NORTH           0.0
5           2  NORTH           0.0

→ Row 1 filled with WEST (from segment 1)
→ Row 4 filled with NORTH (from segment 2)
→ Windows don't cross segment boundaries!


---
## 4. `add_closest_points_educated`

Recomputes the closest-point geometry (segment index, distance to route, closest latitude/longitude) for rows where the route is known but those values are missing.

**Use case:** After `clean_closest_route` fills in route names, we need to compute the actual closest point on that specific route.

### Function Signature
```python
def add_closest_points_educated(
    df: pd.DataFrame,
    lat_column: str,
    lon_column: str,
    route_column: str,
    polyline_idx_column: str,
    output_columns: dict[str, str]
) -> None:
```

**Modifies DataFrame in-place.**
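
Finding the closest point on a known route amounts to projecting the GPS point onto each segment of the route polyline and keeping the nearest projection. The sketch below shows that geometry with NumPy. It treats coordinates as planar, which is only an approximation for latitude/longitude, and it is not the project's implementation. `closest_point_on_polyline` is a hypothetical name.

```python
import numpy as np


def closest_point_on_polyline(point, polyline):
    """Hypothetical helper: return (distance, closest_point, segment_index)."""
    p = np.asarray(point, dtype=float)
    pts = np.asarray(polyline, dtype=float)
    a, b = pts[:-1], pts[1:]          # start and end of every polyline segment
    ab = b - a
    denom = (ab * ab).sum(axis=1)
    denom = np.where(denom == 0, 1.0, denom)  # guard zero-length segments
    # Fraction along each segment of the orthogonal projection, clipped to it.
    t = np.clip(((p - a) * ab).sum(axis=1) / denom, 0.0, 1.0)
    proj = a + t[:, None] * ab
    dist = np.linalg.norm(proj - p, axis=1)
    i = int(dist.argmin())
    return float(dist[i]), proj[i], i


d, closest, seg_idx = closest_point_on_polyline(
    (0.5, 0.2), [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
)
print(round(d, 3), closest.tolist(), seg_idx)  # 0.2 [0.5, 0.0] 0
```

Restricting the search to one known route is what makes the guess "educated": only that route's polyline is scanned instead of every candidate route.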

In [12]:
# Create DataFrame with route known but segment_idx missing
# (simulating result after clean_closest_route)
df_educated = pd.DataFrame({
    'latitude': [42.7302, 42.7310, 42.7318],
    'longitude': [-73.6762, -73.6765, -73.6768],
    'route': ['WEST', 'WEST', 'WEST'],
    'polyline_idx': [0, 0, 0],
    'segment_idx': [5.0, np.nan, 7.0],  # Middle row missing segment_idx
    'dist_to_route': [0.001, np.nan, 0.002],
    'closest_lat': [42.7303, np.nan, 42.7319],
    'closest_lon': [-73.6763, np.nan, -73.6769]
})

print("BEFORE add_closest_points_educated:")
print(df_educated)
print(f"\nRow 1 has route=WEST but missing geometric details")

BEFORE add_closest_points_educated:
   latitude  longitude route  polyline_idx  segment_idx  dist_to_route  \
0   42.7302   -73.6762  WEST             0          5.0          0.001   
1   42.7310   -73.6765  WEST             0          NaN            NaN   
2   42.7318   -73.6768  WEST             0          7.0          0.002   

   closest_lat  closest_lon  
0      42.7303     -73.6763  
1          NaN          NaN  
2      42.7319     -73.6769  

Row 1 has route=WEST but missing geometric details


In [13]:
# Fill in missing geometric details
add_closest_points_educated(
    df_educated,
    lat_column='latitude',
    lon_column='longitude',
    route_column='route',
    polyline_idx_column='polyline_idx',
    output_columns={
        'distance': 'dist_to_route',
        'closest_point_lat': 'closest_lat',
        'closest_point_lon': 'closest_lon',
        'segment_index': 'segment_idx'
    }
)

print("AFTER add_closest_points_educated:")
print(df_educated)
print("\n→ Row 1 now has all geometric details filled in")

Refining 1 rows with educated route guesses...


100%|██████████| 1/1 [00:00<00:00, 345.30it/s]

AFTER add_closest_points_educated:
   latitude  longitude route  polyline_idx  segment_idx  dist_to_route  \
0   42.7302   -73.6762  WEST             0          5.0       0.001000   
1   42.7310   -73.6765  WEST             0          7.0       0.036174   
2   42.7318   -73.6768  WEST             0          7.0       0.002000   

   closest_lat  closest_lon  
0    42.730300   -73.676300  
1    42.730678   -73.676562  
2    42.731900   -73.676900  

→ Row 1 now has all geometric details filled in





---
## Integration with `segment_pipeline`

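In `ml/pipelines.py`, these functions work together. The listing below is abridged: `max_timedelta`, `max_distance`, `window_size`, and `min_segment_length` are pipeline parameters (15 s, 0.02 km, 5, and 3 in the run logged further down).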

In [14]:
# Show how segment_pipeline uses these functions
print("""
def segment_pipeline(df: pd.DataFrame = None, **kwargs) -> pd.DataFrame:
    from ml.data.segment import (
        segment_by_consecutive, filter_segments_by_length,
        clean_closest_route, add_closest_points_educated
    )

    # Step 1: Preprocess if needed
    if df is None:
        df = preprocess_pipeline(**kwargs)

    # Step 2: Segment by consecutive points
    df = segment_by_consecutive(
        df,
        max_timedelta=max_timedelta,
        segment_column='segment_id',
        distance_column='dist_to_route',
        max_distance_to_route=max_distance
    )

    # Step 3: Clean NaN route values
    df = clean_closest_route(df, 'route', 'polyline_idx', 'segment_id', window_size)

    # Step 3.5: Refine geometric points for inferred routes
    add_closest_points_educated(
        df, 'latitude', 'longitude', 'route', 'polyline_idx',
        output_columns={
            'distance': 'dist_to_route',
            'closest_point_lat': 'closest_lat',
            'closest_point_lon': 'closest_lon',
            'segment_index': 'segment_idx'
        }
    )

    # Step 4: Add speed (via speed_pipeline)
    df = speed_pipeline(df, **kwargs)

    # Step 5: Filter short segments
    df = filter_segments_by_length(df, 'segment_id', min_segment_length)

    return df
""")


In [15]:
# Run the actual segment_pipeline on real data
from ml.pipelines import segment_pipeline

# Load segmented data (uses cache if available)
df = segment_pipeline()

print(f"Segmented {len(df):,} records")
print(f"Total segments: {df['segment_id'].nunique():,}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nSegment size statistics:")
print(df.groupby('segment_id').size().describe())

2026-02-04 17:19:41 - ml.pipelines - INFO - SEGMENT PIPELINE
2026-02-04 17:19:41 - ml.cache - INFO - Loading preprocessed data from /Users/joel/eclipse-workspace/shuttletracker-new/ml/cache/shared/locations_preprocessed.csv
2026-02-04 17:19:47 - ml.cache - INFO - Loaded 1833872 records from cache
2026-02-04 17:19:47 - ml.pipelines - INFO - Step 1/5: Preprocessed 1833872 location points
2026-02-04 17:19:47 - ml.pipelines - INFO - Step 2/5: Segmenting (max gap: 15s, max distance: 0.02 km)...
2026-02-04 17:19:48 - ml.pipelines - INFO -   ✓ Created 225139 segments
2026-02-04 17:19:48 - ml.pipelines - INFO - Step 3/5: Cleaning NaN route values (window size: 5)...
Cleaning 585824 NaN route values using window size 5 (Vectorized)...
  ✓ Filled 436227/585824 NaN values (74.5%)
2026-02-04 17:20:07 - ml.pipelines - INFO - Step 3.5/5: Refining geometric points for inferred routes...
Refining 436227 rows with educated route guesses...


100%|██████████| 436227/436227 [02:18<00:00, 3138.75it/s]

2026-02-04 17:22:26 - ml.pipelines - INFO - Step 4/5: Adding speed calculations...
2026-02-04 17:22:26 - ml.pipelines - INFO - Calculating distance and speed...





2026-02-04 17:22:27 - ml.pipelines - INFO -   ✓ Calculated segment-local speeds
2026-02-04 17:22:27 - ml.pipelines - INFO - Step 5/5: Filtering segments < 3 points...
2026-02-04 17:22:27 - ml.pipelines - INFO -   ✓ Kept 78265/225139 segments
2026-02-04 17:22:27 - ml.cache - INFO - Saving segmented data to /Users/joel/eclipse-workspace/shuttletracker-new/ml/cache/shared/locations_segmented_max_distance0p02_max_timedelta15_min_segment_length3_window_size5.csv
2026-02-04 17:22:56 - ml.cache - INFO - Saved 1682903 records
Segmented 1,682,903 records
Total segments: 78,265

Columns: ['vehicle_id', 'latitude', 'longitude', 'timestamp', 'epoch_seconds', 'dist_to_route', 'route', 'closest_lat', 'closest_lon', 'polyline_idx', 'segment_idx', 'segment_id', 'distance_km', 'speed_kmh']

Segment size statistics:
count    78265.000000
mean        21.502626
std         20.256631
min          3.000000
25%          7.000000
50%         14.000000
75%         28.000000
max        179.000000
dtype: float64

---
## Summary

| Function | Purpose | Key Parameters |
|----------|---------|----------------|
| `segment_by_consecutive` | Group points into trip segments | `max_timedelta`, `max_distance_to_route` |
| `filter_segments_by_length` | Remove short segments | `min_length` |
| `clean_closest_route` | Fill NaN routes via majority vote | `window_size` |
| `add_closest_points_educated` | Refine geometry for inferred routes | `route_column`, `polyline_idx_column`, `output_columns` |

**Pipeline Order:**
1. `segment_by_consecutive` - Create segments
2. `clean_closest_route` - Fill NaN routes
3. `add_closest_points_educated` - Refine geometry
4. `speed_pipeline` - Add distance and speed
5. `filter_segments_by_length` - Remove short segments