# Data Pipeline: OL-DL 1v1 Rep Detection

**Consolidated pipeline for processing practice tracking data with coordinate normalization.**

## Workflow
1. **Load Data** - Load parquet file
2. **Inspect Drill Types** - See counts per drill type
3. **Filter Drill Type** - Select one drill type to analyze
4. **Generate Frame IDs** - Create canonical frame numbering
5. **Visualize Raw Data** - Find LOS position and determine orientation (raw coords 0-120)
6. **Apply Transformation** - Normalize coordinates based on LOS and orientation
7. **Filter to Rep Period** - Narrow to the rep period, mark OL/DL
8. **Run Algorithm** - Detect individual reps
9. **Visualize Reps** - Review each detected rep (normalized coords)

In [1]:
# Imports
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
from matplotlib.patches import Rectangle
from pathlib import Path
from math import inf
from IPython.display import display, clear_output
import ipywidgets as widgets
from ipywidgets import IntSlider, Play, Dropdown, Button, HBox, VBox, Output, jslink

# Enable interactive plotting (requires the ipympl backend)
%matplotlib widget

print("Imports complete.")

Imports complete.


In [2]:
# CELL 1: MANUAL CONFIGURATION (Skip to Cell 3 to use created dict; I followed this process to create the dict)
# Path to the practice parquet file
# Replace with your own path to practice files
PRACTICE_FILE = Path("~/Desktop/ShrineBowlSumerSportsAnalyticsCompetition/practice_data/2024_West_Practice_1.snappy.parquet").expanduser()

ID_COL_CANDIDATES = ["zebra_id", "tid", "id"]  # Possible player ID column names
TRACKING_METRICS = ["a", "dir", "sa", "dis", "s", "x", "y", "z"]  # Motion metrics

# Scan the parquet file (lazy evaluation)
tracking_scan = pl.scan_parquet(PRACTICE_FILE)

# Detect schema and find player ID column
try:
    cols = tracking_scan.collect_schema().names()
except Exception:
    cols = list(tracking_scan.schema.keys())

player_id_col = next((c for c in ID_COL_CANDIDATES if c in cols), None)
if player_id_col is None:
    raise ValueError(f"Could not find player ID column. Tried: {ID_COL_CANDIDATES}")

print(f"Using player ID column: {player_id_col}")

# Select columns that exist in the dataset
keep_cols_wanted = [
    "dataset_id", "dataset_name", "dataset_intensity", "dataset_game_id", "session_id",
    "drill_type", "entity_type", player_id_col, "gsis_id", "jersey_number", "ts",
] + TRACKING_METRICS
keep_cols = [c for c in keep_cols_wanted if c in cols]

# Load data (filter to players only)
df_raw = (
    tracking_scan
    .select(keep_cols)
    .filter(pl.col("entity_type") == "player") # Note: Football is sometimes not present in OL-DL drill sessions
    .with_columns([
        pl.col("ts").cast(pl.Utf8),
        pl.col(player_id_col).cast(pl.Utf8).alias("id"),
    ])
    .collect()
)

# Count the drill types; pick one for 1 on 1
drill_counts = df_raw.group_by("drill_type").agg(pl.len().alias("count")).sort("count", descending=True)
with pl.Config(tbl_rows=-1):
    print(drill_counts)

Using player ID column: zebra_id
shape: (9, 2)
┌────────────────────────────┬─────────┐
│ drill_type                 ┆ count   │
│ ---                        ┆ ---     │
│ str                        ┆ u32     │
╞════════════════════════════╪═════════╡
│ Bigs 1 on 1 - Skill 7 on 7 ┆ 3567162 │
│ Bigs 9 on 7 - Skill 1 on 1 ┆ 3567162 │
│ Team                       ┆ 3567162 │
│ Best of 1 on 1             ┆ 3567162 │
│ Indy                       ┆ 3567162 │
│ Pre Practice               ┆ 3567162 │
│ SPT 2                      ┆ 3567162 │
│ SPT                        ┆ 3567162 │
│ Stretch                    ┆ 3567162 │
└────────────────────────────┴─────────┘


#### CELL 2: FILTER TO DRILL TYPE, GENERATE CANONICAL FRAME IDS BASED ON TIMESTAMP
From the counts above, select the drill type you want to analyze. Then filter to that drill type and create frame IDs from the unique timestamps.
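
As a toy illustration of the frame-numbering step, the sketch below assigns frame IDs to a few made-up rows in plain Python. It assumes every timestamp shares the same fixed-width ISO-8601 format, so sorting the strings also sorts them chronologically. The rows and the `frame_of` name are illustrative and not part of the pipeline.

```python
# Made-up rows for two players; one timestamp is shared and one is out of order.
rows = [
    {"id": "A", "ts": "2024-01-27T15:51:08.800"},
    {"id": "A", "ts": "2024-01-27T15:51:08.700"},
    {"id": "B", "ts": "2024-01-27T15:51:08.700"},
    {"id": "B", "ts": "2024-01-27T15:51:08.900"},
]

# One frame ID per unique timestamp, numbered from 0 in chronological order.
unique_ts = sorted({row["ts"] for row in rows})
frame_of = {ts: frame_id for frame_id, ts in enumerate(unique_ts)}

# Every row that shares a timestamp receives the same frame ID.
for row in rows:
    row["frame_id"] = frame_of[row["ts"]]

print([(row["id"], row["frame_id"]) for row in rows])
```

Because the mapping is built from the filtered data only, frame 0 is the first timestamp of the selected drill type, not of the whole practice.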

In [3]:
DRILL_TYPE_FILTER = "Bigs 1 on 1 - Skill 7 on 7"  # Change to match the specific drill type

# Filter to the selected drill type
df_filtered = df_raw.filter(pl.col("drill_type") == DRILL_TYPE_FILTER)

# Build frame_map from unique timestamps (sorted)
frame_map = (
    df_filtered
    .select("ts")
    .unique()
    .sort("ts")
    .with_row_index(name="frame_id", offset=0)
)

# Join frame_id back to filtered data
df_with_frames = df_filtered.join(frame_map, on="ts", how="left")

# Add parsed timestamp
df_full_raw = df_with_frames.with_columns(pl.col("ts").str.to_datetime().alias("parsed_ts"))

#### CELL 3: CREATE VISUALIZATION
We will create a visualization of the data in order to obtain needed configuration parameters to run the pipeline. We need the following parameters:

1. START_TS: The timestamp of the beginning of the drill period
2. END_TS: The timestamp of the end of the drill period
3. LOS: The line of scrimmage for the 1-on-1 drill (can be inferred from where oline is lined up)
4. olinemen: List of oline player jersey numbers
5. dlinemen: List of dline player jersey numbers
6. Flip: Boolean indicating whether the drill is flipped (oline is assumed to be to the right of the LOS)
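
One way to keep these six parameters together is a small validated container. The sketch below is an assumption about how that could look: `SessionConfig` is a hypothetical name that the notebook does not define, and the values are the ones this notebook sets by hand in later cells for `2024_West_Practice_1`.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SessionConfig:  # hypothetical container, not defined in the notebook
    start_ts: str
    end_ts: str
    los: float
    flip: bool
    olinemen: tuple
    dlinemen: tuple

    def __post_init__(self):
        # Both timestamps must parse, and the start must precede the end.
        if datetime.fromisoformat(self.start_ts) >= datetime.fromisoformat(self.end_ts):
            raise ValueError("START_TS must be earlier than END_TS")
        # A jersey number cannot be listed as both OL and DL.
        overlap = set(self.olinemen) & set(self.dlinemen)
        if overlap:
            raise ValueError(f"Jerseys listed as both OL and DL: {sorted(overlap)}")


config = SessionConfig(
    start_ts="2024-01-27T17:14:06.100",
    end_ts="2024-01-27T17:25:47.100",
    los=105.0,
    flip=False,
    olinemen=("75", "55", "72", "60", "54", "69", "71", "78", "77", "70", "73"),
    dlinemen=("7", "6", "97", "85", "99", "9", "58", "92", "52", "8", "91"),
)
print(config.los, config.flip, len(config.olinemen), len(config.dlinemen))
```

Validating at construction time catches a swapped start/end or a jersey typed into both lists before any rep detection runs.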

In [4]:
# Viz (Plotly)
# Full field, raw coordinates with jersey numbers over markers
import plotly.graph_objects as go

# Get unique frames for slider (from raw data)
frames_view = df_full_raw.select(["frame_id", "ts", "parsed_ts"]).unique().sort("frame_id")
frame_ids = frames_view["frame_id"].to_list()
timestamps = frames_view["ts"].to_list()
n_frames = len(frame_ids)

print(f"Visualization ready: {n_frames} frames (full field, ORIGINAL coordinates)")
print(f"Time range: {timestamps[0]} to {timestamps[-1]}")

X_MIN = 0.0
X_MAX = 125.0
Y_MIN = 0.0
Y_MAX = 53.3

fig = go.FigureWidget()

# Yard lines (every 5 yards, bold every 10)
for yard in range(0, 121, 5):
    width = 2 if yard % 10 == 0 else 1
    fig.add_shape(
        type="line",
        x0=yard, x1=yard,
        y0=Y_MIN, y1=Y_MAX,
        line=dict(color="white", width=width),
        layer="below",
    )

# Goal lines
fig.add_shape(
    type="line",
    x0=10, x1=10,
    y0=Y_MIN, y1=Y_MAX,
    line=dict(color="white", width=3),
    layer="below",
)
fig.add_shape(
    type="line",
    x0=110, x1=110,
    y0=Y_MIN, y1=Y_MAX,
    line=dict(color="white", width=3),
    layer="below",
)

# Yard numbers
annotations = []
for yard in range(20, 110, 10):
    num = yard - 10 if yard <= 60 else 110 - yard
    annotations.append(dict(x=yard, y=5, text=str(num), showarrow=False, font=dict(color="white", size=10)))
    annotations.append(dict(x=yard, y=48, text=str(num), showarrow=False, font=dict(color="white", size=10)))

# Players (markers + jersey numbers)
fig.add_trace(go.Scatter(
    x=[], y=[], mode="markers+text", name="Players",
    marker=dict(size=12, color="steelblue", line=dict(color="white", width=1.5)),
    text=[], textposition="middle center",
    textfont=dict(color="white", size=9),
    showlegend=True,
))

# Ball
fig.add_trace(go.Scatter(
    x=[], y=[], mode="markers", name="Ball",
    marker=dict(size=9, color="saddlebrown", line=dict(color="white", width=1.2)),
    showlegend=True,
))

fig.update_layout(
    width=900, height=450,
    title="",
    showlegend=True,
    legend=dict(x=0.02, y=0.98),
    plot_bgcolor="#2e7d32",
    paper_bgcolor="white",
    annotations=annotations,
)

fig.update_xaxes(range=[X_MIN, X_MAX], title="X (yards)", showgrid=False, zeroline=False)
fig.update_yaxes(range=[Y_MIN, Y_MAX], title="Y (yards)", showgrid=False, zeroline=False, scaleanchor="x", scaleratio=1)

# ====== Widgets ======
play = Play(value=0, min=0, max=max(0, n_frames - 1), step=1, interval=100)
slider = IntSlider(min=0, max=max(0, n_frames - 1), step=1, value=0, description="Frame")
back_button = Button(description="<")
forward_button = Button(description=">")


def update_frame(frame_idx):
    idx = max(0, min(int(frame_idx), n_frames - 1))
    fid = frame_ids[idx]
    ts = timestamps[idx]

    frame_data = df_full_raw.filter(pl.col("frame_id") == fid)
    if frame_data.height == 0:
        fig.data[0].x = []
        fig.data[0].y = []
        fig.data[0].text = []
        fig.data[1].x = []
        fig.data[1].y = []
        fig.layout.title = f"EXPLORATION (Raw Coords) | Frame {fid} / {frame_ids[-1]} | {ts} | Players: 0"
        return

    ball_data = frame_data.filter(pl.col("entity_type") == "ball")
    player_points = frame_data.filter(pl.col("entity_type") != "ball")

    x = player_points["x"].to_list()
    y = player_points["y"].to_list()
    jerseys = [str(j) for j in player_points["jersey_number"].to_list()]

    fig.data[0].x = x
    fig.data[0].y = y
    fig.data[0].text = jerseys

    fig.data[1].x = ball_data["x"].to_list()
    fig.data[1].y = ball_data["y"].to_list()

    fig.layout.title = f"EXPLORATION (Raw Coords) | Frame {fid} / {frame_ids[-1]} | {ts} | Players: {len(jerseys)}"


def on_frame_change(change):
    if change["name"] == "value":
        update_frame(change["new"])


def step_frame(delta):
    slider.value = max(0, min(slider.value + delta, n_frames - 1))

jslink((play, "value"), (slider, "value"))
slider.observe(on_frame_change, names="value")
back_button.on_click(lambda _: step_frame(-1))
forward_button.on_click(lambda _: step_frame(1))

controls = HBox([play, back_button, forward_button, slider])
ui = VBox([controls, fig])
display(ui)

update_frame(0)


Visualization ready: 71388 frames (full field, ORIGINAL coordinates)
Time range: 2024-01-27T15:51:08.700 to 2024-01-27T17:50:11.400



#### CELL 4: APPLY COORDINATE TRANSFORMATIONS
To make the coordinate system consistent across sessions, we use LOS and Flip to transform the coordinates so that 1) the LOS is always at x = 0 and 2) the oline is always to the right of the LOS.
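
A worked example on single points may make the two cases concrete. The sketch below uses a hypothetical helper, `normalize_point`, that applies the same formulas as the transformation cell to one `(x, y, dir)` triple. The sample coordinates are made up.

```python
FIELD_WIDTH = 53.3  # yards, the same constant used for the y flip


def normalize_point(x, y, direction, los, flip):
    """Return (x, y, dir) in the normalized frame: LOS at x = 0, OL at x > 0."""
    if flip:
        # Mirror around the field center and shift the LOS to 0 in one step.
        return los - x, FIELD_WIDTH - y, (direction + 180) % 360
    # No flip needed: only shift the LOS to 0.
    return x - los, y, direction


# OL already on the right of the LOS (LOS at x = 105): a plain shift.
print(normalize_point(107.0, 20.0, 90.0, los=105.0, flip=False))

# OL on the left of the LOS (LOS at x = 30): flipping puts the OL on the right.
print(normalize_point(28.0, 20.0, 90.0, los=30.0, flip=True))
```

In both cases the lineman two yards from the LOS ends up at x = +2, which is what the later thresholds such as `CROSSING_X` rely on.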

In [5]:
# Set these after using the exploration visualization to inspect raw data
LOS = 105.0              # Line of scrimmage x-coordinate in ORIGINAL data
FLIP_ORIENTATION = False # Set to True if DL is to the right of LOS in original data

def transform_coordinates(df: pl.DataFrame, los: float, flip_orientation: bool) -> pl.DataFrame:
    """
    Transform coordinates based on LOS and orientation configuration.

    After transformation:
    - LOS is at x = 0
    - DL should be at x < 0 (left of LOS)
    - OL should be at x > 0 (right of LOS)
    - DL moves in +x direction when rep starts

    Parameters:
        df: DataFrame with x, y, dir columns
        los: Line of scrimmage x-coordinate in original data
        flip_orientation: If True, flip x and y around field center before normalizing
    """
    if flip_orientation:
        # Flip around field center (60, 26.65), then normalize LOS to 0
        # Combined formula: x_new = los - x, y_new = 53.3 - y
        # Also flip direction angle by 180 degrees
        df = df.with_columns([
            (pl.lit(los) - pl.col("x")).alias("x"),
            (pl.lit(53.3) - pl.col("y")).alias("y"),
            ((pl.col("dir") + 180) % 360).alias("dir"),  # Rotate direction 180 deg
        ])
    else:
        # Just normalize LOS to x=0
        df = df.with_columns([
            (pl.col("x") - pl.lit(los)).alias("x"),
        ])
    return df

# Apply transformation
df_full_field  = transform_coordinates(df_with_frames, LOS, FLIP_ORIENTATION)

# df: Filter the field so that it is within +/- 15 yards of the LOS. Essentially, the area of interest.
df = df_full_field.filter((pl.col("x") >= -15.0) & (pl.col("x") <= 15.0))

#### CELL 5: IMPUTE MISSING FRAMES
Manual inspection shows that some frames are missing from the data. To avoid downstream issues with rep detection, we impute them via linear interpolation: for each gap, we take the last known frame and the next known frame and step linearly between them. Other imputation methods may work but were not explored. We only impute gaps of up to five missing frames.
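
The interpolation itself is a one-liner, sketched below with made-up values. The sketch also includes `lerp_angle_deg`, a hypothetical alternative that this notebook does not use: the imputation cell interpolates every metric linearly, including `dir`, and a plain linear step between two headings on either side of 0/360 degrees passes through the opposite direction.

```python
def lerp(prev, nxt, step, total_steps):
    """Linear interpolation: `step` frames after `prev`, out of `total_steps`."""
    return prev + (nxt - prev) * (step / total_steps)


def lerp_angle_deg(prev, nxt, step, total_steps):
    """Interpolate along the shorter arc between two headings in degrees."""
    # Signed shortest difference, in the range [-180, 180).
    delta = (nxt - prev + 180.0) % 360.0 - 180.0
    return (prev + delta * (step / total_steps)) % 360.0


# x is known at frames 10 and 13; frames 11 and 12 are missing (total_steps = 3).
print([lerp(1.0, 4.0, step, 3) for step in (1, 2)])

# Heading 350 -> 10 degrees with one missing frame in between.
print(lerp(350.0, 10.0, 1, 2))            # plain linear step
print(lerp_angle_deg(350.0, 10.0, 1, 2))  # shorter-arc step
```

Whether the wrap-around matters here depends on how `dir` is used downstream; the rep detection described below works from positions, speeds, and accelerations.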

In [6]:
# Impute Missing Frames
print(f"Rows before imputation: {df.height:,}")

TRACKING_METRICS = ["a", "dir", "sa", "dis", "s", "x", "y", "z"]
metrics_for_impute = [c for c in TRACKING_METRICS if c in df.columns]
non_metric_cols = [c for c in df.columns if c not in metrics_for_impute + ["frame_id", "ts", "parsed_ts"]]

# Build frame_id -> ts mapping
frame_ts_map = dict(zip(frame_map["frame_id"].to_list(), frame_map["ts"].to_list()))

rows_to_add = []
for _, player_df in df.partition_by("id", as_dict=True).items():
    player_rows = player_df.sort("frame_id").to_dicts()
    for i in range(len(player_rows) - 1):
        prev_row = player_rows[i]
        next_row = player_rows[i + 1]
        gap = next_row["frame_id"] - prev_row["frame_id"]
        missing = gap - 1
        # We will only impute up to 5 missing frames
        if missing <= 0 or missing > 5:
            continue
        
        # If set to False, it will impute the average of the previous and next frame across all missing frames
        USE_LINEAR_INTERPOLATION = True
        total_steps = next_row["frame_id"] - prev_row["frame_id"]

        for fid in range(prev_row["frame_id"] + 1, next_row["frame_id"]):
            ts_val = frame_ts_map.get(fid)
            if ts_val is None:
                continue

            step = fid - prev_row["frame_id"]

            imputed_metrics = {}
            for m in metrics_for_impute:
                pv = prev_row.get(m)
                nv = next_row.get(m)
                if pv is None or nv is None:
                    # At least one neighbor is missing: fall back to whichever value exists
                    imputed_metrics[m] = pv if pv is not None else nv
                elif USE_LINEAR_INTERPOLATION:
                    imputed_metrics[m] = pv + (nv - pv) * (step / total_steps)
                else:
                    imputed_metrics[m] = (pv + nv) / 2

            new_row = {col: prev_row.get(col) for col in non_metric_cols}
            new_row["frame_id"] = fid
            new_row["ts"] = ts_val
            new_row["parsed_ts"] = ts_val
            new_row.update(imputed_metrics)
            rows_to_add.append(new_row)

if rows_to_add:
    imputed_df = pl.DataFrame(rows_to_add)
    if "parsed_ts" in imputed_df.columns:
        imputed_df = imputed_df.with_columns(pl.col("parsed_ts").str.to_datetime())

    for col, dtype in df.schema.items():
        if col not in imputed_df.columns:
            imputed_df = imputed_df.with_columns(pl.lit(None, dtype=dtype).alias(col))

    imputed_df = imputed_df.with_columns([
        pl.col(col).cast(dtype, strict=False)
        for col, dtype in df.schema.items()
        if col in imputed_df.columns
    ])
    imputed_df = imputed_df.select(df.columns)
    df = pl.concat([df, imputed_df], how="vertical").sort([player_id_col, "frame_id"])

print(f"Imputed {len(rows_to_add)} missing frames")
print(f"Rows after imputation: {df.height:,}")

Rows before imputation: 1,100,217
Imputed 183 missing frames
Rows after imputation: 1,100,400


#### CELL 6: FILTER TO REP PERIOD AND PLAYERS
Based on the exploration visualization, set:
1. `REP_PERIOD_START_TS` - Timestamp when OL-DL reps begin
2. `REP_PERIOD_END_TS` - Timestamp when OL-DL reps end
3. `olinemen` - List of OL jersey numbers
4. `dlinemen` - List of DL jersey numbers

In [7]:
# Configure these after using the visualization above
# Or use provided timestamps from competition folder
REP_PERIOD_START_TS = "2024-01-27T17:14:06.100" 
REP_PERIOD_END_TS = "2024-01-27T17:25:47.100"

# OL and DL jersey numbers for this practice
olinemen = ["75", "55", "72", "60", "54", "69", "71", "78", "77", "70", "73"]
dlinemen = ["7", "6", "97", "85", "99", "9", "58", "92", "52", "8", "91"]

# Filter to rep period
df_reps = df.filter(
    (pl.col("ts") >= REP_PERIOD_START_TS) &
    (pl.col("ts") <= REP_PERIOD_END_TS)
)

# Add position flags
df_reps = df_reps.with_columns([
    pl.col("jersey_number").is_in(olinemen).alias("is_olineman"),
    pl.col("jersey_number").is_in(dlinemen).alias("is_dlineman"),
])

print(f"Rep period: {REP_PERIOD_START_TS} to {REP_PERIOD_END_TS}")
print(f"Frame range: {df_reps['frame_id'].min()} to {df_reps['frame_id'].max()}")
print(f"Rows in rep period: {df_reps.height:,}")
print(f"\nOL jerseys: {olinemen}")
print(f"DL jerseys: {dlinemen}")

Rep period: 2024-01-27T17:14:06.100 to 2024-01-27T17:25:47.100
Frame range: 49734 to 56744
Rows in rep period: 160,352

OL jerseys: ['75', '55', '72', '60', '54', '69', '71', '78', '77', '70', '73']
DL jerseys: ['7', '6', '97', '85', '99', '9', '58', '92', '52', '8', '91']


#### CELL 7: LOAD THE DETECTION ALGORITHM
There are two main components of the detection algorithm:
1. A targeted algorithm that 1) identifies the OL-DL pair, 2) marks the start of the rep, and 3) marks the end of the rep. It is designed to work on a segmented window of ~20 seconds.
2. A supra-algorithm that runs on the entire ~10 minute rep period. It finds each individual 1-on-1 rep and then applies the targeted algorithm to it.

#### Targeted Algorithm Components
 Core Pair Scoring

  - For each OL/DL candidate pair, builds a time series by inner-joining on
    frame_id and computing distance, distance change, and frame deltas.
  - Engagement scoring uses distance dynamics: longest closing run, min
    distance, sustained contact (<2.0), and “active close” frames (both speeds ≥
    1.0).
  - The primary engagement_score is the count of active close frames; other
    metrics are used to tie-break.
  - OL candidates can be filtered to those within OL_MAX_X_AT_TRIGGER yards of
    LOS at the trigger frame.
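
  The two main scoring quantities can be sketched on plain lists. This is a simplified illustration with made-up numbers, and it assumes the frames are consecutive (the notebook code additionally checks `frame_delta == 1`). The function names and `MIN_ACTIVE_SPEED` are hypothetical.

```python
CONTACT_THRESHOLD = 2.0  # distance threshold from the description above
MIN_ACTIVE_SPEED = 1.0   # hypothetical name for the "both speeds >= 1.0" rule


def count_active_close_frames(distances, ol_speeds, dl_speeds):
    """Frames where the pair is close and both players are moving."""
    return sum(
        1
        for d, ol_s, dl_s in zip(distances, ol_speeds, dl_speeds)
        if d < CONTACT_THRESHOLD and ol_s >= MIN_ACTIVE_SPEED and dl_s >= MIN_ACTIVE_SPEED
    )


def longest_closing_run(distances, min_drop=0.01):
    """Longest streak of frame-to-frame distance decreases larger than min_drop."""
    best = run = 0
    for prev, cur in zip(distances, distances[1:]):
        if cur - prev < -min_drop:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


distances = [5.0, 4.0, 3.0, 1.8, 1.5, 1.6, 1.4]
ol_speeds = [0.2, 0.8, 1.2, 1.5, 1.1, 0.9, 1.3]
dl_speeds = [1.0, 2.0, 2.5, 2.0, 1.4, 1.2, 1.0]

print(count_active_close_frames(distances, ol_speeds, dl_speeds))
print(longest_closing_run(distances))
```

  In this toy series three frames count as active close frames, and the closing run lasts four frames before the distance briefly grows again.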

  Pair Selection

  - Iterates all OL/DL combinations, computes scores, and selects the pair with
    the highest engagement score.
  - Ties are broken by earliest active closing frame, then earliest min-distance
    frame, then earliest first-contact frame.
  - If no valid pair exists, a ValueError is raised.
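
  The ordering above can be approximated with a single sort key. The sketch below is a simplification, not the notebook's exact logic: it treats a missing frame value as "latest", whereas the notebook skips a tie-break criterion when either side is missing, and it ignores the small score tolerance used there. The candidate values are made up.

```python
from math import inf


def selection_key(candidate):
    """Sort key: highest score first, then the earliest tie-break frames."""
    def earliest(value):
        # Assumption of this sketch: a missing frame sorts after any real frame.
        return inf if value is None else value

    return (
        -candidate["engagement_score"],
        earliest(candidate["first_active_closing_frame"]),
        earliest(candidate["min_distance_frame"]),
        earliest(candidate["first_close_frame"]),
    )


def select_pair(candidates):
    if not candidates:
        raise ValueError("Could not identify engaged OL-DL pair")
    return min(candidates, key=selection_key)


candidates = [
    {"pair": ("75", "97"), "engagement_score": 12.0,
     "first_active_closing_frame": 105, "min_distance_frame": 130, "first_close_frame": 120},
    {"pair": ("55", "99"), "engagement_score": 12.0,
     "first_active_closing_frame": 101, "min_distance_frame": 128, "first_close_frame": None},
    {"pair": ("72", "7"), "engagement_score": 9.0,
     "first_active_closing_frame": 90, "min_distance_frame": 95, "first_close_frame": 92},
]

print(select_pair(candidates)["pair"])
```

  The first two candidates tie on score, so the one whose active closing starts earlier wins; the third starts earliest of all but loses on score.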

  Rep Start Detection

  - Finds the first DL crossing of CROSSING_X that stays above for
    HOLD_FRAMES=10 consecutive frames.
  - Looks back LOOKBACK_FRAMES=15 frames for a trigger where ol_a+dl_a ≥ 1.5 or
    ol_s+dl_s ≥ 1.1; that becomes the start. (These thresholds were chosen
    empirically by inspecting the data.)
  - If no qualifying crossing exists, rep start defaults to the first frame in
    the pair series.
  - This assumes normalized orientation where DL moves in +x at the rep start.
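
  A minimal sketch of the crossing-plus-lookback idea on plain lists follows. It assumes consecutive frames with no gaps, returns list indices rather than frame IDs, and uses hypothetical function names; the input series are made up.

```python
def find_held_crossing(dl_x, crossing_x=0.5, hold_frames=10):
    """Index where dl_x first rises above crossing_x and stays above for hold_frames samples."""
    n = len(dl_x)
    for i in range(n - hold_frames + 1):
        starts_above = dl_x[i] > crossing_x and (i == 0 or dl_x[i - 1] <= crossing_x)
        if starts_above and all(x > crossing_x for x in dl_x[i:i + hold_frames]):
            return i
    return None


def find_rep_start(dl_x, accel_sum, speed_sum, lookback_frames=15,
                   accel_threshold=1.5, speed_threshold=1.1):
    """Earliest sample in the lookback window where the pair is already active."""
    crossing = find_held_crossing(dl_x)
    if crossing is None:
        return 0  # no qualifying crossing: fall back to the first sample
    for i in range(max(0, crossing - lookback_frames), crossing + 1):
        if accel_sum[i] >= accel_threshold or speed_sum[i] >= speed_threshold:
            return i
    return crossing


# DL waits behind the LOS, pokes across for one frame, then crosses for good.
dl_x = [-1.0] * 20 + [0.6, 0.4] + [0.7 + 0.1 * k for k in range(12)]
speed_sum = [0.0] * 18 + [1.2] * 16   # combined OL + DL speed
accel_sum = [0.0] * 34                # combined OL + DL acceleration

print(find_held_crossing(dl_x))
print(find_rep_start(dl_x, accel_sum, speed_sum))
```

  The one-frame poke at index 20 is rejected because it does not hold, and the lookback moves the start from the held crossing back to the first sample where the combined speed clears its threshold.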

  Rep End Detection

  - Starts searching SEARCH_DELAY_FRAMES=10 after the start.
  - Ends the rep if either player shows a sustained retreat (x delta < -0.05 for
    10 consecutive frames).
  - Alternatively, ends on stagnation: |x| and |y| deltas < 0.01 for 3
    consecutive frames.
  - If neither condition triggers, the rep ends at the last available frame.
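
  The two end conditions can be sketched as run detection over per-frame flags. Several details here are assumptions of the sketch rather than facts about the notebook: it returns the list index where the first qualifying run begins, it requires both players to be stagnant in x and y, and it assumes consecutive frames. All names and input series are made up.

```python
def first_run_start(flags, run_length):
    """Index where the first run of run_length consecutive True values begins."""
    run = 0
    for i, flag in enumerate(flags):
        run = run + 1 if flag else 0
        if run == run_length:
            return i - run_length + 1
    return None


def find_rep_end(ol_x, ol_y, dl_x, dl_y, start_idx, search_delay=10,
                 retreat_delta=-0.05, retreat_frames=10,
                 stagnation_eps=0.01, stagnation_frames=3):
    n = len(ol_x)
    begin = start_idx + search_delay
    idxs = range(begin + 1, n)  # each flag compares sample i with sample i - 1
    ol_retreat = [ol_x[i] - ol_x[i - 1] < retreat_delta for i in idxs]
    dl_retreat = [dl_x[i] - dl_x[i - 1] < retreat_delta for i in idxs]
    stagnant = [
        all(abs(v[i] - v[i - 1]) < stagnation_eps for v in (ol_x, ol_y, dl_x, dl_y))
        for i in idxs
    ]
    hits = [
        first_run_start(ol_retreat, retreat_frames),
        first_run_start(dl_retreat, retreat_frames),
        first_run_start(stagnant, stagnation_frames),
    ]
    hits = [h for h in hits if h is not None]
    if not hits:
        return n - 1  # neither condition triggered: end at the last sample
    return begin + 1 + min(hits)


y = [20.0] * 40

# Case 1: both players drive forward for 25 samples, then stop moving.
ol_stop = [1.0 + 0.1 * i for i in range(25)] + [3.4] * 15
dl_stop = [0.5 + 0.1 * i for i in range(25)] + [2.9] * 15
print(find_rep_end(ol_stop, y, dl_stop, y, start_idx=0))

# Case 2: the OL keeps moving while the DL turns back from sample 15 onward.
ol_move = [1.0 + 0.1 * i for i in range(40)]
dl_back = [0.5 + 0.1 * i for i in range(15)] + [1.9 - 0.1 * k for k in range(1, 26)]
print(find_rep_end(ol_move, y, dl_back, y, start_idx=0))
```

  Because the retreat rule needs ten consecutive frames and the stagnation rule only three, a rep that simply stalls ends sooner than one where a player walks back.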

#### Supra Algorithm Components
  - find_next_dl_trigger locates the earliest frame where any DL crosses
    TRIGGER_X from below to above.
  - For each trigger, it defines a window [trigger - WINDOW_BEFORE_TRIGGER,
    trigger + WINDOW_AFTER_TRIGGER].
  - detect_rep runs pair selection and start/end logic; outer loop filters by
    MIN_REP_DURATION.
  - On success, the scan jumps ahead to rep_start + WINDOW_AFTER_TRIGGER; on
    failure it advances by 10 frames.
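
  The outer scan can be sketched as a cursor that hops from trigger to trigger. This is an illustration under stated assumptions: tracks are plain `{frame_id: x}` dicts, `detect_rep` is a stand-in callable that returns `(start, end)` or `None`, the failure step is taken from the trigger frame, and `scan_reps` is a hypothetical name. The notebook's own functions operate on DataFrames.

```python
def find_next_dl_trigger(dl_tracks, start_frame, trigger_x=0.5):
    """Earliest frame >= start_frame where any DL goes from <= trigger_x to > trigger_x."""
    best = None
    for track in dl_tracks.values():
        for frame in sorted(track):
            if frame < start_frame or frame - 1 not in track:
                continue
            if track[frame - 1] <= trigger_x < track[frame]:
                if best is None or frame < best:
                    best = frame
                break  # only the first crossing of this DL matters
    return best


def scan_reps(dl_tracks, last_frame, detect_rep, window_before=40,
              window_after=80, min_rep_duration=15, fail_step=10):
    reps = []
    cursor = 0
    while cursor <= last_frame:
        trigger = find_next_dl_trigger(dl_tracks, cursor)
        if trigger is None:
            break
        window = (max(0, trigger - window_before), min(last_frame, trigger + window_after))
        rep = detect_rep(window, trigger)
        if rep is not None and rep[1] - rep[0] >= min_rep_duration:
            reps.append(rep)
            cursor = rep[0] + window_after  # success: jump past this rep
        else:
            cursor = trigger + fail_step    # failure: nudge forward and retry
    return reps


# Two made-up DL tracks: each sits behind the LOS except for one 60-frame rush.
dl_tracks = {
    "97": {f: (0.6 if 100 <= f < 160 else -1.0) for f in range(400)},
    "99": {f: (0.6 if 300 <= f < 360 else -1.0) for f in range(400)},
}

# Stand-in detector: pretend every rep starts 5 frames before its trigger.
print(scan_reps(dl_tracks, 399, lambda window, trigger: (trigger - 5, trigger + 30)))
```

  Jumping ahead after a success keeps the same rep from being detected twice, while the short step after a failure gives a later crossing in the same stretch another chance.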


In [8]:
# Rep Detection Algorithm Constants
TRIGGER_X = 0.5             # DL crossing threshold for rep window trigger (normalized coords)
CROSSING_X = 0.5            # DL crossing threshold for rep start detection (normalized coords)
WINDOW_BEFORE_TRIGGER = 40  # 4.0 seconds
WINDOW_AFTER_TRIGGER = 80   # 8.0 seconds
MIN_REP_DURATION = 15       # 1.5 seconds (Can bump this up to 2.0 seconds)

## OL proximity filter: only consider OLs within this distance of LOS at trigger frame
## This helps prevent edge cases where the algorithm will pair a DL with an OL that is in the back of the endzone
OL_MAX_X_AT_TRIGGER = 6

# Functions
def compute_pairwise_distance(x1: float, y1: float, x2: float, y2: float) -> float:
    """Euclidean distance between two points."""
    return np.sqrt((x1 - x2)**2 + (y1 - y2)**2)

def build_pair_timeseries(df_window: pl.DataFrame, ol_jersey: str, dl_jersey: str) -> pl.DataFrame:
    """
    Build a timeseries DataFrame for an OL-DL pair with metrics from both players.
    """
    ol_data = (
        df_window
        .filter(pl.col("jersey_number") == ol_jersey)
        .select(["frame_id", "ts", "x", "y", "s", "a"])
        .rename({"x": "ol_x", "y": "ol_y", "s": "ol_s", "a": "ol_a"})
    )
    
    dl_data = (
        df_window
        .filter(pl.col("jersey_number") == dl_jersey)
        .select(["frame_id", "ts", "x", "y", "s", "a"])
        .rename({"x": "dl_x", "y": "dl_y", "s": "dl_s", "a": "dl_a"})
    )
    
    pair_df = ol_data.join(dl_data.drop("ts"), on="frame_id", how="inner")
    
    if pair_df.height == 0:
        return pair_df
    
    pair_df = pair_df.sort("frame_id")
    
    pair_df = pair_df.with_columns(
        (((pl.col("ol_x") - pl.col("dl_x"))**2 + (pl.col("ol_y") - pl.col("dl_y"))**2).sqrt())
        .alias("pairwise_distance")
    )
    
    pair_df = pair_df.with_columns(
        (pl.col("pairwise_distance") - pl.col("pairwise_distance").shift(1))
        .alias("distance_change")
    )
    
    pair_df = pair_df.with_columns(
        (pl.col("frame_id") - pl.col("frame_id").shift(1))
        .alias("frame_delta")
    )
    
    return pair_df

def score_pair_engagement(pair_df: pl.DataFrame) -> dict:
    """
    Score how likely this OL-DL pair is the engaged pair for the rep.
    """
    if pair_df.height < 5:
        return {
            "engagement_score": 0,
            "min_distance": 999,
            "closing_frames": 0,
            "sustained_contact_frames": 0,
            "close_duration": 0,
            "activity_score": 0,
            "active_close_frames": 0,
            "min_distance_idx": 0,
        }

    distances = pair_df["pairwise_distance"].to_numpy()
    distance_changes = pair_df["distance_change"].to_numpy()
    frame_deltas = pair_df["frame_delta"].to_numpy()
    ol_accels = pair_df["ol_a"].to_numpy()
    dl_accels = pair_df["dl_a"].to_numpy()
    ol_speeds = pair_df["ol_s"].to_numpy()
    dl_speeds = pair_df["dl_s"].to_numpy()

    min_distance = float(np.nanmin(distances))
    min_distance_idx = int(np.nanargmin(distances))

    max_closing_run = 0
    current_run = 0
    for i in range(1, len(distance_changes)):
        dc = distance_changes[i]
        fd = frame_deltas[i]
        is_consecutive = fd is not None and not np.isnan(fd) and fd == 1
        is_closing = dc is not None and not np.isnan(dc) and dc < -0.01

        if is_consecutive and is_closing:
            current_run += 1
            max_closing_run = max(max_closing_run, current_run)
        else:
            current_run = 0
    closing_frames = max_closing_run

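    CONTACT_THRESHOLD = 2.0
    # Note: both metrics below count frames under the same threshold, so they are currently identical
    sustained_contact_frames = int(np.sum(distances < CONTACT_THRESHOLD))
    close_duration = int(np.sum(distances < CONTACT_THRESHOLD))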

    active_close_mask = (distances < 2.0) & (ol_speeds >= 1.0) & (dl_speeds >= 1.0)
    active_close_frames = int(np.sum(active_close_mask))

    closing_mask = np.zeros(len(distance_changes) - 1, dtype=bool)
    for i in range(1, len(distance_changes)):
        dc = distance_changes[i]
        fd = frame_deltas[i]
        is_consecutive = fd is not None and not np.isnan(fd) and fd == 1
        is_closing = dc is not None and not np.isnan(dc) and dc < -0.01
        if is_consecutive and is_closing:
            closing_mask[i - 1] = True

    if np.any(closing_mask):
        activity_score = float(np.mean(ol_accels[1:][closing_mask] + dl_accels[1:][closing_mask]))
    else:
        activity_score = 0.0

    engagement_score = float(active_close_frames)

    return {
        "engagement_score": engagement_score,
        "min_distance": min_distance,
        "min_distance_idx": min_distance_idx,
        "closing_frames": closing_frames,
        "sustained_contact_frames": sustained_contact_frames,
        "close_duration": close_duration,
        "activity_score": activity_score,
        "active_close_frames": active_close_frames,
    }

def identify_ol_dl_pair(df_window: pl.DataFrame, excluded_ol_jerseys: list = None, trigger_frame: int = None) -> tuple:
    """
    Identify the OL-DL pair engaged in the rep.
    
    Parameters:
        df_window: DataFrame containing the rep window data
        excluded_ol_jerseys: List of OL jerseys to exclude from consideration
        trigger_frame: Frame when DL crossed TRIGGER_X (used to filter OLs by position)
    """
    if excluded_ol_jerseys is None:
        excluded_ol_jerseys = []
    
    ol_jerseys = (
        df_window
        .filter(pl.col("is_olineman"))
        .select("jersey_number")
        .unique()
        ["jersey_number"]
        .to_list()
    )
    
    # Keep only OL candidates whose x at the trigger frame is at most OL_MAX_X_AT_TRIGGER
    # (a one-sided check: x <= threshold, not |x| <= threshold)
    if trigger_frame is not None:
        ol_jerseys_near_los = []
        for ol_j in ol_jerseys:
            if ol_j in excluded_ol_jerseys:
                continue
            # Get OL position at trigger frame
            ol_at_trigger = df_window.filter(
                (pl.col("jersey_number") == ol_j) &
                (pl.col("frame_id") == trigger_frame)
            )
            if ol_at_trigger.height > 0:
                x_at_trigger = ol_at_trigger.select("x").item()
                if x_at_trigger is not None and x_at_trigger <= OL_MAX_X_AT_TRIGGER:
                    ol_jerseys_near_los.append(ol_j)
        ol_jerseys = ol_jerseys_near_los
    else:
        # Fallback: just exclude the excluded jerseys
        ol_jerseys = [j for j in ol_jerseys if j not in excluded_ol_jerseys]
    
    dl_jerseys = (
        df_window
        .filter(pl.col("is_dlineman"))
        .select("jersey_number")
        .unique()
        ["jersey_number"]
        .to_list()
    )
    
    if len(ol_jerseys) == 0 or len(dl_jerseys) == 0:
        raise ValueError("No OL or DL players found in window (or all OL excluded/filtered by position)")
    
    best_score = -1
    best_pair = (None, None)
    best_pair_df = None
    best_info = None

    TIEBREAK_SCORE_EPS = 0.02
    CONTACT_THRESHOLD = 1.5  # Tie-break only; stricter than the 2.0 used in score_pair_engagement
    ACTIVE_CLOSING_THRESHOLD = -0.005
    MIN_ACTIVITY_SPEED = 0.3
    MIN_ACTIVITY_ACCEL = 0.5

    for ol_j in ol_jerseys:
        for dl_j in dl_jerseys:
            pair_df = build_pair_timeseries(df_window, ol_j, dl_j)
            if pair_df.height < 5:
                continue
            
            info = score_pair_engagement(pair_df)

            frame_ids = pair_df["frame_id"].to_numpy()
            distances = pair_df["pairwise_distance"].to_numpy()
            distance_changes = pair_df["distance_change"].to_numpy()
            frame_deltas = pair_df["frame_delta"].to_numpy()
            ol_speeds = pair_df["ol_s"].to_numpy()
            dl_speeds = pair_df["dl_s"].to_numpy()
            ol_accels = pair_df["ol_a"].to_numpy()
            dl_accels = pair_df["dl_a"].to_numpy()
            min_idx = info.get("min_distance_idx", 0)
            min_frame = int(frame_ids[min_idx]) if len(frame_ids) > min_idx else int(frame_ids[0])
            first_close_frame = None
            if len(distances) > 0:
                close_idx = np.where(distances < CONTACT_THRESHOLD)[0]
                if close_idx.size > 0:
                    first_close_frame = int(frame_ids[close_idx[0]])

            first_active_closing_frame = None
            for i in range(1, len(distance_changes)):
                dc = distance_changes[i]
                fd = frame_deltas[i]
                if dc is None or np.isnan(dc) or fd is None or np.isnan(fd) or fd != 1:
                    continue
                if dc < ACTIVE_CLOSING_THRESHOLD:
                    max_speed = max(ol_speeds[i], dl_speeds[i])
                    max_accel = max(ol_accels[i], dl_accels[i])

                    if max_speed >= MIN_ACTIVITY_SPEED or max_accel >= MIN_ACTIVITY_ACCEL:
                        first_active_closing_frame = int(frame_ids[i])
                        break

            info["min_distance_frame"] = min_frame
            info["first_close_frame"] = first_close_frame
            info["first_active_closing_frame"] = first_active_closing_frame
            
            score = info["engagement_score"]
            if score > best_score + TIEBREAK_SCORE_EPS:
                best_score = score
                best_pair = (ol_j, dl_j)
                best_pair_df = pair_df
                best_info = info
            elif best_info is None or abs(score - best_score) <= TIEBREAK_SCORE_EPS:
                cand_active = info.get("first_active_closing_frame")
                best_active = best_info.get("first_active_closing_frame") if best_info else None
                cand_min = info.get("min_distance_frame")
                best_min = best_info.get("min_distance_frame") if best_info else None
                cand_first = info.get("first_close_frame")
                best_first = best_info.get("first_close_frame") if best_info else None

                prefer = False
                if cand_active is not None and best_active is not None and cand_active != best_active:
                    prefer = cand_active < best_active
                elif cand_min is not None and best_min is not None and cand_min != best_min:
                    prefer = cand_min < best_min
                elif cand_first is not None and best_first is not None and cand_first != best_first:
                    prefer = cand_first < best_first

                if best_info is None or prefer:
                    best_score = score
                    best_pair = (ol_j, dl_j)
                    best_pair_df = pair_df
                    best_info = info
    
    if best_pair[0] is None:
        raise ValueError("Could not identify engaged OL-DL pair")
    
    return best_pair[0], best_pair[1], best_pair_df, best_info

print("Rep detection helper functions loaded.")

# Rep Start Detection - Uses global CROSSING_X
def detect_rep_start(pair_df: pl.DataFrame, min_distance_idx: int = None, df_window: pl.DataFrame = None, ol_jersey: str = None) -> int:
    """
    Detect rep start using DL crossing CROSSING_X with lookback for acceleration trigger.
    Uses global CROSSING_X (normalized to 0.5 for LOS at x=0).
    """
    if pair_df.height == 0:
        return 0
    pair_df = pair_df.sort("frame_id")
    frame_ids = pair_df["frame_id"].to_numpy()
    dl_x = pair_df["dl_x"].to_numpy()
    ol_a = pair_df["ol_a"].to_numpy()
    dl_a = pair_df["dl_a"].to_numpy()
    ol_s = pair_df["ol_s"].to_numpy()
    dl_s = pair_df["dl_s"].to_numpy()
    frame_deltas = np.concatenate([[np.nan], np.diff(frame_ids)])
    n = len(frame_ids)
    
    # Use global CROSSING_X (defined with the rep detection constants above)
    HOLD_FRAMES = 10
    LOOKBACK_FRAMES = 15
    ACCEL_SUM_THRESHOLD = 1.5
    SPEED_SUM_THRESHOLD = 1.1
    
    crossing_idx = None
    for i in range(0, n - HOLD_FRAMES + 1):
        if dl_x[i] <= CROSSING_X:
            continue
        if i > 0 and dl_x[i-1] > CROSSING_X:
            continue
        run_ok = True
        for j in range(HOLD_FRAMES):
            idx = i + j
            if dl_x[idx] <= CROSSING_X:
                run_ok = False
                break
            if j > 0:
                fd = frame_deltas[idx]
                if fd is None or np.isnan(fd) or fd != 1:
                    run_ok = False
                    break
        if run_ok:
            crossing_idx = i
            break
    
    if crossing_idx is None:
        return int(frame_ids[0])
    
    lookback_start = max(0, crossing_idx - LOOKBACK_FRAMES)
    
    for i in range(lookback_start, crossing_idx + 1):
        accel_sum = ol_a[i] + dl_a[i]
        speed_sum = ol_s[i] + dl_s[i]
        if not np.isnan(accel_sum) and (accel_sum >= ACCEL_SUM_THRESHOLD or speed_sum >= SPEED_SUM_THRESHOLD):
            return int(frame_ids[i])
    
    return int(frame_ids[crossing_idx])

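def detect_rep_end(pair_df: pl.DataFrame, rep_start_frame: int) -> int:
    """
    Rep End Detection: LOS Retreat OR Stagnation Rule

    The search starts SEARCH_DELAY_FRAMES after rep_start_frame. The rep ends at the
    first frame that begins one of these runs, for either player:
      - Retreat: x decreases by more than 0.05 per frame for CONSECUTIVE_FRAMES
        consecutive frames.
      - Stagnation: |dx| and |dy| both stay below STAGNATION_THRESHOLD for
        STAGNATION_FRAMES consecutive frames.
    Falls back to the last available frame when neither condition is met.
    """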
    SEARCH_DELAY_FRAMES = 10
    X_DECREASE_THRESHOLD = -0.05
    CONSECUTIVE_FRAMES = 10
    STAGNATION_THRESHOLD = 0.01
    STAGNATION_FRAMES = 3

    search_start_frame = rep_start_frame + SEARCH_DELAY_FRAMES
    pair_after_delay = pair_df.filter(pl.col("frame_id") >= search_start_frame)

    min_frames_needed = min(CONSECUTIVE_FRAMES, STAGNATION_FRAMES)
    if pair_after_delay.height < min_frames_needed:
        return int(pair_df["frame_id"].max())

    pair_after_delay = pair_after_delay.sort("frame_id")
    pair_after_delay = pair_after_delay.with_columns([
        (pl.col("frame_id") - pl.col("frame_id").shift(1)).alias("frame_delta"),
        (pl.col("ol_x") - pl.col("ol_x").shift(1)).alias("ol_x_delta"),
        (pl.col("ol_y") - pl.col("ol_y").shift(1)).alias("ol_y_delta"),
        (pl.col("dl_x") - pl.col("dl_x").shift(1)).alias("dl_x_delta"),
        (pl.col("dl_y") - pl.col("dl_y").shift(1)).alias("dl_y_delta")
    ])

    frame_ids = pair_after_delay["frame_id"].to_numpy()
    frame_deltas = pair_after_delay["frame_delta"].to_numpy()
    ol_x_deltas = pair_after_delay["ol_x_delta"].to_numpy()
    ol_y_deltas = pair_after_delay["ol_y_delta"].to_numpy()
    dl_x_deltas = pair_after_delay["dl_x_delta"].to_numpy()
    dl_y_deltas = pair_after_delay["dl_y_delta"].to_numpy()
    n = len(frame_ids)

    for i in range(1, n):
        # Condition A: LOS Retreat
        if i <= n - CONSECUTIVE_FRAMES:
            ol_retreat_run = True
            for j in range(CONSECUTIVE_FRAMES):
                idx = i + j
                if idx >= n:
                    ol_retreat_run = False
                    break
                fd = frame_deltas[idx]
                is_consecutive = fd is not None and not np.isnan(fd) and fd == 1
                ol_xd = ol_x_deltas[idx]
                ol_retreating = ol_xd is not None and not np.isnan(ol_xd) and ol_xd < X_DECREASE_THRESHOLD
                if not (is_consecutive and ol_retreating):
                    ol_retreat_run = False
                    break
            if ol_retreat_run:
                return int(frame_ids[i])

            dl_retreat_run = True
            for j in range(CONSECUTIVE_FRAMES):
                idx = i + j
                if idx >= n:
                    dl_retreat_run = False
                    break
                fd = frame_deltas[idx]
                is_consecutive = fd is not None and not np.isnan(fd) and fd == 1
                dl_xd = dl_x_deltas[idx]
                dl_retreating = dl_xd is not None and not np.isnan(dl_xd) and dl_xd < X_DECREASE_THRESHOLD
                if not (is_consecutive and dl_retreating):
                    dl_retreat_run = False
                    break
            if dl_retreat_run:
                return int(frame_ids[i])

        # Condition B: Stagnation
        if i <= n - STAGNATION_FRAMES:
            ol_stagnant_run = True
            for j in range(STAGNATION_FRAMES):
                idx = i + j
                if idx >= n:
                    ol_stagnant_run = False
                    break
                fd = frame_deltas[idx]
                is_consecutive = fd is not None and not np.isnan(fd) and fd == 1
                ol_xd = ol_x_deltas[idx]
                ol_yd = ol_y_deltas[idx]
                ol_x_stagnant = ol_xd is not None and not np.isnan(ol_xd) and abs(ol_xd) < STAGNATION_THRESHOLD
                ol_y_stagnant = ol_yd is not None and not np.isnan(ol_yd) and abs(ol_yd) < STAGNATION_THRESHOLD
                ol_is_stagnant = ol_x_stagnant and ol_y_stagnant
                if not (is_consecutive and ol_is_stagnant):
                    ol_stagnant_run = False
                    break
            if ol_stagnant_run:
                return int(frame_ids[i])

            dl_stagnant_run = True
            for j in range(STAGNATION_FRAMES):
                idx = i + j
                if idx >= n:
                    dl_stagnant_run = False
                    break
                fd = frame_deltas[idx]
                is_consecutive = fd is not None and not np.isnan(fd) and fd == 1
                dl_xd = dl_x_deltas[idx]
                dl_yd = dl_y_deltas[idx]
                dl_x_stagnant = dl_xd is not None and not np.isnan(dl_xd) and abs(dl_xd) < STAGNATION_THRESHOLD
                dl_y_stagnant = dl_yd is not None and not np.isnan(dl_yd) and abs(dl_yd) < STAGNATION_THRESHOLD
                dl_is_stagnant = dl_x_stagnant and dl_y_stagnant
                if not (is_consecutive and dl_is_stagnant):
                    dl_stagnant_run = False
                    break
            if dl_stagnant_run:
                return int(frame_ids[i])

    return int(frame_ids[-1])

print("Rep start/end detection functions loaded.")
print(f"  Using CROSSING_X = {CROSSING_X} (normalized coords)")

def detect_rep(df_window: pl.DataFrame, window_start: int, window_end: int, rep_number: int = 0, trigger_frame: int = None) -> dict:
    """Main rep detection function with fallback loop.
    
    Parameters:
        df_window: DataFrame containing the rep window data
        window_start: Start frame of the window
        window_end: End frame of the window
        rep_number: Rep number for labeling
        trigger_frame: Frame when DL crossed TRIGGER_X (passed to identify_ol_dl_pair)
    """
    MIN_REP_DURATION_LOCAL = 0
    MAX_RETRIES = 3
    
    excluded_ol_jerseys = []
    best_result = None
    best_rep_duration = 0
    
    for attempt in range(MAX_RETRIES + 1):
        try:
            ol_jersey, dl_jersey, pair_df, engagement_info = identify_ol_dl_pair(
                df_window, excluded_ol_jerseys=excluded_ol_jerseys, trigger_frame=trigger_frame
            )
            
            rep_start_frame = detect_rep_start(pair_df, engagement_info.get("min_distance_idx"), df_window, ol_jersey)
            rep_end_frame = detect_rep_end(pair_df, rep_start_frame)
            
            rep_duration = rep_end_frame - rep_start_frame
            
            start_ts_row = pair_df.filter(pl.col("frame_id") == rep_start_frame)
            end_ts_row = pair_df.filter(pl.col("frame_id") == rep_end_frame)
            start_ts = start_ts_row["ts"][0] if start_ts_row.height > 0 else None
            end_ts = end_ts_row["ts"][0] if end_ts_row.height > 0 else None
            
            result = {
                "window_start": window_start,
                "window_end": window_end,
                "ol_jersey": ol_jersey,
                "dl_jersey": dl_jersey,
                "rep_start_frame": rep_start_frame,
                "rep_end_frame": rep_end_frame,
                "start_ts": start_ts,
                "end_ts": end_ts,
                "engagement_info": engagement_info,
                "pair_timeseries": pair_df,
                "rep_number": rep_number,
                "retry_attempt": attempt,
                "excluded_ol_jerseys": list(excluded_ol_jerseys)
            }
            
            if rep_duration > best_rep_duration:
                best_rep_duration = rep_duration
                best_result = result
            
            if rep_duration >= MIN_REP_DURATION_LOCAL:
                return result
            
            if attempt < MAX_RETRIES:
                excluded_ol_jerseys.append(ol_jersey)
            
        except ValueError as e:
            if best_result is not None:
                return best_result
            raise
    
    return best_result

print("Main rep detection function loaded.")

# DL Trigger Detection - Uses global TRIGGER_X
def find_next_dl_trigger(df: pl.DataFrame, start_frame: int, end_frame: int = None) -> int | None:
    """Find next frame where ANY DL crosses TRIGGER_X in increasing x direction."""
    dl_df = df.filter(pl.col("is_dlineman") == True)
    
    if end_frame is not None:
        dl_df = dl_df.filter(
            (pl.col("frame_id") >= start_frame) &
            (pl.col("frame_id") <= end_frame)
        )
    else:
        dl_df = dl_df.filter(pl.col("frame_id") >= start_frame)
    
    if dl_df.height == 0:
        return None
    
    dl_with_prev = (
        dl_df
        .sort(["jersey_number", "frame_id"])
        .with_columns(
            pl.col("x").shift(1).over("jersey_number").alias("x_prev")
        )
    )
    
    # Use global TRIGGER_X
    crossings = dl_with_prev.filter(
        (pl.col("x_prev").is_not_null()) &
        (pl.col("x_prev") < TRIGGER_X) &
        (pl.col("x") >= TRIGGER_X)
    )
    
    if crossings.height == 0:
        return None
    
    earliest_frame = crossings["frame_id"].min()
    return int(earliest_frame)

def run_supra_algorithm(df: pl.DataFrame, start_frame: int, end_frame: int, verbose: bool = True) -> list:
    """Main supra-algorithm that scans through the practice period and detects all reps."""
    results = []
    rep_number = 1
    current_scan_position = start_frame
    
    if verbose:
        print(f"Starting supra-algorithm scan from frame {start_frame} to {end_frame}")
        print(f"Trigger X: {TRIGGER_X}, Window: [-{WINDOW_BEFORE_TRIGGER}, +{WINDOW_AFTER_TRIGGER}]")
        print(f"OL position filter: x <= {OL_MAX_X_AT_TRIGGER} at trigger frame")
        print("=" * 70)
    
    while current_scan_position < end_frame:
        trigger_frame = find_next_dl_trigger(df, current_scan_position, end_frame)
        
        if trigger_frame is None:
            if verbose:
                print(f"No more triggers found after frame {current_scan_position}")
            break
        
        if verbose:
            print(f"\nRep {rep_number}: Trigger at frame {trigger_frame}")
        
        window_start = max(start_frame, trigger_frame - WINDOW_BEFORE_TRIGGER)
        window_end = min(end_frame, trigger_frame + WINDOW_AFTER_TRIGGER)
        
        if verbose:
            print(f"  Window: [{window_start}, {window_end}]")
        
        df_window = df.filter(
            (pl.col("frame_id") >= window_start) &
            (pl.col("frame_id") <= window_end)
        )
        
        if df_window.height == 0:
            if verbose:
                print("  Empty window, skipping")
            current_scan_position = trigger_frame + 10
            continue
        
        try:
            result = detect_rep(df_window, window_start, window_end, rep_number, trigger_frame=trigger_frame)
            
            if result is not None:
                rep_duration = result['rep_end_frame'] - result['rep_start_frame']
                
                if rep_duration >= MIN_REP_DURATION:
                    results.append(result)
                    if verbose:
                        print(f"  Detected: OL {result['ol_jersey']} vs DL {result['dl_jersey']}")
                        print(f"  Rep frames: {result['rep_start_frame']} - {result['rep_end_frame']} ({rep_duration} frames)")
                    
                    rep_number += 1
                    current_scan_position = result['rep_start_frame'] + WINDOW_AFTER_TRIGGER
                else:
                    if verbose:
                        print(f"  Rep too short ({rep_duration} frames < {MIN_REP_DURATION}), skipping")
                    current_scan_position = trigger_frame + 10
            else:
                if verbose:
                    print("  No valid rep detected, skipping")
                current_scan_position = trigger_frame + 10
                
        except ValueError as e:
            if verbose:
                print(f"  Error: {e}, skipping")
            current_scan_position = trigger_frame + 10
    
    if verbose:
        print("\n" + "=" * 70)
        print(f"Supra-algorithm complete. Detected {len(results)} reps.")
    
    return results

def build_output_dataframe(results: list, jersey_to_zebra: dict) -> pl.DataFrame:
    """Convert results list to summary DataFrame."""
    if not results:
        return pl.DataFrame({
            'rep_number': [], 'rep_start_frame': [], 'rep_end_frame': [],
            'ol_jersey': [], 'dl_jersey': [], 'ol_zebra_id': [], 'dl_zebra_id': [],
            'start_ts': [], 'end_ts': [], 'duration_frames': [], 'duration_seconds': [],
        })

    rows = []
    for r in results:
        duration_frames = r['rep_end_frame'] - r['rep_start_frame'] + 1  # +1 for inclusive count
        rows.append({
            'rep_number': r['rep_number'],
            'rep_start_frame': r['rep_start_frame'],
            'rep_end_frame': r['rep_end_frame'],
            'ol_jersey': r['ol_jersey'],
            'dl_jersey': r['dl_jersey'],
            'ol_zebra_id': jersey_to_zebra.get(r['ol_jersey']),
            'dl_zebra_id': jersey_to_zebra.get(r['dl_jersey']),
            'start_ts': r['start_ts'],
            'end_ts': r['end_ts'],
            'duration_frames': duration_frames,
            'duration_seconds': duration_frames * 0.1,  # 0.1 s per frame, matching the ts spacing
        })

    return pl.DataFrame(rows)

print("\nAll algorithm functions loaded successfully.")


Rep detection helper functions loaded.
Rep start/end detection functions loaded.
  Using CROSSING_X = 0.5 (normalized coords)
Main rep detection function loaded.

All algorithm functions loaded successfully.
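
The crossing rule inside `detect_rep_start` can be hard to read from the nested loops. The sketch below is a minimal, standalone restatement using plain Python lists, not the notebook's own function: `first_sustained_crossing` and the toy values are hypothetical names and data chosen for illustration. It looks for the first frame where x moves past the threshold and then stays past it for `hold_frames` consecutive frames with no gaps in `frame_id`.

```python
def first_sustained_crossing(xs, frame_ids, threshold, hold_frames):
    """Return the index where a sustained crossing starts, or None (illustrative helper)."""
    n = len(xs)
    for i in range(n - hold_frames + 1):
        # Frame i must be past the threshold...
        if xs[i] <= threshold:
            continue
        # ...and the previous frame must not be, so i is the crossing itself.
        if i > 0 and xs[i - 1] > threshold:
            continue
        # The run must stay past the threshold with frame_id increasing by exactly 1.
        held = all(
            xs[i + j] > threshold
            and (j == 0 or frame_ids[i + j] - frame_ids[i + j - 1] == 1)
            for j in range(hold_frames)
        )
        if held:
            return i
    return None


# Toy data (hypothetical): a brief poke past 0.5 at index 1, then a sustained crossing at index 3.
toy_x = [0.2, 0.6, 0.4, 0.7, 0.8, 0.9, 1.0]
contiguous_ids = [10, 11, 12, 13, 14, 15, 16]
gapped_ids = [10, 11, 12, 13, 15, 16, 17]  # one frame missing right after the crossing

print(first_sustained_crossing(toy_x, contiguous_ids, threshold=0.5, hold_frames=3))
print(first_sustained_crossing(toy_x, gapped_ids, threshold=0.5, hold_frames=3))
```

With contiguous frames the crossing is found at index 3. With the gapped IDs no crossing is reported, because later frames are no longer "first past the threshold"; in the notebook's function that case falls back to the first frame of the pair.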


#### CELL 8: RUN THE ALGORITHM
Let it rip!

In [9]:
# Build jersey -> zebra_id lookup
jersey_to_zebra = dict(
    df_reps.select(["jersey_number", "id"]).unique().iter_rows()
)

# Get frame range
start_frame = int(df_reps["frame_id"].min())
end_frame = int(df_reps["frame_id"].max())

# Run the algorithm
results = run_supra_algorithm(df_reps, start_frame, end_frame, verbose=True)

# Build output DataFrame
output_df = build_output_dataframe(results, jersey_to_zebra)

print("\n" + "="*60)
print(f"DETECTED {len(results)} REPS")
print("="*60)
display(output_df)

Starting supra-algorithm scan from frame 49734 to 56744
Trigger X: 0.5, Window: [-40, +80]
OL position filter: x <= 6 at trigger frame

Rep 1: Trigger at frame 49824
  Window: [49784, 49904]
  Detected: OL 71 vs DL 91
  Rep frames: 49815 - 49842 (27 frames)

Rep 2: Trigger at frame 50039
  Window: [49999, 50119]
  Detected: OL 71 vs DL 91
  Rep frames: 50029 - 50073 (44 frames)

Rep 3: Trigger at frame 50177
  Window: [50137, 50257]
  Detected: OL 69 vs DL 8
  Rep frames: 50163 - 50205 (42 frames)

Rep 4: Trigger at frame 50353
  Window: [50313, 50433]
  Detected: OL 69 vs DL 8
  Rep frames: 50342 - 50375 (33 frames)

Rep 5: Trigger at frame 50479
  Window: [50439, 50559]
  Detected: OL 54 vs DL 52
  Rep frames: 50465 - 50502 (37 frames)

Rep 6: Trigger at frame 50633
  Window: [50593, 50713]
  Detected: OL 54 vs DL 52
  Rep frames: 50621 - 50665 (44 frames)

Rep 7: Trigger at frame 50878
  Window: [50838, 50958]
  Detected: OL 73 vs DL 92
  Rep frames: 50864 - 50908 (44 frames)

Rep 8

rep_number,rep_start_frame,rep_end_frame,ol_jersey,dl_jersey,ol_zebra_id,dl_zebra_id,start_ts,end_ts,duration_frames,duration_seconds
i64,i64,i64,str,str,str,str,str,str,i64,f64
1,49815,49842,"""71""","""91""","""1770000086""","""1770000135""","""2024-01-27T17:14:14.200""","""2024-01-27T17:14:16.900""",28,2.8
2,50029,50073,"""71""","""91""","""1770000086""","""1770000135""","""2024-01-27T17:14:35.600""","""2024-01-27T17:14:40.000""",45,4.5
3,50163,50205,"""69""","""8""","""1770000089""","""1770000099""","""2024-01-27T17:14:49.000""","""2024-01-27T17:14:53.200""",43,4.3
4,50342,50375,"""69""","""8""","""1770000089""","""1770000099""","""2024-01-27T17:15:06.900""","""2024-01-27T17:15:10.200""",34,3.4
5,50465,50502,"""54""","""52""","""1770000095""","""1770000101""","""2024-01-27T17:15:19.200""","""2024-01-27T17:15:22.900""",38,3.8
…,…,…,…,…,…,…,…,…,…,…
29,55698,55740,"""55""","""8""","""1770000094""","""1770000099""","""2024-01-27T17:24:02.500""","""2024-01-27T17:24:06.700""",43,4.3
30,55899,55930,"""78""","""91""","""1770000092""","""1770000135""","""2024-01-27T17:24:22.600""","""2024-01-27T17:24:25.700""",32,3.2
31,56115,56147,"""78""","""91""","""1770000092""","""1770000135""","""2024-01-27T17:24:44.200""","""2024-01-27T17:24:47.400""",33,3.3
32,56414,56448,"""77""","""58""","""1770000084""","""1770000103""","""2024-01-27T17:25:14.100""","""2024-01-27T17:25:17.500""",35,3.5
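
The log and the table above report different durations for the same rep (27 vs. 28 frames for rep 1). The log prints `rep_end_frame - rep_start_frame`, while `build_output_dataframe` adds 1 for an inclusive count and multiplies by 0.1 s per frame. A small worked check, using hypothetical helper names and the window sizes from the logged `Window: [-40, +80]`:

```python
WINDOW_BEFORE_TRIGGER = 40   # frames before the trigger, as logged above
WINDOW_AFTER_TRIGGER = 80    # frames after the trigger, as logged above
SECONDS_PER_FRAME = 0.1      # from `duration_frames * 0.1` in build_output_dataframe


def rep_window(trigger_frame, start_frame, end_frame):
    """Clamp the search window around a trigger to the scanned frame range."""
    return (
        max(start_frame, trigger_frame - WINDOW_BEFORE_TRIGGER),
        min(end_frame, trigger_frame + WINDOW_AFTER_TRIGGER),
    )


def durations(rep_start_frame, rep_end_frame):
    """Return (difference as logged, inclusive frame count, seconds as tabulated)."""
    logged = rep_end_frame - rep_start_frame
    inclusive = logged + 1
    return logged, inclusive, round(inclusive * SECONDS_PER_FRAME, 1)


# Rep 1 from the output above: trigger at 49824, rep frames 49815 - 49842.
print(rep_window(49824, start_frame=49734, end_frame=56744))  # (49784, 49904)
print(durations(49815, 49842))                                # (27, 28, 2.8)
```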


#### CELL 9: VISUALIZE DETECTED REPS
Use the dropdown to select a rep and view the frame-by-frame tracking data. This is helpful for validation and identifying false positives.

- **Blue marker**: The identified offensive lineman (OL) for this rep
- **Red marker**: The identified defensive lineman (DL) for this rep
- **Gray markers**: All other players
- **Yellow line**: LOS at x=0 (normalized coordinates)

In [10]:
# Viz (play-only, Plotly)
if output_df.height == 0:
    print("No reps detected. Check your configuration and try again.")
else:
    if "session_name" in globals():
        session_name_local = session_name
    else:
        session_name_local = PRACTICE_FILE.stem if "PRACTICE_FILE" in globals() else "Session"

    frame_col = "frame_id"

    rep_options = []
    rep_meta = {}
    for result in results:
        rep_number = int(result.get("rep_number"))
        ol_jersey = result.get("ol_jersey")
        dl_jersey = result.get("dl_jersey")
        rep_key = (session_name_local, rep_number)
        rep_meta[rep_key] = result
        label = f"{session_name_local} | rep {rep_number} | OL {ol_jersey} vs DL {dl_jersey}"
        rep_options.append((label, rep_key))

    rep_options = sorted(rep_options, key=lambda x: x[1][1])

    if not rep_options:
        print("No reps available for visualization.")
    else:
        # Visualization constants (normalized coords)
        X_MIN = -15.0
        X_MAX = 15.0
        Y_MIN = 10.0
        Y_MAX = 40.0

        print("Pre-caching rep frames...")
        rep_cache = {}
        for _, rep_key in rep_options:
            meta = rep_meta.get(rep_key)
            if meta is None:
                continue

            rep_start = meta.get("rep_start_frame")
            rep_end = meta.get("rep_end_frame")
            ol_jersey = str(meta.get("ol_jersey"))
            dl_jersey = str(meta.get("dl_jersey"))

            rep_df = (
                df_reps
                .filter((pl.col("frame_id") >= rep_start) & (pl.col("frame_id") <= rep_end))
                .with_columns(pl.col("jersey_number").cast(pl.Utf8))
            )

            if rep_df.height == 0:
                continue

            ol_df = (
                rep_df
                .filter(pl.col("jersey_number") == ol_jersey)
                .select(["frame_id", "x", "y"])
                .unique(subset=["frame_id"])
                .sort("frame_id")
            )

            dl_df = (
                rep_df
                .filter(pl.col("jersey_number") == dl_jersey)
                .select(["frame_id", "x", "y"])
                .unique(subset=["frame_id"])
                .sort("frame_id")
            )

            wide_df = ol_df.join(dl_df, on="frame_id", how="inner", suffix="_dl").sort("frame_id")
            if wide_df.height == 0:
                continue

            frame_ids = wide_df["frame_id"].to_list()

            others_df = rep_df.filter(
                (pl.col("frame_id").is_in(frame_ids)) &
                (~pl.col("jersey_number").is_in([ol_jersey, dl_jersey]))
            )

            frame_groups = (
                others_df
                .group_by("frame_id")
                .agg(
                    pl.col("x").implode().alias("x"),
                    pl.col("y").implode().alias("y"),
                    pl.col("jersey_number").implode().alias("jersey"),
                )
                .sort("frame_id")
            )

            others_by_frame = {
                row["frame_id"]: (row["x"], row["y"], row["jersey"])
                for row in frame_groups.iter_rows(named=True)
            }

            ts_by_frame = {
                row["frame_id"]: row["ts"]
                for row in rep_df.select(["frame_id", "ts"]).unique().iter_rows(named=True)
            }

            rep_cache[rep_key] = {
                "session": rep_key[0],
                "rep_number": rep_key[1],
                "frame_ids": frame_ids,
                "ol_x": wide_df["x"].to_list(),
                "ol_y": wide_df["y"].to_list(),
                "dl_x": wide_df["x_dl"].to_list(),
                "dl_y": wide_df["y_dl"].to_list(),
                "ol_jersey": ol_jersey,
                "dl_jersey": dl_jersey,
                "others_by_frame": others_by_frame,
                "ts_by_frame": ts_by_frame,
            }

        print(f"Loaded {len(rep_cache)} reps")

        # ====== Figure ======
        import plotly.graph_objects as go

        fig = go.FigureWidget()

        # Yard lines
        for x_val in range(int(X_MIN), int(X_MAX) + 1, 5):
            color = "yellow" if x_val == 0 else "rgba(255,255,255,0.5)"
            width = 2 if x_val == 0 else 1
            fig.add_shape(
                type="line",
                x0=x_val, x1=x_val,
                y0=Y_MIN, y1=Y_MAX,
                line=dict(color=color, width=width),
                layer="below",
            )

        # Other players (faded)
        fig.add_trace(go.Scatter(
            x=[], y=[], mode="markers", name="Other Players",
            marker=dict(size=8, color="rgba(220,220,220,0.5)", line=dict(color="rgba(255,255,255,0.4)", width=1)),
            showlegend=True,
        ))

        # OL highlight
        fig.add_trace(go.Scatter(
            x=[], y=[], mode="markers+text", name="OL",
            marker=dict(size=18, color="dodgerblue", line=dict(color="white", width=2)),
            text=[], textposition="middle center",
            textfont=dict(color="white", size=10),
            showlegend=True,
        ))

        # DL highlight
        fig.add_trace(go.Scatter(
            x=[], y=[], mode="markers+text", name="DL",
            marker=dict(size=18, color="red", line=dict(color="white", width=2)),
            text=[], textposition="middle center",
            textfont=dict(color="white", size=10),
            showlegend=True,
        ))

        fig.update_layout(
            width=900, height=500,
            title="",
            showlegend=True,
            legend=dict(x=0.02, y=0.98),
            plot_bgcolor="#2e7d32",
            paper_bgcolor="white",
        )

        fig.update_xaxes(range=[X_MIN, X_MAX], title="X (yards)", showgrid=False, zeroline=False)
        fig.update_yaxes(range=[Y_MIN, Y_MAX], title="Y (yards)", showgrid=False, zeroline=False, scaleanchor="x", scaleratio=1)

        # ====== Widgets ======
        rep_dropdown = Dropdown(options=rep_options, description="Rep")
        play = Play(interval=200, min=0, max=1, step=1, value=0)
        frame_slider = IntSlider(min=0, max=1, step=1, value=0, description="Frame")
        back_button = Button(description="<")
        forward_button = Button(description=">")

        current_data = {"ref": None}

        def update_plot(frame_idx):
            data = current_data["ref"]
            if data is None:
                return
            idx = max(0, min(frame_idx, len(data["frame_ids"]) - 1))
            frame_id = data["frame_ids"][idx]
            other_xy = data["others_by_frame"].get(frame_id, ([], [], []))
            ts = data["ts_by_frame"].get(frame_id, "")

            with fig.batch_update():
                # Other players
                fig.data[0].x = other_xy[0]
                fig.data[0].y = other_xy[1]

                # OL/DL
                fig.data[1].x = [data["ol_x"][idx]]
                fig.data[1].y = [data["ol_y"][idx]]
                fig.data[1].text = [data["ol_jersey"]]
                fig.data[2].x = [data["dl_x"][idx]]
                fig.data[2].y = [data["dl_y"][idx]]
                fig.data[2].text = [data["dl_jersey"]]

                fig.layout.title = (
                    f"{data['session']} | rep {data['rep_number']} | {frame_col} {frame_id} | {ts}"
                )

        def load_rep(rep_key):
            if rep_key not in rep_cache:
                return
            current_data["ref"] = rep_cache[rep_key]
            data = current_data["ref"]
            max_idx = len(data["frame_ids"]) - 1

            frame_slider.max = max_idx
            play.max = max_idx
            frame_slider.value = 0
            update_plot(0)

        def on_rep_change(change):
            if change["name"] == "value":
                load_rep(change["new"])

        def on_frame_change(change):
            if change["name"] == "value":
                update_plot(change["new"])

        jslink((play, "value"), (frame_slider, "value"))
        rep_dropdown.observe(on_rep_change, names="value")
        frame_slider.observe(on_frame_change, names="value")
        back_button.on_click(lambda _: setattr(frame_slider, "value", max(0, frame_slider.value - 1)))
        forward_button.on_click(lambda _: setattr(frame_slider, "value", min(frame_slider.max, frame_slider.value + 1)))

        controls = HBox([play, back_button, forward_button, frame_slider])
        ui = VBox([rep_dropdown, controls, fig])
        display(ui)

        load_rep(rep_dropdown.value)



Pre-caching rep frames...
Loaded 33 reps


VBox(children=(Dropdown(description='Rep', options=(('2024_West_Practice_1.snappy | rep 1 | OL 71 vs DL 91', (…

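#### CELL 10: WRITE THE DF TO A CSV
Build the summary DataFrame and the wide-format rep time series for saving as CSV files. The `write_csv` calls are commented out; uncomment them to write the files to `output_dir`.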
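
The wide format adds three derived columns per frame: `pairwise_distance`, `distance_change`, and `frame_delta`. As a rough illustration of what those columns contain, here is a plain-Python sketch (not the Polars code used in the cell) applied to the first two frames of rep 1 from the sample output of this notebook. `add_derived` is a hypothetical helper name.

```python
import math

# First two frames of rep 1 (OL 71 vs DL 91), copied from the notebook's sample output.
frames = [
    {"frame_id": 49815, "ol_x": 0.440745, "ol_y": 23.378616, "dl_x": -0.514781, "dl_y": 21.168608},
    {"frame_id": 49816, "ol_x": 0.46759, "ol_y": 23.373589, "dl_x": -0.475474, "dl_y": 21.177973},
]


def add_derived(frames):
    """Add Euclidean OL-DL distance plus frame-to-frame differences (None on the first row)."""
    out = []
    prev = None
    for f in frames:
        row = dict(f)
        row["pairwise_distance"] = math.hypot(f["ol_x"] - f["dl_x"], f["ol_y"] - f["dl_y"])
        if prev is None:
            row["distance_change"] = None
            row["frame_delta"] = None
        else:
            row["distance_change"] = row["pairwise_distance"] - prev["pairwise_distance"]
            row["frame_delta"] = f["frame_id"] - prev["frame_id"]
        out.append(row)
        prev = row
    return out


rows = add_derived(frames)
for r in rows:
    print(r["frame_id"], r["pairwise_distance"], r["distance_change"], r["frame_delta"])
```

The first row has no previous frame, so its `distance_change` and `frame_delta` are empty, which matches the nulls in the first row of the sample output.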

In [11]:
# Save Summary and Wide Format
session_name = "2024WestPractice1"
output_dir = Path("~/Desktop/").expanduser() # Or change to your desired output directory
summary_filepath = output_dir / f"{session_name}_summary.csv"

# Save summary DataFrame 
# output_df.with_columns(pl.lit(session_name).alias("session_name")).write_csv(summary_filepath)

# Build and save wide-format rep timeseries
def build_wide_rep_data(df_reps: pl.DataFrame, result: dict) -> pl.DataFrame | None:
    """
    Build wide-format DataFrame for a single rep's OL-DL pair.
    Includes all frames from rep_start to rep_end with metrics for both players.
    Returns None if the two players share no frames. Reads the global session_name.
    """
    rep_start = result['rep_start_frame']
    rep_end = result['rep_end_frame']
    ol_jersey = result['ol_jersey']
    dl_jersey = result['dl_jersey']
    rep_number = result['rep_number']
    
    # Filter to rep frames
    rep_df = df_reps.filter(
        (pl.col("frame_id") >= rep_start) &
        (pl.col("frame_id") <= rep_end)
    )
    
    # Columns to include for each player
    base_cols = ["frame_id", "ts"]
    metric_cols = ["x", "y", "s", "a", "dir", "z", "sa", "dis"]
    id_cols = ["jersey_number", "gsis_id", "id"]  # id is zebra_id
    
    # Filter to available columns
    all_player_cols = metric_cols + id_cols
    available_cols = [c for c in all_player_cols if c in rep_df.columns]
    select_cols = base_cols + available_cols
    
    # Get OL data - deduplicate by frame_id to handle any duplicate rows
    ol_data = (
        rep_df
        .filter(pl.col("jersey_number") == ol_jersey)
        .select([c for c in select_cols if c in rep_df.columns])
        .unique(subset=["frame_id"])  # Deduplicate by frame_id
        .sort("frame_id")
    )
    
    # Rename OL columns (except frame_id, ts)
    ol_rename = {c: f"ol_{c}" for c in available_cols}
    ol_data = ol_data.rename(ol_rename)
    
    # Get DL data - deduplicate by frame_id to handle any duplicate rows
    dl_data = (
        rep_df
        .filter(pl.col("jersey_number") == dl_jersey)
        .select([c for c in select_cols if c in rep_df.columns])
        .unique(subset=["frame_id"])  # Deduplicate by frame_id
        .sort("frame_id")
    )
    
    # Rename DL columns (except frame_id, ts)
    dl_rename = {c: f"dl_{c}" for c in available_cols}
    dl_data = dl_data.rename(dl_rename)
    
    # Join on frame_id (inner join - only frames where both players have data)
    wide_df = ol_data.join(dl_data.drop("ts"), on="frame_id", how="inner")
    
    if wide_df.height == 0:
        return None
    
    # Add rep_number and session_name
    wide_df = wide_df.with_columns([
        pl.lit(rep_number).alias("rep_number"),
        pl.lit(session_name).alias("session_name"),
    ])
    
    # Add pairwise distance
    wide_df = wide_df.with_columns(
        (((pl.col("ol_x") - pl.col("dl_x"))**2 + (pl.col("ol_y") - pl.col("dl_y"))**2).sqrt())
        .alias("pairwise_distance")
    )
    
    # Add distance change
    wide_df = wide_df.with_columns(
        (pl.col("pairwise_distance") - pl.col("pairwise_distance").shift(1))
        .alias("distance_change")
    )
    
    # Add frame delta
    wide_df = wide_df.with_columns(
        (pl.col("frame_id") - pl.col("frame_id").shift(1))
        .alias("frame_delta")
    )
    
    return wide_df

# Build combined wide rep data
all_wide_reps = []
for result in results:
    wide_rep = build_wide_rep_data(df_reps, result)
    if wide_rep is not None:
        all_wide_reps.append(wide_rep)

if all_wide_reps:
    combined_wide_df = pl.concat(all_wide_reps, how="vertical")
    
    # Reorder columns: identifiers first, then OL metrics, then DL metrics, then derived
    id_cols_order = ["session_name", "rep_number", "frame_id", "ts"]
    ol_cols = [c for c in combined_wide_df.columns if c.startswith("ol_")]
    dl_cols = [c for c in combined_wide_df.columns if c.startswith("dl_")]
    derived_cols = ["pairwise_distance", "distance_change", "frame_delta"]
    
    # Build final column order
    final_cols = id_cols_order + sorted(ol_cols) + sorted(dl_cols) + derived_cols
    final_cols = [c for c in final_cols if c in combined_wide_df.columns]
    combined_wide_df = combined_wide_df.select(final_cols)
    
    # Save wide rep data
    wide_filepath = output_dir / f"{session_name}_wide_reps.csv"
    # combined_wide_df.write_csv(wide_filepath)
    
    # Show sample
    print("\nSample (first 5 rows):")
    display(combined_wide_df.head(5))
else:
    print("No rep data to save.")



Sample (first 5 rows):


session_name,rep_number,frame_id,ts,ol_a,ol_dir,ol_dis,ol_gsis_id,ol_id,ol_jersey_number,ol_s,ol_sa,ol_x,ol_y,ol_z,dl_a,dl_dir,dl_dis,dl_gsis_id,dl_id,dl_jersey_number,dl_s,dl_sa,dl_x,dl_y,dl_z,pairwise_distance,distance_change,frame_delta
str,i32,u32,str,f64,f64,f64,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,f64,f64,f64,f64,f64,f64,f64,u32
"""2024WestPractice1""",1,49815,"""2024-01-27T17:14:14.200""",0.715876,114.639725,0.0005,"""327296""","""1770000086""","""71""",0.311376,0.593745,0.440745,23.378616,2.0,1.076955,75.683302,0.001114,"""325899""","""1770000135""","""91""",0.477003,1.076862,-0.514781,21.168608,2.0,2.40773,,
"""2024WestPractice1""",1,49816,"""2024-01-27T17:14:14.300""",0.819911,110.90028,0.000721,"""327296""","""1770000086""","""71""",0.393735,0.774361,0.46759,23.373589,2.0,1.417541,75.781828,0.001547,"""325899""","""1770000135""","""91""",0.631631,1.417333,-0.475474,21.177973,2.0,2.389581,-0.018149,1.0
"""2024WestPractice1""",1,49817,"""2024-01-27T17:14:14.400""",1.235503,114.96914,0.000745,"""327296""","""1770000086""","""71""",0.55266,1.234527,0.49471,23.384496,2.0,1.637186,75.301393,0.003361,"""325899""","""1770000135""","""91""",0.817225,1.636394,-0.417531,21.190806,2.0,2.375807,-0.013773,1.0
"""2024WestPractice1""",1,49818,"""2024-01-27T17:14:14.500""",1.487852,116.732955,0.002573,"""327296""","""1770000086""","""71""",0.716564,1.479584,0.544983,23.362483,2.0,1.663369,74.854191,0.007851,"""325899""","""1770000135""","""91""",0.998911,1.662188,-0.329041,21.214494,2.0,2.319002,-0.056805,1.0
"""2024WestPractice1""",1,49819,"""2024-01-27T17:14:14.600""",1.851712,118.53101,0.003872,"""327296""","""1770000086""","""71""",0.926649,1.831969,0.606354,23.332492,2.0,2.043224,74.733885,0.010068,"""325899""","""1770000135""","""91""",1.213979,2.041493,-0.228841,21.241914,2.0,2.251236,-0.067766,1.0


#### ADDENDUM: FILTERING OUT FALSE POSITIVES
The algorithm detects a small number of false positives. These can be filtered out by manual inspection; for reference, they are also filtered out in the automated pipeline in data_pipeline_automated.ipynb.
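
One simple way to apply the result of a manual inspection is to keep a list of flagged rep numbers and drop them from `results` before building the output. The sketch below assumes that workflow; `drop_flagged_reps` and the flagged numbers are hypothetical, and it does not reproduce whatever filtering the automated notebook performs.

```python
def drop_flagged_reps(results, flagged_rep_numbers):
    """Return the results whose rep_number was not flagged as a false positive."""
    flagged = set(flagged_rep_numbers)
    return [r for r in results if r["rep_number"] not in flagged]


# Toy stand-in for the `results` list; real entries carry many more keys.
toy_results = [{"rep_number": n} for n in range(1, 6)]

# Hypothetical rep numbers judged to be false positives in the viewer.
kept = drop_flagged_reps(toy_results, flagged_rep_numbers=[2, 5])
print([r["rep_number"] for r in kept])  # [1, 3, 4]
```

The surviving reps keep their original `rep_number`, so they can still be cross-referenced against the visualization dropdown.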