# Image generation by segments
The goal of this notebook is to demonstrate a new capability that we've made possible: pixel-by-pixel replay of a RealEye trial.
The stretch goal is to show the Tobii data as well.

**Why aren't they together, when you have code to join them?**
- New code would have to be written to segment the Tobii data in parallel with the RealEye data. That would be a simple "nearest" join followed by filtering out `null`s, but RealEye has to take the lead, and the Tobii and RealEye data need to be paired already.
    - Format-agnostic pairing has not been done.
    - Stapling indices into the existing code would be hacky and error-prone.
    - Writing new code should result in something clearer.

In [None]:
#| default_exp timeseries_segmentation

In [1]:
#|export
import polars as pl
from pathlib import Path

In [None]:
from RevChem.data_export import read_chunks_from_json

# Load the JSON store of RealEye and Tobii pairs; the format is fast to read (4 seconds for 35 trials).
# Data written as JSON to path: /Users/stephen/dev/RevChemData/2025-07-17-python-outputs/202507171142-matches-with-TCA.json.gz
associated_tobii_re_sequences = read_chunks_from_json(
    Path("~/dev/RevChemData/2025-07-17-python-outputs/202507171142-matches-with-TCA.json.gz").expanduser(),
)

In [3]:
associated_tobii_re_sequences[0]

(shape: (55_856, 4)
 ┌────────────────────────────┬──────┬──────┬──────────────────────┐
 │ timestamp                  ┆ X    ┆ Y    ┆ source_tsv           │
 │ ---                        ┆ ---  ┆ ---  ┆ ---                  │
 │ datetime[μs]               ┆ i32  ┆ i32  ┆ str                  │
 ╞════════════════════════════╪══════╪══════╪══════════════════════╡
 │ 2025-03-07 18:43:47.952    ┆ null ┆ null ┆ 2025-03-07-Cyndaquil │
 │ 2025-03-07 18:43:47.983053 ┆ null ┆ null ┆ 2025-03-07-Cyndaquil │
 │ 2025-03-07 18:43:47.993366 ┆ null ┆ null ┆ 2025-03-07-Cyndaquil │
 │ 2025-03-07 18:43:48.008020 ┆ null ┆ null ┆ 2025-03-07-Cyndaquil │
 │ 2025-03-07 18:43:48.023002 ┆ null ┆ null ┆ 2025-03-07-Cyndaquil │
 │ …                          ┆ …    ┆ …    ┆ …                    │
 │ 2025-03-07 18:51:15.869997 ┆ 1770 ┆ 977  ┆ 2025-03-07-Cyndaquil │
 │ 2025-03-07 18:51:15.878330 ┆ 1772 ┆ 986  ┆ 2025-03-07-Cyndaquil │
 │ 2025-03-07 18:51:15.886663 ┆ 1776 ┆ 986  ┆ 2025-03-07-Cyndaquil │
 │ 2025-03-07 

In [18]:
# | export

from datetime import timedelta
from typing import NamedTuple


class AssociatedTrialSegements(NamedTuple):
    trial_name_or_id: str
    segments: list[pl.DataFrame]


def join_chunks_as_segments(
    associated_chunks: list[tuple[pl.DataFrame, list[pl.DataFrame]]],
    *,
    join_strategy="backward",
) -> list[AssociatedTrialSegements]:
    """Transform a "chunk and associated list" into a "list of associated chunks".

    Algorithm:
        Given a list of tuple[pl.DataFrame, list[pl.DataFrame]] representing Tobii and RealEye (RE), resp.,
        for each RE dataframe:
            as-of join the Tobii "master" frame with it on the "timestamp" column
            - the match is controlled by `join_strategy` ("backward", "forward", or "nearest")
            - rename the RE columns to "X_re" and "Y_re"
            - drop the "test_created_at" column
            - keep only the rows within the time bounds of the RE df (nulls are not dropped explicitly)

    Arguments:
        associated_chunks: list of each matched Tobii dataframe with all the RE dataframes per stimulus
        join_strategy: strategy passed to `pl.DataFrame.join_asof`

    Returns:
        One `AssociatedTrialSegements` per trial, holding the subsegments of the Tobii df
        joined on the time column of each RE df, per the algorithm
    """
    output = []
    for tobii_df, re_dfs in associated_chunks:
        trial_name = tobii_df["source_tsv"][0]
        re_rename = dict(X="X_re", Y="Y_re")
        tobii_df = tobii_df.drop("source_tsv")
        segmented_associations = []
        for re_df in re_dfs:
            # NOTE: may need to use the `tolerance` kwarg to better tune the match-up
            associated = tobii_df.join_asof(
                re_df.drop("test_created_at").rename(re_rename),
                on="timestamp",
                strategy=join_strategy,
            )
            associated = associated.filter(
                (pl.col("timestamp") >= re_df["timestamp"].min())
                & (pl.col("timestamp") <= re_df["timestamp"].max())
            )
            segmented_associations.append(associated)

        output.append(AssociatedTrialSegements(trial_name, segmented_associations))
    return output


def test_chunk_assoc():
    first_joined = join_chunks_as_segments(associated_tobii_re_sequences[:1], join_strategy="backward")
    trial_name, segmented_associations = first_joined[0]

    print(f"For trial {trial_name}")
    with pl.Config(tbl_rows=30):
        print(segmented_associations)


In [None]:
#| hide
test_chunk_assoc()

For trial 2025-03-07-Cyndaquil
[shape: (117, 5)
┌────────────────────────────┬──────┬──────┬──────┬──────┐
│ timestamp                  ┆ X    ┆ Y    ┆ X_re ┆ Y_re │
│ ---                        ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ datetime[μs]               ┆ i32  ┆ i32  ┆ i32  ┆ i32  │
╞════════════════════════════╪══════╪══════╪══════╪══════╡
│ 2025-03-07 18:47:07.934088 ┆ 912  ┆ 684  ┆ 956  ┆ 547  │
│ 2025-03-07 18:47:07.942421 ┆ 914  ┆ 684  ┆ 956  ┆ 547  │
│ 2025-03-07 18:47:07.950754 ┆ 915  ┆ 684  ┆ 956  ┆ 547  │
│ 2025-03-07 18:47:07.959088 ┆ 915  ┆ 692  ┆ 892  ┆ 514  │
│ 2025-03-07 18:47:07.967421 ┆ 911  ┆ 691  ┆ 892  ┆ 514  │
│ 2025-03-07 18:47:07.975754 ┆ 912  ┆ 688  ┆ 892  ┆ 514  │
│ 2025-03-07 18:47:07.984088 ┆ 913  ┆ 684  ┆ 892  ┆ 514  │
│ 2025-03-07 18:47:07.992421 ┆ 914  ┆ 682  ┆ 894  ┆ 513  │
│ 2025-03-07 18:47:08.000754 ┆ 915  ┆ 676  ┆ 894  ┆ 513  │
│ 2025-03-07 18:47:08.009088 ┆ 917  ┆ 679  ┆ 894  ┆ 513  │
│ 2025-03-07 18:47:08.017421 ┆ 918  ┆ 679  ┆ 894  ┆ 513  │
│ 2025-0

In [None]:
from RevChem.tobii import GroupedFrames


stimuli_paths = sorted(Path("/Users/stephen/dev/RevChem-Stimuli/jpegs").glob("*.jpg"))

# we know that the stimuli are in the same order as the RealEye data
# problem is that we *don't* know which stimulus corresponds to which subset of the RealEye data

## A note on timings of the RealEye-Tobii trial's RealEye portion
Kathy says the triangle gets 5 seconds and the other two stimuli get 30 seconds.
The control trial says:
* Frame between stimuli is 1 second
* Triangle gets 5 seconds
* Stimulus (questions) gets 60 seconds
* Stimulus (confidence) gets 8 or 9 seconds
* Stimulus (reasoning) gets 45 seconds

I can work with that to get approximate timings, and then map the RealEye buckets to stimuli.
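
A minimal sketch of how the control-trial durations could be turned into approximate time windows. The `StimulusWindow` and `approximate_windows` names are hypothetical, the stimulus order (triangle, question, confidence, reasoning) is an assumption, and the confidence slot uses the lower figure of 8 seconds.

```python
from datetime import timedelta
from typing import NamedTuple


class StimulusWindow(NamedTuple):
    label: str
    start: timedelta  # offset from the start of the sequence
    end: timedelta


def approximate_windows(
    schedule: list[tuple[str, timedelta]],
    gap: timedelta = timedelta(seconds=1),
) -> list[StimulusWindow]:
    """Lay the stimuli end to end, separated by the between-stimuli frame."""
    windows = []
    cursor = timedelta(0)
    for label, duration in schedule:
        windows.append(StimulusWindow(label, cursor, cursor + duration))
        cursor += duration + gap
    return windows


# ASSUMPTION: this ordering is illustrative; only the durations come from the control trial.
control_schedule = [
    ("triangle", timedelta(seconds=5)),
    ("question", timedelta(seconds=60)),
    ("confidence", timedelta(seconds=8)),
    ("reasoning", timedelta(seconds=45)),
]

windows = approximate_windows(control_schedule)
for window in windows:
    print(window.label, window.start.total_seconds(), window.end.total_seconds())
```

Adding each offset to the first timestamp of a trial would give rough bounds against which the RealEye buckets could be compared.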