# `spudtr` epochs format

`spudtr` epochs are `pandas.DataFrame` objects.

There are three key elements:

 1. `epoch_id`: an index-like integer column, where each value designates a unique epoch
 2. `time`: an index-like column of integer timestamps, the same in each epoch
 3. the rest of the data columns

There must be at least one epoch.

There must be at least one timepoint.

All the epochs must be timestamped exactly the same way.

> NOTE: timestamps are integers and may be positive or negative. The units are unspecified: they could be milliseconds, months, nanoseconds, or hours.
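
The rules above can be illustrated with a small hand-built dataframe. This is a minimal sketch in plain `pandas`, not part of `spudtr`; the column names `"time"` and `"my_data"` and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 2 epochs x 3 timestamps, one data column.
toy_epochs_df = pd.DataFrame(
    {
        "epoch_id": [0, 0, 0, 1, 1, 1],  # each value designates one epoch
        "time": [-1, 0, 1, -1, 0, 1],  # integer timestamps, same in every epoch
        "my_data": [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],  # the rest of the columns
    }
)

# Collect the timestamps of each epoch as a tuple, one entry per epoch
times_by_epoch = toy_epochs_df.groupby("epoch_id")["time"].apply(tuple)

n_epochs = len(times_by_epoch)  # must be at least 1
n_timepoints = len(times_by_epoch.iloc[0])  # must be at least 1
same_timestamps = times_by_epoch.nunique() == 1  # identical across epochs
```

Here `same_timestamps` is true because both epochs carry the timestamps -1, 0, 1 in the same order.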

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import spudtr.fake_epochs_data as fake_data
from spudtr import epf, DATA_DIR

# Example: simulated categorical and continuous data

The `epoch_id` column is "epoch_id"; there are four epochs: 0, 1, 2, 3.

The `time` column is "days"; there are 32 days in each epoch: 0, 1, 2, ..., 31.

The rest of the columns are the data recorded in each epoch at each time stamp.

In [2]:
n_epochs_per_category = 2
sim_epochs_df, channels = fake_data._generate(
    n_epochs=n_epochs_per_category,
    n_samples=32,
    n_categories=2,
    n_channels=4,
    time="days",
    epoch_id="epoch_id",
    seed=10,
)
display(sim_epochs_df)

| | epoch_id | days | categorical | continuous | channel0 | channel1 | channel2 | channel3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | cat0 | 0.771321 | -13.170787 | -30.197057 | 19.609869 | 43.177612 |
| 1 | 0 | 1 | cat0 | 0.020752 | 4.233125 | -7.726009 | -65.298259 | 41.464399 |
| 2 | 0 | 2 | cat0 | 0.633648 | 8.191480 | 21.915223 | 18.568468 | 27.639613 |
| 3 | 0 | 3 | cat0 | 0.748804 | -48.557122 | -50.952045 | 14.317029 | -17.186617 |
| 4 | 0 | 4 | cat0 | 0.498507 | -17.193401 | 50.222266 | 0.782896 | 38.251473 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 123 | 3 | 27 | cat1 | 0.744603 | 33.167254 | -7.658414 | 14.630878 | 14.329468 |
| 124 | 3 | 28 | cat1 | 0.469785 | -60.531560 | 0.774228 | 1.689442 | 0.882024 |
| 125 | 3 | 29 | cat1 | 0.598256 | 16.216221 | 66.028993 | 16.373534 | 4.854384 |
| 126 | 3 | 30 | cat1 | 0.147620 | -43.268966 | 26.531028 | -20.493672 | -12.327708 |
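
To see how the three elements fit together, the sketch below takes a dataframe with this kind of layout and pulls out the epoch ids, the timestamps, and the remaining data columns using plain `pandas`. The stand-in dataframe is built by hand with placeholder values so the example runs without `spudtr`; only the column layout (4 epochs, 32 samples per epoch) is taken from the example.

```python
import pandas as pd

# Stand-in with the same epoch_id / days layout as the simulated data.
# The channel values are placeholders, not spudtr output.
n_epochs, n_samples = 4, 32
stand_in_df = pd.DataFrame(
    {
        "epoch_id": [e for e in range(n_epochs) for _ in range(n_samples)],
        "days": list(range(n_samples)) * n_epochs,
        "channel0": 0.0,
        "channel1": 0.0,
    }
)

# Unique epoch ids and timestamps, in order of appearance
epoch_ids = stand_in_df["epoch_id"].unique().tolist()
days = stand_in_df["days"].unique().tolist()

# "The rest of the columns": everything that is not epoch_id or time
data_columns = [c for c in stand_in_df.columns if c not in ("epoch_id", "days")]

# Every epoch has every timestamp, so rows = epochs x timestamps
n_rows_expected = len(epoch_ids) * len(days)
```

With 4 epochs of 32 samples each, `n_rows_expected` is 128, which matches the row labels 0 through 127.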


# Example: EEG data

The `epoch_id` column is "epoch_id"; there are 792 epochs, numbered 0, 1, 2, ..., 791.

The `time` column is "time_ms"; there are 275 digital samples in each epoch at 4 ms intervals: -100, -96, ..., 992, 996.

The rest of the columns are the data recorded in each epoch at each time stamp.

In [3]:
eeg_epochs_df = pd.read_hdf(DATA_DIR / "gh_sub000p3.epochs.h5", key="p3")
eeg_epochs_df

| | epoch_id | time_ms | event_code | eeg_artifact | participant | MiPf | MiCe | MiPa | MiOc | A2 | stim | accuracy | acc_type | exp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -100 | 0 | False | demonstration | -48.0 | 23.015625 | 46.031250 | 11.656250 | 9.843750 | target | correct | hit | p3 |
| 1 | 0 | -96 | 0 | False | demonstration | -52.5 | 19.984375 | 41.968750 | 6.800781 | 5.660156 | target | correct | hit | p3 |
| 2 | 0 | -92 | 0 | False | demonstration | -51.5 | 22.765625 | 43.187500 | 7.773438 | 10.093750 | target | correct | hit | p3 |
| 3 | 0 | -88 | 0 | False | demonstration | -54.0 | 21.750000 | 38.875000 | 5.101562 | 5.906250 | target | correct | hit | p3 |
| 4 | 0 | -84 | 0 | False | demonstration | -55.0 | 19.984375 | 34.812500 | 5.343750 | 7.628906 | target | correct | hit | p3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 217795 | 791 | 980 | 0 | False | demonstration | 4.0 | 15.171875 | 15.984375 | 1.700195 | -2.460938 | all | all | all | p3 |
| 217796 | 791 | 984 | 0 | False | demonstration | -2.5 | 13.406250 | 13.835938 | 0.728516 | -6.398438 | all | all | all | p3 |
| 217797 | 791 | 988 | 0 | False | demonstration | 7.0 | 25.046875 | 25.765625 | 14.335938 | 6.890625 | all | all | all | p3 |
| 217798 | 791 | 992 | 0 | False | demonstration | 1.5 | 22.765625 | 24.328125 | 12.390625 | 4.429688 | all | all | all | p3 |
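
The sample and row counts in this example follow from simple arithmetic, sketched here in plain Python. The variable names are illustrative; the numbers (792 epochs, timestamps from -100 to 996 in steps of 4 ms) come from the description of the EEG data.

```python
# Timestamps from -100 ms up to and including 996 ms, in 4 ms steps
timestamps_ms = list(range(-100, 1000, 4))

n_samples = len(timestamps_ms)  # samples per epoch
n_epochs = 792  # epochs numbered 0 through 791

# Every epoch carries every timestamp, so the dataframe has
# n_epochs * n_samples rows, labeled 0 through n_rows - 1
n_rows = n_epochs * n_samples
```

This gives 275 samples per epoch and 217800 rows in total.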


# Always check the epoch x time format

When things go well, the check quietly succeeds.

When they don't, the reason appears at the bottom of the error messages.

Example: This check of the simulated data **SUCCEEDS**.

In [4]:
epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="days")

Example: This check **FAILS** because the data column named "bogus_channel0" doesn't exist in the data.

In [5]:
epf.check_epochs(sim_epochs_df, ['bogus_channel0', 'channel1'], epoch_id="epoch_id", time="days")

ValueError: data_streams should all be present in the epochs dataframe, the following are missing: ['bogus_channel0']

Example: This check **FAILS** because the `time` column named "hours" doesn't exist.

In [6]:
epf.check_epochs(sim_epochs_df, ['channel0', 'channel1'], epoch_id="epoch_id", time="hours")

ValueError: time column not found: hours
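
Since a failed check raises `ValueError`, a script can catch the exception and report the reason instead of stopping. The sketch below shows that pattern with a hypothetical stand-in, `sketch_check_columns`, written in plain `pandas`. It only tests that the named columns exist and is not `spudtr`'s `check_epochs`; the function name, the toy dataframe, and the wording of the missing data streams message are assumptions for illustration.

```python
import pandas as pd


def sketch_check_columns(epochs_df, data_streams, epoch_id, time):
    """Hypothetical stand-in for a format check, NOT spudtr's check_epochs."""
    # The epoch_id and time columns must be present
    for label, column in (("epoch_id", epoch_id), ("time", time)):
        if column not in epochs_df.columns:
            raise ValueError(f"{label} column not found: {column}")
    # Every requested data column must be present
    missing = [c for c in data_streams if c not in epochs_df.columns]
    if missing:
        raise ValueError(f"data streams missing: {missing}")


# Made-up toy data: one epoch, two timestamps, one data column
toy_df = pd.DataFrame({"epoch_id": [0, 0], "days": [0, 1], "channel0": [1.0, 2.0]})

# Try a good and a bad time column, collecting the failure reasons
failures = []
for time_col in ("days", "hours"):
    try:
        sketch_check_columns(toy_df, ["channel0"], epoch_id="epoch_id", time=time_col)
    except ValueError as err:
        failures.append(str(err))
```

After the loop, `failures` holds one message, for the "hours" column that does not exist.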