[![Test In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vanderschaarlab/temporai/blob/main/tutorials/data/tutorial02_datasets.ipynb)

# Data Tutorial 02: Datasets

This tutorial introduces the different TemporAI `Dataset` classes and shows how to build each one from pandas dataframes.

## Prepare some example data

In [1]:
import pandas as pd
import numpy as np

# Some time series data:
time_series_df = pd.DataFrame(
    {
        "sample_idx": ["sample_0", "sample_0", "sample_0", "sample_0", "sample_1", "sample_1", "sample_2"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "t_feat_0": [11, 12, 13, 14, 21, 22, 31],
        "t_feat_1": [1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 3.1],
        "t_feat_2": [10, 20, 30, 40, 11, 21, 111],
    }
)
time_series_df.set_index(keys=["sample_idx", "time_idx"], drop=True, inplace=True)

# Some static data:
static_df = pd.DataFrame(
    {
        "s_feat_0": [100, 200, 300],
        "s_feat_1": [-1.1, -1.2, -1.3],
        "s_feat_2": [0, 1, 0],
    },
    index=["sample_0", "sample_1", "sample_2"],
)

# Some event data:
event_df = pd.DataFrame(
    {
        "e_feat_0": [(10, True), (12, False), (13, True)],
        "e_feat_1": [(10, False), (10, False), (11, True)],
    },
    index=["sample_0", "sample_1", "sample_2"],
)

Preview the dataframes below.

In [2]:
time_series_df

| sample_idx | time_idx | t_feat_0 | t_feat_1 | t_feat_2 |
|---|---|---|---|---|
| sample_0 | 1 | 11 | 1.1 | 10 |
| sample_0 | 2 | 12 | 1.2 | 20 |
| sample_0 | 3 | 13 | 1.3 | 30 |
| sample_0 | 4 | 14 | 1.4 | 40 |
| sample_1 | 2 | 21 | 2.1 | 11 |
| sample_1 | 4 | 22 | 2.2 | 21 |
| sample_2 | 9 | 31 | 3.1 | 111 |


In [3]:
static_df

|  | s_feat_0 | s_feat_1 | s_feat_2 |
|---|---|---|---|
| sample_0 | 100 | -1.1 | 0 |
| sample_1 | 200 | -1.2 | 1 |
| sample_2 | 300 | -1.3 | 0 |


In [4]:
event_df

|  | e_feat_0 | e_feat_1 |
|---|---|---|
| sample_0 | (10, True) | (10, False) |
| sample_1 | (12, False) | (10, False) |
| sample_2 | (13, True) | (11, True) |
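
Each cell of `event_df` holds a two-element tuple. The sketch below splits those tuples into plain columns for easier inspection. It assumes the first element is an event time and the second a boolean flag for whether the event was observed; the tutorial itself only shows the tuples. The `*_time` and `*_flag` column names are hypothetical, chosen for this illustration.

```python
import pandas as pd

event_df = pd.DataFrame(
    {
        "e_feat_0": [(10, True), (12, False), (13, True)],
        "e_feat_1": [(10, False), (10, False), (11, True)],
    },
    index=["sample_0", "sample_1", "sample_2"],
)

# Split each (time, flag) tuple into two plain columns per event feature.
# Assumption: element 0 is a time, element 1 is a boolean indicator.
split = pd.DataFrame(index=event_df.index)
for col in event_df.columns:
    split[f"{col}_time"] = event_df[col].map(lambda pair: pair[0])
    split[f"{col}_flag"] = event_df[col].map(lambda pair: pair[1])

print(split)
```

This is only a way of looking at the data; the datasets in the rest of the tutorial are built from the tuple form.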


## `CovariatesDataset`

A `CovariatesDataset` contains covariates only: time series and, optionally, static data.
It holds no predictive data (targets or treatments).

It can be used with `preprocessing` transformations.

In [5]:
from tempor.data import dataset

In [6]:
# Initialize a CovariatesDataset:
data = dataset.CovariatesDataset(
    time_series=time_series_df,
    static=static_df,  # Optional, can be `None`.
)

data

CovariatesDataset(
    time_series=TimeSeriesSamples([3, *, 3]),
    static=StaticSamples([3, 3])
)
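
The shape shown as `TimeSeriesSamples([3, *, 3])` can be cross-checked against the input dataframe. The sketch below assumes the three entries stand for samples, timesteps, and features, with `*` marking a timestep count that differs between samples. That reading is inferred from the example data rather than stated in the tutorial.

```python
import pandas as pd

time_series_df = pd.DataFrame(
    {
        "sample_idx": ["sample_0", "sample_0", "sample_0", "sample_0", "sample_1", "sample_1", "sample_2"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "t_feat_0": [11, 12, 13, 14, 21, 22, 31],
        "t_feat_1": [1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 3.1],
        "t_feat_2": [10, 20, 30, 40, 11, 21, 111],
    }
).set_index(["sample_idx", "time_idx"])

# Number of distinct samples (first level of the MultiIndex).
n_samples = time_series_df.index.get_level_values("sample_idx").nunique()

# Number of timesteps recorded for each sample; these differ per sample.
steps_per_sample = time_series_df.groupby(level="sample_idx").size()

# Number of feature columns.
n_features = time_series_df.shape[1]

print(n_samples, steps_per_sample.to_dict(), n_features)
```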

In [7]:
data.time_series

| sample_idx | time_idx | t_feat_0 | t_feat_1 | t_feat_2 |
|---|---|---|---|---|
| sample_0 | 1 | 11 | 1.1 | 10 |
| sample_0 | 2 | 12 | 1.2 | 20 |
| sample_0 | 3 | 13 | 1.3 | 30 |
| sample_0 | 4 | 14 | 1.4 | 40 |
| sample_1 | 2 | 21 | 2.1 | 11 |
| sample_1 | 4 | 22 | 2.2 | 21 |
| sample_2 | 9 | 31 | 3.1 | 111 |


In [8]:
data.static

| sample_idx | s_feat_0 | s_feat_1 | s_feat_2 |
|---|---|---|---|
| sample_0 | 100 | -1.1 | 0 |
| sample_1 | 200 | -1.2 | 1 |
| sample_2 | 300 | -1.3 | 0 |


## `OneOffPredictionDataset`

A `OneOffPredictionDataset` contains time series and optionally static covariates.

It also needs `StaticSamples` prediction *targets* for estimators to be able to `fit` on this dataset.

It can be used with `prediction.one_off` estimators. The task is to predict some one-off value for each sample.

In [9]:
# Initialize a OneOffPredictionDataset:
data = dataset.OneOffPredictionDataset(
    time_series=time_series_df,
    static=static_df.loc[:, :"s_feat_1"],  # Optional, can be `None`.
    targets=static_df.loc[:, ["s_feat_2"]],  # Optional, can be `None` at inference time.
)

data

OneOffPredictionDataset(
    time_series=TimeSeriesSamples([3, *, 3]),
    static=StaticSamples([3, 2]),
    predictive=OneOffPredictionTaskData(targets=StaticSamples([3, 1]))
)
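
The two `.loc` selections used above behave differently from ordinary Python slicing, which the following pandas-only sketch illustrates. A label slice such as `:"s_feat_1"` includes its end label, and selecting with a one-element list keeps the result a `DataFrame` rather than a `Series`. The variable names are illustrative only.

```python
import pandas as pd

static_df = pd.DataFrame(
    {
        "s_feat_0": [100, 200, 300],
        "s_feat_1": [-1.1, -1.2, -1.3],
        "s_feat_2": [0, 1, 0],
    },
    index=["sample_0", "sample_1", "sample_2"],
)

# Label-based slices are inclusive of the end label, so this keeps two columns.
covariates = static_df.loc[:, :"s_feat_1"]

# A list of labels returns a DataFrame, even for a single column.
targets = static_df.loc[:, ["s_feat_2"]]

# A bare label returns a Series instead.
as_series = static_df.loc[:, "s_feat_2"]

print(list(covariates.columns), type(targets).__name__, type(as_series).__name__)
```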

In [10]:
data.time_series

| sample_idx | time_idx | t_feat_0 | t_feat_1 | t_feat_2 |
|---|---|---|---|---|
| sample_0 | 1 | 11 | 1.1 | 10 |
| sample_0 | 2 | 12 | 1.2 | 20 |
| sample_0 | 3 | 13 | 1.3 | 30 |
| sample_0 | 4 | 14 | 1.4 | 40 |
| sample_1 | 2 | 21 | 2.1 | 11 |
| sample_1 | 4 | 22 | 2.2 | 21 |
| sample_2 | 9 | 31 | 3.1 | 111 |


In [11]:
data.static

| sample_idx | s_feat_0 | s_feat_1 |
|---|---|---|
| sample_0 | 100 | -1.1 |
| sample_1 | 200 | -1.2 |
| sample_2 | 300 | -1.3 |


In [12]:
data.predictive.targets

| sample_idx | s_feat_2 |
|---|---|
| sample_0 | 0 |
| sample_1 | 1 |
| sample_2 | 0 |


## `TemporalPredictionDataset`

A `TemporalPredictionDataset` contains time series and optionally static covariates.

It also needs `TimeSeriesSamples` prediction *targets* for estimators to be able to `fit` on this dataset.

It can be used with `prediction.temporal` estimators. The task is to predict some time series for each sample.

In [13]:
# Initialize a TemporalPredictionDataset:
data = dataset.TemporalPredictionDataset(
    time_series=time_series_df.loc[:, :"t_feat_1"],
    static=static_df,  # Optional, can be `None`.
    targets=time_series_df.loc[:, ["t_feat_2"]],  # Optional, can be `None` at inference time.
)

data

TemporalPredictionDataset(
    time_series=TimeSeriesSamples([3, *, 2]),
    static=StaticSamples([3, 3]),
    predictive=TemporalPredictionTaskData(
        targets=TimeSeriesSamples([3, *, 1])
    )
)
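
Here the covariates and the targets are carved out of the same dataframe, so they share one `(sample_idx, time_idx)` index. The sketch below checks this with pandas alone. Whether TemporAI requires identical indexes for covariates and targets is not stated in this tutorial, so the check is shown as a sanity test, not as a library rule.

```python
import pandas as pd

time_series_df = pd.DataFrame(
    {
        "sample_idx": ["sample_0", "sample_0", "sample_0", "sample_0", "sample_1", "sample_1", "sample_2"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "t_feat_0": [11, 12, 13, 14, 21, 22, 31],
        "t_feat_1": [1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 3.1],
        "t_feat_2": [10, 20, 30, 40, 11, 21, 111],
    }
).set_index(["sample_idx", "time_idx"])

# Same split as in the cell above: first two features as covariates,
# the last feature as the temporal target.
covariates = time_series_df.loc[:, :"t_feat_1"]
targets = time_series_df.loc[:, ["t_feat_2"]]

# Both pieces come from the same rows, so their indexes are identical.
same_index = covariates.index.equals(targets.index)
print(same_index)
```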

In [14]:
data.time_series

| sample_idx | time_idx | t_feat_0 | t_feat_1 |
|---|---|---|---|
| sample_0 | 1 | 11 | 1.1 |
| sample_0 | 2 | 12 | 1.2 |
| sample_0 | 3 | 13 | 1.3 |
| sample_0 | 4 | 14 | 1.4 |
| sample_1 | 2 | 21 | 2.1 |
| sample_1 | 4 | 22 | 2.2 |
| sample_2 | 9 | 31 | 3.1 |


In [15]:
data.static

| sample_idx | s_feat_0 | s_feat_1 | s_feat_2 |
|---|---|---|---|
| sample_0 | 100 | -1.1 | 0 |
| sample_1 | 200 | -1.2 | 1 |
| sample_2 | 300 | -1.3 | 0 |


In [16]:
data.predictive.targets

| sample_idx | time_idx | t_feat_2 |
|---|---|---|
| sample_0 | 1 | 10 |
| sample_0 | 2 | 20 |
| sample_0 | 3 | 30 |
| sample_0 | 4 | 40 |
| sample_1 | 2 | 11 |
| sample_1 | 4 | 21 |
| sample_2 | 9 | 111 |


## `TimeToEventAnalysisDataset`

A `TimeToEventAnalysisDataset` contains time series and optionally static covariates.

It also needs `EventSamples` prediction *targets* for estimators to be able to `fit` on this dataset.

It can be used with `time_to_event` estimators. The task is to predict risk scores for each sample.

In [17]:
# Initialize a TimeToEventAnalysisDataset:
data = dataset.TimeToEventAnalysisDataset(
    time_series=time_series_df,
    static=static_df,  # Optional, can be `None`.
    targets=event_df,  # Optional, can be `None` at inference time.
)

data

TimeToEventAnalysisDataset(
    time_series=TimeSeriesSamples([3, *, 3]),
    static=StaticSamples([3, 3]),
    predictive=TimeToEventAnalysisTaskData(targets=EventSamples([3, 2]))
)
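
This dataset combines three dataframes with different shapes, and all of them report 3 samples. A quick pandas check that they describe the same sample IDs could look like the sketch below. It uses a trimmed copy of the example data (one feature per dataframe), and it makes no claim about what validation TemporAI itself performs.

```python
import pandas as pd

# Trimmed copies of the example dataframes: one feature each.
time_series_df = pd.DataFrame(
    {
        "sample_idx": ["sample_0", "sample_0", "sample_0", "sample_0", "sample_1", "sample_1", "sample_2"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "t_feat_0": [11, 12, 13, 14, 21, 22, 31],
    }
).set_index(["sample_idx", "time_idx"])
static_df = pd.DataFrame({"s_feat_0": [100, 200, 300]}, index=["sample_0", "sample_1", "sample_2"])
event_df = pd.DataFrame(
    {"e_feat_0": [(10, True), (12, False), (13, True)]},
    index=["sample_0", "sample_1", "sample_2"],
)

# Collect the sample IDs seen by each dataframe.
ts_ids = set(time_series_df.index.get_level_values("sample_idx"))
static_ids = set(static_df.index)
event_ids = set(event_df.index)

# All three should refer to exactly the same samples.
ids_match = ts_ids == static_ids == event_ids
print(ids_match)
```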

In [18]:
data.time_series

| sample_idx | time_idx | t_feat_0 | t_feat_1 | t_feat_2 |
|---|---|---|---|---|
| sample_0 | 1 | 11 | 1.1 | 10 |
| sample_0 | 2 | 12 | 1.2 | 20 |
| sample_0 | 3 | 13 | 1.3 | 30 |
| sample_0 | 4 | 14 | 1.4 | 40 |
| sample_1 | 2 | 21 | 2.1 | 11 |
| sample_1 | 4 | 22 | 2.2 | 21 |
| sample_2 | 9 | 31 | 3.1 | 111 |


In [19]:
data.static

| sample_idx | s_feat_0 | s_feat_1 | s_feat_2 |
|---|---|---|---|
| sample_0 | 100 | -1.1 | 0 |
| sample_1 | 200 | -1.2 | 1 |
| sample_2 | 300 | -1.3 | 0 |


In [20]:
data.predictive.targets

| sample_idx | e_feat_0 | e_feat_1 |
|---|---|---|
| sample_0 | (10, True) | (10, False) |
| sample_1 | (12, False) | (10, False) |
| sample_2 | (13, True) | (11, True) |


## `OneOffTreatmentEffectsDataset`

A `OneOffTreatmentEffectsDataset` contains time series and optionally static covariates.

It also needs `TimeSeriesSamples` prediction *targets* and `EventSamples` treatments
for estimators to be able to `fit` on this dataset.

It can be used with `treatments.one_off` estimators.
The task is to predict a time series counterfactual outcome based on a one-off treatment event.

In [21]:
# Initialize a OneOffTreatmentEffectsDataset:
data = dataset.OneOffTreatmentEffectsDataset(
    time_series=time_series_df.loc[:, :"t_feat_1"],
    static=static_df,  # Optional, can be `None`.
    targets=time_series_df.loc[:, ["t_feat_2"]],  # Optional, can be `None` at inference time.
    treatments=event_df.loc[:, ["e_feat_0"]],
)

data

OneOffTreatmentEffectsDataset(
    time_series=TimeSeriesSamples([3, *, 2]),
    static=StaticSamples([3, 3]),
    predictive=OneOffTreatmentEffectsTaskData(
        targets=TimeSeriesSamples([3, *, 1]),
        treatments=EventSamples([3, 1])
    )
)
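
To see how a one-off treatment event relates to the target time series, the sketch below aligns each sample's treatment time with its target rows using pandas only. It assumes that the first tuple element of `e_feat_0` is the treatment time, which the tutorial does not state explicitly. The `treat_time` and `before_treatment` columns are hypothetical helpers for this illustration.

```python
import pandas as pd

time_series_df = pd.DataFrame(
    {
        "sample_idx": ["sample_0", "sample_0", "sample_0", "sample_0", "sample_1", "sample_1", "sample_2"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "t_feat_2": [10, 20, 30, 40, 11, 21, 111],
    }
).set_index(["sample_idx", "time_idx"])
event_df = pd.DataFrame(
    {"e_feat_0": [(10, True), (12, False), (13, True)]},
    index=["sample_0", "sample_1", "sample_2"],
)

# Assumption: element 0 of each tuple is the treatment time.
treat_time = event_df["e_feat_0"].map(lambda pair: pair[0])

outcome = time_series_df[["t_feat_2"]].copy()
sample_ids = outcome.index.get_level_values("sample_idx")
times = outcome.index.get_level_values("time_idx")

# Broadcast each sample's treatment time onto its target rows.
outcome["treat_time"] = treat_time.reindex(sample_ids).to_numpy()
# Mark the rows observed strictly before the treatment time.
outcome["before_treatment"] = times.to_numpy() < outcome["treat_time"].to_numpy()

print(outcome)
```

In this toy data every observed timestep precedes its sample's treatment time, so `before_treatment` is `True` throughout.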

In [22]:
data.time_series

| sample_idx | time_idx | t_feat_0 | t_feat_1 |
|---|---|---|---|
| sample_0 | 1 | 11 | 1.1 |
| sample_0 | 2 | 12 | 1.2 |
| sample_0 | 3 | 13 | 1.3 |
| sample_0 | 4 | 14 | 1.4 |
| sample_1 | 2 | 21 | 2.1 |
| sample_1 | 4 | 22 | 2.2 |
| sample_2 | 9 | 31 | 3.1 |


In [23]:
data.static

| sample_idx | s_feat_0 | s_feat_1 | s_feat_2 |
|---|---|---|---|
| sample_0 | 100 | -1.1 | 0 |
| sample_1 | 200 | -1.2 | 1 |
| sample_2 | 300 | -1.3 | 0 |


In [24]:
data.predictive.targets

| sample_idx | time_idx | t_feat_2 |
|---|---|---|
| sample_0 | 1 | 10 |
| sample_0 | 2 | 20 |
| sample_0 | 3 | 30 |
| sample_0 | 4 | 40 |
| sample_1 | 2 | 11 |
| sample_1 | 4 | 21 |
| sample_2 | 9 | 111 |


In [25]:
data.predictive.treatments

| sample_idx | e_feat_0 |
|---|---|
| sample_0 | (10, True) |
| sample_1 | (12, False) |
| sample_2 | (13, True) |


## `TemporalTreatmentEffectsDataset`

A `TemporalTreatmentEffectsDataset` contains time series and optionally static covariates.

It also needs `TimeSeriesSamples` prediction *targets* and `TimeSeriesSamples` treatments
for estimators to be able to `fit` on this dataset.

It can be used with `treatments.temporal` estimators.
The task is to predict a time series counterfactual outcome based on a time series treatment.

In [26]:
# Initialize a TemporalTreatmentEffectsDataset:
data = dataset.TemporalTreatmentEffectsDataset(
    time_series=time_series_df.loc[:, :"t_feat_0"],
    static=static_df,  # Optional, can be `None`.
    targets=time_series_df.loc[:, ["t_feat_1"]],  # Optional, can be `None` at inference time.
    treatments=time_series_df.loc[:, ["t_feat_2"]],
)

data

TemporalTreatmentEffectsDataset(
    time_series=TimeSeriesSamples([3, *, 1]),
    static=StaticSamples([3, 3]),
    predictive=TemporalTreatmentEffectsTaskData(
        targets=TimeSeriesSamples([3, *, 1]),
        treatments=TimeSeriesSamples([3, *, 1])
    )
)
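
In this example the three time series features are split into three roles: covariates, targets, and treatments. A small pandas sketch can confirm that no column ends up in more than one role and that every column is used. Whether TemporAI itself rejects overlapping columns is not stated here, so treat this as an optional sanity check. The `roles` dictionary is a hypothetical helper.

```python
import pandas as pd

time_series_df = pd.DataFrame(
    {
        "sample_idx": ["sample_0", "sample_0", "sample_0", "sample_0", "sample_1", "sample_1", "sample_2"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "t_feat_0": [11, 12, 13, 14, 21, 22, 31],
        "t_feat_1": [1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 3.1],
        "t_feat_2": [10, 20, 30, 40, 11, 21, 111],
    }
).set_index(["sample_idx", "time_idx"])

# The same column split as in the cell above.
roles = {
    "time_series": time_series_df.loc[:, :"t_feat_0"],
    "targets": time_series_df.loc[:, ["t_feat_1"]],
    "treatments": time_series_df.loc[:, ["t_feat_2"]],
}

# Flatten the column names used across all roles.
used = [col for part in roles.values() for col in part.columns]

no_overlap = len(used) == len(set(used))  # no column plays two roles
covers_all = set(used) == set(time_series_df.columns)  # no column is left out
print(no_overlap, covers_all)
```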

In [27]:
data.time_series

| sample_idx | time_idx | t_feat_0 |
|---|---|---|
| sample_0 | 1 | 11 |
| sample_0 | 2 | 12 |
| sample_0 | 3 | 13 |
| sample_0 | 4 | 14 |
| sample_1 | 2 | 21 |
| sample_1 | 4 | 22 |
| sample_2 | 9 | 31 |


In [28]:
data.static

| sample_idx | s_feat_0 | s_feat_1 | s_feat_2 |
|---|---|---|---|
| sample_0 | 100 | -1.1 | 0 |
| sample_1 | 200 | -1.2 | 1 |
| sample_2 | 300 | -1.3 | 0 |


In [29]:
data.predictive.targets

| sample_idx | time_idx | t_feat_1 |
|---|---|---|
| sample_0 | 1 | 1.1 |
| sample_0 | 2 | 1.2 |
| sample_0 | 3 | 1.3 |
| sample_0 | 4 | 1.4 |
| sample_1 | 2 | 2.1 |
| sample_1 | 4 | 2.2 |
| sample_2 | 9 | 3.1 |


In [30]:
data.predictive.treatments

| sample_idx | time_idx | t_feat_2 |
|---|---|---|
| sample_0 | 1 | 10 |
| sample_0 | 2 | 20 |
| sample_0 | 3 | 30 |
| sample_0 | 4 | 40 |
| sample_1 | 2 | 11 |
| sample_1 | 4 | 21 |
| sample_2 | 9 | 111 |
