# Data Tutorial 01: Data Format

This tutorial shows a minimal example of the data format for TemporAI.

*Skip the cell below if you are not on Google Colab or already have TemporAI installed:*

In [None]:
%pip install temporai

# Or from the repo, for the latest version:
# %pip install git+https://github.com/vanderschaarlab/temporai.git

## Time series data

Time series data contains data samples (e.g. patients), with features that unfold sequentially over some number of timesteps.

Time series data should take the form of a `pandas.DataFrame`, with the following specifics:
* The index should be a 2-level multiindex, where level `0` represents the sample IDs, and level `1` represents the timesteps for each sample.
* The sample index can consist of either `int`s or `str`s (homogeneous, not a mix of these).
* The time index (timesteps) may be `int`, `float` or a `pandas.Timestamp`-compatible format (homogeneous, not a mix of these).
* The columns of the dataframe represent the features; column names must be `str`.
* Column (feature) values currently supported are: `bool`, `int`, `float`, or `pandas.Categorical` (homogeneous per column).

Other points to note:
* Sample IDs must be unique.
* (Sample ID, timestep) combination must be unique (a sample cannot have more than one of the same timestep).
* Null values such as `numpy.nan` are allowed and represent missing values.
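
The rules above can be checked with plain pandas before the data is handed to TemporAI. The following is a minimal sketch, not TemporAI's own validation: the helper name `check_time_series_format` and the example frames are hypothetical, and the check only covers the index structure, column names and column dtypes (it does not verify that the index levels are homogeneous).

```python
import pandas as pd
from pandas.api import types as pdt


def check_time_series_format(df: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if not isinstance(df.index, pd.MultiIndex) or df.index.nlevels != 2:
        problems.append("index must be a 2-level MultiIndex")
    elif df.index.duplicated().any():
        problems.append("(sample ID, timestep) pairs must be unique")
    if not all(isinstance(col, str) for col in df.columns):
        problems.append("column names must be str")
    for name, dtype in df.dtypes.items():
        supported = (
            pdt.is_bool_dtype(dtype)
            or pdt.is_integer_dtype(dtype)
            or pdt.is_float_dtype(dtype)
            or isinstance(dtype, pd.CategoricalDtype)
        )
        if not supported:
            problems.append(f"column {name!r} has unsupported dtype {dtype}")
    return problems


index = pd.MultiIndex.from_tuples(
    [("sample_0", 1), ("sample_0", 2), ("sample_1", 1)],
    names=["sample_idx", "time_idx"],
)
good = pd.DataFrame(
    {"t_feat_0": [11, 12, 21], "t_feat_2": pd.Categorical(["a", "b", "a"])},
    index=index,
)
# Same data, but with the categorical column replaced by plain strings.
bad = good.assign(t_feat_2=["a", "b", "a"])

print(check_time_series_format(good))
print(check_time_series_format(bad))
```

The first call should report no problems, while the second should flag `t_feat_2`, since plain string columns need to be converted to a categorical first.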

In [2]:
import pandas as pd
import numpy as np

from IPython.display import display

In [3]:
# Create a time series dataframe.

time_series_df = pd.DataFrame(
    {
        "sample_idx": ["sample_0", "sample_0", "sample_0", "sample_0", "sample_1", "sample_1", "sample_2"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "t_feat_0": [11, 12, 13, 14, 21, 22, 31],
        "t_feat_1": [1.1, 1.2, 1.3, np.nan, 2.1, 2.2, 3.1],
        "t_feat_2": ["a", "b", "b", "c", "a", "a", "c"],
    }
)

# Set the 2-level index:
time_series_df.set_index(keys=["sample_idx", "time_idx"], drop=True, inplace=True)

# "t_feat_2" needs to be converted to a categorical, as the `str` format is not supported.
time_series_df["t_feat_2"] = pd.Categorical(time_series_df["t_feat_2"])

# Preview the dataframe:
time_series_df.info()
time_series_df

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 7 entries, ('sample_0', 1) to ('sample_2', 9)
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   t_feat_0  7 non-null      int64   
 1   t_feat_1  6 non-null      float64 
 2   t_feat_2  7 non-null      category
dtypes: category(1), float64(1), int64(1)
memory usage: 725.0+ bytes


sample_idx,time_idx,t_feat_0,t_feat_1,t_feat_2
sample_0,1,11,1.1,a
sample_0,2,12,1.2,b
sample_0,3,13,1.3,b
sample_0,4,14,,c
sample_1,2,21,2.1,a
sample_1,4,22,2.2,a
sample_2,9,31,3.1,c


## Static data

Static data contains data samples (e.g. patients), with features that are not associated with a particular time.

Static data should take the form of a `pandas.DataFrame`, with the following specifics:
* The index represents sample IDs and is a (single level) index that can consist of `int`s or `str`s (homogeneous, not a mix of these).
* The columns of the dataframe represent the features; column names must be `str`.
* Column (feature) values currently supported are: `bool`, `int`, `float`, or `pandas.Categorical` (homogeneous per column).

Other points to note:
* Sample IDs must be unique.
* Null values such as `numpy.nan` are allowed and represent missing values.
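
As with time series features, string-valued static features need to be converted to a categorical before use. A small pandas-only sketch of that preparation step follows; the column `s_site` and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Raw static data with one plain string column (hypothetical `s_site`).
raw_static = pd.DataFrame(
    {
        "s_feat_0": [100, 200, 300],
        "s_site": ["north", "south", "north"],
    },
    index=["sample_0", "sample_1", "sample_2"],
)

static_ready = raw_static.copy()
# Convert the string column to a categorical column.
static_ready["s_site"] = static_ready["s_site"].astype("category")

# Sanity checks mirroring the rules above: unique sample IDs, str column names.
print(static_ready.index.is_unique)
print(all(isinstance(col, str) for col in static_ready.columns))
print(static_ready.dtypes)
```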

In [4]:
# Create a static data dataframe.

static_df = pd.DataFrame(
    {
        "s_feat_0": [100, 200, 300],
        "s_feat_1": [-1.1, np.nan, -1.3],
    },
    index=["sample_0", "sample_1", "sample_2"],
)

# Preview the dataframe:
static_df.info()
static_df

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, sample_0 to sample_2
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   s_feat_0  3 non-null      int64  
 1   s_feat_1  2 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 72.0+ bytes


,s_feat_0,s_feat_1
sample_0,100,-1.1
sample_1,200,
sample_2,300,-1.3


## Event data

Event data contains data samples (e.g. patients), with features that represent occurrence of an event at a certain time.
If the event did not occur, it is "censored".

Event data should take the form of a `pandas.DataFrame`, with the following specifics:
* The index represents sample IDs and is a (single level) index that can consist of `int`s or `str`s (homogeneous, not a mix of these).
* The columns of the dataframe represent the features; column names must be `str`.
* Column (feature) values must be of the form: `Tuple[<timestep>, bool]`.
    * The first element `<timestep>` may be `int`, `float` or a `pandas.Timestamp`-compatible format (homogeneous per column).
    * The second element `bool` indicates whether the event occurred at this time (`True`) or the event feature is censored (`False`).
    * In case of censoring, the timestep should indicate the last time information about the sample was available.

Other points to note:
* Sample IDs must be unique.
* Null values such as `numpy.nan` are not allowed; indicate an event as censored (did not occur) instead.
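
Raw event data often arrives as separate columns rather than as tuples. Below is a minimal pandas-only sketch of building the `(timestep, bool)` tuples described above. It assumes a hypothetical raw layout with an `event_time` column (null when the event did not occur) and a `last_seen` column (the last time information about the sample was available); neither name comes from TemporAI.

```python
import numpy as np
import pandas as pd

# Hypothetical raw layout: NaN in `event_time` means the event was not observed.
raw = pd.DataFrame(
    {
        "event_time": [10.0, np.nan, 13.0],
        "last_seen": [15.0, 12.0, 20.0],
    },
    index=["sample_0", "sample_1", "sample_2"],
)

# True where the event occurred, False where it is censored.
occurred = raw["event_time"].notna()
# Use the event time where available, otherwise the last observation time.
times = raw["event_time"].where(occurred, raw["last_seen"])

event_df = pd.DataFrame(
    {"e_feat_0": list(zip(times.tolist(), occurred.tolist()))},
    index=raw.index,
)
print(event_df)
```

The resulting column contains no null values: the sample without an observed event is recorded as censored at its `last_seen` time.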

In [5]:
# Create an event dataframe.

event_df = pd.DataFrame(
    {
        "e_feat_0": [(10, True), (12, False), (13, True)],
        "e_feat_1": [(10, False), (10, False), (11, True)],
    },
    index=["sample_0", "sample_1", "sample_2"],
)

# Preview the dataframe:
event_df.info()
event_df

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, sample_0 to sample_2
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   e_feat_0  3 non-null      object
 1   e_feat_1  3 non-null      object
dtypes: object(2)
memory usage: 72.0+ bytes


,e_feat_0,e_feat_1
sample_0,"(10, True)","(10, False)"
sample_1,"(12, False)","(10, False)"
sample_2,"(13, True)","(11, True)"


The data can also be initialised from a 2D `numpy` array (static, event) or a 3D `numpy` array (time series).
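
To illustrate how a 3D array relates to the time series dataframe format described earlier, here is a pandas-only sketch of the correspondence. It is not the TemporAI initialisation API itself. It assumes the array axes are ordered (samples, timesteps, features) and that every sample has the same number of timesteps; the integer IDs and the `t_feat_*` column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy array with 2 samples, 3 timesteps, 2 features.
arr = np.arange(12, dtype=float).reshape(2, 3, 2)
n_samples, n_timesteps, n_features = arr.shape

# One (sample, timestep) index entry per row of the flattened array.
index = pd.MultiIndex.from_product(
    [range(n_samples), range(n_timesteps)],
    names=["sample_idx", "time_idx"],
)
df = pd.DataFrame(
    arr.reshape(n_samples * n_timesteps, n_features),
    index=index,
    columns=[f"t_feat_{i}" for i in range(n_features)],
)
print(df)
```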

`TODO: more info`

## Dataset

The collection of data that represents the task at hand is a `Dataset`.

A `Dataset` contains:
* Time series data (covariates),
* Static data (covariates), optional,
* Predictive data, which depends on the *task*.
    * This may contain *targets* and *treatments*.

For example, for the *time-to-event analysis task* we create a dataset as follows.

In [6]:
from tempor.data.dataset import TimeToEventAnalysisDataset

# Create a dataset for the time-to-event analysis task:
data = TimeToEventAnalysisDataset(
    time_series=time_series_df,
    static=static_df,
    targets=event_df,
)

# Preview dataset:
data

TimeToEventAnalysisDataset(
    time_series=TimeSeriesSamples([3, *, 3]),
    static=StaticSamples([3, 2]),
    predictive=TimeToEventAnalysisTaskData(targets=EventSamples([3, 2]))
)

In [7]:
data.time_series

sample_idx,time_idx,t_feat_0,t_feat_1,t_feat_2
sample_0,1,11,1.1,a
sample_0,2,12,1.2,b
sample_0,3,13,1.3,b
sample_0,4,14,,c
sample_1,2,21,2.1,a
sample_1,4,22,2.2,a
sample_2,9,31,3.1,c


In [8]:
data.static

sample_idx,s_feat_0,s_feat_1
sample_0,100,-1.1
sample_1,200,
sample_2,300,-1.3


In [9]:
data.predictive.targets

sample_idx,e_feat_0,e_feat_1
sample_0,"(10, True)","(10, False)"
sample_1,"(12, False)","(10, False)"
sample_2,"(13, True)","(11, True)"


## Useful methods

The data classes (`{TimeSeries,Event,Static}Samples`) provide a number of useful methods. Some examples are shown below.

### Examples for `TimeSeriesSamples`

In [10]:
time_series = data.time_series

In [11]:
# Return time series data as a dataframe:

time_series.dataframe()

sample_idx,time_idx,t_feat_0,t_feat_1,t_feat_2
sample_0,1,11,1.1,a
sample_0,2,12,1.2,b
sample_0,3,13,1.3,b
sample_0,4,14,,c
sample_1,2,21,2.1,a
sample_1,4,22,2.2,a
sample_2,9,31,3.1,c


In [12]:
# Return time series data as a numpy array:

time_series.numpy(padding_indicator=-999.0)

array([[[11, 1.1, 'a'],
        [12, 1.2, 'b'],
        [13, 1.3, 'b'],
        [14, nan, 'c']],

       [[21, 2.1, 'a'],
        [22, 2.2, 'a'],
        [-999.0, -999.0, -999.0],
        [-999.0, -999.0, -999.0]],

       [[31, 3.1, 'c'],
        [-999.0, -999.0, -999.0],
        [-999.0, -999.0, -999.0],
        [-999.0, -999.0, -999.0]]], dtype=object)
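
In the padded array above, samples with fewer timesteps are filled up to the longest sample with the padding indicator. The following numpy-only sketch shows one way to recover a padding mask and the per-sample lengths from such an array. It assumes, as the output above suggests, that padding follows each sample's last real timestep and fills every feature with the indicator; the array is a hand-written numeric stand-in (the categorical feature is left out), not TemporAI output.

```python
import numpy as np

PAD = -999.0  # same value as passed to `padding_indicator` above

# Hand-written stand-in with shape (samples, timesteps, features).
padded = np.array(
    [
        [[11, 1.1], [12, 1.2], [13, 1.3], [14, np.nan]],
        [[21, 2.1], [22, 2.2], [PAD, PAD], [PAD, PAD]],
        [[31, 3.1], [PAD, PAD], [PAD, PAD], [PAD, PAD]],
    ]
)

# A timestep counts as padding only if every feature equals the indicator.
# Genuine missing values (NaN) never compare equal to PAD, so they are kept.
is_padding = (padded == PAD).all(axis=-1)
lengths = (~is_padding).sum(axis=1)
print(lengths.tolist())
```

Choosing an indicator value that cannot occur in the real data keeps this mask unambiguous.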

In [13]:
# Return the time series data as a list of dataframes:

time_series.list_of_dataframes()

[                     t_feat_0  t_feat_1 t_feat_2
 sample_idx time_idx                             
 sample_0   1               11       1.1        a
            2               12       1.2        b
            3               13       1.3        b
            4               14       NaN        c,
                      t_feat_0  t_feat_1 t_feat_2
 sample_idx time_idx                             
 sample_1   2               21       2.1        a
            4               22       2.2        a,
                      t_feat_0  t_feat_1 t_feat_2
 sample_idx time_idx                             
 sample_2   9               31       3.1        c]

In [14]:
# Show number of features and samples:

print("num_features:", time_series.num_features)
print("num_samples:", time_series.num_samples)

num_features: 3
num_samples: 3


In [15]:
# Show number of timesteps for each sample:

print("timesteps per sample:", time_series.num_timesteps())

timesteps per sample: [4, 2, 1]


### Examples for `{Static,Event}Samples`

In [16]:
assert data.static is not None
assert data.predictive.targets is not None

static = data.static
event = data.predictive.targets

In [17]:
# Return the static data as a dataframe and as a numpy array:

display(static.dataframe())
display(static.numpy())

sample_idx,s_feat_0,s_feat_1
sample_0,100,-1.1
sample_1,200,
sample_2,300,-1.3


array([[100. ,  -1.1],
       [200. ,   nan],
       [300. ,  -1.3]])

In [18]:
# Return the event data as a dataframe and as a numpy array:

display(event.dataframe())
display(event.numpy())

sample_idx,e_feat_0,e_feat_1
sample_0,"(10, True)","(10, False)"
sample_1,"(12, False)","(10, False)"
sample_2,"(13, True)","(11, True)"


array([[(10, True), (10, False)],
       [(12, False), (10, False)],
       [(13, True), (11, True)]], dtype=object)

## 🎉 Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement towards *Machine learning and AI for Medicine*, you can do so in the following ways!



### ⭐ Star [TemporAI](https://github.com/vanderschaarlab/temporai) on GitHub

- The easiest way to help our community is by just starring the repos! This helps raise awareness of the tools we're building.



### Check out other projects from [vanderschaarlab](https://github.com/vanderschaarlab)
- 📝 [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- 📊 [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
- 🤖 [SynthCity](https://github.com/vanderschaarlab/synthcity)
 