# Importing time series data

Time series data is a sequence of data points collected over time, such as daily humidity levels, the weekly number of t-shirt orders or tons of grain harvested per month. Time series forecasting is used in many different fields, including economics, medicine, engineering, and climatology.

synthcity supports different forms of tabular data, including time series data. To work with your own data in synthcity, you need to make it compatible with the `TimeSeriesDataLoader`. This tutorial covers the basics of how to do that.

In [None]:
!pip install synthcity
!pip uninstall -y torchaudio torchdata

In [None]:
# stdlib
import sys
import warnings

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import TimeSeriesDataLoader

log.add(sink=sys.stderr, level="INFO")
warnings.filterwarnings("ignore")

We simulate our data as two dataframes: one containing static data (e.g., age, sex) and one containing temporal data (e.g., body temperature over time).

In [None]:
# stdlib
import datetime
import uuid

# third party
# import libraries for generating simulated data
import numpy as np
import pandas as pd

# set the number of individuals and observations per individual you want to generate
num_subj = 200
num_obs = 10

# generate static data
ids = [uuid.uuid4().hex[:6].upper() for i in range(num_subj)]
static_data = pd.DataFrame(
    {
        "id": ids,
        "var_a": np.random.randint(2, size=(num_subj)),
        "var_b": np.random.normal(loc=2, scale=0.5, size=(num_subj)),
        "outcome": np.random.binomial(1, 0.7, size=(num_subj)),
    }
)

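# generate temporal data in long format: one row per (subject, timepoint) pair
temp_len = num_obs * len(ids)
# repeat each id num_obs times in a row so that every subject is paired with
# timepoints 0..num_obs-1 (tiling the whole id list would misalign the pairs)
temp_ids = [subj_id for subj_id in ids for _ in range(num_obs)]
timepoints = list(range(num_obs)) * num_subj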

temporal_data = pd.DataFrame(
    {
        "id": temp_ids,
        "temp_a": np.random.normal(loc=0, scale=0.2, size=(temp_len)),
        "temp_b": np.random.normal(loc=5, scale=1, size=(temp_len)),
        "temp_c": np.random.binomial(1, 0.5, size=(temp_len)),
        "timepoint": timepoints,
    }
)

Now there are two dataframes: a static and a temporal one. To understand how the data needs to be rearranged, you need to understand how it is imported by the `TimeSeriesDataLoader`. The `TimeSeriesDataLoader` takes four inputs: `temporal_data`, `observation_times`, `static_data` and `outcome`.

- `temporal_data`: a list of dataframes, one per subject. Each dataframe contains a set of observations/measurements for that subject. The index of the dataframes can be anything.
- `observation_times`: a list of arrays, one per subject, that maps directly to the index of the corresponding dataframe in `temporal_data`. It records when each measurement was taken.
- `static_data`: a dataframe of static features for each subject, such as gender or city.
- `outcome`: a dataframe that can hold any kind of target: labels, a regression outcome, forecasting targets, etc.

It is important to note that `temporal_data`, `observation_times`, `static_data` and `outcome` must all have the same length, with one entry (or row) per subject.
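
The length requirement above can be checked before handing anything to the loader. The following is a minimal sketch using only pandas; `check_time_series_inputs` is a hypothetical helper written for this tutorial, not part of synthcity, and the toy values are made up for illustration.

```python
import pandas as pd


def check_time_series_inputs(temporal_data, observation_times, static_data, outcome):
    """Return a list of problems; an empty list means the shapes line up.

    Hypothetical helper for illustration; it is not part of synthcity.
    """
    problems = []
    n_subjects = len(temporal_data)

    # All four inputs should describe the same number of subjects.
    for name, obj in [
        ("observation_times", observation_times),
        ("static_data", static_data),
        ("outcome", outcome),
    ]:
        if len(obj) != n_subjects:
            problems.append(f"{name} has {len(obj)} entries, expected {n_subjects}")

    # Each subject needs one observation time per row of temporal data.
    for i, (frame, times) in enumerate(zip(temporal_data, observation_times)):
        if len(frame) != len(times):
            problems.append(
                f"subject {i}: {len(frame)} rows but {len(times)} observation times"
            )
    return problems


# Toy data: two subjects with three measurements each (illustrative values).
toy_temporal = [
    pd.DataFrame({"temp_a": [0.1, 0.2, 0.3]}, index=[0, 1, 2]),
    pd.DataFrame({"temp_a": [0.6, 0.5, 0.4]}, index=[0, 1, 2]),
]
toy_times = [[0, 1, 2], [0, 1, 2]]
toy_static = pd.DataFrame({"var_a": [0, 1]})
toy_outcome = pd.DataFrame({"outcome": [1, 0]})

ok_problems = check_time_series_inputs(toy_temporal, toy_times, toy_static, toy_outcome)
# Dropping one subject's observation times should be reported.
bad_problems = check_time_series_inputs(toy_temporal, toy_times[:1], toy_static, toy_outcome)
print(ok_problems)
print(bad_problems)
```

Returning a list of messages rather than raising on the first mismatch makes it easier to see every misaligned input at once.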

Knowing this, we can rearrange our data.

In [None]:
# rearrange static data
outcome_data = static_data[["outcome"]]
static_data = static_data.drop(columns=["outcome"])

# rearrange temporal data: one dataframe and one list of observation times per subject
observation_data, temporal_dataframes = [], []
for subj_id in static_data["id"].unique():
    # select the rows belonging to this subject
    temp_df = temporal_data[temporal_data["id"] == subj_id]
    observations = temp_df["timepoint"].tolist()
    # index by timepoint and drop the id column; both calls return a new dataframe
    temp_df = temp_df.set_index("timepoint").drop(columns=["id"])
    # add each to list
    observation_data.append(observations)
    temporal_dataframes.append(temp_df)

# instantiate time series data loader
loader = TimeSeriesDataLoader(
    temporal_data=temporal_dataframes,
    observation_times=observation_data,
    static_data=static_data,
    outcome=outcome_data,
)
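
The per-subject split in the cell above can also be written with `groupby`. This is a sketch on a small toy table, assuming the same long format (an `id` column, a `timepoint` column and one or more measurement columns); the `toy_*` names and values are illustrative only.

```python
import pandas as pd

# Toy long-format table: two subjects, three timepoints each (illustrative values).
toy_long = pd.DataFrame(
    {
        "id": ["A", "A", "A", "B", "B", "B"],
        "timepoint": [0, 1, 2, 0, 1, 2],
        "temp_a": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    }
)

toy_frames, toy_times = [], []
# sort=False keeps subjects in order of first appearance in the table.
for subj_id, group in toy_long.groupby("id", sort=False):
    # order each subject's rows by time before splitting them off
    group = group.sort_values("timepoint")
    toy_times.append(group["timepoint"].tolist())
    toy_frames.append(group.set_index("timepoint").drop(columns=["id"]))

print(toy_times)
print(toy_frames[1])
```

With this approach the order of the resulting lists follows the temporal table, so it is worth confirming that it matches the row order of the static and outcome dataframes. Iterating over the ids of the static dataframe, as in the cell above, guarantees that alignment by construction.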

Now that we have rearranged our data and loaded it with the `TimeSeriesDataLoader`, we can train our model.

In [None]:
syn_model = Plugins().get("timegan")

syn_model.fit(loader)

Next, we generate synthetic data from the trained model.

In [None]:
syn_model.generate(count=10).dataframe()

In [None]:
# synthcity absolute
from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [
        (f"test_{model}", model, {})
        for model in ["timegan"]
    ],
    loader,
    synthetic_size=1000,
    repeats=2,
    task_type="time_series",  # time_series_survival or time_series
)

In [None]:
Benchmarks.print(score)

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, you can do so in the following ways!

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is just by starring the repos! This helps raise awareness of the tools we're building.

### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)