# Synthetic data generation from Time-series - Single Entity

Time-series data is data with a temporal dependency, meaning that the order of the records matters. It can include both categorical and numerical variables, and it appears in a wide range of use cases, from healthcare to financial services.

YData Fabric offers an easy-to-use and familiar interface through the SDK to support Time-Series Synthesis. With the SDK and a few lines of code, users can replicate not only the general statistics of a dataset but also the temporal properties such as seasonality, periods and trends.

The [Air Quality](https://www.kaggle.com/datasets/sid321axn/beijing-multisite-airquality-data-set) dataset is used here to demonstrate Fabric's time-series synthesis capabilities and interface. This example uses the data from a single site.
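Isolating a single site can be done with pandas before the data is uploaded. The following is a minimal sketch, assuming the records from several stations have been loaded into one DataFrame, that the timestamp is stored as separate `year`, `month`, `day` and `hour` columns, and that the site is identified by a `station` column. The rows are made up for illustration and `select_single_site` is a hypothetical helper, not part of the SDK.

```python
import pandas as pd

# Made-up rows shaped like the assumed raw layout (not real measurements).
raw = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [3, 3, 3, 3],
    "day": [1, 1, 1, 1],
    "hour": [0, 1, 0, 1],
    "PM2.5": [4.0, 8.0, 6.0, 7.0],
    "station": ["Aotizhongxin", "Aotizhongxin", "Changping", "Changping"],
})


def select_single_site(df: pd.DataFrame, station: str) -> pd.DataFrame:
    """Keep one station and merge the date parts into a 'Datetime' column."""
    site = df[df["station"] == station].copy()
    parts = ["year", "month", "day", "hour"]
    # pandas can assemble timestamps from columns named year/month/day/hour.
    site["Datetime"] = pd.to_datetime(site[parts])
    site = site.drop(columns=parts)
    return site.sort_values("Datetime").reset_index(drop=True)


single_site = select_single_site(raw, "Aotizhongxin")
print(single_site)
```

The result has one row per timestamp for the chosen station, which is the single-entity shape used in the rest of this example.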

## Getting your data from the Data Catalog

In this example we have created a new datasource from Google Cloud Storage and [created a Dataset in Fabric Data Catalog](https://docs.sdk.ydata.ai/latest/get-started/upload_csv/). Copy the required code snippet by clicking the "Explore in Labs" button on the dataset detail page, as shown in the image below.

![explore_in_labs.png](img/time_series_explore_in_lab.png)

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='insert-datasource-id',
                             namespace='insert-project-id')
dataset = datasource.dataset

In [2]:
# Update the dataset data types: the Datetime column holds datetime values
dataset.astype('Datetime', 'datetime')

In [3]:
print(dataset)

Dataset

Shape: (35064, 15)
Schema:
      Column Variable type
0         No           int
1      PM2.5         float
2       PM10         float
3        SO2         float
4        NO2         float
5         CO         float
6         O3         float
7       TEMP         float
8       PRES         float
9       DEWP         float
10      RAIN         float
11        wd        string
12      WSPM         float
13   station        string
14  Datetime      datetime




### Configure the Metadata for synthesis

Time-series synthesis relies on two time-series-specific Metadata attributes. *sortbykey* is required: it names the column that defines the temporal order of the data, and that column must be either an integer or a date. *entities* is only required when the dataset contains time trajectories that refer to more than 1 entity (e.g. patients, stores, stations, meters). This example has a single entity, so only *sortbykey* is set.

In [5]:
from ydata.metadata import Metadata

dataset_attrs = {
    'sortbykey': 'Datetime',
}

metadata = Metadata(dataset, dataset_type='timeseries', dataset_attrs=dataset_attrs)

  warn("Datasets other than Timeseries don't make use of dataset_attrs")
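The constraints on *sortbykey* and *entities* can be checked on a pandas DataFrame before the Metadata is built. The sketch below is one way to do it, under stated assumptions: `check_timeseries_attrs` is a hypothetical helper that is not part of the YData SDK, and the toy frame only stands in for the real data.

```python
import pandas as pd
from pandas.api import types as ptypes


def check_timeseries_attrs(df, sortbykey, entities=None):
    """Hypothetical pre-check (not an SDK function) for time-series attributes."""
    column = df[sortbykey]
    # The sort key must be an integer or a datetime column.
    if not (ptypes.is_integer_dtype(column)
            or ptypes.is_datetime64_any_dtype(column)):
        raise TypeError(f"'{sortbykey}' must be an integer or a datetime column")
    entities = list(entities or [])
    # Count distinct entity keys; without entity columns there is one series.
    n_entities = len(df[entities].drop_duplicates()) if entities else 1
    return {"sortbykey": sortbykey, "n_entities": n_entities}


# Toy stand-in for the single-site data (values are made up).
toy = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["2013-03-01 00:00", "2013-03-01 01:00", "2013-03-01 02:00"]
    ),
    "PM2.5": [4.0, 8.0, 7.0],
    "station": ["Aotizhongxin"] * 3,
})

report = check_timeseries_attrs(toy, "Datetime", entities=["station"])
print(report)
```

When `n_entities` is 1, as it is here, setting only *sortbykey* is enough. A value above 1 indicates that the *entities* attribute is needed.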


## Train & Generate synthetic data samples

In [6]:
from ydata.synthesizers import TimeSeriesSynthesizer

synth = TimeSeriesSynthesizer()
synth.fit(dataset, metadata=metadata)

INFO: 2024-02-28 15:43:39,918 [SYNTHESIZER] - Initializing Time Series SYNTHESIZER.
INFO: 2024-02-28 15:43:39,919 [SYNTHESIZER] - Number columns considered for synth: 15
INFO: 2024-02-28 15:43:50,509 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2024-02-28 15:43:50,516 [SYNTHESIZER] - Preprocess segment
INFO: 2024-02-28 15:43:50,526 [SYNTHESIZER] - Synthesizer init.
INFO: 2024-02-28 15:43:50,527 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.


<ydata.synthesizers.timeseries.model.TimeSeriesSynthesizer at 0x7f54c03eb850>

### Generating synthetic samples

Unlike the RegularSynthesizer, which generates samples where each row is independent, time-series data has rows that depend on one another. For that reason, sampling is done based on the number of entities that you want to generate.
This means that each generated series has the same trajectory size and covers the same time period as the original data, but the number of entities generated may vary.
In this case, because our original data has only 1 entity, we generate 1 synthetic entity.

In [9]:
synth_sample = synth.sample(n_entities=1)

INFO: 2024-02-28 15:44:44,411 [SYNTHESIZER] - Start generating model samples.
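A quick sanity check is to compare the trajectory size and time period of the synthetic sample with those of the original data. The sketch below assumes both have been converted to pandas DataFrames (for example with `to_pandas()`). The two toy frames stand in for them, and `trajectory_summary` is a hypothetical helper rather than an SDK function.

```python
import pandas as pd


def trajectory_summary(df, sortbykey):
    """Return the number of rows and the time span of a single-entity series."""
    ordered = df.sort_values(sortbykey)
    return {
        "rows": len(ordered),
        "start": ordered[sortbykey].iloc[0],
        "end": ordered[sortbykey].iloc[-1],
    }


# Toy stand-ins for the original and the synthetic data (values are made up).
timestamps = pd.date_range("2013-03-01", periods=4, freq=pd.Timedelta(hours=1))
real = pd.DataFrame({"Datetime": timestamps, "PM2.5": [4.0, 8.0, 7.0, 6.0]})
synthetic = pd.DataFrame({"Datetime": timestamps, "PM2.5": [5.1, 7.4, 6.9, 6.2]})

real_summary = trajectory_summary(real, "Datetime")
synth_summary = trajectory_summary(synthetic, "Datetime")
print(real_summary)
print(synth_summary)
```

If the row counts or the start and end timestamps differ noticeably, the sort key or the entity configuration is worth reviewing.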


## Saving the data

### Writing as a CSV

In [10]:
synth_sample.to_pandas().to_csv('synthetic_sample.csv')
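When the CSV is read back with pandas, the datetime column is parsed as text unless it is listed in `parse_dates`. The sketch below shows a round trip on a toy frame that stands in for `synth_sample.to_pandas()`. It passes `index=False`, which differs from the call above: by default `to_csv` also writes the row index as an unnamed first column.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for synth_sample.to_pandas() (values are made up).
sample = pd.DataFrame({
    "Datetime": pd.to_datetime(["2013-03-01 00:00", "2013-03-01 01:00"]),
    "PM2.5": [5.1, 7.4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "synthetic_sample.csv")
    # index=False keeps the row index out of the file.
    sample.to_csv(path, index=False)
    # parse_dates restores the datetime type instead of plain strings.
    restored = pd.read_csv(path, parse_dates=["Datetime"])

print(restored.dtypes)
```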
