# YData SDK | Tabular synthetic data generation quickstart

#### Install ydata-sdk

In [3]:
!pip install ydata-sdk

Collecting ydata-sdk
  Downloading ydata_sdk-1.0.1-py310-none-any.whl.metadata (7.4 kB)
Collecting httpx==0.23.3 (from ydata-sdk)
  Downloading httpx-0.23.3-py3-none-any.whl.metadata (7.1 kB)
Collecting ydata-core>=0.2.0 (from ydata-sdk)
  Downloading ydata_core-0.7.0-py2.py3-none-any.whl.metadata (1.5 kB)
Collecting prettytable==3.6.0 (from ydata-sdk)
  Downloading prettytable-3.6.0-py3-none-any.whl.metadata (25 kB)
Collecting pydantic==1.10.9 (from ydata-sdk)
  Downloading pydantic-1.10.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (147 kB)
Collecting typeguard==2.13.3 (from ydata-sdk)
  Downloading typeguard-2.13.3-py3-none-any.whl.metadata (3.6 kB)
Collecting ydata-datascience (from ydata-sdk)
  Downloading ydata_datascience-0.7.0-py2.py3-none-any.whl.metadata (1.5 kB)
Collecting httpcore<0.17.0,>=0.15.0 (from httpx==0.23.3->ydata-s

#### Authentication

In [4]:
import os

# Set your YData token as an environment variable before using the SDK
os.environ["YDATA_TOKEN"] = '{insert-token}'
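
Hardcoding a token in a notebook cell makes it easy to leak when the notebook is shared. As an alternative, here is a minimal sketch that reads `YDATA_TOKEN` from the environment and prompts for it only when it is missing or still set to the placeholder. `ensure_token` is a hypothetical helper for illustration, not part of ydata-sdk.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_token(env_var: str = "YDATA_TOKEN",
                 prompt: Callable[[str], str] = getpass) -> str:
    """Return the token from the environment, prompting once if it is missing.

    Hypothetical helper for illustration; it is not part of ydata-sdk.
    """
    token = os.environ.get(env_var, "").strip()
    # Treat the notebook placeholder as "not set".
    if not token or token == "{insert-token}":
        token = prompt(f"Enter value for {env_var}: ").strip()
        if not token:
            raise ValueError(f"{env_var} is empty")
        # Store it so later cells in the same session can read it.
        os.environ[env_var] = token
    return token
```

Calling `ensure_token()` at the top of the notebook keeps the token out of the saved cell contents while leaving the rest of the cells unchanged.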

#### Loading an example tabular dataset

In [2]:
# Load the Titanic example dataset
from ydata.sdk.dataset import get_dataset

dataset = get_dataset('titanic')

print(dataset.head())

   entity_id  time  feature_0  feature_1  feature_2
0          0     0   0.463602   0.060326   0.244735
1          0     1   0.254560   0.226478   0.160453
2          0     2   0.451870   0.656529   0.817230
3          0     3   0.688637   0.952739   0.041068
4          0     4   0.954968   0.079913   0.770322


## Train a Synthetic data generator

### From a pandas dataframe

In [6]:
from ydata.sdk.synthesizers import RegularSynthesizer

# Initialize a regular (tabular) synthesizer
# Until `fit` is called, the synthesizer exists only locally
synth = RegularSynthesizer(name='Titanic synth')

# Train the synthesizer on the dataset
synth.fit(dataset)

### From an existing Datasource

If Datasources are already available in the YData Fabric Data Catalog, you can also use them to train a synthetic data generator with **ydata-sdk**.
- **List your available datasources:** Call `DataSource.list()`. It returns key information such as each datasource's name, creation date, and UID/ID.
- **Access the data:** `DataSource.get()` returns the data from the datasource as a pandas dataframe.

In [20]:
# List the available datasources
from ydata.sdk.datasources import DataSource

DataSource.list()

In [None]:
from ydata.sdk.datasources import DataSource

dataset = DataSource.get('<DATASOURCE_UID>')

# Initialize a regular (tabular) synthesizer
# Until `fit` is called, the synthesizer exists only locally
synth = RegularSynthesizer(name='Titanic synth - from datasource')

# Train the synthesizer on the datasource
synth.fit(dataset)

### Generate samples from an already trained synthesizer

#### From the synthesizer in context in the notebook

In [None]:
# Generate a synthetic sample with 1,000 rows
sample = synth.sample(1000)

sample.head()

#### From a previously trained synthetic data generation model

Previously trained synthetic data generation models can be retrieved with the `get` method by passing the synthesizer's unique identifier (UID/ID). This lets you reuse existing models without retraining.

**Get the details of trained synths:** Call `RegularSynthesizer.list()` to list the previously trained synthetic data generators.

**Retrieve the synthesizer:** Call `RegularSynthesizer(uid='your_synth_uid').get()` with the desired synthesizer's UID/ID.

**Generate data:** Use the retrieved synthesizer's `sample()` method to generate new synthetic data.

In [None]:
# List the trained synthetic data generators to find the synthesizer's UID
RegularSynthesizer.list()

In [None]:
synth = RegularSynthesizer(uid='{insert-synth-id}').get()

# Generate a new synthetic dataset with the sample method
sample = synth.sample(10000)

sample.head()
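
The retrieve-then-sample steps above can be wrapped in a small reusable helper. This is a minimal sketch that assumes the `RegularSynthesizer(uid=...).get()` and `sample(n)` call pattern used in the cells above. `sample_from_trained` is a hypothetical name, and the synthesizer class is passed in as an argument so the helper can be exercised without the SDK or network access.

```python
from typing import Any


def sample_from_trained(synth_cls: Any, uid: str, n_rows: int) -> Any:
    """Fetch a trained synthesizer by UID and draw `n_rows` synthetic rows.

    Hypothetical helper: `synth_cls` is assumed to behave like the
    `RegularSynthesizer` usage in this notebook, i.e. `synth_cls(uid=...).get()`
    returns an object exposing a `sample(n)` method.
    """
    # Catch an unreplaced notebook placeholder before any request is made.
    if not uid or uid.startswith("{") or uid.startswith("<"):
        raise ValueError("Replace the placeholder with a real synthesizer UID")
    if n_rows <= 0:
        raise ValueError("n_rows must be a positive integer")
    synth = synth_cls(uid=uid).get()
    return synth.sample(n_rows)
```

In the notebook this would be called as `sample_from_trained(RegularSynthesizer, 'your-synth-uid', 10000)`, which fails fast if the UID placeholder was left in place.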