# YData Synthetic data generation Holdout

The holdout dataset is a subset of the original data reserved during data processing to evaluate the performance of machine learning models. It plays a pivotal role in synthetic data generation by acting as an independent benchmark to assess the quality and generalizability of the synthetic data.

In YData Fabric's synthetic data generation process, 20% of the provided dataset is automatically set aside by default as the holdout set. This holdout set is essential not only to mitigate risks of overfitting and bias but also to ensure that the synthetic data can effectively generalize to unseen data.
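To make the idea concrete, here is a minimal sketch of a plain random holdout split over row indices, using only the Python standard library. It is not how YData Fabric implements its holdout internally. `random_holdout_split` and its parameters are hypothetical names chosen for illustration.

```python
import random


def random_holdout_split(n_rows, holdout_size=0.2, seed=0):
    """Split row indices into disjoint train and holdout lists.

    Hypothetical helper for illustration only; not part of the YData SDK.
    """
    if not 0.0 <= holdout_size < 1.0:
        raise ValueError("holdout_size must be in [0, 1)")
    indices = list(range(n_rows))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(indices)
    n_holdout = int(n_rows * holdout_size)  # truncates toward zero
    holdout_idx = sorted(indices[:n_holdout])
    train_idx = sorted(indices[n_holdout:])
    return train_idx, holdout_idx


train_idx, holdout_idx = random_holdout_split(n_rows=1000, holdout_size=0.2)
print(len(train_idx), len(holdout_idx))
```

Because the holdout rows are never seen during training, any metric computed on them measures generalization rather than memorization.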

For more detailed information about the holdout set and the synthetic data generation workflow, refer to this whitepaper.

We will use the Adult Census Income dataset to demonstrate how to create and resize a holdout set. This dataset is a collection of 1994 census data, mainly used for prediction tasks where the goal is to identify whether a person makes over 50K a year (https://archive.ics.uci.edu/ml/datasets/adult). Each person is described by 14 features covering personal information, including sensitive attributes such as race and sex.

## Read the data

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='37be2188-7938-4446-afba-839e5111eb97', namespace='17731ef6-bcc3-4aad-899c-ef507e9ea704')
dataset = datasource.dataset
# Getting the calculated Metadata to get the profile overview information in the labs
metadata = datasource.metadata
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 15
Number of rows: 32561
Duplicate rows: 0
Target column:

Column detail:
            Column    Data type Variable type Characteristics
0              age    numerical           int                
1        workclass  categorical        string                
2           fnlwgt    numerical           int                
3        education  categorical        string                
4    education-num  categorical           int                
5   marital-status  categorical        string                
6       occupation  categorical        string                
7     relationship  categorical        string                
8             race  categorical        string                
9              sex  categorical        string                
10    capital-gain    numerical           int                
11    capital-loss    numerical   

## Configuring the Synthetic data generation holdout set

In [3]:
from ydata.synthesizers import RegularSynthesizer

synth = RegularSynthesizer()
synth.fit(X=dataset, metadata=metadata, holdout_size=0.3) 
# By default, the holdout set is 20% of the dataset.
# Depending on the size and characteristics of the dataset, it may be worth adjusting this value.

INFO: 2024-08-16 11:33:57,886 [SYNTHESIZER] - Number columns considered for synth: 15
INFO: 2024-08-16 11:34:05,675 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2024-08-16 11:34:05,680 [SYNTHESIZER] - Preprocess segment
INFO: 2024-08-16 11:34:05,689 [SYNTHESIZER] - Synthesizer init.
INFO: 2024-08-16 11:34:05,690 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.


<ydata.synthesizers.regular.model.RegularSynthesizer at 0x7ff6781feb30>
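When picking a value for `holdout_size`, it can help to see how many rows each choice leaves for training. The sketch below does that arithmetic for the 32561 rows reported in the metadata summary. `expected_split_sizes` is a hypothetical helper, and the rounding rule is an assumption; the exact counts the library produces may differ slightly.

```python
N_ROWS = 32561  # number of rows reported in the metadata summary


def expected_split_sizes(n_rows, holdout_size):
    """Return (n_train, n_holdout) for a given holdout fraction.

    Hypothetical helper; rounding to the nearest row is an assumption.
    """
    n_holdout = round(n_rows * holdout_size)
    return n_rows - n_holdout, n_holdout


for size in (0.1, 0.2, 0.3):
    n_train, n_holdout = expected_split_sizes(N_ROWS, size)
    print(f"holdout_size={size}: ~{n_train} training rows, ~{n_holdout} holdout rows")
```

A larger holdout gives a more stable evaluation but leaves fewer rows for the synthesizer to learn from, which matters most on small datasets.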

#### Disable automatic holdout

In [None]:
from ydata.synthesizers import RegularSynthesizer

synth = RegularSynthesizer()
synth.fit(X=dataset, metadata=metadata, holdout_size=0.0)
# Setting holdout_size to 0 disables this feature.
# This is mainly recommended when you already have your own holdout set or your dataset is too small.

## Creating a holdout set

In [7]:
from ydata.dataset.holdout import Holdout

holdout_config = Holdout(fraction=0.3)
train, holdout = holdout_config.get_split(X=dataset, metadata=metadata, strategy='random')

# This returns a Dataset object with approximately 30% of the records of the original dataset.

In [8]:
print(holdout)

Dataset

Shape: (9724, 15)
Schema:
            Column Variable type
0              age           int
1        workclass        string
2           fnlwgt           int
3        education        string
4    education-num           int
5   marital-status        string
6       occupation        string
7     relationship        string
8             race        string
9              sex        string
10    capital-gain           int
11    capital-loss           int
12  hours-per-week           int
13  native-country        string
14          income        string
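As a quick sanity check, the printed shape can be compared with the requested fraction. The sketch below uses only the two row counts shown in this notebook (9724 holdout rows out of 32561). `holdout_fraction` is a hypothetical helper and the 0.01 tolerance is an arbitrary choice for illustration.

```python
def holdout_fraction(n_holdout, n_total):
    """Fraction of the original rows that ended up in the holdout set."""
    return n_holdout / n_total


requested = 0.3
observed = holdout_fraction(n_holdout=9724, n_total=32561)
print(f"requested={requested:.3f}, observed={observed:.4f}")

# The split need not hit the requested fraction exactly,
# so compare with a tolerance rather than for equality.
assert abs(observed - requested) < 0.01
```

The observed fraction is slightly below 30%, so it is safer to treat `fraction` as a target than as an exact row count.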


