# **YData Synthetic Data Generation with Conditional Sampling**

YData synthesizers now support conditional sampling. The `fit` method has an additional parameter named `condition_on`, which receives a list of features to condition upon. Furthermore, the `sample` method receives the conditions to be applied through an additional parameter also named `condition_on`. For now, there are three types of conditions:
- Condition upon a categorical (or string) feature. The parameters are the name of the feature and a list of values (i.e., categories) to be considered. Each category also has its percentage of representativeness. For example, if we want to condition upon two categories, we need to define the percentage of rows each of these categories will have on the synthetic dataset. Naturally, the sum of such percentages needs to be 1.
- Condition upon a numerical feature. The parameters are the name of the feature and the minimum and maximum of the range to be considered. This feature will present a uniform distribution on the synthetic dataset, limited by the specified range. 
- A generic type of condition where the feature's values are defined according to the data returned by a Generator function.
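
Because categorical percentages must sum to 1, it can help to check a condition before passing it to `sample`. The sketch below is a minimal, hypothetical helper (`check_categorical_condition` is not part of the YData API). It assumes the two ways of writing a category that appear later in this notebook: a dict with `category` and `percentage` keys, and a `(category, percentage)` tuple.

```python
import math


def check_categorical_condition(condition):
    """Return (category, percentage) pairs after validating a categorical condition.

    Hypothetical helper for illustration; not part of the YData API.
    """
    pairs = []
    for entry in condition["categories"]:
        if isinstance(entry, dict):
            # Dict form: {"category": ..., "percentage": ...}
            pairs.append((entry["category"], entry["percentage"]))
        else:
            # Tuple form: (category, percentage)
            category, percentage = entry
            pairs.append((category, percentage))

    total = sum(percentage for _, percentage in pairs)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"percentages sum to {total}, expected 1.0")
    return pairs


pairs = check_categorical_condition(
    {"categories": [("United-States", 0.6), ("Mexico", 0.4)]}
)
print(pairs)
```

Normalizing both forms into tuples keeps the check independent of how the condition was written.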

This notebook describes how to apply conditional sampling with the regular synthesizer. The same logic applies equally to the time series synthesizer.

We will use the Adult Census Income dataset (https://archive.ics.uci.edu/ml/datasets/adult) to demonstrate conditional sampling. This dataset is a collection of census data from 1994, mainly used for prediction tasks where the goal is to identify whether a person makes over 50K a year. Each person is described by 14 features focused on personal information, including sensitive attributes such as race and sex.

In [1]:
from numpy.random import default_rng
from ydata.synthesizers.regular.model import RegularSynthesizer
from ydata.metadata import Metadata
from ydata.labs import DataSources

In [2]:
datasource = DataSources.get(uid='{datasource-uid}')
data = datasource.read()
df = data.to_pandas()
metadata = Metadata(data)

[########################################] | 100% Completed | 101.76 ms
[########################################] | 100% Completed | 315.84 ms
[########################################] | 100% Completed | 155.04 ms
[########################################] | 100% Completed | 101.83 ms
[########################################] | 100% Completed | 2.22 s


In [3]:
data.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## **Condition upon categorical features**

In this first example, we will generate synthetic data conditioned upon two categorical features: `native-country` and `sex`. Specifically, we will only generate data for women (category `Female`) from the United States (60%) and Mexico (40%).

We start by defining which features to condition upon when calling the `fit` method.

In [4]:
synthesizer = RegularSynthesizer()
synthesizer.fit(data, metadata=metadata,
                condition_on=["sex", "native-country"])

INFO: 2023-04-12 11:09:06,629 [SYNTHESIZER] - Number columns considered for synth: 15
INFO: 2023-04-12 11:09:06,874 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-04-12 11:09:06,876 [SYNTHESIZER] - Preprocess segment
INFO: 2023-04-12 11:09:06,878 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-04-12 11:09:06,879 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.


<ydata.synthesizers.regular.model.RegularSynthesizer at 0x7f3e6a6ce110>

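Afterward, we define the specific conditions when calling the `sample` method. Each conditioned feature needs a percentage of representativeness for every category to be generated: 100% `Female` for `sex`, and 60%/40% for the two native countries. The cell below writes categories in two ways, as a dict with `category` and `percentage` keys and as a `(category, percentage)` tuple.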

In [5]:
synth_df = synthesizer.sample(n_samples=len(data),
                              condition_on={
                                  "sex": {
                                      "categories": [{
                                          "category": "Female",
                                          "percentage": 1.0
                                      }]
                                  },
                                  "native-country": {
                                      "categories": [("United-States", 0.6),
                                                     ("Mexico", 0.4)]
                                  }
                              }).to_pandas()

INFO: 2023-04-12 11:09:09,783 [SYNTHESIZER] - Start generating model samples.


We can now check the percentages of representativeness for each category to validate if the conditions were respected.

In [6]:
print(f'\033[1mSex Feature (Synthetic)')
print(f'\033[1mFemale = {synth_df["sex"].value_counts(normalize=True)["Female"] * 100:.0f}%\n')

nc_vc = synth_df["native-country"].value_counts(normalize=True)
print(f'\033[1mNative Country Feature (Synthetic)')
print(f'\033[1mMexico = {nc_vc["Mexico"] * 100:.0f}%')
print(f'\033[1mUnited-States = {nc_vc["United-States"] * 100:.0f}%')

Sex Feature (Synthetic)
Female = 100%

Native Country Feature (Synthetic)
Mexico = 40%
United-States = 60%
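
The same validation can be wrapped in a small reusable check. The following is a minimal sketch using only the standard library; `category_shares` and `shares_match` are hypothetical helper names, and the 1% tolerance is an arbitrary choice for illustration.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each distinct value as a fraction of all values."""
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    return {category: count / total for category, count in counts.items()}


def shares_match(values, targets, tolerance=0.01):
    """Check that each target category's observed share is within `tolerance`."""
    observed = category_shares(values)
    return all(
        abs(observed.get(category, 0.0) - expected) <= tolerance
        for category, expected in targets.items()
    )


# Toy stand-in for synth_df["native-country"].tolist()
countries = ["United-States"] * 6 + ["Mexico"] * 4
print(shares_match(countries, {"United-States": 0.6, "Mexico": 0.4}))
```

A tolerance is used rather than strict equality because the generated row counts may not divide exactly into the requested percentages.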


We can also compare the mean age in the original and synthetic data. When the original data is filtered to Female Mexicans, its mean age is close to the one obtained from the synthetic data, and both are well below the unfiltered mean.

In [7]:
print(f'\033[1mAge Mean (Original - No Filters) = {df["age"].mean()}')
orig_female_mx = df[(df["native-country"] == "Mexico") & (df["sex"] == "Female")]["age"].mean()
print(f'\033[1mAge Mean (Original - Female Mexicans) = {orig_female_mx}')
# Every synthetic row is Female, so filtering by country alone is enough.
synth_female_mx = synth_df[synth_df["native-country"] == "Mexico"]["age"].mean()
print(f'\033[1mAge Mean (Synthetic - Female Mexicans) = {synth_female_mx}')

Age Mean (Original - No Filters) = 38.58164675532078
Age Mean (Original - Female Mexicans) = 32.73287671232877
Age Mean (Synthetic - Female Mexicans) = 32.84072160974482
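
To put a number on how close the two subgroup means are, one option is the relative difference. This is a minimal sketch that reuses the values printed above; `relative_difference` is a hypothetical helper, not something provided by YData.

```python
def relative_difference(reference, candidate):
    """Absolute difference between two values, as a fraction of the reference."""
    return abs(candidate - reference) / abs(reference)


# Age means printed above for Female Mexicans (original vs. synthetic).
original_mean = 32.73287671232877
synthetic_mean = 32.84072160974482
print(f"{relative_difference(original_mean, synthetic_mean):.2%}")
```

For these two values, the difference is well under 1% of the original mean.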


## **Condition upon numerical features**

In this second example, we will generate synthetic data conditioned upon the numerical feature `age`. Specifically, we will only generate data for people aged between 55 and 60.

In [8]:
synthesizer = RegularSynthesizer()
synthesizer.fit(data, metadata=metadata,
                condition_on=["age"])

INFO: 2023-04-12 11:09:10,837 [SYNTHESIZER] - Number columns considered for synth: 15
INFO: 2023-04-12 11:09:11,089 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-04-12 11:09:11,092 [SYNTHESIZER] - Preprocess segment
INFO: 2023-04-12 11:09:11,095 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-04-12 11:09:11,095 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.


<ydata.synthesizers.regular.model.RegularSynthesizer at 0x7f3e48b1cfd0>

For the specific conditions, we now define the minimum and maximum values for the age feature.

In [9]:
synth_df = synthesizer.sample(n_samples=len(data),
                              condition_on={
                                  "age": {
                                      "minimum": 55,
                                      "maximum": 60
                                  },
                              }).to_pandas()

INFO: 2023-04-12 11:09:14,450 [SYNTHESIZER] - Start generating model samples.


We can now check the minimum and maximum values for the age feature to validate if the conditions were respected.

In [10]:
print(f'\033[1mAge Min (Synthetic) = {synth_df["age"].min()}')
print(f'\033[1mAge Max (Synthetic) = {synth_df["age"].max()}')

Age Min (Synthetic) = 55
Age Max (Synthetic) = 59
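
Minimum and maximum alone do not show how the values are spread inside the range. A per-value count gives a quick view of whether the ages look roughly uniform. The sketch below uses a hypothetical helper, `value_counts_in_range`, and toy data. It treats both bounds as inclusive, which mirrors the filter applied to the original data in this notebook but is an assumption about how the synthesizer interprets `minimum` and `maximum`.

```python
from collections import Counter


def value_counts_in_range(values, minimum, maximum):
    """Count occurrences per value, rejecting values outside [minimum, maximum]."""
    out_of_range = [v for v in values if not minimum <= v <= maximum]
    if out_of_range:
        raise ValueError(
            f"{len(out_of_range)} value(s) outside [{minimum}, {maximum}]"
        )
    # Sort by value so the counts read in ascending order.
    return dict(sorted(Counter(values).items()))


# Toy stand-in for synth_df["age"].tolist()
ages = [55, 56, 57, 58, 59, 55, 56, 57, 58, 59]
print(value_counts_in_range(ages, 55, 60))
```

On real output, counts that are similar across values suggest the uniform spread described earlier.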


We can also compare the mean of `hours-per-week` in the original and synthetic data. When the original data is filtered to ages between 55 and 60, its mean is close to the one obtained from the synthetic data.

In [11]:
print(f'\033[1mHours Per Week Mean (Original - No Filters) = {df["hours-per-week"].mean()}')
orig_age_55_60 = df[(df["age"] >= 55) & (df["age"] <= 60)]
print(f'\033[1mHours Per Week Mean (Original - Age between 55 and 60) = {orig_age_55_60["hours-per-week"].mean()}')
print(f'\033[1mHours Per Week Mean (Synthetic - Age between 55 and 60) = {synth_df["hours-per-week"].mean()}')

Hours Per Week Mean (Original - No Filters) = 40.437455852092995
Hours Per Week Mean (Original - Age between 55 and 60) = 41.73299632352941
Hours Per Week Mean (Synthetic - Age between 55 and 60) = 41.91311691901355


## **Condition upon a Generator function**

In this third example, we will generate synthetic data conditioned upon the age feature but using a Generator function. Specifically, we will generate data assuming the age feature follows a normal distribution with a mean of 25 and a standard deviation of 5.

In [12]:
synthesizer = RegularSynthesizer()
synthesizer.fit(data, metadata=metadata,
                condition_on=["age"])

INFO: 2023-04-12 11:09:15,326 [SYNTHESIZER] - Number columns considered for synth: 15
INFO: 2023-04-12 11:09:15,576 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-04-12 11:09:15,579 [SYNTHESIZER] - Preprocess segment
INFO: 2023-04-12 11:09:15,582 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-04-12 11:09:15,582 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.


<ydata.synthesizers.regular.model.RegularSynthesizer at 0x7f3e48b1d120>

After training the synthesizer, we define the Generator function according to the desired condition. This function is then supplied to the `sample` method.

In [13]:
def generate_age():
    yield from default_rng().normal(25, 5, len(data))
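
The generator above yields unbounded floating-point values, so a normal distribution can occasionally produce implausible ages. A variant that rounds to integers and clips to a chosen range could look like the sketch below. It uses only the standard library; `generate_bounded_age`, its parameters, and the bounds of 17 and 90 are illustrative assumptions, and the source does not state whether `sample` requires integer values or accepts a generator that takes arguments.

```python
import random


def generate_bounded_age(n, mean=25, std=5, low=17, high=90, seed=None):
    """Yield n integer ages from a normal distribution, clipped to [low, high]."""
    rng = random.Random(seed)
    for _ in range(n):
        age = round(rng.gauss(mean, std))
        # Clip so that extreme draws stay inside the allowed range.
        yield min(max(age, low), high)


ages = list(generate_bounded_age(1000, seed=0))
print(min(ages), max(ages))
```

Clipping piles extreme draws onto the boundary values, so the result is only approximately normal in the tails.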

In [14]:
synth_df = synthesizer.sample(n_samples=len(data),
                              condition_on={
                                  "age": {
                                      "function": generate_age
                                  }
                              }).to_pandas()

INFO: 2023-04-12 11:09:18,989 [SYNTHESIZER] - Start generating model samples.


We can now check the mean and standard deviation values for the age feature to validate if the conditions were respected.

In [15]:
print(f'\033[1mAge Mean (Synthetic) = {synth_df["age"].mean()}')
print(f'\033[1mAge Standard Deviation (Synthetic) = {synth_df["age"].std()}')

Age Mean (Synthetic) = 24.627499155431344
Age Standard Deviation (Synthetic) = 4.73635677167138


We can also compare the mean of `hours-per-week` in the original and synthetic data. As expected, the synthetic data presents a lower mean, since younger people in this dataset work fewer hours per week.

In [16]:
print(f'\033[1mHours Per Week Mean (Original) = {df["hours-per-week"].mean()}')
print(f'\033[1mHours Per Week Mean (Synthetic) = {synth_df["hours-per-week"].mean()}')

Hours Per Week Mean (Original) = 40.437455852092995
Hours Per Week Mean (Synthetic) = 37.421731519302234
