CTGAN Model
===========

In this guide we will go through a series of steps that will let you
discover functionalities of the `CTGAN` model, including how to:

-   Create an instance of `CTGAN`.
-   Fit the instance to your data.
-   Generate synthetic versions of your data.
-   Use `CTGAN` to anonymize PII.
-   Specify hyperparameters to improve the output quality.

What is CTGAN?
--------------

The `sdv.tabular.CTGAN` model is based on the GAN-based Deep Learning
data synthesizer which was presented at the NeurIPS 2019 conference in
the paper titled [Modeling Tabular data using Conditional
GAN](https://arxiv.org/abs/1907.00503).

Let's now discover how to learn a dataset and later on generate
synthetic data with the same format and statistical properties by using
the `CTGAN` class from SDV.

Quick Usage
-----------

We will start by loading one of our demo datasets,
`student_placements`, which contains information about MBA students who
applied for placements during the year 2020.

<div class="alert alert-warning">

**Warning**

In order to follow this guide you need to have `ctgan` installed on your
system. If you have not done it yet, please install it now by executing
the command `pip install sdv` in a terminal; this installs `ctgan`
together with SDV.

</div>

In [1]:
from sdv.demo import load_tabular_demo

data = load_tabular_demo('student_placements')
data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17264,M,67.0,91.0,Commerce,58.0,Sci&Tech,False,0,55.0,Mkt&HR,58.8,27000.0,True,2020-07-23,2020-10-12,3.0
1,17265,M,79.33,78.33,Science,77.48,Sci&Tech,True,1,86.5,Mkt&Fin,66.28,20000.0,True,2020-01-11,2020-04-09,3.0
2,17266,M,65.0,68.0,Arts,64.0,Comm&Mgmt,False,0,75.0,Mkt&Fin,57.8,25000.0,True,2020-01-26,2020-07-13,6.0
3,17267,M,56.0,52.0,Science,52.0,Sci&Tech,False,0,66.0,Mkt&HR,59.43,,False,NaT,NaT,
4,17268,M,85.8,73.6,Commerce,73.3,Comm&Mgmt,False,0,96.8,Mkt&Fin,55.5,42500.0,True,2020-07-04,2020-09-27,3.0


As you can see, this table contains information about students which
includes, among other things:

-   Their id and gender
-   Their grades and specializations
-   Their work experience
-   The salary that they were offered
-   The duration and dates of their placement

You will notice that there is data with the following characteristics:

-   There are float, integer, boolean, categorical and datetime values.
-   There are some variables that have missing data. In particular, all
    the data related to the placement details is missing in the rows
    where the student was not placed.

Let us use `CTGAN` to learn this data and then sample synthetic data
about new students to see how well the model captures the characteristics
indicated above. In order to do this you will need to:

-   Import the `sdv.tabular.CTGAN` class and create an instance of it.
-   Call its `fit` method passing our table.
-   Call its `sample` method indicating the number of synthetic rows
    that you want to generate.

In [2]:
from sdv.tabular import CTGAN

model = CTGAN()
model.fit(data)

<div class="alert alert-info">

**Note**

Notice that the model `fitting` process took care of transforming the
different fields using the appropriate [Reversible Data
Transforms](http://github.com/sdv-dev/RDT) to ensure that the data has a
format that the underlying CTGANSynthesizer class can handle.

</div>

### Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic
data by calling the `sample` method of your model, passing the number
of rows that you want to generate. The number of rows (`num_rows`)
is a required parameter.

In [3]:
new_data = model.sample(num_rows=200)

This will return a table with the same columns as the one the model was
fitted on, but filled with new data which resembles the original one.

In [4]:
new_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17362,M,32.960462,53.704156,Science,66.631445,Comm&Mgmt,False,0,95.207488,Mkt&Fin,47.321018,,False,NaT,2020-04-23,6.0
1,17225,M,64.110736,43.341049,Science,75.102013,Comm&Mgmt,True,0,77.093897,Mkt&HR,51.570617,28008.508481,True,NaT,2020-09-26,3.0
2,17221,M,61.768412,40.925261,Commerce,64.717689,Sci&Tech,True,0,68.146367,Mkt&HR,57.642954,38203.24765,False,2020-09-17,2021-03-28,12.0
3,17214,M,80.515029,36.534666,Commerce,66.592473,Sci&Tech,True,0,93.435515,Mkt&Fin,54.056713,31803.404036,False,NaT,2020-12-05,
4,17235,F,80.126127,63.582383,Commerce,51.98877,Comm&Mgmt,False,0,70.351838,Mkt&Fin,72.464351,,True,NaT,2020-09-10,3.0


<div class="alert alert-info">

**Note**

There are a number of other parameters in this method that you can use to
optimize the process of generating synthetic data. Use `output_file_path`
to write results directly to a CSV file, `batch_size` to break up sampling
into smaller pieces and track their progress, and `randomize_samples` to
determine whether to generate the same synthetic data every time.
See the <a href="https://sdv.dev/SDV/api_reference/tabular/api/sdv.tabular.ctgan.CTGAN.sample">API Section</a>
for more details.

</div>

### Save and Load the model

In many scenarios it will be convenient to generate synthetic versions
of your data directly in systems that do not have access to the original
data source. For example, you may want to generate testing data on
the fly inside a testing environment that does not have access to your
production database. In these scenarios, fitting the model with real
data every time that you need to generate new data is not feasible, so
you will need to fit a model in your production environment, save the
fitted model into a file, send this file to the testing environment and
then load it there to be able to `sample` from it.

Let's see how this process works.

#### Save and share the model

Once you have fitted the model, all you need to do is call its `save`
method passing the name of the file in which you want to save the model.
Note that the extension of the filename is not relevant, but we will be
using the `.pkl` extension to highlight that the serialization protocol
used is [pickle](https://docs.python.org/3/library/pickle.html).

In [5]:
model.save('my_model.pkl')

This will have created a file called `my_model.pkl` in the same
directory in which you are running SDV.

<div class="alert alert-info">

**Important**

If you inspect the generated file you will notice that its size is much
smaller than the size of the data that you used to generate it. This is
because the serialized model contains **no information about the
original data**, other than the parameters it needs to generate
synthetic versions of it. This means that you can safely share this
`my_model.pkl` file without the risk of disclosing any of your real
data!

</div>

#### Load the model and generate new data

The file you just generated can be sent over to the system where the
synthetic data will be generated. Once it is there, you can load it
using the `CTGAN.load` method, and then you are ready to sample new data
from the loaded instance:

In [6]:
loaded = CTGAN.load('my_model.pkl')
new_data = loaded.sample(num_rows=200)

<div class="alert alert-warning">

**Warning**

Notice that the system where the model is loaded needs to also have
`sdv` and `ctgan` installed, otherwise it will not be able to load the
model and use it.

</div>
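
The save and load steps above rely on pickle. As a rough illustration of
what such a round trip involves, here is a minimal sketch in plain Python.
The `save_model` and `load_model` helpers and the `fitted_params` dictionary
are hypothetical stand-ins and are not part of SDV; the real `save` and
`load` methods may do more than this.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Hypothetical helper: serialize any picklable object to `path`."""
    with open(path, 'wb') as f:
        pickle.dump(model, f)


def load_model(path):
    """Hypothetical helper: restore the object written by `save_model`.

    Unpickling can execute arbitrary code, so only load trusted files.
    """
    with open(path, 'rb') as f:
        return pickle.load(f)


# A plain dict stands in for the parameters of a fitted model.
fitted_params = {'columns': ['gender', 'salary'], 'epochs': 300}

# Write to a temporary directory so the sketch leaves no files behind
# in the working directory.
model_path = os.path.join(tempfile.mkdtemp(), 'my_model.pkl')
save_model(fitted_params, model_path)
restored = load_model(model_path)
```

The restored object is an equal but independent copy, which is why the
receiving system only needs the file and the libraries that define the
pickled classes.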

### Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo
data is that there is a `student_id` column which acts as the primary
key of the table, and which is supposed to have unique values. Indeed,
if we look at the number of times that each value appears, we see that
all of them appear at most once:

In [7]:
data.student_id.value_counts().max()

1

However, if we look at the synthetic data that we generated, we observe
that there are some values that appear more than once:

In [8]:
new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
52,17271,M,46.37456,81.486515,Commerce,66.37206,Sci&Tech,False,0,103.94687,Mkt&HR,48.477507,,True,NaT,2020-07-16,6.0
76,17271,M,58.815226,58.919422,Commerce,63.072138,Comm&Mgmt,True,0,93.463872,Mkt&Fin,51.019909,43100.580159,True,NaT,2020-04-18,3.0
87,17271,M,64.669588,57.76156,Commerce,51.385809,Comm&Mgmt,False,0,106.545396,Mkt&Fin,66.693542,27137.751137,True,NaT,NaT,
181,17271,M,47.890702,81.051024,Science,52.175775,Sci&Tech,True,0,75.079406,Mkt&Fin,63.161405,29850.269919,True,NaT,NaT,6.0


This happens because the model was never told that `student_id` had to
be unique, so sooner or later it will generate colliding values. In
order to solve this, we can pass the argument `primary_key` to our model
when we create it, indicating the name of the column that is the index
of the table.

In [9]:
model = CTGAN(
    primary_key='student_id'
)
model.fit(data)
new_data = model.sample(200)
new_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,M,67.679231,107.045934,Science,56.449142,Comm&Mgmt,True,0,85.463105,Mkt&Fin,62.49216,,False,NaT,2020-06-23,6.0
1,1,F,61.344323,79.999742,Arts,53.222962,Comm&Mgmt,False,0,69.90505,Mkt&Fin,62.458629,42450.334675,True,2020-03-27,2020-08-12,6.0
2,2,F,81.108852,91.145108,Commerce,49.812391,Comm&Mgmt,True,0,72.979795,Mkt&HR,78.032517,,True,2020-01-15,2020-05-26,
3,3,M,73.461129,80.196905,Science,41.874524,Comm&Mgmt,True,1,71.035512,Mkt&Fin,79.273043,,True,2020-03-17,2020-03-19,3.0
4,4,M,76.3704,84.044438,Commerce,52.984854,Comm&Mgmt,False,1,87.025331,Mkt&Fin,62.487152,,True,2020-01-23,2020-07-04,


As a result, the model will learn that this column must be unique and
generate a unique sequence of values for the column:

In [10]:
new_data.student_id.value_counts().max()

1
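
The collision behaviour described in this section can be illustrated with a
small sketch in plain Python. The ID range and sample size below are
hypothetical and chosen so that duplicates are guaranteed; this is not how
SDV generates keys internally.

```python
import random
from collections import Counter

rng = random.Random(42)

# Hypothetical pool of 215 known IDs.
real_ids = list(range(17264, 17264 + 215))

# Without a primary key, IDs are sampled like any other column, so
# nothing prevents the same value from being drawn more than once.
# Drawing 300 values from 215 candidates must repeat at least one.
independent_ids = [rng.choice(real_ids) for _ in range(300)]

# With a primary key, a unique sequence of values is produced instead.
sequential_ids = list(range(300))

max_independent = Counter(independent_ids).most_common(1)[0][1]
max_sequential = Counter(sequential_ids).most_common(1)[0][1]
```

Here `max_independent` is at least 2, while `max_sequential` is exactly 1,
mirroring the `value_counts().max()` checks shown above.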

### Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally
Identifiable Information which we cannot disclose. In these cases, we
will want our Tabular Models to replace the information within these
fields with fake, simulated data that looks similar to the real one but
does not contain any of the original values.

Let's load a new dataset that contains a PII field, the
`student_placements_pii` demo, and try to generate synthetic versions of
it that do not contain any of the PII fields.

<div class="alert alert-info">

**Note**

The `student_placements_pii` dataset is a modified version of the
`student_placements` dataset with one new field, `address`, which
contains PII information about the students. Notice that this additional
`address` field has been simulated and does not correspond to data from
the real users.

</div>

In [11]:
data_pii = load_tabular_demo('student_placements_pii')
data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17264,"70304 Baker Turnpike\nEricborough, MS 15086",M,67.0,91.0,Commerce,58.0,Sci&Tech,False,0,55.0,Mkt&HR,58.8,27000.0,True,2020-07-23,2020-10-12,3.0
1,17265,"805 Herrera Avenue Apt. 134\nMaryview, NJ 36510",M,79.33,78.33,Science,77.48,Sci&Tech,True,1,86.5,Mkt&Fin,66.28,20000.0,True,2020-01-11,2020-04-09,3.0
2,17266,"3702 Bradley Island\nNorth Victor, FL 12268",M,65.0,68.0,Arts,64.0,Comm&Mgmt,False,0,75.0,Mkt&Fin,57.8,25000.0,True,2020-01-26,2020-07-13,6.0
3,17267,Unit 0879 Box 3878\nDPO AP 42663,M,56.0,52.0,Science,52.0,Sci&Tech,False,0,66.0,Mkt&HR,59.43,,False,NaT,NaT,
4,17268,"96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...",M,85.8,73.6,Commerce,73.3,Comm&Mgmt,False,0,96.8,Mkt&Fin,55.5,42500.0,True,2020-07-04,2020-09-27,3.0


If we use our tabular model on this new data we will see how the
synthetic data that it generates discloses the addresses from the real
students:

In [12]:
model = CTGAN(
    primary_key='student_id',
)
model.fit(data_pii)
new_data_pii = model.sample(200)
new_data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"418 Simmons Crescent\nLake Shawnview, SD 98853",F,43.277193,69.505324,Science,67.195715,Sci&Tech,False,0,59.000344,Mkt&Fin,59.008388,28441.121522,False,2019-12-22,2020-09-19,
1,1,"92152 Walker Place Suite 289\nMicheleview, NH ...",F,69.111162,76.887437,Commerce,79.563555,Comm&Mgmt,False,0,102.647341,Mkt&HR,59.08186,,False,2020-02-26,2021-03-06,12.0
2,2,"252 Allen Ranch\nSouth Joshua, AK 02142",M,46.529879,63.733209,Science,58.855701,Sci&Tech,False,0,64.99391,Mkt&Fin,66.154008,27881.607597,True,2020-05-03,2020-09-26,
3,3,51067 Turner Parks Suite 297\nNorth Gregorybor...,M,66.841085,61.948674,Science,51.485522,Comm&Mgmt,True,0,106.02891,Mkt&Fin,71.32769,18714.294298,False,NaT,2021-04-14,6.0
4,4,"79045 Mary Prairie\nEast Christina, GA 42034",F,63.419292,72.578095,Science,53.172454,Sci&Tech,False,0,76.760836,Mkt&Fin,70.605526,,True,2020-05-20,2020-05-29,3.0


More specifically, we can see how all the addresses that have been
generated actually come from the original dataset:

In [13]:
new_data_pii.address.isin(data_pii.address).sum()

200

In order to solve this, we can pass an additional argument
`anonymize_fields` to our model when we create the instance. This
`anonymize_fields` argument will need to be a dictionary that contains:

-   The name of the field that we want to anonymize.
-   The category of the field that we want to use when we generate fake
    values for it.

The complete list of possible categories can be seen in the [Faker
Providers](https://faker.readthedocs.io/en/master/providers.html) page,
and it contains a huge list of concepts such as:

-   name
-   address
-   country
-   city
-   ssn
-   credit_card_number
-   credit_card_expire
-   credit_card_security_code
-   email
-   telephone
-   ...

In this case, since the field is an address, we will pass a
dictionary indicating the category `address`:

In [14]:
model = CTGAN(
    primary_key='student_id',
    anonymize_fields={
        'address': 'address'
    }
)
model.fit(data_pii)

As a result, we can see how the real `address` values have been replaced
by other fake addresses:

In [15]:
new_data_pii = model.sample(200)
new_data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"18252 Kristen Rapid\nSouth Wendy, IA 12847",F,62.788103,57.683344,Commerce,70.322665,Comm&Mgmt,False,0,72.099963,Mkt&Fin,60.288379,,True,2020-01-22,2020-04-29,12.0
1,1,"68663 Alexandra Walks\nRodriguezfort, AZ 77519",F,55.283274,62.558524,Science,73.225486,Comm&Mgmt,True,0,80.555841,Mkt&Fin,61.248741,,False,2020-01-20,NaT,3.0
2,2,"2024 Michael Ports\nDonaldfurt, MN 94301",M,59.515239,61.164161,Commerce,71.543674,Comm&Mgmt,True,1,110.556561,Mkt&Fin,57.346226,26435.827942,True,2020-01-15,2020-01-30,
3,3,"29668 Pearson Keys\nSusanfort, MT 55770",M,65.807491,68.725586,Arts,83.628631,Sci&Tech,False,0,106.766273,Mkt&Fin,58.283908,29354.13554,True,2020-03-29,NaT,3.0
4,4,"5208 Young Village Apt. 695\nNew Soniaton, OH ...",F,88.722821,67.390853,Arts,82.97707,Comm&Mgmt,False,0,69.449941,Mkt&Fin,48.739592,27759.182835,False,2020-01-16,NaT,


This means that none of the original addresses can be found in the
sampled data:

In [16]:
data_pii.address.isin(new_data_pii.address).sum()

0

Advanced Usage
--------------

Now that we have discovered the basics, let's go over a few more
advanced usage examples and see the different arguments that we can pass
to our `CTGAN` Model in order to customize it to our needs.

### How to modify the CTGAN Hyperparameters?

Apart from the common Tabular Model arguments, `CTGAN` has a number of
additional hyperparameters that control its learning behavior and can
impact the performance of the model, both in terms of quality of the
generated data and computational time.

-   `epochs` and `batch_size`: these arguments control the number of
    iterations that the model will perform to optimize its parameters,
    as well as the number of samples used in each step. Their default
    values are `300` and `500` respectively, and `batch_size` must
    always be a multiple of `10`. These hyperparameters have a very
    direct effect on how long the training process lasts, but also on
    the quality of the generated data, so for new datasets you might
    want to start by setting a low value for both of them to see how
    long the training process takes on your data, and later increase
    them to acceptable values in order to improve the quality.
-   `log_frequency`: Whether to use log frequency of categorical levels
    in conditional sampling. It defaults to `True`. This argument affects
    how the model processes the frequencies of the categorical values that
    are used to condition the rest of the values. In some cases, changing
    it to `False` could lead to better performance.
-   `embedding_dim` (int): Size of the random sample passed to the
    Generator. Defaults to 128.
-   `generator_dim` (tuple or list of ints): Size of the output samples for
    each one of the Residuals. A Residual Layer will be created for each
    one of the values provided. Defaults to (256, 256).
-   `discriminator_dim` (tuple or list of ints): Size of the output samples for
    each one of the Discriminator Layers. A Linear Layer will be created
    for each one of the values provided. Defaults to (256, 256).
-   `generator_lr` (float): Learning rate for the generator. Defaults to 2e-4.
-   `generator_decay` (float): Generator weight decay for the Adam Optimizer.
    Defaults to 1e-6.
-   `discriminator_lr` (float): Learning rate for the discriminator.
    Defaults to 2e-4.
-   `discriminator_decay` (float): Discriminator weight decay for the Adam
    Optimizer. Defaults to 1e-6.
-   `discriminator_steps` (int): Number of discriminator updates to do for
    each generator update. From the WGAN paper: https://arxiv.org/abs/1701.07875.
    WGAN paper default is 5. Default used is 1 to match original CTGAN
    implementation.
-   `verbose`: Whether to print fit progress on stdout. Defaults to
    `False`.

<div class="alert alert-warning">

**Warning**

Notice that the value that you set on the `batch_size` argument must
always be a multiple of `10`!

</div>
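
One way to respect the multiple-of-10 rule while following the advice to
start with small values is sketched below. The `round_batch_size` helper and
the specific trial values are hypothetical and not part of SDV; only the
`epochs` and `batch_size` argument names come from this guide.

```python
def round_batch_size(requested, multiple=10):
    """Round `requested` down to the nearest positive multiple of `multiple`.

    Hypothetical helper, not part of SDV.
    """
    if requested < multiple:
        raise ValueError(f'batch_size must be at least {multiple}')
    return (requested // multiple) * multiple


# Small values for a first timing run, larger ones once the duration of
# the training process on your data is known.
quick_trial = {'epochs': 10, 'batch_size': round_batch_size(57)}
full_run = {'epochs': 300, 'batch_size': round_batch_size(500)}
```

These dictionaries could then be unpacked into the constructor, for example
`CTGAN(**quick_trial)`, before moving on to the larger settings.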

As an example, we will try to fit the `CTGAN` model slightly increasing
the number of epochs, reducing the `batch_size` and adding one
additional layer to the models involved.

Before we start, we will evaluate the quality of the previously
generated data using the `sdv.evaluation.evaluate` function:

In [17]:
from sdv.evaluation import evaluate

evaluate(new_data, data)

0.3120837147725067

Afterwards, we create a new instance of the `CTGAN` model with the
hyperparameter values that we want to use:

In [18]:
model = CTGAN(
    primary_key='student_id',
    epochs=500,
    batch_size=100,
    generator_dim=(256, 256, 256),
    discriminator_dim=(256, 256, 256)
)

And fit it to our data:

In [19]:
model.fit(data)

Finally, we are ready to generate new data and evaluate the results.

In [20]:
new_data = model.sample(len(data))
evaluate(new_data, data)

0.3189345188059005

As we can see, in this case these modifications changed the obtained
results slightly, but they did not introduce dramatic changes in
performance either.


### Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional distribution using the `CTGAN` model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the `sample_conditions` method as a list of `sdv.sampling.Condition` objects or to the `sample_remaining_columns` method as a dataframe. 

When specifying a `sdv.sampling.Condition` object, we can pass in the desired conditions as a dictionary, as well as specify the number of desired rows for that condition.

In [21]:
from sdv.sampling import Condition

condition = Condition({
    'gender': 'M'
}, num_rows=5)

model.sample_conditions(conditions=[condition])

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,M,70.251977,51.04317,Commerce,41.864175,Comm&Mgmt,False,1,106.238562,Mkt&Fin,57.851349,33039.96532,False,NaT,NaT,3.0
1,0,M,82.103027,77.718532,Commerce,51.612875,Comm&Mgmt,False,0,56.190743,Mkt&Fin,77.84952,,False,2020-03-09,NaT,12.0
2,1,M,60.028752,68.820894,Commerce,65.995398,Comm&Mgmt,False,0,95.124748,Mkt&Fin,62.951054,27940.855177,False,2020-06-11,2020-06-13,
3,2,M,76.733105,90.554487,Commerce,71.151889,Comm&Mgmt,False,2,59.158381,Mkt&HR,69.919527,,False,2020-04-02,2020-06-09,3.0
4,3,M,57.621074,40.250825,Science,45.390671,Comm&Mgmt,False,1,93.239456,Mkt&Fin,64.525858,,True,2020-02-10,2020-05-21,3.0


It's also possible to condition on multiple columns, such as `gender = M` and `experience_years = 0`.

In [25]:
condition = Condition({
    'gender': 'M',
    'experience_years': 0
}, num_rows=5)

model.sample_conditions(conditions=[condition])

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,M,71.435596,76.273542,Science,77.80396,Comm&Mgmt,False,0,46.464446,Mkt&HR,82.964351,30511.019575,False,2020-03-08,NaT,
1,5,M,67.205207,62.493386,Commerce,51.706185,Sci&Tech,False,0,61.360194,Mkt&HR,83.689837,,True,NaT,NaT,
2,7,M,58.946618,54.097171,Science,67.072846,Comm&Mgmt,False,0,65.284074,Mkt&HR,67.589778,,False,NaT,NaT,
3,9,M,91.028415,75.836934,Science,41.456927,Sci&Tech,False,0,96.141702,Mkt&Fin,58.466902,,False,NaT,2020-05-26,3.0
4,2,M,61.560133,70.004518,Science,77.092232,Comm&Mgmt,False,0,94.452796,Mkt&HR,61.759055,,False,2020-08-18,NaT,3.0


In the `sample_remaining_columns` method, `conditions` is passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, sorted in the same order. Since the model already knows how many samples to generate, passing the number of rows as a parameter is unnecessary. For example, if we want to generate three samples where `gender = M` and three samples with `gender = F`, all of them with `work_experience = True`, we can do the following:

In [23]:
import pandas as pd 

conditions = pd.DataFrame({
    'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'work_experience': [True, True, True, True, True, True]
})
model.sample_remaining_columns(conditions)

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,2,M,80.886749,79.772646,Science,64.129706,Sci&Tech,True,0,78.169567,Mkt&Fin,73.69591,,True,2020-03-28,NaT,12.0
1,1,M,87.416346,74.371714,Science,85.166974,Comm&Mgmt,True,0,42.789144,Mkt&Fin,72.435012,26264.069272,False,2020-03-24,2020-09-14,3.0
2,3,M,44.6058,68.933255,Science,50.551667,Sci&Tech,True,0,47.312228,Mkt&HR,63.514986,,False,2020-01-04,NaT,
3,0,F,76.404555,69.409891,Commerce,80.748612,Comm&Mgmt,True,3,94.117485,Mkt&HR,77.548469,28217.285619,False,2020-01-14,2020-06-05,12.0
4,0,F,68.313682,32.54934,Science,55.139175,Comm&Mgmt,True,0,43.086751,Mkt&HR,52.641979,,False,NaT,NaT,
5,1,F,54.14933,65.672814,Commerce,60.548738,Comm&Mgmt,True,0,90.027053,Mkt&Fin,61.074818,30289.816455,False,2020-03-04,2020-06-14,


`CTGAN` also supports conditioning on continuous values, as long as the values are within the range of seen numbers. For example, if all the values of the dataset are within 0 and 1, `CTGAN` will not be able to set this value to 1000.

In [24]:
condition = Condition({
    'degree_perc': 70.0
}, num_rows=5)

model.sample_conditions(conditions=[condition])

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,12,F,75.274873,38.023216,Commerce,70.0,Sci&Tech,False,0,45.312776,Mkt&HR,82.928409,,False,2019-12-13,2020-06-29,3.0
1,26,M,88.873697,101.980034,Science,70.0,Comm&Mgmt,False,1,58.630728,Mkt&Fin,64.177173,31568.772188,True,2020-09-04,2020-06-08,3.0
2,15,M,96.03818,96.22777,Commerce,70.0,Comm&Mgmt,False,1,51.65035,Mkt&HR,62.006734,23368.53841,True,2020-01-31,2020-08-06,3.0
3,33,F,56.194367,59.559493,Science,70.0,Comm&Mgmt,False,0,100.105575,Mkt&Fin,72.97743,,False,NaT,NaT,
4,26,M,76.835546,38.963765,Science,70.0,Comm&Mgmt,False,2,86.21846,Mkt&Fin,67.647664,27740.418814,False,2020-05-25,2020-06-12,3.0
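
Because conditions on continuous columns are limited to the range of values
seen during fitting, it can help to check a value before requesting it. The
helper below is a hypothetical pre-check written in plain Python; it is not
part of SDV, and it only skips missing values represented as `None`.

```python
def condition_in_range(observed_values, value):
    """Return True if `value` lies within the observed minimum and maximum.

    Hypothetical pre-check, not part of SDV; `None` entries are skipped.
    """
    seen = [v for v in observed_values if v is not None]
    return min(seen) <= value <= max(seen)


# First five degree_perc values from the demo table shown earlier.
degree_perc = [58.0, 77.48, 64.0, 52.0, 73.3]

ok_condition = condition_in_range(degree_perc, 70.0)     # inside 52.0..77.48
bad_condition = condition_in_range(degree_perc, 1000.0)  # far outside
```

With a dataframe, the observed values could be obtained with something like
`data['degree_perc'].dropna().tolist()` before calling the helper.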


<div class="alert alert-info">

**Note**

Currently, conditional sampling works through a rejection sampling process, where
rows are sampled repeatedly until one that satisfies the conditions is found.
If you are not able to sample enough valid rows, try increasing
`max_tries` or `batch_size_per_try`.
More information about these parameters can be found in the
<a href="https://sdv.dev/SDV/api_reference/tabular/api/sdv.tabular.ctgan.CTGAN.sample_conditions.html">API section</a>.

If you have many conditions that cannot easily be satisfied, consider switching
to the <a href="https://sdv.dev/SDV/user_guides/single_table/gaussian_copula.html">GaussianCopula model</a>, which is able to handle conditional
sampling more efficiently.


</div>
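
To make the rejection sampling idea in the note above more concrete, here is
a minimal sketch in plain Python. It is an illustration under simplifying
assumptions and not SDV's implementation: `draw_row` is a hypothetical
stand-in for the unconditioned model, and `max_tries` and
`batch_size_per_try` only mimic the roles that the note describes.

```python
import random


def sample_with_rejection(draw_row, condition, num_rows,
                          max_tries=100, batch_size_per_try=50):
    """Collect `num_rows` rows that match `condition` by rejection sampling.

    `draw_row` is a hypothetical zero-argument callable that returns one
    unconditioned row as a dict.
    """
    accepted = []
    for _ in range(max_tries):
        batch = [draw_row() for _ in range(batch_size_per_try)]
        # Keep only the rows whose values match every requested column.
        accepted.extend(
            row for row in batch
            if all(row[col] == val for col, val in condition.items())
        )
        if len(accepted) >= num_rows:
            return accepted[:num_rows]
    raise ValueError(
        f'Only {len(accepted)} of {num_rows} valid rows after {max_tries} tries'
    )


rng = random.Random(0)


def draw_row():
    # Toy generator with two independent categorical columns.
    return {'gender': rng.choice(['M', 'F']),
            'experience_years': rng.choice([0, 1, 2])}


rows = sample_with_rejection(
    draw_row, {'gender': 'M', 'experience_years': 0}, num_rows=5
)
```

The sketch also shows why rare conditions are expensive: the less likely a
matching row is, the more batches are discarded before enough valid rows
are collected.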

### How do I specify constraints?

If you look closely at the data you may notice that some properties were
not completely captured by the model. For example, you may have seen
that sometimes the model produces an `experience_years` number greater
than `0` while also indicating that `work_experience` is `False`. These
types of properties are what we call `Constraints` and can also be
handled using `SDV`. For further details about them please visit the
[Handling Constraints](04_Handling_Constraints.ipynb) guide.
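
Before defining a constraint, it can be useful to measure how often the
property is actually violated. The check below is a hypothetical sketch in
plain Python and does not use SDV's constraint classes; the rows are reduced
to the two relevant columns, with values taken from the conditional sampling
outputs shown earlier.

```python
def violates_experience_rule(row):
    """True when a row reports experience years without work experience."""
    return (not row['work_experience']) and row['experience_years'] > 0


# Three synthetic rows, keeping only the columns involved in the rule.
synthetic_rows = [
    {'work_experience': False, 'experience_years': 1},
    {'work_experience': False, 'experience_years': 0},
    {'work_experience': True, 'experience_years': 3},
]

violations = [row for row in synthetic_rows if violates_experience_rule(row)]
```

Only the first row ends up in `violations`, since it reports one year of
experience while `work_experience` is `False`.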

### Can I evaluate the Synthetic Data?

A very common question when someone starts using **SDV** to generate
synthetic data is: *"How good is the data that I just generated?"*

In order to answer this question, **SDV** has a collection of metrics
and tools that allow you to compare the *real* data that you provided
and the *synthetic* data that you generated using **SDV** or any other
tool.

You can read more about this in the
[Evaluating Synthetic Data Generators](05_Evaluating_Synthetic_Data_Generators.ipynb) guide.