TVAE Model
===========

In this guide we will go through a series of steps that will let you
discover functionalities of the `TVAE` model, including how to:

-   Create an instance of `TVAE`.
-   Fit the instance to your data.
-   Generate synthetic versions of your data.
-   Use `TVAE` to anonymize PII information.
-   Specify hyperparameters to improve the output quality.

What is TVAE?
--------------

The `sdv.tabular.TVAE` model is based on the VAE-based Deep Learning
data synthesizer which was presented at the NeurIPS 2019 conference in
the paper titled [Modeling Tabular data using Conditional
GAN](https://arxiv.org/abs/1907.00503).

Let's now discover how to learn a dataset and later on generate
synthetic data with the same format and statistical properties by using
the `TVAE` class from SDV.

Quick Usage
-----------

We will start by loading one of our demo datasets, the
`student_placements`, which contains information about MBA students that
applied for placements during the year 2020.

<div class="alert alert-warning">

**Warning**

In order to follow this guide you need to have `sdv` installed on your
system. If you have not done it yet, please install `sdv` now by
executing the command `pip install sdv` in a terminal.

</div>

In [1]:
from sdv.demo import load_tabular_demo

data = load_tabular_demo('student_placements')
data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17264,M,67.0,91.0,Commerce,58.0,Sci&Tech,False,0,55.0,Mkt&HR,58.8,27000.0,True,2020-07-23,2020-10-12,3.0
1,17265,M,79.33,78.33,Science,77.48,Sci&Tech,True,1,86.5,Mkt&Fin,66.28,20000.0,True,2020-01-11,2020-04-09,3.0
2,17266,M,65.0,68.0,Arts,64.0,Comm&Mgmt,False,0,75.0,Mkt&Fin,57.8,25000.0,True,2020-01-26,2020-07-13,6.0
3,17267,M,56.0,52.0,Science,52.0,Sci&Tech,False,0,66.0,Mkt&HR,59.43,,False,NaT,NaT,
4,17268,M,85.8,73.6,Commerce,73.3,Comm&Mgmt,False,0,96.8,Mkt&Fin,55.5,42500.0,True,2020-07-04,2020-09-27,3.0


As you can see, this table contains information about students which
includes, among other things:

-   Their id and gender
-   Their grades and specializations
-   Their work experience
-   The salary that they were offered
-   The duration and dates of their placement

You will notice that there is data with the following characteristics:

-   There are float, integer, boolean, categorical and datetime values.
-   There are some variables that have missing data. In particular, all
    the data related to the placement details is missing in the rows
    where the student was not placed.
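
These characteristics can be checked directly with pandas. The snippet below is a minimal sketch that rebuilds three of the rows shown above by hand (rows 0, 1 and 3, with only a few columns) so that it runs on its own. The `students` table and the variable names are illustrative and not part of SDV.

```python
import pandas as pd

# Hand-built stand-in for three rows of the demo table (illustrative only).
students = pd.DataFrame({
    'placed': [True, True, False],
    'salary': [27000.0, 20000.0, None],
    'start_date': pd.to_datetime(['2020-07-23', '2020-01-11', None]),
    'duration': [3.0, 3.0, None],
})

placement_columns = ['salary', 'start_date', 'duration']

# Number of missing values per column.
missing_counts = students.isna().sum()

# Placement details should be missing exactly where `placed` is False.
not_placed = students.loc[~students['placed'], placement_columns]
placed = students.loc[students['placed'], placement_columns]
all_missing_when_not_placed = bool(not_placed.isna().all().all())
none_missing_when_placed = bool(placed.notna().all().all())
```

Running the same checks on the full table and on the sampled table is a quick way to see whether the missing-data pattern was preserved.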

Let us use `TVAE` to learn this data and then sample synthetic data
about new students to see how well the model captures the characteristics
indicated above. In order to do this you will need to:

-   Import the `sdv.tabular.TVAE` class and create an instance of it.
-   Call its `fit` method passing our table.
-   Call its `sample` method indicating the number of synthetic rows
    that you want to generate.

In [2]:
from sdv.tabular import TVAE

model = TVAE()
model.fit(data)

<div class="alert alert-info">

**Note**

Notice that the model `fitting` process took care of transforming the
different fields using the appropriate [Reversible Data
Transforms](http://github.com/sdv-dev/RDT) to ensure that the data has a
format that the underlying TVAESynthesizer class can handle.

</div>

### Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic
data by calling the `sample` method from your model passing the number
of rows that you want to generate.

In [3]:
new_data = model.sample(num_rows=200)

This will return a table with the same columns as the one the model was
fitted on, but filled with new data that resembles the original.

In [4]:
new_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17424,F,68.414611,71.641484,Arts,59.592832,Sci&Tech,True,1,59.46208,Mkt&HR,60.946053,54304.286549,True,NaT,2020-11-16,3.0
1,17445,F,63.996638,63.546057,Arts,57.22512,Sci&Tech,False,1,69.676283,Mkt&HR,60.093712,56793.137834,False,NaT,2020-07-10,
2,17439,F,76.248241,48.159081,Arts,61.71832,Comm&Mgmt,True,1,70.135505,Mkt&HR,62.328108,,False,2020-03-12,NaT,12.0
3,17331,F,66.550416,65.489936,Arts,67.00688,Sci&Tech,True,1,61.662881,Mkt&HR,64.899771,58462.286843,False,NaT,2020-05-08,3.0
4,17411,F,81.503435,52.607624,Arts,70.031605,Others,True,2,80.296772,Mkt&HR,59.881714,,False,NaT,2020-04-25,3.0


<div class="alert alert-info">

**Note**

There are a number of other parameters in this method that you can use to
optimize the process of generating synthetic data. Use ``output_file_path``
to directly write results to a CSV file, ``batch_size`` to break up sampling
into smaller pieces and track their progress, and ``randomize_samples`` to
determine whether to generate the same synthetic data every time.
See the <a href=https://sdv.dev/SDV/api_reference/tabular/api/sdv.tabular.ctgan.TVAE.sample>API Section</a>
for more details.

</div>

### Save and Load the model

In many scenarios it will be convenient to generate synthetic versions
of your data directly in systems that do not have access to the original
data source. For example, you may want to generate testing data on
the fly inside a testing environment that does not have access to your
production database. In these scenarios, fitting the model with real
data every time that you need to generate new data is not feasible, so you
will need to fit a model in your production environment, save the fitted
model into a file, send this file to the testing environment and then
load it there to be able to `sample` from it.

Let's see how this process works.

#### Save and share the model

Once you have fitted the model, all you need to do is call its `save`
method passing the name of the file in which you want to save the model.
Note that the extension of the filename is not relevant, but we will be
using the `.pkl` extension to highlight that the serialization protocol
used is [pickle](https://docs.python.org/3/library/pickle.html).

In [5]:
model.save('my_model.pkl')

This will have created a file called `my_model.pkl` in the same
directory in which you are running SDV.

<div class="alert alert-info">

**Important**

If you inspect the generated file you will notice that its size is much
smaller than the size of the data that you used to generate it. This is
because the serialized model contains **no information about the
original data**, other than the parameters it needs to generate
synthetic versions of it. This means that you can safely share this
`my_model.pkl` file without the risk of disclosing any of your real
data!

</div>

#### Load the model and generate new data

The file you just generated can be sent over to the system where the
synthetic data will be generated. Once it is there, you can load it
using the `TVAE.load` method, and then you are ready to sample new data
from the loaded instance:

In [6]:
loaded = TVAE.load('my_model.pkl')
new_data = loaded.sample(num_rows=200)

<div class="alert alert-warning">

**Warning**

Notice that the system where the model is loaded also needs to have
`sdv` installed, otherwise it will not be able to load the model and
use it.

</div>
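
The fit, save, ship and load workflow described in this section can be illustrated with the standard library alone. The sketch below is not how SDV implements `save` and `load`; the toy "model" is a hypothetical dictionary holding a mean and a standard deviation, and all function names are made up for the example. It only shows the general pattern of pickling learned parameters in one place and sampling from them in another.

```python
import os
import pickle
import random
import statistics
import tempfile


def fit_toy_model(values):
    """Hypothetical 'model': keeps summary statistics, not the values."""
    return {'mean': statistics.mean(values), 'std': statistics.stdev(values)}


def sample_toy_model(params, num_rows, seed=None):
    """Draw new values from a normal distribution with the learned parameters."""
    rng = random.Random(seed)
    return [rng.gauss(params['mean'], params['std']) for _ in range(num_rows)]


def save_model(params, path):
    with open(path, 'wb') as output:
        pickle.dump(params, output)


def load_model(path):
    # Only unpickle files from sources you trust: unpickling can run code.
    with open(path, 'rb') as source:
        return pickle.load(source)


# "Production" side: fit on the real values and write the model to disk.
model_path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
save_model(fit_toy_model([58.8, 66.28, 57.8, 59.43, 55.5]), model_path)

# "Testing" side: only the file is needed to generate new values.
loaded_params = load_model(model_path)
toy_samples = sample_toy_model(loaded_params, num_rows=5, seed=0)
```

Because pickle can execute code while loading, a model file should only be loaded when it comes from a trusted source.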

### Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo
data is that there is a `student_id` column which acts as the primary
key of the table, and which is supposed to have unique values. Indeed,
if we look at the number of times that each value appears, we see that
all of them appear at most once:

In [7]:
data.student_id.value_counts().max()

1

However, if we look at the synthetic data that we generated, we observe
that there are some values that appear more than once:

In [8]:
new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
5,17437,F,66.836992,65.193488,Arts,59.649978,Comm&Mgmt,False,2,66.727716,Mkt&HR,62.534507,,True,2020-07-21,2020-08-05,12.0
6,17437,F,78.689217,69.817104,Arts,58.566803,Others,True,2,57.886596,Mkt&HR,61.934658,46208.655972,False,NaT,2020-04-18,6.0
31,17437,F,79.996405,71.577193,Arts,57.769377,Comm&Mgmt,True,0,79.278036,Mkt&HR,58.47408,65732.892841,False,NaT,2020-06-28,3.0
73,17437,F,74.133228,65.295345,Arts,54.172868,Comm&Mgmt,False,0,61.904124,Mkt&HR,58.424795,71550.465337,True,2020-07-02,2020-08-16,
158,17437,F,58.058312,61.244965,Arts,59.314837,Comm&Mgmt,True,1,55.661294,Mkt&HR,70.618268,,False,2020-06-25,NaT,3.0
168,17437,F,64.98761,60.283189,Commerce,56.867534,Others,True,1,60.031004,Mkt&HR,63.187485,80071.635841,False,NaT,2020-04-11,3.0
181,17437,F,70.817137,62.007954,Arts,61.707167,Sci&Tech,True,1,74.563869,Mkt&HR,62.359456,,False,NaT,2020-08-29,3.0
199,17437,F,60.928689,62.697369,Arts,60.498766,Comm&Mgmt,True,1,67.254061,Mkt&HR,61.198041,47928.793404,False,NaT,2020-04-03,3.0


This happens because the model was never told that `student_id` has to
be unique, so sooner or later it generates repeated values. In order to
solve this, we can pass the argument `primary_key` to our model when we
create it, indicating the name of the column that is the primary key of
the table.

In [9]:
model = TVAE(
    primary_key='student_id'
)
model.fit(data)
new_data = model.sample(200)
new_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,F,80.836944,61.458954,Science,61.482424,Others,True,1,75.59666,Mkt&HR,61.303812,67926.506847,False,NaT,2020-07-29,12.0
1,1,M,72.08986,70.697692,Arts,66.513671,Others,True,1,90.165485,Mkt&HR,70.958715,65853.866382,False,NaT,2020-11-24,
2,2,F,70.32002,69.161934,Science,59.340073,Comm&Mgmt,False,0,79.847505,Mkt&Fin,66.973222,67639.732902,True,NaT,NaT,12.0
3,3,M,79.842479,60.995107,Science,60.857316,Others,True,1,89.523483,Mkt&HR,65.962838,75646.794029,False,NaT,2020-06-26,12.0
4,4,F,78.330676,64.967778,Science,64.364296,Comm&Mgmt,True,1,75.971503,Mkt&HR,70.256314,63900.07709,False,NaT,2020-09-21,


As a result, the model will learn that this column must be unique and
generate a unique sequence of values for the column:

In [10]:
new_data.student_id.value_counts().max()

1
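
The uniqueness checks used in this section can be wrapped in a small helper. This is a minimal pandas sketch; `duplicated_keys` is a hypothetical name and the tiny `synthetic` table is made up so that the snippet runs on its own.

```python
import pandas as pd


def duplicated_keys(table, key):
    """Return the key values that appear more than once, with their counts."""
    counts = table[key].value_counts()
    return counts[counts > 1]


# Made-up synthetic rows in which one student_id is repeated.
synthetic = pd.DataFrame({
    'student_id': [17437, 17424, 17437, 17445],
    'gender': ['F', 'F', 'F', 'M'],
})

repeated = duplicated_keys(synthetic, 'student_id')
is_valid_key = bool(synthetic['student_id'].is_unique)
```

An empty result from `duplicated_keys` means the column can safely act as a primary key.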

### Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally
Identifiable Information which we cannot disclose. In these cases, we
will want our Tabular Models to replace the information within these
fields with fake, simulated data that looks similar to the real one but
does not contain any of the original values.

Let's load a new dataset that contains a PII field, the
`student_placements_pii` demo, and try to generate synthetic versions of
it that do not contain any of the PII fields.

<div class="alert alert-info">

**Note**

The `student_placements_pii` dataset is a modified version of the
`student_placements` dataset with one new field, `address`, which
contains PII information about the students. Notice that this additional
`address` field has been simulated and does not correspond to data from
the real users.

</div>

In [11]:
data_pii = load_tabular_demo('student_placements_pii')
data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17264,"70304 Baker Turnpike\nEricborough, MS 15086",M,67.0,91.0,Commerce,58.0,Sci&Tech,False,0,55.0,Mkt&HR,58.8,27000.0,True,2020-07-23,2020-10-12,3.0
1,17265,"805 Herrera Avenue Apt. 134\nMaryview, NJ 36510",M,79.33,78.33,Science,77.48,Sci&Tech,True,1,86.5,Mkt&Fin,66.28,20000.0,True,2020-01-11,2020-04-09,3.0
2,17266,"3702 Bradley Island\nNorth Victor, FL 12268",M,65.0,68.0,Arts,64.0,Comm&Mgmt,False,0,75.0,Mkt&Fin,57.8,25000.0,True,2020-01-26,2020-07-13,6.0
3,17267,Unit 0879 Box 3878\nDPO AP 42663,M,56.0,52.0,Science,52.0,Sci&Tech,False,0,66.0,Mkt&HR,59.43,,False,NaT,NaT,
4,17268,"96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...",M,85.8,73.6,Commerce,73.3,Comm&Mgmt,False,0,96.8,Mkt&Fin,55.5,42500.0,True,2020-07-04,2020-09-27,3.0


If we use our tabular model on this new data we will see how the
synthetic data that it generates discloses the addresses from the real
students:

In [12]:
model = TVAE(
    primary_key='student_id',
)
model.fit(data_pii)
new_data_pii = model.sample(200)
new_data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"32455 Michael Row Apt. 500\nWest Timothymouth,...",M,70.793376,46.268669,Science,67.904729,Others,False,1,91.290425,Mkt&Fin,70.168262,32288.003137,False,NaT,NaT,6.0
1,1,"6822 Rebecca Unions Apt. 560\nHunterberg, SC 4...",M,74.131613,45.980113,Science,70.932366,Others,False,1,96.880915,Mkt&HR,62.215164,,True,2020-01-12,NaT,6.0
2,2,"32455 Michael Row Apt. 500\nWest Timothymouth,...",M,73.756645,46.738071,Science,65.88212,Others,False,1,90.90148,Mkt&Fin,62.803208,,True,NaT,NaT,12.0
3,3,"2707 Maria Parkways Apt. 743\nAlisonview, TN 8...",M,64.737157,44.516361,Science,67.729363,Others,False,1,86.184814,Mkt&Fin,63.33121,,True,2020-01-05,NaT,6.0
4,4,"3702 Bradley Island\nNorth Victor, FL 12268",M,59.670929,54.104045,Science,68.47955,Others,False,0,68.614568,Mkt&Fin,61.825971,,False,NaT,NaT,3.0


More specifically, we can see how all the addresses that have been
generated actually come from the original dataset:

In [13]:
new_data_pii.address.isin(data_pii.address).sum()

200

In order to solve this, we can pass an additional argument
`anonymize_fields` to our model when we create the instance. This
`anonymize_fields` argument will need to be a dictionary that contains:

-   The name of the field that we want to anonymize.
-   The category of the field that we want to use when we generate fake
    values for it.

The complete list of possible categories can be seen in the [Faker
Providers](https://faker.readthedocs.io/en/master/providers.html) page,
and it contains a huge list of concepts such as:

-   name
-   address
-   country
-   city
-   ssn
-   credit_card_number
-   credit_card_expire
-   credit_card_security_code
-   email
-   telephone
-   \...

In this case, since the field is an address, we will pass a
dictionary indicating the category `address`:

In [14]:
model = TVAE(
    primary_key='student_id',
    anonymize_fields={
        'address': 'address'
    }
)
model.fit(data_pii)

As a result, we can see how the real `address` values have been replaced
by other fake addresses:

In [15]:
new_data_pii = model.sample(200)
new_data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"766 Melissa Flats Apt. 855\nCraigshire, OK 33527",F,79.487658,59.494256,Science,65.993907,Sci&Tech,True,0,73.359967,Mkt&Fin,62.199366,31736.021959,False,2020-06-30,NaT,
1,1,"2113 Ford Village\nBennetthaven, FL 94467",M,77.629652,69.010061,Science,71.200462,Comm&Mgmt,False,0,66.420962,Mkt&HR,62.207652,23978.253882,True,2020-08-03,NaT,
2,2,"7823 Williams Islands Suite 529\nNorth Brenda,...",M,73.248937,67.215449,Arts,67.968827,Sci&Tech,False,1,86.336591,Mkt&Fin,65.613219,26982.374117,True,2020-03-13,NaT,6.0
3,3,"763 Buck Pass Apt. 667\nKaitlynland, AZ 59218",M,71.912633,66.500061,Arts,73.184308,Sci&Tech,True,2,78.848483,Mkt&Fin,68.436962,34107.884356,True,2020-03-11,NaT,6.0
4,4,"787 Carol Dale Apt. 918\nThompsonborough, UT 7...",F,77.45302,57.695257,Arts,71.257841,Sci&Tech,False,1,81.992638,Mkt&Fin,61.948024,,True,2020-07-26,NaT,6.0


Which means that none of the original addresses can be found in the
sampled data:

In [16]:
data_pii.address.isin(new_data_pii.address).sum()

0

As we can see, in this case these modifications changed the obtained
results slightly, but they did not introduce dramatic changes in the
performance.
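
The overlap checks used in this section can be generalized into a reusable leakage measure. The helper below is a hedged sketch: `leaked_fraction` is a hypothetical name, not an SDV function, and the three small tables are made up for illustration.

```python
import pandas as pd


def leaked_fraction(real, synthetic, column):
    """Fraction of synthetic values in `column` that also occur in the real table."""
    return float(synthetic[column].isin(real[column]).mean())


real = pd.DataFrame({
    'address': ['70304 Baker Turnpike', '3702 Bradley Island'],
})
# Synthetic table that reuses real addresses.
leaky = pd.DataFrame({
    'address': ['3702 Bradley Island', '3702 Bradley Island'],
})
# Synthetic table with only new, fake addresses.
anonymized = pd.DataFrame({
    'address': ['766 Melissa Flats', '2113 Ford Village'],
})

leak_before = leaked_fraction(real, leaky, 'address')
leak_after = leaked_fraction(real, anonymized, 'address')
```

A value of `0.0` means that no synthetic value of the column was copied from the real table.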

### Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional distribution using the `TVAE` model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the `sample_conditions` method as a list of `sdv.sampling.Condition` objects or to the `sample_remaining_columns` method as a dataframe. 

When specifying a `sdv.sampling.Condition` object, we can pass in the desired conditions as a dictionary, as well as specify the number of desired rows for that condition.

In [17]:
from sdv.sampling import Condition

condition = Condition({
    'gender': 'M'
}, num_rows=5)

model.sample_conditions(conditions=[condition])

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,8482 David Views Suite 838\nNew Russellborough...,M,81.281643,74.569253,Science,73.191228,Comm&Mgmt,False,1,72.705225,Mkt&HR,67.111455,31356.579862,True,2020-07-22,NaT,
1,4,192 Ayala Spring Suite 108\nLake Ashleychester...,M,75.689359,60.77935,Arts,62.16342,Sci&Tech,False,1,72.342272,Mkt&Fin,63.902414,,True,2020-03-16,NaT,12.0
2,2,"727 Kevin Track Apt. 345\nDavidberg, SC 48344",M,61.642908,62.140472,Arts,75.200972,Comm&Mgmt,True,0,72.145328,Mkt&Fin,64.551245,28553.476713,False,2020-03-16,NaT,6.0
3,3,"56355 Lindsey Centers Suite 549\nNorth Henry, ...",M,80.859773,69.296383,Arts,67.654884,Comm&Mgmt,True,0,79.49682,Mkt&Fin,63.06124,,False,2019-12-31,2021-01-17,12.0
4,4,"417 Martin Forge Suite 481\nWest Danielleside,...",M,59.101625,65.83545,Science,68.956885,Comm&Mgmt,True,0,71.982152,Mkt&Fin,54.381311,24063.474509,False,2020-07-18,NaT,


It's also possible to condition on multiple columns, such as `gender = M` and `experience_years = 0`.

In [18]:
condition = Condition({
    'gender': 'M',
    'experience_years': 0
}, num_rows=5)

model.sample_conditions(conditions=[condition])

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"6908 Edward Dale Suite 840\nAnnefurt, UT 55654",M,79.015134,65.037219,Science,65.295668,Sci&Tech,False,0,77.594349,Mkt&Fin,67.379004,39518.748938,True,2020-05-07,NaT,6.0
1,3,Unit 7258 Box 0921\nDPO AE 39679,M,62.004899,69.325425,Science,64.828263,Sci&Tech,False,0,82.768434,Mkt&Fin,59.017437,,True,2020-03-14,NaT,12.0
2,0,"8720 Fox Burg Apt. 524\nLake Robin, OR 88270",M,76.581409,53.273323,Arts,71.56752,Others,False,0,74.314734,Mkt&Fin,67.786817,,True,2020-07-15,NaT,12.0
3,2,"80276 Tonya Brooks\nSouth Aprilmouth, SD 49356",M,77.297712,62.496331,Science,66.931573,Comm&Mgmt,False,0,65.88653,Mkt&HR,63.886734,,True,2020-08-09,NaT,
4,4,"8544 Tucker Canyon Suite 636\nJonstad, ME 80045",M,68.482353,63.554968,Science,63.696645,Comm&Mgmt,True,0,82.833582,Mkt&Fin,65.163352,,True,2020-06-09,NaT,6.0


In the `sample_remaining_columns` method, `conditions` is passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, sorted in the same order. Since the model already knows how many samples to generate, passing it as a parameter is unnecessary. For example, if we want to generate three samples where `gender = M` and three samples with `gender = F`, we can do the following: 

In [19]:
import pandas as pd 

conditions = pd.DataFrame({
    'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
})
model.sample_remaining_columns(conditions)

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,Unit 9223 Box 6906\nDPO AA 72225,M,77.689366,66.535651,Science,63.463402,Sci&Tech,False,0,80.814893,Mkt&Fin,67.120832,29609.077457,False,2020-03-22,NaT,12.0
1,2,"38807 Herrera Shoals\nDanielberg, ME 02069",M,76.437568,68.093591,Arts,68.619631,Sci&Tech,False,1,76.972433,Mkt&HR,65.301862,,True,2020-03-15,NaT,3.0
2,0,"275 Franco Drive Suite 209\nLorifort, DE 95588",M,53.343088,59.096284,Arts,74.814227,Sci&Tech,True,0,76.420595,Mkt&Fin,68.945166,,True,2020-07-22,NaT,12.0
3,1,Unit 1802 Box 8963\nDPO AE 94192,F,70.946908,60.552811,Arts,64.527244,Comm&Mgmt,True,0,71.335284,Mkt&Fin,61.68707,28949.984302,True,2020-06-16,2020-11-17,12.0
4,0,"5231 Carol Turnpike\nDavisshire, TN 31726",F,75.251684,68.920104,Arts,69.797003,Sci&Tech,False,0,77.697857,Mkt&Fin,67.314591,,True,2020-03-18,NaT,12.0
5,1,"PSC 0540, Box 1735\nAPO AP 31208",F,81.770787,62.876001,Science,67.616103,Sci&Tech,False,0,70.124889,Mkt&Fin,69.354579,,False,2020-03-13,NaT,


`TVAE` also supports conditioning on continuous values, as long as the values are within the range of seen numbers. For example, if all the values of the dataset are within 0 and 1, `TVAE` will not be able to set this value to 1000.

In [20]:
condition = Condition({
    'degree_perc': 70.0
}, num_rows=5)

model.sample_conditions(conditions=[condition])

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,88528 Katherine Island Suite 174\nLake Charles...,M,75.565847,67.383363,Arts,70.0,Comm&Mgmt,False,0,80.595135,Mkt&Fin,66.783025,28508.275215,True,2020-03-22,NaT,
1,4,"582 Sampson Crossroad\nMarshberg, WI 91845",M,72.803198,54.972422,Science,70.0,Sci&Tech,False,2,62.181067,Mkt&Fin,63.262314,31581.254943,True,NaT,2020-04-28,
2,6,"39409 Luis Gateway Apt. 632\nNorth Tiffany, NE...",F,58.44654,73.072237,Science,70.0,Sci&Tech,False,1,71.204364,Mkt&Fin,56.913063,33057.754832,False,2020-04-21,NaT,
3,8,"6081 Mark Brooks Suite 682\nSaramouth, KS 92058",M,73.113137,61.544769,Arts,70.0,Sci&Tech,False,0,75.747652,Mkt&Fin,66.188572,34970.2088,True,2020-03-18,NaT,12.0
4,10,"9583 Rachael Street\nPort Nicole, DC 63556",F,81.429691,69.620078,Science,70.0,Sci&Tech,False,0,73.838952,Mkt&Fin,66.383343,,True,2020-03-25,NaT,12.0
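
Since conditioning on a continuous value only works inside the observed range, it can help to check the value before sampling. The helper below is an illustrative pandas sketch; `within_observed_range` is a hypothetical name and the `observed` table only contains the five `degree_perc` values shown at the top of this guide.

```python
import pandas as pd


def within_observed_range(table, column, value):
    """Return True if `value` lies between the column's observed min and max."""
    return bool(table[column].min() <= value <= table[column].max())


# The degree_perc values of the first five demo rows.
observed = pd.DataFrame({'degree_perc': [58.0, 77.48, 64.0, 52.0, 73.3]})

can_use_70 = within_observed_range(observed, 'degree_perc', 70.0)
can_use_1000 = within_observed_range(observed, 'degree_perc', 1000.0)
```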


<div class="alert alert-info">

**Note**
    
Currently, conditional sampling works through a rejection sampling process, where
rows are sampled repeatedly until one that satisfies the conditions is found.
If you are not able to sample enough valid rows, try increasing ``max_tries``
or ``batch_size_per_try``.
More information about these parameters can be found in the
<a href=https://sdv.dev/SDV/api_reference/tabular/api/sdv.tabular.ctgan.TVAE.sample_conditions.html> API section</a>.

If you have many conditions that cannot easily be satisfied, consider switching
to the <a href=https://sdv.dev/SDV/user_guides/single_table/gaussian_copula.html>GaussianCopula model</a>, which is able to handle conditional
sampling more efficiently.


</div>
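
To make the rejection sampling idea concrete, here is a minimal standard-library sketch. It is not SDV's implementation: `sample_with_rejection` and `draw_row` are hypothetical names, the default values are arbitrary, and only the parameter names `max_tries` and `batch_size_per_try` are borrowed from the note above.

```python
import random


def sample_with_rejection(draw_row, condition, num_rows,
                          max_tries=100, batch_size_per_try=50):
    """Collect `num_rows` rows that match `condition` by repeated sampling."""
    accepted = []
    for _ in range(max_tries):
        # Draw a batch of unconditional rows and keep only the matching ones.
        batch = [draw_row() for _ in range(batch_size_per_try)]
        accepted.extend(
            row for row in batch
            if all(row[name] == value for name, value in condition.items())
        )
        if len(accepted) >= num_rows:
            return accepted[:num_rows]
    raise ValueError(
        f'Only {len(accepted)} of {num_rows} valid rows found; '
        'try a larger max_tries or batch_size_per_try.'
    )


rng = random.Random(0)


def draw_row():
    # Hypothetical unconditional sampler standing in for a fitted model.
    return {
        'gender': rng.choice(['M', 'F']),
        'experience_years': rng.choice([0, 1, 2]),
    }


rows = sample_with_rejection(
    draw_row, {'gender': 'M', 'experience_years': 0}, num_rows=5
)
```

The sketch also shows why rare conditions are expensive: the less likely a condition is, the more rows are drawn and discarded before enough valid ones are found.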

### How do I specify constraints?

If you look closely at the data you may notice that some properties were
not completely captured by the model. For example, you may have seen
that sometimes the model produces an `experience_years` number greater
than `0` while also indicating that `work_experience` is `False`. These
types of properties are what we call `Constraints` and can also be
handled using `SDV`. For further details about them please visit the
[Handling Constraints](04_Handling_Constraints.ipynb) guide.
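
Before adding a constraint, you can measure how often the property is violated in a table. The snippet below is a small pandas sketch on made-up rows; it does not use the SDV constraints API.

```python
import pandas as pd

# Made-up synthetic rows; row 1 is inconsistent.
synthetic = pd.DataFrame({
    'work_experience': [True, False, False, True],
    'experience_years': [1, 2, 0, 0],
})

# Rows that claim no work experience but a positive number of years.
violations = synthetic[
    ~synthetic['work_experience'] & (synthetic['experience_years'] > 0)
]
violation_rate = len(violations) / len(synthetic)
```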

### Can I evaluate the Synthetic Data?

A very common question when someone starts using **SDV** to generate
synthetic data is: *"How good is the data that I just generated?"*

In order to answer this question, **SDV** has a collection of metrics
and tools that allow you to compare the *real* data that you provided and the
*synthetic* data that you generated using **SDV** or any other tool.

You can read more about this in the [Evaluating Synthetic Data Generators](
05_Evaluating_Synthetic_Data_Generators.ipynb) guide.
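
As a rough first look before using those metrics, the two tables can be compared column by column with plain pandas. The sketch below is not an SDV metric: the `real` and `synthetic` tables are made up, and it only compares category frequencies and a column mean.

```python
import pandas as pd

real = pd.DataFrame({
    'gender': ['M', 'M', 'M', 'F'],
    'mba_perc': [58.8, 66.28, 57.8, 59.43],
})
synthetic = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'F'],
    'mba_perc': [60.9, 60.1, 62.3, 64.9],
})

# Relative frequency of each category in both tables, side by side.
comparison = pd.DataFrame({
    'real': real['gender'].value_counts(normalize=True),
    'synthetic': synthetic['gender'].value_counts(normalize=True),
}).fillna(0.0)

# Absolute difference between the means of a numerical column.
mean_gap = abs(real['mba_perc'].mean() - synthetic['mba_perc'].mean())
```

Large gaps in either comparison suggest that the model did not capture that column well.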