# Single Table Modeling

**SDV** has special support for modeling single table datasets using a variety of models.

Currently, SDV implements:

* GaussianCopula: A tool to model multivariate distributions using [copula functions](https://en.wikipedia.org/wiki/Copula_%28probability_theory%29). Based on our [Copulas Library](https://github.com/sdv-dev/Copulas).
* CTGAN: A GAN-based Deep Learning data synthesizer that can generate synthetic tabular data with high fidelity. Based on our [CTGAN Library](https://github.com/sdv-dev/CTGAN).

## GaussianCopula

In this first part of the tutorial we will be using the GaussianCopula class to model the `users` table
from the toy dataset included in the **SDV** library.

### 1. Load the Data

In [1]:
from sdv import load_demo

users = load_demo()['users']

This will return a table with 4 fields:

* `user_id`: A unique identifier of the user.
* `country`: A 2 letter code of the country of residence of the user.
* `gender`: A single letter code, `M` or `F`, indicating the user gender. Note that this demo simulates the case where some users did not indicate the gender, which resulted in empty data values in some rows.
* `age`: The age of the user, in years.

In [2]:
users

Unnamed: 0,user_id,country,gender,age
0,0,USA,M,34
1,1,UK,F,23
2,2,ES,,44
3,3,UK,M,22
4,4,USA,F,54
5,5,DE,M,57
6,6,BG,F,45
7,7,ES,,41
8,8,FR,F,23
9,9,UK,,30


### 2. Prepare the model

In order to properly model our data we will need to provide some additional information to our model,
so let's prepare this information in some variables.

First, let's indicate that the `user_id` field in our table is the primary key, so we do not want our
model to attempt to learn it.

In [3]:
primary_key = 'user_id'

We will also want to anonymize the countries of residence of our users, to avoid disclosing such information.
Let's make a variable indicating that the `country` field needs to be anonymized using fake `country_codes`.

In [4]:
anonymize_fileds = {
    'country': 'contry_code'
}

The full list of categories supported corresponds to the `Faker` library
[provider names](https://faker.readthedocs.io/en/master/providers.html)

Once we have prepared the arguments for our model we are ready to import it, create an instance
and fit it to our data.

In [5]:
from sdv.tabular import GaussianCopula

model = GaussianCopula(
    primary_key=primary_key,
    anonymize_fileds=anonymize_fileds
)
model.fit(users)

2020-07-09 21:18:32,974 - INFO - table - Loading transformer CategoricalTransformer for field country
2020-07-09 21:18:32,975 - INFO - table - Loading transformer CategoricalTransformer for field gender
2020-07-09 21:18:32,975 - INFO - table - Loading transformer NumericalTransformer for field age
2020-07-09 21:18:32,991 - INFO - gaussian - Fitting GaussianMultivariate(distribution="GaussianUnivariate")


**Notice** how the model took care of transforming the different fields using the appropriate
Reversible Data Transforms to ensure that the data has a format that the GaussianMultivariate model
from the [copulas](https://github.com/sdv-dev/Copulas) library can handle.

### 3. Sample data from the fitted model

Once the modeling has finished you are ready to generate new synthetic data by calling the `sample` method
from our model.

In [6]:
sampled = model.sample()

This will return a table identical to the one which the model was fitted on, but filled with new data
which resembles the original one.

In [7]:
sampled

Unnamed: 0,user_id,country,gender,age
0,0,USA,M,38
1,1,UK,,23
2,2,USA,F,34
3,3,ES,,47
4,4,ES,F,29
5,5,UK,F,39
6,6,FR,,40
7,7,ES,M,38
8,8,ES,F,32
9,9,ES,F,36


Notice, as well that the number of rows generated by default corresponds to the number of rows that
the original table had, but that this number can be changed by simply passing it:

In [8]:
model.sample(5)

Unnamed: 0,user_id,country,gender,age
0,0,UK,F,48
1,1,USA,,38
2,2,USA,M,29
3,3,BG,M,22
4,4,USA,M,43


## CTGAN

In this second part of the tutorial we will be using the CTGAN model to learn the data from the
demo dataset called `census`, which is based on the [UCI Adult Census Dataset]('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data').

### 1. Load the Data

In [9]:
from sdv import load_demo

census = load_demo('census')['census']

2020-07-09 21:18:33,085 - INFO - __init__ - Loading table census


This will return a table with several rows of multiple data types:

In [10]:
census.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### 2. Prepare the model

In this case there is no primary key to setup and we will not be anonymizing anything, so the only
thing that we will pass to the CTGAN model is the number of epochs that we want it to perform when
it leanrs the data, which we will keep low to make this execution quick.

In [11]:
from sdv.tabular import CTGAN

model = CTGAN(epochs=10)



Once the instance is created, we can fit it to our data. Bear in mind that this process might take some
time to finish, especially on non-GPU enabled systems, so in this case we will be passing only a
subsample of the data to accelerate the process.

In [12]:
model.fit(census.sample(1000))

2020-07-09 21:18:33,488 - INFO - table - Loading transformer NumericalTransformer for field age
2020-07-09 21:18:33,489 - INFO - table - Loading transformer LabelEncodingTransformer for field workclass
2020-07-09 21:18:33,489 - INFO - table - Loading transformer NumericalTransformer for field fnlwgt
2020-07-09 21:18:33,490 - INFO - table - Loading transformer LabelEncodingTransformer for field education
2020-07-09 21:18:33,490 - INFO - table - Loading transformer NumericalTransformer for field education-num
2020-07-09 21:18:33,490 - INFO - table - Loading transformer LabelEncodingTransformer for field marital-status
2020-07-09 21:18:33,491 - INFO - table - Loading transformer LabelEncodingTransformer for field occupation
2020-07-09 21:18:33,491 - INFO - table - Loading transformer LabelEncodingTransformer for field relationship
2020-07-09 21:18:33,491 - INFO - table - Loading transformer LabelEncodingTransformer for field race
2020-07-09 21:18:33,492 - INFO - table - Loading transforme

Epoch 1, Loss G: 1.9512, Loss D: -0.0182
Epoch 2, Loss G: 1.9884, Loss D: -0.0663
Epoch 3, Loss G: 1.9710, Loss D: -0.1339
Epoch 4, Loss G: 1.8960, Loss D: -0.2061
Epoch 5, Loss G: 1.9155, Loss D: -0.3062
Epoch 6, Loss G: 1.9699, Loss D: -0.3906
Epoch 7, Loss G: 1.8614, Loss D: -0.5142
Epoch 8, Loss G: 1.8446, Loss D: -0.6448
Epoch 9, Loss G: 1.7619, Loss D: -0.7488
Epoch 10, Loss G: 1.6732, Loss D: -0.7961


### 3. Sample data from the fitted model

Once the modeling has finished you are ready to generate new synthetic data by calling the `sample` method
from our model just like we did with the GaussianCopula model.

In [13]:
sampled = model.sample()

This will return a table identical to the one which the model was fitted on, but filled with new data
which resembles the original one.

In [14]:
sampled.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,50,Local-gov,169719,1st-4th,9,Widowed,?,Husband,White,Male,114,8,38,Columbia,<=50K
1,32,?,152479,1st-4th,9,Never-married,Adm-clerical,Wife,Black,Male,-42,20,21,Jamaica,>50K
2,22,Private,69617,Bachelors,0,Separated,?,Husband,White,Male,6,11,38,Guatemala,<=50K
3,25,?,652858,10th,16,Married-civ-spouse,Handlers-cleaners,Not-in-family,White,Female,152,-27,39,Cuba,<=50K
4,43,Private,301956,Some-college,8,Married-civ-spouse,?,Wife,White,Male,-133,-12,39,India,<=50K
5,66,Private,401171,Prof-school,13,Separated,Protective-serv,Unmarried,Black,Female,-124,-1,40,Cuba,<=50K
6,52,Private,278399,Bachelors,12,Never-married,Prof-specialty,Unmarried,Other,Male,122567,-6,47,Columbia,<=50K
7,36,Federal-gov,229817,HS-grad,8,Married-AF-spouse,Farming-fishing,Not-in-family,White,Male,8,19,38,Portugal,>50K
8,27,Federal-gov,306972,Some-college,8,Never-married,Exec-managerial,Husband,Asian-Pac-Islander,Female,42144,3,39,Japan,>50K
9,28,Local-gov,416161,1st-4th,8,Divorced,Adm-clerical,Unmarried,White,Female,-349,1090,61,Guatemala,>50K


### 4. Evaluate how good the data is

Finally, we will use the evaluation framework included in SDV to obtain a metric of how
similar the sampled data is to the original one.

For this, we will simply import the `sdv.evaluation.evaluate` function and pass both
the synthetic and the real data to it.

In [15]:
import warnings
warnings.filterwarnings('ignore')

from sdv.evaluation import evaluate

evaluate(sampled, census)

-144.971907591418