# Single Table Modeling

**SDV** supports modeling single table datasets. It provides features that make it easy to learn a model
from a table and synthesize new datasets. Some important features of `sdv.tabular` include:

* Support for tables with a primary key.
* Support for anonymizing certain fields, such as addresses, emails, phone numbers, names and other PII.
  We use the `Faker` library for this. The full list of supported categories corresponds to the `Faker` library
  [provider names](https://faker.readthedocs.io/en/master/providers.html).
* Support for a number of different data types: categorical, numerical, discrete-ordinal and datetimes.
* Support for multiple types of statistical and deep learning models:
  * GaussianCopula: A tool to model multivariate distributions using [copula functions](
    https://en.wikipedia.org/wiki/Copula_%28probability_theory%29). Based on our [Copulas Library](
    https://github.com/sdv-dev/Copulas).
  * CTGAN: A GAN-based Deep Learning data synthesizer that can generate synthetic tabular data with high 
    fidelity. Based on our [CTGAN Library](https://github.com/sdv-dev/CTGAN).

**Note:** We are adding a number of additional features and functionality to make it easy to model single table datasets. For example, we are adding ways for users to add inter-column constraints. If you find a use case that we do not support, consider suggesting it and adding examples here.

## Quick usage

Let's consider a dataset from our demo datasets. 

In [1]:
import warnings
warnings.simplefilter('ignore')

from sdv import load_demo

users = load_demo()['users']

This will return a table with 4 fields:

* `user_id`: A unique identifier of the user.
* `country`: A 2 letter code of the country of residence of the user.
* `gender`: A single letter code, `M` or `F`, indicating the user's gender. Note that this demo simulates the case where some users did not indicate their gender, which results in empty values in some rows.
* `age`: The age of the user, in years.

In [2]:
users

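| | user_id | country | gender | age |
|---|---|---|---|---|
| 0 | 0 | US | M | 34 |
| 1 | 1 | UK | F | 23 |
| 2 | 2 | ES | | 44 |
| 3 | 3 | UK | M | 22 |
| 4 | 4 | US | F | 54 |
| 5 | 5 | DE | M | 57 |
| 6 | 6 | BG | F | 45 |
| 7 | 7 | ES | | 41 |
| 8 | 8 | FR | F | 23 |
| 9 | 9 | UK | | 30 |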


We notice that there are some additional properties in this dataset:

* First, the `user_id` field in our table is the `primary_key` and each row has a `unique` value, so we do not
  want our model to attempt to learn it.
* Second, let's say we want to `anonymize` the countries of residence of our `users`, to avoid disclosing
  such information.
* Third, we notice that there is missing data in the `gender` column.
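The first and third observations can be checked directly on the rows shown above. The following is a minimal sketch in plain Python, with the ten demo rows copied by hand into a list of tuples (the names `users_rows`, `is_unique_key` and `missing_gender` are only illustrative and are not part of SDV):

```python
# Rows copied from the demo table shown above; None marks an empty gender value
users_rows = [
    (0, 'US', 'M', 34),
    (1, 'UK', 'F', 23),
    (2, 'ES', None, 44),
    (3, 'UK', 'M', 22),
    (4, 'US', 'F', 54),
    (5, 'DE', 'M', 57),
    (6, 'BG', 'F', 45),
    (7, 'ES', None, 41),
    (8, 'FR', 'F', 23),
    (9, 'UK', None, 30),
]

user_ids = [row[0] for row in users_rows]
genders = [row[2] for row in users_rows]

# A primary key candidate must have one distinct value per row
is_unique_key = len(set(user_ids)) == len(user_ids)

# Count the rows where the gender was left empty
missing_gender = sum(1 for gender in genders if gender is None)

print(is_unique_key)   # True
print(missing_gender)  # 3
```

On the `users` DataFrame itself, the equivalent checks would be `users['user_id'].is_unique` and `users['gender'].isnull().sum()`.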

Let us use the `GaussianCopula` to model this data and then sample synthetic data from the model. In order
to properly model our data we will need to provide some additional information to our model. Once we have
prepared the arguments for our model we are ready to import it, create an instance and fit it to our data.

In [3]:
from sdv.tabular import GaussianCopula

model = GaussianCopula(
    primary_key='user_id',
    anonymize_fields={'country':'country_code'}
)
model.fit(users)

2020-07-23 22:15:16,942 - INFO - table - Loading transformer OneHotEncodingTransformer for field country
2020-07-23 22:15:16,942 - INFO - table - Loading transformer OneHotEncodingTransformer for field gender
2020-07-23 22:15:16,943 - INFO - table - Loading transformer NumericalTransformer for field age
2020-07-23 22:15:16,979 - INFO - gaussian - Fitting GaussianMultivariate()


**Notice** that the model `fitting` process took care of transforming the different fields using the
appropriate [Reversible Data Transforms](http://github.com/sdv-dev/RDT) to ensure that the data has
a format that the GaussianMultivariate model from the [copulas](https://github.com/sdv-dev/Copulas)
library can handle.
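To give an intuition of what a Gaussian copula does with the transformed data, here is a deliberately simplified sketch for two numerical columns using only the standard library. It is not how the Copulas library is implemented: it assumes empirical (rank-based) marginals instead of fitted distributions, the function names are hypothetical, and the two toy columns are invented for this illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_scores(values):
    """Map each value to a standard-normal score through its empirical rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n stays strictly inside (0, 1), so inv_cdf is defined
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def empirical_quantile(sorted_values, u):
    """Return the observed value located at quantile u of a sorted list."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def sample_pairs(xs, ys, num_rows, seed=0):
    """Learn the correlation in normal space, then sample new (x, y) pairs."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(xs), normal_scores(ys))
    sorted_x, sorted_y = sorted(xs), sorted(ys)
    rows = []
    for _ in range(num_rows):
        # Draw two standard normals with correlation rho
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        # Map back to the original scale through the empirical quantiles
        rows.append((
            empirical_quantile(sorted_x, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_y, STD_NORMAL.cdf(z2)),
        ))
    return rho, rows


# Hypothetical toy columns, invented for this illustration only
ages = [22, 23, 30, 34, 41, 44, 45, 54, 57, 61]
incomes = [18, 25, 24, 31, 40, 38, 47, 52, 50, 66]

rho, synthetic = sample_pairs(ages, incomes, num_rows=5)
print(round(rho, 3))
print(synthetic)
```

The key idea is that the dependency between columns is learned in a space where every column looks normally distributed, while the shape of each individual column is handled separately when mapping back.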

## Generate synthetic data from the model

Once the modeling has finished, you are ready to generate new synthetic data by calling the `sample` method
of your model.

In [4]:
sampled = model.sample(5)

This will return a table with the same structure as the one the model was fitted on, but filled with new data
which resembles the original one.

In [5]:
sampled

| | user_id | country | gender | age |
|---|---|---|---|---|
| 0 | 0 | GQ | M | 25 |
| 1 | 1 | BE | M | 57 |
| 2 | 2 | BE | F | 33 |
| 3 | 3 | SN | F | 37 |
| 4 | 4 | GQ | M | 37 |
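As a quick sanity check on the sampled rows shown above, we can verify that no original country value appears in the sample and that the generated `user_id` values are unique. This is a small sketch in plain Python, with the values copied by hand from the two tables; the variable names are only illustrative.

```python
# Distinct countries that appear in the original demo table
original_countries = {'US', 'UK', 'ES', 'DE', 'BG', 'FR'}

# Rows copied from the sampled table shown above
sampled_rows = [
    (0, 'GQ', 'M', 25),
    (1, 'BE', 'M', 57),
    (2, 'BE', 'F', 33),
    (3, 'SN', 'F', 37),
    (4, 'GQ', 'M', 37),
]

sampled_countries = {row[1] for row in sampled_rows}
sampled_ids = [row[0] for row in sampled_rows]

# Country codes present in both the original and the sampled data
overlap = sampled_countries & original_countries

# The primary key must still be unique in the sampled table
ids_are_unique = len(set(sampled_ids)) == len(sampled_ids)

print(overlap)         # set()
print(ids_are_unique)  # True
```

Note that an overlap would not necessarily indicate a leak: since the anonymized values are random country codes, one of them may coincide with an original value by chance.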


**Note:** You can control the number of rows by passing the desired number to
`model.sample(<num_rows>)`. To test this, try `model.sample(10000)`, keeping in mind that the
original table only had 10 rows.

## Let's consider using CTGAN

In this second part of the tutorial we will use the CTGAN model to learn the data from the
demo dataset called `census`, which is based on the [UCI Adult Census Dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data).

In [6]:
from sdv import load_demo

census = load_demo('census')['census']

2020-07-23 22:15:18,538 - INFO - __init__ - Loading table census


This will return a table with several columns of different data types:

In [7]:
census.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In this case there is no `primary_key` to set up and we will not be `anonymizing` anything, so the
only thing that we will pass to the `CTGAN` model is the number of `epochs` that we want it to
perform when it learns the data, which we will keep low to make this execution quick.

In [8]:
from sdv.tabular import CTGAN

model = CTGAN(epochs=10)

Once the instance is created, we can fit it to our data. 

**Note** that this process might take some time to finish, especially on non-GPU enabled systems,
so in this case we will be passing only a `subsample` of the data to accelerate the process.

In [9]:
model.fit(census.sample(1000))

2020-07-23 22:15:18,944 - INFO - table - Loading transformer NumericalTransformer for field age
2020-07-23 22:15:18,945 - INFO - table - Loading transformer LabelEncodingTransformer for field workclass
2020-07-23 22:15:18,945 - INFO - table - Loading transformer NumericalTransformer for field fnlwgt
2020-07-23 22:15:18,946 - INFO - table - Loading transformer LabelEncodingTransformer for field education
2020-07-23 22:15:18,946 - INFO - table - Loading transformer NumericalTransformer for field education-num
2020-07-23 22:15:18,947 - INFO - table - Loading transformer LabelEncodingTransformer for field marital-status
2020-07-23 22:15:18,947 - INFO - table - Loading transformer LabelEncodingTransformer for field occupation
2020-07-23 22:15:18,947 - INFO - table - Loading transformer LabelEncodingTransformer for field relationship
2020-07-23 22:15:18,948 - INFO - table - Loading transformer LabelEncodingTransformer for field race
2020-07-23 22:15:18,948 - INFO - table - Loading transforme

Epoch 1, Loss G: 1.9722, Loss D: 0.0039
Epoch 2, Loss G: 2.0078, Loss D: -0.0528
Epoch 3, Loss G: 1.9806, Loss D: -0.1373
Epoch 4, Loss G: 1.9688, Loss D: -0.1716
Epoch 5, Loss G: 1.8883, Loss D: -0.3181
Epoch 6, Loss G: 1.8073, Loss D: -0.4152
Epoch 7, Loss G: 1.8015, Loss D: -0.5423
Epoch 8, Loss G: 1.6593, Loss D: -0.6838
Epoch 9, Loss G: 1.6784, Loss D: -0.7575
Epoch 10, Loss G: 1.6479, Loss D: -0.8074


### Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic data by calling the `sample` method
from our model just like we did with the `GaussianCopula` model.

In [10]:
sampled = model.sample(1000)

This will return a table with the same structure as the one the model was fitted on, but filled with `synthetic` data
which resembles the original one.

In [11]:
sampled.head(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 55 | ? | 54620 | HS-grad | 9 | Never-married | Prof-specialty | Unmarried | Amer-Indian-Eskimo | Male | 809 | -4 | -4 | United-States | <=50K |
| 1 | 65 | Federal-gov | 238 | Bachelors | 9 | Married-spouse-absent | Exec-managerial | Own-child | Black | Male | 816 | -37 | 45 | Vietnam | <=50K |
| 2 | 63 | ? | -11877 | Prof-school | 9 | Never-married | Prof-specialty | Unmarried | White | Female | 166785 | 1684 | 33 | Poland | <=50K |
| 3 | 21 | Local-gov | -53388 | Preschool | 1 | Married-civ-spouse | ? | Own-child | Other | Male | 554 | 4 | 51 | Italy | >50K |
| 4 | 41 | Private | 117452 | 7th-8th | 4 | Widowed | Farming-fishing | Other-relative | White | Female | 6270 | 12 | 38 | Scotland | <=50K |
| 5 | 59 | Private | 109067 | 11th | 8 | Married-civ-spouse | Tech-support | Not-in-family | White | Female | -296 | -7 | 39 | Hungary | <=50K |
| 6 | 55 | Local-gov | 133741 | Some-college | 9 | Widowed | Adm-clerical | Own-child | White | Male | 27743 | 8 | 40 | France | <=50K |
| 7 | 52 | Federal-gov | 48567 | 7th-8th | 8 | Divorced | Sales | Other-relative | Black | Male | 196 | 19 | 46 | Hong | <=50K |
| 8 | 24 | Federal-gov | 135607 | 9th | -1 | Married-civ-spouse | Sales | Not-in-family | White | Male | 465 | -26 | 38 | Vietnam | >50K |
| 9 | 46 | Self-emp-not-inc | -57277 | 12th | 12 | Separated | Adm-clerical | Own-child | Amer-Indian-Eskimo | Male | 576 | 0 | 3 | Puerto-Rico | >50K |


## Frequently encountered needs

### How can I evaluate the quality of my synthetic data?

In some cases, you will want to know how similar the generated data is to the original data.

For this you can use the `evaluation` framework included in SDV: import the
`sdv.evaluation.evaluate` function and call it with both the synthetic and the
real data.

In [12]:
from sdv.evaluation import evaluate

evaluate(sampled, census)

-144.80667199073469
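The value returned by `evaluate` is a single aggregate score. As a complementary and much simpler check, you can compare how often each category appears in a column of the real and the synthetic data. The sketch below is not what `sdv.evaluation.evaluate` computes; it is a hand-written total variation distance between category frequencies, applied to the `sex` values visible in the `census.head()` and `sampled.head(10)` outputs shown earlier. The helper names are hypothetical.

```python
from collections import Counter


def category_frequencies(values):
    """Return the relative frequency of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute frequency differences: 0 is identical, 1 is disjoint."""
    real = category_frequencies(real_values)
    synthetic = category_frequencies(synthetic_values)
    categories = set(real) | set(synthetic)
    return 0.5 * sum(
        abs(real.get(category, 0.0) - synthetic.get(category, 0.0))
        for category in categories
    )


# `sex` values copied from the census.head() output (5 rows)
real_sex = ['Male', 'Male', 'Male', 'Male', 'Female']

# `sex` values copied from the sampled.head(10) output (10 rows)
synthetic_sex = [
    'Male', 'Male', 'Female', 'Male', 'Female',
    'Female', 'Male', 'Male', 'Male', 'Male',
]

distance = total_variation_distance(real_sex, synthetic_sex)
print(round(distance, 3))  # 0.1
```

With so few rows the number is only indicative. On full tables the same comparison can be repeated for every categorical column.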