# Synthesize a Table Using CTGAN

CTGAN uses generative adversarial networks (GANs) to create synthetic data with high fidelity.

In [1]:
from sdv.datasets.demo import download_demo
from sdv.single_table import CTGANSynthesizer
from sdv.evaluation.single_table import evaluate_quality, get_column_plot, get_column_pair_plot

# 1. Loading the demo data
For this demo, we'll use a fake dataset that describes some fictional guests staying at a hotel.

In [2]:
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

In [3]:
real_data.head()

Unnamed: 0,guest_email,has_rewards,room_type,amenities_fee,checkin_date,checkout_date,room_rate,billing_address,credit_card_number
0,michaelsanders@shaw.net,False,BASIC,37.89,27 Dec 2020,29 Dec 2020,131.23,"49380 Rivers Street\nSpencerville, AK 68265",4075084747483975747
1,randy49@brown.biz,False,BASIC,24.37,30 Dec 2020,02 Jan 2021,114.43,"88394 Boyle Meadows\nConleyberg, TN 22063",180072822063468
2,webermelissa@neal.com,True,DELUXE,0.0,17 Sep 2020,18 Sep 2020,368.33,"0323 Lisa Station Apt. 208\nPort Thomas, LA 82585",38983476971380
3,gsims@terry.com,False,BASIC,,28 Dec 2020,31 Dec 2020,115.61,"77 Massachusetts Ave\nCambridge, MA 02139",4969551998845740
4,misty33@smith.biz,False,BASIC,16.45,05 Apr 2020,,122.41,"1234 Corporate Drive\nBoston, MA 02116",3558512986488983


In [4]:
metadata

{
    "primary_key": "guest_email",
    "columns": {
        "guest_email": {
            "sdtype": "email",
            "pii": true
        },
        "has_rewards": {
            "sdtype": "boolean"
        },
        "room_type": {
            "sdtype": "categorical"
        },
        "amenities_fee": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "checkin_date": {
            "sdtype": "datetime",
            "datetime_format": "%d %b %Y"
        },
        "checkout_date": {
            "sdtype": "datetime",
            "datetime_format": "%d %b %Y"
        },
        "room_rate": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "billing_address": {
            "sdtype": "address",
            "pii": true
        },
        "credit_card_number": {
            "sdtype": "credit_card_number",
            "pii": true
        }
    },
    "METADATA_SPEC_VERSION": "SINGLE_TABL

In [5]:
# Fit a CTGAN Synthesizer (takes longer than using the basic statistical models)
synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(real_data)



In [6]:
# Generate some data from our GAN
synthetic_data = synthesizer.sample(num_rows=500)
synthetic_data.head()

Unnamed: 0,guest_email,has_rewards,room_type,amenities_fee,checkin_date,checkout_date,room_rate,billing_address,credit_card_number
0,dsullivan@example.net,False,BASIC,12.61,02 May 2020,22 Apr 2020,196.46,"90469 Karla Knolls Apt. 781\nSusanberg, NC 28401",5161033759518983
1,steven59@example.org,False,BASIC,7.09,10 May 2020,06 Nov 2019,190.2,"1080 Ashley Creek Apt. 622\nWest Amy, NM 25058",4133047413145475690
2,brandon15@example.net,False,DELUXE,27.49,19 Feb 2020,27 Feb 2020,165.43,"99923 Anderson Trace Suite 861\nNorth Haley, T...",4977328103788
3,humphreyjennifer@example.net,False,BASIC,28.12,31 Jan 2020,20 Nov 2019,145.36,"9301 John Parkways\nThomasland, OH 61350",3524946844839485
4,joshuabrown@example.net,False,BASIC,9.74,25 Jul 2020,02 May 2020,135.91,"126 George Tunnel\nDuranstad, MS 95176",4446905799576890978


The synthesizer is generating synthetic guests in the **same format as the original data**.
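One way to spot-check the "same format" claim for the datetime columns is to parse a few synthetic values with the `datetime_format` string from the metadata. The following is a minimal sketch using only the standard library; the helper name `follows_format` is hypothetical, and the sample values are copied by hand from the output above rather than read from `synthetic_data`.

```python
from datetime import datetime

# Format string copied from the metadata entries for checkin_date / checkout_date.
DATETIME_FORMAT = "%d %b %Y"

# A few checkin_date values copied from the synthetic sample shown above.
synthetic_checkin_dates = ["02 May 2020", "10 May 2020", "19 Feb 2020"]


def follows_format(value, fmt=DATETIME_FORMAT):
    """Return True if `value` parses under `fmt` (hypothetical helper)."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# True only if every sampled value parses under the metadata format.
all_match = all(follows_format(value) for value in synthetic_checkin_dates)
print(all_match)
```

The same helper could be applied to a full column by iterating over its non-missing values.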

In the original dataset, we had some sensitive columns such as the guest's email, billing address and credit card number. In the synthetic data, these columns are **fully anonymized** -- they contain entirely fake values that follow the format of the original.

In [7]:
sensitive_column_names = ['guest_email', 'billing_address', 'credit_card_number']

real_data[sensitive_column_names].head(3)

Unnamed: 0,guest_email,billing_address,credit_card_number
0,michaelsanders@shaw.net,"49380 Rivers Street\nSpencerville, AK 68265",4075084747483975747
1,randy49@brown.biz,"88394 Boyle Meadows\nConleyberg, TN 22063",180072822063468
2,webermelissa@neal.com,"0323 Lisa Station Apt. 208\nPort Thomas, LA 82585",38983476971380


In [8]:
synthetic_data[sensitive_column_names].head(3)

Unnamed: 0,guest_email,billing_address,credit_card_number
0,dsullivan@example.net,"90469 Karla Knolls Apt. 781\nSusanberg, NC 28401",5161033759518983
1,steven59@example.org,"1080 Ashley Creek Apt. 622\nWest Amy, NM 25058",4133047413145475690
2,brandon15@example.net,"99923 Anderson Trace Suite 861\nNorth Haley, T...",4977328103788


In [9]:
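# `in` on a pandas Series tests the index labels, so compare against the column's values instead
[address for address in real_data["billing_address"] if address in synthetic_data["billing_address"].values]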

[]

_Note that any repeated values between the real and synthetic data occur by random chance. This ensures that an attacker won't be able to guess the real, sensitive values based on these columns alone._
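The overlap check above can also be expressed as a set intersection, which compares values directly. This is a minimal sketch, not part of the SDV API: `shared_values` is a hypothetical helper, and the two short lists are copied from the tables above in place of the full columns.

```python
# Two billing addresses from the real data sample shown above.
real_addresses = [
    "49380 Rivers Street\nSpencerville, AK 68265",
    "88394 Boyle Meadows\nConleyberg, TN 22063",
]

# Two billing addresses from the synthetic data sample shown above.
synthetic_addresses = [
    "90469 Karla Knolls Apt. 781\nSusanberg, NC 28401",
    "1080 Ashley Creek Apt. 622\nWest Amy, NM 25058",
]


def shared_values(real_values, synthetic_values):
    """Return the sorted values that appear in both collections (hypothetical helper)."""
    return sorted(set(real_values) & set(synthetic_values))


overlap = shared_values(real_addresses, synthetic_addresses)
print(overlap)
```

Because iterating over a pandas column yields its values, the same helper should also accept `real_data["billing_address"]` and `synthetic_data["billing_address"]` directly, provided the column has no missing values that would interfere with sorting.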

## 2.4 Evaluating Real vs. Synthetic Data
The synthetic data replicates the **mathematical properties** of the real data. To get more insight, we can use the `evaluation` module.

In [10]:
quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata
)

Creating report: 100%|██████████| 4/4 [00:00<00:00, 29.39it/s]



Overall Quality Score: 85.05%

Properties:
Column Shapes: 85.63%
Column Pair Trends: 84.46%
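
The printed numbers are consistent with the overall score being the mean of the two property scores. The sketch below only redoes that arithmetic with the values from the report above; it does not call the SDV library.

```python
# Property scores copied from the quality report output above (in percent).
property_scores = {
    "Column Shapes": 85.63,
    "Column Pair Trends": 84.46,
}

# Mean of the property scores.
overall = sum(property_scores.values()) / len(property_scores)
print(f"{overall:.3f}")
```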


The report allows us to visualize the different properties that were captured. For example, the visualization below shows us _which_ individual column shapes were well-captured and which weren't.

In [11]:
quality_report.get_visualization('Column Shapes')

The TVComplement metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes -- aka the marginal distribution or 1D histogram of the column. It is based on the Total Variation Distance between the category frequencies of the two columns.

Works for:
- Categorical Data
- Boolean Data
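
To make the idea concrete, here is a minimal standard-library sketch of a TVComplement-style score: one minus half the summed absolute difference between category frequencies. The function `tv_complement` is a hypothetical re-implementation for illustration, not the library's code, and the two small `room_type` lists are made up.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """1 - Total Variation Distance between two categorical samples (illustrative)."""
    real_counts = Counter(real_values)
    synthetic_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synthetic_counts)

    # Total Variation Distance: half the summed absolute frequency differences.
    tvd = 0.5 * sum(
        abs(
            real_counts[category] / len(real_values)
            - synthetic_counts[category] / len(synthetic_values)
        )
        for category in categories
    )
    return 1.0 - tvd


# Made-up toy columns: 75% BASIC in the real data, 50% BASIC in the synthetic data.
real_room_types = ["BASIC", "BASIC", "BASIC", "DELUXE"]
synthetic_room_types = ["BASIC", "BASIC", "DELUXE", "DELUXE"]

score = tv_complement(real_room_types, synthetic_room_types)
print(score)
```

A score of 1.0 means the category frequencies match exactly, and 0.0 means the two columns share no categories.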

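The KSComplement metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes -- aka the marginal distribution or 1D histogram of the column. It is based on the Kolmogorov-Smirnov statistic, the maximum difference between the cumulative distributions of the two columns.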

Works for:
- Numerical Data
- Datetime Data
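
A KSComplement-style score can be sketched in the same way: one minus the two-sample Kolmogorov-Smirnov statistic, computed here from empirical CDFs. The function `ks_complement` is a hypothetical standard-library illustration rather than the library's implementation, and the `room_rate` values are made up.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """1 - two-sample KS statistic between two numerical samples (illustrative)."""
    real_sorted = sorted(real_values)
    synthetic_sorted = sorted(synthetic_values)

    def ecdf(sorted_values, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The largest CDF gap is attained at one of the observed data points.
    ks_statistic = max(
        abs(ecdf(real_sorted, x) - ecdf(synthetic_sorted, x))
        for x in real_sorted + synthetic_sorted
    )
    return 1.0 - ks_statistic


# Made-up toy room rates.
real_room_rates = [100, 120, 140, 160]
synthetic_room_rates = [110, 130, 150, 400]

score = ks_complement(real_room_rates, synthetic_room_rates)
print(score)
```

Datetime columns could be handled the same way after converting each value to a number, such as a timestamp.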

## 2.5 Visualizing the Data
For more insights, we can visualize the real vs. synthetic data.

Let's perform a 1D visualization comparing each column of the real data to the same column of the synthetic data.

In [12]:
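for column in real_data.columns:
    try:
        fig = get_column_plot(
            real_data=real_data,
            synthetic_data=synthetic_data,
            column_name=column,
            metadata=metadata
        )
        fig.show()
    except Exception as error:
        # Some columns (e.g. PII columns) may not be plottable; report and continue
        print(f"Skipping column '{column}': {error}")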

We can also visualize in 2D, comparing the correlations of a pair of columns.

In [13]:
fig = get_column_pair_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_names=['room_rate', 'room_type'],
    metadata=metadata
)

fig.show()

## 2.6 Saving and Loading
We can save the synthesizer to share with others and sample more synthetic data in the future.

In [14]:
save_path = 'ctgan_synthesizer.pkl'
synthesizer.save(save_path)

synthesizer = CTGANSynthesizer.load(save_path)

# 3. CTGAN Customization
When using this synthesizer, we can make a tradeoff between training time and data quality using the `epochs` parameter: a higher `epochs` value means that the synthesizer trains for longer, which ideally improves the data quality.


In [15]:
custom_synthesizer = CTGANSynthesizer(
    metadata,
    epochs=1500)
custom_synthesizer.fit(real_data)


Future versions of RDT will not support the 'model_missing_values' parameter. Please switch to using the 'missing_value_generation' parameter to select your strategy.



<font color="maroon"><i><b>This code takes about 10 min to run.</b></i></font>

After we've trained our synthesizer, we can verify the changes to the data quality by creating some synthetic data and evaluating it.

In [16]:
# Save/load the synthesizer
save_path = 'ctgan_custom_synthesizer.pkl'
custom_synthesizer.save(save_path)

# custom_synthesizer = CTGANSynthesizer.load(save_path)

In [17]:
synthetic_data_customized = custom_synthesizer.sample(num_rows=500)

quality_report = evaluate_quality(
    real_data,
    synthetic_data_customized,
    metadata
)

Creating report: 100%|██████████| 4/4 [00:00<00:00, 35.52it/s]



Overall Quality Score: 84.91%

Properties:
Column Shapes: 85.94%
Column Pair Trends: 83.88%
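
To verify the change in quality, the two reports printed in this notebook can be compared property by property. The sketch below copies the scores by hand from the outputs above and computes the differences; it does not call the SDV library.

```python
# Scores copied from the report for the default synthesizer (in percent).
default_report = {
    "Overall": 85.05,
    "Column Shapes": 85.63,
    "Column Pair Trends": 84.46,
}

# Scores copied from the report for the synthesizer trained with epochs=1500.
custom_report = {
    "Overall": 84.91,
    "Column Shapes": 85.94,
    "Column Pair Trends": 83.88,
}

# Positive values mean the customized synthesizer scored higher.
changes = {name: custom_report[name] - default_report[name] for name in default_report}

for name, delta in changes.items():
    print(f"{name}: {delta:+.2f} percentage points")
```

In this particular run the differences are well under one percentage point in either direction. Since each report is computed on a random sample of 500 rows, differences of this size may partly reflect sampling variation rather than the extra training.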


In [18]:
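for column in real_data.columns:
    try:
        fig = get_column_plot(
            real_data=real_data,
            synthetic_data=synthetic_data_customized,
            column_name=column,
            metadata=metadata
        )
        fig.show()
    except Exception as error:
        # Some columns (e.g. PII columns) may not be plottable; report and continue
        print(f"Skipping column '{column}': {error}")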

In [19]:
fig = get_column_pair_plot(
    real_data=real_data,
    synthetic_data=synthetic_data_customized,
    column_names=['room_rate', 'room_type'],
    metadata=metadata
)

fig.show()

While GANs are able to model complex patterns and shapes, it is not easy to understand how they are learning -- but it is possible to modify the underlying architecture of the neural networks.

For users who are familiar with the GAN architecture, there are extra parameters you can use to tune CTGAN to your particular needs. For more details, see [the CTGAN documentation](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/ctgansynthesizer).