<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_07_5_tabular_synthetic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tabular Synthetic Data Generation**




GANs are typically used to generate images, but they can also generate tabular data. In this part, we will use the Python tabgan utility to create synthetic rows from a tabular dataset. Specifically, we will train a GAN on the Auto MPG dataset to generate fake cars.

## Installing Tabgan

The following code installs the needed software to run tabgan in Google Colab. 

In [None]:
# HIDE OUTPUT
CMD = "wget https://raw.githubusercontent.com/Diyago/"\
  "GAN-for-tabular-data/master/requirements.txt"

!{CMD}
!pip install -r requirements.txt
!pip install tabgan

--2022-05-04 14:57:52--  https://raw.githubusercontent.com/Diyago/GAN-for-tabular-data/master/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 197 [text/plain]
Saving to: ‘requirements.txt’


2022-05-04 14:57:53 (7.31 MB/s) - ‘requirements.txt’ saved [197/197]

Collecting category_encoders==2.1.0
  Downloading category_encoders-2.1.0-py2.py3-none-any.whl (100 kB)
Collecting numpy==1.18.1
  Downloading numpy-1.18.1-cp37-cp37m-manylinux1_x86_64.whl (20.1 MB)
Collecting torch==1.6.0
  Downloading torch-1.6.0-cp37-cp37m-manylinux1_x86_64.whl (748.8 MB)

Tabgan is now successfully installed.
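
One way to confirm the installation from inside the notebook is to ask Python for the installed package version. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of tabgan.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "tabgan" is the distribution name used by the pip command above.
print("tabgan version:", installed_version("tabgan"))
```

If this prints `None`, the install step likely did not complete and should be rerun before continuing.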

## Loading the Auto MPG Data and Training a Neural Network



In [None]:
# HIDE OUTPUT
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", 
    na_values=['NA', '?'])

COLS_USED = ['cylinders', 'displacement', 'horsepower', 'weight', 
          'acceleration', 'year', 'origin','mpg']
COLS_TRAIN = ['cylinders', 'displacement', 'horsepower', 'weight', 
          'acceleration', 'year', 'origin']

df = df[COLS_USED]

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())


# Split into training and test sets
df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(
    df.drop("mpg", axis=1),
    df["mpg"],
    test_size=0.20,
    #shuffle=False,
    random_state=42,
)

# Create dataframe versions for tabular GAN
df_x_test, df_y_test = df_x_test.reset_index(drop=True), \
  df_y_test.reset_index(drop=True)
df_y_train = pd.DataFrame(df_y_train)
df_y_test = pd.DataFrame(df_y_test)

# Pandas to Numpy
x_train = df_x_train.values
x_test = df_x_test.values
y_train = df_y_train.values
y_test = df_y_test.values

# Build the neural network
model = Sequential()
# Hidden 1
model.add(Dense(50, input_dim=x_train.shape[1], activation='relu')) 
model.add(Dense(25, activation='relu')) # Hidden 2
model.add(Dense(12, activation='relu')) # Hidden 3
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
        patience=5, verbose=1, mode='auto',
        restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test),
        callbacks=[monitor], verbose=2,epochs=1000)

Epoch 1/1000
10/10 - 1s - loss: 1180.4148 - val_loss: 392.8569 - 697ms/epoch - 70ms/step
Epoch 2/1000
10/10 - 0s - loss: 356.3349 - val_loss: 320.0972 - 65ms/epoch - 7ms/step
Epoch 3/1000
10/10 - 0s - loss: 214.5367 - val_loss: 134.9735 - 65ms/epoch - 7ms/step
Epoch 4/1000
10/10 - 0s - loss: 141.4483 - val_loss: 103.4205 - 67ms/epoch - 7ms/step
Epoch 5/1000
10/10 - 0s - loss: 135.2173 - val_loss: 101.4199 - 54ms/epoch - 5ms/step
Epoch 6/1000
10/10 - 0s - loss: 113.9612 - val_loss: 99.0179 - 44ms/epoch - 4ms/step
Epoch 7/1000
10/10 - 0s - loss: 114.5036 - val_loss: 87.1774 - 68ms/epoch - 7ms/step
Epoch 8/1000
10/10 - 0s - loss: 105.1194 - val_loss: 79.0914 - 44ms/epoch - 4ms/step
Epoch 9/1000
10/10 - 0s - loss: 95.7653 - val_loss: 81.0359 - 61ms/epoch - 6ms/step
Epoch 10/1000
10/10 - 0s - loss: 94.2665 - val_loss: 77.4099 - 64ms/epoch - 6ms/step
Epoch 11/1000
10/10 - 0s - loss: 89.3309 - val_loss: 72.2019 - 65ms/epoch - 7ms/step
Epoch 12/1000
10/10 - 0s - loss: 87.0730 - val_loss: 67.72

<keras.callbacks.History at 0x7f77bc2072d0>


To compare accuracy between the original data and the GAN-generated data, we first compute the RMSE of the trained network on the original test set.

In [None]:
pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred,y_test))
print("Final score (RMSE): {}".format(score))

Final score (RMSE): 3.6612987401955763
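
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. The following is a minimal sketch of that calculation with NumPy; the function name `rmse` and the sample values are made up for illustration and do not come from the Auto MPG data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    # Flatten both inputs so an (n, 1) array and an (n,) array compare
    # element by element instead of broadcasting to an n x n matrix.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical MPG values, made up for illustration only.
actual = [20.0, 30.0, 25.0]
predicted = [22.0, 27.0, 25.0]
print(rmse(actual, predicted))
```

Because the target is MPG, the RMSE is expressed in MPG as well, which makes the score easy to interpret.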


## Training a GAN to Generate Fake Auto MPG Data


In [None]:
from tabgan.sampler import GANGenerator
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Train the GAN on the training split, then generate features and targets
gen_x, gen_y = GANGenerator(
    gen_x_times=1.1,
    cat_cols=None,
    bot_filter_quantile=0.001,
    top_filter_quantile=0.999,
    is_post_process=True,
    adversarial_model_params={
        "metrics": "rmse", "max_depth": 2, "max_bin": 100,
        "learning_rate": 0.02, "random_state": 42,
        "n_estimators": 500,
    },
    pregeneration_frac=2,
    only_generated_data=False,
    gan_params={"batch_size": 500, "patience": 25, "epochs": 500},
).generate_data_pipe(
    df_x_train, df_y_train, df_x_test,
    deep_copy=True, only_adversarial=False, use_adversarial=True,
)



Fitting CTGAN transformers for each column:   0%|          | 0/8 [00:00<?, ?it/s]

Training CTGAN, epochs::   0%|          | 0/500 [00:00<?, ?it/s]


## Evaluating the GAN Results

If we display the results, we can see that the GAN-generated data looks similar to the original. Some columns whose values are typically whole numbers in the original data, such as displacement and horsepower, contain fractional values in the synthetic data.

In [None]:
gen_x

| | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|
| 0 | 5 | 183.000000 | 77.000000 | 3530 | 20.100000 | 79 | 2 |
| 1 | 4 | 160.639338 | 103.282225 | 3550 | 20.953171 | 79 | 2 |
| 2 | 6 | 281.004165 | 74.502942 | 3709 | 20.062095 | 78 | 2 |
| 3 | 4 | 398.886734 | 64.531755 | 3465 | 18.823368 | 77 | 2 |
| 4 | 5 | 98.506210 | 104.141180 | 3679 | 23.009761 | 77 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 512 | 8 | 350.000000 | 165.000000 | 4209 | 12.000000 | 71 | 1 |
| 513 | 8 | 350.000000 | 165.000000 | 4274 | 12.000000 | 72 | 1 |
| 514 | 8 | 318.000000 | 150.000000 | 4096 | 13.000000 | 71 | 1 |
| 515 | 8 | 351.000000 | 153.000000 | 4129 | 13.000000 | 72 | 1 |
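
If the fractional values noted above are a problem for downstream use, one option is to round the affected columns back to whole numbers. The sketch below assumes that displacement and horsepower should be integers; that column choice, the `WHOLE_NUMBER_COLS` name, and the small sample frame are illustrative assumptions, not part of tabgan or the original notebook.

```python
import pandas as pd

# Hypothetical rows shaped like the synthetic output above.
synthetic = pd.DataFrame({
    "cylinders": [4, 6],
    "displacement": [160.639338, 281.004165],
    "horsepower": [103.282225, 74.502942],
    "acceleration": [20.953171, 20.062095],
})

# Assumption: these columns should hold whole numbers.
WHOLE_NUMBER_COLS = ["displacement", "horsepower"]

# Work on a copy so the generated data stays available unchanged.
rounded = synthetic.copy()
for col in WHOLE_NUMBER_COLS:
    rounded[col] = rounded[col].round().astype(int)

print(rounded)
```

Columns that are legitimately continuous, such as acceleration, are left untouched.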


Finally, we present the synthetic data to the previously trained neural network to see how accurately it predicts the synthetic targets. As the output shows, the RMSE rises from about 3.66 on the original test set to about 9.25 on the synthetic data, so some accuracy is lost by going to synthetic data.

In [None]:
# Predict
pred = model.predict(gen_x.values)
score = np.sqrt(metrics.mean_squared_error(pred,gen_y.values))
print("Final score (RMSE): {}".format(score))

Final score (RMSE): 9.252012969483955
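
To put the two scores side by side, the short sketch below computes the absolute and relative change in RMSE. The helper `rmse_change` is a hypothetical name introduced here for illustration; the two values are copied from the outputs in this section.

```python
def rmse_change(baseline, other):
    """Return the absolute difference and the ratio of other to baseline."""
    absolute = other - baseline
    ratio = other / baseline
    return absolute, ratio


# RMSE values copied from the two outputs in this section.
rmse_original = 3.6612987401955763
rmse_synthetic = 9.252012969483955

absolute, ratio = rmse_change(rmse_original, rmse_synthetic)
print(f"Absolute increase: {absolute:.2f} MPG")
print(f"Ratio: {ratio:.2f}x")
```

Both numbers depend on the random initialization of the network and the GAN, so a rerun of the notebook may give different values.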
