# Generating synthetic data

This notebook walks through training a probabilistic, generative RNN model<br>
on a rental scooter location dataset, and then using it to generate a synthetic<br>
dataset with stronger privacy guarantees than the original data.

Training and generation share one configuration: we use the ``config.py`` module<br>
to create a ``LocalConfig`` instance that holds all the attributes<br>
both steps need.

In [3]:
!git clone https://github.com/riyas036/Synthetic-Data-Entry-Job-Automation.git

Cloning into 'Synthetic-Data-Entry-Job-Automation'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 554.81 KiB | 3.63 MiB/s, done.


In [2]:
# Google Colab support
# Note: Click "Runtime->Change Runtime Type" set Hardware Accelerator to "GPU"
# Note: Use pip install gretel-synthetics[tf] to install tensorflow if necessary
# 
!pip install gretel-synthetics --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gretel-synthetics
  Downloading gretel_synthetics-0.20.0-py3-none-any.whl (124 kB)
Collecting sentencepiece==0.1.97
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting category-encoders==2.2.2
  Downloading category_encoders-2.2.2-py2.py3-none-any.whl (80 kB)
Collecting smart-open<6.0,>=2.1.0
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)

In [13]:
!pip install --upgrade tensorflow-estimator==2.3.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow-estimator==2.3.0
  Downloading tensorflow_estimator-2.3.0-py2.py3-none-any.whl (459 kB)
Installing collected packages: tensorflow-estimator
  Attempting uninstall: tensorflow-estimator
    Found existing installation: tensorflow-estimator 2.8.0
    Uninstalling tensorflow-estimator-2.8.0:
      Successfully uninstalled tensorflow-estimator-2.8.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires tensorflow-estimator<2.13,>=2.12.0, but you have tensorflow-estimator 2.3.0 which is incompatible.
gretel-synthetics 0.20.0 requires tensorflow-estimator==2.8, but you have tensorflow-estimator 2.3

In [15]:
!pip install --upgrade tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow-estimator<2.13,>=2.12.0
  Downloading tensorflow_estimator-2.12.0-py2.py3-none-any.whl (440 kB)
Installing collected packages: tensorflow-estimator
  Attempting uninstall: tensorflow-estimator
    Found existing installation: tensorflow-estimator 2.3.0
    Uninstalling tensorflow-estimator-2.3.0:
      Successfully uninstalled tensorflow-estimator-2.3.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gretel-synthetics 0.20.0 requires tensorflow-estimator==2.8, but you have tensorflow-estimator 2.12.0 which is incompatible.
Successfully installed tensorflow-estimator-2.12.0

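The resolver errors above come from changing the ``tensorflow-estimator`` version by hand, so it can help to print what is actually installed before training. The following is a minimal sketch using the standard-library ``importlib.metadata`` module; the helper name ``installed_versions`` is hypothetical and not part of gretel-synthetics.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(
        ["gretel-synthetics", "tensorflow", "tensorflow-estimator"]
    )
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing the printed versions against the requirements quoted in the pip error messages shows which pin is still unsatisfied.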

In [17]:
from pathlib import Path

from gretel_synthetics.config import LocalConfig
# Create a config that we can use for both training and generating data
# The default values for ``max_lines`` and ``epochs`` are optimized for training on a GPU.

config = LocalConfig(
    max_line_len=2048,   # the max line length for input training data
    vocab_size=20000,    # tokenizer model vocabulary size
    field_delimiter=",", # specify if the training text is structured, else ``None``
    overwrite=True,      # overwrite previously trained model checkpoints
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    input_data_path="/content/Synthetic-Data-Entry-Job-Automation/uber_scooter_rides_1day.csv" # filepath or S3
)
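Because ``field_delimiter=","`` tells the model that the input is structured, it may be worth checking that the training file really has a consistent number of fields per line. The sketch below uses only the standard library; ``field_count_summary`` is a hypothetical helper, and it assumes the input is a local, UTF-8 encoded file rather than an S3 path.

```python
import csv
from collections import Counter
from pathlib import Path


def field_count_summary(csv_path, delimiter=",", max_rows=1000):
    """Count how many of the first ``max_rows`` rows have each number of fields."""
    counts = Counter()
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        for row_number, row in enumerate(reader):
            if row_number >= max_rows:
                break
            counts[len(row)] += 1
    return counts


# Example usage (assumes the path passed to LocalConfig above exists locally):
# print(field_count_summary(config.input_data_path))
```

A single key in the result means every sampled row has the same number of fields; several keys suggest malformed rows that the model would also learn to imitate.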


In [None]:
# Train a model
# The training function only requires our config as a single arg
from gretel_synthetics.train import train_rnn

train_rnn(config)

100%|██████████| 27114/27114 [00:00<00:00, 46166.38it/s]


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (64, None, 256)           5120000   
                                                                 
 dropout (Dropout)           (64, None, 256)           0         
                                                                 
 lstm (LSTM)                 (64, None, 256)           525312    
                                                                 
 dropout_1 (Dropout)         (64, None, 256)           0         
                                                                 
 lstm_1 (LSTM)               (64, None, 256)           525312    
                                                                 
 dropout_2 (Dropout)         (64, None, 256)           0         
                                                                 
 dense (Dense)               (64, None, 20000)         5
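The embedding and LSTM parameter counts in the summary can be reproduced by hand. This is a minimal sketch, assuming the standard Keras parameter formulas for ``Embedding`` and ``LSTM`` layers (with bias), and reading the embedding width of 256 and the 256 LSTM units off the output shapes above. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector of length ``embedding_dim`` per vocabulary entry."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel, and a bias."""
    return 4 * ((input_dim + units) * units + units)


# vocab_size=20000 comes from the config; 256 is read from the output shapes
print(embedding_params(20000, 256))  # 5120000
print(lstm_params(256, 256))         # 525312
```

Both values match the ``Param #`` column for the ``embedding`` and ``lstm`` rows, which shows that most of the model's weights scale with ``vocab_size``.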

In [None]:
# Let's generate some text!
#
# The ``generate_text`` function is a generator that yields lines of
# predicted text; the number of lines is governed by the ``gen_lines``
# setting in your config.
#
# There is no limit on line length because, with proper training, your model
# should learn where newlines generally occur. However, if you want to
# specify a maximum number of characters per line, you may set the
# ``gen_chars`` attribute in your config object.
from gretel_synthetics.generate import generate_text

# Optionally, when generating text, you can provide a callable that takes the
# generated line as a single arg. If this function raises any errors, the
# line will fail validation and will not be returned. The exception message
# will be provided as an ``explain`` field in the resulting dict that gets
# created by ``generate_text``.
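def validate_record(line):
    # Fields are split on a comma followed by a space
    rec = line.split(", ")
    if len(rec) == 6:
        # Each conversion raises ValueError if the field is not numeric,
        # which marks the line as invalid
        float(rec[5])
        float(rec[4])
        float(rec[3])
        float(rec[2])
        int(rec[0])
    else:
        raise Exception('record not 6 parts')
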
for line in generate_text(config, line_validator=validate_record, num_lines=1000):
    print(line)
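The validator above splits on a comma followed by a space, so it rejects any line whose fields are separated by a bare comma. If that turns out to be too strict for your data, a more tolerant variant can split on the delimiter alone and strip surrounding whitespace. The sketch below is an assumption about what such a validator might look like, not part of gretel-synthetics. It keeps the same checks: six fields, an integer in the first, and floats in the last four.

```python
def validate_record_tolerant(line, delimiter=",", expected_fields=6):
    """Raise ValueError unless ``line`` has the expected field count and types."""
    fields = [field.strip() for field in line.split(delimiter)]
    if len(fields) != expected_fields:
        raise ValueError(
            f"expected {expected_fields} fields, got {len(fields)}"
        )
    int(fields[0])  # first field must parse as an integer
    for index in (2, 3, 4, 5):
        float(fields[index])  # last four fields must parse as floats
```

It can be passed in the same way as the original, for example ``generate_text(config, line_validator=validate_record_tolerant, num_lines=1000)``.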