### RUNS ON GRID

### Example 03. Synthetic Dataset Generators Trained Using Differential Privacy and Immediate Sensitivity

This example notebook serves as a guide to the CSL data synthesis module and its associated methods.

**CSL Modules:**
* `synthesizers`   <--   **_main focus_**



---

**This notebook:** 

Focuses on the `synthesizers` module and its convenience methods, covering:

* Differential privacy parameters: `ALPHA` and `EPSILON` for immediate sensitivity

* Section A, `train_and_synthesize`: train a model and use the checkpoint to synthesize train data

* Section B, `synthesize_using_pretrained`: load an existing (pre-trained) model and use it to synthesize validation data

**Note:**
The `tasks` argument (a list) can be set to process train and val sequentially (i.e., `tasks = ["train", "val"]`) in `train_and_synthesize`, which then trains a model and synthesizes the train and val datasets in a single call. The order must be preserved because val cannot be generated without a trained model.
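
The ordering rule above can be checked before calling the library. The following is a minimal sketch, not part of `csl`: the helper name `order_tasks` is hypothetical, and it only reorders the list so that `"train"` comes first. It does not add `"train"` when it is missing.

```python
def order_tasks(tasks):
    """Return the requested tasks without duplicates, with "train" first.

    Hypothetical helper for illustration; it is not part of csl.synthesizers.
    """
    unique = []
    for task in tasks:
        if task not in unique:
            unique.append(task)
    # sorted() is stable, so every task other than "train" keeps its
    # original relative order.
    return sorted(unique, key=lambda task: task != "train")


print(order_tasks(["val", "train"]))
```

The result can then be passed as the `tasks` entry of the argument dictionary.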

*The `synthesizers.py` module supports command line arguments*

In [1]:
import torch
import torchvision

import sys
sys.path.append("/persist/carlos_folder/csl/csl")

import os
os.chdir("../")

import csl.synthesizers as syn

**VARIABLES**
* `METHOD`: str, synthesis method (architecture). Currently supported methods: vae, cvae, dcgan
* `DATASET_NAME`: str, name of the dataset
* `MODELS_DIR`: os.pathlike, where models are saved to and loaded from
* `DATA_DIR`: os.pathlike, where synthesized data is saved
* `TASK`: list, which splits to synthesize: train, test (for some datasets), or val

*Differential Privacy Training*
* `ALPHA`: float (default = None)
* `EPSILON`: float (default = None)

**NOTE** Both `ALPHA` and `EPSILON` must be set to something other than `None` to enable differential privacy (immediate_sensitivity) training.
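
The rule in the note can be written as a one-line check. This is a sketch only: `dp_enabled` is a hypothetical helper, and how the library itself reacts when only one of the two values is set is not shown in this notebook.

```python
def dp_enabled(alpha, epsilon):
    """True only when both privacy parameters are set (hypothetical helper)."""
    return alpha is not None and epsilon is not None


# Walk through the four possible combinations of set and unset parameters.
for alpha, epsilon in [(None, None), (20, None), (None, 1.0), (20, 1.0)]:
    mode = "differentially private" if dp_enabled(alpha, epsilon) else "standard"
    print(f"alpha={alpha}, epsilon={epsilon}: {mode} training")
```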

In [2]:
ROOT_DIR = "/persist/"
DATA_DIR = f"{ROOT_DIR}datasets/"
MODELS_DIR = f"{ROOT_DIR}models/"

# VARS
METHOD = "vae"
DATASET_NAME = "mnist"

TASK = ["train"]  # add "val" after "train" to also mimic the validation set
N_EPOCHS = 10
NUM_WORKERS = 16
BATCH_SIZE = 32
CLASS_INDEX = "all"  # 0, 1, 2... etc
ALPHAS = [None, 20]

### A. Train and synthesize

Trains a new model for the given method and dataset name (i.e., input source) and uses the newly trained model to generate a synthetic version of the input source.


**inputs:**
* method: str, see METHOD
* dataset_name: str, see DATASET_NAME
* batch_size: int, number of samples per batch
* num_workers: int, number of virtual cores to use to move data and execute non-gpu data operations
* class_index: int (or str), 
    - int: index of the class to sample, train on, and synthesize
    - str:= "all", means all classes
* num_epochs: int, total number of passes over the training data
* image_dim: int, size of the input/output images (images are resized to square tiles)
* tasks: list, see TASK
* num_samples: int (or str),
    - int: specific number of samples to synthesize (can be any positive number)
    - str:= "all" takes the number of samples from the source dataset (mimics the original input dataset label count distribution)
* data_save_dir: os.pathlike, see DATA_DIR
* model_save_dir: os.pathlike, see MODELS_DIR
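
To make the `num_samples` options concrete, the sketch below counts labels the way `"all"` is described: it mirrors the label count distribution of the source dataset. The helper `samples_per_class` is hypothetical and not part of `csl`. How the library splits an integer `num_samples` across classes is not stated here, so the integer branch (the same count for every class) is an assumption.

```python
from collections import Counter


def samples_per_class(labels, num_samples="all"):
    """Map each class label to the number of samples to synthesize.

    Hypothetical helper for illustration; not part of csl.synthesizers.
    """
    counts = Counter(labels)
    if num_samples == "all":
        # Mirror the label count distribution of the source dataset.
        return dict(sorted(counts.items()))
    if not isinstance(num_samples, int) or num_samples <= 0:
        raise ValueError("num_samples must be a positive int or 'all'")
    # ASSUMPTION: an integer is applied to every class individually.
    return {label: num_samples for label in sorted(counts)}


source_labels = [0, 0, 1, 2, 2, 2]
print(samples_per_class(source_labels))
print(samples_per_class(source_labels, 5))
```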


In [3]:
for ALPHA in ALPHAS:
    EPSILONS = (
        [None]
        if ALPHA is None
        else [1e6, 5e5, 2e5, 1e5, 1e4, 1e3, 1e2, 10, 1, 0.1, 0.01]
    )
    for EPSILON in EPSILONS:
        args = {
            "method": METHOD,
            "dataset_name": DATASET_NAME,
            "batch_size": BATCH_SIZE,
            "num_workers": NUM_WORKERS,
            "class_index": CLASS_INDEX,
            "num_epochs": N_EPOCHS,
            "image_dim": 64,
            "tasks": ["train"],  # , "val"],
            "num_samples": "all",
            "data_save_dir": DATA_DIR,
            "model_save_dir": MODELS_DIR,
            "alpha": ALPHA,
            "epsilon": EPSILON,
        }
        # CALL TRAIN AND SYNTHESIZE TO FIT A MODEL AND SYNTHESIZE THE TRAIN DATASET
        syn.train_and_synthesize(**args)

        # THE TRAINED MODEL IS USED TO SYNTHESIZE THE VALIDATION SET IN SECTION B

2021-01-30 05:35:38 53447080b262 datasets[121511] INFO Processing 'mnist' torch.vision built-in structure
2021-01-30 05:35:38 53447080b262 datasets[121511] INFO Processing 'mnist' torch.vision built-in structure
2021-01-30 05:35:38 53447080b262 datasets[121511] INFO  > Extracting all (max) samples for 0 class(es).
Train Batch 1: 100%|██████████| 186/186 [00:02<00:00, 77.65it/s]
2021-01-30 05:35:46 53447080b262 data_generators.vae.vaes[121511] INFO ====> Epoch: 1 Average Train loss:  49.8599
2021-01-30 05:35:48 53447080b262 data_generators.vae.vaes[121511] INFO ====> Test set loss:  47.2477

Train Batch 2: 100%|██████████| 186/186 [00:02<00:00, 82.28it/s]
2021-01-30 05:35:50 53447080b262 data_generators.vae.vaes[121511] INFO ====> Epoch: 2 Average Train loss:  45.6274
2021-01-30 05:35:51 53447080b262 data_generators.vae.vaes[121511] INFO ====> Test set loss:  43.3431

Train Batch 3:   0%|          | 0/186 [00:00<?, ?it/s]



 ===> 2-epoch. Updating best (with 43.343), which is less than previous (47.248) best_loss


Train Batch 3: 100%|██████████| 186/186 [00:02<00:00, 77.73it/s]
2021-01-30 05:35:54 53447080b262 data_generators.vae.vaes[121511] INFO ====> Epoch: 3 Average Train loss:  41.6899
2021-01-30 05:35:55 53447080b262 data_generators.vae.vaes[121511] INFO ====> Test set loss:  39.4953

Train Batch 4:   0%|          | 0/186 [00:00<?, ?it/s]



 ===> 3-epoch. Updating best (with 39.495), which is less than previous (43.343) best_loss


Train Batch 4: 100%|██████████| 186/186 [00:02<00:00, 81.14it/s]
2021-01-30 05:35:58 53447080b262 data_generators.vae.vaes[121511] INFO ====> Epoch: 4 Average Train loss:  38.2225
2021-01-30 05:35:59 53447080b262 data_generators.vae.vaes[121511] INFO ====> Test set loss:  36.2385

Train Batch 5:   0%|          | 0/186 [00:00<?, ?it/s]



 ===> 4-epoch. Updating best (with 36.239), which is less than previous (39.495) best_loss


Train Batch 5: 100%|██████████| 186/186 [00:02<00:00, 81.50it/s]
2021-01-30 05:36:02 53447080b262 data_generators.vae.vaes[121511] INFO ====> Epoch: 5 Average Train loss:  35.1037


KeyboardInterrupt: 

In [None]:
print("SYNTHESIZER TRAINED AND USED TO SYNTHESIZE A TRAIN DATASET")

### B. Synthesize using pretrained

Loads a pretrained model and uses it to generate a synthetic version of the input source dataset. The cell below generates all the samples for each class index (0 through 9) in turn.

**inputs:**
* method: str, see METHOD
* dataset_name: str, see DATASET_NAME
* num_workers: int, number of virtual cores to use to move data and execute non-gpu data operations
* class_index: int (or str), 
    - int: index of the class to sample, train on, and synthesize
    - str:= "all", means all classes
* tasks: list, see TASK
* num_samples: int (or str),
    - int: specific number of samples to synthesize (can be any positive number)
    - str:= "all" takes the number of samples from the source dataset (mimics the original input dataset label count distribution)
* data_save_dir: os.pathlike, see DATA_DIR

**NEW INPUT**
* **load_model_path**: os.pathlike, location where the checkpoint (pretrained model state_dictionary and associated parameters) is stored.
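
The checkpoint and output locations used in this notebook follow one naming pattern, which the sketch below gathers into a single function. The helper name `checkpoint_and_output_dirs` is hypothetical. The pattern is copied from the loop in this section, and whether `train_and_synthesize` writes checkpoints to exactly this layout is an assumption.

```python
def checkpoint_and_output_dirs(models_dir, data_dir, dataset_name, method,
                               task, class_index, alpha=None, epsilon=None):
    """Build (load_model_path, data_save_dir) for one class index.

    Hypothetical helper; directory names follow this notebook's pattern.
    """
    run_name = f"{dataset_name}_{method}"
    if alpha is not None and epsilon is not None:
        # Differentially private runs carry both parameters in the name.
        run_name += f"_dp_{alpha}a_{epsilon}e"
    # Checkpoints are read from the "train" run, one folder per class.
    load_model_path = f"{models_dir}{run_name}/train/{class_index}/"
    data_save_dir = f"{data_dir}{run_name}/{task}/{class_index}/"
    return load_model_path, data_save_dir


print(checkpoint_and_output_dirs(
    "/persist/models/", "/persist/datasets/", "mnist", "vae", "val", 0
))
```

Both directory arguments are expected to end with a slash, as `MODELS_DIR` and `DATA_DIR` do in this notebook.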


In [None]:
# THE LOOP CAN BE RUN TOGETHER WITH THE PREVIOUS

task = "val"
    
for ALPHA in ALPHAS:
    EPSILONS = (
        [None]
        if ALPHA is None
        else [1e6, 5e5, 2e5, 1e5, 1e4, 1e3, 1e2, 10, 1, 0.1, 0.01]
    )
    for EPSILON in EPSILONS:
        if (ALPHA is not None) and (EPSILON is not None):
            model_dir = (
                f"{MODELS_DIR}{DATASET_NAME}_{METHOD}_dp_{ALPHA}a_{EPSILON}e/train/"
            )
            data_dir = (
                f"{DATA_DIR}{DATASET_NAME}_{METHOD}_dp_{ALPHA}a_{EPSILON}e/{task}/"
            )
        else:
            model_dir = f"{MODELS_DIR}{DATASET_NAME}_{METHOD}/train/"
            data_dir = f"{DATA_DIR}{DATASET_NAME}_{METHOD}/{task}/"

        print(f"Loading checkpoint from {model_dir}")
        print(f"Synthetic dataset destination {data_dir}")

        for class_index in range(10):
            model_path = f"{model_dir}{class_index}/"
            data_path = f"{data_dir}{class_index}/"
            args = {
                "method": METHOD,
                "dataset_name": DATASET_NAME,
                "num_workers": 4,
                "class_index": class_index,  # "all",
                "tasks": [task],
                "num_samples": "all",
                "data_save_dir": data_path,
                "load_model_path": model_path,
            }
            syn.synthesize_using_pretrained(**args)

        # Report once per (ALPHA, EPSILON) combination, inside the EPSILON loop
        print(
            f"\n\n===\n{DATASET_NAME}-{METHOD} DATA GENERATION USING "
            f"{ALPHA}-alpha and {EPSILON}-epsilon IS COMPLETE\n===\n\n"
        )

In [None]:
print("ALL DONE")