# Synthetic Data Generation

## Introduction

Synthetic data generation involves creating artificial datasets that mimic real-world data. This technique is invaluable in situations where data is scarce, sensitive, or imbalanced. Synthetic data can be used to enhance machine learning models, protect privacy, and test algorithms under various conditions.

In this tutorial, we'll explore different methods of synthetic data generation, including their underlying mathematics, implementations, and use cases. We'll reference key papers and discuss recent developments in the field. Code examples and plots accompany each method.

## Table of Contents

1. [Why Synthetic Data?](#1)
2. [Basic Statistical Methods](#2)
   - [Random Sampling](#2.1)
   - [Bootstrapping](#2.2)
3. [Data Augmentation](#3)
   - [Image Data Augmentation](#3.1)
   - [Text Data Augmentation](#3.2)
4. [Generative Models](#4)
   - [Variational Autoencoders (VAEs)](#4.1)
   - [Generative Adversarial Networks (GANs)](#4.2)
5. [Synthetic Tabular Data Generation](#5)
   - [CTGAN](#5.1)
   - [Copulas](#5.2)
6. [Applications of Synthetic Data](#6)
7. [Latest Developments](#7)
   - [Differential Privacy in Synthetic Data](#7.1)
   - [Self-Supervised Learning](#7.2)
8. [Conclusion](#8)
9. [References](#9)

<a id="1"></a>
# 1. Why Synthetic Data?

- **Privacy Preservation**: Synthetic data can be shared without exposing sensitive information.
- **Data Augmentation**: Enhances model performance by providing more training examples.
- **Imbalanced Datasets**: Balances classes by generating minority class samples.
- **Testing and Validation**: Allows testing algorithms under various controlled conditions.
- **Cost Reduction**: Reduces the need for expensive data collection and labeling.

<a id="2"></a>
# 2. Basic Statistical Methods

Simple statistical techniques can generate synthetic data by sampling from predefined distributions.

<a id="2.1"></a>
## 2.1 Random Sampling

We can generate synthetic data by sampling from probability distributions such as uniform, normal, or custom distributions.

### Uniform Distribution

Generate data uniformly distributed between two values $a$ and $b$:

$$
X \sim \text{Uniform}(a, b)
$$

### Normal Distribution

Generate data following a normal distribution with mean $\mu$ and standard deviation $\sigma$:

$$
X \sim \mathcal{N}(\mu, \sigma^2)
$$

In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Generate uniform data
uniform_data = np.random.uniform(low=0.0, high=1.0, size=1000)

# Generate normal data
normal_data = np.random.normal(loc=0.0, scale=1.0, size=1000)

# Plot histograms
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(uniform_data, bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Uniform Distribution')
axes[1].hist(normal_data, bins=30, color='salmon', edgecolor='black')
axes[1].set_title('Normal Distribution')
plt.show()

**Explanation:**

- We use `np.random.uniform` and `np.random.normal` to generate synthetic data.
- Histograms visualize the distribution of the generated data.
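
The section mentions custom distributions without showing one. As a minimal sketch, inverse transform sampling turns uniform draws into draws from any distribution whose inverse CDF is known. The exponential target, the rate `lam`, and the function name `sample_exponential` are assumptions chosen for illustration:

```python
import math
import random

def sample_exponential(lam, size, rng):
    """Draw `size` samples from Exponential(lam) via inverse transform sampling."""
    # If U ~ Uniform(0, 1), then -ln(1 - U) / lam has CDF F(x) = 1 - exp(-lam * x).
    return [-math.log(1.0 - rng.random()) / lam for _ in range(size)]

rng = random.Random(0)  # seeded for reproducibility
samples = sample_exponential(lam=2.0, size=10_000, rng=rng)
sample_mean = sum(samples) / len(samples)
print(f"sample mean: {sample_mean:.3f} (theoretical mean 1/lam = 0.5)")
```

The same pattern applies to any distribution for which the inverse CDF can be written down or approximated numerically.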

<a id="2.2"></a>
## 2.2 Bootstrapping

Bootstrapping involves resampling with replacement from the original dataset to create new synthetic datasets. It's useful for estimating the sampling distribution of a statistic.

### Algorithm

1. Given a dataset of size $n$, sample $n$ data points with replacement.
2. Repeat the process multiple times to create multiple synthetic datasets.

### Mathematical Foundation

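The bootstrap treats the empirical distribution of the observed sample as a stand-in for the unknown population distribution. Recomputing a statistic on many resamples approximates that statistic's sampling distribution, from which standard errors and confidence intervals can be estimated.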

In [None]:
# Original dataset
original_data = np.random.normal(0, 1, 100)

# Bootstrap samples
bootstrap_samples = []
num_samples = 1000
for _ in range(num_samples):
    sample = np.random.choice(original_data, size=len(original_data), replace=True)
    bootstrap_samples.append(sample.mean())

# Plot the distribution of sample means
plt.hist(bootstrap_samples, bins=30, color='lightgreen', edgecolor='black')
plt.title('Bootstrap Distribution of Sample Means')
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.show()

**Explanation:**

- Resample the original data with replacement to create bootstrap samples.
- Compute the mean of each sample to estimate the sampling distribution.
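
A common use of the bootstrap distribution is a percentile confidence interval. The sketch below is one simple way to compute it, using only the standard library; the helper name `bootstrap_ci` and its defaults (2000 resamples, 95% level) are assumptions for illustration:

```python
import random
import statistics

def bootstrap_ci(data, stat_fn, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat_fn(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on each resample drawn with replacement
    replicates = sorted(
        stat_fn(rng.choices(data, k=n)) for _ in range(num_resamples)
    )
    lower = replicates[int((alpha / 2) * num_resamples)]
    upper = replicates[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
lower, upper = bootstrap_ci(data, statistics.mean)
print(f"mean = {statistics.mean(data):.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Passing a different `stat_fn` (for example `statistics.median`) gives an interval for that statistic without any new derivation.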

<a id="3"></a>
# 3. Data Augmentation

Data augmentation creates synthetic data by applying transformations to existing data. It's widely used in computer vision and natural language processing.

<a id="3.1"></a>
## 3.1 Image Data Augmentation

Common techniques include:

- **Rotation**
- **Flipping**
- **Scaling**
- **Cropping**
- **Color Jitter**

In [None]:
# Import libraries
from torchvision import transforms
from PIL import Image
import requests
from io import BytesIO

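# Load an example image
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Coat_of_Arms_of_Granada.png/600px-Coat_of_Arms_of_Granada.png'
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
# Convert to RGB so the color transforms receive a three-channel image
img = Image.open(BytesIO(response.content)).convert('RGB')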

# Define augmentation transforms
augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=1.0),
    transforms.RandomRotation(degrees=45),
    transforms.ColorJitter(brightness=0.5, contrast=0.5),
])

# Apply augmentation
augmented_img = augmentation(img)

# Display original and augmented images
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(img)
axes[0].set_title('Original Image')
axes[0].axis('off')
axes[1].imshow(augmented_img)
axes[1].set_title('Augmented Image')
axes[1].axis('off')
plt.show()

**Explanation:**

- Use `torchvision.transforms` to define augmentation operations.
- Apply transformations to the image to create augmented data.
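
To make the listed operations concrete, here is a dependency-free sketch of flipping, rotating by 90 degrees, and random cropping. It assumes a tiny grayscale image stored as a list of rows, which is an illustrative simplification and not how `torchvision` represents images:

```python
import random

def horizontal_flip(image):
    """Mirror each row (image is a list of rows of pixel values)."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(col) for col in zip(*image[::-1])]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

image = [[1, 2, 3],
         [4, 5, 6]]
flipped = horizontal_flip(image)
rotated = rotate_90_clockwise(image)
cropped = random_crop(image, 2, 2, random.Random(0))
print(flipped)
print(rotated)
print(cropped)
```

Each function returns a new image and leaves the input untouched, so several augmented variants can be derived from one original.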

<a id="3.2"></a>
## 3.2 Text Data Augmentation

Techniques include:

- **Synonym Replacement**
- **Random Insertion**
- **Random Swap**
- **Random Deletion**

Reference: Wei and Zou (2019) proposed **EDA (Easy Data Augmentation)** [[1]](#ref1).

In [None]:
# Install NLTK and download wordnet
!pip install nltk
import nltk
nltk.download('wordnet')

from nltk.corpus import wordnet
import random

In [None]:
# Synonym Replacement function
def synonym_replacement(sentence, n):
    words = sentence.split()
    new_words = words.copy()
    random_word_list = list(set(words))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(synonyms)
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break
    sentence = ' '.join(new_words)
    return sentence

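def get_synonyms(word):
    """Collect WordNet synonyms for `word`, excluding the word itself."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lem in syn.lemmas():
            # WordNet joins multi-word lemmas with underscores (e.g. "domestic_dog")
            synonym = lem.name().replace('_', ' ')
            if synonym.lower() != word.lower():
                synonyms.add(synonym)
    return list(synonyms)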

In [None]:
# Original sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Apply synonym replacement
augmented_sentence = synonym_replacement(sentence, n=3)

# Display results
print("Original:", sentence)
print("Augmented:", augmented_sentence)

**Explanation:**

- Replace words in the sentence with their synonyms.
- This creates new sentences that stay close to the original meaning. Synonyms are picked without regard to word sense, so some replacements will read oddly.
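
Two of the other listed techniques, random swap and random deletion, need no external resources. The following is a minimal sketch in the spirit of EDA rather than the authors' reference implementation; the function names and the parameters `n_swaps` and `p` are chosen here for illustration:

```python
import random

def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
words = "The quick brown fox jumps over the lazy dog".split()
swapped = random_swap(words, n_swaps=2, rng=rng)
deleted = random_deletion(words, p=0.2, rng=rng)
print("Swapped:", " ".join(swapped))
print("Deleted:", " ".join(deleted))
```

Small values of `n_swaps` and `p` keep the augmented sentence recognizable; larger values trade label fidelity for diversity.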

<a id="4"></a>
# 4. Generative Models

Generative models learn the underlying distribution of data and can generate new data samples. Two popular generative models are Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

<a id="4.1"></a>
## 4.1 Variational Autoencoders (VAEs)

VAEs are probabilistic generative models that learn latent representations of data.

### Mathematical Foundation

VAEs introduce a probabilistic encoder (recognition model) and decoder (generative model).

**Encoder:**

$$
q_{\phi}(\mathbf{z} | \mathbf{x}) \approx p_{\theta}(\mathbf{z} | \mathbf{x})
$$

**Decoder:**

$$
p_{\theta}(\mathbf{x} | \mathbf{z})
$$

**Objective Function:**

Maximize the Evidence Lower Bound (ELBO):

$$
\mathcal{L}(\theta, \phi; \mathbf{x}) = -\text{KL}(q_{\phi}(\mathbf{z} | \mathbf{x}) \| p_{\theta}(\mathbf{z})) + \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})} [ \log p_{\theta}(\mathbf{x} | \mathbf{z}) ]
$$

- **KL Divergence**: Regularizes the latent space to match a prior distribution (e.g., standard normal).
- **Reconstruction Loss**: Measures how well the decoder reconstructs the input.
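
For the common choice of a diagonal Gaussian encoder $q_{\phi}(\mathbf{z} | \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and a standard normal prior (an assumption here, matching the example prior named above), the KL term has a closed form over the $d$ latent dimensions:

```latex
\text{KL}\big(q_{\phi}(\mathbf{z} | \mathbf{x}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\big)
= -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```

With `logvar` standing for $\log \sigma_j^2$, this is the expression computed as `KLD` in the loss function of the implementation in this section.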

In [None]:
# Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

In [None]:
# Define VAE model
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        
        # Encoder
        self.fc1 = nn.Linear(784, 400)
        self.fc21 = nn.Linear(400, 20)  # Mean
        self.fc22 = nn.Linear(400, 20)  # Log variance
        
        # Decoder
        self.fc3 = nn.Linear(20, 400)
        self.fc4 = nn.Linear(400, 784)
        
    def encode(self, x):
        h1 = torch.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)
    
    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std
    
    def decode(self, z):
        h3 = torch.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h3))
    
    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

# Loss function
def loss_function(recon_x, x, mu, logvar):
    BCE = nn.functional.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD

In [None]:
# Training settings
epochs = 5
batch_size = 128
learning_rate = 1e-3

# Load MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Initialize model and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = VAE().to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
# Training loop
model.train()
for epoch in range(1, epochs + 1):
    train_loss = 0
    for batch_idx, (data, _) in enumerate(train_loader):
        data = data.to(device)
        optimizer.zero_grad()
        recon_batch, mu, logvar = model(data)
        loss = loss_function(recon_batch, data, mu, logvar)
        loss.backward()
        train_loss += loss.item()
        optimizer.step()
    print(f'Epoch {epoch}, Loss: {train_loss / len(train_loader.dataset):.4f}')

In [None]:
# Generate synthetic images
model.eval()
with torch.no_grad():
    z = torch.randn(64, 20).to(device)
    sample = model.decode(z).cpu()

# Plot the generated images
import torchvision.utils as vutils

plt.figure(figsize=(8,8))
plt.axis("off")
plt.title("Generated Images")
plt.imshow(np.transpose(vutils.make_grid(sample.view(64,1,28,28), padding=2, normalize=True), (1,2,0)))
plt.show()

**Explanation:**

- The VAE consists of an encoder, a reparameterization step, and a decoder.
- The loss function is the negative ELBO: binary cross-entropy as the reconstruction term plus the KL divergence, so minimizing it maximizes the ELBO.
- After training, we sample from the latent space to generate new images.
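
The reparameterization step can be checked in isolation. This scalar sketch mirrors the `reparameterize` method above without PyTorch (the values of `mu` and `logvar` are arbitrary illustrations) and confirms that the samples have the intended mean and standard deviation:

```python
import math
import random
import statistics

def reparameterize(mu, logvar, rng):
    """Return z = mu + eps * std with eps ~ N(0, 1) and std = exp(0.5 * logvar)."""
    std = math.exp(0.5 * logvar)
    eps = rng.gauss(0.0, 1.0)
    return mu + eps * std

rng = random.Random(0)
mu, logvar = 1.5, math.log(0.25)  # variance 0.25, so std 0.5
z_samples = [reparameterize(mu, logvar, rng) for _ in range(20_000)]
print(f"mean = {statistics.mean(z_samples):.3f}, std = {statistics.stdev(z_samples):.3f}")
```

Because the randomness enters only through `eps`, the sample is a deterministic function of `mu` and `logvar`, which is what lets gradients flow back to the encoder during training.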

<a id="4.2"></a>
## 4.2 Generative Adversarial Networks (GANs)

GANs consist of two neural networks, a generator and a discriminator, competing against each other.

### Mathematical Foundation

- **Generator (G)**: Tries to generate data that resembles the real data.
- **Discriminator (D)**: Tries to distinguish between real data and generated data.

**Objective Function:**

$$
\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})} [\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})} [\log(1 - D(G(\mathbf{z})))]
$$

- **$p_{\text{data}}(\mathbf{x})$**: Real data distribution.
- **$p_{\mathbf{z}}(\mathbf{z})$**: Prior noise distribution.
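
For a fixed generator, the inner maximization is solved by $D^*(\mathbf{x}) = p_{\text{data}}(\mathbf{x}) / (p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x}))$, where $p_g$ is the distribution of generated samples, and the resulting value equals $-\log 4$ plus twice the Jensen-Shannon divergence between $p_{\text{data}}$ and $p_g$. The sketch below checks this numerically on small discrete distributions; the probability vectors are made up for illustration:

```python
import math

def value_at_optimal_discriminator(p_data, p_g):
    """V(D*, G) for discrete distributions, with D*(x) = p_data / (p_data + p_g)."""
    value = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0:
            continue  # outcome never occurs under either distribution
        d_star = pd / (pd + pg)
        if pd > 0:
            value += pd * math.log(d_star)
        if pg > 0:
            value += pg * math.log(1.0 - d_star)
    return value

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.5, 0.3, 0.2]
p_g_bad = [0.1, 0.1, 0.8]  # a generator that is far from the data

v_bad = value_at_optimal_discriminator(p_data, p_g_bad)
v_matched = value_at_optimal_discriminator(p_data, p_data)
print(v_bad, -math.log(4) + 2 * js_divergence(p_data, p_g_bad))
print(v_matched, -math.log(4))
```

The value is smallest when the generator matches the data exactly, which is why minimizing over $G$ pulls $p_g$ toward $p_{\text{data}}$.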

In [None]:
# Define Generator and Discriminator
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(100, 256),
            nn.ReLU(True),
            nn.Linear(256, 512),
            nn.ReLU(True),
            nn.Linear(512, 1024),
            nn.ReLU(True),
            nn.Linear(1024, 784),
            nn.Tanh()
        )
    def forward(self, x):
        x = self.model(x)
        return x

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )
    def forward(self, x):
        x = self.model(x)
        return x

In [None]:
# Initialize models
generator = Generator().to(device)
discriminator = Discriminator().to(device)

# Loss function and optimizers
adversarial_loss = nn.BCELoss()
optimizer_G = optim.Adam(generator.parameters(), lr=learning_rate)
optimizer_D = optim.Adam(discriminator.parameters(), lr=learning_rate)

In [None]:
# Training loop
for epoch in range(1, epochs + 1):
    for i, (imgs, _) in enumerate(train_loader):
        # Adversarial ground truths
        valid = torch.ones(imgs.size(0), 1, device=device)
        fake = torch.zeros(imgs.size(0), 1, device=device)
        
        # Configure input: rescale pixels from [0, 1] to [-1, 1] to match the generator's Tanh output range
        real_imgs = imgs.view(imgs.size(0), -1).to(device) * 2 - 1
        
        # ---------------------
        #  Train Generator
        # ---------------------
        optimizer_G.zero_grad()
        
        # Sample noise as generator input
        z = torch.randn(imgs.size(0), 100, device=device)
        
        # Generate images
        gen_imgs = generator(z)
        
        # Loss measures generator's ability to fool the discriminator
        g_loss = adversarial_loss(discriminator(gen_imgs), valid)
        
        g_loss.backward()
        optimizer_G.step()
        
        # ---------------------
        #  Train Discriminator
        # ---------------------
        optimizer_D.zero_grad()
        
        # Measure discriminator's ability to classify real from generated samples
        real_loss = adversarial_loss(discriminator(real_imgs), valid)
        fake_loss = adversarial_loss(discriminator(gen_imgs.detach()), fake)
        d_loss = (real_loss + fake_loss) / 2
        
        d_loss.backward()
        optimizer_D.step()
    print(f'Epoch {epoch}, D loss: {d_loss.item():.4f}, G loss: {g_loss.item():.4f}')

In [None]:
# Generate synthetic images
generator.eval()
with torch.no_grad():
    z = torch.randn(64, 100, device=device)
    gen_imgs = generator(z).view(-1, 1, 28, 28).cpu()

# Plot the generated images
plt.figure(figsize=(8,8))
plt.axis("off")
plt.title("Generated Images")
plt.imshow(np.transpose(vutils.make_grid(gen_imgs, padding=2, normalize=True), (1,2,0)))
plt.show()

**Explanation:**

- The generator and discriminator are trained in an adversarial manner.
- The generator learns to produce images that the discriminator cannot distinguish from real images.

<a id="5"></a>
# 5. Synthetic Tabular Data Generation

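Generating synthetic tabular data poses its own challenges: tables mix continuous and categorical columns, continuous columns are often multimodal rather than Gaussian, categorical columns are frequently imbalanced, and the dependencies between columns must be preserved.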

<a id="5.1"></a>
## 5.1 CTGAN

CTGAN (Conditional Tabular GAN) [[2]](#ref2) is a GAN-based model designed specifically for generating synthetic tabular data, handling both continuous and categorical variables.

In [None]:
# Install SDV (Synthetic Data Vault)
!pip install sdv

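# Note: this cell assumes the SDV 1.x API (sdv.single_table);
# older releases exposed the model as sdv.tabular.CTGAN instead.
from sdv.datasets.demo import download_demo
from sdv.single_table import CTGANSynthesizer
import pandas as pd

# Load a single-table demo dataset together with its metadata
data, metadata = download_demo(modality='single_table', dataset_name='fake_hotel_guests')

In [None]:
# Train CTGAN model
model = CTGANSynthesizer(metadata)
model.fit(data)

# Generate synthetic data
synthetic_data = model.sample(num_rows=1000)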

# Compare real and synthetic data
print('Real Data Sample:')
print(data.head())
print('\nSynthetic Data Sample:')
print(synthetic_data.head())

**Explanation:**

- **SDV** provides tools for generating synthetic data using models like CTGAN.
- **CTGAN** handles mixed data types and learns the distribution of the data.
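
Printing a few rows says little about how well the synthetic table matches the real one. One quick sanity check, which is not part of SDV, is to compare category frequencies column by column. The sketch below uses total variation distance on hypothetical column values:

```python
from collections import Counter

def total_variation_distance(real_values, synthetic_values):
    """TVD between the category frequencies of two columns (0 = identical, 1 = disjoint)."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )

# Hypothetical column values, for illustration only
real_column = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
synthetic_column = ["A"] * 50 + ["B"] * 40 + ["C"] * 10
tvd = total_variation_distance(real_column, synthetic_column)
print(f"TVD = {tvd:.2f}")
```

With pandas data, the same function can be applied to a categorical column by passing `data[col].tolist()` and `synthetic_data[col].tolist()`.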

<a id="5.2"></a>
## 5.2 Copulas

Copulas are functions that allow modeling of the joint distribution of random variables by describing the dependencies between them separately from their marginal distributions.

### Mathematical Foundation

Given cumulative distribution functions (CDFs) of random variables, copulas join them to form a multivariate CDF.

**Sklar's Theorem** states that any multivariate joint distribution can be expressed in terms of univariate marginal distribution functions and a copula that captures the dependence structure.
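
In symbols, for a joint CDF $F$ with marginal CDFs $F_1, \dots, F_d$, the theorem guarantees a copula $C : [0, 1]^d \to [0, 1]$ such that:

```latex
F(x_1, \dots, x_d) = C\big(F_1(x_1), \dots, F_d(x_d)\big)
```

When the marginals are continuous, $C$ is unique. Each $F_i(x_i)$ is uniform on $[0, 1]$, so $C$ carries only the dependence structure.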

In [None]:
# Install Copulas package
!pip install copulas

from copulas.multivariate import GaussianMultivariate

# Fit Copula model on the numeric columns only (rows with missing values are dropped)
numeric_data = data.select_dtypes(include='number').dropna()
copula = GaussianMultivariate()
copula.fit(numeric_data)

# Generate synthetic data
synthetic_data_copula = copula.sample(1000)

# Compare real and synthetic data
print('Real Data Sample:')
print(numeric_data.head())
print('\nSynthetic Data Sample (Copula):')
print(synthetic_data_copula.head())

**Explanation:**

- Copulas model dependencies between variables separately from their marginals.
- **GaussianMultivariate** copula assumes a Gaussian dependence structure.
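
To show what a Gaussian copula does, the following sketch builds one by hand for two variables, independent of the `copulas` package. The correlation `rho` and the exponential marginals are assumptions picked for illustration:

```python
import math
import random

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_gaussian_copula(rho, size, rng):
    """Draw (u1, u2) pairs with uniform marginals and Gaussian dependence rho."""
    pairs = []
    for _ in range(size):
        z1 = rng.gauss(0.0, 1.0)
        # Mix in independent noise so that corr(z1, z2) = rho
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((normal_cdf(z1), normal_cdf(z2)))
    return pairs

def exponential_inverse_cdf(u, lam):
    """Inverse CDF of Exponential(lam)."""
    return -math.log(1.0 - u) / lam

rng = random.Random(0)
uniform_pairs = sample_gaussian_copula(rho=0.8, size=5_000, rng=rng)
# Push each uniform through a marginal's inverse CDF to get the final columns
synthetic_rows = [
    (exponential_inverse_cdf(u1, lam=1.0), exponential_inverse_cdf(u2, lam=0.5))
    for u1, u2 in uniform_pairs
]
print(synthetic_rows[:3])
```

The two columns keep their exponential marginals while remaining positively dependent, which is the separation of marginals and dependence that Sklar's theorem describes.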

<a id="6"></a>
# 6. Applications of Synthetic Data

- **Privacy-Preserving Data Sharing**: Sharing synthetic data instead of real data to protect privacy.
- **Balancing Imbalanced Datasets**: Generating samples for minority classes to improve model performance.
- **Testing and Validation**: Creating datasets to test algorithms under various conditions.
- **Simulation**: Modeling scenarios that are difficult or expensive to collect data for.

<a id="7"></a>
# 7. Latest Developments

<a id="7.1"></a>
## 7.1 Differential Privacy in Synthetic Data

Differentially private synthetic data generation ensures that the synthetic data does not reveal sensitive information about any individual in the original dataset.

### Differential Privacy Definition

A mechanism $\mathcal{M}$ is $\epsilon$-differentially private if for all datasets $D_1$ and $D_2$ differing by one record, and all subsets $S$ of possible outputs:

$$
\Pr[ \mathcal{M}(D_1) \in S ] \leq e^{\epsilon} \Pr[ \mathcal{M}(D_2) \in S ]
$$
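
A classic mechanism that satisfies this definition is the Laplace mechanism. The sketch below applies it to a counting query, whose result changes by at most 1 when one record is added or removed. The records and the helper names are hypothetical:

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-DP; a counting query has sensitivity 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon, rng=rng)

rng = random.Random(0)
ages = [23, 35, 41, 58, 62, 29, 47]  # hypothetical records
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(f"noisy count: {noisy:.2f} (true count is 4)")
```

Smaller values of `epsilon` mean a larger noise scale, so stronger privacy is paid for with less accurate answers.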

### Methods

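- **DP-SGD**: Clips each per-example gradient to a fixed norm and adds Gaussian noise to the aggregated gradient at every optimization step, which yields an $(\epsilon, \delta)$-differential privacy guarantee for the trained model.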
- **PATE-GAN** [[3]](#ref3): Uses Private Aggregation of Teacher Ensembles with GANs.
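
The core of a DP-SGD update can be sketched in a few lines. This is a simplified illustration with plain Python lists, not a production implementation: the names `max_norm` and `noise_multiplier` are assumptions, and the privacy accounting that converts the noise level into a privacy budget is omitted:

```python
import math
import random

def clip_gradient(grad, max_norm):
    """Scale grad down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Average clipped per-example gradients after adding Gaussian noise to their sum."""
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scales with the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

# Hypothetical per-example gradients for a two-parameter model
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
update = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=random.Random(0))
print(update)
```

Clipping bounds how much any single example can move the update, and the noise is calibrated to that bound; together they limit what the trained model can reveal about one record.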

<a id="7.2"></a>
## 7.2 Self-Supervised Learning

Self-supervised learning leverages unlabeled data by generating labels from the data itself. The transformed inputs and their automatically generated labels act as synthetic training data for a pretext task.

### Example: Rotation Prediction

- Rotate each image by one of a fixed set of angles (0, 90, 180, or 270 degrees).
- Train a model to classify which rotation was applied.

In [None]:
# Define rotation angles
angles = [0, 90, 180, 270]

# Prepare dataset
class RotationDataset(torch.utils.data.Dataset):
    def __init__(self, dataset):
        self.dataset = dataset
    def __len__(self):
        return len(self.dataset) * len(angles)
    def __getitem__(self, idx):
        img_idx = idx // len(angles)
        angle_idx = idx % len(angles)
        img, _ = self.dataset[img_idx]
        rotated_img = transforms.functional.rotate(img, angles[angle_idx])
        return rotated_img, angle_idx

rotation_dataset = RotationDataset(train_dataset)
rotation_loader = DataLoader(rotation_dataset, batch_size=batch_size, shuffle=True)

# Define model
class RotationNet(nn.Module):
    def __init__(self):
        super(RotationNet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, 1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, 1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 5 * 5, 128),
            nn.ReLU(),
            nn.Linear(128, 4)
        )
    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Initialize model and optimizer
model = RotationNet().to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

In [None]:
# Training loop
for epoch in range(1, epochs + 1):
    model.train()
    train_loss = 0
    correct = 0
    total = 0
    for data, target in rotation_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data.size(0)
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()
    print(f'Epoch {epoch}, Loss: {train_loss / total:.4f}, Accuracy: {100. * correct / total:.2f}%')

**Explanation:**

- The model learns to predict the rotation angle of images.
- This self-supervised task helps the model learn useful feature representations.

<a id="8"></a>
# 8. Conclusion

Synthetic data generation is a powerful tool with a wide range of applications in machine learning and data science. From simple statistical methods to advanced generative models like GANs and VAEs, synthetic data can help overcome challenges related to data scarcity, privacy, and imbalance. Understanding the underlying mathematics and implementation details enables practitioners to effectively generate and utilize synthetic data in their projects.

<a id="9"></a>
# 9. References

1. <a id="ref1"></a>Wei, J., & Zou, K. (2019). *EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks*. [arXiv:1901.11196](https://arxiv.org/abs/1901.11196)
2. <a id="ref2"></a>Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). *Modeling Tabular data using Conditional GAN*. In Advances in Neural Information Processing Systems (NeurIPS).
3. <a id="ref3"></a>Jordon, J., Yoon, J., & van der Schaar, M. (2019). *PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees*. In International Conference on Learning Representations (ICLR).

---

This notebook provides a comprehensive overview of synthetic data generation methods, including their mathematical foundations, implementations, and applications. You can run the code cells to see how these methods are applied and experiment with different datasets and parameters.