# ML Synthetic Data Generation with spark-bestfit (LocalBackend)

This notebook demonstrates how to generate **synthetic data** for machine learning workflows
by fitting statistical distributions to real data.

## What You'll Learn

1. **Fit distributions** to production data features
2. **Handle mixed types** (continuous + discrete columns)
3. **Save** fitted models for reproducibility
4. **Generate synthetic data** that matches original statistics
5. **Validate** synthetic data quality

## Business Context

Synthetic data is valuable for:

- **Privacy**: Share data patterns without exposing real records
- **Testing**: Generate test data that matches production characteristics
- **Augmentation**: Expand small datasets for model training
- **Development**: Work with realistic data in non-production environments

## Prerequisites

```bash
pip install spark-bestfit pandas numpy scipy matplotlib scikit-learn
```

## Setup

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from spark_bestfit import DistributionFitter, DiscreteDistributionFitter
from spark_bestfit.backends.local import LocalBackend

# Create LocalBackend
backend = LocalBackend()
print(f"LocalBackend initialized with {backend.max_workers} workers")

## Part 1: Create Sample "Production" Dataset

We'll simulate a customer dataset with mixed feature types:

- **Continuous**: age, income, account_balance, credit_score (age and credit_score are stored as integers but fitted as continuous)
- **Discrete**: num_products, num_transactions, tenure_months

In [None]:
np.random.seed(42)
n_customers = 5000

# Generate realistic customer data
data = {
    # Continuous features
    'age': np.clip(np.random.normal(42, 15, n_customers), 18, 85).astype(int),
    'income': np.random.lognormal(10.8, 0.6, n_customers),  # Skewed income distribution
    'account_balance': np.abs(np.random.normal(5000, 3000, n_customers)),
    'credit_score': np.clip(np.random.normal(700, 80, n_customers), 300, 850).astype(int),
    
    # Discrete features
    'num_products': np.random.poisson(2.5, n_customers),
    'num_transactions': np.random.negative_binomial(5, 0.3, n_customers),
    'tenure_months': np.random.geometric(0.02, n_customers)
}

# Create Pandas DataFrame
original_df = pd.DataFrame(data)

print(f"Original dataset: {len(original_df)} customers")
print(original_df.head())
print("\nSummary statistics:")
print(original_df.describe())

In [None]:
# Visualize original distributions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

columns = list(data.keys())
for i, col in enumerate(columns):
    if i < len(axes):
        axes[i].hist(original_df[col], bins=40, density=True, alpha=0.7, edgecolor='black')
        axes[i].set_title(f'{col}\nmean={original_df[col].mean():.1f}')
        axes[i].set_xlabel(col)

# Hide unused subplot
axes[-1].set_visible(False)

plt.suptitle('Original Data Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 2: Fit Distributions to Continuous Features

In [None]:
# Define continuous and discrete columns
continuous_cols = ['age', 'income', 'account_balance', 'credit_score']
discrete_cols = ['num_products', 'num_transactions', 'tenure_months']

# Fit continuous distributions using LocalBackend
cont_fitter = DistributionFitter(backend=backend)

cont_results = cont_fitter.fit(
    original_df,
    columns=continuous_cols,
    lazy_metrics=True,
    max_distributions=25  # Focus on common distributions
)

print(f"Fitted {cont_results.count()} continuous distribution-column combinations")

In [None]:
# Get best continuous fits
best_continuous = cont_results.best_per_column(n=1, metric='aic')

print("Best Continuous Distributions:\n")
for col, fits in best_continuous.items():
    fit = fits[0]
    print(f"  {col}: {fit.distribution} (AIC={fit.aic:.1f})")
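
To make the AIC ranking concrete, here is a minimal sketch that computes AIC = 2k - 2 ln L by hand with SciPy, independent of spark-bestfit. The helper name `aic` and its `n_fixed` argument are assumptions for illustration; how the library computes the metric internally is not shown in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Right-skewed, income-like sample (same lognormal parameters as the notebook).
sample = rng.lognormal(mean=10.8, sigma=0.6, size=2000)


def aic(dist, data, n_fixed=0, **fit_kwargs):
    """Fit `dist` by maximum likelihood and return AIC = 2k - 2 ln L."""
    params = dist.fit(data, **fit_kwargs)
    log_likelihood = np.sum(dist.logpdf(data, *params))
    k = len(params) - n_fixed  # count only the parameters that were estimated
    return 2 * k - 2 * log_likelihood


aic_norm = aic(stats.norm, sample)
# floc=0 pins the location, so lognorm estimates two parameters here.
aic_lognorm = aic(stats.lognorm, sample, n_fixed=1, floc=0)

print(f"norm:    AIC={aic_norm:.1f}")
print(f"lognorm: AIC={aic_lognorm:.1f}")
```

Lower AIC is better, so the lognormal should win on this skewed sample. AIC values are only comparable between models fitted to the same data.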

## Part 3: Fit Distributions to Discrete Features

In [None]:
# Fit discrete distributions
disc_fitter = DiscreteDistributionFitter(backend=backend)

disc_results = disc_fitter.fit(
    original_df,
    columns=discrete_cols
)

print(f"Fitted {disc_results.count()} discrete distribution-column combinations")

In [None]:
# Get best discrete fits
best_discrete = disc_results.best_per_column(n=1, metric='aic')

print("Best Discrete Distributions:\n")
for col, fits in best_discrete.items():
    fit = fits[0]
    print(f"  {col}: {fit.distribution} (AIC={fit.aic:.1f})")
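
SciPy's discrete distribution objects have no `.fit()` method, so discrete candidates need their own estimation step. The sketch below, which is an illustration and not spark-bestfit's internal procedure, uses closed-form maximum-likelihood estimates for two candidates and compares them by AIC. The sample and variable names are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
counts = rng.poisson(lam=2.5, size=3000)  # num_products-like counts

# Poisson: the maximum-likelihood estimate of the rate is the sample mean.
lam_hat = counts.mean()
ll_poisson = stats.poisson.logpmf(counts, lam_hat).sum()

# Geometric shifted to start at 0 (loc=-1): the MLE is p = 1 / (1 + mean).
p_hat = 1.0 / (1.0 + counts.mean())
ll_geom = stats.geom.logpmf(counts, p_hat, loc=-1).sum()

# Each candidate has one estimated parameter, so k = 1 in AIC = 2k - 2 ln L.
aic_poisson = 2 * 1 - 2 * ll_poisson
aic_geom = 2 * 1 - 2 * ll_geom

print(f"poisson: AIC={aic_poisson:.1f}")
print(f"geom:    AIC={aic_geom:.1f}")
```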

## Part 4: Save Fitted Models

Save the fitted distributions for reproducibility and later use.

In [None]:
import os
import tempfile

# Create temp directory for models
model_dir = tempfile.mkdtemp(prefix='synthetic_models_')

# Save best fits for each column
cont_model_path = os.path.join(model_dir, 'continuous_fits')
os.makedirs(cont_model_path, exist_ok=True)

print("Saved continuous models:")
for col, fits in best_continuous.items():
    fit = fits[0]
    path = os.path.join(cont_model_path, f'{col}.json')
    fit.save(path)
    print(f"  {col}: {path}")

disc_model_path = os.path.join(model_dir, 'discrete_fits')
os.makedirs(disc_model_path, exist_ok=True)

print("\nSaved discrete models:")
for col, fits in best_discrete.items():
    fit = fits[0]
    path = os.path.join(disc_model_path, f'{col}.json')
    fit.save(path)
    print(f"  {col}: {path}")
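
The notebook saves each fit but never reads one back. As a rough illustration of what a round trip involves, the sketch below writes a distribution name and its parameters to JSON and rebuilds a frozen SciPy distribution from them. The record layout is hypothetical and is not spark-bestfit's file format; consult the library's documentation for its own loading API.

```python
import json
import os
import tempfile

import numpy as np
from scipy import stats

# Hypothetical record layout for illustration; NOT spark-bestfit's file format.
record = {
    "column": "income",
    "distribution": "lognorm",
    "parameters": [0.6, 0.0, 49000.0],  # shape, loc, scale
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "income.json")
    with open(path, "w") as f:
        json.dump(record, f)
    with open(path) as f:
        loaded = json.load(f)

# Rebuild a frozen SciPy distribution from the stored name and parameters.
frozen = getattr(stats, loaded["distribution"])(*loaded["parameters"])
samples = frozen.rvs(size=1000, random_state=np.random.default_rng(0))
print(f"Reloaded {loaded['distribution']} with median {frozen.median():.0f}")
```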

## Part 5: Generate Synthetic Data

Now we'll generate synthetic data by sampling from the fitted distributions.

In [None]:
from scipy import stats

def generate_synthetic_data(best_continuous, best_discrete, n_samples, seed=42):
    """
    Generate synthetic data by sampling from fitted distributions.
    
    Args:
        best_continuous: Dict of column -> DistributionFitResult for continuous
        best_discrete: Dict of column -> DistributionFitResult for discrete
        n_samples: Number of synthetic records to generate
        seed: Random seed for reproducibility
    
    Returns:
        pandas DataFrame with synthetic data
    """
    np.random.seed(seed)
    synthetic_data = {}
    
    # Generate continuous columns using get_scipy_dist()
    for col, fits in best_continuous.items():
        fit = fits[0]
        frozen_dist = fit.get_scipy_dist()
        
        # Sample from distribution
        samples = frozen_dist.rvs(size=n_samples)
        synthetic_data[col] = samples
        print(f"  Generated {col}: {fit.distribution}")
    
    # Generate discrete columns using get_scipy_dist()
    for col, fits in best_discrete.items():
        fit = fits[0]
        frozen_dist = fit.get_scipy_dist()
        
        # Sample from distribution
        samples = frozen_dist.rvs(size=n_samples)
        synthetic_data[col] = samples
        print(f"  Generated {col}: {fit.distribution}")
    
    return pd.DataFrame(synthetic_data)

# Generate synthetic dataset (same size as original)
print("Generating synthetic data...")
synthetic_df = generate_synthetic_data(
    best_continuous,
    best_discrete,
    n_samples=n_customers,
    seed=42
)

print(f"\nGenerated {len(synthetic_df)} synthetic records")
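
Two refinements are worth considering for the sampler above: passing a local NumPy `Generator` to `rvs` instead of seeding the global state, and re-applying the constraints the original columns had (age and credit_score were clipped and cast to integers). The sketch below is one way to do both. `sample_column` is a hypothetical helper, and the normal distributions stand in for whatever the fitter selected.

```python
import numpy as np
import pandas as pd
from scipy import stats


def sample_column(frozen_dist, n, rng, lower=None, upper=None, as_int=False):
    """Draw n values, then optionally clip to [lower, upper] and round to integers."""
    values = frozen_dist.rvs(size=n, random_state=rng)
    if lower is not None or upper is not None:
        values = np.clip(values, lower, upper)
    if as_int:
        values = np.rint(values).astype(int)
    return values


rng = np.random.default_rng(42)  # local generator; global state is untouched
synthetic = pd.DataFrame({
    "age": sample_column(stats.norm(42, 15), 1000, rng,
                         lower=18, upper=85, as_int=True),
    "credit_score": sample_column(stats.norm(700, 80), 1000, rng,
                                  lower=300, upper=850, as_int=True),
})
print(synthetic.describe())
```

Clipping piles probability mass onto the boundary values, just as `np.clip` did when the original data was created. Rejection sampling or a truncated distribution would avoid that at the cost of more code.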

## Part 6: Validate Synthetic Data Quality

Compare synthetic data statistics to original data.

In [None]:
# Compare summary statistics
all_cols = continuous_cols + discrete_cols

comparison = pd.DataFrame({
    'Original Mean': original_df[all_cols].mean(),
    'Synthetic Mean': synthetic_df[all_cols].mean(),
    'Original Std': original_df[all_cols].std(),
    'Synthetic Std': synthetic_df[all_cols].std(),
})

comparison['Mean Diff %'] = ((comparison['Synthetic Mean'] - comparison['Original Mean']) 
                              / comparison['Original Mean'] * 100).round(1)
comparison['Std Diff %'] = ((comparison['Synthetic Std'] - comparison['Original Std']) 
                             / comparison['Original Std'] * 100).round(1)

print("Statistical Comparison:\n")
print(comparison.round(2).to_string())

In [None]:
# Visual comparison
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for i, col in enumerate(all_cols):
    if i < len(axes):
        # Original
        axes[i].hist(original_df[col], bins=40, density=True, alpha=0.5, 
                     label='Original', color='blue', edgecolor='black')
        # Synthetic
        axes[i].hist(synthetic_df[col], bins=40, density=True, alpha=0.5,
                     label='Synthetic', color='orange', edgecolor='black')
        axes[i].set_title(col)
        axes[i].legend(fontsize=8)

axes[-1].set_visible(False)

plt.suptitle('Original vs Synthetic Data Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
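# Two-sample Kolmogorov-Smirnov test for distribution similarity.
# The test assumes continuous data, so p-values for the discrete columns are approximate.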
print("Kolmogorov-Smirnov Tests (Original vs Synthetic):\n")
print(f"{'Column':<20} {'KS Statistic':<15} {'p-value':<15} {'Match?'}")
print("-" * 60)

for col in all_cols:
    ks_stat, p_value = stats.ks_2samp(original_df[col], synthetic_df[col])
    match = "Yes" if p_value > 0.05 else "No"
    print(f"{col:<20} {ks_stat:<15.4f} {p_value:<15.4f} {match}")

print("\n(p-value > 0.05: no significant difference detected at the 5% level)")
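
When reading KS output, it helps to look at the statistic as well as the p-value: with thousands of rows, even modest differences produce tiny p-values, while the statistic (the largest gap between the two empirical CDFs) indicates how large the mismatch is. A small sketch with made-up samples, independent of the notebook's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
reference = rng.normal(700, 80, size=5000)
same_dist = rng.normal(700, 80, size=5000)   # same distribution, fresh draw
shifted = rng.normal(740, 80, size=5000)     # mean shifted by half a std dev

stat_same, p_same = stats.ks_2samp(reference, same_dist)
stat_shift, p_shift = stats.ks_2samp(reference, shifted)

print(f"same distribution: KS={stat_same:.4f}, p={p_same:.4f}")
print(f"shifted mean:      KS={stat_shift:.4f}, p={p_shift:.4g}")
```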

## Part 7: Using Synthetic Data for ML

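Train a model on synthetic data, then evaluate it on real data. Every column in this notebook's sample data was generated independently, and the synthetic columns are sampled independently too, so the features carry no signal about an income-based target. Expect both models to score near chance (about 50%); the cells below show the train-on-synthetic, test-on-real mechanics rather than a meaningful accuracy benchmark.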

In [None]:
try:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, classification_report
    SKLEARN_AVAILABLE = True
except ImportError:
    SKLEARN_AVAILABLE = False
    print("Note: scikit-learn not installed. Skipping ML validation.")
    print("Install with: pip install scikit-learn")

if SKLEARN_AVAILABLE:
    # Create a binary target (e.g., high-value customer: income > median)
    median_income = original_df['income'].median()

    original_df['high_value'] = (original_df['income'] > median_income).astype(int)
    synthetic_df['high_value'] = (synthetic_df['income'] > median_income).astype(int)

    # Features (exclude income since it defines the target)
    feature_cols = ['age', 'account_balance', 'credit_score', 'num_products', 'num_transactions', 'tenure_months']

    # Split original data for testing
    X_original = original_df[feature_cols]
    y_original = original_df['high_value']

    X_train_orig, X_test, y_train_orig, y_test = train_test_split(
        X_original, y_original, test_size=0.3, random_state=42
    )

    # Train on synthetic data
    X_synthetic = synthetic_df[feature_cols]
    y_synthetic = synthetic_df['high_value']

    print(f"Training set (synthetic): {len(X_synthetic)} samples")
    print(f"Test set (real): {len(X_test)} samples")

In [None]:
if SKLEARN_AVAILABLE:
    # Model 1: Train on original, test on original
    rf_original = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_original.fit(X_train_orig, y_train_orig)
    acc_original = accuracy_score(y_test, rf_original.predict(X_test))

    # Model 2: Train on synthetic, test on original (real)
    rf_synthetic = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_synthetic.fit(X_synthetic, y_synthetic)
    acc_synthetic = accuracy_score(y_test, rf_synthetic.predict(X_test))

    print("Model Performance on Real Test Data:\n")
    print(f"  Trained on Original:  {acc_original:.1%} accuracy")
    print(f"  Trained on Synthetic: {acc_synthetic:.1%} accuracy")
    print(f"\n  Difference: {(acc_original - acc_synthetic):.1%}")
else:
    print("Skipping ML comparison (scikit-learn not installed)")
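
Before interpreting the accuracies, it is worth checking whether the features can predict the target at all. In this notebook's sample data each column is drawn independently, so a target derived from income should be uncorrelated with the other features. The sketch below re-creates two of the columns with a local generator (an assumption; the notebook uses the global seed) and measures that correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    "income": rng.lognormal(10.8, 0.6, n),
    "account_balance": np.abs(rng.normal(5000, 3000, n)),
    "num_products": rng.poisson(2.5, n),
})

# Same target definition as the notebook: income above the median.
target = (df["income"] > df["income"].median()).astype(int)

# Correlation of each feature with the target; values near 0 mean no linear signal.
corr_with_target = df[["account_balance", "num_products"]].corrwith(target)
print(corr_with_target.round(3))
```

Correlations near zero suggest that any classifier trained on these features will hover around 50% accuracy, whether it was trained on real or synthetic rows.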

## Part 8: Scaling Up - Generate Large Synthetic Dataset

In [None]:
# Generate a larger synthetic dataset (10x original)
print("Generating 50,000 synthetic records...")
large_synthetic = generate_synthetic_data(
    best_continuous,
    best_discrete,
    n_samples=50000,
    seed=123
)

print(f"\nLarge synthetic dataset: {len(large_synthetic)} records")
print(large_synthetic.describe())

## Summary

This notebook demonstrated a complete synthetic data generation workflow using LocalBackend:

1. **Fit distributions** to continuous features with `DistributionFitter`
2. **Fit distributions** to discrete features with `DiscreteDistributionFitter`
3. **Save models** for reproducibility by calling `fit.save(path)` on each best fit
4. **Generate synthetic data** by sampling from fitted distributions
5. **Validate quality** using statistical tests and visual comparison
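6. **ML validation** - train on synthetic, test on real; with independently generated features both models score near chance, so matching accuracy here shows the mechanics only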

### Key spark-bestfit Features Used

| Feature | Purpose |
|---------|----------|
| `LocalBackend` | Local parallel processing |
| Multi-column fitting | Fit all features efficiently |
| `DiscreteDistributionFitter` | Handle integer count data |
| `fit.save()` | Persist fitted models |
| `lazy_metrics=True` | Fast fitting when only AIC needed |

### Limitations & Extensions

- **Correlations**: This example generates columns independently. For correlated features, use `GaussianCopula` (see Monte Carlo notebook).
- **Constraints**: Real data may have business constraints (e.g., age ≥ 18). Add post-processing validation.
- **Privacy**: For formal privacy guarantees, consider differential privacy techniques.
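
The correlation limitation can be illustrated with a hand-rolled Gaussian copula. This is a sketch using NumPy and SciPy only, not spark-bestfit's `GaussianCopula` class; the marginals, the 0.7 correlation, and the helper name `gaussian_copula_sample` are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def gaussian_copula_sample(marginals, corr, n, rng):
    """Sample columns with the given marginals, coupled through a Gaussian copula."""
    names = list(marginals)
    # 1. Correlated standard normals.
    z = rng.multivariate_normal(np.zeros(len(names)), corr, size=n)
    # 2. Map to uniforms on (0, 1) with the normal CDF.
    u = stats.norm.cdf(z)
    # 3. Push each uniform column through its marginal's inverse CDF.
    return {name: marginals[name].ppf(u[:, i]) for i, name in enumerate(names)}


marginals = {
    "income": stats.lognorm(0.6, scale=np.exp(10.8)),
    "account_balance": stats.gamma(2.5, scale=2000),
}
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])

samples = gaussian_copula_sample(marginals, corr, 5000, np.random.default_rng(0))
rho, _ = stats.spearmanr(samples["income"], samples["account_balance"])
print(f"Spearman rank correlation: {rho:.2f}")
```

Each column keeps its own marginal distribution while the rank correlation between columns follows the copula, which is what independent per-column sampling cannot provide.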

In [None]:
# Cleanup
import shutil
shutil.rmtree(model_dir, ignore_errors=True)
print("Cleanup complete!")