# BinAgg Tutorial: Differentially Private Linear Regression

This notebook demonstrates how to use the `binagg` package for:
1. Differentially Private Linear Regression
2. Synthetic Data Generation
3. Privacy Budget Management

## Installation

```bash
pip install git+https://github.com/soumojitdas/binagg.git
```

In [None]:
import numpy as np
from binagg import (
    dp_linear_regression,
    generate_synthetic_data,
    mu_to_epsilon,
    delta_from_gdp,
    compose_gdp,
    allocate_budget
)

np.random.seed(42)

---
## Part 1: Basic DP Linear Regression

### Step 1.1: Create Sample Data

In [None]:
# Generate sample data
n_samples = 500
n_features = 3

# Features uniformly distributed in [0, 10]
X = np.random.uniform(0, 10, (n_samples, n_features))

# True coefficients
true_beta = np.array([1.5, -2.0, 0.5])

# Response with noise
y = X @ true_beta + np.random.normal(0, 1.0, n_samples)

print(f"Data shape: X={X.shape}, y={y.shape}")
print(f"True coefficients: {true_beta}")

### Step 1.2: Define Data Bounds

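**Important**: The noise added for differential privacy is calibrated to the declared data bounds, so the bounds must be fixed ahead of time from domain knowledge. Computing them from the data itself (for example with `X.min()` and `X.max()`) leaks information and voids the privacy guarantee.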
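
To make the role of the bounds concrete, here is a minimal sketch of clipping records to declared bounds before any statistic is computed. The helper `clip_to_bounds` is hypothetical and not part of `binagg`; whether `binagg` clips internally is not shown in this notebook.

```python
import numpy as np

# Hypothetical helper (not part of binagg): force each column into its declared range.
def clip_to_bounds(X, bounds):
    lower = np.array([lo for lo, _ in bounds], dtype=float)
    upper = np.array([hi for _, hi in bounds], dtype=float)
    return np.clip(X, lower, upper)  # bounds broadcast across columns

# Bounds are declared up front, never taken from X.min() / X.max().
declared_bounds = [(0, 10), (0, 10)]

X_raw = np.array([[2.0, 11.5],
                  [-1.0, 4.0],
                  [7.5, 9.0]])
X_clipped = clip_to_bounds(X_raw, declared_bounds)
print(X_clipped)

# Once every value lies in [lo, hi], replacing one record changes a column sum
# by at most hi - lo. That worst-case change is what DP noise is scaled to.
max_change = [hi - lo for lo, hi in declared_bounds]
print(max_change)
```

Records outside the declared range are distorted by clipping, which is why bounds that are too tight bias the estimates and bounds that are too loose inflate the noise.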

In [None]:
# Feature bounds - each feature is in [0, 10]
x_bounds = [(0, 10), (0, 10), (0, 10)]

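# Response bounds: with these coefficients the noiseless response lies in [-20, 20];
# widen to [-25, 25] to leave room for the noise term
y_bounds = (-25, 25)  # Based on domain knowledge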

print(f"Feature bounds: {x_bounds}")
print(f"Response bounds: {y_bounds}")

### Step 1.3: Run DP Linear Regression

In [None]:
# Run DP regression with mu=1.0 (moderate privacy)
result = dp_linear_regression(
    X, y,
    x_bounds=x_bounds,
    y_bounds=y_bounds,
    mu=1.0,           # Privacy budget
    alpha=0.05,       # 95% confidence intervals
    random_state=42
)

print(f"Privacy budget used: mu = {result.privacy_budget}")
print(f"Number of bins created: {result.n_bins}")

### Step 1.4: Examine Results
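
The table in this step reads each interval directly from `result.confidence_intervals`. As a rough guide to how an estimate, a standard error, and `alpha` relate, the sketch below builds a normal-approximation (Wald) interval. This is an assumption for illustration only: the notebook does not state how `binagg` constructs its intervals, and the numbers are made up.

```python
from statistics import NormalDist

# Hypothetical helper: Wald interval, estimate +/- z_{1 - alpha/2} * SE.
def wald_interval(estimate, std_error, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return estimate - z * std_error, estimate + z * std_error

# Made-up estimate and standard error, not output from binagg.
low, high = wald_interval(1.48, 0.10)
print(f"[{low:.3f}, {high:.3f}]")
```

Under DP the standard error has to account for the injected noise as well as sampling variability, so intervals are wider than their non-private counterparts.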

In [None]:
print("\n" + "="*70)
print("COEFFICIENT ESTIMATES")
print("="*70)
print(f"{'Feature':<10} {'True':<10} {'DP Est':<12} {'SE':<10} {'95% CI':<25} {'Covers?'}")
print("-"*70)

for i in range(n_features):
    ci_low, ci_high = result.confidence_intervals[i]
    covers = "Yes" if ci_low <= true_beta[i] <= ci_high else "No"
    print(f"beta_{i:<5} {true_beta[i]:<10.3f} {result.coefficients[i]:<12.3f} "
          f"{result.standard_errors[i]:<10.3f} [{ci_low:.3f}, {ci_high:.3f}]  {covers}")

### Step 1.5: Compare with Non-Private OLS

In [None]:
# Standard OLS (no privacy)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("\n" + "="*50)
print("COMPARISON: DP vs OLS")
print("="*50)
print(f"{'Feature':<10} {'True':<10} {'OLS':<12} {'DP Est':<12}")
print("-"*44)

for i in range(n_features):
    print(f"beta_{i:<5} {true_beta[i]:<10.3f} {beta_ols[i]:<12.3f} {result.coefficients[i]:<12.3f}")

print("\nNote: DP estimates have more variance but provide privacy guarantees.")

---
## Part 2: Effect of Privacy Budget

The privacy parameter `mu` controls the privacy-accuracy tradeoff:
- **Smaller mu** = Stronger privacy, more noise, wider confidence intervals
- **Larger mu** = Weaker privacy, less noise, narrower confidence intervals
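
The inverse relationship between `mu` and noise can be seen in the basic Gaussian mechanism: adding normal noise with standard deviation `sensitivity / mu` to a statistic satisfies mu-GDP. The sketch below illustrates that single-release case with made-up values. It is not `binagg`'s internal procedure, which splits the budget across several steps.

```python
import numpy as np

# Gaussian mechanism: value + N(0, (sensitivity / mu)^2) satisfies mu-GDP.
def gaussian_mechanism(value, sensitivity, mu, rng):
    sigma = sensitivity / mu
    return value + rng.normal(0.0, sigma), sigma

rng = np.random.default_rng(0)
true_sum = 2500.0    # illustrative statistic
sensitivity = 10.0   # illustrative worst-case change from one record

for mu_demo in [0.5, 1.0, 2.0, 5.0]:
    noisy, sigma = gaussian_mechanism(true_sum, sensitivity, mu_demo, rng)
    print(f"mu={mu_demo:<4} noise sd={sigma:<6.1f} released={noisy:.1f}")
```

Doubling `mu` halves the noise standard deviation, which is why standard errors and interval widths shrink as `mu` grows.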

In [None]:
print("\n" + "="*60)
print("EFFECT OF PRIVACY BUDGET (mu)")
print("="*60)
print(f"{'mu':<8} {'SE(beta_0)':<15} {'CI Width':<15} {'Bins':<8}")
print("-"*46)

for mu_test in [0.5, 1.0, 2.0, 5.0]:
    res = dp_linear_regression(
        X, y, x_bounds, y_bounds,
        mu=mu_test, random_state=42
    )
    ci_width = res.confidence_intervals[0, 1] - res.confidence_intervals[0, 0]
    print(f"{mu_test:<8.1f} {res.standard_errors[0]:<15.3f} {ci_width:<15.3f} {res.n_bins:<8}")

print("\nHigher mu = smaller SE = narrower CI (but less privacy)")

---
## Part 3: Synthetic Data Generation

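Generate differentially private synthetic data that can be shared in place of the original records. Analyses run on the synthetic data are post-processing and consume no further privacy budget.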

In [None]:
# Generate synthetic data
syn_result = generate_synthetic_data(
    X, y,
    x_bounds=x_bounds,
    y_bounds=y_bounds,
    mu=1.0,
    clip_output=True,
    random_state=42
)

X_syn = syn_result.X_synthetic
y_syn = syn_result.y_synthetic

print(f"Original data: {X.shape[0]} samples")
print(f"Synthetic data: {syn_result.n_samples} samples")
print(f"Number of bins used: {syn_result.n_bins_used}")

### Compare Original vs Synthetic Statistics

In [None]:
if syn_result.n_samples > 0:
    print("\n" + "="*50)
    print("FEATURE MEANS: Original vs Synthetic")
    print("="*50)
    print(f"{'Feature':<10} {'Original':<12} {'Synthetic':<12} {'Diff':<10}")
    print("-"*44)
    
    for i in range(n_features):
        orig_mean = X[:, i].mean()
        syn_mean = X_syn[:, i].mean()
        diff = abs(orig_mean - syn_mean)
        print(f"X_{i:<8} {orig_mean:<12.3f} {syn_mean:<12.3f} {diff:<10.3f}")
    
    print(f"\n{'y':<10} {y.mean():<12.3f} {y_syn.mean():<12.3f} {abs(y.mean()-y_syn.mean()):<10.3f}")
else:
    print("No synthetic samples were generated.")

### Regression on Synthetic Data

In [None]:
if syn_result.n_samples >= n_features + 1:
    # OLS on synthetic data
    beta_syn = np.linalg.lstsq(X_syn, y_syn, rcond=None)[0]
    
    print("\n" + "="*50)
    print("REGRESSION: Original vs Synthetic")
    print("="*50)
    print(f"{'Feature':<10} {'True':<10} {'On Original':<14} {'On Synthetic':<14}")
    print("-"*48)
    
    for i in range(n_features):
        print(f"beta_{i:<5} {true_beta[i]:<10.3f} {beta_ols[i]:<14.3f} {beta_syn[i]:<14.3f}")
else:
    print("Not enough synthetic samples for regression.")

---
## Part 4: Privacy Accounting

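`binagg` states its guarantees in mu-GDP (Gaussian differential privacy). The utilities in this part convert `mu` to the more familiar (epsilon, delta) form, compose budgets across mechanisms, and split a total budget between steps.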

### mu-GDP to epsilon Conversion

In [None]:
print("\n" + "="*40)
print("mu-GDP to epsilon")
print("="*40)
print(f"{'mu':<10} {'epsilon':<12}")
print("-"*22)

for mu in [0.1, 0.5, 1.0, 2.0, 5.0]:
    eps = mu_to_epsilon(mu)
    print(f"{mu:<10.1f} {eps:<12.3f}")

### Computing delta from (mu, epsilon)
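
A mechanism that is mu-GDP satisfies (epsilon, delta)-DP for every epsilon >= 0 with delta(epsilon) = Phi(-epsilon/mu + mu/2) - exp(epsilon) * Phi(-epsilon/mu - mu/2), where Phi is the standard normal CDF. The sketch below evaluates that formula with the standard library only. The name `gdp_delta` is hypothetical, and this is not necessarily how `delta_from_gdp` is implemented.

```python
import math

def normal_cdf(x):
    # Standard normal CDF via erfc, which stays accurate in the lower tail.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

# Hypothetical stand-in for the mu-GDP to (epsilon, delta) conversion.
def gdp_delta(mu, eps):
    return (normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0))

for eps_demo in [0.5, 1.0, 2.0, 3.0, 5.0]:
    print(f"eps={eps_demo:<4} delta={gdp_delta(1.0, eps_demo):.2e}")
```

For fixed `mu` the curve is decreasing in epsilon, so a single `mu` corresponds to a whole family of (epsilon, delta) pairs rather than one pair.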

In [None]:
mu = 1.0
print(f"\nFor mu = {mu}:")
print(f"{'epsilon':<12} {'delta':<15}")
print("-"*27)

for eps in [0.5, 1.0, 2.0, 3.0, 5.0]:
    delta = delta_from_gdp(mu, eps)
    print(f"{eps:<12.1f} {delta:<15.2e}")

print("\nLarger epsilon -> smaller delta (for fixed mu)")

### Privacy Composition
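
Under GDP, running mechanisms with budgets mu_1, ..., mu_k on the same data yields a combined guarantee of sqrt(mu_1^2 + ... + mu_k^2). A minimal sketch of that rule follows, using the hypothetical name `compose_gdp_sketch` to keep it distinct from the package's `compose_gdp`.

```python
import math

# GDP composition: budgets combine as the square root of the sum of squares.
def compose_gdp_sketch(*mus):
    return math.sqrt(sum(m * m for m in mus))

total_mu_sketch = compose_gdp_sketch(0.5, 0.5, 0.5, 0.5)
print(total_mu_sketch)

# k releases at the same mu cost mu * sqrt(k) in total, not mu * k.
print(compose_gdp_sketch(*[0.5] * 16))
```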

In [None]:
# Four mechanisms each with mu=0.5
mus = [0.5, 0.5, 0.5, 0.5]
total_mu = compose_gdp(*mus)

print("\n" + "="*50)
print("PRIVACY COMPOSITION")
print("="*50)
print("Four mechanisms with mu=0.5 each:")
print(f"  GDP composition: mu_total = sqrt(4*0.5^2) = {total_mu:.3f}")
print("  (Compare to naive sum: 4*0.5 = 2.0)")
print("\nGDP composition is tighter than naively summing the budgets.")

### Budget Allocation
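
One way to split a total budget by ratios so that the pieces compose back to the total is to scale each ratio by `total / sqrt(sum of squared ratios)`. The sketch below assumes that rule. It is consistent with the composition check at the end of this section, but the notebook does not confirm that `allocate_budget` works this way, and `allocate_budget_sketch` is a hypothetical name.

```python
import math

# Assumed allocation rule: mu_i = total * r_i / sqrt(sum_j r_j^2),
# so that sqrt(sum_i mu_i^2) equals the total budget.
def allocate_budget_sketch(total_mu, ratios):
    norm = math.sqrt(sum(r * r for r in ratios))
    return [total_mu * r / norm for r in ratios]

budgets_sketch = allocate_budget_sketch(1.0, (1, 3, 3, 3))
recomposed = math.sqrt(sum(b * b for b in budgets_sketch))

print([round(b, 4) for b in budgets_sketch])
print(recomposed)
```

Note that under this rule the individual budgets sum to more than the total, because GDP budgets combine in quadrature rather than additively.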

In [None]:
total = 1.0
ratios = (1, 3, 3, 3)  # Default for dp_linear_regression

budgets = allocate_budget(total, ratios)

print("\n" + "="*50)
print("BUDGET ALLOCATION")
print("="*50)
print(f"Total budget: mu = {total}")
print(f"Ratios: {ratios}")
print("\nAllocated budgets:")

names = ["binning", "counts", "sum_x", "sum_y"]
for name, b in zip(names, budgets):
    print(f"  {name:<10}: mu = {b:.4f}")

# Verify composition
composed = compose_gdp(*budgets)
print(f"\nVerification: sqrt(sum(mu^2)) = {composed:.6f}")

---
## Part 5: Practical Recommendations

### Choosing mu

| Use Case | Recommended mu | Notes |
|----------|---------------|-------|
| High-stakes analysis | 0.1 - 0.5 | Strong privacy, expect wider CIs |
| Standard analysis | 1.0 | Good balance of privacy/utility |
| Exploratory analysis | 2.0 - 5.0 | Weaker privacy, closer to non-private |

### Example Privacy Statement
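
A privacy statement usually fixes a target delta and reports the matching epsilon. Because the delta implied by mu-GDP decreases as epsilon grows, that epsilon can be found by bisection. The sketch below does this with the standard library only. `eps_for_delta` is a hypothetical name and may differ from how `eps_from_mu_delta` is implemented.

```python
import math

def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def gdp_delta(mu, eps):
    # delta implied by mu-GDP at a given epsilon
    return (normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0))

# Hypothetical inverse: smallest epsilon whose implied delta is <= target.
def eps_for_delta(mu, delta, tol=1e-10):
    if gdp_delta(mu, 0.0) <= delta:
        return 0.0
    lo, hi = 0.0, 1.0
    while gdp_delta(mu, hi) > delta:   # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:               # bisect on the decreasing delta curve
        mid = 0.5 * (lo + hi)
        if gdp_delta(mu, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

eps_sketch = eps_for_delta(1.0, 1e-5)
print(f"mu=1.0, delta=1e-5 -> epsilon ~ {eps_sketch:.2f}")
```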

In [None]:
from binagg import eps_from_mu_delta

mu_used = 1.0
target_delta = 1e-5
eps_achieved = eps_from_mu_delta(mu_used, target_delta)

print("\n" + "="*60)
print("EXAMPLE PRIVACY STATEMENT")
print("="*60)
print(f"\n'This analysis satisfies mu={mu_used} Gaussian Differential Privacy,")
print(f" which implies (epsilon={eps_achieved:.2f}, delta={target_delta})-DP.'")

---
## Summary

The `binagg` package provides:

1. **`dp_linear_regression()`**: DP linear regression with valid confidence intervals
2. **`generate_synthetic_data()`**: DP synthetic data generation
3. **Privacy utilities**: `mu_to_epsilon()`, `delta_from_gdp()`, `compose_gdp()`, etc.

### Key Points

- Always specify bounds based on domain knowledge, not from data
- Start with mu=1.0 and adjust based on your privacy/utility needs
- Confidence intervals are valid (asymptotically) even with DP noise
- GDP composition is much tighter than naive epsilon composition

### Citation

```bibtex
@article{lin2025differentially,
  title={Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees},
  author={Lin, Shurong and Slavkovic, Aleksandra},
  journal={arXiv preprint arXiv:2510.16974},
  year={2025}
}
```