# spark-bestfit API Demo (Ray Backend)

This notebook demonstrates the complete API for the `spark-bestfit` library using the **RayBackend**, including:

1. **Distribution Fitting** - Using DistributionFitter with RayBackend
2. **FitterConfig Builder** - Fluent configuration API for complex setups (v2.2.0+)
3. **Progress Tracking** - Monitor long-running fits with callbacks
4. **Working with Results** - FitResults and DistributionFitResult objects
5. **Lazy Metrics** - Skip KS/AD computation for faster model selection (v1.5.0+)
6. **Pre-filtering** - Skip incompatible distributions for faster fitting (v1.6.0+)
7. **Confidence Intervals** - Bootstrap confidence intervals for fitted parameters
8. **Plotting** - Visualization with customizable parameters
9. **Excluding Distributions** - Customizing which distributions to fit
10. **Serialization** - Save and load fitted distributions

## Setup

First, let's initialize a RayBackend. Unlike Spark, Ray can auto-initialize locally.

**RayBackend initialization options:**
- `RayBackend()` - Auto-initialize Ray locally (simplest)
- `RayBackend(num_cpus=4)` - Limit CPU usage
- `RayBackend(address="auto")` - Connect to existing Ray cluster
- `RayBackend(address="ray://cluster:10001")` - Connect to specific cluster

In [None]:
import numpy as np
import pandas as pd
from spark_bestfit import RayBackend

# Create RayBackend (auto-initializes Ray locally)
backend = RayBackend()

print(f"RayBackend initialized with {backend.get_parallelism()} CPUs")

In [None]:
# Import spark-bestfit components
from spark_bestfit import (
    DistributionFitter,
    FitterConfig,
    FitterConfigBuilder,
    DEFAULT_EXCLUDED_DISTRIBUTIONS,
)

## Generate Sample Data

We'll create sample data from known distributions for demonstration.

In [None]:
np.random.seed(42)

# Normal distribution data
normal_data = np.random.normal(loc=50, scale=10, size=50_000)
df_normal = pd.DataFrame({"value": normal_data})

# Exponential distribution data (non-negative)
exp_data = np.random.exponential(scale=5, size=50_000)
df_exp = pd.DataFrame({"value": exp_data})

# Gamma distribution data
gamma_data = np.random.gamma(shape=2.0, scale=2.0, size=50_000)
df_gamma = pd.DataFrame({"value": gamma_data})

print(f"Normal data: {len(df_normal):,} rows, mean={normal_data.mean():.2f}, std={normal_data.std():.2f}")
print(f"Exponential data: {len(df_exp):,} rows, mean={exp_data.mean():.2f}")
print(f"Gamma data: {len(df_gamma):,} rows, mean={gamma_data.mean():.2f}")

---

# Part 1: Excluded Distributions

spark-bestfit excludes some slow distributions by default. You can customize this.

## 1.1 DEFAULT_EXCLUDED_DISTRIBUTIONS

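`DEFAULT_EXCLUDED_DISTRIBUTIONS` lists the distributions skipped by default because they are very slow to fit. To re-enable one, pass a filtered copy to `DistributionFitter(excluded_distributions=...)`.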

In [None]:
# View default excluded distributions
print(f"Default excluded distributions ({len(DEFAULT_EXCLUDED_DISTRIBUTIONS)}):")
for dist in sorted(DEFAULT_EXCLUDED_DISTRIBUTIONS):
    print(f"  - {dist}")

In [None]:
# Include a specific distribution that's excluded by default
custom_exclusions = tuple(d for d in DEFAULT_EXCLUDED_DISTRIBUTIONS if d != "wald")

fitter_with_wald = DistributionFitter(backend=backend, excluded_distributions=custom_exclusions)
print("'wald' removed from exclusions; this fitter will include it in its next fit")

---

# Part 2: Distribution Fitting

The `DistributionFitter` class is the main entry point for fitting distributions.

## 2.1 Basic Fitting

In [None]:
# Create fitter with RayBackend
fitter = DistributionFitter(backend=backend)

# Fit distributions to normal data (limit to 20 for demo speed)
print("Fitting distributions to normal data...")
results_normal = fitter.fit(df_normal, column="value", max_distributions=20)

print(f"\nFitted {results_normal.count()} distributions")

## 2.2 Fitting with Custom Parameters

In [None]:
# Fit only non-negative distributions using support_at_zero=True
fitter_nonneg = DistributionFitter(backend=backend)

print("Fitting non-negative distributions to exponential data...")
results_exp = fitter_nonneg.fit(
    df_exp,
    column="value",
    bins=100,
    support_at_zero=True,  # Only fit non-negative distributions
    enable_sampling=True,
    max_distributions=15,
)

print(f"Fitted {results_exp.count()} non-negative distributions")

## 2.3 FitterConfig Builder (v2.2.0+)

For complex configurations with many parameters, use the **fluent builder pattern**. This provides:
- **Cleaner code**: Grouped, readable configuration
- **Reusability**: Same config works across multiple fits  
- **IDE-friendly**: Better autocomplete and discoverability
- **Immutable**: Frozen dataclass prevents accidental mutation
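
To make the "fluent builder returning a frozen dataclass" idea concrete, here is a minimal standalone sketch of the pattern. `DemoConfig` and `DemoConfigBuilder` are hypothetical names invented for illustration; they are not the library's classes and make no claim about how `FitterConfigBuilder` is implemented.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for an immutable configuration object."""

    bins: int = 50
    lazy_metrics: bool = False


class DemoConfigBuilder:
    """Hypothetical fluent builder: each with_* method returns self for chaining."""

    def __init__(self) -> None:
        self._config = DemoConfig()

    def with_bins(self, bins: int) -> "DemoConfigBuilder":
        # replace() creates a new frozen instance instead of mutating the old one
        self._config = replace(self._config, bins=bins)
        return self

    def with_lazy_metrics(self, enabled: bool = True) -> "DemoConfigBuilder":
        self._config = replace(self._config, lazy_metrics=enabled)
        return self

    def build(self) -> DemoConfig:
        return self._config


demo_config = DemoConfigBuilder().with_bins(100).with_lazy_metrics().build()

try:
    demo_config.bins = 10  # frozen dataclasses reject attribute assignment
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

print(demo_config, mutation_blocked)
```

Because the built object cannot be mutated, it is safe to share one configuration across several fits.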

In [None]:
# Build a reusable configuration with the fluent builder
config = (FitterConfigBuilder()
    .with_bins(100)                           # Histogram bins
    .with_support_at_zero()                   # Non-negative distributions only
    .with_sampling(fraction=0.5)              # Sample 50% of data
    .with_max_distributions(15)               # Limit to 15 distributions
    .with_lazy_metrics()                      # Defer KS/AD computation
    .build())

print("FitterConfig created:")
print(f"  bins: {config.bins}")
print(f"  support_at_zero: {config.support_at_zero}")
print(f"  sample_fraction: {config.sample_fraction}")
print(f"  max_distributions: {config.max_distributions}")
print(f"  lazy_metrics: {config.lazy_metrics}")

# Use config with fitter
fitter_config = DistributionFitter(backend=backend)
results_config = fitter_config.fit(df_exp, column="value", config=config)

print(f"\nFitted {results_config.count()} distributions using FitterConfig")

# Config is reusable - use same config for different data
results_gamma_config = fitter_config.fit(df_gamma, column="value", config=config)
print(f"Fitted {results_gamma_config.count()} distributions (reused same config)")

In [None]:
# FitterConfig with bounded fitting and prefilter
bounded_config = (FitterConfigBuilder()
    .with_bounds(lower=0, upper=100)          # Explicit bounds
    .with_prefilter()                          # Skip incompatible distributions
    .with_lazy_metrics()                       # Fast model selection
    .with_max_distributions(20)
    .build())

print("Bounded FitterConfig:")
print(f"  bounded: {bounded_config.bounded}")
print(f"  lower_bound: {bounded_config.lower_bound}")
print(f"  upper_bound: {bounded_config.upper_bound}")
print(f"  prefilter: {bounded_config.prefilter}")

# You can also create FitterConfig directly (without builder)
direct_config = FitterConfig(
    bins=50,
    lazy_metrics=True,
    max_distributions=10,
)
print(f"\nDirect FitterConfig: bins={direct_config.bins}, lazy={direct_config.lazy_metrics}")

## 2.4 Progress Tracking

For long-running fits, you can monitor progress with a callback. The easiest way is to use the built-in `console_progress()` utility:

```python
from spark_bestfit.progress import console_progress

results = fitter.fit(df, column="value", progress_callback=console_progress())
```

For custom callbacks, pass any function matching `(completed: int, total: int, percent: float) -> None`.
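
As a sketch of a custom callback with that signature, the function below records every update and prints only at 25% milestones. The driver loop is a simulation standing in for `fit()`, and the assumption that `percent` is on a 0-100 scale is mine, not something the signature above confirms.

```python
from typing import Callable, List, Tuple

ProgressCallback = Callable[[int, int, float], None]


def make_recording_callback(log: List[Tuple[int, int, float]]) -> ProgressCallback:
    """Return a callback that stores every update and prints each 25% milestone."""
    state = {"next_milestone": 25.0}

    def callback(completed: int, total: int, percent: float) -> None:
        log.append((completed, total, percent))
        if percent >= state["next_milestone"]:
            print(f"{completed}/{total} tasks done ({percent:.0f}%)")
            # Advance past every milestone this update has already crossed
            while state["next_milestone"] <= percent:
                state["next_milestone"] += 25.0

    return callback


# Simulated driver loop standing in for fit(); not the library's internals.
progress_log: List[Tuple[int, int, float]] = []
on_progress = make_recording_callback(progress_log)
total_tasks = 8
for done in range(1, total_tasks + 1):
    on_progress(done, total_tasks, 100.0 * done / total_tasks)
```

A callback like this could then be passed as `progress_callback=on_progress`.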

In [None]:
from spark_bestfit.progress import console_progress

# Simple approach: use built-in console_progress()
print("Fitting with console_progress()...")
fitter_progress = DistributionFitter(backend=backend)
results_progress = fitter_progress.fit(
    df_normal,
    column="value",
    max_distributions=25,
    progress_callback=console_progress("Fitting"),  # Built-in utility
)
print()  # Newline after progress
print(f"Fitted {results_progress.count()} distributions")

## 2.5 Using Ray Dataset (Distributed)

For larger datasets, you can use Ray Dataset for distributed aggregation. This avoids collecting raw data to the driver.
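
The idea behind distributed aggregation can be sketched without Ray: each partition reports only its extremes and its bin counts, and the driver merges those small summaries. This is a standard-library illustration of the general technique under that assumption, not the backend's actual code.

```python
import random
from typing import List, Sequence


def partial_histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[int]:
    """Count values per bin for one partition, using shared bin edges."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), bins - 1)  # put v == hi in the last bin
            counts[idx] += 1
    return counts


def merge_histograms(partials: Sequence[Sequence[int]]) -> List[int]:
    """Sum per-partition counts bin by bin."""
    return [sum(col) for col in zip(*partials)]


rng = random.Random(0)
partitions = [[rng.gauss(100, 25) for _ in range(1_000)] for _ in range(4)]

# Pass 1: global min/max from per-partition extremes (two numbers per partition).
global_lo = min(min(p) for p in partitions)
global_hi = max(max(p) for p in partitions)

# Pass 2: each partition reports only 20 integers, never its raw rows.
merged = merge_histograms(
    [partial_histogram(p, global_lo, global_hi, 20) for p in partitions]
)
print(merged)
```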

In [None]:
import ray

# Create a larger dataset
large_data = np.random.normal(loc=100, scale=25, size=100_000)
large_df = pd.DataFrame({"value": large_data})

# Convert to Ray Dataset
ds = ray.data.from_pandas(large_df)
print(f"Ray Dataset created with {ds.count()} rows")

# Fit using Ray Dataset - distributed histogram and sampling
results_ray = fitter.fit(ds, column="value", max_distributions=15)

best_ray = results_ray.best(n=1)[0]
print(f"Best fit: {best_ray.distribution}")
print(f"  KS statistic: {best_ray.ks_statistic:.6f}")

---

# Part 3: Working with Results

The `fit()` method returns a `FitResults` object for easy result manipulation.

## 3.1 Getting Best Distributions

In [None]:
# Get best distribution (by K-S statistic, the default)
best = results_normal.best(n=1)[0]
print(f"Best by K-S statistic: {best.distribution}")
print(f"  K-S statistic: {best.ks_statistic:.6f}")
print(f"  p-value: {best.pvalue:.4f}")
print(f"  A-D statistic: {best.ad_statistic:.6f}")
print(f"  A-D p-value: {best.ad_pvalue:.4f}" if best.ad_pvalue is not None else f"  A-D p-value: N/A (not available for {best.distribution})")
print(f"  SSE: {best.sse:.6f}")
print(f"  AIC: {best.aic:.2f}")
print(f"  BIC: {best.bic:.2f}")
print(f"  Parameters: {[f'{p:.4f}' for p in best.parameters]}")

In [None]:
# Get top 5 by different metrics
print("Top 5 by K-S statistic (default):")
for i, r in enumerate(results_normal.best(n=5), 1):
    print(f"  {i}. {r.distribution:20s} KS={r.ks_statistic:.6f} p={r.pvalue:.4f}")

print("\nTop 5 by A-D statistic:")
for i, r in enumerate(results_normal.best(n=5, metric="ad_statistic"), 1):
    ad_p = f"{r.ad_pvalue:.4f}" if r.ad_pvalue is not None else "N/A"
    print(f"  {i}. {r.distribution:20s} AD={r.ad_statistic:.6f} p={ad_p}")

print("\nTop 5 by SSE:")
for i, r in enumerate(results_normal.best(n=5, metric="sse"), 1):
    print(f"  {i}. {r.distribution:20s} SSE={r.sse:.6f}")

print("\nTop 5 by AIC:")
for i, r in enumerate(results_normal.best(n=5, metric="aic"), 1):
    print(f"  {i}. {r.distribution:20s} AIC={r.aic:.2f}")

## 3.2 Filtering Results

In [None]:
# Filter by K-S statistic threshold
good_fits = results_normal.filter(ks_threshold=0.05)
print(f"Distributions with K-S statistic < 0.05: {good_fits.count()}")

for r in good_fits.best(n=10):
    print(f"  {r.distribution:20s} KS={r.ks_statistic:.6f} p={r.pvalue:.4f}")

# Filter by p-value threshold (keep distributions with p-value > 0.05)
significant = results_normal.filter(pvalue_threshold=0.05)
print(f"\nDistributions with p-value > 0.05: {significant.count()}")

# Filter by A-D statistic threshold
good_ad = results_normal.filter(ad_threshold=2.0)
print(f"\nDistributions with A-D statistic < 2.0: {good_ad.count()}")

## 3.3 Converting to Pandas

In [None]:
# Convert to pandas DataFrame for further analysis
# For RayBackend, results.df is already a pandas DataFrame
df_results = results_normal.df
print("Results as pandas DataFrame:")
df_results.head(10)

## 3.4 Using Fitted Distributions

In [None]:
# The DistributionFitResult object wraps the scipy.stats distribution
best = results_normal.best(n=1)[0]

# Generate samples from the fitted distribution
samples = best.sample(size=10000, random_state=42)
print(f"Generated {len(samples)} samples from fitted {best.distribution}")
print(f"  Sample mean: {samples.mean():.2f} (original: {normal_data.mean():.2f})")
print(f"  Sample std: {samples.std():.2f} (original: {normal_data.std():.2f})")

In [None]:
# Evaluate PDF at specific points
x = np.array([30, 40, 50, 60, 70])
pdf_values = best.pdf(x)
cdf_values = best.cdf(x)

print("PDF and CDF values:")
for xi, pdf, cdf in zip(x, pdf_values, cdf_values):
    print(f"  x={xi}: PDF={pdf:.6f}, CDF={cdf:.4f}")

## 3.5 Parameter Confidence Intervals

Compute bootstrap confidence intervals for fitted distribution parameters. This is useful for understanding the uncertainty in your parameter estimates.

**Note**: CI width depends on sample size and distribution identifiability. Highly flexible distributions (like beta with 4 parameters) may have wider CIs due to parameter identifiability issues.
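
For intuition, a percentile bootstrap can be sketched with the standard library alone. The example assumes an exponential model with its location fixed at 0, so the scale estimate is simply the sample mean. It illustrates the general technique and may differ from what `confidence_intervals()` does internally.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_mean_ci(
    data: List[float], n_bootstrap: int = 500, alpha: float = 0.05, seed: int = 42
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean (the exponential scale MLE when loc = 0)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_bootstrap):
        resample = rng.choices(data, k=len(data))  # sample with replacement
        estimates.append(statistics.fmean(resample))
    estimates.sort()
    lower_idx = int((alpha / 2) * n_bootstrap)
    upper_idx = int((1 - alpha / 2) * n_bootstrap) - 1
    return estimates[lower_idx], estimates[upper_idx]


data_rng = random.Random(0)
demo_data = [data_rng.expovariate(1 / 5.0) for _ in range(2_000)]  # true scale = 5
ci_lower, ci_upper = bootstrap_mean_ci(demo_data)
print(f"scale estimate: {statistics.fmean(demo_data):.3f}, 95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
```

Re-running with a smaller `demo_data` should widen the interval, which is the sample-size effect described in the note above.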

In [None]:
# Use exponential fit for CI demo (simpler distribution = more stable CIs)
best_exp = results_exp.best(n=1)[0]

print(f"Distribution: {best_exp.distribution}")
print(f"Parameter names: {best_exp.get_param_names()}")
print(f"Fitted values: {[f'{p:.4f}' for p in best_exp.parameters]}")

# Compute 95% bootstrap confidence intervals
print("\nComputing 95% confidence intervals (this may take a few seconds)...")
ci = best_exp.confidence_intervals(
    df_exp,
    column="value",
    alpha=0.05,           # 95% CI
    n_bootstrap=500,      # Number of bootstrap samples (use 1000+ for production)
    random_seed=42,       # For reproducibility
)

print("\nParameter confidence intervals:")
for param, (lower, upper) in ci.items():
    print(f"  {param}: [{lower:.4f}, {upper:.4f}]")

## 3.6 Lazy Metrics for Fast Model Selection (v1.5.0+)

When fitting ~100 distributions, computing KS and AD statistics for all of them can be slow. With **lazy metrics**, these expensive computations are skipped during fitting and only computed on-demand when you actually need them.

**Key benefits:**
- Fast initial fitting (skip KS/AD computation)
- On-demand computation only for distributions you access
- Ideal for model selection workflows using AIC/BIC
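
AIC and BIC are cheap because they depend only on the maximized log-likelihood, the number of parameters, and the sample size. The sketch below uses the textbook definitions with a normal log-likelihood; the helper names are illustrative and not part of the library.

```python
import math
import random
from typing import Sequence, Tuple


def normal_log_likelihood(data: Sequence[float], mu: float, sigma: float) -> float:
    """Sum of log N(x | mu, sigma) over the data."""
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - sq / (2 * sigma**2)


def aic_bic(log_likelihood: float, n_params: int, n_obs: int) -> Tuple[float, float]:
    """AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic


rng = random.Random(42)
values = [rng.gauss(50, 10) for _ in range(5_000)]

# Candidate A: parameters near the truth. Candidate B: badly misplaced mean.
aic_good, bic_good = aic_bic(normal_log_likelihood(values, 50, 10), 2, len(values))
aic_bad, bic_bad = aic_bic(normal_log_likelihood(values, 60, 10), 2, len(values))
print(f"good: AIC={aic_good:.1f} BIC={bic_good:.1f}")
print(f"bad:  AIC={aic_bad:.1f} BIC={bic_bad:.1f}")
```

Lower values win for both criteria, and neither requires the extra pass over the data that a K-S or A-D test needs.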

In [None]:
# Fit with lazy metrics - KS/AD statistics are NOT computed during fitting
print("Fitting with lazy_metrics=True (fast)...")
fitter_lazy = DistributionFitter(backend=backend)
results_lazy = fitter_lazy.fit(
    df_normal,
    column="value",
    max_distributions=20,
    lazy_metrics=True,  # Skip KS/AD computation!
)

print(f"Fitted {results_lazy.count()} distributions")
print(f"Is lazy: {results_lazy.is_lazy}")

In [None]:
# Get best by AIC - fast! No KS/AD computation needed
best_aic = results_lazy.best(n=1, metric="aic")[0]

print(f"Best by AIC: {best_aic.distribution}")
print(f"  AIC: {best_aic.aic:.2f}")
print(f"  BIC: {best_aic.bic:.2f}")
print(f"  KS statistic: {best_aic.ks_statistic}")  # None - not computed yet!
print(f"  AD statistic: {best_aic.ad_statistic}")  # None - not computed yet!

In [None]:
# Get best by KS statistic - triggers ON-DEMAND computation!
# Only computes KS/AD for top candidates, not all distributions
best_ks = results_lazy.best(n=1, metric="ks_statistic")[0]

print(f"Best by KS: {best_ks.distribution}")
print(f"  KS statistic: {best_ks.ks_statistic:.6f}")  # Computed on-demand!
print(f"  p-value: {best_ks.pvalue:.4f}")
print(f"  AD statistic: {best_ks.ad_statistic:.6f}")

In [None]:
# If you need all metrics computed (e.g., before unpersisting source data),
# use materialize() to force computation of all KS/AD statistics
materialized = results_lazy.materialize()

print(f"Is lazy after materialize: {materialized.is_lazy}")  # False

# Now all distributions have KS/AD computed
top_3 = materialized.best(n=3, metric="ks_statistic")
print("\nTop 3 distributions (all metrics available):")
for i, r in enumerate(top_3, 1):
    print(f"  {i}. {r.distribution:15} KS={r.ks_statistic:.6f} p={r.pvalue:.4f}")

## 3.7 Pre-filtering Distributions (v1.6.0+)

When you know something about your data, you can skip distributions that are mathematically incompatible. Pre-filtering uses data characteristics (support bounds, skewness, kurtosis) to eliminate distributions before the expensive fitting step.

**Filtering layers:**
1. **Support bounds (100% reliable)**: Skips distributions whose support doesn't contain your data range
2. **Skewness sign (95% reliable)**: Skips positive-skew-only distributions for left-skewed data  
3. **Kurtosis (aggressive mode, ~80% reliable)**: Skips low-kurtosis distributions for heavy-tailed data
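
The first layer, the support-bounds check, can be sketched as a simple range comparison. The `SUPPORTS` table is a hypothetical, hand-written subset that assumes each distribution's location is fixed at its standard position; the library's real rules may be more involved.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical table of standardized supports (no loc/scale shift applied).
SUPPORTS: Dict[str, Tuple[float, float]] = {
    "norm": (-math.inf, math.inf),
    "logistic": (-math.inf, math.inf),
    "expon": (0.0, math.inf),
    "gamma": (0.0, math.inf),
    "beta": (0.0, 1.0),
}


def compatible_by_support(
    data: Sequence[float], supports: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Keep distributions whose support contains the whole data range."""
    data_min, data_max = min(data), max(data)
    return [
        name
        for name, (lower, upper) in supports.items()
        if lower <= data_min and data_max <= upper
    ]


negative_values = [-62.5, -48.0, -51.3, -39.9]
kept = compatible_by_support(negative_values, SUPPORTS)
print(kept)  # ['norm', 'logistic']
```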

In [None]:
# Create negative data (will filter out non-negative distributions like expon, gamma)
np.random.seed(42)
negative_data = np.random.normal(loc=-50, scale=10, size=10_000)
df_negative = pd.DataFrame({"value": negative_data})

print(f"Data range: [{negative_data.min():.1f}, {negative_data.max():.1f}]")
print("All values are negative - expon/gamma distributions cannot fit this data")

In [None]:
# Fit WITHOUT prefilter (baseline)
print("Fitting WITHOUT prefilter...")
fitter_no_prefilter = DistributionFitter(backend=backend)
results_no_prefilter = fitter_no_prefilter.fit(
    df_negative, 
    column="value", 
    max_distributions=20,
    prefilter=False,  # Default - fit all distributions
)
print(f"Fitted {results_no_prefilter.count()} distributions (no prefilter)")

In [None]:
# Fit WITH prefilter (safe mode) - skips incompatible distributions
print("\nFitting WITH prefilter=True (safe mode)...")
fitter_prefilter = DistributionFitter(backend=backend)
results_prefilter = fitter_prefilter.fit(
    df_negative,
    column="value", 
    max_distributions=20,
    prefilter=True,  # Enable pre-filtering!
)
print(f"Fitted {results_prefilter.count()} distributions (with prefilter)")
print("\n-> Pre-filter skipped distributions incompatible with negative data")

In [None]:
# Compare best fits - both should find norm as best for normal data
best_no_prefilter = results_no_prefilter.best(n=1)[0]
best_prefilter = results_prefilter.best(n=1)[0]

print("Best distribution comparison:")
print(f"  Without prefilter: {best_no_prefilter.distribution} (KS={best_no_prefilter.ks_statistic:.6f})")
print(f"  With prefilter:    {best_prefilter.distribution} (KS={best_prefilter.ks_statistic:.6f})")
print("\n-> The best fit should match; prefilter saves work by skipping incompatible distributions before fitting")

---

# Part 4: Plotting

Visualize the fitted distribution with the data histogram.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

## 4.1 Basic Plot

In [None]:
# Basic plot with default config
best = results_normal.best(n=1)[0]
fig, ax = fitter.plot(
    best,
    df_normal,
    "value",
    title="Best Fit Distribution (Normal Data)",
    xlabel="Value",
    ylabel="Density"
)
plt.show()

## 4.2 Plot with Custom Parameters

In [None]:
# Custom plot with direct parameters
fig, ax = fitter.plot(
    best,
    df_normal,
    "value",
    figsize=(14, 8),
    dpi=100,
    histogram_alpha=0.7,
    pdf_linewidth=3,
    title_fontsize=18,
    label_fontsize=14,
    legend_fontsize=12,
    grid_alpha=0.4,
    title="Distribution Fit with Custom Styling",
    xlabel="Value",
    ylabel="Density",
)
plt.show()

## 4.3 Plot Non-Negative Distribution

In [None]:
# Best fit for exponential data
best_exp = results_exp.best(n=1)[0]
print(f"Best fit for exponential data: {best_exp.distribution}")
print(f"  K-S statistic: {best_exp.ks_statistic:.6f}")
print(f"  p-value: {best_exp.pvalue:.4f}")

fig, ax = fitter_nonneg.plot(
    best_exp,
    df_exp,
    "value",
    figsize=(14, 8),
    dpi=100,
    histogram_alpha=0.7,
    pdf_linewidth=3,
    title_fontsize=18,
    title=f"Best Fit: {best_exp.distribution.capitalize()}",
    xlabel="Value",
    ylabel="Density",
)
plt.show()

## 4.4 Q-Q Plots for Goodness-of-Fit Assessment

A Q-Q (quantile-quantile) plot is a powerful visual tool for assessing how well a distribution fits your data. It plots sample quantiles against theoretical quantiles from the fitted distribution. If the fit is good, points will fall approximately along the diagonal reference line.
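
The points behind a Q-Q plot are easy to compute by hand. This sketch uses `statistics.NormalDist` as the fitted distribution and the plotting positions `(i - 0.5) / n`, which is one common convention and an assumption here; `plot_qq()` may use a different one.

```python
import random
from statistics import NormalDist
from typing import List, Tuple


def qq_points(sample: List[float], dist: NormalDist) -> List[Tuple[float, float]]:
    """Pair theoretical quantiles with sorted sample values."""
    ordered = sorted(sample)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n avoid the infinite quantiles at 0 and 1.
    return [(dist.inv_cdf((i - 0.5) / n), v) for i, v in enumerate(ordered, start=1)]


rng = random.Random(42)
qq_sample = [rng.gauss(50, 10) for _ in range(1_000)]
qq_fitted = NormalDist.from_samples(qq_sample)
qq_pairs = qq_points(qq_sample, qq_fitted)

# For a good fit, the central points stay close to the line y = x (data units).
central_gap = max(abs(theo - obs) for theo, obs in qq_pairs[100:900])
print(f"largest gap in the central 80%: {central_gap:.3f}")
```

Because both axes are in data units, deviations in the tails are stretched out and easy to see.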

In [None]:
# Q-Q plot for the best fit on normal data
fig, ax = fitter.plot_qq(
    best,
    df_normal,
    "value",
    max_points=1000,  # Sample size for plotting (too many points clutters the plot)
    figsize=(10, 10),
    title="Q-Q Plot: Normal Data vs Fitted Distribution",
)
plt.show()

# Compare: Q-Q plot for exponential data
fig, ax = fitter_nonneg.plot_qq(
    best_exp,
    df_exp,
    "value",
    max_points=1000,
    figsize=(10, 10),
    title="Q-Q Plot: Exponential Data vs Fitted Distribution",
)
plt.show()

## 4.5 P-P Plots for Goodness-of-Fit Assessment

A P-P (probability-probability) plot compares the empirical cumulative distribution function (CDF) of the sample data against the theoretical CDF of the fitted distribution. It is particularly useful for assessing the fit in the center of the distribution.
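
A P-P plot pairs empirical probabilities with fitted CDF values, so both axes run from 0 to 1. The sketch below again uses `statistics.NormalDist` and the `(i - 0.5) / n` plotting positions as illustrative assumptions rather than the library's exact choices.

```python
import random
from statistics import NormalDist
from typing import List, Tuple


def pp_points(sample: List[float], dist: NormalDist) -> List[Tuple[float, float]]:
    """Pair empirical probabilities with fitted CDF values at each sorted point."""
    ordered = sorted(sample)
    n = len(ordered)
    return [((i - 0.5) / n, dist.cdf(v)) for i, v in enumerate(ordered, start=1)]


rng = random.Random(7)
pp_sample = [rng.gauss(50, 10) for _ in range(1_000)]
pp_fitted = NormalDist.from_samples(pp_sample)
pp_pairs = pp_points(pp_sample, pp_fitted)

# Both coordinates live in [0, 1]; a good fit stays close to the diagonal.
max_deviation = max(abs(emp - theo) for emp, theo in pp_pairs)
print(f"largest deviation from the diagonal: {max_deviation:.4f}")
```

Since the CDF flattens in the tails, tail misfit is compressed on this scale, which is why the P-P plot is most informative around the center.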

In [None]:
# P-P plot for the best fit on normal data
fig, ax = fitter.plot_pp(
    best,
    df_normal,
    "value",
    max_points=1000,  # Sample size for plotting (too many points clutters the plot)
    figsize=(10, 10),
    title="P-P Plot: Normal Data vs Fitted Distribution",
)
plt.show()

# Compare: P-P plot for exponential data
fig, ax = fitter_nonneg.plot_pp(
    best_exp,
    df_exp,
    "value",
    max_points=1000,
    figsize=(10, 10),
    title="P-P Plot: Exponential Data vs Fitted Distribution",
)
plt.show()

---

# Part 5: Multi-Column Fitting

Fit multiple columns efficiently in a single operation. This shares overhead across all columns.

## 5.1 Create Multi-Column DataFrame

In [None]:
# Create a DataFrame with multiple columns (different distributions)
np.random.seed(42)

# Generate data from different distributions
n_rows = 20_000
df_multi = pd.DataFrame({
    "normal_col": np.random.normal(50, 10, n_rows),
    "exp_col": np.random.exponential(5, n_rows),
    "gamma_col": np.random.gamma(2.0, 2.0, n_rows),
})

print(f"Created DataFrame with {len(df_multi):,} rows and columns: {df_multi.columns.tolist()}")

## 5.2 Fit Multiple Columns in One Call

In [None]:
# Fit distributions to all columns in a single operation
# This is more efficient than fitting each column separately
print("Fitting distributions to 3 columns simultaneously...")

fitter_multi = DistributionFitter(backend=backend)
results_multi = fitter_multi.fit(
    df_multi,
    columns=["normal_col", "exp_col", "gamma_col"],  # Multi-column fitting!
    max_distributions=15,
)

print(f"\nTotal results: {results_multi.count()}")
print(f"Columns fitted: {results_multi.column_names}")

## 5.3 Get Best Distribution Per Column

In [None]:
# Get the best distribution for each column
best_per_col = results_multi.best_per_column(n=1)

print("Best distribution per column:")
for col_name, fits in best_per_col.items():
    best = fits[0]
    print(f"\n{col_name}:")
    print(f"  Distribution: {best.distribution}")
    print(f"  K-S statistic: {best.ks_statistic:.6f}")
    print(f"  p-value: {best.pvalue:.4f}")
    print(f"  Parameters: {[f'{p:.4f}' for p in best.parameters]}")

## 5.4 Filter Results by Column

In [None]:
# Get results for a specific column
exp_results = results_multi.for_column("exp_col")

print(f"Results for 'exp_col': {exp_results.count()} distributions")
print("\nTop 5 by K-S statistic:")
for i, r in enumerate(exp_results.best(n=5), 1):
    print(f"  {i}. {r.distribution:15} KS={r.ks_statistic:.6f}")

## 5.5 Plot Results for Each Column

In [None]:
# Plot the best fit for each column
for col_name, fits in best_per_col.items():
    best = fits[0]
    fig, ax = fitter_multi.plot(
        best,
        df_multi,
        col_name,
        title=f"{col_name}: {best.distribution} (KS={best.ks_statistic:.4f})",
        xlabel="Value",
        ylabel="Density",
        figsize=(10, 6),
    )
    plt.show()

---

# Part 6: Complete Workflow Example

Putting it all together - a complete production-style workflow.

In [None]:
# Complete workflow with all parameters
fitter_gamma = DistributionFitter(backend=backend, random_seed=42)

# Fit distributions
print("Fitting gamma distribution data...")
results = fitter_gamma.fit(
    df_gamma,
    column="value",
    bins=100,
    use_rice_rule=False,
    enable_sampling=True,
    max_sample_size=1_000_000,
    max_distributions=25,
)

# Get best result
best = results.best(n=1)[0]
print(f"\nBest distribution: {best.distribution}")
print(f"K-S statistic: {best.ks_statistic:.6f}")
print(f"p-value: {best.pvalue:.4f}")
print(f"SSE: {best.sse:.6f}")
print(f"Parameters: {[f'{p:.4f}' for p in best.parameters]}")

# Plot with custom parameters
fig, ax = fitter_gamma.plot(
    best,
    df_gamma,
    "value",
    figsize=(14, 9),
    dpi=150,
    histogram_alpha=0.6,
    pdf_linewidth=3,
    title_fontsize=16,
    title=f"Gamma Data - Best Fit: {best.distribution.capitalize()}",
    xlabel="Value",
    ylabel="Density",
)
plt.show()

# Show top 5 results (sorted by K-S statistic for meaningful ranking)
# For RayBackend, results.df is already a pandas DataFrame
print("\nTop 5 distributions:")
df_top5 = results.df.sort_values("ks_statistic").head(5)
df_top5[["distribution", "ks_statistic", "pvalue", "sse", "aic", "bic"]]

---

# Part 7: Serialization

Save fitted distributions to disk and reload them later for inference without re-fitting.

## 7.1 Save and Load

Save a fitted distribution to JSON (default) or pickle format.

In [None]:
import tempfile
from pathlib import Path

# Use the best fit from Part 6
print(f"Saving distribution: {best.distribution}")
print(f"Parameters: {best.parameters}")

# Save to a temporary directory
model_dir = Path(tempfile.mkdtemp())
json_path = model_dir / "best_model.json"
pkl_path = model_dir / "best_model.pkl"

# Save as JSON (human-readable, default)
best.save(json_path)
print(f"\nSaved to JSON: {json_path}")
print(f"File size: {json_path.stat().st_size:,} bytes")

# Save as pickle (binary, faster)
best.save(pkl_path, format="pickle")
print(f"\nSaved to pickle: {pkl_path}")
print(f"File size: {pkl_path.stat().st_size:,} bytes")

In [None]:
from spark_bestfit import DistributionFitResult

# Load the saved model
loaded = DistributionFitResult.load(json_path)

print(f"Loaded distribution: {loaded.distribution}")
print(f"Parameters: {loaded.parameters}")
print(f"K-S statistic: {loaded.ks_statistic:.6f}")
print(f"p-value: {loaded.pvalue:.4f}")

# Verify loaded model works
samples = loaded.sample(size=1000, random_state=42)
print(f"\nGenerated {len(samples)} samples from loaded model")
print(f"Sample mean: {samples.mean():.2f}")
print(f"Sample std: {samples.std():.2f}")

## 7.2 Data Summary

When fitting with `DistributionFitter`, the result includes data statistics
about the fitted data. This provides lightweight provenance tracking.

In [None]:
# Access data statistics from the loaded model (v2.0+ flat field API)
if loaded.data_count is not None:
    print("Data Statistics (from fitting):")
    print(f"  Sample size: {loaded.data_count:,.0f}")
    print(f"  Min: {loaded.data_min:.4f}")
    print(f"  Max: {loaded.data_max:.4f}")
    print(f"  Mean: {loaded.data_mean:.4f}")
    print(f"  Std: {loaded.data_stddev:.4f}")
else:
    print("No data statistics available (result may have been created manually)")

## 7.3 JSON Format

The JSON format is human-readable and includes version metadata for compatibility.

In [None]:
# View the JSON content
with open(json_path) as f:
    content = f.read()

print("JSON file content:")
print(content)

In [None]:
# Cleanup temporary files
import shutil
shutil.rmtree(model_dir)
print(f"Cleaned up temporary directory: {model_dir}")

---

# Part 8: Ray-Specific Features

These examples show discrete fitting, copulas, and distributed sample generation running on the RayBackend.

## 8.1 Discrete Distribution Fitting

In [None]:
from spark_bestfit import DiscreteDistributionFitter

# Generate count data
count_data = np.random.poisson(lam=12, size=3000)
count_df = pd.DataFrame({"events": count_data})

# Convert to Ray Dataset
count_ds = ray.data.from_pandas(count_df)

# Fit discrete distributions
discrete_fitter = DiscreteDistributionFitter(backend=backend)
discrete_results = discrete_fitter.fit(count_ds, column="events", max_distributions=5)

# Best by AIC (recommended for discrete)
best_discrete = discrete_results.best(n=1, metric="aic")[0]
print(f"Best discrete: {best_discrete.distribution}")
print(f"  Parameters: {dict(zip(best_discrete.get_param_names(), best_discrete.parameters))}")
print(f"  AIC: {best_discrete.aic:.2f}")
print(f"  Data mean: {count_data.mean():.2f}")

## 8.2 Gaussian Copula with RayBackend

The Gaussian Copula uses RayBackend for distributed correlation computation and sample generation.
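
For readers new to copulas, the core sampling idea fits in a few lines: draw correlated standard normals, map them to uniforms with the normal CDF, then push each uniform through the inverse CDF of its marginal. This two-variable, standard-library sketch uses made-up marginals (normal and exponential) and is not the `GaussianCopula` implementation.

```python
import math
import random
from statistics import NormalDist, fmean
from typing import List, Sequence, Tuple

STD_NORMAL = NormalDist()


def clamp_unit(u: float, eps: float = 1e-12) -> float:
    """Keep u strictly inside (0, 1) so inverse CDFs stay finite."""
    return min(max(u, eps), 1.0 - eps)


def copula_sample(n: int, rho: float, seed: int = 0) -> List[Tuple[float, float]]:
    """Draw (normal, exponential) pairs linked by a Gaussian copula with parameter rho."""
    rng = random.Random(seed)
    marginal_a = NormalDist(mu=50, sigma=10)
    exp_scale = 5.0
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # 2x2 Cholesky factor: z2 has correlation rho with z1.
        z2 = rho * z1 + math.sqrt(1 - rho**2) * rng.gauss(0, 1)
        u1 = clamp_unit(STD_NORMAL.cdf(z1))
        u2 = clamp_unit(STD_NORMAL.cdf(z2))
        a = marginal_a.inv_cdf(u1)        # inverse CDF of the normal marginal
        b = -exp_scale * math.log1p(-u2)  # inverse CDF of the exponential marginal
        pairs.append((a, b))
    return pairs


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


copula_pairs = copula_sample(5_000, rho=0.8)
col_a = [a for a, _ in copula_pairs]
col_b = [b for _, b in copula_pairs]
sample_corr = pearson(col_a, col_b)
print(f"Pearson correlation of the generated pairs: {sample_corr:.3f}")
```

The Pearson correlation of the outputs generally comes out somewhat below `rho`, because the exponential marginal transform is nonlinear.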

In [None]:
from spark_bestfit import GaussianCopula

# Create correlated data
n = 5000
x = np.random.normal(0, 1, n)
correlated_data = pd.DataFrame({
    "feature_a": x * 10 + 50,  # Normal-like
    "feature_b": np.exp(0.5 * x + np.random.normal(0, 0.3, n)),  # Log-normal-like
    "feature_c": np.abs(x) * 20 + np.random.exponential(5, n),  # Right-skewed
})

# Fit marginal distributions
marginal_results = fitter.fit(
    correlated_data, 
    columns=["feature_a", "feature_b", "feature_c"],
    max_distributions=10
)

print("Marginal fits:")
for col, fits in marginal_results.best_per_column(n=1).items():
    print(f"  {col}: {fits[0].distribution}")

In [None]:
# Fit copula (correlation computed via RayBackend)
copula = GaussianCopula.fit(marginal_results, correlated_data, backend=backend)

print("\nCorrelation matrix:")
print(copula.correlation_matrix.round(3))

In [None]:
# Generate correlated samples
samples = copula.sample(n=1000)

# Verify correlation is preserved
sample_df = pd.DataFrame(samples)
print("Sample correlation matrix:")
print(sample_df.corr().round(3))

## 8.3 Distributed Sample Generation

RayBackend's `generate_samples` distributes sample generation across workers.
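
Two details matter when sample generation is split across workers: dividing the total count evenly, and giving each partition its own reproducible seed so that partitions do not repeat each other. The sketch below shows one simple scheme in plain Python; the helper names and the seed derivation are assumptions, not the backend's actual logic.

```python
import random
from typing import List


def split_counts(n: int, num_partitions: int) -> List[int]:
    """Divide n samples across partitions, giving the remainder to the first ones."""
    base, extra = divmod(n, num_partitions)
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]


def partition_seed(base_seed: int, partition_id: int) -> int:
    """Derive a distinct, reproducible seed per partition (one simple scheme)."""
    return random.Random(f"{base_seed}-{partition_id}").getrandbits(32)


def generate_partition(n_samples: int, seed: int) -> List[float]:
    """Stand-in for a worker task: draw n_samples values from its own generator."""
    rng = random.Random(seed)
    return [rng.gauss(50, 10) for _ in range(n_samples)]


counts = split_counts(100_003, 4)
seeds = [partition_seed(42, pid) for pid in range(4)]
chunks = [generate_partition(c, s) for c, s in zip(counts, seeds)]
print(counts)  # [25001, 25001, 25001, 25000]
```

Reusing the same seed in every partition would produce identical chunks, which silently shrinks the effective sample size.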

In [None]:
# Generate large sample using RayBackend
best_fit = results_multi.for_column("normal_col").best(n=1)[0]

# Local sampling (small scale)
local_samples = best_fit.sample(size=10000)
print(f"Local samples: mean={local_samples.mean():.2f}, std={local_samples.std():.2f}")

# For very large samples, use the backend's generate_samples
def sample_generator(n_samples, partition_id, seed):
    # Pass the per-partition seed explicitly instead of mutating NumPy's global state
    return {"value": best_fit.sample(size=n_samples, random_state=seed)}

large_samples = backend.generate_samples(
    n=100000,
    generator_func=sample_generator,
    column_names=["value"],
    num_partitions=4,
    random_seed=42
)

print(f"\nDistributed samples: {len(large_samples)} rows")
print(f"  mean={large_samples['value'].mean():.2f}, std={large_samples['value'].std():.2f}")

## 8.4 Performance Comparison: pandas vs Ray Dataset

For small datasets, pandas is faster (no serialization overhead). For large datasets, Ray Dataset provides distributed aggregation.

In [None]:
import time

# Small dataset comparison
small_data = pd.DataFrame({"value": np.random.exponential(5, 1000)})
small_ds = ray.data.from_pandas(small_data)

# pandas DataFrame (perf_counter is a monotonic clock intended for timing)
start = time.perf_counter()
_ = fitter.fit(small_data, column="value", max_distributions=5)
pandas_time = time.perf_counter() - start

# Ray Dataset
start = time.perf_counter()
_ = fitter.fit(small_ds, column="value", max_distributions=5)
ray_time = time.perf_counter() - start

print("Small dataset (1K rows):")
print(f"  pandas: {pandas_time:.2f}s")
print(f"  Ray Dataset: {ray_time:.2f}s")
print("  Recommendation: use pandas for small datasets")

---

## Summary

This notebook demonstrated:

1. **Excluded Distributions**:
   - `DEFAULT_EXCLUDED_DISTRIBUTIONS` - Slow distributions excluded by default
   - Pass custom `excluded_distributions` to `DistributionFitter()` to include/exclude

2. **RayBackend Initialization**:
   - `RayBackend()` - Auto-initialize locally
   - `RayBackend(num_cpus=N)` - Limit CPU usage
   - `RayBackend(address="auto")` - Connect to existing cluster

3. **Fitting**:
   - `DistributionFitter.fit()` - Fit distributions to data
   - Parameters: `bins`, `use_rice_rule`, `support_at_zero`, `enable_sampling`, etc.
   - `max_distributions` parameter to limit fitting scope
   - `progress_callback` parameter to monitor long-running fits

4. **FitterConfig Builder (v2.2.0+)**:
   - `FitterConfigBuilder()` - Fluent API for complex configurations
   - Chain methods: `.with_bins()`, `.with_bounds()`, `.with_lazy_metrics()`, etc.
   - `.build()` returns immutable `FitterConfig` dataclass
   - Pass `config=config` to `fit()` for reusable configurations

5. **Progress Tracking**:
   - Pass `progress_callback=fn` to `fit()` to receive progress updates
   - Callback receives `(completed_tasks, total_tasks, percent_complete)`
   - Works with both `DistributionFitter` and `DiscreteDistributionFitter`

6. **Results**:
   - `results.best(n, metric)` - Get top N by K-S statistic (default), A-D statistic, SSE, AIC, or BIC
   - `results.filter(ks_threshold, pvalue_threshold, ad_threshold)` - Filter by goodness-of-fit
   - `results.df` - Access underlying pandas DataFrame (RayBackend uses pandas)
   - `DistributionFitResult.sample()`, `.pdf()`, `.cdf()` - Use fitted distribution
   - `DistributionFitResult.get_param_names()` - Get parameter names
   - `DistributionFitResult.confidence_intervals()` - Bootstrap confidence intervals

7. **Lazy Metrics (v1.5.0+)**:
   - `lazy_metrics=True` - Skip KS/AD computation during fitting for faster iteration
   - `results.is_lazy` - Check if results have lazy metrics
   - `results.best(metric="ks_statistic")` - Triggers on-demand computation for top candidates only
   - `results.materialize()` - Force computation of all KS/AD statistics

8. **Pre-filtering (v1.6.0+)**:
   - `prefilter=True` - Skip distributions incompatible with your data (safe mode)
   - `prefilter="aggressive"` - Also filter by kurtosis for heavy-tailed data
   - Uses support bounds, skewness sign, and kurtosis to eliminate distributions
   - 30-70% fewer distributions to fit with automatic fallback

9. **Multi-Column Fitting**:
   - `fitter.fit(df, columns=[...])` - Fit multiple columns in one call
   - `results.column_names` - List all fitted columns
   - `results.for_column(name)` - Filter results to one column
   - `results.best_per_column(n, metric)` - Get top N per column

10. **Plotting**:
    - `fitter.plot()` - Visualize fitted distribution with data histogram
    - `fitter.plot_qq()` - Q-Q plot for visual goodness-of-fit assessment
    - `fitter.plot_pp()` - P-P plot for assessing fit in the center of distribution
    - Customizable with `figsize`, `dpi`, `histogram_alpha`, `pdf_linewidth`, etc.

11. **Serialization**:
    - `result.save(path)` - Save to JSON (default) or pickle format
    - `DistributionFitResult.load(path)` - Load a saved result
    - `result.data_min`, `result.data_max`, etc. - Access fitting statistics for provenance
    - JSON format includes version metadata for compatibility

12. **Goodness-of-Fit Metrics**:
    - **K-S statistic** (default) - Lower is better, measures max distance from empirical CDF
    - **A-D statistic** - Lower is better, more sensitive to tails than K-S
    - **p-value** - Higher is better (>0.05 means the fit is not rejected at the 5% level)
    - **A-D p-value** - Only available for norm, expon, logistic, gumbel_r, gumbel_l
    - **SSE** - Sum of squared errors between histogram and fitted PDF
    - **AIC/BIC** - Information criteria for model comparison
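
To make the K-S definition concrete, the statistic can be computed by hand as the largest vertical gap between the empirical CDF and a candidate CDF. This is a textbook sketch using `statistics.NormalDist`; it is not the library's code and it omits the p-value.

```python
import random
from statistics import NormalDist
from typing import Callable, Sequence


def ks_statistic(sample: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Largest vertical gap between the empirical CDF and a candidate CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, v in enumerate(ordered, start=1):
        f = cdf(v)
        # The empirical CDF jumps from (i - 1) / n to i / n at v; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


rng = random.Random(42)
ks_sample = [rng.gauss(50, 10) for _ in range(2_000)]

ks_good = ks_statistic(ks_sample, NormalDist(50, 10).cdf)  # correct parameters
ks_poor = ks_statistic(ks_sample, NormalDist(55, 10).cdf)  # mean off by half a sigma
print(f"KS (good candidate): {ks_good:.4f}")
print(f"KS (poor candidate): {ks_poor:.4f}")
```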

**When to use RayBackend:**

| Scenario | Recommended Backend |
|----------|---------------------|
| Local development/testing | LocalBackend |
| Spark cluster available | SparkBackend |
| Ray cluster or Ray-based ML pipeline | RayBackend |
| Data already in Ray Dataset | RayBackend |

In [None]:
# Shutdown Ray (optional - useful in notebooks)
ray.shutdown()