# spark-bestfit Discrete Distribution Demo

This notebook demonstrates fitting **discrete (count) distributions** using the `DiscreteDistributionFitter` class.

Use cases:
- Event counts (website visits, error occurrences, purchases)
- Count data (items per order, failures per unit, defects per batch)
- Any integer-valued data

Key differences from continuous fitting:
- Uses PMF (probability mass function) instead of PDF
- Fits parameters by numerical maximum likelihood estimation (MLE), because `scipy.stats` discrete distributions have no built-in `fit()` method
- Integer-aligned histograms
- **AIC recommended** for model selection (the K-S test assumes a continuous distribution, so its p-values are not reliable for discrete data)
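
To make the MLE point concrete, here is a minimal sketch of fitting a Poisson rate by minimizing the negative log-likelihood with plain SciPy. It illustrates the general technique only; the optimizer, bounds, and seed are illustrative assumptions and not necessarily what `DiscreteDistributionFitter` does internally.

```python
import numpy as np
from scipy import optimize, stats

# Synthetic sample; the seed and size are arbitrary illustrative choices.
rng = np.random.default_rng(0)
data = rng.poisson(lam=7.5, size=5_000)


def neg_log_likelihood(lam: float) -> float:
    """Negative Poisson log-likelihood of the sample for rate `lam`."""
    return -stats.poisson.logpmf(data, lam).sum()


# Bounded scalar minimization; the bounds are an assumption for this sketch.
result = optimize.minimize_scalar(
    neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded"
)
lam_hat = result.x

print(f"Numerical MLE rate: {lam_hat:.4f}")
print(f"Sample mean:        {data.mean():.4f}")
```

For the Poisson distribution the MLE has a closed form (the sample mean), so the two printed numbers should agree closely. Distributions without a closed form need the numerical route.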

## Setup

In [None]:
import numpy as np
from pyspark.sql import SparkSession

# Create Spark session
spark = (
    SparkSession.builder
    .appName("Discrete-Distribution-Demo")
    .config("spark.sql.shuffle.partitions", "10")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

print(f"Spark version: {spark.version}")

In [None]:
# Import discrete fitting components
from spark_bestfit import (
    DiscreteDistributionFitter,
    DiscreteDistributionRegistry,
    DEFAULT_EXCLUDED_DISCRETE_DISTRIBUTIONS,
)

## Generate Sample Data

We'll create sample data from known discrete distributions.

In [None]:
np.random.seed(42)

# Poisson data (event counts, e.g., website visits per hour)
poisson_data = np.random.poisson(lam=7.5, size=10_000)
df_poisson = spark.createDataFrame([(int(x),) for x in poisson_data], ["counts"])

# Negative binomial data (overdispersed counts, e.g., insurance claims)
nbinom_data = np.random.negative_binomial(n=5, p=0.4, size=10_000)
df_nbinom = spark.createDataFrame([(int(x),) for x in nbinom_data], ["counts"])

# Geometric data (number of trials up to and including the first success, so values start at 1)
geom_data = np.random.geometric(p=0.25, size=10_000)
df_geom = spark.createDataFrame([(int(x),) for x in geom_data], ["counts"])

print(f"Poisson(λ=7.5): {len(poisson_data):,} samples, mean={poisson_data.mean():.2f}")
print(f"NegBinom(n=5, p=0.4): {len(nbinom_data):,} samples, mean={nbinom_data.mean():.2f}")
print(f"Geometric(p=0.25): {len(geom_data):,} samples, mean={geom_data.mean():.2f}")
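
Before fitting, a quick variance-to-mean ratio (dispersion index) can hint at which family is plausible: about 1 for Poisson data, above 1 for overdispersed data. The sketch below regenerates comparable samples with NumPy's `default_rng`, so the exact values differ from the `np.random.seed` samples above. The helper name `dispersion_index` is hypothetical.

```python
import numpy as np

# Same distribution parameters as the notebook, but a different generator.
rng = np.random.default_rng(42)
samples = {
    "poisson": rng.poisson(lam=7.5, size=10_000),
    "nbinom": rng.negative_binomial(n=5, p=0.4, size=10_000),
    "geom": rng.geometric(p=0.25, size=10_000),
}


def dispersion_index(x: np.ndarray) -> float:
    """Sample variance divided by sample mean."""
    return float(x.var(ddof=1) / x.mean())


ratios = {name: dispersion_index(x) for name, x in samples.items()}
for name, ratio in ratios.items():
    print(f"{name:8} variance/mean = {ratio:.2f}")
```

The theoretical ratios are 1 for Poisson, `1/p = 2.5` for this negative binomial, and `(1 - p)/p = 3` for this geometric distribution.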

---

# Part 1: Available Discrete Distributions

In [None]:
# View available discrete distributions
registry = DiscreteDistributionRegistry()
distributions = registry.get_distributions()

print(f"Available discrete distributions ({len(distributions)}):")
for dist in sorted(distributions):
    config = registry.get_param_config(dist)
    print(f"  {dist:15} params: {config['param_names']}")

In [None]:
# Default excluded distributions (slow or problematic)
print(f"\nDefault excluded ({len(DEFAULT_EXCLUDED_DISCRETE_DISTRIBUTIONS)}):")
for dist in sorted(DEFAULT_EXCLUDED_DISCRETE_DISTRIBUTIONS):
    print(f"  - {dist}")
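
For comparison, the discrete distributions that SciPy itself ships can be listed by checking which `scipy.stats` attributes are `rv_discrete` instances. This is a sketch of one way to enumerate candidates; whether `DiscreteDistributionRegistry` builds its list this way is an assumption.

```python
from scipy import stats

# Collect every scipy.stats attribute that is a discrete distribution object.
discrete_names = sorted(
    name
    for name in dir(stats)
    if isinstance(getattr(stats, name), stats.rv_discrete)
)

print(f"SciPy discrete distributions ({len(discrete_names)}):")
for name in discrete_names:
    dist = getattr(stats, name)
    # `shapes` is a comma-separated string of shape parameter names.
    print(f"  {name:15} shapes: {dist.shapes}")
```

The count depends on the installed SciPy version, so it may not match the registry output above.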

---

# Part 2: Fitting Discrete Distributions

In [None]:
# Create discrete fitter
fitter = DiscreteDistributionFitter(spark)

# Fit distributions to Poisson data
print("Fitting discrete distributions to Poisson data...")
results_poisson = fitter.fit(df_poisson, column="counts")

print(f"\nFitted {results_poisson.count()} distributions")

In [None]:
# Get best distributions by AIC (recommended for discrete)
print("Top 5 distributions by AIC (lower is better):")
for i, r in enumerate(results_poisson.best(n=5, metric="aic"), 1):
    params_str = ", ".join([f"{p:.3f}" for p in r.parameters])
    print(f"  {i}. {r.distribution:15} AIC={r.aic:.2f} params=({params_str})")

In [None]:
# Best fit should be Poisson with λ ≈ 7.5
best = results_poisson.best(n=1, metric="aic")[0]
print(f"\nBest fit: {best.distribution}")
print(f"  Parameters: {[f'{p:.4f}' for p in best.parameters]}")
print(f"  AIC: {best.aic:.2f}")
print(f"  BIC: {best.bic:.2f}")
print(f"  SSE: {best.sse:.6f}")
print(f"  K-S statistic: {best.ks_statistic:.6f} (for ranking only)")
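
The AIC and BIC values follow the standard definitions `AIC = 2k - 2 ln L` and `BIC = k ln(n) - 2 ln L`, where `k` is the number of fitted parameters, `n` the sample size, and `L` the maximized likelihood. The sketch below computes both for two one-parameter candidates with SciPy. The helper `information_criteria` is hypothetical, and it is an assumption that spark-bestfit computes its metrics in exactly this way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.poisson(lam=7.5, size=5_000)
n_obs = data.size


def information_criteria(log_likelihood: float, n_params: int, n_obs: int):
    """Return (AIC, BIC) from a maximized log-likelihood."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * np.log(n_obs) - 2 * log_likelihood
    return aic, bic


# Candidate 1: Poisson, whose MLE rate is the sample mean.
ll_poisson = stats.poisson.logpmf(data, data.mean()).sum()

# Candidate 2: geometric shifted to start at 0 (loc=-1), because Poisson
# data can contain zeros. Its MLE is p = 1 / (1 + mean).
p_hat = 1.0 / (1.0 + data.mean())
ll_geom = stats.geom.logpmf(data, p_hat, loc=-1).sum()

aic_poisson, bic_poisson = information_criteria(ll_poisson, 1, n_obs)
aic_geom, bic_geom = information_criteria(ll_geom, 1, n_obs)

print(f"poisson  AIC={aic_poisson:.2f}  BIC={bic_poisson:.2f}")
print(f"geom     AIC={aic_geom:.2f}  BIC={bic_geom:.2f}")
```

Both candidates have one parameter here, so the ranking is driven entirely by the log-likelihood. The complexity penalty matters when candidates differ in `k`, for example Poisson against negative binomial.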

---

# Part 3: Plotting Discrete Distributions

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Plot best fit for Poisson data
fig, ax = fitter.plot(
    best,
    df_poisson,
    "counts",
    title="Poisson Data - Best Fit Distribution",
    xlabel="Count",
    ylabel="Probability",
    figsize=(12, 8),
)
plt.show()
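
The comparison behind such a plot can be sketched without Spark: build an integer-aligned empirical PMF (one bin per integer), evaluate the fitted PMF on the same support, and sum the squared differences. This is only an illustration; it is an assumption that the library's histogram and `sse` are computed along these lines.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.poisson(lam=7.5, size=5_000)

# One bin per integer value from 0 to max(data).
counts = np.bincount(data)
support = np.arange(counts.size)
empirical_pmf = counts / counts.sum()

# Equivalent histogram with bin edges at k - 0.5, so every integer sits
# at the center of its own bin.
edges = np.arange(-0.5, data.max() + 1.5, 1.0)
hist_counts, _ = np.histogram(data, bins=edges)

# Fitted PMF on the same support, using the Poisson MLE (sample mean).
fitted_pmf = stats.poisson.pmf(support, data.mean())
sse = float(np.sum((empirical_pmf - fitted_pmf) ** 2))

print(f"Bins: {counts.size}, SSE between empirical and fitted PMF: {sse:.6f}")
```

Centering bins on integers avoids the artifact where default floating-point bin edges merge or split neighboring counts.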

---

# Part 4: Negative Binomial Data (Overdispersion)

In [None]:
# Fit distributions to negative binomial data
print("Fitting distributions to negative binomial data...")
results_nbinom = fitter.fit(df_nbinom, column="counts")

print("\nTop 5 distributions by AIC:")
for i, r in enumerate(results_nbinom.best(n=5, metric="aic"), 1):
    params_str = ", ".join([f"{p:.3f}" for p in r.parameters])
    print(f"  {i}. {r.distribution:15} AIC={r.aic:.2f} params=({params_str})")
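
The negative binomial has no closed-form MLE for both parameters, so a two-parameter numerical fit is needed. The sketch below starts from method-of-moments estimates and optimizes over unconstrained transforms (`log n`, `logit p`) so the optimizer cannot leave the valid parameter region. The optimizer choice and parameterization are assumptions for illustration, not the library's documented approach.

```python
import numpy as np
from scipy import optimize, special, stats

rng = np.random.default_rng(3)
data = rng.negative_binomial(n=5, p=0.4, size=10_000)

# Method-of-moments starting values (valid only when variance > mean).
mean, var = data.mean(), data.var(ddof=1)
p0 = mean / var
n0 = mean**2 / (var - mean)


def neg_log_likelihood(theta: np.ndarray) -> float:
    """Negative log-likelihood in terms of (log n, logit p)."""
    n = np.exp(theta[0])
    p = special.expit(theta[1])
    return -stats.nbinom.logpmf(data, n, p).sum()


start = np.array([np.log(n0), special.logit(p0)])
result = optimize.minimize(neg_log_likelihood, start, method="Nelder-Mead")

# Map the optimum back to the natural parameters.
n_hat = float(np.exp(result.x[0]))
p_hat = float(special.expit(result.x[1]))

print(f"Moment start: n={n0:.3f}, p={p0:.3f}")
print(f"MLE:          n={n_hat:.3f}, p={p_hat:.3f}")
```

With 10,000 samples the estimates should land near the generating values `n=5` and `p=0.4`.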

In [None]:
# Plot negative binomial fit
best_nbinom = results_nbinom.best(n=1, metric="aic")[0]

fig, ax = fitter.plot(
    best_nbinom,
    df_nbinom,
    "counts",
    title=f"Negative Binomial Data - Best Fit: {best_nbinom.distribution}",
    xlabel="Count",
    ylabel="Probability",
    figsize=(12, 8),
)
plt.show()

---

# Part 5: Geometric Data (Trials Until Success)

In [None]:
# Fit distributions to geometric data
print("Fitting distributions to geometric data...")
results_geom = fitter.fit(df_geom, column="counts")

print("\nTop 5 distributions by AIC:")
for i, r in enumerate(results_geom.best(n=5, metric="aic"), 1):
    params_str = ", ".join([f"{p:.3f}" for p in r.parameters])
    print(f"  {i}. {r.distribution:15} AIC={r.aic:.2f} params=({params_str})")

In [None]:
# Plot geometric fit
best_geom = results_geom.best(n=1, metric="aic")[0]

fig, ax = fitter.plot(
    best_geom,
    df_geom,
    "counts",
    title=f"Geometric Data - Best Fit: {best_geom.distribution}",
    xlabel="Count",
    ylabel="Probability",
    figsize=(12, 8),
)
plt.show()
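
For the geometric distribution on the support 1, 2, 3, ... (the convention used by both `np.random.geometric` and `scipy.stats.geom`), the MLE has the closed form `p = 1 / mean`. The sketch below checks this against the log-likelihood at nearby values; the function name `log_likelihood` and the seed are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.geometric(p=0.25, size=10_000)

# Closed-form MLE for the geometric distribution with support starting at 1.
p_hat = 1.0 / data.mean()


def log_likelihood(p: float) -> float:
    """Geometric log-likelihood of the sample for success probability `p`."""
    return float(stats.geom.logpmf(data, p).sum())


print(f"Smallest observed value: {data.min()}")
print(f"Closed-form MLE: p = {p_hat:.4f}")
print(f"log L at p_hat:        {log_likelihood(p_hat):.2f}")
print(f"log L at 0.95 * p_hat: {log_likelihood(0.95 * p_hat):.2f}")
print(f"log L at 1.05 * p_hat: {log_likelihood(1.05 * p_hat):.2f}")
```

If the data instead counts failures before the first success (support starting at 0), the estimate becomes `p = 1 / (1 + mean)`.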

---

# Part 6: Working with Results

In [None]:
# Convert results to pandas for analysis
df_results = results_poisson.df.toPandas()
print("Results as pandas DataFrame:")
df_results.sort_values("ks_statistic").head(10)

In [None]:
# Filter by goodness-of-fit
good_fits = results_poisson.filter(ks_threshold=0.2)
print(f"Distributions with K-S statistic < 0.2: {good_fits.count()}")

for r in good_fits.best(n=10):
    print(f"  {r.distribution:15} KS={r.ks_statistic:.4f}")

---

## Summary

This notebook demonstrated:

1. **DiscreteDistributionFitter** - Main class for fitting discrete distributions
   - Same API as `DistributionFitter` for continuous data
   - Fits parameters by numerical MLE, because `scipy.stats` discrete distributions have no built-in `fit()` method

2. **Available Distributions**:
   - poisson, nbinom, geom, binom, hypergeom, betabinom, zipf, and more
   - 16 discrete distributions by default

3. **Fitting**:
   - `fitter.fit(df, column)` - Fit all discrete distributions
   - Returns `FitResults` object (same as continuous)

4. **Model Selection** (recommended metrics):
   - **AIC** - Best for discrete model selection (lower is better)
   - **BIC** - Similar to AIC, stronger complexity penalty
   - K-S statistic works for ranking but p-values are not reliable for discrete data

5. **Results**:
   - `results.best(n, metric="aic")` - Get top N distributions by AIC
   - `results.filter(aic_threshold=...)` - Filter by information criteria

6. **Plotting**:
   - `fitter.plot()` - Visualize fitted PMF vs empirical histogram
   - Shows stem plot for discrete probabilities

7. **Use Cases**:
   - **Poisson**: Event counts with constant rate
   - **Negative Binomial**: Overdispersed counts (variance > mean)
   - **Geometric**: Number of trials up to and including the first success (support starts at 1)
   - **Binomial**: Successes in n trials with probability p

In [None]:
spark.stop()
print("Spark session stopped.")