# Capital Budgeting Monte Carlo Simulation with Ray

This notebook demonstrates how to use spark-bestfit with **RayBackend** for **capital budgeting decisions** under uncertainty using Monte Carlo simulation.

## What You'll Learn

1. **Model uncertain project parameters** (revenue growth, costs, discount rates)
2. **Fit distributions** to historical/expert data using FitterConfigBuilder
3. **Run distributed Monte Carlo simulation** with Ray
4. **Calculate investment metrics**: NPV, IRR, Payback Period
5. **Perform risk analysis**: VaR, probability of positive NPV, sensitivity analysis

## Business Context

A company is evaluating a **$5M manufacturing plant investment** with:
- **10-year project life**
- **Uncertain revenue growth** (based on market conditions)
- **Variable operating costs** (labor, materials, energy)
- **Uncertain discount rate** (reflects financing risk)

Traditional discounted cash flow (DCF) analysis uses single "best estimate" values. Monte Carlo simulation instead:
- Fits probability distributions to each uncertain parameter
- Generates thousands of scenarios
- Provides a **distribution of outcomes** rather than a single number
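
The contrast can be sketched on a toy project before any fitting is involved. The block below is a minimal illustration, not part of the notebook's pipeline: the helper `toy_npv`, the level cash flow of $850K, and the normal/uniform input distributions are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_npv(annual_cash_flow, rate, years=10, investment=5_000_000):
    """NPV of a level annual cash flow; accepts scalars or 1-D arrays."""
    t = np.arange(1, years + 1)
    # Add a trailing axis so each scenario is discounted over every year.
    cf = np.asarray(annual_cash_flow, dtype=float)[..., None]
    r = np.asarray(rate, dtype=float)[..., None]
    return (cf / (1 + r) ** t).sum(axis=-1) - investment

# Single "best estimate" inputs give one number.
point_npv = toy_npv(850_000, 0.10)

# Uncertain inputs (assumed distributions) give a whole distribution of NPVs.
cash = rng.normal(850_000, 150_000, size=10_000)
rate = rng.uniform(0.07, 0.13, size=10_000)
npv = toy_npv(cash, rate)
prob_loss = (npv < 0).mean()

print(f"Point estimate NPV: ${point_npv:,.0f}")
print(f"Simulated mean NPV: ${npv.mean():,.0f}")
print(f"P(NPV < 0):         {prob_loss:.1%}")
```

Under these assumed inputs the point estimate is positive, yet a substantial share of simulated scenarios lose money, which is exactly the information a single number hides.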

## Prerequisites

```bash
pip install "spark-bestfit[ray]" pandas numpy matplotlib scipy
```

## Setup

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import brentq
import ray

from spark_bestfit import (
    DistributionFitter,
    FitterConfigBuilder,
    RayBackend,
    GaussianCopula
)

# Initialize Ray
if not ray.is_initialized():
    ray.init()

backend = RayBackend()
print(f"RayBackend initialized with {backend.get_parallelism()} CPUs")

## Part 1: Define Project Parameters

Our capital budgeting model has the following structure:

**Fixed Parameters:**
- Initial investment: $5,000,000
- Project life: 10 years
- Base year revenue: $1,500,000
- Salvage value: $500,000 (year 10)

**Uncertain Parameters (to be fitted):**
- Revenue growth rate (annual %)
- Operating cost ratio (% of revenue)
- Discount rate (WACC)

In [None]:
# Fixed project parameters
INITIAL_INVESTMENT = 5_000_000  # $5M upfront investment
PROJECT_YEARS = 10
BASE_REVENUE = 1_500_000  # Year 1 revenue: $1.5M
SALVAGE_VALUE = 500_000  # End of project salvage: $500K
TAX_RATE = 0.25  # 25% corporate tax rate
DEPRECIATION = INITIAL_INVESTMENT / PROJECT_YEARS  # Straight-line

print("Project Parameters:")
print(f"  Initial Investment: ${INITIAL_INVESTMENT:,.0f}")
print(f"  Project Life: {PROJECT_YEARS} years")
print(f"  Base Year Revenue: ${BASE_REVENUE:,.0f}")
print(f"  Annual Depreciation: ${DEPRECIATION:,.0f}")
print(f"  Salvage Value: ${SALVAGE_VALUE:,.0f}")
print(f"  Tax Rate: {TAX_RATE:.0%}")

## Part 2: Generate Historical Data for Uncertain Parameters

In practice, you would use:
- **Historical revenue growth** from comparable projects or industry data
- **Operating cost ratios** from financial statements
- **Historical WACC** or required returns from similar investments

Here we simulate realistic historical data to demonstrate the workflow.

In [None]:
np.random.seed(42)
n_historical = 500  # Historical observations

# Revenue Growth Rate: Slightly right-skewed (good years have high growth)
# skewnorm(a=2, loc=0.08, scale=0.05) has mean ~11.6% and std ~3.5% before clipping
revenue_growth = np.clip(
    stats.skewnorm.rvs(a=2, loc=0.08, scale=0.05, size=n_historical),
    -0.10, 0.30  # -10% to +30% range
)

# Operating Cost Ratio: Concentrated around 60% with some variation
# Lower is better (more efficient operations)
cost_ratio = np.clip(
    stats.beta.rvs(a=12, b=8, size=n_historical),
    0.40, 0.80  # 40% to 80% range
)

# Discount Rate (WACC): Right-skewed reflecting risk premium uncertainty
# Median 10% (mean ~10.5%) before clipping, represents cost of capital
discount_rate = np.clip(
    stats.lognorm.rvs(s=0.3, scale=0.10, size=n_historical),
    0.05, 0.20  # 5% to 20% range
)

# Create DataFrame
historical_pdf = pd.DataFrame({
    'revenue_growth': revenue_growth,
    'cost_ratio': cost_ratio,
    'discount_rate': discount_rate
})

# Convert to Ray Dataset
historical_ds = ray.data.from_pandas(historical_pdf)

print(f"Historical data: {historical_ds.count()} observations")
print("\nSummary Statistics:")
print(historical_pdf.describe().round(4))

In [None]:
# Visualize historical distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Revenue Growth
axes[0].hist(revenue_growth * 100, bins=40, density=True, alpha=0.7, edgecolor='black', color='steelblue')
axes[0].axvline(np.mean(revenue_growth) * 100, color='red', linestyle='--', label=f'Mean: {np.mean(revenue_growth)*100:.1f}%')
axes[0].set_xlabel('Annual Growth Rate (%)')
axes[0].set_ylabel('Density')
axes[0].set_title('Revenue Growth Distribution')
axes[0].legend()

# Operating Cost Ratio
axes[1].hist(cost_ratio * 100, bins=40, density=True, alpha=0.7, edgecolor='black', color='coral')
axes[1].axvline(np.mean(cost_ratio) * 100, color='red', linestyle='--', label=f'Mean: {np.mean(cost_ratio)*100:.1f}%')
axes[1].set_xlabel('Cost Ratio (% of Revenue)')
axes[1].set_ylabel('Density')
axes[1].set_title('Operating Cost Ratio Distribution')
axes[1].legend()

# Discount Rate
axes[2].hist(discount_rate * 100, bins=40, density=True, alpha=0.7, edgecolor='black', color='seagreen')
axes[2].axvline(np.mean(discount_rate) * 100, color='red', linestyle='--', label=f'Mean: {np.mean(discount_rate)*100:.1f}%')
axes[2].set_xlabel('Discount Rate (%)')
axes[2].set_ylabel('Density')
axes[2].set_title('Discount Rate (WACC) Distribution')
axes[2].legend()

plt.suptitle('Historical Distributions of Uncertain Parameters', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 3: Fit Distributions Using FitterConfigBuilder

We use the **FitterConfigBuilder** pattern to create a reusable configuration for distribution fitting. This is especially useful when:
- Fitting multiple columns with the same settings
- Ensuring bounded distributions (our parameters have natural bounds)
- Managing complex configurations
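
Later cells rank the fitted candidates by AIC. As a rough illustration of what that ranking means, the sketch below computes AIC by hand for two candidate distributions on synthetic data. It is not spark-bestfit's implementation; the data, the two candidates, and the helper `aic` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.10, scale=0.02, size=500)  # synthetic "discount rates"

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L, lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Candidate 1: normal. The MLE is the sample mean and population std.
mu, sigma = data.mean(), data.std()
ll_normal = np.sum(
    -0.5 * np.log(2 * np.pi * sigma**2) - (data - mu) ** 2 / (2 * sigma**2)
)

# Candidate 2: uniform on [min, max]. The MLE is the sample range.
lo, hi = data.min(), data.max()
ll_uniform = -len(data) * np.log(hi - lo)

scores = {"normal": aic(ll_normal, 2), "uniform": aic(ll_uniform, 2)}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Both candidates have two parameters here, so the comparison reduces to log-likelihood; the penalty term matters when candidates differ in parameter count.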

In [None]:
# Create fitting configuration using the builder pattern
fit_config = (
    FitterConfigBuilder()
    .with_bins(50)  # 50 bins for histogram
    .with_sampling(enabled=False)  # Small dataset, no sampling needed
    .with_lazy_metrics(False)  # Compute all metrics for comparison
    .build()
)

print("FitterConfig created:")
print(f"  bins={fit_config.bins}")
print(f"  enable_sampling={fit_config.enable_sampling}")
print(f"  lazy_metrics={fit_config.lazy_metrics}")

In [None]:
# Fit distributions to all parameters
fitter = DistributionFitter(backend=backend)

results = fitter.fit(
    historical_ds,
    columns=['revenue_growth', 'cost_ratio', 'discount_rate'],
    config=fit_config,
    max_distributions=30  # Focus on common distributions
)

print(f"Fitted {results.count()} distribution-parameter combinations")

In [None]:
# Get best distributions for each parameter
best_per_param = results.best_per_column(n=3, metric='aic')

print("Best Distributions by AIC:")
print("=" * 60)

best_fits = {}
for param, fits in best_per_param.items():
    print(f"\n{param.upper().replace('_', ' ')}:")
    for i, fit in enumerate(fits, 1):
        marker = "→" if i == 1 else " "
        print(f"  {marker} {i}. {fit.distribution}: AIC={fit.aic:.2f}, BIC={fit.bic:.2f}")
    best_fits[param] = fits[0]  # Store best fit

print("\n" + "=" * 60)
print("Selected distributions marked with →")

In [None]:
# Visualize fitted distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

params = ['revenue_growth', 'cost_ratio', 'discount_rate']
titles = ['Revenue Growth', 'Cost Ratio', 'Discount Rate']
colors = ['steelblue', 'coral', 'seagreen']

for i, (param, title, color) in enumerate(zip(params, titles, colors)):
    fit = best_fits[param]
    data = historical_pdf[param]
    
    # Histogram
    axes[i].hist(data, bins=40, density=True, alpha=0.6, edgecolor='black', color=color, label='Historical')
    
    # Fitted PDF
    x = np.linspace(data.min(), data.max(), 200)
    frozen = fit.get_scipy_dist()
    axes[i].plot(x, frozen.pdf(x), 'r-', linewidth=2, label=f'Fitted: {fit.distribution}')
    
    axes[i].set_xlabel(param.replace('_', ' ').title())
    axes[i].set_ylabel('Density')
    axes[i].set_title(f'{title}\n({fit.distribution})')
    axes[i].legend(fontsize=9)

plt.suptitle('Fitted Distributions for Monte Carlo Simulation', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 4: Model Parameter Correlations

Economic parameters are often correlated:
- High revenue growth may come with higher costs (expansion expenses)
- Economic uncertainty affects both discount rates and growth prospects

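We use a **Gaussian Copula** to capture these dependencies. Note that the synthetic data in Part 2 draws each parameter independently, so the fitted correlations in this demo should be close to zero; real historical data would typically show stronger dependence.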
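
To make the idea concrete, here is a minimal two-variable sketch of Gaussian copula sampling in plain NumPy. It is independent of the `GaussianCopula` class used in this notebook: the latent correlation `rho`, the two marginals, and the helper `ranks` are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
rho = 0.6   # assumed latent correlation (illustrative)
n = 5_000

# 1. Draw correlated standard normals.
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

# 2. Map each column to Uniform(0, 1) with the standard normal CDF.
std_normal_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
u = std_normal_cdf(z)

# 3. Push the uniforms through each marginal's inverse CDF.
growth = -0.05 + 0.25 * u[:, 0]               # Uniform(-5%, 20%)
discount = 0.05 - 0.04 * np.log1p(-u[:, 1])   # 5% plus Exponential(mean 4%)

def ranks(x):
    """Ordinal ranks 0..n-1 (no tie handling; draws are continuous)."""
    return np.argsort(np.argsort(x))

rank_corr = np.corrcoef(ranks(growth), ranks(discount))[0, 1]
print(f"Rank correlation of the sampled marginals: {rank_corr:.2f}")
```

The marginals keep their own shapes while the rank dependence comes entirely from the latent normals, which is why a copula lets the two concerns be fitted separately.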

In [None]:
# Fit Gaussian Copula to capture correlations
copula = GaussianCopula.fit(
    results,
    historical_ds,
    columns=['revenue_growth', 'cost_ratio', 'discount_rate'],
    backend=backend
)

print("Parameter Correlation Matrix:")
print("-" * 50)
corr_df = pd.DataFrame(
    copula.correlation_matrix,
    index=['revenue_growth', 'cost_ratio', 'discount_rate'],
    columns=['revenue_growth', 'cost_ratio', 'discount_rate']
)
print(corr_df.round(3))

# Visualize correlation matrix
plt.figure(figsize=(6, 5))
im = plt.imshow(copula.correlation_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
plt.colorbar(im, label='Correlation')
labels = ['Revenue\nGrowth', 'Cost\nRatio', 'Discount\nRate']
plt.xticks(range(3), labels)
plt.yticks(range(3), labels)
for i in range(3):
    for j in range(3):
        plt.text(j, i, f'{copula.correlation_matrix[i,j]:.2f}', 
                ha='center', va='center', fontsize=12, fontweight='bold')
plt.title('Parameter Correlations')
plt.tight_layout()
plt.show()

## Part 5: Define Financial Metric Functions

We need functions to calculate:
- **Net Present Value (NPV)**: Sum of discounted cash flows minus initial investment
- **Internal Rate of Return (IRR)**: Discount rate that makes NPV = 0
- **Payback Period**: Time to recover initial investment
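
A small worked example helps check the three definitions before applying them to the project. The sketch below uses only the standard library and a toy cash-flow stream (invest 1,000, receive 500 for three years); the helper names and the bisection solver are assumptions for illustration, whereas the notebook's own functions use `brentq`.

```python
def npv(rate, flows):
    """NPV of flows[0], flows[1], ... received at t = 0, 1, ..."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows))

def irr_bisect(flows, lo=-0.5, hi=1.0, tol=1e-10):
    """Bisection root of npv(rate) = 0; needs a sign change on [lo, hi]."""
    f_lo = npv(lo, flows)
    if f_lo * npv(hi, flows) > 0:
        return float("nan")  # no bracketed root
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = npv(mid, flows)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2

def payback_years(investment, inflows):
    """Undiscounted payback with linear interpolation inside the recovery year."""
    cumulative = 0.0
    for year, cf in enumerate(inflows, start=1):
        if cumulative + cf >= investment:
            return year - 1 + (investment - cumulative) / cf
        cumulative += cf
    return float("inf")  # never recovered

flows = [-1000, 500, 500, 500]
project_npv = npv(0.10, flows)              # about 243.43
project_irr = irr_bisect(flows)             # about 23.4%
project_payback = payback_years(1000, flows[1:])
print(project_npv, project_irr, project_payback)
```

The IRR exceeds the 10% discount rate exactly when the NPV at 10% is positive, which is the consistency check to expect for a conventional stream with one sign change.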

In [None]:
def calculate_cash_flows(revenue_growth, cost_ratio, years=PROJECT_YEARS):
    """
    Calculate annual after-tax cash flows for the project.
    
    Cash Flow = (Revenue - Operating Costs - Depreciation) × (1 - Tax Rate) + Depreciation
    """
    cash_flows = []
    revenue = BASE_REVENUE
    
    for year in range(1, years + 1):
        # Revenue grows each year
        if year > 1:
            revenue = revenue * (1 + revenue_growth)
        
        # Operating costs as percentage of revenue
        operating_costs = revenue * cost_ratio
        
        # EBITDA
        ebitda = revenue - operating_costs
        
        # EBIT (after depreciation)
        ebit = ebitda - DEPRECIATION
        
        # After-tax operating income
        after_tax_income = ebit * (1 - TAX_RATE)
        
        # Add back depreciation (non-cash expense)
        cash_flow = after_tax_income + DEPRECIATION
        
        # Add salvage value in final year (fully taxed: book value is zero after straight-line depreciation)
        if year == years:
            cash_flow += SALVAGE_VALUE * (1 - TAX_RATE)
        
        cash_flows.append(cash_flow)
    
    return np.array(cash_flows)


def calculate_npv(cash_flows, discount_rate):
    """
    Calculate Net Present Value.
    
    NPV = -Initial Investment + Σ(CF_t / (1+r)^t)
    """
    years = np.arange(1, len(cash_flows) + 1)
    discount_factors = (1 + discount_rate) ** years
    pv_cash_flows = cash_flows / discount_factors
    return -INITIAL_INVESTMENT + np.sum(pv_cash_flows)


def calculate_irr(cash_flows, initial_investment=INITIAL_INVESTMENT):
    """
    Calculate Internal Rate of Return using root finding.
    
    IRR is the rate r where NPV = 0
    """
    all_flows = np.concatenate([[-initial_investment], cash_flows])
    
    def npv_at_rate(r):
        years = np.arange(len(all_flows))
        return np.sum(all_flows / (1 + r) ** years)
    
    try:
        # Find IRR between -50% and 100%
        irr = brentq(npv_at_rate, -0.5, 1.0)
        return irr
    except ValueError:
        # No valid IRR found
        return np.nan


def calculate_payback_period(cash_flows):
    """
    Calculate simple payback period (undiscounted).
    
    Returns years to recover initial investment.
    """
    cumulative = np.cumsum(cash_flows)
    recovered_idx = np.where(cumulative >= INITIAL_INVESTMENT)[0]
    
    if len(recovered_idx) == 0:
        return np.inf  # Never recovers investment
    
    first_recovery_year = recovered_idx[0] + 1
    
    # Interpolate for fractional year
    if first_recovery_year == 1:
        return INITIAL_INVESTMENT / cash_flows[0]
    else:
        remaining = INITIAL_INVESTMENT - cumulative[first_recovery_year - 2]
        fraction = remaining / cash_flows[first_recovery_year - 1]
        return first_recovery_year - 1 + fraction


# Test with mean parameter values
test_growth = historical_pdf['revenue_growth'].mean()
test_cost = historical_pdf['cost_ratio'].mean()
test_discount = historical_pdf['discount_rate'].mean()

test_cf = calculate_cash_flows(test_growth, test_cost)
test_npv = calculate_npv(test_cf, test_discount)
test_irr = calculate_irr(test_cf)
test_payback = calculate_payback_period(test_cf)

print("Base Case Analysis (Mean Parameter Values):")
print("=" * 50)
print(f"Revenue Growth:  {test_growth:.1%}")
print(f"Cost Ratio:      {test_cost:.1%}")
print(f"Discount Rate:   {test_discount:.1%}")
print("-" * 50)
print(f"NPV:             ${test_npv:,.0f}")
print(f"IRR:             {test_irr:.1%}")
print(f"Payback Period:  {test_payback:.1f} years")
print("-" * 50)
print(f"Decision:        {'ACCEPT ✓' if test_npv > 0 else 'REJECT ✗'}")

## Part 6: Run Monte Carlo Simulation

Now we generate thousands of scenarios using the fitted distributions and copula to get a **distribution of outcomes** rather than a single point estimate.
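
For larger scenario counts, the per-scenario Python loop can become the bottleneck. One option is to evaluate all scenarios at once with NumPy broadcasting. The sketch below re-implements the same cash-flow formula in vectorized form; the function name `npv_vectorized` and the random inputs are assumptions for illustration, and the constants are copied from the project definition above.

```python
import numpy as np

# Constants copied from the project definition in this notebook.
INVEST, YEARS, BASE_REV = 5_000_000, 10, 1_500_000
SALVAGE, TAX = 500_000, 0.25
DEP = INVEST / YEARS

def npv_vectorized(growth, cost_ratio, discount_rate):
    """NPV per scenario: one row per scenario, one column per year."""
    growth = np.asarray(growth, dtype=float)[:, None]
    cost_ratio = np.asarray(cost_ratio, dtype=float)[:, None]
    discount_rate = np.asarray(discount_rate, dtype=float)[:, None]
    t = np.arange(1, YEARS + 1)                     # years 1..10
    revenue = BASE_REV * (1 + growth) ** (t - 1)    # year 1 is the base revenue
    ebit = revenue * (1 - cost_ratio) - DEP
    cash = ebit * (1 - TAX) + DEP                   # add back depreciation
    cash[:, -1] += SALVAGE * (1 - TAX)              # after-tax salvage, final year
    return (cash / (1 + discount_rate) ** t).sum(axis=1) - INVEST

# Illustrative inputs only; in the notebook these would come from the copula.
rng = np.random.default_rng(3)
g = rng.normal(0.08, 0.03, 1_000)
c = rng.uniform(0.5, 0.7, 1_000)
r = rng.uniform(0.07, 0.13, 1_000)
npvs = npv_vectorized(g, c, r)
print(f"Mean NPV over {len(npvs):,} scenarios: ${npvs.mean():,.0f}")
```

IRR and payback are harder to vectorize because they involve root finding and a first-crossing search, so a loop (or a distributed map) may still be the simpler choice for those metrics.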

In [None]:
# Generate correlated parameter scenarios using the copula
N_SCENARIOS = 10_000

print(f"Generating {N_SCENARIOS:,} Monte Carlo scenarios...")

# Sample correlated parameters from the copula
scenarios = copula.sample(n=N_SCENARIOS, random_state=42)

print(f"Generated {len(scenarios['revenue_growth']):,} parameter sets")  # correct for a dict of arrays or a DataFrame
print("\nScenario Statistics:")
print(pd.DataFrame(scenarios).describe().round(4))

In [None]:
# Calculate financial metrics for all scenarios
print(f"\nCalculating NPV, IRR, and Payback Period for {N_SCENARIOS:,} scenarios...")

npv_results = []
irr_results = []
payback_results = []

for i in range(N_SCENARIOS):
    growth = scenarios['revenue_growth'][i]
    cost = scenarios['cost_ratio'][i]
    discount = scenarios['discount_rate'][i]
    
    # Calculate cash flows
    cf = calculate_cash_flows(growth, cost)
    
    # Calculate metrics
    npv_results.append(calculate_npv(cf, discount))
    irr_results.append(calculate_irr(cf))
    payback_results.append(calculate_payback_period(cf))

# Convert to arrays
npv_results = np.array(npv_results)
irr_results = np.array(irr_results)
payback_results = np.array(payback_results)

# Create results DataFrame
simulation_results = pd.DataFrame({
    'revenue_growth': scenarios['revenue_growth'],
    'cost_ratio': scenarios['cost_ratio'],
    'discount_rate': scenarios['discount_rate'],
    'npv': npv_results,
    'irr': irr_results,
    'payback_period': payback_results
})

print("\nSimulation complete!")
print(simulation_results[['npv', 'irr', 'payback_period']].describe().round(2))

## Part 7: Analyze NPV Distribution

The NPV distribution tells us the range of possible outcomes and their probabilities.
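
Two tail statistics used below are Value at Risk (VaR) and Conditional VaR (Expected Shortfall). A minimal sketch on toy data shows how they relate; the helper `var_cvar` and the toy outcomes are assumptions for illustration, and the convention here treats higher outcomes as better, matching NPV.

```python
import numpy as np

def var_cvar(outcomes, alpha=0.05):
    """Return (VaR, CVaR) at level alpha for outcomes where higher is better."""
    outcomes = np.asarray(outcomes, dtype=float)
    var = np.percentile(outcomes, alpha * 100)   # alpha-quantile of the outcomes
    cvar = outcomes[outcomes <= var].mean()      # mean of the tail at or below VaR
    return var, cvar

toy_outcomes = np.arange(1, 101, dtype=float)    # 1, 2, ..., 100
toy_var, toy_cvar = var_cvar(toy_outcomes)
print(toy_var, toy_cvar)
```

CVaR is never better than VaR by construction, because it averages only the outcomes at or beyond the VaR threshold.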

In [None]:
# NPV Analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax1 = axes[0]
n, bins, patches = ax1.hist(npv_results / 1e6, bins=80, density=True, alpha=0.7, edgecolor='black')

# Color bars: green for positive NPV, red for negative
for patch, left_edge in zip(patches, bins[:-1]):
    if left_edge < 0:
        patch.set_facecolor('indianred')
    else:
        patch.set_facecolor('seagreen')

# Add vertical lines for key percentiles
p5 = np.percentile(npv_results, 5) / 1e6
p50 = np.percentile(npv_results, 50) / 1e6
p95 = np.percentile(npv_results, 95) / 1e6
mean_npv = np.mean(npv_results) / 1e6

ax1.axvline(0, color='black', linewidth=2, linestyle='-', label='Break-even')
ax1.axvline(p5, color='orange', linewidth=2, linestyle='--', label=f'5th %ile: ${p5:.2f}M')
ax1.axvline(p50, color='blue', linewidth=2, linestyle='--', label=f'Median: ${p50:.2f}M')
ax1.axvline(p95, color='purple', linewidth=2, linestyle='--', label=f'95th %ile: ${p95:.2f}M')

ax1.set_xlabel('NPV ($ Millions)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.set_title('NPV Distribution', fontsize=14, fontweight='bold')
ax1.legend(loc='upper right', fontsize=9)

# Cumulative Distribution (S-curve)
ax2 = axes[1]
sorted_npv = np.sort(npv_results) / 1e6
cumulative = np.arange(1, len(sorted_npv) + 1) / len(sorted_npv)

ax2.plot(sorted_npv, cumulative, 'b-', linewidth=2)
ax2.axvline(0, color='black', linewidth=2, linestyle='-')
ax2.axhline(0.5, color='gray', linewidth=1, linestyle=':')

# Probability of positive NPV
prob_positive = (npv_results > 0).mean()
ax2.fill_between(sorted_npv[sorted_npv < 0], 0, cumulative[:np.sum(sorted_npv < 0)], 
                 alpha=0.3, color='red', label=f'P(NPV < 0) = {1-prob_positive:.1%}')
ax2.fill_between(sorted_npv[sorted_npv >= 0], 
                 cumulative[np.sum(sorted_npv < 0):], 1, 
                 alpha=0.3, color='green', label=f'P(NPV > 0) = {prob_positive:.1%}')

ax2.set_xlabel('NPV ($ Millions)', fontsize=12)
ax2.set_ylabel('Cumulative Probability', fontsize=12)
ax2.set_title('NPV Cumulative Distribution', fontsize=14, fontweight='bold')
ax2.legend(loc='lower right', fontsize=10)
ax2.set_ylim(0, 1)

plt.tight_layout()
plt.show()

In [None]:
# Detailed NPV Statistics
prob_positive_npv = (npv_results > 0).mean()
var_5 = np.percentile(npv_results, 5)  # Value at Risk (5%)
cvar_5 = npv_results[npv_results <= var_5].mean()  # Conditional VaR (Expected Shortfall)

print("="*60)
print("NPV ANALYSIS SUMMARY")
print("="*60)
print(f"\nCentral Tendency:")
print(f"  Mean NPV:         ${np.mean(npv_results):>12,.0f}")
print(f"  Median NPV:       ${np.median(npv_results):>12,.0f}")
print(f"  Std Deviation:    ${np.std(npv_results):>12,.0f}")
print(f"\nDistribution Range:")
print(f"  Minimum:          ${np.min(npv_results):>12,.0f}")
print(f"  5th Percentile:   ${np.percentile(npv_results, 5):>12,.0f}")
print(f"  25th Percentile:  ${np.percentile(npv_results, 25):>12,.0f}")
print(f"  75th Percentile:  ${np.percentile(npv_results, 75):>12,.0f}")
print(f"  95th Percentile:  ${np.percentile(npv_results, 95):>12,.0f}")
print(f"  Maximum:          ${np.max(npv_results):>12,.0f}")
print(f"\nRisk Metrics:")
print(f"  P(NPV > 0):       {prob_positive_npv:>12.1%}")
print(f"  P(NPV < 0):       {1-prob_positive_npv:>12.1%}")
print(f"  VaR (5%):         ${var_5:>12,.0f}")
print(f"  CVaR/ES (5%):     ${cvar_5:>12,.0f}")
print("="*60)

## Part 8: Analyze IRR Distribution

In [None]:
# Filter valid IRR values
valid_irr = irr_results[~np.isnan(irr_results)]
valid_irr = valid_irr[(valid_irr > -0.5) & (valid_irr < 1.0)]  # Remove extreme outliers

print(f"Valid IRR calculations: {len(valid_irr):,} of {N_SCENARIOS:,} ({len(valid_irr)/N_SCENARIOS:.1%})")

# IRR Distribution Plot
fig, ax = plt.subplots(figsize=(10, 5))

n, bins, patches = ax.hist(valid_irr * 100, bins=60, density=True, alpha=0.7, edgecolor='black')

# Color bars based on hurdle rate (assume 10% WACC as hurdle)
hurdle_rate = 0.10
for patch, left_edge in zip(patches, bins[:-1]):
    if left_edge < hurdle_rate * 100:
        patch.set_facecolor('indianred')
    else:
        patch.set_facecolor('seagreen')

# Key statistics
ax.axvline(hurdle_rate * 100, color='black', linewidth=2, linestyle='-', label=f'Hurdle Rate: {hurdle_rate:.0%}')
ax.axvline(np.median(valid_irr) * 100, color='blue', linewidth=2, linestyle='--', 
           label=f'Median IRR: {np.median(valid_irr):.1%}')
ax.axvline(np.mean(valid_irr) * 100, color='orange', linewidth=2, linestyle='--', 
           label=f'Mean IRR: {np.mean(valid_irr):.1%}')

ax.set_xlabel('IRR (%)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Internal Rate of Return (IRR) Distribution', fontsize=14, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)

plt.tight_layout()
plt.show()

# IRR Statistics
prob_irr_above_hurdle = (valid_irr > hurdle_rate).mean()
print(f"\nIRR Statistics:")
print(f"  Mean IRR:              {np.mean(valid_irr):.1%}")
print(f"  Median IRR:            {np.median(valid_irr):.1%}")
print(f"  P(IRR > {hurdle_rate:.0%}):          {prob_irr_above_hurdle:.1%}")

## Part 9: Analyze Payback Period Distribution

In [None]:
# Filter reasonable payback periods
valid_payback = payback_results[payback_results < PROJECT_YEARS + 1]

print(f"Scenarios with payback ≤ {PROJECT_YEARS} years: {len(valid_payback):,} of {N_SCENARIOS:,} ({len(valid_payback)/N_SCENARIOS:.1%})")

# Payback Period Distribution
fig, ax = plt.subplots(figsize=(10, 5))

# Target payback of 6 years
target_payback = 6

n, bins, patches = ax.hist(valid_payback, bins=50, density=True, alpha=0.7, edgecolor='black')

# Color bars based on target payback
for patch, left_edge in zip(patches, bins[:-1]):
    if left_edge <= target_payback:
        patch.set_facecolor('seagreen')
    else:
        patch.set_facecolor('goldenrod')

ax.axvline(target_payback, color='black', linewidth=2, linestyle='-', 
           label=f'Target: {target_payback} years')
ax.axvline(np.median(valid_payback), color='blue', linewidth=2, linestyle='--', 
           label=f'Median: {np.median(valid_payback):.1f} years')

ax.set_xlabel('Payback Period (Years)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Payback Period Distribution', fontsize=14, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)

plt.tight_layout()
plt.show()

# Payback Statistics
# Use all scenarios as the denominator; np.inf (never recovered) counts as a miss
prob_meet_target = (payback_results <= target_payback).mean()
print(f"\nPayback Period Statistics:")
print(f"  Mean Payback:          {np.mean(valid_payback):.1f} years")
print(f"  Median Payback:        {np.median(valid_payback):.1f} years")
print(f"  P(Payback ≤ {target_payback} yrs):   {prob_meet_target:.1%}")

## Part 10: Sensitivity Analysis

Which parameters have the greatest impact on NPV? Understanding sensitivity helps focus risk management efforts.
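
The analysis below uses Pearson correlation, which measures linear association. When an input affects NPV monotonically but nonlinearly, a rank (Spearman) correlation can be a useful cross-check. The sketch below is an illustration on a toy response, not part of the notebook's pipeline; the helper `spearman` and the exponential response are assumptions.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (no tie correction; fine for continuous draws)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

rng = np.random.default_rng(4)
cost_ratio = rng.uniform(0.4, 0.8, 2_000)
# Toy response: falls monotonically but nonlinearly as the cost ratio rises.
toy_response = 1e6 * np.exp(-5 * cost_ratio)

pearson = np.corrcoef(cost_ratio, toy_response)[0, 1]
rank_corr = spearman(cost_ratio, toy_response)
print(f"Pearson: {pearson:.3f}, Spearman: {rank_corr:.3f}")
```

Here the rank correlation reaches -1 because the relationship is perfectly monotone, while Pearson understates it slightly. Either measure describes association within the simulated ranges, not a causal effect size.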

In [None]:
# Correlation between inputs and NPV
correlations = simulation_results[['revenue_growth', 'cost_ratio', 'discount_rate', 'npv']].corr()['npv'].drop('npv')

# Tornado chart (sensitivity analysis)
fig, ax = plt.subplots(figsize=(10, 5))

params = correlations.index.tolist()
values = correlations.values
colors = ['seagreen' if v > 0 else 'indianred' for v in values]

# Sort by absolute value
sorted_idx = np.argsort(np.abs(values))
params = [params[i] for i in sorted_idx]
values = [values[i] for i in sorted_idx]
colors = [colors[i] for i in sorted_idx]

y_pos = np.arange(len(params))
ax.barh(y_pos, values, color=colors, edgecolor='black', height=0.6)

ax.set_yticks(y_pos)
ax.set_yticklabels([p.replace('_', ' ').title() for p in params])
ax.set_xlabel('Correlation with NPV', fontsize=12)
ax.set_title('Sensitivity Analysis: What Drives NPV?', fontsize=14, fontweight='bold')
ax.axvline(0, color='black', linewidth=1)

# Add value labels
for i, (v, p) in enumerate(zip(values, params)):
    offset = 0.02 if v > 0 else -0.02
    ha = 'left' if v > 0 else 'right'
    ax.text(v + offset, i, f'{v:.2f}', va='center', ha=ha, fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nSensitivity Ranking (correlation with NPV):")
print("-" * 50)
for p, v in zip(reversed(params), reversed(values)):
    direction = "↑" if v > 0 else "↓"
    print(f"  {p.replace('_', ' ').title():<20}: {v:>6.3f}  {direction} Higher = {'Higher' if v > 0 else 'Lower'} NPV")

In [None]:
# Scatter plots of key drivers
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

params_to_plot = ['revenue_growth', 'cost_ratio', 'discount_rate']
titles = ['Revenue Growth vs NPV', 'Cost Ratio vs NPV', 'Discount Rate vs NPV']

# Sample for visualization (10k points is too many)
sample_idx = np.random.choice(len(simulation_results), size=min(2000, len(simulation_results)), replace=False)
sample_df = simulation_results.iloc[sample_idx]

for i, (param, title) in enumerate(zip(params_to_plot, titles)):
    ax = axes[i]
    
    # Color by NPV (green = positive, red = negative)
    colors = ['seagreen' if npv > 0 else 'indianred' for npv in sample_df['npv']]
    
    ax.scatter(sample_df[param], sample_df['npv'] / 1e6, 
               c=colors, alpha=0.3, s=10)
    
    ax.axhline(0, color='black', linewidth=1, linestyle='--')
    ax.set_xlabel(param.replace('_', ' ').title(), fontsize=11)
    ax.set_ylabel('NPV ($ Millions)', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')

plt.suptitle('Parameter Impact on NPV', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 11: Executive Summary Report

In [None]:
# Generate executive summary
prob_positive = (npv_results > 0).mean()
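prob_irr_hurdle = (valid_irr > hurdle_rate).mean()  # hurdle_rate is defined in Part 8
# All scenarios in the denominator; np.inf (never recovered) counts as a miss
prob_payback_target = (payback_results <= target_payback).mean()  # target_payback is defined in Part 9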

print("="*70)
print("                    CAPITAL BUDGETING ANALYSIS REPORT")
print("                    Manufacturing Plant Investment")
print("="*70)

print("\n┌─────────────────────────────────────────────────────────────────────┐")
print("│                        PROJECT OVERVIEW                             │")
print("├─────────────────────────────────────────────────────────────────────┤")
print(f"│  Initial Investment:        ${INITIAL_INVESTMENT:>12,}                       │")
print(f"│  Project Life:              {PROJECT_YEARS:>12} years                       │")
print(f"│  Monte Carlo Scenarios:     {N_SCENARIOS:>12,}                             │")
print("└─────────────────────────────────────────────────────────────────────┘")

print("\n┌─────────────────────────────────────────────────────────────────────┐")
print("│                     KEY INVESTMENT METRICS                          │")
print("├─────────────────────────────────────────────────────────────────────┤")
print(f"│                           Mean         Median         5th %ile     │")
print(f"│  NPV:              ${np.mean(npv_results)/1e6:>7.2f}M    ${np.median(npv_results)/1e6:>7.2f}M    ${np.percentile(npv_results, 5)/1e6:>7.2f}M     │")
print(f"│  IRR:                  {np.mean(valid_irr)*100:>6.1f}%       {np.median(valid_irr)*100:>6.1f}%       {np.percentile(valid_irr, 5)*100:>6.1f}%      │")
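# Payback row: the third column reports the 95th percentile (the slow-payback tail), since longer is worse
print(f"│  Payback:              {np.mean(valid_payback):>6.1f} yrs    {np.median(valid_payback):>6.1f} yrs    {np.percentile(valid_payback, 95):>6.1f} yrs    │")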
print("└─────────────────────────────────────────────────────────────────────┘")

print("\n┌─────────────────────────────────────────────────────────────────────┐")
print("│                       RISK ASSESSMENT                               │")
print("├─────────────────────────────────────────────────────────────────────┤")
print(f"│  Probability of Positive NPV:           {prob_positive:>6.1%}                      │")
print(f"│  Probability IRR > 10% hurdle:          {prob_irr_hurdle:>6.1%}                      │")
print(f"│  Probability Payback ≤ 6 years:         {prob_payback_target:>6.1%}                      │")
print(f"│  Value at Risk (5%):                    ${var_5/1e6:>6.2f}M                     │")
print(f"│  Expected Shortfall (5%):               ${cvar_5/1e6:>6.2f}M                     │")
print("└─────────────────────────────────────────────────────────────────────┘")

print("\n┌─────────────────────────────────────────────────────────────────────┐")
print("│                      KEY RISK DRIVERS                               │")
print("├─────────────────────────────────────────────────────────────────────┤")
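# Rank drivers by absolute correlation instead of hardcoding the order
ranked = correlations.reindex(correlations.abs().sort_values(ascending=False).index)
for rank, (name, corr) in enumerate(ranked.items(), 1):
    label = name.replace('_', ' ').title() + ':'
    note = ' (most impactful)' if rank == 1 else ''
    row = f"│  {rank}. {label:<18} {corr:>5.2f} correlation{note}"
    print(f"{row:<70}│")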
print("└─────────────────────────────────────────────────────────────────────┘")

# Investment Decision
print("\n" + "="*70)
if prob_positive > 0.7 and prob_irr_hurdle > 0.6:
    decision = "RECOMMEND APPROVAL"
    emoji = "✅"
    reason = "Strong probability of positive returns with acceptable risk profile."
elif prob_positive > 0.5:
    decision = "CONDITIONAL APPROVAL"
    emoji = "⚠️"
    reason = "NPV is more likely positive than negative, but downside risk is significant. Consider risk mitigation."
else:
    decision = "RECOMMEND REJECTION"
    emoji = "❌"
    reason = "High probability of negative NPV. Risk exceeds potential reward."

print(f"  INVESTMENT DECISION: {emoji} {decision}")
print(f"  Rationale: {reason}")
print("="*70)

## Summary

This notebook demonstrated a complete **Monte Carlo capital budgeting workflow** using spark-bestfit with RayBackend:

### Workflow Steps

1. **Parameter Modeling**: Defined uncertain project parameters (growth, costs, discount rate)
2. **Distribution Fitting**: Used `FitterConfigBuilder` to fit distributions to historical data
3. **Correlation Capture**: Applied `GaussianCopula` to model parameter dependencies
4. **Monte Carlo Simulation**: Generated 10,000 correlated scenarios
5. **Financial Metrics**: Calculated NPV, IRR, and Payback Period for each scenario
6. **Risk Analysis**: Computed probabilities, VaR, CVaR, and sensitivity rankings

### Key spark-bestfit Features Used

| Feature | Purpose |
|---------|----------|
| `FitterConfigBuilder` | Create reusable fitting configuration |
| `DistributionFitter` | Fit 90+ distributions to uncertain parameters |
| `GaussianCopula` | Model correlations between economic parameters |
| `RayBackend` | Distributed computation for large-scale simulation |

### Business Value

Monte Carlo simulation provides:
- **Distribution of outcomes** instead of single point estimates
- **Probability-based decision making** (P(NPV > 0), P(IRR > hurdle))
- **Risk quantification** (VaR, Expected Shortfall)
- **Sensitivity insights** to focus risk management efforts
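
Probabilities estimated from a finite number of scenarios carry sampling error of their own. A rough way to gauge it is the normal-approximation interval for a proportion, sketched below with the standard library. The helper `probability_interval` and the value 0.70 are illustrative assumptions, not results from this notebook.

```python
import math

def probability_interval(p_hat, n, z=1.96):
    """Normal-approximation ~95% interval for a probability estimated from n draws."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

low, high = probability_interval(0.70, 10_000)   # 0.70 is an illustrative value
half_width = (high - low) / 2
print(f"Interval: [{low:.3f}, {high:.3f}] (half-width {half_width:.4f})")
```

With 10,000 scenarios the half-width is under one percentage point for this example. That figure covers Monte Carlo sampling error only; uncertainty in the fitted distributions and in the cash-flow model itself is not included.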

### Extensions

- Add more uncertain parameters (tax rates, salvage value, project timing)
- Model time-varying correlations (regime-switching)
- Include real options (abandonment, expansion)
- Compare multiple investment alternatives

In [None]:
# Cleanup
ray.shutdown()