# Planning Analytics Regression with TabPFN

This notebook demonstrates how to use **TabPFN** for regression tasks across the retail/CPG planning value chain.

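TabPFN is a pretrained foundation model for tabular data. It delivers strong regression accuracy on small to medium-sized datasets without hyperparameter tuning and can return predictive quantiles alongside point estimates, which is useful when planners need to know how much confidence to place in a prediction.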

**Use Cases Covered:**

| Use Case | Planning Process | Target Variable |
|----------|-----------------|----------------|
| Price Elasticity Prediction | Demand Planning | Price elasticity coefficient |
| Promotion Lift Prediction | Demand Planning | % sales lift during promotion |
| Supplier Lead Time Prediction | Supply Planning | Actual delivery days |
| Yield Prediction | Production Planning | Production yield % |
| Transportation Lead Time | Distribution Planning | Actual transit days |

**Business Value:**
- Optimize pricing and promotion strategies
- Improve supply planning with accurate lead time forecasts
- Optimize production capacity planning with yield predictions
- Improve delivery planning and customer service

**Prerequisites:** Run the `00_data_preparation` notebook first to set up the datasets.

**References:**
- [TabPFN Client GitHub](https://github.com/PriorLabs/tabpfn-client)
- [Prior Labs Documentation](https://docs.priorlabs.ai/)

## Compute Setup

We recommend running this notebook on **Serverless Compute** with the **Base Environment V4**.

## 1. Installation

In [None]:
%pip install tabpfn-client scikit-learn pandas matplotlib seaborn --quiet

In [None]:
dbutils.library.restartPython()

## 2. Authentication

See the `01_classification` notebook for detailed instructions on setting up Databricks Secrets.

In [None]:
import tabpfn_client

token = dbutils.secrets.get(scope="tabpfn-client", key="token")
tabpfn_client.set_access_token(token)

## 3. Configuration

In [None]:
CATALOG = "tabpfn_databricks"
SCHEMA = "default"

spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SCHEMA}")

## 4. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from tabpfn_client import TabPFNRegressor

---
# Part 1: Demand Planning
---

## 5. Price Elasticity Prediction

**Business Context:** Revenue management and demand planning teams need to understand how price changes affect demand to:
- Set optimal prices for different products and markets
- Plan price increases without losing market share
- Design effective discount strategies

**Price Elasticity:** Measures the % change in demand for a 1% change in price.
- Elasticity of -2.0 means: 1% price increase → 2% demand decrease
- More negative = more price sensitive (elastic)
- Closer to 0 = less price sensitive (inelastic)
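
As a quick worked example of the definition above, the sketch below applies a constant-elasticity approximation to a price change. The function name and the numbers are illustrative assumptions, not values from the dataset, and the linear approximation is only reasonable for small price changes.

```python
def expected_demand_change_pct(elasticity: float, price_change_pct: float) -> float:
    """Approximate % change in demand for a given % change in price."""
    return elasticity * price_change_pct


# Elastic product (elasticity -2.0) vs. inelastic product (elasticity -0.4),
# both facing the same 5% price increase. Values are illustrative only.
elastic_change = expected_demand_change_pct(-2.0, 5.0)
inelastic_change = expected_demand_change_pct(-0.4, 5.0)

print(f"Elastic product:   {elastic_change:+.1f}% demand")
print(f"Inelastic product: {inelastic_change:+.1f}% demand")
```

The elastic product loses roughly five times as much volume for the same price move, which is why the elasticity coefficient is the target worth predicting per product and market.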

In [None]:
# Load Price Elasticity dataset from Delta table
df_elasticity = spark.table("price_elasticity").toPandas()

print(f"Dataset shape: {df_elasticity.shape}")
print(f"\nFeatures:")
print([col for col in df_elasticity.columns if col != 'price_elasticity'])
print(f"\nTarget (price_elasticity) statistics:")
print(df_elasticity['price_elasticity'].describe())

In [None]:
# Visualize elasticity distribution by category
fig, ax = plt.subplots(figsize=(10, 6))
df_elasticity.boxplot(column='price_elasticity', by='category', ax=ax)
ax.set_title('Price Elasticity by Product Category')
ax.set_xlabel('Category')
ax.set_ylabel('Price Elasticity')
plt.suptitle('')  # Remove automatic title
plt.tight_layout()
plt.show()

In [None]:
# Prepare features - encode categorical columns
df_elas_encoded = pd.get_dummies(df_elasticity, 
                                  columns=['category', 'price_tier', 'purchase_frequency'], 
                                  drop_first=True)

# Separate features and target
feature_cols = [col for col in df_elas_encoded.columns if col != 'price_elasticity']
X = df_elas_encoded[feature_cols].values
y = df_elasticity['price_elasticity'].values

print(f"Feature matrix shape: {X.shape}")
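
To see what `pd.get_dummies(..., drop_first=True)` does to a categorical column, here is a minimal sketch on a toy frame. The column names and values are made up for illustration. Passing `dtype=float` is an optional choice (not used in the cell above) that keeps the encoded matrix purely numeric, since recent pandas versions create boolean dummy columns by default.

```python
import pandas as pd

# Toy data: one numeric feature and one categorical feature (illustrative only)
toy = pd.DataFrame({
    "base_price": [2.5, 4.0, 1.2],
    "category": ["snacks", "beverages", "dairy"],
})

# drop_first=True drops the first category in sorted order ("beverages"),
# which becomes the implicit baseline (all remaining dummies are 0).
encoded = pd.get_dummies(toy, columns=["category"], drop_first=True, dtype=float)

print(encoded.columns.tolist())
print(encoded.to_numpy().dtype)
```

A row whose dummy columns are all zero therefore belongs to the dropped baseline category.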

In [None]:
# Use a sample for faster demonstration (TabPFN works best with smaller datasets)
np.random.seed(42)
sample_size = min(2000, len(X))
sample_idx = np.random.choice(len(X), size=sample_size, replace=False)
X_sample = X[sample_idx]
y_sample = y[sample_idx]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_sample, y_sample, test_size=0.2, random_state=42
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
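
The same sample-then-split pattern is repeated for every use case in this notebook. One way to factor it out is sketched below with NumPy only. `sample_and_split` is a hypothetical helper, not part of TabPFN or scikit-learn, and because it uses a different random generator it will not select the same rows as the `train_test_split` call above.

```python
import numpy as np


def sample_and_split(X, y, max_rows=2000, test_size=0.2, seed=42):
    """Randomly keep at most `max_rows` rows, then split into train and test."""
    rng = np.random.default_rng(seed)
    n = min(max_rows, len(X))
    # choice(..., replace=False) returns unique row indices in random order
    idx = rng.choice(len(X), size=n, replace=False)
    n_test = int(round(n * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Synthetic demo data: row i of X_demo is [2i, 2i + 1] and y_demo[i] is i
X_demo = np.arange(10_000, dtype=float).reshape(5000, 2)
y_demo = np.arange(5000, dtype=float)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = sample_and_split(X_demo, y_demo)
print(f"Training: {len(X_tr_demo)}, Test: {len(X_te_demo)}")
```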

In [None]:
# Initialize and train TabPFN regressor
reg = TabPFNRegressor()
reg.fit(X_train, y_train)

# Make predictions
y_pred = reg.predict(X_test)

# Evaluate performance
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"TabPFN Price Elasticity Prediction Results:")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE:  {mae:.4f}")
print(f"  R²:   {r2:.4f}")
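
To make the three reported metrics concrete, the sketch below recomputes them by hand on four made-up values. RMSE penalizes large errors more heavily than MAE, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The `_manual` variable names are chosen to avoid overwriting the notebook's own results.

```python
import numpy as np

# Made-up actuals and predictions, for illustration only
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_demo = np.array([1.5, 2.0, 2.5, 4.0])

residuals = y_true_demo - y_hat_demo

rmse_manual = np.sqrt(np.mean(residuals ** 2))             # root mean squared error
mae_manual = np.mean(np.abs(residuals))                    # mean absolute error
ss_res = np.sum(residuals ** 2)                            # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.4f}, MAE: {mae_manual:.4f}, R²: {r2_manual:.4f}")
```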

In [None]:
# Visualize predictions vs actual values
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(y_test, y_pred, alpha=0.5, edgecolors='none')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax.set_xlabel('Actual Price Elasticity')
ax.set_ylabel('Predicted Price Elasticity')
ax.set_title(f'Price Elasticity: Predicted vs Actual (R² = {r2:.3f})')
plt.tight_layout()
plt.show()

## 6. Promotion Lift Prediction

**Business Context:** Trade promotion managers need to predict the incremental sales lift from promotions to:
- Optimize promotion ROI
- Plan inventory for promotional periods
- Negotiate trade spend with retailers

**Promotion Lift:** The % increase in sales during a promotion compared to baseline.
- Lift of 100% means: Sales double during the promotion
- Higher lift = more effective promotion
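
The lift definition above can be written as a one-line calculation. The helper name and the unit counts below are illustrative assumptions, not values from the `promotion_lift` table.

```python
def promotion_lift_pct(baseline_units: float, promo_units: float) -> float:
    """% increase in sales during a promotion relative to baseline sales."""
    if baseline_units <= 0:
        raise ValueError("baseline_units must be positive")
    return (promo_units - baseline_units) / baseline_units * 100.0


# Sales double during the promotion: lift of 100%
lift_double = promotion_lift_pct(baseline_units=1000, promo_units=2000)
# A more modest promotion
lift_modest = promotion_lift_pct(baseline_units=1000, promo_units=1350)

print(f"Doubling sales: {lift_double:.1f}% lift")
print(f"1000 -> 1350:   {lift_modest:.1f}% lift")
```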

In [None]:
# Load Promotion Lift dataset from Delta table
df_promo = spark.table("promotion_lift").toPandas()

print(f"Dataset shape: {df_promo.shape}")
print(f"\nTarget (promotion_lift_pct) statistics:")
print(df_promo['promotion_lift_pct'].describe())
print(f"\nPromotion type distribution:")
print(df_promo['promotion_type'].value_counts())

In [None]:
# Prepare features - encode categorical columns
df_promo_encoded = pd.get_dummies(df_promo, 
                                   columns=['promotion_type', 'category'], 
                                   drop_first=True)

# Separate features and target
promo_feature_cols = [col for col in df_promo_encoded.columns if col != 'promotion_lift_pct']
X_promo = df_promo_encoded[promo_feature_cols].values
y_promo = df_promo['promotion_lift_pct'].values

# Sample and split
np.random.seed(42)
sample_size = min(2000, len(X_promo))
sample_idx_p = np.random.choice(len(X_promo), size=sample_size, replace=False)
X_promo_sample = X_promo[sample_idx_p]
y_promo_sample = y_promo[sample_idx_p]

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    X_promo_sample, y_promo_sample, test_size=0.2, random_state=42
)

print(f"Training: {len(X_train_p)}, Test: {len(X_test_p)}")

In [None]:
# Train TabPFN regressor for promotion lift
reg_promo = TabPFNRegressor()
reg_promo.fit(X_train_p, y_train_p)

# Make predictions
y_pred_promo = reg_promo.predict(X_test_p)

# Evaluate performance
rmse_promo = np.sqrt(mean_squared_error(y_test_p, y_pred_promo))
mae_promo = mean_absolute_error(y_test_p, y_pred_promo)
r2_promo = r2_score(y_test_p, y_pred_promo)

print(f"TabPFN Promotion Lift Prediction Results:")
print(f"  RMSE: {rmse_promo:.2f}%")
print(f"  MAE:  {mae_promo:.2f}%")
print(f"  R²:   {r2_promo:.4f}")

---
# Part 2: Supply Planning
---

## 7. Supplier Lead Time Prediction

**Business Context:** Supply planners need accurate lead time predictions to:
- Time purchase orders correctly to avoid stockouts
- Set appropriate safety stock levels
- Identify suppliers with unpredictable delivery performance

**Target:** Actual lead time in days (vs. contracted lead time)

In [None]:
# Load Supplier Lead Time dataset
df_supplier_lt = spark.table("supplier_lead_time").toPandas()

print(f"Dataset shape: {df_supplier_lt.shape}")
print(f"\nFeatures:")
print([col for col in df_supplier_lt.columns if col != 'actual_lead_time_days'])
print(f"\nTarget (actual_lead_time_days) statistics:")
print(df_supplier_lt['actual_lead_time_days'].describe())

In [None]:
# Visualize lead time by supplier region
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

df_supplier_lt.boxplot(column='actual_lead_time_days', by='supplier_region', ax=axes[0])
axes[0].set_title('Actual Lead Time by Supplier Region')
axes[0].set_xlabel('Region')
axes[0].set_ylabel('Lead Time (days)')
plt.suptitle('')

# Contracted vs Actual lead time
axes[1].scatter(df_supplier_lt['contracted_lead_time_days'], 
                df_supplier_lt['actual_lead_time_days'], alpha=0.3)
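# Reference line where actual equals contracted lead time (points above it arrived late)
lt_max = max(df_supplier_lt['contracted_lead_time_days'].max(),
             df_supplier_lt['actual_lead_time_days'].max())
axes[1].plot([0, lt_max], [0, lt_max], 'r--', label='Actual = contracted')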
axes[1].set_xlabel('Contracted Lead Time (days)')
axes[1].set_ylabel('Actual Lead Time (days)')
axes[1].set_title('Contracted vs Actual Lead Time')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Prepare features
df_slt_encoded = pd.get_dummies(df_supplier_lt, 
                                 columns=['supplier_tier', 'supplier_region', 'order_complexity', 
                                         'transport_mode', 'port_of_entry'], 
                                 drop_first=True)

slt_feature_cols = [col for col in df_slt_encoded.columns if col != 'actual_lead_time_days']
X_slt = df_slt_encoded[slt_feature_cols].values
y_slt = df_supplier_lt['actual_lead_time_days'].values

# Sample and split
np.random.seed(42)
sample_size = min(1500, len(X_slt))
sample_idx_slt = np.random.choice(len(X_slt), size=sample_size, replace=False)

X_train_slt, X_test_slt, y_train_slt, y_test_slt = train_test_split(
    X_slt[sample_idx_slt], y_slt[sample_idx_slt], test_size=0.2, random_state=42
)
print(f"Training: {len(X_train_slt)}, Test: {len(X_test_slt)}")

In [None]:
# Train TabPFN regressor for supplier lead time
reg_slt = TabPFNRegressor()
reg_slt.fit(X_train_slt, y_train_slt)

y_pred_slt = reg_slt.predict(X_test_slt)

rmse_slt = np.sqrt(mean_squared_error(y_test_slt, y_pred_slt))
mae_slt = mean_absolute_error(y_test_slt, y_pred_slt)
r2_slt = r2_score(y_test_slt, y_pred_slt)

print(f"TabPFN Supplier Lead Time Prediction Results:")
print(f"  RMSE: {rmse_slt:.2f} days")
print(f"  MAE:  {mae_slt:.2f} days")
print(f"  R²:   {r2_slt:.4f}")

In [None]:
# Uncertainty quantification for supplier lead time
y_lower_slt = reg_slt.predict(X_test_slt, output_type="quantiles", quantiles=[0.1]).flatten()
y_upper_slt = reg_slt.predict(X_test_slt, output_type="quantiles", quantiles=[0.9]).flatten()

coverage_slt = np.mean((y_test_slt >= y_lower_slt) & (y_test_slt <= y_upper_slt))
print(f"80% Prediction Interval Coverage: {coverage_slt:.1%}")

# Visualize predictions with uncertainty
fig, ax = plt.subplots(figsize=(10, 6))
sort_idx = np.argsort(y_pred_slt)
n_show = min(60, len(y_test_slt))  # Guard against test sets smaller than 60 rows

ax.fill_between(range(n_show), 
                y_lower_slt[sort_idx[:n_show]], 
                y_upper_slt[sort_idx[:n_show]], 
                alpha=0.3, color='blue', label='80% Prediction Interval')
ax.plot(range(n_show), y_pred_slt[sort_idx[:n_show]], 'b-', linewidth=2, label='Predicted')
ax.scatter(range(n_show), y_test_slt[sort_idx[:n_show]], color='red', s=20, label='Actual', zorder=5)

ax.set_xlabel('Sample Index (sorted by prediction)')
ax.set_ylabel('Lead Time (days)')
ax.set_title('Supplier Lead Time Prediction with Uncertainty')
ax.legend()
plt.tight_layout()
plt.show()
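
The coverage number is easier to interpret next to the interval width: an 80% interval is well calibrated when close to 80% of actuals fall inside it, and among calibrated intervals narrower ones are more useful for planning. The sketch below computes both on a handful of made-up lead times. The helper names and the numbers are illustrative assumptions.

```python
import numpy as np


def interval_coverage(y_true, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    y_true, lower, upper = (np.asarray(a, dtype=float) for a in (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))


def mean_interval_width(lower, upper):
    """Average width of the prediction intervals, in target units."""
    return float(np.mean(np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)))


# Made-up lead times (days) with lower and upper interval bounds
actual_demo = [10, 14, 21, 9, 30]
lower_demo = [8, 12, 15, 10, 22]
upper_demo = [12, 18, 20, 14, 35]

coverage_demo = interval_coverage(actual_demo, lower_demo, upper_demo)
width_demo = mean_interval_width(lower_demo, upper_demo)

print(f"Coverage: {coverage_demo:.0%}, mean width: {width_demo:.1f} days")
```

Coverage well below the nominal level suggests the intervals are too narrow for safety stock decisions, while coverage well above it suggests they are wider than necessary.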

---
# Part 3: Production Planning
---

## 8. Yield Prediction

**Business Context:** Production planners need accurate yield predictions to:
- Plan raw material requirements correctly
- Set realistic production targets
- Identify factors affecting production efficiency

**Target:** Production yield percentage (0-100%)
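
To illustrate why yield matters for raw material planning, the sketch below converts a target quantity of good units into the input quantity to schedule. It assumes yield is defined as good output divided by input, expressed as a percentage. The helper name and the numbers are hypothetical.

```python
import math


def required_input_units(target_good_units: float, yield_pct: float) -> int:
    """Input units needed to obtain `target_good_units` at a given yield %."""
    if not 0 < yield_pct <= 100:
        raise ValueError("yield_pct must be in (0, 100]")
    return math.ceil(target_good_units / (yield_pct / 100.0))


# Illustrative numbers: the same target at two predicted yields
input_at_92 = required_input_units(10_000, 92.0)
input_at_88 = required_input_units(10_000, 88.0)

print(f"Yield 92%: schedule {input_at_92} input units")
print(f"Yield 88%: schedule {input_at_88} input units")
```

A four-point drop in predicted yield raises the input requirement by several hundred units in this example, so yield errors translate directly into material shortages or excess.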

In [None]:
# Load Yield Prediction dataset
df_yield = spark.table("yield_prediction").toPandas()

print(f"Dataset shape: {df_yield.shape}")
print(f"\nFeatures:")
print([col for col in df_yield.columns if col != 'yield_percentage'])
print(f"\nTarget (yield_percentage) statistics:")
print(df_yield['yield_percentage'].describe())

In [None]:
# Visualize yield by key factors
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

df_yield.boxplot(column='yield_percentage', by='production_line', ax=axes[0])
axes[0].set_title('Yield by Production Line')
axes[0].set_ylabel('Yield (%)')
plt.suptitle('')

df_yield.boxplot(column='yield_percentage', by='raw_material_grade', ax=axes[1])
axes[1].set_title('Yield by Material Grade')
axes[1].set_ylabel('Yield (%)')
plt.suptitle('')  # Clear again: the second grouped boxplot resets the figure title

axes[2].scatter(df_yield['equipment_age_years'], df_yield['yield_percentage'], alpha=0.3)
axes[2].set_xlabel('Equipment Age (years)')
axes[2].set_ylabel('Yield (%)')
axes[2].set_title('Yield vs Equipment Age')

plt.tight_layout()
plt.show()

In [None]:
# Prepare features
df_yield_encoded = pd.get_dummies(df_yield, 
                                   columns=['production_line', 'product_complexity', 
                                           'raw_material_grade', 'shift'], 
                                   drop_first=True)

yield_feature_cols = [col for col in df_yield_encoded.columns if col != 'yield_percentage']
X_yield = df_yield_encoded[yield_feature_cols].values
y_yield = df_yield['yield_percentage'].values

# Sample and split
np.random.seed(42)
sample_size = min(1500, len(X_yield))
sample_idx_y = np.random.choice(len(X_yield), size=sample_size, replace=False)

X_train_y, X_test_y, y_train_y, y_test_y = train_test_split(
    X_yield[sample_idx_y], y_yield[sample_idx_y], test_size=0.2, random_state=42
)
print(f"Training: {len(X_train_y)}, Test: {len(X_test_y)}")

In [None]:
# Train TabPFN regressor for yield prediction
reg_yield = TabPFNRegressor()
reg_yield.fit(X_train_y, y_train_y)

y_pred_yield = reg_yield.predict(X_test_y)

rmse_yield = np.sqrt(mean_squared_error(y_test_y, y_pred_yield))
mae_yield = mean_absolute_error(y_test_y, y_pred_yield)
r2_yield = r2_score(y_test_y, y_pred_yield)

print(f"TabPFN Yield Prediction Results:")
print(f"  RMSE: {rmse_yield:.2f}%")
print(f"  MAE:  {mae_yield:.2f}%")
print(f"  R²:   {r2_yield:.4f}")

In [None]:
# Visualize predictions vs actual
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(y_test_y, y_pred_yield, alpha=0.5)
ax.plot([y_test_y.min(), y_test_y.max()], [y_test_y.min(), y_test_y.max()], 'r--', lw=2)
ax.set_xlabel('Actual Yield (%)')
ax.set_ylabel('Predicted Yield (%)')
ax.set_title(f'Yield Prediction: Predicted vs Actual (R² = {r2_yield:.3f})')
plt.tight_layout()
plt.show()

---
# Part 4: Distribution Planning
---

## 9. Transportation Lead Time Prediction

**Business Context:** Distribution planners need accurate transit time predictions to:
- Promise accurate delivery dates to customers
- Plan warehouse receiving schedules
- Optimize carrier selection

**Target:** Actual transit time in days

In [None]:
# Load Transportation Lead Time dataset
df_transport = spark.table("transportation_lead_time").toPandas()

print(f"Dataset shape: {df_transport.shape}")
print(f"\nFeatures:")
print([col for col in df_transport.columns if col != 'actual_transit_days'])
print(f"\nTarget (actual_transit_days) statistics:")
print(df_transport['actual_transit_days'].describe())

In [None]:
# Visualize transit time by key factors
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

df_transport.boxplot(column='actual_transit_days', by='service_level', ax=axes[0])
axes[0].set_title('Transit Time by Service Level')
axes[0].set_ylabel('Transit Days')
plt.suptitle('')

df_transport.boxplot(column='actual_transit_days', by='shipment_type', ax=axes[1])
axes[1].set_title('Transit Time by Shipment Type')
axes[1].set_ylabel('Transit Days')
plt.suptitle('')  # Clear again: the second grouped boxplot resets the figure title

axes[2].scatter(df_transport['distance_miles'], df_transport['actual_transit_days'], alpha=0.3)
axes[2].set_xlabel('Distance (miles)')
axes[2].set_ylabel('Transit Days')
axes[2].set_title('Transit Time vs Distance')

plt.tight_layout()
plt.show()

In [None]:
# Prepare features
df_trans_encoded = pd.get_dummies(df_transport, 
                                   columns=['shipment_type', 'origin_region', 'destination_region',
                                           'carrier_tier', 'service_level', 'ship_day_of_week'], 
                                   drop_first=True)

trans_feature_cols = [col for col in df_trans_encoded.columns if col != 'actual_transit_days']
X_trans = df_trans_encoded[trans_feature_cols].values
y_trans = df_transport['actual_transit_days'].values

# Sample and split
np.random.seed(42)
sample_size = min(1500, len(X_trans))
sample_idx_t = np.random.choice(len(X_trans), size=sample_size, replace=False)

X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(
    X_trans[sample_idx_t], y_trans[sample_idx_t], test_size=0.2, random_state=42
)
print(f"Training: {len(X_train_t)}, Test: {len(X_test_t)}")

In [None]:
# Train TabPFN regressor for transportation lead time
reg_trans = TabPFNRegressor()
reg_trans.fit(X_train_t, y_train_t)

y_pred_trans = reg_trans.predict(X_test_t)

rmse_trans = np.sqrt(mean_squared_error(y_test_t, y_pred_trans))
mae_trans = mean_absolute_error(y_test_t, y_pred_trans)
r2_trans = r2_score(y_test_t, y_pred_trans)

print(f"TabPFN Transportation Lead Time Prediction Results:")
print(f"  RMSE: {rmse_trans:.2f} days")
print(f"  MAE:  {mae_trans:.2f} days")
print(f"  R²:   {r2_trans:.4f}")

In [None]:
# Uncertainty quantification for transportation lead time
y_lower_t = reg_trans.predict(X_test_t, output_type="quantiles", quantiles=[0.1]).flatten()
y_upper_t = reg_trans.predict(X_test_t, output_type="quantiles", quantiles=[0.9]).flatten()

coverage_t = np.mean((y_test_t >= y_lower_t) & (y_test_t <= y_upper_t))
print(f"80% Prediction Interval Coverage: {coverage_t:.1%}")

# Business application: Delivery Promise
# Use upper bound of prediction interval for conservative delivery promises
print(f"\nDelivery Promise Strategy:")
print(f"  Average predicted transit: {y_pred_trans.mean():.1f} days")
print(f"  Average conservative promise (mean of 90th percentiles): {y_upper_t.mean():.1f} days")
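
The averages above summarize the strategy, but a delivery promise is made per shipment. A minimal sketch of that step follows: round each shipment's upper-quantile estimate up to whole days, then check how often the actual transit time met the promise. The helper name and all numbers are illustrative assumptions, not outputs of the model.

```python
import numpy as np


def promise_days(upper_quantile_days):
    """Round each upper-quantile transit estimate up to a whole number of days."""
    return np.ceil(np.asarray(upper_quantile_days, dtype=float)).astype(int)


# Made-up 90th-percentile estimates and actual transit times (days)
upper_q_demo = [2.3, 4.0, 5.7, 3.1]
actual_days_demo = [2, 5, 5, 3]

promised_demo = promise_days(upper_q_demo)
on_time_rate_demo = float(np.mean(np.asarray(actual_days_demo) <= promised_demo))

print(f"Promised days: {promised_demo.tolist()}")
print(f"On-time rate:  {on_time_rate_demo:.0%}")
```

Choosing a higher quantile raises the on-time rate at the cost of longer quoted delivery times, which is the trade-off the planner controls.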

---
# Part 5: Model Comparison & Summary
---

## 10. Model Comparison

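This section compares TabPFN with three common baselines (Random Forest, Gradient Boosting, and Ridge Regression) on the same train/test split for each use case. The baselines are not tuned and each score comes from a single split, so treat the R² values as a rough comparison rather than a benchmark.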

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

# Define models
models = {
    "TabPFN": TabPFNRegressor(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "Ridge Regression": Ridge(alpha=1.0),
}

# Evaluate on all datasets
datasets = [
    ("Price Elasticity", X_train, X_test, y_train, y_test),
    ("Promotion Lift", X_train_p, X_test_p, y_train_p, y_test_p),
    ("Supplier Lead Time", X_train_slt, X_test_slt, y_train_slt, y_test_slt),
    ("Yield Prediction", X_train_y, X_test_y, y_train_y, y_test_y),
    ("Transportation LT", X_train_t, X_test_t, y_train_t, y_test_t),
]

all_results = {}
for ds_name, X_tr, X_te, y_tr, y_te in datasets:
    print(f"\n{ds_name}:")
    print("-" * 50)
    all_results[ds_name] = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        y_pred_model = model.predict(X_te)
        r2_model = r2_score(y_te, y_pred_model)  # Separate name so the earlier `r2` is not overwritten
        all_results[ds_name][name] = r2_model
        print(f"  {name:20s}: R² = {r2_model:.4f}")
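
Once the nested results dictionary is filled, it can be summarized without plotting. The sketch below picks the best model per use case and averages each model's score across use cases. It runs on placeholder names and made-up numbers that only mirror the shape of `all_results`; they are not results from this notebook.

```python
# Placeholder scores with the same nesting as all_results:
# {use_case: {model_name: r2}}. Names and values are made up.
example_results = {
    "Use case A": {"Model 1": 0.82, "Model 2": 0.79, "Model 3": 0.61},
    "Use case B": {"Model 1": 0.55, "Model 2": 0.58, "Model 3": 0.40},
}


def best_model_per_use_case(results):
    """Return the model with the highest score for each use case."""
    return {ds: max(scores, key=scores.get) for ds, scores in results.items()}


def mean_score_per_model(results):
    """Average each model's score across all use cases."""
    collected = {}
    for scores in results.values():
        for model, score in scores.items():
            collected.setdefault(model, []).append(score)
    return {model: sum(vals) / len(vals) for model, vals in collected.items()}


best_demo = best_model_per_use_case(example_results)
mean_demo = mean_score_per_model(example_results)

for ds, model in best_demo.items():
    print(f"{ds}: best = {model}")
for model, score in mean_demo.items():
    print(f"{model}: mean R² = {score:.3f}")
```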

In [None]:
# Visualize comparison across all use cases
df_all = pd.DataFrame(all_results)

fig, ax = plt.subplots(figsize=(12, 6))
df_all.T.plot(kind='bar', ax=ax)
ax.set_xlabel('Use Case')
ax.set_ylabel('R² Score')
ax.set_title('Model Comparison Across All Regression Use Cases')
ax.legend(title='Model', bbox_to_anchor=(1.02, 1), loc='upper left')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()

## Summary

In this notebook, we demonstrated TabPFN regression across the planning value chain:

| Use Case | Planning Process | Key Metrics |
|----------|-----------------|-------------|
| Price Elasticity | Demand Planning | R², MAE for elasticity coefficient |
| Promotion Lift | Demand Planning | R², MAE for lift % |
| Supplier Lead Time | Supply Planning | R², MAE for days, prediction intervals |
| Yield Prediction | Production Planning | R², MAE for yield % |
| Transportation Lead Time | Distribution Planning | R², MAE for transit days, prediction intervals |

**Key Takeaways:**
1. TabPFN runs with default settings and no hyperparameter tuning; Section 10 shows how its R² compares with untuned baselines on each use case
2. Built-in uncertainty quantification enables risk-aware planning
3. Predictions can be directly integrated into planning systems

**Next Steps:**
- Run the `03_outlier_detection` notebook for production anomaly detection
- Run the `04_time_series_forecasting` notebook for demand forecasting
- Integrate predictions into ERP/planning systems