# TabPFN Regression on Databricks

This notebook demonstrates how to use **TabPFN** for regression tasks on Databricks.

TabPFN offers strong regression performance on small to medium-sized tabular datasets and has built-in uncertainty quantification, which makes it well suited to scenarios where understanding prediction confidence is important.

**What you will learn:**
- How to perform regression with TabPFN
- How to quantify prediction uncertainty
- How to compare TabPFN with other regressors

**Prerequisites:** Run the `00_data_preparation` notebook first to set up the datasets.

**References:**
- [TabPFN Client GitHub](https://github.com/PriorLabs/tabpfn-client)
- [Prior Labs Documentation](https://docs.priorlabs.ai/)

## Compute Setup

We recommend running this notebook on **Serverless Compute** with the **Base Environment V4**.

## 1. Installation

In [None]:
%pip install tabpfn-client scikit-learn pandas matplotlib seaborn --quiet

In [None]:
dbutils.library.restartPython()

## 2. Authentication

See the `01_classification` notebook for detailed instructions on setting up Databricks Secrets.

In [None]:
import tabpfn_client

token = dbutils.secrets.get(scope="tabpfn-client", key="token")
tabpfn_client.set_access_token(token)

## 3. Configuration

In [None]:
CATALOG = "tabpfn_databricks"
SCHEMA = "default"

spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SCHEMA}")

## 4. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from tabpfn_client import TabPFNRegressor

## 5. Basic Regression Example

We'll use the **California Housing** dataset to predict median house values.

In [None]:
# Load California Housing dataset from Delta table
df_housing = spark.table("california_housing").toPandas()

# Separate features and target
feature_names = [col for col in df_housing.columns if col != "target"]
X = df_housing[feature_names].values
y = df_housing["target"].values

print(f"Dataset shape: {X.shape}")
print(f"Features: {feature_names}")
print("Target: Median house value (in $100,000s)")
print(f"Target range: [{y.min():.2f}, {y.max():.2f}]")

In [None]:
# Use a subset for faster demonstration (TabPFN works best with smaller datasets)
np.random.seed(42)
sample_idx = np.random.choice(len(X), size=2000, replace=False)
X_sample = X[sample_idx]
y_sample = y[sample_idx]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_sample, y_sample, test_size=0.2, random_state=42
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

In [None]:
# Initialize and train TabPFN regressor
reg = TabPFNRegressor()
reg.fit(X_train, y_train)

# Make predictions
y_pred = reg.predict(X_test)

# Evaluate performance
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("TabPFN Regression Results:")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE:  {mae:.4f}")
print(f"  R²:   {r2:.4f}")
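
For reference, the three metrics printed above can be reproduced directly from their definitions with NumPy. This is a minimal sketch on made-up toy arrays (the values are illustrative, not taken from the housing data), and `regression_metrics` is a hypothetical helper name, not part of TabPFN or scikit-learn.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = np.sqrt(np.mean(residuals ** 2))         # root mean squared error
    mae = np.mean(np.abs(residuals))                # mean absolute error
    ss_res = np.sum(residuals ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination
    return {"RMSE": float(rmse), "MAE": float(mae), "R2": float(r2)}


# Toy values chosen only for illustration
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two usually points to a few large misses rather than uniformly mediocre predictions.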

In [None]:
# Visualize predictions vs actual values
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(y_test, y_pred, alpha=0.5, edgecolors='none')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax.set_xlabel('Actual Values')
ax.set_ylabel('Predicted Values')
ax.set_title(f'TabPFN Regression: Predicted vs Actual (R² = {r2:.3f})')
plt.tight_layout()
plt.show()

## 6. Uncertainty Quantification

TabPFN can provide prediction intervals, which is valuable for understanding model confidence.

In [None]:
# Get predictions with uncertainty (quantiles)
# Predict 5th, 50th (median), and 95th percentiles for 90% prediction interval
y_lower = reg.predict(X_test, output_type="quantiles", quantiles=[0.05]).flatten()
y_median = reg.predict(X_test, output_type="quantiles", quantiles=[0.5]).flatten()
y_upper = reg.predict(X_test, output_type="quantiles", quantiles=[0.95]).flatten()

print(f"Shapes - lower: {y_lower.shape}, median: {y_median.shape}, upper: {y_upper.shape}")

# Calculate coverage (what percentage of true values fall within the prediction interval)
coverage = np.mean((y_test >= y_lower) & (y_test <= y_upper))
print(f"90% Prediction Interval Coverage: {coverage:.1%}")
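
Coverage alone does not say how tight the intervals are, since very wide intervals reach 90% coverage trivially. The sketch below shows two complementary checks: mean interval width and pinball (quantile) loss. It runs on synthetic stand-in arrays rather than TabPFN output, and `interval_report` and `pinball_loss` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np


def pinball_loss(y_true, y_quantile, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_quantile, dtype=float)
    # Under-prediction is weighted by q, over-prediction by (1 - q)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


def interval_report(y_true, lower, upper):
    """Empirical coverage and mean width of a prediction interval."""
    y_true, lower, upper = (np.asarray(a, dtype=float) for a in (y_true, lower, upper))
    inside = (y_true >= lower) & (y_true <= upper)
    return {
        "coverage": float(inside.mean()),
        "mean_width": float(np.mean(upper - lower)),
    }


# Synthetic stand-in: Gaussian noise around a known mean (assumption for the demo)
rng = np.random.default_rng(0)
sigma = 0.5
mu = rng.uniform(1.0, 4.0, size=5000)
y_true_demo = mu + rng.normal(0.0, sigma, size=5000)

z95 = 1.6448536269514722  # 95th percentile of the standard normal distribution
lower_demo = mu - z95 * sigma
upper_demo = mu + z95 * sigma

report = interval_report(y_true_demo, lower_demo, upper_demo)
print(f"Coverage: {report['coverage']:.1%}, mean width: {report['mean_width']:.3f}")
print(f"Pinball loss at q=0.05: {pinball_loss(y_true_demo, lower_demo, 0.05):.4f}")
```

With the notebook's variables, the same helpers could be called as `interval_report(y_test, y_lower, y_upper)` and `pinball_loss(y_test, y_lower, 0.05)`. Lower pinball loss at a given quantile level indicates a better-placed quantile estimate.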

In [None]:
# Visualize predictions with uncertainty
# Sort by predicted median so the interval band reads left to right
sort_idx = np.argsort(y_median)
n_show = 50  # Show the 50 test samples with the lowest predicted medians

fig, ax = plt.subplots(figsize=(12, 6))
x_range = np.arange(n_show)

ax.fill_between(x_range, 
                y_lower[sort_idx[:n_show]], 
                y_upper[sort_idx[:n_show]], 
                alpha=0.3, color='blue', label='90% Prediction Interval')
ax.plot(x_range, y_median[sort_idx[:n_show]], 'b-', linewidth=2, label='Predicted (median)')
ax.scatter(x_range, y_test[sort_idx[:n_show]], color='red', s=20, label='Actual', zorder=5)

ax.set_xlabel('Sample Index (sorted by prediction)')
ax.set_ylabel('House Value ($100,000s)')
ax.set_title('TabPFN Regression with Uncertainty Quantification')
ax.legend()
plt.tight_layout()
plt.show()

## 7. Model Comparison

Let's compare TabPFN with other popular regression models.

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

# Define models
models = {
    "TabPFN": TabPFNRegressor(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "Ridge Regression": Ridge(alpha=1.0),
}

# Evaluate each model
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred_model = model.predict(X_test)
    
    results[name] = {
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred_model)),
        "MAE": mean_absolute_error(y_test, y_pred_model),
        "R²": r2_score(y_test, y_pred_model)
    }
    print(f"{name:20s}: RMSE = {results[name]['RMSE']:.4f}, R² = {results[name]['R²']:.4f}")
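
A single 80/20 split can make the ranking between models noisy. One way to get a steadier estimate is K-fold cross-validation. The sketch below uses only scikit-learn baselines on synthetic data from `make_regression`, so it runs without a TabPFN token. Whether to include `TabPFNRegressor` in such a loop depends on API usage limits, which this sketch does not address. The names `X_demo`, `y_demo`, `baselines`, and `cv_rmse` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic linear data used purely as a stand-in for the housing sample
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

baselines = {
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse = {}
for name, model in baselines.items():
    # scikit-learn reports negated RMSE so that higher scores are always better
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = (float(-scores.mean()), float(scores.std()))
    print(f"{name:20s}: RMSE = {cv_rmse[name][0]:.4f} ± {cv_rmse[name][1]:.4f}")
```

Reporting the standard deviation across folds alongside the mean helps judge whether a difference between two models is larger than the split-to-split variation.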

In [None]:
# Visualize comparison
df_results = pd.DataFrame(results).T

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Build colors from the sorted index so TabPFN's bar is the one highlighted in green
def bar_colors(index):
    return ['#2ecc71' if name == 'TabPFN' else '#3498db' for name in index]

# RMSE comparison
rmse_sorted = df_results['RMSE'].sort_values()
rmse_sorted.plot(kind='barh', ax=axes[0], color=bar_colors(rmse_sorted.index))
axes[0].set_xlabel('RMSE (lower is better)')
axes[0].set_title('Model Comparison - RMSE')

# R² comparison
r2_sorted = df_results['R²'].sort_values()
r2_sorted.plot(kind='barh', ax=axes[1], color=bar_colors(r2_sorted.index))
axes[1].set_xlabel('R² (higher is better)')
axes[1].set_title('Model Comparison - R²')

plt.tight_layout()
plt.show()

## Summary

In this notebook, we demonstrated:

- ✅ Basic regression with TabPFN
- ✅ Uncertainty quantification with prediction intervals
- ✅ Model comparison with other popular regressors
- ✅ Loading data from Delta tables

TabPFN's built-in uncertainty quantification makes it particularly valuable for applications where understanding prediction confidence is critical.