# TabPFN Classification on Databricks

This notebook demonstrates how to use **TabPFN** for classification tasks on Databricks.

TabPFN is a foundation model for tabular data. Its authors report that it matches or outperforms tuned traditional methods on small to medium-sized datasets in far less time, because it requires no hyperparameter tuning and works out of the box.

**What you will learn:**
- How to install and authenticate with the TabPFN client
- How to perform binary and multi-class classification
- How to compare TabPFN with other classifiers
- How to get probability estimates and evaluate model performance

**Prerequisites:** Run the `00_data_preparation` notebook first to set up the datasets.

**References:**
- [TabPFN Client GitHub](https://github.com/PriorLabs/tabpfn-client)
- [Prior Labs Documentation](https://docs.priorlabs.ai/)

## Compute Setup

We recommend running this notebook on **Serverless Compute** with the **Base Environment V4**.

To configure:
1. Click on the compute selector in the notebook toolbar
2. Select **Serverless**
3. Under Environment, choose **Base Environment V4**

Serverless compute provides fast startup times and automatic scaling, which makes it well suited to interactive notebook workflows.

## 1. Installation

In [None]:
%pip install tabpfn-client scikit-learn pandas matplotlib --quiet

In [None]:
dbutils.library.restartPython()

## 2. Authentication

The TabPFN client requires authentication with an access token.

**Setting up Databricks Secrets (one-time setup):**

1. Create a secret scope using the Databricks CLI:
   ```bash
   databricks secrets create-scope tabpfn-client
   ```

2. Store your TabPFN token in the secret scope:
   ```bash
   databricks secrets put-secret tabpfn-client token
   ```

3. If you do not have your token at hand for step 2, you can retrieve it on another machine by running:
   ```python
   import tabpfn_client
   token = tabpfn_client.get_access_token()
   print(token)
   ```

In [None]:
import tabpfn_client

token = dbutils.secrets.get(scope="tabpfn-client", key="token")
tabpfn_client.set_access_token(token)
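
If the secret scope or key is missing, the cell above fails with an error that can be hard to read. The following is a minimal sketch of a guard around the lookup. `load_token` and `fake_get` are hypothetical helpers introduced only for illustration; `fake_get` stands in for `dbutils.secrets.get` so that the sketch runs outside Databricks.

```python
def load_token(get_secret, scope="tabpfn-client", key="token"):
    """Fetch a token via `get_secret` and fail early with a readable message."""
    try:
        token = get_secret(scope=scope, key=key)
    except Exception as exc:  # e.g. the scope or key does not exist
        raise RuntimeError(
            f"Could not read secret '{key}' from scope '{scope}'. "
            "Check that the one-time secret setup was completed."
        ) from exc
    if not token or not token.strip():
        raise RuntimeError(f"Secret '{key}' in scope '{scope}' is empty.")
    # Strip stray whitespace or newlines picked up when the secret was stored.
    return token.strip()


# Hypothetical stand-in for dbutils.secrets.get (illustration only).
def fake_get(scope, key):
    store = {"tabpfn-client": {"token": " abc123 \n"}}
    return store[scope][key]


token = load_token(fake_get)
print(len(token))
```

On Databricks, the same helper could be called as `load_token(dbutils.secrets.get)` before passing the result to `tabpfn_client.set_access_token`.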

## 3. Configuration

Configure the catalog and schema where the prepared datasets are stored.

In [None]:
# Configure catalog and schema (must match 00_data_preparation)
CATALOG = "tabpfn_databricks"
SCHEMA = "default"

spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SCHEMA}")
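
The f-strings above interpolate the names into SQL unquoted, which breaks if a catalog or schema name contains characters such as a hyphen. A minimal sketch of backtick quoting follows; `quote_identifier` is a hypothetical helper, not part of Spark or the TabPFN client.

```python
def quote_identifier(name: str) -> str:
    """Wrap a SQL identifier in backticks, doubling any embedded backtick."""
    if not name:
        raise ValueError("identifier must be non-empty")
    return "`" + name.replace("`", "``") + "`"


CATALOG = "tabpfn_databricks"
SCHEMA = "default"

# Build the statements as strings; on Databricks they would be passed to spark.sql(...).
use_catalog_sql = f"USE CATALOG {quote_identifier(CATALOG)}"
use_schema_sql = f"USE SCHEMA {quote_identifier(SCHEMA)}"

print(use_catalog_sql)
print(use_schema_sql)
```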

## 4. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

from tabpfn_client import TabPFNClassifier

## 5. Binary Classification Example

We'll use the **Breast Cancer Wisconsin** dataset for binary classification.

In [None]:
# Load the Breast Cancer dataset from Delta table
df_breast_cancer = spark.table("breast_cancer").toPandas()

# Separate features and target
feature_names = [col for col in df_breast_cancer.columns if col != "target"]
X = df_breast_cancer[feature_names].values
y = df_breast_cancer["target"].values

print(f"Dataset shape: {X.shape}")
print(f"Number of features: {len(feature_names)}")
print("Classes: ['malignant', 'benign']")
print(f"Class distribution: {dict(zip(*np.unique(y, return_counts=True)))}")

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
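
The `stratify=y` argument keeps the class proportions roughly equal in the training and test sets. The sketch below illustrates the idea in plain Python on toy labels. It is a simplified illustration, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class contributes about `test_size` of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 30/70 class imbalance (not the breast cancer data).
labels = [0] * 30 + [1] * 70
train_idx, test_idx = stratified_split(labels, test_size=0.2)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both splits keep the 30/70 ratio of the toy labels, which is what makes metrics on the test set comparable to the training distribution.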

In [None]:
# Initialize and train TabPFN classifier
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])

print("TabPFN Results:")
print(f"  Accuracy: {accuracy:.4f}")
print(f"  ROC AUC:  {roc_auc:.4f}")
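
ROC AUC is computed from the positive-class probabilities (`y_pred_proba[:, 1]`) rather than from the hard predictions. It can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on toy values. `roc_auc_pairwise` is a hypothetical helper for illustration, not scikit-learn's implementation.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores (not model output).
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = roc_auc_pairwise(y_true_toy, scores_toy)
print(auc_toy)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```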

In [None]:
# Detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))
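
The report lists precision, recall, and F1 per class. As a sketch of what those columns mean, the following computes them by hand for one class on toy labels. `precision_recall_f1` is a hypothetical helper, and the labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Toy labels: 0 = malignant, 1 = benign (made-up values, not model output).
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 0]

for cls, name in enumerate(["malignant", "benign"]):
    p, r, f = precision_recall_f1(y_true_toy, y_pred_toy, positive=cls)
    print(f"{name:10s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```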

## 6. Model Comparison

Let's compare TabPFN with other popular classifiers using cross-validation.

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Define models to compare
models = {
    "TabPFN": TabPFNClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
}

# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = {"mean": scores.mean(), "std": scores.std()}
    print(f"{name:20s}: ROC AUC = {scores.mean():.4f} (+/- {scores.std():.4f})")
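
A note on the `+/-` value printed above: NumPy's `std()` defaults to the population standard deviation (`ddof=0`), which is slightly smaller than the sample standard deviation when there are only five folds. The sketch below shows the difference with the standard library. The fold scores are hypothetical placeholders, not results from this notebook.

```python
import statistics

# Hypothetical per-fold ROC AUC values (placeholders, not real results).
fold_scores = [0.98, 0.99, 1.00, 0.97, 0.99]

mean_score = statistics.fmean(fold_scores)
population_std = statistics.pstdev(fold_scores)  # divides by n, like numpy's default std()
sample_std = statistics.stdev(fold_scores)       # divides by n - 1, like std(ddof=1)

print(f"mean           = {mean_score:.4f}")
print(f"population std = {population_std:.4f}")
print(f"sample std     = {sample_std:.4f}")
```

With so few folds, either spread is only a rough indication of variability, so small differences between models in the comparison should be read with caution.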

In [None]:
# Visualize comparison
df_results = pd.DataFrame(results).T
df_results.columns = ["Mean ROC AUC", "Std"]
df_results = df_results.sort_values("Mean ROC AUC", ascending=True)

fig, ax = plt.subplots(figsize=(10, 5))
colors = ["#2ecc71" if name == "TabPFN" else "#3498db" for name in df_results.index]
bars = ax.barh(df_results.index, df_results["Mean ROC AUC"], color=colors)
ax.errorbar(df_results["Mean ROC AUC"], df_results.index,
            xerr=df_results["Std"], fmt="none", color="black", capsize=3)
ax.set_xlabel("ROC AUC Score")
ax.set_title("Model Comparison - Breast Cancer Classification")
ax.set_xlim(0.9, 1.0)
plt.tight_layout()
plt.show()

## 7. Multi-class Classification

TabPFN also supports multi-class classification. Let's demonstrate with the Iris dataset.

In [None]:
# Load the Iris dataset from Delta table
df_iris = spark.table("iris").toPandas()

# Separate features and target
iris_feature_names = [col for col in df_iris.columns if col != "target"]
X_iris = df_iris[iris_feature_names].values
y_iris = df_iris["target"].values
iris_target_names = ['setosa', 'versicolor', 'virginica']

print(f"Dataset shape: {X_iris.shape}")
print(f"Classes: {iris_target_names}")

# Split the data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

In [None]:
# Train TabPFN on multi-class problem
clf_multi = TabPFNClassifier()
clf_multi.fit(X_train_iris, y_train_iris)

# Make predictions
y_pred_iris = clf_multi.predict(X_test_iris)

# Evaluate
accuracy_iris = accuracy_score(y_test_iris, y_pred_iris)
print(f"Multi-class Accuracy: {accuracy_iris:.4f}")

print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=iris_target_names))
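
Probability estimates are also available in the multi-class case: `predict_proba` returns one column per class. The sketch below maps probability rows to class names by taking the largest entry. The rows are made-up values rather than model output, and the sketch assumes the columns follow the order of `iris_target_names`. `top_class` is a hypothetical helper.

```python
iris_target_names = ["setosa", "versicolor", "virginica"]

# Made-up probability rows, one column per class (not model output).
proba_rows = [
    [0.90, 0.07, 0.03],
    [0.10, 0.60, 0.30],
    [0.05, 0.40, 0.55],
]


def top_class(row, names):
    """Return the name and probability of the most likely class in one row."""
    best = max(range(len(row)), key=row.__getitem__)
    return names[best], row[best]


predictions = [top_class(row, iris_target_names) for row in proba_rows]
for name, prob in predictions:
    print(f"{name:12s} p={prob:.2f}")
```

The second and third rows show why the probabilities are worth inspecting: the predicted class wins by a much smaller margin than in the first row.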

## Summary

In this notebook, we demonstrated:

- ✅ Installing and authenticating the TabPFN client
- ✅ Loading data from Delta tables
- ✅ Binary classification with probability estimates
- ✅ Multi-class classification
- ✅ Model comparison with cross-validation

TabPFN performs well out of the box without hyperparameter tuning, which makes it a good fit for rapid prototyping and baseline modeling.