# Outlier Detection on Databricks

This notebook demonstrates outlier detection on Databricks, comparing a TabPFN-based anomaly scorer with two traditional methods: Isolation Forest and Local Outlier Factor.

**What you will learn:**
- How to use TabPFN for anomaly scoring via classification
- How to evaluate and visualize anomaly scores
- How to compare with traditional outlier detection methods

**Prerequisites:** Run the `00_data_preparation` notebook first to set up the datasets.

## Compute Setup

We recommend running this notebook on **Serverless Compute** with the **Base Environment V4**.

## 1. Installation

In [None]:
%pip install tabpfn-client scikit-learn pandas matplotlib seaborn --quiet

In [None]:
dbutils.library.restartPython()

## 2. Authentication

In [None]:
import tabpfn_client

token = dbutils.secrets.get(scope="tabpfn-client", key="token")
tabpfn_client.set_access_token(token)

## 3. Configuration

In [None]:
CATALOG = "tabpfn_databricks"
SCHEMA = "default"

spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SCHEMA}")

## 4. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from tabpfn_client import TabPFNClassifier

## 5. Synthetic Data Example

In [None]:
# Generate synthetic data with known outliers
np.random.seed(42)

n_normal = 200
X_normal = np.random.randn(n_normal, 2) * 0.5 + np.array([2, 2])

n_outliers = 20
X_outliers = np.random.uniform(-2, 6, size=(n_outliers, 2))

X_synthetic = np.vstack([X_normal, X_outliers])
y_true = np.array([0] * n_normal + [1] * n_outliers)

print(f"Dataset shape: {X_synthetic.shape}")
print(f"Normal samples: {n_normal}, Outliers: {n_outliers}")

In [None]:
# Visualize the data
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X_synthetic[y_true == 0, 0], X_synthetic[y_true == 0, 1], c='blue', label='Normal', alpha=0.6)
ax.scatter(X_synthetic[y_true == 1, 0], X_synthetic[y_true == 1, 1], c='red', label='Outlier', marker='x', s=100)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Synthetic Data with Known Outliers')
ax.legend()
plt.show()
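
Because the outliers are drawn uniformly from the square [-2, 6] x [-2, 6], some of them can land inside the normal cluster, where a detector has little chance of telling them apart from normal points. The sketch below regenerates the same data and counts the outliers within 1.5 units of the cluster center. The helper `count_within_radius` and the radius of 1.5 (three standard deviations per axis) are illustrative choices, not part of the original notebook.

```python
import numpy as np

def count_within_radius(points, center, radius):
    """Count rows of `points` whose Euclidean distance to `center` is <= radius."""
    distances = np.linalg.norm(points - center, axis=1)
    return int(np.sum(distances <= radius))

# Regenerate the data exactly as in the cell above (same seed, same call order)
np.random.seed(42)
X_normal_demo = np.random.randn(200, 2) * 0.5 + np.array([2, 2])
X_outliers_demo = np.random.uniform(-2, 6, size=(20, 2))

center = np.array([2.0, 2.0])
radius = 1.5  # assumption: 3 standard deviations (3 * 0.5) per axis
n_overlap = count_within_radius(X_outliers_demo, center, radius)
print(f"Outliers within {radius} of the cluster center: {n_overlap} of {len(X_outliers_demo)}")
```

On average, a fraction π · 1.5² / 64 ≈ 0.11 of uniformly drawn points falls inside that circle, so a couple of labeled outliers may be effectively indistinguishable from normal points, which limits the achievable ROC AUC.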

## 6. TabPFN-based Anomaly Detection

We fit TabPFN as a binary classifier on labeled normal points plus a small set of synthetic outliers, then use its predicted outlier probability as the anomaly score.

In [None]:
# Fit TabPFN to separate normal points from synthetic outliers,
# then use the predicted outlier probability as the anomaly score

# Training set: 150 labeled normal samples
X_train = X_normal[:150]
y_train = np.zeros(150)   # All labeled as normal (0)

# Add 15 synthetic outliers, drawn from the same uniform range as the
# test outliers, so the classifier sees both classes
X_train_outliers = np.random.uniform(-2, 6, size=(15, 2))
X_train = np.vstack([X_train, X_train_outliers])
y_train = np.concatenate([y_train, np.ones(15)])  # Label outliers as 1

clf = TabPFNClassifier()
clf.fit(X_train, y_train)

# Score all points (including the 150 training normals): probability of being an outlier
anomaly_scores_tabpfn = clf.predict_proba(X_synthetic)[:, 1]

roc_auc_tabpfn = roc_auc_score(y_true, anomaly_scores_tabpfn)
print(f"TabPFN ROC AUC: {roc_auc_tabpfn:.4f}")
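
Since the first 150 rows of `X_synthetic` were also used for fitting, a held-out score is a useful cross-check. A minimal sketch follows, assuming the row order produced by `np.vstack([X_normal, X_outliers])`. The helper `heldout_mask` is hypothetical, and random stand-in scores replace the TabPFN output so that the block runs without the API.

```python
import numpy as np

def heldout_mask(n_normal, n_train_normal, n_outliers):
    """Boolean mask over the rows of vstack([X_normal, X_outliers]).

    True marks rows not used for fitting: the normal rows after the first
    `n_train_normal`, plus all test outliers (the training outliers were
    drawn separately and are not part of X_synthetic).
    """
    mask = np.ones(n_normal + n_outliers, dtype=bool)
    mask[:n_train_normal] = False
    return mask

mask = heldout_mask(n_normal=200, n_train_normal=150, n_outliers=20)
y_true_demo = np.array([0] * 200 + [1] * 20)

# Stand-in scores; in the notebook these would be anomaly_scores_tabpfn
rng = np.random.default_rng(0)
scores_demo = rng.random(220)

y_heldout = y_true_demo[mask]
scores_heldout = scores_demo[mask]
print(f"Held-out rows: {mask.sum()} ({int(y_heldout.sum())} outliers)")
```

In the notebook, `roc_auc_score(y_true[mask], anomaly_scores_tabpfn[mask])` would then give the held-out ROC AUC.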

## 7. Comparison with Traditional Methods

In [None]:
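# Standardize features; both baselines are fitted unsupervised on all points
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_synthetic)

# Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
iso_forest.fit(X_scaled)
# score_samples is lower for more anomalous points; negate so higher = more anomalous
scores_iso = -iso_forest.score_samples(X_scaled)

# Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20, novelty=False, contamination=0.1)
lof.fit(X_scaled)
# negative_outlier_factor_ is lower for outliers; negate for the same orientation
scores_lof = -lof.negative_outlier_factor_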

# Evaluate all methods
methods = {
    "TabPFN (semi-supervised)": anomaly_scores_tabpfn,
    "Isolation Forest": scores_iso,
    "Local Outlier Factor": scores_lof,
}

results = {}
for name, scores in methods.items():
    roc = roc_auc_score(y_true, scores)
    results[name] = roc
    print(f"{name:30s}: ROC AUC = {roc:.4f}")
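
ROC AUC can be read as the probability that a randomly chosen outlier receives a higher score than a randomly chosen normal point, with ties counted as one half. This is why the scikit-learn scores are negated: `score_samples` and `negative_outlier_factor_` assign lower values to more anomalous points. The pure-Python sketch below illustrates the idea on made-up toy values; `pairwise_auc` is an illustrative helper, not a library function.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (outlier, normal) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    so suitable for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: higher score = more anomalous
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.2, 0.8, 0.3]

auc_value = pairwise_auc(labels, scores)
auc_flipped = pairwise_auc(labels, [-s for s in scores])
print(f"AUC: {auc_value:.3f}, AUC with flipped sign: {auc_flipped:.3f}")
```

Forgetting the sign flip turns an AUC of a into 1 - a, which would make a good detector look worse than chance.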

In [None]:
# Visualize comparison
fig, ax = plt.subplots(figsize=(10, 5))
colors = ['#2ecc71' if 'TabPFN' in name else '#3498db' for name in results.keys()]
bars = ax.barh(list(results.keys()), list(results.values()), color=colors)
ax.set_xlabel('ROC AUC Score')
ax.set_title('Outlier Detection Method Comparison')
ax.set_xlim(0.0, 1.0)  # full range so a score below 0.5 (worse than chance) stays visible
plt.tight_layout()
plt.show()
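
With only 20 outliers among 220 points, the area under the precision-recall curve can complement ROC AUC because it focuses on the minority class. The sketch below uses the `precision_recall_curve` and `auc` functions imported in section 4; the helper `pr_auc` and the toy labels and scores are made up for illustration.

```python
from sklearn.metrics import auc, precision_recall_curve

def pr_auc(y_true, scores):
    """Area under the precision-recall curve (trapezoidal rule)."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return auc(recall, precision)

# Toy data: higher score = more anomalous
y_demo = [0, 0, 0, 0, 1, 1]
perfect_scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
noisy_scores = [0.1, 0.9, 0.3, 0.4, 0.8, 0.2]

print(f"PR AUC (perfect ranking): {pr_auc(y_demo, perfect_scores):.3f}")
print(f"PR AUC (noisy ranking):   {pr_auc(y_demo, noisy_scores):.3f}")
```

In the notebook, the same helper could be applied to each entry of `methods` together with `y_true`.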

## Summary

In this notebook, we demonstrated:

- ✅ Using TabPFN as a classifier to score anomalies, trained on normal data plus synthetic outliers
- ✅ Comparing with traditional methods (Isolation Forest, LOF)
- ✅ Evaluating with the ROC AUC metric