# Production Anomaly Detection with TabPFN

This notebook demonstrates outlier detection techniques for production planning in retail/CPG manufacturing.

**Use Case:** Production Planning - Detect unusual scrap/defect patterns

**Business Context:** Manufacturing operations generate vast amounts of production data. Detecting anomalous production runs early helps:
- Identify equipment issues before major failures
- Catch quality problems early in the production process
- Reduce scrap costs and improve overall equipment effectiveness (OEE)
- Maintain product quality and customer satisfaction

**What you will learn:**
- How to use TabPFN for anomaly scoring via semi-supervised classification
- How to evaluate and visualize anomaly scores
- How to identify anomalous production runs and understand key indicators

**Prerequisites:** Run the `00_data_preparation` notebook first to create the datasets.

## Compute Setup

We recommend running this notebook on **Serverless Compute** with the **Base Environment V4**.

## 1. Installation

In [None]:
%pip install tabpfn-client scikit-learn pandas matplotlib seaborn mlflow --quiet

In [None]:
dbutils.library.restartPython()

## 2. Authentication

In [None]:
import tabpfn_client

token = dbutils.secrets.get(scope="tabpfn-client", key="token")
tabpfn_client.set_access_token(token)

## 3. Configuration

In [None]:
CATALOG = "tabpfn_databricks"
SCHEMA = "default"

# MLflow experiment configuration (shared across all TabPFN notebooks)
# Default uses user namespace, but can be customized
current_user = spark.sql("SELECT current_user()").collect()[0][0]
MLFLOW_EXPERIMENT_NAME = f"/Users/{current_user}/tabpfn-databricks"

spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SCHEMA}")

## 4. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, confusion_matrix
import mlflow

from tabpfn_client import TabPFNClassifier

# Set MLflow experiment
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)
print(f"MLflow experiment set to: {MLFLOW_EXPERIMENT_NAME}")

## 5. Load Scrap Anomaly Data

The scrap anomaly dataset contains production metrics that can indicate abnormal production runs, including:
- **Scrap rate**: Percentage of units scrapped
- **Defect count**: Number of defects detected
- **Rework hours**: Time spent on rework
- **Equipment vibration**: Sensor reading for equipment health
- **Process temperature deviation**: Deviation from optimal temperature
- **Quality score**: Overall quality assessment

Anomalies may indicate:
- Equipment malfunction
- Material quality issues
- Process parameter drift
- Operator errors

In [None]:
# Load the Scrap Anomaly training dataset from Delta table
df_scrap = spark.table("scrap_anomaly_train").toPandas()

# Extract labels and features
y_true = df_scrap['is_anomaly'].values

# Select numeric features for anomaly detection
numeric_features = [
    'scrap_rate_pct', 'defect_count', 'rework_hours', 'equipment_vibration',
    'process_temperature_deviation', 'material_waste_pct', 'cycle_time_variance',
    'operator_interventions', 'quality_score', 'downtime_minutes'
]

print(f"Dataset shape: {df_scrap.shape}")
print(f"\nAnomaly distribution:")
print(f"  Normal (0): {(y_true == 0).sum()}")
print(f"  Anomaly (1): {(y_true == 1).sum()}")
print(f"  Anomaly rate: {y_true.mean():.1%}")

display(df_scrap.head())

In [None]:
# Visualize distributions for key features
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

features_to_plot = ['scrap_rate_pct', 'defect_count', 'equipment_vibration', 
                    'quality_score', 'rework_hours', 'downtime_minutes']

for ax, feature in zip(axes.flatten(), features_to_plot):
    # Plot distribution by anomaly status
    df_scrap[df_scrap['is_anomaly'] == 0][feature].hist(ax=ax, bins=30, alpha=0.5, 
                                                        label='Normal', color='blue')
    df_scrap[df_scrap['is_anomaly'] == 1][feature].hist(ax=ax, bins=30, alpha=0.5, 
                                                        label='Anomaly', color='red')
    ax.set_title(feature)
    ax.legend()

plt.tight_layout()
plt.show()

## 6. TabPFN-based Anomaly Detection

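We use TabPFN as a binary classifier in the setup this notebook calls semi-supervised. Because labels are available for both classes, the classifier is fit directly on labeled runs:
1. Fit on a random 70% split that contains both labeled normal runs and labeled anomalies
2. Score the held-out 30% of production runs
3. Read the predicted probability of the anomaly class as the anomaly score: higher = more likely to be abnormal

Unlike unsupervised outlier detectors, this approach requires labeled anomalies in the training split.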

In [None]:
# Prepare feature matrix
X = df_scrap[numeric_features].values

# Create a semi-supervised training set
# Use 70% of the data for "training" (fitting the anomaly detector)
np.random.seed(42)
n_total = len(X)
n_train = int(0.7 * n_total)

# Shuffle indices
shuffled_idx = np.random.permutation(n_total)
train_idx = shuffled_idx[:n_train]
test_idx = shuffled_idx[n_train:]

X_train = X[train_idx]
y_train = y_true[train_idx]
X_test = X[test_idx]
y_test = y_true[test_idx]

print(f"Training set: {len(X_train)} samples ({y_train.mean():.1%} anomalies)")
print(f"Test set: {len(X_test)} samples ({y_test.mean():.1%} anomalies)")
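
With a low anomaly rate, a plain random split can leave the training and test sets with noticeably different anomaly proportions. The sketch below shows one way to split each class separately so both sets keep roughly the same rate. The helper `stratified_split_indices` and the synthetic labels are illustrative assumptions, not part of the original notebook; scikit-learn's `train_test_split` with its `stratify` argument is an alternative.

```python
import numpy as np

def stratified_split_indices(y, train_ratio=0.7, seed=42):
    """Return (train_idx, test_idx) with each class split at roughly train_ratio."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        # Shuffle the row positions of this class, then cut at train_ratio
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(train_ratio * len(idx)))
        train_parts.append(idx[:n_train])
        test_parts.append(idx[n_train:])
    # Shuffle again so the classes are interleaved within each split
    train_idx = rng.permutation(np.concatenate(train_parts))
    test_idx = rng.permutation(np.concatenate(test_parts))
    return train_idx, test_idx

# Synthetic labels for illustration only: 950 normal runs, 50 anomalies
y_demo = np.array([0] * 950 + [1] * 50)
train_idx_demo, test_idx_demo = stratified_split_indices(y_demo)

print(f"Train anomaly rate: {y_demo[train_idx_demo].mean():.1%}")
print(f"Test anomaly rate:  {y_demo[test_idx_demo].mean():.1%}")
```

In the notebook itself, the same helper could be called with `y_true` in place of the synthetic labels.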

In [None]:
# Train TabPFN classifier for anomaly detection with MLflow logging
with mlflow.start_run(run_name="scrap_anomaly_tabpfn"):
    # Log parameters
    mlflow.log_param("model_type", "TabPFNClassifier")
    mlflow.log_param("task", "scrap_anomaly_detection")
    mlflow.log_param("approach", "semi_supervised")
    mlflow.log_param("train_ratio", 0.7)
    mlflow.log_param("n_features", X_train.shape[1])
    mlflow.log_param("train_samples", X_train.shape[0])
    mlflow.log_param("test_samples", X_test.shape[0])
    mlflow.log_param("train_anomaly_rate", y_train.mean())
    mlflow.log_param("test_anomaly_rate", y_test.mean())
    
    clf = TabPFNClassifier()
    clf.fit(X_train, y_train)

    # Score test set - probability of being an anomaly
    anomaly_scores_tabpfn = clf.predict_proba(X_test)[:, 1]

    # Evaluate
    roc_auc_tabpfn = roc_auc_score(y_test, anomaly_scores_tabpfn)
    
    # Calculate PR AUC
    precision, recall, _ = precision_recall_curve(y_test, anomaly_scores_tabpfn)
    pr_auc_tabpfn = auc(recall, precision)
    
    # Log metrics
    mlflow.log_metric("roc_auc", roc_auc_tabpfn)
    mlflow.log_metric("pr_auc", pr_auc_tabpfn)
    
    print(f"TabPFN ROC AUC: {roc_auc_tabpfn:.4f}")
    print(f"TabPFN PR AUC: {pr_auc_tabpfn:.4f}")
    print(f"MLflow Run ID: {mlflow.active_run().info.run_id}")
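
The PR AUC above is computed with `auc(recall, precision)`, which interpolates linearly between points on the curve. A common step-wise alternative is average precision (available in scikit-learn as `average_precision_score`). Either number is best read against the anomaly rate, since a scorer that ranks at random has an expected precision roughly equal to the share of anomalies. The sketch below works through average precision on a tiny hand-made example; the helper name and the toy data are assumptions for illustration, and the helper ignores tied scores.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a true anomaly appears.

    Assumes at least one positive label and no tied scores.
    """
    y_true = np.asarray(y_true)
    # Rank runs from highest to lowest score
    order = np.argsort(-np.asarray(scores), kind="stable")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits == 1].mean())

# Toy example: anomalies sit at ranks 1 and 3
y_toy = np.array([1, 0, 1, 0, 0])
scores_toy = np.array([0.9, 0.8, 0.7, 0.2, 0.1])

ap_toy = average_precision(y_toy, scores_toy)
baseline_toy = y_toy.mean()  # expected precision of a random ranking

print(f"Average precision: {ap_toy:.3f}")
print(f"Random baseline (anomaly rate): {baseline_toy:.3f}")
```

Here precision is 1/1 at the first anomaly and 2/3 at the second, so the average precision is (1 + 2/3) / 2.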

In [None]:
# Visualize anomaly score distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Score distribution by actual label
axes[0].hist(anomaly_scores_tabpfn[y_test == 0], bins=30, alpha=0.5, 
             label='Normal', color='blue', density=True)
axes[0].hist(anomaly_scores_tabpfn[y_test == 1], bins=30, alpha=0.5, 
             label='Anomaly', color='red', density=True)
axes[0].set_xlabel('Anomaly Score (Probability)')
axes[0].set_ylabel('Density')
axes[0].set_title('TabPFN Anomaly Score Distribution')
axes[0].legend()

# Precision-Recall curve
precision, recall, thresholds = precision_recall_curve(y_test, anomaly_scores_tabpfn)
pr_auc = auc(recall, precision)
axes[1].plot(recall, precision, 'b-', linewidth=2)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title(f'Precision-Recall Curve (AUC = {pr_auc:.4f})')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Identifying Anomalous Production Runs

Let's identify the most anomalous production runs and understand what makes them unusual.

In [None]:
# Add anomaly scores to test data
df_test = df_scrap.iloc[test_idx].copy().reset_index(drop=True)
df_test['anomaly_score'] = anomaly_scores_tabpfn
df_test['predicted_anomaly'] = (anomaly_scores_tabpfn > 0.5).astype(int)

# Show top 10 most anomalous production runs
print("Top 10 Most Anomalous Production Runs:")
top_anomalies = df_test.nlargest(10, 'anomaly_score')[[
    'batch_number', 'production_line', 'shift', 'scrap_rate_pct', 
    'defect_count', 'equipment_vibration', 'quality_score', 
    'anomaly_score', 'is_anomaly'
]]
display(top_anomalies)

In [None]:
# Confusion matrix at 0.5 threshold
y_pred = (anomaly_scores_tabpfn > 0.5).astype(int)
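# labels=[0, 1] keeps the matrix 2x2 even if one class is absent from y_test and y_pred
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])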

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(cm, interpolation='nearest', cmap='Blues')
ax.figure.colorbar(im, ax=ax)

classes = ['Normal', 'Anomaly']
ax.set(xticks=[0, 1], yticks=[0, 1],
       xticklabels=classes, yticklabels=classes,
       title='Anomaly Detection Confusion Matrix',
       ylabel='Actual', xlabel='Predicted')

# Add text annotations
thresh = cm.max() / 2.
for i in range(2):
    for j in range(2):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()
plt.show()

# Calculate detection metrics
tn, fp, fn, tp = cm.ravel()
precision_val = tp / (tp + fp) if (tp + fp) > 0 else 0
recall_val = tp / (tp + fn) if (tp + fn) > 0 else 0
f1_val = 2 * precision_val * recall_val / (precision_val + recall_val) if (precision_val + recall_val) > 0 else 0

print(f"\nDetection Metrics (threshold=0.5):")
print(f"  Precision: {precision_val:.3f}")
print(f"  Recall: {recall_val:.3f}")
print(f"  F1 Score: {f1_val:.3f}")
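
The 0.5 cutoff is a default rather than a tuned choice. As a rough sketch of how a cutoff could be chosen instead, the helper below scans every distinct score and keeps the one with the highest F1. The function name and the toy data are illustrative assumptions, and it uses `>=` where the cell above uses `>`. In practice the cutoff would be selected on a separate validation split, not on the test set used for reporting.

```python
import numpy as np

def best_f1_threshold(y_true, scores):
    """Try each distinct score as a cutoff; return (threshold, f1) with the highest F1."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Toy scores for illustration only: two of three anomalies score below 0.5
y_val_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores_val_toy = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.90])

best_t, best_f1 = best_f1_threshold(y_val_toy, scores_val_toy)
print(f"Best threshold: {best_t:.2f} (F1 = {best_f1:.3f})")
```

F1 weights precision and recall equally. If missed anomalies cost more than false alarms, a criterion such as "highest precision subject to a minimum recall" may fit better.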

In [None]:
# Descriptive feature comparison (not a model-derived feature importance):
# standardized mean difference between anomalous and normal runs
df_normal = df_scrap[df_scrap['is_anomaly'] == 0][numeric_features]
df_anomaly = df_scrap[df_scrap['is_anomaly'] == 1][numeric_features]

# Standardized difference
mean_diff = (df_anomaly.mean() - df_normal.mean()) / df_normal.std()
mean_diff = mean_diff.sort_values(key=abs, ascending=True)

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if v > 0 else '#3498db' for v in mean_diff.values]
mean_diff.plot(kind='barh', ax=ax, color=colors)
ax.set_xlabel('Standardized Difference (Anomaly - Normal)')
ax.set_title('Feature Differences: Anomalies vs Normal Production Runs')
ax.axvline(x=0, color='black', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

print("\nKey Anomaly Indicators (highest absolute difference):")
top_indicators = mean_diff.abs().nlargest(5)
for feature in top_indicators.index:
    direction = "higher" if mean_diff[feature] > 0 else "lower"
    print(f"  - {feature}: Anomalies have {direction} values")
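
The standardized differences above describe the data, not the fitted model. One model-based complement is permutation importance: shuffle one feature at a time and measure how much the ROC AUC drops. The sketch below is a minimal version under stated assumptions: `roc_auc`, `permutation_importance`, and `stand_in_scorer` are hypothetical helpers, and the data is synthetic, so the example runs without a fitted model.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random anomaly outranks a random normal run."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every anomaly score with every normal score; ties count as half
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def permutation_importance(score_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in ROC AUC when each column is shuffled independently."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc(y, score_fn(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - roc_auc(y, score_fn(X_perm))
    return drops / n_repeats

def stand_in_scorer(X):
    """Hypothetical scorer that only looks at column 0; replace with real model scores."""
    return X[:, 0]

# Synthetic data for illustration: column 0 drives the label, column 1 is noise
rng_imp = np.random.default_rng(0)
X_imp = rng_imp.normal(size=(400, 2))
y_imp = (X_imp[:, 0] > 1.0).astype(int)

importances = permutation_importance(stand_in_scorer, X_imp, y_imp)
for j, value in enumerate(importances):
    print(f"  column {j}: mean ROC AUC drop = {value:.3f}")
```

To apply this to the notebook's model, the scorer could wrap `clf.predict_proba(X)[:, 1]`. That requires one prediction call per feature per repeat, which may be slow or costly with a remote client.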

## Summary

In this notebook, we demonstrated:

- **TabPFN for Anomaly Detection** - Semi-supervised approach using classification
- **Production Anomaly Identification** - Finding unusual production runs
- **Feature Analysis** - Understanding what makes runs anomalous

**Key Takeaways:**
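1. With labeled anomalies available, TabPFN can be fit as a binary classifier and its anomaly-class probability used as the anomaly score
2. Probability scores enable threshold tuning for different business objectives; the 0.5 cutoff used here is only a default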

**Next Steps:**
- Run the `04_time_series_forecasting` notebook for demand forecasting
- Integrate real-time anomaly scoring into production monitoring systems
- Develop automated alerting workflows for high-risk production runs