# Time Series Forecasting with TabPFN on Databricks

This notebook demonstrates time series forecasting with TabPFN by framing it as a regression problem on lagged values.

**What you will learn:**
- How to prepare time series data with lag features
- How to use TabPFN for time series forecasting
- How to evaluate forecast accuracy

**Prerequisites:** Run the `00_data_preparation` notebook first.

## Compute Setup

We recommend running this notebook on **Serverless Compute** with the **Base Environment V4**.

## 1. Installation

In [None]:
%pip install tabpfn-client scikit-learn pandas matplotlib --quiet

In [None]:
dbutils.library.restartPython()

## 2. Authentication

In [None]:
import tabpfn_client

token = dbutils.secrets.get(scope="tabpfn-client", key="token")
tabpfn_client.set_access_token(token)

## 3. Configuration

In [None]:
CATALOG = "tabpfn_databricks"
SCHEMA = "default"

spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SCHEMA}")

## 4. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error
from tabpfn_client import TabPFNRegressor

## 5. Load Time Series Data

In [None]:
df_tourism = spark.table("tourism_monthly").toPandas()
df_tourism['timestamp'] = pd.to_datetime(df_tourism['timestamp'])

print(f"Total records: {len(df_tourism)}")
print(f"Number of time series: {df_tourism['item_id'].nunique()}")

display(df_tourism.head(10))

In [None]:
# Select one time series
series_id = "T000000"
df_series = df_tourism[df_tourism['item_id'] == series_id].sort_values('timestamp').reset_index(drop=True)
values = df_series['target'].values

print(f"Series length: {len(values)}")

In [None]:
# Visualize
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(df_series['timestamp'], values, 'b-', linewidth=1.5)
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.set_title(f'Tourism Monthly Data - Series {series_id}')
ax.grid(True, alpha=0.3)
plt.show()

## 6. Create Lag Features for Time Series

In [None]:
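def create_lag_features(series, n_lags=12):
    """Create lag features for time series forecasting.

    Each row of X holds the n_lags values that precede the matching target in y.
    """
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i-n_lags:i])  # window of the previous n_lags values
        y.append(series[i])           # value to predict
    return np.array(X), np.array(y)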

n_lags = 12  # Use last 12 months as features
X, y = create_lag_features(values, n_lags)

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
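
The windowing loop above can also be written with NumPy's `sliding_window_view` (available from NumPy 1.20). The sketch below shows an equivalent formulation on a toy series so the shape of `X` and `y` is easy to inspect. The name `create_lag_features_vectorized` is hypothetical and not used elsewhere in this notebook, and the function assumes the series has at least `n_lags` values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def create_lag_features_vectorized(series, n_lags=12):
    """Hypothetical vectorized variant of create_lag_features."""
    series = np.asarray(series)
    # Every window of n_lags consecutive values; the last window has no target after it
    windows = sliding_window_view(series, n_lags)
    X = windows[:-1].copy()  # copy() turns the read-only view into a regular array
    y = series[n_lags:]      # value immediately following each window
    return X, y


toy = np.arange(6)  # 0, 1, 2, 3, 4, 5
X_toy, y_toy = create_lag_features_vectorized(toy, n_lags=3)
print(X_toy)  # rows: [0 1 2], [1 2 3], [2 3 4]
print(y_toy)  # [3 4 5]
```

Each row of `X_toy` is a window of three consecutive values, and the matching entry of `y_toy` is the value that comes right after it.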

## 7. Train-Test Split

In [None]:
# Hold out the last 12 points (one year of monthly data) as the test set; the split is chronological
test_size = 12

X_train, X_test = X[:-test_size], X[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]

print(f"Training: {len(X_train)}, Test: {len(X_test)}")

## 8. Train TabPFN Regressor

In [None]:
reg = TabPFNRegressor()
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# MAPE divides by the actuals, so it is only meaningful when y_test contains no zeros
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print("Forecast Metrics:")
print(f"  MAE:  {mae:.2f}")
print(f"  RMSE: {rmse:.2f}")
print(f"  MAPE: {mape:.1f}%")
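
Error metrics are easier to interpret next to a baseline. With `n_lags = 12` on monthly data, the first column of each feature row is the value observed 12 months before the target, which is a seasonal-naive forecast. The sketch below illustrates the idea on toy data with a season length of 3. The helpers `seasonal_naive_from_lags` and `mape` are hypothetical names introduced for illustration, and the zero-masking in `mape` is an assumption about how undefined ratios should be handled.

```python
import numpy as np


def seasonal_naive_from_lags(X):
    """Hypothetical helper: forecast each target with the oldest lag in its row.

    Assumes the number of lags equals the season length (12 for monthly data).
    """
    return np.asarray(X)[:, 0]


def mape(y_true, y_pred):
    """MAPE in percent, skipping zero actuals where the ratio is undefined."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100


# Toy stand-in for X_test / y_test: a 3-step season that grows by 10% per cycle
X_demo = np.array([[100.0, 120.0, 90.0],
                   [120.0, 90.0, 110.0],
                   [90.0, 110.0, 132.0]])
y_demo = np.array([110.0, 132.0, 99.0])

baseline_pred = seasonal_naive_from_lags(X_demo)
print(baseline_pred)                          # [100. 120.  90.]
print(round(mape(y_demo, baseline_pred), 2))  # 9.09
```

In this notebook, `seasonal_naive_from_lags(X_test)` could be scored with the same metrics as `y_pred` to check whether the model improves on simply repeating last year's value.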

In [None]:
# Visualize forecasts
fig, ax = plt.subplots(figsize=(14, 6))

# Historical
ax.plot(range(len(y_train)), y_train, 'b-', linewidth=1.5, label='Training')

# Actual vs Predicted
test_idx = range(len(y_train), len(y_train) + len(y_test))
ax.plot(test_idx, y_test, 'g-', linewidth=2, marker='o', label='Actual')
ax.plot(test_idx, y_pred, 'r--', linewidth=2, marker='s', label='Forecast')

ax.set_xlabel('Time Index')
ax.set_ylabel('Value')
ax.set_title('TabPFN Time Series Forecast')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
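
Because every test row is built from observed values, the forecasts above are one-step-ahead predictions. Forecasting several months past the last observation requires feeding predictions back in as lags. Below is a minimal sketch of that recursive scheme, assuming any fitted regressor with a scikit-learn-style `predict` method. `recursive_forecast` and `MeanOfLags` are hypothetical names, and `MeanOfLags` is only a stand-in so the example runs without a TabPFN token.

```python
import numpy as np


class MeanOfLags:
    """Stand-in model (not TabPFN): predicts the mean of each lag window."""

    def predict(self, X):
        return np.asarray(X, dtype=float).mean(axis=1)


def recursive_forecast(model, last_window, horizon):
    """Forecast `horizon` steps ahead, feeding each prediction back in as a lag."""
    window = np.asarray(last_window, dtype=float).copy()
    forecasts = []
    for _ in range(horizon):
        next_value = float(model.predict(window.reshape(1, -1))[0])
        forecasts.append(next_value)
        # Drop the oldest lag and append the new prediction
        window = np.append(window[1:], next_value)
    return np.array(forecasts)


demo = recursive_forecast(MeanOfLags(), last_window=[2.0, 4.0, 6.0], horizon=3)
print(demo)
```

With the fitted `reg`, a call such as `recursive_forecast(reg, values[-n_lags:], horizon=12)` would extend the series 12 months past the last observation. Each step issues its own `predict` call, and errors can compound along the horizon, so accuracy measured one step ahead may not carry over to longer horizons.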

## 9. Forecast Multiple Series

In [None]:
series_ids = ["T000000", "T000001", "T000002", "T000003"]
results = []

for sid in series_ids:
    # Extract one series in chronological order
    df_s = df_tourism[df_tourism['item_id'] == sid].sort_values('timestamp')
    vals = df_s['target'].values

    # Same lag features and chronological split as for the single series
    X_s, y_s = create_lag_features(vals, n_lags)
    X_tr, X_te = X_s[:-test_size], X_s[-test_size:]
    y_tr, y_te = y_s[:-test_size], y_s[-test_size:]

    # Fit a fresh regressor per series
    model = TabPFNRegressor()
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)

    mae = mean_absolute_error(y_te, pred)
    mape = np.mean(np.abs((y_te - pred) / y_te)) * 100

    results.append({'Series': sid, 'MAE': mae, 'MAPE': mape})
    print(f"{sid}: MAE={mae:.2f}, MAPE={mape:.1f}%")

df_results = pd.DataFrame(results)
print(f"\nAverage MAPE: {df_results['MAPE'].mean():.1f}%")
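
The loop above assumes each series is long enough to leave training rows after the lag window and the test hold-out are taken out. A series of length `L` yields `L - n_lags` feature rows, of which `test_size` are held out. A small sketch of that arithmetic, where `n_training_rows` is a hypothetical helper:

```python
def n_training_rows(series_length, n_lags=12, test_size=12):
    """Rows left for training after building lag windows and holding out the test set."""
    return max(series_length - n_lags - test_size, 0)


print(n_training_rows(100))  # 76
print(n_training_rows(20))   # 0
```

Series for which this returns 0 could be skipped before calling `fit`.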

## Summary

In this notebook, we demonstrated:

- ✅ Loading time series data from Delta tables
- ✅ Creating lag features for forecasting
- ✅ Using TabPFN Regressor for time series prediction
- ✅ Evaluating forecast accuracy with MAE, RMSE, and MAPE
- ✅ Batch forecasting multiple series