# Transformer Forecasting for Time Series
Transformers replace recurrence with self-attention, letting the model compare every time step in a window at once. In forecasting, we typically feed a fixed lookback window and predict a multi-step horizon in one shot.


## Takeaways
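- Attention connects any pair of time steps directly, so long-range dependencies do not have to pass through a recurrent state; the price is compute and memory that grow quadratically with lookback length.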
- Positional encodings inject order information that attention alone lacks.
- Multi-horizon forecasting can be framed as a supervised window-to-horizon regression problem.


## Problem setup: windowed forecasting
Given a series $y_1, \dots, y_T$, we build input windows of length $L$ and predict a horizon of length $H$:

$$
X_t = [y_{t-L+1}, \dots, y_t], \quad \hat{y}_{t+1:t+H} = f(X_t).
$$

This notebook trains a Transformer encoder to map $X_t$ to a full horizon in a single forward pass.
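
As a quick sanity check of the window-to-horizon framing, here is a minimal sketch on a toy series. The names `toy`, `L_toy`, and `H_toy` are made up for illustration; the sketch assumes NumPy 1.20 or newer for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

toy = np.arange(10, dtype=float)  # stand-in for y_1..y_T with T = 10
L_toy, H_toy = 4, 2               # lookback and horizon (illustrative values)

# Each frame holds one lookback window followed by its horizon.
frames = sliding_window_view(toy, L_toy + H_toy)  # shape (T - L - H + 1, L + H)
X_toy, y_toy = frames[:, :L_toy], frames[:, L_toy:]

print(X_toy.shape, y_toy.shape)  # (5, 4) (5, 2)
print(X_toy[0], y_toy[0])        # [0. 1. 2. 3.] [4. 5.]
```

A series of length $T$ yields $T - L - H + 1$ input/target pairs, which is worth keeping in mind when the series is short relative to $L + H$.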


## Self-attention in one line
Scaled dot-product attention computes a weighted mix of value vectors $V$ using query-key similarity:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.
$$

Multi-head attention repeats this with several projections, letting the model attend to different temporal patterns simultaneously.
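
To make the formula concrete, here is a minimal single-head sketch in NumPy. It is illustrative only: the function name and the random toy inputs are assumptions, and the model later in the notebook relies on PyTorch's built-in layers rather than this code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention. Q, K: (L, d_k); V: (L, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (L, L) query-key similarity
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

attn_rng = np.random.default_rng(0)
Q = attn_rng.normal(size=(5, 4))  # 5 time steps, d_k = 4
K = attn_rng.normal(size=(5, 4))
V = attn_rng.normal(size=(5, 4))

out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (5, 4) (5, 5)
```

Each row of `attn` sums to 1, so every output step is a convex combination of the value vectors.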


## Positional encoding
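Self-attention on its own is permutation-equivariant: shuffling the input steps only shuffles the outputs, so the model has no notion of order. We therefore add position signals. A common sinusoidal encoding is: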

$$
\mathrm{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad
\mathrm{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right).
$$

These features are added to input embeddings so the model can learn order-aware patterns.
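
The need for position signals can be checked numerically. The sketch below uses a simplified self-attention with identity projections ($Q = K = V$), which is an assumption made for brevity and not the layer used later. It compares outputs before and after reversing the time order.

```python
import numpy as np

def self_attention(tokens):
    """Single-head self-attention with identity projections (Q = K = V = tokens)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ tokens

def sinusoidal_pe(length, d):
    """Sinusoidal encoding from the formula above; d is assumed to be even."""
    pos = np.arange(length)[:, None]
    two_i = np.arange(0, d, 2)[None, :]          # the even indices 2i
    angles = pos / (10000.0 ** (two_i / d))
    pe = np.zeros((length, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

tokens = np.random.default_rng(0).normal(size=(6, 8))
perm = np.arange(6)[::-1]  # reverse the time order

# Without positions: permuting the inputs only permutes the outputs.
no_pe_equivariant = np.allclose(self_attention(tokens)[perm], self_attention(tokens[perm]))

# With positions added after the shuffle, each output depends on where a step sits.
pe_demo = sinusoidal_pe(6, 8)
with_pe_equivariant = np.allclose(
    self_attention(tokens + pe_demo)[perm], self_attention(tokens[perm] + pe_demo)
)
print(no_pe_equivariant, with_pe_equivariant)
```

The first check holds exactly in theory; the second is expected to fail because the encoding ties each step to its position.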


## When to use
- Long lookbacks or multiple seasonalities where attention can connect distant time steps.
- Multivariate or multi-series settings where shared patterns matter.
- You have more data and compute available than classical baselines require.


In [None]:
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

SEED = 7
rng = np.random.default_rng(SEED)
torch.manual_seed(SEED)

FAST_RUN = True
LOOKBACK = 96
HORIZON = 24


In [None]:
# Synthetic series: trend + daily + weekly seasonality
n = 720
t = np.arange(n, dtype=float)
trend = 0.02 * t
daily = 1.5 * np.sin(2 * np.pi * t / 24)
weekly = 0.6 * np.sin(2 * np.pi * t / 168)
noise = rng.normal(0, 0.3, n)
series = trend + daily + weekly + noise

fig = go.Figure()
fig.add_trace(go.Scatter(x=t, y=series, mode="lines", name="series"))
fig.update_layout(title="Synthetic time series", xaxis_title="t", yaxis_title="value")
fig


## Windowing and scaling
We turn the series into a supervised dataset of sliding windows and normalize with the training statistics.


In [None]:
def make_windows(values: np.ndarray, lookback: int, horizon: int):
    X, y = [], []
    for i in range(lookback, len(values) - horizon + 1):
        X.append(values[i - lookback : i])
        y.append(values[i : i + horizon])
    return np.array(X), np.array(y)

X, y = make_windows(series, LOOKBACK, HORIZON)

n_total = len(X)
train_end = int(0.7 * n_total)
val_end = int(0.85 * n_total)

X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]

# Scaling statistics come from the training inputs only, so no val/test information leaks in.
mu = X_train.mean()
sigma = X_train.std()

X_train_s = (X_train - mu) / sigma
y_train_s = (y_train - mu) / sigma
X_val_s = (X_val - mu) / sigma
y_val_s = (y_val - mu) / sigma
X_test_s = (X_test - mu) / sigma
y_test_s = (y_test - mu) / sigma

print("windows:", X.shape, "train/val/test:", X_train.shape, X_val.shape, X_test.shape)


In [None]:
# Seasonal naive baseline using last daily season
SEASON = 24

def seasonal_naive(window: np.ndarray, horizon: int, season: int = 24) -> np.ndarray:
    if season <= 0:
        return np.repeat(window[-1], horizon)
    reps = int(np.ceil(horizon / season))
    pattern = np.tile(window[-season:], reps)
    return pattern[:horizon]

baseline_preds = np.vstack([
    seasonal_naive(x, HORIZON, SEASON) for x in X_test
])

mae_baseline = np.mean(np.abs(y_test - baseline_preds))
rmse_baseline = np.sqrt(np.mean((y_test - baseline_preds) ** 2))

print(f"baseline MAE: {mae_baseline:.3f} | baseline RMSE: {rmse_baseline:.3f}")


## Transformer forecaster (PyTorch)
We use an encoder-only Transformer. The last token representation is mapped to the forecast horizon with a linear head.


In [None]:
class WindowDataset(Dataset):
    def __init__(self, X: np.ndarray, y: np.ndarray):
        self.X = torch.tensor(X, dtype=torch.float32).unsqueeze(-1)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 1000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32) * (-np.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pe[:, : x.size(1)]


class TransformerForecaster(nn.Module):
    def __init__(
        self,
        input_size: int,
        horizon: int,
        d_model: int = 64,
        nhead: int = 4,
        num_layers: int = 2,
        dim_ff: int = 128,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.input_proj = nn.Linear(input_size, d_model)
        self.positional = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_ff,
            dropout=dropout,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.input_proj(x)
        z = self.positional(z)
        z = self.encoder(z)
        last = z[:, -1, :]
        return self.head(last)


In [None]:
# Visualize a small positional encoding matrix
pe = PositionalEncoding(32, max_len=48).pe.squeeze(0).numpy()
fig = px.imshow(
    pe,
    aspect="auto",
    color_continuous_scale="RdBu",
    title="Sinusoidal positional encoding (48 positions x 32 dims)",
)
fig.update_layout(xaxis_title="embedding dim", yaxis_title="position")
fig


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

train_ds = WindowDataset(X_train_s, y_train_s)
val_ds = WindowDataset(X_val_s, y_val_s)

test_ds = WindowDataset(X_test_s, y_test_s)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64)

torch.manual_seed(SEED)

model = TransformerForecaster(input_size=1, horizon=HORIZON).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

EPOCHS = 15 if FAST_RUN else 60


In [None]:
def run_epoch(model, loader, optimizer=None, train: bool = True):
    model.train(train)
    total_loss = 0.0
    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        if train:
            optimizer.zero_grad()
        with torch.set_grad_enabled(train):
            preds = model(X_batch)
            loss = criterion(preds, y_batch)
        if train:
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
        total_loss += loss.item() * X_batch.size(0)
    return total_loss / len(loader.dataset)

history = {"train": [], "val": []}
for epoch in range(1, EPOCHS + 1):
    train_loss = run_epoch(model, train_loader, optimizer=optimizer, train=True)
    val_loss = run_epoch(model, val_loader, train=False)
    history["train"].append(train_loss)
    history["val"].append(val_loss)
    if epoch == 1 or epoch % 5 == 0 or epoch == EPOCHS:
        print(f"epoch {epoch:02d} | train {train_loss:.4f} | val {val_loss:.4f}")


In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(y=history["train"], mode="lines+markers", name="train"))
fig.add_trace(go.Scatter(y=history["val"], mode="lines+markers", name="val"))
fig.update_layout(title="Training curves", xaxis_title="epoch", yaxis_title="MSE")
fig


In [None]:
model.eval()
with torch.no_grad():
    preds_scaled = []
    for X_batch, _ in DataLoader(test_ds, batch_size=128):
        X_batch = X_batch.to(device)
        preds_scaled.append(model(X_batch).cpu().numpy())

preds_scaled = np.vstack(preds_scaled)
preds = preds_scaled * sigma + mu

mae = np.mean(np.abs(y_test - preds))
rmse = np.sqrt(np.mean((y_test - preds) ** 2))

print(f"transformer MAE: {mae:.3f} | transformer RMSE: {rmse:.3f}")


In [None]:
# Plot a single forecast example
idx = 0
history_vals = X_test[idx]
truth = y_test[idx]
forecast = preds[idx]

hist_x = np.arange(len(history_vals))
future_x = np.arange(len(history_vals), len(history_vals) + HORIZON)

fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_x, y=history_vals, mode="lines", name="history"))
fig.add_trace(go.Scatter(x=future_x, y=truth, mode="lines", name="actual"))
fig.add_trace(go.Scatter(x=future_x, y=forecast, mode="lines", name="forecast"))
fig.update_layout(title="Transformer forecast", xaxis_title="t", yaxis_title="value")
fig


## Practical tips
- Scale inputs and targets (z-score or min-max) to stabilize training.
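- Use a lookback that covers at least one full cycle of the longest seasonality you care about; attention memory grows quadratically with lookback length, so reduce the batch size if needed.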
- Regularize with dropout and early stopping to avoid overfitting.
- Compare against seasonal naive or ETS baselines before deploying.
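
The training loop in this notebook runs for a fixed number of epochs. One possible way to add the early stopping mentioned above is a small patience counter. The `EarlyStopping` class below is a hypothetical helper, not part of PyTorch, and the validation losses are made-up numbers used only to show the behavior.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, val_loss: float, epoch: int) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy validation curve (made-up numbers, for illustration only).
toy_val_losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(toy_val_losses, start=1):
    if stopper.step(loss, epoch):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)  # 3 6
```

In the real loop, one option is to call `stopper.step(val_loss, epoch)` after each epoch and keep a copy of `model.state_dict()` whenever the best loss improves, then restore that copy before evaluating on the test set.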


## Exercises
1. Replace the single-token readout with mean pooling and compare accuracy.
2. Add exogenous features (calendar flags, promotions) as extra input channels.
3. Try a longer horizon and evaluate how error grows with forecast step.
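
As a possible starting point for exercise 3, the error can be broken down by forecast step instead of averaged over the whole horizon. The helper name `per_step_mae` and the toy arrays are illustrative assumptions; in the notebook you would pass `y_test` and `preds` instead.

```python
import numpy as np

def per_step_mae(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """MAE for each forecast step; both inputs have shape (n_windows, horizon)."""
    return np.mean(np.abs(y_true - y_pred), axis=0)

# Toy example: errors that grow with the forecast step.
toy_true = np.zeros((3, 4))
toy_pred = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [-0.1, -0.2, -0.3, -0.4],
    [0.1, 0.2, 0.3, 0.4],
])
step_errors = per_step_mae(toy_true, toy_pred)
print(step_errors)
```

Plotting the resulting vector against the step index shows how quickly accuracy degrades further into the horizon.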


## Further reading
- Vaswani et al., *Attention Is All You Need* (2017).
- Lim et al., *Temporal Fusion Transformers* (2019).
- Wu et al., *Autoformer: Decomposition Transformers with Auto-Correlation* (2021).
