Merged
15 changes: 15 additions & 0 deletions PRPs/INITIAL/INITIAL-MLZOO-C-xgboost-prophet-extensions.md
@@ -1,5 +1,20 @@
# INITIAL-MLZOO-C-xgboost-prophet-extensions.md - XGBoost and Prophet-like Extensions

> **This brief is split into TWO PRPs — two branches, two review units. Never one.**
> This INITIAL is the shared brief for both, but the two models are delivered separately:
>
> - **`PRPs/PRP-MLZOO-C1-xgboost-model.md`** — the XGBoost half. A low-risk follow-up that
> mirrors the merged `LightGBMForecaster` design (optional `ml-xgboost` extra, feature
> flag, lazy import, deterministic training, registry metadata).
> - **`PRPs/PRP-MLZOO-C2-prophet-like-additive-model.md`** — the Prophet-like half. A
> distinct model-family design task — a pure-scikit-learn additive linear model with
> trend / seasonality / holiday-regressor decomposition; **not** a clone of the tree
> models and **not** the real `prophet` dependency.
>
> Do not combine the two models into a single PRP or a single branch. The "Out of scope"
> lists below still apply to *each* PRP individually (e.g. C1 does not touch Prophet-like
> work; C2 does not touch XGBoost). See `INITIAL-MLZOO-index.md` for the updated roadmap.

## FEATURE:

Extend the Advanced ML Model Zoo after the feature-frame foundation and first advanced model path are stable.
16 changes: 13 additions & 3 deletions PRPs/INITIAL/INITIAL-MLZOO-index.md
@@ -18,16 +18,24 @@ Recommended PRP sequence:
| 1 | `INITIAL-MLZOO-A-foundation-feature-frames.md` | PRP-29 | Feature-aware forecasting foundation and leakage-safe frame contracts |
| 2 | `INITIAL-MLZOO-B-lightgbm-first-model.md` | PRP-30 | First advanced model path with LightGBM (optional `ml-lightgbm` extra) |
| 2.5 | `INITIAL-MLZOO-B.2-feature-aware-backtesting.md` | PRP-MLZOO-B.2 | Wire feature-aware models into the backtesting fold loop (per-fold leakage-safe `X_train` / `X_future`) |
| 3 | `INITIAL-MLZOO-C-xgboost-prophet-extensions.md` | Future PRP | XGBoost and Prophet-like extensions |
| 3a | `INITIAL-MLZOO-C-xgboost-prophet-extensions.md` (XGBoost half) | PRP-MLZOO-C1 | XGBoost feature-aware model — a low-risk follow-up mirroring the merged LightGBM design (optional `ml-xgboost` extra) |
| 3b | `INITIAL-MLZOO-C-xgboost-prophet-extensions.md` (Prophet-like half) | PRP-MLZOO-C2 | Prophet-like additive model — a distinct model-family design (pure scikit-learn; trend / seasonality / regressor decomposition) |
| 4 | `INITIAL-MLZOO-D-frontend-registry-explainability.md` | Future PRP | UI, registry surfacing, and explanation polish |

**C is two PRPs, not one.** `INITIAL-MLZOO-C` briefs both XGBoost and a Prophet-like model,
but they are deliberately split into **two separate PRPs, branches, and review units** —
`PRP-MLZOO-C1` (XGBoost) and `PRP-MLZOO-C2` (Prophet-like). They are additive and
order-independent; whichever merges second rebases cleanly. Do **not** combine them into a
single branch or a single review unit (this honours the "one reviewable unit" rule below).

Dependency graph:

```text
A. Foundation feature frames
-> B. LightGBM first model
-> B.2 Feature-aware backtesting
-> C. XGBoost / Prophet-like extensions
-> C1. XGBoost model (separate review unit)
-> C2. Prophet-like model (separate review unit; parallel to C1)
-> D. Frontend / registry / explainability
```

@@ -74,5 +82,7 @@ Read these before creating any MLZOO PRP:
- Do not implement LightGBM before the feature-frame contracts and leakage tests are stable.
- Do not implement XGBoost or Prophet-like models before the first advanced model path proves the architecture.
- Do not add frontend/explainability scope before backend metadata and persistence contracts are stable.
- Keep each PRP to one branch and one reviewable unit.
- Keep each PRP to one branch and one reviewable unit. In particular, `INITIAL-MLZOO-C`'s
two models (XGBoost, Prophet-like) are **two PRPs** — `PRP-MLZOO-C1` and `PRP-MLZOO-C2` —
never one combined branch.

979 changes: 979 additions & 0 deletions PRPs/PRP-MLZOO-C1-xgboost-model.md

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions README.md
@@ -47,8 +47,9 @@ docker-compose up -d
```bash
uv sync --extra dev
# or: pip install -e ".[dev]"
# LightGBM is an opt-in advanced model — add the extra to enable it:
# LightGBM and XGBoost are opt-in advanced models — add the extra to enable each:
# uv sync --extra dev --extra ml-lightgbm (then set forecast_enable_lightgbm=true)
# uv sync --extra dev --extra ml-xgboost (then set forecast_enable_xgboost=true)
```

4. **Run database migrations**
@@ -342,6 +343,7 @@ curl -X POST http://localhost:8123/forecasting/predict \
- `moving_average` - Mean of last N observations
- `regression` - Gradient-boosted exogenous-feature regressor (feature-aware)
- `lightgbm` - LightGBM feature-aware regressor — opt-in: install the `ml-lightgbm` extra and set `forecast_enable_lightgbm=True`
- `xgboost` - XGBoost feature-aware regressor — opt-in: install the `ml-xgboost` extra and set `forecast_enable_xgboost=True`

See [examples/models/](examples/models/) for baseline model examples.

@@ -394,7 +396,7 @@ curl -X POST http://localhost:8123/backtesting/run \
When `include_baselines=true`, automatically compares against naive and seasonal_naive models.

**Feature-Aware Models:**
`regression` and `lightgbm` models can be backtested too — set
`regression`, `lightgbm`, and `xgboost` models can be backtested too — set
`model_config_main.model_type` accordingly. Each fold builds a leakage-safe
per-fold feature matrix (`min_train_size >= 30` required); the result carries
`feature_aware: true` and `exogenous_policy: "observed"`.
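
As a rough illustration of the per-fold rule above, here is a minimal sketch of an expanding-window splitter that enforces a minimum training size and builds a lag column from the training slice only. The names `expanding_folds` and `lag_column` are hypothetical and are not the repository's `TimeSeriesSplitter` or feature-frame API; the fold arithmetic is an assumption chosen for the example.

```python
import math
from collections.abc import Iterator


def expanding_folds(
    n_obs: int, n_splits: int, horizon: int, min_train_size: int = 30
) -> Iterator[tuple[range, range]]:
    """Yield (train_idx, test_idx) index ranges for an expanding-window backtest."""
    first_cutoff = n_obs - n_splits * horizon
    if first_cutoff < min_train_size:
        raise ValueError(
            f"first fold would train on {first_cutoff} rows; need >= {min_train_size}"
        )
    for k in range(n_splits):
        cutoff = first_cutoff + k * horizon
        # Training rows always end strictly before the test rows begin.
        yield range(cutoff), range(cutoff, cutoff + horizon)


def lag_column(values: list[float], lag: int) -> list[float]:
    """Row t holds values[t - lag]; the first `lag` rows are NaN."""
    if lag < 1:
        raise ValueError("lag must be >= 1")
    n_missing = min(lag, len(values))
    return [math.nan] * n_missing + values[: len(values) - n_missing]


y = [float(i) for i in range(120)]
folds = list(expanding_folds(len(y), n_splits=3, horizon=14))
fold_sizes = [(len(train), len(test)) for train, test in folds]
# fold_sizes == [(78, 14), (92, 14), (106, 14)]

for train, test in folds:
    # The lag feature is computed from the training slice alone, so no
    # value from the test window can leak into the training matrix.
    train_lag7 = lag_column([y[i] for i in train], lag=7)
    assert len(train_lag7) == len(train)
    assert train[-1] < test[0]
```

Rebuilding the features inside each fold costs more than building them once, but it keeps every training matrix blind to its own test window.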
1 change: 1 addition & 0 deletions app/core/config.py
@@ -99,6 +99,7 @@ class Settings(BaseSettings):
    forecast_max_horizon: int = 90
    forecast_model_artifacts_dir: str = "./artifacts/models"
    forecast_enable_lightgbm: bool = False
    forecast_enable_xgboost: bool = False

    # Backtesting
    backtest_max_splits: int = 20
44 changes: 43 additions & 1 deletion app/features/backtesting/tests/test_feature_aware_backtest.py
@@ -25,7 +25,11 @@
SeriesData,
)
from app.features.backtesting.splitter import TimeSeriesSplitter
from app.features.forecasting.schemas import NaiveModelConfig, RegressionModelConfig
from app.features.forecasting.schemas import (
NaiveModelConfig,
RegressionModelConfig,
XGBoostModelConfig,
)
from app.shared.feature_frames import canonical_feature_columns

_N_FEATURES = len(canonical_feature_columns()) # 14 — 4 lags + 6 calendar + 4 exogenous
@@ -135,6 +139,44 @@ def test_feature_aware_backtest_produces_per_fold_metrics(
        assert "mae" in fold.metrics


def test_feature_aware_backtest_runs_with_xgboost_model(
    sample_dates_120: list[date],
    sample_values_120: np.ndarray,
    sample_split_config_expanding: SplitConfig,
    monkeypatch: pytest.MonkeyPatch,
) -> None:
    """An XGBoost backtest runs end-to-end and yields per-fold metrics.

    Mirrors ``test_feature_aware_backtest_produces_per_fold_metrics`` for the
    XGBoost feature-aware model (PRP-MLZOO-C1) — proving the B.2
    ``requires_features`` probe needs no per-model backtesting-service wiring.
    SKIPs when the optional ``ml-xgboost`` dependency is absent; the
    ``forecast_enable_xgboost`` flag is enabled so ``model_factory`` dispatches.
    """
    pytest.importorskip("xgboost")
    from app.core.config import get_settings

    monkeypatch.setattr(get_settings(), "forecast_enable_xgboost", True)

    service = BacktestingService()
    series = _series(sample_dates_120, sample_values_120, with_exogenous=True)
    splitter = TimeSeriesSplitter(sample_split_config_expanding)

    result = service._run_model_backtest(
        series_data=series,
        splitter=splitter,
        model_config=XGBoostModelConfig(),
        store_fold_details=True,
    )

    assert result.model_type == "xgboost"
    assert result.feature_aware is True
    assert len(result.fold_results) > 0
    assert "mae" in result.aggregated_metrics
    for fold in result.fold_results:
        assert "mae" in fold.metrics
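
The test above patches an attribute on the object returned by `get_settings()`. That only reaches `model_factory` if every call to `get_settings()` returns the same instance. The sketch below shows the idea under the assumption that the accessor is memoized (the diff does not show its definition). The `Settings` class here is a stand-in, and `unittest.mock.patch.object` is used in place of pytest's `monkeypatch` so the snippet runs on its own.

```python
from dataclasses import dataclass
from functools import lru_cache
from unittest.mock import patch


@dataclass
class Settings:
    # Stand-in for the real settings class; only the flag under test is modeled.
    forecast_enable_xgboost: bool = False


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    # Assumption: the real accessor is cached, so all callers share one object.
    return Settings()


def xgboost_enabled() -> bool:
    # Plays the role of production code that reads the flag at call time.
    return get_settings().forecast_enable_xgboost


# Patch the shared instance for the duration of the block, then restore it.
with patch.object(get_settings(), "forecast_enable_xgboost", True):
    enabled_inside = xgboost_enabled()
enabled_after = xgboost_enabled()
```

If the accessor built a fresh `Settings` on every call, the patch would land on a throwaway object and the flag check in production code would still see `False`.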


def test_feature_aware_result_records_observed_policy(
sample_dates_120: list[date],
sample_values_120: np.ndarray,
175 changes: 174 additions & 1 deletion app/features/forecasting/models.py
@@ -732,8 +732,166 @@ def set_params(self, **params: Any) -> LightGBMForecaster: # noqa: ANN401
        return self


class XGBoostForecaster(BaseForecaster):
    """Feature-aware forecaster wrapping ``xgboost.XGBRegressor``.

    The second ADVANCED feature-aware tree model (MLZOO-C1). Structurally a
    twin of ``LightGBMForecaster``: it REQUIRES a non-``None`` exogenous ``X``
    for both ``fit`` and ``predict``; the estimator is gradient-boosted trees
    from the optional ``xgboost`` package.

    ``xgboost`` is imported LAZILY inside ``fit`` — never at module scope and
    never in ``__init__`` — so importing this module (which every forecasting
    code path does, baseline models included) never requires the optional
    ``ml-xgboost`` dependency.

    Determinism: ``XGBRegressor`` has no ``deterministic`` switch (unlike
    LightGBM). Bit-reproducibility comes from ``n_jobs=1`` + ``tree_method="hist"``
    + a fixed ``random_state`` + the conservative config leaving ``subsample`` /
    ``colsample_bytree`` at their ``1.0`` defaults (no stochastic sampling) —
    all pinned in ``fit``. XGBoost tolerates ``NaN`` natively (``missing=np.nan``),
    which matters because the future feature frame leaves lag cells ``NaN``
    when their source target lies in the un-observed horizon.

    Attributes:
        n_estimators: Number of boosting rounds.
        learning_rate: Gradient-boosting learning rate.
        max_depth: Maximum depth of each tree.
    """

    requires_features: ClassVar[bool] = True
    """A feature-aware model — ``fit``/``predict`` REQUIRE a non-None ``X``."""

    def __init__(
        self,
        *,
        n_estimators: int = 100,
        learning_rate: float = 0.1,
        max_depth: int = 6,
        random_state: int = 42,
    ) -> None:
        """Initialize the XGBoost forecaster.

        Args:
            n_estimators: Number of boosting rounds.
            learning_rate: Gradient-boosting learning rate.
            max_depth: Maximum depth of each tree.
            random_state: Random seed for reproducibility (determinism).
        """
        super().__init__(random_state)
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self._estimator: Any = None

    def fit(
        self,
        y: np.ndarray[Any, np.dtype[np.floating[Any]]],
        X: np.ndarray[Any, np.dtype[np.floating[Any]]] | None = None,
    ) -> XGBoostForecaster:
        """Fit the gradient-boosted regressor on historical features.

        Args:
            y: Target values (1D array of shape ``[n_samples]``).
            X: Exogenous features (2D array of shape ``[n_samples, n_features]``).
                REQUIRED — unlike the baseline forecasters.

        Returns:
            self (for method chaining).

        Raises:
            ValueError: If ``X`` is ``None``, ``y`` is empty, or the row counts
                of ``X`` and ``y`` do not match.
        """
        if X is None:
            raise ValueError("XGBoostForecaster requires exogenous features X for fit()")
        if len(y) == 0:
            raise ValueError("Cannot fit on empty array")
        if X.shape[0] != len(y):
            raise ValueError(
                f"X has {X.shape[0]} rows but y has {len(y)} — feature/target rows must match"
            )
        # LAZY import — the optional ``ml-xgboost`` dependency is only needed
        # the first time an XGBoost model is actually fitted.
        import xgboost as xgb

        estimator: Any = xgb.XGBRegressor(
            n_estimators=self.n_estimators,
            learning_rate=self.learning_rate,
            max_depth=self.max_depth,
            random_state=self.random_state,
            n_jobs=1,  # single-threaded — removes float-summation non-determinism
            tree_method="hist",  # explicit; the default, and the reproducible path
            verbosity=0,  # silence XGBoost's training chatter
        )
        estimator.fit(X, y)
        self._estimator = estimator
        self._last_values = np.asarray(y[-1:], dtype=np.float64)
        self._is_fitted = True
        return self

    def predict(
        self,
        horizon: int,
        X: np.ndarray[Any, np.dtype[np.floating[Any]]] | None = None,
    ) -> np.ndarray[Any, np.dtype[np.floating[Any]]]:
        """Generate forecasts from a future feature frame.

        Args:
            horizon: Number of steps to forecast.
            X: Exogenous features for the forecast period, shape
                ``[horizon, n_features]``. REQUIRED.

        Returns:
            Array of forecasts with shape ``[horizon]``.

        Raises:
            RuntimeError: If the model has not been fitted.
            ValueError: If ``X`` is ``None`` or its row count is not ``horizon``.
        """
        if not self._is_fitted or self._estimator is None:
            raise RuntimeError("Model must be fitted before predict")
        if X is None:
            raise ValueError("XGBoostForecaster requires exogenous features X for predict()")
        if X.shape[0] != horizon:
            raise ValueError(f"X has {X.shape[0]} rows but horizon is {horizon} — they must match")
        predictions = self._estimator.predict(X)
        result: np.ndarray[Any, np.dtype[np.floating[Any]]] = np.asarray(
            predictions, dtype=np.float64
        )
        return result

    def get_params(self) -> dict[str, Any]:
        """Get model parameters.

        Returns:
            Dictionary with n_estimators, learning_rate, max_depth, random_state.
        """
        return {
            "n_estimators": self.n_estimators,
            "learning_rate": self.learning_rate,
            "max_depth": self.max_depth,
            "random_state": self.random_state,
        }

    def set_params(self, **params: Any) -> XGBoostForecaster:  # noqa: ANN401
        """Set model parameters.

        Args:
            **params: Parameter names and values to set.

        Returns:
            self (for method chaining).
        """
        for key, value in params.items():
            setattr(self, key, value)
        return self
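
The class docstring above argues that importing the optional backend inside `fit` keeps a baseline-only install importable. The toy below sketches that pattern in isolation. It is not the PR's code: `LazyBackendForecaster` and the module name `hypothetical_optional_backend` are invented for illustration, the stdlib `statistics` module stands in for an installed backend, and the wrapped `ImportError` message is an addition (the class in the diff uses a plain `import xgboost as xgb`).

```python
from __future__ import annotations

import importlib
import sys
from typing import Any


class LazyBackendForecaster:
    """Toy forecaster that defers its backend import until fit()."""

    # Hypothetical module name: a stand-in for an optional dependency.
    backend_name = "hypothetical_optional_backend"

    def __init__(self, window: int = 3) -> None:
        # No backend import here, so construction works on a minimal install.
        self.window = window
        self._level: float | None = None

    def fit(self, y: list[float]) -> LazyBackendForecaster:
        if not y:
            raise ValueError("Cannot fit on empty array")
        try:
            # The import happens only when a model is actually fitted.
            backend: Any = importlib.import_module(self.backend_name)
        except ImportError as exc:
            raise ImportError(
                f"{self.backend_name!r} is not installed; install the matching extra"
            ) from exc
        self._level = float(backend.fmean(y[-self.window :]))
        return self


class StdlibBackedForecaster(LazyBackendForecaster):
    # The stdlib `statistics` module plays the role of an installed backend.
    backend_name = "statistics"


model = LazyBackendForecaster()  # succeeds even though the backend is absent
backend_loaded_early = LazyBackendForecaster.backend_name in sys.modules
fitted = StdlibBackedForecaster(window=2).fit([1.0, 2.0, 4.0])
```

The trade-off is that a missing dependency surfaces at the first `fit` call rather than at import time, which is why the factory in this PR also checks a feature flag before constructing the model.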


# Type alias for model type literals
ModelType = Literal["naive", "seasonal_naive", "moving_average", "lightgbm", "regression"]
ModelType = Literal[
"naive", "seasonal_naive", "moving_average", "xgboost", "lightgbm", "regression"
]


def model_factory(config: ModelConfig, random_state: int = 42) -> BaseForecaster:
@@ -790,6 +948,21 @@ def model_factory(config: ModelConfig, random_state: int = 42) -> BaseForecaster
                random_state=random_state,
            )
        raise ValueError("Invalid config type for lightgbm")
    elif model_type == "xgboost":
        if not settings.forecast_enable_xgboost:
            raise ValueError(
                "XGBoost is not enabled. Set forecast_enable_xgboost=True in settings."
            )
        from app.features.forecasting.schemas import XGBoostModelConfig

        if isinstance(config, XGBoostModelConfig):
            return XGBoostForecaster(
                n_estimators=config.n_estimators,
                learning_rate=config.learning_rate,
                max_depth=config.max_depth,
                random_state=random_state,
            )
        raise ValueError("Invalid config type for xgboost")
    elif model_type == "regression":
        from app.features.forecasting.schemas import RegressionModelConfig
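
The `xgboost` branch above applies two guards in order: the feature flag first, then the config type. A self-contained sketch of that ordering follows. `build_model`, the config dataclasses, and the returned dict are hypothetical stand-ins rather than the repository's `model_factory`, schemas, or forecaster classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    forecast_enable_xgboost: bool = False


@dataclass(frozen=True)
class NaiveConfig:
    model_type: str = "naive"


@dataclass(frozen=True)
class XGBoostConfig:
    model_type: str = "xgboost"
    n_estimators: int = 100


def build_model(config: object, settings: Settings) -> dict[str, object]:
    """Dispatch on model_type; opt-in models are gated by a settings flag."""
    model_type = getattr(config, "model_type", None)
    if model_type == "naive":
        return {"kind": "naive"}
    if model_type == "xgboost":
        # Guard 1: the flag, checked before anything model-specific happens.
        if not settings.forecast_enable_xgboost:
            raise ValueError(
                "XGBoost is not enabled. Set forecast_enable_xgboost=True in settings."
            )
        # Guard 2: the config must be the matching config class.
        if isinstance(config, XGBoostConfig):
            return {"kind": "xgboost", "n_estimators": config.n_estimators}
        raise ValueError("Invalid config type for xgboost")
    raise ValueError(f"Unknown model type: {model_type!r}")


enabled = Settings(forecast_enable_xgboost=True)
built = build_model(XGBoostConfig(n_estimators=50), enabled)
```

Checking the flag first means a disabled deployment rejects the request with a clear message before any optional package is touched.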

27 changes: 27 additions & 0 deletions app/features/forecasting/persistence.py
@@ -42,6 +42,8 @@ class ModelBundle:
        sklearn_version: Scikit-learn version used when saving.
        lightgbm_version: LightGBM version used when saving, ``None`` when the
            optional ``ml-lightgbm`` dependency was not installed.
        xgboost_version: XGBoost version used when saving, ``None`` when the
            optional ``ml-xgboost`` dependency was not installed.
        bundle_hash: Deterministic hash of bundle contents.
    """

@@ -54,6 +56,7 @@ class ModelBundle:
    python_version: str | None = None
    sklearn_version: str | None = None
    lightgbm_version: str | None = None
    xgboost_version: str | None = None
    bundle_hash: str | None = None

    def compute_hash(self) -> str:
@@ -106,6 +109,14 @@ def save_model_bundle(bundle: ModelBundle, path: str | Path) -> Path:
        bundle.lightgbm_version = str(lightgbm.__version__)
    except ImportError:
        bundle.lightgbm_version = None
    # Best-effort: XGBoost is an optional dependency, so a baseline-only
    # install legitimately has no version to record.
    try:
        import xgboost

        bundle.xgboost_version = str(xgboost.__version__)
    except ImportError:
        bundle.xgboost_version = None
    bundle.bundle_hash = bundle.compute_hash()

    # Save with compression
@@ -198,6 +209,22 @@ def load_model_bundle(path: str | Path, base_dir: str | Path | None = None) -> M
                current_lightgbm=current_lightgbm,
            )

    # XGBoost is optional — only warn when the bundle recorded a version AND
    # the optional dependency is importable here AND the two differ.
    if bundle.xgboost_version:
        try:
            import xgboost

            current_xgboost: str | None = str(xgboost.__version__)
        except ImportError:
            current_xgboost = None
        if current_xgboost is not None and bundle.xgboost_version != current_xgboost:
            logger.warning(
                "forecasting.xgboost_version_mismatch",
                saved_xgboost=bundle.xgboost_version,
                current_xgboost=current_xgboost,
            )

    logger.info(
        "forecasting.model_bundle_loaded",
        path=str(path),
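
The save and load hunks above share one rule: record a version only if the optional package is importable, and warn only when a version was recorded, the package is importable now, and the two differ. The sketch below restates that rule as two small helpers. `optional_version` and `version_mismatch` are hypothetical names and are not part of `persistence.py`.

```python
from __future__ import annotations

import importlib


def optional_version(module_name: str) -> str | None:
    """Return the module's __version__, or None when it is not importable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    version = getattr(module, "__version__", None)
    return None if version is None else str(version)


def version_mismatch(saved: str | None, current: str | None) -> bool:
    """True only when both sides are known and they differ."""
    return bool(saved) and current is not None and saved != current


# A package that is not installed yields no version to record.
missing = optional_version("hypothetical_package_that_is_not_installed")

# Only the last of these combinations would justify a warning.
cases = [
    version_mismatch(None, "2.1.0"),     # nothing recorded at save time
    version_mismatch("2.0.3", None),     # dependency absent at load time
    version_mismatch("2.1.0", "2.1.0"),  # same version on both sides
    version_mismatch("2.0.3", "2.1.0"),  # recorded, present, and different
]
```

Treating an absent dependency as "nothing to compare" keeps baseline-only installs quiet; a bundle that actually needs the package would fail later, at unpickle or predict time, with a more specific error.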