## Milestone II

- Supervised Learning Task:   Multi-Domain Recession Forecasting
- Unsupervised Learning Task: Economic Regime Identification

### Current Datasets/Data Sources:
1. Federal Reserve Economic Data (FRED, https://fred.stlouisfed.org/)
   - Appears to contain all of the economic data we need (GDP, CPI, unemployment, etc.).
   - Requires an API key, obtainable upon registration on the website.
   - The data is sourced from government agencies, international organizations, private companies, research institutions, etc.; source metadata is available through the API. In practice, nearly all of our data will come from the OECD (Organisation for Economic Co-operation and Development).
   - Also covers countries besides the U.S. We will focus on the G7 and possibly a few other countries, after screening for data availability and reliability.

### Notes:
- We will use the OECD-based Recession Indicators (https://fred.stlouisfed.org/series/USAREC) as our definition of a recession.


### Import Packages

In [1]:
import os
from dataclasses import dataclass, field
from typing import Literal

import pandas as pd
import numpy as np
import pyfredapi as pf
import warnings

warnings.filterwarnings("ignore")

In [None]:
# Type alias for supported feature engineering operations
FeatureOp = Literal["log_diff", "first_diff", "amplitude_deviation", "rolling_stats"]


@dataclass(frozen=True)
class SeriesConfig:
    """Configuration for a FRED series and its feature engineering.

    Attributes:
        prefix: Characters before the country code in series ID (default: "")
        suffix: Characters after the country code in series ID (default: "")
        use_iso3: Use 3-letter (True) or 2-letter (False) country codes (default: True)
        is_global: If True, series is not country-specific, e.g., VIX (default: False)
        agg_method: Aggregation method for frequency conversion, e.g., "avg" (default: None)
        suffix_overrides: Country-specific suffix exceptions (default: {})
        iso_overrides: Country-specific ISO code exceptions (default: {})
        feature_ops: Feature operation(s) to apply - single string or list (default: [])
        lagged: If True, add lagged features for this indicator (default: False)
    """

    # FRED series construction
    prefix: str = ""
    suffix: str = ""
    use_iso3: bool = True
    is_global: bool = False
    agg_method: str | None = None
    suffix_overrides: dict[str, str] = field(default_factory=dict)
    iso_overrides: dict[str, str] = field(default_factory=dict)

    # Feature engineering configuration
    feature_ops: FeatureOp | list[FeatureOp] = field(default_factory=list)
    lagged: bool = False

    def __post_init__(self):
        """Normalize feature_ops to a list."""
        if isinstance(self.feature_ops, str):
            object.__setattr__(self, "feature_ops", [self.feature_ops])

    def has_op(self, op: FeatureOp) -> bool:
        """Check if this series should have the given operation applied."""
        return op in self.feature_ops

### Setup API Key

#### To use FRED in this notebook:

1. Get an API key from https://fredaccount.stlouisfed.org/login/secure/, registering if needed.
2. Add it to your operating system's environment variables as `FRED_API_KEY`.
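As a minimal sketch, the presence of the key could also be verified with a helper that raises instead of printing, so that later cells do not fail with a less obvious error. The helper name `require_env` and the demo variable name are hypothetical and not part of this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your environment variables "
            "and restart the notebook kernel."
        )
    return value


# Demonstration with a throwaway variable so no real key is needed.
os.environ["DEMO_FRED_KEY"] = "abc123"
demo_key = require_env("DEMO_FRED_KEY")
print(demo_key)  # abc123
```

In this notebook the call would be `require_env("FRED_API_KEY")`.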

In [3]:
# Check that the FRED API key is set in the environment; if not, see the steps above
if os.environ.get("FRED_API_KEY") is None:
    print("Failed to get FRED API Key!")


### Get data from FRED

Getting data from FRED requires knowing the series ID of the data you want.

Most series IDs consist of a country code (2- or 3-letter) combined with a series prefix and suffix, or with just a suffix.

You can look them up individually at https://fred.stlouisfed.org/

Rather than hardcoding the IDs, I generate them from their pattern, which makes it easier to add new series.
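As a rough illustration of that pattern, the sketch below composes IDs from a prefix, a country code, and a suffix. The function name `compose_series_id` is hypothetical; the two IDs are the USA series for real GDP and CPI used in this notebook.

```python
def compose_series_id(prefix: str, country_code: str, suffix: str) -> str:
    """Join the three parts of a FRED series ID (hypothetical helper)."""
    return f"{prefix}{country_code}{suffix}"


# Prefix + 2-letter code + suffix (real GDP, USA)
gdp_id = compose_series_id("NAEXKP01", "US", "Q652S")
# No prefix, 3-letter code + suffix (CPI, USA)
cpi_id = compose_series_id("", "USA", "CPIALLMINMEI")

print(gdp_id)  # NAEXKP01USQ652S
print(cpi_id)  # USACPIALLMINMEI
```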

### G7 Country Codes

ISO-2 and ISO-3 country codes for the G7 nations, used to construct FRED series IDs.

In [4]:
G7_COUNTRY_CODES = {
    "USA": {"iso2": "US", "iso3": "USA"},
    "Canada": {"iso2": "CA", "iso3": "CAN"},
    "UK": {"iso2": "GB", "iso3": "GBR"},
    "France": {"iso2": "FR", "iso3": "FRA"},
    "Germany": {"iso2": "DE", "iso3": "DEU"},
    "Italy": {"iso2": "IT", "iso3": "ITA"},
    "Japan": {"iso2": "JP", "iso3": "JPN"},
}

### FRED Series Configuration

Configuration for constructing FRED series IDs and feature engineering. Each indicator uses a `SeriesConfig` dataclass with the following fields:

**Series ID Construction:**
- `prefix`: Characters before the country code (default: "")
- `suffix`: Characters after the country code (default: "")
- `use_iso3`: Use 3-letter (True) or 2-letter (False) country codes (default: True)
- `is_global`: If True, series is not country-specific, e.g., VIX (default: False)
- `agg_method`: Aggregation method for frequency conversion (default: None)
- `suffix_overrides`: Country-specific suffix exceptions (default: {})
- `iso_overrides`: Country-specific ISO code exceptions (default: {})

**Feature Engineering:**
- `feature_ops`: Operation(s) to apply - single string or list. Supported: `"log_diff"`, `"first_diff"`, `"amplitude_deviation"`, `"rolling_stats"` (default: [])
- `lagged`: If True, add lagged features (1, 3, 6, 12 months) for this indicator (default: False)

To add a new data series, create a `SeriesConfig` entry. Only specify non-default values.
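`feature_ops` accepts either a single string or a list, and `SeriesConfig` normalizes it to a list even though the dataclass is frozen. The trimmed-down stand-in below sketches that mechanism in isolation; the class name `MiniConfig` is hypothetical and only the one field is reproduced.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class MiniConfig:
    """Trimmed-down stand-in for SeriesConfig (illustration only)."""

    feature_ops: Union[str, List[str]] = field(default_factory=list)

    def __post_init__(self):
        # Frozen dataclasses block normal attribute assignment, so the
        # normalization has to bypass it with object.__setattr__.
        if isinstance(self.feature_ops, str):
            object.__setattr__(self, "feature_ops", [self.feature_ops])


single = MiniConfig(feature_ops="log_diff")
multiple = MiniConfig(feature_ops=["log_diff", "rolling_stats"])
print(single.feature_ops)    # ['log_diff']
print(multiple.feature_ops)  # ['log_diff', 'rolling_stats']

# Assignment after construction is rejected.
try:
    single.feature_ops = []
except FrozenInstanceError as exc:
    print(type(exc).__name__)  # FrozenInstanceError
```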

In [None]:
SERIES_CONFIG: dict[str, SeriesConfig] = {
    # Real GDP, quarterly, in national currency units, seasonally adj.
    # USA URL: https://fred.stlouisfed.org/series/NAEXKP01USQ652S
    "real_gdp": SeriesConfig(
        prefix="NAEXKP01",
        suffix="Q189S",
        use_iso3=False,
        suffix_overrides={"USA": "Q652S", "UK": "Q652S"},
        feature_ops="log_diff",
    ),
    # Consumer Price Index, 2015=100, monthly
    # USA URL: https://fred.stlouisfed.org/series/USACPIALLMINMEI
    "cpi": SeriesConfig(
        suffix="CPIALLMINMEI",
        feature_ops="log_diff",
    ),
    # Unemployment Rate (%), Monthly, Seasonally Adj.
    # USA URL: https://fred.stlouisfed.org/series/LRHUTTTTUSM156S
    # Lagged: Labor market conditions lag economic turning points
    "unemployment_rate": SeriesConfig(
        prefix="LRHUTTTT",
        suffix="M156S",
        use_iso3=False,
        feature_ops="first_diff",
        lagged=True,
    ),
    # Economic Policy Uncertainty Index, Monthly
    # USA URL: https://fred.stlouisfed.org/series/USEPUINDXM
    # Japan's series was discontinued on FRED in 2016; the original source is
    # updated through the present: https://policyuncertainty.com/japan_monthly.html
    # Will correct later.
    # Lagged: Policy uncertainty signals precede market stress
    "epu": SeriesConfig(
        suffix="EPUINDXM",
        use_iso3=False,
        suffix_overrides={"France": "EUINDXM"},
        iso_overrides={"Japan": "JPN", "UK": "UK", "Canada": "CAN"},
        feature_ops=["log_diff", "rolling_stats"],
        lagged=True,
    ),
    # 10 Year Government Bond Interest Rates, Monthly
    # USA URL: https://fred.stlouisfed.org/series/IRLTLT01USM156N
    # Lagged: Long-term rates reflect financial conditions
    "10_yr_yld": SeriesConfig(
        prefix="IRLTLT01",
        suffix="M156N",
        use_iso3=False,
        feature_ops="first_diff",
        lagged=True,
    ),
    # 3-Month Interbank Interest Rate, Monthly
    # USA URL: https://fred.stlouisfed.org/series/IR3TIB01USM156N
    # This is the interest rate that banks charge other banks for a 90-day loan.
    # The 3-month government bond yield is not used because it is not available
    # for all G7 countries.
    # Lagged: Short-term rates reflect monetary policy stance
    "3_mo_yld": SeriesConfig(
        prefix="IR3TIB01",
        suffix="M156N",
        use_iso3=False,
        feature_ops="first_diff",
        lagged=True,
    ),
    # OECD based Recession Indicators, Monthly
    # USA URL: https://fred.stlouisfed.org/series/USAREC
    # Discontinued since 2022 but can be used for historical analysis
    # Can find alternatives if needed
    "oecd_rec": SeriesConfig(
        suffix="REC",
    ),
    # Industrial Activity Index, Monthly, Seasonally Adj.
    # USA URL: https://fred.stlouisfed.org/series/USAPROINDMISMEI
    # Lagged: Industrial output tracks real economy momentum
    "ind_out": SeriesConfig(
        suffix="PROINDMISMEI",
        feature_ops="log_diff",
        lagged=True,
    ),
    # Composite Consumer Confidence Amplitude, Monthly, Seasonally Adj.
    # USA URL: https://fred.stlouisfed.org/series/CSCICP03USM665S
    # Normal is 100 (amplitude-adjusted index)
    # Lagged: Consumer sentiment is a leading indicator
    "comp_consumer_conf": SeriesConfig(
        prefix="CSCICP03",
        suffix="M665S",
        use_iso3=False,
        feature_ops="amplitude_deviation",
        lagged=True,
    ),
    # Car Registration for Passenger Cars, Monthly, Seasonally Adj.
    # USA URL: https://fred.stlouisfed.org/series/USASLRTCR03GPSAM
    "pcar_reg": SeriesConfig(
        suffix="SLRTCR03GPSAM",
    ),
    # VIX - Daily data, aggregated to monthly via average
    # URL: https://fred.stlouisfed.org/series/VIXCLS
    # Lagged: Market volatility signals financial stress
    "vix": SeriesConfig(
        prefix="VIXCLS",
        use_iso3=False,
        is_global=True,
        agg_method="avg",
        feature_ops=["log_diff", "rolling_stats"],
        lagged=True,
    ),
    # Spot Crude Oil Price: West Texas Intermediate (WTI), Monthly
    # URL: https://fred.stlouisfed.org/series/WTISPLC
    "oil": SeriesConfig(
        prefix="WTISPLC",
        use_iso3=False,
        is_global=True,
        feature_ops="log_diff",
    ),
    # Global price of Copper, Monthly
    # URL: https://fred.stlouisfed.org/series/PCOPPUSDM
    "copper": SeriesConfig(
        prefix="PCOPPUSDM",
        use_iso3=False,
        is_global=True,
        feature_ops="log_diff",
    ),
    # Proxy for Gold, Gold Spot Price is no longer available due to licensing from
    # the London Bullion Market Association
    # Producer Price Index for Jewelry (Gold and Platinum) and Silverware, Monthly
    # https://fred.stlouisfed.org/series/WPU159402
    "gps": SeriesConfig(
        prefix="WPU159402",
        use_iso3=False,
        is_global=True,
        feature_ops="log_diff",
    ),
    # OECD: Leading Indicators: Composite Leading Indicator: Amplitude Adjusted, Monthly
    # URL: https://fred.stlouisfed.org/series/USALOLITOAASTSAM
    # Lagged: Composite leading indicator designed to predict cycles
    "cli": SeriesConfig(
        suffix="LOLITOAASTSAM",
        feature_ops="amplitude_deviation",
        lagged=True,
    ),
    # Sales: Retail Trade: Total Retail Trade: Volume, Growth Rate Previous Period, Monthly
    # URL: https://fred.stlouisfed.org/series/USASLRTTO01GPSAM
    # Caveat: this is a growth-rate series, so values can be zero or negative,
    # for which log_diff produces -inf/NaN.
    "retail_vol": SeriesConfig(
        suffix="SLRTTO01GPSAM",
        feature_ops="log_diff",
        lagged=True,
    ),
    # Financial Market: Share Prices, Index 2015=100, Monthly
    # URL: https://fred.stlouisfed.org/series/SPASTT01USM661N
    "national_share_price": SeriesConfig(
        prefix="SPASTT01",
        suffix="M661N",
        use_iso3=False,
        feature_ops=["log_diff", "rolling_stats"],
        lagged=True
    )
}

### Series ID Construction

Functions to build FRED series IDs from the pattern templates and country codes.

- `build_series_id()`: Constructs a single series ID for a given indicator and country
- `build_series_dict()`: Builds a nested dictionary of all series IDs for batch fetching

In [6]:
def build_series_id(indicator: str, country: str) -> str:
    """Constructs a FRED series ID from template and country code.

    Args:
        indicator: The indicator type (e.g., 'real_gdp', 'cpi').
        country: The country name (e.g., 'USA', 'UK').

    Returns:
        The constructed FRED series ID string.

    Raises:
        ValueError: If indicator or country is not recognized.
    """
    if indicator not in SERIES_CONFIG:
        raise ValueError(f"Unknown indicator: {indicator}")
    if country not in G7_COUNTRY_CODES:
        raise ValueError(f"Unknown country: {country}")

    config = SERIES_CONFIG[indicator]
    codes = G7_COUNTRY_CODES[country]

    # Global indicators don't use country codes
    if config.is_global:
        return f"{config.prefix}{config.suffix}"

    # Check for country-specific ISO override first
    if country in config.iso_overrides:
        code = config.iso_overrides[country]
    else:
        code = codes["iso3"] if config.use_iso3 else codes["iso2"]

    suffix = config.suffix_overrides.get(country, config.suffix)

    return f"{config.prefix}{code}{suffix}"


def build_series_dict(
    indicators: list[str] | None = None,
    countries: list[str] | None = None,
) -> dict[str, dict[str, str]]:
    """Builds a nested dictionary of FRED series IDs.

    Args:
        indicators: List of indicators to include (default: all).
        countries: List of countries to include (default: all).

    Returns:
        Nested dict: {indicator: {country: series_id, country2: series_id2}, indicator2: ...}
        For global indicators, all countries map to the same series ID.
    """
    indicators = indicators or list(SERIES_CONFIG.keys())
    countries = countries or list(G7_COUNTRY_CODES.keys())

    series_ids_by_indicator = {}

    for indicator in indicators:
        config = SERIES_CONFIG[indicator]

        if config.is_global:
            # Global indicators use the same series ID for all countries
            global_series_id = build_series_id(indicator, countries[0])
            series_ids_by_country = {country: global_series_id for country in countries}
        else:
            series_ids_by_country = {
                country: build_series_id(indicator, country) for country in countries
            }

        series_ids_by_indicator[indicator] = series_ids_by_country

    return series_ids_by_indicator

### Data Retrieval

Functions for fetching data from FRED API.

- `get_series_metadata()`: Retrieves metadata (source, frequency, units) for a single series
- `get_fred_data()`: Fetches and combines multiple series into a multi-indexed DataFrame
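Some series are quarterly (real GDP) while most are monthly, so `get_fred_data()` aligns everything to month-start frequency and forward-fills the gaps. A minimal sketch of that step on synthetic values (the numbers are made up, and no FRED call is involved):

```python
import pandas as pd

# Synthetic quarterly and monthly series (values are made up).
quarterly = pd.Series(
    [100.0, 102.0],
    index=pd.to_datetime(["2020-01-01", "2020-04-01"]),
    name="real_gdp",
)
monthly = pd.Series(
    [3.5, 3.6, 3.7, 3.8, 3.9, 4.0],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
    name="unemployment_rate",
)

# Outer-join on the date index; the quarterly column has gaps.
combined = pd.concat([quarterly, monthly], axis=1)

# Align to month-start and carry the last observation forward.
filled = combined.resample("MS").first().ffill()
print(filled)
```

Forward fill also extends a series beyond its last observation, which is worth keeping in mind for discontinued series.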

In [7]:
def get_series_metadata(series_id: str) -> dict:
    """Retrieves source metadata and release URLs for a specific data series.

    Args:
        series_id: The unique identifier for the FRED data series (e.g., 'GDPC1').

    Returns:
        A dictionary containing metadata with the following keys:
            - series_title: title of the data series
            - series_notes: notes about the series
            - dataset_url: link to the dataset release
            - source_name: name of the data source organization
            - source_url: URL of the source organization
            - frequency: frequency of the data (e.g., monthly, quarterly)
            - units: units of measurement for the data values
    """
    source_dict = {}

    series_info = pf.get_series_info(series_id=series_id)
    source_dict["series_title"] = series_info.title
    source_dict["series_notes"] = series_info.notes
    source_dict["frequency"] = series_info.frequency
    source_dict["units"] = series_info.units

    release_info = pf.get_series_releases(series_id=series_id)
    release_id = release_info["releases"][0]["id"]

    source_dict["dataset_url"] = release_info["releases"][0]["link"]

    release_sources = pf.get_release_sources(release_id=release_id)
    source_dict["source_name"] = release_sources["sources"][0]["name"]
    source_dict["source_url"] = release_sources["sources"][0]["link"]

    return source_dict


def get_fred_data(
    series_dict: dict[str, dict[str, str]] | None = None,
    indicators: list[str] | None = None,
    countries: list[str] | None = None,
    start_date: str | None = "1970-01-01",
    end_date: str | None = "2020-12-31",
) -> pd.DataFrame:
    """Fetches FRED data for multiple series and compiles it into a DataFrame.

    Args:
        series_dict: Nested dictionary mapping indicator types to country-series mappings.
            If None, builds from templates using indicators/countries filters.
        indicators: List of indicators to include (used if series_dict is None).
        countries: List of countries to include (used if series_dict is None).
        start_date: The earliest date to include in the data (default: '1970-01-01').
        end_date: The latest date to include in the data (default: '2020-12-31').

    Returns:
        A DataFrame with (date, country) multi-index and indicator columns.
    """
    if series_dict is None:
        series_dict = build_series_dict(indicators, countries)

    data_frames = []
    fetched_series: dict[str, pd.DataFrame] = {}  # Cache for fetched series

    for indicator in series_dict:
        country_series = series_dict[indicator]
        config = SERIES_CONFIG[indicator]

        for country, series_id in country_series.items():
            # Use cached data if already fetched (for global indicators)
            if series_id not in fetched_series:
                try:
                    # Build optional kwargs for frequency aggregation
                    # If agg_method is defined and is_global, use monthly frequency with specified aggregation
                    kwargs = {}
                    if config.is_global and config.agg_method is not None:
                        kwargs["frequency"] = "m"
                        kwargs["aggregation_method"] = config.agg_method

                    series = pf.get_series(series_id=series_id, **kwargs)
                except Exception as e:
                    print(f"Error fetching data for {series_id}: {e}")
                    raise

                series_df = series[["date", "value"]].copy()
                series_df["date"] = pd.to_datetime(series_df["date"])
                series_df = series_df.set_index("date")

                fetched_series[series_id] = series_df

            df = fetched_series[series_id].copy()
            df = df.rename(columns={"value": (indicator, country)})

            data_frames.append(df)

    result = pd.concat(data_frames, axis=1)
    result.columns = pd.MultiIndex.from_tuples(
        result.columns, names=["indicator", "country"]
    )

    # Align to month-start frequency; forward fill upsamples quarterly data to
    # monthly (and also carries the last value forward for series that end early)
    result = result.resample("MS").first().ffill()

    # Filter to start_date onwards
    if start_date is not None:
        result = result.loc[start_date:]

    # Filter to end_date
    if end_date is not None:
        result = result.loc[:end_date]

    result = result.stack(level="country")

    return result

### Feature Engineering

Derives additional features from the raw FRED data.

##### Original Features
- `gdp_qoq_growth`: Quarter-over-quarter GDP growth rate (%)
- `technical_rec`: Boolean flag for technical recession (two consecutive quarters of negative GDP growth)
- `yield_curve`: Spread between 10-year and 3-month yields (negative values often signal recession)
- `sahm_value`: Sahm Rule indicator (3-month average unemployment rate minus its trailing 12-month minimum)
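To make the `sahm_value` definition concrete, the sketch below applies the same two rolling steps this notebook uses (a 3-month average, then a 12-month minimum of that average) to a synthetic single-country unemployment series. The numbers are made up for illustration, and the 0.5 percentage point threshold is the rule of thumb quoted in the code comments.

```python
import pandas as pd

# 14 months of flat unemployment followed by a rise (synthetic data).
rates = [4.0] * 14 + [4.3, 4.6, 4.9]
unemployment = pd.Series(
    rates, index=pd.date_range("2019-01-01", periods=len(rates), freq="MS")
)

avg_3mo = unemployment.rolling(window=3, min_periods=3).mean()
min_12mo = avg_3mo.rolling(window=12, min_periods=12).min()
sahm_value = avg_3mo - min_12mo

# Flag months where the gap reaches 0.5 percentage points.
signal = sahm_value >= 0.5
print(sahm_value.round(2).tail(3))
print(signal.tail(3))
```

In this toy series the gap widens to roughly 0.1, 0.3, and 0.6 over the last three months, so only the final month is flagged.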

##### Log Differences (40 features)
For level-based indicators: `real_gdp`, `cpi`, `ind_out`, `oil`, `copper`, `gps`, `vix`, `epu`, `retail_vol`, `national_share_price`
- `{col}_log_1mo`, `{col}_log_3mo`, `{col}_log_6mo`, `{col}_log_12mo`
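Log differences treat gains and losses symmetrically and add up across periods, which simple percentage changes do not. A small worked sketch with made-up levels:

```python
import math

levels = [100.0, 110.0, 100.0]  # up 10, then back down to the start

log_diffs = [math.log(b / a) for a, b in zip(levels, levels[1:])]
pct_changes = [(b - a) / a for a, b in zip(levels, levels[1:])]

# Log differences cancel on a round trip; percentage changes do not.
print(abs(sum(log_diffs)) < 1e-12)  # True
print(round(sum(pct_changes), 4))   # 0.0091
```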

##### First Differences (12 features)
For rate-based indicators: `unemployment_rate`, `10_yr_yld`, `3_mo_yld`
- `{col}_diff_1mo`, `{col}_diff_3mo`, `{col}_diff_6mo`, `{col}_diff_12mo`

##### Amplitude Deviation Features (10 features)
For amplitude-adjusted indices centered at 100: `comp_consumer_conf`, `cli`
- `{col}_dev`: Deviation from 100 baseline (100 = long-term trend)
- `{col}_diff_1mo`, `{col}_diff_3mo`, `{col}_diff_6mo`, `{col}_diff_12mo`

##### Yield Curve Features (4 features)
- `yield_curve_inverted`: Binary flag when yield curve is negative
- `yield_curve_diff_1mo`, `yield_curve_diff_3mo`, `yield_curve_diff_6mo`

##### Rolling Statistics (18 features)
For volatility/market indicators: `vix`, `epu`, `national_share_price`
- `{col}_rolling_mean_3mo`, `{col}_rolling_mean_6mo`, `{col}_rolling_mean_12mo`
- `{col}_rolling_std_3mo`, `{col}_rolling_std_6mo`, `{col}_rolling_std_12mo`
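Because the frame is indexed by (date, country), rolling windows have to be computed within each country. The sketch below, on synthetic data, shows the groupby-rolling pattern and why the extra index level is dropped afterwards.

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=4, freq="MS")
index = pd.MultiIndex.from_product(
    [dates, ["USA", "UK"]], names=["date", "country"]
)
# Synthetic values: USA = 10, 20, 30, 40 and UK = 1, 2, 3, 4.
df = pd.DataFrame(
    {"vix": [10.0, 1.0, 20.0, 2.0, 30.0, 3.0, 40.0, 4.0]}, index=index
)

rolled = (
    df.groupby(level="country")["vix"]
    .rolling(window=3, min_periods=3)
    .mean()
)
# groupby().rolling() prepends the group key, giving (country, date, country);
# dropping level 0 restores the original (date, country) index for alignment.
df["vix_rolling_mean_3mo"] = rolled.droplevel(0)
print(df)
```

The first two months of each country are NaN because `min_periods` equals the window length.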

##### Derived Features (4 features)
- `unemployment_accel`: Second difference (month-over-month change in the monthly change) of the unemployment rate
- `cpi_yoy_pct`: Year-over-year CPI change in percent (12-month log difference × 100)
- `real_rate_10yr`, `real_rate_3mo`: Real interest rates (nominal rate minus `cpi_yoy_pct`)

##### Lagged Features
Lagged versions of key indicators at 1, 3, 6, and 12 month horizons for recession prediction.
Lags capture past levels of indicators, enabling the model to learn leading patterns.

Indicators with `lagged=True` in `SERIES_CONFIG` receive lagged versions:
- `{col}_lag_1mo`, `{col}_lag_3mo`, `{col}_lag_6mo`, `{col}_lag_12mo`

Lagged indicators: `unemployment_rate`, `epu`, `10_yr_yld`, `3_mo_yld`, `ind_out`, `comp_consumer_conf`, `vix`, `cli`, `retail_vol`, `national_share_price`

Additional derived features are also lagged:
- `yield_curve`, `yield_curve_inverted`, `sahm_value`, `gdp_qoq_growth`
- `unemployment_accel`, `real_rate_10yr`, `real_rate_3mo`
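A minimal sketch of per-country lagging on synthetic data; the column name `cli` is borrowed from the indicator list, and the values are made up.

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=3, freq="MS")
index = pd.MultiIndex.from_product(
    [dates, ["USA", "UK"]], names=["date", "country"]
)
# Synthetic values: USA = 100, 101, 102 and UK = 99, 98, 97.
df = pd.DataFrame(
    {"cli": [100.0, 99.0, 101.0, 98.0, 102.0, 97.0]}, index=index
)

# Shift within each country so one country's history never leaks into another's.
df["cli_lag_1mo"] = df.groupby(level="country")["cli"].shift(1)
print(df)
```

A plain `df["cli"].shift(1)` would instead pull the previous row, which here belongs to the other country on the same date.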

##### Forecast Target
- `pre_recession`: Binary target marking the 12 months before each recession start (configurable horizon)
  - Value of 1 indicates the economy will enter recession within the forecast horizon
  - Value of 0 for all other periods (including during recession itself)
  - Useful for training models to predict upcoming recessions rather than identify current ones
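The labeling rule can be sketched on a plain list of monthly recession flags. The function name `mark_pre_recession` is hypothetical, and positions stand in for calendar months, whereas the notebook works with date offsets.

```python
def mark_pre_recession(recession: list[int], horizon: int) -> list[int]:
    """Flag the `horizon` periods before each 0 -> 1 transition (illustrative)."""
    target = [0] * len(recession)
    for i in range(1, len(recession)):
        # A recession starts where the flag switches from 0 to 1.
        if recession[i] == 1 and recession[i - 1] == 0:
            for j in range(max(0, i - horizon), i):
                target[j] = 1
    return target


# Months 4 and 5 are in recession; the start is month 4.
recession_flags = [0, 0, 0, 0, 1, 1, 0, 0]
pre_recession = mark_pre_recession(recession_flags, horizon=2)
print(pre_recession)  # [0, 0, 1, 1, 0, 0, 0, 0]
```

The recession months themselves stay at 0, matching the definition above.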

In [None]:
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derives additional features from raw FRED data.

    Args:
        df: DataFrame with (date, country) multi-index and indicator columns.

    Returns:
        DataFrame with additional derived feature columns.
    """
    df = df.copy()
    periods = [(1, "1mo"), (3, "3mo"), (6, "6mo"), (12, "12mo")]

    # Convert OECD recession flag to boolean
    df["oecd_rec"] = df["oecd_rec"].astype(bool)

    # Calculate quarter over quarter GDP growth rate (GDP is quarterly data)
    df["gdp_qoq_growth"] = (
        df.groupby(level="country")["real_gdp"].pct_change(periods=3) * 100
    )

    # Technical recession: two consecutive quarters of negative GDP growth
    df["technical_rec"] = (df["gdp_qoq_growth"] < 0) & (
        df.groupby(level="country")["gdp_qoq_growth"].shift(3) < 0
    )

    # Spread between 10-year and 3-month yields
    # A negative spread (inverted yield curve) is a commonly used recession signal
    df["yield_curve"] = df["10_yr_yld"] - df["3_mo_yld"]

    # Sahm Rule: signals recession when the 3-month avg unemployment rate rises
    # 0.5pp+ above its 12-month low. Useful as a real-time indicator with monthly data.
    df["unemployment_3mo_avg"] = (
        df.groupby(level="country")["unemployment_rate"]
        .rolling(window=3, min_periods=3)
        .mean()
        .droplevel(0)
    )
    df["unemployment_12mo_min"] = (
        df.groupby(level="country")["unemployment_3mo_avg"]
        .rolling(window=12, min_periods=12)
        .min()
        .droplevel(0)
    )
    df["sahm_value"] = df["unemployment_3mo_avg"] - df["unemployment_12mo_min"]

    # =========================================================================
    # CONFIG-DRIVEN FEATURE ENGINEERING
    # Applies feature operations based on SERIES_CONFIG settings
    # =========================================================================
    for indicator, config in SERIES_CONFIG.items():
        if indicator not in df.columns:
            continue

        # LOG DIFFERENCES - for level-based indicators
        # Captures growth rates with symmetric treatment of gains/losses
        if config.has_op("log_diff"):
            for period, label in periods:
                df[f"{indicator}_log_{label}"] = df.groupby(level="country")[
                    indicator
                ].transform(lambda x: np.log(x).diff(period))

        # FIRST DIFFERENCES - for rate-based indicators (already percentages)
        # Captures change in percentage points
        if config.has_op("first_diff"):
            for period, label in periods:
                df[f"{indicator}_diff_{label}"] = df.groupby(level="country")[
                    indicator
                ].diff(period)

        # AMPLITUDE DEVIATION FEATURES - for amplitude-adjusted indices centered at 100
        # These indices are normalized so 100 = long-term trend; deviation is meaningful
        if config.has_op("amplitude_deviation"):
            # Deviation from baseline (100)
            df[f"{indicator}_dev"] = df[indicator] - 100
            # Momentum (change over time)
            for period, label in periods:
                df[f"{indicator}_diff_{label}"] = df.groupby(level="country")[
                    indicator
                ].diff(period)

        # ROLLING STATISTICS - for volatility indicators
        # Captures sustained stress vs temporary spikes
        if config.has_op("rolling_stats"):
            for window in [3, 6, 12]:
                df[f"{indicator}_rolling_mean_{window}mo"] = (
                    df.groupby(level="country")[indicator]
                    .rolling(window=window, min_periods=window)
                    .mean()
                    .droplevel(0)
                )
                df[f"{indicator}_rolling_std_{window}mo"] = (
                    df.groupby(level="country")[indicator]
                    .rolling(window=window, min_periods=window)
                    .std()
                    .droplevel(0)
                )

    # =========================================================================
    # YIELD CURVE FEATURES
    # =========================================================================
    # Binary flag for yield curve inversion (historically precedes recessions)
    df["yield_curve_inverted"] = (df["yield_curve"] < 0).astype(int)

    # Yield curve momentum (flattening vs steepening)
    for period, label in [(1, "1mo"), (3, "3mo"), (6, "6mo")]:
        df[f"yield_curve_diff_{label}"] = df.groupby(level="country")[
            "yield_curve"
        ].diff(period)

    # =========================================================================
    # UNEMPLOYMENT ACCELERATION
    # Second difference captures labor market inflection points.
    # Both diffs run inside each country group; a chained .diff().diff() would
    # apply the second diff across interleaved country rows.
    # =========================================================================
    df["unemployment_accel"] = df.groupby(level="country")[
        "unemployment_rate"
    ].transform(lambda x: x.diff().diff())

    # =========================================================================
    # REAL INTEREST RATES
    # Nominal rate minus inflation (approximated by CPI YoY log return)
    # =========================================================================
    # CPI YoY is already computed as cpi_log_12mo, convert to percentage
    df["cpi_yoy_pct"] = df["cpi_log_12mo"] * 100
    df["real_rate_10yr"] = df["10_yr_yld"] - df["cpi_yoy_pct"]
    df["real_rate_3mo"] = df["3_mo_yld"] - df["cpi_yoy_pct"]

    return df

def one_hot_encode_countries(df: pd.DataFrame) -> pd.DataFrame:
    """Creates one-hot encoded columns for each country in the multi-index.

    Args:
        df: DataFrame with (date, country) multi-index.

    Returns:
        DataFrame with additional binary columns for each country.
    """
    df = df.copy()

    # Get country from the multi-index level
    countries = df.index.get_level_values("country")

    # Create one-hot encoded columns using pd.get_dummies
    country_dummies = pd.get_dummies(countries, prefix="country")
    country_dummies.index = df.index

    # Concatenate with original DataFrame
    df = pd.concat([df, country_dummies], axis=1)

    return df


def add_lagged_features(df: pd.DataFrame) -> pd.DataFrame:
    """Adds lagged versions of key features for recession prediction.

    Lagged features capture past levels of indicators, enabling the model to
    learn leading patterns where indicators precede recessions by months.

    Args:
        df: DataFrame with (date, country) multi-index and engineered features.

    Returns:
        DataFrame with additional lagged feature columns.
    """
    df = df.copy()

    lag_periods = [(1, "1mo"), (3, "3mo"), (6, "6mo"), (12, "12mo")]

    # =========================================================================
    # COLLECT COLUMNS TO LAG FROM SERIES_CONFIG
    # =========================================================================
    cols_to_lag = []

    for indicator, config in SERIES_CONFIG.items():
        if config.lagged and indicator in df.columns:
            cols_to_lag.append(indicator)
            # Also add deviation columns for amplitude_deviation features
            if config.has_op("amplitude_deviation"):
                dev_col = f"{indicator}_dev"
                if dev_col in df.columns:
                    cols_to_lag.append(dev_col)
            # Also add rolling mean columns for rolling_stats features
            if config.has_op("rolling_stats"):
                for window in [3, 6, 12]:
                    rolling_col = f"{indicator}_rolling_mean_{window}mo"
                    if rolling_col in df.columns:
                        cols_to_lag.append(rolling_col)

    # =========================================================================
    # DERIVED FEATURES TO LAG (not in SERIES_CONFIG)
    # These are computed features that should also be lagged
    # =========================================================================
    derived_lag_cols = [
        "yield_curve",           # Inversion precedes recession by 6-18 months
        "yield_curve_inverted",  # Binary signal of inversion state
        "sahm_value",            # Designed as early recession signal
        "gdp_qoq_growth",        # GDP contraction leads recession dating
        "unemployment_accel",    # Acceleration signals inflection points
        "real_rate_10yr",        # Real 10-year rate
        "real_rate_3mo",         # Real 3-month rate
    ]
    cols_to_lag.extend([c for c in derived_lag_cols if c in df.columns])

    # =========================================================================
    # APPLY LAGS
    # =========================================================================
    for col in cols_to_lag:
        for period, label in lag_periods:
            df[f"{col}_lag_{label}"] = df.groupby(level="country")[col].shift(period)

    return df


def create_forecast_target(df: pd.DataFrame, horizon: int = 12) -> pd.DataFrame:
    """Creates a binary target for recession forecasting.

    Marks the N months before each recession start as the positive class (1).
    This allows the model to learn patterns that precede recessions, making it
    suitable for early warning systems.

    Args:
        df: DataFrame with (date, country) multi-index and 'oecd_rec' column.
        horizon: Number of months before recession to mark as target (default: 12).

    Returns:
        DataFrame with 'pre_recession' target column added.
    """
    df = df.copy()
    df["pre_recession"] = 0

    for country in df.index.get_level_values("country").unique():
        # Get country data with dates as index
        country_mask = df.index.get_level_values("country") == country
        country_data = df.loc[country_mask, "oecd_rec"].droplevel("country")

        # Identify recession starts (False -> True transition)
        rec = country_data.astype(int)
        recession_starts = rec.index[(rec == 1) & (rec.shift(1) == 0)]

        # Mark the N months before each recession start
        for start_date in recession_starts:
            pre_start = start_date - pd.DateOffset(months=horizon)
            pre_mask = (
                (df.index.get_level_values("date") >= pre_start)
                & (df.index.get_level_values("date") < start_date)
                & (df.index.get_level_values("country") == country)
            )
            df.loc[pre_mask, "pre_recession"] = 1

    return df

### Fetch and Display Data

Execute the data pipeline: fetch from FRED, engineer features, and display sample outputs.

In [None]:
# Fetch data from FRED and apply feature engineering
df = get_fred_data()
df = engineer_features(df)
df = add_lagged_features(df)
df = create_forecast_target(df, horizon=12)
df = one_hot_encode_countries(df)

# Display feature count
print(f"Total features: {len(df.columns)}")

# Display forecast target distribution
print("\n\nForecast Target Distribution (pre_recession):")
print(df["pre_recession"].value_counts())
print(f"\nPositive class ratio: {df['pre_recession'].mean():.2%}")

# Display pre-recession periods by country
print("\n\nPre-recession months by country:")
print(df.groupby(level="country")["pre_recession"].sum().astype(int))

# Scale GDP to billions for readability
df_display = df.copy()
df_display["real_gdp"] = df_display["real_gdp"] / 1e9

# Display recent data (last 14 rows = 2 months for all 7 countries)
print("\n\nDataFrame Tail (GDP in billions):")
print(df_display.round(2).tail(14))

# Can also unstack for super wide date, country multi-index (not recommended)
# print("\n\nDataFrame Tail (Unstacked, GDP in billions):")
# print(df_display.unstack(level="country").round(2).tail(10))

# Can also only get specific indicators/countries
# df_usa_gdp = get_fred_data(indicators=["real_gdp"], countries=["USA", "UK"])

# Display data from specific historical dates
print("\n\nData from specific dates:")
specific_dates = pd.to_datetime(["1980-01-01", "2000-01-01", "2020-01-01"])
date_level = df_display.index.get_level_values("date")
print(df_display[date_level.isin(specific_dates)].round(2))

# Display data from 2008 financial crisis period
print("\n\nData from 2008 financial crisis period:")
print(df_display.loc["2008-06-01":"2009-06-01"].round(2))