## Milestone II

- Supervised Learning Task:   Multi-Domain Recession Forecasting
- Unsupervised Learning Task: Economic Regime Identification

### Current Datasets/Data Sources:
1. Federal Reserve Economic Data (FRED, https://fred.stlouisfed.org/)
   - Appears to contain all of the economic data we will need (GDP, CPI, unemployment, etc.); coverage still needs to be confirmed per series.
   - Requires an API key, obtainable upon registration on their website.
   - FRED aggregates data from government agencies, international organizations, private companies, and research institutions; source metadata is available through the API. In practice, nearly all of the series we use come from the OECD (Organisation for Economic Co-operation and Development).
   - Covers countries besides the U.S. We will focus on the G7 and possibly a few other countries, after screening for availability and reliability.

### Notes:
- We will use the OECD-based Recession Indicators (https://fred.stlouisfed.org/series/USAREC) as our definition of a recession.


In [4]:
%pip install -r requirements.txt


Defaulting to user installation because normal site-packages is not writeable
Collecting annotated-types==0.7.0 (from -r requirements.txt (line 1))
  Using cached annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)
Collecting asttokens==3.0.1 (from -r requirements.txt (line 2))
  Using cached asttokens-3.0.1-py3-none-any.whl.metadata (4.9 kB)
Collecting certifi==2026.1.4 (from -r requirements.txt (line 3))
  Using cached certifi-2026.1.4-py3-none-any.whl.metadata (2.5 kB)
Collecting charset-normalizer==3.4.4 (from -r requirements.txt (line 4))
  Using cached charset_normalizer-3.4.4-cp313-cp313-win_amd64.whl.metadata (38 kB)
Collecting comm==0.2.3 (from -r requirements.txt (line 6))
  Using cached comm-0.2.3-py3-none-any.whl.metadata (3.7 kB)
Collecting contourpy==1.3.3 (from -r requirements.txt (line 7))
  Using cached contourpy-1.3.3-cp313-cp313-win_amd64.whl.metadata (5.5 kB)
Collecting cycler==0.12.1 (from -r requirements.txt (line 8))
  Using cached cycler-0.12.1-py3-none-any.

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-core 0.3.78 requires packaging<26.0.0,>=23.2.0, but you have packaging 26.0 which is incompatible.
langchain-google-genai 2.1.12 requires google-ai-generativelanguage<1,>=0.7, but you have google-ai-generativelanguage 0.6.15 which is incompatible.
mdit-py-plugins 0.3.0 requires markdown-it-py<3.0.0,>=1.0.0, but you have markdown-it-py 4.0.0 which is incompatible.
numba 0.61.0 requires numpy<2.2,>=1.24, but you have numpy 2.4.2 which is incompatible.
spyder 6.0.7 requires ipython!=8.17.1,<9.0.0,>=8.13.0; python_version > "3.8", but you have ipython 9.9.0 which is incompatible.
spyder-kernels 3.0.5 requires ipykernel<7,>=6.29.3, but you have ipykernel 7.1.0 whi

### Import Packages

In [1]:
import os
from dotenv import load_dotenv

import pandas as pd
import numpy as np
import pyfredapi as pf
import warnings

warnings.filterwarnings("ignore")

### Setup API Key

#### To use FRED in this notebook:

1. Get an API key from https://fredaccount.stlouisfed.org/login/secure/ (register if needed).
2. Add it as `FRED_API_KEY` to a `.env` file, or to your operating system's environment variables.

In [2]:
# Load FRED API key from .env file
load_dotenv()
FRED_API_KEY = os.environ.get("FRED_API_KEY")
if FRED_API_KEY is None:
    print("Warning: FRED_API_KEY not found in .env file or environment variables!")
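If a missing key should stop the notebook early rather than surface later as a failed request, the check can raise instead of printing. A minimal sketch, assuming a hypothetical helper named `require_env` (it is not part of `python-dotenv` or `pyfredapi`) and a throwaway variable name used only for the demo:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper for illustration only.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to a .env file or to the environment."
        )
    return value


# Demo with a throwaway variable so the sketch runs without a real key.
os.environ["DEMO_FRED_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_FRED_API_KEY"))
```

Whether to warn or raise is a design choice: a warning keeps the notebook importable without a key, while an exception makes the failure point obvious.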

### Get data from FRED

Getting data from FRED requires knowing the series ID of the data you want.

Most series IDs consist of a 2- or 3-letter country code combined with a prefix and a suffix, or with a suffix alone.

You can look them up individually at https://fred.stlouisfed.org/

I avoid hardcoding them and instead generate them from their pattern, which makes it easier to add new series.
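As a rough illustration of the pattern, the IDs are plain string concatenations of a prefix, a country code, and a suffix. The sketch below uses a hypothetical `compose_series_id` helper; the pieces are taken from the U.S. unemployment and CPI series URLs cited in this notebook.

```python
def compose_series_id(prefix: str, country_code: str, suffix: str) -> str:
    """Join the three pieces of a FRED series ID (hypothetical helper)."""
    return f"{prefix}{country_code}{suffix}"


# Unemployment rate: prefix + ISO-2 code + suffix
print(compose_series_id("LRHUTTTT", "US", "M156S"))  # LRHUTTTTUSM156S

# CPI: no prefix, ISO-3 code + suffix
print(compose_series_id("", "USA", "CPIALLMINMEI"))  # USACPIALLMINMEI
```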

### G7 Country Codes

ISO-2 and ISO-3 country codes for the G7 nations, used to construct FRED series IDs.

In [4]:
G7_COUNTRY_CODES = {
    "USA": {"iso2": "US", "iso3": "USA"},
    "Canada": {"iso2": "CA", "iso3": "CAN"},
    "UK": {"iso2": "GB", "iso3": "GBR"},
    "France": {"iso2": "FR", "iso3": "FRA"},
    "Germany": {"iso2": "DE", "iso3": "DEU"},
    "Italy": {"iso2": "IT", "iso3": "ITA"},
    "Japan": {"iso2": "JP", "iso3": "JPN"},
}

### FRED Series Patterns

Configuration for constructing FRED series IDs. Each indicator defines:
- `prefix`: Characters before the country code
- `suffix`: Characters after the country code
- `use_iso3`: Whether to use 3-letter (True) or 2-letter (False) country codes
- `suffix_overrides`: Country-specific suffix exceptions
- `iso_overrides`: Country-specific ISO code exceptions
- `is_global`: If True, series is not country-specific (e.g., VIX)

To add a new data series, add an entry following this pattern.
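Because a typo in a new entry would only surface when a request fails, a small check of each entry's keys can help before fetching anything. The following is a sketch under assumptions: `validate_pattern` and the `example_patterns` dict are hypothetical, and which keys count as required is a choice made here, not something the notebook enforces.

```python
REQUIRED_KEYS = {"prefix", "suffix", "use_iso3"}
OPTIONAL_KEYS = {"suffix_overrides", "iso_overrides", "is_global", "agg_method"}


def validate_pattern(name: str, entry: dict) -> list[str]:
    """Return a list of human-readable problems found in one pattern entry."""
    problems = []
    keys = set(entry)
    missing = REQUIRED_KEYS - keys
    if missing:
        problems.append(f"{name}: missing keys {sorted(missing)}")
    unknown = keys - REQUIRED_KEYS - OPTIONAL_KEYS
    if unknown:
        problems.append(f"{name}: unknown keys {sorted(unknown)}")
    if "use_iso3" in entry and not isinstance(entry["use_iso3"], bool):
        problems.append(f"{name}: use_iso3 must be True or False")
    return problems


# Hypothetical entries: one well-formed, one with a misspelled key.
example_patterns = {
    "cpi": {"prefix": "", "suffix": "CPIALLMINMEI", "use_iso3": True},
    "broken": {"prefix": "X", "sufix": "Y", "use_iso3": False},
}
for name, entry in example_patterns.items():
    print(name, validate_pattern(name, entry))
```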

In [14]:
SERIES_PATTERN = {
    # Real GDP, quarterly, in national currency units, seasonally adj.
    # USA URL: https://fred.stlouisfed.org/series/NAEXKP01USQ652S
    "real_gdp": {
        "prefix": "NAEXKP01",
        "suffix": "Q189S",
        "use_iso3": False,
        "suffix_overrides": {"USA": "Q652S", "UK": "Q652S"},
    },
    # Consumer Price Index, 2015=100, monthly
    # USA URL: https://fred.stlouisfed.org/series/USACPIALLMINMEI
    "cpi": {
        "prefix": "",
        "suffix": "CPIALLMINMEI",
        "use_iso3": True,
    },
    # Unemployment Rate (%), Monthly, Seasonally Adj.
    # USA URL: https://fred.stlouisfed.org/series/LRHUTTTTUSM156S
    "unemployment_rate": {
        "prefix": "LRHUTTTT",
        "suffix": "M156S",
        "use_iso3": False,
    },
    # Economic Policy Uncertainty Index, Monthly
    # USA URL: https://fred.stlouisfed.org/series/USEPUINDXM
    # Japan's series was discontinued on FRED in 2016; the original source is
    # current: https://policyuncertainty.com/japan_monthly.html
    # To be corrected later.
    "epu": {
        "prefix": "",
        "suffix": "EPUINDXM",
        "use_iso3": False,
        "suffix_overrides": {"France": "EUINDXM"},
        "iso_overrides": {"Japan": "JPN", "UK": "UK", "Canada": "CAN"},
    },
    # 10 Year Government Bond Interest Rates, Monthly
    # USA URL: https://fred.stlouisfed.org/series/IRLTLT01USM156N
    "10_yr_yld": {
        "prefix": "IRLTLT01",
        "suffix": "M156N",
        "use_iso3": False,
    },
    # 3-Month Interbank Interest Rate, Monthly
    # USA URL: https://fred.stlouisfed.org/series/IR3TIB01USM156N
    # This is the rate banks charge each other for a 90-day loan. The 3-month
    # government bond yield is not used because it is not available for all G7 countries.
    "3_mo_yld": {
        "prefix": "IR3TIB01",
        "suffix": "M156N",
        "use_iso3": False,
    },
    # OECD based Recession Indicators, Monthly
    # USA URL: https://fred.stlouisfed.org/series/USAREC
    # Discontinued since 2022 but can be used for historical analysis
    # Can find alternatives if needed
    "oecd_rec": {
        "prefix": "",
        "suffix": "REC",
        "use_iso3": True,
    },
    # Industrial Activity Index, Monthly, Seasonally Adj.
    # USA URL: https://fred.stlouisfed.org/series/USAPROINDMISMEI
    "ind_out": {
        "prefix": "",
        "suffix": "PROINDMISMEI",
        "use_iso3": True,
    },
    # Composite Consumer Confidence Amplitude, Monthly, Seasonally Adj.
    # USA URL: https://fred.stlouisfed.org/series/CSCICP03USM665S
    # Normal is 100
    "comp_consumer_conf": {
        "prefix": "CSCICP03",
        "suffix": "M665S",
        "use_iso3": False,
    },
    # Car Registration for Passenger Cars, Monthly, Seasonally Adj.
    # Values are growth rates vs. the previous period (%), not raw counts.
    # USA URL: https://fred.stlouisfed.org/series/USASLRTCR03GPSAM
    "pcar_reg": {
        "prefix": "",
        "suffix": "SLRTCR03GPSAM",
        "use_iso3": True,
    },
    # VIX - Daily data, aggregated to monthly via average
    # URL: https://fred.stlouisfed.org/series/VIXCLS
    "vix": {
        "prefix": "VIXCLS",
        "suffix": "",
        "use_iso3": False,
        "is_global": True,
        "agg_method": "avg",  # pyfredapi aggregation_method param
    },
}

### Series ID Construction

Functions to build FRED series IDs from the pattern templates and country codes.

- `build_series_id()`: Constructs a single series ID for a given indicator and country
- `build_series_dict()`: Builds a nested dictionary of all series IDs for batch fetching

In [15]:
def build_series_id(indicator: str, country: str) -> str:
    """Constructs a FRED series ID from template and country code.

    Args:
        indicator: The indicator type (e.g., 'real_gdp', 'cpi').
        country: The country name (e.g., 'USA', 'UK').

    Returns:
        The constructed FRED series ID string.

    Raises:
        ValueError: If indicator or country is not recognized.
    """
    if indicator not in SERIES_PATTERN:
        raise ValueError(f"Unknown indicator: {indicator}")
    if country not in G7_COUNTRY_CODES:
        raise ValueError(f"Unknown country: {country}")

    template = SERIES_PATTERN[indicator]
    codes = G7_COUNTRY_CODES[country]

    # Global indicators don't use country codes
    if template.get("is_global", False):
        return f"{template['prefix']}{template['suffix']}"

    # Check for country-specific ISO override first
    iso_overrides = template.get("iso_overrides", {})
    if country in iso_overrides:
        code = iso_overrides[country]
    else:
        code = codes["iso3"] if template["use_iso3"] else codes["iso2"]

    suffix_overrides = template.get("suffix_overrides", {})
    suffix = suffix_overrides.get(country, template["suffix"])

    return f"{template['prefix']}{code}{suffix}"


def build_series_dict(
    indicators: list[str] | None = None,
    countries: list[str] | None = None,
) -> dict[str, dict[str, str]]:
    """Builds a nested dictionary of FRED series IDs.

    Args:
        indicators: Indicators to include; all by default.
        countries: Countries to include; all by default.

    Returns:
        Nested dict: {indicator: {country: series_id, country2: series_id2}, indicator2: ...}
        For global indicators, all countries map to the same series ID.
    """
    indicators = indicators or list(SERIES_PATTERN.keys())
    countries = countries or list(G7_COUNTRY_CODES.keys())

    series_ids_by_indicator = {}

    for indicator in indicators:
        template = SERIES_PATTERN[indicator]
        is_global = template.get("is_global", False)

        if is_global:
            # Global indicators use the same series ID for all countries
            global_series_id = build_series_id(indicator, countries[0])
            series_ids_by_country = {country: global_series_id for country in countries}
        else:
            series_ids_by_country = {
                country: build_series_id(indicator, country) for country in countries
            }

        series_ids_by_indicator[indicator] = series_ids_by_country

    return series_ids_by_indicator

### Data Retrieval

Functions for fetching data from FRED API.

- `get_series_metadata()`: Retrieves metadata (source, frequency, units) for a single series
- `get_fred_data()`: Fetches and combines multiple series into a multi-indexed DataFrame

In [20]:
def get_series_metadata(series_id: str, api_key: str | None = None) -> dict:
    """Retrieves source metadata and release URLs for a specific data series.

    Args:
        series_id: The unique identifier for the FRED data series (e.g., 'GDPC1').
        api_key: FRED API key. If None, uses FRED_API_KEY environment variable.

    Returns:
        A dictionary containing metadata with the following keys:
            - series_title: title of the data series
            - series_notes: notes about the series
            - dataset_url: link to the dataset release
            - source_name: name of the data source organization
            - source_url: URL of the source organization
            - frequency: frequency of the data (e.g., monthly, quarterly)
            - units: units of measurement for the data values
    """
    if api_key is None:
        api_key = FRED_API_KEY
    source_dict = {}

    series_info = pf.get_series_info(series_id=series_id, api_key=api_key)
    source_dict["series_title"] = series_info.title
    source_dict["series_notes"] = series_info.notes
    source_dict["frequency"] = series_info.frequency
    source_dict["units"] = series_info.units

    release_info = pf.get_series_releases(series_id=series_id, api_key=api_key)
    release_id = release_info["releases"][0]["id"]

    source_dict["dataset_url"] = release_info["releases"][0]["link"]

    release_sources = pf.get_release_sources(release_id=release_id, api_key=api_key)
    source_dict["source_name"] = release_sources["sources"][0]["name"]
    source_dict["source_url"] = release_sources["sources"][0]["link"]

    return source_dict

def get_fred_data(
    series_dict: dict[str, dict[str, str]] | None = None,
    indicators: list[str] | None = None,
    countries: list[str] | None = None,
    start_date: str | None = "1970-01-01",
    end_date: str | None = "2020-12-31",
) -> pd.DataFrame:
    """Fetches FRED data for multiple series and compiles it into a DataFrame.

    Args:
        series_dict: Nested dictionary mapping indicator types to country-series mappings.
            If None, builds from templates using indicators/countries filters.
        indicators: List of indicators to include (used if series_dict is None).
        countries: List of countries to include (used if series_dict is None).
        start_date: The earliest date to include in the data (default: '1970-01-01').
        end_date: The latest date to include in the data (default: '2020-12-31').

    Returns:
        A DataFrame with (date, country) multi-index and indicator columns.
    """
    if series_dict is None:
        series_dict = build_series_dict(indicators, countries)

    data_frames = []
    fetched_series: dict[str, pd.DataFrame] = {}  # Cache for fetched series

    for indicator in series_dict:
        country_series = series_dict[indicator]
        template = SERIES_PATTERN[indicator]

        for country, series_id in country_series.items():
            # Use cached data if already fetched (for global indicators)
            if series_id not in fetched_series:
                try:
                    # Build optional kwargs for frequency aggregation
                    # If agg_method is defined and is_global, use monthly frequency with specified aggregation
                    kwargs = {}
                    if template.get("is_global", False) and "agg_method" in template:
                        kwargs["frequency"] = "m"
                        kwargs["aggregation_method"] = template["agg_method"]

                    series = pf.get_series(series_id=series_id, **kwargs)
                except Exception as e:
                    print(f"Error fetching data for {series_id}: {e}")
                    raise

                series_df = series[["date", "value"]].copy()
                series_df["date"] = pd.to_datetime(series_df["date"])
                series_df = series_df.set_index("date")

                fetched_series[series_id] = series_df

            df = fetched_series[series_id].copy()
            df = df.rename(columns={"value": (indicator, country)})

            data_frames.append(df)

    result = pd.concat(data_frames, axis=1)
    result.columns = pd.MultiIndex.from_tuples(
        result.columns, names=["indicator", "country"]
    )

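    # Align all series on month-start dates. Quarterly values are forward-filled
    # across the months of each quarter; the same fill also carries the last
    # observation of a shorter or discontinued series forward.
    result = result.resample("MS").first().ffill()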

    # Filter to start_date onwards
    if start_date is not None:
        result = result.loc[start_date:]

    # Filter to end_date
    if end_date is not None:
        result = result.loc[:end_date]

    result = result.stack(level="country")

    return result
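
The upsampling step at the end of `get_fred_data()` can be seen in isolation on toy data. This sketch uses made-up quarterly values; it only illustrates how `resample("MS").first().ffill()` repeats each quarterly observation across the months of its quarter.

```python
import pandas as pd

# Made-up quarterly observations dated at the start of each quarter.
quarterly = pd.Series(
    [100.0, 102.0],
    index=pd.to_datetime(["2020-01-01", "2020-04-01"]),
    name="toy_gdp",
)

# Month-start bins; months without an observation become NaN, then ffill
# carries the last known quarterly value forward.
monthly = quarterly.resample("MS").first().ffill()
print(monthly)
```

Since the values repeat within a quarter, a three-row shift on the monthly frame compares one quarter with the previous one, which is what the feature engineering below relies on.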

### Feature Engineering

Derives additional features from the raw FRED data:

- `gdp_qoq_growth`: Quarter-over-quarter GDP growth rate (%)
- `technical_rec`: Boolean flag for technical recession (two consecutive quarters of negative GDP growth)
- `yield_curve`: Spread between 10-year and 3-month yields (negative values often signal recession)
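
To make the `technical_rec` definition concrete, here is a small sketch on made-up quarterly GDP levels. It works directly on quarterly data with a one-period shift, whereas the notebook works on forward-filled monthly data and so uses a three-month shift.

```python
import pandas as pd

# Made-up quarterly real GDP levels.
gdp = pd.Series(
    [100.0, 101.0, 100.0, 99.0, 99.5],
    index=pd.date_range("2019-01-01", periods=5, freq="QS"),
)

# Quarter-over-quarter growth in percent.
growth = gdp.pct_change() * 100

# Flag a quarter when it and the previous quarter both show negative growth.
technical_rec = (growth < 0) & (growth.shift(1) < 0)

print(
    pd.DataFrame(
        {"gdp": gdp, "growth_pct": growth.round(2), "technical_rec": technical_rec}
    )
)
```

Only the fourth quarter is flagged: it is the first quarter whose own growth and the preceding quarter's growth are both negative.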

In [21]:
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derives additional features from raw FRED data.

    Args:
        df: DataFrame with (date, country) multi-index and indicator columns.

    Returns:
        DataFrame with additional derived feature columns.
    """
    df = df.copy()

    # Convert OECD recession flag to a nullable boolean. A plain astype(bool)
    # would turn missing values (NaN) into True.
    df["oecd_rec"] = df["oecd_rec"].astype("boolean")

    # Calculate quarter over quarter GDP growth rate (GDP is quarterly data)
    df["gdp_qoq_growth"] = (
        df.groupby(level="country")["real_gdp"].pct_change(periods=3) * 100
    )

    # Technical recession: two consecutive quarters of negative GDP growth
    df["technical_rec"] = (df["gdp_qoq_growth"] < 0) & (
        df.groupby(level="country")["gdp_qoq_growth"].shift(3) < 0
    )

    # Spread between 10-year and 3-month yields
    # This being negative is a commonly used metric for recession
    df["yield_curve"] = df["10_yr_yld"] - df["3_mo_yld"]

    return df
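
One pandas detail matters when casting a 0/1 indicator such as `oecd_rec` to booleans: a plain `astype(bool)` turns missing values into `True`, while the nullable `"boolean"` dtype keeps them missing. A short sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up indicator values with one missing observation.
flags = pd.Series([1.0, 0.0, np.nan])

# NumPy bool has no missing value, and NaN is truthy, so it becomes True.
print(flags.astype(bool).tolist())  # [True, False, True]

# The nullable dtype keeps the gap as <NA>, so it can be filled or dropped later.
nullable = flags.astype("boolean")
print(nullable.isna().tolist())  # [False, False, True]
```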

### Fetch and Display Data

Execute the data pipeline: fetch from FRED, engineer features, and display sample outputs.

In [22]:
# Fetch data from FRED and apply feature engineering
df = get_fred_data()
df = engineer_features(df)

# Scale GDP to billions for readability
df_display = df.copy()
df_display["real_gdp"] = df_display["real_gdp"] / 1e9

# Display recent data (last 14 rows = 2 months for all 7 countries)
print("\n\nDataFrame Tail (GDP in billions):")
print(df_display.round(2).tail(14))

# Can also unstack for super wide date, country multi-index (not recommended)
# print("\n\nDataFrame Tail (Unstacked, GDP in billions):")
# print(df_display.unstack(level="country").round(2).tail(10))

# Can also fetch only specific indicators/countries
# df_gdp_subset = get_fred_data(indicators=["real_gdp"], countries=["USA", "UK"])

# Display data from specific historical dates
print("\n\nData from specific dates:")
specific_dates = ["1980-01-01", "2000-01-01", "2020-01-01"]
print(df_display.loc[specific_dates].round(2))

# Display data from 2008 financial crisis period
print("\n\nData from 2008 financial crisis period:")
print(df_display.loc["2008-06-01":"2009-06-01"].round(2))



DataFrame Tail (GDP in billions):
indicator            real_gdp     cpi  unemployment_rate     epu  10_yr_yld  \
date       country                                                            
2020-11-01 Canada      546.81  108.80                8.6  455.05       0.69   
           France      559.95  104.73                8.1  403.89      -0.32   
           Germany     796.47  105.11                3.8  387.49      -0.61   
           Italy       404.56  102.40                9.5  279.39       0.66   
           Japan    134994.38  101.30                3.0  110.08       0.03   
           UK          514.53  109.10                5.3  313.84       0.38   
           USA       20724.13  109.79                6.7  246.71       0.87   
2020-12-01 Canada      546.81  108.56                8.9  371.18       0.73   
           France      559.95  104.96                7.9  257.27      -0.34   
           Germany     796.47  105.22                3.8  227.94      -0.62   
           Italy

In [31]:
df_display.head()

| date | country | real_gdp | cpi | unemployment_rate | epu | 10_yr_yld | 3_mo_yld | oecd_rec | ind_out | comp_consumer_conf | pcar_reg | vix | gdp_qoq_growth | technical_rec | yield_curve |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1970-01-01 | Canada | 150.072 | 15.95997 | 4.5 | NaN | 8.34 | 9.065 | True | 40.41944 | NaN | -4.446483 | NaN | NaN | False | -0.725 |
| 1970-01-01 | France | NaN | 14.31504 | NaN | NaN | 8.74 | 10.35 | False | 60.02731 | NaN | -46.243902 | NaN | NaN | False | -1.61 |
| 1970-01-01 | Germany | NaN | 29.29181 | NaN | NaN | 7.6 | 9.29 | False | 49.30363 | NaN | -5.060277 | NaN | NaN | False | -1.69 |
| 1970-01-01 | Italy | NaN | 5.543876 | NaN | NaN | NaN | NaN | False | 68.41165 | NaN | 6.030313 | NaN | NaN | False | NaN |
| 1970-01-01 | Japan | NaN | 30.84754 | 1.1 | NaN | NaN | NaN | False | 45.59721 | NaN | 8.898026 | NaN | NaN | False | NaN |


In [23]:
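def clean_and_transform_data(df: pd.DataFrame, start_year: str = "1995-01-01") -> pd.DataFrame:
    """
    Cleans raw FRED data, applies stationarity transformations,
    and fixes missing recession labels.

    Args:
        df: Output of engineer_features(), indexed by (date, country).
        start_year: Earliest date to keep, as a date string (e.g., '1995-01-01').
    """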
    df = df.copy()
    
    # ---------------------------------------------------------
    # 1. STATIONARITY TRANSFORMATIONS (Make data model-ready)
    # ---------------------------------------------------------
    
    # GDP: QoQ growth already exists from engineer_features(); YoY is often smoother for ML
    # Year-over-year growth % (12-row lag, since the data is resampled to monthly)
    df['gdp_yoy_growth'] = df.groupby(level='country')['real_gdp'].pct_change(12) * 100
    
    # CPI: Convert Index -> Inflation Rate (YoY %)
    df['inflation_rate'] = df.groupby(level='country')['cpi'].pct_change(12) * 100
    
    # Industrial Output: Index -> YoY % Change
    df['ind_out_growth'] = df.groupby(level='country')['ind_out'].pct_change(12) * 100
    
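    # Car Registrations: 12-month % change of pcar_reg. Note that pcar_reg is itself a
    # period-over-period growth rate (it takes negative values), not a raw count, so this
    # ratio can become extreme or infinite; fix_data_integrity() handles those values.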
    df['cars_yoy_growth'] = df.groupby(level='country')['pcar_reg'].pct_change(12) * 100
    
    # Unemployment: Use the Change in Rate (Sahm Rule logic)
    # Are things getting worse? (Current Rate - Rate 12 months ago)
    df['unemp_change_yoy'] = df['unemployment_rate'] - df.groupby(level='country')['unemployment_rate'].shift(12)
    
    # Yield spread (10Y - 3M); same definition as 'yield_curve' in engineer_features()
    df['yield_spread'] = df['10_yr_yld'] - df['3_mo_yld']

    # ---------------------------------------------------------
    # 2. TARGET VARIABLE ENGINEERING (The Proxy Fix)
    # ---------------------------------------------------------
    
    # Calculate Technical Recession (2 consecutive quarters of negative GDP growth)
    # Note: We use shift(3) because GDP is quarterly (every 3 months in our filled data)
    df['gdp_qoq'] = df.groupby(level='country')['real_gdp'].pct_change(3)
    df['is_tech_rec'] = (df['gdp_qoq'] < 0) & (df.groupby(level='country')['gdp_qoq'].shift(3) < 0)
    
    # Patch 'oecd_rec': If it is NaN (missing recent data), use the Technical Recession flag
    df['oecd_rec'] = df['oecd_rec'].fillna(df['is_tech_rec'])
    
    # Convert boolean/object to integer (1/0) for XGBoost
    df['oecd_rec'] = df['oecd_rec'].astype(int)

    # ---------------------------------------------------------
    # 3. CLEANING & TRUNCATION
    # ---------------------------------------------------------
    
    # Select only the features we want for the model (Drop raw indices)
    features_to_keep = [
        'oecd_rec',          # TARGET
        'yield_spread',      # Feature
        'gdp_yoy_growth',    # Feature
        'inflation_rate',    # Feature
        'unemp_change_yoy',  # Feature
        'ind_out_growth',    # Feature
        'cars_yoy_growth',   # Feature
        'epu',               # Feature (Policy Uncertainty)
        'comp_consumer_conf',# Feature (Confidence)
        'vix'                # Feature (Volatility)
    ]
    
    # Filter columns
    df_clean = df[features_to_keep]
    
    # Drop data before the start_year (to remove early eras with many NaNs)
    df_clean = df_clean.loc[start_year:]
    
    # Forward Fill any remaining small gaps (e.g., if EPU is missing for 1 month)
    df_clean = df_clean.groupby(level='country').ffill()
    
    # Drop any remaining rows that still have NaNs (Model cannot handle them)
    df_final = df_clean.dropna()
    
    return df_final

# --- APPLY IT ---
df_clean = clean_and_transform_data(df, start_year="1995-01-01")
print(df_clean.describe())

indicator     oecd_rec  yield_spread  gdp_yoy_growth  inflation_rate  \
count      2049.000000   2049.000000     2049.000000     2049.000000   
mean          0.382138      1.162832        1.377941        1.555788   
std           0.486029      1.138796        2.736143        1.074660   
min           0.000000     -2.316680      -21.944151       -2.558867   
25%           0.000000      0.369000        0.749774        0.907686   
50%           0.000000      1.075500        1.824159        1.616619   
75%           1.000000      1.911143        2.791475        2.222239   
max           1.000000      5.572318        5.944910        5.600125   

indicator  unemp_change_yoy  ind_out_growth  cars_yoy_growth          epu  \
count           2049.000000     2049.000000      2049.000000  2049.000000   
mean              -0.089946        0.656203              NaN   141.979127   
std                0.947131        5.574486              NaN    96.488360   
min               -2.900000      -43.951622
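
The year-over-year transforms rely on `groupby(level='country')` so that each country's series is compared only with its own value 12 rows earlier. A minimal sketch on a made-up two-country panel (the country names `A` and `B` and the `cpi` values are illustrative):

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=13, freq="MS")
index = pd.MultiIndex.from_product([dates, ["A", "B"]], names=["date", "country"])

# Flat index levels for 12 months, then a rise in the 13th month.
levels = {"A": [100.0] * 12 + [102.0], "B": [200.0] * 12 + [210.0]}
panel = pd.DataFrame(
    {"cpi": [levels[country][i] for i in range(13) for country in ("A", "B")]},
    index=index,
)

# 12-month percent change computed within each country.
panel["inflation_rate"] = panel.groupby(level="country")["cpi"].pct_change(12) * 100
print(panel.tail(2))
```

The first 12 months of each country are NaN because there is no observation a year earlier, which is one reason rows are lost when the cleaned frame is passed through `dropna()`.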

In [27]:
df_clean.head()

| date | country | oecd_rec | yield_spread | gdp_yoy_growth | inflation_rate | unemp_change_yoy | ind_out_growth | cars_yoy_growth | epu | comp_consumer_conf | vix |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1995-01-01 | Canada | 1 | 1.511524 | 4.458118 | 0.580722 | -1.8 | 10.258572 | -138.471726 | 194.27143 | 98.954867 | 12.27 |
| 1995-01-01 | France | 0 | 2.2977 | 3.152749 | 1.719902 | -0.7 | 6.680106 | -8851.831831 | 150.71796 | 100.912 | 12.27 |
| 1995-01-01 | Germany | 1 | 2.435909 | 1.790475 | 2.304741 | -0.2 | 4.023834 | -132.072985 | 107.05698 | 99.79076 | 12.27 |
| 1995-01-01 | USA | 1 | 1.54 | 3.481406 | 2.804371 | -1.0 | 6.745325 | -343.950529 | 107.95494 | 101.1366 | 12.27 |
| 1995-02-01 | Canada | 1 | 0.7835 | 4.458118 | 1.873542 | -1.5 | 9.969476 | 295.685744 | 126.45159 | 98.723338 | 11.47 |


In [25]:
def fix_data_integrity(df: pd.DataFrame) -> pd.DataFrame:
    """Cleans extreme car-growth values and relabels clear-growth months as non-recession."""
    df = df.copy()
    
    # --- FIX 1: Infinite Values in Cars ---
    # Replace inf/-inf with NaN, then clip and fill
    df['cars_yoy_growth'] = df['cars_yoy_growth'].replace([np.inf, -np.inf], np.nan)
    
    # Clip extreme outliers (growth beyond +/-100% is likely noise/base effects)
    # We clip to +/- 100% to keep the sign of the signal but remove the explosion
    df['cars_yoy_growth'] = df['cars_yoy_growth'].clip(lower=-100.0, upper=100.0)
    
    # Fill remaining NaNs (if any) with the median of that country
    df['cars_yoy_growth'] = df['cars_yoy_growth'].fillna(
        df.groupby(level='country')['cars_yoy_growth'].transform('median')
    )

    # --- FIX 2: Over-flagged Recession Labels ---
    # Reset the label to 0 wherever YoY GDP growth exceeds 0.5%
    mask_obvious_growth = df['gdp_yoy_growth'] > 0.5
    df.loc[mask_obvious_growth, 'oecd_rec'] = 0
    
    return df

# Apply the fix
df_ready = fix_data_integrity(df_clean)

# Check the new "Recession Rate" - it should be closer to 0.10 - 0.15
print(f"New Recession Rate: {df_ready['oecd_rec'].mean():.4f}")
print(df_ready.describe())

New Recession Rate: 0.1279
indicator     oecd_rec  yield_spread  gdp_yoy_growth  inflation_rate  \
count      2049.000000   2049.000000     2049.000000     2049.000000   
mean          0.127867      1.162832        1.377941        1.555788   
std           0.334023      1.138796        2.736143        1.074660   
min           0.000000     -2.316680      -21.944151       -2.558867   
25%           0.000000      0.369000        0.749774        0.907686   
50%           0.000000      1.075500        1.824159        1.616619   
75%           0.000000      1.911143        2.791475        2.222239   
max           1.000000      5.572318        5.944910        5.600125   

indicator  unemp_change_yoy  ind_out_growth  cars_yoy_growth          epu  \
count           2049.000000     2049.000000      2049.000000  2049.000000   
mean              -0.089946        0.656203       -45.744243   141.979127   
std                0.947131        5.574486        76.481580    96.488360   
min             
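
The infinite and extreme values addressed in Fix 1 are what `pct_change` produces when the underlying series passes through zero or changes sign, which a series that is itself a growth rate (as the negative `pcar_reg` values in the earlier output suggest) routinely does. A sketch on made-up numbers, followed by the same replace-then-clip treatment:

```python
import numpy as np
import pandas as pd

# Made-up monthly growth rates (%) that cross zero.
growth_rate = pd.Series([2.0, 0.0, -1.0, 0.5])

# Percent change of a rate: dividing by 0 gives -inf, and a sign change
# gives a value beyond -100%.
change = growth_rate.pct_change() * 100
print(change.tolist())

# Same treatment as Fix 1: turn infinities into NaN, then clip to +/-100.
cleaned = change.replace([np.inf, -np.inf], np.nan).clip(lower=-100.0, upper=100.0)
print(cleaned.tolist())
```

After clipping, most of the surviving values sit exactly at the bounds, so the column carries little more than a sign.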

In [28]:
def prepare_supervised_data(df: pd.DataFrame, target_col: str = 'oecd_rec', lag_months: int = 6) -> pd.DataFrame:
    """
    Lags all feature columns by `lag_months` to create a predictive dataset.
    The Target (Y) stays at time T.
    The Features (X) come from time T - lag_months.
    """
    df_lagged = df.copy()
    
    feature_cols = [c for c in df.columns if c != target_col]
    
    # Shift features forward by lag_months (so X_{t-lag_months} aligns with Y_t)
    # GroupBy country ensures we don't shift Canada data into France rows
    df_lagged[feature_cols] = df_lagged.groupby(level='country')[feature_cols].shift(lag_months)
    
    # Drop each country's first `lag_months` rows (their shifted features are NaN)
    df_lagged = df_lagged.dropna()
    
    return df_lagged

# Create the dataset for "6-Month Early Warning"
df_model = prepare_supervised_data(df_ready, lag_months=6)

# Final Check
print("Model Data Shape:", df_model.shape)


Model Data Shape: (2007, 10)
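
The alignment produced by `prepare_supervised_data` can be checked on a made-up panel: after the shift, each row pairs the label at time T with the feature observed `lag_months` rows earlier for the same country. The sketch below uses a one-period lag and illustrative column names (`label`, `feature`).

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=3, freq="MS")
index = pd.MultiIndex.from_product([dates, ["A", "B"]], names=["date", "country"])
panel = pd.DataFrame(
    {"label": [0, 0, 0, 1, 1, 1], "feature": [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]},
    index=index,
)

# Shift features within each country so the feature at T-1 sits next to the label at T.
lagged = panel.copy()
lagged["feature"] = lagged.groupby(level="country")["feature"].shift(1)

# The first row of each country has no earlier feature and is dropped.
lagged = lagged.dropna()
print(lagged)
```

Country `A`'s features (1.0, 2.0) and country `B`'s features (10.0, 20.0) stay within their own rows, and one row per country is lost per lag period. That matches the drop from 2049 to 2007 rows for a six-month lag across seven countries.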


In [29]:
df_model.head()

| date | country | oecd_rec | yield_spread | gdp_yoy_growth | inflation_rate | unemp_change_yoy | ind_out_growth | cars_yoy_growth | epu | comp_consumer_conf | vix |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1995-07-01 | Canada | 0 | 1.511524 | 4.458118 | 0.580722 | -1.8 | 10.258572 | -100.0 | 194.27143 | 98.954867 | 12.27 |
| 1995-07-01 | France | 0 | 2.2977 | 3.152749 | 1.719902 | -0.7 | 6.680106 | -100.0 | 150.71796 | 100.912 | 12.27 |
| 1995-07-01 | Germany | 0 | 2.435909 | 1.790475 | 2.304741 | -0.2 | 4.023834 | -100.0 | 107.05698 | 99.79076 | 12.27 |
| 1995-07-01 | USA | 0 | 1.54 | 3.481406 | 2.804371 | -1.0 | 6.745325 | -100.0 | 107.95494 | 101.1366 | 12.27 |
| 1995-08-01 | Canada | 0 | 0.7835 | 4.458118 | 1.873542 | -1.5 | 9.969476 | 100.0 | 126.45159 | 98.723338 | 11.47 |
