---
title: "Downstream Risk Implications of Derivative Pricing Models: An End-to-End Market Risk Modelling Study"
format:
  pdf:
    toc: true
    toc-depth: 3
    geometry:
      - margin=15mm
---

## Section 1: Introduction and Summary

**1.1. Project Overview and Objective**

Market risk modelling for equity derivatives relies fundamentally on the interaction between upstream pricing models and downstream risk analytics, where valuation engines serve as inputs into P&L simulation, exposure profiling, and regulatory risk measurement. While derivative pricing models are often evaluated on valuation accuracy in isolation, their distributional assumptions and dynamic structures play a critical role in shaping downstream risk estimates used for trading book risk management and capital assessment.

The objective of this study is to systematically examine the downstream risk implications of three widely used equity derivative pricing frameworks—Monte-Carlo simulation, Black-Scholes-Merton (BSM), and the Heston stochastic volatility model—by benchmarking how their respective modelling assumptions propagate into market risk measures and factor-based risk diagnostics. The analysis is conducted within an end-to-end market risk modelling pipeline, spanning price generation, P&L simulation, and tail-risk evaluation. Two parallel datasets are employed: a proxy-based dataset designed to support methodological transparency and regulatory defensibility, and an ETF-based dataset used for external price anchoring and empirical realism.

Rather than proposing a novel pricing methodology, this research adopts a risk-centric perspective, focusing on comparative robustness, distributional fidelity, and stability of downstream risk metrics under realistic data, calibration, and governance constraints. The study is explicitly framed to reflect the considerations faced by market risk, model validation, and regulatory risk teams, where the primary concern is not pricing optimality in isolation, but the coherence, interpretability, and reliability of risk estimates derived from model-implied price dynamics.


**1.2. Overview of the Models Considered**

The three models examined in this study span a spectrum of theoretical assumptions and practical complexity:

- Black-Scholes-Merton (BSM) represents the classical closed-form solution for European option pricing, assuming log-normally distributed asset prices, constant volatility, frictionless markets, and continuous hedging. Despite well-documented empirical violations of these assumptions, BSM remains an industry benchmark due to its interpretability, analytical tractability, and historical adoption.

- The Heston stochastic volatility model extends the BSM framework by introducing a mean-reverting stochastic process for volatility, thereby capturing volatility clustering and leverage effects observed in real markets. While more flexible than BSM, the Heston model introduces calibration complexity, numerical stability considerations, and additional parameter uncertainty.

- Monte-Carlo simulation, implemented in this study using a bootstrap-based, non-parametric approach, avoids strong parametric assumptions about return distributions. Instead, it relies on empirically observed return dynamics to generate future price paths, allowing for skewness, excess kurtosis, and regime-specific behavior to be preserved in simulated outcomes.

Together, these models enable a structured comparison between analytical tractability, stochastic realism, and empirically driven distributional modeling. Given the dual data architecture adopted in this study—European-style proxy options for model development and American-style ETF options for external benchmarking—the **Longstaff–Schwartz (LSMC) methodology for approximating optimal early-exercise decisions in American options** is employed solely as a conditional early-exercise valuation operator where required. Importantly, LSMC is not treated as an independent pricing model, nor is it used in downstream risk generation; its role is limited to ensuring comparability and pricing realism when benchmarking against observed ETF option markets.
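
As a rough illustration of the bootstrap-based Monte-Carlo idea described above, the sketch below resamples historical daily log returns with replacement and compounds them into terminal prices. The helper name `bootstrap_terminal_prices`, the plain i.i.d. resampling scheme, and the synthetic return history are assumptions made for illustration only; the implementation used in this study may differ (for example, by resampling blocks of returns).

```python
import numpy as np

def bootstrap_terminal_prices(log_returns, spot, horizon_days, n_paths, seed=0):
    """Simulate terminal prices by resampling observed daily log returns.

    Hypothetical helper: plain i.i.d. bootstrap (sampling with replacement).
    """
    rng = np.random.default_rng(seed)
    log_returns = np.asarray(log_returns, dtype=float)
    # Draw one historical return per path and per day, with replacement
    draws = rng.choice(log_returns, size=(n_paths, horizon_days), replace=True)
    # Sum the log returns along each path and map back to price space
    return spot * np.exp(draws.sum(axis=1))

# Synthetic fat-tailed history, used only to keep the example self-contained
rng = np.random.default_rng(42)
history = 0.01 * rng.standard_t(df=4, size=2_000)
terminal = bootstrap_terminal_prices(history, spot=100.0, horizon_days=21, n_paths=10_000)
print(f"mean={terminal.mean():.2f}, 1% quantile={np.quantile(terminal, 0.01):.2f}")
```

An i.i.d. bootstrap of this kind carries over the skewness and excess kurtosis of the sampled returns, but it does not by itself reproduce serial dependence such as volatility clustering.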


**1.3. Data Architecture and Dual-ETL Design**

A key design feature of this project is the construction of two distinct ETL pipelines, motivated by data availability constraints and regulatory considerations.

- The proxy-based dataset is constructed using liquid, publicly available equity and index data to create synthetic underlying assets suitable for model development, stress testing, and out-of-sample evaluation. This pipeline emphasizes transparency, reproducibility, long historical coverage, and suitability for statistical testing and regulatory defense. The proxy dataset serves as the primary foundation for model calibration, simulation, and risk evaluation. The proxy architecture uses European-style options.

- The ETF-based dataset sources option-relevant price information derived from exchange-traded funds tracking the same or closely related underlying assets. This pipeline is used exclusively for external price comparison, visual benchmarking, and plausibility checks of model-implied option prices. Due to limitations associated with free and publicly accessible option datasets for European options, the ETF architecture leverages American options.

European and American options share the same underlying stochastic dynamics, but American options add the complexity of exercise timing: a European option can be exercised only at maturity, whereas an American option can be exercised at any time up to and including the contract's maturity date. The proxy-based architecture is therefore used to assess the realism and stability of alternative stochastic volatility specifications under controlled European-style contracts. These findings inform, but do not determine, the interpretation of results in the ETF-based American option framework, where all candidate dynamics are evaluated under a common early-exercise valuation methodology against observed market prices.

In short, $\text{European option} = \text{stochastic modelling}$, whereas $\text{American option} = \text{stochastic modelling} + \text{exercise-time handling}$. The LSMC methodology is used to approximate optimal early-exercise decisions.
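
To make the early-exercise step concrete, the following is a minimal sketch of the Longstaff-Schwartz regression idea for an American put. It assumes geometric Brownian motion paths, a quadratic polynomial basis, and hypothetical parameter values; none of these choices are taken from the study's implementation, where LSMC acts only as an exercise operator on top of each candidate model's own paths.

```python
import numpy as np

def lsmc_american_put(s0, strike, r, sigma, maturity, n_steps=50, n_paths=20_000, seed=0):
    """Longstaff-Schwartz estimate of an American put price under GBM (illustrative)."""
    rng = np.random.default_rng(seed)
    dt = maturity / n_steps
    disc = np.exp(-r * dt)
    # Simulate risk-neutral GBM paths; column j holds prices at time (j + 1) * dt
    z = rng.standard_normal((n_paths, n_steps))
    log_paths = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1)
    paths = s0 * np.exp(log_paths)
    # Start from the payoff at maturity and roll back through the exercise dates
    cashflow = np.maximum(strike - paths[:, -1], 0.0)
    for j in range(n_steps - 2, -1, -1):
        cashflow *= disc  # discount future cash flows back to time (j + 1) * dt
        s = paths[:, j]
        exercise = np.maximum(strike - s, 0.0)
        itm = exercise > 0
        if itm.sum() > 3:
            # Regress discounted future cash flows on the spot (in-the-money paths only)
            coeffs = np.polyfit(s[itm], cashflow[itm], deg=2)
            continuation = np.polyval(coeffs, s[itm])
            # Exercise where the immediate payoff beats the estimated continuation value
            idx = np.where(itm)[0][exercise[itm] > continuation]
            cashflow[idx] = exercise[idx]
    # Discount from the first exercise date to time zero; allow immediate exercise
    return max(strike - s0, disc * cashflow.mean())

price = lsmc_american_put(s0=36.0, strike=40.0, r=0.06, sigma=0.2, maturity=1.0)
print(f"LSMC American put estimate: {price:.3f}")
```

The same backward regression can in principle be applied to paths generated by any of the candidate dynamics, which is what allows a common early-exercise treatment across models.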


**1.4. Statistical Diagnostics and Distributional Testing**

Prior to model implementation, extensive statistical diagnostics are applied to the proxy-based dataset to assess the validity of common modeling assumptions. These diagnostics include:

- Tests for normality and log-normality of returns
- Evaluation of skewness and excess kurtosis
- Analysis of volatility clustering and regime dependence

Empirical evidence of non-Gaussian return behavior, fat tails, and asymmetry motivates the inclusion of a bootstrap-based Monte-Carlo framework and informs the interpretation of results obtained from parametric models such as BSM and Heston.
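
The diagnostics listed above can be sketched with a few moment-based statistics. The function below is an illustrative assumption, not the study's test suite: it computes sample skewness, excess kurtosis, the Jarque-Bera statistic, and the lag-1 autocorrelation of squared returns as a simple volatility-clustering indicator, and it is applied here to synthetic data.

```python
import numpy as np

def return_diagnostics(returns, lag=1):
    """Moment-based checks for non-Gaussian behavior in a return series."""
    x = np.asarray(returns, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered**2)
    skew = np.mean(centered**3) / m2**1.5
    excess_kurt = np.mean(centered**4) / m2**2 - 3.0
    # Jarque-Bera: asymptotically chi-squared with 2 degrees of freedom under normality
    jb = n / 6.0 * (skew**2 + excess_kurt**2 / 4.0)
    # Autocorrelation of squared returns as a rough volatility-clustering proxy
    sq = centered**2
    acf_sq = np.corrcoef(sq[:-lag], sq[lag:])[0, 1]
    return {
        "skew": skew,
        "excess_kurtosis": excess_kurt,
        "jarque_bera": jb,
        "acf_sq_returns": acf_sq,
    }

# Synthetic samples: one Gaussian, one fat-tailed (Student-t with 4 degrees of freedom)
rng = np.random.default_rng(0)
diag_gauss = return_diagnostics(rng.standard_normal(5_000))
diag_fat = return_diagnostics(rng.standard_t(df=4, size=5_000))
print(diag_gauss)
print(diag_fat)
```

A Jarque-Bera statistic well above 5.99, the 5% critical value of a chi-squared distribution with two degrees of freedom, points away from normality.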


**1.5. Out-of-Sample Evaluation Framework**

Model evaluation is conducted using a clearly defined out-of-sample (OOS) framework, designed to assess robustness rather than in-sample fit. OOS parameters are selected to capture multiple dimensions of model performance, including:

- Pricing stability across market regimes
- Tail risk behavior and downside sensitivity
- Risk decomposition and factor exposure consistency

The evaluation framework emphasizes comparative behavior under identical inputs, ensuring that differences in outcomes can be attributed to model structure rather than data artifacts.


**1.6. Implementation Methodology (Summary)**

The overall implementation methodology follows a structured and reproducible sequence:

 1. Construction of proxy-based and ETF-based ETL pipelines 
 2. Statistical diagnostics and assumption testing on proxy data 
 3. Independent implementation of Monte-Carlo, BSM, and Heston models 
 4. Model calibration and numerical validation, where applicable
 5. ETF-based benchmarking for external pricing plausibility and realism 
 6. Proxy-based out-of-sample risk and stability evaluation (including Barra-style risk identification and other OOS risk metrics like CVaR, VaR, TLF, ESD, etc.) 
 7. Comparative analysis and final research findings across all models and evaluation metrics

Step 1 builds both pipelines. Steps 2 to 4 and 6 use proxy data, while step 5 uses ETF data to check external price plausibility and realism. All downstream risk and stability analyses are conducted using the proxy-based architecture.
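
For the tail-risk metrics named in step 6, a minimal historical-simulation sketch of VaR and CVaR is shown below. The function name, the sign convention (losses reported as positive numbers), and the toy P&L vector are assumptions for illustration; the study's own estimators may use a different quantile convention.

```python
import numpy as np

def var_cvar(pnl, alpha=0.99):
    """Historical-simulation VaR and CVaR at confidence level alpha.

    Losses are reported as positive numbers; pnl holds simulated P&L outcomes.
    """
    losses = -np.asarray(pnl, dtype=float)
    var = np.quantile(losses, alpha)
    # CVaR (expected shortfall): average loss at or beyond the VaR threshold
    cvar = losses[losses >= var].mean()
    return var, cvar

# Toy P&L vector, used only to show the mechanics
pnl = np.array([-10.0, -5.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
var_90, cvar_90 = var_cvar(pnl, alpha=0.90)
print(var_90, cvar_90)
```

By construction CVaR is at least as large as VaR at the same confidence level, which is why it reacts more strongly to differences in the tails of model-implied P&L distributions.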

**1.7. Summary of Findings**

**TO BE FILLED IN THE END**


**1.8. Compliance with SR 11-7 Model Risk and Governance Framework**

**TO BE FILLED IN THE END**


**Disclaimer**

This document is intended solely for academic, educational, and research purposes. The models, methodologies, data sources, and results presented herein are illustrative in nature and are not intended to constitute investment advice, trading recommendations, or financial forecasts. Any references to market instruments, prices, or returns are for analytical demonstration only. The analyses rely on publicly available data and simplifying assumptions, and may not fully capture real-world market frictions, liquidity constraints, transaction costs, or institutional trading considerations. This document does not intend to provide representations regarding the suitability of any model or result for live trading, risk management, or regulatory capital determination.

In [1]:
# Import necessary packages
import os, time, requests, json
from dotenv import load_dotenv
from contextlib import contextmanager
from IPython.display import display
import pandas as pd, numpy as np, matplotlib.pyplot as plt, yfinance as yf

# Initializing environment
load_dotenv()
zero_coupon_ecb = os.getenv('zero_coupon_ecb')
zero_coupon_fred = os.getenv('zero_coupon_fred')
proxy_master = os.getenv('proxy_master')
etf_master = os.getenv('etf_master')
fred_api_key = os.getenv('fred_api_key')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.4f}'.format)

# Defining time context block
@contextmanager
def time_block(label: str='Block'):
    start_time = time.perf_counter()
    try:
        yield
    finally:
        end_time = time.perf_counter()
        runtime = end_time - start_time
        print(f'{label} execution time: {runtime/60:.2f} minutes.')


## Section 2: Implementation of ETL Pipelines

**2.1. Overview**

This project adopts a dual data-architecture strategy to balance **economic correctness**, **empirical validation**, and **data accessibility constraints**:

- **Proxy-based architecture**:
  This architecture constructs a synthetic yet economically grounded state-space of contingent claims, designed to support model development, assumption testing, and controlled out-of-sample (OOS) evaluation. The approach deliberately decouples contract specification from market-quoted option chains by defining options across fixed grids of moneyness, maturity, and payoff type, while anchoring all valuations to observed underlying prices, realized volatility proxies, risk-free term structures, and dividend yield assumptions. By holding the contract space and input state variables constant, the framework enables clean comparative benchmarking of pricing models, ensuring that observed differences in valuation and risk metrics are attributable to model dynamics rather than data availability, liquidity effects, or market microstructure noise. This proxy-based design emphasizes transparency, economic interpretability, numerical stability, and robustness under model governance and regulatory scrutiny, consistent with principles outlined in formal model-risk management frameworks.

- **ETF-based architecture**: 
  This architecture leverages exchange-traded fund (ETF) option chains as market-anchored representations of contingent claims, designed to support empirical validation, calibration realism, and external consistency checks against observed option prices. The approach preserves the contract specifications embedded in traded ETF options, including listed strikes, standardized maturities, and observed call/put structures, while anchoring valuations to market-implied prices, bid–ask dynamics, and prevailing liquidity conditions. As a result, the ETF-based framework reflects the joint influence of investor demand, hedging activity, volatility risk premia, and market microstructure effects inherent in real-world options markets. By operating directly on quoted option chains, the architecture enables assessment of absolute pricing accuracy, calibration stability, and model fit to observed market prices, complementing the controlled benchmarking enabled by the proxy-based framework. Differences between model-implied and market-observed prices can therefore be interpreted in the context of model assumptions, volatility dynamics, and unmodeled risk premia, rather than purely synthetic contract design. This ETF-based design emphasizes market realism, empirical validity, and external benchmark alignment, providing a necessary counterbalance to synthetic architectures and supporting robust model validation, sensitivity analysis, and governance-oriented performance assessment when used in conjunction with controlled proxy-based evaluations.

- **Why a dual-architecture setup**:
  The dual-architecture design addresses the absence of licensed market data feeds (e.g., Bloomberg, Refinitiv, Eurex) while maintaining methodological rigor consistent with institutional model development practices. Early stages of model development—including assumption testing, sensitivity analysis, and numerical stability checks—require controlled experimentation, isolation of effects, and a repeatable state-space. These requirements are satisfied by the proxy-based architecture, but cannot be reliably met using a purely ETF-based framework due to changing contract availability, liquidity effects, and embedded market risk premia. In contrast, later stages of model calibration and out-of-sample (OOS) validation benefit from direct exposure to market-quoted option prices, where an ETF-based architecture provides a realistic representation of observed pricing dynamics that a proxy-based framework cannot replicate. This separation is consistent with formal model-risk management guidance. In particular, SR 11-7 states that *“model validation should employ a combination of theoretical evaluation, controlled testing, and benchmarking against observed outcomes.”* The dual-architecture setup operationalizes this principle by explicitly separating controlled model diagnostics from market-anchored validation, ensuring both interpretability and empirical relevance.

>**Notes**: 
>European and American options share the same underlying stochastic dynamics, but American options add the complexity of exercise timing: a European option can be exercised ONLY at maturity, whereas an American option can be exercised at any time up to and including the contract's maturity date. The proxy-based architecture is therefore used to assess the realism and stability of alternative stochastic volatility specifications under controlled European-style contracts. These findings inform, but do not determine, model preference in the ETF-based American option framework, where all candidate dynamics are evaluated under a common early-exercise methodology against observed market prices. Accordingly, the proxy architecture uses closing prices of "EXSA.DE (iShares STOXX Europe 600 UCITS ETF (DE) EUR (Dist))", and the ETF architecture uses closing prices and the option chain of "SPY (State Street SPDR S&P 500 ETF Trust)", with the dividend yield held constant.
>Fundamentally, the proxy architecture answers the question **"Given a clean, European, long-history equity proxy, how do different stochastic dynamics compare relative to each other in terms of realism, stability, and numerical behavior?"** whereas the ETF architecture answers the question **"Given observable option markets on a different but comparable equity benchmark, do those relative model characteristics persist when confronted with real prices?"**.


**2.2. ETL Pipeline**

In alignment with the SR 11-7 Model Risk Management framework, the project implements the following standardized ETL pipeline for both data architectures:

- **Step 1: Data Acquisition:** Sourcing raw market observables from documented, reproducible open-source channels.

- **Step 2: Data Ingestion:** Structured loading into analysis-ready data containers with version control and timestamping.

- **Step 3: Data Cleansing:** Handling missing values, calendar alignment, corporate action adjustments, and data sanity checks.

- **Step 4: Feature Engineering (as required):** Construction of derived quantities such as log returns, realized volatility measures, and term-structure interpolations.

- **Step 5: Rendering Model-Ready Datasets:** Final transformation into inputs compatible with Black–Scholes–Merton, Heston, and Monte Carlo pricing frameworks.

This pipeline separation ensures traceability, auditability, and reproducibility—key SR 11-7 requirements.


**2.3. Data Acquisition Sources and Rationale: Proxy-Based Architecture**

Under the proxy-based architecture, the following data components are sourced to construct a **theoretically consistent index-level pricing environment**. These inputs are used for model calibration, assumption testing, and controlled comparative analysis rather than direct replication of observed option prices.

Main Area 1: Underlying Index Prices - *Used to compute log returns, realized volatility, and to define the underlying state variable.*

- Daily closing prices - *Source: Yahoo Finance – EXSA.DE*  
- Trading calendar and date alignment - *Derived from ETF price time series*

Main Area 2: Realized Volatility Proxy - *Used to characterize historical volatility dynamics and provide data-driven inputs and initial conditions for stochastic volatility models.*

- Rolling realized volatility (21 days) - *Engineered from log returns derived in Main Area 1*  
- Rolling realized volatility (63 days) - *Engineered from log returns derived in Main Area 1*
- EWMA realized volatility - *Computed from historical log returns using exponentially decaying weights*

> **Note:** Full implied volatility surfaces are not available via open-source channels for European indices. Volatility indices are therefore engineered for this project using realized returns data.
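
The EWMA proxy can be written as the recursion $\sigma_t^2 = \lambda \sigma_{t-1}^2 + (1 - \lambda) r_t^2$. The sketch below, run on synthetic returns, checks that this recursion matches the pandas `ewm(..., adjust=False)` form and shows the rolling and annualized variants. The decay factor 0.94 and the 252-day annualization follow the values used in the pipeline code; seeding the recursion with the first squared return is an assumption that mirrors the pandas behavior.

```python
import numpy as np
import pandas as pd

def ewma_vol_recursive(log_returns, lam=0.94):
    """EWMA volatility via sigma2_t = lam * sigma2_{t-1} + (1 - lam) * r_t**2."""
    r2 = np.asarray(log_returns, dtype=float) ** 2
    sigma2 = np.empty_like(r2)
    sigma2[0] = r2[0]  # seed with the first squared return (assumption)
    for t in range(1, r2.size):
        sigma2[t] = lam * sigma2[t - 1] + (1.0 - lam) * r2[t]
    return np.sqrt(sigma2)

# Synthetic daily log returns, used only to keep the example self-contained
rng = np.random.default_rng(1)
rets = pd.Series(0.01 * rng.standard_normal(500))

manual = ewma_vol_recursive(rets.to_numpy())
pandas_ewma = np.sqrt(rets.pow(2).ewm(alpha=1 - 0.94, adjust=False).mean()).to_numpy()
rolling_21d = rets.rolling(21).std(ddof=1)   # first 20 values are NaN
annualized = manual * np.sqrt(252)           # daily to annual scaling

print(np.allclose(manual, pandas_ewma))
```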

Main Area 3: Risk-Free Rate Term Structure - *Used for discounting option payoffs and defining the risk-neutral drift.*

- Zero-coupon yields by maturity - *Source: European Central Bank – Data Portal*

> **Note:** The yields are annualized, continuously compounded zero-coupon rates, published on an ACT/365 basis and observed daily across multiple maturities. They are retrieved as JSON responses from public URLs published by the ECB.

Main Area 4: Dividend Yield - *Used to correct forward price dynamics and maintain put–call parity consistency.*

- Index-level dividend yield (spot or trailing) - *Source: Yahoo Finance - EXSA.DE – index `dividendYield` field*

> **Note:** The dividend yield is treated as constant over time, consistent with standard equity option pricing practice.
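
To illustrate how a constant dividend yield enters the forward dynamics and how put-call parity can be checked, here is a minimal Black-Scholes-Merton sketch with a continuous dividend yield. The function names and the numerical inputs are hypothetical and chosen only for demonstration.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bsm_price(spot, strike, maturity, rate, div_yield, sigma, option_type="C"):
    """Black-Scholes-Merton price of a European option with continuous dividend yield."""
    spot_disc = spot * math.exp(-div_yield * maturity)   # dividend-adjusted spot
    strike_disc = strike * math.exp(-rate * maturity)    # discounted strike
    vol_sqrt_t = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate - div_yield + 0.5 * sigma**2) * maturity) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    if option_type == "C":
        return spot_disc * norm_cdf(d1) - strike_disc * norm_cdf(d2)
    return strike_disc * norm_cdf(-d2) - spot_disc * norm_cdf(-d1)

# Hypothetical inputs: S=100, K=105, T=1Y, r=3%, q=2%, sigma=20%
call = bsm_price(100.0, 105.0, 1.0, 0.03, 0.02, 0.2, "C")
put = bsm_price(100.0, 105.0, 1.0, 0.03, 0.02, 0.2, "P")
# Put-call parity: C - P = S * exp(-q T) - K * exp(-r T)
parity_gap = call - put - (100.0 * math.exp(-0.02) - 105.0 * math.exp(-0.03))
print(f"call={call:.4f}, put={put:.4f}, parity gap={parity_gap:.2e}")
```

Omitting the dividend term from one leg but not the other is a common way to break parity, which is why the yield is applied through a single dividend-adjusted spot.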

Main Area 5: Option Chain Construction and Pricing - *Used for inter-model benchmarking, internal consistency checks, and analysis of model-implied smile and skew behavior.*

- Option type (Call / Put) - *Constructed using simple classification*
- Strike price grid (K) - *Constructed using volatility-scaled spot log-moneyness:* $K = S_t \, e^{k}$ *where* $k = L \cdot \sigma_a \cdot \sqrt{T}$
- Time to maturity (T) - *Constructed - as per zero-coupon maturity dates*

>**Note:**
>To preserve economic relevance and numerical stability, the log-moneyness domain is capped to $[-0.7, 0.7]$, while volatility regime shocks still affect pricing within the admissible contract space. Without the cap, $k$ is amplified mathematically rather than economically, because $k \propto L \cdot \sigma \cdot \sqrt{T}$. For grid points that combine long maturities with high volatility (e.g., 2008 with a 10Y maturity), the resulting strikes are far out-of-the-money or deep in-the-money, where time value and most Greeks are approximately zero and all three models become numerically indistinguishable. The cap is enforced to avoid this mathematical noise.
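
The effect of the cap can be seen in a small numerical sketch. The helper `strike_grid` is hypothetical, and the 15% and 60% volatility levels are assumed values for a calm and a stressed regime; only the $L$ grid and the cap of 0.7 are taken from the pipeline.

```python
import numpy as np

def strike_grid(spot, sigma_annual, maturity, l_grid, k_cap=0.7):
    """Return (raw k, capped k, strikes) for volatility-scaled log-moneyness."""
    k_raw = np.asarray(l_grid, dtype=float) * sigma_annual * np.sqrt(maturity)
    k = np.clip(k_raw, -k_cap, k_cap)
    return k_raw, k, spot * np.exp(k)

l_grid = [-2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]

# Calm regime, 3M maturity: the cap never binds
k_raw_calm, k_calm, K_calm = strike_grid(100.0, 0.15, 0.25, l_grid)
# Stressed regime, 10Y maturity: raw log-moneyness far exceeds the cap
k_raw_stress, k_stress, K_stress = strike_grid(100.0, 0.60, 10.0, l_grid)

print(np.abs(k_raw_calm).max(), np.abs(k_raw_stress).max())
print(K_stress.min(), K_stress.max())
```

In the stressed case the largest raw log-moneyness is about 4.74, which would place the top strike at more than 100 times spot; after capping, strikes stay between roughly half and twice the spot level.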

> **Important Clarification:**  
> Under the proxy-based architecture, **no observed market option prices are sourced**. Option prices are fully **model-implied**, enabling controlled comparison of pricing dynamics across Black–Scholes–Merton, Heston, and Monte Carlo frameworks without contamination from microstructure noise or liquidity effects. Additionally, under the proxy architecture, for each observed market state, we fix the underlying economic variables and enumerate a controlled grid of contingent claims. Pricing models are then applied to this fixed state-space, allowing downstream P&L, Greeks, and tail-risk measures to be compared on a like-for-like basis without contamination from liquidity or market microstructure effects.


**2.4. Data Acquisition Sources and Rationale: ETF-Based Architecture**

The ETF-based architecture is introduced to provide **empirical pricing benchmarks** using observable option markets on highly liquid exchange-traded funds. While ETFs introduce tracking error and structural noise, their option chains offer the only feasible open-source alternative for observed option prices. Additionally, under the ETF-based architecture, the ETF itself is treated as the underlying tradable asset; index replication is not assumed.

The data points mirror those used in the proxy-based architecture to ensure methodological consistency; however, **data sources differ materially**.

Main Area 1: Underlying ETF Prices - *Used as the tradable underlying for observed option contracts.*

- Daily closing and adjusted prices - *Source: Yahoo Finance – SPY*  
- Trading calendar - *Derived from ETF price series*

Main Area 2: Implied Volatility (Observed) - *Computed via model inversion from observable option mid-market prices for liquid ETFs and used for empirical benchmarking and validation.*

- Option-implied volatility (by strike and maturity) - *Extracted from Yahoo Finance (SPY) ETF option chains and cleaned for liquidity and data quality*
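
Model inversion of this kind can be sketched as a bisection search on the volatility input. The example below inverts a European Black-Scholes-Merton call price, which is a simplifying assumption: SPY options are American-style, so an inversion consistent with the early-exercise treatment may be needed in practice. Function names, search bounds, and inputs are hypothetical.

```python
import math

def bsm_call(spot, strike, maturity, rate, div_yield, sigma):
    """European call price under Black-Scholes-Merton with dividend yield."""
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    vol_sqrt_t = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate - div_yield + 0.5 * sigma**2) * maturity) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    return spot * math.exp(-div_yield * maturity) * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)

def implied_vol_call(price, spot, strike, maturity, rate, div_yield,
                     lo=1e-4, hi=5.0, tol=1e-8, max_iter=200):
    """Bisection on sigma; relies on the call price being increasing in sigma."""
    f = lambda s: bsm_call(spot, strike, maturity, rate, div_yield, s)
    if not (f(lo) <= price <= f(hi)):
        return float("nan")  # price not attainable within the search bounds
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Round trip: price a call at 25% volatility, then recover the volatility
target = bsm_call(100.0, 110.0, 0.5, 0.03, 0.015, 0.25)
recovered = implied_vol_call(target, 100.0, 110.0, 0.5, 0.03, 0.015)
print(f"recovered implied volatility: {recovered:.6f}")
```

Returning `nan` for unattainable prices keeps quotes that violate basic price bounds from silently producing a boundary volatility.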

Main Area 3: Risk-Free Rate Term Structure - *Used for discounting option payoffs and defining the risk-neutral drift.*

- Zero-coupon yields by maturity - *Source: FRED  - API*

Main Area 4: Dividend Yield - *Used to correct forward price dynamics and maintain put–call parity consistency.*

- ETF dividend yield - *Source: Yahoo Finance – SPY - index `dividendYield` field*

Main Area 5: Observed Option Chain Data - *Used for direct model-to-market pricing comparison and OOS evaluation.*

- Option type (Call / Put) - *Source: Yahoo Finance SPY option chains*
- Strike price (K) - *Source: Yahoo Finance SPY option chains*
- Expiration date and time to maturity (T) - *Source: Yahoo Finance SPY option chains*
- Market option prices (mid, bid–ask) - *Source: Yahoo Finance SPY option chains*
- Volume and open interest (where available) - *Source: Yahoo Finance SPY option chains*
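
A possible shape for the mid-price construction and liquidity cleaning of an observed chain is sketched below. The thresholds are hypothetical, and the column names (`bid`, `ask`, `openInterest`) are assumed to follow the layout of the downloaded option chain; the filters actually applied in the pipeline may differ.

```python
import pandas as pd

def clean_option_chain(chain, max_rel_spread=0.25, min_open_interest=1):
    """Add mid prices and drop quotes that fail simple liquidity checks.

    Thresholds are illustrative assumptions, not calibrated values.
    """
    out = chain.copy()
    out["mid"] = 0.5 * (out["bid"] + out["ask"])
    out["rel_spread"] = (out["ask"] - out["bid"]) / out["mid"]
    keep = (
        (out["bid"] > 0)                                   # drop one-sided quotes
        & (out["ask"] >= out["bid"])                       # drop crossed quotes
        & (out["rel_spread"] <= max_rel_spread)            # drop very wide markets
        & (out["openInterest"].fillna(0) >= min_open_interest)
    )
    return out.loc[keep].reset_index(drop=True)

# Toy chain, used only to show which rows survive the filters
sample = pd.DataFrame({
    "strike": [400.0, 450.0, 500.0, 550.0],
    "bid": [52.0, 10.0, 0.0, 0.05],
    "ask": [52.6, 10.2, 0.02, 0.40],
    "openInterest": [1200, 3400, 15, None],
})
cleaned = clean_option_chain(sample)
print(cleaned[["strike", "mid", "rel_spread"]])
```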

>**Important Clarification:**
>Due to the unavailability of reliable open-source historical option chain data for ETFs, the ETF-based architecture is implemented as a point-in-time benchmark. Its purpose is not time-series evaluation, but to assess whether the relative pricing behavior and structural characteristics observed under the proxy architecture remain economically plausible when confronted with observable market prices.

**Governance Note (SR 11-7 Alignment)**

The separation of proxy-based and ETF-based architectures ensures:

- Explicit documentation of data limitations and assumptions
- Clear distinction between **model development** and **empirical validation**
- Avoidance of overfitting or misrepresentation of model accuracy
- Traceability of all inputs to reproducible sources

This design aligns with SR 11-7 expectations regarding model transparency, validation independence, and controlled use of approximations.


**Section Output:**
At the end of the ETL process, the project produces two consolidated model-ready datasets: one based on the proxy architecture and one based on the ETF architecture. The proxy datasets support controlled model development and assumption testing, while the ETF datasets enable empirical validation against observed option prices. Model prices generated under the ETF architecture are compared against market option prices, not against proxy-generated prices, to ensure economically meaningful validation and SR 11-7-compliant separation of development and benchmarking.

In [2]:
# Pull zero-coupon data - for proxy architecture
def pull_ecb_zero_coupon_data():
    url = [
        'https://data-api.ecb.europa.eu/service/data/YC/B.U2.EUR.4F.G_N_A.SV_C_YM.SR_3M?startPeriod=2008-04-04&endPeriod=2026-01-30&format=jsondata',
        'https://data-api.ecb.europa.eu/service/data/YC/B.U2.EUR.4F.G_N_A.SV_C_YM.SR_6M?startPeriod=2008-04-04&endPeriod=2026-01-30&format=jsondata',
        'https://data-api.ecb.europa.eu/service/data/YC/B.U2.EUR.4F.G_N_A.SV_C_YM.SR_1Y?startPeriod=2008-04-04&endPeriod=2026-01-30&format=jsondata',
        'https://data-api.ecb.europa.eu/service/data/YC/B.U2.EUR.4F.G_N_A.SV_C_YM.SR_2Y?startPeriod=2008-04-04&endPeriod=2026-01-30&format=jsondata',
        'https://data-api.ecb.europa.eu/service/data/YC/B.U2.EUR.4F.G_N_A.SV_C_YM.SR_5Y?startPeriod=2008-04-04&endPeriod=2026-01-30&format=jsondata',
        'https://data-api.ecb.europa.eu/service/data/YC/B.U2.EUR.4F.G_N_A.SV_C_YM.SR_10Y?startPeriod=2008-04-04&endPeriod=2026-01-30&format=jsondata'
    ]
    data_dump_yield = []
    data_dump_dates = []
    data_dump_ids = []
    df_zero_coupon_ecb = pd.DataFrame()
    for series_url in url:
        req = requests.get(series_url, timeout=120)
        req.raise_for_status()
        json_dump = req.json()
        yield_data = json_dump['dataSets'][0]['series']['0:0:0:0:0:0:0']['observations']
        time_data = json_dump['structure']['dimensions']['observation'][0]['values']
        id_data = json_dump['structure']['dimensions']['series'][6]['values'][0]['id']
        # Observation keys are string positions into time_data; pairing by key keeps
        # yields, dates, and maturity IDs aligned even if an observation is missing
        for obs_key, obs_values in yield_data.items():
            data_dump_yield.append(obs_values[0])
            data_dump_dates.append(time_data[int(obs_key)]['name'])
            data_dump_ids.append(id_data)

    
    df_zero_coupon_ecb['yields'] = data_dump_yield
    df_zero_coupon_ecb['yields'] = df_zero_coupon_ecb['yields'] / 100
    df_zero_coupon_ecb['Date'] = data_dump_dates
    df_zero_coupon_ecb['ID'] = data_dump_ids
    df_zero_coupon_ecb['Date'] = pd.to_datetime(df_zero_coupon_ecb['Date'])
    df_zero_coupon_ecb = (
        df_zero_coupon_ecb
        .pivot(index='Date', columns='ID', values='yields')
        .reset_index()
    )
    df_zero_coupon_ecb = df_zero_coupon_ecb.set_index('Date')
    df_zero_coupon_ecb.to_csv(zero_coupon_ecb)
    return df_zero_coupon_ecb

In [3]:
# Proxy-based architecture implementation
def proxy_architecture_data_generation():
    rt = {
        'date': [],
        'r_t': []
    }
    
    # 1. Index closing prices
    df_proxy_index = yf.download('EXSA.DE', start='2008-01-01', end='2026-02-01', progress=False)
    df_proxy_index = (
        df_proxy_index
        .xs('Close', level='Price', axis=1)
        .rename(columns={'EXSA.DE': 'index_closing_price'})
        .sort_index()
    )

    # 3. Zero-coupon yield
    with time_block('Zero Coupon Data Load - ECB'):
        df_zero_coupon = pull_ecb_zero_coupon_data()

    df_proxy_index = df_proxy_index.join(
        df_zero_coupon,
        how='inner'
    )

    # 4. Dividend-yield
    exsa = yf.Ticker('EXSA.DE')
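    div_raw = exsa.info.get('dividendYield')
    if div_raw is None:
        raise ValueError('dividendYield is missing from the EXSA.DE ticker info.')
    div = div_raw / 100  # assumes the field is quoted in percent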


    # Performing data consolidation, engineering and sanity checks
    # Data engineering
    df_proxy_index['index_closing_simple_ret'] = df_proxy_index['index_closing_price'].pct_change()
    df_proxy_index['index_closing_log_ret'] = np.log(1 + df_proxy_index['index_closing_simple_ret'])
    # 2. Volatility proxies (calculated in this section to avoid data conflicts)
    lam = 0.94
    df_proxy_index['daily_rol_vol_21D'] = df_proxy_index['index_closing_log_ret'].rolling(21).std(ddof=1)
    df_proxy_index['daily_rol_vol_63D'] = df_proxy_index['index_closing_log_ret'].rolling(63).std(ddof=1)
    df_proxy_index['daily_ewma_vol'] = np.sqrt(df_proxy_index['index_closing_log_ret'].pow(2).ewm(alpha=1-lam, adjust=False).mean()) # set adjust = False to compute volatility using the classical EWMA vol model. True will use the statistical definition.
    df_proxy_index['dividend_yield'] = div
    df_proxy_index = df_proxy_index.dropna(how='any')
    # 5. Option chain construction (implemented here to preserve data structure)
    # Maturity dates construction
    T_in_years = [0.25, 0.50, 1.00, 2.00, 5.00, 10.00]
    df_proxy_index = (
        df_proxy_index
        .loc[df_proxy_index.index.repeat(len(T_in_years))]
        .assign(T_in_years=T_in_years * len(df_proxy_index))
        .set_index('T_in_years', append=True)
        .sort_index()
    )
    # Strike price construction
    L_grid = np.array([-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5])
    n = len(df_proxy_index)
    df_proxy_index = (
        df_proxy_index
        .loc[df_proxy_index.index.repeat(len(L_grid))]
    )
    df_proxy_index['L'] = np.tile(L_grid, n)
    df_proxy_index['annualized_ewma_vol'] = df_proxy_index['daily_ewma_vol'].to_numpy() * np.sqrt(252)
    k_raw = df_proxy_index['L'] * df_proxy_index['annualized_ewma_vol'] * np.sqrt(df_proxy_index.index.get_level_values('T_in_years'))
    df_proxy_index['k'] = np.clip(k_raw, -0.7, 0.7)
    df_proxy_index['K'] = df_proxy_index['index_closing_price'] * np.exp(df_proxy_index['k'])
    # Call/ Put classification construction
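    c_p_classification = ['C', 'P']
    # Repeat rows by position: the (Date, T_in_years) index is no longer unique here,
    # so label-based .loc with repeated labels would return every matching row for
    # each repeated label and multiply the frame instead of doubling it.
    df_proxy_index = df_proxy_index.iloc[np.repeat(np.arange(len(df_proxy_index)), 2)]
    df_proxy_index['call_put_classification'] = np.tile(c_p_classification, len(df_proxy_index)//2)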
    df_class_check = (
        df_proxy_index.groupby([
            pd.Grouper(level='Date'),
            pd.Grouper(level='T_in_years'),
            'K'
        ])['call_put_classification']
        .nunique()
    )
    mismatch = df_class_check[df_class_check < 2]
    if len(mismatch) > 0:
        print(f'{len(mismatch)} records with incorrect classification structure found.')
    else:
        print('Classification structure complete.')

    # Data sanity checks
    assert df_proxy_index.index.is_monotonic_increasing
    assert (df_proxy_index['index_closing_simple_ret'] > -1).all()

    # Dropping duplicates and creating final master
    df_proxy_index_master = (
        df_proxy_index
        .reset_index()
        .drop_duplicates(subset=['Date', 'T_in_years', 'L', 'call_put_classification'], keep='first')
        .sort_values(['Date', 'T_in_years', 'L', 'call_put_classification'])
        .set_index(['Date', 'T_in_years'])
    )

    # Mapping zero-coupon rates based on T_in_years
    T = df_proxy_index_master.index.get_level_values("T_in_years").to_numpy()
    T_round = np.round(T, 2)  # grid is {0.25,0.50,1,2,5,10}
    df_proxy_index_master["r_T"] = np.select(
        [
            T_round == 0.25,
            T_round == 0.50,
            T_round == 1.00,
            T_round == 2.00,
            T_round == 5.00,
            T_round == 10.0,
        ],
        [
            df_proxy_index_master["SR_3M"].to_numpy(),
            df_proxy_index_master["SR_6M"].to_numpy(),
            df_proxy_index_master["SR_1Y"].to_numpy(),
            df_proxy_index_master["SR_2Y"].to_numpy(),
            df_proxy_index_master["SR_5Y"].to_numpy(),
            df_proxy_index_master["SR_10Y"].to_numpy(),
        ],
        default=np.nan
    )
    assert df_proxy_index_master["r_T"].isna().sum() == 0
    
    # Configuring data display template
    df_proxy_index_master.style.set_table_styles(
        [
            {
                'selector': 'th',
                'props': [('text-align','center')]
            },
            {
                'selector': 'td',
                'props': [('text-align', 'right')]
            }
        ]
    )

    # Diagnostic print statements
    print('Data generated for proxy architecture:')
    display(df_proxy_index_master.head(44))
    print(f'Total number of records in master proxy data: {len(df_proxy_index_master)}')
    
    # Uploading data to pickle file
    df_proxy_index_master.to_pickle(proxy_master)
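
The maturity-to-rate mapping at the end of the proxy pipeline relies on `np.select`, which picks, row by row, the value from the first choice array whose condition holds. The sketch below reproduces that pattern on a toy frame. The frame `toy`, its values, and the off-grid maturity of 0.75 are illustrative assumptions and are not taken from the study's data.

```python
import numpy as np
import pandas as pd

# Toy inputs (hypothetical values): one row per maturity, two spot-rate columns.
toy = pd.DataFrame(
    {
        "T_in_years": [0.25, 0.5, 0.75],
        "SR_3M": [0.030, 0.031, 0.032],
        "SR_6M": [0.035, 0.036, 0.037],
    }
)

# Rounding guards the equality tests against floating-point noise in T.
T_round = np.round(toy["T_in_years"].to_numpy(), 2)

# For each row, np.select takes the value from the first choice array whose
# condition is True, and falls back to the default when none matches.
toy["r_T"] = np.select(
    [T_round == 0.25, T_round == 0.50],
    [toy["SR_3M"].to_numpy(), toy["SR_6M"].to_numpy()],
    default=np.nan,
)

# Maturities outside the grid (0.75 here) stay NaN, which is the condition
# that an isna() assertion on r_T would catch.
off_grid = toy.loc[toy["r_T"].isna(), "T_in_years"].tolist()
print(toy)
print("Off-grid maturities:", off_grid)
```

Keeping `default=np.nan` rather than a fallback rate means an unexpected maturity fails loudly at the assertion instead of silently inheriting the wrong tenor.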

In [4]:
# Pull zero-coupon data - for ETF architecture - max 2Y since SPY options are mostly <= 2 years
def pull_fred_zero_coupon_data():
    id_url = 'https://api.stlouisfed.org/fred/release/series'
    obs_url = 'https://api.stlouisfed.org/fred/series/observations'
    id_params = {
        'release_id': 18,
        'api_key': fred_api_key,
        'file_type': 'json'
    }
    id_data = {
        'id': [],
        'title': []
    }
    obs_data = {
        'date': [],
        'value': [],
        'id': []
    }
    df_zero_coupon_fred = pd.DataFrame()

    response_id = requests.get(id_url, params=id_params, timeout=60)
    response_id.raise_for_status()  # Fail early on HTTP errors
    response_id_dump = response_id.json()
    tenors = ('1-Month', '3-Month', '6-Month', '1-Year', '2-Year')
    for series in response_id_dump['seriess']:
        title = series.get('title', '')
        identifier = series.get('id', '')
        if (
            identifier.startswith('D')  # Daily series only
            and title.startswith('Market')  # Yield-specific series only
            and any(tenor in title for tenor in tenors)
        ):
            id_data['id'].append(identifier)
            id_data['title'].append(title)

    identifier_list = id_data['id']
    for iden in identifier_list:
        obs_params = {
            'series_id': iden,
            'api_key': fred_api_key,
            'file_type': 'json',
            'observation_start': '2026-01-01',
            'observation_end': '2026-02-01'
        }
        response_obs = requests.get(obs_url, params=obs_params, timeout=60)
        response_obs.raise_for_status()  # Fail early on HTTP errors
        response_obs_dump = response_obs.json()
        # Store one (date, value, series id) triple per observation
        for data in response_obs_dump['observations']:
            obs_data['date'].append(data['date'])
            obs_data['value'].append(data['value'])
            obs_data['id'].append(iden)

    df_zero_coupon_fred['Date'] = pd.to_datetime(obs_data['date'])
    df_zero_coupon_fred['yields'] = pd.to_numeric(
        obs_data['value'],
        errors='coerce'
    )
    df_zero_coupon_fred['yields'] = df_zero_coupon_fred['yields'] / 100
    df_zero_coupon_fred['identifier'] = obs_data['id']
    df_zero_coupon_fred = df_zero_coupon_fred.set_index('Date')
    df_zero_coupon_fred = df_zero_coupon_fred.pivot(columns='identifier', values='yields')

    # Loading data to csv
    df_zero_coupon_fred.to_csv(zero_coupon_fred)

    return df_zero_coupon_fred
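
The reshaping step in `pull_fred_zero_coupon_data` (string values to decimal yields, then a pivot to one column per series) can be exercised without network access. The sketch below is a minimal illustration under stated assumptions: the helper `observations_to_frame` and the `sample` payloads are hypothetical, and the payloads only mimic the `observations` list of `date`/`value` records that the function above iterates over.

```python
import pandas as pd


def observations_to_frame(payloads):
    """Turn {series_id: observations payload} into a Date x series_id frame of decimal yields."""
    records = [
        {"Date": obs["date"], "identifier": series_id, "yields": obs["value"]}
        for series_id, payload in payloads.items()
        for obs in payload["observations"]
    ]
    df = pd.DataFrame.from_records(records)
    df["Date"] = pd.to_datetime(df["Date"])
    # Non-numeric placeholders such as "." become NaN instead of raising.
    df["yields"] = pd.to_numeric(df["yields"], errors="coerce") / 100
    return df.pivot(index="Date", columns="identifier", values="yields")


# Hypothetical payloads, shaped like the observations response used above.
sample = {
    "DGS1MO": {"observations": [{"date": "2026-01-02", "value": "3.72"},
                                {"date": "2026-01-05", "value": "."}]},
    "DGS3MO": {"observations": [{"date": "2026-01-02", "value": "3.67"},
                                {"date": "2026-01-05", "value": "3.66"}]},
}
df_rates = observations_to_frame(sample)
print(df_rates)
```

Separating the parsing from the HTTP calls in this way would make the conversion testable on canned payloads, which may be useful when the pipeline has to be rerun for validation.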

In [5]:
# ETF-based architecture implementation
def etf_architecture_data_generation():
    df_etf_option_chain = pd.DataFrame()

    # 1. Index closing prices
    df_etf_index = yf.download('SPY', start='2026-01-01', end='2026-02-01', progress=False)
    df_etf_index = (
        df_etf_index
        .xs('Close', level='Price', axis=1)
        .rename(columns={'SPY': 'index_closing_price'})
        .sort_index()
    )

    # 3. Zero-coupon yield
    with time_block('Zero Coupon Data Load - FRED'):
        df_zero_coupon = pull_fred_zero_coupon_data()

    # 4. Dividend yield
    spy = yf.Ticker('SPY')
    div = spy.info.get('dividendYield')
    if div is None:
        raise ValueError('dividendYield is not available for SPY')
    df_etf_index['dividend_yield'] = div / 100

    # 2 and 5. Option chain incl. implied volatility
    option_chain = []
    exp_dates = spy.options
    for date in exp_dates:
        chain = spy.option_chain(date)
        for opt_type, df in [('call', chain.calls), ('put', chain.puts)]:
            tmp_df = df.copy()
            tmp_df['option_type'] = opt_type
            tmp_df['expiration_dates'] = pd.to_datetime(date)
            tmp_df['valuation_as_of_date'] = df_etf_index.index[-1]
            tmp_df['spot'] = float(
                df_etf_index.loc[df_etf_index.index[-1], 'index_closing_price']
            )
            option_chain.append(tmp_df)

    df_etf_option_chain = pd.concat(option_chain)
    df_etf_option_chain = df_etf_option_chain.set_index('valuation_as_of_date')

    # Performing data consolidation, engineering and sanity checks
    df_etf_option_chain = df_etf_option_chain.join(
        df_zero_coupon,
        how='inner'
    )
    df_etf_option_chain = df_etf_option_chain.dropna(how='any')
    df_etf_option_chain['mid'] = (
        (df_etf_option_chain['bid'] + df_etf_option_chain['ask']) / 2
    )
    df_etf_option_chain['spread'] = (
        df_etf_option_chain['ask'] - df_etf_option_chain['bid']
    )
    df_etf_option_chain['rel_spread'] = (
        df_etf_option_chain['spread'] / df_etf_option_chain['mid']
    )
    df_etf_option_chain = df_etf_option_chain[
        (df_etf_option_chain['bid'] >= 0) &
        (df_etf_option_chain['ask'] > df_etf_option_chain['bid']) &
        (df_etf_option_chain['mid'] >= 0.05) &
        (df_etf_option_chain['rel_spread'] <= 0.20)
    ]
    df_etf_option_chain['lastTradeDate'] = pd.to_datetime(df_etf_option_chain['lastTradeDate'])
    df_etf_option_chain['T_in_years'] = (
        (df_etf_option_chain['expiration_dates'] - df_etf_option_chain.index).dt.days / 365
    )
    df_etf_option_chain = df_etf_option_chain[
        df_etf_option_chain['impliedVolatility'] > 1e-5
    ]
    df_etf_option_chain = df_etf_option_chain[
        (df_etf_option_chain['T_in_years'] > 0) &
        (
            (df_etf_option_chain['volume'] > 0) | (df_etf_option_chain['openInterest'] > 0)
        )
    ]
    df_etf_option_chain = df_etf_option_chain[
        df_etf_option_chain['lastTradeDate'] <= '2026-01-30'
    ]
    # Interpolating zero-coupon rates based on T_in_years
    t_years = np.array([1/12, 0.25, 0.5, 1.0, 2.0])
    t_cols = ['DGS1MO','DGS3MO','DGS6MO','DGS1','DGS2']
    r_data = df_etf_option_chain[t_cols].iloc[0].to_numpy()
    T = df_etf_option_chain['T_in_years'].to_numpy()
    T_clip = np.clip(T, t_years.min(), t_years.max())
    df_etf_option_chain['r_T'] = np.interp(T_clip, t_years, r_data)
    df_etf_option_chain.style.set_table_styles(
        [
            {
                'selector': 'th',
                'props': [('text-align', 'center')]
            },
            {
                'selector': 'td',
                'props': [('text-align', 'right')]
            }
        ]
    )

    # Diagnostic print statements
    print('Data generated for ETF architecture:')
    display(df_etf_option_chain.head(44))
    print(f'Total records in ETF index data {len(df_etf_option_chain)}')

    # Uploading data to pickle file
    df_etf_option_chain.to_pickle(etf_master)
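
The rate assignment in the ETF pipeline is a linear interpolation over a five-point curve, with maturities outside the 1M to 2Y range held flat at the nearest endpoint. The sketch below illustrates this using the curve values shown in the displayed output rows; the three maturities in `T` are illustrative assumptions.

```python
import numpy as np

# Tenor grid in years and the matching rates (DGS1MO, DGS3MO, DGS6MO, DGS1, DGS2),
# taken from the displayed output rows for the valuation date.
t_years = np.array([1 / 12, 0.25, 0.5, 1.0, 2.0])
r_data = np.array([0.0372, 0.0367, 0.0361, 0.0348, 0.0352])

# Illustrative maturities: below the grid, inside it, and above it.
T = np.array([0.0575, 0.375, 3.0])

# Clip to the grid, then interpolate linearly between neighbouring tenors.
T_clip = np.clip(T, t_years.min(), t_years.max())
r_T = np.interp(T_clip, t_years, r_data)
print(r_T)

# np.interp already returns the endpoint values outside the grid by default,
# so the explicit clip documents the flat extrapolation rather than changing it.
assert np.allclose(np.interp(T, t_years, r_data), r_T)
```

The first maturity (0.0575 years) lands on the 1M rate, which matches the `r_T` value of 0.0372 shown for the short-dated contracts in the output.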

In [6]:
# Calling data generation programs

# 1. Calling proxy data generation program
with time_block('Proxy data generation block'):
    proxy_architecture_data_generation()

# 2. Calling etf data generation program
with time_block('ETF data generation block'):
    etf_architecture_data_generation()

Zero Coupon Data Load - ECB execution time: 0.13 minutes.
Classification structure complete.
Data generated for proxy architecture:


Date,T_in_years,index_closing_price,SR_10Y,SR_1Y,SR_2Y,SR_3M,SR_5Y,SR_6M,index_closing_simple_ret,index_closing_log_ret,daily_rol_vol_21D,daily_rol_vol_63D,daily_ewma_vol,dividend_yield,L,annualized_ewma_vol,k,K,call_put_classification,r_T
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-2.5,0.1886,-0.2358,15.3771,C,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-2.5,0.1886,-0.2358,15.3771,P,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-2.0,0.1886,-0.1886,16.1196,C,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-2.0,0.1886,-0.1886,16.1196,P,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-1.5,0.1886,-0.1415,16.8979,C,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-1.5,0.1886,-0.1415,16.8979,P,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-1.0,0.1886,-0.0943,17.7137,C,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-1.0,0.1886,-0.0943,17.7137,P,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-0.5,0.1886,-0.0472,18.5689,C,0.0422
2008-07-03,0.25,19.4655,0.0471,0.0444,0.0451,0.0422,0.0455,0.0432,0.0088,0.0087,0.0112,0.0107,0.0119,0.0254,-0.5,0.1886,-0.0472,18.5689,P,0.0422


Total number of records in master proxy data: 588720
Proxy data generation block execution time: 0.32 minutes.
Zero Coupon Data Load - FRED execution time: 0.08 minutes.
Data generated for ETF architecture:


valuation_as_of_date,contractSymbol,lastTradeDate,strike,lastPrice,bid,ask,change,percentChange,volume,openInterest,impliedVolatility,inTheMoney,contractSize,currency,option_type,expiration_dates,spot,DGS1,DGS1MO,DGS2,DGS3MO,DGS6MO,mid,spread,rel_spread,T_in_years,r_T
2026-01-30,SPY260220C00365000,2025-11-26 15:41:15+00:00,365.0,316.48,327.35,330.81,0.0,0.0,2.0,2.0,1.3731,True,REGULAR,USD,call,2026-02-20,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,329.08,3.46,0.0105,0.0575,0.0372
2026-01-30,SPY260220P00745000,2025-12-19 15:52:02+00:00,745.0,65.93,52.12,55.51,0.0,0.0,2.0,0.0,0.4158,True,REGULAR,USD,put,2026-02-20,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,53.815,3.39,0.063,0.0575,0.0372
2026-01-30,SPY260220P00780000,2025-12-15 15:03:53+00:00,780.0,99.27,87.12,90.41,0.0,0.0,2.0,0.0,0.5043,True,REGULAR,USD,put,2026-02-20,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,88.765,3.29,0.0371,0.0575,0.0372
2026-01-30,SPY260220P00795000,2025-12-29 20:50:07+00:00,795.0,106.93,102.49,105.3,0.0,0.0,1.0,0.0,0.5698,True,REGULAR,USD,put,2026-02-20,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,103.895,2.81,0.027,0.0575,0.0372
2026-01-30,SPY260227C00470000,2025-11-18 14:54:26+00:00,470.0,197.4,223.42,226.8,0.0,0.0,5.0,5.0,0.907,True,REGULAR,USD,call,2026-02-27,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,225.11,3.38,0.015,0.0767,0.0372
2026-01-30,SPY260227C00485000,2025-11-14 16:56:55+00:00,485.0,196.05,208.62,211.99,0.0,0.0,1.0,2.0,0.866,True,REGULAR,USD,call,2026-02-27,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,210.305,3.37,0.016,0.0767,0.0372
2026-01-30,SPY260227C00495000,2025-11-05 15:18:54+00:00,495.0,189.6,198.71,201.99,0.0,0.0,3.0,0.0,0.8284,True,REGULAR,USD,call,2026-02-27,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,200.35,3.28,0.0164,0.0767,0.0372
2026-01-30,SPY260227C00584000,2025-11-11 15:32:18+00:00,584.0,106.29,110.81,114.24,0.0,0.0,19.0,20.0,0.5403,True,REGULAR,USD,call,2026-02-27,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,112.525,3.43,0.0305,0.0767,0.0372
2026-01-30,SPY260227C00587000,2025-11-10 20:17:51+00:00,587.0,105.68,107.89,111.44,0.0,0.0,19.0,73.0,0.5344,True,REGULAR,USD,call,2026-02-27,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,109.665,3.55,0.0324,0.0767,0.0372
2026-01-30,SPY260227C00597000,2025-12-01 14:44:53+00:00,597.0,91.31,98.19,101.49,0.0,0.0,1.0,13.0,0.5,True,REGULAR,USD,call,2026-02-27,691.97,0.0348,0.0372,0.0352,0.0367,0.0361,99.84,3.3,0.0331,0.0767,0.0372


Total records in ETF index data 366
ETF data generation block execution time: 0.25 minutes.
