__Exploratory Data Analysis (EDA)__

- Inspect distributions, missing values, outliers.
- Plot time series of IV, skew, curvature.
- Compare SPY vs QQQ.
- Correlation checks.
- Document findings.

In [1]:
import wrds
import numpy as np  # needed for np.nan in the feature aggregation below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

__Load the filtered parquet__

In [None]:
df = pd.read_parquet("options_filtered/", engine="fastparquet")
df.head()

__Inspect distributions, missing values, outliers__

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
# Unique symbols
print(df['ticker'].unique())

# Or with counts
print(df['ticker'].value_counts())

In [None]:
fig, axes = plt.subplots(2,1, figsize=(12,10))

# Distribution of IV
sns.histplot(data=df, x='impl_volatility', bins=50, kde=True, ax = axes[0])
axes[0].set_title("Distribution of Implied Volatility")

# Boxplot by symbol
sns.boxplot(x='ticker', y='impl_volatility', data=df, ax = axes[1])
axes[1].set_title("IV Distribution by Ticker")

plt.tight_layout()
plt.show()

Implied volatility is concentrated between roughly 0.15 and 0.30 (15% to 30%), with few observations above 0.30, which suggests the filtering step left a clean dataset. SPY and QQQ show broadly similar IV distributions; QQQ’s median is only slightly higher, reflecting modest sector risk differences. The observations above 0.30 may represent short‑term stress events or occasional bad quotes, but they are rare and do not dominate the distribution.
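To put a number on "rare", the share of quotes above the 0.30 level can be tabulated per ticker. The following is a minimal sketch rather than part of the original analysis: the helper name `iv_outlier_share` and the default threshold are assumptions, and it expects the `ticker` and `impl_volatility` columns used above.

```python
import pandas as pd


def iv_outlier_share(df: pd.DataFrame, threshold: float = 0.30) -> pd.DataFrame:
    """Count and share of quotes with implied volatility above `threshold`, per ticker.

    Hypothetical helper; missing IV values compare as False and so count as "not above".
    """
    above = df["impl_volatility"] > threshold
    return (
        df.assign(above=above)
        .groupby("ticker")["above"]
        .agg(n_quotes="size", n_above="sum", share_above="mean")
    )
```

Calling `iv_outlier_share(df)` on the filtered frame returns one row per ticker, which makes it easy to check whether the tail is similar for SPY and QQQ.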

__Aggregate daily features (ATM IV, skew, curvature)__

In [None]:
features = []

# Group by date and symbol (ticker)
for (date, ticker), group in df.groupby(['date','ticker']):
    group = group.copy()
    
    # --- ATM IV by delta ---
    atm_row = group.iloc[(group['delta'].abs() - 0.5).abs().argsort()[:1]]
    atm_iv = atm_row['impl_volatility'].values[0]

    # --- 25-delta put and call ---
    put25 = group.loc[group['cp_flag'] == 'P']
    call25 = group.loc[group['cp_flag'] == 'C']

    if not put25.empty and not call25.empty:
        put25_idx = (put25['delta'] + 0.25).abs().idxmin()
        call25_idx = (call25['delta'] - 0.25).abs().idxmin()

        iv_put25 = group.loc[put25_idx, 'impl_volatility']
        iv_call25 = group.loc[call25_idx, 'impl_volatility']

        skew = iv_put25 - iv_call25
        curvature = (iv_put25 + iv_call25) / 2 - atm_iv
    else:
        skew = curvature = np.nan

    features.append({
        'date': date,
        'ticker': ticker,
        'ATM_IV': atm_iv,
        'Skew': skew,
        'Curvature': curvature
    })

features_df = pd.DataFrame(features)
features_df.head()
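Written out, the three daily features computed in the loop above are as follows. The notation is introduced here for illustration only: sigma with a subscript denotes the implied volatility of the contract whose delta is closest to the stated target on that date.

```latex
\begin{aligned}
\text{ATM\_IV}    &= \sigma_{|\Delta| \approx 0.50} \\
\text{Skew}       &= \sigma_{\text{put},\,\Delta \approx -0.25} - \sigma_{\text{call},\,\Delta \approx +0.25} \\
\text{Curvature}  &= \tfrac{1}{2}\left(\sigma_{\text{put},\,\Delta \approx -0.25} + \sigma_{\text{call},\,\Delta \approx +0.25}\right) - \text{ATM\_IV}
\end{aligned}
```

For example, with a 25-delta put IV of 0.26, a 25-delta call IV of 0.20 and an ATM IV of 0.22, the skew is 0.06 and the curvature is 0.23 - 0.22 = 0.01.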

Pull the 3-month Treasury bill secondary market rate (dtb3) and the 2-year Treasury constant maturity rate (dgs2) from the WRDS FRB data and join them to the features by date.

In [None]:
db = wrds.Connection(wrds_username='ayansola')
# setup pg_pass needed for access to the wrds dataset (first time only)
# db.create_pgpass_file()

In [None]:
params = {
    "from_date": "2022-01-01",
    "to_date": "2023-12-31"
}

t_bill = db.raw_sql(
    """
    SELECT date, dtb3 as tbills_3m, dgs2 as treasury_2y
    FROM frb.rates_daily
    WHERE date BETWEEN %(from_date)s AND %(to_date)s
    """,
    params=params,
)

In [None]:
t_bill.dtypes

In [None]:
t_bill['date'] = pd.to_datetime(t_bill['date'])

In [None]:
t_bill.info()

In [None]:
t_bill.head()

In [None]:
features_df = features_df.merge(t_bill[['date','tbills_3m', 'treasury_2y']], on='date', how='left')

In [None]:
features_df.head()
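Because the join is a left merge, any feature date without a matching rate row silently ends up with missing rates. The sketch below shows one way to surface that, using the `indicator` and `validate` options of `DataFrame.merge`. The function name `merge_rates` is an assumption for illustration, not part of the original notebook.

```python
import pandas as pd


def merge_rates(features: pd.DataFrame, rates: pd.DataFrame) -> pd.DataFrame:
    """Left-join daily rates onto the features and report unmatched rows.

    Hypothetical helper: both frames are expected to carry a `date` column.
    """
    # Coerce both keys to datetime so the join does not fail on mixed dtypes.
    left = features.assign(date=pd.to_datetime(features["date"]))
    right = rates.assign(date=pd.to_datetime(rates["date"]))

    # validate="many_to_one" raises if the rates table has duplicate dates.
    merged = left.merge(
        right, on="date", how="left", indicator=True, validate="many_to_one"
    )

    n_unmatched = int((merged["_merge"] == "left_only").sum())
    print(f"{n_unmatched} feature rows have no matching rate date")
    return merged.drop(columns="_merge")
```

A non-zero count would be a reason to look at holidays or calendar mismatches before interpreting the rate plots.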

__Plot time series of IV, skew, curvature & compare SPY vs QQQ__

In [None]:
fig, axes = plt.subplots(5,1, figsize=(12,16))

sns.lineplot(data=features_df, x='date', y='ATM_IV', hue='ticker', ax = axes[0])
axes[0].set_title("ATM IV Over Time")

sns.lineplot(data=features_df, x='date', y='Skew', hue='ticker', ax = axes[1])
axes[1].set_title("Skew Over Time")

sns.lineplot(data=features_df, x='date', y='Curvature', hue='ticker', ax = axes[2])
axes[2].set_title("Curvature Over Time")

sns.lineplot(data=features_df, x='date', y='tbills_3m', ax = axes[3])
axes[3].set_title("3 month TBills Secondary Market Over Time")

sns.lineplot(data=features_df, x='date', y='treasury_2y', ax = axes[4])
axes[4].set_title("Treasury Constant Maturity 2-year Over Time")


# sns.lineplot(data=features_df, x='date', y='dtb1yr', ax = axes[3])
# axes[6].set_title("1 year TBills Secondary Market Over Time")

plt.tight_layout()
plt.show()

- Both symbols experience occasional, brief volatility spikes, with QQQ’s being more pronounced.
- QQQ’s skew is more volatile and often higher, pointing to stronger demand for put options.
- Curvature is persistently positive for both symbols.
- Implied volatility does not track short‑term rates such as the 3‑month bill, but it responds more visibly to shifts in the 2‑year yield, as seen in the plots above.
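The visual comparison between IV and the two rate series can be backed by a simple number. The sketch below correlates day-over-day changes in ATM IV with changes in each rate, per ticker. Using changes rather than levels is a choice made here, not something the notebook does, and the name `iv_rate_correlations` is hypothetical. It assumes `features_df` still contains both `tbills_3m` and `treasury_2y`.

```python
import pandas as pd


def iv_rate_correlations(
    features: pd.DataFrame,
    rate_cols=("tbills_3m", "treasury_2y"),
) -> pd.DataFrame:
    """Pearson correlation of daily changes in ATM_IV with daily changes in each rate.

    Returns one row per ticker and one column per rate (hypothetical helper).
    """
    rows = {}
    for ticker, g in features.sort_values("date").groupby("ticker"):
        # First differences within each ticker, in date order.
        changes = g[["ATM_IV", *rate_cols]].diff()
        rows[ticker] = changes.corr().loc["ATM_IV", list(rate_cols)]
    return pd.DataFrame(rows).T
```

If the 2-year yield really matters more, its column should show correlations further from zero than the 3-month column.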

In [None]:
# Drop the 3-month T-bill rate (dtb3, stored as tbills_3m) since ATM IV does not track it
features_df = features_df.drop(columns=['tbills_3m'])

__Correlation checks__

In [None]:
metrics = ['ATM_IV', 'Skew', 'Curvature']

for metric in metrics:
    pivot = features_df.pivot(index='date', columns='ticker', values=metric)
    corr = pivot.corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
    plt.title(f"Correlation of {metric} between SPY and QQQ")
    plt.show()

Volatility levels between SPY and QQQ are highly synchronized, skew is moderately aligned but allows for divergence, and curvature behaves almost independently.
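The same comparison can be collected into one table instead of three heatmaps. The sketch below also reports the correlation of daily changes, since two trending series can look correlated in levels without moving together day to day. That second column is an addition for illustration, and the helper name `cross_ticker_corr` is hypothetical.

```python
import pandas as pd


def cross_ticker_corr(
    features: pd.DataFrame,
    metrics=("ATM_IV", "Skew", "Curvature"),
    a: str = "SPY",
    b: str = "QQQ",
) -> pd.DataFrame:
    """Correlation between tickers `a` and `b` for each metric, in levels and in daily changes."""
    rows = []
    for metric in metrics:
        # One column per ticker, one row per date (requires unique date/ticker pairs).
        wide = features.pivot(index="date", columns="ticker", values=metric).sort_index()
        rows.append({
            "metric": metric,
            "corr_levels": wide[a].corr(wide[b]),
            "corr_changes": wide[a].diff().corr(wide[b].diff()),
        })
    return pd.DataFrame(rows).set_index("metric")
```

Calling `cross_ticker_corr(features_df)` gives one row per metric, so the three statements above can be read off a single frame.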

In [None]:
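# Augmented Dickey-Fuller test on SPY ATM IV.
# Null hypothesis: the series has a unit root (is non-stationary); a small p-value rejects it.
series = features_df.loc[features_df['ticker'] == 'SPY', 'ATM_IV']
result = adfuller(series.dropna())
print("ADF Statistic:", result[0])
print("p-value:", result[1])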

### EDA Summary
- **Data quality:** No major missing values.
- **Distributions:** SPY and QQQ IV show broadly similar distributions, with SPY IV centered lower than QQQ.
- **Time series:** Both symbols show volatility spikes around market stress dates.
- **Correlations:** ATM IV is highly correlated between SPY and QQQ (~0.8).

__Save aggregated daily features to Parquet, partitioned by ticker and year for downstream use__

In [None]:
features_df['year'] = pd.to_datetime(features_df['date']).dt.year

# Write to Parquet with partitioning
features_df.to_parquet(
    "features_parquet/",
    engine="fastparquet",        # or "pyarrow"
    partition_cols=["ticker", "year"],
    index=False
)