# Preprocessing & Graph Construction

**Goal.** Build the day-by-day graph dataset used throughout the thesis:
- one **graph per date**,
- **fixed nodes**: `MOM, HML, OI, PC, BAS, LIQ, VOL` (the code labels momentum `Mom`),
- **directed edges** that encode the assumed causal diagram,
- node features that are **scalar values for that date**,
- a target of next-day `VOL` (or same-day `VOL`, if preferred).
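
The pairing of one date's node values with the next-day `VOL` target can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the function `make_daily_samples`, its `horizon` parameter, and the toy values are all hypothetical, and only two of the seven nodes are shown.

```python
import pandas as pd

# Hypothetical toy data: three trading days, two nodes, made-up values.
wide = pd.DataFrame(
    {"BAS": [0.10, 0.12, 0.11], "VOL": [1.0, 1.2, 0.9]},
    index=pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05"]),
)

def make_daily_samples(wide_df, target_col="VOL", horizon=1):
    """Pair each date's node values with the target `horizon` rows ahead.

    `horizon=1` gives a next-trading-day target, `horizon=0` a same-day target.
    """
    target = wide_df[target_col].shift(-horizon)
    samples = []
    for date, row in wide_df.iterrows():
        y = target.loc[date]
        if pd.isna(y):  # the last `horizon` dates have no future target
            continue
        samples.append({"date": date, "x": row.to_dict(), "y": float(y)})
    return samples

samples = make_daily_samples(wide)
print(len(samples), samples[0]["y"])
```

Shifting by rows rather than calendar days means "next day" is the next trading day in the index, so weekends and holidays do not create gaps in the target.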



In [None]:
import os
import numpy as np
import pandas as pd
import yfinance as yf
from pandas_datareader import data as pdr
import networkx as nx
import matplotlib.pyplot as plt
from pathlib import Path

# config
MARKET_PROXY_TICKER = "SPY"
ALL_NODES = ["Mom", "HML", "OI", "PC", "BAS", "LIQ", "VOL"]
START_DATE = "2023-01-01"
END_DATE = "2023-12-31"
VOL_WINDOW = 21
ROLL_SPREAD_WINDOW = 21

PROJECT_ROOT = Path().resolve().parent
OUTPUT_DIR = PROJECT_ROOT / "data" / "processed"
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Project Root detected at: {PROJECT_ROOT}")
print(f"Libraries loaded. Output will be saved to '{OUTPUT_DIR}'")


## 2. Feature Engineering (Definitions)

We follow standard operational definitions:

- **MOM (Momentum).** Winners minus losers over prior 12 months, skip last month.
- **HML (Value).** High minus low book-to-market proxy.
- **OI (Order Imbalance).** Net buyer-initiated minus seller-initiated flow.
- **PC (Price Change).** Same-day percentage return.
- **BAS (Bid–Ask Spread).** Quoted or effective spread proxy.
- **LIQ (Liquidity).** Illiquidity-based proxy; this notebook uses the Amihud ratio, so higher values indicate *lower* liquidity.
- **VOL (Realized Volatility).** Rolling window realized vol or daily range-based proxy.

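To obtain data for these abstract nodes, we use a market proxy, the SPDR S&P 500 ETF (`SPY`), for all price-derived and microstructure variables, and take the factor returns (`HML`, `Mom`) directly from the Kenneth French Data Library. Every microstructure metric is computed from daily prices and volume alone: `OI` is proxied by return-signed dollar volume, `BAS` by the Roll spread estimator, `LIQ` by the Amihud illiquidity ratio, and `VOL` by a 21-day rolling standard deviation of returns. The helper functions below compute these values; variables whose magnitudes are extremely large or small are rescaled afterwards.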
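
For reference, the proxies computed by the helper functions in the next cell can be written out as below. This is a sketch read off the code rather than a formal specification. The notation is introduced here for illustration: $P_t$ is the daily close, $V_t$ the daily share volume, $\Delta P_t = P_t - P_{t-1}$, and the standard deviation and covariance are sample estimates over a rolling 21-day window (`VOL_WINDOW`, `ROLL_SPREAD_WINDOW`).

$$
\begin{aligned}
\mathrm{PC}_t &= \frac{P_t - P_{t-1}}{P_{t-1}}, \\
\mathrm{VOL}_t &= \operatorname{std}\left(\mathrm{PC}_{t-20}, \dots, \mathrm{PC}_t\right), \\
\mathrm{LIQ}_t &= \frac{\lvert \mathrm{PC}_t \rvert}{P_t V_t + 10^{-9}}, \\
\mathrm{OI}_t &= \operatorname{sign}(\mathrm{PC}_t)\, P_t V_t, \\
\mathrm{BAS}_t &= 2\sqrt{\max\left(0,\; -\widehat{\operatorname{Cov}}\left(\Delta P_s, \Delta P_{s-1}\right)\right)}
\end{aligned}
$$

The $\max(0, \cdot)$ in the spread estimate reflects the clipping in `compute_roll_spread`: when the estimated autocovariance of price changes is positive, the square root is undefined and the spread is reported as zero.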




In [None]:
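# Helper functions
def compute_returns(df, price_col):
    """Simple daily percentage return of `price_col`."""
    return df[price_col].pct_change()

def compute_rolling_vol(returns, window=21):
    """Rolling sample standard deviation; NaN until a full window is available."""
    return returns.rolling(window=window, min_periods=window).std()

def compute_amihud(return_series, dollar_volume_series):
    """Amihud illiquidity: |return| / dollar volume (higher = less liquid)."""
    # The small constant guards against division by zero on zero-volume days.
    return (return_series.abs() / (dollar_volume_series + 1e-9)).fillna(0.0)

def compute_signed_volume(return_series, dollar_volume_series):
    """Order-imbalance proxy: dollar volume signed by the day's return."""
    return np.sign(return_series).fillna(0) * dollar_volume_series

def compute_roll_spread(price_series, window=21):
    """Roll spread estimate: 2 * sqrt(-cov(dP_t, dP_{t-1})), floored at zero."""
    dp = price_series.diff()
    dp_lag = dp.shift(1)
    cov = dp.rolling(window=window).cov(dp_lag)
    # A positive autocovariance has no real square root, so clip it to zero.
    spread = 2.0 * np.sqrt((-cov).clip(lower=0.0))
    # Warm-up rows (NaN covariance) are reported as a zero spread.
    return spread.fillna(0.0)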


In [None]:
print("Downloading Fama-French 3 Factors and Momentum Factor (Daily)...")
ff_3_factor = pdr.DataReader('F-F_Research_Data_Factors_daily', 'famafrench', start=START_DATE, end=END_DATE)[0]
ff_mom_factor = pdr.DataReader('F-F_Momentum_Factor_daily', 'famafrench', start=START_DATE, end=END_DATE)[0]


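# Combine and clean the factor data
ff_factors = ff_3_factor.join(ff_mom_factor, how='inner')
# Strip whitespace from column labels: the momentum file may label its
# column with trailing spaces, which would break the 'Mom' lookup below.
ff_factors.columns = ff_factors.columns.str.strip()
ff_factors /= 100.0  # percent -> decimal returns

ff_factors = ff_factors[['HML', 'Mom']].copy()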
print("Fama-French data downloaded successfully.")

print(f"\nDownloading market proxy data for {MARKET_PROXY_TICKER}...")
market_data_raw = yf.download(MARKET_PROXY_TICKER, start=START_DATE, end=END_DATE)
print("Market proxy data downloaded successfully.")

if isinstance(market_data_raw.columns, pd.MultiIndex):
    market_data = market_data_raw.xs(MARKET_PROXY_TICKER, level='Ticker', axis=1).copy()
else:
    market_data = market_data_raw.copy()

market_data.head()


In [None]:
price_col = 'Price' if 'Price' in market_data.columns else 'Close'
print(f"\nUsing '{price_col}' as the primary price column for calculations.")

market_data['pc'] = compute_returns(market_data, price_col=price_col)
market_data['vol'] = compute_rolling_vol(market_data['pc'], window=VOL_WINDOW)
market_data['dollarvolume'] = market_data[price_col] * market_data['Volume']
market_data['liq'] = compute_amihud(market_data['pc'], market_data['dollarvolume'])
market_data['bas'] = compute_roll_spread(market_data[price_col], window=ROLL_SPREAD_WINDOW)
market_data['oi'] = compute_signed_volume(market_data['pc'], market_data['dollarvolume'])

# Rename columns to match our final node names for clarity
market_data.rename(columns={
    'pc': 'PC', 'vol': 'VOL', 'liq': 'LIQ', 'bas': 'BAS', 'oi': 'OI'
}, inplace=True)

# Join the factor data with the calculated microstructure data
final_df_wide = ff_factors.join(market_data, how='inner').dropna()
final_df_wide = final_df_wide[ALL_NODES]
print("\nAll node values calculated and aligned.")
final_df_wide.head()


Our preprocessing uses a targeted scaling strategy: only variables with extreme ranges are transformed, which keeps the models numerically stable while preserving interpretability. Specifically, we apply a signed logarithmic transformation, `sign(x) * log1p(|x|)`, to compress the high-magnitude `OI` variable, and we multiply the low-magnitude `LIQ` and `VOL` variables by constant factors (`1e12` and `100`, respectively). All other variables remain on their original scale, so their estimated causal effects can be read directly in real-world units.
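
The signed logarithm is invertible, so effects estimated on the transformed `OI` can be mapped back to dollar units. The sketch below illustrates this under stated assumptions: the helper names `signed_log1p` and `signed_expm1` and the sample values are hypothetical, and the inverse step is not part of the original pipeline.

```python
import numpy as np

def signed_log1p(x):
    """Compress magnitude while keeping the sign: sign(x) * log(1 + |x|)."""
    return np.sign(x) * np.log1p(np.abs(x))

def signed_expm1(z):
    """Inverse of signed_log1p: sign(z) * (exp(|z|) - 1)."""
    return np.sign(z) * np.expm1(np.abs(z))

# Hypothetical signed dollar volumes (sell pressure, flat day, buy pressure).
oi = np.array([-2.5e10, 0.0, 3.0e9])

z = signed_log1p(oi)           # values compressed to a few tens at most
recovered = signed_expm1(z)    # back on the original dollar scale

print(z)
print(np.allclose(recovered, oi))
```

Unlike a plain logarithm, this transform is defined at zero and for negative values, which matters because `OI` takes both signs.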


In [None]:
df_t = final_df_wide.copy()

df_t['OI'] = np.sign(df_t['OI']) * np.log1p(np.abs(df_t['OI']))

LIQ_SCALE = 1e12   # 3e-13 -> ~0.3
VOL_SCALE = 100.0  # 0.01   -> ~1.0
df_t['LIQ'] = df_t['LIQ'] * LIQ_SCALE
df_t['VOL'] = df_t['VOL'] * VOL_SCALE

df_t.head()


## 3. Causal Graph (Fixed Topology)

We use a **fixed DAG** across all dates:

`MOM → HML`  
`MOM → PC`  
`HML → OI`  
`OI → BAS`  
`OI → PC`  
`PC → BAS`  
`BAS → LIQ`  
`BAS → VOL`

This topology encodes microstructure priors (order flow → prices/spreads; spreads → liquidity/volatility). Real markets exhibit feedback loops, so acyclicity is a **working assumption** that we adopt to keep graph surgery tractable.
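
A quick sanity check on the edge list above is to confirm that it is acyclic and to see what graph surgery does to it. The sketch below uses only the standard library (`graphlib`, Python 3.9+). It assumes that "graph surgery" means the usual do-operator construction, in which every edge pointing into the intervened node is removed; the function names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Edge list copied from the diagram above (markdown node labels).
EDGES = [
    ("MOM", "HML"), ("MOM", "PC"), ("HML", "OI"), ("OI", "BAS"),
    ("OI", "PC"), ("PC", "BAS"), ("BAS", "LIQ"), ("BAS", "VOL"),
]

def topological_order(edges):
    """Return the nodes in an order where every source precedes its targets."""
    predecessors = {}
    for source, target in edges:
        predecessors.setdefault(source, set())
        predecessors.setdefault(target, set()).add(source)
    return list(TopologicalSorter(predecessors).static_order())

def is_acyclic(edges):
    """True if the edge list forms a DAG, False if it contains a cycle."""
    try:
        topological_order(edges)
        return True
    except CycleError:
        return False

def intervene(edges, node):
    """Graph surgery for do(node): drop every edge pointing into `node`."""
    return [(s, t) for s, t in edges if t != node]

order = topological_order(EDGES)
surgered = intervene(EDGES, "BAS")  # removes OI -> BAS and PC -> BAS

print(order)
print(is_acyclic(EDGES), len(surgered))
```

Adding a feedback edge such as `VOL → MOM` would make `is_acyclic` return `False`, which is why feedback has to be excluded for this style of analysis.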



In [None]:
# --- 1. features.csv ---
features_long = df_t.reset_index().melt(
    id_vars='Date', value_vars=ALL_NODES, var_name='node_id', value_name='value')
features_long.rename(columns={'Date': 'date'}, inplace=True)
features_long['date'] = pd.to_datetime(features_long['date']).dt.strftime("%Y-%m-%d")
features_path = OUTPUT_DIR / "features.csv"
features_long.to_csv(features_path, index=False)
print(f"Saved features.csv to {features_path}")

# --- 2. nodes.csv ---
nodes_df = pd.DataFrame({'node_id': ALL_NODES})
nodes_path = OUTPUT_DIR / "nodes.csv"
nodes_df.to_csv(nodes_path, index=False)
print(f"Saved nodes.csv to {nodes_path}")

# --- 3. edges.csv ---
edges_list = [
    {'source': 'Mom', 'target': 'HML'}, {'source': 'Mom', 'target': 'PC'},
    {'source': 'HML', 'target': 'OI'},  {'source': 'OI', 'target': 'PC'},
    {'source': 'OI', 'target': 'BAS'},  {'source': 'PC', 'target': 'BAS'},
    {'source': 'BAS', 'target': 'LIQ'}, {'source': 'BAS', 'target': 'VOL'},
]
edges_df = pd.DataFrame(edges_list)
edges_path = OUTPUT_DIR / "edges.csv"
edges_df.to_csv(edges_path, index=False)
print(f"Saved edges.csv to {edges_path}")

In [None]:
print("\n--- Visualizing the Constructed Causal Graph ---")

G = nx.DiGraph()
G.add_nodes_from(nodes_df['node_id'])
G.add_edges_from(edges_df.values)

# Explicitly set the (x, y) coordinates of each node so the drawing is
# reproducible and arranged in layers from root cause (top) to outcomes (bottom).
pos = {
    'Mom': (0, 4),      # Top layer (Root Cause)
    'HML': (-1.5, 3),   # Mid-layer 1
    'PC':  (1.5, 3),
    'OI':  (0, 2),      # Mid-layer 2
    'BAS': (0, 1),      # Mid-layer 3
    'LIQ': (-1, 0),     # Bottom layer (Outcomes)
    'VOL': (1, 0)
}


plt.figure(figsize=(12, 10))
node_options = {
    "node_size": 4500,
    "node_color": "#e0f7fa",
    "edgecolors": "#0077b6",
    "linewidths": 2.0,
}
label_options = {
    "font_size": 12,
    "font_weight": "bold",
}

edge_options = {
    "width": 2.0,
    "edge_color": "#495057",
    "arrowstyle": "-|>",  
    "arrowsize": 25,
}

nx.draw(
    G,
    pos,
    with_labels=True,
    **node_options,
    font_size=label_options["font_size"],
    font_weight=label_options["font_weight"],
    **edge_options
)

plt.title("Hypothesized Causal Graph", size=20)
plt.axis('off') 
plt.show()


In [None]:
print("\n--- Preprocessing Complete ---")
print(f"Total nodes: {len(nodes_df)}")
print(f"Total causal edges: {len(edges_df)}")
print(f"Date range: {final_df_wide.index.min().date()} to {final_df_wide.index.max().date()}")
print(f"Total time steps (days): {len(final_df_wide)}")