# Week 1 — Financial Markets, Data Structures & Microstructure

> **"Every ML-for-finance failure you'll ever read about started here — with someone who didn't understand their data."**

Here's something that might surprise you: when you click "buy" on Robinhood, your order doesn't go to "the stock market." It goes to Citadel Securities — a single company that handles about 25% of all US equity trades. They look at your order, decide whether to fill it themselves, and pocket a fraction of a penny for their trouble. They made $7.5 billion in revenue in 2022. On pennies per share. To understand why that system exists, why those fractions of pennies add up to billions, and why it matters for every ML model you'll ever build in finance — we need to start from the beginning.

But first, a horror story. On August 1, 2012, Knight Capital Group deployed new trading software to production. Somewhere in the release, old test code was reactivated: code that bought at the ask and sold at the bid, exactly backwards, across 154 stocks simultaneously. In 45 minutes, Knight Capital lost $440 million. That's roughly $10 million per minute. The firm had about $365 million in cash on hand. It was effectively insolvent before lunch, survived only through an emergency $400 million rescue, and was acquired by Getco the following year. The bug wasn't in a machine learning model. It wasn't in a neural network. It was in the plumbing: the system that decides what prices to look at and what orders to send. Knight Capital didn't have a model failure. They had a data and systems failure. The code did exactly what it was told. It was told the wrong thing.

This should unsettle you, because the single most common reason ML models fail in finance has nothing to do with architecture, hyperparameters, or loss functions. It's the data. Specifically, it's that the person who built the model didn't understand the data they were feeding it — didn't know that stock prices can't be treated like pixel values, didn't know that their "clean" dataset had quietly removed every company that went bankrupt, didn't know that the number their model was predicting had already been reported three days before the timestamp said it would be. These aren't edge cases. These are the default state of financial data.

This week, we're going to take you — an ML engineer who knows how to build a transformer from scratch but has never thought about what a bid-ask spread is — and give you the mental model you need to work with financial data without making the mistakes that have cost real firms real money. We'll start from what happens when you click "buy," build up to the data structures that become inputs to your models, and encounter four horsemen of financial data pathology along the way: survivorship bias, look-ahead bias, non-stationarity, and fat tails. Each one can silently destroy a model that would otherwise look brilliant in a backtest.

By the end of this lecture, you'll know how to download financial data, clean it properly, compute returns correctly, understand why your return distributions look nothing like a Gaussian, and build alternative sampling methods that produce better-behaved inputs for ML models. You'll also have a healthy paranoia about data quality that will serve you for the rest of this course — and the rest of your career.

In [None]:
import warnings; warnings.filterwarnings('ignore', category=FutureWarning)
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
from scipy import stats

plt.rcParams.update({'figure.figsize': (10, 5), 'font.size': 12,
                      'axes.grid': True, 'grid.alpha': 0.3})

def get_close(data):
    """Extract close prices, handling yfinance MultiIndex."""
    if isinstance(data.columns, pd.MultiIndex):
        return data['Close'].squeeze()
    return data['Close']

## 1. How Financial Markets Actually Work

Imagine you're buying a used car. The sticker says $20,000. But you can't just pay the sticker price — that's the seller's *asking* price. You counter at $19,500. That $500 gap between what the buyer will pay and what the seller demands? On Wall Street, they call it the **bid-ask spread**. It's the price of immediacy — the tax you pay for wanting something *right now* instead of waiting for a better deal.

Now scale this up. An exchange — NYSE, NASDAQ, CME — is nothing more than a matching engine. It maintains a giant list of everyone who wants to buy (the **bids**) and everyone who wants to sell (the **asks**), sorted by price. This list is called the **order book**. When you place a **limit order**, you're adding yourself to the queue: "I'll buy 100 shares at $189.99, and I'll wait." When you place a **market order**, you're saying: "I'll buy 100 shares at whatever the best available price is, and I'll pay it right now." Market orders execute instantly but at whatever price the book offers you. Limit orders might never execute at all.

Standing in the middle of all this are **market makers** — firms that continuously post both bids and asks, earning the spread on every round-trip. Citadel Securities is the largest. They don't bet on whether Apple will go up or down; they bet that they can buy at $189.99 and sell at $190.01 enough times per second to make the pennies add up. And they do — to the tune of billions per year. Every time your Robinhood order fills instantly, there's a good chance Citadel took the other side of it.

Why does any of this matter for ML? Because every time your model says "buy," someone is quietly taking money from you. The bid-ask spread is a tax that comes out of your returns on every single trade, both in and out. If your model can't predict returns larger than this tax, it's losing money by design — no matter how good the architecture is.

Let's make this concrete with a formula. The round-trip cost of a trade — buying and then selling — is simply the spread divided by the mid-price:

$$\text{Round-trip cost} = \frac{\text{ask} - \text{bid}}{\text{mid-price}} = \frac{\text{ask} - \text{bid}}{(\text{ask} + \text{bid}) \,/\, 2}$$

For Apple at bid = $189.99, ask = $190.01, that's about 0.01% — you'd barely notice. But for a micro-cap stock trading at bid = $4.75, ask = $5.25, the round-trip cost is 10%. Your model needs to be right by more than 10% just to break even. Most models aren't right by 1%. This is why nearly every profitable quant strategy focuses on liquid, large-cap stocks — not because they're easier to predict, but because the spread doesn't eat your edge alive.
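The arithmetic above is easy to check in code. The following is a minimal sketch; the helper name `round_trip_cost` is ours for illustration and is not part of any library.

```python
def round_trip_cost(bid: float, ask: float) -> float:
    """Quoted spread as a fraction of the mid-price (hypothetical helper)."""
    if ask < bid:
        raise ValueError("ask must be >= bid")
    mid = (ask + bid) / 2
    return (ask - bid) / mid

# The two quotes used in the text
aapl_cost = round_trip_cost(189.99, 190.01)
microcap_cost = round_trip_cost(4.75, 5.25)

print(f"Large-cap round trip: {aapl_cost:.4%}")
print(f"Micro-cap round trip: {microcap_cost:.2%}")
```

This only captures the quoted spread. Commissions, market impact, and slippage would come on top, so treat the result as a lower bound on real trading costs.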

Let's see what an order book looks like. We'll simulate one for a stock like Apple — 10 price levels on each side of the spread. If the book were perfectly balanced, buyers and sellers would be equally eager. They rarely are, and that imbalance is itself a signal that high-frequency trading firms monitor thousands of times per second.

In [None]:
np.random.seed(42)
mid = 190.00
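# Build levels from integer steps; np.arange with a float step can return 10 or 11 elements
bid_prices = np.round(mid - 0.01 * np.arange(1, 11), 2)  # 189.99 down to 189.90
ask_prices = np.round(mid + 0.01 * np.arange(1, 11), 2)  # 190.01 up to 190.10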
bid_volumes = np.random.randint(500, 900, size=10)
ask_volumes = np.random.randint(300, 650, size=10)

fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(range(10), -bid_volumes, color='#2ecc71', alpha=0.8, label='Bids (buyers)')
ax.barh(range(10), ask_volumes, color='#e74c3c', alpha=0.8, label='Asks (sellers)')
labels = [f'${b}  |  ${a}' for b, a in zip(bid_prices, ask_prices)]
ax.set_yticks(range(10)); ax.set_yticklabels(labels)
ax.set_xlabel('Volume (shares)'); ax.set_title('Simulated Order Book — AAPL-like Stock')
ax.legend(loc='lower right'); ax.axvline(0, color='black', linewidth=1.5)
plt.tight_layout(); plt.show()

Look at those volumes: roughly 500–900 shares stacked at each bid level, 300–650 at each ask. We built that asymmetry into the simulation on purpose, but in a real book it would mean more people want to buy than sell at these prices. Firms like Jump Trading and Virtu Financial monitor exactly this ratio on microsecond timescales. A persistent bid-side imbalance often predicts short-term upward price movement, because demand is outstripping supply. We'll work with a much slower version of this signal in later weeks, but the principle is the same: the order book isn't just a list of orders, it's a live readout of collective opinion.
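One common way to turn that asymmetry into a number is a normalized imbalance between resting bid and ask volume. The sketch below assumes a simple definition, (bid − ask) / (bid + ask) over the top few levels; the function name `book_imbalance`, the `levels` parameter, and the example volumes are illustrative choices, not a standard API.

```python
import numpy as np

def book_imbalance(bid_volumes, ask_volumes, levels=5):
    """Normalized imbalance in [-1, 1] over the top `levels` of the book.

    +1 means only bids are resting, -1 means only asks (hypothetical helper).
    """
    bid = np.asarray(bid_volumes, dtype=float)[:levels].sum()
    ask = np.asarray(ask_volumes, dtype=float)[:levels].sum()
    return (bid - ask) / (bid + ask)

# Made-up volumes for the five best levels on each side
example_bids = [700, 650, 800, 600, 750]
example_asks = [400, 500, 350, 450, 300]

imb = book_imbalance(example_bids, example_asks)
print(f"Imbalance: {imb:+.3f}")
```

A positive value means more resting buy interest than sell interest. Real implementations often weight levels by their distance from the mid-price, which this sketch does not do.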

Notice the gap in the middle — that's the spread. For a stock like Apple, it's a penny. For illiquid names, it can be dollars. Every time your model crosses that gap by submitting a market order, it pays the spread as an implicit tax.

Here's the bottom line for your future models: a strategy that turns over daily at 10 basis points round-trip (0.10%) burns 25% per year in transaction costs alone. The average hedge fund's gross return is about 10–15%. You'd be spending twice your expected revenue on shipping costs. That's not a strategy — that's a donation to market makers.
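To see where the 25% figure comes from, here is a small sketch of the cost arithmetic. It assumes 252 trading days and one full round trip per day; the function name and the optional compounding switch are our additions for illustration.

```python
TRADING_DAYS = 252

def annual_cost_drag(round_trip_bps, round_trips_per_day=1.0, compound=False):
    """Fraction of capital lost per year to round-trip costs (hypothetical helper)."""
    per_day = round_trip_bps / 1e4 * round_trips_per_day
    if compound:
        # Costs paid on a shrinking capital base
        return 1 - (1 - per_day) ** TRADING_DAYS
    # Simple sum of daily costs, the back-of-envelope version used in the text
    return per_day * TRADING_DAYS

simple_drag = annual_cost_drag(10)
compound_drag = annual_cost_drag(10, compound=True)

print(f"Simple sum:  {simple_drag:.1%}")
print(f"Compounded:  {compound_drag:.1%}")
```

The compounded figure is somewhat lower than the simple sum because each day's cost is charged on slightly less capital, but either way the drag dwarfs a typical strategy's gross edge.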

## 2. Financial Data Types and Their Quirks

On June 6, 2022, Amazon's 20:1 stock split took effect. The stock went from roughly $2,447 to roughly $122 over a weekend. Same company, same market cap, same everything: management just decided to cut each share into 20 smaller pieces. If your model is looking at raw prices, it sees a 95% crash. If it's looking at adjusted prices, it sees nothing, just a quiet Monday. Same company, same value, but one version of the data says the world ended and the other says it was a normal day. This is the adjusted price problem, and it's the first of many ways financial data will try to trick your model.

Let's talk about what financial data actually looks like. The workhorse format is **OHLCV** — Open, High, Low, Close, Volume. Each row compresses an entire trading day (or hour, or minute) into five numbers. The Open is the first trade of the period, the Close is the last, and the High and Low capture the extremes. Volume counts how many shares changed hands. This is a brutal compression — thousands of individual trades, millions of dollars of flow, the entire drama of buyers and sellers fighting over price — all collapsed into five numbers. It's like summarizing a movie with its first frame, last frame, the brightest frame, the darkest frame, and a ticket count.

Then there's the distinction between **adjusted** and **unadjusted** prices. The raw (unadjusted) close is what actually traded on that day — the number you would have seen on your screen. The adjusted close has been retroactively modified to account for stock splits and dividends, so that the historical prices are consistent with today's share count. Which one should you use? Almost always adjusted. But you need to understand what the adjustment does, because it rewrites history — and history that's been rewritten can mislead in subtle ways.
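The mechanics of a split adjustment can be shown on a toy series. The prices below are invented to resemble a 20:1 split, and the adjustment handles splits only (no dividends), so this is a sketch of the idea rather than what any data vendor actually does.

```python
import numpy as np
import pandas as pd

dates = pd.bdate_range("2022-06-01", periods=6)
split_date = dates[3]   # first day trading at the post-split price (assumed)
split_ratio = 20.0

# Hypothetical as-traded closes: about $2,440 before the split, about $122 after
raw = pd.Series([2430.0, 2450.0, 2440.0, 122.5, 123.0, 121.8], index=dates)

# Back-adjust: divide every pre-split price by the split ratio
factor = pd.Series(np.where(dates < split_date, split_ratio, 1.0), index=dates)
adjusted = raw / factor

raw_ret = raw.pct_change()
adj_ret = adjusted.pct_change()

print(pd.DataFrame({"raw": raw, "adjusted": adjusted,
                    "raw_ret": raw_ret, "adj_ret": adj_ret}).round(4))
```

On the split date the raw return is a collapse of roughly 95%, while the adjusted return is a fraction of a percent. On every other day the two return series agree, which is why adjusted prices are the safer input for return-based features.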

There's another trap waiting here, and it's even more insidious than splits: **look-ahead bias**. Fundamental data — earnings, revenue, book value — arrives with a delay. A company's Q4 earnings might cover October through December, but the actual report isn't filed until February. If your dataset timestamps Q4 earnings as "December 31," your model is using information that wasn't available until two months later. Your backtest looks amazing. Your live strategy loses money. This is look-ahead bias, and it's the most common way quants accidentally cheat. We won't fix it today — that requires a proper point-in-time database — but you need to know it exists, because it will haunt you in Week 4 when we start working with fundamental features.
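A toy example makes the leak visible. The table layout, column names (`period_end`, `filing_date`, `eps`), and all values below are made up for illustration. The sketch uses `pandas.merge_asof`, which attaches to each price date the most recent fundamental row whose key is on or before that date.

```python
import pandas as pd

# Hypothetical quarterly fundamentals: the period they cover vs. when they became public
fundamentals = pd.DataFrame({
    "period_end": pd.to_datetime(["2022-09-30", "2022-12-31"]),
    "filing_date": pd.to_datetime(["2022-11-01", "2023-02-03"]),
    "eps": [1.20, 1.88],
})

prices = pd.DataFrame({
    "date": pd.to_datetime(["2022-12-30", "2023-01-17", "2023-02-06"]),
    "close": [130.0, 135.0, 151.0],
})

# Leaky join: keyed on period end, so Q4 numbers appear before they were filed
leaky = pd.merge_asof(prices, fundamentals.sort_values("period_end"),
                      left_on="date", right_on="period_end")

# Point-in-time join: keyed on filing date, so each row sees only public information
pit = pd.merge_asof(prices, fundamentals.sort_values("filing_date"),
                    left_on="date", right_on="filing_date")

print(leaky[["date", "eps"]])
print(pit[["date", "eps"]])
```

On January 17 the leaky join already shows the Q4 figure, while the point-in-time join still shows Q3, which is all a trader could have known that day. A real point-in-time database also tracks restatements, which this sketch ignores.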

Let's see the adjusted vs. unadjusted problem with real data. We'll download 10 years of Apple prices — both the raw and the adjusted versions — and spot the splits. Apple did a 7:1 split in June 2014 and a 4:1 split in August 2020. Both should be visible as sudden cliffs in the raw data, but completely invisible in the adjusted series.

In [None]:
aapl = yf.download("AAPL", start="2014-01-01", end="2024-01-01", auto_adjust=False)
print(f"Columns: {list(aapl.columns.get_level_values(0).unique())}")
print(f"Date range: {aapl.index[0].date()} to {aapl.index[-1].date()}")
print(f"Total trading days: {len(aapl)}")
aapl.head()

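Notice we have both "Close" and "Adj Close" in the columns. In principle, the raw close is what actually traded that day, the number that appeared on Bloomberg terminals and brokerage screens, while the adjusted close has been retroactively scaled for splits and dividends, which makes older prices *lower* than they really were. For example, before the 4:1 split in 2020, Apple traded around $500. After adjustment, that same day shows roughly $125. One caveat specific to this data source: Yahoo Finance describes its "Close" column as already adjusted for splits, with "Adj Close" additionally adjusted for dividends, so the `Close` you just downloaded may not be the true as-traded price. Check the early-2014 values in the table above: if they are in the tens of dollars rather than the hundreds, the split adjustment has already been applied. Neither number is "wrong"; they're just answers to different questions. The raw price answers "what could I have bought it for?" The adjusted price answers "what is this equivalent to in today's shares?" For ML, we almost always want the adjusted version. Let's see why visually.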

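Below, we plot both series on the same axis. The adjusted series (orange) will be a smooth curve from bottom-left to top-right: Apple's steady appreciation over a decade. A truly as-traded series would show dramatic stair-steps at each split, sudden drops that look like crashes but are not. If the blue `Close` line from Yahoo turns out to be split-adjusted already, you will instead see two smooth lines separated by a small gap that widens going back in time, which is the cumulative effect of dividends. Either way, the lesson stands: your ML model can't tell the difference between a split and a crash. That's the problem.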

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
close = aapl['Close'].squeeze() if isinstance(aapl['Close'], pd.DataFrame) else aapl['Close']
adj = aapl['Adj Close'].squeeze() if isinstance(aapl['Adj Close'], pd.DataFrame) else aapl['Adj Close']
ax.plot(close.index, close.values, label='Raw Close', alpha=0.8, color='#3498db')
ax.plot(adj.index, adj.values, label='Adjusted Close', alpha=0.8, color='#e67e22')
ax.set_title('AAPL: Raw vs. Adjusted Close (2014–2024)')
ax.set_ylabel('Price ($)'); ax.legend()
ax.axvline(pd.Timestamp('2014-06-09'), color='gray', ls='--', alpha=0.5)
ax.axvline(pd.Timestamp('2020-08-31'), color='gray', ls='--', alpha=0.5)
ax.annotate('7:1 split', xy=(pd.Timestamp('2014-06-09'), 100), fontsize=9, color='gray')
ax.annotate('4:1 split', xy=(pd.Timestamp('2020-08-31'), 200), fontsize=9, color='gray')
plt.tight_layout(); plt.show()

Look at the dashed lines marking the two splits. In as-traded prices, the first is the 7:1 split in June 2014, where the price drops from around $650 to around $93 overnight. The second is the 4:1 split in August 2020, where it drops from roughly $500 to $125. If the blue line shows no stair-steps at those dates, your data source has already applied the split adjustment to `Close`. To a human, these are obviously not crashes. To a model fed as-traded prices, they're the most dramatic price events in the entire dataset, and they mean *nothing*.

The orange adjusted line is smooth through both events. That's what your model should see. But be warned: the adjustment changes historical prices retroactively. The number in your dataset for January 3, 2014 is NOT the number that Apple actually traded at on that day. It's a fiction — a useful fiction that makes the math work, but a fiction nonetheless. If your model needs to know what price was actually available for execution on a given day (for example, when computing whether a limit order would have filled), you need unadjusted prices. For returns and features, use adjusted. For execution simulation, use raw. Mixing them up is a classic quant mistake.

## 3. Data Pathology #1 — Survivorship Bias

Let's play a game. Name the 10 biggest US companies from the year 2000. You'll probably remember Microsoft, GE, Walmart, maybe Cisco or Intel. You probably won't remember WorldCom, the 20th largest company by market cap, which filed what was then the largest bankruptcy in US history just two years later. Or Enron, the 7th largest by revenue, whose stock went from $90 to $0.26 as it collapsed into bankruptcy in late 2001. Or Lehman Brothers, or Bear Stearns, or Washington Mutual. These weren't small companies. They were titans. And they're gone.

If you build your training universe from today's index members, these companies simply don't exist. They've been erased from the dataset as thoroughly as if they never existed at all. Your model trains on today's S&P 500, a list that, by definition, contains only companies that survived and prospered enough to still be in the index. The dead have been removed and replaced with companies like Tesla (added 2020), Meta (added 2013), and Netflix (added 2010). Your model learns from this curated list of winners and concludes that large-cap stocks always survive. They don't.

This is **survivorship bias**, and it's the most insidious data quality problem in finance. It doesn't show up as a missing value or an error code. It shows up as an absence — the absence of every company that failed, went bankrupt, or was acquired at a loss. Your dataset doesn't tell you what's missing. It just quietly pretends that the surviving companies are the whole story.

Here's the scale of the problem: of the 500 companies in the S&P 500 in the year 2000, approximately 150 are no longer in the index by 2024. That's 30% turnover. A model trained on today's index members has never seen 30% of the actual market that existed during its training period. It's as if you trained an image classifier on the 70% of images that were "easy" and threw away the ones the model got wrong.

We can express the bias formally. The survivorship bias is the difference in average return between a survivor-only dataset and the full universe that actually existed:

$$\text{Survivorship Bias} = \bar{R}_{\text{survivors only}} - \bar{R}_{\text{full universe}}$$

This number is almost always positive, because the survivors, almost by definition, did better than the companies that died. Elton, Gruber & Blake (1996) measured it at roughly 0.9% per year for mutual funds. For individual stocks, Shumway & Warther (1999) estimated the bias from delisted stocks at 0.5–1.0% *per month*. That compounds to enormous errors over a backtest horizon. A 14-year backtest inflated by even 1% per year overstates cumulative returns by about 15%. If your model shows a 20% annualized return, a meaningful chunk of that might be ghosts.
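The formula can be illustrated with a small simulation. Everything here is an assumption chosen for illustration: the return distribution (5% mean, 30% volatility in annual log returns), the 14-year horizon, and the delisting rule (a stock "dies" if it ever falls 80% below its starting value). For simplicity the dead stocks keep their full return history in the "full universe" average.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_years = 1000, 14

# Hypothetical universe of annual log returns
annual_log_ret = rng.normal(0.05, 0.30, size=(n_stocks, n_years))
cum_log_ret = annual_log_ret.cumsum(axis=1)

# Assumed delisting rule: cumulative value ever drops below 20% of the start
died = cum_log_ret.min(axis=1) < np.log(0.20)
survivors = ~died

full_mean = annual_log_ret.mean()
survivor_mean = annual_log_ret[survivors].mean()
bias = survivor_mean - full_mean

print(f"Stocks delisted:       {died.sum()} of {n_stocks}")
print(f"Mean annual log return, full universe:  {full_mean:.4f}")
print(f"Mean annual log return, survivors only: {survivor_mean:.4f}")
print(f"Survivorship bias:                      {bias:+.4f}")
```

Every stock in this simulation was drawn from the same distribution, so the gap between the two averages comes purely from conditioning on survival. The size of the gap depends entirely on the assumed parameters and should not be read as an estimate for real markets.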

Let's try to download data for some of these ghosts. If survivorship bias is real — if the dead truly have been erased from history — then yfinance should have nothing for companies that no longer exist. We'll try Enron, WorldCom, Lehman Brothers, Bear Stearns, and Washington Mutual — five companies that were household names and are now footnotes.

In [None]:
ghosts = {
    'Enron': 'ENRNQ',
    'WorldCom': 'WCOEQ',
    'Lehman Brothers': 'LEH',
    'Bear Stearns': 'BSC',
    'Washington Mutual': 'WAMUQ'
}

for name, ticker in ghosts.items():
    data = yf.download(ticker, start="2000-01-01", end="2010-01-01",
                       progress=False)
    rows = len(data)
    status = f"{rows} rows" if rows > 0 else "EMPTY — erased from history"
    print(f"{name:25s} ({ticker:6s}): {status}")

Most of these return empty datasets. Lehman Brothers (LEH) might return partial data up to September 2008 — the month it filed the largest bankruptcy in US history with $639 billion in assets. But Enron? Gone. WorldCom? Gone. The dead have been erased from the free data that most of us use for research.

And here's a fact that should keep you up at night: even the bluest of blue chips get dropped. General Electric, an original member of the Dow Jones Industrial Average and a continuous member since 1907, was removed from the Dow in 2018 after 111 years. A model trained on today's S&P 500 has never seen a stock go from $90 to $0.26 (Enron), or from the mid-$60s to pennies in under a year (Lehman). It has never watched a company die. It will be very confused when one does.

The ML analogy is exact: survivorship bias is training on a dataset where all the failed examples have been removed. You'd get great training accuracy and terrible real-world performance. That's exactly what happens when you backtest on Yahoo Finance data. In the seminar, we'll try to quantify exactly how many percentage points this bias adds to a backtest — and the number will make you uncomfortable.

## 4. Non-Stationarity and Fat Tails

On October 19, 1987, the Dow Jones Industrial Average dropped 22.6% in a single day (the S&P 500 fell about 20.5%). Under a Gaussian model with the historical mean and standard deviation, this was roughly a 25-sigma event, a move 25 standard deviations from the mean. The probability of that, under Gaussian assumptions, is on the order of $10^{-138}$. To put that number in perspective: the universe is about $4.3 \times 10^{17}$ seconds old. If you watched the market every second since the Big Bang, across $10^{50}$ parallel universes, you still wouldn't expect to see this event once. It was supposed to be impossible on timescales that make the age of the universe look like a coffee break. It happened on a Monday.
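Gaussian tail probabilities like this one can be computed directly with `scipy.stats.norm.sf`, the one-sided survival function. The sketch below prints the probability for a few sigma levels and redoes the thought experiment; the universe-age figure and the $10^{50}$ multiplier are taken from the paragraph above.

```python
import numpy as np
from scipy import stats

UNIVERSE_AGE_SECONDS = 4.3e17
PARALLEL_UNIVERSES = 1e50

for k in (5, 10, 25):
    p = stats.norm.sf(k)                        # one-sided P(Z > k)
    log10_p = stats.norm.logsf(k) / np.log(10)  # same quantity on a log10 scale
    print(f"{k:>2}-sigma: p = {p:.3e}  (log10 p = {log10_p:.1f})")

# One observation per second since the Big Bang, in every parallel universe
p25 = stats.norm.sf(25)
expected_hits = p25 * UNIVERSE_AGE_SECONDS * PARALLEL_UNIVERSES
print(f"Expected 25-sigma observations: {expected_hits:.2e}")
```

Working on the log scale with `logsf` is the safer habit, since for even larger sigma values the plain probability underflows to zero in double precision.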

That crash tells us two things about financial data, and both of them are bad news for your ML model. The first is **non-stationarity**: the data-generating process that produced stock prices in 1987 is not the same process producing them today. Regulations changed, algorithmic trading arrived, new instruments were invented, market participants evolved. The distribution shifts constantly. The second is **fat tails**: even within a single regime, extreme events happen far more often than any Gaussian model predicts. These aren't separate problems — they're two faces of the same fundamental challenge.

Let's start with non-stationarity, because it's the more immediate problem for ML. A **stationary** time series is one whose statistical properties — mean, variance, autocorrelation — don't change over time. Temperature in a city is roughly stationary (it fluctuates around a fixed seasonal pattern). Stock prices are emphatically not. They trend upward over decades, crash periodically, and exhibit volatility that clusters in bursts. If your model assumes the training data and test data come from the same distribution, and they don't, you're in trouble before you write a single line of code.

Here's what non-stationarity means concretely for your model. Apple's split-adjusted price in 2015 was around $30. In 2024, it's around $190. If you train an LSTM on 2015 data and test on 2024 data, every single number in the test set is outside the training range. The model is extrapolating every prediction. This is the equivalent of training an image classifier on cats and testing on dogs, except it's harder to notice, because the numbers *look* similar. $30 and $190 are both reasonable-looking numbers. But to a model that learned its patterns from the $25–$33 range, $190 is as foreign as a picture of a dog to a cat classifier.

Let's see non-stationarity with our own eyes. We'll plot 24 years of SPY prices — the S&P 500 ETF, the most widely traded security in the world. If the series were stationary — if it came from a stable distribution — you'd see it oscillate around a fixed level, like a temperature reading. It won't.

In [None]:
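# auto_adjust is set explicitly so the result does not depend on the installed yfinance default
spy_raw = yf.download("SPY", start="2000-01-01", end="2024-01-01", auto_adjust=True)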
spy_close = get_close(spy_raw)

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(spy_close.index, spy_close.values, color='#2c3e50', linewidth=0.8)
ax.set_title('SPY Close Price (2000–2024): A Very Non-Stationary Process')
ax.set_ylabel('Price ($)'); ax.set_xlabel('Date')
plt.tight_layout(); plt.show()

That's not a stationary process — it's a series with a clear upward trend, changing variance (look at the 2008–2009 crash and the COVID dip in March 2020), and regime shifts (the post-2009 bull run looks nothing like the 2000–2002 dot-com bust). Feeding raw prices directly into an ML model is like feeding it a ruler and asking it to predict the next inch mark. It'll memorize the training range and fail spectacularly outside it.

The partial fix is **returns** — the percentage change from one period to the next. Returns are *roughly* stationary: they fluctuate around a mean that's close to zero, and their variance is somewhat stable over time. We'll cover returns properly in Section 5. But first, let's address the other half of the problem: even returns have a distribution that's very far from Gaussian, and that matters enormously for risk management.
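A quick way to see why returns behave better than prices is to compare the first and second half of a simulated series. The data below is synthetic (i.i.d. Gaussian log returns with a small positive drift), so it shows the mechanism only and says nothing about real markets. The "shift" statistic is an ad hoc measure we define here, not a formal stationarity test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000

# Hypothetical daily log returns and the price path they imply
log_ret = pd.Series(rng.normal(0.0005, 0.01, size=n))
price = 100 * np.exp(log_ret.cumsum())

half = n // 2

def mean_shift(series):
    """Change in mean between halves, in units of the first half's std (ad hoc)."""
    first, second = series.iloc[:half], series.iloc[half:]
    return abs(second.mean() - first.mean()) / first.std()

price_shift = mean_shift(price)
ret_shift = mean_shift(log_ret)

print(f"Price mean shift:  {price_shift:.2f} first-half standard deviations")
print(f"Return mean shift: {ret_shift:.3f} first-half standard deviations")
```

The price level drifts far from where it started, while the returns keep fluctuating around the same mean. A formal check would use a unit-root test such as the augmented Dickey-Fuller test, which is outside the scope of this sketch.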

Fat tails are measured by **excess kurtosis** — a statistic that captures how much probability mass lives in the extreme tails of a distribution compared to a Gaussian:

$$\text{Excess Kurtosis} = \frac{E\left[(X - \mu)^4\right]}{\sigma^4} - 3$$

A Gaussian distribution has excess kurtosis of exactly 0. The S&P 500's daily returns typically show excess kurtosis well above 10; the exact figure depends heavily on the sample period and on whether days like October 1987 are included. That means the tails are dramatically thicker than a Gaussian would predict: extreme events happen 10 to 100 times more often than your Gaussian risk model expects. Every Value-at-Risk (VaR) estimate, every portfolio optimizer, every option pricing model that assumes normality is systematically understating the probability of disaster. It's not just wrong, it's wrong in the direction that loses you money.
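The formula translates almost line for line into NumPy. The sketch below checks a hand-rolled version against `scipy.stats.kurtosis` (which returns excess kurtosis by default) on two synthetic samples: a Gaussian and a Student-t with 5 degrees of freedom, used here as a stand-in for a fat-tailed distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
gaussian = rng.normal(size=200_000)
fat = rng.standard_t(df=5, size=200_000)  # fat-tailed stand-in, not market data

def excess_kurtosis(x):
    """E[(X - mu)^4] / sigma^4 - 3, using population (biased) moments."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**4).mean() / (centered**2).mean() ** 2 - 3.0

for name, sample in [("Gaussian", gaussian), ("Student-t(5)", fat)]:
    print(f"{name:13s} by hand: {excess_kurtosis(sample):6.3f}   "
          f"scipy: {stats.kurtosis(sample):6.3f}")
```

The Gaussian sample lands near zero and the Student-t sample well above it. Sample kurtosis is a noisy estimator for fat-tailed data, because a handful of extreme observations dominate the fourth moment, so expect the number to move between samples.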

We're about to plot the distribution of daily S&P 500 returns against a Gaussian with the same mean and standard deviation. If markets were the well-behaved system that most textbooks assume, these two curves would overlap perfectly. They won't. Watch the tails — the left tail (crashes) and the center peak will be the most dramatic departures.

In [None]:
spy_log_ret = np.log(spy_close / spy_close.shift(1)).dropna()
mu, sigma = spy_log_ret.mean(), spy_log_ret.std()
kurt = stats.kurtosis(spy_log_ret)  # excess kurtosis

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(spy_log_ret, bins=200, density=True, alpha=0.6, color='#2c3e50',
        label='SPY daily log returns')
x = np.linspace(spy_log_ret.min(), spy_log_ret.max(), 500)
ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', lw=2,
        label=f'Gaussian (same μ, σ)')
ax.set_title(f'SPY Daily Returns vs. Gaussian — Excess Kurtosis: {kurt:.1f}')
ax.set_xlabel('Log Return'); ax.set_ylabel('Density')
ax.legend(); plt.tight_layout(); plt.show()

See those heavy tails? The actual distribution has far more extreme events than the Gaussian predicts. The center is also sharper — more days with near-zero returns than a Gaussian expects. This is the classic "peaked center, fat tails" shape that finance people call **leptokurtic**. On March 16, 2020, the S&P dropped 12% in a single day. Under Gaussian assumptions, that's roughly a 1-in-$10^{25}$ event — it shouldn't happen once in trillions of universe lifetimes. Under the actual distribution, it's maybe a 1-in-500-year event. Unlikely, but not impossible. If your risk model uses Gaussian VaR, it told you that day was impossible. Your portfolio disagreed.

This isn't just a theoretical concern. In 1998, Long-Term Capital Management (LTCM) — a hedge fund with two Nobel laureates on staff, $125 billion in assets, and models that explicitly assumed Gaussian tails — nearly collapsed the global financial system when a series of "impossible" events happened in quick succession. Their models said a loss that large wouldn't happen in the lifetime of the universe. It happened in four months.

A QQ-plot makes the fat tail problem even more dramatic. It plots the quantiles of your data against what a Gaussian would predict. If returns were Gaussian, the points would fall neatly on the diagonal line. Any departure — especially at the edges — is evidence that the tails are thicker than a Gaussian allows.

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))
stats.probplot(spy_log_ret, dist="norm", plot=ax)
ax.set_title('QQ-Plot: SPY Daily Log Returns vs. Gaussian')
ax.get_lines()[0].set(markersize=2, alpha=0.5, color='#2c3e50')
ax.get_lines()[1].set(color='#e74c3c', linewidth=2)
plt.tight_layout(); plt.show()

Look at both ends of the plot. The left tail (crashes) curves sharply downward away from the line — real crashes are much worse than a Gaussian predicts. The right tail (rallies) curves upward — real rallies are also more extreme. The departures at both extremes are dramatic: the most extreme observed returns are 3 to 5 times larger than what a Gaussian with the same standard deviation would ever produce.

This means every risk metric that assumes normality — VaR, Sharpe ratio (in its standard form), portfolio optimizers using mean-variance — is lying to you. It's not lying by a little. The probability of a 5-sigma event under a Gaussian is about 1 in 3.5 million. Under the actual fat-tailed distribution of S&P returns, events of that magnitude happen roughly once every few years. Your risk model says "once in a million days" and reality says "see you next Tuesday." We'll come back to this problem in Week 2, where we'll discuss how to model these tails properly instead of pretending they don't exist.

## 5. Returns Math — Simple vs. Log Returns

Quick puzzle. You invest $100. It goes up 10% in January — great, you have $110. Then it drops 10% in February. Are you back to $100?

Nope. You're at $99. You *lost* money on two moves that should cancel out. This is the **compounding trap**, and it's bitten more junior quants than any bug in production code. The reason is multiplicative compounding: the 10% drop in February applies to $110, not $100. So you lose $11, not $10. And $110 − $11 = $99.

This is not just a math curiosity — it has real consequences for how you compute multi-period returns. If you naively add simple returns across time ($+10\% + (-10\%) = 0\%$), you get the wrong answer. The correct multi-period return is the *product* of $(1 + r_t)$ terms, not the sum. That multiplicative structure is annoying to work with — you can't just add things up. Unless you switch to log returns.

Here are the two types of returns you'll use throughout this course:

**Simple (arithmetic) returns:**

$$r_t = \frac{P_t - P_{t-1}}{P_{t-1}}$$

**Log (continuously compounded) returns:**

$$R_t = \ln\!\left(\frac{P_t}{P_{t-1}}\right) = \ln(1 + r_t)$$

The magic of log returns is that they're **additive across time**. The multi-period log return is just the sum of single-period log returns:

$$R_{t_0 \to t_n} = \sum_{i=1}^{n} R_{t_i} = \ln\!\left(\frac{P_{t_n}}{P_{t_0}}\right)$$

That's because $\ln(a/b) = \ln(a) - \ln(b)$, and the intermediate terms telescope. Simple returns, on the other hand, are **additive across assets** — a portfolio return is the weighted sum of its constituent simple returns. This means: log returns for time-series work, simple returns for portfolio work. Mixing them up won't crash your code — it'll quietly bias every number downstream.
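Both additivity properties can be verified on a few made-up numbers. The prices, weights, and asset returns below are invented for illustration.

```python
import numpy as np

# --- Across time: one asset, four hypothetical prices ---
prices = np.array([100.0, 110.0, 99.0, 104.0])
simple = prices[1:] / prices[:-1] - 1
log = np.log(prices[1:] / prices[:-1])

total_simple = prices[-1] / prices[0] - 1   # true multi-period return
from_log = np.expm1(log.sum())              # sum log returns, then convert back
from_compounding = np.prod(1 + simple) - 1  # product of (1 + r_t) terms
naive_sum = simple.sum()                    # adding simple returns across time

print(f"True total: {total_simple:.4f}  from logs: {from_log:.4f}  "
      f"compounded: {from_compounding:.4f}  naive sum: {naive_sum:.4f}")

# --- Across assets: one period, two hypothetical assets ---
weights = np.array([0.6, 0.4])
asset_simple = np.array([0.10, -0.05])

portfolio_simple = weights @ asset_simple   # correct portfolio return
wrong_portfolio = np.expm1(weights @ np.log1p(asset_simple))  # weighting log returns

print(f"Portfolio (simple, correct): {portfolio_simple:.4f}  "
      f"(weighted logs, wrong): {wrong_portfolio:.4f}")
```

Summing log returns and compounding simple returns both recover the true total, while the naive sum of simple returns does not. In the cross-section it is the other way around: the weighted average of log returns understates the portfolio return.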

For annualization, we scale by the number of trading days (approximately 252): $\mu_{\text{annual}} = \mu_{\text{daily}} \times 252$ and $\sigma_{\text{annual}} = \sigma_{\text{daily}} \times \sqrt{252}$. The square root comes from the assumption that daily returns are roughly independent — variance adds linearly, so standard deviation scales with $\sqrt{T}$. This assumption is approximate at best, and we'll challenge it in Week 2.
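Here is a minimal sketch of those two scaling rules on synthetic daily log returns. The drift and volatility parameters are arbitrary, and the Sharpe-style ratio at the end assumes a zero risk-free rate for simplicity.

```python
import numpy as np

TRADING_DAYS = 252
rng = np.random.default_rng(3)

# Ten years of hypothetical daily log returns
daily_log_ret = rng.normal(0.0004, 0.012, size=TRADING_DAYS * 10)

mu_daily = daily_log_ret.mean()
sigma_daily = daily_log_ret.std(ddof=1)

mu_annual = mu_daily * TRADING_DAYS               # mean scales linearly with time
sigma_annual = sigma_daily * np.sqrt(TRADING_DAYS)  # volatility scales with sqrt(time)
sharpe = mu_annual / sigma_annual                 # assumes a zero risk-free rate

print(f"Annualized mean:       {mu_annual:.2%}")
print(f"Annualized volatility: {sigma_annual:.2%}")
print(f"Sharpe-style ratio:    {sharpe:.2f}")
```

The square-root rule is exact only for independent returns, which this simulated data satisfies by construction. With real returns, autocorrelation and volatility clustering make it an approximation.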

Let's prove the compounding trap with code, and then show that log returns handle it honestly. We'll walk through the $100 → +10% → −10% example and verify that simple returns lie while log returns tell the truth.

In [None]:
# The compounding trap
start = 100
after_up = start * 1.10       # +10%
after_down = after_up * 0.90  # -10%

simple_sum = 0.10 + (-0.10)   # naive: just add them
log_sum = np.log(1.10) + np.log(0.90)  # log returns

print(f"Start: ${start:.2f}")
print(f"After +10%: ${after_up:.2f}")
print(f"After -10%: ${after_down:.2f}  (NOT $100!)")
print(f"\nSimple returns sum:  {simple_sum:+.4f}  (says breakeven — WRONG)")
print(f"Log returns sum:     {log_sum:+.4f}  (says small loss — CORRECT)")
print(f"Actual return:       {(after_down/start - 1):+.4f}")

The simple returns said +10% and −10%, which feels like it should net to zero. But $1.10 \times 0.90 = 0.99$, not $1.00$. You lost a dollar. Log returns got it right: $\ln(1.10) + \ln(0.90) = \ln(0.99) \approx -0.01005$, a small negative number that's honest about the loss. The actual return is −1%, and exponentiating the log-return sum recovers it exactly: $e^{-0.01005} - 1 = -0.01$. The small gap between −0.01005 and −0.01 is just the difference between a log return and a simple return, since $\ln(1+r) \approx r$ only for small $r$.

This matters every time you compute cumulative returns, backtest a strategy, or calculate a Sharpe ratio. If you sum simple returns across time, you're lying to yourself. If you sum log returns, you're telling the truth. But how different are these two measures in practice? For most days, barely at all — but for the days that matter most, the gap is significant.

For small daily returns — typically under 1% in absolute value — simple and log returns are nearly identical, because $\ln(1 + r) \approx r$ when $r$ is small. But for crash days, the days that determine whether your risk model keeps you solvent, they diverge meaningfully. Let's see the difference across the full SPY dataset and highlight the days where it matters.

In [None]:
simple_ret = spy_close.pct_change().dropna()
log_ret = np.log(spy_close / spy_close.shift(1)).dropna()
diff = simple_ret - log_ret

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(diff.index, diff.values * 100, color='#8e44ad', linewidth=0.5, alpha=0.7)
ax.set_title('Difference: Simple − Log Returns (percentage points)')
ax.set_ylabel('Difference (pp)'); ax.set_xlabel('Date')
ax.axhline(0, color='black', linewidth=0.5)

# Annotate the biggest divergences
worst_day = diff.abs().idxmax()
ax.annotate(f'{worst_day.strftime("%Y-%m-%d")}', xy=(worst_day, diff[worst_day]*100),
            fontsize=8, color='red', ha='center')
plt.tight_layout(); plt.show()

For the vast majority of days, the difference is negligible: a 1% move produces a gap of only about 0.005 percentage points. But look at the spikes. On the worst crash days, the gap between simple and log returns reaches 0.5–1.0 percentage points. On March 16, 2020 (the COVID crash), simple returns said about −11.98% and log returns said about −12.77%. That 0.8 percentage point gap matters when you're computing cumulative returns over years or calculating a Value-at-Risk threshold.
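The size of those spikes follows from a Taylor expansion: $\ln(1+r) = r - r^2/2 + r^3/3 - \dots$, so the gap $r - \ln(1+r)$ is approximately $r^2/2$. It grows with the square of the move, which is why it is invisible on quiet days and obvious on crash days. A quick sketch with illustrative return values (the last one is the crash-day figure quoted above):

```python
import numpy as np

# Simple returns ranging from a quiet day to a crash day (illustrative values)
r = np.array([0.001, -0.01, -0.03, -0.05, -0.1198])

gap = r - np.log1p(r)     # simple minus log return; never negative
approx = r**2 / 2         # leading term of the Taylor expansion

for ri, gi, ai in zip(r, gap, approx):
    print(f"r = {ri:+8.2%}   gap = {gi * 100:7.4f} pp   r^2/2 = {ai * 100:7.4f} pp")
```

For negative returns the higher-order terms all push in the same direction, so the true gap on crash days is slightly larger than $r^2/2$ suggests.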

For the rest of this course: **log returns for time-series analysis** (because they add across time) and **simple returns for portfolio analysis** (because they add across assets). When in doubt, use log returns — they're more mathematically convenient, and for daily data, the difference is usually negligible. But on the days that matter most — the crash days, the days that determine whether your fund survives — the difference is real.

## 6. Alternative Bars — Volume Bars and Dollar Bars

Think about what a daily bar actually represents. On a quiet summer Tuesday, Apple might trade 30 million shares. On the morning after an earnings announcement, it might trade 150 million shares. Same bar. Same weight in your dataset. But one contains 5x more information than the other. You'd never train an image classifier by giving some images 1 pixel and others 500 pixels and calling them the same resolution. That's exactly what time bars do with financial data.

This insight comes from Marcos Lopez de Prado, who argues in *Advances in Financial Machine Learning* that time bars are "the worst possible way" to sample financial data. His alternative: sample by *activity*, not by *clock time*. **Volume bars** create a new bar every time a fixed number of shares trade. **Dollar bars** create a new bar every time a fixed dollar amount trades. Dollar bars are Lopez de Prado's recommendation, because they normalize for both volume changes *and* price changes — if a stock doubles in price, you don't suddenly get half as many bars.

The practical consequence is elegant: during high-activity periods (earnings, crashes, news events), you get more bars — more data points precisely when more information is flowing. During quiet periods, you get fewer bars, because there's less to learn. Your model sees a more uniform information density, and the resulting return distribution is closer to Gaussian. That's a free lunch for any ML model that assumes (or benefits from) approximately normal inputs.

In the time it takes your Python script to import NumPy — about 150 milliseconds — a Xilinx FPGA at the NYSE has already processed roughly 150,000 market data messages and placed orders on half of them. That's the speed at which modern markets operate. Dollar bars are a much humbler intervention — we're just being smarter about when we sample, not trying to compete on speed.

The idea is straightforward. We compute the dollar volume at each time step — price times shares traded — and accumulate it:

$$\text{Dollar Volume}_t = P_t \times V_t$$

$$\text{Cumulative Dollar Volume}_t = \sum_{i=1}^{t} P_i \times V_i$$

Every time the cumulative dollar volume crosses a threshold (say, $50 billion for SPY), we close one bar and start a new one. The result: more bars during high-activity periods, fewer during quiet periods. The threshold is a hyperparameter: too small and you get noisy bars, too large and you lose resolution. When the input is daily data, a reasonable starting point is a small multiple of the median daily dollar volume, so that a typical bar spans a few days. A threshold at or below one median day would mostly just reproduce the daily bars.
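To make the accumulate-and-cut rule concrete, here is a toy example small enough to check by hand. The prices, volumes, and threshold are made up. The labeling trick (floor-dividing the cumulative dollar volume by the threshold) is one simple way to implement the rule, not the only one.

```python
import numpy as np

# Toy data: constant price, uneven volume (illustrative numbers only)
price = np.array([10, 10, 10, 10, 10, 10])
volume = np.array([100, 300, 50, 50, 500, 100])
threshold = 2_500  # dollars per bar (assumed)

dollar_vol = price * volume         # [1000, 3000, 500, 500, 5000, 1000]
cum_dollars = dollar_vol.cumsum()   # [1000, 4000, 4500, 5000, 10000, 11000]
bar_ids = cum_dollars // threshold  # label = number of thresholds crossed so far

print(bar_ids.tolist())                  # [0, 1, 1, 2, 4, 4]
print(len(np.unique(bar_ids)), "bars")   # 4 bars
```

Two details are worth noticing. The row that pushes the running total across a threshold is assigned to the next bar. And label 3 never appears: the fifth row carries $5,000 on its own, which crosses two thresholds at once, and a single input row cannot be split. Dollar bars can never be finer than the data they are built from.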

Lopez de Prado argues dollar bars are strictly better than time bars for ML inputs. Let's test that claim. We'll build dollar bars in about 5 lines — the napkin version — and see how many bars we get compared to the standard daily bars. In the seminar, you'll build these from scratch with proper OHLC handling; for now, we just want the punchline.

In [None]:
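spy = spy_raw.copy()
c = get_close(spy)
# Flatten a (field, ticker) MultiIndex, if present, so columns can be addressed by name
# (assumes level 0 holds the field names: Open, High, Low, Close, Volume)
if isinstance(spy.columns, pd.MultiIndex):
    spy.columns = spy.columns.get_level_values(0)
v = spy['Volume']
dollar_vol = c * v                    # dollars traded each day
cum_dollars = dollar_vol.cumsum()     # running total of dollars traded
threshold = dollar_vol.median() * 5   # ~5 median days per bar
bar_ids = (cum_dollars // threshold).astype(int)  # bar label = thresholds crossed so far

# Aggregate by column name, not position: column order differs across data sources
dollar_bars = spy.groupby(bar_ids).agg(
    {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last', 'Volume': 'sum'})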

print(f"Time bars (daily):  {len(spy):,} bars")
print(f"Dollar bars:        {len(dollar_bars):,} bars (threshold: ${threshold:,.0f})")

We went from thousands of daily bars down to a smaller set of dollar bars. The dollar bars are unevenly spaced in time: some cover a single wild trading day during the COVID crash, others span a quiet week or more in August. That's the entire point: we're sampling *information*, not *clock ticks*. One limit of this napkin version is that it is built from daily data, so a bar can never be shorter than one day. During the March 2020 crash, when SPY's daily dollar volume ran at a multiple of its median, bars close after just a day or two; with intraday data, the same logic would produce multiple bars per day. During a sleepy holiday week, a single bar might cover the whole week or more. The model sees a more uniform information density.

Now the punchline. If Lopez de Prado is right, dollar bar returns should be closer to Gaussian than time bar returns — lower excess kurtosis, thinner tails, a distribution that better matches the assumptions baked into most ML models. Let's compute returns for both bar types and plot them side by side with Gaussian overlays.

In [None]:
time_ret = np.log(spy_close / spy_close.shift(1)).dropna()
db_close = dollar_bars.iloc[:, 3]  # 'last' = close
dollar_ret = np.log(db_close / db_close.shift(1)).dropna()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, ret, title in zip(axes, [time_ret, dollar_ret],
    [f'Time Bars (kurtosis={stats.kurtosis(time_ret):.1f})',
     f'Dollar Bars (kurtosis={stats.kurtosis(dollar_ret):.1f})']):
    ax.hist(ret, bins=100, density=True, alpha=0.6, color='#2c3e50')
    x = np.linspace(ret.min(), ret.max(), 300)
    ax.plot(x, stats.norm.pdf(x, ret.mean(), ret.std()), 'r-', lw=2)
    ax.set_title(title); ax.set_xlabel('Log Return'); ax.set_ylabel('Density')
fig.suptitle('Time Bars vs. Dollar Bars: Return Distributions', y=1.02)
plt.tight_layout(); plt.show()

Compare the kurtosis numbers in the titles. Time bars produce returns with substantially higher excess kurtosis — fatter tails, more extreme events, a distribution that's further from Gaussian. Dollar bars bring the kurtosis down meaningfully. The histogram on the right should look more "Gaussian-like" — the tails are thinner, the peak is less extreme, the overall shape is closer to the red curve. And we achieved this with zero feature engineering, zero model changes, zero additional data. Just by being smarter about *when* we sample.
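If the kurtosis numbers in those titles feel abstract, it helps to calibrate on distributions where the answer is known. The sketch below uses synthetic draws, not market data. `scipy.stats.kurtosis` reports *excess* kurtosis by default (`fisher=True`), so a Gaussian sample should come out near 0, and a Student-t sample with 5 degrees of freedom (theoretical excess kurtosis of 6) should come out well above it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200_000

gaussian = rng.normal(size=n)
fat_tailed = rng.standard_t(df=5, size=n)  # theoretical excess kurtosis = 6

k_gauss = stats.kurtosis(gaussian)                  # excess kurtosis (fisher=True default)
k_fat = stats.kurtosis(fat_tailed)
k_pearson = stats.kurtosis(gaussian, fisher=False)  # raw kurtosis, about 3 for a Gaussian

print(f"Gaussian, excess kurtosis:   {k_gauss:+.3f}")
print(f"Student-t(5), excess:        {k_fat:+.3f}")
print(f"Gaussian, raw (Pearson):     {k_pearson:+.3f}")
```

Sample kurtosis is a noisy estimator for fat-tailed data, so the Student-t figure will wander from run to run if you change the seed. Keep that in mind when comparing two kurtosis values that are close together.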

The **Jarque-Bera test** makes this rigorous. It's a statistical test for whether a sample comes from a Gaussian distribution, based on skewness and kurtosis. A lower test statistic means closer to Gaussian; a large statistic (with a tiny p-value) means "definitely not Gaussian." Neither of our return series will *pass* the test — financial returns are never truly Gaussian — but we can see which one is *closer*.

In [None]:
jb_time = stats.jarque_bera(time_ret.dropna())
jb_dollar = stats.jarque_bera(dollar_ret.dropna())

print(f"{'Bar Type':<15} {'JB Statistic':>15} {'p-value':>12}")
print(f"{'-'*42}")
print(f"{'Time bars':<15} {jb_time.statistic:>15,.1f} {jb_time.pvalue:>12.2e}")
print(f"{'Dollar bars':<15} {jb_dollar.statistic:>15,.1f} {jb_dollar.pvalue:>12.2e}")

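Dollar bars win. The Jarque-Bera statistic drops substantially (often by 50% or more), in line with what the histograms showed. One caution when reading the table: the JB statistic scales with sample size, and there are far fewer dollar bars than daily bars, so part of the drop is mechanical. The kurtosis figures in the histogram titles don't depend on sample size and are the cleaner comparison. Both series still reject the Gaussian hypothesis (the p-values are essentially zero), because financial returns are *never* Gaussian. But dollar bars are meaningfully *closer* to Gaussian, and for ML models that assume or benefit from approximately normal inputs, that's a real advantage.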
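It helps to see what the test actually computes. The statistic is $JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)$, where $S$ is the sample skewness, $K$ is the sample excess kurtosis, and $n$ is the number of observations. The sketch below rebuilds it by hand on synthetic Student-t draws (an assumption standing in for real returns) and compares the result with `scipy.stats.jarque_bera` at two sample sizes.

```python
import numpy as np
from scipy import stats

def jb_by_hand(x):
    """Jarque-Bera statistic from sample skewness and excess kurtosis."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = stats.skew(x)       # biased (population) estimator, scipy's default
    k = stats.kurtosis(x)   # excess kurtosis, biased estimator
    return n / 6.0 * (s**2 + k**2 / 4.0)

rng = np.random.default_rng(7)
# One fat-tailed sample, then a shorter slice of the same draws
long_sample = rng.standard_t(df=5, size=5_000)
short_sample = long_sample[:1_000]

for name, x in [("n=5000", long_sample), ("n=1000", short_sample)]:
    jb_scipy, p_value = stats.jarque_bera(x)
    print(f"{name}: JB={jb_scipy:,.1f}  by hand={jb_by_hand(x):,.1f}  "
          f"JB/n={jb_scipy / x.size:.3f}")
```

Because of the factor $n$ out front, the raw statistic tends to grow with sample size even when the shape of the distribution is unchanged. When two series have very different lengths, dividing by $n$ (or comparing $S$ and $K$ directly) gives a fairer picture of which one is closer to Gaussian.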

This replicates the qualitative result from Lopez de Prado's Figure 2.4 in *Advances in Financial Machine Learning*. The takeaway: ML models generally perform better when their inputs are approximately Gaussian. Dollar bars move you closer to Gaussian for free — just by understanding your data better than the next person. No new features, no new model, no new math. Just a smarter sampling strategy. In the seminar, you'll build dollar bars from scratch for different stocks and discover that the improvement isn't universal — it depends on how much volume variation the stock has.

## 7. Toward a Clean Data Pipeline

Every quant firm has a data pipeline they've spent years building and debugging. At Two Sigma, the data engineering team is larger than the research team. At Renaissance Technologies, Jim Simons reportedly said that 80% of the work is cleaning data. You won't build anything that sophisticated this week, but you'll build something *correct* — and in this business, correct beats sophisticated every time.

Think about everything we've covered today. A proper data pipeline needs to handle: downloading from a source (and caching, because yfinance will throttle you), adjusting for splits and dividends, filling or flagging missing data (and documenting which policy you chose and why), computing returns both ways (simple for portfolios, log for time-series), constructing alternative bar types (volume, dollar), and flagging anomalies — extreme returns that might be data errors, zero-volume days that might be holidays or halts, ticker changes that might be misinterpreted as new companies.

That's a lot of plumbing. And every piece of it matters, because a single error — one unadjusted split, one delisted ticker treated as a loss, one future-leaked fundamental number — can silently corrupt every model downstream. The data pipeline isn't glamorous work. It's the most important work.
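As one small piece of that plumbing, here is a minimal sketch of download-once caching. Everything in it is an assumption for illustration: `load_cached` and its `fetch` argument are hypothetical names, `fetch` stands in for whatever download function you use, and CSV is chosen only because it needs no extra dependencies.

```python
from pathlib import Path
import pandas as pd

def load_cached(ticker, fetch, cache_dir="data_cache"):
    """Return price data for `ticker`, calling `fetch` only on a cache miss.

    `fetch` is a hypothetical callable (ticker -> DataFrame indexed by date);
    in practice it would wrap your actual download function.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{ticker}.csv"

    if path.exists():
        # Cache hit: read from disk and restore the date index
        return pd.read_csv(path, index_col=0, parse_dates=True)

    df = fetch(ticker)  # cache miss: hit the data source once
    df.to_csv(path)
    return df
```

A real pipeline would also need a policy for when the cache is stale, since adjusted prices change retroactively every time a dividend or split occurs. That decision is deliberately left out here.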

In the homework, you'll build a `DataLoader` class that handles all of this for 200 stocks. Here's a sketch of what it needs to look like — the blueprint, not the implementation. Think of this as the table of contents for a book you're about to write.

Below is the skeleton — the list of responsibilities your DataLoader will need to fulfill. Each comment represents a real decision you'll have to make in the homework: What's your missing data policy? What's your dollar bar threshold? How do you flag a suspicious return?

In [None]:
class DataLoader:
    """Week 1 DataLoader — skeleton for homework."""
    def __init__(self, tickers, start, end):
        self.tickers = tickers
        self.start = start
        self.end = end
        # TODO: download and cache adjusted OHLCV
        # TODO: document your missing data policy
        # TODO: compute simple and log returns
        # TODO: flag anomalies (>15% daily moves, zero volume)
        # TODO: implement get_dollar_bars(threshold)

That's the blueprint. In the homework, you'll flesh this out into a real, working class — one that handles 200 stocks, computes returns both ways, builds dollar bars, flags anomalies, and documents its choices. The class you build will be the foundation for every pipeline in this course. If it's right, you can trust your results for the next 17 weeks. If it's wrong, every model you build on top of it will be wrong in ways that are very hard to debug. No pressure.
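To give a feel for one of those TODOs, here is a rough sketch of anomaly flagging on a single ticker. The function name `flag_anomalies`, the column names, and the synthetic data are all assumptions for illustration. The 15% cutoff mirrors the comment in the skeleton, and whether it is the right cutoff is one of the judgment calls the homework asks you to make.

```python
import numpy as np
import pandas as pd

def flag_anomalies(close, volume, max_abs_return=0.15):
    """Flag suspicious rows without dropping or altering them.

    `close` and `volume` are Series sharing a date index.
    """
    # fill_method=None: do not forward-fill gaps before computing returns
    simple_ret = close.pct_change(fill_method=None)
    flags = pd.DataFrame(index=close.index)
    flags["extreme_return"] = simple_ret.abs() > max_abs_return
    flags["zero_volume"] = volume == 0
    flags["missing_price"] = close.isna()
    return flags

# Tiny synthetic example (made-up numbers)
idx = pd.bdate_range("2024-01-01", periods=6)
close = pd.Series([100.0, 101.0, 50.5, 51.0, np.nan, 52.0], index=idx)
volume = pd.Series([1_000, 1_200, 5_000, 0, 900, 1_100], index=idx)

flags = flag_anomalies(close, volume)
print(flags[flags.any(axis=1)])  # show only the rows with at least one flag
```

The sketch flags rather than fixes. The −50% move on the third day could be a real crash or an unadjusted 2-for-1 split, and the code cannot tell which. Keeping the flag separate from the data leaves that decision visible and documented.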

## Summary

| Concept | Key Takeaway | Watch Out For |
|---|---|---|
| **Order books & microstructure** | Every trade has an implicit cost (the spread). Market makers profit from it; your model pays it. | Transaction costs can exceed your model's edge, especially for illiquid stocks. |
| **OHLCV & adjusted prices** | Always use adjusted prices for returns and features. Raw prices contain split artifacts that look like crashes. | Adjusted prices rewrite history — don't use them for execution simulation. |
| **Survivorship bias** | Free datasets contain only survivors. ~30% of the S&P 500 turns over per decade. | Your backtest has never seen a company die. It will overstate returns by 1–3% per year. |
| **Non-stationarity** | Raw prices are non-stationary — the distribution shifts constantly. Returns are the partial fix. | Even returns change character over time (volatility clustering, regime shifts). |
| **Fat tails** | Extreme events are 10–100x more likely than Gaussian models predict. Kurtosis ≈ 20 for SPY. | Every risk metric assuming normality systematically understates tail risk. |
| **Returns math** | Log returns add across time; simple returns add across assets. Don't mix them. | The compounding trap: +10% then −10% ≠ 0%. The difference matters on crash days. |
| **Alternative bars** | Dollar bars sample by information flow, not clock time. Returns are closer to Gaussian. | The improvement varies by stock. More dramatic for high-volume-variation names. |
| **Data pipeline** | 80% of quant work is data cleaning. Correct beats sophisticated. | A single unadjusted split can corrupt every model downstream. |

Let's take stock of what you now know that you didn't know 90 minutes ago. You know that "the stock market" is actually a complex network of exchanges, dark pools, and market makers — and that every trade has a cost that most backtests ignore. You know that your data has been quietly lying to you: removing dead companies, leaking future information, and pretending that extreme events can't happen. You know that returns, not prices, are the right input for ML — and that log returns and simple returns each have their place. And you know that sampling by dollar volume instead of by clock time gives you better-behaved data for free.

Next week, we confront the deeper problem that returns only partially solve: stationarity. We'll discover that taking first differences (returns) throws away too much information — the long-memory structure that tells you whether a stock is trending or mean-reverting. Raw prices keep too much information — they're non-stationary and can't be fed to a model. There's a mathematically elegant sweet spot between the two called **fractional differentiation**. It sounds exotic. It's actually just calculus being clever. We'll implement it, test it on real data, and see whether it actually improves predictions or whether Lopez de Prado was having us on.

### Suggested Reading

- **Lopez de Prado, *Advances in Financial Machine Learning*, Chapter 2 (Data Structures)** — The authoritative treatment of alternative bar types. Dense and academic in style, but the empirical evidence for dollar bars is compelling. This is the chapter we'll reference most in the first half of the course.

- **Larry Harris, *Trading and Exchanges* (2003)** — The definitive market microstructure textbook. Dated in its technology (pre-HFT era) but timeless in its economics. If you want to understand *why* markets are structured the way they are — why spreads exist, why dark pools emerged, why market makers behave as they do — this is the book.

- **Stefan Jansen, *Machine Learning for Algorithmic Trading*, Chapter 2 (Market & Fundamental Data)** — The most practical treatment of financial data for ML engineers. Covers everything from tick data to fundamental data to alternative data, with working Python code. If this lecture was the "why," Jansen's chapter is the "how" at industrial scale.

- **Michael Lewis, *Flash Boys* (2014)** — Not a textbook, but the most readable account of how modern market microstructure actually works. The story of IEX, dark pools, and the arms race between high-frequency traders. You'll understand the plumbing — and the politics — much better after reading this.