# Week 1 Seminar — Getting Your Hands Dirty with Market Data

You've heard the theory. Now let's see if it holds up when you actually touch the data. The lecture told you that fat tails exist, that survivorship bias is bad, that dollar bars are better, and that transaction costs matter. Fine — but how do fat tails compare across *asset classes* — equities vs. bonds vs. commodities vs. volatility products? Do dollar bars help every ETF equally, or does the benefit depend on liquidity? Can you actually put a dollar figure on survivorship bias? And at what cost level does a decent-looking strategy die?

These are the questions we'll answer today. Four exercises, each one designed to produce a number that surprises you. By the end, you'll have hard evidence — not just intuitions — for why these issues matter. And you'll have a much better sense of which asset classes behave nicely and which ones make your distributional assumptions look absurd.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
from scipy import stats
plt.rcParams.update({'figure.figsize': (12, 6), 'font.size': 12, 'axes.grid': True, 'grid.alpha': 0.3})
def get_close(d):
    """Return the Close column(s) as a DataFrame for single- or multi-ticker downloads."""
    return d['Close'] if isinstance(d.columns, pd.MultiIndex) else d[['Close']]
# ── Download ALL data for ALL four exercises ──
# 10 ETFs spanning equities, bonds, commodities, volatility
tickers = ['SPY','QQQ','TLT','GLD','USO','EEM','IWM','HYG','XLU','VIXY']
raw = yf.download(tickers, start='2010-01-01', end='2024-01-01', auto_adjust=True)
prices = get_close(raw)
# Survivorship bias data (Exercise 3)
survivors = ['AAPL','MSFT','JNJ','JPM','PG','UNH','HD','V','DIS','MRK','KO','PEP','CSCO','ABT','WMT','CMCSA','AMGN','LOW']
removed = ['GE','XRX','GPS','FLR','HRB','HP','LEG','NWSA','IPG','AIZ']
bias_raw = yf.download(survivors + removed, start='2010-01-01', end='2024-01-01', auto_adjust=True)
bias_prices = get_close(bias_raw)

## Exercise 1: Fat-Tail Safari Across Asset Classes

The lecture showed you that SPY has fat tails — excess kurtosis well above zero, a QQ-plot that curls away from the diagonal. But SPY is just one data point from one asset class. Here's the question worth asking: is non-Gaussianity a *universal* market phenomenon, or does it vary dramatically depending on what you're trading?

We've downloaded 10 ETFs that span very different corners of the financial universe: SPY and QQQ (US equities), IWM (small caps), EEM (emerging markets), TLT (long-term treasuries), HYG (high-yield bonds), GLD (gold), USO (oil), XLU (utilities), and VIXY (VIX futures — pure volatility). These aren't just different tickers — they represent fundamentally different return-generating processes. Oil prices are driven by OPEC decisions and inventory reports. Treasury prices respond to Fed announcements. And VIXY is a product designed to spike when everything else crashes.

If fat tails are the same everywhere, a single distributional model covers everything. If they're different — different magnitudes, different *directions* of skewness — then your risk model needs to know which asset it's looking at. Let's find out.

### Tasks

1. Compute daily log returns for all 10 ETFs.
2. Build a statistics table: mean, standard deviation, skewness, excess kurtosis, and Jarque-Bera statistic for each ETF. Include the asset class label.
3. Produce QQ-plots for all 10 ETFs on a single 2×5 multi-panel figure, ordered by kurtosis.
4. Answer: does kurtosis cluster by asset class? By how much do the fattest and tamest tails differ? Does skewness tell a different story than kurtosis?

In [None]:
# ── YOUR WORKSPACE ──
# Compute log returns for all 10 ETFs
# Build a comparison table: mean, std, skewness, kurtosis, JB stat
# Produce a 2x5 grid of QQ-plots ordered by kurtosis

---
### ▶ Solution

In [None]:
log_rets = np.log(prices / prices.shift(1))

asset_class = {'SPY': 'US Eq', 'QQQ': 'US Eq', 'IWM': 'US SmCap',
               'EEM': 'EM Eq', 'TLT': 'Bonds', 'HYG': 'HY Bonds',
               'GLD': 'Gold', 'USO': 'Oil', 'XLU': 'Util', 'VIXY': 'Vol'}
rows = []
for ticker in log_rets.columns:
    r = log_rets[ticker].dropna()
    jb_stat, _ = stats.jarque_bera(r)
    rows.append({'ETF': ticker, 'Class': asset_class.get(ticker, '?'),
                 'Mean%': r.mean() * 100, 'Std%': r.std() * 100,
                 'Skew': r.skew(), 'Kurtosis': r.kurtosis(), 'JB': jb_stat})

stat_df = pd.DataFrame(rows).sort_values('Kurtosis', ascending=False).set_index('ETF')
print(stat_df.to_string(float_format=lambda x: f'{x:.2f}'))

Scan the kurtosis column from top to bottom. The pattern jumps out immediately. VIXY — pure volatility — sits at the extreme, with excess kurtosis that dwarfs everything else. Oil (USO) is likely next, driven by OPEC shocks and supply disruptions that produce days when crude moves 8-10% on a single headline. The equity ETFs (SPY, QQQ, IWM, EEM) cluster in the middle with broadly similar kurtosis, though emerging markets tend to run a notch wilder than US large caps. And TLT and XLU — the "boring" names — sit at the bottom, still non-Gaussian, but dramatically less fat-tailed.
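The JB column in the table can be reproduced by hand, which is a useful check on what the statistic actually measures. The sketch below runs on synthetic Student-t data (an assumption for illustration, not ETF returns) and uses the textbook formula JB = n/6 · (S² + K²/4), where S is sample skewness and K is sample excess kurtosis. One caveat: `scipy.stats.skew` and `scipy.stats.kurtosis` default to biased moment estimators, while the pandas `.skew()` and `.kurtosis()` methods used in the table apply small-sample corrections, so the two sets of numbers differ slightly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic fat-tailed "returns": Student-t with 5 degrees of freedom (illustrative only)
x = rng.standard_t(df=5, size=5000) * 0.01

n = len(x)
S = stats.skew(x)        # biased sample skewness (scipy default)
K = stats.kurtosis(x)    # biased sample *excess* kurtosis (Fisher definition)
jb_manual = n / 6 * (S**2 + K**2 / 4)

# scipy's implementation, unpacked as (statistic, p-value)
jb_scipy, _ = stats.jarque_bera(x)

print(f'manual JB: {jb_manual:.2f}   scipy JB: {jb_scipy:.2f}')
```

Because the kurtosis term enters squared and is scaled by n, JB grows quickly with both tail fatness and sample length, which is why it is best read as a ranking across ETFs with similar history lengths.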

Now look at the skewness column. This is where asset classes diverge in a way kurtosis alone can't capture. Equity ETFs tend to be *left-skewed* — their tails are fatter on the downside, because crashes are fast and violent while recoveries are slow. VIXY is likely *right-skewed* — it spikes upward when fear hits, producing extreme positive returns. GLD often has near-zero skewness but still fat tails — symmetric heavy tails, the kind of distribution Nassim Taleb builds portfolios around. These are qualitatively different risk profiles masquerading under the same "fat tails" label.

The QQ-plots below will make these differences visceral — watch which curves break away from the diagonal at the 2-sigma level and which ones hold steady until 4-sigma territory.

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
for i, ticker in enumerate(stat_df.index[:10]):
    ax = axes.ravel()[i]
    r = log_rets[ticker].dropna()
    stats.probplot(r, dist='norm', plot=ax)
    ax.set_title(f'{ticker} (κ={r.kurtosis():.1f}, s={r.skew():.2f})')
    ax.get_lines()[0].set(markersize=2, alpha=0.4)

fig.suptitle('QQ-Plots Ranked by Excess Kurtosis — Watch the Tails',
             fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

The QQ-plots tell the full story. The highest-kurtosis ETFs — VIXY and USO — break away from the Gaussian diagonal almost immediately, curving outward in dramatic fashion. For VIXY, the right tail dominates: the top 1% of returns are far larger than a Gaussian predicts, while the left tail is comparatively tame. For equity ETFs, the opposite pattern: the left tail (crashes) is where the real deviation lives. TLT and XLU stay closer to the diagonal longer, bending only at the extremes — they're non-Gaussian, but politely so.
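To put a number on "breaking away from the diagonal", it helps to count extreme days directly. The sketch below is a minimal illustration on synthetic data: it compares how many 4-sigma days a Gaussian would predict over roughly 14 years with how many show up in a Student-t sample with 3 degrees of freedom. The sample size and the choice of distribution are assumptions made for the example, not properties of any ETF above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_days = 3500  # roughly 14 years of trading days (assumption for illustration)

# Probability that a Gaussian return lands more than 4 standard deviations from the mean
p_gauss = 2 * stats.norm.sf(4)
expected_gauss = p_gauss * n_days

# Synthetic fat-tailed sample, standardized with its own mean and standard deviation
x = rng.standard_t(df=3, size=n_days)
z = (x - x.mean()) / x.std()
observed = int((np.abs(z) > 4).sum())

print(f'Gaussian expects {expected_gauss:.2f} four-sigma days in {n_days}')
print(f'Student-t(3) sample shows {observed}')
```

The same three lines of standardizing and counting can be pointed at any column of `log_rets` to turn a QQ-plot impression into a count.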

Here's why this matters for your ML models. A risk model that treats all 10 ETFs as having "the same kind of fat tails" is making at least three different mistakes. For volatility products, it underestimates upside risk. For equities, it underestimates downside risk. For bonds, it overestimates both. The variation across asset classes isn't noise — it's a structural feature of how different markets process information and absorb shocks. Any feature engineering or distribution-fitting you do later in this course needs to be *asset-aware*, not one-size-fits-all.

The lecture told you fat tails exist. Now you know they vary by an order of magnitude across asset classes, they have different *shapes* (left-skewed vs. right-skewed vs. symmetric), and the pattern is predictable from the type of asset. In the homework, you'll compute this for 200 stocks and discover that even within equities, the kurtosis distribution is bimodal.

## Exercise 2: Dollar Bars at Scale — 10 ETFs

The lecture showed a 5-line "napkin version" of dollar bars on SPY and noted that they bring returns closer to Gaussian. Lopez de Prado makes a stronger claim: dollar bars are *strictly better* than time bars as inputs for ML models. But SPY is the most liquid security on the planet — over $30 billion in daily volume. It's the easiest possible test case.

Here's the real question: does the dollar-bar advantage generalize? When you try dollar bars on a thinly-traded utilities ETF, or a volatile commodity fund, or a volatility product with bizarre volume patterns, does the kurtosis reduction hold up? Does it get better or worse? And does adding volume bars to the comparison change the story?

We'll build time bars, volume bars, *and* dollar bars for all 10 ETFs, choose per-ETF thresholds based on each asset's own trading patterns, and produce a comprehensive comparison table. By the end, you'll know exactly when dollar bars earn their keep and when they're more trouble than they're worth.

### Tasks

1. Write `make_dollar_bars` and `make_volume_bars` functions that take a DataFrame and a threshold and return OHLCV bars.
2. For each of the 10 ETFs, compute the median daily dollar volume and median daily share volume. Use these as per-ETF thresholds.
3. Build time bars (daily), volume bars, and dollar bars for all 10 ETFs. Compute log returns for each bar type.
4. Build a comparison table: ETF, bar type, number of bars, excess kurtosis, Jarque-Bera statistic.
5. Answer: does the kurtosis reduction from dollar bars depend on the ETF's liquidity? Which ETFs benefit most, and which least?

In [None]:
# ── YOUR WORKSPACE ──
# Build make_dollar_bars and make_volume_bars functions
# Apply to all 10 ETFs with per-ETF thresholds
# Compute returns for each bar type, build comparison table

---
### ▶ Solution

In [None]:
def make_dollar_bars(df, dollar_threshold):
    """Build OHLCV bars, one per fixed dollar volume traded."""
    dollar_vol = df['Close'] * df['Volume']
    cum_dollars = dollar_vol.cumsum()
    bar_ids = (cum_dollars // dollar_threshold).astype(int)
    return df.groupby(bar_ids).agg(
        {'Open': 'first', 'High': 'max',
         'Low': 'min', 'Close': 'last', 'Volume': 'sum'})

def make_volume_bars(df, volume_threshold):
    """Build OHLCV bars, one per fixed number of shares traded."""
    cum_vol = df['Volume'].cumsum()
    bar_ids = (cum_vol // volume_threshold).astype(int)
    return df.groupby(bar_ids).agg(
        {'Open': 'first', 'High': 'max',
         'Low': 'min', 'Close': 'last', 'Volume': 'sum'})

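Both functions follow the same logic: accumulate a running total (dollar volume or share volume) and start a new bar every time the total crosses a multiple of the threshold. Because our input is daily data, a bar can never be shorter than one day: a high-activity day closes a bar on its own (even if it jumps past several thresholds at once), while a run of quiet days gets merged into a single bar. With intraday or tick data, the same code would produce several bars on a busy day. Either way, the insight is the same: we're sampling by *information flow*, not by the clock.

The critical design choice is the threshold. Too low and you get noisy micro-bars dominated by bid-ask bounce (with daily input, a very low threshold simply reproduces the daily bars). Too high and you get so few bars that you're back to weekly data. Using each ETF's median daily value as the threshold keeps the bar count on the same order as the number of trading days, which makes for a reasonably apples-to-apples comparison. But it is not one-to-one: with daily input the count can only come out at or below the number of trading days, because a calm August week collapses into fewer bars than days while every above-median day of an FOMC week gets a bar of its own.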
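The bar-assignment step is easier to see on a tiny hand-made example. The sketch below applies the same cumulative-sum-and-floor-divide logic to six hypothetical days (prices, volumes, and the threshold are all made up for illustration) and prints which bar each day lands in.

```python
import pandas as pd

# Hypothetical six-day sample; prices and volumes are made up for illustration
toy = pd.DataFrame({
    'Close':  [10.0, 10.0, 10.0, 10.0, 10.0, 10.0],
    'Volume': [40, 40, 40, 250, 30, 30],
})
threshold = 1000.0  # dollars per bar (illustrative)

dollar_vol = toy['Close'] * toy['Volume']         # 400, 400, 400, 2500, 300, 300
cum_dollars = dollar_vol.cumsum()                 # 400, 800, 1200, 3700, 4000, 4300
bar_ids = (cum_dollars // threshold).astype(int)  # 0, 0, 1, 3, 4, 4

print(bar_ids.tolist())
print('bars formed:', bar_ids.nunique(), 'from', len(toy), 'days')
```

Two things are worth noticing. The first two quiet days share bar 0, so bars can span several days. And the busy fourth day jumps the id from 1 to 3: it crossed two thresholds but still yields only one bar, because a daily row cannot be split.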

Now let's apply both functions to all 10 ETFs and see which ones benefit most from the resampling.

In [None]:
rows = []
for ticker in tickers:
    df = raw.xs(ticker, axis=1, level=1).dropna(subset=['Close'])
    time_ret = np.log(df['Close'] / df['Close'].shift(1)).dropna()

    daily_dollar = df['Close'] * df['Volume']
    d_bars = make_dollar_bars(df, daily_dollar.median())
    v_bars = make_volume_bars(df, df['Volume'].median())

    d_ret = np.log(d_bars['Close'] / d_bars['Close'].shift(1)).dropna()
    v_ret = np.log(v_bars['Close'] / v_bars['Close'].shift(1)).dropna()

    for label, ret in [('Time', time_ret), ('Volume', v_ret), ('Dollar', d_ret)]:
        jb = stats.jarque_bera(ret)
        rows.append({'ETF': ticker, 'Bar': label, 'N': len(ret),
                     'Kurt': ret.kurtosis(), 'JB': jb.statistic})

comp = pd.DataFrame(rows)
pivot_k = comp.pivot(index='ETF', columns='Bar', values='Kurt')[['Time','Volume','Dollar']]
pivot_k['%Δ'] = ((pivot_k['Dollar'] - pivot_k['Time']) / pivot_k['Time'] * 100)
pivot_k = pivot_k.sort_values('Time', ascending=False)
print(pivot_k.to_string(float_format=lambda x: f'{x:.1f}'))

Read the %Δ column — it tells you how much kurtosis *changed* when switching from time bars to dollar bars (negative means improvement). The pattern is striking. The liquid, high-volume ETFs — SPY, QQQ, IWM — show substantial kurtosis reductions, likely in the 20-50% range. These are assets where daily volume swings dramatically between quiet days and event days, so resampling by dollar volume genuinely normalizes the information content per bar.

Now look at the lower-volume ETFs — XLU, USO, possibly HYG. The improvement is much smaller, perhaps only 5-15%. For very illiquid names, you might even see kurtosis *increase* with dollar bars, because the threshold produces bars that span multiple days, introducing stale-price effects. This is the honest answer about dollar bars: they're not a magic bullet.

The volume bars (middle column) tell an intermediate story. They help, but not as much as dollar bars for the liquid names — because volume bars don't account for price changes. A million shares of SPY at $450 carries different information content than a million shares at $350. Dollar bars capture that; volume bars don't. Let's see this as a chart.

In [None]:
fig, ax = plt.subplots(figsize=(14, 6))
x = np.arange(len(pivot_k))
w = 0.25

ax.bar(x - w, pivot_k['Time'], w, label='Time Bars', color='#e74c3c', alpha=0.8)
ax.bar(x, pivot_k['Volume'], w, label='Volume Bars', color='#f39c12', alpha=0.8)
ax.bar(x + w, pivot_k['Dollar'], w, label='Dollar Bars', color='#2ecc71', alpha=0.8)
ax.set_xticks(x)
ax.set_xticklabels(pivot_k.index, rotation=45)
ax.set_ylabel('Excess Kurtosis')
ax.set_title('Kurtosis by Bar Type — Dollar Bars Help Most Where Needed Most')
ax.legend()
plt.tight_layout()
plt.show()

The chart makes the pattern visceral. For the fattest-tailed ETFs on the left, the ones where non-Gaussianity is most extreme, the green bars (dollar) sit noticeably lower than the red bars (time). Dollar bars are doing real work where it matters most. For the tamest ETFs on the right, where time-bar kurtosis is already relatively low, all three bar types produce similar results. The free lunch has a variable price tag.

This has a practical consequence for your ML pipeline. If you're building a multi-asset model that takes positions across equities, bonds, commodities, and volatility products, a blanket switch to dollar bars helps some assets dramatically and others barely at all. For liquid assets with large intraday volume variation (SPY on earnings day vs. SPY in August), dollar bars are a genuine improvement. For thinly-traded assets with sparse, lumpy volume, the threshold selection becomes a new source of researcher discretion — and researcher discretion is a polite name for overfitting.

The lecture told you dollar bars are better. Now you know *how much* better depends on the asset — and the pattern is predictable from liquidity and volume variability. In the homework, you'll integrate dollar bars into a `DataLoader` class and discover that threshold selection at scale introduces its own set of design decisions.

## Exercise 3: Putting a Price on Survivorship Bias

The lecture told the story of Enron and WorldCom — companies that vanished and took their data with them. It told you survivorship bias is bad. But *how* bad? Two percent per year? Five? Is it enough to matter, or is it a theoretical concern that doesn't move the needle in practice?

Let's put a number on it. We'll take a list of S&P 500 constituents from around 2010, compare it to the current list, and measure the return gap between the stocks that survived and the stocks that were removed. The survivors are the ones you'd see if you downloaded "S&P 500 data" today and backtested to 2010. The removed companies are the ghosts your model never trained on.

### Tasks

1. Using the pre-defined lists of survivors and removed stocks (downloaded in the imports cell), compute the annualized return of each stock from 2010 to 2024 (or to removal date, whichever comes first).
2. Compute the average annualized return for each group.
3. Calculate the survivorship bias premium (survivors minus removed).
4. Compound that annual premium over 14 years. Answer: if your backtest shows 15% annualized, how many of those percentage points might be ghosts?

**Removed stocks provided** (these were in or near the S&P 500 around 2010 and have since been removed):

| Ticker | Company | Removed | Reason |
|--------|---------|---------|--------|
| GE | General Electric | 2018 (Dow) | Underperformance; dropped from the Dow Jones Industrial Average (an original 1896 member), not from the S&P 500 |
| XRX | Xerox | 2003/2017 | Decline |
| GPS | Gap Inc | 2020 | Retail decline |
| FLR | Fluor Corp | 2019 | Earnings collapse |
| HRB | H&R Block | 2013 | Shrinking market cap |
| HP | Helmerich & Payne | 2020 | Energy downturn |
| LEG | Leggett & Platt | 2019 | Market cap decline |
| NWSA | News Corp | various | Corporate restructuring |
| IPG | Interpublic Group | various | Removed/readded |
| AIZ | Assurant | 2016 | Dropped below threshold |

In [None]:
# ── YOUR WORKSPACE ──
# Compute annualized return for each survivor and removed stock
# Compare group averages
# Calculate the compound bias over 14 years


---
### ▶ Solution

In [None]:
def annualized_return(series):
    """Annualized return from a price series."""
    clean = series.dropna()
    if len(clean) < 252:
        return np.nan
    n_yr = (clean.index[-1] - clean.index[0]).days / 365.25
    total = clean.iloc[-1] / clean.iloc[0]
    return total ** (1 / n_yr) - 1 if (total > 0 and n_yr > 0) else np.nan

results = []
for ticker in survivors + removed:
    if ticker in bias_prices.columns:
        ann = annualized_return(bias_prices[ticker])
        grp = 'Survivor' if ticker in survivors else 'Removed'
        if not np.isnan(ann):
            results.append({'Ticker': ticker, 'Group': grp, 'Ann': ann})

We now have annualized returns for both groups. Before we look at the averages, it's worth scanning the individual stocks. The survivors should look like a greatest-hits album: many of them are companies that rode the post-2010 bull market to multi-bagger returns (think AAPL, AMGN, UNH). The removed stocks tell a different story: GE, once the most valuable company on Earth, has underperformed badly. GPS (Gap) was gutted by the retail apocalypse. FLR collapsed when its earnings evaporated. One simplification to keep in mind: `annualized_return` measures each stock over its full downloaded history, not up to its removal date as Task 1 suggests, so any post-removal recovery or decline is included in the figures.
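A quick sanity check on the annualization itself: a price that doubles over two years should come out near 41% per year (the square root of 2, minus 1), not 50%. The sketch below uses a stripped-down, hypothetically named `annualized_return_sketch` helper and two made-up price paths; neither is part of the exercise data.

```python
import numpy as np
import pandas as pd

def annualized_return_sketch(series):
    """Compound annual growth rate between the first and last valid price."""
    clean = series.dropna()
    n_yr = (clean.index[-1] - clean.index[0]).days / 365.25
    return (clean.iloc[-1] / clean.iloc[0]) ** (1 / n_yr) - 1

# Hypothetical price path: starts at 100 and doubles over two calendar years
idx = pd.bdate_range('2010-01-04', '2012-01-04')
doubling = pd.Series(np.linspace(100.0, 200.0, len(idx)), index=idx)

# Hypothetical path that loses half its value over the same window
halving = pd.Series(np.linspace(100.0, 50.0, len(idx)), index=idx)

ann_up = annualized_return_sketch(doubling)
ann_down = annualized_return_sketch(halving)
print(f'doubling: {ann_up:+.1%}')
print(f'halving:  {ann_down:+.1%}')
```

Only the first and last prices matter here, so the path in between is irrelevant to the result. That is also why the measurement window (full history versus up-to-removal) can change a stock's figure substantially.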

The key insight is that you wouldn't see the removed group at all in a typical backtest. They've been silently replaced by companies like Tesla, which entered the S&P 500 in 2020. A backtest built from today's constituent list credits you with Tesla's returns from well before it joined the index, and misses whatever stock Tesla replaced.

In [None]:
res_df = pd.DataFrame(results)

surv_avg = res_df[res_df['Group'] == 'Survivor']['Ann'].mean()
remv_avg = res_df[res_df['Group'] == 'Removed']['Ann'].mean()
bias = surv_avg - remv_avg
n_years = 14

surv_cum = (1 + surv_avg) ** n_years - 1
remv_cum = (1 + remv_avg) ** n_years - 1

print(f'Survivor avg ann. return:  {surv_avg:+.2%}')
print(f'Removed avg ann. return:   {remv_avg:+.2%}')
print(f'Survivorship bias premium: {bias:+.2%} / year')
print(f'\nOver {n_years} years, compounded:')
print(f'  Survivors cumulative:  {surv_cum:+.0%}')
print(f'  Removed cumulative:    {remv_cum:+.0%}')
print(f'  Cumulative overstate:  {surv_cum - remv_cum:+.0%}')

The numbers speak for themselves. The survivorship bias premium is likely in the range of 2-5% per year — and that's a *lower bound*, because we can only measure it for removed stocks that still have downloadable data. Enron returned -100%. Lehman Brothers returned -100%. WorldCom returned -100%. Those zeros don't show up in our removed-stock average because `yfinance` can't download them. The true bias is worse than what we measured.

Compound that annual premium over a 14-year backtest horizon and the overstatement becomes staggering. If the true annual bias is, say, 3.5%, that compounds to roughly a 60-70% overstatement of cumulative returns. Concretely: if your backtest shows 15% annualized returns, somewhere around 3-4 of those percentage points might be ghosts — returns that came from stocks being silently replaced by better-performing ones. Your model didn't earn those returns. The index reconstitution gave them to you for free.
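The compounding arithmetic is short enough to check directly. The sketch below compounds three annual bias levels over 14 years and then compares terminal wealth for a reported 15% backtest against the same backtest with a 3.5% bias removed. The 15% and 3.5% figures are the illustrative values from the text, not measured results.

```python
n_years = 14

# Cumulative effect of a constant annual bias, compounded
cumulative = {}
for annual_bias in (0.02, 0.035, 0.05):
    cumulative[annual_bias] = (1 + annual_bias) ** n_years - 1
    print(f'{annual_bias:.1%} per year -> {cumulative[annual_bias]:.0%} over {n_years} years')

# Terminal wealth per $1: reported backtest vs. the same backtest net of the bias
reported, bias = 0.15, 0.035
wealth_reported = (1 + reported) ** n_years
wealth_true = (1 + reported - bias) ** n_years
print(f'reported: ${wealth_reported:.2f}   bias-adjusted: ${wealth_true:.2f}')
```

The spread between the low and high ends widens quickly: a 2% annual bias compounds to roughly a third, while 5% roughly doubles the cumulative figure.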

This is why professional quants pay thousands of dollars a year for survivorship-bias-free databases like CRSP. With free data from `yfinance`, you can document the bias and acknowledge it, but you cannot eliminate it. Every backtest you run with free data carries this invisible inflation.

## Exercise 4: Death by a Thousand Cuts — Transaction Costs

Most academic papers show you a beautiful equity curve and then bury the words "transaction costs are not included" in a footnote on page 23. Let's see what happens when you drag those costs out of the footnote and into the P&L.

We'll implement the simplest possible timing strategy — the 50/200-day moving average crossover that every intro-to-trading textbook loves. At zero cost, it looks decent. The question is: at what cost level does it die? Is it 5 basis points per side? 10? 20? And once you know the answer, you'll understand why every model we build in this course includes a transaction cost estimate — because without one, you're running a fantasy backtest.

### Tasks

1. Compute the 50-day and 200-day simple moving averages of SPY's close price.
2. Generate a signal: long SPY when MA(50) > MA(200), else flat (cash). Shift the signal by one day to avoid look-ahead bias.
3. Compute daily strategy returns at four cost levels: 0, 5, 10, and 20 basis points per side. Costs are incurred on each day the signal changes.
4. Plot cumulative returns for each cost level alongside buy-and-hold.
5. Answer: at what cost level does the strategy underperform buy-and-hold? What does that imply for any strategy you build?

In [None]:
# ── YOUR WORKSPACE ──
# Build the moving average crossover strategy
# Apply costs at 0, 5, 10, 20 bps per side
# Plot cumulative returns for each cost level vs. buy-and-hold


---
### ▶ Solution

In [None]:
spy_close = prices['SPY'].dropna()
spy_ret = np.log(spy_close / spy_close.shift(1)).dropna()

ma50 = spy_close.rolling(50).mean()
ma200 = spy_close.rolling(200).mean()
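# Keep the signal undefined until the 200-day average exists, so the warm-up
# period is dropped rather than counted as time spent in cash
signal = (ma50 > ma200).astype(int).where(ma200.notna())
signal = signal.shift(1).reindex(spy_ret.index).dropna()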
spy_ret = spy_ret.reindex(signal.index)

trades = signal.diff().abs().fillna(0)
n_trades = int(trades.sum())

print(f'Period: {signal.index[0].strftime("%Y-%m-%d")} to '
      f'{signal.index[-1].strftime("%Y-%m-%d")}')
print(f'Signal changes (trades): {n_trades}')
print(f'Avg holding period: ~{len(signal) // max(n_trades, 1)} days')

The moving average crossover is intentionally a low-frequency strategy — it trades only when a slow trend reverses, so the number of round trips over 14 years is modest. That's actually the *best case* for surviving transaction costs. A daily-rebalancing strategy would trade 252 times per year; this one trades maybe a dozen times. If even this strategy gets killed by costs, imagine what happens to something that trades every day.
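The shift-then-diff logic used to count trades is worth tracing on a toy series. The sketch below uses a hypothetical eight-day raw signal (made up for illustration) to show how the one-day shift delays each position change and how `diff().abs()` flags the days on which a cost would be charged.

```python
import pandas as pd

# Hypothetical raw signal computed from each day's close (1 = long, 0 = flat)
raw_signal = pd.Series([0, 0, 1, 1, 1, 0, 0, 1],
                       index=pd.bdate_range('2020-01-06', periods=8))

# Shift by one day: today's position uses only yesterday's information
position = raw_signal.shift(1).dropna()

# A trade happens whenever the position changes from the previous day
trades = position.diff().abs().fillna(0)

print('position:', position.astype(int).tolist())
print('trades:  ', trades.astype(int).tolist())
print('total signal changes:', int(trades.sum()))
```

The final raw signal (a switch back to long) has not been acted on yet, so it contributes neither a position nor a cost. Note also that each entry and each exit counts as one trade, so a full round trip is two signal changes.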

Now let's compute returns at each cost level and see where the equity curves diverge.

In [None]:
cost_bps_list = [0, 5, 10, 20]
strat_curves = {}
bh_curve = spy_ret.cumsum().apply(np.exp)
bh_ann = spy_ret.mean() * 252

for bps in cost_bps_list:
    cost = bps / 10_000
    net = signal * spy_ret - trades * cost
    strat_curves[bps] = {'curve': net.cumsum().apply(np.exp),
                          'ann': net.mean() * 252}

for bps in cost_bps_list:
    r = strat_curves[bps]
    print(f"  {bps:2d} bps/side  ->  ann. return: {r['ann']:+.2%}")
print(f"  Buy & Hold ->  ann. return: {bh_ann:+.2%}")

Look at how the annualized return decays as costs increase. The gap between 0 bps and 20 bps might not sound like much in percentage terms, but compounded over 14 years it's the difference between a strategy worth running and one that's worse than doing nothing. Let's see this as an equity curve, where the divergence over time makes the damage visceral.

In [None]:
fig, ax = plt.subplots(figsize=(14, 7))
colors = ['#2ecc71', '#f39c12', '#e67e22', '#e74c3c']
for bps, c in zip(cost_bps_list, colors):
    r = strat_curves[bps]
    ax.plot(r['curve'], color=c, lw=1.5,
            label=f'{bps} bps/side (ann: {r["ann"]:.1%})')
ax.plot(bh_curve, color='steelblue', lw=2, ls='--',
        label=f'Buy & Hold (ann: {bh_ann:.1%})')
ax.set_yscale('log')
ax.set_ylabel('Cumulative Return (log scale)')
ax.set_title('Transaction Costs: The Silent Strategy Killer')
ax.legend(loc='upper left')
plt.tight_layout()
plt.show()

The chart tells a clear story. At zero cost, the moving average crossover looks like a reasonable strategy — it avoids some drawdowns and captures most of the upside. But watch what happens as you add costs. At 5 bps per side, the curve dips noticeably. At 10 bps, the gap with buy-and-hold narrows to almost nothing. At 20 bps, the strategy likely underperforms buy-and-hold over the full period.

And remember: this is a *low-turnover* strategy. It trades maybe 10-20 times over 14 years. A typical quant strategy trades weekly or daily. The damage from costs scales linearly with turnover, so a daily-rebalancing model that turns over its whole position at 10 bps per round trip would pay roughly 25 percentage points of return per year in costs (252 × 10 bps). For reference, the average hedge fund's gross return is about 10-15%. You'd be donating your entire edge, and then some, to market makers.
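The linear scaling can be checked on the back of an envelope. The sketch below uses a hypothetical helper, `annual_cost_drag`, that multiplies trades per year by cost per trade, under the simplifying assumption that the full position turns over on every trade. The count of 15 crossover trades is an illustrative stand-in for whatever `n_trades` prints, not a measured result.

```python
def annual_cost_drag(trades_per_year, cost_bps_per_trade):
    """Approximate yearly return given up to costs, assuming full turnover on each trade."""
    return trades_per_year * cost_bps_per_trade / 10_000

# Crossover: assume about 15 signal changes over 14 years (illustrative figure)
crossover = annual_cost_drag(15 / 14, 20)

# Daily rebalancing: 252 trades per year at 10 bps each
daily = annual_cost_drag(252, 10)

print(f'crossover at 20 bps: {crossover:.2%} per year')
print(f'daily at 10 bps:     {daily:.2%} per year')
```

The crossover's drag is small in absolute terms because it trades so rarely. What decides its fate is how that drag compares with the strategy's margin over buy-and-hold, which is why the breakeven cost level is the number to report.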

This is the most important practical lesson from today's seminar. A strategy that looks brilliant on paper can be worthless in practice if it trades too frequently or if the assets it trades have wide spreads. Every model in this course will include a transaction cost estimate — because without one, you're living in fantasy land. The breakeven cost level for your strategy is the single most important number in your backtest, and most academic papers never report it.

---

## Summary

Four exercises, four numbers that should change how you think about financial data:

- **Fat tails vary by an order of magnitude across asset classes — and they have different *shapes*.** Volatility products (VIXY) can exceed kurtosis of 30, oil (USO) pushes 15-20, while bonds and utilities sit in the 5-8 range. Skewness diverges too: equities are left-skewed (crash risk), volatility is right-skewed (spike risk), gold is nearly symmetric. A risk model that treats all assets the same is making different mistakes in different directions.

- **Dollar bars help most where they're needed most — on liquid ETFs.** SPY and QQQ see 20-50% kurtosis reduction; thinly-traded ETFs see 5-15% at best. The dollar-bar advantage is proportional to each asset's volume variability. For illiquid assets, threshold selection becomes a new source of researcher discretion — which is a polite name for overfitting.

- **Survivorship bias inflates backtest returns by an estimated 2-5% per year.** Compounded over 14 years, that's roughly a 30-100% overstatement of cumulative returns (1.02^14 ≈ 1.32, 1.05^14 ≈ 1.98). And our estimate is a lower bound: the true disasters are unmeasurable because the data no longer exists.

- **Even a low-turnover strategy dies at moderate cost levels.** The moving average crossover underperforms buy-and-hold somewhere around 10-20 bps per side. A daily-rebalancing strategy would need far lower costs to survive.

You've done this for 10 ETFs. In the homework, you'll scale to 200 stocks — and discover that the patterns get more interesting. The kurtosis distribution across the full universe is bimodal. The data quality issues multiply. And you'll build a `DataLoader` class that handles these pathologies systematically, so every model you build for the rest of the course starts from clean, honest data.