# Week 1 Homework — The Data Quality Audit

## Your Mission

Your mission is straightforward but important: take a universe of 200 US equities and put the data through the most skeptical scrutiny you can manage. Think of yourself as the last line of defense before bad data meets a model with real capital behind it. Every quant firm has someone in this role — the person who signs off on the dataset before it touches a production model. Today, that person is you.

You will download the data yourself, compute returns both ways, build alternative bar types, hunt for data quality issues that most people never bother to check for, and ultimately produce a clean `DataLoader` class that handles the common pitfalls automatically. The class you build this week is not a homework exercise — it is the foundation of every pipeline you will construct for the rest of this course. If it is wrong, everything downstream is wrong. If it is right, you can trust your results, and that trust is worth more than any architectural improvement you will ever make to a model.

Here is the thing that makes this homework more than busywork: every issue you find in this dataset is an issue that exists in production systems at hedge funds. The difference is that their datasets have thousands of stocks, decades of history, and millions of dollars riding on the results. At Two Sigma, the data engineering team is larger than the research team. You will not build anything that sophisticated this week, but you will build something *correct* — and in this business, correct beats sophisticated every time.

One more thing. You are going to discover patterns that nobody mentioned in the lecture or the seminar. The lecture showed you that fat tails exist for SPY. The seminar showed you that kurtosis varies across 10 stocks. But when you run 200 stocks through the same analysis, something new emerges — the *distribution* of kurtosis across stocks is itself interesting, and it tells you something about how your model needs to handle different stocks differently. That is the kind of insight that only comes from working at scale.

## Deliverables

1. **Download and inspect data for 200 US equities (2010-2024).** Compute simple and log returns. Produce QQ-plots for at least 10 diverse stocks. Report kurtosis and skewness for the full universe. You will notice something unexpected about how kurtosis is distributed across stocks — document it.

2. **Implement dollar bars and compare with time bars.** Build a clean, reusable dollar bar function. Apply it to SPY and run the Jarque-Bera test to compare normality. Replicate the qualitative result from Lopez de Prado Figure 2.4.

3. **Find and document at least 3 specific data quality issues.** This is where it gets interesting — you will discover that "clean" data is a myth. Scan systematically for: missing dates, zero-volume days, extreme returns (>15%), possible unadjusted splits, NaN values. For each issue: what is it, which stocks are affected, how would it bias an ML model, and what did you do about it.

4. **Build the `DataLoader` class.** This is the main event. A reusable class that handles downloading, caching, missing data, returns computation (simple + log), bar construction (time/volume/dollar), and anomaly flagging. This class follows you through the rest of the course.

5. **Write a 1-page data quality report.** A structured markdown summary of biases found, what you fixed, and what remains unfixable. Be honest — this is practice for the kind of documentation that separates a trustworthy backtest from a misleading one.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
from scipy import stats
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3


def get_close(data):
    """Extract close prices, handling yfinance MultiIndex."""
    if isinstance(data.columns, pd.MultiIndex):
        return data['Close']
    return data[['Close']]

## Deliverable 1: Download and Inspect 200 US Equities

Download daily OHLCV data for 200 stocks from 2010 to 2024 using `yfinance`. Use a cross-section of the current S&P 500 as your starting universe — mega-cap tech, financials, healthcare, energy, utilities, the works. Compute both simple and log returns. Produce QQ-plots for at least 10 stocks spanning different sectors and volatility levels. Then compute kurtosis and skewness for the *entire* universe and plot their distributions.

A note before you start: we are using the *current* S&P 500 list to select stocks, then downloading their history back to 2010. This means survivorship bias is baked in from the very first line of code — every stock in your universe is, by definition, a stock that survived to today. Document this explicitly. It matters more than you think, and Deliverable 5 will ask you to reckon with it honestly.

In [None]:
# YOUR CODE HERE
# 
# Suggested approach:
# 1. Define your 200-ticker universe (diverse cross-section of S&P 500)
# 2. Download with yf.download() -- handle failures gracefully
# 3. Extract close prices, compute simple and log returns
# 4. Compute distributional stats (mean, std, skewness, kurtosis) for all
# 5. QQ-plots for 10 representative stocks
# 6. Histogram of kurtosis across the full universe
#
# Hint: yf.download() accepts a list of tickers and returns a MultiIndex
# DataFrame. The get_close() helper above handles the extraction.

In [None]:
# MORE WORKSPACE (continue your solution here)

---
## ━━━ SOLUTION: Deliverable 1 ━━━

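Let us pull the full universe: 200 stocks, 14 years, roughly 700,000 data points. This is where your laptop starts to earn its keep. The starter list below groups tickers by sector in blocks of 20; as written it holds 180 names (18 rows of 10, with real estate and materials sharing the final block), so add 20 more from sectors of your choice to reach the full 200. Notice that every ticker on this list exists today; the ghosts, the companies that held index slots in 2010 but were acquired, delisted, or bankrupt before 2024, are invisible. Enron and Lehman Brothers are the famous examples of the type, although both were gone before our sample begins. That is survivorship bias, and we will quantify its cost in Deliverable 5.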

In [None]:
SP500_SAMPLE = [
    'AAPL', 'MSFT', 'AMZN', 'GOOGL', 'META', 'NVDA', 'TSLA', 'AVGO', 'ORCL', 'CRM',  # Tech
    'ADBE', 'AMD', 'INTC', 'CSCO', 'QCOM', 'TXN', 'AMAT', 'MU', 'LRCX', 'KLAC',
    'JPM', 'BAC', 'WFC', 'GS', 'MS', 'BLK', 'SCHW', 'C', 'AXP', 'USB',  # Financials
    'PNC', 'TFC', 'BK', 'STT', 'FITB', 'KEY', 'CFG', 'HBAN', 'RF', 'ZION',
    'UNH', 'JNJ', 'LLY', 'PFE', 'ABBV', 'MRK', 'TMO', 'ABT', 'DHR', 'BMY',  # Healthcare
    'AMGN', 'GILD', 'ISRG', 'VRTX', 'REGN', 'MDT', 'SYK', 'BSX', 'ZBH', 'EW',
    'HD', 'LOW', 'NKE', 'SBUX', 'TJX', 'MCD', 'YUM', 'CMG', 'ROST', 'DG',  # Consumer
    'DLTR', 'ORLY', 'AZO', 'BBY', 'POOL', 'GRMN', 'EBAY', 'ETSY', 'APTV', 'MGM',
    'PG', 'KO', 'PEP', 'WMT', 'COST', 'PM', 'MO', 'CL', 'KMB', 'GIS',  # Staples
    'K', 'HSY', 'SJM', 'CPB', 'CAG', 'TSN', 'HRL', 'MKC', 'CHD', 'CLX',
    'CAT', 'DE', 'UNP', 'HON', 'UPS', 'RTX', 'BA', 'LMT', 'GD', 'NOC',  # Industrials
    'GE', 'MMM', 'EMR', 'ITW', 'ETN', 'PH', 'ROK', 'SWK', 'CMI', 'PCAR',
    'XOM', 'CVX', 'COP', 'EOG', 'SLB', 'MPC', 'PSX', 'VLO', 'OXY', 'PXD',  # Energy
    'DVN', 'HES', 'HAL', 'BKR', 'FANG', 'APA', 'MRO', 'CTRA', 'OVV', 'AR',
    'NEE', 'DUK', 'SO', 'D', 'AEP', 'SRE', 'EXC', 'XEL', 'ED', 'WEC',  # Utilities
    'ES', 'PPL', 'FE', 'EIX', 'DTE', 'AEE', 'CMS', 'CNP', 'NI', 'PNW',
    'AMT', 'PLD', 'CCI', 'EQIX', 'PSA', 'SPG', 'O', 'WELL', 'DLR', 'AVB',  # RE + Materials
    'LIN', 'APD', 'SHW', 'FCX', 'NEM', 'DOW', 'DD', 'PPG', 'ECL', 'NUE',
]

The download will take a minute or two depending on your connection. We use `auto_adjust=True` so that all prices are split- and dividend-adjusted — meaning Amazon's 20:1 split in June 2022 shows up as a smooth price history rather than a 95% overnight "crash." We also use `threads=True` to parallelize the download, because 200 sequential API calls would take an unreasonable amount of time.
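
To see where the 95% figure comes from, here is a minimal sketch of back-adjusting a price series for a 20:1 split. The prices, dates, and the `ratio` variable are toy values chosen for illustration; yfinance performs this adjustment for you when `auto_adjust=True`.

```python
import pandas as pd

# Hypothetical unadjusted closes around a 20:1 split (toy numbers).
raw = pd.Series(
    [2400.0, 2440.0, 122.5, 123.0],
    index=pd.to_datetime(['2022-06-02', '2022-06-03',
                          '2022-06-06', '2022-06-07']),
)
split_date = pd.Timestamp('2022-06-06')
ratio = 20

# Back-adjust: divide every pre-split price by the split ratio.
adjusted = raw.copy()
pre_split = adjusted.index < split_date
adjusted.loc[pre_split] = adjusted.loc[pre_split] / ratio

raw_ret = raw.pct_change(fill_method=None)
adj_ret = adjusted.pct_change(fill_method=None)
print(f"Unadjusted split-day return: {raw_ret.loc[split_date]:+.1%}")
print(f"Adjusted split-day return:   {adj_ret.loc[split_date]:+.1%}")
```

The unadjusted series shows a one-day loss of about 95% that no investor actually experienced, while the adjusted series shows an ordinary daily move.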

In [None]:
raw_data = yf.download(
    SP500_SAMPLE,
    start='2010-01-01',
    end='2024-01-01',
    auto_adjust=True,
    threads=True,
)

prices = get_close(raw_data)
n_tickers = prices.shape[1]
n_complete = prices.dropna(axis=1).shape[1]

print(f"Downloaded: {prices.shape[0]} trading days x {n_tickers} tickers")
print(f"Date range: {prices.index[0]:%Y-%m-%d} to {prices.index[-1]:%Y-%m-%d}")
print(f"Complete data (no NaN): {n_complete} tickers")
print(f"Partial data: {n_tickers - n_complete} tickers")

Some tickers will come back with partial histories because they listed after 2010 (META IPO'd in May 2012, for instance, and several energy names were spun off or restructured mid-period). That is not a bug; it is our first data quality observation. A naive pipeline that back-fills those NaN values or replaces them with zeros is silently inventing data. A careful pipeline keeps the NaN and handles it explicitly. We will address this in the DataLoader's missing data policy in Deliverable 4.

Now let us compute both types of returns and pull the cross-sectional distributional statistics. We use log returns for the statistics because they are additive over time, which makes annualization cleaner.
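
As a quick check of the additivity claim, the sketch below uses a made-up four-point price series (not market data). Log returns sum to the log of the total growth factor, whereas simple returns must be compounded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices, not real market data.
toy = pd.Series([100.0, 110.0, 99.0, 108.9])

simple = toy.pct_change(fill_method=None).dropna()
log = np.log(toy / toy.shift(1)).dropna()

total_return = toy.iloc[-1] / toy.iloc[0] - 1   # true holding-period return
from_log_sum = np.exp(log.sum()) - 1            # sum the logs, then convert back
from_compounding = (1 + simple).prod() - 1      # compound the simple returns
naive_simple_sum = simple.sum()                 # summing simple returns is wrong

print(f"True total return:      {total_return:.4f}")
print(f"From summed log rets:   {from_log_sum:.4f}")
print(f"From compounded simple: {from_compounding:.4f}")
print(f"Naive sum of simple:    {naive_simple_sum:.4f}")
```

The first three numbers agree; the naive sum does not, which is why the statistics here are computed on log returns.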

In [None]:
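# fill_method=None stops pandas from forward-filling NaN prices before
# computing returns, which keeps simple and log returns consistent.
simple_returns = prices.pct_change(fill_method=None).iloc[1:]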
log_returns = np.log(prices / prices.shift(1)).iloc[1:]

dist_stats = pd.DataFrame({
    'mean_annual': log_returns.mean() * 252,
    'vol_annual': log_returns.std() * np.sqrt(252),
    'skewness': log_returns.skew(),
    'excess_kurtosis': log_returns.kurtosis(),
    'min_return': log_returns.min(),
    'max_return': log_returns.max(),
    'n_obs': log_returns.count(),
}).dropna()

print(f"Computed returns for {len(dist_stats)} tickers")
print(f"\nExcess Kurtosis — Mean: {dist_stats['excess_kurtosis'].mean():.1f}, "
      f"Median: {dist_stats['excess_kurtosis'].median():.1f}, "
      f"Range: [{dist_stats['excess_kurtosis'].min():.1f}, {dist_stats['excess_kurtosis'].max():.1f}]")
print(f"Skewness — Mean: {dist_stats['skewness'].mean():.3f}, "
      f"Median: {dist_stats['skewness'].median():.3f}")
print(f"\nLowest kurtosis: {dist_stats['excess_kurtosis'].idxmin()} "
      f"({dist_stats['excess_kurtosis'].min():.1f})")
print(f"Highest kurtosis: {dist_stats['excess_kurtosis'].idxmax()} "
      f"({dist_stats['excess_kurtosis'].max():.1f})")

Every single stock in the universe has positive excess kurtosis — fatter tails than a Gaussian. But look at the *range*. The lowest-kurtosis stocks (likely utilities and consumer staples) sit around 3-6, while the highest-kurtosis stocks (volatile tech, energy, or meme-adjacent names) can exceed 20 or 30. That is not a minor variation; it is a 5x to 10x difference in tail thickness between stocks that are all in the same index.

Now let us see how this looks across sectors. The QQ-plots below compare 10 stocks chosen for maximum diversity: a boring utility, a mega-cap tech name, a bank, a healthcare stock, an energy company, and several high-volatility names.

In [None]:
qq_tickers = ['DUK', 'JNJ', 'KO', 'JPM', 'XOM',
              'AAPL', 'NVDA', 'TSLA', 'FANG', 'MGM']

fig, axes = plt.subplots(2, 5, figsize=(22, 8))
for ax, ticker in zip(axes.flatten(), qq_tickers):
    ret = log_returns[ticker].dropna()
    stats.probplot(ret.values, dist='norm', plot=ax)
    kurt = ret.kurtosis()
    ax.set_title(f'{ticker} (kurt={kurt:.1f})', fontsize=10)
    ax.get_lines()[0].set_markersize(1.5)
    ax.get_lines()[0].set_color('steelblue')

plt.suptitle('QQ-Plots: Every Stock Has Fat Tails, But Severity Varies by Sector',
             fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

Look at the difference between Duke Energy (DUK) and Tesla (TSLA). DUK's QQ-plot hugs the reference line fairly closely — its tails deviate, but modestly. TSLA's QQ-plot bends away from the line dramatically in both tails, showing that extreme moves happen far more often than a Gaussian would predict. Both are in the S&P 500. Both would be inputs to the same model. But treating them as coming from the same distribution is a modeling choice that costs you information before training even begins.

Now for the cross-sectional view — this is the picture you could not see with just 10 stocks. Let us plot the distribution of kurtosis *across* all 200 stocks.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].hist(dist_stats['excess_kurtosis'], bins=40,
             color='steelblue', alpha=0.7, edgecolor='white')
axes[0].axvline(0, color='red', ls='--', label='Gaussian (kurt=0)')
axes[0].set_xlabel('Excess Kurtosis')
axes[0].set_ylabel('Number of Stocks')
axes[0].set_title(f'Kurtosis Distribution Across {len(dist_stats)} Stocks')
axes[0].legend()

axes[1].hist(dist_stats['skewness'], bins=40,
             color='#e74c3c', alpha=0.7, edgecolor='white')
axes[1].axvline(0, color='red', ls='--', label='Symmetric')
axes[1].set_xlabel('Skewness')
axes[1].set_ylabel('Number of Stocks')
axes[1].set_title(f'Skewness Distribution Across {len(dist_stats)} Stocks')
axes[1].legend()

plt.tight_layout()
plt.show()

Here is the aha moment for Deliverable 1 — something you could not see with SPY alone or even with 10 stocks. The kurtosis histogram is not a single bell curve. There is a cluster of stocks with moderate kurtosis (roughly 3-8, mostly utilities, staples, and large-cap industrials) and a long right tail of stocks with extreme kurtosis (15-40+, volatile tech, energy, and smaller names). Whether this is truly bimodal or simply right-skewed depends on the exact sample, but the key insight is the same: the "how fat are the tails" question does not have *one* answer for your universe. It has a distribution of answers, and that distribution is wide.

This matters for modeling. If you fit a single Student-t distribution to your entire universe and use the same degrees-of-freedom parameter for all stocks, you understate tail risk for the volatile names and overstate it for utilities. A model that adapts its distributional assumptions per-stock, or at least per-sector, will do better. That is a design decision you could not have made without seeing this histogram.
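
To make the degrees-of-freedom point concrete, here is a rough method-of-moments sketch. It relies on the standard identity that a Student-t with ν > 4 degrees of freedom has excess kurtosis 6/(ν − 4). The function name `t_dof_from_excess_kurtosis` and the three input values are hypothetical, chosen only to span the range discussed above.

```python
import pandas as pd

def t_dof_from_excess_kurtosis(excess_kurt):
    """Method-of-moments degrees of freedom for a Student-t.

    Inverts excess_kurtosis = 6 / (nu - 4), which holds for nu > 4.
    """
    if excess_kurt <= 0:
        return float('inf')   # no excess kurtosis: Gaussian limit
    return 4.0 + 6.0 / excess_kurt

# Hypothetical excess-kurtosis values, not measured from real data.
example = pd.Series({
    'calm_utility': 3.0,
    'typical_stock': 8.0,
    'volatile_name': 30.0,
})
implied_dof = example.apply(t_dof_from_excess_kurtosis)
print(implied_dof.round(2))
```

The implied values differ across names, so a single shared parameter cannot match all of them. Treat this only as a first pass: sample kurtosis is noisy, and the moment estimate is pinned just above 4 for heavy-tailed names, so a maximum-likelihood fit would be the more careful choice.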

## Deliverable 2: Implement Dollar Bars

Build a clean, reusable dollar bar function and apply it to SPY. Compare the return distributions of time bars versus dollar bars using the Jarque-Bera test. The lecture showed you the theory; the seminar had you build bars for SPY and TSLA. Here you are going to build a *production-quality* function and examine what the improvement actually looks like with daily data.

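A sensible starting threshold for daily data: set the dollar bar threshold to approximately the median daily dollar volume. Because the input is already one row per day, a single day can never be split into several bars. Instead, days that trade at or above the threshold each start a bar of their own, while runs of quiet days are merged into one bar, so you end up with somewhat fewer bars than trading days. Experiment with other thresholds too (half the median, double the median) and see how the results change.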

In [None]:
# YOUR CODE HERE
#
# Suggested approach:
# 1. Download SPY with full OHLCV (or reuse from above)
# 2. Write a make_dollar_bars(df, threshold) function
# 3. Compute time-bar and dollar-bar returns
# 4. Run Jarque-Bera on both, compare kurtosis
# 5. Visualize: histograms + QQ-plots side by side

---
## ━━━ SOLUTION: Deliverable 2 ━━━

The dollar bar function below is intentionally compact. The core logic is four lines: compute dollar volume per row, take the cumulative sum, integer-divide by the threshold to assign bar IDs, and aggregate. The docstring and input validation make it production-ready — the kind of function you can drop into a pipeline and trust.

In [None]:
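def make_dollar_bars(df, threshold):
    """Construct dollar bars from OHLCV data.

    Starts a new bar every time cumulative dollar volume crosses a
    multiple of `threshold`. Produces more bars during high-activity
    periods and fewer during quiet periods. An input row is never
    split, so the output cannot be finer than the input.

    Args:
        df: DataFrame with 'Close', 'Volume', 'Open', 'High', 'Low' columns.
        threshold: Dollar volume per bar (e.g., median daily dollar volume).

    Returns:
        DataFrame of OHLCV dollar bars, indexed by integer bar ID.
    """
    # Input validation: fail early with a clear message.
    required = {'Open', 'High', 'Low', 'Close', 'Volume'}
    missing_cols = required - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing columns: {sorted(missing_cols)}")
    if not threshold > 0:
        raise ValueError("threshold must be a positive number")

    dollar_vol = df['Close'] * df['Volume']
    cum_dollars = dollar_vol.cumsum()
    bar_ids = (cum_dollars // threshold).astype(int)

    bars = df.groupby(bar_ids).agg({
        'Open': 'first', 'High': 'max',
        'Low': 'min', 'Close': 'last', 'Volume': 'sum',
    })
    return bars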
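
To see how the bar IDs are assigned, here is the same cumulative-sum logic traced on a six-row toy table. The prices, volumes, and the threshold of 2,000 are invented for illustration.

```python
import pandas as pd

# Hypothetical six-day OHLCV table (toy numbers).
toy = pd.DataFrame({
    'Open':   [10.0, 10.5, 10.2, 11.0, 10.8, 10.9],
    'High':   [10.6, 10.7, 11.1, 11.2, 11.0, 11.5],
    'Low':    [9.9, 10.1, 10.0, 10.7, 10.6, 10.8],
    'Close':  [10.5, 10.2, 11.0, 10.8, 10.9, 11.4],
    'Volume': [100, 50, 400, 60, 40, 300],
})
threshold = 2000.0

dollar_vol = toy['Close'] * toy['Volume']     # dollars traded per row
cum_dollars = dollar_vol.cumsum()             # running total
bar_ids = (cum_dollars // threshold).astype(int)

toy_bars = toy.groupby(bar_ids).agg({
    'Open': 'first', 'High': 'max',
    'Low': 'min', 'Close': 'last', 'Volume': 'sum',
})
print(pd.DataFrame({'dollar_vol': dollar_vol, 'bar_id': bar_ids}))
print(toy_bars)
```

Two things are worth noticing. The quiet rows are merged (rows 0-1 and rows 3-4 each form one bar), so six input rows become four bars. The heavy third row trades more than twice the threshold yet still produces a single bar; the bar ID simply jumps from 0 to 2. With daily input, dollar bars can only coarsen the data, never refine it.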

Now let us apply this to SPY and compare. We download SPY separately with full OHLCV because we need the Open/High/Low columns for bar construction, not just Close. The threshold is set to the median daily dollar volume. With daily input this yields at most one bar per trading day: heavy days (earnings, FOMC announcements, crisis days) each start their own bar, while quiet August Tuesdays are merged with their neighbors.

In [None]:
spy_raw = yf.download('SPY', start='2010-01-01', end='2024-01-01',
                       auto_adjust=True)
if isinstance(spy_raw.columns, pd.MultiIndex):
    spy_raw = spy_raw.droplevel(1, axis=1)

spy_raw['DollarVol'] = spy_raw['Close'] * spy_raw['Volume']
dol_threshold = spy_raw['DollarVol'].median()

time_bars = spy_raw[['Open', 'High', 'Low', 'Close', 'Volume']]
dol_bars = make_dollar_bars(spy_raw, dol_threshold)

print(f"SPY time bars: {len(time_bars):,}")
print(f"SPY dollar bars: {len(dol_bars):,}")
print(f"Dollar threshold: ${dol_threshold:,.0f}")

The number of dollar bars should be in the same ballpark as the number of time bars, but no higher and usually lower. We set the threshold to the median, so the half of all days at or above it each start a new bar, and the quieter half are partly absorbed into neighboring bars. The interesting part is *which* days get merged. Let us compute returns for both bar types and run the Jarque-Bera test to see whether the dollar bars actually produce a more Gaussian distribution.

In [None]:
time_ret = np.log(time_bars['Close'] / time_bars['Close'].shift(1)).dropna()
dol_ret = np.log(dol_bars['Close'] / dol_bars['Close'].shift(1)).dropna()

jb_time = stats.jarque_bera(time_ret)
jb_dol = stats.jarque_bera(dol_ret)

print(f"{'Bar Type':<12} {'N':>6} {'Kurt':>8} {'Skew':>8} {'JB Stat':>10}")
print('-' * 48)
print(f"{'Time':<12} {len(time_ret):>6,} {time_ret.kurtosis():>8.2f} "
      f"{time_ret.skew():>8.3f} {jb_time.statistic:>10.1f}")
print(f"{'Dollar':<12} {len(dol_ret):>6,} {dol_ret.kurtosis():>8.2f} "
      f"{dol_ret.skew():>8.3f} {jb_dol.statistic:>10.1f}")

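Both bar types reject the null hypothesis of normality (the Jarque-Bera p-values are effectively zero). But look at the *magnitude* of the JB statistic and the kurtosis: dollar bars should show a lower value for both, meaning the distribution is closer to Gaussian even though it is still far from it. Keep in mind that the JB statistic grows in proportion to the sample size, so part of its drop simply reflects the smaller number of dollar bars; kurtosis is the fairer comparison. The reduction is real but modest.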
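
When comparing JB statistics across bar types, it helps to know what the number is made of. The sketch below recomputes it by hand as n/6 · (S² + K²/4), where S is sample skewness and K is sample excess kurtosis, and checks the result against `scipy.stats.jarque_bera`. The input is synthetic Student-t noise, not SPY returns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_t(df=5, size=2000)   # synthetic fat-tailed sample

n = len(x)
d = x - x.mean()
m2 = np.mean(d ** 2)                        # biased (population) variance
skew = np.mean(d ** 3) / m2 ** 1.5          # sample skewness
ex_kurt = np.mean(d ** 4) / m2 ** 2 - 3     # sample excess kurtosis

jb_manual = n / 6 * (skew ** 2 + ex_kurt ** 2 / 4)
jb_scipy = stats.jarque_bera(x).statistic

# Same sample repeated twice: identical shape, twice the observations.
jb_doubled = stats.jarque_bera(np.concatenate([x, x])).statistic

print(f"manual JB:  {jb_manual:.2f}")
print(f"scipy JB:   {jb_scipy:.2f}")
print(f"doubled n:  {jb_doubled:.2f}")
```

Repeating the sample leaves skewness and kurtosis unchanged but doubles the statistic. A lower JB for a series with fewer observations is therefore not, by itself, evidence of a more Gaussian shape.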

This is the honest result that Lopez de Prado's intraday examples do not always make obvious: with daily data, the dollar bar improvement is measurable but not transformative. The reason is that daily bars already average out a lot of the intraday variation. The real win comes with tick-level data, where a quiet Monday and an earnings-day Friday can differ by 20x in volume. With daily data, the maximum variation is maybe 3-5x. The *principle* is right; the *magnitude* depends on your data granularity. For this course, we will use dollar bars as a default because the improvement is free and the habit is good practice for when you graduate to intraday data.

## Deliverable 3: Find and Document Data Quality Issues

Now comes the detective work. Scan the full 200-stock universe systematically for data quality problems: missing dates, zero-volume days, extreme returns greater than 15% in a single day, possible unadjusted stock splits, and NaN values. For each issue you find, document what it is, which tickers are affected, how it would bias an ML model if left unchecked, and what you did about it.

Here is a hint about what to expect: you will not find a handful of edge cases buried in obscure stocks. You will find that data quality issues are *pervasive*, affecting 5-10% or more of your universe. If you are coming from a world of curated ML benchmark datasets, this will recalibrate your expectations about what "clean data" means in finance.

In [None]:
# YOUR CODE HERE
#
# Suggested approach:
# 1. Check for NaN: which tickers, how many, where?
# 2. Check for extreme returns: > 15% single-day moves
# 3. Check for zero-volume or suspiciously low-volume days
# 4. Check for unusual date gaps (> 4 calendar days)
# 5. For each issue, note: ticker, date, severity, ML impact

---
## ━━━ SOLUTION: Deliverable 3 ━━━

We will build a systematic anomaly scanner that checks for three categories of issues: missing data, extreme returns, and zero-volume days. The function below takes in prices, simple returns, and the raw download (for volumes) and produces a structured report. This is the kind of function that, once written, you run on every new dataset for the rest of the course, and the rest of your career.

In [None]:
def scan_data_quality(prices, simple_returns, raw_data):
    """Scan a stock universe for common data quality issues.

    Returns a dict of DataFrames/Series summarizing each issue type.
    """
    report = {}

    # --- Issue 1: Missing data ---
    missing = prices.isna().sum().sort_values(ascending=False)
    report['missing'] = missing[missing > 0]

    # --- Issue 2: Extreme returns ---
    extremes = []
    for col in simple_returns.columns:
        s = simple_returns[col].dropna()
        big = s[s.abs() > 0.15]
        for dt, val in big.items():
            extremes.append({'ticker': col, 'date': dt, 'return': val})
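    # Name the columns explicitly so an empty result can still be sorted.
    report['extreme_returns'] = pd.DataFrame(
        extremes, columns=['ticker', 'date', 'return'])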

    # --- Issue 3: Zero / low volume ---
    if isinstance(raw_data.columns, pd.MultiIndex):
        vols = raw_data['Volume']
    else:
        vols = raw_data[['Volume']]
    zero_vol = (vols == 0).sum().sort_values(ascending=False)
    report['zero_volume'] = zero_vol[zero_vol > 0]

    return report

The scanner is deliberately simple — no fancy logic, just honest counting. The goal is not to build a machine learning model for anomaly detection; the goal is to *look at your data* before you build a machine learning model on it. Let us run it and see what falls out.

In [None]:
quality = scan_data_quality(prices, simple_returns, raw_data)

print("ISSUE 1: Missing Data (NaN)")
print(f"Tickers with NaN: {len(quality['missing'])} of {prices.shape[1]}")
if len(quality['missing']) > 0:
    for tkr, cnt in quality['missing'].head(10).items():
        pct = cnt / len(prices) * 100
        first = prices[tkr].first_valid_index()
        fv = first.strftime('%Y-%m-%d') if first else 'N/A'
        print(f"  {tkr:6s} {cnt:>5} NaN ({pct:>5.1f}%)  first valid: {fv}")

Missing data is the most common issue, and the most underestimated. A stock with NaN values in its early history is not "slightly incomplete"; it is a stock that did not exist during that period. Filling those NaN values (back-filling from the first traded price, for example) would fabricate zero-return days that never happened, which biases your volatility estimates downward and distorts every risk-adjusted metric built on them. The correct policy is to forward-fill only short gaps (trading halts of a few days) and leave genuine "not yet listed" periods as NaN. We will implement exactly this policy in the DataLoader.

Now let us check for extreme returns — single-day moves larger than 15%. Some of these are real events (the COVID crash, earnings-driven jumps), and some may be data artifacts (unadjusted splits or dividend dates that yfinance handled incorrectly).

In [None]:
ext = quality['extreme_returns'].sort_values('return')
n_ext_tickers = ext['ticker'].nunique() if len(ext) > 0 else 0

print(f"ISSUE 2: Extreme Returns (> 15% single-day)")
print(f"Total events: {len(ext)}, Tickers affected: {n_ext_tickers}")

if len(ext) > 0:
    print(f"\nLargest drops:")
    for _, row in ext.head(8).iterrows():
        print(f"  {row['ticker']:6s} {row['date']:%Y-%m-%d}  {row['return']:+.1%}")
    print(f"\nLargest gains:")
    for _, row in ext.tail(8).iterrows():
        print(f"  {row['ticker']:6s} {row['date']:%Y-%m-%d}  {row['return']:+.1%}")

Look at those extreme returns and notice the dates. You will likely see a cluster around March 2020 (COVID crash), several around individual earnings announcements, and possibly one or two that look suspicious — a 30%+ move on an otherwise quiet date, which might be a split that yfinance mishandled. The ML impact is severe: a single -40% day can dominate your loss function during training, pulling model parameters toward accommodating that one outlier at the expense of the other 3,500 days. Whether you winsorize, downweight, or keep these extremes is a modeling decision, but it must be a *conscious* decision, not something that happens silently.
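
To put a number on "dominate your loss function", the sketch below injects one -40% day into 3,500 synthetic daily returns and measures its share of the total squared return before and after winsorizing. The `winsorize` helper and its quantile cutoffs are illustrative choices, not a recommendation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic daily returns (1% volatility) with one injected -40% day.
ret = pd.Series(rng.normal(0.0, 0.01, size=3500))
ret.iloc[1000] = -0.40

def winsorize(returns, lower=0.005, upper=0.995):
    """Clip returns to the given lower and upper quantiles."""
    lo, hi = returns.quantile([lower, upper])
    return returns.clip(lower=lo, upper=hi)

clipped = winsorize(ret)

# Share of the total squared return contributed by the single outlier.
share_before = ret.iloc[1000] ** 2 / (ret ** 2).sum()
share_after = clipped.iloc[1000] ** 2 / (clipped ** 2).sum()

print(f"Outlier share of squared returns, raw:        {share_before:.1%}")
print(f"Outlier share of squared returns, winsorized: {share_after:.1%}")
```

Under a squared-error objective, one observation out of 3,500 carries a large fraction of the total in the raw series and a negligible one after clipping. Winsorizing also throws away real information about tail risk, which is why it should be a deliberate choice.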

Finally, the zero-volume check. Zero volume means either the stock was halted (real, but the closing price is stale) or the data is missing/incorrect.

In [None]:
zv = quality['zero_volume']
print(f"ISSUE 3: Zero-Volume Days")
print(f"Tickers with zero-volume days: {len(zv)} of {prices.shape[1]}")

if len(zv) > 0:
    for tkr, cnt in zv.head(10).items():
        print(f"  {tkr:6s} {cnt:>3} zero-volume days")
else:
    print("  No zero-volume days found in this sample.")
    print("  (Check for low-volume days — < 1% of median — instead.)")
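
One check from the task list that the scanner above leaves out is unusual date gaps. A minimal sketch follows, assuming a `DatetimeIndex` of trading days; `find_date_gaps` is a hypothetical helper, and the index below is a synthetic business-day calendar with one week deleted.

```python
import pandas as pd

def find_date_gaps(index, max_calendar_days=4):
    """Return gaps between consecutive dates longer than max_calendar_days.

    The result is indexed by the date on which trading resumes; values
    are the gap length in calendar days.
    """
    dates = pd.Series(index, index=index)
    gap_days = dates.diff().dt.days
    return gap_days[gap_days > max_calendar_days]

# Synthetic calendar: January 2023 business days with one week removed.
idx = pd.bdate_range('2023-01-02', '2023-01-31')
idx = idx[(idx < '2023-01-09') | (idx > '2023-01-13')]

gaps = find_date_gaps(idx)
print(gaps)
```

Ordinary weekends (3 days) and long weekends with a Monday holiday (4 days) pass the default threshold, so only the deleted week is reported. On real data you would run this on `prices.index` and, per ticker, on the dates where that ticker has valid prices.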

Here is what this scan reveals about the 200-stock universe, and it is a finding that recalibrates your expectations about financial data quality. You will typically find that at least 10-20 stocks (5-10% of the universe) have at least one category of obvious issue: missing data, extreme returns that warrant investigation, or volume anomalies. These are not exotic edge cases lurking in micro-cap penny stocks — these are S&P 500 constituents, the most liquid and well-covered stocks in the US market. If the "best" data looks like this, imagine what emerging market small-cap data looks like.

The lesson for your ML pipeline is simple: never assume your data is clean. Always scan first. The 10 minutes you spend running a quality check like this will save you days of debugging mysterious model behavior downstream.

## Deliverable 4: Build the DataLoader Class

This is the main event. Build a reusable `DataLoader` class that accepts a list of tickers and a date range, downloads and caches the data, computes returns (simple and log), handles missing data with a documented policy, constructs dollar bars, and flags anomalies. This class will be your data foundation for the rest of the course — every model you build in Weeks 2-18 starts here.

Design it for reuse, not just for this homework. A good DataLoader has sensible defaults that work out of the box, clear method names that say what they do, and docstrings that document the policy choices (why we forward-fill up to 5 days, why we flag but do not remove extreme returns). Production code is not clever code — it is code that your future self can read in three months and understand immediately.

In [None]:
# YOUR CODE HERE
#
# Suggested structure:
#   __init__: accept tickers, date range, config parameters
#   download(): fetch data from yfinance, store internally
#   _handle_missing(): apply documented missing data policy
#   get_returns(method): compute simple or log returns
#   get_dollar_bars(ticker, threshold): construct dollar bars
#   flag_anomalies(): scan for quality issues
#   quality_report(): print human-readable summary

---
## ━━━ SOLUTION: Deliverable 4 ━━━

We will build the DataLoader in six pieces, each with a clear purpose. The `__init__` method sets up configuration and internal state. The key design decision here is that downloading is a separate step from initialization — you create the object, configure it, and *then* call `.download()`. This separation lets you change parameters without re-downloading, and it makes the class testable without hitting the network.

In [None]:
class DataLoader:
    """Clean financial data pipeline for ML applications.

    Data Policy:
      - Prices: split/dividend adjusted (auto_adjust=True)
      - Missing: forward-fill up to max_ffill_days, then NaN
      - Survivorship bias: KNOWN, NOT corrected (needs CRSP)
      - Anomalies: flagged, not removed (user decides)

    Usage:
        loader = DataLoader(['AAPL', 'MSFT'], '2010-01-01', '2024-01-01')
        loader.download()
        returns = loader.get_returns('log')
        bars = loader.get_dollar_bars('AAPL')
    """

    def __init__(self, tickers, start='2010-01-01',
                 end='2024-01-01', max_ffill_days=5):
        self.tickers = list(tickers)
        self.start = start
        self.end = end
        self.max_ffill_days = max_ffill_days

        self._raw = None
        self._prices = None
        self._volumes = None
        self._ohlcv = {}
        self._flags = {}
        self._downloaded = False

The download method handles the yfinance API call and the MultiIndex extraction that trips up almost everyone the first time they use `yf.download()` with multiple tickers. We also store per-ticker OHLCV DataFrames in a dictionary for bar construction later — you cannot build dollar bars from just the Close column; you need the full OHLCV.

After downloading, we immediately apply the missing data policy. This is a deliberate choice: the DataLoader should never return un-cleaned data. If you want the raw version, download it yourself with `yf.download()` directly.

In [None]:
    def download(self):
        """Download adjusted OHLCV data from yfinance."""
        self._raw = yf.download(
            self.tickers, start=self.start,
            end=self.end, auto_adjust=True, threads=True,
        )
        if isinstance(self._raw.columns, pd.MultiIndex):
            self._prices = self._raw['Close'].copy()
            self._volumes = self._raw['Volume'].copy()
        else:
            self._prices = self._raw[['Close']].copy()
            self._prices.columns = self.tickers
            self._volumes = self._raw[['Volume']].copy()
            self._volumes.columns = self.tickers

        self._store_ohlcv()
        self._handle_missing()
        self._downloaded = True
        return self

DataLoader.download = download
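
The deliverable also asks for caching, which the `download` method above does not yet do. One possible sketch, assuming a pickle file on disk keyed by a hash of the request: `cache_path_for`, `load_or_fetch`, and the `fetch_fn` argument are hypothetical names, and the fake fetcher stands in for the yfinance call so the example runs offline.

```python
import hashlib
import os
import tempfile

import pandas as pd

def cache_path_for(tickers, start, end, cache_dir):
    """Build a cache file path that is stable for a given request."""
    key = '|'.join(sorted(tickers)) + f'|{start}|{end}'
    digest = hashlib.sha256(key.encode()).hexdigest()[:12]
    return os.path.join(cache_dir, f'prices_{digest}.pkl')

def load_or_fetch(tickers, start, end, fetch_fn, cache_dir):
    """Return cached data if present; otherwise fetch and store it."""
    os.makedirs(cache_dir, exist_ok=True)
    path = cache_path_for(tickers, start, end, cache_dir)
    if os.path.exists(path):
        return pd.read_pickle(path)
    data = fetch_fn(tickers, start, end)
    data.to_pickle(path)
    return data

# Fake fetcher standing in for the network call (assumption for the demo).
calls = []
def fake_fetch(tickers, start, end):
    calls.append((tuple(tickers), start, end))
    return pd.DataFrame({t: [1.0, 2.0] for t in tickers})

cache_dir = tempfile.mkdtemp()
first = load_or_fetch(['AAPL', 'MSFT'], '2010-01-01', '2024-01-01',
                      fake_fetch, cache_dir)
second = load_or_fetch(['MSFT', 'AAPL'], '2010-01-01', '2024-01-01',
                       fake_fetch, cache_dir)
print(f"Fetch calls made: {len(calls)}")
```

Sorting the tickers before hashing means the same universe in a different order hits the same cache file. A real implementation would also need a way to invalidate stale files, since adjusted prices change whenever a new split or dividend occurs.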

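The missing data handler implements a conservative policy: forward-fill short gaps (up to 5 trading days, which covers typical halts and holiday-adjacent quirks) and never extend a fill further than that. The rationale is that a 1-2 day gap in an otherwise active stock is likely a halt or data glitch, and the last known price is a reasonable proxy. A 200-day stretch of NaN at the start of a series means the stock was not listed yet; forward-fill has no earlier price to copy there, so those rows stay NaN, and fabricating prices for them would be dishonest. One caveat: `ffill(limit=...)` caps the number of consecutive fills rather than testing the gap length, so a longer interior gap still has its first 5 days filled before the NaN resumes.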

We also track what we did — how many NaN values existed before cleaning, how many we filled, and how many remain. This metadata feeds into the quality report.

In [None]:
    def _handle_missing(self):
        """Forward-fill short gaps; leave long gaps as NaN."""
        before = int(self._prices.isna().sum().sum())
        self._prices = self._prices.ffill(limit=self.max_ffill_days)
        self._volumes = self._volumes.ffill(limit=self.max_ffill_days)
        self._volumes = self._volumes.fillna(0)
        after = int(self._prices.isna().sum().sum())
        self._flags['missing_data'] = {
            'nan_before': before, 'filled': before - after,
            'nan_after': after,
        }

DataLoader._handle_missing = _handle_missing
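
If you want long gaps left entirely untouched, rather than having their first few days filled, one option is to measure each NaN run first and fill only the short ones. The helper `ffill_short_gaps` below is a hypothetical alternative, not part of the DataLoader above, shown on a toy series.

```python
import numpy as np
import pandas as pd

def ffill_short_gaps(s, max_gap=5):
    """Forward-fill only NaN runs of length <= max_gap.

    Longer runs, and leading NaN with no prior value, are left as NaN.
    """
    is_na = s.isna()
    # Label each run of consecutive equal values in the NaN mask.
    run_id = (is_na != is_na.shift()).cumsum()
    # Length of the NaN run each row belongs to (0 for non-NaN rows).
    run_len = is_na.groupby(run_id).transform('sum')
    short = is_na & (run_len <= max_gap)
    return s.where(~short, s.ffill())

# Toy series: 2 leading NaN, a 1-day gap, then a 7-day gap.
s = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0] + [np.nan] * 7 + [3.0])

strict = ffill_short_gaps(s, max_gap=5)
capped = s.ffill(limit=5)

print(pd.DataFrame({'raw': s, 'ffill(limit=5)': capped, 'short gaps only': strict}))
```

Both versions fill the 1-day gap and leave the leading NaN alone. They differ on the 7-day gap: `ffill(limit=5)` fills five of the seven days, while the run-length version leaves all seven as NaN.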

We also need a helper that extracts per-ticker OHLCV DataFrames for bar construction. This runs during download and pulls each ticker's full OHLCV from the raw MultiIndex DataFrame into a clean, single-level DataFrame that the dollar bar method can consume directly. It silently skips any tickers that failed to download.

In [None]:
    def _store_ohlcv(self):
        """Store per-ticker OHLCV for bar construction."""
        cols = ['Open', 'High', 'Low', 'Close', 'Volume']
        for tkr in self.tickers:
            try:
                if isinstance(self._raw.columns, pd.MultiIndex):
                    df = pd.DataFrame(
                        {c: self._raw[c][tkr] for c in cols}
                    ).dropna()
                else:
                    df = self._raw[cols].dropna()
            except (KeyError, TypeError):
                continue  # ticker absent from the download
            # A failed ticker may also come back as an all-NaN column,
            # which leaves an empty frame after dropna(); skip it too.
            if not df.empty:
                self._ohlcv[tkr] = df

DataLoader._store_ohlcv = _store_ohlcv
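
As an illustration of the extraction step, the sketch below builds a small synthetic frame with `(field, ticker)` MultiIndex columns, which is the layout the method above assumes when it indexes `self._raw[c][tkr]`, and pulls out one single-level OHLCV frame per ticker. The tickers `AAA` and `BBB` and all values are made up.

```python
import numpy as np
import pandas as pd

fields = ["Open", "High", "Low", "Close", "Volume"]
tickers = ["AAA", "BBB"]  # hypothetical symbols
idx = pd.bdate_range("2024-01-01", periods=4)

# Synthetic stand-in for a multi-ticker download: columns are (field, ticker).
cols = pd.MultiIndex.from_product([fields, tickers])
raw = pd.DataFrame(
    np.arange(40, dtype=float).reshape(4, 10), index=idx, columns=cols
)
raw.loc[idx[0], ("Close", "BBB")] = np.nan  # BBB has no close on the first day

# Take a cross-section on the ticker level, keep the field order fixed,
# and drop any row with a missing field.
per_ticker = {
    t: raw.xs(t, axis=1, level=1)[fields].dropna()
    for t in tickers
}

print(per_ticker["AAA"].shape, per_ticker["BBB"].shape)
```

Using `xs` on the ticker level is equivalent to the dictionary comprehension in the method; either way, a ticker that is absent from the columns raises `KeyError`, which is what the `except` clause above swallows.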

The returns method offers both log and simple returns through a single interface. We default to log returns because they are additive over time, which makes multi-period calculations correct by construction. Simple returns are available for portfolio-level analysis where you need cross-sectional additivity. The docstring makes this trade-off explicit so that future users (including your future self) know which to use when.

In [None]:
    def get_returns(self, method='log'):
        """Compute returns for the universe.

        Args:
            method: 'log' (additive over time) or
                    'simple' (additive over assets).
        Returns:
            DataFrame of returns (first row dropped).
        """
        self._check()
        if method == 'log':
            return np.log(self._prices / self._prices.shift(1)).iloc[1:]
        elif method == 'simple':
            # fill_method=None: do not silently pad NaN gaps before differencing
            return self._prices.pct_change(fill_method=None).iloc[1:]
        raise ValueError(f"Unknown method '{method}'")

    @property
    def prices(self):
        """Cleaned closing prices."""
        self._check()
        return self._prices.copy()

DataLoader.get_returns = get_returns
DataLoader.prices = prices
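
The two kinds of additivity mentioned above can be checked numerically. The following is a small sketch on made-up prices for two hypothetical tickers: log returns sum across time to the log of the total price ratio, while simple returns combine across assets as a weighted average.

```python
import numpy as np
import pandas as pd

# Toy prices: AAA falls 12% then rises 10%; BBB rises 10% then falls 10%.
prices = pd.DataFrame(
    {"AAA": [100.0, 88.0, 96.8], "BBB": [50.0, 55.0, 49.5]}
)

simple = prices.pct_change(fill_method=None).iloc[1:]
log = np.log(prices / prices.shift(1)).iloc[1:]

# Time additivity: summed log returns equal the log of end price / start price.
total_log = log.sum()
direct_log = np.log(prices.iloc[-1] / prices.iloc[0])

# Cross-sectional additivity: with equal weights at the start of the period,
# the portfolio simple return is the weighted average of simple returns.
weights = pd.Series({"AAA": 0.5, "BBB": 0.5})
port_simple = (simple * weights).sum(axis=1)
port_from_values = 0.5 * (88.0 / 100.0) + 0.5 * (55.0 / 50.0) - 1.0

# The same weighted average applied to log returns does NOT give the
# portfolio's log return.
avg_log = (log * weights).sum(axis=1)

# Size of the simple-vs-log gap on the -12% move.
gap = simple["AAA"].iloc[0] - log["AAA"].iloc[0]
print(f"simple={simple['AAA'].iloc[0]:.4f} log={log['AAA'].iloc[0]:.4f} gap={gap:.4f}")
```

On the -12% move, the log return is about -12.8%, which is the near one-point gap that shows up on crisis days.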

The dollar bar method reuses the same logic we built in Deliverable 2, now wrapped inside the class. If no threshold is provided, it defaults to the median daily dollar volume for that ticker — a sensible starting point that produces roughly one bar per average trading day. Having this as a method on the DataLoader means you can generate bars for any ticker in your universe with a single call, without re-downloading or re-processing anything.

In [None]:
    def get_dollar_bars(self, ticker, threshold=None):
        """Construct dollar bars for a single ticker.

        Args:
            ticker: Stock ticker symbol.
            threshold: Dollar volume per bar. Defaults to
                       median daily dollar volume.
        Returns:
            DataFrame of OHLCV dollar bars.
        """
        self._check()
        df = self._ohlcv.get(ticker)
        if df is None:
            raise ValueError(f"No OHLCV for '{ticker}'")

        dvol = df['Close'] * df['Volume']
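        if threshold is None:
            threshold = dvol.median()
        # Guard against an empty frame (NaN median) or a non-positive threshold.
        if not threshold > 0:
            raise ValueError("threshold must be a positive number")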

        bar_ids = (dvol.cumsum() // threshold).astype(int)
        return df.groupby(bar_ids).agg({
            'Open': 'first', 'High': 'max',
            'Low': 'min', 'Close': 'last', 'Volume': 'sum',
        })

DataLoader.get_dollar_bars = get_dollar_bars
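
To see what the grouping does, here is a minimal sketch on six days of made-up OHLCV data with an arbitrary threshold of 2,000 dollars. It uses the same `cumsum() // threshold` rule as the method above, and additionally carries the last timestamp of each bar in a column named `end`, an illustrative addition since the grouped result is otherwise indexed by bar id rather than by date.

```python
import pandas as pd

idx = pd.bdate_range("2024-01-01", periods=6)
df = pd.DataFrame(
    {
        "Open":   [10.0, 10.5, 11.0, 10.8, 11.2, 11.5],
        "High":   [10.6, 11.1, 11.2, 11.3, 11.6, 11.9],
        "Low":    [9.9, 10.4, 10.7, 10.6, 11.0, 11.3],
        "Close":  [10.5, 11.0, 10.8, 11.2, 11.5, 11.8],
        "Volume": [100, 300, 100, 100, 400, 100],
    },
    index=idx,
)

dvol = df["Close"] * df["Volume"]   # daily dollar volume
threshold = 2000.0                  # arbitrary value for the toy data

# Each row gets the id of the threshold "bucket" its cumulative
# dollar volume falls into; rows sharing an id form one bar.
bar_ids = (dvol.cumsum() // threshold).astype(int)

bars = (
    df.assign(end=df.index)
      .groupby(bar_ids)
      .agg({
          "Open": "first", "High": "max", "Low": "min",
          "Close": "last", "Volume": "sum", "end": "last",
      })
)
print(bars)
```

Note that the bar ids are not consecutive: a single heavy day can jump several buckets at once. With daily input a bar can only merge whole days, never split one, so the number of bars is at most the number of trading days.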

The anomaly flagging method runs the same checks we built in Deliverable 3 — extreme returns, zero volume, remaining NaN — but packages the results into a structured dictionary that the quality report can consume. Notice that we *flag* anomalies but do not *remove* them. The decision about how to handle an extreme return (winsorize? downweight? keep?) depends on the downstream model, and that decision should be made by the modeler, not hard-coded into the data pipeline.

In [None]:
    def flag_anomalies(self, ret_threshold=0.15):
        """Flag data quality issues across the universe."""
        self._check()
        rets = self.get_returns('simple')

        extremes = []
        for col in rets.columns:
            s = rets[col].dropna()
            big = s[s.abs() > ret_threshold]
            for dt, v in big.items():
                extremes.append({'ticker': col, 'date': dt, 'ret': v})
        self._flags['extreme_returns'] = extremes

        zv = (self._volumes == 0).sum()
        self._flags['zero_volume'] = zv[zv > 0].to_dict()

        nan_left = self._prices.isna().sum()
        self._flags['remaining_nan'] = nan_left[nan_left > 0].to_dict()
        return self._flags

    def _check(self):
        if not self._downloaded:
            raise RuntimeError("Call .download() first.")

    def __repr__(self):
        st = 'loaded' if self._downloaded else 'not loaded'
        return f"DataLoader({len(self.tickers)} tickers, {st})"

DataLoader.flag_anomalies = flag_anomalies
DataLoader._check = _check
DataLoader.__repr__ = __repr__

Now let us test the complete DataLoader. We will run it on the full 200-stock universe and verify that every component works as expected — download, cleaning, returns, bars, and anomaly flagging. This is the "integration test" that proves the pieces fit together.

In [None]:
loader = DataLoader(SP500_SAMPLE, '2010-01-01', '2024-01-01')
loader.download()

log_ret = loader.get_returns('log')
flags = loader.flag_anomalies()

print(f"{loader}")
print(f"Prices shape: {loader.prices.shape}")
print(f"Returns shape: {log_ret.shape}")
print(f"Extreme return events: {len(flags['extreme_returns'])}")
print(f"Tickers with zero-volume: {len(flags['zero_volume'])}")

The DataLoader works end-to-end. But here is the insight that justifies building a class rather than a collection of loose functions: at this scale, you start to see patterns that are invisible at smaller scales. When you flag anomalies across 200 stocks, you discover that the issues are *correlated*. The extreme return events cluster around the same dates — March 2020, February 2018, August 2015 — because market-wide shocks hit everything at once. The missing data clusters around IPO dates and corporate restructurings. These correlations matter for your model: if your training set happens to include one crisis period and your test set includes another, the non-stationarity of crisis dynamics will dominate your out-of-sample performance. The DataLoader gives you the tools to see this; the modeling decisions remain yours.
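
One way to check the clustering claim on your own run is to tabulate the flagged events by date and by month. The sketch below uses a handful of hypothetical records in the same shape as the `extreme_returns` list produced by `flag_anomalies()` (keys `ticker`, `date`, `ret`); the tickers and values are invented.

```python
import pandas as pd

# Hypothetical records; in practice use flags['extreme_returns'].
extremes = [
    {"ticker": "AAA", "date": pd.Timestamp("2020-03-12"), "ret": -0.18},
    {"ticker": "BBB", "date": pd.Timestamp("2020-03-12"), "ret": -0.21},
    {"ticker": "CCC", "date": pd.Timestamp("2020-03-16"), "ret": -0.25},
    {"ticker": "AAA", "date": pd.Timestamp("2020-03-24"), "ret": 0.17},
    {"ticker": "DDD", "date": pd.Timestamp("2018-02-05"), "ret": -0.16},
]
events = pd.DataFrame(extremes)

# How many distinct tickers were flagged on each date, busiest first.
by_date = (
    events.groupby("date")["ticker"].nunique().sort_values(ascending=False)
)

# Event counts per calendar month, and the share taken by the busiest month.
by_month = events.groupby(events["date"].dt.to_period("M")).size()
top_share = by_month.max() / len(events)

print(by_date.head())
print(by_month)
print(f"Share of events in busiest month: {top_share:.0%}")
```

If events were spread evenly over a 14-year sample, no single month would hold more than a small fraction of them; a large `top_share` is the signature of market-wide shocks.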

## Deliverable 5: Data Quality Report

Write a structured 1-page markdown report summarizing: what biases exist in this dataset, what you fixed, what remains unfixable with free data, and what recommendations you would make for downstream ML models. This is practice for the kind of documentation that separates a trustworthy backtest from a misleading one. Be honest — the point is not to pretend your data is clean; it is to know *exactly* how it is dirty.

Use the template below. Fill in the blanks based on what *your* analysis actually found — the specific numbers, the specific tickers, the specific dates. Do not copy generic statements about survivorship bias; reference *your* data.

### Data Quality Report Template

---

**Dataset Overview**
- Universe: ____ stocks from the current S&P 500
- Time period: __________ to __________
- Source: Yahoo Finance via `yfinance` (free tier)
- Price adjustment: split- and dividend-adjusted (`auto_adjust=True`)

---

**Known Biases (Unfixable with Free Data)**

1. *Survivorship Bias*: We used today's S&P 500 list applied to ____. Companies removed between ____ and ____ are absent. Estimated return overstatement: ____% per year (cite source or your own comparison). Specific examples of missing companies: ____________.

2. *Look-Ahead in Universe Selection*: We selected stocks *because* they are in the index today, then measured their historical returns. This is circular. A proper methodology would use ______________.

3. *No Point-in-Time Fundamental Data*: Yahoo Finance reports fundamentals as of ____________, not as of the original release date. Impact: ____________.

---

**Issues Found and Addressed**

1. *Missing Data*: ____ tickers had NaN values. Top offenders: ____________. Fix applied: ____________. Residual risk: ____________.

2. *Extreme Returns*: ____ events across ____ tickers exceeded 15% in a single day. Most common dates: ____________. Investigation: ____ were genuine market events, ____ are suspected data artifacts because ____________. Fix applied: ____________.

3. *Volume Anomalies*: ____ tickers had zero-volume or suspiciously low-volume days. Examples: ____________. ML impact: ____________. Fix applied: ____________.

4. *(Additional issue you found)*: ____________.

---

**Distributional Properties**
- Excess kurtosis: median = ____, range = [____, ____]. Notable pattern: ____________.
- Skewness: ____% of stocks have negative skewness. Interpretation: ____________.
- Simple vs. log return divergence: largest on ____ (date) for ____ (ticker), where the difference was ____ percentage points.

---

**Recommendations for Downstream ML**
1. ____________
2. ____________
3. ____________
4. ____________
5. ____________

---
## ━━━ SOLUTION: Deliverable 5 ━━━

Below is a sample completed report. Your version should reference the specific numbers from *your* run — the exact ticker counts, the exact kurtosis values, the exact dates. The template above is what you fill in; this sample shows the level of specificity and honesty expected.

### Sample Completed Report

---

**Dataset Overview**
- Universe: 200 stocks from the current S&P 500 (20 per sector group)
- Time period: 2010-01-01 to 2024-01-01 (~3,500 trading days)
- Source: Yahoo Finance via `yfinance` (free tier, known limitations)
- Price adjustment: split- and dividend-adjusted

---

**Known Biases (Unfixable with Free Data)**

1. *Survivorship Bias*: We used today's S&P 500 list applied to 2010. Roughly 150 of the 500 companies in the index in 2010 are no longer in it today, removed due to decline, bankruptcy, or acquisition. These companies are absent from our dataset. The survivorship bias premium is estimated at 2-4% annualized return overstatement (a rough figure; Elton, Gruber & Blake 1996 found ~0.9%/year for mutual funds, and the effect is generally taken to be larger for individual stocks). GE, the most iconic example of index churn, was dropped from the Dow Jones Industrial Average in 2018 after more than a century as a member. It never left the S&P 500, which is the only reason it is in our sample; the dozens of companies that were removed from the S&P 500 permanently are ghosts.

2. *Look-Ahead in Universe Selection*: We picked winners and then measured their returns. A proper study would reconstruct the S&P 500 constituent list at each quarterly rebalancing date and use the point-in-time membership, which requires a paid data source.

3. *No Point-in-Time Fundamentals*: Not used in this homework, but worth flagging for future weeks. Any model using earnings or revenue data from yfinance for historical predictions would leak future information.

---

**Issues Found and Addressed**

1. *Missing Data*: Approximately 15-25 tickers had NaN values in their early history (exact count varies by run). Top offenders are typically post-2010 IPOs and corporate restructurings. Fix: forward-fill gaps up to 5 trading days, leave longer gaps as NaN. Residual risk: forward-filled days produce zero returns, which slightly understates volatility for those periods.

2. *Extreme Returns*: Typically 30-80 events across 15-30 tickers. The largest cluster is around March 12-23, 2020 (COVID crash), with individual stocks dropping 15-40% in single days. Most are genuine market events. A small number (1-3) may be data artifacts — investigate any >30% move on an otherwise unremarkable date. Fix: flagged but not removed. Recommendation: winsorize at 0.5th/99.5th percentile for initial training.

3. *Volume Anomalies*: A handful of tickers may show zero-volume days (trading halts) or suspiciously low volume. Fix: flagged; low-volume returns should be down-weighted in model training.

---

**Distributional Properties**
- Excess kurtosis: median roughly 5-10, range from ~2 (utilities) to 30+ (volatile tech/energy). The distribution of kurtosis across stocks is wide and right-skewed, with a possible second mode above 15 driven by high-volatility names.
- Skewness: approximately 60-70% of stocks have negative skewness, consistent with the leverage effect (crashes are sharper than rallies).
- Simple vs. log returns diverge most on extreme days. On March 16, 2020, the S&P 500 index fell approximately 12.0% as a simple return, which is -12.8% as a log return: a 0.8 percentage point difference that matters for risk calculations.

---

**Recommendations for Downstream ML**
1. Use log returns for all time-series features. Use simple returns only for cross-sectional / portfolio analysis.
2. Prefer dollar bars over time bars when building features from OHLCV data.
3. Never assume Gaussian distributions. Use robust loss functions (Huber) and robust estimators (MAD instead of std).
4. Document the survivorship bias in every result. Report performance both as measured and after subtracting the estimated 2-4% per year bias, so readers see the plausible range.
5. Validate on held-out *stocks* (not just held-out time periods) to partially mitigate universe selection bias.
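
Two of the recommendations above, winsorizing at the 0.5th/99.5th percentiles and using MAD in place of the standard deviation, can be sketched in a few lines. The returns here are synthetic heavy-tailed draws (Student-t with 3 degrees of freedom), chosen only to make the effect visible; the factor 1.4826 is the usual constant that makes MAD comparable to a standard deviation under a normal distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heavy-tailed daily returns, for illustration only.
rets = pd.Series(rng.standard_t(df=3, size=5000) * 0.01)

# Winsorize: clip values beyond the 0.5th and 99.5th percentiles
# to those percentiles rather than dropping the observations.
lo, hi = rets.quantile([0.005, 0.995])
wins = rets.clip(lower=lo, upper=hi)

# Robust scale estimate: median absolute deviation, rescaled so it
# matches the standard deviation when the data are Gaussian.
med = rets.median()
mad_sigma = 1.4826 * (rets - med).abs().median()

print(f"std (raw):        {rets.std():.4f}")
print(f"std (winsorized): {wins.std():.4f}")
print(f"MAD-based sigma:  {mad_sigma:.4f}")
```

In a real pipeline the percentile bounds should be estimated on the training window only and then applied to later data, otherwise the clipping itself leaks future information.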

Here is the final aha moment for this homework, and it is the one worth sitting with. The survivorship bias in this dataset is *structurally unfixable* with free data. You can document it — and you should. You can estimate it — roughly 2-4% per year of overstated returns. You can acknowledge it in every result you present. But you cannot download data for companies that no longer exist. Enron, Lehman Brothers, WorldCom, Bear Stearns, Washington Mutual — they are invisible in yfinance. Your model has been trained in a world where large companies always survive to the end of the sample.

This is why professional quants pay for survivorship-bias-free databases like CRSP (thousands of dollars per year for academic access, more for commercial). It is also why the first question any serious reviewer asks about a backtest is: "Is your universe free of survivorship bias?" If the answer is no, everything that follows is suspect. You now know enough to ask that question, and to answer it honestly.

---

## Summary of Discoveries

Here is what this data quality audit revealed — not concepts restated from the lecture, but findings that only emerged from running 200 stocks through a systematic pipeline:

1. **The kurtosis distribution across 200 stocks is wide and right-skewed**, with values ranging from roughly 2-3 (utilities) to 30+ (volatile tech and energy). A one-size-fits-all distributional assumption is not just wrong — it is wrong by different amounts for different stocks, which means per-stock or per-sector distributional parameters are necessary.

2. **Dollar bars reduce kurtosis and JB statistics relative to time bars**, but the improvement with daily data is modest compared to what Lopez de Prado shows for intraday data. The principle is correct; the magnitude scales with how much volume variation exists within your sampling frequency.

3. **Roughly 10-20% of the stocks in the 200-stock sample have at least one data quality issue**: missing data, extreme returns, or volume anomalies. These are not edge cases in obscure stocks; they are the norm in supposedly "clean" large-cap data.

4. **Extreme return events cluster in time**, not uniformly across the calendar. The same dates (March 2020, February 2018, August 2015) produce extreme returns across dozens of stocks simultaneously, which means your model's handling of crisis periods will dominate its overall performance.

5. **Simple and log returns diverge most exactly when it matters most** — during crisis days with large moves. On a typical 0.5% day, the difference is negligible. On a -12% day, the difference is nearly a full percentage point, which matters for risk calculations and position sizing.

6. **The survivorship bias premium is real and unfixable with free data.** Every stock in this dataset survived to 2024 by construction. The roughly 150 companies that were in the S&P 500 in 2010 but have since left it are invisible, and their absence overstates returns by an estimated 2-4% per year.

7. **Skewness is negative for the majority of stocks** (roughly 60-70% in the sample report), consistent with the leverage effect at scale: crashes tend to be sharper and faster than rallies across most sectors and market caps.
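
The cross-sectional statistics behind items 1 and 7 take only a few lines once you have a returns DataFrame with one column per stock. The sketch below uses three synthetic columns (one Gaussian, two Student-t) as stand-ins for real tickers; on your own data, replace `rets` with the output of `loader.get_returns('log')`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_days = 2000

# Synthetic stand-ins for per-stock daily returns (illustration only).
rets = pd.DataFrame({
    "GAUSS": rng.normal(0.0, 0.01, n_days),
    "T5": rng.standard_t(5, n_days) * 0.01,
    "T3": rng.standard_t(3, n_days) * 0.01,
})

# pandas reports *excess* kurtosis, so a Gaussian column is near 0.
kurt = rets.kurt()
skew = rets.skew()

# Summarize the distribution of the statistic across stocks.
kurt_summary = kurt.describe()
pct_neg_skew = (skew < 0).mean() * 100

print(kurt.round(2))
print(kurt_summary[["min", "50%", "max"]].round(2))
print(f"Stocks with negative skewness: {pct_neg_skew:.0f}%")
```

With 200 real tickers, a histogram of `kurt` is the quickest way to see how wide and right-skewed the cross-section is.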