---
title: "Lab 2: Data Acquisition & APIs"
subtitle: "Build a minimal, reliable pipeline"
format:
  html:
    toc: false
    number-sections: true
execute:
  echo: true
  warning: false
  message: false
---

::: callout-note
### Expected Time
- FIN510: Seminar hands‑on ≈ 60 min; directed learning extensions ≈ 90–120 min
- FIN720: Computer lab ≈ 120 min
:::

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/quinfer/fin510-colab-notebooks/blob/main/labs/lab02_apis.ipynb)

## Setup (Colab‑only installs)

In [None]:
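# Install the lab's dependencies only if they are not already importable
try:
    import yfinance, pandas, pandas_datareader
except ImportError:
    # `!pip` is notebook shell syntax; it works in Colab/Jupyter, not in a plain Python script
    !pip -q install yfinance pandas pandas-datareader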

## Objectives

- Pull assets with yfinance; validate and log
- Handle missing values and out‑of‑range returns

## Task 1 — Download and Validate

In [None]:
import os, time, random
import yfinance as yf
import pandas as pd
import numpy as np

symbols = ['AAPL','MSFT','SPY']

def get_close_from_yf(symbols, period='2y', tries=3):
    last_err = None
    for i in range(tries):
        try:
            df = yf.download(symbols, period=period, auto_adjust=True, progress=False, group_by='ticker', threads=True)
            # yfinance returns MultiIndex cols when multiple symbols
            if isinstance(df.columns, pd.MultiIndex):
                closes = pd.concat({sym: df[sym]['Close'] for sym in symbols if sym in df.columns.levels[0]}, axis=1)
                closes.columns = [c if isinstance(c, str) else c[0] for c in closes.columns]
            else:
                # single symbol
                closes = df['Close'].to_frame(symbols[0])
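            if closes.dropna(how='all').shape[0] > 0:
                return closes
            # An empty frame is also a failure; record it so the final error is informative
            last_err = ValueError("download returned no rows")
        except Exception as e:
            last_err = e
        # Back off before retrying (linear delay plus jitter); no wait after the final attempt
        if i < tries - 1:
            time.sleep(2*(i+1) + random.random())
    raise RuntimeError(f"yfinance download failed after {tries} tries: {last_err}")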

def get_close_from_stooq(symbols, years=2):
    from datetime import datetime, timedelta
    from pandas_datareader import data as web
    end = datetime.today()
    start = end - timedelta(days=365*years + 14)
    series = []
    for sym in symbols:
        try:
            s = web.DataReader(sym, 'stooq', start, end)['Close'].sort_index()
            s.name = sym
            series.append(s)
            time.sleep(0.4)
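        except Exception as e:
            # Skip symbols that fail, but report them instead of failing silently
            print(f"stooq: skipping {sym}: {e!r}")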
    if not series:
        raise RuntimeError("stooq fallback returned no data")
    return pd.concat(series, axis=1)

def synthetic_prices(symbols, periods=252, mu=0.0004, sigma=0.012):
    rng = np.random.default_rng(42)
    dates = pd.bdate_range(end=pd.Timestamp.today().normalize(), periods=periods)
    shocks = rng.normal(mu, sigma, size=(len(dates), len(symbols)))
    levels = 100*np.exp(np.cumsum(shocks, axis=0))
    return pd.DataFrame(levels, index=dates, columns=symbols)

# Try yfinance → stooq → synthetic
try:
    prices = get_close_from_yf(symbols)
    source = 'yfinance'
except Exception as e1:
    try:
        prices = get_close_from_stooq(symbols)
        source = 'stooq (pandas-datareader)'
    except Exception as e2:
        prices = synthetic_prices(symbols)
        source = f'synthetic (fallback due to: {e1!r}; {e2!r})'

log = {}
log['source'] = source
log['missing_prices'] = int(prices.isna().sum().sum())
rets = prices.pct_change()
log['missing_returns'] = int(rets.isna().sum().sum())
log['out_of_range'] = int((rets.abs()>0.2).sum().sum())
print(log)  # a bare `log` mid-cell is not displayed; only a cell's last expression is
if prices.dropna(how='all').shape[0] > 0:
    print(f"Data source: {source}. Download and validation checks completed ✔")
else:
    print("Warning: no data returned from any source. Proceeding with empty frame.")
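
The three checks logged above (missing prices, missing returns, out‑of‑range returns) can be exercised on a small frame with known defects before trusting them on live data. The sketch below is illustrative only: `validate_prices` and the toy prices are hypothetical, not part of the lab's pipeline, and the 20% threshold simply mirrors the one used in Task 1.

```python
import numpy as np
import pandas as pd

def validate_prices(prices, max_abs_ret=0.2):
    """Count basic data-quality issues in a frame of prices (hypothetical helper)."""
    # Simple returns without forward-filling, so gaps stay visible as NaN
    rets = prices / prices.shift(1) - 1
    return {
        'missing_prices': int(prices.isna().sum().sum()),
        'missing_returns': int(rets.isna().sum().sum()),
        'out_of_range': int((rets.abs() > max_abs_ret).sum().sum()),
    }

# Toy data with one missing price and one implausible jump in column A
dates = pd.bdate_range('2024-01-01', periods=6)
toy = pd.DataFrame({
    'A': [100.0, 101.0, np.nan, 103.0, 140.0, 141.0],
    'B': [50.0, 50.5, 51.0, 51.5, 52.0, 52.5],
}, index=dates)

report = validate_prices(toy)
print(report)
```

One missing price produces two missing returns (the gap itself and the following day), in addition to the first row of each column, which never has a return.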

## Task 2 — Clean and Save

In [None]:
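import os

# Drop rows with any missing return, then cap extreme returns at ±20%
clean = rets.dropna().clip(lower=-0.2, upper=0.2)
print(clean.tail())
clean.to_csv('returns_clean.csv')
# Check that the file exists; asserting a non-empty string would always pass
assert os.path.exists('returns_clean.csv'), "returns_clean.csv was not written"
print("Saved returns_clean.csv")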

Deliverable: a short note describing the issues you found (missing values, out‑of‑range returns) and how you handled them.
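
To back the note with numbers, it may help to count what the cleaning step changes before the original values are overwritten. The following is a minimal sketch on made-up returns; `rets_demo` and the other variable names are hypothetical, and the ±20% cap is the same one used in Task 2.

```python
import numpy as np
import pandas as pd

limit = 0.2  # same cap as Task 2

# Toy returns: two breaches in A and one missing value in B (illustrative values)
rets_demo = pd.DataFrame({
    'A': [0.01, -0.35, 0.02, 0.25],
    'B': [0.00, 0.01, np.nan, -0.01],
}, index=pd.bdate_range('2024-01-01', periods=4))

complete = rets_demo.dropna()                        # rows with any NaN are removed
rows_dropped = len(rets_demo) - len(complete)
clipped_per_column = (complete.abs() > limit).sum()  # breaches counted before clipping
clean_demo = complete.clip(lower=-limit, upper=limit)

print("Rows dropped for missing values:", rows_dropped)
print("Values clipped per column:")
print(clipped_per_column)
```

Counting before clipping matters: once values are capped, a breach is indistinguishable from a genuine ±20% return.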

::: callout-tip
### Troubleshooting
- API download empty: try fewer symbols or a shorter period, then rerun the cell.
- Many outliers: check for corporate actions (splits, dividends) and confirm that `auto_adjust=True` is set in `yf.download`.
- CSV not found: check the current working directory with `os.getcwd()` and confirm it is writable in Colab.
:::
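
For the "CSV not found" case, one way to rule out path and permission problems is to write the file, confirm it exists, and read it back. This is a sketch under stated assumptions: `save_and_verify` is a hypothetical helper, and the demo writes to a temporary directory rather than the lab's real output path.

```python
import os
import tempfile
import pandas as pd

def save_and_verify(frame, path):
    """Write a frame to CSV, then read it back to confirm it landed (hypothetical helper)."""
    frame.to_csv(path)
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} was not written")
    back = pd.read_csv(path, index_col=0, parse_dates=True)
    if back.shape != frame.shape:
        raise ValueError(f"shape changed on round trip: {frame.shape} -> {back.shape}")
    return back

# Where do relative paths resolve, and is that location writable?
print("cwd:", os.getcwd(), "| writable:", os.access(os.getcwd(), os.W_OK))

demo = pd.DataFrame(
    {'AAPL': [0.01, -0.02, 0.005], 'MSFT': [0.0, 0.01, -0.01]},
    index=pd.bdate_range('2024-01-01', periods=3),
)
# A temporary directory keeps the demo from touching the real deliverable
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'returns_demo.csv')
    round_trip = save_and_verify(demo, path)

print(round_trip.shape)
```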

::: callout-note
### Further Reading (Hilpisch 2019)
- See: [Hilpisch Code Resources](../resources/hilpisch-code.qmd) — Week 2
- Chapter 13 (ML pipelines) shows end‑to‑end workflows (features → pipeline → evaluation) you can mirror with time‑aware splits.
:::

## Mini‑Task — JKP Sample (Primer for Coursework 2)

This short exercise previews the JKP factor dataset used in Coursework 2. Load a small sample CSV, compute summary statistics, and (optionally) estimate a CAPM alpha with a one‑line regression.

In [None]:
# JKP sample (course mirror) — small monthly slice with MKT, SMB, HML, MOM
import pandas as pd, os
import statsmodels.api as sm

# Prefer local file during site build; fall back to raw GitHub if needed
local_path = os.path.join('..','resources','jkp-sample.csv')
if os.path.exists(local_path):
    jkp = pd.read_csv(local_path, parse_dates=['date']).set_index('date').sort_index()
else:
    url = "https://raw.githubusercontent.com/quinfer/financial-data-science/main/resources/jkp-sample.csv"
    jkp = pd.read_csv(url, parse_dates=['date']).set_index('date').sort_index()

# Summary stats and quick cumulative return for MOM
summary = jkp[['MKT','SMB','HML','MOM']].describe().round(3)
cum = (1 + jkp['MOM']).cumprod() - 1
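# Print explicitly: a tuple left mid-cell is evaluated but never displayed
print(summary.tail(3))   # median, upper quartile, and max of each factor
print(cum.tail())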

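# Optional: CAPM alpha (no HAC here; use HAC in the assessment)
ls = jkp['MOM'].dropna()
mkt = jkp['MKT'].reindex(ls.index)
# missing='drop' skips months where MKT is missing instead of passing NaNs to the fit
capm = sm.OLS(ls, sm.add_constant(mkt), missing='drop').fit()
print("alpha:", float(capm.params['const']), "t-stat:", float(capm.tvalues['const']))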

Notes:

- In the assessment you will use a larger CSV downloaded from the JKP portal and apply HAC standard errors.
- Keep scope tight (few factors, limited window) and focus on quality of evidence.

## Quick Leakage Check (Practice)

In [None]:
# Ensure prediction tasks shift the target correctly
import pandas as pd

# Intentionally wrong design for demonstration: the target is NOT shifted,
# so the feature matrix at month t already contains the value being predicted
X_wrong = jkp[['MKT','SMB','HML','MOM']].dropna()
y_wrong = jkp['MOM'].reindex(X_wrong.index)   # same-month target (no shift)

# Leakage check: count rows where a feature column equals the target exactly
leaky = int((X_wrong['MOM'] == y_wrong).sum())
print("Leaky rows with wrong design:", leaky)

# Correct design: predictors at t, target at t+1 → align and drop NA
X = jkp[['MKT','SMB','HML']].shift(0)
y = jkp['MOM'].shift(-1)
df = pd.concat([X, y.rename('y')], axis=1).dropna()
print("Rows after proper shift/drop:", len(df))
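
Shifting the target is only half of a leakage-free setup: the train/test split should also respect time order. A minimal sketch of a chronological split follows, run on simulated monthly data rather than the JKP sample. `chronological_split` and the 25% test fraction are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, test_frac=0.25):
    """Split a time-indexed frame so every test row is later than every training row."""
    frame = frame.sort_index()
    cut = int(len(frame) * (1 - test_frac))
    return frame.iloc[:cut], frame.iloc[cut:]

# Simulated monthly factors standing in for the JKP sample (values are made up)
rng = np.random.default_rng(1)
idx = pd.date_range('2015-01-01', periods=48, freq='MS')
demo = pd.DataFrame(rng.normal(0, 0.04, size=(48, 2)), index=idx, columns=['MKT', 'MOM'])

# Predictors at t, target at t+1; the final month has no target and is dropped
data = pd.concat([demo[['MKT']], demo['MOM'].shift(-1).rename('y')], axis=1).dropna()
train, test = chronological_split(data)

print(len(train), len(test))
print(train.index.max().date(), "<", test.index.min().date())
```

A random shuffle would mix future months into the training set, which is the same kind of leakage the shift is meant to prevent.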