# Homework Starter — Stage 04: Data Acquisition and Ingestion
Name: 
Date: 

## Objectives
- API ingestion with secrets in `.env`
- Scrape a permitted public table
- Validate and save raw data to `data/raw/`

In [24]:
import os, pathlib, datetime as dt
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

RAW = pathlib.Path('data/raw'); RAW.mkdir(parents=True, exist_ok=True)
load_dotenv(); print('ALPHAVANTAGE_API_KEY loaded?', bool(os.getenv('ALPHAVANTAGE_API_KEY')))

ALPHAVANTAGE_API_KEY loaded? False


## Helpers (use or modify)

In [25]:
def ts():
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

def save_csv(df: pd.DataFrame, prefix: str, **meta):
    mid = '_'.join([f"{k}-{v}" for k,v in meta.items()])
    path = RAW / f"{prefix}_{mid}_{ts()}.csv"
    df.to_csv(path, index=False)
    print('Saved', path)
    return path

def validate(df: pd.DataFrame, required):
    missing = [c for c in required if c not in df.columns]
    return {'missing': missing, 'shape': df.shape, 'na_total': int(df.isna().sum().sum())}

## Part 1 — API Pull (Required)
Choose an endpoint (e.g., Alpha Vantage), or fall back to `yfinance` if no API key is available.

In [26]:
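SYMBOL = 'AAPL'
# The key only sets the `source` label in the saved filename; this cell always pulls from yfinance.
USE_ALPHA = bool(os.getenv('ALPHAVANTAGE_API_KEY'))
import yfinance as yf
# Daily bars for the last three months, reduced to the date and closing price.
df_api = yf.download(SYMBOL, period='3mo', interval='1d').reset_index()[['Date','Close']]
df_api.columns = ['date','adj_close']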

v_api = validate(df_api, ['date','adj_close']); v_api

[*********************100%***********************]  1 of 1 completed


{'missing': [], 'shape': (63, 2), 'na_total': 0}

In [27]:
_ = save_csv(df_api.sort_values('date'), prefix='api', source='alpha' if USE_ALPHA else 'yfinance', symbol=SYMBOL)

Saved data/raw/api_source-yfinance_symbol-AAPL_20250820-174747.csv


## Part 2 — Scrape a Public Table (Required)
Replace `SCRAPE_URL` with a permitted page containing a simple table.

In [28]:
SCRAPE_URL = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'  # TODO: replace with permitted page
headers = {'User-Agent':'AFE-Homework/1.0'}
try:
    resp = requests.get(SCRAPE_URL, headers=headers, timeout=30); resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    rows = [[c.get_text(strip=True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]
    header, *data = [r for r in rows if r]
    df_scrape = pd.DataFrame(data, columns=header)
except Exception as e:
    print('Scrape failed, using inline demo table:', e)
    html = '<table><tr><th>Ticker</th><th>Price</th></tr><tr><td>AAA</td><td>101.2</td></tr></table>'
    soup = BeautifulSoup(html, 'html.parser')
    rows = [[c.get_text(strip=True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]
    header, *data = [r for r in rows if r]
    df_scrape = pd.DataFrame(data, columns=header)

if 'Price' in df_scrape.columns:
    df_scrape['Price'] = pd.to_numeric(df_scrape['Price'], errors='coerce')
v_scrape = validate(df_scrape, list(df_scrape.columns)); v_scrape

{'missing': [], 'shape': (880, 8), 'na_total': 758}

In [29]:
_ = save_csv(df_scrape, prefix='scrape', site='wikipedia', table='sp500')

Saved data/raw/scrape_site-wikipedia_table-sp500_20250820-174748.csv


## Documentation — Stage 04 (Data Acquisition & Ingestion)

**API Source & Params**  
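- Source: `yfinance.download()`. `ALPHAVANTAGE_API_KEY` was not set in this run, and the notebook cell calls `yfinance` regardless of the key; the key only controls the `source` label in the saved filename.  
- Params used: `tickers="AAPL"` (from `SYMBOL`), `period="3mo"`, `interval="1d"`.  
- Note: `yfinance` returned 63 daily bars with no missing values in this run.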
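
The run documented here pulled from `yfinance` only. If a key were available, an Alpha Vantage pull might look like the sketch below. It assumes the `TIME_SERIES_DAILY` function of the `https://www.alphavantage.co/query` endpoint and its `"Time Series (Daily)"` / `"4. close"` JSON keys; the function names `parse_alpha_daily` and `fetch_alpha_daily` and the sample payload are hypothetical and not part of the notebook. Because this endpoint's close is unadjusted, the column is named `close` rather than `adj_close`.

```python
import pandas as pd
import requests

ALPHA_URL = "https://www.alphavantage.co/query"


def parse_alpha_daily(payload: dict) -> pd.DataFrame:
    """Turn a TIME_SERIES_DAILY JSON payload into a tidy (date, close) frame."""
    series = payload.get("Time Series (Daily)")
    if not series:
        # Rate-limit and error responses arrive as JSON without the series key.
        raise ValueError(f"No daily series in response; keys: {sorted(payload)}")
    df = pd.DataFrame(
        {"date": list(series), "close": [bar["4. close"] for bar in series.values()]}
    )
    df["date"] = pd.to_datetime(df["date"])
    df["close"] = pd.to_numeric(df["close"], errors="coerce")
    return df.sort_values("date").reset_index(drop=True)


def fetch_alpha_daily(symbol: str, api_key: str) -> pd.DataFrame:
    """Request daily bars for one symbol (network call, not executed in this sketch)."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    resp = requests.get(ALPHA_URL, params=params, timeout=30)
    resp.raise_for_status()
    return parse_alpha_daily(resp.json())


# Offline check with a hand-written demo payload (not real market data).
sample = {
    "Meta Data": {"2. Symbol": "DEMO"},
    "Time Series (Daily)": {
        "2025-01-03": {"4. close": "11.50"},
        "2025-01-02": {"4. close": "10.25"},
    },
}
df_alpha = parse_alpha_daily(sample)
print(df_alpha)
```

In the notebook this could be wired in as `fetch_alpha_daily(SYMBOL, os.getenv('ALPHAVANTAGE_API_KEY'))` when `USE_ALPHA` is true, so that the `source` label in the filename matches where the data actually came from.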

**API Output (Saved)**  
- Pattern: `data/raw/api_source-yfinance_symbol-AAPL_<YYYYMMDD-HHMMSS>.csv`  
- Example from this run: `data/raw/api_source-yfinance_symbol-AAPL_20250820-174747.csv`

**Scrape Source**  
- URL: `https://en.wikipedia.org/wiki/List_of_S%26P_500_companies`  
- Parse method: BeautifulSoup (`html.parser`) with generic row extraction over `<tr>/<th>/<td>`.  
- Fallback: if request/parse fails, a tiny inline demo table (`AAA`, `101.2`) is used so the pipeline still demonstrates saving/validation.
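
The generic extraction walks every `<tr>` on the page, so pages with several tables contribute rows from all of them. One possible tightening, sketched here under the assumption that the table of interest is the first `<table>` in the document, is to scope the row search to that element and drop rows whose width differs from the header. The helper name `first_table_rows` and the second table in the demo HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def first_table_rows(html: str):
    """Return (header, data_rows) taken from the first <table> only."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> found")
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, *data = [r for r in rows if r]
    # Keep only rows whose width matches the header.
    data = [r for r in data if len(r) == len(header)]
    return header, data


# Two tables on one page; only the first should be returned.
demo_html = """
<table><tr><th>Ticker</th><th>Price</th></tr>
<tr><td>AAA</td><td>101.2</td></tr></table>
<table><tr><th>Date</th><th>Added</th><th>Removed</th></tr>
<tr><td>2025-01-01</td><td>BBB</td><td>CCC</td></tr></table>
"""
header, data = first_table_rows(demo_html)
print(header, data)
```

On a real page, selecting the table by its `id` or caption would be more robust than relying on position, but the right selector depends on the page and is not something the notebook specifies.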

**Scrape Output (Saved)**  
- Pattern (per current `save_csv()` call): `data/raw/scrape_site-wikipedia_table-sp500_<YYYYMMDD-HHMMSS>.csv`  
- Example from this run: `data/raw/scrape_site-wikipedia_table-sp500_20250820-174748.csv`
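
Both output patterns come from the same rule in `save_csv()`: prefix, then `key-value` pairs joined by underscores, then a timestamp. The sketch below reproduces that rule with a fixed timestamp so the result is deterministic, and adds a hypothetical parser (`parse_raw_name`, not in the notebook) that reads the pieces back. The parser assumes that metadata values contain no underscores and that every pair contains a hyphen.

```python
import datetime as dt
import pathlib
import re

RAW = pathlib.Path("data/raw")


def raw_path(prefix: str, stamp: dt.datetime, **meta) -> pathlib.Path:
    """Build prefix_key-value_..._YYYYMMDD-HHMMSS.csv under data/raw/."""
    mid = "_".join(f"{k}-{v}" for k, v in meta.items())
    return RAW / f"{prefix}_{mid}_{stamp:%Y%m%d-%H%M%S}.csv"


# Hypothetical pattern for splitting a saved filename into its parts.
NAME_RE = re.compile(r"^(?P<prefix>[^_]+)_(?P<meta>.+)_(?P<stamp>\d{8}-\d{6})\.csv$")


def parse_raw_name(name: str) -> dict:
    """Recover prefix, timestamp, and metadata pairs from a filename."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    meta = dict(pair.split("-", 1) for pair in m["meta"].split("_"))
    return {"prefix": m["prefix"], "stamp": m["stamp"], **meta}


path = raw_path("api", dt.datetime(2025, 8, 20, 17, 47, 47), source="yfinance", symbol="AAPL")
print(path.as_posix())
print(parse_raw_name(path.name))
```

With the timestamp from this run, the first line printed matches the path reported by the API save cell.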

**Validation Logic**  
- Function: `validate(df, required)` reports:
  - `missing` → any required columns absent,
  - `shape` → `(rows, cols)`,
  - `na_total` → total missing values in the DataFrame.
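- API check: required `date` and `adj_close`; the helper reported no missing columns, shape `(63, 2)`, and `na_total` of 0.  
- Scrape check: required the parsed table's own columns (or the inline demo's on fallback), so `missing` is empty by construction; the helper reported shape `(880, 8)` and `na_total` of 758. The row count exceeds the roughly 500 index constituents, which suggests rows from more than one table on the page were captured.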
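
`validate()` returns a single NA total and does not look at dtypes, which makes it hard to tell which column is the problem. A stricter variant might look like the following sketch. The name `validate_strict`, the `numeric` argument, the `ok` flag, and the demo frame are all hypothetical additions, not part of the notebook.

```python
import pandas as pd


def validate_strict(df: pd.DataFrame, required, numeric=()):
    """Like validate(), plus per-column NA counts and a numeric-dtype check."""
    missing = [c for c in required if c not in df.columns]
    non_numeric = [
        c for c in numeric
        if c in df.columns and not pd.api.types.is_numeric_dtype(df[c])
    ]
    # Only list columns that actually contain missing values.
    na_by_col = {c: int(n) for c, n in df.isna().sum().items() if n > 0}
    return {
        "missing": missing,
        "non_numeric": non_numeric,
        "shape": df.shape,
        "na_by_col": na_by_col,
        "ok": not missing and not non_numeric,
    }


# Demo frame: prices stored as strings, one missing value, no volume column.
demo = pd.DataFrame({"date": ["2025-01-02", "2025-01-03"], "adj_close": ["10.0", None]})
report = validate_strict(demo, required=["date", "adj_close", "volume"], numeric=["adj_close"])
print(report)
```

For the scraped table, passing an explicit list of expected column names as `required`, rather than the frame's own columns, would let the check fail when the page layout changes.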

**Assumptions & Risks**  
- API: possible throttling/changes on provider side; holiday/weekend gaps in daily bars.  
- Scrape: Wikipedia schema/HTML can change (selector fragility); generic `<tr>/<td>` parsing may include header or footnote rows; network failures handled by inline fallback.
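
To make the gap risk visible, one option is to list the weekdays inside the downloaded range that have no bar. The sketch below uses `pd.bdate_range`, which knows about weekends but not exchange holidays, so every date it returns is either a market holiday or a genuinely missing bar and needs a manual look. The helper name `missing_weekdays` and the demo dates are hypothetical.

```python
import pandas as pd


def missing_weekdays(dates) -> pd.DatetimeIndex:
    """Weekdays between the first and last date that have no bar."""
    idx = pd.DatetimeIndex(pd.to_datetime(dates)).normalize()
    expected = pd.bdate_range(idx.min(), idx.max())
    return expected.difference(idx)


# Demo dates: Thu 2025-01-02, Fri 2025-01-03, Tue 2025-01-07 (Mon the 6th is absent).
gaps = missing_weekdays(["2025-01-02", "2025-01-03", "2025-01-07"])
print(list(gaps.strftime("%Y-%m-%d")))
```

In the notebook this could be called as `missing_weekdays(df_api['date'])` after the API pull.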

**.env & Repro Notes**  
- `.env` is present locally and not committed (verified with `git check-ignore -v .env`).  
- Reproduce: create env, install requirements, run notebook top-to-bottom; outputs appear under `data/raw/` with a fresh timestamp in the filenames.
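
The first cell prints only whether the key loaded. When a run depends on the key, a small guard can fail early with a clear message and avoid ever printing the secret. This is a sketch under the assumption that `load_dotenv()` has already been called; `require_env`, `mask`, and the `DEMO_API_KEY` variable are hypothetical names used for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the variable's value, or raise without echoing any secret."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it before running")
    return value


def mask(secret: str, keep: int = 4) -> str:
    """Show only the last few characters, for safe logging."""
    if len(secret) <= keep:
        return "*" * len(secret)
    return "*" * (len(secret) - keep) + secret[-keep:]


os.environ["DEMO_API_KEY"] = "abcd1234efgh"  # hypothetical value for the demo only
print(mask(require_env("DEMO_API_KEY")))
```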
