# Homework Starter — Stage 04: Data Acquisition and Ingestion
Name: 
Date: 

## Objectives
- API ingestion with secrets in `.env`
- Scrape a permitted public table
- Validate and save raw data to `data/raw/`

In [2]:
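# Requires: pip install requests pandas beautifulsoup4 python-dotenv (plus yfinance for the fallback)
import os, pathlib, datetime as dt
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# Raw files go under data/raw/; create the folder if needed.
RAW = pathlib.Path('data/raw')
RAW.mkdir(parents=True, exist_ok=True)

# Read secrets from .env and report only whether the key is present, never its value.
load_dotenv()
print('ALPHAVANTAGE_API_KEY loaded?', bool(os.getenv('ALPHAVANTAGE_API_KEY')))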

## Helpers (use or modify)

In [None]:
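def ts():
    """Return a filesystem-safe local timestamp in YYYYMMDD-HHMMSS form."""
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

def save_csv(df: pd.DataFrame, prefix: str, **meta):
    """Write df to data/raw/ as <prefix>_<key-value pairs>_<timestamp>.csv and return the path."""
    mid = '_'.join(f"{k}-{v}" for k, v in meta.items())
    # Skip empty segments so a call without metadata does not produce a double underscore.
    name = '_'.join(part for part in (prefix, mid, ts()) if part)
    path = RAW / f"{name}.csv"
    df.to_csv(path, index=False)
    print('Saved', path)
    return path

def validate(df: pd.DataFrame, required):
    """Report missing required columns, the frame's shape, and the total NA count."""
    missing = [c for c in required if c not in df.columns]
    return {'missing': missing, 'shape': df.shape, 'na_total': int(df.isna().sum().sum())}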
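
As a quick illustration of what `validate` reports, the sketch below restates the helper so it runs on its own and applies it to a small hand-made frame. The `demo` data and the extra `volume` column name are made up for illustration only.

```python
import pandas as pd

def validate(df: pd.DataFrame, required):
    missing = [c for c in required if c not in df.columns]
    return {'missing': missing, 'shape': df.shape, 'na_total': int(df.isna().sum().sum())}

# Hypothetical demo frame: one missing price and no 'volume' column.
demo = pd.DataFrame({'date': ['2024-01-02', '2024-01-03'], 'adj_close': [185.6, None]})
report = validate(demo, ['date', 'adj_close', 'volume'])
print(report)  # {'missing': ['volume'], 'shape': (2, 2), 'na_total': 1}
```

A non-empty `missing` list or an unexpectedly large `na_total` is a signal to inspect the data before saving it.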

## Part 1 — API Pull (Required)
Choose an endpoint (e.g., Alpha Vantage), or use the `yfinance` fallback when no API key is set.

In [1]:
SYMBOL = 'AAPL'
USE_ALPHA = bool(os.getenv('ALPHAVANTAGE_API_KEY'))

if USE_ALPHA:
    url = 'https://www.alphavantage.co/query'
    params = {
        'function':'TIME_SERIES_DAILY_ADJUSTED',
        'symbol': SYMBOL,
        'outputsize':'compact',
        'apikey': os.getenv('ALPHAVANTAGE_API_KEY')
    }
    r = requests.get(url, params=params, timeout=30)
    r.raise_for_status()
    js = r.json()
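    # Rate-limit and error responses carry no 'Time Series' key, so fail with the API's own message.
    ts_keys = [k for k in js if 'Time Series' in k]
    if not ts_keys:
        raise RuntimeError(f'Alpha Vantage returned no time series: {js}')
    key = ts_keys[0]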
    df_api = pd.DataFrame(js[key]).T.reset_index().rename(
        columns={'index':'date', '5. adjusted close':'adj_close'}
    )[['date', 'adj_close']]
    df_api['date'] = pd.to_datetime(df_api['date'])
    df_api['adj_close'] = pd.to_numeric(df_api['adj_close'])
else:
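    import yfinance as yf
    # auto_adjust=False keeps the 'Adj Close' column, which newer yfinance releases drop by default.
    raw = yf.download(SYMBOL, period='3mo', interval='1d', auto_adjust=False)
    # Some yfinance versions return MultiIndex columns even for a single symbol; keep the field level.
    if isinstance(raw.columns, pd.MultiIndex):
        raw.columns = raw.columns.get_level_values(0)
    df_api = raw.reset_index()[['Date', 'Adj Close']]
    df_api.columns = ['date', 'adj_close']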

v_api = validate(df_api, ['date', 'adj_close'])
print(v_api)

In [None]:
_ = save_csv(df_api.sort_values('date'), prefix='api', source='alpha' if USE_ALPHA else 'yfinance', symbol=SYMBOL)
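
The parsing step in the API cell can be tried offline. The following is a minimal sketch, assuming a payload shaped like the fields used above; the `sample` dictionary is made up, `parse_daily_adjusted` is a hypothetical helper name, and the `Note`, `Information`, and `Error Message` keys are assumptions about how the API signals a rate limit or an error.

```python
import pandas as pd

def parse_daily_adjusted(js: dict) -> pd.DataFrame:
    """Turn an Alpha Vantage-style daily payload into a tidy date/adj_close frame."""
    series_keys = [k for k in js if 'Time Series' in k]
    if not series_keys:
        # Assumed keys for rate-limit or error messages; fall back to the whole payload.
        msg = js.get('Note') or js.get('Information') or js.get('Error Message') or js
        raise ValueError(f'No time series in response: {msg}')
    df = pd.DataFrame(js[series_keys[0]]).T.reset_index().rename(
        columns={'index': 'date', '5. adjusted close': 'adj_close'}
    )[['date', 'adj_close']].copy()
    df['date'] = pd.to_datetime(df['date'])
    df['adj_close'] = pd.to_numeric(df['adj_close'])
    return df.sort_values('date').reset_index(drop=True)

# Made-up sample payload, newest date first.
sample = {
    'Meta Data': {'2. Symbol': 'AAPL'},
    'Time Series (Daily)': {
        '2024-01-03': {'4. close': '184.25', '5. adjusted close': '183.90'},
        '2024-01-02': {'4. close': '185.64', '5. adjusted close': '185.30'},
    },
}
df_demo = parse_daily_adjusted(sample)
print(df_demo)
```

Raising an error with the API's message makes a rate-limit response easier to diagnose than an `IndexError` from an empty key list.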

## Part 2 — Scrape a Public Table (Required)
Replace `SCRAPE_URL` with a permitted page containing a simple table.

In [None]:
SCRAPE_URL = 'https://www.wsj.com/market-data'  # TODO: replace with permitted page
headers = {'User-Agent':'AFE-Homework/1.0'}

def rows_to_df(html: str) -> pd.DataFrame:
    """Collect the text of every <tr> in the HTML; the first non-empty row becomes the header."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = [[c.get_text(strip=True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]
    header, *data = [r for r in rows if r]
    return pd.DataFrame(data, columns=header)

try:
    resp = requests.get(SCRAPE_URL, headers=headers, timeout=30)
    resp.raise_for_status()
    df_scrape = rows_to_df(resp.text)
except Exception as e:
    # Any failure (network, HTTP status, no table, ragged rows) falls back to a tiny demo table.
    print('Scrape failed, using inline demo table:', e)
    html = '<table><tr><th>Ticker</th><th>Price</th></tr><tr><td>AAA</td><td>101.2</td></tr></table>'
    df_scrape = rows_to_df(html)

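# Coerce prices to numeric; cells that cannot be parsed become NaN and show up in na_total.
if 'Price' in df_scrape.columns:
    df_scrape['Price'] = pd.to_numeric(df_scrape['Price'], errors='coerce')
# Validating against the frame's own columns can never report a missing column;
# replace the list with the columns you expect (e.g., ['Ticker', 'Price']).
v_scrape = validate(df_scrape, list(df_scrape.columns)); v_scrape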

In [None]:
_ = save_csv(df_scrape, prefix='scrape', site='example', table='markets')
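
One way to sanity-check that a page is open to scraping is to consult the site's `robots.txt`. The sketch below uses the standard-library `urllib.robotparser` on a made-up `robots.txt` string so it runs offline; `allowed_by_robots` is a hypothetical helper name and the example.com URLs are placeholders. Passing this check does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Made-up robots.txt for illustration; a real check would download <site>/robots.txt.
sample_robots = """\
User-agent: *
Disallow: /private/
"""
ok_public = allowed_by_robots(sample_robots, 'AFE-Homework/1.0', 'https://example.com/tables/prices')
ok_private = allowed_by_robots(sample_robots, 'AFE-Homework/1.0', 'https://example.com/private/data')
print(ok_public, ok_private)  # True False
```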

## Documentation
- API Source: (URL/endpoint/params)
- Scrape Source: (URL/table description)
- Assumptions & risks: (rate limits, selector fragility, schema changes)
- Confirm `.env` is not committed.
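
The last item can be checked with a few lines of code. This is a minimal sketch, assuming `.gitignore` sits at the repository root; `env_is_ignored` is a hypothetical helper that only recognizes a handful of literal entries and does not evaluate Git's full pattern rules.

```python
import pathlib
import tempfile

def env_is_ignored(repo_root) -> bool:
    """Return True if .gitignore under repo_root has a line that plainly ignores .env."""
    gitignore = pathlib.Path(repo_root) / '.gitignore'
    if not gitignore.exists():
        return False
    lines = [ln.strip() for ln in gitignore.read_text().splitlines()]
    # Simplification: only these literal entries count, not arbitrary globs or negations.
    return any(ln in ('.env', '/.env', '*.env', '.env*') for ln in lines)

# Demo in a throwaway directory: first without a .gitignore, then with one listing .env.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    before = env_is_ignored(root)
    (root / '.gitignore').write_text('__pycache__/\n.env\n')
    after = env_is_ignored(root)
print(before, after)  # False True
```

Inside a real repository, `git check-ignore -v .env` gives the authoritative answer, since Git itself evaluates the ignore rules.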

### Data Sources & URLs
- Alpha Vantage API (used if an API key is provided)
  - URL: https://www.alphavantage.co/query
  - Params: `function=TIME_SERIES_DAILY_ADJUSTED`, `symbol=AAPL`, `outputsize=compact`, `apikey` read from `.env`
- Yahoo Finance (`yfinance` fallback)
  - Uses the `yfinance` Python package to download 3 months of daily historical data for AAPL.
- Web scraping: market data table
  - Example URL: https://www.wsj.com/market-data
  - HTML table parsed with BeautifulSoup (`<tr>`, `<th>`, and `<td>` elements)

### Validation Logic
- Required columns for API and yfinance data: `date`, `adj_close`
- For scraped data: check that the expected columns (e.g., `Ticker`, `Price`) are present, and convert `Price` to numeric if applicable
- Validation output includes:
  - List of missing required columns
  - DataFrame shape (rows, columns)
  - Total number of NA (missing) values

### Environment & Security
- The `.env` file contains sensitive keys, e.g. `ALPHAVANTAGE_API_KEY`.
- Ensure `.env` is listed in `.gitignore` to avoid committing secrets to version control.

### Assumptions & Risks
Assumptions:
- The Alpha Vantage API key is valid and rate limits are not exceeded.
- The `yfinance` fallback is available and accurate for historical prices.
- The scraped site's structure and HTML format remain unchanged.

Risks:
- API limits or downtime could interrupt data retrieval.
- Website layout changes can break the scraping logic and require updates.
- Missing or corrupted data may require manual review.
- Sensitive API keys must be kept secure and not exposed publicly.
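
The API-limit and downtime risk noted above is often handled by retrying with a growing delay. The following is a minimal sketch of that idea, not something the notebook already does; `with_retries` and `flaky` are hypothetical names, and the delay schedule (1 s, then 2 s) is an arbitrary choice.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**i seconds, then try again."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)

# Demo with a made-up function that fails twice and then succeeds.
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('temporary failure')
    return 'ok'

delays = []
# Record the delays instead of sleeping so the demo finishes instantly.
result = with_retries(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the notebook, the wrapped function could be a small lambda around the `requests.get(...)` call, with the default `time.sleep` left in place.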