# Homework Starter — Stage 05: Data Storage
Name: 
Date: 

Objectives:
- Env-driven paths to `data/raw/` and `data/processed/`
- Save CSV and Parquet; reload and validate
- Abstract IO with utility functions; document choices

In [1]:
import os, pathlib, datetime as dt
import pandas as pd
from dotenv import load_dotenv

load_dotenv()
RAW = pathlib.Path(os.getenv('DATA_DIR_RAW', 'data/raw'))
PROC = pathlib.Path(os.getenv('DATA_DIR_PROCESSED', 'data/processed'))
RAW.mkdir(parents=True, exist_ok=True)
PROC.mkdir(parents=True, exist_ok=True)
print('RAW ->', RAW.resolve())
print('PROC ->', PROC.resolve())

RAW -> C:\Users\Tracy\bootcamp_yuning_wang\homework\stage05_data-storage\notebooks\data\raw
PROC -> C:\Users\Tracy\bootcamp_yuning_wang\homework\stage05_data-storage\notebooks\data\processed
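
The fallback in `os.getenv(name, default)` applies only when the variable is unset; a variable that is set but empty yields an empty path. A minimal sketch of a helper that covers both cases is shown here. The name `data_dir` and the variable `DEMO_DATA_DIR` are hypothetical and exist only for this illustration, and the demo runs in a temporary directory so nothing is written to the project tree.

```python
import os
import tempfile
from pathlib import Path

def data_dir(env_var: str, default: str) -> Path:
    """Return the directory named by env_var, or default if unset or empty."""
    # `or default` also covers a variable that is set but empty,
    # which os.getenv(env_var, default) alone would not.
    path = Path(os.getenv(env_var) or default)
    path.mkdir(parents=True, exist_ok=True)
    return path

# Demo in a throwaway directory; DEMO_DATA_DIR is a made-up variable name.
with tempfile.TemporaryDirectory() as tmp:
    os.environ['DEMO_DATA_DIR'] = str(Path(tmp) / 'custom_raw')
    overridden = data_dir('DEMO_DATA_DIR', str(Path(tmp) / 'raw'))
    del os.environ['DEMO_DATA_DIR']
    fallback = data_dir('DEMO_DATA_DIR', str(Path(tmp) / 'raw'))

print(overridden.name, fallback.name)
```

In the notebook, the same helper could be called with `DATA_DIR_RAW` and `DATA_DIR_PROCESSED` after `load_dotenv()`.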


## 1) Create or Load a Sample DataFrame
You may reuse data from prior stages or create a small synthetic dataset.

In [2]:
import numpy as np
# Reproducible sample data
rng = np.random.default_rng(42)
dates = pd.date_range('2024-01-01', periods=20, freq='D')
df = pd.DataFrame({
    'date': dates,
    'ticker': ['AAPL'] * 20,
    'price': 150 + rng.normal(0, 1, size=20).cumsum()
})
df.head()

        date ticker       price
0 2024-01-01   AAPL  150.304717
1 2024-01-02   AAPL  149.264733
2 2024-01-03   AAPL  150.015184
3 2024-01-04   AAPL  150.955749
4 2024-01-05   AAPL  149.004714


## 2) Save CSV to data/raw/ and Parquet to data/processed/ (TODO)
- Use timestamped filenames.
- Handle missing Parquet engine gracefully.

In [3]:
# Timestamp helper
def ts():
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

# --- Save CSV (raw) ---
csv_path = RAW / f"sample_{ts()}.csv"
df.to_csv(csv_path, index=False)

# --- Save Parquet (processed) ---
pq_path = PROC / f"sample_{ts()}.parquet"
try:
    # pandas raises ImportError if neither pyarrow nor fastparquet is installed
    df.to_parquet(pq_path)
except ImportError as e:
    print('Parquet engine not available. Install pyarrow or fastparquet to complete this step.')
    print('Details:', e)
    pq_path = None

pq_path

WindowsPath('data/processed/sample_20250820-155307.parquet')
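
An alternative to catching the error at write time is to check for an engine up front. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `find_parquet_engine` is hypothetical, and the candidate order mirrors what pandas tries with `engine='auto'` (pyarrow first, then fastparquet).

```python
import importlib.util
from typing import Optional, Sequence

def find_parquet_engine(
    candidates: Sequence[str] = ('pyarrow', 'fastparquet'),
) -> Optional[str]:
    """Return the name of the first importable candidate module, or None."""
    for name in candidates:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is not None:
            return name
    return None

engine = find_parquet_engine()
if engine is None:
    print('No Parquet engine found; skipping the Parquet step.')
else:
    print('Parquet engine available:', engine)
```

This only confirms that the module can be located, not that it imports cleanly, so the `try`/`except` around `to_parquet` is still worth keeping.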

## 3) Reload and Validate (TODO)
- Compare shapes and key dtypes.

In [4]:
# Validation helpers
def validate_loaded(original: pd.DataFrame, reloaded: pd.DataFrame) -> dict:
    checks = {
        'shape_equal': original.shape == reloaded.shape,
        'date_is_datetime': (
            'date' in reloaded.columns and pd.api.types.is_datetime64_any_dtype(reloaded['date'])
        ),
        'price_is_numeric': (
            'price' in reloaded.columns and pd.api.types.is_numeric_dtype(reloaded['price'])
        ),
    }
    return checks

# Reload CSV (parse date)
df_csv = pd.read_csv(csv_path, parse_dates=['date'])
csv_checks = validate_loaded(df, df_csv)
print('CSV checks:', csv_checks)

CSV checks: {'shape_equal': True, 'date_is_datetime': True, 'price_is_numeric': True}
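
CSV stores only text, so dtypes are re-inferred on load; that is why the reload above passes `parse_dates`. A minimal sketch of the difference, using an in-memory buffer instead of a file (the small frame here is made up for illustration and is not the homework data):

```python
import io
import pandas as pd

# Small illustrative frame (not the homework data)
sample = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=3, freq='D'),
    'price': [150.25, 149.5, 150.75],
})
csv_text = sample.to_csv(index=False)

# Without parse_dates, the 'date' column comes back as plain strings
plain = pd.read_csv(io.StringIO(csv_text))
# With parse_dates, it is converted back to a datetime dtype
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

plain_is_datetime = pd.api.types.is_datetime64_any_dtype(plain['date'])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed['date'])
print('plain:', plain_is_datetime, '| parsed:', parsed_is_datetime)
```

Parquet stores the column types alongside the data, so the Parquet reload needs no equivalent of `parse_dates`.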


In [5]:
if pq_path:
    try:
        df_pq = pd.read_parquet(pq_path)
        pq_checks = validate_loaded(df, df_pq)
        print('Parquet checks:', pq_checks)
    except Exception as e:
        print('Parquet read failed:', e)
else:
    print('Parquet file not created (engine missing).')

Parquet checks: {'shape_equal': True, 'date_is_datetime': True, 'price_is_numeric': True}
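
The checks above compare shape and dtypes but not the values themselves. If a stricter round-trip check is wanted, `pd.testing.assert_frame_equal` can compare contents as well. This is a sketch under assumptions: `frames_match` is a hypothetical wrapper, the sample frame is made up, and the tolerance settings are one reasonable choice rather than a requirement.

```python
import io
import pandas as pd

def frames_match(original: pd.DataFrame, reloaded: pd.DataFrame) -> bool:
    """Return True if both frames hold the same values, within float tolerance."""
    try:
        pd.testing.assert_frame_equal(
            original,
            reloaded,
            check_dtype=False,  # tolerate e.g. differing datetime resolutions
            check_exact=False,  # compare floats with the default tolerance
        )
    except AssertionError:
        return False
    return True

# Illustrative data, round-tripped through an in-memory CSV
sample = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=3, freq='D'),
    'ticker': ['AAPL'] * 3,
    'price': [150.25, 149.5, 150.75],
})
reloaded = pd.read_csv(io.StringIO(sample.to_csv(index=False)), parse_dates=['date'])

same = frames_match(sample, reloaded)

# Change one value to confirm the check catches a mismatch
tampered = reloaded.copy()
tampered.loc[0, 'price'] += 1.0
different = frames_match(sample, tampered)
print(same, different)
```

The same call could be added to `validate_loaded` as an extra `'values_equal'` entry if value-level validation is in scope for the assignment.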


## 4) Utilities (TODO)
- Implement `detect_format`, `write_df`, `read_df`.
- Use suffix to route; create parent dirs if needed; friendly errors for Parquet.

In [None]:
import typing as t
from pathlib import Path

def detect_format(path: t.Union[str, Path]) -> str:
    s = str(path).lower()
    if s.endswith('.csv'):
        return 'csv'
    if s.endswith('.parquet') or s.endswith('.pq') or s.endswith('.parq'):
        return 'parquet'
    raise ValueError(f'Unsupported format: {s}')

def write_df(df: pd.DataFrame, path: t.Union[str, Path]) -> Path:
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    fmt = detect_format(p)
    if fmt == 'csv':
        df.to_csv(p, index=False)
    else:
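        try:
            df.to_parquet(p)
        except ImportError as e:
            # pandas raises ImportError when no Parquet engine is installed;
            # other failures (permissions, bad data) propagate unchanged
            raise RuntimeError('Parquet engine not available. Install pyarrow or fastparquet.') from e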
    return p

def _csv_has_date_col(p: Path) -> bool:
    # Read header only to decide whether to parse dates
    cols = pd.read_csv(p, nrows=0).columns
    return 'date' in cols

def read_df(path: t.Union[str, Path]) -> pd.DataFrame:
    p = Path(path)
    fmt = detect_format(p)
    if fmt == 'csv':
        if _csv_has_date_col(p):
            return pd.read_csv(p, parse_dates=['date'])
        return pd.read_csv(p)
    else:
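        try:
            return pd.read_parquet(p)
        except ImportError as e:
            # Only a missing engine is reported as such; a missing or
            # corrupt file raises its own, more accurate error
            raise RuntimeError('Parquet engine not available. Install pyarrow or fastparquet.') from e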
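
As a design note, suffix routing can also be driven by a lookup table, which keeps the supported extensions in one place. The sketch below is an alternative to `detect_format`, not a replacement for it: `SUFFIX_TO_FORMAT` and `detect_format_by_suffix` are hypothetical names, and the table simply mirrors the extensions handled above.

```python
from pathlib import Path
from typing import Union

# Hypothetical lookup table; add entries here to support more formats.
SUFFIX_TO_FORMAT = {
    '.csv': 'csv',
    '.parquet': 'parquet',
    '.pq': 'parquet',
    '.parq': 'parquet',
}

def detect_format_by_suffix(path: Union[str, Path]) -> str:
    """Map a path's final suffix (case-insensitive) to a format name."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f'Unsupported format: {suffix or "<no suffix>"}') from None

print(detect_format_by_suffix('data/raw/sample.CSV'))
print(detect_format_by_suffix(Path('data/processed/sample.parq')))
```

One behavioral difference from the `endswith` version: `Path.suffix` looks only at the final extension, so a compressed name such as `sample.csv.gz` is rejected by both, while a file named just `.csv` is accepted by `endswith` but has an empty suffix here.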


In [10]:
from IPython.display import display

p_csv = RAW / f"util_{ts()}.csv"
p_pq  = PROC / f"util_{ts()}.parquet"

print("→ Writing CSV:", p_csv)
write_df(df, p_csv)
csv_head = read_df(p_csv).head()
print("CSV preview:")
display(csv_head)

try:
    print("\n→ Writing Parquet:", p_pq)
    write_df(df, p_pq)
    pq_head = read_df(p_pq).head()
    print("Parquet preview:")
    display(pq_head)
except RuntimeError as e:
    print("Skipping Parquet util demo:", e)

→ Writing CSV: data\raw\util_20250820-160301.csv
CSV preview:

        date ticker       price
0 2024-01-01   AAPL  150.304717
1 2024-01-02   AAPL  149.264733
2 2024-01-03   AAPL  150.015184
3 2024-01-04   AAPL  150.955749
4 2024-01-05   AAPL  149.004714

→ Writing Parquet: data\processed\util_20250820-160301.parquet
Parquet preview:

        date ticker       price
0 2024-01-01   AAPL  150.304717
1 2024-01-02   AAPL  149.264733
2 2024-01-03   AAPL  150.015184
3 2024-01-04   AAPL  150.955749
4 2024-01-05   AAPL  149.004714


## 5) Documentation (TODO)
- Update README with a **Data Storage** section (folders, formats, env usage).
- Summarize validation checks and any assumptions.