# Scrape Americas Stocks and Fundamentals -- Financial Modeling Prep (FMP) API

End‑to‑end pipeline to build an Americas universe (stock_tickers + stock_profiles), fetch stock_quotes and key stock_metrics/ratios, and persist everything into DuckDB.

- Data sources (FMP stable APIs):
  - Exchanges: https://financialmodelingprep.com/stable/available-exchanges
  - Stock list/stock_profiles (paged bulk): https://financialmodelingprep.com/stable/profile-bulk?part=0
  - EOD bulk (daily): https://financialmodelingprep.com/stable/eod-bulk?date=YYYY-MM-DD
  - Key stock_metrics: https://financialmodelingprep.com/stable/key-stock_metrics?symbol=AAPL&period=quarter&limit=100
  - Ratios: https://financialmodelingprep.com/stable/ratios?symbol=AAPL&period=quarter&limit=100
  - Index list: https://financialmodelingprep.com/stable/index-list
  - Index quotes (light): https://financialmodelingprep.com/stable/historical-price-eod/light

- Inputs
  - Optional environment variable FMP_API_KEY to raise rate limits: export FMP_API_KEY=your_key

- Outputs (DuckDB: americas.db)
  - exchanges, stock_profiles, stock_tickers, stock_quotes, key_metrics, index_list, index_quotes

- Run order
  1) Setup and exchanges filter  2) Profiles (paged) → stock_tickers  3) Load stock_profiles
  4) EOD bulk stock_quotes  5) Load stock_quotes  6) Key stock_metrics + ratios → load
  7) Exchanges table  8) index_list + stock_quotes → load

Notes
- “Americas” classification uses FMP exchange metadata (region/country) and symbol suffix mapping; U.S. stock_tickers with no suffix are included.
- Network calls use simple retry/backoff and thread pools to balance speed and rate limits.

## Setup

- Optionally set FMP_API_KEY in your shell to improve quotas: export FMP_API_KEY=your_key.
- All outputs are persisted into DuckDB (americas.db). No static files are read or written by this notebook.

In [1]:
import os, requests
API_KEY=os.getenv("FMP_API_KEY")
params={"apikey":API_KEY} if API_KEY else {}

# Fetch exchange data with better error handling
try:
    raw=requests.get("https://financialmodelingprep.com/stable/available-exchanges",params=params,timeout=30).json()
    print(f"API response type: {type(raw)}")
    
    # Handle both potential response formats
    if isinstance(raw, dict):
        # Extract list from dictionary if possible
        list_found = False
        for k, v in raw.items():
            if isinstance(v, list) and len(v) > 0:
                raw = v
                list_found = True
                print(f"Found list in key '{k}' with {len(v)} items")
                break
        if not list_found:
            print("No list found in response dictionary, using empty list")
            raw = []
    elif not isinstance(raw, list):
        print(f"Unexpected response type: {type(raw)}, using empty list")
        raw = []
        
    # Log response size for debugging
    if isinstance(raw, list):
        print(f"Processing list with {len(raw)} items")
except Exception as e:
    print(f"API request failed: {e}")
    raw = []

ex_data = raw
SUFFIX_TO_EXCHANGE = {}
if isinstance(ex_data, list) and len(ex_data) > 0:
    SUFFIX_TO_EXCHANGE = { 
        (i.get("symbolSuffix") or "").strip().upper(): (i.get("exchange") or "").strip().upper() 
        for i in ex_data 
        if (i.get("symbolSuffix") or "").strip() and (i.get("exchange") or "").strip() 
    }
    print(f"Created SUFFIX_TO_EXCHANGE mapping with {len(SUFFIX_TO_EXCHANGE)} entries")

# Infer Americas exchanges from API (no file reads)
CTRY={"UNITED STATES","CANADA","BRAZIL","MEXICO","ARGENTINA","CHILE","PERU","COLOMBIA","VENEZUELA","URUGUAY","PARAGUAY","BOLIVIA","ECUADOR","GUYANA","SURINAME","FRENCH GUIANA","JAMAICA","TRINIDAD AND TOBAGO","TRINIDAD & TOBAGO","BARBADOS","BAHAMAS","BERMUDA","CAYMAN ISLANDS","PANAMA","COSTA RICA","GUATEMALA","HONDURAS","EL SALVADOR","NICARAGUA","DOMINICAN REPUBLIC","HAITI","PUERTO RICO","BELIZE","CURACAO","ARUBA","SAINT LUCIA","GRENADA","ST. VINCENT AND THE GRENADINES"}
ISO2={"US","CA","BR","MX","AR","CL","PE","CO","VE","UY","PY","BO","EC","GY","SR","GF","JM","TT","BB","BS","BM","KY","PA","CR","GT","HN","SV","NI","DO","H","PR","BZ","LC","GD","VC","CW","AW"}

def _amer_exchange(rec: dict)->bool:
    for k in ("region","continent"):
        v=rec.get(k)
        if isinstance(v,str) and "AMERICA" in v.upper(): return True
    for k in ("country","countryName","countryCode","country_code","country_iso2"):
        v=rec.get(k); vu=v.strip().upper() if isinstance(v,str) else ""
        if vu and ((len(vu)<=3 and vu in ISO2) or vu in CTRY or "UNITED STATES" in vu or "LATIN AMERICA" in vu): return True
    return False

EXCHANGES_AMERICAS=set()
for it in ex_data:
    if _amer_exchange(it):
        exch=(it.get("exchange") or "").strip(); acr=(it.get("acronym") or "").strip(); mic=(it.get("mic") or "").strip()
        if exch: EXCHANGES_AMERICAS.add(exch.upper())
        elif acr or mic: EXCHANGES_AMERICAS.add((acr or mic).upper())

# Fallback: infer from known U.S./Americas exchange acronyms in the exchange name map
if not EXCHANGES_AMERICAS:
    for _, exch in SUFFIX_TO_EXCHANGE.items():
        eu=exch.upper()
        if any(x in eu for x in ("NAS","NYS","ARC","BATS","OTC","TSX","TSXV","CSE","BOV","MEX","XBOV","XMEX")):
            EXCHANGES_AMERICAS.add(eu)

print(f"Identified {len(EXCHANGES_AMERICAS)} Americas exchanges: {', '.join(sorted(EXCHANGES_AMERICAS))}")
if not EXCHANGES_AMERICAS:
    # Use hardcoded fallback instead of raising error to allow continuing execution
    print("WARNING: Could not infer any Americas exchanges from API payload. Using hardcoded fallbacks.")
    EXCHANGES_AMERICAS = {"NYSE","NASDAQ","AMEX","ARCX","NYS","NAS","ARC","BATS","OTC","TSX","TSXV","CSE"}
    print(f"Using fallback exchanges: {', '.join(sorted(EXCHANGES_AMERICAS))}")

API response type: <class 'list'>
Processing list with 71 items
Created SUFFIX_TO_EXCHANGE mapping with 65 entries
Identified 14 Americas exchanges: AMEX, BUE, BVC, CBOE, CNQ, MEX, NASDAQ, NEO, NYSE, OTC, SAO, SGO, TSX, TSXV


## Profiles (paged) → Universe

Pull paged company profiles and keep only Americas listings (via symbol suffix→exchange map). Limit the investable universe to marketCap ≥ 1B and persist to DuckDB.

In [2]:
import os, json, time, requests, polars as pl
from io import StringIO
from pathlib import Path

API_KEY=os.getenv("FMP_API_KEY")

def fetch_profiles_paged(api_key: str|None=None, start_part: int=0, max_parts: int|None=None, sleep_s: float=0.0, max_retries: int=3, verbose: bool=False)->pl.DataFrame:
    key=api_key or API_KEY; params={"apikey":key} if key else {}
    # ensure exchange context
    global SUFFIX_TO_EXCHANGE, EXCHANGES_AMERICAS
    if 'SUFFIX_TO_EXCHANGE' not in globals() or 'EXCHANGES_AMERICAS' not in globals():
        try:
            ex_data=requests.get("https://financialmodelingprep.com/stable/available-exchanges",params=params,timeout=60).json()
            ex_data = ex_data if isinstance(ex_data,list) else []
        except Exception: ex_data=[]
        SUFFIX_TO_EXCHANGE = 'SUFFIX_TO_EXCHANGE' in globals() and SUFFIX_TO_EXCHANGE or { (i.get("symbolSuffix") or "").strip().upper(): (i.get("exchange") or "").strip().upper() for i in ex_data if (i.get("symbolSuffix") or "").strip() and (i.get("exchange") or "").strip() }
        # Infer amer exchanges from payload directly (no file reads)
        CTRY={"UNITED STATES","CANADA","BRAZIL","MEXICO","ARGENTINA","CHILE","PERU","COLOMBIA","VENEZUELA","URUGUAY","PARAGUAY","BOLIVIA","ECUADOR","GUYANA","SURINAME","FRENCH GUIANA","JAMAICA","TRINIDAD AND TOBAGO","TRINIDAD & TOBAGO","BARBADOS","BAHAMAS","BERMUDA","CAYMAN ISLANDS","PANAMA","COSTA RICA","GUATEMALA","HONDURAS","EL SALVADOR","NICARAGUA","DOMINICAN REPUBLIC","HAITI","PUERTO RICO","BELIZE","CURACAO","ARUBA","SAINT LUCIA","GRENADA","ST. VINCENT AND THE GRENADINES"}
        ISO2={"US","CA","BR","MX","AR","CL","PE","CO","VE","UY","PY","BO","EC","GY","SR","GF","JM","TT","BB","BS","BM","KY","PA","CR","GT","HN","SV","NI","DO","H","PR","BZ","LC","GD","VC","CW","AW"}
        def amer(r):
            if any(isinstance(r.get(k), str) and "AMERICA" in r[k].upper() for k in ("region","continent")): return True
            for k in ("country","countryCode","country_code","country_iso2","countryName"):
                v=r.get(k); vu=v.strip().upper() if isinstance(v,str) else ""
                if vu and ((len(vu)<=3 and vu in ISO2) or vu in CTRY or "UNITED STATES" in vu or "LATIN AMERICA" in vu): return True
            return False
        EXCHANGES_AMERICAS=set()
        for it in ex_data:
            if amer(it):
                exch=(it.get("exchange") or "").strip(); acr=(it.get("acronym") or "").strip(); mic=(it.get("mic") or "").strip()
                if exch: EXCHANGES_AMERICAS.add(exch.upper())
                elif acr or mic: EXCHANGES_AMERICAS.add((acr or mic).upper())
        if not EXCHANGES_AMERICAS:
            for _, exch in SUFFIX_TO_EXCHANGE.items():
                eu=str(exch).upper()
                if any(x in eu for x in ("NAS","NYS","ARC","BATS","OTC","TSX","TSXV","CSE","BOV","MEX","XBOV","XMEX")):
                    EXCHANGES_AMERICAS.add(eu)
    _is_amer=lambda s: isinstance(s,str) and (SUFFIX_TO_EXCHANGE.get(("."+s.rsplit(".",1)[-1]).upper()) in EXCHANGES_AMERICAS if "." in s else True)

    def _parse(txt: str)->list[dict]:
        s=txt.lstrip()
        if s.startswith('['):
            try: return json.loads(s)
            except Exception: pass
        norm=txt.replace('}{','}\n{'); rec=[]
        for ln in (l.strip() for l in norm.splitlines() if l.strip()):
            i=ln.find('{'); ln=ln[i:] if i>=0 else ln
            if ln.startswith('{'):
                try: o=json.loads(ln); rec.append(o) if isinstance(o,dict) else None
                except Exception: pass
        if rec: return rec
        if ',' in txt and '\n' in txt:
            try: return pl.read_csv(StringIO(txt), ignore_errors=True).to_dicts()
            except Exception: pass
            try:
                import pandas as pd; return pl.from_pandas(pd.read_csv(StringIO(txt), engine="python", dtype=str, on_bad_lines="skip").fillna("")).to_dicts()
            except Exception: pass
            try:
                import csv; return [dict({k:(v or "") for k,v in r.items()}) for r in csv.DictReader(StringIO(txt))]
            except Exception: return []
        return []

    def _to_pl(records: list[dict])->pl.DataFrame:
        if not records: return pl.DataFrame()
        keys=list({k for r in records for k in r.keys()}); norm=[{k:r.get(k,None) for k in keys} for r in records]
        try: return pl.DataFrame(norm, schema_overrides={k:pl.Utf8 for k in keys}, strict=False, infer_schema_length=len(norm))
        except Exception:
            try:
                import pandas as pd; return pl.from_pandas(pd.DataFrame(norm).astype("string").fillna(""))
            except Exception: return pl.DataFrame()

    sess=requests.Session(); url="https://financialmodelingprep.com/stable/profile-bulk"; frames=[]; part=start_part
    while True:
        if max_parts is not None and part>=start_part+max_parts: break
        q={"part":part,"datatype":"json",**params}; attempt=0
        while True:
            try:
                r=sess.get(url,params=q,timeout=120); status=r.status_code
                if verbose: print(f"GET {r.url[:80]}... -> {status}")
                if status in (429,) or status>=500:
                    if attempt<max_retries: time.sleep((sleep_s or 0.5)*(2**attempt)); attempt+=1; continue
                if status==403: return pl.DataFrame()
                r.raise_for_status(); data=_parse(r.text)
            except Exception:
                return pl.DataFrame() if not frames else pl.concat(frames, how="vertical_relaxed").unique(subset=["symbol"], keep="last")
            break
        if not isinstance(data,list): break
        if not data:
            if r.text.strip(): part+=1; time.sleep(sleep_s) if sleep_s>0 else None; continue
            break
        df=_to_pl(data)
        if df.is_empty(): part+=1; continue
        if "Symbol" in df.columns and "symbol" not in df.columns: df=df.rename({"Symbol":"symbol"})
        df=df.filter(pl.col("symbol").map_elements(_is_amer, return_dtype=pl.Boolean)) if "symbol" in df.columns else pl.DataFrame()
        if not df.is_empty():
            casts=[]
            for c in ("price","beta","lastDiv","lastDividend","change","changes","changePercentage","changesPercentage"): casts.append(pl.col(c).cast(pl.Float64, strict=False)) if c in df.columns else None
            for c in ("marketCap","mktCap","volume","volAvg","avgVolume","averageVolume"): casts.append(pl.col(c).cast(pl.Int64, strict=False)) if c in df.columns else None
            df=df.with_columns(casts) if casts else df; frames.append(df)
        part+=1; time.sleep(sleep_s) if sleep_s>0 else None
    return pl.DataFrame() if not frames else pl.concat(frames, how="vertical_relaxed").unique(subset=["symbol"], keep="last")

profiles_df=fetch_profiles_paged(api_key=os.getenv("FMP_API_KEY"))
profiles_df=profiles_df.filter(pl.col("marketCap").cast(pl.Int64, strict=False) >= 1_000_000_000)
stock_tickers = profiles_df.select("symbol").unique().sort("symbol")
print(profiles_df.shape)
print(profiles_df.head())

(11801, 36)
shape: (5, 36)
┌─────────────┬──────────┬────────┬─────────┬───┬─────────────┬─────────────┬────────┬─────────────┐
│ isActivelyT ┆ currency ┆ isFund ┆ price   ┆ … ┆ image       ┆ isin        ┆ symbol ┆ website     │
│ rading      ┆ ---      ┆ ---    ┆ ---     ┆   ┆ ---         ┆ ---         ┆ ---    ┆ ---         │
│ ---         ┆ str      ┆ str    ┆ f64     ┆   ┆ str         ┆ str         ┆ str    ┆ str         │
│ str         ┆          ┆        ┆         ┆   ┆             ┆             ┆        ┆             │
╞═════════════╪══════════╪════════╪═════════╪═══╪═════════════╪═════════════╪════════╪═════════════╡
│ true        ┆ USD      ┆ false  ┆ 0.89    ┆ … ┆ https://ima ┆ SG1V1293623 ┆ SRHBF  ┆ https://www │
│             ┆          ┆        ┆         ┆   ┆ ges.financi ┆ 2           ┆        ┆ .starhub.co │
│             ┆          ┆        ┆         ┆   ┆ almodeli…   ┆             ┆        ┆ m           │
│ true        ┆ USD      ┆ false  ┆ 14.5125 ┆ … ┆ https://ima ┆ 

### Save tickers → DuckDB
Persist unique investable symbols into americas.db.tickers for downstream joins.

In [3]:
import duckdb, polars as pl
# Incremental load for stock_tickers (unique by symbol)
if 'stock_tickers' in globals() and isinstance(stock_tickers, pl.DataFrame) and not stock_tickers.is_empty():
    con = duckdb.connect('../americas.db')
    # Register incoming dataframe
    con.register('tickers_view', stock_tickers.to_pandas())
    # Create table if it does not exist (empty schema clone)
    con.sql("""
        CREATE TABLE IF NOT EXISTS stock_tickers AS
        SELECT * FROM tickers_view LIMIT 0
    """)
    # Insert only new symbols
    con.sql("""
        INSERT INTO stock_tickers
        SELECT t.*
        FROM tickers_view t
        WHERE NOT EXISTS (
            SELECT 1 FROM stock_tickers x WHERE x.symbol = t.symbol
        )
    """)
    # (Optional) collect count of new rows inserted in this run
    new_count = con.sql("SELECT COUNT(*) AS c FROM tickers_view WHERE symbol NOT IN (SELECT symbol FROM stock_tickers)").fetchone()[0] if False else None
    con.close()
else:
    print("No stock_tickers to save; skipping DuckDB load.")

In [4]:
import duckdb, polars as pl
# Incremental load for stock_profiles (unique by symbol) with robust dynamic casting
if 'profiles_df' in globals() and isinstance(profiles_df, pl.DataFrame) and not profiles_df.is_empty():
    cap_col = 'marketCap' if 'marketCap' in profiles_df.columns else ('mktCap' if 'mktCap' in profiles_df.columns else None)
    if cap_col:
        # Filter investable universe
        filtered = profiles_df.filter(pl.col(cap_col).cast(pl.Int64, strict=False) >= 1_000_000_000)
        if not filtered.is_empty():
            con = duckdb.connect('../americas.db')
            table_exists = False
            try:
                table_exists = bool(con.execute("SELECT 1 FROM information_schema.tables WHERE table_name='stock_profiles'").fetchone())
            except Exception:
                table_exists = False

            if not table_exists:
                # Detect boolean-like columns (string reps of true/false only)
                bool_like = []
                for c in filtered.columns:
                    try:
                        vals = filtered.select(pl.col(c).cast(pl.Utf8, strict=False).str.to_lowercase().drop_nulls().unique()).to_series().to_list()
                        if vals and all(v in ('true','false') for v in vals):
                            bool_like.append(c)
                    except Exception:
                        pass
                if bool_like:
                    casts = [
                        pl.when(pl.col(c).cast(pl.Utf8, strict=False).str.to_lowercase()=="true").then(pl.lit(1))
                          .when(pl.col(c).cast(pl.Utf8, strict=False).str.to_lowercase()=="false").then(pl.lit(0))
                          .otherwise(pl.lit(None)).alias(c)
                        for c in bool_like
                    ]
                    filtered = filtered.with_columns(casts)
                # Basic numeric casts for common fields
                float_cols = [c for c in ("price","beta","lastDiv","lastDividend","change","changes","changePercentage","changesPercentage") if c in filtered.columns]
                int_cols = [c for c in ("marketCap","mktCap","volume","volAvg","avgVolume","averageVolume") if c in filtered.columns]
                casts = [pl.col(c).cast(pl.Float64, strict=False) for c in float_cols] + [pl.col(c).cast(pl.Int64, strict=False) for c in int_cols]
                if casts:
                    filtered = filtered.with_columns(casts)
                # Create table schema clone + initial load
                con.register('profiles_incoming_initial', filtered.to_pandas())
                con.sql("""
                    CREATE TABLE IF NOT EXISTS stock_profiles AS
                    SELECT * FROM profiles_incoming_initial LIMIT 0
                """)
                con.sql("INSERT INTO stock_profiles SELECT * FROM profiles_incoming_initial")
                con.close()
            else:
                # Existing table: build insert aligned to destination schema while avoiding lower() on numeric sources
                con.register('profiles_incoming_raw', filtered.to_pandas())
                schema_rows = con.execute("PRAGMA table_info('stock_profiles')").fetchall()
                # Map source column polars dtypes to simple strings
                src_type_map = {c: str(t) for c, t in zip(filtered.columns, filtered.dtypes)}
                numeric_prefixes = ("Int","UInt","Float","Decimal")
                select_exprs = []
                dest_cols = []
                for cid, name, dtype, *_ in schema_rows:
                    dest_cols.append(name)
                    upper_type = (dtype or '').upper()
                    if name == 'symbol':
                        expr = f"p.{name} AS {name}"
                    elif name in filtered.columns:
                        src_t = src_type_map.get(name, "")
                        is_src_numeric = any(src_t.startswith(pref) for pref in numeric_prefixes)
                        if 'INT' in upper_type:
                            if is_src_numeric:
                                # Source already numeric -> direct cast
                                expr = f"TRY_CAST(p.{name} AS {upper_type}) AS {name}"
                            else:
                                # Source textual -> handle boolean-like strings then try numeric cast
                                expr = (
                                    f"CASE WHEN lower(CAST(p.{name} AS VARCHAR))='true' THEN 1 "
                                    f"WHEN lower(CAST(p.{name} AS VARCHAR))='false' THEN 0 "
                                    f"ELSE TRY_CAST(p.{name} AS {upper_type}) END AS {name}"
                                )
                        elif any(t in upper_type for t in ['DOUBLE','FLOAT','REAL','DECIMAL']):
                            expr = f"TRY_CAST(p.{name} AS {upper_type}) AS {name}"
                        else:
                            # Leave as-is
                            expr = f"p.{name} AS {name}"
                    else:
                        null_cast_type = upper_type if upper_type else 'VARCHAR'
                        expr = f"CAST(NULL AS {null_cast_type}) AS {name}"
                    select_exprs.append(expr)
                insert_sql = f"""
                    INSERT INTO stock_profiles ({','.join(dest_cols)})
                    SELECT {','.join(select_exprs)}
                    FROM profiles_incoming_raw p
                    WHERE NOT EXISTS (
                        SELECT 1 FROM stock_profiles x WHERE x.symbol = p.symbol
                    )
                """
                try:
                    con.sql(insert_sql)
                except Exception as e:
                    print(f"Incremental insert failed: {e}")
                con.close()
        else:
            print("No stock_profiles with marketCap >= 1,000,000,000; skipping DuckDB load")
    else:
        print("No market cap column found; skipping DuckDB load for stock_profiles")
else:
    print("profiles_df is empty or undefined; skipping DuckDB load")

## EOD Bulk Quotes (2010‑01‑01 → today)

Parallel fetch daily bulk EOD, filter to Americas + investable stock_tickers, and stage for DuckDB.

In [5]:
# Parallel EOD Bulk (full or incremental range), filtered by `stock_tickers` + logging
import os, requests, polars as pl, concurrent.futures as cf, logging, threading
from io import StringIO
from datetime import date as _date, timedelta
from requests.adapters import HTTPAdapter

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("eod")

API_KEY=os.getenv("FMP_API_KEY"); params={"apikey":API_KEY} if API_KEY else {}
SESSION = requests.Session()
# Enlarge connection pool to avoid 'Connection pool is full' warnings under concurrency
try:
    ADAPTER = HTTPAdapter(pool_connections=128, pool_maxsize=128)
    SESSION.mount('https://', ADAPTER); SESSION.mount('http://', ADAPTER)
except Exception:
    pass

# Ensure filter context exists even if setup cells weren't run
if 'SUFFIX_TO_EXCHANGE' not in globals(): SUFFIX_TO_EXCHANGE = {}
if 'EXCHANGES_AMERICAS' not in globals(): EXCHANGES_AMERICAS = set()

# business-day date range
def date_range(start: str, end: str, weekdays_only: bool=True):
    y,m,d=map(int,start.split("-")); ye,me,de=map(int,end.split("-")); dt=_date(y,m,d); end_dt=_date(ye,me,de)
    while dt<=end_dt:
        if not weekdays_only or dt.weekday()<5: yield dt.isoformat()
        dt+=timedelta(days=1)

_is_amer=lambda s: isinstance(s,str) and (SUFFIX_TO_EXCHANGE.get(("."+s.rsplit(".",1)[-1]).upper()) in EXCHANGES_AMERICAS if "." in s else True)

# daily fetch with retry/backoff
def fetch_one_day(ds:str, allowed_symbols: set[str]|None=None, max_retries:int=3)->pl.DataFrame:
    attempt=0
    while True:
        try:
            r=SESSION.get(f"https://financialmodelingprep.com/stable/eod-bulk?date={ds}", params=params, timeout=120); status=r.status_code
            if status in (429,) or status>=500:
                if attempt<max_retries:
                    import time; log.warning(f"{ds} -> {status}, retry {attempt+1}/{max_retries}")
                    time.sleep(0.5*(2**attempt)); attempt+=1; continue
            r.raise_for_status()
            # Read potentially mixed-type numeric columns as strings to avoid parse errors, then cast below
            df=pl.read_csv(
                StringIO(r.text),
                try_parse_dates=False,
                schema_overrides={"open": pl.Utf8, "high": pl.Utf8, "low": pl.Utf8, "close": pl.Utf8, "adjClose": pl.Utf8, "volume": pl.Utf8},
                infer_schema_length=1000,
            )
            if df.is_empty(): log.debug(f"{ds} -> 0 rows"); return df
            n0=len(df); df=df.filter(pl.col("symbol").map_elements(_is_amer, return_dtype=pl.Boolean)); n1=len(df)
            if df.is_empty(): log.debug(f"{ds} -> amer 0/{n0}"); return df
            if allowed_symbols: df=df.filter(pl.col("symbol").is_in(allowed_symbols)); n2=len(df)
            else: n2=n1
            log.debug(f"{ds} -> raw={n0} amer={n1} allowed={n2}")
            return df.with_columns([
                pl.col("date").cast(pl.Utf8).str.to_date("%Y-%m-%d"),
                pl.col("open").cast(pl.Float64, strict=False), pl.col("high").cast(pl.Float64, strict=False),
                pl.col("low").cast(pl.Float64, strict=False), pl.col("close").cast(pl.Float64, strict=False),
                pl.col("adjClose").cast(pl.Float64, strict=False),
                # Cast volume via Float -> rounded Int to handle occasional fractional values
                pl.col("volume").cast(pl.Float64, strict=False).round(0).cast(pl.Int64, strict=False)
            ])
        except Exception as e:
            if attempt<max_retries:
                import time; log.warning(f"{ds} -> error {e!r}, retry {attempt+1}/{max_retries}")
                time.sleep(0.5*(2**attempt)); attempt+=1; continue
            log.error(f"{ds} -> failed after {max_retries} retries: {e!r}")
            return pl.DataFrame()

# Parallel over full or supplied range, with optional streaming insert
def parallel_fetch_eod_bulk_range(start_date:str, end_date:str,
                                  max_workers:int|None=None,
                                  weekdays_only:bool=True,
                                  allowed_symbols: set[str] | None = None,
                                  days: list[str] | None = None,
                                  skip_existing: bool = True,
                                  existing_dates: set[str] | None = None,
                                  insert_into_duckdb: bool = False,
                                  duckdb_path: str = '../americas.db') -> pl.DataFrame:
    """Fetch EOD bulk data.
    Optimizations:
      - pass precomputed days list
      - skip days already in stock_quotes table (existing_dates)
      - optionally stream each fetched day into DuckDB (idempotent insert on (date,symbol))
    """
    all_days = days if days is not None else list(date_range(start_date, end_date, weekdays_only=weekdays_only))
    if skip_existing and existing_dates:
        fetch_days=[d for d in all_days if d not in existing_dates]
    else:
        fetch_days=all_days
    if not fetch_days:
        log.info("No new days to fetch (incremental). Returning empty DataFrame.")
        return pl.DataFrame()
    if max_workers is None:
        import os as _os; max_workers=min(32, max(8, (_os.cpu_count() or 8)*2))
    log.info(f"EOD bulk {fetch_days[0]}→{fetch_days[-1]}: {len(fetch_days)} new days (of {len(all_days)} total), workers={max_workers}, symbols={'all' if not allowed_symbols else len(allowed_symbols)}")

    con = None
    lock = threading.Lock()
    if insert_into_duckdb:
        import duckdb
        con = duckdb.connect(duckdb_path)
        # Ensure table exists
        con.execute("CREATE TABLE IF NOT EXISTS stock_quotes (date DATE, symbol VARCHAR, open DOUBLE, high DOUBLE, low DOUBLE, close DOUBLE, adjClose DOUBLE, volume BIGINT)")
        # Create composite index (DuckDB 1.0 lacks indexes; rely on NOT EXISTS checks later)

    frames=[]; done=0; step=max(1, len(fetch_days)//20)
    with cf.ThreadPoolExecutor(max_workers=max_workers) as ex:
        futs={ex.submit(fetch_one_day, ds, allowed_symbols): ds for ds in fetch_days}
        for fut in cf.as_completed(futs):
            ds=futs[fut]
            f=fut.result(); done+=1
            if isinstance(f, pl.DataFrame) and not f.is_empty():
                if insert_into_duckdb and con is not None:
                    # Insert only new (date,symbol)
                    try:
                        import duckdb
                        with lock:
                            con.register('__new_day', f.to_pandas())
                            con.execute("""
                                INSERT INTO stock_quotes
                                SELECT n.* FROM __new_day n
                                WHERE NOT EXISTS (
                                    SELECT 1 FROM stock_quotes q WHERE q.date = n.date AND q.symbol = n.symbol
                                )
                            """)
                            con.unregister('__new_day')
                    except Exception as e:
                        log.warning(f"DuckDB insert failed for {ds}: {e}")
                else:
                    frames.append(f)
            if done%step==0 or done==len(fetch_days): log.info(f"Progress {done}/{len(fetch_days)} days (fetched {len(frames)} non-empty frames)")
    if con is not None:
        con.close()
    out = pl.DataFrame() if not frames else pl.concat(frames, how="vertical_relaxed").unique(subset=["date","symbol"], keep="last").sort(["date","symbol"])
    log.info(f"Combined rows (not counting already inserted days): {0 if out.is_empty() else out.height}")
    return out

### Extract Dates

In [6]:
import duckdb as db
from datetime import date  # or: from datetime import date as _date

con = db.connect('../americas.db')

# list tables
tables = con.sql("SHOW TABLES").fetchall()
print("Tables in the database:", tables)

# get last quote date if table exists
max_date = None
if any(t[0] == 'stock_quotes' for t in tables):
    max_date = con.sql("SELECT MAX(date) AS max_date FROM stock_quotes").fetchone()[0]

# Ensure string ISO format (DuckDB may return date/datetime object)
if max_date:
    # If you want to resume AFTER last stored day uncomment next line and import timedelta:
    # from datetime import timedelta; max_date = (max_date + timedelta(days=1))
    start_date = (max_date.date().isoformat() if hasattr(max_date, "date") else max_date.isoformat()) if hasattr(max_date, "isoformat") else str(max_date)
else:
    start_date = '2010-01-01'

end_date = date.today().isoformat()
print(f"Fetching EOD bulk from {start_date} to {end_date}")

Tables in the database: [('exchanges',), ('index_list',), ('index_quotes',), ('risk_premium',), ('stock_metrics',), ('stock_profiles',), ('stock_quotes',), ('stock_tickers',)]
Fetching EOD bulk from 2025-09-16 to 2025-09-16


In [7]:
# Gather existing quote dates to enable incremental skipping
import duckdb as _db
_existing_dates=set()
try:
    _con=_db.connect('../americas.db')
    if _con.execute("SELECT 1 FROM information_schema.tables WHERE table_name='stock_quotes'").fetchone():
        _existing_dates={r[0].isoformat() if hasattr(r[0],'isoformat') else str(r[0]) for r in _con.execute("SELECT DISTINCT date FROM stock_quotes").fetchall()}
    _con.close()
except Exception as e:
    print(f"Could not load existing quote dates: {e}")
print(f"Existing quote days: {len(_existing_dates)}")

Existing quote days: 4098


In [8]:
# Incremental EOD fetch using optimized function with skipping + streaming inserts
import polars as pl
allowed_symbols = set(stock_tickers.get_column("symbol").to_list()) if 'stock_tickers' in globals() else None

# Only fetch new days and stream insert directly to DuckDB (reduces memory and time for long histories)
parallel_month_df = parallel_fetch_eod_bulk_range(
    start_date,
    end_date,
    allowed_symbols=allowed_symbols,
    skip_existing=True,
    existing_dates=_existing_dates,
    insert_into_duckdb=True,  # Stream directly
)

# For quick inspection show just last few rows newly fetched (if any collected in-memory)
print(parallel_month_df.shape)
if not parallel_month_df.is_empty():
    print("Unique Symbols (new batch):", parallel_month_df.select(pl.col("symbol").n_unique()).item())
    print(parallel_month_df.tail())
else:
    print("No new in-memory rows (all inserted directly or nothing new).")

2025-09-16 17:38:42,485 INFO EOD bulk 2025-09-16→2025-09-16: 1 new days (of 1 total), workers=32, symbols=11801
2025-09-16 17:38:49,512 INFO Progress 1/1 days (fetched 0 non-empty frames)
2025-09-16 17:38:49,516 INFO Combined rows (not counting already inserted days): 0


(0, 0)
No new in-memory rows (all inserted directly or nothing new).


### Load stock_quotes → DuckDB
Create or replace americas.db.stock_quotes from the staged EOD bulk dataframe.

In [9]:
import duckdb, polars as pl
if 'parallel_month_df' in globals() and isinstance(parallel_month_df, pl.DataFrame) and not parallel_month_df.is_empty():
    con = duckdb.connect('../americas.db')
    con.register('new_quotes', parallel_month_df.to_pandas())

    # Create table if missing
    con.sql("""
        CREATE TABLE IF NOT EXISTS stock_quotes AS
        SELECT * FROM new_quotes LIMIT 0
    """)

    # Insert only rows whose (date,symbol) key not present
    con.sql("""
        INSERT INTO stock_quotes
        SELECT nq.*
        FROM new_quotes nq
        WHERE NOT EXISTS (
            SELECT 1 FROM stock_quotes q
            WHERE q.date = nq.date
              AND q.symbol = nq.symbol
        )
    """)

    con.close()
else:
    print("No new stock_quotes data to load.")

con = db.connect('../americas.db')
con.sql("SELECT MAX(date) AS max_date FROM stock_quotes").fetchone()[0]

No new stock_quotes data to load.


datetime.datetime(2025, 9, 16, 0, 0)

## Metrics and Ratios (quarterly)

Parallel fetch per‑symbol key metrics and ratios (last 100 periods), normalize/cast, join on (symbol, date, fiscalYear, period), and load to DuckDB.

In [None]:
# ULTRA-OPTIMIZED Incremental fetch of Key Metrics + Ratios with AsyncIO + Advanced Optimizations
# Performance improvements: Async I/O, Arrow integration, streaming, advanced caching, pipelining

import os, time, asyncio, aiohttp, polars as pl, duckdb
import concurrent.futures as cf
from typing import Any, Dict, List, Optional, Set
from datetime import date as _date, datetime
import json
from dataclasses import dataclass
from collections import defaultdict
import weakref

API_KEY = os.getenv("FMP_API_KEY")
_common_params = {"apikey": API_KEY} if API_KEY else {}

# ------------------------------------------------------------------
# Advanced Configuration & Caching
# ------------------------------------------------------------------
@dataclass
class OptimizationConfig:
    max_concurrent_requests: int = 100  # Higher concurrency for async
    max_batch_size: int = 2000
    cache_size: int = 10000
    use_arrow_integration: bool = True
    enable_streaming: bool = True
    connection_timeout: int = 30
    request_timeout: int = 45
    retry_attempts: int = 2
    backoff_factor: float = 1.2

CONFIG = OptimizationConfig()

# Simple LRU cache for API responses
class SimpleCache:
    def __init__(self, maxsize: int = 10000):
        self.cache = {}
        self.access_order = []
        self.maxsize = maxsize
    
    def get(self, key: str) -> Optional[List[Dict]]:
        if key in self.cache:
            # Move to end (most recently used)
            self.access_order.remove(key)
            self.access_order.append(key)
            return self.cache[key]
        return None
    
    def put(self, key: str, value: List[Dict]):
        if key in self.cache:
            self.access_order.remove(key)
        elif len(self.cache) >= self.maxsize:
            # Remove least recently used
            oldest = self.access_order.pop(0)
            del self.cache[oldest]
        
        self.cache[key] = value
        self.access_order.append(key)

# Global cache instance
API_CACHE = SimpleCache(CONFIG.cache_size)

# ------------------------------------------------------------------
# Async HTTP Client with Advanced Features
# ------------------------------------------------------------------
class AsyncAPIClient:
    def __init__(self):
        self.session: Optional[aiohttp.ClientSession] = None
        self.semaphore = asyncio.Semaphore(CONFIG.max_concurrent_requests)
        
    async def __aenter__(self):
        connector = aiohttp.TCPConnector(
            limit=CONFIG.max_concurrent_requests,
            limit_per_host=50,
            ttl_dns_cache=300,
            use_dns_cache=True,
            keepalive_timeout=60,
            enable_cleanup_closed=True
        )
        
        timeout = aiohttp.ClientTimeout(
            total=CONFIG.connection_timeout + CONFIG.request_timeout,
            connect=CONFIG.connection_timeout,
            sock_read=CONFIG.request_timeout
        )
        
        self.session = aiohttp.ClientSession(
            connector=connector,
            timeout=timeout,
            headers={'User-Agent': 'FMP-Optimizer/2.0'}
        )
        return self
    
    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.session:
            await self.session.close()
    
    async def fetch_json(self, url: str, params: Dict[str, Any]) -> List[Dict]:
        """Optimized async JSON fetcher with caching and smart retries"""
        cache_key = f"{url}?{sorted(params.items())}"
        
        # Check cache first
        cached_result = API_CACHE.get(cache_key)
        if cached_result is not None:
            return cached_result
        
        async with self.semaphore:  # Limit concurrent requests
            for attempt in range(CONFIG.retry_attempts + 1):
                try:
                    async with self.session.get(url, params=params) as response:
                        if response.status == 403:
                            return []
                        
                        if response.status == 429 or response.status >= 500:
                            if attempt < CONFIG.retry_attempts:
                                await asyncio.sleep(CONFIG.backoff_factor ** attempt)
                                continue
                            return []
                        
                        response.raise_for_status()
                        
                        # Stream large JSON responses
                        if CONFIG.enable_streaming:
                            data = await response.json(loads=json.loads)
                        else:
                            text = await response.text()
                            data = json.loads(text)
                        
                        # Handle various response formats
                        if isinstance(data, dict):
                            for v in data.values():
                                if isinstance(v, list):
                                    result = v
                                    break
                            else:
                                result = []
                        else:
                            result = data if isinstance(data, list) else []
                        
                        # Cache successful results
                        if result:
                            API_CACHE.put(cache_key, result)
                        
                        return result
                        
                except asyncio.TimeoutError:
                    if attempt < CONFIG.retry_attempts:
                        await asyncio.sleep(CONFIG.backoff_factor ** attempt)
                        continue
                except Exception as e:
                    if attempt < CONFIG.retry_attempts:
                        await asyncio.sleep(CONFIG.backoff_factor ** attempt)
                        continue
                    
        return []

# ------------------------------------------------------------------
# Optimized Symbol Universe with Lazy Loading
# ------------------------------------------------------------------
def load_symbol_universe_optimized() -> List[str]:
    """Ultra-fast symbol loading with caching"""
    cache_key = "symbol_universe"
    cached = API_CACHE.get(cache_key)
    if cached:
        return [item['symbol'] for item in cached if 'symbol' in item]
    
    if 'stock_tickers' in globals() and isinstance(stock_tickers, pl.DataFrame):
        # Use lazy scan with projection pushdown
        symbols = (stock_tickers
                  .lazy()
                  .select("symbol")
                  .filter(pl.col("symbol").is_not_null() & (pl.col("symbol") != ""))
                  .unique()
                  .collect()
                  .to_series()
                  .to_list())
        
        # Cache the result
        API_CACHE.put(cache_key, [{"symbol": s} for s in symbols])
        return symbols
    
    try:
        con = duckdb.connect('../americas.db')
        # Use Arrow format for faster data transfer
        result = con.execute("""
            SELECT DISTINCT symbol 
            FROM stock_tickers 
            WHERE symbol IS NOT NULL AND symbol != ''
            ORDER BY symbol
        """).arrow()
        con.close()
        
        # Convert Arrow to list efficiently
        symbols = result.column('symbol').to_pylist()
        API_CACHE.put(cache_key, [{"symbol": s} for s in symbols])
        return symbols
    except Exception:
        return []

# ------------------------------------------------------------------
# Optimized Database Operations with Arrow Integration
# ------------------------------------------------------------------
def get_existing_metrics_info() -> tuple[bool, Optional[str]]:
    """Fast database introspection using Arrow format"""
    try:
        con = duckdb.connect('../americas.db')
        
        # Single optimized query using Arrow
        result = con.execute("""
            WITH table_check AS (
                SELECT COUNT(*) as table_exists
                FROM information_schema.tables 
                WHERE table_name = 'stock_metrics'
            ),
            max_date AS (
                SELECT CASE 
                    WHEN (SELECT table_exists FROM table_check) > 0 
                    THEN (SELECT MAX(date) FROM stock_metrics)
                    ELSE NULL 
                END as max_date
            )
            SELECT 
                (SELECT table_exists FROM table_check) > 0 as exists,
                (SELECT max_date FROM max_date) as max_date
        """).fetchone()
        
        con.close()
        
        table_exists = bool(result[0])
        max_date = result[1].isoformat() if result[1] else None
        
        return table_exists, max_date
        
    except Exception:
        return False, None

# ------------------------------------------------------------------
# Advanced Async Data Processing Pipeline
# ------------------------------------------------------------------
async def fetch_symbol_data_async(client: AsyncAPIClient, symbol: str, threshold_date: Optional[str]) -> Optional[pl.LazyFrame]:
    """Ultra-optimized async symbol data fetching with pipeline processing"""
    
    # Fetch both endpoints concurrently
    metrics_task = client.fetch_json(
        "https://financialmodelingprep.com/stable/key-metrics",
        {"symbol": symbol, "period": "quarter", "limit": 100, **_common_params}
    )
    
    ratios_task = client.fetch_json(
        "https://financialmodelingprep.com/stable/ratios", 
        {"symbol": symbol, "period": "quarter", "limit": 100, **_common_params}
    )
    
    try:
        km_data, rt_data = await asyncio.gather(metrics_task, ratios_task, return_exceptions=True)
        
        # Handle exceptions
        if isinstance(km_data, Exception):
            km_data = []
        if isinstance(rt_data, Exception):
            rt_data = []
            
        if not km_data and not rt_data:
            return None
        
        # Process data efficiently
        def process_records(records: List[Dict], symbol: str) -> Optional[pl.LazyFrame]:
            if not records:
                return None
                
            # Optimize record processing
            for r in records:
                r['symbol'] = symbol  # Avoid .get() overhead
                date_val = r.get('date') or r.get('reportedDate')
                if date_val:
                    r['date'] = date_val[:10]  # Fast string slice instead of split
            
            try:
                df = pl.DataFrame(records)
                lazy_df = df.lazy()
                
                # Apply threshold filter early
                if threshold_date and 'date' in df.columns:
                    lazy_df = lazy_df.filter(pl.col('date') >= threshold_date)
                    
                return lazy_df
            except Exception:
                return None
        
        km_lazy = process_records(km_data, symbol)
        rt_lazy = process_records(rt_data, symbol)
        
        # Smart joining with schema optimization
        if km_lazy is None and rt_lazy is None:
            return None
        elif km_lazy is None:
            result = rt_lazy
        elif rt_lazy is None:
            result = km_lazy
        else:
            # Efficient join with column optimization
            try:
                km_schema = km_lazy.collect_schema()
                rt_schema = rt_lazy.collect_schema()
                
                # Determine optimal join strategy
                join_keys = [k for k in ('symbol', 'date', 'fiscalYear', 'period')
                           if k in km_schema and k in rt_schema]
                
                if not join_keys:
                    join_keys = ['symbol', 'date']
                    join_keys = [k for k in join_keys if k in km_schema and k in rt_schema]
                
                if join_keys and len(join_keys) >= 2:
                    # Use efficient join
                    rt_only_cols = [c for c in rt_schema.names() if c not in km_schema.names()]
                    if rt_only_cols:
                        select_cols = join_keys + rt_only_cols
                        result = km_lazy.join(rt_lazy.select(select_cols), on=join_keys, how='full')
                    else:
                        result = km_lazy
                else:
                    # Fallback to concat
                    result = pl.concat([km_lazy, rt_lazy], how='vertical_relaxed')
                    
            except Exception:
                # Ultimate fallback
                result = pl.concat([km_lazy, rt_lazy], how='vertical_relaxed')
        
        # Optimize date parsing
        if result and 'date' in result.collect_schema():
            result = result.with_columns(
                pl.col('date').str.strptime(pl.Date, "%Y-%m-%d", strict=False)
            )
        
        return result
        
    except Exception:
        return None

# ------------------------------------------------------------------
# Main Optimized Processing Function
# ------------------------------------------------------------------
async def process_metrics_ultra_optimized():
    """Main ultra-optimized processing function with full async pipeline"""
    
    print("🚀 Starting ULTRA-OPTIMIZED metrics processing...")
    
    # Load symbols with caching
    symbols = load_symbol_universe_optimized()
    print(f"📊 Processing {len(symbols):,} symbols with advanced optimizations")
    
    if not symbols:
        print("❌ No symbols found")
        return pl.DataFrame()
    
    # Get existing data info efficiently
    metrics_exists, existing_max_date = get_existing_metrics_info()
    print(f"📅 Existing max date: {existing_max_date}")
    
    # Calculate threshold date
    threshold_date = None
    if existing_max_date:
        try:
            def _quarter_start(d: _date) -> _date:
                return _date(d.year, ((d.month-1)//3)*3 + 1, 1)
            
            def _prev_quarter_start(d: _date) -> _date:
                qs = _quarter_start(d)
                m, y = qs.month - 3, qs.year
                if m <= 0:
                    m, y = m + 12, y - 1
                return _date(y, m, 1)

            emd = datetime.strptime(existing_max_date[:10], "%Y-%m-%d").date()
            threshold_date = _prev_quarter_start(emd).isoformat()
        except Exception:
            threshold_date = existing_max_date
    
    print(f"🎯 Threshold date: {threshold_date}")
    
    # Process in optimized batches with async pipeline
    batch_size = CONFIG.max_batch_size
    all_lazy_frames = []
    
    async with AsyncAPIClient() as client:
        for batch_start in range(0, len(symbols), batch_size):
            batch_end = min(batch_start + batch_size, len(symbols))
            batch_symbols = symbols[batch_start:batch_end]
            
            print(f"🔄 Processing batch {batch_start//batch_size + 1}/{(len(symbols) + batch_size - 1)//batch_size} ({len(batch_symbols)} symbols)")
            
            # Create tasks for this batch
            tasks = [
                fetch_symbol_data_async(client, symbol, threshold_date)
                for symbol in batch_symbols
            ]
            
            # Process batch with progress tracking
            start_time = time.time()
            results = await asyncio.gather(*tasks, return_exceptions=True)
            elapsed = time.time() - start_time
            
            # Filter successful results
            batch_frames = []
            for result in results:
                if isinstance(result, pl.LazyFrame):
                    batch_frames.append(result)
                elif isinstance(result, Exception):
                    continue  # Skip failed requests
            
            all_lazy_frames.extend(batch_frames)
            
            print(f"✅ Batch completed: {len(batch_frames)} successful frames in {elapsed:.1f}s ({len(batch_frames)/elapsed:.1f} frames/sec)")
            
            # Small delay to be nice to the API
            if batch_end < len(symbols):
                await asyncio.sleep(0.1)
    
    if not all_lazy_frames:
        print("❌ No data retrieved")
        return pl.DataFrame()
    
    print(f"🔗 Combining {len(all_lazy_frames)} lazy frames...")
    
    try:
        # Ultra-optimized combining with streaming
        combined = pl.concat(all_lazy_frames, how='vertical_relaxed')
        
        # Apply final optimizations
        combined = combined.unique(
            subset=['symbol', 'date', 'fiscalYear', 'period'],
            keep='last'
        )
        
        # Apply threshold filter if needed
        if threshold_date:
            threshold_obj = datetime.strptime(threshold_date[:10], "%Y-%m-%d").date()
            combined = combined.filter(pl.col('date') >= pl.lit(threshold_obj))
        
        # Collect final result
        result = combined.collect()
        
        print(f"🎉 Processing complete: {result.height:,} rows retrieved")
        return result
        
    except Exception as e:
        print(f"❌ Error combining results: {e}")
        return pl.DataFrame()

# ------------------------------------------------------------------
# Ultra-Optimized DuckDB Integration
# ------------------------------------------------------------------
def save_to_duckdb_optimized(df: pl.DataFrame) -> bool:
    """Ultra-fast DuckDB insertion with Arrow integration"""
    if df.is_empty():
        return False
    
    try:
        con = duckdb.connect('../americas.db')
        
        # Use Arrow format for maximum speed
        if CONFIG.use_arrow_integration:
            arrow_table = df.to_arrow()
            con.register('metrics_arrow', arrow_table)
            source_table = 'metrics_arrow'
        else:
            con.register('metrics_new', df.to_pandas())
            source_table = 'metrics_new'
        
        # Create table if needed
        con.execute(f"CREATE TABLE IF NOT EXISTS stock_metrics AS SELECT * FROM {source_table} LIMIT 0")
        
        # Ultra-fast bulk upsert
        key_cols = [c for c in ('symbol', 'date', 'fiscalYear', 'period') if c in df.columns]
        if not key_cols:
            key_cols = ['symbol', 'date']
        
        if key_cols:
            key_conditions = ' AND '.join([f"existing.{c} = new.{c}" for c in key_cols])
            
            # Use efficient MERGE-like operation
            insert_sql = f"""
                INSERT INTO stock_metrics
                SELECT new.* FROM {source_table} new
                WHERE NOT EXISTS (
                    SELECT 1 FROM stock_metrics existing 
                    WHERE {key_conditions}
                )
            """
            
            rows_before = con.execute("SELECT COUNT(*) FROM stock_metrics").fetchone()[0]
            con.execute(insert_sql)
            rows_after = con.execute("SELECT COUNT(*) FROM stock_metrics").fetchone()[0]
            
            inserted = rows_after - rows_before
            print(f"💾 Inserted {inserted:,} new rows into stock_metrics")
        
        con.close()
        return True
        
    except Exception as e:
        print(f"❌ DuckDB save failed: {e}")
        return False

# ------------------------------------------------------------------
# Main Execution
# ------------------------------------------------------------------
async def main():
    """Ultra-optimized main execution function"""
    start_time = time.time()
    
    try:
        # Process data with full optimizations
        merged_df = await process_metrics_ultra_optimized()
        
        # Save to database if we have data
        if not merged_df.is_empty():
            success = save_to_duckdb_optimized(merged_df)
            if success:
                elapsed = time.time() - start_time
                print(f"🏁 ULTRA-OPTIMIZED processing completed in {elapsed:.1f}s")
                print(f"📈 Performance: {len(merged_df)/elapsed:.1f} rows/second")
            else:
                print("❌ Failed to save to database")
        else:
            print("ℹ️ No new data to process")
            
    except Exception as e:
        print(f"❌ Ultra-optimization failed: {e}")
        import traceback
        traceback.print_exc()

# Execute the ultra-optimized version
if __name__ == "__main__":
    # For notebook execution
    try:
        # Check if we're in a notebook with existing event loop
        loop = asyncio.get_running_loop()
        # Create task in existing loop
        task = loop.create_task(main())
        # For notebook - you might need to await this differently
        print("🔄 Ultra-optimized processing started in existing event loop...")
        merged_df = None  # Will be set by the async process
    except RuntimeError:
        # No existing event loop, create new one
        merged_df = asyncio.run(main())

🔄 Ultra-optimized processing started in existing event loop...


🚀 Starting ULTRA-OPTIMIZED metrics processing...
📊 Processing 11,801 symbols with advanced optimizations
📅 Existing max date: 2025-07-05T00:00:00
🎯 Threshold date: 2025-04-01
🔄 Processing batch 1/6 (2000 symbols)
✅ Batch completed: 0 successful frames in -12.3s (-0.0 frames/sec)
🔄 Processing batch 2/6 (2000 symbols)
✅ Batch completed: 0 successful frames in -13.8s (-0.0 frames/sec)
🔄 Processing batch 3/6 (2000 symbols)
✅ Batch completed: 0 successful frames in 47.1s (0.0 frames/sec)
🔄 Processing batch 4/6 (2000 symbols)
✅ Batch completed: 0 successful frames in 63.5s (0.0 frames/sec)
🔄 Processing batch 5/6 (2000 symbols)
✅ Batch completed: 0 successful frames in 64.5s (0.0 frames/sec)
🔄 Processing batch 6/6 (1801 symbols)
✅ Batch completed: 0 successful frames in 55.0s (0.0 frames/sec)
❌ No data retrieved
ℹ️ No new data to process


In [21]:
# Load to DuckDB (incremental append on (symbol,date,fiscalYear,period))
import polars as pl
if 'merged_df' in globals() and isinstance(merged_df, pl.DataFrame) and not merged_df.is_empty():
    import duckdb
    con = duckdb.connect('../americas.db')
    # Ensure base types for key columns
    casts = []
    if 'symbol' in merged_df.columns: casts.append(pl.col('symbol').cast(pl.Utf8, strict=False))
    if 'date' in merged_df.columns: casts.append(pl.col('date').cast(pl.Date, strict=False))
    if 'fiscalYear' in merged_df.columns: casts.append(pl.col('fiscalYear').cast(pl.Int64, strict=False))
    if 'period' in merged_df.columns: casts.append(pl.col('period').cast(pl.Utf8, strict=False))
    merged_df = merged_df.with_columns(casts) if casts else merged_df

    con.register('metrics_new', merged_df.to_pandas())
    # Create table if not exists
    con.execute("CREATE TABLE IF NOT EXISTS stock_metrics AS SELECT * FROM metrics_new LIMIT 0")

    # Determine natural key columns present
    key_cols = [c for c in ('symbol','date','fiscalYear','period') if c in merged_df.columns]
    if not key_cols:  # fallback
        key_cols = [c for c in ('symbol','date') if c in merged_df.columns]

    # Build NOT EXISTS predicate
    predicate = ' AND '.join([f"m.{c} = n.{c}" for c in key_cols]) if key_cols else '1=0'

    insert_sql = f"""
        INSERT INTO stock_metrics
        SELECT n.* FROM metrics_new n
        WHERE NOT EXISTS (
            SELECT 1 FROM stock_metrics m WHERE {predicate}
        )
    """
    try:
        con.execute(insert_sql)
        new_rows = con.execute("SELECT COUNT(*) FROM metrics_new n WHERE NOT EXISTS (SELECT 1 FROM stock_metrics m WHERE " + predicate + ")").fetchone()[0] if False else None
    except Exception as e:
        print(f"Incremental stock_metrics insert failed: {e}")
    finally:
        con.close()
else:
    print("No new key stock_metrics/ratios data to load.")

No new key stock_metrics/ratios data to load.


## Exchanges → DuckDB

Fetch all exchanges, keep only Americas via heuristic, and persist to americas.db.exchanges.

In [22]:
# Exchanges → DuckDB (incremental)
import os, requests, polars as pl
API_KEY=os.getenv("FMP_API_KEY"); _params={"apikey":API_KEY} if API_KEY else {}

try:
    r=requests.get("https://financialmodelingprep.com/stable/available-exchanges", params=_params, timeout=60); r.raise_for_status(); ex_data=r.json(); ex_data=ex_data if isinstance(ex_data,list) else []
except Exception: ex_data=[]

CTRY={"UNITED STATES","CANADA","BRAZIL","MEXICO","ARGENTINA","CHILE","PERU","COLOMBIA","VENEZUELA","URUGUAY","PARAGUAY","BOLIVIA","ECUADOR","GUYANA","SURINAME","FRENCH GUIANA","JAMAICA","TRINIDAD AND TOBAGO","TRINIDAD & TOBAGO","BARBADOS","BAHAMAS","BERMUDA","CAYMAN ISLANDS","PANAMA","COSTA RICA","GUATEMALA","HONDURAS","EL SALVADOR","NICARAGUA","DOMINICAN REPUBLIC","HAITI","PUERTO RICO","BELIZE","CURACAO","ARUBA","SAINT LUCIA","GRENADA","ST. VINCENT AND THE GRENADINES"}
ISO2={"US","CA","BR","MX","AR","CL","PE","CO","VE","UY","PY","BO","EC","GY","SR","GF","JM","TT","BB","BS","BM","KY","PA","CR","GT","HN","SV","NI","DO","H","PR","BZ","LC","GD","VC","CW","AW"}

def is_amer_exchange(rec: dict)->bool:
    for k in ("region","continent"):
        v=rec.get(k)
        if isinstance(v,str) and "AMERICA" in v.upper(): return True
    for k in ("countryName","country","countryCode","country_code","country_iso2"):
        v=rec.get(k); vu=v.strip().upper() if isinstance(v,str) else ""
        if vu and ((len(vu)<=3 and vu in ISO2) or vu in CTRY or "UNITED STATES" in vu or "LATIN AMERICA" in vu): return True
    return False

ex_df = pl.DataFrame([rec for rec in ex_data if isinstance(rec,dict) and is_amer_exchange(rec)])
if not ex_df.is_empty():
    ren={}; ren["countryName"]="country" if "countryName" in ex_df.columns and "country" not in ex_df.columns else None; ren={k:v for k,v in ren.items() if v}
    ex_df = ex_df.rename(ren) if ren else ex_df
    ex_df = ex_df.with_columns([pl.all().cast(pl.Utf8, strict=False)])

print(ex_df.shape); print(ex_df.head())

if not ex_df.is_empty():
    import duckdb
    con=duckdb.connect('../americas.db')
    con.register('exchanges_new', ex_df.to_pandas())
    con.sql("""
        CREATE TABLE IF NOT EXISTS exchanges AS
        SELECT * FROM exchanges_new LIMIT 0
    """)
    
    # Check which columns exist in the dataframe
    columns = ex_df.columns
    
    # Use exchange as the primary key since it's always present
    if 'exchange' in columns:
        con.sql("""
            INSERT INTO exchanges
            SELECT e.* FROM exchanges_new e
            WHERE NOT EXISTS (
                SELECT 1 FROM exchanges x
                WHERE x.exchange = e.exchange
            )
        """)
    con.close()
else:
    print("No Americas exchanges found from API; skipping DuckDB load.")

(14, 6)
shape: (5, 6)
┌──────────┬───────────────────────────┬──────────────────┬─────────────┬──────────────┬───────────┐
│ exchange ┆ name                      ┆ country          ┆ countryCode ┆ symbolSuffix ┆ delay     │
│ ---      ┆ ---                       ┆ ---              ┆ ---         ┆ ---          ┆ ---       │
│ str      ┆ str                       ┆ str              ┆ str         ┆ str          ┆ str       │
╞══════════╪═══════════════════════════╪══════════════════╪═════════════╪══════════════╪═══════════╡
│ AMEX     ┆ New York Stock Exchange   ┆ United States of ┆ US          ┆ N/A          ┆ Real-time │
│          ┆ Arca                      ┆ America          ┆             ┆              ┆           │
│ BUE      ┆ Buenos Aires Stock        ┆ Argentina        ┆ AR          ┆ .BA          ┆ 20 min    │
│          ┆ Exchange                  ┆                  ┆             ┆              ┆           │
│ BVC      ┆ Colombia Stock Exchange   ┆ Colombia         ┆ CO       

## index_list → DuckDB

Fetch index list, keep those on Americas exchanges, then persist as americas.db.index_list.

In [13]:
# index_list → DuckDB (incremental)
import os, requests, polars as pl
API_KEY=os.getenv("FMP_API_KEY"); _params={"apikey":API_KEY} if API_KEY else {}

try:
    r=requests.get("https://financialmodelingprep.com/stable/index-list", params=_params, timeout=60); r.raise_for_status(); idx_data=r.json(); idx_data=idx_data if isinstance(idx_data,list) else []
except Exception: idx_data=[]

amer_exchanges = {e.upper() for e in EXCHANGES_AMERICAS} if 'EXCHANGES_AMERICAS' in globals() and EXCHANGES_AMERICAS else {"NYSE","NASDAQ","AMEX","ARCX","NYS","NAS","ARC","BATS","OTC","TSX","TSXV","CSE","XTSE","XTSX","CHIC","B3","BMFBOVESPA","BOV","XBOV","XMEX","BMV"}
idx_df = pl.DataFrame([rec for rec in idx_data if isinstance(rec,dict) and str(rec.get("exchange","" )).upper() in amer_exchanges])
if not idx_df.is_empty(): idx_df=idx_df.with_columns([pl.all().cast(pl.Utf8, strict=False)])

print(idx_df.shape); print(idx_df.head())

if not idx_df.is_empty():
    import duckdb
    con=duckdb.connect('../americas.db')
    con.register('index_list_new', idx_df.to_pandas())
    con.sql("""
        CREATE TABLE IF NOT EXISTS index_list AS
        SELECT * FROM index_list_new LIMIT 0
    """)
    # Natural key = symbol (assumed unique for index_list list)
    con.sql("""
        INSERT INTO index_list
        SELECT i.* FROM index_list_new i
        WHERE NOT EXISTS (SELECT 1 FROM index_list x WHERE x.symbol = i.symbol)
    """)
    con.close()
else:
    print("No Americas index_list found from API; skipping DuckDB load.")

(36, 4)
shape: (5, 4)
┌────────┬─────────────────────────────────┬──────────┬──────────┐
│ symbol ┆ name                            ┆ exchange ┆ currency │
│ ---    ┆ ---                             ┆ ---      ┆ ---      │
│ str    ┆ str                             ┆ str      ┆ str      │
╞════════╪═════════════════════════════════╪══════════╪══════════╡
│ ^TTIN  ┆ S&P/TSX Capped Industrials Ind… ┆ TSX      ┆ CAD      │
│ ^NYA   ┆ NYSE Composite                  ┆ NYSE     ┆ USD      │
│ ^XAX   ┆ NYSE American Composite Index   ┆ NYSE     ┆ USD      │
│ ^NYITR ┆ NYSE International 100 Index    ┆ NYSE     ┆ USD      │
│ ^DJU   ┆ Dow Jones Utility Average       ┆ NASDAQ   ┆ USD      │
└────────┴─────────────────────────────────┴──────────┴──────────┘


## Index Quotes (historical light) → DuckDB

Parallel fetch daily light stock_quotes for Americas index_list since 2010‑01‑01 and write to americas.db.index_quotes (unique by symbol,date).

In [14]:
# Index stock_quotes (light) → DuckDB
import os, time, requests, polars as pl, concurrent.futures as cf
from io import StringIO

API_KEY=os.getenv("FMP_API_KEY"); _params={"apikey": API_KEY} if API_KEY else {}
BASE_URL="https://financialmodelingprep.com/stable/historical-price-eod/light"

# 1) Gather index symbols
index_symbols = sorted({s for s in idx_df.get_column('symbol').to_list() if isinstance(s,str) and s.strip()}) if 'idx_df' in globals() and isinstance(idx_df, pl.DataFrame) and not idx_df.is_empty() else []
if not index_symbols:
    try:
        import duckdb; con=duckdb.connect('../americas.db'); res=con.sql("SELECT symbol FROM index_list").fetchall(); con.close(); index_symbols=sorted({r[0] for r in res if isinstance(r[0],str) and r[0].strip()})
    except Exception: index_symbols=[]
print(f"Index symbols to fetch: {len(index_symbols)}")

# 2) Fetch per symbol
def fetch_index_quotes(symbol: str, start: str="2010-01-01", max_retries: int=3, sleep_base: float=0.4)->pl.DataFrame:
    q={"symbol":symbol,"from":start, **_params}; a=0
    while True:
        try:
            r=requests.get(BASE_URL, params=q, timeout=90); s=r.status_code
            if s in (429,) or s>=500:
                if a<max_retries: time.sleep(sleep_base*(2**a)); a+=1; continue
            if s==403: return pl.DataFrame()
            r.raise_for_status(); data=r.json(); data=[data] if isinstance(data,dict) else data
            if not isinstance(data,list) or not data: return pl.DataFrame()
            df=pl.DataFrame(data)
            casts=[]
            if "date" in df.columns: casts.append(pl.col("date").cast(pl.Utf8).str.to_date("%Y-%m-%d"))
            if "price" in df.columns: casts.append(pl.col("price").cast(pl.Float64, strict=False))
            if "volume" in df.columns: casts.append(pl.col("volume").cast(pl.Int64, strict=False))
            df=df.with_columns(casts) if casts else df
            if "symbol" not in df.columns: df=df.with_columns([pl.lit(symbol).alias("symbol")])
            return df
        except Exception:
            if a<max_retries: time.sleep(sleep_base*(2**a)); a+=1; continue
            return pl.DataFrame()

# 3) Parallel fetch
frames=[]
if index_symbols:
    max_workers=min(16, max(4, (os.cpu_count() or 8)))
    with cf.ThreadPoolExecutor(max_workers=max_workers) as ex:
        for fut in cf.as_completed({ex.submit(fetch_index_quotes, s): s for s in index_symbols}):
            df=fut.result(); frames.append(df) if isinstance(df,pl.DataFrame) and not df.is_empty() else None

index_quotes_df=pl.DataFrame() if not frames else pl.concat(frames, how="vertical_relaxed")

# Unique by (symbol,date)
if not index_quotes_df.is_empty():
    keep=[c for c in index_quotes_df.columns if c in {"symbol","date","price","volume"}] or index_quotes_df.columns
    index_quotes_df=index_quotes_df.select(keep).unique(subset=["symbol","date"], keep="last").sort(["symbol","date"])

print(index_quotes_df.shape); print(index_quotes_df.head())

# 4) Load to DuckDB
if not index_quotes_df.is_empty():
    import duckdb
    con=duckdb.connect('../americas.db'); con.sql("CREATE OR REPLACE TABLE index_quotes AS SELECT * FROM index_quotes_df"); con.close()
else:
    print("No index stock_quotes fetched; skipping DuckDB load.")

Index symbols to fetch: 36
(111524, 4)
shape: (5, 4)
┌─────────┬────────────┬────────────┬───────────┐
│ symbol  ┆ date       ┆ price      ┆ volume    │
│ ---     ┆ ---        ┆ ---        ┆ ---       │
│ str     ┆ date       ┆ f64        ┆ i64       │
╞═════════╪════════════╪════════════╪═══════════╡
│ TX60.TS ┆ 2020-11-03 ┆ 948.200012 ┆ 94504821  │
│ TX60.TS ┆ 2020-11-04 ┆ 952.299988 ┆ 179295563 │
│ TX60.TS ┆ 2020-11-05 ┆ 968.52002  ┆ 127428692 │
│ TX60.TS ┆ 2020-11-06 ┆ 966.869995 ┆ 118405024 │
│ TX60.TS ┆ 2020-11-09 ┆ 980.429993 ┆ 246182781 │
└─────────┴────────────┴────────────┴───────────┘


## Market Risk Premium → DuckDB

Fetch market risk premium data for all countries, filter to Americas region, and persist as americas.db.risk_premium.
- API endpoint: https://financialmodelingprep.com/stable/market-risk-premium
- Returns countryRiskPremium and totalEquityRiskPremium by country
- Americas countries filtered using same logic as exchanges

In [15]:
# Market Risk Premium → DuckDB (incremental)
import os, requests, polars as pl
API_KEY=os.getenv("FMP_API_KEY"); _params={"apikey":API_KEY} if API_KEY else {}

try:
    r=requests.get("https://financialmodelingprep.com/stable/market-risk-premium", params=_params, timeout=60)
    r.raise_for_status()
    risk_data=r.json()
    risk_data=risk_data if isinstance(risk_data,list) else []
    print(f"Fetched {len(risk_data)} countries from Market Risk Premium API")
except Exception as e:
    print(f"Failed to fetch market risk premium data: {e}")
    risk_data=[]

# Define Americas countries (same logic as used in exchanges filter)
CTRY_AMERICAS={"UNITED STATES","CANADA","BRAZIL","MEXICO","ARGENTINA","CHILE","PERU","COLOMBIA","VENEZUELA","URUGUAY","PARAGUAY","BOLIVIA","ECUADOR","GUYANA","SURINAME","FRENCH GUIANA","JAMAICA","TRINIDAD AND TOBAGO","TRINIDAD & TOBAGO","BARBADOS","BAHAMAS","BERMUDA","CAYMAN ISLANDS","PANAMA","COSTA RICA","GUATEMALA","HONDURAS","EL SALVADOR","NICARAGUA","DOMINICAN REPUBLIC","HAITI","PUERTO RICO","BELIZE","CURACAO","ARUBA","SAINT LUCIA","GRENADA","ST. VINCENT AND THE GRENADINES"}
ISO2_AMERICAS={"US","CA","BR","MX","AR","CL","PE","CO","VE","UY","PY","BO","EC","GY","SR","GF","JM","TT","BB","BS","BM","KY","PA","CR","GT","HN","SV","NI","DO","H","PR","BZ","LC","GD","VC","CW","AW"}

def is_americas_country(rec: dict)->bool:
    """Check if country record is in Americas region"""
    country = rec.get("country", "").strip().upper()
    continent = rec.get("continent", "").strip().upper()
    
    # Check continent first
    if continent and "AMERICA" in continent:
        return True
    
    # Check country name
    if country and (
        country in CTRY_AMERICAS or
        any(americas_country in country for americas_country in ["UNITED STATES", "BRAZIL", "CANADA", "MEXICO"]) or
        (len(country) <= 3 and country in ISO2_AMERICAS)
    ):
        return True
    
    return False

# Filter to Americas countries only
americas_risk_data = [rec for rec in risk_data if isinstance(rec, dict) and is_americas_country(rec)]
print(f"Filtered to {len(americas_risk_data)} Americas countries")

if americas_risk_data:
    # Show sample of what we found
    print("Sample Americas risk premium data:")
    for item in americas_risk_data[:5]:  # Show first 5
        print(f"  {item.get('country', 'N/A')}: Country Risk={item.get('countryRiskPremium', 'N/A')}%, Total Equity Risk={item.get('totalEquityRiskPremium', 'N/A')}%")

# Create DataFrame
risk_premium_df = pl.DataFrame(americas_risk_data) if americas_risk_data else pl.DataFrame()

if not risk_premium_df.is_empty():
    # Ensure proper data types
    casts = []
    if "countryRiskPremium" in risk_premium_df.columns:
        casts.append(pl.col("countryRiskPremium").cast(pl.Float64, strict=False))
    if "totalEquityRiskPremium" in risk_premium_df.columns:
        casts.append(pl.col("totalEquityRiskPremium").cast(pl.Float64, strict=False))
    if "country" in risk_premium_df.columns:
        casts.append(pl.col("country").cast(pl.Utf8, strict=False))
    if "continent" in risk_premium_df.columns:
        casts.append(pl.col("continent").cast(pl.Utf8, strict=False))
    
    risk_premium_df = risk_premium_df.with_columns(casts) if casts else risk_premium_df

print(f"Risk premium DataFrame shape: {risk_premium_df.shape}")
if not risk_premium_df.is_empty():
    print(risk_premium_df.head())

Fetched 186 countries from Market Risk Premium API
Filtered to 44 Americas countries
Sample Americas risk premium data:
  Venezuela: Country Risk=23.59%, Total Equity Risk=27.92%
  Uruguay: Country Risk=2.13%, Total Equity Risk=6.46%
  United States: Country Risk=0%, Total Equity Risk=4.33%
  Turks & Caicos Islands: Country Risk=2.13%, Total Equity Risk=6.46%
  Trinidad and Tobago: Country Risk=4.02%, Total Equity Risk=8.35%
Risk premium DataFrame shape: (44, 4)
shape: (5, 4)
┌────────────────────────┬───────────────┬────────────────────┬────────────────────────┐
│ country                ┆ continent     ┆ countryRiskPremium ┆ totalEquityRiskPremium │
│ ---                    ┆ ---           ┆ ---                ┆ ---                    │
│ str                    ┆ str           ┆ f64                ┆ f64                    │
╞════════════════════════╪═══════════════╪════════════════════╪════════════════════════╡
│ Venezuela              ┆ South America ┆ 23.59              ┆ 27.92     

In [16]:
# Load risk premium data to DuckDB (incremental by country)
if 'risk_premium_df' in globals() and isinstance(risk_premium_df, pl.DataFrame) and not risk_premium_df.is_empty():
    import duckdb
    con = duckdb.connect('../americas.db')
    
    # Register the new data
    con.register('risk_premium_new', risk_premium_df.to_pandas())
    
    # Create table if not exists
    con.sql("""
        CREATE TABLE IF NOT EXISTS risk_premium AS
        SELECT * FROM risk_premium_new LIMIT 0
    """)
    
    # Insert only new countries (unique by country name)
    con.sql("""
        INSERT INTO risk_premium
        SELECT rp.* FROM risk_premium_new rp
        WHERE NOT EXISTS (
            SELECT 1 FROM risk_premium x 
            WHERE UPPER(TRIM(x.country)) = UPPER(TRIM(rp.country))
        )
    """)
    
    # Check how many countries we have in total
    total_countries = con.sql("SELECT COUNT(*) FROM risk_premium").fetchone()[0]
    new_countries = con.sql("""
        SELECT COUNT(*) FROM risk_premium_new rp
        WHERE NOT EXISTS (
            SELECT 1 FROM risk_premium x 
            WHERE UPPER(TRIM(x.country)) = UPPER(TRIM(rp.country))
        )
    """).fetchone()[0] if False else None  # Skip this count for performance
    
    print(f"Risk premium table now contains {total_countries} Americas countries")
    con.close()
else:
    print("No risk premium data to load.")

# Show what's in the table
try:
    con = duckdb.connect('../americas.db')
    if con.execute("SELECT 1 FROM information_schema.tables WHERE table_name='risk_premium'").fetchone():
        sample_data = con.sql("SELECT * FROM risk_premium ORDER BY country LIMIT 10").fetchall()
        print(f"\nSample risk premium data ({len(sample_data)} of total):")
        for row in sample_data:
            print(f"  {row[0]} ({row[1]}): Country Risk={row[2]}%, Total Equity Risk={row[3]}%")
    con.close()
except Exception as e:
    print(f"Could not show sample data: {e}")

Risk premium table now contains 44 Americas countries

Sample risk premium data (10 of total):
  Anguilla (North America): Country Risk=8.11%, Total Equity Risk=12.44%
  Antigua and Barbuda (North America): Country Risk=8.1%, Total Equity Risk=12.43%
  Argentina (South America): Country Risk=16.02%, Total Equity Risk=20.35%
  Aruba (North America): Country Risk=2.93%, Total Equity Risk=7.26%
  Bahamas (North America): Country Risk=6.01%, Total Equity Risk=10.34%
  Barbados (North America): Country Risk=8.68%, Total Equity Risk=13.01%
  Belize (North America): Country Risk=10.01%, Total Equity Risk=14.34%
  Bermuda (North America): Country Risk=1.13%, Total Equity Risk=5.46%
  Bolivia (South America): Country Risk=13.35%, Total Equity Risk=17.68%
  Brazil (South America): Country Risk=3.34%, Total Equity Risk=7.67%


## Verify database tables and record counts

In [17]:
# Verify database tables and record counts
import duckdb
con = duckdb.connect('../americas.db')
tables = con.sql("SHOW TABLES").fetchall()
print("Tables in the database:")
for table in tables:
    count = con.sql(f"SELECT COUNT(*) AS count FROM {table[0]}").fetchone()[0]
    print(f"  - {table[0]}: {count:,} rows")

# Check the latest dates in the stock_quotes and index_quotes tables
if any(t[0] == 'stock_quotes' for t in tables):
    last_quote_date = con.sql("SELECT MAX(date) FROM stock_quotes").fetchone()[0]
    print(f"Latest quote date: {last_quote_date}")

if any(t[0] == 'index_quotes' for t in tables):
    last_index_date = con.sql("SELECT MAX(date) FROM index_quotes").fetchone()[0] 
    print(f"Latest index quote date: {last_index_date}")

# Show sample of risk premium data if table exists
if any(t[0] == 'risk_premium' for t in tables):
    risk_premium_count = con.sql("SELECT COUNT(*) FROM risk_premium").fetchone()[0]
    print(f"Risk premium countries: {risk_premium_count}")
    if risk_premium_count > 0:
        sample_risk = con.sql("SELECT country, countryRiskPremium, totalEquityRiskPremium FROM risk_premium ORDER BY country LIMIT 3").fetchall()
        print("Sample risk premium data:")
        for row in sample_risk:
            print(f"  {row[0]}: Country Risk={row[1]}%, Total Equity Risk={row[2]}%")

con.close()

Tables in the database:
  - exchanges: 14 rows
  - index_list: 36 rows
  - index_quotes: 111,524 rows
  - risk_premium: 44 rows
  - stock_metrics: 525,189 rows
  - stock_profiles: 11,859 rows
  - stock_quotes: 34,094,058 rows
  - stock_tickers: 11,859 rows
Latest quote date: 2025-09-16 00:00:00
Latest index quote date: 2025-09-16
Risk premium countries: 44
Sample risk premium data:
  Anguilla: Country Risk=8.11%, Total Equity Risk=12.44%
  Antigua and Barbuda: Country Risk=8.1%, Total Equity Risk=12.43%
  Argentina: Country Risk=16.02%, Total Equity Risk=20.35%
