# DataUSA Population Ingest (Raw Landing) with Resilient Fetch + Cached Fallback

This notebook ingests **U.S. population** data from the DataUSA Tesseract API and lands the response as a **raw JSON artifact** in the lakehouse.

- **Source API:** `https://honolulu-api.datausa.io/tesseract/data.jsonrecords`
- **Dataset:** `acs_yg_total_population_1` (drilldowns: `Year,Nation`; measure: `Population`)
- **Raw target:** `/Volumes/rearc_quest/lakehouse/raw_datausa/population.json`

## Why this pattern
This is a production-style ingestion design optimized for **reliability and idempotency**:

### 1) API-first, cache-on-failure
- The job **always attempts the API first**.
- If the API fails after retries, it **falls back to the last successfully saved file** (if present) so downstream stages can still proceed.
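
The control flow described above can be sketched without Databricks or a live API. This is a minimal illustration, not the notebook's implementation: `ingest` and `failing_fetch` are hypothetical names, the `fetch` callable stands in for the HTTP request, and a temporary local directory stands in for the Volumes path.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def ingest(fetch: Callable[[], dict], target: Path) -> str:
    """Try the API first; on failure, keep the last saved file if one exists.

    `fetch` is a hypothetical stand-in for the HTTP call. The return value is
    the mode string that the run metadata would record.
    """
    try:
        payload = fetch()
    except Exception as exc:
        if target.exists():
            # Leave the cached snapshot untouched so downstream stages can proceed.
            return "fallback_cached"
        # First run with no cache: there is nothing safe to fall back to.
        raise RuntimeError("API failed and no cached file exists") from exc
    # Success: overwrite the single "latest" snapshot.
    target.write_text(json.dumps(payload))
    return "api_success"


def failing_fetch() -> dict:
    """Simulate an outage that persists after all retries."""
    raise ConnectionError("simulated outage")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "population.json"
    first = ingest(lambda: {"data": [{"Year": 2023}]}, target)  # fresh fetch
    second = ingest(failing_fetch, target)                      # outage, cache present
    cached = json.loads(target.read_text())

print(first, second)
```

The second call leaves the file written by the first call in place, which is what lets downstream stages keep reading the same path during an outage.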

### 2) Retry/backoff for transient failures
- Uses an HTTP session configured with retries and exponential backoff for common transient conditions: connection errors, **429** (throttling), and **500/502/503/504** server errors.
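
To make the retry idea concrete, here is a small, library-free sketch of retrying on transient status codes with delays that double each time. It is a generic schedule for illustration only; the exact timing that `urllib3` applies may differ. `backoff_delays` and `call_with_retries` are hypothetical helpers, and `send` stands in for the HTTP request.

```python
import time
from typing import Callable, List

# Status codes treated as transient, matching the notebook's status_forcelist.
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}


def backoff_delays(retries: int, factor: float) -> List[float]:
    """Delays that double on each retry: factor, 2*factor, 4*factor, ..."""
    return [factor * (2 ** i) for i in range(retries)]


def call_with_retries(
    send: Callable[[], int],
    retries: int = 2,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send()` until it returns a non-transient status or retries run out."""
    status = send()
    for delay in backoff_delays(retries, factor):
        if status not in TRANSIENT_STATUSES:
            break
        sleep(delay)  # wait before the next attempt
        status = send()
    return status


# Simulate two transient failures followed by success, recording the waits
# instead of actually sleeping.
responses = iter([503, 429, 200])
slept: List[float] = []
final_status = call_with_retries(lambda: next(responses), sleep=slept.append)
print(final_status, slept)
```

Keeping the retry count small bounds the worst-case wait, which matters when each attempt can also spend the full connect timeout.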

### 3) Observability
- Writes run metadata to `_meta/population_ingest_run.json` (overwritten on each run, so it always describes the latest run), capturing:
  - run timestamp
  - API URL + params
  - mode (`api_success` vs `fallback_cached`)
  - row count (when available)
  - error message (if any)
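
The fields listed above can be assembled and written roughly as follows. This is a sketch under stated assumptions: it writes one timestamped file per run to a local directory, whereas the notebook writes through `dbutils.fs.put` to a fixed path. The helper names and the file-naming scheme are hypothetical.

```python
import datetime as dt
import json
import tempfile
from pathlib import Path
from typing import Optional


def build_run_info(
    url: str,
    params: dict,
    mode: Optional[str],
    rows: Optional[int] = None,
    error: Optional[str] = None,
    now: Optional[dt.datetime] = None,
) -> dict:
    """Assemble the run metadata fields into one JSON-serializable dict."""
    now = now or dt.datetime.now(dt.timezone.utc)
    return {
        "run_utc": now.astimezone(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "url": url,
        "params": params,
        "mode": mode,
        "rows": rows,
        "error": error,
    }


def write_run_info(meta_dir: Path, run_info: dict) -> Path:
    """Write one file per run; the timestamped file name is a hypothetical scheme."""
    meta_dir.mkdir(parents=True, exist_ok=True)
    stamp = run_info["run_utc"].replace("-", "").replace(":", "")
    path = meta_dir / f"population_ingest_run_{stamp}.json"
    path.write_text(json.dumps(run_info, indent=2))
    return path


# A fixed timestamp keeps the example deterministic.
fixed_now = dt.datetime(2024, 1, 2, 3, 4, 5, tzinfo=dt.timezone.utc)
info = build_run_info(
    url="https://honolulu-api.datausa.io/tesseract/data.jsonrecords",
    params={"cube": "acs_yg_total_population_1"},
    mode="fallback_cached",
    rows=10,
    error="simulated timeout",
    now=fixed_now,
)
with tempfile.TemporaryDirectory() as tmp:
    written = write_run_info(Path(tmp) / "_meta", info)
    written_name = written.name
    reloaded = json.loads(written.read_text())

print(written_name)
```

Per-run files keep a history that a single overwritten file cannot, at the cost of needing some cleanup policy for old files.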

## Notes
- This is a **raw ingestion step** (bronze-aligned). It preserves the upstream JSON structure.
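
Because the raw file is replaced on every successful fetch, a light shape check before writing may be worthwhile so that an unexpected response does not replace a good snapshot. The sketch below assumes the response is a JSON object with a non-empty `data` list, which is the shape the notebook reads when counting rows; `validate_payload` is a hypothetical helper and is not part of the notebook.

```python
def validate_payload(payload: object) -> int:
    """Return the record count, or raise ValueError if the shape is unexpected.

    Assumes a top-level JSON object whose "data" key holds a non-empty list.
    """
    if not isinstance(payload, dict):
        raise ValueError(f"expected a JSON object, got {type(payload).__name__}")
    data = payload.get("data")
    if not isinstance(data, list) or not data:
        raise ValueError("payload has no non-empty 'data' list")
    return len(data)


# A payload with one record passes; the content of the record is left as-is.
row_count = validate_payload({"data": [{"Year": 2023}]})
print(row_count)
```

The check only inspects the envelope and does not alter the records, so the landed file would still preserve the upstream structure.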


In [0]:
import requests
import json
import datetime as dt
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# ------------------------------------------------------------------------------
# PURPOSE
# Fetch DataUSA population data and land the raw API response as JSON in a
# lakehouse raw zone. Uses API-first ingestion with a cached fallback strategy
# to reduce pipeline fragility during transient API outages.
# ------------------------------------------------------------------------------

# Raw landing path (single artifact representing the latest successful fetch)
TARGET_PATH = "/Volumes/rearc_quest/lakehouse/raw_datausa/population.json"

# Run metadata for observability (single file, overwritten on each run)
META_DIR  = "/Volumes/rearc_quest/lakehouse/raw_datausa/_meta"
META_PATH = f"{META_DIR}/population_ingest_run.json"

# Source API endpoint
URL = "https://honolulu-api.datausa.io/tesseract/data.jsonrecords"

# Query parameters to request population by Year and Nation
PARAMS = {
    "cube": "acs_yg_total_population_1",
    "drilldowns": "Year,Nation",
    "locale": "en",
    "measures": "Population"
}

# Good API citizenship: identify the pipeline and a contact address via User-Agent
HEADERS = {
    "User-Agent": "rearc-quest-contact: rohit.pradhan2995@gmail.com",
    "Accept": "application/json"
}

# Ensure target directories exist (Databricks Volumes paths)
dbutils.fs.mkdirs("/Volumes/rearc_quest/lakehouse/raw_datausa")
dbutils.fs.mkdirs(META_DIR)

def path_exists(p: str) -> bool:
    """
    Returns True if the path exists and is accessible. Used to confirm whether a
    cached file is available for fallback when the API is down.
    """
    try:
        dbutils.fs.ls(p)
        return True
    except Exception:
        return False

# ------------------------------------------------------------------------------
# Robust HTTP session with retries/backoff
# - handles transient throttling (429) and server errors (5xx)
# - keeps the notebook resilient without over-retrying
# ------------------------------------------------------------------------------
retry_strategy = Retry(
    total=2,                      # total retries (keep modest to avoid long waits)
    backoff_factor=2,             # exponential backoff between retries
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"]
)

adapter = HTTPAdapter(max_retries=retry_strategy)

session = requests.Session()
session.headers.update(HEADERS)
session.mount("https://", adapter)
session.mount("http://", adapter)

# ------------------------------------------------------------------------------
# Run metadata (written in finally for observability)
# ------------------------------------------------------------------------------
run_info = {
    "run_utc": dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "url": URL,
    "params": PARAMS,
    "mode": None,      # "api_success" or "fallback_cached"
    "rows": None,      # number of records in payload["data"] when known
    "error": None      # error message if API fails
}

try:
    # --------------------------------------------------------------------------
    # API-first: attempt to fetch fresh data from DataUSA
    # timeout=(connect, read) to avoid hanging
    # --------------------------------------------------------------------------
    resp = session.get(URL, params=PARAMS, timeout=(30, 60))
    resp.raise_for_status()

    payload = resp.json()

    # Persist raw response as the current "latest" snapshot.
    # overwrite=True makes reruns idempotent: each successful run replaces the file.
    dbutils.fs.put(TARGET_PATH, json.dumps(payload), overwrite=True)

    run_info["mode"] = "api_success"
    run_info["rows"] = len(payload.get("data", []))

    print(" Population API fetch succeeded")
    print("Final URL:", resp.url)
    print("Saved to:", TARGET_PATH)
    print("Rows:", run_info["rows"])

except requests.exceptions.RequestException as e:
    # --------------------------------------------------------------------------
    # If the API fails after retries, fall back to last cached payload if it exists.
    # This keeps downstream pipelines unblocked during transient outages.
    # --------------------------------------------------------------------------
    run_info["error"] = str(e)

    if path_exists(TARGET_PATH):
        run_info["mode"] = "fallback_cached"

        # Optional: read the row count from the cached payload for reporting.
        # Note: dbutils.fs.head reads only a prefix of the file (up to the
        # 2,000,000 bytes requested here). If the file is larger, the truncated
        # JSON fails to parse and rows remains None.
        try:
            cached_head = dbutils.fs.head(TARGET_PATH, 2_000_000)
            cached_payload = json.loads(cached_head)
            run_info["rows"] = len(cached_payload.get("data", []))
        except Exception:
            run_info["rows"] = None

        print(" Population API fetch failed after retries.")
        print(" Falling back to last cached file:", TARGET_PATH)
        print("Error:", str(e))
        print("Cached rows (if parsed):", run_info["rows"])

    else:
        # No cached file exists -> hard fail (correct behavior for a first run)
        raise RuntimeError(
            "Population API fetch failed after retries AND no cached population.json exists. "
            f"Error={e}"
        ) from e

finally:
    # Always write run metadata for observability (success or failure)
    dbutils.fs.put(META_PATH, json.dumps(run_info, indent=2), overwrite=True)
    print(" Wrote ingest run metadata to:", META_PATH)



 Population API fetch failed after retries.
 Falling back to last cached file: /Volumes/rearc_quest/lakehouse/raw_datausa/population.json
Error: HTTPSConnectionPool(host='honolulu-api.datausa.io', port=443): Max retries exceeded with url: /tesseract/data.jsonrecords?cube=acs_yg_total_population_1&drilldowns=Year%2CNation&locale=en&measures=Population (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fc68c573020>, 'Connection to honolulu-api.datausa.io timed out. (connect timeout=30)'))
Cached rows (if parsed): 10
Wrote 685 bytes.
 Wrote ingest run metadata to: /Volumes/rearc_quest/lakehouse/raw_datausa/_meta/population_ingest_run.json


In [0]:
from pyspark.sql.functions import explode

# Read the nested JSON with multiLine option
df_raw = spark.read.option("multiLine", "true").json("/Volumes/rearc_quest/lakehouse/raw_datausa/population.json")

# Extract the nested 'data' array and explode it
df = df_raw.select(explode("data").alias("record")).select("record.*")

display(df)

Nation,Nation ID,Population,Year
United States,01000US,316128839,2013
United States,01000US,318857056,2014
United States,01000US,321418821,2015
United States,01000US,323127515,2016
United States,01000US,325719178,2017
United States,01000US,327167439,2018
United States,01000US,328239523,2019
United States,01000US,331893745,2021
United States,01000US,333287562,2022
United States,01000US,334914896,2023
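
For a quick check without Spark, the same flattening can be sketched in plain Python. The sample records copy the first two rows of the output above; the wrapping object with a `data` key and the value types are assumptions about the raw file.

```python
import json

# Stand-in for the contents of population.json (two rows only).
raw = json.dumps({
    "data": [
        {"Nation": "United States", "Nation ID": "01000US",
         "Population": 316128839, "Year": 2013},
        {"Nation": "United States", "Nation ID": "01000US",
         "Population": 318857056, "Year": 2014},
    ]
})

payload = json.loads(raw)
records = payload["data"]          # plays the role of explode("data")
columns = sorted(records[0])       # alphabetical, like the displayed columns
rows = [[rec[col] for col in columns] for rec in records]

print(",".join(columns))
for row in rows:
    print(",".join(str(value) for value in row))
```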
