global_data_storage

Centralized financial data layer for trading systems.

| Subsystem | Source | Status |
|---|---|---|
| prices (daily OHLCV) | yfinance (primary) + Stooq CSV (IT fallback) | implemented |
| sentiment (superinvestor 13F) | dataroma.com (HTML scraping) | implemented |
| viz (interactive Plotly charts) | reads from prices via the public API | implemented |
| fundamentals | TBD (yfinance / FMP / EODHD) | scaffold only; see DESIGN.md |
| macro | TBD (FRED / ECB / ISTAT / Eurostat) | scaffold only; see DESIGN.md |

Architecture & design decisions: ARCHITECTURE.md.


Quickstart

uv sync --extra dev                                # install
uv run pytest                                      # run the test suite
uv run python examples/external_system_usage.py    # live end-to-end demo

The demo is incremental: the first run downloads ~5y of OHLCV for the seeded universe and scrapes one investor; subsequent runs are no-ops unless new bars or filings exist.

Public API

External trading systems must depend only on these symbols.

# Prices — subsystem 1
from global_data_storage.prices import (
    get_ohlcv,         # (ticker, start, end, freq="1d") -> pd.DataFrame
    ensure_available,  # (ticker) -> bool   # idempotent gate
    list_universe,     # () -> list[str]
    last_update,       # (ticker) -> date | None
)

# Sentiment — subsystem 4
from global_data_storage.sentiment import (
    get_holdings,      # (investor) -> pd.DataFrame
    get_recent_moves,  # (since: date, action: str | None = None) -> pd.DataFrame
    get_consensus,     # (ticker) -> dict
    list_investors,    # () -> list[str]
)

Tickers are canonical (no .MI suffix). The yfinance source rewrites ENI → ENI.MI internally; the DB and API never expose the suffix.
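
The suffix handling can be pictured as a pair of small translation helpers at the source boundary. The sketch below is illustrative only: `to_yfinance_symbol`, `to_canonical`, and the `ITALIAN_TICKERS` set are hypothetical names rather than the package's internals, and it assumes that only Borsa Italiana listings need the `.MI` suffix.

```python
# Hypothetical sketch; the real yfinance source may resolve suffixes differently.
ITALIAN_TICKERS = frozenset({"ENI", "ENEL", "ISP", "UCG"})  # assumed subset of the universe
MI_SUFFIX = ".MI"


def to_yfinance_symbol(ticker: str) -> str:
    """Map a canonical ticker to the symbol the upstream source expects."""
    return ticker + MI_SUFFIX if ticker in ITALIAN_TICKERS else ticker


def to_canonical(symbol: str) -> str:
    """Strip the exchange suffix so stored tickers stay canonical."""
    return symbol.removesuffix(MI_SUFFIX)  # requires Python 3.9+
```

Keeping both directions in one place means the suffix never has to leak into the DB or the public API.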

The ensure_available pattern is the recommended gate — call it before any read, and the underlying data will be fetched if missing or stale:

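from datetime import date

from global_data_storage.prices import ensure_available, get_ohlcv

# Example window; passing date objects is an assumption about the signature.
start, end = date(2025, 1, 1), date(2025, 12, 31)
if ensure_available("ENI"):
    df = get_ohlcv("ENI", start, end)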
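
To make the "idempotent gate" idea concrete, here is a minimal in-memory sketch of how such a gate could behave. It is not the package's implementation: `ensure_available_sketch`, `_fetch`, the `_last_bar` dictionary, and the one-day staleness threshold are all assumptions chosen for illustration.

```python
from datetime import date, timedelta

# Hypothetical stand-in for the store: ticker -> date of the last stored bar.
_last_bar: dict[str, date] = {}
fetch_calls: list[str] = []  # records every simulated download


def _fetch(ticker: str, today: date) -> bool:
    """Pretend download; a real source would hit the network here."""
    fetch_calls.append(ticker)
    _last_bar[ticker] = today
    return True


def ensure_available_sketch(
    ticker: str, today: date, max_age: timedelta = timedelta(days=1)
) -> bool:
    """Return True when data is usable, fetching only if missing or stale."""
    last = _last_bar.get(ticker)
    if last is not None and today - last <= max_age:
        return True  # fresh enough: repeated calls do no extra work
    return _fetch(ticker, today)
```

Calling the gate twice on the same day triggers a single fetch, which is the property that makes it safe to call before every read.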

Configuration

Editable hand-curated artefacts:

  • config/universe.yaml — tracked tickers (5 IT seed + 11 SPDR; populate FTSE MIB via scripts/seed_universe_it.py).

Environment variables (all optional; see .env.example):

| Var | Default | Purpose |
|---|---|---|
| GDS_DB_PATH | ~/.global_data_storage/store.duckdb | DuckDB store path |
| GDS_HTTP_CACHE_PATH | ~/.global_data_storage/http_cache | scraping cache |
| GDS_LOG_FORMAT | console | console (dev) or json (prod) |
| GDS_LOG_LEVEL | INFO | DEBUG / INFO / WARNING / ERROR |
| GDS_CONTACT_EMAIL | "" | contact appended to scraping User-Agent |
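
As a rough illustration of how these variables and their defaults fit together, the following sketch reads them from a mapping. `Settings` and `load_settings` are hypothetical names; the package's own config loader in common/ may be structured differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container; the package's real config object may differ."""

    db_path: Path
    http_cache_path: Path
    log_format: str
    log_level: str
    contact_email: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the GDS_* variables, falling back to the documented defaults."""
    return Settings(
        db_path=Path(
            env.get("GDS_DB_PATH", "~/.global_data_storage/store.duckdb")
        ).expanduser(),
        http_cache_path=Path(
            env.get("GDS_HTTP_CACHE_PATH", "~/.global_data_storage/http_cache")
        ).expanduser(),
        log_format=env.get("GDS_LOG_FORMAT", "console"),
        log_level=env.get("GDS_LOG_LEVEL", "INFO"),
        contact_email=env.get("GDS_CONTACT_EMAIL", ""),
    )
```

Passing the environment as a parameter keeps the loader easy to test without touching the real process environment.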

Storage

Single DuckDB file, four schemas (prices, sentiment, fundamentals, macro). Cross-domain joins are first-class:

-- "OHLCV of stocks Buffett bought in 2025Q4"
SELECT p.*
FROM prices.equity_ohlcv p
JOIN sentiment.moves m ON m.ticker = p.ticker
WHERE m.investor_slug = 'BRK'
  AND m.action IN ('BUY', 'ADD')
  AND m.period_quarter = '2025Q4'
  AND p.date >= today() - INTERVAL 90 DAY;
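
When building such filters from Python, a quarter label in the same shape as the `'2025Q4'` literal can be derived from a date. This is a small sketch; `quarter_label` is a hypothetical helper, and it assumes `period_quarter` always uses the `YYYYQn` calendar-quarter format shown in the query.

```python
from datetime import date


def quarter_label(d: date) -> str:
    """Format a date as a calendar-quarter label such as '2025Q4'."""
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, and so on
    return f"{d.year}Q{quarter}"
```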

External trading systems should open the DB read-only:

from global_data_storage.storage import read_only

with read_only() as con:
    df = con.execute("SELECT ...").fetchdf()

Layout

config/                              # universe.yaml (hand-edited)
src/global_data_storage/
  prices/         # subsystem 1 — full
  sentiment/      # subsystem 4 — full
  fundamentals/   # subsystem 2 — DESIGN.md only
  macro/          # subsystem 3 — DESIGN.md only
  storage/        # DuckDB connection + schema
  common/         # logging, config, retry, http (cache + politeness)
scripts/
  seed_universe_it.py    # one-shot Borsa Italiana → YAML
  backup.py              # cold-copy snapshot of the DuckDB file
tests/
  fixtures/sentiment/    # captured Dataroma HTML for parser tests
examples/
  external_system_usage.py  # canonical consumer pattern

Operations

Backup

uv run python scripts/backup.py             # default: ~/.global_data_storage/backups
uv run python scripts/backup.py --keep 30   # keep last 30 snapshots
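
The cold-copy-plus-retention behaviour behind `--keep` could look roughly like the sketch below. It is not the contents of scripts/backup.py: the function names and the `store-YYYYMMDD-HHMMSS.duckdb` naming scheme are assumptions, chosen so that lexicographic order matches chronological order.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot(db_path: Path, backup_dir: Path, now: datetime) -> Path:
    """Copy the DuckDB file to a timestamped snapshot (writer must be idle)."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"store-{now:%Y%m%d-%H%M%S}.duckdb"
    shutil.copy2(db_path, target)
    return target


def prune_snapshots(backup_dir: Path, keep: int) -> list[Path]:
    """Delete all but the `keep` newest snapshots; return the removed paths."""
    snapshots = sorted(backup_dir.glob("store-*.duckdb"))
    doomed = snapshots[:-keep] if keep > 0 else snapshots
    for path in doomed:
        path.unlink()
    return doomed
```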

Refreshing data

Prices are refreshed lazily by ensure_available(ticker). Sentiment is refreshed manually, on demand (13F filings are quarterly):

from global_data_storage.sentiment.ingest import refresh_all
refresh_all()    # ~80 investors x 2 pages each, politeness-bound

Checking what's new

SELECT ticker, last_date, status, rows_added
FROM prices.update_log
ORDER BY last_run_at DESC
LIMIT 20;

Constraints

  • Single-user, single-machine, batch ingestion. DuckDB is single-writer.
  • External trading systems are read-only consumers — open the DB with read_only=True.
  • No intraday support today; the freq column on equity_ohlcv is intraday-ready, but its value is currently always '1d'.
  • Scraping politeness: 2.5s ± 0.5s per request, 24h on-disk cache, robots.txt enforced. User-Agent identifies the project.
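
The politeness numbers above translate into two simple checks. The sketch below is an assumption-laden illustration, not the code in common/: it assumes the jitter is drawn uniformly, and `next_delay` and `is_cache_fresh` are hypothetical names.

```python
import random

BASE_DELAY_S = 2.5            # documented base delay per request
JITTER_S = 0.5                # documented jitter (uniform draw is an assumption)
CACHE_TTL_S = 24 * 60 * 60    # documented 24h on-disk cache lifetime


def next_delay(rng: random.Random) -> float:
    """Draw a delay between 2.0 and 3.0 seconds for the next request."""
    return BASE_DELAY_S + rng.uniform(-JITTER_S, JITTER_S)


def is_cache_fresh(fetched_at: float, now: float) -> bool:
    """True while a cached response is younger than 24 hours (epoch seconds)."""
    return now - fetched_at < CACHE_TTL_S
```

A caller would sleep for `next_delay(...)` before each uncached request and skip the request entirely when `is_cache_fresh` returns True.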

Visualization

Local-only interactive charts (Plotly + kaleido). No web server. Each call returns a fresh go.Figure; HTML and PNG land in output/viz/.

from global_data_storage.viz import quick

# Single ticker, candles + indicators (range selector, range slider, hover OHLC)
fig = quick.candles("ENI", period="1Y", indicators=["sma_50", "sma_200", "bbands"])
fig.show()

# Multi-ticker comparison (default: cumulative log returns — honest on long horizons)
fig = quick.compare(["ENI", "ENEL", "ISP", "UCG"], mode="log_returns", period="2Y")

# Cross-ticker analytics
fig = quick.correlation_heatmap(["XLK","XLF","XLE","XLV","XLI","XLY","XLP","XLU","XLB","XLRE","XLC"], period="1Y")
fig = quick.pair_scatter("XLK", "XLY", period="1Y")             # OLS line + R² annotation
fig = quick.rolling_correlation("XLK", "XLY", window=60)        # rolling Pearson on log returns

# Sub-panel indicators (RSI, MACD, ATR)
fig = quick.overlay("ENI", indicators=["rsi_14", "macd"], period="1Y")

# Programmatic annotations (survive HTML/PNG round-trip; live drawn shapes don't)
quick.add_horizontal_level(fig, price=22.5, label="resistance")
quick.add_trendline(fig, start=("2025-01-15", 13.0), end=("2026-04-30", 24.0), label="uptrend")
quick.add_vertical_event(fig, when="2025-04-07", label="vol spike")
quick.add_fibonacci(fig, high_date="2026-04-15", high_price=25.0,
                          low_date="2026-05-08",  low_price=22.5)

# Slide-ready
quick.apply_presentation_theme(fig, title="ENI", subtitle="1Y daily")
quick.save_html(fig, "eni_1y")          # ~80 KB; opens from disk with no server, but Plotly.js loads from a CDN
quick.save_image(fig, "eni_1y_slide")   # PNG @ 1920x1080, scale=2

Design choices (Okabe-Ito palette, 16:9 default, footer and theme constants) are documented in VIZ_DESIGN.md. Five end-to-end examples are in examples/viz_examples.py.
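
The compare default of cumulative log returns is easy to verify by hand. The sketch below uses plain Python rather than the viz module; `cumulative_log_returns` is a hypothetical helper, not the function `quick.compare` calls.

```python
import math


def cumulative_log_returns(closes: list[float]) -> list[float]:
    """Running sum of log(close_t / close_{t-1}), starting at 0.0."""
    out = [0.0]
    for prev, cur in zip(closes, closes[1:]):
        out.append(out[-1] + math.log(cur / prev))
    return out
```

Log returns are additive: a series that rises from 100 to 110 and falls back to 100 ends at a cumulative value of zero, whereas summing simple returns (+10%, then about -9.09%) would not. That additivity is one reason log returns make a fair basis for comparing tickers over long horizons.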

Dashboard (control panel)

Local Flask app: refresh prices and generate any chart in the viz module without writing a Python script. Loopback only (127.0.0.1); the browser opens automatically.

uv run python -m global_data_storage.dashboard   # or: gds-dashboard

On Windows: double-click dashboard.bat instead — it sets up PATH, launches the server, and your browser opens automatically.

What it gives you:

  • Sidebar: every active ticker from config/universe.yaml with its last_update date. Tick the ones you want, click Refresh selected — the dashboard runs prices.ensure_available(...) for each (incremental download, no duplicate work).
  • Tabs for every chart type: Candles, Compare, Correlation heatmap, Pair scatter, Rolling correlation, Overlay. Each form picks tickers from the universe (single dropdown or multi-select); you can't request a ticker that isn't in universe.yaml — to add one, edit the YAML and reload the page.
  • Compare tab: pick ≥ 2 tickers, choose mode (log returns / normalized / rolling / drawdown), period — the chart re-renders inline. This is the part the static HTML exports don't give you.
  • Result is rendered as a Plotly chart inside an iframe; pop it out for a full window or use quick.save_html / save_image from a script for slide exports.
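
The two input rules described above (tickers must come from the universe, and Compare needs at least two) can be expressed as a small validation step. This is a sketch only; `validate_compare_request` is a hypothetical name and the dashboard's actual form handling may differ.

```python
def validate_compare_request(tickers: list[str], universe: set[str]) -> list[str]:
    """Return de-duplicated tickers, or raise if the request breaks a rule."""
    unknown = [t for t in tickers if t not in universe]
    if unknown:
        raise ValueError(f"not in universe.yaml: {', '.join(unknown)}")
    unique = list(dict.fromkeys(tickers))  # drop duplicates, keep order
    if len(unique) < 2:
        raise ValueError("Compare needs at least two distinct tickers")
    return unique
```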

Tooling

uv (deps), ruff (lint+format), mypy --strict (types), pytest (tests). All wired in pyproject.toml.
