# GDELT News Stream

This notebook polls the **GDELT 2.0 Doc API** every 15 minutes, collects headlines that
mention any company or asset on a watch‑list, and stores the result in a local
DuckDB file. It mirrors the architecture of our Alpaca price‑stream notebook so
that data from both sources can be joined later.

**Why do this?**  The goal is to measure *market attention*—how often companies are
talked about in the news—side‑by‑side with live price data.

**Key steps**
1. Define the watch‑list of companies/tickers.
2. Build an async fetcher that pages through the GDELT "ArtList" endpoint.
3. Normalise & deduplicate articles.
4. Store both articles and their ticker‑mentions in DuckDB tables.
5. Explore the data (quick sanity checks & visualisations).

---


## Imports & configuration
Most of these are standard data‑science libraries. *`nest_asyncio`* patches asyncio to
allow nested event loops, which is needed because the Jupyter kernel already runs one.



In [1]:
import asyncio
import aiohttp
import hashlib
import duckdb
import nest_asyncio
import pandas as pd
from datetime import datetime, timezone, timedelta
from pathlib import Path
from urllib.parse import quote_plus

nest_asyncio.apply()


### Watch‑list
Map **company and asset names → ticker symbols** so we can detect both in headlines.

In [2]:
COMPANY_NAMES = {
    "Google": "GOOGL",
    "Apple": "AAPL",
    "Amazon": "AMZN",
    "Nvidia": "NVDA",
    "Microsoft": "MSFT",
    "Bitcoin": "BTC",
    "Ripple": "XRP",
}

keywords = list(COMPANY_NAMES.keys())  # for the query string
print("Keywords:", keywords)


Keywords: ['Google', 'Apple', 'Amazon', 'Nvidia', 'Microsoft', 'Bitcoin', 'Ripple']
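
Detecting names in headlines with a plain substring test can over‑match, for example "Apple" inside "Pineapple". A minimal sketch of whole‑word matching with the standard `re` module is shown here; `compile_patterns` and `find_mentions` are hypothetical helper names used for illustration and are not part of the notebook.

```python
import re

# Same mapping as the watch-list above, repeated so the sketch runs on its own.
COMPANY_NAMES = {
    "Google": "GOOGL",
    "Apple": "AAPL",
    "Amazon": "AMZN",
    "Nvidia": "NVDA",
    "Microsoft": "MSFT",
    "Bitcoin": "BTC",
    "Ripple": "XRP",
}


def compile_patterns(names):
    """Build one case-insensitive, whole-word regex per ticker (hypothetical helper)."""
    patterns = {}
    for name, ticker in names.items():
        alternatives = "|".join(re.escape(term) for term in (name, ticker))
        patterns[ticker] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns


def find_mentions(title, patterns):
    """Return tickers whose name or symbol appears as a whole word in title."""
    return [ticker for ticker, pattern in patterns.items() if pattern.search(title)]


PATTERNS = compile_patterns(COMPANY_NAMES)
print(find_mentions("US stocks drift as Nvidia leads gains for tech", PATTERNS))  # ['NVDA']
print(find_mentions("Pineapple exports surge", PATTERNS))  # []
```

Word boundaries do not help with ordinary words that happen to match a name: a headline using "ripple" as a verb or "Amazon" for the river would still count as a mention.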


### GDELT API endpoint
We use the *Doc 2.0* “ArtList” mode, which returns JSON metadata for up to 250
articles per page.


In [3]:
GDELT_URL = (
    "https://api.gdeltproject.org/api/v2/doc/doc?"
    "query={query}&mode=ArtList&maxrecords=250&STARTDATETIME={start}" 
    "&ENDDATETIME={end}&page={page}&format=json"
)

# Build the OR‑joined search expression, quoted as required by the API.
QUERY = "(" + " OR ".join(f'"{kw}"' for kw in keywords) + ")"
print("Endpoint example:")
print(GDELT_URL.format(
    query=quote_plus(QUERY),
    page=1,
    start=datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"),
    end=datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"),
))


Endpoint example:
https://api.gdeltproject.org/api/v2/doc/doc?query=%28%22Google%22+OR+%22Apple%22+OR+%22Amazon%22+OR+%22Nvidia%22+OR+%22Microsoft%22+OR+%22Bitcoin%22+OR+%22Ripple%22%29&mode=ArtList&maxrecords=250&STARTDATETIME=20250714214805&ENDDATETIME=20250714214805&page=1&format=json
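
The printed example uses the same instant for both `STARTDATETIME` and `ENDDATETIME`, so its window is empty. A small sketch of building the URL for a real look‑back window follows; `build_url` and `STAMP` are hypothetical names, and the fixed `end` value is chosen only to make the output reproducible.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote_plus

# Same template as in the cell above, repeated so the sketch runs on its own.
GDELT_URL = (
    "https://api.gdeltproject.org/api/v2/doc/doc?"
    "query={query}&mode=ArtList&maxrecords=250&STARTDATETIME={start}"
    "&ENDDATETIME={end}&page={page}&format=json"
)
STAMP = "%Y%m%d%H%M%S"  # timestamp layout used in the template's date fields


def build_url(query, end, hours=1, page=1):
    """Format the request URL for the window [end - hours, end] (hypothetical helper)."""
    start = end - timedelta(hours=hours)
    return GDELT_URL.format(
        query=quote_plus(query),
        start=start.strftime(STAMP),
        end=end.strftime(STAMP),
        page=page,
    )


end = datetime(2025, 7, 14, 21, 48, 5, tzinfo=timezone.utc)
url = build_url('("Google" OR "Apple")', end)
print(url)
```

With these inputs the URL carries `STARTDATETIME=20250714204805` and `ENDDATETIME=20250714214805`, a one‑hour window ending at the chosen instant.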


### Local DuckDB database
The same file will later also hold our Alpaca price stream so that we can join
on timestamps.


In [4]:
DB_PATH = (Path.cwd() / "data" / "market_attention.duckdb").resolve()
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
print("DB path:", DB_PATH)


DB path: C:\Users\zaina\PycharmProjects\Market-Attention-Graph\notebooks\collectors_notebooks\data\market_attention.duckdb


## Helper functions
* `hash_id` — stable SHA‑256 hash of the article URL (primary key).
* `fetch_news` — async crawler that pages until < 250 results returned.
* `save_to_db` — stores articles + per‑ticker mention rows.
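
The stop rule in the second bullet (keep paging until a page comes back with fewer than 250 rows) can be illustrated without any network access. In this sketch `collect_pages`, `fake_fetch` and `FAKE_ARTICLES` are hypothetical stand‑ins; it shows only the loop logic, not how the real endpoint behaves.

```python
PAGE_SIZE = 250  # matches maxrecords in the request URL


def collect_pages(fetch_page, page_size=PAGE_SIZE):
    """Call fetch_page(1), fetch_page(2), ... until a short page arrives."""
    results, page = [], 1
    while True:
        rows = fetch_page(page)
        results.extend(rows)
        if len(rows) < page_size:  # a short (or empty) page is the last one
            break
        page += 1
    return results


# Stand-in for the network call: 600 fake articles served in pages of 250.
FAKE_ARTICLES = [{"url": f"https://example.com/{i}"} for i in range(600)]


def fake_fetch(page):
    start = (page - 1) * PAGE_SIZE
    return FAKE_ARTICLES[start:start + PAGE_SIZE]


articles = collect_pages(fake_fetch)
print(len(articles))  # 600, gathered from pages of 250, 250 and 100 rows
```

The comparison has to be strict: a full page of exactly 250 rows says nothing about whether more rows exist, so only a shorter page can end the loop.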


In [37]:

def hash_id(text: str) -> str:
    """Return a deterministic 64‑char hex digest for *text*."""
    return hashlib.sha256(text.encode()).hexdigest()


async def fetch_news(session: aiohttp.ClientSession, *, hours: int = 1):
    """Fetch headlines from the past *hours* hours.

    Returns a list of article dicts as the GDELT API sends them.
    """
    all_articles, page = [], 1
    now = datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)

    while True:
        url = GDELT_URL.format(
            query=quote_plus(QUERY),
            page=page,
            start=start.strftime("%Y%m%d%H%M%S"),
            end=now.strftime("%Y%m%d%H%M%S"),
        )
        async with session.get(url) as resp:
            if resp.status != 200:
                print(f"[{datetime.now(timezone.utc)}] GDELT fetch failed: {resp.status}")
                break
            data = await resp.json()
            rows = data.get("articles", [])
            all_articles.extend(rows)

        if len(rows) < 250:  # a short page means the last page was reached
            break
        page += 1

    return all_articles


def save_to_db(con: duckdb.DuckDBPyConnection, articles: list[dict]):
    """Persist *articles* and their ticker mentions."""
    # --- ensure tables exist ---
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS news_articles (
            article_id TEXT PRIMARY KEY,
            title TEXT,
            timestamp TIMESTAMP
        );
        """
    )

    con.execute(
        """
        CREATE TABLE IF NOT EXISTS ticker_mentions (
            article_id TEXT,
            ticker TEXT
        );
        """
    )

    # --- insert loop (already-stored articles are skipped) ---
    for art in articles:
        article_id = hash_id(art["url"])
        title = art.get("title", "")
        published = art.get("seendate")  # format: YYYYMMDDThhmmssZ
        if not published:
            continue
        clean = published.replace("T", "").rstrip("Z")
        timestamp = datetime.strptime(clean, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

        # Which tickers does this headline mention?
        mentions = [
            ticker
            for name, ticker in COMPANY_NAMES.items()
            if (name.lower() in title.lower()) or (ticker.lower() in title.lower())
        ]
        if not mentions:
            continue  # skip if no company of interest found

        # --- write article row ---
        try:
            con.execute(
                "INSERT INTO news_articles VALUES (?, ?, ?)",
                (article_id, title, timestamp),
            )
        except duckdb.ConstraintException:
            continue  # URL already stored; skip so its mentions are not counted twice

        # --- write mention rows ---
        for ticker in mentions:
            con.execute(
                "INSERT INTO ticker_mentions VALUES (?, ?)",
                (article_id, ticker),
            )
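
`hash_id` is applied to the raw URL, so two spellings of the same address would get different keys. One possible normalisation step before hashing is sketched here. This is an assumption for illustration and not something the notebook does: `normalise_url` is a hypothetical helper, and whether a trailing slash or fragment really identifies the same article depends on the publisher.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def hash_id(text):
    """Same hashing as the helper above: SHA-256 hex digest of the text."""
    return hashlib.sha256(text.encode()).hexdigest()


a = "https://Example.com/news/story/#comments"
b = "https://example.com/news/story"
print(normalise_url(a))  # https://example.com/news/story
print(hash_id(normalise_url(a)) == hash_id(normalise_url(b)))  # True
```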


## Poll loop (15‑minute cadence)
This mirrors the *writer* loop in the price‑stream notebook.  Feel free to stop
it with <kbd>Interrupt</kbd> once you see data accumulating.


In [6]:
async def poll_gdelt(hours_back: int = 1, sleep_seconds: int = 900):
    """Endless loop: fetch → save → sleep."""
    con = duckdb.connect(DB_PATH)
    async with aiohttp.ClientSession() as session:
        while True:
            print(f"[{datetime.now(timezone.utc)}] Polling GDELT…")
            articles = await fetch_news(session, hours=hours_back)
            if articles:
                print(f"[+] {len(articles):,} articles fetched → saving to DB…")
                save_to_db(con, articles)
            else:
                print("[-] No articles returned.")
            await asyncio.sleep(sleep_seconds)


**Run the poller**
The call below runs the loop in the foreground, so the cell stays busy until you stop it with *Kernel → Interrupt*.


In [14]:
await poll_gdelt()
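
If the notebook should stay usable while polling, one option is to schedule the coroutine as a task and cancel it later. The sketch below uses a stand‑in coroutine, `poll_forever`, in place of `poll_gdelt`; both it and `demo` are hypothetical names, and the short intervals are only there to keep the example fast.

```python
import asyncio


async def poll_forever(interval, log):
    """Stand-in for poll_gdelt: record a tick, then sleep, forever."""
    tick = 0
    while True:
        tick += 1
        log.append(tick)
        await asyncio.sleep(interval)


async def demo():
    log = []
    task = asyncio.create_task(poll_forever(0.01, log))  # returns immediately
    await asyncio.sleep(0.05)  # other work could happen here
    task.cancel()  # request a clean stop
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled on purpose
    return log, task.cancelled()


log, cancelled = asyncio.run(demo())
print(len(log) >= 1, cancelled)  # True True
```

Inside a notebook cell, `await demo()` can be used instead of `asyncio.run(demo())`, because the kernel already has an event loop running.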

## Quick exploration
Once you have some data, you can run the cells below to check what came in.


In [34]:
EXAMPLE_DB = (Path.cwd().parent.parent / "src" / "data" / "market_attention.duckdb").resolve()

con = duckdb.connect(EXAMPLE_DB)
df = con.execute("SELECT title, timestamp FROM news_articles ORDER BY timestamp DESC LIMIT 10").fetchdf()
display(df)

| | title | timestamp |
|---|---|---|
| 0 | XRP price : Bitcoin , Ether , XRP to witness 2... | 2025-07-15 17:45:00 |
| 1 | Grayscale Confidentially Files for Potential I... | 2025-07-15 17:45:00 |
| 2 | RICH Miner Launches XRP - Powered Cloud Mining... | 2025-07-15 17:45:00 |
| 3 | Apple Reportedly Taps Samsung Display for Firs... | 2025-07-15 17:45:00 |
| 4 | » Prime Day «: Amazon wegen vermeintlicher So... | 2025-07-15 17:45:00 |
| 5 | Function Ushers in the Era of Bitcoin Yield Wi... | 2025-07-15 17:45:00 |
| 6 | Kunden mit falschen Rabatten getäuscht : Niede... | 2025-07-15 17:45:00 |
| 7 | Google signs agreement with Brookfield for 3GW... | 2025-07-15 17:45:00 |
| 8 | US stocks drift as Nvidia leads gains for tech | 2025-07-15 17:45:00 |
| 9 | Bitcoin gains tempered by profit - taking , US... | 2025-07-15 17:45:00 |


### Mention counts per ticker


In [35]:
mention_counts = con.execute(
    """
    SELECT ticker, COUNT(*) AS mentions
    FROM ticker_mentions
    GROUP BY ticker
    ORDER BY mentions DESC
    """
).fetchdf()
con.close()
display(mention_counts)

| | ticker | mentions |
|---|---|---|
| 0 | NVDA | 15 |
| 1 | BTC | 9 |
| 2 | GOOGL | 6 |
| 3 | AAPL | 5 |
| 4 | MSFT | 5 |
| 5 | AMZN | 3 |
| 6 | XRP | 2 |
