# RSS News Stream

This notebook polls a collection of **finance/economy RSS feeds** every 15 minutes,
extracts headlines that mention a watch-list of companies/tickers, and stores the
data in the same **`market_attention.duckdb`** file we used for the GDELT stream.

That means you can join *RSS* articles, *GDELT* articles, and *Alpaca* price data
in one place for unified attention analytics.

**Feeds included** (edit `RSS_FEEDS` below to customise):
- Yahoo Finance Top Stories
- Investing.com Market News
- CNBC Markets
- MarketWatch Top Stories
- CoinDesk crypto news
- Bloomberg ETF Report podcast (metadata only)
- Wall Street Journal Markets
- The Economist Finance & Economics
- NY Times Business & Economy

---


## Imports & configuration

In [4]:
import feedparser
import duckdb
import hashlib
import time
from datetime import datetime, timezone
from pathlib import Path
import nest_asyncio
import pandas as pd

# Jupyter already runs an asyncio event loop; nest_asyncio patches asyncio so that
# nested event loops are allowed. (`time.sleep` is interruptible without it.)
nest_asyncio.apply()


### Feed list & watch-list

In [5]:
RSS_FEEDS = [
    "https://finance.yahoo.com/rss/",
    "https://www.investing.com/rss/news_301.rss",
    "https://www.cnbc.com/id/100003114/device/rss/rss.html",
    "https://www.marketwatch.com/rss/topstories",
    "https://www.coindesk.com/arc/outboundfeeds/rss/",
    "https://www.bloomberg.com/feed/podcast/etf-report.xml",
    "https://feeds.a.dj.com/rss/RSSMarketsMain.xml",
    "https://www.economist.com/finance-and-economics/rss.xml",
    "https://rss.nytimes.com/services/xml/rss/nyt/Business.xml",
    "https://rss.nytimes.com/services/xml/rss/nyt/Economy.xml",
]

COMPANY_NAMES = {
    "Google": "GOOGL",
    "Apple": "AAPL",
    "Amazon": "AMZN",
    "Nvidia": "NVDA",
    "Microsoft": "MSFT",
    "Bitcoin": "BTC",
    "XRP": "XRP",
}

print("Tracking:", list(COMPANY_NAMES.keys()))


Tracking: ['Google', 'Apple', 'Amazon', 'Nvidia', 'Microsoft', 'Bitcoin', 'XRP']


### Database path
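We store everything in `src/data/market_attention.duckdb`. The path is resolved two
directories above the current working directory, so start the notebook from its own
folder inside the project; launched from elsewhere, it will create the file in a
different place.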


In [17]:
DB_PATH = (Path.cwd().parent.parent / "src" / "data" / "market_attention.duckdb").resolve()
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
print("DB path:", DB_PATH)

# Ensure the schema exists up-front so the first poll doesn’t need to create it.
with duckdb.connect(DB_PATH) as con:
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS news_articles (
            article_id TEXT PRIMARY KEY,
            title TEXT,
            timestamp TIMESTAMP
        );
        """
    )
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS ticker_mentions (
            article_id TEXT,
            ticker TEXT
        );
        """
    )


DB path: C:\Users\zaina\PycharmProjects\Market-Attention-Graph\src\data\market_attention.duckdb


## Helper functions


In [7]:

def hash_id(text: str) -> str:
    """Return a deterministic 64-char SHA-256 hex digest."""
    return hashlib.sha256(text.encode()).hexdigest()


def store_article(con: duckdb.DuckDBPyConnection, title: str, url: str, ts, mentions) -> bool:
    """Insert the article row + mention rows; return False if the URL is already stored."""
    article_id = hash_id(url)

    try:
        con.execute(
            "INSERT INTO news_articles VALUES (?, ?, ?)",
            (article_id, title, ts),
        )
    except duckdb.ConstraintException:
        return False  # duplicate URL -> skip

    for ticker in mentions:
        con.execute(
            "INSERT INTO ticker_mentions VALUES (?, ?)",
            (article_id, ticker),
        )
    return True


def poll_once(con: duckdb.DuckDBPyConnection) -> int:
    """Scan all feeds once; return number of newly stored matching headlines."""
    count = 0
    for url in RSS_FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            title = getattr(entry, "title", "")
            link = getattr(entry, "link", "")
            if not title or not link:
                continue
            # feedparser's *_parsed fields are struct_time values in UTC, so build the
            # datetime directly; time.mktime would treat them as local time.
            parsed = getattr(entry, "published_parsed", None)
            if parsed:
                ts = datetime(*parsed[:6], tzinfo=timezone.utc)
            else:
                ts = datetime.now(timezone.utc)  # no publish date: fall back to poll time

            title_lower = title.lower()
            mentions = [
                ticker
                for name, ticker in COMPANY_NAMES.items()
                if name.lower() in title_lower or ticker.lower() in title_lower
            ]
            # Count only articles that were actually inserted, not duplicates.
            if mentions and store_article(con, title, link, ts, mentions):
                count += 1
    return count
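
The substring test in `poll_once` also fires inside longer words, so a headline containing "Pineapple" would count as an Apple mention. Below is a minimal sketch of a stricter, word-boundary match using the standard `re` module. `find_mentions` is a hypothetical helper, not part of the notebook, and the watch-list is copied here only to keep the block self-contained.

```python
import re

# Copy of the notebook's watch-list so the sketch runs on its own.
COMPANY_NAMES = {
    "Google": "GOOGL",
    "Apple": "AAPL",
    "Amazon": "AMZN",
    "Nvidia": "NVDA",
    "Microsoft": "MSFT",
    "Bitcoin": "BTC",
    "XRP": "XRP",
}


def find_mentions(title: str, names: dict = COMPANY_NAMES) -> list:
    """Return tickers whose company name or symbol appears as a whole word."""
    found = []
    for name, ticker in names.items():
        # \b anchors the match at word boundaries; re.escape guards against
        # names that contain regex metacharacters.
        pattern = rf"\b(?:{re.escape(name)}|{re.escape(ticker)})\b"
        if re.search(pattern, title, flags=re.IGNORECASE):
            found.append(ticker)
    return found


print(find_mentions("Pineapple exports surge"))  # []
print(find_mentions("Apple and Nvidia rally"))   # ['AAPL', 'NVDA']
```

Possessives such as "Nvidia's" still match, because the apostrophe counts as a word boundary.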


## Poll loop (15-minute interval)
Run this cell to start continuous polling. Stop with *Kernel → Interrupt*.


In [8]:

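def poll_rss(interval_seconds: int = 900):
    """Poll all feeds every `interval_seconds` until the kernel is interrupted."""
    con = duckdb.connect(DB_PATH)
    try:
        while True:
            print(f"[{datetime.now(timezone.utc)}] 🔎 Polling RSS feeds…")
            added = poll_once(con)
            print(f"[+] Stored {added} matched articles.")
            time.sleep(interval_seconds)
    except KeyboardInterrupt:
        print("RSS polling stopped by user.")
    finally:
        con.close()  # release the database file so other processes can open it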


In [None]:
poll_rss()
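
Any exception other than `KeyboardInterrupt` raised during a poll ends the loop above. The sketch below shows a loop that logs the failure and carries on, assuming that skipping a failed cycle is acceptable. `run_polling`, `max_iterations` and the injectable `sleep` are hypothetical names, added so the loop can be exercised without waiting 15 minutes.

```python
import time
from typing import Callable, Optional


def run_polling(
    poll_fn: Callable[[], int],
    interval_seconds: float = 900,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call poll_fn repeatedly; return the total number of stored articles."""
    total = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        try:
            total += poll_fn()
        except Exception as exc:  # KeyboardInterrupt is not caught here
            print(f"Poll {iteration} failed: {exc!r}")
        # Skip the final sleep when a bounded run has finished.
        if max_iterations is None or iteration < max_iterations:
            sleep(interval_seconds)
    return total


# Dry run with a fake poll function: two good cycles and one failure.
outcomes = iter([3, RuntimeError("feed timeout"), 2])


def fake_poll() -> int:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


total = run_polling(fake_poll, interval_seconds=0, max_iterations=3, sleep=lambda s: None)
print(total)  # 5
```

In the notebook, `poll_fn` could be `lambda: poll_once(con)` with `max_iterations` left at `None`.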

## Quick look at what we have so far


In [15]:
con = duckdb.connect(DB_PATH)
df = con.execute("SELECT title, timestamp FROM news_articles ORDER BY timestamp DESC LIMIT 10").fetchdf()
display(df)

|   | title | timestamp |
|---|-------|-----------|
| 0 | BTC Volatility Hit Historic Lows as ETF Inflow... | 2025-07-15 16:00:37 |
| 1 | Wall Street Cheers Nvidia's Return To China AI... | 2025-07-15 15:49:53 |
| 2 | MP Materials stock rips 23% higher after $500 ... | 2025-07-15 15:28:14 |
| 3 | Google to invest $25 billion in data centers a... | 2025-07-15 15:27:11 |
| 4 | Nvidia’s stock pops as China win may pave the ... | 2025-07-15 15:27:00 |
| 5 | Stock Market Today: Dow Wavers After Inflation... | 2025-07-15 15:23:19 |
| 6 | Nvidia Stock Reaches for Another Record, Lifts... | 2025-07-15 15:17:48 |
| 7 | Nvidia Says U.S. Has Lifted Restrictions on A.... | 2025-07-15 15:17:02 |
| 8 | Nvidia Jumps As Trump Administration Will Let ... | 2025-07-15 15:04:24 |
| 9 | Apple Partners With MP Materials On Rare-Earth... | 2025-07-15 15:04:16 |
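
The `timestamp` column is meant to hold UTC times. feedparser exposes `published_parsed` as a `time.struct_time` in UTC, whereas `time.mktime` interprets its argument as local time, so the result shifts with the machine's timezone. The sketch below shows a timezone-independent conversion using `calendar.timegm` from the standard library. `struct_to_utc` is a hypothetical helper name and the sample value is made up for illustration.

```python
import calendar
import time
from datetime import datetime, timezone


def struct_to_utc(parsed: time.struct_time) -> datetime:
    """Convert a UTC struct_time into a timezone-aware UTC datetime."""
    # calendar.timegm is the UTC counterpart of time.mktime.
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


# Illustrative value: 2025-07-15 16:00:37 UTC.
sample = time.struct_time((2025, 7, 15, 16, 0, 37, 1, 196, 0))
print(struct_to_utc(sample).isoformat())  # 2025-07-15T16:00:37+00:00
```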


### Mention counts (all sources combined)


In [16]:
mention_counts = con.execute(
    """
    SELECT ticker, COUNT(*) AS mentions
    FROM ticker_mentions
    GROUP BY ticker
    ORDER BY mentions DESC
    """
).fetchdf()
con.close()
mention_counts

|   | ticker | mentions |
|---|--------|----------|
| 0 | NVDA | 9 |
| 1 | BTC | 6 |
| 2 | GOOGL | 4 |
| 3 | AAPL | 2 |
| 4 | XRP | 1 |
| 5 | MSFT | 1 |
| 6 | AMZN | 1 |
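
To compare tickers on a common scale, the counts can be turned into shares of all mentions. Here is a small sketch using the figures from the output above, hard-coded so that it runs without the database; in the notebook the same arithmetic could be applied to `mention_counts`.

```python
# Counts copied from the output above.
mentions = {"NVDA": 9, "BTC": 6, "GOOGL": 4, "AAPL": 2, "XRP": 1, "MSFT": 1, "AMZN": 1}

total = sum(mentions.values())
shares = {ticker: n / total for ticker, n in mentions.items()}

for ticker, share in shares.items():
    print(f"{ticker:<6}{share:6.1%}")
```

With 24 mentions in total, NVDA accounts for 37.5% of them and BTC for 25.0%.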
