# Twitter Mentions Stream

This notebook polls **X / Twitter’s public search endpoint** (via the
[`twikit`](https://github.com/x0rz/twikit) async client) for brand‑related tweets
and stores matches in the same `market_attention.duckdb` database used by our
GDELT and RSS pipelines. Each stored tweet becomes one row in `news_articles`,
plus one `ticker_mentions` row per matched ticker.

## Why scrape Twitter?
* Headlines tell us what *news desks* publish; tweets tell us what *people with
  large followings* amplify in near‑real‑time.
* By filtering for accounts with **≥ 10 k followers** we keep the focus on
  tweets that can plausibly move sentiment.
* Merging the resulting `ticker_mentions` with price data lets us explore
  *attention shocks* vs. short‑term volatility.
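
The merge in the last bullet could look roughly like the sketch below. It assumes an hourly price table that this notebook does not create: the `prices` frame, its `hour` and `close` columns, the `attention_shocks` name, and the z‑score threshold are all hypothetical. Only `ticker` and `timestamp` mirror the tables this notebook writes.

```python
import pandas as pd


def attention_shocks(mentions: pd.DataFrame, prices: pd.DataFrame,
                     z_threshold: float = 2.0) -> pd.DataFrame:
    """Join hourly mention counts to hourly prices and flag unusually busy hours.

    mentions: columns ["ticker", "timestamp"] (one row per stored mention)
    prices:   columns ["ticker", "hour", "close"] (hypothetical hourly bars)
    """
    # Count mentions per ticker per hour.
    hourly = (
        mentions.assign(hour=mentions["timestamp"].dt.floor("h"))
        .groupby(["ticker", "hour"])
        .size()
        .rename("mentions")
        .reset_index()
    )

    # Keep every price bar; hours without tweets get a count of zero.
    merged = prices.merge(hourly, on=["ticker", "hour"], how="left")
    merged["mentions"] = merged["mentions"].fillna(0)
    merged = merged.sort_values(["ticker", "hour"]).reset_index(drop=True)

    # Absolute hourly return as a crude volatility proxy.
    merged["abs_return"] = merged.groupby("ticker")["close"].pct_change().abs()

    # Per-ticker z-score of the mention count; a "shock" is a large positive z.
    grouped = merged.groupby("ticker")["mentions"]
    merged["z"] = (merged["mentions"] - grouped.transform("mean")) / grouped.transform("std")
    merged["shock"] = merged["z"] > z_threshold
    return merged
```

Comparing `abs_return` in shock hours against the other hours is then a one‑line `groupby("shock")` away; whether a z‑score is the right shock definition is an open modelling choice, not something the notebook settles.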

**Caveats**
* Twitter rate‑limits unauthenticated search very aggressively. `twikit` uses
  the same guest‑token flow as the web client, so polling too frequently
  returns HTTP 429 (Too Many Requests) errors. The poll loop below handles
  that by sleeping until the rate‑limit reset time.
* For higher throughput you’ll need elevated API access or a paid tier.
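
The "sleep until reset" idea from the first caveat can be isolated in a small helper. This is a minimal sketch, assuming the reset time arrives as a Unix epoch in seconds (which is how the poll loop reads `e.rate_limit_reset`); the `seconds_until_reset` name and the 60‑second fallback are hypothetical.

```python
import math
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reset(reset_epoch: Optional[float],
                        now: Optional[datetime] = None,
                        fallback: int = 60) -> int:
    """Return how long to sleep after a 429, in whole seconds (at least 1)."""
    if reset_epoch is None:
        # Assumption: with no reset time available, wait a fixed fallback.
        return fallback
    now = now or datetime.now(timezone.utc)
    # Build the reset time in UTC so both sides of the subtraction are
    # timezone-aware and refer to the same clock.
    reset = datetime.fromtimestamp(reset_epoch, tz=timezone.utc)
    # Round up so we never wake just before the window reopens.
    return max(1, math.ceil((reset - now).total_seconds()))
```

Keeping both datetimes timezone‑aware matters here: subtracting a naive UTC time from a naive local time would shift the wait by the machine's UTC offset.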

---

## Imports & configuration


In [15]:
import asyncio
import hashlib
from datetime import datetime, timezone
from pathlib import Path

import duckdb
import nest_asyncio  # lets top-level await run inside Jupyter's already-running event loop
import pandas as pd
from twikit import Client, TooManyRequests, errors

nest_asyncio.apply()

# Watch‑list (same as other notebooks)
COMPANY_NAMES = {
    "Google": "GOOGL",
    "Apple": "AAPL",
    "Amazon": "AMZN",
    "Nvidia": "NVDA",
    "Microsoft": "MSFT",
    "Bitcoin": "BTC",
    "Ripple": "XRP",
}
MIN_FOLLOWERS = 10_000
QUERY = " OR ".join(f'"{kw}"' for kw in COMPANY_NAMES)  # quoted OR‑joined terms
print("Search query:", QUERY)

# DuckDB file (shared with other streams)
DB_PATH = (Path.cwd().parent.parent / "src" / "data" / "market_attention.duckdb").resolve()
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
print("DB path:", DB_PATH)

# Ensure schema exists once at notebook load
with duckdb.connect(str(DB_PATH)) as con:  # str() keeps older duckdb versions happy
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS news_articles (
            article_id TEXT PRIMARY KEY,
            title TEXT,
            timestamp TIMESTAMP
        );
        """
    )
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS ticker_mentions (
            article_id TEXT,
            ticker TEXT
        );
        """
    )


Search query: "Google" OR "Apple" OR "Amazon" OR "Nvidia" OR "Microsoft" OR "Bitcoin" OR "Ripple"
DB path: C:\Users\zaina\PycharmProjects\Market-Attention-Graph\src\data\market_attention.duckdb


## Helper functions


In [3]:

def hash_id(text: str) -> str:
    """64‑char SHA‑256 hex digest (stable primary key)."""
    return hashlib.sha256(text.encode()).hexdigest()


async def authenticate() -> Client:
    """Return an authenticated **twikit** Client.

    The call expects a `cookies.json` exported from a logged‑in browser session.
    Remove or modify this step if you prefer guest‑token only.
    """
    client = Client(language="en-US")
    client.load_cookies("cookies.json")  # create once via browser dev‑tools
    await client._get_guest_token()
    return client


async def latest_tweets(client: Client):
    """Fetch up to 100 newest tweets matching *QUERY* (web public search API)."""
    try:
        return await client.search_tweet(QUERY, "Latest", count=100)
    except errors.NotFound:  # guest token expired → refresh & retry once
        await client._get_guest_token()
        return await client.search_tweet(QUERY, "Latest", count=100)


def save_to_db(con: duckdb.DuckDBPyConnection, tweets):
    """Insert tweet rows + ticker mentions, skipping dups & small accounts."""
    added = 0
    for tw in tweets:
        if getattr(tw.user, "followers_count", 0) < MIN_FOLLOWERS:
            continue

        tweet_id = hash_id(tw.id)
        title = tw.text
        ts = datetime.strptime(tw.created_at, "%a %b %d %H:%M:%S %z %Y")

        mentions = [
            tk
            for name, tk in COMPANY_NAMES.items()
            if name.lower() in title.lower() or tk.lower() in title.lower()
        ]
        if not mentions:
            continue

        try:
            con.execute("INSERT INTO news_articles VALUES (?, ?, ?)", (tweet_id, title, ts))
        except duckdb.ConstraintException:
            continue  # duplicate

        for tk in mentions:
            con.execute("INSERT INTO ticker_mentions VALUES (?, ?)", (tweet_id, tk))
        added += 1
    return added
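
The substring test in `save_to_db` also fires on words that merely contain a watch‑list term (for example, "pineapple" contains "apple"). If whole‑word matching is what you want, a word‑boundary regex is one option. The sketch below is an assumption about the desired behaviour, not what the notebook currently does; `find_mentions` is a hypothetical helper and the dictionary is a trimmed copy of `COMPANY_NAMES` for illustration.

```python
import re
from typing import List

# Trimmed copy of the watch-list, for illustration only.
COMPANY_NAMES = {"Google": "GOOGL", "Apple": "AAPL", "Bitcoin": "BTC"}

# One case-insensitive pattern per ticker, matching the company name or the
# ticker symbol only when it appears as a whole word.
_PATTERNS = {
    tk: re.compile(rf"\b(?:{re.escape(name)}|{re.escape(tk)})\b", re.IGNORECASE)
    for name, tk in COMPANY_NAMES.items()
}


def find_mentions(text: str) -> List[str]:
    """Return tickers whose name or symbol occurs as a whole word in *text*."""
    return [tk for tk, pattern in _PATTERNS.items() if pattern.search(text)]
```

Because `$` and `@` are not word characters, cashtags such as `$BTC` and handles such as `@Google` still match.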


## Continuous poll loop (every 20 s)
*Handles 429 by sleeping until reset.* Stop with *Kernel → Interrupt*.


In [4]:
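async def poll_twitter(interval_seconds: int = 20):
    """Poll the search endpoint until interrupted, storing new matches each round."""
    con = duckdb.connect(str(DB_PATH))
    client = await authenticate()
    stored = 0

    try:
        while True:
            try:
                tweets = await latest_tweets(client)
            except TooManyRequests as e:
                now = datetime.now(timezone.utc)
                if e.rate_limit_reset is None:
                    # No reset time on the error: fall back to the normal interval.
                    wait = interval_seconds
                else:
                    # rate_limit_reset is a Unix epoch, so build both datetimes in
                    # UTC; mixing local time with UTC skews the wait by the offset.
                    reset = datetime.fromtimestamp(e.rate_limit_reset, tz=timezone.utc)
                    wait = max(1, int((reset - now).total_seconds()))
                print(f"[{now}] ⚠️ 429 hit → sleeping {wait}s")
                await asyncio.sleep(wait)
                continue

            now = datetime.now(timezone.utc)
            if tweets:
                n = save_to_db(con, tweets)
                stored += n
                print(f"[{now}] +{n} tweets stored (total {stored})")
            else:
                print(f"[{now}] No tweets returned")

            await asyncio.sleep(interval_seconds)
    finally:
        con.close()  # release the DuckDB file when the kernel is interrupted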


In [None]:
await poll_twitter()

## Quick sanity‑check query

In [12]:
con = duckdb.connect(str(DB_PATH))
pd.set_option("display.max_rows", None)  # show all rows when printing
df = con.execute(
    "SELECT title, timestamp, ticker "
    "FROM news_articles JOIN ticker_mentions USING (article_id) "  # unambiguous join key
    "ORDER BY timestamp LIMIT 10"
).fetchdf()
display(df)

|   | title | timestamp | ticker |
|---|-------|-----------|--------|
| 0 | The latest MLX Swift supports tvOS 📺 !\n\nYou ... | 2025-07-15 01:55:08 | AAPL |
| 1 | That’s awful… if you’re the thief you should b... | 2025-07-15 01:55:31 | GOOGL |
| 2 | そうか両社共通の投資家Founders Fundが引き合わせた説はありうるな。ナプキンの裏計... | 2025-07-15 01:55:51 | GOOGL |
| 3 | @GrapeApe9k Just because drones require a smal... | 2025-07-15 01:56:30 | AAPL |
| 4 | @pudgypenguins @Apple @Google @PlayPudgyParty ... | 2025-07-15 01:56:31 | AAPL |
| 5 | @pudgypenguins @Apple @Google @PlayPudgyParty ... | 2025-07-15 01:56:31 | GOOGL |
| 6 | @namiru319 失礼いたします。Amazonの自動メッセージです。 ご注文は、お届け予... | 2025-07-15 01:56:32 | AMZN |
| 7 | ε/ The cycle structure hasn’t changed\n\n• Bit... | 2025-07-15 01:56:32 | BTC |
| 8 | If you want to get early opportunities, you ca... | 2025-07-15 01:56:34 | BTC |
| 9 | This is not altseason.\n\nThis is still Bitcoi... | 2025-07-15 01:56:34 | BTC |


### Mention count


In [13]:
df = con.execute("SELECT ticker, COUNT(ticker) AS mentions FROM ticker_mentions GROUP BY ticker ORDER BY mentions").fetchdf()
con.close()
display(df)

|   | ticker | mentions |
|---|--------|----------|
| 0 | AMZN | 1 |
| 1 | GOOGL | 3 |
| 2 | BTC | 3 |
| 3 | AAPL | 5 |
