This project uses Diffbot's Knowledge Graph to collect news from a pre-defined set of sources (e.g. wsj.com) about a pre-defined set of companies (the Russell 1000).
build_stock_universe.py -- Build a Mapped Universe
- Get R1000 tickers from iShares holdings
- Get diffbot-entity-id by submitting Ticker to Diffbot Knowledge Graph
- Output goes to ./data/id_map.csv
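
As a rough illustration of the mapping step, the sketch below writes a ticker to diffbot-entity-id table as CSV. It is not the code in build_stock_universe.py: the function name `write_id_map`, the column names, and the `lookup_entity_id` callable (a stand-in for whatever performs the Knowledge Graph query) are all assumptions made for this example.

```python
import csv
import io
from typing import Callable, Iterable, Optional, TextIO


def write_id_map(
    tickers: Iterable[str],
    lookup_entity_id: Callable[[str], Optional[str]],
    out: TextIO,
) -> int:
    """Write ticker -> diffbot-entity-id rows as CSV and return how many were mapped.

    `lookup_entity_id` is a hypothetical callable that returns the entity id
    for a ticker, or None when no match is found.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ticker", "diffbot_entity_id"])  # column names are assumed
    mapped = 0
    for ticker in tickers:
        entity_id = lookup_entity_id(ticker)
        if entity_id is None:
            continue  # skip tickers the lookup could not resolve
        writer.writerow([ticker, entity_id])
        mapped += 1
    return mapped


# Usage with a fake in-memory lookup instead of a real Knowledge Graph call.
fake_lookup = {"AAA": "E-1"}.get
buffer = io.StringIO()
count = write_id_map(["AAA", "BBB"], fake_lookup, buffer)
```

Passing the lookup in as a callable keeps the CSV logic testable without network access; in practice the output stream would be a file opened at ./data/id_map.csv.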
sync_news.py -- Get Articles by querying Diffbot Knowledge Graph
- Query by diffbot-entity-id + news source + year
- Output goes to S3 (e.g. diffbot-stock-news/type=kg_raw/version=202110.0/entity=E-6s5hEvCNFCnAQ2hpmLT8g/source=cnn.com/year=2019.gz)
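
The output location follows a key=value partition layout. A minimal sketch of how such a path could be assembled is shown below; the function name `kg_raw_path` and its defaults are assumptions for illustration, and whether `diffbot-stock-news` is the bucket name or a key prefix is not stated here, so it is treated as an opaque root.

```python
def kg_raw_path(
    entity_id: str,
    source: str,
    year: int,
    version: str = "202110.0",
    root: str = "diffbot-stock-news",  # assumed: bucket name or top-level prefix
) -> str:
    """Build the partitioned output path for one (entity, source, year) result."""
    return (
        f"{root}/type=kg_raw/version={version}"
        f"/entity={entity_id}/source={source}/year={year}.gz"
    )


# Reproduces the example path shown in the list above.
example = kg_raw_path("E-6s5hEvCNFCnAQ2hpmLT8g", "cnn.com", 2019)
```

Because each (entity, source, year) combination maps to exactly one path, a re-run can skip combinations that already exist instead of querying them again.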
build_corpus.py -- Collect all the data into a single object
- Remove duplicates
- Clean up text
- Output goes to ./data/news_extracts/R2000_201901_to_2021109_allnews.json (everything in one place) and ./data/news_extracts/text_chunks/ (data is chunked for more efficient downstream multiprocessing/streaming)
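
The corpus steps above (clean, de-duplicate, chunk) can be sketched roughly as follows. This is an illustration under assumptions, not the logic in build_corpus.py: it assumes each article is a dict with a `text` field, treats two articles as duplicates when their cleaned text is identical, and limits cleaning to whitespace normalization.

```python
import re
from typing import Dict, Iterable, Iterator, List


def clean_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()


def dedupe(articles: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first article seen for each distinct cleaned text."""
    seen = set()
    unique = []
    for article in articles:
        text = clean_text(article["text"])  # "text" field name is assumed
        if text in seen:
            continue
        seen.add(text)
        unique.append({**article, "text": text})
    return unique


def chunked(items: List[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Usage on a tiny in-memory sample.
sample = [
    {"id": "a", "text": "Hello   world\n"},
    {"id": "b", "text": "Hello world"},
    {"id": "c", "text": "Other story"},
]
corpus = dedupe(sample)
chunks = list(chunked(corpus, 1))
```

Each chunk can then be written to its own file so that downstream workers read one chunk at a time rather than loading the full corpus.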
news_sources = ['bloomberg.com', 'wsj.com', 'reuters.com', 'barrons.com', 'nytimes.com', 'cnbc.com',
'marketwatch.com', 'ft.com', 'finance.yahoo.com', 'apnews.com', 'cnn.com',
'foxnews.com', 'foxbusiness.com']
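
Since articles are queried by diffbot-entity-id + news source + year, the full set of queries is the cross product of those three dimensions. The sketch below enumerates it; the function name `query_jobs` and the 2019 to 2021 year range are assumptions (the range is inferred from the corpus file name), and only two sources are used to keep the example short.

```python
from itertools import product
from typing import Iterable, Iterator, Tuple


def query_jobs(
    entity_ids: Iterable[str],
    sources: Iterable[str],
    years: Iterable[int],
) -> Iterator[Tuple[str, str, int]]:
    """Yield one (entity_id, source, year) tuple per query to run."""
    return product(entity_ids, sources, years)


# Usage: one entity, two sources, three years gives six queries.
jobs = list(query_jobs(["E-6s5hEvCNFCnAQ2hpmLT8g"], ["wsj.com", "cnn.com"], range(2019, 2022)))
```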