Collecting news articles for all the companies in the R1000, for a pre-defined set of news outlets, using Diffbot's Knowledge Graph

talsan/stock_news

Collecting Stock News History using Diffbot

This project uses Diffbot's Knowledge Graph to collect news articles from a pre-defined set of sources (e.g. wsj.com) for a pre-defined set of companies (the Russell 1000).

Process Details

  1. build_stock_universe.py -- Build a mapped universe
    1. Get the R1000 tickers from the iShares holdings
    2. Get the diffbot-entity-id for each ticker by submitting the ticker to the Diffbot Knowledge Graph
    3. Output goes to ./data/id_map.csv
  2. sync_news.py -- Get articles by querying the Diffbot Knowledge Graph
    1. Query by diffbot-entity-id + news source + year
    2. Output goes to S3 (e.g. diffbot-stock-news/type=kg_raw/version=202110.0/entity=E-6s5hEvCNFCnAQ2hpmLT8g/source=cnn.com/year=2019.gz)
  3. build_corpus.py -- Collect all the data into a single object
    1. Remove duplicates
    2. Clean up the text
    3. Output goes to ./data/news_extracts/R2000_201901_to_2021109_allnews.json (everything in one place) and ./data/news_extracts/text_chunks/ (the data is chunked for more efficient downstream multiprocessing/streaming)
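The corpus-building step (remove duplicates, clean up text, chunk the output) could be sketched roughly as follows. This is not the code in build_corpus.py. The "text" field, the whitespace-only cleanup, the exact-match hashing, and the helper names `clean_text`, `dedupe`, and `chunked` are all assumptions made for illustration.

```python
import hashlib
import re
from typing import Dict, Iterable, Iterator, List


def clean_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def dedupe(articles: Iterable[Dict]) -> Iterator[Dict]:
    """Yield articles whose cleaned text has not been seen before.

    Assumes each article is a dict with a "text" field (hypothetical schema).
    Articles with empty text are skipped.
    """
    seen = set()
    for article in articles:
        text = clean_text(article.get("text", ""))
        if not text:
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield {**article, "text": text}


def chunked(items: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Group items into lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    raw = [
        {"text": "Apple  beats\nestimates"},
        {"text": "Apple beats estimates"},  # duplicate after cleanup
        {"text": "Tesla recalls cars"},
    ]
    for batch in chunked(dedupe(raw), size=1):
        print(batch)
```

Hashing the cleaned text only catches exact duplicates. Near-duplicates, such as the same wire story republished with a different headline, would need a fuzzier comparison.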

News Sources

news_sources = ['bloomberg.com', 'wsj.com', 'reuters.com', 'barrons.com', 'nytimes.com', 'cnbc.com',
                'marketwatch.com', 'ft.com', 'finance.yahoo.com', 'apnews.com', 'cnn.com',
                'foxnews.com', 'foxbusiness.com']
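Since articles are queried by diffbot-entity-id + news source + year, the work can be viewed as one task per (entity, source, year) combination, each written to its own partitioned S3 key. The sketch below only illustrates that enumeration. The key layout follows the example path given in the process details, but the function names, the default version and prefix values, and the year range are assumptions, and no Diffbot or S3 call is made.

```python
from itertools import product
from typing import Iterable, Iterator, Tuple


def build_key(entity_id: str, source: str, year: int,
              version: str = "202110.0",
              prefix: str = "diffbot-stock-news") -> str:
    """Build a partitioned key that mirrors the README's example path."""
    return (
        f"{prefix}/type=kg_raw/version={version}"
        f"/entity={entity_id}/source={source}/year={year}.gz"
    )


def iter_tasks(entity_ids: Iterable[str],
               sources: Iterable[str],
               years: Iterable[int]) -> Iterator[Tuple[str, str, int, str]]:
    """Yield one (entity_id, source, year, key) tuple per combination."""
    for entity_id, source, year in product(entity_ids, sources, years):
        yield entity_id, source, year, build_key(entity_id, source, year)


if __name__ == "__main__":
    # The entity id is the one from the README's example path.
    # The year range is an assumption.
    tasks = iter_tasks(
        ["E-6s5hEvCNFCnAQ2hpmLT8g"],
        ["cnn.com", "wsj.com"],
        range(2019, 2022),
    )
    for task in tasks:
        print(task)
```

Keeping one object per combination means an interrupted sync can resume by skipping keys that already exist.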
