Standalone internal scripts used on HFNews, a news aggregator and search engine built in Django.
Scrapes RSS feeds and runs them through Readability to get news articles. Requires bs4, readability-lxml, lxml, requests, and feedparser.
Call fetch_articles(rss.xml) to print the scraped articles to stdout.
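The scraping flow described above can be sketched roughly as follows. This is not the script's actual code: the names `entry_links` and `print_feed_articles`, the `timeout` parameter, and the choice to skip duplicate links and failed downloads are all assumptions made for illustration. It uses only the libraries listed as requirements.

```python
def entry_links(entries):
    """Return the unique, non-empty article links from feed entries, in feed order."""
    seen = set()
    links = []
    for entry in entries:
        link = entry.get("link")
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links


def print_feed_articles(feed_url, timeout=10):
    """Print the title and Readability-extracted text of each article in a feed.

    Hypothetical helper, not the repository's fetch_articles().
    """
    # Third-party imports live here so entry_links() can be used without them.
    import feedparser
    import requests
    from bs4 import BeautifulSoup
    from readability import Document

    feed = feedparser.parse(feed_url)
    for link in entry_links(feed.entries):
        try:
            response = requests.get(link, timeout=timeout)
            response.raise_for_status()
        except requests.RequestException as exc:
            # Assumption: a failed download is skipped rather than aborting the run.
            print("skipping %s: %s" % (link, exc))
            continue
        doc = Document(response.text)
        # summary() returns the main content as HTML; bs4 strips the tags.
        text = BeautifulSoup(doc.summary(), "lxml").get_text(" ", strip=True)
        print(doc.title())
        print(text)
```

Calling `print_feed_articles` with a feed URL would then print one title and body per entry. The link selection is kept in a separate function so it can be checked without network access.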
Takes body of text and runs it through the OpenCalais API to get entities (tags). Requires requests.
    tag = TagScraper(text)
    tag.get_calais_json()
    tag.get_entities()
    # all entities under 30% relevance are filtered out:
    print tag.entities
    print tag.crunchbase_entities
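The 30% relevance cutoff could be implemented along these lines. This is a sketch, not the internals of TagScraper, which are not shown here. It assumes the OpenCalais JSON response is a dictionary whose entity items carry `_typeGroup`, `_type`, `name`, and `relevance` fields; the function name `filter_entities` and the `min_relevance` parameter are hypothetical.

```python
def filter_entities(calais_json, min_relevance=0.3):
    """Keep entity items whose relevance is at least min_relevance.

    Assumes each entity item looks like:
    {"_typeGroup": "entities", "_type": "Company", "name": "...", "relevance": 0.8}
    """
    entities = []
    for item in calais_json.values():
        # Skip metadata and non-entity items such as topics or relations.
        if not isinstance(item, dict) or item.get("_typeGroup") != "entities":
            continue
        relevance = float(item.get("relevance", 0.0))
        if relevance < min_relevance:
            continue
        entities.append({
            "name": item.get("name"),
            "type": item.get("_type"),
            "relevance": relevance,
        })
    # Most relevant entities first.
    entities.sort(key=lambda entity: entity["relevance"], reverse=True)
    return entities
```

Sorting by relevance is an extra convenience added here, so that the strongest tags come first when they are later looked up elsewhere.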
Used in combination with entities retrieved using calais.py to get relevant company/person information. Requires requests.
Call fetch_info(tag_name, tag_type) to retrieve relevant information from the Crunchbase API.
can be equal to