GitHub - slifty/rdiscraper: Scraping news since 2011

Files:
 - README
 - chicago_tribune_scraper.py
 - la_times_scraper.py
 - ny_times_post_1981_scraper.py
 - ny_times_pre_1981_scraper.py
 - washington_post_post_1987.py
 - washington_post_pre_1987.py

This is a suite of scrapers that return article URLs, which are then fed into the MediaCloud system.

We are focusing on the following organizations:
 - New York Times
 - Chicago Tribune
 - Washington Post
 - LA Times


The scrapers use a consistent API which takes in:
 - A start date
 - An end date


And returns (in XML format):
 - Article URL for each article in that date range
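
As a rough illustration of the date-range input described above, the sketch below walks a range one day at a time and gathers URLs. It is not taken from the scrapers themselves: the names `days_in_range`, `collect_urls`, and `fetch_day` are hypothetical, and treating the end date as inclusive is an assumption.

```python
from datetime import date, timedelta


def days_in_range(start, end):
    """Yield each date from start to end (inclusive end is an assumption)."""
    if end < start:
        raise ValueError("end date must not precede start date")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def collect_urls(start, end, fetch_day):
    """Gather article URLs for every day in the range.

    fetch_day is a hypothetical callable (date -> list of URL strings)
    standing in for whatever site-specific lookup a scraper performs.
    """
    urls = []
    for day in days_in_range(start, end):
        urls.extend(fetch_day(day))
    return urls
```

Passing the per-day lookup in as a callable keeps the date handling identical across news organizations, which is one way the "consistent API" could be kept consistent.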


==== RETURN XML STRUCTURE ====
<articles>
	<article>
		<url></url>
	</article>
</articles>
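
A minimal sketch of producing and reading the structure above with Python's standard `xml.etree.ElementTree` module. The function names `build_articles_xml` and `parse_articles_xml` are illustrative assumptions, not part of the scrapers.

```python
import xml.etree.ElementTree as ET


def build_articles_xml(urls):
    """Serialize a list of URL strings into the <articles> structure."""
    root = ET.Element("articles")
    for url in urls:
        article = ET.SubElement(root, "article")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(article, "url").text = url
    return ET.tostring(root, encoding="unicode")


def parse_articles_xml(xml_text):
    """Return the list of URL strings found in an <articles> document."""
    root = ET.fromstring(xml_text)
    return [url.text for url in root.findall("./article/url")]
```

For example, `parse_articles_xml(build_articles_xml(urls))` should give back the original list, including URLs with query strings that contain `&`.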