# Web Scraping for the News

## What exactly *is* web scraping?

"Web scraping" is an overloaded phrase.

Depending on who you ask, web scraping could involve one or more of the following activities:

- Downloading structured data used by a website to generate content (e.g. a JSON API that supplies data for a table).
- Plucking data from the HTML of web pages
- Gathering PDFs, audio, video, screenshots or other files hosted on websites
- Extracting data from scraped PDFs
- Transcribing scraped audio and video
- Standardizing and adding new columns to scraped data
- Storing data and files gathered from websites in databases, ["buckets" in the cloud](https://en.wikipedia.org/wiki/Object_storage), and other long-term storage locales for analysis, search or archiving.

All of these activities are common in data and document processing pipelines that power news stories, interactive graphics and apps.

But for our purposes, we're going to use a narrower definition.

<dl>
    <dt><strong>web scraping</strong></dt>
    <dd><em>The act of automating the acquisition of data or other files (images, videos, documents, etc.) from the web. The data may live on one or more pages of a website, or perhaps many different websites. At root, web scraping involves writing code to mimic the actions a human might take to visit a site in a web browser and manually gather files or data.</em></dd>
</dl>

## Value of scraping

Journalistically valuable information is often locked up on a website that lacks easier methods for data acquisition. Not all government agencies, for example, offer downloadable [CSV files](https://en.wikipedia.org/wiki/Comma-separated_values) or [APIs](../apis/README.ipynb). Nor do they always respond to public records requests in a timely or helpful manner.

**Web scraping allows journalists to acquire information in the face of technical or bureaucratic hurdles.**

Scraping is also useful in scenarios where a website offers the most up-to-date or widest scope of information. In such cases, web scraping can help journalists tell a more accurate and timely story.

Here are a few examples where web scraping helped produce news:

* [Accidental shootings involving kids often go unpunished](https://apnews.com/article/32e2ce4e701f4448b3d9ba355edfa31d), by The Associated Press, relied on data scraped from the [Gun Violence Archive](https://www.gunviolencearchive.org/).
* [Amazon Says It Puts Customers First. But Its Pricing Algorithm Doesn't](https://www.propublica.org/article/amazon-says-it-puts-customers-first-but-its-pricing-algorithm-doesnt), by ProPublica. Here's the [behind-the-scenes look](https://www.propublica.org/article/how-we-analyzed-amazons-shopping-algorithm) at how they scraped and analyzed Amazon data.
* [Dollars for Docs](https://projects.propublica.org/docdollars), a searchable news app by ProPublica. Here's a write-up on the [scraping aspect](https://www.propublica.org/nerds/scraping-websites) of the work.


## Scraping - The Technical Bits
    
Every web site has its own [personality](website_personalities.ipynb).
    
Head to [Web Scraping 101](scraping_101.ipynb) to learn how to scrape different types of sites, from simple to challenging.

## The option of last resort

Web scraping is a brittle activity. Sites move, URLs and page structures evolve, and interactivity gets added or removed.

Shiny new web scrapers inevitably break in the days, months and years after they were written.

And often, websites do *not* reflect the most recent or most accurate information.

**For these reasons, scraping should be treated as an option of last resort.**
    
When a government website does not offer easy methods for obtaining data, journalists typically reach out to the agency and possibly file public records requests to obtain structured data or digital files. They seek to exhaust easier options before turning to their scraping toolkit. 

## Ethical scraping

Scraping ethically implies a number of best practices. To mention a few:

* Respecting a site's terms of use
* Identifying yourself clearly
* Taking care not to overwhelm a site with large volumes of requests
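
The last two practices can be sketched in a few lines of Python. This is a minimal illustration, not a complete recipe: the `USER_AGENT` string, the `polite_fetch` helper and the two-second default delay are all hypothetical choices made for this example, and the right pause will depend on the site. The demo uses a stand-in `fake_fetch` function so it runs without touching the network.

```python
import time

# Hypothetical identity string: swap in your own newsroom and contact details.
USER_AGENT = "example-newsroom-scraper/0.1 (contact: data-desk@example.com)"


def polite_fetch(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, identifying ourselves and pausing between requests.

    `fetch` is any callable that accepts (url, headers) and returns the page body.
    """
    headers = {"User-Agent": USER_AGENT}
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # pause so requests never arrive in a rapid burst
        results[url] = fetch(url, headers)
    return results


# Offline stand-in for a real HTTP call, so the sketch runs without a network.
def fake_fetch(url, headers):
    return f"{url} fetched as {headers['User-Agent']}"


pages = polite_fetch(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
for body in pages.values():
    print(body)
```

To use this against a real site, one option is to pass a small wrapper around the requests library in place of `fake_fetch`, for example `lambda url, headers: requests.get(url, headers=headers, timeout=10).text`.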

Here are a few articles that lay out ethical concerns in more detail:

* [On the Ethics of Web Scraping and Data Journalism](https://gijn.org/stories/on-the-ethics-of-web-scraping-and-data-journalism/)
* [Ethics in Web Scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01)

Keep in mind that opinions vary about what is or is not "ethical" -- or legal -- when it comes to scraping. It's an issue that [has been tested in the courts](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data) and will continue to be fought over.

*Be mindful of your legal responsibilities and potential liability when scraping the web.* If in doubt, contact a legal organization that provides advice to journalists, such as our friends at the [First Amendment Coalition](https://firstamendmentcoalition.org/).

## Let's scrape something already

Sheesh. That was a lot of preamble.
    
Let's write some code.
    
Below is an extremely simple example of web scraping. It's designed to whet your appetite. We'll wrestle with more challenging sites down the road.
    
In the below example, we use a pair of Python libraries to:
    
- Fetch the HTML of the home page of <http://example.com> (the [requests][] bit)
- Extract the page's main heading (the `<h1>` tag) from the HTML (the [BeautifulSoup][] bit)
    
Enjoy this amuse-bouche. It's about as easy as scraping will ever get.

[requests]: https://docs.python-requests.org/en/latest/index.html
[BeautifulSoup]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

> If you're working in GitHub Codespaces, we've installed these libraries for you. If you're working locally, run `pip install requests bs4`.


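```python
import bs4
import requests

# Fetch the raw HTML of the page
url = "http://www.example.com"
html = requests.get(url).text

# Parse the HTML and print the text of the first <h1> tag
soup = bs4.BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')
print(h1.text)
```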

> If you're curious why we include `html.parser` as an option when we call [BeautifulSoup][],
> check out bs4's docs on [installing a parser][] and [differences between parsers][].

[BeautifulSoup]: https://beautiful-soup-4.readthedocs.io/en/latest/#beautifulsoup
[installing a parser]: https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser
[differences between parsers]: https://beautiful-soup-4.readthedocs.io/en/latest/#differences-between-parsers
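
Real pages are rarely as tidy as example.com, and `find()` returns `None` when a tag is missing, so calling `.text` on the result can raise an error. The sketch below shows one way to guard against that. It assumes a made-up page (`SAMPLE_HTML`) and a hypothetical helper named `text_of`, and it parses an inline string so it runs without a network request. It also shows that a page's `<title>` and its `<h1>` heading are separate tags that can hold different text.

```python
from bs4 import BeautifulSoup

# Hypothetical page markup, inlined so the example runs without a network call.
SAMPLE_HTML = """
<html>
  <head><title>County Budget Portal</title></head>
  <body>
    <h1>2023 Budget Documents</h1>
    <p>Download the adopted budget below.</p>
  </body>
</html>
"""


def text_of(soup, tag_name):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    tag = soup.find(tag_name)  # find() returns None when nothing matches
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(text_of(soup, "title"))  # County Budget Portal
print(text_of(soup, "h1"))     # 2023 Budget Documents
print(text_of(soup, "h2"))     # None, because the page has no <h2>
```

Returning `None` for a missing tag is only one possible design. For a scraper that runs unattended, raising an error may be the better choice, since it makes a changed page structure hard to miss.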


## What's Next

- [Web Scraping 101](scraping_101.ipynb) 
- [Dissecting Websites](dissecting_websites.ipynb)
- [Website Personalities](website_personalities.ipynb) - Common website traits (aka scraping challenges) and how to address them
  - [Skip the Scraping. Cheat](skip_scraping_cheat.ipynb) - You'll thank us
  - [WYSIWYG Scraping](wysiwyg_scraping.ipynb) - Old school, easy scrapes. We miss you, Internet 2005
  - [Drive the Browser, Robot](drive_the_browser_robot.ipynb) - Sometimes it's easier to have the machines do the work
- [Scraping exercises](exercises.ipynb) - A few sites to challenge your scraping skills.
- [Scraping resources](resources.ipynb) - Tutorials, key concepts, code libraries for scraping, etc.