# Skip the Scraping

## Cheat with JSON

The absolute first question to ask before embarking on a web scraping journey: 

> *Do I need to scrape at all?*

These days, a seemingly complex website might secretly rely on easily accessible structured data, often in the form of JSON supplied through an [API](../apis/README.ipynb).

## A Secretly Simple "Scrape"

For example, the [Committee to Protect Journalists](https://cpj.org) provides a [searchable online database](https://cpj.org/data/) of reporters who have died in armed conflicts since 1992.

The site provides a basic search form to fetch a subset of the data.

![CPJ search form](../files/cpj_gaza_db_search.png)

## Looks are deceiving

You might see this site and, eager to bring it to heel with a shiny new web scraper, spend minutes, hours or days (depending on your scraping chops) crafting code to capture the data.

If you act on that plan, you'll be kicking yourself for wasting more than a minute of time on it. 

Yes, we said **1 minute**.


## Just take a peek...

Had you peeked under the website's hood using the `Network` tab of the browser's Developer Tools, you would have quickly noticed that the data for the search is powered by an API.

![CPJ data API in network tab of browser tools](../files/cpj_api_network_tab.png)

## ...and poke around a bit

Let's see what that data looks like.

- On the `Network` tab, click on the web request for the API call. *If you're trying this at home, it won't appear exactly the same. Locate the row where the `File` column value starts with `entries?distinct...`*
- When the side panel for the request appears, click on the `Response` tab.

![CPJ API response expanded to show data](../files/cpj_api_response_expanded.png)


## Solve a puzzle

This sure resembles the data displayed on the results page for the search. It's quite likely the search results page is constructed dynamically by JavaScript using this JSON data (*see [Website Personalities](website_personalities.ipynb) for more on dynamic pages*).

We could spelunk the code to prove this is the case, but that can be a LOT of work. Instead, let's just compare a sample of the JSON to the displayed search results to verify that we're working with the right data.

Below is a side-by-side view of the search results and the JSON data. 

*See those red lines connecting data points from the search results to the JSON data?*

Behold the glory. The data is readily accessible in a structured form. So in theory, we can just grab the JSON data and head home for an early dinner. Or can we...?

![CPJ search results compared to JSON data](../files/cpj_search_results_by_json.png)

## But wait... that's only 20 records

So we've solved one problem. Or rather, we have a promising lead on avoiding a problem called "web scraping."

But there's another issue to contend with before declaring victory: There are only 20 records in this JSON file and we know that the database contains more than 1,500 records (at the time of writing). 

*How do we get all of the records?*

## Hack the API

Let's see how the API call is constructed by:

- Clicking the `Network` tab
- Clicking on the web request for the API call
- Heading over to the `Headers` tab for the web request

In the information panel, you should see a downright awful URL. It contains a boatload of URL parameters after the `?` in the form of `key=value` pairs, separated by ampersands (`&`). These are variables of sorts that instruct the API on what data to return. Normally, these parameters are configured by a web form filled out by a human visiting the website.

If you look closely, you may notice that the URL parameters include one particularly interesting morsel: `pageSize=20`

![CPJ api call](../files/cpj_api_call_page_size.png)
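
If squinting at that URL makes your eyes water, you can let Python pull the `key=value` pairs apart for you. The sketch below uses only the standard library. It assumes the URL matches the example search in this lesson (your own search may produce different filters), and the `split_params` helper is our own invention, not part of any CPJ tooling.

```python
from urllib.parse import urlsplit, parse_qsl

# The API URL from this lesson's example search (assumed; yours may differ)
API_URL = (
    "https://datamanager.cpj.org/api/datamanager/reports/entries"
    "?distinct(personId)="
    "&includes=organizations,fullName,location,status,typeOfDeath,charges,"
    "startDisplay,mtpage,country,type,motiveConfirmed"
    "&sort=fullName&pageNum=1&pageSize=20"
    "&in(status,'Killed')="
    "&or(eq(type,'media worker'),in(motiveConfirmed,'Confirmed'))="
    "&in(type,'Journalist')="
    "&ge(year,1992)=&le(year,2024)="
)


def split_params(url):
    """Return the URL's query string as a list of (key, value) pairs."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps filter-style keys that have an empty value
    return parse_qsl(query, keep_blank_values=True)


params = split_params(API_URL)
for key, value in params:
    print(f"{key} = {value!r}")
```

Printed one per line, the paging parameters (`pageNum` and `pageSize`) are much easier to spot among the filters.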


It's a good bet that `pageSize` tells the API how many records to return. You could "page" through the data by changing the `pageNum` parameter from `1` to `2` to `3` and so on. 
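
Here is a rough sketch of what that paging approach might look like in Python. It only builds the URLs for the first three pages; nothing is downloaded. The `with_param` helper is hypothetical, and `short_url` is a trimmed-down stand-in for the real API URL (filters omitted for readability), so don't expect it to return the same data as the full one.

```python
import re


def with_param(url, name, value):
    """Return url with the existing numeric `name=...` parameter set to value."""
    # Match the parameter only when it directly follows "?" or "&"
    pattern = rf"(?<=[?&]){re.escape(name)}=\d+"
    new_url, count = re.subn(pattern, f"{name}={value}", url)
    if count != 1:
        raise ValueError(f"expected exactly one {name} parameter, found {count}")
    return new_url


# Shortened stand-in for the full API URL (an assumption for illustration)
short_url = (
    "https://datamanager.cpj.org/api/datamanager/reports/entries"
    "?sort=fullName&pageNum=1&pageSize=20"
)

# Build the URLs for pages 1, 2 and 3
page_urls = [with_param(short_url, "pageNum", n) for n in range(1, 4)]
for url in page_urls:
    print(url)
```

With 20 records per page and more than 1,500 records, that adds up to dozens of requests.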

But we know you have the mind of a true hacker. 

What if we tweaked `pageSize` instead and set it to a value exceeding the total number of records in the database (~1,500 at the time of writing)?

Below is a modified version of the URL that applies this strategy. We changed `pageSize=20` to `pageSize=2000` (yes, that's two thousand).

Go ahead. Click the link. We dare you.

<a href="https://datamanager.cpj.org/api/datamanager/reports/entries?distinct(personId)=&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,'Killed')=&or(eq(type,'media worker'),in(motiveConfirmed,'Confirmed'))=&in(type,'Journalist')=&ge(year,1992)=&le(year,2024)=">https://datamanager.cpj.org/api/datamanager/reports/entries?distinct(personId)=&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,'Killed')=&or(eq(type,'media worker'),in(motiveConfirmed,'Confirmed'))=&in(type,'Journalist')=&ge(year,1992)=&le(year,2024)=</a>
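
If you'd rather save the data than stare at it in a browser tab, a few lines of standard-library Python should do. Treat this as a minimal sketch under some assumptions: `encode_url` and `download_json` are hypothetical helpers we made up, the server may expect headers that a browser sends and this code does not, and we make no claims about the shape of the JSON that comes back. The block itself only prints the encoded URL; it doesn't touch the network until you call `download_json`.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Same URL as the link above, with pageSize=2000
API_URL = (
    "https://datamanager.cpj.org/api/datamanager/reports/entries"
    "?distinct(personId)="
    "&includes=organizations,fullName,location,status,typeOfDeath,charges,"
    "startDisplay,mtpage,country,type,motiveConfirmed"
    "&sort=fullName&pageNum=1&pageSize=2000"
    "&in(status,'Killed')="
    "&or(eq(type,'media worker'),in(motiveConfirmed,'Confirmed'))="
    "&in(type,'Journalist')="
    "&ge(year,1992)=&le(year,2024)="
)


def encode_url(url):
    """Percent-encode characters, such as spaces, that urlopen rejects."""
    # Leave URL punctuation and the API's filter syntax untouched
    return quote(url, safe=":/?&=(),'")


def download_json(url, path):
    """Fetch url, parse the body as JSON, save it to path and return it."""
    with urlopen(encode_url(url), timeout=30) as response:
        data = json.load(response)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data


print(encode_url(API_URL))
```

To actually fetch the records, you could call something like `download_json(API_URL, "cpj_killed.json")`, where the file name is just an example.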

## Skip the Scraping

Congratulations!! You nabbed the data without having to write a web scraper. At least in the traditional sense. 

There was no need to scrape the search page, fill out a form, get the results back, and then page through the search results, extracting data points from HTML along the way. If that sounds painful and error-prone, you have good instincts. It's a workable solution, but in this case it's total overkill.

Instead, we gave the site a [physical exam](dissecting_websites.ipynb) and realized that we could skip the scraping entirely and just grab the data.

If you've never dissected a website like this before, all of the above likely seems like magic. It might even feel like this process would take just as long as writing a web scraper. But you'd be wrong. As you gain comfort with dissecting websites, the techniques described here will take you minutes -- perhaps even seconds -- on many sites.

Invest a bit of time learning how to use browser Developer Tools, and this workflow will become second nature. And then you can avoid hours of writing and maintaining fickle scrapers. Sometimes.

## Alas, you can't always skip the scraping

Sadly, not all websites provide a grab-and-go JSON API. Often, you really will need to grab the HTML source code for a web page to extract the data or files that you're after.

If you find yourself in that situation, check out the following:

- [WYSIWYG Scraping](wysiwyg_scraping.ipynb) - Basic scrapes. Unicorns in the mist...
- [Just Use a Browser, Robot](just_use_browser_robot.ipynb) - After a certain threshold, it's easier to mimic a human.