# Dissecting a web site

> **WARNING**: This page strains the limits of medical analogies and metaphors.

OK, we get it. You're just *itching* to scrape that government website. A juicy story is waiting to be told, if you could only get your hands on the data...

PLEASE. Resist the urge to start cranking away on code. 

Close that code editor and fire up a web browser.

Open your browser's [developer tools](https://en.wikipedia.org/wiki/Web_development_tools) (*see [Resources](resources.ipynb) for options*).

Now spend some time dissecting the anatomy of the website. 

Your job is to understand how the site works at a code level. By code, we mean how the HTML, CSS, JavaScript and various other file types (audio, video, PNGs, JPEGs, etc.) come together into this thing -- this *experience* -- we call a web page.

Why do we need to perform this kind of surgery?

Because **understanding the inner workings of a website is the only way to craft a sensible, minimally painful scraping strategy.** 

This process is not an indulgence. It will routinely spare you hours or days of wasted effort writing brittle scrapers, only to realize that a little poking around would have revealed a [JSON API](../apis/README.ipynb) that provides the data in a much more accessible fashion.

Don't believe us? Check [this](skip_scraping_cheat.ipynb) out. 

Or read on for a checklist of questions that will help you devise a solid scraping...wait for it... prescription.

## Anamnesis

Fancy word, [anamnesis](https://en.ghsg.org/anamnesis), right? 

From medicine, it refers to the doctor-patient interview process in which a physician questions a patient to gather information about her health, symptoms, etc. It's a critical step in the process of diagnosis and prescribing remedies.

Before we begin scraping, we have to play the role of doctor. The website is our patient.

### Do you *really* need to scrape?

Here are some questions to ask when assessing a site (see *The Web Surgeon's Tools* below for an overview of tools and underlying tech that can help you dissect a website).

1. Is there an easier, more reliable way to get the data than scraping? 
1. You're sure there's no easier way to get the data? A CSV download perhaps? Maybe a simple phone call or records request?

### OK, you do need to scrape

If you're *absolutely certain* there's no easier way to get the data, then here are some 
questions to help plan a scraping strategy.

> **See [Website Personalities](website_personalities.ipynb) for details on terminology such as "pagination," "predictable URLs," etc.**

1. Does the page have a JSON or other structured data source underlying it (see the Developer Tools Network tab) that you can readily use instead of scraping?
1. Is the target information located on a single web page?
1. Is there a landing page with a list of items that link to child pages with additional details?
1. Does the site use [pagination][] to present a long list of data, files, downstream pages, etc.?
1. Do target pages/files have "predictable" URLs (*see [Website Personalities](website_personalities.ipynb)*)?
1. Do you have to fill out a search form before seeing target results?
1. Does the site require a user to log in?
1. Is the site using sessions/cookies to manage client connections?
1. Is the target data in the source HTML or is it dynamically generated by JavaScript after the page has loaded in the browser? (*see [Website Personalities](website_personalities.ipynb) and [Drive the Browser, Robot](drive_the_browser_robot.ipynb)*)
1. Are there CAPTCHAs or does the site block IP addresses that issue too many requests? *Note: Often you'll only discover these roadblocks while testing or running a scraper. See [Website Personalities](website_personalities.ipynb) for more background.*
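
One quick way to answer the "source HTML or JavaScript?" question is to look at the raw HTML the server sends and check whether the target rows are already there. The snippet below is only a rough sketch using Python's standard library. The two page bodies, the `inspections` table and the helper names are made up for illustration; in practice you would paste in (or download) the real page source.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count the <tr> start tags found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        if tag == "tr":
            self.rows += 1


def count_table_rows(html):
    """Return how many table rows appear in the raw HTML string."""
    parser = RowCounter()
    parser.feed(html)
    return parser.rows


# Hypothetical server-rendered page: the data rows are in the source.
STATIC_PAGE = """
<table id="inspections">
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Cafe A</td><td>92</td></tr>
</table>
"""

# Hypothetical JavaScript-driven page: the table is empty until a script fills it.
DYNAMIC_PAGE = """
<table id="inspections"></table>
<script src="/static/app.js"></script>
"""

print(count_table_rows(STATIC_PAGE))   # 2
print(count_table_rows(DYNAMIC_PAGE))  # 0
```

If the count is zero in the raw source but you can see rows in the browser, the data is probably arriving after page load, which is a good cue to check the Network tab for a JSON request.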

[pagination]: https://en.wikipedia.org/wiki/Pagination#Pagination_on_UI
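
If the answers to the pagination and "predictable URLs" questions are both yes, you can often generate every page address up front instead of clicking "Next" in code. Here is a minimal sketch of that idea. The base URL and the `page` / `per_page` parameter names are assumptions for illustration; a real site will use whatever names you spot in the address bar or Network tab.

```python
from urllib.parse import urlencode


def page_urls(base_url, last_page, per_page=50):
    """Build one URL per results page, numbered 1 through last_page."""
    urls = []
    for page in range(1, last_page + 1):
        # Hypothetical query parameters; substitute the site's real ones
        query = urlencode({"page": page, "per_page": per_page})
        urls.append(f"{base_url}?{query}")
    return urls


for url in page_urls("https://example.gov/inspections", last_page=3):
    print(url)
# https://example.gov/inspections?page=1&per_page=50
# https://example.gov/inspections?page=2&per_page=50
# https://example.gov/inspections?page=3&per_page=50
```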

## The Web Surgeon's Tools

To answer these questions, you must go beyond simply clicking around a
website. You must use your web browser to view its source code ***and***
use [Developer Tools][] to look under the hood.

[Developer Tools]: https://developers.google.com/web/tools/chrome-devtools/

Here are some additional resources to level up on core skills for
dissecting websites:

- [HTML][] - A markup language that tells your browser how to structure a page.
- [HTTP][] - Or [Hypertext Transfer Protocol](https://en.wikipedia.org/wiki/HTTP) if you're into web jargon. Its request methods are the verbs of the Web. They enable client computers (aka your browser or scraper code) and servers to communicate. If nothing else, you should understand how to use [GET][] and [POST][] requests (the latter are commonly used with web forms).
- [CSS][] - A style language that tells a browser how to display page elements. CSS selectors are invaluable for extracting data from elements in a page.
- JavaScript and the [Document Object Model (DOM)][] - You don't need to be a master of JavaScript to scrape, but it's critical to understand how JavaScript can manipulate the DOM to dynamically generate/alter content **after** a page has been loaded.

[HTML]: https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started
[HTTP]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods
[GET]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/GET
[POST]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST
[CSS]: https://developer.mozilla.org/en-US/docs/Web/CSS
[Document Object Model (DOM)]: https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction
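
To make the GET/POST distinction concrete, here is a small sketch that builds (but does not send) both kinds of request with Python's standard library. The search URL and the `county` / `year` form fields are hypothetical; the point is only where the parameters end up in each case.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search endpoint and form fields, for illustration only
SEARCH_URL = "https://example.gov/search"
PARAMS = {"county": "Alameda", "year": 2020}


def build_get(url, params):
    """GET: the parameters travel in the URL's query string."""
    return Request(f"{url}?{urlencode(params)}")


def build_post(url, params):
    """POST: the parameters travel in the request body, as a web form would send them."""
    body = urlencode(params).encode("utf-8")
    return Request(url, data=body)


get_req = build_get(SEARCH_URL, PARAMS)
post_req = build_post(SEARCH_URL, PARAMS)

print(get_req.get_method(), get_req.full_url)
# GET https://example.gov/search?county=Alameda&year=2020
print(post_req.get_method(), post_req.data)
# POST b'county=Alameda&year=2020'
```

Passing either object to `urllib.request.urlopen` would actually send it. The Network tab in Developer Tools shows which method a form uses and which fields it submits, so you can mirror them in your scraper.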