# Website Personalities

Every website has a personality.


On a technical level, web scraping typically involves gathering web pages or other files from a website. This process can be automated by understanding the anatomy of a site -- how pages are structured, URL patterns, and other ["personality traits"](website_personalities.ipynb) of a site.

Scraping can be more or less difficult depending on the nature of the site. 

A simple site with no dynamic content and predictable URL patterns could be a quick job, compared to one that uses web forms, randomized URLs, cookies or sessions, dynamically generated content, password-based logins, CAPTCHAs, etc. 

Sites often use a combination of these strategies, so it's important to spend time learning how a site works so you can devise an appropriate scraping strategy.

Below are some high-level strategies for a few common scraping
scenarios. Keep in mind that you may run into sites that require you to
combine these strategies -- e.g. basic scraping techniques with more
advanced stateful web scraping.

## Basic scrapes

Let's define a basic scrape as one where the target information is located in the page's HTML source itself, and any child pages are easily accessible (perhaps because they use predictable URLs that can be harvested from a landing or index page). The site doesn't use forms or require logins, doesn't dynamically generate content, and doesn't use sessions/cookies. In such cases, you can likely get away with simply using the [requests](https://2.python-requests.org/en/master/) library to grab HTML pages and the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to parse and extract data from each page's HTML.
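
As a minimal sketch of such a basic scrape, the example below fetches a page with requests and pulls table rows out of its HTML with BeautifulSoup. The `fetch_html` and `extract_table_rows` helpers and the sample HTML are illustrative assumptions, not part of either library; a real page will need parsing logic tailored to its markup.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML (hypothetical helper)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_table_rows(html):
    """Return each table row as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Made-up HTML standing in for a downloaded page, so the parsing step
# can be tried without touching the network.
SAMPLE_HTML = """
<table>
  <tr><th>Bank</th><th>State</th></tr>
  <tr><td>Example Bank</td><td>WA</td></tr>
</table>
"""

rows = extract_table_rows(SAMPLE_HTML)
print(rows)
```

Against a live site, the same parser would be fed `fetch_html(url)` instead of the sample string.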

## Predictable URLs and Query Strings

If a site contains a list of records that in turn lead to child pages with more detail, you can often scrape the so-called "index" page to harvest links to the child pages. Or perhaps there's a piece of metadata in the records on the index page (e.g. a company ID) that will let you
dynamically generate the links to child pages.

The FDIC Failed Banks site and Data.gov are good examples:

> <https://www.fdic.gov/bank/individual/failed/wafedbank.html>

> <https://catalog.data.gov/dataset/national-student-loan-data-system>
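
Both approaches can be sketched in a few lines: harvesting links from an index page, or generating child URLs from IDs found on it. The helper names, the `#records` selector, the `example.com` URLs, and the sample HTML are all hypothetical and would need to be adapted to the actual site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def harvest_links(index_html, base_url, css_selector="a[href]"):
    """Collect absolute URLs for every link matched on an index page."""
    soup = BeautifulSoup(index_html, "html.parser")
    # urljoin resolves relative hrefs against the index page's own URL
    return [urljoin(base_url, a["href"]) for a in soup.select(css_selector)]


def build_detail_urls(record_ids, url_template):
    """Generate child-page URLs from IDs scraped off the index page."""
    return [url_template.format(record_id=record_id) for record_id in record_ids]


# Made-up index page with one root-relative and one relative link.
INDEX_HTML = """
<ul id="records">
  <li><a href="/records/101.html">First record</a></li>
  <li><a href="detail/102.html">Second record</a></li>
</ul>
"""

links = harvest_links(
    INDEX_HTML, "https://example.com/list/index.html", "#records a[href]"
)
generated = build_detail_urls(
    [101, 102], "https://example.com/records/{record_id}.html"
)
print(links)
print(generated)
```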

Some sites use so-called [query strings](https://en.wikipedia.org/wiki/Query_string), which are
extra search parameters added to a URL as one or more `key=value` pairs (following a question mark). Here are two examples:

> <https://www.whitehouse.gov/?s=coronavirus>

> <https://www.governmentjobs.com/careers/santaclara?department%5B0%5D=County%20Counsel&department%5B1%5D=County%20Executive&sort=Salary%7CDescending>

The requests library can help you [construct such
URLs](http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls).
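
A small sketch of how requests turns a dictionary of parameters into a query string. Preparing the request without sending it makes the resulting URL visible offline; `build_url` is a hypothetical helper, and the `example.com` search URL and its parameter names are made up for illustration.

```python
import requests


def build_url(base_url, params):
    """Return the URL requests would call for the given query parameters."""
    # Preparing a request encodes the parameters but does not send anything.
    return requests.Request("GET", base_url, params=params).prepare().url


print(build_url("https://www.whitehouse.gov/", {"s": "coronavirus"}))
print(build_url("https://example.com/search", {"q": "student loans", "page": 2}))

# To actually fetch the page, pass the same dictionary to requests.get:
# response = requests.get("https://www.whitehouse.gov/", params={"s": "coronavirus"}, timeout=30)
```

Note that requests encodes spaces in parameter values as `+`, so the URL it builds may differ cosmetically from one copied out of a browser's address bar.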

### Web Forms

Often, you will have to fill out a search form to locate target data.  Such forms can be handled in a few ways, depending on the nature of the site. If the form generates a predictable URL (perhaps using URL parameters), you can dig into the form options in the HTML and figure out how to dynamically construct the URL. You can test this by manually filling out and submitting the form and examining the URL of the resulting page.

The website where officials in East Brandywine, PA, post agendas is a good example.

<img alt="Agenda website with form" src="../files/scraping_agendas_pa.png" style="vertical-align:bottom; border:2px solid #555; margin: 10px;" width="350">

Many web forms use POST requests, where the form information is sent as part of the body of the web request (as opposed to embedded in the URL).
In such cases, you can use a tool such as [requests.post](https://2.python-requests.org/en/master/user/quickstart/#more-complicated-post-requests) or Selenium to [fill out and submit](https://selenium-python.readthedocs.io/locating-elements.html?highlight=login#locating-by-id)
the form.
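
To illustrate the difference from a query string, here is a sketch of a POST form submission with requests. The URL and the field names (`year`, `type`) are hypothetical; in practice you would copy the real field names from the form's `<input>` and `<select>` elements. The request is prepared without being sent so the encoded body can be inspected.

```python
import requests


def prepare_form_post(url, form_data):
    """Build (but do not send) a POST request carrying form fields in its body."""
    return requests.Request("POST", url, data=form_data).prepare()


def submit_form(url, form_data):
    """Send the form over the network and return the resulting HTML."""
    response = requests.post(url, data=form_data, timeout=30)
    response.raise_for_status()
    return response.text


# Hypothetical search form with two fields.
prepared = prepare_form_post(
    "https://example.com/agendas/search", {"year": "2020", "type": "minutes"}
)
print(prepared.url)                      # the URL carries no form data
print(prepared.body)                     # the fields travel in the request body
print(prepared.headers["Content-Type"])
```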

### Logging in

Sites that require logins can often be handled by simply passing in your login credentials as part of a web form (see Web Forms above). The requests library provides several ways to [authenticate](http://docs.python-requests.org/en/master/user/authentication/),
or you can use a stateful web browser such as Selenium to [fill out a login form](https://selenium-python.readthedocs.io/locating-elements.html?highlight=login#locating-by-id).
Login-based sites will also often use sessions/cookies to manage your interactions. See [Stateful Web Scraping](#stateful-web-scraping) below for more details on how to handle this.
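
The two login styles mentioned above can be sketched roughly as follows. HTTP Basic authentication is built into requests; the form-based variant assumes a login form with fields named `username` and `password`, which is a guess that must be checked against the real form. All URLs and credentials here are placeholders.

```python
import requests
from requests.auth import HTTPBasicAuth


def prepare_basic_auth_request(url, username, password):
    """Build (but do not send) a GET request carrying HTTP Basic credentials."""
    request = requests.Request("GET", url, auth=HTTPBasicAuth(username, password))
    return request.prepare()


def form_login(session, login_url, username, password):
    """Post credentials to a login form; field names are assumptions."""
    response = session.post(
        login_url,
        data={"username": username, "password": password},
        timeout=30,
    )
    response.raise_for_status()
    return response


prepared = prepare_basic_auth_request("https://example.com/private", "user", "pass")
print(prepared.headers["Authorization"])

# Form-based login against a live site would look like this:
# with requests.Session() as session:
#     form_login(session, "https://example.com/login", "user", "pass")
#     page = session.get("https://example.com/private", timeout=30)
```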

### Dynamic content

Many sites use JavaScript to dynamically add or transform page content ***after*** you've loaded the page. This means that what you see in the source HTML using View Source will not match what you see in the browser
(or the Elements tab of Chrome Developer Tools).

Scraping such a page typically requires a browser-automation library such as [Selenium](https://selenium-python.readthedocs.io/index.html), which uses "web driver" technology to control a real browser such as Firefox and automate browser interactions.

Selenium gives you access to the [Document Object Model](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction)
(DOM) -- the content as seen by a real web browser. The DOM is the internal representation of a page that reflects both the static HTML *and* elements/styles dynamically added or manipulated by JavaScript.

Selenium allows you to automate interactions with the page, such as scrolling down or stepping through a paginated list of results, the same way a human using a browser would. These interactions generate the content that you're targeting.

Further, the [Python Selenium](https://selenium-python.readthedocs.io/) library
provides convenient helper methods for accessing DOM elements, for instance using [CSS selectors](https://selenium-python.readthedocs.io/locating-elements.html#locating-elements-by-css-selectors). This is similar in concept (and often in syntax) to how BeautifulSoup helps you parse and extract data from HTML.
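
A rough sketch of scraping JavaScript-rendered content with Selenium, assuming Selenium 4 and a local Firefox installation. The function names, the URL, and the CSS selector are hypothetical. The Selenium imports sit inside `scrape_rendered_texts` so that the text-cleaning helper can be used and tested on machines without a browser.

```python
def visible_texts(elements):
    """Return stripped, non-empty text from WebElement-like objects."""
    texts = (element.text.strip() for element in elements)
    return [text for text in texts if text]


def scrape_rendered_texts(url, css_selector, timeout=10):
    """Load a page in Firefox, wait for matching elements, return their text."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until JavaScript has added the target elements to the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return visible_texts(driver.find_elements(By.CSS_SELECTOR, css_selector))
    finally:
        driver.quit()  # always close the browser, even on errors


# Example call (opens a real browser window):
# titles = scrape_rendered_texts("https://example.com/results", "div.result h2")
```

The explicit wait matters because `driver.get` can return before scripts on the page have finished inserting content.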

### Stateful Web Scraping

Some websites will use [sessions](https://en.wikipedia.org/wiki/Session_(computer_science)#HTTP_session_token) to uniquely identify a visitor and maintain a record of the visitor's interaction -- or state -- with the site. Web servers often manage sessions by sending a unique
key in a [cookie](https://en.wikipedia.org/wiki/HTTP_cookie) to your browser. This key is passed back to the server for each new request a visitor makes (e.g. when submitting a search form). Scraping a session-based site requires you to manage the session in your code. The
requests library has support for [managing
sessions](https://2.python-requests.org/en/master/user/advanced/#session-objects).

Alternatively, you can use a library such as [Selenium](https://selenium-python.readthedocs.io/getting-started.html) to mimic a browser and get session management for "free" (although note that session management in Selenium can also get tricky depending on the
nature of the site).
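
As a sketch of session handling with requests, the example below reuses one `requests.Session` so that cookies set by an earlier response are sent back on later requests. The URLs, the form data, and the `sessionid` cookie name are placeholders. The cookie is planted by hand here only to show, without any network traffic, what the session would send.

```python
import requests


def search_with_session(session, start_url, search_url, form_data):
    """Visit a start page to obtain a session cookie, then submit a search."""
    session.get(start_url, timeout=30).raise_for_status()
    # The session now holds any cookies the server set; they are sent
    # automatically with the next request.
    response = session.post(search_url, data=form_data, timeout=30)
    response.raise_for_status()
    return response.text


def preview_cookie_header(session, url):
    """Return the Cookie header the session would send to url (no network)."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


session = requests.Session()
# Simulate a cookie that a server would normally set in its response.
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")
cookie_header = preview_cookie_header(session, "https://example.com/search")
print(cookie_header)
```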