# Website Personalities

Every website has a personality.


On a technical level, web scraping typically involves gathering web pages or other files from a website. This process can be automated by understanding the anatomy of a site -- how pages are structured, URL patterns, and other "personality traits" of a site.

Scraping can be more or less difficult depending on the nature of the site. 

A friendly site with no dynamic content and predictable URL patterns could be a quick job.

A not-so-friendly site might "feature" web forms, randomized URLs, cookies or sessions, dynamically generated content, password-based logins, CAPTCHAs, etc.

Sites often use a combination of these strategies, so it's important to spend time learning how a site works so you can devise an appropriate scraping strategy.

Below are some high-level challenges and related technical strategies for common scraping scenarios. Keep in mind that you may run into sites that require you to combine approaches -- e.g. basic scraping techniques with more advanced stateful web scraping.

## Avoiding a scrape

Some seemingly complex sites might be quite easy to "scrape".

> Check out [Skip the Scraping: Cheat with JSON](skip_scraping_cheat.ipynb).

## Basic scrapes

Let's define a basic scrape as a site where target information is located in the HTML source code, and any child pages are easily accessible (perhaps because they use predictable URLs that can be harvested from a landing or index page). 

The site doesn't use forms or require logins.

It does not dynamically generate content, and does not use sessions/cookies. 

In such cases, you can likely get away with simply using the [requests](https://requests.readthedocs.io/en/latest/) library to grab HTML pages and the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to parse and extract data from each page's HTML.
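As a minimal sketch of that two-step pattern (fetch, then parse), the example below pairs a `requests`-based fetch function with a `BeautifulSoup` parser. The table layout, column names, and sample rows are made up for illustration, and the parser runs against an inline HTML string so the example works without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_rows(html: str) -> list[dict]:
    """Extract one dict per table row, keyed by the header cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row contains only <th> cells, so it is skipped
            rows.append(dict(zip(headers, cells)))
    return rows


# Made-up stand-in for the HTML that fetch_html() would return from a real site.
SAMPLE_HTML = """
<table>
  <tr><th>Bank</th><th>City</th></tr>
  <tr><td>First Example Bank</td><td>Springfield</td></tr>
  <tr><td>Sample Savings</td><td>Riverton</td></tr>
</table>
"""

rows = parse_rows(SAMPLE_HTML)
print(rows)
```

On a real site you would call `parse_rows(fetch_html(url))` and adjust the tag names to match the page's actual markup.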

> Here's a [basic scraping tutorial](wysiwyg_scraping.ipynb) that uses `requests` and `BeautifulSoup`.

## Predictable URLs and Query Strings

If a site contains a list of records that in turn lead to child pages with more detail, you can often scrape the so-called "index" page to harvest links to the child pages.

Or perhaps there's a piece of metadata in the records on the index page (e.g. a company ID) that will let you dynamically generate the links to child pages.

The FDIC Failed Banks site and Data.gov are good examples:

> <https://www.fdic.gov/bank/individual/failed/wafedbank.html>

> <https://catalog.data.gov/dataset/national-student-loan-data-system>
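Both approaches can be sketched in a few lines. The index-page markup, the `data-id` attribute, and the `example.com` URLs below are hypothetical; a real site will use its own structure, so treat this as an illustration of the idea rather than a recipe for either site above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://example.com/banks/"  # hypothetical index page URL

# Hypothetical index page: each record links to a child page and carries an ID.
INDEX_HTML = """
<ul id="records">
  <li data-id="1001"><a href="detail/1001.html">First Example Bank</a></li>
  <li data-id="1002"><a href="detail/1002.html">Sample Savings</a></li>
</ul>
"""

soup = BeautifulSoup(INDEX_HTML, "html.parser")

# Option 1: harvest the links directly, resolving relative hrefs against the index URL.
harvested = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]

# Option 2: generate the links from a piece of metadata (here, the record ID).
generated = [
    f"{BASE_URL}detail/{li['data-id']}.html"
    for li in soup.find_all("li", attrs={"data-id": True})
]

print(harvested)
print(generated)
```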

Some sites use so-called [query strings](https://en.wikipedia.org/wiki/Query_string), which are
extra search parameters added to a URL as one or more `key=value` pairs. The pairs follow a question mark and are separated by ampersands (`&`). Here are two examples:

> <https://www.whitehouse.gov/?s=coronavirus>
> <https://www.governmentjobs.com/careers/santaclara?department%5B0%5D=County%20Counsel&department%5B1%5D=County%20Executive&sort=Salary%7CDescending>

The `requests` library can help you [construct such
URLs](http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls).
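Here is a small sketch of how a dictionary of parameters becomes a query string. It prepares requests without sending them, so it runs offline; `build_url` is a helper name invented for this example, and the `example.com` search URL is hypothetical.

```python
import requests


def build_url(base_url: str, params: dict) -> str:
    """Return the full URL that requests would use for these parameters."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


search_url = build_url("https://www.whitehouse.gov/", {"s": "coronavirus"})
print(search_url)  # https://www.whitehouse.gov/?s=coronavirus

# Multiple pairs are joined with ampersands; spaces are encoded for you.
paged_url = build_url("https://example.com/search", {"q": "county counsel", "page": 2})
print(paged_url)  # https://example.com/search?q=county+counsel&page=2
```

When you actually want to fetch the page, `requests.get(base_url, params=params)` builds the same URL and sends the request in one step.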

### Web Forms

Often, you will have to fill out a search form to locate target data.  Such forms can be handled in a few ways, depending on the nature of the site. If the form generates a predictable URL (perhaps using URL parameters), you can dig into the form options in the HTML and figure out how to dynamically construct the URL. You can test this by manually filling out and submitting the form and examining the URL of the resulting page.

The website where officials in East Brandywine, PA post agendas is a good example.

<img alt="Agenda website with form" src="../files/scraping_agendas_pa.png" style="vertical-align:bottom; border:2px solid #555; margin: 10px;" width="350">

Many web forms use POST requests, where the form information is sent as part of the body of the web request (as opposed to embedded in the URL). 

In such cases, you can use a tool such as [requests.post](https://docs.python-requests.org/en/latest/user/quickstart/#more-complicated-post-requests) or Selenium to [fill out and submit](https://selenium-python.readthedocs.io/locating-elements.html#locating-by-id) the form.
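The sketch below shows what a POST form submission looks like with `requests`. The form URL and field names are hypothetical; on a real site you would read them from the `<form>` and `<input>` tags in the page source. The request is prepared but not sent, so you can inspect where the form data ends up.

```python
import requests

FORM_URL = "https://example.com/agendas/search"      # hypothetical form action
form_fields = {"keyword": "zoning", "year": "2024"}  # hypothetical input names

# Prepare the request without sending it, to see what would go over the wire.
prepared = requests.Request("POST", FORM_URL, data=form_fields).prepare()
print(prepared.url)                      # the URL carries no form data
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # keyword=zoning&year=2024


def submit_form(url: str, fields: dict) -> str:
    """Submit the form for real and return the HTML of the results page."""
    response = requests.post(url, data=fields, timeout=30)
    response.raise_for_status()
    return response.text
```

Note that the form values travel in the request body rather than the URL, which is why the results page of a POST form often has no query string to copy.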

### Logging in

Sites that require logins can often be handled by simply passing in your login credentials as part of a web form (see `Web Forms` above). 

The requests library provides several ways to [authenticate](https://docs.python-requests.org/en/latest/user/authentication/), or you can use a browser automation library such as [Playwright](https://playwright.dev/python/).

> Check out [Drive the Browser, Robot](drive_the_browser_robot.ipynb).

Login-based sites will also often use sessions/cookies to manage your interactions (*see `Stateful Web Scraping` below for more details*).

### Dynamic content

Many sites use JavaScript to dynamically add or transform page content ***after*** the page has loaded. This means that what you see in the source HTML using `View Page Source` will not match what you see in the browser (or the Elements tab of Chrome Developer Tools).

Scraping such a page typically requires a library such as [Playwright](drive_the_browser_robot.ipynb) or [Selenium](https://selenium-python.readthedocs.io/index.html), which automate a real browser such as Firefox or Chrome.

These tools give you access to the [Document Object Model](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction)
(DOM) -- the content as seen by a real web browser. The DOM is the internal representation of a page that reflects both the static HTML delivered to your browser *and* elements/styles dynamically added or manipulated by JavaScript.

Playwright/Selenium allow you to automate interactions with the browser -- the same as a human. These tools can be programmed to scroll down a page, step through a paginated list of results, take screenshots and download PDFs.
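As a rough sketch, here is how Playwright's synchronous API can load a page, wait for dynamically added content, and hand back the rendered DOM as HTML. The function name and its `wait_for` argument are invented for this example, and running it assumes you have installed Playwright (`pip install playwright`) and a browser (`playwright install chromium`).

```python
def fetch_rendered_html(url: str, wait_for: str) -> str:
    """Load a page in headless Chromium and return the rendered DOM as HTML."""
    # Imported here so this module still loads on machines without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Block until JavaScript has added the element we care about.
            page.wait_for_selector(wait_for)
            # page.content() reflects the live DOM, not the original page source.
            return page.content()
        finally:
            browser.close()
```

A call might look like `fetch_rendered_html("https://example.com/results", "table#results")`, where both the URL and the CSS selector are placeholders. The returned HTML can then be parsed with `BeautifulSoup` like any static page.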

> Check out [Drive the Browser, Robot](drive_the_browser_robot.ipynb) for a tutorial using `Playwright`.

### Stateful Web Scraping

Some websites will use [sessions](https://en.wikipedia.org/wiki/Session_(computer_science)#HTTP_session_token) to uniquely identify a visitor and maintain a record of the visitor's interaction -- or state -- with the site. 

Web servers often manage sessions by sending a unique key in a [cookie](https://en.wikipedia.org/wiki/HTTP_cookie) to your browser. This key is passed back to the server for each new request a visitor makes (e.g. when submitting a search form). 

Scraping a session-based site requires you to manage the session in your code. The
requests library has support for [managing sessions](https://requests.readthedocs.io/en/latest/user/advanced/#session-objects).
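The following sketch illustrates the idea with `requests.Session`. The URLs, form field names, and cookie name are all hypothetical. To keep the example runnable offline, it sets the session cookie by hand instead of calling the login function, then shows that the cookie is attached to a later request automatically.

```python
import requests

LOGIN_URL = "https://example.com/login"    # hypothetical
SEARCH_URL = "https://example.com/search"  # hypothetical


def log_in(session: requests.Session, username: str, password: str) -> None:
    """POST credentials; the session stores any cookies the server sets."""
    response = session.post(
        LOGIN_URL,
        data={"username": username, "password": password},  # hypothetical field names
        timeout=30,
    )
    response.raise_for_status()


session = requests.Session()

# Simulate the cookie a server would set after log_in(), so this runs offline.
session.cookies.set("sessionid", "abc123", domain="example.com")

# Any later request made through the session carries the cookie automatically.
prepared = session.prepare_request(requests.Request("GET", SEARCH_URL))
print(prepared.headers["Cookie"])  # sessionid=abc123
```

In a real scrape you would call `log_in(session, ...)` once and then use `session.get()` or `session.post()` for every subsequent page, so the server sees one continuous visit.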

Alternatively, you can use a browser automation library such as [Playwright](https://playwright.dev/python/) or [Selenium](https://selenium-python.readthedocs.io/getting-started.html) to mimic a browser and get session management for "free".

> Check out [Drive the Browser, Robot](drive_the_browser_robot.ipynb) for a tutorial using `Playwright`.

## CAPTCHAs

CAPTCHAs are the pesky image- or audio-based challenges that confront users when they visit some websites. The goal is to block (or at least slow down) [bots](https://en.wikipedia.org/wiki/Software_agent), including web scrapers.

<img src="../files/captcha_sf_court_search.png" alt="sf courts captcha" style="float:right; border:2px solid #555; margin: 10px;" width="200"/>

On some sites they're always present as a first obstacle. Other sites use algorithms (often provided by third-party services) to determine if the "client" requesting a page is a human or a bot. This assessment can be based on everything from the metadata that your browsing agent shares with the website, to how frequently requests for pages are emanating from your [IP address][] (see below). 

If you're scraping a lot of pages on a website or domain, it's possible that initial requests will not trigger a CAPTCHA, but as the volume of requests increases over a certain period of time, you may end up seeing CAPTCHAs.

A common solution (not without its [ethical challenges](https://github.com/iv-org/invidious/issues/1256)) is to use a third-party CAPTCHA-solving service such as [AntiCAPTCHA](https://anti-captcha.com/). These services enlist data entry workers to solve CAPTCHAs and typically cost fractions of a penny per CAPTCHA solved.

## IP Blocking

Websites that receive a large number of requests from a single [IP address][] may begin rejecting web requests from that address, typically for some period of time. When scraping websites, a common tactic is to "throttle" the web scraper to slow it down (e.g. pausing some amount of time before requesting additional pages).
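One simple way to throttle, sketched below, is to sleep for a random interval between requests. The `throttled` helper, its default delays, and the `example.com` URLs are invented for this example; the demonstration uses a fake fetcher and a stand-in for `time.sleep` so it finishes instantly and never touches the network.

```python
import random
import time


def throttled(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield (url, result) pairs, pausing a random interval between requests."""
    for i, url in enumerate(urls):
        if i > 0:  # no need to wait before the first request
            sleep(random.uniform(min_delay, max_delay))
        yield url, fetch(url)


# Offline demonstration: a fake fetcher, and a "sleep" that only records the delay.
pauses = []
results = list(
    throttled(
        [
            "https://example.com/page/1",
            "https://example.com/page/2",
            "https://example.com/page/3",
        ],
        fetch=lambda url: f"<html>{url}</html>",
        sleep=pauses.append,
    )
)
print(len(results), "pages fetched with", len(pauses), "pauses")
```

In a real scraper you would pass a function that calls `requests.get` as `fetch` and leave `sleep` at its default. Randomizing the delay makes the request pattern look a little less mechanical than a fixed pause.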

Alternatively, you can use free or paid [web proxy servers](https://en.wikipedia.org/wiki/Proxy_server) to generate web requests from randomized IP addresses. Here are a few resources to check out:

- [Zyte](https://www.zyte.com/ban-management-lp)
- [Oxylabs](https://oxylabs.io/pages/rotating-residential-proxies)
- [How to use a Proxy with Scrapy in 2024](https://www.zenrows.com/blog/scrapy-proxy)
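For plain `requests` scripts, a proxy is passed through the `proxies` argument. The sketch below rotates through a small pool in round-robin order (a simpler stand-in for random selection). The proxy addresses are hypothetical placeholders, and `fetch_via_proxy` is defined but not called, since it needs working proxies.

```python
import itertools

import requests

# Hypothetical proxy endpoints; a real list would come from your proxy provider.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxies() -> dict:
    """Return a requests-style proxies mapping using the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}


def fetch_via_proxy(url: str) -> str:
    """Fetch a URL through the next proxy in the pool (needs real proxies)."""
    response = requests.get(url, proxies=next_proxies(), timeout=30)
    response.raise_for_status()
    return response.text


# Show the rotation without making any network requests.
first, second, third = next_proxies(), next_proxies(), next_proxies()
print(first["http"], second["http"], third["http"])
```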


[IP address]: https://en.wikipedia.org/wiki/IP_address