# Drive the Browser, Robot

## That's an order. Not a request.

The days of simple websites built from plain-old HTML and CSS seem like a distant memory.

Today, a growing number of sites use JavaScript to dynamically add functionality *and* content to a page.

Perhaps you're staring at a simple table of data on a government website and thinking: *This looks like an easy scrape!*

Think again. 

Often, the data you see in a web browser is not actually present in the HTML of the page, but written into the page *after it has loaded* in your browser.
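To make that concrete, here is a minimal sketch that uses only Python's standard library. The HTML string is a made-up stand-in for what a server might send for a JavaScript-driven page (the `budget` table id and the `/api/budgets.json` path are hypothetical). The table body arrives empty, and a script would fill it in later, inside the browser.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, as a plain HTTP client would receive it.
RAW_HTML = """
<html>
  <body>
    <table id="budget"><tbody></tbody></table>
    <script>
      // In a real browser, this script would fetch data and add table rows.
      fetch("/api/budgets.json").then(() => {});
    </script>
  </body>
</html>
"""


class RowCounter(HTMLParser):
    """Count the <tr> tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


counter = RowCounter()
counter.feed(RAW_HTML)
print(counter.rows)  # 0: the rows only exist after JavaScript runs
```

A scraper that only reads the raw HTML would come away empty-handed here, even though a person looking at the page in a browser would see a full table.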

> *Confused? That's natural. Check [this](website_personalities.ipynb) out. We'll wait...*

In other words, [what you see is NOT what you get](wysiwyg_scraping.ipynb).

That fact of modern life on the web can dramatically complicate the task of web scraping, especially if there's no [JSON API](../apis/README.ipynb) lurking under the hood that can allow you to [skip the scraping](skip_scraping_cheat.ipynb).

Modern web pages are often just a collection of "hooks" with little of the most interesting content embedded in the source code. Instead, JavaScript dynamically generates the content using logic and/or secondary sources of data such as a [JSON API](../apis/README.ipynb).

In some cases, this lets us [avoid web scraping](skip_scraping_cheat.ipynb). We can simply grab the JSON, making the job much easier.
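As a rough illustration of why JSON is the easier path, the sketch below parses a small payload with the standard library. The payload is invented for this example and does not come from any real site.

```python
import json

# Hypothetical JSON payload, standing in for what such an API might return.
payload = """
{
  "results": [
    {"agency": "Parks", "budget": 1200000},
    {"agency": "Libraries", "budget": 800000}
  ]
}
"""

data = json.loads(payload)

# The data is already structured: no HTML parsing required.
for row in data["results"]:
    print(row["agency"], row["budget"])
```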

In other cases, it's not so simple. Add into this mix a gaggle of other [website idiosyncrasies](website_personalities.ipynb), and the traditional way of scraping (using simple tools such as `requests` and `BeautifulSoup`) becomes far more difficult or downright impossible.

In such cases, it's often much easier to switch tactics and use a browser-automation tool to drive Chrome or Firefox to harvest files, data, etc.

**You're essentially automating a human workflow: Open browser. Do stuff.**

This approach is a bit mind-bendy and more complex than plain-old `requests`, but it's quite handy when tackling a difficult web scrape.

Many programming languages have browser automation tools. In Python, two of the most popular are [Selenium](https://selenium-python.readthedocs.io/index.html) and the newer kid on the block: [Playwright](https://playwright.dev/python/).

We'll use Playwright in this tutorial since it simplifies some of the initial setup.

> **IMPORTANT**: This tutorial must be run locally on your own machine. It won't work on GitHub Codespaces or in JupyterLite.

## Setting up the Environment

If you haven't already done so, clone the [data-journalism-notebooks GitHub repository](https://github.com/stanfordjournalism/data-journalism-notebooks), either using VS Code or the command line:

```bash
# Here's how to use plain old git to clone on the command line
git clone git@github.com:stanfordjournalism/data-journalism-notebooks.git
```

You can install `playwright` using `pip` or `pipenv`. We've included `playwright` in the `Pipfile` for this repo, so plain-old `pipenv install` should have you covered.

```bash
# On the command line, navigate to the code repo,
# replacing "path/to" with a real path on your machine.
cd path/to/data-journalism-notebooks
pipenv install # will include playwright and some other goodies
```

By default, `playwright` doesn't use the browsers already installed on your machine. Instead, it downloads its own builds of Chromium, Firefox and WebKit. We'll drive Chromium in this tutorial, though you can also use the others.

Run the following command to download those browsers:
        
```bash
# On the command line, navigate to our repo
cd path/to/data-journalism-notebooks
# Activate the virtual environment (where playwright was installed)
pipenv shell
# Download the browsers that playwright will drive
playwright install
```

This command downloads and installs Playwright's own builds of the Chromium, Firefox, and WebKit browsers.

> Note: If you encounter any permission issues while installing the library or browser drivers, try running the command as an administrator: `sudo playwright install` on Mac/Linux systems.

Verify the installation by running the cell below.

In [None]:
from playwright.sync_api import sync_playwright

If the import statement executes without any errors, the `playwright` library is successfully installed and ready to use.
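If you'd also like to know *which* version got installed, the standard library can look it up. This is a small sketch; `installed_version` is a helper name invented for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if playwright is installed, otherwise None
print(installed_version("playwright"))
```

A result of `None` usually means the notebook is running outside the `pipenv` environment where `playwright` was installed.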

## Robot, go to example.com

Launch a browser instance using `playwright` and navigate to <https://example.com>.

### **Important**

This code is more complex than if we were working in a simple Python script (with a `.py` extension).

Because we're working in JupyterLab, which has its own ["event loop"](https://medium.com/@dpzhcmy/running-asynchronous-code-in-jupyter-notebooks-managing-event-loops-b9696a596ce4), we have to use Playwright in so-called `async` mode.

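In practical terms, that means you have to put the `await` keyword in front of most `playwright` method calls that do work in the browser, such as `page.goto(...)`. Calls that merely describe where an element lives, such as `page.get_by_label(...)`, don't need it.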

> *See [Hidden Life of Objects](../classes_and_oop/hidden_life_of_objects.ipynb) if you're unfamiliar with terms such as classes and methods. Check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/) for background on asynchronous Python.*
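If `await` is new to you, this standard-library sketch shows the basic idea without involving a browser. `slow_greeting` is a made-up coroutine that stands in for a slow browser operation. Note that this example is written as a plain `.py` script: `asyncio.run()` raises an error inside a notebook, where an event loop is already running.

```python
import asyncio


async def slow_greeting(name):
    """A made-up coroutine standing in for a slow browser operation."""
    await asyncio.sleep(0.1)  # pause here without blocking other work
    return f"Hello, {name}"


# Calling the function only creates a coroutine object; nothing has run yet.
pending = slow_greeting("robot")
print(type(pending).__name__)  # coroutine

# In a plain script, asyncio.run() starts an event loop and waits for the result.
greeting = asyncio.run(pending)
print(greeting)  # Hello, robot
```

In a notebook cell, the last step would simply be `greeting = await slow_greeting("robot")`, which is the same pattern used with `playwright` throughout this tutorial.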

In [None]:
from playwright.async_api import async_playwright

pw = await async_playwright().start()
# make the browser visible instead of running in "headless" mode
browser = await pw.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto("https://example.com")

In [None]:
# Find the first <h1> element on the page
h1 = await page.query_selector('h1')
# Read the text a human visitor would see
text = await h1.inner_text()
text

Congratulations!! You just made the robot do your bidding.

By the way, it's important to dismiss the robot when you're done, if for no other reason than to avoid a boatload of open browser windows.

In [None]:
await browser.close()
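Closing the browser doesn't stop the Playwright instance we started with `async_playwright().start()`. Playwright provides a `stop()` method for that. The helper below is one way to tidy up both in a single call; `shut_down` is a name invented here, and it assumes the `browser` and `pw` variables from the launch cell.

```python
async def shut_down(browser, pw):
    """Close the browser, then stop the Playwright instance.

    `browser` and `pw` are the objects created in the launch cell.
    """
    try:
        await browser.close()
    finally:
        # Runs even if closing the browser raised an error
        await pw.stop()
```

In a notebook cell, you would call it with `await shut_down(browser, pw)`.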

## Do harder things, Robot

Ok, so that was a nice warm-up. Let's put `playwright` through its paces on a more challenging website.

This time we'll work with the Oklahoma Court Search index.

Technically, it's possible to scrape this site with plain-old `requests` and `BeautifulSoup`, and in fact we do just that in the Big Local News [court-scraper](https://github.com/biglocalnews/court-scraper) project.

But the site is sufficiently complex that `playwright` is a justifiable alternative.

Here's the Oklahoma [Case Docket Search page](https://oscn.net/dockets/Search.aspx). 

![Oklahoma court search](../files/ok_courts_search.png)

The page contains a number of forms, allowing you to search for docket information by:
    
- County / court
- Case number
- Name of parties to the case
- Case date range (not pictured above; scroll down the page)

## Test the search

Before writing any code, you should always spend some time exploring a site from the perspective of a normal user. 

As an example, let's try hunting down the docket info for a case between an Oklahoma resident named Scott Sapulpa and the Gannett news company. Sapulpa won a $25 million award from Gannett in February 2024. Here's the [AP story](https://apnews.com/article/oklahoma-newspaper-defamation-racist-comments-7a97e443a35097fa25617106ea20bafe) on the judgment.

![ok man wins 5 million from gannett](../files/ok_man_wins_5m_from_gannett.png)

We don't have the case number, but the story contains enough details for us to hunt down the docket information based on the plaintiff's name and the county where the case was filed.

Navigate to the [Case Docket Search page](https://oscn.net/dockets/Search.aspx) and fill in the form fields.

First select `Muskogee County District Court` from the court selection menu.

![ok search muskogee county](../files/ok_court_select_muskogee.png)

Next, fill in the plaintiff's last name (`Sapulpa`) and first name (`Scott`).

![ok search sapulpa case](../files/ok_search_sapulpa.png)

Now click the `Go` button, making sure to use the one in the `Search by Party` section (there are `Go` buttons in each section).

This should display a page of cases, including *Sapulpa v. Gannett*.

![ok search results](../files/ok_court_search_results.png)

Lastly, if you click through the record, you'll find oodles of information related to the case, from parties involved in the case to important events and even downloadable case filings.

![ok court case detail page](../files/ok_case_detail_page.png)

## Automate the Search

So we have a feel for the search system. 

Better yet, we've created a roadmap of the steps we'll need to follow when using `playwright` to automate the browser:

- Go to case search page
- Fill in court / county field
- Fill in last and first name fields
- Click `Go` under `Search by Party`
- Locate the case on search results page
- Click on case to view detailed docket information

### Go to search page

Let's grab some starter code from the earlier `example.com` scrape, with a few tweaks:

- Skip the import line since we already did that in an earlier notebook cell
- Update the URL in the `page.goto(...)` call.

In [None]:
pw = await async_playwright().start()
browser = await pw.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto("https://oscn.net/dockets/Search.aspx")

### Select the county

Did the page open for you? Great.

Now we need to pinpoint the `County or Court` search menu and select Muskogee County.

If we examine the HTML, we can see this is a `<select>` tag and it has an associated `<label>`.

![ok court field with html](../files/ok_court_field_with_html.png)

`playwright` provides tons of different methods to interact with elements on a web page. 

For example, you can [locate an element by its label](https://playwright.dev/python/docs/locators#locate-by-label).

In our case, that label is `County or Court:` (note the trailing colon `:`).

Below, we create a `locator` using the field's label, and then select the court option by name.

In [None]:
locator = page.get_by_label("County or Court:")

In [None]:
await locator.select_option("Muskogee County District Court")
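When you pass a bare string to `select_option`, Playwright matches it against either an option's `value` attribute or its visible text. If you'd rather be explicit, the sketch below matches on the visible text only and then reads the selection back as a sanity check. `choose_court` is a helper name invented for this example, and it assumes the `page` object from the earlier cell.

```python
async def choose_court(page, court_name):
    """Pick a court by its visible option text and return the option's value."""
    menu = page.get_by_label("County or Court:")
    # label= matches the text shown in the dropdown, not the value attribute
    await menu.select_option(label=court_name)
    # input_value() reads back the value of the currently selected option
    return await menu.input_value()
```

In a notebook cell: `await choose_court(page, "Muskogee County District Court")`. The returned string is whatever `value` the site assigns to that option, which may look quite different from the name shown in the menu.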

### Fill the name fields

Next up, we need to fill in those name fields. 

Once again, we should inspect the source code to get our bearings.

![OK court search name fields html](../files/ok_search_name_field_html.png)

We can see that these are standard `input` fields. They also have labels, so let's try using the `get_by_label` method again, this time with the [fill](https://playwright.dev/python/docs/api/class-locator#locator-fill) method.

In [None]:
await page.get_by_label("Last name:").fill('Sapulpa')
await page.get_by_label("First Name:").fill('Scott')
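If a form has more than a couple of fields, a dictionary and a loop can keep things tidy. The sketch below is one possible approach; `fill_fields` is a helper name invented for this example, and it assumes the `page` object from the earlier cell.

```python
async def fill_fields(page, values_by_label):
    """Fill each labeled text input and return what the page now holds."""
    filled = {}
    for label, value in values_by_label.items():
        field = page.get_by_label(label)
        await field.fill(value)
        # Read the value back to confirm the field holds what we typed
        filled[label] = await field.input_value()
    return filled
```

In a notebook cell: `await fill_fields(page, {"Last name:": "Sapulpa", "First Name:": "Scott"})`.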

### Click Go

We're now ready to click. You know the routine. Inspect the HTML to locate the button. 

Then use `playwright` to select and click the button.

**The one hitch is that our page has more than one `Go` button** (one per search section).

We want to click the second button at the end of the `Search by Party` section.

Let's grab all the buttons and pluck out the second.

In [None]:
buttons = await page.query_selector_all('input.submit[type="submit"]')
# Grab the second button and click
await buttons[1].click()
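The same click can also be written with Playwright's locator API, which waits for the element to be ready before acting. This is a sketch of that alternative, not a required change; `click_party_search` is a name invented here, and it reuses the CSS selector from the cell above.

```python
async def click_party_search(page):
    """Click the second Go button, which belongs to Search by Party."""
    go_buttons = page.locator('input.submit[type="submit"]')
    # nth() is zero-based, so nth(1) is the second matching button
    await go_buttons.nth(1).click()
    # Wait for the results page to finish loading
    await page.wait_for_load_state()
```

In a notebook cell: `await click_party_search(page)`. Use this *instead of* the cell above, not in addition to it, or the second click will land on the results page.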

### Locate the detail page and click through

Hopefully you now see the search results page we described earlier.

The last step in this process involves locating the `Scott Sapulpa V. Gannett CO. INC.` row and clicking through to the case information page.

Once again, let's pry open the hood on the web page.

![ok sapulpa search results html](../files/ok_sapulpa_case_results_html.png)

We can see there's a table data (`td`) tag that contains the link (`a` tag) to our case.

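Let's select the link using the case name text and trigger a click. By default, `get_by_text` ignores case and collapses extra whitespace, so the capitalization and spacing in our code don't need to match the page exactly.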

In [None]:
link = page.get_by_text("SCOTT  SAPULPA V. GANNETT CO. INC.")
await link.click()
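Playwright locators are strict: `click()` raises an error if the locator matches more than one element. A search for a common name could easily return several similar rows, so it can help to count the matches first. The sketch below shows one way to do that; `click_unique_text` is a helper name invented for this example.

```python
async def click_unique_text(page, text):
    """Click the element containing `text`, but only if exactly one matches."""
    matches = page.get_by_text(text)
    found = await matches.count()
    if found != 1:
        raise ValueError(f"Expected exactly one match for {text!r}, found {found}")
    await matches.click()
```

In a notebook cell: `await click_unique_text(page, "SCOTT SAPULPA V. GANNETT CO. INC.")`. If the count comes back as zero, double-check that the search results page actually loaded.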

## Time to celebrate

If the case information page opened, congratulations!! It's time to bust out the champagne and party!!

You've now subjected the machine to your will. The case detail page is ready and waiting for you to harvest data and documents. We'll leave that as an exercise (see `Code Challenges` below).

Hopefully you're feeling the rush of excitement that comes with this form of web scraping. 

We won't rain on the party excessively by reminding you that scraping -- using browser automation or otherwise -- should always be an option of last resort.

But when all else fails, it sure can help sidestep bureaucratic and technical hurdles.

One last technical reminder: Once you're done admiring the case information page, make sure to close the browser. The robot deserves a break once in a while.

In [None]:
await browser.close()

## Code Challenges

If you're hooked on scraping, try your hand at the code challenges below to get some more reps.

- Update the notebook to extract metadata from the case information page (e.g. party names, lawyers, etc.)
- Update the notebook to download documents from the case information page
- Update the notebook to extract case information and documents from *all* cases listed on the search results page