# Scraping dynamic web sites with Selenium

[Last week's lesson](../pt1/scraping_lecture.ipynb) involved scraping a static site, meaning a site whose content is rendered up front in HTML. Today, we'll look at how to scrape sites that change when you load or interact with the page, sometimes without the URL changing.

[Selenium](https://www.selenium.dev/documentation/) was created to "automate browsers." The major use case for software like Selenium is to automate testing browser-based apps. But journalists can use software like Selenium to scrape dynamic websites.

For today's lesson, we're going to scrape all the public hearings in Alameda County courts on a given day.

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.by import By

import time
import math

## Open your automated browser

Earlier we installed `chromedriver` using `brew`. Below, we tell Selenium to use Chrome as our automated browser.

In [None]:
# initiate webdriver
driver = webdriver.Chrome()

# some people like to call this variable `browser` — call it whatever you like!

## Open the website

In [None]:
# Navigate to a URL
driver.get('https://publicrecords.alameda.courts.ca.gov/CalendarSearch/')

## Find the inputs you want to interact with

In last week's lecture, we used Beautiful Soup to find elements on a page. Because we want to interact with elements within Selenium's automated browser, we need to use Selenium to find elements. 

Tips:
- If you want to interact with the page, use Selenium
- If you want to read or parse complex HTML, use Beautiful Soup (`bs4`)

You'll use `By` to indicate how the browser will pinpoint your element. These are the [different options for `By`](https://www.selenium.dev/selenium/docs/api/py/webdriver/selenium.webdriver.common.by.html):

- `CLASS_NAME`
- `CSS_SELECTOR` (any CSS selector, such as `div.results a`)
- `ID`
- `LINK_TEXT` (the full text inside `<a>` tags)
- `NAME`
- `PARTIAL_LINK_TEXT` (part of the text inside `<a>` tags)
- `TAG_NAME`
- `XPATH` (when the element doesn't have a unique identifier, you can still pinpoint it with this method; Chrome has a cool way to grab the XPath of an item in Developer Tools)

Luckily, the date fields have IDs, so we can select them this way:

In [None]:
hearing_date_from = driver.find_element(By.ID, 'FeaturedContent_txtFromdt')
hearing_date_to   = driver.find_element(By.ID, 'FeaturedContent_txtTodt')

You can use `type()` to find out whether a variable is a Selenium object or a Beautiful Soup object.

In [None]:
type(hearing_date_from)

## Input dates into the date fields

Use Selenium's `send_keys()` method to type text into the date fields.

In [None]:
hearing_date_from.send_keys('12/06/2021')
hearing_date_to.send_keys('12/06/2021')
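
The date strings typed above are hard-coded. If you want to build them from Python dates instead, here is a minimal sketch. It assumes the form expects `MM/DD/YYYY`, as in the strings above; `format_hearing_date` and `DATE_FORMAT` are hypothetical names, not part of Selenium.

```python
from datetime import date, timedelta

# Assumption: the search form expects MM/DD/YYYY, matching the strings typed above.
DATE_FORMAT = '%m/%d/%Y'


def format_hearing_date(day):
    """Return a date formatted the way the form's date fields seem to expect."""
    return day.strftime(DATE_FORMAT)


# build the strings for two consecutive days
first_day = date(2021, 12, 6)
next_day = first_day + timedelta(days=1)

from_text = format_hearing_date(first_day)
to_text = format_hearing_date(next_day)
print(from_text, to_text)
```

You could then pass `from_text` and `to_text` to `send_keys()` in place of the typed-out strings.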

## "Click" on the submit button

First, you'll have to find the element by its `id` value, then `click()` on it.

In [None]:
submit_button = driver.find_element(By.ID, 'FeaturedContent_btFind')

In [None]:
submit_button.click()

Below, I'm telling the computer to wait 5 seconds before executing the next line of code so the browser can finish loading the page. That's crucial if I end up restarting this notebook kernel and running all cells at once: some elements might not exist until the page has finished loading, and trying to find them too early raises an error.

In [None]:
time.sleep(5)

There are better ways to wait for elements on a page. Check out the documentation to read more about [WebDriverWait()](https://selenium-python.readthedocs.io/waits.html).
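
To give a feel for the idea behind those better waits, here is a rough sketch of polling: keep checking for something until it shows up or time runs out, instead of always sleeping for a fixed 5 seconds. `wait_until` is a hypothetical helper written for this illustration, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10, poll_every=0.5):
    """Call `condition` repeatedly until it returns something truthy.

    Returns that truthy value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_every)


# With this notebook's `driver`, a call might look like the line below.
# find_elements() returns an empty list (falsy) until the element exists.
# wait_until(lambda: driver.find_elements(By.ID, 'MainContent_gvResult'))
```

Selenium's own `WebDriverWait(driver, 10).until(...)` works along the same lines, so in practice you would likely reach for that rather than writing your own helper.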

## "Select" more rows to view

When you get your search results, the courts show only 10 rows at a time. It'll be faster to scrape all the results if you show the maximum number of rows at a time (which is 50).

In [None]:
displayed_rows_dropdown = Select(driver.find_element(By.NAME, 'ctl00$MainContent$gvResult$ctl13$ctl13'))

In [None]:
displayed_rows_dropdown.select_by_visible_text('50')

## Get the count of results so you know how many pages you have to scrape

Even though I'm reading text out of the page below, I'm using Selenium instead of Beautiful Soup. I haven't created a Beautiful Soup object yet, and Selenium can read an element's text on its own through `.text`.

In [None]:
records_count_container = driver.find_element(By.ID, 'MainContent_lbCnt')
# the count is the last word in the container's text
records_count = int(records_count_container.text.split()[-1])
records_count

In [None]:
pages_to_check = math.ceil(records_count/50)
pages_to_check
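
The arithmetic in the two cells above can be checked without a browser. This sketch wraps it in a hypothetical `count_pages` function. The label text `'Total Records: 137'` is made up for illustration; the real wording on the site may differ, and the only assumption the code relies on is that the count is the last word.

```python
import math

ROWS_PER_PAGE = 50  # the maximum the rows dropdown offers


def count_pages(label_text, rows_per_page=ROWS_PER_PAGE):
    """Read the record count from the end of the label and return the page count."""
    records_count = int(label_text.split()[-1])
    # round up: a partly filled last page is still a page to scrape
    return math.ceil(records_count / rows_per_page)


# hypothetical label text, not copied from the site
print(count_pages('Total Records: 137'))  # 137 / 50 = 2.74, rounded up to 3
print(count_pages('Total Records: 100'))  # exactly 2 full pages
```

`math.ceil` matters here because plain integer division would drop the last, partly filled page.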

## Figure out how to loop through the pages

In [None]:
# find the "Next" link — it looks like ">"
next_button = driver.find_element(By.LINK_TEXT, '>')
next_button.click()

The code below is commented out because I don't want you to run it yet. But you can see how one could flip through all the pages of this site.

In [None]:
# for n in range(pages_to_check):
#     next_button = driver.find_element(By.LINK_TEXT, '>')
#     next_button.click()
    
#     # wait 2 seconds
#     time.sleep(2)

You can manually get back to the first page by going to the "automated" browser and clicking "1".

## Parse the first page of results with Beautiful Soup

Now I'm going to switch to Beautiful Soup because it's better suited to parsing through a lot of HTML.

### Get the table by its `id`

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find(id='MainContent_gvResult')

In [None]:
# table

Each row of this table is a unique something. I'm not sure what that something is. It might not be a unique case. It might be something else. I'm not going to assume. Anyway, I'd like to transfer this table into a pandas dataframe.

### Create your blank dataframe

In [None]:
hearings = pd.DataFrame(
    columns=[
        'Serial No.',
        'Name',
        'Case #',
        'PFN',
        'CEN',
        'Dept#',
        'Hearing Date',
        'Hearing Time',
        'Hearing Type',
        'Case Type',
        'Defense Atty',
        'DA'
    ])

### Parse the table and put the data into a dataframe

Let's go over each section below manually before running.

In [None]:
# create a simple `page_data` list to store the page data before we make a pandas dataframe
page_data = []
rows = table.find_all('tr')

# we haven't used enumerate() yet, but basically it gives you each item's index along with the item
for i, row in enumerate(rows):

    # we can skip the first row because that's the header row
    # we can also skip any row greater than index 50 because that has the page numbers
    if (i > 0) and (i <= 50):
    
        # `cells` will get and index all the cells within a row
        cells = row.find_all('td')
        page_data.append({
            'Serial No.' : cells[0].text.strip(),
            'Name' : cells[1].text.strip(), 
            'Case #' : cells[2].text.strip(), 
            'PFN' : cells[3].text.strip(), 
            'CEN' : cells[4].text.strip(), 
            'Dept#' : cells[5].text.strip(), 
            'Hearing Date' : cells[6].text.strip(), 
            'Hearing Time' : cells[7].text.strip(), 
            'Hearing Type' : cells[8].text.strip(), 
            'Case Type' : cells[9].text.strip(), 
            'Defense Atty' : cells[10].text.strip(), 
            'DA' : cells[11].text.strip()
        })
        
# create a dataframe with `page_data`
page_hearing = pd.DataFrame(page_data)
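
The loop above spells out all twelve columns by position. One possible alternative, sketched here, pairs a list of column names with the cell text using `zip()`. `COLUMNS` and `row_to_record` are hypothetical names, and the sample cells are placeholder values, not real court data.

```python
COLUMNS = [
    'Serial No.', 'Name', 'Case #', 'PFN', 'CEN', 'Dept#',
    'Hearing Date', 'Hearing Time', 'Hearing Type', 'Case Type',
    'Defense Atty', 'DA',
]


def row_to_record(cell_texts, columns=COLUMNS):
    """Pair each column name with the stripped text of the cell in that position."""
    if len(cell_texts) != len(columns):
        raise ValueError(f'expected {len(columns)} cells, got {len(cell_texts)}')
    return {column: text.strip() for column, text in zip(columns, cell_texts)}


# placeholder cell text for illustration only
sample_cells = [' 1 ', 'DOE, JANE', 'X000', '', '', '1',
                '12/06/2021', '9:00 AM', 'Example', 'Example', '', '']
record = row_to_record(sample_cells)
print(record['Serial No.'], record['Hearing Date'])
```

Inside the loop, this would be called as `row_to_record([cell.text for cell in cells])`. The length check also makes it obvious when a row, such as the one holding the page numbers, doesn't have the expected twelve cells.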

## Append `page_hearing` dataframe to main `hearings` dataframe

In [None]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
hearings = pd.concat([hearings, page_hearing], ignore_index=True)

## View dataframe

In [None]:
hearings

In [None]:
# for n in range(pages_to_check):
#     next_button = driver.find_element(By.LINK_TEXT, '>')
#     next_button.click()
#     time.sleep(2)

## Addenda 

If I want to search for another date, I can stay on the same page and "clear" the date fields. Then I send new dates.

In [None]:
hearing_date_from.clear()
hearing_date_to.clear()

hearing_date_from.send_keys('12/07/2021')
hearing_date_to.send_keys('12/07/2021')

Once you're done using the automated browser, you can close it manually or run the following:

In [None]:
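# quit() closes every window and ends the chromedriver session;
# close() would only close the current window
driver.quit()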

## Classwork

I'd like you to figure out how to loop through all the pages and collect all the information.