## Scraping from a live site

In our [starter exercise](scraper-0.ipynb), we stepped through the basics of writing a scraper, picking up some principles we'll use again and again.

* Inspect the page to find the HTML tags wrapping the data you care about
* Use Python to open the file
* Parse that HTML and loop through it, storing the data you want
* Save it into a CSV for analysis later

We also started by scraping a really simple file, but scraping a more complex page pretty much follows these same steps. This time we'll need to get a little more specific as we target our data.

And the file in our first example was a copy we had saved locally—which is a great approach when you can use it! But sometimes you might want to scrape directly from a live page on the internet, like if a page updates regularly and you want to automate your scraper to fetch new data each day.

That's the approach we'll take here, to scrape [a list of court cases](https://cp.spokanecounty.org/courtdocumentviewer/PublicViewer/SCHearingsByDate.aspx?d=01/23/2019). We'll start our Python script the same as last time, importing the `BeautifulSoup` and `csv` libraries. And we'll add a new library called `requests` to help us grab our page from the internet.

In [1]:
# import the Python libraries we need
import requests
from bs4 import BeautifulSoup
import csv

Just in case our wifi fails, we have a local copy of the page we can work with. But for now we'll leave the `open()` function we used last time commented out.

In [2]:
# if we're working from a local copy of our page ...
# use Python's open() to open the HTML page
# html = open('pages/scraper-1-page.html', 'r')

The `requests` library is amazingly handy, and we'll barely be scratching the surface of what it can do. To get our page from the internet, we'll define a couple of variables first: `URL` to hold our page's web address, and `HEADERS` to send along a bit of information about our request, which will show up in the site's server logs. Then we can use `requests` to `get()` our page, and pull its contents into a variable we can work with.

(We'll be hard-coding the date in our `URL` variable ... for now. But let's get our scraper working first before we think about how to make it more flexible.)

In [3]:
# if we're requesting a live page over the internet ...
# define the URL we want to scrape
URL = 'https://cp.spokanecounty.org/courtdocumentviewer/PublicViewer/SCHearingsByDate.aspx?d=01/23/2019'

# define the headers our scraper will pass, so we look like a browser
# https://developers.whatismybrowser.com/useragents/explore/
HEADERS = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0',}

# use requests to fetch that URL
page = requests.get(URL, headers=HEADERS)

# and store the page content in a Python variable
html = page.content
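
One thing the cell above takes on faith is that the request actually worked. Here's a minimal sketch of a more cautious fetch, assuming we'd rather have the script stop than quietly parse an error page. The `fetch_html` name, the 30-second `timeout` and the optional `session` argument are our own additions for illustration, not part of the original scraper.

```python
import requests


def fetch_html(url, headers, timeout=30, session=None):
    """Fetch a page and return its raw content, raising an error on failure."""
    # use a caller-supplied session if there is one, otherwise plain requests
    http = session or requests
    # timeout keeps the script from hanging forever on a stalled connection
    response = http.get(url, headers=headers, timeout=timeout)
    # raise an exception for 4xx/5xx responses instead of returning bad HTML
    response.raise_for_status()
    return response.content
```

With that in place, `html = fetch_html(URL, HEADERS)` would stand in for the two lines that call `requests.get()` and read `page.content`.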

And now we're back into familiar territory, using `BeautifulSoup` to parse our HTML and making an empty list to hold our row data. We're going to be a little more specific this time when we tell `BeautifulSoup` where our data is, though! Just in case there's more than one `<table>` on the page, we'll identify the one we want by its `id`.

In [4]:
# however we opened it, we have our page so use BeautifulSoup to parse it
soup = BeautifulSoup(html, 'html.parser')

# make ourselves an empty list to hold data for a CSV
list_of_rows = []

# use BeautifulSoup to find the table in our parsed HTML
table = soup.find(id='tblHearingsSCByDate')
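
Worth knowing: `find()` returns `None` when nothing matches, so if the site ever renames that `id`, the next step would fail with a confusing error. Here's a small sketch of a guard against that, run on a made-up scrap of HTML rather than the real page. The `find_table` helper and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def find_table(html, table_id):
    """Return the element with the given id, or raise a readable error."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find(id=table_id)
    # find() gives back None when no element has that id
    if table is None:
        raise ValueError('No element with id "{}" on this page'.format(table_id))
    return table


# made-up HTML with two tables, just to show the id lookup
sample_html = (
    '<table id="layout"><tr><td>nav</td></tr></table>'
    '<table id="tblHearingsSCByDate"><tr><td>data</td></tr></table>'
)
table = find_table(sample_html, 'tblHearingsSCByDate')
print(table['id'])
```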

We'll also be a little more specific here when we loop through our table rows. The `<tr>` elements with the data we want all have a CSS class of `detailrow`, so let's specify those and ignore any other rows we run into.

Another problem we're going to run into: One of our `<td>` cells has some duplicated information in it. But luckily it's wrapped in a `<span>` tag, so we can use `BeautifulSoup` to `clear()` it right out.

In [5]:
# loop through the rows in our table using BeautifulSoup
for row in table.find_all('tr', class_='detailrow'):
    # create an empty list each time through, to hold cell data
    list_of_cells = []
    # loop through each cell in this table row
    for cell in row.find_all('td'):
        # we noticed some cruft on one cell, so get rid of it
        if cell.span:
            cell.span.clear()
        # grab the text from that cell
        text = cell.text.strip()
        # and append it to our list
        list_of_cells.append(text)
    # when we're done with this table row, append its data to our list of rows
    list_of_rows.append(list_of_cells)
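
If you want to watch that loop work without hitting the live site, here's the same logic run against a tiny made-up table. Everything in `SAMPLE_HTML` is invented for illustration (the cell values and the `headerrow` class are assumptions); only the `detailrow` class and the table `id` come from the real page.

```python
from bs4 import BeautifulSoup

# invented sample: one header row, two data rows, one cell with a duplicate <span>
SAMPLE_HTML = """
<table id="tblHearingsSCByDate">
  <tr class="headerrow"><th>Case</th><th>Name</th></tr>
  <tr class="detailrow"><td>CASE-001</td><td>DOE, JANE <span>DOE, JANE</span></td></tr>
  <tr class="detailrow"><td>CASE-002</td><td>ROE, RICHARD</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find(id='tblHearingsSCByDate')

list_of_rows = []
# only rows with the detailrow class are picked up; the header row is skipped
for row in table.find_all('tr', class_='detailrow'):
    list_of_cells = []
    for cell in row.find_all('td'):
        # empty out the span so its duplicated text doesn't end up in our data
        if cell.span:
            cell.span.clear()
        list_of_cells.append(cell.text.strip())
    list_of_rows.append(list_of_cells)

print(list_of_rows)
```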

Nice! We've looped through our table, ignored rows we don't care about and some duplicated data, and now we're ready to write everything into a CSV. The only thing to change about this from last time is our filename for output.

In [6]:
# use Python's CSV library to create our output file
outfile = open('docket.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
outfile.close()
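
A common variation on the cell above is to open the file in a `with` block, which closes it for us even if something goes wrong partway through writing. Here's a minimal sketch; the `write_rows` function name is our own invention.

```python
import csv


def write_rows(filename, rows):
    """Write a list of rows to a CSV file, closing the file automatically."""
    # the with block closes outfile when we leave it, even after an error
    with open(filename, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(rows)
```

Calling `write_rows('docket.csv', list_of_rows)` would produce the same file as the four lines above.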

And we have scraped yet again, hopefully from all the way across the internet.

That's awesome, but let's take a second to think about how we could make this scraper even better. That URL we're scraping has a date parameter ... and that sounds like an invitation to make our script flexible enough to grab data day after day without having to edit things by hand.

To do that, we'd revisit the top of our scraper, importing the `date` class from Python's built-in `datetime` module and creating a variable that holds today's date.

In [7]:
from datetime import date
today = date.today()

Then we could use our `today` variable to change the URL we want to scrape ...

In [8]:
url_date = today.strftime('%m/%d/%Y')
URL = 'https://cp.spokanecounty.org/courtdocumentviewer/PublicViewer/SCHearingsByDate.aspx?d={}'.format(url_date)
print(URL)

https://cp.spokanecounty.org/courtdocumentviewer/PublicViewer/SCHearingsByDate.aspx?d=08/21/2019


And we could use it to name our output file in a way that keeps everything organized.

In [9]:
filename_date = today.strftime('%Y-%m-%d')
outfile_filename = 'docket-{}.csv'.format(filename_date)
print(outfile_filename)

docket-2019-08-21.csv
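
One way to tidy up those two tweaks, sketched here as an assumption rather than the one right answer: wrap each in a small function that takes a date. The function names and `BASE_URL` are ours. Passing the date in, instead of always calling `date.today()`, also makes it easy to re-run the scraper for a day we missed.

```python
from datetime import date

# the same URL as before, with a placeholder where the date goes
BASE_URL = 'https://cp.spokanecounty.org/courtdocumentviewer/PublicViewer/SCHearingsByDate.aspx?d={}'


def docket_url(day):
    """Build the page URL for a given date, formatted as MM/DD/YYYY."""
    return BASE_URL.format(day.strftime('%m/%d/%Y'))


def docket_filename(day):
    """Build an output filename that sorts nicely by date."""
    return 'docket-{}.csv'.format(day.strftime('%Y-%m-%d'))


# a fixed date so the output is predictable; swap in date.today() for daily runs
sample_day = date(2019, 8, 21)
print(docket_url(sample_day))
print(docket_filename(sample_day))
```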


Nice. Just a couple small tweaks and this scraper is ready for automation. There are plenty of other things we could add, too.

* We could email a reporter each time the scraper runs, attaching the daily data file.
* Once we've populated `list_of_rows`, we can use it as much as we want. We could open a second CSV in append mode, and add the daily data to it for a longer-term analysis.
* We could make ourselves a list of `ALERT_NAMES` and check for matches each time the scraper runs.
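
Here's a rough sketch of two of those ideas: appending to a running CSV and checking for names on a watch list. The function names, the sample rows and the contents of `ALERT_NAMES` are all made up for illustration, and the matching is a simple case-insensitive substring check, which may be too loose or too strict for real names.

```python
import csv

# hypothetical watch list; replace with the names you actually care about
ALERT_NAMES = ['DOE, JANE']


def append_rows(filename, rows):
    """Add rows to the end of a CSV, creating the file if it doesn't exist."""
    # 'a' is append mode, so earlier days' data stays in the file
    with open(filename, 'a', newline='', encoding='utf-8') as outfile:
        csv.writer(outfile).writerows(rows)


def find_alerts(rows, alert_names):
    """Return the rows where any cell contains one of the alert names."""
    wanted = [name.lower() for name in alert_names]
    matches = []
    for row in rows:
        # case-insensitive substring match against every cell in the row
        if any(name in cell.lower() for cell in row for name in wanted):
            matches.append(row)
    return matches


# invented rows standing in for list_of_rows
sample_rows = [['CASE-001', 'Doe, Jane'], ['CASE-002', 'ROE, RICHARD']]
print(find_alerts(sample_rows, ALERT_NAMES))
```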

We're just writing Python here, so we're only limited by our imagination and our ability to search for answers we can copy/paste from Stack Overflow.

Our [final exercise](scraper-2.ipynb) adds one more approach to our scraping toolkit. Sometimes the dataset we want is broken up across multiple web pages, but we won't let that throw us for a `loop`. (Yeahhhhh, I feel pretty guilty about typing that.)