## Scraping basics

There are a few steps you'll follow pretty much every time you write a new web scraper:

1. Inspect the underlying HTML on the page you want to scrape. Use "View Source" in your browser, or right-click and "Inspect Element" to take a look at how the HTML tags are structured around the data you want to capture.
2. In the programming language of your choice (we're using Python here), write a script that:
    * Opens that page from the internet
    * Parses its HTML, using the tags you spotted earlier as a guide
    * Saves that data for later
3. Test your scraper, fix what's broken, and run it again till it works.

It sounds like a pretty straightforward process, but the code underneath the pages where your data lives is often _anything_ but straightforward. So let's start by stepping through the process with a super simple example.

[This page just has one table on it](pages/scraper-0-page-example-table.html), holding some recent data about NICAR conference locations. Try "View Source" and take a look at the code underneath. A `<table>` tag, some `<tr>` table rows inside of it, and some `<td>` table cells inside those rows. Just what we like to see.

So let's get that data out of there! In the script we'll build below, our `#` code comments will guide us through each piece of code to write. The first thing we need to do in our Python script is import the libraries we'll be using here: `BeautifulSoup` for parsing HTML and `csv` to write our data to a file.

In [2]:
# import the Python libraries we need
from bs4 import BeautifulSoup
import csv

Now we need to load our page and parse it. Our local copy of the HTML makes that easy, and we'll use `BeautifulSoup` to turn it into a Python object we can work with.

In [3]:
# use Python's open() to open the HTML page we've stored locally
page = open('pages/scraper-0-page-example-table.html', 'r')

# use BeautifulSoup to parse that page into Python
soup = BeautifulSoup(page, 'html.parser')

# and close the HTML page
page.close()
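
If remembering to call `close()` feels error-prone, a `with` block is a common alternative: Python closes the file for us as soon as the block ends. This is a sketch, not a change to the scraper we're building. So that it can run on its own, it first writes a tiny, made-up HTML file to a temporary folder; the `demo_` names and the file contents are invented for this illustration.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# write a tiny, made-up HTML file so this example can run on its own
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo-table.html')
with open(demo_path, 'w') as demo_file:
    demo_file.write('<table><tr><td>City A</td></tr></table>')

# open the file inside a with block; Python closes it when the block ends
with open(demo_path, 'r') as demo_page:
    demo_soup = BeautifulSoup(demo_page, 'html.parser')

# the file is closed now, but the parsed soup is still usable
print(demo_page.closed)
print(demo_soup.find('td').text)
```

Either approach works for a small script like ours; the `with` version just means there's no `close()` to forget.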

We know we want to make ourselves a CSV later on, so we'll need an empty Python list to start stuffing each row of data into. And we can use `BeautifulSoup` again to find just the part of our page we care about: the `<table>`.

In [4]:
# make ourselves an empty list to hold data for a CSV
list_of_rows = []

# use BeautifulSoup to find the table in our parsed HTML
table = soup.find('table')
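
One thing worth knowing: `find()` hands back only the first matching tag, and it returns `None` when nothing matches. The small sketch below uses made-up HTML (the `demo_` names are ours, not part of the scraper) to show both cases, which can help when a scraper breaks because a page doesn't contain the tag you expected.

```python
from bs4 import BeautifulSoup

# made-up HTML with two tables, just for illustration
demo_html = (
    '<table id="first"><tr><td>A</td></tr></table>'
    '<table id="second"><tr><td>B</td></tr></table>'
)
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# find() returns the first <table> it comes across
first_table = demo_soup.find('table')
print(first_table['id'])

# when there's no match, find() returns None instead of raising an error
missing = demo_soup.find('blockquote')
print(missing)

# so it's worth checking before calling methods on the result
if missing is None:
    print('no <blockquote> on this page')
```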

And now we're ready to start extracting data! We need a Python loop that:

* goes through each `<tr>` in our `<table>`
* creates a new, temporary list to hold the cell data it contains
* loops through each `<td>` in that row, adding its text to our temporary list
* appends that temporary list to our master list of row data once the full row is processed.

Then our loop will move on to the next row, and repeat.

In [5]:
# loop through the rows in our table using BeautifulSoup
for row in table.find_all('tr'):
    # create an empty list each time through, to hold cell data
    list_of_cells = []
    # loop through each cell in this table row
    for cell in row.find_all('td'):
        # grab the text from that cell
        text = cell.text.strip()
        # and append it to our list
        list_of_cells.append(text)
    # when we're done with this table row, append its data to our list of rows
    list_of_rows.append(list_of_cells)
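
To see the shape of what this loop builds, here's a sketch that runs the same logic on a made-up table (the HTML and the `demo_` names are invented for illustration, not taken from the example page). It makes one tweak based on an assumption: if a table marks its header cells with `<th>` rather than `<td>`, searching for `'td'` alone would leave an empty list for that row, so this version asks `find_all()` for both tags.

```python
from bs4 import BeautifulSoup

# made-up table with a <th> header row and some stray whitespace
demo_html = """
<table>
  <tr><th>Year</th><th>City</th></tr>
  <tr><td> Year A </td><td>City A</td></tr>
  <tr><td>Year B</td><td> City B</td></tr>
</table>
"""
demo_table = BeautifulSoup(demo_html, 'html.parser').find('table')

demo_rows = []
for row in demo_table.find_all('tr'):
    # passing a list matches either tag, so header cells are kept too;
    # strip() trims the whitespace around each cell's text
    cells = [cell.text.strip() for cell in row.find_all(['td', 'th'])]
    demo_rows.append(cells)

# a list of lists: one inner list per table row
print(demo_rows)
```

The list comprehension is just a compact way of writing the inner loop; the longer version above does the same job and is easier to add comments to.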

With our data successfully extracted and stored in a nice, big list of lists, we can use Python's built-in `csv` library to write ourselves a CSV to analyze later.

In [6]:
# use Python's CSV library to create our output file
# (newline='' keeps extra blank lines out of the CSV on Windows)
outfile = open('nicar_cities.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
outfile.close()
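
A quick way to check a CSV is to read it straight back in. The sketch below writes a few made-up rows (stand-ins for `list_of_rows`) to a temporary file and compares what comes back. It opens the file with `newline=''`, which the `csv` module's documentation recommends, and uses `with` blocks so the files close themselves. The `demo_` names and the temporary file are assumptions made for this illustration.

```python
import csv
import os
import tempfile

# made-up rows standing in for list_of_rows; one cell contains a comma on purpose
demo_rows = [['Year', 'City'], ['Year A', 'City A'], ['Year B', 'City, B']]

# write to a temporary folder so we don't touch any real output file
demo_csv_path = os.path.join(tempfile.mkdtemp(), 'demo_cities.csv')

# newline='' lets the csv module handle line endings itself
with open(demo_csv_path, 'w', newline='') as demo_outfile:
    demo_writer = csv.writer(demo_outfile)
    demo_writer.writerows(demo_rows)

# read the file back to confirm the rows survived the round trip
with open(demo_csv_path, 'r', newline='') as demo_infile:
    rows_read_back = list(csv.reader(demo_infile))

print(rows_read_back == demo_rows)
```

Notice that the cell containing a comma comes back intact: `csv.writer` wraps it in quotes when writing, which is one reason to use the library instead of joining strings by hand.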

And there we have it: we scraped!

Our [next exercise](scraper-1.ipynb) adds just a bit of complexity. We'll scrape a page with more than just a `<table>` on it, and we'll pull it straight off the internet (wifi willing).