# From zero to scraping

Depending on how much time we have, this may end up being more of a demo than a full hands-on experience.

### What am I even looking at right now?

This is an IPython Notebook, which is basically just a fancy Python script that can be executed in chunks and annotated with helpful text. The aim is to make Python stuff more approachable and easily digestible. We'll see!

(Barely important side note: Wakari is actually using an older version of this setup; the project has since been rebranded as [Jupyter](http://jupyter.org/).)

### So, what do we need to know to make this happen?

1. Python will execute your instructions one by one.
2. Variables store things like strings, numbers and lists of items (like strings and numbers).
3. We can extend Python's innate abilities with outside libraries designed for specific tasks.
    - **What the hell does that mean?** It means smart people have written things that allow you to emulate a web browser or dissect complex HTML without much work.
4. Files are typically read and/or written one line (row) at a time.
5. Loops help you do the same thing to every item in a list.
    - Like a list made up of rows in an online table.

### How will we scrape [this website](http://www.nrc.gov/reactors/operating/list-power-reactor-units.html)?

1. We will import some libraries that:
    - Act like an internet browser
    - Parse HTML code
    - Read and write CSV files
2. Grab the contents of the web page.
3. Parse the contents of the web page and target only the data table.
4. Open a blank CSV file to store the information in the data table.
5. Loop through each row in the online data table:
    - Extract each element (cell) and store it in a variable
    - Write those variables as a row into the CSV file
6. Close the CSV file.
7. Rejoice.

### Why will we scrape this way?

While code-free tools are handy in a pinch, scripts written in Python or another language are more flexible and adaptable. They can also run automatically in the background on a schedule. Also, you don't have to worry about a service or a tool ever disappearing, making all your hard work for naught.

### 1. Import libraries to do the heavy lifting

We're going to bring in three modules to help us scrape this page. The first two are outside libraries; **csv** comes with Python.

- **requests** will act like an internet browser and collect HTML
- **BeautifulSoup** will parse the HTML code and allow us to isolate a data table
- **csv** will allow us to write what we find to a nicely formatted file

In [1]:
import requests
from bs4 import BeautifulSoup # There's a bunch of stuff in bs4; we only want BeautifulSoup
import csv

### 2. Grab the contents of a web page.

The page we want is located here: http://www.nrc.gov/reactors/operating/list-power-reactor-units.html

**requests** has a function called *get*, which is analogous to a browser like Firefox or Chrome fetching the HTML code for display.

In [2]:
url = 'http://www.nrc.gov/reactors/operating/list-power-reactor-units.html'
web_page = requests.get(url)

We can quickly check whether we've gotten the expected raw HTML code by using the response's *text* attribute, which returns the HTML code as plain text.

In [3]:
print(web_page.text)

<!-- #BeginTemplate "/Templates/generic-terminal-no-box.dwt" --><!-- DW6 -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<meta http-equiv="X-UA-Compatible" content="IE=11" />

<head>
<!-- #BeginEditable "doctitle" -->
<title>NRC: List of Power Reactor Units</title>
<!-- #EndEditable -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="content-language" content="en" />
<meta name="description" content="The Nuclear Regulatory Commission, protecting people and the environment." />
<meta name="keywords" lang="en" content="Nuclear Regulatory Commission, NRC, protecting, people, environment" />
<link rel="stylesheet" href="/admin/css/styles.css" type="text/css" media="screen" />
<link rel="stylesheet" href="/admin/css/jcalendar.css" type="text/css" />
<!-- <link rel="stylesheet" type="text/css" 

### 3. Parse the HTML and target the table

Now we can send our HTML code to **BeautifulSoup**, which is specifically designed to navigate the structural elements of the document, breaking off the pieces we choose. In this case, we are after the web page's only table -- it has all the data we need.

**BeautifulSoup** has methods called *find* and *find_all* designed to target HTML tags. While *find* picks up the first matching instance, *find_all* locates all matching instances and returns them as a kind of list. We will use this to our advantage in a moment.
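
To make the difference between *find* and *find_all* concrete, here is a minimal sketch on a tiny made-up HTML snippet (the snippet and the names `sample_soup`, `first_cell` and `all_cells` are assumptions for illustration, not part of the page we're scraping):

```python
from bs4 import BeautifulSoup

# A tiny made-up table, just for illustration.
html = """
<table>
  <tr><td>first</td><td>second</td></tr>
  <tr><td>third</td><td>fourth</td></tr>
</table>
"""
sample_soup = BeautifulSoup(html, 'html.parser')

# find returns only the first matching tag.
first_cell = sample_soup.find('td')

# find_all returns every matching tag, in document order.
all_cells = sample_soup.find_all('td')

print(first_cell.text)
for c in all_cells:
    print(c.text)
```

Because *find_all* hands back something list-like, you can loop over it or slice it just as you would an ordinary list.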

In [4]:
soup = BeautifulSoup(web_page.content, 'html.parser')
reactor_table = soup.find('table')

Again, we can check to see if we've isolated the table.

In [5]:
print(reactor_table)

<table border="1" cellpadding="5" cellspacing="0" summary="List of Power Reactor Units" width="100%">
<tr align="middle" bgcolor="#eaf2eb">
<th scope="col">Plant Name<br/>
            Docket Number</th>
<th scope="col">Reactor<br/>
            Type</th>
<th scope="col">Location</th>
<th scope="col">Owner/Operator</th>
<th scope="col">NRC Region</th>
</tr>
<tr>
<td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>
            05000313</td>
<td>PWR</td>
<td>6 miles WNW of Russellville,  AR</td>
<td>Entergy Nuclear Operations, Inc. </td>
<td align="middle">4</td>
</tr>
<tr>
<td scope="row"><a href="/info-finder/reactors/ano2.html">Arkansas Nuclear 2</a><br/>
            05000368</td>
<td>PWR</td>
<td>6 miles WNW of Russellville,  AR</td>
<td>Entergy Nuclear Operations, Inc. </td>
<td align="middle">4</td>
</tr>
<tr>
<td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a><br/>
            05000334</td>
<td>PWR</td>
<td>17 miles W of

### 4. Open a blank CSV file for data storage

We need a place for all this data to go once we start scraping it. We can open a new blank file and then use the **csv** module's *writer* function to create an object (stay with me now) that we can order around with some basic commands, making it write data to the new blank file.

In [6]:
data_file = open('reactors.csv', 'wb')
output = csv.writer(data_file)
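
The `'wb'` mode above is the Python 2 convention for the **csv** module. On Python 3, the csv documentation instead calls for a text-mode file opened with `newline=''`. Here is a minimal sketch of that variant, assuming Python 3; the file name `reactors_py3.csv` and the `_py3` variable names are made up so the sketch doesn't clobber the file above. With a text-mode file opened with `encoding='utf-8'`, strings are encoded on the way out, so they wouldn't need their own `.encode('utf-8')` calls.

```python
import csv

# Assumption: Python 3. 'reactors_py3.csv' is a hypothetical file name.
# newline='' lets the csv module control line endings itself.
data_file_py3 = open('reactors_py3.csv', 'w', newline='', encoding='utf-8')
output_py3 = csv.writer(data_file_py3)

# Write a header row, then close the file so the row lands on disk.
output_py3.writerow(["NAME", "LINK", "DOCKET", "TYPE", "LOCATION", "OWNER", "REGION"])
data_file_py3.close()
```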

Let's write our inaugural row to the file: the header that specifies what all the different columns are. We'll use the writer's *writerow* method to send a list of what we would like written to the file: `"NAME", "LINK", "DOCKET", "TYPE", "LOCATION", "OWNER", "REGION"`

In [7]:
output.writerow(["NAME", "LINK", "DOCKET", "TYPE", "LOCATION", "OWNER", "REGION"])

### 5. Loop through each row in the table, extract data and write it to the file

Here comes the tricky part: we have to actually scrape the data out of the table we isolated.

To do that, we need to not only loop through every row in the table, but also each cell in every row.

The basic nitty-gritty of Python can be self-explanatory to a certain extent, but loops tend to trip up people who haven't been exposed to the concept before.

A loop just does the same thing to every item in a list. It's a very helpful structure for scraping, because you can essentially treat a table like a list of rows.

Let's experiment for a minute on an example list:

```
my_list = ['Toronto', 'Ontario', 2016, 'May']
```

In [8]:
my_list = ['Toronto', 'Ontario', 2016, 'May']

I want to do something to each item in this list without having to retype it repeatedly. The basic syntax, in pseudocode, looks like this:

```
for [a list item] in [some list]:
    do a thing with [a list item]
```

That thing will then happen with the first list item, the second, the third, etc., until the end of the list is reached.

So if I wanted to print each thing in the list we made above, one by one, I could do it like this:

```
for thing in my_list:
    print(thing)
```

In [9]:
for thing in my_list:
    print(thing)

Toronto
Ontario
2016
May


There is nothing special about the variable name `thing`; all it does is hold the list item until the loop moves on to the next. We could call it `banana` or `zorro` and the result would be exactly the same.

In [10]:
for zorro in my_list:
    print(zorro)

Toronto
Ontario
2016
May


In [11]:
print(zorro)

May


Above, you're seeing the last list item that was passed into the loop variable.
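
Since a table can be treated as a list of rows, and each row as a list of cells, one loop can sit inside another. Here is a minimal sketch using a made-up two-row table (the data and the names `table` and `cells_seen` are assumptions for illustration). The scraping loop further down picks cells out of each row by position instead of looping over them, but the row-by-row idea is the same.

```python
# A made-up table: a list of rows, where each row is itself a list of cells.
table = [
    ['Toronto', 'Ontario'],
    ['Montreal', 'Quebec'],
]

cells_seen = []

# The outer loop visits each row; the inner loop visits each cell in that row.
for row in table:
    for cell in row:
        print(cell)
        cells_seen.append(cell)
```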

So now we have to dive into the table with this long-ish list of steps. We'll make a list of HTML snippets wrapped in `<tr>` tags (the table rows), and then, for each row, a list of the actual data cells wrapped in `<td>` tags. The `[1:]` slice skips the first row, which holds the column headers rather than data.

```
for row in reactor_table.find_all('tr')[1:]:

    # Each <tr> tag also has some <td> tags holding cell contents; these are
    # what we'll move into variables and then write to the CSV file.
    cell = row.find_all('td')
    
    # Reactor name, detail page link and docket number are all part of the first cell.
    # Docket has a bunch of whitespace, so we'll .strip() it.
    name = cell[0].contents[0].text
    link = cell[0].contents[0].get('href')
    docket = cell[0].contents[2].strip()
    reactype = cell[1].text
    
    # Two fields in this table (location and owner) have characters outside
    # of our fair ASCII realm; we need to make sure these are encoded into a
    # character system (and one that can handle them) on the way into our CSV.
    # We'll put them in UTF-8, the original encoding of our page.
    location = cell[2].text.encode('utf-8')
    owner = cell[3].text.strip().encode('utf-8')
    region = cell[4].text

    # Once everything's collected, write it as a row in the csv.
    output.writerow([name, link, docket, reactype, location, owner, region])
```

In [12]:
for row in reactor_table.find_all('tr')[1:]:

    # Each <tr> tag also has some <td> tags holding cell contents; these are
    # what we'll move into variables and then write to the CSV file.
    cell = row.find_all('td')
    
    # Reactor name, detail page link and docket number are all part of the first cell.
    # Docket has a bunch of whitespace, so we'll .strip() it.
    name = cell[0].contents[0].text
    link = cell[0].contents[0].get('href')
    docket = cell[0].contents[2].strip()
    reactype = cell[1].text
    
    # Two fields in this table (location and owner) have characters outside
    # of our fair ASCII realm; we need to make sure these are encoded into a
    # character system (and one that can handle them) on the way into our CSV.
    # We'll put them in UTF-8, the original encoding of our page.
    location = cell[2].text.encode('utf-8')
    owner = cell[3].text.strip().encode('utf-8')
    region = cell[4].text

    # Once everything's collected, write it as a row in the csv.
    output.writerow([name, link, docket, reactype, location, owner, region])
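
To see why `contents[0]` and `contents[2]` are the right positions in the first cell, here is a small sketch that parses just that cell, copied from the table output printed earlier, and lists its pieces. The names `cell_html` and `first_cell` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# First cell of the first data row, copied from the table printed earlier.
cell_html = '''<td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>
            05000313</td>'''
first_cell = BeautifulSoup(cell_html, 'html.parser').find('td')

# .contents is a list of the cell's direct children:
# the <a> tag, the <br/> tag, and the trailing text holding the docket number.
for position, piece in enumerate(first_cell.contents):
    print(position, repr(piece))

name = first_cell.contents[0].text          # text inside the <a> tag
link = first_cell.contents[0].get('href')   # the <a> tag's href attribute
docket = first_cell.contents[2].strip()     # text after <br/>, minus whitespace

print(name, link, docket)
```

If the page's markup ever changes (say, an extra tag inside the cell), these positions would shift, so this is the first place to look when the scraper breaks.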

This loop has done all the work! Just one thing left to do:

### 6. Close the file

Some of the data we've written just hangs out in the computer's memory until you close the file, which commits it all to disk.

In [13]:
data_file.close()
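
An alternative worth knowing about is Python's `with` statement, which closes the file for you when the indented block ends. Here is a minimal sketch, assuming Python 3; the file name `example.csv` and the variable names are made up for illustration, and the second row reuses values from the table above.

```python
import csv

# Assumptions: Python 3, and a hypothetical file name 'example.csv'.
with open('example.csv', 'w', newline='', encoding='utf-8') as example_file:
    example_output = csv.writer(example_file)
    example_output.writerow(["NAME", "REGION"])
    example_output.writerow(["Arkansas Nuclear 1", "4"])

# Leaving the indented block closes the file, so no explicit .close() is needed.
print(example_file.closed)
```

This doesn't fit a notebook that spreads the work across several cells, because everything that writes to the file has to live inside the one `with` block, but it's a common pattern in standalone scripts.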