## Gathering some reactor data
Now for the fun part.

In [1]:
# import the same modules as before but we need a new one, too
import requests
from bs4 import BeautifulSoup
import csv

### Fetch HTML with `requests`

`requests` is great at playing web browser. For more information, check out the [full documentation](http://docs.python-requests.org/en/master/).

```python
response = requests.get('some URL')
# requests the page and hands you back a response object

response.content
# an attribute (not a method) holding the raw bytes the server sent back (the HTML we crave)
```
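
Before parsing anything, it can help to confirm the request actually succeeded. Here is a minimal sketch, assuming a hypothetical helper named `fetch_html` and a 10-second timeout (neither appears in the original notebook):

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at `url` (hypothetical helper)."""
    # the timeout keeps the call from hanging forever if the server never answers
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` would then stand in for `requests.get(url).content`, with an error raised instead of silently parsing an error page.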

We'll gather data on nuclear reactors operating in the U.S. from this page: http://www.nrc.gov/reactors/operating/list-power-reactor-units.html

In [2]:
# fetch the contents of webpage with requests
url = "http://www.nrc.gov/reactors/operating/list-power-reactor-units.html"
main_page = requests.get(url)

### Parse HTML with `BeautifulSoup`

In [3]:
# let BeautifulSoup parse the content of that page
soup = BeautifulSoup(main_page.content, 'html.parser')

### Target the data

In [4]:
# snip out the table and pass it to a new variable
reactors_table = soup.find('table')
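
`.find` returns only the first matching tag, which works here as long as the table we want is the first one on the page. A small sketch of how `.find` and `.find_all` differ, using a made-up two-table snippet rather than the real NRC page:

```python
from bs4 import BeautifulSoup

# a tiny made-up page with two tables, for illustration only
html = """
<table id="first"><tr><td>a</td></tr></table>
<table id="second"><tr><td>b</td></tr></table>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

first_table = demo_soup.find('table')      # first match only
all_tables = demo_soup.find_all('table')   # every match, as a list
missing = demo_soup.find('form')           # no match returns None

print(first_table.get('id'))   # first
print(len(all_tables))         # 2
print(missing)                 # None
```

Because a failed `.find` returns `None`, printing the result (as the next cell does) is a quick way to catch a wrong target early.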

In [5]:
# print reactors_table to verify we have the right thing
print(reactors_table)

<table border="1" cellpadding="5" cellspacing="0" summary="List of Power Reactor Units" width="100%">
<tr valign="top">
<th scope="col">Plant Name<br/>
Docket Number</th>
<th scope="col">License Number</th>
<th scope="col">Reactor<br/>
Type</th>
<th scope="col">Location</th>
<th scope="col">Owner/Operator</th>
<th scope="col">NRC Region</th>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>
<td align="center">DPR-51</td>
<td>PWR</td>
<td>6 miles WNW of Russellville,  AR</td>
<td>Entergy Nuclear Operations, Inc. </td>
<td align="middle">4</td>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/ano2.html">Arkansas Nuclear 2</a><br/>05000368</td>
<td align="center">NPF-6</td>
<td>PWR</td>
<td>6 miles WNW of Russellville,  AR</td>
<td>Entergy Nuclear Operations, Inc. </td>
<td align="middle">4</td>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a

In [6]:
# use .find_all to create a list of rows in the table
reactor_rows = reactors_table.find_all('tr')

In [7]:
# isolate the fourth row and print it
example_row = reactor_rows[3]
print(example_row)

<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a><br/>05000334</td>
<td align="center">DPR-66</td>
<td>PWR</td>
<td>17 miles W of McCandless,  PA</td>
<td>FirstEnergy Nuclear Operating Co. </td>
<td align="middle">1</td>
</tr>


One of our table's rows, with a little shading and indentation:

```html
<tr valign="top">
    <td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>
    <td align="center">DPR-51</td>
    <td>PWR</td>
    <td>6 miles WNW of Russellville,  AR</td>
    <td>Entergy Nuclear Operations, Inc.</td>
    <td align="middle">4</td>
</tr>
```

In [8]:
# use .find_all again to generate a list of the row's cells and return it
cells = example_row.find_all('td')
cells

[<td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a><br/>05000334</td>,
 <td align="center">DPR-66</td>,
 <td>PWR</td>,
 <td>17 miles W of McCandless,  PA</td>,
 <td>FirstEnergy Nuclear Operating Co. </td>,
 <td align="middle">1</td>]

### Getting data from trickier tags

`BeautifulSoup` has a couple of other tools we haven't discussed yet that are helpful for extracting the information _inside_ a tag:
```python
soup.contents
# an attribute that breaks the components within a tag into a list (useful when a cell holds more than text)

soup.get('some attribute')
# returns the value of that attribute (useful for getting URLs from <a> tags, for example)
```

In [9]:
# examine the "contents" of the first item in cells
cells[0].contents

[<a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a>,
 <br/>,
 '05000334']

In [10]:
# isolate and print the name, the link and the docket number
print(cells[0].contents[0].text)
print(cells[0].contents[0].get('href'))
print(cells[0].contents[2])

Beaver Valley 1
/info-finder/reactors/bv1.html
05000334
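
Indexing into `.contents` depends on the pieces of the cell always arriving in the same order. As a hedged alternative sketch (not the notebook's method), the same three values can be pulled out by searching inside the cell instead; the variable names here are made up for illustration:

```python
from bs4 import BeautifulSoup

# the first cell of the Beaver Valley 1 row, copied from the output above
cell_html = ('<td scope="row"><a href="/info-finder/reactors/bv1.html">'
             'Beaver Valley 1</a><br/>05000334</td>')
cell = BeautifulSoup(cell_html, 'html.parser').find('td')

anchor = cell.find('a')                    # the <a> tag, wherever it sits in the cell
name = anchor.get_text(strip=True)
href = anchor.get('href')
# the docket number is the last piece of text in the cell
docket = list(cell.stripped_strings)[-1]

print(name)     # Beaver Valley 1
print(href)     # /info-finder/reactors/bv1.html
print(docket)   # 05000334
```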


### Extract the data

In [11]:
# make an empty list to hold the data
scraped_data = []

# a for loop is going to take us through every row in the table EXCEPT the header
# combining two steps: the list it pulls from will be created by a .find_all for 'tr' tags
for row in reactors_table.find_all('tr')[1:]:
    
    # .find_all 'td' tags in the row and put them into a variable
    cells = row.find_all('td')
    
    # extract the cell contents
    reactor_name = cells[0].contents[0].text
    link = 'http://www.nrc.gov' + cells[0].contents[0].get('href')
    docket = cells[0].contents[2]
    license = cells[1].text
    reactor_type = cells[2].text
    location = cells[3].text
    owner = cells[4].text
    region = cells[5].text
    
    # append the collected data to the empty list
    scraped_data.append([reactor_name, link, docket, license, reactor_type, location, owner, region])
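
The loop builds each full link by gluing `'http://www.nrc.gov'` onto the `href`. That works for this page; a slightly more general sketch, assuming you would rather resolve links against the page URL, uses `urljoin` from the standard library (`page_url` and `relative_link` are illustrative names):

```python
from urllib.parse import urljoin

page_url = "http://www.nrc.gov/reactors/operating/list-power-reactor-units.html"
relative_link = "/info-finder/reactors/bv1.html"

# urljoin resolves the href against the page it was found on
full_link = urljoin(page_url, relative_link)
print(full_link)
# http://www.nrc.gov/info-finder/reactors/bv1.html
```

`urljoin` also leaves an `href` alone when it is already a full URL, which plain string concatenation would not.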
    

### Does it all look OK?

In [12]:
# print scraped_data and give it a look
print(scraped_data)

[['Arkansas Nuclear 1', 'http://www.nrc.gov/info-finder/reactors/ano1.html', '05000313', 'DPR-51', 'PWR', '6 miles WNW of Russellville,\xa0\xa0AR', 'Entergy Nuclear Operations, Inc. ', '4'], ['Arkansas Nuclear 2', 'http://www.nrc.gov/info-finder/reactors/ano2.html', '05000368', 'NPF-6', 'PWR', '6 miles WNW of Russellville,\xa0\xa0AR', 'Entergy Nuclear Operations, Inc. ', '4'], ['Beaver Valley 1', 'http://www.nrc.gov/info-finder/reactors/bv1.html', '05000334', 'DPR-66', 'PWR', '17 miles W of McCandless,\xa0\xa0PA', 'FirstEnergy Nuclear Operating Co. ', '1'], ['Beaver Valley 2', 'http://www.nrc.gov/info-finder/reactors/bv2.html', '05000412', 'NPF-73', 'PWR', '17 miles W of McCandless,\xa0\xa0PA', 'FirstEnergy Nuclear Operating Co. ', '1'], ['Braidwood 1', 'http://www.nrc.gov/info-finder/reactors/brai1.html', '05000456', 'NPF-72', 'PWR', '20 miles SSW of Joliet,\xa0\xa0IL', 'Exelon Generation Co., LLC ', '3'], ['Braidwood 2', 'http://www.nrc.gov/info-finder/reactors/brai2.html', '05000457
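
The output shows some leftovers worth noticing: `\xa0` is a non-breaking space from the page, and the owner names carry a trailing space. One possible cleanup, sketched with a hypothetical `clean_text` helper that is not part of the original notebook:

```python
def clean_text(text):
    """Collapse runs of whitespace, non-breaking spaces included, into single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace and drops the ends
    return ' '.join(text.split())


raw_location = '17 miles W of McCandless,\xa0\xa0PA'
raw_owner = 'FirstEnergy Nuclear Operating Co. '

print(clean_text(raw_location))   # 17 miles W of McCandless, PA
print(clean_text(raw_owner))      # FirstEnergy Nuclear Operating Co.
```

A helper like this could be wrapped around each `.text` value inside the extraction loop.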

### Write the data to CSV

In [13]:
# open a file and write our data to it
# (newline='' keeps the csv module from adding blank rows on Windows)
with open('reactor_data.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['reactor_name', 'link', 'docket', 'license', 'reactor_type', 'location', 'owner', 'region'])
    writer.writerows(scraped_data)
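
A quick way to check the file is to read it back. The sketch below uses a temporary file and a couple of sample rows (a trimmed, illustrative subset of the columns) so it runs on its own; opening with `newline=''` follows the `csv` module's documented recommendation:

```python
import csv
import os
import tempfile

header = ['reactor_name', 'docket', 'region']
sample_rows = [
    ['Arkansas Nuclear 1', '05000313', '4'],
    ['Beaver Valley 1', '05000334', '1'],
]

# write to a throwaway location so nothing real gets overwritten
path = os.path.join(tempfile.mkdtemp(), 'reactor_sample.csv')

with open(path, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(sample_rows)

# DictReader uses the header row as keys for each following row
with open(path, newline='') as infile:
    rows_read = list(csv.DictReader(infile))

print(len(rows_read))            # 2
print(rows_read[1]['docket'])    # 05000334
```

Note that the docket number comes back as a string, so its leading zeros survive the round trip.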