# Scrape data with Python Requests and Beautiful Soup

Welcome to this Jupyter Notebook! 
  
This notebook was made for the Learno course Python for Journalists. In this module you'll learn how to automatically download data from the internet, a technique also known as scraping data. We'll be using the libraries Requests and Beautiful Soup to scrape data. Don't forget to install these libraries in your Anaconda environment; otherwise importing them will result in an error message. Install the libraries from the terminal/cmd prompt using the commands `conda install requests` and `conda install bs4`.


## About Jupyter Notebooks and Pandas

Right now you're looking at a Jupyter Notebook: an interactive, browser-based programming environment. You can use these notebooks to program in R, Julia or Python - as you'll be doing later on. Read more about Jupyter Notebook in the [Jupyter Notebook Quick Start Guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
  
To clean up our data, we'll be using Python and Pandas. Pandas is an open-source Python library - basically an extra toolkit to go with Python - that is designed for data analysis. Pandas is flexible, easy to use and has lots of useful functions built right in. Read more about Pandas and its features in [the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/).

**Notebook shortcuts**  

Within Jupyter Notebooks, there are some shortcuts you can use. If you'll be using more notebooks for your data analysis in the future, you'll remember these shortcuts soon enough. :) 

* `esc` will take you into command mode
* `a` will insert cell above
* `b` will insert cell below
* `shift then tab` will show you the documentation for your code
* `shift and enter` will run your cell
* ` d d` will delete a cell

**Pandas dictionary**

* **dataframe**: dataframe is Pandas speak for a table with a labeled y-axis, also known as an index. (The index usually starts at 0.)
* **series**: a series is a list of values; a single column within a dataframe is a series.

Before we dive in, a little more about Jupyter Notebooks. Every notebook is made out of cells. A cell can contain either Markdown text - like this one - or code. In the latter you can execute your code. To see what that means, type the following command in the next cell: `print('Hello world!')`.

In [1]:
print('Hello world!')

Hello world!


## Getting started

Now, let's import the libraries we need to get started with scraping. Type `import requests`, `from bs4 import BeautifulSoup`, `import pandas as pd` and `import csv`.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

**What's in a name**  
Scraping is the act of automatically downloading selected data from a website. Scraping is also known as web scraping, web harvesting, web data extraction and data scraping. It can be a very valuable tool for your newsroom: instead of saving data from the web by hand, you can automate and speed up the process by writing a custom Python program that downloads the information for you. 
  
    


**What we'll actually be doing when I say 'we're scraping a website':**  

- tell your computer which site to visit: where do you want to download data from? 
    - we'll be using the `requests` library to request webpages
- save the webpage (the html-page) to the computer
    - this too will be done with the `requests` library
- from the webpage, select the data you want to have
    - we'll be using `BeautifulSoup` to do this
- write the selection to a csv-file
    - this is done with the `csv` library

If there is more than 1 page you want to get data from, you can tell your computer to move on to the next page and repeat the process. But that's for another course... :) 


# Scraping a website

## Request webpage
We'll be scraping a list of [Power Reactors](https://www.nrc.gov/reactors/operating/list-power-reactor-units.html) from the site of the US Nuclear Regulatory Commission (NRC). First we need to let our computer know what site we want to visit; then we can request the site using `requests.get('http://website.com')`.

In [3]:
page = requests.get('https://www.nrc.gov/reactors/operating/list-power-reactor-units.html')

If you want your code to be more easily reusable, you can rewrite it to:

In [4]:
url = 'https://www.nrc.gov/reactors/operating/list-power-reactor-units.html'
page = requests.get(url)

Note that `requests.get(url)` doesn't have the url in quotes; it's clear the url is a string by the quotation marks in `url = 'https://www.nrc.gov/reactors/operating/list-power-reactor-units.html'`.

To check if everything went right, we can simply type `page`; this will return a response code. Status codes are issued by a server in response to a client's request made to the server. Read more about these codes on the Wikipedia page on HTTP status codes. Basically, if you have a 200 response code, the website loaded just fine.

In [5]:
page

<Response [200]>
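
If you'd rather check the status code in your code than by eye, the response object also exposes it as a number in `page.status_code`. The sketch below is only an illustration: the helper name `describe_response` is made up for this example, and the response is built by hand (instead of downloaded) so the cell also runs without an internet connection.

```python
import requests


def describe_response(response):
    """Return a short, human-readable summary of a response's status code."""
    if response.status_code == 200:
        return 'OK: the page loaded fine'
    if response.status_code == 404:
        return 'Not found: check the url for typos'
    return 'Unexpected status code: ' + str(response.status_code)


# Assumption: a hand-made response, so no website is actually visited here.
# With a real page you would pass in the result of requests.get(url).
fake_page = requests.Response()
fake_page.status_code = 404

summary = describe_response(fake_page)
print(summary)
```

With the real page from above you would call `describe_response(page)` instead.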

## Parse HTML, select data
Now that we've got the page, let's parse the HTML page. To parse is just nerd speak for splitting up the original data into smaller bits. Use `BeautifulSoup(page.content, 'html.parser')`. When scraping, it's pretty common to name the first object you create with BeautifulSoup 'soup'. This 'soup' variable will contain all the HTML of the page once we're done. 

Of course, if you want to see what is in 'soup', you could type `print(soup)`. (Notice how there are no quotation marks, since the soup we're referring to is a variable that has data stored inside of it, not a string.) But when you type `soup` on the last line of a cell, the notebook will also print your soup. Again: programmers like things short and sweet.

Btw, the library is named after the Beautiful Soup from Alice in Wonderland... Not kidding.

Now, let's make ourselves some soup...

In [7]:
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!-- #BeginTemplate "/Templates/generic-terminal-no-box.dwt" --><!-- DW6 --><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<meta content="IE=11" http-equiv="X-UA-Compatible"/>
<head>
<!-- #BeginEditable "doctitle" -->
<title>NRC: List of Power Reactor Units</title>
<!-- #EndEditable -->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="en" http-equiv="content-language"/>
<meta content="The Nuclear Regulatory Commission, protecting people and the environment." name="description"/>
<meta content="Nuclear Regulatory Commission, NRC, protecting, people, environment" lang="en" name="keywords"/>
<link href="/admin/css/styles.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="/admin/css/jcalendar.css" rel="stylesheet" type="text/css"/>
<!-- <link rel="stylesheet" type="text/css" href="/admin/style/base

Next you want to select the table from this soup. Thanks to the BeautifulSoup library, you can do this by writing `soup.find('table')`. This command will look for the first `<table>` in the source code of the webpage, also known as our soup.

In [8]:
table = soup.find('table')

In [9]:
table

<table border="1" cellpadding="5" cellspacing="0" summary="List of Power Reactor Units" width="100%">
<tr valign="top">
<th scope="col">Plant Name<br/>
Docket Number</th>
<th scope="col">License Number</th>
<th scope="col">Reactor<br/>
Type</th>
<th scope="col">Location</th>
<th scope="col">Owner/Operator</th>
<th scope="col">NRC Region</th>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>
<td align="center">DPR-51</td>
<td>PWR</td>
<td>6 miles WNW of Russellville,  AR</td>
<td>Entergy Nuclear Operations, Inc. </td>
<td align="middle">4</td>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/ano2.html">Arkansas Nuclear 2</a><br/>05000368</td>
<td align="center">NPF-6</td>
<td>PWR</td>
<td>6 miles WNW of Russellville,  AR</td>
<td>Entergy Nuclear Operations, Inc. </td>
<td align="middle">4</td>
</tr>
<tr valign="top">
<td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a><

Next, let's get all rows in the table. The HTML code for rows in a table is `<tr>`. We can use the BeautifulSoup command `.find_all('tr')` to get all of these rows.

In [25]:
rows = table.find_all('tr')

In [26]:
rows

[<tr valign="top">
 <th scope="col">Plant Name<br/>
 Docket Number</th>
 <th scope="col">License Number</th>
 <th scope="col">Reactor<br/>
 Type</th>
 <th scope="col">Location</th>
 <th scope="col">Owner/Operator</th>
 <th scope="col">NRC Region</th>
 </tr>, <tr valign="top">
 <td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>
 <td align="center">DPR-51</td>
 <td>PWR</td>
 <td>6 miles WNW of Russellville,  AR</td>
 <td>Entergy Nuclear Operations, Inc. </td>
 <td align="middle">4</td>
 </tr>, <tr valign="top">
 <td scope="row"><a href="/info-finder/reactors/ano2.html">Arkansas Nuclear 2</a><br/>05000368</td>
 <td align="center">NPF-6</td>
 <td>PWR</td>
 <td>6 miles WNW of Russellville,  AR</td>
 <td>Entergy Nuclear Operations, Inc. </td>
 <td align="middle">4</td>
 </tr>, <tr valign="top">
 <td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a><br/>05000334</td>
 <td align="center">DPR-66</td>
 <td>PWR</td>
 <td>17 mi

See how `.find_all('')` finds all rows at once, while `.find('')` will just get you the first one of whatever it is you're looking for.

Since there is only 1 table on this webpage, you can either use `soup.find_all('tr')` or `table.find_all('tr')`. But if there are two or more tables on one page, the `soup.find_all('tr')` command will get you all rows, from all tables. `table.find_all('tr')` builds upon `soup.find('table')`, which will give you the **first** table; meaning that `table.find_all('tr')` will get all rows from the first table only.
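
To see the difference without visiting a website, here is a minimal sketch on a tiny, made-up page with two tables. The HTML string and the variable names (`example_soup`, `first_table`, `all_rows`, `first_table_rows`) are assumptions for this illustration only; they are not part of the NRC page.

```python
from bs4 import BeautifulSoup

# Assumption: a tiny made-up page with two tables, not the NRC page
html = """
<html><body>
<table>
  <tr><td>first table, row 1</td></tr>
  <tr><td>first table, row 2</td></tr>
</table>
<table>
  <tr><td>second table, row 1</td></tr>
</table>
</body></html>
"""

example_soup = BeautifulSoup(html, 'html.parser')

# .find() returns only the first table in the soup
first_table = example_soup.find('table')

# rows from every table on the page
all_rows = example_soup.find_all('tr')
# rows from the first table only
first_table_rows = first_table.find_all('tr')

print(len(all_rows))          # 3
print(len(first_table_rows))  # 2
```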

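Don't believe me? Let's look at a page that has more than one table: the Wikipedia page on the Elfstedentocht. Here `soup1.find_all('table')` returns a list of all tables on the page, and adding `[4]` picks the fifth one (computers start counting at zero).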

In [15]:
url1 = 'https://en.wikipedia.org/wiki/Elfstedentocht'
page1 = requests.get(url1)
soup1 = BeautifulSoup(page1.content, 'html.parser')
table1 = soup1.find_all('table')[4]

In [16]:
table1

<table class="wikitable hlist" style="border:0; margin:0;">
<tr>
<th>Year</th>
<th>Date</th>
<th>Temperature</th>
<th colspan="2">Winner (*)</th>
<th>Time</th>
<th>Distance</th>
<th>Average speed</th>
</tr>
<tr>
<td>1909</td>
<td>2 January</td>
<td style="text-align:center;">n/a</td>
<td class="hlist" colspan="2">
<ul>
<li><a class="new" href="/w/index.php?title=Minne_Hoekstra&amp;action=edit&amp;redlink=1" title="Minne Hoekstra (page does not exist)">Minne Hoekstra</a><sup>(<a class="extiw" href="https://nl.wikipedia.org/wiki/Minne_Hoekstra" title="nl:Minne Hoekstra">NL</a>)</sup></li>
</ul>
</td>
<td>13:50</td>
<td>189 km</td>
<td>13.7 km/h</td>
</tr>
<tr>
<td>1912</td>
<td>7 February</td>
<td style="text-align:center;">3.8°C</td>
<td class="hlist" colspan="2">
<ul>
<li><a href="/wiki/Coen_de_Koning" title="Coen de Koning">Coen de Koning</a></li>
</ul>
</td>
<td>11:40</td>
<td>189 km</td>
<td>16.2 km/h</td>
</tr>
<tr>
<td>1917</td>
<td>27 January</td>
<td style="text-align:center;">-

You see? Just remember: whatever command you give your computer, it always applies to the data in front of the dot. Meaning `soup.find_all('tr')` looks for `tr`s in `soup`, and `table.find_all('tr')` looks for `tr`s in `table`.


Now let's say that you are especially interested in the 21st row. What do you do? Since computers start counting at zero, you should ask it for row 20 to get to see the 21st row. And since you saved all rows in the `rows` variable, you can actually say 'dear computer, give me row 20' by typing `rows[20]`.

In [27]:
rows[20]

<tr valign="top">
<td nowrap="nowrap" scope="row"><a href="/info-finder/reactors/wash2.html">Columbia Generating Station</a><br/>05000397</td>
<td align="center">NPF-21</td>
<td>BWR</td>
<td>20 miles NNE of Pasco, WA</td>
<td>Energy Northwest </td>
<td align="middle">4</td>
</tr>

Looking at this row, do you recognize the different cells? Every cell starts with `<td>`, the HTML abbreviation for table data. You can use BeautifulSoup to look for all `td`s by typing `table.find_all('td')`. Note that this looks inside the whole `table`, so you get the cells of every row, not just those of the 21st row.

In [28]:
cells = table.find_all('td')
cells

[<td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>,
 <td align="center">DPR-51</td>,
 <td>PWR</td>,
 <td>6 miles WNW of Russellville,  AR</td>,
 <td>Entergy Nuclear Operations, Inc. </td>,
 <td align="middle">4</td>,
 <td scope="row"><a href="/info-finder/reactors/ano2.html">Arkansas Nuclear 2</a><br/>05000368</td>,
 <td align="center">NPF-6</td>,
 <td>PWR</td>,
 <td>6 miles WNW of Russellville,  AR</td>,
 <td>Entergy Nuclear Operations, Inc. </td>,
 <td align="middle">4</td>,
 <td scope="row"><a href="/info-finder/reactors/bv1.html">Beaver Valley 1</a><br/>05000334</td>,
 <td align="center">DPR-66</td>,
 <td>PWR</td>,
 <td>17 miles W of McCandless,  PA</td>,
 <td>FirstEnergy Nuclear Operating Co. </td>,
 <td align="middle">1</td>,
 <td scope="row"><a href="/info-finder/reactors/bv2.html">Beaver Valley 2</a><br/>05000412</td>,
 <td align="center">NPF-73</td>,
 <td>PWR</td>,
 <td>17 miles W of McCandless,  PA</td>,
 <td>FirstEnergy Nuclea

To get only the cells of the 21st row, call `.find_all('td')` on that row and save the result to a variable called cells: simply type `cells = rows[20].find_all('td')`.

In [30]:
cells = rows[20].find_all('td')
cells

[<td nowrap="nowrap" scope="row"><a href="/info-finder/reactors/wash2.html">Columbia Generating Station</a><br/>05000397</td>,
 <td align="center">NPF-21</td>,
 <td>BWR</td>,
 <td>20 miles NNE of Pasco, WA</td>,
 <td>Energy Northwest </td>,
 <td align="middle">4</td>]

Now that you know how to select one specific row, you can probably guess how to select a data cell. Exactly: use `cells[0]` to get the first cell of `cells`.

In [31]:
cells[0]

<td nowrap="nowrap" scope="row"><a href="/info-finder/reactors/wash2.html">Columbia Generating Station</a><br/>05000397</td>

It works, but it doesn't look too good, does it? Let's get rid of the HTML bits and pieces around our data. Add `.text` to get the job done.

In [32]:
cells[0].text

'Columbia Generating Station05000397'

Looks much better, doesn't it? 
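
One thing to notice: `.text` glues the plant name and the docket number together, because in the HTML they are only separated by a `<br/>`. If you want to keep them apart, BeautifulSoup's `get_text()` accepts a separator, and `stripped_strings` gives you each piece of text on its own. The sketch below runs on the cell copied from the output above; the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# The first cell of row 20, copied from the output above
cell_html = ('<td nowrap="nowrap" scope="row">'
             '<a href="/info-finder/reactors/wash2.html">Columbia Generating Station</a>'
             '<br/>05000397</td>')
cell = BeautifulSoup(cell_html, 'html.parser').find('td')

# .text glues the two pieces of text together
glued = cell.text
# get_text() with a separator keeps them apart; strip=True trims whitespace
separated = cell.get_text(separator=' ', strip=True)
# stripped_strings yields each piece of text on its own
plant_name, docket_number = list(cell.stripped_strings)

print(glued)          # Columbia Generating Station05000397
print(separated)      # Columbia Generating Station 05000397
print(plant_name)     # Columbia Generating Station
print(docket_number)  # 05000397
```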

Unfortunately, there are too many rows in this table to get each cell one by one, the way we got `Columbia Generating Station05000397`. We're going to have to automate it. Luckily, this is one of the big benefits of programming. 

Here's what we're going to do: 
1. create an empty list to be used later
2. extract the table from our soup, save it to the `table` variable
3. 'loop over' our table....
4. ...to save the data we need for each row in the table
5. add the selected data to the list
6. print the list

At step 3 we'll 'loop over' the table. What does that mean? Well, using a for loop, as it's called, means that we'll give our computer an assignment and have it done **for** every something. It's like when your mum told you to treat your friends to candy: **for every one of your friends, give them a piece of candy**. It's shorter than naming all your friends one by one and repeating the assignment time and time again, right? We're doing exactly the same by telling our computer: **for every row in the table, get the data inside the cells**.
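
Written out in Python, the candy example could look like the sketch below. The names in the list are made up, purely for illustration.

```python
# Assumption: made-up names, purely for illustration
friends = ['Anna', 'Bram', 'Chris']

handed_out = []
# for every friend in the list of friends...
for friend in friends:
    # ...do the same assignment: hand over one piece of candy
    handed_out.append(friend + ' gets a piece of candy')

print(handed_out)
```

The indented lines are the assignment; they are repeated once for every item in the list.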

In [38]:
# for every row in the table...
for row in table.find_all('tr'):
    # ...save the data in each cell inside that row to cells
    cells = row.find_all(['th', 'td'])
    # ...get the data from each of the cells and save it to a variable
    plantNameDocketNumber = cells[0].text
    licenseNumber = cells[1].text
    reactorType = cells[2].text
    location = cells[3].text
    ownerOperator = cells[4].text
    NRCRegion = cells[5].text
    # ...create a list called rowData containing all variables
    rowData = [plantNameDocketNumber, licenseNumber, reactorType, location, ownerOperator, NRCRegion]
    print(rowData)

['Plant Name\r\nDocket Number', 'License Number', 'Reactor\r\nType', 'Location', 'Owner/Operator', 'NRC Region']
['Arkansas Nuclear 105000313', 'DPR-51', 'PWR', '6 miles WNW of Russellville,\xa0\xa0AR', 'Entergy Nuclear Operations, Inc. ', '4']
['Arkansas Nuclear 205000368', 'NPF-6', 'PWR', '6 miles WNW of Russellville,\xa0\xa0AR', 'Entergy Nuclear Operations, Inc. ', '4']
['Beaver Valley 105000334', 'DPR-66', 'PWR', '17 miles W of McCandless,\xa0\xa0PA', 'FirstEnergy Nuclear Operating Co. ', '1']
['Beaver Valley 205000412', 'NPF-73', 'PWR', '17 miles W of McCandless,\xa0\xa0PA', 'FirstEnergy Nuclear Operating Co. ', '1']
['Braidwood 105000456', 'NPF-72', 'PWR', '20 miles SSW of Joliet,\xa0\xa0IL', 'Exelon Generation Co., LLC ', '3']
['Braidwood 205000457', 'NPF-77', 'PWR', '20 miles SSW of Joliet,\xa0\xa0IL', 'Exelon Generation Co., LLC ', '3']
['Browns Ferry 105000259', 'DPR-33', 'BWR', '32 miles W of Huntsville,  AL', 'Tennessee Valley Authority ', '2']
['Browns Ferry 205000260', 'D

Congrats! You just wrote your very first scraper - well done!
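
The step list earlier mentions an empty list, while the loop above prints each row straight away. Here is a minimal sketch of the list-based variant, run on a small table typed in by hand so it works without an internet connection. The names `example_table`, `all_rows` and `row_data` are assumptions for this example; with the real data you would loop over `table` instead.

```python
from bs4 import BeautifulSoup

# Assumption: a small hand-typed table with values taken from the output above
html = """
<table>
<tr><th>Plant Name</th><th>License Number</th><th>Reactor Type</th></tr>
<tr><td>Arkansas Nuclear 1</td><td>DPR-51</td><td>PWR</td></tr>
<tr><td>Columbia Generating Station</td><td>NPF-21</td><td>BWR</td></tr>
</table>
"""
example_table = BeautifulSoup(html, 'html.parser').find('table')

# 1. create an empty list to be used later
all_rows = []
# 3. loop over the table...
for row in example_table.find_all('tr'):
    cells = row.find_all(['th', 'td'])
    # 4. ...get the text of every cell in this row
    row_data = [cell.text for cell in cells]
    # 5. add the selected data to the list
    all_rows.append(row_data)

# 6. print the list
print(all_rows)
```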

## Saving the scraped data

Now, of course having your data printed inside the notebook is nice. But it would be even better to store the data in a CSV file. Remember that I explained what we'd actually be doing? Of course things are a bit more complicated; let me explain. Here's what I told you before:

- tell your computer which site to visit: where do you want to download data from? 
    - we'll be using the `requests` library to request webpages
- save the webpage (the html-page) to the computer
    - this too will be done with the `requests` library
- from the webpage, select the data you want to have
    - we'll be using `BeautifulSoup` to do this
- write the selection to a csv-file
    - this is done with the `csv` library

Here's what the code will actually do: 
1. Create a CSV file to save data in
2. Create a CSV writer to write data to the CSV file
3. Tell your computer which site(s) to visit
4. Get the webpage
5. Select data from the webpage
6. Write data with the CSV writer to the CSV file 
7. Save file

## Save data to CSV

Here's how to save data to a CSV file using the `csv` library. The process involves a couple of steps:
1. create a file, open it and make sure it's 'writeable': use `f = open('filename.csv', 'w', encoding='utf8', newline='')`
2. create a writer, which you'll need to write data to the file: use `csv.writer(f, delimiter=',')`, where `f` is the file you just opened
3. write data to the file using the writer: use `writer.writerow([data])`

Of course you can repeat step 3 as often as necessary.

In [56]:
# create file, make sure it's writeable, set some defaults
f = open('powerReactorUnits.csv', 'w', encoding='utf8', newline='')

# create a writer to write data to the CSV file
writer = csv.writer(f, delimiter=',')

# write the row of data to the file
writer.writerow(['plantNameDocketNumber', 'licenseNumber', 'reactorType', 'location', 'ownerOperator', 'NRCRegion'])

82
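
The `82` printed above is the number of characters `writerow` wrote to the file: the header text plus the line ending. Note that the file stays open until you close it. One common way to make sure it gets closed (and therefore saved) is a `with` block, sketched below. The filename `example.csv` is made up for this illustration; it is not the file used in the rest of this notebook.

```python
import csv

header = ['plantNameDocketNumber', 'licenseNumber', 'reactorType',
          'location', 'ownerOperator', 'NRCRegion']

# Assumption: 'example.csv' is a made-up filename for this sketch
with open('example.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    # writerow returns the number of characters it wrote
    characters_written = writer.writerow(header)
# leaving the with-block closes the file, which saves it to disk

print(characters_written)  # 82

# read the file back to confirm the row is really there
with open('example.csv', encoding='utf8', newline='') as f:
    saved_rows = list(csv.reader(f))
print(saved_rows)
```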

Using the `ls` command you can see that a new file was created. 

In [57]:
ls

Analyse data - Complete.ipynb  Test Notebook.ipynb
Analyse data.ipynb*            Untitled.ipynb
Clean data - Complete.ipynb    powerReactorUnits.csv
Clean data.ipynb*              results.csv*
Scrape data.ipynb*             results_clean.csv


## The scraper
Before, we broke our scraper down into pieces, like an essay broken into sentences. Now I'll be putting all these sentences together. This way, you can get a good overview of what a scraper could look like. Here's a list of what we need to do, in the exact order: 
1. Create a CSV file, open it, make it writeable
2. Create a CSV writer to write data
3. Write the column headers to the file
4. Tell your computer which site(s) to visit
5. Get the webpage
6. Select data from the webpage
7. Write data with the CSV writer to the CSV file 
8. Save file

In [53]:
# create a file, make it writeable, set the encoding, let the csv writer handle line endings
f = open('powerReactorUnits.csv', 'w', encoding='utf8', newline='')
# create a writer that separates values with a comma and writes to f
writer = csv.writer(f, delimiter=',')
# write the first row of data to the file - column headers
writer.writerow(['plantNamedocketNumber', 'licenseNumber', 'reactorType', 'location', 'ownerOperator', 'NRCRegion'])

# set the url of the page that needs to be scraped
url = 'https://www.nrc.gov/reactors/operating/list-power-reactor-units.html'
# get the source code of the webpage, save to page
page = requests.get(url)

# let's make some soup
soup = BeautifulSoup(page.content, 'html.parser')
# extract the table from our soup
table = soup.find('table')

# for every row in the table...
for row in table.find_all('tr'):
    # ...save the data in each cell inside that row to cells
    cells = row.find_all(['th', 'td'])
    # ...get the data from each of the cells and save it to a variable
    plantNameDocketNumber = cells[0].text
    licenseNumber = cells[1].text
    reactorType = cells[2].text
    location = cells[3].text
    ownerOperator = cells[4].text
    NRCRegion = cells[5].text
    # ...create a list called rowData containing all variables
    rowData = [plantNameDocketNumber, licenseNumber, reactorType, location, ownerOperator, NRCRegion]
    # ...write the information in rowData to the CSV file
    writer.writerow(rowData)

# save the file by closing it, so all rows are written to disk
f.close()

If you want to check if everything worked as it's supposed to, you can import the powerReactorUnits.csv file as a dataframe using `pd.read_csv('filename.csv')`. Look at the dataframe to see if there's data in the file. Using `df.shape` you can even quickly check if there is as much data in the file as you'd expect. 

In [58]:
df = pd.read_csv('powerReactorUnits.csv')
df

Unnamed: 0,plantNamedocketNumber,licenseNumber,reactorType,location,ownerOperator,NRCRegion
0,Plant Name\r\nDocket Number,License Number,Reactor\r\nType,Location,Owner/Operator,NRC Region
1,Arkansas Nuclear 105000313,DPR-51,PWR,"6 miles WNW of Russellville, AR","Entergy Nuclear Operations, Inc.",4
2,Arkansas Nuclear 205000368,NPF-6,PWR,"6 miles WNW of Russellville, AR","Entergy Nuclear Operations, Inc.",4
3,Beaver Valley 105000334,DPR-66,PWR,"17 miles W of McCandless, PA",FirstEnergy Nuclear Operating Co.,1
4,Beaver Valley 205000412,NPF-73,PWR,"17 miles W of McCandless, PA",FirstEnergy Nuclear Operating Co.,1
5,Braidwood 105000456,NPF-72,PWR,"20 miles SSW of Joliet, IL","Exelon Generation Co., LLC",3
6,Braidwood 205000457,NPF-77,PWR,"20 miles SSW of Joliet, IL","Exelon Generation Co., LLC",3
7,Browns Ferry 105000259,DPR-33,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority,2
8,Browns Ferry 205000260,DPR-52,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority,2
9,Browns Ferry 305000296,DPR-68,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority,2


`df.shape` will give you the number of rows and columns of the dataframe: a quick way to check if everything that should be in the CSV file is really there.

In [59]:
df.shape

(101, 6)
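
The first number in `df.shape` is the number of rows, the second the number of columns. A minimal sketch on three lines of CSV text shows how to read them. The text is typed in by hand for this example, with values taken from the table above, and the variable names are made up.

```python
import io

import pandas as pd

# Assumption: three lines of hand-typed CSV text standing in for a real file
csv_text = """plantName,reactorType,NRCRegion
Arkansas Nuclear 1,PWR,4
Columbia Generating Station,BWR,4
"""

# io.StringIO lets pandas read the text as if it were a file
example_df = pd.read_csv(io.StringIO(csv_text))

# the first line became the column headers, so 2 rows and 3 columns remain
number_of_rows, number_of_columns = example_df.shape
print(number_of_rows, number_of_columns)  # 2 3
```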

In [60]:
df.tail(5)

Unnamed: 0,plantNamedocketNumber,licenseNumber,reactorType,location,ownerOperator,NRCRegion
96,Vogtle 205000425,NPF-81,PWR,"26 miles SE of Augusta, GA",Southern Nuclear Operating Co.,2
97,Waterford 305000382,NPF-38,PWR,"25 miles W of New Orleans, LA","Entergy Nuclear Operations, Inc.",4
98,Watts Bar 105000390,NPF-90,PWR,"60 miles SW of Knoxville, TN",Tennessee Valley Authority,2
99,Watts Bar 205000391,NPF-96,PWR,"60 miles SW of Knoxville, TN",Tennessee Valley Authority,2
100,Wolf Creek 105000482,NPF-42,PWR,"3.5 miles NE of Burlington, KS",Wolf Creek Nuclear Operating Corp.,4


Well done, happy web scraping!