# Scrape data with Python Requests and Beautiful Soup

Welcome to this Jupyter Notebook! 
  
This notebook was made for the Datajournalism.com [course Python for Journalists](https://datajournalism.com/watch/python-for-journalists). In this module you'll learn how to instruct your computer to download structured data that is not password protected from the internet, a technique also known as web scraping. We'll be using the libraries Requests and Beautiful Soup to scrape data. Don't forget to install these libraries in your Anaconda environment; otherwise importing them will result in an error message. Installing these libraries needs to be done in the terminal/cmd prompt using the commands `conda install requests` and `conda install bs4`.


## About Jupyter Notebooks and Pandas

Right now you're looking at a Jupyter Notebook: an interactive, browser-based programming environment. You can use these notebooks to program in R, Julia or Python, as you'll be doing later on. Read more about Jupyter Notebook in the [Jupyter Notebook Quick Start Guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
  
To clean up our data, we'll be using Python and Pandas. Pandas is an open-source Python library - basically an extra toolkit to go with Python - that is designed for data analysis. Pandas is flexible, easy to use and has lots of useful functions built right in. Read more about Pandas and its features in [the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/).

**Notebook shortcuts**  

Within Jupyter Notebooks, there are some shortcuts you can use. If you'll be using more notebooks for your data analysis in the future, you'll remember these shortcuts soon enough. :) 

* `esc` will take you into command mode
* `a` will insert cell above
* `b` will insert cell below
* `shift then tab` will show you the documentation for your code
* `shift and enter` will run your cell
* `d d` will delete a cell

**Pandas dictionary**

* **dataframe**: a dataframe is Pandas speak for a table with labeled rows; these row labels are known as the index. (The index usually starts at 0.)
* **series**: a series is a list-like sequence of values; a single column of a dataframe is a series.

Before we dive in, a little more about Jupyter Notebooks. Every notebook is made out of cells. A cell can either contain Markdown text, like this one, or code. In a code cell you can execute your code. To see what that means, type the following command in the next cell: `print("hello world")`.

## Getting started

Now, let's import the libraries we need to get started with scraping. Type `import requests`, `from bs4 import BeautifulSoup`, `import pandas as pd` and `import csv`.
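
Typed into a single code cell, the four imports named above look like this. If the cell runs without an error message, the libraries are installed correctly.

```python
import requests                  # downloads webpages
from bs4 import BeautifulSoup    # selects data from the HTML
import pandas as pd              # checks and cleans the scraped data
import csv                       # writes the data to a CSV file
```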

**What's in a name**  
Scraping is the act of automatically downloading selected data from a website. Scraping is also known as web scraping, web harvesting, web data extraction and data scraping. It can be a very valuable tool for your newsroom: instead of saving data from the web by hand, you can automate and speed up the process by writing a custom Python program that downloads the information for you. 
  
    


**What we'll actually be doing when I say 'we're scraping a website':**  

- tell your computer which site to visit: where do you want to download data from? 
    - we'll be using the `requests` library to request webpages
- save the webpage (the html-page) to the computer
    - this too will be done with the `requests` library
- from the webpage, select the data you want to have
    - we'll be using `BeautifulSoup` to do this
- write the selection to a csv-file
    - this is done with the `csv` library

If there is more than 1 page you want to get data from, you can tell your computer to move on to the next page and repeat the process. But that's for another course... :) 


# Scraping a website

## Request webpage
We'll be scraping a list of [Power Reactors](https://www.nrc.gov/reactors/operating/list-power-reactor-units.html) from the site of the US government. First we need to let our computer know what site we want to visit; then we can request the site using `requests.get('http://website.com')`.

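If you want your code to become more easily reusable, you can first store the address in a variable, `url = 'https://www.nrc.gov/reactors/operating/list-power-reactor-units.html'`, and then request it with `page = requests.get(url)`.

Note that `requests.get(url)` doesn't have the url in quotes; it's clear the url is a string from the quotation marks in `url = 'https://www.nrc.gov/reactors/operating/list-power-reactor-units.html'`.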

To check if everything went right, simply type `page`; this will show the response, including its status code (for example `<Response [200]>`). Status codes are issued by a server in response to a client's request made to the server. Read more about these codes on the Wikipedia page on HTTP status codes. Basically, if you have a 200 response code, the website loaded in just fine.
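
Here is a minimal sketch of the request step, using the address from the text. The `timeout` value and the `try`/`except` wrapper are additions for illustration: they keep the cell from hanging or crashing when the site can't be reached. The plain `requests.get(url)` call works just as well when everything goes right.

```python
import requests

# The address is stored in a variable, so it is easy to reuse or swap out
url = 'https://www.nrc.gov/reactors/operating/list-power-reactor-units.html'

try:
    # timeout=10 is an assumption: give up after 10 seconds without an answer
    page = requests.get(url, timeout=10)
    print(page.status_code)   # 200 means the page loaded fine
except requests.RequestException as error:
    # No connection, unknown host, timeout, and so on
    page = None
    print('Request failed:', error)
```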

## Parse HTML, select data
Now that we've got the page, let's parse the HTML page. To parse is just nerd speak for splitting up the original data into smaller bits. Use `BeautifulSoup(page.content, 'html.parser')`. When scraping, it's pretty common to name the first object you create with BeautifulSoup 'soup'. This 'soup' variable will contain all the HTML of the page once we're done. 

Of course, if you want to see what is in 'soup', you could type `print(soup)`. (Notice how there are no quote marks, since the soup we're referring to is a variable that has data stored inside of it and not a string.) But when you type just `soup` on a line of its own at the end of a cell, the notebook will also print your soup. Again: programmers like things short and sweet.

Btw, the library is named after the Beautiful Soup from Alice in Wonderland... Not kidding.

Now, let's make ourselves some soup...
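
A small sketch of making soup. To keep it runnable without an internet connection, it parses a made-up snippet of HTML (the `html` variable, an assumption for illustration) instead of `page.content`; with the downloaded page you would pass `page.content` in its place.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for page.content
html = """
<html><body>
<table>
  <tr><td>Plant name</td><td>Region</td></tr>
  <tr><td>Example Plant 1</td><td>1</td></tr>
</table>
</body></html>
"""

# Parse the HTML with Python's built-in parser and store the result in soup
soup = BeautifulSoup(html, 'html.parser')

# Show everything that is in the soup
print(soup)
```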

Next you want to select the table from this soup. Thanks to the BeautifulSoup library, you can do this by writing `soup.find('table')`. This command will look for the first `<table>` in the source code of the webpage, also known as our soup.

Next, let's get all rows in the table. The HTML code for rows in a table is `<tr>`. We can use the BeautifulSoup command `.find_all('tr')` to get all of these rows.

See how with `.find_all('')` you can find all rows at once, while `.find('')` will just get you the first one of whatever it is you're looking for.
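
The difference between `.find()` and `.find_all()` can be sketched on a tiny made-up table (the `html` string is an assumption for illustration, not the real page):

```python
from bs4 import BeautifulSoup

# Made-up table with three rows
html = '<table><tr><td>A</td></tr><tr><td>B</td></tr><tr><td>C</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table')      # the first <table> in the soup
rows = table.find_all('tr')     # a list with every <tr> in that table
first_row = table.find('tr')    # only the first <tr>

print(len(rows))     # 3
print(first_row)     # <tr><td>A</td></tr>
```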

Since there is only 1 table on this webpage, you can either use `soup.find_all('tr')` or `table.find_all('tr')`. But if there are two or more tables on one page, the `soup.find_all('tr')` command will get you all rows, from all tables. `table.find_all('tr')` builds upon `soup.find('table')`, which will give you the **first** table; meaning that `table.find_all('tr')` will get all rows from the first table only.

Don't believe me? Let's try and use `soup.find_all('tr')`...

You see? Exactly the same result. Just remember: whatever assignment you give to your computer, it always applies to the data that comes before the `.assignment`. Meaning `soup.find_all('tr')` looks for `tr`s in `soup`, and `table.find_all('tr')` looks for `tr`s in `table`.
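
With a single table both commands agree. A sketch with two made-up tables (the `html` string is an assumption for illustration) shows where they start to differ:

```python
from bs4 import BeautifulSoup

# Two made-up tables: the first has two rows, the second has one
html = (
    '<table><tr><td>1a</td></tr><tr><td>1b</td></tr></table>'
    '<table><tr><td>2a</td></tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')          # the first table only

all_rows = soup.find_all('tr')      # rows from every table in the soup
table_rows = table.find_all('tr')   # rows from the first table

print(len(all_rows))     # 3
print(len(table_rows))   # 2
```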


Now let's say that you are especially interested in the 21st row. What do you do? Since computers start counting at zero, you should ask it for row 20 to get to see the 21st row. And since you saved all rows in the `rows` variable, you can actually say 'dear computer, give me row 20' by typing `rows[20]`.

Looking at this row, do you recognize the different cells? Every cell starts with `<td>`, the HTML abbreviation for table data. You can use BeautifulSoup to look for all `td`'s in this 21st row by typing: `rows[20].find_all('td')`.

Just for your information: you can even save the data from the `td`'s to a variable called cells, simply type `cells = rows[20].find_all('td')`

Now that you know how to only select 1 certain row, you can probably guess how to select a data cell. Exactly, use `cells[0]` to get the first cell of `cells`.

It works, but it doesn't look too good, does it? Let's get rid of the HTML bits and pieces around our data. Add `.text` to get the job done.
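
Selecting one row, one cell and then only the text can be sketched like this. The table is generated from made-up values (an assumption for illustration), just to have enough rows to ask for number 21:

```python
from bs4 import BeautifulSoup

# Made-up table with 25 rows of two cells each
html_rows = ''.join(
    f'<tr><td>Plant {n}</td><td>Region {n % 4 + 1}</td></tr>'
    for n in range(1, 26)
)
soup = BeautifulSoup(f'<table>{html_rows}</table>', 'html.parser')
rows = soup.find('table').find_all('tr')

row = rows[20]               # index 20 is the 21st row
cells = row.find_all('td')   # all data cells in that row

print(cells[0])              # <td>Plant 21</td>
print(cells[0].text)         # Plant 21
```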

Looks much better, doesn't it? 

Unfortunately, there are too many rows in this table to get each cell like we got `Comanche Peak 105000445`. We're going to have to automate it. Luckily this is one of the big benefits of programming. 

Here's what we're going to do: 
1. create an empty list to be used later
2. extract the table from our soup, save it to the `table` variable
3. 'loop over' our table....
4. ...to save the data we need for each row in the table
5. add the selected data to the list
6. print the list

At step 3 we'll 'loop over' the table. What does that mean? Well, using a for loop, as it's called, means that we'll give our computer an assignment and have it done **for** every something. It's like your mum when she told you to treat your friends to candy: **for every one of your friends, give them a piece of candy**. It's shorter than naming all your friends one by one and repeating the assignment time and time again, right? We're doing exactly the same by telling our computer: **for every row in the table, get the data inside the cells**.
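
The six steps can be sketched as follows. The HTML is a made-up stand-in for the real page (an assumption for illustration), and the variable names `data`, `name` and `region` are hypothetical:

```python
from bs4 import BeautifulSoup

# Made-up table standing in for the downloaded page
html = """
<table>
  <tr><td>Plant name</td><td>Region</td></tr>
  <tr><td>Example Plant 1</td><td>1</td></tr>
  <tr><td>Example Plant 2</td><td>4</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

data = []                           # 1. create an empty list
table = soup.find('table')          # 2. extract the table from the soup
for row in table.find_all('tr'):    # 3. loop over every row in the table
    cells = row.find_all('td')      # 4. get the cells of this row
    name = cells[0].text
    region = cells[1].text
    data.append([name, region])     # 5. add the selected data to the list

print(data)                         # 6. print the list
```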

Congrats! You just wrote your very first scraper - well done!

## Saving the scraped data

Now, of course having your data printed inside the notebook is nice. But it would be even better to store the data in a CSV file. Remember that I explained what we'd actually be doing? Of course things are a bit more complicated; let me explain. Here's what I told you before:

- tell your computer which site to visit: where do you want to download data from? 
    - we'll be using the `requests` library to request webpages
- save the webpage (the html-page) to the computer
    - this too will be done with the `requests` library
- from the webpage, select the data you want to have
    - we'll be using `BeautifulSoup` to do this
- write the selection to a csv-file
    - this is done with the `csv` library

Here's what the code will actually do: 
1. Create a CSV file to save data in
2. Create a CSV writer to write data to the CSV file
3. Tell your computer which site(s) to visit
4. Get the webpage
5. Select data from the webpage
6. Write data with the CSV writer to the CSV file 
7. Save file

## Save data to CSV

Here's how to save data to a CSV file using the `csv` library. The process involves a couple of steps:
1. create a file, open it, make sure it's 'writeable', use `open('filename.csv', 'w', encoding='utf8', newline='')`
2. create a writer, you'll need a writer if you want to write data to the file, use `csv.writer(file, delimiter=',')`, where `file` is the opened file from step 1
3. write data to the file using the writer, use `writer.writerow([data])`

Of course you can repeat step 3 as often as necessary.
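
The three steps can be sketched like this. The file name `example.csv` and the values written are made up for illustration. The sketch uses a `with` block, which is a choice of this example: it closes (and so saves) the file automatically when the block ends.

```python
import csv

# 1. create the file and open it for writing ('w')
with open('example.csv', 'w', encoding='utf8', newline='') as file:
    # 2. create a writer that writes to the opened file
    writer = csv.writer(file, delimiter=',')
    # 3. write rows; every call writes one line to the file
    writer.writerow(['Plant name', 'Region'])
    writer.writerow(['Example Plant 1', '1'])
# Leaving the with block closes the file
```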

Using the `ls` command you can see that a new file was created. 

## The scraper
So far we've broken our scraper up into separate steps, much like breaking an essay into sentences. Now I'll be putting all these sentences together. This way, you can get a good overview of what a scraper could look like. Here's a list of what we need to do, in the exact order: 
1. Create a CSV file, open it, make it writeable
2. Create a CSV writer to write data
3. Write the column headers to the file
4. Tell your computer which site(s) to visit
5. Get the webpage
6. Select data from the webpage
7. Write data with the CSV writer to the CSV file 
8. Save file
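
Put together, the steps above could look roughly like this. It is a sketch under stated assumptions, not the definitive scraper: the function name `scrape_to_csv`, the column headers and the sample HTML are all made up, and the HTML is passed in as a string so the example runs without an internet connection. The first row of the sample table holds the column names in `<td>` cells, so it gets scraped along with the data.

```python
import csv
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page (steps 4 and 5)
sample_html = """
<table>
  <tr><td>Plant name</td><td>Region</td></tr>
  <tr><td>Example Plant 1</td><td>1</td></tr>
  <tr><td>Example Plant 2</td><td>4</td></tr>
</table>
"""

def scrape_to_csv(html, filename):
    # 1. create a CSV file, open it, make it writeable
    with open(filename, 'w', encoding='utf8', newline='') as file:
        # 2. create a CSV writer
        writer = csv.writer(file, delimiter=',')
        # 3. write the column headers (hypothetical names)
        writer.writerow(['name', 'region'])
        # 6. select the data from the webpage
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        for row in table.find_all('tr'):
            cells = row.find_all('td')
            # 7. write the data of this row to the CSV file
            writer.writerow([cells[0].text, cells[1].text])
    # 8. leaving the with block closes and saves the file

scrape_to_csv(sample_html, 'ScrapedData.csv')
```

With a live page you would pass `requests.get(url).content` as the `html` argument instead of `sample_html`.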

If you want to check if everything worked as it's supposed to, you can import the ScrapedData.csv file as a dataframe using `pd.read_csv('filename.csv')`. Look at the dataframe to see if there's data in the file. Using `df.shape` you can even quickly check if there is as much data in the file as you'd expect. 

`df.shape` will give you the number of rows and columns of the dataframe. A quick way to check if really everything that should be in the CSV file is there.
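
A small sketch of this check. To stay self-contained it reads made-up CSV text from memory through `io.StringIO` (an assumption for illustration); with the real file you would pass the file name to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the scraped file
csv_text = (
    'name,region\n'
    'Plant name,Region\n'
    'Example Plant 1,1\n'
    'Example Plant 2,4\n'
)

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (3, 2): 3 rows and 2 columns
print(df.head())  # the first rows of the dataframe
```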

Note that the headers are in the dataset twice: while scraping we added a header row ourselves, but we also scraped the headers, since they are in the first row of the table and we scraped all table rows...

Now what? 

You can easily delete a row by using ``df.drop(df.index[N])``, which drops the row at position N (counting from 0).

To make sure you get the index number right, why not print the first rows once more? We're in a notebook after all... You can use ``df.head()``

Looking at these first 5 rows, you'll find that you want to delete the row with index number 0. As stated before, you can use ``df.drop``. By default Pandas will create and return a copy of your dataset, and delete the row of your choosing in that copy. This means that the original will still include the dropped row.

Consider this a safety belt when deleting data using Pandas. ;)

To delete the first row in the original dataset, and not in a copy that Pandas returns to you, you'll need to use ``inplace=True``. The full command becomes: ``df.drop(df.index[0], inplace=True)``. 

``inplace=True`` will delete the row in the original dataset, and won't return anything. Try it:
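
The difference between the two ways of dropping can be sketched on a small made-up dataframe (the values and the name `df_copy` are assumptions for illustration):

```python
import pandas as pd

# Made-up dataframe with the scraped header row still in it at index 0
df = pd.DataFrame({
    'name': ['Plant name', 'Example Plant 1', 'Example Plant 2'],
    'region': ['Region', '1', '4'],
})

# Without inplace: Pandas returns a copy, the original keeps all its rows
df_copy = df.drop(df.index[0])
rows_in_copy = len(df_copy)
rows_in_original = len(df)
print(rows_in_copy, rows_in_original)   # 2 3

# With inplace=True: the row is removed from df itself, nothing is returned
df.drop(df.index[0], inplace=True)
print(df.head())
```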

To see that it worked, request the head of the dataframe...

If you want to you can save this cleaned version, by using ``df.to_csv()``...
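
A minimal sketch of saving, using a made-up dataframe and the hypothetical file name `CleanedData.csv`. Passing `index=False` is a choice of this example: it leaves the index numbers out of the file, so they don't come back as an extra column when you read the file again.

```python
import pandas as pd

# Made-up cleaned dataframe
df = pd.DataFrame({
    'name': ['Example Plant 1', 'Example Plant 2'],
    'region': ['1', '4'],
})

# Write the dataframe to a CSV file without the index column
df.to_csv('CleanedData.csv', index=False)
```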

Well done, happy web scraping!