## Scraping basics

Code-free tools for data harvesting are handy in a pinch, but scripts written in Python or another language are more flexible and adaptable. They can also run automatically in the background on a schedule. And you don't have to worry about a service or tool disappearing and taking your hard work with it.

This exercise uses a simple HTML table as an example before trying a live site.

### Import the necessary modules

In [1]:
# we need BeautifulSoup from bs4 and csv
from bs4 import BeautifulSoup
import csv

### Load the HTML file and parse it with `BeautifulSoup`
We can handle this in a combined step. 
```python
with open('/path/to/some/file.html', 'r') as our_file:
    soup = BeautifulSoup(our_file, 'html.parser')
```

In [2]:
# open the sample HTML file and parse it with BeautifulSoup
with open('simple_table.html', 'r') as our_file:
    soup = BeautifulSoup(our_file, 'html.parser')

### Let's see what we're working with
Chances are, you've worked with data types like strings and integers in Python. `BeautifulSoup` makes HTML navigable by parsing it into its own type of object, a tree of tags you can search, instead of leaving it as one long string.

In [3]:
# try printing soup and checking its type
print(soup)
print(type(soup))

<!DOCTYPE html>

<html lang="en">
<body>
<table border="1px solid black">
<tr>
<th>Name</th>
<th>Age</th>
<th>Location</th>
</tr>
<tr>
<td>John Smith</td>
<td>42</td>
<td>Miami</td>
</tr>
<tr>
<td>Jane Lindey</td>
<td>70</td>
<td>Fresno</td>
</tr>
<tr>
<td>Beth Green</td>
<td>22</td>
<td>Des Moines</td>
</tr>
<tr>
<td>Paul Johnson</td>
<td>35</td>
<td>Chicago</td>
</tr>
<tr>
<td>Lisa Perez</td>
<td>65</td>
<td>Las Vegas</td>
</tr>
</table>
</body>
</html>
<class 'bs4.BeautifulSoup'>
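
To make the "navigable" idea concrete, here is a minimal sketch. It assumes a tiny inline HTML string (a hypothetical stand-in for `simple_table.html`) and uses attribute access such as `soup.table.tr`, which returns the first matching tag at each step.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for simple_table.html
html = "<html><body><table><tr><th>Name</th></tr><tr><td>John Smith</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Attribute access walks the tree and returns the first matching tag
first_row = soup.table.tr

print(type(first_row))     # a bs4 Tag object, not a plain string
print(first_row.name)      # tr
print(first_row.th.text)   # Name
```

Because each result is itself a tag object, you can keep searching inside it, which is what the rest of this exercise relies on.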


### Target the data

The `BeautifulSoup` package offers many ways to zero in on specific sections of a web page. Two key ones:
```python
soup.find('some HTML tag')
# returns only the FIRST tag (like a "div," "span" or "table") that matches

soup.find_all('some HTML tag')
# returns a list of ALL tags that match
```
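
A short sketch of the difference between the two methods, using a hypothetical two-row table defined inline. It also shows what comes back when nothing matches, which is worth knowing before pointing a script at a live site.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample with two rows of two cells each
html = "<table><tr><td>John Smith</td><td>42</td></tr><tr><td>Jane Lindey</td><td>70</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

first_cell = soup.find("td")         # only the first <td>
all_cells = soup.find_all("td")      # every <td>, in document order
missing = soup.find("span")          # no match: find returns None
no_matches = soup.find_all("span")   # no match: find_all returns an empty list

print(first_cell.text)    # John Smith
print(len(all_cells))     # 4
print(missing)            # None
print(len(no_matches))    # 0
```

Since `find` returns `None` when there is no match, calling `.text` or `.find_all` on its result without checking first will raise an `AttributeError`.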

(`BeautifulSoup` also has [detailed documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) covering the many other ways it can parse and search HTML and XML.)

In [4]:
# snip out the table and pass it to a new variable
tbl = soup.find('table')
print(tbl)

<table border="1px solid black">
<tr>
<th>Name</th>
<th>Age</th>
<th>Location</th>
</tr>
<tr>
<td>John Smith</td>
<td>42</td>
<td>Miami</td>
</tr>
<tr>
<td>Jane Lindey</td>
<td>70</td>
<td>Fresno</td>
</tr>
<tr>
<td>Beth Green</td>
<td>22</td>
<td>Des Moines</td>
</tr>
<tr>
<td>Paul Johnson</td>
<td>35</td>
<td>Chicago</td>
</tr>
<tr>
<td>Lisa Perez</td>
<td>65</td>
<td>Las Vegas</td>
</tr>
</table>


In [5]:
# use the find_all method to create a list of rows in the table
tbl_rows = tbl.find_all('tr')
print(tbl_rows)

[<tr>
<th>Name</th>
<th>Age</th>
<th>Location</th>
</tr>, <tr>
<td>John Smith</td>
<td>42</td>
<td>Miami</td>
</tr>, <tr>
<td>Jane Lindey</td>
<td>70</td>
<td>Fresno</td>
</tr>, <tr>
<td>Beth Green</td>
<td>22</td>
<td>Des Moines</td>
</tr>, <tr>
<td>Paul Johnson</td>
<td>35</td>
<td>Chicago</td>
</tr>, <tr>
<td>Lisa Perez</td>
<td>65</td>
<td>Las Vegas</td>
</tr>]


In [6]:
# isolate the second row and print it
print(tbl_rows[1])

<tr>
<td>John Smith</td>
<td>42</td>
<td>Miami</td>
</tr>


In [7]:
# use .find_all again to generate a list of the row's cells and print it
example_row = tbl_rows[1]
cells = example_row.find_all('td')
print(cells)

[<td>John Smith</td>, <td>42</td>, <td>Miami</td>]


`BeautifulSoup` lets you fetch text from within HTML tags:
```python
soup.text
# returns the text in a tag as a string
```
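
Real pages often pad cell contents with whitespace. As a sketch, assuming a hypothetical cell with line breaks around the name, `.text` returns the whitespace as-is, while `get_text(strip=True)` trims it:

```python
from bs4 import BeautifulSoup

# Hypothetical cell with extra whitespace around the text
html = "<td>\n  John Smith\n</td>"
cell = BeautifulSoup(html, "html.parser").td

raw = cell.text                     # whitespace preserved
clean = cell.get_text(strip=True)   # leading and trailing whitespace removed

print(repr(raw))     # '\n  John Smith\n'
print(repr(clean))   # 'John Smith'
```

The sample table in this exercise has no stray whitespace, so `.text` is enough here.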

In [8]:
# print the first item in cells, then try printing what .text can extract from the same thing
print(cells[0])
print(cells[0].text)

<td>John Smith</td>
John Smith


### Extract the data

OK, now for the tricky part. We need to create the list of rows in the table and then extract the text contents of each cell. We'll set up an empty list beforehand and append each row of extracted data to it as its own list.

In [9]:
# make an empty list to hold extracted data
table_data = []

# loop through the rows, grab each row's header or data cells, and append their text as a list
for row in tbl.find_all('tr'):
    cells = row.find_all(['th', 'td'])
    table_data.append([cells[0].text, cells[1].text, cells[2].text])
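
The loop above assumes every row has exactly three cells. As an alternative sketch (not the approach used in this exercise), a nested list comprehension collects however many cells each row has. The inline HTML is a shortened, hypothetical copy of the sample table.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the sample table (assumption: same structure)
html = """
<table>
<tr><th>Name</th><th>Age</th><th>Location</th></tr>
<tr><td>John Smith</td><td>42</td><td>Miami</td></tr>
<tr><td>Jane Lindey</td><td>70</td><td>Fresno</td></tr>
</table>
"""
tbl = BeautifulSoup(html, "html.parser").find("table")

# One inner list per row, one string per cell, regardless of column count
table_data = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in tbl.find_all("tr")
]

print(table_data)
# [['Name', 'Age', 'Location'], ['John Smith', '42', 'Miami'], ['Jane Lindey', '70', 'Fresno']]
```

This version keeps working if a column is added to the table, at the cost of being a little harder to read than explicit indexes.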

### Let's check our work
Did we get a list of lists like we'd expect?

In [10]:
# print table_data
print(table_data)

[['Name', 'Age', 'Location'], ['John Smith', '42', 'Miami'], ['Jane Lindey', '70', 'Fresno'], ['Beth Green', '22', 'Des Moines'], ['Paul Johnson', '35', 'Chicago'], ['Lisa Perez', '65', 'Las Vegas']]


### Now write the data to CSV
We're going to use a process similar to the one we used to read the HTML file. The basic syntax to know for Python's `csv` module:

```python
writer_obj = csv.writer('some file we opened')
# make a writer object that can move information from your script to a file in CSV form

writer_obj.writerow('some list of strings')
# write a single row

writer_obj.writerows('some list of lists of strings')
# write a bunch of rows
```
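
Here is a runnable sketch of the same calls. It writes to an in-memory `io.StringIO` buffer instead of a file so the result can be printed directly. The "Paul Johnson, Jr." value is made up for illustration, to show that the writer quotes fields containing commas.

```python
import csv
import io

# An in-memory text buffer stands in for an opened file
buffer = io.StringIO()
writer = csv.writer(buffer)

# writerow takes one list of values
writer.writerow(["Name", "Age", "Location"])

# writerows takes a list of lists; the comma in the name is a made-up example
writer.writerows([
    ["John Smith", "42", "Miami"],
    ["Paul Johnson, Jr.", "35", "Chicago"],
])

print(buffer.getvalue())
# Name,Age,Location
# John Smith,42,Miami
# "Paul Johnson, Jr.",35,Chicago
```

Letting the writer handle quoting is the main reason to use `csv` rather than joining strings with commas by hand.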

Check out [the documentation](https://docs.python.org/3.6/library/csv.html) for more examples of how it all works.

In [11]:
# open a file and write our data to it (newline='' keeps csv from adding blank rows on Windows)
with open('simple.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(table_data)
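
One way to check the result is to read the file back with `csv.reader` and compare it to the original list of lists. This sketch assumes a shortened `table_data` and writes to a temporary directory so it does not touch the real `simple.csv`.

```python
import csv
import os
import tempfile

# Shortened sample data (assumption: same shape as table_data above)
table_data = [["Name", "Age", "Location"], ["John Smith", "42", "Miami"]]

# Write to a temporary location rather than the working directory
path = os.path.join(tempfile.mkdtemp(), "simple.csv")
with open(path, "w", newline="") as outfile:
    csv.writer(outfile).writerows(table_data)

# Read the file back; each row comes back as a list of strings
with open(path, newline="") as infile:
    rows_read = list(csv.reader(infile))

print(rows_read == table_data)   # True
```

Note that `csv.reader` always returns strings, so a number like an age would need converting with `int()` before doing arithmetic on it.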