## Scraping basics

Code-free tools for data harvesting are handy in a pinch, but scripts written in Python or another language are more flexible and adaptable. They can run automatically in the background on a schedule, and you don't have to worry about a service or tool disappearing and taking all your hard work with it.

This exercise uses a simple HTML table as an example before trying a live site.

### Import the necessary modules

In [None]:
# we need BeautifulSoup from bs4 and csv


### Load the HTML file and parse it with `BeautifulSoup`
We can handle this in a combined step. 
```python
with open('/path/to/some/file.html', 'r') as our_file:
    soup = BeautifulSoup(our_file, 'html.parser')
```

In [None]:
# open the sample HTML file and parse it with BeautifulSoup


### Let's see what we're working with
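Chances are, you've worked with data types like strings and integers in Python. `BeautifulSoup` makes HTML navigable because it does something a little different: instead of handing back one long string, it parses the markup into a tree of objects that you can search and step through.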
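
As a rough illustration, the sketch below parses a short, made-up HTML string and inspects the result. The `sample_html` snippet is an assumption for demonstration only and is not the exercise file.

```python
from bs4 import BeautifulSoup

# hypothetical snippet standing in for a real page
sample_html = "<html><body><h1>Hello</h1><p>First</p></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")

# soup is a BeautifulSoup object, not a plain string
print(type(soup))

# tags can be reached as attributes of the parsed tree
heading = soup.h1
print(type(heading))
print(heading.name)
```

Printing `soup` itself shows the markup, which can make it look like a string; checking its type shows that it is not.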

In [None]:
# try printing soup and checking its type


### Target the data

The `BeautifulSoup` package offers many ways to zero in on specific sections of a web page. Two of the most useful:
```python
soup.find('some HTML tag')
# returns only the FIRST tag (like a "div," "span" or "table") that matches

soup.find_all('some HTML tag')
# returns a list of ALL tags that match
```

(`BeautifulSoup` also has [detailed documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) covering the many other ways it can parse HTML and XML.)
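
Here is one possible sketch of the two methods working together on a small, invented table. The names and contents of `sample_html` are assumptions for illustration; the exercise file will look different.

```python
from bs4 import BeautifulSoup

# hypothetical table, not the exercise data
sample_html = """
<table>
  <tr><th>Name</th><th>Role</th></tr>
  <tr><td>Ada</td><td>Editor</td></tr>
  <tr><td>Grace</td><td>Reporter</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find returns the first match, or None if nothing matches
table = soup.find("table")
missing = soup.find("div")

# find_all returns a list-like collection of every match
rows = table.find_all("tr")
print(len(rows))

# rows can be indexed like a list; index 1 is the second row
second_row = rows[1]
cells = second_row.find_all("td")
print(cells)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before calling further methods on it when working with pages you don't control.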

In [None]:
# snip out the table and pass it to a new variable


In [None]:
# use the find_all method to create a list of rows in the table


In [None]:
# isolate the second row and print it


In [None]:
# use .find_all again to generate a list of the row's cells and return it


`BeautifulSoup` lets you fetch text from within HTML tags:
```python
soup.text
# returns the text in a tag as a string
```
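
A quick sketch of the difference between a tag and its text, using a made-up table cell (the markup here is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# hypothetical cell with nested markup and stray whitespace
cell = BeautifulSoup("<td> <b>Ada</b> Lovelace </td>", "html.parser").td

# printing the tag shows the HTML, tags and all
print(cell)

# .text returns only the text, including surrounding whitespace
print(repr(cell.text))

# the result is an ordinary string, so string methods like .strip() apply
clean = cell.text.strip()
print(clean)
```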

In [None]:
# print the first item in cells, then try printing what .text can extract from the same thing


### Extract the data

OK, now for the tricky part. We need to create the list of rows in the table and then extract the text contents of each cell. We'll set up an empty list beforehand and append each row of extracted data to it as its own list.
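
The pattern described above might look something like the sketch below. It runs on an invented table rather than the exercise file, and collecting both `th` and `td` cells is one design choice among several, so treat it as a starting point.

```python
from bs4 import BeautifulSoup

# hypothetical table, not the exercise data
sample_html = """
<table>
  <tr><th>Name</th><th>Role</th></tr>
  <tr><td>Ada</td><td>Editor</td></tr>
  <tr><td>Grace</td><td>Reporter</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# empty list to hold one inner list per table row
table_data = []

for row in soup.find("table").find_all("tr"):
    # passing a list of names matches header cells and data cells alike
    cells = row.find_all(["th", "td"])
    # pull the text out of each cell and trim stray whitespace
    table_data.append([cell.text.strip() for cell in cells])

print(table_data)
```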

In [None]:
# make an empty list to hold extracted data

# loop through rows, and then each cell in each row, returning a list of extracted text


### Let's check our work
Did we get a list of lists like we'd expect?

In [None]:
# print table_data


### Now write the data to CSV
We're going to use a process similar to the one we used to read the HTML file. The basic syntax to know for Python's built-in `csv` module (the quoted strings below are placeholders for real objects, not literal arguments):

```python
writer_obj = csv.writer('some file we opened')
# make a writer object that can move information from your script to a file in CSV form

writer_obj.writerow('some list of strings')
# write a single row

writer_obj.writerows('some list of lists of strings')
# write a bunch of rows
```

Check out [the documentation](https://docs.python.org/3.6/library/csv.html) for more examples of how it all works.
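
For reference, a minimal sketch of the whole write step follows. The `table_data` values and the temporary output path are assumptions for illustration; in the exercise you would use your own extracted data and a file name of your choosing.

```python
import csv
import os
import tempfile

# hypothetical list of lists standing in for the scraped data
table_data = [
    ["Name", "Role"],
    ["Ada", "Editor"],
    ["Grace", "Reporter"],
]

# write to a throwaway directory so the sketch doesn't clutter your project
out_path = os.path.join(tempfile.mkdtemp(), "table.csv")

# newline='' lets the csv module control line endings itself
with open(out_path, "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(table_data[0])    # header row
    writer.writerows(table_data[1:])  # everything else

# read the file back to confirm what was written
with open(out_path, newline="") as in_file:
    rows_back = list(csv.reader(in_file))

print(rows_back)
```

Opening the file with `newline=''` follows the recommendation in the `csv` documentation and avoids blank lines between rows on Windows.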

In [None]:
# open a file and write our data to it
