## Scraping multiple pages from a live site

By now we're feeling pretty good about how to structure a Python script that scrapes data out of HTML, and in our [previous exercise](scraper-1.ipynb) we even pulled that HTML straight off the internet. We're going to go one step further here and get our script ready to handle multiple pages in one scrape.

It's pretty common to see websites paginate long lists of data or search results, so this technique will really come in handy. We'll need the same libraries we used before (`requests`, `BeautifulSoup`, and `csv`), and then we'll add Python's built-in `time` to help us be good citizens later on.

In [1]:
# import the Python libraries we need
import requests
from bs4 import BeautifulSoup
import csv
import time

Our target data this time [lives here](https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm), in a big list of FDA Warning Letters. Let's take a second to inspect the HTML, and notice a couple things we'll care about in just a minute.

* The first row of every table is a list of headers, not data. And we probably won't want those same headers sprinkled throughout our CSV at the end.
* Below each table is a set of links to each page in the dataset. There's also a link called "Next" that doesn't work on the very last page. Interesting.
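
To make those two observations concrete, here's a small sketch on a made-up HTML snippet. The markup, company names, and dates are invented for illustration and are not the real FDA page. It also assumes a reasonably recent BeautifulSoup, where `string=` is the current name for the older `text=` argument.

```python
from bs4 import BeautifulSoup

# made-up HTML that mimics the shape described above (not the real FDA markup)
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Date</th></tr>
  <tr><td>Acme Corp</td><td>01/02/2018</td></tr>
  <tr><td>Widget Co</td><td>01/09/2018</td></tr>
</table>
<a href="?Page=2">Next</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# every <tr> in the table, header row included
all_rows = soup.find('table').find_all('tr')

# slicing off the first item leaves only the data rows
data_rows = all_rows[1:]

# any <a> tag that has an href and whose text is exactly "Next"
next_links = soup.find_all('a', href=True, string='Next')

print(len(all_rows), len(data_rows), len(next_links))
```

On this toy snippet we get three rows in total, two of them data rows, and one "Next" link. A last page would give us an empty list of links instead, which is the signal we can test for.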

Now that we have a feel for the HTML, we'll set up just like last time, with `URL` and `HEADERS` variables that we pass into a `get()` request. 

In [2]:
# define the URL we want to scrape
URL = 'https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2018/default.htm'

# define the headers our scraper will pass, so we look like a browser
# https://developers.whatismybrowser.com/useragents/explore/
HEADERS = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0'}

# use requests to fetch that URL
page = requests.get(URL, headers=HEADERS)

# and store the page content in a Python variable
html = page.content

If some of this is starting to feel familiar our third time through, then good! Just wait till you write your 30th scraper, or your 300th. Scripts like this will probably always have a lot in common, with tweaks here and there to target different kinds of HTML tags, or to handle different edge cases. That's not a _bad_ thing either: it means you don't have to think about every single piece each time you write a new scraper. Just copy, paste, and make changes.

In [3]:
# use BeautifulSoup to parse our page
soup = BeautifulSoup(html, 'html.parser')

# make ourselves an empty list to hold data for a CSV
list_of_rows = []

# use BeautifulSoup to find the table in our parsed HTML
table = soup.find('table')

And here's our first big change: We need to set up for handling multiple pages. We still need an empty list to hold our row data, but let's give ourselves a couple new variables too. As we loop through the pages that hold our dataset, we'll need to keep track of what `page_num` we're on, and whether there are `more_pages` left to scrape.

In [4]:
# make ourselves an empty list to hold data for a CSV
list_of_rows = []

# we'll be scraping multiple pages, so start tracking what page we're on
page_num = 1

# and set up our loop to run until we tell it to stop
more_pages = True

Our loop is going to look a little different too. In fact, we'll have a loop inside of a loop, handling the pages we're scraping and then the rows of data on each one of them.

Python's `while` statement will help us out here. It starts a loop that keeps running for as long as its condition is true, so our outer loop will just keep going until we flip a switch and tell it to stop. As long as there are `more_pages`, we'll keep scraping. Inside that loop, we'll have three main pieces of logic:

* The internal loop that we've written before, looking at each `<tr>`, grabbing data, and sticking it in our list for later
* An `if` statement that tests for a link called "Next" on our current page. If it exists, we still have scraping to do! So we should use our `page_num` to concoct a new URL to request, and `time` to pause for a second and give the server a break. If there's no link called "Next," we're on the last page, and we should use ...
* An `else` statement that gets us out of our `while` loop

This part of our script might look more unwieldy than our previous scrapers, but breaking down what we need to do like this maps things out nicely.
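
If it helps to see the skeleton on its own, here's a rough sketch of those three pieces with the web requests swapped out for a pretend site. Everything named here (`BASE_URL`, `FAKE_SITE`, `fake_fetch`) is hypothetical and only stands in for the real requests and parsing. The one-second pause is left out to keep the sketch short.

```python
# a stand-in for the website: each made-up URL maps to (rows, has_next_link)
BASE_URL = 'https://example.com/letters'
FAKE_SITE = {
    BASE_URL: (['row 1', 'row 2'], True),
    BASE_URL + '?Page=2': (['row 3', 'row 4'], True),
    BASE_URL + '?Page=3': (['row 5'], False),
}

def fake_fetch(url):
    """Pretend to request a page; return its rows and whether "Next" exists."""
    return FAKE_SITE[url]

list_of_rows = []
page_num = 1
more_pages = True

# start with the first page, like our initial request
rows, has_next = fake_fetch(BASE_URL)

while more_pages:
    # piece 1: the inner loop that collects rows from the current page
    for row in rows:
        list_of_rows.append(row)

    # piece 2: if there's a "Next" link, build the next URL and fetch it
    if has_next:
        page_num += 1
        next_url = BASE_URL + '?Page=' + str(page_num)
        rows, has_next = fake_fetch(next_url)

    # piece 3: otherwise flip the switch so the while loop ends
    else:
        more_pages = False

print(page_num, list_of_rows)
```

The loop visits all three pretend pages, gathers five rows, and stops once a page reports no "Next" link.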

In [5]:
# start our loop to scrape multiple pages
while more_pages:

    # loop through the rows in our current table using BeautifulSoup
    # (we noticed that the first row holds headers, so we can use Python's slice to skip it)
    for row in table.find_all('tr')[1:]:
        # create an empty list each time through, to hold cell data
        list_of_cells = []

        # loop through each cell in this table row
        for cell in row.find_all('td'):
        
            # grab the text from that cell
            text = cell.text.strip()
            
            # and append it to our list
            list_of_cells.append(text)
        
        # when we're done with this table row, append its data to our list of rows
        list_of_rows.append(list_of_cells)
    
    # look to see if there's a "next page" link on our current page
    if len(soup.find_all('a', href=True, string='Next')) > 0:

        # we have another page! fetch a new table and send it back through the loop
        # adjust the URL we're scraping (in this case, by incrementing the page number)
        page_num += 1
        NEXT_URL = URL + "?Page=" + str(page_num)
        
        # use requests to fetch our new URL
        page = requests.get(NEXT_URL, headers=HEADERS)
        
        # as above, get the page content into Python
        # then use BeautifulSoup to parse it and find our table
        html = page.content
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table')
        
        # pause for a second to be kind to the server
        time.sleep(1)

    # if there's no "next page" link, there are no more pages, so drop out of our loop
    else:
        # our `while` loop only runs until `more_pages` is no longer True, so...
        more_pages = False

And because we carried forward some code from last time, we have _all_ that data, from multiple pages, tucked nicely into one Python list. So we don't have to change anything about our CSV output other than the filename.

In [6]:
# use Python's CSV library to create our output file
outfile = open('fda_warning_letters.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
outfile.close()
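
Because we skipped the header row on every page, the file we just wrote has no column names at all. If we want them back, one option is to write a single header row before the data. Here's a minimal sketch of that idea: the column names and sample rows are made up, and it writes to an in-memory buffer instead of a real file so it's easy to try out.

```python
import csv
import io

# hypothetical column names; check them against the real table's header row
header = ['Company', 'Date']

# a couple of made-up rows standing in for list_of_rows
sample_rows = [
    ['Acme Corp', '01/02/2018'],
    ['Widget Co', '01/09/2018'],
]

# write to an in-memory buffer here; swap in open(...) to write a real file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)        # the headers go in once, at the top
writer.writerows(sample_rows)  # then all the data rows

print(buffer.getvalue())
```

In the real script, that would mean one extra `writer.writerow(...)` call just before `writer.writerows(list_of_rows)`.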