# Multipage Tables Scrape Demo

You will often encounter data tables that are spread across hundreds, if not thousands, of pages.

As a demo, we're going to scrape a table that runs across several pages of this mock website:

```https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html```

Capturing the target information into a single CSV file requires many of the foundational skills we've covered, including:

- ```delays```
- ```conditional logic```
- ```while loops```
- ```BeautifulSoup```


And we'll explore a few new functional Python methods today.

## Scraping Strategies

- How do we approach this scrape?
- What pattern do we see?
- How do we capture a table on a single page?
- How do we capture a sequence of tables?
- How do we navigate from page 1 to the subsequent pages?

# Let's code!

In [1]:
# import libraries

from bs4 import BeautifulSoup  ## web scraping
import requests ## request the html for one or more pages
import csv ## read or write to csv
import pandas as pd ## pandas to work with data
import re ## regular expressions, used in one of our functions

We want to scrape:

```https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html```

But look at what I have assigned to the url variable:

In [2]:
## How is it different?
url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page{}.html"

## Placeholders

<img src="images/placeholder1.png" style="width:500px;">

## Placeholders

<img src="images/placeholder2.png" style="width:500px;">

## Placeholders

<img src="images/placeholder3.png" style="width:500px;">

## Filling the Placeholder

### We use ```.format()``` to fill values into the ```{}``` placeholder

In [3]:
## here's our base url
base_link = "http://www.example{}.html"
base_link

'http://www.example{}.html'

In [4]:
## Using a ```for loop```
all_urls_fl = []
for url_number in range(0,7):
    print(url_number)
    print(base_link.format(url_number))
    all_urls_fl.append(base_link.format(url_number))
    
# all_urls_fl


0
http://www.example0.html
1
http://www.example1.html
2
http://www.example2.html
3
http://www.example3.html
4
http://www.example4.html
5
http://www.example5.html
6
http://www.example6.html


In [5]:
## using list comprehension
all_urls_lc = [base_link.format(url_number) for url_number in range(1,7)]
all_urls_lc

['http://www.example1.html',
 'http://www.example2.html',
 'http://www.example3.html',
 'http://www.example4.html',
 'http://www.example5.html',
 'http://www.example6.html']

## Using f-strings

In [6]:
burl = "http://www.example"


In [7]:
## Using a ```for loop```
all_urls_fs = []
for url_number in range(1,7):
    temp_url = f"{burl}{url_number}.html"
    all_urls_fs.append(temp_url)
    
all_urls_fs

['http://www.example1.html',
 'http://www.example2.html',
 'http://www.example3.html',
 'http://www.example4.html',
 'http://www.example5.html',
 'http://www.example6.html']
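
Both approaches also accept format specifiers, which helps when a site zero-pads its page numbers. Here is a minimal sketch, assuming a hypothetical URL pattern (`example01.html`, `example02.html`, ...) that is not the pattern our mock site uses:

```python
# Hypothetical zero-padded pattern, for illustration only
padded_template = "http://www.example{:02d}.html"
burl = "http://www.example"

# .format() version: {:02d} pads the number to two digits
urls_format = [padded_template.format(n) for n in range(1, 4)]

# f-string version of the same thing
urls_fstring = [f"{burl}{n:02d}.html" for n in range(1, 4)]

print(urls_format)
print(urls_format == urls_fstring)
```

If the two lists match, either style can be swapped in for the other.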

## Back to our scrape

In [8]:
## let's remind ourselves of url variable's value

url

'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page{}.html'

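## We know we need placeholder values up to ```4```
## Let's create a variable called ```total_pages``` to use as the end of our ```range()```. Because ```range()``` stops one short of its end value, we set it to one more than the last page number.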

In [9]:
## total pages to scrape
total_pages = 5
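
Because ```range()``` leaves out its end value, `range(1, 5)` yields 1 through 4. One way to make that less error-prone is to wrap it in a small helper. The sketch below assumes a hypothetical function name, `build_page_urls`, and assumes the site's last page is page 4:

```python
def build_page_urls(template, last_page, first_page=1):
    """Return one URL per page from first_page through last_page, inclusive."""
    # range() excludes its end value, so add 1 to include last_page
    return [template.format(page) for page in range(first_page, last_page + 1)]

url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page{}.html"
page_urls = build_page_urls(url, last_page=4)

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```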

In [10]:
## Let's write the for loop
## but instead of storing the links in a list, we feed each number directly to our placeholder
## we want to just scrape each page
for url_number in range(0,total_pages):
    link = url.format(url_number)
    print(link)
    site = requests.get(link)
    print(site.status_code)

https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page0.html
404
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
200
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html
200
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html
200
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html
200


# HUGE PROBLEM

### We're hitting the server way too fast. We have to add a delay before we proceed.

In [11]:
## Let's import the required libraries to create a delay
from random import randrange ## picks a random whole number from a range
import time ## lets us pause the script

In [12]:
## Let's run our code again but with appropriate delay

for url_number in range(0,total_pages):
    link = url.format(url_number)
    print(link)
    snooze = randrange(10,25)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)

https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page0.html
snoozing for 15 seconds before scraping next link.
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
snoozing for 22 seconds before scraping next link.
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html
snoozing for 16 seconds before scraping next link.
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html
snoozing for 11 seconds before scraping next link.
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html
snoozing for 18 seconds before scraping next link.
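
The delay steps could also live in a small reusable function. This is only a sketch: the name `polite_pause` and its `sleep` parameter are assumptions for illustration. Passing in a stand-in `sleep` lets us try the function without actually waiting.

```python
import time
from random import randrange

def polite_pause(low=10, high=25, sleep=time.sleep):
    """Sleep for a random whole number of seconds in [low, high) and return it."""
    snooze = randrange(low, high)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    sleep(snooze)
    return snooze

# Demo with a stand-in sleep function so the example finishes instantly
waited = polite_pause(sleep=lambda seconds: None)
print(10 <= waited < 25)
```

In a real scrape you would call `polite_pause()` with no `sleep` argument so that it really pauses.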


## Working Around Errors

When you scrape hundreds of pages, there's a chance that one of the URLs might be a dud.

We can set up an error check to see what kind of response each request gets:

```<Response [200]>``` means the page is accessible.

```<Response [404]>``` means a broken link or no page at that address.

If we try to parse a page that isn't there, the whole script might break, and we'd have to figure out where it broke.

We can make that easier with conditional logic.

In [13]:
total_pages = 6
for url_number in range(0,total_pages):
    link = url.format(url_number)
    try:
        ## the request itself is what can raise an error, so it goes inside try
        site = requests.get(link)
    except requests.exceptions.RequestException as error:
        ## network-level failure (no connection, timeout, bad address)
        print(f"I can't seem to reach {link}: {error}")
        continue

    if site.status_code == 200:
        print(f"got it...scraping page...{link}")
        soup = BeautifulSoup(site.content, "html.parser")
        snooze = randrange(10,25)
        print(f"snoozing for {snooze} seconds before scraping next link.")
        time.sleep(snooze)
    else:
        print(f"oh no! {link} returned:", site.status_code)

oh no! https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page0.html returned: 404
got it...scraping page...https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
snoozing for 18 seconds before scraping next link.
got it...scraping page...https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html
snoozing for 12 seconds before scraping next link.
got it...scraping page...https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html
snoozing for 19 seconds before scraping next link.
got it...scraping page...https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html
snoozing for 14 seconds before scraping next link.
oh no! https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page5.html returned: 404
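
When we don't know the number of pages in advance, a ```while loop``` can keep going until a page stops returning 200. Here is a minimal sketch. To keep it runnable offline, `get_status` is a stand-in that mimics the responses shown above (pages 1 through 4 exist, everything else is a 404); `find_valid_links` and its `max_pages` safety cap are hypothetical names.

```python
url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page{}.html"

# Stand-in for the real site: pages 1-4 exist, everything else is a 404
def get_status(link):
    existing = {url.format(n) for n in range(1, 5)}
    return 200 if link in existing else 404

def find_valid_links(template, get_status, start=1, max_pages=1000):
    """Collect links from page `start` onward until one does not return 200."""
    links = []
    page = start
    while page < start + max_pages:
        link = template.format(page)
        if get_status(link) != 200:
            break  # first missing page: assume we've run out of pages
        links.append(link)
        page += 1
    return links

valid_links = find_valid_links(url, get_status)
print(len(valid_links))
print(valid_links[-1])
```

Against the live site, `get_status` could be replaced with something like `lambda link: requests.get(link).status_code`, with a delay added between requests.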


# Cleaning and Organizing Functions

Before proceeding to the entire scrape, let's activate the functions that will help us clean and organize our scraped data.

In [14]:
# function to lowercase, strip and underscore header labels
def santize_label(label):
    value = label.lower().replace(":", "").strip()
    value = re.sub(r'[^A-Za-z0-9]+', '_', value)
    return value

# function to create a dict of row data
def make_dict_list(animal, weight, animal_type):
    creature = {'animal': animal, 'weight': weight, 'animal_type': animal_type}
    return creature
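
To see what these two functions return, here is a quick standalone check. The functions are repeated so the block runs on its own, and the header text passed in is invented for illustration:

```python
import re

def santize_label(label):
    value = label.lower().replace(":", "").strip()
    value = re.sub(r'[^A-Za-z0-9]+', '_', value)
    return value

def make_dict_list(animal, weight, animal_type):
    return {'animal': animal, 'weight': weight, 'animal_type': animal_type}

# Made-up header labels: lowercased, colon removed, spaces turned into underscores
label_one = santize_label("Animal Type:")
label_two = santize_label("  Weight ")
print(label_one, label_two)

# One row of data packed into a dict
row = make_dict_list("Blue whale", 136000, "Marine")
print(row)
```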

# All in One Step

Because our ```for loop``` cycles through each link and performs several steps on each page (request, parse, extract, store), all of those steps need to live inside the same loop.

In [35]:
total_pages = 5 ## range(1, total_pages) stops one short, so we scrape pages 1 through 4

data_rows = [] # list of dicts that hold row info

for url_number in range(1,total_pages):
    link = url.format(url_number)
    print(url_number)
    try:
        ## the request itself is what can raise an error, so it goes inside try
        site = requests.get(link)
    except requests.exceptions.RequestException as error:
        print(f"I can't seem to reach {link}: {error}")
        continue

    if site.status_code == 200:
        print(f"got it...scraping page...{link}")
        soup = BeautifulSoup(site.content, "html.parser")

        ## find table in soup, then the rows in its body
        table = soup.find("table", class_="full_table")
        rows = table.find("tbody").find_all("tr")

        ## grab each cell of each row into the proper variable
        for row in rows:
            my_row = row.find_all("td")
            print(my_row)

            animal = my_row[0].get_text()
            weight = int(my_row[1].get_text().replace(",",""))
            animal_type = my_row[2].get_text()

            ## USE DICT CONSTRUCTOR to build one record per row
            creatures_dict = dict(
                animal = animal,
                weight = weight,
                animal_type = animal_type
            )

            print(creatures_dict)
            data_rows.append(creatures_dict)

        ## snooze once per page (not once per row), and skip it after the last page
        if url_number != (total_pages-1):
            snooze = randrange(3,6)
            print(f"snoozing for {snooze} seconds before scraping next link.")
            time.sleep(snooze)

    else:
        print(f"oh no! {link} returned:", site.status_code)


1
got it...scraping page...https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
[<td>Blue whale</td>, <td>136,000</td>, <td>Marine</td>]
{'animal': 'Blue whale', 'weight': 136000, 'animal_type': 'Marine'}
[<td>Bowhead whale</td>, <td>100,000</td>, <td>Marine</td>]
{'animal': 'Bowhead whale', 'weight': 100000, 'animal_type': 'Marine'}
[<td>Fin whale</td>, <td>70,000</td>, <td>Marine</td>]
{'animal': 'Fin whale', 'weight': 70000, 'animal_type': 'Marine'}
[<td>Southern right whale</td>, <td>45,000</td>, <td>Marine</td>]
{'animal': 'Southern right whale', 'weight': 45000, 'animal_type': 'Marine'}
[<td>Humpback whale</td>, <td>30,000</td>, <td>Marine</td>]
{'animal': 'Humpback whale', 'weight': 30000, 'animal_type': 'Marine'}
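
The weight step above converts text like `136,000` with `int()`, which raises a `ValueError` if a cell is blank or not a number. A more forgiving version might look like the sketch below. The name `parse_weight` and the choice to return `None` for unreadable cells are assumptions, not part of the original scrape:

```python
def parse_weight(text):
    """Turn a cell like '136,000' into an int; return None if it isn't a number."""
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return None

print(parse_weight("136,000"))
print(parse_weight(" 7,500 "))
print(parse_weight("unknown"))
```

Rows whose weight comes back as `None` can then be reviewed by hand instead of stopping the whole scrape.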

In [25]:
data_rows

[{'animal': 'Blue whale', 'weight': 136000, 'animal_type': 'Marine'},
 {'animal': 'Bowhead whale', 'weight': 100000, 'animal_type': 'Marine'},
 {'animal': 'Fin whale', 'weight': 70000, 'animal_type': 'Marine'},
 {'animal': 'Southern right whale', 'weight': 45000, 'animal_type': 'Marine'},
 {'animal': 'Humpback whale', 'weight': 30000, 'animal_type': 'Marine'},
 {'animal': 'Gray whale', 'weight': 28500, 'animal_type': 'Marine'},
 {'animal': 'Northern right whale', 'weight': 23000, 'animal_type': 'Marine'},
 {'animal': 'Sei whale', 'weight': 20000, 'animal_type': 'Marine'},
 {'animal': "Bryde's whale", 'weight': 16000, 'animal_type': 'Marine'},
 {'animal': "Baird's beaked whale", 'weight': 11380, 'animal_type': 'Marine'},
 {'animal': 'Minke whale ', 'weight': 7500, 'animal_type': 'Marine'},
 {'animal': 'Northern bottlenose whale',
  'weight': 6500,
  'animal_type': 'Marine'},
 {'animal': "Gervais's beaked whale", 'weight': 5600, 'animal_type': 'Marine'},
 {'animal': 'African elephant', '

# Export to a CSV file

In [None]:
## use pandas to write to csv file
filename = "heaviest-animals.csv"
df = pd.DataFrame(data_rows) ## one row per dict, one column per key
df.to_csv(filename, encoding='utf-8', index=False)

print(f"{filename} is in your project folder!")
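
pandas isn't the only option here. The ```csv``` module we imported at the top can also write a list of dicts. The sketch below writes to an in-memory buffer so it leaves no file behind; the two sample rows are copied from the output above, and `sample_rows` is a stand-in for the full `data_rows` list:

```python
import csv
import io

sample_rows = [
    {'animal': 'Blue whale', 'weight': 136000, 'animal_type': 'Marine'},
    {'animal': 'Bowhead whale', 'weight': 100000, 'animal_type': 'Marine'},
]

# For a real file, swap the buffer for:
# open("heaviest-animals.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['animal', 'weight', 'animal_type'])
writer.writeheader()            # first line: the column names
writer.writerows(sample_rows)   # one line per dict

csv_text = buffer.getvalue()
print(csv_text)
```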

## Not needed

In [None]:
## not needed~!
# find column headers
header = table.find("thead").find("tr")
print(header.prettify())
labels = []
for column_headers in header.find_all("th", class_ ="table-head"):
    column_header = santize_label(column_headers.get_text())
    labels.append(column_header)
labels.append('link')
print(labels)

In [None]:
all_urls_lc = [base_link.format(url_number) for url_number in range(1,total_pages)]
all_urls_lc