# Multipage Tables Scrape Demo

You'll often encounter data and tables spread across hundreds, if not thousands, of pages.

As a demo, we're going to scrape a table that runs across several pages of this mock website:

```https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html```

Capturing your target information in a single CSV file will require many of the foundational skills we've covered, including:

- ```delays```
- ```conditional logic```
- ```while loops```
- ```BeautifulSoup```


And we'll explore a few new functional Python methods today.

## Scraping Strategies

- How do we approach this scrape?
- What pattern do we see?
- How do we capture a table on a single page?
- How do we capture a sequence of tables?
- How do we navigate from page 1 to the subsequent pages?

# Let's code!

## pip install icecream for debugging

In [1]:
pip install icecream

Note: you may need to restart the kernel to use updated packages.


In [2]:
# import libraries

from bs4 import BeautifulSoup  ## web scraping
import requests ## request html for a page(s)
import pandas as pd ## pandas to work with data
from icecream import ic

## Single Table Scrape

In [4]:
##scrape url website
url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html"
response = requests.get(url)
ic(response.status_code)

ic| response.status_code: 200


200

In [5]:
type(response)

requests.models.Response

In [8]:
## page content type
ic(type(response.text))

ic| type(response.text): <class 'str'>


str

## Since ```response.text``` returns a ```str```, Pandas can read the tables directly and we don't need to use ```BeautifulSoup```.


In [17]:
## use Pandas to read tables on page
df = pd.read_html(response.text)
df[0]

Unnamed: 0,Animal,Weight(kg),Type
0,Blue whale,136000,Marine
1,Bowhead whale,100000,Marine
2,Fin whale,70000,Marine
3,Southern right whale,45000,Marine
4,Humpback whale,30000,Marine


In [18]:
type(df)

list

In [19]:
type(df[0])

pandas.core.frame.DataFrame
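
Why does ```pd.read_html``` hand back a ```list```? It returns one DataFrame per ```<table>``` it finds, even when there is only one. The sketch below illustrates this offline with a small, hypothetical HTML string standing in for ```response.text``` (the two tiny tables reuse rows from the demo site). It also wraps the string in ```StringIO```, since recent pandas versions warn when raw HTML text is passed directly. It assumes an HTML parser such as ```lxml``` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for response.text: one page holding two tables
html = """
<table>
  <thead><tr><th>Animal</th><th>Weight(kg)</th></tr></thead>
  <tbody><tr><td>Blue whale</td><td>136000</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Animal</th><th>Weight(kg)</th></tr></thead>
  <tbody><tr><td>Gray whale</td><td>28500</td></tr></tbody>
</table>
"""

# read_html always returns a list: one DataFrame per <table> on the page
tables = pd.read_html(StringIO(html))
print(len(tables))  # 2

# Pick the table you want by its position in the list
first_table = tables[0]

# Or keep only the tables whose text matches a string or regex
gray_only = pd.read_html(StringIO(html), match="Gray whale")
```

On a page with many tables, ```match``` can save you from guessing the right index.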

In [None]:
## Do we want the first table?
# type(df)


In [20]:
## store it into a copy called animals_df
animals_df = df[0].copy()
animals_df

Unnamed: 0,Animal,Weight(kg),Type
0,Blue whale,136000,Marine
1,Bowhead whale,100000,Marine
2,Fin whale,70000,Marine
3,Southern right whale,45000,Marine
4,Humpback whale,30000,Marine


## But we want to scrape multiple pages
Two ways to build the list of URLs that we have to navigate to:

1. Placeholders
2. f-strings

In [None]:
## Never do this manually
## (and avoid naming a variable list, which shadows Python's built-in list type)
urls = ["https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html",
        "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html",
        "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html",
        "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html"]

### 1. Placeholders

In [None]:
## How is it different?
url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page{}.html"

## Placeholders

<img src="https://raw.githubusercontent.com/sandeepmj/scrape-example-page/master/images/placeholder1.png" style="width:500px;">

<img src="https://raw.githubusercontent.com/sandeepmj/scrape-example-page/master/images/placeholder2.png" style="width:500px;">

<img src="https://raw.githubusercontent.com/sandeepmj/scrape-example-page/master/images/placeholder3.png" style="width:500px;">

## Filling the Placeholder

### We use ```.format()``` to fill in values into the ```{}``` placeholder

In [21]:
## here's our base url
base_url = "https://www.example{}.html"
ic(base_url)

ic| base_url: 'https://www.example{}.html'


'https://www.example{}.html'

In [23]:
## using a for loop
all_urls_fl = []

for url_number in range(1,7):
#     ic(base_url.format(url_number))
    all_urls_fl.append(base_url.format(url_number))

ic(all_urls_fl)

ic| all_urls_fl: ['https://www.example1.html',
                  'https://www.example2.html',
                  'https://www.example3.html',
                  'https://www.example4.html',
                  'https://www.example5.html',
                  'https://www.example6.html']


['https://www.example1.html',
 'https://www.example2.html',
 'https://www.example3.html',
 'https://www.example4.html',
 'https://www.example5.html',
 'https://www.example6.html']

In [24]:
## using list comprehension
all_urls_lc = [base_url.format(url_number) for url_number in range(1,7)]
all_urls_lc

['https://www.example1.html',
 'https://www.example2.html',
 'https://www.example3.html',
 'https://www.example4.html',
 'https://www.example5.html',
 'https://www.example6.html']

### 2. Using f-strings

In [26]:
## base url of site to scrape
base_url = "https://www.example"

In [27]:
## using a for loop
fs_fl = []
for url_number in range(1,7):
    fs_fl.append(f"{base_url}{url_number}.html")
    
ic(fs_fl)

ic| fs_fl: ['https://www.example1.html',
            'https://www.example2.html',
            'https://www.example3.html',
            'https://www.example4.html',
            'https://www.example5.html',
            'https://www.example6.html']


['https://www.example1.html',
 'https://www.example2.html',
 'https://www.example3.html',
 'https://www.example4.html',
 'https://www.example5.html',
 'https://www.example6.html']

In [28]:
## using list comprehension

url_lc = [f"{base_url}{url_number}.html" for url_number in range(1,7)]
url_lc

['https://www.example1.html',
 'https://www.example2.html',
 'https://www.example3.html',
 'https://www.example4.html',
 'https://www.example5.html',
 'https://www.example6.html']
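
As a quick check, here is a minimal sketch confirming that the two approaches build identical lists. It reuses the hypothetical ```https://www.example``` address from the cells above. The last lines show an assumed variation (not needed for our demo site) for sites that zero-pad their page numbers.

```python
# Hypothetical addresses, matching the example URLs used above
template = "https://www.example{}.html"
base = "https://www.example"

# 1. Placeholder filled with .format()
with_format = [template.format(n) for n in range(1, 7)]

# 2. f-string
with_fstring = [f"{base}{n}.html" for n in range(1, 7)]

print(with_format == with_fstring)  # True

# range() stops before its second argument, so range(1, 7) yields 1 through 6
print(list(range(1, 7)))  # [1, 2, 3, 4, 5, 6]

# Assumed variation: a site that numbers its pages 01, 02, 03, ...
padded = [f"{base}{n:02d}.html" for n in range(1, 4)]
print(padded[0])  # https://www.example01.html
```

Which one you pick is mostly a matter of taste: ```.format()``` lets you define the template once and fill it later, while an f-string fills in the value right where it is written.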

## Back to our scrape

In [29]:
## let's remind ourselves of url variable's value

url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page{}.html"

## We know we need placeholder values up to ```4```
## Let's create a variable called ```total_pages``` to match the number of pages on the site.

In [31]:
## total_pages is the stop value for range(), which is excluded:
## total_pages = 3 means we visit pages 1 and 2
total_pages = 3

In [32]:
## generate the urls in a loop; next we'll request each one from the server (are you getting 200?)

for url_number in range(1, total_pages):
    link = url.format(url_number)
    ic(link)


ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html'
ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html'


# We have a problem...

### We're hitting the server way too fast. We have to add a delay before we proceed.

In [33]:
## Let's import the required libraries to create a delay
from random import randrange ## picks a random whole number from a range
import time ## lets us pause with time.sleep()

In [36]:
## Let's run our code again but with appropriate delay

for url_number in range(1, total_pages):
    link = url.format(url_number)
    ic(link)
    print(f"scraping page {link}")
    response = requests.get(link)
    mysnoozer = randrange(4,10)
    print(f"Snoozing for {mysnoozer} seconds before next scrape")
    ic(response.status_code)
    time.sleep(mysnoozer)
    
print(f"done scraping {total_pages - 1} pages")



ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html'
ic| response.status_code: 200


scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
Snoozing for 7 seconds before next scrape


ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html'
ic| response.status_code: 200


scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html
Snoozing for 9 seconds before next scrape
done scraping 2 pages
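
The delay logic in the loop above can also be wrapped in a small helper so every scrape reuses it. This is only a sketch: ```polite_pause``` and its ```sleeper``` parameter are hypothetical names, not part of any library. Note that ```randrange(4, 10)``` picks a whole number from 4 through 9, because the upper bound is excluded just like in ```range()```.

```python
import random
import time


def polite_pause(low=4, high=10, sleeper=time.sleep):
    """Pause for a random whole number of seconds from low up to high - 1.

    sleeper is the function that does the waiting; swapping it out lets us
    try the helper without actually waiting.
    """
    seconds = random.randrange(low, high)  # high itself is never picked
    sleeper(seconds)
    return seconds


# Try it without waiting by passing a do-nothing sleeper
waits = [polite_pause(sleeper=lambda s: None) for _ in range(5)]
print(waits)  # five numbers, each between 4 and 9
```

In a real scrape you would simply call ```polite_pause()``` after each request and let it sleep for real.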


In [None]:
## let's remind ourselves of url variable's value

base_url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page"

## Working Around Errors

When you scrape hundreds of pages, there's a chance that one of the URLs is a dud.

We can set up an error check to see what kind of responses we get:

```<Response [200]>``` means the page is accessible.

```<Response [404]>``` means a broken link or no page at that address.

Without a check, your whole script might break on a dud link and you'll have to figure out where it broke.

We can make that easier with conditional logic.

In [38]:
## CHECK FOR ERROR
total_pages = 8
busted_links = []
df_all = []

for url_number in range(1, total_pages):
    link = url.format(url_number)
    ic(link)
    print(f"scraping page {link}")
    response = requests.get(link)
    if response.status_code == 200:
        df = pd.read_html(response.text)
        ic(df)
        mysnoozer = randrange(4,10)
        print(f"Snoozing for {mysnoozer} seconds before next scrape")
        ic(response.status_code)
        time.sleep(mysnoozer)
    
    else:
        print(f"Oh no... {link} returned {response.status_code}")
        busted_links.append(link)
    
print(f"done scraping {total_pages - 1} pages")

ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html'
ic| df: [                 Animal  Weight(kg)    Type
        0            Blue whale      136000  Marine
        1         Bowhead whale      100000  Marine
        2             Fin whale       70000  Marine
        3  Southern right whale       45000  Marine
        4        Humpback whale       30000  Marine]


scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
Snoozing for 8 seconds before next scrape


ic| response.status_code: 200
ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html'
ic| df: [                 Animal  Weight(kg)    Type
        0            Gray whale       28500  Marine
        1  Northern right whale       23000  Marine
        2             Sei whale       20000  Marine
        3         Bryde's whale       16000  Marine
        4  Baird's beaked whale       11380  Marine]
ic| response.status_code: 200


scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html
Snoozing for 6 seconds before next scrape


ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html'
ic| df: [                      Animal  Weight(kg)         Type
        0                Minke whale        7500       Marine
        1  Northern bottlenose whale        6500       Marine
        2     Gervais's beaked whale        5600       Marine
        3           African elephant        4800  Terrestrial
        4               Killer whale        3988       Marine]
ic| response.status_code: 200


scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html
Snoozing for 4 seconds before next scrape


ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html'
ic| df: [                     Animal  Weight(kg)         Type
        0              Hippopotamus        3750  Terrestrial
        1            Asian elephant        3178  Terrestrial
        2     Cuvier's beaked whale        2701       Marine
        3  Short-finned pilot whale        2200       Marine
        4          White rhinoceros        2175  Terrestrial]
ic| response.status_code: 200


scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html
Snoozing for 9 seconds before next scrape


ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page5.html'
ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page6.html'


scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page5.html
Oh no... https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page5.html returned 404
scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page6.html


ic| link: 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page7.html'


Oh no... https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page6.html returned 404
scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page7.html
Oh no... https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page7.html returned 404
done scraping 7 pages


In [39]:
## show broken links
busted_links

['https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page5.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page6.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page7.html']

# All in One Step

Because our ```for loop``` cycles through each link and performs several steps on it (request the page, read the table, store it, pause), all of those steps need to happen inside the same loop.



In [40]:
## Combined url timed nav with table scrape

total_pages = 5
busted_links = []
df_all = []

for url_number in range(1, total_pages):
    link = url.format(url_number)
#     ic(link)
    print(f"scraping page {link}")
    response = requests.get(link)
    if response.status_code == 200:
        df_list = pd.read_html(response.text)
        df_all.append(df_list[0])
#         ic(df)
        mysnoozer = randrange(4,10)
        print(f"Snoozing for {mysnoozer} seconds before next scrape")
#         ic(response.status_code)
        time.sleep(mysnoozer)
    
    else:
        print(f"Oh no... {link} returned {response.status_code}")
        busted_links.append(link)
    
print(f"done scraping {total_pages - 1} pages")

scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
Snoozing for 5 seconds before next scrape
scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html
Snoozing for 5 seconds before next scrape
scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html
Snoozing for 6 seconds before next scrape
scraping page https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html
Snoozing for 8 seconds before next scrape
done scraping 4 pages


In [41]:
df_all

[                 Animal  Weight(kg)    Type
 0            Blue whale      136000  Marine
 1         Bowhead whale      100000  Marine
 2             Fin whale       70000  Marine
 3  Southern right whale       45000  Marine
 4        Humpback whale       30000  Marine,
                  Animal  Weight(kg)    Type
 0            Gray whale       28500  Marine
 1  Northern right whale       23000  Marine
 2             Sei whale       20000  Marine
 3         Bryde's whale       16000  Marine
 4  Baird's beaked whale       11380  Marine,
                       Animal  Weight(kg)         Type
 0                Minke whale        7500       Marine
 1  Northern bottlenose whale        6500       Marine
 2     Gervais's beaked whale        5600       Marine
 3           African elephant        4800  Terrestrial
 4               Killer whale        3988       Marine,
                      Animal  Weight(kg)         Type
 0              Hippopotamus        3750  Terrestrial
 1            Asian

In [45]:
df_all[3]

Unnamed: 0,Animal,Weight(kg),Type
0,Hippopotamus,3750,Terrestrial
1,Asian elephant,3178,Terrestrial
2,Cuvier's beaked whale,2701,Marine
3,Short-finned pilot whale,2200,Marine
4,White rhinoceros,2175,Terrestrial


In [51]:
## FUNCTION to combine the dataframes in a list and save them as a single csv
def combine_tables(list_name,filename):
  '''
  Combines a list of dataframes into one dataframe and saves it as a CSV.
  Tables must have identical column headers in the same order.
  Arguments: the list of dataframes and the CSV filename you want (in quotes as a string).
  Returns the combined dataframe.
  '''
  df = pd.concat(list_name) ## join/concat all the dataframes into one dataframe
  df.to_csv(filename, encoding='utf-8', index=False) ## convert that single dataframe into a csv
  print(f"{filename} is in your current folder!")
  return df
    

In [52]:
## CALL THE FUNCTION
df = combine_tables(df_all, "animals1.csv")

animals1.csv is in your current folder!


In [54]:
df

Unnamed: 0,Animal,Weight(kg),Type
0,Blue whale,136000,Marine
1,Bowhead whale,100000,Marine
2,Fin whale,70000,Marine
3,Southern right whale,45000,Marine
4,Humpback whale,30000,Marine
0,Gray whale,28500,Marine
1,Northern right whale,23000,Marine
2,Sei whale,20000,Marine
3,Bryde's whale,16000,Marine
4,Baird's beaked whale,11380,Marine
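
In the combined table above, the row labels restart at 0 for every page, because ```pd.concat``` keeps each dataframe's own index by default. If you would rather have one continuous index, ```ignore_index=True``` renumbers the rows. The sketch below shows the difference on two tiny hand-built dataframes (```page1``` and ```page2``` are hypothetical stand-ins for the scraped tables, reusing rows from the demo site).

```python
import pandas as pd

# Hypothetical stand-ins for two scraped pages
page1 = pd.DataFrame({"Animal": ["Blue whale", "Bowhead whale"],
                      "Weight(kg)": [136000, 100000]})
page2 = pd.DataFrame({"Animal": ["Gray whale", "Northern right whale"],
                      "Weight(kg)": [28500, 23000]})

# Default: each dataframe keeps its own row labels
kept = pd.concat([page1, page2])
print(list(kept.index))  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 to the end
renumbered = pd.concat([page1, page2], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Since ```combine_tables``` saves the CSV with ```index=False```, the repeated labels never reach the file. They only matter if you keep working with the returned dataframe, for example when selecting rows with ```.loc```.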
