# Multi-page Tables Scrape Demo

You're often going to encounter data and tables spread across hundreds if not thousands of pages.

We might want to, for example, compile details about all the doctors [on this site](https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action) and export them to a ```dataframe``` and a ```.csv``` file.

#### Today in class

As a demo, we're going to scrape a table that runs across several pages on [this mock website](https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html).

```https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html```

Capturing your target information into a single CSV file will require many of the foundational skills we've covered, including:

- ```delays```
- ```conditional logic```
- ```for loops```


And we'll explore a few new functional Python methods today.

## Scraping Strategies

- How do we approach this scrape?
- What pattern do we see?
- How do we capture a table on a single page?
- How do we capture a sequence of tables?
- How do we navigate from page 1 to the subsequent pages?

# Let's code!

In [1]:
# import libraries
import pandas as pd
import requests

## Single Table Scrape

In [3]:
##scrape url website

url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html"
response = requests.get(url)
response.status_code

200

In [4]:
## page content type
type(response.content)

bytes

In [5]:
## page text type
type(response.text)

str
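
The difference between the two types is easier to see on a small example. This is a minimal sketch: the byte string below is a made-up stand-in for `response.content`, not the real page, and it assumes the server sent UTF-8.

```python
# Hypothetical stand-in for response.content: raw bytes as a server would send them.
raw_bytes = "<td>Bryde's whale</td>".encode("utf-8")

# Decoding turns the bytes into a str, which is roughly what response.text
# does for us (requests chooses the encoding; here we assume UTF-8).
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes))
print(type(decoded_text))
print(decoded_text)
```

For table scraping we mostly care about the decoded string, since that is the HTML we hand to Pandas.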

In [6]:
response.text

'<html lang="en">\n\n\t<head>\n\n\t\t<!-- Makes the page responsive and scaled to be read easily -->\n\t\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\n\t\t<!-- Links to stylesheet -->\n\t\t<link rel="stylesheet" type="text/css" href="style.css">\n\t\t<!-- Remember to update page title -->\n\t\t<title>Heaviest Animals</title>\n\n\t</head>\n\n\t<body>\n\t\t<!-- All content goes here -->\n\n\t\t<div class="container">\n\t\t\t<section id="multi_table">\n\t\t\t\t<table class="full_table">\n\t\t\t\t\t<h3>20 Heaviest Animals in the World</h3>\n\t\t\t\t\t<h4>Showing 1-5 of 20</h4>\n\t\t\t\t\t<thead>\n\t\t\t\t\t\t<tr>\n\t\t\t\t\t\t\t<th class="table-head">Animal</th>\n\t\t\t\t\t\t\t<th class="table-head">Weight(kg)</th>\n\t\t\t\t\t\t\t<th class="table-head">Type</th>\n\t\t\t\t\t\t</tr>\n\t\t\t\t\t</thead>\n\t\t\t\t\t<tbody>\n\t\t\t\t\t\t<tr>\n\t\t\t\t\t\t\t<td>Blue whale</td>\n\t\t\t\t\t\t\t<td>136,000</td>\n\t\t\t\t\t\t\t<td>Marine</td>\n\n\t\t\t\t\t\t</tr>\n\t\t\t\t

## ```Pandas``` captures tables.


In [7]:
## use Pandas to read tables on page
df1 = pd.read_html(response.text)
df1

[                 Animal  Weight(kg)    Type
 0            Blue whale      136000  Marine
 1         Bowhead whale      100000  Marine
 2             Fin whale       70000  Marine
 3  Southern right whale       45000  Marine
 4        Humpback whale       30000  Marine]

In [8]:
# type
type(df1)

list
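
To see why we get a list, here is a small offline sketch that needs no network. The HTML snippet is a hypothetical two-row table modeled on the demo page, and it assumes the `lxml` parser is installed. Recent pandas releases warn when `read_html` receives a raw HTML string, so the sketch wraps the string in `StringIO`.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table, modeled on the demo page's markup.
sample_html = """
<table>
  <thead>
    <tr><th>Animal</th><th>Weight(kg)</th><th>Type</th></tr>
  </thead>
  <tbody>
    <tr><td>Blue whale</td><td>136,000</td><td>Marine</td></tr>
    <tr><td>Bowhead whale</td><td>100,000</td><td>Marine</td></tr>
  </tbody>
</table>
"""

# Wrapping the string in StringIO hands pandas a file-like object.
tables = pd.read_html(StringIO(sample_html))

print(type(tables))  # a list, even when the page holds a single table
print(len(tables))

# Pick the table we want by its position in the list.
sample_df = tables[0]
print(sample_df)
```

Note that the weights arrive as numbers: by default `read_html` treats the comma as a thousands separator.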

In [9]:
## Do we want the first table?
df = df1[0]
df

                 Animal  Weight(kg)    Type
0            Blue whale      136000  Marine
1         Bowhead whale      100000  Marine
2             Fin whale       70000  Marine
3  Southern right whale       45000  Marine
4        Humpback whale       30000  Marine


In [None]:
## store it into a copy called animals_df
animals_df = df.copy()


## But we want to scrape multiple pages


In [None]:
## Never do this manually
url_list = [
    "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html",
    "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html",
    "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html",
    "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1000000.html"
]

## ```f-strings``` to create our links

In [10]:
## base url of site to scrape
base_url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page"



In [13]:
## Using a ```for loop```

urls_fl = []
for url_number in range(1,6):
#     print(url_number)
#     print(f"{base_url}{url_number}.html")
    urls_fl.append(f"{base_url}{url_number}.html")

In [14]:
urls_fl

['https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page5.html']

In [15]:
## using list comprehension
urls_lc = [f"{base_url}{url_number}.html" for url_number in range(1, 6)]

urls_lc

['https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page5.html']
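
If we expect to reuse this pattern, one option is to wrap the list comprehension in a small function. This is only a sketch: `build_page_urls` and its parameter names are hypothetical helpers, not part of any library.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return one URL per page number, from first_page through last_page."""
    # range() excludes its stop value, so add 1 to include last_page.
    return [f"{base_url}{page}.html" for page in range(first_page, last_page + 1)]


base_url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page"
page_urls = build_page_urls(base_url, 1, 5)

for page_url in page_urls:
    print(page_url)
```

Passing the last page number directly avoids the off-by-one slip that `range(1, 6)` invites.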

## Back to our scrape

Remember: without a pause between requests, our loop would hit this server too fast. We have to add a delay.

In [16]:
## Let's import the required libraries to create a delay

from random import randrange
import time
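
One detail worth checking before we rely on `randrange` for the delay is that its upper bound is excluded. A quick sketch (the sample size of 1000 is arbitrary):

```python
from random import randrange

# randrange(5, 12) picks a whole number from 5 through 11; 12 is never returned.
samples = [randrange(5, 12) for _ in range(1000)]

print(min(samples), max(samples))
```

So a call like `randrange(5, 12)` pauses between 5 and 11 seconds, never 12.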

In [21]:
## first time scrape

total_links = len(urls_lc)
counter = 1

for url in urls_lc:
    print(f"scraping {counter} of {total_links}")
    counter += 1
    response = requests.get(url)
    df = pd.read_html(response.text)[0]
    print(df)
    snoozer = randrange(5,12)
    print(f"snoozing for {snoozer} seconds before next scrape")
    time.sleep(snoozer)

scraping 1 of 5
                 Animal  Weight(kg)    Type
0            Blue whale      136000  Marine
1         Bowhead whale      100000  Marine
2             Fin whale       70000  Marine
3  Southern right whale       45000  Marine
4        Humpback whale       30000  Marine
snoozing for 8 seconds before next scrape
scraping 2 of 5
                 Animal  Weight(kg)    Type
0            Gray whale       28500  Marine
1  Northern right whale       23000  Marine
2             Sei whale       20000  Marine
3         Bryde's whale       16000  Marine
4  Baird's beaked whale       11380  Marine
snoozing for 7 seconds before next scrape
scraping 3 of 5
                      Animal  Weight(kg)         Type
0                Minke whale        7500       Marine
1  Northern bottlenose whale        6500       Marine
2     Gervais's beaked whale        5600       Marine
3           African elephant        4800  Terrestrial
4               Killer whale        3988       Marine
snoozing for 7 s

ValueError: No tables found

In [22]:
## what was our status code when it broke?
response.status_code

404

## Working Around Errors

When you scrape hundreds of pages, there's a chance that one of the URLs is a dud.

We can set up an error check to see what kind of response each request returns:

```<Response [200]>``` means the page is accessible.

```<Response [404]>``` means a broken link or no page at that address.

If a dud goes unhandled, your whole loop breaks and you'll have to figure out where it broke.

We can make that easier with ```Conditional Logic``` or ```Error Exceptions```.
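
As a rough sketch of the conditional-logic route, we can check the status code before trying to parse anything. The `fake_responses` objects below are hypothetical stand-ins that carry only the two attributes we need, so the example runs without touching the network.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for requests responses (url and status_code only).
fake_responses = [
    SimpleNamespace(url="page1.html", status_code=200),
    SimpleNamespace(url="page4.html", status_code=404),
    SimpleNamespace(url="page5.html", status_code=200),
]

good_links = []
busted_links = []

for response in fake_responses:
    if response.status_code == 200:
        # Safe to hand this page to pd.read_html.
        good_links.append(response.url)
    else:
        # Skip parsing and remember the link for later review.
        busted_links.append(response.url)

print(good_links)
print(busted_links)
```

This catches duds that announce themselves with a bad status code. A page that returns 200 but contains no table would still need the exception route.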

### Bypassing exceptions

In [26]:
## deal with exceptions
## hold on to broken links

busted_links = []
df_all = []
total_links = len(urls_lc)
counter = 1

for url in urls_lc:
    print(f"Scraping {counter} of {total_links}")
    counter += 1
    response = requests.get(url)
    try:
        df = pd.read_html(response.text)[0]
    except Exception:  # pd.read_html fails when the page has no table, e.g. a 404 page
        print(f"oh no! {url} returned {response.status_code}")
        busted_links.append(url)
    else:
        df_all.append(df)
        print(df)
    snoozer = randrange(5,7)
    print(f"snoozing for {snoozer} seconds before next scrape")
    time.sleep(snoozer)
    
print("done scraping all provided links")


Scraping 1 of 5
                 Animal  Weight(kg)    Type
0            Blue whale      136000  Marine
1         Bowhead whale      100000  Marine
2             Fin whale       70000  Marine
3  Southern right whale       45000  Marine
4        Humpback whale       30000  Marine
snoozing for 5 seconds before next scrape
Scraping 2 of 5
                 Animal  Weight(kg)    Type
0            Gray whale       28500  Marine
1  Northern right whale       23000  Marine
2             Sei whale       20000  Marine
3         Bryde's whale       16000  Marine
4  Baird's beaked whale       11380  Marine
snoozing for 6 seconds before next scrape
Scraping 3 of 5
                      Animal  Weight(kg)         Type
0                Minke whale        7500       Marine
1  Northern bottlenose whale        6500       Marine
2     Gervais's beaked whale        5600       Marine
3           African elephant        4800  Terrestrial
4               Killer whale        3988       Marine
snoozing for 5 s

In [27]:
## which link broke?
busted_links

['https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html']

In [28]:
## what does df_all hold?
df_all

[                 Animal  Weight(kg)    Type
 0            Blue whale      136000  Marine
 1         Bowhead whale      100000  Marine
 2             Fin whale       70000  Marine
 3  Southern right whale       45000  Marine
 4        Humpback whale       30000  Marine,
                  Animal  Weight(kg)    Type
 0            Gray whale       28500  Marine
 1  Northern right whale       23000  Marine
 2             Sei whale       20000  Marine
 3         Bryde's whale       16000  Marine
 4  Baird's beaked whale       11380  Marine,
                       Animal  Weight(kg)         Type
 0                Minke whale        7500       Marine
 1  Northern bottlenose whale        6500       Marine
 2     Gervais's beaked whale        5600       Marine
 3           African elephant        4800  Terrestrial
 4               Killer whale        3988       Marine,
                      Animal  Weight(kg)         Type
 0              Hippopotamus        3750  Terrestrial
 1            Asian

In [32]:
df_all[3]

                     Animal  Weight(kg)         Type
0              Hippopotamus        3750  Terrestrial
1            Asian elephant        3178  Terrestrial
2     Cuvier's beaked whale        2701       Marine
3  Short-finned pilot whale        2200       Marine
4          White rhinoceros        2175  Terrestrial


In [35]:
## convert to a single df rather than a list of df
df = pd.concat(df_all, ignore_index = True)
df

                 Animal  Weight(kg)    Type
0            Blue whale      136000  Marine
1         Bowhead whale      100000  Marine
2             Fin whale       70000  Marine
3  Southern right whale       45000  Marine
4        Humpback whale       30000  Marine
5            Gray whale       28500  Marine
6  Northern right whale       23000  Marine
7             Sei whale       20000  Marine
8         Bryde's whale       16000  Marine
9  Baird's beaked whale       11380  Marine
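
Why ```ignore_index = True```? Each scraped table starts its own index at 0, so stacking them without it repeats index values. A small sketch with two made-up single-page tables shows the difference:

```python
import pandas as pd

# Two hypothetical single-page tables, each indexed from 0 like our scraped ones.
page_a = pd.DataFrame(
    {"Animal": ["Blue whale", "Bowhead whale"], "Weight(kg)": [136000, 100000]}
)
page_b = pd.DataFrame(
    {"Animal": ["Gray whale", "Sei whale"], "Weight(kg)": [28500, 20000]}
)

kept_index = pd.concat([page_a, page_b])
fresh_index = pd.concat([page_a, page_b], ignore_index=True)

print(kept_index.index.tolist())   # [0, 1, 0, 1]
print(fresh_index.index.tolist())  # [0, 1, 2, 3]
```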


In [36]:
## export to csv
df.to_csv("heaviest_animals.csv", index = False, encoding = "UTF-8")
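
To confirm an export like this round-trips cleanly, one approach is to write to an in-memory buffer and read it back. This is a sketch on a made-up two-row frame, not our scraped data; `StringIO` stands in for the file on disk.

```python
from io import StringIO

import pandas as pd

sample_df = pd.DataFrame(
    {"Animal": ["Blue whale", "Gray whale"], "Weight(kg)": [136000, 28500]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = StringIO()
sample_df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read it back to check the columns; index=False means no extra index column.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
```

Without ```index = False```, the CSV would gain an unnamed first column holding the row index.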