# Challenge – Multi-page Tables Scrape 

You're often going to encounter data and tables spread across hundreds, if not thousands, of pages. We might want, for example, to compile details about all the doctors <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">on this site</a> and export them to a ```dataframe``` and a ```.csv``` file.

But we'll practice first on <a href="https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html">a mock site</a>.

While you could try to jump straight to your AI Assistants, let's brainstorm how you might, in pseudocode, scrape this site.

## <a id='toc1_'></a>[Controlling the flow](#toc0_)

We always need to ensure that our scripts are able to handle various scenarios to return error-free and meaningful outputs. We do this by controlling the flow of our code (and not just how fast it runs).

Here are two main reasons we will control the flow:


### 1. <a id='toc1_1_2_'></a>[Exception Handling: managing unforeseen errors](#toc0_)

> Our scripts might need to interact with external web pages (or documents or other data). We can't account for every single variation we might run into and our code will break down when it doesn't know how to proceed.

> Flow control ensures that when errors happen, the program can handle them gracefully instead of crashing.

### 2. <a id='toc1_1_3_'></a>[Time Delays - pacing the speed of our code.](#toc0_)

> Python can be super fast – often faster than a website can populate its content. We need to slow down our code.










In [1]:
## the zen of python
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### 1. Exception Handling: managing unforeseen errors


In Python, an ```exception``` is an error that occurs during the execution of a script.
If not handled, an exception will stop the program and print an error message (a traceback).

We will run into errors when we scrape webpages, read documents and run natural language analysis. 

We might have to iterate through 10,000 links to scrape the content of each page. If some of those pages aren't structured as our scraper expects, our script will break. We'll get to those more complex cases in the near future. Today, we'll deal with a simpler ```exception```.

In [3]:
## divide 100 by 10
100 / 10

10.0

In [5]:
## divide 100 by 0
100 / 0


ZeroDivisionError: division by zero

### ```try-except``` blocks:

<img src="https://sandeepmj.github.io/image-host/try-except.png" width="550">


In [9]:
## code try-except block here
try:
    num = int(input("enter number: "))
    result = 100 / num
    print(f"100 divided by {num} is {result}")
except ZeroDivisionError:
    print("you can't divide by zero!")



enter number:  0


you can't divide by zero!


### You can have multiple exceptions

In [11]:
## enter a string instead of a number (only ZeroDivisionError is handled here)
try:
    num = int(input("enter number: "))
    result = 100 / num
    print(f"100 divided by {num} is {result}")
except ZeroDivisionError:
    print("you can't divide by zero!")

enter number:  dog


ValueError: invalid literal for int() with base 10: 'dog'

In [13]:
## code try-multiple exceptions here
try:
    num = int(input("enter number: "))
    result = 100 / num
    print(f"100 divided by {num} is {result}")
except ZeroDivisionError:
    print("you can't divide by zero!")
except ValueError:
    print("that's not a whole number!")

enter number:  dog


that's not a whole number!
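Typing input each time makes the two handlers slow to try out. Here is a minimal sketch that wraps the same logic in a function so we can feed it several values at once. The name `divide_100_by` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def divide_100_by(text):
    """Return a message describing 100 divided by the number in `text`."""
    try:
        num = int(text)        # raises ValueError for input like "dog"
        result = 100 / num     # raises ZeroDivisionError when num is 0
        return f"100 divided by {num} is {result}"
    except ZeroDivisionError:
        return "you can't divide by zero!"
    except ValueError:
        return f"{text!r} is not a whole number"


# try a good value, a zero and a word
for entry in ["10", "0", "dog"]:
    print(divide_100_by(entry))
```

Each `except` block only runs for its own error type, so the zero and the word get different messages.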


### ```try-except-finally```

We can tack on a ```finally``` block that executes code regardless of whether an exception was raised or not. It is typically used for cleanup; in our scripts we'll mostly use it to show us how far our code has progressed.

<img src="https://sandeepmj.github.io/image-host/try-except-finally.png" width="550">


In [27]:
## demo here

rent = 1_000
# food = 300
# income = 1_500
income = "$1,500"
food = "300"

try: 
    remaining_balance = income - food - rent
    print(remaining_balance)
except TypeError:
    print("You can't subtract one string from another")
finally:
    print("prices are going up no matter what")


You can't subtract one string from another
prices are going up no matter what
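The `TypeError` above happens because `income` and `food` are text, not numbers. One way to make the arithmetic work is to convert the values first. This is a minimal sketch, assuming the amounts only ever contain digits, `$` and `,`; `to_number` is a hypothetical helper name.

```python
def to_number(value):
    """Convert values like "$1,500" or "300" to an int; pass ints through."""
    if isinstance(value, int):
        return value
    cleaned = value.replace("$", "").replace(",", "")
    return int(cleaned)  # raises ValueError if other characters remain


rent = 1_000
income = "$1,500"
food = "300"

try:
    remaining_balance = to_number(income) - to_number(food) - to_number(rent)
    print(remaining_balance)
except ValueError:
    print("one of the amounts is not a number")
finally:
    print("prices are going up no matter what")
```

The `finally` message still prints, whether or not the conversion succeeds.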


## Totally Unexpected Errors?

You don't even need to know in advance what type of error you might encounter.

You can catch the general ```Exception``` class and log the error, along with its full traceback, using the ```logging``` module.

In [None]:
## code try-except variables here
rent = 1_000
# food = 300
# income = 1_500
income = "$1,500"
food = "300"

try: 
    remaining_balance = income - food - rent
    print(remaining_balance)
except TypeError:
    print("You can't subtract one string from another")
finally:
    print("prices are going up no matter what")

In [29]:
## import logging
import logging

In [33]:
## try with logging
try: 
    remaining_balance = income - food - rent
    print(remaining_balance)
    if remaining_balance < 0:
        print("You are spending more than you earn")
    else: 
        print("you are making do")
except Exception as e:
    logging.exception("something went wrong")
finally:
    print("prices are going up no matter what")

ERROR:root:something went wrong
Traceback (most recent call last):
  File "/var/folders/jg/xjfmqdcj0m1bqs1d7sgv9v480000gp/T/ipykernel_19032/1426120569.py", line 3, in <module>
    remaining_balance = income - food - rent
TypeError: unsupported operand type(s) for -: 'str' and 'str'


prices are going up no matter what
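Because the error is logged instead of crashing the script, a loop can keep going and remember which items failed. Here is a minimal sketch of that idea; the `amounts` list is made-up sample data.

```python
import logging

amounts = ["300", "$1,500", "twelve", 45]  # made-up sample values
cleaned = []
failed = []

for amount in amounts:
    try:
        # strip "$" and "," then convert to an integer
        cleaned.append(int(str(amount).replace("$", "").replace(",", "")))
    except Exception:
        # logging.exception records our message plus the full traceback
        logging.exception("could not convert %r", amount)
        failed.append(amount)

print(cleaned)
print(failed)
```

The bad value ends up in `failed`, and the values after it are still processed.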


### 2. Time Delays

**Delay timers** are critical when scraping data from websites for several reasons. The **two** most important reasons are:

1. Sometimes your scraper clicks on links and must wait for the content to actually populate on the new page. Your script is likely to run faster than a page can load.


2. You don't want your scraper to be mistaken for a hostile attack on a server. You have to slow down the scrapes.

In [35]:
## run this cell to activate the list
URLS = [
    'great-unique-data-1.html',
    'great-unique-data-2.html',
    'great-unique-data-3.html',
    'great-unique-data-4.html',
    'great-unique-data-5.html',
    'great-unique-data-6.html',
    'great-unique-data-7.html',
    'great-unique-data-8.html',
    'great-unique-data-9.html',
    'great-unique-data-10.html',
    'great-unique-data-11.html',
    'great-unique-data-12.html',
    'great-unique-data-13.html',
    'great-unique-data-14.html',
    'great-unique-data-15.html'
]

In [39]:
## import a package that includes a method that works like a stopwatch
import datetime as dt

In [47]:
# time how fast code runs
quantity = len(URLS)
for i, URL in enumerate(URLS, start = 1):
    current_time = dt.datetime.now()
    print(f"scraping {i} of {quantity} at exactly {current_time}: {URL}")

scraping 1 of 15 at exactly 2025-09-15 14:42:42.073271: great-unique-data-1.html
scraping 2 of 15 at exactly 2025-09-15 14:42:42.073884: great-unique-data-2.html
scraping 3 of 15 at exactly 2025-09-15 14:42:42.073904: great-unique-data-3.html
scraping 4 of 15 at exactly 2025-09-15 14:42:42.073919: great-unique-data-4.html
scraping 5 of 15 at exactly 2025-09-15 14:42:42.073931: great-unique-data-5.html
scraping 6 of 15 at exactly 2025-09-15 14:42:42.073944: great-unique-data-6.html
scraping 7 of 15 at exactly 2025-09-15 14:42:42.073956: great-unique-data-7.html
scraping 8 of 15 at exactly 2025-09-15 14:42:42.073968: great-unique-data-8.html
scraping 9 of 15 at exactly 2025-09-15 14:42:42.073981: great-unique-data-9.html
scraping 10 of 15 at exactly 2025-09-15 14:42:42.073994: great-unique-data-10.html
scraping 11 of 15 at exactly 2025-09-15 14:42:42.074007: great-unique-data-11.html
scraping 12 of 15 at exactly 2025-09-15 14:42:42.074020: great-unique-data-12.html
scraping 13 of 15 at e

In [49]:
# time is required. we will use its sleep function
import time

In [51]:
## snooze for 5 seconds between scrapes

quantity = len(URLS)
for i, URL in enumerate(URLS, start = 1):
    current_time = dt.datetime.now()
    print(f"scraping {i} of {quantity} at exactly {current_time}: {URL}")
    print("Snoozing for 5 seconds")
    time.sleep(5)
    

    


scraping 1 of 15 at exactly 2025-09-15 14:45:51.063257: great-unique-data-1.html
Snoozing for 5 seconds
scraping 2 of 15 at exactly 2025-09-15 14:45:56.068689: great-unique-data-2.html
Snoozing for 5 seconds
scraping 3 of 15 at exactly 2025-09-15 14:46:01.070043: great-unique-data-3.html
Snoozing for 5 seconds
scraping 4 of 15 at exactly 2025-09-15 14:46:06.075862: great-unique-data-4.html
Snoozing for 5 seconds
scraping 5 of 15 at exactly 2025-09-15 14:46:11.081624: great-unique-data-5.html
Snoozing for 5 seconds
scraping 6 of 15 at exactly 2025-09-15 14:46:16.087696: great-unique-data-6.html
Snoozing for 5 seconds
scraping 7 of 15 at exactly 2025-09-15 14:46:21.093522: great-unique-data-7.html
Snoozing for 5 seconds
scraping 8 of 15 at exactly 2025-09-15 14:46:26.098740: great-unique-data-8.html
Snoozing for 5 seconds
scraping 9 of 15 at exactly 2025-09-15 14:46:31.103581: great-unique-data-9.html
Snoozing for 5 seconds
scraping 10 of 15 at exactly 2025-09-15 14:46:36.107496: great-u

### Randomize

Software that tracks traffic to a server might grow suspicious about a hit arriving at a fixed interval, such as exactly every 5 seconds.

Let's **randomize** the time between hits by using ```randint``` or ```uniform``` from the ```random``` library.


#### What's the difference?

**Difference**

```uniform``` returns a ```float```.

```randint``` returns an ```int```.

**Syntax**

```uniform(1, 10)``` will return a float between 1 and 10

```randint(1, 10)``` will return an integer between 1 and 10, with both 1 and 10 possible
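To see the difference for yourself, it helps to draw many values rather than one. This is a small sketch that uses a narrow range of 1 to 3 (an arbitrary choice for the demo) so the possible results are easy to read.

```python
from random import randint, uniform

# draw many values so the possible range becomes visible
int_draws = [randint(1, 3) for _ in range(1_000)]
float_draws = [uniform(1, 3) for _ in range(1_000)]

print(sorted(set(int_draws)))               # the distinct whole numbers drawn
print(min(float_draws), max(float_draws))   # floats stay within 1 and 3
```

The integers can only ever be 1, 2 or 3, while the floats land anywhere in between.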


In [53]:
# import randint and uniform from the random library
from random import randint, uniform

In [65]:
## test it randint
randint(5,15)

9

In [67]:
## test it uniform
uniform(5,15)

9.502284622390423

In [69]:
## snooze for a random number of seconds between 8 and 16 seconds between scrapes
quantity = len(URLS)
for i, URL in enumerate(URLS, start = 1):
    current_time = dt.datetime.now()
    print(f"scraping {i} of {quantity} at exactly {current_time}: {URL}")
    seconds = uniform(8,16)
    print(f"Snoozing for {seconds} seconds")
    time.sleep(seconds)

print("All done")


scraping 1 of 15 at exactly 2025-09-15 14:52:54.060847: great-unique-data-1.html
Snoozing for 15.488563043770775 seconds
scraping 2 of 15 at exactly 2025-09-15 14:53:09.554546: great-unique-data-2.html
Snoozing for 13.393059173619518 seconds
scraping 3 of 15 at exactly 2025-09-15 14:53:22.953197: great-unique-data-3.html
Snoozing for 14.495520721653467 seconds
scraping 4 of 15 at exactly 2025-09-15 14:53:37.454095: great-unique-data-4.html
Snoozing for 8.603953255806232 seconds
scraping 5 of 15 at exactly 2025-09-15 14:53:46.062863: great-unique-data-5.html
Snoozing for 11.651167481377932 seconds
scraping 6 of 15 at exactly 2025-09-15 14:53:57.719684: great-unique-data-6.html
Snoozing for 15.943309295423449 seconds
scraping 7 of 15 at exactly 2025-09-15 14:54:13.668883: great-unique-data-7.html
Snoozing for 14.482913498406546 seconds
scraping 8 of 15 at exactly 2025-09-15 14:54:28.153401: great-unique-data-8.html
Snoozing for 12.981483022499598 seconds
scraping 9 of 15 at exactly 2025-
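Since we will pause in every scraper we write, it can help to put the random snooze in a small function. This is a minimal sketch; `polite_pause` is a hypothetical name, and the tiny values in the demo call are only there so it finishes quickly.

```python
import time
from random import uniform


def polite_pause(min_seconds, max_seconds):
    """Sleep for a random duration and return how long the pause was."""
    seconds = uniform(min_seconds, max_seconds)
    print(f"Snoozing for {seconds:.1f} seconds")  # rounded for readability
    time.sleep(seconds)
    return seconds


# short values for the demo; a real scrape would use something
# closer to the 8 to 16 seconds used above
waited = polite_pause(0.1, 0.3)
```

Returning the number of seconds makes it easy to log or total up how long a scrape spent waiting.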

### ```range()``` and ```for loop```

```range()``` is a built-in function that generates a sequence of numbers.

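It takes 2 important parameters:

```range(start_number, end_number)```

The sequence starts at ```start_number``` and stops just before ```end_number```, so ```range(1, 10)``` produces 1 through 9.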

In [71]:
## can't just call range()
range(1,10)

range(1, 10)

### range() is usually paired with ```for loops```

In [73]:
## range(1, 10): print out each generated number (1 through 9)
for number in range(1, 10):
    print(number)

1
2
3
4
5
6
7
8
9
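This is why ```range()``` matters for scraping: many sites number their pages, so we can generate every page address from one base URL. Here is a small sketch using the mock site's address pattern; the list comprehension is just a compact way of writing the for loop.

```python
base_url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page"

# range(1, 6) yields 1, 2, 3, 4, 5 (the end number is excluded)
page_urls = [f"{base_url}{number}.html" for number in range(1, 6)]

for url in page_urls:
    print(url)
```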


In [75]:
# import libraries
import pandas as pd
# randint and uniform generate random numbers
from random import randint, uniform
## import logging
import logging
# time is required. we will use its sleep function
import time


In [81]:
## headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "           
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/124.0.0.0 Safari/537.36"}

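A ```headers``` dictionary only has an effect once it is attached to a request. Here is a minimal sketch of that step using the standard library's `urllib.request` as a stand-in; with the `requests` package the same dictionary would be passed as `requests.get(url, headers=headers)`. Nothing is actually sent in this example.

```python
from urllib.request import Request

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/124.0.0.0 Safari/537.36"}

url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html"

# build the request object; nothing is sent until it is passed to urlopen()
request = Request(url, headers=headers)

# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
```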
In [1]:
# simple scrape example of single page (the first page of the mock site)
url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html"

## read the url; read_html returns a list with one DataFrame per table found
response = pd.read_html(url)

In [101]:
len(response)

1

In [103]:
type(response)

list

In [107]:
response[0]

Unnamed: 0,Animal,Weight(kg),Type
0,Blue whale,136000,Marine
1,Bowhead whale,100000,Marine
2,Fin whale,70000,Marine
3,Southern right whale,45000,Marine
4,Humpback whale,30000,Marine


In [113]:
## full scrape example of multiple pages with error handling and snoozing
base_url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page"
df_list = []
broken_links = []

for number in range(1, 6):
    url = f"{base_url}{number}.html"
    print(f"Scraping page {number}, url: {url}")
    try:
        df = pd.read_html(url)[0]
        df["source_url"] = url
        df_list.append(df)
    except Exception as e:
        print(f"Oh no...encountered an issue: {e} at {url}")
        broken_links.append(url)
    finally:
        snoozer = uniform(5,20)
        print(f"Snoozing for {snoozer} seconds before next scrape")
        time.sleep(snoozer)

print("done scraping all urls")

Scraping page 1, url: https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
Snoozing for 12.300955446535562 seconds before next scrape
Scraping page 2, url: https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html
Snoozing for 17.10772549449547 seconds before next scrape
Scraping page 3, url: https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html
Snoozing for 15.2335010535359 seconds before next scrape
Scraping page 4, url: https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html
Oh no...encountered an issue: No tables found at https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html
Snoozing for 7.596019168329496 seconds before next scrape
Scraping page 5, url: https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page5.html
Snoozing for 16.937485581701893 seconds before next scrape
done scraping all urls
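The same pattern (try, record failures, always snooze) can be packaged as a function we reuse on other sites. This is a sketch under some assumptions: `scrape_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the demo runs without a network connection.

```python
import logging
import time
from random import uniform


def scrape_pages(urls, fetch, pause_range=(5, 20)):
    """Call fetch(url) for each url; return (results, broken_links)."""
    results, broken_links = [], []
    for i, url in enumerate(urls, start=1):
        print(f"Scraping page {i} of {len(urls)}: {url}")
        try:
            results.append(fetch(url))
        except Exception:
            # log the traceback and remember the url for a later look
            logging.exception("problem at %s", url)
            broken_links.append(url)
        finally:
            # pause whether or not the page worked
            time.sleep(uniform(*pause_range))
    return results, broken_links


def fake_fetch(url):
    """Stand-in for a real download; fails on page 4 like the mock site."""
    if "page4" in url:
        raise ValueError("No tables found")
    return f"table from {url}"


urls = [f"heaviest-animals-page{n}.html" for n in range(1, 6)]
results, broken_links = scrape_pages(urls, fake_fetch, pause_range=(0.01, 0.05))
print(len(results), broken_links)
```

For a real scrape, we could pass something like `lambda url: pd.read_html(url)[0]` as `fetch` and keep the default pause range.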


In [None]:
## read_html needs an HTML parser; install html5lib if pandas raises an ImportError asking for it
%pip install html5lib

In [117]:
# count the DataFrames in the list (page 4 failed, so we have 4 of 5)
len(df_list)

4

In [125]:
# concat the list into a single dataframe
df = pd.concat(df_list, ignore_index=True)
df

Unnamed: 0,Animal,Weight(kg),Type,source_url
0,Blue whale,136000,Marine,https://sandeepmj.github.io/scrape-example-pag...
1,Bowhead whale,100000,Marine,https://sandeepmj.github.io/scrape-example-pag...
2,Fin whale,70000,Marine,https://sandeepmj.github.io/scrape-example-pag...
3,Southern right whale,45000,Marine,https://sandeepmj.github.io/scrape-example-pag...
4,Humpback whale,30000,Marine,https://sandeepmj.github.io/scrape-example-pag...
5,Gray whale,28500,Marine,https://sandeepmj.github.io/scrape-example-pag...
6,Northern right whale,23000,Marine,https://sandeepmj.github.io/scrape-example-pag...
7,Sei whale,20000,Marine,https://sandeepmj.github.io/scrape-example-pag...
8,Bryde's whale,16000,Marine,https://sandeepmj.github.io/scrape-example-pag...
9,Baird's beaked whale,11380,Marine,https://sandeepmj.github.io/scrape-example-pag...
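The last step in our plan is exporting the combined table to a ```.csv``` file. Here is a minimal sketch using two tiny stand-in tables so it runs without scraping anything; the file name `heaviest_animals.csv` is a hypothetical choice.

```python
from pathlib import Path

import pandas as pd

# two small stand-in tables; in the notebook these come from pd.read_html
df_list = [
    pd.DataFrame({"Animal": ["Blue whale"], "Weight(kg)": [136000]}),
    pd.DataFrame({"Animal": ["Gray whale"], "Weight(kg)": [28500]}),
]

# stack the tables and renumber the rows from 0
df = pd.concat(df_list, ignore_index=True)

csv_path = Path("heaviest_animals.csv")   # hypothetical file name
df.to_csv(csv_path, index=False)          # index=False leaves out the row numbers

# read the file back to confirm what was written
print(pd.read_csv(csv_path))
```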


## To recap

When we scrape or analyze data or documents, we need to control the flow of our code.

- ```error handling``` will allow you to write a script that doesn't break down when it encounters its first blip, and will even let you collect info on what exactly went wrong.
- ```timers``` help slow down your code so you are not flagged as a hacker!
- ```headers``` so the site thinks you are a human, and not ```code```.
- ```random``` to generate random numbers, so the pauses between requests vary.


In [127]:
import requests

