# Challenge – Multi-page Tables Scrape 

You're often going to encounter data and tables spread across hundreds if not thousands of pages. We might want, for example, to compile details about all the doctors <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">on this site</a> and export them to a ```dataframe``` and a ```.csv``` file.

But we'll practice first on <a href="https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html">a mock site</a>.

While you could try to jump straight to your AI Assistants, let's brainstorm how you might, in pseudocode, scrape this site.

## <a id='toc1_'></a>[Controlling the flow](#toc0_)

We always need to ensure that our scripts are able to handle various scenarios to return error-free and meaningful outputs. We do this by controlling the flow of our code (and not just how fast it runs).

Here are two main reasons we will control the flow:


### 1. <a id='toc1_1_2_'></a>[Exception Handling: managing unforeseen errors](#toc0_)

> Our scripts might need to interact with external web pages (or documents or other data). We can't account for every single variation we might run into and our code will break down when it doesn't know how to proceed.

> Flow control ensures that when errors happen, the program can handle them gracefully instead of crashing.

### 2. <a id='toc1_1_3_'></a>[Time Delays - pacing the speed of our code.](#toc0_)

> Python can be super fast – often faster than a website can populate its content. We need to slow down our code.










In [1]:
## the zen of python
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### 1. Exception Handling: managing unforeseen errors


In Python, an ```exception``` is an error that occurs during the execution of a script.
If not handled, exceptions will stop the program and raise an error message.

We will run into errors when we scrape webpages, read documents and run natural language analysis. 

We might have to iterate through 10,000 links to scrape the content of each page. If some of those pages aren't structured as our scraper expects, our script will break. We'll get to those more complex cases in the near future. Today, we'll deal with a simpler ```exception```.

In [5]:
## divide 100 by 10
100 / 10

10.0

In [7]:
## divide 100 by 0

100/0

ZeroDivisionError: division by zero

### ```try-except``` blocks:

<img src="https://sandeepmj.github.io/image-host/try-except.png" width="550">


In [None]:
## code try-except block here

try:
    num = int(input("enter number: "))
    result = 100 / num
    print(f"100 divided by {num} is {result}")

except ZeroDivisionError:
    print("you can't divide by zero!")

100/0

### You can have multiple exceptions

In [None]:
## divide a number by a string


In [None]:
## code try-multiple exceptions here


### ```try-except-finally```

We tack on a ```finally``` block that executes code regardless of whether an exception was raised or not, usually just to show us how far our code has progressed.

<img src="https://sandeepmj.github.io/image-host/try-except-finally.png" width="550">


In [None]:
## demo here

rent = 1_000
# food = 300
# income = 1_500
income = "$1,500"
food = "300"
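
A possible version of this demo, assuming the goal is to subtract expenses from income. The ```savings``` and ```finished``` names are hypothetical, added only for illustration.

```python
rent = 1_000
income = "$1,500"  # a string, not a number
food = "300"       # also a string

savings = None
finished = False
try:
    # subtracting a number from a string raises a TypeError
    savings = income - rent - food
    print(f"savings: {savings}")
except TypeError:
    print("at least one of these values isn't a number")
finally:
    # runs whether or not the math worked
    finished = True
    print("finished checking the budget")
```

Swap in the commented-out numeric values for ```income``` and ```food``` and the ```try``` block succeeds, but the ```finally``` message still prints.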



## Totally Unexpected Errors?

You actually don't even need to know what type of error you might encounter.

You can log an error, using the ```logging``` package.

In [None]:
## code try-except variables here
rent = 1_000
# food = 300
# income = 1_500
income = "$1,500"
food = "300"



In [None]:
## import logging


In [None]:
## try with logging
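
A rough sketch of catching any error and logging it. It assumes console logging at the ```ERROR``` level is enough; the message format and the ```error_name``` variable are choices made for this example, not something the notebook prescribes.

```python
import logging

# Assumption: print errors to the console; a long scrape might log to a file instead.
logging.basicConfig(level=logging.ERROR, format="%(levelname)s: %(message)s")

rent = 1_000
income = "$1,500"
food = "300"

savings = None
error_name = None
try:
    savings = income - rent - food
except Exception as error:
    # We never named a specific error type; we just record whatever went wrong.
    error_name = type(error).__name__
    logging.error(f"could not compute savings: {error_name}: {error}")
```

Catching the broad ```Exception``` class keeps the script running, and the log line tells us afterward which kind of error it was.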


### 2. Time Delays

**Delay timers** are critical when scraping data from websites for several reasons. The **two** most important reasons are:

1. Sometimes your scraper clicks on links and must wait for the content to actually populate on the new page. Your script is likely to run faster than a page can load.


2. You don't want your scraper to be mistaken for a hostile attack on a server. You have to slow down the scrapes.

In [None]:
## run this cell to activate the list
URLS = [
    'great-unique-data-1.html',
    'great-unique-data-2.html',
    'great-unique-data-3.html',
    'great-unique-data-4.html',
    'great-unique-data-5.html',
    'great-unique-data-6.html',
    'great-unique-data-7.html',
    'great-unique-data-8.html',
    'great-unique-data-9.html',
    'great-unique-data-10.html',
    'great-unique-data-11.html',
    'great-unique-data-12.html',
    'great-unique-data-13.html',
    'great-unique-data-14.html',
    'great-unique-data-15.html'
]

In [None]:
## import a package that includes a function that works like a stopwatch


In [None]:
# time how fast code runs


In [None]:
# time is required. we will use its sleep function


In [None]:
## snooze for 5 seconds between scrapes
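
One way the timing and snoozing could look together. This sketch assumes ```time.perf_counter()``` as the stopwatch (the notebook may use a different function), uses a hypothetical three-item ```sample_urls``` list, and shortens the delay so the demo finishes quickly.

```python
import time

# Hypothetical short list so the demo finishes quickly
sample_urls = [
    "great-unique-data-1.html",
    "great-unique-data-2.html",
    "great-unique-data-3.html",
]
DELAY = 0.2  # shortened for the demo; the cell above calls for 5 seconds

start = time.perf_counter()  # start the stopwatch
for url in sample_urls:
    print(f"scraping {url}")  # placeholder for the real scrape
    time.sleep(DELAY)         # pause before the next hit
elapsed = time.perf_counter() - start
print(f"finished in {elapsed:.2f} seconds")
```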


    

    


### Randomize

Software that tracks traffic to a server might grow suspicious about a hit that arrives at exactly the same interval every time.

Let's **randomize** the time between hits by using ```randint``` or ```uniform``` from the ```random``` library.


#### What's the difference?

**Difference 1**

```uniform``` returns a ```float```.

```randint``` returns an ```integer```. 

**Syntax**

```uniform(1, 10)```  will return a float between 1 and 10

```randint(1, 10)``` will return an integer between 1 and 10, and both 1 and 10 are possible results


In [None]:
# import randint and uniform from the random library


In [None]:
## test it randint


In [None]:
## test it uniform


In [None]:
## snooze for a random number of seconds (between 5 and 12) between scrapes
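
A small sketch of both functions, followed by a randomized snooze. The ```sample_urls``` list and the shortened ```LOW``` and ```HIGH``` bounds are stand-ins chosen so the demo runs fast.

```python
import time
from random import randint, uniform

whole = randint(1, 10)     # an integer; 1 and 10 are both possible
fraction = uniform(1, 10)  # a float between 1 and 10
print(whole, fraction)

# Hypothetical short list so the demo finishes quickly
sample_urls = ["great-unique-data-1.html", "great-unique-data-2.html"]
LOW, HIGH = 0.1, 0.3  # shortened for the demo; the cell above calls for 5 and 12

for url in sample_urls:
    snooze = uniform(LOW, HIGH)  # a different pause on every pass
    print(f"scraping {url}, then snoozing {snooze:.2f} seconds")
    time.sleep(snooze)
```

```uniform``` is used for the snooze here because fractional seconds give more varied gaps than whole numbers; ```randint(5, 12)``` would work too.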



### ```range()``` and ```for loop```

```range()``` is a built-in function that generates a sequence of numbers.

It takes 2 important parameters. The sequence starts at ```start_number``` and stops just before ```end_number```:

```range(start_number, end_number)```

In [None]:
## can't just call range()


### range() must be paired with ```for loops```

In [None]:
## range 1-10 and print out each generated number
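
A quick sketch of both points: calling ```range()``` by itself only gives us a range object, while a ```for loop``` walks through the numbers. The ```collected``` list is added just to keep the results around.

```python
numbers = range(1, 11)
print(numbers)  # prints: range(1, 11)

collected = []
for number in range(1, 11):  # 11 is the stop value, so the last number is 10
    print(number)
    collected.append(number)
```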


In [None]:
# import libraries
import pandas as pd
# import randint and uniform from the random library
from random import randint, uniform
## import logging
import logging
# time is required. we will use its sleep function
import time


In [None]:
## headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "           
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/124.0.0.0 Safari/537.36"}

In [None]:
# simple scrape example of single page


In [None]:
## full scrape example of multiple pages with error handling and snoozing
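
Below is a sketch of how the pieces could fit together, not the notebook's actual scraper. The ```scrape_all``` and ```fake_fetch``` names are hypothetical, and ```fake_fetch``` stands in for real page-scraping code so the example runs without a network connection. The delays are shortened for the demo.

```python
import logging
import time
from random import uniform


def scrape_all(urls, fetch_page, low=5, high=12):
    """Fetch every URL, logging failures and snoozing between hits."""
    results = []
    failed = []
    for url in urls:
        try:
            results.append(fetch_page(url))
        except Exception as error:
            # one bad page should not stop the whole scrape
            logging.error(f"{url} failed: {type(error).__name__}: {error}")
            failed.append(url)
        finally:
            # runs after every page, whether it worked or not
            snooze = uniform(low, high)
            print(f"done with {url}; snoozing {snooze:.2f} seconds")
            time.sleep(snooze)
    return results, failed


def fake_fetch(url):
    """Hypothetical stand-in for a real scraper; it fails on page 2."""
    if url == "great-unique-data-2.html":
        raise ValueError("no table found")
    return {"url": url, "rows": 3}


pages = [f"great-unique-data-{n}.html" for n in range(1, 4)]
results, failed = scrape_all(pages, fake_fetch, low=0.05, high=0.1)
print(f"{len(results)} pages scraped, {len(failed)} failed")
```

Passing the fetching function in as a parameter keeps the flow control (error handling plus snoozing) separate from the scraping itself. If each fetch returned a dataframe, the resulting list could be combined with ```pd.concat```.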


In [None]:
# call the list


In [None]:
# concat the list into a single dataframe


## To recap

When we scrape or analyze data or documents, we need to control the flow of our code.

- ```error handling``` will allow you to write scripts that don't break down at the first blip, and will even let you collect info on what exactly went wrong.
- ```timers``` help slow down your code so you are not flagged as a hacker!
- ```headers``` so the site thinks you are a human, and not ```code```.
- ```random``` to generate random numbers, so the time between hits varies.
