# Today's theme is Failure


# How To Debug Scrapers: Browser Automation

## 1. Spot check the results

Manually inspect the data you just collected. Does it look like what you expect?

Let's look at the first and last page of Zillow that we collected.
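As a rough sketch of a spot check, assuming the collected results are already loaded into a list of dicts (the name `records` and the fields below are hypothetical, not taken from the Zillow scraper), you can pull out the first and last entries along with the total count:

```python
def spot_check(records, n=1):
    """Return the total count plus the first and last `n` records for a manual look."""
    return {
        "count": len(records),
        "first": records[:n],
        "last": records[-n:] if records else [],
    }

# Hypothetical listings standing in for scraped data.
records = [
    {"page": 1, "address": "1 Example St", "price": 500000},
    {"page": 1, "address": "2 Example St", "price": 650000},
    {"page": 20, "address": "3 Example St", "price": None},
]
summary = spot_check(records)
print(summary["count"], summary["first"], summary["last"])
```

A missing value on the last page, like the `None` price here, is the kind of thing a quick look tends to surface.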

## 2. Can't find an element

Maybe something hasn't loaded yet. If that is the case, you can wait for it to show up.

See the example in the [Inspect Element tutorial](https://inspectelement.org/browser_automation.html#step-3-finding-elements-on-page-and-interacting-with-them).

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 20 seconds for the element to become visible before calling `find_element`.
X_seconds = 20
wait = WebDriverWait(driver, timeout=X_seconds)
wait.until(EC.visibility_of_element_located(
    (By.CSS_SELECTOR, '[data-e2e="modal-close-inner-button"]'))
)

# This line only runs once the element is visible; if it never shows up,
# `wait.until` raises a TimeoutException after 20 seconds instead.
close_button = driver.find_element(By.CSS_SELECTOR, '[data-e2e="modal-close-inner-button"]')
close_button

## 3. Look for known issues

For example, a CAPTCHA or an empty result.
- Wait to see if these signs show up.
- Intervene as necessary.
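
One way to watch for these signs is to scan the page HTML for known markers. This is a minimal sketch: the marker strings are assumptions and should be replaced with whatever the target site actually shows, and `detect_known_issue` is a hypothetical helper name.

```python
def detect_known_issue(page_source):
    """Return a label for a known failure sign in the page HTML, or None if none is found."""
    text = page_source.lower()
    # Assumed marker: many CAPTCHA pages mention "captcha" somewhere in the HTML.
    if "captcha" in text:
        return "captcha"
    # Assumed marker: swap in the exact empty-state text the target site uses.
    if "no results found" in text:
        return "empty_result"
    return None

print(detect_known_issue("<div id='px-captcha'></div>"))
print(detect_known_issue("<p>No results found</p>"))
print(detect_known_issue("<ul><li>Listing 1</li></ul>"))
```

With Selenium, you could pass `driver.page_source` to this function after each page load and pause or alert when it returns something other than `None`.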

# Debugging APIs

## 1. Listen to status codes
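The status code will tell you whether your API calls are successful, whether you are being rate limited (a `429` code), and whether you crashed a server (a `5xx` code).

Intervene as necessary. Also add periodic sleeps between requests so you don't overload the server.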
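
Here is a minimal sketch of reacting to status codes with sleeps. The `fetch` argument is a hypothetical callable that returns `(status_code, body)`; it is injected so the example runs without a network connection. The retry counts and sleep times are arbitrary assumptions.

```python
import time

def fetch_with_backoff(fetch, url, max_tries=3, base_sleep=1.0, sleep=time.sleep):
    """Call `fetch(url)` until it returns status 200, sleeping longer after each failure.

    `fetch` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 429 or status >= 500:
            # Rate limited or server trouble: back off before trying again.
            sleep(base_sleep * 2 ** attempt)
            continue
        # Other codes (such as 404) usually mean the request itself is wrong.
        raise RuntimeError(f"Unrecoverable status {status} for {url}")
    raise RuntimeError(f"Gave up on {url} after {max_tries} tries")

# Fake fetcher for demonstration: fails twice, then succeeds.
responses = iter([(429, ""), (503, ""), (200, '{"ok": true}')])
body = fetch_with_backoff(
    lambda url: next(responses), "https://example.com/api", sleep=lambda seconds: None
)
print(body)
```

With the `requests` library, `fetch` could wrap `requests.get(url)` and return `(response.status_code, response.text)`.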

## 2. Spot check

Open the JSON and make sure it looks like what you expect.

## 3. Check for known keys

Programmatically check if the `key` you're expecting is present.
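
A small sketch of that check, using a made-up response and made-up key names:

```python
import json

def missing_keys(record, expected_keys):
    """Return the expected keys that are absent from a parsed JSON object."""
    return [key for key in expected_keys if key not in record]

# Hypothetical API response and key names.
response_text = '{"id": 7, "price": 1200}'
record = json.loads(response_text)
missing = missing_keys(record, ["id", "price", "address"])
if missing:
    print(f"Missing keys: {missing}")
```

Returning the list of missing keys, rather than just `True` or `False`, makes the resulting log message more useful.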

# General notes

## Summarize the data
Check the number of rows per day. This is similar to a dashboard: a sudden drop or spike is a sign that something broke.
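
A minimal sketch of a rows-per-day count, assuming each row carries an ISO-formatted timestamp in a field called `scraped_at` (a hypothetical name):

```python
from collections import Counter

def rows_per_day(rows, date_field="scraped_at"):
    """Count rows by the date portion (YYYY-MM-DD) of an ISO timestamp field."""
    return Counter(row[date_field][:10] for row in rows)

# Hypothetical rows standing in for scraped data.
rows = [
    {"scraped_at": "2024-01-01T09:00:00", "id": 1},
    {"scraped_at": "2024-01-01T17:30:00", "id": 2},
    {"scraped_at": "2024-01-02T09:00:00", "id": 3},
]
counts = rows_per_day(rows)
for day, n in sorted(counts.items()):
    print(day, n)
```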

## Catch and handle exceptions

Monitor the scraper for known issues. Determine automated responses to those issues.

Have you used `try` and `except` statements in Python? Read more about them [here](https://pythonbasics.org/try-except/).

In [10]:
try:
    assert(2 == 3)
except Exception as e:
    print(f"Wrong {e}")

Wrong 


## Keep a log

Get familiar with [logging](https://realpython.com/python-logging/). A log file is basically a place to store `print` statements.
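
A minimal sketch using Python's standard `logging` module. The log location and messages are assumptions; a temporary directory is used here only so the example cleans up easily.

```python
import logging
import os
import tempfile

# Assumed log location; a real scraper would use a stable path such as "scraper.log".
log_path = os.path.join(tempfile.mkdtemp(), "scraper.log")

logger = logging.getLogger("scraper")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# Use these calls wherever you would otherwise reach for `print`.
logger.info("Collected page %d", 1)
logger.warning("CAPTCHA detected on page %d", 2)

with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

Unlike `print`, each line gets a timestamp and a severity level, which helps when you check on a scraper hours later.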

For a quick version: check the last time a directory was modified.
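
The quick version can be sketched with `os.path.getmtime`. The one-hour threshold is an arbitrary assumption, and a temporary directory stands in for the scraper's output folder.

```python
import os
import tempfile
import time

def seconds_since_modified(path):
    """Return how many seconds ago `path` was last modified."""
    return time.time() - os.path.getmtime(path)

# A temporary directory stands in for the scraper's output folder.
output_dir = tempfile.mkdtemp()
age = seconds_since_modified(output_dir)
if age > 60 * 60:
    print("No new output in over an hour; the scraper may be stuck.")
else:
    print(f"Output directory changed {age:.0f} seconds ago.")
```

Note that a directory's modification time changes when files are added to or removed from it, not when an existing file inside it is edited in place.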

# Productionalizing Scrapers

See more from [this presentation](https://docs.google.com/presentation/d/1K5ttTgP1f6ghL06kj6QqyqsGccU_Ttxh1otdx5wWYGo/edit#slide=id.p).

Simple tips:
1. Don't repeat work.
    - Use a structured naming system for outputs, and check whether an output exists first.
2. Break up the work. Make the scraper as simple as possible. For example, a scraper handles one city in Zillow.
    - Paginate, save results. That's all.
    - Another scraper takes the saved HTML, parses it, and inserts it into a database.
3. Keep a schedule.
    - Use `cron` to schedule jobs locally. For example, cron allows an hourly job or one that runs every day at 4:30pm. Read [more](https://ostechnix.com/a-beginners-guide-to-cron-jobs/).
    - Other tools exist to do this on the cloud.
4. Keep tabs on inputs with a TODO list.
    - Use a CSV if you know what you want.
    - Use AWS SQS (similar to a commercial kitchen's ticket system). My fave!
5. Can you scale up?
    - If scrapers are simple, it's easy to parallelize them.
    - If local: use async computing or `multiprocessing`.
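
The first tip, "don't repeat work," can be sketched as follows. The filename pattern and the `fetch_html` callable are hypothetical; the point is that the output path is derived from the inputs, so finished work is easy to detect.

```python
import os
import tempfile

def output_path(out_dir, city, page):
    """Build a predictable filename from the inputs so finished work can be detected."""
    return os.path.join(out_dir, f"{city}_page-{page:03d}.html")

def scrape_once(out_dir, city, page, fetch_html):
    """Fetch and save a page unless its output already exists. Returns True if work was done."""
    path = output_path(out_dir, city, page)
    if os.path.exists(path):
        return False  # Already collected, so skip it.
    with open(path, "w", encoding="utf-8") as f:
        f.write(fetch_html(city, page))
    return True

def fake_fetch(city, page):
    """Hypothetical stand-in for a real browser or HTTP request."""
    return f"<html>{city} {page}</html>"

out_dir = tempfile.mkdtemp()
first = scrape_once(out_dir, "chicago", 1, fake_fetch)
second = scrape_once(out_dir, "chicago", 1, fake_fetch)
print(first, second)
```

Because the second call finds the file on disk, a crashed scraper can be restarted from the top without refetching pages it already saved.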


# Tools
- cron: schedule scripts and scrapers on a local machine.
- htop: view your computer's resources. For example, how many CPUs are being used and how much memory is in use.
- multiprocessing: run a Python function across several processes in parallel.

### Multiprocessing
Check this [gist](https://gist.github.com/yinleon/8b7555afbbeed47e439dbd2364b8d404).

In [12]:
import time
from multiprocessing import Pool

In [20]:
def example_function(n):
    """Sleeps for 5 seconds with an arbitrary input"""
    time.sleep(5)
    print(n)
    return 1

In [14]:
ex_inputs = list(range(30))

In [21]:
data = []
with Pool(processes=8) as pool:
    for record in pool.imap_unordered(example_function, ex_inputs):
        data.append(record)

2
1
6
3
4
5
7
0
8
11
10
9
12
13
14
15
17
16
18
20
21
19
23
22
24
25
27
29
28
26


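Notice that order doesn't matter here: the numbers print in whatever order the workers finish, not in input order. `imap_unordered` likewise yields results as soon as each one completes.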