# Today's theme is Failure


# How To Debug Scrapers: Browser Automation

## 1. Spot check the results

Manually inspect the data you just collected. Does it look like what you expect?

Let's look at the first and last page of Zillow that we collected.

## 2. Can't find an element

Maybe something hasn't loaded yet. If that is the case, you can wait for it to show up.

See the example in the [Inspect Element tutorial](https://inspectelement.org/browser_automation.html#step-3-finding-elements-on-page-and-interacting-with-them).

```
from playwright.async_api import async_playwright, expect

# assume `page` was set up earlier

# wait until the first matching option is visible before using it
xpath_1st_opt = '//li[@role="option"]'
await expect(page.locator(xpath_1st_opt).first).to_be_visible()

# the element is now safe to interact with
first_option = page.locator(xpath_1st_opt).first
first_option
```

## 3. Look for known issues

For example, a CAPTCHA or an empty result.
- Wait to see if these signs show up.
- Intervene as necessary.
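
One way to watch for these signs is to scan the page's HTML for tell-tale text. The sketch below is only an illustration: the marker strings and the function name `detect_known_issue` are assumptions, so swap in whatever the site you are scraping actually shows.

```python
# Hypothetical marker strings; replace them with text that the
# site you are scraping actually displays.
CAPTCHA_MARKERS = ("captcha", "verify you are human")
EMPTY_MARKERS = ("no results found", "no matching results")


def detect_known_issue(html):
    """Return 'captcha', 'empty', or None for a page's HTML."""
    text = html.lower()
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return "captcha"
    if any(marker in text for marker in EMPTY_MARKERS):
        return "empty"
    return None


print(detect_known_issue("<h1>Please complete the CAPTCHA</h1>"))
print(detect_known_issue("<p>No results found</p>"))
print(detect_known_issue("<li>3 bed, 2 bath</li>"))
```

With Playwright, you could pass in `await page.content()` after each page load and pause or stop the scraper when the result is not `None`.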

# Debugging APIs

## 1. Listen to status codes
The status code tells you whether an API call succeeded, and whether you have overloaded (or crashed) the server.

Intervene as necessary. Also add periodic sleeps between requests so you don't hammer the server.
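
A minimal sketch of this idea, assuming you wrap your own request in a function that returns `(status_code, body)`. The name `fetch_with_retries` and its parameters are made up for illustration. It retries on `429` and `5xx` codes with a growing pause, and gives up right away on other errors.

```python
import time


def fetch_with_retries(fetch, max_tries=3, pause=1.0):
    """
    Call `fetch()` until it returns a 2xx status code.

    `fetch` is any function that returns (status_code, body).
    Returns the body on success, or None if we gave up.
    """
    for attempt in range(1, max_tries + 1):
        status_code, body = fetch()
        if 200 <= status_code < 300:
            return body
        # 429 = rate limited, 5xx = server trouble: worth retrying
        retryable = status_code == 429 or status_code >= 500
        if not retryable:
            # other codes usually mean the request itself is wrong
            print(f"Giving up: status {status_code}")
            return None
        if attempt < max_tries:
            # sleep a little longer after each failed attempt
            time.sleep(pause * attempt)
    print(f"Giving up after {max_tries} tries")
    return None


# a fake API that fails once, then succeeds
fake_responses = iter([(503, None), (200, {"results": []})])
print(fetch_with_retries(lambda: next(fake_responses), pause=0.1))
```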

## 2. Spot check

Open the JSON and make sure it looks like what you expect.

## 3. Check for known keys

Programmatically check if the `key` you're expecting is present.
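
For instance, a small helper can report which expected keys are absent. This is a sketch: the response and the key names below are hypothetical.

```python
def missing_keys(record, expected_keys):
    """Return the expected keys that are absent from a dict."""
    return [key for key in expected_keys if key not in record]


# hypothetical API response and key names
response = {"results": [{"price": 450000}], "page": 1}
expected = ["results", "page", "total_pages"]

missing = missing_keys(response, expected)
if missing:
    print(f"Missing keys: {missing}")
```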

# General notes

## Summarize the data
Check the number of rows per day. This is similar to a dashboard: a sudden drop or spike is a sign that something changed.
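
A rough sketch of a rows-per-day count, assuming each record stores the date it was collected. The field name `collected_at` and the records themselves are made up for illustration.

```python
from collections import Counter

# hypothetical records; each one stores the date it was collected
records = [
    {"collected_at": "2024-01-01", "price": 450000},
    {"collected_at": "2024-01-01", "price": 512000},
    {"collected_at": "2024-01-02", "price": 389000},
]

# count how many rows were collected on each day
rows_per_day = Counter(record["collected_at"] for record in records)

for day, n_rows in sorted(rows_per_day.items()):
    print(day, n_rows)
```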

## Catch and handle exceptions

Monitor the scraper for known issues. Decide on automated responses to those issues.

Have you used `try` and `except` statements in Python? Read more about them [here](https://pythonbasics.org/try-except/).

In [1]:
2 == 3

False

In [2]:
try:
    # assert raises an AssertionError when the statement is False
    assert 2 == 3
except AssertionError as e:
    print(f"Wrong {e}")

Wrong 


## Keep a log

Get familiar with [log files](https://realpython.com/python-logging/). A log file is basically a place to store `print` statements.

For a quick version: check the last time a directory was modified.
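
The quick version might look like the sketch below, which uses `os.path.getmtime` from the standard library. The `data/` folder and the helper name `last_modified` are assumptions for illustration.

```python
import os
from datetime import datetime


def last_modified(path):
    """Return when `path` was last modified, as a datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path))


# hypothetical output folder
os.makedirs("data/", exist_ok=True)
print(last_modified("data/"))
```

On most file systems a directory's modification time updates when files are added to or removed from it, so a stale timestamp suggests the scraper has stopped writing new files.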

In [3]:
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(filename='example.log', encoding='utf-8', level=logging.DEBUG)
logger.debug('This message should go to the log file')
logger.info('So should this')
logger.warning('And this, too')

In [4]:
!cat example.log

DEBUG:__main__:This message should go to the log file
INFO:__main__:So should this
DEBUG:parso.python.diff:diff parser start
DEBUG:parso.python.diff:line_lengths old: 1; new: 1
DEBUG:parso.python.diff:-> code[replace] old[1:1] new[1:1]
DEBUG:parso.python.diff:parse_part from 1 to 1 (to 0 in part parser)
DEBUG:parso.python.diff:diff parser end
DEBUG:__main__:This message should go to the log file
INFO:__main__:So should this
DEBUG:__main__:This message should go to the log file
INFO:__main__:So should this
DEBUG:__main__:This message should go to the log file
INFO:__main__:So should this


# Productionalizing Scrapers

See more from [this presentation](https://docs.google.com/presentation/d/1K5ttTgP1f6ghL06kj6QqyqsGccU_Ttxh1otdx5wWYGo/edit#slide=id.p).

Simple tips:
1. Don't repeat work
    - Use a structured naming system for outputs, and check whether an output exists before collecting it again.
2. Keep receipts
    - Save the timestamp (when data was collected) and the raw data.
3. Break up the work. Make each scraper as simple as possible.
    - For example, one scraper handles one city in Zillow: paginate and save the results. That's all.
    - Another scraper takes the saved HTML, parses it, and inserts it into a database.
4. Keep a schedule.
    - Use `cron` to schedule jobs locally. For example, cron allows an hourly job or one that runs every day at 4:30pm. Read [more](https://ostechnix.com/a-beginners-guide-to-cron-jobs/).
    - Other tools exist to do this on the cloud.
5. Keep tabs on inputs with a TODO list.
    - Use a CSV if you know what you want.
    - Use AWS SQS (similar to a commercial kitchen's ticket system). My fave!
6. Can you scale up?
    - If scrapers are simple, it's easy to parallelize them.
    - If local: use async computing or `multiprocessing`.


In [5]:
import os

# make a folder for data
os.makedirs('data/', exist_ok=True)

fn_out = 'data/output.csv'

# check if the file exists
if not os.path.exists(fn_out):
    print("Write the file")
    # do something

else:
    print("File exists")
    # do nothing

Write the file
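
The "keep receipts" tip can be sketched in a similar way: save the raw page under a file name that records when it was collected. The helper `save_receipt`, the `data/raw` folder, and the naming scheme are assumptions, not a fixed convention.

```python
import os
from datetime import datetime, timezone


def save_receipt(raw_html, name, out_dir="data/raw"):
    """Save raw HTML under a file name that includes the collection time."""
    os.makedirs(out_dir, exist_ok=True)
    # UTC timestamp, e.g. 20240101T163000Z
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    fn_out = os.path.join(out_dir, f"{name}_{timestamp}.html")
    with open(fn_out, "w", encoding="utf-8") as f:
        f.write(raw_html)
    return fn_out


# hypothetical page name and contents
fn = save_receipt("<html>example</html>", name="zillow_page_1")
print(fn)
```

Because the raw HTML is kept, a separate parser can be rerun later without collecting the page again.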


# Tools
- cron: schedule scripts and scrapers on a local machine.
- htop: view your computer's resources. For example, how many CPUs are being used and how much memory is in use.
- multiprocessing: run the same function on many inputs in parallel.

## TQDM
A useful progress bar.

In [6]:
!pip install tqdm



In [7]:
import time
from tqdm import tqdm

In [8]:
def example_function(n):
    """
    Sleeps for 1 second, then returns its input unchanged
    """
    time.sleep(1)
    # print(n)
    return n

In [9]:
ex_inputs = list(range(10))

for i in tqdm(ex_inputs):
    example_function(i)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.01s/it]


## Multiprocessing
Check this [gist](https://gist.github.com/yinleon/8b7555afbbeed47e439dbd2364b8d404).

In [10]:
!pip install multiprocess



In [13]:
from multiprocess import Pool

ex_inputs = list(range(20))

data = []
with Pool(processes=4) as pool:
    for record in tqdm(pool.imap_unordered(example_function, 
                                           ex_inputs)):
        data.append(record)

20it [00:05,  3.99it/s]


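Notice that order doesn't matter here: `imap_unordered` yields results as they finish, so `data` may not be in the same order as `ex_inputs`.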
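
If you do need to know which input produced which result, one option is to key each result by its input. The sketch below swaps in the standard library's `concurrent.futures.ThreadPoolExecutor` rather than `multiprocess`, and `slow_double` is a made-up stand-in for a scraper call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_double(n):
    """Stand-in for a scraper call: sleeps briefly, then returns n * 2."""
    time.sleep(0.1)
    return n * 2


inputs = list(range(8))
results = {}

with ThreadPoolExecutor(max_workers=4) as executor:
    # map each future back to the input that produced it
    future_to_input = {executor.submit(slow_double, n): n for n in inputs}
    # as_completed yields futures as they finish, in no fixed order
    for future in as_completed(future_to_input):
        results[future_to_input[future]] = future.result()

# results arrived in any order, but each one is keyed by its input
print([results[n] for n in inputs])
```

Threads suit scrapers that spend most of their time waiting on the network; CPU-heavy parsing is usually a better fit for processes.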