## Configuration
_Initial steps to get the notebook ready to play nice with our repository. Do not delete this section._

Code formatting with [black](https://pypi.org/project/nb-black/).

In [166]:
%load_ext lab_black

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black


Add the parent directory to Python's `sys.path` so we can import Python files from sibling directories such as `utils`.

In [167]:
import os
import sys

In [168]:
this_dir = os.path.abspath("")

In [169]:
parent_dir = os.path.dirname(this_dir)

In [170]:
sys.path.insert(0, parent_dir)
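
The same path setup can be sketched with `pathlib`. This assumes the working directory is the notebook's own folder, which is what `os.path.abspath("")` resolves to. The `add_to_sys_path` helper is hypothetical and not part of the repository; it skips the insert when the entry is already listed, so re-running the cell does not stack duplicates.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Put `directory` at the front of sys.path unless it is already listed."""
    entry = str(Path(directory).resolve())
    if entry not in sys.path:
        sys.path.insert(0, entry)
    return entry


# Assumption: the current working directory is the notebook's folder,
# so its parent is the directory that contains `utils`.
parent_dir = add_to_sys_path(Path.cwd().parent)
```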

Import utilities from the `../utils` module.

In [171]:
from utils import env, reader, writer, cleaners

In [172]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
import unicodedata
from datetime import datetime, date
from slugify import slugify

## Download

Retrieve the page

In [173]:
url = "http://publichealth.lacounty.gov/media/Coronavirus/locations.htm"

In [174]:
page = requests.get(url)

## Parse

In [175]:
soup = BeautifulSoup(page.content, "html.parser")

Get the content well (the `div` with `id="content"`)

In [176]:
content = soup.find("div", {"id": "content"})

Get the table that contains the "CITY/COMMUNITY" header

In [177]:
# `string=` and `find_parent` are the current names for the older `text=` and `findParent`
for tag in content.find_all(string=re.compile("CITY/COMMUNITY")):
    table = tag.find_parent("table")

In [178]:
tbody = soup.tbody

In [179]:
row_list = tbody.find_all("tr")

In [180]:
dict_list = []

In [181]:
def safetxt(element):
    v = element.text.strip()
    v = v.replace("\u200b", "")
    return v

In [182]:
def safenumber(element):
    v = safetxt(element)
    v = v.replace(",", "")
    v = v.replace(" ", "")
    return v
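
A quick illustration of what the two helpers do to a messy cell. `FakeCell` is a hypothetical stand-in for a BeautifulSoup tag (only `.text` is used), and `to_int` is an assumed extra step; the notebook itself keeps the cleaned values as strings.

```python
class FakeCell:
    """Hypothetical stand-in for a BeautifulSoup tag; only `.text` is needed."""

    def __init__(self, text):
        self.text = text


def safetxt(element):
    # Trim surrounding whitespace and drop zero-width spaces (U+200B)
    v = element.text.strip()
    return v.replace("\u200b", "")


def safenumber(element):
    # Remove thousands separators and stray spaces; the result is still a string
    return safetxt(element).replace(",", "").replace(" ", "")


def to_int(element):
    # Hypothetical extra step: convert the cleaned string to an integer
    return int(safenumber(element))


cases = to_int(FakeCell(" 6,823\u200b "))
```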

In [183]:
for row in row_list:
    cell_content = row.find_all("td")
    d = dict(
        county="Los Angeles",
        area=safetxt(cell_content[0]),
        confirmed_cases=safenumber(cell_content[1]),
        confirmed_deaths=safenumber(cell_content[3]),
    )
    dict_list.append(d)
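
The loop above indexes straight into `cell_content`, so a row without enough `td` cells (a header row, for instance) would raise an `IndexError`. A minimal sketch of a more defensive version follows. The sample markup and its column labels are made up for illustration and are not copied from the county's page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real page's layout is assumed, not verified
sample_html = """
<table>
  <tbody>
    <tr><th>CITY/COMMUNITY</th><th>Cases</th><th>Rate</th><th>Deaths</th></tr>
    <tr><td>City of Example</td><td>1,234</td><td>567</td><td>8</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_rows = []
for row in sample_soup.tbody.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 4:
        # Header or malformed row: skip it instead of raising IndexError
        continue
    sample_rows.append(
        dict(
            area=cells[0].text.strip(),
            confirmed_cases=cells[1].text.strip().replace(",", ""),
            confirmed_deaths=cells[3].text.strip().replace(",", ""),
        )
    )
```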

In [184]:
df = pd.DataFrame(dict_list)

Get timestamp

In [185]:
date_url = "http://publichealth.lacounty.gov/media/Coronavirus/js/casecounter.js"

In [186]:
response = requests.get(date_url)
date_page = response.text

In [187]:
date_text = re.search(r"([0-9][0-9]/[0-9][0-9])", date_page).group(1)
date_text = date_text + "/" + str(date.today().year)

In [188]:
latest_date = pd.to_datetime(date_text).date()
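
The source only exposes a month and day, so the year is borrowed from today's date. That can produce a future date around New Year. A minimal sketch of one way to guard against it, using a hypothetical `resolve_date` helper that is not part of the notebook or `utils`:

```python
from datetime import date


def resolve_date(month_day, today):
    """Attach a year to an "MM/DD" string, never returning a future date."""
    month, day = (int(part) for part in month_day.split("/"))
    candidate = date(today.year, month, day)
    if candidate > today:
        # e.g. a "12/31" stamp read on January 1 belongs to the previous year
        candidate = date(today.year - 1, month, day)
    return candidate
```

Calling `resolve_date(date_text, date.today())` on the bare `MM/DD` match would stand in for the string concatenation. A `02/29` stamp read in a non-leap year would still raise a `ValueError`.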

In [189]:
df["county_date"] = latest_date

In [190]:
df.tail(1)

|     | county | area | confirmed_cases | confirmed_deaths | county_date |
| --- | --- | --- | --- | --- | --- |
| 341 | Los Angeles | - Under Investigation | 6823 | 12 | 2020-12-11 |


In [191]:
df.loc[df.area == "-  Under Investigation", "area"] = "Under Investigation"

In [192]:
df.loc[df.area == "- Under Investigation", "area"] = "Under Investigation"
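
The two cells above differ only in the number of spaces after the dash. A sketch of a single normalizer that tolerates any amount of whitespace; `clean_area` is a hypothetical helper, and unlike the cells above it also collapses repeated whitespace in every other area name.

```python
import re


def clean_area(name):
    """Collapse runs of whitespace, then rename the dashed placeholder row."""
    collapsed = re.sub(r"\s+", " ", name).strip()
    if collapsed == "- Under Investigation":
        return "Under Investigation"
    return collapsed
```

It could be applied with `df["area"] = df["area"].map(clean_area)`.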

## Vet

In [193]:
len(df)

342

In [194]:
assert len(df) <= 342, "L.A. County's scraper has extra rows"

In [195]:
assert len(df) >= 342, "L.A. County's scraper is missing rows"
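
Both checks can also be folded into one function that reports the actual count. This is a sketch only; `check_row_count` and `EXPECTED_ROWS` are hypothetical names, and 342 is the count observed in the run above.

```python
EXPECTED_ROWS = 342  # row count observed in the notebook run above


def check_row_count(n_rows, expected=EXPECTED_ROWS):
    """Raise AssertionError with a specific message if the count is off."""
    if n_rows > expected:
        raise AssertionError(
            f"L.A. County's scraper has extra rows ({n_rows} > {expected})"
        )
    if n_rows < expected:
        raise AssertionError(
            f"L.A. County's scraper is missing rows ({n_rows} < {expected})"
        )
```

In the notebook this would be called as `check_row_count(len(df))`.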

## Export

Set the date

In [196]:
now = env.get_today()

In [197]:
writer.to_csv(df, f"city-scrapers/{slugify('Los Angeles')}/{now}.csv")