This notebook downloads the EPA Facility Registry Service (FRS) data as a ZIP archive, unpacks it in memory, reads the four constituent CSV files with pandas, and writes each one as a Parquet file to the Microsoft Planetary Computer Hub account that the kernel is connected to via an API token. Keeping the data as Parquet makes it easy to read back in other work done on that machine.

In [1]:
import requests
from io import BytesIO
from zipfile import ZipFile
import pandas as pd
from glob import glob

In [3]:
glob('data/*')

['data/FRS_PROGRAM_LINKS.parquet',
 'data/FRS_SIC_CODES.parquet',
 'data/FRS_NAICS_CODES.parquet',
 'data/FRS_FACILITIES.parquet']

The files are already there, but we can refresh them if needed. The EPA ECHO site has a page, which I found in the past, that indicates when an ECHO dataset has been updated. That page could be used to drive a periodic refresh process.
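
One possible way to decide whether a refresh is due is sketched below. It does not use the ECHO status page mentioned above. Instead it swaps in a simpler check: compare the HTTP `Last-Modified` header of the download with the modification time of a local Parquet file. Whether the ECHO server sends that header is an assumption, and `needs_refresh` is a hypothetical helper name, not part of this notebook.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path
from typing import Optional


def needs_refresh(last_modified: Optional[str], local_path: Path) -> bool:
    """Return True when the local file is missing or older than the remote copy.

    `last_modified` is the raw value of an HTTP Last-Modified header
    (assumption: the server provides one), or None if it was absent.
    """
    local_path = Path(local_path)
    if not local_path.exists():
        return True
    if not last_modified:
        # No header to compare against, so err on the side of refreshing.
        return True
    try:
        remote_time = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable header: treat it the same as a missing one.
        return True
    if remote_time.tzinfo is None:
        # Naive timestamps are assumed to be UTC.
        remote_time = remote_time.replace(tzinfo=timezone.utc)
    local_time = datetime.fromtimestamp(local_path.stat().st_mtime, tz=timezone.utc)
    return remote_time > local_time
```

In this notebook the header value could come from `requests.head(epa_frs_url).headers.get("Last-Modified")`, with one of the files under `data/` as `local_path`. If the header turns out to be absent, the function always answers `True`, which falls back to the current behavior of downloading every time.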

In [4]:
%%time
epa_frs_url = "https://echo.epa.gov/files/echodownloads/frs_downloads.zip"

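# Download the whole archive into memory; fail loudly on an HTTP error
r_epa_frs = requests.get(epa_frs_url)
r_epa_frs.raise_for_status()
z = ZipFile(BytesIO(r_epa_frs.content))

# Read every CSV as strings so identifiers and codes stay exactly as written,
# then write each table to data/<name>.parquet
for fn in z.namelist():
    pd.read_csv(z.open(fn), dtype=str).to_parquet(
        f"data/{fn.split('.')[0]}.parquet"
    )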

CPU times: user 38 s, sys: 3.69 s, total: 41.7 s
Wall time: 52.6 s
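
Once the Parquet files exist, other notebooks on the same machine can read them back without repeating the download. The sketch below is one way that might look. `frs_parquet_path` and `load_frs_table` are hypothetical helper names, and the `data` folder is assumed to be the one written above. Because the CSVs were read with `dtype=str`, every column comes back as a string.

```python
from pathlib import Path

DATA_DIR = Path("data")  # assumption: same folder the conversion cell writes to


def frs_parquet_path(table, data_dir=DATA_DIR):
    """Build the path of one converted table, e.g. 'FRS_FACILITIES'."""
    return Path(data_dir) / f"{table}.parquet"


def load_frs_table(table, columns=None, data_dir=DATA_DIR):
    """Load one table, optionally restricted to a subset of columns."""
    # Imported here so the path helper stays usable without pandas installed.
    import pandas as pd

    return pd.read_parquet(frs_parquet_path(table, data_dir), columns=columns)
```

For example, `load_frs_table("FRS_NAICS_CODES")` would return the full table. Passing `columns=[...]` reads only the named columns, which can save memory on the larger files.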
