**Aim**: To acquire the geographic and industry details of all entities that the [Toxics Release Inventory](https://enviro.epa.gov/triexplorer/tri_release.facility) considers hazard release spots. A few important references:
* [TRI Model](https://www.epa.gov/enviro/tri-model)
* [TRI Reported Chemical Information Subject Area Model](https://www.epa.gov/enviro/tri-reported-chemical-information-subject-area-model)
* [Table: TRI_CHEM_INFO](https://enviro.epa.gov/enviro/ef_metadata_html.ef_metadata_table?p_table_name=tri_chem_info&p_topic=tri)

<br>

**Important**: It seems that the [United States Environmental Protection Agency](https://www.epa.gov/) is in the midst of upgrading its data services.  This seems to include the decommissioning of 

* https://data.epa.gov/efservice/ 

and replacing it with

* http://iaspub.epa.gov/enviro/efservice

Peculiarly, https://data.epa.gov/efservice/ is **out of service**, but its replacement, http://iaspub.epa.gov/enviro/efservice, is **not yet in service**. **Therefore, this notebook cannot function at present.** This state of affairs was anticipated; a set of results that this notebook produced while it could still function can be retrieved from [GitHub](https://github.com/vetiveria/spots).
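
Whether either endpoint answers can be checked before running the rest of the notebook. The probe below is a minimal sketch that uses only the standard library; the function name `in_service` and its `opener` parameter are hypothetical and not part of this repository.

```python
import urllib.error
import urllib.request


def in_service(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the URL answers with an HTTP status below 400.

    `opener` is an illustrative hook that allows a stub to replace
    urllib.request.urlopen when no network is available.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError, and timeouts are all subclasses of OSError
        return False


# Usage, which requires network access:
# in_service('https://data.epa.gov/efservice/')
# in_service('http://iaspub.epa.gov/enviro/efservice')
```

A `False` result only says that the endpoint did not answer successfully at that moment; it does not say why.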

<br>
<br>

## **Preliminaries**

```bash
# machine characteristics
cat /etc/issue &> logs/ubuntu.log
cat /proc/cpuinfo &> logs/cpu.log
cat /proc/meminfo &> logs/memory.log
```

<br>


Cleaning-up

In [1]:
!rm -rf *.sh

<br>
<br>

**Packages**

In [2]:
import subprocess

In [3]:
if 'google.colab' in str(get_ipython()):
    subprocess.run('wget -q https://raw.githubusercontent.com/vetiveria/spots/develop/scripts.sh', shell=True)
    subprocess.run('chmod u+x scripts.sh', shell=True)
    subprocess.run('./scripts.sh', shell=True)

<br>
<br>

**Libraries**

In [4]:
import pandas as pd
import dask.dataframe as dd
import dask
import numpy as np
import requests
import os

In [5]:
import logging

<br>

**Logging**

In [6]:
logging.basicConfig(level=logging.ERROR, format='%(asctime)s \n\r %(levelname)s %(message)s', datefmt='%H:%M:%S')
logger = logging.getLogger(__name__)

<br>
<br>

**Classes**

In [7]:
import src.boundaries.boundaries
import src.settings

import src.references.chemicals
import src.references.naics
import src.references.industries

import src.releases.request
import src.releases.helpers

<br>

Instantiate

In [8]:
settings = src.settings.Settings()
boundaries = src.boundaries.boundaries.Boundaries(crs=settings.crs)

<br>
<br>

## **References**

In [9]:
referencespath = os.path.join(os.getcwd(), 'warehouse', 'references')

# Create the directory if it does not exist
os.makedirs(referencespath, exist_ok=True)

<br>

### NAICS

In [10]:
naics = src.references.naics.NAICS().exc()
naics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2003 entries, 0 to 2002
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   naics        2003 non-null   int64 
 1   description  2003 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


In [11]:
naics.to_csv(path_or_buf=os.path.join(referencespath, 'naics.csv'), header=True, index=False, encoding='UTF-8')

<br>

### Industries

In [12]:
industries = src.references.industries.Industries().exc()
industries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   industry_code  30 non-null     int64 
 1   name           30 non-null     object
dtypes: int64(1), object(1)
memory usage: 608.0+ bytes


In [13]:
industries.to_csv(path_or_buf=os.path.join(referencespath, 'industries.csv'), header=True, index=False, encoding='UTF-8')

<br>

### Chemicals

In [None]:
chemicals = src.references.chemicals.Chemicals().exc()
chemicals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 649 entries, 0 to 650
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   TRI_CHEM_ID                649 non-null    object 
 1   CHEM_NAME                  649 non-null    object 
 2   ACTIVE_DATE                649 non-null    int64  
 3   INACTIVE_DATE              649 non-null    int64  
 4   CAAC_IND                   649 non-null    int64  
 5   CARC_IND                   649 non-null    int64  
 6   R3350_IND                  649 non-null    int64  
 7   METAL_IND                  649 non-null    int64  
 8   FEDS_IND                   649 non-null    int64  
 9   CLASSIFICATION             649 non-null    int64  
 10  PBT_START_YEAR             21 non-null     float64
 11  PBT_END_YEAR               21 non-null     float64
 12  NO_DECIMALS                20 non-null     float64
 13  UNIT_OF_MEASURE            649 non-null    object 

<br>
<br>

## **Front Matter**

The objectives of the modules herein:

**TRI**
* acquire the lists of hazardous sites from the old TRI repository, and ensure that each site has its coordinate details: longitude, latitude, state FP/GEOID, county FP/GEOID, tract CE/GEOID
* acquire the lists of hazardous sites from the latest TRI Services repository
* map the details of these repositories

**NAICS**
* acquire their industry classifications
* create an efficient storage set-up for the outcomes
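
The state, county, and tract identifiers listed above are nested: a census tract GEOID is 11 digits, made up of a 2-digit state FIPS code, a 3-digit county code, and a 6-digit tract code. The sketch below splits a tract GEOID into those parts; the function name `split_tract_geoid` is hypothetical, and the tract code in the usage line is purely illustrative.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit census tract GEOID into its parts."""
    # Integer storage drops leading zeros, e.g., for state code 01, so pad first
    geoid = str(geoid).zfill(11)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError('Expected an 11-digit tract GEOID, got {!r}'.format(geoid))
    return {
        'STATEFP': geoid[:2],       # state FIPS code
        'COUNTYFP': geoid[2:5],     # county code within the state
        'TRACTCE': geoid[5:],       # tract code within the county
        'COUNTYGEOID': geoid[:5],   # state + county
    }


parts = split_tract_geoid('22071000100')
```

Keeping these identifiers as strings, rather than integers, avoids losing the leading zeros.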

<br>

### TRI

```bash
%%bash
python src/tri/main.py
ls spots/mapable/*csv | wc -l
```

<br>

### NAICS

```bash
%%bash
python src/naics/main.py
ls naics/*csv | wc -l
```

<br>
<br>

## **Releases**

States

In [None]:
states = boundaries.states(settings.latest)
states.info()

<br>

Counties

In [None]:
counties = boundaries.counties(settings.latest)
counties = counties.merge(states[['STATEFP', 'STUSPS']], on='STATEFP', how='left')
counties.rename(columns={'GEOID': 'COUNTYGEOID'}, inplace=True)
counties.info()
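
The merge above attaches each state's postal abbreviation to its counties. A minimal sketch of the same pattern on toy frames follows; the toy values are illustrative, and the `validate` argument is an addition that makes pandas raise an error if `STATEFP` is not unique among the states.

```python
import pandas as pd

# Toy stand-ins for the boundaries data; the values are illustrative only
states = pd.DataFrame({'STATEFP': ['22', '28'], 'STUSPS': ['LA', 'MS']})
counties = pd.DataFrame({'STATEFP': ['22', '22', '28'],
                         'GEOID': ['22033', '22071', '28049']})

# Many counties map to one state; validate= guards against duplicated state codes
labelled = counties.merge(states[['STATEFP', 'STUSPS']], on='STATEFP',
                          how='left', validate='many_to_one')
labelled = labelled.rename(columns={'GEOID': 'COUNTYGEOID'})

# With how='left', counties lacking a matching state get a missing STUSPS
missing = int(labelled['STUSPS'].isna().sum())
```

Checking `missing` after a left merge is a cheap way to spot counties whose state code has no match.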

<br>

Directories

In [None]:
releasespath = os.path.join(os.getcwd(), 'warehouse', 'designs', 'designs')
# Create the directory if it does not exist
os.makedirs(releasespath, exist_ok=True)

<br>
<br>

### Setting-up

Initially, the focus is on LA. Remember, these are **on-site releases & disposals**.

In brief:

* Drop: FACILITY_NAME, EPA_REGISTRY_ID, CAS_CHEM_NAME, RELEASE_BASIS_EST_CODE
* Only select cases whereby TRADE_SECRET_IND == 0
* Ascertain data consistency w.r.t. ENVIRONMENTAL_MEDIUM
* Drop duplicates
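
The drop, select, and de-duplicate steps can be sketched with pandas as follows. This is an illustration on toy data, not the code in `src.releases`; the function name `tidy` is hypothetical, and the consistency check on `ENVIRONMENTAL_MEDIUM` is left out because its rules are not specified here.

```python
import pandas as pd

# Fields named in the list above
DROPPED = ['FACILITY_NAME', 'EPA_REGISTRY_ID', 'CAS_CHEM_NAME', 'RELEASE_BASIS_EST_CODE']


def tidy(releases: pd.DataFrame) -> pd.DataFrame:
    """Drop unneeded fields, exclude trade secret records, drop duplicates."""
    data = releases.drop(columns=DROPPED, errors='ignore')
    data = data[data['TRADE_SECRET_IND'] == 0]
    return data.drop_duplicates().reset_index(drop=True)


# Toy records; the values are illustrative only
releases = pd.DataFrame({
    'FACILITY_NAME': ['A', 'A', 'B', 'C'],
    'TRADE_SECRET_IND': [0, 0, 1, 0],
    'DOC_CTRL_NUM': ['d1', 'd1', 'd2', 'd3'],
    'ENVIRONMENTAL_MEDIUM': ['AIR STACK', 'AIR STACK', 'WATER', 'WATER'],
    'TOTAL_RELEASE': [10.0, 10.0, 5.0, 7.0]})

tidied = tidy(releases)
```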

Beware of

* The `DOC_CTRL_NUM`, `WATER_SEQUENCE_NUM`, and `ENVIRONMENTAL_MEDIUM` [fields](https://enviro.epa.gov/enviro/ef_metadata_html.ef_metadata_table?p_table_name=tri_release_qty&p_topic=tri). These probably help identify distinct records when dropping duplicates; the release quantities of distinct combinations of these fields are expected to differ almost always.
* Units of measure differences; a mix of pounds & grams
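
Because the units of measure are mixed, quantities need a common unit before they are summed or compared. The sketch below converts to kilograms; the function name `to_kilograms` and the unit labels `pounds` and `grams` are assumptions about how `UNIT_OF_MEASURE` is encoded, not a description of `helpers.weights`.

```python
# Conversion factors; one pound is exactly 0.45359237 kilograms
KILOGRAMS_PER_UNIT = {'pounds': 0.45359237, 'grams': 0.001}


def to_kilograms(quantity, unit):
    """Convert a release quantity to kilograms; the unit labels are assumed."""
    key = str(unit).strip().lower()
    if key not in KILOGRAMS_PER_UNIT:
        raise ValueError('Unknown unit of measure: {!r}'.format(unit))
    return quantity * KILOGRAMS_PER_UNIT[key]


in_kilograms = [to_kilograms(100, 'Pounds'), to_kilograms(250, 'Grams')]
```

Raising an error on an unknown unit, instead of passing the quantity through, prevents unconverted values from mixing silently with converted ones.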


<br>
<br>

### Steps

In [None]:
request = src.releases.request.Request()
helpers = src.releases.helpers.Helpers(counties=counties, chemicals=chemicals)

In [None]:
for index in states[5:6].index:

    logger.info('\n...{}'.format(states.STUSPS[index]))

    # A state's data streams
    streams = request.exc(state=states.STUSPS[index])

    # The streams: county, year, and chemical level
    distributions = streams.compute(scheduler='processes')

    # The chemical amount released; the by-year option might be removed in future
    base = distributions.groupby(by=['COUNTYGEOID', 'TRI_CHEM_ID', 'REPORTING_YEAR'])['TOTAL_RELEASE'].sum()
    base = base.reset_index(drop=False)

    # Get the unit of measure per chemical
    details = helpers.units(data=base.copy())

    # Ensure consistent release measures
    transformed = helpers.weights(data=details.copy())

    # Preliminary Focus: Analysis w.r.t. total release over time, per county & chemical
    # points = transformed.groupby(by=['COUNTYGEOID', 'TRI_CHEM_ID'])['RELEASE_KG'].sum()
    # points = points.reset_index(drop=False)
    points = transformed

    # Design matrix
    matrix = helpers.regressors(data=points, state=states.STUSPS[index])
    # DataFrame.info() prints its summary and returns None, hence it is called directly
    matrix.info()
    matrix.to_csv(path_or_buf=os.path.join(releasespath, states.STUSPS[index] + '.csv'), index=True, encoding='UTF-8', header=True)
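
The aggregation step in the loop can be illustrated on toy data. The group-by mirrors the one above; the pivot that follows is only one plausible shape for a design matrix, with a row per county and a column per chemical, and it is an assumption rather than a description of `helpers.regressors`. The chemical identifiers `C1` and `C2` are made up.

```python
import pandas as pd

# Toy records; the identifiers and amounts are illustrative only
distributions = pd.DataFrame({
    'COUNTYGEOID': ['22071', '22071', '22071', '22033'],
    'TRI_CHEM_ID': ['C1', 'C1', 'C2', 'C1'],
    'REPORTING_YEAR': [2018, 2018, 2018, 2019],
    'TOTAL_RELEASE': [10.0, 5.0, 2.5, 1.0]})

# Total release per county, chemical, and year, as in the loop above
base = distributions.groupby(by=['COUNTYGEOID', 'TRI_CHEM_ID', 'REPORTING_YEAR'])['TOTAL_RELEASE'].sum()
base = base.reset_index(drop=False)

# Assumed design matrix layout: counties as rows, chemicals as columns
matrix = base.pivot_table(index='COUNTYGEOID', columns='TRI_CHEM_ID',
                          values='TOTAL_RELEASE', aggfunc='sum', fill_value=0)
```

In this layout, a county that never reported a chemical gets a zero in that chemical's column.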


<br>

A Computation Graph

In [None]:
streams.visualize(filename='streams', format='pdf')

<br>
<br>

### Later

* Selections w.r.t. the states of `CAAC_IND`, `CARC_IND`, & `R3350_IND` in chemicals, per `TRI_CHEM_ID`.
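
A selection of this kind might look like the sketch below. It assumes the indicator fields are coded 0/1, which is consistent with their `int64` type in the chemicals listing but is not confirmed here; the function name `select` and the toy identifiers are hypothetical.

```python
import pandas as pd

# Toy chemicals reference; identifiers and flag states are illustrative only
chemicals = pd.DataFrame({
    'TRI_CHEM_ID': ['X1', 'X2', 'X3'],
    'CAAC_IND': [1, 0, 1],
    'CARC_IND': [1, 1, 0],
    'R3350_IND': [0, 0, 1]})


def select(chemicals, **flags):
    """Return the TRI_CHEM_ID values whose indicator fields match the given states."""
    condition = pd.Series(True, index=chemicals.index)
    for field, state in flags.items():
        condition &= chemicals[field] == state
    return chemicals.loc[condition, 'TRI_CHEM_ID'].tolist()


both = select(chemicals, CAAC_IND=1, CARC_IND=1)
outside = select(chemicals, R3350_IND=0)
```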