# ATTAINS Data Cleaning
<i>Ryan Treves</i>
### Goals:
- Download and clean data from EPA's ATTAINS database in order to explore TMDLs and water quality

### Can we download summary data on Assessment Units (AUs) for 52 states and territories?

In [1]:
import pandas as pd

# display all rows & columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json
from urllib.request import urlopen

In [112]:
# Load acceptable list of state codes (this came from https://attains.epa.gov/attains-public/api/domains?domainName=OrgStateCode)
orgID_list = pd.read_csv('OrgID_list.csv')

In [None]:
# Download .json files for each state
for stateCode in orgID_list['code'].unique():
    url = 'https://attains.epa.gov/attains-public/api/assessmentUnits?stateCode=' + stateCode
    # Close the HTTP connection as soon as the payload has been read
    with urlopen(url) as response:
        data = json.load(response)
    with open(stateCode + '.json', 'w') as f:
        json.dump(data, f)

After this step, I manually uploaded each .json file to this JSON-to-CSV converter and downloaded the output: https://csvjson.com/json2csv. Output files are named state code + '_AUs.csv' and can be found in Raw_AU_data/. NOTE: the Pennsylvania .json file was so large (127 MB) that I couldn't find a way to run it through the converter without crashing it, so Pennsylvania is excluded for now.
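
As a possible alternative to the manual converter step, the nested .json could be flattened directly in Python, which might also avoid the file-size problem. The sketch below is only an illustration: `flatten_assessment_units` is a hypothetical helper, the nesting (`items` > `assessmentUnits` > `locations`) is inferred from the column names in the converted files rather than from the API documentation, and `sample` is made-up data. Its output layout (one row per AU) differs from the converter's, so the pipelines below would not apply to it unchanged.

```python
import pandas as pd

def flatten_assessment_units(data):
    """Return one row per assessment unit from an ATTAINS-style payload (assumed layout)."""
    rows = []
    for org in data.get('items', []):
        for au in org.get('assessmentUnits', []):
            row = {
                'organizationIdentifier': org.get('organizationIdentifier'),
                'organizationName': org.get('organizationName'),
                'AUID': au.get('assessmentUnitIdentifier'),
                'assessmentUnitName': au.get('assessmentUnitName'),
            }
            # Keep the first location of each type (e.g. 'HUC-8') as its own column
            for loc in au.get('locations') or []:
                if loc.get('locationTypeCode'):
                    row.setdefault(loc['locationTypeCode'], loc.get('locationText'))
            rows.append(row)
    return pd.DataFrame(rows)

# Made-up payload that mimics the assumed structure
sample = {
    'items': [{
        'organizationIdentifier': 'ORG1',
        'organizationName': 'Example Agency',
        'assessmentUnits': [
            {'assessmentUnitIdentifier': 'A1',
             'assessmentUnitName': 'Alpha Creek',
             'locations': [{'locationTypeCode': 'HUC-8', 'locationText': '05020001'}]},
            {'assessmentUnitIdentifier': 'B2',
             'assessmentUnitName': 'Beta Lake',
             'locations': []},
        ],
    }]
}

flat = flatten_assessment_units(sample)
print(flat)
```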

Now we want to clean up the columns so that all the files can be concatenated. A few states (e.g., AZ, ME, NC) will throw parsing errors at this step due to data entry differences within the rows; I went back and fixed these manually.
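
To locate the rows that need manual fixing, one option is to compare each row's field count against the header. This is a minimal sketch, assuming the parsing errors come from rows with a different number of fields than the header; `find_ragged_rows` is a hypothetical helper and the sample text is made up.

```python
import csv
import io

def find_ragged_rows(lines):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(lines)
    header = next(reader)
    return [(reader.line_num, len(row)) for row in reader if len(row) != len(header)]

# Made-up file contents: the third line has one field too many
sample = io.StringIO(
    'AUID,name,HUC-8\n'
    'A1,Alpha Creek,05020001\n'
    'B2,Beta Lake,extra,05020002\n'
)
bad_rows = find_ragged_rows(sample)
print(bad_rows)  # [(3, 4)]
```

For a real file, the same function could be called on an open file handle, e.g. `open('Raw_AU_data/AZ_AUs.csv', newline='')`.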

In [4]:
stateCode = 'WV' # Example

In [5]:
AUs = pd.read_csv('Raw_AU_data/' + stateCode + '_AUs.csv')

For some states, this pipeline will work:

In [32]:
# Pipeline 1
AU_keys = AUs[['items.organizationTypeText', 'items.organizationIdentifier', 'items.organizationName', 'items.assessmentUnits.assessmentUnitIdentifier', 'items.assessmentUnits.assessmentUnitName', 'items.assessmentUnits.agencyCode', 'items.assessmentUnits.statusIndicator', 'items.assessmentUnits.useClass']].rename(columns={'items.organizationTypeText': 'AUID'}).drop_duplicates()

HUC_codes = AUs[AUs['items.assessmentUnits.locations.locationTypeCode']=='HUC-8'][['items.organizationTypeText', 'items.assessmentUnits.locations.locationText']].rename(columns={'items.organizationTypeText': 'AUID', 'items.assessmentUnits.locations.locationText': 'HUC-8'})

waterTypes = AUs[['items.organizationTypeText', 'items.assessmentUnits.waterTypes.waterTypeCode', 'items.assessmentUnits.waterTypes.waterSizeNumber', 'items.assessmentUnits.waterTypes.unitsCode']].dropna().rename(columns={'items.organizationTypeText': 'AUID'})
AUs = AU_keys.merge(waterTypes, on='AUID', how='left').merge(HUC_codes, on='AUID', how='left')

For others, this pipeline will work:

In [9]:
# Pipeline 2
AU_keys = AUs[['items.organizationTypeText', 'items.organizationIdentifier', 'items.organizationName', 'items.assessmentUnits.assessmentUnitIdentifier', 'items.assessmentUnits.assessmentUnitName', 'items.assessmentUnits.agencyCode', 'items.assessmentUnits.statusIndicator', 'items.assessmentUnits.useClass']].rename(columns={'items.assessmentUnits.assessmentUnitIdentifier': 'AUID'}).drop_duplicates()

HUC_codes_1 = AUs[AUs['items.assessmentUnits.locations.locationTypeCode']=='HUC-8'][['items.assessmentUnits.assessmentUnitIdentifier', 'items.assessmentUnits.locations.locationText']].rename(columns={'items.assessmentUnits.assessmentUnitIdentifier': 'AUID', 'items.assessmentUnits.locations.locationText': 'HUC-8'})

HUC_codes_12 = AUs[AUs['items.assessmentUnits.locations.locationTypeCode']=='HUC-12'][['items.assessmentUnits.assessmentUnitIdentifier', 'items.assessmentUnits.locations.locationText']].rename(columns={'items.assessmentUnits.assessmentUnitIdentifier': 'AUID', 'items.assessmentUnits.locations.locationText': 'HUC-12'})


waterTypes = AUs[['items.assessmentUnits.assessmentUnitIdentifier', 'items.assessmentUnits.waterTypes.waterTypeCode', 'items.assessmentUnits.waterTypes.waterSizeNumber', 'items.assessmentUnits.waterTypes.unitsCode']].dropna().rename(columns={'items.assessmentUnits.assessmentUnitIdentifier': 'AUID'})

AUs = AU_keys.merge(waterTypes, on='AUID', how='left').merge(HUC_codes_1, on='AUID', how='left').merge(HUC_codes_12, on='AUID', how='left')
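
Both pipelines follow the same pattern: split the flattened rows into a table of unique AU keys, a table of HUC codes, and a table of water types, then left-merge everything back on `AUID`. The toy example below illustrates that pattern with shortened, made-up column names and data. It assumes the converted files spread each AU's nested location and water-type records across several rows, which is what the filtering and `dropna()` calls above suggest.

```python
import numpy as np
import pandas as pd

# Made-up flattened rows: each AU appears once per nested record
raw = pd.DataFrame({
    'AUID': ['A1', 'A1', 'A1', 'B2', 'B2'],
    'name': ['Alpha Creek'] * 3 + ['Beta Lake'] * 2,
    'locationTypeCode': ['HUC-8', 'HUC-12', np.nan, 'HUC-8', np.nan],
    'locationText': ['05020001', '050200010101', np.nan, '05020002', np.nan],
    'waterTypeCode': [np.nan, np.nan, 'STREAM', np.nan, 'LAKE'],
})

# One row per AU with its descriptive fields
keys = raw[['AUID', 'name']].drop_duplicates()

# Rows that carry a HUC-8 code, reshaped into an AUID -> HUC-8 lookup
huc8 = (raw[raw['locationTypeCode'] == 'HUC-8'][['AUID', 'locationText']]
        .rename(columns={'locationText': 'HUC-8'}))

# Rows that carry water-type information
water = raw[['AUID', 'waterTypeCode']].dropna()

# Left merges keep every AU even when a lookup table has no match
tidy = keys.merge(water, on='AUID', how='left').merge(huc8, on='AUID', how='left')
print(tidy)
```

If an AU had more than one HUC-8 or water-type row, the merges would yield more than one row for that AU, so checking `tidy['AUID'].is_unique` afterwards may be worthwhile.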

For others still, no HUC data is available and we just use this:

In [63]:
# No HUC columns to merge; just rename the identifier column (the cleaned file is saved in the next cell)
AUs = AUs.rename(columns={'items.assessmentUnits.assessmentUnitIdentifier': 'AUID'})

Finally, save the result to a cleaned .csv file:

In [12]:
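# index=False avoids writing the row index, which would otherwise reappear as an 'Unnamed: 0' column on reload
AUs.to_csv('Clean_AU_data/' + stateCode + '_AUs_cleaned.csv', index=False)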
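
One thing to watch when these files are read back: HUC codes can begin with a zero, and `pd.read_csv` parses all-digit columns as integers by default, which drops the leading zero. The sketch below uses made-up data to show the effect and one way to avoid it, by passing `dtype` for the HUC columns.

```python
import io
import pandas as pd

# Made-up cleaned file with a HUC-8 code that starts with zero
csv_text = 'AUID,HUC-8\nA1,05020001\n'

default_read = pd.read_csv(io.StringIO(csv_text))
string_read = pd.read_csv(io.StringIO(csv_text), dtype={'HUC-8': str})

print(default_read['HUC-8'].iloc[0])  # 5020001
print(string_read['HUC-8'].iloc[0])   # 05020001
```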

Once the above code has been done for all states, we can concatenate the data:

In [None]:
# Note: Pennsylvania (PA) is excluded, see above for an explanation. Note that 'VI'= Virgin Islands, 'PR'=Puerto Rico, 'GU'=Guam
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'GU', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'VI', 'WA', 'WV', 'WI', 'WY']

# Load and concatenate all state AUs data in a single pd.concat call
all_AUs = pd.concat(
    [pd.read_csv('Clean_AU_data/' + state + '_AUs_cleaned.csv') for state in states],
    axis=0,
    ignore_index=True,
)

In [None]:
all_AUs.to_csv('Clean_AU_data/all_AUs_cleaned.csv', index=False)
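
Because the cleaned files do not all have the same columns (for example, only Pipeline 2 produces a `HUC-12` column), it helps to know how `pd.concat` handles the mismatch: it takes the union of the columns and fills the gaps with NaN. A small sketch with made-up frames:

```python
import pandas as pd

# Made-up stand-ins for three cleaned state files with different columns
state_a = pd.DataFrame({'AUID': ['A1'], 'HUC-8': ['05020001']})
state_b = pd.DataFrame({'AUID': ['B2'], 'HUC-8': ['05020002'], 'HUC-12': ['050200020101']})
state_c = pd.DataFrame({'AUID': ['C3']})

combined = pd.concat([state_a, state_b, state_c], ignore_index=True)

# Share of missing values per column, a quick check on how complete the combined table is
missing_share = combined.isna().mean()
print(combined)
print(missing_share)
```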

### Can we download summary data on assessments yielding IR category 5 determinations for 52 states and territories?
It turns out that, because the ATTAINS API returns these data as deeply nested .json, I wasn't able to find a workable way to download and convert the files as I did for the assessment unit data above. Instead, see the R script `pull_IR5_assessments.R`, which can be run to download assessment summary data containing the features we're interested in.