# ATTAINS DATA EXPLORATION
<i>Ryan Treves</i>

### Questions:
- How many Assessment Units (AUs) exist nationwide?
- For how many AUs do we have a HUC code match?
- How many use assessments nationwide, ever, have contributed to an IR5 category determination?
- How many unique assessment units have been assigned category IR5?
- Which states have had the most use assessments leading to IR5 determinations?
- Which states have had the highest rate of IR5 determinations per assessment unit?
- For what fraction of use assessments do we have an assessment date?
- What parameters have caused the most use non-attainment declarations?
- What parameters have caused the most assessment units to be categorized as IR5, irrespective of number of use non-attainment declarations?
- What uses have the highest rate of non-attainment?

Note: the dataset of AUs doesn't include Pennsylvania (see `ATTAINS_data_cleaning.ipynb` for an explanation)


In [1]:
import pandas as pd
import json
from urllib.request import urlopen

# display all rows & columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [47]:
# Load in national Assessment Unit (AU) data
AUs = pd.read_csv('Clean_AU_data/all_AUs_cleaned.csv')



In [3]:
AUs.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,AUID,items.organizationIdentifier,items.organizationName,items.assessmentUnits.assessmentUnitIdentifier,items.assessmentUnits.assessmentUnitName,items.assessmentUnits.agencyCode,items.assessmentUnits.statusIndicator,items.assessmentUnits.useClass,items.assessmentUnits.waterTypes.waterTypeCode,items.assessmentUnits.waterTypes.waterSizeNumber,items.assessmentUnits.waterTypes.unitsCode,HUC-8,items.organizationTypeText,HUC-12,count,items.assessmentUnits.locationDescriptionText,items.assessmentUnits.stateCode,items.assessmentUnits.waterTypes.sizeEstimationMethodCode,items.assessmentUnits.waterTypes.sizeSourceText,items.assessmentUnits.waterTypes.sizeSourceScaleText,items.assessmentUnits.monitoringStations.monitoringOrganizationIdentifier,items.assessmentUnits.monitoringStations.monitoringLocationIdentifier,items.assessmentUnits.monitoringStations.monitoringDataLinkText,items.assessmentUnits.locations.locationTypeCode,items.assessmentUnits.locations.locationText,HUC-10
0,0,0,WYGR140401040103_01,Wyoming,State,Lower Big Sandy River,From the confluence with Squaw Creek downstrea...,WY,"{\useClassCode\"":\""85\""","\""useClassName\"":\""CLASS 2AB\""}""",STREAM,2.1,Miles,,,,,,,,,,,,,,,
1,1,1,WYNP101800020105_02,Wyoming,State,Muddy Creek,Entire watershed upstream of the confluence wi...,WY,"{\useClassCode\"":\""85\""","\""useClassName\"":\""CLASS 2AB\""}""",STREAM,44.5,Miles,,,,,,,,,,,,,,,
2,2,2,WYBH100800140107_01,Wyoming,State,Dry Gulch,From the confluence with the Shoshone River to...,WY,"{\useClassCode\"":\""117\""","\""useClassName\"":\""CLASS 3B\""}""",STREAM,0.5,Miles,,,,,,,,,,,,,,,
3,3,3,WYBH100800030108_02,Wyoming,State,Little Popo Agie River,From the confluence with the Popo Agie River u...,WY,"{\useClassCode\"":\""85\""","\""useClassName\"":\""CLASS 2AB\""}""",STREAM,11.1,Miles,,,,,,,,,,,,,,,
4,4,4,WYGR140401040303 _01,Wyoming,State,Pacific Creek,Confluence with Jack Morrow Creek upstream to ...,WY,"{\useClassCode\"":\""85\""","\""useClassName\"":\""CLASS 2AB\""}""",STREAM,13.8,Miles,,,,,,,,,,,,,,,


### How many Assessment Units (AUs) exist nationwide?
Note: this estimate doesn't include Pennsylvania, which according to https://attains.epa.gov/attains-public/api/assessmentUnits?stateCode=PA&returnCountOnly=Y contains on the order of 200,000 AUs on its own.

In [4]:
len(AUs['AUID'].unique())

331553

### For how many AUs do we have a HUC code match?

In [5]:
AUs[(~pd.isna(AUs['HUC-12'])) | (~pd.isna(AUs['HUC-10'])) | (~pd.isna(AUs['HUC-8']))].shape[0]

99534
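
The same count can be expressed with `DataFrame.notna` and `any(axis=1)`. The sketch below runs on a small invented frame (`toy_AUs` and its values are hypothetical; only the three HUC column names come from the dataset), so it illustrates the technique rather than reproducing the number above.

```python
import pandas as pd

# Hypothetical toy rows standing in for the AU table; only the column names match the dataset
toy_AUs = pd.DataFrame({
    'AUID': ['A1', 'A2', 'A3', 'A4'],
    'HUC-8': ['14040104', None, None, None],
    'HUC-10': [None, None, '1008001401', None],
    'HUC-12': [None, None, '100800140107', None],
})

huc_cols = ['HUC-8', 'HUC-10', 'HUC-12']
# True for rows where at least one of the HUC columns is populated
has_huc = toy_AUs[huc_cols].notna().any(axis=1)
n_with_huc = int(has_huc.sum())
share_with_huc = n_with_huc / len(toy_AUs)
print(n_with_huc, share_with_huc)
```

Like the cell above, this counts rows. If an AUID can appear on more than one row, deduplicating on `AUID` first would be needed to count assessment units rather than rows.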

### How many use assessments nationwide, ever, have contributed to an IR5 category determination?
Here, a use assessment is uniquely identified by the combination of `assessmentUnitIdentifier`, `useName`, `reportingCycleText`, and `assessment_date`.

In [36]:
assessments = pd.read_csv('all_IR5_assessments.csv')
assessments.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,state_code,organizationIdentifier,organizationTypeText,reportingCycleText,reportStatusCode,assessmentUnitIdentifier,trophicStatusCode,useName,useAttainmentCode,threatenedIndicator,parameterStatusName,parameterName,cycle_first_listed,cycleLastAssessedText,cycle_scheduled_for_TMDL,assessment_date
0,0,1,AL,21AWIC,State,2008,Historical,AL-Gulf-of-Mexico,,Contact Recreation,F,N,Cause,MERCURY,2006.0,1998,2013.0,1998-04-01
1,1,2,AL,21AWIC,State,2008,Historical,AL-Gulf-of-Mexico,,Fishing,N,N,Cause,MERCURY,2006.0,1998,2013.0,1998-04-01
2,2,3,AL,21AWIC,State,2008,Historical,AL-Gulf-of-Mexico,,Propagation of Fish and Wildlife,F,N,Cause,MERCURY,2006.0,1998,2013.0,1998-04-01
3,3,4,AL,21AWIC,State,2008,Historical,AL-Gulf-of-Mexico,,Shellfishing,N,N,Cause,MERCURY,2006.0,1998,2013.0,1998-04-01
4,4,5,AL,21AWIC,State,2008,Historical,AL-Gulf-of-Mexico,,Industrial and Agriculture Uses,F,N,Cause,MERCURY,2006.0,1998,2013.0,1998-04-01


In [44]:

assessments_nonattainment = assessments[assessments['useAttainmentCode']=='N']
assessments_nonattainment.drop_duplicates(subset=['assessmentUnitIdentifier', 'useName', 'reportingCycleText', 'assessment_date']).shape[0]

564221
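
To show why deduplicating on the four-column key matters, here is a minimal sketch on invented rows. It assumes the file can hold several rows for one use assessment (for example, one per parameter, as the `parameterName` column suggests); the `toy` frame is hypothetical.

```python
import pandas as pd

key = ['assessmentUnitIdentifier', 'useName', 'reportingCycleText', 'assessment_date']

# Hypothetical rows: one non-attaining use listed under two parameters, plus one attaining use
toy = pd.DataFrame({
    'assessmentUnitIdentifier': ['AU-1', 'AU-1', 'AU-1'],
    'useName': ['Fishing', 'Fishing', 'Contact Recreation'],
    'reportingCycleText': [2008, 2008, 2008],
    'assessment_date': ['1998-04-01', '1998-04-01', '1998-04-01'],
    'useAttainmentCode': ['N', 'N', 'F'],
    'parameterName': ['MERCURY', 'PCBS IN FISH TISSUE', 'MERCURY'],
})

nonattaining = toy[toy['useAttainmentCode'] == 'N']
n_rows = len(nonattaining)                                      # rows, one per parameter
n_use_assessments = len(nonattaining.drop_duplicates(subset=key))  # unique use assessments
print(n_rows, n_use_assessments)
```

Without the `drop_duplicates` step, a use assessment with several cause parameters would be counted once per parameter.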

### How many unique assessment units have been assigned category IR5?

In [45]:
len(assessments['assessmentUnitIdentifier'].unique())

126751

### Which states have had the most use assessments leading to IR5 determinations?

In [46]:
assessments_nonattainment.drop_duplicates(subset=['assessmentUnitIdentifier', 'useName', 'reportingCycleText', 'assessment_date'])['state_code'].value_counts()

PA    226200
VA     27282
NH     22287
IN     19875
MI     15928
WV     14355
MN     14161
FL     12923
NC     12718
TN     12564
CA     11523
KY     11422
NJ     11141
OR     10842
OK      9592
MA      9164
ID      8943
WA      8330
KS      8263
OH      7425
MT      7320
SC      6850
TX      6672
WI      5559
IL      5378
CT      4145
CO      3816
PR      3722
AL      3661
LA      3538
IA      3255
NM      2908
HI      2620
UT      2518
GA      2511
RI      2347
NV      2293
SD      2190
NE      2003
MS      1787
VT      1761
ME      1665
AR      1619
MD      1419
DE      1395
MO      1237
VI      1183
AK       845
AZ       782
WY       750
NY       664
ND       629
GU       241
Name: state_code, dtype: int64

### Which states have had the highest rate of IR5 determinations per assessment unit?

In [55]:
# Get counts of assessment units in each state
AU_counts = {}
for state in assessments['state_code'].unique():
    response = urlopen('https://attains.epa.gov/attains-public/api/assessmentUnits?stateCode=' + state + '&returnCountOnly=Y')
    data = json.loads(response.read())['count']
    AU_counts[state] = data
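
The request in the loop above could also be wrapped in small helpers. This is a sketch only: `build_count_url`, `fetch_au_count`, and the 30-second timeout are hypothetical names and values introduced for illustration, while the endpoint, its two query parameters, and the `count` key are taken from the cell above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = 'https://attains.epa.gov/attains-public/api/assessmentUnits'

def build_count_url(state_code):
    """Return the count-only assessmentUnits URL for one state code."""
    query = urlencode({'stateCode': state_code, 'returnCountOnly': 'Y'})
    return BASE_URL + '?' + query

def fetch_au_count(state_code, timeout=30):
    """Fetch the AU count for one state (performs a network request when called)."""
    with urlopen(build_count_url(state_code), timeout=timeout) as response:
        return json.loads(response.read())['count']

# Only the URL is built here; no request is sent
print(build_count_url('PA'))
```

Calling `fetch_au_count(state)` inside the loop would then replace the two request lines, with the timeout guarding against a stalled connection.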

In [70]:
# Count unique non-attaining use assessments per state
unique_nonattainment = assessments_nonattainment.drop_duplicates(subset=['assessmentUnitIdentifier', 'useName', 'reportingCycleText', 'assessment_date'])
# rename_axis + reset_index(name=...) yields the same column names across pandas versions
rates = unique_nonattainment['state_code'].value_counts().rename_axis('state').reset_index(name='# IR5 use assessments')
rates['AUs'] = rates['state'].map(AU_counts)
rates['IR5 use assessment rate'] = rates['# IR5 use assessments']/rates['AUs']

In [75]:
rates.sort_values(by='IR5 use assessment rate', ascending=False).iloc[0:10]

Unnamed: 0,state,# IR5 use assessments,AUs,IR5 use assessment rate
12,NJ,11141,958,11.629436
27,PR,3722,358,10.396648
46,VI,1183,177,6.683616
29,LA,3538,563,6.284192
20,MT,7320,1203,6.084788
37,SD,2190,400,5.475
19,OH,7425,1723,4.309344
18,KS,8263,2421,3.413052
15,MA,9164,2764,3.315485
36,NV,2293,711,3.225035
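
As a worked check of the rate column, the sketch below recomputes it for three states using the counts shown in the outputs above. The names `ir5_counts`, `au_counts`, and `rates_demo` are introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the outputs above: IR5 use assessments and AU counts per state
ir5_counts = pd.Series({'NJ': 11141, 'PR': 3722, 'VI': 1183})
au_counts = {'NJ': 958, 'PR': 358, 'VI': 177}

rates_demo = ir5_counts.rename_axis('state').reset_index(name='# IR5 use assessments')
rates_demo['AUs'] = rates_demo['state'].map(au_counts)
# Rate = unique non-attaining use assessments divided by the number of AUs in the state
rates_demo['IR5 use assessment rate'] = rates_demo['# IR5 use assessments'] / rates_demo['AUs']
print(rates_demo)
```

A rate above 1 is possible because the numerator counts use assessments, not assessment units: one AU can be assessed for several uses and in several reporting cycles.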


### For what fraction of use assessments do we have an assessment date?

In [77]:
use_assessments_unique = assessments.drop_duplicates(subset=['assessmentUnitIdentifier', 'useName', 'reportingCycleText', 'assessment_date'])
use_assessments_unique[~pd.isna(use_assessments_unique['assessment_date'])].shape[0]/use_assessments_unique.shape[0]

0.24028999680744875
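
An equivalent way to get this fraction is to take the mean of a boolean mask, since `True` counts as 1. The frame below is invented for illustration (`toy_dates` is a hypothetical name).

```python
import pandas as pd

# Hypothetical rows: one use assessment with a date, three without
toy_dates = pd.DataFrame({
    'assessmentUnitIdentifier': ['AU-1', 'AU-2', 'AU-3', 'AU-4'],
    'assessment_date': ['1998-04-01', None, None, None],
})

# notna() gives True where a date is present; the mean of the mask is the share of dated rows
share_dated = toy_dates['assessment_date'].notna().mean()
print(share_dated)
```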

### What parameters have caused the most use non-attainment declarations?

In [78]:
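# Share of 'Cause' parameter rows by parameter (rows are not deduplicated or restricted to useAttainmentCode == 'N')
assessments[assessments['parameterStatusName']=='Cause']['parameterName'].value_counts(normalize=True).iloc[0:10]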

ESCHERICHIA COLI (E. COLI)          0.077221
PATHOGENS                           0.055360
DISSOLVED OXYGEN                    0.050509
POLYCHLORINATED BIPHENYLS (PCBS)    0.044071
PCBS IN FISH TISSUE                 0.044023
FECAL COLIFORM                      0.037691
SEDIMENTATION/SILTATION             0.035051
MERCURY IN FISH TISSUE              0.034144
SILTATION                           0.033814
MERCURY                             0.030667
Name: parameterName, dtype: float64

### What parameters have caused the most assessment units to be categorized as IR5, irrespective of number of use non-attainment declarations?

In [86]:
IR5_culprits_unique = assessments.drop_duplicates(subset=['assessmentUnitIdentifier', 'reportingCycleText', 'assessment_date', 'parameterName'])

IR5_culprits_unique[IR5_culprits_unique['parameterStatusName']=='Cause']['parameterName'].value_counts(normalize=True).iloc[0:10]

PATHOGENS                           0.089291
ESCHERICHIA COLI (E. COLI)          0.075443
SILTATION                           0.072346
DISSOLVED OXYGEN                    0.045512
PCBS IN FISH TISSUE                 0.037096
MERCURY                             0.036874
FECAL COLIFORM                      0.035271
POLYCHLORINATED BIPHENYLS (PCBS)    0.033996
PH                                  0.032374
CAUSE UNKNOWN                       0.031365
Name: parameterName, dtype: float64

### What uses have the highest rate of non-attainment?

In [87]:
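# Note: this is each use's share of all unique use assessments in the file, not a per-use non-attainment rate
use_assessments_unique['useName'].value_counts(normalize=True).iloc[0:10]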

Water Contact Sports            0.082313
Fish Consumption                0.059401
Fishing                         0.052269
Warm Water Fishes               0.049517
Cold Water Fishes               0.043985
Aquatic Life                    0.032166
Primary Contact Recreation      0.029205
Recreation                      0.027474
Trout Stocking                  0.024389
Secondary Contact Recreation    0.022256
Name: useName, dtype: float64
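
The figures above are each use's share of all unique use assessments in the file. A per-use non-attainment rate would instead divide the non-attaining assessments of a use by all assessments of that use. One possible way to compute it is sketched below on invented rows (`toy_uses` is hypothetical; only the column names and the `'N'` code come from the dataset).

```python
import pandas as pd

# Hypothetical unique use assessments: 'N' = not attaining, 'F' = fully attaining
toy_uses = pd.DataFrame({
    'useName': ['Fishing', 'Fishing', 'Fishing', 'Recreation', 'Recreation'],
    'useAttainmentCode': ['N', 'N', 'F', 'N', 'F'],
})

is_nonattaining = toy_uses['useAttainmentCode'].eq('N')
# Mean of the boolean mask within each use = fraction of that use's assessments that are 'N'
nonattainment_rate = (
    is_nonattaining.groupby(toy_uses['useName']).mean().sort_values(ascending=False)
)
print(nonattainment_rate)
```

On the real data this would be applied to `use_assessments_unique`. Because `all_IR5_assessments.csv` only covers assessments tied to IR5 determinations, any such rate would describe IR5 assessment units rather than all waters.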