
# Mapping Groundwater Contaminants of California

Potential goals of this notebook:
1. Clean the dataset into a workable dataframe
2. Spatially plot the data using geopandas or cartopy
3. ...

## 1. Reorganizing the data

This first section will deal with cleaning and reorganizing the data. 

(The `%matplotlib inline` magic command renders each figure directly below the cell that creates it.)

In [3]:
%matplotlib inline
from matplotlib import pyplot as plt

## Imports
import pandas as pd
import numpy as np
import os
pd.set_option('display.max_columns', 500)

## Set the data's directory path
script_dir = os.path.abspath('')
# The shared folder sits two directory levels above this notebook
shared_dir = os.path.dirname(os.path.dirname(script_dir))
data_dir   = os.path.join( shared_dir, 'whw2019_India GW Code & Data', 'whw2019_NWQData' )

The data we are using for this analysis come from a collaboration between the United States Geological Survey ([USGS](https://www.usgs.gov/)), the Environmental Protection Agency ([EPA](https://www.epa.gov/)), the United States Department of Agriculture Agricultural Research Service ([USDA ARS](https://www.ars.usda.gov/)), and the National Water Quality Monitoring Council ([NWQMC](https://acwi.gov/monitoring/)). The groundwater quality data were aggregated and downloaded from the [Water Quality Portal](https://www.waterqualitydata.us/coverage/).

The reported data sources are:
* National Water Information System ([NWIS](https://waterdata.usgs.gov/nwis)) - USGS
* STOrage and RETrieval ([STORET](https://www.epa.gov/waterdata/water-quality-data-wqx)) Data Warehouse - EPA
* Sustaining The Earth's Watersheds - Agricultural Research Database System ([STEWARDS]())

For now the state/region of interest is California (CA). However, we hope to be able to apply similar analyses to other states around the US, or to other countries (e.g., India) should adequate spatial (X,Y,Z) and temporal data resolution be available.

To read in the data files, we must point to their storage location (on Hydroshare). The following `pd.read_csv` commands may raise some warnings when they run. In this instance the warnings are safe to ignore, but always read what a warning says before dismissing it.
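If the warnings turn out to be `DtypeWarning`s about columns with mixed types (an assumption; the exact warning text is not reproduced here), one way to avoid them is to read every column as text and convert the numeric ones explicitly. The sketch below uses a hypothetical two-row CSV and an assumed column name, `ResultMeasureValue`, rather than the real data files.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: the value column mixes numbers and text,
# as lab results often do (e.g. "<0.5" for a non-detect).
csv_text = (
    "MonitoringLocationIdentifier,ResultMeasureValue\n"
    "SITE-1,1.2\n"
    "SITE-2,<0.5\n"
)

# Read every column as text so pandas never has to guess a type,
# then convert the numeric column explicitly; unparseable entries become NaN.
demo = pd.read_csv(io.StringIO(csv_text), dtype=str)
demo["ResultMeasureValue_num"] = pd.to_numeric(
    demo["ResultMeasureValue"], errors="coerce"
)
print(demo)
```

Keeping the original text column alongside the numeric one preserves qualifiers such as `<` that would otherwise be lost in the conversion.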


In [4]:
# Enter the state code in the 'state' variable to read in that state's data.
state = 'CA'
results  = pd.read_csv(os.path.join(data_dir, '{}_result.csv'.format(state)))
stations = pd.read_csv(os.path.join(data_dir, '{}_station.csv'.format(state)))



In [5]:
results.drop(columns=['OrganizationIdentifier', 
                      'OrganizationFormalName',
                      'ActivityIdentifier', 
                      'ActivityTypeCode', 
                      'ActivityMediaName',
                      'ActivityMediaSubdivisionName',
                      'ResultStatusIdentifier', 
                      'StatisticalBaseCode', 
                      'ResultValueTypeName',
                      'ResultWeightBasisText', 
                      'ResultTimeBasisText',
                      'ResultTemperatureBasisText',
                      'ResultParticleSizeBasisText',
                      'PrecisionValue', 
                      'ResultCommentText',
                      'USGSPCode',
                      'ResultDepthHeightMeasure/MeasureValue',
                      'ResultDepthHeightMeasure/MeasureUnitCode',
                      'ResultDepthAltitudeReferencePointText', 
                      'SubjectTaxonomicName',
                      'SampleTissueAnatomyName',
                      'ResultAnalyticalMethod/MethodIdentifier',
                      'ResultAnalyticalMethod/MethodIdentifierContext',
                      'ResultAnalyticalMethod/MethodName', 
                      'MethodDescriptionText',
                      'LaboratoryName',
                      'AnalysisStartDate', 
                      'ResultLaboratoryCommentText',
                      'DetectionQuantitationLimitTypeName',
                      'DetectionQuantitationLimitMeasure/MeasureValue',
                      'DetectionQuantitationLimitMeasure/MeasureUnitCode',
                      'PreparationStartDate', 
                      'ProviderName',
                      'ProjectIdentifier',
                      'ActivityConductingOrganizationText',
                      'ActivityCommentText',
                      'MeasureQualifierCode', 
                      'SampleCollectionMethod/MethodIdentifier',
                      'SampleCollectionMethod/MethodIdentifierContext',
                      'SampleCollectionMethod/MethodName',
                      'SampleCollectionEquipmentName',
                      'ResultDetectionConditionText'
                     ], inplace=True)

# # preview the data
# results.head()
# stations.head()
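
An alternative to listing every column to drop is to list the few columns to keep. The sketch below illustrates the idea on a hypothetical three-column stand-in for `results`; the keep-list is an assumption and would need to match the columns the analysis actually uses.

```python
import pandas as pd

# Hypothetical stand-in for `results`: two columns to keep, one to discard.
toy_results = pd.DataFrame({
    "MonitoringLocationIdentifier": ["SITE-1", "SITE-2"],
    "CharacteristicName": ["Nitrate", "Arsenic"],
    "LaboratoryName": ["Lab A", "Lab B"],
})

# Assumed keep-list; adjust to the columns the analysis needs.
keep_cols = ["MonitoringLocationIdentifier", "CharacteristicName"]

# Selecting a list of columns returns a new frame and leaves the original untouched.
toy_trimmed = toy_results[keep_cols].copy()
print(toy_trimmed.columns.tolist())
```

A keep-list is shorter when most columns are unwanted, and it fails loudly with a `KeyError` if an expected column is missing from the download.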

After reading the .csv files into `Pandas` dataframes, we dropped the unnecessary columns from `results` (above). The next step is to merge the two dataframes, `results` and `stations`, on the station identifier to obtain a unified dataframe.

We then set the row index of the merged dataframe to each station's 'MonitoringLocationIdentifier'. The new variable `mwd` stands for "merged well dataframe".

In [6]:
mwd = stations.merge( results, on='MonitoringLocationIdentifier' )
mwd = mwd.set_index( 'MonitoringLocationIdentifier' )
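
To see what this merge does to the row count and the index, here is a minimal sketch on hypothetical miniature versions of the two tables. The site names, coordinates, and the `CharacteristicName` column are illustrative assumptions, not values from the real data.

```python
import pandas as pd

# Hypothetical miniature versions of `stations` and `results`.
toy_stations = pd.DataFrame({
    "MonitoringLocationIdentifier": ["SITE-1", "SITE-2", "SITE-3"],
    "LatitudeMeasure": [32.69, 34.05, 38.58],
    "LongitudeMeasure": [-117.10, -118.24, -121.49],
})
toy_results = pd.DataFrame({
    "MonitoringLocationIdentifier": ["SITE-1", "SITE-1", "SITE-2"],
    "CharacteristicName": ["Nitrate", "Arsenic", "Nitrate"],
})

# Default inner join: one output row per result, with the station columns repeated.
# validate="one_to_many" raises an error if a station identifier is duplicated
# in the left-hand table.
toy_mwd = toy_stations.merge(
    toy_results, on="MonitoringLocationIdentifier", validate="one_to_many"
).set_index("MonitoringLocationIdentifier")

print(len(toy_mwd))             # SITE-3 has no results, so it is dropped
print(toy_mwd.index.is_unique)  # SITE-1 appears twice, so the index repeats
```

Two things are worth noticing: stations without any results disappear under the default inner join, and the resulting index is not unique because a station appears once per result.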

Now that we have a `Pandas` dataframe, the next step is to convert it to a `GeoPandas` dataframe.

If this is your first time using `GeoPandas`, you may need to install it. To do so, run the following command above the first line in the cell below:

`! pip install geopandas`
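
Before installing anything, it can help to check which of the required packages are already importable. The helper below, `is_installed`, is a hypothetical name introduced for this sketch; it relies only on the standard library.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# Report the status of the two packages the next cell imports.
for name in ("geopandas", "shapely"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```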

In [10]:
# import geopandas
import geopandas as gpd
from shapely.geometry import Point

# 1. We need to convert the DataFrame's Lat/Long coordinates into the appropriate shapely geometries.
#    Point takes (x, y), i.e. (longitude, latitude).
gpdgeom = [Point(xy) for xy in zip(mwd.LongitudeMeasure, mwd.LatitudeMeasure)]

# 2. Convert the dataframe to a geopandas dataframe, leaving the now-redundant
#    coordinate columns out of the copy (mwd itself is unchanged).
crs = 'EPSG:4326'  # WGS 84 geographic coordinates
mwgd = gpd.GeoDataFrame(mwd.drop(columns=['LatitudeMeasure', 'LongitudeMeasure']),
                        crs=crs, geometry=gpdgeom)

ModuleNotFoundError: No module named 'geopandas'
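
Point geometries take their coordinates as (x, y), which for geographic data means (longitude, latitude), so the coordinate order in the conversion step matters. A small range check can catch a swapped pair for California, where longitudes fall outside the valid latitude range. The helper `to_xy` and the longitude value below are hypothetical, introduced only for this sketch.

```python
def to_xy(latitude, longitude):
    """Return (x, y) in the order point geometries expect: (longitude, latitude)."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(
            f"latitude {latitude} is outside [-90, 90]; are the arguments swapped?"
        )
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude {longitude} is outside [-180, 180]")
    return (longitude, latitude)

# Latitude taken from the first station; the longitude is an illustrative value.
xy = to_xy(latitude=32.6946938, longitude=-117.1)
print(xy)
```

The check only catches swaps where one value is out of range, so it is a sanity test rather than a guarantee.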

In [13]:
! pip install shapely

Collecting shapely
  Using cached https://files.pythonhosted.org/packages/a2/fb/7a7af9ef7a35d16fa23b127abee272cfc483ca89029b73e92e93cdf36e6b/Shapely-1.6.4.post2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\allan\AppData\Local\Temp\pip-install-etyci4w_\shapely\setup.py", line 80, in <module>
        from shapely._buildcfg import geos_version_string, geos_version, \
      File "C:\Users\allan\AppData\Local\Temp\pip-install-etyci4w_\shapely\shapely\_buildcfg.py", line 200, in <module>
        lgeos = CDLL("geos_c.dll")
      File "C:\Users\allan\Anaconda3\lib\ctypes\__init__.py", line 356, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: [WinError 126] The specified module could not be found
    
    ----------------------------------------


Command "python setup.py egg_info" failed with error code 1 in C:\Users\allan\AppData\Local\Temp\pip-install-etyci4w_\shapely\


In [8]:
# Use .iloc for positional access, since the index holds station identifiers
mwd.LatitudeMeasure.iloc[0]


32.6946938