# NRW Groundwater Data - OpenHygrisC Data Engineering

Data from <br>
**[LANUV](https://www.lanuv.nrw.de/): Landesamt für Natur, Umwelt und Verbraucherschutz Nordrhein-Westfalen** <br>
(State Office for Nature, Environment and Consumer Protection NRW)

* LANUV groundwater web pages: https://www.lanuv.nrw.de/umwelt/wasser/grundwasser

Groundwater data: https://www.lanuv.nrw.de/umwelt/wasser/grundwasser/grundwasserstand/grundwasserdaten-online

ELWAS-WEB NRW - Infos zu den Grundwasserkörpern (YouTube): https://www.youtube.com/watch?v=4wFKIu622rk

In the HygrisC database, LANUV provides groundwater quality and quantity data for most groundwater wells in NRW. The wells are owned and operated partly by the state of NRW and partly by other parties.
Measurements are usually taken annually; some groundwater wells are sampled more frequently.

WRRL: EU Wasserrahmenrichtlinie, EU Water Framework Directive

The quality data is based on chemical analyses of groundwater samples. The quantity data is based on groundwater level measurements.


OpenHygrisC Data: https://www.opengeodata.nrw.de/produkte/umwelt_klima/wasser/grundwasser/hygrisc/

**Download the NRW groundwater data zip file**:
<br>
https://www.opengeodata.nrw.de/produkte/umwelt_klima/wasser/grundwasser/hygrisc/OpenHygrisC_gw-messstellen-messwerte_EPSG25832_CSV.zip

The zip archive contains gw station info, a catalog of possible physico-chemical analysis parameters, and the measured data. 
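
As a minimal sketch (not part of the original workflow), the contents of the downloaded archive can be inspected with Python's standard `zipfile` module before anything is extracted. The helper name `list_csv_members` and the local file name in the usage comment are assumptions for illustration.

```python
import zipfile


def list_csv_members(zip_source):
    """Return the sorted names of all CSV files inside a zip archive.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".csv")
        )


# Usage (assumes the archive was saved under this hypothetical local name):
# for name in list_csv_members("OpenHygrisC.zip"):
#     print(name)
```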

## Coordinate Obfuscation 

Some coordinates in the gw station info need special handling. The coordinate reference system (CRS) is the projected, metric
EPSG:25832 (ETRS89 / UTM zone 32N).
The dataframe coordinate columns `e32` (easting) and `n32` (northing) are of data type object (not numeric).

The resolution is 1 m, but for privacy reasons many coordinates are obscured to a precision of 100 m. A few coordinates are missing, i.e. either empty (NaN) or filled with `xx`.


The coordinate columns `e32` and `n32` are of data type object/string. Four kinds of values must be distinguished; they are illustrated by cases (1) to (5) in the table below:

* Most strings are in a regular number format and can be converted to float right away (cases (1) and (2) in the table).
* Other coordinate strings are obfuscated by replacing the two least significant digits with the characters "xx". This usually happens when a groundwater well is on private property: the coordinates are made less precise to respect privacy. The remaining coordinate information is still usable, with the precision limited to 100 m and an uncertainty of ±50 m (case (3) in the table).
* In some cases no coordinate information is given at all. In these cases the coordinate strings are just "xx" (case (4) in the table).
* In very few cases the coordinate columns are empty, i.e. NaN (Null) (case (5) in the table).

The following table shows representative cases.


| case |   messstelle_id | e32    | n32     | grundstueck   |
|-----:|----------------:|-------:|--------:|:--------------|
|  (1) |        10000094 | 292868 | 5632572 | oeffentlich   |
|  (2) |        10000045 | 299399 | 5650595 | privat        |
|  (3) |        10000033 | 3070xx | 56583xx | privat        |
|  (4) |        47247101 | xx     | xx      |               |
|  (5) |        79921802 | nan    | nan     |               |

Cases (1) and (2) have coordinate strings that convert directly to integer or float with 1 m precision. Case (3) shows coordinate obfuscation to a precision of 100 m: the digits representing tens and ones are anonymized. Cases (4) and (5) carry no usable coordinate information. Note that the raw CSV writes the placeholder in upper case (`XX`), as the `df.head()` output further down shows.

How to deal with non-anonymized data:

"299399" (string, prec. 1) => 299399.0 (float) 

How to deal with anonymization:

307000 <= 3070xx <= 307099

"3070xx" (string, prec. 100) => 307050 (float, +- 50m) 

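The conversion rules above can be sketched as a small pure-Python function. The name `parse_coord` and the `(value, precision)` return shape are illustrative assumptions, not part of the data set or the notebook. Missing coordinates are returned as `None` here rather than as a sentinel number.

```python
def parse_coord(text):
    """Convert an e32/n32 string to (value_in_m, precision_in_m).

    Returns (None, None) when no usable coordinate is present.
    """
    if not isinstance(text, str):
        return None, None            # NaN / None: the column was empty
    text = text.strip()
    if text.isdigit():
        return float(text), 1        # plain number, 1 m precision
    if len(text) > 2 and text[:-2].isdigit() and text[-2:].lower() == "xx":
        # Obfuscated: the true value lies in [prefix00, prefix99];
        # use the interval midpoint, uncertainty about +-50 m.
        return float(text[:-2] + "50"), 100
    return None, None                # bare "xx" or anything unparseable


print(parse_coord("299399"))   # (299399.0, 1)
print(parse_coord("3070xx"))   # (307050.0, 100)
print(parse_coord("xx"))       # (None, None)
```

The comparison is case-insensitive so that both `xx` and `XX` are recognized.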


In [1]:
#!conda env list

## Correct wrong `PROJ_LIB` environment variable value 

This problem seems to occur on Windows when using the OSGeo4W installer. `PROJ_LIB` must point to the `proj` directory of the activated conda environment, which lives in a user-specific directory, e.g. `PROJ_LIB=C:\Users\<username>\Anaconda3\envs\geo\Library\share\proj`.

In [2]:
import os
os.environ['PROJ_LIB']

'C:\\Program Files\\PostgreSQL\\13\\share\\contrib\\postgis-3.4\\proj'

In [3]:
# Correct wrong environment variable value occurring when using OSGeo4W installer.
# Upper-case variable names also work on case-sensitive platforms.
conda_prefix = os.environ['CONDA_PREFIX']
print(f"CONDA_PREFIX: {conda_prefix:s}")
os.environ['PROJ_LIB'] = conda_prefix + r"\Library\share\proj"
proj_lib = os.environ['PROJ_LIB']
print(f"New env var value: \nPROJ_LIB={proj_lib:s}")

CONDA_PREFIX: C:\Users\shrey\anaconda3\envs\geo
New env var value: 
PROJ_LIB=C:\Users\shrey\anaconda3\envs\geo\Library\share\proj
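
A slightly more defensive variant of this fix is sketched here. It builds the path with `os.path.join` and only overrides `PROJ_LIB` if the directory actually contains `proj.db`, the database file that PROJ expects in its data directory. The helper name `conda_proj_dir` and the Windows-style `Library/share/proj` layout are assumptions taken from the example path above; other platforms lay out the environment differently.

```python
import os


def conda_proj_dir(conda_prefix):
    """Return the PROJ data directory inside a conda env, or None.

    Assumes the Windows conda layout <prefix>/Library/share/proj.
    """
    candidate = os.path.join(conda_prefix, "Library", "share", "proj")
    if os.path.isfile(os.path.join(candidate, "proj.db")):
        return candidate
    return None


prefix = os.environ.get("CONDA_PREFIX")
if prefix:
    proj_dir = conda_proj_dir(prefix)
    if proj_dir:
        # Must happen before geopandas is imported.
        os.environ["PROJ_LIB"] = proj_dir
```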


## Imports

In [4]:
# Correct the wrong PROJ_LIB path first (see above); otherwise geopandas does not load!
import pandas as pd
import geopandas as gpd

## Data Directories and Files

In [68]:
station_data_in_dir = r"../data/OpenGeodata.NRW/OpenHygrisC/OpenHygrisC_gw-messstelle_EPSG25832_CSV/"
# List the files in the station data directory
for elt in os.listdir(station_data_in_dir):
    print(elt)

.ipynb_checkpoints
OpenHygrisC_gw-chemischer-messwert_2020-2029_EPSG25832_CSV
OpenHygrisC_gw-messstelle.csv
OpenHygrisC_gw-messstelle1-utf8.csv
OpenHygrisC_gw-messstelle1.csv


## GW Station Data


In [69]:
gw_station_fname = r"OpenHygrisC_gw-messstelle.csv"
# os.path.join avoids a doubled path separator
gw_station_pfname = os.path.join(station_data_in_dir, gw_station_fname)
print(f"Station data:  {gw_station_pfname:s}")

Station data:  ../data/OpenGeodata.NRW/OpenHygrisC/OpenHygrisC_gw-messstelle_EPSG25832_CSV/OpenHygrisC_gw-messstelle.csv


In [30]:
df = pd.read_csv(gw_station_pfname, sep = ";", encoding="cp1252", index_col=["messstelle_id"])

In [31]:
df.sort_index(ascending=True, inplace=True)

In [32]:
num_total = df.shape[0]
df.shape

(72528, 38)

In [33]:
print(f"{pd.get_option('display.max_columns') = }")
pd.set_option("display.max_columns", None)
print(f"{pd.get_option('display.max_columns') = }")

pd.get_option('display.max_columns') = None
pd.get_option('display.max_columns') = None


In [34]:
df.head()

messstelle_id,sl_nr,name,e32,n32,gw_stockwerk,grundstueck,gemeinde_id,gwhorizont_id,gwhorizont,gwleiter_id,gwleiter,einrichtungsgrund,gwk_lage_auf_id,gwk_lage_id,gwk_monitoring_auf_id,gwk_monitoring_id,messprogramm,turnus_wasserstand,freigabe_wstd,freigabe_chemie,freigabe_lage,wasserstandsmessstelle,guetemessstelle,im_wrrl_messnetz_chemie,im_wrrl_messnetz_wasserstand,messstellenart,wasserart,labor,beobachtung_wasserstand,eigentuemer,betreiber,filterlaenge_cm,sumpfrohrlaenge_cm,ausbaudurchmesser_mm,historischer_ruhe_wsp,einbaulaenge_cm,oberkante_filter_cm,unterkante_filter_cm
10000008,70796,P3-BSR/Mariaschacht P 5 neu,3068XX,56254XX,1.0,,5334032,,,,,GW-Beschaffenheit,01.07.2016,282_11,01.07.2016,282_11,RWÜ (Messstelle f. GWÜ geeignet),monatlich,nein,nein,nein,ja,ja,nein,nein,GW-Messstelle,reines Grundwasser,-,-,BSR Schotterwerk ...,Enwor GmbH ...,700.0,0.0,80.0,,1650.0,20973.0,20273.0
10000010,1,SCHERPENSEEL NR 1,2935XX,56452XX,1.0,privat,5370028,6D,Neurather Sand,,,LGD,01.07.2016,28_04,01.07.2016,28_04,Grundwassergüteüberwachung,Messstelle besteht nicht mehr,ja,ja,nein,ja,ja,nein,nein,GW-Messstelle,keine Angabe,-,-,Land NRW ...,keine Angabe ...,400.0,,100.0,,4936.0,7731.0,7331.0
10000021,2,Bellinghoven Nr. 2,312776,5660432,1.0,privat,5370004,14,Ältere Hauptterrassen,,,LGD,01.07.2016,286_07,01.07.2016,286_07,,Messstelle inaktiv,ja,nein,ja,ja,nein,nein,nein,Schachtbrunnen,,,-,keine Angabe ...,keine Angabe ...,,,1000.0,,1411.0,7885.0,7885.0
10000033,3,Doveren Nr. 3,3070XX,56583XX,1.0,privat,5370020,16,Jüngere Hauptterrassen mit Lößauflagerung,,,LGD,01.07.2016,282_01,01.07.2016,282_01,Emittentenmst./Anlagenüberw.,Messstelle inaktiv,ja,nein,nein,ja,ja,nein,nein,Schachtbrunnen,keine Angabe,-,-,Privatperson ...,keine Angabe ...,,,1000.0,,755.0,4847.0,4847.0
10000045,4,Geilenkirchen Nr. 5,299399,5650595,1.0,privat,5370012,10,Sande und Kiese,,,LGD,01.07.2016,282_03,01.07.2016,282_03,,Messstelle besteht nicht mehr,ja,nein,ja,ja,nein,nein,nein,Vertikalfilterbrunnen,,,-,Privatperson ...,keine Angabe ...,200.0,,1000.0,,1079.0,6140.0,5940.0


In [12]:
#df[df["grundstueck"]=="oeffentlich"].head()

## Challenge: Coordinates obfuscation

The coordinate columns `e32` and `n32` are of data type string. Four cases must be distinguished:

(1) Most strings are in a regular number format and can be converted to float right away.

(2) Other coordinate strings are obfuscated by replacing the two least significant digits with the characters "xx". This usually happens when a groundwater well is on private property: the coordinates are made less precise to respect privacy. The remaining coordinate information is still usable, with the precision limited to 100 m and an uncertainty of ±50 m.

(3) In some cases no coordinate information is given at all. In these cases the coordinate strings are just "xx".

(4) In very few cases the coordinate columns are empty, i.e. NaN (Null).

In [35]:
# These five groundwater wells summarize the coordinate problems.
#df_coord_problem=df.loc[[10000094, 10000045, 10000033, 47247101, 79921802],["e32","n32", "grundstueck"]]
#df_coord_problem

In [36]:
# format table as markdown
#from tabulate import tabulate
#print(tabulate(df_coord_problem, tablefmt="pipe", headers="keys"))

|   messstelle_id | e32    | n32     | grundstueck   |
|----------------:|:-------|:--------|:--------------|
|        10000094 | 292868 | 5632572 | oeffentlich   |
|        10000045 | 299399 | 5650595 | privat        |
|        10000033 | 3070xx | 56583xx | privat        |
|        47247101 | xx     | xx      |               |
|        79921802 | nan    | nan     |               |

**Boolean indexes are used to filter the data according to the cases (1) to (4).**

In [37]:
df

messstelle_id,sl_nr,name,e32,n32,gw_stockwerk,grundstueck,gemeinde_id,gwhorizont_id,gwhorizont,gwleiter_id,gwleiter,einrichtungsgrund,gwk_lage_auf_id,gwk_lage_id,gwk_monitoring_auf_id,gwk_monitoring_id,messprogramm,turnus_wasserstand,freigabe_wstd,freigabe_chemie,freigabe_lage,wasserstandsmessstelle,guetemessstelle,im_wrrl_messnetz_chemie,im_wrrl_messnetz_wasserstand,messstellenart,wasserart,labor,beobachtung_wasserstand,eigentuemer,betreiber,filterlaenge_cm,sumpfrohrlaenge_cm,ausbaudurchmesser_mm,historischer_ruhe_wsp,einbaulaenge_cm,oberkante_filter_cm,unterkante_filter_cm
10000008,70796,P3-BSR/Mariaschacht P 5 neu,3068XX,56254XX,1.0,,05334032,,,,,GW-Beschaffenheit,01.07.2016,282_11,01.07.2016,282_11,RWÜ (Messstelle f. GWÜ geeignet),monatlich,nein,nein,nein,ja,ja,nein,nein,GW-Messstelle,reines Grundwasser,-,-,BSR Schotterwerk ...,Enwor GmbH ...,700.0,0.0,80.0,,1650.0,20973.0,20273.0
10000010,1,SCHERPENSEEL NR 1,2935XX,56452XX,1.0,privat,05370028,6D,Neurather Sand,,,LGD,01.07.2016,28_04,01.07.2016,28_04,Grundwassergüteüberwachung,Messstelle besteht nicht mehr,ja,ja,nein,ja,ja,nein,nein,GW-Messstelle,keine Angabe,-,-,Land NRW ...,keine Angabe ...,400.0,,100.0,,4936.0,7731.0,7331.0
10000021,2,Bellinghoven Nr. 2,312776,5660432,1.0,privat,05370004,14,Ältere Hauptterrassen,,,LGD,01.07.2016,286_07,01.07.2016,286_07,,Messstelle inaktiv,ja,nein,ja,ja,nein,nein,nein,Schachtbrunnen,,,-,keine Angabe ...,keine Angabe ...,,,1000.0,,1411.0,7885.0,7885.0
10000033,3,Doveren Nr. 3,3070XX,56583XX,1.0,privat,05370020,16,Jüngere Hauptterrassen mit Lößauflagerung,,,LGD,01.07.2016,282_01,01.07.2016,282_01,Emittentenmst./Anlagenüberw.,Messstelle inaktiv,ja,nein,nein,ja,ja,nein,nein,Schachtbrunnen,keine Angabe,-,-,Privatperson ...,keine Angabe ...,,,1000.0,,755.0,4847.0,4847.0
10000045,4,Geilenkirchen Nr. 5,299399,5650595,1.0,privat,05370012,10,Sande und Kiese,,,LGD,01.07.2016,282_03,01.07.2016,282_03,,Messstelle besteht nicht mehr,ja,nein,ja,ja,nein,nein,nein,Vertikalfilterbrunnen,,,-,Privatperson ...,keine Angabe ...,200.0,,1000.0,,1079.0,6140.0,5940.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289382518,72729,Hackenbroich H20H,345602,5659991,1.0,,05162004,19,Niederterrassen mit Lößauflagerung,,,keine Angabe,01.07.2016,27_20,01.07.2016,27_20,,halbjährlich,nein,nein,ja,ja,nein,nein,nein,GW-Messstelle,,,-,,,2000.0,100.0,65.0,,3070.0,3442.0,1442.0
289382520,72730,Hackenbroich H20H,345602,5659991,1.0,,05162004,19,Niederterrassen mit Lößauflagerung,,,keine Angabe,01.07.2016,27_20,01.07.2016,27_20,,halbjährlich,nein,nein,ja,ja,nein,nein,nein,GW-Messstelle,,,-,,,2000.0,100.0,150.0,,3070.0,3431.0,1431.0
289382610,72927,Hackenbroich H25T,345313,5660272,1.0,,05162004,19,Niederterrassen mit Lößauflagerung,,,keine Angabe,01.07.2016,274_01,01.07.2016,274_01,,Messstelle inaktiv,nein,nein,ja,ja,nein,nein,nein,GW-Messstelle,,,-,,,1000.0,,65.0,,1700.0,3659.0,2659.0
289382622,72928,Hackenbroich H25T,345313,5660272,1.0,,05162004,19,Niederterrassen mit Lößauflagerung,,,keine Angabe,01.07.2016,274_01,01.07.2016,274_01,,Messstelle inaktiv,nein,nein,ja,ja,nein,nein,nein,GW-Messstelle,,,-,,,900.0,,65.0,,2650.0,2608.0,1708.0


In [38]:
# Add column for precision
df["genau"] = 0

# (1) If the coord data is numeric, the precision is 1 m.
#     str.isnumeric() yields NaN for missing values; "== True" turns that into False.
idx_coords_1m_prec = (df["e32"].str.isnumeric() == True)

# (3, 4) Some stations don't have coordinates:
# the e32 and n32 strings are either "XX" or NaN (Null).
idx_coords_missing = (df["e32"].str.len() < 6) | df["e32"].isnull()

# (2) If coord data is available but not numeric, the two least significant
#     digits have been obscured with "XX".
idx_coords_100m_prec = ~idx_coords_missing & ~idx_coords_1m_prec


In [39]:
df[idx_coords_missing]

messstelle_id,sl_nr,name,e32,n32,gw_stockwerk,grundstueck,gemeinde_id,gwhorizont_id,gwhorizont,gwleiter_id,gwleiter,einrichtungsgrund,gwk_lage_auf_id,gwk_lage_id,gwk_monitoring_auf_id,gwk_monitoring_id,messprogramm,turnus_wasserstand,freigabe_wstd,freigabe_chemie,freigabe_lage,wasserstandsmessstelle,guetemessstelle,im_wrrl_messnetz_chemie,im_wrrl_messnetz_wasserstand,messstellenart,wasserart,labor,beobachtung_wasserstand,eigentuemer,betreiber,filterlaenge_cm,sumpfrohrlaenge_cm,ausbaudurchmesser_mm,historischer_ruhe_wsp,einbaulaenge_cm,oberkante_filter_cm,unterkante_filter_cm,genau
36446518,47111,WA-Lörick LR - RMM,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
36487600,46659,Wittlaer,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
47199003,47636,Sammelleitung 1-7,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
47200005,47638,Sammelleitung 10-20,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
47202002,47647,Sammelleitung 21-30,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
47247101,46769,RM-Moers Gerdt,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Rohmischwassermessstelle,keine Angabe,-,-,,,,,,,,,,0
47299009,47658,RM-Bucholtwelmen,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Rohmischwassermessstelle,keine Angabe,-,-,,,,,,,,,,0
59621035,46215,RM-Brunnen Stortel,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),,nein,nein,nein,nein,ja,nein,nein,Rohmischwassermessstelle,keine Angabe,-,,,,,,,,,,,0
68011003,47697,WW.HALTERNHAARD-MI,XX,XX,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),,nein,nein,nein,nein,ja,nein,nein,Sammelmessstelle,keine Angabe,-,,,,,,,,,,,0
68012007,14582,WW.HALTERNWEST-MI,XX,XX,1.0,,,,,,,keine Angabe,,,,,RWÜ (keine GWÜ-Messstellen),,nein,nein,nein,nein,ja,nein,nein,Sammelmessstelle,keine Angabe,-,,,,,,,,,,,0


**Convert the strings to floats where possible. Missing coordinates are flagged with negative sentinel values (-999.9 for the coordinates, -999 for the precision).**

In [40]:
df.loc[idx_coords_1m_prec,"e32num"] = df.loc[idx_coords_1m_prec,"e32"].astype(float)
df.loc[idx_coords_1m_prec,"n32num"] = df.loc[idx_coords_1m_prec,"n32"].astype(float)
df.loc[idx_coords_1m_prec, "genau"] = 1

In [41]:
df.loc[idx_coords_100m_prec,"e32num"] = (df.loc[idx_coords_100m_prec,"e32"].str[:-2]+"50").astype(float)
df.loc[idx_coords_100m_prec,"n32num"] = (df.loc[idx_coords_100m_prec,"n32"].str[:-2]+"50").astype(float)
df.loc[idx_coords_100m_prec, "genau"] = 100

In [42]:
df.loc[idx_coords_missing,"e32num"] = -999.9
df.loc[idx_coords_missing,"n32num"] = -999.9
df.loc[idx_coords_missing, "genau"] = -999

In [43]:
# check if all records have been matched
num_of_1m_prec = df[df["genau"] == 1].shape[0]
num_of_100m_prec = df[df["genau"] == 100].shape[0]
num_of_no_prec = df[df["genau"] == -999].shape[0]

num_check = num_of_1m_prec + num_of_100m_prec + num_of_no_prec

print(f"total num of recs:                        {num_total:6d}")
print(f"number of recs with 1m coord precision:   {num_of_1m_prec:6d}")
print(f"number of recs with 100m coord precision: {num_of_100m_prec:6d}")
print(f"number of recs with no coords:            {num_of_no_prec:6d}")
print(f"check sum:                                {num_check:6d}")

assert num_check == num_total, "ERROR. Mismatch in numbers of stations"


total num of recs:                         72528
number of recs with 1m coord precision:    60885
number of recs with 100m coord precision:  11625
number of recs with no coords:                18
check sum:                                 72528


**Save the original string as well as the derived numeric columns to a CSV file for checking externally.**

In [44]:
df[["e32","e32num","n32","n32num","genau"]].to_csv("check.csv")
df[["e32","e32num","n32","n32num","genau"]]

messstelle_id,e32,e32num,n32,n32num,genau
10000008,3068XX,306850.0,56254XX,5625450.0,100
10000010,2935XX,293550.0,56452XX,5645250.0,100
10000021,312776,312776.0,5660432,5660432.0,1
10000033,3070XX,307050.0,56583XX,5658350.0,100
10000045,299399,299399.0,5650595,5650595.0,1
...,...,...,...,...,...
289382518,345602,345602.0,5659991,5659991.0,1
289382520,345602,345602.0,5659991,5659991.0,1
289382610,345313,345313.0,5660272,5660272.0,1
289382622,345313,345313.0,5660272,5660272.0,1


## Geopandas

In [45]:
import geopandas as gpd
from shapely.geometry import Point

In [46]:
# remove records without coords
df2 = df[df["genau"] > 0]

In [47]:
df2.shape

(72510, 41)

In [48]:
%%time
gdf = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.e32num, df2.n32num), crs="EPSG:25832")

CPU times: total: 62.5 ms
Wall time: 87.8 ms


In [49]:
gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 72510 entries, 10000008 to 289382713
Data columns (total 42 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   sl_nr                         72510 non-null  int64   
 1   name                          72510 non-null  object  
 2   e32                           72510 non-null  object  
 3   n32                           72510 non-null  object  
 4   gw_stockwerk                  54887 non-null  float64 
 5   grundstueck                   72510 non-null  object  
 6   gemeinde_id                   72510 non-null  object  
 7   gwhorizont_id                 28561 non-null  object  
 8   gwhorizont                    28561 non-null  object  
 9   gwleiter_id                   2308 non-null   object  
 10  gwleiter                      2308 non-null   object  
 11  einrichtungsgrund             72510 non-null  object  
 12  gwk_lage_auf_id               72

In [50]:
gdf.head(3)

messstelle_id,sl_nr,name,e32,n32,gw_stockwerk,grundstueck,gemeinde_id,gwhorizont_id,gwhorizont,gwleiter_id,gwleiter,einrichtungsgrund,gwk_lage_auf_id,gwk_lage_id,gwk_monitoring_auf_id,gwk_monitoring_id,messprogramm,turnus_wasserstand,freigabe_wstd,freigabe_chemie,freigabe_lage,wasserstandsmessstelle,guetemessstelle,im_wrrl_messnetz_chemie,im_wrrl_messnetz_wasserstand,messstellenart,wasserart,labor,beobachtung_wasserstand,eigentuemer,betreiber,filterlaenge_cm,sumpfrohrlaenge_cm,ausbaudurchmesser_mm,historischer_ruhe_wsp,einbaulaenge_cm,oberkante_filter_cm,unterkante_filter_cm,genau,e32num,n32num,geometry
10000008,70796,P3-BSR/Mariaschacht P 5 neu,3068XX,56254XX,1.0,,5334032,,,,,GW-Beschaffenheit,01.07.2016,282_11,01.07.2016,282_11,RWÜ (Messstelle f. GWÜ geeignet),monatlich,nein,nein,nein,ja,ja,nein,nein,GW-Messstelle,reines Grundwasser,-,-,BSR Schotterwerk ...,Enwor GmbH ...,700.0,0.0,80.0,,1650.0,20973.0,20273.0,100,306850.0,5625450.0,POINT (306850.000 5625450.000)
10000010,1,SCHERPENSEEL NR 1,2935XX,56452XX,1.0,privat,5370028,6D,Neurather Sand,,,LGD,01.07.2016,28_04,01.07.2016,28_04,Grundwassergüteüberwachung,Messstelle besteht nicht mehr,ja,ja,nein,ja,ja,nein,nein,GW-Messstelle,keine Angabe,-,-,Land NRW ...,keine Angabe ...,400.0,,100.0,,4936.0,7731.0,7331.0,100,293550.0,5645250.0,POINT (293550.000 5645250.000)
10000021,2,Bellinghoven Nr. 2,312776,5660432,1.0,privat,5370004,14,Ältere Hauptterrassen,,,LGD,01.07.2016,286_07,01.07.2016,286_07,,Messstelle inaktiv,ja,nein,ja,ja,nein,nein,nein,Schachtbrunnen,,,-,keine Angabe ...,keine Angabe ...,,,1000.0,,1411.0,7885.0,7885.0,1,312776.0,5660432.0,POINT (312776.000 5660432.000)


In [None]:
%%time

# This takes 90 secs on my computer!

#gdf.to_file("GW_Stations.gpkg", layer='GW Stations', driver="GPKG")

## PostGIS, Inline SQL Magic: `create schema gw`

To store the data in PostGIS/PostgreSQL, it is recommended to create a dedicated database schema (a kind of namespace) that separates relations (tables, views), stored procedures, etc. from the rest of the database. Schemas help to organize tables and access privileges clearly.


In [56]:
#!conda install -c conda-forge ipython-sql
!pip install ipython-sql

Collecting ipython-sql
  Downloading ipython_sql-0.5.0-py3-none-any.whl (20 kB)
Collecting prettytable (from ipython-sql)
  Obtaining dependency information for prettytable from https://files.pythonhosted.org/packages/4d/81/316b6a55a0d1f327d04cc7b0ba9d04058cb62de6c3a4d4b0df280cbe3b0b/prettytable-3.9.0-py3-none-any.whl.metadata
  Downloading prettytable-3.9.0-py3-none-any.whl.metadata (26 kB)
Collecting sqlparse (from ipython-sql)
  Downloading sqlparse-0.4.4-py3-none-any.whl (41 kB)
Downloading prettytable-3.9.0-py3-none-any.whl (27 kB)
Installing collected packages: sqlparse, prettytable, ipython-sql
Successfully installed ipython-sql-0.5.0 prettytable-3.9.0 sqlparse-0.4.4


In [57]:
%load_ext sql

In [58]:
print("Connect")
%sql postgresql://env_master:M123xyz@localhost/env_db

Connect


In [59]:
%%sql
SELECT * FROM information_schema.schemata

 * postgresql://env_master:***@localhost/env_db
5 rows affected.


catalog_name,schema_name,schema_owner,default_character_set_catalog,default_character_set_schema,default_character_set_name,sql_path
env_db,public,pg_database_owner,,,,
env_db,information_schema,postgres,,,,
env_db,pg_catalog,postgres,,,,
env_db,pg_toast,postgres,,,,
env_db,gw_test,env_master,,,,


In [60]:
%%sql
CREATE SCHEMA IF NOT EXISTS gw AUTHORIZATION env_master

 * postgresql://env_master:***@localhost/env_db
Done.


[]

In [61]:
%%sql
SELECT * FROM information_schema.schemata;

 * postgresql://env_master:***@localhost/env_db
6 rows affected.


catalog_name,schema_name,schema_owner,default_character_set_catalog,default_character_set_schema,default_character_set_name,sql_path
env_db,public,pg_database_owner,,,,
env_db,information_schema,postgres,,,,
env_db,pg_catalog,postgres,,,,
env_db,pg_toast,postgres,,,,
env_db,gw,env_master,,,,
env_db,gw_test,env_master,,,,


## PostGIS: Upload GeoDataFrame with `gdf.to_postgis()`

Dependencies:
* psycopg2
* geoalchemy2

In [None]:
#!conda install -c conda-forge geoalchemy2 psycopg2

In [62]:
import sqlalchemy
engine = sqlalchemy.create_engine("postgresql://env_master:M123xyz@localhost/env_db")
# fast_executemany=True
# use_batch_mode=True
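
Hard-coding the password in the notebook is convenient for a local test database but easy to leak. As a hedged sketch, the connection URL could be assembled from environment variables instead. The function name `build_pg_url` and the variable names `PG_USER` and `PG_PASSWORD` are assumptions for illustration; `quote_plus` escapes characters such as `@` or `/` that would otherwise break the URL.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(user, password, host="localhost", database="env_db"):
    """Assemble a PostgreSQL connection URL with escaped credentials."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Hypothetical environment variable names; set them outside the notebook.
url = build_pg_url(
    os.environ.get("PG_USER", "env_master"),
    os.environ.get("PG_PASSWORD", ""),
)
# engine = sqlalchemy.create_engine(url)
```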

In [65]:
%%time
gdf.to_postgis(con=engine, name="gw_stations", schema="gw", index=True, chunksize=100, if_exists="replace")

CPU times: total: 1.75 s
Wall time: 37.6 s


Create primary key!

In [66]:
%%sql
alter table gw.gw_stations add constraint pk_gw_stations primary key (messstelle_id)

 * postgresql://env_master:***@localhost/env_db
Done.


[]

# Groundwater "Quality Data": Chemistry!

## Data Directories and Files

In [70]:
quality_data_in_dir = r"..\data\OpenGeodata.NRW\OpenHygrisC\OpenHygrisC_gw-chemischer-messwert_EPSG25832_CSV\OpenHygrisC_gw-chemischer-messwert_1990-1999_EPSG25832_CSV"
gw_quality_fname = r"\gw-chemischer-messwert-1990-1999.csv"
gw_quality_pfname = quality_data_in_dir + gw_quality_fname
print(f"Quality data: {gw_quality_pfname:s}")

Quality data: ..\data\OpenGeodata.NRW\OpenHygrisC\OpenHygrisC_gw-chemischer-messwert_EPSG25832_CSV\OpenHygrisC_gw-chemischer-messwert_1990-1999_EPSG25832_CSV\gw-chemischer-messwert-1990-1999.csv


In [71]:
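# Read only the header line; the context manager closes the file again.
with open(gw_quality_pfname, "r", encoding="cp1252", newline="") as fh:
    s = fh.readline()
s = s.replace('"', '').strip()
# Split the full line: slicing with s[1:] would cut the first letter of "sl_nr".
header_de = s.split(';')
header_de

['sl_nr',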
 'messstelle_id',
 'messstelle_sl_nr',
 'datum_pn',
 'stoff_nr',
 'probengut',
 'messergebnis_c',
 'messergebnis_hinweis',
 'bestimmungsgrenze',
 'masseinheit',
 'trennverfahren',
 'verfahren',
 'vor_ort']
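
The same header can also be read with the standard `csv` module, which handles the quoting and the `;` delimiter without manual string manipulation. This is a sketch; the helper name `read_header` is an assumption.

```python
import csv


def read_header(path, encoding="cp1252", delimiter=";"):
    """Return the column names from the first row of a delimited text file."""
    with open(path, "r", encoding=encoding, newline="") as fh:
        return next(csv.reader(fh, delimiter=delimiter))


# Usage with the path variable defined above:
# header_de = read_header(gw_quality_pfname)
```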

In [72]:
#df['messstell'] = df['Value'].str.replace(',', '.', regex=True)

In [73]:
%time df_qual = pd.read_csv(gw_quality_pfname, sep = ";", encoding="cp1252", dtype = {"messergebnis_c":str ,"messergebnis_hinweis":str }, nrows = 5)

CPU times: total: 0 ns
Wall time: 8.06 ms


In [74]:
df_qual

Unnamed: 0,sl_nr,messstelle_id,messstelle_sl_nr,datum_pn,stoff_nr,probengut,messergebnis_c,messergebnis_hinweis,bestimmungsgrenze,masseinheit,trennverfahren,verfahren,vor_ort
0,10472780,289002916,18994,1991-06-24,1247,Grundwasser,"<0,03049",Konzentration zu gering zur Bestimmung ...,,mg/l,Gesamtgehalt,,
1,758273,10203230,252,1990-09-18,2001,Grundwasser,"<0,10000",Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
2,758278,10203230,252,1990-09-18,2010,Grundwasser,"<0,10000",Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
3,758272,10203230,252,1990-09-18,2000,Grundwasser,"<10,00000",Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
4,758267,10203230,252,1990-09-18,1343,Grundwasser,"<10,00000",Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,"DIN 38409-H14 MAERZ 1985, ABSCHN. 8.2.2",


In [75]:
# Replace the German decimal comma with a point (plain string replacement, no regex needed)
df_qual["messergebnis_c"] = df_qual["messergebnis_c"].str.replace(',', '.', regex=False)
df_qual

Unnamed: 0,sl_nr,messstelle_id,messstelle_sl_nr,datum_pn,stoff_nr,probengut,messergebnis_c,messergebnis_hinweis,bestimmungsgrenze,masseinheit,trennverfahren,verfahren,vor_ort
0,10472780,289002916,18994,1991-06-24,1247,Grundwasser,<0.03049,Konzentration zu gering zur Bestimmung ...,,mg/l,Gesamtgehalt,,
1,758273,10203230,252,1990-09-18,2001,Grundwasser,<0.10000,Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
2,758278,10203230,252,1990-09-18,2010,Grundwasser,<0.10000,Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
3,758272,10203230,252,1990-09-18,2000,Grundwasser,<10.00000,Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
4,758267,10203230,252,1990-09-18,1343,Grundwasser,<10.00000,Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,"DIN 38409-H14 MAERZ 1985, ABSCHN. 8.2.2",
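
The values in `messergebnis_c` are still strings after this step. In the sample rows a leading `<` accompanies the hint "Konzentration zu gering zur Bestimmung" (concentration too low to be determined). One way to keep that information when converting to numbers is sketched below. The function name `parse_result` and the `(value, below_limit)` return shape are assumptions, and only the `<` prefix seen in the sample rows is handled.

```python
def parse_result(text):
    """Split a result string like '<0,03049' into (value, below_limit)."""
    text = text.strip()
    below_limit = text.startswith("<")
    if below_limit:
        text = text[1:]
    # Accept the German decimal comma as well as a decimal point.
    return float(text.replace(",", ".")), below_limit


print(parse_result("<0,03049"))   # (0.03049, True)
print(parse_result("10.5"))       # (10.5, False)
```

Keeping the flag in a separate column avoids silently treating "below the limit" values as exact measurements.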


In [76]:
df_qual.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   sl_nr                 5 non-null      int64  
 1   messstelle_id         5 non-null      int64  
 2   messstelle_sl_nr      5 non-null      int64  
 3   datum_pn              5 non-null      object 
 4   stoff_nr              5 non-null      int64  
 5   probengut             5 non-null      object 
 6   messergebnis_c        5 non-null      object 
 7   messergebnis_hinweis  5 non-null      object 
 8   bestimmungsgrenze     0 non-null      float64
 9   masseinheit           5 non-null      object 
 10  trennverfahren        5 non-null      object 
 11  verfahren             4 non-null      object 
 12  vor_ort               5 non-null      object 
dtypes: float64(1), int64(4), object(8)
memory usage: 652.0+ bytes


**The complete CSV file with the measured values of the chemical analyses comprises more than 3.6 million measured values!**

In [77]:
# Wall time: 13 s
%time df_qual = pd.read_csv(gw_quality_pfname, sep = ";", encoding="cp1252", index_col=["sl_nr"], \
                            dtype = {"messergebnis_c":str ,"messergebnis_hinweis":str }, \
                            parse_dates = ["datum_pn"])

CPU times: total: 2.42 s
Wall time: 2.56 s


In [78]:
df_qual.shape

(714299, 12)

In [79]:
df_qual.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714299 entries, 10472780 to 12763519
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   messstelle_id         714299 non-null  int64         
 1   messstelle_sl_nr      714299 non-null  int64         
 2   datum_pn              714299 non-null  datetime64[ns]
 3   stoff_nr              714299 non-null  int64         
 4   probengut             714299 non-null  object        
 5   messergebnis_c        714299 non-null  object        
 6   messergebnis_hinweis  714299 non-null  object        
 7   bestimmungsgrenze     24065 non-null   object        
 8   masseinheit           714299 non-null  object        
 9   trennverfahren        714299 non-null  object        
 10  verfahren             521715 non-null  object        
 11  vor_ort               714299 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage

In [80]:
# duplicate sl_nr values? Can it be a unique index?
# Result should be empty
print(df_qual[df_qual.index.duplicated()])

Empty DataFrame
Columns: [messstelle_id, messstelle_sl_nr, datum_pn, stoff_nr, probengut, messergebnis_c, messergebnis_hinweis, bestimmungsgrenze, masseinheit, trennverfahren, verfahren, vor_ort]
Index: []
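
The same check can also be expressed with `Index.is_unique`, and `duplicated(keep=False)` lists every row of a duplicate group rather than only the repeats. A minimal sketch on a made-up frame (the name `df_demo` and all `sl_nr` and `stoff_nr` values below are illustrative assumptions, not taken from the data set):

```python
import pandas as pd

# Hypothetical miniature frame; sl_nr 102 is duplicated on purpose.
df_demo = pd.DataFrame(
    {"stoff_nr": [1244, 1061, 1244]},
    index=pd.Index([101, 102, 102], name="sl_nr"),
)

# True only if every index value occurs exactly once
index_is_unique = df_demo.index.is_unique

# keep=False marks all members of a duplicate group, not only the later ones
dup_rows = df_demo[df_demo.index.duplicated(keep=False)]

print(index_is_unique)
print(dup_rows)
```

For the real `df_qual` the duplicate selection is empty, so `sl_nr` can serve as a unique index.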


## Time Series Example

In [82]:
df_qual

sl_nr,messstelle_id,messstelle_sl_nr,datum_pn,stoff_nr,probengut,messergebnis_c,messergebnis_hinweis,bestimmungsgrenze,masseinheit,trennverfahren,verfahren,vor_ort
10472780,289002916,18994,1991-06-24,1247,Grundwasser,"<0,03049",Konzentration zu gering zur Bestimmung ...,,mg/l,Gesamtgehalt,,
758273,10203230,252,1990-09-18,2001,Grundwasser,"<0,10000",Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
758278,10203230,252,1990-09-18,2010,Grundwasser,"<0,10000",Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
758272,10203230,252,1990-09-18,2000,Grundwasser,"<10,00000",Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,DIN 38407-F4 MAI 1988,
758267,10203230,252,1990-09-18,1343,Grundwasser,"<10,00000",Konzentration zu gering zur Bestimmung ...,,µg/l,Gesamtgehalt,"DIN 38409-H14 MAERZ 1985, ABSCHN. 8.2.2",
...,...,...,...,...,...,...,...,...,...,...,...,...
2148723,30302626,8174,1998-07-21,1266,Grundwasser,"0,01000",...,,mg/l,Membranfilter,DIN 38405-D11-3 OKTOBER 1983,
2148724,30302626,8174,1998-07-21,1061,Grundwasser,"6,70000",...,,-,Gesamtgehalt,DIN 38404-C5 JANUAR 1984,ja
12244907,30302626,8174,1998-07-21,1244,Grundwasser,"3,36452",...,,mg/l,Membranfilter,DIN 38405-D20 SEPTEMBER 1991,
12520549,30302626,8174,1998-07-21,1246,Grundwasser,"0,02300",...,,mg/l,Gesamtgehalt,"DIN 38405-D10 02.81,DIN EN 26777 04.1993",


In [83]:
# time series example
# stoff_nr=1244 ->"Nitrat"
idx = (df_qual["messstelle_id"] == 30302626) & (df_qual["stoff_nr"] == 1244)
df_qual.loc[idx,["datum_pn", "messergebnis_c"]].sort_values("datum_pn")

sl_nr,datum_pn,messergebnis_c
12242606,1996-03-18,"<1,32810"
12153491,1996-09-03,"<1,32810"
12153509,1997-04-04,"<1,32810"
12153524,1997-09-09,"<1,32810"
12244853,1998-03-12,"4,42700"
12244907,1998-07-21,"3,36452"
12244901,1999-06-16,"2,78901"
12245004,1999-12-17,"2,83328"


### Tests for different measurement value string cases

```
(1)   "1.00" (is_float)
(2)  "<1.00" (is_less)
(3)  ">1.00" (is_greater)
```


In [84]:
# check if string can be converted to float
def is_float(element: str) -> bool:
    try:
        float(element)
        return True
    except (ValueError, TypeError):  # TypeError: input such as None
        return False

In [85]:
# check if string starts with '<' (startswith is safe for empty strings)
def is_less(element: str) -> bool:
    return element.startswith("<")

In [86]:
# check if string starts with '>' (startswith is safe for empty strings)
def is_greater(element: str) -> bool:
    return element.startswith(">")

In [87]:
print("is_float()")
print(is_float("<1.234"))
print(is_float(">1.234"))
print(is_float("-1.234"))

is_float()
False
False
True


In [88]:
# Some test applications
print("is_less()")
print(is_less("<1.234"))
print(is_less(">1.234"))
print(is_less("1.234"))
print("is_greater()")
print(is_greater("<1.234"))
print(is_greater(">1.234"))
print(is_greater("1.234"))
print("is_float()")
print(is_float("<1.234"))
print(is_float(">1.234"))
print(is_float("1.234"))

is_less()
True
False
False
is_greater()
False
True
False
is_float()
False
False
True
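
The three tests can also be combined into a single parser that handles the German decimal comma found in the raw file. The following is a minimal sketch, not the notebook's method: the function name `parse_measurement` and the `(None, None)` fallback are assumptions made for illustration, as is the rule that a comma is always a decimal separator.

```python
from typing import Optional, Tuple


def parse_measurement(raw: str) -> Tuple[Optional[str], Optional[float]]:
    """Split a raw result string into (qualifier, value).

    qualifier is "<", ">" or "=", value is a float.
    (None, None) is returned if the string cannot be interpreted.
    """
    text = raw.strip()
    qualifier = "="
    if text.startswith(("<", ">")):
        qualifier, text = text[0], text[1:]
    try:
        # Assumption: a comma is a decimal separator, never a thousands separator
        return qualifier, float(text.replace(",", "."))
    except ValueError:
        return None, None


print(parse_measurement("<0,03049"))
print(parse_measurement("1.234"))
print(parse_measurement("not a number"))
```

Returning the qualifier and the number together keeps both pieces of information for censored values such as `<0,03049`.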


In [89]:
# Apply the tests and create Boolean indexes
%time idx_mess_is_float   = df_qual["messergebnis_c"].apply(is_float)
%time idx_mess_is_less    = df_qual["messergebnis_c"].apply(is_less)
%time idx_mess_is_greater = df_qual["messergebnis_c"].apply(is_greater)

CPU times: total: 1.34 s
Wall time: 1.38 s
CPU times: total: 188 ms
Wall time: 185 ms
CPU times: total: 78.1 ms
Wall time: 179 ms


In [90]:
print(idx_mess_is_greater)

sl_nr
10472780    False
758273      False
758278      False
758272      False
758267      False
            ...  
2148723     False
2148724     False
12244907    False
12520549    False
12763519    False
Name: messergebnis_c, Length: 714299, dtype: bool


In [93]:
# Assert that no record is neither less nor greater nor float -> the selection should be an empty data frame
assert df_qual[~idx_mess_is_less & ~idx_mess_is_greater & ~idx_mess_is_float].shape[0] == 0

AssertionError: 

In [92]:

# Dataframe should be empty
print(df_qual[~idx_mess_is_less & ~idx_mess_is_greater & ~idx_mess_is_float])

          messstelle_id  messstelle_sl_nr   datum_pn  stoff_nr    probengut  \
sl_nr                                                                         
758239         10203230               252 1990-09-18      1011  Grundwasser   
758240         10203230               252 1990-09-18      1061  Grundwasser   
758241         10203230               252 1990-09-18      1061  Grundwasser   
758242         10203230               252 1990-09-18      1082  Grundwasser   
758243         10203230               252 1990-09-18      1082  Grundwasser   
...                 ...               ...        ...       ...          ...   
2148723        30302626              8174 1998-07-21      1266  Grundwasser   
2148724        30302626              8174 1998-07-21      1061  Grundwasser   
12244907       30302626              8174 1998-07-21      1244  Grundwasser   
12520549       30302626              8174 1998-07-21      1246  Grundwasser   
12763519       30302626              8174 1998-07-21

In [44]:
# res = (~idx_mess_is_less & ~idx_mess_is_greater & ~idx_mess_is_float).value_counts()
res = (idx_mess_is_less | idx_mess_is_greater | idx_mess_is_float).value_counts()
res

messergebnis_c
False    402875
True     311424
Name: count, dtype: int64
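
The large `False` count is most likely caused by the decimal comma in the raw strings: `float("0,03049")` raises `ValueError`, so plain values written with a comma are not recognised by `is_float()`. A hedged sketch of a vectorised classification that normalises the separator first (the sample strings and the variable names `raw`, `normalized`, `is_plain` and `covered` are illustrative assumptions):

```python
import pandas as pd

# Made-up sample strings in the style of the messergebnis_c column
raw = pd.Series(["<0,03049", "1,25000", ">0,00000", "7,10000"], name="messergebnis_c")

# Literal replacement of the decimal comma, no regular expression needed
normalized = raw.str.replace(",", ".", regex=False)

is_less = normalized.str.startswith("<")
is_greater = normalized.str.startswith(">")
# errors="coerce" turns unparsable strings into NaN instead of raising
is_plain = pd.to_numeric(normalized, errors="coerce").notna()

covered = is_less | is_greater | is_plain
print(covered.value_counts())
```

If `covered` is `True` everywhere, each record falls into one of the three cases.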

## Convert measurement results to float. Fill the limit column.

In [45]:
%time df_qual.loc[idx_mess_is_float,"messergebnis_num"] = df_qual.loc[idx_mess_is_float,"messergebnis_c"].astype(float)
%time df_qual.loc[idx_mess_is_float,"grenze"] = "="

%time df_qual.loc[idx_mess_is_less,"messergebnis_num"] = df_qual.loc[idx_mess_is_less,"messergebnis_c"].str[1:].astype(float)
%time df_qual.loc[idx_mess_is_less,"grenze"] = "<"

%time df_qual.loc[idx_mess_is_greater,"messergebnis_num"] = df_qual.loc[idx_mess_is_greater,"messergebnis_c"].str[1:].astype(float)
%time df_qual.loc[idx_mess_is_greater,"grenze"] = ">"



CPU times: total: 0 ns
Wall time: 10.2 ms
CPU times: total: 15.6 ms
Wall time: 34.3 ms




ValueError: could not convert string to float: '0,03049'

CPU times: total: 0 ns
Wall time: 15.5 ms


ValueError: could not convert string to float: '0,00000'

CPU times: total: 0 ns
Wall time: 2 ms


In [46]:
print("Different values for column 'grenze'")
print(df_qual["grenze"].value_counts())

Different values for column 'grenze'
grenze
<    311420
>         4
Name: count, dtype: int64


In [47]:
df_qual[idx_mess_is_greater][["messergebnis_c", "messergebnis_num", "grenze"]].head()

sl_nr,messergebnis_c,messergebnis_num,grenze
6177232,">0,00000",,>
6177402,">0,00000",,>
6177268,">0,00000",,>
6177316,">0,00000",,>


In [48]:
df_qual[idx_mess_is_less][["messergebnis_c", "messergebnis_num", "grenze"]].head()

sl_nr,messergebnis_c,messergebnis_num,grenze
10472780,"<0,03049",,<
758273,"<0,10000",,<
758278,"<0,10000",,<
758272,"<10,00000",,<
758267,"<10,00000",,<


In [49]:
df_qual[idx_mess_is_float][["messergebnis_c", "messergebnis_num", "grenze"]].head()

sl_nr,messergebnis_c,messergebnis_num,grenze
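
The `ValueError` messages and the empty `messergebnis_num` values above point to the decimal comma: the strings still contain `,` when `astype(float)` is called. One possible way to fill both columns in a single pass is sketched below. It is not the notebook's original code; `df_demo` and its values are made up, while the column names `messergebnis_c`, `messergebnis_num` and `grenze` follow the notebook.

```python
import pandas as pd

# Made-up sample in the style of the raw column
df_demo = pd.DataFrame({"messergebnis_c": ["<0,03049", "1,25000", ">0,00000"]})

# Normalise the decimal separator (assumption: comma is never a thousands separator)
s = df_demo["messergebnis_c"].str.replace(",", ".", regex=False)

first = s.str[0]
has_qualifier = first.isin(["<", ">"])

# Keep the leading "<" or ">" as the limit flag, use "=" for plain values
df_demo["grenze"] = first.where(has_qualifier, "=")

# Strip the qualifier where present, then convert; unparsable strings become NaN
df_demo["messergebnis_num"] = pd.to_numeric(
    s.where(~has_qualifier, s.str[1:]), errors="coerce"
)

print(df_demo)
```

Using `errors="coerce"` means unexpected strings end up as `NaN` and can be inspected afterwards instead of aborting the whole conversion.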


In [None]:
# Reason for not being float? XOR: A ^ B
#idx = (~idx_mess_is_float ^ idx_mess_is_less) # These are non-floats which are not "less" => they should be "greater"
#df_qual[idx]

In [None]:
# Reason for not being float? XOR
#idx = (~idx_mess_is_float ^ idx_mess_is_greater)
#df_qual[idx]

In [50]:
df_qual[df_qual["messergebnis_num"]<0]

sl_nr,messstelle_id,messstelle_sl_nr,datum_pn,stoff_nr,probengut,messergebnis_c,messergebnis_hinweis,bestimmungsgrenze,masseinheit,trennverfahren,verfahren,vor_ort,messergebnis_num,grenze


## Upload the data to the database with `df.to_sql()`

In [None]:
import sqlalchemy
engine = sqlalchemy.create_engine("postgresql+psycopg://env_master:M123xyz@localhost/env_db")

In [None]:
# the default to_sql() / SQLAlchemy insert method with the driver selected in the engine URL (psycopg) ...
# on my laptop:
# Approx. Wall time: 4min 32s 

%time df_qual.to_sql(con=engine, name="gw_meas", schema="gw", if_exists="fail")
#%time df_qual.to_sql(con=engine, name="gw_meas", schema="gw", if_exists="replace")
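
For large frames, `to_sql()` accepts a `chunksize` argument that writes the rows in batches. Whether this shortens the upload on a given PostgreSQL setup is not measured here. The sketch below only demonstrates the call and uses an in-memory SQLite database as a stand-in so that it runs without a server; `df_demo` and its values are made up, and the `schema` argument is omitted because the SQLite stand-in has no `gw` schema.

```python
import sqlite3

import pandas as pd

# Made-up miniature of the measurement table
df_demo = pd.DataFrame(
    {"stoff_nr": [1244, 1061, 1246], "messergebnis_num": [1.25, 7.1, 0.02]},
    index=pd.Index([1, 2, 3], name="sl_nr"),
)

# In-memory SQLite as a stand-in for the PostgreSQL engine
con = sqlite3.connect(":memory:")

# index=True writes sl_nr as a column; chunksize=2 inserts two rows per batch
df_demo.to_sql("gw_meas", con, if_exists="fail", index=True, chunksize=2)

n_rows = pd.read_sql_query("select count(*) as n from gw_meas", con)["n"].iloc[0]
print(n_rows)
```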

## Search for duplicates! Primary key is not straightforward!

In [None]:
%load_ext sql

In [None]:
print("Connect")
%sql postgresql://env_master:M123xyz@localhost/env_db

In [None]:
%%sql
alter table gw.gw_meas add constraint pk_gw_meas primary key (messstelle_id, datum_pn, stoff_nr)

In [None]:
%%sql
select * from gw.gw_meas where (messstelle_id, datum_pn, stoff_nr) = (73537317, '1990-08-17 00:00:00', 1061)

Is `sl_nr` unique?

In [None]:
%%sql
select sl_nr,count(sl_nr) as count from gw.gw_meas group by sl_nr having count(sl_nr) > 1; 

**`sl_nr` is unique, so it can serve as a non-smart (surrogate) primary key ...**

In [None]:
%%sql
alter table gw.gw_meas add constraint pk_gw_meas primary key (sl_nr)

## Create some indexes to improve database performance

In [None]:
%%sql
create index idx_gw_meas_messstelle_id_datum_pn on gw.gw_meas (messstelle_id, datum_pn)

In [None]:
%%sql
create index idx_gw_meas_datum_pn_meas_messstelle on gw.gw_meas (datum_pn, messstelle_id)

In [None]:
%%sql
create index idx_gw_meas_stoff_nr on gw.gw_meas (stoff_nr)

In [None]:
%%sql
create index idx_gw_meas_datum_pn_stoff_nr on gw.gw_meas (datum_pn, stoff_nr);
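
Whether a query actually uses one of these indexes can be checked from the query plan. In PostgreSQL this is done with `EXPLAIN`. The sketch below illustrates the idea with SQLite's `explain query plan` on an empty in-memory table, so it runs without a server; the table layout is reduced to a few columns and the filter values are only placeholders.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table gw_meas ("
    "sl_nr integer primary key, messstelle_id integer, datum_pn text, stoff_nr integer)"
)
con.execute(
    "create index idx_gw_meas_messstelle_id_datum_pn on gw_meas (messstelle_id, datum_pn)"
)

# Ask the planner how it would run a lookup by station and sampling date
plan = con.execute(
    "explain query plan select * from gw_meas where messstelle_id = ? and datum_pn = ?",
    (30302626, "1998-07-21"),
).fetchall()

# The last field of each plan row is the human-readable description
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

The plan text names the composite index when the planner chooses it for the lookup.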

In [None]:
%%time
%sql select count(*) from gw.gw_meas

In [None]:
%%time
%sql select messstelle_id, datum_pn, count(*) as count from gw.gw_meas group by (messstelle_id, datum_pn) limit 20

**ATTENTION! 140515 analyses were performed with more than one method!**

In [None]:
#%%sql
#SELECT messstelle_id, datum_pn, stoff_nr, COUNT(*) AS Count
#FROM gw.gw_meas
#GROUP BY messstelle_id, datum_pn, stoff_nr
#HAVING COUNT(*) > 1;
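
The grouping query above can be tried on a small scale without the PostgreSQL server. The following sketch uses an in-memory SQLite table with made-up rows (station id, date and method names are illustrative assumptions); two rows share the same `(messstelle_id, datum_pn, stoff_nr)` but differ in `verfahren`, which is the situation that prevents a composite primary key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table gw_meas ("
    "sl_nr integer primary key, messstelle_id integer, "
    "datum_pn text, stoff_nr integer, verfahren text)"
)

# Made-up rows: sl_nr 1 and 2 describe the same analysis with two methods
rows = [
    (1, 100, "1990-08-17", 1061, "method A"),
    (2, 100, "1990-08-17", 1061, "method B"),
    (3, 100, "1990-08-17", 1244, "method A"),
]
con.executemany("insert into gw_meas values (?, ?, ?, ?, ?)", rows)

# Key combinations that occur more than once
dup_keys = con.execute(
    """
    select messstelle_id, datum_pn, stoff_nr, count(*) as n
    from gw_meas
    group by messstelle_id, datum_pn, stoff_nr
    having count(*) > 1
    """
).fetchall()
print(dup_keys)
```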

In [None]:
# %%sql
# SELECT t1.* from gw.gw_meas t1, gw.gw_meas t2 
# where 
# t1.messstelle_id = t2.messstelle_id
# and
# t1.datum_pn = t2.datum_pn
# and
# t1.stoff_nr = t2.stoff_nr
# and
# t1.verfahren <> t2.verfahren
# and

# t1.sl_nr = (select max(sl_nr) from gw.gw_meas t3 
# where
# t1.messstelle_id = t3.messstelle_id
# and
# t1.datum_pn = t3.datum_pn
# and
# t1.stoff_nr = t3.stoff_nr
# )

# limit 1000

## Import `katalog_stoff`

The import of the catalog table is done in another notebook.

In [None]:
%%sql
select * from gw.katalog_stoff where name like 'N%'

In [None]:
%%sql
alter table gw.katalog_stoff add constraint pk_kat_stoff primary key (stoff_nr)

In [None]:
%%sql
create index idx_kat_stoff_name on gw.katalog_stoff (name) 

In [None]:
%%sql
select * from gw.gw_stations limit 3

## Create Views!

In [None]:
%%sql
drop view if exists gw.v_gw_stations_wrrl_chemie

In [None]:
%%sql
create view gw.v_gw_stations_wrrl_chemie as
select * from gw.gw_stations 
where im_wrrl_messnetz_chemie = 'ja'
and freigabe_chemie = 'ja'

In [None]:
%%sql
select count(*) from gw.v_gw_stations_wrrl_chemie

In [None]:
%%sql
select geometry, messstelle_id, name, genau, im_wrrl_messnetz_chemie, im_wrrl_messnetz_wasserstand from gw.gw_stations
limit 10

In [None]:
%%sql
select *
from gw.gw_meas
limit 1

In [None]:
%%sql
select sl_nr, messstelle_id, stoff_nr, datum_pn, grenze, messergebnis_num, masseinheit 
from gw.gw_meas
limit 3

In [None]:
%%sql
select 
st.geometry, st.messstelle_id, st.name, st.genau, st.im_wrrl_messnetz_chemie, st.im_wrrl_messnetz_wasserstand,
m.sl_nr, m.stoff_nr,  p.name as stoffname, m.datum_pn, m.grenze, m.messergebnis_num, m.masseinheit
from gw.gw_stations st, gw.gw_meas m, gw.katalog_stoff p
where st.messstelle_id = m.messstelle_id
and m.stoff_nr = p.stoff_nr
limit 2

In [None]:
%%sql
drop view if exists gw.gw_station_series

In [None]:
%%sql
create or replace view gw.gw_station_series as
select 
m.sl_nr as fid, st.geometry, st.messstelle_id, st.name, st.genau, st.im_wrrl_messnetz_chemie, st.im_wrrl_messnetz_wasserstand,
m.sl_nr, m.stoff_nr,  p.name as stoffname, m.datum_pn, m.grenze, m.messergebnis_num, m.masseinheit
from gw.gw_stations st, gw.gw_meas m, gw.katalog_stoff p
where st.messstelle_id = m.messstelle_id
and m.stoff_nr = p.stoff_nr
order by messstelle_id, stoff_nr, datum_pn
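
The three-table join can also be written with explicit `join ... on` clauses, which keeps the join conditions apart from the filter conditions. The sketch below tries this on an in-memory SQLite database so that it runs without PostgreSQL or PostGIS. The tables are reduced to a few columns, there is no `geometry` column, and the station name, the parameter name `Demo-Parameter`, the view name `v_station_nitrat` and all values are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    create table gw_stations (messstelle_id integer primary key, name text);
    create table katalog_stoff (stoff_nr integer primary key, name text);
    create table gw_meas (
        sl_nr integer primary key, messstelle_id integer,
        stoff_nr integer, datum_pn text, messergebnis_num real
    );

    insert into gw_stations values (1, 'Station A');
    insert into katalog_stoff values (1244, 'Nitrat'), (9999, 'Demo-Parameter');
    insert into gw_meas values
        (10, 1, 1244, '1998-03-12', 1.25),
        (11, 1, 9999, '1998-03-12', 7.10),
        (12, 1, 1244, '1998-07-21', 1.50);

    -- Explicit joins: one ON clause per table relationship
    create view v_station_nitrat as
    select m.sl_nr, st.name, p.name as stoffname, m.datum_pn, m.messergebnis_num
    from gw_meas m
    join gw_stations st on st.messstelle_id = m.messstelle_id
    join katalog_stoff p on p.stoff_nr = m.stoff_nr
    where p.name = 'Nitrat';
    """
)

nitrate_rows = con.execute(
    "select sl_nr, stoffname, datum_pn from v_station_nitrat order by datum_pn"
).fetchall()
print(nitrate_rows)
```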

In [None]:
%%sql
select * from gw.katalog_stoff where name = 'Nitrat'

In [None]:
%%sql
drop view if exists gw.gw_station_nitrat_series

In [None]:
%%sql
select distinct (im_wrrl_messnetz_chemie) from gw.gw_stations

In [None]:
%%sql
drop view if exists gw.v_gw_station_nitrat

In [None]:
%%sql
create or replace view gw.v_gw_station_nitrat as
select 
m.sl_nr as fid, st.geometry, st.messstelle_id, st.name, st.genau, st.im_wrrl_messnetz_chemie, st.im_wrrl_messnetz_wasserstand,
m.sl_nr, m.stoff_nr,  p.name as stoffname, m.datum_pn, m.grenze, m.messergebnis_num, m.masseinheit
from gw.gw_stations st, gw.gw_meas m, gw.katalog_stoff p
where st.im_wrrl_messnetz_chemie = 'ja'
and p.name = 'Nitrat'
and m.stoff_nr = p.stoff_nr
and st.messstelle_id = m.messstelle_id
order by messstelle_id, stoff_nr, datum_pn

In [None]:
%%sql
create or replace view gw.v_gw_station_sulfat as
select 
m.sl_nr as fid, st.geometry, st.messstelle_id, st.name, st.genau, st.im_wrrl_messnetz_chemie, st.im_wrrl_messnetz_wasserstand,
m.sl_nr, m.stoff_nr,  p.name as stoffname, m.datum_pn, m.grenze, m.messergebnis_num, m.masseinheit
from gw.gw_stations st, gw.gw_meas m, gw.katalog_stoff p
where st.im_wrrl_messnetz_chemie = 'ja'
and p.name = 'Sulfat'
and m.stoff_nr = p.stoff_nr
and st.messstelle_id = m.messstelle_id
order by messstelle_id, stoff_nr, datum_pn

In [None]:
%sql select count(*) from gw.v_gw_station_sulfat

## Exercises

1) Add the PostGIS table `gw.gw_stations` as a vector layer to QGIS.

2) Use `df.to_sql()` to upload the catalog table (file `katalog_stoff.csv` in the data directory) of the analyzed quantities (substances and physico-chemical parameters, e.g. NO3- concentration (nitrate), pH, air temperature (can be negative), etc.)

3) Add the catalog with municipalities (file `katalog_gemeinde.csv`)

4) SQL: Create a view joining the gw station table with gw meas table and gw parameter table. (A bit difficult. We have not discussed it yet.)

5) Create a reduced view for nitrate only joining the gw station table with gw meas table and gw parameter table.

6) Try to get the station-nitrate table into QGIS using the PostGIS interface.

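SQL: Before you create the views, create primary keys for the tables, i.e. `(messstelle_id)` for `gw_stations` and `(sl_nr)` for `gw_meas`. The combination `(messstelle_id, stoff_nr, datum_pn)` is not unique in `gw_meas` because some analyses were performed with more than one method.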