# NRW Groundwater Data - OpenHygrisC Data Engineering

Data from <br>
**[LANUV](https://www.lanuv.nrw.de/): Landesamt für Natur, Umwelt und Verbraucherschutz Nordrhein-Westfalen** <br>
(State Office for Nature, Environment and Consumer Protection NRW)

* LANUV groundwater web pages: https://www.lanuv.nrw.de/umwelt/wasser/grundwasser

Groundwater data: https://www.lanuv.nrw.de/umwelt/wasser/grundwasser/grundwasserstand/grundwasserdaten-online

ELWAS-WEB NRW - Infos zu den Grundwasserkörpern (YouTube): https://www.youtube.com/watch?v=4wFKIu622rk

In the HygrisC database, LANUV provides groundwater quality and quantity data for most groundwater wells in NRW. Some wells are owned and operated by the state of NRW, others by third parties. 
Measurements are usually taken annually; some groundwater wells are sampled more frequently. 

WRRL: EU Wasserrahmenrichtlinie, EU Water Framework Directive

The quality data is based on chemical analyses of groundwater samples. The quantity data is based on groundwater level measurement.


OpenHygrisC Data: https://www.opengeodata.nrw.de/produkte/umwelt_klima/wasser/grundwasser/hygrisc/

**Download the NRW groundwater data zip file**:
<br>
https://www.opengeodata.nrw.de/produkte/umwelt_klima/wasser/grundwasser/hygrisc/OpenHygrisC_gw-messstellen-messwerte_EPSG25832_CSV.zip

The zip archive contains gw station info, a catalog of possible physico-chemical analysis parameters, and the measured data. 

## Coordinate Obfuscation 

Some coordinates in the gw station info need special handling. The coordinate reference system (CRS) is the projected, metric 
EPSG:25832 (ETRS89 / UTM zone 32N). 
The dataframe coordinate columns `e32` (easting) and `n32` (northing) are of data type object (not numeric). 

The nominal resolution is 1 m, but many coordinates are obscured to a precision of 100 m for privacy reasons. A few coordinates are missing, i.e. either empty (NaN) or filled with `xx`.


The coordinate columns `e32` and `n32` are of data type object/string. Four situations must be distinguished (the case numbers refer to the rows of the table below):

* Most strings are in a regular number format and can be converted to float right away (cases (1) and (2) in the table).
* Other coordinate strings are obfuscated by replacing the two least significant digits with the characters "xx". This usually happens when a groundwater well is on private property: the coordinates are made less precise to respect privacy. The remaining coordinate information is still usable, but the precision is limited to 100 m, i.e. an uncertainty of ±50 m (case (3) in the table).
* In some cases no coordinate information is given at all and the coordinate strings are just "xx" (case (4) in the table).
* In very few cases the coordinate columns are empty, i.e. NaN (null) (case (5) in the table).

The following table shows representative cases.


| case |   messstelle_id | e32    | n32     | grundstueck   |
|-----:|----------------:|-------:|--------:|:--------------|
|  (1) |        10000094 | 292868 | 5632572 | oeffentlich   |
|  (2) |        10000045 | 299399 | 5650595 | privat        |
|  (3) |        10000033 | 3070xx | 56583xx | privat        |
|  (4) |        47247101 | xx     | xx      |               |
|  (5) |        79921802 | nan    | nan     |               |

Cases (1) and (2) have coordinate strings that can be converted directly to integer or float with 1 m precision. Case (3) shows coordinates obfuscated to a precision of 100 m: the tens and ones digits are anonymized. Cases (4) and (5) carry no usable coordinate information.  

How to deal with non-anonymized data:

"299399" (string, prec. 1) => 299399.0 (float) 

How to deal with anonymization:

307000 <= 3070xx <= 307099

"3070xx" (string, prec. 100) => 307050 (float, +- 50m) 

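The conversion rules above can be sketched as a small stand-alone helper. This is a minimal sketch, not the notebook's own implementation (which uses pandas Boolean indexes further down). The function name `parse_coord` and the use of `None` for missing coordinates are assumptions made for illustration.

```python
import math
from typing import Optional, Tuple


def parse_coord(raw) -> Tuple[Optional[float], Optional[int]]:
    """Return (coordinate in m, precision in m) for one e32/n32 entry.

    Hypothetical helper: (None, None) marks missing coordinates.
    """
    # Case (5): empty cell, which arrives as None or as a float NaN
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None, None
    s = str(raw).strip().lower()
    # Cases (1) and (2): plain digits, 1 m precision
    if s.isdigit():
        return float(s), 1
    # Case (3): last two digits replaced by "xx" -> use the centre of the 100 m cell
    if len(s) > 2 and s.endswith("xx") and s[:-2].isdigit():
        return float(s[:-2] + "50"), 100
    # Case (4) ("xx" only) and anything unexpected
    return None, None


for raw in ["299399", "3070xx", "56583xx", "xx", float("nan")]:
    print(repr(raw), "->", parse_coord(raw))
```

Returning the precision alongside the value keeps the ±50 m uncertainty of obfuscated coordinates visible to later processing steps.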


In [1]:
#!conda env list

## Correct wrong `PROJ_LIB` environment variable value 

This problem seems to occur on Windows when the OSGeo4W installer has been used. The environment variable `PROJ_LIB` must point to the `proj` data directory of the activated conda environment inside the user's profile, e.g. `PROJ_LIB=C:\Users\<username>\Anaconda3\envs\geo\Library\share\proj` 

In [11]:
import os
os.environ['PROJ_LIB']

'C:\\Users\\shrey\\anaconda3\\envs\\geo\\Library\\share\\proj'

In [12]:
# Correct wrong environment variable value occurring when using OSGeo4W installer
# (upper-case names also work on case-sensitive platforms)
conda_prefix = os.environ['CONDA_PREFIX']
print(f"CONDA_PREFIX: {conda_prefix:s}")
os.environ['PROJ_LIB'] = os.path.join(conda_prefix, "Library", "share", "proj")
proj_lib = os.environ['PROJ_LIB']
print(f"New env var value: \nPROJ_LIB={proj_lib:s}")

CONDA_PREFIX: C:\Users\shrey\anaconda3\envs\geo
New env var value: 
PROJ_LIB=C:\Users\shrey\anaconda3\envs\geo\Library\share\proj
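The same correction can be wrapped in a function that does nothing when no conda environment is active. This is only a sketch under the assumption of the Windows conda layout described above; the helper names `proj_dir_for` and `fix_proj_lib` are hypothetical, and the demonstration runs on a plain dictionary instead of the real process environment.

```python
from pathlib import PureWindowsPath
from typing import MutableMapping


def proj_dir_for(conda_prefix: str) -> str:
    """Build the proj data directory below a conda prefix (Windows layout assumed)."""
    return str(PureWindowsPath(conda_prefix) / "Library" / "share" / "proj")


def fix_proj_lib(environ: MutableMapping[str, str]) -> bool:
    """Point PROJ_LIB at the active conda environment; return False if none is active."""
    prefix = environ.get("CONDA_PREFIX")
    if not prefix:
        return False
    environ["PROJ_LIB"] = proj_dir_for(prefix)
    return True


# Demonstration on a plain dict; pass os.environ to change the real environment
env = {"CONDA_PREFIX": r"C:\Users\<username>\Anaconda3\envs\geo"}
fix_proj_lib(env)
print(env["PROJ_LIB"])
```

Calling `fix_proj_lib(os.environ)` before importing geopandas would apply the change to the running session.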


## Imports

In [13]:
# Correct the wrong PROJ_LIB path first (see above); otherwise geopandas does not load!
import pandas as pd
import geopandas as gpd

## Data Directories and Files

In [14]:
pwd

'C:\\Users\\shrey\\Desktop\\Geodata_Management\\EE_8136_Geodata_WS2023_1_EXAM-Group-C\\gdms0000_Final_Assignment\\Task_3\\OpenHyPE-main\\OpenHyPE-main\\python'

In [28]:
data_in_dir = r"C:\Users\shrey\Desktop\Geodata_Management\EE_8136_Geodata_WS2023_1_EXAM-Group-C\gdms0000_Final_Assignment\Task_3\OpenHyPE-main\OpenHyPE-main\data\OpenGeoData.NRW\OpenHygrisC\OpenHygrisC_gw-messstelle_EPSG25832_CSV"
for elt in os.listdir(data_in_dir): print(elt)

OpenHygrisC_gw-messstelle.csv


## GW Station Data


In [29]:
gw_station_fname = "OpenHygrisC_gw-messstelle.csv"
# os.path.join inserts the path separator between directory and file name
gw_station_pfname = os.path.join(data_in_dir, gw_station_fname)
print(f"Stationsdaten:  {gw_station_pfname:s}")

Stationsdaten:  C:\Users\shrey\Desktop\Geodata_Management\EE_8136_Geodata_WS2023_1_EXAM-Group-C\gdms0000_Final_Assignment\Task_3\OpenHyPE-main\OpenHyPE-main\data\OpenGeoData.NRW\OpenHygrisC\OpenHygrisC_gw-messstelle_EPSG25832_CSV\OpenHygrisC_gw-messstelle.csv


In [31]:
df = pd.read_csv(gw_station_pfname, sep = ";", index_col=["messstelle_id"], encoding="ISO-8859-1")

In [32]:
df.sort_index(ascending=True, inplace=True)

In [33]:
num_total = df.shape[0]
df.shape

(72528, 38)

In [34]:
print(f"{pd.get_option('display.max_columns') = }")
pd.set_option("display.max_columns", None)
print(f"{pd.get_option('display.max_columns') = }")

pd.get_option('display.max_columns') = 20
pd.get_option('display.max_columns') = None


In [35]:
df.head()

messstelle_id,sl_nr,name,e32,n32,gw_stockwerk,grundstueck,gemeinde_id,gwhorizont_id,gwhorizont,gwleiter_id,gwleiter,einrichtungsgrund,gwk_lage_auf_id,gwk_lage_id,gwk_monitoring_auf_id,gwk_monitoring_id,messprogramm,turnus_wasserstand,freigabe_wstd,freigabe_chemie,freigabe_lage,wasserstandsmessstelle,guetemessstelle,im_wrrl_messnetz_chemie,im_wrrl_messnetz_wasserstand,messstellenart,wasserart,labor,beobachtung_wasserstand,eigentuemer,betreiber,filterlaenge_cm,sumpfrohrlaenge_cm,ausbaudurchmesser_mm,historischer_ruhe_wsp,einbaulaenge_cm,oberkante_filter_cm,unterkante_filter_cm
10000008,70796,P3-BSR/Mariaschacht P 5 neu,3068XX,56254XX,1.0,,5334032,,,,,GW-Beschaffenheit,01.07.2016,282_11,01.07.2016,282_11,RWÜ (Messstelle f. GWÜ geeignet),monatlich,nein,nein,nein,ja,ja,nein,nein,GW-Messstelle,reines Grundwasser,-,-,BSR Schotterwerk ...,Enwor GmbH ...,700.0,0.0,80.0,,1650.0,20973.0,20273.0
10000010,1,SCHERPENSEEL NR 1,2935XX,56452XX,1.0,privat,5370028,6D,Neurather Sand,,,LGD,01.07.2016,28_04,01.07.2016,28_04,Grundwassergüteüberwachung,Messstelle besteht nicht mehr,ja,ja,nein,ja,ja,nein,nein,GW-Messstelle,keine Angabe,-,-,Land NRW ...,keine Angabe ...,400.0,,100.0,,4936.0,7731.0,7331.0
10000021,2,Bellinghoven Nr. 2,312776,5660432,1.0,privat,5370004,14,Ältere Hauptterrassen,,,LGD,01.07.2016,286_07,01.07.2016,286_07,,Messstelle inaktiv,ja,nein,ja,ja,nein,nein,nein,Schachtbrunnen,,,-,keine Angabe ...,keine Angabe ...,,,1000.0,,1411.0,7885.0,7885.0
10000033,3,Doveren Nr. 3,3070XX,56583XX,1.0,privat,5370020,16,Jüngere Hauptterrassen mit Lößauflagerung,,,LGD,01.07.2016,282_01,01.07.2016,282_01,Emittentenmst./Anlagenüberw.,Messstelle inaktiv,ja,nein,nein,ja,ja,nein,nein,Schachtbrunnen,keine Angabe,-,-,Privatperson ...,keine Angabe ...,,,1000.0,,755.0,4847.0,4847.0
10000045,4,Geilenkirchen Nr. 5,299399,5650595,1.0,privat,5370012,10,Sande und Kiese,,,LGD,01.07.2016,282_03,01.07.2016,282_03,,Messstelle besteht nicht mehr,ja,nein,ja,ja,nein,nein,nein,Vertikalfilterbrunnen,,,-,Privatperson ...,keine Angabe ...,200.0,,1000.0,,1079.0,6140.0,5940.0


In [36]:
df[df["grundstueck"]=="oeffentlich"].head()

messstelle_id,sl_nr,name,e32,n32,gw_stockwerk,grundstueck,gemeinde_id,gwhorizont_id,gwhorizont,gwleiter_id,gwleiter,einrichtungsgrund,gwk_lage_auf_id,gwk_lage_id,gwk_monitoring_auf_id,gwk_monitoring_id,messprogramm,turnus_wasserstand,freigabe_wstd,freigabe_chemie,freigabe_lage,wasserstandsmessstelle,guetemessstelle,im_wrrl_messnetz_chemie,im_wrrl_messnetz_wasserstand,messstellenart,wasserart,labor,beobachtung_wasserstand,eigentuemer,betreiber,filterlaenge_cm,sumpfrohrlaenge_cm,ausbaudurchmesser_mm,historischer_ruhe_wsp,einbaulaenge_cm,oberkante_filter_cm,unterkante_filter_cm
10000094,9,Richterich Nr. 11,292868,5632572,1.0,oeffentlich,5334002,,,kro,Oberkreide,LGD,01.07.2016,282_09,01.07.2016,282_09,,Messstelle inaktiv,ja,nein,ja,ja,nein,nein,nein,Schachtbrunnen,,,-,Bahnbrunnen ...,keine Angabe ...,,,3000.0,,954.0,17351.0,17351.0
10000173,19,WALLENTHAL NR 20,328303,5604342,1.0,oeffentlich,5366024,SM,Mittlerer Buntsandstein,,,keine Angabe,01.07.2016,274_13,01.07.2016,274_13,,monatlich,ja,nein,ja,ja,nein,nein,ja,Schachtbrunnen,,,durch LANUV,keine Angabe ...,keine Angabe ...,,,1000.0,,400.0,37030.0,37030.0
10099839,73160,LGD Nettersheim 01,330892,5595776,1.0,oeffentlich,5366032,,,dk,"Devon, Kalk",GW-Beschaffenheit,01.07.2016,282_15,01.07.2016,282_15,Grundwassergüteüberwachung,monatlich,ja,ja,ja,ja,ja,nein,nein,GW-Messstelle,reines Grundwasser,LANUV,durch LANUV,Land NRW ...,Land NRW ...,500.0,0.0,100.0,3490.0,4144.0,47588.0,47088.0
10099967,72240,LGD Breitenbenden 01,335348,5605306,1.0,oeffentlich,5366028,,,,,GW-Beschaffenheit,01.07.2016,274_10,01.07.2016,274_10,Grundwassergüteüberwachung,monatlich,ja,ja,ja,ja,ja,nein,nein,GW-Messstelle,reines Grundwasser,LANUV,durch LANUV,Land NRW ...,Land NRW ...,500.0,0.0,100.0,150.0,1470.0,27072.0,26572.0
10099979,72238,LGD Vussem-Bergh. 01,334646,5604224,1.0,oeffentlich,5366028,,,,,GW-Beschaffenheit,01.07.2016,274_10,01.07.2016,274_10,Grundwassergüteüberwachung,monatlich,ja,ja,ja,ja,ja,nein,nein,GW-Messstelle,reines Grundwasser,LANUV,durch LANUV,Land NRW ...,Land NRW ...,200.0,0.0,100.0,380.0,860.0,29231.0,29031.0


## Challenge: Coordinates obfuscation

The coordinate columns `e32` and `n32` are of data type string. Four cases must be distinguished:

(1) Most strings are in a regular number format and can be converted to float right away.

(2) Other coordinate strings are obfuscated by replacing the two least significant digits with the characters "xx". This usually happens when a groundwater well is on private property: the coordinates are made less precise to respect privacy. The remaining coordinate information is still usable, but the precision is limited to 100 m, i.e. an uncertainty of ±50 m. 

(3) In some cases no coordinate information is given at all and the coordinate strings are just "xx".

(4) In very few cases the coordinate columns are empty, i.e. NaN (null).

In [13]:
# These five groundwater wells summarize the coordinate problems.
df_coord_problem=df.loc[[10000094, 10000045, 10000033, 47247101, 79921802],["e32","n32", "grundstueck"]]
df_coord_problem

messstelle_id,e32,n32,grundstueck
10000094,292868,5632572,oeffentlich
10000045,299399,5650595,privat
10000033,3070xx,56583xx,privat
47247101,xx,xx,
79921802,,,


In [14]:
# format table as markdown
#from tabulate import tabulate
#print(tabulate(df_coord_problem, tablefmt="pipe", headers="keys"))

|   messstelle_id | e32    | n32     | grundstueck   |
|----------------:|:-------|:--------|:--------------|
|        10000094 | 292868 | 5632572 | oeffentlich   |
|        10000045 | 299399 | 5650595 | privat        |
|        10000033 | 3070xx | 56583xx | privat        |
|        47247101 | xx     | xx      |               |
|        79921802 | nan    | nan     |               |

**Boolean indexes are used to filter the data according to the cases (1) to (4).**

In [15]:
# Add column for precision
df["genau"] = 0

# (1) If the coord data is numeric then the precision is 1m
# (the comparison with True maps NaN entries to False)
idx_coords_1m_prec = (df["e32"].str.isnumeric() == True)

# (3,4) Some stations don't have coordinates:
# e32 and n32 strings are either NaN (Null) or "xx" (shorter than 6 characters)
idx_coords_missing = (df["e32"].str.len() < 6) | df["e32"].isnull()

# (2) If coord data is available but not numeric, then the two least significant digits have been obscured with "xx".
idx_coords_100m_prec = ~idx_coords_missing & ~(df["e32"].str.isnumeric() == True)


In [16]:
df[idx_coords_missing]

messstelle_id,sl_nr,name,e32,n32,gw_stockwerk,grundstueck,gemeinde_id,gwhorizont_id,gwhorizont,gwleiter_id,gwleiter,einrichtungsgrund,gwk_lage_auf_id,gwk_lage_id,gwk_monitoring_auf_id,gwk_monitoring_id,messprogramm,turnus_wasserstand,freigabe_wstd,freigabe_chemie,freigabe_lage,wasserstandsmessstelle,guetemessstelle,im_wrrl_messnetz_chemie,im_wrrl_messnetz_wasserstand,messstellenart,wasserart,labor,beobachtung_wasserstand,eigentuemer,betreiber,filterlaenge_cm,sumpfrohrlaenge_cm,ausbaudurchmesser_mm,historischer_ruhe_wsp,einbaulaenge_cm,oberkante_filter_cm,unterkante_filter_cm,genau
36446518,47111,WA-Lörick LR - RMM,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
36487600,46659,Wittlaer,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
47039000,46753,Mündelheim Rhein,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,"Talsperre, Flusswassermesstelle",keine Angabe,-,-,,,,,,,,,,0
47199003,47636,Sammelleitung 1-7,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
47200005,47638,Sammelleitung 10-20,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
47202002,47647,Sammelleitung 21-30,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Sammelmessstelle,keine Angabe,-,-,,,,,,,,,,0
47247101,46769,RM-Moers Gerdt,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Rohmischwassermessstelle,keine Angabe,-,-,,,,,,,,,,0
47299009,47658,RM-Bucholtwelmen,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),Messstelle inaktiv,nein,nein,nein,ja,ja,nein,nein,Rohmischwassermessstelle,keine Angabe,-,-,,,,,,,,,,0
59621035,46215,RM-Brunnen Stortel,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),,nein,nein,nein,nein,ja,nein,nein,Rohmischwassermessstelle,keine Angabe,-,,,,,,,,,,,0
68011003,47697,WW.HALTERNHAARD-MI,xx,xx,,,,,,,,Eigenüberwachung Wasserwerk,,,,,RWÜ (keine GWÜ-Messstellen),,nein,nein,nein,nein,ja,nein,nein,Sammelmessstelle,keine Angabe,-,,,,,,,,,,,0


**Convert the strings to floats where possible. Missing (no-data) values are represented by negative sentinel numbers.**

In [17]:
df.loc[idx_coords_1m_prec,"e32num"] = df.loc[idx_coords_1m_prec,"e32"].astype(float)
df.loc[idx_coords_1m_prec,"n32num"] = df.loc[idx_coords_1m_prec,"n32"].astype(float)
df.loc[idx_coords_1m_prec, "genau"] = 1

In [18]:
df.loc[idx_coords_100m_prec,"e32num"] = (df.loc[idx_coords_100m_prec,"e32"].str[:-2]+"50").astype(float)
df.loc[idx_coords_100m_prec,"n32num"] = (df.loc[idx_coords_100m_prec,"n32"].str[:-2]+"50").astype(float)
df.loc[idx_coords_100m_prec, "genau"] = 100

In [19]:
df.loc[idx_coords_missing,"e32num"] = -999.9
df.loc[idx_coords_missing,"n32num"] = -999.9
df.loc[idx_coords_missing, "genau"] = -999

In [20]:
# check if all records have been matched
num_of_1m_prec = df[df["genau"] == 1].shape[0]
num_of_100m_prec = df[df["genau"] == 100].shape[0]
num_of_no_prec = df[df["genau"] == -999].shape[0]

num_check = num_of_1m_prec + num_of_100m_prec + num_of_no_prec

print(f"total num of recs:                        {num_total:6d}")
print(f"number of recs with 1m coord precision:   {num_of_1m_prec:6d}")
print(f"number of recs with 100m coord precision: {num_of_100m_prec:6d}")
print(f"number of recs with no coords:            {num_of_no_prec:6d}")
print(f"check sum:                                {num_check:6d}")

assert num_check == num_total, "ERROR. Mismatch in numbers of stations"


total num of recs:                         71120
number of recs with 1m coord precision:    59280
number of recs with 100m coord precision:  11810
number of recs with no coords:                30
check sum:                                 71120


**Save the original string as well as the derived numeric columns to a CSV file for checking externally.**

In [21]:
df[["e32","e32num","n32","n32num","genau"]].to_csv("check.csv")
df[["e32","e32num","n32","n32num","genau"]]

messstelle_id,e32,e32num,n32,n32num,genau
10000008,3068xx,306850.0,56254xx,5625450.0,100
10000010,2935xx,293550.0,56452xx,5645250.0,100
10000021,312776,312776.0,5660432,5660432.0,1
10000033,3070xx,307050.0,56583xx,5658350.0,100
10000045,299399,299399.0,5650595,5650595.0,1
...,...,...,...,...,...
289382210,345323,345323.0,5659935,5659935.0,1
289382221,345323,345323.0,5659935,5659935.0,1
289382518,345603,345603.0,5659991,5659991.0,1
289382520,345603,345603.0,5659991,5659991.0,1


## Geopandas

In [22]:
import geopandas as gpd
from shapely.geometry import Point

In [23]:
# remove records without coords
df2 = df[df["genau"] > 0]

In [24]:
df2.shape

(71090, 41)

In [25]:
%%time
gdf = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.e32num, df2.n32num), crs="EPSG:25832")

CPU times: total: 78.1 ms
Wall time: 84 ms


In [26]:
gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 71090 entries, 10000008 to 289382713
Data columns (total 42 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   sl_nr                         71090 non-null  int64   
 1   name                          71090 non-null  object  
 2   e32                           71090 non-null  object  
 3   n32                           71090 non-null  object  
 4   gw_stockwerk                  54173 non-null  float64 
 5   grundstueck                   71090 non-null  object  
 6   gemeinde_id                   71090 non-null  object  
 7   gwhorizont_id                 28424 non-null  object  
 8   gwhorizont                    28424 non-null  object  
 9   gwleiter_id                   2690 non-null   object  
 10  gwleiter                      2690 non-null   object  
 11  einrichtungsgrund             71090 non-null  object  
 12  gwk_lage_auf_id            

In [27]:
gdf.head(3)

messstelle_id,sl_nr,name,e32,n32,gw_stockwerk,grundstueck,gemeinde_id,gwhorizont_id,gwhorizont,gwleiter_id,gwleiter,einrichtungsgrund,gwk_lage_auf_id,gwk_lage_id,gwk_monitoring_auf_id,gwk_monitoring_id,messprogramm,turnus_wasserstand,freigabe_wstd,freigabe_chemie,freigabe_lage,wasserstandsmessstelle,guetemessstelle,im_wrrl_messnetz_chemie,im_wrrl_messnetz_wasserstand,messstellenart,wasserart,labor,beobachtung_wasserstand,eigentuemer,betreiber,filterlaenge_cm,sumpfrohrlaenge_cm,ausbaudurchmesser_mm,historischer_ruhe_wsp,einbaulaenge_cm,oberkante_filter_cm,unterkante_filter_cm,genau,e32num,n32num,geometry
10000008,70796,P3-BSR/Mariaschacht P 5 neu,3068xx,56254xx,1.0,,5334032,,,,,GW-Beschaffenheit,01.07.2016,282_11,01.07.2016,282_11,RWÜ (Messstelle f. GWÜ geeignet),monatlich,nein,nein,nein,ja,ja,nein,nein,GW-Messstelle,keine Angabe,-,-,BSR Schotterwerk ...,Enwor GmbH ...,700.0,,80.0,,1570.0,21053.0,20353.0,100,306850.0,5625450.0,POINT (306850.000 5625450.000)
10000010,1,SCHERPENSEEL NR 1,2935xx,56452xx,1.0,privat,5370028,6D,Neurather Sand,,,LGD,01.07.2016,28_04,01.07.2016,28_04,Grundwassergüteüberwachung,Messstelle besteht nicht mehr,ja,ja,nein,ja,ja,nein,nein,GW-Messstelle,keine Angabe,-,-,Land NRW ...,keine Angabe ...,400.0,,100.0,,4936.0,7731.0,7331.0,100,293550.0,5645250.0,POINT (293550.000 5645250.000)
10000021,2,Bellinghoven Nr. 2,312776,5660432,1.0,privat,5370004,14,Ältere Hauptterrassen,,,LGD,01.07.2016,286_07,01.07.2016,286_07,,Messstelle inaktiv,ja,nein,ja,ja,nein,nein,nein,Schachtbrunnen,,,-,keine Angabe ...,keine Angabe ...,,,1000.0,,1411.0,7885.0,7885.0,1,312776.0,5660432.0,POINT (312776.000 5660432.000)


In [28]:
%%time

# This takes 90 secs on my computer!

#gdf.to_file("GW_Stations.gpkg", layer='GW Stations', driver="GPKG")

CPU times: total: 0 ns
Wall time: 0 ns


## PostGIS, Inline SQL Magic: `create schema gw`

To store the data in PostGIS/PostgreSQL, it is recommended to create a dedicated database "schema" (a kind of namespace) that separates the relations (tables, views), stored procedures, etc. from the rest of the database. Schemata help to keep tables and access privileges clearly organized. 


In [29]:
#!conda install -c conda-forge ipython-sql

In [30]:
%load_ext sql

In [31]:
print("Connect")
%sql postgresql://env_master:M123xyz@localhost/env_db

Connect


'Connected: env_master@env_db'

In [32]:
%%sql
SELECT * FROM information_schema.schemata

 * postgresql://env_master:***@localhost/env_db
5 rows affected.


catalog_name,schema_name,schema_owner,default_character_set_catalog,default_character_set_schema,default_character_set_name,sql_path
env_db,public,pg_database_owner,,,,
env_db,information_schema,postgres,,,,
env_db,pg_catalog,postgres,,,,
env_db,pg_toast,postgres,,,,
env_db,gw,env_master,,,,


In [33]:
%%sql
CREATE SCHEMA IF NOT EXISTS gw AUTHORIZATION env_master

 * postgresql://env_master:***@localhost/env_db
Done.


[]

In [34]:
%%sql
SELECT * FROM information_schema.schemata;

 * postgresql://env_master:***@localhost/env_db
5 rows affected.


catalog_name,schema_name,schema_owner,default_character_set_catalog,default_character_set_schema,default_character_set_name,sql_path
env_db,public,pg_database_owner,,,,
env_db,information_schema,postgres,,,,
env_db,pg_catalog,postgres,,,,
env_db,pg_toast,postgres,,,,
env_db,gw,env_master,,,,


## PostGIS: Upload GeoDataFrame with `gdf.to_postgis()`

Dependencies:
* psycopg2
* geoalchemy2

In [35]:
#!conda install -c conda-forge geoalchemy2 psycopg2

In [36]:
import sqlalchemy
engine = sqlalchemy.create_engine("postgresql://env_master:M123xyz@localhost/env_db")
# fast_executemany=True
# use_batch_mode=True

In [37]:
%%time
gdf.to_postgis(con=engine, name="gw_stations", schema="gw", index=True, chunksize=100, if_exists="replace")

CPU times: total: 1.83 s
Wall time: 4 s


Create the primary key on `messstelle_id`:

In [38]:
%%sql
alter table gw.gw_stations add constraint pk_gw_stations primary key (messstelle_id)

 * postgresql://env_master:***@localhost/env_db
Done.


[]

# Groundwater "Quality Data": Chemistry!

## Data Directories and Files

In [39]:
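gw_quality_fname = "opendata.gw_chemischer_messwert.csv"
# os.path.join inserts the path separator; plain concatenation would fuse directory and file name
gw_quality_pfname = os.path.join(data_in_dir, gw_quality_fname)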
print(f"Qualitätsdaten: {gw_quality_pfname:s}")

Qualitätsdaten: ../data/OpenGeoData.NRW/OpenHygrisC/OpenHygrisC_gw-messstellen-messwerte_EPSG25832_CSV/opendata.gw_chemischer_messwert.csv


In [40]:
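# read only the header line; the with-block closes the file again
with open(gw_quality_pfname, "r", encoding="utf-8", newline="") as fh:
    s = fh.readline()
s = s.replace('"', '').strip()
# skip the first character of the line (it is not part of the first column name)
header_de = s[1:].split(';')
header_de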

['sl_nr',
 'messstelle_id',
 'pna_id',
 'datum_pn',
 'stoff_nr',
 'probengut',
 'messergebnis_c',
 'messergebnis_hinweis',
 'bestimmungsgrenze',
 'masseinheit',
 'trennverfahren',
 'verfahren',
 'vor_ort',
 'herkunft',
 'aktual_dat',
 'erstell_dat']
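The header line can also be read with the standard `csv` module, which takes care of the quoting. The sketch below assumes that the single character skipped by `s[1:]` is a UTF-8 byte-order mark; the source does not confirm this. The function name `read_header` and the sample line are hypothetical.

```python
import csv
import io


def read_header(fh) -> list:
    """Return the column names from the first line of a ';'-separated text file."""
    # Assumption: the line may start with a byte-order mark (U+FEFF)
    line = fh.readline().lstrip("\ufeff")
    return next(csv.reader([line], delimiter=";", quotechar='"'))


# Hypothetical sample standing in for the real file
sample = io.StringIO('\ufeff"sl_nr";"messstelle_id";"pna_id"\n1;2;"5/2005/4599"\n')
print(read_header(sample))
```

If the skipped character really is a byte-order mark, opening the file with `encoding="utf-8-sig"` would remove it without any string slicing.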

In [41]:
%time df_qual = pd.read_csv(gw_quality_pfname, sep = ";", dtype = {"messergebnis_c":str ,"messergebnis_hinweis":str }, nrows = 5)

CPU times: total: 0 ns
Wall time: 10 ms


In [42]:
df_qual.head(5)

Unnamed: 0,sl_nr,messstelle_id,pna_id,datum_pn,stoff_nr,probengut,messergebnis_c,messergebnis_hinweis,bestimmungsgrenze,masseinheit,trennverfahren,verfahren,vor_ort,herkunft,aktual_dat,erstell_dat
0,2903561,59620687,5/2005/4599,20051018,1164,Grundwasser,22.0,,,µg/l,Gesamtgehalt,DIN 38406-E22 MAERZ 1988,,HYGC_BR-AR,20051205,20051205
1,2903564,59620687,5/2005/4599,20051018,1061,Grundwasser,6.8,,,-,Gesamtgehalt,DIN 38404-C5 JANUAR 1984,ja,HYGC_BR-AR,20051205,20051205
2,2903565,59620687,5/2005/4599,20051018,1011,Grundwasser,12.8,,,°C,Gesamtgehalt,DIN 38404-C4 DEZEMBER 1976,ja,HYGC_BR-AR,20051205,20051205
3,2903584,59620389,5/2005/5002,20051114,1011,Grundwasser,12.3,,,°C,Gesamtgehalt,DIN 38404-C4 DEZEMBER 1976,ja,HYGC_BR-AR,20051205,20051205
4,2903585,59620080,5/2005/5001,20051111,1061,Grundwasser,7.4,,,-,Gesamtgehalt,DIN 38404-C5 JANUAR 1984,ja,HYGC_BR-AR,20051205,20051205


In [43]:
df_qual.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   sl_nr                 5 non-null      int64  
 1   messstelle_id         5 non-null      int64  
 2   pna_id                5 non-null      object 
 3   datum_pn              5 non-null      int64  
 4   stoff_nr              5 non-null      int64  
 5   probengut             5 non-null      object 
 6   messergebnis_c        5 non-null      object 
 7   messergebnis_hinweis  0 non-null      object 
 8   bestimmungsgrenze     0 non-null      float64
 9   masseinheit           5 non-null      object 
 10  trennverfahren        5 non-null      object 
 11  verfahren             5 non-null      object 
 12  vor_ort               5 non-null      object 
 13  herkunft              5 non-null      object 
 14  aktual_dat            5 non-null      int64  
 15  erstell_dat           5 non

**The complete CSV file with the measured values of the chemical analyses comprises more than 3.6 million measured values!**

In [44]:
# Wall time: 13 s
%time df_qual = pd.read_csv(gw_quality_pfname, sep = ";", index_col=["sl_nr"], \
                            dtype = {"messergebnis_c":str ,"messergebnis_hinweis":str }, \
                            parse_dates = ["datum_pn", "aktual_dat", "erstell_dat"])

CPU times: total: 8.52 s
Wall time: 8.53 s
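In the five-row preview the date columns hold integers such as `20051018`, which `parse_dates` turns into timestamps. A minimal sketch of that conversion for a single value, assuming the format is always `YYYYMMDD` (the helper name `parse_yyyymmdd` is hypothetical):

```python
from datetime import date, datetime


def parse_yyyymmdd(value) -> date:
    """Convert an integer or string like 20051018 to a date (YYYYMMDD assumed)."""
    return datetime.strptime(str(value), "%Y%m%d").date()


# Sample value taken from the datum_pn column of the preview
print(parse_yyyymmdd(20051018))
```

A value that does not match the assumed format raises a `ValueError`, which makes malformed dates easy to spot.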


In [45]:
df_qual.shape

(3671913, 15)

In [46]:
df_qual.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3671913 entries, 2903561 to 2882795
Data columns (total 15 columns):
 #   Column                Dtype         
---  ------                -----         
 0   messstelle_id         int64         
 1   pna_id                object        
 2   datum_pn              datetime64[ns]
 3   stoff_nr              int64         
 4   probengut             object        
 5   messergebnis_c        object        
 6   messergebnis_hinweis  object        
 7   bestimmungsgrenze     float64       
 8   masseinheit           object        
 9   trennverfahren        object        
 10  verfahren             object        
 11  vor_ort               object        
 12  herkunft              object        
 13  aktual_dat            datetime64[ns]
 14  erstell_dat           datetime64[ns]
dtypes: datetime64[ns](3), float64(1), int64(2), object(9)
memory usage: 448.2+ MB


In [47]:
# duplicate sl_nr values? Can it be a unique index?
# Result should be empty
print(df_qual[df_qual.index.duplicated()])

Empty DataFrame
Columns: [messstelle_id, pna_id, datum_pn, stoff_nr, probengut, messergebnis_c, messergebnis_hinweis, bestimmungsgrenze, masseinheit, trennverfahren, verfahren, vor_ort, herkunft, aktual_dat, erstell_dat]
Index: []


## Time Series Example

In [48]:
# time series example
# stoff_nr=1244 ->"Nitrat"
idx = (df_qual["messstelle_id"] == 20002129) & (df_qual["stoff_nr"] == 1244)
df_qual.loc[idx,["datum_pn", "messergebnis_c"]].sort_values("datum_pn")

sl_nr,datum_pn,messergebnis_c
12222093,1985-09-04,20.3642
12149275,1986-05-05,22.5777
12222368,1986-12-11,42.9419
12222587,1987-05-22,23.0204
12222658,1987-11-17,37.1868
12149846,1988-07-08,44.27
12149887,1988-10-27,66.405
12222967,1989-11-28,75.259
12223110,1990-07-06,101.821
12223236,1991-09-18,97.394


### Tests for different measurement value string cases

```
(1)   "1.00" (is_float)
(2)  "<1.00" (is_less)
(3)  ">1.00" (is_greater)
```


In [49]:
# check if string can be converted to float
def is_float(element: str) -> bool:
    try:
        float(element)
        return True
    except ValueError:
        return False

In [50]:
# check if string starts with '<'
def is_less(element: str) -> bool:
    return element.startswith("<")  # also safe for empty strings

In [51]:
# check if string starts with '>'
def is_greater(element: str) -> bool:
    return element.startswith(">")  # also safe for empty strings

In [52]:
print("is_float()")
print(is_float("<1.234"))
print(is_float(">1.234"))
print(is_float("-1.234"))

is_float()
False
False
True


In [53]:
# Some test applications
print("is_less()")
print(is_less("<1.234"))
print(is_less(">1.234"))
print(is_less("1.234"))
print("is_greater()")
print(is_greater("<1.234"))
print(is_greater(">1.234"))
print(is_greater("1.234"))
print("is_float()")
print(is_float("<1.234"))
print(is_float(">1.234"))
print(is_float("1.234"))

is_less()
True
False
False
is_greater()
False
True
False
is_float()
False
False
True


In [54]:
# Apply the tests and create Boolean indexes
%time idx_mess_is_float   = df_qual["messergebnis_c"].apply(is_float)
%time idx_mess_is_less    = df_qual["messergebnis_c"].apply(is_less)
%time idx_mess_is_greater = df_qual["messergebnis_c"].apply(is_greater)

CPU times: total: 2.67 s
Wall time: 2.66 s
CPU times: total: 625 ms
Wall time: 621 ms
CPU times: total: 625 ms
Wall time: 627 ms
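
The three `apply` calls above run a Python function once per row. A possible vectorized alternative, sketched here on a small stand-in Series rather than on the full `df_qual`, uses the pandas string accessor and `pd.to_numeric`. The Series `s` and the variable names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Small stand-in for df_qual["messergebnis_c"] (values taken from the outputs in this notebook)
s = pd.Series(["22.0", "<0.01000", ">1.00000", "-60.00000"])

# Vectorized prefix tests instead of apply(is_less) / apply(is_greater)
is_less = s.str.startswith("<")
is_greater = s.str.startswith(">")

# errors="coerce" turns strings that are not plain numbers into NaN,
# so notna() marks the values that can be converted directly
is_number = pd.to_numeric(s, errors="coerce").notna()

print(pd.DataFrame({"raw": s, "float": is_number, "less": is_less, "greater": is_greater}))
```

One difference to `is_float()`: a literal string such as `"nan"` is accepted by `float()` but would be flagged as non-numeric here, so the two approaches should be compared on the real data before swapping one for the other.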


In [55]:
# Print records which are neither less nor greater nor float -> should be empty data frame
assert df_qual[~idx_mess_is_less & ~idx_mess_is_greater & ~idx_mess_is_float].shape[0] == 0

# Dataframe should be empty
print(df_qual[~idx_mess_is_less & ~idx_mess_is_greater & ~idx_mess_is_float])

Empty DataFrame
Columns: [messstelle_id, pna_id, datum_pn, stoff_nr, probengut, messergebnis_c, messergebnis_hinweis, bestimmungsgrenze, masseinheit, trennverfahren, verfahren, vor_ort, herkunft, aktual_dat, erstell_dat]
Index: []


In [56]:
# res = (~idx_mess_is_less & ~idx_mess_is_greater & ~idx_mess_is_float).value_counts()
res = (idx_mess_is_less | idx_mess_is_greater | idx_mess_is_float).value_counts()
res

True    3671913
Name: messergebnis_c, dtype: int64

## Convert measurement results to float. Fill the limit column.

In [57]:
%time df_qual.loc[idx_mess_is_float,"messergebnis_num"] = df_qual.loc[idx_mess_is_float,"messergebnis_c"].astype(float)
%time df_qual.loc[idx_mess_is_float,"limit"] = "="

%time df_qual.loc[idx_mess_is_less,"messergebnis_num"] = df_qual.loc[idx_mess_is_less,"messergebnis_c"].str[1:].astype(float)
%time df_qual.loc[idx_mess_is_less,"limit"] = "<"

%time df_qual.loc[idx_mess_is_greater,"messergebnis_num"] = df_qual.loc[idx_mess_is_greater,"messergebnis_c"].str[1:].astype(float)
%time df_qual.loc[idx_mess_is_greater,"limit"] = ">"



CPU times: total: 344 ms
Wall time: 348 ms
CPU times: total: 109 ms
Wall time: 103 ms
CPU times: total: 766 ms
Wall time: 760 ms
CPU times: total: 31.2 ms
Wall time: 43 ms
CPU times: total: 15.6 ms
Wall time: 6 ms
CPU times: total: 0 ns
Wall time: 5 ms


In [58]:
print("Different values for column 'limit'")
print(df_qual["limit"].value_counts())

Different values for column 'limit'
<    1974713
=    1697167
>         33
Name: limit, dtype: int64
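
The split into a numeric value and a limit flag can also be sketched as a single pure-Python helper, which is easy to test in isolation. The function name `split_measurement` is hypothetical; it follows the same three cases as above (`"="`, `"<"`, `">"`).

```python
from typing import Tuple


def split_measurement(raw: str) -> Tuple[str, float]:
    """Split a raw result string into (limit, value).

    "<1.00" -> ("<", 1.0), ">1.00" -> (">", 1.0), "1.00" -> ("=", 1.0)
    Raises ValueError if the remaining text is not a number.
    """
    text = raw.strip()
    if text.startswith(("<", ">")):
        # first character is the limit flag, the rest is the number
        return text[0], float(text[1:])
    return "=", float(text)


for raw in ["22.0", "<0.01000", ">1.00000", "-60.00000"]:
    print(raw, "->", split_measurement(raw))
```

A helper like this could be applied row by row, but on 3.6 million rows the Boolean-index approach used above is likely to be faster.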


In [59]:
df_qual[idx_mess_is_greater][["messergebnis_c", "messergebnis_num", "limit"]].head()

sl_nr,messergebnis_c,messergebnis_num,limit
17552890,>1.00000,1.0,>
1263499,>0.03875,0.03875,>
2016179,>1.00000,1.0,>
2923020,>1.00000,1.0,>
2923130,>1.00000,1.0,>


In [60]:
df_qual[idx_mess_is_less][["messergebnis_c", "messergebnis_num", "limit"]].head()

sl_nr,messergebnis_c,messergebnis_num,limit
17716627,<1.00000,1.0,<
17716638,<0.01000,0.01,<
17716639,<0.03000,0.03,<
17716670,<0.00200,0.002,<
17716672,<1.00000,1.0,<


In [61]:
df_qual[idx_mess_is_float][["messergebnis_c", "messergebnis_num", "limit"]].head()

sl_nr,messergebnis_c,messergebnis_num,limit
2903561,22.0,22.0,=
2903564,6.8,6.8,=
2903565,12.8,12.8,=
2903584,12.3,12.3,=
2903585,7.4,7.4,=


In [62]:
# Reason for not being float? XOR: A ^ B
#idx = (~idx_mess_is_float ^ idx_mess_is_less) # non-floats that are not "less" => these must be "greater"
#df_qual[idx]

In [63]:
# Reason for not being float? XOR
#idx = (~idx_mess_is_float ^ idx_mess_is_greater)
#df_qual[idx]

In [64]:
df_qual[df_qual["messergebnis_num"]<0]

sl_nr,messstelle_id,pna_id,datum_pn,stoff_nr,probengut,messergebnis_c,messergebnis_hinweis,bestimmungsgrenze,masseinheit,trennverfahren,verfahren,vor_ort,herkunft,aktual_dat,erstell_dat,messergebnis_num,limit
2937027,219278519,1/2005/90634,2005-11-16,1072,Grundwasser,-60.00000,,,mV,Nach Laborjournal,,,HYGC_BR-K,2006-01-30,2006-01-30,-60.0,=
2943925,219278519,1/2007/90357,2007-02-07,1015,Grundwasser,-0.90000,,,°C,Nach Laborjournal,DIN 38404-C4 DEZEMBER 1976,ja,HYGC_BR-K,2007-08-20,2007-08-20,-0.9,=
2976241,219278519,1/2007/90832,2007-07-23,1072,Grundwasser,-11.00000,,,mV,Nach Laborjournal,,,HYGC_BR-K,2007-10-29,2007-10-29,-11.0,=
2980420,59620419,5/2007/4322,2007-10-11,1072,Grundwasser,-27.00000,,,mV,Gesamtgehalt,DIN 38404-C6 MAI 1984,,HYGC_BR-AR,2007-11-07,2007-11-07,-27.0,=
3006262,59160135,5/2007/4395,2007-10-16,1072,Grundwasser,-5.00000,,,mV,Gesamtgehalt,DIN 38404-C6 MAI 1984,,HYGC_BR-AR,2008-02-26,2008-02-26,-5.0,=
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2868501,59540643,5/2005/1064,2005-02-23,1015,Grundwasser,-5.00000,,,°C,Gesamtgehalt,DIN 38404-C4 DEZEMBER 1976,,HYGC_BR-AR,2005-04-07,2005-04-07,-5.0,=
2868904,59540485,5/2005/1135,2005-03-02,1015,Grundwasser,-1.00000,,,°C,Gesamtgehalt,DIN 38404-C4 DEZEMBER 1976,,HYGC_BR-AR,2005-04-07,2005-04-07,-1.0,=
2869927,59160100,5/2005/1767,2005-04-12,1072,Grundwasser,-4.00000,,,mV,Gesamtgehalt,DIN 38404-C6 MAI 1984,,HYGC_BR-AR,2005-05-06,2005-05-06,-4.0,=
2873380,59160056,5/2005/1636,2005-04-08,1072,Grundwasser,-3.00000,,,mV,Gesamtgehalt,DIN 38404-C6 MAI 1984,,HYGC_BR-AR,2005-07-21,2005-07-21,-3.0,=


## Upload the data to the database with `df.to_sql()`

In [65]:
import sqlalchemy
engine = sqlalchemy.create_engine("postgresql+psycopg2://env_master:M123xyz@localhost/env_db")
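
The connection string above contains the password in plain text. One way to keep credentials out of the notebook is to read them from environment variables and assemble the URL from the parts. This is only a sketch: the variable names `ENV_DB_USER`, `ENV_DB_PASSWORD`, `ENV_DB_HOST` and `ENV_DB_NAME` as well as the helper `build_pg_url` are assumptions for illustration.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, database: str) -> str:
    """Assemble a SQLAlchemy URL for the psycopg2 driver."""
    # quote_plus escapes characters such as "@" or "/" that would break the URL
    return f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Hypothetical variable names; the fallback values are placeholders only
url = build_pg_url(
    user=os.environ.get("ENV_DB_USER", "env_master"),
    password=os.environ.get("ENV_DB_PASSWORD", "change-me"),
    host=os.environ.get("ENV_DB_HOST", "localhost"),
    database=os.environ.get("ENV_DB_NAME", "env_db"),
)
# engine = sqlalchemy.create_engine(url)
```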

In [66]:
# the default to_sql() / sqlalchemy method using psycopg2 (default PG driver) ...
# on my laptop:
# Approx. Wall time: 4min 32s 

%time df_qual.to_sql(con=engine, name="gw_meas", schema="gw", if_exists="fail")
#%time df_qual.to_sql(con=engine, name="gw_meas", schema="gw", if_exists="replace")

ValueError: Table 'gw_meas' already exists.

## Search for duplicates! The primary key is not straightforward!

In [67]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [68]:
print("Connect")
%sql postgresql://env_master:M123xyz@localhost/env_db

Connect


'Connected: env_master@env_db'

In [69]:
%%sql
alter table gw.gw_meas add constraint pk_gw_meas primary key (messstelle_id, datum_pn, stoff_nr)

 * postgresql://env_master:***@localhost/env_db
(psycopg2.errors.InvalidTableDefinition) multiple primary keys for table "gw_meas" are not allowed

[SQL: alter table gw.gw_meas add constraint pk_gw_meas primary key (messstelle_id, datum_pn, stoff_nr)]
(Background on this error at: https://sqlalche.me/e/14/f405)


In [70]:
%%sql
select * from gw.gw_meas where (messstelle_id, datum_pn, stoff_nr) = (73537317, '1990-08-17 00:00:00', 1061)

 * postgresql://env_master:***@localhost/env_db
2 rows affected.


sl_nr,messstelle_id,pna_id,datum_pn,stoff_nr,probengut,messergebnis_c,messergebnis_hinweis,bestimmungsgrenze,masseinheit,trennverfahren,verfahren,vor_ort,herkunft,aktual_dat,erstell_dat,messergebnis_num,limit
5165938,73537317,7/1990/2850,1990-08-17 00:00:00,1061,Grundwasser,6.9,,,-,Gesamtgehalt,DIN 38404-C5 JANUAR 1984,ja,HYGC_BR-K,1994-12-01 00:00:00,1994-12-01 00:00:00,6.9,=
5165937,73537317,7/1990/2850,1990-08-17 00:00:00,1061,Grundwasser,7.2,,,-,Gesamtgehalt,DEV C5-2 4.LIEFERUNG 1966,,HYGC_BR-K,1994-12-01 00:00:00,1994-12-01 00:00:00,7.2,=


Is `sl_nr` unique?

In [71]:
%%sql
select sl_nr,count(sl_nr) as count from gw.gw_meas group by sl_nr having count(sl_nr) > 1; 

 * postgresql://env_master:***@localhost/env_db
0 rows affected.


sl_nr,count


**Ugly Primary Key!**

In [72]:
%%sql
alter table gw.gw_meas add constraint pk_gw_meas primary key (sl_nr)

 * postgresql://env_master:***@localhost/env_db
(psycopg2.errors.InvalidTableDefinition) multiple primary keys for table "gw_meas" are not allowed

[SQL: alter table gw.gw_meas add constraint pk_gw_meas primary key (sl_nr)]
(Background on this error at: https://sqlalche.me/e/14/f405)


**Create some indexes to improve database performance.**

In [73]:
%%sql
create index idx_gw_meas_messstelle_id_datum_pn on gw.gw_meas (messstelle_id, datum_pn)

 * postgresql://env_master:***@localhost/env_db
(psycopg2.errors.DuplicateTable) relation "idx_gw_meas_messstelle_id_datum_pn" already exists

[SQL: create index idx_gw_meas_messstelle_id_datum_pn on gw.gw_meas (messstelle_id, datum_pn)]
(Background on this error at: https://sqlalche.me/e/14/f405)


In [74]:
%%sql
create index idx_gw_id_datum_pn_meas_messstelle on gw.gw_meas (datum_pn, messstelle_id)

 * postgresql://env_master:***@localhost/env_db
(psycopg2.errors.DuplicateTable) relation "idx_gw_id_datum_pn_meas_messstelle" already exists

[SQL: create index idx_gw_id_datum_pn_meas_messstelle on gw.gw_meas (datum_pn, messstelle_id)]
(Background on this error at: https://sqlalche.me/e/14/f405)
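
The `DuplicateTable` errors above occur because the cells were run again after the indexes had already been created. A statement of the form `create index if not exists ...` makes such a cell safe to re-run. The sketch below demonstrates the behaviour with an in-memory SQLite database as a stand-in for PostgreSQL (an assumption made to keep the example self-contained; the schema prefix `gw.` is dropped for the same reason).

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database used in the notebook
con = sqlite3.connect(":memory:")
con.execute(
    "create table gw_meas (sl_nr integer primary key, messstelle_id integer, datum_pn text)"
)

ddl = (
    "create index if not exists idx_gw_meas_messstelle_id_datum_pn "
    "on gw_meas (messstelle_id, datum_pn)"
)
con.execute(ddl)
con.execute(ddl)  # the second run is a no-op instead of an error

# List the indexes that now exist on the table
index_names = [
    row[0]
    for row in con.execute(
        "select name from sqlite_master where type = 'index' and tbl_name = 'gw_meas'"
    )
]
print(index_names)
```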


In [75]:
%%time
%sql select count(*) from gw.gw_meas

 * postgresql://env_master:***@localhost/env_db
1 rows affected.
CPU times: total: 0 ns
Wall time: 236 ms


count
3671913


In [79]:
%%time
%sql select messstelle_id, datum_pn, count(*) as count from gw.gw_meas group by (messstelle_id, datum_pn) limit 20

 * postgresql://env_master:***@localhost/env_db
20 rows affected.
CPU times: total: 0 ns
Wall time: 2 ms


messstelle_id,datum_pn,count
10131310,1984-05-17 00:00:00,17
10131310,1984-11-23 00:00:00,20
10131310,1985-04-26 00:00:00,22
10131310,1988-09-21 00:00:00,25
10131310,1989-09-20 00:00:00,25
10131310,1991-11-22 00:00:00,36
10131310,1992-10-14 00:00:00,37
10131310,1993-11-12 00:00:00,44
10131310,1994-11-29 00:00:00,46
10131310,1995-09-12 00:00:00,36


**ATTENTION! 140515 analyses were performed with more than one method!**

In [77]:
#%%sql
#SELECT messstelle_id, datum_pn, stoff_nr, COUNT(*) AS Count
#FROM gw.gw_meas
#GROUP BY messstelle_id, datum_pn, stoff_nr
#HAVING COUNT(*) > 1;

In [78]:
# %%sql
# SELECT t1.* from gw.gw_meas t1, gw.gw_meas t2 
# where 
# t1.messstelle_id = t2.messstelle_id
# and
# t1.datum_pn = t2.datum_pn
# and
# t1.stoff_nr = t2.stoff_nr
# and
# t1.verfahren <> t2.verfahren
# and

# t1.sl_nr = (select max(sl_nr) from gw.gw_meas t3 
# where
# t1.messstelle_id = t3.messstelle_id
# and
# t1.datum_pn = t3.datum_pn
# and
# t1.stoff_nr = t3.stoff_nr
# )

# limit 1000
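
The two commented queries above can be tried out on a tiny table first. The following sketch uses an in-memory SQLite database instead of PostgreSQL (an assumption to keep it self-contained). The first two rows are the duplicate pair shown earlier; the third row is made up for illustration. Keeping the row with the highest `sl_nr` per key is just one possible rule, not necessarily the right one for every analysis.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table gw_meas (
        sl_nr integer primary key,
        messstelle_id integer, datum_pn text, stoff_nr integer,
        verfahren text, messergebnis_num real
    )
""")
rows = [
    (5165937, 73537317, "1990-08-17", 1061, "DEV C5-2 4.LIEFERUNG 1966", 7.2),
    (5165938, 73537317, "1990-08-17", 1061, "DIN 38404-C5 JANUAR 1984", 6.9),
    (5165939, 73537317, "1990-08-17", 1015, "DIN 38404-C4 DEZEMBER 1976", 11.5),  # made-up row
]
con.executemany("insert into gw_meas values (?, ?, ?, ?, ?, ?)", rows)

# Key combinations that occur more than once
dup_keys = con.execute("""
    select messstelle_id, datum_pn, stoff_nr, count(*) as n
    from gw_meas
    group by messstelle_id, datum_pn, stoff_nr
    having count(*) > 1
""").fetchall()
print(dup_keys)

# One row per key: keep the row with the highest sl_nr
latest = con.execute("""
    select t1.sl_nr, t1.stoff_nr, t1.messergebnis_num
    from gw_meas t1
    where t1.sl_nr = (
        select max(t3.sl_nr) from gw_meas t3
        where t3.messstelle_id = t1.messstelle_id
          and t3.datum_pn = t1.datum_pn
          and t3.stoff_nr = t1.stoff_nr
    )
    order by t1.sl_nr
""").fetchall()
print(latest)
```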

# Import `katalog_stoff`

# Create Views!

## Exercises

1) Add the PostGIS table `gw.gw_stations` as vector layer to QGIS.

2) Use `df.to_sql()` to upload the catalog table (file `katalog_stoff.csv` in the data directory) of the analyzed quantities (substances and physico-chemical parameters, e.g. NO3- (nitrate) concentration, pH, air temperature (can be negative), etc.).

3) Add the catalog with municipalities (file `katalog_gemeinde.csv`)

4) SQL: Create a view joining the gw station table with gw meas table and gw parameter table. (A bit difficult. We have not discussed it yet.)

5) Create a reduced view for nitrate only joining the gw station table with gw meas table and gw parameter table.

6) Try to get the station-nitrate table into QGIS using the PostGIS interface.

SQL: Before you create the views, create primary keys for the tables, i.e. `(messstelle_id)` for `gw_stations` and `(sl_nr)` for `gw_meas`. The combination `(messstelle_id, datum_pn, stoff_nr)` is not unique for `gw_meas`, as the duplicate search above shows.
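
As a starting point for exercises 4 and 5, the join logic of such a view can be sketched on toy tables. The example runs on in-memory SQLite rather than PostgreSQL. The column `name` in `gw_stations` and `katalog_stoff`, the view name `v_gw_nitrate`, the station label, the unit `mg/l` and the pH row are all assumptions for illustration; check the real column names in your tables before adapting it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table gw_stations (messstelle_id integer primary key, name text);
    create table katalog_stoff (stoff_nr integer primary key, name text);
    create table gw_meas (
        sl_nr integer primary key, messstelle_id integer, datum_pn text,
        stoff_nr integer, messergebnis_num real, masseinheit text
    );

    -- toy data: only the nitrate value is taken from the time series example
    insert into gw_stations values (20002129, 'station A');
    insert into katalog_stoff values (1244, 'Nitrat'), (1061, 'pH');
    insert into gw_meas values
        (12222093, 20002129, '1985-09-04', 1244, 20.3642, 'mg/l'),
        (1,        20002129, '1985-09-04', 1061, 7.0,     '-');

    -- reduced view: stations joined with measurements and catalog, nitrate only
    create view v_gw_nitrate as
        select s.messstelle_id, s.name as station, k.name as stoff,
               m.datum_pn, m.messergebnis_num, m.masseinheit
        from gw_meas m
        join gw_stations s on s.messstelle_id = m.messstelle_id
        join katalog_stoff k on k.stoff_nr = m.stoff_nr
        where m.stoff_nr = 1244;
""")

view_rows = con.execute("select * from v_gw_nitrate").fetchall()
print(view_rows)
```

Dropping the `where` clause gives the general view from exercise 4. For use in QGIS the view would additionally need the geometry column of the station table.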