# GeoDataFrames

This session provides an introduction to GeoDataFrames, which link spatial attributes to the data in a Pandas DataFrame. They are provided by the GeoPandas package, which is very powerful but unfortunately plagued by installation issues. Along the way we'll also learn how to download data from online databases and store them in a DataFrame.

We'll start by importing some of the general Python packages. During this session we'll make use of the pathlib library. It provides an object-oriented approach to represent filesystem paths and works across operating systems.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
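
As a quick illustration of the object-oriented approach mentioned above, the sketch below builds a path and inspects its parts. The file name is the one used for the shapefile later in this session; nothing is read from or written to disk here.

```python
from pathlib import Path

# The "/" operator joins path components using the right separator for the OS
shapefile = Path("shp") / "borehole_data.shp"

print(shapefile.name)    # file name including the extension
print(shapefile.stem)    # file name without the extension
print(shapefile.suffix)  # extension, including the leading dot
print(shapefile.parent)  # directory that contains the file
```

Because the path is an object rather than a plain string, the same code runs unchanged on Windows, macOS and Linux.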

## Downloading data from a REST API

In this next exercise we will pull online data into a Pandas DataFrame. We'll make use of the documentation of a library that was developed by Kent Inverarity for obtaining groundwater data from the South Australian WaterConnect database, see
<A href="https://github.com/kinverarity1/python-sa-gwdata">https://github.com/kinverarity1/python-sa-gwdata</A>. The key command behind libraries like this one is the `get` method of the `requests` package, which attempts to retrieve data from a specified source. The request is made by passing a url with a specific structure, which is defined by the application programming interface (API) of the service that is being queried. Several approaches exist, with the most common one today being the representational state transfer architectural style (REST). An API that conforms to this style is called a RESTful API.

Without providing any more technical details, let's just try to see how this works. Click the following link and observe the information that appears in your web browser.

<A href="https://www.waterconnect.sa.gov.au/_layouts/15/dfw.sharepoint.wdd/WDDDMS.ashx/GetObswellNumberSearchData?OBSNUMBER=WLG051">https://www.waterconnect.sa.gov.au/_layouts/15/dfw.sharepoint.wdd/WDDDMS.ashx/GetObswellNumberSearchData?OBSNUMBER=WLG051</A>

The data obtained appear in the form of a table and it can be seen that there are several fields. This is not very useful yet, but it gives you an idea of the information that is sent when you use the `get` method. If you look at the above url, it is possible to recognise three parts:
 * a base url: https://www.waterconnect.sa.gov.au/_layouts/15/dfw.sharepoint.wdd/WDDDMS.ashx/
 * a command: 'GetObswellNumberSearchData'
 * a section with parameters for the command: 'OBSNUMBER=WLG051'

From this we can infer that this url requests a data search based on the observation well (obswell) number, which in this case is specified to be WLG051 (not entirely coincidentally, this is the monitoring bore right next to the farm dam).
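
The three parts listed above can also be pulled apart programmatically. The following is a minimal sketch using only the standard library module `urllib.parse`; the variable names are chosen for illustration, and no request is sent to the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

full_url = (
    "https://www.waterconnect.sa.gov.au/_layouts/15/dfw.sharepoint.wdd/"
    "WDDDMS.ashx/GetObswellNumberSearchData?OBSNUMBER=WLG051"
)

parts = urlsplit(full_url)

# Everything up to the last "/" of the path belongs to the base url,
# the final path segment is the command
base_path, command = parts.path.rsplit("/", 1)
base = f"{parts.scheme}://{parts.netloc}{base_path}/"

# parse_qs returns a list of values per key because keys may be repeated
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(base)
print(command)
print(params)

# Going the other way: rebuild the full url from its three parts
rebuilt = f"{base}{command}?{urlencode(params)}"
```

The dictionary in `params` has the same form as the one that is passed to `requests.get` in the next step.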

Now let's do this in Python using the `get` method. We create a variable `url`, which combines the base url and the command and we pass the command parameters as a dictionary.

In [None]:
import requests

base_url = "https://www.waterconnect.sa.gov.au/_layouts/15/dfw.sharepoint.wdd/WDDDMS.ashx/"
url = base_url + "GetObswellNumberSearchData"
rest_params = {"OBSNUMBER": "WLG051"}

response = requests.get(url, params=rest_params)

The returned data are stored in `response`. This is an object that contains information about the request and, if the request was successful, the data in JSON format (this differs between APIs; other services may use a different format, e.g. CSV). JSON stands for JavaScript Object Notation and is a common data-interchange format. Although it is intended to be readable for humans, it is not as convenient as a DataFrame, so the next code cell converts the data to a DataFrame called `df`.

In [None]:
data = response.json()
df = pd.json_normalize(data)
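
To see what `json_normalize` does without contacting the server, here is a small sketch that starts from a hand-written JSON string. The field names and values are made up for illustration and are not actual WaterConnect output.

```python
import json

import pandas as pd

# Made-up JSON text with two records (placeholder names and values)
json_text = """
[
  {"id": 1, "name": "bore A", "location": {"lat": -35.21, "lon": 138.56}},
  {"id": 2, "name": "bore B", "location": {"lat": -35.23, "lon": 138.58}}
]
"""

records = json.loads(json_text)          # a list of dictionaries
example_df = pd.json_normalize(records)  # one row per record

# Nested dictionaries are flattened into dotted column names
print(example_df.columns.tolist())
```

Each dictionary becomes a row, and the nested `location` dictionary ends up in the columns `location.lat` and `location.lon`.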

Inspecting the DataFrame in the variable explorer shows that it has a column 'DHNO', which stands for drillhole number. We can use this number in combination with the API command 'GetWaterLevelDetails' to get the water level time series for this well.

In [None]:

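# Take the drillhole number from the first (and only) row, so that a single
# value rather than a whole column is passed as the parameter
dhno = df["DHNO"].iloc[0]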
url = base_url + "GetWaterLevelDetails"
rest_params = {"DHNO": dhno}

response = requests.get(url, params=rest_params)

Once again, we can convert the JSON data to a DataFrame. The data in the column 'OBS_DATE' can be converted to a datetime format, and be used as the index of the DataFrame.

In [None]:
data = response.json()
df = pd.json_normalize(data)
df.index = pd.to_datetime(df["OBS_DATE"])
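
The benefit of a datetime index is easiest to see on a small example. The sketch below uses made-up observations (the column names mimic the ones above, the values are placeholders) and shows that rows can then be selected by date.

```python
import pandas as pd

# Made-up observations, deliberately not in chronological order
example = pd.DataFrame(
    {
        "OBS_DATE": ["2020-03-01", "2019-11-15", "2021-07-30"],
        "RSWL": [101.2, 100.8, 101.5],
    }
)

# Convert the text dates, use them as the index and sort chronologically
example["OBS_DATE"] = pd.to_datetime(example["OBS_DATE"])
example = example.set_index("OBS_DATE").sort_index()

# With a sorted DatetimeIndex, rows can be selected with date labels
recent = example.loc["2020":"2021"]
print(recent)
```

Sorting the index first is a choice made for this sketch; it keeps the label-based slicing unambiguous and makes line plots run from left to right in time.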

Plotting the data is then a breeze of course...

In [None]:
fig, ax = plt.subplots()
ax.plot(df.index, df["RSWL"], '.-')
ax.set_title("WLG051")
ax.set_xlabel("Year")
ax.set_ylabel("RSWL")

The examples above demonstrate the use of the `get` method. Libraries like <A href="https://github.com/kinverarity1/python-sa-gwdata">python-sa-gwdata</A> and <A href="https://github.com/ArtesiaWater/hydropandas">Hydropandas</A> wrap Python code around this method to provide a user-friendly way to obtain data from a database. If you work with chemicals then <A href="https://pubchempy.readthedocs.io/en/latest/">PubChemPy</A> is another interesting package to look at, as it allows you to access the data in the <A href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</A> database. More RESTful APIs exist, and their number is growing.

***Exercise***: The following command requests the molecular formula and weight of PFOA (Perfluorooctanoic acid) from the PubChem database in json format:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/pfoa/property/MolecularFormula,MolecularWeight/json

Use `requests.get` to retrieve this record and store the molecular weight in a separate variable. Note that you can pass the above url directly, there is no need to define parameters (because of the way the PUG REST works, see <A href="https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=Compound-Property-Tables">the documentation page</A>).

*Hint: The JSON part of the response is a nested dictionary which contains a list with yet another dictionary (rather convoluted!). The structure of the dictionary can be seen by looking at the variable in the variable explorer. In order to pull out the molecular weight you have to do some slicing/indexing.*

In [None]:
# Pass this url to the get function (no need to pass parameters)
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/pfoa/property/MolecularFormula,MolecularWeight/json"

response = requests.get(url)
data = response.json()
mw = data['PropertyTable']['Properties'][0]['MolecularWeight']
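
The indexing can be easier to follow when it is done one level at a time. The sketch below uses a hand-written stand-in with the nested layout described in the hint; the values are illustrative and were not fetched from PubChem, and the names `example_data` and `example_mw` are chosen here to avoid overwriting the variables above.

```python
# Hand-written stand-in for the parsed response (illustrative values only)
example_data = {
    "PropertyTable": {
        "Properties": [
            {"MolecularFormula": "C8HF15O2", "MolecularWeight": "414.07"}
        ]
    }
}

property_table = example_data["PropertyTable"]  # outer dictionary
properties = property_table["Properties"]       # list of dictionaries
first_record = properties[0]                    # dictionary for the first compound

# float() works whether the value arrives as a number or as text
example_mw = float(first_record["MolecularWeight"])
print(example_mw)
```

Converting with `float` is a precaution taken in this sketch, so that the result can be used in calculations regardless of how the service encodes the number.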

## PyQGIS 
QGIS offers Python support via the QGIS Python API, known as PyQGIS (ArcGIS has its own Python interface, ArcPy). Extensive documentation is provided <A href="https://docs.qgis.org/3.28/en/docs/pyqgis_developer_cookbook/index.html#">here</A>. The example below will use GeoPandas to create a shapefile of some data downloaded from WaterConnect. The shapefile will be imported into QGIS and PyQGIS will be used to create three separate layers showing the boreholes with chemistry, water (level) and salinity data, respectively. The PyQGIS interface will also be used to set a different marker symbol colour for each of the three layers.

Let's start by using the WaterConnect API to download the available boreholes in a rectangular area near McLaren Vale, SA.

In [None]:
url = base_url + "GetGridData"
rest_params = {"Box": "-35.25,138.55,-35.20,138.6"}
response = requests.get(url, params=rest_params)

# Convert to DataFrame
data = response.json()
df = pd.json_normalize(data)
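
The `Box` parameter is a plain comma-separated string. The sketch below splits it into named bounds and checks whether a point falls inside the rectangle. The order south, west, north, east is an assumption inferred from the values themselves, not taken from the API documentation, and the helper `in_box` is hypothetical.

```python
# Box string copied from the request above
box = "-35.25,138.55,-35.20,138.6"

# ASSUMPTION: the order is south, west, north, east (inferred from the values)
south, west, north, east = (float(value) for value in box.split(","))


def in_box(lat, lon):
    """Return True if the point lies inside the bounding box."""
    return south <= lat <= north and west <= lon <= east


print(in_box(-35.22, 138.57))  # a point inside the rectangle
print(in_box(-34.93, 138.60))  # a point well to the north of it
```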

With GeoPandas, the DataFrame can be converted to a shapefile with just a few lines of code. First, the data in the columns 'LON' and 'LAT' are used to create the coordinate data. Together with the DataFrame `df`, the coordinate data in `lat_long_coordinates` are used to create a GeoDataFrame (note that the EPSG code 4326 refers to lat/long coordinates based on the World Geodetic System 1984 ensemble (WGS84)). The `mkdir` call ensures that the subdirectory 'shp' exists, and the method `to_file` saves the shapefile to disk.

In [None]:
import geopandas as gpd

lat_long_coordinates = gpd.points_from_xy(df["LON"], df["LAT"])
gdf = gpd.GeoDataFrame(
    df, 
    geometry=lat_long_coordinates, 
    crs="EPSG:4326",
)

Path('shp').mkdir(exist_ok=True)
gdf.to_file("shp/borehole_data.shp")
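
A detail that is easy to get wrong is the argument order of `points_from_xy`: x (longitude) comes first, y (latitude) second. The sketch below, which uses two made-up boreholes rather than the downloaded data, builds a small GeoDataFrame and inspects the result.

```python
import geopandas as gpd
import pandas as pd

# Two made-up boreholes (placeholder names and coordinates)
example = pd.DataFrame(
    {
        "NAME": ["bore A", "bore B"],
        "LON": [138.56, 138.58],
        "LAT": [-35.21, -35.23],
    }
)

example_gdf = gpd.GeoDataFrame(
    example,
    geometry=gpd.points_from_xy(example["LON"], example["LAT"]),  # x first, then y
    crs="EPSG:4326",
)

first_point = example_gdf.geometry.iloc[0]
print(first_point.x, first_point.y)  # longitude, latitude
print(example_gdf.crs.to_epsg())     # numeric EPSG code of the GeoDataFrame
```

The original columns are kept, so the GeoDataFrame behaves like the DataFrame it was built from, with an extra `geometry` column.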

If running the code in the above cell results in an error message, you can try to solve it by installing the `pyogrio` package. Enter the following commands in the Anaconda Prompt (which you have to open from the Windows Start menu); the first line activates the geopandas_env environment:

`conda activate geopandas_env`

`conda install -c conda-forge pyogrio`

The code below will only work inside the QGIS Python editor (as will be demonstrated during the session). Note that the information on the object model for QGIS is extensive. For example, the documentation for a map layer object can be found <A href="https://api.qgis.org/api/classQgsMapLayer.html">here</A> and there are many, many more (see <A href="https://api.qgis.org/api/modules.html">https://api.qgis.org/api/modules.html</A>).

In [None]:
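# These classes are already available in the QGIS Python console;
# the explicit import makes the dependencies visible
from qgis.core import QgsMarkerSymbol, QgsProject, QgsSingleSymbolRenderer

current_project = QgsProject.instance()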

layer = current_project.mapLayersByName('borehole_data')[0]

field_names = ["CHEM", "WATER", "SAL"]

for field_name in field_names:
    new_layer = layer.clone()
    new_layer.setName(f'{field_name}_data')
    new_layer.setSubsetString(f'"{field_name}" = \'Y\'')
    current_project.addMapLayer(new_layer)

colors = ["red", "green", "blue"] 
for field_name, color in zip(field_names, colors):
    layer = current_project.mapLayersByName(f'{field_name}_data')[0]
    layerRenderer = layer.renderer()
    mSingleRenderer = QgsSingleSymbolRenderer.convertFromRenderer(layerRenderer)

    new_sym = QgsMarkerSymbol.createSimple({"color": color})
    mSingleRenderer.setSymbol(new_sym)
    layer.setRenderer(mSingleRenderer)
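
The filter expression passed to `setSubsetString` mixes two kinds of quotes, which can be hard to read inside an f-string. The sketch below runs in any Python interpreter (no QGIS needed) and simply prints the layer names and expressions that the first loop above generates.

```python
field_names = ["CHEM", "WATER", "SAL"]

# Double quotes enclose the field name, single quotes enclose the text value
subset_strings = {name: f'"{name}" = \'Y\'' for name in field_names}

for name, expression in subset_strings.items():
    print(f"{name}_data: {expression}")
```

Each new layer therefore only shows the boreholes for which the corresponding field has the value 'Y'.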