# Intermediate Lesson on Geospatial Data 

## Spatial Databases

<strong>Lesson Developers:</strong> Jayakrishnan Ajayakumar, Shana Crosson, Mohsen Ahmadkhani

#### Part 4 of 5

In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display, clear_output
from ipywidgets import interactive, Textarea, HBox, Button, Layout
import ipywidgets as widgets
import sqlite3
import spatialite
import pandas as pd
import geopandas as gpd

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
# sys.path.append('supplementary')
import hourofci
try:
    import os
    os.chdir('supplementary')
except OSError:  # no 'supplementary' folder below the current directory; stay where we are
    pass

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

## Making the database spatially-enabled

To perform spatial queries we should **spatially enable our database**. This is exactly what a spatial database is built for. 

> A spatial database is a database that has been extended to include spatial data that represents objects defined in a geographic space, along with tools for querying and analyzing such data.

Remember: 
<ul>
    <li>A spatial database is a database, so we can still leverage all the functionalities of a traditional non-spatial database. 
    <li>A spatial database includes a new data type called <b>Geometry</b> that enables spatial operations on and between objects (i.e., points, lines, and/or polygons). 
</ul>    

Let's look at three example relations (tables) that have spatial data in the form of geometry. 

<table style="background: #fff; font-size:25px; text-align:left">
    <tr>
        <td style="background: #fff; font-size:25px; text-align:left">As shown below, each of the three tables has a special column of the <b>geometry</b> data type. 
    <br/>This column stores spatial information in an encoded form that is not human-readable: it looks like a series of numbers, symbols, and letters, as in the column on the right. <br/>
        <i style="font-size:20px">*Please note that the graphics in the geometry column of the three tables below are for illustration only. </i></td>
        <td style="background: #fff; font-size:25px; text-align:left">  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </td>
<td><img src = "supplementary/images/geom_col.jpg" width = "500px"></td>
    </tr>
</table>
<center><img src = "supplementary/images/geometry_types.png" width = "600px" height = "100px"></center>


### The three basic geometry types are 

1. **Points** (schools, shootings, earthquakes, your location)
2. **Lines** (rivers, streets, roads, railway lines)
3. **Polygons** (countries, states, census tracts, zip codes)


Apart from supporting geometry types, spatial databases also support operations on geometries (e.g., intersection of two polygons).

Queries that involve geometry types are called spatial queries, which are covered in this lesson. 

## Spatial Queries

> **Spatial queries are queries in a spatial database** that can be answered on the **basis of geometric information only,** i.e., the spatial position and extent of the objects involved.

Spatial functions start with the <b> `ST_` </b> prefix and perform specific spatial operations. 

There are many `ST_` functions; we will introduce a few of them here and in the next segment. 

But before we start, let's take a look at our `us_states` table that will be used in this lesson. 

### Fetching and Visualizing our data 
Our spatial database has a table named `us_states` that holds information about the contiguous US states. 

In the module below: <br/>
<ul>
    <li>
        To fetch all rows from it as a <b>Dataframe</b> click the <i>Execute!</i> button.
    </li>
    <li>
        To <b>plot</b> it click the <i>Plot!</i> button. 
    </li>
</ul>


*Please note that this query uses `ST_AsBinary(geom)` instead of `geom`. This function translates the geometry column into a form that Python can read. You won't need it often, so don't be intimidated!*
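
To give a feel for what "a form that Python can read" means, here is a minimal sketch of the Well-Known Binary (WKB) layout for a single 2D point, which is the kind of value `ST_AsBinary` returns. The helper names `encode_point_wkb` and `decode_point_wkb` are made up for this illustration, and the sketch handles only points; in the lesson, GeoPandas does this decoding for you.

```python
import struct

def encode_point_wkb(x, y):
    """Build little-endian WKB for a 2D point (byte order 1, geometry type 1)."""
    return struct.pack("<BIdd", 1, 1, x, y)

def decode_point_wkb(wkb):
    """Decode a 2D point from WKB and return its (x, y) coordinates."""
    byte_order = wkb[0]                      # 1 = little-endian, 0 = big-endian
    prefix = "<" if byte_order == 1 else ">"
    (geom_type,) = struct.unpack(prefix + "I", wkb[1:5])
    if geom_type != 1:
        raise ValueError("this sketch only handles 2D points")
    return struct.unpack(prefix + "dd", wkb[5:21])

# Hypothetical coordinates, used only to show the round trip.
blob = encode_point_wkb(-93.265, 44.978)
print(blob.hex())               # the "non-readable" form stored in a binary column
print(decode_point_wkb(blob))   # the coordinates recovered from it
```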


In [None]:
q1 = """SELECT pk_uid, statefp, geoid, name, aland , awater, ST_AsBinary(geom) as geom 
FROM us_states
"""
inp1 = Textarea(description='<b>Query:</b>', value= q1, layout=Layout(width='40%', height='120px'))
button1 = Button(description="Execute!")
plot1 = Button(description="Plot!")
Box1 = HBox([inp1, button1, plot1])

db = spatialite.connect('databases/spatialDB.sqlite')

def execute_query1(b): 
    clear_output()
    button1.on_click(execute_query1)
    plot1.on_click(plot_query1)
    display(Box1)
    print('Please wait...')
    gdff1 = gpd.GeoDataFrame.from_postgis(inp1.value, db,crs = 'EPSG:3857')
    clear_output()
    button1.on_click(execute_query1)
    plot1.on_click(plot_query1)
    display(Box1)
    return display(gdff1)

def plot_query1(b): 
    clear_output()
    button1.on_click(execute_query1)
    plot1.on_click(plot_query1)
    display(Box1)
    print('Please wait...')
    gdff1 = gpd.GeoDataFrame.from_postgis(inp1.value, db,crs = 'EPSG:3857')
    clear_output()
    button1.on_click(execute_query1)
    plot1.on_click(plot_query1)
    display(Box1)
    return display(gdff1.plot())

button1.on_click(execute_query1)
plot1.on_click(plot_query1)
display(Box1)


## Generating Centroids
Calculating the centroid of a set of polygons is indeed a **spatial** operation. 

Performing this operation is as easy as calling the `st_centroid(geom)` function! It takes the geometry column as its only parameter and returns the centroids (points). 

In the example below you can generate the centroids of the US states yourself! 
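
If you are curious what a centroid calculation involves, the sketch below computes the area-weighted centroid of a simple polygon with the standard shoelace-style formula. It is an illustration only: the function name `polygon_centroid` is hypothetical, and we are assuming (not claiming) that the database does something comparable internally, with extra handling for holes and multi-part shapes.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0          # twice the signed area of the polygon
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]   # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    # The formula divides by 6 * area, which equals 3 * area2.
    return cx / (3 * area2), cy / (3 * area2)

# A 4 x 2 rectangle: its centroid is the middle of the rectangle.
rectangle = [(0, 0), (4, 0), (4, 2), (0, 2)]
print(polygon_centroid(rectangle))
```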

In [None]:

q2 = "select geoid, name, ST_AsBinary(ST_CENTROID(geom)) as geom from us_states"

inp2 = Textarea(description='<b>Query:</b>', value= q2, layout=Layout(width='40%', height='120px'))
button2 = Button(description="Execute!")
plot2 = Button(description="Plot!")
Box2 = HBox([inp2, button2, plot2])

db = spatialite.connect('databases/spatialDB.sqlite')

def execute_query2(b): 
    clear_output()
    button2.on_click(execute_query2)
    plot2.on_click(plot_query2)
    display(Box2)
    print('Please wait...')
    gdff2 = gpd.GeoDataFrame.from_postgis(inp2.value, db,crs = 'EPSG:3857')
    clear_output()
    button2.on_click(execute_query2)
    plot2.on_click(plot_query2)
    display(Box2)
    return display(gdff2)

def plot_query2(b): 
    clear_output()
    button2.on_click(execute_query2)
    plot2.on_click(plot_query2)
    display(Box2)
    print('Please wait...')
    gdff2 = gpd.GeoDataFrame.from_postgis(inp2.value, db,crs = 'EPSG:3857')
    clear_output()
    button2.on_click(execute_query2)
    plot2.on_click(plot_query2)
    display(Box2)
    return display(gdff2.plot())

button2.on_click(execute_query2)
plot2.on_click(plot_query2)
display(Box2)






## Generating Buffers
Creating buffers of any radius is another common **spatial** operation, with a built-in function called `st_buffer(geom, radius)`. 
This function takes the geometry column and the radius of the buffer. The unit of the radius depends on the coordinate system used. 
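
For intuition, the buffer of a single point is simply a circle, which is usually stored as a polygon with many short edges. The sketch below builds such a polygon; `buffer_point` and its `segments` parameter are hypothetical names for this illustration, and buffers of lines and polygons need considerably more work than this.

```python
import math

def buffer_point(x, y, radius, segments=32):
    """Approximate the buffer of a point as a polygon with `segments` vertices."""
    return [
        (x + radius * math.cos(2 * math.pi * i / segments),
         y + radius * math.sin(2 * math.pi * i / segments))
        for i in range(segments)
    ]

# The radius is in the same units as the coordinates (meters, degrees, ...).
ring = buffer_point(0.0, 0.0, 2000.0)
print(len(ring), ring[0])
```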

The following query makes a 30 km buffer around the state of Minnesota:

```sql
select st_buffer(st_transform(geom, 3857), 30000) as geom 
from us_states 
where name='Minnesota'
```
Notice that this query uses the `st_transform(geom, 3857)` function to transform the geometry into the Web Mercator spatial reference system, which uses **meters** as its length unit.

The number 3857 is the spatial reference system identifier (SRID) of the <a href="https://en.wikipedia.org/wiki/Web_Mercator_projection">Web Mercator projection</a>. 
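
As a rough sketch of what such a transformation does, the code below applies the spherical Web Mercator formulas to a longitude/latitude pair given in degrees. The function names are hypothetical, and the sketch assumes the input coordinates are already in degrees; the database takes care of the details for the actual source coordinate system.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator, in meters

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def ground_distance_factor(lat_deg):
    """Approximate ground meters represented by one projected meter at a latitude."""
    return math.cos(math.radians(lat_deg))

# Hypothetical point roughly in Minnesota, used only for illustration.
print(lonlat_to_web_mercator(-94.0, 46.0))
print(ground_distance_factor(46.0))
```

One caveat worth knowing: Web Mercator stretches distances more and more as you move away from the equator, so a projected meter covers less than a true meter on the ground at Minnesota's latitude. For exploring buffers in this lesson the difference does not matter, but a projection suited to the study area is a better choice when distances need to be accurate.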

Run this query in the next slide and change the buffer radius to see how it looks! 

In [None]:

q3 = "select ST_AsBinary(st_buffer(ST_Transform(geom, 3857), 30000)) as geom from us_states where name='Minnesota'"


inp3 = Textarea(description='<b>Query:</b>', value= q3, layout=Layout(width='40%', height='120px'))
button3 = Button(description="Execute!")
plot3 = Button(description="Plot!")
Box3 = HBox([inp3, button3, plot3])

db = spatialite.connect('databases/spatialDB.sqlite')

def execute_query3(b): 
    clear_output()
    button3.on_click(execute_query3)
    plot3.on_click(plot_query3)
    display(Box3)
    print('Please wait...')
    gdff3 = gpd.GeoDataFrame.from_postgis(inp3.value, db,crs = 'EPSG:3857')
    clear_output()
    button3.on_click(execute_query3)
    plot3.on_click(plot_query3)
    display(Box3)
    return display(gdff3)

def plot_query3(b): 
    clear_output()
    button3.on_click(execute_query3)
    plot3.on_click(plot_query3)
    display(Box3)
    print('Please wait...')
    gdff3 = gpd.GeoDataFrame.from_postgis(inp3.value, db,crs = 'EPSG:3857')
    clear_output()
    button3.on_click(execute_query3)
    plot3.on_click(plot_query3)
    display(Box3)
    return display(gdff3.plot())

button3.on_click(execute_query3)
plot3.on_click(plot_query3)
display(Box3)



## Challenge!
#### Can you make a buffer of 2 km around the centroid of Minnesota?

You can check your solution by clicking the *Reveal the SQL code!* button!

In [None]:

q4 = "SELECT \nFROM \nWHERE"


inp4 = Textarea(description='<b>Query:</b>', value= q4, layout=Layout(width='40%', height='120px'))
button4 = Button(description="Execute!")
plot4 = Button(description="Plot!")
Box4 = HBox([inp4, button4, plot4])

db = spatialite.connect('databases/spatialDB.sqlite')

def execute_query4(b): 
    clear_output()
    button4.on_click(execute_query4)
    plot4.on_click(plot_query4)
    display(Box4)
    print('Please wait...')
    gdff4 = gpd.GeoDataFrame.from_postgis(inp4.value, db,crs = 'EPSG:3857')
    clear_output()
    button4.on_click(execute_query4)
    plot4.on_click(plot_query4)
    display(Box4)
    return display(gdff4)

def plot_query4(b): 
    clear_output()
    button4.on_click(execute_query4)
    plot4.on_click(plot_query4)
    display(Box4)
    print('Please wait...')
    gdff4 = gpd.GeoDataFrame.from_postgis(inp4.value, db,crs = 'EPSG:3857')
    clear_output()
    button4.on_click(execute_query4)
    plot4.on_click(plot_query4)
    display(Box4)
    return display(gdff4.plot())

button4.on_click(execute_query4)
plot4.on_click(plot_query4)
display(Box4)


In [None]:
button5 = Button(description="Reveal the SQL code!")
Box5 = HBox([button5])
query5 = "SELECT ST_AsBinary(st_buffer(st_centroid(ST_Transform(geom, 3857)), 2000)) as geom \nFROM us_states \nWHERE name='Minnesota'"

def on_click5(b):
    clear_output()
    display(Box5)
    return print(query5)

button5.on_click(on_click5)
display(Box5)

### Containment Query
The function **st_contains(geometry A,geometry B)** returns true if geometry A completely contains geometry B.


<img src = "supplementary/images/containment_detail.png" width = "600px">
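
For the common point-in-polygon case, a containment test can be sketched with the classic ray-casting idea: shoot a horizontal ray from the point and count how many polygon edges it crosses. The function name `point_in_polygon` is hypothetical, and this sketch ignores holes and does not treat points exactly on the boundary with any care, which a real spatial database does.

```python
def point_in_polygon(px, py, vertices):
    """Return True if (px, py) is inside the simple polygon [(x, y), ...]."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line y = py?
        if (y0 > py) != (y1 > py):
            # x coordinate where the edge crosses that line
            x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if px < x_cross:
                inside = not inside   # each crossing to the right flips the state
    return inside

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square))    # a point in the middle of the square
print(point_in_polygon(15, 5, square))   # a point to the right of the square
```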

Some real-world examples include:

 <table>
    <tr>
        <td style="background: #fff; font-size:25px; text-align:left">
            <img src = "supplementary/images/PointInPolygon.png" width = "600px">
        </td>
        <td style="background: #fff; font-size:25px; text-align:left">
            <img src = "supplementary/images/Covid_Cases.png" width = "600px">
        </td>
    </tr>
  </table>

### Query: How many Starbucks are there in Minnesota?

<center><img src = "supplementary/images/137-44505.png" width = "400px"></center>
In the next slide, we will look at the two <b>tables</b> involved in this query.

To perform such a query, we need two spatial tables: one that keeps the US states' information (the `us_states` table) and another that holds the Starbucks stores' information (the `starbucks` table), as shown below:

In [None]:
from ipywidgets import Button, HBox, VBox,widgets,Layout,GridspecLayout,IntSlider,HTML
from IPython.display import display
import spatialite
import pandas as pd
db = spatialite.connect('databases/spatialDB.sqlite')
table1 = pd.read_sql_query('select statefp,name,geom as geometry from us_states limit 5',db)
table2 = pd.read_sql_query('select pk_uid,fid,geom as geometry from starbucks limit 5',db)
table1_disp = widgets.Output()
table2_disp = widgets.Output()
table1_header = widgets.HTML(value="<center><b><font color='red'>US_STATES</font></b></center>")
table2_header = widgets.HTML(value="<center><b><font color='red'>STARBUCKS</font></b></center>")
with table1_disp:
    display(table1)
with table2_disp:
    display(table2)
out=HBox([VBox([table1_header,table1_disp],layout = Layout(margin='0 100px 0 0')),VBox([table2_header,table2_disp])])
out

To count all the STARBUCKS that are **within** the state of 'Minnesota' we can use the following query:

```sql
select count(*) as total_starbucks 
from us_states u,starbucks s 
where u.name = 'Minnesota' and st_contains(u.geom,s.geom)
```

From the last chapter you might recall that this is a join operation involving multiple tables. But unlike the examples we saw there, there is no explicit key-based relationship between the two tables. 

So instead of a key-based relationship, the join uses the relationship between the geometries of the two tables. Joins of this type are called **Spatial Joins**.

>**Spatial Join** - **Joins attributes from one table to another based on their spatial relationship**.

#### Let's dismantle the query!
The clause
```sql
where u.name = 'Minnesota' 
```
selects the row of the `us_states` table whose name is 'Minnesota', and the function
```sql
st_contains(u.geom,s.geom)
```

keeps only the pairs of rows from the `starbucks` and `us_states` tables in which the **geometry in the `starbucks` table (a `point`) is contained by the geometry in the `us_states` table (a `polygon`, in this case Minnesota)**. Finally,
```sql
count(*) as total_starbucks
```
counts the rows that satisfy the where clause and names the result `total_starbucks`.
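
The same join logic can be tried without a spatial database. The sketch below uses Python's built-in `sqlite3` module with a toy containment function: rectangles stand in for state polygons, and `box_contains` stands in for `st_contains`. The table names, column names, and data are all made up for this illustration.

```python
import sqlite3

def box_contains(xmin, ymin, xmax, ymax, x, y):
    """Toy containment test: is point (x, y) strictly inside the rectangle?"""
    return int(xmin < x < xmax and ymin < y < ymax)

conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL, like a (very simple) spatial function.
conn.create_function("box_contains", 6, box_contains)
conn.executescript("""
    CREATE TABLE regions (name TEXT, xmin REAL, ymin REAL, xmax REAL, ymax REAL);
    CREATE TABLE shops (id INTEGER, x REAL, y REAL);
    INSERT INTO regions VALUES ('West', 0, 0, 10, 10), ('East', 10, 0, 20, 10);
    INSERT INTO shops VALUES (1, 2, 3), (2, 8, 9), (3, 15, 5);
""")

# No shared key joins the tables: the containment test is the join condition.
query = """
    SELECT count(*) AS total_shops
    FROM regions r, shops s
    WHERE r.name = 'West'
      AND box_contains(r.xmin, r.ymin, r.xmax, r.ymax, s.x, s.y)
"""
total_shops = conn.execute(query).fetchone()[0]
print(total_shops)
```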


Run this query in the next slide!


In [None]:

inp6 = Textarea(description='<b>Query:</b>', value="select count(*) as total_starbucks \nfrom us_states u,starbucks s \nwhere u.name = 'Minnesota' and st_contains(u.geom,s.geom)" , layout=Layout(width='40%', height='120px'))
button6 = Button(description="Execute!")          
Box6 = HBox([inp6, button6])

db = spatialite.connect('databases/spatialDB.sqlite')


def execute_query6(b):
    clear_output()
    button6.on_click(execute_query6)
    display(Box6)
    print('Please wait...')
    table16 = pd.read_sql_query(inp6.value,db)
    clear_output()
    button6.on_click(execute_query6)
    display(Box6)
    return display(table16)

button6.on_click(execute_query6)
display(Box6)


As shown in the example below, `st_contains(u.geom,s.geom)` identifies the objects (stars) that lie within another object (circles), and `count(*)` counts them.

<img src = "supplementary/images/PointInMultiplePolygons.png" width = "600px">

Now, let's modify the question as below:
### How many Starbucks are there in *each* state?

The key difference here is that we are not selecting any particular state and we want our results to be **grouped by** the state names.

We can write this query as 

```sql
select u.name,count(*) as total_starbucks 
from us_states u,starbucks s 
where st_contains(u.geom,s.geom) 
group by u.name
```

If you compare this query to the previous one, you will notice that the clause that checks the state name has been removed and a `group by` clause has been added. 
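
Conceptually, the grouped query tests every (state, store) pair and tallies the matches per state. Here is a minimal plain-Python sketch of that idea, again with made-up rectangles and points standing in for real geometries; the names `regions`, `shops`, and `box_contains` are hypothetical.

```python
from collections import Counter

# Rectangles given as (xmin, ymin, xmax, ymax); stand-ins for state polygons.
regions = {
    "West": (0, 0, 10, 10),
    "East": (10, 0, 20, 10),
    "North": (0, 10, 20, 20),
}
# Points given as (x, y); stand-ins for store locations.
shops = [(2, 3), (8, 9), (15, 5)]

def box_contains(box, point):
    """Toy containment test: is the point strictly inside the rectangle?"""
    xmin, ymin, xmax, ymax = box
    x, y = point
    return xmin < x < xmax and ymin < y < ymax

# Test every (region, shop) pair and count the matches per region name.
counts = Counter(
    name
    for name, box in regions.items()
    for point in shops
    if box_contains(box, point)
)
print(dict(counts))
```

Note that a region with no points inside it (here `North`) is missing from the result altogether rather than listed with a count of zero. The SQL query above behaves the same way, because a state only appears if at least one row satisfies the where clause.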


Run this query yourself in the next slide!

In [None]:

inp7 = Textarea(description='<b>Query:</b>', value="select u.name,count(*) as total_starbucks \nfrom us_states u,starbucks s \nwhere st_contains(u.geom,s.geom) \ngroup by u.name", layout=Layout(width='40%', height='120px'))
button7 = Button(description="Execute!")
Box7 = HBox([inp7, button7])

def execute_query7(b):
    clear_output()
    button7.on_click(execute_query7)
    display(Box7)
    print('Please wait...')
    table17 = pd.read_sql_query(inp7.value,db)
    clear_output()
    button7.on_click(execute_query7)
    display(Box7)
    return display(table17)

button7.on_click(execute_query7)
display(Box7)


### Let's visualize the result!

In the previous slide you were able to return the number of Starbucks branches in each state. Now, let's see those branches on a map! In the map below you can change the state interactively, get the counts, and see them on the map.


In [None]:
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON,Choropleth
from ipywidgets import Button, HBox, VBox,widgets,Layout,GridspecLayout,IntSlider,HTML
from IPython.display import display
import spatialite
import pandas as pd
import geopandas as gpd
import json
import time
db = spatialite.connect('databases/spatialDB.sqlite')
def stateChanged(slid):
    layer_group.clear_layers()
    stateGeomSql = f"SELECT ST_AsBinary(geom) as geom FROM us_states where name='{states.value}';"
    starbucksSql = f"""SELECT ST_AsBinary(s.geom) as geom FROM us_states u,starbucks s where u.name='{states.value}'
     and st_contains(u.geom,s.geom) and s.rowid in(SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'starbucks' 
        AND search_frame = u.geom)"""
    gdf = gpd.GeoDataFrame.from_postgis(stateGeomSql, db,crs = 'EPSG:4269').to_crs('EPSG:4326')
    starbucksgdf = gpd.GeoDataFrame.from_postgis(starbucksSql, db,crs = 'EPSG:4326')
    center = [gdf.centroid.y.values[0],gdf.centroid.x.values[0]]
    sMap.center = center
    sMap.zoom = 6
    geo_data = GeoJSON(data = json.loads(gdf.to_json()),style={'opacity': 1, 'dashArray': '9', 'fillOpacity': 0, 'weight': 1})
    layer_group.add_layer(geo_data)
    geo_data_starbucks = GeoJSON(data = json.loads(starbucksgdf.to_json()))
    layer_group.add_layer(geo_data_starbucks)
    counts.value = str(len(starbucksgdf))
    
sql = "SELECT name FROM us_states order by name;"
statedf = pd.read_sql_query(sql,db)
sMap= Map(center=(41.482222, -81.669722), zoom=15,prefer_canvas =True)
layer_group = LayerGroup()
sMap.add_layer(layer_group)
states = widgets.Dropdown(
    options=statedf.name.values,
    value=statedf.name.values[0],
    description='State:',
    disabled=False,
)
counts=widgets.Text(
    value='',
    placeholder='',
    description='Total:',
    disabled=True,
)
states.observe(stateChanged, 'value')
filterParams=HBox([sMap,VBox([states,counts])])
stateChanged(None)
filterParams

Click the link below to move on.


<br>
<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="gd-5.ipynb">Click here to go to the next notebook.</a></font>