# Intermediate Lesson on Geospatial Data 

## Spatial Databases

<strong>Lesson Developers:</strong> Jayakrishnan Ajayakumar, Shana Crosson, Mohsen Ahmadkhani

#### Part 4 of 5

In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
# sys.path.append('supplementary')
import hourofci
try:
    import os
    os.chdir('supplementary')
except OSError:
    # The directory may not exist relative to the current location; stay where we are.
    pass

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

In the last chapter, the Kindergarten and Schools example showed that with a traditional database and SQL queries we can neither ask **"spatially relevant"** questions nor identify spatial patterns (put simply, we can't map).

To answer such questions we need to **spatially enable our database**. This is exactly what a spatial database is built for. Let's look at the formal definition of a spatial database.

>A **spatial database is a database** that has been **enhanced to include spatial data that represents objects defined in a geometric space, along with tools for querying and analyzing such data**.

The first key point is that a **spatial database is a database**, so we can still use all the functionality of a traditional non-spatial database. The second is that a spatial database adds a **new type called Geometry** and can perform operations on geometries and between them. Let's look at three example relations (tables) that store spatial data in the form of geometry.


<img src = "supplementary/images/geometry_types.png" width = "900px">

As you can see, the three tables have a special column called **geometry** (the images shown in the geometry column are only for illustration). The three basic geometry types are

1. Points (schools, shootings, earthquakes, your location)
2. Lines (rivers, streets, roads, railway lines)
3. Polygons (countries, states, census tracts, zip codes)

Other geometries are built on top of the basic geometry types, such as MultiPoint, MultiLineString, and MultiPolygon.
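
To make the three basic types concrete, here is a minimal sketch that writes a point, a line, and a polygon in Well-Known Text (WKT), a common plain-text notation for geometries. The helper names and the coordinates are hypothetical and only for illustration; a spatial database stores geometries in its own internal format and provides its own constructors.

```python
# Hypothetical helpers for illustration only; real spatial databases store
# geometries in their own binary formats and expose richer constructors.

def point_wkt(x, y):
    """Return the WKT text for a single point."""
    return f"POINT ({x} {y})"

def linestring_wkt(coords):
    """Return the WKT text for a line made of two or more (x, y) pairs."""
    body = ", ".join(f"{x} {y}" for x, y in coords)
    return f"LINESTRING ({body})"

def polygon_wkt(ring):
    """Return the WKT text for a polygon with one outer ring.

    The ring is closed automatically if the last vertex differs from the first.
    """
    if ring[0] != ring[-1]:
        ring = list(ring) + [ring[0]]
    body = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({body}))"

school = point_wkt(-93.26, 44.98)                       # a made-up location
river = linestring_wkt([(0, 0), (1, 2), (3, 3)])        # a made-up path
state = polygon_wkt([(0, 0), (4, 0), (4, 3), (0, 3)])   # a made-up boundary

print(school)  # POINT (-93.26 44.98)
print(river)   # LINESTRING (0 0, 1 2, 3 3)
print(state)   # POLYGON ((0 0, 4 0, 4 3, 0 3, 0 0))
```

Note that a polygon repeats its first vertex at the end, which is how the ring is marked as closed.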

Apart from supporting geometry types, a spatial database also supports operations on a single geometry as well as between geometries.

Queries that involve geometry types are called spatial queries, which we cover in the next section.

## Spatial Queries

> **Spatial queries are queries in a spatial database** that can be answered on the **basis of geometric information only, i.e., the spatial position and extent of the objects involved**.

Let's look at various types of spatial functions and queries.

### Containment Query
The function **st_contains(geometry A, geometry B)** returns true if geometry A completely contains geometry B.
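
For the common case where A is a polygon and B is a point, the idea can be sketched with the classic ray-casting test. This is only an illustration under simplifying assumptions (a single ring, no holes, no special handling of points exactly on the boundary); it is not how any particular spatial database implements `st_contains`, and the function name `point_in_polygon` is hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring.

    `ring` is a list of (x, y) vertices. Points exactly on the boundary
    are not treated specially in this sketch.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```

A ray drawn from the point to the right crosses the boundary an odd number of times when the point is inside and an even number of times when it is outside.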


<img src = "supplementary/images/containment_detail.png" width = "600px">

Some real-world examples include

 <table>
    <tr>
        <td>
            <img src = "supplementary/images/PointInPolygon.png" width = "600px">
        </td>
        <td>
            <img src = "supplementary/images/Covid_Cases.png" width = "600px">
        </td>
    </tr>
  </table>

**How many Starbucks are there in my state?**

<img src = "supplementary/images/137-44505.png" width = "400px">

Let's look at the two **tables** involved in this query.

In [None]:
from ipywidgets import Button, HBox, VBox,widgets,Layout,GridspecLayout,IntSlider,HTML
from IPython.display import display
import spatialite
import pandas as pd
db = spatialite.connect('databases/spatialDB.sqlite')
table1 = pd.read_sql_query('select statefp,name,geom as geometry from us_states limit 5',db)
table2 = pd.read_sql_query('select pk_uid,fid,geom as geometry from starbucks limit 5',db)
table1_disp = widgets.Output()
table2_disp = widgets.Output()
table1_header = widgets.HTML(value = "<b><font color='red'><center>US_STATES</center></font></b>")
table2_header = widgets.HTML(value = "<b><font color='red'><center>STARBUCKS</center></font></b>")
with table1_disp:
    display(table1)
with table2_disp:
    display(table2)
out=HBox([VBox([table1_header,table1_disp],layout = Layout(margin='0 100px 0 0')),VBox([table2_header,table2_disp])])
out

As you can see, both tables (US_STATES and STARBUCKS) have a geometry column that can be used for spatial querying. Suppose your state is California; to count all the Starbucks locations **within** the state of 'California' we can use the query,

```sql
select count(*) as total_starbucks from us_states u,starbucks s where u.name = 'California' and st_contains(u.geom,s.geom)
```

From the last chapter you might recall that this is a join operation involving multiple tables. But unlike the examples that we have seen, there is no explicit key-based relationship between the two tables.

So instead of using a key-based relationship for the join, we use the relationship between the geometries of the two tables. Joins of this type are called **Spatial Joins**.

>**Spatial Joins** - **Joins attributes from one table to another based on the spatial relationship**.

The clause
```sql
where u.name = 'California' 
```
retrieves all rows from the us_states table whose name is 'California' (there is only one), and the clause
```sql
st_contains(u.geom,s.geom)
```

retrieves the pairs of rows from the starbucks and us_states tables where the **geometry in the starbucks table (a point) is contained by the geometry in the us_states table (a polygon)**, which in this case is California. Finally,
```sql
count(*) as total_starbucks
```
counts the number of rows returned by the where clause and names the result total_starbucks.
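
The same query shape can be tried without a spatial database. The sketch below uses Python's built-in `sqlite3` module with a user-defined SQL function standing in for `st_contains`. Everything here is an assumption made for illustration: the tables `toy_states` and `toy_shops`, the function `box_contains`, and the simplification that each "state" is a rectangle rather than a real polygon.

```python
import sqlite3

def box_contains(min_x, min_y, max_x, max_y, x, y):
    """Stand-in for st_contains: 1 if the point is inside the rectangle, else 0."""
    return int(min_x < x < max_x and min_y < y < max_y)

con = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
con.create_function("box_contains", 6, box_contains)

# Toy tables: each "state" is a rectangle, each "shop" is a point.
con.execute("create table toy_states (name text, min_x real, min_y real, max_x real, max_y real)")
con.execute("create table toy_shops (id integer, x real, y real)")
con.executemany("insert into toy_states values (?, ?, ?, ?, ?)",
                [("East", 5, 0, 10, 5), ("West", 0, 0, 5, 5)])
con.executemany("insert into toy_shops values (?, ?, ?)",
                [(1, 1, 1), (2, 2, 3), (3, 7, 2), (4, 12, 2)])

# No key links the two tables; the join condition is purely geometric.
query = """
select count(*) as total_shops
from toy_states u, toy_shops s
where u.name = 'West'
  and box_contains(u.min_x, u.min_y, u.max_x, u.max_y, s.x, s.y)
"""
total = con.execute(query).fetchone()[0]
print(total)  # 2
```

Shops 1 and 2 fall inside the 'West' rectangle, shop 3 falls in 'East', and shop 4 lies outside both, so the count is 2.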

Let's look at an interactive example of this query. Here we can change the state interactively and get the count. Along with the count, the locations of the Starbucks are displayed on the map.

In [None]:
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON,Choropleth
from ipywidgets import Button, HBox, VBox,widgets,Layout,GridspecLayout,IntSlider,HTML
from IPython.display import display
import spatialite
import pandas as pd
import geopandas as gpd
import json
import time
db = spatialite.connect('databases/spatialDB.sqlite')
def stateChanged(slid):
    layer_group.clear_layers()
    stateGeomSql = f"SELECT ST_AsBinary(geom) as geom FROM us_states where name='{states.value}';"
    starbucksSql = f"""SELECT ST_AsBinary(s.geom) as geom FROM us_states u,starbucks s where u.name='{states.value}'
     and st_contains(u.geom,s.geom) and s.rowid in(SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'starbucks' 
        AND search_frame = u.geom)"""
    gdf = gpd.GeoDataFrame.from_postgis(stateGeomSql, db,crs = 'EPSG:4269').to_crs('EPSG:4326')
    starbucksgdf = gpd.GeoDataFrame.from_postgis(starbucksSql, db,crs = 'EPSG:4326')
    center = [gdf.centroid.y.values[0],gdf.centroid.x.values[0]]
    sMap.center = center
    sMap.zoom = 6
    geo_data = GeoJSON(data = json.loads(gdf.to_json()),style={'opacity': 1, 'dashArray': '9', 'fillOpacity': 0, 'weight': 1})
    layer_group.add_layer(geo_data)
    geo_data_starbucks = GeoJSON(data = json.loads(starbucksgdf.to_json()))
    layer_group.add_layer(geo_data_starbucks)
    counts.value = str(len(starbucksgdf))
    
sql = "SELECT name FROM us_states order by name;"
statedf = pd.read_sql_query(sql,db)
sMap= Map(center=(41.482222, -81.669722), zoom=15,prefer_canvas =True)
layer_group = LayerGroup()
sMap.add_layer(layer_group)
states = widgets.Dropdown(
    options=statedf.name.values,
    value=statedf.name.values[0],
    description='State:',
    disabled=False,
)
counts=widgets.Text(
    value='',
    placeholder='',
    description='Total:',
    disabled=True,
)
states.observe(stateChanged, 'value')
filterParams=HBox([sMap,VBox([states,counts])])
stateChanged(None)
filterParams

We can also modify the question to, **"How many Starbucks are there in each state?"**


<img src = "supplementary/images/PointInMultiplePolygons.png" width = "600px">

**Total Starbucks in each state**

The key difference here is that we are not selecting any particular state, and we want the results to be **grouped** by state name.

We can write this **query** as 

```sql
select u.name,count(*) as total_starbucks from us_states u,starbucks s where st_contains(u.geom,s.geom) group by u.name
```

If you compare this query to the previous one, you will notice that the clause that checks the state name has been removed and a `group by` clause has been added.
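
What the grouped spatial join computes can be sketched in plain Python. The data below is made up (rectangles standing in for state polygons, and `box_contains` as a hypothetical stand-in for `st_contains`); the point is only to show the two steps: keep the matching (state, shop) pairs, then collapse them into one count per state name.

```python
from collections import Counter

# Toy data: rectangles as (min_x, min_y, max_x, max_y), shops as (x, y).
toy_states = {"West": (0, 0, 5, 5), "East": (5, 0, 10, 5), "North": (0, 5, 10, 10)}
toy_shops = [(1, 1), (2, 3), (7, 2), (12, 2)]

def box_contains(box, point):
    """Stand-in for st_contains on a rectangle and a point."""
    min_x, min_y, max_x, max_y = box
    x, y = point
    return min_x < x < max_x and min_y < y < max_y

# The "where" step keeps matching (state, shop) pairs; the "group by" step
# collapses them into one count per state name.
matches = [name for name, box in toy_states.items()
           for shop in toy_shops if box_contains(box, shop)]
totals = Counter(matches)

for name in sorted(totals):
    print(name, totals[name])
# East 1
# West 2
```

'North' contains no shops, so it produces no matching pairs and does not appear in the result. The SQL query above behaves the same way: a state with no contained points is simply absent from the grouped output.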

In [None]:
from ipywidgets import Button, HBox, VBox,widgets,Layout,GridspecLayout,IntSlider,HTML
from IPython.display import display
db = spatialite.connect('databases/spatialDB.sqlite')
table1 = pd.read_sql_query("""SELECT statefp,name,u.geom as geometry,s.pk_uid,fid,s.geom as geometry from us_states u,starbucks s
 where st_contains(u.geom,s.geom) and s.rowid in(SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'starbucks' 
        AND search_frame = u.geom) limit 10""",db)
table1_disp = widgets.Output()
table1_header = widgets.HTML(value = "<b><font color='red'><center>Matching Rows</center></font></b>")
with table1_disp:
    display(table1)
out=HBox([VBox([table1_header,table1_disp])])
out

Now, to group this table by name and show the total count for each name, use count() together with group by name.

In [None]:
disp = widgets.Output()
db = spatialite.connect('databases/spatialDB.sqlite')
stateGeomSql = f"""SELECT u.name,count(*) as total_sbucks from us_states u,starbucks s
 where st_contains(u.geom,s.geom) and s.rowid in(SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'starbucks' 
        AND search_frame = u.geom) group by u.name"""
data = pd.read_sql_query(stateGeomSql,con=db)
with disp:
    display(data)
disp

We can ask similar questions using the containment query. 
Try out some examples:

**How many Domino's are there in my state?**

Let's look at the tables:

In [None]:
from ipywidgets import Button, HBox, VBox,widgets,Layout,GridspecLayout,IntSlider,HTML
from IPython.display import display
db = spatialite.connect('databases/spatialDB.sqlite')
table1 = pd.read_sql_query('select statefp,name,geom as geometry from us_states limit 5',db)
table2 = pd.read_sql_query('select pk_uid,fid,geom as geometry from dominos limit 5',db)
table1_disp = widgets.Output()
table2_disp = widgets.Output()
table1_header = widgets.HTML(value = "<b><font color='red'><center>US_STATES</center></font></b>")
table2_header = widgets.HTML(value = "<b><font color='red'><center>DOMINOS</center></font></b>")
with table1_disp:
    display(table1)
with table2_disp:
    display(table2)
out=HBox([VBox([table1_header,table1_disp],layout = Layout(margin='0 100px 0 0')),VBox([table2_header,table2_disp])])
out

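And this is the query. It returns the geometry of every Domino's location in the chosen state, so the total is simply the number of rows returned.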

```sql
SELECT ST_AsBinary(s.geom) as geom FROM us_states u,dominos s where u.name='California'
     and st_contains(u.geom,s.geom)
```

Let's look at an interactive example:

In [None]:
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
from ipywidgets import Button, HBox, VBox,widgets,Layout,GridspecLayout,IntSlider,HTML
from IPython.display import display
import spatialite
import pandas as pd
import geopandas as gpd
import json
import time
db = spatialite.connect('databases/spatialDB.sqlite')

def stateChanged(slid):
    layer_group.clear_layers()
    stateGeomSql = f"SELECT ST_AsBinary(geom) as geom FROM us_states where name='{states.value}';"
    dominosSql = f"""SELECT ST_AsBinary(s.geom) as geom FROM us_states u,dominos s where u.name='{states.value}'
     and st_contains(u.geom,s.geom) and s.rowid in(SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'dominos' 
        AND search_frame = u.geom)"""
    gdf = gpd.GeoDataFrame.from_postgis(stateGeomSql, db,crs = 'EPSG:4269').to_crs('EPSG:4326')
    dominosgdf = gpd.GeoDataFrame.from_postgis(dominosSql, db,crs = 'EPSG:4326')
    center = [gdf.centroid.y.values[0],gdf.centroid.x.values[0]]
    sMap.center = center
    sMap.zoom = 6
    geo_data = GeoJSON(data = json.loads(gdf.to_json()),style={'opacity': 1, 'dashArray': '9', 'fillOpacity': 0, 'weight': 1})
    layer_group.add_layer(geo_data)
    geo_data_dominos = GeoJSON(data = json.loads(dominosgdf.to_json()))
    layer_group.add_layer(geo_data_dominos)
    counts.value = str(len(dominosgdf))
    
sql = "SELECT name FROM us_states order by name;"
statedf = pd.read_sql_query(sql,db)
sMap= Map(center=(41.482222, -81.669722), zoom=15,prefer_canvas =True)
layer_group = LayerGroup()
sMap.add_layer(layer_group)
states = widgets.Dropdown(
    options=statedf.name.values,
    value=statedf.name.values[0],
    description='State:',
    disabled=False,
)
counts=widgets.Text(
    value='',
    placeholder='',
    description='Total:',
    disabled=True,
)
states.observe(stateChanged, 'value')
filterParams=HBox([sMap,VBox([states,counts])])
stateChanged(None)
filterParams

Another example 

**Total Domino's in each state**

And this is the query

```sql
   SELECT u.name,count(*) as total_dominos from us_states u,dominos s
 where st_contains(u.geom,s.geom) group by u.name
```

In [None]:
disp = widgets.Output()
db = spatialite.connect('databases/spatialDB.sqlite')
stateGeomSql = f"""SELECT u.name,count(*) as total_dominos from us_states u,dominos s
 where st_contains(u.geom,s.geom) and s.rowid in(SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'dominos' 
        AND search_frame = u.geom) group by u.name"""
data = pd.read_sql_query(stateGeomSql,con=db)
with disp:
    display(data)
disp

One more example

**Total homicides in each neighborhood**

Here are the tables

In [None]:
from ipywidgets import HBox, VBox,widgets,Layout,HTML
from IPython.display import display
db = spatialite.connect('databases/spatialDB.sqlite')
table1 = pd.read_sql_query('select boroname,name,geom as geometry from nyc_neighborhoods limit 5',db)
table2 = pd.read_sql_query('select id,weapon, light_dark, year, geom as geometry from nyc_homicides limit 5',db)
table1_disp = widgets.Output()
table2_disp = widgets.Output()
table1_header = widgets.HTML(value = "<b><font color='red'><center>NYC_NEIGHBORHOODS</center></font></b>")
table2_header = widgets.HTML(value = "<b><font color='red'><center>NYC_HOMICIDES</center></font></b>")
with table1_disp:
    display(table1)
with table2_disp:
    display(table2)
out=HBox([VBox([table1_header,table1_disp],layout = Layout(margin='0 100px 0 0')),VBox([table2_header,table2_disp])])
out

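And here is the query. Note that it groups by `boroname`, so the totals are reported per borough; grouping by `u.name` (the name column shown in the neighborhoods table above) would instead give one total per neighborhood.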

```sql
   SELECT u.boroname,count(*) as total_homicides from nyc_neighborhoods u,nyc_homicides s
 where st_contains(u.geom,s.geom) group by u.boroname
```

In [None]:
disp = widgets.Output()
db = spatialite.connect('databases/spatialDB.sqlite')
stateGeomSql = f"""SELECT u.boroname,count(*) as total_homicides from nyc_neighborhoods u,nyc_homicides s
 where st_contains(u.geom,s.geom) and s.rowid in(SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'nyc_homicides' 
        AND search_frame = u.geom) group by u.boroname"""
data = pd.read_sql_query(stateGeomSql,con=db)
with disp:
    display(data)
disp

One last example

**Total earthquakes in each state**

Let's look at the tables

In [None]:
from ipywidgets import HBox, VBox,widgets,Layout,HTML
from IPython.display import display
db = spatialite.connect('databases/spatialDB.sqlite')
table1 = pd.read_sql_query('select statefp,name,geom as geometry from us_states limit 5',db)
table2 = pd.read_sql_query('select * from earthquakes limit 5',db)
table1_disp = widgets.Output()
table2_disp = widgets.Output()
table1_header = widgets.HTML(value = "<b><font color='red'><center>US_STATES</center></font></b>")
table2_header = widgets.HTML(value = "<b><font color='red'><center>EARTHQUAKES</center></font></b>")
with table1_disp:
    display(table1)
with table2_disp:
    display(table2)
out=HBox([VBox([table1_header,table1_disp],layout = Layout(margin='0 100px 0 0')),VBox([table2_header,table2_disp])])
out

And here is the query. Note that the geometry column of the earthquakes table is called `geometry` rather than `geom`.

```sql
SELECT u.name,count(*) as total_earthquakes from us_states u,earthquakes s
 where st_contains(u.geom,s.geometry) group by u.name
```

In [None]:
disp = widgets.Output()
db = spatialite.connect('databases/spatialDB.sqlite')
stateGeomSql = f"""SELECT u.name,count(*) as total_earthquakes from us_states u,earthquakes s
 where st_contains(u.geom,s.geometry) and s.rowid in(SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'earthquakes' 
        AND search_frame = u.geom) group by u.name"""
data = pd.read_sql_query(stateGeomSql,con=db)
with disp:
    display(data)
disp

Click the link below to move on.


<br>
<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="gd-5.ipynb">Click here to go to the next notebook.</a></font>