## Introduction to Big Data
### Segment 3 of 5

# Spatial Big Data (Volume)

*Lesson Developer: Jayakrishnan Ajayakumar, jxa421@case.edu*



In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

import warnings
warnings.filterwarnings('ignore') # Hide warnings

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
# HTML(''' 
#     <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
#     <input id="toggle_code" type="button" value="Toggle raw code">
# ''')

HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')




## Datasets
For this module we will use four datasets:

1. US counties (https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html).

2. Major Cities in US (https://hub.arcgis.com/datasets/esri::usa-major-cities)

3. NYC Taxi Zones (https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc)

4. NYC Taxi Data for One Day (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

Our first hands-on experiment will be to find out which county you belong to. 

## Experiment I (Which County Do I Belong To?)

Let's first explore our county dataset. We will use a popular Python package called geopandas to explore our dataset.

In [None]:
#let's import geopandas
import geopandas as gpd
#our shapefile is in the us_counties folder, which is inside the data folder. We use the bbox parameter because we want to limit the dataset to the contiguous United States
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
#just type the variable name to see its contents. It has 9 attribute columns and one geometry column
counties

In [None]:
#you can even plot this on the fly
counties.plot();

In [None]:
#and use attributes for selecting a subset. For example the STATEFP for Ohio is 39. We have to use '39' as it's a string
ohio = counties[counties.STATEFP == '39']
# now just plot it
ohio.plot();

### First Small Exercise
Try to plot Minnesota. You can look up its state FIPS code here: https://www.mercercountypa.gov/dps/state_fips_code_listing.htm

In [None]:
# Give it a shot!!!

Now let's continue.....

Your current location can be represented as a Point (with a single coordinate pair)

If you want to find out your location, an easy way is to use Google Maps. Click anywhere on the map (possibly your home location) and Google Maps will display the coordinates of that location in (latitude, longitude) format. Give it a try.

Now what we want to achieve is to identify which county the location is in. For example we can ask, <span STYLE="font-size:15.0pt;color:black">"Which county is the Statue of Liberty in?", or "Which county is (40.68927479475465, -74.04450466714349) in?" </span> (you can just copy and paste the coordinates into the Google Maps search box to explore the location)
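One practical detail when copying coordinates from Google Maps: it lists them as (latitude, longitude), whereas GeoJSON and shapely's `Point(x, y)` expect (longitude, latitude). A minimal sketch of the swap is shown below. The helper name `to_lon_lat` is hypothetical and not part of any library.

```python
def to_lon_lat(text):
    """Parse a 'latitude, longitude' string and return (longitude, latitude)."""
    lat, lon = (float(part) for part in text.strip("() ").split(","))
    return lon, lat

# Coordinates quoted above for the Statue of Liberty, in Google Maps order
lon, lat = to_lon_lat("(40.68927479475465, -74.04450466714349)")
print(lon, lat)
```

The returned pair can be passed in that order to a point constructor that expects x (longitude) first.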

So how do we achieve this?

### Point in Polygon Overlay

Point-in-polygon overlay is a GIS operation in which points from one dataset are overlaid on polygons from another to determine which polygon each point falls in. Let's see an illustrative diagram.

![point_in_polygon](supplementary/images/point_in_polygon.png)

We can implement this in an intuitive way (maybe not the fastest way) by checking the point against each polygon in turn.

![pp_sequence](supplementary/images/pp_sequence.png)


As you can see, the number of steps depends on the order in which the data is stored. Unfortunately, in the illustrative example, the matching polygon was stored last, so it took four steps (four comparisons) to find it. But if the data were stored in the order D, A, C, B, or any other order with D as the first geometry, there would be only one comparison (assuming there are no overlapping polygons!!). We could always argue that we could plot the data and check visually which polygon contains the point, but that becomes a daunting task once there are many points and many polygons.
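To make the effect of storage order concrete, here is a minimal pure-Python sketch. It assumes four made-up unit squares standing in for polygons A to D, and it uses a simple ray-casting test in place of the `contains` method that the notebook relies on. The function names are illustrative only.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def find_polygon(x, y, polygons):
    """Check polygons in storage order; return (name, comparisons made)."""
    comparisons = 0
    for name, ring in polygons:
        comparisons += 1
        if point_in_polygon(x, y, ring):
            return name, comparisons
    return None, comparisons

def square(x0, y0):
    """Unit square with its lower-left corner at (x0, y0)."""
    return [(x0, y0), (x0 + 1, y0), (x0 + 1, y0 + 1), (x0, y0 + 1)]

# Four hypothetical unit squares standing in for polygons A-D
stored = [("A", square(0, 0)), ("B", square(1, 0)),
          ("C", square(0, 1)), ("D", square(1, 1))]
point = (1.5, 1.5)  # lies inside D

print(find_polygon(*point, stored))        # D stored last  -> ('D', 4)
print(find_polygon(*point, stored[::-1]))  # D stored first -> ('D', 1)
```

The same point costs four comparisons or one, depending only on where D sits in the list.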

So let's see this in action (don't feel overwhelmed by the code, we will dissect the code part by part).

Click on the circle in the toolbar, add a location anywhere on the map, and then click on "Start". You will see the name of the matching county and the total number of comparisons.


In [None]:
import geopandas as gpd
import ipywidgets as widgets
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
import json
from shapely.geometry import Point
#load the county data using geopandas
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
#lets make a map
m = Map(center=(44.967243, -103.771556), zoom=3)
#now lets load the county data 
#first we will create a layer group for the county
countyLayerGroup = LayerGroup()
#create another layer to hold the new points being added to the map
newPointLayer = LayerGroup()
#add the layer group to map
m.add_layer(countyLayerGroup)
m.add_layer(newPointLayer)
#now create a geojson layer from the county dataset
geo_data = GeoJSON(data = json.loads(counties.to_json()),
                   name = 'counties',style={'opacity': 1,'color':'gray', 'fillOpacity': 0, 'weight': 1})
#add the layer to the countyLayerGroup
countyLayerGroup.add_layer(geo_data)
#add draw control to the map
draw_control = DrawControl(polyline={},polygon={})
#add to map
m.add_control(draw_control)
# now we will create few more elements to display our results
#we will create an html element which displays data along with results
results = widgets.HTML(
    value="",
    placeholder='',
    description='',
)
#we will also add a button to start the process
button = widgets.Button(
    description='Start',
    disabled=False,
    button_style='info'
)
button.layout.width = '35%'
button.layout.margin = '0% 0% 0% 50%'
#add a text box to display the coordinates of the selected point
coords = widgets.Text(
    value='',
    placeholder='',
    description='Location',
    disabled=True
)
coords.layout.margin = '0% 2% 5% 0%'
coords.layout.width = '100%'
sm_container = widgets.VBox([coords,button,results])

#now create a horizontal box that can hold the map and the other container.
m.layout.width='80%'
sm_container.layout.width='20%'
container = widgets.HBox([m,sm_container])
container.layout.width='90%'
#now write a call back function when a marker is added. Add the new location coordinates to the text box
def addPoint(target, action, geo_json):
    newPointLayer.clear_layers()
    draw_control.clear()
    newPoint = GeoJSON(data = geo_json)
    newPointLayer.add_layer(newPoint)
    coords.value = ",".join(map(str,geo_json['geometry']['coordinates']))
#add the call back function
draw_control.on_draw(addPoint)
#now write a function for the start button
def start(b):
    if coords.value!='':
        #create a point geometry from the coordinates
        point = Point(list(map(float,coords.value.split(','))))
        #now go through the county dataset one by one and check whether the selected point is inside the county
        details="<b>Name</b>: Not Found"
        for idx,county in counties.iterrows():
            if county.geometry.contains(point):
                details = details.replace('Not Found',county.NAME)
                break
        details+=f'<br><b>Number of comparisons</b>: {idx+1}'
        results.value=details
button.on_click(start)
container

Did you notice anything??

When you add a point on the western side, there are relatively fewer comparisons than when you add a point on the eastern side.

The illustration shown below explains this phenomenon

![point_in_polygon_order](supplementary/images/pp_order.png)

From our county dataset, if the queried location happens to be in Clallam County, WA, then we get the result instantaneously as there is only one comparison, but at the same time if the queried location happens to be in Washington County, ME, then we get the result after 3108 comparisons.

And how many comparisons will there be if we search for a location outside the USA?

So what do you think more comparisons mean? We will see that in our next experiment.

## Experiment II (Which County gets Which City)

For this experiment we will use the County data that we already have and the City data. Let's quickly explore the city data.  

In [None]:
import geopandas as gpd
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
import json
import ipywidgets as widgets
from shapely.geometry import Point
import pandas as pd
import time
pd.set_option('display.max_rows', None)

In [None]:
#load the county data using geopandas
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
#load the cities data using geopandas
cities = gpd.read_file(r'supplementary/data/USA_Major_Cities/USA_Major_Cities.shp')
#just check the cities. Since there are 3,886 cities, we will show only the first five
cities.head()

Now just plot the cities

In [None]:
cities.plot();

Now we will create an interactive page where the user can select a number of cities and then check which city belongs to which county. Use the dropdown to choose the number of cities and then click on Start.

In [None]:
#lets make a map
m = Map(center=(44.967243, -103.771556), zoom=3)
#now lets load the county data 
#first we will create a layer group for the county and city
countyLayerGroup = LayerGroup()
cityLayerGroup = LayerGroup()
#add the layer group to map
m.add_layer(countyLayerGroup)
m.add_layer(cityLayerGroup)
#now create a geojson layer from the county dataset
geo_data = GeoJSON(data = json.loads(counties.to_json()),
                   name = 'counties',style={'opacity': 1,'color':'gray', 'fillOpacity': 0, 'weight': 1})
#add the layer to the countyLayerGroup
countyLayerGroup.add_layer(geo_data)
# now we will create a few more elements to display our results
#we will create an output element which will display the cities table with corresponding counties
results = widgets.Output()
#we will also add a button to start the process
button = widgets.Button(
    description='Start',
    disabled=False,
    button_style='info'
)
results.layout.width = '100%'
results.layout.overflow = 'scroll'
results.layout.height = '250px'
button.layout.width = '35%'
button.layout.margin = '3% 0% 0% 50%'
#add a select widget to select the number of cities
numcities = [str(i) for i in range(100,3300,100)]
citySelect = widgets.Dropdown(
    options=numcities,
    value='100',
    description='Total Cities:',
    disabled=False,
)
citySelect.layout.width = '95%'
totalComparisons = widgets.Text(
    value='',
    placeholder='',
    description='Comparisons:',
    disabled=True
)
totalComparisons.layout.margin = '2% 2% 5% 0%'
totalComparisons.layout.width = '100%'
totalTime = widgets.Text(
    value='',
    placeholder='',
    description='Time Taken:',
    disabled=True
)
totalTime.layout.margin = '2% 2% 5% 0%'
totalTime.layout.width = '100%'
sm_container = widgets.VBox([citySelect,button,totalComparisons,totalTime,results])

#now create a horizontal box that can hold the map and the other container.
m.layout.width='80%'
sm_container.layout.width='20%'
container = widgets.HBox([m,sm_container])
container.layout.width='90%'

def start(b):
    totalComparisons.value = 'Waiting.............'
    totalTime.value = 'Waiting.............'
    cityLayerGroup.clear_layers()
    #randomly select a sample of cities based on the selection
    sampleCount = int(citySelect.value)
    sampleCities = cities.sample(sampleCount)
    #show the cities on the map
    city_data = GeoJSON(data  = json.loads(sampleCities.to_json()),
                    point_style={'radius': 2, 'color': 'red', 'fillOpacity': 1, 'fillColor': 'red', 'weight': 0})
    cityLayerGroup.add_layer(city_data)
    #now iterate through each of the cities and assign corresponding counties
    cityCounty = []  # a list to store city name and corresponding county name
    totalComp = 0
    start = time.time()
    for cidx,city in sampleCities.iterrows():
        for coidx,county in counties.iterrows():
            totalComp+=1
            if county.geometry.contains(city.geometry):
                cityCounty.append([city.NAME,county.NAME])
                break
    totalTime.value = str(int(time.time()-start))+' seconds'
    #update the total comparison value
    totalComparisons.value = str(totalComp)
    #create a dataframe for the output
    cityCounty = pd.DataFrame(cityCounty,columns=['City','County'])
    with results:
        display(cityCounty, clear=True)
button.on_click(start)
container

What do you notice?

Even for a small number of cities the total time taken is around 12 seconds. And when you increase the number of cities, both the total number of comparisons and the total time taken increase.

Now try 200 cities. You will start getting frustrated waiting for the results.

Now let's look at the section of code that does the heavy lifting.


```python
cityCounty = []  # a list to store city name and corresponding county name
totalComp = 0
start = time.time()
for cidx,city in sampleCities.iterrows():
    for coidx,county in counties.iterrows():
        totalComp+=1
        if county.geometry.contains(city.geometry):
            cityCounty.append([city.NAME,county.NAME])
            break
```
We go through the cities one by one <code>for cidx,city in sampleCities.iterrows():</code> and for each city we go through the counties in their stored order <code>for coidx,county in counties.iterrows():</code>, stopping at the first match. We use the contains method of the geometry object to check whether one geometry contains the other <code>if county.geometry.contains(city.geometry):</code>
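The same nested-loop pattern can be tried on a toy example without any GIS library. This is only a sketch: the "counties" are made-up axis-aligned boxes, the "cities" are made-up points, and `box_contains` is a hypothetical stand-in for the geometry `contains` check.

```python
# Hypothetical stand-ins: each "county" is a box (minx, miny, maxx, maxy)
toy_counties = {"West": (0, 0, 1, 1), "Central": (1, 0, 2, 1), "East": (2, 0, 3, 1)}
toy_cities = {"Alpha": (0.5, 0.5), "Beta": (2.5, 0.2), "Gamma": (1.1, 0.9)}

def box_contains(box, point):
    """True if the point lies inside (or on the edge of) the box."""
    minx, miny, maxx, maxy = box
    x, y = point
    return minx <= x <= maxx and miny <= y <= maxy

city_county = []
total_comparisons = 0
for city_name, location in toy_cities.items():        # outer loop: every city
    for county_name, box in toy_counties.items():     # inner loop: counties in stored order
        total_comparisons += 1
        if box_contains(box, location):
            city_county.append((city_name, county_name))
            break                                      # stop at the first match

print(city_county)
print(total_comparisons, "comparisons out of a possible",
      len(toy_cities) * len(toy_counties))
```

Here the early `break` saves three of the nine possible comparisons. A city that sits in the last stored county still pays the full cost of the inner loop.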


Why do you think this approach takes a relatively long time even for a small number of cities?

It is simply because of the sheer number of comparisons that have to be performed. For example, for 100 cities, the maximum possible number of comparisons is

100 (cities) X 3108 (counties) = 310,800

And for 3200 cities the total maximum number of comparisons is 

3200 (cities) X 3108 (counties) = 9,945,600!!!

No wonder it takes forever!!!
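The arithmetic above can be reproduced for any sample size with a few lines of Python. This is a minimal sketch: the county count of 3108 is the figure quoted in this lesson, and `max_comparisons` is a name made up for illustration.

```python
TOTAL_COUNTIES = 3108  # county count quoted in this lesson

def max_comparisons(num_cities, num_counties=TOTAL_COUNTIES):
    """Upper bound: every city is checked against every county."""
    return num_cities * num_counties

for n in (100, 200, 1600, 3200):
    print(f"{n:>5} cities -> at most {max_comparisons(n):>10,} comparisons")
```

The bound grows in direct proportion to both the number of cities and the number of counties, which is why doubling either one doubles the worst-case work.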

In this experiment the result is a table with city name and the corresponding county name.

In the next experiment we will count the number of cities in every county.


## Experiment III (Total Number of Cities in every County)

We will straightaway get into the code

In [None]:
import geopandas as gpd
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
import json
import ipywidgets as widgets
from shapely.geometry import Point
import pandas as pd
import time
import collections
pd.set_option('display.max_rows', None)
#load the county data using geopandas
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
#load the cities data using geopandas
cities = gpd.read_file(r'supplementary/data/USA_Major_Cities/USA_Major_Cities.shp')
#lets make a map
m = Map(center=(44.967243, -103.771556), zoom=3)
#now lets load the county data 
#first we will create a layer group for the county and city
countyLayerGroup = LayerGroup()
cityLayerGroup = LayerGroup()
#add the layer group to map
m.add_layer(countyLayerGroup)
m.add_layer(cityLayerGroup)
#now create a geojson layer from the county dataset
geo_data = GeoJSON(data = json.loads(counties.to_json()),
                   name = 'counties',style={'opacity': 1,'color':'gray', 'fillOpacity': 0, 'weight': 1})
#add the layer to the countyLayerGroup
countyLayerGroup.add_layer(geo_data)
# now we will create few more elements to display our results
#we will create an output element which will display the table of counties with their city counts
results = widgets.Output()
#we will also add a button to start the process
button = widgets.Button(
    description='Start',
    disabled=False,
    button_style='info'
)
results.layout.width = '100%'
results.layout.overflow = 'scroll'
results.layout.height = '300px'
button.layout.width = '35%'
button.layout.margin = '3% 0% 0% 50%'
#add a select widget to select the number of cities
numcities = [str(i) for i in range(100,3300,100)]
citySelect = widgets.Dropdown(
    options=numcities,
    value='100',
    description='Total Cities:',
    disabled=False,
)
citySelect.layout.width = '95%'
totalComparisons = widgets.Text(
    value='',
    placeholder='',
    description='Comparisons:',
    disabled=True
)
totalComparisons.layout.margin = '2% 2% 5% 0%'
totalComparisons.layout.width = '100%'
totalTime = widgets.Text(
    value='',
    placeholder='',
    description='Time Taken:',
    disabled=True
)
totalTime.layout.margin = '2% 2% 5% 0%'
totalTime.layout.width = '100%'
sm_container = widgets.VBox([citySelect,button,totalComparisons,totalTime,results])

#now create a horizontal box that can hold the map and the other container.
m.layout.width='80%'
sm_container.layout.width='20%'
container = widgets.HBox([m,sm_container])
container.layout.width='90%'
#now write a function for the start button
def start(b):
    totalComparisons.value = 'Waiting.............'
    totalTime.value = 'Waiting.............'
    cityLayerGroup.clear_layers()
    #randomly select a sample of cities based on the selection
    
    sampleCount = int(citySelect.value)
    sampleCities = cities.sample(sampleCount)
    #show the cities on the map
    city_data = GeoJSON(data  = json.loads(sampleCities.to_json()),
                    point_style={'radius': 2, 'color': 'red', 'fillOpacity': 1, 'fillColor': 'red', 'weight': 0})
    cityLayerGroup.add_layer(city_data)
    countyDat = []  # a list to store county ids
    totalComp = 0
    start = time.time()
    for cidx,city in sampleCities.iterrows():
        for coidx,county in counties.iterrows():
            totalComp+=1
            if county.geometry.contains(city.geometry):
                # we need to use GEOID as two counties can have the same name
                countyDat.append(county.GEOID)
                break
    totalTime.value = str(int(time.time()-start))+' seconds'
    #update the total comparison value
    totalComparisons.value = str(totalComp)
    #count the occurrences of each county
    countyFrequency = collections.Counter(countyDat)
    #create a dataframe for the output
    countyDat = pd.DataFrame({'GEOID':countyFrequency.keys(),'Total':countyFrequency.values()}).sort_values(by='Total',ascending=False)
    #now merge with original dataset to get the names
    countyDat=countyDat.merge(counties[['GEOID','NAME']],on='GEOID')[['GEOID','NAME','Total']]
    with results:
        display(countyDat, clear=True)
button.on_click(start)
container

The important code sections are almost the same, but rather than saving the city name with the corresponding county name, we just add the county ID (GEOID) to our list and later calculate how often each county occurs.
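The counting step can be seen in isolation with `collections.Counter` from the standard library. The GEOID strings below are sample values made up for illustration, not results from the experiment.

```python
from collections import Counter

# Hypothetical GEOIDs, one entry per city that matched a county
matched_geoids = ["39035", "39049", "39035", "27053", "39035", "27053"]

county_frequency = Counter(matched_geoids)
# most_common() returns (GEOID, count) pairs sorted from most to least frequent
for geoid, total in county_frequency.most_common():
    print(geoid, total)
```

Counting by GEOID rather than by name avoids merging two different counties that happen to share a name.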

While 12 seconds for 100 cities may seem reasonable, if this were a professional website, the user would have simply moved on to another website.

![slow_website](supplementary/images/slow_website.jpg)

So how can we improve it?

## Experiment IV (Where am I in a Grid)

Have you heard about indexes? I bet you have seen one before.

![book_index](supplementary/images/book_index.jpg)

Using the index shown above, we can easily look up the content about "ticket pricing" (it's on page 44). Without an index we would have to go through the pages one by one, and if "ticket pricing" happened to be on the last page, we would have to go through the entire book.
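The book analogy maps directly onto a Python dictionary. The sketch below assumes a made-up 100-page book in which only page 44 covers "ticket pricing". The names `pages`, `scan_pages` and `book_index` are hypothetical.

```python
# Hypothetical book: page number -> topic covered on that page
pages = {n: "other material" for n in range(1, 101)}
pages[44] = "ticket pricing"

def scan_pages(topic):
    """No index: read pages in order until the topic turns up."""
    for pages_read, (number, content) in enumerate(pages.items(), start=1):
        if content == topic:
            return number, pages_read
    return None, len(pages)

# With an index: build the topic -> page mapping once, then look it up directly
book_index = {content: number for number, content in pages.items()}

print(scan_pages("ticket pricing"))   # (44, 44): 44 pages read before the match
print(book_index["ticket pricing"])   # 44, found with a single lookup
```

Building the index costs one pass over the book, but every later lookup skips the page-by-page scan.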

We will try to create such an index, but for spatial data. For this experiment we will cover the entire contiguous US with a uniform grid. Let's look at the grid.

In [None]:
import geopandas as gpd
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
import json
import ipywidgets as widgets
import numpy as np
from shapely.geometry import Point
#load the county data using geopandas
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
# for easy access based on GEOID we will set it as the index
counties.index = counties.GEOID
#load the grid file
grid = gpd.read_file(r'supplementary/data/us_grid_22_40/us_grid_22_40_with_counties.shp')
#for fast access based on id we will use id as index
grid.index = grid.id
grid

In [None]:
ax = grid.plot(facecolor="none")
counties.plot(facecolor="none",ax=ax);

The grid shapefile contains the id of each grid cell as well as a GEOIDS column. The GEOIDS column lists the GEOIDs of the counties that <span STYLE="font-size:18.0pt;color:black">intersect the grid cell.</span>

![grid_poly](supplementary/images/grid_poly.jpg)

The grid acts as a first pass that eliminates many non-matching candidates. Its regular structure is another advantage: we can easily work out which grid cell a point falls in. Let's look at the demonstration below.

![grid_search](supplementary/images/grid_search.jpg)

Once the grid cell is identified, we need to check only those polygons (in our case counties) that intersect the grid cell (we have their GEOIDs precomputed). For example, in the setting shown below, if we want to check which polygon contains the point (0.5,1.5), we can do it in two steps.

First find out the matching grid cell for (0.5,1.5), which in this case is the grid cell with ID 8.

Then check whether the point is inside any of the polygons that intersect with the grid cell 8, which in this case is Polygon E only. 

So rather than going through each polygon one-by-one, we can directly check whether the Polygon E contains the required point. 

![grid_search](supplementary/images/grid_search.jpg)
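The two-step lookup can be sketched in plain Python. Everything here is an assumption made for illustration: a made-up 3 x 3 grid, cell ids numbered down each column from the top-left (which may differ from the numbering in the figure), a made-up cell-to-polygon table, and `bisect_right` from the standard library in place of a NumPy bin search.

```python
from bisect import bisect_right

# Hypothetical 3 x 3 grid over x in [0, 3] and y in [0, 3]
x_edges = [0, 1, 2, 3]   # column boundaries, ascending
y_edges = [0, 1, 2, 3]   # row boundaries, ascending
n_cols, n_rows = len(x_edges) - 1, len(y_edges) - 1

def cell_id(x, y):
    """Return the 1-based cell id (numbered down each column from the
    top-left), or None if the point is outside the grid."""
    col = bisect_right(x_edges, x)                  # 1..n_cols when inside
    row = (n_rows + 1) - bisect_right(y_edges, y)   # 1 = top row, n_rows = bottom row
    if 1 <= col <= n_cols and 1 <= row <= n_rows:
        return row + (col - 1) * n_rows
    return None

# Precomputed index: cell id -> polygons that intersect the cell (made-up values)
cell_to_polygons = {1: ["P"], 4: ["P", "Q"], 9: ["R"]}

# Step 1: find the cell. Step 2: fetch only the candidates for that cell.
cid = cell_id(1.5, 2.5)
print(cid, cell_to_polygons.get(cid, []))
```

The candidates returned in step 2 still need a real point-in-polygon test, but there are far fewer of them than in the full dataset.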

Let's see this in action

In [None]:
#now we need to create bins along the longitude and along the latitude. In this case we know the grid has 40 columns and 22 rows and the id for the grid is in column order
cols = [i for i in range(1,len(grid),22)]  #this should give all the grid cells in first row.......1,23,45....859
rows = [i for i in range(1,23)] #this should give all the grid cells in the first column.......1,2,3.....22

longBins = np.asarray([[bnds['minx'],bnds['maxx']] for i,bnds in grid.loc[cols].geometry.bounds.iterrows()]).flatten().tolist()
longBins = [longBins[0]]+longBins[1:-1:2]+[longBins[-1]]
latBins = np.asarray([[bnds['maxy'],bnds['miny']] for i,bnds in grid.loc[rows].geometry.bounds.iterrows()]).flatten().tolist()
#as latitude reduces as we move from top to bottom (90,-90) we need to reverse the bins so that its in ascending order
latBins.reverse()
latBins = [latBins[0]]+latBins[1:-1:2]+[latBins[-1]]
#lets make a map
m = Map(center=(44.967243, -103.771556), zoom=3)
#now lets load the county data 
#first we will create a layer group for the county
countyLayerGroup = LayerGroup()
#create another layer to hold the new points being added to the map
newPointLayer = LayerGroup()
#another layer for the grid
newGridLayer = LayerGroup()
#add the layer group to map
m.add_layer(countyLayerGroup)
m.add_layer(newPointLayer)
m.add_layer(newGridLayer)
#now create a geojson layer from the county dataset
geo_data = GeoJSON(data = json.loads(counties.to_json()),
                   name = 'counties',style={'opacity': 1,'color':'gray', 'fillOpacity': 0, 'weight': 1})
#add the layer to the countyLayerGroup
countyLayerGroup.add_layer(geo_data)
#now create a geojson layer from the grid dataset
grid_geo_data = GeoJSON(data = json.loads(grid.to_json()),
                   name = 'grid',style={'opacity': 1,'color':'black', 'fillOpacity': 0, 'weight': 1})
#add the layer to the newGridLayer
newGridLayer.add_layer(grid_geo_data)
#add draw control to the map
draw_control = DrawControl(polyline={},polygon={})
#add to map
m.add_control(draw_control)
# now we will create few more elements to display our results
#we will create an html element which displays data along with results
results = widgets.HTML(
    value="",
    placeholder='',
    description='',
)
#we will also add a button to start the process
button = widgets.Button(
    description='Start',
    disabled=False,
    button_style='info'
)
button.layout.width = '35%'
button.layout.margin = '0% 0% 0% 50%'
#add a text box to display the coordinates of the selected point
coords = widgets.Text(
    value='',
    placeholder='',
    description='Location',
    disabled=True
)
coords.layout.margin = '0% 2% 5% 0%'
coords.layout.width = '100%'
sm_container = widgets.VBox([coords,button,results])

#now create a horizontal box that can hold the map and the other container.
m.layout.width='80%'
sm_container.layout.width='20%'
container = widgets.HBox([m,sm_container])
container.layout.width='90%'
#now write a call back function when a marker is added. Add the new location coordinates to the text box
def addPoint(target, action, geo_json):
    newPointLayer.clear_layers()
    draw_control.clear()
    newPoint = GeoJSON(data = geo_json)
    newPointLayer.add_layer(newPoint)
    coords.value = ",".join(map(str,geo_json['geometry']['coordinates']))
#add the call back function
draw_control.on_draw(addPoint)
#now write a function for the start button
def start(b):
    if coords.value!='':
        #create a point geometry from the coordinates
        point = Point(list(map(float,coords.value.split(','))))
        #check to which grid cell the coordinate belong to
        box_x = np.searchsorted(longBins,point.x,side = "right")
        box_y = 23-np.searchsorted(latBins,point.y,side = "right")
        details="<b>Name</b>: Not Found"
        totalComparison = 1 #this is set to 1 as we need at least one comparison for the grid cell lookup
        # if the coordinates fall outside the grid bounds we don't need to do anything
        # (valid columns are 1..40 and valid rows are 1..22, so the bottom row must be included)
        if box_x>0 and box_x<41 and box_y>0 and box_y<23:
            #calculate the index for the box
            id_c = box_y+((box_x-1)*22)
            #get the grid cell
            gridCell = grid.loc[id_c]
            #if the grid cell has counties then check
            if gridCell.GEOIDS is not None:
                #retrieve the Geoids as a list
                geoids = gridCell.GEOIDS.split(',')
                #the corresponding counties
                matchCounties = counties.loc[geoids]
                for idx,county in matchCounties.iterrows():
                    totalComparison+=1
                    if county.geometry.intersects(point):
                        details = details.replace('Not Found',county.NAME)
                        break
        details+=f'<br><b>Number of comparisons</b>: {totalComparison}'
        results.value=details
button.on_click(start)
container

The first thing you will notice is that the number of comparisons has dropped drastically, regardless of where the point is (N, S, E, W). To see whether there is any reduction in time, let's move to the next experiment.

## Experiment V (Which County gets Which City, a gridded approach)

Again we will use the county and city data. But this time we will use the grid data as a first pass to remove non-matching counties.

In [None]:
import geopandas as gpd
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
import json
import ipywidgets as widgets
from shapely.geometry import Point
import pandas as pd
import time
import numpy as np
pd.set_option('display.max_rows', None)
#load the county data using geopandas
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
# for easy access based on GEOID we will set it as the index
counties.index = counties.GEOID
#load the cities data using geopandas
cities = gpd.read_file(r'supplementary/data/USA_Major_Cities/USA_Major_Cities.shp')
#load the grid file
grid = gpd.read_file(r'supplementary/data/us_grid_22_40/us_grid_22_40_with_counties.shp')
#for fast access based on id we will use id as index
grid.index = grid.id
#now we need to create bins along the longitude and along the latitude. In this case we know the grid has 40 columns and 22 rows and the id for the grid is in column order
cols = [i for i in range(1,len(grid),22)]  #this should give all the grid cells in first row.......1,23,45....859
rows = [i for i in range(1,23)] #this should give all the grid cells in the first column.......1,2,3.....22

longBins = np.asarray([[bnds['minx'],bnds['maxx']] for i,bnds in grid.loc[cols].geometry.bounds.iterrows()]).flatten().tolist()
longBins = [longBins[0]]+longBins[1:-1:2]+[longBins[-1]]
latBins = np.asarray([[bnds['maxy'],bnds['miny']] for i,bnds in grid.loc[rows].geometry.bounds.iterrows()]).flatten().tolist()
#as latitude reduces as we move from top to bottom (90,-90) we need to reverse the bins so that its in ascending order
latBins.reverse()
latBins = [latBins[0]]+latBins[1:-1:2]+[latBins[-1]]
#lets make a map
m = Map(center=(44.967243, -103.771556), zoom=3)
#now lets load the county data 
#first we will create a layer group for the county and city
countyLayerGroup = LayerGroup()
cityLayerGroup = LayerGroup()
#another layer for the grid
newGridLayer = LayerGroup()
#add the layer group to map
m.add_layer(countyLayerGroup)
m.add_layer(cityLayerGroup)
m.add_layer(newGridLayer)
#now create a geojson layer from the county dataset
geo_data = GeoJSON(data = json.loads(counties.to_json()),
                   name = 'counties',style={'opacity': 1,'color':'gray', 'fillOpacity': 0, 'weight': 1})
#add the layer to the countyLayerGroup
countyLayerGroup.add_layer(geo_data)
#now create a geojson layer from the grid dataset
grid_geo_data = GeoJSON(data = json.loads(grid.to_json()),
                   name = 'grid',style={'opacity': 1,'color':'black', 'fillOpacity': 0, 'weight': 1})
#add the layer to the newGridLayer
newGridLayer.add_layer(grid_geo_data)
# now we will create few more elements to display our results
#we will create an output element which will display the cities table with corresponding counties
results = widgets.Output()
#we will also add a button to start the process
button = widgets.Button(
    description='Start',
    disabled=False,
    button_style='info'
)
results.layout.width = '100%'
results.layout.overflow = 'scroll'
results.layout.height = '250px'
button.layout.width = '35%'
button.layout.margin = '3% 0% 0% 50%'
#add a select widget to select the number of cities
numcities = [str(i) for i in range(100,3300,100)]
citySelect = widgets.Dropdown(
    options=numcities,
    value='100',
    description='Total Cities:',
    disabled=False,
)
citySelect.layout.width = '95%'
totalComparisons = widgets.Text(
    value='',
    placeholder='',
    description='Comparisons:',
    disabled=True
)
totalComparisons.layout.margin = '2% 2% 5% 0%'
totalComparisons.layout.width = '100%'
totalTime = widgets.Text(
    value='',
    placeholder='',
    description='Time Taken:',
    disabled=True
)
totalTime.layout.margin = '2% 2% 5% 0%'
totalTime.layout.width = '100%'
sm_container = widgets.VBox([citySelect,button,totalComparisons,totalTime,results])

#now create a horizontal box that can hold the map and the other container.
m.layout.width='80%'
sm_container.layout.width='20%'
container = widgets.HBox([m,sm_container])
container.layout.width='90%'
#now write a function for the start button
def start(b):
    totalComparisons.value = 'Waiting.............'
    totalTime.value = 'Waiting.............'
    cityLayerGroup.clear_layers()
    #randomly select a sample of cities based on the selection
    sampleCount = int(citySelect.value)
    sampleCities = cities.sample(sampleCount)
    #show the cities on the map
    city_data = GeoJSON(data  = json.loads(sampleCities.to_json()),
                    point_style={'radius': 2, 'color': 'red', 'fillOpacity': 1, 'fillColor': 'red', 'weight': 0})
    cityLayerGroup.add_layer(city_data)
    #now iterate through each of the cities and assign corresponding counties
    cityCounty = []  # a list to store city name and corresponding county name
    totalComp = 1
    start = time.time()
    for cidx,city in sampleCities.iterrows():
        #now rather than iterating through the entire county set find the grid cell and corresponding counties
        #check to which grid cell the coordinate belong to
        box_x = np.searchsorted(longBins,city.geometry.x,side = "right")
        box_y = 23-np.searchsorted(latBins,city.geometry.y,side = "right")
        if box_x>0 and box_x<41 and box_y>0 and box_y<23:  #the grid has 40 columns and 22 rows
            #calculate the index for the box
            id_c = box_y+((box_x-1)*22)
            #get the grid cell
            gridCell = grid.loc[id_c]
            #if the grid cell has counties then check
            if gridCell.GEOIDS is not None:
                #retrieve the Geoids as a list
                geoids = gridCell.GEOIDS.split(',')
                #the corresponding counties
                matchCounties = counties.loc[geoids]
                for coidx,county in matchCounties.iterrows():
                    totalComp+=1
                    if county.geometry.contains(city.geometry):
                        cityCounty.append([city.NAME,county.NAME])
                        break
    totalTime.value = str(int(time.time()-start))+' seconds'
    #update the total comparison value
    totalComparisons.value = str(totalComp)
    #create a dataframe for the output
    cityCounty = pd.DataFrame(cityCounty,columns=['City','County'])
    with results:
        display(cityCounty, clear=True)
button.on_click(start)
container

You will straightaway notice the reduction in two things

1) Total comparisons

2) Total time taken

For example, processing 100 cities now takes less than 1 second (compared to 12 seconds previously). The number of comparisons has dropped from around 160,000 to 961. Even for 3,200 cities, the total time taken is under six seconds and the total number of comparisons is under 30,000. Let's look at the critical sections of the code.

```python
for cidx,city in sampleCities.iterrows():
    #now rather than iterating through the entire county set find the grid cell and corresponding counties
    #check to which grid cell the coordinate belong to
    box_x = np.searchsorted(longBins,city.geometry.x,side = "right")
    box_y = 23-np.searchsorted(latBins,city.geometry.y,side = "right")
    #the grid has 40 columns and 22 rows
    if box_x>0 and box_x<41 and box_y>0 and box_y<23:
        #calculate the index for the box
        id_c = box_y+((box_x-1)*22)
        #get the grid cell
        gridCell = grid.loc[id_c]
        #if the grid cell has counties then check
        if gridCell.GEOIDS is not None:
            #retrieve the Geoids as a list
            geoids = gridCell.GEOIDS.split(',')
            #the corresponding counties
            matchCounties = counties.loc[geoids]
            for coidx,county in matchCounties.iterrows():
                totalComp+=1
                if county.geometry.contains(city.geometry):
                    cityCounty.append([city.NAME,county.NAME])
                    break
```

Similar to the previous version of the same problem, we will iterate through the cities one by one <code>for cidx,city in sampleCities.iterrows():</code>

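Then we find out which grid cell the city belongs to. <code>np.searchsorted</code> returns the position of the coordinate among the sorted bin edges, which gives the column (<code>box_x</code>) and the row (<code>box_y</code>) of the cell. Because the grid ids are in column order with 22 cells per column, the cell id is <code>box_y+((box_x-1)*22)</code>.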

```python
box_x = np.searchsorted(longBins,city.geometry.x,side = "right")
box_y = 23-np.searchsorted(latBins,city.geometry.y,side = "right")
#the grid has 40 columns and 22 rows
if box_x>0 and box_x<41 and box_y>0 and box_y<23:
    #calculate the index for the box
    id_c = box_y+((box_x-1)*22)
```
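
To make the bin lookup concrete, here is a minimal sketch on a hypothetical grid with 3 columns and 2 rows. The edge values and the helper name `cell_id` are invented for illustration and are not part of the lesson's data. The sketch uses `bisect.bisect_right` from the standard library, which returns the same position as `np.searchsorted(..., side="right")` on a sorted list, and it numbers the cells in column order starting from the top-left, as the lesson's grid does.

```python
from bisect import bisect_right

# Hypothetical grid: 3 columns x 2 rows, ids in column order from the top-left:
#   1 3 5
#   2 4 6
long_bins = [-120.0, -110.0, -100.0, -90.0]  # 4 edges -> 3 columns
lat_bins = [30.0, 40.0, 50.0]                # 3 edges -> 2 rows (ascending)
n_cols = len(long_bins) - 1
n_rows = len(lat_bins) - 1

def cell_id(x, y):
    """Return the grid cell id for point (x, y), or None if it is outside the grid."""
    box_x = bisect_right(long_bins, x)                # 1-based column, west to east
    box_y = (n_rows + 1) - bisect_right(lat_bins, y)  # 1-based row, north to south
    if 1 <= box_x <= n_cols and 1 <= box_y <= n_rows:
        return box_y + (box_x - 1) * n_rows
    return None

print(cell_id(-115.0, 45.0))  # a point in the top-left cell
print(cell_id(-95.0, 35.0))   # a point in the bottom-right cell
print(cell_id(-80.0, 35.0))   # a point east of the grid
```

Note that the lookup cost does not depend on the number of counties: it is two binary searches over the bin edges followed by one arithmetic expression.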

Once you know the id of the cell, you can retrieve the cell details from the grid dataset.

<code>gridCell = grid.loc[id_c]</code>

Then we read the GEOIDS of that cell (a comma-separated list of the counties associated with the cell) and retrieve the counties for those GEOIDS.

```python
if gridCell.GEOIDS is not None:
    #retrieve the Geoids as a list
    geoids = gridCell.GEOIDS.split(',')
    #the corresponding counties
    matchCounties = counties.loc[geoids]
```   
And then, as done previously, we just need to loop through the counties and check whether the city is within the county.

```python
for coidx,county in matchCounties.iterrows():
    totalComp+=1
    if county.geometry.contains(city.geometry):
        cityCounty.append([city.NAME,county.NAME])
        break
```

Now we will look at one more example with the grid.

## Experiment VI (Total Number of Cities in every County, a gridded approach)

This is similar to Experiment III, where we calculated the number of cities in each county. For this experiment, however, we use the grid as a first pass. So let's get into the code straightaway.

In [None]:
import geopandas as gpd
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
import json
import ipywidgets as widgets
from shapely.geometry import Point
import pandas as pd
import time
import numpy as np
import collections
pd.set_option('display.max_rows', None)
#load the county data using geopandas
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
# for easy access based on GEOID we will set it as the index
counties.index = counties.GEOID
#load the cities data using geopandas
cities = gpd.read_file(r'supplementary/data/USA_Major_Cities/USA_Major_Cities.shp')
#load the grid file
grid = gpd.read_file(r'supplementary/data/us_grid_22_40/us_grid_22_40_with_counties.shp')
#for fast access based on id we will use id as index
grid.index = grid.id
#now we need to create bins along the longitude and along the latitude. In this case we know the grid has 40 columns and 22 rows and the id for the grid is in column order
cols = [i for i in range(1,len(grid),22)]  #this should give all the grid cells in first row.......1,23,45....859
rows = [i for i in range(1,23)] #this should give all the grid cells in the first column.......1,2,3.....22

longBins = np.asarray([[bnds['minx'],bnds['maxx']] for i,bnds in grid.loc[cols].geometry.bounds.iterrows()]).flatten().tolist()
longBins = [longBins[0]]+longBins[1:-1:2]+[longBins[-1]]
latBins = np.asarray([[bnds['maxy'],bnds['miny']] for i,bnds in grid.loc[rows].geometry.bounds.iterrows()]).flatten().tolist()
#as latitude reduces as we move from top to bottom (90,-90) we need to reverse the bins so that its in ascending order
latBins.reverse()
latBins = [latBins[0]]+latBins[1:-1:2]+[latBins[-1]]
#lets make a map
m = Map(center=(44.967243, -103.771556), zoom=3)
#now lets load the county data 
#first we will create a layer group for the county and city
countyLayerGroup = LayerGroup()
cityLayerGroup = LayerGroup()
#another layer for the grid
newGridLayer = LayerGroup()
#add the layer group to map
m.add_layer(countyLayerGroup)
m.add_layer(cityLayerGroup)
m.add_layer(newGridLayer)
#now create a geojson layer from the county dataset
geo_data = GeoJSON(data = json.loads(counties.to_json()),
                   name = 'counties',style={'opacity': 1,'color':'gray', 'fillOpacity': 0, 'weight': 1})
#add the layer to the countyLayerGroup
countyLayerGroup.add_layer(geo_data)
#now create a geojson layer from the grid dataset
grid_geo_data = GeoJSON(data = json.loads(grid.to_json()),
                   name = 'grid',style={'opacity': 1,'color':'black', 'fillOpacity': 0, 'weight': 1})
#add the layer to the newGridLayer
newGridLayer.add_layer(grid_geo_data)
# now we will create few more elements to display our results
#we will create an output element which will display the cities table with corresponding counties
results = widgets.Output()
#we will also add a button to start the process
button = widgets.Button(
    description='Start',
    disabled=False,
    button_style='info'
)
results.layout.width = '100%'
results.layout.overflow = 'scroll'
results.layout.height = '250px'
button.layout.width = '35%'
button.layout.margin = '3% 0% 0% 50%'
#add a select widget to select the number of cities
numcities = [str(i) for i in range(100,3300,100)]
citySelect = widgets.Dropdown(
    options=numcities,
    value='100',
    description='Total Cities:',
    disabled=False,
)
citySelect.layout.width = '95%'
totalComparisons = widgets.Text(
    value='',
    placeholder='',
    description='Comparisons:',
    disabled=True
)
totalComparisons.layout.margin = '2% 2% 5% 0%'
totalComparisons.layout.width = '100%'
totalTime = widgets.Text(
    value='',
    placeholder='',
    description='Time Taken:',
    disabled=True
)
totalTime.layout.margin = '2% 2% 5% 0%'
totalTime.layout.width = '100%'
sm_container = widgets.VBox([citySelect,button,totalComparisons,totalTime,results])

#now create a horizontal box that can hold the map and the other container.
m.layout.width='80%'
sm_container.layout.width='20%'
container = widgets.HBox([m,sm_container])
container.layout.width='90%'
#now write a function for the start button
def start(b):
    totalComparisons.value = 'Waiting.............'
    totalTime.value = 'Waiting.............'
    cityLayerGroup.clear_layers()
    #randomly select a sample of cities based on the selection
    sampleCount = int(citySelect.value)
    sampleCities = cities.sample(sampleCount)
    #show the cities on the map
    city_data = GeoJSON(data  = json.loads(sampleCities.to_json()),
                    point_style={'radius': 2, 'color': 'red', 'fillOpacity': 1, 'fillColor': 'red', 'weight': 0})
    cityLayerGroup.add_layer(city_data)
    countyDat = []  # a list to store county ids
    totalComp = 1
    start = time.time()
    for cidx,city in sampleCities.iterrows():
        #now rather than iterating through the entire county set find the grid cell and corresponding counties
        #check to which grid cell the coordinate belong to
        box_x = np.searchsorted(longBins,city.geometry.x,side = "right")
        box_y = 23-np.searchsorted(latBins,city.geometry.y,side = "right")
        if box_x>0 and box_x<41 and box_y>0 and box_y<23:  #the grid has 40 columns and 22 rows
            #calculate the index for the box
            id_c = box_y+((box_x-1)*22)
            #get the grid cell
            gridCell = grid.loc[id_c]
            #if the grid cell has counties then check
            if gridCell.GEOIDS is not None:
                #retrieve the Geoids as a list
                geoids = gridCell.GEOIDS.split(',')
                #the corresponding counties
                matchCounties = counties.loc[geoids]
                for coidx,county in matchCounties.iterrows():
                    totalComp+=1
                    if county.geometry.contains(city.geometry):
                        # we need to use GEOID as two counties can have the same name
                        countyDat.append(county.GEOID)
                        break
    totalTime.value = str(int(time.time()-start))+' seconds'
    #update the total comparison value
    totalComparisons.value = str(totalComp)
    #count the occurrence of counties
    countyFrequency = collections.Counter(countyDat)
    #create a dataframe for the output
    countyDat = pd.DataFrame({'GEOID':countyFrequency.keys(),'Total':countyFrequency.values()}).sort_values(by='Total',ascending=False)
    #now merge with original dataset to get the names
    countyDat=countyDat.merge(counties['NAME'],on='GEOID')[['GEOID','NAME','Total']]
    with results:
        display(countyDat, clear=True)
button.on_click(start)
container

Similar to the previous results with the grid, we can see a considerable reduction in total comparisons and total time taken.

So our grid experiments have been successful. We were able to reduce the total number of comparisons and total run time. 
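
The only step in Experiment VI that differs from Experiment V is the aggregation: instead of storing city and county names, the loop collects one GEOID per matched city and then tallies them with `collections.Counter`. The sketch below isolates that step; the GEOID values are made up for illustration.

```python
import collections

# Hypothetical GEOIDs collected by the point-in-polygon loop, one per matched city
county_dat = ["06037", "17031", "06037", "48201", "06037", "17031"]

# Counter tallies how many times each GEOID occurs
county_frequency = collections.Counter(county_dat)

# most_common() returns (GEOID, count) pairs sorted by descending count
for geoid, total in county_frequency.most_common():
    print(geoid, total)
```

Counting by GEOID rather than by name matters because, as the code comment notes, two counties can share the same name.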

But are there any potential issues? Let's look at the illustration shown below.

![uneven](supplementary/images/uneven.jpg)

As you can see, our polygon data is unevenly distributed (which is very common in real-world datasets). If a point falls in Grid Cell 9, we would have to go through almost all the polygons in the dataset (barring the one polygon that happens to be in Grid Cell 1). This defeats the purpose of the index, as we are back to scanning nearly the entire dataset.

So our takeaway for a grid index is that it works well when the spatial data is evenly distributed, but falls apart when the data is skewed.
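
A small sketch can make the effect of skew measurable. It assumes a hypothetical 3 x 3 grid over the unit square with cells numbered 1 to 9 row by row (this numbering and the helper `cell_of` are assumptions for illustration and may not match the figure), and it uses invented point locations as stand-ins for polygons.

```python
from collections import Counter

# Hypothetical 3 x 3 grid over the unit square, cells numbered 1..9 row by row
def cell_of(x, y, n=3):
    col = min(int(x * n), n - 1)
    row = min(int(y * n), n - 1)
    return row * n + col + 1

# Invented, skewed "polygon" locations: one in cell 1, the rest crowded into cell 9
locations = [(0.1, 0.1)] + [(0.7 + 0.03 * i, 0.9) for i in range(9)]

# Number of candidate polygons stored in each grid cell
per_cell = Counter(cell_of(x, y) for x, y in locations)
print(per_cell)

# Fraction of the whole dataset a query landing in cell 9 still has to test
print(per_cell[9] / len(locations))
```

With evenly spread data each cell would hold roughly one ninth of the polygons; here a query in the crowded cell still has to test most of them.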

So what's the solution?

## R-Tree

Note: We are only going to give a bird's-eye view of the R-Tree, as it is an advanced data structure and you need a strong background in tree-based data structures to understand how it is implemented.

An R-tree represents individual spatial objects (could be Polygons, Lines or Points) and their minimum bounding rectangle as the lowest level of the spatial index. It then aggregates nearby objects and represents them with their aggregate minimum bounding rectangle in the next higher level of the index. This is done iteratively, until everything is nested into one top-level minimum bounding rectangle.

Let's see an example of a minimum bounding rectangle (MBR).

![mbr](supplementary/images/mbr.jpg)

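Minimum bounding rectangles are very powerful because they act as a cheap first pass before **costly geometric operations such as Contains, Intersection, and Within**. If a point is not inside a polygon's MBR, it cannot be inside the polygon, so the costly test can be skipped. Let's see a quick example.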

![mbr_example](supplementary/images/mbr_example.jpg)
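
The same idea can be sketched in plain Python. The polygon and the helper names (`mbr`, `in_mbr`, `in_polygon`) are hypothetical, and the exact test uses a simple ray-casting routine as a stand-in for a library's Contains operation. The point is only that the MBR check costs four comparisons, while the exact check has to visit every edge.

```python
# A hypothetical L-shaped polygon, given as a list of (x, y) vertices
polygon = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]

def mbr(vertices):
    """Minimum bounding rectangle as (minx, miny, maxx, maxy)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)

def in_mbr(point, rect):
    """Cheap test: four comparisons, regardless of polygon complexity."""
    x, y = point
    minx, miny, maxx, maxy = rect
    return minx <= x <= maxx and miny <= y <= maxy

def in_polygon(point, vertices):
    """Exact test (ray casting): cost grows with the number of vertices."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

rect = mbr(polygon)
for p in [(9, 9), (3, 3), (0.5, 2)]:
    # Only points that pass the MBR test need the exact test
    print(p, in_mbr(p, rect) and in_polygon(p, polygon))
```

The point (9, 9) is rejected by the MBR alone. The point (3, 3) passes the MBR test but fails the exact test, which shows why the MBR is only a first pass and not a replacement for the geometric operation.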

Let's look at an illustration of R-tree

![sample_rtree](supplementary/images/sample_rtree.png)

As you can see the R-tree is basically a hierarchy of minimum bounding rectangles. 
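
To illustrate the hierarchy, here is a deliberately simplified two-level sketch. It is not a real R-tree: there is no balancing, node splitting, or insertion logic, the grouping into leaf nodes is done by hand, and all rectangles and names are hypothetical. It only shows how aggregate MBRs let a query skip whole groups of objects.

```python
def union(rects):
    """Aggregate MBR that encloses every rectangle in rects."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def intersects(a, b):
    """True if rectangles a and b (minx, miny, maxx, maxy) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical object MBRs keyed by id, grouped by hand into two leaf nodes
objects = {
    "A": (0, 0, 2, 2), "B": (1, 1, 3, 3),    # cluster in the lower left
    "C": (8, 8, 9, 9), "D": (7, 7, 10, 10),  # cluster in the upper right
}
leaves = [["A", "B"], ["C", "D"]]

# Each node stores the aggregate MBR of its children
nodes = [(union([objects[k] for k in leaf]), leaf) for leaf in leaves]
root = union([node_mbr for node_mbr, _ in nodes])

def query(rect):
    """Return ids whose MBR intersects rect, and how many rectangles were tested."""
    tested = 1
    if not intersects(root, rect):
        return [], tested
    hits = []
    for node_mbr, leaf in nodes:
        tested += 1
        if intersects(node_mbr, rect):  # skip whole groups that cannot match
            for k in leaf:
                tested += 1
                if intersects(objects[k], rect):
                    hits.append(k)
    return hits, tested

# A point query expressed as a degenerate rectangle
print(query((8.5, 8.5, 8.5, 8.5)))
```

In this toy case the lower-left group is rejected with a single rectangle test, so objects A and B are never examined. With thousands of objects and more levels, the savings grow accordingly.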

We won't go into the details of how to query an R-tree or how to add/delete data. Luckily, there is an R-tree implementation for Python (the <code>rtree</code> package). We will look into how we can use an R-tree for fast lookups.

## Experiment VII (Total Number of Cities in every County, an R-Tree based approach)

In [None]:
import geopandas as gpd
from rtree import index
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
import json
import ipywidgets as widgets
from shapely.geometry import Point
import pandas as pd
import time
import numpy as np
import collections
from shapely.geometry import box
pd.set_option('display.max_rows', None)

In [None]:
#create an index (r-tree)
idx = index.Index()
#load the county data using geopandas
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
#load the cities data using geopandas
cities = gpd.read_file(r'supplementary/data/USA_Major_Cities/USA_Major_Cities.shp')
#fill the index using the bounding box of counties
counties.bounds.apply(lambda x:idx.insert(x.name,x.values),axis=1)
#now lets make a geodataframe from the tree rectangles
tree_rectangles = gpd.GeoDataFrame(pd.DataFrame([[leaf[0],box(*leaf[2])] for leaf in idx.leaves()],columns=['id','geometry']),crs='EPSG:4326')

In [None]:
ax = tree_rectangles.plot(facecolor="none")
counties.plot(facecolor="none",ax=ax);

In [None]:
#lets make a map
m = Map(center=(44.967243, -103.771556), zoom=3)
#now lets load the county data 
#first we will create a layer group for the county and city
countyLayerGroup = LayerGroup()
cityLayerGroup = LayerGroup()
#another layer for the tree
treeLayer = LayerGroup()
#add the layer group to map
m.add_layer(countyLayerGroup)
m.add_layer(cityLayerGroup)
m.add_layer(treeLayer)
#now create a geojson layer from the county dataset
geo_data = GeoJSON(data = json.loads(counties.to_json()),
                   name = 'counties',style={'opacity': 1,'color':'gray', 'fillOpacity': 0, 'weight': 1})
#add the layer to the countyLayerGroup
countyLayerGroup.add_layer(geo_data)
#now create a geojson layer from the tree_rectangles dataset
tree_geo_data = GeoJSON(data = json.loads(tree_rectangles.to_json()),
                   name = 'grid',style={'opacity': 1,'color':'green', 'fillOpacity': 0, 'weight': 1})
#add the layer to the treeLayer
treeLayer.add_layer(tree_geo_data)
# now we will create few more elements to display our results
#we will create an output element which will display the cities table with corresponding counties
results = widgets.Output()
#we will also add a button to start the process
button = widgets.Button(
    description='Start',
    disabled=False,
    button_style='info'
)
results.layout.width = '100%'
results.layout.overflow = 'scroll'
results.layout.height = '250px'
button.layout.width = '35%'
button.layout.margin = '3% 0% 0% 50%'
#add a select widget to select the number of cities
numcities = [str(i) for i in range(100,3300,100)]
citySelect = widgets.Dropdown(
    options=numcities,
    value='100',
    description='Total Cities:',
    disabled=False,
)
citySelect.layout.width = '95%'
totalTime = widgets.Text(
    value='',
    placeholder='',
    description='Time Taken:',
    disabled=True
)
totalTime.layout.margin = '2% 2% 5% 0%'
totalTime.layout.width = '100%'
sm_container = widgets.VBox([citySelect,button,totalTime,results])

#now create a horizontal box that can hold the map and the other container.
m.layout.width='80%'
sm_container.layout.width='20%'
container = widgets.HBox([m,sm_container])
container.layout.width='90%'
#now write a function for the start button
def start(b):
    totalTime.value = 'Waiting.............'
    cityLayerGroup.clear_layers()
    #randomly select a sample of cities based on the selection
    sampleCount = int(citySelect.value)
    sampleCities = cities.sample(sampleCount)
    #show the cities on the map
    city_data = GeoJSON(data  = json.loads(sampleCities.to_json()),
                    point_style={'radius': 2, 'color': 'red', 'fillOpacity': 1, 'fillColor': 'red', 'weight': 0})
    cityLayerGroup.add_layer(city_data)
    countyDat = []  # a list to store county ids
    start = time.time()
    for cidx,city in sampleCities.iterrows():
        #now using the bounding box of the city, query the tree to get the index for counties
        countyIndexes = list(idx.intersection(city.geometry.bounds))
        #get the counties from the index
        matchCounties = counties.loc[countyIndexes]
        for coidx,county in matchCounties.iterrows():
            if county.geometry.contains(city.geometry):
                # we need to use GEOID as two counties can have the same name
                countyDat.append(county.GEOID)
                break
    totalTime.value = str(int(time.time()-start))+' seconds'
    #count the occurrence of counties
    countyFrequency = collections.Counter(countyDat)
    #create a dataframe for the output
    countyDat = pd.DataFrame({'GEOID':countyFrequency.keys(),'Total':countyFrequency.values()}).sort_values(by='Total',ascending=False)
    #now merge with original dataset to get the names
    countyDat=countyDat.merge(counties[['GEOID','NAME']],on='GEOID')[['GEOID','NAME','Total']]
    with results:
        display(countyDat, clear=True)
button.on_click(start)
container

Let's see the critical sections of the code

This is how we create an index. 
<code>idx = index.Index()</code>

And this is how we insert minimum bounding rectangles into the index
<code>counties.bounds.apply(lambda x:idx.insert(x.name,x.values),axis=1)</code>

And this is the section of the code that does the heavy lifting

```python
for cidx,city in sampleCities.iterrows():
    #now using the bounding box of the city, query the tree to get the index for counties
    countyIndexes = list(idx.intersection(city.geometry.bounds))
    #get the counties from the index
    matchCounties = counties.loc[countyIndexes]
    for coidx,county in matchCounties.iterrows():
        if county.geometry.contains(city.geometry):
            # we need to use GEOID as two counties can have the same name
            countyDat.append(county.GEOID)
            break
```

Similar to the earlier approaches we iterate through the cities
<code>for cidx,city in sampleCities.iterrows():</code>

This line of code retrieves the indexes of all counties whose minimum bounding rectangle intersects the minimum bounding rectangle of the city (for a point, the bounding rectangle is just the point itself)
<code> countyIndexes = list(idx.intersection(city.geometry.bounds))</code>

Now we can retrieve the matching county geometries using the matching indexes
<code> matchCounties = counties.loc[countyIndexes]</code>

Similar to the previous experiments we will iterate through this small subset of counties to check whether the city lies within any of the counties in the subset.
```python
for coidx,county in matchCounties.iterrows():
    if county.geometry.contains(city.geometry):
        # we need to use GEOID as two counties can have the same name
        countyDat.append(county.GEOID)
        break
```

What do you notice? While for 100 cities the total time taken is the same as with the grid approach, for 3,000 cities the total time taken is halved (from 6 seconds to 3 seconds). But can we improve further?
Thanks to highly optimized libraries such as GeoPandas, we can perform standard geometric operations such as point-in-polygon with a few lines of code. So let's dive straight into the code.

## Experiment VIII (Total Number of Cities in every County, using Geopandas)

In [None]:
import geopandas as gpd
from ipyleaflet import Map, DrawControl,GeoData,LayerGroup,WidgetControl,Rectangle,basemap_to_tiles,basemaps,Polygon,GeoJSON
import json
import ipywidgets as widgets
import pandas as pd
import time
pd.set_option('display.max_rows', None)
#load the county data using geopandas
counties = gpd.read_file(r'supplementary/data/us_counties/us_counties_geog.shp',bbox=[-125.6,24.89,-64.36,49.47])
#load the cities data using geopandas
cities = gpd.read_file(r'supplementary/data/USA_Major_Cities/USA_Major_Cities.shp')
#lets make a map
m = Map(center=(44.967243, -103.771556), zoom=3)
#now lets load the county data 
#first we will create a layer group for the county and city
countyLayerGroup = LayerGroup()
cityLayerGroup = LayerGroup()
#add the layer group to map
m.add_layer(countyLayerGroup)
m.add_layer(cityLayerGroup)
#now create a geojson layer from the county dataset
geo_data = GeoJSON(data = json.loads(counties.to_json()),
                   name = 'counties',style={'opacity': 1,'color':'gray', 'fillOpacity': 0, 'weight': 1})
#add the layer to the countyLayerGroup
countyLayerGroup.add_layer(geo_data)
# now we will create few more elements to display our results
#we will create an output element which will display the cities table with corresponding counties
results = widgets.Output()
#we will also add a button to start the process
button = widgets.Button(
    description='Start',
    disabled=False,
    button_style='info'
)
results.layout.width = '100%'
results.layout.overflow = 'scroll'
results.layout.height = '250px'
button.layout.width = '35%'
button.layout.margin = '3% 0% 0% 50%'
#add a select widget to select the number of cities
numcities = [str(i) for i in range(100,3300,100)]
citySelect = widgets.Dropdown(
    options=numcities,
    value='100',
    description='Total Cities:',
    disabled=False,
)
citySelect.layout.width = '95%'
totalTime = widgets.Text(
    value='',
    placeholder='',
    description='Time Taken:',
    disabled=True
)
totalTime.layout.margin = '2% 2% 5% 0%'
totalTime.layout.width = '100%'
sm_container = widgets.VBox([citySelect,button,totalTime,results])

#now create a horizontal box that can hold the map and the other container.
m.layout.width='80%'
sm_container.layout.width='20%'
container = widgets.HBox([m,sm_container])
container.layout.width='90%'
#now write a function for the start button
def start(b):
    totalTime.value = 'Waiting.............'
    cityLayerGroup.clear_layers()
    #randomly select a sample of cities based on the selection
    sampleCount = int(citySelect.value)
    sampleCities = cities.sample(sampleCount)
    #show the cities on the map
    city_data = GeoJSON(data  = json.loads(sampleCities.to_json()),
                    point_style={'radius': 2, 'color': 'red', 'fillOpacity': 1, 'fillColor': 'red', 'weight': 0})
    cityLayerGroup.add_layer(city_data)
    #now we will directly use geopandas to get the counts using spatial join
    start = time.time()
    cityCountyJoin = counties.sjoin(sampleCities)
    totalTime.value = str(int(time.time()-start))+' seconds'
    #name the count column explicitly so that it does not depend on the pandas version
    countyDat=cityCountyJoin[['GEOID','NAME_left']].value_counts().reset_index(name='Total').rename(columns={'NAME_left':'NAME'})
    with results:
        display(countyDat, clear=True)
button.on_click(start)
container

You will notice the instant speedup. It doesn't take even a second to finish the aggregation run for 3,200 cities. The secret is that, apart from using advanced data structures such as the R-Tree, many of the spatial operations in GeoPandas run in compiled C/C++ code, which is much faster than looping in Python (an interpreted language). And all it takes is a one-liner to achieve this drastic performance improvement.
<code>counties.sjoin(sampleCities)</code> 
This one-liner is in fact the main section of the code. You don't even need to iterate through the cities or counties one by one as done previously.

We were able to reduce the response time from many seconds to a few milliseconds. This is the kind of response time we would expect from real-world websites. Nobody wants to wait even a few seconds to find nearby restaurants from their current location. So the key takeaway from these experiments is that

<span STYLE="font-size:24.0pt;color:black">"Solving Spatial Big Data challenges requires better Algorithms."</span>

Finally, let's remember an aspect of the Volume component in Spatial Big Data that we covered in the Beginner Big Data lesson.

## Limited Main Memory

Recall that **main memory** is where programs and data are kept when the processor is actively using them. When programs and data become active, they are copied from secondary memory into main memory where the processor can interact with them. 

Main memory is sometimes called **RAM**, which stands for Random Access Memory.

And *RAM is limited.*

So what if you have a machine with 64GB of RAM and the spatial data file that you want to process is 500GB? This is not uncommon in the world we live in: for example, the entire yellow cab taxi dataset for NYC from 2009 to 2022 is around 500GB and contains 1.5 billion ride records. You will quickly run into

<span STYLE="font-size:24.0pt;color:red">MemoryError</span>



## Reading data in chunks
Let's return to the exploration from the Beginner Big Data lesson: we can chunk data. We read a chunk in, process it, and then remove it from memory, clearing space for the next chunk. Let's look at an illustration:

![chunking](supplementary/images/chunking.jpg)

As you can see, chunking can be used to generate partitions of data. These partitions can then be processed to generate partial results, and at the end all the partial results can be combined to generate the final result. This is a very powerful paradigm and can be used to solve very large problems. You can, in theory, send these chunks to different threads, cores, or even machines to generate partial results, which can then be stitched back together to get the final result. Okay, enough theory; let's see this in action.
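
Here is a minimal sketch of the chunk, process, and combine pattern using only the Python standard library. The tiny in-memory "file", the column name `zone`, and the helper `read_in_chunks` are assumptions made for illustration; a real run would open a file on disk and use a much larger chunk size.

```python
import csv
import io
from collections import Counter
from itertools import islice

# Stand-in for a file too large for memory (hypothetical zone ids, one per taxi ride)
big_file = io.StringIO("zone\n" + "\n".join(["A", "B", "A", "C", "A", "B", "C", "A"]))

def read_in_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size rows, so only one chunk is in memory at a time."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

total = Counter()
for chunk in read_in_chunks(big_file, chunk_size=3):
    partial = Counter(row["zone"] for row in chunk)  # partial result for this chunk
    total.update(partial)                            # combine partial results

print(total)
```

If the data is in CSV form and you are working with pandas, `pd.read_csv(path, chunksize=n)` offers the same pattern: it returns an iterator of DataFrames instead of loading the whole file at once.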


So that finishes our lesson on Spatial Big Data Volume. The key takeaways are

1) Efficient algorithms are important to handle large volumes of data.

2) Chunking can be an effective strategy to handle memory bottlenecks.

In the next lesson we will look into the next important characteristic of SBD, <span STYLE="font-size:24.0pt;color:black">Velocity</span>

<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" 
href="bigdata-5.ipynb">Click here to go to the next notebook.</a></font>