# Geo analytics AAA - Intentionally Blank

The census tract borders and the points of interest data were sourced from the following sites:
- Census tract borders: https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2016&layergroup=Census+Tracts
- Points of interest data: https://overpass-turbo.eu/s/1yIY

There is no need to download the files manually. Both files should already be in the data folder provided by the previously mentioned sciebo link.

The analysis is split into two notebooks. This notebook visualizes static, non-time-dependent census tract and hexagon bin data, while the second notebook visualizes different temporal bin sizes.

**Dependencies needed for this notebook:**
- Pandas
- Numpy
- Folium
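- H3-py, version 3.x (*conda install -c conda-forge "h3-py<4"*); the code uses the v3 function names `h3.geo_to_h3`, `h3.polyfill` and `h3.h3_to_geo_boundary`, which were renamed in version 4
- Matplotlib (*conda install -c conda-forge matplotlib*)
- Geopandas (*conda install -c conda-forge geopandas*)
  - Shapely, version 2.0 or newer (*conda install -c conda-forge shapely*); `shapely.to_geojson` is used below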
- Branca (most likely installed with Folium or Geopandas else: *conda install -c conda-forge branca*)
- Pyarrow (*conda install pyarrow*)
  - Needed for reading the taxi dataset from the parquet file


## Imports and preparation

In [None]:
import json

import pandas as pd
import numpy as np

import folium
import h3
import matplotlib.pyplot as plt
import geopandas as gpd
import branca.colormap as cm
import shapely
from shapely.geometry import shape

First we import the needed files for the analysis.

In [None]:
# Read taxi, census tract borders and points of interest data
taxi_df = pd.read_parquet('data/prepared/taxi_data_prepared.gzip')
census_tract_borders = gpd.read_file('data/chicago_census_tract_borders.zip')
poi_df = gpd.read_file('data/POI.geojson')

Afterwards we do some small preparation steps needed for later use.

In [None]:
# Change pickup and dropoff location to geopandas geometry objects
taxi_df['pickup_centroid_location'] = gpd.GeoSeries.from_wkt(taxi_df['pickup_centroid_location'])
taxi_df['dropoff_centroid_location'] = gpd.GeoSeries.from_wkt(taxi_df['dropoff_centroid_location'])

# Drop columns that are not needed for the analysis
poi_df = poi_df.drop(columns= poi_df.columns.difference(["amenity", "public_transport","geometry"]))
census_tract_borders = census_tract_borders.drop(census_tract_borders.columns.difference(['GEOID', 'geometry']), axis=1)

# Get all unique census tract ids from the taxi data and filter the census tract borders to only include those
unique_census_tract_id = np.append(taxi_df['pickup_census_tract'].unique(), taxi_df['dropoff_census_tract'].unique()).astype('str')
census_tract_borders= census_tract_borders[census_tract_borders['GEOID'].isin(unique_census_tract_id)].reset_index(drop=True)

Here is a first look at the data used in the analysis.

In [None]:
# First look at the data
poi_df

The amenity column describes the type of the point of interest. The public_transport column describes the type of station, provided the point of interest is a station.

In [None]:
taxi_df

In [None]:
census_tract_borders

## Census Tract Analysis

Below is a helper function that selects the features and aggregates the taxi dataframe by location, feature column and aggregation type. It can also be specified whether all census tracts that are present in the taxi dataframe should be added to the output dataframe, even if they have no trips for the chosen location.
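
As a minimal sketch of the aggregation idea, the following toy example groups a small made-up trips table by pickup census tract and adds a trip count. The values and the `aggregate_by_tract` helper are hypothetical and only illustrate the pattern; they are not part of the notebook's pipeline.

```python
import pandas as pd

# Toy trips table; the column names mirror the taxi dataframe, the values are made up
trips = pd.DataFrame({
    "pickup_census_tract": [1001, 1001, 1002, 1002, 1002],
    "trip_total": [10.0, 14.0, 30.0, 6.0, 9.0],
    "trip_miles": [1.5, 2.0, 8.0, 0.5, 1.0],
})


def aggregate_by_tract(df, aggregation="sum"):
    """Aggregate all features per pickup tract and add the number of trips."""
    grouped = df.groupby("pickup_census_tract")
    result = grouped.agg(aggregation)
    # size() counts the rows per group, including rows with missing feature values
    result["trip_count"] = grouped.size()
    return result.reset_index()


per_tract_sum = aggregate_by_tract(trips, "sum")
per_tract_mean = aggregate_by_tract(trips, "mean")
print(per_tract_sum)
print(per_tract_mean)
```

This sketch counts rows with `size()`, whereas the notebook's helper obtains the trip count by applying `count` to a dedicated `trip_count` column inside the same `agg` call.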

In [None]:
def filterByFeatureStatic(dataframe, location='pickup', feature='all', aggregation='sum', missingCensusTract=False):
    """ Filter a feature or all features of a dataframe. Furthermore aggregate the data by location and feature.

    Parameters
    ----------

    dataframe :  (pandas.DataFrame) 
        The taxi dataframe to filter and aggregate.
    location : (str) 
        The location column of the dataframe. Can be either 'pickup' or 'dropoff'. Default is 'pickup'.
    feature : (str) 
        The feature to aggregate. If 'all', all features are aggregated. Default is 'all'
    aggregation : (str)  
        The aggregation function to use. Can be either 'mean', 'median', 'sum', 'count', 'min', 'max'. Default is 'sum'.
    missingCensusTract : (bool)
        If True, census tracts with no data are included in the output. Default is False.

    Returns
    ----------

    dataframe_grouped : (geopandas.GeoDataFrame) 
        The geodataframe grouped by the location column and the feature column. Contains always a geometry column and trip_count column.
    """
    # Copy dataframe to not change the original dataframe
    dataframe_grouped = dataframe.copy()

    # Get all features or only the specified feature and save them in a list
    if feature == 'all':
        features = dataframe_grouped.columns.difference(['pickup_census_tract', 'dropoff_census_tract', 'pickup_centroid_location', 'dropoff_centroid_location', 'trip_start_timestamp', 'trip_end_timestamp', 'taxi_id']).tolist()
    else:
        features = [feature]
    
    # Append the location column to the features list and drop all other columns not in the features list
    if location == 'pickup':
        features.append('pickup_census_tract')
        features.append('pickup_centroid_location')
        dataframe_grouped = dataframe_grouped.drop(columns=dataframe_grouped.columns.difference(features))
    elif location == 'dropoff':
        features.append('dropoff_census_tract')
        features.append('dropoff_centroid_location')
        dataframe_grouped = dataframe_grouped.drop(columns=dataframe_grouped.columns.difference(features))
    else:
        raise ValueError("Location must be either 'pickup' or 'dropoff'.")
    
    # Make a trip count column for later aggregation (constant, so that 'count' counts every trip even if a feature value is missing)
    dataframe_grouped['trip_count'] = 1
    
    # Group by the census tract id and the location centroid. Moreover aggregate the data by the specified aggregation function
    dataframe_grouped = dataframe_grouped.groupby([location + '_census_tract', location + '_centroid_location']).agg(lambda column: column.agg('count') if column.name == 'trip_count' else column.agg(aggregation)).reset_index()
    dataframe_grouped = dataframe_grouped.rename(columns={location + '_census_tract': 'GEOID'})
    dataframe_grouped['GEOID'] = dataframe_grouped['GEOID'].astype('str')

    # Add missing census tracts if specified: a right merge keeps every census tract border,
    # a left merge keeps only the census tracts that have trips
    merge_how = 'right' if missingCensusTract else 'left'
    dataframe_grouped = dataframe_grouped.merge(census_tract_borders, on='GEOID', how=merge_how)
    dataframe_grouped = gpd.GeoDataFrame(dataframe_grouped)
    dataframe_grouped[location + '_centroid_location'] = dataframe_grouped['geometry'].to_crs('+proj=cea').centroid.to_crs('EPSG:4326')
    dataframe_grouped['geometry'] = dataframe_grouped['geometry'].to_crs('EPSG:4326')

    # Add poi count to each census tract
    dataframe_grouped['poi_count'] = dataframe_grouped.apply(lambda row: row['geometry'].contains(poi_df['geometry']).sum(), axis=1)
    return dataframe_grouped


In [None]:
geo_df_taxi_pickup_sum = filterByFeatureStatic(taxi_df, missingCensusTract = True, feature='all', aggregation= "sum", location='pickup')

In [None]:
# Filter all POIs that are within the census tract borders

# 'predicate' replaces the deprecated 'op' keyword of gpd.sjoin
poi_in_census_tract = gpd.sjoin(poi_df, geo_df_taxi_pickup_sum, how='inner', predicate='within')

First we plot all points of interest to see how the points are distributed.

In [None]:
# Add the title to the folium map
loc = 'Points of interest within census tract borders'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
# Interactive map
poi_in_census_tract.explore(tooltip=True, cmap="viridis", m = m, style_kwds = {"opacity": 0.3, "color": "red", "fillOpacity": 0.3, "fillColor": 'red'}, marker_kwds = {"radius": 3})

As we can see, the center of Chicago has the most points of interest, and the distribution thins out with increasing distance from the center. A closer look also shows that the points of interest mostly lie along the larger streets.

Now we plot the points of interest together with a choropleth plot of the census tracts. From here on, most plots use a logarithmic scale because the differences in demand are so large that on a linear scale only the center would be colored yellow and all other census tracts would appear purple.
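
To illustrate why a logarithmic scale helps, the following sketch normalizes a few made-up trip counts to the colormap range [0, 1], once linearly and once after applying log10. The numbers are hypothetical and not taken from the taxi data.

```python
import math

# Hypothetical trip counts per census tract (illustrative values only)
trip_counts = [12, 150, 2_400, 90_000, 1_500_000]


def normalize(values):
    """Scale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


linear = normalize(trip_counts)
logarithmic = normalize([math.log10(v) for v in trip_counts])

for count, lin, log in zip(trip_counts, linear, logarithmic):
    print(f"{count:>9} trips  linear={lin:.3f}  log10={log:.3f}")
```

On the linear scale every tract except the busiest one ends up below 0.1 and would receive almost the same color, while the log10 values are spread over the whole range.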

In [None]:
# Add the title to the folium map
loc = 'POI count by census tract pickup location logarithmic scale'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
# Interactive map
geo_df_taxi_pickup_sum[geo_df_taxi_pickup_sum['poi_count'] > 0].explore(column=np.log10(geo_df_taxi_pickup_sum[geo_df_taxi_pickup_sum['poi_count'] > 0]['poi_count']), tooltip=True, cmap="viridis", m = m)

The choropleth of the POI density makes this even clearer. The spider-web-like distribution most likely follows the more populated streets. The choropleth further shows that the center and the airports have a higher density of POIs.

In [None]:
geo_df_taxi_dropoff_sum = filterByFeatureStatic(taxi_df, missingCensusTract = True, feature='all', aggregation= "sum", location='dropoff')

We check whether demand, measured by trip count, is distributed differently for pickup and dropoff locations.

In [None]:
# Add the title to the folium map
loc = 'Trip count by census tract pickup location logarithmic scale'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_pickup_sum.explore(column=np.log10(geo_df_taxi_pickup_sum['trip_count']), tooltip=True, cmap="viridis", m = m)

In [None]:
# Add the title to the folium map
loc = 'Trip count by census tract dropoff location logarithmic scale'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_dropoff_sum.explore(column=np.log10(geo_df_taxi_dropoff_sum['trip_count']), tooltip=True, cmap="viridis", m = m)

As we can see, the overall distribution is nearly the same for dropoff and pickup locations. The only difference is that some census tracts have no entries for the dropoff or the pickup location; these are shown in grey.
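
The grey tracts come from the merge in `filterByFeatureStatic`: with `missingCensusTract=True` the aggregated trips are merged onto the census tract borders with `how='right'`, so tracts without trips stay in the result with missing values. A minimal sketch with made-up GEOIDs, where the two small dataframes are hypothetical stand-ins for the aggregated trips and `census_tract_borders`:

```python
import pandas as pd

# Hypothetical aggregated trips; tract "C" has no pickups
trips_per_tract = pd.DataFrame({"GEOID": ["A", "B"], "trip_count": [120, 45]})
# Hypothetical list of all tracts (stands in for census_tract_borders)
all_tracts = pd.DataFrame({"GEOID": ["A", "B", "C"]})

# how='right' keeps every row of all_tracts, how='left' keeps only tracts with trips
with_missing = trips_per_tract.merge(all_tracts, on="GEOID", how="right")
without_missing = trips_per_tract.merge(all_tracts, on="GEOID", how="left")

print(with_missing)
print(without_missing)
```

In the right merge, tract "C" is kept with a missing `trip_count`, which is what ends up as a grey area on the map.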

Furthermore, as mentioned previously, the center of Chicago, especially the area near Navy Pier, has the highest demand, which decreases with distance from the center. Two census tracts are exceptions to this trend: one in the upper left and one in the middle left of the map show high demand. A closer look at the map shows that both contain airports, which could explain the spike in demand compared to the neighboring census tracts. One reason why the airports and the center/Navy Pier have the highest trip counts could be the generally higher traffic around those areas. People arriving by plane usually do not have a car at the airport and most likely carry luggage, which makes a taxi more convenient than public transit. The same applies to Navy Pier when people arrive by ship. Another reason for the high demand near the pier could be its central location, which makes it easy to reach POIs and to move on to other places.

Now we plot the choropleth map based on the sum of the trip total per census tract. The trip total column correlates with the trip miles and trip seconds columns because more revenue usually means that the trips were longer in time and distance. The trip count is a good indicator of demand, but the pattern could differ depending on the type of trips: a census tract could have a high trip count but relatively short trips compared to other parts of Chicago, leading to less revenue overall. This could skew the perception of demand.

In [None]:
# Add the title to the folium map
loc = 'Trip total by census tract pickup location logarithmic scale'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_pickup_sum.explore(column=np.log10(geo_df_taxi_pickup_sum['trip_total']), tooltip=True, cmap="viridis", m = m)

As we can see, demand measured by trip total correlates with demand measured by trip count. The pattern looks similar, if not identical, to the trip count pattern.

In the figure below all features are plotted side by side to further compare the patterns of the different features.

In [None]:
fig, axd = plt.subplot_mosaic([['trip_total', 'trip_seconds'],
                               ['trip_miles', 'idle_seconds'],
                               ['trip_count', 'poi_count']],
                              figsize=(18,30), layout="constrained")
fig.suptitle('Features plotted by census tract pickup location on a logarithmic scale', fontsize=16)
# log10 is undefined for 0, so keep only rows with a positive or a missing value (the parentheses are required because | binds tighter than >)
idle_mask = (geo_df_taxi_pickup_sum['idle_seconds'] > 0) | geo_df_taxi_pickup_sum['idle_seconds'].isna()
poi_mask = (geo_df_taxi_pickup_sum['poi_count'] > 0) | geo_df_taxi_pickup_sum['poi_count'].isna()
geo_df_taxi_pickup_sum.plot(column=np.log10(geo_df_taxi_pickup_sum['trip_total']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_total'], legend_kwds={"label": "Sum of revenue (in dollars)", "orientation": "horizontal"})
geo_df_taxi_pickup_sum.plot(column=np.log10(geo_df_taxi_pickup_sum['trip_seconds']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_seconds'], legend_kwds={"label": "Sum of trip time in seconds", "orientation": "horizontal"})
geo_df_taxi_pickup_sum.plot(column=np.log10(geo_df_taxi_pickup_sum['trip_miles']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_miles'], legend_kwds={"label": "Sum of trip miles", "orientation": "horizontal"})
geo_df_taxi_pickup_sum[idle_mask].plot(column=np.log10(geo_df_taxi_pickup_sum.loc[idle_mask, 'idle_seconds']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['idle_seconds'], legend_kwds={"label": "Sum of idle time in seconds", "orientation": "horizontal"})
geo_df_taxi_pickup_sum.plot(column=np.log10(geo_df_taxi_pickup_sum['trip_count']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_count'], legend_kwds={"label": "Sum of Trips", "orientation": "horizontal"})
geo_df_taxi_pickup_sum[poi_mask].plot(column=np.log10(geo_df_taxi_pickup_sum.loc[poi_mask, 'poi_count']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['poi_count'], legend_kwds={"label": "Sum of points of interest", "orientation": "horizontal"})

From the plot we can see that each feature follows a similar trend, which makes sense because most features correlate with each other. Note that the idle time plot is the least meaningful of the six: for idle time the mean is a better metric than the sum, because the sum is heavily influenced by the trip count.
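
A small sketch with made-up numbers shows how the trip count dominates the sum. The two tracts and their idle times are hypothetical.

```python
from statistics import mean

# Hypothetical idle times in seconds per trip for two census tracts
idle_seconds = {
    "busy_tract": [60] * 1000,   # many trips, short idle time each
    "quiet_tract": [600] * 20,   # few trips, long idle time each
}

totals = {tract: sum(values) for tract, values in idle_seconds.items()}
averages = {tract: mean(values) for tract, values in idle_seconds.items()}

for tract in idle_seconds:
    print(f"{tract}: sum={totals[tract]}  mean={averages[tract]}  trips={len(idle_seconds[tract])}")
```

The busy tract has the larger sum although each of its trips has a much shorter idle time, so the sum mostly reflects how many trips there were.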

In [None]:
geo_df_taxi_pickup_mean = filterByFeatureStatic(taxi_df, missingCensusTract = True, feature='all', aggregation= "mean", location='pickup')

Next we look at the means of all features grouped by the pickup location. From the previous plots we can infer that the center of the city and the airports are the areas with the most demand but it can be worth looking at the averages for additional patterns. For this we plot the averages for the revenue, trip seconds, trip miles and idle seconds but we also plot the trip count and total points of interest for comparison.

In [None]:
fig, axd = plt.subplot_mosaic([['trip_total', 'trip_seconds'],
                               ['trip_miles', 'idle_seconds'],
                               ['trip_count', 'poi_count']],
                              figsize=(18,30), layout="constrained")
fig.suptitle('Features plotted by census tract pickup location on a linear scale', fontsize=16)
# Keep only rows with a positive or a missing value (the parentheses are required because | binds tighter than >)
idle_mask = (geo_df_taxi_pickup_mean['idle_seconds'] > 0) | geo_df_taxi_pickup_mean['idle_seconds'].isna()
poi_mask = (geo_df_taxi_pickup_mean['poi_count'] > 0) | geo_df_taxi_pickup_mean['poi_count'].isna()
geo_df_taxi_pickup_mean.plot(column=geo_df_taxi_pickup_mean['trip_total'], cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, vmax=geo_df_taxi_pickup_mean['trip_total'].quantile(0.75), ax=axd['trip_total'], legend_kwds={"label": "Mean of revenue (in dollars)", "orientation": "horizontal"})
geo_df_taxi_pickup_mean.plot(column=geo_df_taxi_pickup_mean['trip_seconds'], cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, vmax=geo_df_taxi_pickup_mean['trip_seconds'].quantile(0.75), ax=axd['trip_seconds'], legend_kwds={"label": "Mean of trip time in seconds", "orientation": "horizontal"})
geo_df_taxi_pickup_mean.plot(column=geo_df_taxi_pickup_mean['trip_miles'], cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, vmax=geo_df_taxi_pickup_mean['trip_miles'].quantile(0.75), ax=axd['trip_miles'], legend_kwds={"label": "Mean of trip miles", "orientation": "horizontal"})
geo_df_taxi_pickup_mean[idle_mask].plot(column=geo_df_taxi_pickup_mean.loc[idle_mask, 'idle_seconds'], cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, vmax=geo_df_taxi_pickup_mean['idle_seconds'].quantile(0.75), ax=axd['idle_seconds'], legend_kwds={"label": "Mean of idle time in seconds", "orientation": "horizontal"})
geo_df_taxi_pickup_mean.plot(column=np.log10(geo_df_taxi_pickup_mean['trip_count']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_count'], legend_kwds={"label": "Sum of Trips", "orientation": "horizontal"})
geo_df_taxi_pickup_mean[poi_mask].plot(column=np.log10(geo_df_taxi_pickup_mean.loc[poi_mask, 'poi_count']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['poi_count'], legend_kwds={"label": "Sum of points of interest", "orientation": "horizontal"})

Again, revenue, trip seconds and trip miles show similar patterns because they are correlated. All three have lower averages at the center, i.e. less revenue and shorter trips in time and distance, and the averages increase with distance from the center. The idle seconds follow a similar trend but with more outliers. Note that areas with low demand have few trips, so their averages rest on little data, which could lead to statistical errors and false patterns.
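
To give a rough feeling for how much less reliable a mean based on few trips is, the following sketch compares the standard error of the mean for a small and a large sample. The fares are made up and the `standard_error` helper is only illustrative.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical fares in dollars; the large sample repeats the small one,
# so both have the same mean and a similar spread
few_trips = [8.0, 35.0, 12.0, 60.0]
many_trips = few_trips * 100

for name, fares in (("few trips", few_trips), ("many trips", many_trips)):
    print(f"{name}: n={len(fares)}  mean={statistics.mean(fares):.2f}  standard error={standard_error(fares):.2f}")
```

Both samples have the same mean, but the standard error of the small sample is many times larger, so a single unusual trip can shift the average of a low-demand tract noticeably.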

## Hexagon bin analysis

As in the previous chapter, below is a helper function that selects the features and aggregates the taxi dataframe by location, feature column and aggregation type, this time into H3 hexagon bins. It can again be specified whether all census tracts that are present in the taxi dataframe should be added to the output dataframe. The additional 'hexRes' parameter specifies the resolution of the hexagon bins.

In [None]:
def plotH3_HexagonMap(dataframe, location='pickup', feature='all', aggregation='sum', missingCensusTract=False, hexRes = 8):
    """ Aggregate the features of a dataframe into H3 hexagon bins that can be plotted on a map.

    Parameters
    ----------

    dataframe :  (pandas.DataFrame) 
        The taxi dataframe to aggregate.
    location : (str) 
        The location column of the dataframe. Can be either 'pickup' or 'dropoff'. Default is 'pickup'.
    feature : (str) 
        The feature to aggregate. If 'all', all features are aggregated. Default is 'all'
    aggregation : (str)  
        The aggregation function to use. Can be either 'mean', 'median', 'sum', 'count', 'min', 'max'. Default is 'sum'.
    missingCensusTract : (bool)
        If True, census tracts with no data are included in the plot. Default is False.
    hexRes : (int)
        H3 hexagon resolution size. Default is 8.
    Returns
    ----------

    taxi_df_geo_grouped : (geopandas.GeoDataFrame) 
        The geodataframe grouped by the location column and the feature column. Contains always a geometry column and trip_count column.
    """
    taxi_df_geo = filterByFeatureStatic(dataframe, location= location,feature = feature, aggregation = aggregation, missingCensusTract = missingCensusTract)
    # geometry to h3 index
    taxi_df_geo['h3_index'] = taxi_df_geo.apply(lambda row: h3.geo_to_h3(row[location + '_centroid_location'].y, row[location + '_centroid_location'].x, hexRes), axis=1)

    geojson = []
    geometries = []
    indexes = []

    for geometry in taxi_df_geo['geometry']:
        geojson.append(shapely.to_geojson(geometry))

    for geometry in geojson:
        obj = json.loads(geometry)
        h3_indexes = h3.polyfill(obj, hexRes ,True)
        for index in h3_indexes:
            geometries.append(shape({"type": "Polygon",
                    "coordinates": [h3.h3_to_geo_boundary(index, geo_json=True)],
                    "properties": ""
                    }))
            indexes.append(index)   
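    # Drop the centroid column of the chosen location (not only 'pickup') so that location='dropoff' works as well
    taxi_df_geo.drop(columns= ['geometry', location + '_centroid_location', 'GEOID'], inplace = True)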
    df_h3_polyfilled = pd.DataFrame({'h3_index': indexes})
    taxi_df_geo_grouped = taxi_df_geo.groupby('h3_index').agg(aggregation).reset_index()
    taxi_df_geo_grouped = taxi_df_geo_grouped.merge(df_h3_polyfilled, on='h3_index', how='outer')
    taxi_df_geo_grouped['geometry'] = taxi_df_geo_grouped.apply(lambda row: shape({"type": "Polygon",
                                           "coordinates": [h3.h3_to_geo_boundary(row["h3_index"], geo_json=True)],
                                           "properties": ""
                                           }), axis=1)
    taxi_df_geo_grouped.loc[taxi_df_geo_grouped['trip_count'] == 0,'trip_count'] = np.nan
    taxi_df_geo_grouped.loc[taxi_df_geo_grouped['trip_miles'] == 0,'trip_miles'] = np.nan
    taxi_df_geo_grouped.loc[taxi_df_geo_grouped['trip_seconds'] == 0,'trip_seconds'] = np.nan
    taxi_df_geo_grouped.loc[taxi_df_geo_grouped['trip_total'] == 0,'trip_total'] = np.nan
    taxi_df_geo_grouped = gpd.GeoDataFrame(taxi_df_geo_grouped, crs='EPSG:4326', geometry='geometry')

    taxi_df_geo_grouped['poi_count'] = taxi_df_geo_grouped.apply(lambda row: row['geometry'].contains(poi_df['geometry']).sum(), axis=1)    
    return taxi_df_geo_grouped


First we need to find a suitable hexagon resolution. For this we test the resolutions 4, 7, 8 and 9, which showcases hexagons that are too big as well as hexagons that are too small.
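
As a rough orientation before plotting, the sketch below estimates how many hexagons cover the city at each tested resolution. The average hexagon areas are rounded values from the H3 resolution table, and the city area of about 600 km² is an assumption; both should be read as approximations.

```python
# Approximate average H3 hexagon area in km^2 per resolution (rounded values
# from the H3 resolution table; rough guides, not exact figures)
AVG_HEX_AREA_KM2 = {4: 1770.3, 7: 5.16, 8: 0.737, 9: 0.105}

# Assumption: Chicago covers roughly 600 km^2
CITY_AREA_KM2 = 600


def approx_hexagons(city_area_km2, resolution):
    """Rough number of hexagons needed to cover an area at a given resolution."""
    return city_area_km2 / AVG_HEX_AREA_KM2[resolution]


for res in AVG_HEX_AREA_KM2:
    print(f"resolution {res}: ~{approx_hexagons(CITY_AREA_KM2, res):,.1f} hexagons")
```

Under these assumptions a single hexagon at resolution 4 is larger than the whole city, while resolution 9 needs several thousand hexagons. Each resolution step changes the hexagon area by a factor of about 7.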

In [None]:
geo_df_taxi_h3_pickup_sum_res4 = plotH3_HexagonMap(taxi_df, location='pickup', feature='all', aggregation='sum', missingCensusTract=True, hexRes= 4)

We start with the hexagon bins at a resolution of 4.

In [None]:
# Add the title to the folium map
loc = 'Trip count by h3 hexagon pickup location linear scale - Resolution 4'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_h3_pickup_sum_res4.explore(column=geo_df_taxi_h3_pickup_sum_res4['trip_count'], vmax = geo_df_taxi_h3_pickup_sum_res4['trip_count'].quantile(0.75), tooltip=True, cmap="viridis", m = m)

From the map above we can see that the hexagons are too large to observe and analyze patterns. We can still see that the center has the most demand, but not the finer gradient. Furthermore, the hexagons extend beyond the census tract borders.

In [None]:
geo_df_taxi_h3_pickup_sum_res7 = plotH3_HexagonMap(taxi_df, location='pickup', feature='all', aggregation='sum', missingCensusTract=True, hexRes= 7)

In [None]:
# Add the title to the folium map
loc = 'Trip total by h3 hexagon pickup location logarithmic scale - Resolution 7'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_h3_pickup_sum_res7.explore(column=np.log10(geo_df_taxi_h3_pickup_sum_res7['trip_total']), tooltip=True, cmap="viridis", m = m)

The map looks much clearer at a higher resolution. At a resolution of 7 we can already recognize the pattern from the census tract analysis.

In [None]:
geo_df_taxi_h3_pickup_sum_res8 = plotH3_HexagonMap(taxi_df, location='pickup', feature='all', aggregation='sum', missingCensusTract=True, hexRes= 8)

In [None]:
# Add the title to the folium map
loc = 'Trip total by h3 hexagon pickup location logarithmic scale - Resolution 8'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_h3_pickup_sum_res8.explore(column=np.log10(geo_df_taxi_h3_pickup_sum_res8['trip_total']), tooltip=True, cmap="viridis", m = m)

At resolution 8 we can observe finer transitions in demand between neighboring hexagon bins, but the problem of smaller hexagon bins also becomes apparent: at a higher resolution more gaps with no data emerge.

In [None]:
geo_df_taxi_h3_pickup_sum_res9 = plotH3_HexagonMap(taxi_df, location='pickup', feature='all', aggregation='sum', missingCensusTract=True, hexRes= 9)

In [None]:
# Add the title to the folium map
loc = 'Trip total by h3 hexagon pickup location logarithmic scale - Resolution 9'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_h3_pickup_sum_res9.explore(column=np.log10(geo_df_taxi_h3_pickup_sum_res9['trip_total']), tooltip=True, cmap="viridis", m = m)

Going one resolution further makes the problem with higher resolutions clearer. We can still observe the demand to some degree, but the gaps in the data become more apparent. Such a high resolution is not optimal for the type of data we use but could be useful for data with a higher variance in location points. Even at a resolution of 9 we can see that the trip locations are not stored at their exact position for privacy reasons, which should be kept in mind for the conclusions drawn from the plots.

After showcasing different resolutions, we decide to use a resolution of 8 for the following maps. A resolution of 8 has more gaps than a resolution of 7, but the patterns are clearer because of the finer hexagons. Both resolutions seem like a good choice for this specific use case.

In [None]:
# Add the title to the folium map
loc = 'POI count by h3 hexagon pickup location logarithmic scale - Resolution 8'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_h3_pickup_sum_res8[geo_df_taxi_h3_pickup_sum_res8['poi_count'] > 0].explore(column=np.log10(geo_df_taxi_h3_pickup_sum_res8[geo_df_taxi_h3_pickup_sum_res8['poi_count'] > 0]['poi_count']), tooltip=True, cmap="viridis", m = m)

In [None]:
# Add the title to the folium map
loc = 'Trip total by h3 hexagon pickup location logarithmic scale - Resolution 8'
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(loc)   

m = folium.Map(location=(41.8,-87.723177), zoom_start=11)

m.get_root().html.add_child(folium.Element(title_html))
geo_df_taxi_h3_pickup_sum_res8.explore(column=np.log10(geo_df_taxi_h3_pickup_sum_res8['trip_total']), tooltip=True, cmap="viridis", m = m)

From the previous figures we can see the same patterns as in the census tract chapter. The advantage of hexagon bins is that their size is uniform, which makes patterns and transitions easier to see.

Below we also plot the sums and averages of all features per hexagon bin for comparison with the census tract plots from the previous chapter.

In [None]:
fig, axd = plt.subplot_mosaic([['trip_total', 'trip_seconds'],
                               ['trip_miles', 'idle_seconds'],
                               ['trip_count', 'poi_count']],
                              figsize=(18,30), layout="constrained")
fig.suptitle('Features plotted by hexagon bins on a logarithmic scale - Resolution 8', fontsize=16)
# log10 is undefined for 0, so keep only rows with a positive or a missing value (the parentheses are required because | binds tighter than >)
idle_mask = (geo_df_taxi_h3_pickup_sum_res8['idle_seconds'] > 0) | geo_df_taxi_h3_pickup_sum_res8['idle_seconds'].isna()
poi_mask = (geo_df_taxi_h3_pickup_sum_res8['poi_count'] > 0) | geo_df_taxi_h3_pickup_sum_res8['poi_count'].isna()
geo_df_taxi_h3_pickup_sum_res8.plot(column=np.log10(geo_df_taxi_h3_pickup_sum_res8['trip_total']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_total'], legend_kwds={"label": "Sum of revenue (in dollars)", "orientation": "horizontal"})
geo_df_taxi_h3_pickup_sum_res8.plot(column=np.log10(geo_df_taxi_h3_pickup_sum_res8['trip_seconds']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_seconds'], legend_kwds={"label": "Sum of trip time in seconds", "orientation": "horizontal"})
geo_df_taxi_h3_pickup_sum_res8.plot(column=np.log10(geo_df_taxi_h3_pickup_sum_res8['trip_miles']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_miles'], legend_kwds={"label": "Sum of trip miles", "orientation": "horizontal"})
geo_df_taxi_h3_pickup_sum_res8[idle_mask].plot(column=np.log10(geo_df_taxi_h3_pickup_sum_res8.loc[idle_mask, 'idle_seconds']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['idle_seconds'], legend_kwds={"label": "Sum of idle time in seconds", "orientation": "horizontal"})
geo_df_taxi_h3_pickup_sum_res8.plot(column=np.log10(geo_df_taxi_h3_pickup_sum_res8['trip_count']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_count'], legend_kwds={"label": "Sum of Trips", "orientation": "horizontal"})
# Only the hexagon bins are drawn on the poi_count axis; the extra census tract plot that was drawn on top of it has been removed
geo_df_taxi_h3_pickup_sum_res8[poi_mask].plot(column=np.log10(geo_df_taxi_h3_pickup_sum_res8.loc[poi_mask, 'poi_count']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['poi_count'], legend_kwds={"label": "Sum of points of interest", "orientation": "horizontal"})

In [None]:
geo_df_taxi_h3_pickup_mean_res8 = plotH3_HexagonMap(taxi_df, location='pickup', feature='all', aggregation='mean', missingCensusTract=True, hexRes= 8)

In [None]:
fig, axd = plt.subplot_mosaic([['trip_total', 'trip_seconds'],
                               ['trip_miles', 'idle_seconds'],
                               ['trip_count', 'poi_count']],
                              figsize=(18,30), layout="constrained")
fig.suptitle('Features plotted by hexagon bins on a linear scale - Resolution 8', fontsize=16)
# Keep only rows with a positive or a missing value (the parentheses are required because | binds tighter than >)
idle_mask = (geo_df_taxi_h3_pickup_mean_res8['idle_seconds'] > 0) | geo_df_taxi_h3_pickup_mean_res8['idle_seconds'].isna()
poi_mask = (geo_df_taxi_h3_pickup_mean_res8['poi_count'] > 0) | geo_df_taxi_h3_pickup_mean_res8['poi_count'].isna()
geo_df_taxi_h3_pickup_mean_res8.plot(column=geo_df_taxi_h3_pickup_mean_res8['trip_total'], cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, vmax=geo_df_taxi_h3_pickup_mean_res8['trip_total'].quantile(0.75), ax=axd['trip_total'], legend_kwds={"label": "Mean of revenue (in dollars)", "orientation": "horizontal"})
geo_df_taxi_h3_pickup_mean_res8.plot(column=geo_df_taxi_h3_pickup_mean_res8['trip_seconds'], cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, vmax=geo_df_taxi_h3_pickup_mean_res8['trip_seconds'].quantile(0.75), ax=axd['trip_seconds'], legend_kwds={"label": "Mean of trip time in seconds", "orientation": "horizontal"})
geo_df_taxi_h3_pickup_mean_res8.plot(column=geo_df_taxi_h3_pickup_mean_res8['trip_miles'], cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, vmax=geo_df_taxi_h3_pickup_mean_res8['trip_miles'].quantile(0.75), ax=axd['trip_miles'], legend_kwds={"label": "Mean of trip miles", "orientation": "horizontal"})
# vmax is taken from the hexagon dataframe itself instead of the census tract dataframe
geo_df_taxi_h3_pickup_mean_res8[idle_mask].plot(column=geo_df_taxi_h3_pickup_mean_res8.loc[idle_mask, 'idle_seconds'], cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, vmax=geo_df_taxi_h3_pickup_mean_res8['idle_seconds'].quantile(0.75), ax=axd['idle_seconds'], legend_kwds={"label": "Mean of idle time in seconds", "orientation": "horizontal"})
geo_df_taxi_h3_pickup_mean_res8.plot(column=np.log10(geo_df_taxi_h3_pickup_mean_res8['trip_count']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['trip_count'], legend_kwds={"label": "Sum of Trips", "orientation": "horizontal"})
geo_df_taxi_h3_pickup_mean_res8[poi_mask].plot(column=np.log10(geo_df_taxi_h3_pickup_mean_res8.loc[poi_mask, 'poi_count']), cmap="viridis", legend=True, missing_kwds={'color': 'lightgrey'}, ax=axd['poi_count'], legend_kwds={"label": "Sum of points of interest", "orientation": "horizontal"})

In conclusion, the observed patterns can be derived from both the census tract plots and the hexagon bin plots. Overall we recommend hexagon bins over census tract borders because the patterns and transitions look much clearer. The appropriate resolution depends heavily on the type of data and the use case. On a city scale a resolution of 7 or 8 works well; for geospatial observations on a state scale we recommend a lower resolution, and on a neighborhood scale a higher one.

The geo analysis for the temporal bins can be found in the [02b_geo_analytics_time_series.ipynb](./02b_geo_analytics_time_series.ipynb).