# Stop Sign Counting By Zipcodes in California

This notebook demonstrates GeoDataFrame joins using point-in-polygon tests in cuSpatial.

## Prerequisite: Datasets

Datasets used (TODO: discuss whether these need to be self-hosted):
1. Stops (Signs and Stop lines) Dataset from OpenStreetMap: [Tag Page](https://wiki.openstreetmap.org/wiki/Tag:highway%3Dstop)
2. USA ZipCode: [Download](https://catalog.data.gov/dataset/tiger-line-shapefile-2019-2010-nation-u-s-2010-census-5-digit-zip-code-tabulation-area-zcta5-na)
3. USA States boundary: [link](https://osm-boundaries.com/)

Download the datasets and save them as:
1. USA_Stops_Vertices.csv
2. USA_Zipcodes_2019_Tiger.csv
3. USA_States.csv

## Overview

Stop signs are an important indicator of city and road development in geographic information systems. This notebook takes all stop sign locations from the dataset and uses spatial joins to assign them to the zip code boundaries located within California. The notebook performs the following steps:
1. Filter the zip code boundaries located in California with a spatial join.
2. Filter the stop signs located in those zip codes with a spatial join.
3. Count the stop signs by zip code.
4. Visualize the result on a map.
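
The steps above can be sketched on the CPU in plain Python to make the idea concrete. This is a minimal illustration, not the cuSpatial implementation: the even-odd ray casting test, the `point_in_ring` helper, and the toy `zones` and `stops` data are all assumptions introduced here.

```python
from collections import Counter

def point_in_ring(x, y, ring):
    """Even-odd ray casting test against one ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from (x, y) toward +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical toy data: two square "zip codes" and four "stop signs".
zones = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
stops = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (9.0, 9.0)]

# Assign each stop to the first zone that contains it, then count per zone.
counts = Counter()
for sx, sy in stops:
    for name, ring in zones.items():
        if point_in_ring(sx, sy, ring):
            counts[name] += 1
            break

print(dict(counts))  # {'A': 2, 'B': 1}
```

The stop at `(9.0, 9.0)` falls outside both zones and is dropped. The spatial joins in this notebook discard out-of-state points in the same way.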

In [1]:
# Import Necessary Packages
import cuspatial, cudf
import pandas as pd
import geopandas as gpd
import numpy as np
from shapely import wkt
from tqdm.notebook import trange
import pydeck as pdk

## Load Dataset and Cleanup

We load the datasets and store them as device dataframes in cuSpatial. Note that the second cell below loads the zipcode dataset with cuDF, and the cell after it uses GeoPandas to parse the WKT strings into shapely objects. This parsing step is slow: it runs on the CPU and requires data transfer between device and host.

In [2]:
# Import Stop Sign CSV
d_stops = cudf.read_csv("USA_Stops_Vertices.csv", usecols=["x", "y"])

In [3]:
# Import CSV of Zipcodes
d_zip = cudf.read_csv(
    "USA_Zipcodes_2019_Tiger.csv",
    usecols=["WKT", "ZCTA5CE10", "INTPTLAT10", "INTPTLON10"])
d_zip.INTPTLAT10 = d_zip.INTPTLAT10.astype("float")
d_zip.INTPTLON10 = d_zip.INTPTLON10.astype("float")

In [4]:
# Load WKT as shapely objects
h_zip = d_zip.to_pandas()
h_zip["WKT"] = h_zip["WKT"].apply(wkt.loads)
h_zip = gpd.GeoDataFrame(h_zip, geometry="WKT", crs='epsg:4326')

# Round-trip back to the device as a cuspatial GeoDataFrame
d_zip = cuspatial.from_geopandas(h_zip)

In [5]:
# Load State Boundaries
states = gpd.read_file("USA_States.csv", geometry='WKT', crs='epsg:4326')
d_states = cuspatial.from_geopandas(states)

In [6]:
class QuadTree:
    """Helper class to use cuspatial quadtree interface
    """
    def __init__(self,
                 df,
                 x_column,
                 y_column,
                 x_min=None,
                 x_max=None,
                 y_min=None,
                 y_max=None,
                 scale = 100,
                 max_depth = 3,
                 min_size = 12):

        # Compare against None so that an explicit bound of 0 is respected
        self.x_min = df[x_column].min() if x_min is None else x_min
        self.x_max = df[x_column].max() if x_max is None else x_max
        self.y_min = df[y_column].min() if y_min is None else y_min
        self.y_max = df[y_column].max() if y_max is None else y_max
        
        self.scale = scale
        self.max_depth = max_depth
        self.min_size = min_size

        self.point_df = df
        self.x_column = x_column
        self.y_column = y_column
        
        self.polygon_point_mapping = None
        
        self.point_indices, self.quadtree = (
            cuspatial.quadtree_on_points(df[x_column],
                                         df[y_column],
                                         self.x_min,
                                         self.x_max,
                                         self.y_min,
                                         self.y_max,
                                         self.scale,
                                         self.max_depth,
                                         self.min_size))

    def set_polygon(self, df, poly_column, verbose=True):
        rng = trange if verbose else range

        polys = df[poly_column]

        parts = polys.polygons.part_offset
        rings = polys.polygons.ring_offset
        x = polys.polygons.x
        y = polys.polygons.y
        partials = []

        stride = 100

        geometries = cudf.Series(polys.polygons.geometry_offset)
        for start in rng(0, len(parts), stride):
            end = min(start + stride, len(parts))
            
            poly_bboxes = cuspatial.polygon_bounding_boxes(
                parts[start:end],
                rings,
                x,
                y
            )
            intersections = cuspatial.join_quadtree_and_bounding_boxes(
                self.quadtree, poly_bboxes, self.x_min, self.x_max, self.y_min, self.y_max, self.scale, self.max_depth
            )
            partial_polygon_point_mapping = cuspatial.quadtree_point_in_polygon(
                intersections,
                self.quadtree,
                self.point_indices,
                self.point_df[self.x_column],
                self.point_df[self.y_column],
                parts[start:end],
                rings,
                x,
                y
            )
            
            part_index = partial_polygon_point_mapping.polygon_index + start
            polygon_index = geometries.searchsorted(part_index, side="right")-1
            partial_polygon_point_mapping.polygon_index = polygon_index
            partials.append(partial_polygon_point_mapping)

        self.polygon_point_mapping = cudf.concat(partials, axis=0)
        self.polygon_point_mapping = self.polygon_point_mapping.reset_index(drop=True)
        
        idx_of_idx = self.point_indices.take(
            self.polygon_point_mapping.point_index
        ).reset_index(drop=True)
        
        self.polygon_point_mapping.point_index = idx_of_idx
        self.polygon_df = df

    def _subset_geodf(self, geodf, columns):
        res = cudf.DataFrame()
        for col in columns:
            res[col] = geodf[col]
        return res

    def points(self, columns = None):
        if self.polygon_point_mapping is None:
            raise ValueError("First set polygon dataframe.")
        
        if not columns:
            df = self.point_df
        else:
            df = self._subset_geodf(self.point_df, columns)

        if any(dtype == "geometry" for dtype in df.dtypes):
            df = cuspatial.GeoDataFrame(df)
        
        mapping = self.polygon_point_mapping
        res = df.iloc[mapping.point_index]
        res = res.reset_index(drop=True)
        res["polygon_index"] = mapping.polygon_index
        res["point_index"] = mapping.point_index
        return res

    def polygons(self, columns = None):
        if self.polygon_point_mapping is None:
            raise ValueError("First set polygon dataframe.")
        
        if not columns:
            df = self.polygon_df
        else:
            df = self._subset_geodf(self.polygon_df, columns)
        
        if any(dtype == "geometry" for dtype in df.dtypes):
            df = cuspatial.GeoDataFrame(df)
        
        mapping = self.polygon_point_mapping
        res = df.iloc[mapping.polygon_index]
        res = res.reset_index(drop=True)
        res["polygon_index"] = mapping.polygon_index
        res["point_index"] = mapping.point_index
        return res
    
    def point_left_join_polygon(self, point_columns=None, polygon_columns=None):
        points = self.points(point_columns)
        polygons = self.polygons(polygon_columns)
        joined = points.merge(polygons, on=["polygon_index", "point_index"], how="left")
        joined = joined.drop(["polygon_index", "point_index"], axis=1)
        return cuspatial.GeoDataFrame(joined)
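
In `set_polygon` above, the point-in-polygon result is indexed by polygon part, and `searchsorted(..., side="right") - 1` on the geometry offsets maps each part back to the (multi)polygon that owns it. A minimal CPU sketch of that lookup with the standard-library `bisect` module follows. The offsets are hypothetical toy values, and `part_to_geometry` is a name introduced only for this illustration.

```python
from bisect import bisect_right

# Hypothetical offsets: geometry 0 owns part 0, geometry 1 owns parts 1-2,
# geometry 2 owns part 3. The last entry is the total number of parts.
geometry_offset = [0, 1, 3, 4]

def part_to_geometry(part_index):
    """Return the index of the geometry whose part range contains part_index."""
    return bisect_right(geometry_offset, part_index) - 1

mapping = [part_to_geometry(p) for p in range(4)]
print(mapping)  # [0, 1, 1, 2]
```

Both parts of the multipolygon (parts 1 and 2) map to geometry 1, so points found in either part are attributed to the same row.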

## Filtering Zipcode by its Geometric Center

The zipcode dataset contains boundaries for all zip codes in the US. The cell below takes the geometric center of each zip code (the `INTPTLON10` and `INTPTLAT10` columns) and uses cuSpatial's quadtree interface to keep only the zip codes located in California.
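
To show why a quadtree helps, here is a simplified pure-Python sketch of the idea: points are bucketed into recursively subdivided boxes, and a polygon's bounding box is compared against leaf boxes to get a small candidate set before any exact point-in-polygon test. The parameter names `max_depth` and `min_size` mirror the helper class, but the splitting rule, the function names, and the toy data are assumptions for illustration and do not reproduce cuSpatial's internals.

```python
def build_quadtree(points, bbox, max_depth=3, min_size=2, depth=0):
    """Return a list of (bbox, points) leaves; bbox is (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = bbox
    # Stop splitting at the depth limit or when the node holds few points.
    if depth == max_depth or len(points) <= min_size:
        return [(bbox, points)]
    x_mid = (x_min + x_max) / 2
    y_mid = (y_min + y_max) / 2
    quadrants = {}
    for x, y in points:
        key = (x >= x_mid, y >= y_mid)  # (east?, north?)
        quadrants.setdefault(key, []).append((x, y))
    leaves = []
    for (east, north), pts in quadrants.items():
        child = (
            x_mid if east else x_min,
            y_mid if north else y_min,
            x_max if east else x_mid,
            y_max if north else y_mid,
        )
        leaves.extend(build_quadtree(pts, child, max_depth, min_size, depth + 1))
    return leaves

def boxes_overlap(a, b):
    """True if two (x_min, y_min, x_max, y_max) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def candidates(leaves, query_bbox):
    """Points in leaves overlapping query_bbox (a superset of the true matches)."""
    return [p for box, pts in leaves if boxes_overlap(box, query_bbox) for p in pts]

# Hypothetical toy points and a polygon bounding box in the upper-right corner.
points = [(1, 1), (1.5, 1.2), (2, 3), (6, 6), (7, 7), (7.5, 6.5), (3, 7)]
leaves = build_quadtree(points, bbox=(0, 0, 8, 8))
found = candidates(leaves, query_bbox=(5, 5, 8, 8))
print(sorted(found))  # [(6, 6), (7, 7), (7.5, 6.5)]
```

Only three of the seven points survive the bounding-box filter, so the exact test runs on far fewer point and polygon pairs.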

In [7]:
# Use quadtree to filter zip codes

# Build a point quadtree using the geometric center of the zip code region
zipcode_quadtree = QuadTree(d_zip, x_column="INTPTLON10", y_column="INTPTLAT10")

# Pass boundary
zipcode_quadtree.set_polygon(d_states, poly_column='geometry')

# Join state and zip code boundaries
zipcode_by_state = zipcode_quadtree.point_left_join_polygon(["WKT", "ZCTA5CE10"], ["STUSPS"])

# Get Californian zipcodes
CA_zipcode = zipcode_by_state[zipcode_by_state.STUSPS == 'CA']


In [8]:
len(CA_zipcode), len(d_stops)

(1762, 467666)

## Join stop signs dataset with California Zipcode boundaries

The cell below joins the stop sign dataset (467,666 points) with the zip code boundaries in California (1,762 zip codes).

In [9]:
# Build a 2nd quadtree with all stop signs in US
stop_quadtree = QuadTree(d_stops, x_column='x', y_column='y')

# Pass zip code polygons
stop_quadtree.set_polygon(CA_zipcode, poly_column="WKT")

# Join the stop signs and the zip code dataframe
stop_by_zipcode = stop_quadtree.point_left_join_polygon(["x", "y"], ["ZCTA5CE10"])


In [10]:
stop_by_zipcode.head()

| | x | y | ZCTA5CE10 |
|---|---|---|---|
| 0 | -118.438872 | 34.175839 | 91401 |
| 1 | -118.438872 | 34.17568 | 91401 |
| 2 | -118.442066 | 34.176668 | 91401 |
| 3 | -118.442298 | 34.176667 | 91401 |
| 4 | -118.442066 | 34.177578 | 91401 |


## Zipcode counting with cuDF

The cell below uses [cuDF](https://docs.rapids.ai/api/cudf/stable/index.html) to count the number of stop signs per zip code, then merges in the geometry information from the zipcode dataset.
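
The count-then-left-merge pattern can be sketched with the standard library. Zip codes that contain no stop signs never appear in the grouped counts, so the left merge is what restores them with a count of 0. The zip code lists below are hypothetical toy data, and `Counter` stands in for the cuDF `groupby`.

```python
from collections import Counter

# Hypothetical join output: one zip code per matched stop sign.
joined_zipcodes = ["91401", "91401", "94501", "91401", "94501"]
# Hypothetical full list of zip codes, including one with no stop signs.
all_zipcodes = ["91401", "94501", "96121"]

# Equivalent of groupby(...).count(): only zip codes with matches appear.
stop_counts = Counter(joined_zipcodes)

# Equivalent of a left merge followed by fillna(0): every zip code keeps a row.
stop_counts_and_bounds = {z: stop_counts.get(z, 0) for z in all_zipcodes}
print(stop_counts_and_bounds)  # {'91401': 3, '94501': 2, '96121': 0}
```

This is why the grouped result in the notebook has fewer entries (1338) than the merged dataframe (1762).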

In [11]:
# Count the Stop Signs by California Zip Codes
stop_counts = stop_by_zipcode.groupby("ZCTA5CE10").x.count().rename("stop_count")
stop_counts.head(), len(stop_counts)

(ZCTA5CE10
 90034    399
 94501    128
 92114     86
 93543     17
 90502      2
 Name: stop_count, dtype: int32,
 1338)

In [12]:
# Fetch the polygon boundaries
stop_counts_and_bounds = cuspatial.GeoDataFrame(CA_zipcode.merge(stop_counts, on="ZCTA5CE10", how="left"))
stop_counts_and_bounds["stop_count"] = stop_counts_and_bounds["stop_count"].astype("int").fillna(0)
print("DataFrame Size: ", len(stop_counts_and_bounds))

DataFrame Size:  1762


In [13]:
# Move dataframe to host for visualization
host_df = stop_counts_and_bounds.to_geopandas()
host_df.head()

| | WKT | ZCTA5CE10 | STUSPS | stop_count |
|---|---|---|---|---|
| 25407 | POLYGON ((-117.80192 34.05094, -117.80168 34.0... | 91766 | CA | 27 |
| 25538 | POLYGON ((-123.28057 41.59464, -123.27830 41.6... | 96032 | CA | 2 |
| 25545 | POLYGON ((-122.71085 41.99296, -122.71084 41.9... | 96044 | CA | 13 |
| 25547 | POLYGON ((-122.90051 40.35854, -122.90048 40.3... | 96047 | CA | 3 |
| 25406 | MULTIPOLYGON (((-114.55600 32.78118, -114.5560... | 92222 | CA | 0 |


In [14]:
host_df = host_df.rename({"WKT": "geometry"}, axis=1)

In [15]:
host_df

| | geometry | ZCTA5CE10 | STUSPS | stop_count |
|---|---|---|---|---|
| 25407 | POLYGON ((-117.80192 34.05094, -117.80168 34.0... | 91766 | CA | 27 |
| 25538 | POLYGON ((-123.28057 41.59464, -123.27830 41.6... | 96032 | CA | 2 |
| 25545 | POLYGON ((-122.71085 41.99296, -122.71084 41.9... | 96044 | CA | 13 |
| 25547 | POLYGON ((-122.90051 40.35854, -122.90048 40.3... | 96047 | CA | 3 |
| 25406 | MULTIPOLYGON (((-114.55600 32.78118, -114.5560... | 92222 | CA | 0 |
| ... | ... | ... | ... | ... |
| 19532 | POLYGON ((-122.26230 37.83786, -122.26218 37.8... | 94618 | CA | 115 |
| 19533 | POLYGON ((-122.21027 37.78804, -122.20993 37.7... | 94619 | CA | 217 |
| 19214 | POLYGON ((-120.46465 40.21273, -120.46454 40.2... | 96121 | CA | 0 |
| 19524 | POLYGON ((-122.19735 37.77162, -122.19728 37.7... | 94605 | CA | 894 |


## Visualization

Now, we visualize the stop sign count results using [PyDeck](https://deckgl.readthedocs.io/en/latest/index.html).
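
The view state in the cell that follows carries a comment giving California's geographic center in degrees and decimal minutes, whereas `pdk.ViewState` takes latitude and longitude in decimal degrees. A minimal sketch of the conversion, using a hypothetical helper named `dm_to_decimal`:

```python
def dm_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and decimal minutes to signed decimal degrees."""
    value = degrees + minutes / 60.0
    # West longitudes and south latitudes are negative.
    return -value if hemisphere in ("W", "S") else value

lon = dm_to_decimal(120, 4.9, "W")
lat = dm_to_decimal(36, 57.9, "N")
print(round(lat, 5), round(lon, 5))  # 36.965 -120.08167
```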

In [16]:
# Visualize the Dataset

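# Geographic center of CA: 120°4.9'W 36°57.9'N (about -120.08167, 36.965).
# Note: the view below is centered at (-118.08167, 33.965), not at that point.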
view_state = pdk.ViewState(
    **{"latitude": 33.96500, "longitude": -118.08167, "zoom": 6, "maxZoom": 16, "pitch": 95, "bearing": 0}
)

gpd_layer = pdk.Layer(
        "GeoJsonLayer",
        data=host_df[["geometry", "stop_count", "ZCTA5CE10"]],
        get_polygon="geometry",
        get_elevation="stop_count",
        extruded=True,
        elevation_scale=50,
        get_fill_color=[227,74,51],
        get_line_color=[255, 255, 255],
        auto_highlight=True,
        filled=True,
        wireframe=True,
        pickable=True
    )

tooltip = {"html": "<b>Stop Sign Count:</b> {stop_count} <br> <b>ZipCode:</b> {ZCTA5CE10}"}

r = pdk.Deck(
    gpd_layer,
    initial_view_state=view_state,
    map_style=pdk.map_styles.LIGHT,
    tooltip=tooltip,
)

# Time consuming since it contains ~1700 polygons
r.to_html("geopandas_layer.html", notebook_display=False)

### Open geopandas_layer.html to see the visualization result

![stop_per_state_map](stop_states.png)