# Stop Sign Counting By Zipcode in California

This notebook demonstrates how to perform GeoDataFrame joins using point-in-polygon tests in cuSpatial.

## Prerequisite: Datasets

Datasets used:
1. Stops (Signs and Stop lines) dataset from OpenStreetMap
2. USA ZipCode boundaries from US Census Bureau
3. USA States boundaries from OpenStreetMap

- OpenStreetMap is open data, licensed under [Open Data Commons Open Database License (ODbL)](https://opendatacommons.org/licenses/odbl/) by the [OpenStreetMap Foundation (OSMF)](https://wiki.osmfoundation.org/wiki/Main_Page).
- US Census Bureau data is open and free to use: https://www.census.gov/about/policies/open-gov/open-data.html
- "TIGER/Line Shapefile, 2019, 2010 nation, U.S., 2010 Census 5-Digit ZIP Code Tabulation Area (ZCTA5) National, Metadata Updated: November 1, 2022." Accessed March xx, 2023. https://catalog.data.gov/dataset/tiger-line-shapefile-2019-2010-nation-u-s-2010-census-5-digit-zip-code-tabulation-area-zcta5-na

Disclaimer: Each user is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.

In [1]:
# Download the datasets and save as:
# 1. USA_Stops_Vertices.csv
# 2. USA_Zipcodes_2019_Tiger.csv
# 3. USA_States.csv

!if [ ! -f "USA_States.csv" ]; then curl "https://data.rapids.ai/cuspatial/benchmark/USA_States.csv" -o USA_States.csv; else echo "USA_States.csv found"; fi
!if [ ! -f "USA_Stops_Vertices.csv" ]; then curl "https://data.rapids.ai/cuspatial/benchmark/USA_Stops_Vertices.csv" -o USA_Stops_Vertices.csv; else echo "USA_Stops_Vertices.csv found"; fi
!if [ ! -f "USA_Zipcodes_2019_Tiger.csv" ]; then curl "https://data.rapids.ai/cuspatial/benchmark/USA_Zipcodes_2019_Tiger.csv" -o USA_Zipcodes_2019_Tiger.csv; else echo "USA_Zipcodes_2019_Tiger.csv found"; fi

USA_States.csv found
USA_Stops_Vertices.csv found
USA_Zipcodes_2019_Tiger.csv found


## Overview

Stop signs are an important symbol of city and road development in geographical information systems. This notebook processes all stop sign locations in the dataset and uses spatial joins to assign them to the zipcode boundaries within California. It performs the following steps:

1. Filters the zipcode boundaries located in California with a spatial join.
2. Filters the stop signs located in the California zipcodes with a spatial join.
3. Counts the stop signs by zipcode.
4. Visualizes the result on a map.
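
Both spatial joins in the steps above rest on a point-in-polygon test. As a rough illustration of that test, here is a minimal CPU-only sketch using the even-odd ray-casting rule. It is not cuSpatial's implementation; the function name `point_in_polygon` and the unit-square data are hypothetical and exist only for this example.

```python
def point_in_polygon(x, y, ring):
    """Return True if (x, y) lies inside the ring, a list of (x, y) vertices.

    Even-odd ray casting: count how many edges a ray going in the +x
    direction crosses. Points exactly on an edge get no special handling.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical unit square standing in for a zipcode boundary
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
points = [(0.5, 0.5), (1.5, 0.5), (0.25, 0.75)]
flags = [point_in_polygon(px, py, square) for px, py in points]
print(flags)
```

Here the first and third points fall inside the square and the second falls outside. The notebook runs the equivalent test on the GPU for every candidate point and polygon pair.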

In [2]:
# Import Necessary Packages
import os
import cuspatial, cudf
import pandas as pd
import geopandas as gpd
import numpy as np
import cupy as cp
from shapely import wkt
import pydeck as pdk

In [3]:
# Root folder for datasets
DATASET_ROOT = "./"

def path_of(dataset):
    return os.path.join(DATASET_ROOT, dataset)

## Load Dataset and Cleanup

We load the datasets and store them as cuSpatial device dataframes. Note that the zipcode dataset is first loaded with cuDF and then moved to the host, where geopandas and shapely parse the WKT (Well-Known Text) strings into shapely objects. This parsing is a slow step performed on the CPU and requires data transfer between device and host.

In [4]:
# Import Stop Sign CSV
d_stops = cudf.read_csv(path_of("USA_Stops_Vertices.csv"), usecols=["x", "y"])

In [5]:
# Import CSV of Zipcodes
d_zip = cudf.read_csv(
    path_of("USA_Zipcodes_2019_Tiger.csv"),
    usecols=["WKT", "ZCTA5CE10", "INTPTLAT10", "INTPTLON10"])
d_zip.INTPTLAT10 = d_zip.INTPTLAT10.astype("float")
d_zip.INTPTLON10 = d_zip.INTPTLON10.astype("float")

In [6]:
# Load WKT as shapely objects
h_zip = d_zip.to_pandas()
h_zip["WKT"] = h_zip["WKT"].apply(wkt.loads)
h_zip = gpd.GeoDataFrame(h_zip, geometry="WKT", crs='epsg:4326')

# Transfer back to GPU with cuSpatial
d_zip = cuspatial.from_geopandas(h_zip)

In [7]:
# Load State Boundaries
states = gpd.read_file(path_of("USA_States.csv"), geometry='WKT', crs='epsg:4326')
d_states = cuspatial.from_geopandas(states)

In [8]:
class QuadTree:
    """Helper class to use cuspatial quadtree interface
    """
    def __init__(self,
                 df,
                 x_column,
                 y_column,
                 x_min=None,
                 x_max=None,
                 y_min=None,
                 y_max=None,
                 scale = -1,
                 max_depth = 15,
                 min_size = 12):

        # Compare against None so that an explicit bound of 0 is respected
        self.x_min = df[x_column].min() if x_min is None else x_min
        self.x_max = df[x_column].max() if x_max is None else x_max
        self.y_min = df[y_column].min() if y_min is None else y_min
        self.y_max = df[y_column].max() if y_max is None else y_max
        
        self.scale = scale
        self.max_depth = max_depth
        self.min_size = min_size

        self.point_df = df
        self.x_column = x_column
        self.y_column = y_column
        
        self.polygon_point_mapping = None
        
        self.d_points = cuspatial.GeoSeries.from_points_xy(
            cudf.DataFrame({"x": df[x_column], "y": df[y_column]}
        ).interleave_columns())
        
        self.point_indices, self.quadtree = (
            cuspatial.quadtree_on_points(self.d_points,
                                         self.x_min,
                                         self.x_max,
                                         self.y_min,
                                         self.y_max,
                                         self.scale,
                                         self.max_depth,
                                         self.min_size))

    def set_polygon(self, df, poly_column):
        polys = df[poly_column]

        # Offsets delimiting the parts and rings of each polygon
        parts = polys.polygons.part_offset
        rings = polys.polygons.ring_offset
        
        single_polys = cuspatial.GeoSeries.from_polygons_xy(
            polys.polygons.xy, rings, parts, cp.arange(len(parts))
        )
        
        geometries = cudf.Series(polys.polygons.geometry_offset)
            
        poly_bboxes = cuspatial.polygon_bounding_boxes(single_polys)
        intersections = cuspatial.join_quadtree_and_bounding_boxes(
            self.quadtree, poly_bboxes, self.x_min, self.x_max, self.y_min, self.y_max, self.scale, self.max_depth
        )
        polygon_point_mapping = cuspatial.quadtree_point_in_polygon(
            intersections,
            self.quadtree,
            self.point_indices,
            self.d_points,
            single_polys
        )

        # Update Polygon Index to MultiPolygon Index
        polygon_index = geometries.searchsorted(polygon_point_mapping.polygon_index, side="right")-1
        polygon_point_mapping.polygon_index = polygon_index

        self.polygon_point_mapping = polygon_point_mapping.reset_index(drop=True)
        
        # Remap point indices
        idx_of_idx = self.point_indices.take(
            self.polygon_point_mapping.point_index
        ).reset_index(drop=True)
        self.polygon_point_mapping.point_index = idx_of_idx

        self.polygon_df = df

    def _subset_geodf(self, geodf, columns):
        res = cudf.DataFrame()
        for col in columns:
            res[col] = geodf[col]
        return res

    def points(self, columns = None):
        if self.polygon_point_mapping is None:
            raise ValueError("First set polygon dataframe.")
        
        if not columns:
            df = self.point_df
        else:
            df = self._subset_geodf(self.point_df, columns)

        if any(dtype == "geometry" for dtype in df.dtypes):
            df = cuspatial.GeoDataFrame(df)
        
        mapping = self.polygon_point_mapping
        res = df.iloc[mapping.point_index]
        res = res.reset_index(drop=True)
        res["polygon_index"] = mapping.polygon_index
        res["point_index"] = mapping.point_index
        return res

    def polygons(self, columns = None):
        if self.polygon_point_mapping is None:
            raise ValueError("First set polygon dataframe.")
        
        if not columns:
            df = self.polygon_df
        else:
            df = self._subset_geodf(self.polygon_df, columns)
        
        if any(dtype == "geometry" for dtype in df.dtypes):
            df = cuspatial.GeoDataFrame(df)
        
        mapping = self.polygon_point_mapping
        res = df.iloc[mapping.polygon_index]
        res = res.reset_index(drop=True)
        res["polygon_index"] = mapping.polygon_index
        res["point_index"] = mapping.point_index
        return res
    
    def point_left_join_polygon(self, point_columns=None, polygon_columns=None):
        points = self.points(point_columns)
        polygons = self.polygons(polygon_columns)
        joined = points.merge(polygons, on=["polygon_index", "point_index"], how="left")
        joined = joined.drop(["polygon_index", "point_index"], axis=1)
        return cuspatial.GeoDataFrame(joined)
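
The "Update Polygon Index to MultiPolygon Index" step in `set_polygon` uses `searchsorted(..., side="right") - 1` on the geometry offsets. The sketch below shows the same lookup on the CPU with the standard library's `bisect_right`. The offsets are made-up values and `multipolygon_index` is a hypothetical helper, so treat this as an illustration of the idea rather than the notebook's code.

```python
from bisect import bisect_right

# Hypothetical geometry offsets: multipolygon i owns the single polygons
# numbered geometry_offset[i] through geometry_offset[i + 1] - 1.
# Here: multipolygon 0 -> polygons 0, 1; 1 -> polygon 2; 2 -> polygons 3, 4, 5.
geometry_offset = [0, 2, 3, 6]


def multipolygon_index(polygon_index):
    """Map a single-polygon index back to the multipolygon that contains it."""
    # bisect_right finds the first offset strictly greater than the index;
    # the owning multipolygon is the entry just before it.
    return bisect_right(geometry_offset, polygon_index) - 1


mapped = [multipolygon_index(p) for p in range(6)]
print(mapped)
```

This remapping matters because a state or zipcode stored as a multipolygon is split into single polygons before the point-in-polygon test, and the results need to be attributed back to the original rows.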

## Filtering Zipcode by its Geometric Center

The zipcode dataset contains boundaries for all zipcodes in the US. The cell below takes the geometric center of each zipcode (encoded in the `INTPTLON10` and `INTPTLAT10` columns) and uses cuSpatial's quadtree interface to keep only the zipcodes located in California.

In [9]:
# Use quadtree to filter zip codes

# Build a point quadtree using the geometric center of the zip code region
zipcode_quadtree = QuadTree(d_zip, x_column="INTPTLON10", y_column="INTPTLAT10")

# Pass boundary
zipcode_quadtree.set_polygon(d_states, poly_column='geometry')

# Join state and zip code boundaries
zipcode_by_state = zipcode_quadtree.point_left_join_polygon(["WKT", "ZCTA5CE10"], ["STUSPS"])

# Get Californian zipcodes
CA_zipcode = zipcode_by_state[zipcode_by_state.STUSPS == 'CA']



In [10]:
len(CA_zipcode), len(d_zip)

(1762, 33144)

Of the 33,144 zipcodes in the dataset, 1,762 are located in California.

## Join stop signs dataset with California Zipcode boundaries

The cell below joins the stop sign dataset (460K data points) with all zipcode boundaries in California (1,762 polygons).
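
Before running the exact point-in-polygon test, the `QuadTree` helper defined earlier computes polygon bounding boxes and joins them against the quadtree. The following CPU-only sketch shows why a bounding-box pass is a useful prefilter. It leaves out the quadtree entirely, and the helpers `bounding_box` and `candidate_pairs` and the sample shapes are hypothetical names chosen for this example.

```python
def bounding_box(ring):
    """Return (x_min, y_min, x_max, y_max) for a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def candidate_pairs(points, polygons):
    """Pair each point with every polygon whose bounding box contains it."""
    bboxes = [bounding_box(ring) for ring in polygons]
    pairs = []
    for point_index, (x, y) in enumerate(points):
        for polygon_index, (x_min, y_min, x_max, y_max) in enumerate(bboxes):
            if x_min <= x <= x_max and y_min <= y <= y_max:
                pairs.append((point_index, polygon_index))
    return pairs


# Hypothetical shapes standing in for zipcode boundaries
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
square = [(5.0, 5.0), (7.0, 5.0), (7.0, 7.0), (5.0, 7.0)]
points = [(1.0, 1.0), (3.0, 3.0), (6.0, 6.0), (10.0, 10.0)]

candidates = candidate_pairs(points, [triangle, square])
print(candidates)
```

The point `(3.0, 3.0)` is a candidate for the triangle because it lies inside the triangle's bounding box, even though it is outside the triangle itself. The cheap box check only narrows the search; the exact test still has to confirm each surviving pair.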

In [11]:
# Build a second quadtree with all stop signs in the US
stop_quadtree = QuadTree(d_stops, x_column='x', y_column='y')

# Pass zip code polygons
stop_quadtree.set_polygon(CA_zipcode, poly_column="WKT")

# Join the stop signs and the zip code dataframe
stop_by_zipcode = stop_quadtree.point_left_join_polygon(["x", "y"], ["ZCTA5CE10"])



In [12]:
stop_by_zipcode.head()

             x          y ZCTA5CE10
0  -121.858094  37.280787     95136
1  -121.856648  37.278295     95136
2  -121.855441  37.280375     95136
3  -121.856343  37.283195     95136
4  -121.856604  37.281005     95136


## Zipcode counting with cuDF

The cells below use [cuDF](https://docs.rapids.ai/api/cudf/stable/index.html) to count the number of stop signs per zipcode, then merge the counts with the geometry information from the zipcode dataset.
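
To make the semantics of this step concrete, here is a small CPU-only sketch of the same idea: count matches per zipcode, then left-join onto the full zipcode list so that zipcodes without any stop signs end up with a count of zero. The zipcode values and variable names are made up for illustration, and the standard library stands in for cuDF.

```python
from collections import Counter

# Hypothetical join output: one zipcode label per matched stop sign
stop_zipcodes = ["95136", "95136", "94901", "95136", "94901"]

# Hypothetical list of all zipcodes, including one with no stop signs
all_zipcodes = ["94901", "95136", "96161"]

# Group-by count: number of stop signs seen per zipcode
counts = Counter(stop_zipcodes)

# Left join onto all zipcodes, filling missing counts with 0
stop_count = {zipcode: counts.get(zipcode, 0) for zipcode in all_zipcodes}
print(stop_count)
```

The zero fill is what keeps every California zipcode in the final dataframe, which is why its size matches the number of California zipcodes rather than the number of zipcodes that contain at least one stop sign.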

In [13]:
# Count the Stop Signs by California Zip Codes
stop_counts = stop_by_zipcode.groupby("ZCTA5CE10").x.count().rename("stop_count")
stop_counts.head()

ZCTA5CE10
94901    131
94535    205
95112    103
95407    126
93933    205
Name: stop_count, dtype: int32

In [14]:
# Fetch the polygon boundaries
stop_counts_and_bounds = cuspatial.GeoDataFrame(CA_zipcode.merge(stop_counts, on="ZCTA5CE10", how="left"))
stop_counts_and_bounds["stop_count"] = stop_counts_and_bounds["stop_count"].fillna(0).astype("int")
print("DataFrame Size: ", len(stop_counts_and_bounds))

DataFrame Size:  1762


## Visualization

Now, we visualize the stop sign count results using [PyDeck](https://deckgl.readthedocs.io/en/latest/index.html).
Uncomment the cell below to run it:

In [15]:
# # Visualize the Dataset

# # Move dataframe to host for visualization
# host_df = stop_counts_and_bounds.to_geopandas()
# host_df = host_df.rename({"WKT": "geometry"}, axis=1)
# host_df.head()

# # Geo Center of CA: 120°4.9'W 36°57.9'N (the view below is centered on the Los Angeles area instead)
# view_state = pdk.ViewState(
#     **{"latitude": 33.96500, "longitude": -118.08167, "zoom": 6, "maxZoom": 16, "pitch": 95, "bearing": 0}
# )

# gpd_layer = pdk.Layer(
#         "GeoJsonLayer",
#         data=host_df[["geometry", "stop_count", "ZCTA5CE10"]],
#         get_polygon="geometry",
#         get_elevation="stop_count",
#         extruded=True,
#         elevation_scale=50,
#         get_fill_color=[227,74,51],
#         get_line_color=[255, 255, 255],
#         auto_highlight=True,
#         filled=True,
#         wireframe=True,
#         pickable=True
#     )

# tooltip = {"html": "<b>Stop Sign Count:</b> {stop_count} <br> <b>ZipCode:</b> {ZCTA5CE10}"}

# r = pdk.Deck(
#     gpd_layer,
#     initial_view_state=view_state,
#     map_style=pdk.map_styles.LIGHT,
#     tooltip=tooltip,
# )

# r.to_html("geopandas_layer.html", notebook_display=False)

### Open geopandas_layer.html to see the visualization result

![stop_per_state_map](https://github.com/isVoid/cuspatial/raw/notebook/zipcode_counting/notebooks/stop_states.png)