# cuSpatial API demo
GTC April 2023 Michael Wang and Thomson Comer


## Data

[National Address Database](https://nationaladdressdata.s3.amazonaws.com/NAD_r12_TXT.zip)

[NYC Taxi Boroughs](https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip)

[taxi2015.csv](https://rapidsai-data.s3.us-east-2.amazonaws.com/viz-data/nyc_taxi.tar.gz)

The following notebook demonstrates the use of cuSpatial to perform analytics using large datasets.

The structure of the notebook is as follows:
1. Imports
1. Read datasets: National Address Database (NAD), NYC Taxi Boroughs Polygons, 2015 NYC Taxi pickup/dropoff information with lon/lat. Also convert epsg:2263 (NYC Long Island) to WGS.
1. Convert separate lon/lat columns in DataFrames into cuspatial.GeoSeries
1. Compute number of addresses and pickups in each borough
1. Compute addresses for each pickup in one borough

A drawing of an addresses table and a pickups table, with a line connecting two rows together and
adding the address where it belongs in the pickups table.

In [None]:
import cudf
import cuspatial
import geopandas
import cupy as cp
import pandas as pd
cudf.set_option("spill", True) 

<center><img src="https://www.dropbox.com/s/b7zmjlxnrtgqdwn/zones.png?dl=1" width=400></center>
zero
image of national address database points

I/O
 
- National Address Database (NAD): 
- NYC Taxi Zones Shapefile (zones)
- NYC 2015 Taxi Pickups and Dropoffs with Lon/Lat Coords (taxi2015)

In [None]:
# I/O (18GB NAD, 265 borough polygons, 13m taxi pickups and dropoffs.
NAD = cudf.read_csv('NAD_r11.txt', usecols=[
    'State',
    'Longitude',
    'Latitude',
])
NAD = NAD[NAD['State'] == 'NY']
NAD_Street = cudf.read_csv('NAD_r11.txt', usecols=[
    'State',
    'StN_PreDir',
    'StreetName',
    'StN_PosTyp',
    'Add_Number',
])
NAD_Street = NAD_Street[NAD_Street['State'] == 'NY']
# Read taxi_zones.zip shapefile with GeoPandas, then convert to epsg:4326 for lon/lat
host_zones = geopandas.read_file('taxi_zones.zip')
host_lonlat = host_zones.to_crs(epsg=4326)
zones = cuspatial.from_geopandas(host_lonlat)
zones.set_index(zones['OBJECTID'], inplace=True)
taxi2015 = cudf.read_csv('taxi2015.csv')

`make_geoseries_from_lon_lat`
<center><img src="https://www.dropbox.com/s/pp75u59z5uxwrlz/table-to-geoseries.png?dl=1" width=500></center>

In [None]:
# Utility function to convert dataframes into GeoSeries

def make_geoseries_from_lon_lat(lon, lat):
    # Scatter the two columns into one column
    assert len(lon) == len(lat)
    xy = cudf.Series(cp.zeros(len(lon) * 2))
    xy[::2] = lon
    xy[1::2] = lat

    return cuspatial.GeoSeries(cuspatial.core._column.geocolumn.GeoColumn._from_points_xy(xy._column))

In [None]:
# Convert DataFrames to GeoSeries

pickups = make_geoseries_from_lon_lat(
    taxi2015['pickup_longitude'],
    taxi2015['pickup_latitude']
)
addresses = make_geoseries_from_lon_lat(
    NAD['Longitude'],
    NAD['Latitude']
)

In [None]:
borough_addresses = zones['geometry'].contains_properly(addresses, allpairs=True)
display(borough_addresses)

Unnamed: 0,polygon_index,point_index
20736,1,5648100
20737,1,5648101
20738,2,5202801
20739,2,5202802
20740,2,5202803
...,...,...
966784,262,5368821
966785,262,5368822
966786,262,5368823
966787,262,5368824


In [None]:
borough_pickups = zones['geometry'].iloc[0:50].contains_properly(pickups, allpairs=True)
display(borough_pickups)

# You can do it one of two ways: .contains_properly, or write the pip yourself.

Unnamed: 0,polygon_index,point_index
15008,0,44084
15009,0,76169
15010,0,129737
15011,0,177939
15012,0,219859
...,...,...
1055577,49,12748299
1055578,49,12748338
1055579,49,12748431
1055580,49,12748781


In [180]:
# Add pickup and address counts to zones dataframe

zones["pickup_count"] = borough_pickups.groupby('polygon_index').count()
zones["address_count"] = borough_addresses.groupby('polygon_index').count()
# Add to taxi2015 dataframe for demo display

TypeError: data type 'geometry' not understood

TypeError: data type 'geometry' not understood

          OBJECTID  Shape_Leng  Shape_Area                     zone  \
OBJECTID                                                              
1                1    0.116357    0.000782           Newark Airport   
2                2    0.433470    0.004866              Jamaica Bay   
3                3    0.084341    0.000314  Allerton/Pelham Gardens   
4                4    0.043567    0.000112            Alphabet City   
5                5    0.092146    0.000498            Arden Heights   
...            ...         ...         ...                      ...   
21              21    0.115974    0.000380         Bensonhurst East   
22              22    0.126170    0.000472         Bensonhurst West   
23              23    0.290556    0.002196  Bloomfield/Emerson Hill   
24              24    0.047000    0.000061             Bloomingdale   
25              25    0.047146    0.000124              Boerum Hill   

          LocationID        borough  \
OBJECTID                             

# Cartesian product via tiling

<center><img src="https://www.dropbox.com/s/wlcr9fugq79nyut/tiled-cartesian-product.png?dl=1" width=650></center>

In [None]:
BOROUGH_ID = 12

# Let's make two GeoSeries: For each borough, create a GeoSeries with all address Points
# repeated the number of times there are pickups in that borough, and another GeoSeries with
# the opposite: all pickups Points repeated the number of times there are addresses in that
# borough.

# addresses tiled
borough_address_point_ids = borough_addresses['point_index'][borough_addresses['polygon_index'] == BOROUGH_ID]
pickups_count = len(borough_pickups[borough_pickups['polygon_index'] == BOROUGH_ID])
addresses_tiled = NAD.iloc[
    borough_address_point_ids
].tile(pickups_count)

# pickups tiled
borough_pickup_point_ids = borough_pickups['point_index'][borough_pickups['polygon_index'] == BOROUGH_ID]
addresses_count = len(borough_addresses[borough_addresses['polygon_index'] == BOROUGH_ID])
pickups_tiled = taxi2015[[
    'pickup_longitude',
    'pickup_latitude'
]].iloc[
    borough_pickup_point_ids
].tile(addresses_count)

pickup_points = make_geoseries_from_lon_lat(
    pickups_tiled['pickup_longitude'],
    pickups_tiled['pickup_latitude']
)
address_points = make_geoseries_from_lon_lat(
    addresses_tiled['Longitude'],
    addresses_tiled['Latitude']
)
len(address_points)

11081124

<center><img src="https://www.dropbox.com/s/30rntm6p67mw96c/pairwise_point_distance.png?dl=1" width=550></center>

In [None]:
# get the list of addresses and their indices that are closest to a pickup point

haversines = cuspatial.haversine_distance(
    pickup_points.points.x,
    pickup_points.points.y,
    address_points.points.x,
    address_points.points.y
)

gb_df = cudf.DataFrame({
    'address': addresses_tiled.index,
    'pickup': pickups_tiled.index,
    'distance': haversines
})

address_indices_of_nearest = gb_df[['address', 'distance']].groupby('address').idxmin()
pickup_indices_of_nearest = gb_df[['pickup', 'distance']].groupby('pickup').idxmin()
address_pickup_minimum_correspondence = gb_df.loc[address_indices_of_nearest['distance']]

# We're almost there

### We have the index of the addresses and their pickups

In [131]:
nearest_pickups = taxi2015.iloc[address_pickup_minimum_correspondence['pickup']]
nearest_addresses_lonlat = NAD.loc[address_pickup_minimum_correspondence['address']]
# display(nearest_pickups)
# display(nearest_addresses_lonlat)


In [None]:
# concatenate address fields

def build_address_string(NAD_Street):
    blanks = cudf.Series([' '] * len(NAD_Street))
    blanks.index = NAD_Street.index
    NAD_Street['StN_PreDir'] = NAD_Street['StN_PreDir'].fillna('')
    NAD_Street['StN_PosTyp'] = NAD_Street['StN_PosTyp'].fillna('')
    street_names = NAD_Street['Add_Number'].astype('str').str.cat(
        blanks
    ).str.cat(
        NAD_Street['StN_PreDir']
    ).str.cat(
        blanks
    ).str.cat(
        NAD_Street['StreetName']
    ).str.cat(
        blanks
    ).str.cat(
        NAD_Street['StN_PosTyp']
    )
    return street_names.str.replace('  ', ' ')

nearest_addresses_street_name = NAD_Street.loc[address_pickup_minimum_correspondence['address']]
street_names = build_address_string(nearest_addresses_street_name)

# Last Step

In [None]:
# Attach the street names to the original pickups dataframe

no_index = nearest_pickups.reset_index()
no_index['address'] = street_names.reset_index(drop=True)
no_index['distance'] = gb_df['distance']
taxi_pickups_with_address = no_index.set_index(no_index['index'])
taxi_pickups_with_address.drop('index', inplace=True, axis=1)
display(taxi_pickups_with_address[[
    'VendorID',
    'tpep_pickup_datetime',
    'passenger_count',
    'trip_distance',
    'distance',
    'pickup_longitude',
    'pickup_latitude',
    'fare_amount',
    'tip_amount',
    'address'
]])
display(taxi_pickups_with_address[[
    'pickup_latitude',
    'pickup_longitude',
    'address'
]])

Unnamed: 0_level_0,VendorID,tpep_pickup_datetime,passenger_count,trip_distance,distance,pickup_longitude,pickup_latitude,fare_amount,tip_amount,address
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1567610,2,2015-01-15 20:09:13,1,2.68,0.405943,-74.014481,40.712929,11.5,2.40,100 West Street
926297,1,2015-01-02 16:13:01,2,3.80,0.457179,-74.017296,40.710136,14.7,3.30,328 Albany Street
4705561,1,2015-01-18 13:57:29,1,14.00,0.815652,-74.016747,40.714275,39.0,4.00,1 North End Avenue
1826002,1,2015-01-05 17:51:49,1,3.70,0.423738,-74.017555,40.710224,12.0,5.70,332 Albany Street
7804656,1,2015-01-11 19:38:00,1,6.50,0.335372,-74.016289,40.709743,21.5,2.00,250 South End Avenue
...,...,...,...,...,...,...,...,...,...,...
3580781,1,2015-01-29 22:50:43,1,1.20,1.013655,-74.015305,40.714058,8.0,1.86,230 Vesey Street
11620349,1,2015-01-15 09:46:42,2,4.70,1.024640,-74.015724,40.714394,24.0,4.95,4 World Financial Center
6985477,2,2015-01-30 09:05:58,1,0.95,0.182268,-74.016220,40.711037,6.0,1.00,345 South End Avenue
6747208,2,2015-01-29 14:56:25,1,2.20,0.532995,-74.017464,40.707241,9.5,2.06,70 Battery Place


3446.0116925239563


Unnamed: 0_level_0,pickup_latitude,pickup_longitude,address
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1567610,40.712929,-74.014481,100 West Street
926297,40.710136,-74.017296,328 Albany Street
4705561,40.714275,-74.016747,1 North End Avenue
1826002,40.710224,-74.017555,332 Albany Street
7804656,40.709743,-74.016289,250 South End Avenue
...,...,...,...
3580781,40.714058,-74.015305,230 Vesey Street
11620349,40.714394,-74.015724,4 World Financial Center
6985477,40.711037,-74.016220,345 South End Avenue
6747208,40.707241,-74.017464,70 Battery Place


# Use cuXfilter to display these coordinates

In [181]:
import cuxfilter
from bokeh import palettes
from cuxfilter.layouts import feature_and_double_base

from pyproj import Proj, Transformer

combined_pickups_and_addresses = cudf.concat([
    nearest_pickups[['pickup_longitude', 'pickup_latitude']].rename(
        columns={
            'pickup_longitude': 'Longitude',
            'pickup_latitude': 'Latitude'
        }
    ),
    nearest_addresses_lonlat[['Longitude', 'Latitude']]], axis=0
)
combined_pickups_and_addresses['color'] = cp.repeat(cp.array([1, 2]), len(
    combined_pickups_and_addresses
)//2)
"""
combined_pickups_and_addresses = taxi2015
combined_pickups_and_addresses['color'] = cp.repeat(cp.array([1, 2]), len(
    combined_pickups_and_addresses) //2)
combined_pickups_and_addresses = combined_pickups_and_addresses.rename(
    columns={
        'pickup_longitude': 'Longitude',
        'pickup_latitude': 'Latitude'
    }
)
"""
# Back to NYC CRS for display
transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
combined_pickups_and_addresses['location_x'], combined_pickups_and_addresses['location_y'] = transform_4326_to_3857.transform(
    combined_pickups_and_addresses['Latitude'].values_host, combined_pickups_and_addresses['Longitude'].values_host
)

In [182]:
cux_df = cuxfilter.DataFrame.from_dataframe(combined_pickups_and_addresses)
chart1 = cuxfilter.charts.scatter(
    x='location_x',
    y='location_y',
    point_size=20,
    color_palette=["#3182dd", "#c2314d"],
    aggregate_col="color", aggregate_fn="mean",
    pixel_shade_type='log', legend_position='top_right',
    unselected_alpha=0.2,
    tile_provider="CartoLight", x_range=(-8239910.23,-8229529.24), y_range=(4968481.34,4983152.92),
)
d = cux_df.dashboard([chart1],  theme=cuxfilter.themes.dark, title= 'NYC TAXI DATASSET')

In [183]:
chart1.view()