# Using cuSpatial to Correlate Taxi Data after a Format Change
In 2017, the NYC Taxi data switched from reporting pickup and dropoff locations as `lat/lon` coordinates to reporting one of 263 `LocationID`s.  While these `LocationID`s made it easier to determine region and borough information that was lacking in the previous datasets, they made it difficult to compare datasets before and after the transition.



By using cuSpatial `Points in Polygon` (PIP), we can quickly and easily map the latitude and longitude of the pre-2017 taxi dataset to the `LocationID`s of the 2017+ dataset.  In this notebook, we will show you how to do so.  cuSpatial 0.14 PIP only works on 31 polygons per call, so we will show how to process the larger 263-polygon shapefile in batches with minimal memory impact.  cuSpatial 0.15 will eliminate the 31-polygon limitation and provide a substantial additional speedup.

You may need a GPU with 16GB of memory or more.

## Imports

In [1]:
import cuspatial
import geopandas as gpd
import cudf
from numba import cuda
import numpy as np

## Download the Data
We're going to download the January NYC Taxi datasets for 2016 and 2017.  We also need the NYC Taxi Zones boundaries, which we download as GeoJSON.

In [2]:
!if [ ! -f "tzones_lonlat.json" ]; then curl "https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON" -o tzones_lonlat.json; else echo "tzones_lonlat.json found"; fi
!if [ ! -f "taxi2016.csv" ]; then curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv -o taxi2016.csv; else echo "taxi2016.csv found"; fi   
!if [ ! -f "taxi2017.csv" ]; then curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv -o taxi2017.csv; else echo "taxi2017.csv found"; fi

tzones_lonlat.json found
taxi2016.csv found
taxi2017.csv found


## Read in the Taxi Data with cuDF
Let's read in the pickups and dropoffs for 2016 and 2017.

In [3]:
taxi2016 = cudf.read_csv("taxi2016.csv")
taxi2017 = cudf.read_csv("taxi2017.csv")

Let's have a look at the columns in `taxi2016` and `taxi2017` to verify the difference.

In [4]:
set(taxi2017.columns).difference(set(taxi2016.columns))

{'DOLocationID', 'PULocationID'}

## Read in the Spatial Data with cuSpatial

With the `read_polygon_shapefile` function, cuSpatial loads polygons into a `cudf.Series` of _feature offsets_, a `cudf.Series` of _ring offsets_, and a `cudf.DataFrame` of `x` and `y` coordinates (which can be used for lon/lat, as in this case). Because we downloaded the zones as GeoJSON, we first use GeoPandas to write them out as a shapefile. We're working on more advanced I/O integrations and nested `Columns` this year.

In [5]:
tzones = gpd.GeoDataFrame.from_file('tzones_lonlat.json')
tzones.to_file('cu_taxi_zones.shp')

In [6]:
taxi_zones = cuspatial.read_polygon_shapefile('cu_taxi_zones.shp')
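
To make the three-part layout more concrete, here is a minimal pure-Python sketch of how feature offsets, ring offsets, and flat coordinate arrays fit together. The toy data and the `rings_of_feature` helper are hypothetical, and the sketch assumes each offset marks the start position of a feature's first ring or a ring's first point; it is an illustration of the idea, not cuSpatial's own code.

```python
# Hypothetical toy data: two features, three rings, stored as flat arrays.
# Feature 0 has two rings (an outer square and a hole); feature 1 has one ring.
feature_offsets = [0, 2]      # index of each feature's first ring
ring_offsets = [0, 5, 10]     # index of each ring's first point
xs = [0, 4, 4, 0, 0,  1, 2, 2, 1, 1,  10, 12, 12, 10, 10]
ys = [0, 0, 4, 4, 0,  1, 1, 2, 2, 1,  0, 0, 2, 2, 0]


def rings_of_feature(f):
    """Return the rings of feature f as lists of (x, y) tuples."""
    ring_start = feature_offsets[f]
    # The last feature runs to the end of the ring list.
    if f + 1 < len(feature_offsets):
        ring_end = feature_offsets[f + 1]
    else:
        ring_end = len(ring_offsets)
    rings = []
    for r in range(ring_start, ring_end):
        pt_start = ring_offsets[r]
        # The last ring runs to the end of the coordinate arrays.
        pt_end = ring_offsets[r + 1] if r + 1 < len(ring_offsets) else len(xs)
        rings.append(list(zip(xs[pt_start:pt_end], ys[pt_start:pt_end])))
    return rings


feature0 = rings_of_feature(0)
feature1 = rings_of_feature(1)
print(len(feature0), len(feature1))  # 2 1
```

Slicing the feature offsets, as the batched loop later in this notebook does, selects a subset of polygons while the ring offsets and coordinates stay shared.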

## Converting lon/lat coordinates to LocationIDs with cuSpatial
Looking at the taxi zones and the `taxi2016` data, you can see that we have:
- 10.09 million pickup locations
- 10.09 million dropoff locations
- 263 LocationID features
- 354 LocationID rings
- 98,192 LocationID coordinates

Now that we've collected the set of pickup locations and dropoff locations, we can use `cuspatial.point_in_polygon` to quickly determine which pickups and dropoffs occur in each taxi zone.

To do this in a memory-efficient way, instead of creating two massive 10.09 million x 263 arrays, we're going to use the 31-polygon limit to our advantage: we process the zones in batches and map the resulting true values in each batch's array to a new `PULocationID` and `DOLocationID`, matching the 2017 schema.  Locations outside of the `LocationID` areas are coded as `264` and `265`.  We'll be using `264` to indicate our out-of-bounds points.
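
For intuition about what a point-in-polygon test computes, here is a minimal pure-Python sketch of the classic even-odd ray casting rule for a single ring. The `point_in_ring` helper and the square are hypothetical, and this is not how cuSpatial implements PIP on the GPU; it only illustrates the underlying idea.

```python
def point_in_ring(px, py, ring):
    """Even-odd ray casting: count edges crossed by a ray going right from (px, py)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            # x coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_ring(2.0, 2.0, square))  # True
print(point_in_ring(5.0, 2.0, square))  # False
```

Running such a test for every point against every polygon is what makes the problem a good fit for the GPU.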

In [7]:
pip_iterations = list(np.arange(0, 263, 31))
pip_iterations.append(263)
print(pip_iterations)

[0, 31, 62, 93, 124, 155, 186, 217, 248, 263]
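
The same boundaries can also be expressed as explicit `(start, end)` pairs. The sketch below uses a hypothetical `batch_bounds` helper in plain Python; it is one possible way to build the batches, not part of cuSpatial.

```python
def batch_bounds(n_polygons, batch_size):
    """Return (start, end) pairs covering range(n_polygons) in chunks of batch_size."""
    return [
        (start, min(start + batch_size, n_polygons))
        for start in range(0, n_polygons, batch_size)
    ]


bounds = batch_bounds(263, 31)
print(bounds[0], bounds[-1], len(bounds))  # (0, 31) (248, 263) 9
```

Each pair can then be used to slice the feature offsets, so that no single call sees more than 31 polygons.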


In [8]:
%%time
# Default every trip to 264 (out of bounds); matched trips are overwritten below.
taxi2016['PULocationID'] = 264
taxi2016['DOLocationID'] = 264
for i in range(len(pip_iterations)-1):
    # Test all points against one batch of at most 31 polygons.
    start = pip_iterations[i]
    end = pip_iterations[i+1]
    pickups = cuspatial.point_in_polygon(taxi2016['pickup_longitude'] , taxi2016['pickup_latitude'], taxi_zones[0][start:end], taxi_zones[1], taxi_zones[2]['x'], taxi_zones[2]['y'])
    dropoffs = cuspatial.point_in_polygon(taxi2016['dropoff_longitude'] , taxi2016['dropoff_latitude'], taxi_zones[0][start:end], taxi_zones[1], taxi_zones[2]['x'], taxi_zones[2]['y'])
    # Each result column is used as a boolean mask: write its label wherever the mask is True.
    for j in pickups.columns:
        taxi2016['PULocationID'].loc[pickups[j]] = j
    for j in dropoffs.columns:
        taxi2016['DOLocationID'].loc[dropoffs[j]] = j

CPU times: user 11.2 s, sys: 6.27 s, total: 17.5 s
Wall time: 17.7 s
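
The assignment pattern in the loop above can be sketched in plain Python: start from the out-of-bounds default and overwrite it wherever a polygon's mask is true. The `assign_ids` helper and the toy masks are hypothetical. The labels are simply whatever labels the masks carry; whether they equal the official `LocationID`s depends on the zone file and is not checked here.

```python
OUT_OF_BOUNDS = 264  # default used in this notebook for points in no zone


def assign_ids(masks_by_label, n_points, default=OUT_OF_BOUNDS):
    """masks_by_label maps a polygon label to a list of n_points booleans."""
    ids = [default] * n_points
    for label, mask in masks_by_label.items():
        for row, hit in enumerate(mask):
            if hit:
                ids[row] = label  # overwrite the default for matched points
    return ids


# Hypothetical result for 4 points tested against polygons labelled 0 and 1.
masks = {
    0: [True, False, False, False],
    1: [False, False, True, False],
}
ids = assign_ids(masks, n_points=4)
print(ids)  # [0, 264, 1, 264]
```

Because only one ID column per endpoint is kept, memory use stays proportional to the number of trips rather than trips times polygons.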


In [9]:
del pickups
del dropoffs

I wonder how many taxi rides in 2016 started and ended in the same location. As a first rough check, let's look at the correlation between the pickup and dropoff IDs.

In [10]:
print(taxi2016['DOLocationID'].corr(taxi2016['PULocationID']))

0.10632302994054163


That is a weaker relationship than I expected. What share of trips exactly started and ended in the same location?

In [11]:
# The ratio is a fraction of all trips, so format it as a percentage.
print(format((taxi2016['DOLocationID'] == taxi2016['PULocationID']).sum()/taxi2016.shape[0], '.0%'))

7%


Something with perhaps a higher correlation: pickups and dropoffs by zone are unlikely to change much from year to year, especially within a given month. Let's see how similar the pickup and dropoff patterns are in January 2016 and January 2017:

In [12]:
print(taxi2016['DOLocationID'].value_counts().corr(taxi2017['DOLocationID'].value_counts()))
print(taxi2016['PULocationID'].value_counts().corr(taxi2017['PULocationID'].value_counts()))

0.6013696841885884
0.5822576814239405


## Bringing Them All Together
If you wanted to include this as part of a larger cleanup of Taxi data, you'd then concatenate this dataframe into a `dask_cudf` dataframe and delete its `cuDF` version, or convert it into Arrow memory format and process it much as we did in the mortgage notebook.  For now, as we are only working on a couple of GBs, we'll concatenate in cuDF.

In [13]:
df = cudf.concat([taxi2017, taxi2016])

## Final Check
Now let's check that both years are present as expected.

In [14]:
print(df.query('PULocationID == 204'))

         VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
542479          1  2017-01-15 17:33:37   2017-01-15 17:35:32                1   
159122          1  2016-01-01 04:19:02   2016-01-01 04:25:09                3   
180769          2  2016-01-01 06:40:21   2016-01-01 06:41:09                2   
4871707         1  2016-01-15 01:07:34   2016-01-15 01:11:50                1   
7766271         1  2016-01-31 04:38:06   2016-01-31 04:39:05                1   
9039663         1  2016-01-22 17:49:06   2016-01-22 17:49:34                2   
9259375         1  2016-01-23 10:58:21   2016-01-23 11:20:08                1   

         trip_distance  RatecodeID store_and_fwd_flag  PULocationID  \
542479             0.0           5                  N           204   
159122             1.5           1                  N           204   
180769             0.0           5                  N           204   
4871707            1.2           1                  N           204

As you can see, the 2017 rows lack longitude and latitude. With our existing `DataFrame`, it is trivial and fast to compute the mean longitude and latitude for each `PULocationID` and `DOLocationID`. We could inject those means into the missing values to easily see how well the new `LocationID`s map to pickup and dropoff locations.
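
As a rough illustration of that idea, the sketch below averages coordinates per zone in plain Python. The `mean_coords_by_zone` helper and the sample rows are hypothetical; on a `DataFrame` the same result would typically come from a grouped mean rather than an explicit loop.

```python
from collections import defaultdict


def mean_coords_by_zone(rows):
    """rows: iterable of (zone_id, lon, lat). Returns {zone_id: (mean_lon, mean_lat)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running lon sum, lat sum, count
    for zone, lon, lat in rows:
        acc = sums[zone]
        acc[0] += lon
        acc[1] += lat
        acc[2] += 1
    return {zone: (s[0] / s[2], s[1] / s[2]) for zone, s in sums.items()}


# Hypothetical pickups: (zone ID, longitude, latitude).
rows = [(204, -74.0, 40.5), (204, -74.2, 40.7), (7, -73.9, 40.8)]
centers = mean_coords_by_zone(rows)
print(centers[7])
```

The per-zone means could then be looked up by `LocationID` to fill in approximate coordinates for the 2017 rows.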

## Back To Your Workflow
So now you've seen how to use cuSpatial to clean and correlate your spatial data using the NYC taxi data. You can now perform multi-year analytics across the entire range of taxi datasets using your favorite RAPIDS libraries.