# NYC Taxi Spatial Notebook

The New York Yellow Cab datasets have long been popular large datasets, both for observing the behavior of NYCers and for testing performance. However, they are also known to contain noise and bad data. Also, we haven't really been able to analyze how NYCers travel between their 5 boroughs and the surrounding areas.

`cuSpatial`, the GPU-accelerated RAPIDS spatial library, lets everyone quickly clean the data down to trips within the 5 boroughs and dive deeper into the interborough behavior of NYCers.

In this notebook, we will primarily demonstrate 
- `cuSpatial`'s `Point in Polygon` (PIP) and `Window Points` (WP) capabilities 
- how to use them effectively in a workflow alongside other RAPIDS libraries, like `cudf` and `cuXFilter`.  

We'll also show:
- how polygon data works in cuSpatial with features, rings, and coordinates
- how to analyze specific features in polygons with PIP
- a small CPU comparison
- how to visualize your spatial data with cuXFilter
- methods of cleaning spatial data with cuSpatial and cuXFilter

In [1]:
import cuspatial
import geopandas as gpd
import cudf
from numba import cuda
import numpy as np

## Add Notebook Ports
Before you go on your click-fest through this notebook, **please add your notebook server's URL and port**
- local server, use: `127.0.0.1:<port>`
- remote server, use `<notebook ip address>:<notebook server port>`

If you don't do this correctly, the **notebook's visualization will FAIL!!**

In [None]:
#for cuXfilter Bokeh server
url = '127.0.0.1' # if not on local machine, make this your notebook server's url, like '192.168.1.100'
port = '8888'     # set this to your notebook server's port

# Get your data

In [4]:
!if [ ! -f "zones.zip" ]; then curl https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip -o zones.zip; else echo "zones.zip found"; fi

zones.zip found


In [5]:
!if [ !  -f "taxi2015.csv" ]; then curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-01.csv -o taxi2015.csv; else echo "taxi2015.csv found"; fi

taxi2015.csv found


In [6]:
! if [ ! -f "NYC_boroughs.json" ]; then curl "https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=GeoJSON" > NYC_boroughs.json; else echo "NYC_boroughs.json found"; fi

NYC_boroughs.json found


In [7]:
%%time
NYC_boroughs = gpd.read_file('NYC_boroughs.json')
NYC_boroughs.to_file('NYC_boroughs.shp')
NYC_gpu = cuspatial.read_polygon_shapefile('NYC_boroughs.shp')

CPU times: user 762 ms, sys: 259 ms, total: 1.02 s
Wall time: 1.02 s


# Let's check out the shape file

cuSpatial geometry uses a packed format. In this object, `NYC_gpu`, the first GPU-array in the tuple contains the feature positions `f_pos`. `Feature 0` is the first 24 rings, `Feature 1` is the set of rings from index 24-28, and so forth.

`r_pos` are the ring positions of each feature. `Ring 0` is the first 12 coordinate pairs, `Ring 1` is the coordinate pairs from index 12-511, and so forth.

Finally, the `DataFrame` at position `2` in `NYC_gpu` holds the 75519 coordinates used across all 5 features and the 107 rings that compose them.
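The packed layout described above can be sketched in plain Python. This is only an illustration of the end-offset convention: the `span` helper is a hypothetical name, not a cuSpatial function, and the offsets are copied from the output printed in this notebook.

```python
# Sketch of the packed (end-offset) layout, in plain Python.
# `span` is a hypothetical helper, not part of cuSpatial.

def span(offsets, i):
    """Return the half-open range [start, end) for entry i of an end-offset array."""
    start = 0 if i == 0 else offsets[i - 1]
    return start, offsets[i]

f_pos = [24, 28, 55, 73, 107]          # exclusive end ring of each feature
r_pos_head = [12, 511, 523, 528, 536]  # exclusive end vertex of the first five rings

print(span(f_pos, 0))       # rings that make up Feature 0
print(span(f_pos, 1))       # rings that make up Feature 1
print(span(r_pos_head, 1))  # vertices that make up Ring 1
```

Each range can be used directly as a slice, for example into the coordinate `DataFrame` with `.iloc[start:end]`.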


In [8]:
print(NYC_boroughs.boro_name) # take note of Borough names for later
print(NYC_gpu)

0            Bronx
1    Staten Island
2         Brooklyn
3           Queens
4        Manhattan
Name: boro_name, dtype: object
(0     24
1     28
2     55
3     73
4    107
Name: f_pos, dtype: int32, 0         12
1        511
2        523
3        528
4        536
       ...  
102    70054
103    70300
104    75278
105    75291
106    75519
Name: r_pos, Length: 107, dtype: int32,                x          y
0     -73.896809  40.795808
1     -73.896785  40.796329
2     -73.897131  40.796798
3     -73.897883  40.797117
4     -73.898521  40.796936
...          ...        ...
75514 -73.907743  40.872846
75515 -73.907465  40.873547
75516 -73.907088  40.874326
75517 -73.906917  40.875056
75518 -73.906651  40.875753

[75519 rows x 2 columns])


In [9]:
t = NYC_gpu[2].iloc[511:522]
print(t)

             x          y
511 -73.898330  40.802413
512 -73.899387  40.801936
513 -73.899489  40.800901
514 -73.900037  40.800909
515 -73.899716  40.800799
516 -73.899787  40.799510
517 -73.900210  40.799264
518 -73.899025  40.799172
519 -73.898640  40.799101
520 -73.897985  40.799604
521 -73.896467  40.800790


The rows above are the first 11 of the 12 vertices of `Feature 0`, `Ring 2`, which spans indices 511-523 according to `r_pos`.

## Working with Polygons
Let's dive into the elements of the shapefile we just read with cuSpatial.

In [10]:
print("Polygon Bounds:" , NYC_gpu[0]) # upper bound of the rings that make up the polygon (feature)
print("Last Vertex:" , NYC_gpu[1]) # this is the position of the last vertex in each ring

# You can get the lon/lat by 
print("Longitude: ", NYC_gpu[2]['x']) # prints lon
print("Latitude: " , NYC_gpu[2]['y']) # prints lat
      
NYC_gpu[2]['x'] = NYC_gpu[2]['x']
NYC_gpu[2]['y'] = NYC_gpu[2]['y']

Polygon Bounds: 0     24
1     28
2     55
3     73
4    107
Name: f_pos, dtype: int32
Last Vertex: 0         12
1        511
2        523
3        528
4        536
       ...  
102    70054
103    70300
104    75278
105    75291
106    75519
Name: r_pos, Length: 107, dtype: int32
Longitude:  0       -73.896809
1       -73.896785
2       -73.897131
3       -73.897883
4       -73.898521
           ...    
75514   -73.907743
75515   -73.907465
75516   -73.907088
75517   -73.906917
75518   -73.906651
Name: x, Length: 75519, dtype: float64
Latitude:  0        40.795808
1        40.796329
2        40.796798
3        40.797117
4        40.796936
           ...    
75514    40.872846
75515    40.873547
75516    40.874326
75517    40.875056
75518    40.875753
Name: y, Length: 75519, dtype: float64


In [11]:
NYC_gpu[1].head(25) # rings 0-23 belong to the first feature (NYC_gpu[0][0] == 24); row 24 starts the second feature. This data is packed.

0       12
1      511
2      523
3      528
4      536
5      561
6      581
7     1968
8     2065
9     2255
10    2260
11    2389
12    2496
13    2633
14    2638
15    2643
16    2649
17    2657
18    2664
19    2680
20    2685
21    2690
22    2714
23    8509
24    8556
Name: r_pos, dtype: int32

## Taxi Data
Let's import the taxi data. Newer years have pickup/dropoff location IDs that you would have to cross-reference against the taxi zones. Older years, like 2015, give you lon/lat values. We'll be using those.

In [12]:
taxi2015 = cudf.read_csv("taxi2015.csv")
print(taxi2015.dtypes)

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RateCodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object


In [13]:
taxi2015.count()

VendorID                 12748986
tpep_pickup_datetime     12748986
tpep_dropoff_datetime    12748986
passenger_count          12748986
trip_distance            12748986
pickup_longitude         12748986
pickup_latitude          12748986
RateCodeID               12748986
store_and_fwd_flag       12748986
dropoff_longitude        12748986
dropoff_latitude         12748986
payment_type             12748986
fare_amount              12748986
extra                    12748986
mta_tax                  12748986
tip_amount               12748986
tolls_amount             12748986
improvement_surcharge    12748983
total_amount             12748986
dtype: int64

As you can see, the columns that we're interested in for our spatial analysis are `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, and `dropoff_latitude`. They are all `float64`. Converting them to `float32` would halve their memory footprint, giving us GPU storage for twice as many values, and `float32` still has plenty of precision for lon/lat computations. Note that the next cell as written reassigns each column to itself, so the columns stay `float64`, as the later outputs show.
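To put rough numbers on the `float32` trade-off, here is a minimal standard-library sketch. It is not part of the original workflow: `to_float32` is a hypothetical helper, and the row count and sample longitude are taken from the outputs in this notebook.

```python
import struct

def to_float32(value):
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", value))[0]

n_rows = 12_748_986      # row count reported by taxi2015.count()
n_coord_columns = 4      # pickup/dropoff longitude and latitude

bytes_float64 = n_rows * n_coord_columns * 8
bytes_float32 = n_rows * n_coord_columns * 4
print(bytes_float64 / 1e6, "MB as float64")
print(bytes_float32 / 1e6, "MB as float32")

lon = -73.993896         # first pickup longitude shown in this notebook
error_degrees = abs(to_float32(lon) - lon)
# A degree is at most about 111 km on the ground, so this is an upper bound in metres.
print(error_degrees * 111_000, "m (upper bound on the rounding error)")
```

For coordinates around NYC the rounding error stays well under a metre, which is why single precision is usually adequate for this kind of borough-level analysis.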

In [14]:
taxi2015['pickup_longitude'] = taxi2015['pickup_longitude']
taxi2015['pickup_latitude'] = taxi2015['pickup_latitude']

taxi2015['dropoff_longitude'] = taxi2015['dropoff_longitude']
taxi2015['dropoff_latitude'] = taxi2015['dropoff_latitude']

## GPU Point-In-Polygon

- 12m pickup locations
- 12m dropoff locations
- 5 borough features
- 107 borough polygons
- 75519 borough coordinates

Now that we've collected the set of pickup locations and dropoff locations, we can use `cuspatial.point_in_polygon_bitmap` to quickly determine which pickups and dropoffs occur in each borough. That is, 5 boroughs composed of a total of 107 polygons.
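For intuition about what a point-in-polygon test does, here is a minimal single-ring sketch using the classic even-odd ray-casting rule. It is an illustration only and makes no claim about how cuSpatial implements the test internally; `point_in_ring` is a hypothetical helper name.

```python
def point_in_ring(x, y, ring):
    """Even-odd ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_ring(2.0, 2.0, square))  # a point inside the square
print(point_in_ring(5.0, 2.0, square))  # a point to the right of it
```

The cost of this loop is one pass over the ring's vertices per point, which is why the work grows with both the number of points and the number of polygon vertices.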

In [15]:
%%time

NYC_gpu[0].index = ["Bronx", "Staten Island", "Brooklyn", "Queens", "Manhattan"]

pickups = cuspatial.point_in_polygon_bitmap(taxi2015['pickup_longitude'] , taxi2015['pickup_latitude'], NYC_gpu[0], NYC_gpu[1], NYC_gpu[2]['x'], NYC_gpu[2]['y'])
dropoffs = cuspatial.point_in_polygon_bitmap(taxi2015['dropoff_longitude'] , taxi2015['dropoff_latitude'], NYC_gpu[0], NYC_gpu[1], NYC_gpu[2]['x'], NYC_gpu[2]['y'])

CPU times: user 27.5 s, sys: 15.4 ms, total: 27.6 s
Wall time: 27.6 s


In [16]:
pickups.head()

Unnamed: 0,Bronx,Staten Island,Brooklyn,Queens,Manhattan
0,False,False,False,False,True
1,False,False,False,False,True
2,False,False,False,False,True
3,False,False,False,False,True
4,False,False,False,False,True


In [17]:
dropoffs.head()

Unnamed: 0,Bronx,Staten Island,Brooklyn,Queens,Manhattan
0,False,False,False,False,True
1,False,False,False,False,True
2,False,False,False,False,True
3,False,False,False,False,True
4,False,False,False,False,True


We just computed point in polygon for 25 million points against 75.5k polygon vertices in a matter of seconds (about `3.07s` in the run this text describes; the wall time recorded above is from a different run, and yours will depend on your GPU). That's really fast. Naively comparing all 25m points with all 75.5k polygon coordinates is `25000000*75500 ≈ 1.9 trillion` comparisons. We're **NOT going to compare full CPU times**. We will take a subset to give you a taste, though.

## CPU Point-In-Polygon

Comparing GPU speed with CPU speed is always a concern with RAPIDS. We want to be much, much faster on GPU. The next few cells pull out a single ring from our dataset, then run point-in-polygon using `Shapely`, one of the most popular CPU Python GIS libraries.

### Getting the Data

Previously, we detected which points fall inside of which features using `cuspatial.point_in_polygon_bitmap`. Let's make sure that the CPU comparison is interesting by selecting a ring that we know contains points for comparison.

Instead of computing `point_in_polygon` for our five original features, let's modify the call to use the rings instead.

In [18]:
ring_pickups = cuspatial.point_in_polygon_bitmap(taxi2015['pickup_longitude'] , taxi2015['pickup_latitude'], cudf.Series(np.arange(2,34)), NYC_gpu[1], NYC_gpu[2]['x'], NYC_gpu[2]['y'])
print(ring_pickups.head())
print(NYC_gpu[1][20:26])
print(ring_pickups[24][3799])

      0      1      2      3      4      5      6      7      8      9   ...  \
0  False  False  False  False  False  False  False  False  False  False  ...   
1  False  False  False  False  False  False  False  False  False  False  ...   
2  False  False  False  False  False  False  False  False  False  False  ...   
3  False  False  False  False  False  False  False  False  False  False  ...   
4  False  False  False  False  False  False  False  False  False  False  ...   

      22     23     24     25     26     27     28     29     30     31  
0  False  False  False  False  False  False  False  False  False  False  
1  False  False  False  False  False  False  False  False  False  False  
2  False  False  False  False  False  False  False  False  False  False  
3  False  False  False  False  False  False  False  False  False  False  
4  False  False  False  False  False  False  False  False  False  False  

[5 rows x 32 columns]
20    2685
21    2690
22    2714
23    8509
24    85

In [19]:
ring_pickups.sum()
print(ring_pickups[24].index[ring_pickups[24]])
print(ring_pickups[20][3799])
print(ring_pickups[21][3799])
print(ring_pickups[22][3799])
print(ring_pickups[23][3799])
print(ring_pickups[24][3799])
print(ring_pickups[25][3799])
first_true = taxi2015[['pickup_longitude', 'pickup_latitude']].iloc[3799,:]
print(first_true)

Int64Index([], dtype='int64')
False
False
True
False
False
False
pickup_longitude   -73.904068
pickup_latitude     40.852001
Name: 3799, dtype: float64


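By substituting a run of consecutive integers (`cudf.Series(np.arange(2,34))` in the cell above) for the `f_pos` argument, we're telling cuspatial to treat individual rings as features, so it computes which points fall into which rings. In the output above, point 3799 falls in column 22.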
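To see which rings each output column covers, the sketch below walks an `f_pos` array as exclusive end offsets. That reading is an assumption based on how the printed `f_pos` is described earlier in this notebook, and `column_ring_ranges` is a hypothetical helper, not a cuSpatial function.

```python
def column_ring_ranges(f_pos):
    """For each output column, return the half-open range of ring indices it covers,
    assuming f_pos holds exclusive end offsets."""
    ranges = []
    start = 0
    for end in f_pos:
        ranges.append((start, end))
        start = end
    return ranges

ranges = column_ring_ranges(list(range(2, 34)))
print(ranges[0])    # rings covered by the first column
print(ranges[22])   # rings covered by column 22
```

Under that assumption, column 22 covers a single ring, index 23, and `r_pos` places that ring's vertices at 2714 to 8509.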

In [20]:
NYC_gpu[1][20:26]

20    2685
21    2690
22    2714
23    8509
24    8556
25    8565
Name: r_pos, dtype: int32

The 24th ring (index 23) spans vertices 2714 to 8509. We'll use that for our temporary dataframe for our CPU run.

### CPU Benchmark

In the following section we'll pull the ring spanning 2714:8509 from the source data (along with its two neighbors, to confirm which one contains our test point), create a `Shapely` polygon from it, and use Shapely to run point-in-polygon on a small subset of the 12m taxi pickup locations.

In [21]:
from shapely.geometry.polygon import Polygon

pol1 = NYC_gpu[2].iloc[2714:8509]
pol2 = NYC_gpu[2].iloc[8509:8556]
pol3 = NYC_gpu[2].iloc[8556:8565]
shape1 = list(zip(pol1['x'].tolist(), pol1['y'].tolist()))
shape2 = list(zip(pol2['x'].tolist(), pol2['y'].tolist()))
shape3 = list(zip(pol3['x'].tolist(), pol3['y'].tolist()))

In [22]:
from shapely.geometry import Point
first_point = Point(first_true.tolist())
print(Polygon(shape1).contains(first_point))
print(Polygon(shape2).contains(first_point))
print(Polygon(shape3).contains(first_point))

polygon = Polygon(shape1)

True
False
False


We take a small sample of points from the original set of pickup locations: 1m points. This will run about 12x as fast as on all 12m points, so you can hopefully see the results quickly regardless of your individual machine performance. Because the points are randomly sampled and the sample is 1/12th the size of the original data, we should see about `in_polygon_24/12 = 9810/12 ≈ 817` points in the test ring.

In [23]:
%%time
from shapely.geometry import Point

outcome = []
outcome_yes = 0
outcome_no = 0
random_set = taxi2015[['pickup_longitude', 'pickup_latitude']].iloc[
    np.random.choice(taxi2015.shape[0], 1000000),:]
#random_set = taxi2015[['pickup_longitude', 'pickup_latitude']].iloc[0:5000,:]
points = [Point(p) for p in list(zip(
    random_set['pickup_longitude'].tolist(), random_set['pickup_latitude'].tolist()))]
print(points[3799])

POINT (-73.7820587158203 40.64459228515625)
CPU times: user 7.05 s, sys: 83.9 ms, total: 7.13 s
Wall time: 7.13 s


Just converting the points from GPU back to host for the comparison takes a noticeable period of time.

Now for the benchmark:

In [24]:
%%time
for i in range(0, len(points)):
    point = points[i]
    if(polygon.contains(point)):
        outcome_yes += 1
    else:
        outcome_no += 1
print("yes: ", outcome_yes, "no: " , outcome_no)

yes:  758 no:  999242
CPU times: user 3.82 s, sys: 16.2 ms, total: 3.83 s
Wall time: 3.83 s


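A simple measure: 1m points inside of 1 polygon with 8509-2714=5795 vertices. This is about `1/12` as many points and about `1/13` as many polygon vertices (`75519 / 5795 ≈ 13`) as the full problem. Without running the full dataset, we can estimate that the final runtime would be `7.4s * 12 * 13 ≈ 1154s`. That's a speedup of roughly `1154 / 2.9 ≈ 398`x using GPU instead of CPU! The `7.4s` and `2.9s` figures are from one particular run; your timings, like those recorded in the cell outputs above, will differ.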
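The extrapolation above is a simple linear scaling, sketched here so you can plug in your own timings. The function name and the example numbers are hypothetical and chosen only for illustration; the linear-scaling assumption is a rough approximation.

```python
def estimate_speedup(cpu_sample_seconds, point_ratio, vertex_ratio, gpu_seconds):
    """Linearly extrapolate a CPU sample timing to the full problem and compare to GPU.

    Assumes CPU cost scales linearly with both the number of points and the
    number of polygon vertices, which is only a rough approximation.
    """
    cpu_full_estimate = cpu_sample_seconds * point_ratio * vertex_ratio
    return cpu_full_estimate, cpu_full_estimate / gpu_seconds

# Hypothetical timings for illustration; substitute the wall times from your own run.
cpu_full, speedup = estimate_speedup(
    cpu_sample_seconds=2.0, point_ratio=10, vertex_ratio=5, gpu_seconds=4.0
)
print(cpu_full, speedup)
```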

# Making the Spatial data useful for the other RAPIDS libraries

Now that we've identified the polygon membership status of every one of our 12m points, let's add that information back into our original dataframe. From that point we could easily save the data permanently for future use, run further analytics using cuML, or, as we'll do in this example, load it into cuXFilter for visualization.
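The labelling step can be sketched in plain Python: start every trip as `external`, then overwrite the label wherever a borough's membership flag is true. The membership flags below are made up for illustration, and `label_rows` is a hypothetical helper; `name_map` mirrors the mapping used in this notebook.

```python
def label_rows(membership, default="external"):
    """membership maps a borough name to a list of booleans, one per trip."""
    n_rows = len(next(iter(membership.values())))
    labels = [default] * n_rows
    for name, flags in membership.items():
        for row, is_inside in enumerate(flags):
            if is_inside:
                labels[row] = name
    return labels

# Made-up membership flags for three trips
pickups_demo = {
    "Bronx":     [False, True,  False],
    "Manhattan": [True,  False, False],
}
name_map = {"external": 0, "Bronx": 1, "Staten Island": 2,
            "Brooklyn": 3, "Queens": 4, "Manhattan": 5}

labels = label_rows(pickups_demo)
codes = [name_map[label] for label in labels]
print(labels, codes)
```

Trips that match no borough keep the `external` label, which is what later lets us filter out trips that leave the 5 boroughs.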

In [25]:
taxi2015['pickup_borough'] = 'external'
taxi2015['pickup_borough'][pickups['Bronx']] = 'Bronx'
taxi2015['pickup_borough'][pickups['Staten Island']] = 'Staten Island'
taxi2015['pickup_borough'][pickups['Brooklyn']] = 'Brooklyn'
taxi2015['pickup_borough'][pickups['Queens']] = 'Queens'
taxi2015['pickup_borough'][pickups['Manhattan']] = 'Manhattan'
taxi2015['dropoff_borough'] = 'external'
taxi2015['dropoff_borough'][dropoffs['Bronx']] = 'Bronx'
taxi2015['dropoff_borough'][dropoffs['Staten Island']] = 'Staten Island'
taxi2015['dropoff_borough'][dropoffs['Brooklyn']] = 'Brooklyn'
taxi2015['dropoff_borough'][dropoffs['Queens']] = 'Queens'
taxi2015['dropoff_borough'][dropoffs['Manhattan']] = 'Manhattan'
name_map = {"external": 0, "Bronx": 1 , "Staten Island": 2, "Brooklyn": 3, "Queens": 4, "Manhattan": 5}
taxi2015['pickup_borough_integer'] = taxi2015['pickup_borough'].replace(name_map).astype('int32')
taxi2015['dropoff_borough_integer'] = taxi2015['dropoff_borough'].replace(name_map).astype('int32')


## Add the outputs to your taxi dataframe

In [26]:
taxi2015.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,...,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,pickup_borough,dropoff_borough,pickup_borough_integer,dropoff_borough_integer
0,2,2015-01-15 19:05:39,2015-01-15 19:23:42,1,1.59,-73.993896,40.750111,1,N,-73.974785,...,1.0,0.5,3.25,0.0,0.3,17.05,Manhattan,Manhattan,5,5
1,1,2015-01-10 20:33:38,2015-01-10 20:53:28,1,3.3,-74.001648,40.724243,1,N,-73.994415,...,0.5,0.5,2.0,0.0,0.3,17.8,Manhattan,Manhattan,5,5
2,1,2015-01-10 20:33:38,2015-01-10 20:43:41,1,1.8,-73.963341,40.802788,1,N,-73.95182,...,0.5,0.5,0.0,0.0,0.3,10.8,Manhattan,Manhattan,5,5
3,1,2015-01-10 20:33:39,2015-01-10 20:35:31,1,0.5,-74.009087,40.713818,1,N,-74.004326,...,0.5,0.5,0.0,0.0,0.3,4.8,Manhattan,Manhattan,5,5
4,1,2015-01-10 20:33:39,2015-01-10 20:52:58,1,3.0,-73.971176,40.762428,1,N,-74.004181,...,0.5,0.5,0.0,0.0,0.3,16.3,Manhattan,Manhattan,5,5


In [27]:
taxi2015['all_within_boroughs'] = (
   (taxi2015['pickup_borough'] != 'external')
 & (taxi2015['dropoff_borough'] != 'external')
)

In [28]:
taxi2015.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,pickup_borough,dropoff_borough,pickup_borough_integer,dropoff_borough_integer,all_within_boroughs
0,2,2015-01-15 19:05:39,2015-01-15 19:23:42,1,1.59,-73.993896,40.750111,1,N,-73.974785,...,0.5,3.25,0.0,0.3,17.05,Manhattan,Manhattan,5,5,True
1,1,2015-01-10 20:33:38,2015-01-10 20:53:28,1,3.3,-74.001648,40.724243,1,N,-73.994415,...,0.5,2.0,0.0,0.3,17.8,Manhattan,Manhattan,5,5,True
2,1,2015-01-10 20:33:38,2015-01-10 20:43:41,1,1.8,-73.963341,40.802788,1,N,-73.95182,...,0.5,0.0,0.0,0.3,10.8,Manhattan,Manhattan,5,5,True
3,1,2015-01-10 20:33:39,2015-01-10 20:35:31,1,0.5,-74.009087,40.713818,1,N,-74.004326,...,0.5,0.0,0.0,0.3,4.8,Manhattan,Manhattan,5,5,True
4,1,2015-01-10 20:33:39,2015-01-10 20:52:58,1,3.0,-73.971176,40.762428,1,N,-74.004181,...,0.5,0.0,0.0,0.3,16.3,Manhattan,Manhattan,5,5,True


In [29]:
taxi2015['dropoff_borough'].value_counts()

Manhattan        11150550
Brooklyn           642234
Queens             602715
external           282554
Bronx               68489
Staten Island        2444
Name: dropoff_borough, dtype: int32

In [30]:
taxi2015['pickup_borough'].value_counts()

Manhattan        11612333
Queens             640693
external           256123
Brooklyn           229809
Bronx                9840
Staten Island         188
Name: pickup_borough, dtype: int32

In [31]:
taxi2015['all_within_boroughs'].value_counts()

True     12434190
False      314796
Name: all_within_boroughs, dtype: int32

In [32]:
taxi2015['interborough'] = taxi2015['pickup_borough'] != taxi2015['dropoff_borough']  # True when pickup and dropoff boroughs differ

In [33]:
taxi2015['interborough'].value_counts()

False    11275954
True      1473032
Name: interborough, dtype: int32

# Cleaning the Data
Now that we have some great analytics, let's do what we came here to do: get a clean, 5-borough-only dataset with no dropoffs in the middle of the water. cuspatial makes two ways of cleaning the data possible:
1. through cuspatial (no visualization required)
1. through cuxfilter (has awesome visualizations, but you have to run the visualization to create an output)
    
## Cleaning with cuspatial 
With cuSpatial, we can easily query the data on `all_within_boroughs`, or on rows where neither `pickup_borough` nor `dropoff_borough` is `'external'` (equivalently, where the `_integer` columns are `!= 0`). Since we created `all_within_boroughs` mostly as an easy filter for option two, and you'd probably not do that in your spatial workflow, we'll show how to do it the latter way.

In [34]:
taxi_cuS_cleaned = taxi2015[
    (
        (taxi2015['pickup_borough'] != 'external')
      & (taxi2015['dropoff_borough'] != 'external')
    )
]

The filter above drops the rows where `all_within_boroughs` is `False` (314,796 of the 12,748,986 trips). Those rows contain the extra-borough transportation as well as the dirty data. Good to look at with interest, but outside of our scope. If you want to clean this data as well, you'll need boundary maps of the rest of the Tri-State area.

## Cleaning using cuXFilter
(You may remember me from other notebooks like...)  

First, we're going to have a little fun with this. All those new parameters we created will be put to use. We're going to build an interactive visualization on the whole `taxi2015` dataset. In the end you'll be able to see:
1. original data pickups
1. original data dropoffs
1. cuSpatial cleaned pickups
1. cuSpatial cleaned dropoffs
1. original data and cleaned intraborough pickups
1. original data and cleaned intraborough dropoffs
1. original data and cleaned interborough pickups
1. original data and cleaned interborough dropoffs
1. and more!

To do this, we're going to 
1. convert the pickup and drop off coordinates from lat/lon (EPSG:4326) to Web Mercator (EPSG:3857)
1. prep the bokeh server to show a background map to overlay our data
1. create some charts to plot pickup and drop off points
1. create some more charts to help filter the data by parameters of interest, like 
 1. pickup borough by borough
 1. dropoff borough by borough
 1. Whether the trip was interborough
 1. and a quick filter on whether or not the whole trip was within NYC's 5 boroughs (yes, the "easy button" cleaning filter)
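The coordinate conversion in the first step is the standard spherical Web Mercator projection from EPSG:4326 degrees to EPSG:3857 metres. The notebook uses `pyproj` for it; the sketch below is a plain-Python version of the same formula for illustration, with a hypothetical function name, checked against the first pickup point shown in this notebook.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project a longitude/latitude pair in degrees to EPSG:3857 metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# First pickup in the dataset, as printed by taxi2015.head()
x, y = lonlat_to_web_mercator(-73.993896, 40.750111)
print(round(x), round(y))
```

The result should agree with the `pu_x` and `pu_y` values printed for the first row to within the display precision.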

In [35]:
import cuxfilter
from bokeh import palettes
from cuxfilter.layouts import double_feature

In [36]:
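from pyproj import Transformer

def makeXfilter(x, y):
    # Reproject EPSG:4326 (lat/lon degrees) to EPSG:3857 (Web Mercator metres).
    # EPSG:4326 axis order is (latitude, longitude), so pass latitude as `x`.
    temp = cudf.DataFrame()
    transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
    temp['x'], temp['y'] = transform_4326_to_3857.transform(
        x.to_array(), y.to_array()
    )
    print(temp.head())
    return temp.x, temp.y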

taxi2015['pu_x'], taxi2015['pu_y'] = makeXfilter(taxi2015['pickup_latitude'], taxi2015['pickup_longitude'])
taxi2015['do_x'], taxi2015['do_y'] = makeXfilter(taxi2015['dropoff_latitude'],taxi2015['dropoff_longitude'])
cux = cuxfilter.DataFrame.from_dataframe(taxi2015)

              x             y
0 -8.236963e+06  4.975553e+06
1 -8.237826e+06  4.971752e+06
2 -8.233561e+06  4.983296e+06
3 -8.238654e+06  4.970221e+06
4 -8.234434e+06  4.977363e+06
              x             y
0 -8.234835e+06  4.975627e+06
1 -8.237021e+06  4.976875e+06
2 -8.232279e+06  4.986477e+06
3 -8.238124e+06  4.971127e+06
4 -8.238108e+06  4.974457e+06


In [37]:
from bokeh.tile_providers import get_provider as gp
tile_provider = gp('CARTODBPOSITRON')

### Create your charts and launch the dashboard

In [38]:
label_map = {
    0:"external", 
    1:"Bronx",
    2:"Staten Island",
    3:"Brooklyn", 
    4:"Queens", 
    5:"Manhattan"
}
chart0 = cuxfilter.charts.scatter_geo(x='pu_x',
                                      y='pu_y',
                                      title='NYC Taxi Pickups',
                                      aggregate_fn='count',
                                      tile_provider=tile_provider, x_range=(-8267428.97,-8207328.23), y_range=(4935861.67,5000548.55))
chart1 = cuxfilter.charts.scatter_geo(x='do_x',
                                      y='do_y',
                                      title='NYC Taxi Dropoffs',
                                      aggregate_fn='count',
                                      tile_provider=tile_provider, x_range=(-8267428.97,-8207328.23), y_range=(4935861.67,5000548.55))
chart2 = cuxfilter.charts.multi_select('pickup_borough_integer', label_map=label_map)
chart3 = cuxfilter.charts.multi_select('dropoff_borough_integer', label_map=label_map)

In [None]:
d = cux.dashboard([chart0, chart1, chart2, chart3], layout=cuxfilter.layouts.feature_and_base, theme=cuxfilter.themes.dark, title= 'NYC TAXI DATASET')

In [None]:
d

In [None]:
d.show(url+':'+port)

# WARNING!
Before you stop the data viz, please 
1. select all boroughs that aren't `external` in `pickup_borough_integer` and `dropoff_borough_integer`.  This will give you a cleaned dataset
1. Then, run the next cell to export the data in the visualizations as a dataframe.  

This will give you a similar result to the `taxi_cuS_cleaned` dataset.  Then, we'll stop the visualization and compare the two datasets for similarity, just to prove that they are the same.

In [42]:
taxi_cuX_cleaned = d.export()

final query pickup_borough_integer in (1,2,3,4,5) and dropoff_borough_integer in (1,2,3,4,5)


In [43]:
taxi_cuX_cleaned.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,...,pickup_borough,dropoff_borough,pickup_borough_integer,dropoff_borough_integer,all_within_boroughs,interborough,pu_x,pu_y,do_x,do_y
0,2,2015-01-15 19:05:39,2015-01-15 19:23:42,1,1.59,-73.993896,40.750111,1,N,-73.974785,...,Manhattan,Manhattan,5,5,True,False,-8236963.0,4975553.0,-8234835.0,4975627.0
1,1,2015-01-10 20:33:38,2015-01-10 20:53:28,1,3.3,-74.001648,40.724243,1,N,-73.994415,...,Manhattan,Manhattan,5,5,True,False,-8237826.0,4971752.0,-8237021.0,4976875.0
2,1,2015-01-10 20:33:38,2015-01-10 20:43:41,1,1.8,-73.963341,40.802788,1,N,-73.95182,...,Manhattan,Manhattan,5,5,True,False,-8233561.0,4983296.0,-8232279.0,4986477.0
3,1,2015-01-10 20:33:39,2015-01-10 20:35:31,1,0.5,-74.009087,40.713818,1,N,-74.004326,...,Manhattan,Manhattan,5,5,True,False,-8238654.0,4970221.0,-8238124.0,4971127.0
4,1,2015-01-10 20:33:39,2015-01-10 20:52:58,1,3.0,-73.971176,40.762428,1,N,-74.004181,...,Manhattan,Manhattan,5,5,True,False,-8234434.0,4977363.0,-8238108.0,4974457.0


In [44]:
d.stop()

# Comparing the data
Let's check that the two cleaned datasets, `taxi_cuS_cleaned` and `taxi_cuX_cleaned`, are the same.

In [45]:
print(taxi_cuX_cleaned.head())
print(taxi_cuS_cleaned.head())
print(taxi_cuX_cleaned.count())
print(taxi_cuS_cleaned.count())

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1   
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1   
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1   
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1   
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1   

   trip_distance  pickup_longitude  pickup_latitude  RateCodeID  \
0           1.59        -73.993896        40.750111           1   
1           3.30        -74.001648        40.724243           1   
2           1.80        -73.963341        40.802788           1   
3           0.50        -74.009087        40.713818           1   
4           3.00        -73.971176        40.762428           1   

  store_and_fwd_flag  dropoff_longitude  ...  pickup_borough dropoff_borough  \
0                  N         -73.974785  ...       Manhattan      

So far, so good.  Now let's compare each element in each column and row.

In [46]:
print(taxi_cuX_cleaned.columns)
print(taxi_cuS_cleaned.columns)
for i in range(0, len(taxi_cuS_cleaned.columns)):
    diff = (taxi_cuX_cleaned.iloc[:,i] != taxi_cuS_cleaned.iloc[:,i]).sum()
    print("# of differences in column " + taxi_cuX_cleaned.columns[i]+ ': ', diff)

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'pickup_borough',
       'dropoff_borough', 'pickup_borough_integer', 'dropoff_borough_integer',
       'all_within_boroughs', 'interborough', 'pu_x', 'pu_y', 'do_x', 'do_y'],
      dtype='object')
Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'pickup_borough',
       'dropoff_borough', 'pic

These should all be 0.  Hooray! (if not, you didn't follow directions :) )

# Analyzing interesting interborough behaviours of NYCers
So now that we have cleaned data of yellow cab trips, and verified it visually using cuXFilter, we can use cuSpatial to see who is coming from where and why. One curiosity is why, unlike the rest of the region, Rockaway/Seaside has so many dropoffs but almost no pickups. This happens with both interborough and intraborough travel. It's also January, so nature tourism seems...unlikely. Let's find out where these people are coming from.

To begin, we have to make a box from the latitudes and longitudes of our area of interest, given here as (latitude, longitude) corner points:
- 40.633869, -73.748353
- 40.559571, -73.877628
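A window query simply keeps the points that fall inside an axis-aligned box. The plain-Python sketch below shows the idea with the corner points above and three sample coordinates taken from this notebook's outputs. It is an illustration, not cuSpatial's implementation, and it does not assume anything about how cuSpatial treats points exactly on the boundary.

```python
def window_points(left, bottom, right, top, xs, ys):
    """Keep the (x, y) pairs that fall strictly inside the axis-aligned box."""
    return [
        (x, y)
        for x, y in zip(xs, ys)
        if left < x < right and bottom < y < top
    ]

# The notebook passes latitude as x and longitude as y, so the box is given
# as (min lat, min lon, max lat, max lon).
box = (40.559571, -73.877628, 40.633869, -73.748353)
lats = [40.625492, 40.750111, 40.574512]
lons = [-73.780495, -73.993896, -73.858978]
inside = window_points(*box, lats, lons)
print(len(inside))
```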

In [47]:
pu_points_inside = cuspatial.window_points(
    40.559571, -73.877628,  40.633869, -73.748353, taxi2015['pickup_latitude'], taxi2015['pickup_longitude']
)
print(pu_points_inside.shape[0])

431


In [48]:
do_points_inside = cuspatial.window_points(
    40.559571, -73.877628,  40.633869, -73.748353, taxi2015['dropoff_latitude'], taxi2015['dropoff_longitude']
)
print(do_points_inside.shape[0])
print(do_points_inside)

1453
              x          y
0     40.625492 -73.780495
1     40.618229 -73.779617
2     40.596161 -73.766403
3     40.600056 -73.763321
4     40.597908 -73.776009
...         ...        ...
1448  40.593910 -73.755951
1449  40.596855 -73.772598
1450  40.599525 -73.782089
1451  40.574512 -73.858978
1452  40.590324 -73.797195

[1453 rows x 2 columns]


So, that's over 3x the pickup traffic.  Where is this influx coming from?  Let's merge back the other data on these points.

In [49]:
pickups = pickups*1
dropoffs = dropoffs*1
taxi2015 = taxi2015.join(pickups)

In [50]:
taxi2015.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,...,interborough,pu_x,pu_y,do_x,do_y,Bronx,Staten Island,Brooklyn,Queens,Manhattan
3488,1,2015-01-11 00:38:14,2015-01-11 00:55:13,1,2.0,-73.980522,40.744232,1,N,-73.988129,...,False,-8235474.0,4974689.0,-8236321.0,4971210.0,0,0,0,0,1
3489,1,2015-01-11 00:38:15,2015-01-11 00:48:17,1,2.8,0.0,0.0,1,N,0.0,...,False,0.0,0.0,0.0,0.0,0,0,0,0,0
3490,1,2015-01-11 00:38:15,2015-01-11 01:11:20,1,12.7,-73.997078,40.736382,1,N,-73.893898,...,True,-8237317.0,4973535.0,-8225831.0,4991296.0,0,0,0,0,1
3491,1,2015-01-11 00:38:15,2015-01-11 00:46:33,4,1.3,-73.986359,40.739956,1,N,-74.005478,...,False,-8236124.0,4974061.0,-8238252.0,4974359.0,0,0,0,0,1
3492,1,2015-01-11 00:38:16,2015-01-11 00:58:56,1,4.5,-74.003876,40.732872,1,N,-73.962509,...,False,-8238074.0,4973020.0,-8233469.0,4979692.0,0,0,0,0,1


In [51]:
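# window_points returned columns named x and y; we passed latitude as x and
# longitude as y, so mirror those names on taxi2015 to give the merge matching keys
taxi2015['x'] = taxi2015['dropoff_latitude']
taxi2015['y'] = taxi2015['dropoff_longitude']
rwys = do_points_inside.merge(taxi2015)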
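
Because no `on=` argument is given, the merge joins on the column names the two frames share, here `x` and `y`. The sketch below uses pandas as a CPU stand-in for cuDF (an assumption, so that it runs without a GPU) with made-up rows; `window_pts` and `trips` are hypothetical names. It also shows one thing worth checking: if two trips share identical dropoff coordinates, an inner merge on coordinates alone pairs every copy with every copy, so the row count can grow.

```python
import pandas as pd

# Hypothetical stand-ins: points returned by a window query, and the full trip table.
window_pts = pd.DataFrame({
    "x": [40.60, 40.57, 40.57],
    "y": [-73.76, -73.85, -73.85],   # last two rows are the same location
})
trips = pd.DataFrame({
    "x": [40.60, 40.57, 40.57, 40.74],
    "y": [-73.76, -73.85, -73.85, -73.98],
    "trip_distance": [12.7, 4.5, 5.1, 2.0],
})

# No `on=`: pandas joins on the shared column names, x and y.
merged = window_pts.merge(trips)
print(len(merged))   # the duplicated point matches 2 x 2 times

# Dropping duplicate keys first keeps one output row per matching trip.
deduped = window_pts.drop_duplicates().merge(trips)
print(len(deduped))
```

Comparing the merged row count against the count returned by the window query is a cheap way to spot this kind of duplication.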

Let's also do some time analysis and see the frequency of rides per day.  This requires us to modify the date a bit.

In [52]:
rwys['tpep_pickup_datetime']=rwys['tpep_pickup_datetime'].str.timestamp2int(format="%Y-%m-%d %H:%M:%S")

In [53]:
rwys['tpep_pickup_datetime']=rwys['tpep_pickup_datetime'].astype('datetime64[s]')
rwys['day_of_the_month']=rwys['tpep_pickup_datetime'].dt.day
rwys['time_of_the_day']= rwys['tpep_pickup_datetime'].dt.hour
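
The two cells above parse the timestamp string and pull out the day and hour. The same steps on a single value, using only Python's standard library as a CPU-side illustration (the sample string is taken from the `taxi2015.head()` output):

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"  # same format string used in the cell above

raw = "2015-01-11 00:38:14"
parsed = datetime.strptime(raw, FORMAT)

day_of_the_month = parsed.day   # 1..31
time_of_the_day = parsed.hour   # 0..23 on the 24-hour clock
print(day_of_the_month, time_of_the_day)
```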

In [54]:
# count rides per day of the month (value_counts already returns a new object)
day_freq = rwys['day_of_the_month'].value_counts().reset_index()
day_freq['# of Rides'] = day_freq['day_of_the_month']
day_freq['Day of the Month'] = day_freq['index']
print(day_freq)

    index  day_of_the_month  # of Rides  Day of the Month
0      20               135         135                20
1       8                97          97                 8
2      12                97          97                12
3      13                94          94                13
4      14                84          84                14
5       9                78          78                 9
6      10                63          63                10
7      11                58          58                11
8       7                51          51                 7
9      22                48          48                22
10     21                45          45                21
11      4                44          44                 4
12      6                44          44                 6
13     17                44          44                17
14     30                43          43                30
15      1                41          41                 1
16      5     
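
The column names produced by `value_counts().reset_index()` depend on the library version (in pandas, for example, they changed in 2.0), so the `index` and `day_of_the_month` names used above may not hold everywhere. Here is a sketch that names both columns explicitly, using pandas as a CPU stand-in for cuDF and made-up days; whether your cuDF version supports the same calls is an assumption to verify.

```python
import pandas as pd

# Made-up day-of-month values for a handful of rides.
days = pd.Series([20, 8, 20, 12, 20, 8], name="day_of_the_month")

day_freq = (
    days.value_counts()
        .rename_axis("Day of the Month")   # name the index explicitly
        .reset_index(name="# of Rides")    # name the counts explicitly
)
print(day_freq)
```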

In [55]:
cux = cuxfilter.DataFrame.from_dataframe(day_freq)
chart1 = cuxfilter.charts.bar("Day of the Month", "# of Rides")
d = cux.dashboard([chart1], layout=cuxfilter.layouts.single_feature, theme=cuxfilter.themes.dark, title= 'NYC TAXI DATASET')
chart1.view()

As you can see, there was a spike of rides on the 19th of January.  But where were they all from?

Let's look at a heatmap of `Day of Month` and `Boroughs` filtered on `Time of Day` 

Recall: 
label_map = {
    0:"external", 
    1:"Bronx",
    2:"Staten Island",
    3:"Brooklyn", 
    4:"Queens", 
    5:"Manhattan"
}

In [56]:
rwys = rwys.sort_values(['day_of_the_month'])

In [None]:
from cuxfilter import layouts, themes, DataFrame
from cuxfilter.charts import heatmap
from cuxfilter.charts import stacked_lines

rwys = rwys.sort_values(['day_of_the_month'])
cux = cuxfilter.DataFrame.from_dataframe(rwys)
chart0 = heatmap(y='pickup_borough_integer', x='day_of_the_month', aggregate_fn='count',
                            color_palette=palettes.inferno(256), point_size=20)
chart1 = cuxfilter.charts.multi_select('time_of_the_day')
chart2 = cuxfilter.charts.multi_select('pickup_borough_integer', label_map=label_map)
chart3 = cuxfilter.charts.multi_select('dropoff_borough_integer', label_map=label_map)
d = cux.dashboard([chart0, chart1, chart2, chart3], layout=layouts.single_feature, theme=themes.dark)
d.show(url+':'+port)


You should see that the spike of rides started after 4PM on the 19th.  What's more, they came from pickups outside of the 5 boroughs.  That's interesting, but we'll need to map this out.  For now, let's move on to the other days of high ridership.
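
The heatmap's `aggregate_fn='count'` can be thought of as counting rides in each (day, borough) cell. A minimal pure-Python sketch of that idea with made-up rides (an illustration only, not how cuXfilter computes it), reusing the `label_map` recalled above:

```python
from collections import Counter

label_map = {0: "external", 1: "Bronx", 2: "Staten Island",
             3: "Brooklyn", 4: "Queens", 5: "Manhattan"}

# Made-up rides as (day_of_the_month, pickup_borough_integer) pairs.
rides = [(19, 0), (19, 0), (19, 4), (20, 5), (20, 0), (19, 0)]

# One counter entry per heatmap cell.
cell_counts = Counter(rides)

for (day, borough), n in sorted(cell_counts.items()):
    print(f"day {day:2d}  {label_map[borough]:<13}  {n}")
```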

In [58]:
d.stop()

Let's examine the other days looking at `Time of Day` and `Boroughs` filtered on `Day of Month`

Recall:
label_map = {
    0:"external", 
    1:"Bronx",
    2:"Staten Island",
    3:"Brooklyn", 
    4:"Queens", 
    5:"Manhattan"
}

In [None]:
chart0 = heatmap(y='pickup_borough_integer', x='time_of_the_day', aggregate_fn='count',
                            color_palette=palettes.inferno(256), point_size=20)
chart1 = cuxfilter.charts.multi_select('day_of_the_month')
chart2 = cuxfilter.charts.multi_select('pickup_borough_integer', label_map=label_map)
chart3 = cuxfilter.charts.multi_select('dropoff_borough_integer', label_map=label_map)
d = cux.dashboard([chart0, chart1, chart2, chart3], layout=layouts.single_feature, theme=themes.dark)
d.show(url+':'+port)

In [60]:
d.stop()

On the 24-hour clock, for the 7th, 11th, and 12th, it seems like most of the rides to the Rockaways are at midday from Queens, at the 9PM hour.  Let's verify this and see where those rides come from.  We'll do this by once again using the `scatter_geo` plot with some `multi_select` filters.

### Now to create the charts

In [61]:
rwys['rwys_x'], rwys['rwys_y'] = makeXfilter(rwys['pickup_latitude'], rwys['pickup_longitude'])
cux = cuxfilter.DataFrame.from_dataframe(rwys)
chart1 = cuxfilter.charts.scatter_geo(x='rwys_x',
                                      y='rwys_y',
                                      title='Where are the drop offs coming from?',
                                      add_interaction=True,
                                      color_palette=palettes.inferno(256),  # use bokeh color palettes for colors that pop
                                      aggregate_col='time_of_the_day', aggregate_fn='max',
                                      tile_provider=tile_provider, x_range=(-8267428.97,-8207328.23), y_range=(4935861.67,5000548.55))
chart2 = cuxfilter.charts.multi_select('pickup_borough_integer', label_map=label_map)
chart3 = cuxfilter.charts.multi_select('dropoff_borough_integer', label_map=label_map)
chart4 = cuxfilter.charts.multi_select('day_of_the_month')
chart5 = cuxfilter.charts.multi_select('time_of_the_day')

              x             y
0 -8.232332e+06  4.979535e+06
1 -8.235610e+06  4.974270e+06
2 -8.231901e+06  4.979102e+06
3 -8.234700e+06  4.975193e+06
4 -8.232094e+06  4.969006e+06
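
The projected values above, like the `pu_x`/`pu_y` columns in the earlier `taxi2015.head()` output, are consistent with Web Mercator (EPSG:3857) metres. `makeXfilter` is defined earlier in the notebook; the function below is not that helper, just a hypothetical pure-Python sketch of the standard spherical Web Mercator formula, checked against the first row of that output.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# First row of the taxi2015.head() output: pickup at (-73.980522, 40.744232),
# listed there with pu_x = -8235474.0 and pu_y = 4974689.0.
x, y = lonlat_to_web_mercator(-73.980522, 40.744232)
print(round(x), round(y))
```

This is also why the `x_range` and `y_range` arguments are given in metres rather than degrees.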


In [None]:
d = cux.dashboard([chart1, chart2, chart3, chart4, chart5], layout=cuxfilter.layouts.single_feature, theme=cuxfilter.themes.dark, title= 'NYC TAXI DATASET')
d.show(url+':'+port)

Many of the pick up points seem to be at LaGuardia and JFK airports.

In [63]:
d.stop()

And only because SOMEONE will ask "why didn't we show 'day of the month' and 'time of the day' on the large visualization?"...

In [64]:
taxi2015['tpep_pickup_datetime']=taxi2015['tpep_pickup_datetime'].str.timestamp2int(format="%Y-%m-%d %H:%M:%S")

In [65]:
taxi2015['tpep_pickup_datetime']=taxi2015['tpep_pickup_datetime'].astype('datetime64[s]')
taxi2015['day_of_the_month']=taxi2015['tpep_pickup_datetime'].dt.day
taxi2015['time_of_the_day']= taxi2015['tpep_pickup_datetime'].dt.hour

In [66]:
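cux = cuxfilter.DataFrame.from_dataframe(taxi2015)  # rebuild from the full dataset; cux was last built from rwys
chart0 = cuxfilter.charts.scatter_geo(x='pu_x',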
                                      y='pu_y',
                                      title='NYC Taxi Pickups',
                                      aggregate_fn='count',
                                      tile_provider=tile_provider, x_range=(-8267428.97,-8207328.23), y_range=(4935861.67,5000548.55))
chart1 = cuxfilter.charts.scatter_geo(x='do_x',
                                      y='do_y',
                                      title='NYC Taxi Dropoffs',
                                      aggregate_fn='count',
                                      tile_provider=tile_provider, x_range=(-8267428.97,-8207328.23), y_range=(4935861.67,5000548.55))
chart2 = cuxfilter.charts.multi_select('pickup_borough_integer', label_map=label_map)
chart3 = cuxfilter.charts.multi_select('dropoff_borough_integer', label_map=label_map)
chart4 = cuxfilter.charts.multi_select('day_of_the_month')
chart5 = cuxfilter.charts.multi_select('time_of_the_day')

In [None]:
d = cux.dashboard([chart0, chart1, chart2, chart3, chart4, chart5], layout=cuxfilter.layouts.feature_and_base, theme=cuxfilter.themes.dark, title= 'NYC TAXI DATASET')
d.show(url+':'+port)

In [68]:
d.stop()