See the instructions to [download and run this sample locally](https://developers.arcgis.com/python/sample-notebooks/#Download-and-run-the-sample-notebooks) on your computer.

# Analyzing New York City taxi data using big data tools

ArcGIS Enterprise 10.5 introduces ArcGIS GeoAnalytics Server, which lets you perform big data analysis on your own infrastructure. This sample demonstrates the steps involved in performing an aggregation analysis on New York City taxi point data using the ArcGIS API for Python.

The data used in this sample can be downloaded from the [NYC Taxi & Limousine Commission website](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). This sample uses data for January and February 2015, with each month averaging 12 million records.

**Note**: Big data analysis is available only on ArcGIS Enterprise 10.5 licensed with a GeoAnalytics Server. It is not yet available on ArcGIS Online.

## The NYC taxi data

To give you an overview, let us take a look at a subset with 2000 points published as a feature service.

In [1]:
from arcgis import GIS
# Connect to ArcGIS Online anonymously
ago_gis = GIS()
search_subset = ago_gis.content.search("NYC_taxi_subset", item_type = "Feature Service")
subset_item = search_subset[0]
subset_item

Let us bring up a map to display the data.

In [2]:
subset_map = ago_gis.map("New York, NY", zoomlevel = 11)
subset_map

In [3]:
subset_map.add_layer(subset_item)

Let us access the feature layers and their attribute table to understand the structure of our data.

In [4]:
import pandas as pd
from pandas.io.json import json_normalize
from arcgis.lyr import FeatureLayer

subset_feature_layer = subset_item.layers[0]

# Query the attribute information. OBJECTID < 5 limits the result to the first 4 rows.
query_result = subset_feature_layer.query(where='OBJECTID < 5',
                                          out_fields="*",
                                          returnGeometry=False)

att_data_frame = json_normalize(query_result)
att_data_frame.columns = att_data_frame.columns.str.replace("attribute..","")
att_data_frame

| | Field1 | OBJECTID | RateCodeID | VendorID | dropoff_latitude | dropoff_longitude | extra | fare_amount | improvement_surcharge | mta_tax | ... | payment_type | pickup_latitude | pickup_longitude | store_and_fwd_flag | tip_amount | tolls_amount | total_amount | tpep_dropoff_datetime | tpep_pickup_datetime | trip_distance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3479320 | 1 | 1 | 2 | 40.782318 | -73.980492 | 0.0 | 9.5 | 0.3 | 0.5 | ... | 1 | 40.778149 | -73.956291 | N | 2.1 | 0 | 12.4 | 1422268943000 | 1422268218000 | 1.76 |
| 1 | 8473342 | 2 | 1 | 2 | 40.769756 | -73.9506 | 0.5 | 13.5 | 0.3 | 0.5 | ... | 2 | 40.729458 | -73.983864 | N | 0.0 | 0 | 14.8 | 1422137577000 | 1422136892000 | 3.73 |
| 2 | 10864374 | 3 | 1 | 2 | 40.75304 | -73.98568 | 0.0 | 14.5 | 0.3 | 0.5 | ... | 2 | 40.74374 | -73.987617 | N | 0.0 | 0 | 15.3 | 1422719906000 | 1422718711000 | 2.84 |
| 3 | 7350094 | 4 | 1 | 2 | 40.765743 | -73.954994 | 0.0 | 11.5 | 0.3 | 0.5 | ... | 2 | 40.757507 | -73.981682 | N | 0.0 | 0 | 12.3 | 1420907558000 | 1420906601000 | 2.18 |
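
The `tpep_pickup_datetime` and `tpep_dropoff_datetime` values in the output above are large integers. Assuming they are milliseconds since the Unix epoch in UTC (the usual encoding for date fields returned by ArcGIS feature services), a minimal sketch for converting them and deriving a trip duration could look like the following. The helper name `epoch_ms_to_utc` is illustrative and not part of the ArcGIS API.

```python
from datetime import datetime, timezone

def epoch_ms_to_utc(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Values copied from the first row of the output above
pickup_ms = 1422268218000
dropoff_ms = 1422268943000

pickup = epoch_ms_to_utc(pickup_ms)
dropoff = epoch_ms_to_utc(dropoff_ms)

# Trip duration in minutes, derived from the two timestamps
duration_min = (dropoff - pickup).total_seconds() / 60

print(pickup.isoformat(), dropoff.isoformat(), round(duration_min, 1))
```

The datetimes are in UTC under that assumption, so convert them to New York local time before analyzing patterns by hour of day.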


The table above shows the attribute information available in the NYC dataset. Columns such as pickup and dropoff locations, fare, tips, tolls, and trip distance provide a wealth of information, allowing many interesting patterns to be observed. Our full dataset contains over 24 million points. To discern patterns in it, let us aggregate the points into square blocks 1 kilometer on a side.
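
To make the idea of square-block aggregation concrete, here is a minimal pure-Python sketch, not the GeoAnalytics implementation. It projects longitude/latitude pairs to Web Mercator (EPSG:3857) with the standard spherical formula and counts points per 1,000 m cell. The function names and the cell-index scheme are assumptions made for illustration.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 longitude/latitude in degrees to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def bin_points(points, cell_size_m=1000.0):
    """Count points per square cell; keys are (column, row) cell indices."""
    counts = Counter()
    for lon, lat in points:
        x, y = to_web_mercator(lon, lat)
        cell = (math.floor(x / cell_size_m), math.floor(y / cell_size_m))
        counts[cell] += 1
    return counts

# Pickup coordinates (longitude, latitude) copied from the four sample rows above
pickups = [
    (-73.956291, 40.778149),
    (-73.983864, 40.729458),
    (-73.987617, 40.74374),
    (-73.981682, 40.757507),
]

bins = bin_points(pickups)
for cell, count in sorted(bins.items()):
    print(cell, count)
```

Web Mercator stretches distances away from the equator, so a 1,000 m cell in projected units covers less than 1 km on the ground at New York's latitude.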

## Creating a data store

For the GeoAnalytics Server to process your big data, the data must be registered as a data store. In our case, the data is in multiple CSV files, so we will register the folder containing the files as a data store of type `bigDataFileShare`.

Let us connect to an ArcGIS Enterprise portal.

In [5]:
gis = GIS("http://yourportal.domain.com/webcontext", "username","password")

The `datastore` property of `GIS` provides you with a `DatastoreManager` object. This object allows you to query, inspect and manipulate the datastores available to your ArcGIS Server.

In [6]:
# Query the data stores available
data_mgr = gis.datastore
data_mgr.search()

[<Datastore title:"/bigDataFileShares/Fortune_500" type:"bigDataFileShare">,
 <Datastore title:"/enterpriseDatabases/AGSDataStore_ds_t6qywzm8" type:"egdb">,
 <Datastore title:"/fileShares/_raster_store" type:"folder">,
 <Datastore title:"/nosqlDatabases/AGSDataStore_bigdata_bds_jn7cdee2" type:"nosql">,
 <Datastore title:"/nosqlDatabases/AGSDataStore_nosqldb_tcs_5p0kacid" type:"nosql">]
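
The titles in the listing above follow a `/<category>/<name>` pattern. As a small illustration, the sketch below checks a list of such title strings for a big data file share with a given name. The titles are copied from the output above. The helper `has_big_data_file_share` is hypothetical, and it works on plain strings rather than on `Datastore` objects.

```python
def has_big_data_file_share(titles, name):
    """Return True if a title of the form /bigDataFileShares/<name> is present."""
    target = "/bigDataFileShares/" + name
    return target in titles

# Title strings copied from the datastore listing above
titles = [
    "/bigDataFileShares/Fortune_500",
    "/enterpriseDatabases/AGSDataStore_ds_t6qywzm8",
    "/fileShares/_raster_store",
    "/nosqlDatabases/AGSDataStore_bigdata_bds_jn7cdee2",
    "/nosqlDatabases/AGSDataStore_nosqldb_tcs_5p0kacid",
]

print(has_big_data_file_share(titles, "Fortune_500"))
print(has_big_data_file_share(titles, "NYCdata"))
```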

There is no `big data file share` for NYC taxi data registered on the server. So let us register one that points to the shared folder containing NYC taxi data.

In [7]:
data_item = data_mgr.add_bigdata("NYCdata",r"\\path_to_your_data")

Big Data file share exists for NYCdata


Once a big data file share is created, the GeoAnalytics server processes all the valid file types to discern the schema of the data. This process can take a few minutes depending on the size of your data. Once processed, querying the `manifest` property returns the schema. As you can see from below, the schema is similar to the subset we observed earlier in this sample.

In [8]:
from IPython.display import display
display(data_item.manifest)

{'datasets': [{'format': {'extension': 'csv',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'quoteChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited'},
   'geometry': {'fields': [{'formats': ['x'], 'name': 'pickup_longitude'},
     {'formats': ['y'], 'name': 'pickup_latitude'}],
    'geometryType': 'esriGeometryPoint',
    'spatialReference': {'wkid': 4326}},
   'name': 'sampled',
   'schema': {'fields': [{'name': 'VendorID',
      'type': 'esriFieldTypeBigInteger'},
     {'name': 'tpep_pickup_datetime', 'type': 'esriFieldTypeString'},
     {'name': 'tpep_dropoff_datetime', 'type': 'esriFieldTypeString'},
     {'name': 'passenger_count', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'trip_distance', 'type': 'esriFieldTypeDouble'},
     {'name': 'pickup_longitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'pickup_latitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'RateCodeID', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'store_and_fwd_flag',
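
A manifest with this shape can be summarized with ordinary dictionary access. The sketch below uses a shortened literal copy of the structure shown above, with only three of the schema fields, and a hypothetical helper `summarize_manifest`. It assumes the manifest is a plain nested `dict`, as the printed output suggests.

```python
# Shortened copy of the manifest structure printed above (three schema fields only)
manifest = {
    "datasets": [
        {
            "name": "sampled",
            "geometry": {
                "fields": [
                    {"formats": ["x"], "name": "pickup_longitude"},
                    {"formats": ["y"], "name": "pickup_latitude"},
                ],
                "geometryType": "esriGeometryPoint",
                "spatialReference": {"wkid": 4326},
            },
            "schema": {
                "fields": [
                    {"name": "VendorID", "type": "esriFieldTypeBigInteger"},
                    {"name": "tpep_pickup_datetime", "type": "esriFieldTypeString"},
                    {"name": "trip_distance", "type": "esriFieldTypeDouble"},
                ]
            },
        }
    ]
}

def summarize_manifest(manifest):
    """Map each dataset name to its geometry settings and field types."""
    summary = {}
    for dataset in manifest.get("datasets", []):
        geometry = dataset.get("geometry", {})
        schema_fields = dataset.get("schema", {}).get("fields", [])
        summary[dataset["name"]] = {
            "geometry_type": geometry.get("geometryType"),
            "wkid": geometry.get("spatialReference", {}).get("wkid"),
            "coordinate_fields": [f["name"] for f in geometry.get("fields", [])],
            "field_types": {f["name"]: f["type"] for f in schema_fields},
        }
    return summary

summary = summarize_manifest(manifest)
print(summary["sampled"]["coordinate_fields"])
print(summary["sampled"]["field_types"])
```

A summary like this makes it easy to confirm which columns supply the point geometry and which columns were detected as strings rather than numbers.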

## Perform data aggregation

When you add a big data file share datastore, a corresponding item gets created on your portal. You can search for it like a regular item and query its layers.

In [9]:
search_result = gis.content.search("",item_type = "big data file share")
search_result

[<Item title:"bigDataFileShares_NYCdata" type:Big Data File Share owner:Admin>,
 <Item title:"bigDataFileShares_Fortune_500" type:Big Data File Share owner:Admin>]

In [10]:
data_item = search_result[0]
data_item.url

'https://portalurl/server/rest/services/DataStoreCatalogs/bigDataFileShares_NYCdata/BigDataCatalogServer'

In [11]:
year_2015 = data_item.layers[0]
year_2015

<Layer url:"https://dev002146.esri.com/server/rest/services/DataStoreCatalogs/bigDataFileShares_NYCdata/BigDataCatalogServer/sampled">

### Aggregate points tool
The `aggregate_points()` tool can be accessed through the `tools.bigdata` property of your GIS. In this example, we use it to aggregate the points into square blocks 1 kilometer on a side. The tool writes its output to a feature service, which can be accessed once processing is complete.

In [12]:
agg_result = gis.tools.bigdata.aggregate_points(year_2015,
                                               "NYC_aggregation_result",
                                               distance_interval = 1,
                                               distance_interval_unit = 'Kilometers',
                                               process_sr = 3857,
                                               out_sr = 3857)

Submitted.
Executing...
Executing (AggregatePoints): AggregatePoints "Feature Set" 1 Kilometers SQUARE # # # # # # # {"itemProperties":{"itemId":"08235fc287a846ccac2ff0c331d3937c"},"serviceProperties":{"serviceUrl":"http://dev002146.esri.com/server/rest/services/Hosted/NYC_aggregation_result/FeatureServer","name":"NYC_aggregation_result"}} # # GDB {"outSR":{"wkid":3857},"processSR":{"wkid":3857}}
Start Time: Thu Sep 15 12:48:49 2016
Using URL based GPRecordSet param: https://dev002146.esri.com/server/rest/services/DataStoreCatalogs/bigDataFileShares_NYCdata/BigDataCatalogServer/sampled
'Input Points' will be projected into the output spatial reference.
Starting new distributed job with 4 tasks.
1/4 distributed tasks completed.
  extent = Some(Envelope: [-74.27270769100203, 0.0, 0.008983152841195214, 40.8687394594905])
  interval = None
  count = 152
Feature service layer created: http://dev002146.esri.com/server/rest/services/Hosted/NYC_aggregation_result/FeatureServer/0
Succeeded at T

### Inspect the results
Let us create a map and load the processed result, which is a feature service.

In [13]:
processed_map = gis.map('New York, NY', 11)
processed_map

In [14]:
sr = gis.content.search("NYC_aggregation_result")
sr

[<Item title:"NYC_aggregation_result" type:Feature Service owner:Admin>]

In [15]:
processed_map.add_layer(sr[0])

Let us create a few more maps and inspect the analysis result using smart mapping. To learn more about this visualization capability, refer to the sample titled 'Smart Mapping' under the section '02 Power Users & Developers'.

In [16]:
processed_layer = sr[0].layers[0]
processed_layer

<FeatureLayer url:"http://dev002146.esri.com/server/rest/services/Hosted/NYC_aggregation_result/FeatureServer/0">

In [17]:
map2 = gis.map("New York, NY", 11)
map2

In [18]:
map2.add_layer({"type": "FeatureLayer",
                "url": processed_layer.url,
                "renderer": "ClassedSizeRenderer",
                "field_name": "MAX_tip_amount",
                "normalizationField": "MAX_trip_distance",
                "classificationMethod": "natural-breaks",
                "opacity": 0.8
               })
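
The renderer options above size each block by `MAX_tip_amount` normalized by `MAX_trip_distance`. As a rough illustration, and assuming that normalizing by a field means dividing one attribute by the other, the sketch below computes that ratio and assigns each block to a size class. The attribute values and class upper bounds are hypothetical. A natural-breaks classification would derive the bounds from the data rather than hard-code them.

```python
from bisect import bisect_left

def normalize(value, by):
    """Return value / by, or None when either value is missing or the divisor is zero."""
    if value is None or not by:
        return None
    return value / by

def class_index(value, upper_bounds):
    """Return the index of the first class whose upper bound is >= value."""
    return min(bisect_left(upper_bounds, value), len(upper_bounds) - 1)

# Hypothetical aggregated attributes for three blocks
blocks = [
    {"MAX_tip_amount": 10.0, "MAX_trip_distance": 4.0},
    {"MAX_tip_amount": 30.0, "MAX_trip_distance": 5.0},
    {"MAX_tip_amount": 2.0, "MAX_trip_distance": 0.0},
]

# Hypothetical class upper bounds for the normalized value
upper_bounds = [1.0, 3.0, 10.0]

classes = []
for block in blocks:
    ratio = normalize(block["MAX_tip_amount"], block["MAX_trip_distance"])
    # Blocks without a usable ratio are left unclassified
    classes.append(None if ratio is None else class_index(ratio, upper_bounds))

print(classes)
```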