# Data Exploration
## Intro
Alright, now we know the ins and outs of Spark!

Spark be like:
<img src="https://68.media.tumblr.com/ca90adfa596dbf8652705a138d065cb8/tumblr_opqf06DGKQ1w2tlv5o1_500.gif" width="400">

And, yes Spark, you would be absolutely correct. I won't pretend to be a Spark expert after scouring the internet for random tutorial videos, but you know what? I feel like I have enough knowledge to get started. So, let's do just that.

Now, having the data on S3 really helps us here, and I'm quickly starting to appreciate the integrated-ness (definitely not a word) of AWS' suite of services. In the case of EMR, S3 integrates directly with it as a distributed storage source (replacing HDFS). All of this is abstracted away from us, and I'll just keep my mind naive for now until I need to deal with it.

First, I have to set up a few packages that will help me integrate Spark with Jupyter. By default, EMR comes with the Zeppelin notebook, which integrates directly with Spark out of the box, but I'm going with Jupyter because it's what I'm more familiar with (Zeppelin actually looks pretty cool though, with built-in interactive visualizations from SQL results!) and it translates better to a blog.

The **findspark** package makes the PySpark libraries importable from our Jupyter notebook so we can create a SparkSession there, and the **sql\_magic** package adds SQL "Magic" commands, that is, the ability to write in-line SQL statements that query our Spark dataframes using Spark SQL. (Back in the [all-nba predict](https://strikingmoose.com/2017/07/28/exploration-of-historical-nba-players-part-ii-scatterplot-matrices/) days, we were using the R magic commands to pass dataframes back and forth between R and Python.)

In [2]:
import os
os.system("sudo pip install findspark sql_magic")

0

## Initiate Spark
Let's use the **findspark** package to connect Jupyter up to Spark. Apparently, the first thing we do in any new Spark application is initiate a SparkSession. When you start a Spark shell from the command line, or open up Zeppelin, a SparkSession is already instantiated for you as the _**spark**_ object. In Jupyter, we can either configure Jupyter on the server side to launch with PySpark, or we can use _**findspark**_, which locates the Spark installation and makes PySpark importable from a plain Python session (in our case, the one Jupyter is running). Going the second route, we have to initiate the SparkSession ourselves. The SparkSession is an object that provides an entry point into all the different APIs (e.g. the Spark SQL API that we will be using later) and also tracks the application being spun up so we can view it and tag it in the Spark History Server console.

In [4]:
# Use findspark package to connect Jupyter to Spark shell
import findspark
findspark.init('/usr/lib/spark')
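
For the curious, here is a rough sketch of what I understand `findspark.init()` to be doing under the hood: pointing `SPARK_HOME` at the Spark install and putting Spark's bundled Python libraries on `sys.path`. The function name `add_pyspark_to_path` is made up for illustration, and this is not findspark's actual code, just the general idea.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Expose Spark's bundled Python libraries to the current interpreter."""
    os.environ["SPARK_HOME"] = spark_home

    # PySpark itself lives under $SPARK_HOME/python
    spark_python = os.path.join(spark_home, "python")
    # py4j (the Python-to-JVM bridge) ships as a zip under python/lib
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))

    new_paths = [spark_python] + py4j_zips
    for path in new_paths:
        if path not in sys.path:
            sys.path.insert(0, path)
    return new_paths
```

After something like `add_pyspark_to_path('/usr/lib/spark')`, an `import pyspark` should resolve against the cluster's own Spark install, which is why the imports in the next cell work.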

In [19]:
# Load SparkSession object
import pyspark
from pyspark.sql import SparkSession

# Load other libraries
from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType
import pandas as pd

In [7]:
# Initiate SparkSession as "spark"
spark = SparkSession\
    .builder\
    .getOrCreate()

## SQL Magic
Okay, now we have a SparkSession initiated as _**spark**_ like we would have if we just opened up a Spark shell. Next up, we have to load the **sql\_magic** package so we have Spark SQL capabilities later on when we play around with the data.

We load the "Magic" package and configure it to be hooked up to Spark. [Reading up on **sql_magic** a little bit](https://github.com/pivotal/sql_magic), it's actually pretty cool because it provides an entry point into multiple technologies: Spark is one of them, but it also supports others, including sqlalchemy connections... We can pull from a database inline with SQL! Amazing!

<img src="https://s-media-cache-ak0.pinimg.com/originals/6b/84/1b/6b841b927f7202ede137263640bb79db.jpg" width="500">

I immediately thought of that, but NO, I WILL STICK TO MY WORD. IT'S GODDAMN AMAZING.

In [29]:
%load_ext sql_magic
%config SQL.conn_name = 'spark'

## Read CSV
Spark 2.0 and on makes it relatively easy to read a CSV. Let's try it:

In [12]:
%%time
# Read NYPD Complaint Data
df = spark.read.csv(
    "s3n://2017edmfasatb/nypd_complaints/data/NYPD_Complaint_Data_Historic.csv", 
    header = True, 
    inferSchema = True
)

CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 35 s


35 seconds to load a ~1.5 GB dataset... a bit on the longer side, I would imagine? The dataset is not _**that big**_, but remember, I came into this knowing that the data set is on the small side to be working with Spark or EMR.

In [9]:
# Describe df
df.printSchema()

root
 |-- CMPLNT_NUM: integer (nullable = true)
 |-- CMPLNT_FR_DT: string (nullable = true)
 |-- CMPLNT_FR_TM: string (nullable = true)
 |-- CMPLNT_TO_DT: string (nullable = true)
 |-- CMPLNT_TO_TM: string (nullable = true)
 |-- RPT_DT: string (nullable = true)
 |-- KY_CD: integer (nullable = true)
 |-- OFNS_DESC: string (nullable = true)
 |-- PD_CD: integer (nullable = true)
 |-- PD_DESC: string (nullable = true)
 |-- CRM_ATPT_CPTD_CD: string (nullable = true)
 |-- LAW_CAT_CD: string (nullable = true)
 |-- JURIS_DESC: string (nullable = true)
 |-- BORO_NM: string (nullable = true)
 |-- ADDR_PCT_CD: integer (nullable = true)
 |-- LOC_OF_OCCUR_DESC: string (nullable = true)
 |-- PREM_TYP_DESC: string (nullable = true)
 |-- PARKS_NM: string (nullable = true)
 |-- HADEVELOPT: string (nullable = true)
 |-- X_COORD_CD: integer (nullable = true)
 |-- Y_COORD_CD: integer (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Lat_Lon: string (nu

We see here that the dataframe has loaded successfully, and that the headers and column types have been inferred somewhat. Let's try to do a head() on the dataframe.

In [36]:
%%time
# See df
df.head(10)

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 9.15 s


[Row(CMPLNT_NUM=101109527, CMPLNT_FR_DT=u'12/31/2015', CMPLNT_FR_TM=u'23:45:00', CMPLNT_TO_DT=None, CMPLNT_TO_TM=None, RPT_DT=u'12/31/2015', KY_CD=113, OFNS_DESC=u'FORGERY', PD_CD=729, PD_DESC=u'FORGERY,ETC.,UNCLASSIFIED-FELO', CRM_ATPT_CPTD_CD=u'COMPLETED', LAW_CAT_CD=u'FELONY', JURIS_DESC=u'N.Y. POLICE DEPT', BORO_NM=u'BRONX', ADDR_PCT_CD=44, LOC_OF_OCCUR_DESC=u'INSIDE', PREM_TYP_DESC=u'BAR/NIGHT CLUB', PARKS_NM=None, HADEVELOPT=None, X_COORD_CD=1007314, Y_COORD_CD=241257, Latitude=40.828848333, Longitude=-73.916661142, Lat_Lon=u'(40.828848333, -73.916661142)'),
 Row(CMPLNT_NUM=153401121, CMPLNT_FR_DT=u'12/31/2015', CMPLNT_FR_TM=u'23:36:00', CMPLNT_TO_DT=None, CMPLNT_TO_TM=None, RPT_DT=u'12/31/2015', KY_CD=101, OFNS_DESC=u'MURDER & NON-NEGL. MANSLAUGHTER', PD_CD=None, PD_DESC=None, CRM_ATPT_CPTD_CD=u'COMPLETED', LAW_CAT_CD=u'FELONY', JURIS_DESC=u'N.Y. POLICE DEPT', BORO_NM=u'QUEENS', ADDR_PCT_CD=103, LOC_OF_OCCUR_DESC=u'OUTSIDE', PREM_TYP_DESC=None, PARKS_NM=None, HADEVELOPT=None, X_

Okay! Not so friendly of an output, but we actually see some data. We've only issued like 2-3 commands at this point, but that was quite the whirlwind in the back-end... what the hell exactly happened there? Let's try to break it down.

## Review
### Initial state
First of all, when we spun up this EMR cluster, we got an architecture like this:

<img src="https://s3.ca-central-1.amazonaws.com/2017edmfasatb/nypd_complaints/images/23_spark_1.png" width="500">

It's not the greatest diagram, but simple is better right? As painful to the eyes as this diagram is, please bear with me haha. The master and 2 workers are the EC2 nodes that our EMR cluster spun up. The S3 stack is our storage, which acts as our distributed file system for our EMR cluster (the S3 being split up into multiple partitions is supposed to act as a distributed storage engine, if my drawing was not clear enough).

The master is where the EMR applications are installed, and whether you're reaching Zeppelin, Spark History Server, or Hue, you would be connecting to the master node. Before I opened this Jupyter notebook, I SSH'ed into the master node and did a simple

> _**sudo pip install jupyter**_

and Jupyter was available on the master node at port 8889 (8888 was used for Zeppelin by default on EMR). Hopefully everything makes sense so far! As I'm working in this notebook right now, I'm working on the master node. The master node is not only providing an interface for me to interact with Spark via Jupyter, but it's also acting as my Spark application's master (the driver, in Spark's terminology): it generates the DAG, pushes the individual tasks defined by the DAG down to the appropriate workers, collects the results back from the workers, and presents them to me when I run an action.

### spark.read.csv
When I read a CSV into a dataframe within Spark, this is the first time I'm interacting with any data on S3 (remember, this is where my raw CSV file is hosted). I'm not 100% sure about this, but I don't think any data is actually kept in memory when you perform spark.read (Spark's dataframe reader). However, I've told spark.read to
1. Look for headers
2. Infer a schema (try to guess which fields are text or numbers etc)

To do that, it has to scan through the entire data set (or at least sample a sizeable portion of it) before it can confidently say that this column is a string and that column is an integer. I think that's why the spark.read function took 35 seconds. At this point, however, no data has actually been loaded into the memory of our workers for us to work with. Spark has simply _**mapped**_ our data structure, enough that we can run a

> _**df.printSchema()**_

command! Note that Spark doesn't actually have to show any data on this command, just the skeleton of the dataframe! This makes me wonder though, because I'm pretty sure that loading this 1.5 GB data set into a Pandas dataframe wouldn't take more than 35 seconds. I could just be talking out of my arse though, maybe a comparison to try another day! For now, we'll say that 35 seconds is fair, as Spark does have to scan the data to infer the schema (and it probably does this in a distributed way).
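
If that schema-inference scan ever becomes a problem, one option is to hand Spark the schema up front so it can skip the inference pass. Below is a minimal sketch, assuming a Spark version whose reader accepts a DDL-formatted schema string (2.3+, as far as I know; older versions would need a `StructType` instead). The helper names `build_ddl` and `read_complaints` are hypothetical, and the column types are simply copied from the `printSchema()` output above. I haven't timed this on the cluster, so treat any speed-up as a guess.

```python
# Column names and types copied from the printSchema() output
COMPLAINT_COLUMNS = [
    ("CMPLNT_NUM", "INT"),
    ("CMPLNT_FR_DT", "STRING"),
    ("CMPLNT_FR_TM", "STRING"),
    ("CMPLNT_TO_DT", "STRING"),
    ("CMPLNT_TO_TM", "STRING"),
    ("RPT_DT", "STRING"),
    ("KY_CD", "INT"),
    ("OFNS_DESC", "STRING"),
    ("PD_CD", "INT"),
    ("PD_DESC", "STRING"),
    ("CRM_ATPT_CPTD_CD", "STRING"),
    ("LAW_CAT_CD", "STRING"),
    ("JURIS_DESC", "STRING"),
    ("BORO_NM", "STRING"),
    ("ADDR_PCT_CD", "INT"),
    ("LOC_OF_OCCUR_DESC", "STRING"),
    ("PREM_TYP_DESC", "STRING"),
    ("PARKS_NM", "STRING"),
    ("HADEVELOPT", "STRING"),
    ("X_COORD_CD", "INT"),
    ("Y_COORD_CD", "INT"),
    ("Latitude", "DOUBLE"),
    ("Longitude", "DOUBLE"),
    ("Lat_Lon", "STRING"),
]


def build_ddl(columns):
    """Turn (name, type) pairs into a DDL schema string like 'a INT, b STRING'."""
    return ", ".join("{0} {1}".format(name, dtype) for name, dtype in columns)


def read_complaints(spark, path):
    """Read the CSV with an explicit schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=build_ddl(COMPLAINT_COLUMNS))
```

Calling `read_complaints(spark, ...)` with the same S3 path as before should return a dataframe with the same column types, without Spark having to guess them.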

### head()
When we try a _**head()**_ command, we start to load _**some**_ data into memory. I've specified head(10), so we're only looking at the first 10 rows of the dataframe, and Spark only has to load a subset of the data. We see that it took around 10 seconds to load the first 10 rows of the dataframe. Is this efficient? For the amount of data we pulled? Probably not. I imagine a head(10) on a normal Pandas dataframe doesn't take more than a few milliseconds! However, we're seeing some of the distributed overhead at work here. It took 10 seconds to do roughly something like this:

<img src="https://s3.ca-central-1.amazonaws.com/2017edmfasatb/nypd_complaints/images/25_spark_3.png" width="500">
<img src="https://s3.ca-central-1.amazonaws.com/2017edmfasatb/nypd_complaints/images/24_spark_2.png" width="500">

In the first image, our client (my laptop) types the _**head()**_ command into the Jupyter notebook (which lives on the Spark Master node). The Spark Master node then creates a DAG, which in this case simply consists of pulling data and presenting it without any processing (head simply retrieves and presents data). The master node figures out:
1. Where the data is stored and how it splits into partitions (with HDFS, the NameNode keeps track of where each block lives; with S3, Spark lists the files and splits them into partitions by size)
2. Which nodes should be executing its own DAG processes on which partitions of data

The second image basically just shows the two nodes at work. They are carrying out the DAG assignments given to them by the Master. In this case, each worker grabs a portion of the data. _**I have to make it extremely clear that I don't actually know that worker \#2 gets rows 1-6 and worker \#1 gets rows 7-10**_. This is purely me trying to visualize _**one possibility**_ in which the DAG could be carried out. At this point, I'm not even sure whether both workers are doing the work. Spark may have some built-in intelligence to say that only a single worker is required here, but for demonstration purposes of distributed computing, I just wanted to visualize a super simple distributed task for myself.
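
To make the "maybe only one worker is needed" idea concrete, here is a toy, pure-Python model of an incremental take: read partitions one at a time and stop as soon as enough rows have been collected. This is only an illustration of the idea. The function `take` and the fake partitions are made up, and real Spark's scheduling is more involved than this.

```python
def take(partitions, n):
    """Toy model: pull rows partition by partition until we have n of them."""
    rows = []
    partitions_read = 0
    for part in partitions:
        if len(rows) >= n:
            break  # enough rows already, the remaining partitions are never touched
        partitions_read += 1
        rows.extend(part[: n - len(rows)])
    return rows, partitions_read


# Three fake partitions of 100 rows each
partitions = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
rows, partitions_read = take(partitions, 10)
print(rows)             # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(partitions_read)  # 1
```

In this toy model a `head(10)` only ever touches the first partition. On the real dataframe, `df.rdd.getNumPartitions()` reports how many partitions Spark split the data into, which is a good first step towards seeing how the work could be divided.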

## Spark SQL
Back to our Spark dataframe for a sec... Usually, with vanilla Python and Pandas, Jupyter will output your dataframe nicely:

In [2]:
pd.DataFrame({
    'a': [1, 2],
    'b': [3, 4]
})

| | a | b |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |


Unfortunately, our Spark _**df.head()**_ command didn't yield that clean of an output. One thing we can do is introduce Spark SQL here. When run through the SQL Magic library, Spark SQL by default executes the distributed Spark DAG and collects the result into a Pandas dataframe. We can then view the Pandas dataframe as a well-formatted table within Jupyter.

We first have to register the Spark dataframe as a "temporary table", which adds the dataframe to the SQLContext entry point for the Spark SQL API so we can perform SQL queries on it.

In [31]:
# Register temporary table
df.createOrReplaceTempView("df")

In [34]:
# Perform the SQL equivalent of "head"
result = %read_sql SELECT * FROM df LIMIT 10;
result

Query started at 10:44:47 PM UTC; Query executed in 0.01 m

| | CMPLNT_NUM | CMPLNT_FR_DT | CMPLNT_FR_TM | CMPLNT_TO_DT | CMPLNT_TO_TM | RPT_DT | KY_CD | OFNS_DESC | PD_CD | PD_DESC | ... | ADDR_PCT_CD | LOC_OF_OCCUR_DESC | PREM_TYP_DESC | PARKS_NM | HADEVELOPT | X_COORD_CD | Y_COORD_CD | Latitude | Longitude | Lat_Lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101109527 | 12/31/2015 | 23:45:00 | | | 12/31/2015 | 113 | FORGERY | 729.0 | FORGERY,ETC.,UNCLASSIFIED-FELO | ... | 44 | INSIDE | BAR/NIGHT CLUB | | | 1007314 | 241257 | 40.828848 | -73.916661 | (40.828848333, -73.916661142) |
| 1 | 153401121 | 12/31/2015 | 23:36:00 | | | 12/31/2015 | 101 | MURDER & NON-NEGL. MANSLAUGHTER | | | ... | 103 | OUTSIDE | | | | 1043991 | 193406 | 40.697338 | -73.784557 | (40.697338138, -73.784556739) |
| 2 | 569369778 | 12/31/2015 | 23:30:00 | | | 12/31/2015 | 117 | DANGEROUS DRUGS | 503.0 | CONTROLLED SUBSTANCE,INTENT TO | ... | 28 | | OTHER | | | 999463 | 231690 | 40.802607 | -73.945052 | (40.802606608, -73.945051911) |
| 3 | 968417082 | 12/31/2015 | 23:30:00 | | | 12/31/2015 | 344 | ASSAULT 3 & RELATED OFFENSES | 101.0 | ASSAULT 3 | ... | 105 | INSIDE | RESIDENCE-HOUSE | | | 1060183 | 177862 | 40.654549 | -73.726339 | (40.654549444, -73.726338791) |
| 4 | 641637920 | 12/31/2015 | 23:25:00 | 12/31/2015 | 23:30:00 | 12/31/2015 | 344 | ASSAULT 3 & RELATED OFFENSES | 101.0 | ASSAULT 3 | ... | 13 | FRONT OF | OTHER | | | 987606 | 208148 | 40.738002 | -73.987891 | (40.7380024, -73.98789129) |
| 5 | 365661343 | 12/31/2015 | 23:18:00 | 12/31/2015 | 23:25:00 | 12/31/2015 | 106 | FELONY ASSAULT | 109.0 | ASSAULT 2,1,UNCLASSIFIED | ... | 71 | FRONT OF | DRUG STORE | | | 996149 | 181562 | 40.665023 | -73.957111 | (40.665022689, -73.957110763) |
| 6 | 608231454 | 12/31/2015 | 23:15:00 | | | 12/31/2015 | 235 | DANGEROUS DRUGS | 511.0 | CONTROLLED SUBSTANCE, POSSESSI | ... | 7 | OPPOSITE OF | STREET | | | 987373 | 201662 | 40.7202 | -73.988735 | (40.720199996, -73.988735082) |
| 7 | 265023856 | 12/31/2015 | 23:15:00 | 12/31/2015 | 23:15:00 | 12/31/2015 | 118 | DANGEROUS WEAPONS | 792.0 | WEAPONS POSSESSION 1 & 2 | ... | 46 | FRONT OF | STREET | | | 1009041 | 247401 | 40.845707 | -73.910398 | (40.845707148, -73.910398033) |
| 8 | 989238731 | 12/31/2015 | 23:15:00 | 12/31/2015 | 23:30:00 | 12/31/2015 | 344 | ASSAULT 3 & RELATED OFFENSES | 101.0 | ASSAULT 3 | ... | 48 | INSIDE | RESIDENCE - APT. HOUSE | | | 1014154 | 251416 | 40.856711 | -73.8919 | (40.856711291, -73.891899956) |
| 9 | 415095955 | 12/31/2015 | 23:10:00 | 12/31/2015 | 23:10:00 | 12/31/2015 | 341 | PETIT LARCENY | 338.0 | LARCENY,PETIT FROM BUILDING,UN | ... | 19 | INSIDE | DRUG STORE | | | 994327 | 218211 | 40.765618 | -73.963623 | (40.765617688, -73.96362342) |
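
If you'd rather not go through SQL Magic, the same nicely formatted preview can be had straight from the dataframe API. A minimal sketch (the helper name `preview` is mine): `limit` keeps the work small and distributed, and `toPandas` collects the result onto the master as a Pandas dataframe, so it should only be used on results small enough to fit in the master's memory.

```python
def preview(df, n=10):
    """Collect the first n rows of a Spark dataframe as a Pandas dataframe."""
    # limit() is lazy; toPandas() is the action that actually triggers the Spark job
    return df.limit(n).toPandas()
```

In the notebook, ending a cell with `preview(df)` should render the same kind of table as the SQL cell above.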


One thing to note is that, whether we performed a _**df.head(10)**_ or a SQL _**SELECT * ... LIMIT 10**_, we are essentially _**trying to execute the SAME query**_!

Spark works in a way such that, once a DAG is created, it runs _**the same way every time**_. There are simply multiple ways to tell Spark to build this DAG. We can use the Spark Core API (Scala), Pyspark API (Python), or Spark SQL API... No matter what type of input we give Spark, its power comes from contextualizing these entry points as _**simply an API**_. Our pyspark command and Spark SQL command should have resulted in the _**exact same DAG**_ being built, and the two should have _**no**_ difference in performance other than the minor overhead of, for example, converting the SQL input into a DAG!
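
One way to check that claim rather than take my word for it is to ask Spark to print its plan for both versions with `explain()`. A minimal sketch (the helper name `show_plans` is made up, and I'm using `limit(n)` as the dataframe-API counterpart of `head(n)`, which as far as I know is what `head` does before collecting):

```python
def show_plans(spark, df, n=10):
    """Print the physical plans for the dataframe-API and SQL versions of 'first n rows'."""
    # The SQL version needs the dataframe registered under a name it can query
    df.createOrReplaceTempView("df")

    print("DataFrame API plan:")
    df.limit(n).explain()

    print("Spark SQL plan:")
    spark.sql("SELECT * FROM df LIMIT {0}".format(n)).explain()
```

Running `show_plans(spark, df)` should print two plans that look essentially the same, which is the "same DAG, different entry point" idea in action.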

Even with this single command, let's try to illustrate what just happened:

<img src="https://s3.ca-central-1.amazonaws.com/2017edmfasatb/nypd_complaints/images/26_spark_4.png" width="500">
<img src="https://s3.ca-central-1.amazonaws.com/2017edmfasatb/nypd_complaints/images/27_spark_5.png" width="500">

We see how many responsibilities the Master has at all times.

In the first image, the Master is
1. Hosting the Jupyter Notebook
2. Receiving the user command
3. Understanding and translating the input Spark SQL into a DAG
4. Distributing tasks among the workers

In the second image, the Master is
1. Receiving the partitioned pieces from workers
2. Aggregating the results into a single dataframe
3. Converting the Spark dataframe into a Pandas dataframe (which has now switched context to the Python interpreter running on the Master itself)
4. Feeding the result back to the user via Jupyter

And all we did there was... pull 10 lines of data! Now that we have a sense of what the Spark workflow looks like, let's get serious and start exploring the data for realsies.

## Data Cleaning
The first thing I'm noticing here is that most of the fields were inferred correctly - good job Spark! The one thing it didn't get (whether because of the cleanliness of the data or the capabilities of Spark, I'm not sure) is the date and time fields, which all came through as strings.

PySpark is supposed to be fairly close to Python when scripting and to Pandas when working with dataframes. It's not 100%, but it should be close enough. Let's try converting these columns.

In [7]:
# Define lambda functions as UDFs to convert date and time columns
date_func = udf(lambda x: datetime.strptime(x, '%m/%d/%Y'), DateType())
# Note: declaring DateType() here keeps only the date part, so the time of day is dropped
time_func = udf(lambda x: datetime.strptime(x, '%H:%M:%S'), DateType())

# After a bit of tinkering, the UDF above doesn't work well with NAs, so we fill NAs with bogus values
df = df.fillna('01/01/1900', subset = ['CMPLNT_FR_DT', 'CMPLNT_TO_DT', 'RPT_DT'])
df = df.fillna('00:00:00', subset = ['CMPLNT_FR_TM', 'CMPLNT_TO_TM'])

# Perform column conversion
df = df.withColumn('CMPLNT_FR_DT_FORMATTED', date_func(col('CMPLNT_FR_DT')))
df = df.withColumn('CMPLNT_TO_DT_FORMATTED', date_func(col('CMPLNT_TO_DT')))
df = df.withColumn('RPT_DT_FORMATTED', date_func(col('RPT_DT')))
df = df.withColumn('CMPLNT_FR_TM_FORMATTED', time_func(col('CMPLNT_FR_TM')))
df = df.withColumn('CMPLNT_TO_TM_FORMATTED', time_func(col('CMPLNT_TO_TM')))

# View schema after conversion
df.printSchema()

root
 |-- CMPLNT_NUM: integer (nullable = true)
 |-- CMPLNT_FR_DT: string (nullable = false)
 |-- CMPLNT_FR_TM: string (nullable = false)
 |-- CMPLNT_TO_DT: string (nullable = false)
 |-- CMPLNT_TO_TM: string (nullable = false)
 |-- RPT_DT: string (nullable = false)
 |-- KY_CD: integer (nullable = true)
 |-- OFNS_DESC: string (nullable = true)
 |-- PD_CD: integer (nullable = true)
 |-- PD_DESC: string (nullable = true)
 |-- CRM_ATPT_CPTD_CD: string (nullable = true)
 |-- LAW_CAT_CD: string (nullable = true)
 |-- JURIS_DESC: string (nullable = true)
 |-- BORO_NM: string (nullable = true)
 |-- ADDR_PCT_CD: integer (nullable = true)
 |-- LOC_OF_OCCUR_DESC: string (nullable = true)
 |-- PREM_TYP_DESC: string (nullable = true)
 |-- PARKS_NM: string (nullable = true)
 |-- HADEVELOPT: string (nullable = true)
 |-- X_COORD_CD: integer (nullable = true)
 |-- Y_COORD_CD: integer (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Lat_Lon: strin

Looks like it's worked. The last 5 columns are now "date" fields. I also want to extract a few date parts (year and month) from the complaint start date.

In [8]:
# Note: no return type is given, so these UDF columns are typed as strings (the udf default)
extract_year = udf(lambda x: x.year)
extract_month = udf(lambda x: x.month)

df = df.withColumn('CMPLNT_FR_DT_YEAR', extract_year(col('CMPLNT_FR_DT_FORMATTED')))
df = df.withColumn('CMPLNT_FR_DT_MONTH', extract_month(col('CMPLNT_FR_DT_FORMATTED')))
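
As an alternative to the Python UDFs above, Spark SQL has built-in `year()` and `month()` functions, which return integer columns and avoid the round trip through Python. Here is a minimal sketch using `selectExpr` (the helper name `add_year_month` is hypothetical). It is meant to be used instead of the two `withColumn` calls above, not on top of them.

```python
def add_year_month(df, date_col="CMPLNT_FR_DT_FORMATTED"):
    """Add year and month columns using Spark SQL's built-in date functions."""
    return df.selectExpr(
        "*",  # keep every existing column
        "year({0}) AS CMPLNT_FR_DT_YEAR".format(date_col),
        "month({0}) AS CMPLNT_FR_DT_MONTH".format(date_col),
    )
```

The usage would be `df = add_year_month(df)`, after which `df.printSchema()` should list both new columns as integers rather than strings.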

In [9]:
df.createOrReplaceTempView("complaint")

In [18]:
result = %read_sql SELECT * FROM complaint LIMIT 10;
result

Query started at 07:43:37 PM UTC; Query executed in 0.73 m

| | CMPLNT_NUM | CMPLNT_FR_DT | CMPLNT_FR_TM | CMPLNT_TO_DT | CMPLNT_TO_TM | RPT_DT | KY_CD | OFNS_DESC | PD_CD | PD_DESC | ... | Latitude | Longitude | Lat_Lon | CMPLNT_FR_DT_FORMATTED | CMPLNT_TO_DT_FORMATTED | RPT_DT_FORMATTED | CMPLNT_FR_TM_FORMATTED | CMPLNT_TO_TM_FORMATTED | CMPLNT_FR_DT_YEAR | CMPLNT_FR_DT_MONTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101109527 | 12/31/2015 | 23:45:00 | 01/01/1900 | 00:00:00 | 12/31/2015 | 113 | FORGERY | 729.0 | FORGERY,ETC.,UNCLASSIFIED-FELO | ... | 40.828848 | -73.916661 | (40.828848333, -73.916661142) | 2015-12-31 | 1900-01-01 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 1 | 153401121 | 12/31/2015 | 23:36:00 | 01/01/1900 | 00:00:00 | 12/31/2015 | 101 | MURDER & NON-NEGL. MANSLAUGHTER | | | ... | 40.697338 | -73.784557 | (40.697338138, -73.784556739) | 2015-12-31 | 1900-01-01 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 2 | 569369778 | 12/31/2015 | 23:30:00 | 01/01/1900 | 00:00:00 | 12/31/2015 | 117 | DANGEROUS DRUGS | 503.0 | CONTROLLED SUBSTANCE,INTENT TO | ... | 40.802607 | -73.945052 | (40.802606608, -73.945051911) | 2015-12-31 | 1900-01-01 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 3 | 968417082 | 12/31/2015 | 23:30:00 | 01/01/1900 | 00:00:00 | 12/31/2015 | 344 | ASSAULT 3 & RELATED OFFENSES | 101.0 | ASSAULT 3 | ... | 40.654549 | -73.726339 | (40.654549444, -73.726338791) | 2015-12-31 | 1900-01-01 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 4 | 641637920 | 12/31/2015 | 23:25:00 | 12/31/2015 | 23:30:00 | 12/31/2015 | 344 | ASSAULT 3 & RELATED OFFENSES | 101.0 | ASSAULT 3 | ... | 40.738002 | -73.987891 | (40.7380024, -73.98789129) | 2015-12-31 | 2015-12-31 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 5 | 365661343 | 12/31/2015 | 23:18:00 | 12/31/2015 | 23:25:00 | 12/31/2015 | 106 | FELONY ASSAULT | 109.0 | ASSAULT 2,1,UNCLASSIFIED | ... | 40.665023 | -73.957111 | (40.665022689, -73.957110763) | 2015-12-31 | 2015-12-31 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 6 | 608231454 | 12/31/2015 | 23:15:00 | 01/01/1900 | 00:00:00 | 12/31/2015 | 235 | DANGEROUS DRUGS | 511.0 | CONTROLLED SUBSTANCE, POSSESSI | ... | 40.7202 | -73.988735 | (40.720199996, -73.988735082) | 2015-12-31 | 1900-01-01 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 7 | 265023856 | 12/31/2015 | 23:15:00 | 12/31/2015 | 23:15:00 | 12/31/2015 | 118 | DANGEROUS WEAPONS | 792.0 | WEAPONS POSSESSION 1 & 2 | ... | 40.845707 | -73.910398 | (40.845707148, -73.910398033) | 2015-12-31 | 2015-12-31 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 8 | 989238731 | 12/31/2015 | 23:15:00 | 12/31/2015 | 23:30:00 | 12/31/2015 | 344 | ASSAULT 3 & RELATED OFFENSES | 101.0 | ASSAULT 3 | ... | 40.856711 | -73.8919 | (40.856711291, -73.891899956) | 2015-12-31 | 2015-12-31 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
| 9 | 415095955 | 12/31/2015 | 23:10:00 | 12/31/2015 | 23:10:00 | 12/31/2015 | 341 | PETIT LARCENY | 338.0 | LARCENY,PETIT FROM BUILDING,UN | ... | 40.765618 | -73.963623 | (40.765617688, -73.96362342) | 2015-12-31 | 2015-12-31 | 2015-12-31 | 1900-01-01 | 1900-01-01 | 2015 | 12 |
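
One thing the table above gives away: the two `*_TM_FORMATTED` columns read 1900-01-01 on every row. That's because `strptime` with a time-only format fills in a default date of 1900-01-01, and declaring the UDF with `DateType()` then throws away the time of day. Below is a sketch of one possible way around it, assuming we are happy to combine each date and time pair into a single timestamp. The helper names are hypothetical, and returning `None` for missing or malformed values would also remove the need for the bogus fill values.

```python
from datetime import datetime


def parse_date(value, fmt="%m/%d/%Y"):
    """Parse a date string; return None for missing or malformed values."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None


def parse_timestamp(date_value, time_value, fmt="%m/%d/%Y %H:%M:%S"):
    """Combine a date string and a time string into one datetime (or None)."""
    if date_value is None or time_value is None:
        return None
    try:
        return datetime.strptime(date_value + " " + time_value, fmt)
    except ValueError:
        return None


def make_udfs():
    """Wrap the helpers as Spark UDFs (pyspark is imported lazily, only when called)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DateType, TimestampType
    return udf(parse_date, DateType()), udf(parse_timestamp, TimestampType())


# Why the time columns show 1900-01-01: strptime fills in a default date
print(datetime.strptime("23:45:00", "%H:%M:%S"))  # 1900-01-01 23:45:00
```

With `date_udf, ts_udf = make_udfs()`, something like `df.withColumn('CMPLNT_FR_TS', ts_udf(col('CMPLNT_FR_DT'), col('CMPLNT_FR_TM')))` should give a proper timestamp column (`CMPLNT_FR_TS` is just a name I picked). Newer Spark versions also ship built-in `to_date` and `to_timestamp` functions that take a format string, which would avoid Python UDFs altogether.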
