# New York City 311 Data

## Overview

In New York City, citizens with non-emergency complaints (e.g. uncollected trash, rodent infestations) can call 311 to make a Service Request.  These requests are recorded and shared on New York's open data site at https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9.

## High-Level Description

The data runs from 2010 to the present day and is updated daily.  At the time of this writing, there are over 20 million rows, each representing a single service request, and over 40 columns describing aspects of each request, such as the street address being referenced, the type of complaint, the agency responsible, and the date of the service request.

## Bring in Data Dictionary via pandas

We'll use `pandas` to bring in the (somewhat incomplete) data dictionary supplied by the City.

In [1]:
import pandas as pd
data_dict = pd.read_excel("https://nycopendata.socrata.com/api/views/erm2-nwe9/files/68b25fbb-9d30-486a-a571-7115f54911cd?download=true&filename=311_SR_Data_Dictionary_2018.xlsx",
                         sheet_name='Data Dictionary')




Let's take a peek at the data dictionary supplied by NYC.  You'll notice it's far from perfect!

In [2]:
data_dict

Unnamed: 0,Column Name,Description,Expected Values,Notes:
0,Unique Key,Unique identifier of a Service Request (SR) in...,,This is NOT the Service Request (SR) # provide...
1,Created Date,Date SR was created,Date in format MM/DD/YY HH:MM:SS AM/PM,
2,Closed Date,Date SR was closed by responding agency,Date in format MM/DD/YY HH:MM:SS AM/PM,
3,Agency,Acronym of responding City Government Agency,,
4,Agency Name,Full Agency name of responding City Government...,,
5,Complaint Type,This is the fist level of a hierarchy identify...,,
6,Descriptor,"This is associated to the Complaint Type, and...",,
7,Status,Status of SR submitted,"Assigned, Cancelled, Closed, Pending, +",Prior column indicates most frequent
8,Due Date,Date when responding agency is expected to upd...,Date in format MM/DD/YY HH:MM:SS AM/PM,
9,Resolution Action Updated Date,Date when responding agency last updated the SR.,Date in format MM/DD/YY HH:MM:SS AM/PM,


And now, let's bring in the data itself!  I've already downloaded a file that has a million rows.

In [3]:
data311 = pd.read_csv("data/fhrw-4uyv.csv")



Let's take a quick peek at what the data looks like.  Then we'll use pandas to work with it!

In [4]:
data311.head(50)

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,location_city,location,location_address,location_zip,location_state,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih
0,28276154,2014-06-17T15:29:57.000,2014-06-18T00:00:00.000,DOB,Department of Buildings,General Construction/Plumbing,Building Shaking/Vibrating/Structural Stability,,10453.0,17 WEST 182 STREET,...,,POINT (-73.905280267776 40.857277469306),,,,10931.0,6.0,5.0,29.0,29.0
1,28276155,2014-06-17T14:14:37.000,2014-06-27T00:00:00.000,DOB,Department of Buildings,General Construction/Plumbing,Failure To Maintain,,11209.0,7316 5 AVENUE,...,,POINT (-74.022106248876 40.630814858486),,,,17216.0,10.0,2.0,44.0,41.0
2,28276157,2014-06-17T21:30:00.000,2014-06-18T11:18:00.000,DOT,Department of Transportation,Street Light Condition,Street Light Out,,11209.0,95 STREET,...,,POINT (-74.032338267192 40.616588929222),,,,17216.0,10.0,2.0,44.0,41.0
3,28276158,2014-06-17T00:00:00.000,2014-07-17T00:00:00.000,DOHMH,Department of Health and Mental Hygiene,Standing Water,Tires,1-2 Family Dwelling,11356.0,5-23 128 STREET,...,,POINT (-73.83983236841 40.792152555656),,,,14191.0,22.0,3.0,20.0,67.0
4,28276161,2014-06-17T08:56:00.000,2014-06-18T05:05:00.000,DOT,Department of Transportation,Traffic Signal Condition,Veh Signal Lamp,,11432.0,,...,,POINT (-73.795622199878 40.71175740947),,,,24340.0,25.0,3.0,24.0,65.0
5,28276165,2014-06-18T00:14:54.000,2014-06-18T15:45:19.000,NYPD,New York City Police Department,Blocked Driveway,No Access,Street/Sidewalk,11226.0,3009 CLARENDON ROAD,...,,POINT (-73.948561667644 40.643266811829),,,,13510.0,61.0,2.0,26.0,40.0
6,28276167,2014-06-17T20:01:00.000,2014-06-19T12:00:00.000,DSNY,BCC - Brooklyn North,Sanitation Condition,15 Street Cond/Dump-Out/Drop-Off,Street,11211.0,,...,,POINT (-73.944613290729 40.715938351298),,,,17613.0,36.0,2.0,30.0,57.0
7,28276168,2014-06-17T13:59:00.000,2014-06-24T12:00:00.000,DSNY,BCC - Brooklyn North,Sanitation Condition,15 Street Cond/Dump-Out/Drop-Off,Street,11212.0,406 REMSEN AVENUE,...,,POINT (-73.923251548408 40.655738728336),,,,16866.0,61.0,2.0,17.0,40.0
8,28276173,2014-06-17T10:40:00.000,2014-07-20T11:55:00.000,DOT,Department of Transportation,Traffic Signal Condition,Controller,,11234.0,,...,,POINT (-73.917993178671 40.626456458664),,,,13825.0,5.0,2.0,8.0,38.0
9,28276174,2014-06-17T18:50:07.000,2014-09-08T00:00:00.000,DSNY,Department of Sanitation,Graffiti,Graffiti,Mixed Use,11204.0,124 AVENUE O,...,,POINT (-73.980453997038 40.610594441885),,,,16867.0,1.0,2.0,18.0,37.0


We're interested in looking at pothole repair time (from the opening of the ticket to the close) over time:

* Transform the string timestamp for `created_date` to a true datetime data type
* Do the same for `closed_date`
* Create a `time_to_close` variable (a timedelta), which we'll later convert to a number of days
* Isolate just the pothole data
* Trim the columns so that the data is easier to work with / look at / understand
* Do some basic data visualization in matplotlib

In [5]:
data311['created_date'] = pd.to_datetime(data311['created_date'])
data311['closed_date'] = pd.to_datetime(data311['closed_date'])
# Subtracting two datetime columns already yields a timedelta column
data311['time_to_close'] = data311['closed_date'] - data311['created_date']
data311.head()

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,location,location_address,location_zip,location_state,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,time_to_close
0,28276154,2014-06-17 15:29:57,2014-06-18 00:00:00,DOB,Department of Buildings,General Construction/Plumbing,Building Shaking/Vibrating/Structural Stability,,10453.0,17 WEST 182 STREET,...,POINT (-73.905280267776 40.857277469306),,,,10931.0,6.0,5.0,29.0,29.0,0 days 08:30:03
1,28276155,2014-06-17 14:14:37,2014-06-27 00:00:00,DOB,Department of Buildings,General Construction/Plumbing,Failure To Maintain,,11209.0,7316 5 AVENUE,...,POINT (-74.022106248876 40.630814858486),,,,17216.0,10.0,2.0,44.0,41.0,9 days 09:45:23
2,28276157,2014-06-17 21:30:00,2014-06-18 11:18:00,DOT,Department of Transportation,Street Light Condition,Street Light Out,,11209.0,95 STREET,...,POINT (-74.032338267192 40.616588929222),,,,17216.0,10.0,2.0,44.0,41.0,0 days 13:48:00
3,28276158,2014-06-17 00:00:00,2014-07-17 00:00:00,DOHMH,Department of Health and Mental Hygiene,Standing Water,Tires,1-2 Family Dwelling,11356.0,5-23 128 STREET,...,POINT (-73.83983236841 40.792152555656),,,,14191.0,22.0,3.0,20.0,67.0,30 days 00:00:00
4,28276161,2014-06-17 08:56:00,2014-06-18 05:05:00,DOT,Department of Transportation,Traffic Signal Condition,Veh Signal Lamp,,11432.0,,...,POINT (-73.795622199878 40.71175740947),,,,24340.0,25.0,3.0,24.0,65.0,0 days 20:09:00
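
As an alternative to converting the columns after loading, `pd.read_csv` can parse them while reading, via its `parse_dates` argument. The sketch below uses a two-row inline CSV (a made-up stand-in for the real file, with timestamps copied from the rows above) and the hypothetical name `toy`:

```python
import io

import pandas as pd

# Inline stand-in for the real CSV file; only three columns are kept.
csv_text = (
    "unique_key,created_date,closed_date\n"
    "28276154,2014-06-17T15:29:57.000,2014-06-18T00:00:00.000\n"
    "28276157,2014-06-17T21:30:00.000,2014-06-18T11:18:00.000\n"
)

# parse_dates converts the named columns to datetimes during the read.
toy = pd.read_csv(io.StringIO(csv_text),
                  parse_dates=["created_date", "closed_date"])

# Datetime minus datetime gives a timedelta column.
toy["time_to_close"] = toy["closed_date"] - toy["created_date"]
print(toy.dtypes)
print(toy["time_to_close"])
```

Whether this is faster or clearer than a separate `pd.to_datetime` step depends on the file; both approaches should give the same columns.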


In [6]:
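# Drop rows with a negative time-to-close.  The ~(... < 0) form keeps rows
# whose time_to_close is missing (NaT), i.e. tickets with no closed_date.
good_311_data = data311[~(data311['time_to_close'] < pd.to_timedelta(0))]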
good_311_data.head()

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,location,location_address,location_zip,location_state,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,time_to_close
0,28276154,2014-06-17 15:29:57,2014-06-18 00:00:00,DOB,Department of Buildings,General Construction/Plumbing,Building Shaking/Vibrating/Structural Stability,,10453.0,17 WEST 182 STREET,...,POINT (-73.905280267776 40.857277469306),,,,10931.0,6.0,5.0,29.0,29.0,0 days 08:30:03
1,28276155,2014-06-17 14:14:37,2014-06-27 00:00:00,DOB,Department of Buildings,General Construction/Plumbing,Failure To Maintain,,11209.0,7316 5 AVENUE,...,POINT (-74.022106248876 40.630814858486),,,,17216.0,10.0,2.0,44.0,41.0,9 days 09:45:23
2,28276157,2014-06-17 21:30:00,2014-06-18 11:18:00,DOT,Department of Transportation,Street Light Condition,Street Light Out,,11209.0,95 STREET,...,POINT (-74.032338267192 40.616588929222),,,,17216.0,10.0,2.0,44.0,41.0,0 days 13:48:00
3,28276158,2014-06-17 00:00:00,2014-07-17 00:00:00,DOHMH,Department of Health and Mental Hygiene,Standing Water,Tires,1-2 Family Dwelling,11356.0,5-23 128 STREET,...,POINT (-73.83983236841 40.792152555656),,,,14191.0,22.0,3.0,20.0,67.0,30 days 00:00:00
4,28276161,2014-06-17 08:56:00,2014-06-18 05:05:00,DOT,Department of Transportation,Traffic Signal Condition,Veh Signal Lamp,,11432.0,,...,POINT (-73.795622199878 40.71175740947),,,,24340.0,25.0,3.0,24.0,65.0,0 days 20:09:00
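
The filter in cell 6 is written as "not negative" rather than "greater than or equal to zero", and the two are not equivalent when `closed_date` is missing. A minimal sketch on a made-up three-row frame (the name `toy` and its values are illustrative only) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({
    "created_date": pd.to_datetime(
        ["2014-06-17 15:29:57", "2014-06-17 10:00:00", "2014-06-17 09:00:00"]),
    # Second ticket closes before it opens; third ticket is still open.
    "closed_date": pd.to_datetime(
        ["2014-06-18 00:00:00", "2014-06-16 10:00:00", None]),
})
toy["time_to_close"] = toy["closed_date"] - toy["created_date"]

zero = pd.Timedelta(0)

# Comparisons against NaT are False, so negating "< 0" keeps the open ticket...
kept_negated = toy[~(toy["time_to_close"] < zero)]

# ...while ">= 0" silently drops it.
kept_direct = toy[toy["time_to_close"] >= zero]

print(len(kept_negated), len(kept_direct))
```

Which behavior is wanted depends on whether still-open tickets should stay in the analysis.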


Before we start filtering, we want to understand what kinds of street condition reports there are -- maybe potholes aren't the only thing we want to track!  We are going to make a **copy**, not just a **slice** of data -- a whole new data frame we call `street_conditions`.

In [7]:
street_conditions = good_311_data[good_311_data['complaint_type'] == "Street Condition"].copy()
street_conditions.head(50)

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,location,location_address,location_zip,location_state,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,time_to_close
33,28276220,2014-06-17 11:10:52,2014-06-17 13:57:06,DOT,Department of Transportation,Street Condition,Cave-in,Street,10310.0,203 ELIZABETH STREET,...,POINT (-74.119232293179 40.629056135134),,,,10697.0,4.0,1.0,13.0,74.0,0 days 02:46:14
117,28276325,2014-06-16 10:04:26,2014-06-16 10:04:26,DOT,Department of Transportation,Street Condition,Pothole,,10463.0,3616 HENRY HUDSON PARKWAY,...,POINT (-73.911646178492 40.88726556442),,,,11272.0,48.0,5.0,40.0,33.0,0 days 00:00:00
129,28276343,2014-06-17 12:11:07,2014-06-17 12:11:07,DOT,Department of Transportation,Street Condition,Pothole,,11204.0,2066 53 STREET,...,POINT (-73.977410202739 40.622489816417),,,,16867.0,2.0,2.0,18.0,39.0,0 days 00:00:00
138,28276357,2014-06-16 12:16:26,2014-06-16 12:16:26,DOT,Department of Transportation,Street Condition,Pothole,,10306.0,WATERSIDE STREET,...,POINT (-74.099310710279 40.564300536276),,,,10693.0,30.0,1.0,14.0,76.0,0 days 00:00:00
139,28276358,2014-06-17 09:54:21,2014-06-17 09:54:21,DOT,Department of Transportation,Street Condition,Pothole,,11234.0,6813 AVENUE T,...,POINT (-73.911200260156 40.620020364259),,,,13825.0,5.0,2.0,8.0,38.0,0 days 00:00:00
140,28276359,2014-06-17 06:19:21,2014-06-17 06:19:21,DOT,Department of Transportation,Street Condition,Pothole,,11222.0,136 RUSSELL STREET,...,POINT (-73.944389264845 40.725031631493),,,,18182.0,36.0,2.0,38.0,57.0,0 days 00:00:00
141,28276360,2014-06-16 16:05:50,2014-06-16 16:05:50,DOT,Department of Transportation,Street Condition,Pothole,,11232.0,,...,POINT (-74.001189425552 40.660562203785),,,,13515.0,9.0,2.0,7.0,45.0,0 days 00:00:00
142,28276362,2014-06-16 14:21:05,2014-06-16 14:21:05,DOT,Department of Transportation,Street Condition,Pothole,,11210.0,1736 ALBANY AVENUE,...,POINT (-73.936873176083 40.630895721731),,,,17217.0,5.0,2.0,26.0,38.0,0 days 00:00:00
152,28276372,2014-06-17 09:51:17,2014-06-17 09:51:17,DOT,Department of Transportation,Street Condition,Pothole,,11378.0,60-64 71 STREET,...,POINT (-73.887759145655 40.721910733153),,,,14788.0,54.0,3.0,34.0,62.0,0 days 00:00:00
154,28276375,2014-06-16 20:01:29,2014-06-16 20:01:29,DOT,Department of Transportation,Street Condition,Pothole,,11372.0,,...,POINT (-73.894900136103 40.750718186716),,,,14783.0,65.0,3.0,5.0,73.0,0 days 00:00:00
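
To illustrate why `.copy()` is used here, the sketch below builds a made-up three-row frame (`toy` and the `is_pothole` column are hypothetical names), takes a filtered copy, and adds a column to it. The original frame is left untouched, and pandas has no reason to warn about writing to a slice:

```python
import pandas as pd

toy = pd.DataFrame({
    "complaint_type": ["Street Condition", "Graffiti", "Street Condition"],
    "descriptor": ["Pothole", "Graffiti", "Cave-in"],
})

# Boolean filter, then an explicit copy: an independent data frame.
subset = toy[toy["complaint_type"] == "Street Condition"].copy()

# Adding a column to the copy does not affect the original frame.
subset["is_pothole"] = subset["descriptor"] == "Pothole"

print(subset)
print(toy.columns.tolist())
```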


Let's look at unique values for `descriptor`:

In [8]:
street_conditions['descriptor'].unique()

array(['Cave-in', 'Pothole', 'Wear & Tear',
       'Rough, Pitted or Cracked Roads', 'Failed Street Repair',
       'Guard Rail - Street', 'Defective Hardware',
       'Line/Marking - Faded', 'Blocked - Construction',
       'General Bad Condition', 'Plate Condition - Noisy',
       'Plate Condition - Shifted', 'Line/Marking - After Repaving',
       'Defacement', 'Plate Condition - Open', 'Unsafe Worksite',
       'Crash Cushion Defect', 'Plate Condition - Anti-Skid', 'Hummock',
       'Depression Maintenance', 'Maintenance Cover',
       'Dumpster - Construction Waste', 'Dumpster - Causing Damage',
       'Suspected Street Cut', 'Strip Paving'], dtype=object)

Lots to choose from!  What about counts of each type?  This might help us decide on whether there are enough potholes alone to merit study, or if we should include all street conditions complaints.


In [9]:
street_conditions['descriptor'].value_counts()

Pothole                           21827
Cave-in                            5535
Rough, Pitted or Cracked Roads     2876
Defective Hardware                 2693
Failed Street Repair               2534
Blocked - Construction             1318
Line/Marking - Faded               1222
Plate Condition - Noisy             835
Wear & Tear                         598
Plate Condition - Shifted           385
Line/Marking - After Repaving       216
Plate Condition - Open              158
Dumpster - Construction Waste        91
Hummock                              60
Unsafe Worksite                      45
Guard Rail - Street                  44
Defacement                           41
Crash Cushion Defect                 29
Plate Condition - Anti-Skid           9
Maintenance Cover                     9
General Bad Condition                 6
Dumpster - Causing Damage             6
Depression Maintenance                4
Suspected Street Cut                  2
Strip Paving                          1
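
Raw counts can also be expressed as shares of the total by passing `normalize=True` to `value_counts`. A minimal sketch on a made-up five-item series (not the real data):

```python
import pandas as pd

# Hypothetical descriptors standing in for street_conditions['descriptor'].
descriptors = pd.Series(
    ["Pothole", "Pothole", "Pothole", "Cave-in", "Cave-in"])

counts = descriptors.value_counts()
shares = descriptors.value_counts(normalize=True)

print(counts)
print(shares)
```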


With ~22k pothole complaints among 1M rows, there are plenty of potholes for us to do data analysis on. Let's use the Socrata API to bring in the pothole data (up to 2 million rows, but we won't get that many) from the many millions of rows of NYC 311 data!  We'll also free up some memory by removing the data frames we won't use again, and running garbage collection (gc):

In [10]:
import gc
del data311, street_conditions
gc.collect()

170

In [11]:
potholes = pd.read_csv("https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?descriptor=Pothole&$limit=2000000")
potholes['created_date'] = pd.to_datetime(potholes['created_date'])
potholes['closed_date'] = pd.to_datetime(potholes['closed_date'])
# Subtracting two datetime columns already yields a timedelta column
potholes['time_to_close'] = potholes['closed_date'] - potholes['created_date']
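
The query string used above can also be assembled from a dictionary, which makes it easier to change the filter or the row limit. This is only a sketch: the endpoint and the two parameters come from the cell above, while the variable names are made up for illustration. It builds the URL without making a network request.

```python
from urllib.parse import urlencode

base_url = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"

params = {
    "descriptor": "Pothole",   # simple equality filter on a column
    "$limit": 2000000,         # maximum number of rows to return
}

# safe="$" keeps the dollar sign literal instead of encoding it as %24.
url = base_url + "?" + urlencode(params, safe="$")
print(url)
```

The resulting string can be passed to `pd.read_csv` exactly as in the cell above.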

How many pothole complaints do we have?

In [12]:
potholes.count()

unique_key                        563770
created_date                      563770
closed_date                       560867
agency                            563770
agency_name                       563770
complaint_type                    563770
descriptor                        563770
location_type                       1875
incident_zip                      519887
incident_address                  353209
street_name                       353209
cross_street_1                    458762
cross_street_2                    458697
intersection_street_1             201410
intersection_street_2             201409
address_type                      546194
city                              523306
landmark                               0
facility_type                          0
status                            563770
due_date                            1872
resolution_description            563234
resolution_action_updated_date    563578
community_board                   563770
bbl             

I'm very interested in the number of days it takes for potholes to be fixed, so I'll create a new variable that says (in a numeric data type, not a timedelta) how many days are in the timedelta `time_to_close`.

In [13]:
from datetime import datetime, timedelta
potholes['days_to_close'] = potholes['time_to_close'] / timedelta(days=1)

In [14]:
potholes.head()

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,bridge_highway_segment,latitude,longitude,location_city,location,location_address,location_zip,location_state,time_to_close,days_to_close
0,34690422,2016-11-01 15:01:46,2016-11-02 09:45:00,DOT,Department of Transportation,Street Condition,Pothole,,,CONEY ISLAND AVENUE,...,,,,,,,,,0 days 18:43:14,0.780023
1,42107874,2019-04-01 22:22:27,2019-04-02 10:51:00,DOT,Department of Transportation,Street Condition,Pothole,,10306.0,355 EDISON STREET,...,,40.572961,-74.113157,,POINT (-74.113156832531 40.572961322519),,,,0 days 12:28:33,0.519826
2,24766901,2013-01-09 11:20:10,2013-01-10 14:04:00,DOT,Department of Transportation,Street Condition,Pothole,,,BARUCH DRIVE,...,,,,,,,,,1 days 02:43:50,1.113773
3,24767098,2013-01-10 14:45:07,2013-01-11 10:20:00,DOT,Department of Transportation,Street Condition,Pothole,,11101.0,10 STREET,...,,,,,,,,,0 days 19:34:53,0.815891
4,42082809,2019-03-29 07:05:49,2019-03-30 20:00:00,DOT,Department of Transportation,Street Condition,Pothole,,10025.0,122 MANHATTAN AVENUE,...,,40.7982,-73.961809,,POINT (-73.961809120646 40.798199855119),,,,1 days 12:54:11,1.537627
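
As a quick sanity check on the conversion, the sketch below takes two `time_to_close` values from the output above and converts them to fractional days in two ways: dividing by a one-day timedelta (as in cell 13, here with `pd.Timedelta`) and dividing total seconds by 86,400. The variable names are illustrative.

```python
import pandas as pd

# Two values copied from the time_to_close column shown above.
time_to_close = pd.Series(
    pd.to_timedelta(["0 days 18:43:14", "1 days 02:43:50"]))

# Timedelta divided by timedelta gives a plain float (number of days).
days_by_division = time_to_close / pd.Timedelta(days=1)

# Equivalent route: total seconds divided by seconds per day.
days_by_seconds = time_to_close.dt.total_seconds() / 86400

print(days_by_division.round(6).tolist())
print(days_by_seconds.round(6).tolist())
```

Both routes should agree with the `days_to_close` values in the table (0.780023 and 1.113773).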


## Visualization With Matplotlib

Let's try to get some understanding of pothole repair times using matplotlib.  We'll begin by just plotting the number of days it took to fix potholes as a function of the initial complaint date.  We're going to use `matplotlib` as a tool of exploratory data analysis (EDA), not as the engine of perfectly beautiful data visualizations.

In [16]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10,15))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
#plt.xlim(datetime(2014,6,1), datetime(2019,8,1))
ax.plot(potholes['created_date'], 
        potholes['days_to_close'], 
        marker='o', 
        linestyle='',  
        alpha = 0.5)

plt.show()

<Figure size 1000x1500 with 1 Axes>

The existence of some outliers makes the overall trend (maybe a decline, over time, of the time it took to fix potholes?) hard to see.  If we had bins, say, by month or year, we could do box plots to help us understand statistical trends, but we're not advanced enough yet to do that kind of feature engineering!  But we do have some bins already -- the borough bins!  Let's do a boxplot using `pandas` -- it's far easier to plot categorical data in `pandas` than in `matplotlib`.

## Visualization With Pandas


In [None]:
potholes.boxplot("days_to_close", by='borough')

It's hard to know if there is a statistical difference between boroughs, because, again, the outliers make it hard to compare the actual box, which is down around 0.  Let's change some parameters!

In [None]:
potholes.boxplot("days_to_close", by='borough', showfliers=False, figsize = (20,6))

What's interesting here is that the IQR (interquartile range, or middle 50%) for Manhattan and Queens is much smaller than that of other boroughs, such as Staten Island.  Maybe Manhattanites are very demanding, so there's rapid response all the time.  This gives me some ideas about research questions, such as:

* Are Manhattan complaints in general dealt with more quickly than complaints from other boroughs?
* Are Queens complaints in general dealt with more quickly than complaints from other boroughs?
* Are potholes different from other kinds of complaints?  E.g. is the rapid response for Queens potholes due not to an overall pro-Queens bias but to some other reason?
* How have different boroughs been treated over time?  It seems like the response time for potholes might have improved over time -- is that true for all boroughs?
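
The IQR comparison read off the boxplot can also be computed directly with a `groupby`. The sketch below uses made-up numbers (the frame `toy`, its borough labels, and its values are all hypothetical) to show the pattern:

```python
import pandas as pd

toy = pd.DataFrame({
    "borough": ["MANHATTAN"] * 5 + ["STATEN ISLAND"] * 5,
    "days_to_close": [1, 2, 3, 4, 5, 0, 5, 10, 15, 20],
})

# First and third quartiles per borough, one column per quantile.
quartiles = (toy.groupby("borough")["days_to_close"]
                .quantile([0.25, 0.75])
                .unstack())

# Interquartile range: spread of the middle 50% of repair times.
iqr = quartiles[0.75] - quartiles[0.25]
print(iqr)
```

On the real data, the same two lines applied to `potholes` would give one IQR per borough to compare against the boxplot.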

Let's see what else we can learn through graphical EDA using Seaborn!

## Visualization With Seaborn

In [None]:
import seaborn as sns
sns.set(style="whitegrid")

# Draw a scatter plot of repair time against complaint date,
# coloring each point by borough
f, ax = plt.subplots(figsize=(20, 10))
sns.scatterplot(x="created_date", y="days_to_close",
                hue="borough", linewidth=0,
                data=potholes, ax=ax)

This is weird!  Why does my plot start at 2000, when my data does not? 

* https://stackoverflow.com/questions/54050472/seaborn-scatterplot-datetime-xaxis-too-wide
* https://github.com/mwaskom/seaborn/issues/1641#issuecomment-452078518
 
Long story short, this is a `matplotlib` problem, and there's no simple fix.  We can, however, set the x-axis limits manually, choosing the start and end dates we want to see:

In [None]:
sns.set(style="whitegrid")

# Draw a scatter plot of repair time against complaint date,
# coloring each point by borough
f, ax = plt.subplots(figsize=(20, 20))
plt.xlim(datetime(2009,6,1), datetime(2019,8,1))
sns.scatterplot(x="created_date", y="days_to_close",
                hue="borough", linewidth=0,
                data=potholes, ax=ax)
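
Rather than hard-coding the start and end dates, the limits could be derived from the data itself. This is a sketch under assumptions: the three dates, the 30-day padding, and the variable names are all made up, and the plotting call is left as a comment so the example runs without a figure.

```python
import pandas as pd

# Hypothetical stand-in for potholes['created_date'].
created = pd.Series(pd.to_datetime(["2010-01-04", "2014-06-17", "2019-04-01"]))

# Pad both ends so the first and last points are not on the plot edge.
padding = pd.Timedelta(days=30)
x_min = created.min() - padding
x_max = created.max() + padding

# With a real Axes object, these would be passed as: ax.set_xlim(x_min, x_max)
print(x_min, x_max)
```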

Again, those darn outliers prevent us from seeing overall trends.  Let's remove outliers (maybe `days_to_close` greater than 30):

In [None]:
f, ax = plt.subplots(figsize=(20, 20))
plt.xlim(datetime(2009,6,1), datetime(2019,8,1))
sns.scatterplot(x="created_date", y="days_to_close",
                hue="borough", linewidth=0,
                data=potholes[potholes['days_to_close'] <= 30], ax=ax)

Wow, this is interesting!  There are some line artifacts -- diagonal in some views, or so steep they look vertical, as in the above graph.  What the heck is that?  I suspect this has to do with batches of complaints being formally closed for a given borough when work crews concentrate in a specific area.  So, in a couple of days, a lot of open tickets get closed out all at once in one borough (giving us a stripe of that borough's color), and then the complaints build up there for a while before all getting resolved in another work sprint lasting a few days.  Meanwhile, complaints come in fairly evenly across all days.

How could we test this theory?
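
One possible first check, offered as a sketch rather than a definitive test, is to count how many tickets are closed per borough per calendar day. If crews work in batches, a few days should account for a disproportionate share of closures. The frame `toy` and its dates are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "borough": ["QUEENS"] * 4 + ["BRONX"] * 2,
    "closed_date": pd.to_datetime([
        "2019-03-10 09:00", "2019-03-10 11:30", "2019-03-10 15:00",
        "2019-03-20 10:00", "2019-03-05 12:00", "2019-03-12 12:00",
    ]),
})

# Strip the time of day so closures on the same date fall in one group.
closed_day = toy["closed_date"].dt.normalize().rename("closed_day")

# Number of tickets closed per borough per day.
closures_per_day = toy.groupby(["borough", closed_day]).size()

# Days with the most closures; batch work would show up at the top.
busiest = closures_per_day.sort_values(ascending=False).head(3)
print(busiest)
```

Running the same grouping on `created_date` would show whether complaints really do arrive more evenly than they are closed.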

Also, we have some negative time-to-close.  We should probably remove them as we advance in our data analysis.

What have we learned?

* There are differences in pothole resolution time between boroughs
* There are spikes in pothole resolution time (due to weather? Crew locations?)
* There are more things we'd like to do once we can bin time into months or years
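
As a preview of that last point, here is a minimal sketch of what binning by year or month might look like in pandas. The frame `toy`, its values, and the new column names are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "created_date": pd.to_datetime(["2013-01-09", "2013-06-01", "2019-04-01"]),
    "days_to_close": [1.0, 3.0, 0.5],
})

# Year as an integer bin; month as a pandas Period such as 2013-01.
toy["created_year"] = toy["created_date"].dt.year
toy["created_month"] = toy["created_date"].dt.to_period("M")

# Median repair time per year, a summary less sensitive to outliers.
median_by_year = toy.groupby("created_year")["days_to_close"].median()
print(median_by_year)
```

With a column like `created_year` in place, the earlier `boxplot(..., by=...)` call could group by year instead of borough.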