# New York City 311 Data

## Overview

In New York City, people with non-emergency complaints (e.g., missed trash collection, rodent infestations) can call 311 to make a Service Request.  These requests are recorded and shared on New York's open data site at https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9.

## High-Level Description

The data runs from 2010 to the present and is updated daily.  At the time of this writing, there are over 20 million rows, each representing a single service request, and over 40 columns describing aspects of each request, such as the street address being referenced, the type of complaint, the agency responsible, and the date of the request.

## Bring in Data via pandas

I'm going to bring in only the rows that have 'Pothole' in the `descriptor` field, with an upper limit of 5 million rows.


In [16]:
import pandas as pd
import numpy as np
import datetime as dt
potholes = pd.read_csv("https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?descriptor=Pothole&$limit=5000000")
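
The query string in that URL passes a column filter (`descriptor=Pothole`) and a row cap (`$limit`) as ordinary URL parameters. As a minimal sketch, the same URL can be assembled with the standard library, which makes the filter and the limit easier to change. The `build_url` helper is hypothetical, written only for illustration:

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"


def build_url(descriptor, limit):
    """Hypothetical helper: build the CSV query URL used in this notebook."""
    # safe="$" keeps the "$" in "$limit" from being percent-encoded.
    query = urlencode({"descriptor": descriptor, "$limit": limit}, safe="$")
    return f"{BASE_URL}?{query}"


url = build_url("Pothole", 5000000)
print(url)
```

The resulting string can be passed straight to `pd.read_csv`.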



Let's take a quick peek at what the data looks like.  Then we'll use pandas to work with it!

In [17]:
potholes.head()

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,bbl,borough,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location_city,location,location_address,location_zip,location_state
0,34690422,2016-11-01T15:01:46.000,2016-11-02T09:45:00.000,DOT,Department of Transportation,Street Condition,Pothole,,,CONEY ISLAND AVENUE,CONEY ISLAND AVENUE,AVENUE M,AVENUE N,,,BLOCKFACE,,,,Closed,,The Department of Transportation inspected thi...,2016-11-02T09:45:00.000,Unspecified BROOKLYN,,BROOKLYN,,,UNKNOWN,Unspecified,BROOKLYN,,,,,,,,,,,,,,
1,42107874,2019-04-01T22:22:27.000,2019-04-02T10:51:00.000,DOT,Department of Transportation,Street Condition,Pothole,,10306.0,355 EDISON STREET,EDISON STREET,JACQUES AVENUE,NEW DORP LANE,,,ADDRESS,STATEN ISLAND,,,Closed,,The Department of Transportation inspected thi...,2019-04-02T10:51:00.000,02 STATEN ISLAND,,STATEN ISLAND,952814.0,148042.0,UNKNOWN,Unspecified,STATEN ISLAND,,,,,,,,40.572961,-74.113157,,POINT (-74.113156832531 40.572961322519),,,
2,24766901,2013-01-09T11:20:10.000,2013-01-10T14:04:00.000,DOT,Department of Transportation,Street Condition,Pothole,,,BARUCH DRIVE,BARUCH DRIVE,DELANCEY STREET,WILLIAMSBURG BRIDGE,,,BLOCKFACE,NEW YORK,,,Closed,,The Department of Transportation inspected thi...,2013-01-10T14:04:00.000,Unspecified MANHATTAN,,MANHATTAN,,,UNKNOWN,Unspecified,MANHATTAN,,,,,,,,,,,,,,
3,24767098,2013-01-10T14:45:07.000,2013-01-11T10:20:00.000,DOT,Department of Transportation,Street Condition,Pothole,,11101.0,10 STREET,10 STREET,40 AVENUE,41 AVENUE,,,BLOCKFACE,Long Island City,,,Closed,,The Department of Transportation inspected thi...,2013-01-11T10:20:00.000,Unspecified QUEENS,,QUEENS,,,UNKNOWN,Unspecified,QUEENS,,,,,,,,,,,,,,
4,42082809,2019-03-29T07:05:49.000,2019-03-30T20:00:00.000,DOT,Department of Transportation,Street Condition,Pothole,,10025.0,122 MANHATTAN AVENUE,MANHATTAN AVENUE,WEST 105 STREET,WEST 106 STREET,,,ADDRESS,NEW YORK,,,Closed,,The Department of Transportation inspected thi...,2019-03-30T20:00:00.000,07 MANHATTAN,1018410000.0,MANHATTAN,994824.0,230085.0,UNKNOWN,Unspecified,MANHATTAN,,,,,,,,40.7982,-73.961809,,POINT (-73.961809120646 40.798199855119),,,


In [18]:
potholes.shape

(568977, 45)

OK, we have around 570,000 rows, far fewer than our 5 million upper limit, but plenty to work with!  Let's do a bit of cleanup.  First, we'll do some date work.

In [0]:
for col in ['created_date', 'closed_date', 'due_date', 'resolution_action_updated_date']:
    potholes[col] = pd.to_datetime(potholes[col])
    # Treat implausible dates (before 2007 or in the future) as missing.
    potholes.loc[potholes[col] < '2007-01-01', col] = pd.NaT
    potholes.loc[potholes[col] > pd.Timestamp(dt.date.today()), col] = pd.NaT

# Prefer closed_date; fall back to the last resolution update when it is missing.
potholes['resolved_date'] = potholes['closed_date'].fillna(potholes['resolution_action_updated_date'])
# Whole calendar days between creation and resolution.
potholes['days_to_close'] = (potholes['resolved_date'].dt.normalize() - potholes['created_date'].dt.normalize()).dt.days


# Get names of indexes for which days_to_close < 0
indexNames = potholes[potholes['days_to_close'] <0 ].index
# Drop them
potholes.drop(indexNames , inplace=True)
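
To see what this date cleanup does, here is a minimal sketch on a made-up three-row frame. The `toy` values are invented for illustration; only the column names come from the real data. It uses `fillna` and `dt.normalize()`, which is one way to get whole-day differences and not necessarily the only one:

```python
import pandas as pd

toy = pd.DataFrame({
    "created_date": ["2016-11-01T15:01:46.000", "1900-01-01T00:00:00.000", "2019-04-03T08:00:00.000"],
    "closed_date": ["2016-11-02T09:45:00.000", "2016-01-05T10:00:00.000", None],
    "resolution_action_updated_date": [None, None, "2019-04-02T10:51:00.000"],
})

for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col])
    # Scope the assignment to `col` so only that column is blanked.
    toy.loc[toy[col] < "2007-01-01", col] = pd.NaT

# Fall back to the resolution update when closed_date is missing.
resolved = toy["closed_date"].fillna(toy["resolution_action_updated_date"])
toy["days_to_close"] = (resolved.dt.normalize() - toy["created_date"].dt.normalize()).dt.days

# Drop rows with a negative duration; rows with a missing duration are kept.
cleaned = toy[~(toy["days_to_close"] < 0)]
print(cleaned[["created_date", "days_to_close"]])
```

Row 1 keeps its `closed_date` but loses its implausible `created_date`, and row 2 is dropped because it appears to be resolved before it was created.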

Let's rebuild the longitude and latitude columns from the `location` field, which stores them as `POINT (longitude latitude)`.

In [0]:
new_lat_long = (potholes['location'].str.extract(r'.+(\-\d{2}\.*\d*) (\d{2}\.*\d*).+')).astype(float)
potholes.loc[:, 'longitude'] = new_lat_long[0]
potholes.loc[:, 'latitude'] = new_lat_long[1]
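
As a small illustration of the extraction step, the sketch below runs `str.extract` on two `POINT` strings taken from the sample rows shown earlier, plus one missing value. The pattern here is a slightly stricter variant chosen for readability, not the exact one used in this notebook:

```python
import pandas as pd

location = pd.Series([
    "POINT (-74.113156832531 40.572961322519)",
    None,
    "POINT (-73.961809120646 40.798199855119)",
])

# One capture group per coordinate; rows without a match become NaN.
coords = location.str.extract(r"\((-?\d+\.?\d*) (-?\d+\.?\d*)\)").astype(float)
coords.columns = ["longitude", "latitude"]
print(coords)
```

Longitude comes first in the `POINT` text, which is why the first capture group feeds the `longitude` column.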

And let's remove "unspecified" boroughs and tickets that weren't closed.

In [0]:
indexNames = potholes[potholes['borough'] == 'Unspecified' ].index
potholes.drop(indexNames , inplace=True)

potholes.drop(potholes[potholes['status'] != "Closed"].index, axis=0, inplace=True)
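
The same filtering can also be written as a single boolean mask instead of two in-place drops. This is only a sketch on invented rows; the `"Open"` status value is an assumption made for the example:

```python
import pandas as pd

toy = pd.DataFrame({
    "borough": ["BROOKLYN", "Unspecified", "QUEENS", "MANHATTAN"],
    "status": ["Closed", "Closed", "Open", "Closed"],  # "Open" is an assumed value
})

# Keep rows with a known borough and a closed ticket.
keep = (toy["borough"] != "Unspecified") & (toy["status"] == "Closed")
filtered = toy[keep].copy()
print(filtered)
```

Building a mask leaves the original frame untouched, which can make it easier to re-run a cell without surprises.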

In [22]:
potholes['resolution_description'].unique()

array(['The Department of Transportation inspected this complaint and repaired the problem.',
       'The Department of Transportation inspected this complaint and did not find the reported problem.',
       'The Department of Transportation determined that this complaint is a duplicate of a previously filed complaint. The original complaint is being addressed.',
       'The Department of Transportation inspected this complaint and found that the problem was fixed.',
       'The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.',
       'General maintenance and cleaning is on a regular schedule. The next scheduled maintenance and cleaning will correct the condition.',
       'The Department of Transportation has determined that this issue is not within its jurisdiction.',
       'The Department of Transportation requires 6 months to respond to this type of complaint.  Please note your Service Request number for future reference.',
    

And create better, briefer resolution descriptions.

In [0]:
resolution_map = zip(potholes['resolution_description'].unique(), ["Repaired",
                                                                  "Did Not Find",
                                                                  "Repaired Already",
                                                                  "Duplicate",
                                                                  "Referred: Maintenance Unit",
                                                                  "Repaired: Capital Project",
                                                                  "No Description",
                                                                  "Rescheduled: Inaccessible",
                                                                  "Assigned: Field Crew",
                                                                  "Referred: Inspections Unit",
                                                                  "Future Maintenance Will Repair (Incomplete Description)",
                                                                  "Status Not Available",
                                                                  "Future Maintenance Will Repair (Complete Description)",
                                                                  "Not in DOT Jurisdiction (Not Specified)",
                                                                  "Completed or Corrected",
                                                                  "See Customer Notes",
                                                                  "Requires 6 Months for Response",
                                                                  "Not Repaired, was in Compliance",
                                                                  "Repair to be Scheduled",
                                                                  "Insufficient Information to Respond",
                                                                  "Not in DOT Jurisdiction (MTA)",
                                                                  "Not in DOT Jurisdiction (Parks and Rec)",
                                                                  "Referred: Barricaded",
                                                                  "Temporarily Repaired",
                                                                  "Not in DOT Jurisdiction (Other)",
                                                                  "Referred: Other DOT",
                                                                  "In Progress",
                                                                  "Referred: Dept. Environmental Protection",
                                                                  "Not in DOT Jurisdiction (State DOT)"
                                                                  ])

In [0]:
simple_map = zip(potholes['resolution_description'].unique(), ["Repaired",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Duplicate",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Unknown",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Unknown",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Unknown",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired"
                                                                  ])

In [0]:
potholes['shorter_resolution_desc'] = potholes['resolution_description'].map(dict(resolution_map))
potholes['shortest_resolution_desc'] = potholes['resolution_description'].map(dict(simple_map))
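
Pairing `unique()` with a hand-written list relies on the descriptions appearing in the same order every time the data is loaded. A sketch of an order-independent alternative keys the dictionary on the description text itself. The four pairings below are one reading of the descriptions printed above against the short labels, not a confirmed mapping, and `explicit_map` is a hypothetical name:

```python
import pandas as pd

PREFIX = "The Department of Transportation "

# Assumed pairings of full description text to short label.
explicit_map = {
    PREFIX + "inspected this complaint and repaired the problem.": "Repaired",
    PREFIX + "inspected this complaint and did not find the reported problem.": "Did Not Find",
    PREFIX + "inspected this complaint and found that the problem was fixed.": "Repaired Already",
    PREFIX + "determined that this complaint is a duplicate of a previously filed complaint. "
             "The original complaint is being addressed.": "Duplicate",
}

descriptions = pd.Series([
    PREFIX + "inspected this complaint and repaired the problem.",
    PREFIX + "inspected this complaint and found that the problem was fixed.",
    "Some description that is not in the dictionary.",
])

# Unmapped descriptions come back as NaN, so gaps in the dictionary are easy to spot.
short = descriptions.map(explicit_map)
print(short)
```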

I'd like to use scikit-learn (`sklearn`) to fit a decision tree that predicts whether a request ends up "Repaired" or "Not Repaired".

In [0]:
# Keep only rows with a definite outcome, plus the location and creation-date columns.
simpler_potholes = potholes.loc[(potholes['shortest_resolution_desc'] == "Repaired") | (potholes['shortest_resolution_desc'] == "Not Repaired"),
                            ["latitude", "longitude", "created_date", "shortest_resolution_desc"]].dropna()
# Replace the timestamp with numeric year and month features.
simpler_potholes['year'], simpler_potholes['month'] = simpler_potholes['created_date'].dt.year, simpler_potholes['created_date'].dt.month
simpler_potholes.drop(columns = ['created_date'], inplace=True)

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Features are latitude, longitude, year and month; the target is the simplified resolution.
predictors, outcome = simpler_potholes.drop('shortest_resolution_desc',axis=1), simpler_potholes['shortest_resolution_desc']
# Hold out the default 25% of rows for testing.
X_train, X_test, y_train, y_test = train_test_split(predictors, outcome, random_state=1)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [0]:
predictions = model.predict(X_train)

In [37]:
model_outcome_training = pd.DataFrame({"prediction": predictions, "actual": y_train})

model_outcome_training.head()

|        | prediction | actual   |
|--------|------------|----------|
| 24018  | Repaired   | Repaired |
| 329817 | Repaired   | Repaired |
| 440011 | Repaired   | Repaired |
| 84658  | Repaired   | Repaired |
| 536682 | Repaired   | Repaired |


In [39]:
(model_outcome_training['prediction'] == model_outcome_training['actual']).value_counts()

True     371856
False      2844
dtype: int64

In [42]:
pd.crosstab(model_outcome_training['prediction'], model_outcome_training['actual'])


| prediction   | actual: Not Repaired | actual: Repaired |
|--------------|----------------------|------------------|
| Not Repaired | 25466                | 2118             |
| Repaired     | 726                  | 346390           |

Well, that's pretty good (about 99.2% of training rows predicted correctly), and better than last week's logistic regression performance.  Let's see how it does on testing data!
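
Checking held-out data matters because a tree with no depth limit can keep splitting until it fits its training rows almost exactly. The sketch below illustrates this on synthetic data that has nothing to do with potholes; the noisy label rule and the `max_depth=3` comparison are assumptions chosen only to make the effect visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
# The label depends on the first feature plus noise, so it cannot be predicted perfectly.
y = (X[:, 0] + rng.normal(scale=1.0, size=2000)) > 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Print training accuracy, then testing accuracy, for each tree.
for name, tree in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, tree.score(X_train, y_train), tree.score(X_test, y_test))
```

The fully grown tree memorizes the synthetic training rows, so its training score says little about how it will do on unseen rows.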

In [44]:
model_outcome_testing = pd.DataFrame({"prediction": model.predict(X_test), "actual": y_test})

pd.crosstab(model_outcome_testing['prediction'], model_outcome_testing['actual'])


| prediction   | actual: Not Repaired | actual: Repaired |
|--------------|----------------------|------------------|
| Not Repaired | 1429                 | 7753             |
| Repaired     | 7373                 | 108345           |
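
To compare the two crosstabs on one scale, the counts can be turned into accuracy figures. This is a sketch in plain Python that copies the counts from the tables in this notebook; the `accuracy` helper and the always-"Repaired" baseline are additions made for illustration:

```python
def accuracy(table):
    """table[predicted][actual] -> share of rows where predicted == actual."""
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(table[label][label] for label in table)
    return correct / total


train = {
    "Not Repaired": {"Not Repaired": 25466, "Repaired": 2118},
    "Repaired": {"Not Repaired": 726, "Repaired": 346390},
}
test = {
    "Not Repaired": {"Not Repaired": 1429, "Repaired": 7753},
    "Repaired": {"Not Repaired": 7373, "Repaired": 108345},
}

# Share of test rows whose actual label is "Repaired",
# i.e. the score of always guessing "Repaired".
test_total = sum(sum(row.values()) for row in test.values())
baseline = sum(row["Repaired"] for row in test.values()) / test_total

print(f"training accuracy: {accuracy(train):.4f}")
print(f"testing accuracy:  {accuracy(test):.4f}")
print(f"always-'Repaired' baseline on test: {baseline:.4f}")
```

On these counts the tree is right on about 88% of testing rows versus about 99% of training rows, while always guessing "Repaired" would be right about 93% of the time on the test set. That gap is consistent with overfitting, though other explanations are possible.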
