# New York City 311 Data

## Overview

In New York City, residents with non-emergency complaints (e.g. missed trash collection, rodent infestations) can call 311 to make a Service Request. These requests are recorded and shared on New York's open data site at https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9.

## High-Level Description

The data dates from 2010 to the current day, with data being updated on a daily basis.  At the time of this writing, there are over 20 million rows, each row representing a single service request, and over 40 columns which represent aspects of each service request, such as the street address being referenced, the type of complaint, the agency responsible, the date of the service request, etc.

## Bring in Data via pandas

I'm going to bring in only the rows that have 'Pothole' in the `descriptor` field. I'll set an upper limit of 5 million rows.


In [1]:
import pandas as pd
import numpy as np
import datetime as dt
potholes = pd.read_csv("https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?descriptor=Pothole&$limit=5000000")

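The query string in that URL does the filtering on the server: `descriptor=Pothole` selects rows by column value and `$limit` caps how many rows come back. Here is a minimal sketch of building the same URL with the standard library, which can be easier to adjust than a hand-typed string. The names `base_url` and `params` are illustrative, not part of the dataset's API.

```python
from urllib.parse import urlencode

# Endpoint copied from the read_csv call above.
base_url = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"

params = {
    "descriptor": "Pothole",  # keep only rows whose descriptor is 'Pothole'
    "$limit": 5000000,        # upper bound on the number of rows returned
}

# safe="$" keeps the dollar sign literal instead of percent-encoding it.
url = f"{base_url}?{urlencode(params, safe='$')}"
print(url)
```

The resulting string can be passed to `pd.read_csv` just like the literal URL.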


Let's take a quick peek at what the data looks like.  Then we'll use pandas to work with it!

In [2]:
potholes.head()

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,bbl,borough,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location_city,location,location_address,location_zip,location_state
0,42035216,2019-03-23T15:37:36.000,2019-03-27T07:34:00.000,DOT,Department of Transportation,Street Condition,Pothole,,11355.0,BOOTH MEMORIAL AVENUE,BOOTH MEMORIAL AVENUE,146 STREET,148 STREET,,,BLOCKFACE,Flushing,,,Closed,,The Department of Transportation inspected thi...,2019-03-27T07:34:00.000,07 QUEENS,,QUEENS,1033984.0,210706.0,UNKNOWN,Unspecified,QUEENS,,,,,,,,40.744876,-73.820516,,POINT (-73.820515773594 40.744876037242),,,
1,42035218,2019-03-23T17:09:03.000,2019-03-27T10:13:00.000,DOT,Department of Transportation,Street Condition,Pothole,,10314.0,,,,,MANOR ROAD,VICTORY BOULEVARD,INTERSECTION,STATEN ISLAND,,,Closed,,The Department of Transportation inspected thi...,2019-03-27T10:13:00.000,01 STATEN ISLAND,,STATEN ISLAND,950223.0,162661.0,UNKNOWN,Unspecified,STATEN ISLAND,,,,,,,,40.613078,-74.122557,,POINT (-74.122556992381 40.613077936948),,,
2,42036611,2019-03-23T19:04:12.000,,DOT,Department of Transportation,Street Condition,Pothole,,11377.0,,,,,BQE EB ENTRANCE QUEENS BLVD WB,BROOKLYN QUEENS EXPRESSWAY EN NB,INTERSECTION,Woodside,,,Pending,,The Department of Transportation inspected thi...,2019-03-27T07:02:00.000,02 QUEENS,,QUEENS,1012316.0,209017.0,UNKNOWN,Unspecified,QUEENS,,,,,,,,40.740335,-73.89872,,POINT (-73.898720018464 40.740335470005),,,
3,42036612,2019-03-23T09:25:59.000,2019-03-27T13:20:00.000,DOT,Department of Transportation,Street Condition,Pothole,,10314.0,,,,,DELMORE STREET,LIVERMORE AVENUE,INTERSECTION,STATEN ISLAND,,,Closed,,The Department of Transportation inspected thi...,2019-03-27T13:20:00.000,01 STATEN ISLAND,,STATEN ISLAND,945579.0,163320.0,UNKNOWN,Unspecified,STATEN ISLAND,,,,,,,,40.614868,-74.139287,,POINT (-74.139287300499 40.614867702584),,,
4,42036613,2019-03-23T13:40:15.000,2019-03-28T09:44:00.000,DOT,Department of Transportation,Street Condition,Pothole,,10475.0,3285 ROMBOUTS AVENUE,ROMBOUTS AVENUE,CARVER LOOP,GIVAN AVENUE,,,ADDRESS,BRONX,,,Closed,,The Department of Transportation inspected thi...,2019-03-28T09:44:00.000,10 BRONX,2051410000.0,BRONX,1031923.0,259965.0,UNKNOWN,Unspecified,BRONX,,,,,,,,40.880089,-73.827604,,POINT (-73.827603802742 40.880089328815),,,


In [3]:
potholes.shape

(568911, 45)

OK, we have around 570 k rows, much less than our 5 million upper limit, but plenty to work with!  Let's do a bit of cleanup.  First, we'll do some date work.

In [0]:
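# Parse the date columns and blank out implausible values.
for col in ['created_date', 'closed_date', 'due_date', 'resolution_action_updated_date']:
    potholes[col] = pd.to_datetime(potholes[col])
    # The data starts in 2010, so anything before 2007 is treated as bad data.
    potholes.loc[potholes[col] < '2007-01-01', col] = pd.NaT
    # Dates after today are blanked too; restrict to this column, not the whole row.
    potholes.loc[potholes[col] > pd.Timestamp(dt.date.today()), col] = pd.NaT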

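# Use closed_date when present, otherwise fall back to the last resolution update.
potholes['resolved_date'] = potholes['closed_date'].fillna(potholes['resolution_action_updated_date'])
# Whole calendar days between creation and resolution (time of day ignored).
potholes['days_to_close'] = (potholes['resolved_date'].dt.normalize() - potholes['created_date'].dt.normalize()).dt.days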


# Get names of indexes for which days_to_close < 0
indexNames = potholes[potholes['days_to_close'] <0 ].index
# Drop them
potholes.drop(indexNames , inplace=True)
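To see why `days_to_close` is computed from calendar dates rather than full timestamps, here is a small standard-library sketch using the first row from the preview above. The helper `days_between` is hypothetical and exists only for this illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"


def days_between(created: str, resolved: str) -> int:
    """Calendar days from creation to resolution, ignoring time of day."""
    created_dt = datetime.strptime(created, FMT)
    resolved_dt = datetime.strptime(resolved, FMT)
    return (resolved_dt.date() - created_dt.date()).days


created = "2019-03-23T15:37:36.000"
closed = "2019-03-27T07:34:00.000"

calendar_days = days_between(created, closed)
# Subtracting full timestamps counts only completed 24-hour periods.
elapsed_full_days = (datetime.strptime(closed, FMT) - datetime.strptime(created, FMT)).days

print(calendar_days)      # 4
print(elapsed_full_days)  # 3
```

A request whose resolution date precedes its creation date would give a negative value here, which is the case the cell above drops.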

Let's rebuild the longitude and latitude columns from the `location` field, which stores each point as `POINT (longitude latitude)`.

In [0]:
# Raw string so the backslashes reach the regex engine unchanged.
new_lat_long = potholes['location'].str.extract(r'.+(-\d{2}\.*\d*) (\d{2}\.*\d*).+').astype(float)
potholes.loc[:, 'longitude'] = new_lat_long[0]
potholes.loc[:, 'latitude'] = new_lat_long[1]
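As a quick check of what that pattern captures, here is a sketch with the standard `re` module applied to one `location` value from the preview above. The pattern is the one from the cell, written as a raw string. It assumes a negative longitude and a positive latitude, each with a two-digit whole part, which holds for New York City.

```python
import re

# First group: the negative longitude; second group: the latitude.
pattern = re.compile(r'.+(-\d{2}\.*\d*) (\d{2}\.*\d*).+')

location = "POINT (-73.820515773594 40.744876037242)"
match = pattern.search(location)

longitude, latitude = (float(group) for group in match.groups())
print(longitude, latitude)
```

pandas' `str.extract` returns the same two capture groups as columns `0` and `1`, which is why the cell assigns `new_lat_long[0]` to longitude and `new_lat_long[1]` to latitude.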

And let's remove "unspecified" boroughs and tickets that weren't closed.

In [0]:
indexNames = potholes[potholes['borough'] == 'Unspecified' ].index
potholes.drop(indexNames , inplace=True)

potholes.drop(potholes[potholes['status'] != "Closed"].index, axis=0, inplace=True)

In [7]:
potholes['resolution_description'].unique()

array(['The Department of Transportation inspected this complaint and repaired the problem.',
       'The Department of Transportation inspected this complaint and did not find the reported problem.',
       'The Department of Transportation inspected this complaint and found that the problem was fixed.',
       'The Department of Transportation determined that this complaint is a duplicate of a previously filed complaint. The original complaint is being addressed.',
       'The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.',
       'The Department of Transportation fixed all street defects at this location as part of a Capital Project.',
       nan,
       'The Department of Transportation inspected this complaint and found that the defect was not accessible. The repair will be rescheduled.',
       'The Department of Transportation assigned this complaint to a field crew for inspection and, if warranted, repair.',
       'The Dep

And create better, briefer resolution descriptions.

In [0]:
resolution_map = zip(potholes['resolution_description'].unique(), ["Repaired",
                                                                  "Did Not Find",
                                                                  "Repaired Already",
                                                                  "Duplicate",
                                                                  "Referred: Maintenance Unit",
                                                                  "Repaired: Capital Project",
                                                                  "No Description",
                                                                  "Rescheduled: Inaccessible",
                                                                  "Assigned: Field Crew",
                                                                  "Referred: Inspections Unit",
                                                                  "Future Maintenance Will Repair (Incomplete Description)",
                                                                  "Status Not Available",
                                                                  "Future Maintenance Will Repair (Complete Description)",
                                                                  "Not in DOT Jurisdiction (Not Specified)",
                                                                  "Completed or Corrected",
                                                                  "See Customer Notes",
                                                                  "Requires 6 Months for Response",
                                                                  "Not Repaired, was in Compliance",
                                                                  "Repair to be Scheduled",
                                                                  "Insufficient Information to Respond",
                                                                  "Not in DOT Jurisdiction (MTA)",
                                                                  "Not in DOT Jurisdiction (Parks and Rec)",
                                                                  "Referred: Barricaded",
                                                                  "Temporarily Repaired",
                                                                  "Not in DOT Jurisdiction (Other)",
                                                                  "Referred: Other DOT",
                                                                  "In Progress",
                                                                  "Referred: Dept. Environmental Protection",
                                                                  "Not in DOT Jurisdiction (State DOT)"
                                                                  ])

In [0]:
simple_map = zip(potholes['resolution_description'].unique(), ["Repaired",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Duplicate",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Unknown",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Unknown",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Unknown",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired",
                                                              "Repaired",
                                                              "Not Repaired",
                                                              "Not Repaired"
                                                                  ])

In [0]:
potholes['shorter_resolution_desc'] = potholes['resolution_description'].map(dict(resolution_map))
potholes['shortest_resolution_desc'] = potholes['resolution_description'].map(dict(simple_map))
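The cells above rely on `zip` pairing each description with a label by position, and on `dict` turning those pairs into a lookup table for `map`. Here is a minimal sketch of that mechanism on the first three descriptions, using plain Python lists in place of the DataFrame column. The names `labels` and `sample` are illustrative.

```python
descriptions = [
    "The Department of Transportation inspected this complaint and repaired the problem.",
    "The Department of Transportation inspected this complaint and did not find the reported problem.",
    "The Department of Transportation inspected this complaint and found that the problem was fixed.",
]
short_labels = ["Repaired", "Did Not Find", "Repaired Already"]

# zip pairs items by position; dict() turns the pairs into a lookup table.
labels = dict(zip(descriptions, short_labels))

sample = [descriptions[2], descriptions[0], "some unseen description"]
# .get returns None for text with no entry, much as Series.map yields NaN.
shortened = [labels.get(text) for text in sample]
print(shortened)  # ['Repaired Already', 'Repaired', None]
```

Since the pairing is positional, the label lists have to follow the order in which `unique()` returns the descriptions. If that order changes between data pulls, the labels would silently shift.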

I'd like to use scikit-learn (`sklearn`) to fit a logistic regression that predicts whether a pothole ends up "Repaired" or "Not Repaired". We'll use latitude, longitude, month, and year as our predictors.

In [0]:
import datetime as dt
simpler_potholes = potholes.loc[(potholes['shortest_resolution_desc'] == "Repaired") | (potholes['shortest_resolution_desc'] == "Not Repaired"), 
                            ["latitude", "longitude", "created_date", "shortest_resolution_desc"]].dropna()
simpler_potholes['year'], simpler_potholes['month'] = simpler_potholes['created_date'].dt.year, simpler_potholes['created_date'].dt.month
simpler_potholes.drop(columns = ['created_date'], inplace = True)

In [12]:
from sklearn.linear_model import LogisticRegression

predictors, outcome = simpler_potholes.drop('shortest_resolution_desc',axis=1), simpler_potholes['shortest_resolution_desc']
logisticRegr = LogisticRegression()

logisticRegr.fit(predictors, outcome)
predictions = logisticRegr.predict(predictors)



Let's look at how we did.

In [13]:
model_outcome = pd.DataFrame({"prediction": predictions, "actual": outcome})

model_outcome.head()

Unnamed: 0,prediction,actual
0,Repaired,Repaired
1,Repaired,Repaired
3,Repaired,Repaired
4,Repaired,Repaired
5,Repaired,Not Repaired


That doesn't look bad!  Let's look at our stats.

In [14]:
(model_outcome['prediction'] == model_outcome['actual']).value_counts()

True     426157
False     35357
dtype: int64
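As a fraction, those counts work out to a little over 92% of rows predicted correctly. A small sketch of the arithmetic, with the two counts copied from the output above:

```python
correct = 426157    # rows where prediction == actual
incorrect = 35357   # rows where they differ

total = correct + incorrect
accuracy = correct / total

print(total)              # 461514
print(f"{accuracy:.1%}")  # 92.3%
```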

Wow, that looks really great.  BUT.  There's a problem!

In [15]:
model_outcome['prediction'].value_counts()

Repaired    461514
Name: prediction, dtype: int64

My model simply guessed "Repaired" every single time. That's a good guess, since my sample is so unbalanced, with many more "Repaired" than "Not Repaired" outcomes. But it shows a complication in machine learning: the need to handle unbalanced samples. This is really important when dealing with things like rare (but catastrophic) diseases or material failures in bridges or planes.
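One way to make the problem visible is to score each class separately instead of relying on overall accuracy. The sketch below uses only the counts printed above (every row predicted "Repaired", 426157 of them correctly) and computes per-class recall by hand. The variable names are mine.

```python
# Counts from the outputs above: the model predicted "Repaired" for every row.
repaired_rows = 426157      # actually "Repaired" (all predicted correctly)
not_repaired_rows = 35357   # actually "Not Repaired" (all predicted wrongly)

# Recall = correctly predicted rows of a class / all rows of that class.
recall_repaired = repaired_rows / repaired_rows
recall_not_repaired = 0 / not_repaired_rows

accuracy = repaired_rows / (repaired_rows + not_repaired_rows)
# Balanced accuracy averages the per-class recalls, so each class counts equally.
balanced_accuracy = (recall_repaired + recall_not_repaired) / 2

print(f"accuracy:          {accuracy:.3f}")           # 0.923
print(f"balanced accuracy: {balanced_accuracy:.3f}")  # 0.500
```

A balanced accuracy of 0.5 is what a constant guess earns on a two-class problem. scikit-learn provides `recall_score` and `balanced_accuracy_score` in `sklearn.metrics` for the same calculation. `LogisticRegression` also accepts `class_weight="balanced"` as one common way to counter the imbalance, though whether it helps here would need checking on held-out data.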