# Obtaining NYC 311 Data

New York's Open Data Portal (https://opendata.cityofnewyork.us/) exposes the datasets it hosts through the Socrata Open Data API.

This is significant because the datasets on NYC Open Data often run to many millions of rows, which can be prohibitively large to download or load in full.  With the API, we can download only the first, say, 50 thousand rows to get a taste of the entire dataset.  We can also request only certain columns and rows by specifying column names and conditions.
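As a rough sketch of what such a request can look like, the snippet below builds a query URL using the SoQL parameters `$select`, `$where`, and `$limit`. The endpoint is the resource URL used later in this notebook; the helper name `build_query_url` and the chosen columns are illustrative assumptions, not part of the original analysis.

```python
from urllib.parse import urlencode

# Assumption: the same resource endpoint that is used later in this notebook.
BASE_URL = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"


def build_query_url(columns, where, limit):
    """Build a URL that asks for selected columns, filtered rows, and a row cap."""
    params = {
        "$select": ",".join(columns),  # which columns to return
        "$where": where,               # row filter condition
        "$limit": limit,               # maximum number of rows
    }
    # Keep "$" and "," unescaped so the URL stays readable.
    return f"{BASE_URL}?{urlencode(params, safe='$,')}"


url = build_query_url(
    columns=["created_date", "descriptor", "borough"],
    where="descriptor = 'Pothole'",
    limit=50000,
)
print(url)
```

The resulting string can be passed straight to `pd.read_csv`, in the same way as the URL in the next section.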

## 311 Overview

In the city of New York, citizens with non-emergency complaints (e.g. trash non-collection, rodent infestations) can call 311 to make a Service Request. These are recorded and shared on New York's open data site at https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9.

## High-Level Data Description

The data runs from 2010 to the present and is updated daily. At the time of this writing, there are over 20 million rows, each representing a single service request, and over 40 columns describing aspects of each request, such as the street address being referenced, the type of complaint, the agency responsible, and the date of the service request.

## Bring in Data via pandas

We're going to bring in only the rows that have 'Pothole' in the descriptor field.  We'll set an upper limit of 1 million rows.

In [1]:
import pandas as pd
potholes = pd.read_csv("https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?descriptor=Pothole&$limit=1000000")



How large is this data?

In [2]:
potholes.shape

(569331, 45)

Let's take a peek at the data in several ways.  We'll start by looking at the first few rows.  We'll scroll to the right to see all the columns.

In [3]:
potholes.head()

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location_city,location,location_address,location_zip,location_state
0,34690422,2016-11-01T15:01:46.000,2016-11-02T09:45:00.000,DOT,Department of Transportation,Street Condition,Pothole,,,CONEY ISLAND AVENUE,...,,,,,,,,,,
1,42107874,2019-04-01T22:22:27.000,2019-04-02T10:51:00.000,DOT,Department of Transportation,Street Condition,Pothole,,10306.0,355 EDISON STREET,...,,,,40.572961,-74.113157,,POINT (-74.113156832531 40.572961322519),,,
2,24766901,2013-01-09T11:20:10.000,2013-01-10T14:04:00.000,DOT,Department of Transportation,Street Condition,Pothole,,,BARUCH DRIVE,...,,,,,,,,,,
3,24767098,2013-01-10T14:45:07.000,2013-01-11T10:20:00.000,DOT,Department of Transportation,Street Condition,Pothole,,11101.0,10 STREET,...,,,,,,,,,,
4,42082809,2019-03-29T07:05:49.000,2019-03-30T20:00:00.000,DOT,Department of Transportation,Street Condition,Pothole,,10025.0,122 MANHATTAN AVENUE,...,,,,40.7982,-73.961809,,POINT (-73.961809120646 40.798199855119),,,


Let's look at the overall number of present vs absent values in each column, as well as the column type:

In [4]:
potholes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569331 entries, 0 to 569330
Data columns (total 45 columns):
unique_key                        569331 non-null int64
created_date                      569331 non-null object
closed_date                       566424 non-null object
agency                            569331 non-null object
agency_name                       569331 non-null object
complaint_type                    569331 non-null object
descriptor                        569331 non-null object
location_type                     1900 non-null object
incident_zip                      525231 non-null float64
incident_address                  357148 non-null object
street_name                       357148 non-null object
cross_street_1                    462767 non-null object
cross_street_2                    462702 non-null object
intersection_street_1             202993 non-null object
intersection_street_2             202991 non-null object
address_type                      55

## Cleaning and Preparing Data

We see multiple columns with few to no values, and we also see columns that have data types that aren't quite right (date stamps as string objects).  We'll take that on in this section.

### Dates

Let's begin by converting dates:

In [5]:
for col in ['created_date', 'closed_date', 'due_date', 'resolution_action_updated_date']:
    potholes[col] = pd.to_datetime(potholes[col])
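As an aside, here is a small self-contained sketch of the same conversion on made-up rows. It adds `errors="coerce"`, which is an assumption on our part rather than something the cell above uses: it turns missing or unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy rows in the same timestamp format as the 311 export (values are illustrative).
sample = pd.DataFrame({
    "created_date": ["2016-11-01T15:01:46.000", "2019-04-01T22:22:27.000"],
    "closed_date": ["2016-11-02T09:45:00.000", None],
})

for col in ["created_date", "closed_date"]:
    # errors="coerce" converts anything that cannot be parsed into NaT
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

print(sample.dtypes)
```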

And now let's peek at the dates:

In [6]:
potholes[['created_date', 'closed_date', 'due_date', 'resolution_action_updated_date']].describe()

Unnamed: 0,created_date,closed_date,due_date,resolution_action_updated_date
count,569331,566424,1872,569138
unique,545184,407240,1575,410364
top,2010-03-26 07:00:00,2011-02-10 00:00:00,2014-06-25 18:44:38,2014-02-23 00:00:00
freq,94,70,17,48
first,2010-01-01 01:57:23,2010-01-01 06:08:33,2010-06-22 16:28:07,2010-01-01 06:08:33
last,2019-08-01 01:01:28,2019-07-31 21:46:00,2019-11-24 09:43:36,2019-08-01 01:01:28


Wonderful, we don't have any outlier dates.  All the dates fall within an expected range of 2010-2019.  Let's now narrow our data by eliminating columns with more than 30% missing values, that is, keeping only columns that are at least 70% populated.

In [7]:
potholes.dropna(thresh=(.7*potholes.shape[0]), axis=1, inplace=True)
potholes.head()

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,incident_zip,cross_street_1,cross_street_2,...,community_board,borough,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,latitude,longitude,location
0,34690422,2016-11-01 15:01:46,2016-11-02 09:45:00,DOT,Department of Transportation,Street Condition,Pothole,,AVENUE M,AVENUE N,...,Unspecified BROOKLYN,BROOKLYN,,,UNKNOWN,Unspecified,BROOKLYN,,,
1,42107874,2019-04-01 22:22:27,2019-04-02 10:51:00,DOT,Department of Transportation,Street Condition,Pothole,10306.0,JACQUES AVENUE,NEW DORP LANE,...,02 STATEN ISLAND,STATEN ISLAND,952814.0,148042.0,UNKNOWN,Unspecified,STATEN ISLAND,40.572961,-74.113157,POINT (-74.113156832531 40.572961322519)
2,24766901,2013-01-09 11:20:10,2013-01-10 14:04:00,DOT,Department of Transportation,Street Condition,Pothole,,DELANCEY STREET,WILLIAMSBURG BRIDGE,...,Unspecified MANHATTAN,MANHATTAN,,,UNKNOWN,Unspecified,MANHATTAN,,,
3,24767098,2013-01-10 14:45:07,2013-01-11 10:20:00,DOT,Department of Transportation,Street Condition,Pothole,11101.0,40 AVENUE,41 AVENUE,...,Unspecified QUEENS,QUEENS,,,UNKNOWN,Unspecified,QUEENS,,,
4,42082809,2019-03-29 07:05:49,2019-03-30 20:00:00,DOT,Department of Transportation,Street Condition,Pothole,10025.0,WEST 105 STREET,WEST 106 STREET,...,07 MANHATTAN,MANHATTAN,994824.0,230085.0,UNKNOWN,Unspecified,MANHATTAN,40.7982,-73.961809,POINT (-73.961809120646 40.798199855119)
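The `thresh` argument of `dropna` is the minimum number of non-null values a column needs in order to be kept. To make that rule easier to see, here is a hedged sketch on a toy frame that expresses the same idea as a non-null fraction per column. The column names are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_full": [1, 2, 3, np.nan],      # 75% non-null
    "half_empty": [1, np.nan, np.nan, 4],  # 50% non-null
    "full": [1, 2, 3, 4],                  # 100% non-null
})

# Fraction of non-null values in each column
non_null_share = toy.notna().mean()

# Keep only the columns that are at least 70% populated
kept = toy.loc[:, non_null_share >= 0.7]
print(list(kept.columns))
```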


Let's add a new column that gives the time between complaint creation date and completion date (either closed or resolution updated date), and remove the columns we don't need any more:

In [8]:
import numpy as np
potholes['resolved_date'] = np.where(potholes['closed_date'].notnull(), potholes['closed_date'], 
                                     potholes['resolution_action_updated_date'])
potholes['days_to_close'] = (potholes['resolved_date'].dt.date - potholes['created_date'].dt.date).dt.days
potholes.drop(columns=['closed_date','resolution_action_updated_date'], inplace = True)
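For comparison, here is an alternative sketch of the same calculation on toy data. It uses `fillna` for the fallback and `dt.normalize()` to set each timestamp to midnight, so the values stay as pandas datetimes and the subtraction counts calendar days. The rows below are illustrative, not taken from the full dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "created_date": pd.to_datetime(["2016-11-01 15:01:46", "2019-03-29 07:05:49"]),
    "closed_date": pd.to_datetime(["2016-11-02 09:45:00", None]),
    "resolution_action_updated_date": pd.to_datetime(
        ["2016-11-02 09:45:00", "2019-03-30 20:00:00"]
    ),
})

# Use closed_date where it exists, otherwise fall back to the resolution update date
toy["resolved_date"] = toy["closed_date"].fillna(toy["resolution_action_updated_date"])

# normalize() drops the time of day, so the difference is in whole calendar days
toy["days_to_close"] = (
    toy["resolved_date"].dt.normalize() - toy["created_date"].dt.normalize()
).dt.days

print(toy[["created_date", "resolved_date", "days_to_close"]])
```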

We can also remove columns that don't provide meaningful data for prediction (like `unique_key`) or that hold the same value throughout (like `agency`).  Let's take a quick peek at the number of unique values in each column to see if there are obvious candidates for removal:

In [9]:
potholes.nunique()

unique_key                  569331
created_date                545184
agency                           1
agency_name                      1
complaint_type                   2
descriptor                       1
incident_zip                   229
cross_street_1                7836
cross_street_2                7616
address_type                     3
city                            89
status                           5
resolution_description          32
community_board                 77
borough                          6
x_coordinate_state_plane     93761
y_coordinate_state_plane    100169
open_data_channel_type           4
park_facility_name               1
park_borough                     6
latitude                    216552
longitude                   216550
location                    216552
resolved_date               409230
days_to_close                  286
dtype: int64

OK, so we can get rid of `agency`, `agency_name`, `descriptor`, and `park_facility_name` for sure!  As stated earlier, `unique_key` doesn't add any useful info, so we can get rid of it as well.  `location` is essentially a duplicate of `latitude` and `longitude`, so we can get rid of that column, too.  Unfortunately, we don't have any information about `x_coordinate_state_plane` and `y_coordinate_state_plane`, so we'll remove them.  There's not much we can do with that data!

In [10]:
potholes.drop(columns = ['unique_key', 'agency', 'agency_name', 'descriptor', 
                         'park_facility_name', 'location', 'x_coordinate_state_plane', 
                         'y_coordinate_state_plane'], inplace = True)
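Reading the `nunique()` output by eye works fine here. As a sketch of how the same candidates could be picked out programmatically, the toy example below flags columns with a single value and columns where every row is different. The frame and variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "agency": ["DOT", "DOT", "DOT"],
    "borough": ["BROOKLYN", "QUEENS", "BROOKLYN"],
    "unique_key": [1, 2, 3],
})

counts = toy.nunique()  # number of distinct non-null values per column

# Columns with a single value carry no information
constant_cols = counts[counts <= 1].index.tolist()

# Columns where every row is distinct behave like identifiers
id_like_cols = counts[counts == len(toy)].index.tolist()

print(constant_cols, id_like_cols)
```

On the real data, a column like `created_date` is almost but not quite unique per row, so the identifier check is only a starting point for a manual decision.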

Let's look at `complaint_type`, which we expected to just have one value, but seems to have two:

In [11]:
potholes['complaint_type'].value_counts()

Street Condition    567427
Bridge Condition      1904
Name: complaint_type, dtype: int64

OK, that's fair.  Sometimes potholes are on streets, sometimes on bridges.

Let's peek at the column data now:

In [12]:
potholes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569331 entries, 0 to 569330
Data columns (total 17 columns):
created_date              569331 non-null datetime64[ns]
complaint_type            569331 non-null object
incident_zip              525231 non-null float64
cross_street_1            462767 non-null object
cross_street_2            462702 non-null object
address_type              551684 non-null object
city                      528650 non-null object
status                    569331 non-null object
resolution_description    568795 non-null object
community_board           569331 non-null object
borough                   569331 non-null object
open_data_channel_type    569331 non-null object
park_borough              569331 non-null object
latitude                  521373 non-null float64
longitude                 521373 non-null float64
resolved_date             569138 non-null datetime64[ns]
days_to_close             569138 non-null float64
dtypes: datetime64[ns](2), float64(4

There are many ways a ticket can be resolved.  Let's take a look at them and see if we can make some categories that are easier to read.

In [13]:
print(potholes['resolution_description'].unique())

['The Department of Transportation inspected this complaint and repaired the problem.'
 'The Department of Transportation inspected this complaint and did not find the reported problem.'
 'The Department of Transportation referred this complaint to the Inspections Unit for further action.'
 'The Department of Transportation determined that this complaint is a duplicate of a previously filed complaint. The original complaint is being addressed.'
 'The Department of Transportation inspected this complaint and found that the problem was fixed.'
 'The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.'
 'General maintenance and cleaning is on a regular schedule. The next scheduled maintenance and cleaning will correct the condition.'
 'The Department of Transportation has determined that this issue is not within its jurisdiction.'
 'The Department of Transportation requires 6 months to respond to this type of complaint.  Please note your Se

I'm going to map each possible long resolution description to a shorter label of a few words:

In [14]:
long_descriptions = ['The Department of Transportation inspected this complaint and repaired the problem.',
       'The Department of Transportation inspected this complaint and referred it to the Arterial Division for further action.',
       'The Department of Transportation inspected this complaint and did not find the reported problem.',
       'The Department of Transportation inspected this complaint and found that the problem was fixed.',
       'The Department of Transportation referred this complaint to the Inspections Unit for further action.',
       'The Department of Transportation determined that this complaint is a duplicate of a previously filed complaint. The original complaint is being addressed.',
       'General maintenance and cleaning is on a regular schedule. The next scheduled maintenance and cleaning will correct the condition.',
       'The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.',
       'The Department of Transportation has completed the request or corrected the condition.',
       'The Department of Transportation fixed all street defects at this location as part of a Capital Project.',
       np.nan,
       'The Department of Transportation inspected this complaint and found that the defect was not accessible. The repair will be rescheduled.',
       'The Department of Transportation assigned this complaint to a field crew for inspection and, if warranted, repair.',
       'The Department of Transportation has determined that this issue is not within its jurisdiction.',
       'The Department of Transportation inspected and has requested the Department of Environmental Protection address the issue. The condition will be re-inspected in 60 days.',
       'The condition has been inspected/investigated, see customer notes for more information.',
       'The status of this Service Request is currently not available online. Please call 311 for further assistance. If you are outside of New York City, please call (212) NEW-YORK (212-639-9675).',
       'General maintenance and cleaning is on a regular schedule. The next scheduled maintenance and cleani',
       'The Department of Transportation requires 6 months to respond to this type of complaint.  Please note your Service Request number for future reference.',
       'The condition was inspected and it was in compliance with Department of Transportation standards, not hazardous, or a valid permit exists.',
       'The Department of Transportation inspected this complaint and referred it to the Bridge Division for further action.',
       'The Department of Transportation has determined that this issue is not within its jurisdiction. It has been referred to the Department of Parks and Recreation.',
       'The Department of Transportation inspected this complaint and will schedule the repair.',
       'The request submitted did not have sufficient information for the Department of Transportation to respond.',
       'The Department of Transportation has determined that this issue is not within its jurisdiction. It has been referred to the Metropolitan Transportation Authority.',
       'Your Service Request has been submitted to the Department of Transportation.  Please check back later for status.',
       'The Department of Transportation inspected this complaint and barricaded the area. The issue was referred to another unit for further action.',
       'The Department of Transportation inspected the condition and temporary repairs were made to make the area safe. Permanent repairs/restoration will be scheduled as part of a project, seasonal work (April - November), or when work is assigned to a contractor.',
       "The condition was inspected and determined not to be under Department of Transportation's jurisdiction. The Department of Transportation notified the appropriate responsible party of the complaint.",
       'The condition was inspected and determined to be under the jurisdiction of another Department of Transportation unit.  The unit has been notified.',
       "The Department of Transportation's work on this complaint is still in progress.",
       "The condition was inspected and determined not to be under New York City Department of Transportation's jurisdiction. The Department of Transportation notified the New York State Department of Transportation of the complaint.",
       'The Department of Transportation has inspected the complaint and referred it to the Department of Environmental Protection.']
short_descriptions = ["Repaired: Inspected and Repaired",
                       "Referred: Arterial",
                       "Not Repaired: Did Not Find",
                       "Repaired: Already Complete",
                       "Referred: Inspections",
                       "Not Repaired: Duplicate",
                       "Postponed: Future Maintenance Will Fix",
                       "Referred: Maintenance Unit",
                       "Repaired: Completed or Corrected",
                       "Repaired: Capital Project",
                       "Other: No Description",
                       "Postponed: Inaccessible",
                       "Scheduled: Field Crew",
                       "Not Repaired: Not in DOT Jurisdiction (Not Specified)",
                       "Referred: Dept. Environmental Protection, Will Reinspect",
                       "Referred: See Customer Notes",
                       "Other: Status Not Available",
                       "Postponed: Future Maintenance Will Fix (Incomplete Description)",
                       "Postponed: Requires 6 Months for Response",
                       "Not Repaired: Was in Compliance",
                       "Referred: Bridges",
                       "Referred: Parks and Rec",
                       "Scheduled: DOT",
                       "Referred: MTA",
                       "Not Repaired: Insufficient Information",
                       "Other: Check Status",
                       "Referred: Barricaded",
                       "Repaired: Temporary Repair",
                       "Referred: Unspecified",
                       "Referred: Other DOT",
                       "Repaired: In Progress",
                       "Referred: State DOT",
                       "Referred: Dept. Environmental Protection"]
desc_map = dict(zip(long_descriptions, short_descriptions))
desc_map

{'The Department of Transportation inspected this complaint and repaired the problem.': 'Repaired: Inspected and Repaired',
 'The Department of Transportation inspected this complaint and referred it to the Arterial Division for further action.': 'Referred: Arterial',
 'The Department of Transportation inspected this complaint and did not find the reported problem.': 'Not Repaired: Did Not Find',
 'The Department of Transportation inspected this complaint and found that the problem was fixed.': 'Repaired: Already Complete',
 'The Department of Transportation referred this complaint to the Inspections Unit for further action.': 'Referred: Inspections',
 'The Department of Transportation determined that this complaint is a duplicate of a previously filed complaint. The original complaint is being addressed.': 'Not Repaired: Duplicate',
 'General maintenance and cleaning is on a regular schedule. The next scheduled maintenance and cleaning will correct the condition.': 'Postponed: Future 

In [15]:
potholes['shorter_description'] = potholes['resolution_description'].map(desc_map)

Did we miss any codes?

In [16]:
potholes[potholes['shorter_description'].isna()]['resolution_description']

Series([], Name: resolution_description, dtype: object)
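Another way to run this kind of check, sketched here on made-up values, is to compare the distinct strings in the column against the dictionary keys directly. This lists any description that has no short label, and it ignores missing values rather than relying on how they are mapped. The names `desc_map_demo` and `values` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the long-to-short mapping
desc_map_demo = {"long text A": "Short A", "long text B": "Short B"}

values = pd.Series(["long text A", "long text C", np.nan, "long text B"])

# Distinct non-null descriptions that have no entry in the mapping
unmapped = sorted(set(values.dropna().unique()) - set(desc_map_demo))
print(unmapped)
```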

No, we didn't.  Let's delete the longer resolution description and peek at what we have now.

In [17]:
potholes.drop(columns = "resolution_description", inplace=True)

In [18]:
potholes.head()

Unnamed: 0,created_date,complaint_type,incident_zip,cross_street_1,cross_street_2,address_type,city,status,community_board,borough,open_data_channel_type,park_borough,latitude,longitude,resolved_date,days_to_close,shorter_description
0,2016-11-01 15:01:46,Street Condition,,AVENUE M,AVENUE N,BLOCKFACE,,Closed,Unspecified BROOKLYN,BROOKLYN,UNKNOWN,BROOKLYN,,,2016-11-02 09:45:00,1.0,Repaired: Inspected and Repaired
1,2019-04-01 22:22:27,Street Condition,10306.0,JACQUES AVENUE,NEW DORP LANE,ADDRESS,STATEN ISLAND,Closed,02 STATEN ISLAND,STATEN ISLAND,UNKNOWN,STATEN ISLAND,40.572961,-74.113157,2019-04-02 10:51:00,1.0,Not Repaired: Did Not Find
2,2013-01-09 11:20:10,Street Condition,,DELANCEY STREET,WILLIAMSBURG BRIDGE,BLOCKFACE,NEW YORK,Closed,Unspecified MANHATTAN,MANHATTAN,UNKNOWN,MANHATTAN,,,2013-01-10 14:04:00,1.0,Repaired: Inspected and Repaired
3,2013-01-10 14:45:07,Street Condition,11101.0,40 AVENUE,41 AVENUE,BLOCKFACE,Long Island City,Closed,Unspecified QUEENS,QUEENS,UNKNOWN,QUEENS,,,2013-01-11 10:20:00,1.0,Repaired: Inspected and Repaired
4,2019-03-29 07:05:49,Street Condition,10025.0,WEST 105 STREET,WEST 106 STREET,ADDRESS,NEW YORK,Closed,07 MANHATTAN,MANHATTAN,UNKNOWN,MANHATTAN,40.7982,-73.961809,2019-03-30 20:00:00,1.0,Repaired: Inspected and Repaired
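Because every short label follows a "Category: Detail" pattern, the leading category can be pulled out if a coarser grouping is useful. This is an optional sketch on a few sample labels, not a step in the original cleaning.

```python
import pandas as pd

short = pd.Series([
    "Repaired: Inspected and Repaired",
    "Not Repaired: Did Not Find",
    "Referred: MTA",
    "Repaired: Temporary Repair",
])

# Split once on ": " and keep the part before it
category = short.str.split(": ", n=1).str[0]
print(category.value_counts())
```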


Great, we now have a compact DataFrame with few missing values and little duplication of data across columns.  Let's save that data!  Note that we have to use Git LFS (Large File Storage) to store this large CSV on GitHub.  See https://git-lfs.github.com for more information on how to do this!

In [19]:
potholes.to_csv("../data/cleaned_311_pothole_data.csv", index=False)
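One thing to keep in mind when loading the saved file later: CSV does not store column types, so the date columns come back as plain strings unless we ask pandas to parse them. The sketch below shows the round trip on a one-row toy frame, using an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "created_date": pd.to_datetime(["2016-11-01 15:01:46"]),
    "resolved_date": pd.to_datetime(["2016-11-02 09:45:00"]),
    "days_to_close": [1.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype for the listed columns
reloaded = pd.read_csv(buffer, parse_dates=["created_date", "resolved_date"])
print(reloaded.dtypes)
```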