# Data Wrangling - Potholes Dataset

Thibautl Dody
08/11/2017

The purpose of this notebook is to describe the process used to verify the quality of the potholes data set. The output of this notebook is a new csv file **Closed_Pothole_Cases_Cleaned.csv** located in *./Cleaned Data/* folder.

Data source: https://data.cityofboston.gov/City-Services/Requests-for-Pothole-Repair/n65p-xaz7/data
The data is filtered down to the requests made after January 1st 2014.

## 1. Process

The first step consists of a visual inspection of the csv file (loaded using excel). This phase is critical as it is used to define how the data will be imported and how the parameters of the import will be defined. Once the data has been imported and investigated, it needs to be cleaned and re-organized. To do so, missing values are either dropped or estimated and unrelevant features are removed from the set.

The file **Closed_Pothole_Cases.csv** is located in *./Original Data/* folder.

In [208]:
# Import all the python libraries needed for the wrangling
import numpy as np
import pandas as pd

## 2. Visual inspection

The dataset contains has the following properties that will impact the import:
- The file contains missing data
- The file contains date and times (OPEN_DT, CLOSED_DT)
- The file contains columns that are empty

In order to facilitate the filtering of the dataset, the choice is made to import the entire content of the file and to modify the dataset using Pandas' tools.

In [209]:
# Import the data as Pandas dataframe
fileFullPath = "./Original Data/Closed_Pothole_Cases.csv"
potholes_raw_df = pd.read_csv(fileFullPath, parse_dates=[1,2,3])

Before diving into the content of the dataframe, the size and data type of the dataset are obtained.

In [210]:
# Dataframe Shape
potholes_raw_df.shape

(29451, 33)

In [211]:
# Dataframe Info
potholes_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29451 entries, 0 to 29450
Data columns (total 33 columns):
CASE_ENQUIRY_ID                   29451 non-null int64
OPEN_DT                           29451 non-null datetime64[ns]
TARGET_DT                         29451 non-null datetime64[ns]
CLOSED_DT                         29118 non-null datetime64[ns]
OnTime_Status                     29450 non-null object
CASE_STATUS                       29451 non-null object
CLOSURE_REASON                    29125 non-null object
CASE_TITLE                        29451 non-null object
SUBJECT                           29451 non-null object
REASON                            29451 non-null object
TYPE                              29451 non-null object
QUEUE                             29451 non-null object
Department                        29451 non-null object
SubmittedPhoto                    16533 non-null object
ClosedPhoto                       11074 non-null object
Location                    

The features presented in the file are defined as:
 - *CASE_ENQUIRY_ID*: Case number assigned to the request to repair the pothole.
 - *OPEN_DT*: Date and time of the repair request.
 - *TARGET_DT*: Scheduled time for repair.
 - *CLOSED_DT*: Date and time the case was closed.
 - *OnTime_Status*: ONTIME if *CLOSED_DT*>TARGET_DT
 - *CASE_STATUS*: Case status.
 - *CLOSURE_REASON*: Reason for the case closure.
 - *CASE_TITLE*: Request type. In this case, the type is "Request for Pothole Repair".
 - *SUBJECT*: The city department in charge of the request.
 - *REASON*: Reason for the case opening.
 - *TYPE*: Specific reason for the case opening. In this case, the type is "Request for Pothole Repair".
 - *QUEUE*: Code corresponding to the department per neighborhood in charge of the repair.
 - *Department*: Code corresponding to the department in charge of the repair.
 - *SubmittedPhoto*: URL of the photo taken to support the claim.
 - *ClosedPhoto*:  URL of the photo taken to support the repair.
 - *Location*: Address of the pothole.
 - *fire_district*: Fire district corresponding to the pothole location.
 - *pwd_district*: Public Work district corresponding to the pothole location.
 - *city_council_district*: City Council district corresponding to the pothole location.
 - *police_district*: Police district corresponding to the pothole location.
 - *neighborhood*: Neighborhood corresponding to the pothole location.
 - *neighborhood_services_district*: Neighborhood Services district corresponding to the pothole location.
 - *ward*: Ward corresponding to the pothole location.
 - *precinct*: Precinct corresponding to the pothole location.
 - *land_usage*: Blank column.
 - *LOCATION_STREET_NAME*: Street number and street name corresponding to the pothole location.
 - *LOCATION_ZIPCODE*: Zip code corresponding to the pothole location.
 - *Property_Type*: Blank column.
 - *Property_ID*: Blank column.
 - *LATITUDE*: Latitude of the pothole location.
 - *LONGITUDE*: Longitude of the pothole location.
 - *Source*: Source of the request.
 - *Geocoded_Location*: Blank column

# 3. Feature cleaning

The following choices are made:
- Delete *land_usage*, *Property_Type*, *Property_ID*, and *Geocoded_Location* since they do not contain any data.
- Import *OPEN_DT* and *CLOSED_DT* as DateTime objects

In [212]:
# Defined the feature names to be deleted.
columnNameToDelete = ['land_usage','Property_Type','Property_ID','Geocoded_Location']

# Clone the dataset and delete features of the clone.
potholes_df = potholes_raw_df.copy()
potholes_df.drop(columnNameToDelete,inplace=True,axis=1)

potholes_df.shape

(29451, 29)

After checking the top records of the dataset, it seems that several columns are filled with the same values. This is due to the fact that the database is a subset of a larger one containing all the 311 calls to the city of Boston. We check that our assumption is correct by looking at the number of unique values in each column.

In [213]:
# Obtain the count of all and unique values for the entiere dataframe.
uniqueValuesCount_dict = {function.__name__:potholes_df.apply(function)
                          for function in (pd.Series.nunique, pd.Series.count)}
pd.concat(uniqueValuesCount_dict, axis=1)

Unnamed: 0,count,nunique
CASE_ENQUIRY_ID,29451,29451
OPEN_DT,29451,29381
TARGET_DT,29451,22068
CLOSED_DT,29118,29073
OnTime_Status,29450,2
CASE_STATUS,29451,2
CLOSURE_REASON,29125,18033
CASE_TITLE,29451,46
SUBJECT,29451,1
REASON,29451,1


After inspection of the results, the following decisions are made:
- *SUBJECT*, *REASON*, *TYPE*, and *Department* can be deleted

In [214]:
# The case status is dropped
potholes_df.drop(["SUBJECT","REASON","TYPE","Department"],inplace=True,axis=1)
potholes_df.shape

(29451, 25)

## 4. Feature conversion

### 4.1 Photos

In order to simplify the analysis and keep only relevant infomation, the *Submitted_Photo* and *Closed_Photo* features are converted into booleans. If the record contains the url of a picture, the value is converted to True.

In [215]:
# Create the new faeture and delete the ones containing the urls.
potholes_df["SubmittedPhoto_Bool"] = potholes_df["SubmittedPhoto"].notnull()
potholes_df["ClosedPhoto_Bool"] = potholes_df["ClosedPhoto"].notnull()
potholes_df.drop(["SubmittedPhoto","ClosedPhoto"],inplace=True,axis=1)
potholes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29451 entries, 0 to 29450
Data columns (total 25 columns):
CASE_ENQUIRY_ID                   29451 non-null int64
OPEN_DT                           29451 non-null datetime64[ns]
TARGET_DT                         29451 non-null datetime64[ns]
CLOSED_DT                         29118 non-null datetime64[ns]
OnTime_Status                     29450 non-null object
CASE_STATUS                       29451 non-null object
CLOSURE_REASON                    29125 non-null object
CASE_TITLE                        29451 non-null object
QUEUE                             29451 non-null object
Location                          29328 non-null object
fire_district                     29155 non-null float64
pwd_district                      29213 non-null object
city_council_district             29321 non-null float64
police_district                   29232 non-null object
neighborhood                      29228 non-null object
neighborhood_services_dist

### 4.2. Closure reason

Upon inspection of the *CLOSURE_REASON* column, a number of records contains information about the case. For instance,
- "Case Closed Case Invalid"
- "Case Closed Case Invalid  duplicate"
- "Case Closed Case Noted  No address given  please resubmit"
- "Case Closed Case Noted  no eform"  
  
In conlusion, it seems that the data needs to be cleaned to make sure only relevant records are considered.

The first step consists of identifying elements that are shared amongst all the relevant pothole case.
- *CLOSURE_REASON* contains the string "Case Resolved" => Case is acceptable
- *CLOSURE_REASON* contains the word "duplicate" => Case to be removed. (typos include duplcate, duplicte)
- *CLOSURE_REASON* contains the word "invalid" => Case to be removed.
- *CLOSURE_REASON* contains the words "better location" => Case to be removed.
- *CLOSURE_REASON* contains the words "please contact" => Case to be removed.
- *CLOSURE_REASON* contains the words "please call" => Case to be removed.
- *CLOSURE_REASON* contains the word "test" => Case to be removed.
- *CLOSURE_REASON* contains the words "could not find" => Case to be removed.
- *CLOSURE_REASON* contains the word "cannot" => Case to be removed.
- *CLOSURE_REASON* contains the word "private" => Case to be removed. (versions include prvt)
- *CLOSURE_REASON* contains the word "wrong" => Case to be removed.
- *CLOSURE_REASON* contains the word "nothing" => Case to be removed.
- *CLOSURE_REASON* contains the word "re-subnmit" => Case to be removed. (versions include resubmit)
- *CLOSURE_REASON* contains the words "no pot hole" => Case to be removed. (versions include no potholes, no sink hole)

Having one field set as a text box proved to be a challenge to deal with when cleaning the data.
Please note that afte the above filters are applied, roughly 400 claims are left unfiltered. Upon visual inspection, it seems that a large (if not all) requests contains relevant claim. The choice is made to keep the leftover in the set.

In [216]:
# Convert the CLOSURE_REASON content to lower case
potholes_df.CLOSURE_REASON = potholes_df.CLOSURE_REASON.str.lower()

# Create a list of all the cases considered as invalid
invalid_expressions = ['duplicate','duplcate','duplicte','invalid','better location','please contact',
                      'please call','test','could not find','cannot','private','prvt','wrong','nothing',
                      're-subnmit','resubmit','no pot hole','no potholes','no sink hole']

# In order to use the str.contains method, we need to replace the NaN values in the column by ''
potholes_df.CLOSURE_REASON.fillna(" ",inplace=True)
print(potholes_df.shape)

for invalid_key in invalid_expressions:
    dim_before = potholes_df.shape[0]
    potholes_df = potholes_df[~potholes_df.CLOSURE_REASON.str.contains(invalid_key)]
    dim_after = potholes_df.shape[0]
    print('Key: "'+invalid_key+'"',dim_before-dim_after,'matchs found.')

(29451, 25)
Key: "duplicate" 928 matchs found.
Key: "duplcate" 1 matchs found.
Key: "duplicte" 1 matchs found.
Key: "invalid" 392 matchs found.
Key: "better location" 167 matchs found.
Key: "please contact" 376 matchs found.
Key: "please call" 32 matchs found.
Key: "test" 18 matchs found.
Key: "could not find" 9 matchs found.
Key: "cannot" 16 matchs found.
Key: "private" 267 matchs found.
Key: "prvt" 18 matchs found.
Key: "wrong" 12 matchs found.
Key: "nothing" 43 matchs found.
Key: "re-subnmit" 0 matchs found.
Key: "resubmit" 42 matchs found.
Key: "no pot hole" 22 matchs found.
Key: "no potholes" 23 matchs found.
Key: "no sink hole" 3 matchs found.


### 4.2 Ward name

The first step consists of extracting the record without ward number and decide if they can be removed from the dataset.

In [217]:
print(potholes_df[potholes_df.ward.isnull()].shape)
potholes_df[potholes_df.ward.isnull()].head(50)

(105, 25)


Unnamed: 0,CASE_ENQUIRY_ID,OPEN_DT,TARGET_DT,CLOSED_DT,OnTime_Status,CASE_STATUS,CLOSURE_REASON,CASE_TITLE,QUEUE,Location,...,neighborhood_services_district,ward,precinct,LOCATION_STREET_NAME,LOCATION_ZIPCODE,LATITUDE,LONGITUDE,Source,SubmittedPhoto_Bool,ClosedPhoto_Bool
268,101001141766,2014-08-01 11:32:55,2014-08-05 11:32:55,NaT,OVERDUE,Open,,Request for Pothole Repair,PWDx_Roadway Repair_ARP_Resurfacing,,...,,,,,,42.3594,-71.0587,Self Service,False,False
584,101001130012,2014-07-16 10:19:52,2014-07-18 10:19:52,NaT,OVERDUE,Open,,Request for Pothole Repair,PWDx_Roadway Repair_CRP_Resurfacing,,...,,,,,,42.3594,-71.0587,Self Service,False,False
934,101001162784,2014-09-02 21:37:16,2015-09-05 08:30:00,NaT,OVERDUE,Open,,Request for Pothole Repair,PWDx_Roadway Repair_Ponding,,...,,,,,,42.3594,-71.0587,Self Service,False,False
1344,101001151755,2014-08-17 16:15:59,2014-08-20 08:30:00,2014-08-21 10:27:18,OVERDUE,Closed,case closed case resolved it belonngs to b w s...,Request for Pothole Repair,PWDx_District 05: South Boston,,...,,,,,,42.3594,-71.0587,Self Service,False,False
1398,101001031585,2014-02-20 11:15:12,2014-02-28 11:13:00,2014-02-27 12:08:36,ONTIME,Closed,case closed case resolved done,Request for Pothole Repair,PWDx_District 1A: Charlestown,,...,,,,,,42.3594,-71.0587,Employee Generated,False,False
1733,101001316714,2015-02-25 19:11:59,2015-02-27 08:30:00,2015-04-14 07:29:33,OVERDUE,Closed,case closed case resolved,Request for Pothole Repair,PWDx_District 10A: Roxbury,,...,,,,,,42.3594,-71.0587,Self Service,False,False
1989,101001353573,2015-04-10 11:50:15,2015-04-13 11:50:15,2015-04-13 12:49:32,OVERDUE,Closed,case closed case resolved completed from e.bro...,Request for Pothole Repair,PWDx_District 05: South Boston,,...,,,,,,42.3594,-71.0587,Self Service,False,True
2265,101001058326,2014-04-01 09:28:11,2014-04-03 09:28:11,2014-04-04 06:22:59,OVERDUE,Closed,case closed case resolved,Request for Pothole Repair,PWDx_District 05: South Boston,,...,,,,,,42.3594,-71.0587,Self Service,False,False
2456,101002047469,2017-03-23 18:36:11,2017-03-27 08:30:00,2017-03-24 06:50:36,ONTIME,Closed,case closed case noted no location specified ...,Request for Pothole Repair,PWDx_Requests for Pothole Repair,,...,,,,,,42.3594,-71.0587,Citizens Connect App,False,False
2766,101001338019,2015-03-20 09:18:21,2015-03-25 09:18:00,2015-03-24 10:51:36,ONTIME,Closed,case closed case resolved,Request for Pothole Repair,PWDx_District 10B: Roxbury,,...,,,,,,42.3594,-71.0587,Self Service,False,True


The records having both their location as Nan and their ward as Nan are removed. We choose to keep the other since there is a change to retrieve the locations later.

In [218]:
potholes_df = potholes_df[potholes_df.ward.notnull() & potholes_df.Location.notnull()]
potholes_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26976 entries, 0 to 29450
Data columns (total 25 columns):
CASE_ENQUIRY_ID                   26976 non-null int64
OPEN_DT                           26976 non-null datetime64[ns]
TARGET_DT                         26976 non-null datetime64[ns]
CLOSED_DT                         26648 non-null datetime64[ns]
OnTime_Status                     26976 non-null object
CASE_STATUS                       26976 non-null object
CLOSURE_REASON                    26976 non-null object
CASE_TITLE                        26976 non-null object
QUEUE                             26976 non-null object
Location                          26976 non-null object
fire_district                     26823 non-null float64
pwd_district                      26873 non-null object
city_council_district             26970 non-null float64
police_district                   26897 non-null object
neighborhood                      26890 non-null object
neighborhood_services_dist

The ward column contains records as "Ward n" and other as "n". In order to have a consistent data type, the "Ward n" record are converted into "n". Finally, the column data is converted into integer.

In [219]:
potholes_df['ward'] = potholes_df['ward'].str.extract('(\d+)').astype(int)
potholes_df.info()

  """Entry point for launching an IPython kernel.


<class 'pandas.core.frame.DataFrame'>
Int64Index: 26976 entries, 0 to 29450
Data columns (total 25 columns):
CASE_ENQUIRY_ID                   26976 non-null int64
OPEN_DT                           26976 non-null datetime64[ns]
TARGET_DT                         26976 non-null datetime64[ns]
CLOSED_DT                         26648 non-null datetime64[ns]
OnTime_Status                     26976 non-null object
CASE_STATUS                       26976 non-null object
CLOSURE_REASON                    26976 non-null object
CASE_TITLE                        26976 non-null object
QUEUE                             26976 non-null object
Location                          26976 non-null object
fire_district                     26823 non-null float64
pwd_district                      26873 non-null object
city_council_district             26970 non-null float64
police_district                   26897 non-null object
neighborhood                      26890 non-null object
neighborhood_services_dist

### 4.3 The misterious ward "0"

Several records contains a ward value of 0. In order to decide is this value is meaningfull, let's plot the data on a map to see if the ward 0 records are grouped.

In [220]:
# Extracted records of interest
wardZero_df = potholes_df[potholes_df.ward==0]
defaultLocation = wardZero_df[["LATITUDE","LONGITUDE"]].drop_duplicates()
defaultLocation

Unnamed: 0,LATITUDE,LONGITUDE
93,42.3594,-71.0587


In [221]:
# Extract default location
defaultLocation_lst = [defaultLocation.LATITUDE[93],defaultLocation.LONGITUDE[93]]
defaultLocation_lst

[42.359400000000001, -71.058700000000002]

<font color='red'>**NOTE: As part of the goal of this capstone, the choice is made to experiment with a map plotting package instead of just retrieving the corresponding the location from the Web.**</font>

In [222]:
# The folium packages is used
%matplotlib inline
import folium

In [223]:
map_ward_0 = folium.Map(location=defaultLocation_lst,zoom_start=14)

folium.Marker(location=defaultLocation_lst, 
              popup='Ward 0',
              icon=folium.Icon(color='red',icon='info-sign',)).add_to(map_ward_0)

map_ward_0

In [224]:
# Extract the Location feature of the potholes assigned to the Ward 0.
print(wardZero_df.shape)
for loc in wardZero_df.Location.drop_duplicates():
    print(loc)

(37, 25)
INTERSECTION of N Washington St & Rutherford Ave  Boston  MA
INTERSECTION of Summit St & Metropolitan Ave  Boston  MA
INTERSECTION of Saint Pauls Ave & Pond St  Boston  MA
INTERSECTION of Saint Mary's St & Beacon St  Boston  MA
INTERSECTION of Mountfort St & Saint Mary's St  Boston  MA
INTERSECTION of Medfield St & Saint Mary's St  Boston  MA
INTERSECTION of Cambridge St & Crescent St  Charlestown  MA
INTERSECTION of Commonwealth Ave & University Rd  Boston  MA


Later on the Google geo-API will be used to correct some of the locations. In order to avoid wastin an unecessary number of query (limit per day). The locations above are searched using Google Maps and the Longitude and Latitude are retrieved.

In [225]:
# Prepare dictionaries mapping the locations to their longitudes, latitues, and zip.
Ward_0_Loc_Lat = {"INTERSECTION of N Washington St & Rutherford Ave  Boston  MA":42.371427,
                 "INTERSECTION of Summit St & Metropolitan Ave  Boston  MA":42.252180,
                 "INTERSECTION of Saint Pauls Ave & Pond St  Boston  MA":42.309236,
                 "INTERSECTION of Saint Mary's St & Beacon St  Boston  MA":42.346021,
                 "INTERSECTION of Mountfort St & Saint Mary's St  Boston  MA":42.348591,
                 "INTERSECTION of Medfield St & Saint Mary's St  Boston  MA":42.345236,
                 "INTERSECTION of Cambridge St & Crescent St  Charlestown  MA":42.382121,
                 "INTERSECTION of Commonwealth Ave & University Rd  Boston  MA":42.350490}
Ward_0_Loc_Lon = {"INTERSECTION of N Washington St & Rutherford Ave  Boston  MA":-71.06282,
                 "INTERSECTION of Summit St & Metropolitan Ave  Boston  MA":-71.109310,
                 "INTERSECTION of Saint Pauls Ave & Pond St  Boston  MA":-71.134461,
                 "INTERSECTION of Saint Mary's St & Beacon St  Boston  MA":-71.106668,
                 "INTERSECTION of Mountfort St & Saint Mary's St  Boston  MA":-71.106880,
                 "INTERSECTION of Medfield St & Saint Mary's St  Boston  MA":-71.106298,
                 "INTERSECTION of Cambridge St & Crescent St  Charlestown  MA":-71.080858,
                 "INTERSECTION of Commonwealth Ave & University Rd  Boston  MA":-71.109661}
Ward_0_Loc_Zip = {"INTERSECTION of N Washington St & Rutherford Ave  Boston  MA":2129,
                 "INTERSECTION of Summit St & Metropolitan Ave  Boston  MA":2136,
                 "INTERSECTION of Saint Pauls Ave & Pond St  Boston  MA":2130,
                 "INTERSECTION of Saint Mary's St & Beacon St  Boston  MA":2215,
                 "INTERSECTION of Mountfort St & Saint Mary's St  Boston  MA":2446,
                 "INTERSECTION of Medfield St & Saint Mary's St  Boston  MA":2215,
                 "INTERSECTION of Cambridge St & Crescent St  Charlestown  MA":2129,
                 "INTERSECTION of Commonwealth Ave & University Rd  Boston  MA":2215}

# Create dataframe containing the corrected values related to Ward 0.
Ward_0_corrected_df = pd.DataFrame({"LATITUDE":Ward_0_Loc_Lat,
                                    "LONGITUDE":Ward_0_Loc_Lon,
                                    "LOCATION_ZIPCODE":Ward_0_Loc_Zip})

Ward_0_corrected_df['Location'] = Ward_0_Loc_Zip.keys()
Ward_0_corrected_df.reset_index(inplace=True,drop=True)
Ward_0_corrected_df.head()

Unnamed: 0,LATITUDE,LOCATION_ZIPCODE,LONGITUDE,Location
0,42.382121,2129,-71.080858,INTERSECTION of N Washington St & Rutherford A...
1,42.35049,2215,-71.109661,INTERSECTION of Summit St & Metropolitan Ave ...
2,42.345236,2215,-71.106298,INTERSECTION of Saint Pauls Ave & Pond St Bos...
3,42.348591,2446,-71.10688,INTERSECTION of Saint Mary's St & Beacon St B...
4,42.371427,2129,-71.06282,INTERSECTION of Mountfort St & Saint Mary's St...


In [226]:
# Apply the merge:
potholes_df_clone = potholes_df[potholes_df.ward==0].copy()
potholes_df_clone=potholes_df_clone.merge(right=Ward_0_corrected_df,on='Location')

# Re-organize and rename the column to keep only the data from the mapping.
potholes_df_clone.drop(["LOCATION_ZIPCODE_x","LATITUDE_x","LONGITUDE_x"],inplace=True,axis=1)
potholes_df_clone.rename(index=str,columns={"LATITUDE_y":"LATITUDE",
                                           "LOCATION_ZIPCODE_y":"LOCATION_ZIPCODE",
                                           "LONGITUDE_y":"LONGITUDE"},inplace=True)

potholes_df_No_Ward0 = potholes_df[potholes_df.ward!=0]

potholes_df = pd.concat([potholes_df_No_Ward0,potholes_df_clone])


In [227]:
potholes_df[potholes_df.ward==0].head()

Unnamed: 0,CASE_ENQUIRY_ID,CASE_STATUS,CASE_TITLE,CLOSED_DT,CLOSURE_REASON,ClosedPhoto_Bool,LATITUDE,LOCATION_STREET_NAME,LOCATION_ZIPCODE,LONGITUDE,...,SubmittedPhoto_Bool,TARGET_DT,city_council_district,fire_district,neighborhood,neighborhood_services_district,police_district,precinct,pwd_district,ward
0,101002062057,Closed,Request for Pothole Repair,2017-04-13 12:30:30,case closed. closed date : 2017-04-13 12:30:30...,True,42.382121,INTERSECTION N Washington St & Rutherford Ave,2129.0,-71.080858,...,False,2017-04-13 08:30:00,0.0,3.0,,0.0,,,,0
1,101002064025,Closed,Request for Pothole Repair,2017-04-18 10:27:17,case closed. closed date : 2017-04-18 10:27:17...,True,42.382121,INTERSECTION N Washington St & Rutherford Ave,2129.0,-71.080858,...,False,2017-04-14 16:39:47,0.0,3.0,,0.0,,,,0
2,101002066010,Closed,Request for Pothole Repair,2017-04-18 10:26:24,case closed. closed date : 2017-04-18 10:26:24...,True,42.382121,INTERSECTION N Washington St & Rutherford Ave,2129.0,-71.080858,...,False,2017-04-19 08:30:00,0.0,3.0,,0.0,,,,0
3,101002066592,Closed,Request for Pothole Repair,2017-04-18 10:26:08,case closed. closed date : 2017-04-18 10:26:08...,True,42.382121,INTERSECTION N Washington St & Rutherford Ave,2129.0,-71.080858,...,False,2017-04-19 08:31:05,0.0,3.0,,0.0,,,,0
4,101002012963,Closed,Request for Pothole Repair,2017-02-14 11:44:44,case closed. closed date : 2017-02-14 11:44:44...,False,42.382121,INTERSECTION N Washington St & Rutherford Ave,2129.0,-71.080858,...,True,2017-02-15 11:44:42,0.0,3.0,,0.0,,,,0


In [228]:
potholes_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26976 entries, 0 to 36
Data columns (total 25 columns):
CASE_ENQUIRY_ID                   26976 non-null int64
CASE_STATUS                       26976 non-null object
CASE_TITLE                        26976 non-null object
CLOSED_DT                         26648 non-null datetime64[ns]
CLOSURE_REASON                    26976 non-null object
ClosedPhoto_Bool                  26976 non-null bool
LATITUDE                          26976 non-null float64
LOCATION_STREET_NAME              26976 non-null object
LOCATION_ZIPCODE                  16944 non-null float64
LONGITUDE                         26976 non-null float64
Location                          26976 non-null object
OPEN_DT                           26976 non-null datetime64[ns]
OnTime_Status                     26976 non-null object
QUEUE                             26976 non-null object
Source                            26976 non-null object
SubmittedPhoto_Bool               26976 non

### 4.3 Missing CLOSED_DT

As shown above, a number of potholes have not been assigned a closed date. Since the purpose of this project is to analyze the response time of the city to repair potholes, these cases are not relevant. Therefore, they are removed from the data frame.

In [229]:
potholes_df=potholes_df[potholes_df.CLOSED_DT.notnull()]
potholes_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26648 entries, 0 to 36
Data columns (total 25 columns):
CASE_ENQUIRY_ID                   26648 non-null int64
CASE_STATUS                       26648 non-null object
CASE_TITLE                        26648 non-null object
CLOSED_DT                         26648 non-null datetime64[ns]
CLOSURE_REASON                    26648 non-null object
ClosedPhoto_Bool                  26648 non-null bool
LATITUDE                          26648 non-null float64
LOCATION_STREET_NAME              26648 non-null object
LOCATION_ZIPCODE                  16754 non-null float64
LONGITUDE                         26648 non-null float64
Location                          26648 non-null object
OPEN_DT                           26648 non-null datetime64[ns]
OnTime_Status                     26648 non-null object
QUEUE                             26648 non-null object
Source                            26648 non-null object
SubmittedPhoto_Bool               26648 non

## 5. The default location

After inspecting the resulting data, it seems that a large number of potholes are located at the "Default" location corresponding to LATITUDE=42.3594, LONGITUDE=-71.0587.

In [230]:
defaultLocation_df = potholes_df[(potholes_df.LATITUDE==42.3594) &
                                 (potholes_df.LONGITUDE==-71.0587)]
defaultLocation_df.shape

(5353, 25)

In [231]:
# The default location Latitude and Longitude are replaced with NaN in order to facilitate the cleaning.
potholes_df.loc[(potholes_df.LATITUDE==42.3594) &
            (potholes_df.LONGITUDE==-71.0587),["LATITUDE","LONGITUDE"]] = np.NaN

In order to estimate the location of these potholes, a step by step process is chosen. The records corresponding to this default location are inspected. If sufficient information can be retrieve from other feature to estimate the location then the record is kept.

** Search by zip code**

In [232]:
# Identify the mapping between neighborhoods and zip codes
potholes_df[['neighborhood',"LOCATION_ZIPCODE"]].drop_duplicates().sort_values(by="LOCATION_ZIPCODE")

Unnamed: 0,neighborhood,LOCATION_ZIPCODE
456,Beacon Hill,2108.0
13,Downtown / Financial District,2108.0
6446,Back Bay,2108.0
1185,Boston,2108.0
22,Boston,2109.0
49,Downtown / Financial District,2109.0
5454,South Boston / South Boston Waterfront,2110.0
1420,Boston,2110.0
217,Downtown / Financial District,2110.0
451,Downtown / Financial District,2111.0


The mapping is not unique for most cases. However, the following zip/neighborhood are:
 - Charlestown: zip 2129
 - East Boston: zip 2128
 
 However, this would only help to fix three records.


The precinct feature are dropped is not relevant as the zones can be redefined every couple years.

In [233]:
potholes_df.drop(["precinct"],inplace=True,axis=1)
potholes_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26648 entries, 0 to 36
Data columns (total 24 columns):
CASE_ENQUIRY_ID                   26648 non-null int64
CASE_STATUS                       26648 non-null object
CASE_TITLE                        26648 non-null object
CLOSED_DT                         26648 non-null datetime64[ns]
CLOSURE_REASON                    26648 non-null object
ClosedPhoto_Bool                  26648 non-null bool
LATITUDE                          21295 non-null float64
LOCATION_STREET_NAME              26648 non-null object
LOCATION_ZIPCODE                  16754 non-null float64
LONGITUDE                         21295 non-null float64
Location                          26648 non-null object
OPEN_DT                           26648 non-null datetime64[ns]
OnTime_Status                     26648 non-null object
QUEUE                             26648 non-null object
Source                            26648 non-null object
SubmittedPhoto_Bool               26648 non

First let's take care of the single missing "OnTime_Status".

Create a new feature *OnTime_Status_Bool* which is True is OnTime_Status is "ONTIME".

In [234]:
potholes_df["OnTime_Status_Bool"] = potholes_df["OnTime_Status"]=="ONTIME"
potholes_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26648 entries, 0 to 36
Data columns (total 25 columns):
CASE_ENQUIRY_ID                   26648 non-null int64
CASE_STATUS                       26648 non-null object
CASE_TITLE                        26648 non-null object
CLOSED_DT                         26648 non-null datetime64[ns]
CLOSURE_REASON                    26648 non-null object
ClosedPhoto_Bool                  26648 non-null bool
LATITUDE                          21295 non-null float64
LOCATION_STREET_NAME              26648 non-null object
LOCATION_ZIPCODE                  16754 non-null float64
LONGITUDE                         21295 non-null float64
Location                          26648 non-null object
OPEN_DT                           26648 non-null datetime64[ns]
OnTime_Status                     26648 non-null object
QUEUE                             26648 non-null object
Source                            26648 non-null object
SubmittedPhoto_Bool               26648 non

### 5. Default location and incorrect address

The next big challenge of the cleaning process is to correct the records with incorrect addresses. By default, their longitude and latitude were assigned to the default location (see previous). The idea is to create a function that takes a row as an input, creates an approximate addres, and retrieves the longitude and latitude from Google's geo-API.

The main challenge with this method is the limit of the API. In order to deal with it in an efficient way, we will extract all the locations from the data frame and save the unique values in a csv file. Using a separate Notebook, an algorithm is implemented to retrieve as much information as possible.  
**PLEASE SEE Google_Geo_API_Fetcher.ipynb**

In [251]:
Missing_Zip_Lat_Long = potholes_df[(potholes_df.LOCATION_ZIPCODE.isnull())|
                                   (potholes_df.LATITUDE.isnull())|
                                   (potholes_df.LONGITUDE.isnull())]
Missing_Zip_Lat_Long =Missing_Zip_Lat_Long[["Location"]].drop_duplicates()
Missing_Zip_Lat_Long.sort_values(by="Location",ascending=True,inplace=True)
Missing_Zip_Lat_Long.reset_index(inplace=True,drop=True)
Missing_Zip_Lat_Long.shape

(4530, 1)

In [252]:
Missing_Zip_Lat_Long.to_csv("./Intermediate Data/Missing_Zip_Lat_Long.csv")

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

# THE SECTION BELOW IS NOT UP TO DATE. PLEASE DISREGARD

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

In [91]:
# The Geocoder module is used.
from pygeocoder import Geocoder
import time

In [93]:
def CorrectLongitudeLatitude(df):
    '''
    Goal: retrieve the longitude and latitude of a given pothole.
    Input: Complete dataframe row.
    Output: a tuple with the longitude and latitude.
    '''
    Results = []
    Addresses = []
    Failures = 0
    Size = df.shape[0]
    
    print("Fetch starts on " + str(Size) + " items...")
    i=0
    for index, row in potholes_default.iterrows():
        Address_str = row['Location']
        time.sleep(0.2)
        # First brute pass
        try:
            Results.append(Address_str)
            Addresses.append(Geocoder.geocode(Address_str).coordinates)
        except:
            Results.append("ERROR")
            Addresses.append("ERROR")
            Failures+=1
            i+=1
        if i%50==0:
            print("Done with "+str(i)+" records. "+ str(Failures)+" failures encountered...")
    print("Fetch completed.")
    print(str(Failures) + " failures encountered.")
    return Results, Addresses

In [94]:
#test
potholes_default = potholes_df[(potholes_df.LATITUDE==42.3594) & (potholes_df.LONGITUDE==-71.0587)]
Results,Addresses = CorrectLongitudeLatitude(potholes_default)

Fetch starts on 1119 items...
Fetch completed.
1119 failures encountered.


Upon review of several records with this set of latitude and logitude, the following reason are found to explain the use of this generic geolocation.
- The pothole is located at an intersection
- The address does not have an exact number or is incomplete


Unnamed: 0,LATITUDE,LONGITUDE
93,42.3594,-71.0587


It seems that all the yard 0 corresponds to a default location. A

<font color='red'>**NOTE: As part of the goal of this capstone, the choice is made to experiment with a map plotting package instead of just retrieving the corresponding the location from the Web.**</font>

In [86]:
# The folium packages is used
%matplotlib inline
import folium

In [90]:
map_ward_0 = folium.Map(location=wardZeor_longLat[0],zoom_start=12)

for loc in wardZeor_longLat:
    folium.Marker(location=loc, popup='Ward 0',
              icon=folium.Icon(color='red',icon='info-sign',)).add_to(map_ward_0)

map_ward_0

In [17]:
%matplotlib inline
import folium

<font color='red'>**NOTE: As part of the goal of this capstone, the choice is made to experiment with a map plotting package instead of just retrieving the corresponding the location from the Web.**</font>

In [74]:
unknownLocation = [singleLatLong.LATITUDE.iloc[0],singleLatLong.LONGITUDE.iloc[0]]
unknownLocation

map_1 = folium.Map(location=unknownLocation,zoom_start=14.5)
folium.Marker(location=unknownLocation, popup='Unknow Location',
              icon=folium.Icon(color='red',icon='info-sign')).add_to(map_1)
map_1

NameError: name 'singleLatLong' is not defined

The address seems to be located at the center of the city. While the location is not off, there is no reason for more than one hundred potholes to be located at this address. The choice is made to drop the corresponding records.

In [19]:
# Drop the date where the location is null
potholes_df = potholes_df[potholes_df.Location.notnull()]
potholes_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29328 entries, 0 to 29450
Data columns (total 25 columns):
CASE_ENQUIRY_ID                   29328 non-null int64
OPEN_DT                           29328 non-null object
TARGET_DT                         29328 non-null object
CLOSED_DT                         28998 non-null object
OnTime_Status                     29327 non-null object
CLOSURE_REASON                    29005 non-null object
CASE_TITLE                        29328 non-null object
QUEUE                             29328 non-null object
Department                        29328 non-null object
SubmittedPhoto                    16526 non-null object
ClosedPhoto                       11061 non-null object
Location                          29328 non-null object
fire_district                     29155 non-null float64
pwd_district                      29213 non-null object
city_council_district             29321 non-null float64
police_district                   29232 non-null o

After this step, the missing data is contained in the following features:
- <font color='red'>*fire_district*</font>
- <font color='red'>*pwd_district*</font>
- <font color='red'>*city_council_district*</font>
- <font color='red'>*police_district*</font>
- <font color='green'>*neighborhood*</font>
- <font color='green'>*neighborhood_services_district*</font>
- <font color='green'>*ward*</font>
- <font color='green'>*precinct*</font>
- <font color='blue'>*LOCATION_ZIPCODE*</font>
- <font color='purple'>*CLOSURE_REASON*</font>

In [20]:
# Missing CLOSURE_REASON
potholes_df[potholes_df.CLOSURE_REASON.isnull()]

Unnamed: 0,CASE_ENQUIRY_ID,OPEN_DT,TARGET_DT,CLOSED_DT,OnTime_Status,CLOSURE_REASON,CASE_TITLE,QUEUE,Department,SubmittedPhoto,...,police_district,neighborhood,neighborhood_services_district,ward,precinct,LOCATION_STREET_NAME,LOCATION_ZIPCODE,LATITUDE,LONGITUDE,Source
3,101001100259,05/30/2014 09:53:06 AM,06/03/2014 09:53:06 AM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair_CRP_Resurfacing,PWDx,,...,D4,Back Bay,6.0,Ward 5,0501,INTERSECTION Berkeley St & Saint James Ave,,42.3504,-71.0726,Constituent Call
5,101001191509,10/08/2014 01:38:03 AM,12/10/2014 08:30:00 AM,,OVERDUE,,PRINTED,PWDx_Roadway Repair_Ponding,PWDx,https://mayors24.cityofboston.gov/media/boston...,...,D4,South End,6.0,Ward 8,0801,INTERSECTION Albany St & Plympton St,,42.3381,-71.0671,Citizens Connect App
7,101001637210,11/17/2015 06:10:00 PM,11/19/2015 08:30:00 AM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair_ARP_Resurfacing,PWDx,https://mayors24.cityofboston.gov/media/boston...,...,B2,Roxbury,13.0,08,0804,30 Warren St,2119.0,42.3298,-71.0831,Citizens Connect App
9,101001952215,11/15/2016 12:14:00 PM,11/16/2016 12:14:35 PM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair,PWDx,,...,D4,South End,4.0,Ward 3,0307,22 Milford St,2118.0,42.3441,-71.0700,Citizens Connect App
11,101002062120,04/11/2017 07:00:00 PM,04/13/2017 08:30:00 AM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair_CRP_Resurfacing,PWDx,https://mayors24.cityofboston.gov/media/boston...,...,A1,Downtown / Financial District,3.0,Ward 3,0304,18 Cooper St,2113.0,42.3643,-71.0563,Citizens Connect App
20,101001770109,04/16/2016 04:49:00 PM,04/20/2016 08:30:00 AM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair,PWDx,,...,E18,Hyde Park,10.0,Ward 18,1820,92 W Milton St,2136.0,42.2387,-71.1366,Citizens Connect App
21,101001692985,01/13/2016 02:18:00 PM,01/14/2016 02:18:27 PM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair_Patch Paving,PWDx,,...,E18,Hyde Park,10.0,Ward 18,1820,10-2 Milton St,2136.0,42.2378,-71.1336,Constituent Call
23,101001398694,05/30/2015 09:25:00 AM,06/02/2015 08:30:00 AM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair_Reconstruction,PWDx,,...,C6,South Boston / South Boston Waterfront,5.0,Ward 7,0701,1849 Columbia Rd,2127.0,42.3321,-71.0270,Self Service
24,101001927200,10/09/2016 02:47:00 PM,10/12/2016 08:30:00 AM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair,PWDx,https://mayors24.cityofboston.gov/media/boston...,...,C6,South Boston / South Boston Waterfront,5.0,06,0605,685 E Second St,2127.0,42.3372,-71.0357,Citizens Connect App
28,101001214546,11/10/2014 12:11:20 PM,11/13/2014 12:11:20 PM,,OVERDUE,,Request for Pothole Repair,PWDx_Roadway Repair,PWDx,,...,A1,Downtown / Financial District,4.0,Ward 3,0308,185 Harrison Ave,2111.0,42.3490,-71.0624,Constituent Call


Since only one record is incomplete for the CLOSURE_REASON, the choice is made to remove is from the data set.

In [21]:
# Remove missing data
potholes_df = potholes_df[potholes_df.CLOSURE_REASON.notnull()]
potholes_df.shape

(29005, 25)

## 5. Filling missing data

### 5.1 Fire, police

The first step consists of locating the different locations corresponding to the 

In [22]:
missing_fire = potholes_df[potholes_df.fire_district.isnull()]
missing_pwd = potholes_df[potholes_df.pwd_district.isnull()]
missing_police = potholes_df[potholes_df.police_district.isnull()]
missing_council = potholes_df[potholes_df.city_council_district.isnull()]

In [23]:
missing_fire_lat_long = [[x,y] for x,y in 
                         zip(missing_fire.LATITUDE.values,missing_fire.LONGITUDE.values)]
missing_pwd_lat_long = [[x,y] for x,y in 
                         zip(missing_fire.LATITUDE.values,missing_pwd.LONGITUDE.values)]
missing_police_lat_long = [[x,y] for x,y in 
                         zip(missing_fire.LATITUDE.values,missing_police.LONGITUDE.values)]
missing_council_lat_long = [[x,y] for x,y in 
                         zip(missing_fire.LATITUDE.values,missing_council.LONGITUDE.values)]

all_missing_dept = [missing_fire_lat_long,missing_pwd_lat_long,missing_police_lat_long,missing_council_lat_long]

In [27]:
map_2 = folium.Map(location=unknownLocation,zoom_start=12)

for loc in missing_fire_lat_long:
    folium.Marker(location=loc, popup='Unknow Location',
              icon=folium.Icon(color='red',icon='info-sign',)).add_to(map_2)
for loc in missing_pwd_lat_long:
    folium.Marker(location=loc, popup='Unknow Location',
              icon=folium.Icon(color='green',icon='info-sign',)).add_to(map_2)
for loc in missing_police_lat_long:
    folium.Marker(location=loc, popup='Unknow Location',
              icon=folium.Icon(color='blue',icon='info-sign',)).add_to(map_2)
for loc in missing_council_lat_long:
    folium.Marker(location=loc, popup='Unknow Location',
              icon=folium.Icon(color='purple',icon='info-sign',)).add_to(map_2)
map_2;