# Capstone Data Wrangling
#### Springboard Data Science Career Track
##### Tamara Monge

### (A) Data Cleaning Performed

In [1]:
import pandas as pd
from datetime import datetime
df = pd.read_csv ('Documents/Data Science Course/Capstone1/Parking_Citations.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1345053 entries, 0 to 1345052
Data columns (total 21 columns):
Citation           1345053 non-null int64
Tag                1344837 non-null object
ExpMM              1300960 non-null object
ExpYY              1345047 non-null float64
State              1345053 non-null object
Make               1343712 non-null object
Address            1345051 non-null object
ViolCode           1344668 non-null float64
Description        1345053 non-null object
ViolFine           1344668 non-null object
ViolDate           1340590 non-null object
Balance            1345053 non-null object
PenaltyDate        0 non-null float64
OpenFine           1344668 non-null object
OpenPenalty        1344668 non-null object
NoticeDate         602001 non-null object
ImportDate         1345053 non-null object
Neighborhood       206670 non-null object
PoliceDistrict     206670 non-null object
CouncilDistrict    206691 non-null float64
Location           1323450 non-null object

As shown above, the initial DataFrame was composed of 21 columns: 16 object Series (Tag, ExpMM, State, Make, Address, Description, ViolFine, ViolDate, Balance, OpenFine, OpenPenalty, NoticeDate, ImportDate, Neighborhood, PoliceDistrict, Location), 4 float64 Series (ExpYY, ViolCode, PenaltyDate, CouncilDistrict), and 1 int64 Series (Citation).


Four of the columns (ViolFine, Balance, OpenFine, OpenPenalty) contained financial data in string format and needed to be converted to floats. This process required 3 cleaning steps. First, null values (non-strings) were removed. Second, a lambda function was applied that trimmed the '$'. Third, the values were converted to floats:

In [None]:
df.ViolFine = df.ViolFine[df.ViolFine.apply(type) == str].apply(lambda x: x[1:]).astype(float)   
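The same three steps apply to all four financial columns. The following is a minimal sketch on a toy frame, assuming every non-null entry is a string that begins with '$'; the sample values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real citations table (values are illustrative only)
toy = pd.DataFrame({
    'ViolFine': ['$32.00', None, '$502.00'],
    'Balance': ['$0.00', '$32.00', '$100.00'],
    'OpenFine': ['$0.00', None, '$502.00'],
    'OpenPenalty': ['$0.00', None, '$16.00'],
})

money_cols = ['ViolFine', 'Balance', 'OpenFine', 'OpenPenalty']
for col in money_cols:
    # The .str methods skip nulls, so missing values remain NaN after the float cast
    toy[col] = toy[col].str.lstrip('$').astype(float)
```

Looping over a list of column names keeps the cleaning rule in one place, so a change to the rule reaches all four columns at once.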

One of the columns (ExpMM) contained string information in mixed formats (e.g., '001', '01', '1.0', and 'JAN' all corresponded to the month of January). These data needed to be cleaned to a single format. This process required the creation of dictionaries which I then used to map the undesired formats to the desired format:  

In [None]:
dict1 = {'01':'JAN', '02':'FEB', '03':'MAR', '04':'APR', '05':'MAY', '06':'JUN', '07':'JUL', '08':'AUG', '09':'SEP', '10':'OCT', '11':'NOV', '12':'DEC', '00':'', 'PE':''}
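# Assign the result back to the column; an in-place replace on df.ExpMM may not update df in newer pandas versions
df['ExpMM'] = df['ExpMM'].replace(dict1)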
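An alternative to maintaining one dictionary entry per raw format is to normalize the numeric variants first. The sketch below is not the method used above; `normalize_month` is a hypothetical helper, and the sample values are the formats mentioned in the text plus the two junk codes from the dictionary.

```python
import pandas as pd

month_abbr = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
              'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']

def normalize_month(value):
    """Map one raw ExpMM entry to a three-letter abbreviation ('' if unrecognized)."""
    if pd.isna(value):
        return value                      # leave missing values untouched
    text = str(value).strip().upper()
    if text in month_abbr:
        return text                       # already in the desired format
    try:
        number = int(float(text))         # handles '001', '01', and '1.0'
    except (ValueError, OverflowError):
        return ''                         # non-numeric junk such as 'PE'
    return month_abbr[number - 1] if 1 <= number <= 12 else ''

raw = pd.Series(['001', '01', '1.0', 'JAN', '12', '00', 'PE', None])
clean = raw.apply(normalize_month)
```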

One of the columns (ExpYY) contained 2-digit year information and required two cleaning steps. First, the values were converted to strings. Second, the century was prepended:

In [None]:
df.ExpYY = df.ExpYY[df.ExpYY.notnull()].apply(lambda x: int(x)).astype(str)
df.ExpYY = '20' + df.ExpYY
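One edge case is worth checking: a year stored as `9.0` becomes the string '9', and prepending '20' then yields '209' rather than '2009'. Whether such single-digit values occur in this dataset is an assumption I have not verified. A minimal sketch that zero-pads before prepending the century, on invented sample values:

```python
import pandas as pd

exp_yy = pd.Series([17.0, 9.0, None])   # two-digit years stored as floats

# Cast non-null years to int, zero-pad to two digits, then prepend the century
exp_year = '20' + exp_yy.dropna().astype(int).astype(str).str.zfill(2)

# Restore the rows that were null so the result aligns with the original Series
exp_year = exp_year.reindex(exp_yy.index)
```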

One of the columns (Make) contained data in string format, with varying cases and varying numbers of characters representing the same category (e.g., 'Hon' and 'HONDA'). This column required two cleaning steps. First, the case was standardized to uppercase. Second, each value was truncated to its first three characters:

In [None]:
df.Make = df.Make.str.upper()
df.Make = df.Make[df.Make.apply(type) == str].apply(lambda x: x[:3]) 

Two of the columns (State, PoliceDistrict) contained data in string format with varying cases. These columns required only one cleaning step, case standardization:

In [None]:
df.State = df.State.str.upper()
df.PoliceDistrict = df.PoliceDistrict.str.upper()

One of the columns (CouncilDistrict) contained float64 values that are really category labels. So that descriptive statistics would treat them as categories rather than numbers, the values were converted to strings:

In [None]:
df.CouncilDistrict = df.CouncilDistrict[df.CouncilDistrict.notnull()].astype(str)     

One of the columns (NoticeDate) contained data in string format and needed to be converted to timestamp/datetime objects:

In [None]:
df.NoticeDate = df.NoticeDate[pd.notnull(df.NoticeDate)].apply(lambda x: datetime.strptime(x, '%m/%d/%Y'))
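The same conversion can also be done with the vectorized `pd.to_datetime`, which maps missing entries to `NaT` without an explicit null filter. A minimal sketch on invented sample dates, assuming every non-null value follows the '%m/%d/%Y' format:

```python
import pandas as pd

notice = pd.Series(['03/15/2017', None, '12/01/2016'])

# Missing entries become NaT; a malformed date string would raise an error
notice_dt = pd.to_datetime(notice, format='%m/%d/%Y')
```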

One of the columns (ViolDate) contained data in string format that needed to be converted to a DatetimeIndex. This required 4 cleaning steps. First, the Series was converted to timestamp objects. Second, the index of the DataFrame was set to the Series. Third, the year, month, and hour of the timestamp objects were extracted and saved as 3 new Series within the DataFrame. Fourth, the redundant Series was dropped:

In [None]:
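df.ViolDate = df.ViolDate[pd.notnull(df.ViolDate)].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'))
# Bracket assignment is needed to create a new column; df.ViolYear = ... only sets an attribute
df['ViolYear'] = df.ViolDate[pd.notnull(df.ViolDate)].dt.year.astype(int)
df.index = df.ViolDate
# drop() returns a new DataFrame, so the result must be assigned back
df = df.drop('ViolDate', axis=1)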
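The cell above shows only the year. A minimal sketch of all four steps on invented timestamps follows; the column names `ViolMonth` and `ViolHour` are hypothetical, chosen by analogy with `ViolYear`.

```python
import pandas as pd

toy = pd.DataFrame({'ViolDate': ['01/05/2017 08:30:00 AM',
                                 '11/20/2016 03:15:00 PM']})

# Step 1: parse the strings into timestamps
toy['ViolDate'] = pd.to_datetime(toy['ViolDate'], format='%m/%d/%Y %I:%M:%S %p')

# Step 3: extract components into new columns
toy['ViolYear'] = toy['ViolDate'].dt.year
toy['ViolMonth'] = toy['ViolDate'].dt.month   # hypothetical column name
toy['ViolHour'] = toy['ViolDate'].dt.hour     # hypothetical column name

# Steps 2 and 4: set_index moves the column into the index, so no separate drop is needed
toy = toy.set_index('ViolDate')
```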

One of the columns (Location) housed multiple pieces of information in a single string (e.g., '6000 CHINQUAPIN PKWY\nBaltimore, MD\n(39.365093, -76.59764)'). This required two cleaning steps: splitting the string and extracting the latitude and longitude components into new columns.

In [None]:
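# Location is a Series, so use the vectorized .str accessor; the third line of each string holds '(lat, lon)'
coords = df.Location.str.split('\n').str[2].str.strip('()').str.split(',', expand=True)
df['Lat'] = coords[0].astype(float)
df['Lon'] = coords[1].astype(float)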
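A regular expression is another way to pull out the coordinates, and it does not depend on the pair sitting on the third line of the string. This is a minimal sketch using the example string from the text; the pattern assumes plain decimal coordinates inside parentheses.

```python
import pandas as pd

loc = pd.Series(['6000 CHINQUAPIN PKWY\nBaltimore, MD\n(39.365093, -76.59764)',
                 None])

# Two capture groups: latitude and longitude; rows without a match become NaN
pattern = r'\((-?\d+\.?\d*),\s*(-?\d+\.?\d*)\)'
coords = loc.str.extract(pattern).astype(float)
coords.columns = ['Lat', 'Lon']
```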

Two of the columns (PenaltyDate, ImportDate) contained irrelevant and/or all null data and were dropped:

In [None]:
df.drop(['PenaltyDate', 'ImportDate'], axis=1, inplace=True)

Five of the columns (Citation, Tag, Address, Description, Neighborhood) required no cleaning.

### (B) Handling Missing Data

In [2]:
df.count()/df.Citation.count()*100

Citation           100.000000
Tag                 99.983941
ExpMM               96.721839
ExpYY               99.999554
State              100.000000
Make                99.900301
Address             99.999851
ViolCode            99.971377
Description        100.000000
ViolFine            99.971377
ViolDate            99.668192
Balance            100.000000
PenaltyDate          0.000000
OpenFine            99.971377
OpenPenalty         99.971377
NoticeDate          44.756675
ImportDate         100.000000
Neighborhood        15.365194
PoliceDistrict      15.365194
CouncilDistrict     15.366755
Location            98.393892
dtype: float64

Two columns (PenaltyDate, ImportDate) were dropped, as described in section (A).


Fifteen of the remaining 19 columns were missing less than 5% of their data. For these columns, I chose to simply ignore the missing values.
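The 5% cutoff can be applied programmatically rather than by reading the table. A minimal sketch on a toy frame with invented values; the names `mostly_complete` and `sparse` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Citation': [1, 2, 3, 4],
    'Tag': ['A', 'B', 'C', 'D'],
    'Neighborhood': ['X', None, None, None],
})

# Percentage of missing values per column
pct_missing = toy.isnull().mean() * 100

# Split the columns at the 5% threshold
mostly_complete = pct_missing[pct_missing < 5].index.tolist()
sparse = pct_missing[pct_missing >= 5].index.tolist()
```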

The remaining columns (NoticeDate, Neighborhood, PoliceDistrict, and CouncilDistrict) are secondary to the main questions of this study and may well go unused. If I do decide to use them, I will need to determine how to handle their missing values.