In [2]:
# The usual preamble
%matplotlib inline
import polars as pl
import matplotlib.pyplot as plt

# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')

pl.Config.set_tbl_cols(60)
pl.Config.set_fmt_str_lengths(5000)

plt.rcParams['figure.figsize'] = (15, 5)

One of the main problems with messy data is: how do you know if it's messy or not?

We're going to use the NYC 311 service request dataset again here, since it's big and a bit unwieldy.

In [3]:
# infer_schema_length=0 skips type inference, so every column is read as a string
requests = pl.read_csv('./data/311-service-requests.csv', infer_schema_length=0)

# 7.1 How do we know if it's messy? 

We're going to look at a few columns here. I know already that there are some problems with the zip code, so let's look at that first.
 
To get a sense for whether a column has problems, I usually use `.unique()` to look at all its values. If it's a numeric column, I'll instead plot a histogram to get a sense of the distribution.

When we look at the unique values in "Incident Zip", it quickly becomes clear that this is a mess.

Some of the problems:

* Zip codes are labels rather than numbers, so the column has to be read as strings (which is why we forced every column to a string above)
* There are missing values (nulls)
* Some of the zip codes are `29616-0759` or `83`
* There are some N/A values that polars didn't recognize, like 'N/A' and 'NO CLUE'

What we can do:

* Normalize 'N/A' and 'NO CLUE' into regular null values
* Look at what's up with the 83, and decide what to do
* Make everything strings

In [4]:
requests['Incident Zip'].unique()

Incident Zip
str
"""11220"""
"""11419"""
"""10000"""
"""11375"""
"""11104"""
…
"""41042"""
"""08807"""
"""92123"""
"""11716"""


# 7.2 Fixing the nan values and string/float confusion

We can pass a `null_values` option to `pl.read_csv` to clean this up a little bit. We can also specify that the type of Incident Zip is a string, not a float.

In [7]:
na_values = ['NO CLUE', 'N/A', '0']
requests = pl.read_csv(
    './data/311-service-requests.csv',
    null_values=na_values,
    schema_overrides={'Incident Zip': pl.Utf8}  
)

In [8]:
requests['Incident Zip'].unique()

Incident Zip
str
"""55164-0737"""
"""11549-3650"""
"""11697"""
"""11575"""
"""11413"""
…
"""11207"""
"""10034"""
"""11520"""
"""10282"""


# 7.3 What's up with the dashes?

In [9]:
rows_with_dashes = (requests
    .select(pl.col('Incident Zip').str.contains('-').fill_null(False))
    .to_series()
)

result = len(requests.filter(rows_with_dashes))
result

5

In [10]:
requests.filter(rows_with_dashes)

Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Action Updated Date,Community Board,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Park Facility Name,Park Borough,School Name,School Number,School Region,School Code,School Phone Number,School Address,School City,School State,School Zip,School Not Found,School or Citywide Complaint,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,i64,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,str
26550551,"""10/24/2013 06:16:34 PM""",,"""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""False Advertising""",,"""77092-2016""","""2700 EAST SELTICE WAY""","""EAST SELTICE WAY""",,,,,,"""HOUSTON""",,,"""Assigned""","""11/13/2013 11:15:20 AM""","""10/29/2013 11:16:16 AM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26548831,"""10/24/2013 09:35:10 AM""",,"""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Harassment""",,"""55164-0737""","""P.O. BOX 64437""","""64437""",,,,,,"""ST. PAUL""",,,"""Assigned""","""11/13/2013 02:30:21 PM""","""10/29/2013 02:31:06 PM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26488417,"""10/15/2013 03:40:33 PM""",,"""TLC""","""Taxi and Limousine Commission""","""Taxi Complaint""","""Driver Complaint""","""Street""","""11549-3650""","""365 HOFSTRA UNIVERSITY""","""HOFSTRA UNIVERSITY""",,,,,,"""HEMSTEAD""",,,"""Assigned""","""11/30/2013 01:20:33 PM""","""10/16/2013 01:21:39 PM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,"""La Guardia Airport""",,,,,,,,,,
26468296,"""10/10/2013 12:36:43 PM""","""10/26/2013 01:07:07 AM""","""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Debt Not Owed""",,"""29616-0759""","""PO BOX 25759""","""BOX 25759""",,,,,,"""GREENVILLE""",,,"""Closed""","""10/26/2013 09:20:28 AM""","""10/26/2013 01:07:07 AM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26461137,"""10/09/2013 05:23:46 PM""","""10/25/2013 01:06:41 AM""","""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Harassment""",,"""35209-3114""","""600 BEACON PKWY""","""BEACON PKWY""",,,,,,"""BIRMINGHAM""",,,"""Closed""","""10/25/2013 02:43:42 PM""","""10/25/2013 01:06:41 AM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,


I thought these were missing data and originally deleted them like this:

`requests['Incident Zip'][rows_with_dashes] = np.nan`

But then my friend Dave pointed out that 9-digit zip codes are normal. Let's look at all the zip codes with more than 5 digits, make sure they're okay, and then truncate them.

In [None]:
result = (requests
    .filter(pl.col('Incident Zip').str.len_chars() > 5)
    ['Incident Zip']
    .unique()
)
result

Incident Zip
str
"""35209-3114"""
"""29616-0759"""
"""77092-2016"""
"""11549-3650"""
"""55164-0737"""
"""000000"""


Those all look okay to truncate to me.

In [17]:
requests = requests.with_columns([pl.col('Incident Zip').str.slice(0,5)])
requests

Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Action Updated Date,Community Board,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Park Facility Name,Park Borough,School Name,School Number,School Region,School Code,School Phone Number,School Address,School City,School State,School Zip,School Not Found,School or Citywide Complaint,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,i64,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,str
26589651,"""10/31/2013 02:08:41 AM""",,"""NYPD""","""New York City Police Department""","""Noise - Street/Sidewalk""","""Loud Talking""","""Street/Sidewalk""","""11432""","""90-03 169 STREET""","""169 STREET""","""90 AVENUE""","""91 AVENUE""",,,"""ADDRESS""","""JAMAICA""",,"""Precinct""","""Assigned""","""10/31/2013 10:08:41 AM""","""10/31/2013 02:35:17 AM""","""12 QUEENS""","""QUEENS""",1042027,197389,"""Unspecified""","""QUEENS""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,40.708275,-73.791604,"""(40.70827532593202, -73.79160395779721)"""
26593698,"""10/31/2013 02:01:04 AM""",,"""NYPD""","""New York City Police Department""","""Illegal Parking""","""Commercial Overnight Parking""","""Street/Sidewalk""","""11378""","""58 AVENUE""","""58 AVENUE""","""58 PLACE""","""59 STREET""",,,"""BLOCKFACE""","""MASPETH""",,"""Precinct""","""Open""","""10/31/2013 10:01:04 AM""",,"""05 QUEENS""","""QUEENS""",1009349,201984,"""Unspecified""","""QUEENS""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,40.721041,-73.909453,"""(40.721040535628305, -73.90945306791765)"""
26594139,"""10/31/2013 02:00:24 AM""","""10/31/2013 02:40:32 AM""","""NYPD""","""New York City Police Department""","""Noise - Commercial""","""Loud Music/Party""","""Club/Bar/Restaurant""","""10032""","""4060 BROADWAY""","""BROADWAY""","""WEST 171 STREET""","""WEST 172 STREET""",,,"""ADDRESS""","""NEW YORK""",,"""Precinct""","""Closed""","""10/31/2013 10:00:24 AM""","""10/31/2013 02:39:42 AM""","""12 MANHATTAN""","""MANHATTAN""",1001088,246531,"""Unspecified""","""MANHATTAN""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,40.84333,-73.939144,"""(40.84332975466513, -73.93914371913482)"""
26595721,"""10/31/2013 01:56:23 AM""","""10/31/2013 02:21:48 AM""","""NYPD""","""New York City Police Department""","""Noise - Vehicle""","""Car/Truck Horn""","""Street/Sidewalk""","""10023""","""WEST 72 STREET""","""WEST 72 STREET""","""COLUMBUS AVENUE""","""AMSTERDAM AVENUE""",,,"""BLOCKFACE""","""NEW YORK""",,"""Precinct""","""Closed""","""10/31/2013 09:56:23 AM""","""10/31/2013 02:21:10 AM""","""07 MANHATTAN""","""MANHATTAN""",989730,222727,"""Unspecified""","""MANHATTAN""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,40.778009,-73.980213,"""(40.7780087446372, -73.98021349023975)"""
26590930,"""10/31/2013 01:53:44 AM""",,"""DOHMH""","""Department of Health and Mental Hygiene""","""Rodent""","""Condition Attracting Rodents""","""Vacant Lot""","""10027""","""WEST 124 STREET""","""WEST 124 STREET""","""LENOX AVENUE""","""ADAM CLAYTON POWELL JR BOULEVARD""",,,"""BLOCKFACE""","""NEW YORK""",,,"""Pending""","""11/30/2013 01:53:44 AM""","""10/31/2013 01:59:54 AM""","""10 MANHATTAN""","""MANHATTAN""",998815,233545,"""Unspecified""","""MANHATTAN""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,40.807691,-73.947387,"""(40.80769092704951, -73.94738703491433)"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
26426013,"""10/04/2013 12:01:13 AM""","""10/07/2013 04:07:16 PM""","""DPR""","""Department of Parks and Recreation""","""Maintenance or Facility""","""Structure - Outdoors""","""Park""","""11213""",,,,,,,,"""BROOKLYN""",,,"""Closed""","""10/18/2013 12:01:13 AM""","""10/07/2013 04:07:16 PM""","""08 BROOKLYN""","""BROOKLYN""",,,"""Brower Park""","""BROOKLYN""","""Brower Park""","""B012""",,,"""7189658900""","""Brooklyn, St. Marks, Kingston Avenues, Park Place""","""BROOKLYN""","""NY""","""11213""","""N""",,,,,,,,,,,,,,
26428083,"""10/04/2013 12:01:05 AM""","""10/04/2013 02:13:50 AM""","""NYPD""","""New York City Police Department""","""Illegal Parking""","""Posted Parking Sign Violation""","""Street/Sidewalk""","""11434""",,,,,"""GUY R BREWER BOULEVARD""","""ROCKAWAY BOULEVARD""","""INTERSECTION""","""JAMAICA""",,"""Precinct""","""Closed""","""10/04/2013 08:01:05 AM""","""10/04/2013 02:13:50 AM""","""13 QUEENS""","""QUEENS""",1048801,178419,"""Unspecified""","""QUEENS""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,40.65616,-73.767353,"""(40.656160351546845, -73.76735262738222)"""
26428987,"""10/04/2013 12:00:45 AM""","""10/04/2013 01:25:01 AM""","""NYPD""","""New York City Police Department""","""Noise - Street/Sidewalk""","""Loud Talking""","""Street/Sidewalk""","""10016""","""344 EAST 28 STREET""","""EAST 28 STREET""","""MOUNT CARMEL PLACE""","""1 AVENUE""",,,"""ADDRESS""","""NEW YORK""",,"""Precinct""","""Closed""","""10/04/2013 08:00:45 AM""","""10/04/2013 01:25:01 AM""","""06 MANHATTAN""","""MANHATTAN""",990637,208987,"""Unspecified""","""MANHATTAN""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,40.740295,-73.976952,"""(40.740295354643706, -73.97695165980414)"""
26426115,"""10/04/2013 12:00:28 AM""","""10/04/2013 04:17:32 AM""","""NYPD""","""New York City Police Department""","""Noise - Commercial""","""Loud Talking""","""Club/Bar/Restaurant""","""11226""","""1233 FLATBUSH AVENUE""","""FLATBUSH AVENUE""","""AVENUE D""","""NEWKIRK AVENUE""",,,"""ADDRESS""","""BROOKLYN""",,"""Precinct""","""Closed""","""10/04/2013 08:00:28 AM""","""10/04/2013 04:17:32 AM""","""14 BROOKLYN""","""BROOKLYN""",996654,172515,"""Unspecified""","""BROOKLYN""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,40.640182,-73.955306,"""(40.64018174662485, -73.95530566958138)"""


Done.

Earlier I thought 00083 was a broken zip code, but it turns out Central Park's zip code is 00083! Shows what I know. I'm still concerned about the 00000 zip codes, though: let's look at those.

In [19]:
is_00000 = requests['Incident Zip'] == '00000'
requests.filter(is_00000)

Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Action Updated Date,Community Board,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Park Facility Name,Park Borough,School Name,School Number,School Region,School Code,School Phone Number,School Address,School City,School State,School Zip,School Not Found,School or Citywide Complaint,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,i64,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,str
26529313,"""10/22/2013 02:51:06 PM""",,"""TLC""","""Taxi and Limousine Commission""","""Taxi Complaint""","""Driver Complaint""",,"""00000""","""EWR EWR""","""EWR""",,,,,,"""NEWARK""",,,"""Assigned""","""12/07/2013 09:53:51 AM""","""10/23/2013 09:54:43 AM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,"""Other""",,,,,,,,,,
26507389,"""10/17/2013 05:48:44 PM""",,"""TLC""","""Taxi and Limousine Commission""","""Taxi Complaint""","""Driver Complaint""","""Street""","""00000""","""1 NEWARK AIRPORT""","""NEWARK AIRPORT""",,,,,,"""NEWARK""",,,"""Assigned""","""12/02/2013 11:59:46 AM""","""10/18/2013 12:01:08 PM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,"""Other""",,,,,,,,,,


This looks bad to me. Let's set these to null.

In [20]:
is_00000 = requests['Incident Zip'] == '00000'
requests = requests.with_columns(
    pl.when(is_00000)
      .then(None)
      .otherwise(pl.col('Incident Zip'))
      .alias('Incident Zip')
)

Great. Let's see where we are now:

In [21]:
unique_zips = requests['Incident Zip'].unique()
unique_zips.sort()
unique_zips

Incident Zip
str
"""10026"""
"""11412"""
"""07114"""
"""10024"""
"""11235"""
…
"""10310"""
"""11354"""
"""02061"""
"""11042"""


Amazing! This is much cleaner. There's something a bit weird here, though -- I looked up 77056 on Google Maps, and that's in Texas.

Let's take a closer look:

In [22]:
zips = pl.col('Incident Zip')
is_close = (zips.str.starts_with('0') | zips.str.starts_with('1'))
is_far = (~is_close) & (zips.is_not_null())

In [23]:
requests.filter(is_far)

Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Action Updated Date,Community Board,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Park Facility Name,Park Borough,School Name,School Number,School Region,School Code,School Phone Number,School Address,School City,School State,School Zip,School Not Found,School or Citywide Complaint,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,i64,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,str
26575658,"""10/28/2013 04:11:05 PM""",,"""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Debt Not Owed""",,"""77056""","""5251 WESTHEIMER""","""WESTHEIMER""",,,,,,"""HOUSTON""",,,"""Open""","""11/01/2013 04:11:05 PM""",,"""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26578251,"""10/28/2013 11:02:26 AM""",,"""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Contract Dispute""",,"""70711""","""139 ACKERMAN AVENUE""","""ACKERMAN AVENUE""",,,,,,"""CLIFTON""",,,"""Open""","""11/01/2013 11:02:26 AM""",,"""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26550551,"""10/24/2013 06:16:34 PM""",,"""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""False Advertising""",,"""77092""","""2700 EAST SELTICE WAY""","""EAST SELTICE WAY""",,,,,,"""HOUSTON""",,,"""Assigned""","""11/13/2013 11:15:20 AM""","""10/29/2013 11:16:16 AM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26548831,"""10/24/2013 09:35:10 AM""",,"""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Harassment""",,"""55164""","""P.O. BOX 64437""","""64437""",,,,,,"""ST. PAUL""",,,"""Assigned""","""11/13/2013 02:30:21 PM""","""10/29/2013 02:31:06 PM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26528948,"""10/22/2013 09:29:57 AM""",,"""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Billing Dispute""",,"""90010""","""4751 WILSHIER BLVD""","""WILSHIER BLVD""",,,,,,"""LOS ANGELES""",,,"""Assigned""","""11/09/2013 06:10:20 PM""","""10/25/2013 06:11:06 PM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
26479594,"""10/13/2013 04:19:13 PM""",,"""TLC""","""Taxi and Limousine Commission""","""Taxi Complaint""","""Driver Complaint""","""Street""","""NA""","""NA NA""","""NA""",,,,,,"""NA""",,,"""Assigned""","""11/29/2013 08:41:20 AM""","""10/15/2013 08:42:38 AM""","""Unspecified QUEENS""","""QUEENS""",,,"""Unspecified""","""QUEENS""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,"""JFK Airport""",,,,,,,,,,
26489268,"""10/11/2013 06:51:31 PM""","""10/27/2013 01:07:21 AM""","""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Billing Dispute""",,"""61702""","""P.O. BOX""","""BOX""",,,,,,"""BLOOMIGTON""",,,"""Closed""","""10/30/2013 08:08:59 AM""","""10/27/2013 01:07:21 AM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26468296,"""10/10/2013 12:36:43 PM""","""10/26/2013 01:07:07 AM""","""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Debt Not Owed""",,"""29616""","""PO BOX 25759""","""BOX 25759""",,,,,,"""GREENVILLE""",,,"""Closed""","""10/26/2013 09:20:28 AM""","""10/26/2013 01:07:07 AM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,
26461137,"""10/09/2013 05:23:46 PM""","""10/25/2013 01:06:41 AM""","""DCA""","""Department of Consumer Affairs""","""Consumer Complaint""","""Harassment""",,"""35209""","""600 BEACON PKWY""","""BEACON PKWY""",,,,,,"""BIRMINGHAM""",,,"""Closed""","""10/25/2013 02:43:42 PM""","""10/25/2013 01:06:41 AM""","""0 Unspecified""","""Unspecified""",,,"""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""Unspecified""","""N""",,,,,,,,,,,,,,


In [24]:
requests.filter(is_far) \
    .select(['Incident Zip', 'Descriptor', 'City']) \
    .sort('Incident Zip')

Incident Zip,Descriptor,City
str,str,str
"""23502""","""Harassment""","""NORFOLK"""
"""23541""","""Harassment""","""NORFOLK"""
"""29616""","""Debt Not Owed""","""GREENVILLE"""
"""35209""","""Harassment""","""BIRMINGHAM"""
"""41042""","""Harassment""","""FLORENCE"""
…,…,…
"""77092""","""False Advertising""","""HOUSTON"""
"""90010""","""Billing Dispute""","""LOS ANGELES"""
"""92123""","""Harassment""","""SAN DIEGO"""
"""92123""","""Billing Dispute""","""SAN DIEGO"""


Okay, there really are requests coming from LA and Houston! Good to know. Filtering by zip code is probably a bad way to handle this -- we should really be looking at the city instead.

In [25]:
requests['City'].str.to_uppercase().value_counts()

City,count
str,u32
"""WOODSIDE""",609
"""BIRMINGHAM""",1
"""RICHMOND HILL""",404
"""EDGEWATER""",1
"""FLORENCE""",1
…,…
"""BLOOMIGTON""",1
"""LAWRENCE""",1
"""NANUET""",1
"""BRIARWOOD""",1


It looks like these are legitimate complaints, so we'll just leave them alone.

# 7.4 Putting it together

Here's what we ended up with:

In [26]:
requests['Incident Zip'].unique()

Incident Zip
str
"""11423"""
"""10459"""
"""07306"""
"""11363"""
"""11693"""
…
"""10468"""
"""07093"""
"""11004"""
"""11416"""


<style>
    @font-face {
        font-family: "Computer Modern";
        src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');
    }
    div.cell{
        width:800px;
        margin-left:16% !important;
        margin-right:auto;
    }
    h1 {
        font-family: Helvetica, serif;
    }
    h4{
        margin-top:12px;
        margin-bottom: 3px;
       }
    div.text_cell_render{
        font-family: Computer Modern, "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;
        line-height: 145%;
        font-size: 130%;
        width:800px;
        margin-left:auto;
        margin-right:auto;
    }
    .CodeMirror{
            font-family: "Source Code Pro", source-code-pro,Consolas, monospace;
    }
    .text_cell_render h5 {
        font-weight: 300;
        font-size: 22pt;
        color: #4057A1;
        font-style: italic;
        margin-bottom: .5em;
        margin-top: 0.5em;
        display: block;
    }
    
    .warning{
        color: rgb( 240, 20, 20 )
        }  