# Data Aggregation

First, we'll load 100,000 rows of NYC 311 data into a pandas DataFrame:

In [1]:
import pandas as pd
nyc_311_100k = pd.read_csv("https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000")


Let's take a look at the number of unique values for each variable in our dataset:

In [2]:
nyc_311_100k.nunique()

unique_key                        100000
created_date                       44279
closed_date                        28932
agency                                14
agency_name                           95
complaint_type                       130
descriptor                           634
location_type                         67
incident_zip                         368
incident_address                   38449
street_name                         5056
cross_street_1                      6671
cross_street_2                      6744
intersection_street_1               3264
intersection_street_2               3394
address_type                           4
city                                  88
landmark                              24
facility_type                          4
status                                 6
due_date                           23660
resolution_description               389
resolution_action_updated_date      8674
community_board                       75
bbl             
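
As a small aside, here is a sketch of what `nunique()` counts, using a made-up miniature frame (the `toy` DataFrame and its values are hypothetical, not taken from the 311 data). By default, missing values are not counted as a distinct value; passing `dropna=False` includes them.

```python
import pandas as pd

# Hypothetical miniature of the 311 data; values are invented for illustration.
toy = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "QUEENS", None],
    "status": ["Closed", "Open", "Closed", "Closed"],
})

# nunique() counts distinct non-missing values in each column.
unique_counts = toy.nunique()

# dropna=False treats a missing value as one more distinct value.
unique_with_missing = toy.nunique(dropna=False)

print(unique_counts)
print(unique_with_missing)
```

This matters when reading the counts above: a column with many missing entries can still report a small number of unique values.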

Interestingly, we have six kinds of statuses and six borough values (the five boroughs plus "Unspecified"). It might be nice to know how often each status occurs in each borough.

In [6]:
status_counts = nyc_311_100k.groupby(['borough', 'status']).size()
status_counts

borough        status     
BRONX          Assigned         118
               Closed          6478
               Open              55
               Pending          710
               Started            1
BROOKLYN       Assigned         101
               Closed         14261
               Open              42
               Pending          738
               Started            3
MANHATTAN      Assigned         131
               Closed         10491
               Open              37
               Pending          332
QUEENS         Assigned         242
               Closed         14268
               Open              79
               Pending          624
               Unspecified        2
STATEN ISLAND  Assigned          27
               Closed          3541
               Open              75
               Pending          202
Unspecified    Closed         47405
               Open              37
dtype: int64
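
Note that the result only lists combinations that actually occur; for example, MANHATTAN has no "Started" row. One way to see the gaps explicitly is to pivot the inner index level into columns with `unstack`. The following is a minimal sketch on a made-up frame (the `toy` data is hypothetical), assuming we want missing combinations shown as zero:

```python
import pandas as pd

# Hypothetical miniature of the 311 data; values are invented for illustration.
toy = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS"],
    "status": ["Closed", "Closed", "Open", "Closed", "Pending"],
})

# Same pattern as above: count rows per (borough, status) pair.
toy_counts = toy.groupby(["borough", "status"]).size()

# Move the 'status' level into columns; absent pairs become 0 instead of NaN.
toy_table = toy_counts.unstack(fill_value=0)
print(toy_table)
```

The same call on `status_counts` should give one row per borough and one column per status.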

That's great, but it only gives us raw counts, not percentages. To get each status's share within its borough, we can divide every count by its borough's total:

In [7]:
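# group_keys=False keeps the original (borough, status) index in newer pandas versions
status_pcts = status_counts.groupby(level=0, group_keys=False).apply(lambda x: 100 * x / float(x.sum()))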
status_pcts

borough        status     
BRONX          Assigned        1.602825
               Closed         87.992393
               Open            0.747080
               Pending         9.644118
               Started         0.013583
BROOKLYN       Assigned        0.666887
               Closed         94.163090
               Open            0.277319
               Pending         4.872895
               Started         0.019809
MANHATTAN      Assigned        1.191884
               Closed         95.450823
               Open            0.336639
               Pending         3.020653
QUEENS         Assigned        1.590536
               Closed         93.775879
               Open            0.519224
               Pending         4.101216
               Unspecified     0.013145
STATEN ISLAND  Assigned        0.702211
               Closed         92.093628
               Open            1.950585
               Pending         5.253576
Unspecified    Closed         99.922010
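
An alternative way to compute the same percentages is `transform`, which broadcasts each borough's total back onto the original index so the division lines up row by row. This is a sketch on a made-up frame (the `toy` data and variable names are hypothetical), not necessarily how the percentages above were produced:

```python
import pandas as pd

# Hypothetical miniature of the 311 data; values are invented for illustration.
toy = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS"],
    "status": ["Closed", "Closed", "Open", "Closed", "Pending"],
})

toy_counts = toy.groupby(["borough", "status"]).size()

# transform("sum") returns a Series with the same index as toy_counts,
# where every row holds the total for its borough.
borough_totals = toy_counts.groupby(level="borough").transform("sum")

# Element-wise division gives each status's share within its borough.
toy_pcts = 100 * toy_counts / borough_totals
print(toy_pcts)
```

A quick sanity check for either approach is that the percentages within each borough sum to 100.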
             