# Working with Data - Computer Lab for Guest Lecture Julia Lane

In this computer lab we will go into more detail and practice data work to complement the lecture presented by Julia Lane on responsible data use. We will address a research question, think about data and measurement errors, and manipulate data. 

OUTLINE: 
1. Define a research question 
2. Think about what data are available 
3. Think about possible measurement errors 
4. Think about the interpretation of your results 
5. Inform your results by linking datasets 

# 1. Define a research question
Which Community Districts in NYC show the highest number of complaints?

# 2. Think about what data are available
Find suitable data by searching the CUSP Data Catalog https://datahub.cusp.nyu.edu/catalog. You can use Urban Profiler to investigate the metadata associated with each dataset. This tool helps you decide which attributes of the data you need to answer your question, so you don't have to load the entire dataset. 
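
As a minimal sketch of loading only the attributes you need, `pd.read_csv` accepts a `usecols` list. The tiny in-memory CSV below is a made-up stand-in for the 311 file, and the choice of columns is an assumption about what the research question requires.

```python
import io
import pandas as pd

# Stand-in for the 311 extract: a tiny CSV held in memory (made-up rows).
sample_csv = io.StringIO(
    "Unique Key,Created Date,Agency,Complaint Type,Community Board,Borough\n"
    "1,11/05/2015 02:59:15 AM,DOT,Street Condition,13 BROOKLYN,BROOKLYN\n"
    "2,11/05/2015 02:02:20 AM,NYPD,Illegal Parking,12 BROOKLYN,BROOKLYN\n"
)

# Columns assumed sufficient for counting complaints per Community District.
wanted = ["Unique Key", "Complaint Type", "Community Board", "Borough"]

# usecols makes the parser skip every column that is not listed.
subset = pd.read_csv(sample_csv, usecols=wanted)
print(subset.shape)
```

With a file of roughly ten million rows, skipping unused columns can noticeably reduce memory use and load time.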

In [1]:
import os
import pandas as pd
import numpy as np
import re
import string
PUIdata = os.getenv('PUIDATA')

In [53]:
# Load dataset
datfile = "/projects/open/NYCOpenData/nycopendata/data/erm2-nwe9/1446832678/erm2-nwe9"
c311reqs = pd.read_csv(datfile, header=0)
c311reqs.shape

(10187766, 53)

In [54]:
c311reqs.head(5)

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
0,31911011,11/05/2015 02:59:15 AM,,DOT,Department of Transportation,Street Condition,Pothole,,11224.0,,...,,,,,,,,40.573431,-73.991742,"(40.57343122248129, -73.99174247588253)"
1,31908754,11/05/2015 02:09:49 AM,,CHALL,CHALL,Opinion for the Mayor,HOUSING,,,,...,,,,1-1-1173130914,,,,,,
2,31910423,11/05/2015 02:06:51 AM,,DPR,Department of Parks and Recreation,Root/Sewer/Sidewalk Condition,Trees and Sidewalks Program,Street,11234.0,1157 EAST 57 STREET,...,,,,,,,,40.625004,-73.920726,"(40.62500363580505, -73.92072558378698)"
3,31909924,11/05/2015 02:02:20 AM,,NYPD,New York City Police Department,Illegal Parking,Blocked Hydrant,Street/Sidewalk,11218.0,722 EAST 4 STREET,...,,,,,,,,40.634522,-73.97479,"(40.634522428879706, -73.97479041437481)"
4,31913310,11/05/2015 01:57:20 AM,11/05/2015 01:57:31 AM,HRA,HRA Benefit Card Replacement,Benefit Card Replacement,Medicaid,NYC Street Address,,,...,,,,,,,,,,


# 3. Think about possible measurement errors
Do you see any problems regarding possible measurement error? Think about who is represented in the data, omissions, duplications, content error, missing data, etc. 

In [55]:
# Check if all Boroughs and Community Districts are represented in the Data 
boroughs = c311reqs.groupby(["Borough"]).agg('count')
boroughs["Park Borough"]

Borough
BRONX            1665625
BROOKLYN         2831932
MANHATTAN        1900005
QUEENS           2189760
STATEN ISLAND     490998
Unspecified      1109446
Name: Park Borough, dtype: int64
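
To put the "Unspecified" group in proportion, the counts above can be turned into shares. This is a small sketch that re-enters the printed counts by hand; in the lab itself you would compute the shares directly from `c311reqs`.

```python
import pandas as pd

# Record counts copied from the groupby output above.
borough_counts = pd.Series({
    "BRONX": 1665625, "BROOKLYN": 2831932, "MANHATTAN": 1900005,
    "QUEENS": 2189760, "STATEN ISLAND": 490998, "Unspecified": 1109446,
})

# Share of all records that falls into each borough label.
shares = borough_counts / borough_counts.sum()
unspecified_share = shares["Unspecified"]
print(shares.round(3))
```

About 11 percent of the records carry no borough, which is worth keeping in mind before ranking districts.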

In [56]:
# How many unique values do we have?

"""
There should be 59, 12 each in Bronx and Manhattan, 14 in Queens, 18 in Brooklyn and
3 in Staten Island

However, there is an unspecified borough (1.1 million records), an unspecified community
board for each borough (roughly 200,000 per borough except SI, which has 50,000), and a set
of invalid boards for each borough (anything over 18, about 8,000 for Queens and fewer than
2,000 for all other boroughs).
"""

com_grp = c311reqs[['Community Board','Borough','Agency']].groupby(["Community Board",'Borough'])
com_dists = com_grp.agg('count')
print(len(com_dists))
dists_list = com_dists.sort_index(level=1)
for dist in com_dists.itertuples():
    print(dist)

77
Pandas(Index=('0 Unspecified', 'Unspecified'), Agency=1109446)
Pandas(Index=('01 BRONX', 'BRONX'), Agency=74631)
Pandas(Index=('01 BROOKLYN', 'BROOKLYN'), Agency=185057)
Pandas(Index=('01 MANHATTAN', 'MANHATTAN'), Agency=77974)
Pandas(Index=('01 QUEENS', 'QUEENS'), Agency=171484)
Pandas(Index=('01 STATEN ISLAND', 'STATEN ISLAND'), Agency=182713)
Pandas(Index=('02 BRONX', 'BRONX'), Agency=60257)
Pandas(Index=('02 BROOKLYN', 'BROOKLYN'), Agency=121022)
Pandas(Index=('02 MANHATTAN', 'MANHATTAN'), Agency=133860)
Pandas(Index=('02 QUEENS', 'QUEENS'), Agency=114333)
Pandas(Index=('02 STATEN ISLAND', 'STATEN ISLAND'), Agency=121132)
Pandas(Index=('03 BRONX', 'BRONX'), Agency=75134)
Pandas(Index=('03 BROOKLYN', 'BROOKLYN'), Agency=197306)
Pandas(Index=('03 MANHATTAN', 'MANHATTAN'), Agency=150296)
Pandas(Index=('03 QUEENS', 'QUEENS'), Agency=122009)
Pandas(Index=('03 STATEN ISLAND', 'STATEN ISLAND'), Agency=136487)
Pandas(Index=('04 BRONX', 'BRONX'), Agency=181953)
Pandas(Index=('04 BROOKLYN

In [None]:
# Why do we have so many? Some of them are unspecified, missing. Some might be invalid entries. 
# We should have 59 Community Districts.
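
One possible way to tell valid labels from invalid ones is to check the district number against the per-borough counts quoted above (12, 18, 12, 14, 3). The helper `is_valid_board` and the label `"Unspecified QUEENS"` are hypothetical names introduced for this sketch; note that this check is stricter than the `[01]\d` pattern used in the marker cell of this lab.

```python
import re

# Number of Community Districts per borough, as stated in the note above.
DISTRICTS_PER_BOROUGH = {
    "BRONX": 12, "BROOKLYN": 18, "MANHATTAN": 12,
    "QUEENS": 14, "STATEN ISLAND": 3,
}

def is_valid_board(label):
    """Return True if label looks like '07 BRONX' with an in-range number."""
    match = re.fullmatch(r"(\d{2}) (.+)", label)
    if match is None:
        return False
    number, borough = int(match.group(1)), match.group(2)
    # Unknown boroughs get a limit of 0, so they are always rejected.
    return 1 <= number <= DISTRICTS_PER_BOROUGH.get(borough, 0)

for label in ["01 BRONX", "13 BRONX", "0 Unspecified", "Unspecified QUEENS"]:
    print(label, is_valid_board(label))
```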

In [57]:
# Check for duplicates? Are these plausible?
# There are 22 unique key entries that are duplicates of other unique keys in set
print(len(c311reqs["Unique Key"]))
print(len(c311reqs["Unique Key"].unique()))

10187766
10187744


In [58]:
dups = c311reqs['Unique Key'].duplicated()
dups.shape

(10187766,)

In [59]:
# Nearly all of the duplicates are rodent complaints. The two exceptions also relate to unsanitary animals
isdup = dups[dups]  # keep only the rows flagged as duplicates
dupreqs = c311reqs.loc[isdup.index]
print(len(dupreqs['Unique Key'].unique()))
dupreqs

22


Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
4248298,26003579,07/26/2013 12:00:00 AM,08/06/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Condition Attracting Rodents,1-2 Family Dwelling,10302,131 HARRISON AVENUE,...,,,,,,,,40.636835,-74.138796,"(40.63683487948972, -74.13879629882382)"
4253082,26020434,07/25/2013 12:00:00 AM,08/05/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Condition Attracting Rodents,1-2 Family Dwelling,10462,1909 BARNES AVENUE,...,,,,,,,,40.848845,-73.863631,"(40.8488451919449, -73.86363125763393)"
4253086,26012011,07/25/2013 12:00:00 AM,08/06/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Rat Sighting,3+ Family Apt. Building,11225,440 BROOKLYN AVENUE,...,,,,,,,,40.664148,-73.945482,"(40.66414769632634, -73.94548172836168)"
4253091,26002968,07/25/2013 12:00:00 AM,08/06/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Condition Attracting Rodents,1-2 Family Dwelling,11237,406 SUYDAM STREET,...,,,,,,,,40.705703,-73.920175,"(40.705702500630075, -73.92017516512333)"
4253092,26020332,07/25/2013 12:00:00 AM,08/06/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Rat Sighting,Other (Explain Below),11213,780 ST MARKS AVENUE,...,,,,,,,,40.675023,-73.946814,"(40.67502312706964, -73.94681393618372)"
4253095,26033513,07/25/2013 12:00:00 AM,08/06/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Condition Attracting Rodents,1-2 Family Dwelling,10302,60 AVENUE B,...,,,,,,,,40.636351,-74.129806,"(40.63635131789832, -74.12980576711388)"
4253097,26011524,07/25/2013 12:00:00 AM,08/02/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Rat Sighting,3+ Family Apt. Building,10031,1484 AMSTERDAM AVENUE,...,,,,,,,,40.81786,-73.953008,"(40.817859804380014, -73.95300779968848)"
4253098,26011777,07/25/2013 12:00:00 AM,08/06/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Rat Sighting,Other (Explain Below),11221,,...,,,,,,,,40.691219,-73.939679,"(40.6912192567118, -73.93967921077531)"
4253100,26011611,07/25/2013 12:00:00 AM,08/02/2013 12:00:00 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Rat Sighting,3+ Family Apt. Building,10467,2309 HOLLAND AVENUE,...,,,,,,,,40.859953,-73.865586,"(40.85995256353461, -73.86558595844559)"
4253101,26028900,07/25/2013 12:00:00 AM,08/06/2013 10:54:57 AM,DOHMH,Department of Health and Mental Hygiene,Rodent,Condition Attracting Rodents,1-2 Family Dwelling,10302,22 JEWETT AVENUE,...,,,,,,,,40.636904,-74.12882,"(40.63690412970023, -74.12881960785445)"


In [61]:
# Remove duplicates (keep the first occurrence of each Unique Key)
c311reqs = c311reqs.loc[~dups].copy()
c311reqs.shape

(10187744, 53)
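
The same result can be reached in one step with `drop_duplicates`. The sketch below uses a made-up frame, not the 311 data.

```python
import pandas as pd

# Made-up requests; key 102 appears twice.
reqs = pd.DataFrame({
    "Unique Key": [101, 102, 102, 103],
    "Complaint Type": ["Rodent", "Rodent", "Rodent", "Noise"],
})

# keep="first" retains the first row of each key and drops later repeats.
deduped = reqs.drop_duplicates(subset="Unique Key", keep="first")
print(len(reqs), len(deduped))
```

Before dropping rows it is worth checking whether the repeated keys are exact copies or carry different content, since that affects which row you want to keep.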

In [62]:
# What about missing values? Can you detect any patterns? 
# Unique Key, Agency, Complaint Type, Status, Borough and a few others all have no missing values
# Landmark, Facility Type, Vehicle Type, Bridge Highway, Ferry, are mostly or all missing
c311reqs.isnull().sum()

Unique Key                               0
Created Date                             0
Closed Date                         450879
Agency                                   0
Agency Name                              0
Complaint Type                           0
Descriptor                           31057
Location Type                      3067596
Incident Zip                        794159
Incident Address                   2197705
Street Name                        2198558
Cross Street 1                     2375942
Cross Street 2                     2425792
Intersection Street 1              8345267
Intersection Street 2              8346281
Address Type                        490299
City                                788712
Landmark                          10180434
Facility Type                      8579463
Status                                   0
Due Date                           7120286
Resolution Description             3786739
Resolution Action Updated Date      250932
Community B

In [79]:
# Generate a marker for implausible Community Districts
# What do these districts look like?

# Start with every record marked valid (1.0); invalid boards are set to NaN below
c311reqs["valid_board"] = np.ones(len(c311reqs), dtype=float)
for bname in com_dists.itertuples():
    board = bname[0][0]  # first level of the (Community Board, Borough) index
    if re.match(r"[01]\d", board) is None:
        # A single .loc call avoids chained assignment on a copy
        c311reqs.loc[c311reqs['Community Board'] == board, 'valid_board'] = np.nan
c311reqs.valid_board.isnull().sum()


2076821

In [85]:
# Drop the marked districts
c311reqs = c311reqs[~c311reqs['valid_board'].isnull()]

In [87]:
# Produce your result: count the complaints in each Community District
# so that the districts can be ranked by number of complaints
c311reqs['Complaints'] = np.zeros(len(c311reqs.valid_board))
com_grp = c311reqs[['Community Board', 'Borough',
                    'Complaints']].groupby(['Community Board', 'Borough'])
comp_by_board = com_grp.agg({'Complaints' : 'count'})
comp_by_board

Unnamed: 0_level_0,Unnamed: 1_level_0,Complaints
Community Board,Borough,Unnamed: 2_level_1
01 BRONX,BRONX,74631
01 BROOKLYN,BROOKLYN,185057
01 MANHATTAN,MANHATTAN,77974
01 QUEENS,QUEENS,171484
01 STATEN ISLAND,STATEN ISLAND,182708
02 BRONX,BRONX,60257
02 BROOKLYN,BROOKLYN,121021
02 MANHATTAN,MANHATTAN,133860
02 QUEENS,QUEENS,114333
02 STATEN ISLAND,STATEN ISLAND,121132


In [None]:
# Save reduced data frame (Community District level)
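
One way to fill in this step is to write the reduced frame to CSV and read it back as a check. The file name and the temporary directory are hypothetical; in the lab you would more likely write into the folder stored in `PUIdata`.

```python
import os
import tempfile
import pandas as pd

# Small stand-in for the district-level frame (two counts from the table above).
reduced = pd.DataFrame({"cd_id": ["01 BRONX", "01 BROOKLYN"],
                        "Complaints": [74631, 185057]})

# Hypothetical location: a temporary directory stands in for your data folder.
out_path = os.path.join(tempfile.mkdtemp(), "complaints_by_board.csv")
# index=False keeps the row numbers out of the file.
reduced.to_csv(out_path, index=False)

# Read the file back to confirm that it round-trips.
restored = pd.read_csv(out_path)
print(restored.shape)
```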

# 4. Think about the interpretation of your results
What do you have to keep in mind when interpreting your results? Are they generalizable? Does the way the data are collected influence your results? To better inform city agencies, it might be good to explore in more detail the underlying demographics and infrastructure of a Community District, because these might influence 311 calls. You can do this by merging external data on the Community District level with your analysis data. 

In [88]:
# Population by Community District
df_pop = pd.read_csv("http://cosmo.nyu.edu/~fb55/PUI2016/data/Final_Demographics.csv")

In [91]:
# Check variables in file
for cname in df_pop.columns:
    print(cname)

FIPS
cd_id
Total Population
Population Density (per sq. mile)
% Total Population: Male
% Total Population: 18 to 24 Years
% Total Population: 25 to 34 Years
% Total Population: 35 to 44 Years
% Population 5 Years And Over: Speak Only English
% Population 5 Years And Over: Spanish or Spanish Creole
% Population 5 Years And Over: Spanish or Spanish Creole: Speak English "very Well"
% Population 5 Years And Over: Spanish or Spanish Creole: Speak English Less Than "very Well"
Population 25 Years and over:
Population 25 Years and over: Less Than High School
Population 25 Years and over: High School Graduate (includes equivalency)
Population 25 Years and over: Some college
Population 25 Years and over: Bachelor's degree
Population 25 Years and over: Master's degree
Population 25 Years and over: Professional school degree
Population 25 Years and over: Doctorate degree
% Population 25 Years and over: Less Than High School
% Population 25 Years and over: High School Graduate (includes equivalen

In [90]:
# How many community districts are in file? 
df_pop.shape

(59, 158)

In [100]:
# Set some variables to refer to column names in df_pop
Popn = "Total Population"
Dense = "Population Density (per sq. mile)"
PctEnglish = "% Population 5 Years And Over: Speak Only English"
PctBach = "% Population 25 Years and over: Bachelor's degree or more"
PctMaster = "% Population 25 Years and over: Master's degree or more"
PctProf = "% Population 25 Years and over: Professional school degree or more"
PctDoctorate = "% Population 25 Years and over: Doctorate degree.1"
Bach = "Population 25 Years and over: Bachelor's degree or more"
Master = "Population 25 Years and over: Master's degree or more"
Prof = "Population 25 Years and over: Professional school degree or more"
Doctorate = "Population 25 Years and over: Doctorate degree.1"

In [194]:
# Manipulate data to get some information on demographics by Community District. 
# Think about who might be more likely to call 311
df_pop['Low Income'] = (df_pop["% Households: Less than $10,000"] +
                        df_pop["% Households: $10,000 to $14,999"] +
                        df_pop["% Households: $15,000 to $19,999"] +
                        df_pop["% Households: $20,000 to $24,999"] +
                        df_pop["% Households: $25,000 to $29,999"])
# .copy() gives an independent frame, so the assignments below do not act on a slice of df_pop
df_pop_slim = df_pop[["FIPS", "cd_id", Popn, Dense, PctEnglish, Bach, Master, Prof, Doctorate,
                    "Low Income"]].copy()
# Note: PctEnglish is a percentage (0-100), so this product is 100 times the head count
df_pop_slim["English Only"] = df_pop_slim[Popn] * df_pop_slim[PctEnglish]
bdict = {'BX' : "BRONX", 'BK' : 'BROOKLYN', 'MN' : 'MANHATTAN', 'QN' : 'QUEENS',
        'SI' : 'STATEN ISLAND'}
# Quick check of the identifier conversion on one example
str1 = "BX01"
print(str1)
str1 = re.sub(r"(..)(\d\d)", r'\2 \1', str1)
print(str1)
# Convert e.g. "BX01" to "01 BX", then expand the borough code to its full name
df_pop_slim["cd_id"] = df_pop_slim["cd_id"].map(lambda x: x[2:4] + " " + x[:2])
for br in bdict.keys():
    df_pop_slim["cd_id"] = df_pop_slim["cd_id"].map(lambda x: x.replace(br, bdict[br]))
df_pop_slim


BX01
01 BX


Unnamed: 0,FIPS,cd_id,Total Population,Population Density (per sq. mile),% Population 5 Years And Over: Speak Only English,Population 25 Years and over: Bachelor's degree or more,Population 25 Years and over: Master's degree or more,Population 25 Years and over: Professional school degree or more,Population 25 Years and over: Doctorate degree.1,Low Income,English Only
0,3603701,08 BRONX,106737,31229.95006,46.8,28677,13421,5205,1885,31.31,4995291.6
1,3603702,12 BRONX,134644,19966.67839,73.09,20682,5466,213,151,33.92,9841129.96
2,3603703,10 BRONX,121209,12913.81703,61.79,23341,8959,1361,725,27.34,7489504.11
3,3603704,11 BRONX,135839,35677.95453,43.22,22500,10174,3168,1246,33.76,5870961.58
4,3603705,03 BRONX,172247,39405.79222,36.82,11694,3781,1237,743,60.09,6342134.54
5,3603705,06 BRONX,172247,39405.79222,28.21,11694,3781,1237,743,60.09,4859087.87
6,3603706,07 BRONX,135893,86487.07792,29.1,15350,3872,905,307,43.74,3954486.3
7,3603707,05 BRONX,132850,87974.3486,29.84,9931,3463,658,295,61.31,3964244.0
8,3603708,04 BRONX,141467,71270.88219,42.97,13564,3501,502,385,54.34,6078836.99
9,3603709,09 BRONX,190126,42752.5069,33.62,20089,5968,579,48,43.08,6392036.12
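
The identifier conversion can also be written as a single lookup, which avoids having to reason about whether a borough code might occur inside an already expanded name. The function name `to_board_label` is a hypothetical helper for this sketch.

```python
# Borough codes used in cd_id, mapped to the names used in the 311 data.
BOROUGH_NAMES = {"BX": "BRONX", "BK": "BROOKLYN", "MN": "MANHATTAN",
                 "QN": "QUEENS", "SI": "STATEN ISLAND"}

def to_board_label(cd_id):
    """Convert e.g. 'BX01' to '01 BRONX' (the 311 'Community Board' format)."""
    code, number = cd_id[:2], cd_id[2:4]
    return "{} {}".format(number, BOROUGH_NAMES[code])

print(to_board_label("BX01"))
```

An unknown borough code raises a `KeyError` here instead of passing through unchanged, which makes bad identifiers visible early.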


In [None]:
# Save data frame

In [118]:
# Infrastructure by Community District
infr_file = ("http://cosmo.nyu.edu/~fb55/PUI2016/data/" +
             "ACS_Computer_Use_and_Internet_2014_1Year_Estimate.csv")
df_infr = pd.read_csv(infr_file)
df_infr.shape

(59, 31)

In [120]:
# Check variables in file
df_infr.columns

Index([u'FIPS', u'Geographic Identifier', u'Qualifying Name', u'Households',
       u'Households: With An Internet Subscription',
       u'Households: Dial-Up Alone', u'Households: Dsl',
       u'Households: With Mobile Broadband',
       u'Households: Without Mobile Broadband', u'Households: Cable Modem',
       u'Households: With Mobile Broadband.1',
       u'Households: Without Mobile Broadband.1', u'Households: Fiber-Optic',
       u'Households: With Mobile Broadband.2',
       u'Households: Without Mobile Broadband.2',
       u'Households: Satellite Internet Service',
       u'Households: With Mobile Broadband.3',
       u'Households: Without Mobile Broadband.3',
       u'Households: Two or More Fixed Broadband Types, or Other',
       u'Households: With Mobile Broadband.4',
       u'Households: Without Mobile Broadband.4',
       u'Households: Mobile Broadband Alone or With Dialup',
       u'Households: Internet Access Without A Subscription',
       u'Households: No Internet Acc

In [123]:
# How many community districts are in file? 
# There are 59 community districts. However, 2 pairs of districts are combined in
# Bronx and 1 in Manhattan
df_infr["Qualifying Name"]

0     NYC-Bronx Community District 8--Riverdale, New...
1     NYC-Bronx Community District 12--Wakefield, Ne...
2     NYC-Bronx Community District 10--Co-op City, N...
3     NYC-Bronx Community District 11--Pelham Parkwa...
4     NYC-Bronx Community District 3 & 6--Belmont, N...
5     NYC-Bronx Community District 3 & 6--Belmont, N...
6     NYC-Bronx Community District 7--Bedford Park, ...
7     NYC-Bronx Community District 5--Morris Heights...
8     NYC-Bronx Community District 4--Concourse, New...
9     NYC-Bronx Community District 9--Castle Hill, N...
10    NYC-Bronx Community District 1 & 2--Hunts Poin...
11    NYC-Bronx Community District 1 & 2--Hunts Poin...
12    NYC-Manhattan Community District 12--Washingto...
13    NYC-Manhattan Community District 9--Hamilton H...
14    NYC-Manhattan Community District 10--Central H...
15    NYC-Manhattan Community District 11--East Harl...
16    NYC-Manhattan Community District 8--Upper East...
17    NYC-Manhattan Community District 7--Upper 

In [138]:
s1 = "comm board dist 2"
w = s1.split(' ')
"{:s} {:02d}".format(w[0], int(w[3]))

'comm 02'

In [130]:
df_infr[["Geographic Identifier", "Households"]].iloc[[10, 11]]

Unnamed: 0,Geographic Identifier,Households
10,79500US3603710,52191
11,79500US3603710,52191


In [154]:
cdists = {}
def set_cd_id(qname):
    """Convert a Qualifying Name such as 'NYC-Bronx Community District 8--...' to '08 BRONX'."""
    qname = qname[4:]                      # drop the leading "NYC-"
    wds = qname.upper().replace("-", " ").split(' ')
    bname = wds[0]
    wds = wds[1:]
    if bname == "STATEN":                  # two-word borough name
        bname += " " + wds[0]
        wds = wds[1:]
    wds = wds[2:]                          # skip "COMMUNITY DISTRICT"
    qnum = "{:02d}".format(int(wds[0]))
    # Combined districts ("3 & 6") appear on two rows: the first row gets the
    # first number, the second row gets the number after the "&"
    if qnum + " " + bname in cdists:
        qnum = "{:02d}".format(int(wds[2]))
    dname = qnum + " " + bname
    cdists[dname] = 1
    return dname

# Apply row by row, in file order; the combined-district logic relies on that order
df_infr["cd_id"] = [set_cd_id(qname) for qname in df_infr["Qualifying Name"]]
df_infr[["Qualifying Name", "Households", "Households: With An Internet Subscription", "cd_id"]]

Unnamed: 0,Qualifying Name,Households,Households: With An Internet Subscription,cd_id
0,"NYC-Bronx Community District 8--Riverdale, New...",42035,31795,08 BRONX
1,"NYC-Bronx Community District 12--Wakefield, Ne...",44830,32243,12 BRONX
2,"NYC-Bronx Community District 10--Co-op City, N...",47050,32729,10 BRONX
3,NYC-Bronx Community District 11--Pelham Parkwa...,44922,32003,11 BRONX
4,"NYC-Bronx Community District 3 & 6--Belmont, N...",57556,35503,03 BRONX
5,"NYC-Bronx Community District 3 & 6--Belmont, N...",57556,35503,06 BRONX
6,"NYC-Bronx Community District 7--Bedford Park, ...",47252,31468,07 BRONX
7,NYC-Bronx Community District 5--Morris Heights...,44699,26332,05 BRONX
8,"NYC-Bronx Community District 4--Concourse, New...",47935,29376,04 BRONX
9,"NYC-Bronx Community District 9--Castle Hill, N...",64011,45976,09 BRONX
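
A regular expression offers an alternative that does not depend on row order or on a shared dictionary: it returns every district named in a Qualifying Name. This is a sketch; `board_labels` is a hypothetical helper, the example strings are shortened versions of the names shown above, and the Staten Island name is assumed to follow the same pattern.

```python
import re

def board_labels(qualifying_name):
    """Return every '<NN> <BOROUGH>' label named in a Qualifying Name."""
    match = re.match(r"NYC-(.+?) Community District ([\d &]+)--", qualifying_name)
    if match is None:
        raise ValueError("unexpected name: " + qualifying_name)
    borough = match.group(1).upper()
    # Combined districts such as "3 & 6" yield one number per district.
    numbers = re.findall(r"\d+", match.group(2))
    return ["{:02d} {}".format(int(n), borough) for n in numbers]

print(board_labels("NYC-Bronx Community District 3 & 6--Belmont"))
print(board_labels("NYC-Bronx Community District 8--Riverdale"))
```

Keep in mind that rows for combined districts repeat the same household counts, so the two districts in a pair are not measured separately.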


In [162]:
# Manipulate data to get some information on internet/broadband usage by Community District
# Aggregate the mobile subscription data
mobile_flds = []
for fld in df_infr.columns:
    if re.search("With Mobile", fld) or re.search("Households: Mobile", fld):
        mobile_flds.append(fld)
df_infr['Mobile'] = df_infr[mobile_flds[0]]
for c in mobile_flds[1:]:
    df_infr['Mobile'] += df_infr[c]
mobile_flds.append('Mobile')
df_infr[mobile_flds].head(5)

Unnamed: 0,Households: With Mobile Broadband,Households: With Mobile Broadband.1,Households: With Mobile Broadband.2,Households: With Mobile Broadband.3,Households: With Mobile Broadband.4,Households: Mobile Broadband Alone or With Dialup,Mobile
0,946,10433,433,37,3510,2168,17527
1,405,5577,2358,0,2146,928,11414
2,398,6377,1200,0,3450,639,12064
3,474,5624,2272,241,2137,1001,11749
4,651,6690,695,111,6760,1385,16292


In [None]:
# Aggregate internet type by high and low connections

In [None]:
# Save data frame 

# 5. Inform your results by linking datasets
Now you want to link the three data frames to produce summary statistics for Community Districts which show a high number of complaints vs. Community Districts which show a lower number of complaints. The Community District identifiers for each DataFrame were already harmonized as the DataFrames were loaded.

In [None]:
# Harmonize identifier of dataframe 1

In [None]:
# Harmonize identifier of dataframe 2

In [None]:
# Harmonize identifier of dataframe 3

In [195]:
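# Link the 3 dataframes
# "Community Board" must be a regular column, so move it out of the index first.
# Run this once: repeating reset_index adds extra "index"/"level_0" columns.
comp_by_board = comp_by_board.reset_index()
comp_by_board['cd_id'] = comp_by_board['Community Board']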
df_merge = pd.merge(comp_by_board, df_pop_slim, on='cd_id')
df_merge = pd.merge(df_merge, df_infr, on='cd_id')
df_merge.head(5)

Unnamed: 0,level_0,index,Community Board,Borough,Complaints,cd_id,FIPS_x,Total Population,Population Density (per sq. mile),% Population 5 Years And Over: Speak Only English,...,Households: Internet Access Without A Subscription,Households: No Internet Access,% Households: With An Internet Subscription,Households.1,Households: Has A Computer,Households: With Dial-Up Internet Subscription Alone,Households: With A Broadband Internet Subscription,Households: Without An Internet Subscription,Households: No Computer,Mobile
0,0,0,01 BRONX,BRONX,74631,01 BRONX,3603710,167147,34412.07524,27.49,...,2412,17066,62.68,52191,39141,0,30958,8183,13050,15981
1,1,1,01 BROOKLYN,BROOKLYN,185057,01 BROOKLYN,3604001,154713,37671.51058,72.48,...,2451,13526,74.64,62990,52660,449,46148,6063,10330,21459
2,2,2,01 MANHATTAN,MANHATTAN,77974,01 MANHATTAN,3603810,159903,53928.0536,65.31,...,3089,6193,88.95,83976,79890,132,74339,5419,4086,41450
3,3,3,01 QUEENS,QUEENS,171484,01 QUEENS,3604101,182860,35800.7596,66.19,...,1815,12082,81.66,75758,66023,277,60733,5013,9735,31144
4,4,4,01 STATEN ISLAND,STATEN ISLAND,182708,01 STATEN ISLAND,3603903,176338,12537.60496,71.43,...,1340,13521,76.05,62047,50159,134,46362,3663,11888,11151
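
An inner merge silently drops districts whose identifiers do not match in every frame. One way to see what is lost is an outer merge with `indicator=True`. The frames below are made up for illustration and are not the lab data.

```python
import pandas as pd

# Made-up frames standing in for the complaint counts and a demographic table.
complaints = pd.DataFrame({"cd_id": ["01 BRONX", "02 BRONX", "03 BRONX"],
                           "Complaints": [10, 20, 30]})
demographics = pd.DataFrame({"cd_id": ["01 BRONX", "03 BRONX", "04 BRONX"],
                             "Total Population": [100, 300, 400]})

# how="outer" keeps unmatched keys; indicator=True adds a "_merge" column
# saying whether each key came from the left frame, the right frame, or both.
check = pd.merge(complaints, demographics, on="cd_id", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["cd_id", "_merge"]]
print(unmatched)
```

Running the same check on the real frames would show which Community Districts fail to link and therefore drop out of any later analysis.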


In [188]:
df_merge.columns

Index([u'index', u'Community Board', u'Borough', u'Complaints', u'cd_id',
       u'FIPS_x', u'Total Population', u'Population Density (per sq. mile)',
       u'% Population 5 Years And Over: Speak Only English',
       u'Population 25 Years and over: Bachelor's degree or more',
       u'Population 25 Years and over: Master's degree or more',
       u'Population 25 Years and over: Professional school degree or more',
       u'Population 25 Years and over: Doctorate degree.1', u'English Only',
       u'FIPS_y', u'Geographic Identifier', u'Qualifying Name', u'Households',
       u'Households: With An Internet Subscription',
       u'Households: Dial-Up Alone', u'Households: Dsl',
       u'Households: With Mobile Broadband',
       u'Households: Without Mobile Broadband', u'Households: Cable Modem',
       u'Households: With Mobile Broadband.1',
       u'Households: Without Mobile Broadband.1', u'Households: Fiber-Optic',
       u'Households: With Mobile Broadband.2',
       u'Households: 

In [197]:
# Are the demographics and infrastructure different in Community Districts that
# show more complaints than others?
# Try running a regression against several of the demographic variables and see how
# strong the correlation is. Looking for p-value < 0.05
import statsmodels.api as sm
df_merge["Complaints Per1000"] = df_merge.Complaints * 1000 / df_merge['Total Population']
df_merge['Pct College'] = (df_merge["Population 25 Years and over: Bachelor's degree or more"] /
  df_merge["Total Population"] * 100)
df_merge['Pct Mobile'] = df_merge['Mobile'] * 100 / df_merge['Households']
ind_vars = ['% Population 5 Years And Over: Speak Only English', "Low Income",
            'Population Density (per sq. mile)', "Pct College", 'Pct Mobile']
model = sm.OLS(df_merge["Complaints Per1000"], sm.add_constant(df_merge[ind_vars]),
               missing='drop').fit()
model.summary()

0,1,2,3
Dep. Variable:,Complaints Per1000,R-squared:,0.222
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,2.797
Date:,"Sun, 13 Nov 2016",Prob (F-statistic):,0.0267
Time:,19:48:10,Log-Likelihood:,-373.39
No. Observations:,55,AIC:,758.8
Df Residuals:,49,BIC:,770.8
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,1243.2125,218.189,5.698,0.000,804.745 1681.680
% Population 5 Years And Over: Speak Only English,2.0582,1.887,1.091,0.281,-1.733 5.850
Low Income,-11.7640,4.329,-2.717,0.009,-20.464 -3.065
Population Density (per sq. mile),0.0053,0.002,3.089,0.003,0.002 0.009
Pct College,-11.3356,4.413,-2.569,0.013,-20.204 -2.467
Pct Mobile,0.0322,5.135,0.006,0.995,-10.286 10.351

0,1,2,3
Omnibus:,1.172,Durbin-Watson:,2.528
Prob(Omnibus):,0.557,Jarque-Bera (JB):,1.215
Skew:,0.283,Prob(JB):,0.545
Kurtosis:,2.543,Cond. No.,361000.0


Without the Low Income variable, none of the variables in this set had a p-value below 0.05. Adding Low Income made the coefficients for Population Density (positive, p = 0.003) and Pct College (negative, p = 0.013) significant. In addition, a higher percentage of low-income households is associated with fewer complaints per 1,000 residents (p = 0.009). These are associations across 55 Community Districts, not causal effects, and the model explains only about 22% of the variation in complaint rates (R-squared = 0.222).
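
To get a feel for the size of these coefficients, the sketch below multiplies each one by an illustrative change in the predictor. The coefficients are copied from the summary table; the sizes of the changes are assumptions chosen for readability, and the results describe associations under this model only.

```python
# Coefficients copied from the regression summary above.
coef = {
    "Low Income": -11.7640,                       # per percentage point
    "Population Density (per sq. mile)": 0.0053,  # per person per sq. mile
    "Pct College": -11.3356,                      # per percentage point
}

# Illustrative changes (assumptions chosen for readability, not from the data).
change = {
    "Low Income": 10,
    "Population Density (per sq. mile)": 10000,
    "Pct College": 10,
}

# Predicted difference in complaints per 1,000 residents, other variables held fixed.
effects = {name: coef[name] * change[name] for name in coef}
for name, effect in effects.items():
    print("{}: {:+.1f} complaints per 1,000 residents".format(name, effect))
```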