### Diagnostic HW 1 - Part 2

To limit the number of API requests, I will focus on Vacant and Abandoned Buildings requests from the last 3 months of 2016. Specifically, I will explore the income distribution of the areas the requests come from.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import json
from pprint import pprint
import requests
from urllib.request import urlopen
from datetime import datetime
import matplotlib
%matplotlib inline

In [2]:
buildings = pd.read_csv('vacant_buildings_3_months.csv')

In [3]:
buildings.shape

(702, 23)

In [4]:
sum(pd.isnull(buildings['LATITUDE'])), sum(pd.isnull(buildings['LONGITUDE']))

(0, 0)

There are no null values for latitude or longitude, so I will assume that every row has a valid coordinate pair. I will now query the FCC block API and store each row's FIPS block code, which is the key needed to look up census information.
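
A non-null value is not necessarily a valid coordinate, so the assumption above could be checked with a range test before spending API requests. This is a minimal sketch: the helper name `invalid_coordinate_mask` and the sample frame are made up for illustration, and the bounds are just the global latitude/longitude limits rather than anything specific to Chicago.

```python
import pandas as pd

def invalid_coordinate_mask(df, lat_col='LATITUDE', lon_col='LONGITUDE'):
    '''
    Return a boolean Series that is True for rows whose coordinates are
    missing, non-numeric, or outside the global lat/long limits.
    (Hypothetical helper, not part of the original notebook.)
    '''
    lat = pd.to_numeric(df[lat_col], errors='coerce')
    lon = pd.to_numeric(df[lon_col], errors='coerce')
    # between() is False for NaN, so unparseable values are flagged too.
    return ~(lat.between(-90, 90) & lon.between(-180, 180))

# Illustrative rows only: the first is the test coordinate used later on.
sample = pd.DataFrame({'LATITUDE': [41.9688732498, 0.0, 141.7],
                       'LONGITUDE': [-87.6698381016, -87.6, -87.6]})
print(invalid_coordinate_mask(sample).tolist())
```

Note that a row such as (0.0, -87.6) passes this test even though it is clearly not in Chicago, so a tighter bounding box would be needed to catch placeholder values.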

In [5]:
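def get_census_block(lat, long):
    '''
    Return the census block FIPS code for a coordinate, or None if the lookup fails.
    '''
    FIPS_url = 'http://data.fcc.gov/api/block/find?format=json&latitude={}&longitude={}&showall=true'.format(lat,long)
    try:
        response = urlopen(FIPS_url)
        FIPS = json.loads(response.read().decode('utf-8'))
        return FIPS['Block']['FIPS']
    except Exception:
        # Print the failing URL so the request can be retried by hand.
        print(FIPS_url)
        return None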

In [6]:
def scrape_fips_blocks(df):
    '''
    Look up the census block for every row and return a copy of df
    with a FIPS_BLOCK_NUMBER column added.
    '''
    blocks = [get_census_block(row['LATITUDE'], row['LONGITUDE'])
              for index, row in df.iterrows()]
    building_fips_df = df.copy()
    # Assign by position so the result does not depend on df having a default 0..n-1 index.
    building_fips_df['FIPS_BLOCK_NUMBER'] = blocks
    return building_fips_df
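
Since the goal is to limit API requests, one option is to cache lookups so that repeated coordinates only hit the API once. The sketch below is an assumption about how that could be done, not what the notebook actually ran: `make_cached_lookup` is a hypothetical wrapper, and `fake_lookup` stands in for `get_census_block` so the example runs without a network connection.

```python
from functools import lru_cache

def make_cached_lookup(lookup, precision=6):
    '''
    Wrap a (lat, long) -> value function so repeated coordinates are
    served from memory. (Hypothetical helper for illustration.)
    '''
    @lru_cache(maxsize=None)
    def cached(lat, lon):
        return lookup(lat, lon)

    def wrapper(lat, lon):
        # Round so tiny floating point differences share one cache entry.
        return cached(round(lat, precision), round(lon, precision))

    return wrapper

# Stand-in for get_census_block that records how often it is called.
calls = []
def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return 'block-for-{}-{}'.format(lat, lon)

cached_lookup = make_cached_lookup(fake_lookup)
first = cached_lookup(41.9688732498, -87.6698381016)
second = cached_lookup(41.9688732498, -87.6698381016)
print(len(calls))
```

One caveat with this design: a failed lookup that returns None would be cached as well, so failures would need to be retried outside the cache.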

In [7]:
get_census_block(41.9688732498, -87.6698381016)

'170310318001003'
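
The returned string is a 15-digit census block code made up of a 2-digit state, 3-digit county, 6-digit tract and 4-digit block. As a small sketch of how the code above breaks down (the helper name `split_block_fips` is my own, added for illustration):

```python
def split_block_fips(fips):
    '''
    Split a 15-digit census block FIPS code into its components.
    (Hypothetical helper, not part of the original notebook.)
    '''
    fips = str(fips)
    if len(fips) != 15 or not fips.isdigit():
        raise ValueError('expected a 15-digit block code, got {!r}'.format(fips))
    return {
        'state': fips[0:2],    # 2 digits
        'county': fips[2:5],   # 3 digits
        'tract': fips[5:11],   # 6 digits
        'block': fips[11:15],  # 4 digits
    }

parts = split_block_fips('170310318001003')
print(parts)
```

The state, county and tract pieces are the ones a tract-level census query needs.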

In [8]:
building_fips_df = scrape_fips_blocks(buildings)

In [9]:
building_fips_df.to_pickle('building_fips.pkl')

In [10]:
building_fips_df = pd.read_pickle('building_fips.pkl')

In [11]:
building_fips_df.head(1)

Unnamed: 0,SERVICE REQUEST TYPE,SERVICE REQUEST NUMBER,DATE SERVICE REQUEST WAS RECEIVED,"LOCATION OF BUILDING ON THE LOT (IF GARAGE, CHANGE TYPE CODE TO BGD).",IS THE BUILDING DANGEROUS OR HAZARDOUS?,IS BUILDING OPEN OR BOARDED?,"IF THE BUILDING IS OPEN, WHERE IS THE ENTRY POINT?",IS THE BUILDING CURRENTLY VACANT OR OCCUPIED?,IS THE BUILDING VACANT DUE TO FIRE?,"ANY PEOPLE USING PROPERTY? (HOMELESS, CHILDEN, GANGS)",...,ZIP CODE,X COORDINATE,Y COORDINATE,Ward,Police District,Community Area,LATITUDE,LONGITUDE,Location,FIPS_BLOCK_NUMBER
0,Vacant/Abandoned Building,16-06902848,10/01/2016,Rear,,Open,BACK DOOR,Vacant,False,True,...,60628.0,1177385.0,1837852.0,9,5,49,41.710392,-87.62599,"(41.710391684078424, -87.62598966616069)",170314907002004


Using the FIPS block codes, I can now query the Census API for income and benefits information.

In [None]:
def scrape_income(df):
    '''
    Function that retrieves DP03_0051E from the INCOME AND BENEFITS
    (IN 2015 INFLATION-ADJUSTED DOLLARS) section for each row's census tract.
    '''
    avg_income_list = []
    for index, row in df.iterrows():
        # A block FIPS code is state (2 digits) + county (3) + tract (6) + block (4).
        state = row['FIPS_BLOCK_NUMBER'][0:2]
        county = row['FIPS_BLOCK_NUMBER'][2:5]
        tract = row['FIPS_BLOCK_NUMBER'][5:11]
        url ='http://api.census.gov/data/2015/acs5/profile?get=DP03_0051E,NAME&for=tract:{tract}&in=state:{state}+county:{county}&key=5114f013c5c3a46e13d51564a7d6411436e2b063'.format(state=state, county=county, tract=tract)
        r = requests.get(url)
        if r.status_code != 204: # 204 corresponds to no content.
            data = r.json()  # named data so the json module is not shadowed
            # Row 0 is the header; row 1 holds the requested values.
            avg_income_list.append(data[1][0])
        else:
            # Keep one entry per row so the list stays aligned with df.
            avg_income_list.append(None)
    buildings_income_df = df.copy()
    buildings_income_df['AVG_INCOME'] = avg_income_list
    return buildings_income_df
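
The function above relies on the response being a list of rows whose first row is a header. Pairing the header with the data row makes it clearer which value is being read than indexing with `[1][0]`. This is only a sketch: `first_record` is a hypothetical helper, and the sample payload uses made-up values in the shape I would expect for this query, not real census figures.

```python
def first_record(payload):
    '''
    Turn a header-plus-rows response into a dict for the first data row,
    or return None when there is no data row.
    (Hypothetical helper for illustration.)
    '''
    if not payload or len(payload) < 2:
        return None
    header, row = payload[0], payload[1]
    return dict(zip(header, row))

# Made-up payload: the values are placeholders, not real census data.
sample_payload = [
    ['DP03_0051E', 'NAME', 'state', 'county', 'tract'],
    ['1234', 'Illustrative tract name', '17', '031', '031800'],
]
record = first_record(sample_payload)
print(record['DP03_0051E'])
```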

In [None]:
buildings_income_df = scrape_income(building_fips_df)

In [None]:
buildings_income_df.to_pickle('buildings_income.pkl')

In [None]:
buildings_income_df = pd.read_pickle('buildings_income.pkl')

In [None]:
buildings_income_df.sample(10)

In [None]:
int_list = [int(x) for x in list(buildings_income_df['AVG_INCOME'])]

In [None]:
buildings_income_df.drop(['AVG_INCOME'], axis=1, inplace=True)

In [None]:
pd_int_list = pd.DataFrame({'AVG_INCOME': int_list})

In [None]:
buildings_income_df = pd.concat([buildings_income_df, pd_int_list], axis=1)
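
The four cells above convert the scraped strings to integers by rebuilding the column. A shorter alternative, sketched here on a small made-up frame rather than the real data, is `pd.to_numeric` with `errors='coerce'`, which converts in place and turns anything unparseable into NaN instead of raising.

```python
import pandas as pd

# Made-up values standing in for the scraped strings.
df = pd.DataFrame({'AVG_INCOME': ['52000', '31000', None, 'n/a']})

# Convert the column in place; missing or unparseable entries become NaN.
df['AVG_INCOME'] = pd.to_numeric(df['AVG_INCOME'], errors='coerce')

n_missing = int(df['AVG_INCOME'].isna().sum())
print(n_missing)
```

The trade-off is that the column ends up as floats when any value is missing, which is fine for the group means computed next.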

In [None]:
income_dist = pd.DataFrame(buildings_income_df.groupby('ZIP CODE')['AVG_INCOME'].mean().rename('mean').sort_values(ascending=True))

In [None]:
income_dist.plot(kind='barh',figsize=(16, 12))
plt.title('Income Distribution per Zip Code')
plt.xlabel('Average Income (Thousands)')
plt.ylabel('Zip Codes')

This graph shows the stark income divide in Chicago:
60605 refers to the downtown core while 60636 is the Englewood neighbourhood.
The downtown core is extremely affluent, while Englewood suffers from high levels of poverty and violence.
The graph below is consistent with our findings in that the most reports of neglected urban infrastructure come from the most impoverished neighbourhoods.
We can see that the gap between the richest and poorest zip codes is approximately 500 percent.
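
A figure like "500 percent" can be read either as a ratio or as a percent difference, so it helps to compute both. The numbers below are hypothetical round values chosen to illustrate the arithmetic, not values read off the graph, and `income_gap` is a helper name I made up.

```python
def income_gap(richest, poorest):
    '''
    Return (ratio, percent difference) between two group means.
    (Hypothetical helper for illustration.)
    '''
    ratio = richest / poorest
    percent_difference = (richest - poorest) / poorest * 100
    return ratio, percent_difference

# Hypothetical means for the richest and poorest zip codes.
ratio, pct = income_gap(120000, 20000)
print(ratio, pct)
```

With these inputs a 500 percent difference corresponds to the richest zip code having six times the value of the poorest.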

In [None]:
hist_zip = buildings['ZIP CODE'].value_counts()
plt.figure(figsize=(16,12))
graph=sns.countplot(y='ZIP CODE', saturation=1, data=buildings, order=hist_zip.index)
plt.title('ZIP Code Histogram')
plt.xlabel('Requests')
plt.ylabel('Zip Codes')

60636 had a higher number of requests for sanitation (~500), potholes (~750) and graffiti (~1000) than it did for vacant buildings (~90). It is reasonable to assume that graffiti and sanitation issues have a higher chance of occurrence in unoccupied buildings. Thus, the large number of requests for graffiti and sanitation may lead one to question whether the number of requests put through for vacant buildings is accurate. The lack of affordable housing on the South Side of Chicago may be a factor in the low number of yearly reports about vacant buildings (squatters living in these buildings are unlikely to report them). Then again, perhaps my assumption that graffiti is more likely to occur on unoccupied buildings is weak.
Another issue to highlight is how the data was collected: it would not surprise me if statistics relating to the South Side of Chicago were hard to gather. As such, there may be implicit bias or large amounts of missing data.