# Chicago Bike Infrastructure Project
<h3>Capstone Project for Data Analytics Certificate<br>
University of Texas<br><br>
Samantha Goodman
<br>December 2021</h3><br>
This notebook is part 3 of a 5-part series.<br>
1 - Bike Shops from FourSquare API<br>
2 - Bike Infrastructure<br>
<b>3 - Background information about neighborhoods</b><br>
4 - Analysis<br>
5 - Model Building and Predictions<br><br>
Questions this project aims to answer:<br>
<ul><li>Which community areas (neighborhoods) have the most bike infrastructure, and which have the least?</li>
<li>Are there areas that show an unmet demand for bike infrastructure (higher rates of Divvy trips, but lower rates of bike lanes and repair shops)?</li>
<li>Can I predict bike infrastructure levels based on demographic or community health data?</li></ul>


In [65]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

First, import the list of Chicago neighborhoods, which the city refers to as community areas.

In [66]:
# Import CSV to dataframe
# Source = City of Chicago data portal
neighborhoods = pd.read_csv('Chicago_community_areas.csv')
# Make all column names lowercase
neighborhoods.columns= neighborhoods.columns.str.lower()

In [67]:
neighborhoods.shape

(77, 3)

There are 77 official neighborhoods (community areas) in Chicago. We will keep this number in mind when checking our datasets going forward.
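One way to keep the 77-area count in mind is to wrap it in a small check that can be rerun after each import or merge. The sketch below is only an illustration: `check_area_count` and `EXPECTED_AREAS` are hypothetical names that do not appear elsewhere in this project, and the dataframe is a toy stand-in.

```python
import pandas as pd

EXPECTED_AREAS = 77  # number of official Chicago community areas


def check_area_count(df, key, expected=EXPECTED_AREAS):
    """Return True if `key` has no missing values and exactly `expected` distinct values."""
    values = df[key]
    return bool(values.notna().all() and values.nunique() == expected)


# Toy stand-in for a real dataframe: area numbers 1 through 77
toy = pd.DataFrame({"comm_num": range(1, 78)})
ok = check_area_count(toy, "comm_num")
```

Calling the helper on a dataframe with a missing or duplicated area returns `False`, which makes a bad import easier to spot than reading `.shape` by eye.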

# Find total street lengths for each neighborhood

Total street length is a good baseline metric to use when looking at the length of bike lanes. Although we have the area of each neighborhood, some neighborhoods have large industrial sections without much city-owned infrastructure, so area alone would be a misleading denominator.
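To illustrate why street length is a useful denominator, here is a minimal sketch that expresses bike lane length as a share of total street length. The numbers are made up, and `length_bike_lanes_m` and `bike_lane_share_pct` are hypothetical column names used only for this example.

```python
import pandas as pd

# Made-up numbers for two hypothetical areas, for illustration only
toy = pd.DataFrame({
    "community": ["AREA A", "AREA B"],
    "length_streets_m": [50000.0, 200000.0],
    "length_bike_lanes_m": [5000.0, 10000.0],  # hypothetical column
})

# Share of street length that carries a bike lane, in percent
toy["bike_lane_share_pct"] = (
    toy["length_bike_lanes_m"] / toy["length_streets_m"] * 100
)
```

In this toy case, AREA B has twice as many meters of bike lane as AREA A, but only half the share once street length is taken into account.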

In [68]:
# Import CSV to dataframe
# Source = FourSquare, see 1-BikeShopsFromFourSquare notebook
streets = pd.read_csv('street_lengths_comm.csv')
# Make all column names lowercase
streets.columns= streets.columns.str.lower()

In [69]:
streets.head()

Unnamed: 0,class,date_creat,time_creat,create_use,dir_travel,edit_date,edit_type,ewns,ewns_coord,ewns_dir,...,l_f_add,l_fips,l_parity,l_t_add,l_zip,length,update_use,area_numbe,community,length_2
0,4,1999/01/01,00:00:00.000,EXISTING,F,0,,-232,232,W,...,0,14000,O,0,60621.0,220.566014,EXISTING,68,ENGLEWOOD,67.227523
1,2,1999/01/01,00:00:00.000,EXISTING,B,0,,0,0,,...,7301,14000,O,7359,60619.0,664.774635,EXISTING,69,GREATER GRAND CROSSING,202.61964
2,4,1999/01/01,00:00:00.000,EXISTING,B,0,,-2500,2500,W,...,10801,14000,O,10859,60655.0,665.378484,EXISTING,75,MORGAN PARK,202.805535
3,4,1999/01/01,00:00:00.000,EXISTING,B,0,,-932,932,W,...,0,14000,O,0,60643.0,152.564889,EXISTING,75,MORGAN PARK,46.501185
4,4,1999/01/01,00:00:00.000,EXISTING,B,0,,-11800,11800,S,...,1933,14000,O,1959,60643.0,332.691371,EXISTING,75,MORGAN PARK,101.403278


In [70]:
# Group by community area and sum the length of all streets in each neighborhood
streets_grouped = streets.groupby('community')['length_2'].sum().reset_index()
streets_grouped.head()

Unnamed: 0,community,length_2
0,ALBANY PARK,73723.820898
1,ARCHER HEIGHTS,52339.197584
2,ARMOUR SQUARE,59148.654592
3,ASHBURN,152829.703586
4,AUBURN GRESHAM,136022.586733


In [71]:
streets_grouped.sort_values('length_2')

Unnamed: 0,community,length_2
12,BURNSIDE,13591.225291
55,OAKLAND,19827.721142
36,KENWOOD,31479.439883
44,MONTCLARE,33779.597458
25,FULLER PARK,40480.268680
...,...,...
61,ROSELAND,168029.716438
75,WEST TOWN,188426.007579
56,OHARE,205033.967690
5,AUSTIN,232288.776880


In [72]:
# Rename column to reflect that it's the total street length in meters
streets_grouped.columns = streets_grouped.columns.str.replace('length_2', 'length_streets_m')

# Sort descending by sum of street lengths, just to see
streets_grouped.sort_values(by=['length_streets_m'], inplace=True, ascending=False)

streets_grouped.head()

Unnamed: 0,community,length_streets_m
49,NEAR WEST SIDE,232421.104766
5,AUSTIN,232288.77688
56,OHARE,205033.96769
75,WEST TOWN,188426.007579
61,ROSELAND,168029.716438
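As an alternative sketch of the rename-and-sort step, `DataFrame.rename` with a dictionary targets one exact column name, whereas a string replace on `columns` would also touch any other column whose name happens to contain `length_2`. The two rows below reuse values from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "community": ["BURNSIDE", "AUSTIN"],
    "length_2": [13591.225291, 232288.776880],
})

# Rename exactly one column, then sort longest-first without modifying in place
toy = (
    toy.rename(columns={"length_2": "length_streets_m"})
       .sort_values("length_streets_m", ascending=False)
       .reset_index(drop=True)
)
```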


Add street lengths to the neighborhoods dataframe and name the result neighborhood_data

In [73]:
# Join neighborhoods and street length dataframes on the 'community' column
neighborhood_data = neighborhoods.merge(streets_grouped, how='left', on='community')

In [74]:
neighborhood_data.head()

Unnamed: 0,comm_num,community,area_kmsq,length_streets_m
0,1,ROGERS PARK,51.259902,57712.379599
1,2,WEST RIDGE,98.429095,121335.202892
2,3,UPTOWN,65.095643,63769.090382
3,4,LINCOLN SQUARE,71.352328,76974.128441
4,5,NORTH CENTER,57.054168,73743.375859


In [75]:
neighborhood_data.shape

(77, 4)
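A left merge always keeps every row of the left dataframe, so a shape of 77 rows does not by itself show that every community name found a match. The sketch below shows one way to check, using the `indicator` and `validate` parameters of `merge`. The data is a toy example, and the misspelled key is invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"community": ["ALBANY PARK", "OHARE", "LOOP"]})
right = pd.DataFrame({
    "community": ["ALBANY PARK", "O'HARE", "LOOP"],  # deliberately mismatched key
    "length_streets_m": [1.0, 2.0, 3.0],             # made-up values
})

# indicator=True adds a `_merge` column; validate raises if keys are duplicated
merged = left.merge(
    right, how="left", on="community", indicator=True, validate="one_to_one"
)

# Rows that exist on the left but found no partner on the right
unmatched = merged.loc[merged["_merge"] == "left_only", "community"].tolist()
```

An empty `unmatched` list would mean every community area picked up a street length.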

# Next up, add economic hardship data by neighborhood

In [76]:
# Import CSV to dataframe
# Source = City of Chicago Data Portal
# https://data.cityofchicago.org/Health-Human-Services/hardship-index/792q-4jtu
# Time Period: 2006-2010
hardship = pd.read_csv('hardship_data_community.csv')
# Make all column names lowercase
hardship.columns= hardship.columns.str.lower()

In [77]:
hardship.head()

Unnamed: 0,community area number,community area name,percent of housing crowded,percent households below poverty,percent aged 16+ unemployed,percent aged 25+ without high school diploma,percent aged under 18 or over 64,per capita income,hardship index
0,1.0,Rogers Park,7.7,23.6,8.7,18.2,27.5,23939,39.0
1,2.0,West Ridge,7.8,17.2,8.8,20.8,38.5,23040,46.0
2,3.0,Uptown,3.8,24.0,8.9,11.8,22.2,35787,20.0
3,4.0,Lincoln Square,3.4,10.9,8.2,13.4,25.5,37524,17.0
4,5.0,North Center,0.3,7.5,5.2,4.5,26.2,57123,6.0
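The hardship columns contain spaces and symbols such as `+`, which means they can only be accessed with bracket notation. If shorter identifiers are preferred, one possible cleanup (a sketch, not what the rest of this notebook does) is to convert the names to snake_case. The column names in the toy dataframe below are variations on the real ones, with inconsistent casing and a trailing space added for illustration.

```python
import pandas as pd

toy = pd.DataFrame(columns=[
    "Community Area Number",
    "PERCENT AGED 16+ UNEMPLOYED",
    "PER CAPITA INCOME ",  # trailing space added for illustration
])

# Trim, lowercase, collapse runs of non-alphanumerics to "_", trim stray "_"
toy.columns = (
    toy.columns.str.strip()
       .str.lower()
       .str.replace(r"[^0-9a-z]+", "_", regex=True)
       .str.strip("_")
)
```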


In [78]:
# Sort ascending by hardship index (lowest hardship first), just to see
hardship.sort_values(by=['hardship index'], inplace=True, ascending=True)
hardship.head()

Unnamed: 0,community area number,community area name,percent of housing crowded,percent households below poverty,percent aged 16+ unemployed,percent aged 25+ without high school diploma,percent aged under 18 or over 64,per capita income,hardship index
7,8.0,Near North Side,1.9,12.9,7.0,2.5,22.6,88669,1.0
6,7.0,Lincoln Park,0.8,12.3,5.1,3.6,21.5,71551,2.0
31,32.0,Loop,1.5,14.7,5.7,3.1,13.5,65526,3.0
5,6.0,Lake View,1.1,11.4,4.7,2.6,17.0,60058,5.0
4,5.0,North Center,0.3,7.5,5.2,4.5,26.2,57123,6.0


In [79]:
hardship.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 78 entries, 7 to 77
Data columns (total 9 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   community area number                         77 non-null     float64
 1   community area name                           78 non-null     object 
 2   percent of housing crowded                    78 non-null     float64
 3   percent households below poverty              78 non-null     float64
 4   percent aged 16+ unemployed                   78 non-null     float64
 5   percent aged 25+ without high school diploma  78 non-null     float64
 6   percent aged under 18 or over 64              78 non-null     float64
 7   per capita income                             78 non-null     int64  
 8   hardship index                                77 non-null     float64
dtypes: float64(7), int64(1), object(1)
memory usage: 6.1+ KB


In [80]:
# Drop the row labeled 77, which holds the Chicago-wide averages (it has no community area number)
hardship = hardship.drop(77)

In [81]:
# Recast floats as ints
hardship['community area number'] = hardship['community area number'].astype(int)
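Dropping by a hard-coded label works for this file, but it depends on the summary row staying at label 77. A sketch of a more defensive variant filters on the missing community area number instead, which also guarantees that the integer cast will not hit a NaN. The toy rows below reuse two real areas, and the name given to the summary row is a placeholder.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "community area number": [1.0, 2.0, np.nan],
    "community area name": ["Rogers Park", "West Ridge", "CITYWIDE"],  # placeholder name
})

# Keep only rows that carry a community area number, then cast to int
areas_only = toy[toy["community area number"].notna()].copy()
areas_only["community area number"] = (
    areas_only["community area number"].astype(int)
)
```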

In [82]:
# Join hardship data to neighborhood_data, matching 'comm_num' to 'community area number'
neighborhood_data = neighborhood_data.merge(hardship, how='left', left_on='comm_num', right_on='community area number')

In [83]:
neighborhood_data.shape

(77, 13)

In [84]:
neighborhood_data.head()

Unnamed: 0,comm_num,community,area_kmsq,length_streets_m,community area number,community area name,percent of housing crowded,percent households below poverty,percent aged 16+ unemployed,percent aged 25+ without high school diploma,percent aged under 18 or over 64,per capita income,hardship index
0,1,ROGERS PARK,51.259902,57712.379599,1,Rogers Park,7.7,23.6,8.7,18.2,27.5,23939,39.0
1,2,WEST RIDGE,98.429095,121335.202892,2,West Ridge,7.8,17.2,8.8,20.8,38.5,23040,46.0
2,3,UPTOWN,65.095643,63769.090382,3,Uptown,3.8,24.0,8.9,11.8,22.2,35787,20.0
3,4,LINCOLN SQUARE,71.352328,76974.128441,4,Lincoln Square,3.4,10.9,8.2,13.4,25.5,37524,17.0
4,5,NORTH CENTER,57.054168,73743.375859,5,North Center,0.3,7.5,5.2,4.5,26.2,57123,6.0
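The output above shows why the merge uses the area number rather than the name: `community` is upper case while `community area name` is title case, so a join on the raw names would match nothing. If a numeric key were not available, one possible approach is to normalize the names first. This is a sketch on toy data, and the trailing space in one name is invented to show why `strip` helps.

```python
import pandas as pd

a = pd.DataFrame({"community": ["ROGERS PARK", "WEST RIDGE"]})
b = pd.DataFrame({
    "community area name": ["Rogers Park", "West Ridge "],  # stray space is made up
    "hardship index": [39.0, 46.0],
})

# Build a normalized key that matches the upper-case names on the left
b["community"] = b["community area name"].str.strip().str.upper()

joined = a.merge(b[["community", "hardship index"]], how="left", on="community")
```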


In [85]:
# Drop extraneous community area columns
neighborhood_data.drop(columns=['community area number', 'community area name'], inplace=True)
neighborhood_data.head()

Unnamed: 0,comm_num,community,area_kmsq,length_streets_m,percent of housing crowded,percent households below poverty,percent aged 16+ unemployed,percent aged 25+ without high school diploma,percent aged under 18 or over 64,per capita income,hardship index
0,1,ROGERS PARK,51.259902,57712.379599,7.7,23.6,8.7,18.2,27.5,23939,39.0
1,2,WEST RIDGE,98.429095,121335.202892,7.8,17.2,8.8,20.8,38.5,23040,46.0
2,3,UPTOWN,65.095643,63769.090382,3.8,24.0,8.9,11.8,22.2,35787,20.0
3,4,LINCOLN SQUARE,71.352328,76974.128441,3.4,10.9,8.2,13.4,25.5,37524,17.0
4,5,NORTH CENTER,57.054168,73743.375859,0.3,7.5,5.2,4.5,26.2,57123,6.0


In [86]:
# Incorporate census data from CMAP
# Source: https://datahub.cmap.illinois.gov/dataset/community-data-snapshots-raw-data
demo_data = pd.read_csv('CMAPdata.csv')
# Make all column names lowercase
demo_data.columns= demo_data.columns.str.lower()

demo_data.head()


Unnamed: 0,geoid,geog,2000_pop,2010_pop,tot_pop,und5,a5_19,a20_34,a35_49,a50_64,...,2000_assoc,2000_bach,2000_grad_prof,2000_pop_25ov,ht_cost_typical,ht_cost_mod,h_cost_typical,h_cost_mod,t_cost_typical,t_cost_mod
0,14.0,Albany Park,57655.0,51542.0,49805.99998,3110.0,9413.0,12785.0,11765.0,7691.0,...,,,,,,,,,,
1,57.0,Archer Heights,12644.0,13393.0,13700.97018,1164.0,3306.0,2970.0,2805.0,1713.0,...,,,,,,,,,,
2,34.0,Armour Square,12032.0,13391.0,13598.48056,645.0,1876.0,2657.0,2525.0,2520.0,...,,,,,,,,,,
3,70.0,Ashburn,39584.0,41081.0,43355.99999,2741.0,9956.0,8224.0,8847.0,8497.0,...,,,,,,,,,,
4,71.0,Auburn Gresham,55928.0,48743.0,45909.00001,2415.0,9256.0,8520.0,8040.0,9452.0,...,,,,,,,,,,


In [87]:
demo_data.shape

(104, 245)

In [88]:
demo_data.tail(30)

Unnamed: 0,geoid,geog,2000_pop,2010_pop,tot_pop,und5,a5_19,a20_34,a35_49,a50_64,...,2000_assoc,2000_bach,2000_grad_prof,2000_pop_25ov,ht_cost_typical,ht_cost_mod,h_cost_typical,h_cost_mod,t_cost_typical,t_cost_mod
74,2.0,West Ridge,73199.0,71942.0,78466.00002,6723.0,15755.0,16184.0,15248.0,14104.0,...,,,,,,,,,,
75,24.0,West Town,87435.0,82236.0,84698.0,5315.0,6960.0,38669.0,19729.0,9014.0,...,,,,,,,,,,
76,42.0,Woodlawn,27086.0,23740.0,22655.0,1499.0,5137.0,5554.0,3865.0,3896.0,...,,,,,,,,,,
77,,,,,,,,,,,...,,,,,,,,,,
78,,,,,,,,,,,...,,,,,,,,,,
79,,,,,,,,,,,...,,,,,,,,,,
80,,,,,,,,,,,...,,,,,,,,,,
81,,,,,,,,,,,...,,,,,,,,,,
82,,,,,,,,,,,...,,,,,,,,,,
83,,,,,,,,,,,...,,,,,,,,,,


In [89]:
# Drop the NaN rows at the end of the dataset
demo_data.drop(demo_data.index[77:104], inplace=True)

In [90]:
demo_data.head()

Unnamed: 0,geoid,geog,2000_pop,2010_pop,tot_pop,und5,a5_19,a20_34,a35_49,a50_64,...,2000_assoc,2000_bach,2000_grad_prof,2000_pop_25ov,ht_cost_typical,ht_cost_mod,h_cost_typical,h_cost_mod,t_cost_typical,t_cost_mod
0,14.0,Albany Park,57655.0,51542.0,49805.99998,3110.0,9413.0,12785.0,11765.0,7691.0,...,,,,,,,,,,
1,57.0,Archer Heights,12644.0,13393.0,13700.97018,1164.0,3306.0,2970.0,2805.0,1713.0,...,,,,,,,,,,
2,34.0,Armour Square,12032.0,13391.0,13598.48056,645.0,1876.0,2657.0,2525.0,2520.0,...,,,,,,,,,,
3,70.0,Ashburn,39584.0,41081.0,43355.99999,2741.0,9956.0,8224.0,8847.0,8497.0,...,,,,,,,,,,
4,71.0,Auburn Gresham,55928.0,48743.0,45909.00001,2415.0,9256.0,8520.0,8040.0,9452.0,...,,,,,,,,,,


In [91]:
demo_data.shape

(77, 245)
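Slicing by position works here because the empty rows all sit at the end of the file. As an alternative sketch that does not depend on row positions, `dropna(how="all")` removes only the rows in which every column is missing. The toy dataframe reuses two real rows from the output above plus two empty ones.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "geoid": [14.0, 57.0, np.nan, np.nan],
    "geog": ["Albany Park", "Archer Heights", np.nan, np.nan],
    "tot_pop": [49805.99998, 13700.97018, np.nan, np.nan],
})

# Drop rows where every column is NaN; partially filled rows are kept
cleaned = toy.dropna(how="all").reset_index(drop=True)
```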

In [92]:
neighborhood_data.head()

Unnamed: 0,comm_num,community,area_kmsq,length_streets_m,percent of housing crowded,percent households below poverty,percent aged 16+ unemployed,percent aged 25+ without high school diploma,percent aged under 18 or over 64,per capita income,hardship index
0,1,ROGERS PARK,51.259902,57712.379599,7.7,23.6,8.7,18.2,27.5,23939,39.0
1,2,WEST RIDGE,98.429095,121335.202892,7.8,17.2,8.8,20.8,38.5,23040,46.0
2,3,UPTOWN,65.095643,63769.090382,3.8,24.0,8.9,11.8,22.2,35787,20.0
3,4,LINCOLN SQUARE,71.352328,76974.128441,3.4,10.9,8.2,13.4,25.5,37524,17.0
4,5,NORTH CENTER,57.054168,73743.375859,0.3,7.5,5.2,4.5,26.2,57123,6.0


In [93]:
# Sort by community number
demo_data.sort_values(by=['geoid'], inplace=True, ascending=True)
demo_data.reset_index(drop=True, inplace=True)

In [94]:
# Merge demographic data to neighborhood dataset
neighborhood_data = neighborhood_data.merge(demo_data.loc[:,['geoid', 'med_age', 'white', 'hisp', 'black', 'asian', 'other']], how='left', left_on='comm_num', right_on='geoid')

In [95]:
# Add a column for population total, counting each group once, since other columns in the original dataset didn't line up
neighborhood_data['pop'] = (neighborhood_data['white'] + neighborhood_data['hisp'] + neighborhood_data['black'] + neighborhood_data['asian'] + neighborhood_data['other'])

In [96]:
# Calculate percentages for each demographic group
neighborhood_data['percent_white'] = (neighborhood_data['white'] / neighborhood_data['pop'])*100
neighborhood_data['percent_hisp'] = (neighborhood_data['hisp'] / neighborhood_data['pop'])*100
neighborhood_data['percent_black'] = (neighborhood_data['black'] / neighborhood_data['pop'])*100
neighborhood_data['percent_asian'] = (neighborhood_data['asian'] / neighborhood_data['pop'])*100
neighborhood_data['percent_other'] = (neighborhood_data['other'] / neighborhood_data['pop'])*100

In [97]:
neighborhood_data.head()

Unnamed: 0,comm_num,community,area_kmsq,length_streets_m,percent of housing crowded,percent households below poverty,percent aged 16+ unemployed,percent aged 25+ without high school diploma,percent aged under 18 or over 64,per capita income,...,hisp,black,asian,other,pop,percent_white,percent_hisp,percent_black,percent_asian,percent_other
0,1,ROGERS PARK,51.259902,57712.379599,7.7,23.6,8.7,18.2,27.5,23939,...,10887.0,15187.0,2695.0,2349.0,55475.0,43.906264,19.625056,27.376296,4.858044,4.23434
1,2,WEST RIDGE,98.429095,121335.202892,7.8,17.2,8.8,20.8,38.5,23040,...,14835.0,9086.0,18650.0,4059.0,78466.0,40.572987,18.906278,11.579538,23.768256,5.172941
2,3,UPTOWN,65.095643,63769.090382,3.8,24.0,8.9,11.8,22.2,35787,...,8609.0,10476.0,6207.0,1713.0,58979.0,54.212516,14.596721,17.762254,10.524085,2.904424
3,4,LINCOLN SQUARE,71.352328,76974.128441,3.4,10.9,8.2,13.4,25.5,37524,...,7611.0,1470.0,3820.0,2037.0,42103.0,64.520343,18.077097,3.491438,9.072988,4.838135
4,5,NORTH CENTER,57.054168,73743.375859,0.3,7.5,5.2,4.5,26.2,57123,...,4070.0,750.0,1737.0,1357.0,35543.0,77.734012,11.450919,2.11012,4.887038,3.817911
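The population total and the five percentage columns can also be computed from a single list of group names, which ensures each group is counted exactly once and makes it easy to confirm that the shares add up to 100. This is a sketch on made-up counts; the column names follow the ones used in this notebook.

```python
import pandas as pd

groups = ["white", "hisp", "black", "asian", "other"]

# Made-up counts for two hypothetical areas
toy = pd.DataFrame({
    "white": [50.0, 10.0],
    "hisp": [20.0, 10.0],
    "black": [10.0, 10.0],
    "asian": [10.0, 10.0],
    "other": [10.0, 10.0],
})

# Row-wise total over the group columns, each counted once
toy["pop"] = toy[groups].sum(axis=1)

# Divide every group column by the row total and label the results
pct = toy[groups].div(toy["pop"], axis=0).mul(100).add_prefix("percent_")
toy = toy.join(pct)

# Sanity check: shares in each row should sum to 100
share_sums = pct.sum(axis=1)
```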


In [99]:
# Save dataframe to CSV for use in analysis notebook
neighborhood_data.to_csv('neighborhood_data.csv')
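By default, `to_csv` also writes the dataframe index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that extra column is not wanted in the analysis notebook, passing `index=False` avoids it. The sketch below demonstrates the round trip with an in-memory buffer instead of a file.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "comm_num": [1, 2],
    "community": ["ROGERS PARK", "WEST RIDGE"],
})

# Write without the index, then read the same text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```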