### For Capstone Project

### Data Collection

We will look for a suitable location for our business among the neighborhoods of the City of Chicago. The analysis needs neighborhood names, ZIP codes, and latitude/longitude pairs for map markers, which will be collected from the following sources:

- **For Neighborhood names**: https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Chicago

- **For Zip Codes** https://data.cityofchicago.org/api/views/unjd-c2ca/rows.csv?accessType=DOWNLOAD

- **For Lat Lng** https://simplemaps.com/data/us-zips

- **FourSquare API for Venues** https://developer.foursquare.com/docs/resources/categories

After that we will form clusters and analyze which clusters have space/land available for commercial activity. The data for this will be obtained from the following:
- **Chicago City Owned Lands**  https://data.cityofchicago.org/Community-Economic-Development/City-Owned-Land-Inventory/aksk-kvfp/data

### 1. Setting up the environment

In [296]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json

from geopy.geocoders import Nominatim

from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0
import geocoder
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import seaborn as sns
from sklearn.cluster import KMeans
import folium

## Getting Neighborhoods for the City of Chicago

### 1. Parsing the HTML

In [297]:
url = 'https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Chicago'
page = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(page, 'html.parser')

wiki_table = soup.body.table.tbody

### 2. Extracting data from the table to the data frame

In [298]:
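def get_cell(element):
    # Collect the text of every <td> in a table row, preferring link text when present
    cells = element.find_all('td')
    row = []

    for cell in cells:
        if cell.a and cell.a.text:
            row.append(cell.a.text)
            continue
        # get_text() also works for cells with nested tags, where cell.string is None
        row.append(cell.get_text().strip())

    return row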

In [299]:
def get_row():    
    data = []  
    
    for tr in wiki_table.find_all('tr'):
        row = get_cell(tr)
        if len(row) != 2:
            continue
        data.append(row)        
    
    return data

In [300]:
data = get_row()
columns = ['Neighborhood', 'Community Area']
df = pd.DataFrame(data, columns=columns)
df.head()

Unnamed: 0,Neighborhood,Community Area
0,Albany Park,Albany Park
1,Altgeld Gardens,Riverdale
2,Andersonville,Edgewater
3,Archer Heights,Archer Heights
4,Armour Square,Armour Square


In [301]:
df.shape

(246, 2)

### 3. Cleaning the data

In [10]:
df = df[df.Neighborhood != 'Not assigned']
df = df.sort_values(by=['Neighborhood','Community Area'])

df = df.reset_index(drop=True)  # renumber rows without keeping the old index as a column

df.head()

Unnamed: 0,Neighborhood,Community Area
0,Albany Park,Albany Park
1,Altgeld Gardens,Riverdale
2,Andersonville,Edgewater
3,Archer Heights,Archer Heights
4,Armour Square,Armour Square


In [11]:
df.shape

(246, 2)

We have our neighborhoods, but we need more information to obtain their geographical locations. One way is to use ZIP codes, and [this City of Chicago dataset](https://data.cityofchicago.org/api/views/unjd-c2ca/rows.csv?accessType=DOWNLOAD) provides the relevant data.

In [12]:
df_zip=pd.read_csv('Zip_Codes.csv')
df_zip.head()

Unnamed: 0,the_geom,OBJECTID,ZIP,SHAPE_AREA,SHAPE_LEN
0,MULTIPOLYGON (((-87.67762151065281 41.91775780...,33,60647,106052300.0,42720.044406
1,MULTIPOLYGON (((-87.72683253163021 41.92264626...,34,60639,127476100.0,48103.782721
2,MULTIPOLYGON (((-87.78500237831095 41.90914785...,35,60707,45069040.0,27288.609612
3,MULTIPOLYGON (((-87.6670686895295 41.888851884...,36,60622,70853830.0,42527.989679
4,MULTIPOLYGON (((-87.70655631674127 41.89555340...,37,60651,99039620.0,47970.140153


We only need the ZIP column.

In [13]:
df_zip=df_zip['ZIP']

In [14]:
df_zip=pd.DataFrame(df_zip)
df_zip.head()

Unnamed: 0,ZIP
0,60647
1,60639
2,60707
3,60622
4,60651


We have ZIP codes, but we don't know which ZIP code falls in which neighborhood or what its lat/lng is. We therefore need data that maps these ZIP codes to lat/lng pairs, which we can then map to neighborhoods. The dataset at [SimpleMaps](https://simplemaps.com/data/us-zips) provides this information, so we will use it.

In [15]:
df3=pd.read_csv('uszips.csv')
df3.head()

Unnamed: 0,zip,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,all_county_weights,imprecise,military,timezone
0,601,18.18004,-66.75218,Adjuntas,PR,Puerto Rico,True,,18570,111.4,72001,Adjuntas,"{'72001':99.43,'72141':0.57}",False,False,America/Puerto_Rico
1,602,18.36073,-67.17517,Aguada,PR,Puerto Rico,True,,41520,523.5,72003,Aguada,{'72003':100},False,False,America/Puerto_Rico
2,603,18.45439,-67.12202,Aguadilla,PR,Puerto Rico,True,,54689,667.9,72005,Aguadilla,{'72005':100},False,False,America/Puerto_Rico
3,606,18.16724,-66.93828,Maricao,PR,Puerto Rico,True,,6615,60.4,72093,Maricao,"{'72093':94.88,'72121':1.35,'72153':3.78}",False,False,America/Puerto_Rico
4,610,18.29032,-67.12243,Anasco,PR,Puerto Rico,True,,29016,312.0,72011,Añasco,"{'72003':0.55,'72011':99.45}",False,False,America/Puerto_Rico


In [16]:
df3['city'].unique()

array(['Adjuntas', 'Aguada', 'Aguadilla', ..., 'Metlakatla',
       'Point Baker', 'Wrangell'], dtype=object)

We only need data related to Chicago

In [17]:
df3=df3[df3['city']=='Chicago']

In [18]:
df3.rename(columns={'zip':'ZIP'}, inplace=True) # rename so both frames share the 'ZIP' key
chicago_df=pd.merge(df_zip, df3, on='ZIP', how='left') # left merge keeps every city ZIP

In [19]:
chicago_df

Unnamed: 0,ZIP,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,all_county_weights,imprecise,military,timezone
0,60647,41.92068,-87.70167,Chicago,IL,Illinois,True,,87291.0,8385.0,17031.0,Cook,{'17031':100},False,False,America/Chicago
1,60639,41.92056,-87.75603,Chicago,IL,Illinois,True,,90407.0,7156.5,17031.0,Cook,{'17031':100},False,False,America/Chicago
2,60707,,,,,,,,,,,,,,,
3,60622,41.90274,-87.68331,Chicago,IL,Illinois,True,,52548.0,8213.0,17031.0,Cook,{'17031':100},False,False,America/Chicago
4,60651,41.90206,-87.74095,Chicago,IL,Illinois,True,,64267.0,7099.1,17031.0,Cook,{'17031':100},False,False,America/Chicago
5,60611,41.89472,-87.61938,Chicago,IL,Illinois,True,,28718.0,13562.3,17031.0,Cook,{'17031':100},False,False,America/Chicago
6,60638,41.78145,-87.77056,Chicago,IL,Illinois,True,,55026.0,1913.0,17031.0,Cook,{'17031':100},False,False,America/Chicago
7,60652,41.74795,-87.71479,Chicago,IL,Illinois,True,,40959.0,3153.6,17031.0,Cook,{'17031':100},False,False,America/Chicago
8,60626,42.00903,-87.66963,Chicago,IL,Illinois,True,,50139.0,11355.1,17031.0,Cook,{'17031':100},False,False,America/Chicago
9,60615,41.80223,-87.60272,Chicago,IL,Illinois,True,,40603.0,7086.4,17031.0,Cook,{'17031':100},False,False,America/Chicago
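
The left merge keeps every ZIP from the city list, so a ZIP that is absent from the filtered SimpleMaps table (such as 60707 above, presumably listed there under a different city name) comes back with empty fields. A minimal sketch of how one could surface those rows, using a tiny hand-made sample in place of the real files; `zips`, `coords`, and `unmatched` are illustrative names, not from the notebook:

```python
import pandas as pd

# Small stand-ins for df_zip and the Chicago rows of df3 (values copied from the tables above).
zips = pd.DataFrame({'ZIP': [60647, 60639, 60707]})
coords = pd.DataFrame({
    'ZIP': [60647, 60639],
    'lat': [41.92068, 41.92056],
    'lng': [-87.70167, -87.75603],
})

# indicator=True adds a '_merge' column telling where each row was found.
merged = pd.merge(zips, coords, on='ZIP', how='left', indicator=True)

# Rows present only in the ZIP list have no coordinates.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'ZIP'].tolist()
print(unmatched)
```

Listing the unmatched ZIPs before dropping them makes it easier to decide whether they are genuinely outside the city or just labeled differently in the second dataset.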


Drop Null Entries

In [20]:
chicago_df = chicago_df[np.isfinite(chicago_df['lat'])]

In [21]:
chicago_df.drop(['zcta','parent_zcta','county_fips','all_county_weights','imprecise','military','timezone'],axis=1,inplace=True )

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
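
The warning above appears because `chicago_df` was created by filtering another DataFrame, and pandas cannot tell whether the in-place drop is meant to affect the original. One common way to avoid it, sketched here on a small made-up frame (sample values taken from the table above, `sample` and `clean` being illustrative names), is to take an explicit copy after filtering and to reassign rather than modify in place:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'ZIP': [60647, 60707, 60622],
    'lat': [41.92068, np.nan, 41.90274],
    'lng': [-87.70167, np.nan, -87.68331],
    'timezone': ['America/Chicago', None, 'America/Chicago'],
})

# dropna + copy() yields an independent frame, so later assignments are unambiguous.
clean = sample.dropna(subset=['lat']).copy()

# Reassign instead of using inplace=True.
clean = clean.drop(columns=['timezone'])

# Adding a derived column to the independent copy does not trigger the warning.
clean['coord_pairs'] = clean[['lat', 'lng']].values.round(4).tolist()
print(clean)
```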


In [22]:
chicago_df

Unnamed: 0,ZIP,lat,lng,city,state_id,state_name,population,density,county_name
0,60647,41.92068,-87.70167,Chicago,IL,Illinois,87291.0,8385.0,Cook
1,60639,41.92056,-87.75603,Chicago,IL,Illinois,90407.0,7156.5,Cook
3,60622,41.90274,-87.68331,Chicago,IL,Illinois,52548.0,8213.0,Cook
4,60651,41.90206,-87.74095,Chicago,IL,Illinois,64267.0,7099.1,Cook
5,60611,41.89472,-87.61938,Chicago,IL,Illinois,28718.0,13562.3,Cook
6,60638,41.78145,-87.77056,Chicago,IL,Illinois,55026.0,1913.0,Cook
7,60652,41.74795,-87.71479,Chicago,IL,Illinois,40959.0,3153.6,Cook
8,60626,42.00903,-87.66963,Chicago,IL,Illinois,50139.0,11355.1,Cook
9,60615,41.80223,-87.60272,Chicago,IL,Illinois,40603.0,7086.4,Cook
10,60621,41.77638,-87.63944,Chicago,IL,Illinois,35912.0,3718.9,Cook


In [23]:
chicago_df['coord_pairs']=chicago_df[['lat', 'lng']].values.round(4).tolist()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [92]:
chicago_df.head()

Unnamed: 0,ZIP,lat,lng,city,state_id,state_name,population,density,county_name,coord_pairs,Neighborhood
0,60647,41.92068,-87.70167,Chicago,IL,Illinois,87291.0,8385.0,Cook,"[41.9207, -87.7017]",Palmer Square
1,60639,41.92056,-87.75603,Chicago,IL,Illinois,90407.0,7156.5,Cook,"[41.9206, -87.756]",Hanson Park
3,60622,41.90274,-87.68331,Chicago,IL,Illinois,52548.0,8213.0,Cook,"[41.9027, -87.6833]",Ukrainian Village
4,60651,41.90206,-87.74095,Chicago,IL,Illinois,64267.0,7099.1,Cook,"[41.9021, -87.741]",West Humboldt Park
5,60611,41.89472,-87.61938,Chicago,IL,Illinois,28718.0,13562.3,Cook,"[41.8947, -87.6194]",Streeterville


### Getting Neighborhood Names for Each ZIP Code

We will now use geocoder to look up the neighborhood name for each lat/lng pair, which is already mapped to a ZIP code.

In [27]:
def get_neighbor(latlng):
    g=geocoder.mapbox(latlng, method='reverse',key='pk.eyJ1IjoiaGNkNzQ5ODYiLCJhIjoiY2sxejh6OGNuMG82YzNjbnNjNjAxbXd4ayJ9.UyAu6s5crbE_QpzNpGg4fw')
    a=g.json['raw']['neighborhood']
    return a
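
`get_neighbor` assumes every reverse-geocoding response carries a `neighborhood` entry under `raw`; a coordinate without one would raise a `KeyError` and abort the whole `apply`. A defensive variant might look like the sketch below. The helper `extract_neighborhood` and the sample payloads are hypothetical, and the payload shape is inferred only from the lookup used above:

```python
from typing import Optional

def extract_neighborhood(payload: Optional[dict]) -> Optional[str]:
    """Return payload['raw']['neighborhood'] if present, else None."""
    if not payload:
        return None
    raw = payload.get('raw') or {}
    return raw.get('neighborhood')

# Hypothetical payloads mimicking the structure accessed in get_neighbor.
with_name = {'raw': {'neighborhood': 'Palmer Square'}}
without_name = {'raw': {'place': 'Chicago'}}

print(extract_neighborhood(with_name))
print(extract_neighborhood(without_name))
print(extract_neighborhood(None))
```

With such a helper, the lookup inside `get_neighbor` could return `extract_neighborhood(g.json)`, and rows left as `None` can be inspected or dropped afterwards.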

In [28]:
# df['Neighbour']=df['new'].apply(lambda x : get_neighbour(x))
chicago_df['Neighborhood'] = chicago_df['coord_pairs'].apply(get_neighbor)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [29]:
chicago_df.head()

Unnamed: 0,ZIP,lat,lng,city,state_id,state_name,population,density,county_name,coord_pairs,Neighbor
0,60647,41.92068,-87.70167,Chicago,IL,Illinois,87291.0,8385.0,Cook,"[41.9207, -87.7017]",Palmer Square
1,60639,41.92056,-87.75603,Chicago,IL,Illinois,90407.0,7156.5,Cook,"[41.9206, -87.756]",Hanson Park
3,60622,41.90274,-87.68331,Chicago,IL,Illinois,52548.0,8213.0,Cook,"[41.9027, -87.6833]",Ukrainian Village
4,60651,41.90206,-87.74095,Chicago,IL,Illinois,64267.0,7099.1,Cook,"[41.9021, -87.741]",West Humboldt Park
5,60611,41.89472,-87.61938,Chicago,IL,Illinois,28718.0,13562.3,Cook,"[41.8947, -87.6194]",Streeterville


In [289]:
chicago_df.describe()

Unnamed: 0,ZIP,lat,lng,population,density
count,57.0,57.0,57.0,57.0,57.0
mean,60630.105263,41.86477,-87.674223,47902.385965,5874.859649
std,17.72678,0.09327,0.060866,26884.982808,3196.44269
min,60601.0,41.66435,-87.82692,493.0,485.3
25%,60615.0,41.78145,-87.71176,28641.0,3447.3
50%,60630.0,41.88056,-87.66277,48281.0,4950.5
75%,60644.0,41.93998,-87.62912,65996.0,7743.1
max,60661.0,42.00903,-87.55431,113916.0,13562.3


In [295]:
chicago_df['Neighborhood'].value_counts()

Old Irving Park                      1
Bronzeville                          1
Hyde Park                            1
Mount Greenwood                      1
Marquette Park                       1
South Deering                        1
Edgewater Glen                       1
Ukrainian Village                    1
South Shore                          1
East Garfield Park                   1
South Austin                         1
Edgebrook                            1
Ashburn                              1
Palmer Square                        1
Rogers Park                          1
Jefferson Park                       1
Norwood Park West                    1
West Humboldt Park                   1
Sheffield Neighbors                  1
West Englewood                       1
Brighton Park                        1
Graceland West                       1
Streeterville                        1
Little Village                       1
Uptown                               1
Englewood                

Some neighborhoods appear to have more than one ZIP code. These extra rows are redundant for us and may affect our clusters and their analysis, so we need to drop them.
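
Before dropping, it can help to see which neighborhoods are actually shared by several ZIP codes. A minimal sketch on made-up rows (the ZIP-to-neighborhood pairs below are illustrative, not taken from the real results, and `rows`, `shared`, and `unique_rows` are hypothetical names):

```python
import pandas as pd

rows = pd.DataFrame({
    'ZIP': [60601, 60602, 60611, 60647],
    'Neighborhood': ['The Loop', 'The Loop', 'Streeterville', 'Palmer Square'],
})

# keep=False flags every member of a duplicated group, not just the repeats.
shared = rows[rows.duplicated(subset='Neighborhood', keep=False)]
print(shared)

# Keep the first ZIP per neighborhood; reassigning avoids in-place modification.
unique_rows = rows.drop_duplicates(subset='Neighborhood', keep='first').reset_index(drop=True)
print(unique_rows)
```

Keeping the first ZIP is an arbitrary choice; which row survives depends on the current row order.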

In [294]:
chicago_df.drop_duplicates(subset ="Neighborhood", keep = 'first', inplace = True) 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Saving the cleaned data for further analysis.

In [None]:
chicago_df.to_csv('Chicago.csv')

**Getting Lat,Lng for the city of Chicago**

In [94]:
address = 'Chicago, IL'
geolocator = Nominatim(user_agent="ch_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Chicago City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Chicago City are 41.8755616, -87.6244212.
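
The geocoding call depends on an external service and can return nothing when the lookup fails. As a rough fallback for centering the map, one could average the ZIP-code coordinates already collected. This is only a sketch: `fallback_center` and `zip_centroids` are hypothetical names, and the five points are copied from the first rows of `chicago_df` shown earlier rather than the full table:

```python
from statistics import mean

# (lat, lng) pairs copied from the first rows of chicago_df above.
zip_centroids = [
    (41.92068, -87.70167),
    (41.92056, -87.75603),
    (41.90274, -87.68331),
    (41.90206, -87.74095),
    (41.89472, -87.61938),
]

def fallback_center(points):
    """Average the latitudes and longitudes; adequate for a city-sized area."""
    lats, lngs = zip(*points)
    return mean(lats), mean(lngs)

center_lat, center_lng = fallback_center(zip_centroids)
print(round(center_lat, 4), round(center_lng, 4))
```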


## Create a Map of Chicago and Place Markers to Identify Neighborhoods

**Folium** is a great visualization library. We can zoom into the map below and click on each circle marker to reveal the name of the neighborhood.

In [32]:
# create map of Chicago using latitude and longitude values
map_Chicago = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(chicago_df['lat'], chicago_df['lng'], chicago_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Chicago)  
    
map_Chicago