Introduction/Business Problem 

People weigh a variety of factors when deciding to rent or buy a home in a new town, and location is a major driver of that decision. The search takes a lot of manual effort and often involves realtor fees. In this project, I analyze Chicago neighborhoods and recommend where to rent or buy based on proximity to a combination of venue types chosen by the user.

Data

Three sets of data will be required for this problem:

1. Chicago neighborhoods and associated zip codes: will be scraped from https://www.dreamtown.com/maps/chicago-zipcode-map

2. The above data will be converted to a dataframe and a longitude/latitude will be appended to the table using Geocoder Python package: https://geocoder.readthedocs.io/index.html

3. For each neighborhood, I will get the nearby venues with their latitudes/longitudes and categories using the following Foursquare API request:

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
results = requests.get(url).json()
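
The request URL above is assembled with a long format string, which makes it easy to misplace an argument. As a minimal sketch of an alternative, the query string can be built from named parameters with the standard library. The helper name `build_explore_url` and the placeholder credentials are assumptions made for illustration; the endpoint and parameter names are the ones used in the request above.

```python
from urllib.parse import urlencode

# Endpoint taken from the request above
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Hypothetical helper: assemble the explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll value readable instead of encoding it as %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')


# Placeholder credentials, for illustration only
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 41.97, -87.7)
print(example_url)
```

Naming each parameter keeps the latitude/longitude pair and the radius/limit pair from being swapped by accident.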


Following this, I will clean the JSON response and structure it into a pandas dataframe:

venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
                  

In [3]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_html('https://www.dreamtown.com/maps/chicago-zipcode-map', header = 0)

In [5]:
result=pd.concat(df)
res = result.reset_index(drop=True)

In [6]:
res.drop(['# of Listings'], axis = 1, inplace = True)

In [7]:
pd.set_option("display.max_rows", None)

In [8]:
res

Unnamed: 0,Neighborhoods,Zip Codes
0,Albany Park,60625
1,Altgeld Gardens,60827
2,Andersonville,60640
3,Arcadia Terrace,60659
4,Archer Heights,60632
5,Ashburn,"60652, 60629"
6,Austin,"60644, 60639, 60651, 60707"
7,Avalon Park,60619
8,Avondale,60618
9,Albany Park,60625


In [9]:
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 

import requests # library to handle requests
from pandas.io.json import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.4.1               |             py_0          26 KB  conda-forge
    altair-4.1.0               |             py_1         614 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    certifi-2020.6.20          |   py36h9f0ad1d_0         151 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    ca-certificates-2020.6.20  |       hecda079_0         145 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                       

In [11]:
#!conda install -c conda-forge geocoder --yes 

In [10]:
!pip install uszipcode

Collecting uszipcode
  Downloading https://files.pythonhosted.org/packages/bc/94/1b908c6fe2008f0e913b0b2d97951aa76e00ec1044883c012afb2e477b4a/uszipcode-0.2.4-py2.py3-none-any.whl (378kB)
Collecting pathlib-mate (from uszipcode)
  Downloading https://files.pythonhosted.org/packages/ee/90/b414af97dea2b4f98b0cebaa69ec02eacca82e6b1ba18632c5927f01591a/pathlib_mate-1.0.0-py2.py3-none-any.whl (77kB)
Collecting autopep8 (from pathlib-mate->uszipcode)
  Downloading https://files.pythonhosted.org/packages/33/9e/69587808c3f77088c96a99a2a4bd8e4a17e8ddbbc2ab1495b5df4c2cd37e/autopep8-1.5.3.tar.gz (116kB)
Collecting pycodestyle>=2.6.0 (from autopep8->pathlib-mate->uszipcode)
  Downloading https://files.pythonhosted.org/packages/10/5b/88879fb861ab79aef45c7e199cae3ef7

In [11]:
import uszipcode

In [12]:
!pip install mpu

Collecting mpu
  Downloading https://files.pythonhosted.org/packages/a6/3a/c4c04201c9cd8c5845f85915d644cb14b16200680e5fa424af01c411e140/mpu-0.23.1-py3-none-any.whl (69kB)
Installing collected packages: mpu
Successfully installed mpu-0.23.1


In [13]:
import mpu

In [14]:
df_Lat_Lng = res['Zip Codes']

In [15]:
df_Lat_Lng

0                                                  60625
1                                                  60827
2                                                  60640
3                                                  60659
4                                                  60632
5                                           60652, 60629
6                             60644, 60639, 60651, 60707
7                                                  60619
8                                                  60618
9                                                  60625
10                                                 60827
11                                                 60640
12                                                 60659
13                                                 60632
14                                          60652, 60629
15                            60644, 60639, 60651, 60707
16                                                 60619
17                             

In [17]:
g = df_Lat_Lng.shape[0]  # number of rows to look up

Use the uszipcode package to look up the latitude and longitude of each Chicago zip code

In [18]:
from uszipcode import SearchEngine


# latitude and longitude collected for each row of df_Lat_Lng
Lt = []
Lg = []

search = SearchEngine(simple_zipcode=True)

for i in range(0, g):
    ind = df_Lat_Lng[i]

    # entries that list several zip codes do not match a single zip code,
    # so lat and lng come back empty for those rows
    zipc = search.by_zipcode(ind)
    lat = zipc.lat
    long = zipc.lng

    Lt.append(lat)
    Lg.append(long)



In [19]:
res['Latitude'] = Lt
res['Longitude'] = Lg

In [20]:
res

Unnamed: 0,Neighborhoods,Zip Codes,Latitude,Longitude
0,Albany Park,60625,41.97,-87.7
1,Altgeld Gardens,60827,41.65,-87.63
2,Andersonville,60640,41.97,-87.66
3,Arcadia Terrace,60659,41.99,-87.7
4,Archer Heights,60632,41.82,-87.69
5,Ashburn,"60652, 60629",,
6,Austin,"60644, 60639, 60651, 60707",,
7,Avalon Park,60619,41.74,-87.61
8,Avondale,60618,41.95,-87.7
9,Albany Park,60625,41.97,-87.7


Remove neighborhoods that have multiple zip codes; the lookup returned no coordinates for them, so their Latitude and Longitude are missing

In [21]:
res.dropna(inplace = True)

In [22]:
res.drop(['Zip Codes'],axis=1,inplace=True)
res.rename(columns={"Neighborhoods": "Neighborhood"}, inplace = True)
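
Dropping every neighborhood that lists several zip codes removes areas such as Ashburn and Austin from the analysis. As a hedged alternative, one could keep those rows and look up only the first zip code in the field. This is an approximation, since the first zip code need not be the most central one, and the helper name `first_zip_code` is hypothetical.

```python
def first_zip_code(zip_field):
    """Hypothetical helper: return the first zip code in a comma-separated field."""
    # str() also covers fields that were parsed as integers
    return str(zip_field).split(',')[0].strip()


single = first_zip_code(60625)
multiple = first_zip_code('60644, 60639, 60651, 60707')
print(single, multiple)
```

The returned string could then be passed to the zip code lookup in place of the raw field.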

In [23]:
address = 'Chicago, IL'

geolocator = Nominatim(user_agent="ch_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Chicago are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Chicago are 41.8755616, -87.6244212.


In [24]:
res


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Albany Park,41.97,-87.7
1,Altgeld Gardens,41.65,-87.63
2,Andersonville,41.97,-87.66
3,Arcadia Terrace,41.99,-87.7
4,Archer Heights,41.82,-87.69
7,Avalon Park,41.74,-87.61
8,Avondale,41.95,-87.7
9,Albany Park,41.97,-87.7
10,Altgeld Gardens,41.65,-87.63
11,Andersonville,41.97,-87.66


In [25]:
# create map of Chicago using latitude and longitude values
map_chicago = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(res['Latitude'], res['Longitude'], res['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_chicago)  
    
map_chicago

In [26]:
CLIENT_ID = '2B5LEN4VCIG21SX3N0WNV4Q0QCMPDEKS5KJAWYEQ054PEFSX' # your Foursquare ID
CLIENT_SECRET = '2VNXKIY0UYMZJDJ1MQD14WT5GUOZWCPWXTAXKIIK1GK2UBYG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 2B5LEN4VCIG21SX3N0WNV4Q0QCMPDEKS5KJAWYEQ054PEFSX
CLIENT_SECRET:2VNXKIY0UYMZJDJ1MQD14WT5GUOZWCPWXTAXKIIK1GK2UBYG


In [27]:
neighborhood_latitude = res.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = res.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = res.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and Longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and Longitude values of Albany Park are 41.97, -87.7.


In [28]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=2B5LEN4VCIG21SX3N0WNV4Q0QCMPDEKS5KJAWYEQ054PEFSX&client_secret=2VNXKIY0UYMZJDJ1MQD14WT5GUOZWCPWXTAXKIIK1GK2UBYG&v=20180605&ll=41.97,-87.7&radius=500&limit=100'

In [29]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f03c00420d13b532c479c68'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Ravenswood',
  'headerFullLocation': 'Ravenswood, Chicago',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 18,
  'suggestedBounds': {'ne': {'lat': 41.9745000045, 'lng': -87.69395879999725},
   'sw': {'lat': 41.9654999955, 'lng': -87.70604120000276}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4ee5382b6c25be96312ba9da',
       'name': 'Goosefoot',
       'location': {'address': '2656 W Lawrence Ave',
        'crossStreet': 'at Washtenaw',
        'lat': 41.96860996532721,
        'lng': -87.69598804838718,
        'labeledLatLngs': [{'label': 'displ

In [30]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [31]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Goosefoot,New American Restaurant,41.96861,-87.695988
1,HarvesTime Foods,Grocery Store,41.96883,-87.69525
2,Ronan Park,Park,41.969627,-87.702085
3,Monti's,Sandwich Place,41.968243,-87.694978
4,Nhu Lan Bakery,Sandwich Place,41.968598,-87.694471


In [32]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

18 venues were returned by Foursquare.


Get venues in neighborhood

In [33]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
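
The list comprehension in getNearbyVenues indexes `categories[0]` directly, so a venue with an empty category list would raise an IndexError (get_category_type earlier guards against that case). A minimal sketch of a more tolerant extraction step follows. The helper name `extract_venue_rows` is hypothetical and the sample items are hand-written to mimic the response structure, not real API output.

```python
def extract_venue_rows(name, lat, lng, items):
    """Hypothetical helper: build one tuple per venue, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # fall back to None when a venue has no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# hand-written items that mimic the response structure (not real API output)
sample_items = [
    {'venue': {'name': 'Venue A', 'location': {'lat': 41.9686, 'lng': -87.6960},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 41.9688, 'lng': -87.6953},
               'categories': []}},
]
rows = extract_venue_rows('Albany Park', 41.97, -87.7, sample_items)
print(rows)
```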

In [34]:
chicago_venues=getNearbyVenues(names=res['Neighborhood'],
                                   latitudes=res['Latitude'],
                                   longitudes=res['Longitude']
                                  )

Albany Park
Altgeld Gardens
Andersonville
Arcadia Terrace
Archer Heights
Avalon Park
Avondale
Albany Park
Altgeld Gardens
Andersonville
Arcadia Terrace
Archer Heights
Avalon Park
Avondale
Back of the Yards
Belmont Heights
Belmont Terrace
Beverly View
Beverly Woods
Big Oaks
Bohemian National Cemetery
Brainerd
Brighton Park
Budlong Woods
Burnside
Cabrini Green
Chatham
Chicago Lawn
Chinatown
Clearing
Cottage Grove Heights
Dearborn Park
DePaul
Douglas Park
Dunning
East Chicago
East Rogers Park
East Village
Edgebrook
Edison Park
Fifth City
Forest Glen
Fuller Park
Garfield Ridge
Graceland Cemetery
Graceland West
Gresham
Hegewisch
Hollywood Park
Homan Square
Irving Park
Irving Woods
Jackson Park Highlands
Jeffery Manor
Kennedy Park
Kilbourn Park
Lakewood Balmoral
LeClaire Courts
Lincoln Square
Margate Park
Marquette Park
Marycrest
Marynook
Mayfair
Midway
Morgan Park
Mount Greenwood
Near South Side
New East Side
Noble Square
North Kenwood
North Mayfair
North Park
Norwood Park
O'Hare
Oakland
Ol

In [35]:
print(chicago_venues.shape)
chicago_venues.head()

(2487, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Albany Park,41.97,-87.7,Goosefoot,41.96861,-87.695988,New American Restaurant
1,Albany Park,41.97,-87.7,HarvesTime Foods,41.96883,-87.69525,Grocery Store
2,Albany Park,41.97,-87.7,Ronan Park,41.969627,-87.702085,Park
3,Albany Park,41.97,-87.7,Monti's,41.968243,-87.694978,Sandwich Place
4,Albany Park,41.97,-87.7,Nhu Lan Bakery,41.968598,-87.694471,Sandwich Place


In [37]:
chicago_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Albany Park,36,36,36,36,36,36
Andersonville,108,108,108,108,108,108
Arcadia Terrace,42,42,42,42,42,42
Archer Heights,36,36,36,36,36,36
Avalon Park,48,48,48,48,48,48
Avondale,70,70,70,70,70,70
Back of the Yards,15,15,15,15,15,15
Belmont Heights,21,21,21,21,21,21
Belmont Terrace,21,21,21,21,21,21
Beverly View,12,12,12,12,12,12
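
Albany Park shows 36 venues here although the single request for it returned 18, which is consistent with the neighborhood appearing twice in res (index 0 and 9). Repeated rows trigger repeated API calls and double the counts. A minimal sketch of removing the repeats before calling getNearbyVenues, using a small stand-in dataframe (`res_example` is an illustrative name, not part of the notebook):

```python
import pandas as pd

# small stand-in for res, with one neighborhood listed twice
res_example = pd.DataFrame({
    'Neighborhood': ['Albany Park', 'Avondale', 'Albany Park'],
    'Latitude': [41.97, 41.95, 41.97],
    'Longitude': [-87.7, -87.7, -87.7],
})

# keep the first occurrence of each neighborhood and renumber the rows
res_unique = res_example.drop_duplicates(subset='Neighborhood').reset_index(drop=True)
print(res_unique)
```

Because the grouped frequencies are ratios, doubling every venue leaves them unchanged, but the raw counts and the number of API requests are affected.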


In [38]:
print('There are {} unique categories.'.format(len(chicago_venues['Venue Category'].unique())))

There are 183 unique categories.


In [50]:
# one hot encoding
chicago_onehot = pd.get_dummies(chicago_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
chicago_onehot['Neighborhood'] = chicago_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [chicago_onehot.columns[-1]] + list(chicago_onehot.columns[:-1])
chicago_onehot = chicago_onehot[fixed_columns]

chicago_onehot.head(25)

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Amphitheater,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,...,Trail,Train Station,Ukrainian Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Vineyard,Wings Joint,Women's Store,Yoga Studio
0,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
chicago_onehot.shape


(2487, 184)

In [73]:
chicago_grouped_test = chicago_onehot.groupby('Neighborhood').mean().reset_index()
#chicago_grouped = chicago_grouped_test
chicago_grouped = chicago_grouped_test[['Neighborhood','Coffee Shop', 'Grocery Store', 'Train Station']]


chicago_grouped

Unnamed: 0,Neighborhood,Coffee Shop,Grocery Store,Train Station
0,Albany Park,0.0,0.055556,0.055556
1,Andersonville,0.037037,0.037037,0.0
2,Arcadia Terrace,0.047619,0.0,0.0
3,Archer Heights,0.0,0.055556,0.0
4,Avalon Park,0.0,0.0,0.0
5,Avondale,0.0,0.0,0.0
6,Back of the Yards,0.0,0.133333,0.0
7,Belmont Heights,0.0,0.0,0.0
8,Belmont Terrace,0.0,0.0,0.0
9,Beverly View,0.0,0.083333,0.0


In [74]:
chicago_grouped.shape

(114, 4)
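
Each value in chicago_grouped is the mean of a 0/1 indicator column, which equals the share of a neighborhood's venues that belong to that category. For Albany Park, 0.055556 for Grocery Store corresponds to 2 of its 36 venue rows. The same quantity can be sketched without one-hot encoding; the venue pairs below are made up for illustration, and `category_share` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# made-up (neighborhood, category) pairs, for illustration only
venues = [
    ('A', 'Coffee Shop'), ('A', 'Park'), ('A', 'Coffee Shop'), ('A', 'Grocery Store'),
    ('B', 'Grocery Store'), ('B', 'Park'),
]

# count venues per category within each neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1


def category_share(neighborhood, category):
    """Fraction of a neighborhood's venues that fall in the given category."""
    total = sum(counts[neighborhood].values())
    return counts[neighborhood][category] / total


share_a = category_share('A', 'Coffee Shop')
share_b = category_share('B', 'Coffee Shop')
print(share_a, share_b)
```

A share of 0.0 therefore means that no venue of that category was returned within the search radius, not that the data is missing.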

In [75]:
num_top_venues = 5

for hood in chicago_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = chicago_grouped[chicago_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Albany Park----
           venue  freq
0  Grocery Store  0.06
1  Train Station  0.06
2    Coffee Shop  0.00


----Andersonville----
           venue  freq
0    Coffee Shop  0.04
1  Grocery Store  0.04
2  Train Station  0.00


----Arcadia Terrace----
           venue  freq
0    Coffee Shop  0.05
1  Grocery Store  0.00
2  Train Station  0.00


----Archer Heights----
           venue  freq
0  Grocery Store  0.06
1    Coffee Shop  0.00
2  Train Station  0.00


----Avalon Park----
           venue  freq
0    Coffee Shop   0.0
1  Grocery Store   0.0
2  Train Station   0.0


----Avondale----
           venue  freq
0    Coffee Shop   0.0
1  Grocery Store   0.0
2  Train Station   0.0


----Back of the Yards----
           venue  freq
0  Grocery Store  0.13
1    Coffee Shop  0.00
2  Train Station  0.00


----Belmont Heights----
           venue  freq
0    Coffee Shop   0.0
1  Grocery Store   0.0
2  Train Station   0.0


----Belmont Terrace----
           venue  freq
0    Coffee Shop   0.0
1 

In [76]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [77]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = chicago_grouped['Neighborhood']

for ind in np.arange(chicago_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(chicago_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Albany Park,Train Station,Grocery Store,Coffee Shop
1,Andersonville,Grocery Store,Coffee Shop,Train Station
2,Arcadia Terrace,Coffee Shop,Train Station,Grocery Store
3,Archer Heights,Grocery Store,Train Station,Coffee Shop
4,Avalon Park,Train Station,Grocery Store,Coffee Shop
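
The column names above rely on a three-element suffix list plus an exception fallback, which would label a 21st column as "21th" if num_top_venues were ever raised that far. A small sketch of a general suffix rule, with `ordinal` as a hypothetical helper name:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
print(columns)
```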


In [78]:
# set number of clusters
kclusters = 5

chicago_grouped_clustering = chicago_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(chicago_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 4, 4, 2, 0, 0, 2, 0, 0, 2], dtype=int32)
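
The value kclusters = 5 is chosen by hand. One common sanity check is the elbow method: fit k-means for several values of k and look for the point where the inertia stops dropping sharply. The sketch below is not part of the original analysis, and the feature rows are made up to resemble the three frequency columns.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up frequency rows (Coffee Shop, Grocery Store, Train Station), for illustration only
features = np.array([
    [0.00, 0.00, 0.00], [0.00, 0.01, 0.00],
    [0.05, 0.00, 0.00], [0.06, 0.01, 0.00],
    [0.00, 0.10, 0.00], [0.00, 0.12, 0.01],
    [0.00, 0.05, 0.06], [0.01, 0.04, 0.07],
])

k_values = range(1, 5)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    # inertia_ is the sum of squared distances from each row to its cluster center
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 5))
```

On the real data, chicago_grouped_clustering would take the place of `features`.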

In [None]:
#neighborhoods_venues_sorted.head()


#neighborhoods_venues_sorted.drop(['Cluster Labels'],axis=1,inplace=True)

#neighborhoods_venues_sorted.head()

In [98]:
# add clustering labels (guarded so that re-running the cell does not fail on an existing column)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_.astype(int))

chicago_merged = res

# join the cluster labels and top venues onto res, which holds each neighborhood's latitude/longitude
chicago_merged = chicago_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
chicago_merged.dropna(axis=0,inplace=True)


chicago_merged['Cluster Labels'] = chicago_merged['Cluster Labels'].astype(int)


chicago_merged.head(40)

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Albany Park,41.97,-87.7,1,Train Station,Grocery Store,Coffee Shop
2,Andersonville,41.97,-87.66,4,Grocery Store,Coffee Shop,Train Station
3,Arcadia Terrace,41.99,-87.7,4,Coffee Shop,Train Station,Grocery Store
4,Archer Heights,41.82,-87.69,2,Grocery Store,Train Station,Coffee Shop
7,Avalon Park,41.74,-87.61,0,Train Station,Grocery Store,Coffee Shop
8,Avondale,41.95,-87.7,0,Train Station,Grocery Store,Coffee Shop
9,Albany Park,41.97,-87.7,1,Train Station,Grocery Store,Coffee Shop
11,Andersonville,41.97,-87.66,4,Grocery Store,Coffee Shop,Train Station
12,Arcadia Terrace,41.99,-87.7,4,Coffee Shop,Train Station,Grocery Store
13,Archer Heights,41.82,-87.69,2,Grocery Store,Train Station,Coffee Shop


In [110]:
neighborhoods_venues_sorted_base = neighborhoods_venues_sorted[['Cluster Labels', 'Neighborhood']]
print(chicago_grouped)
chicago_merged_sum = pd.merge(neighborhoods_venues_sorted_base, chicago_grouped, on='Neighborhood')
chicago_merged_sum

                   Neighborhood  Coffee Shop  Grocery Store  Train Station
0                   Albany Park     0.000000       0.055556       0.055556
1                 Andersonville     0.037037       0.037037       0.000000
2               Arcadia Terrace     0.047619       0.000000       0.000000
3                Archer Heights     0.000000       0.055556       0.000000
4                   Avalon Park     0.000000       0.000000       0.000000
5                      Avondale     0.000000       0.000000       0.000000
6             Back of the Yards     0.000000       0.133333       0.000000
7               Belmont Heights     0.000000       0.000000       0.000000
8               Belmont Terrace     0.000000       0.000000       0.000000
9                  Beverly View     0.000000       0.083333       0.000000
10                Beverly Woods     0.000000       0.000000       0.000000
11                     Big Oaks     0.000000       0.000000       0.000000
12   Bohemian National Ce

Unnamed: 0,Cluster Labels,Neighborhood,Coffee Shop,Grocery Store,Train Station
0,1,Albany Park,0.0,0.055556,0.055556
1,4,Andersonville,0.037037,0.037037,0.0
2,4,Arcadia Terrace,0.047619,0.0,0.0
3,2,Archer Heights,0.0,0.055556,0.0
4,0,Avalon Park,0.0,0.0,0.0
5,0,Avondale,0.0,0.0,0.0
6,2,Back of the Yards,0.0,0.133333,0.0
7,0,Belmont Heights,0.0,0.0,0.0
8,0,Belmont Terrace,0.0,0.0,0.0
9,2,Beverly View,0.0,0.083333,0.0


In [111]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(chicago_merged['Latitude'], chicago_merged['Longitude'], chicago_merged['Neighborhood'], chicago_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [102]:
chicago_merged_sum.groupby('Cluster Labels')[['Coffee Shop', 'Grocery Store', 'Train Station']].sum()

Unnamed: 0_level_0,Coffee Shop,Grocery Store,Train Station
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.0,0.037037,0.0
1,0.152842,0.728488,1.140797
2,0.0,1.194118,0.0
3,0.333333,0.333333,0.0
4,1.205959,0.389133,0.022727
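
The sums above grow with the number of neighborhoods in a cluster, so a large cluster can look strong in a category simply because it has many members. A per-cluster mean describes the typical neighborhood in each cluster instead. The sketch below uses made-up rows in the same layout as chicago_merged_sum; `example` and `cluster_means` are illustrative names.

```python
import pandas as pd

# made-up rows in the same layout as chicago_merged_sum, for illustration only
example = pd.DataFrame({
    'Cluster Labels': [0, 0, 1, 1, 1],
    'Neighborhood': ['A', 'B', 'C', 'D', 'E'],
    'Coffee Shop': [0.0, 0.0, 0.04, 0.06, 0.05],
    'Grocery Store': [0.0, 0.02, 0.0, 0.0, 0.03],
    'Train Station': [0.0, 0.0, 0.0, 0.0, 0.0],
})

feature_columns = ['Coffee Shop', 'Grocery Store', 'Train Station']
# selecting the numeric columns first keeps the Neighborhood strings out of the aggregation
cluster_means = example.groupby('Cluster Labels')[feature_columns].mean()
print(cluster_means)
```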


In [103]:
chicago_merged.loc[chicago_merged['Cluster Labels'] == 0, chicago_merged.columns[[0] + list(range(4, chicago_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
7,Avalon Park,Train Station,Grocery Store,Coffee Shop
8,Avondale,Train Station,Grocery Store,Coffee Shop
16,Avalon Park,Train Station,Grocery Store,Coffee Shop
17,Avondale,Train Station,Grocery Store,Coffee Shop
21,Belmont Heights,Train Station,Grocery Store,Coffee Shop
22,Belmont Terrace,Train Station,Grocery Store,Coffee Shop
25,Beverly Woods,Train Station,Grocery Store,Coffee Shop
26,Big Oaks,Train Station,Grocery Store,Coffee Shop
36,Burnside,Train Station,Grocery Store,Coffee Shop
39,Chatham,Train Station,Grocery Store,Coffee Shop


In [105]:
chicago_merged.loc[chicago_merged['Cluster Labels'] == 1, chicago_merged.columns[[0] + list(range(4, chicago_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Albany Park,Train Station,Grocery Store,Coffee Shop
9,Albany Park,Train Station,Grocery Store,Coffee Shop
27,Bohemian National Cemetery,Train Station,Coffee Shop,Grocery Store
34,Budlong Woods,Train Station,Grocery Store,Coffee Shop
50,East Chicago,Train Station,Grocery Store,Coffee Shop
52,East Rogers Park,Train Station,Grocery Store,Coffee Shop
56,Edison Park,Train Station,Grocery Store,Coffee Shop
60,Forest Glen,Train Station,Coffee Shop,Grocery Store
84,Jeffery Manor,Train Station,Grocery Store,Coffee Shop
95,Lincoln Square,Train Station,Grocery Store,Coffee Shop


In [106]:
chicago_merged.loc[chicago_merged['Cluster Labels'] == 2, chicago_merged.columns[[0] + list(range(4, chicago_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
4,Archer Heights,Grocery Store,Train Station,Coffee Shop
13,Archer Heights,Grocery Store,Train Station,Coffee Shop
18,Back of the Yards,Grocery Store,Train Station,Coffee Shop
24,Beverly View,Grocery Store,Train Station,Coffee Shop
29,Brainerd,Grocery Store,Train Station,Coffee Shop
31,Brighton Park,Grocery Store,Train Station,Coffee Shop
42,Clearing,Grocery Store,Train Station,Coffee Shop
61,Fuller Park,Grocery Store,Train Station,Coffee Shop
65,Garfield Ridge,Grocery Store,Train Station,Coffee Shop
71,Gresham,Grocery Store,Train Station,Coffee Shop


In [107]:
chicago_merged.loc[chicago_merged['Cluster Labels'] == 3, chicago_merged.columns[[0] + list(range(4, chicago_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
54,Edgebrook,Grocery Store,Coffee Shop,Train Station
130,Peterson Park Grounds,Grocery Store,Coffee Shop,Train Station
157,South Edgebrook,Grocery Store,Coffee Shop,Train Station
189,Wildwood,Grocery Store,Coffee Shop,Train Station


In [108]:
chicago_merged.loc[chicago_merged['Cluster Labels'] == 4, chicago_merged.columns[[0] + list(range(4, chicago_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
2,Andersonville,Grocery Store,Coffee Shop,Train Station
3,Arcadia Terrace,Coffee Shop,Train Station,Grocery Store
11,Andersonville,Grocery Store,Coffee Shop,Train Station
12,Arcadia Terrace,Coffee Shop,Train Station,Grocery Store
37,Cabrini Green,Coffee Shop,Train Station,Grocery Store
46,DePaul,Coffee Shop,Train Station,Grocery Store
53,East Village,Grocery Store,Coffee Shop,Train Station
68,Graceland Cemetery,Coffee Shop,Grocery Store,Train Station
69,Graceland West,Coffee Shop,Grocery Store,Train Station
75,Hollywood Park,Coffee Shop,Train Station,Grocery Store
