<h1 align=center><font size = 5>Best Neighborhood for a new Shopping Mall</font></h1>

## Introduction

In this lab, you will learn how to convert addresses into their equivalent latitude and longitude 
values. Also, you will use the Foursquare API to explore neighborhoods in Chicago City. You will 
use the **explore** function to get the number of shopping malls in each neighborhood, and then use
this feature to group the neighborhoods into clusters and we have also taken into consideration 
per capita income of neighborhoods. You will use the *k*-means clustering algorithm to complete
this task. Finally, you will use the Folium library to visualize the neighborhoods in Chicago City
and examine, analyse the cluster and select the best set of neighborhoods to open a new Shopping Mall.


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Chicago City</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

Before we start exploring the data, let's download all the libraries we need to build the model

In [15]:
import pandas as pd  # library for data analsysis
import numpy as np  # library to handle data in a vectorized manner

# map rendering library
import folium

# convert an address into latitude and longitude values
!pip install geopy
from geopy.geocoders import Nominatim 

import requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import urllib.request

# for web scrapping data from web pages
!pip install bs4
from bs4 import BeautifulSoup
!conda install -c anaconda lxml --yes

#to get the latitude and longitude values for neighborhoods 
!pip install geocoder  
import geocoder  

Collecting geopy
[?25l  Downloading https://files.pythonhosted.org/packages/07/e1/9c72de674d5c2b8fcb0738a5ceeb5424941fefa080bfe4e240d0bacb5a38/geopy-2.0.0-py3-none-any.whl (111kB)
[K     |████████████████████████████████| 112kB 6.1MB/s eta 0:00:01
[?25hCollecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.0.0
Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
[?25l  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
[K     |████████████████████████████████| 122kB 3.3MB/s eta 0:00:01
[

## 1. Download and Explore Dataset

Apparently it's was not so easy to collect the data of Chicago City,beacause there was no particular website which had all the data. So,I have collected data from two different websites,one
with the data having per capita income with community area names and the other having the data  of neighborhoods. I used Community Area Names as common column to combine the data from both the  websites.

Dowloading Data from the web page "http://www.chicagocomputerclasses.com/average-city-chicago-income/"

In [16]:
url = "http://www.chicagocomputerclasses.com/average-city-chicago-income/"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

We use Beautiful soup library to copy the data and then convert to a DataFrame

In [3]:
right_table=soup.find('table')
A=[]
B=[]
for row in right_table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==2:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
A=A[2:]
B=B[2:]
per_capita_income = pd.DataFrame({"Community_Area_Name":A,"Per_Capita_Income":B})
per_capita_income['Neighborhood'] = None
per_capita_income

Unnamed: 0,Community_Area_Name,Per_Capita_Income,Neighborhood
0,Near North Side,"$88,669.00",
1,Lincoln Park,"$71,551.00",
2,Loop,"$65,526.00",
3,Lake View,"$60,058.00",
4,Near South Side,"$59,077.00",
...,...,...,...
73,West Englewood,"$11,317.00",
74,West Garfield Park,"$10,934.00",
75,Fuller Park,"$10,432.00",
76,South Lawndale,"$10,402.00",


Copying Data from the web page "https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Chicago"

In [4]:
url = "https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Chicago"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

We use Beautiful soup library to copy the data and then convert to a DataFrame

In [5]:
right_table=soup.find('table', class_='wikitable sortable')
A=[]
B=[]
for row in right_table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==2:
        A.append(cells[0].find(text=True).rstrip('\n'))
        B.append(cells[1].find(text=True).rstrip('\n'))
neighborhood_df = pd.DataFrame({"Neighborhood":A,"Community_Area_Name":B}) 
neighborhood_df

Unnamed: 0,Neighborhood,Community_Area_Name
0,Albany Park,Albany Park
1,Altgeld Gardens,Riverdale
2,Andersonville,Edgewater
3,Archer Heights,Archer Heights
4,Armour Square,Armour Square
...,...,...
241,Wildwood,Forest Glen
242,Woodlawn,Woodlawn
243,Wrightwood,Ashburn
244,Wrightwood Neighbors,Lincoln Park


Combining two table for the neighborhoods and per capita income to get together in a single table

In [6]:
#assigning neighborhood values to the per capita income table
for row in per_capita_income.iterrows():
    for col in neighborhood_df.iterrows():
        if row[1][0] == col[1][1] :
            row[1][2] = col[1][0]
            
#switching the columns
fixed_cols = list(per_capita_income.columns[-1:]) + list(per_capita_income.columns[:-1])
per_capita_income = per_capita_income[fixed_cols]
per_capita_income

Unnamed: 0,Community_Area_Name,Per_Capita_Income,Neighborhood
0,Near North Side,"$88,669.00",Streeterville
1,Lincoln Park,"$71,551.00",Wrightwood Neighbors
2,Loop,"$65,526.00",
3,Lake View,"$60,058.00",Wrigleyville
4,Near South Side,"$59,077.00",Prairie Avenue Historic District


I used Geocode function to get the Latitude and Longitude values for the Neighborhoods 

In [8]:
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Chicago'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

# Call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in per_capita_income["Neighborhood"].tolist()]

We put the latitude and longitude values in a data frame and now data is ready to be explored !

In [9]:
per_capita_income['Latitude'] = (0)
per_capita_income['Longitude'] = (0)
per_capita_income[['Latitude','Longitude']] = coords
per_capita_income

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using

Unnamed: 0,Neighborhood,Community_Area_Name,Per_Capita_Income,Latitude,Longitude
0,Streeterville,Near North Side,"$88,669.00",41.898430,-87.621410
1,Wrightwood Neighbors,Lincoln Park,"$71,551.00",41.928979,-87.656190
2,,Loop,"$65,526.00",41.884250,-87.632450
3,Wrigleyville,Lake View,"$60,058.00",41.947250,-87.653200
4,Prairie Avenue Historic District,Near South Side,"$59,077.00",41.856420,-87.620880
...,...,...,...,...,...
73,West Englewood,West Englewood,"$11,317.00",41.777580,-87.667260
74,West Garfield Park,West Garfield Park,"$10,934.00",41.877020,-87.730740
75,Fuller Park,Fuller Park,"$10,432.00",41.812530,-87.632620
76,South Lawndale,South Lawndale,"$10,402.00",41.848013,-87.717330


We use geolocator to get the latitude and longitude values of Chicago City

In [10]:
address = "Chicago"
geolocator = Nominatim(user_agent="ch-explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("The Coordinates of {} are {},{}".format(address,latitude,longitude))

The Coordinates of Chicago are 41.8755616,-87.6244212


Now we use folium to plot all the neighborhoods on the map of Chicago

In [11]:
map_chicago = folium.Map(location = [latitude,longitude],zoom_start=10)

for lat,lng,label in zip(per_capita_income['Latitude'],per_capita_income['Longitude'],per_capita_income['Neighborhood']):
    label=folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_capacity=0.7).add_to(map_chicago)
    
map_chicago

## 2. Explore Neighborhoods in Manhattan

Here are your FourSquare credentials which we'll be using to get the venues around the neighborhoods

In [12]:
CLIENT_ID =  'BLR24K5O3BY4RUZ5ZQPRHRRO2UU41JRTWWO1L2LINL2AIS3U'
CLIENT_SECRET = 'CLU4HNVW1WYE1TREVETMMY2BOILHYGCPRLL1DOQHHHX24V1N'
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: BLR24K5O3BY4RUZ5ZQPRHRRO2UU41JRTWWO1L2LINL2AIS3U
CLIENT_SECRET:CLU4HNVW1WYE1TREVETMMY2BOILHYGCPRLL1DOQHHHX24V1N


We get different venues around the neighborhood by making API call to FourSquare and convert them into a DataFrame 

In [13]:
RADIUS = 2000
LIMIT = 100

venues = []

for lat,lng,pci,neighborhood in zip(per_capita_income['Latitude'],per_capita_income['Longitude'],per_capita_income['Per_Capita_Income'],per_capita_income['Neighborhood']):
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID,CLIENT_SECRET,VERSION,lat,lng,RADIUS,LIMIT)
    results = requests.get(url).json()['response']['groups'][0]['items']
    for venue in results:
        venues.append((neighborhood,
                     pci,
                     lat,
                     lng,
                     venue['venue']['name'],
                     venue['venue']['location']['lat'],
                     venue['venue']['location']['lng'],
                     venue['venue']['categories'][0]['name']))
venues_df = pd.DataFrame(venues)
venues_df.columns = ['Neighborhood','PCI','Latitude','Longitude','VenueName','VenueLatiude','VenueLongitude','VenueCategory']
venues_df

Unnamed: 0,Neighborhood,PCI,Latitude,Longitude,VenueName,VenueLatiude,VenueLongitude,VenueCategory
0,Streeterville,"$88,669.00",41.898430,-87.621410,360 CHICAGO,41.898642,-87.622758,Scenic Lookout
1,Streeterville,"$88,669.00",41.898430,-87.621410,Marisol,41.897420,-87.621284,Restaurant
2,Streeterville,"$88,669.00",41.898430,-87.621410,The LEGO Store,41.898087,-87.622788,Toy / Game Store
3,Streeterville,"$88,669.00",41.898430,-87.621410,Broadway Playhouse,41.898475,-87.622678,Performing Arts Venue
4,Streeterville,"$88,669.00",41.898430,-87.621410,Cafecito,41.898344,-87.621274,Cuban Restaurant
...,...,...,...,...,...,...,...,...
6256,Riverdale,"$8,201.00",41.653846,-87.609655,142nd Railroad Tracks,41.637117,-87.611708,Light Rail Station
6257,Riverdale,"$8,201.00",41.653846,-87.609655,Currency Exchange,41.636696,-87.608824,Currency Exchange
6258,Riverdale,"$8,201.00",41.653846,-87.609655,Illinois International Port,41.665959,-87.593030,Pier
6259,Riverdale,"$8,201.00",41.653846,-87.609655,Rene's Pizza,41.644330,-87.629159,Pizza Place


We just check how many venues have been returned for each neighborhood

In [14]:
venues_df.groupby(['Neighborhood','PCI','Latitude','Longitude']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,VenueName,VenueLatiude,VenueLongitude,VenueCategory
Neighborhood,PCI,Latitude,Longitude,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Archer Heights,"$16,134.00",41.811540,-87.72556,100,100,100,100
Bridgeport,"$22,694.00",41.837980,-87.65090,100,100,100,100
Brighton Park,"$13,089.00",41.818610,-87.69948,76,76,76,76
Burnside,"$12,515.00",41.729440,-87.59768,88,88,88,88
Clearing West,"$25,113.00",41.778090,-87.75978,94,94,94,94
...,...,...,...,...,...,...,...
Wildwood,"$44,164.00",42.001350,-87.77537,100,100,100,100
Woodlawn,"$18,672.00",41.780460,-87.60135,89,89,89,89
Wrightwood,"$23,482.00",41.928979,-87.65619,100,100,100,100
Wrightwood Neighbors,"$71,551.00",41.928979,-87.65619,100,100,100,100


Cheking how many different categories of venues are there and printing a few of them

In [15]:
print('There are {} number of unique categories of venues'.format(len(venues_df['VenueCategory'].unique())))
venues_df['VenueCategory'].unique()[0:50]

There are 348 number of unique categories of venues


array(['Scenic Lookout', 'Restaurant', 'Toy / Game Store',
       'Performing Arts Venue', 'Cuban Restaurant', 'Breakfast Spot',
       'Art Museum', 'Hotel', 'Boutique', 'Park', 'Salon / Barbershop',
       'Sporting Goods Shop', 'Historic Site', 'American Restaurant',
       'Shopping Mall', 'Clothing Store', 'Lingerie Store',
       'New American Restaurant', 'Beach', 'Bakery', 'Donut Shop',
       'Jewelry Store', 'Spa', 'Vietnamese Restaurant', 'Cupcake Shop',
       'Café', 'Yoga Studio', "Women's Store", 'Pizza Place',
       'Mediterranean Restaurant', 'Coffee Shop', 'Snack Place',
       'Cosmetics Shop', 'Italian Restaurant', 'Steakhouse',
       'Seafood Restaurant', 'Japanese Restaurant', 'Grocery Store',
       'Cycle Studio', 'Mexican Restaurant', 'Gourmet Shop', 'Juice Bar',
       'Gym / Fitness Center', 'Cocktail Bar', 'Salad Place',
       'Frozen Yogurt Shop', 'Asian Restaurant', 'Burger Joint',
       'Bike Rental / Bike Share', 'Department Store'], dtype=object)

## 3. Analyze Each Neighborhood

In [16]:
#one hot encoding
chicago_onehot = pd.get_dummies(venues_df['VenueCategory'],prefix="",prefix_sep="")

#adding neighborhood and PCI columns
chicago_onehot['Neighborhoods'] = venues_df['Neighborhood']
chicago_onehot['PCI'] = venues_df['PCI']

#moving neighborhood and PCI columns to the front
fixed_cols = list(chicago_onehot.columns[-2:]) + list(chicago_onehot.columns[:-2])
chicago_onehot = chicago_onehot[fixed_cols]

#checking out the table
print(chicago_onehot.shape)
chicago_onehot

(6261, 350)


Unnamed: 0,Neighborhoods,PCI,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Service,Airport Terminal,...,Vietnamese Restaurant,Warehouse Store,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Streeterville,"$88,669.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Streeterville,"$88,669.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Streeterville,"$88,669.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Streeterville,"$88,669.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Streeterville,"$88,669.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6256,Riverdale,"$8,201.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6257,Riverdale,"$8,201.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6258,Riverdale,"$8,201.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6259,Riverdale,"$8,201.00",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [17]:
#calculation mean of each category for each neighborhood 
chicago_grouped = chicago_onehot.groupby(['Neighborhoods','PCI']).mean().reset_index()

#checking out the table
print(chicago_grouped.shape)
chicago_grouped

(71, 350)


Unnamed: 0,Neighborhoods,PCI,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Service,Airport Terminal,...,Vietnamese Restaurant,Warehouse Store,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Archer Heights,"$16,134.00",0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,...,0.00,0.00,0.0,0.0,0.00,0.0,0.000000,0.010000,0.000000,0.000000
1,Bridgeport,"$22,694.00",0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,...,0.00,0.00,0.0,0.0,0.00,0.0,0.000000,0.020000,0.000000,0.000000
2,Brighton Park,"$13,089.00",0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,...,0.00,0.00,0.0,0.0,0.00,0.0,0.000000,0.000000,0.000000,0.000000
3,Burnside,"$12,515.00",0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,...,0.00,0.00,0.0,0.0,0.00,0.0,0.000000,0.022727,0.011364,0.000000
4,Clearing West,"$25,113.00",0.0,0.010638,0.0,0.0,0.010638,0.010638,0.021277,0.0,...,0.00,0.00,0.0,0.0,0.00,0.0,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66,Wildwood,"$44,164.00",0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,...,0.01,0.01,0.0,0.0,0.00,0.0,0.000000,0.010000,0.000000,0.000000
67,Woodlawn,"$18,672.00",0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,...,0.00,0.00,0.0,0.0,0.00,0.0,0.011236,0.000000,0.000000,0.011236
68,Wrightwood,"$23,482.00",0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,...,0.00,0.00,0.0,0.0,0.02,0.0,0.000000,0.000000,0.000000,0.020000
69,Wrightwood Neighbors,"$71,551.00",0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,...,0.00,0.00,0.0,0.0,0.02,0.0,0.000000,0.000000,0.000000,0.020000


Checking number of shopping malls in Chigago City

In [18]:
len(chicago_grouped[chicago_grouped["Shopping Mall"]>0])

10

Converting PCI column to numeric for KMeans to work without any error

In [53]:
#We only consider shopping mall of all the categories beacause that is all we want
chicago_mall = chicago_grouped[['Neighborhoods','PCI','Shopping Mall']]

#removes '&' and ',' ,wherever found
chicago_mall = chicago_mall.replace('\$','',regex=True)
chicago_mall = chicago_mall.replace('\,','',regex=True)

#convert PCI to numeric to rum KMeans
chicago_mall[['PCI']] = chicago_mall[['PCI']].apply(pd.to_numeric)

chicago_mall.head()

Unnamed: 0,Neighborhoods,PCI,Shopping Mall
0,Archer Heights,16134.0,0.0
1,Bridgeport,22694.0,0.0
2,Brighton Park,13089.0,0.0
3,Burnside,12515.0,0.0
4,Clearing West,25113.0,0.0


## 4. Cluster Neighborhoods

Running k-means to cluster the neighborhoods into 3 clusters

In [54]:
#initialize number of clusters to 3
k_clusters = 3

#drop the neighborhood column
chicago_clustering = chicago_mall.drop(['Neighborhoods'],1)

#run k-means
kmeans = KMeans(n_clusters=k_clusters,random_state=0).fit(chicago_clustering)

#checkout the labels
kmeans.labels_[0:10]

array([2, 2, 2, 2, 2, 2, 1, 2, 2, 2], dtype=int32)

Copying Neighborhood, PCI, Shopping Malls, Cluster Labels into a new data frame

In [57]:
# Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
chicago_merged = chicago_mall.copy()

# Add the clustering labels
chicago_merged["Cluster Labels"] = kmeans.labels_
chicago_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
chicago_merged

Unnamed: 0,Neighborhood,PCI,Shopping Mall,Cluster Labels
0,Archer Heights,16134.0,0.00,2
1,Bridgeport,22694.0,0.00,2
2,Brighton Park,13089.0,0.00,2
3,Burnside,12515.0,0.00,2
4,Clearing West,25113.0,0.00,2
...,...,...,...,...
66,Wildwood,44164.0,0.01,1
67,Woodlawn,18672.0,0.00,2
68,Wrightwood,23482.0,0.00,2
69,Wrightwood Neighbors,71551.0,0.00,0


Sorting the table with respect to Cluster Labels

In [61]:
#copying latitude and longitude values
chicago_merged['Latitude'] = venues_df['Latitude']
chicago_merged['Longitude'] = venues_df['Longitude']

#sort the table wrt cluster labels
chicago_merged.sort_values(['Cluster Labels'], inplace=True)

#print the table
chicago_merged

Unnamed: 0,Neighborhood,PCI,Shopping Mall,Cluster Labels,Latitude,Longitude
70,Wrigleyville,60058.0,0.000000,0,41.89843,-87.62141
30,Prairie Avenue Historic District,59077.0,0.000000,0,41.89843,-87.62141
69,Wrightwood Neighbors,71551.0,0.000000,0,41.89843,-87.62141
46,Streeterville,88669.0,0.010000,0,41.89843,-87.62141
38,Saint Ben's,57123.0,0.000000,0,41.89843,-87.62141
...,...,...,...,...,...,...
37,Rosemoor,17949.0,0.011628,2,41.89843,-87.62141
7,Fifth City,12961.0,0.000000,2,41.89843,-87.62141
21,New City,12765.0,0.000000,2,41.89843,-87.62141
63,West Pullman,16563.0,0.019608,2,41.89843,-87.62141


Visualizing the clusters of neighborhoods

In [67]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k_clusters)
ys = [i + x + (i*x)**2 for i in range(k_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
#markers_colors = []
for lat, lon, poi, cluster in zip(chicago_merged['Latitude'], chicago_merged['Longitude'], chicago_merged['Neighborhood'], chicago_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    #print("!")
       
map_clusters

## 5. Examine Clusters

Now, you can examine each cluster and determine which set of neighborhoods is better suited for a new shopping mall.  

In [69]:
#checking the number of neighborhoods in each cluster
print(len(chicago_merged.loc[chicago_merged['Cluster Labels'] == 0]))
print(len(chicago_merged.loc[chicago_merged['Cluster Labels'] == 1]))
print(len(chicago_merged.loc[chicago_merged['Cluster Labels'] == 2]))

5
19
47


**CLUSTER 1**

In [70]:
chicago_merged.loc[chicago_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,PCI,Shopping Mall,Cluster Labels,Latitude,Longitude
70,Wrigleyville,60058.0,0.0,0,41.89843,-87.62141
30,Prairie Avenue Historic District,59077.0,0.0,0,41.89843,-87.62141
69,Wrightwood Neighbors,71551.0,0.0,0,41.89843,-87.62141
46,Streeterville,88669.0,0.01,0,41.89843,-87.62141
38,Saint Ben's,57123.0,0.0,0,41.89843,-87.62141


**CLUSTER 2**

In [71]:
chicago_merged.loc[chicago_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,PCI,Shopping Mall,Cluster Labels,Latitude,Longitude
17,Lakewood / Balmoral,33385.0,0.096774,1,41.89843,-87.62141
50,Vittum Park,26353.0,0.0,1,41.89843,-87.62141
27,Pill Hill,28887.0,0.0,1,41.89843,-87.62141
22,North Kenwood,35911.0,0.0,1,41.89843,-87.62141
48,Union Ridge,32875.0,0.0,1,41.89843,-87.62141
39,Schorsch Village,26282.0,0.02,1,41.89843,-87.62141
54,West Beverly,39523.0,0.016393,1,41.89843,-87.62141
47,The Villa,27249.0,0.0,1,41.89843,-87.62141
14,Hyde Park,39056.0,0.0,1,41.89843,-87.62141
65,Wicker Park,43198.0,0.0,1,41.89843,-87.62141


**CLUSTER 3**

In [72]:
chicago_merged.loc[chicago_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,PCI,Shopping Mall,Cluster Labels,Latitude,Longitude
58,West Garfield Park,10934.0,0.0,2,41.89843,-87.62141
57,West Englewood,11317.0,0.0,2,41.89843,-87.62141
56,West Elsdon,15754.0,0.0,2,41.89843,-87.62141
44,Stateway Gardens,23791.0,0.0,2,41.89843,-87.62141
52,Washington Park,13785.0,0.0,2,41.89843,-87.62141
53,Wentworth Gardens,16148.0,0.0,2,41.89843,-87.62141
59,West Humboldt Park,15957.0,0.0,2,41.89843,-87.62141
43,South Shore,19398.0,0.0,2,41.89843,-87.62141
55,West Chesterfield,18881.0,0.0,2,41.89843,-87.62141
60,West Lawn,16907.0,0.0,2,41.89843,-87.62141


In [14]:
temp  = chicago_onehot[['Neighborhoods','PCI','Shopping Mall']]
temp.groupby(['Neighborhoods','PCI']).sum(1)


NameError: name 'chicago_onehot' is not defined