# Problem Statement


Use clustering to find neighbourhoods that are similar to a business's current location, in order to identify the most suitable place to open another branch.

# Introduction



Imagine you run a shop or another business on the outskirts of New York City (in our case Little Neck, Queens). The business does well in its current neighbourhood because of the geospatial features in its proximity, such as other businesses, parks and offices.

Now you wish to open another branch, for instance in Manhattan. How do you decide which neighbourhood in Manhattan would be most suitable, so that the new branch can thrive as much as the current one? One way to approach the problem is to find the neighbourhoods in Manhattan that are similar to your current neighbourhood.

# Data 

So how does one decide which neighborhood is similar to your current neighbourhood? This is where data science comes in. 

We will use the New York City JSON dataset, which is available on the IBM Developer Skills Network (https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json). This dataset contains the neighbourhoods of New York City along with their latitude and longitude.

Using the latitude and longitude coordinates from this dataset, we will leverage the Foursquare API to explore each neighbourhood and find the most commonly occurring venue categories in its vicinity. These venue details will serve as the feature vector that we fit with a clustering algorithm such as K-means or DBSCAN.

A clustering algorithm groups neighbourhoods of a similar type based on the feature set (in our case, the most common venues in the vicinity) and assigns each one a cluster label.

From that point on, the problem reduces to identifying the neighbourhoods in Manhattan that belong to the same cluster as our current neighbourhood. These are the neighbourhoods most suitable for our new branch, since their mix of nearby venues resembles that of the current branch.

# Methodology

As stated in the Data section, we use the New York City JSON file, which contains the name, latitude and longitude of each neighbourhood in New York City. We combine this geospatial data with the Foursquare API to explore each neighbourhood and find its most common venue types. The prominence of each venue type forms our feature set.

We then feed this feature set to a clustering algorithm, which groups the locations according to how prominent similar types of venues are in each neighbourhood's vicinity.
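The approach can be tried on a toy dataset before touching the real data. The sketch below is a minimal illustration under stated assumptions: the four neighbourhoods (`Home`, `A`, `B`, `C`) and their venue lists are invented, and two clusters are used only because the toy data has two obvious groups.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue listings: one row per venue found near a neighbourhood.
venues = pd.DataFrame({
    'Neighborhood': ['Home'] * 4 + ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
    'Venue Category': [
        'Deli', 'Deli', 'Bank', 'Bakery',      # Home (our current location)
        'Deli', 'Deli', 'Deli', 'Bank',        # A
        'Bar', 'Bar', 'Theater', 'Hotel',      # B
        'Bar', 'Theater', 'Theater', 'Hotel',  # C
    ],
})

# One-hot encode the categories, then average per neighbourhood to get
# the share of each venue category in that neighbourhood.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
freq = onehot.groupby(venues['Neighborhood']).mean()

# Cluster the frequency vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(freq)
clusters = pd.Series(labels, index=freq.index)

# Candidates are the other neighbourhoods that share the cluster of 'Home'.
same_cluster = clusters == clusters['Home']
candidates = clusters[same_cluster & (clusters.index != 'Home')].index.tolist()
print(candidates)
```

Here `A` ends up with `Home` because both are dominated by delis and banks, while `B` and `C` form a separate nightlife-style group.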

## Exploratory Data Analysis

Let us load the dataset (the New York City JSON file downloaded from the IBM Skills Network through the link in the Data section) and visualise all the neighbourhoods:

In [1]:
import json # library to handle JSON files

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

A quick look into the file:

In [2]:
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

The inspection above shows that the 'features' key holds the information about the neighbourhoods, so we extract it:

In [3]:
neighborhoods_data = newyork_data['features']

Now let us load this data into a Pandas data frame for further analysis. Our dataframe will contain the columns Borough, Neighbourhood, Latitude and Longitude.

In [4]:
import pandas as pd 
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Let us loop through the dataset and load the data into the dataframe

In [5]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)
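GeoJSON stores point coordinates as `[longitude, latitude]`, which is why the loop above reads index 1 for the latitude and index 0 for the longitude. A small helper can make that order explicit and reject values outside the valid ranges. This is a minimal sketch; the function name `point_to_latlon` is hypothetical and not part of the original notebook.

```python
def point_to_latlon(feature):
    """Return (latitude, longitude) for a GeoJSON Point feature."""
    lon, lat = feature['geometry']['coordinates'][:2]
    # Valid ranges: latitude in [-90, 90], longitude in [-180, 180].
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError('coordinates out of range: {}'.format((lon, lat)))
    return lat, lon


# Sample feature taken from the dataset preview shown earlier.
wakefield = {
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}
print(point_to_latlon(wakefield))
```

Note that a range check alone cannot detect swapped values for New York, because both numbers remain valid in either position.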

Let us check the number of Neighbourhoods and Boroughs in the dataframe

In [6]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


Let us visualise the neighbourhoods on the map.

For our business case, our location is Little Neck, Queens, which will be highlighted in red (with an orange fill); all other locations will be highlighted in blue.

In [7]:
import requests

import matplotlib.cm as cm
import matplotlib.colors as colors

import folium # map rendering library


In [8]:
# Our shop latitude and longitude
our_neighbourhood = 'Little Neck'
our_borough = "Queens"

our_latitude = neighborhoods.loc[neighborhoods['Neighborhood']==our_neighbourhood]['Latitude'].values[0]
our_longitude = neighborhoods.loc[neighborhoods['Neighborhood']==our_neighbourhood]['Longitude'].values[0]

print('The geographical coordinates of our neighbourhood - {}, {} are ({}, {}).'.format(our_neighbourhood,our_borough, our_latitude, our_longitude))


The geographical coordinates of our neighbourhood - Little Neck, Queens are (40.7708261928267, -73.7388977558074).


In [9]:
# New York City coordinates
latitude = 40.7127281
longitude = -74.0060152

# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    if neighborhood == our_neighbourhood:
        # our current location: red outline with an orange fill
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='red',
            fill=True,
            fill_color='#cc9331',
            fill_opacity=0.7,
            parse_html=False).add_to(map_newyork)
    else:
        # every other neighbourhood: blue marker
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_newyork)


map_newyork

In the above map all the neighbourhoods in New York City are shown in blue, while our current neighbourhood is shown in red.


Now let us look at all the possible neighbourhoods in Manhattan where we wish to open a new branch.

In [10]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


The number of possible neighbourhoods is:

In [11]:
manhattan_data.count()

Borough         40
Neighborhood    40
Latitude        40
Longitude       40
dtype: int64

There are 40 possible locations. Let us visualise all of them on a map

In [12]:
# Manhattan coordinates
manhattan_lat=40.7896239
manhattan_long=-73.9598939

# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[manhattan_lat, manhattan_long], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  

our_label=our_neighbourhood
our_label = folium.Popup(our_label, parse_html=True)
folium.CircleMarker(
    [our_latitude, our_longitude],
    radius=5,
    popup=our_label,
    color='red',
    fill=True,
    fill_color='#cc9331',
    fill_opacity=0.7,
    parse_html=False).add_to(map_manhattan)    
map_manhattan

In the above map the red dot is our current shop location, and the blue dots mark the 40 neighbourhoods in Manhattan where we could open a new branch. We now need to decide which of the blue neighbourhoods are most similar to our current red location.

To do so, we find the most prominent venues in each neighbourhood. These form the feature set used to group the neighbourhoods into clusters, which in turn lets us identify the Manhattan neighbourhoods most similar to our current neighbourhood of Little Neck.

Let us prepare a dataframe with all the Manhattan neighbourhoods plus our own location so that we can build the feature set.

In [19]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds our neighbourhood's row instead
manhattan_data = pd.concat([manhattan_data, neighborhoods.loc[neighborhoods['Neighborhood'] == our_neighbourhood]])

In [21]:
manhattan_data.reset_index(drop=True, inplace=True)
manhattan_data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688
5,Manhattan,Manhattanville,40.816934,-73.957385
6,Manhattan,Central Harlem,40.815976,-73.943211
7,Manhattan,East Harlem,40.792249,-73.944182
8,Manhattan,Upper East Side,40.775639,-73.960508
9,Manhattan,Yorkville,40.77593,-73.947118


We will use this dataset to build the feature set for clustering.

## Feature Set Construction

Foursquare Credentials and Version


In [22]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


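Let us explore the neighbourhoods. The function below queries the Foursquare API for up to LIMIT venues around each neighbourhood's coordinates and collects the name, location and category of every venue returned.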

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
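The function above indexes straight into the response, so a payload without a `response` key or a venue without any category would raise an exception partway through the loop. The sketch below shows one defensive way to unpack a payload. It assumes the same nesting that `getNearbyVenues` relies on; the helper name `extract_venues` and the sample payload are hypothetical and hand-written, not real API output.

```python
def extract_venues(payload):
    """Turn one explore-style response into (name, lat, lng, category) tuples.

    Assumes the nesting used in getNearbyVenues:
    payload['response']['groups'][0]['items'][i]['venue'].
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Venues without a category are kept, with None as the label.
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written payload for illustration only (not a real API response).
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 40.77, 'lng': -73.74},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled', 'location': {'lat': 40.78, 'lng': -73.75},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({}))  # payload without a 'response' key gives an empty list
```

Whether to keep or drop venues without a category is a design choice; dropping them would keep them out of the one-hot encoding step entirely.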

In [24]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards
Little Neck


Let us have a look at the venues in the neighbourhoods

In [25]:
manhattan_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.910660,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.910660,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.910660,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.910660,Dunkin',40.877136,-73.906666,Donut Shop
4,Marble Hill,40.876551,-73.910660,Starbucks,40.877531,-73.905582,Coffee Shop
...,...,...,...,...,...,...,...
3298,Little Neck,40.770826,-73.738898,Emily's Skin Care & Spa,40.772374,-73.734498,Spa
3299,Little Neck,40.770826,-73.738898,Allon Vision,40.766915,-73.738592,Doctor's Office
3300,Little Neck,40.770826,-73.738898,Deli & Grocery,40.773990,-73.742127,Deli / Bodega
3301,Little Neck,40.770826,-73.738898,Little Neck Cafe & Deli,40.774093,-73.742262,Deli / Bodega


Let us one-hot encode the venue categories so that each venue becomes a row of 0/1 indicators, with one column per category.

In [26]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

#adding neighbourhood name back to the data frame 
manhattan_onehot['Neighborhood']= manhattan_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

In [27]:
manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let us now group the above dataframe by neighbourhood, as it currently has one row per venue. Taking the mean of each one-hot column gives the share of a neighbourhood's venues that fall in that category.

In [28]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.01087,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01087,0.0,0.021739,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.011628,0.0,0.0,0.011628,0.0,...,0.0,0.011628,0.0,0.0,0.0,0.011628,0.046512,0.0,0.011628,0.034884
2,Central Harlem,0.0,0.0,0.0,0.065217,0.043478,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.0
4,Chinatown,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01
5,Civic Center,0.0,0.0,0.0,0.0,0.04,0.01,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.01,0.01,0.02,0.01,0.0,0.03
6,Clinton,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.0,0.0
7,East Harlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,East Village,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.01,...,0.0,0.03,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.0
9,Financial District,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0
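Taking the mean of one-hot rows per neighbourhood is the same as computing each category's share of the venues found there. The sketch below reproduces that arithmetic with the standard library on a made-up list of categories; the data is hypothetical.

```python
from collections import Counter

# Hypothetical venue categories returned for one neighbourhood.
categories = ['Coffee Shop', 'Pizza Place', 'Coffee Shop', 'Park']

counts = Counter(categories)
total = len(categories)

# Share of each category: its count divided by the number of venues.
frequencies = {name: count / total for name, count in counts.items()}
print(frequencies)
```

A value such as 0.04 in the table above therefore means that 4% of the venues returned for that neighbourhood belong to that category.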


Let us restrict each neighbourhood to its 10 most common venue categories.

In [31]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [32]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions beyond the 3rd take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Park,Coffee Shop,Clothing Store,Hotel,Gym,Boat or Ferry,Playground,Memorial Site,Shopping Mall,Pizza Place
1,Carnegie Hill,Coffee Shop,Café,Wine Shop,Yoga Studio,Cosmetics Shop,Gym / Fitness Center,Bookstore,French Restaurant,Gym,Bar
2,Central Harlem,African Restaurant,Seafood Restaurant,American Restaurant,Gym / Fitness Center,Chinese Restaurant,French Restaurant,Art Gallery,Bar,Public Art,Music Venue
3,Chelsea,Coffee Shop,Bakery,Art Gallery,Ice Cream Shop,Hotel,American Restaurant,Wine Shop,French Restaurant,Seafood Restaurant,Market
4,Chinatown,Chinese Restaurant,Bakery,Cocktail Bar,American Restaurant,Salon / Barbershop,Dessert Shop,Mexican Restaurant,Bubble Tea Shop,Hotpot Restaurant,Ice Cream Shop
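The column-naming loop above works for the ten columns used here, but it would label a 21st column as "21th" because every position past the third falls back to 'th'. If more columns were ever needed, a general ordinal helper could be used instead. This is a minimal sketch; the function `ordinal` is hypothetical and not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns[1], '|', columns[10])
```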


In [34]:
neighborhoods_venues_sorted.shape

(41, 11)

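The per-neighbourhood frequency table (manhattan_grouped) forms the feature set on which we will run the clustering algorithm; the top-10 table above (41 neighbourhoods, 11 columns) makes each neighbourhood's profile easier to read.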

## Machine Learning 

We now have a feature set that describes the most prominent venues in the vicinity of each neighbourhood. We will fit a clustering algorithm to it, which groups the neighbourhoods according to the venues near them.

### K Means Clustering

We will use K-means clustering to group the neighbourhoods together.

In [36]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 


array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
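The number of clusters is fixed at 5 here without a stated justification. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic data that stands in for the feature matrix; the data, the range of k and the variable names are assumptions made for illustration, and this check is not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three tight groups of points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, `X` would be replaced by `manhattan_grouped_clustering`. With only 41 rows and several hundred sparse columns the scores may be low and close together, so they are better treated as a rough guide than as a definitive answer.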

In [37]:
# add clustering labels
# work on a copy so the original top-venues table is left unchanged
neighborhoods_venues_sorted_km = neighborhoods_venues_sorted.copy()
neighborhoods_venues_sorted_km.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged_km = neighborhoods_venues_sorted_km

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
manhattan_merged_km = manhattan_merged_km.join(manhattan_data.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged_km.head() 

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Latitude,Longitude
0,1,Battery Park City,Park,Coffee Shop,Clothing Store,Hotel,Gym,Boat or Ferry,Playground,Memorial Site,Shopping Mall,Pizza Place,Manhattan,40.711932,-74.016869
1,1,Carnegie Hill,Coffee Shop,Café,Wine Shop,Yoga Studio,Cosmetics Shop,Gym / Fitness Center,Bookstore,French Restaurant,Gym,Bar,Manhattan,40.782683,-73.953256
2,1,Central Harlem,African Restaurant,Seafood Restaurant,American Restaurant,Gym / Fitness Center,Chinese Restaurant,French Restaurant,Art Gallery,Bar,Public Art,Music Venue,Manhattan,40.815976,-73.943211
3,1,Chelsea,Coffee Shop,Bakery,Art Gallery,Ice Cream Shop,Hotel,American Restaurant,Wine Shop,French Restaurant,Seafood Restaurant,Market,Manhattan,40.744035,-74.003116
4,1,Chinatown,Chinese Restaurant,Bakery,Cocktail Bar,American Restaurant,Salon / Barbershop,Dessert Shop,Mexican Restaurant,Bubble Tea Shop,Hotpot Restaurant,Ice Cream Shop,Manhattan,40.715618,-73.994279


In [38]:
manhattan_merged_km.shape

(41, 15)

Let us plot the clusters on the map

In [39]:
# create map
map_clusters_km = folium.Map(location=[manhattan_lat, manhattan_long], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged_km['Latitude'], manhattan_merged_km['Longitude'], manhattan_merged_km['Neighborhood'], manhattan_merged_km['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_km)
       
map_clusters_km

## Results

From the above map it can be seen that K-means has divided the neighbourhoods into the 5 clusters we asked for (red, purple, blue, lime green and orange). Our neighbourhood (Little Neck) belongs to the blue cluster. Therefore, out of the 40 possible locations in Manhattan, the neighbourhoods highlighted in blue are the most similar to ours, and they are the most promising places to open a new branch because their surroundings resemble those in which our current business thrives.


## Discussion

K-means placed our neighbourhood in the blue cluster, so the blue Manhattan neighbourhoods are the most suitable candidates for a new branch of our shop or business. Opening in one of them places the new branch in an environment similar to that of the current one. Let us list these neighbourhoods.

In [122]:
# identify the cluster that contains our neighbourhood
our_cluster = manhattan_merged_km.loc[manhattan_merged_km["Neighborhood"]==our_neighbourhood]["Cluster Labels"].values[0]
# select the Manhattan neighbourhoods in the same cluster as our current neighbourhood; these are the candidate locations
manhattan_merged_km.loc[((manhattan_merged_km["Cluster Labels"]==our_cluster) & (manhattan_merged_km["Borough"]=="Manhattan"))]

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Latitude,Longitude
12,2,Greenwich Village,Italian Restaurant,Clothing Store,Sushi Restaurant,Boutique,Indian Restaurant,Coffee Shop,Cosmetics Shop,Dessert Shop,Bubble Tea Shop,Ice Cream Shop,Manhattan,40.726933,-73.999914
16,2,Lenox Hill,Italian Restaurant,Coffee Shop,Sushi Restaurant,Café,Cocktail Bar,Pizza Place,Gym / Fitness Center,Gym,Burger Joint,Steakhouse,Manhattan,40.768113,-73.95886
17,2,Lincoln Square,Plaza,Café,Theater,Concert Hall,Performing Arts Venue,Wine Shop,Park,Food Truck,Coffee Shop,Indie Movie Theater,Manhattan,40.773529,-73.985338
22,2,Manhattanville,Coffee Shop,Deli / Bodega,Mexican Restaurant,Bar,Italian Restaurant,Seafood Restaurant,Fried Chicken Joint,Bike Trail,Spanish Restaurant,Scenic Lookout,Manhattan,40.816934,-73.957385
26,2,Morningside Heights,Bookstore,American Restaurant,Coffee Shop,Park,Café,Sandwich Place,Deli / Bodega,Burger Joint,Food Truck,Seafood Restaurant,Manhattan,40.808,-73.963896
29,2,Roosevelt Island,Coffee Shop,Deli / Bodega,Residential Building (Apartment / Condo),Gym,Supermarket,Bus Line,Grocery Store,Greek Restaurant,Outdoors & Recreation,Soccer Field,Manhattan,40.76216,-73.949168
30,2,Soho,Clothing Store,Boutique,Italian Restaurant,Shoe Store,Mediterranean Restaurant,Salon / Barbershop,Hotel,Coffee Shop,Sporting Goods Shop,Bakery,Manhattan,40.722184,-74.000657
33,2,Tribeca,American Restaurant,Italian Restaurant,Park,Wine Bar,Café,Spa,Gym / Fitness Center,Greek Restaurant,French Restaurant,Basketball Court,Manhattan,40.721522,-74.010683
34,2,Tudor City,Park,Mexican Restaurant,Café,Pizza Place,Greek Restaurant,Gym,Diner,Coffee Shop,Garden,Seafood Restaurant,Manhattan,40.746917,-73.971219
35,2,Turtle Bay,Italian Restaurant,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Ramen Restaurant,Park,Deli / Bodega,Seafood Restaurant,Steakhouse,Manhattan,40.752042,-73.967708


In [124]:
manhattan_merged_km.loc[((manhattan_merged_km["Cluster Labels"]==our_cluster) & (manhattan_merged_km["Borough"]=="Manhattan"))].shape

(13, 15)

Let us look at the venue features of our current neighbourhood

In [123]:
manhattan_merged_km.loc[manhattan_merged_km["Neighborhood"]==our_neighbourhood]

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Latitude,Longitude
19,2,Little Neck,Chinese Restaurant,Deli / Bodega,Korean Restaurant,Italian Restaurant,Coffee Shop,Spa,Bank,Bakery,Bus Station,Peruvian Restaurant,Queens,40.770826,-73.738898


Looking at the above dataframes, we see that the cluster contains 13 Manhattan neighbourhoods where we could open another branch of our shop or business. They share venue features with our current neighbourhood, such as Italian restaurants and coffee shops.
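K-means places these neighbourhoods in the same cluster but does not order them. One possible follow-up, which is not part of the original analysis, is to rank the candidates by the cosine similarity between their venue-frequency vectors and that of our own neighbourhood. The sketch below uses made-up vectors over four venue categories; the candidate names and values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical frequency vectors over four venue categories.
home = [0.50, 0.25, 0.25, 0.00]
candidates = {
    'Candidate A': [0.40, 0.30, 0.20, 0.10],
    'Candidate B': [0.00, 0.10, 0.20, 0.70],
    'Candidate C': [0.50, 0.25, 0.25, 0.00],
}

# Most similar candidates first.
ranked = sorted(candidates,
                key=lambda name: cosine_similarity(home, candidates[name]),
                reverse=True)
print(ranked)
```

With the real data, the vectors would be the rows of `manhattan_grouped` for Little Neck and for each neighbourhood in its cluster.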

## Conclusion

In conclusion, we used K-means clustering to identify 13 neighbourhoods in Manhattan (out of a total of 40) that are the most suitable for a new branch of our shop or business. Because their venue profiles are the most similar to that of our current neighbourhood, they are the most likely to offer a similar customer base and comparable conditions for the new venture.