
<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in New York City</font></h1>


## Introduction

In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values. Also, you will use the Foursquare API to explore neighborhoods in New York City. You will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. You will use the _k_-means clustering algorithm to complete this task. Finally, you will use the Folium library to visualize the neighborhoods in New York City and their emerging clusters.


In [2]:
%%time
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analsysis

import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

import matplotlib.cm as cm
import matplotlib.colors as colors
import plotly.express as px 

import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from sklearn.cluster import KMeans


Wall time: 19.2 s


<a id='item1'></a>


## 1. Download and Explore Dataset


Neighborhood has a total of 5 boroughs and 306 neighborhoods. In order to segement the neighborhoods and explore them, we will essentially need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough as well as the the latitude and logitude coordinates of each neighborhood. 


#### Load and explore the data


In [2]:
import json,urllib.request
data = urllib.request.urlopen("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json").read()
output = json.loads(data)


Let's take a quick look at the data.


In [65]:
output['features'][1]

{'type': 'Feature',
 'id': 'nyu_2451_34572.2',
 'geometry': {'type': 'Point',
  'coordinates': [-73.82993910812398, 40.87429419303012]},
 'geometry_name': 'geom',
 'properties': {'name': 'Co-op City',
  'stacked': 2,
  'annoline1': 'Co-op',
  'annoline2': 'City',
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.82993910812398,
   40.87429419303012,
   -73.82993910812398,
   40.87429419303012]}}

Notice how all the relevant data is in the _features_ key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.


In [7]:
neighborhoods_data = output['features']

Let's take a look at the first item in this list.


In [14]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Tranform the data into a _pandas_ dataframe


The next task is essentially transforming this data of nested Python dictionaries into a _pandas_ dataframe. So let's start by creating an empty dataframe.


In [17]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data and fill the dataframe one row at a time.


In [19]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

Quickly examine the resulting dataframe.


In [126]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


And make sure that the dataset has all 5 boroughs and 306 neighborhoods.


In [26]:
print(neighborhoods['Borough'].nunique())
print(len(neighborhoods['Neighborhood'].unique()))

5
302


In [28]:
neighborhoods.shape

(306, 4)

In [29]:
neighborhoods.isnull().sum()

Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64

#### Use geopy library to get the latitude and longitude values of New York City.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [30]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="shiv_explo")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


#### Create a map of New York with neighborhoods superimposed on top.


In [39]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='yellow', #(if we want to fill the marker)
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### Todo simplify the above map and segment and cluster only the neighborhoods in Manhattan? <br>
### Creating a new dataframe of the Manhattan data.


In [41]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [42]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="shiv_explo")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Manhattan are 40.7896239, -73.9598939.


In [43]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

## Define Foursquare Credentials and Version


In [44]:
CLIENT_ID = 'FKZSXSUZ15N15NDT0L02LQKUVQOJPLBWRLKHXX3SFOBLYDSO' # your Foursquare ID
CLIENT_SECRET = 'XPGX1VPF3NP2NCDTYMWZXSKCZJYLETAWONQYSBNSKD3D0UJO' # your Foursquare Secret
ACCESS_TOKEN = '4HLZVV31IRIKGULK52WDBHUCHFBP4P04YZNQAXWCWGHZEIBK' # your FourSquare Access Token
VERSION = '20211505'
LIMIT = 100 # A default Foursquare API limit value

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: FKZSXSUZ15N15NDT0L02LQKUVQOJPLBWRLKHXX3SFOBLYDSO
CLIENT_SECRET:XPGX1VPF3NP2NCDTYMWZXSKCZJYLETAWONQYSBNSKD3D0UJO


#### Let's explore the first neighborhood in our dataframe.


In [45]:
manhattan_data['Neighborhood'].loc[0]

'Marble Hill'

In [46]:
#Get the neighborhood's latitude and longitude values.

neighborhood_latitude = manhattan_data['Latitude'].loc[0] # neighborhood latitude value
neighborhood_longitude = manhattan_data['Longitude'].loc[0] # neighborhood longitude value

neighborhood_name = manhattan_data['Neighborhood'].loc[0] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


#### Now, let's get the top 100 venues that are in Marble Hill within a radius of 500 meters.


In [47]:
# getting the requoired request url
limit = 100
radius=500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id=FKZSXSUZ15N15NDT0L02LQKUVQOJPLBWRLKHXX3SFOBLYDSO&client_secret=XPGX1VPF3NP2NCDTYMWZXSKCZJYLETAWONQYSBNSKD3D0UJO&v=20211505&ll=40.87655077879964,-73.91065965862981&radius=500&limit=100'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=FKZSXSUZ15N15NDT0L02LQKUVQOJPLBWRLKHXX3SFOBLYDSO&client_secret=XPGX1VPF3NP2NCDTYMWZXSKCZJYLETAWONQYSBNSKD3D0UJO&v=20211505&ll=40.87655077879964,-73.91065965862981&radius=500&limit=100'

Send the GET request and examine the resutls


In [110]:
# for better understanding the dict and brackets of json files we can dump it or use json_normalize
results =requests.get(url).json()
#results = requests.get(url).json()
#results_ste = json.dump(results,open('Marble Hill.json',"w"),indent=4)

results_norm = pd.json_normalize(results)



From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [101]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a _pandas_ dataframe.


In [121]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[1] for col in nearby_venues.columns]

nearby_venues.head()

  nearby_venues = json_normalize(venues) # flatten JSON


Unnamed: 0,name,categories,location,location.1
0,Bikram Yoga,Yoga Studio,40.876844,-73.906204
1,Arturo's,Pizza Place,40.874412,-73.910271
2,Tibbett Diner,Diner,40.880404,-73.908937
3,Rite Aid,Pharmacy,40.875467,-73.908906
4,Subway,Sandwich Place,40.874667,-73.909586


In [122]:
#And how many venues were returned by Foursquare?

print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

24 venues were returned by Foursquare.


In [127]:
neighborhoods.Borough.unique()

array(['Bronx', 'Manhattan', 'Brooklyn', 'Queens', 'Staten Island'],
      dtype=object)

## 2. Explore Neighborhoods in Manhattan


#### Let's create a function to repeat the same process to all the neighborhoods in Manhattan


In [179]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'N_Lat', 
                  'N_Lng', 
                  'Venue', 
                  'V_Lti', 
                  'V_Lng', 
                  'Venue_Category']
    
    return(nearby_venues)

#### Now write the code to run the above function on each neighborhood and create a new dataframe called _manhattan_venues_.


In [180]:
%%time
# Getting all the venues and along with thier Categories in CIty: Manhattan from all the Neighborhoods 
venues_manhattan = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude'])

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


#### Let's check the size of the resulting dataframe


In [181]:
venues_manhattan

Unnamed: 0,Neighborhood,N_Lat,N_Lng,Venue,V_Lti,V_Lng,Venue_Category
0,Marble Hill,40.876551,-73.910660,Bikram Yoga,40.876844,-73.906204,Yoga Studio
1,Marble Hill,40.876551,-73.910660,Arturo's,40.874412,-73.910271,Pizza Place
2,Marble Hill,40.876551,-73.910660,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.910660,Rite Aid,40.875467,-73.908906,Pharmacy
4,Marble Hill,40.876551,-73.910660,Subway,40.874667,-73.909586,Sandwich Place
...,...,...,...,...,...,...,...
2995,Hudson Yards,40.756658,-74.000111,NYPD Mounted Unit,40.759155,-74.004121,Stables
2996,Hudson Yards,40.756658,-74.000111,NY Waterway 42nd St Bus,40.760050,-74.003379,Bus Station
2997,Hudson Yards,40.756658,-74.000111,Pier Cafe,40.759625,-74.004162,Café
2998,Hudson Yards,40.756658,-74.000111,Twilight Cruise By Citysightseeing,40.759744,-74.004096,Boat or Ferry


In [185]:
# Unique Venues and Unique Category
print(venues_manhattan.Venue.nunique())
print(venues_manhattan.Venue_Category.nunique())

2497
325


In [233]:
df = venues_manhattan.groupby('Neighborhood').count().reset_index()

df.tail()


Unnamed: 0,Neighborhood,N_Lat,N_Lng,Venue,V_Lti,V_Lng,Venue_Category
35,Upper East Side,78,78,78,78,78,78
36,Upper West Side,70,70,70,70,70,70
37,Washington Heights,83,83,83,83,83,83
38,West Village,100,100,100,100,100,100
39,Yorkville,95,95,95,95,95,95


In [234]:
fig = px.histogram(df,x='Neighborhood',y='Venue_Category')
fig.update_layout(height=720,width=1080)
fig.show()

In [None]:
venues_manhattan.to_csv('./output/venues_manhattan.csv')

## 3. Analyze Each Neighborhood


In [15]:
venues_manhattan = pd.read_csv('./output/venues_manhattan.csv',index_col=0)

In [21]:
venues_man_hotenco = pd.get_dummies(venues_manhattan.Venue_Category)
venues_manhattan_onehot = pd.concat([venues_manhattan,venues_man_hotenco],axis="columns").reset_index(drop=True)
venues_manhattan_onehot.drop(columns=['N_Lat','N_Lng','Venue','V_Lti','V_Lng','Venue_Category'],inplace=True)
venues_manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,...,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
venues_manhattan_onehot.shape

(3000, 326)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category


In [34]:
frequency_of_venues = venues_manhattan_onehot.groupby(venues_manhattan_onehot['Neighborhood']).mean().round(4).reset_index()
frequency_of_venues.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,...,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0317,0.0,0.0,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.0137,0.0,0.0,0.0,0.0274,...,0.0,0.0,0.0,0.0274,0.0,0.0137,0.0548,0.0,0.0137,0.0274
2,Central Harlem,0.0,0.0,0.0,0.0811,0.0541,0.0,0.0,0.027,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.07,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.02,0.0
4,Chinatown,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.01,0.0,...,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0


In [61]:
frequency_of_venues_t= frequency_of_venues.T
frequency_of_venues_t.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
Neighborhood,Battery Park City,Carnegie Hill,Central Harlem,Chelsea,Chinatown,Civic Center,Clinton,East Harlem,East Village,Financial District,...,Stuyvesant Town,Sutton Place,Tribeca,Tudor City,Turtle Bay,Upper East Side,Upper West Side,Washington Heights,West Village,Yorkville
Accessories Store,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0
Adult Boutique,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0115,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0
Afghan Restaurant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
African Restaurant,0.0,0.0,0.0811,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [78]:
frequency_of_venues_t.iloc[1:,1].sort_values(ascending=False).head()

Coffee Shop       0.0685
Pizza Place       0.0548
Wine Shop         0.0548
Cosmetics Shop    0.0548
Gym               0.0411
Name: 1, dtype: object

#### Let's confirm the new size


In [33]:
frequency_of_venues.shape

(40, 326)

#### Let's print each neighborhood along with the top 5 most common venues


In [67]:
number_of_top_venue = 5

for neighborgood in frequency_of_venues['Neighborhood']:
    print(neighborgood+"----")
    freq = frequency_of_venues[frequency_of_venues['Neighborhood'] == neighborgood].T.reset_index()
    freq.columns = ['venue','freq']
    freq = freq.iloc[1:]
    freq['freq'] = freq['freq'].astype(float)
    freq = freq.round({'freq': 2})
    print(freq.sort_values('freq', ascending=False).reset_index(drop=True).head(number_of_top_venue))
    print('\n')

Battery Park City----
           venue  freq
0           Park  0.10
1          Hotel  0.06
2  Memorial Site  0.05
3  Boat or Ferry  0.05
4            Gym  0.05


Carnegie Hill----
            venue  freq
0     Coffee Shop  0.07
1     Pizza Place  0.05
2       Wine Shop  0.05
3  Cosmetics Shop  0.05
4             Gym  0.04


Central Harlem----
                  venue  freq
0    African Restaurant  0.08
1        Sandwich Place  0.05
2  Gym / Fitness Center  0.05
3                   Bar  0.05
4    Seafood Restaurant  0.05


Chelsea----
                 venue  freq
0          Art Gallery  0.07
1          Coffee Shop  0.05
2               Bakery  0.05
3    French Restaurant  0.03
4  American Restaurant  0.03


Chinatown----
                 venue  freq
0               Bakery  0.08
1   Chinese Restaurant  0.06
2         Cocktail Bar  0.05
3         Optical Shop  0.03
4  American Restaurant  0.03


Civic Center----
                  venue  freq
0           Coffee Shop  0.07
1                 

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [86]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [88]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = frequency_of_venues['Neighborhood']

for ind in np.arange(frequency_of_venues.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(frequency_of_venues.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Park,Hotel,Memorial Site,Boat or Ferry,Gym,Coffee Shop,Sandwich Place,Food Court,Plaza,Playground
1,Carnegie Hill,Coffee Shop,Pizza Place,Wine Shop,Cosmetics Shop,Gym,Café,Yoga Studio,Bakery,Pub,French Restaurant
2,Central Harlem,African Restaurant,Sandwich Place,Gym / Fitness Center,Bar,Seafood Restaurant,French Restaurant,American Restaurant,Music Venue,Ethiopian Restaurant,Juice Bar
3,Chelsea,Art Gallery,Coffee Shop,Bakery,French Restaurant,American Restaurant,Wine Shop,Italian Restaurant,Ice Cream Shop,Hotel,Bookstore
4,Chinatown,Bakery,Chinese Restaurant,Cocktail Bar,Optical Shop,American Restaurant,Spa,Dessert Shop,Salon / Barbershop,Ice Cream Shop,Coffee Shop


## 4. Cluster Neighborhoods


Run _k_-means to cluster the neighborhood into 5 clusters.


In [160]:
# set number of clusters
k = 5

manhattan_clustering = frequency_of_venues.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(manhattan_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 1, 0, 0, 2, 1, 0, 2, 2, 1])

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [141]:
venues_manhattan.drop(columns='Neighborhood')
neighborhoods_venues_sorted.insert(0, 'Cluster_Labels', kmeans.labels_)
merged = pd.concat([neighborhoods_venues_sorted,venues_manhattan],axis='columns',join='outer')

merged =  merged[[ 'Neighborhood','V_Lat', 'V_Lng', 'Cluster_Labels','1st Most Common Venue',
       '2nd Most Common Venue', '3rd Most Common Venue',
       '4th Most Common Venue', '5th Most Common Venue',
       '6th Most Common Venue', '7th Most Common Venue',
       '8th Most Common Venue', '9th Most Common Venue',
       '10th Most Common Venue']]
merged.head()

Unnamed: 0,Neighborhood,N_Lat,N_Lng,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,40.876551,-73.91066,4.0,Park,Hotel,Memorial Site,Boat or Ferry,Gym,Coffee Shop,Sandwich Place,Food Court,Plaza,Playground
1,Carnegie Hill,40.876551,-73.91066,1.0,Coffee Shop,Pizza Place,Wine Shop,Cosmetics Shop,Gym,Café,Yoga Studio,Bakery,Pub,French Restaurant
2,Central Harlem,40.876551,-73.91066,0.0,African Restaurant,Sandwich Place,Gym / Fitness Center,Bar,Seafood Restaurant,French Restaurant,American Restaurant,Music Venue,Ethiopian Restaurant,Juice Bar
3,Chelsea,40.876551,-73.91066,0.0,Art Gallery,Coffee Shop,Bakery,French Restaurant,American Restaurant,Wine Shop,Italian Restaurant,Ice Cream Shop,Hotel,Bookstore
4,Chinatown,40.876551,-73.91066,2.0,Bakery,Chinese Restaurant,Cocktail Bar,Optical Shop,American Restaurant,Spa,Dessert Shop,Salon / Barbershop,Ice Cream Shop,Coffee Shop


Finally, let's visualize the resulting clusters


In [144]:
# create map
map_clusters = folium.Map(location=[N_Lat, N_Lng], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged['N_Lat'],merged['N_Lng'],merged['Neighborhood'],merged['Cluster_Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

NameError: name 'N_Lat' is not defined

<a id='item5'></a>


## 5. Examine Clusters


Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.


#### Cluster 1


In [154]:
merged.Cluster_Labels.value_counts()

1.0    21
2.0    10
0.0     6
4.0     2
3.0     1
Name: Cluster_Labels, dtype: int64

In [155]:
merged[merged['Cluster_Labels'] == 0]

Unnamed: 0,Neighborhood,N_Lat,N_Lng,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Central Harlem,40.876551,-73.91066,0.0,African Restaurant,Sandwich Place,Gym / Fitness Center,Bar,Seafood Restaurant,French Restaurant,American Restaurant,Music Venue,Ethiopian Restaurant,Juice Bar
3,Chelsea,40.876551,-73.91066,0.0,Art Gallery,Coffee Shop,Bakery,French Restaurant,American Restaurant,Wine Shop,Italian Restaurant,Ice Cream Shop,Hotel,Bookstore
6,Clinton,40.876551,-73.91066,0.0,Coffee Shop,Theater,Sandwich Place,Italian Restaurant,American Restaurant,Gym / Fitness Center,Gym,Spa,Wine Shop,Steakhouse
14,Hudson Yards,40.876551,-73.91066,0.0,American Restaurant,Gym / Fitness Center,Italian Restaurant,Restaurant,Burger Joint,Park,Hotel,Gym,Coffee Shop,Clothing Store
17,Lincoln Square,40.876551,-73.91066,0.0,Café,Concert Hall,Plaza,Performing Arts Venue,Theater,Gym / Fitness Center,Bakery,Indie Movie Theater,French Restaurant,Gym
23,Midtown,40.876551,-73.91066,0.0,Hotel,Clothing Store,Spa,Theater,Bookstore,Food Truck,Sushi Restaurant,Coffee Shop,Bakery,Cuban Restaurant


#### Cluster 2


In [159]:
merged[merged['Cluster_Labels'] == 1]

Unnamed: 0,Neighborhood,N_Lat,N_Lng,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Carnegie Hill,40.876551,-73.91066,1.0,Coffee Shop,Pizza Place,Wine Shop,Cosmetics Shop,Gym,Café,Yoga Studio,Bakery,Pub,French Restaurant
5,Civic Center,40.876551,-73.91066,1.0,Coffee Shop,Spa,French Restaurant,Gym / Fitness Center,Cocktail Bar,Hotel,Bakery,Gym,Park,Café
9,Financial District,40.876551,-73.91066,1.0,Coffee Shop,Pizza Place,Cocktail Bar,Gym / Fitness Center,Falafel Restaurant,American Restaurant,Steakhouse,Café,Hotel,Gym
10,Flatiron,40.876551,-73.91066,1.0,Italian Restaurant,American Restaurant,Spa,Japanese Restaurant,New American Restaurant,Mediterranean Restaurant,Wine Shop,Gym / Fitness Center,Coffee Shop,Furniture / Home Store
11,Gramercy,40.876551,-73.91066,1.0,Pizza Place,Italian Restaurant,Coffee Shop,Sandwich Place,Bagel Shop,Spa,Park,Cocktail Bar,Mexican Restaurant,Diner
12,Greenwich Village,40.876551,-73.91066,1.0,Italian Restaurant,Clothing Store,Dessert Shop,Sushi Restaurant,Coffee Shop,Café,Boutique,Indian Restaurant,Pizza Place,Cosmetics Shop
16,Lenox Hill,40.876551,-73.91066,1.0,Italian Restaurant,Café,Bank,Pizza Place,Sushi Restaurant,Burger Joint,Gym / Fitness Center,Cocktail Bar,Gym,Coffee Shop
18,Little Italy,40.876551,-73.91066,1.0,Bakery,Italian Restaurant,Café,Spa,Coffee Shop,Pizza Place,Tea Room,Salon / Barbershop,Hotel,Ice Cream Shop
21,Manhattanville,40.876551,-73.91066,1.0,Deli / Bodega,Sushi Restaurant,Bank,Chinese Restaurant,Italian Restaurant,Coffee Shop,Cosmetics Shop,Seafood Restaurant,Shipping Store,Cuban Restaurant
24,Midtown South,40.715618,-73.994279,1.0,Korean Restaurant,Hotel,Hotel Bar,Bakery,Gym / Fitness Center,Salad Place,Cosmetics Shop,Coffee Shop,Japanese Restaurant,Café


#### Cluster 3


In [171]:
cluster3_= merged[merged['Cluster_Labels'] == 2]
cluster3_

Unnamed: 0,Neighborhood,N_Lat,N_Lng,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Chinatown,40.876551,-73.91066,2.0,Bakery,Chinese Restaurant,Cocktail Bar,Optical Shop,American Restaurant,Spa,Dessert Shop,Salon / Barbershop,Ice Cream Shop,Coffee Shop
7,East Harlem,40.876551,-73.91066,2.0,Pizza Place,Bakery,Mexican Restaurant,Bank,Deli / Bodega,Steakhouse,Sandwich Place,Historic Site,Seafood Restaurant,Beer Bar
8,East Village,40.876551,-73.91066,2.0,Bar,Wine Bar,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Pizza Place,Juice Bar,Korean Restaurant,Coffee Shop,Mexican Restaurant,Ice Cream Shop
13,Hamilton Heights,40.876551,-73.91066,2.0,Pizza Place,Mexican Restaurant,Café,Sandwich Place,Yoga Studio,Coffee Shop,Bakery,Caribbean Restaurant,Juice Bar,Cocktail Bar
15,Inwood,40.876551,-73.91066,2.0,Mexican Restaurant,Lounge,Café,Park,Wine Bar,Bank,Caribbean Restaurant,Bakery,Deli / Bodega,Pizza Place
19,Lower East Side,40.876551,-73.91066,2.0,Art Gallery,Sandwich Place,Café,Chinese Restaurant,Bakery,Pizza Place,Diner,Bus Stop,Butcher,Cocktail Bar
20,Manhattan Valley,40.876551,-73.91066,2.0,Yoga Studio,Pizza Place,Coffee Shop,Mexican Restaurant,Indian Restaurant,Bar,Cuban Restaurant,Deli / Bodega,Ice Cream Shop,Thai Restaurant
22,Marble Hill,40.876551,-73.91066,2.0,Sandwich Place,Bank,Supplement Shop,Storage Facility,Steakhouse,Clothing Store,Coffee Shop,Seafood Restaurant,Deli / Bodega,Department Store
25,Morningside Heights,40.715618,-73.994279,2.0,Bookstore,Café,Burger Joint,Sandwich Place,Deli / Bodega,American Restaurant,Park,Coffee Shop,Shipping Store,Pizza Place
37,Washington Heights,40.715618,-73.994279,2.0,Café,Pizza Place,Bakery,Grocery Store,Bank,Mobile Phone Shop,Chinese Restaurant,Supplement Shop,Supermarket,Spanish Restaurant


#### Cluster 4


In [157]:
merged[merged['Cluster_Labels'] == 3]

Unnamed: 0,Neighborhood,N_Lat,N_Lng,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,Stuyvesant Town,40.715618,-73.994279,3.0,Park,Yoga Studio,Boat or Ferry,Bar,Coffee Shop,Gas Station,Cocktail Bar,Bistro,Harbor / Marina,Heliport


#### Cluster 5


In [193]:
merged[merged['Cluster_Labels'] == 4]


Unnamed: 0,Neighborhood,N_Lat,N_Lng,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,40.876551,-73.91066,4.0,Park,Hotel,Memorial Site,Boat or Ferry,Gym,Coffee Shop,Sandwich Place,Food Court,Plaza,Playground
28,Roosevelt Island,40.715618,-73.994279,4.0,Park,Coffee Shop,Bus Line,Soccer Field,Bubble Tea Shop,Noodle House,Supermarket,Outdoors & Recreation,Food & Drink Shop,Pizza Place
