<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in New York City</font></h1>


## Introduction

In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values. Also, you will use the Foursquare API to explore neighborhoods in New York City. You will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. You will use the _k_-means clustering algorithm to complete this task. Finally, you will use the Folium library to visualize the neighborhoods in New York City and their emerging clusters.


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Download and Explore Dataset</a>

2.  <a href="#item2">Explore Neighborhoods in Cambridge</a>

3.  <a href="#item3">Analyze Each Neighborhood</a>

4.  <a href="#item4">Cluster Neighborhoods</a>

5.  <a href="#item5">Examine Clusters</a>  
    </font>
    </div>


Before we get the data and start exploring it, let's download all the dependencies that we will need.


In [2]:
!pip install geopy

Collecting geopy
  Downloading geopy-2.1.0-py3-none-any.whl (112 kB)
Collecting geographiclib<2,>=1.49
  Downloading geographiclib-1.50-py3-none-any.whl (38 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.1.0


In [1]:
import os # library for file path

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import re # to handle regex

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas.io.json.json_normalize is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>


## 1. Download and Explore Dataset


In [2]:
# directory that contains this notebook; the data files are read from here
notebook_path = os.path.abspath("UK_Area_Similarity_Capstone_Project.ipynb")
path = os.path.dirname(notebook_path) + os.sep

Get postcode with latitude and longitude information

In [3]:
file = 'ukpostcodes.csv' # downloaded csv file from internet
df = pd.read_csv(path+file)
df.head()

Unnamed: 0,id,postcode,latitude,longitude
0,1,AB10 1XG,57.144165,-2.114848
1,2,AB10 6RN,57.13788,-2.121487
2,3,AB10 7JB,57.124274,-2.12719
3,4,AB11 5QN,57.142701,-2.093295
4,5,AB11 6UL,57.137547,-2.112233


In [4]:
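file = 'ukpostcodes.csv' # downloaded csv file from internet
df = pd.read_csv(path+file)
df = df.drop(['id'],axis=1)
# the postcode district (outward code) is the part before the space, e.g. 'AB10'
df['AreaCode'] = df['postcode'].apply(lambda x: x.split(' ')[0])
# average only the numeric coordinate columns within each district
df = df.groupby(['AreaCode'])[['latitude', 'longitude']].mean()
df['AreaCode'] = df.index
df.index = range(df.shape[0])
df.columns = ['Latitude','Longitude','Postcode']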
print(df.shape)
df.head()

(2979, 3)


Unnamed: 0,Latitude,Longitude,Postcode
0,57.135204,-2.120402,AB10
1,57.139148,-2.092871,AB11
2,57.102113,-2.112707,AB12
3,57.107919,-2.237453,AB13
4,57.100449,-2.271324,AB14
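The grouping step above can be illustrated on a few rows. This is a minimal sketch: the coordinates are made up for illustration, and `toy` and `district_centres` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Made-up sample rows (illustrative only, not taken from ukpostcodes.csv)
toy = pd.DataFrame({
    'postcode': ['AB10 1XG', 'AB10 6RN', 'AB11 5QN'],
    'latitude': [57.0, 57.2, 57.4],
    'longitude': [-2.0, -2.2, -2.4],
})

# The postcode district is the part of the postcode before the space
toy['AreaCode'] = toy['postcode'].str.split(' ').str[0]

# Average the coordinates of all postcodes that share a district
district_centres = (
    toy.groupby('AreaCode')[['latitude', 'longitude']].mean().reset_index()
)
print(district_centres)
```

Each district ends up with a single representative point, which is the mean of its postcodes rather than an official centroid.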


Get area name by web scraping

In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_postcode_areas_in_the_United_Kingdom' # Wikipedia page
AreaName = pd.read_html(url) # gets all tables in this page

In [6]:
AreaName_df = AreaName[0]
AreaName_df['Area name'] = AreaName_df['Postcode area name[1][3]'].apply(lambda x: x.split('[')[0])
AreaName_df = AreaName_df.drop(['Postcode area name[1][3]','Code formation'],axis=1)
MissingPC = [['IM','Isle of Man'],['GY','Guernsey'],['JE','Jersey']]
for i,v in enumerate(MissingPC):
    AreaName_df.loc[AreaName_df.shape[0]+i] = [v[0],v[1]]
print(AreaName_df.shape)
AreaName_df

(124, 2)


Unnamed: 0,Postcode area,Area name
0,AB,Aberdeen
1,AL,St Albans
2,B,Birmingham
3,BA,Bath
4,BB,Blackburn
5,BD,Bradford
6,BH,Bournemouth
7,BL,Bolton
8,BN,Brighton
9,BR,Bromley


Mapping postcode with area name

In [7]:
AreaName_dict = {row['Postcode area']:row['Area name'] for ind,row in AreaName_df.iterrows()}

In [8]:
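def MapAreaName(pc):
    # the leading letters are the postcode area, e.g. 'CB' in 'CB21'
    r = re.compile("([a-zA-Z]+)([0-9]+)")
    m = r.match(pc)
    if m is None:
        return ""
    # unknown areas map to "" so that they are filtered out in the next cell
    return AreaName_dict.get(m.group(1), "")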
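The same mapping can also be expressed with vectorized pandas string methods instead of a row-by-row function. The sketch below is one possible alternative, not the notebook's method; `area_names` is a hypothetical two-entry lookup and the district values are chosen only to show the matching and non-matching cases.

```python
import pandas as pd

# Hypothetical lookup table with two entries, for illustration only
area_names = {'AB': 'Aberdeen', 'CB': 'Cambridge'}

# Two valid districts, one unknown area, and one string without leading letters
districts = pd.Series(['AB10', 'CB21', 'ZZ9', '123'])

# Capture the leading letters that are followed by a digit; non-matches become NaN
area_codes = districts.str.extract(r'^([A-Za-z]+)[0-9]', expand=False)

# Look up each area code; anything not found is replaced by an empty string
mapped = area_codes.map(area_names).fillna("")
print(mapped.tolist())
```

Rows that end up with an empty string can then be dropped with the same `!= ""` filter used in the next cell.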

In [172]:
df['Area Name'] = df['Postcode'].apply(lambda x: MapAreaName(x))
df = df[df['Area Name']!=""]
df = df[df['Latitude'].notna()]
print(df.shape)
df.head()

(2977, 4)


Unnamed: 0,Latitude,Longitude,Postcode,Area Name
0,57.135204,-2.120402,AB10,Aberdeen
1,57.139148,-2.092871,AB11,Aberdeen
2,57.102113,-2.112707,AB12,Aberdeen
3,57.107919,-2.237453,AB13,Aberdeen
4,57.100449,-2.271324,AB14,Aberdeen


### Get latitude and longitude

In [None]:
# loading Geospatial_Coordinates.csv 

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_e767165c58574a5a85b5a1ebbcfbad11 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=os.environ['IBM_API_KEY_ID'],  # read the key from an environment variable (name is an assumption) instead of hard-coding it
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')

body = client_e767165c58574a5a85b5a1ebbcfbad11.get_object(Bucket='capstoneprojectweek3-donotdelete-pr-zg9kzsdfrijbrj',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)
df_data_1.head()

#### Load and explore the data


Next, let's load the data.


In [5]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Let's take a quick look at the data.


In [6]:
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

Notice how all the relevant data is in the _features_ key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.


In [7]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.


In [8]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a _pandas_ dataframe


The next task is essentially transforming this data of nested Python dictionaries into a _pandas_ dataframe. So let's start by creating an empty dataframe.


In [9]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.


In [10]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data and fill the dataframe one row at a time.


In [11]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    # DataFrame.append was removed in pandas 2.0; add the row by label instead
    neighborhoods.loc[len(neighborhoods)] = [borough,
                                             neighborhood_name,
                                             neighborhood_lat,
                                             neighborhood_lon]
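An alternative to growing the dataframe one row at a time is to collect plain dictionaries first and build the dataframe once. The sketch below assumes the same nested structure as the features shown earlier; `sample_features` holds two entries with coordinates rounded for illustration, and `sample_neighborhoods` is a hypothetical name.

```python
import pandas as pd

# Two features in the same shape as the GeoJSON above (coordinates rounded)
sample_features = [
    {'geometry': {'coordinates': [-73.85, 40.89]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'coordinates': [-73.83, 40.87]},
     'properties': {'name': 'Co-op City', 'borough': 'Bronx'}},
]

# GeoJSON stores coordinates as [longitude, latitude]
records = [
    {'Borough': f['properties']['borough'],
     'Neighborhood': f['properties']['name'],
     'Latitude': f['geometry']['coordinates'][1],
     'Longitude': f['geometry']['coordinates'][0]}
    for f in sample_features
]

# Build the dataframe in a single step with an explicit column order
sample_neighborhoods = pd.DataFrame(
    records, columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])
print(sample_neighborhoods)
```

Building the dataframe once avoids repeated reallocation, which matters more as the number of rows grows.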

Quickly examine the resulting dataframe.


In [12]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


And make sure that the dataset has all 5 boroughs and 306 neighborhoods.


In [13]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### Use geopy library to get the latitude and longitude values of the United Kingdom.


In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>uk_explorer</em>, as shown below.


In [166]:
address = 'United Kingdom'

geolocator = Nominatim(user_agent="uk_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the United Kingdom are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the United Kingdom are 54.7023545, -3.2765753.
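geopy's `geocode` returns `None` when an address cannot be resolved, so reading `.latitude` directly can fail. The wrapper below is a minimal sketch of a guard; `safe_geocode`, `_FakeLocation`, and `fake_geocode` are hypothetical helpers, and the fake geocoder only stands in for `geolocator.geocode` so the example runs without network access.

```python
def safe_geocode(geocode, address):
    """Return (latitude, longitude) for address, or None when nothing is found.

    `geocode` is any callable that behaves like geopy's Nominatim.geocode:
    it returns an object with .latitude and .longitude, or None.
    """
    location = geocode(address)
    if location is None:
        return None
    return location.latitude, location.longitude


class _FakeLocation:
    """Stand-in for a geopy Location, used only in this illustration."""
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    """Pretend geocoder that knows a single address (illustrative only)."""
    known = {'United Kingdom': _FakeLocation(54.7023545, -3.2765753)}
    return known.get(address)


print(safe_geocode(fake_geocode, 'United Kingdom'))
print(safe_geocode(fake_geocode, 'Nowhere at all'))
```

In the notebook, the call would be `safe_geocode(geolocator.geocode, address)`, followed by a check for `None` before unpacking the coordinates.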


#### Create a map of the United Kingdom with postcode districts superimposed on top.


In [175]:
# create map of the United Kingdom using latitude and longitude values
map_uk = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, postcode, area in zip(df['Latitude'], df['Longitude'], df['Postcode'], df['Area Name']):
    label = '{}, {}'.format(area, postcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uk)  
    
map_uk

**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle marker to reveal the area name and the postcode district.


However, for illustration purposes, let's simplify the above map and segment and cluster only the postcode districts in Cambridge. So let's slice the original dataframe and create a new dataframe of the Cambridge data.


In [196]:
Cambridge_Data = df[df['Area Name'] == 'Cambridge'].reset_index(drop=True)
print(Cambridge_Data.shape)
Cambridge_Data.head()

(16, 4)


Unnamed: 0,Latitude,Longitude,Postcode,Area Name
0,52.195442,0.142137,CB1,Cambridge
1,52.028719,0.26299,CB10,Cambridge
2,51.999125,0.212496,CB11,Cambridge
3,52.185868,0.123155,CB2,Cambridge
4,52.127848,0.281822,CB21,Cambridge


Let's get the geographical coordinates of Cambridge.


In [198]:
address = 'Cambridge, UK'

geolocator = Nominatim(user_agent="uk_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Cambridge are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Cambridge are 52.197584649999996, 0.13915373736874398.


As we did with the whole United Kingdom, let's visualize Cambridge and the postcode districts in it.


In [200]:
# create map of Cambridge using latitude and longitude values
map_Cambridge = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(Cambridge_Data['Latitude'], Cambridge_Data['Longitude'], Cambridge_Data['Postcode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Cambridge)  
    
map_Cambridge

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.


#### Define Foursquare Credentials and Version


In [201]:
# Read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names below are illustrative assumptions)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
ACCESS_TOKEN = os.environ['FOURSQUARE_ACCESS_TOKEN'] # your Foursquare Access Token
VERSION = '20180604'
LIMIT = 100 # A default Foursquare API limit value

print('Credentials loaded.')

Credentials loaded.


#### Let's explore the first neighborhood in our dataframe.


Get the neighborhood's name.


In [202]:
Cambridge_Data.loc[0, 'Postcode']

'CB1'

Get the neighborhood's latitude and longitude values.


In [203]:
cb_latitude = Cambridge_Data.loc[0, 'Latitude'] # neighborhood latitude value
cb_longitude = Cambridge_Data.loc[0, 'Longitude'] # neighborhood longitude value

cb_postcode = Cambridge_Data.loc[0, 'Postcode'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(cb_postcode, 
                                                               cb_latitude, 
                                                               cb_longitude))

Latitude and longitude values of CB1 are 52.19544233215599, 0.14213684657435816.


#### Now, let's get the top 100 venues that are in CB1 within a radius of 500 meters.


First, let's create the GET request URL. Name your URL **url**.


In [204]:
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    cb_latitude, 
    cb_longitude, 
    radius, 
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180604&ll=52.19544233215599,0.14213684657435816&radius=500&limit=100'
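The query string can also be assembled from a dictionary, which keeps each parameter visible and leaves escaping to the standard library. This is a sketch under assumptions: `build_explore_url` is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL used above, letting urlencode handle escaping."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude as one parameter
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials, for illustration only
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180604',
                                52.195442, 0.142137, 500, 100)
print(example_url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is an equivalent encoding of the same value.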


Send the GET request and examine the results.


In [205]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '6031234bc769a12f9abb9571'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Coleridge',
  'headerFullLocation': 'Coleridge, Cambridge',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 28,
  'suggestedBounds': {'ne': {'lat': 52.199942336655994,
    'lng': 0.14946445342003567},
   'sw': {'lat': 52.19094232765599, 'lng': 0.13480923972868064}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54cd31eb498ec288176596ae',
       'name': 'Tradizioni',
       'location': {'address': 'Mill Road',
        'lat': 52.19791754114031,
        'lng': 0.14393600453004363,
        'labeledLatLngs': [{'label': 'display',
          'lat': 52.1979175

From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [206]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a _pandas_ dataframe.


In [207]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Tradizioni,Italian Restaurant,52.197918,0.143936
1,196 Mill Road,Bar,52.197752,0.144226
2,Chill #02,Café,52.194805,0.137223
3,Relevant Record Café,Café,52.197105,0.146521
4,Caffè Nero,Coffee Shop,52.194526,0.136673
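The flattening and cleaning steps can be tried on a small hand-written response. The sketch below assumes the same nesting as the Foursquare output above; `sample_items` is made up for illustration, and the second item has no category to show how that case can be handled.

```python
import pandas as pd

# Made-up items in the same nested shape as the response above
sample_items = [
    {'venue': {'name': 'Example Cafe',
               'categories': [{'name': 'Café'}],
               'location': {'lat': 52.19, 'lng': 0.14}}},
    {'venue': {'name': 'Uncategorised Place',
               'categories': [],
               'location': {'lat': 52.20, 'lng': 0.15}}},
]

# Nested keys become dotted column names such as 'venue.location.lat'
flat = pd.json_normalize(sample_items)
flat = flat[['venue.name', 'venue.categories',
             'venue.location.lat', 'venue.location.lng']].copy()

# Keep the first category name, or None when the list is empty
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# Drop the dotted prefixes so the columns read name, categories, lat, lng
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```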


And how many venues were returned by Foursquare?


In [208]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

28 venues were returned by Foursquare.


In [209]:
nearby_venues

Unnamed: 0,name,categories,lat,lng
0,Tradizioni,Italian Restaurant,52.197918,0.143936
1,196 Mill Road,Bar,52.197752,0.144226
2,Chill #02,Café,52.194805,0.137223
3,Relevant Record Café,Café,52.197105,0.146521
4,Caffè Nero,Coffee Shop,52.194526,0.136673
5,Urban larder,Deli / Bodega,52.197975,0.143638
6,Ibis Hotel,Hotel,52.19483,0.13726
7,The Sea Tree,Seafood Restaurant,52.197918,0.143775
8,Limoncello,Deli / Bodega,52.197654,0.144721
9,Sainsbury's Local,Grocery Store,52.19509,0.136815


<a id='item2'></a>


## 2. Explore Neighborhoods in Cambridge


#### Let's create a function to repeat the same process for all the neighborhoods in Cambridge


In [210]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
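The list comprehension in `getNearbyVenues` indexes `categories[0]`, which raises `IndexError` for a venue that has no category. A possible guard is sketched below as a separate, network-free helper; `extract_venue_rows` and `sample_items` are hypothetical names, and the items are made up for illustration.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn the `items` list of one explore response into flat tuples.

    Venues without any category get None instead of raising IndexError.
    """
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Made-up items in the same shape as the Foursquare response, for illustration
sample_items = [
    {'venue': {'name': 'Example Bar', 'categories': [{'name': 'Bar'}],
               'location': {'lat': 52.1977, 'lng': 0.1442}}},
    {'venue': {'name': 'No Category', 'categories': [],
               'location': {'lat': 52.1948, 'lng': 0.1372}}},
]

rows = extract_venue_rows('CB1', 52.195442, 0.142137, sample_items)
for row in rows:
    print(row)
```

Inside `getNearbyVenues`, the list comprehension could be replaced by `venues_list.append(extract_venue_rows(name, lat, lng, results))`.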

#### Now run the above function on each postcode district and create a new dataframe called _Cambridge_Venues_.


In [211]:
Cambridge_Venues = getNearbyVenues(names=Cambridge_Data['Postcode'],
                                   latitudes=Cambridge_Data['Latitude'],
                                   longitudes=Cambridge_Data['Longitude']
                                  )

CB1
CB10
CB11
CB2
CB21
CB22
CB23
CB24
CB25
CB3
CB4
CB5
CB6
CB7
CB8
CB9



#### Let's check the size of the resulting dataframe


In [212]:
print(Cambridge_Venues.shape)
Cambridge_Venues.head()

(80, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,CB1,52.195442,0.142137,Tradizioni,52.197918,0.143936,Italian Restaurant
1,CB1,52.195442,0.142137,196 Mill Road,52.197752,0.144226,Bar
2,CB1,52.195442,0.142137,Chill #02,52.194805,0.137223,Café
3,CB1,52.195442,0.142137,Relevant Record Café,52.197105,0.146521,Café
4,CB1,52.195442,0.142137,Caffè Nero,52.194526,0.136673,Coffee Shop


Let's check how many venues were returned for each neighborhood.


In [214]:
Cambridge_Venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
CB1,28,28,28,28,28,28
CB11,1,1,1,1,1,1
CB2,3,3,3,3,3,3
CB25,1,1,1,1,1,1
CB3,5,5,5,5,5,5
CB4,5,5,5,5,5,5
CB5,15,15,15,15,15,15
CB8,1,1,1,1,1,1
CB9,21,21,21,21,21,21
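Only 9 of the 16 queried districts appear in the table above, because districts with no venues within the radius contribute no rows. The sketch below lists the missing ones using the district codes printed earlier; `queried`, `with_venues`, and `missing` are hypothetical names.

```python
# Districts passed to getNearbyVenues, as printed above
queried = ['CB1', 'CB10', 'CB11', 'CB2', 'CB21', 'CB22', 'CB23', 'CB24',
           'CB25', 'CB3', 'CB4', 'CB5', 'CB6', 'CB7', 'CB8', 'CB9']

# Districts that appear in the grouped count table
with_venues = {'CB1', 'CB11', 'CB2', 'CB25', 'CB3', 'CB4', 'CB5', 'CB8', 'CB9'}

# Districts that returned no venues, in their original order
missing = [pc for pc in queried if pc not in with_venues]
print(missing)
```

In the notebook, the same check could be written as `set(Cambridge_Data['Postcode']) - set(Cambridge_Venues['Neighborhood'])`. These districts drop out of every later step, including the clustering.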


#### Let's find out how many unique categories can be curated from all the returned venues


In [215]:
print('There are {} unique categories.'.format(len(Cambridge_Venues['Venue Category'].unique())))

There are 47 unique categories.


<a id='item3'></a>


## 3. Analyze Each Neighborhood


In [216]:
# one hot encoding
Cambridge_Onehot = pd.get_dummies(Cambridge_Venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Cambridge_Onehot['Neighborhood'] = Cambridge_Venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Cambridge_Onehot.columns[-1]] + list(Cambridge_Onehot.columns[:-1])
Cambridge_Onehot = Cambridge_Onehot[fixed_columns]

Cambridge_Onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Bakery,Bar,Bed & Breakfast,Bookstore,Café,Chinese Restaurant,Coffee Shop,Convenience Store,Cosmetics Shop,Deli / Bodega,Electronics Store,Fast Food Restaurant,Furniture / Home Store,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Hardware Store,Hostel,Hotel,Indian Restaurant,Insurance Office,Italian Restaurant,Movie Theater,Music Venue,Nature Preserve,Paper / Office Supplies Store,Pharmacy,Pub,Public Art,Rental Car Location,Sandwich Place,Seafood Restaurant,Soccer Field,Soccer Stadium,Sporting Goods Shop,Sports Bar,Sports Club,Stationery Store,Supermarket,Sushi Restaurant,Tennis Court,Turkish Restaurant,Warehouse Store
0,CB1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,CB1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,CB1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,CB1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,CB1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.


In [217]:
Cambridge_Onehot.shape

(80, 48)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category


In [218]:
Cambridge_Grouped = Cambridge_Onehot.groupby('Neighborhood').mean().reset_index()
Cambridge_Grouped

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Bakery,Bar,Bed & Breakfast,Bookstore,Café,Chinese Restaurant,Coffee Shop,Convenience Store,Cosmetics Shop,Deli / Bodega,Electronics Store,Fast Food Restaurant,Furniture / Home Store,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Hardware Store,Hostel,Hotel,Indian Restaurant,Insurance Office,Italian Restaurant,Movie Theater,Music Venue,Nature Preserve,Paper / Office Supplies Store,Pharmacy,Pub,Public Art,Rental Car Location,Sandwich Place,Seafood Restaurant,Soccer Field,Soccer Stadium,Sporting Goods Shop,Sports Bar,Sports Club,Stationery Store,Supermarket,Sushi Restaurant,Tennis Court,Turkish Restaurant,Warehouse Store
0,CB1,0.0,0.0,0.0,0.071429,0.0,0.035714,0.107143,0.0,0.107143,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.035714,0.071429,0.071429,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.035714,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0
1,CB11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CB2,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CB25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CB3,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0
5,CB4,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,CB5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133333,0.0,0.066667,0.0,0.0,0.0,0.066667,0.066667,0.066667,0.0,0.0,0.066667,0.0,0.0,0.0,0.066667,0.0,0.066667,0.066667,0.0,0.0,0.066667,0.0,0.0,0.0,0.066667,0.066667,0.0,0.066667,0.0,0.066667,0.0,0.0,0.0,0.0
7,CB8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
8,CB9,0.047619,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.047619,0.0,0.0,0.047619,0.0,0.0,0.047619,0.047619,0.0,0.0,0.0,0.0,0.047619,0.047619,0.047619,0.047619,0.047619,0.0,0.0,0.0,0.047619,0.047619,0.0,0.0,0.047619,0.0,0.047619,0.0,0.0,0.047619,0.0,0.047619,0.095238,0.0,0.0,0.0,0.047619


#### Let's confirm the new size


In [219]:
Cambridge_Grouped.shape

(9, 48)
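The one-hot encoding and the per-neighborhood mean can be checked by hand on a tiny example. This sketch uses made-up venues; `toy_venues`, `onehot`, and `grouped` are hypothetical names.

```python
import pandas as pd

# Made-up venues: neighborhood A has two pubs and a cafe, B has one hotel
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Pub', 'Pub', 'Café', 'Hotel'],
})

# One column per category, with a 1 in the column of each venue's category
onehot = pd.get_dummies(toy_venues[['Venue Category']],
                        prefix="", prefix_sep="", dtype=int)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of the 0/1 columns is the share of each category per neighborhood
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Each row of the grouped frame sums to 1, so neighborhoods with very few venues, such as those with a single venue above, get a frequency of 1.0 for one category.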

#### Let's print each neighborhood along with the top 5 most common venues


In [220]:
num_top_venues = 5

for hood in Cambridge_Grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = Cambridge_Grouped[Cambridge_Grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----CB1----
           venue  freq
0  Grocery Store  0.14
1            Pub  0.14
2           Café  0.11
3    Coffee Shop  0.11
4            Bar  0.07


----CB11----
                 venue  freq
0        Deli / Bodega   1.0
1  American Restaurant   0.0
2   Seafood Restaurant   0.0
3        Movie Theater   0.0
4          Music Venue   0.0


----CB2----
                venue  freq
0  Athletics & Sports  0.33
1         Golf Course  0.33
2      Soccer Stadium  0.33
3      Sandwich Place  0.00
4       Movie Theater  0.00


----CB25----
                 venue  freq
0      Nature Preserve   1.0
1  American Restaurant   0.0
2   Seafood Restaurant   0.0
3        Movie Theater   0.0
4          Music Venue   0.0


----CB3----
                 venue  freq
0                  Bar   0.4
1         Tennis Court   0.2
2                  Gym   0.2
3           Public Art   0.2
4  American Restaurant   0.0


----CB4----
                venue  freq
0     Bed & Breakfast   0.2
1                 Pub   0.2
2  C

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [221]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
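Because the function above always returns `num_top_venues` names, districts with only a few venues are padded with categories whose frequency is zero. The variant below is a sketch of one way to avoid that; `most_common_nonzero` is a hypothetical helper and the sample row reuses two of the CB3 frequencies shown earlier.

```python
import pandas as pd


def most_common_nonzero(row, num_top_venues):
    """Return up to num_top_venues category names with a frequency above zero."""
    freqs = row.iloc[1:].astype(float)   # skip the 'Neighborhood' entry
    freqs = freqs[freqs > 0]             # drop categories that never occur
    ranked = freqs.sort_values(ascending=False, kind='stable')
    return list(ranked.index[:num_top_venues])


# Sample row: two categories occur, two do not
toy_row = pd.Series({'Neighborhood': 'CB3', 'Bar': 0.4, 'Gym': 0.2,
                     'Hotel': 0.0, 'Pub': 0.0})
print(most_common_nonzero(toy_row, 3))
```

The result can be shorter than `num_top_venues`, so a dataframe built from it would need padding with an explicit placeholder such as `None`.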

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [272]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Cambridge_Grouped['Neighborhood']

for ind in np.arange(Cambridge_Grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cambridge_Grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,CB1,Grocery Store,Pub,Café,Coffee Shop,Deli / Bodega,Indian Restaurant,Bar,Hotel,Bookstore,Seafood Restaurant
1,CB11,Deli / Bodega,Warehouse Store,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store,Golf Course,Furniture / Home Store,Fast Food Restaurant
2,CB2,Athletics & Sports,Golf Course,Soccer Stadium,Warehouse Store,Deli / Bodega,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store
3,CB25,Nature Preserve,Warehouse Store,Hotel,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store,Golf Course,Furniture / Home Store
4,CB3,Bar,Tennis Court,Gym,Public Art,Warehouse Store,Deli / Bodega,Gym Pool,Gym / Fitness Center,Grocery Store,Golf Course
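The column labels above rely on a `try`/`except` around a three-item suffix list, which works for ten columns but would label an eleventh-to-thirteenth or twenty-first column incorrectly if more were requested. A small helper such as the one sketched below covers the general case; `ordinal` is a hypothetical name.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, such as 1st, 2nd, 3rd or 11th."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th and 13th are irregular
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Rebuild the column labels for the top 10 venues
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```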


<a id='item4'></a>


## 4. Cluster Neighborhoods


Run _k_-means to cluster the neighborhoods into 5 clusters.


In [273]:
# set number of clusters
kclusters = 5

Cambridge_Grouped_Clustering = Cambridge_Grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Cambridge_Grouped_Clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 4, 3, 0, 1, 1, 1, 2, 1], dtype=int32)
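To make the step above concrete, here is a minimal sketch of the same pattern on toy data. The frame `toy_grouped`, its two venue categories, and the four postcodes are all made up for illustration; only the `KMeans` call mirrors the notebook. `n_init=10` is set explicitly here as an assumption, since its default differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# hypothetical venue-category frequencies for four made-up neighborhoods
toy_grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Pub': [0.8, 0.7, 0.0, 0.1],
    'Gym': [0.2, 0.3, 1.0, 0.9],
})

# drop the name column so that only numeric features reach k-means
toy_features = toy_grouped.drop(columns='Neighborhood')

# one label per row, in the same order as the rows of toy_grouped
toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_features)

toy_grouped['Cluster Labels'] = toy_kmeans.labels_
print(toy_grouped)
```

The label numbers themselves are arbitrary; what matters is which rows share a label. In this toy case the pub-heavy rows A and B end up together, as do the gym-heavy rows C and D.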

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [274]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Cambridge_Merged = Cambridge_Data

# join the cluster labels and top venues onto Cambridge_Data, which holds the latitude/longitude of each postcode
Cambridge_Merged = Cambridge_Merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Postcode')

Cambridge_Merged = Cambridge_Merged[Cambridge_Merged['Cluster Labels'].notna()]

Cambridge_Merged.head() # check the last columns!

Unnamed: 0,Latitude,Longitude,Postcode,Area Name,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,52.195442,0.142137,CB1,Cambridge,1.0,Grocery Store,Pub,Café,Coffee Shop,Deli / Bodega,Indian Restaurant,Bar,Hotel,Bookstore,Seafood Restaurant
2,51.999125,0.212496,CB11,Cambridge,4.0,Deli / Bodega,Warehouse Store,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store,Golf Course,Furniture / Home Store,Fast Food Restaurant
3,52.185868,0.123155,CB2,Cambridge,3.0,Athletics & Sports,Golf Course,Soccer Stadium,Warehouse Store,Deli / Bodega,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store
8,52.260932,0.253649,CB25,Cambridge,0.0,Nature Preserve,Warehouse Store,Hotel,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store,Golf Course,Furniture / Home Store
9,52.212283,0.098208,CB3,Cambridge,1.0,Bar,Tennis Court,Gym,Public Art,Warehouse Store,Deli / Bodega,Gym Pool,Gym / Fitness Center,Grocery Store,Golf Course
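The cluster labels appear as `1.0`, `4.0` and so on because the join leaves `NaN` for postcodes that have no venue data, and a column containing `NaN` is stored as float. The sketch below reproduces this on toy data; `toy_data` and `toy_sorted` are hypothetical stand-ins for `Cambridge_Data` and `neighborhoods_venues_sorted`, and the final cast to `int` is an optional step that the notebook does not perform at this point.

```python
import pandas as pd

# hypothetical stand-ins for Cambridge_Data and neighborhoods_venues_sorted
toy_data = pd.DataFrame({
    'Postcode': ['X1', 'X2', 'X3'],
    'Latitude': [52.1, 52.2, 52.3],
})
toy_sorted = pd.DataFrame({
    'Cluster Labels': [1, 0],
    'Neighborhood': ['X1', 'X3'],
})

# left join: every row of toy_data is kept, matched on Postcode == Neighborhood
toy_merged = toy_data.join(toy_sorted.set_index('Neighborhood'), on='Postcode')
print(toy_merged['Cluster Labels'].dtype)  # float, because X2 has no match

# drop unmatched rows, then restore integer labels
toy_merged = toy_merged[toy_merged['Cluster Labels'].notna()].copy()
toy_merged['Cluster Labels'] = toy_merged['Cluster Labels'].astype(int)
print(toy_merged)
```

Filtering with `notna()` keeps the original row index, which is why the notebook's merged table shows index values such as 0, 2, 3, 8 and 9 rather than a continuous range.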


Finally, let's visualize the resulting clusters.


In [275]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Cambridge_Merged['Latitude'], Cambridge_Merged['Longitude'], Cambridge_Merged['Area Name'], Cambridge_Merged['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels are 0-based, so index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
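The color scheme in the cell above comes down to sampling one color per cluster from matplotlib's `rainbow` colormap and converting it to a hex string that Folium accepts. The sketch below isolates that step without drawing a map. It assumes the notebook imported `matplotlib.cm` as `cm` and `matplotlib.colors` as `colors`; the `cluster_color` dictionary is a hypothetical addition for illustration.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 5

# sample kclusters evenly spaced RGBA colors from the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))

# convert each RGBA row to a '#rrggbb' hex string
rainbow = [colors.rgb2hex(c) for c in colors_array]

# hypothetical lookup from 0-based cluster label to its marker color
cluster_color = {label: rainbow[label] for label in range(kclusters)}
print(cluster_color)
```

Printing the mapping alongside the map makes it easier to tell which marker color belongs to which cluster label.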

<a id='item5'></a>


## 5. Examine Clusters


Now you can examine each cluster and identify the venue categories that distinguish it from the others. Based on those defining categories, you can then assign a name to each cluster. I will leave this exercise to you.
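One possible way to approach this exercise is to average the venue frequencies within each cluster and look at the highest-scoring categories. The sketch below does this on a small made-up frame; `toy`, its category columns, and its values are hypothetical, and in the notebook the equivalent input would be the grouped frequency table with the cluster labels attached.

```python
import pandas as pd

# hypothetical per-neighborhood venue frequencies with cluster labels attached
toy = pd.DataFrame({
    'Cluster Labels': [0, 0, 1],
    'Pub': [0.6, 0.4, 0.0],
    'Gym': [0.1, 0.1, 0.9],
    'Hotel': [0.3, 0.5, 0.1],
})

# mean frequency of every category within each cluster
cluster_profile = toy.groupby('Cluster Labels').mean()

# the two strongest categories per cluster
top_per_cluster = {
    label: row.sort_values(ascending=False).head(2).index.tolist()
    for label, row in cluster_profile.iterrows()
}
print(top_per_cluster)
```

Categories that rank high in one cluster but low in the others are natural candidates for naming that cluster.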


#### Cluster 1


In [276]:
Cambridge_Merged.loc[Cambridge_Merged['Cluster Labels'] == 0, Cambridge_Merged.columns[[1] + list(range(5, Cambridge_Merged.shape[1]))]]

Unnamed: 0,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,0.253649,Nature Preserve,Warehouse Store,Hotel,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store,Golf Course,Furniture / Home Store


#### Cluster 2


In [277]:
Cambridge_Merged.loc[Cambridge_Merged['Cluster Labels'] == 1, Cambridge_Merged.columns[[1] + list(range(5, Cambridge_Merged.shape[1]))]]

Unnamed: 0,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0.142137,Grocery Store,Pub,Café,Coffee Shop,Deli / Bodega,Indian Restaurant,Bar,Hotel,Bookstore,Seafood Restaurant
9,0.098208,Bar,Tennis Court,Gym,Public Art,Warehouse Store,Deli / Bodega,Gym Pool,Gym / Fitness Center,Grocery Store,Golf Course
10,0.129817,Bed & Breakfast,Grocery Store,Chinese Restaurant,Convenience Store,Pub,Warehouse Store,Fast Food Restaurant,Hardware Store,Gym Pool,Gym / Fitness Center
11,0.153588,Electronics Store,Indian Restaurant,Rental Car Location,Gym Pool,Hardware Store,Furniture / Home Store,Music Venue,Paper / Office Supplies Store,Pharmacy,Gym / Fitness Center
15,0.439755,Supermarket,Warehouse Store,Pub,Bakery,Coffee Shop,Cosmetics Shop,Fast Food Restaurant,Grocery Store,Gym,Hotel


#### Cluster 3


In [278]:
Cambridge_Merged.loc[Cambridge_Merged['Cluster Labels'] == 2, Cambridge_Merged.columns[[1] + list(range(5, Cambridge_Merged.shape[1]))]]

Unnamed: 0,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,0.427659,Turkish Restaurant,Hotel,Hostel,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store,Golf Course,Furniture / Home Store


#### Cluster 4


In [279]:
Cambridge_Merged.loc[Cambridge_Merged['Cluster Labels'] == 3, Cambridge_Merged.columns[[1] + list(range(5, Cambridge_Merged.shape[1]))]]

Unnamed: 0,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,0.123155,Athletics & Sports,Golf Course,Soccer Stadium,Warehouse Store,Deli / Bodega,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store


#### Cluster 5


In [280]:
Cambridge_Merged.loc[Cambridge_Merged['Cluster Labels'] == 4, Cambridge_Merged.columns[[1] + list(range(5, Cambridge_Merged.shape[1]))]]

Unnamed: 0,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,0.212496,Deli / Bodega,Warehouse Store,Hardware Store,Gym Pool,Gym / Fitness Center,Gym,Grocery Store,Golf Course,Furniture / Home Store,Fast Food Restaurant


### Thank you for completing this lab!

This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork-21253531&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork-21253531&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ) and [Polong Lin](https://www.linkedin.com/in/polonglin?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork-21253531&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork-21253531&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ). I hope you found this lab interesting and educational. Feel free to contact us if you have any questions!


This notebook is part of a course on **Coursera** called _Applied Data Science Capstone_. If you accessed this notebook outside the course, you can take this course online by clicking [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2).


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By    | Change Description         |
| ----------------- | ------- | ------------- | -------------------------- |
| 2020-11-26        | 2.0     | Lakshmi Holla | Updated the markdown cells |

## <h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>
