# Applied Data Science Capstone - Week 3
## Segmenting and Clustering Neighborhoods in Toronto
- Data source: Wikipedia website https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- The dataframe will consist of three columns: Postcode, Borough, and Neighborhood.
- Only process the cells that have an assigned Borough. Ignore cells with a Borough that is not assigned.
- More than one neighborhood can exist in one postal code area. Such rows will be combined into one row separated with a comma.
- If a cell has a borough but a not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
- Submit a link to your Notebook on your Github repository. (10 marks)

### Install beautifulsoup if necessary and import libraries

In [23]:
!pip install beautifulsoup4
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import json # library to handle JSON files
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans



## Web Scraping
### 1. Read Wikipedia page
### 2. Parse HTML with standard parser

In [2]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(r.text, 'html.parser')

### Find table on Wikipedia page

In [3]:
table = soup.find('table',{'class':'wikitable sortable'})

### Find all rows in the table

In [4]:
trs = table.find_all('tr')

### Append rows

In [5]:
rows = []
for r in trs:
    rows.append([t.text.strip() for t in r.find_all('td')])
     
df = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighborhood'])
df = df[~df['Postcode'].isnull()]

print(df.head())
print('---')
print(df.tail())

  Postcode           Borough      Neighborhood
1      M1A      Not assigned      Not assigned
2      M2A      Not assigned      Not assigned
3      M3A        North York         Parkwoods
4      M4A        North York  Victoria Village
5      M5A  Downtown Toronto      Harbourfront
---
    Postcode       Borough           Neighborhood
283      M8Z     Etobicoke              Mimico NW
284      M8Z     Etobicoke     The Queensway West
285      M8Z     Etobicoke  Royal York South West
286      M8Z     Etobicoke         South of Bloor
287      M9Z  Not assigned           Not assigned


### Remove rows with borough='Not assigned' and reindex

In [6]:
df.drop(df[df['Borough']=='Not assigned'].index,axis=0, inplace=True)
df = df.reset_index(drop=True)

print(df.head())
print('---')
print(df.tail())

  Postcode           Borough      Neighborhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M6A        North York  Lawrence Heights
4      M6A        North York    Lawrence Manor
---
    Postcode    Borough              Neighborhood
205      M8Z  Etobicoke  Kingsway Park South West
206      M8Z  Etobicoke                 Mimico NW
207      M8Z  Etobicoke        The Queensway West
208      M8Z  Etobicoke     Royal York South West
209      M8Z  Etobicoke            South of Bloor


### If there is more than one neighborhood for the same postcode, aggregate to 1 row with neighborhoods separated by commas and re-index

In [7]:
df = df.groupby(['Postcode', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()

print(df.head())
print('---')
print(df.tail())

  Postcode      Borough                            Neighborhood
0      M1B  Scarborough                          Rouge, Malvern
1      M1C  Scarborough  Highland Creek, Rouge Hill, Port Union
2      M1E  Scarborough       Guildwood, Morningside, West Hill
3      M1G  Scarborough                                  Woburn
4      M1H  Scarborough                               Cedarbrae
---
    Postcode    Borough                                       Neighborhood
98       M9N       York                                             Weston
99       M9P  Etobicoke                                          Westmount
100      M9R  Etobicoke  Kingsview Village, Martin Grove Gardens, Richv...
101      M9V  Etobicoke  Albion Gardens, Beaumond Heights, Humbergate, ...
102      M9W  Etobicoke                                          Northwest


### If neighborhood = 'Not assigned' then set neighborhood = borough

In [8]:
print('Example:')
print('Postcode M7A old:')
print(df.loc[df['Postcode'] == 'M7A'])

df.loc[df['Neighborhood']=="Not assigned",'Neighborhood']=df.loc[df['Neighborhood']=="Not assigned",'Borough']

print('---------------------------------------')
print('Postcode M7A new:')
print(df.loc[df['Postcode'] == 'M7A'])

Example:
Postcode M7A old:
   Postcode       Borough  Neighborhood
85      M7A  Queen's Park  Not assigned
---------------------------------------
Postcode M7A new:
   Postcode       Borough  Neighborhood
85      M7A  Queen's Park  Queen's Park


## Show no. of rows and colums in the dataframe

In [9]:
df.shape

(103, 3)

## Show no. of boroughs and neighborhoods in the dataframe

In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


## This Jupyter Notebook is available on GitHub
('Applied Data Science Capstone Week 3.ipynb')

https://github.com/steveshep/Coursera_Capstone/blob/master/Applied%20Data%20Science%20Capstone%20Week%203.ipynb

## Get lattitude and longitude for each postcode in the dataframe 
### Use CSV file, geocoder.google() does not not work

In [11]:
# copy dataframe for further processing with geo data
geo_df = df

# add columns latitude and Longitude to new dataframe
geo_df['Latitude'] = ''
geo_df['Longitude'] = ''

In [12]:
# read csv file with geo coordinates for Postcodes into dataframe as the geocoders don't work very well
geo_coordinates = pd.read_csv('https://cocl.us/Geospatial_data')

In [13]:
# define function to get lat and long out of coordinates dataframe
def get_geo_coord(df_pc):
    lat   = geo_coordinates.loc[geo_coordinates['Postal Code'] == df_pc].iloc[0]['Latitude']
    long = geo_coordinates.loc[geo_coordinates['Postal Code'] == df_pc].iloc[0]['Longitude']
    return lat, long

# loop to add lattitude and longitude to dataframe
for i in range(0,len(geo_df)):
    geo_df['Latitude'][i], geo_df['Longitude'][i] = get_geo_coord(geo_df.iloc[i]['Postcode'])

## Use a dataframe containing only boroughs that contain the word 'Toronto'

In [14]:
geo_df_to = geo_df[geo_df['Borough'].str.contains('Toronto')].reset_index(drop=True)

geo_df_to.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.6764,-79.293
1,M4K,East Toronto,"The Danforth West, Riverdale",43.6796,-79.3522
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.669,-79.3156
3,M4M,East Toronto,Studio District,43.6595,-79.3409
4,M4N,Central Toronto,Lawrence Park,43.728,-79.3888


## Define Foursquare credentials and version

In [15]:
# @hidden cell
CLIENT_ID = 'WWNDAHJPMN04XWSXX2APHYS3NNHBCFQS2PI3KYGONDFLFEDX' # your Foursquare ID
CLIENT_SECRET = 'X5Q5HJCGTDL2FYWTREOLGR202XHBPIVYDPZQAHHWKHMRVH1N' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version

# Credentials removed before uploaded to GitHub
#CLIENT_ID = '...' # Foursquare ID
#CLIENT_SECRET = '...' # Foursquare Secret
#VERSION = '20180604' # Foursquare API version

### Use 'folium' map rendering library

In [29]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

Collecting package metadata: done
Solving environment: done


  current version: 4.6.14
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.



## Explore the first neighborhood in Toronto

In [31]:
#show name of first neighborhood
geo_df_to.loc[0, 'Neighborhood']

'The Beaches'

In [32]:
latitude = geo_df_to.loc[0, 'Latitude']
longitude = geo_df_to.loc[0, 'Longitude']

### Show map

In [33]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(geo_df_to['Latitude'], geo_df_to['Longitude'], geo_df_to['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Get the top 10 venues that are in The Beaches within a radius of 500 meters

In [34]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=WWNDAHJPMN04XWSXX2APHYS3NNHBCFQS2PI3KYGONDFLFEDX&client_secret=X5Q5HJCGTDL2FYWTREOLGR202XHBPIVYDPZQAHHWKHMRVH1N&v=20180604&ll=43.67635739999999,-79.2930312&radius=500&limit=100'

In [37]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5de46e527828ae001b6ec96a'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 6,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682094413784,
          'lng': -79.29394208780985}],
        'distanc

### All the information is in the *items* key. We borrow the **get_category_type** function from the Foursquare lab.

In [38]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

### Now we are ready to clean the json and structure it into a *pandas* dataframe

In [39]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Glen Stewart Ravine,Other Great Outdoors,43.6763,-79.294784
4,Domino's Pizza,Pizza Place,43.679058,-79.297382


In [40]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

6 venues were returned by Foursquare.


## Explore neighborhoods

### Create a function to repeat the same process to all the neighborhoods in Toronto

In [41]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

### Now write the code to run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [42]:
toronto_venues = getNearbyVenues(names=geo_df_to['Neighborhood'],
                                   latitudes=geo_df_to['Latitude'],
                                   longitudes=geo_df_to['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The Junction Sout

### Check the size of the resulting dataframe

In [43]:
print(toronto_venues.shape)
toronto_venues.head()

(1686, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Glen Stewart Ravine,43.6763,-79.294784,Other Great Outdoors
4,The Beaches,43.676357,-79.293031,Domino's Pizza,43.679058,-79.297382,Pizza Place


### Check how many venues were returned for each neighborhood

In [44]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,57,57,57,57,57,57
"Brockton, Exhibition Place, Parkdale Village",23,23,23,23,23,23
Business Reply Mail Processing Centre 969 Eastern,19,19,19,19,19,19
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",15,15,15,15,15,15
"Cabbagetown, St. James Town",42,42,42,42,42,42
Central Bay Street,82,82,82,82,82,82
"Chinatown, Grange Park, Kensington Market",92,92,92,92,92,92
Christie,17,17,17,17,17,17
Church and Wellesley,84,84,84,84,84,84


### Find out how many unique categories can be curated from all the returned venues

In [45]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 235 uniques categories.


## Analyze each neighborhood

In [46]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Show the new dataframe size

In [47]:
toronto_onehot.shape

(1686, 235)

### Group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [48]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.066667,0.066667,0.066667,0.133333,0.133333,0.133333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,...,0.0,0.0,0.0,0.0,0.012195,0.0,0.0,0.012195,0.0,0.0
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01087,0.0,0.0,0.0,0.021739,0.0,0.043478,0.01087,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.011905,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,...,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.011905


#### Confirm the new size

In [49]:
toronto_grouped.shape

(38, 235)

#### Print each neighborhood along with the top 5 most common venues

In [50]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
             venue  freq
0      Coffee Shop  0.07
1             Café  0.05
2       Steakhouse  0.04
3              Bar  0.04
4  Thai Restaurant  0.04


----Berczy Park----
            venue  freq
0     Coffee Shop  0.07
1          Bakery  0.05
2      Steakhouse  0.04
3  Farmers Market  0.04
4            Café  0.04


----Brockton, Exhibition Place, Parkdale Village----
                   venue  freq
0         Breakfast Spot  0.09
1            Coffee Shop  0.09
2                   Café  0.09
3  Performing Arts Venue  0.09
4                 Bakery  0.09


----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0  Light Rail Station  0.11
1         Yoga Studio  0.05
2                 Spa  0.05
3       Garden Center  0.05
4              Garden  0.05


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0    Airport Lounge  0.13
1   Airport S

### Put that into a *pandas* dataframe

#### First, write a function to sort the venues in descending order

In [51]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Now create the new dataframe and display the top 10 venues for each neighborhood.

In [52]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant,Bakery,Burger Joint,Restaurant,Cosmetics Shop,Sushi Restaurant
1,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Café,Cheese Shop,Seafood Restaurant,Steakhouse,Beer Bar,Farmers Market,Creperie
2,"Brockton, Exhibition Place, Parkdale Village",Performing Arts Venue,Coffee Shop,Café,Breakfast Spot,Bakery,Gym,Intersection,Pet Store,Grocery Store,Climbing Gym
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Recording Studio,Smoke Shop,Skate Park,Brewery,Burrito Place,Butcher,Restaurant,Park
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Terminal,Airport Lounge,Airport Service,Plane,Harbor / Marina,Coffee Shop,Sculpture Garden,Boutique,Boat or Ferry,Airport Gate


## Cluster neighborhoods

### Run *k*-means to cluster the neighborhood into 5 clusters

In [53]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int32)

#### Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [54]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = geo_df_to

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.6764,-79.293,0,Pizza Place,Trail,Pub,Other Great Outdoors,Health Food Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Department Store,Donut Shop
1,M4K,East Toronto,"The Danforth West, Riverdale",43.6796,-79.3522,3,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore,Furniture / Home Store,Restaurant,Pizza Place,Brewery,Bubble Tea Shop
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.669,-79.3156,3,Park,Pizza Place,Pub,Liquor Store,Light Rail Station,Burger Joint,Sandwich Place,Fast Food Restaurant,Burrito Place,Fish & Chips Shop
3,M4M,East Toronto,Studio District,43.6595,-79.3409,3,Café,Coffee Shop,Italian Restaurant,American Restaurant,Bakery,Brewery,Stationery Store,Bar,Fish Market,Coworking Space
4,M4N,Central Toronto,Lawrence Park,43.728,-79.3888,2,Park,Swim School,Bus Line,Wings Joint,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


#### Finally, visualize the resulting clusters

In [55]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine clusters

### Now we can examine each cluster and determine the discriminating venue categories that distinguish each cluster

#### Cluster 1

In [56]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0,
                     toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Pizza Place,Trail,Pub,Other Great Outdoors,Health Food Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Department Store,Donut Shop
5,Central Toronto,0,Gym,Clothing Store,Sandwich Place,Asian Restaurant,Food & Drink Shop,Hotel,Breakfast Spot,Park,Electronics Store,Eastern European Restaurant
8,Central Toronto,0,Gym,Intersection,Trail,Tennis Court,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
23,Central Toronto,0,Trail,Mexican Restaurant,Jewelry Store,Sushi Restaurant,Wings Joint,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


#### Cluster 2

In [57]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1,
                     toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,1,Garden,Wings Joint,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


#### Cluster 3

In [58]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2,
                     toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,2,Park,Swim School,Bus Line,Wings Joint,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


#### Cluster 4

In [59]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3,
                     toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East Toronto,3,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore,Furniture / Home Store,Restaurant,Pizza Place,Brewery,Bubble Tea Shop
2,East Toronto,3,Park,Pizza Place,Pub,Liquor Store,Light Rail Station,Burger Joint,Sandwich Place,Fast Food Restaurant,Burrito Place,Fish & Chips Shop
3,East Toronto,3,Café,Coffee Shop,Italian Restaurant,American Restaurant,Bakery,Brewery,Stationery Store,Bar,Fish Market,Coworking Space
6,Central Toronto,3,Clothing Store,Coffee Shop,Sporting Goods Shop,Gym / Fitness Center,Metro Station,Mexican Restaurant,Diner,Dessert Shop,Park,Chinese Restaurant
7,Central Toronto,3,Sandwich Place,Pizza Place,Dessert Shop,Café,Coffee Shop,Gym,Italian Restaurant,Sushi Restaurant,Flower Shop,Japanese Restaurant
9,Central Toronto,3,Pub,Coffee Shop,Pizza Place,Light Rail Station,Sports Bar,Bagel Shop,Restaurant,Supermarket,Sushi Restaurant,Fried Chicken Joint
11,Downtown Toronto,3,Coffee Shop,Park,Pizza Place,Restaurant,Café,Pub,Italian Restaurant,Bakery,Diner,Indian Restaurant
12,Downtown Toronto,3,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Pub,Men's Store,Gastropub,Gym,Hotel
13,Downtown Toronto,3,Coffee Shop,Pub,Park,Bakery,Café,Breakfast Spot,Mexican Restaurant,Theater,Spa,Electronics Store
14,Downtown Toronto,3,Coffee Shop,Clothing Store,Café,Fast Food Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Bakery,Japanese Restaurant,Italian Restaurant,Bubble Tea Shop


#### Cluster 5

In [60]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4,
                     toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Downtown Toronto,4,Park,Playground,Trail,Wings Joint,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


## End of notebook
### Applied Data Science Capstone - Week 3