<h1 style="color:Black; text-align:center; font-size:3em;">Segmenting and Clustering Neighbourhoods in Toronto</h1>
<h3 style="color:green; text-align:center; font-size:2em;">Scraping data, Creating dataframe, Analyzing data, Clustering</h3>

## Table of contents

### 1. Build a dataframe to use in a clustering project with Foursquare location data
    1a. Scrape Wikipedia
    1b. Build a table of Toronto's Neighbourhoods with their Postal codes
    1c. Create a dataframe with the previous table
    1d. Clean the previous dataframe

### 2. Assign the latitude and longitude coordinates for each Neighbourhood in the previous dataframe
    2a. Complete the dataframe: get the latitude and the longitude coordinates for each Neighbourhood
    
### 3. Explore and cluster the Neighbourhoods in Toronto
    3a. Create a map of Toronto with Neighbourhoods superimposed on top
    3b. Map the Neighbourhoods in Downtown Toronto
    3c. Explore and analyze the Neighbourhoods of Downtown Toronto
    3d. Cluster the Neighbourhoods of Downtown Toronto

<h4 style="color:Black; text-align:left">This is the new notebook for this project</h4>

<div class="alert alert-block alert-info">
<b>Tip:</b> Notebook Ready.
</div>

***

### 1. Build a dataframe to use in a clustering project with Foursquare location data

<h4 style="color:Blue; text-align:left">1a. Scraping Wikipedia</h4>

This is the URL: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [22]:
# Importing libraries

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup


Let's write the code to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data from its table of postal codes, and transform it into a pandas dataframe.

In [23]:
# Wikipedia url
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipage= requests.get(url).text

# Reading Wikipedia
wiki_data= BeautifulSoup(wikipage,'xml')
# print(wiki_data.prettify())

<h4 style="color:Blue; text-align:left">1b. Building the table</h4>

In [24]:
# Extracting data / building table

nh_table = wiki_data.find('table', class_ = 'wikitable sortable')
neightborhs = nh_table.find_all('tr')

neigh_table = []
for row in neightborhs:
    dat = row.text.split('\n')[1:-1]
    neigh_table.append(dat)
    
print("This is the table of Toronto's postcodes")
neigh_table[:5]

This is the table of Toronto's postcodes


[['Postcode', 'Borough', 'Neighbourhood'],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village']]
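
To see why the loop slices each row with `[1:-1]`: the text of a table row starts and ends with a newline, so splitting on `'\n'` leaves an empty string at each end. A minimal sketch on a hand-written string (the sample text is an assumption standing in for `row.text`):

```python
# Hypothetical stand-in for row.text of one <tr> in the postcode table.
sample_row_text = "\nM3A\nNorth York\nParkwoods\n"

parts = sample_row_text.split("\n")
# The split leaves an empty string before the first cell and after the last one.
cells = parts[1:-1]

print(parts)
print(cells)
```

If the page markup changes (for example, extra blank lines inside a cell), this positional slicing would need to be revisited.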

<h4 style="color:Blue; text-align:left">1c. Creating the dataframe</h4>

In [25]:
# Building DataFrame

nh_df = pd.DataFrame(neigh_table[1:], columns=neigh_table[0])
print("This is the dataframe of Toronto's postcodes")
nh_df.head()

This is the dataframe of Toronto's postcodes


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [26]:
# Shape of the initial dataframe

print('The shape of the initial dataframe is: {}'.format(nh_df.shape))

The shape of the initial dataframe is: (288, 3)


<div class="alert alert-block alert-info">
<b>Tip:</b> Dataframe Ready. Let's clean it.
</div>

***

<h4 style="color:Blue; text-align:left">1d. Cleaning the dataframe</h4>

In [27]:
# Filtering out the rows that contain "Not assigned" in the 'Borough' Column

nh_df = nh_df[nh_df['Borough'] != 'Not assigned']

print('This is the shape of the dataframe without "Not assigned" in the Borough column: {}'.format(nh_df.shape))

This is the shape of the dataframe without "Not assigned" in the Borough column: (211, 3)


In [28]:
# Combining Neighbourhoods with the same Postcode into one row

def nh_list(grouped):
    return ', '.join(sorted(grouped['Neighbourhood'].tolist()))
                    
nh_df = nh_df.groupby(['Postcode', 'Borough'])
nhood_df = nh_df.apply(nh_list).reset_index(name='Neighbourhood')
nhood_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
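
For readers who want to see what the `groupby` plus `apply` step does, here is the same grouping logic sketched in plain Python. The sample rows are hypothetical, and this is an illustration of the idea rather than how pandas implements it:

```python
from collections import defaultdict

# Hypothetical sample rows: (postcode, borough, neighbourhood).
rows = [
    ("M1B", "Scarborough", "Rouge"),
    ("M1B", "Scarborough", "Malvern"),
    ("M1G", "Scarborough", "Woburn"),
]

# Collect every neighbourhood that shares a (postcode, borough) key.
grouped = defaultdict(list)
for postcode, borough, neighbourhood in rows:
    grouped[(postcode, borough)].append(neighbourhood)

# Join each group's names alphabetically, as nh_list does.
combined = [
    (postcode, borough, ", ".join(sorted(names)))
    for (postcode, borough), names in sorted(grouped.items())
]
print(combined)
```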


In [29]:
print('This is the shape of the dataframe after combining rows: {}'.format(nhood_df.shape))

This is the shape of the dataframe after combining rows: (103, 3)


In [30]:
# Where the Neighbourhood is "Not assigned", use the Borough name instead

nhood_df.loc[nhood_df['Neighbourhood']=="Not assigned",'Neighbourhood'] = nhood_df.loc[nhood_df['Neighbourhood']=="Not assigned",'Borough']
nhood_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


***
<h3 style="color:Blue; text-align:center; font-size:1em">1. This is the dataframe of Toronto's Neighbourhoods+Postcodes</h3>

***

This is the dataframe of Toronto's neighbourhoods and postcodes, ready to be assigned latitude and longitude coordinates.

In [31]:
nhood_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [32]:
print('And this is the shape of this dataframe: {}'.format(nhood_df.shape))

And this is the shape of this dataframe: (103, 3)


### 2. Assign the latitude and longitude coordinates for each Neighbourhood in the previous dataframe 

<h4 style="color:Blue; text-align:left">2a. Getting the latitude and the longitude coordinates for each Neighbourhood</h4>

Downloading the dataset and reading it into a pandas dataframe:

In [33]:
nhCoord_df = pd.read_csv('https://cocl.us/Geospatial_data')
nhCoord_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [34]:
# Renaming the columns ("Postal Code" becomes "Postcode") so that the two dataframes can be merged

nhCoord_df.columns=['Postcode','Latitude','Longitude']
nhCoord_df.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Let's assign the longitude and latitude coordinates to each Neighbourhood

In [35]:
# Merging the two dataframes

TNh_df = pd.merge(nhood_df, nhCoord_df[['Postcode','Latitude', 'Longitude']], on='Postcode')
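
`pd.merge` performs an inner join by default, so any postcode present in only one of the two dataframes would be dropped without warning. A quick sanity check is to compare the key sets before merging. The sketch below uses hypothetical postcode lists in place of the two `Postcode` columns:

```python
# Hypothetical postcode lists standing in for the two 'Postcode' columns.
neighbourhood_codes = ["M1B", "M1C", "M1E"]
coordinate_codes = ["M1B", "M1C", "M1G"]

# Postcodes that would lose their row in an inner join.
missing_coordinates = sorted(set(neighbourhood_codes) - set(coordinate_codes))
unused_coordinates = sorted(set(coordinate_codes) - set(neighbourhood_codes))

print(missing_coordinates, unused_coordinates)
```

In this notebook the row count is 103 both before and after the merge, which suggests that no postcode was dropped.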

***
<h3 style="color:Blue; text-align:center; font-size:1em">2. The following is the dataframe of Toronto's Neighbourhoods+Coordinates</h3>

***

<div class="alert alert-block alert-info">
<b>Tip:</b> The Dataframe is Ready to work with Foursquare data
</div>

***

In [36]:
TNh_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [37]:
print('This is the shape of this dataframe: {}'.format(TNh_df.shape))

This is the shape of this dataframe: (103, 5)


### 3. Explore and cluster the neighborhoods in Toronto 

In [21]:
#!conda install -c conda-forge geopy --yes
#!conda install -c conda-forge folium=0.5.0 --yes




<h4 style="color:Black; text-align:left">Importing other dependencies that we will need</h4>

In [38]:
import json
from geopy.geocoders import Nominatim
import requests
from pandas import json_normalize  # top-level import available since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

Using the geopy library to get the **latitude** and **longitude** values of **Toronto**

To create an instance of the geocoder, we need to define a user_agent. I will name the agent Toronto.

In [39]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="Toronto")
location = geolocator.geocode(address)
lat_Tor = location.latitude
long_Tor = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(lat_Tor, long_Tor))

The geographical coordinates of Toronto are 43.653963, -79.387207.


<h4 style="color:Blue; text-align:left">3a. Creating a map of Toronto with Neighbourhoods superimposed on top</h4>

Using the dataframe that was created in step 2: **TNh_df**

In [40]:
map_toronto = folium.Map(location=[lat_Tor, long_Tor], zoom_start=10)

# adding markers to Toronto map
for lat, lng, borough, Neighbourhood in zip(TNh_df['Latitude'], TNh_df['Longitude'], TNh_df['Borough'], TNh_df['Neighbourhood']):
    label = '{}, {}'.format(Neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

***

<h4 style="color:Black; text-align:left">Slicing the original dataframe and creating a new dataframe with the rows whose Borough is "Downtown Toronto"</h4>

In [41]:
# Slicing and creating a new dataframe

downtown_data = TNh_df[TNh_df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)  
downtown_data.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


***
Getting the geographical coordinates of Downtown Toronto

In [42]:
address = 'Downtown Toronto, TO'

geolocator = Nominatim(user_agent="Downtown Toronto")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6541737, -79.3808116451341.


***
<h4 style="color:Blue; text-align:left">3b. Mapping the Neighbourhoods in Downtown Toronto</h4>

Let's visualize the neighbourhoods in Downtown Toronto.

In [43]:
# creating map of Downtown Toronto using latitude and longitude values
map_downtown = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(downtown_data['Latitude'], downtown_data['Longitude'], downtown_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_downtown)  
    
map_downtown

#### Foursquare: Credentials and Version

In [67]:
CLIENT_ID = 'Private' # your Foursquare ID
CLIENT_SECRET = 'Private' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: Private
CLIENT_SECRET:Private


<h4 style="color:Blue; text-align:left">3c. Exploring and Analyzing Neighbourhoods of Downtown Toronto</h4>

List of the Neighbourhoods of Downtown Toronto

In [45]:
downtown_data

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


Getting the longitude and latitude values

In [46]:
neighbourhood_latitude = downtown_data.loc[0, 'Latitude'] # neighbourhood latitude value
neighbourhood_longitude = downtown_data.loc[0, 'Longitude'] # neighbourhood longitude value
neighbourhood_name = downtown_data.loc[0, 'Neighbourhood'] # neighbourhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of Rosedale are 43.6795626, -79.37752940000001.


<h4 style="color:Black; text-align:left">Getting the top 100 venues that are in Rosedale within a radius of 500 meters</h4>

In [47]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=Private&client_secret=Private&v=20180605&ll=43.6795626,-79.37752940000001&radius=500&limit=100'
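
As an alternative to assembling the query string with `str.format`, the parameters can be kept in a dictionary and encoded with the standard library. This is only a sketch: the credential values are placeholders, and the name `explore_url` is introduced here for illustration.

```python
from urllib.parse import urlencode

# Placeholder values; real credentials should never be printed or committed.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180605",
    "ll": "43.6795626,-79.3775294",
    "radius": 500,
    "limit": 100,
}

base_url = "https://api.foursquare.com/v2/venues/explore"
# urlencode percent-escapes reserved characters, e.g. the comma in "ll".
explore_url = base_url + "?" + urlencode(params)
print(explore_url)
```

Keeping the parameters in a dictionary also makes it easier to log the request without the secret, by printing only selected keys.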

Let's send the request and examine the results

In [48]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5da9c5f36bdee6002c355651'},
 'response': {'headerLocation': 'Rosedale',
  'headerFullLocation': 'Rosedale, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 5,
  'suggestedBounds': {'ne': {'lat': 43.6840626045, 'lng': -79.37131878274371},
   'sw': {'lat': 43.675062595499995, 'lng': -79.38374001725632}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bae2150f964a520df873be3',
       'name': 'Mooredale House',
       'location': {'address': '146 Crescent Rd.',
        'crossStreet': 'btwn. Lamport Ave. and Mt. Pleasant Rd.',
        'lat': 43.678630645646535,
        'lng': -79.38009142511322,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.678630645646535,
          'lng': -79.380091425113

#### Defining the **get_category_type** function, the same used in the Foursquare lab

This function extracts the name of the first category listed for a venue, or returns `None` if the venue has no category.

In [49]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
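
A short usage sketch may help here. It repeats the same logic as the function above so that it runs on its own, and applies it to two hypothetical rows shaped like flattened Foursquare items (plain dictionaries stand in for dataframe rows):

```python
def get_category_type(row):
    # Same logic as the function above, repeated so the sketch is self-contained.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows: one venue with two categories, one with none.
with_category = {'venue.categories': [{'name': 'Park'}, {'name': 'Trail'}]}
without_category = {'venue.categories': []}

first = get_category_type(with_category)
second = get_category_type(without_category)
print(first, second)
```

Only the first category is kept, so a venue tagged as both a park and a trail is counted as a park.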

Let's clean the JSON and structure it into a pandas dataframe.

In [50]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# Filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# Filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# Clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Mooredale House,Building,43.678631,-79.380091
1,Rosedale Park,Playground,43.682328,-79.378934
2,Whitney Park,Park,43.682036,-79.373788
3,Alex Murray Parkette,Park,43.6783,-79.382773
4,Milkman's Lane,Trail,43.676352,-79.373842


The number of venues that were returned by Foursquare

In [51]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues were returned by Foursquare.


<h4 style="color:Black; text-align:left">Exploring Neighbourhoods in Downtown Toronto</h4>

Let's create a function that repeats the same process for all the neighbourhoods in Downtown Toronto

In [52]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
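
One caveat: the expression `v['venue']['categories'][0]['name']` raises an `IndexError` if a venue comes back with an empty category list. The sketch below shows one possible way to tolerate that case. The helper name `extract_venue_rows` and the sample payload are hypothetical, not part of the original lab code:

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None instead of failing on an empty list.
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Hypothetical payload: the second venue has an empty category list.
sample_items = [
    {'venue': {'name': 'Rosedale Park',
               'location': {'lat': 43.682328, 'lng': -79.378934},
               'categories': [{'name': 'Playground'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.68, 'lng': -79.38},
               'categories': []}},
]

rows = extract_venue_rows('Rosedale', 43.679563, -79.377529, sample_items)
print(rows)
```

Rows with a `None` category would then need to be dropped or labelled before the one-hot encoding step.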

Writing the code to run the above function on each neighbourhood and create a new dataframe called downtown_venues.

In [53]:
downtown_venues = getNearbyVenues(names=downtown_data['Neighbourhood'],
                                   latitudes=downtown_data['Latitude'],
                                   longitudes=downtown_data['Longitude']
                                  )

Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
Bathurst Quay, CN Tower, Harbourfront West, Island airport, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie


***
What is the shape?

In [54]:
print('The shape is: {}' .format(downtown_venues.shape))
downtown_venues.head()

The shape is: (1299, 7)


Unnamed: 0,Neighborhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Mooredale House,43.678631,-79.380091,Building
1,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
2,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
3,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
4,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail


Checking the number of **venues** returned for each neighbourhood.

In [55]:
downtown_venues.groupby('Neighborhood').count()

Neighborhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
"Bathurst Quay, CN Tower, Harbourfront West, Island airport, King and Spadina, Railway Lands, South Niagara",14,14,14,14,14,14
Berczy Park,57,57,57,57,57,57
"Cabbagetown, St. James Town",45,45,45,45,45,45
Central Bay Street,89,89,89,89,89,89
"Chinatown, Grange Park, Kensington Market",100,100,100,100,100,100
Christie,15,15,15,15,15,15
Church and Wellesley,85,85,85,85,85,85
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"Design Exchange, Toronto Dominion Centre",100,100,100,100,100,100


Finding out how many unique categories can be curated from all the returned venues

In [56]:
print('There are {} unique categories.'.format(len(downtown_venues['Venue Category'].unique())))

There are 204 unique categories.


<h4 style="color:Black; text-align:left">Analyzing each Neighbourhood</h4>

In [57]:
# one hot encoding
downtown_onehot = pd.get_dummies(downtown_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
downtown_onehot['Neighborhood'] = downtown_venues['Neighborhood'] 

# move neighbourhood column to the first column
fixed_columns = [downtown_onehot.columns[-1]] + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


What is the shape of the previous dataframe (**downtown_onehot**)?

In [58]:
print('The shape is: {}' .format(downtown_onehot.shape))

The shape is: (1299, 204)


<h4 style="color:Black; text-align:left">Grouping rows by neighbourhood and taking the mean frequency of occurrence of each category</h4>

In [59]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01
1,"Bathurst Quay, CN Tower, Harbourfront West, Is...",0.0,0.0,0.071429,0.071429,0.142857,0.071429,0.142857,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0
3,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.011236,0.0,0.0,0.0,0.0,0.0,0.0,0.011236,0.0,...,0.0,0.0,0.0,0.0,0.011236,0.0,0.0,0.011236,0.0,0.0
5,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.05,0.0,0.04,0.01,0.0,0.0
6,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Church and Wellesley,0.023529,0.011765,0.0,0.0,0.0,0.0,0.0,0.011765,0.0,...,0.011765,0.0,0.0,0.0,0.0,0.011765,0.0,0.0,0.011765,0.0
8,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
9,"Design Exchange, Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0
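
Why does the mean of the one-hot columns give a frequency? Each venue contributes a 1 to exactly one category column, so a column's mean within a neighbourhood is that category's share of the neighbourhood's venues. A small sketch in plain Python, with hypothetical venue categories:

```python
from collections import Counter

# Hypothetical venue categories for one neighbourhood.
categories = ['Park', 'Park', 'Playground', 'Trail']

# One-hot encode: one row per venue, one 0/1 column per category.
columns = sorted(set(categories))
onehot = [[1 if cat == col else 0 for col in columns] for cat in categories]

# Column means of the one-hot rows ...
means = {col: sum(row[i] for row in onehot) / len(onehot)
         for i, col in enumerate(columns)}
# ... equal each category's share of the neighbourhood's venues.
shares = {cat: count / len(categories) for cat, count in Counter(categories).items()}

print(means)
```

This also means that neighbourhoods with few venues get coarse frequencies: with 14 venues, a single venue already accounts for about 0.07.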


***
What is the shape of the previous dataframe (**downtown_grouped**)?

In [60]:
print('The shape is: {}' .format(downtown_grouped.shape))

The shape is: (18, 204)


<h4 style="color:Black; text-align:left">Printing each Neighbourhood along with the top 5 most common venues</h4>

In [61]:
num_top_venues = 5

for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
             venue  freq
0      Coffee Shop  0.07
1             Café  0.05
2       Steakhouse  0.04
3              Bar  0.04
4  Thai Restaurant  0.04


----Bathurst Quay, CN Tower, Harbourfront West, Island airport, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0    Airport Lounge  0.14
1  Airport Terminal  0.14
2             Plane  0.07
3   Harbor / Marina  0.07
4          Boutique  0.07


----Berczy Park----
          venue  freq
0   Coffee Shop  0.07
1  Cocktail Bar  0.05
2          Café  0.04
3        Bakery  0.04
4    Steakhouse  0.04


----Cabbagetown, St. James Town----
                venue  freq
0         Coffee Shop  0.07
1  Chinese Restaurant  0.04
2          Restaurant  0.04
3              Bakery  0.04
4   Convenience Store  0.04


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.13
1  Italian Restaurant  0.06
2                Café  0.04
3      Ice Cream Shop  0.04
4        Burger

#### Creating a dataframe with the previous data

This is a function to sort the venues in descending order

In [62]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
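
The same ranking idea can be sketched without pandas: sort the categories by frequency and keep the first `n` names. The frequencies and the helper name `top_categories` below are hypothetical:

```python
# Hypothetical category frequencies for one neighbourhood.
freqs = {'Coffee Shop': 0.07, 'Café': 0.05, 'Steakhouse': 0.04, 'Yoga Studio': 0.0}

def top_categories(freqs, n):
    """Return the n category names with the highest frequency."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]

top_three = top_categories(freqs, 3)
print(top_three)
```

Python's `sorted` is stable, so tied categories keep their input order. `sort_values` with its default sort kind makes no such guarantee, which may be why Steakhouse and Bar swap places for "Adelaide, King, Richmond" between the top-5 printout and the top-10 table in this notebook.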

Creating the new dataframe and displaying the top 10 venues for each neighbourhood

In [63]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant,Asian Restaurant,Restaurant,Burger Joint,Hotel,Sushi Restaurant
1,"Bathurst Quay, CN Tower, Harbourfront West, Is...",Airport Lounge,Airport Terminal,Coffee Shop,Harbor / Marina,Sculpture Garden,Boat or Ferry,Bar,Plane,Boutique,Airport Food Court
2,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Beer Bar,Farmers Market,Bakery,Steakhouse,Cheese Shop,Café,Belgian Restaurant
3,"Cabbagetown, St. James Town",Coffee Shop,Chinese Restaurant,Restaurant,Pub,Café,Bakery,Pizza Place,Park,Convenience Store,Italian Restaurant
4,Central Bay Street,Coffee Shop,Italian Restaurant,Ice Cream Shop,Café,Sandwich Place,Burger Joint,Middle Eastern Restaurant,Bar,Chinese Restaurant,Sushi Restaurant


<h4 style="color:Blue; text-align:left">3d. Clustering the Neighbourhoods</h4>

Let's run **k-means** to cluster the neighbourhoods into 5 clusters.

In [64]:
# Setting number of clusters
kclusters = 5

downtown_grouped_clustering = downtown_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(downtown_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 3, 0, 0, 0, 4, 2, 0, 0, 0], dtype=int32)
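
To make the two steps of k-means concrete, here is a minimal pure-Python sketch of the basic algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not scikit-learn's implementation, which adds a smarter initialisation and other refinements. The starting centroids and sample vectors are assumptions chosen for illustration:

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign points to the nearest centroid, then re-average."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical frequency vectors: two coffee-heavy and two park-heavy neighbourhoods.
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = kmeans(points, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(labels)
```

The label numbers themselves carry no meaning; only which neighbourhoods share a label matters.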

Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighbourhood

In [65]:
# Add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

downtown_merged = downtown_data

# Join neighborhoods_venues_sorted to downtown_data to add the cluster label and top venues to each Neighbourhood
downtown_merged = downtown_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

downtown_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,1,Park,Playground,Trail,Building,Dessert Shop,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant,Donut Shop,Doner Restaurant
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,0,Coffee Shop,Chinese Restaurant,Restaurant,Pub,Café,Bakery,Pizza Place,Park,Convenience Store,Italian Restaurant
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Burger Joint,Restaurant,Yoga Studio,Gastropub,Fast Food Restaurant,Gym
3,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,0,Coffee Shop,Park,Pub,Bakery,Café,Mexican Restaurant,Breakfast Spot,Theater,Hotel,Greek Restaurant
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Coffee Shop,Clothing Store,Cosmetics Shop,Italian Restaurant,Café,Middle Eastern Restaurant,Restaurant,Bubble Tea Shop,Sporting Goods Shop,Ramen Restaurant


Let's map it

In [66]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(downtown_merged['Latitude'], downtown_merged['Longitude'], downtown_merged['Neighbourhood'], downtown_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
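
A note on the colour lookup: `rainbow[cluster-1]` sends cluster label 0 to the last colour through Python's negative indexing, so every cluster still gets a distinct colour, just shifted by one position. The sketch below builds a palette with the standard library and indexes it directly by label. The helper `cluster_colors` is hypothetical and only stands in for the `cm.rainbow` palette:

```python
import colorsys

def cluster_colors(k):
    """Return k evenly spaced hues as hex strings (a stand-in for cm.rainbow)."""
    hex_colors = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        hex_colors.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return hex_colors

palette = cluster_colors(5)
for cluster in range(5):
    # Direct indexing: label 0 uses the first colour, label 4 the last.
    print(cluster, palette[cluster])
```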