# Toronto Neighborhoods Clustering
In this notebook I explore and cluster the neighborhoods in the Toronto area.

## Table of Contents

<div class="alert alert-block alert-success" style="margin-top: 20px">

<font size = 4>

1. <a href="#item1">Defining the data</a><br />
<br />

2. <a href="#item2">Determining Geographical Coordinates</a>
<br />

3. <a href="#item3">Explore and cluster the neighborhoods in Toronto</a><br />
    a. <a href="#item3a">Analyzing Neighborhoods</a><br />
    b. <a href="#item3b">K-Means Clustering</a><br />
    c. <a href="#item3c">Visualizing Clusters on a Map</a>
    
</font>
</div>

<a id='item1'></a>

### Part 1: Defining the data
In this section I build the dataframe used in this project. The data is scraped from the table of Toronto postal codes at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
#importing necessary libraries
import pandas as pd
import numpy as np
import lxml
import requests
from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors
import folium # map rendering library

In [2]:
#scraping the wikipedia page
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
toronto_df = dfs[0] #the first table of the page contains the required data
toronto_df.head()


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Cleaning the data:
* Remove all postcodes whose borough is not assigned
* Group all neighborhoods that share a postcode into one row
* Replace any "Not assigned" neighborhood with its borough name

In [3]:
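# drop postcodes that have no borough
toronto_df = toronto_df[toronto_df.Borough != 'Not assigned'].reset_index(drop=True)
# an unassigned neighbourhood takes its borough name (done before grouping so the joined strings stay clean)
toronto_df['Neighbourhood'] = toronto_df['Neighbourhood'].mask(toronto_df['Neighbourhood'] == 'Not assigned', toronto_df['Borough'])
# one row per postcode, neighbourhoods joined with commas
toronto_df = toronto_df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
toronto_df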

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie..."
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."


In [4]:
print(toronto_df.shape)

(103, 3)
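
The cleaning steps above can be checked on a tiny hand-made frame. This is only a sketch: the four rows below are invented for illustration and are not the Wikipedia data, although the column names follow the notebook. It fills in unassigned neighbourhoods before grouping.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only
raw = pd.DataFrame({
    'Postcode':      ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':       ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. drop postcodes with no borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. an unassigned neighbourhood takes the borough name
unassigned = clean['Neighbourhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighbourhood'] = clean.loc[unassigned, 'Borough']

# 3. one row per postcode, neighbourhoods joined with commas
clean = (clean.groupby(['Postcode', 'Borough'])['Neighbourhood']
              .apply(','.join)
              .reset_index())
print(clean)
```

The result should have one row per remaining postcode, with the two M5A neighbourhoods joined into a single string.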


<a id='item2'></a>

### Part 2: Determining Geographical Coordinates
In this section I retrieve the latitude and longitude of each postal code in the toronto_df dataframe. The Geocoder Python package was the first choice, but it proved unreliable, so the coordinates are read from a CSV file instead.

In [5]:
#defining a function to retrieve the latitude and longitude given the postal code
'''
#this function will not be used because the geocoder function is not reliable
import geocoder
def get_coord(df):
    
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
      print(df.Postcode)
      g = geocoder.google('{}, Toronto, Ontario'.format(df.Postcode))
      lat_lng_coords = g.latlng
    print('found')
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
'''

In [6]:
#Because the geocoder function is unreliable I am importing the data from the provided csv
df_coord = pd.read_csv("http://cocl.us/Geospatial_data")
df_coord.rename(columns={'Postal Code':'Postcode'}, inplace=True)
df_coord.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
toronto_df = pd.merge(toronto_df, df_coord, how='inner', on = 'Postcode')
toronto_df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",43.739416,-79.588437
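
An inner merge silently drops any postcode that has no match in the coordinates file. The sketch below shows one way to surface such rows, using the `indicator` and `validate` options of `DataFrame.merge`. The two small frames are hypothetical stand-ins for `toronto_df` and `df_coord`, and the postcode `M9Z` is made up so that one row fails to match.

```python
import pandas as pd

# Hypothetical miniature versions of toronto_df and df_coord
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Etobicoke']})
right = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# A left merge with indicator=True keeps every postcode and flags unmatched ones;
# validate='one_to_one' raises if a postcode is duplicated on either side
checked = left.merge(right, how='left', on='Postcode',
                     validate='one_to_one', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

If `missing` is empty, the inner merge used in the notebook loses no rows.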


<a id='item3'></a>

### Part 3: Explore and cluster the neighborhoods in Toronto
Here I analyze the Toronto neighborhoods and cluster them using K-means.

<a id='item3a'></a>

#### Analyzing neighborhoods
First I define a function that retrieves the venues near a given location from the Foursquare API.

In [8]:
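# Defining Foursquare credentials info
# (read from environment variables so the secrets are not stored in the notebook;
#  the variable names below are this notebook's own choice)
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'
LIMIT = 100
radius = 500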

# Defining the function
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
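        # make the GET request; skip this location if it fails or the response is malformed
        # (otherwise `results` would be undefined or reused from the previous location)
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        except (requests.RequestException, ValueError, KeyError, IndexError):
            continue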
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
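
The function above reads several nested keys from the JSON response, so a single missing key can break the whole loop. As a rough sketch, the extraction step could be pulled into a small helper that tolerates malformed payloads. The helper name `extract_venue_rows` and both sample payloads are hypothetical; they only mirror the fields the function above already accesses and are not taken from the Foursquare documentation.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Return one tuple per venue, or an empty list if the payload is malformed."""
    try:
        items = payload['response']['groups'][0]['items']
    except (KeyError, IndexError, TypeError):
        return []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        # fall back to an empty category when the list is missing or empty
        categories = venue.get('categories') or [{}]
        rows.append((name, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     categories[0].get('name')))
    return rows

# Made-up payload shaped like the fields the notebook's function reads
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'No Category Spot',
               'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}

rows = extract_venue_rows(sample, 'Downtown Toronto', 43.6532, -79.3832)
print(rows)
# A payload without a 'response' key yields no rows instead of raising
print(extract_venue_rows({'meta': {'code': 429}}, 'X', 0.0, 0.0))
```

Keeping the parsing separate from the HTTP call also makes it possible to test the parsing without network access.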

Now I fetch the venues around each postcode's coordinates, label them by borough, and one-hot encode the venue categories

In [9]:
toronto_venues = getNearbyVenues(names=toronto_df['Borough'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
toronto_onehot['Borough'] = toronto_venues['Borough'] 

# move borough column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
toronto_grouped = toronto_onehot.groupby('Borough').mean().reset_index()
toronto_grouped

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Central Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018519,...,0.0,0.009259,0.0,0.0,0.009259,0.0,0.0,0.0,0.0,0.009259
1,Downtown Toronto,0.0,0.00077,0.00077,0.00077,0.00077,0.001541,0.002311,0.001541,0.013867,...,0.002311,0.013867,0.002311,0.0,0.004622,0.0,0.006934,0.00077,0.0,0.002311
2,East Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02459,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02459
3,East York,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.013514,0.0,0.013514,0.0,0.0,0.0,0.013514
4,Etobicoke,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0
5,Mississauga,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North York,0.003922,0.0,0.003922,0.0,0.0,0.0,0.0,0.0,0.007843,...,0.0,0.0,0.003922,0.003922,0.007843,0.0,0.0,0.003922,0.011765,0.0
7,Queen's Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.02439
8,Scarborough,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010989,...,0.0,0.0,0.0,0.0,0.010989,0.0,0.0,0.0,0.0,0.0
9,West Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.011364,0.0,0.0,0.011364,0.0,0.005682,0.0,0.0,0.005682
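
Taking the mean of one-hot columns within a group gives the share of that group's venues that fall in each category, which is why every value in the table above lies between 0 and 1. A minimal sketch on invented data (the boroughs `A` and `B` and their venues are made up):

```python
import pandas as pd

# Invented venue list for two boroughs
venues = pd.DataFrame({
    'Borough': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bakery', 'Park', 'Park'],
})

# one indicator column per category, stored as floats
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Borough', venues['Borough'])

# mean of the indicators = relative frequency of each category per borough
freq = onehot.groupby('Borough').mean()
print(freq)
```

Each row of `freq` sums to 1, so boroughs with very different venue counts remain comparable.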


Printing the top 5 venues for each borough (the print statements are currently commented out)

In [11]:
num_top_venues = 5

for b in toronto_grouped['Borough']:
    #print("----"+b+"----")
    temp = toronto_grouped[toronto_grouped['Borough'] == b].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    #print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    #print('\n')

Creating a function to sort the venues in descending order of frequency

In [12]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [13]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions past the 3rd use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Borough'] = toronto_grouped['Borough']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,Coffee Shop,Sandwich Place,Park,Café,Pizza Place,Clothing Store,Restaurant,Sushi Restaurant,Dessert Shop,Gym
1,Downtown Toronto,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Bakery,Japanese Restaurant,Bar,Park,Gastropub
2,East Toronto,Greek Restaurant,Coffee Shop,Italian Restaurant,Café,Brewery,Ice Cream Shop,Park,American Restaurant,Bakery,Yoga Studio
3,East York,Coffee Shop,Burger Joint,Pizza Place,Park,Bank,Sporting Goods Shop,Pharmacy,Fast Food Restaurant,Sandwich Place,Liquor Store
4,Etobicoke,Pizza Place,Sandwich Place,Coffee Shop,Pharmacy,Fast Food Restaurant,Grocery Store,Gym,Liquor Store,Park,Beer Store
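
The column headers above rely on a `try`/`except` around a three-element suffix list, which works for the ten columns used here. As an alternative sketch, a small helper can compute the suffix directly and also handles cases such as 11th or 21st. The function name `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 10th through 20th (and 110th through 120th, ...) always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Borough'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```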


<a id='item3b'></a>

#### K-Means Clustering

In [14]:
# set number of clusters
kclusters = 5

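# keep only the numeric frequency columns (the positional axis argument was removed in pandas 2.0)
toronto_grouped_clustering = toronto_grouped.drop(columns='Borough')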

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 3, 3, 0, 1, 4, 3, 1], dtype=int32)
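
The notebook fixes `kclusters = 5` without comparing alternatives. One common sanity check, sketched below under the assumption that an elbow-style comparison is acceptable here, is to fit K-means for several values of k and compare the inertia (within-cluster sum of squares). The data is synthetic, three well-separated 2-D groups, and is not the Toronto data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points in three well-separated groups (not the Toronto data)
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))

# Labels for the k that matches the number of synthetic groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

On data like this the inertia drops sharply up to k = 3 and flattens afterwards. With only ten boroughs in `toronto_grouped`, such a curve would be coarse, so it is best read as a rough guide.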

In [15]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_df

# join the cluster labels and top venues onto toronto_df, which holds the latitude/longitude of each postcode
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Borough'), on='Borough')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,3,Coffee Shop,Breakfast Spot,Fast Food Restaurant,Pizza Place,Chinese Restaurant,Bakery,Pharmacy,Indian Restaurant,Electronics Store,Skating Rink
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,3,Coffee Shop,Breakfast Spot,Fast Food Restaurant,Pizza Place,Chinese Restaurant,Bakery,Pharmacy,Indian Restaurant,Electronics Store,Skating Rink
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,3,Coffee Shop,Breakfast Spot,Fast Food Restaurant,Pizza Place,Chinese Restaurant,Bakery,Pharmacy,Indian Restaurant,Electronics Store,Skating Rink
3,M1G,Scarborough,Woburn,43.770992,-79.216917,3,Coffee Shop,Breakfast Spot,Fast Food Restaurant,Pizza Place,Chinese Restaurant,Bakery,Pharmacy,Indian Restaurant,Electronics Store,Skating Rink
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,3,Coffee Shop,Breakfast Spot,Fast Food Restaurant,Pizza Place,Chinese Restaurant,Bakery,Pharmacy,Indian Restaurant,Electronics Store,Skating Rink


<a id='item3c'></a>

#### Visualizing the clusters on a map

In [16]:
# create map
t_latitude = 43.6532
t_longitude = -79.3832
map_clusters = folium.Map(location=[t_latitude, t_longitude], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Borough'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
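
The marker colors above come from matplotlib's `cm.rainbow`. As a sketch of an alternative that needs only the standard library, the helper below spaces hues evenly around the HSV color wheel and returns hex strings, the same kind of value the notebook already passes to `folium.CircleMarker`. The function name `cluster_palette` and the saturation and brightness values are arbitrary choices for illustration.

```python
import colorsys

def cluster_palette(n):
    """Return n hex colors evenly spaced around the HSV hue wheel."""
    palette = []
    for i in range(n):
        # hue i/n, fixed saturation and value; components come back in [0, 1]
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(5)
print(palette)
```

With this palette, a marker for cluster label `c` (an integer from 0 to `kclusters - 1`) would use `palette[c]`.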