# Segmenting and Clustering Neighborhoods in Toronto

Explore, segment, and cluster the neighborhoods in the city of Toronto.

This notebook contains all parts of the assignment, labeled with the headings Part 1, Part 2, and Part 3.

## Part 1 

### Data Cleaning and Preprocessing
In this section we clean and prepare the data using the following guidelines:

1. Rename the Postcode column to PostalCode
2. Remove all rows with unassigned Boroughs
3. Aggregate the data so that Neighbourhoods sharing a postal code are combined into one comma-separated value
4. Update the value of unassigned Neighbourhoods with the value of the Borough

In [82]:
import pandas as pd
import numpy as np


Read in the data from Wikipedia using the pd.read_html helper from the Pandas library



In [83]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

In [84]:
#1. Rename column
df.rename(columns={'Postcode': 'PostalCode'}, inplace=True)

In [85]:
#2. Filter out unassigned boroughs
filtered_data = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
   
#3. Group by postal code to show neighborhoods sharing same postal code
groups = filtered_data.groupby(['PostalCode', 'Borough'], as_index=False).agg(lambda x: ','.join(x))

#4. Update any neighborhoods that are unassigned and give them the name of the borough
unassigned = groups['Neighbourhood'] == 'Not assigned'
groups.loc[unassigned, 'Neighbourhood'] = groups.loc[unassigned, 'Borough']
# groups.loc[matching rows, column] = values

#groups
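
The four cleaning steps can be checked on a small in-memory frame before running them against the live Wikipedia table. This is a minimal sketch: the rows in `raw` are toy stand-ins (an assumption, not the scraped data), and the name `cleaned` is illustrative only.

```python
import pandas as pd

# Toy stand-in for the scraped table (hypothetical rows, same column names).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Regent Park", "Not assigned"],
})

# 1. Rename the column.
cleaned = raw.rename(columns={"Postcode": "PostalCode"})

# 2. Drop rows whose borough is unassigned.
cleaned = cleaned[cleaned["Borough"] != "Not assigned"]

# 3. Join the neighbourhoods that share a postal code into one string.
cleaned = (
    cleaned.groupby(["PostalCode", "Borough"], as_index=False)
    .agg({"Neighbourhood": ",".join})
)

# 4. Fall back to the borough name where the neighbourhood is unassigned.
mask = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[mask, "Neighbourhood"] = cleaned.loc[mask, "Borough"]

print(cleaned)
```

Passing the column name explicitly to `agg` keeps the join limited to `Neighbourhood`, so the step does not depend on which other columns the scraped table happens to contain.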

Show the shape of the cleaned data set

In [86]:
groups.shape

(103, 3)

## Part 2

Import the geo_data containing the Latitude and Longitude for each postal code and merge it into the original dataframe


In [87]:
geo_data = pd.read_csv("https://cocl.us/Geospatial_data")

First rename the column so we can use it for the key column in our merge.

In [88]:

# rename the key column so both dataframes share the name 'PostalCode' for the merge
geo_data.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)


Perform an inner join merge using the PostalCode column as the key

In [128]:
# perform an inner join merge using the PostalCode as the key
groups = groups.merge(geo_data, on='PostalCode', how='inner')

groups.head()

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [90]:
groups.shape

(103, 5)
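
In a notebook, a merge cell that reassigns `groups` can be executed more than once; a second run joins the coordinates onto a frame that already has them, and pandas then disambiguates the duplicates with `_x` and `_y` suffixes. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `add_coordinates` and toy frames `left` and `right`:

```python
import pandas as pd


def add_coordinates(frame: pd.DataFrame, coords: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: inner-join coordinates once, keyed on PostalCode."""
    if {"Latitude", "Longitude"}.issubset(frame.columns):
        # Already merged: return unchanged to avoid _x/_y duplicate columns.
        return frame
    # validate="one_to_one" raises if PostalCode is duplicated on either side.
    return frame.merge(coords, on="PostalCode", how="inner", validate="one_to_one")


# Toy frames (assumed values for illustration).
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

once = add_coordinates(left, right)
twice = add_coordinates(once, right)  # safe to call again
print(twice.columns.tolist())
```

The inner join drops `M1E` because it has no match in `left`, which is the same behaviour the notebook relies on when it keeps only postal codes present in both tables.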

## Part 3

Data Analysis - Cluster and segment the neighborhoods


Get a subset of the dataframe that includes only the Boroughs whose name contains the word Toronto

In [129]:
# Get a dataframe with only Boroughs containing 'Toronto' 
neighborhoods = groups[groups['Borough'].str.contains('Toronto')].reset_index()

neighborhoods.head()


| | index | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | 41 | M4K | East Toronto | The Danforth West,Riverdale | 43.679557 | -79.352188 |
| 2 | 42 | M4L | East Toronto | The Beaches West,India Bazaar | 43.668999 | -79.315572 |
| 3 | 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
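
The borough filter relies on `Series.str.contains`, which treats its pattern as a regular expression by default and returns a missing value for missing entries. A minimal sketch on a toy column (the `boroughs` frame is an assumption for illustration) showing the `regex=False` and `na=False` options, which make the match a plain substring test and keep the mask strictly boolean:

```python
import pandas as pd

# Toy column with one missing value (hypothetical data).
boroughs = pd.DataFrame({
    "Borough": ["East Toronto", "Scarborough", "Central Toronto", None, "North York"],
})

# regex=False: literal substring match; na=False: missing values count as no match.
mask = boroughs["Borough"].str.contains("Toronto", regex=False, na=False)

# drop=True discards the old index instead of keeping it as a column.
toronto_only = boroughs[mask].reset_index(drop=True)
print(toronto_only)
```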


In [130]:
# clean up dataframe
neighborhoods.drop(['index', 'PostalCode'], inplace=True, axis=1)
neighborhoods.head()


| | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | East Toronto | The Danforth West,Riverdale | 43.679557 | -79.352188 |
| 2 | East Toronto | The Beaches West,India Bazaar | 43.668999 | -79.315572 |
| 3 | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |


In [112]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 4 boroughs and 39 neighborhoods.


Install geopy and Folium so we can visualize the Neighborhoods on a map

In [None]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab

In [107]:
import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [108]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="t_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
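
geopy's `geocode` returns `None` when the service finds no match, so reading `.latitude` directly can raise an `AttributeError`. A minimal sketch of a guard, assuming a hypothetical helper `lookup_coordinates`; the `StubGeolocator` class is a stand-in so the example runs without network access and is not part of geopy:

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def lookup_coordinates(geolocator, address: str) -> Optional[Tuple[float, float]]:
    """Hypothetical helper: return (lat, lng), or None if the address is not found."""
    location = geolocator.geocode(address)
    if location is None:
        return None
    return (location.latitude, location.longitude)


class StubGeolocator:
    """Stand-in for a geocoder so the sketch runs offline (not a geopy class)."""

    _known = {"Toronto, Canada": (43.653963, -79.387207)}

    def geocode(self, address):
        if address not in self._known:
            return None
        lat, lng = self._known[address]
        return SimpleNamespace(latitude=lat, longitude=lng)


found = lookup_coordinates(StubGeolocator(), "Toronto, Canada")
missing = lookup_coordinates(StubGeolocator(), "Nowhere, Canada")
print(found, missing)
```

In the notebook, the same helper could be called with the `Nominatim` instance in place of the stub.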


In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add one circle marker per neighbourhood, labelled "neighbourhood, borough"
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

![Toronto Neighborhoods](toronto_neighborhoods.png)

Select one Borough and plot its Neighbourhoods on a map

In [115]:
east_toronto = neighborhoods[neighborhoods['Borough'] == 'East Toronto'].reset_index(drop=True)
east_toronto.head()

| | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | East Toronto | The Danforth West,Riverdale | 43.679557 | -79.352188 |
| 2 | East Toronto | The Beaches West,India Bazaar | 43.668999 | -79.315572 |
| 3 | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | East Toronto | Business Reply Mail Processing Centre 969 Eastern | 43.662744 | -79.321558 |


In [116]:
address = 'East Toronto, Canada'

geolocator = Nominatim(user_agent="t_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of East Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of East Toronto are 43.626243, -79.396962.
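
The geocoded point for 'East Toronto, Canada' lies south-west of every East Toronto row listed above, so a map centred on it may show the markers off to one side. One alternative, sketched here as an assumption rather than the notebook's method, is to centre the map on the mean of the borough's own coordinates. The frame `east_toronto_sample` copies the five rows shown in the table; `map_center` is an illustrative name.

```python
import pandas as pd

# The five East Toronto rows shown in the notebook output.
east_toronto_sample = pd.DataFrame({
    "Neighbourhood": [
        "The Beaches",
        "The Danforth West,Riverdale",
        "The Beaches West,India Bazaar",
        "Studio District",
        "Business Reply Mail Processing Centre 969 Eastern",
    ],
    "Latitude": [43.676357, 43.679557, 43.668999, 43.659526, 43.662744],
    "Longitude": [-79.293031, -79.352188, -79.315572, -79.340923, -79.321558],
})

# Arithmetic mean of the coordinates; adequate for points this close together.
center = east_toronto_sample[["Latitude", "Longitude"]].mean()
map_center = [center["Latitude"], center["Longitude"]]
print(map_center)
```

The resulting `map_center` list has the `[latitude, longitude]` shape that `folium.Map(location=...)` expects.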


In [None]:
# create map of East Toronto using latitude and longitude values
map_east_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add one circle marker per neighbourhood, labelled with its name
for lat, lng, label in zip(east_toronto['Latitude'], east_toronto['Longitude'], east_toronto['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_east_toronto)

map_east_toronto

![East Toronto](east_toronto.png)