# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this notebook, I scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain its table of postal codes, load that table into a pandas dataframe, and then explore and cluster the neighborhoods in Toronto.


## Import all the dependencies

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset


In [2]:
# website url
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Extract all tables on the page as a list of dataframes
dft = pd.read_html(url)

# Get the first table
df = dft[0]

# Extract the columns Postal Code, Borough, and Neighbourhood, and rename them.
# Chaining rename (instead of inplace=True on a slice) avoids SettingWithCopyWarning.
df2 = df[['Postal Code', 'Borough', 'Neighbourhood']].rename(
    columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'}
)

# Print data shape
print('Dataframe shape is:', df2.shape)

# Show the first rows
df2.head(12)


Dataframe shape is: (180, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |

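Taking `dft[0]` assumes the postal code table is the first table on the page, which may stop being true if the page is edited. A minimal sketch of a more defensive selection follows. The helper `pick_table` is hypothetical (it is not part of pandas), and the toy dataframes only stand in for the list that `pd.read_html` returns.

```python
import pandas as pd

REQUIRED_COLUMNS = ['Postal Code', 'Borough', 'Neighbourhood']

def pick_table(tables, required=REQUIRED_COLUMNS):
    """Return the first dataframe that contains every required column.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    for table in tables:
        if set(required).issubset(table.columns):
            return table
    raise ValueError(f'No table contains the columns {required}')

# Toy stand-ins for the list returned by pd.read_html(url)
toy_tables = [
    pd.DataFrame({'Note': ['navigation box']}),
    pd.DataFrame({
        'Postal Code': ['M1A', 'M3A'],
        'Borough': ['Not assigned', 'North York'],
        'Neighbourhood': ['Not assigned', 'Parkwoods'],
    }),
]

postal_table = pick_table(toy_tables)
print(postal_table.shape)
```

In the notebook this would be called as `pick_table(dft)` in place of `dft[0]`, so a change in table order raises an error instead of silently selecting the wrong table.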

<a id='item2'></a>
## 2. Remove rows whose borough is not assigned

In [3]:
# Ignore cells with a borough that is Not assigned.

# is_assigned is a boolean Series: True where the borough is assigned
is_assigned = df2['Borough'] != 'Not assigned'

# Keep only the rows with an assigned borough
df_assigned = df2[is_assigned].reset_index(drop=True)

print("Dataframe shape after removing rows with a not assigned borough is:", df_assigned.shape)

# Show the first rows
df_assigned.head(12)

Dataframe shape after removing rows with a not assigned borough is: (103, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |

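The filtering step relies on boolean indexing: comparing a column to a value yields a True/False Series, and indexing the dataframe with that Series keeps only the True rows. The sketch below shows the same pattern on a small made-up dataframe, so the values are illustrative and not the scraped data.

```python
import pandas as pd

# Made-up rows that mimic the structure of the scraped table
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A', 'M8A'],
    'Borough': ['Not assigned', 'North York', 'North York', 'Not assigned'],
})

# True where the borough is assigned, False otherwise
mask = toy['Borough'] != 'Not assigned'

# Keep the True rows and renumber the index from 0
kept = toy[mask].reset_index(drop=True)

removed = len(toy) - len(kept)
print(f'kept {len(kept)} rows, removed {removed}')
```

Applied to the shapes printed in this notebook, the same subtraction gives 180 - 103 = 77 rows removed.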

<a id='item3'></a>
## 3. Replace not assigned neighborhoods with the borough name

In [4]:
# Boolean mask of rows whose neighborhood is not assigned
is_notassigned_ngbhd = df_assigned['Neighborhood'] == 'Not assigned'

# Count the not assigned neighborhoods
print("The number of not assigned neighborhoods is:", is_notassigned_ngbhd.sum())

# Replace each not assigned neighborhood with its borough
df_assigned.loc[is_notassigned_ngbhd, 'Neighborhood'] = df_assigned.loc[is_notassigned_ngbhd, 'Borough']



The number of not assigned neighborhoods is: 0


In [5]:
# Print the dataframe shape
print("The shape of the dataframe is: ", df_assigned.shape )

The shape of the dataframe is:  (103, 3)

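Because the count of not assigned neighborhoods is 0 here, the replacement step never changes a row in this run. To show what it would do when such rows exist, here is a small sketch on made-up data. The postal codes and names are placeholders, not values from the Wikipedia table.

```python
import pandas as pd

# Placeholder rows; the second one has no neighborhood name
toy = pd.DataFrame({
    'PostalCode': ['X1A', 'X2A'],
    'Borough': ['Borough A', 'Borough B'],
    'Neighborhood': ['Some Neighborhood', 'Not assigned'],
})

# Rows whose neighborhood is missing
missing = toy['Neighborhood'] == 'Not assigned'

# Copy the borough name into the neighborhood column for those rows only
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

print(toy['Neighborhood'].tolist())
```

Selecting the same rows on both sides of the assignment keeps the intent explicit: only rows flagged by `missing` are read and written.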

<a id='item4'></a>
## 4. Create a dataframe with latitude and longitude for Toronto

In [6]:
# Alternative, left commented out: look up each postal code with the geocoder package.
#!conda install -c conda-forge geocoder
#import geocoder # import geocoder

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]


# url for the latitude and longitude reference file
url = 'https://cocl.us/Geospatial_data'

# read csv file
df_longlat = pd.read_csv(url)
print("longitude and latitude dataframe columns are:", df_longlat.columns)

# Rename the key column to match the neighborhood dataframe
df_longlat = df_longlat.rename(columns={'Postal Code': 'PostalCode'})

# Join the coordinates onto the neighborhood dataframe by postal code
neighborhoods = pd.merge(df_assigned, df_longlat, on='PostalCode')

# Show the first rows
neighborhoods.head(12)



longitude and latitude dataframe columns are: Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |

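By default `pd.merge` performs an inner join, so a postal code without a match in the coordinates file would be dropped without any message. The sketch below shows one way to check for that, using the `how`, `validate`, and `indicator` parameters of `pd.merge`. The data is a toy example: `M9Z` is a made-up postal code added only to produce an unmatched row.

```python
import pandas as pd

# Toy neighborhood rows; 'M9Z' is a made-up code with no coordinates
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Hypothetical'],
})

# Toy coordinates table covering only two of the three codes
coords = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every neighborhood row; validate raises if a postal
# code is duplicated on either side; indicator adds a '_merge' column
merged = pd.merge(left, coords, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

# Rows marked 'left_only' found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print('Postal codes without coordinates:', unmatched)
```

If the list is empty for the real data, the inner join used in this notebook loses no rows, and the merged dataframe has the same number of rows as `df_assigned`.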

### Thanks!