# Segmenting and Clustering Neighbourhoods in Toronto

## Scraping Data from a Web Page and Adding Latitude and Longitude to the DataFrame

This notebook is the peer-graded assignment *Segmenting and Clustering Neighbourhoods in Toronto*. I will first scrape the necessary data from the web page [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)
and prepare it for further use in the assignment.

In [3]:
import pandas as pd
import numpy as np

To obtain the postal codes of Toronto neighbourhoods, I will first install the *lxml* library. It is a feature-rich, easy-to-use library for processing **XML** and **HTML** in Python, and `pd.read_html` uses it as its default HTML parser.


In [4]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


`pd.read_html` is a function that searches a web page for `<table>` elements. It returns a list of DataFrames, one per table found.
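
To illustrate the list-of-DataFrames return value without depending on the live page, here is a minimal sketch that parses an inline HTML string. The two-row table and the names `html`, `tables` and `sample` are made up for illustration; the call assumes *lxml* (or an equivalent parser) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table standing in for the Wikipedia page.
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# read_html always returns a list, even when only one table is found.
tables = pd.read_html(StringIO(html), header=0)
print(type(tables), len(tables))

# Each list element is an ordinary DataFrame.
sample = tables[0]
print(sample.columns.tolist())
```

Indexing the list (`tables[0]`) is what turns the result into a single DataFrame, which is why the next cell uses `table[0]`.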

In [5]:
table = pd.read_html('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header = 0)
table[0]

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Now I will create a DataFrame from the first table in that list (`table[0]`), keeping only the three columns named below.

In [6]:
column_n = ['Postal Code', 'Borough', 'Neighbourhood']
df = pd.DataFrame(table[0], columns = column_n)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [7]:
# Keep only the rows for which a Borough is assigned.
df1 = df[df['Borough'] != 'Not assigned']
df1.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [8]:
df1.shape  # There are 103 rows and three columns

(103, 3)

In [9]:
df1.reset_index(drop = True, inplace = True)  # Resetting the index after discarding rows with no Borough assigned
df1.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [10]:
df1['Postal Code'].value_counts()  # Checking for duplicate Postal Codes. There are none.

M4G    1
M4M    1
M1L    1
M1W    1
M1K    1
      ..
M2L    1
M6H    1
M6N    1
M3L    1
M9A    1
Name: Postal Code, Length: 103, dtype: int64
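
Because `value_counts()` output is truncated for long Series, a duplicate could hide in the elided rows. A minimal sketch of a more direct check follows, using `Series.duplicated()`. The small `codes` frame is hypothetical sample data, with `M5A` repeated on purpose.

```python
import pandas as pd

# Hypothetical sample; 'M5A' is repeated on purpose.
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M5A', 'M5A']})

# True if any postal code occurs more than once.
has_duplicates = codes['Postal Code'].duplicated().any()

# keep=False marks every occurrence, so the repeated values can be listed.
mask = codes['Postal Code'].duplicated(keep=False)
repeated = codes.loc[mask, 'Postal Code'].unique().tolist()

print(has_duplicates, repeated)
```

Applied to `df1`, the same `duplicated().any()` call would reduce the check to a single boolean.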

In [11]:
df1['Neighbourhood'] == 'Not assigned'  # Checking whether every Neighbourhood has a name assigned.

0      False
1      False
2      False
3      False
4      False
       ...  
98     False
99     False
100    False
101    False
102    False
Name: Neighbourhood, Length: 103, dtype: bool
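
The boolean Series above is also truncated, so it cannot confirm the result for all 103 rows at a glance. One way to collapse it into a single answer is sketched below. The `sample` frame is hypothetical and deliberately contains one unassigned neighbourhood.

```python
import pandas as pd

# Hypothetical sample with one unassigned neighbourhood.
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M7A'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Not assigned'],
})

unassigned = sample['Neighbourhood'] == 'Not assigned'

# any() answers "is there at least one?"; sum() counts the True values.
any_unassigned = bool(unassigned.any())
n_unassigned = int(unassigned.sum())

# Postal codes of the offending rows, for inspection.
unassigned_codes = sample.loc[unassigned, 'Postal Code'].tolist()

print(any_unassigned, n_unassigned, unassigned_codes)
```
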

## Adding Latitude and Longitude to Dataframe

Unfortunately, I was unable to obtain latitude and longitude coordinates using geocoder. As an alternative, I used the geospatial data from [GeoCoordinates](https://cocl.us/Geospatial_data).

In [41]:
geo_data = pd.read_csv('Geospatial_Coordinates.csv', header = 0)
geo_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [21]:
df1.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [42]:
#merging data from different dataframes based on Postal Code values 
new_df = pd.merge(df1, geo_data, how = 'left', on = 'Postal Code')
new_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
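
A left merge keeps every row of the left frame, so a postal code missing from the coordinates file would silently produce `NaN` coordinates. The sketch below shows one way to detect that, using the `indicator` and `validate` parameters of `pd.merge`. The frames `left` and `right` and the postal code `M0X` are made up for illustration.

```python
import pandas as pd

# Hypothetical inputs; 'M0X' has no match in the coordinates frame.
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a '_merge' column recording where each row came from.
merged = pd.merge(left, right, how='left', on='Postal Code',
                  validate='one_to_one', indicator=True)

# Rows present only on the left have no coordinates.
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
n_missing_lat = int(merged['Latitude'].isna().sum())

print(missing, n_missing_lat)
```

An empty `missing` list would confirm that every postal code received coordinates.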


In [46]:
print('Toronto has {} boroughs and {} neighbourhoods'.format(len(new_df['Borough'].unique()), new_df.shape[0]))

Toronto has 10 boroughs and 103 neighbourhoods


In [43]:
new_df['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East York            5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64