# Assignment - Segmenting and Clustering Neighborhoods in Toronto

### Introduction

In the first part of this assignment, we will extract postal codes from the following link:<br>
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Only postal codes with an assigned borough will be considered in this notebook.

### Import Libraries

First of all, let's import the necessary modules:

In [1]:
import requests # library to handle requests

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# handle HTML data (bs4 is imported from the beautifulsoup4 package)
!pip install beautifulsoup4
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


### 1. Extract Postal Codes from Wiki


We use _Requests_ to fetch the HTML of the Wikipedia page and _BeautifulSoup_ to parse the returned HTML.

In [2]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
r.raise_for_status() # stop early if the page could not be fetched

soup = BeautifulSoup(r.text.replace('\n', ''), "html.parser") # strip line breaks before parsing

postal_code_data = pd.DataFrame(columns=["Postal Code", "Borough", "Neighborhood"])

#finds the correct table based on its class
postal_code_table = soup.find("table", {"class": "wikitable"})

for row in postal_code_table.find("tbody").find_all("tr"):
    if not row.find_all("th"): #handle data only if no table head is found
        col = row.find_all("td")
        postal_code = col[0].text
        borough = col[1].text
        neighborhood = col[2].text
        
        if borough.upper() != 'NOT ASSIGNED':
            if neighborhood.upper() == 'NOT ASSIGNED':
                neighborhood = borough
            
            # DataFrame.append was removed in pandas 2.0; add the row by label instead
            postal_code_data.loc[len(postal_code_data)] = [postal_code, borough, neighborhood]

postal_code_data.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [3]:
postal_code_data.shape

(103, 3)
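
The filtering rule used in the loop above can be isolated into a small helper, which makes it easy to check on its own. This is only a minimal sketch: the name `clean_row` is hypothetical and not part of the notebook, the `strip()` calls are an added assumption, and the input rows are illustrative.

```python
from typing import Optional, Tuple

def clean_row(postal_code: str, borough: str, neighborhood: str) -> Optional[Tuple[str, str, str]]:
    """Return the cleaned row, or None when the borough is not assigned."""
    # Assumption: trim stray whitespace left over from the HTML cells.
    postal_code, borough, neighborhood = (
        s.strip() for s in (postal_code, borough, neighborhood)
    )
    # Rows without an assigned borough are dropped entirely.
    if borough.upper() == "NOT ASSIGNED":
        return None
    # A missing neighborhood falls back to the borough name.
    if neighborhood.upper() == "NOT ASSIGNED":
        neighborhood = borough
    return postal_code, borough, neighborhood

# Illustrative inputs covering the three cases.
samples = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M7A", "Queen's Park", "Not assigned"),
]
kept = [row for row in (clean_row(*s) for s in samples) if row is not None]
print(kept)
```

A list of tuples like `kept` can then be turned into a data frame in one step instead of growing it row by row.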

### 2. Add Latitude and Longitude

Now, we get the latitude and longitude of each postal code from a CSV file (http://cocl.us/Geospatial_data) and add them to <i>postal_code_data</i>.

Download the CSV file to get the geo information:

In [4]:
csv_url = 'http://cocl.us/Geospatial_data'
df_csv = pd.read_csv(csv_url)
df_csv.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Merge both dataframes based on the _Postal Code_ column:

In [5]:
postal_code_data = pd.merge(postal_code_data, df_csv, on=["Postal Code"])
postal_code_data.head(20)

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
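
By default `pd.merge` performs an inner join, so a postal code with no match in the CSV would be dropped without any warning. One way to check for that, sketched here on small illustrative frames, is a left join with `indicator=True`. The frames `codes` and `coords` are hypothetical stand-ins for the notebook's data, and `M0X` is a made-up code used only to trigger a mismatch.

```python
import pandas as pd

# Illustrative stand-ins; "M0X" is a made-up code with no coordinates.
codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Example"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postal code; indicator=True adds a "_merge" column
# saying where each row came from; validate guards against duplicate keys.
checked = pd.merge(
    codes, coords, on="Postal Code",
    how="left", indicator=True, validate="one_to_one",
)

# Postal codes that found no coordinates in the CSV.
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty `missing` list would mean that every postal code received coordinates and the inner join lost nothing.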


Some neighborhood names appear more than once in the data frame, but each occurrence has its own postal code and its own latitude and longitude, so we keep the data as it is.

In [6]:
# example of a duplicated neighborhood name
postal_code_data[postal_code_data["Neighborhood"] == "Don Mills"]

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 13 | M3C | North York | Don Mills | 43.7259 | -79.340923 |
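
Rather than looking up one name at a time, all repeated neighborhood names can be listed at once with `duplicated(keep=False)`. The sketch below uses a small illustrative frame, `sample`, built from values shown in the outputs above. It also checks whether any repeated name shares the same coordinates, which is the case that would justify dropping rows.

```python
import pandas as pd

# Illustrative subset of the merged data (values taken from the outputs above).
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M3B", "M3C"],
    "Neighborhood": ["Parkwoods", "Don Mills", "Don Mills"],
    "Latitude": [43.753259, 43.745906, 43.7259],
    "Longitude": [-79.329656, -79.352188, -79.340923],
})

# keep=False marks every occurrence of a repeated name, not only the later ones.
dup_names = sample[sample["Neighborhood"].duplicated(keep=False)]
print(dup_names)

# True only if some repeated name also repeats the same coordinates.
same_place = bool(
    sample.duplicated(subset=["Neighborhood", "Latitude", "Longitude"]).any()
)
print(same_place)
```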
