# Final Assignment of the Applied Data Science Capstone
Segmenting and Clustering Neighborhoods in Toronto

## Scrape the Toronto Neighbourhoods
1. Use Beautiful Soup to scrape the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
2. Load the scraped data into a pandas data frame

Install the Beautiful Soup library to scrape the Wikipedia page.

In [63]:

!pip3 install beautifulsoup4



Import all the necessary libraries for the first task

In [64]:
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup

Fetch the HTML of the Wikipedia page and parse it with Beautiful Soup, using the html.parser backend.

In [65]:
html_data = requests.get(url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup_scraper = BeautifulSoup(html_data.text, 'html.parser')
soup_scraper.title


<title>List of postal codes of Canada: M - Wikipedia</title>

The Wikipedia page loaded successfully. Now we can create a pandas data frame with the required columns and fill it with the rows of the HTML table.

In [66]:
# Collect the table rows in a list first; DataFrame.append was removed in pandas 2.0
rows = []

for row in soup_scraper.find('div', id='mw-content-text').find('table').find('tbody').find_all('tr'):
    col = row.find_all('td')
    # Header rows contain no <td> cells and are skipped
    if len(col) > 0:
        rows.append({'PostalCode': col[0].text, 'Borough': col[1].text, 'Neighbourhood': col[2].text})

# Build the data frame in one step from the collected rows
toronto_neighbourhoods = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighbourhood'])


We have acquired our dataset. Now we print the first 5 rows to check the data quality.

In [67]:
toronto_neighbourhoods.head()


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |
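
The trailing '\n' in every cell comes from the whitespace inside each table cell of the page. As a minimal sketch, assuming a small hand-written HTML table instead of the live Wikipedia page (the `sample_html` string and the `extract_rows` helper are hypothetical and exist only for illustration), Beautiful Soup's `get_text(strip=True)` can remove that whitespace while the rows are being extracted:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
# Each cell ends with a newline, as in the scraped data.
sample_html = (
    "<table>"
    "<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>"
    "<tr><td>M1A\n</td><td>Not assigned\n</td><td>Not assigned\n</td></tr>"
    "<tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>"
    "</table>"
)


def extract_rows(html):
    """Return one dict per data row, with surrounding whitespace stripped."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for row in soup.find('table').find_all('tr'):
        cells = row.find_all('td')
        if len(cells) >= 3:  # the header row contains <th> cells only
            rows.append({
                'PostalCode': cells[0].get_text(strip=True),
                'Borough': cells[1].get_text(strip=True),
                'Neighbourhood': cells[2].get_text(strip=True),
            })
    return rows


sample_rows = extract_rows(sample_html)
print(sample_rows[1])
```

Stripping at extraction time is one option; cleaning the columns afterwards in pandas works just as well and keeps the scraping loop simpler.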


We have to remove the trailing '\n' character from every column of the dataset.

In [68]:
# Strip the trailing newline (and any surrounding whitespace) from every text column.
# str.strip() is used because str.replace(r'\n', '') only matches a newline when the
# pattern is treated as a regular expression, which is no longer the default in pandas 2.0.
for column in ['PostalCode', 'Borough', 'Neighbourhood']:
    toronto_neighbourhoods[column] = toronto_neighbourhoods[column].str.strip()
toronto_neighbourhoods.head()


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
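
Removing a newline is easy to get subtly wrong, because the raw string `r'\n'` is two characters (a backslash and the letter n) and only matches a real newline when it is interpreted as a regular expression. In pandas, `Series.str.replace` treated patterns as regular expressions by default in older releases and treats them as literal text from version 2.0. The sketch below uses only plain Python strings and the standard `re` module (the `raw_cell` value is illustrative) to show the difference:

```python
import re

raw_cell = 'M1A\n'   # cell text as scraped, ending in a real newline
pattern = r'\n'      # raw string: a backslash followed by the letter n

# As literal text, the two-character pattern does not occur in the cell,
# so nothing is removed.
literal_result = raw_cell.replace(pattern, '')

# As a regular expression, \n matches the newline character.
regex_result = re.sub(pattern, '', raw_cell)

# Stripping whitespace avoids the ambiguity altogether.
stripped_result = raw_cell.strip()

print(len(pattern), repr(literal_result), repr(regex_result), repr(stripped_result))
```

Only the literal replacement leaves the newline in place, which is why an explicit `regex=` argument or `str.strip()` is the safer choice for this kind of cleanup.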


The text is now clean, but some postal codes are still marked as not assigned. We remove the rows without a borough and, where only the neighbourhood is missing, use the borough name instead.

In [69]:
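# Treat the 'Not assigned' placeholder as a missing value
toronto_neighbourhoods = toronto_neighbourhoods.replace('Not assigned', np.nan)
# Drop the rows that have no borough
toronto_neighbourhoods = toronto_neighbourhoods.dropna(subset=['Borough'])
# Where only the neighbourhood is missing, fall back to the borough name
toronto_neighbourhoods['Neighbourhood'] = toronto_neighbourhoods['Neighbourhood'].fillna(toronto_neighbourhoods['Borough'])
toronto_neighbourhoods.isnull().value_counts()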


PostalCode  Borough  Neighbourhood
False       False    False            103
dtype: int64

Now let's see how many rows we have left.

In [70]:
toronto_neighbourhoods.shape


(103, 3)
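
The cleaning rules can be checked on a small hand-made frame. This is only a sketch with invented rows (the `sample` frame and its values are hypothetical, not taken from the scraped table): a row without a borough is dropped, and a missing neighbourhood falls back to the borough name.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; not taken from the scraped table.
sample = pd.DataFrame({
    'PostalCode': ['X1A', 'X2A', 'X3A'],
    'Borough': ['Not assigned', 'East Side', 'West Side'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Riverbank'],
})

# Same three steps as in the notebook cell, written without inplace=True
cleaned = sample.replace('Not assigned', np.nan)
cleaned = cleaned.dropna(subset=['Borough'])
cleaned['Neighbourhood'] = cleaned['Neighbourhood'].fillna(cleaned['Borough'])
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```

The first row disappears because its borough is missing, while the second row keeps its postal code and takes 'East Side' as its neighbourhood.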

As the geocoder API does not work reliably, we load the latitude and longitude from the CSV file provided with the assignment.

In [80]:
toronto_postal_code_geo = pd.read_csv('https://cocl.us/Geospatial_data')
toronto_postal_code_geo.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


With the geo data loaded, rename the postal code column so that both data frames use the same column name, then look at the shape to see whether we can merge them.

In [82]:
toronto_postal_code_geo.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
toronto_postal_code_geo.shape


(103, 3)
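
Equal shapes only show that both frames have the same number of rows; they do not prove that the postal codes themselves match. A minimal sketch of a stricter check, using two small invented frames (`left` and `right` are hypothetical stand-ins for the two data frames, and all values are made up):

```python
import pandas as pd

# Invented stand-ins for the neighbourhood and geo data frames
left = pd.DataFrame({'PostalCode': ['X1A', 'X2A', 'X3A'],
                     'Borough': ['East Side', 'East Side', 'West Side']})
right = pd.DataFrame({'PostalCode': ['X3A', 'X1A', 'X2A'],
                      'Latitude': [43.3, 43.1, 43.2],
                      'Longitude': [-79.3, -79.1, -79.2]})

# Each postal code should appear exactly once in each frame
keys_unique = left['PostalCode'].is_unique and right['PostalCode'].is_unique

# Postal codes present in one frame but missing from the other
only_left = set(left['PostalCode']) - set(right['PostalCode'])
only_right = set(right['PostalCode']) - set(left['PostalCode'])

print(keys_unique, only_left, only_right)
```

If both difference sets are empty and the keys are unique, a merge on the postal code keeps every row exactly once, regardless of row order.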

As the row count matches that of our original data frame (103), we merge the two data frames on the PostalCode column.

In [88]:
toronto_neighbourhoods_geo = pd.merge(toronto_neighbourhoods, toronto_postal_code_geo, on='PostalCode')
toronto_neighbourhoods_geo.head()


| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
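
`pd.merge` performs an inner join by default, so a postal code missing from either frame would silently disappear from the result. As a sketch on invented data (the `areas` and `coords` frames and their values are hypothetical), the `how`, `validate` and `indicator` parameters of `pd.merge` can make such gaps visible:

```python
import pandas as pd

# Invented frames; 'X3A' deliberately has no coordinates
areas = pd.DataFrame({'PostalCode': ['X1A', 'X2A', 'X3A'],
                      'Borough': ['East Side', 'East Side', 'West Side']})
coords = pd.DataFrame({'PostalCode': ['X1A', 'X2A'],
                       'Latitude': [43.1, 43.2],
                       'Longitude': [-79.1, -79.2]})

# how='left' keeps every row of areas, validate raises an error if a key is
# duplicated on either side, and indicator adds a _merge column recording
# whether each row was found in both frames or only in the left one.
merged = pd.merge(areas, coords, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every neighbourhood row received coordinates.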
