# Applied Data Science Capstone Course #
## Week 3 Assignment: Segmenting and Clustering Neighbourhoods in Toronto, Canada ##

### Imports ###

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import geopy
import folium
import requests

### Neighbourhood data ###

First, read the table of postal codes, boroughs, and neighbourhoods from the Wikipedia page.

In [2]:
postcodes_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df_postcodes = pd.read_html(postcodes_url)[0]
df_postcodes.head()

,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Data cleaning ####

Drop the rows whose borough is "Not assigned".

In [3]:
df_clean_borough = df_postcodes[df_postcodes['Borough'] != "Not assigned"]
df_clean_borough.head()

,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
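
The filter above keeps the original row labels, which is why the cleaned table starts at index 2. As a minimal sketch, assuming a fresh 0-based index is wanted and that the result will be modified later, the same boolean mask can be followed by `.copy()` and `reset_index(drop=True)`. The names `sample`, `has_borough`, and `cleaned` are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Small sample built from the first rows of the scraped table shown earlier
sample = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows that have a real borough
has_borough = sample["Borough"] != "Not assigned"

# .copy() gives an independent dataframe rather than a slice of the original;
# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
cleaned = sample[has_borough].copy().reset_index(drop=True)
print(cleaned)
```

Whether to reset the index is a matter of preference here: the later steps in this notebook select rows by postcode value, so they work with either labelling.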


Combine the neighbourhoods that share a postal code so that each row of the dataframe corresponds to one unique postal code. If a postal code has a borough but no neighbourhood, use the borough name as the neighbourhood. The source table should not contain duplicate postcode/neighbourhood pairs, but the code below screens out any duplicates anyway with the pandas `unique` method.

In [4]:
# Find the unique postcodes (in order of first appearance)
unique_postcodes = df_clean_borough['Postcode'].unique()

# Initialize a new dataframe with the unique postcodes
df_combneigh = pd.DataFrame(columns=df_clean_borough.columns)
df_combneigh['Postcode'] = unique_postcodes

# Iterate over each unique postcode, and fill in the borough and neighbourhood list string
for index, row in df_combneigh.iterrows():
    
    # For borough, just pick the first instance
    borough = df_clean_borough[df_clean_borough['Postcode'] == row['Postcode']]['Borough'].to_list()[0]
    df_combneigh.at[index, 'Borough'] = borough
    
    # Now construct the neighbourhood string for each postal code
    neighlist = df_clean_borough[df_clean_borough['Postcode'] == row['Postcode']]['Neighbourhood'].unique()
    neighstr = ', '.join(neighlist)
    neighstr = neighstr.replace('Not assigned', borough)
    df_combneigh.at[index, 'Neighbourhood'] = neighstr

df_combneigh

,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So..."
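
The row-by-row loop above can also be written as a single `groupby` aggregation. The following is a sketch of that alternative, not the approach used in this notebook. It runs on a small illustrative sample: the `M9Z` row is hypothetical and exists only to exercise the "Not assigned" case, and the names `sample`, `neigh`, and `combined` are assumptions.

```python
import pandas as pd

# Illustrative sample with the same columns as the cleaned table;
# the M9Z row is made up to show the borough fallback
sample = pd.DataFrame({
    "Postcode": ["M3A", "M6A", "M6A", "M9Z"],
    "Borough": ["North York", "North York", "North York", "Etobicoke"],
    "Neighbourhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor", "Not assigned"],
})

# Where no neighbourhood is assigned, fall back to the borough name
neigh = sample["Neighbourhood"].where(
    sample["Neighbourhood"] != "Not assigned", sample["Borough"]
)

combined = (
    sample.assign(Neighbourhood=neigh)
    # sort=False keeps postcodes in order of first appearance
    .groupby("Postcode", sort=False)
    .agg(
        # Take the first borough seen for each postcode
        Borough=("Borough", "first"),
        # Join the distinct neighbourhood names into one string
        Neighbourhood=("Neighbourhood", lambda s: ", ".join(s.unique())),
    )
    .reset_index()
)
print(combined)
```

One difference from the loop: the fallback to the borough name is applied to each row before joining, rather than by string replacement on the joined result. For data shaped like this table the two should give the same strings.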


In [5]:
print("Shape of the dataframe:", df_combneigh.shape)

Shape of the dataframe: (103, 3)
