# Applied Data Science Capstone Course #
## Week 3 Assignment: Segmenting and Clustering Neighbourhoods in Toronto, Canada ##
## Part 2 ##

### Imports ###

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import geopy
import folium
import requests

### Neighbourhood data ###

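First, read the postal code, borough, and neighbourhood table from the Wikipedia page. `pd.read_html` returns a list of the tables it finds on the page; the postal code table is taken as the first one.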

In [2]:
postcodes_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df_postcodes = pd.read_html(postcodes_url)[0]
df_postcodes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Data cleaning ####

Drop the rows whose borough is "Not assigned".

In [3]:
df_clean_borough = df_postcodes[df_postcodes['Borough'] != "Not assigned"]
df_clean_borough.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
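
As a quick sanity check on this step, the filter can be tried on a few rows. This is only a minimal sketch: the `sample`, `assigned`, and `remaining` names are hypothetical, and the rows are made up to mirror the columns of `df_postcodes`. The `.copy()` call is an optional addition that gives an independent dataframe, which avoids chained-assignment warnings if the result is modified later.

```python
import pandas as pd

# Hypothetical sample rows with the same columns as df_postcodes
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront", "Not assigned"],
})

# Keep only rows with an assigned borough; .copy() returns an independent frame
assigned = sample[sample["Borough"] != "Not assigned"].copy()

# Count how many unassigned boroughs are left (should be zero)
remaining = int((assigned["Borough"] == "Not assigned").sum())
print(assigned)
print("Unassigned boroughs remaining:", remaining)
```

Note that a row can still have an unassigned neighbourhood after this filter; that case is handled in the next step.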


Combine the neighbourhoods that share a postal code so that each row of the dataframe corresponds to one unique postal code. If a postal code has a borough but no neighbourhood, use the borough name as the neighbourhood. Duplicate postcode/neighbourhood combinations are not expected, but the code below screens them out anyway with the pandas `unique` method.

In [4]:
# Find the unique postcodes
unique_postcodes = df_clean_borough['Postcode'].unique()

# Initialize a new dataframe with the unique postcodes
df_combneigh = pd.DataFrame(columns=df_clean_borough.columns)
df_combneigh['Postcode'] = unique_postcodes

# Iterate over each unique postcode, and fill in the borough and neighbourhood list string
for index, row in df_combneigh.iterrows():

    # Select the rows of the cleaned data that belong to this postcode (computed once)
    rows = df_clean_borough[df_clean_borough['Postcode'] == row['Postcode']]

    # For borough, just pick the first instance
    borough = rows['Borough'].iloc[0]
    df_combneigh.at[index, 'Borough'] = borough

    # Now construct the neighbourhood string for each postal code,
    # using the borough name where no neighbourhood is assigned
    neighlist = rows['Neighbourhood'].unique()
    neighstr = ', '.join(neighlist)
    neighstr = neighstr.replace('Not assigned', borough)
    df_combneigh.at[index, 'Neighbourhood'] = neighstr

df_combneigh

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So..."
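
The same combination can probably be expressed without an explicit loop by using `groupby`. The sketch below is an alternative, not the approach used in this notebook: the `sample`, `filled`, and `combined` names are hypothetical and the rows are illustrative. It fills in missing neighbourhoods with the borough name before grouping, rather than replacing the text afterwards, which should give the same result for data shaped like this.

```python
import pandas as pd

# Hypothetical sample with the same columns as df_clean_borough
sample = pd.DataFrame({
    "Postcode": ["M5A", "M6A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor", "Not assigned"],
})

# Use the borough name wherever the neighbourhood is not assigned
filled = sample.copy()
missing = filled["Neighbourhood"] == "Not assigned"
filled.loc[missing, "Neighbourhood"] = filled.loc[missing, "Borough"]

# One row per postcode: first borough, unique neighbourhoods joined by ", "
# sort=False keeps the postcodes in order of first appearance
combined = (
    filled.groupby("Postcode", sort=False)
    .agg(
        Borough=("Borough", "first"),
        Neighbourhood=("Neighbourhood", lambda s: ", ".join(s.unique())),
    )
    .reset_index()
)
print(combined)
```

On larger tables the grouped version avoids re-filtering the dataframe once per postcode, although for roughly a hundred postal codes either approach is fast enough.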


In [5]:
print("Shape of the dataframe:", df_combneigh.shape)

Shape of the dataframe: (103, 3)



### Get coordinates for each neighbourhood (Part 2) ###

Download the provided .csv file, which lists the latitude and longitude of each postal code.

In [6]:
!wget https://cocl.us/Geospatial_data

wget: /opt/anaconda3/lib/libuuid.so.1: no version information available (required by wget)
--2020-02-25 16:37:18--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 158.85.108.83, 158.85.108.86, 169.48.113.194
Connecting to cocl.us (cocl.us)|158.85.108.83|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-02-25 16:37:19--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.29.197, 107.152.27.197, 107.152.24.197, ...
Connecting to ibm.box.com (ibm.box.com)|107.152.29.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-02-25 16:37:20--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request se

Load the .csv file

In [7]:
df_coords = pd.read_csv('Geospatial_data')
df_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


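Join the coordinate dataframe with the neighbourhoods dataframe on the postal code. The default inner join keeps only the postal codes that appear in both dataframes.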

In [8]:
df_neighcoords = df_combneigh.merge(df_coords, left_on='Postcode', right_on='Postal Code')
df_neighcoords.drop(columns=['Postal Code'], inplace=True)
df_neighcoords.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,M9A,Queen's Park,Queen's Park,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
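
Because an inner join silently drops postal codes that have no coordinates, it may be worth checking for unmatched rows. The following is a minimal sketch under stated assumptions: the `neigh`, `coords`, `checked`, and `unmatched` names are hypothetical, and `"M0Z"` is a made-up postal code included only to show what an unmatched row looks like. It uses the `how`, `indicator`, and `validate` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for df_combneigh and df_coords
neigh = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M0Z"],  # "M0Z" is made up and has no coordinates
    "Borough": ["North York", "North York", "Nowhere"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every neighbourhood row; indicator=True adds a "_merge"
# column, and validate raises an error if a postal code is duplicated
checked = neigh.merge(
    coords,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    indicator=True,
    validate="one_to_one",
)

# Postal codes that found no match in the coordinate table
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print("Postal codes without coordinates:", unmatched)
```

If the list of unmatched postal codes is empty, the inner join used above loses no rows.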
