# Segmenting and Clustering Neighborhoods in the city of Toronto, Canada

## Table of Contents
- [Part 1 - Data Scraping](#part-1)


<div id='part-1'/>

____
## Part 1 - Data Scraping

Input data: [Wikipedia: List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

In [1]:
from bs4 import BeautifulSoup
import urllib3
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim
import folium
import os
import requests
import json
from pandas import json_normalize  # pandas >= 1.0; pandas.io.json.json_normalize is deprecated
import matplotlib.cm as cm
import matplotlib.colors as colors


- **Input data is obtained from Wikipedia via an HTTP request.**
- **A _"BeautifulSoup"_ object is created from the response body.**

In [2]:
page_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# if you are behind a firewall set the proper url, including protocol, host and port.
#   (ex: http://internal-proxy:80)
proxy_url = ""

if proxy_url.strip() != "":
    # using proxy
    http = urllib3.ProxyManager(proxy_url)
else:
    # direct internet connection
    http = urllib3.PoolManager()

req = http.request('GET', page_url)
soup = BeautifulSoup(req.data, 'html.parser')
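
The cell above parses whatever the server returns, even an error page. A minimal sketch of a guard, assuming the manager behaves like a `urllib3` pool (responses expose `.status` and `.data`); `fetch_html` is a hypothetical helper name, not part of `urllib3`:

```python
def fetch_html(http, url):
    """Fetch `url` with a urllib3-style manager and return the body as bytes.

    Hypothetical helper: raises instead of handing an error page to the parser.
    """
    resp = http.request('GET', url)
    if resp.status != 200:
        raise RuntimeError(f"GET {url} returned HTTP status {resp.status}")
    return resp.data

# Possible usage with the objects defined in the cell above:
#   soup = BeautifulSoup(fetch_html(http, page_url), 'html.parser')
```

Raising early keeps a failed download from surfacing later as a confusing "table not found" error.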




  
- **The HTML postal code table is parsed.**
- **Rows with a 'Not assigned' borough are dropped.**
- **A pandas dataframe is constructed.**
  

In [3]:
# locate postcode table
toronto_table = soup.find('table', {'class': 'wikitable sortable'})

# process table rows and collect them as dicts
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first)
records = []
for row in toronto_table.find_all('tr'):
    row_items = row.find_all('td')
    if len(row_items) > 0:  # header rows contain only <th> cells
        postcode = row_items[0].text.strip()
        borough = row_items[1].text.strip()
        if borough.lower() != "not assigned":
            neighborhood = row_items[2].text.strip()
            records.append({'PostalCode': postcode,
                            'Borough': borough,
                            'Neighborhood': neighborhood})

# build raw_df from the collected rows
raw_df = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])
raw_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
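
As an alternative to filtering inside the loop, the 'Not assigned' boroughs can be dropped after the dataframe is built. The sketch below uses made-up records standing in for the scraped rows; the `records`, `df`, `keep` and `filtered` names are illustrative only:

```python
import pandas as pd

# Illustrative records, shaped like the rows collected from the table.
records = [
    {'PostalCode': 'M1A', 'Borough': 'Not assigned', 'Neighborhood': 'Not assigned'},
    {'PostalCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    {'PostalCode': 'M7A', 'Borough': "Queen's Park", 'Neighborhood': 'Not assigned'},
]
df = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])

# Boolean mask: True for rows whose borough is assigned (case-insensitive).
keep = df['Borough'].str.lower() != 'not assigned'
filtered = df[keep].reset_index(drop=True)
print(filtered)
```

Only the borough is tested here, so a row with an assigned borough but a 'Not assigned' neighborhood is kept for the next step.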


  
- **Combine neighborhoods that share the same postal code and borough into one row.**
- **Replace _'Not assigned'_ neighborhoods with the borough's name.**
  

In [4]:
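# build one row per (PostalCode, Borough) pair
grouped = []
for name, group in raw_df.groupby(['PostalCode', 'Borough'])['Neighborhood']:
    # name is the (PostalCode, Borough) tuple; group holds its neighborhoods
    nblist = ", ".join(str(x) for x in group.tolist())
    # a lone 'Not assigned' neighborhood falls back to the borough name
    if nblist == "Not assigned":
        nblist = name[1]
    grouped.append((name[0], name[1], nblist))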

toronto_df = pd.DataFrame(grouped, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(toronto_df.shape)
toronto_df.head()

(103, 3)


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
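
The same combine-and-replace step can also be written without an explicit loop. This is a sketch on toy data, not the notebook's method: it substitutes the borough name before grouping rather than after joining, which gives the same result when a 'Not assigned' neighborhood is the only entry for its postal code.

```python
import pandas as pd

# Toy rows standing in for raw_df (values are illustrative, not scraped).
raw = pd.DataFrame({
    'PostalCode':   ['M5A', 'M5A', 'M7A', 'M3A'],
    'Borough':      ['Downtown Toronto', 'Downtown Toronto', "Queen's Park", 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Not assigned', 'Parkwoods'],
})

# Keep the neighborhood where it is assigned, otherwise use the borough name.
raw['Neighborhood'] = raw['Neighborhood'].where(
    raw['Neighborhood'] != 'Not assigned', raw['Borough'])

# One row per (PostalCode, Borough); neighborhoods joined with ", ".
merged = (raw.groupby(['PostalCode', 'Borough'])['Neighborhood']
             .agg(', '.join)
             .reset_index())
print(merged)
```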


In [5]:
toronto_df.tail()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
| 102 | M9W | Etobicoke | Northwest |


In [6]:
# just for verification. This query should return no rows.
toronto_df.query("Neighborhood == 'Not assigned'")

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|


In [7]:
# verify a known 'Not assigned' neighborhood case: Neighborhood should equal Borough.
toronto_df.query("PostalCode == 'M7A'") 

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Queen's Park |
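
The two visual checks above could also be expressed as assertions, so that a re-run fails loudly if the source table changes. A minimal sketch on a small sample frame; `check_cleanup` is a hypothetical helper name and the sample rows are illustrative:

```python
import pandas as pd

def check_cleanup(df, fallback_code):
    """Hypothetical helper: assert the two checks described above."""
    # No neighborhood should still read 'Not assigned'.
    leftover = df[df['Neighborhood'] == 'Not assigned']
    assert leftover.empty, f"{len(leftover)} rows still 'Not assigned'"
    # For the known fallback case, Neighborhood must equal Borough.
    row = df[df['PostalCode'] == fallback_code]
    assert (row['Neighborhood'] == row['Borough']).all(), \
        f"{fallback_code}: Neighborhood should equal Borough"

sample = pd.DataFrame(
    [('M7A', "Queen's Park", "Queen's Park"),
     ('M9N', 'York', 'Weston')],
    columns=['PostalCode', 'Borough', 'Neighborhood'])

check_cleanup(sample, 'M7A')  # passes silently on the sample
```

In the notebook this would be called as `check_cleanup(toronto_df, 'M7A')`.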


  
- **Final assignment requirement: the dataframe shape is shown.**
  

In [8]:
toronto_df.shape

(103, 3)