# Segmenting and Clustering Neighborhoods in Toronto
We will scrape postal code information from a Wikipedia page, load it into a pandas dataframe, run K-means clustering, and visualize the result with Folium.
The source page is:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

## 1. Scraping postal code information from a web page

In [5]:
# Install and import required libraries, as in the New York clustering notebook
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.17.0-py_0 conda-forge

geographiclib- 100% |################################| Time: 0:00:00   1.07 MB/s
geopy-1.17.0-p 100% |################################| Time: 0:00:00   1.66 MB/s
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00   2.54 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  16.35 MB/s
vincent-0.4.4- 100% |###################

In [6]:
!conda install -c conda-forge beautifulsoup4 --yes 
from bs4 import BeautifulSoup

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    beautifulsoup4: 4.6.0-py35h442a8c9_1 --> 4.6.3-py35_0 conda-forge

beautifulsoup4 100% |################################| Time: 0:00:00   1.51 MB/s


In [7]:
# Get the page and ensure we got it 
page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
page.status_code

200
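
Checking `status_code` by eye works in a notebook. Outside a notebook, it is usually safer to fail loudly on a bad response. The following is a minimal sketch, not part of the original notebook: `fetch_html` and its `get` parameter are hypothetical names introduced here so the function can be exercised without network access.

```python
import requests

def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising on HTTP error statuses.

    `get` defaults to requests.get; it is a parameter only so that a
    stand-in can be supplied in tests (an assumption of this sketch).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')` would then either return the HTML or raise an exception, rather than continuing with an error page.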

In [8]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighbourhood'] 
# instantiate the dataframe
postal_codes = pd.DataFrame(columns=column_names)

soup = BeautifulSoup(page.content, 'html.parser')

# First, get the table in question
tables = soup.findChildren('table')
# We are interested in the first table only
post_table = tables[0]
rows = post_table.findChildren('tr')
# Skip the first tr (it is the header)
post_rows = rows[1:]
for row in post_rows:
    # Parse table row cells, stripping whitespace
    cells = row.findChildren('td')
    postcode = cells[0].string.strip()
    borough = cells[1].string.strip()
    for anchor in cells[1].findChildren('a'):
        borough = anchor.string.strip()
    anchorsFound = False
    for anchor in cells[2].findChildren('a'):
        neighbourhood = anchor.string.strip()
        anchorsFound = True
    if not anchorsFound:
        neighbourhood = cells[2].string.strip() 
    if borough != 'Not assigned':
        if neighbourhood == 'Not assigned':
            neighbourhood = borough
            
        # Boolean mask selecting any existing row for this (postal code, borough) pair
        sameBorough = (postal_codes.PostalCode == postcode) & (postal_codes.Borough == borough)
        if sameBorough.any():
            # Existing pair => extend that row's comma-separated neighbourhood list.
            # Updating via .loc touches only the matching row; a value-based replace()
            # would also change other rows that happen to hold the same string.
            postal_codes.loc[sameBorough, 'Neighbourhood'] += ', ' + neighbourhood
        else:
            # New pair => add a row (pd.concat also works where DataFrame.append is unavailable)
            new_row = pd.DataFrame([[postcode, borough, neighbourhood]], columns=column_names)
            postal_codes = pd.concat([postal_codes, new_row], ignore_index=True)
postal_codes.head(15)

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
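
The scraping cell reads cell text through `.string`, which Beautiful Soup defines as `None` when a tag contains more than one child (for example a link followed by a newline). The sketch below shows `get_text(strip=True)` as one possible alternative. It runs on made-up inline markup, not the live Wikipedia page, and `SAMPLE_HTML` and `parse_rows` are names invented for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration only
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="#">North York</a>
  </td><td><a href="#">Parkwoods</a>
  </td></tr>
  <tr><td>M7A</td><td><a href="#">Queen's Park</a></td><td>Not assigned</td></tr>
</table>
"""

def parse_rows(html):
    """Return (postcode, borough, neighbourhood) tuples from the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    records = []
    # Skip the header row
    for tr in table.find_all("tr")[1:]:
        # get_text(strip=True) collects text from nested tags and trims whitespace
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 3:
            continue
        postcode, borough, neighbourhood = cells
        # Drop rows without a borough, as in the notebook
        if borough == "Not assigned":
            continue
        # A missing neighbourhood falls back to the borough name
        if neighbourhood == "Not assigned":
            neighbourhood = borough
        records.append((postcode, borough, neighbourhood))
    return records

rows = parse_rows(SAMPLE_HTML)
```

Note that `get_text(strip=True)` joins the pieces of a cell without a separator, so a cell holding several links may need `get_text(' ', strip=True)` or per-link handling instead.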


In [9]:
postal_codes.shape

(103, 3)
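
Merging neighbourhoods that share a postal code and borough can also be done after scraping, in a single pandas step. The sketch below is an alternative to the row-by-row approach, not the notebook's own method. The `raw` dataframe is a small hand-made sample (one neighbourhood per row) that reuses values from the output above.

```python
import pandas as pd

# One row per (postal code, borough, neighbourhood), before merging
raw = pd.DataFrame(
    [
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M3A", "North York", "Parkwoods"),
    ],
    columns=["PostalCode", "Borough", "Neighbourhood"],
)

# Group on the key pair and join each group's neighbourhoods with ', '.
# sort=False keeps groups in order of first appearance.
merged = (
    raw.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
```

This collects all rows first and merges once, which avoids growing the dataframe inside the loop.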

## 2. Getting the latitude and longitude of each neighbourhood

In [14]:
!conda install -c conda-forge geocoder --yes 
import geocoder

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geocoder                  1.38.1                     py_0    conda-forge


In [36]:
# The code was removed by Watson Studio for sharing.

In [41]:
# Return (latitude, longitude) for a postal code within the given area
def fetchCoordinatesByAreaAndPostalCode(area, postal_code):
    coords = None
    n_times = 0
    # Uses a Bing Maps account; the other providers did not seem to work reliably
    # Retry until coordinates are returned, at most 5 times
    while coords is None and n_times < 5:
        print('Trying to find {}, {}'.format(postal_code, area))
        #g = geocoder.osm('{}, {}'.format(postal_code, area))
        #coords = g.osm
        g = geocoder.bing('{}, {}'.format(postal_code, area), key=BING_KEY)
        coords = g.latlng
        n_times = n_times + 1
    if coords is not None:
        # return coords.get('x'), coords.get('y')
        return coords[0], coords[1]
    else:
        # All attempts failed: (0.0, 0.0) marks a missing result
        return 0.0, 0.0

# Loop through all areas and fetch coordinates by postal code
lats=[]
longs=[]
for index, area in postal_codes.iterrows():
    lat, lon = fetchCoordinatesByAreaAndPostalCode('Toronto, Ontario', area['PostalCode'])
    lats.append(lat)
    longs.append(lon)
postal_codes = postal_codes.assign(Latitude=lats, Longitude=longs)
postal_codes.head(15)


Trying to find M3A, Toronto, Ontario
Trying to find M4A, Toronto, Ontario
Trying to find M5A, Toronto, Ontario
Trying to find M6A, Toronto, Ontario
Trying to find M7A, Toronto, Ontario
Trying to find M9A, Toronto, Ontario
Trying to find M1B, Toronto, Ontario
Trying to find M3B, Toronto, Ontario
Trying to find M4B, Toronto, Ontario
Trying to find M5B, Toronto, Ontario
Trying to find M6B, Toronto, Ontario
Trying to find M9B, Toronto, Ontario
Trying to find M1C, Toronto, Ontario
Trying to find M3C, Toronto, Ontario
Trying to find M4C, Toronto, Ontario
Trying to find M5C, Toronto, Ontario
Trying to find M6C, Toronto, Ontario
Trying to find M9C, Toronto, Ontario
Trying to find M1E, Toronto, Ontario
Trying to find M4E, Toronto, Ontario
Trying to find M5E, Toronto, Ontario
Trying to find M6E, Toronto, Ontario
Trying to find M1G, Toronto, Ontario
Trying to find M4G, Toronto, Ontario
Trying to find M5G, Toronto, Ontario
Trying to find M6G, Toronto, Ontario
Trying to find M1H, Toronto, Ontario
T

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.751255 | -79.329895 |
| 1 | M4A | North York | Victoria Village | 43.729958 | -79.314201 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65522 | -79.361969 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.722801 | -79.450691 |
| 4 | M7A | Queen's Park | Queen's Park | 43.664486 | -79.393021 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.662743 | -79.528427 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.810154 | -79.194603 |
| 7 | M3B | North York | Don Mills North | 43.749134 | -79.362007 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.707577 | -79.310913 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657467 | -79.377708 |
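
The retry loop in the geocoding function can be separated from the geocoding provider, which makes it testable without an API key. The sketch below assumes a hypothetical helper, `geocode_with_retry`, and a stand-in lookup, `flaky_lookup`. Neither is part of the notebook or of the `geocoder` library. It returns `None` on failure rather than `(0.0, 0.0)`, so a missing result cannot be mistaken for a real coordinate.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable returning a [lat, lng] pair or None.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return tuple(coords)
        # Pause between attempts, but not after the last one
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None

# Stand-in for a geocoder: fails twice, then succeeds
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [43.751255, -79.329895]

result = geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario")
```

With the real service, `lookup` could be something like `lambda q: geocoder.bing(q, key=BING_KEY).latlng`, mirroring the call used in the cell above.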
