# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we explore, segment, and cluster the neighborhoods of the city of Toronto. Unlike New York, however, the neighborhood data is not readily available on the internet, so we need to write code that scrapes the Wikipedia page

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 

in order to obtain the data in the table of postal codes and transform it into a pandas dataframe. A good package for this purpose is BeautifulSoup. We start by importing the libraries used in the following steps.

In [1]:
import pandas as pd
import numpy as np
import os,sys
from bs4 import BeautifulSoup
import requests
import urllib
from urllib.request import urlopen
import json
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  54.17 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  36.49 MB/s
vincent-0.4.4- 100% |################################| Time: 0:00:00  40.71 MB/s
folium-0.5.0-p 100% |################################| Time: 0:00:00  49.25 MB/s


We now look at the table on the Wikipedia page, which has the columns Postcode, Borough and Neighbourhood. Some postcodes are not assigned to any borough, and these rows will be dropped. Some neighborhoods share a postcode; we will list them in a single row, separated by commas. A few boroughs have a "Not assigned" neighborhood; in this case we give the neighborhood the same name as the borough.
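As a minimal sketch of these three rules, the cell below applies them to a handful of rows copied from the Wikipedia table. The names `toy`, `assigned`, `merged` and `mask` are illustrative only and are not used elsewhere in this notebook; the real data is scraped in the next section.

```python
import pandas as pd

# Toy rows copied from the Wikipedia table (illustrative subset only)
toy = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Rule 1: drop the postcodes whose borough is not assigned
assigned = toy[toy["Borough"] != "Not assigned"]

# Rule 2: one row per postcode, neighbourhoods joined with ", "
merged = (
    assigned.groupby("Postcode", as_index=False)
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
)

# Rule 3: an unassigned neighbourhood takes the name of its borough
mask = merged["Neighbourhood"] == "Not assigned"
merged.loc[mask, "Neighbourhood"] = merged.loc[mask, "Borough"]

print(merged)
```

With these four input rows, M1A is dropped, the two M5A rows collapse into "Harbourfront, Regent Park", and the M7A neighbourhood becomes "Queen's Park".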

### Downloading the table from Wikipedia and loading the data in a pandas dataframe

In [2]:
# Download the page
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urlopen(URL)
soup = BeautifulSoup(page, "lxml")
page.close()
 
# Write the first table on the page to a CSV file, one line per table row
tables = soup.findAll('table')
tab = tables[0]
with open("data.csv", "w") as fp:
    for tr in tab.tbody.findAll('tr'):
        # Header cells (<th>) and data cells (<td>) are written the same way;
        # the trailing comma creates an empty fourth column, dropped below
        for cell in tr.findAll(['th', 'td']):
            fp.write(cell.getText().strip() + ',')
        fp.write('\n')

# create the pandas dataframe
dfToronto = pd.read_csv('data.csv')
dfToronto.drop('Unnamed: 3',axis=1,inplace = True)
dfToronto.head(10)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
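The cell above appends a comma after every table cell, so each line ends with a trailing comma and pandas reads an extra `Unnamed: 3` column that then has to be dropped. One possible alternative, shown here only as a sketch, is to let the standard-library `csv.writer` handle separators and quoting. The `rows` list and the in-memory `buffer` are hypothetical stand-ins for the scraped cell texts and for `data.csv`; the last row is included only to show a cell that contains a comma.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the cell texts scraped from each <tr>
rows = [
    ["Postcode", "Borough", "Neighbourhood"],
    ["M3A", "North York", "Parkwoods"],
    ["M5A", "Downtown Toronto", "Harbourfront"],
    ["M1B", "Scarborough", "Rouge, Malvern"],  # illustrative cell with a comma
]

buffer = io.StringIO()  # in-memory stand-in for data.csv
writer = csv.writer(buffer)
for cells in rows:
    # csv.writer quotes cells that contain commas and adds no trailing comma
    writer.writerow(cells)

buffer.seek(0)
df = pd.read_csv(buffer)
print(df.columns.tolist())
print(df)
```

Written this way, the dataframe has exactly three columns, and a cell that itself contains a comma stays in a single column.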


### Cleaning the dataframe

In [3]:
# Remove the unassigned postcodes
dfToronto1 = dfToronto[ ~ dfToronto['Borough'].str.contains('Not assigned')]
dfToronto1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [4]:
# Combine the neighborhoods with the same postcode in the required format
group = dfToronto1.groupby('Postcode')
# Join all neighbourhood names of a postcode into one comma-separated string
grouped_neighborhoods = group['Neighbourhood'].apply(', '.join)
# Every row of a postcode has the same borough, so take the first one
grouped_boroughs = group['Borough'].first()
dfToronto2 = pd.concat([grouped_boroughs, grouped_neighborhoods], axis=1).reset_index()
dfToronto2.columns = ['Postcode', 'Borough', 'Neighbourhood']

dfToronto2.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [5]:
# Give the unassigned neighbourhoods the same name as the borough.
# Assigning through .loc writes to dfToronto2 itself; a row taken with
# .iloc may be a copy, so changes made to it can be lost.
mask = dfToronto2['Neighbourhood'] == 'Not assigned'
dfToronto2.loc[mask, 'Neighbourhood'] = dfToronto2.loc[mask, 'Borough']

In [6]:
# Check the number of rows of the dataframe
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
    len(dfToronto2['Borough'].unique()),
    dfToronto2.shape[0]
))

The dataframe has 11 boroughs and 103 neighborhoods.
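Note that `dfToronto2.shape[0]` counts rows, and after grouping each row corresponds to one postcode that may hold several neighbourhood names. The sketch below, built on three rows taken from the grouped output shown earlier, illustrates the difference between counting rows and counting individual names; `sample`, `n_rows` and `n_names` are illustrative names, and splitting on `', '` assumes that no neighbourhood name itself contains that separator.

```python
import pandas as pd

# Three rows taken from the grouped dataframe shown earlier
sample = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C", "M1G"],
        "Borough": ["Scarborough", "Scarborough", "Scarborough"],
        "Neighbourhood": [
            "Rouge, Malvern",
            "Highland Creek, Rouge Hill, Port Union",
            "Woburn",
        ],
    }
)

# One row per postcode
n_rows = sample.shape[0]

# Individual neighbourhood names, obtained by splitting each row on ", "
n_names = sample["Neighbourhood"].str.split(", ").map(len).sum()

print(n_rows, n_names)
```

For this sample the two counts are 3 rows and 6 names.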
