<h1 align=center><font size = 6><b>Segmenting and Clustering Neighborhoods in Toronto</b></font></h1>

<p><font size = 5>In this project we will explore, segment, and cluster the neighborhoods in the city of Toronto.</font></p>

In the first part, we need to obtain the neighborhood data for Toronto. This data is not readily available in a structured, ready-to-use form, so we will scrape it from the following Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

To scrape the Neighborhood data from the web page we will use the Beautiful Soup 4 package. We will use the 'lxml' parser to parse the HTML file.

Let us install Beautiful Soup and lxml.

In [217]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/ec/be/5ab8abdd8663c0386ec2dd595a5bc0e23330a0549b8a91e32f38c20845b6/lxml-4.4.1-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
Installing collected packages: lxml
Successfully installed lxml-4.4.1
Note: you may need to restart the kernel to use updated packages.


We will also need the 'requests' library to get the HTML data from the web page.

In [9]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [218]:
# Importing all necessary libraries

from bs4 import BeautifulSoup
import requests
import pandas as pd

Let's get the source code from the Wikipedia page using the 'requests' library. 

In [219]:
# The 'text' attribute of the response returns the HTML as a string
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
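The one-line request above assumes the download always succeeds. As a minimal sketch of a more defensive variant, the helper below adds a timeout and fails loudly on HTTP errors. The function name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper, not from the notebook)."""
    response = requests.get(url, timeout=timeout)  # give up if the server does not answer in time
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Usage (performs a network request, so it is left commented out here):
# source = fetch_html(URL)
```

With this variant, a failed download stops the notebook at the request instead of surfacing later as a missing table.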

Now let's create a Beautiful Soup object

In [221]:
soup = BeautifulSoup(source, 'lxml')

# To view the source code, uncomment the print statement below
# print(soup.prettify())

# The line is commented out because the source code can run to thousands of lines,
# which makes the notebook tedious to scroll through

On inspecting the source code, we can see that the table containing the neighborhood data is inside the HTML tag 'table' with class 'wikitable sortable'.

In [223]:
table = soup.find('table', class_='wikitable sortable')

# print(table.prettify())  # commented out because the output is very long
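One detail of the lookup above is worth illustrating: when `class_` is given a string containing a space, Beautiful Soup compares it against the exact value of the `class` attribute, so the order of the class names matters. The sketch below uses a tiny hand-written table (a stand-in, not the real Wikipedia markup) and the built-in `html.parser` so it runs without lxml. `select_one` with a CSS selector is shown as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in markup; note the class names are in the opposite order.
html = """
<table class="sortable wikitable">
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: fails because the order differs.
exact = demo_soup.find("table", class_="wikitable sortable")

# CSS selector: matches any table carrying both classes, in any order.
by_selector = demo_soup.select_one("table.wikitable.sortable")

print(exact is None, by_selector is not None)
```

If the page ever reorders or adds classes on the table, the selector form would still find it, while the exact-string form would return `None`.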

Every row in the table is enclosed in a 'tr' tag. We will use this to fetch every row and append its cell values to a list.

In [224]:
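a = []  # one list of cell values per table row; a[0] will hold the header row
for row in table.find_all('tr'):
    # row.text holds the cell values separated by newlines; split it and drop the empty pieces
    b = [value for value in row.text.split('\n') if value != '']
    a.append(b)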
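To make the split-and-filter step concrete, here is a small sketch that applies it to a single string. The value of `row_text` is a hypothetical example of what `row.text` might look like for one row. It is an assumption about the shape of the text, not a value copied from the page.

```python
# Hypothetical text of one table row: cell values separated by newlines,
# with empty strings produced at both ends after splitting.
row_text = "\nM3A\nNorth York\nParkwoods\n"

parts = row_text.split("\n")
cells = [value for value in parts if value != ""]

print(parts)  # ['', 'M3A', 'North York', 'Parkwoods', '']
print(cells)  # ['M3A', 'North York', 'Parkwoods']
```

Note that this approach drops genuinely empty cells as well, which would shift the remaining values to the left. That is harmless here only as long as every cell in the table has some text.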

Now let us create a Pandas dataframe from this list and name it 'toronto'.

In [226]:
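toronto = pd.DataFrame(a[1:])  # a[0] is the header row; the remaining lists are data rows
a[0][0] = 'PostalCode'         # rename the first header cell
a[0][2] = 'Neighborhood'       # rename the third header cell
toronto.columns = a[0]
toronto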

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


We will remove every row where the value of 'Borough' is 'Not assigned'.

In [229]:
toronto.drop(toronto.loc[toronto['Borough']=='Not assigned'].index, inplace=True)
toronto.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
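The same filtering can also be expressed with a boolean mask instead of `drop`. The sketch below runs on a three-row stand-in dataframe named `demo` (made up for illustration, not the scraped data) and keeps only the rows whose borough is assigned.

```python
import pandas as pd

# Small stand-in for the scraped table.
demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep rows whose Borough is assigned; the original index labels are preserved.
kept = demo[demo["Borough"] != "Not assigned"]
print(kept)
```

Unlike `drop(..., inplace=True)`, this returns a new dataframe and leaves `demo` untouched, which can make the notebook cell safer to re-run.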


For rows where 'Borough' has a proper value but 'Neighborhood' is 'Not assigned', we will set the neighborhood to the borough name. For example, for the borough 'Queen's Park', the neighborhood also becomes 'Queen's Park'.

In [232]:
idx = toronto.loc[toronto['Neighborhood']=='Not assigned'].index  # index labels of the rows with no neighborhood assigned
toronto.loc[idx, 'Neighborhood'] = toronto.loc[idx, 'Borough']  # copy the borough name into the neighborhood for these rows

toronto.reset_index(drop=True, inplace=True)   #resetting index to consecutive numbers starting from zero
toronto.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
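As an alternative sketch of the same fill step, `Series.mask` replaces values wherever a condition is true, which avoids collecting index labels first. The dataframe `demo` below is a two-row stand-in made up for illustration.

```python
import pandas as pd

# Small stand-in for the cleaned table.
demo = pd.DataFrame({
    "PostalCode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Where the neighborhood is unassigned, take the value from Borough instead.
missing = demo["Neighborhood"] == "Not assigned"
demo["Neighborhood"] = demo["Neighborhood"].mask(missing, demo["Borough"])
print(demo)
```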


Many postal codes span multiple rows. We will group these rows into a single row per postal code and list all of its neighborhoods in one cell, separated by commas. We do this because we will later cluster by postal code, treating each postal code as one neighborhood.

In [249]:
toronto = toronto.groupby(['PostalCode','Borough'],as_index=False, sort=False).agg(','.join)
toronto.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


Let us see the number of rows in our dataframe. Each row is now one postal code, which we treat as one neighborhood.

In [245]:
print('The number of distinct neighborhoods in our data is ' + str(toronto.shape[0]) + '.')

The number of distinct neighborhoods in our data is 103.
