## **Data Collection for Segmentation and Clustering of Neighbourhood Venues in Toronto**

The data needed to segment and cluster Toronto neighbourhoods is available on Wikipedia. To collect it, the Wikipedia page is scraped with the BeautifulSoup Python library.

In [2]:
# Import Libraries

import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd

#### Fetch the Wikipedia page with the requests `get` method, then parse the returned HTML with BeautifulSoup to create a soup object

In [3]:
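# Download the page; website_url holds the HTTP response, not just the URL
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# Parse the response body with Python's built-in HTML parser
soup = BeautifulSoup(website_url.text, 'html.parser')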

In [4]:
# Use the soup object to find the table with class "wikitable sortable" in the HTML

table = soup.find('table',{'class':'wikitable sortable'})

# Check the table content
table

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>
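
The lookup above matches the `class` attribute as the exact string `wikitable sortable`. As a minimal sketch, the fragment below (a hand-written `sample_html` string, made up for illustration and not the real page) runs that lookup next to a CSS selector, which matches on both classes regardless of their order:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the full class attribute string.
by_attr = sample_soup.find('table', {'class': 'wikitable sortable'})

# CSS selector: any table that carries both classes.
by_css = sample_soup.select_one('table.wikitable.sortable')

print(by_attr is not None, by_css is not None)
```

Both calls return `None` when nothing matches, so checking the result before using it avoids an `AttributeError` if the page layout changes.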

In [5]:
# Collect the table rows: every 'tr' tag inside the table

rows = table.find_all('tr')
rows

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>, <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td></tr>, <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
 </td></tr>, <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
 </td></tr>, <tr>
 <td>M6A</td>
 <td
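
The first row in the list holds only `th` header cells, so asking it for `td` cells returns an empty list. The sketch below illustrates this on a made-up two-row fragment (`sample_html` is an assumption, not the scraped page) and uses `get_text(strip=True)` to read each cell without surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one header row and one data row.
sample_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

sample_rows = BeautifulSoup(sample_html, 'html.parser').find_all('tr')

# One list of cell texts per row; the header row gives an empty list.
parsed = [
    [cell.get_text(strip=True) for cell in row.find_all('td')]
    for row in sample_rows
]

for values in parsed:
    print(len(values), values)
```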

In [6]:
# Create a DataFrame to store Postcode, Borough and Neighbourhood

columns = ['Postcode','Borough','Neighbourhood']
df_canada = pd.DataFrame(columns=columns)

#View the Dataframe
df_canada

Unnamed: 0,Postcode,Borough,Neighbourhood


### Walk through each row and find its 'td' tags (the cells of each 'tr' tag) to retrieve the postcode, borough and neighbourhood


In [7]:
postcode = []
borough = []
neighbourhood = []

for row in rows:
    cells = row.find_all('td')
    # The header row has only 'th' cells, so it is skipped here
    if len(cells) >= 3:
        postcode.append(cells[0].get_text().strip('\n'))
        # get_text() returns the visible text of a cell whether or not it is
        # wrapped in an <a> link, so no separate branch for links is needed
        borough.append(cells[1].get_text())
        # The last cell of each row still ends with a newline at this point;
        # that is checked and removed in a later cell
        neighbourhood.append(cells[2].get_text())
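
One detail worth keeping in mind when cleaning cell text: `str.strip` takes a set of characters rather than a substring, and the newline escape is written `'\n'`, while `'/n'` means a slash and the letter n. The following sketch uses made-up strings to show the difference:

```python
raw = 'Parkwoods\n'

# '/n' is the two characters '/' and 'n', so the trailing newline stays.
slash_n = raw.strip('/n')

# '\n' is the newline character, so it is removed.
newline = raw.strip('\n')

# With no argument, all leading and trailing whitespace is removed.
default = '  Parkwoods \n'.strip()

# Character-set behaviour: every leading/trailing '/' or 'n' is dropped.
char_set = 'nn/abc/n'.strip('/n')

print(repr(slash_n), repr(newline), repr(default), repr(char_set))
```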


In [8]:
# Assign postcode, borough and neighbourhood from the previous step to the df_canada DataFrame

df_canada['Postcode']=postcode
df_canada['Borough']=borough
df_canada['Neighbourhood']=neighbourhood

# Check 5 rows of dataframe
df_canada.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n


In [9]:
# Check how many rows have Borough "Not assigned" so they can be removed from the DataFrame

df_canada.loc[(df_canada.Borough == 'Not assigned'),'Borough'].count()

77

In [10]:
# Keep only the rows with a valid Borough and reset the index

df_canada_filtered = df_canada.loc[df_canada.Borough != 'Not assigned'].reset_index(drop=True)

# List 5 rows
df_canada_filtered.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods\n
1,M4A,North York,Victoria Village\n
2,M5A,Downtown Toronto,Harbourfront\n
3,M5A,Downtown Toronto,Regent Park\n
4,M6A,North York,Lawrence Heights\n


In [11]:
# Verify that no row in df_canada_filtered still has Borough 'Not assigned' after filtering

df_canada_filtered.loc[(df_canada_filtered.Borough == 'Not assigned'),'Borough'].count()

0
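
The count, filter and recount steps can all be driven by one boolean mask. This is a minimal sketch on a three-row `sample` frame (made-up rows in the same layout as `df_canada`), not a replacement for the cells above:

```python
import pandas as pd

# Hypothetical rows in the same layout as df_canada.
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# True for rows whose Borough is the placeholder value.
unassigned = sample['Borough'] == 'Not assigned'

# Summing a boolean Series counts the True values.
count_before = int(unassigned.sum())

# ~ inverts the mask; drop=True discards the old index instead of keeping it as a column.
filtered = sample[~unassigned].reset_index(drop=True)

count_after = int((filtered['Borough'] == 'Not assigned').sum())
print(count_before, count_after)
```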

In [14]:
# Check for newline characters, since text taken from HTML tags sometimes contains a newline

print("Newline characters in Postcode :",df_canada_filtered.loc[df_canada_filtered.Postcode.str.contains('\n'),'Postcode'].count())

print("Newline characters in Borough :",df_canada_filtered.loc[df_canada_filtered.Borough.str.contains('\n'),'Borough'].count())

print("Newline characters in Neighbourhood :",df_canada_filtered.loc[df_canada_filtered.Neighbourhood.str.contains('\n'),'Neighbourhood'].count())

Newline characters in Postcode : 0
Newline characters in Borough : 0
Newline characters in Neighbourhood : 211


In [16]:
# Remove newline characters from Neighbourhood

df_canada_filtered['Neighbourhood'] = df_canada_filtered['Neighbourhood'].str.replace('\n', '', regex=False)

# Verify after replacing '\n'
print("Newline characters after replacement in Neighbourhood :",df_canada_filtered.loc[df_canada_filtered.Neighbourhood.str.contains('\n'),'Neighbourhood'].count())

Newline characters after replacement in Neighbourhood : 0
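
If other whitespace (spaces or tabs) might also surround the scraped text, `Series.str.strip()` is a slightly broader alternative. The sketch below applies it to every text column of a made-up `sample` frame; the column list and the values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical rows with trailing newlines, as left by the scraping step.
sample = pd.DataFrame({
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods\n', 'Harbourfront\n'],
})

# Strip leading and trailing whitespace from each text column.
for col in ['Borough', 'Neighbourhood']:
    sample[col] = sample[col].str.strip()

# regex=False treats '\n' as a literal character to look for.
remaining = int(sample['Neighbourhood'].str.contains('\n', regex=False).sum())
print("Newline characters left:", remaining)
```

Unlike replacing `'\n'` everywhere, `str.strip()` only touches the ends of each value, so a newline in the middle of a name would be left in place.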


In [17]:
# As per the assignment, if a Neighbourhood is "Not assigned", the Borough is used as the Neighbourhood.
# Find the affected rows

df_canada_filtered[df_canada_filtered.Neighbourhood == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Not assigned


In [18]:
# Assign the Borough value to Neighbourhood at index ind

ind= df_canada_filtered[(df_canada_filtered.Neighbourhood  == 'Not assigned')].index.values[0]
df_canada_filtered.loc[ind,'Neighbourhood'] = df_canada_filtered.loc[ind,'Borough']    

# Verify the Neighbourhood at index 'ind'
df_canada_filtered.loc[ind,['Borough','Neighbourhood']]

Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 6, dtype: object
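
The cell above updates only the first matching row (`index.values[0]`), which is enough here because a single row is affected. If several rows could carry the placeholder, a mask-based assignment would cover all of them at once. A sketch on made-up data (the second "Not assigned" row is invented for illustration):

```python
import pandas as pd

# Hypothetical rows; only the M7A row reflects the data shown above.
sample = pd.DataFrame({
    'Postcode': ['M7A', 'X0X', 'M3A'],
    'Borough': ["Queen's Park", 'Etobicoke', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

# True for every row whose Neighbourhood is the placeholder.
missing = sample['Neighbourhood'] == 'Not assigned'

# Copy the Borough into Neighbourhood for all of those rows.
sample.loc[missing, 'Neighbourhood'] = sample.loc[missing, 'Borough']

print(sample)
```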

In [19]:
# As per the assignment, when one Postcode and Borough has more than one neighbourhood, the neighbourhoods are combined into a single comma-separated value

# This can be achieved by grouping on ['Postcode','Borough'] and joining the Neighbourhood values with commas

# For example postcode M5A
df_canada_filtered[df_canada_filtered.Postcode == 'M5A']

# Group by ['Postcode','Borough'] and join the neighbourhoods with commas

df_final = df_canada_filtered[['Postcode','Borough','Neighbourhood']].\
groupby(['Postcode','Borough'])['Neighbourhood'].\
apply(lambda x: ','.join(x)).reset_index()

In [20]:
# Verify above scenario for Postcode = 'M5A'

df_final[df_final.Postcode == 'M5A']

Unnamed: 0,Postcode,Borough,Neighbourhood
53,M5A,Downtown Toronto,"Harbourfront,Regent Park"
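
To make the grouping step easier to follow in isolation, here is a small sketch on three made-up rows. It uses `agg` with `','.join` in place of the `apply` with a lambda; both are expected to give the same joined strings:

```python
import pandas as pd

# Hypothetical rows: M5A appears twice, M3A once.
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One output row per (Postcode, Borough) pair, neighbourhoods joined by commas.
grouped = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
          .agg(','.join)
          .reset_index()
)

print(grouped)
```

`groupby` sorts the group keys by default, which is why the M5A row ends up at a different index in `df_final` than in `df_canada_filtered`. Within each group the original row order is kept, so Harbourfront still comes before Regent Park.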


In [21]:
# Now check rows of df_final to verify 

df_final.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


### Final shape of the df_final DataFrame after applying all of the steps above

In [24]:
print("Shape of Dataframe :",df_final.shape)

print("\nThere are {} rows and {} columns in df_final Dataframe".format(df_final.shape[0],df_final.shape[1]))

Shape of Dataframe : (103, 3)

There are 103 rows and 3 columns in df_final Dataframe
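
Beyond the shape, a few quick checks can confirm that the cleaning steps held. The helper below, `check_final`, is a hypothetical function written for this sketch and run on a made-up two-row frame; the same checks could be pointed at `df_final`:

```python
import pandas as pd

def check_final(df):
    """Return a few yes/no checks on a cleaned postcode table."""
    return {
        # Each postcode should appear once after grouping.
        'unique_postcodes': bool(df['Postcode'].is_unique),
        # No cell should still hold the placeholder value.
        'no_placeholder': not bool((df == 'Not assigned').any().any()),
        # No neighbourhood should still contain a newline.
        'no_newlines': not bool(
            df['Neighbourhood'].str.contains('\n', regex=False).any()
        ),
    }

# Hypothetical rows in the same layout as df_final.
sample_final = pd.DataFrame({
    'Postcode': ['M1B', 'M5A'],
    'Borough': ['Scarborough', 'Downtown Toronto'],
    'Neighbourhood': ['Rouge,Malvern', 'Harbourfront,Regent Park'],
})

print(check_final(sample_final))
```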
