# Segmenting and Clustering Neighborhoods in Toronto

### This notebook is part of the IBM Applied Data Science Capstone course. In this practice assignment, location and venue data from Toronto are segmented and clustered.

*By Andrew Dahlstrom*

*04/22/2019*

In [116]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

### The first step is to scrape the text from the wikitable of postal codes, which can be found here:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
### Then we need to parse, clean, and load that text data into a pandas dataframe.

In [117]:
# Scrape text from the wikitable online and collect it row by row

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')  # the 'lxml' parser requires the lxml package
#print(soup.prettify())

# Assumption: the postal-code table is the first table with class 'wikitable' on the page
table = soup.find('table', class_='wikitable')
neightable = []

for row in table.find_all('tr'):
    # Rows without <td> cells (such as a header row) yield an empty list
    neighrow = []

    for data in row.find_all('td'):
        neighrow.append(data.text.rstrip('\n'))

    neightable.append(neighrow)

# Clean data: drop the first (header) row and rows with unassigned boroughs

neighdf = pd.DataFrame(neightable, columns=['PostalCode', 'Borough', 'Neighbourhood'])
neighdf.drop(index=0, inplace=True)
todrop = neighdf[neighdf['Borough'] == "Not assigned"].index
neighdf.drop(todrop, inplace=True)
# reset_index returns a new dataframe, so assign it back to keep the new index
neighdf = neighdf.reset_index(drop=True)
neighdf

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
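
To illustrate why the first row is dropped before filtering, the sketch below builds a dataframe from a hand-written list that mimics the scraped rows. It assumes the table's header row contains no `<td>` cells, so the scraper records it as an empty list. The `M1A` row is an illustrative placeholder rather than scraped data, and `toy_rows`, `toy_df`, and `cleaned` are hypothetical names.

```python
import pandas as pd

# Hand-written stand-in for the scraped rows (not real scraped output).
toy_rows = [
    [],                                        # header row: no <td> cells found
    ['M1A', 'Not assigned', 'Not assigned'],   # illustrative unassigned row
    ['M3A', 'North York', 'Parkwoods'],
    ['M4A', 'North York', 'Victoria Village'],
]

toy_df = pd.DataFrame(toy_rows, columns=['PostalCode', 'Borough', 'Neighbourhood'])

# The empty list becomes a row made up entirely of missing values.
header_is_empty = toy_df.iloc[0].isna().all()

# Drop fully empty rows, then keep only rows with an assigned borough.
cleaned = toy_df.dropna(how='all')
cleaned = cleaned[cleaned['Borough'] != 'Not assigned'].reset_index(drop=True)

print(header_is_empty)
print(cleaned)
```

One possible advantage of `dropna(how='all')` over dropping index 0 is that it does not depend on the empty row being first.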


### The next step organizes the table by merging neighborhoods that share a postal code and giving unassigned neighborhoods the name of their borough.

In [118]:
# Merge neighborhoods that share a postal code into one comma-separated string

neighdf = neighdf.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
neighdf

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
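
The merge step can be tried on a few rows in isolation. The sketch below reuses five rows from the first output table and applies the same group-and-join idea, using `agg` in place of `apply`. The names `toy_df` and `merged` are hypothetical.

```python
import pandas as pd

# Toy rows copied from the first output table of this notebook.
toy_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A', 'M6A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park',
                      'Lawrence Heights', 'Lawrence Manor', 'Parkwoods'],
})

# Group rows sharing a postal code and borough, then join their
# neighbourhood names into one comma-separated string per group.
merged = (
    toy_df.groupby(['PostalCode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)

print(merged)
```

By default `groupby` sorts the result by the group keys, which is why the merged table starts at `M1B` even though the scraped rows started at `M3A`.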


In [121]:
# Replace unassigned neighborhoods with borough names 

for index, row in neighdf.iterrows(): 
    if neighdf.at[index, 'Neighbourhood'] == "Not assigned":
        neighdf.at[index, 'Neighbourhood'] = neighdf.at[index, 'Borough']
      
neighdf

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
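
The row-by-row loop above can also be written as a single masked assignment. This is a sketch of an alternative, not the notebook's own code. It uses two rows from the first output table, including the `M7A` row whose neighbourhood is "Not assigned", and `toy_df` and `unassigned` are hypothetical names.

```python
import pandas as pd

# Two rows copied from the first output table of this notebook.
toy_df = pd.DataFrame({
    'PostalCode': ['M7A', 'M9A'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Islington Avenue'],
})

# Boolean mask marking the rows whose neighbourhood is unassigned.
unassigned = toy_df['Neighbourhood'] == 'Not assigned'

# Copy the borough name into the neighbourhood column for those rows only.
toy_df.loc[unassigned, 'Neighbourhood'] = toy_df.loc[unassigned, 'Borough']

print(toy_df)
```

The masked assignment avoids iterating over rows in Python, which tends to matter more as the table grows.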


### The shape of our final dataframe looks like this:

In [122]:
neighdf.shape

(103, 3)
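
A shape of `(103, 3)` means 103 rows and 3 columns. If each postal code belongs to a single borough, the row count after merging should equal the number of distinct postal codes. The sketch below shows one way such checks could be written. `summarize` is a hypothetical helper, and the toy rows are copied from the merged output above.

```python
import pandas as pd

def summarize(df):
    """Return basic consistency facts about a cleaned postal-code dataframe."""
    n_rows, n_cols = df.shape
    return {
        'rows': n_rows,
        'columns': n_cols,
        'unique_postal_codes': df['PostalCode'].nunique(),
        # Number of rows that still contain the placeholder text in any column.
        'unassigned_left': int((df == 'Not assigned').any(axis=1).sum()),
    }

# Three rows copied from the merged output table.
toy_df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G', 'M1H'],
    'Borough': ['Scarborough'] * 3,
    'Neighbourhood': ['Rouge, Malvern', 'Woburn', 'Cedarbrae'],
})

summary = summarize(toy_df)
print(summary)
```

On the full dataframe, `summarize(neighdf)` would be expected to report matching values for `rows` and `unique_postal_codes` under the single-borough assumption stated above.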