# Segmenting and Clustering Neighborhoods in Toronto

## Submission for Question 1

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import xml
import numpy as np

### Get data from Wikipedia

In [11]:
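# Download the page; `html` holds the markup text, not the URL.
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# Name the parser explicitly so BeautifulSoup does not guess one and emit a warning.
soup = BeautifulSoup(html, 'html.parser')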

In [2]:
# Uncomment the next line to inspect the parsed HTML in the soup object. The required data sit in <td> tags.
# soup

In [14]:
# Take the first table on the page and collect all of its <td> cells.
tbl_tag = soup.find('table')
td_tag = tbl_tag.find_all('td')

print(len(td_tag))

861
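
To make the cell extraction above easier to follow, here is a minimal offline sketch of the same `find` / `find_all` pattern. The `sample_html` string and the `sample_` names are hypothetical stand-ins, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

# Passing the parser name keeps the result independent of which parsers are installed.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first <table>; find_all('td') returns its data cells in document order.
sample_table = sample_soup.find("table")
sample_cells = [td.get_text(strip=True) for td in sample_table.find_all("td")]

print(len(sample_cells))
print(sample_cells)
```

Header cells (`<th>`) are skipped because only `<td>` tags are requested, which is why the count is a multiple of three.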


### Create dataframe

In [17]:
# store values in lists
postalcodes = []
boroughs = []
neighborhoods = []

# The relevant <td> tags come in groups of three: postal code, borough, neighborhood.
indexes = range(0, len(td_tag), 3)

for idx in indexes:
    postalcodes.append(td_tag[idx].text.strip())
    boroughs.append(td_tag[idx+1].text.strip())
    neighborhoods.append(td_tag[idx+2].text.strip())
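
As an alternative to the index loop, the same grouping in threes can be sketched with list slicing. The `cells` list is a hypothetical flat list of cell texts, in the order `find_all('td')` would return them.

```python
# Hypothetical flat list of cell texts: code, borough, neighborhood, repeated.
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M4A", "North York", "Victoria Village",
]

# A step of 3 with offsets 0, 1, 2 picks out each column of the original table.
codes = cells[0::3]
boros = cells[1::3]
hoods = cells[2::3]

# zip() pairs the columns back up into one tuple per table row.
rows = list(zip(codes, boros, hoods))
print(rows)
```

One difference worth knowing: if the number of cells is not a multiple of three, `zip` silently drops the incomplete last group, whereas the index loop would raise an `IndexError`.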
    

In [22]:
# Each list becomes one row here, so transpose to get one row per table entry.
df = pd.DataFrame(data=[postalcodes, boroughs, neighborhoods])
df = df.transpose()
df

Unnamed: 0,0,1,2
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor
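
The dataframe can also be built column by column, which avoids the transpose and names the columns in one step. This is a sketch on three hypothetical sample lists, not a replacement for the cell above.

```python
import pandas as pd

# Small hypothetical samples standing in for the three scraped lists.
sample_codes = ["M1A", "M3A", "M4A"]
sample_boroughs = ["Not assigned", "North York", "North York"]
sample_hoods = ["Not assigned", "Parkwoods", "Victoria Village"]

# Each dict key becomes a column name and each list becomes that column's values.
sample_df = pd.DataFrame({
    "PostalCode": sample_codes,
    "Borough": sample_boroughs,
    "Neighborhood": sample_hoods,
})
print(sample_df)
```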


### Tidy dataframe

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

In [24]:
# Replace the default integer column names with descriptive ones.
df.columns = ['PostalCode', 'Borough', 'Neighborhood']


Ignore rows whose borough is "Not assigned".

In [25]:
df.drop(df[df['Borough'] == "Not assigned"].index, inplace=True)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
...,...,...,...
281,M8Z,Etobicoke,Kingsway Park South West
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
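
The same filtering can be written with a boolean mask, which some readers find easier to scan than `drop` with an index. A minimal sketch on a hypothetical four-row sample:

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the scraped table.
sample_df = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows where the comparison is True; the original index labels survive.
kept = sample_df[sample_df["Borough"] != "Not assigned"]

# Optional: renumber the rows from 0 if the old labels are not needed.
kept_reset = kept.reset_index(drop=True)
print(kept)
print(kept_reset)
```

Both approaches keep the original index labels, which is why the output above starts at 2 rather than 0.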


If a row has a borough but a "Not assigned" neighborhood, the neighborhood should be set to the borough name.

In [31]:
df[df['Neighborhood'] == "Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood
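
The check above comes back empty for this data, but if such rows did exist, the rule could be applied as in the following sketch. The sample row with code `X0X` and borough `Example Borough` is invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample: the second row is made up to trigger the rule.
sample_df = pd.DataFrame({
    "PostalCode": ["M3A", "X0X"],
    "Borough": ["North York", "Example Borough"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Boolean mask marking rows whose neighborhood is missing.
unassigned = sample_df["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for just those rows.
sample_df.loc[unassigned, "Neighborhood"] = sample_df.loc[unassigned, "Borough"]
print(sample_df)
```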


No neighborhood has the value "Not assigned", so no replacement is needed. Next, combine the neighborhoods that share a postal code into a single comma-separated row.

In [35]:
df_grouped = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
df_grouped

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
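
To show what the grouping step does on a scale that is easy to check by eye, here is a small sketch using three rows taken from the output above. The `sample_df` and `grouped` names are illustrative only.

```python
import pandas as pd

# Three rows: two neighborhoods share M1B, one belongs to M1G.
sample_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# Group by code and borough, then join each group's neighborhoods with ", ".
# reset_index() turns the group keys back into ordinary columns.
grouped = (
    sample_df.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

Within each group the neighborhoods keep the order in which they appeared in the input, while the groups themselves are sorted by key.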


In the last cell of the notebook, use the .shape attribute to print the dimensions of the dataframe; the first value is the number of rows.

In [56]:
df_grouped.shape

(103, 3)
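
Since `.shape` is a `(rows, columns)` tuple, the row count alone can be read off by unpacking or indexing. A short sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame, just to show how the tuple is read.
sample_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Unpack the tuple; sample_df.shape[0] or len(sample_df) give the same row count.
n_rows, n_cols = sample_df.shape
print(n_rows)
```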