# Exploring and clustering the neighborhoods in Toronto
### Part 1: Scraping the web data

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

> Scrape the postal code table from the following Wikipedia page (url):

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw = requests.get(url).text
table = BeautifulSoup(raw, 'html.parser').find('table', {'class': 'wikitable sortable'})
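
The live page can change over time, so the table lookup is easier to check against a small inline document. The sketch below is only an illustration: `sample_html`, `sample_table`, `rows`, `header` and `records` are hypothetical names, and the markup is a hand-written stand-in for the real page, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: a tiny inline table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
# Same lookup as in the cell above: first table whose class attribute matches.
sample_table = soup.find('table', {'class': 'wikitable sortable'})

# Collect the text of every header/data cell, one list per table row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in sample_table.find_all('tr')
]
header, records = rows[0], rows[1:]
print(header)
print(records)
```

If `find` returns `None` on the real page, the table's class attribute has probably changed and the selector needs adjusting.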

In [3]:
# pd.read_html returns a list of DataFrames (one per table found); take the first.
df = pd.read_html(str(table))[0]

In [4]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


> Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [5]:
df = df.query('Borough != "Not assigned"').reset_index(drop=True)
df.head(7)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
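
As a rough check of the filtering step, the sketch below applies the same `query` expression to a three-row toy frame and compares it with an equivalent boolean mask. The frame `toy` and the names `by_query` and `by_mask` are made up for illustration.

```python
import pandas as pd

# Toy data in the shape of the scraped table (values chosen for illustration).
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Drop rows without a borough, then renumber the index from 0.
by_query = toy.query('Borough != "Not assigned"').reset_index(drop=True)

# The same filter written as a boolean mask.
by_mask = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

print(by_query)
assert by_query.equals(by_mask)
```

Note that the M7A row survives the filter: only the borough is tested here, so an unassigned neighbourhood is handled in the next step.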


> If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.

In [6]:
# Where the neighbourhood is 'Not assigned', copy the borough name into it.
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']

In [7]:
df.head(7)  # row 6 (M7A) now uses its borough name as the neighbourhood

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park


> The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

In [8]:
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df.columns

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')
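
Assigning to `df.columns` relies on the column order staying fixed. A possible alternative, sketched here on a hypothetical one-row frame `toy`, is `rename` with an explicit old-to-new mapping, which leaves unlisted columns untouched.

```python
import pandas as pd

# One-row toy frame with the original Wikipedia column names.
toy = pd.DataFrame({
    'Postcode': ['M3A'],
    'Borough': ['North York'],
    'Neighbourhood': ['Parkwoods'],
})

# Map each old name to its new name; 'Borough' is not listed and stays as-is.
renamed = toy.rename(columns={
    'Postcode': 'PostalCode',
    'Neighbourhood': 'Neighborhood',
})
print(list(renamed.columns))
```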

> More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma.

In [9]:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

In [10]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [11]:
df[df['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighborhood
53,M5A,Downtown Toronto,"Harbourfront, Regent Park"
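
The grouping step can be traced on a small example. The sketch below uses a hypothetical three-row frame `toy` and the same `groupby` / `', '.join` pattern. It also shows why the row order changes: `groupby` sorts by the group keys by default, which is presumably why M5A ends up at index 53 in the full table.

```python
import pandas as pd

# Toy data: M5A appears twice, M4A once (values chosen for illustration).
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Victoria Village'],
})

# Join the neighborhood names within each (PostalCode, Borough) group.
combined = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .apply(', '.join)
       .reset_index()
)
print(combined)
```

The result has one row per postal code, with M4A listed before M5A because of the default key sort.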


> In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [12]:
df.shape[0]

103

In [14]:
# Save the df to a file ('codes-borough-neighborhood', aka 'cbn') for further analysis:
df.to_csv('cbn.csv', index=False)
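
Before relying on the saved file in later parts, it may be worth confirming that it reads back unchanged. The sketch below does a round trip on a hypothetical two-row frame `toy` in a temporary directory; the file name `cbn_sample.csv` is made up so the real `cbn.csv` is not touched.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the final column layout (values chosen for illustration).
toy = pd.DataFrame({
    'PostalCode': ['M4A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighborhood': ['Victoria Village', 'Harbourfront, Regent Park'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cbn_sample.csv')
    # index=False keeps the row index out of the file.
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.shape)
```

Without `index=False`, the index is written as an unnamed first column and comes back from `read_csv` as `Unnamed: 0`.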