## Segmenting and Clustering Neighborhoods in the City of Toronto, Canada: Part 1

Created by Rhys Morgan on 27th July 2020

[My GitHub Repository](https://github.com/rmjmorgan/Coursera_Capstone 'Coursera Capstone Project')  
[Course Info](https://www.coursera.org/learn/applied-data-science-capstone 'Applied Data Science Capstone')

In this part, I will demonstrate my understanding of scraping data from a webpage, then using it to create a dataframe.

Side Note: The Wiki page has been updated numerous times since this assignment was created, so the contents of the dataframe I create will differ slightly from the assignment example.

In [1]:
# import the Pandas library.
import pandas as pd

# verify that Pandas has been imported by displaying the version number.
'Pandas version {} imported.'.format(pd.__version__)

'Pandas version 1.0.1 imported.'

I will use Pandas' `read_html()` function to scrape data from the specified URL, because it's fast and requires very little code.  
This function searches the entire webpage for tabular data and returns a list of dataframes, one for each table it finds.
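To illustrate how `read_html()` handles a page with more than one table, here is a minimal sketch that parses a small inline HTML string instead of the live Wikipedia page. The HTML snippet and the names `html` and `tables` are made up for illustration, and the sketch assumes an HTML parser such as `lxml` is installed, which `read_html()` needs.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, standing in for a real webpage.
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
  <tr><td>M4A</td><td>North York</td></tr>
</table>
<table>
  <tr><th>Code</th><th>Region</th></tr>
  <tr><td>M</td><td>Toronto</td></tr>
</table>
"""

# read_html() returns a list with one dataframe per table found.
tables = pd.read_html(StringIO(html), header=0)

print(len(tables))              # number of tables found
print(list(tables[0].columns))  # column names of the first table
```

Indexing the returned list with `[0]` picks out the first table, which is the same pattern I use on the Wikipedia page.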

In [2]:
# define the URL.
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# "header=0" tells Pandas to use the first row of each table as the column names.
# "[0]" selects the first dataframe from the list that read_html() returns, i.e. the first table on the page.
df = pd.read_html(wiki_url, header=0)[0]

df.head(12)

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |


In [3]:
# retrieve the number of rows and columns of this dataframe.
'There are {} rows and {} columns in this dataframe.'.format(df.shape[0], df.shape[1])

'There are 180 rows and 3 columns in this dataframe.'

In this case, it is important that I know the borough name of every postal code in this dataset. Therefore, all rows where the borough is __"Not assigned"__ must be removed.

For the sake of consistency, I will also rename the columns to match those shown in the assignment example.

In [4]:
# remove all entries where the borough is 'Not assigned', then renumber the index from 0.
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)

# rename the columns to match those shown in the assignment example.
df = df.rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'})

df.head(12) 

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [5]:
# retrieve the number of rows and columns of this dataframe.
'There are {} rows and {} columns in this dataframe.'.format(df.shape[0], df.shape[1])

'There are 103 rows and 3 columns in this dataframe.'
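
As a quick sanity check on the cleaning step, the filter and rename can be tried in isolation on a tiny hand-built dataframe. This is only a sketch: the names `toy` and `cleaned` and the three sample rows are assumptions made for illustration, not part of the scraped data.

```python
import pandas as pd

# Toy data standing in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Drop unassigned boroughs, renumber the index, then rename the columns.
cleaned = (
    toy[toy['Borough'] != 'Not assigned']
    .reset_index(drop=True)
    .rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'})
)

# Confirm that no 'Not assigned' boroughs remain and the columns were renamed.
assert (cleaned['Borough'] != 'Not assigned').all()
assert list(cleaned.columns) == ['PostalCode', 'Borough', 'Neighborhood']
print(cleaned.shape)
```

Running the same two assertions against the real `df` would confirm that the 103 remaining rows all have an assigned borough.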