# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Web Scraping, Transforming Data into a Pandas Dataframe and Cleaning Data

This is the first part of the week 3 assignment. Our task is to scrape the Wikipedia table of Canadian postal codes that begin with M (the Toronto area), clean the data, and turn it into a usable *pandas dataframe*.

The first step is to import the libraries and packages we need:

In [1]:
import requests  # HTTP library used to download the web page
from bs4 import BeautifulSoup  # parser for HTML and XML documents
import pandas as pd  # data analysis and manipulation library

The next step is to scrape the data we need and turn it into a pandas dataframe:

In [2]:
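from io import StringIO  # newer pandas versions deprecate passing a raw HTML string

# Download the Wikipedia page that lists the postal codes starting with M
res = requests.get(
    'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
res.raise_for_status()  # stop early if the request failed

soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find('table')  # first <table> element on the page

# read_html returns a list of dataframes; take the first one
df = pd.read_html(StringIO(str(table)))[0]
df.head(10)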

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| 7 | M7A | Downtown Toronto | Queen's Park |
| 8 | M8A | Not assigned | Not assigned |
| 9 | M9A | Etobicoke | Islington Avenue |
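
To make the parsing step less of a black box, here is a minimal sketch of roughly what turning an HTML table into a dataframe involves, run on a small inline snippet rather than the live page. The snippet `SAMPLE_HTML` and the helper `table_to_dataframe` are made up for illustration; the real Wikipedia markup may differ and can change over time, and `pd.read_html` handles many cases this sketch ignores.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the downloaded page (not the real Wikipedia markup)
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def table_to_dataframe(html):
    """Turn the first <table> in `html` into a dataframe (illustrative helper)."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    if table is None:
        raise ValueError('no <table> element found')
    # Collect the text of every header or data cell, row by row
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')
    ]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


sample_df = table_to_dataframe(SAMPLE_HTML)
print(sample_df)
```

Checking for a missing table before parsing gives a clearer error than a failure further down if the page layout ever changes.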


As we can see, some rows have no borough assigned. We need to clean the data and keep only the rows whose 'Borough' value is something other than 'Not assigned':

In [3]:
# Keep only the rows that have an assigned borough
df = df[df.Borough != 'Not assigned']
# Renumber the rows from 0 after filtering
df = df.reset_index(drop=True)
df.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Downtown Toronto | Queen's Park |
| 6 | M9A | Etobicoke | Islington Avenue |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |


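(There's an alternative that gives the same result on this table, provided numpy is imported first:

import numpy as np<br/>
df = df.replace('Not assigned', np.nan)<br/>
df = df.dropna()

Note that `dropna()` removes a row when any column is missing, not only 'Borough'. Here we'll stick to the first method.)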
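
The difference between the two approaches only shows up when a row has a borough but no neighbourhood. The sketch below uses a small made-up dataframe (`toy`, with a hypothetical postal code `M0X`) to compare the boolean mask, a `dropna` restricted to 'Borough', and a bare `dropna()`.

```python
import numpy as np
import pandas as pd

# Made-up data: the last row has a borough but no neighbourhood (hypothetical case)
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M0X'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Method 1: boolean mask on the Borough column only
by_mask = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# Method 2: turn the placeholder into NaN, then drop rows missing a Borough
by_dropna = (
    toy.replace('Not assigned', np.nan)
       .dropna(subset=['Borough'])
       .reset_index(drop=True)
)

# A bare dropna() also discards the row that only lacks a neighbourhood
dropped_any = toy.replace('Not assigned', np.nan).dropna()

print(by_mask['Postcode'].tolist())      # ['M3A', 'M0X']
print(by_dropna['Postcode'].tolist())    # ['M3A', 'M0X']
print(dropped_any['Postcode'].tolist())  # ['M3A']
```

With the mask, the neighbourhood of `M0X` stays as the string 'Not assigned'; with the second method it becomes `NaN`. Which is preferable depends on how later steps treat missing neighbourhoods.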

Now let's rename the columns as suggested in the assignment description, and combine the rows that share a postal code into a single row, separating the neighborhood names with commas:

In [4]:
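# Rename the columns as suggested in the assignment description
df = df.rename(columns={'Postcode': 'PostalCode',
                        'Neighbourhood': 'Neighborhood'})

# Merge rows that share a postal code, joining neighborhood names with ', '
# (groupby sorts by its keys, so the result is ordered by postal code)
df = df.groupby(['PostalCode', 'Borough'])[
    'Neighborhood'].apply(', '.join).reset_index()

df.head(10)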

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
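
To see the grouping step in isolation, here is a small sketch that applies the same idea to three rows taken from the cleaned table, where M6A appears twice. The names `rows` and `combined` are just for this example, and `.agg` is used here in place of `.apply`; for a plain string join the two should behave the same.

```python
import pandas as pd

# Three rows from the cleaned table; M6A appears twice
rows = pd.DataFrame({
    'PostalCode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighborhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# Group by postal code and borough, then join each group's names with ', '
combined = (
    rows.groupby(['PostalCode', 'Borough'])['Neighborhood']
        .agg(', '.join)
        .reset_index()
)

print(combined)
```

The two M6A rows collapse into one row whose 'Neighborhood' value is "Lawrence Heights, Lawrence Manor", while M5A is left unchanged.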


Great. Now, let's see how many rows our dataframe consists of:

In [5]:
df.shape

(103, 3)
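
Beyond the shape, a few quick checks can confirm that the cleaning did what we intended. The helper `check_cleaned` below is a hypothetical sketch, not part of the assignment; it is shown on two tiny made-up frames, and the same call could be applied to `df`.

```python
import pandas as pd


def check_cleaned(frame):
    """Return a list of problems found in a cleaned frame (hypothetical helper)."""
    expected = ['PostalCode', 'Borough', 'Neighborhood']
    if list(frame.columns) != expected:
        return [f'unexpected columns: {list(frame.columns)}']
    problems = []
    if frame['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    if (frame['Borough'] == 'Not assigned').any():
        problems.append('unassigned boroughs remain')
    if frame.isna().any().any():
        problems.append('missing values')
    return problems


# Made-up examples: one clean frame, one with a repeated postal code
good = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})
bad = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern'],
})

print(check_cleaned(good))  # []
print(check_cleaned(bad))   # ['duplicate postal codes']
```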

Good. We have scraped the data, turned it into a pandas dataframe, and cleaned it. The result has 103 rows and 3 columns, and it is the dataframe we'll need for the next steps.