# Segmenting and Clustering Neighborhoods in Toronto - Part 1

## Summary
This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts its table of postal codes, and loads the data into a pandas dataframe.

## Getting the Data

In [1]:
# Import the required libraries, installing any that may be missing
import requests
import pandas as pd
# Use pip to install bs4 into the environment of the running kernel
import sys
!{sys.executable} -m pip install beautifulsoup4
from bs4 import BeautifulSoup
# Install the lxml parser used by BeautifulSoup
!{sys.executable} -m pip install lxml
import lxml



In [2]:
# Download the page and keep its HTML as a string
source_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(source_url).text

In [3]:
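# Create a soup object from the HTML, using the lxml HTML parser
# ('xml' would select the XML parser, which is not meant for HTML pages)
soup = BeautifulSoup(source, 'lxml')
# Take the first table on the page, which holds the postal codes
table = soup.find('table')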

In [4]:
# Create an empty dataframe with three columns: Postalcode, Borough, and Neighborhood
column_names = ['Postalcode', 'Borough', 'Neighborhood']
df = pd.DataFrame(columns=column_names)

In [5]:
# Read the postal code, borough, and neighborhood from each table row
# and append them to the dataframe
for tr_cell in table.find_all('tr'):
    row_data = []
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    # Header rows have no <td> cells, so only complete data rows are kept
    if len(row_data) == 3:
        df.loc[len(df)] = row_data

In [6]:
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
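
The loop above grows the dataframe one row at a time. A minimal sketch of an alternative, assuming the cell texts have already been extracted into plain lists: collect the rows first and build the dataframe in a single call. `parsed_rows` and `sample_df` are hypothetical names, the data rows are the five shown in the output above, and the one-cell row is an invented example of a row that should be skipped.

```python
import pandas as pd

column_names = ['Postalcode', 'Borough', 'Neighborhood']

# Stand-in for the stripped <td> texts of each table row.
parsed_rows = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M2A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['Canadian postal codes'],  # hypothetical non-data row, skipped below
    ['M4A', 'North York', 'Victoria Village'],
    ['M5A', 'Downtown Toronto', 'Harbourfront'],
]

# Keep only rows with exactly three cells, then build the frame in one call.
rows = [row for row in parsed_rows if len(row) == len(column_names)]
sample_df = pd.DataFrame(rows, columns=column_names)
print(sample_df)
```

Building the dataframe once avoids the repeated `df.loc[len(df)]` assignments, which matters little for a table of this size but scales better for larger ones.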


## Cleaning the Data

In [7]:
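# Drop rows whose borough is "Not assigned"
df = df[df['Borough'] != 'Not assigned'].copy()
# Where only the neighborhood is "Not assigned", use the borough name instead.
# Selecting the 'Neighborhood' column with .loc keeps the other columns intact.
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
df.head()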

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
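
The cleaning rule can be checked on a few rows in isolation. The following is a minimal sketch, not the notebook's exact cell: `sample_df`, `cleaned`, and `mask` are hypothetical names, and the M7A row is an illustrative example of a postal code with a borough but no neighborhood.

```python
import pandas as pd

sample_df = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M3A', 'North York', 'Parkwoods'],
        ['M7A', "Queen's Park", 'Not assigned'],  # illustrative row
    ],
    columns=['Postalcode', 'Borough', 'Neighborhood'],
)

# Drop rows without a borough; .copy() avoids assigning into a slice.
cleaned = sample_df[sample_df['Borough'] != 'Not assigned'].copy()

# Overwrite only the Neighborhood column where it is unassigned.
mask = cleaned['Neighborhood'] == 'Not assigned'
cleaned.loc[mask, 'Neighborhood'] = cleaned.loc[mask, 'Borough']
print(cleaned)
```

Restricting the assignment to the `Neighborhood` column matters: assigning to the whole masked row would also replace the postal code with the borough name.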


In [8]:
# Build a temporary dataframe with one row per postal code, joining the
# neighborhoods that share a postal code into a comma-separated string
temp_df = df.groupby('Postalcode')['Neighborhood'].apply(', '.join)
temp_df = temp_df.reset_index(drop=False)
temp_df.rename(columns={'Neighborhood': 'Neighborhood_joined'}, inplace=True)

In [9]:
# Attach the joined neighborhoods to every row with the same postal code
df_merge = pd.merge(df, temp_df, on='Postalcode')

In [10]:
# Drop the original neighborhood column and the now-identical duplicate rows,
# then rename the joined column back to Neighborhood
df_merge.drop(['Neighborhood'], axis=1, inplace=True)
df_merge.drop_duplicates(inplace=True)
df_merge.rename(columns={'Neighborhood_joined': 'Neighborhood'}, inplace=True)
df_merge.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
5,M7A,Queen's Park,Queen's Park
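
The group, merge, drop, and rename steps above can likely be collapsed into a single `groupby`. This is a sketch under the assumption that every postal code belongs to exactly one borough, so grouping on both columns does not split any postal code; `sample_df` and `merged` are hypothetical names, and the rows are taken from the outputs shown earlier.

```python
import pandas as pd

sample_df = pd.DataFrame(
    [
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M6A', 'North York', 'Lawrence Heights'],
        ['M6A', 'North York', 'Lawrence Manor'],
    ],
    columns=['Postalcode', 'Borough', 'Neighborhood'],
)

# Group on postal code and borough, join the neighborhoods of each group,
# and turn the group keys back into ordinary columns.
merged = (
    sample_df
    .groupby(['Postalcode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(merged)
```

Unlike the merge-based version, this produces a fresh 0-based index with no gaps where duplicate rows were dropped.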


In [11]:
df_merge.shape

(103, 3)
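
The shape alone does not confirm that the cleaning worked. A few sanity checks can be sketched as follows; `check_clean` is a hypothetical helper, and the pattern `M\d[A-Z]` (the letter M, a digit, a letter) is an assumption about the format of the postal codes in this table. The second frame is an invented example of a row whose postal code was overwritten by a borough name.

```python
import pandas as pd

def check_clean(frame):
    """Return a list of problems found in a cleaned postal-code frame."""
    problems = []
    if frame['Postalcode'].duplicated().any():
        problems.append('duplicate postal codes')
    if (frame[['Borough', 'Neighborhood']] == 'Not assigned').any().any():
        problems.append('unassigned borough or neighborhood')
    # Assumed format: 'M', one digit, one uppercase letter.
    if not frame['Postalcode'].str.match(r'^M\d[A-Z]$').all():
        problems.append('malformed postal code')
    return problems

columns = ['Postalcode', 'Borough', 'Neighborhood']
good = pd.DataFrame(
    [['M3A', 'North York', 'Parkwoods'],
     ['M4A', 'North York', 'Victoria Village']],
    columns=columns,
)
bad = pd.DataFrame(
    [['M3A', 'North York', 'Parkwoods'],
     ["Queen's Park", "Queen's Park", "Queen's Park"]],  # invented bad row
    columns=columns,
)

print(check_clean(good))
print(check_clean(bad))
```

Running `check_clean(df_merge)` after the last cell would flag any row that slipped through the cleaning steps.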