<h1 align=center><font size = 10>Segmenting and Clustering Neighborhoods in Toronto PART ONE</font></h1>

# By Katarzyna Cybulska

This is workbook one of three containing steps in segmentation and clustering of neighborhoods in Toronto. In this workbook you will find how to scrape Wikipedia table to get full list of Toronto's boroughs and neighborhoods with associated postal codes. Enjoy!

# We will start with imporing all required packages:

In [90]:
#import packages

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests

# For scrapping table from Wikipedia page https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup

print('Libraries imported.')


Libraries imported.


# Scrape Wikipedia table

For this exercise we will get data from Wikipedia page: "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M". Go ahead and check out table which is available on this page.

We will find contents of this table and save it in pandas DataFrame calle "wiki_table", omitting rows which have missing values ("Not assigned" borough).

In [91]:
#Initialize pandas data frame to store information about Toronto neighbourhoods
wiki_table = pd.DataFrame({"PostalCode":{},"Borough":{},'Neighborhood':{}})

# save Wikipedia page's url
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Get wikipedia page contencts in text format
res = requests.get(url).text

# Use BeautifulSoup package to wraggle data from Wikipedia page. Initialize BeautifulSoup object:
soup = BeautifulSoup(res,'lxml')


#search for table and copy tablesscontent to pandas data frame

for items in soup.find('table', class_='wikitable sortable').find_all('tr')[1::1]: 

    data = items.find_all(['td'])
    try:
        postal_code = data[0].text
        borough = data[1].text               
        hood = data[2].text[0:-1]             # to remove \n sign at the end of cell
    except IndexError:pass
    
    
    
    # if/else statements to help ommit rows with "Not assigned" borough, and use borough name in case there is no neighborhood name.   
    if hood == "Not assigned":
        if borough == "Not assigned":
            pass                            
        else:
            df = pd.DataFrame([[postal_code,borough,borough]], columns=wiki_table.columns.tolist())
            wiki_table = wiki_table.append(df)
    else:
            df = pd.DataFrame([[postal_code,borough,hood]], columns=wiki_table.columns.tolist())
            wiki_table = wiki_table.append(df)

        

#Set index
wiki_table.index = range(0,wiki_table.shape[0])
  
print(wiki_table.shape)
wiki_table.head(10)

(212, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


# Reshape table

We have all the data we need! Now we want to reshape this table a little: group table by Postal Codes, so that Nieghbourhoods are listed with comma separator.

In [96]:
# First drop Neighborhoods from wiki_table and create table of unique postal codes with assigned boroughs.

unique_postal_codes = wiki_table.drop('Neighborhood', axis = 1).drop_duplicates()
unique_postal_codes['Neighborhood']=''


# For each postal codes find all neighborhoods..
for postal_code in unique_postal_codes["PostalCode"]:
    list_of_neighborhoods = wiki_table.loc[wiki_table['PostalCode']==postal_code, wiki_table.columns[2]].tolist()
    
    # ... and paste them to one string
    hoods = list_of_neighborhoods[0]
    for hood in list_of_neighborhoods[1:-1]:
        hoods = hoods + ", " + hood
        
    # paste neighborhoods into the table
    unique_postal_codes.loc[unique_postal_codes['PostalCode'] == postal_code, unique_postal_codes.columns[2]]=hoods

#rename the table
toronto_neighborhoods = unique_postal_codes
print(toronto_neighborhoods.shape)
toronto_neighborhoods.tail(10)


(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
186,M8W,Etobicoke,Alderwood
188,M9W,Etobicoke,Northwest
189,M1X,Scarborough,Upper Rouge
190,M4X,Downtown Toronto,Cabbagetown
192,M5X,Downtown Toronto,First Canadian Place
194,M8X,Etobicoke,"The Kingsway, Montgomery Road"
197,M4Y,Downtown Toronto,Church and Wellesley
198,M7Y,East Toronto,Business reply mail Processing Centre969 Eastern
199,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So..."
207,M8Z,Etobicoke,"Kingsway Park South West, Mimico NW, The Queen..."


# Done!

Remeber to check rest of the exercise in notebooks Segmenting-and-Clustering-Neighborhoods-in-Toronto-PART_TWO and Segmenting-and-Clustering-Neighborhoods-in-Toronto-PART_THREE.

# Thank you for reading!

