# Segmenting and Clustering Neighborhoods in Toronto (Pt. 1)

This notebook is part of the Capstone Project for the <a href="https://www.coursera.org/professional-certificates/ibm-data-science">IBM Data Science Professional Certificate</a>. In this project we will <strong>segment and cluster</strong> neighborhoods in Toronto, Canada.

The generated dataframe will meet the following requirements:
<ul>
<li>The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood</li>
<li>Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.</li>
<li>More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.</li>
<li>If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.</li>
<li>Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.</li>
<li>In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.</li>
</ul>
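The combining rule in the third requirement can be sketched with a pandas `groupby`. This is a minimal illustration on a hand-made frame, not the scraped data; the helper name `combine_neighborhoods` is hypothetical, and it assumes the three column names listed above.

```python
import pandas as pd


def combine_neighborhoods(df):
    """Collapse rows that share a postal code into one comma-separated row.

    Hypothetical helper for illustration; assumes the columns are named
    PostalCode, Borough, and Neighborhood.
    """
    return (
        df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)  # join every neighborhood in the group into one string
        .reset_index()
    )


# Toy rows (made up for the example) in a one-neighborhood-per-row layout.
raw = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M5A", "M5A"],
        "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park"],
    }
)

combined = combine_neighborhoods(raw)
print(combined)
```

Passing `sort=False` keeps the postal codes in their original order instead of sorting them alphabetically.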


## Make the Necessary Imports

In [1]:
import requests
import pandas as pd

from bs4 import BeautifulSoup

## Define a function that returns a DataFrame meeting the requirements above

### Note: No row with an assigned borough has a 'Not assigned' neighborhood, so the fourth requirement above needed no special handling.
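That claim can be checked in code rather than by eye. The following is a minimal sketch, assuming the column names used in this notebook; the toy frame and the helper name `fill_unassigned_neighborhoods` are made up for illustration. It counts the affected rows and applies the "neighborhood equals borough" fallback from the requirements if any exist.

```python
import pandas as pd


def fill_unassigned_neighborhoods(df):
    """Return a copy where 'Not assigned' neighborhoods take the borough name.

    Hypothetical helper; also returns how many rows were changed so the
    caller can confirm whether the fallback was needed at all.
    """
    df = df.copy()
    mask = df["Neighborhood"] == "Not assigned"
    df.loc[mask, "Neighborhood"] = df.loc[mask, "Borough"]
    return df, int(mask.sum())


# Made-up rows: the second one has a borough but no neighborhood.
toy = pd.DataFrame(
    {
        "PostalCode": ["M3A", "X0X"],
        "Borough": ["North York", "Example Borough"],
        "Neighborhood": ["Parkwoods", "Not assigned"],
    }
)

filled, n_fixed = fill_unassigned_neighborhoods(toy)
print(n_fixed)
print(filled)
```

A count of zero on the real data would confirm that the step can be skipped.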

In [2]:
def parse_postal_codes(table):
    t_headers = []
    
    for th in table.find_all("th"):
        # Get all header labels
        t_headers.append(th.text.replace('\n', ' ').strip())
        
    t_data = []
        
    for tr in table.tbody.find_all("tr"): # find all tr's from table's tbody
        t_row = {}
        
        # Pair each <td> cell in the row with its header label
        for td, th in zip(tr.find_all("td"), t_headers): 
            t_row[th] = td.text.replace('\n', '').strip()
        
        t_data.append(t_row)
            
    df = pd.DataFrame(t_data)
    
    # Drop the all-NaN row produced by the header <tr>, which has no <td> cells
    df = df.dropna()
    
    # Keep only rows with an assigned borough
    df = df[df.Borough != "Not assigned"]
    df = df.rename(columns={"Postal Code": "PostalCode"})
    
    return df      
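
The header-and-cell pairing inside `parse_postal_codes` is easier to see on a small inline table. The sketch below uses made-up HTML in the same shape as the Wikipedia table. Unlike the function above, it skips the header row explicitly instead of dropping it later, since a `<tr>` holding only `<th>` cells would otherwise produce an empty record.

```python
from bs4 import BeautifulSoup

# Made-up table with the same layout as the Wikipedia one.
html = """
<table>
  <tbody>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
headers = [th.text.strip() for th in table.find_all("th")]

rows = []
for tr in table.tbody.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # the header row has only <th> cells
        continue
    # zip pairs each header label with the cell in the same position
    rows.append({h: td.text.strip() for h, td in zip(headers, cells)})

print(rows)
```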

## Scrape the page with BeautifulSoup

In [3]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
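# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, "html.parser")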

nb_table = soup.find("table", attrs={"class": "wikitable sortable"})

df = parse_postal_codes(nb_table)

## Print the dataframe

In [4]:
df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 6 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 7 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 161 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 166 | M4Y | Downtown Toronto | Church and Wellesley |
| 169 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 170 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


## Print the shape of the generated dataframe

In [5]:
df.shape

(103, 3)
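
`.shape` is an attribute that holds a `(rows, columns)` tuple, so the row count on its own is the first element. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Toy frame (made up for the example) with the notebook's three columns.
toy = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods", "Victoria Village"],
    }
)

n_rows, n_cols = toy.shape  # unpack the (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")
```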