# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
This notebook extracts and processes the table of postcodes from the Wikipedia page at: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

We ignore unassigned postcodes, merge all neighbourhoods that share a postcode into a single row, and use the borough name as the neighbourhood wherever a neighbourhood has not been assigned a name of its own.


### Import the packages we'll use

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML

### Download the wiki page and convert into an object tree

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

wiki_page = requests.get(url).text
    
soup = BeautifulSoup(wiki_page,'lxml')

### Find the table part of the page and extract the headers
3.1 The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [3]:
My_table = soup.find('table',{'class':'wikitable sortable'})
#print(My_table)

ths = My_table.findAll('th')
headers=[]
for th in ths:
    headers.append(th.text.strip())
headers

['Postcode', 'Borough', 'Neighbourhood']

### Now add the data
Find all of the table rows containing data and build up a dictionary of postcodes, boroughs and neighbourhoods, filtering and appending neighbourhoods to existing postcodes as we go.
* 3.2 Only process the cells that have an assigned borough. _Ignore cells with a borough that is Not assigned._
* 3.3 More than one neighborhood can exist in one postal code area. _For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table._
* 3.4 If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. _So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park._

In [4]:
# Find all of the table rows to process
rows = My_table.findAll('tr')
postcode_list={}
# lose the first row as that's the headers that we have already processed
rows=rows[1:]
for row in rows:
    tds = row.findAll('td')
    postcode = tds[0].text.strip()
    borough = tds[1].text.strip()
    neigh = tds[2].text.strip()
    
    #3.4 If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
    if (neigh == "Not assigned"):
        neigh = borough

    ## 3.2 Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    if (borough != 'Not assigned'):
        postcode_dict = {
            "Postcode": postcode, 
            "Borough": borough, 
            "Neighbourhood": neigh
        }
        ## 3.3 If we already have an entry for this postcode, concatenate the neighbourhoods and update the existing entry
        if (postcode in postcode_list):
            concatenated_neigh = postcode_list[postcode]["Neighbourhood"] + ", " + neigh
            postcode_dict.update({"Neighbourhood": concatenated_neigh})
            postcode_list[postcode].update(postcode_dict)
        else:
            postcode_list[postcode] = postcode_dict
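
The three rules above can also be written as a small pure-Python function, which makes them easy to check without downloading the page. This is only a minimal sketch, not the notebook's own code: the function name `merge_postcode_rows` and the sample rows are hypothetical, and the input is assumed to be already-stripped `(postcode, borough, neighbourhood)` tuples.

```python
def merge_postcode_rows(rows):
    """Apply rules 3.2-3.4 to (postcode, borough, neighbourhood) tuples."""
    merged = {}
    for postcode, borough, neigh in rows:
        # 3.2: skip rows whose borough is not assigned
        if borough == "Not assigned":
            continue
        # 3.4: an unassigned neighbourhood takes the borough's name
        if neigh == "Not assigned":
            neigh = borough
        if postcode in merged:
            # 3.3: append to the neighbourhoods already stored for this postcode
            merged[postcode]["Neighbourhood"] += ", " + neigh
        else:
            merged[postcode] = {
                "Postcode": postcode,
                "Borough": borough,
                "Neighbourhood": neigh,
            }
    return merged


# Illustrative rows only (hypothetical sample, not scraped data)
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M7A", "Queen's Park", "Not assigned"),
]

result = merge_postcode_rows(sample_rows)
print(result["M5A"]["Neighbourhood"])  # Harbourfront, Regent Park
print(result["M7A"]["Neighbourhood"])  # Queen's Park
```

Keeping the rules in one function means the scraping step only has to produce tuples, and the filtering logic can be exercised on a handful of hand-written rows.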

### Convert our dictionary of postcode data into a dataframe
Show the first 12 entries.

In [5]:
df = pd.DataFrame(list(postcode_list.values()), columns=headers)
# Keep 'Postcode' as a regular column: it is the key for the merge with the coordinates later on
df.head(12)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M5N | Central Toronto | Roselawn |
| 1 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 2 | M4J | East York | East Toronto |
| 3 | M4R | Central Toronto | North Toronto West |
| 4 | M1X | Scarborough | Upper Rouge |
| 5 | M5W | Downtown Toronto | Stn A PO Boxes 25 The Esplanade |
| 6 | M2R | North York | Willowdale West |
| 7 | M4A | North York | Victoria Village |
| 8 | M6M | York | Del Ray, Keelesdale, Mount Dennis, Silverthorn |
| 9 | M6K | West Toronto | Brockton, Exhibition Place, Parkdale Village |


### Now check the shape of the dataframe
3.6 In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [12]:
print("The dataframe has {} rows.".format(df.shape[0]))

The dataframe has 103 rows.


### Obtain co-ordinates for the suburbs
* 4.2 Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In [7]:
# The coordinates could come from the geocoder package or from a CSV file; the CSV file is used here

import csv

csv_url = 'https://cocl.us/Geospatial_data'

with requests.Session() as s:
    download = s.get(csv_url)

    decoded_content = download.content.decode('utf-8')

    reader = csv.reader(decoded_content.splitlines(), delimiter=',')
    csv_data = list(reader)
            
latlong_df=pd.DataFrame(csv_data[1:], columns=['Postcode', 'Latitude', 'Longitude'])
# 'Postcode' stays a regular column so that it can be used as the merge key
latlong_df.head()

| | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.8066863 | -79.1943534 |
| 1 | M1C | 43.7845351 | -79.1604971 |
| 2 | M1E | 43.7635726 | -79.1887115 |
| 3 | M1G | 43.7709921 | -79.2169174 |
| 4 | M1H | 43.773136 | -79.2394761 |
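
Because `csv.reader` yields every field as a string, the coordinates remain text until they are converted explicitly. The following is a minimal sketch of converting them while reading, using an inline sample in place of the download. The header names in `sample_csv` and the helper name `read_coordinates` are assumptions made for illustration; the three data rows are copied from the output above.

```python
import csv
import io

# Inline stand-in for the downloaded file (header names are assumed)
sample_csv = """Postcode,Latitude,Longitude
M1B,43.8066863,-79.1943534
M1C,43.7845351,-79.1604971
M1E,43.7635726,-79.1887115
"""


def read_coordinates(text):
    """Map each postcode to a (latitude, longitude) tuple of floats."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["Postcode"]: (float(row["Latitude"]), float(row["Longitude"]))
        for row in reader
    }


coords = read_coordinates(sample_csv)
print(coords["M1B"])
```

Converting at read time avoids carrying string coordinates into later steps, at the cost of failing early if a field is not a valid number.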


### Update our dataframe with the co-ordinates of each postcode
e.g. 'M4N' should have co-ordinates: ('43.7280205', '-79.3887901')

In [8]:
df = pd.merge(df, latlong_df, how='outer', on='Postcode')
#convert the lat/long strings to numeric
df[['Latitude', 'Longitude']] = df[['Latitude', 'Longitude']].apply(pd.to_numeric)
df.head(12)

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5N | Central Toronto | Roselawn | 43.711695 | -79.416936 |
| 1 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 2 | M4J | East York | East Toronto | 43.685347 | -79.338106 |
| 3 | M4R | Central Toronto | North Toronto West | 43.715383 | -79.405678 |
| 4 | M1X | Scarborough | Upper Rouge | 43.836125 | -79.205636 |
| 5 | M5W | Downtown Toronto | Stn A PO Boxes 25 The Esplanade | 43.646435 | -79.374846 |
| 6 | M2R | North York | Willowdale West | 43.782736 | -79.442259 |
| 7 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 8 | M6M | York | Del Ray, Keelesdale, Mount Dennis, Silverthorn | 43.691116 | -79.476013 |
| 9 | M6K | West Toronto | Brockton, Exhibition Place, Parkdale Village | 43.636847 | -79.428191 |
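
An outer merge keeps postcodes that appear in only one of the two frames, so it can be worth checking whether any rows failed to match. The sketch below uses the `validate` and `indicator` parameters of `pd.merge` on two small frames. The frames themselves are illustrative: the first two postcodes and their coordinates come from the output above, while `M9Z` and its zero coordinates are made up to force a mismatch.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M5N", "M4J", "M1X"],
    "Borough": ["Central Toronto", "East York", "Scarborough"],
})
right = pd.DataFrame({
    "Postcode": ["M5N", "M4J", "M9Z"],  # M9Z is a made-up code with no match
    "Latitude": [43.711695, 43.685347, 0.0],
    "Longitude": [-79.416936, -79.338106, 0.0],
})

# validate raises MergeError if a postcode is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from
checked = pd.merge(left, right, how="outer", on="Postcode",
                   validate="one_to_one", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(sorted(unmatched["Postcode"]))  # ['M1X', 'M9Z']

# A left merge keeps only the postcodes present in the first frame
left_merged = pd.merge(left, right, how="left", on="Postcode")
print(len(left_merged))  # 3
```

If the unmatched set is empty, the outer and left merges give the same rows; if it is not, a left merge avoids introducing postcodes that have coordinates but no borough.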


### Explore and cluster the neighborhoods in Toronto
You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:
* to add enough Markdown cells to explain what you decided to do and to report any observations you make.
* to generate maps to visualize your neighborhoods and how they cluster together.
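
Restricting the analysis to boroughs that contain the word Toronto can be done with a string filter on the `Borough` column. This is a minimal sketch on a hypothetical four-row sample; the variable names `sample` and `toronto_only` are illustrative, and the same expression would be applied to the full dataframe.

```python
import pandas as pd

# Small illustrative sample using boroughs that appear in the table above
sample = pd.DataFrame({
    "Postcode": ["M5N", "M1X", "M6K", "M4A"],
    "Borough": ["Central Toronto", "Scarborough", "West Toronto", "North York"],
})

# regex=False treats "Toronto" as a plain substring rather than a pattern
mask = sample["Borough"].str.contains("Toronto", regex=False)
toronto_only = sample[mask].reset_index(drop=True)
print(toronto_only["Postcode"].tolist())  # ['M5N', 'M6K']
```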

In [9]:
# Load the previous New York notebook for reference