## IBM Data Science - Peer graded assignment
### Explore, segment and cluster the neighborhoods of Toronto city

In this assignment, I am going to explore Toronto city neighborhoods by using segmenting and clustering. <br>
The data is not readily available on the internet. There is a Wikipedia page that exists for Toronto neighborhood data. <br>
Here is the link below: 

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.


### Importing all the required libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analsysis

import requests
!pip install BeautifulSoup4
from bs4 import BeautifulSoup 

print('Required Libraries imported.')

Required Libraries imported.


### Scraping the required data from Wikipedia Url and extracting data

In [2]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib') 
table = soup.find('div', attrs = {'id':'container'})

print('Wikipedia Page Scrapped.')

Wikipedia Page Scrapped.


And only processing the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. <br>
If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.

In [3]:
postalCodes = [];
boroughs= [];
neighborhoods = [];
columnNum = 1;
passVal = False

for row in soup.find_all('td'):
    for cell in row:
        if cell.string and cell.string[0].isalpha() and len(cell.string) > 2:
            passVal = False
            if columnNum == 1:
                if passVal == False and cell.string[1].isdigit():
                    postalCodes.append(cell.string);   
                    columnNum = 2
                else:
                    continue
            elif columnNum == 2 :
                if cell.string == 'Not assigned':
                    passVal = True
                    del postalCodes[-1]
                    columnNum = 1
                    continue
                else:
                    boroughs.append(cell.string);      
                    columnNum = 3
            elif columnNum == 3 :
                if cell.string == 'Not assigned\n':
                    neighborhoods.append(boroughs[-1])
                else:
                    neighborhoods.append(cell.string); 
                columnNum = 1
                
print('Required Data Collected.')

Required Data Collected.


#### Defining columns for the Dataframe

In [6]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,PostalCode,Borough,Neighborhood


In [11]:
# Appending columns and 
for data in range(len(neighborhoods)):
    code = postalCodes[data]
    borough = boroughs[data]
    neighborhood_name = neighborhoods[data]

    df = df.append({ 'PostalCode': code,
                                   'Borough': borough,
                                   'Neighborhood': neighborhood_name}, ignore_index=True)

df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...
1,M1A,Not assigned,M2A
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,M9A
8,M1B,Scarborough,Malvern / Rouge
9,M2B,Not assigned,M3B


In [12]:
df.shape

(406, 3)

### Installing and importing Geo-coder library

In [14]:
import sys
!{sys.executable} -m pip install geocoder
import geocoder # import geocoder

print('GeoCoder Package installed.')

GeoCoder Package installed.


#### Defining new dataframe columns to include geoCodes

In [15]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)

df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude


#### Merging and Appending the dataframe with geocodes - Latitude and Longitudes

In [16]:
# initialize your variable to None
lat_lng_coords = None

for data in range(0, len(postalCodes)-1):
    code = postalCodes[data]
    borough = boroughs[data]
    neighborhood_name = neighborhoods[data]
    
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
    lat_lng_coords = g.latlng

    df = df.append({ 'PostalCode': code,
                                   'Borough': borough,
                                   'Neighborhood': neighborhood_name,
                                   'Latitude': lat_lng_coords[0],
                                   'Longitude': lat_lng_coords[1]}, ignore_index=True)
    
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1A,Not assigned,M2A,43.64869,-79.38544
1,M3A,North York,Parkwoods,43.752935,-79.335641
2,M4A,North York,Victoria Village,43.728102,-79.31189
3,M5A,Downtown Toronto,Regent Park / Harbourfront,43.650964,-79.353041
4,M6A,North York,Lawrence Manor / Lawrence Heights,43.723265,-79.451211
5,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.66179,-79.38939
6,M8A,Not assigned,M9A,43.64869,-79.38544
7,M1B,Scarborough,Malvern / Rouge,43.808626,-79.189913
8,M2B,Not assigned,M3B,43.64869,-79.38544
9,M4B,East York,Parkview Hill / Woodbine Gardens,43.707193,-79.311529


In [17]:
df.shape

(135, 5)