<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto - Part 1</font></h1>

## Introduction

In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to be agile and sharpen your ability to pick up new libraries and tools quickly as the project demands.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Before we get the data and start exploring it, let's import all the dependencies that we will need.


In [7]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions use pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

#!conda install -c conda-forge beautifulsoup4 --yes  # uncomment this line if you haven't completed beautifulsoup
from bs4 import BeautifulSoup as bs

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas  dataframe like the one shown below:


For convenience, you can simply run a `wget` command to save the page to a local file and access the data from there. So let's go ahead and do that.

In [2]:
!wget -q -O 'canada_postal_code_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')

Data downloaded!


#### Load and explore the data

In [3]:
with open("canada_postal_code_data.html") as fp:
    soup = bs(fp, 'html.parser')

tag = soup.table
body = tag.tbody
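
To make the navigation above concrete, here is a minimal sketch on a tiny hand-written table. The `sample_html` string and the `sample_*` names are made up for illustration and are not the actual Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the downloaded page
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# soup.table returns the first <table> element; .tbody steps into its body
sample_body = sample_soup.table.tbody

# each <tr> is one row; stripped_strings yields the cell texts
# with surrounding whitespace removed and empty strings skipped
sample_rows = [list(tr.stripped_strings) for tr in sample_body.find_all("tr")]
print(sample_rows)
```

Note that `html.parser` does not add a `<tbody>` element that is missing from the markup, so `table.tbody` is `None` for a table written without one.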

Transform the data into a dataframe


In [4]:
Tab = (body.tr).find_all('th')
colNames = [Tab[0].string.strip('\n'), Tab[1].string.strip('\n'), Tab[2].string.strip('\n')]

# instantiate the dataframe
postcodedf = pd.DataFrame(columns=colNames)
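
The header extraction above relies on `.string`, which returns `None` when a tag has more than one child. The sketch below shows that behaviour and a possible alternative using `get_text(strip=True)`. The `header_html` fragment is hypothetical, and whether the real page needs this depends on its current markup.

```python
from bs4 import BeautifulSoup

# Hypothetical header row: the second cell wraps its text in a <b> tag
# and is followed by a newline, so it has two children
header_html = "<tr><th>Postcode\n</th><th><b>Borough</b>\n</th></tr>"
header_cells = BeautifulSoup(header_html, "html.parser").find_all("th")

# .string works when the tag has exactly one child
first_name = header_cells[0].string.strip()

# .string is None when the tag has several children
second_string = header_cells[1].string

# get_text(strip=True) concatenates all nested text and strips whitespace
names = [th.get_text(strip=True) for th in header_cells]
print(first_name, second_string, names)
```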

Each remaining row of the table holds a postal code, a borough, and a neighbourhood. So, let's loop over the rows, skip the ones whose borough is 'Not assigned', and collect the rest into the dataframe.

In [5]:
# collect the cleaned records in a list (DataFrame.append was removed in pandas 2.0)
rows = []

# extract all 'tr' tagged fields except the first one (column names)
CodesData = body.find_all('tr')[1:]

for code in CodesData:
    tabc = ["", "", ""]
    for i, value in enumerate(code.stripped_strings):
        tabc[i] = value.strip()
    postcode = tabc[0]
    borough = tabc[1]
    neighbourhood = tabc[2]

    # Ignore borough = 'Not assigned' records
    if borough != 'Not assigned':
        # Use the borough name when neighbourhood = 'Not assigned'
        if neighbourhood == 'Not assigned':
            neighbourhood = borough

        # add the cleaned record to the list of rows
        rows.append({'PostalCode': postcode,
                     'Borough': borough,
                     'Neighbourhood': neighbourhood})

# build the dataframe from the collected rows
postcodedf = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighbourhood'])

# Combine rows with same postal code into one row with the neighborhoods separated with a comma 
df = postcodedf.groupby('PostalCode', as_index=False).agg({'Borough':'first', 'Neighbourhood':', '.join})
df.head(10)

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
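
The cleaning rules and the `groupby` step can also be tried on a small toy dataframe. The sketch below uses made-up rows (not the real Wikipedia data) and expresses the same three rules with vectorized pandas operations. The names `raw`, `clean`, and `toy` are hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the shape of the scraped table
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# rule 1: drop rows whose borough is 'Not assigned'
clean = raw[raw["Borough"] != "Not assigned"].copy()

# rule 2: fall back to the borough name where the neighbourhood is 'Not assigned'
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# rule 3: one row per postal code, neighbourhoods joined with a comma
toy = clean.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(toy)
```

Here the two `M5A` rows collapse into a single row, and `M7A` keeps its borough name as its neighbourhood.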


In [6]:
print ("Toronto postal codes dataframe dimensions = ", df.shape)

Toronto postal codes dataframe dimensions =  (103, 3)
