# Capstone for Data Science: Part 2 - Toronto Neighbourhoods Clustering

**Renzo Maldonado**

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

Start by creating a new notebook for this assignment.
Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data in its table of postal codes, and transform that data into a pandas dataframe like the one shown below:

![Example of the expected dataframe](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1605052800000&hmac=8Rp6gpIwz2quoycYAIeS6V02xqWXOxILQHR7MkYOtRo)

### Loading data from the Internet

In [2]:
# importing libraries

import requests
import lxml.html as lh
import pandas as pd

In [3]:
link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# download the page
page = requests.get(link)

# parse the downloaded HTML into an element tree stored under doc
doc = lh.fromstring(page.content)

# collect every <tr>..</tr> element (table row) of the HTML page
tr_data = doc.xpath('//tr')

In [4]:
# number of cells in each of the first 10 rows, i.e. the number of columns per row

[len(T) for T in tr_data[:10]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
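
To see why `len()` of a row gives its number of columns, here is a minimal sketch on a hand-written HTML fragment. The fragment and the names `sample_html`, `sample_rows` and `cell_counts` are illustrative assumptions, not the Wikipedia page itself. In lxml, `len()` of an element counts its direct children, so each `<tr>` reports how many `<th>`/`<td>` cells it holds.

```python
import lxml.html as lh

# Hypothetical HTML fragment standing in for the Wikipedia table
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

sample_doc = lh.fromstring(sample_html)
sample_rows = sample_doc.xpath('//tr')

# len() of an lxml element is the number of its direct children (the cells)
cell_counts = [len(row) for row in sample_rows]
print(cell_counts)

# text_content() returns the text of a cell, including text in nested tags
header = [cell.text_content().strip() for cell in sample_rows[0]]
print(header)
```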

In [5]:
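# parse the first row as the header (tr_data was built in the cell above)

# list of (column name, values) pairs, one per column
col = []

# for each cell of the first row, store its text as a column name
# together with an empty list that will hold the column's values
for i, t in enumerate(tr_data[0], start=1):
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))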

1:"Postal Code
"
2:"Borough
"
3:"Neighbourhood
"


#### Pandas

In [6]:
# data is stored from the second row onwards
for j in range(1, len(tr_data)):
    # T is our j'th row
    T = tr_data[j]

    # a row that does not have 3 cells is not from our table, so stop here
    if len(T) != 3:
        break

    # i is the index of the current column
    i = 0

    # loop through each cell of the row
    for t in T.iterchildren():
        data = t.text_content()
        # for every column except the first, convert purely numeric values to integers
        if i > 0:
            try:
                data = int(data)
            except ValueError:
                # not a number: keep the text as it is
                pass
        # append the data to the list of the i'th column
        col[i][1].append(data)
        # increment i for the next column
        i += 1

In [7]:
# check that every column holds the same number of rows
[len(C) for (title,C) in col]

[181, 181, 181]

In [8]:
# create the dataframe df
df=pd.DataFrame({title:column for (title,column) in col})

In [9]:
# check the first 5 rows of the created dataframe
df.head()

|   | Postal Code\n | Borough\n | Neighbourhood\n |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |
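
The step from the list of `(title, values)` pairs to a dataframe can be tried in isolation. The sketch below uses a small hand-made list, `sample_col`, as a hypothetical stand-in for `col`; it checks that all columns have the same length and then builds the dataframe with the same dict comprehension.

```python
import pandas as pd

# Hypothetical (title, values) pairs in the same shape as `col`
sample_col = [
    ('Postal Code', ['M3A', 'M4A']),
    ('Borough', ['North York', 'North York']),
    ('Neighbourhood', ['Parkwoods', 'Victoria Village']),
]

# Every column must hold the same number of values,
# otherwise pd.DataFrame raises a ValueError
lengths = [len(values) for _, values in sample_col]
assert len(set(lengths)) == 1

# The dict comprehension maps each title to its list of values
sample_df = pd.DataFrame({title: values for title, values in sample_col})
print(sample_df.shape)
```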


In [12]:
# remove \n from the dataframe
df.columns = df.columns.str.replace('\n','', regex=True)
df = df.replace('\n','', regex=True)
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
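
The `regex=True` argument matters in `DataFrame.replace`: without it, pandas only replaces cells whose whole value equals `'\n'`, so a trailing newline inside a longer string is left alone. A minimal sketch on made-up data (the frame `raw` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame with the same trailing-newline problem
raw = pd.DataFrame({
    'Postal Code\n': ['M3A\n', 'M4A\n'],
    'Borough\n': ['North York\n', 'North York\n'],
})

# Column labels are cleaned through the .str accessor of the Index
raw.columns = raw.columns.str.replace('\n', '', regex=False)

# Without regex=True only cells equal to exactly '\n' would be replaced
untouched = raw.replace('\n', '')

# With regex=True the pattern is matched inside each string
clean = raw.replace('\n', '', regex=True)
print(clean)
```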


In [13]:
# drop rows with "Not assigned" values in Borough
df.drop(df.index[df['Borough'] == 'Not assigned'], inplace = True)
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
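
Dropping rows by index label, as above, keeps the original labels, which is why the output now starts at 2. An equivalent way to express the same filter is a boolean mask; the sketch below uses a small made-up frame (`sample` is hypothetical) and also shows what `reset_index(drop=True)` does to the labels.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

# Keep only rows whose Borough is assigned; the original labels survive
assigned = sample[sample['Borough'] != 'Not assigned']
labels_before = assigned.index.tolist()

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
assigned = assigned.reset_index(drop=True)
labels_after = assigned.index.tolist()
print(labels_before, labels_after)
```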


In [14]:
# reset index in the new dataframe
df = df.reset_index(drop=True)
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [15]:
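# combine rows that share a postal code and borough into one row,
# joining their neighbourhoods with commas (the result is sorted by postal code)
df = df.groupby(['Postal Code', 'Borough'])['Neighbourhood'].apply(','.join).reset_index()
df.columns = ['Postal Code','Borough','Neighbourhood']
# drop any leftover row whose Borough is 'Canadian postal codes', which is not a Toronto borough
df.drop(df.index[df['Borough'] == 'Canadian postal codes'], inplace = True)
df = df.reset_index(drop=True)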
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
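
The effect of the `groupby` step is easier to see on data where a postal code really does span several rows. The sketch below is a toy example under that assumption (the frame `sample` is made up, and `', '` is used as the separator purely for readability):

```python
import pandas as pd

# Hypothetical input: postal code M1B appears on two rows
sample = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern', 'Rouge', 'Woburn'],
})

# Group on the key columns, join each group's neighbourhoods into one string,
# then turn the group keys back into ordinary columns
combined = (
    sample.groupby(['Postal Code', 'Borough'])['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)
print(combined)
```

Each group of neighbourhood names is passed to `', '.join`, so the two M1B rows collapse into a single row while M1G is returned unchanged.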


In [16]:
# check the number of rows in the new dataframe after combining neighbourhoods that share a postal code
df.count()

Postal Code      103
Borough          103
Neighbourhood    103
dtype: int64

In [17]:
# strip leading and trailing whitespace from every column
df['Neighbourhood'] = df['Neighbourhood'].str.strip()
df['Postal Code'] = df['Postal Code'].str.strip()
df['Borough'] = df['Borough'].str.strip()
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
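
Note that `str.strip()` only trims whitespace at the two ends of each string; spaces inside a value, such as the one after each comma, are kept. A short sketch with made-up values (the series `names` is hypothetical):

```python
import pandas as pd

# Hypothetical values with stray whitespace at the ends
names = pd.Series(['  Malvern, Rouge ', 'Woburn\n'])

# strip() removes leading and trailing whitespace, including newlines,
# but leaves the space inside 'Malvern, Rouge' alone
stripped = names.str.strip()
print(stripped.tolist())
```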


In [18]:
# where Neighbourhood is "Not assigned", use the Borough name as the Neighbourhood
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough']
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
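
The first five rows are unchanged because none of them has an unassigned neighbourhood. To see the assignment take effect, here is a minimal sketch on a made-up frame (`sample` and its values are assumptions for illustration). The boolean mask selects the rows to update, and pandas aligns the right-hand side on the index, so only those rows receive their borough name.

```python
import pandas as pd

# Hypothetical rows: the first one has no neighbourhood assigned
sample = pd.DataFrame({
    'Postal Code': ['M7A', 'M9A'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Islington Avenue'],
})

# Rows whose neighbourhood is missing
mask = sample['Neighbourhood'] == 'Not assigned'

# Copy the borough into the neighbourhood for the masked rows only
sample.loc[mask, 'Neighbourhood'] = sample.loc[mask, 'Borough']
print(sample)
```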


#### Final shape of the dataframe

In [19]:
df.shape

(103, 3)

In [20]:
# save the cleaned dataframe to a CSV file; the index is written as an unnamed first column
df.to_csv(r'df_canada_postcodes.csv')
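
One thing to keep in mind when reloading this file: by default `to_csv` also writes the index, and `read_csv` then returns it as an extra column named `Unnamed: 0`. The sketch below illustrates this on a one-row made-up frame, writing to in-memory buffers instead of a file (the names `sample`, `with_index` and `without_index` are hypothetical). Passing `index=False` is one way to avoid the extra column, if the index is not needed later.

```python
import io
import pandas as pd

# Hypothetical one-row frame
sample = pd.DataFrame({'Postal Code': ['M1B'], 'Borough': ['Scarborough']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```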