# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

# Step 1:

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1) Start by creating a new Notebook for this assignment.

2) Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

In [12]:
import requests
import lxml.html as lh
import pandas as pd
print("Imported Successfully")

Imported Successfully


In [13]:
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Download the page; response holds the raw HTTP response
response=requests.get(wikipedia_link)

#Parse the HTML content into an element tree, wikipedia_page
wikipedia_page = lh.fromstring(response.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = wikipedia_page.xpath('//tr')

#Check the number of cells in each of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

In [14]:
tr_elements = wikipedia_page.xpath('//tr')

#Create empty list
col=[]

#For each header cell in the first row, store its text and an empty list
for t in tr_elements[0]:
    name=t.text_content()
    print(name)
    col.append((name,[]))

Postcode
Borough
Neighbourhood



In [15]:
col

[('Postcode', []), ('Borough', []), ('Neighbourhood\n', [])]

In [16]:
#Since our first row is the header, data is stored from the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #For every column after the first, convert numerical values to integers
        if i>0:
            try:
                data=int(data)
            except ValueError:
                #Keep non-numeric text as a string
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [17]:
[len(C) for (title,C) in col]

[287, 287, 287]

In [18]:
# Transform the list of (title, column) pairs into a dictionary, then into a dataframe
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood\n
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
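
As a cross-check on the row-by-row parsing above, the same kind of table can usually be read in one call with `pandas.read_html`. The sketch below is only an illustration, not the method used in this notebook: it parses a small hand-written HTML snippet (an assumption standing in for the Wikipedia page) and assumes an HTML parser such as `lxml`, which this notebook already imports, is installed.

```python
import io

import pandas as pd

# Hand-written stand-in for the Wikipedia table (assumption: same three headers).
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it finds.
tables = pd.read_html(io.StringIO(sample_html))
sample_df = tables[0]
print(sample_df)
```

For the live page, the downloaded HTML text could be passed in the same way; which entry of the returned list holds the postal code table would need to be checked by inspecting the column names.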


In [19]:
# Remove line breaks from the cell values and the column names
df = df.replace(r'\n','', regex=True)
df.columns = ['Postcode', 'Borough', 'Neighbourhood']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [36]:
#Remove rows whose borough is "Not assigned" and reset the df index
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,,
1,M4A,North York,Victoria Village,,
2,M5A,Downtown Toronto,Harbourfront,,
3,M6A,North York,Lawrence Heights,,
4,M6A,North York,Lawrence Manor,,
...,...,...,...,...,...
205,M8Z,Etobicoke,Kingsway Park South West,,
206,M8Z,Etobicoke,Mimico NW,,
207,M8Z,Etobicoke,The Queensway West,,
208,M8Z,Etobicoke,Royal York South West,,


In [21]:
df.shape

(210, 3)

3) To create the above dataframe:

    1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    2. Only process the cells that have complete information and are not greyed out or marked as not assigned.
    3. For each cell, the postal code will go under the PostalCode column, the first line under the postal code will go under Borough, and the remaining lines will go under the Neighborhood column formatted nicely and separated with commas as shown in the sample dataframe above. For example, for cell (1, 3) on the Wikipedia page, M3A will go under PostalCode, North York will go under Borough, and Parkwoods will go under Neighborhood.
    4. If a cell has only one line under the postal code, like cell (1, 7), then that line will go under both the Borough and the Neighborhood columns. So for cell (1, 7), the value of both the Borough and the Neighborhood columns will be Queen's Park. Note: this no longer holds for the current data on Wikipedia.
    5. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
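
The rules in this list can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the notebook's own code: the `raw` frame is a small hand-made sample using the `Postcode`, `Borough` and `Neighbourhood` column names from the scraped table, and `clean_postcodes` is a hypothetical helper name.

```python
import pandas as pd

# Small hand-made sample (assumption), shaped like the scraped table.
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M6A", "M6A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Lawrence Heights", "Lawrence Manor", "Not assigned"],
})


def clean_postcodes(table):
    """Hypothetical helper applying rules 2 to 4 of the list above."""
    # Rule 2: keep only rows with an assigned borough.
    assigned = table[table["Borough"] != "Not assigned"].copy()
    # Rule 4: an unassigned neighbourhood takes the borough name.
    unnamed = assigned["Neighbourhood"] == "Not assigned"
    assigned.loc[unnamed, "Neighbourhood"] = assigned.loc[unnamed, "Borough"]
    # Rule 3: one row per postal code, neighbourhoods joined with commas.
    return (
        assigned.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
    )


cleaned = clean_postcodes(raw)
print(cleaned)
# Rule 5: report the dimensions (rows, columns).
print(cleaned.shape)
```

Passing `sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.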


# Step 2:

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, you need the latitude and the longitude coordinates of each neighborhood in order to utilize the Foursquare location data.

Use the Geocoder package to get the latitude and the longitude coordinates for all the neighborhoods in our dataframe. Here is a sample of the resulting dataframe for your reference.

In [25]:
#!conda install -c conda-forge geocoder --yes
import geocoder
import pandas as pd
import numpy as np
print("Imported successfully")

Imported successfully


In [35]:
# initialize your variable to None
lat_lng_coords = None

#Create extra columns
df['Latitude'] = pd.Series("", index=df.index)
df['Longitude'] = pd.Series("", index=df.index)

print(df.columns)
df

Index(['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,,
1,M4A,North York,Victoria Village,,
2,M5A,Downtown Toronto,Harbourfront,,
3,M6A,North York,Lawrence Heights,,
4,M6A,North York,Lawrence Manor,,
...,...,...,...,...,...
205,M8Z,Etobicoke,Kingsway Park South West,,
206,M8Z,Etobicoke,Mimico NW,,
207,M8Z,Etobicoke,The Queensway West,,
208,M8Z,Etobicoke,Royal York South West,,


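# loop until you get the coordinates
i = 0

sum_latitude = sum(df['Latitude'] == '')

# Stop after one pass over the rows so the loop ends even if the geocoder keeps returning None
while sum_latitude > 0 and i < len(df):
    print('Missing coordinates: ', sum_latitude)
    if df.loc[i, 'Latitude'] == '':
        try:
            g = geocoder.google('{}, Toronto, Ontario'.format(df.loc[i, 'Neighbourhood']))
            lat_lng_coords = g.latlng
            if lat_lng_coords is not None:
                # Use .loc to avoid chained assignment, which may write to a copy
                df.loc[i, 'Latitude'] = lat_lng_coords[0]
                df.loc[i, 'Longitude'] = lat_lng_coords[1]
        except Exception:
            # Stop on any request error
            break
    i = i+1
    sum_latitude = sum(df['Latitude'] == '')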
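
A "loop until you get the coordinates" pattern is easier to reason about when the number of attempts per address is bounded. The sketch below is one possible shape for that, not the notebook's code: `geocode_with_retries` and its `lookup` parameter are hypothetical names, and `flaky_lookup` is a stand-in that returns made-up coordinates instead of calling a real geocoding service.

```python
import time


def geocode_with_retries(query, lookup, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable returning [lat, lng] or None (hypothetical interface).
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            # Pause before the next attempt.
            time.sleep(delay_seconds)
    return None


# Stand-in for a real geocoder: fails twice, then returns made-up coordinates.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return [43.0, -79.0] if calls["count"] >= 3 else None


result = geocode_with_retries("Parkwoods, Toronto, Ontario", flaky_lookup)
print(result)
```

With the geocoder package used in this notebook, `lookup` could plausibly be `lambda q: geocoder.google(q).latlng`, since `latlng` is `None` when no result comes back.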


df
location = pd.read_csv("Geospatial_Coordinates.csv", sep=',')
print(location['Latitude'])
g = geocoder.google('Mountain View, CA')
print(g.latlng)

# Geocoder does not work, and the provided geospatial data is partial.
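
When coordinates come from a file such as Geospatial_Coordinates.csv rather than from the geocoder, a left merge on the postal code keeps every neighbourhood row and makes the gaps in partial data visible. The sketch below assumes the file has a `Postal Code` column next to `Latitude` and `Longitude` (only `Latitude` is confirmed by the code above), and it uses small in-memory frames with placeholder coordinate values instead of reading the real file.

```python
import pandas as pd

# Stand-in for the cleaned dataframe built in Step 1.
neighbourhoods = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront"],
})

# Stand-in for the CSV contents. The "Postal Code" column name is an
# assumption and the coordinates are placeholders, not real data.
coordinates = pd.DataFrame({
    "Postal Code": ["M3A", "M5A"],
    "Latitude": [43.75, 43.65],
    "Longitude": [-79.33, -79.36],
})

# A left merge keeps all neighbourhood rows; unmatched ones get NaN.
merged = neighbourhoods.merge(
    coordinates, how="left", left_on="Postcode", right_on="Postal Code"
).drop(columns="Postal Code")

# List the postal codes that still have no coordinates.
missing = merged[merged["Latitude"].isna()]
print(merged)
print("Postcodes without coordinates:", missing["Postcode"].tolist())
```

Rows listed as missing could then be filled in by hand or by a geocoding service, depending on what is available.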