# Toronto Neighbourhoods

<h2 id="Introduction">1. Introduction</h2>

This notebook completes the [Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto](https://www.coursera.org/learn/applied-data-science-capstone#syllabus) from the [Applied Data Science Capstone](https://www.coursera.org/learn/applied-data-science-capstone) course. It analyses the neighbourhoods of Toronto, Canada. First, the neighbourhood data are collected directly from Wikipedia by scraping this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) to build the table shown in Table 1. Next, the Geocoder library is used to get the latitude and longitude coordinates shown in Table 2. Once the geospatial coordinates are collected, some boroughs are explored using the [Foursquare API](https://developer.foursquare.com/). Finally, a clustering algorithm is used to cluster the neighbourhoods in Toronto.

<p style="text-align:center"><i>Table 1</i></p>

![](image-table.png)

<p style="text-align:center"><i>Table 2</i></p>

![](image-table2.png)

### 1.1. Aim

The goal of this notebook is to build a table of Toronto neighbourhoods from Wikipedia and the Geocoder library, and to segment those neighbourhoods with a clustering method.

### 1.2. Objectives

1. To scrape the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
2. To create the dataframe shown in Table 1
3. To get the latitude and longitude coordinates of each postal code using the Geocoder library and create the dataframe shown in Table 2
4. To explore some boroughs in Toronto using the Foursquare API
5. To cluster the explored boroughs using the K-Means clustering algorithm
6. To visualise the generated clusters on a map using the Folium library

### 1.3. Structure

This notebook is arranged as follows.

1. [Introduction](#Introduction)
2. [Webpage Scraping](#Webpage-Scraping)
3. [Getting Coordinates](#Getting-Coordinates)
4. [Exploring Boroughs](#Exploring-Boroughs)
5. [Clustering](#Clustering)
6. [Visualisations](#Visualisations)
7. [Conclusions](#Conclusions)

<h2 id="Webpage-Scraping">2. Webpage Scraping</h2>

Import the libraries needed to scrape the Wikipedia page.

In [4]:
import pandas as pd

Let's get the original table from the Wikipedia page.

In [50]:
# url of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# get all tables from the page using pandas.read_html(), which returns a list of dataframes
df = pd.read_html(url)

# the table we're looking for is the first one in the list of extracted tables
df = df[0]

# print its original shape
print('The dataframe has {} rows and {} columns'.format(df.shape[0], df.shape[1]))

# show the first 5 rows
df.head()

The dataframe has 180 rows and 3 columns


|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | NaN |
| 1 | M2A | Not assigned | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
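
`pd.read_html()` returns a list of dataframes, one per table found on the page, and the position of the postal code table could change if the page is edited. A minimal sketch of picking the table by its expected columns rather than by position is given below. The `tables` list is hand-built stand-in data and `pick_table` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# stand-in for the list returned by pd.read_html(url); made-up data
tables = [
    pd.DataFrame({'Code': ['x'], 'Note': ['navigation box']}),
    pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                  'Borough': ['North York', 'North York'],
                  'Neighborhood': ['Parkwoods', 'Victoria Village']}),
]

def pick_table(tables, required_columns):
    """Return the first dataframe that contains all required columns."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError('no table with columns {}'.format(required_columns))

postal = pick_table(tables, ['Postal Code', 'Borough', 'Neighborhood'])
print(postal.shape)
```

Selecting by column names fails loudly when the expected table is missing, instead of silently continuing with the wrong one.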


As seen in the above dataframe, there are __Not assigned__ and `NaN` values. First, let's find all the missing values by looking at the unique values of each column.

In [61]:
df['Postal Code'].unique()

array(['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M6A', 'M7A', 'M8A', 'M9A',
       'M1B', 'M2B', 'M3B', 'M4B', 'M5B', 'M6B', 'M7B', 'M8B', 'M9B',
       'M1C', 'M2C', 'M3C', 'M4C', 'M5C', 'M6C', 'M7C', 'M8C', 'M9C',
       'M1E', 'M2E', 'M3E', 'M4E', 'M5E', 'M6E', 'M7E', 'M8E', 'M9E',
       'M1G', 'M2G', 'M3G', 'M4G', 'M5G', 'M6G', 'M7G', 'M8G', 'M9G',
       'M1H', 'M2H', 'M3H', 'M4H', 'M5H', 'M6H', 'M7H', 'M8H', 'M9H',
       'M1J', 'M2J', 'M3J', 'M4J', 'M5J', 'M6J', 'M7J', 'M8J', 'M9J',
       'M1K', 'M2K', 'M3K', 'M4K', 'M5K', 'M6K', 'M7K', 'M8K', 'M9K',
       'M1L', 'M2L', 'M3L', 'M4L', 'M5L', 'M6L', 'M7L', 'M8L', 'M9L',
       'M1M', 'M2M', 'M3M', 'M4M', 'M5M', 'M6M', 'M7M', 'M8M', 'M9M',
       'M1N', 'M2N', 'M3N', 'M4N', 'M5N', 'M6N', 'M7N', 'M8N', 'M9N',
       'M1P', 'M2P', 'M3P', 'M4P', 'M5P', 'M6P', 'M7P', 'M8P', 'M9P',
       'M1R', 'M2R', 'M3R', 'M4R', 'M5R', 'M6R', 'M7R', 'M8R', 'M9R',
       'M1S', 'M2S', 'M3S', 'M4S', 'M5S', 'M6S', 'M7S', 'M8S', 'M9S',
       'M1T', 'M2T',

There is no suspicious value in the first column. Let's move on to the second one.

In [62]:
df.Borough.unique()

array(['Not assigned', 'North York', 'Downtown Toronto', 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

As seen above, there is a `Not assigned` value in the second column. Now, let's check the last column.

In [63]:
df.Neighborhood.unique()

array([nan, 'Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront',
       'Lawrence Manor, Lawrence Heights',
       "Queen's Park, Ontario Provincial Government", 'Islington Avenue',
       'Malvern, Rouge', 'Don Mills', 'Parkview Hill, Woodbine Gardens',
       'Garden District, Ryerson', 'Glencairn',
       'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale',
       'Rouge Hill, Port Union, Highland Creek', 'Woodbine Heights',
       'St. James Town', 'Humewood-Cedarvale',
       'Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood',
       'Guildwood, Morningside, West Hill', 'The Beaches', 'Berczy Park',
       'Caledonia-Fairbanks', 'Woburn', 'Leaside', 'Central Bay Street',
       'Christie', 'Cedarbrae', 'Hillcrest Village',
       'Bathurst Manor, Wilson Heights, Downsview North',
       'Thorncliffe Park', 'Richmond, Adelaide, King',
       'Dufferin, Dovercourt Village', 'Scarborough Village',
       'Fairview, Henry Farm, Oriole', 'Nort

There is a `NaN` (not a number) value in the third column. So, we're going to remove the rows with `Not assigned` in the second column and the rows with `NaN` in the third one.

In [51]:
# drop the rows whose Borough is 'Not assigned'
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)

# drop the rows that contain NaN
df.dropna(axis='rows', inplace=True)

# let's see the first 5 rows
df.head()

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
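
The same cleaning can also be written as a single step by first turning the placeholder into a real missing value. The sketch below runs on a small made-up dataframe, not on the Wikipedia table.

```python
import numpy as np
import pandas as pd

# small made-up sample with the same three columns as the scraped table
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': [np.nan, 'Parkwoods', 'Victoria Village'],
})

# treat the 'Not assigned' placeholder as a missing value,
# then drop every row with a missing Borough or Neighborhood
cleaned = (sample.replace('Not assigned', np.nan)
                 .dropna(subset=['Borough', 'Neighborhood'])
                 .reset_index(drop=True))

print(cleaned)
```

Resetting the index at the end keeps the row labels consecutive after rows have been dropped.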


Then, let's check if `Not assigned` and `NaN` still exist.

In [4]:
print('Does "Not assigned" exist?')
# summing an empty selection gives 0 in every column, so all zeros means no such rows are left
print(df[df.Borough == 'Not assigned'].sum())
print('\nDoes NaN exist?')
print(df.isnull().sum())

Does "Not assigned" exist?
Postal Code     0.0
Borough         0.0
Neighborhood    0.0
dtype: float64

Does NaN exist?
Postal Code     0
Borough         0
Neighborhood    0
dtype: int64
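
A more direct check is to count matches with a boolean mask, which gives a single number instead of one value per column. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# made-up sample containing one placeholder and one missing value
sample = pd.DataFrame({
    'Borough': ['North York', 'Not assigned', 'Etobicoke'],
    'Neighborhood': ['Parkwoods', np.nan, 'Islington Avenue'],
})

# comparing a column with a value gives True/False per row; summing counts the True values
n_not_assigned = (sample['Borough'] == 'Not assigned').sum()

# isnull() flags missing cells; the double sum counts them over the whole dataframe
n_missing = sample.isnull().sum().sum()

print(n_not_assigned, n_missing)
```

On a cleaned dataframe both counts should be zero.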


Nice. We've got our dataframe cleaned. Now, let's check its number of rows and columns.

In [5]:
print('The final dataframe has {} rows and {} columns'.format(df.shape[0], df.shape[1]))

The final dataframe has 103 rows and 3 columns


<h2 id="Getting-Coordinates">3. Getting Coordinates</h2>

Import the library needed to collect the coordinates.

In [46]:
import geocoder  # library used to look up geographical coordinates

We're supposed to use the `Geocoder` library in the function below to get the geospatial coordinates of each postal code and assign them to two new columns, `Latitude` and `Longitude`. However, this library can be very unreliable. In case we are not able to get the geographical coordinates of the neighbourhoods using Geocoder, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [10]:
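def get_geocoder(postal_code, max_attempts=10):
    # start with no coordinates
    lat_lng_coords = None
    # retry until coordinates are returned or the attempts run out
    attempts = 0
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        attempts += 1
    if lat_lng_coords is None:
        # give up instead of looping forever when the service never answers
        return None, None
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude

df['Latitude'], df['Longitude'] = zip(*df['Postal Code'].apply(get_geocoder))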
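
The last line relies on `apply()` returning a series of `(latitude, longitude)` tuples, which `zip(*...)` then splits into one sequence of latitudes and one of longitudes. The sketch below shows the same pattern with a stand-in lookup so that it runs without any geocoding service. `fake_lookup` and its coordinates are made up for illustration.

```python
import pandas as pd

# made-up coordinates, used only to illustrate the unpacking pattern
FAKE_COORDS = {'M3A': (43.75, -79.33), 'M4A': (43.73, -79.32)}

def fake_lookup(postal_code):
    """Hypothetical stand-in for get_geocoder(): returns a (lat, lng) tuple."""
    return FAKE_COORDS[postal_code]

demo = pd.DataFrame({'Postal Code': ['M3A', 'M4A']})

# apply() gives a series of tuples; zip(*...) regroups them into
# one tuple of latitudes and one tuple of longitudes
pairs = demo['Postal Code'].apply(fake_lookup)
demo['Latitude'], demo['Longitude'] = zip(*pairs)

print(demo)
```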

In case we are not able to get the coordinates using Geocoder, we're going to download and import the csv file.

In [48]:
# uncomment the line below if you'd like to download the csv file
# !wget -q -O 'Geospatial_Coordinates.csv' http://cocl.us/Geospatial_data

Once the csv file is downloaded, let's import it and take a look.

In [52]:
fileCoor = 'Geospatial_Coordinates.csv'
LatLon = pd.read_csv(fileCoor)
LatLon.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Next, we left join our `df` dataframe with the new `LatLon` dataframe, using the `Postal Code` column as the key.

In [53]:
df = df.merge(LatLon, on='Postal Code', how='left')
df.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
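
With a left join, any postal code missing from the coordinates file keeps its row but gets `NaN` coordinates, so it is worth checking for unmatched keys. A minimal sketch on made-up data, using the `indicator` option of `merge()`:

```python
import pandas as pd

# made-up data: 'M9Z' has no entry in the coordinates table
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Made-up Borough']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.75, 43.73],
                       'Longitude': [-79.33, -79.32]})

# indicator=True adds a _merge column saying where each row's key was found
merged = left.merge(coords, on='Postal Code', how='left', indicator=True)

# rows marked 'left_only' had no match in the coordinates table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that every postal code received its coordinates.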


Great! We've got our merged dataframe. Let's save it as a csv file.

In [55]:
# index=False leaves the row index out of the file
df.to_csv('toronto.csv', index=False)
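
By default `to_csv()` also writes the row index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. The sketch below shows the round trip with in-memory buffers instead of files. Passing `index=False` is one way to avoid the extra column.

```python
import io
import pandas as pd

demo = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                     'Borough': ['North York', 'North York']})

# default: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```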

<h2 id="Exploring-Boroughs">4. Exploring Boroughs</h2>

In [58]:
df[df.Borough.str.contains('Toronto')]

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| 24 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 25 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| 30 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 31 | M6H | West Toronto | Dufferin, Dovercourt Village | 43.669005 | -79.442259 |
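
The filter above keeps every borough whose name contains the word Toronto. A minimal sketch of the same filter on made-up rows, followed by a count of rows per borough, is given below. The `na=False` argument is an addition so that a missing borough name counts as no match.

```python
import numpy as np
import pandas as pd

# made-up rows, including one missing borough name
demo = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York', 'East Toronto',
                'Downtown Toronto', np.nan],
})

# str.contains() returns True where the substring occurs;
# na=False makes missing values count as False instead of NaN
is_toronto = demo['Borough'].str.contains('Toronto', na=False)
toronto = demo[is_toronto]

# number of rows per matching borough
counts = toronto['Borough'].value_counts().to_dict()
print(counts)
```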


<h2 id="Clustering">5. Clustering</h2>

<h2 id="Visualisations">6. Visualisations</h2>

<h2 id='Conclusions'>7. Conclusions</h2>