<h2>Segmenting and Clustering Neighborhoods in Toronto, Canada</h2>

<h3>Let's import required packages</h3>   

In [1]:
import requests
import numpy as np
import pandas as pd
import random

In [2]:
!pip install geopy
import geopy
from geopy.geocoders import Nominatim

# libraries for displaying images
from IPython.display import Image 
from IPython.display import HTML

import folium # plotting library





<h3>Download the Wikipedia page</h3>   

The Wikipedia page has a table of postal codes that contains all the information needed to explore and cluster the neighborhoods in Toronto.

In [3]:
wikipage = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
wikipage

<Response [200]>

In [5]:
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(wikipage.content, 'lxml')

<h4>Get the table of postal codes and transform the data into a pandas dataframe</h4>   

In [6]:
postcode_tbl = soup.find(class_="wikitable sortable")
postcode_tbl

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

In [7]:
rows = postcode_tbl.find_all('tr')
rows

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>, <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td></tr>, <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
 </td></tr>, <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
 </td></tr>, <tr>
 <td>M6A</td>
 <td

In [8]:
# header cells are <th>; get_text(strip=True) drops the trailing newline from each label
cols = [h.get_text(strip=True) for h in rows[0].find_all('th')]
cols

['Postcode', 'Borough', 'Neighbourhood']
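
The header extraction above extends naturally to the data rows. The following is a minimal sketch, assuming a tiny hand-written table (`sample_html`, a hypothetical stand-in for the Wikipedia page) and the built-in `html.parser` backend instead of `lxml`. It is only an illustration of the technique, not the approach the notebook takes next.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical three-row sample shaped like the Wikipedia table (not the real page).
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.find('table').find_all('tr')

# The first row holds <th> header cells; the remaining rows hold <td> data cells.
header = [th.get_text(strip=True) for th in sample_rows[0].find_all('th')]
records = [[td.get_text(strip=True) for td in tr.find_all('td')]
           for tr in sample_rows[1:]]

sample_df = pd.DataFrame(records, columns=header)
print(sample_df)
```

Building the frame from explicit lists keeps the link text (for example `North York`) and discards the surrounding `<a>` markup.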

In [50]:
df = pd.read_html(str(postcode_tbl))[0].drop([0])

The dataframe will consist of three columns: Postcode, Borough, and Neighbourhood.

In [10]:
df.columns = cols
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.

In [11]:
df = df[df['Borough'] != 'Not assigned']
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough. So for the row at index 6 of the dataframe, the value of both the Borough and the Neighbourhood columns will be **Queen's Park**.

In [12]:
df.at[6, 'Neighbourhood'] = df.at[6, 'Borough']
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
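
The assignment above targets one row by its index. If the page changes, or more than one row has a missing neighbourhood, a boolean mask may be more robust. This is a minimal sketch on a small hypothetical frame (`sample`), not the notebook's own dataframe.

```python
import pandas as pd

# Hypothetical sample with one 'Not assigned' neighbourhood.
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Harbourfront', 'Not assigned', 'Islington Avenue'],
})

# Select every row whose neighbourhood is missing and copy the borough into it.
mask = sample['Neighbourhood'] == 'Not assigned'
sample.loc[mask, 'Neighbourhood'] = sample.loc[mask, 'Borough']
print(sample)
```

The same two lines would apply to `df` without needing to know which index holds the missing value.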


<h3>Combine the neighborhoods that have the same postal code into one row</h3>   

More than one neighborhood can exist in one postal code area. The rows will be combined into one row with the neighborhoods separated with a comma.

In [41]:
def GetBorough(series):
    return series.tolist()[0]

df_1 = df.groupby('Postcode')['Borough'].apply(GetBorough).reset_index()
df_2 = df.groupby('Postcode')['Neighbourhood'].apply(lambda tags: ','.join(tags)).reset_index()


In [49]:
df_grouped = df_1.join(df_2.set_index('Postcode'), on='Postcode', sort=True)
df_grouped

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
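
The two `groupby` calls and the join can also be expressed as a single aggregation. The sketch below assumes a small hypothetical frame (`sample`) with the same column names and is offered as an alternative formulation only.

```python
import pandas as pd

# Hypothetical sample in which two postal codes cover two neighbourhoods each.
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M1B', 'M1B', 'M1G'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Rouge', 'Malvern', 'Woburn'],
})

# One pass: keep the first borough per postal code and comma-join the neighbourhoods.
grouped = (sample.groupby('Postcode', as_index=False)
                 .agg({'Borough': 'first', 'Neighbourhood': ','.join}))
print(grouped)
```

Because `groupby` sorts by key by default, the result comes out ordered by postal code, matching the table above.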


Use the .shape attribute to get the number of rows and columns of the dataframe.

In [48]:
df_grouped.shape

(103, 3)

<h3>Get latitude and the longitude coordinates of each neighborhood</h3>   

We will use the Geocoder Python package (https://geocoder.readthedocs.io/index.html) to look up the coordinates of each postal code.

In [54]:
postal_codes = df_grouped['Postcode'].tolist()

In [51]:
import geocoder # import geocoder

In [None]:
for postal_code in postal_codes:
    # reset for every postal code; otherwise the first result is reused for all of them
    lat_lng_coords = None

    # loop until you get the coordinates (never ends if the service keeps returning None)
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    print("Postalcode: {}'s Latitude is: {} Longitude is: {}".format(postal_code, latitude, longitude))
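
The loop above retries without limit, so it can hang when the service never answers. A bounded retry is one way around that. The helper below is a hypothetical sketch: `geocode_with_retries` and `flaky_lookup` are illustrative names, the lookup function is passed in so the example runs without network access, and the coordinates are placeholder values.

```python
import time

def geocode_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a result or the attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # brief pause before the next attempt
    return None  # give up instead of looping forever

# Stand-in for a real geocoder call: returns None twice, then placeholder coordinates.
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.0, -79.0]

coords = geocode_with_retries(flaky_lookup, 'M1B, Toronto, Ontario')
print(coords)
```

With the Geocoder package, the lookup could be something like `lambda q: geocoder.google(q).latlng`, and a `None` result would signal that the CSV fallback is needed.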


This package can be very unreliable. If you cannot get the geographical coordinates of the neighborhoods with the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [None]:
geo_co = pd.read_csv('http://cocl.us/Geospatial_data')
geo_co
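
Once the coordinates are loaded, they can be attached to the grouped dataframe with a merge on the postal code. The sketch below uses two small hypothetical frames. It assumes the CSV names its columns `Postal Code`, `Latitude`, and `Longitude` (check `geo_co.columns` before relying on this), and the coordinate values are placeholders rather than real data.

```python
import pandas as pd

# Hypothetical stand-in for df_grouped.
grouped_sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge,Malvern', 'Highland Creek,Rouge Hill,Port Union'],
})

# Hypothetical stand-in for geo_co; column names are assumed, values are placeholders.
geo_sample = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Latitude': [1.0, 2.0],
    'Longitude': [-1.0, -2.0],
})

# A left merge keeps every postal code even if the coordinate file lacks it.
merged = (grouped_sample
          .merge(geo_sample, left_on='Postcode', right_on='Postal Code', how='left')
          .drop(columns='Postal Code'))
print(merged)
```

A left join is used so that a postal code missing from the coordinate file shows up with empty coordinates instead of silently disappearing.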