# Explore and Cluster Neighborhoods in Toronto
### Scrape the Wikipedia page for a table of postal codes and preprocess it into a dataframe.
### Use the dataframe to explore and cluster neighborhoods

#### Explore the BeautifulSoup library for web scraping.

##### Documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
##### YouTube video: https://www.youtube.com/watch?v=ng2o98k983k

In [3]:
from bs4 import BeautifulSoup
import requests

import pandas as pd

In [4]:
weblink="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

toronto_data=requests.get(weblink)
#toronto_data.text

Now we scrape the page for the required information (the tables).

Begin with find_all('table') to inspect the tables and identify the class of the one we need.

Then use BeautifulSoup to assign that table to a variable.

In [5]:
soup = BeautifulSoup(toronto_data.text,'lxml')
#print(soup.prettify())
#print(soup.get_text())

#table=soup.find_all('table')
#soup.table['class']
Toronto = soup.find('table',{'class':'wikitable sortable'})
Toronto

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

It can be seen that:
1) Every row is a "tr"
2) Header cells are in "th"
3) Row data is in "td"
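
The row structure listed above can be illustrated without any third-party dependency. The sketch below swaps BeautifulSoup for Python's standard-library `html.parser` and runs on a two-row excerpt modelled on the table output above. The class name `TableRowParser` and the variable names are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> still belongs to the open cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Excerpt modelled on the scraped table (header row plus one data row)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)

# The first row holds the "th" headings, the remaining rows hold "td" data
header, *records = parser.rows
parsed_rows = [dict(zip(header, record)) for record in records]
print(parsed_rows)
```

Stripping each cell removes the trailing newline that the Wikipedia markup leaves in the last column, which is the same problem the notebook handles with `split("\n")[0]`.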

In [6]:
toto_dict=[]

for tr in Toronto.find_all('tr')[1:-1]:
    data=tr.find_all(['th','td'])
    pcode=data[0].string
    bor=data[1].string
    neigh=(data[2].text).split("\n")[0]
    #print(neigh)
    
    toto_dict.append({'Postcode':pcode, 'Borough':bor, 'Neighbourhood':neigh})

toto_dict

[{'Postcode': 'M1A',
  'Borough': 'Not assigned',
  'Neighbourhood': 'Not assigned'},
 {'Postcode': 'M2A',
  'Borough': 'Not assigned',
  'Neighbourhood': 'Not assigned'},
 {'Postcode': 'M3A', 'Borough': 'North York', 'Neighbourhood': 'Parkwoods'},
 {'Postcode': 'M4A',
  'Borough': 'North York',
  'Neighbourhood': 'Victoria Village'},
 {'Postcode': 'M5A',
  'Borough': 'Downtown Toronto',
  'Neighbourhood': 'Harbourfront'},
 {'Postcode': 'M5A',
  'Borough': 'Downtown Toronto',
  'Neighbourhood': 'Regent Park'},
 {'Postcode': 'M6A',
  'Borough': 'North York',
  'Neighbourhood': 'Lawrence Heights'},
 {'Postcode': 'M6A',
  'Borough': 'North York',
  'Neighbourhood': 'Lawrence Manor'},
 {'Postcode': 'M7A',
  'Borough': "Queen's Park",
  'Neighbourhood': 'Not assigned'},
 {'Postcode': 'M8A',
  'Borough': 'Not assigned',
  'Neighbourhood': 'Not assigned'},
 {'Postcode': 'M9A',
  'Borough': 'Etobicoke',
  'Neighbourhood': 'Islington Avenue'},
 {'Postcode': 'M1B', 'Borough': 'Scarborough', 'Nei

In [7]:
#Convert to pandas DataFrame
Table=pd.DataFrame.from_dict(toto_dict, orient='columns')
Table=Table[['Postcode','Borough','Neighbourhood']]
print(Table.shape)
Table.head()

(288, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Filter Table for rows having an assigned Borough

In [8]:
Table=Table[Table.Borough != 'Not assigned']
print(Table.shape)
Table.reset_index(drop=True,inplace=True)
Table.head()

(212, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Now handle the following case:
If a cell has a borough but a "Not assigned" neighborhood, the neighborhood is set to the borough name.
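
As a minimal sketch of this rule, independent of pandas, the function below applies it to plain dictionaries. The two sample rows are taken from the scraped output above. The function name `fill_unassigned` is a hypothetical helper introduced only for illustration.

```python
# Two rows from the scraped data: one with a missing neighbourhood, one complete
rows = [
    {"Postcode": "M7A", "Borough": "Queen's Park", "Neighbourhood": "Not assigned"},
    {"Postcode": "M9A", "Borough": "Etobicoke", "Neighbourhood": "Islington Avenue"},
]


def fill_unassigned(records, missing="Not assigned"):
    """Return copies of the records with a missing neighbourhood replaced by the borough."""
    filled = []
    for record in records:
        record = dict(record)  # copy so the input rows stay untouched
        if record["Neighbourhood"] == missing:
            record["Neighbourhood"] = record["Borough"]
        filled.append(record)
    return filled


filled_rows = fill_unassigned(rows)
print(filled_rows)
```

Unlike a lookup of the first matching index, this form covers every affected row and does nothing when no row is affected.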

In [9]:
# Copy the borough into every row whose neighbourhood is 'Not assigned'
neigh_none=Table['Neighbourhood'] == 'Not assigned'
Table.loc[neigh_none,'Neighbourhood']=Table.loc[neigh_none,'Borough']
Table.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Merge neighborhoods that share the same Postcode and Borough
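
The merge step can be pictured as grouping rows by their (Postcode, Borough) key and joining the neighbourhood names with commas. The sketch below shows that idea in plain Python on three rows from the data above. `merge_by_postcode` is a hypothetical helper name, and the sorted output order is an assumption chosen to mirror how a groupby result is ordered by key.

```python
from collections import defaultdict

# (Postcode, Borough, Neighbourhood) rows taken from the table above
records = [
    ("M1B", "Scarborough", "Rouge"),
    ("M1B", "Scarborough", "Malvern"),
    ("M1G", "Scarborough", "Woburn"),
]


def merge_by_postcode(rows):
    """Combine the neighbourhoods of each (Postcode, Borough) pair into one string."""
    grouped = defaultdict(list)
    for postcode, borough, neighbourhood in rows:
        grouped[(postcode, borough)].append(neighbourhood)
    # One output row per key, neighbourhoods kept in their original order
    return [
        (postcode, borough, ",".join(names))
        for (postcode, borough), names in sorted(grouped.items())
    ]


merged = merge_by_postcode(records)
print(merged)
```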

In [10]:
# Join the neighbourhoods of each Postcode/Borough pair into one comma-separated string
Table1=Table.groupby(['Postcode','Borough'])['Neighbourhood'].agg(",".join).reset_index()
print(Table1.shape)
Table1

(103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


Print the shape and the top 5 rows of the resulting table

In [11]:
print("Table dimensions are:",Table1.shape)

Table1.head()

Table dimensions are: (103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Now that we have built a dataframe of postal codes with their borough and neighborhood names, we need the latitude and longitude of each postal code in order to use the Foursquare location data.

Here is a link to a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data



In [12]:
!wget -q -O 'toronto_data.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [13]:
df_geo=pd.read_csv("toronto_data.csv")
# Rename the columns so the postal code column matches Table1 for merging
df_geo.columns=['Postcode', 'Latitude', 'Longitude']
df_geo.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
Table_final=Table1.merge(df_geo,on='Postcode',how='left')
print(Table_final.shape)
Table_final

(103, 5)


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
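
A left merge on Postcode keeps every row of the left table and looks up its coordinates in the right table. The sketch below mimics that behaviour with a dictionary lookup. The first two rows and their coordinates come from the tables above, while the `M0X` row is made up purely to show what happens to a postal code with no match. `left_join` is a hypothetical helper, and `None` stands in for the NaN that pandas would produce.

```python
# Rows of the merged postcode table (the last one is hypothetical)
table_rows = [
    ("M1B", "Scarborough", "Rouge,Malvern"),
    ("M1C", "Scarborough", "Highland Creek,Rouge Hill,Port Union"),
    ("M0X", "Hypothetical Borough", "Hypothetical Neighbourhood"),
]

# Postcode -> (Latitude, Longitude), as in the geospatial CSV
coordinates = {
    "M1B": (43.806686, -79.194353),
    "M1C": (43.784535, -79.160497),
}


def left_join(rows, lookup):
    """Append coordinates to each row, keeping rows that have no match."""
    return [row + lookup.get(row[0], (None, None)) for row in rows]


joined = left_join(table_rows, coordinates)
unmatched = [row[0] for row in joined if row[3] is None]
print(joined)
print("Postcodes without coordinates:", unmatched)
```

Checking for unmatched postal codes after the merge is a cheap way to confirm that the row count stayed the same because every code was found, not merely because a left join never drops rows.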


Try using the geocoder package

In [17]:
!conda install -c conda-forge geocoder --yes

Test run geocoder package

In [18]:
import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(df_geo['Postcode'][0]))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

KeyboardInterrupt: 

Note: Since the geocoder package failed to respond despite multiple attempts (the loop above had to be interrupted manually), we move forward with the latitudes and longitudes from the Toronto data CSV file.
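
One way to avoid a loop that never ends is to cap the number of attempts and fall back when the cap is reached. The sketch below is a minimal illustration under that assumption. It does not call the geocoder package: `geocode_with_retry` is a hypothetical helper, and the two lookup functions are stand-ins that simulate a service which never answers and one which answers on the third call.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause between attempts
    return None  # caller decides on the fallback, e.g. the CSV file


# Stand-in for a service that never responds with coordinates
def always_fails(query):
    return None


# Stand-in for a service that responds on the third call
calls = []


def succeeds_on_third_call(query):
    calls.append(query)
    return [43.806686, -79.194353] if len(calls) == 3 else None


failed = geocode_with_retry(always_fails, "M1B, Toronto, Ontario")
found = geocode_with_retry(succeeds_on_third_call, "M1B, Toronto, Ontario")
print(failed, found)
```

With a real service, a non-zero `delay_seconds` would usually be sensible so that repeated requests do not hit rate limits.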

### K Means Clustering

In [19]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(Table_final['Borough'].unique()),
        Table_final.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


#### Use the geopy library to get the latitude and longitude of Toronto.

In [20]:
from geopy.geocoders import Nominatim

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Create a map of Toronto using folium

In [21]:
import folium
map_toto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(Table_final['Latitude'], Table_final['Longitude'], Table_final['Borough'], Table_final['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toto)  
    
map_toto

#### Extracting Boroughs whose name ends with the word Toronto

In [22]:
Table_km=Table_final[Table_final["Borough"].str.rsplit(" ", n=1).str[-1] == "Toronto"].reset_index(drop=True)
Table_km

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049


#### Centroid of shortlisted Boroughs

In [33]:
Boroughs=Table_km.groupby(["Borough"])[["Latitude","Longitude"]].mean().reset_index()
Boroughs

Unnamed: 0,Borough,Latitude,Longitude
0,Central Toronto,43.70198,-79.398954
1,Downtown Toronto,43.654169,-79.383665
2,East Toronto,43.669436,-79.324654
3,West Toronto,43.652653,-79.44929
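
The centroid used here is simply the arithmetic mean of the latitudes and of the longitudes within each borough, which is a reasonable approximation at city scale. The sketch below computes it in plain Python for a partial sample: the first four East Toronto rows and the first two Central Toronto rows from the table above. Because the sample is partial, its results differ slightly from the full-data centroids shown above. `borough_centroids` is a hypothetical helper name.

```python
from collections import defaultdict

# (Borough, Latitude, Longitude) for a partial sample of the Toronto rows
points = [
    ("East Toronto", 43.676357, -79.293031),
    ("East Toronto", 43.679557, -79.352188),
    ("East Toronto", 43.668999, -79.315572),
    ("East Toronto", 43.659526, -79.340923),
    ("Central Toronto", 43.72802, -79.38879),
    ("Central Toronto", 43.712751, -79.390197),
]


def borough_centroids(rows):
    """Return {borough: (mean latitude, mean longitude)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # latitude sum, longitude sum, count
    for borough, lat, lng in rows:
        total = sums[borough]
        total[0] += lat
        total[1] += lng
        total[2] += 1
    return {b: (s[0] / s[2], s[1] / s[2]) for b, s in sums.items()}


centroids = borough_centroids(points)
for borough, (lat, lng) in centroids.items():
    print(borough, round(lat, 6), round(lng, 6))
```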


In [35]:
import folium
map_boro = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(Table_km['Latitude'], Table_km['Longitude'], Table_km['Borough'], Table_km['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_boro)  

for lat, lng, borough in zip(Boroughs['Latitude'], Boroughs['Longitude'], Boroughs['Borough']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.Marker(
        [lat, lng],
        popup=label,
        ).add_to(map_boro)
    
map_boro

#### Let's cluster the postal codes into 4 clusters and see if they cluster around the boroughs

In [25]:
from sklearn.cluster import KMeans

# set number of clusters
clusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(Table_km[["Latitude","Longitude"]])

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 0], dtype=int32)

In [68]:
import numpy as np
import matplotlib.cm as cm
from matplotlib.colors import rgb2hex

# create map
map_km = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(clusters)
ys = [i+x+(i*x)**2 for i in range(clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, boro, neigh, cluster in zip(Table_km['Latitude'], Table_km['Longitude'], Table_km['Borough'], Table_km['Neighbourhood'], kmeans.labels_):
    label = folium.Popup(' Cluster ' + str(cluster) + ': '+str(neigh) + '->' + str(boro), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_km)       

# cluster label for each borough centroid, entered by hand from the labels printed above
Boroughs['cluster']=[2,1,0,3]

for lat, lon, boro, cluster in zip(Boroughs['Latitude'], Boroughs['Longitude'], Boroughs['Borough'], Boroughs['cluster']):
    label = folium.Popup('Cluster'+str(cluster)+':'+ str(boro), parse_html=True)
    folium.Marker(
        [lat, lon],
        popup=label,
        #icon=folium.Icon(color=red)
        ).add_to(map_km)
    
map_km

##### We can see that K-Means has done a reasonably good job of clustering the postal codes geographically
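
The visual impression can be backed by a simple count of how often rows of the same borough land in the same cluster. The sketch below is a rough illustration using only the first ten rows of the Toronto table (four East Toronto, six Central Toronto) and the first ten cluster labels printed above. The agreement measure, the share of rows falling in their borough's most common cluster, is an assumption chosen for simplicity and not a standard clustering metric.

```python
from collections import Counter

# Boroughs of the first ten rows and the first ten k-means labels from the output above
boroughs = ["East Toronto"] * 4 + ["Central Toronto"] * 6
labels = [0, 0, 0, 0, 2, 2, 2, 2, 2, 2]

# Count how many rows fall into each (borough, cluster) combination
pair_counts = Counter(zip(boroughs, labels))

# For each borough keep its most common cluster and the number of rows in it
majority = {}
for (borough, label), count in pair_counts.items():
    if count > majority.get(borough, (None, 0))[1]:
        majority[borough] = (label, count)

# Share of rows that sit in the majority cluster of their borough
agreement = sum(count for _, count in majority.values()) / len(labels)
print(pair_counts)
print("Agreement:", agreement)
```

Running the same count on all rows would show where clusters and boroughs disagree, which the partial sample here cannot reveal.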