## Segmenting and Clustering Neighborhoods in Toronto 1.2

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the table of postal codes, and transforms it into a pandas dataframe.  
  
Then it gets the latitude and the longitude coordinates of each neighborhood.  
  
Finally, it explores and clusters the neighborhoods in Toronto. It groups them into **5 clusters** according to their geographical coordinates and displays them on a map with color-coded markers. The result can also be viewed in the screenshots (**map_Toronto.png**, **map_clusters.png**) in this repository. (The maps may take a long time to render in this notebook, so the screenshots are a convenient alternative.)

In [1]:
# import libraries
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

### 1. Scrape page

In [2]:
# connect to page and get html content
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
doc=requests.get(url)

### 2. Convert to BeautifulSoup Object and get table
Use the BeautifulSoup package to extract the table from the Wikipedia page and load it into a pandas dataframe.

In [3]:
# convert to BeautifulSoup Object
html_content=BeautifulSoup(doc.content,'lxml')
#print(html_content.prettify())

In [4]:
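# get the first table and transform it into a pandas dataframe
# (wrapping the HTML in StringIO avoids the literal-string deprecation in newer pandas)
from io import StringIO
table=html_content.find_all('table')[0]
df=pd.read_html(StringIO(str(table)))[0]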
#df

### 3. Data wrangling

In [5]:
# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
df.columns=['PostalCode','Borough','Neighborhood']
df.drop(0,inplace=True)
#df

In [6]:
# Only process the cells that have an assigned borough; ignore cells whose borough is "Not assigned"

# Replace "Not assigned" with NaN in column "Borough"
# (work on a copy so the original df is not modified)
df_clean=df.copy()
df_clean['Borough']=df_clean['Borough'].replace("Not assigned", np.nan)
#df_clean.head()

In [7]:
# Drop rows with value "NaN"
df_clean.dropna(inplace=True)
#df_clean.head()

In [8]:
# If a cell has a borough but a "Not assigned" neighborhood, use the borough as the neighborhood.
df_clean['Neighborhood']=df_clean['Neighborhood'].mask(df_clean['Neighborhood']=="Not assigned", df_clean['Borough'])
#df_clean
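
The two cleaning rules above can be tried on a tiny frame before running them on the scraped table. The sketch below uses hypothetical toy rows (they only mimic the shape of the scraped data) and boolean masks rather than `replace`; the names `toy`, `cleaned`, and `missing` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy rows mimicking the scraped table (not the real data)
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: drop rows whose borough is "Not assigned"
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: fall back to the borough name where the neighborhood is "Not assigned"
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

Only the rows with an assigned borough remain, and the last row takes its borough name as its neighborhood.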

In [9]:
# merge the Neighbourhood with the same Postcode
df_group=df_clean.groupby(['PostalCode','Borough']).aggregate(lambda x:', '.join(x))
df_group=df_group.reset_index()
df_group

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [10]:
#  the number of rows of df_group
df_group.shape[0]

103
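
To illustrate the grouping step, here is a minimal sketch on hypothetical long-format rows, where each neighborhood sits on its own row. It joins the neighborhoods per postal code and then checks that each postal code appears only once, which is what the row count above relies on. The frame `long_rows` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical long-format rows: one row per (postal code, neighborhood) pair
long_rows = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# Join the neighborhoods of each (postal code, borough) pair with ", "
grouped = (
    long_rows.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

# After grouping, every postal code should appear exactly once
assert grouped["PostalCode"].is_unique
print(grouped)
```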

### 4. Get the latitude and longitude coordinates of each neighborhood

In [11]:
lat_lng=pd.read_csv('http://cocl.us/Geospatial_data')
#lat_lng

In [12]:
# merge neighbourhood with geospatial data by the key Postal Code
df_lat_lng=df_group.copy()

# inner join two tables
df_lat_lng=pd.merge(df_lat_lng, lat_lng, how='inner', on=None, left_on='PostalCode', right_on='Postal Code')
# delete duplicate column 
df_lat_lng.drop(['Postal Code'],axis=1,inplace=True)

df_lat_lng

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
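
An inner join silently drops postal codes that have no match in the coordinate file. One way to check for that, sketched here under assumptions, is a left join with `indicator=True`. The first two coordinate pairs are copied from the table above; the code `M9Z` and its zero coordinates are made up so that one row fails to match.

```python
import pandas as pd

# Postal codes copied from the first rows of the grouped table
neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})

# Hypothetical coordinate table; "M9Z" is a made-up code with no match
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Latitude": [43.806686, 43.784535, 0.0],
    "Longitude": [-79.194353, -79.160497, 0.0],
})

# A left join with indicator=True keeps unmatched rows visible
# instead of dropping them as an inner join would
checked = neighborhoods.merge(
    coords, how="left", left_on="PostalCode", right_on="Postal Code",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty on the real data, the inner join used in the notebook loses nothing.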


In [13]:
# initialize the df by extending 2 columns named Latitude and Longitude

#df_lat_lng=df_group.copy()
#df_lat_lng.insert(3,'Latitude',np.nan)
#df_lat_lng.insert(4,'Longitude',np.nan)
#df_lat_lng

In [14]:
#import geocoder

# initialize variable to None
#lat_lng_coords = None

#for i in range(0,df_group.shape[0]):
    
    # for each postalcode
#    postal_code=df_lat_lng.iloc[i][0]
    
    # loop until get the coordinates
#    while(lat_lng_coords is None):
#        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#        lat_lng_coords = g.latlng
    
    #lat_lng_coords=[postal_code,postal_code+'1'] 
    
#    latitude = lat_lng_coords[0]
#    longitude = lat_lng_coords[1]

#    df_lat_lng.loc[i,'Latitude']=latitude
#    df_lat_lng.loc[i,'Longitude']=longitude
  
#df_lat_lng   

### 5. Segment and cluster neighborhoods in Toronto

In [15]:
# import libraries
# from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes 
import folium

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.0.2p             |       h470a237_1         3.1 MB  conda-forge
    certifi-2018.10.15         |        py36_1000         138 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.3.0               |             py_0          24 KB  conda-forge
    ca-certificates-2018.10.15 |       ha4d7672_0         135 KB  conda-forge
    conda-4.5.11               |        py36_1000         651 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    altair-2.2.2               |        py36_1001         494 KB  conda-forge
    ------------------------------------------------------------
                         

In [16]:
# segment and cluster only the neighborhoods in Toronto
# drop rows that are irrelevant to Toronto
Toronto_data=df_lat_lng.copy()
Toronto_data = Toronto_data[Toronto_data['Borough'].str.contains('Toronto')].reset_index(drop=True)
Toronto_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049
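
The filter above keeps a row only when its borough name contains the substring "Toronto". A small sketch of that behavior follows; the series `boroughs` is an illustrative sample, not the full list of boroughs.

```python
import pandas as pd

# Illustrative sample of borough names
boroughs = pd.Series(["Scarborough", "East Toronto", "Central Toronto", "North York"])

# regex=False treats "Toronto" as a plain substring
mask = boroughs.str.contains("Toronto", regex=False)
kept = boroughs[mask].tolist()
print(kept)
```

Boroughs such as Scarborough are excluded by this rule because the match is purely on the name.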


In [17]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.17.0               |             py_0          49 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          82 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.17.0-py_0 conda-forge


Downloading and Extracting Packages
geopy-1.17.0         | 49 KB     | ##################################### | 100% 
geographiclib-1.49   | 32 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [18]:
# get the geographical coordinates of Toronto
address = 'Toronto, CA'

# Nominatim's usage policy expects an identifying user_agent; the name here is arbitrary
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
Toronto_lat = location.latitude
Toronto_lng = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(Toronto_lat, Toronto_lng))



The geographical coordinates of Toronto are 43.653963, -79.387207.


In [19]:
# visualize Toronto and the neighborhoods in it

# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[Toronto_lat, Toronto_lng], zoom_start=12)

# add markers to map
for lat, lng, label in zip(Toronto_data['Latitude'], Toronto_data['Longitude'], Toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
    ).add_to(map_Toronto)
    
map_Toronto

In [24]:
map_Toronto.save('map_Toronto.html')

In [20]:
# Run k-means to cluster the neighborhoods into 5 clusters
# set number of clusters
kclusters = 5

# keep only the numeric Latitude and Longitude columns as features
Toronto_data_clustering = Toronto_data.drop(['PostalCode','Borough','Neighborhood'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_data_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       1, 1, 0, 0, 0, 2, 2, 2, 0, 4, 0, 0, 4, 4, 4, 3], dtype=int32)
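
To show what clustering on coordinates does, the following is a plain NumPy sketch of Lloyd's iterations (assign each point to its nearest center, then move each center to the mean of its points). It is a teaching sketch, not scikit-learn's `KMeans` implementation, and the function name `lloyd_kmeans` and its parameters are assumptions. The four coordinate pairs are copied from the `Toronto_data` rows shown earlier.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; a teaching sketch, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every center
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its points (keep it if the cluster is empty)
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# (latitude, longitude) pairs copied from the Toronto_data rows shown above
coords = np.array([
    [43.676357, -79.293031],  # The Beaches
    [43.668999, -79.315572],  # The Beaches West, India Bazaar
    [43.728020, -79.388790],  # Lawrence Park
    [43.712751, -79.390197],  # Davisville North
])
labels, centers = lloyd_kmeans(coords, k=2)
print(labels)
```

The two East Toronto points end up in one cluster and the two Central Toronto points in the other. The numeric label values themselves are arbitrary, as they are in the `kmeans.labels_` array above.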

In [21]:
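# copy so that Toronto_data keeps only its original columns if the clustering cell is re-run
Toronto_merged = Toronto_data.copy()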

# add clustering labels
Toronto_merged['Cluster Labels'] = kmeans.labels_

Toronto_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,3
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,3
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,3
3,M4M,East Toronto,Studio District,43.659526,-79.340923,3
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197,1
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,1
7,M4S,Central Toronto,Davisville,43.704324,-79.38879,1
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,1
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049,1


In [22]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [23]:
# create map
map_clusters = folium.Map(location=[Toronto_lat, Toronto_lng], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighborhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [25]:
map_clusters.save('map_clusters.html')

## Conclusion

- The five clusters mostly coincide with the corresponding boroughs.  
- The neighborhoods within each cluster are geographically adjacent on the map.
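
The overlap between clusters and boroughs can be tabulated rather than read off the map. The sketch below cross-tabulates only the ten `Toronto_merged` rows displayed earlier, so it is a partial check. On the full data the same call would presumably be `pd.crosstab(Toronto_merged['Borough'], Toronto_merged['Cluster Labels'])`.

```python
import pandas as pd

# Borough and cluster label copied from the ten rows of Toronto_merged shown above
sample = pd.DataFrame({
    "Borough": ["East Toronto"] * 4 + ["Central Toronto"] * 6,
    "Cluster Labels": [3, 3, 3, 3, 1, 1, 1, 1, 1, 1],
})

# Rows: boroughs; columns: cluster labels; cells: number of postal codes
overlap = pd.crosstab(sample["Borough"], sample["Cluster Labels"])
print(overlap)
```

A borough that maps cleanly onto one cluster shows a single non-zero cell in its row.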