# Segmenting and Clustering Neighborhoods in Toronto

This project scrapes the Wikipedia page for the postal codes of Canada, then cleans and prepares the data for clustering. Clustering is performed with K-Means, and the clusters are plotted with the Folium library.
<br>The boroughs whose names contain 'Toronto' are plotted twice: once before clustering and once after K-Means clustering.

All three tasks (web scraping, data preparation, and clustering) are done in this one notebook for ease of evaluation.

## Part 1. Peer-graded Assignment

#### 1. Load Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

#### 2. Scrape Wikipedia

In [2]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
response.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(response.text, 'lxml')

#### 3. Locate Postal Codes, Borough and Neighbourhood

In [3]:
Table_Postal_Code = soup.find('table')
Fields = Table_Postal_Code.find_all('td')

PostalCode = []
Borough = []
Neighbourhood = []

# Cells arrive in groups of three: postal code, borough, neighbourhood
for i in range(0, len(Fields), 3):
    PostalCode.append(Fields[i].text.strip())
    Borough.append(Fields[i + 1].text.strip())
    Neighbourhood.append(Fields[i + 2].text.strip())

df_PostalCode = pd.DataFrame(data = [PostalCode,Borough,Neighbourhood]).transpose()
df_PostalCode.columns = ['PostalCode','Borough','Neighbourhood']
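
The loop above steps through the cells three at a time, so it relies on every table row having exactly three cells. A minimal sketch of a row-by-row variant follows, run on a small inline HTML sample instead of the live page. `SAMPLE_HTML`, `parse_postal_table`, and the choice of the built-in `html.parser` are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like the postal code table (not the real page)
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""

def parse_postal_table(html):
    """Read the first table row by row, keeping only rows with three data cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    records = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 3:  # skips the header row, which has <th> cells only
            records.append(cells)
    return pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighbourhood"])

df_sample = parse_postal_table(SAMPLE_HTML)
print(df_sample)
```

Reading by row means a malformed or merged row is skipped rather than shifting every later value into the wrong column.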

#### 4. Remove records where Borough is "Not assigned"

In [4]:
df_PostalCode = df_PostalCode[df_PostalCode.Borough != 'Not assigned']

#### 5. Combine records where PostalCode and Borough are same

In [5]:
df_PostalCode = df_PostalCode.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df_PostalCode.columns = ['PostalCode','Borough','Neighbourhood']

#### 6. For "Not assigned" Neighbourhood assign Borough as Neighbourhood

In [6]:
df_PostalCode['Neighbourhood'] = np.where(df_PostalCode['Neighbourhood'] == 'Not assigned',df_PostalCode['Borough'], df_PostalCode['Neighbourhood'])
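
Steps 4 to 6 can be checked on a handful of rows before trusting them on the full table. The sketch below uses a hypothetical toy dataframe, `df_demo`, whose rows are made up to exercise each rule; it is not the scraped data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mimicking the scraped table (not the real data)
df_demo = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# Step 4: drop rows without a borough
df_demo = df_demo[df_demo["Borough"] != "Not assigned"]

# Step 5: one row per (PostalCode, Borough), neighbourhoods joined with commas
df_demo = (
    df_demo.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# Step 6: fall back to the borough name when no neighbourhood is assigned
df_demo["Neighbourhood"] = np.where(
    df_demo["Neighbourhood"] == "Not assigned",
    df_demo["Borough"],
    df_demo["Neighbourhood"],
)

print(df_demo)
```

The order matters: the "Not assigned" boroughs must be removed first, otherwise step 6 would copy "Not assigned" from the borough column into the neighbourhood column.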

#### 7. Number of Rows and Columns in the dataframe

In [7]:
df_PostalCode.shape

(103, 3)

## Part 2. Peer-graded Assignment

#### 1. Read Geographical Coordinates of each Postal Code from http://cocl.us/Geospatial_data

In [8]:
df_Geo = pd.read_csv('http://cocl.us/Geospatial_data')
df_Geo.columns = ['PostalCode', 'Latitude', 'Longitude']

#### 2. Merge Geographical Coordinates in the Postal Code dataframe

In [9]:
df_Canada = pd.merge(df_PostalCode, df_Geo, on = ['PostalCode'], how = 'inner')
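
An inner merge silently drops any postal code that has no coordinates. As a hedged sketch, a left merge with pandas' `indicator` and `validate` options can surface such rows before they disappear. The frames `postal_demo` and `geo_demo` are illustrative, and the code `M9Z` is a made-up value chosen so that one row fails to match.

```python
import pandas as pd

# Illustrative frames; 'M9Z' is a hypothetical code with no coordinates
postal_demo = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
geo_demo = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column saying where each row was found
checked_demo = pd.merge(
    postal_demo, geo_demo,
    on="PostalCode", how="left",
    validate="one_to_one", indicator=True,
)

unmatched_demo = checked_demo.loc[checked_demo["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", unmatched_demo)
```

If the unmatched list is empty, the inner merge used in this notebook loses nothing.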

#### 3. Print the dataframe

In [10]:
df_Canada

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| ... | ... | ... | ... | ... | ... |
| 98 | M9N | York | Weston | 43.706876 | -79.518188 |
| 99 | M9P | Etobicoke | Westmount | 43.696319 | -79.532242 |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... | 43.688905 | -79.554724 |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... | 43.739416 | -79.588437 |


## Part 3. Peer-graded Assignment

#### 1. Install / Load Libraries

In [None]:
!pip install folium

In [11]:
from geopy.geocoders import Nominatim
import folium
import json
# json_normalize is called as pd.json_normalize below, so no separate import is needed
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

#### 2. Using geopy library find the latitude and longitude values of Toronto

In [13]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent = "Toronto_Explorer")
location = geolocator.geocode(address)
latitude_tor = location.latitude
longitude_tor = location.longitude
print('The geographical coordinates of the City of Toronto are {}, {}.'.format(latitude_tor, longitude_tor))

The geographical coordinates of the City of Toronto are 43.6534817, -79.3839347.
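
geopy's `geocode` returns `None` when the service finds no match, in which case reading `location.latitude` fails with an `AttributeError`. The sketch below guards against that. `extract_coordinates` and the stand-in `FakeLocation` are hypothetical helpers introduced here so the example runs without a network call; the coordinates are the ones printed above.

```python
from collections import namedtuple

# Stand-in for the object returned by geolocator.geocode(); only the two
# attributes used in this notebook are modelled (an assumption for the demo).
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def extract_coordinates(location, query):
    """Return (latitude, longitude), or raise a clear error if geocoding failed."""
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(query))
    return location.latitude, location.longitude

latitude_demo, longitude_demo = extract_coordinates(
    FakeLocation(43.6534817, -79.3839347), "Toronto, Canada"
)
print(latitude_demo, longitude_demo)
```

In the notebook, the same function could be called as `extract_coordinates(geolocator.geocode(address), address)`.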


#### 3. Select only Toronto Neighbourhoods

In [14]:
df_Toronto = df_Canada[df_Canada['Borough'].str.contains('Toronto')].reset_index(drop = True)
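
The filter keeps every borough whose name contains the substring 'Toronto', so names such as 'York' or 'Scarborough' are excluded even though they are part of the city. A small sketch of how `str.contains` behaves, on an illustrative list of borough names (`boroughs_demo` is made up for the example, including a missing value):

```python
import pandas as pd

# Illustrative borough names, plus a missing value to show the na handling
boroughs_demo = pd.Series(
    ["Scarborough", "East Toronto", "Downtown Toronto", "North York", "York", None]
)

# na=False treats missing names as "no match" instead of propagating a missing value
mask_demo = boroughs_demo.str.contains("Toronto", na=False)
matched_demo = boroughs_demo[mask_demo].tolist()
print(matched_demo)
```

The cleaned dataframe in this notebook has no missing boroughs, so `na=False` is only a safeguard there.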

#### 4. Map of Toronto Neighbourhoods only

In [15]:
map_toronto = folium.Map(location = [latitude_tor, longitude_tor], zoom_start = 12)

for lat, lng, bor, nei in zip(df_Toronto['Latitude'], df_Toronto['Longitude'], df_Toronto['Borough'], df_Toronto['Neighbourhood']):
    label = '{}, {}'.format(nei, bor)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.5).add_to(map_toronto)

map_toronto

#### 5. Explore the first neighbourhood

In [16]:
neighbourhood_latitude = df_Toronto.loc[0, 'Latitude']
neighbourhood_longitude = df_Toronto.loc[0, 'Longitude']
neighbourhood_name = df_Toronto.loc[0, 'Neighbourhood']
print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, neighbourhood_latitude, neighbourhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


In [None]:
# Define Foursquare Credentials and Version

CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20201111' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

In [18]:
# URL for Foursquare

LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
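
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is a sketch only: `params_demo` and `url_demo` are illustrative names, and the credentials are placeholders rather than real values.

```python
from urllib.parse import urlencode

# Placeholder credentials for illustration only (not real values)
params_demo = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20201111",
    "ll": "{},{}".format(43.67635739999999, -79.2930312),
    "radius": 500,
    "limit": 100,
}

# safe="," keeps the comma in the ll value readable instead of encoding it as %2C
url_demo = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params_demo, safe=",")
print(url_demo)
```

`requests.get` also accepts such a dictionary through its `params` argument, which avoids building the URL string at all.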

In [19]:
# Get results from Foursquare

results = requests.get(url).json()

In [20]:
# Function to extract the category of the venue

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [21]:
# Clean json and structure into a pandas dataframe

venues = results['response']['groups'][0]['items']
nearby_venues = pd.json_normalize(venues)
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
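
The flattening step can be tried without API credentials on a hand-built payload. The sketch below assumes the response has the nesting used in the cell above (`response` → `groups` → `items` → `venue`). `sample_results` is fabricated for the demonstration: the first venue reuses a name and coordinates from the output in this notebook, and the second is a hypothetical venue with no category.

```python
import pandas as pd

# Hand-built payload shaped like the explore response used above (illustrative only)
sample_results = {"response": {"groups": [{"items": [
    {"venue": {"name": "Glen Manor Ravine",
               "categories": [{"name": "Trail"}],
               "location": {"lat": 43.676821, "lng": -79.293942}}},
    {"venue": {"name": "Hypothetical Venue",
               "categories": [],  # no category assigned
               "location": {"lat": 43.68, "lng": -79.29}}},
]}]}}

items_demo = sample_results["response"]["groups"][0]["items"]

# json_normalize flattens nested dicts into dotted column names; lists are kept as-is
venues_demo = pd.json_normalize(items_demo)[
    ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
].copy()

# Keep the first category name, or None when the list is empty
venues_demo["venue.categories"] = venues_demo["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Strip the dotted prefixes: 'venue.location.lat' becomes 'lat'
venues_demo.columns = [col.split(".")[-1] for col in venues_demo.columns]
print(venues_demo)
```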

In [24]:
print('The following {} venues were returned by Foursquare in {} neighbourhood.'.format(nearby_venues.shape[0], neighbourhood_name))
nearby_venues

The following 4 venues were returned by Foursquare in The Beaches neighbourhood.


| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Glen Manor Ravine | Trail | 43.676821 | -79.293942 |
| 1 | The Big Carrot Natural Food Market | Health Food Store | 43.678879 | -79.297734 |
| 2 | Grover Pub and Grub | Pub | 43.679181 | -79.297215 |
| 3 | Upper Beaches | Neighborhood | 43.680563 | -79.292869 |


#### 6. Using KMeans Clustering for the Toronto neighbourhoods

In [25]:
k = 5

# Cluster on the coordinate columns only (Latitude, Longitude)
toronto_clustering = df_Toronto.drop(columns = ['PostalCode','Borough','Neighbourhood'])
kmeans = KMeans(n_clusters = k, random_state = 0).fit(toronto_clustering)
df_Toronto.insert(0, 'Cluster Labels', kmeans.labels_)
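
Because the only features here are latitude and longitude, K-Means groups postal codes purely by geographic proximity, treating the coordinates as points on a plane. A minimal sketch on six coordinate pairs from the merged table in Part 2 shows the effect. It uses rows from `df_Canada` rather than `df_Toronto`, with k = 2 and an explicit `n_init`, all chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude pairs from the merged table in Part 2:
# three Scarborough postal codes followed by three York/Etobicoke ones
coords_demo = np.array([
    [43.806686, -79.194353],  # M1B
    [43.784535, -79.160497],  # M1C
    [43.763573, -79.188711],  # M1E
    [43.706876, -79.518188],  # M9N
    [43.696319, -79.532242],  # M9P
    [43.688905, -79.554724],  # M9R
])

kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords_demo)
labels_demo = kmeans_demo.labels_

print(labels_demo)                   # one label per postal code
print(kmeans_demo.cluster_centers_)  # mean latitude/longitude of each cluster
```

The eastern and western postal codes land in separate clusters. The label numbers themselves are arbitrary, so only the grouping is meaningful.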

In [27]:
map_clusters = folium.Map(location = [latitude_tor, longitude_tor], zoom_start = 12)

# One colour per cluster, evenly spaced along the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lng, nei, clu in zip(df_Toronto['Latitude'], df_Toronto['Longitude'], df_Toronto['Neighbourhood'], df_Toronto['Cluster Labels']):
    label = folium.Popup('{} - Cluster {}'.format(nei, clu), parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = rainbow[clu],
        fill = True,
        fill_color = rainbow[clu],
        fill_opacity = 0.5).add_to(map_clusters)

map_clusters