<a href="https://colab.research.google.com/github/vvedhas/Coursera_Capstone/blob/main/Capstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Project Notebook for the Coursera IBM Data Science Professional Capstone Project

Section 1 : *Scraping and Cleaning the Toronto Postal Code Table*

In [5]:
import numpy as np 
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
import requests
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium

from IPython.display import display_html

print('Libraries imported.')

Libraries imported.


Here we scrape the table of Toronto postal codes (those beginning with M) from Wikipedia. We use a Wayback Machine snapshot from 10 November 2018 because the current Wikipedia page lays the table out in a format that is easier to read but harder to scrape.

In [15]:
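from io import StringIO

# Fetch the archived page and parse its first table into a DataFrame
source = requests.get('https://web.archive.org/web/20181110230135/https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
src_df = pd.read_html(StringIO(str(soup.table)))  # recent pandas versions expect a file-like object
df = src_df[0]
# Drop postal codes with no borough, then combine neighbourhoods that share a postal code
df1 = df[df.Borough != 'Not assigned']
df2 = df1.groupby(['Postcode', 'Borough'], sort=False).agg(', '.join)
df2.reset_index(inplace=True)
# Where no neighbourhood is listed, fall back to the borough name
df2['Neighbourhood'] = np.where(df2['Neighbourhood'] == 'Not assigned', df2['Borough'], df2['Neighbourhood'])
df2.shape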

(103, 3)
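
The three clean-up steps in the cell above are easier to follow on a tiny table. The sketch below uses four made-up rows (assumed to have the same column names as the scraped table, not taken from it) and applies the same filter, group-and-join, and fallback logic.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same shape as the scraped table.
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop postal codes that have no borough.
assigned = raw[raw['Borough'] != 'Not assigned']

# 2. Collapse rows that share a postal code into one comma-separated row.
grouped = (
    assigned.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)

# 3. Fall back to the borough name where no neighbourhood is listed.
grouped['Neighbourhood'] = np.where(
    grouped['Neighbourhood'] == 'Not assigned',
    grouped['Borough'],
    grouped['Neighbourhood'],
)
print(grouped)
```

The two M5A rows end up as a single row, and M7A takes its borough name as its neighbourhood.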

Section 2 : *Creating a DataFrame with Latitude and Longitude Data*


In [33]:
# Load the coordinates of each postal code and align the key column name with df2
coord = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv")
coord.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
# Inner join on the postal code
df = pd.merge(df2, coord, on='Postcode')
df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
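
`pd.merge` performs an inner join by default, so any postal code missing from the coordinates file would be dropped without warning. One way to check for that is sketched below, using two small made-up tables (the `M9Z` row and `Example Borough` are hypothetical) with a left join and the `indicator` option.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
areas = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
}).rename(columns={'Postal Code': 'Postcode'})

# A left join keeps every area; indicator=True adds a `_merge` column
# recording whether a matching coordinate row was found.
checked = areas.merge(coords, on='Postcode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean the inner join loses nothing.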


Section 3 : *Selecting Boroughs Whose Names Contain "Toronto", and Mapping Them*

In [35]:
df=df[df['Borough'].str.contains("Toronto",case=False)].reset_index(drop=True)
df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
8,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752
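
The filter above relies on `Series.str.contains` with `case=False`. A minimal sketch of how it behaves on a hand-written list of borough names (the `None` entry is an assumption, added to show how missing values can be handled with `na=False`):

```python
import pandas as pd

boroughs = pd.Series(['Downtown Toronto', 'North York', 'East Toronto', 'Etobicoke', None])

# case=False ignores letter case; na=False treats missing values as non-matches
mask = boroughs.str.contains('toronto', case=False, na=False)
matches = boroughs[mask].tolist()
print(matches)
```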


In [36]:
# Centre the map on Toronto and add one marker per postal code
toronto_map = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(toronto_map)
toronto_map


In [41]:
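k = 4
# Cluster on the coordinates only, so drop the text columns
toronto_cluster = df.drop(columns=['Postcode', 'Borough', 'Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_cluster)
print(kmeans.labels_)
# Store each postal code's cluster label as the first column
df.insert(0, 'Cluster Labels', kmeans.labels_)
df.head(12)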

[0 0 0 3 0 0 2 0 2 0 2 3 0 2 3 0 3 1 1 1 1 2 1 0 2 1 0 2 1 0 1 0 0 0 0 0 0
 3]


Unnamed: 0,Cluster Labels,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
1,0,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
2,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,2,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,0,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
8,2,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259
9,0,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752
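
The notebook fixes `k=4` without showing how that value might be chosen. One common heuristic is the elbow method: fit `KMeans` for several values of k and compare the inertia. The sketch below is an illustration on synthetic points around three made-up centres, not the Toronto data, and `n_init=10` is an assumed setting.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) points around three made-up centres;
# these are NOT the Toronto coordinates.
rng = np.random.default_rng(0)
centres = np.array([[43.65, -79.38], [43.70, -79.40], [43.67, -79.30]])
points = np.vstack([c + rng.normal(scale=0.005, size=(20, 2)) for c in centres])

# Inertia is the within-cluster sum of squared distances. It generally
# falls as k grows, so look for the k after which the drop levels off.
inertias = {}
for n in range(1, 7):
    model = KMeans(n_clusters=n, n_init=10, random_state=0).fit(points)
    inertias[n] = model.inertia_

for n, inertia in inertias.items():
    print(n, round(inertia, 5))
```

On this synthetic set the inertia drops sharply up to three clusters and only slowly afterwards, which is the "elbow" the method looks for.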


In [42]:
map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# One hex colour per cluster, sampled evenly from the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, neighbourhood, cluster in zip(df['Latitude'], df['Longitude'], df['Neighbourhood'], df['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    # Labels run from 0 to k-1, so they index the colour list directly
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
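
The colour assignment can be looked at on its own. This is a minimal sketch with a hand-written list of labels (hypothetical, not the fitted ones) that builds one hex colour per cluster and looks each label up in the palette.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 4
# Sample k evenly spaced colours from the rainbow colormap.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# KMeans labels run from 0 to k-1, so a label can index the palette directly.
labels = [0, 3, 2, 1, 0]
marker_colours = [palette[label] for label in labels]
print(marker_colours)
```

An offset such as `label - 1` would still run, because index -1 wraps around to the last colour, but the colour order would no longer follow the label order.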

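The map shows the four k-means clusters, each drawn in its own colour. Because the only features are latitude and longitude, the clusters group postal codes purely by geographic proximity. They roughly follow the borough boundaries, but the match is not exact: Christie (M6G) is listed under Downtown Toronto, yet it falls in the same cluster as Dovercourt Village (M6H) in West Toronto.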
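
How closely clusters and boroughs agree can be checked with a cross-tabulation rather than by eye. The sketch below uses a small hand-made sample (the rows are hypothetical) and assumes the column names `Borough` and `Cluster Labels` from the notebook's `df`.

```python
import pandas as pd

# Hypothetical labels for illustration; substitute the real `df` columns.
sample = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'East Toronto',
                'West Toronto', 'West Toronto', 'Central Toronto'],
    'Cluster Labels': [0, 0, 3, 2, 2, 1],
})

# Rows are boroughs, columns are cluster labels, cells are counts.
table = pd.crosstab(sample['Borough'], sample['Cluster Labels'])
print(table)

# A borough maps cleanly to one cluster when its row has a single non-zero cell.
clean = (table > 0).sum(axis=1) == 1
print(clean.all())
```

With the real data, any borough whose row has more than one non-zero cell is split across clusters.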