<a href="https://colab.research.google.com/github/talentrics/coursera_capstone/blob/master/PeerReview1_Clustering_Toronto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Coursera - Capstone Project for IBM Data Science Certificate
Segmenting and Clustering Neighborhoods in Toronto  **by Daniel Macdonald @talentrics**

**Description**

This is the peer-reviewed assignment for week 3 of the Capstone project:
> https://www.coursera.org/learn/applied-data-science-capstone/home/info

**Objective:** explore, segment, and cluster the neighborhoods in the city of Toronto


**Data Sources:** (Wiki page and shared csv files via Google Drive)
> https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

> https://drive.google.com/file/d/1kzhizbGuRObxHRhoYjCSDqMCaxwHEaEz/view?usp=sharing

**CoLab Shared Link - this notebook & GitHub repository**

> https://colab.research.google.com/drive/1_eAC7QIm31nl8wm5JzXNshXGEerQ7Vu3

> https://github.com/talentrics/coursera_capstone

**Table of contents:**

*   System & Data Setup
*   Part 1 - Create initial table with 103 postal codes ('Postcode', 'Borough','Neighborhood')
*   Part 2 - Concatenate table from part 1 with Geospatial Coordinates ('Latitude', 'Longitude')
*   Part 3 - Generate maps to visualize neighborhoods and how they cluster together

### System & Data Setup

In [68]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

#mapping tools
!pip install geopy 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install folium
import folium # map rendering library

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

Collecting folium
  Downloading https://files.pythonhosted.org/packages/55/e2/7e523df8558b7f4b2ab4c62014fd378ccecce3fdc14c9928b272a88ae4cc/folium-0.7.0-py3-none-any.whl (85kB)
    100% |████████████████████████████████| 92kB 4.6MB/s 
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.7.0


Set up PyDrive to download the 'Geospatial_Coordinates' csv file from Google Drive

In [0]:
#install PyDrive to pull in csv data
!pip install -U -q PyDrive
 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
#https://drive.google.com/file/d/1kzhizbGuRObxHRhoYjCSDqMCaxwHEaEz/view?usp=sharing

#download the geospatial coordinates csv from google drive
downloaded1 = drive.CreateFile({'id': '1kzhizbGuRObxHRhoYjCSDqMCaxwHEaEz'})
downloaded1.GetContentFile('Geospacial_Coordinates.csv')

In [47]:
# read csv file
Geospacial_Coordinates = pd.read_csv('Geospacial_Coordinates.csv', sep = ',') 
# examine the shape of original input data
print(Geospacial_Coordinates.shape)

(103, 3)


**Create table - step 1** use BeautifulSoup to scrape data from website:

In [0]:
#create an object with raw data from website
website_url = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

In [0]:
#parse the raw html into a BeautifulSoup object (naming the parser explicitly avoids a warning)
soup = BeautifulSoup(website_url, 'html.parser')

**Create table - step 2** locate the table on the web page (find the table, then extract its rows to build a pd DataFrame)

In [0]:
#search for table
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table; #remove ';' to view output

**Create table - step 3** extract table data and create pd DataFrame

In [0]:
#extract row data to a list of lists (one inner list per table row)
row_data = []
for row in My_table.find_all("tr"):
    cols = row.find_all("td")
    cols = [ele.text.strip() for ele in cols]
    row_data.append(cols)

row_data; #remove ';' to view output
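
The row extraction above can be tried on a small inline table without fetching the live page. This is a minimal sketch: the HTML string and the `sample_*` names are hypothetical stand-ins, not the actual Wikipedia markup. It also shows why the first entry of `row_data` comes out empty: the header row holds `<th>` cells, so `find_all("td")` returns nothing for it.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page's table (illustration only)
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

# One inner list per <tr>; the header row has no <td> cells, so it yields []
sample_rows = []
for tr in sample_table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    sample_rows.append(cells)

print(sample_rows)
```

The empty first list becomes the all-empty row at index 0 of the DataFrame, which is why that row is dropped in Part 1.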

### Part 1 - Create initial table with 103 Postcodes ('Postcode', 'Borough', 'Neighborhood')

In [34]:
#create initial pd DataFrame
df_table = pd.DataFrame(row_data)
df_table = df_table.rename(columns={0:"Postcode",1:"Borough",2:"Neighborhood"})
df_table.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [41]:
#drop the first row (index = 0), and any row where 'Borough' = 'Not assigned'
df_table2 = df_table.drop([0]) #drop returns a new DataFrame, so no separate copy is needed
df_table2 = df_table2.drop(df_table2[df_table2['Borough']=='Not assigned'].index)
df_table2 = df_table2.reset_index(drop=True)
df_table2.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


**Create table - step 4** - data transform - if 'Neighborhood' = 'Not assigned', then use 'Borough'

In [35]:
#check a row where 'Neighborhood' = 'Not assigned'
df_table2.loc[6]

Postcode                 M7A
Borough         Queen's Park
Neighborhood    Not assigned
Name: 6, dtype: object

In [36]:
#create a new table and replace 'Neighborhood' with 'Borough' where 'Neighborhood' = 'Not assigned'
df_table3 = df_table2.copy()

df_table3['Neighborhood'] = df_table3.apply(
    lambda row: row['Borough'] if row['Neighborhood'] == 'Not assigned' else row['Neighborhood'],
    axis=1
)

#have a look at the transformed data
df_table3.loc[6]

Postcode                 M7A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 6, dtype: object
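
A vectorized alternative to the row-wise `apply` is `Series.mask`, which replaces values wherever a condition holds. The sketch below uses a hypothetical two-row `sample` frame built from values shown above; it is an illustration of the technique, not a change to the notebook's tables.

```python
import pandas as pd

# Hypothetical sample: one normal row and one with an unassigned neighborhood
sample = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Where Neighborhood is 'Not assigned', take the value from Borough instead
sample["Neighborhood"] = sample["Neighborhood"].mask(
    sample["Neighborhood"] == "Not assigned", sample["Borough"]
)
print(sample)
```

On a table of about a hundred rows the speed difference is negligible; the main benefit is that the intent reads in a single expression.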

**Create table - step 5** group the dataframe by Postcode & Borough and 'Join' values in 'Neighborhood'

In [28]:
df_table4 = df_table3.copy()

df_table4 = (df_table4.groupby(['Postcode','Borough'])['Neighborhood']
       .apply(lambda x: ','.join(set(x.dropna())))
       .reset_index())

df_table4 = pd.DataFrame(df_table4)
df_table4.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Rouge Hill,Highland Creek,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
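
Note that joining through `set` does not guarantee the order of the neighborhood names, so the joined strings may differ between runs. If a stable order matters, one option is to de-duplicate with `drop_duplicates`, which keeps first-seen order. A minimal sketch on a hypothetical `sample` frame:

```python
import pandas as pd

# Hypothetical sample with two neighborhoods per postcode
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M1B", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough", "Scarborough"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Rouge", "Malvern"],
})

# drop_duplicates preserves the order in which names first appear
grouped = (
    sample.groupby(["Postcode", "Borough"])["Neighborhood"]
    .agg(lambda s: ",".join(s.dropna().drop_duplicates()))
    .reset_index()
)
print(grouped)
```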


In [53]:
df_table4.shape

(103, 3)

**df_table4 above is shape (103,3)** - submission for part 1 of peer review

### Part 2 - Concatenate initial table with Geospatial Coordinates

In [48]:
Geo = pd.DataFrame(Geospacial_Coordinates)
Geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [54]:
df_table_final = pd.concat([df_table4, Geo], axis=1)
df_table_final = df_table_final.drop(['Postal Code'], axis = 1)
df_table_final.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill,Highland Creek,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
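
`pd.concat(..., axis=1)` pairs rows by index position, so the result above is only correct because both tables happen to be sorted by postal code with the same 103 rows. A key-based merge does not rely on that. The sketch below uses hypothetical `codes` and `coords` frames, with the coordinates deliberately in a different row order:

```python
import pandas as pd

codes = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern,Rouge", "Rouge Hill,Highland Creek,Port Union"],
})
# Same coordinates as shown above, but rows swapped on purpose
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Match rows on the postal code; validate raises if a code is duplicated
merged = (
    codes.merge(coords, left_on="Postcode", right_on="Postal Code",
                how="left", validate="one_to_one")
    .drop(columns="Postal Code")
)
print(merged)
```

With `how="left"`, any postcode missing from the coordinates file would show up as `NaN` in `Latitude` and `Longitude` rather than silently shifting rows.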


### Part 3 - mapping

In [0]:
import json
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

In [82]:
address = 'Toronto, Canada'

#recent geopy versions require a user_agent; the name used here is an arbitrary placeholder
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Canada are 43.653963, -79.387207.


In [83]:
df_toronto = df_table_final[df_table_final['Borough'].str.contains('Toronto')].reset_index(drop=True)
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"Riverdale,The Danforth West",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar,The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [85]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)


# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
  label = '{},{}'.format(neighborhood,borough)
  label = folium.Popup(label, parse_html=True)
  folium.CircleMarker(
      [lat, lng],
      radius=5,
      popup=label,
      color='blue',
      fill=True,
      fill_color='#3186cc',
      fill_opacity=0.7,
      parse_html=False).add_to(map_toronto) 
    
map_toronto
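
`KMeans` is imported at the start of Part 3 but not called in this section. As a rough sketch of how the clustering step might look, the block below groups a handful of neighborhoods by their coordinates alone. The choice of `k = 2`, the use of raw latitude and longitude as features, and the six-row `sample` frame (values taken from the tables above) are all assumptions for illustration, not the assignment's prescribed method.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Six rows copied from the tables above (illustration only)
sample = pd.DataFrame({
    "Neighborhood": ["The Beaches", "Studio District", "Lawrence Park",
                     "Woburn", "Cedarbrae", "Malvern,Rouge"],
    "Latitude": [43.676357, 43.659526, 43.728020, 43.770992, 43.773136, 43.806686],
    "Longitude": [-79.293031, -79.340923, -79.388790, -79.216917, -79.239476, -79.194353],
})

k = 2  # assumed number of clusters
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)

# fit_predict returns one integer cluster label per row
sample["Cluster"] = kmeans.fit_predict(sample[["Latitude", "Longitude"]])
print(sample[["Neighborhood", "Cluster"]])
```

The resulting `Cluster` column could then drive the marker color in the folium loop, for example by indexing into a list of `k` colors.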