## Applied Data Science Coursera Capstone
# Segmenting and Clustering Neighborhoods in Toronto

For this project, we want to explore and cluster the neighborhoods in Toronto. This exploration will take place in three parts.

### Part 1

We will use a web-scraping package to retrieve Toronto neighborhood information from a Wikipedia page. We will save the scraped data in a dataframe and clean its contents.

We can begin by importing the libraries we will use for this analysis.

In [1]:
# Import libraries.
import pandas as pd
from bs4 import BeautifulSoup    # for website scraping
from urllib.request import urlopen    # for working with url requests

Since we are using *BeautifulSoup* for our web scraping, we first assign the URL we want to scrape and then construct a *BeautifulSoup* object from the page it points to. This object represents the webpage we want to scrape, with its associated content and HTML tags.

In [2]:
# URL for scraping.
webpage = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Build the BeautifulSoup object from the fetched page.
with urlopen(webpage) as fp:
    soup = BeautifulSoup(fp)

print(soup)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XptLuwpAICwAAA81In4AAABF","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":951325562,"wgRevisionId":951325562,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"

Before we start working on extracting content, let's first initialize the dataframe we want to build, with columns for the postal code, borough, and neighborhood.

In [24]:
# Initialize our dataframe.
df = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood'])
df

Unnamed: 0,PostalCode,Borough,Neighborhood


So far, so good. Now, we can start getting our hands dirty with extraction.

If we open the webpage in a browser, right-click, and select *Inspect element*, we can see that the information we want is enclosed in a table in the body of the HTML document.

We can also see that the header is enclosed in ```<th></th>``` tags. Since we have already built our dataframe with these columns, we can ignore the header. The data we want within the table is enclosed in ```<td></td>``` tags. We can use the ```find_all``` method to search for and retrieve the ```td``` tags.

In [25]:
# Find all entries of table contained within body > table > tbody elements.
web_table = soup.body.table.tbody.find_all('td')
web_table

[<td>M1A
 </td>, <td>Not assigned
 </td>, <td>
 </td>, <td>M2A
 </td>, <td>Not assigned
 </td>, <td>
 </td>, <td>M3A
 </td>, <td>North York
 </td>, <td>Parkwoods
 </td>, <td>M4A
 </td>, <td>North York
 </td>, <td>Victoria Village
 </td>, <td>M5A
 </td>, <td>Downtown Toronto
 </td>, <td>Regent Park / Harbourfront
 </td>, <td>M6A
 </td>, <td>North York
 </td>, <td>Lawrence Manor / Lawrence Heights
 </td>, <td>M7A
 </td>, <td>Downtown Toronto
 </td>, <td>Queen's Park / Ontario Provincial Government
 </td>, <td>M8A
 </td>, <td>Not assigned
 </td>, <td>
 </td>, <td>M9A
 </td>, <td>Etobicoke
 </td>, <td>Islington Avenue
 </td>, <td>M1B
 </td>, <td>Scarborough
 </td>, <td>Malvern / Rouge
 </td>, <td>M2B
 </td>, <td>Not assigned
 </td>, <td>
 </td>, <td>M3B
 </td>, <td>North York
 </td>, <td>Don Mills
 </td>, <td>M4B
 </td>, <td>East York
 </td>, <td>Parkview Hill / Woodbine Gardens
 </td>, <td>M5B
 </td>, <td>Downtown Toronto
 </td>, <td>Garden District, Ryerson
 </td>, <td>M6B
 </td>, <td>No

We can see that this works, but the text we need is still wrapped in a pair of tags. We will need to loop through this object and pull out the text of each cell using the ```contents``` attribute that *BeautifulSoup* provides on every tag.

In [26]:
postal_code = []
borough = []
neighborhood = []

for i in range(len(web_table)):
    entry = web_table[i].contents     # the contents between each <td> tag
    entry = entry[0].strip()          # remove any newline characters
    if i % 3 == 0:                    # corresponds to leftmost column index
        postal_code.append(entry)
    if i % 3 == 1:                    # corresponds to middle column index
        borough.append(entry)
    if i % 3 == 2:                    # corresponds to rightmost column index
        neighborhood.append(entry)
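
As a point of comparison, the same extraction could be done row by row instead of tracking the column with `i % 3`. The sketch below is only an illustration: it parses a small hand-written HTML fragment (an assumption standing in for the Wikipedia table, not the live page) and uses `get_text(strip=True)` to read each cell.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the layout of the scraped table (assumed, not the live page).
html = """
<table><tbody>
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in sample_soup.find_all("tr"):
    # The header row holds only <th> tags, so it produces an empty list and is skipped.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:
        rows.append(cells)

print(rows)
```

Each inner list then corresponds to one table row, so the three column lists never have to be kept in step by hand.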

Now that we have our lists, we can add them to the dataframe, ```df```, that we initialized before, and view the first ten rows.

In [27]:
df['PostalCode'] = postal_code
df['Borough'] = borough
df['Neighborhood'] = neighborhood
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


Let's check the shape of our dataframe to see the number of rows and columns it has.

In [28]:
df.shape

(180, 3)

As we can see from the first ten rows, some boroughs are unassigned and some neighborhood values are empty. Since we will eventually need borough values to be present, we will begin by filtering out the rows whose borough is *Not assigned*. We will store the filtered rows in a new dataframe so that we can still refer back to the original data.

In [34]:
# Take an explicit copy so that later column assignments do not act on a view of df.
df_filtered = df[df['Borough'] != 'Not assigned'].copy()
df_filtered.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge
11,M3B,North York,Don Mills
12,M4B,East York,Parkview Hill / Woodbine Gardens
13,M5B,Downtown Toronto,"Garden District, Ryerson"


Let's check whether we have any ```Neighborhood``` values that have not been assigned.

In [35]:
df_filtered[df_filtered['Neighborhood'] == '']

Unnamed: 0,PostalCode,Borough,Neighborhood


It looks like every remaining row has a neighborhood assigned.

Now we can reformat the entries where a single postal code covers multiple neighborhoods. We want to replace the slashes that separate them with commas.

In [36]:
df_filtered[['Neighborhood']] = df_filtered[['Neighborhood']].replace(" / ", ", ", regex = True)
df_filtered.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
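
The replacement above treats `" / "` as a regular expression, which happens to work because the pattern contains no special characters. A minimal sketch of a literal, column-level alternative is shown below, assuming a small hand-made series in place of the real `Neighborhood` column.

```python
import pandas as pd

# Sample values copied from the table above; the series itself is an illustrative stand-in.
names = pd.Series([
    "Regent Park / Harbourfront",
    "Parkwoods",
    "Garden District, Ryerson",
])

# regex=False makes the pattern a plain substring, so no escaping is ever needed.
cleaned = names.str.replace(" / ", ", ", regex=False)

print(cleaned.tolist())
```

Values that contain no slash, or that already use commas, pass through unchanged.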


This looks nearly perfect, but the row index still carries the numbering of the unfiltered dataframe, so it has gaps where rows were removed. Let's fix that.

In [37]:
df_filtered.reset_index(inplace = True)
df_filtered.head(10)

Unnamed: 0,index,PostalCode,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,8,M9A,Etobicoke,Islington Avenue
6,9,M1B,Scarborough,"Malvern, Rouge"
7,11,M3B,North York,Don Mills
8,12,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,13,M5B,Downtown Toronto,"Garden District, Ryerson"


The rows are now numbered consecutively, but ```reset_index``` has kept the old numbering in a new ```index``` column. Let's drop that column.

In [39]:
df_filtered = df_filtered.drop('index', axis = 1)
df_filtered.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
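
The reset and the drop can also be combined into one step. The sketch below uses a tiny made-up dataframe (an assumption for illustration) to show that `reset_index(drop=True)` renumbers the rows without adding an `index` column.

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# Filtering leaves the original row labels (1 and 2) in place.
kept = sample[sample["Borough"] != "Not assigned"]

# drop=True discards the old labels instead of storing them in a new column.
renumbered = kept.reset_index(drop=True)

print(kept.index.tolist(), renumbered.index.tolist())
print(list(renumbered.columns))
```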


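It looks like we're all set to move on to the next section.

Before we do, let's check the shape of the original dataframe. The filtering was applied to ```df_filtered```, so ```df``` should still contain all 180 rows; ```df_filtered.shape``` would give the row count after filtering.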

In [40]:
df.shape

(180, 3)

---

### Part 2

Now that we have the postal code, borough name, and neighborhood name, we need to get geographical data in the form of latitude and longitude coordinates for each of them.

We can do so using two methods. One method is to use a ```geocoder``` package, and the other is to retrieve the geocodes directly from a .csv file.

Let's try the first method and see how we do.

In [41]:
# import geocoder

In [42]:
#postal_code = 'M3A'

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]

When we tried to retrieve the latitude and longitude for the postal code M3A this way, the process hung and had to be terminated without returning any results. The ```while``` loop exits only after a successful lookup, so it never finishes if ```g.latlng``` keeps coming back as ```None```.

Now we can try the second method: retrieving the geocodes from a .csv file.
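
One way to keep a lookup loop like this from running forever is to cap the number of attempts. The sketch below is illustrative only: `lookup_with_retries` and the two stand-in lookup functions are hypothetical names, not part of the `geocoder` package, and the coordinates are placeholder values.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # brief pause before the next attempt
    return None  # give up instead of looping indefinitely


# Hypothetical stand-in for a geocoding call that never succeeds.
def always_fails(query):
    return None


# Hypothetical stand-in that succeeds on the second call.
calls = []


def succeeds_on_second_try(query):
    calls.append(query)
    return [43.75, -79.33] if len(calls) >= 2 else None


failed = lookup_with_retries(always_fails, "M3A, Toronto, Ontario")
found = lookup_with_retries(succeeds_on_second_try, "M3A, Toronto, Ontario")
print(failed, found)
```

With a cap in place, a service that never answers produces `None` after a fixed number of tries, and the caller can fall back to another data source.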

In [43]:
geo_file = "https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv"

geo_coords = pd.read_csv(geo_file, header = 0)

After retrieving the file and importing it into a *pandas* dataframe, let's take a look at the first five rows.

In [44]:
geo_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We should modify the dataframe so that the ```Postal Code``` column is the index, which lets us look up coordinates by postal code.

In [45]:
geo_coords.set_index('Postal Code', inplace = True)
geo_coords.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


Now that we have the dataframe with geocodes corresponding to Toronto postal codes, we want to add these to our ```df_filtered``` dataframe.

We begin by setting ```PostalCode``` as our index.

In [46]:
df_filtered.set_index('PostalCode', inplace = True)

In [47]:
geo_lat = []          # save our latitude values
geo_long = []         # save our longitude values

for code in df_filtered.index:          # iterate through the postal codes of our dataframe
    lat = geo_coords.loc[code, 'Latitude']       # look up by column name rather than position
    long = geo_coords.loc[code, 'Longitude']
    geo_lat.append(lat)
    geo_long.append(long)

In [48]:
df_filtered['Latitude'] = geo_lat       # add our latitude values to our dataframe
df_filtered['Longitude'] = geo_long     # add our longitude values to our dataframe
df_filtered.head(10)

PostalCode,Borough,Neighborhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
M3B,North York,Don Mills,43.745906,-79.352188
M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


Looks good! Once again, we reset our index to replicate our original dataframe style.

In [49]:
df_filtered.reset_index(inplace = True)
df_filtered.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


Success! We now have a dataframe containing the postal codes, borough names, neighborhood names, and latitude and longitude coordinates of neighborhoods in Toronto.
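
The index-and-loop approach used here could also be written as a single join. The sketch below is a minimal illustration on two small hand-made dataframes (the three rows are copied from the tables above; the variable names are assumptions) and uses `DataFrame.merge` with a left join so that the neighborhood order is preserved.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M1B"],
    "Borough": ["North York", "North York", "Scarborough"],
})

coords = pd.DataFrame({
    "Postal Code": ["M1B", "M3A", "M4A"],
    "Latitude": [43.806686, 43.753259, 43.725882],
    "Longitude": [-79.194353, -79.329656, -79.315572],
})

# Match rows on the postal code, keeping every row (and the row order) of the left frame.
merged = neighborhoods.merge(
    coords, how="left", left_on="PostalCode", right_on="Postal Code"
).drop(columns="Postal Code")  # the right-hand key duplicates PostalCode

print(merged)
```

A left join also makes missing coordinates visible as `NaN` rather than raising a `KeyError` during a lookup.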

---

### Part 3

Up until now, we have worked towards extracting and shaping the data we needed for exploration. In this section, we can begin the exploration and clustering of neighborhoods in Toronto.

We shall start by loading the libraries we will need.

In [54]:
import matplotlib.pyplot as plt    # plotting library
# backend for rendering plots within the browser
%matplotlib inline
from sklearn.cluster import KMeans 
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values
import folium

In [62]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder; keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version

In [57]:
address = 'North York, Ontario, Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7543263, -79.44911696639593.


In [58]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_filtered['Latitude'], df_filtered['Longitude'], df_filtered['Borough'], df_filtered['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto