# Final Assignment of the Applied Data Science Capstone
Segmenting and Clustering Neighborhoods in Toronto

## Scrape the Toronto Neighbourhoods
1. Use Beautiful Soup to scrape the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
2. Load the scraped data into a pandas DataFrame

Install the Beautiful Soup library to scrape the Wikipedia page

In [63]:

! pip3 install bs4



Import all the necessary libraries for the first task

In [64]:
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup

Fetch the HTML of the Wikipedia page and parse it with Beautiful Soup using html.parser

In [65]:
html_data = requests.get(url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup_scraper = BeautifulSoup(html_data.text, 'html.parser')
soup_scraper.title


<title>List of postal codes of Canada: M - Wikipedia</title>

The Wikipedia page loaded successfully. Now we can create a pandas DataFrame with the required columns and fill it with the rows of the HTML table.

In [66]:
# Collect one dict per table row, then build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []

for row in soup_scraper.find('div', id='mw-content-text').find('table').find('tbody').find_all('tr'):
    col = row.find_all('td')
    if len(col) > 0:
        rows.append({'PostalCode': col[0].text,
                     'Borough': col[1].text,
                     'Neighbourhood': col[2].text})

toronto_neighbourhoods = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighbourhood'])
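
The row-collection pattern can be tried offline on a small table. The snippet below is only a sketch: the inline HTML is a hand-written stand-in for the Wikipedia table (not its real markup), and the names `html`, `rows` and `frame` are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the scraped table (an assumption, not the real page)
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find('table').find_all('tr'):
    cells = tr.find_all('td')
    if cells:  # the header row has only <th> cells, so it is skipped
        rows.append({
            'PostalCode': cells[0].get_text(strip=True),
            'Borough': cells[1].get_text(strip=True),
            'Neighbourhood': cells[2].get_text(strip=True),
        })

# Build the frame once from the collected rows
frame = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighbourhood'])
print(frame)
```

Building the frame once from a list of dicts avoids growing a DataFrame row by row, and `get_text(strip=True)` drops surrounding whitespace while the cell is read.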
    


With the dataset acquired, we print the first 5 rows to check the data quality

In [67]:
toronto_neighbourhoods.head()


|   | PostalCode | Borough | Neighbourhood |
|---|------------|---------|---------------|
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |


Every value ends with a '\n' character, which we have to remove from the dataset

In [68]:
# Remove newline characters from every text column; regex=False makes the
# literal match explicit, because the default of `regex` changed in pandas 2.0
for column in ['PostalCode', 'Borough', 'Neighbourhood']:
    toronto_neighbourhoods[column] = toronto_neighbourhoods[column].str.replace('\n', '', regex=False)
toronto_neighbourhoods.head()


|   | PostalCode | Borough | Neighbourhood |
|---|------------|---------|---------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
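
The newline cleanup can be checked on a toy frame. This is a minimal sketch, assuming the newline only appears at the end of each value, as in the output above; the frame `raw` and its two rows are sample data for illustration.

```python
import pandas as pd

# Toy rows shaped like the scraped table, with the trailing newline still attached
raw = pd.DataFrame({
    'PostalCode': ['M3A\n', 'M5A\n'],
    'Neighbourhood': ['Parkwoods\n', 'Regent Park, Harbourfront\n'],
})

cleaned = raw.copy()
for column in cleaned.columns:
    # '\n' (not r'\n') is a real newline; regex=False treats it as a literal
    cleaned[column] = cleaned[column].str.replace('\n', '', regex=False)

# str.strip() is an alternative when the newline only appears at the ends
stripped = raw.apply(lambda s: s.str.strip())

print(cleaned)
print(cleaned.equals(stripped))
```

Passing `regex=False` keeps the behaviour the same across pandas versions, since the default interpretation of the pattern differs between pandas 1.x and 2.x.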


The data is now clean, but some postal codes have no borough or neighbourhood assigned. We drop the rows without a borough, and where only the neighbourhood is missing we use the borough name instead.

In [69]:
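toronto_neighbourhoods = toronto_neighbourhoods.replace('Not assigned', np.nan)
toronto_neighbourhoods = toronto_neighbourhoods.dropna(subset=['Borough'])
# Assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas
toronto_neighbourhoods['Neighbourhood'] = toronto_neighbourhoods['Neighbourhood'].fillna(toronto_neighbourhoods['Borough'])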
toronto_neighbourhoods.isnull().value_counts()


PostalCode  Borough  Neighbourhood
False       False    False            103
dtype: int64
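
The effect of the three cleaning steps is easiest to see on a few rows. The frame below is made up for illustration (in particular the `M7A` row with a borough but no neighbourhood is hypothetical), so treat it as a sketch of the logic rather than real data.

```python
import numpy as np
import pandas as pd

# Made-up rows covering the three cases: nothing assigned, fully assigned,
# and a borough without a neighbourhood
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# 1. Turn the placeholder text into real missing values
toy = toy.replace('Not assigned', np.nan)
# 2. Drop rows that have no borough at all
toy = toy.dropna(subset=['Borough'])
# 3. Fall back to the borough name where the neighbourhood is missing
toy['Neighbourhood'] = toy['Neighbourhood'].fillna(toy['Borough'])
toy = toy.reset_index(drop=True)

print(toy)
```

Resetting the index at the end is optional; it only keeps the row labels contiguous after rows were dropped.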

Now let's see how many rows we have left.

In [70]:
toronto_neighbourhoods.shape


(103, 3)

## Add Geolocation data
As the geocoder API does not work reliably, we load the latitude and longitude from the CSV file provided with the assignment.

In [80]:
toronto_postal_code_geo = pd.read_csv('https://cocl.us/Geospatial_data')
toronto_postal_code_geo.head()

|   | Postal Code | Latitude | Longitude |
|---|-------------|----------|-----------|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


With the geo data loaded, rename the postal code column so that both data frames use the same column name, then check the shape to see whether we can merge them.

In [82]:
toronto_postal_code_geo.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
toronto_postal_code_geo.shape


(103, 3)

The shape matches that of our original data frame, so we merge the two data frames on the PostalCode column.

In [88]:
toronto_neighbourhoods_geo = pd.merge(toronto_neighbourhoods, toronto_postal_code_geo, on='PostalCode')
toronto_neighbourhoods_geo.head()


|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|------------|---------|---------------|----------|-----------|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
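
Equal shapes do not guarantee that every postal code finds a partner, so a merge can also be checked directly. The sketch below uses sample frames (the `M9Z` row and its zero coordinates are made up) and the `validate` and `indicator` options of `pd.merge` to list keys that did not match.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})

# Same number of rows, but 'M9Z' (a made-up code) replaces 'M5A'
geo = pd.DataFrame({
    'PostalCode': ['M4A', 'M3A', 'M9Z'],
    'Latitude': [43.725882, 43.753259, 0.0],
    'Longitude': [-79.315572, -79.329656, 0.0],
})

merged = pd.merge(
    neighbourhoods, geo,
    on='PostalCode',
    how='left',              # keep every neighbourhood row
    validate='one_to_one',   # raise if a postal code is duplicated on either side
    indicator=True,          # add a '_merge' column describing each row's origin
)

# Rows present only on the left side have no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

With the default inner join, an unmatched row would be dropped silently; the left join plus indicator makes such rows visible instead.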


## Data Visualisation on the Map
Explore and cluster the neighborhoods in Toronto. We'll work only with boroughs whose name contains the word Toronto, and then replicate the analysis we did on the New York City data.
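
One way such a borough filter could be written is sketched below on sample rows taken from the merged output above. The variable names `sample` and `toronto_only` are illustrative and not part of the notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront',
                      'Lawrence Manor, Lawrence Heights',
                      "Queen's Park, Ontario Provincial Government"],
})

# Keep only the rows whose borough name contains the word "Toronto"
mask = sample['Borough'].str.contains('Toronto')
toronto_only = sample[mask].reset_index(drop=True)

print(toronto_only)
```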

We no longer need the postal code, so we can remove it

In [89]:
toronto_neighbourhoods_geo.drop('PostalCode', axis = 1, inplace=True)

Use the geopy library to get the latitude and longitude values of Toronto

In [91]:
! pip3 install geopy

Collecting geopy
  Downloading geopy-2.1.0-py3-none-any.whl (112 kB)
     |████████████████████████████████| 112 kB 4.7 MB/s 
Collecting geographiclib<2,>=1.49
  Downloading geographiclib-1.50-py3-none-any.whl (38 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.1.0


In [92]:
from geopy.geocoders import Nominatim

In [93]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(
    latitude, longitude))


The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Create a map of Toronto with the neighbourhoods marked on it

In [94]:
! pip3 install folium

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
     |████████████████████████████████| 94 kB 2.4 MB/s 
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


In [95]:
import folium

In [98]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(toronto_neighbourhoods_geo['Latitude'], toronto_neighbourhoods_geo['Longitude'],
 toronto_neighbourhoods_geo['Borough'], toronto_neighbourhoods_geo['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto


## Foursquare Data and Clustering 
After the visualisation, let's cluster the data. We'll use the Foursquare API to get the venues on which the clustering is based.

Foursquare API config

In [100]:
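import os

# Read the credentials from environment variables rather than hard-coding them
# (the variable names below are an assumption; use whatever you have configured)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']  # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']  # your Foursquare Secret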
VERSION = '20180605'  # Foursquare API version
LIMIT = 100  # A default Foursquare API limit value


Let's create a function to get the venues information from neighbourhoods

In [103]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood',
                             'Neighbourhood Latitude',
                             'Neighbourhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return nearby_venues
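
The function above indexes straight into the JSON response, which raises an error when a result has no category or when the response contains no groups. A more defensive way to read the same fields is sketched below. The helper `extract_venues` is hypothetical, and `sample_payload` is hand-built to mirror only the keys the function above accesses; it is not a real API response, and its second venue is made up.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Hypothetical helper: it reads only the keys used by getNearbyVenues.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    venues = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None instead of failing on an empty category list
        category = categories[0]['name'] if categories else None
        venues.append((venue['name'],
                       venue['location']['lat'],
                       venue['location']['lng'],
                       category))
    return venues


# Hand-built payload for illustration only
sample_payload = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Brookbanks Park',
                           'location': {'lat': 43.751976, 'lng': -79.33214},
                           'categories': [{'name': 'Park'}]}},
                {'venue': {'name': 'Uncategorised Place',
                           'location': {'lat': 43.75, 'lng': -79.33},
                           'categories': []}},
            ]
        }]
    }
}

venues = extract_venues(sample_payload)
print(venues)
```

Keeping the parsing in its own function makes it possible to test it without network access or API credentials.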


Let's use the function and get the information we need

In [104]:
toronto_venues = getNearbyVenues(names=toronto_neighbourhoods_geo['Neighbourhood'],
                                 latitudes=toronto_neighbourhoods_geo['Latitude'],
                                 longitudes=toronto_neighbourhoods_geo['Longitude']
                                 )
toronto_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|--------------|-----------------------|------------------------|-------|----------------|-----------------|----------------|
| 0 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | Parkwoods | 43.753259 | -79.329656 | Towns On The Ravine | 43.754754 | -79.332552 | Hotel |
| 2 | Parkwoods | 43.753259 | -79.329656 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 3 | Parkwoods | 43.753259 | -79.329656 | Corrosion Service Company Limited | 43.752432 | -79.334661 | Construction & Landscaping |
| 4 | Victoria Village | 43.725882 | -79.315572 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |


How many venues do we have for each neighbourhood?