# Segmenting and Clustering Neighboroods in Toronto

## Getting and cleaning the data

### 1) Scraping the table from Wikipedia

No need to use Beautiful Soup to import the dataframe, as Pandas has a useful `read.html` function which returns a list containing all the tables in a page, already converted into DataFrames.

In [64]:
import pandas as pd

In [65]:
url  = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [107]:
#resetting to default the number of rows displayed on output
pd.reset_option("display.max_rows")
dfs = pd.read_html(url)
dfs

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

The table is indexed as the first data frame in the list.

In [108]:
toronto_pc = dfs[0]
toronto_pc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Postal Code    180 non-null    object
 1   Borough        180 non-null    object
 2   Neighbourhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


We have to process only the cells that have an assigned borough and ignore cells with a borough that is 'Not Assigned'.

In [109]:
toronto_pc= toronto_pc[toronto_pc.Borough != 'Not assigned']
toronto_pc

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [110]:
#Checking that every row contains a different Postal Code
toronto_pc['Postal Code'].nunique() == len(toronto_pc)

True

In [111]:
#cleaning the index 
# the parameter drop =True avoids to create a new index columns with the old values;
toronto_pc.reset_index(inplace = True, drop = True) 
toronto_pc

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### 2) Fetching Coordinates

Since the geocoder package seems to show many issues, I will use the package pgeocode to import the coordinates. After setting the local ('ca'), this library returns a Pandas Data Frame in answer to a Postal Code query. From this data frame we will select only the data pertaining latitude and longitude.

In [112]:
import pgeocode 

nomi = pgeocode.Nominatim('ca')

Latitude = []
Longitude = []

for pc in toronto_pc['Postal Code']:
    query = nomi.query_postal_code(pc)
    Latitude.append(query.latitude)
    Longitude.append(query.longitude)

In [113]:
Latitude
Longitude

[-79.33,
 -79.3148,
 -79.3626,
 -79.4504,
 -79.3889,
 -79.5282,
 -79.193,
 -79.359,
 -79.3094,
 -79.3783,
 -79.4479,
 -79.5517,
 -79.1564,
 -79.3329,
 -79.3116,
 -79.3756,
 -79.4307,
 -79.5767,
 -79.1866,
 -79.2941,
 -79.3754,
 -79.4507,
 -79.2144,
 -79.3644,
 -79.386,
 -79.4205,
 -79.2389,
 -79.3577,
 -79.4472,
 -79.3464,
 -79.3833,
 -79.4378,
 -79.2323,
 -79.3479,
 -79.4921,
 -79.3368,
 -79.3936,
 -79.4177,
 -79.2639,
 -79.3813,
 -79.4692,
 -79.3538,
 -79.3823,
 -79.4301,
 -79.2843,
 -79.3764,
 -79.5116,
 -79.3155,
 -79.3823,
 -79.4869,
 -79.5565,
 -79.2312,
 -79.4103,
 -79.4928,
 -79.3406,
 -79.4177,
 -79.4857,
 -79.5401,
 -79.2646,
 -79.4111,
 -79.521,
 -79.3935,
 -79.4195,
 -79.4839,
 -79.517,
 -79.2707,
 -79.3978,
 -79.3887,
 -79.412,
 -79.4633,
 -79.5323,
 -79.3003,
 -79.445,
 -79.4065,
 -79.4035,
 -79.4521,
 nan,
 -79.5582,
 -79.2644,
 -79.3853,
 -79.3987,
 -79.4828,
 -79.3036,
 -79.3853,
 -79.3978,
 -79.2819,
 -79.4025,
 -79.3995,
 -79.5013,
 -79.5876,
 -79.3216,
 -79.373,
 -7

In [114]:
#assigning the coordinates to new columns in the existing dataframe
toronto_pc['Latitude'] = Latitude
toronto_pc['Longitude'] = Longitude
toronto_pc

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.3300
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6518,-79.5076
99,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.3830
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6325,-79.4939


We got a common warning in Pandas, but we can ignore that. The list of coordinates seems correctly placed. 

In [115]:
toronto_pc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Postal Code    103 non-null    object 
 1   Borough        103 non-null    object 
 2   Neighbourhood  103 non-null    object 
 3   Latitude       102 non-null    float64
 4   Longitude      102 non-null    float64
dtypes: float64(2), object(3)
memory usage: 4.1+ KB


We got just one Nan:

In [116]:
toronto_pc[toronto_pc.isnull().any(axis = 1)]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
76,M7R,Mississauga,Canada Post Gateway Processing Centre,,


We can retrieve the coordinates manuamlly from Google Maps :

In [117]:
ll = 43.63657950496381, -79.61576357177279

In [118]:
#set the values
toronto_pc.at[76, 'Latitude'] = ll[0]
toronto_pc.at[76,'Longitude'] = ll[1]

In [119]:
toronto_pc.isnull().any(axis = 0)

Postal Code      False
Borough          False
Neighbourhood    False
Latitude         False
Longitude        False
dtype: bool

In [120]:
#checking if the values are correctly set
toronto_pc.iloc[76]

Postal Code                                        M7R
Borough                                    Mississauga
Neighbourhood    Canada Post Gateway Processing Centre
Latitude                                      43.63658
Longitude                                   -79.615764
Name: 76, dtype: object

In [121]:
toronto_pc

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.3300
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6518,-79.5076
99,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.3830
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6325,-79.4939


## Clustering

Importing the libraries needed for visualization and clustering:

In [125]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library
#library needed to get Toronto coordinates:
!pip install geopy
from geopy.geocoders import Nominatim
print('Libraries imported.')

Collecting geopy
  Downloading geopy-2.1.0-py3-none-any.whl (112 kB)
Collecting geographiclib<2,>=1.49
  Downloading geographiclib-1.50-py3-none-any.whl (38 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.1.0
Libraries imported.


In order to define an instance of the geocoder, we need to define a user_agent:

In [126]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.6534817, -79.3839347.


Creating a map of Toronto with neighborhoods superimposed:


In [128]:
toronto_map= folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_pc['Latitude'], toronto_pc['Longitude'], toronto_pc['Borough'], toronto_pc['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

We are now ready to set up the Foursquare API to make queries.

## Defining Foursquare API credentials and parameters

Since I believe working with the URL as suggested in the course to make an API is overtly complicated, I decided to use this line of code (found here : https://developer.foursquare.com/docs/places-api/getting-started/). In place of filling the URL directly with the parameters, we will create a dictionary of the parameters which we will place inside a GET request (much neater).

In [160]:
import json, requests
url = 'https://api.foursquare.com/v2/venues/explore'

params = dict(
client_id='TDUMTXN0WG4EZFIPJBH5FUG1YXLZCMFQXJTWNEZ2RSJ14W3S',
client_secret='1FEIJEA0NXDXAONICQ55E53CPLUUWHQ4FNDWJIP0V1TGO1XU',
v='20180323',
ll='43.6534817,-79.3839347',
limit=1
)
#testing a query
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)

In [181]:
requests.get(url=url, params=params).json()["response"]['groups']

[{'type': 'Recommended Places',
  'name': 'recommended',
  'items': [{'reasons': {'count': 0,
     'items': [{'summary': 'This spot is popular',
       'type': 'general',
       'reasonName': 'globalInteractionReason'}]},
    'venue': {'id': '4bd847e709ecb713146d487c',
     'name': 'Gingerman Restaurant',
     'contact': {},
     'location': {'address': '1104 Victoria Park Ave',
      'crossStreet': 'at St Clair Ave',
      'lat': 43.707844,
      'lng': -79.29558,
      'labeledLatLngs': [{'label': 'display',
        'lat': 43.707844,
        'lng': -79.29558}],
      'distance': 1029,
      'postalCode': 'M4B 2K3',
      'cc': 'CA',
      'city': 'East York',
      'state': 'ON',
      'country': 'Canada',
      'formattedAddress': ['1104 Victoria Park Ave (at St Clair Ave)',
       'East York ON M4B 2K3',
       'Canada']},
     'categories': [{'id': '4bf58dd8d48988d1c4941735',
       'name': 'Restaurant',
       'pluralName': 'Restaurants',
       'shortName': 'Restaurant',
       

I willl borrow the function **getNearbyVenues**  used in the Lab to loop 'explore' queries through the neighboroods and get the corresponding venues.

In [179]:
#setting the standard query limit
params['limit'] = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        #modify the coordinates
        params['ll'] = str(lat)+','+str(lng)
            
        # make the GET request
        results = requests.get(url, params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [180]:
toronto_venues = getNearbyVenues(toronto_pc.Neighbourhood, toronto_pc.Latitude, toronto_pc.Longitude, radius=500)

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

KeyError: 'groups'