# Segmenting and Clustering Neighborhoods in Toronto

**Extracting out Toronto neighbourhood data from their wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M**

### Reading data from HTML using Beautiful Soup Library

Importing Beautiful Soup package and other necessary libraries for data extraction

In [1]:
# from bs4 import BeautifulSoup
# import requests
# source = requests.get(' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# soup = BeautifulSoup(source,'lxml')
# table = soup.table  # or table = soup.find('table')
# table_rows = table.find_all('tr')
# for tr in table_rows:
#     td = tr.find_all('td')
#     row = [i.text for i in td]
#     print(row)

Let us see another method to extract out datframe from html web page
### Reading datframe from HTML through pandas 
It is much simpler to extract out dataframe from html using pandas compared to Beautiful Soup package

In [5]:
import pandas as pd
data = []
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',header=0)
for df in dfs:
    data.append(df)

table = pd.DataFrame(data[0])



**1. Removing the cells with borough 'Not assigned'**

In [6]:
table.drop(table.index[table.Borough == 'Not assigned'], inplace=True)

**2. Combining rows with same postal code**

In [7]:
table_final = table.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()

**3. For cells having a borough but a 'Not assigned' neighborhood, replacing neighborhood with the borough name** 

In [8]:
table_final.loc[table_final['Neighbourhood'] == 'Not assigned']
table_final.iloc[85,2] = table_final.iloc[85,1]



**Shape of final dataframe**

In [11]:
table_final

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


# Location of Postal codes

**Importing dataframe which contains latitude and longitute of postal codes**

In [7]:
import pandas as pd
table_2 = pd.read_csv('C:/Users/saksh/Downloads/Geospatial_Coordinates.csv')

**Joining two datframes**

In [14]:
join = table_2.set_index('Postal Code').join(table_final.set_index('Postcode'))
join = join.reset_index()
join


Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighbourhood
0,M1B,43.806686,-79.194353,Scarborough,"Rouge,Malvern"
1,M1C,43.784535,-79.160497,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
5,M1J,43.744734,-79.239476,Scarborough,Scarborough Village
6,M1K,43.727929,-79.262029,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,43.711112,-79.284577,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,43.716316,-79.239476,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,43.692657,-79.264848,Scarborough,"Birch Cliff,Cliffside West"


In [9]:
join.Borough.nunique()
join.head


<bound method DataFrame.head of     Postal Code   Latitude  Longitude           Borough  \
0           M1B  43.806686 -79.194353       Scarborough   
1           M1C  43.784535 -79.160497       Scarborough   
2           M1E  43.763573 -79.188711       Scarborough   
3           M1G  43.770992 -79.216917       Scarborough   
4           M1H  43.773136 -79.239476       Scarborough   
5           M1J  43.744734 -79.239476       Scarborough   
6           M1K  43.727929 -79.262029       Scarborough   
7           M1L  43.711112 -79.284577       Scarborough   
8           M1M  43.716316 -79.239476       Scarborough   
9           M1N  43.692657 -79.264848       Scarborough   
10          M1P  43.757410 -79.273304       Scarborough   
11          M1R  43.750072 -79.295849       Scarborough   
12          M1S  43.794200 -79.262029       Scarborough   
13          M1T  43.781638 -79.304302       Scarborough   
14          M1V  43.815252 -79.284577       Scarborough   
15          M1W  43.7995

The final dataframe contains 103 Neighbourhoods and 11 Boroughs

In [10]:
#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import folium # map rendering library

In [11]:
address = 'Toronto, ontario'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [12]:
map_toronto = folium.Map(location = [latitude,longitude], zoom_start=10)

# Add markers to map
for lat, lon, neighb, bor in zip(join['Latitude'], join['Longitude'], join['Neighbourhood'], join['Borough']):
    label = '{},{}'.format(neighb,bor)
    label = folium.Popup(label, parse_html =True)
    folium.CircleMarker([lat,lon],
                        radius = 5,
                        Popup = label,
                        color = 'blue',
                        fill = True,
                        fill_color = '#3186cc',
                        fill_opacity = 0.7,
                        parse_html =False).add_to(map_toronto)
    
map_toronto

In [19]:
# Let us examine DowntownToronto 

DownTor_data = join[join['Borough']=='Downtown Toronto'].reset_index(drop=True)
DownTor_data

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighbourhood
0,M4W,43.679563,-79.377529,Downtown Toronto,Rosedale
1,M4X,43.667967,-79.367675,Downtown Toronto,"Cabbagetown,St. James Town"
2,M4Y,43.66586,-79.38316,Downtown Toronto,Church and Wellesley
3,M5A,43.65426,-79.360636,Downtown Toronto,"Harbourfront,Regent Park"
4,M5B,43.657162,-79.378937,Downtown Toronto,"Ryerson,Garden District"
5,M5C,43.651494,-79.375418,Downtown Toronto,St. James Town
6,M5E,43.644771,-79.373306,Downtown Toronto,Berczy Park
7,M5G,43.657952,-79.387383,Downtown Toronto,Central Bay Street
8,M5H,43.650571,-79.384568,Downtown Toronto,"Adelaide,King,Richmond"
9,M5J,43.640816,-79.381752,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station"


In [20]:
address = 'Downtown Toronto, ontario'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Downtown Toronto are 43.655115, -79.380219.


As we did with all of Toronto City, let's visualize the neighborhoods of Downtown Toronto in it.

In [21]:
map_downtown = folium.Map(location = [latitude,longitude], zoom_start=10)

# Add markers to map
for lat, lon, neighb, bor in zip(DownTor_data['Latitude'], DownTor_data['Longitude'], DownTor_data['Neighbourhood'], DownTor_data['Borough']):
    label = '{},{}'.format(neighb,bor)
    label = folium.Popup(label, parse_html =True)
    folium.CircleMarker([lat,lon],
                        radius = 5,
                        Popup = label,
                        color = 'blue',
                        fill = True,
                        fill_color = '#3186cc',
                        fill_opacity = 0.7,
                        parse_html =False).add_to(map_downtown)
    
map_downtown

#### Define Foursquare Credentials and Version

In [22]:
CLIENT_ID = 'U3CPOLQKZXPSRE3LUXCBAWQONS1OFXIE42UOJBEL3345MMSR' # your Foursquare ID
CLIENT_SECRET = '4WD4I5RY0DMYPBN0QJDEAVFOF2FCA0ISOHCUAH4BR4RPCO01' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: U3CPOLQKZXPSRE3LUXCBAWQONS1OFXIE42UOJBEL3345MMSR
CLIENT_SECRET:4WD4I5RY0DMYPBN0QJDEAVFOF2FCA0ISOHCUAH4BR4RPCO01


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [27]:
DownTor_data.loc[0,'Neighbourhood']

'Rosedale'

Get the neighborhood's latitude and longitude values.

In [30]:
neighborhood_latitude = DownTor_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = DownTor_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = DownTor_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rosedale are 43.6795626, -79.3775294.


#### Now, let's get the top 100 venues that are in Rosedale within a radius of 500 meters.

First, let's create the GET request URL.

In [32]:
# type your answer here
LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

In [34]:
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
results = requests.get(url).json()
results

{u'meta': {u'code': 200, u'requestId': u'5d13f70b6adbf5002c5a9bbc'},
 u'response': {u'groups': [{u'items': [{u'reasons': {u'count': 0,
       u'items': [{u'reasonName': u'globalInteractionReason',
         u'summary': u'This spot is popular',
         u'type': u'general'}]},
      u'referralId': u'e-0-4aff2d47f964a520743522e3-0',
      u'venue': {u'categories': [{u'icon': {u'prefix': u'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/playground_',
          u'suffix': u'.png'},
         u'id': u'4bf58dd8d48988d1e7941735',
         u'name': u'Playground',
         u'pluralName': u'Playgrounds',
         u'primary': True,
         u'shortName': u'Playground'}],
       u'id': u'4aff2d47f964a520743522e3',
       u'location': {u'address': u'38 Scholfield Ave.',
        u'cc': u'CA',
        u'city': u'Toronto',
        u'country': u'Canada',
        u'crossStreet': u'at Edgar Ave.',
        u'distance': 327,
        u'formattedAddress': [u'38 Scholfield Ave. (at Edgar Ave.)',
         

let's borrow the **get_category_type** function from the Foursquare lab.

In [35]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [38]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Rosedale Park,Playground,43.682328,-79.378934
1,Whitney Park,Park,43.682036,-79.373788
2,Alex Murray Parkette,Park,43.6783,-79.382773
3,Milkman's Lane,Trail,43.676352,-79.373842


In [37]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


## Let's create a function to repeat the same process to all the neighborhoods in Downtown Toronto

In [40]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [42]:
downtown_venues = getNearbyVenues(names=DownTor_data['Neighbourhood'],
                                   latitudes=DownTor_data['Latitude'],
                                   longitudes=DownTor_data['Longitude']
                                  )

Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront,Regent Park
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie


#### Let's check the size of the resulting dataframe

In [44]:
print(downtown_venues.shape)
downtown_venues.head()

(1287, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,"Cabbagetown,St. James Town",43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


Let's check how many venues were returned for each neighborhood

In [46]:
downtown_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",16,16,16,16,16,16
"Cabbagetown,St. James Town",46,46,46,46,46,46
Central Bay Street,88,88,88,88,88,88
"Chinatown,Grange Park,Kensington Market",100,100,100,100,100,100
Christie,15,15,15,15,15,15
Church and Wellesley,87,87,87,87,87,87
"Commerce Court,Victoria Hotel",100,100,100,100,100,100
"Design Exchange,Toronto Dominion Centre",100,100,100,100,100,100
