# Battle of Neighbourhoods (week 2) - The travelling student

### Clustering the neighbourhoods of London and Toronto

# Introduction

Studying abroad is an essential part of many students' university experience. They not only want to study at a great institution but also want to explore the city they will be living in for the next few years, so in a sense they are tourists who are also studying. We focus on students who want to study in an English-speaking country to develop their knowledge of the language but are having difficulty deciding which country to explore.

London and Toronto are popular destinations for visitors from all around the world. They are diverse and multicultural, and offer a wide variety of experiences that are widely sought after. We group the neighbourhoods of London and Toronto separately and draw insights into what they look like now.

# Business problem

The aim is to help international students choose their destination depending on the experiences that the neighbourhoods have to offer and what they would want to have. This also helps people who are thinking about living in London or Toronto. Our findings will help international students make informed decisions and address any concerns they have, including the different kinds of cuisines, provision stores and what the city has to offer.



# Data Description

We require geolocation data for both London and Toronto. Postal codes in each city serve as a starting point. Using postal codes we can find the neighbourhoods, boroughs, venues and their most popular venue categories.

## London

To derive our solution, we scrape our data from https://en.wikipedia.org/wiki/List_of_areas_of_London

This Wikipedia page has information about all the neighbourhoods; we limit it to the rows whose post town is London.

1. borough: Name of Neighbourhood
2. town: Name of borough
3. postcode: Postal codes for London.

This Wikipedia page lacks information about geographical locations. To solve this problem we use the ArcGIS API.

## ArcGIS API

ArcGIS Online enables you to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. Share your insights with the world or specific groups.

More specifically, we use ArcGIS to get the geolocations of the neighbourhoods of London. The following columns are added to our initial dataset to prepare our data.

1. latitude: Latitude for Neighbourhood
2. longitude: Longitude for Neighbourhood

## Toronto

To derive our solution, we scrape our data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

1. Neighborhood: Name of Neighbourhood
2. Borough: Name of borough
3. PostalCode: Postal codes for Toronto.

This Wikipedia page lacks information about geographical locations, so we use the link provided in week 3 of this course: https://cocl.us/Geospatial_data

## Foursquare API

We need data about the venues in each neighbourhood of a given borough. To obtain it we use Foursquare location data. Foursquare is a location data provider with information about all manner of venues and events within an area of interest, including venue names, locations, menus and even photos. The Foursquare location platform is therefore used as the sole source of venue data, since all the required information can be obtained through its API.

After finding the list of neighbourhoods, we then connect to the Foursquare API to gather information about venues inside each and every neighbourhood. For each neighbourhood, we have chosen the radius to be 500 meters.
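To give a feel for what a 500 metre radius means in terms of coordinates, the sketch below computes the great-circle distance between two latitude/longitude points with the haversine formula. The names `haversine_m` and `within_radius` and the mean Earth radius of 6,371 km are assumptions made for illustration. They are not part of the Foursquare API, which applies the radius on its side.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed value)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_radius(centre, venue, radius_m=500):
    """True if `venue` lies within `radius_m` metres of `centre` (both are (lat, lng) pairs)."""
    return haversine_m(*centre, *venue) <= radius_m

# Hypothetical points: 0.003 degrees of latitude is roughly 330 metres
print(within_radius((51.5, -0.1), (51.503, -0.1)))
```

At London's latitude a 500 metre radius spans well under a hundredth of a degree, so neighbouring postcodes can easily return overlapping sets of venues.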

The data retrieved from Foursquare contains information about venues within a specified distance of the latitude and longitude of the postcodes. The information obtained per venue is as follows:

1. Neighbourhood : Name of the Neighbourhood
2. Neighbourhood Latitude : Latitude of the Neighbourhood
3. Neighbourhood Longitude : Longitude of the Neighbourhood
4. Venue : Name of the Venue
5. Venue Latitude : Latitude of Venue
6. Venue Longitude : Longitude of Venue
7. Venue Category : Category of Venue

Based on all the information collected for both London and Toronto, we have sufficient data to build our model. We cluster the neighbourhoods together based on similar venue categories. We then present our observations and findings. Using this data, international students can make the necessary decision.

# Methodology

We will be creating our model with the help of Python so we start off by importing all the required packages.

In [1]:
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize  # transform JSON file into a pandas dataframe
!pip install folium==0.5.0
import folium # map rendering library
# import k-means from clustering stage
from sklearn.cluster import KMeans
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


The approach taken here is to explore each city individually: we plot a map of the neighbourhoods being considered, build our model by clustering similar neighbourhoods together, and finally plot a new map with the clustered neighbourhoods. We then draw insights and compare and discuss our findings.

## Exploring London

### Neighbourhoods of London

We begin by collecting and refining the data.

### Data collection


To get the neighbourhoods in London, we start by scraping the "List of areas of London" Wikipedia page.

In [2]:
url_london="https://en.wikipedia.org/wiki/List_of_areas_of_London"
wiki_london_url=requests.get(url_london)
wiki_london_url

<Response [200]>

"<Response [200]>" means that the request succeeded and the page was retrieved.

In [3]:
wiki_london_data=pd.read_html(wiki_london_url.text)
wiki_london_data

[                                                   0
 0  Map all coordinates in "Category:Areas of Lond...
 1                 Download coordinates as: KML · GPX,
             Location                     London borough       Post town  \
 0         Abbey Wood              Bexley, Greenwich [7]          LONDON   
 1              Acton  Ealing, Hammersmith and Fulham[8]          LONDON   
 2          Addington                         Croydon[8]         CROYDON   
 3         Addiscombe                         Croydon[8]         CROYDON   
 4        Albany Park                             Bexley  BEXLEY, SIDCUP   
 ..               ...                                ...             ...   
 526         Woolwich                          Greenwich          LONDON   
 527   Worcester Park       Sutton, Kingston upon Thames  WORCESTER PARK   
 528  Wormwood Scrubs             Hammersmith and Fulham          LONDON   
 529          Yeading                         Hillingdon           HAYES   
 


Scraping the web page gives us all the tables present on it. We need the second table, so we select index 1.

In [4]:
wiki_london_data=wiki_london_data[1]
wiki_london_data

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
526,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
527,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
528,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
529,Yeading,Hillingdon,HAYES,UB4,020,TQ115825


### Data preprocessing

Now we strip the surrounding whitespace from the column titles and replace the spaces between words with "_".

In [5]:
wiki_london_data.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
wiki_london_data

Unnamed: 0,Location,London borough,Post_town,Postcode district,Dial code,OS_grid_ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
526,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
527,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
528,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
529,Yeading,Hillingdon,HAYES,UB4,020,TQ115825


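We see that a few columns still have no "_" between the words despite applying our function. This suggests that those names contain special characters, most likely non-breaking spaces, which a replacement of the plain space character does not match.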
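A possible way to handle such names is sketched below. It assumes the unconverted titles contain non-breaking spaces (`\xa0`), which is a guess, since the page source was not inspected here. The helper name `normalise_column` and the sample titles are hypothetical.

```python
import re

def normalise_column(name):
    """Collapse any run of whitespace (including non-breaking spaces) into one underscore."""
    return re.sub(r"\s+", "_", name.strip())

# Hypothetical column titles; the second one contains a non-breaking space
columns = ["Location", "London\xa0borough", "Post town", "OS grid ref"]
print([normalise_column(c) for c in columns])
# ['Location', 'London_borough', 'Post_town', 'OS_grid_ref']
```

With pandas, the same function could be passed to `rename(columns=normalise_column)`. The later steps select columns by position and then rename them, so they do not depend on this.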

### Feature selection


We need only the boroughs, postcodes and post towns for further steps, so we can drop the locations, dial codes and OS grid references.

In [6]:
df1=wiki_london_data.drop([wiki_london_data.columns[0],wiki_london_data.columns[4],wiki_london_data.columns[5]],axis=1)
df1.head()

Unnamed: 0,London borough,Post_town,Postcode district
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


Let's rename the columns to something simpler.

In [7]:
df1.columns=["borough","town","post_code"]
df1.head()

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich [7]",LONDON,SE2
1,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Croydon[8],CROYDON,CR0
3,Croydon[8],CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


Now let's remove the square brackets with the numbers in them (e.g. "[7]") from the borough column.

In [8]:
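# note: a space before the bracket (as in "Greenwich [7]") survives as a trailing space; appending .rstrip() would remove it
df1["borough"]=df1["borough"].map(lambda x: x.rstrip("]").rstrip("0123456789").rstrip("["))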
df1.head()

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
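The chained `rstrip` calls remove the bracket and the digits but leave any space that preceded the bracket, which is why a name such as "Bexley, Greenwich " can later appear as a separate group from "Bexley, Greenwich". A regular expression is one alternative; the sketch below is not the method used in this notebook, and the helper name `strip_reference` is hypothetical.

```python
import re

def strip_reference(text):
    """Remove a trailing footnote marker such as "[7]" together with surrounding whitespace."""
    return re.sub(r"\s*\[\d+\]\s*$", "", text).strip()

for raw in ["Bexley, Greenwich [7]", "Croydon[8]", "Bexley"]:
    print(repr(strip_reference(raw)))
```

Applied with `df1["borough"].map(strip_reference)`, this would merge the two spellings into one neighbourhood, so the group counts shown further down would change slightly.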


Now let us check the dimensions of our dataframe

In [9]:
df1.shape

(531, 3)

We currently have 531 records and 3 columns in our data. It is now time to perform feature engineering.

### Feature Engineering

We are only focusing on the neighbourhoods of London, so we keep only the rows whose post town contains "LONDON".

In [10]:
df1=df1[df1["town"].str.contains("LONDON")]
df1.head()

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
6,City,LONDON,EC3
7,Westminster,LONDON,WC2
9,Bromley,LONDON,SE20


In [11]:
df1.shape

(308, 3)

We now have only 308 rows. We can proceed with getting some descriptive statistics.

In [12]:
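df1.info  # note: without parentheses this displays the bound method (the dataframe repr below); df1.info() would print the column summary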

<bound method DataFrame.info of                             borough                    town post_code
0                Bexley, Greenwich                   LONDON       SE2
1    Ealing, Hammersmith and Fulham                  LONDON    W3, W4
6                              City                  LONDON       EC3
7                       Westminster                  LONDON       WC2
9                           Bromley                  LONDON      SE20
..                              ...                     ...       ...
521                       Redbridge                  LONDON  IG8, E18
522       Redbridge, Waltham Forest  LONDON, WOODFORD GREEN       IG8
525                          Barnet                  LONDON       N12
526                       Greenwich                  LONDON      SE18
528          Hammersmith and Fulham                  LONDON       W12

[308 rows x 3 columns]>

In [13]:
df1.describe()

Unnamed: 0,borough,town,post_code
count,308,308,308
unique,50,12,162
top,Barnet,LONDON,E14
freq,27,297,8


## Geolocations of the London neighbourhoods

### ArcGis API

We need to get the geographical co-ordinates for the neighbourhoods to plot the map. We will use the arcgis package to do so.

We did not run into a limit on the number of geocoding calls with ArcGIS, so it fits our use case well.

In [14]:
pip install arcgis

Note: you may need to restart the kernel to use updated packages.


In [15]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS
g=GIS()


Defining London arcgis geocode function to return latitude and longitude

In [16]:
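def get_x_y_uk(address1):
    """Return "latitude,longitude" for a London postcode using the ArcGIS geocoder."""
    # take the first match returned by the geocoder
    match=geocode(address="{},London,England,GBR".format(address1))[0]
    lng_coords=match["location"]["x"]  # x is the longitude
    lat_coords=match["location"]["y"]  # y is the latitude
    return str(lat_coords)+","+str(lng_coords)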

Checking sample data

In [17]:
c=get_x_y_uk("SE2")

In [18]:
c

'51.492450000000076,0.12127000000003818'

Looks good. We copy over the postal codes of London to pass them into the geocoding function that we just defined above.

In [19]:
geo_coordinates_uk=df1["post_code"]    
geo_coordinates_uk

0           SE2
1        W3, W4
6           EC3
7           WC2
9          SE20
         ...   
521    IG8, E18
522         IG8
525         N12
526        SE18
528         W12
Name: post_code, Length: 308, dtype: object

Passing the postal codes of London to get the geographical co-ordinates

In [20]:
coordinates_latlng_uk=geo_coordinates_uk.apply(lambda x: get_x_y_uk(x))
coordinates_latlng_uk

0       51.492450000000076,0.12127000000003818
1        51.51324000000005,-0.2674599999999714
6       51.51200000000006,-0.08057999999994081
7       51.51651000000004,-0.11967999999995982
9       51.41009000000008,-0.05682999999993399
                        ...                   
521    51.589770000000044,0.030520000000024083
522      51.50642000000005,-0.1272099999999341
525     51.615920000000074,-0.1767399999999384
526      51.48207000000008,0.07143000000002075
528      51.50645000000003,-0.2369099999999662
Name: post_code, Length: 308, dtype: object

### Latitude

Extracting the latitude from our previously collected coordinates

In [21]:
lat_uk=coordinates_latlng_uk.apply(lambda x: x.split(",")[0])
lat_uk

0      51.492450000000076
1       51.51324000000005
6       51.51200000000006
7       51.51651000000004
9       51.41009000000008
              ...        
521    51.589770000000044
522     51.50642000000005
525    51.615920000000074
526     51.48207000000008
528     51.50645000000003
Name: post_code, Length: 308, dtype: object

### Longitude

Extracting the Longitude from our previously collected coordinates

In [22]:
lng_uk=coordinates_latlng_uk.apply(lambda x: x.split(",")[1])
lng_uk

0       0.12127000000003818
1       -0.2674599999999714
6      -0.08057999999994081
7      -0.11967999999995982
9      -0.05682999999993399
               ...         
521    0.030520000000024083
522     -0.1272099999999341
525     -0.1767399999999384
526     0.07143000000002075
528     -0.2369099999999662
Name: post_code, Length: 308, dtype: object

We now have the geographical co-ordinates of the London Neighbourhoods.
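The coordinates above are carried around as "latitude,longitude" strings and split twice. As a small sketch of the same parsing done in one step, the hypothetical helper `parse_lat_lng` below turns such a string into a pair of floats. It assumes exactly one comma in the input, as in the geocoder output shown above.

```python
def parse_lat_lng(text):
    """Split a "latitude,longitude" string into a (latitude, longitude) pair of floats."""
    lat_text, lng_text = text.split(",")
    return float(lat_text), float(lng_text)

# Sample value copied from the geocoder output for postcode SE2
lat, lng = parse_lat_lng("51.492450000000076,0.12127000000003818")
print(round(lat, 5), round(lng, 5))  # 51.49245 0.12127
```

Returning numbers straight away avoids the later `astype(float)` conversion, at the cost of changing the shape of the intermediate series.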

We proceed by merging our source data with the geographical co-ordinates to make our dataset ready for the next stage.

In [23]:
london_merged=pd.concat([df1,lat_uk.astype(float),lng_uk.astype(float)],axis=1)
london_merged.columns=["borough","town","post_code","latitude","longitude"]
london_merged

Unnamed: 0,borough,town,post_code,latitude,longitude
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746
6,City,LONDON,EC3,51.51200,-0.08058
7,Westminster,LONDON,WC2,51.51651,-0.11968
9,Bromley,LONDON,SE20,51.41009,-0.05683
...,...,...,...,...,...
521,Redbridge,LONDON,"IG8, E18",51.58977,0.03052
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8,51.50642,-0.12721
525,Barnet,LONDON,N12,51.61592,-0.17674
526,Greenwich,LONDON,SE18,51.48207,0.07143


In [24]:
london_merged.dtypes

borough       object
town          object
post_code     object
latitude     float64
longitude    float64
dtype: object

### Coordinates for London

Getting the geocode for London to visualise on the map

In [25]:
london=geocode(address="London, England, GBR")[0]
london_lng_coords=london["location"]["x"]
london_lat_coords=london["location"]["y"]
london_lng_coords

-0.1272099999999341

In [26]:
london_lat_coords

51.50642000000005

### Visualising the map of London

To help visualise the neighbourhoods in London, we make use of the folium package.

In [28]:
# Creating the map of London
map_London=folium.Map(location=[london_lat_coords,london_lng_coords],zoom_start=10)
# adding markers to map
for latitude,longitude,borough,town in zip(london_merged["latitude"],london_merged["longitude"],london_merged["borough"],
                                           london_merged["town"]):
    label="{}, {}".format(town,borough)
    label=folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude,longitude],
        radius=5,
        popup=label,
        color="red",
        fill=True
        ).add_to(map_London)  
map_London

### Venues in London

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and the venue categories around each neighbourhood in London.

In [29]:
CLIENT_ID="YOUR_FOURSQUARE_CLIENT_ID" # your Foursquare ID
CLIENT_SECRET="YOUR_FOURSQUARE_CLIENT_SECRET" # your Foursquare Secret
ACCESS_TOKEN="YOUR_FOURSQUARE_ACCESS_TOKEN" # your Foursquare Access Token
VERSION="20180604" # API version date
LIMIT=100 # maximum number of venues per request
print("Your credentials:")
print("CLIENT_ID: "+CLIENT_ID)
print("CLIENT_SECRET: "+CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


Defining a function to get the nearby venues in each neighbourhood. This gives us the venue categories, which are important for our analysis.

In [30]:
LIMIT=100
def getNearbyVenues(names,latitudes,longitudes,radius=500):
    venues_list=[]
    for name,lat,lng in zip(names,latitudes,longitudes):
        print(name)
        # create the API request URL
        url="https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
        # make the GET request
        results=requests.get(url).json()["response"]["groups"][0]["items"]
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v["venue"]["name"], 
            v["venue"]["categories"][0]["name"]) for v in results])
    nearby_venues=pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns=["Neighbourhood", 
                  "Neighbourhood Latitude", 
                  "Neighbourhood Longitude", 
                  "Venue", 
                  "Venue Category"]
    return(nearby_venues)
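The request URL in the function above is assembled with string formatting. As a sketch of an alternative, the query string can be built from named parameters with the standard library, which takes care of escaping. The endpoint and parameter names are the ones used above; the helper name `build_explore_url` and the placeholder credentials are hypothetical, and the sketch does not check whether the endpoint is still served.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore request URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude of the neighbourhood
        "radius": radius,                # search radius in metres
        "limit": limit,                  # maximum number of venues returned
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials, for illustration only
print(build_explore_url("ID", "SECRET", "20180604", 51.49245, 0.12127))
```

The comma in `ll` is percent-encoded as `%2C`, which is the standard encoding for a comma in a query string.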

Getting the venues in London

In [31]:
venues_in_London=getNearbyVenues(london_merged["borough"],london_merged["latitude"],london_merged["longitude"])

Bexley, Greenwich 
Ealing, Hammersmith and Fulham
City
Westminster
Bromley
Islington
Islington
Barnet
Enfield
Wandsworth
Southwark
City
Richmond upon Thames
Barnet
Islington
Wandsworth
Westminster
Bromley
Newham
Ealing
Westminster
Lewisham
Camden
Southwark
Tower Hamlets
Bexley
City
Lewisham
Greenwich
Tower Hamlets
Camden
Haringey
Tower Hamlets
Haringey
Barnet
Brent
Lambeth
Lewisham
Tower Hamlets
Kensington and Chelsea, Hammersmith and Fulham
Brent
Barnet
Barnet
Southwark
Tower Hamlets
Camden
Tower Hamlets
Waltham Forest
Newham
Islington
Richmond upon Thames
Lewisham
Camden
Westminster
Greenwich
Kensington and Chelsea
Barnet
Westminster
Lewisham
Waltham Forest
Hounslow, Ealing, Hammersmith and Fulham
Brent
Barnet
Lambeth, Wandsworth
Islington
Barnet
Merton
Barnet
Westminster
Barnet, Brent, Camden
Lewisham
Bexley
Haringey
Bromley
Tower Hamlets
Newham
Hackney
Islington
Southwark
Lewisham
Brent
Southwark
Ealing
Kensington and Chelsea
Wandsworth
Southwark
Barnet
Newham
Richmond upon Thames


Sampling our data

In [32]:
venues_in_London.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,"Bexley, Greenwich",51.49245,0.12127,Lesnes Abbey,Historic Site
1,"Bexley, Greenwich",51.49245,0.12127,Sainsbury's,Supermarket
2,"Bexley, Greenwich",51.49245,0.12127,Lidl,Supermarket
3,"Bexley, Greenwich",51.49245,0.12127,Abbey Wood Railway Station (ABW),Train Station
4,"Bexley, Greenwich",51.49245,0.12127,Bean @ Work,Coffee Shop


In [33]:
venues_in_London.shape

(10415, 5)

In London there are 10415 venue records, which gives the clustering plenty of data to work with.

### Grouping by venue category 

We now need to see how many venue categories there are for further processing.

In [34]:
venues_in_London.groupby("Venue Category").max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Accessories Store,Westminster,51.51656,-0.11968,James Smith & Sons
Adult Boutique,Islington,51.52969,-0.08697,Sh! Women's Erotic Emporium
African Restaurant,Westminster,51.52587,-0.08808,Red Sea Restaurant
American Restaurant,Waltham Forest,51.61780,0.02795,Spielburger
Antique Shop,Westminster,51.51651,-0.11968,The London Silver Vaults
...,...,...,...,...
Wings Joint,Hammersmith and Fulham,51.54187,-0.19795,Wingmans
Women's Store,Westminster,51.55457,0.00278,Vivien of Holloway
Xinjiang Restaurant,Southwark,51.47480,-0.09313,Silk Road
Yoga Studio,Westminster,51.55457,-0.03558,yogahaven


The grouped table has 302 rows, one per venue category, which shows how diverse and interesting London is.

### One hot encoding

We need to encode our venue categories as numeric indicator columns, because k-means works on numeric features.

In [35]:
London_venue_cat=pd.get_dummies(venues_in_London[["Venue Category"]],prefix="",prefix_sep="")
London_venue_cat

Unnamed: 0,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10410,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10411,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10412,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10413,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding the Neighbourhoods to it.

In [36]:
London_venue_cat["Neighbourhood"]=venues_in_London["Neighbourhood"] 
# moving neighborhood column to the first column
fixed_columns=[London_venue_cat.columns[-1]]+list(London_venue_cat.columns[:-1])
London_venue_cat=London_venue_cat[fixed_columns]
London_venue_cat.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Bexley, Greenwich",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Venue categories mean values

We now group the rows by neighbourhood and calculate the mean of each venue category column, which gives the share of a neighbourhood's venues that fall in each category.

In [37]:
London_grouped=London_venue_cat.groupby("Neighbourhood").mean().reset_index()
London_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,Barnet,0.0,0.0,0.0,0.001795,0.0,0.0,0.0,0.007181,0.0,...,0.007181,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Barnet, Brent, Camden",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bexley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now let's make a function to get the top common venue categories

In [38]:
def return_most_common_venues(row,num_top_venues):
    # skip the first entry (the neighbourhood name) and keep the category frequencies
    row_categories=row.iloc[1:]
    row_categories_sorted=row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

There are far too many venue categories to list for each neighbourhood, so we keep the top 10 to describe it.

Creating the column labels for the top venues

In [39]:
num_top_venues=10
indicators=["st","nd","rd"]
# create columns according to number of top venues
columns=["Neighbourhood"]
for ind in np.arange(num_top_venues):
    try:
        columns.append("{}{} Most Common Venue".format(ind+1,indicators[ind]))
    except IndexError:  # positions 4 and up have no special suffix
        columns.append("{}th Most Common Venue".format(ind+1))
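The loop above relies on an exception to switch from "st", "nd", "rd" to "th", which is fine for ten columns. As a sketch of a more general approach, the hypothetical helper `ordinal` below also handles cases such as 11th to 13th and 21st, should more than ten venues ever be listed.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> "1st", 11 -> "11th"."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th to 20th all end in "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

top_venue_columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(top_venue_columns[1], "|", top_venue_columns[-1])
```

For the ten columns used here the result is the same list of labels as the loop produces.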

### Top venue categories

Getting the top venue categories in London

In [40]:
# create a new dataframe for London
neighborhoods_venues_sorted_london=pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_london["Neighbourhood"]=London_grouped["Neighbourhood"]
for ind in np.arange(London_grouped.shape[0]):
    neighborhoods_venues_sorted_london.iloc[ind,1:]=return_most_common_venues(London_grouped.iloc[ind,:],num_top_venues)
neighborhoods_venues_sorted_london.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Barnet,Coffee Shop,Café,Grocery Store,Pub,Supermarket,Pharmacy,Italian Restaurant,Bus Stop,Sushi Restaurant,Turkish Restaurant
1,"Barnet, Brent, Camden",Bus Station,Clothing Store,Gym / Fitness Center,Hardware Store,Supermarket,Fish & Chips Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant
2,Bexley,Supermarket,Historic Site,Convenience Store,Coffee Shop,Train Station,Platform,Bus Stop,Golf Course,Park,Fish Market
3,"Bexley, Greenwich",Daycare,Construction & Landscaping,Bus Stop,Park,Golf Course,Historic Site,Home Service,Sports Club,Discount Store,Diner
4,"Bexley, Greenwich",Supermarket,Train Station,Coffee Shop,Convenience Store,Platform,Historic Site,Film Studio,Exhibit,Falafel Restaurant,Farmers Market


## Model building

### K means clustering

Let's group the neighbourhoods of London into 5 clusters to make them easier to analyse.

We use the K Means clustering technique to do so.

In [41]:
# set number of clusters
k_num_clusters=5
London_grouped_clustering=London_grouped.drop("Neighbourhood",axis=1)
# run k-means clustering
kmeans_london=KMeans(n_clusters=k_num_clusters,random_state=0).fit(London_grouped_clustering)
kmeans_london

KMeans(n_clusters=5, random_state=0)
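For readers who want to see what the clustering step does, the sketch below is a toy pure-Python version of the assign-and-update loop (Lloyd's algorithm) that k-means is based on. It is not scikit-learn's implementation, which adds refinements such as smarter initialisation. All function names and the sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points (unchanged if it has none)."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def toy_kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

# Two well-separated groups of hypothetical 2-D points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = toy_kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In the notebook each point is a neighbourhood and each dimension is the frequency of one venue category, so the vectors have hundreds of dimensions rather than two.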

### Labelling clustered data

In [42]:
kmeans_london.labels_

array([0, 4, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0], dtype=int32)

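Our model has assigned a cluster label to each neighbourhood. We add the labels to the sorted venues table, shifting them by 1 so that the clusters are numbered from 1 to 5.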

In [43]:
neighborhoods_venues_sorted_london.insert(0,"Cluster Labels",kmeans_london.labels_+1)

Adding the latitude and longitude of each neighbourhood so that we are able to visualise it

In [44]:
london_data=london_merged
london_data=london_data.join(neighborhoods_venues_sorted_london.set_index("Neighbourhood"),on="borough")
london_data.head()

Unnamed: 0,borough,town,post_code,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127,4,Supermarket,Train Station,Coffee Shop,Convenience Store,Platform,Historic Site,Film Studio,Exhibit,Falafel Restaurant,Farmers Market
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746,2,Grocery Store,Indian Restaurant,Park,Breakfast Spot,Train Station,Zoo Exhibit,Film Studio,Exhibit,Falafel Restaurant,Farmers Market
6,City,LONDON,EC3,51.512,-0.08058,1,Hotel,Italian Restaurant,Coffee Shop,Gym / Fitness Center,Pub,Restaurant,Wine Bar,Sandwich Place,Salad Place,Scenic Lookout
7,Westminster,LONDON,WC2,51.51651,-0.11968,1,Hotel,Coffee Shop,Pub,Café,Sandwich Place,Italian Restaurant,Theater,Restaurant,Hotel Bar,Sushi Restaurant
9,Bromley,LONDON,SE20,51.41009,-0.05683,1,Supermarket,Grocery Store,Convenience Store,Fast Food Restaurant,Hotel,Park,Café,Historic Site,Gym / Fitness Center,Italian Restaurant


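Drop any rows that have no cluster label (neighbourhoods for which no venues were returned), since they cannot be assigned a cluster colour on the map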

In [45]:
london_data_nonan=london_data.dropna(subset=["Cluster Labels"])

### Visualising the clustered neighbourhoods

Let us now plot the clusters

In [48]:
map_clusters_london=folium.Map(location=[london_lat_coords,london_lng_coords],zoom_start=10)
# set color scheme for the clusters
x=np.arange(k_num_clusters)
ys=[i+x+(i*x)**2 for i in range(k_num_clusters)]
colors_array=cm.rainbow(np.linspace(0,1,len(ys)))
rainbow=[colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors=[]
for lat,lon,poi,cluster in zip(london_data_nonan["latitude"],london_data_nonan["longitude"],london_data_nonan["borough"],
                               london_data_nonan["Cluster Labels"]):
    # the labels are already 1-based, so show them unchanged
    label=folium.Popup("Cluster "+str(int(cluster))+"\n"+str(poi),parse_html=True)
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_london)
map_clusters_london
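The colour scheme above comes from matplotlib's rainbow colormap. As a rough alternative that needs only the standard library, the sketch below spaces k hues evenly around the colour wheel with `colorsys`. This swaps in a different technique from the one used in the notebook, and the helper name `cluster_colours` is hypothetical.

```python
import colorsys

def cluster_colours(k):
    """Return k hex colours evenly spaced around the hue wheel."""
    colours = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        colours.append("#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255)))
    return colours

colours = cluster_colours(5)
# Cluster labels run from 1 to 5, so label n uses colours[n - 1]
print(colours[0])  # #ff0000
```

Either way, the index into the colour list is the cluster label minus one, because the labels were shifted to start at 1.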

### Examining our clusters

#### Cluster 1

In [50]:
london_data_nonan.loc[london_data_nonan["Cluster Labels"]==1,london_data_nonan.columns[[1]+
                                                                                       list(range(5,london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,LONDON,1,Hotel,Italian Restaurant,Coffee Shop,Gym / Fitness Center,Pub,Restaurant,Wine Bar,Sandwich Place,Salad Place,Scenic Lookout
7,LONDON,1,Hotel,Coffee Shop,Pub,Café,Sandwich Place,Italian Restaurant,Theater,Restaurant,Hotel Bar,Sushi Restaurant
9,LONDON,1,Supermarket,Grocery Store,Convenience Store,Fast Food Restaurant,Hotel,Park,Café,Historic Site,Gym / Fitness Center,Italian Restaurant
10,LONDON,1,Coffee Shop,Pub,Food Truck,Café,Italian Restaurant,Vietnamese Restaurant,Park,Cocktail Bar,Gym / Fitness Center,Hotel
12,LONDON,1,Coffee Shop,Pub,Food Truck,Café,Italian Restaurant,Vietnamese Restaurant,Park,Cocktail Bar,Gym / Fitness Center,Hotel
...,...,...,...,...,...,...,...,...,...,...,...,...
521,LONDON,1,Café,Grocery Store,Coffee Shop,Pub,Pizza Place,Seafood Restaurant,Bar,Park,Bakery,BBQ Joint
522,"LONDON, WOODFORD GREEN",1,Hotel,Café,Plaza,Pub,Theater,Garden,Pharmacy,Bakery,Sandwich Place,Monument / Landmark
525,LONDON,1,Coffee Shop,Café,Grocery Store,Pub,Supermarket,Pharmacy,Italian Restaurant,Bus Stop,Sushi Restaurant,Turkish Restaurant
526,LONDON,1,Pub,Grocery Store,Bus Stop,Indian Restaurant,Coffee Shop,Historic Site,Fish & Chips Shop,Turkish Restaurant,Pier,Park


#### Cluster 2

In [51]:
london_data_nonan.loc[london_data_nonan["Cluster Labels"]==2,london_data_nonan.columns[[1]+
                                                                                       list(range(5,london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,LONDON,2,Grocery Store,Indian Restaurant,Park,Breakfast Spot,Train Station,Zoo Exhibit,Film Studio,Exhibit,Falafel Restaurant,Farmers Market


#### Cluster 3

In [123]:
london_data_nonan.loc[london_data_nonan["Cluster Labels"]==3,london_data_nonan.columns[[1]+
                                                                                       list(range(5,london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
377,"HARROW, STANMOREEDGWARE, LONDON",3,Bakery,Chinese Restaurant,Gym,Food Stand,Food Service,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Exhibit


#### Cluster 4

In [122]:
london_data_nonan.loc[london_data_nonan["Cluster Labels"]==4,london_data_nonan.columns[[1]
                                                                                       +list(range(5,london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,LONDON,4,Supermarket,Train Station,Coffee Shop,Convenience Store,Platform,Historic Site,Film Studio,Exhibit,Falafel Restaurant,Farmers Market
45,"BEXLEYHEATH, LONDON",4,Supermarket,Historic Site,Convenience Store,Coffee Shop,Train Station,Platform,Bus Stop,Golf Course,Park,Fish Market
124,LONDON,4,Supermarket,Historic Site,Convenience Store,Coffee Shop,Train Station,Platform,Bus Stop,Golf Course,Park,Fish Market
291,"LONDON, SIDCUP",4,Supermarket,Historic Site,Convenience Store,Coffee Shop,Train Station,Platform,Bus Stop,Golf Course,Park,Fish Market
505,LONDON,4,Supermarket,Historic Site,Convenience Store,Coffee Shop,Train Station,Platform,Bus Stop,Golf Course,Park,Fish Market


#### Cluster 5

In [54]:
london_data_nonan.loc[london_data_nonan["Cluster Labels"]==5,london_data_nonan.columns[[1]
                                                                                       +list(range(5,london_data_nonan.shape[1]))]]

Unnamed: 0,town,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
121,LONDON,5,Bus Station,Clothing Store,Gym / Fitness Center,Hardware Store,Supermarket,Fish & Chips Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant








# Exploring Toronto

## Neighbourhoods of Toronto

### Data Collection

To get the neighbourhoods in Toronto, we start by scraping the list of postal codes beginning with M from the Wikipedia page.

In [55]:
source=requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup=BeautifulSoup(source,"lxml")
table=soup.find("table")
table_rows=table.tbody.find_all("tr")
res=[]
for tr in table_rows:
    td=tr.find_all("td")
    row=[cell.text for cell in td]
    # Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    if row != [] and row[1] != "Not assigned\n":
        # If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
        if "Not assigned\n" in row[2]: 
            row[2]=row[1]
        res.append(row)
# Dataframe with 3 columns
df=pd.DataFrame(res,columns=["PostalCode","Borough","Neighborhood"])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A\n,North York\n,Parkwoods\n
1,M4A\n,North York\n,Victoria Village\n
2,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
3,M6A\n,North York\n,"Lawrence Manor, Lawrence Heights\n"
4,M7A\n,Downtown Toronto\n,"Queen's Park, Ontario Provincial Government\n"


### Data preprocessing

We now have to remove the trailing "\n" from every column of the table.

In [56]:
df["Neighborhood"]=df["Neighborhood"].str.replace("\n","")
df["PostalCode"]=df["PostalCode"].str.replace("\n","")
df["Borough"]=df["Borough"].str.replace("\n","")
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
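
As a compact alternative, the three `str.replace` calls could be folded into a single pass that strips surrounding whitespace from every text column. The sketch below uses a stand-in frame whose values are copied from the raw output above; `raw` and `clean` are illustrative names. Unlike `str.replace`, `str.strip` only touches the ends of each string, which is assumed to be enough here because the "\n" only appears at the end of each cell.

```python
import pandas as pd

# Stand-in for the scraped frame, with the trailing newlines still present.
raw = pd.DataFrame({
    "PostalCode": ["M3A\n", "M4A\n"],
    "Borough": ["North York\n", "North York\n"],
    "Neighborhood": ["Parkwoods\n", "Victoria Village\n"],
})

# str.strip() removes leading and trailing whitespace, including "\n".
clean = raw.apply(lambda column: column.str.strip())
print(clean)
```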


### Feature selection

Unlike the London data, the table is already in the format that we need, with the columns "PostalCode", "Borough" and "Neighborhood", so there is no need to drop any of the columns.

In [57]:
df.info

<bound method DataFrame.info of     PostalCode           Borough  \
0          M3A        North York   
1          M4A        North York   
2          M5A  Downtown Toronto   
3          M6A        North York   
4          M7A  Downtown Toronto   
..         ...               ...   
98         M8X         Etobicoke   
99         M4Y  Downtown Toronto   
100        M7Y      East Toronto   
101        M8Y         Etobicoke   
102        M8Z         Etobicoke   

                                          Neighborhood  
0                                            Parkwoods  
1                                     Victoria Village  
2                            Regent Park, Harbourfront  
3                     Lawrence Manor, Lawrence Heights  
4          Queen's Park, Ontario Provincial Government  
..                                                 ...  
98       The Kingsway, Montgomery Road, Old Mill North  
99                                Church and Wellesley  
100  Business reply ma
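
Note that `df.info` was evaluated here without parentheses, so the cell echoes the bound method together with a printout of the frame rather than the usual column summary. A minimal sketch of the call form follows, using a small stand-in frame (its three rows are copied from the table above; `buffer` and `summary` are illustrative names).

```python
import io
import pandas as pd

# Stand-in for the scraped frame: three rows copied from the output above.
df = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

# df.info() writes its summary to a buffer (stdout by default) and returns None.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```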

In [61]:
df.shape

(103, 3)

We now have only 103 rows. We can proceed with getting some descriptive statistics

In [60]:
df.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,103,103,103
unique,103,10,99
top,M4V,North York,Downsview
freq,1,24,4


## Geolocations of the Toronto neighbourhoods

We get the geolocations of the Toronto neighbourhoods from the link that was provided in the week 3 assignment of this module. This makes things easier than it was for the London data.

In [62]:
df_geo_coor=pd.read_csv("https://cocl.us/Geospatial_data")
df_geo_coor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We now have the geographical coordinates of the Toronto neighbourhoods.

We proceed by merging our source data with the geographical coordinates to make our dataset ready for the next stage.

In [63]:
df_toronto=pd.merge(df,df_geo_coor,how="left",left_on="PostalCode",right_on="Postal Code")
# remove the "Postal Code" column
df_toronto.drop("Postal Code",axis=1,inplace=True)
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
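
Because this is a left merge, a postal code with no entry in the coordinates file would silently end up with NaN coordinates. A hedged sketch of how that could be checked on toy data: `validate` and `indicator` are standard `pd.merge` parameters, while the postal code "M9Z", the name "Hypothetical" and the variable names are made up for illustration.

```python
import pandas as pd

areas = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code with no match
    "Neighborhood": ["Parkwoods", "Victoria Village", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = pd.merge(
    areas, coords,
    how="left", left_on="PostalCode", right_on="Postal Code",
    validate="one_to_one",  # raise if a postal code is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Postal codes that found no coordinates on the right-hand side.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every neighbourhood received coordinates.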


In [64]:
df_toronto.dtypes

PostalCode       object
Borough          object
Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

We get the coordinates of Toronto itself to help centre the map.

In [65]:
address="Toronto, ON"
geolocator=Nominatim(user_agent="toronto_explorer")
location=geolocator.geocode(address)
latitude=location.latitude
longitude=location.longitude
print("The geographical coordinates of Toronto are {}, {}.".format(latitude,longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


## Visualise the map of Toronto


To help visualise the map of Toronto and its neighbourhoods, we make use of the folium package.

In [66]:
# create map of Toronto using latitude and longitude values
map_toronto=folium.Map(location=[latitude,longitude],zoom_start=10)
# add markers to map
for lat,lng,borough,neighborhood in zip(df_toronto["Latitude"],df_toronto["Longitude"],df_toronto["Borough"],
                                        df_toronto["Neighborhood"]):
    label="{},{}".format(neighborhood,borough)
    label=folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color="blue",
        fill=True,
        fill_color="#3186cc",
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
map_toronto

### Venues in Toronto

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and venue categories around each neighbourhood in Toronto.

In [67]:
CLIENT_ID="YOUR_CLIENT_ID" # your Foursquare ID
CLIENT_SECRET="YOUR_CLIENT_SECRET" # your Foursquare Secret
ACCESS_TOKEN="YOUR_ACCESS_TOKEN" # your Foursquare Access Token
VERSION="20180604"
LIMIT=100
print("Your credentials:")
print("CLIENT_ID:"+CLIENT_ID)
print("CLIENT_SECRET:"+CLIENT_SECRET)

Your credentials:
CLIENT_ID:YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET


We define a function to get the nearby venues in each neighbourhood. This gives us the venue categories, which are important for our analysis.

In [71]:
def getNearbyVenues(names,latitudes,longitudes,radius=500):
    venues_list=[]
    for name,lat,lng in zip(names,latitudes,longitudes):
        print(name)
        # create the API request URL
        url="https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        # make the GET request
        results=requests.get(url).json()["response"]["groups"][0]["items"]
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v["venue"]["name"], 
            v["venue"]["location"]["lat"], 
            v["venue"]["location"]["lng"],  
            v["venue"]["categories"][0]["name"]) for v in results])
    nearby_venues=pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns=["Neighborhood", 
                  "Neighborhood Latitude", 
                  "Neighborhood Longitude", 
                  "Venue", 
                  "Venue Latitude", 
                  "Venue Longitude", 
                  "Venue Category"]
    return(nearby_venues)
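
The flattening step inside `getNearbyVenues` can be tried offline on a hand-written response. The sketch below is not the function above but a simplified stand-in: `flatten_items` is a hypothetical helper, the sample data only imitates the shape of `response["groups"][0]["items"]`, and the guard for a venue without categories rests on the assumption that such venues can occur.

```python
def flatten_items(name, lat, lng, items):
    """Turn one neighbourhood's Foursquare-style items into flat tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written sample; the second venue is invented to exercise the guard.
sample_items = [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorised Venue",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]

rows = flatten_items("Parkwoods", 43.753259, -79.329656, sample_items)
for row in rows:
    print(row)
```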

Getting venues in Toronto

In [72]:
toronto_venues=getNearbyVenues(names=df_toronto["Neighborhood"],
                                   latitudes=df_toronto["Latitude"],
                                   longitudes=df_toronto["Longitude"])
toronto_venues.head()

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,GTA Restoration,43.753396,-79.333477,Fireworks Store
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [73]:
toronto_venues.shape

(2137, 7)

In Toronto there are 2137 venue records. This will definitely make the clustering interesting.

### Grouping by venue category

In [74]:
toronto_venues.groupby("Venue Category").max()

Unnamed: 0_level_0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Accessories Store,"Lawrence Manor, Lawrence Heights",43.778517,-79.346556,Sunglass Hut,43.777661,-79.344692
Airport,Downsview,43.737473,-79.394420,Toronto Downsview Airport (YZD),43.738883,-79.396033
Airport Food Court,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Billy Bishop Café,43.631132,-79.396139
Airport Gate,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Gate 8,43.631536,-79.394570
Airport Lounge,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Porter Lounge,43.631360,-79.395756
...,...,...,...,...,...,...
Wine Bar,"Toronto Dominion Centre, Design Exchange",43.657952,-79.375418,The National Club,43.659128,-79.380574
Wine Shop,"Regent Park, Harbourfront",43.654260,-79.360636,Wine Rack,43.656573,-79.356928
Wings Joint,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium,43.630275,-79.518169
Women's Store,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,Want Boutique,43.778111,-79.343660


We can see that there are 275 distinct venue categories, which shows how diverse and interesting Toronto is.

### One hot encoding

We one-hot encode the venue categories so that they can be used as numeric features for clustering, and then add the neighbourhood names back to the encoded table.

In [75]:
# one hot encoding
toronto_onehot=pd.get_dummies(toronto_venues[["Venue Category"]],prefix="",prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot["Neighborhood"]=toronto_venues["Neighborhood"] 
# move neighborhood column to the first column
fixed_columns=[toronto_onehot.columns[-1]]+list(toronto_onehot.columns[:-1])
toronto_onehot=toronto_onehot[fixed_columns]
toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
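
In the output above, `Yoga Studio` ends up as the first column and no `Neighborhood` column is visible at the front. One possible explanation (an assumption, not confirmed by the source) is that one of the venue categories is itself called "Neighborhood", so the assignment overwrites that dummy column in place and `columns[-1]` picks up `Yoga Studio` instead. A sketch that guards against such a name clash on toy data; `one_hot_by_area` and the " (venue category)" suffix are hypothetical.

```python
import pandas as pd

def one_hot_by_area(venues, area_col="Neighborhood", category_col="Venue Category"):
    """One-hot encode the categories and put the area name in the first column."""
    dummies = pd.get_dummies(venues[category_col], dtype=int)
    if area_col in dummies.columns:
        # Rename the clashing category so the area names do not overwrite it.
        dummies = dummies.rename(columns={area_col: area_col + " (venue category)"})
    dummies.insert(0, area_col, venues[area_col].to_numpy())
    return dummies

# Toy data: the second venue has the category "Neighborhood" on purpose.
toy = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village"],
    "Venue Category": ["Park", "Neighborhood", "Yoga Studio"],
})

encoded = one_hot_by_area(toy)
print(encoded.columns.tolist())
```

Inserting the column by name at position 0 avoids relying on where the column happens to sit after the assignment.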


### Venue categories mean values

We group the rows by neighbourhood and take the mean of each one-hot column, which gives the share of a neighbourhood's venues that falls into each category.

In [76]:
toronto_grouped=toronto_onehot.groupby("Neighborhood").mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462
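
To see why the mean of a one-hot column is a share, consider a toy table (the neighbourhood names "A" and "B" and the counts are made up). By the same reasoning, a value such as 0.038462 in the output above would be consistent with one venue out of 26.

```python
import pandas as pd

# Neighbourhood "A" has three venues (two parks, one café); "B" has one park.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Café": [0, 0, 1, 0],
    "Park": [1, 1, 0, 1],
})

# The mean of a 0/1 column is the fraction of rows that are 1.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so neighbourhoods with many venues and neighbourhoods with few are put on the same scale before clustering.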


### Top Venue categories

Now let's create a new dataframe that lists the top 10 venue categories for each neighbourhood in Toronto.

In [78]:
def return_most_common_venues(row,num_top_venues):
    row_categories=row.iloc[1:]
    row_categories_sorted=row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues=10
indicators=["st","nd","rd"]
# create columns according to number of top venues
columns=["Neighborhood"]
for ind in np.arange(num_top_venues):
    try:
        columns.append("{}{} Most Common Venue".format(ind+1,indicators[ind]))
    except IndexError:
        columns.append("{}th Most Common Venue".format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted=pd.DataFrame(columns=columns)
neighborhoods_venues_sorted["Neighborhood"]=toronto_grouped["Neighborhood"]
for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:]=return_most_common_venues(toronto_grouped.iloc[ind,:],num_top_venues)
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Latin American Restaurant,Lounge,Breakfast Spot,Women's Store,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
1,"Alderwood, Long Branch",Pizza Place,Gym,Dance Studio,Pharmacy,Coffee Shop,Athletics & Sports,Pub,Dog Run,Dim Sum Restaurant,Diner
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Fried Chicken Joint,Pizza Place,Intersection,Supermarket,Ice Cream Shop,Sushi Restaurant,Restaurant,Shopping Mall
3,Bayview Village,Café,Bank,Chinese Restaurant,Japanese Restaurant,Women's Store,Doner Restaurant,Discount Store,Distribution Center,Dog Run,Donut Shop
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Women's Store,Indian Restaurant,Juice Bar,Breakfast Spot,Liquor Store,Locksmith,Restaurant
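
The column labels above rely on a three-item suffix list with a fallback to "th", which works for 1 to 10 but would produce "21th" if `num_top_venues` were ever raised past 20. A small sketch of a more general helper; `ordinal` is a hypothetical name, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th and 13th are the irregular cases
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns[:4])
```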


## Model building

### K means clustering and labelling clustered data

Let's group the neighbourhoods of Toronto into 5 clusters to make them easier to analyse.

We use the K Means clustering technique to do so.

In [88]:
# set number of clusters
kclusters=5
toronto_grouped_clustering=toronto_grouped.drop("Neighborhood",axis=1)
# run k-means clustering
kmeans_toronto=KMeans(n_clusters=kclusters,random_state=0).fit(toronto_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans_toronto

KMeans(n_clusters=5, random_state=0)

In [89]:
kmeans_toronto.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 1,
       0, 1, 0, 0, 0, 0, 1], dtype=int32)
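
A quick tally of these labels shows how unevenly the neighbourhoods are spread across the clusters. The sketch below copies the array printed above and shifts it by one to match the 1-based labels used in the rest of the notebook; `labels` and `sizes` are illustrative names.

```python
from collections import Counter

# Labels copied from the kmeans_toronto.labels_ output above.
labels = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 1, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 1,
    0, 1, 0, 0, 0, 0, 1,
]

# Shift to the 1-based cluster numbers used later on.
sizes = Counter(label + 1 for label in labels)
for cluster in sorted(sizes):
    print("Cluster {}: {} neighbourhoods".format(cluster, sizes[cluster]))
```

Of the 95 labelled neighbourhoods, 83 fall into cluster 1, while clusters 3 and 5 hold a single neighbourhood each, which is worth keeping in mind when reading the per-cluster tables.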

In [90]:
neighborhoods_venues_sorted.insert(0,"Cluster Labels",kmeans_toronto.labels_+1)


Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [91]:
toronto_merged=df_toronto
# join the cluster labels and top venues onto the neighbourhoods with their latitude/longitude
toronto_merged=toronto_merged.join(neighborhoods_venues_sorted.set_index("Neighborhood"),on="Neighborhood")
toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2.0,Park,Fireworks Store,Food & Drink Shop,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Hockey Arena,Pizza Place,Coffee Shop,Portuguese Restaurant,Women's Store,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Coffee Shop,Pub,Bakery,Park,Breakfast Spot,Café,Theater,Gym / Fitness Center,Electronics Store,Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Clothing Store,Furniture / Home Store,Vietnamese Restaurant,Athletics & Sports,Coffee Shop,Miscellaneous Shop,Event Space,Boutique,Accessories Store,Ethiopian Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1.0,Coffee Shop,Sushi Restaurant,College Cafeteria,Beer Bar,Bank,Bar,Portuguese Restaurant,Café,Diner,Yoga Studio
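
The cluster labels appear as floats (2.0, 1.0) in this table. A likely reason is that the join leaves NaN for any neighbourhood that has no row in `neighborhoods_venues_sorted`, and a column that contains NaN cannot keep a plain integer dtype. A toy sketch of that behaviour; the neighbourhood "No Venues Here" and the variable names are made up.

```python
import pandas as pd

areas = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Victoria Village", "No Venues Here"],  # last name is made up
})
labels = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Victoria Village"],
    "Cluster Labels": [2, 1],
})

# Left join on the neighbourhood name: the unmatched row gets NaN,
# which turns the whole label column into floats.
merged = areas.join(labels.set_index("Neighborhood"), on="Neighborhood")
unlabelled = merged[merged["Cluster Labels"].isna()]

# Drop the unlabelled rows on an explicit copy, then cast back to integers.
labelled = merged.dropna(subset=["Cluster Labels"]).copy()
labelled["Cluster Labels"] = labelled["Cluster Labels"].astype(int)
print(labelled)
```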


Drop the rows that received no cluster label, which happens when no venues were returned for a neighbourhood, and convert the remaining labels from floats to integers. Taking an explicit copy avoids the pandas SettingWithCopyWarning on the assignment.

In [111]:
toronto_merged_nonan=toronto_merged.dropna(subset=["Cluster Labels"]).copy()

In [113]:
toronto_merged_nonan["Cluster Labels"]=toronto_merged_nonan["Cluster Labels"].astype(int)


### Visualising the clustered neighbourhoods

Let's visualise the resulting clusters.

In [115]:
# create map
map_clusters=folium.Map(location=[latitude,longitude],zoom_start=11)
# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array=cm.rainbow(np.linspace(0,1,kclusters))
rainbow=[colors.rgb2hex(i) for i in colors_array]
# add markers to the map
for lat,lon,poi,cluster in zip(toronto_merged_nonan["Latitude"],toronto_merged_nonan["Longitude"],
                               toronto_merged_nonan["Neighborhood"],toronto_merged_nonan["Cluster Labels"]):
    label=folium.Popup(str(poi)+" Cluster "+str(cluster),parse_html=True)
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters

### Examining our clusters

#### Cluster 1

In [117]:
toronto_merged_nonan.loc[toronto_merged_nonan["Cluster Labels"]==1,toronto_merged_nonan.columns[[1]+
                                                                                    list(range(5,toronto_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,1,Hockey Arena,Pizza Place,Coffee Shop,Portuguese Restaurant,Women's Store,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
2,Downtown Toronto,1,Coffee Shop,Pub,Bakery,Park,Breakfast Spot,Café,Theater,Gym / Fitness Center,Electronics Store,Restaurant
3,North York,1,Clothing Store,Furniture / Home Store,Vietnamese Restaurant,Athletics & Sports,Coffee Shop,Miscellaneous Shop,Event Space,Boutique,Accessories Store,Ethiopian Restaurant
4,Downtown Toronto,1,Coffee Shop,Sushi Restaurant,College Cafeteria,Beer Bar,Bank,Bar,Portuguese Restaurant,Café,Diner,Yoga Studio
7,North York,1,Gym,Restaurant,Japanese Restaurant,Coffee Shop,Beer Store,Supermarket,Discount Store,Café,Italian Restaurant,Caribbean Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
97,Downtown Toronto,1,Coffee Shop,Café,Hotel,Restaurant,Gym,Japanese Restaurant,Seafood Restaurant,Salad Place,Steakhouse,Asian Restaurant
98,Etobicoke,1,River,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Women's Store,Department Store
99,Downtown Toronto,1,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Fast Food Restaurant,Café,Pub,Hotel
100,East Toronto,1,Light Rail Station,Gym / Fitness Center,Spa,Auto Workshop,Brewery,Burrito Place,Comic Shop,Farmers Market,Fast Food Restaurant,Garden


#### Cluster 2

In [118]:
toronto_merged_nonan.loc[toronto_merged_nonan["Cluster Labels"]==2,toronto_merged_nonan.columns[[1]+ 
                                                                                    list(range(5,toronto_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,2,Park,Fireworks Store,Food & Drink Shop,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
21,York,2,Park,Pool,Women's Store,Colombian Restaurant,Comfort Food Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
35,East York,2,Park,Metro Station,Convenience Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Dessert Shop,Donut Shop
52,North York,2,Park,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore,College Rec Center
64,York,2,Park,Jewelry Store,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dessert Shop,Drugstore
66,North York,2,Park,Convenience Store,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dessert Shop,Drugstore
85,Scarborough,2,Park,Intersection,Playground,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop
91,Downtown Toronto,2,Park,Trail,Playground,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant


#### Cluster 3

In [119]:
toronto_merged_nonan.loc[toronto_merged_nonan["Cluster Labels"]==3,toronto_merged_nonan.columns[[1] 
                                                                                    +list(range(5,toronto_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Scarborough,3,Fast Food Restaurant,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,College Rec Center


#### Cluster 4

In [120]:
toronto_merged_nonan.loc[toronto_merged_nonan["Cluster Labels"]==4,toronto_merged_nonan.columns[[1]+ 
                                                                                    list(range(5,toronto_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,4,Baseball Field,Women's Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Dim Sum Restaurant
101,Etobicoke,4,Baseball Field,Women's Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Dim Sum Restaurant


#### Cluster 5

In [121]:
toronto_merged_nonan.loc[toronto_merged_nonan["Cluster Labels"]==5,toronto_merged_nonan.columns[[1]+ 
                                                                                    list(range(5,toronto_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Etobicoke,5,Bakery,Women's Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Dim Sum Restaurant


# Results and Discussion

The neighbourhoods of London are very multicultural. The clusters show a wide range of cuisines, including Indian, Italian, Turkish and Chinese, alongside many restaurants, bars, juice bars, coffee shops, fish and chips shops and breakfast spots. There are plenty of shopping options too, such as flea markets, flower shops, fish markets, fishing stores and clothing stores. The main modes of transport seem to be buses and trains. For leisure, the neighbourhoods offer parks, golf courses, zoo exhibits, gyms and historic sites. Overall, the city of London offers a multicultural, diverse and certainly entertaining experience.

Toronto is broadly similar to London. It has a wide variety of cuisines and eateries, including Japanese, Vietnamese, Ethiopian and Colombian restaurants, bakeries and several others. There are many places to relax, such as cafés, beer bars and coffee shops. People also like to visit farmers markets and the harbour. Diners feature among Toronto's most common venues, whereas none appear in the London rows listed above. Public transport in Toronto includes metro stations and light rail stations. For leisure, there are a lot of dog runs, gyms and spas, and the hockey arenas and baseball fields suggest that people also like to play hockey and baseball. Overall, much like London, Toronto is multicultural, diverse and offers a great experience to anyone who visits or lives there.

# Conclusion

The purpose of this project was to explore the cities of London and Toronto and see how attractive they are to potential tourists and to international students who are planning to study in one of them. We explored both cities based on their postal codes, extracted the most common venues in each neighbourhood and finally clustered similar neighbourhoods together.

We could see that the neighbourhoods in both cities have a wide variety of experiences to offer, each unique in its own way. The cultural diversity is quite evident, which also gives a sense of inclusion.

Both London and Toronto seem to offer a vast number of things to do while studying, with a lot of places to explore, beautiful landscapes and a wide variety of cultures. Since both cities offer much the same, the choice is ultimately up to the international students, and whichever city they pick they are unlikely to be disappointed. To decide between the two they would probably need to research the universities themselves, but as far as the city is concerned, they can't go wrong with either of these two wonderful cities.