# Segmentation of neighborhoods in Toronto


# Part 1

## 1. Objective
Scrape the table of postal codes from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and transform the data into a pandas dataframe.


### Note: **the data on this page may differ from the example shown in the image**, since the page is updated over time.

### Note 2: In this notebook, the terms "region", "postal code", and "postal code region" are used interchangeably, since several neighborhoods in Toronto share a common postal code.

## 2. Scraping data from Wikipedia

In [1]:
import urllib.request
from bs4 import BeautifulSoup 

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("All packages are loaded!")

All packages are loaded!


In this example, it would be simpler to copy the table from the Wikipedia page and process it by hand. I chose BeautifulSoup instead because it will be useful for later projects.

The scraping relies on two assumptions about the Wikipedia page:
1. The table is stored in a table tag of the HTML page.
2. Each row of data is stored in a tr tag, and its cells may contain newline characters.

In [2]:
with urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M') as f:
    scrapper = BeautifulSoup(f, 'html.parser')
    table = scrapper.find('table').tbody
    all_row = table.find_all('tr')
    all_row = all_row[1:]
    data = [[x.text.rstrip() for x in row.find_all('td')] for row in all_row]

The *data* object is a list of rows, each holding the cell values of one table row.
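
The two assumptions above (one table tag, rows in tr tags, cells that may end with a newline) can be tried offline on a tiny HTML string. The sketch below is not the notebook's method: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without a network connection or extra packages. `SAMPLE_HTML` and `TableRows` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical miniature of the Wikipedia table; the real page has many more rows.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>"
    "<tr><td>M1A</td><td>Not assigned</td><td>Not assigned\n</td></tr>"
    "<tr><td>M3A</td><td>North York</td><td>Parkwoods\n</td></tr>"
    "</table>"
)


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # strip() removes the trailing newline that some cells carry
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # the header row has only <th> cells, so it stays empty
                self.rows.append(self._row)
            self._row = None


parser = TableRows()
parser.feed(SAMPLE_HTML)
sample_rows = parser.rows
print(sample_rows)
```

Skipping rows without td cells plays the same role as `all_row[1:]` in the cell above: it drops the header row.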

## 3. Prepare the dataframes

In [3]:
## prepare for the dataframe
header = ['PostalCode', 'Borough', 'Neighborhood']

In [4]:
pandas_table = pd.DataFrame(data, columns = header)
pandas_table.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Filter out rows where the Borough is not assigned

In [5]:
pandas_table = pandas_table[pandas_table['Borough'] != 'Not assigned']
pandas_table.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


### Group neighborhoods that share the same postal code

In [6]:
grb_tbl = pandas_table.groupby(by=['PostalCode']).agg(lambda x: ','.join(set(x))).reset_index()

In [7]:
grb_tbl.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Rouge Hill,Highland Creek,Port Union"
2,M1E,Scarborough,"Guildwood,West Hill,Morningside"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
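
One detail of the grouping step: joining a `set` does not guarantee the order of the joined names, so the neighborhood string may differ between runs. The sketch below shows one possible alternative that keeps first-seen order by using `dict.fromkeys`. The `toy` dataframe and the `join_unique` helper are made up for this illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical rows in the same layout as pandas_table
toy = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M1B", "M1B"],
    "Borough": ["North York", "North York", "Scarborough", "Scarborough"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Rouge", "Malvern"],
})


def join_unique(values):
    # dict.fromkeys drops duplicates while keeping first-seen order
    return ",".join(dict.fromkeys(values))


toy_grouped = toy.groupby("PostalCode").agg(join_unique).reset_index()
print(toy_grouped)
```

The repeated borough name collapses to a single value, and the neighborhoods appear in the order they had in the source rows.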


In [8]:
pandas_table[pandas_table == 'Not assigned'].sum()

PostalCode      0.0
Borough         0.0
Neighborhood    0.0
dtype: float64

As shown above, every row has a borough and at least one valid neighborhood, so there is no need to fill in the neighborhood with the borough name.

Finally, we check the shape of the dataframe.
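
The check in cell In [8] masks the dataframe and sums what is left, which is somewhat indirect. A more direct way to count the "Not assigned" cells, together with the fallback that turned out to be unnecessary here, might look like the sketch below. The `toy` dataframe is a made-up example, not the scraped data.

```python
import pandas as pd

# Hypothetical rows: the second one has a borough but no neighborhood
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Compare every cell to the marker, then count the True values per column.
not_assigned_counts = (toy == "Not assigned").sum()
print(not_assigned_counts)

# Fallback: where the neighborhood is unassigned, reuse the borough name.
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)
print(toy)
```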

In [9]:
grb_tbl.shape

(103, 3)

The dataframe has 103 rows.

# Part 2

## 4. Add coordinates of these places to dataframe

The code below was meant to get the coordinates of each postal code in Toronto. However, the geocoder service was not reliable and I could not get the code to finish, so I decided to use the data provided in this CSV file instead: https://cocl.us/Geospatial_data

In [10]:
DO_NOT_RUN = True

if not DO_NOT_RUN:

    import geocoder # import geocoder

    latitude = []
    longitude = []

    for _, row in grb_tbl.iterrows():
        # initialize your variable to None
        lat_lng_coords = None

        # loop until you get the coordinates
        while lat_lng_coords is None:
            # look up the postal code itself, not the row index
            g = geocoder.google('{}, Toronto, Ontario'.format(row['PostalCode']))
            lat_lng_coords = g.latlng

        latitude.append(lat_lng_coords[0])
        longitude.append(lat_lng_coords[1])

    print(latitude)
    print(longitude)
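
The `while` loop above never gives up, which is one reason such a cell can hang when the service keeps returning nothing. A bounded retry is one possible alternative. In the sketch below, `geocode_with_retry` and `flaky_lookup` are hypothetical names: `flaky_lookup` is a stand-in for a real geocoding call, so the example runs without any network access.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # the caller decides how to handle a missing result


# Hypothetical stand-in for a geocoding service: fails twice, then succeeds.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.806686, -79.194353]


coords = geocode_with_retry(flaky_lookup, "M1B, Toronto, Ontario")
never_found = geocode_with_retry(lambda query: None, "nowhere", max_attempts=3)
print(coords, calls["count"], never_found)
```

With a real service, a non-zero `delay_seconds` would usually be sensible so that repeated requests do not hit rate limits.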

Download the csv file that contains the geo data.

In [11]:
!wget https://cocl.us/Geospatial_data

--2020-03-10 08:54:27--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 119.81.168.75, 119.81.168.76, 161.202.50.39
Connecting to cocl.us (cocl.us)|119.81.168.75|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-03-10 08:54:28--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 103.116.4.197
Connecting to ibm.box.com (ibm.box.com)|103.116.4.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-03-10 08:54:29--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv

### Load geodata into a pandas dataframe

In [12]:
geo_data = pd.read_csv('./Geospatial_data')

In [76]:
geo_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


I will check the shape of this dataframe.

In [14]:
geo_data.shape

(103, 3)

### Merge the geo dataframe into the original data and clean up

In [15]:
grb_tbl = grb_tbl.merge(geo_data, how='inner', left_on='PostalCode', right_on='Postal Code')

In [16]:
grb_tbl.shape

(103, 6)

In [17]:
grb_tbl.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill,Highland Creek,Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,West Hill,Morningside",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476


#### Drop the duplicated postal code column, which the merge kept because the names differ

In [18]:
grb_tbl.drop(columns=['Postal Code'], inplace=True)
grb_tbl.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill,Highland Creek,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,West Hill,Morningside",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
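
An alternative to merging and then dropping the extra column is to rename the key before the merge, so that both sides share one column name. The sketch below uses two small made-up dataframes (`left` and `right`) in the same layout as `grb_tbl` and `geo_data`. The `validate="one_to_one"` argument is an optional safeguard added here, not something the notebook uses: it makes pandas raise an error if a postal code appears more than once on either side.

```python
import pandas as pd

# Hypothetical miniature versions of grb_tbl and geo_data
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = left.merge(
    right.rename(columns={"Postal Code": "PostalCode"}),  # align the key name first
    on="PostalCode",
    how="inner",
    validate="one_to_one",  # fail loudly on duplicated postal codes
)
print(merged.columns.tolist())
```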


# Part 3

## 5. Explore dataset

How many boroughs are there in this dataframe?

In [19]:
print('The dataframe has {} boroughs.'.format(
        len(grb_tbl['Borough'].unique())
    )
)

The dataframe has 10 boroughs.


With the help of the geopy package, we look up the coordinates of Toronto to center the map.

In [20]:
address = 'Toronto, ONTARIO, Canada'

geolocator = Nominatim(user_agent="trt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


### Display the map of Toronto

In [22]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(grb_tbl['Latitude'], grb_tbl['Longitude'], \
                                           grb_tbl['Borough'], grb_tbl['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In this notebook, only the borough of Scarborough is examined.

### Filter the data to keep only Scarborough

In [23]:
scarborough_data = grb_tbl[grb_tbl['Borough'] == 'Scarborough'].reset_index(drop=True)
scarborough_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill,Highland Creek,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,West Hill,Morningside",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


### Get the coordinates of Scarborough

In [24]:
address = 'Scarborough, Toronto'

geolocator = Nominatim(user_agent="trt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Scarborough are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Scarborough are 43.773077, -79.257774.


### Displaying Scarborough on map

In [25]:
# create map of Scarborough using latitude and longitude values
map_scarborough = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(scarborough_data['Latitude'], scarborough_data['Longitude'],\
                           scarborough_data['PostalCode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scarborough)  
    
map_scarborough

### Using the Foursquare API to find the venues in this borough

This requires a Foursquare client ID and secret. Mine are hidden here; substitute your own.

In [26]:
CLIENT_ID = 'your ID' # your Foursquare ID
CLIENT_SECRET = 'your secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

Check the first value of the dataframe:

In [28]:
scarborough_data.loc[0, 'PostalCode']

'M1B'

In [30]:
neighborhood_latitude = scarborough_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = scarborough_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = scarborough_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Malvern,Rouge are 43.806686299999996, -79.19435340000001.


### Testing whether the Foursquare API works

**Which venues lie within a radius of 500 meters of the given coordinates?**

In [31]:
radius = 500
LIMIT=100
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
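
The request URL above is assembled with string formatting. As a sketch of an alternative, the query string can be built from a dictionary with `urllib.parse.urlencode`, which also escapes characters such as spaces and commas. `build_explore_url` is a hypothetical helper name, and the credentials are the same placeholders used earlier.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL from named parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


example_url = build_explore_url("your ID", "your secret", "20180605", 43.806686, -79.194353)
print(example_url)
```

The `requests` library can also take such a dictionary directly through its `params` argument, which avoids building the string by hand.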


**The Foursquare API returns its result in JSON format. We can inspect the content of this result.**

In [32]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e674aa977af03001b93fb80'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 2,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': "Wendy's",
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

## Source:

https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&S=22&O=A

**Helper function to extract the category of a given venue**

In [33]:
def get_category_type(row):
    # the key is 'categories' for raw venue records and 'venue.categories' for flattened rows
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

**Combine the venue information into a table**

In [34]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056
1,Interprovincial Group,Print Shop,43.80563,-79.200378


**Given this postal code, how many venues are returned?**

In [35]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

2 venues were returned by Foursquare.


#### Combine all the previous steps into a single function so we can apply them to other regions

In [36]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [41]:
scarborough_venues = getNearbyVenues(names=scarborough_data['Neighborhood'],
                                   latitudes=scarborough_data['Latitude'],
                                   longitudes=scarborough_data['Longitude']
                                  )


Malvern,Rouge
Rouge Hill,Highland Creek,Port Union
Guildwood,West Hill,Morningside
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park,Kennedy Park,Ionview
Clairlea,Oakridge,Golden Mile
Scarborough Village West,Cliffside,Cliffcrest
Cliffside West,Birch Cliff
Wexford Heights,Scarborough Town Centre,Dorset Park
Maryvale,Wexford
Agincourt
Tam O'Shanter,Clarks Corners,Sullivan
Agincourt North,L'Amoreaux East,Milliken,Steeles East
L'Amoreaux West
Upper Rouge


In [42]:
print(scarborough_venues.shape)
scarborough_venues.head()

(85, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern,Rouge",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Malvern,Rouge",43.806686,-79.194353,Interprovincial Group,43.80563,-79.200378,Print Shop
2,"Rouge Hill,Highland Creek,Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Guildwood,West Hill,Morningside",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood,West Hill,Morningside",43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa


#### Count number of venues per postal code region

In [43]:
scarborough_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,4,4,4,4,4,4
"Agincourt North,L'Amoreaux East,Milliken,Steeles East",2,2,2,2,2,2
Cedarbrae,9,9,9,9,9,9
"Clairlea,Oakridge,Golden Mile",9,9,9,9,9,9
"Cliffside West,Birch Cliff",4,4,4,4,4,4
"East Birchmount Park,Kennedy Park,Ionview",5,5,5,5,5,5
"Guildwood,West Hill,Morningside",7,7,7,7,7,7
L'Amoreaux West,12,12,12,12,12,12
"Malvern,Rouge",2,2,2,2,2,2
"Maryvale,Wexford",7,7,7,7,7,7


#### Count number of unique categories

In [45]:
print('There are {} unique categories.'.format(len(scarborough_venues['Venue Category'].unique())))

There are 53 unique categories.


#### Convert this categorical variable to one-hot vectors so we can use them to compare regions

In [46]:
# one hot encoding
scarborough_onehot = pd.get_dummies(scarborough_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
scarborough_onehot['Neighborhood'] = scarborough_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [scarborough_onehot.columns[-1]] + list(scarborough_onehot.columns[:-1])
scarborough_onehot = scarborough_onehot[fixed_columns]

scarborough_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,College Stadium,Convenience Store,Department Store,Discount Store,Electronics Store,Fast Food Restaurant,Fried Chicken Joint,Gaming Cafe,Gas Station,General Entertainment,Grocery Store,Hakka Restaurant,Ice Cream Shop,Indian Restaurant,Intersection,Italian Restaurant,Korean Restaurant,Latin American Restaurant,Lounge,Medical Center,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Motel,Noodle House,Park,Pet Store,Pharmacy,Pizza Place,Playground,Print Shop,Rental Car Location,Sandwich Place,Skating Rink,Smoke Shop,Soccer Field,Spa,Supermarket,Thai Restaurant,Vietnamese Restaurant
0,"Malvern,Rouge",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,"Malvern,Rouge",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,"Rouge Hill,Highland Creek,Port Union",0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,"Guildwood,West Hill,Morningside",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"Guildwood,West Hill,Morningside",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


#### Confirm the shape of dataframe

In [48]:
scarborough_onehot.shape

(85, 54)

#### Grouping this dataframe by postal code region and averaging the other columns gives a representation of each region, which we can use to compare them

In [49]:
scarborough_grouped = scarborough_onehot.groupby('Neighborhood').mean().reset_index()
scarborough_grouped

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,College Stadium,Convenience Store,Department Store,Discount Store,Electronics Store,Fast Food Restaurant,Fried Chicken Joint,Gaming Cafe,Gas Station,General Entertainment,Grocery Store,Hakka Restaurant,Ice Cream Shop,Indian Restaurant,Intersection,Italian Restaurant,Korean Restaurant,Latin American Restaurant,Lounge,Medical Center,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Motel,Noodle House,Park,Pet Store,Pharmacy,Pizza Place,Playground,Print Shop,Rental Car Location,Sandwich Place,Skating Rink,Smoke Shop,Soccer Field,Spa,Supermarket,Thai Restaurant,Vietnamese Restaurant
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Cedarbrae,0.0,0.111111,0.0,0.111111,0.111111,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0
3,"Clairlea,Oakridge,Golden Mile",0.0,0.0,0.0,0.222222,0.0,0.0,0.0,0.222222,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0
4,"Cliffside West,Birch Cliff",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
5,"East Birchmount Park,Kennedy Park,Ionview",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.2,0.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Guildwood,West Hill,Morningside",0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.142857,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0
7,L'Amoreaux West,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.083333,0.0,0.0
8,"Malvern,Rouge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Maryvale,Wexford",0.0,0.0,0.142857,0.142857,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.142857,0.0,0.0,0.0,0.0,0.142857


In [50]:
scarborough_grouped.shape

(16, 54)
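
To see why the mean of one-hot columns is a useful representation, note that each averaged value is simply the share of a region's venues that fall in that category, so every row sums to 1. The sketch below demonstrates this on a made-up table (`toy_venues`) with two regions; the names are illustrative only.

```python
import pandas as pd

# Hypothetical venues: region A has 4 venues, region B has 2
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Bakery", "Bank", "Park", "Playground"],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Averaging per region turns the 0/1 flags into category frequencies
profile = onehot.groupby("Neighborhood").mean()
print(profile)
```

Region A has two parks among four venues, so its Park frequency is 0.5, the same kind of value seen in the grouped table above.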

#### What are the top 5 venues of each region?

In [51]:
num_top_venues = 5

for hood in scarborough_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = scarborough_grouped[scarborough_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0  Latin American Restaurant  0.25
1                     Lounge  0.25
2             Breakfast Spot  0.25
3               Skating Rink  0.25
4        American Restaurant  0.00


----Agincourt North,L'Amoreaux East,Milliken,Steeles East----
                 venue  freq
0           Playground   0.5
1                 Park   0.5
2  American Restaurant   0.0
3            Pet Store   0.0
4    Korean Restaurant   0.0


----Cedarbrae----
                 venue  freq
0               Lounge  0.11
1      Thai Restaurant  0.11
2               Bakery  0.11
3                 Bank  0.11
4  Fried Chicken Joint  0.11


----Clairlea,Oakridge,Golden Mile----
            venue  freq
0          Bakery  0.22
1        Bus Line  0.22
2    Intersection  0.11
3  Ice Cream Shop  0.11
4    Soccer Field  0.11


----Cliffside West,Birch Cliff----
                   venue  freq
0  General Entertainment  0.25
1           Skating Rink  0.25
2                   Café  

#### Function to return the most common venues per region

In [52]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Find the top 10 common venues per region

In [56]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scarborough_grouped['Neighborhood']

for ind in np.arange(scarborough_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scarborough_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Skating Rink,Latin American Restaurant,Breakfast Spot,Lounge,Vietnamese Restaurant,College Stadium,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Park,Playground,Vietnamese Restaurant,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store,Discount Store
2,Cedarbrae,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Lounge,Fried Chicken Joint,Caribbean Restaurant,Hakka Restaurant,Department Store
3,"Clairlea,Oakridge,Golden Mile",Bakery,Bus Line,Ice Cream Shop,Intersection,Soccer Field,Bus Station,Metro Station,Convenience Store,General Entertainment,Gas Station
4,"Cliffside West,Birch Cliff",College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Grocery Store,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant
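
The column names above rely on a three-element suffix list with a fallback to "th", which is enough for ten columns. A small standalone helper is one way to make the suffix rule explicit and keep it correct for larger counts such as 11 or 21. `ordinal` and `column_names` are hypothetical names used only in this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


column_names = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(column_names[:4])
```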


In [62]:
neighborhoods_venues_sorted

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Agincourt,Skating Rink,Latin American Restaurant,Breakfast Spot,Lounge,Vietnamese Restaurant,College Stadium,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint
1,0,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Park,Playground,Vietnamese Restaurant,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store,Discount Store
2,1,Cedarbrae,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Lounge,Fried Chicken Joint,Caribbean Restaurant,Hakka Restaurant,Department Store
3,1,"Clairlea,Oakridge,Golden Mile",Bakery,Bus Line,Ice Cream Shop,Intersection,Soccer Field,Bus Station,Metro Station,Convenience Store,General Entertainment,Gas Station
4,1,"Cliffside West,Birch Cliff",College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Grocery Store,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant
5,1,"East Birchmount Park,Kennedy Park,Ionview",Coffee Shop,Discount Store,Department Store,Convenience Store,Chinese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe
6,1,"Guildwood,West Hill,Morningside",Spa,Intersection,Rental Car Location,Breakfast Spot,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,College Stadium,Gas Station
7,1,L'Amoreaux West,Coffee Shop,Chinese Restaurant,Fast Food Restaurant,Sandwich Place,Grocery Store,Pharmacy,Pizza Place,Breakfast Spot,Supermarket,Electronics Store
8,2,"Malvern,Rouge",Fast Food Restaurant,Print Shop,Vietnamese Restaurant,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Electronics Store
9,1,"Maryvale,Wexford",Vietnamese Restaurant,Middle Eastern Restaurant,Auto Garage,Bakery,Smoke Shop,Sandwich Place,Breakfast Spot,General Entertainment,Gas Station,Gaming Cafe


#### Clustering this borough into 5 clusters

In [57]:
# set number of clusters
kclusters = 5

scarborough_grouped_clustering = scarborough_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scarborough_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 1, 1, 1, 1, 1, 2, 1], dtype=int32)
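
The numbers in `labels_` are arbitrary cluster identifiers: rows with the same number belong to the same cluster, but the number itself carries no meaning. The sketch below illustrates this on made-up frequency profiles (`toy_profiles`) with two obvious groups. Setting `n_init` explicitly is a choice made for this example, not something the notebook does.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical venue-frequency profiles: rows 0-1 resemble each other, as do rows 2-3
toy_profiles = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.4, 0.6],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_profiles)
labels = model.labels_
print(labels)
```

Only the grouping matters when reading the result: the first two rows share one label and the last two share the other.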

In [66]:
# add clustering labels (skipped if the column was already inserted by an earlier run)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

scarborough_merged = scarborough_data

# merge scarborough_data with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
scarborough_merged = scarborough_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood', how='inner')
scarborough_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353,2,Fast Food Restaurant,Print Shop,Vietnamese Restaurant,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Electronics Store
1,M1C,Scarborough,"Rouge Hill,Highland Creek,Port Union",43.784535,-79.160497,3,Bar,Vietnamese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
2,M1E,Scarborough,"Guildwood,West Hill,Morningside",43.763573,-79.188711,1,Spa,Intersection,Rental Car Location,Breakfast Spot,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,College Stadium,Gas Station
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4,Coffee Shop,Korean Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Lounge,Fried Chicken Joint,Caribbean Restaurant,Hakka Restaurant,Department Store


In [67]:
scarborough_merged

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353,2,Fast Food Restaurant,Print Shop,Vietnamese Restaurant,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Electronics Store
1,M1C,Scarborough,"Rouge Hill,Highland Creek,Port Union",43.784535,-79.160497,3,Bar,Vietnamese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
2,M1E,Scarborough,"Guildwood,West Hill,Morningside",43.763573,-79.188711,1,Spa,Intersection,Rental Car Location,Breakfast Spot,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,College Stadium,Gas Station
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4,Coffee Shop,Korean Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Lounge,Fried Chicken Joint,Caribbean Restaurant,Hakka Restaurant,Department Store
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,0,Playground,Vietnamese Restaurant,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
6,M1K,Scarborough,"East Birchmount Park,Kennedy Park,Ionview",43.727929,-79.262029,1,Coffee Shop,Discount Store,Department Store,Convenience Store,Chinese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe
7,M1L,Scarborough,"Clairlea,Oakridge,Golden Mile",43.711112,-79.284577,1,Bakery,Bus Line,Ice Cream Shop,Intersection,Soccer Field,Bus Station,Metro Station,Convenience Store,General Entertainment,Gas Station
8,M1M,Scarborough,"Scarborough Village West,Cliffside,Cliffcrest",43.716316,-79.239476,1,American Restaurant,Motel,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
9,M1N,Scarborough,"Cliffside West,Birch Cliff",43.692657,-79.264848,1,College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Grocery Store,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant


#### Display these clusters on the map

In [68]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scarborough_merged['Latitude'], scarborough_merged['Longitude'],\
                                  scarborough_merged['Neighborhood'], scarborough_merged['Cluster Labels']):
    cluster = int(cluster)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
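
The marker loop above picks each color with `rainbow[cluster-1]`. A short sketch of how such a palette maps labels to colors, assuming `kclusters = 5` (inferred from the labels 0 to 4 in the cluster tables, not stated in this cell); `palette` and `assigned` are names introduced only for this illustration:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5  # assumed value; the cluster tables use labels 0 to 4

# One evenly spaced sample of the rainbow colormap per cluster.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# With index cluster-1, label 0 wraps around to the last color (Python's -1 index),
# so every label still gets its own color.
assigned = {cluster: palette[cluster - 1] for cluster in range(kclusters)}
print(assigned)
```

Indexing with `palette[cluster]` directly would give the same one-color-per-label property without relying on the wrap-around.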

## 5. Examine each cluster.

### Cluster 1

In [71]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 0, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Scarborough,0,Playground,Vietnamese Restaurant,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
14,Scarborough,0,Park,Playground,Vietnamese Restaurant,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store,Discount Store


### Cluster 2

In [72]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 1, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Scarborough,1,Spa,Intersection,Rental Car Location,Breakfast Spot,Electronics Store,Medical Center,Mexican Restaurant,Vietnamese Restaurant,College Stadium,Gas Station
4,Scarborough,1,Thai Restaurant,Athletics & Sports,Bakery,Bank,Gas Station,Lounge,Fried Chicken Joint,Caribbean Restaurant,Hakka Restaurant,Department Store
6,Scarborough,1,Coffee Shop,Discount Store,Department Store,Convenience Store,Chinese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe
7,Scarborough,1,Bakery,Bus Line,Ice Cream Shop,Intersection,Soccer Field,Bus Station,Metro Station,Convenience Store,General Entertainment,Gas Station
8,Scarborough,1,American Restaurant,Motel,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
9,Scarborough,1,College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Grocery Store,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant
10,Scarborough,1,Indian Restaurant,Pet Store,Chinese Restaurant,Gaming Cafe,Vietnamese Restaurant,Smoke Shop,Skating Rink,Gas Station,Spa,Fried Chicken Joint
11,Scarborough,1,Vietnamese Restaurant,Middle Eastern Restaurant,Auto Garage,Bakery,Smoke Shop,Sandwich Place,Breakfast Spot,General Entertainment,Gas Station,Gaming Cafe
12,Scarborough,1,Skating Rink,Latin American Restaurant,Breakfast Spot,Lounge,Vietnamese Restaurant,College Stadium,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint
13,Scarborough,1,Pizza Place,Gas Station,Thai Restaurant,Intersection,Bank,Italian Restaurant,Fried Chicken Joint,Fast Food Restaurant,Chinese Restaurant,Noodle House


### Cluster 3

In [77]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 2, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,2,Fast Food Restaurant,Print Shop,Vietnamese Restaurant,Coffee Shop,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Electronics Store


### Cluster 4

In [74]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 3, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,3,Bar,Vietnamese Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store


### Cluster 5 

In [75]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 4, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Scarborough,4,Coffee Shop,Korean Restaurant,College Stadium,Grocery Store,General Entertainment,Gas Station,Gaming Cafe,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
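
The five cells above apply the same filter with a different label each time. One possible way to run the same inspection in a loop is sketched below, using a small hypothetical frame `demo` (a few rows copied from the tables above) in place of `scarborough_merged`:

```python
import pandas as pd

# Hypothetical miniature of scarborough_merged (values copied from the tables above).
demo = pd.DataFrame({
    "Neighborhood": ["Malvern,Rouge", "Woburn", "Cedarbrae", "Scarborough Village"],
    "Cluster Labels": [2, 4, 1, 0],
    "1st Most Common Venue": ["Fast Food Restaurant", "Coffee Shop",
                              "Thai Restaurant", "Playground"],
})

# groupby yields one sub-frame per label, in ascending label order.
for label, members in demo.groupby("Cluster Labels"):
    # Headings in this notebook number clusters from 1, labels start at 0.
    print(f"Cluster {label + 1}: {len(members)} neighborhood(s)")
    print(members["1st Most Common Venue"].value_counts().to_string())

# Number of neighborhoods per label, sorted by label.
sizes = demo["Cluster Labels"].value_counts().sort_index()
```

Printing the cluster sizes alongside the venues makes it easier to see when one cluster holds most neighborhoods and the others contain a single one.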


## 6. Conclusion

Asian restaurants, such as Vietnamese and Korean restaurants, appear frequently among the most common venues in this borough of Toronto. The difference between clusters 4 and 5, for example, is that the first has a bar and a Vietnamese restaurant as its top two venues, while the second has a coffee shop and a Korean restaurant. Their other venues are the same.