# Segmenting, Clustering And Exploring The Neighborhoods In The City Of Toronto-Ontario, Canada

## `Introduction`

**This project is part of [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science)**

* **`About the project?`**
>1. we will explore, segment, and cluster the neighborhoods in the city of Toronto based on the [postalcode](https://en.wikipedia.org/wiki/Postal_code#:~:text=A%20postal%20code%20(also%20known,the%20purpose%20of%20sorting%20mail.) and [borough](https://www.google.com/search?q=borough+meaning&rlz=1C1CHBF_enIN917IN917&oq=borough+meaning&aqs=chrome.0.0i433j0l2j46i175i199j0j46j0j69i60.3236j0j4&sourceid=chrome&ie=UTF-8) information. 
>2. For that, we will scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format.
>3. Once the data is in a structured format, explore and cluster the neighborhoods in the city of Toronto.

* **`What is Toronto ?`** 
> Toronto is the capital of the [Canadian province](https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada) of [Ontario](https://en.wikipedia.org/wiki/Ontario) and the most populous city in Canada, the fourth most populous city in North America. [India is the number one source country for immigrants coming from overseas to Canada](https://www.cicnews.com/2020/10/how-to-immigrate-to-canada-from-india-1015949.html#gs.3qjx68) and [The highest concentrations of Indian Canadians are found in the provinces of Ontario](https://en.wikipedia.org/wiki/Indo-Canadians); [followed by the China and the Philippines in 2019](https://www.immigration.ca/where-will-canadas-401000-immigrants-come-from-in-2021). **In the five years that ended in 2019, immigration from India, skyrocketed, growing by almost 117.6%** from 39,340 in 2015 to 85,590.

Explore more about immigration from different countries to Canada [here](https://jovian.ai/omprakashp014909/immgration-to-canada-from-1980-to-2013).

## `Table Of Content`

> 1. Web scrapping the wikipedia page using requests and Beatiful Soup.
> 2. Using Geocoder Python package to find the latitude and the longitudes of a location.
> 3. Using Foursqaure's REST API for finding the location data of nearby venues of a given location. 
> 4. Data wrangling and segmentation using pandas and Numpy.
> 5. Data Clustering using Machine Learning algorithm - K Means.
> 6. Data Visualization using Folium and matplotlib.

**So let kick-start by installing and importing the helper libraries**

In [1]:
%%capture
!pip install pandas      
!pip install numpy
!pip install requests
!pip install html5lib
!pip install bs4
!pip install geopy
!pip install folium
!pip install matplotlib 

In [2]:
import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files
import requests # library to handle requests
from bs4 import BeautifulSoup # for web scrapping
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geopy.geocoders as geocoder
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')
print('Libraries imported.')

Libraries imported.


# Part1. Web scrapping  the data

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data = requests.get(url).text 

soup = BeautifulSoup(html_data, 'html5lib')

# creating Toronto Dataframe
df= pd.DataFrame(columns = ['PostalCode','Borough','Neighborhood'])


# scrapping all rows of the table
table_rows = soup.find('table').tbody.find_all('tr')

# filtering rows and inserting data to df_toronto
for rows in table_rows :
    for column in rows.find_all('td') :
        if column.span.text != 'Not assigned' :
            span  = column.span.text.split('(')
            df = df.append({'PostalCode' : column.b.text,
                              'Borough' : span[0],
                              'Neighborhood' : span[1][:-1]}, ignore_index=True)
            
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

df = df.sort_values('PostalCode').reset_index(drop = True)
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West


**Postal data has been scrapped successfully from wikipedia**

**Let's check the size of the dataframe**

In [4]:
df.shape

(103, 3)

**df_toronto has 103 rows and 3 columns**

# Part 2: Getting the Latitude and Longitude of each Postal_code

**Load the latitude and the longitude of the postalcodes from IBM**

In [5]:
!wget -O GeoSpataial_Data https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv

--2021-06-13 19:28:57--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2788 (2.7K) [text/csv]
Saving to: ‘GeoSpataial_Data’


2021-06-13 19:28:58 (456 MB/s) - ‘GeoSpataial_Data’ saved [2788/2788]



In [6]:
geospatial_data = pd.read_csv('GeoSpataial_Data')
geospatial_data.columns = ['PostalCode','Latitude', 'Longitude']
geospatial_data.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
geospatial_data.shape

(103, 3)

**Let's merge df_toronto and geospatial_data**

In [8]:
df = df.join(geospatial_data.set_index('PostalCode'), on = 'PostalCode')

In [9]:
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park,43.727929,-79.262029
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge,43.711112,-79.284577
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West,43.716316,-79.239476
9,M1N,Scarborough,Birch Cliff / Cliffside West,43.692657,-79.264848


# Part 3: Exploring and Clustering the neighborhoods in Toronto

In [10]:
df.Borough.value_counts()

North York                24
Scarborough               17
Downtown Toronto          17
Etobicoke                 11
Central Toronto            9
West Toronto               6
York                       5
East York                  4
East Toronto               4
East Toronto Business      1
East York/East Toronto     1
Mississauga                1
Etobicoke Northwest        1
Downtown Toronto Stn A     1
Queen's Park               1
Name: Borough, dtype: int64

**We can see North York has highest numbmer of PostalCodes, followed by Scarborough and Downtown Toronto**

**Use geopy library to get the latitude and longitude values of Toronto city**

In [11]:
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent = 'ny_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto City are 43.6534817, -79.3839347.


**Ploting all neighborhoods on Map**

In [12]:
# segmenting boroughs in 5 categories
york = ['North York', 'York ','East York' ]
toronto = ['Downtown Toronto', 'Central Toronto', 'West Toronto', 'East Toronto',
           'Downtown Toronto Stn A' ,'East Toronto Business','East York/East Toronto']
scarborough = ['Scarborough']
etobicoke = ['Etobicoke','Etobicoke Northwest']
others = ["Queen's Park", 'Mississauga']

borough_array = [york, toronto, scarborough, etobicoke, others]

# now let's make changes in the dataframe accordingly
df1 = df.copy()
for boroughs in borough_array :
    for borough in boroughs :
        df1.replace(borough, str(boroughs), inplace = True)
    
colors_array = ['red', 'blue', 'green', 'purple', 'orange']


# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for borough, color in zip(borough_array, colors_array) :
    df2 = df1[df1.Borough == str(borough)]
    for lat, lng, borough, neighborhood in zip(df2['Latitude'], df2['Longitude'], df2['Borough'], df2['Neighborhood']):
        label = '{}, {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7,
            parse_html=False).add_to(map_toronto)  
    
map_toronto

**Feel free to zoom in and out the map and do click on the points to know the Borough**

### Let's Plot North York and explore it's first  neighborhood in our dataframe

In [13]:
df_north_york = df[df.Borough == 'North York'].reset_index(drop=True)

address = 'North York, Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of North York are {}, {}.'.format(latitude, longitude))
# create map of Manhattan using latitude and longitude values
map_north_york = folium.Map(location=[latitude, longitude], zoom_start=11)

# plot North York
folium.CircleMarker(
    [latitude, longitude],
    radius=4,
    popup='North York',
    color='Red',
    fill=True,
    fill_color='red',
    fill_opacity=0.7,
    parse_html=False).add_to(map_north_york)

# add markers to map
for lat, lng, label in zip(df_north_york['Latitude'], df_north_york['Longitude'], df_north_york['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_north_york)  
    
map_north_york

The geograpical coordinate of North York are 43.7543263, -79.44911696639593.


**Walk through the first neighborhood of North York in our dataset**

In [14]:
df_north_york.loc[0]

PostalCode                    M2H
Borough                North York
Neighborhood    Hillcrest Village
Latitude                  43.8038
Longitude                -79.3635
Name: 0, dtype: object

**Let's explore the nearby venues of Hillcrest Village using Foursquare API**

In [15]:
neighborhood_name = df_north_york.Neighborhood[0]
neighborhood_latitude = df_north_york.Latitude[0]
neighborhood_longitude = df_north_york.Longitude[0]

CLIENT_ID = '0RVTYD1VK2FA522PAZNV0O0SLWVRLZBVCHXXG3VCPXKTPM3Y' # your Foursquare ID
CLIENT_SECRET = 'LDRZ1G3S2XW5GIO45L0L0ONSCZ0WUVKPOKGHKXJH544EKABH' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60c65c7b08dd5409db1296fe'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 22,
  'suggestedBounds': {'ne': {'lat': 43.81276220900001,
    'lng': -79.35100467075661},
   'sw': {'lat': 43.79476219099999, 'lng': -79.37589872924339}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4d8771f6651041bd5e499b30',
       'name': 'New York Fries',
       'location': {'address': '1800 Sheppard Avenue East',
        'crossStreet': 'in Fairview Mall',
        'lat': 43.80366383775661,
        'lng': -79.36390544874158,
        'labeledLatLngs': [{'label': 'd

**By going through the json file, we can see the venues are present in `results['response']['groups'][0]['items']`. Let's create  a pandas dataframe for vennues**


In [16]:
venues =  results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues)

# filteres columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
with pd.option_context('display.max_colwidth', -1): # display non truncated dataframe
    display(nearby_venues)

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,New York Fries,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'Fast Food Restaurant', 'pluralName': 'Fast Food Restaurants', 'shortName': 'Fast Food', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/fastfood_', 'suffix': '.png'}, 'primary': True}]",43.803664,-79.363905
1,Tastee,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'Bakery', 'pluralName': 'Bakeries', 'shortName': 'Bakery', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_', 'suffix': '.png'}, 'primary': True}]",43.807722,-79.356798
2,TD Canada Trust,"[{'id': '4bf58dd8d48988d10a951735', 'name': 'Bank', 'pluralName': 'Banks', 'shortName': 'Bank', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/shops/financial_', 'suffix': '.png'}, 'primary': True}]",43.798466,-79.368832
3,Subway,"[{'id': '4bf58dd8d48988d1c5941735', 'name': 'Sandwich Place', 'pluralName': 'Sandwich Places', 'shortName': 'Sandwiches', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/deli_', 'suffix': '.png'}, 'primary': True}]",43.799059,-79.368946
4,Galati,"[{'id': '4bf58dd8d48988d118951735', 'name': 'Grocery Store', 'pluralName': 'Grocery Stores', 'shortName': 'Grocery Store', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/shops/food_grocery_', 'suffix': '.png'}, 'primary': True}]",43.797831,-79.36941
5,Pizza Pizza,"[{'id': '4bf58dd8d48988d1ca941735', 'name': 'Pizza Place', 'pluralName': 'Pizza Places', 'shortName': 'Pizza', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/pizza_', 'suffix': '.png'}, 'primary': True}]",43.799079,-79.369449
6,Tim Hortons,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'Coffee Shop', 'pluralName': 'Coffee Shops', 'shortName': 'Coffee Shop', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_', 'suffix': '.png'}, 'primary': True}]",43.798945,-79.369644
7,고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap,"[{'id': '4bf58dd8d48988d113941735', 'name': 'Korean Restaurant', 'pluralName': 'Korean Restaurants', 'shortName': 'Korean', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/korean_', 'suffix': '.png'}, 'primary': True}]",43.798391,-79.369187
8,Shoppers Drug Mart,"[{'id': '4bf58dd8d48988d10f951735', 'name': 'Pharmacy', 'pluralName': 'Pharmacies', 'shortName': 'Pharmacy', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/shops/pharmacy_', 'suffix': '.png'}, 'primary': True}]",43.798341,-79.369804
9,Hillcrest Tennis Club,"[{'id': '4e39a956bd410d7aed40cbc3', 'name': 'Tennis Court', 'pluralName': 'Tennis Courts', 'shortName': 'Tennis Court', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/stadium_tennis_', 'suffix': '.png'}, 'primary': True}]",43.798561,-79.363506


**By observing the venue.categories, we can see that th ename of the category i savailable at `nearby_venues['venue.categories'][0]['name']`, let's pull it**

In [17]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [18]:
# filter category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [column.split('.')[-1] for column in nearby_venues.columns ]
nearby_venues

Unnamed: 0,name,categories,lat,lng
0,New York Fries,Fast Food Restaurant,43.803664,-79.363905
1,Tastee,Bakery,43.807722,-79.356798
2,TD Canada Trust,Bank,43.798466,-79.368832
3,Subway,Sandwich Place,43.799059,-79.368946
4,Galati,Grocery Store,43.797831,-79.36941
5,Pizza Pizza,Pizza Place,43.799079,-79.369449
6,Tim Hortons,Coffee Shop,43.798945,-79.369644
7,고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap,Korean Restaurant,43.798391,-79.369187
8,Shoppers Drug Mart,Pharmacy,43.798341,-79.369804
9,Hillcrest Tennis Club,Tennis Court,43.798561,-79.363506


**22 Venues were returned by the fourspare for Hillcrest Village in Borough North York**

**`Let's create a function to repeat the same process to all the neighborhoods in North York`**

In [19]:
def get_near_by_venues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'\
        .format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(name, lat, lng, 
                             v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'],
                             v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue in venues_list for item in venue])
    nearby_venues.columns = ['Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude', 
                             'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    
    return nearby_venues

**Now write the code to run the above function on each neighborhood and create a new dataframe called north_york_venues.**

In [20]:
north_york_venues = get_near_by_venues(names = df_north_york['Neighborhood'],
                                   latitudes = df_north_york['Latitude'],
                                   longitudes = df_north_york['Longitude'])


print(north_york_venues.shape)
north_york_venues.head()

(602, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,New York Fries,43.803664,-79.363905,Fast Food Restaurant
1,Hillcrest Village,43.803762,-79.363452,Tastee,43.807722,-79.356798,Bakery
2,Hillcrest Village,43.803762,-79.363452,TD Canada Trust,43.798466,-79.368832,Bank
3,Hillcrest Village,43.803762,-79.363452,Subway,43.799059,-79.368946,Sandwich Place
4,Hillcrest Village,43.803762,-79.363452,Galati,43.797831,-79.36941,Grocery Store


**Let's check how many venues were returned for each neighborhood**

In [21]:
north_york_venues.groupby('Neighborhood')['Venue Category'].count().sort_values(ascending = False)

Neighborhood
Willowdale)Sout                                      100
Fairview / Henry Farm / Oriole                        45
Lawrence Manor / Lawrence Heights                     41
Bedford Park / Lawrence Manor East                    39
Don Mills)Sout                                        36
Downsview)Northwes                                    35
Glencairn                                             32
Parkwoods                                             29
Don Mills)Nort                                        27
Willowdale / Newtonbrook                              25
Bathurst Manor / Wilson Heights / Downsview North     24
Northwood Park / York University                      24
Downsview)East                                        22
Hillcrest Village                                     22
York Mills West                                       18
Bayview Village                                       17
Victoria Village                                      12
Willowdale)Wes    

**Let's find out how many unique categories can be curated from all the returned venues**

In [22]:
print('There are {} uniques categories.'.format(len(north_york_venues['Venue Category'].unique())))

There are 143 uniques categories.


**`Analyze Each Neighborhood`**

In [23]:
# one hot encoding
north_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']])

# add neighborhood column to the back of the dataframe
north_york_onehot['Neighborhood'] = north_york_venues['Neighborhood']

fixed_columns = [north_york_onehot.columns[-1]] + list(north_york_onehot.columns[:-1].values)

north_york_onehot = north_york_onehot[fixed_columns]
north_york_onehot.head()

Unnamed: 0,Neighborhood,Venue Category_ATM,Venue Category_Accessories Store,Venue Category_Airport,Venue Category_American Restaurant,Venue Category_Art Gallery,Venue Category_Arts & Crafts Store,Venue Category_Asian Restaurant,Venue Category_Athletics & Sports,Venue Category_Automotive Shop,Venue Category_Baby Store,Venue Category_Bagel Shop,Venue Category_Bakery,Venue Category_Bank,Venue Category_Bar,Venue Category_Baseball Field,Venue Category_Beer Store,Venue Category_Boutique,Venue Category_Bowling Alley,Venue Category_Breakfast Spot,Venue Category_Bridal Shop,Venue Category_Bubble Tea Shop,Venue Category_Burger Joint,Venue Category_Burrito Place,Venue Category_Bus Line,Venue Category_Bus Station,Venue Category_Bus Stop,Venue Category_Business Service,Venue Category_Butcher,Venue Category_Cafeteria,Venue Category_Café,Venue Category_Caribbean Restaurant,Venue Category_Chinese Restaurant,Venue Category_Clothing Store,Venue Category_Coffee Shop,Venue Category_Comfort Food Restaurant,Venue Category_Community Center,Venue Category_Convenience Store,Venue Category_Cosmetics Shop,Venue Category_Creperie,Venue Category_Cupcake Shop,Venue Category_Department Store,Venue Category_Dessert Shop,Venue Category_Dim Sum Restaurant,Venue Category_Diner,Venue Category_Discount Store,Venue Category_Dog Run,Venue Category_Dumpling Restaurant,Venue Category_Eastern European Restaurant,Venue Category_Electronics Store,Venue Category_Falafel Restaurant,Venue Category_Fast Food Restaurant,Venue Category_Fireworks Store,Venue Category_Fish & Chips Shop,Venue Category_Fish Market,Venue Category_Flower Shop,Venue Category_Food & Drink Shop,Venue Category_Food Court,Venue Category_Food Truck,Venue Category_French Restaurant,Venue Category_Fried Chicken Joint,Venue Category_Frozen Yogurt Shop,Venue Category_Furniture / Home Store,Venue Category_Gas Station,Venue Category_Golf Course,Venue Category_Greek Restaurant,Venue Category_Grocery Store,Venue Category_Gym,Venue Category_Gym / Fitness Center,Venue Category_History Museum,Venue Category_Hobby Shop,Venue Category_Hockey Arena,Venue Category_Hookah Bar,Venue Category_Hot Dog Joint,Venue Category_Hotel,Venue Category_Ice Cream Shop,Venue Category_Indian Restaurant,Venue Category_Intersection,Venue Category_Italian Restaurant,Venue Category_Japanese Restaurant,Venue Category_Kitchen Supply Store,Venue Category_Korean Restaurant,Venue Category_Latin American Restaurant,Venue Category_Laundry Service,Venue Category_Liquor Store,Venue Category_Lounge,Venue Category_Massage Studio,Venue Category_Men's Store,Venue Category_Metro Station,Venue Category_Mexican Restaurant,Venue Category_Middle Eastern Restaurant,Venue Category_Miscellaneous Shop,Venue Category_Mobile Phone Shop,Venue Category_Movie Theater,Venue Category_Moving Target,Venue Category_New American Restaurant,Venue Category_Noodle House,Venue Category_Optical Shop,Venue Category_Other Repair Shop,Venue Category_Paper / Office Supplies Store,Venue Category_Park,Venue Category_Pet Store,Venue Category_Pharmacy,Venue Category_Pizza Place,Venue Category_Playground,Venue Category_Plaza,Venue Category_Pool,Venue Category_Portuguese Restaurant,Venue Category_Pub,Venue Category_Ramen Restaurant,Venue Category_Recreation Center,Venue Category_Residential Building (Apartment / Condo),Venue Category_Restaurant,Venue Category_Salad Place,Venue Category_Salon / Barbershop,Venue Category_Sandwich Place,Venue Category_Seafood Restaurant,Venue Category_Shop & Service,Venue Category_Shopping Mall,Venue Category_Skating Rink,Venue Category_Ski Area,Venue Category_Ski Chalet,Venue Category_Snack Place,Venue Category_Soccer Field,Venue Category_Spa,Venue Category_Sporting Goods Shop,Venue Category_Sports Bar,Venue Category_Sports Club,Venue Category_Storage Facility,Venue Category_Supermarket,Venue Category_Sushi Restaurant,Venue Category_Tennis Court,Venue Category_Thai Restaurant,Venue Category_Theater,Venue Category_Toy / Game Store,Venue Category_Trail,Venue Category_Train Station,Venue Category_Turkish Restaurant,Venue Category_Video Game Store,Venue Category_Video Store,Venue Category_Vietnamese Restaurant,Venue Category_Wings Joint,Venue Category_Women's Store,Venue Category_Yoga Studio
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Hillcrest Village,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Hillcrest Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**And let's examine the new dataframe size.**

In [24]:
north_york_onehot.shape

(602, 144)

**Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category**

In [25]:
north_york_grouped = north_york_onehot.groupby('Neighborhood').mean().reset_index()
north_york_grouped

Unnamed: 0,Neighborhood,Venue Category_ATM,Venue Category_Accessories Store,Venue Category_Airport,Venue Category_American Restaurant,Venue Category_Art Gallery,Venue Category_Arts & Crafts Store,Venue Category_Asian Restaurant,Venue Category_Athletics & Sports,Venue Category_Automotive Shop,Venue Category_Baby Store,Venue Category_Bagel Shop,Venue Category_Bakery,Venue Category_Bank,Venue Category_Bar,Venue Category_Baseball Field,Venue Category_Beer Store,Venue Category_Boutique,Venue Category_Bowling Alley,Venue Category_Breakfast Spot,Venue Category_Bridal Shop,Venue Category_Bubble Tea Shop,Venue Category_Burger Joint,Venue Category_Burrito Place,Venue Category_Bus Line,Venue Category_Bus Station,Venue Category_Bus Stop,Venue Category_Business Service,Venue Category_Butcher,Venue Category_Cafeteria,Venue Category_Café,Venue Category_Caribbean Restaurant,Venue Category_Chinese Restaurant,Venue Category_Clothing Store,Venue Category_Coffee Shop,Venue Category_Comfort Food Restaurant,Venue Category_Community Center,Venue Category_Convenience Store,Venue Category_Cosmetics Shop,Venue Category_Creperie,Venue Category_Cupcake Shop,Venue Category_Department Store,Venue Category_Dessert Shop,Venue Category_Dim Sum Restaurant,Venue Category_Diner,Venue Category_Discount Store,Venue Category_Dog Run,Venue Category_Dumpling Restaurant,Venue Category_Eastern European Restaurant,Venue Category_Electronics Store,Venue Category_Falafel Restaurant,Venue Category_Fast Food Restaurant,Venue Category_Fireworks Store,Venue Category_Fish & Chips Shop,Venue Category_Fish Market,Venue Category_Flower Shop,Venue Category_Food & Drink Shop,Venue Category_Food Court,Venue Category_Food Truck,Venue Category_French Restaurant,Venue Category_Fried Chicken Joint,Venue Category_Frozen Yogurt Shop,Venue Category_Furniture / Home Store,Venue Category_Gas Station,Venue Category_Golf Course,Venue Category_Greek Restaurant,Venue Category_Grocery Store,Venue Category_Gym,Venue Category_Gym / Fitness Center,Venue Category_History Museum,Venue Category_Hobby Shop,Venue Category_Hockey Arena,Venue Category_Hookah Bar,Venue Category_Hot Dog Joint,Venue Category_Hotel,Venue Category_Ice Cream Shop,Venue Category_Indian Restaurant,Venue Category_Intersection,Venue Category_Italian Restaurant,Venue Category_Japanese Restaurant,Venue Category_Kitchen Supply Store,Venue Category_Korean Restaurant,Venue Category_Latin American Restaurant,Venue Category_Laundry Service,Venue Category_Liquor Store,Venue Category_Lounge,Venue Category_Massage Studio,Venue Category_Men's Store,Venue Category_Metro Station,Venue Category_Mexican Restaurant,Venue Category_Middle Eastern Restaurant,Venue Category_Miscellaneous Shop,Venue Category_Mobile Phone Shop,Venue Category_Movie Theater,Venue Category_Moving Target,Venue Category_New American Restaurant,Venue Category_Noodle House,Venue Category_Optical Shop,Venue Category_Other Repair Shop,Venue Category_Paper / Office Supplies Store,Venue Category_Park,Venue Category_Pet Store,Venue Category_Pharmacy,Venue Category_Pizza Place,Venue Category_Playground,Venue Category_Plaza,Venue Category_Pool,Venue Category_Portuguese Restaurant,Venue Category_Pub,Venue Category_Ramen Restaurant,Venue Category_Recreation Center,Venue Category_Residential Building (Apartment / Condo),Venue Category_Restaurant,Venue Category_Salad Place,Venue Category_Salon / Barbershop,Venue Category_Sandwich Place,Venue Category_Seafood Restaurant,Venue Category_Shop & Service,Venue Category_Shopping Mall,Venue Category_Skating Rink,Venue Category_Ski Area,Venue Category_Ski Chalet,Venue Category_Snack Place,Venue Category_Soccer Field,Venue Category_Spa,Venue Category_Sporting Goods Shop,Venue Category_Sports Bar,Venue Category_Sports Club,Venue Category_Storage Facility,Venue Category_Supermarket,Venue Category_Sushi Restaurant,Venue Category_Tennis Court,Venue Category_Thai Restaurant,Venue Category_Theater,Venue Category_Toy / Game Store,Venue Category_Trail,Venue Category_Train Station,Venue Category_Turkish Restaurant,Venue Category_Video Game Store,Venue Category_Video Store,Venue Category_Vietnamese Restaurant,Venue Category_Wings Joint,Venue Category_Women's Store,Venue Category_Yoga Studio
0,Bathurst Manor / Wilson Heights / Downsview North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.083333,0.0,0.041667,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.041667,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.041667,0.0,0.041667,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117647,0.0,0.0,0.117647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117647,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.058824,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bedford Park / Lawrence Manor East,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.025641,0.051282,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,0.076923,0.025641,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.025641,0.025641,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.025641,0.025641,0.051282,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.025641,0.025641,0.025641,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.051282,0.0,0.0,0.051282,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.025641,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,0.0,0.025641,0.0,0.025641,0.0,0.0
3,Don Mills)Nort,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.037037,0.0,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.074074,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.148148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills)Sout,0.0,0.0,0.0,0.027778,0.027778,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.055556,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.027778,0.027778,0.027778,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.055556,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0
5,Downsview)Centra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0
6,Downsview)East,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.136364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.045455,0.0,0.0,0.0
7,Downsview)Northwes,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.028571,0.0,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.028571,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.057143,0.0,0.0,0.085714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.114286,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057143,0.085714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.057143,0.028571,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0
8,Downsview)Wes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Fairview / Henry Farm / Oriole,0.0,0.0,0.0,0.022222,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.044444,0.044444,0.022222,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.111111,0.111111,0.0,0.0,0.0,0.022222,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.088889,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.044444,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.022222,0.044444,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0


**Let's confirm the new size**

In [26]:
north_york_grouped.shape

(24, 144)

**Let's print each neighborhood along with the top 5 most common venues**

In [32]:
num_top_venues = 5
for hood in north_york_grouped.Neighborhood :
    print(f'----{hood}----')
    temp = north_york_grouped[north_york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp.freq = temp.freq.astype(float)
    temp.freq = temp.round({'freq' : 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
    

----Bathurst Manor / Wilson Heights / Downsview North----
                                  venue                                  freq
0            Venue Category_Yoga Studio            Venue Category_Yoga Studio
1          Venue Category_Women's Store          Venue Category_Women's Store
2            Venue Category_Wings Joint            Venue Category_Wings Joint
3  Venue Category_Vietnamese Restaurant  Venue Category_Vietnamese Restaurant
4            Venue Category_Video Store            Venue Category_Video Store


----Bayview Village----
                                  venue                                  freq
0            Venue Category_Yoga Studio            Venue Category_Yoga Studio
1          Venue Category_Women's Store          Venue Category_Women's Store
2            Venue Category_Wings Joint            Venue Category_Wings Joint
3  Venue Category_Vietnamese Restaurant  Venue Category_Vietnamese Restaurant
4            Venue Category_Video Store            Venue C

**First, let's write a function to sort the venues in descending order.**

In [33]:
def return_most_common_venues(row, num_top_venues):
    row = row.iloc[1:]
    row_sorted = row.sort_values(ascending=False)
    
    return row_sorted.index.values[0:num_top_venues]

**Now let's create the new dataframe and display the top 10 venues for each neighborhood.**

In [49]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = north_york_grouped['Neighborhood']

for ind in np.arange(north_york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(north_york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bathurst Manor / Wilson Heights / Downsview North,Venue Category_Park,Venue Category_Bank,Venue Category_Coffee Shop,Venue Category_Pharmacy,Venue Category_Pizza Place,Venue Category_Convenience Store,Venue Category_Mobile Phone Shop,Venue Category_Diner,Venue Category_Men's Store,Venue Category_Sandwich Place
1,Bayview Village,Venue Category_Grocery Store,Venue Category_Gas Station,Venue Category_Bank,Venue Category_Intersection,Venue Category_Japanese Restaurant,Venue Category_Playground,Venue Category_Chinese Restaurant,Venue Category_Café,Venue Category_Restaurant,Venue Category_Shopping Mall
2,Bedford Park / Lawrence Manor East,Venue Category_Coffee Shop,Venue Category_Sandwich Place,Venue Category_Italian Restaurant,Venue Category_Restaurant,Venue Category_Bank,Venue Category_Pharmacy,Venue Category_Pub,Venue Category_Skating Rink,Venue Category_Bridal Shop,Venue Category_Intersection
3,Don Mills)Nort,Venue Category_Pizza Place,Venue Category_Coffee Shop,Venue Category_Burger Joint,Venue Category_Japanese Restaurant,Venue Category_Gym,Venue Category_Bank,Venue Category_Diner,Venue Category_Caribbean Restaurant,Venue Category_Café,Venue Category_Cafeteria
4,Don Mills)Sout,Venue Category_Restaurant,Venue Category_Gym,Venue Category_Coffee Shop,Venue Category_Supermarket,Venue Category_Clothing Store,Venue Category_Sandwich Place,Venue Category_Indian Restaurant,Venue Category_Intersection,Venue Category_Italian Restaurant,Venue Category_Burger Joint


**Run k-means to cluster the neighborhood into 5 clusters**

In [50]:
# consider number of clusters as 5.
k = 5

X = north_york_grouped.drop('Neighborhood', axis = 1)

# run k-means clustering
kmeans = KMeans(n_clusters = k, random_state=0)
kmeans.fit(X)

KMeans(n_clusters=5, random_state=0)

**Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.**

In [51]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

north_york_merged = df_north_york

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
north_york_merged = df_north_york.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

north_york_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,3,Venue Category_Park,Venue Category_Coffee Shop,Venue Category_Pharmacy,Venue Category_Grocery Store,Venue Category_Bank,Venue Category_Shopping Mall,Venue Category_Convenience Store,Venue Category_Fast Food Restaurant,Venue Category_Restaurant,Venue Category_Korean Restaurant
1,M2J,North York,Fairview / Henry Farm / Oriole,43.778517,-79.346556,3,Venue Category_Coffee Shop,Venue Category_Clothing Store,Venue Category_Fast Food Restaurant,Venue Category_Japanese Restaurant,Venue Category_Bank,Venue Category_Bakery,Venue Category_Sandwich Place,Venue Category_Restaurant,Venue Category_Shopping Mall,Venue Category_Burrito Place
2,M2K,North York,Bayview Village,43.786947,-79.385975,3,Venue Category_Grocery Store,Venue Category_Gas Station,Venue Category_Bank,Venue Category_Intersection,Venue Category_Japanese Restaurant,Venue Category_Playground,Venue Category_Chinese Restaurant,Venue Category_Café,Venue Category_Restaurant,Venue Category_Shopping Mall
3,M2L,North York,York Mills / Silver Hills,43.75749,-79.374714,2,Venue Category_Park,Venue Category_Pool,Venue Category_Diner,Venue Category_Falafel Restaurant,Venue Category_Electronics Store,Venue Category_Eastern European Restaurant,Venue Category_Dumpling Restaurant,Venue Category_Dog Run,Venue Category_Discount Store,Venue Category_Dim Sum Restaurant
4,M2M,North York,Willowdale / Newtonbrook,43.789053,-79.408493,3,Venue Category_Korean Restaurant,Venue Category_Café,Venue Category_Diner,Venue Category_Park,Venue Category_Hookah Bar,Venue Category_Bus Station,Venue Category_Hot Dog Joint,Venue Category_Fried Chicken Joint,Venue Category_Indian Restaurant,Venue Category_Japanese Restaurant


**Finally, let's visualize the resulting clusters**

In [54]:
# create map
map_clusterd = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(north_york_merged['Latitude'], north_york_merged['Longitude'],
                                  north_york_merged['Neighborhood'], north_york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusterd)
       
map_clusterd

**`Below Fuction can be used to find top venues of neighbourhoods of a borough of Toronto and cluster them`**

In [55]:
def explore_borough(b, n, cluster_k):
    new_df = df[df['Borough'] == b].reset_index(drop = True)
    print(new_df.shape)

    address = b+' ,Toronto, Ontario'
    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude

    venues =  get_near_by_venues(names = new_df['Neighborhood'],latitudes = new_df['Latitude'], longitudes = new_df['Longitude'])

    onehot_df = pd.get_dummies(venues[['Venue Category']], prefix= "", prefix_sep= " ")

    # # add neighborhood column back to dataframe
    onehot_df['Neighborhood'] = new_df['Neighborhood']
    # move neighborhood column to the first column
    fixed_columns = [onehot_df.columns[-1]] + list(onehot_df.columns[:-1])
    onehot_df = onehot_df[fixed_columns]
    onehot_df_grouped = onehot_df.groupby('Neighborhood').mean().reset_index()

    num_top_venues = 10

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except:
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    venues_sorted = pd.DataFrame(columns=columns)
    venues_sorted['Neighborhood'] = onehot_df_grouped['Neighborhood']

    for ind in np.arange(onehot_df_grouped.shape[0]):
        venues_sorted.iloc[ind, 1:] = return_most_common_venues(onehot_df_grouped.iloc[ind, :], num_top_venues)

    k = cluster_k
    X = onehot_df_grouped.drop('Neighborhood', axis = 1)

    # run k-means clustering
    kmeans = KMeans(n_clusters = k, random_state=0)
    kmeans.fit(X)

    # add clustering labels
    venues_sorted['Cluster_Labels']=  kmeans.labels_

    merged_df = new_df
    # merge top venues_sorted with toronto_data
    merged_df = merged_df.join(venues_sorted.set_index('Neighborhood'), on='Neighborhood')
    
    # create map
    borough_map = folium.Map(location=[latitude, longitude], zoom_start=11)
    
    # set color scheme for the clusters
    x = np.arange(cluster_k)
    ys = [i + x + (i*x)**2 for i in range(cluster_k)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

        
    # add markers to the map
    for lat, lon, poi, cluster in zip(merged_df['Latitude'], merged_df['Longitude'], merged_df['Neighborhood'], merged_df['Cluster_Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(borough_map)

    return borough_map, merged_df

**Exploring `Scarborough`**

In [56]:
map_Scarborough, data_Scarborough = explore_borough(b = 'Scarborough', n = 10, cluster_k = 5)
data_Scarborough.head()

(17, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster_Labels
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353,Bank,Yoga Studio,Gas Station,Discount Store,Electronics Store,Event Space,Fast Food Restaurant,Filipino Restaurant,Fish Market,Flea Market,0
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497,Fast Food Restaurant,Yoga Studio,Convenience Store,Discount Store,Electronics Store,Event Space,Filipino Restaurant,Fish Market,Flea Market,Flower Shop,3
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711,Fast Food Restaurant,Yoga Studio,Convenience Store,Discount Store,Electronics Store,Event Space,Filipino Restaurant,Fish Market,Flea Market,Flower Shop,3
3,M1G,Scarborough,Woburn,43.770992,-79.216917,Restaurant,Yoga Studio,Furniture / Home Store,Discount Store,Electronics Store,Event Space,Fast Food Restaurant,Filipino Restaurant,Fish Market,Flea Market,0
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,Caribbean Restaurant,Yoga Studio,Furniture / Home Store,Discount Store,Electronics Store,Event Space,Fast Food Restaurant,Filipino Restaurant,Fish Market,Flea Market,0


In [57]:
map_Scarborough

**Exploring `Downtown Toronto`**

In [58]:
map_Downtown_Toronto, data_Downtown_Toronto = explore_borough(b = 'Downtown Toronto', n = 10, cluster_k = 5)
data_Downtown_Toronto.head()

(17, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster_Labels
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,Grocery Store,Yoga Studio,Dog Run,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,3
1,M4X,Downtown Toronto,St. James Town / Cabbagetown,43.667967,-79.367675,BBQ Joint,Yoga Studio,Filipino Restaurant,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,0
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,Bistro,Comic Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,0
3,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636,Park,Yoga Studio,Distribution Center,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,1
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,Bank,Yoga Studio,Filipino Restaurant,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,0


In [59]:
map_Downtown_Toronto

# Conclusion

> **We found that Borough North York has maximum neighborgoods-24, follwed by Scarborough-17 and Downtown Toronto-17 and we clustered them into 5 clusters using K Means alogorythm based on top 10 venues in their vicinity.**

# References

>* https://www.coursera.org/learn/applied-data-science-capstone
>* https://en.wikipedia.org/wiki/Toronto

In [None]:
!pip install jovian --upgrade --quiet

In [62]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="exploring-segmenting-and-clustering-neighborhoods-in-the-city-of-toronto-ontario-canada")

<IPython.core.display.Javascript object>

[jovian] Attempting to save notebook..[0m
