In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information. However, unlike New York, the neighborhood data for Toronto is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to learn to be agile and refine your ability to pick up new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

# FIRST PART IS TO DO WEB SCRAPING OF https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

To create the dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
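
The three cleaning rules above can be sketched on a small dataframe. This is a minimal illustration under stated assumptions, not the notebook's actual scraping result: the rows in `raw` are made-up stand-ins for the scraped table, and the variable names `raw`, `clean`, and `missing` are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only).
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: keep only cells with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: a "Not assigned" neighborhood falls back to the borough name.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Rule 3: combine neighborhoods sharing a postal code into one comma-separated row.
clean = (
    clean.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(clean)
print(clean.shape)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without a separate aggregation for it.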

In [149]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import os
import folium 
from geopy.geocoders import Nominatim 
import re 
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)


In [2]:
List_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(List_url).text

In [3]:
soup = BeautifulSoup(source, 'html.parser')  # the Wikipedia page is HTML, so use an HTML parser
table=soup.find('table')
column_names=['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns=column_names)

In [4]:
# let's check that the dataframe was created with the required column names
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood


In [47]:
postal_code = []
Borough = []
Neighbourhood = []
# each table cell holds the postal code in <b> and "Borough(Neighbourhood / ...)" in <span>
for tr_cell in table.find_all('tr'):
    for td_cell in tr_cell.find_all('td'):
        for p in td_cell.find_all('p'):
            for b in p.find_all('b'):
                postal_code.append(b.text)
            for span in p.find_all('span'):
                # split the borough from the parenthesised neighbourhood list
                hold_list = span.text.split(sep='(')
                Borough.append(hold_list[0])
                if len(hold_list) > 1:
                    # neighbourhoods are separated by '/' on the page; join them with commas
                    var_list = hold_list[1].replace(')', '').split('/')
                    Neighbourhood.append(','.join(map(str, var_list)))
                else:
                    # no neighbourhood listed for this cell
                    Neighbourhood.append(' ')


In [49]:
df_final = pd.DataFrame({'Postalcode':postal_code,'Borough':Borough,'Neighbourhood':Neighbourhood})

In [51]:
df_final.head(50)

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Queen's Park,Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern , Rouge"
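
The output above shows stray spaces around the commas (for example "Regent Park , Harbourfront"). A minimal sketch of a parsing helper that trims each name is shown below. It assumes the cell text has the form `Borough(Name / Name)`, as implied by the splitting logic in the loop above; `parse_cell` is a hypothetical helper, and the fallback to the borough follows the assignment's rule for unassigned neighborhoods.

```python
def parse_cell(text):
    """Split 'Borough(Name / Name)' into (borough, 'Name, Name')."""
    borough, _, rest = text.partition("(")
    borough = borough.strip()
    # Trim whitespace around every neighbourhood name and drop empty pieces.
    names = [part.strip() for part in rest.replace(")", "").split("/")]
    names = [name for name in names if name]
    # With no neighbourhood listed, fall back to the borough name.
    neighbourhood = ", ".join(names) if names else borough
    return borough, neighbourhood


print(parse_cell("Downtown Toronto(Regent Park / Harbourfront)"))
print(parse_cell("Etobicoke"))
```

Trimming at parse time keeps later string comparisons and merges on the neighbourhood name from failing on invisible whitespace.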


In [52]:
df_final=df_final[df_final['Borough']!= 'Not assigned']

In [53]:
df_final.shape

(103, 3)

In [55]:
df_final.head(10)

Unnamed: 0,Postalcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Queen's Park,Ontario Provincial Government
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern , Rouge"
11,M3B,North York,Don MillsNorth
12,M4B,East York,"Parkview Hill , Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


# SECOND PART IS TO GET LATITUDE AND LONGITUDE OF EACH POSTAL CODE

Read the Geospatial_Coordinates.csv file to get the latitude and longitude of each postal code.


In [57]:
df_lat_lon = pd.read_csv('Geospatial_Coordinates.csv')

In [58]:
df_lat_lon.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [59]:
df_lat_lon.shape

(103, 3)

In [63]:
df_final.columns

Index(['Postalcode', 'Borough', 'Neighbourhood'], dtype='object')

In [64]:
df_final = df_final.rename(columns={'Postalcode' : 'Postal Code'})

In [65]:
df_final.columns

Index(['Postal Code', 'Borough', 'Neighbourhood'], dtype='object')

In [66]:
merge_df = pd.merge(df_final,df_lat_lon,on=['Postal Code'])

In [67]:
merge_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,...",43.636258,-79.498509
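
An inner merge silently drops postal codes that have no match in the coordinates file. The sketch below shows one way to check for that, using a left merge with `indicator=True`. The two small frames are illustrative, and the postal code `M9Z` is a made-up value chosen so that one row has no match.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # 'M9Z' is a hypothetical unmatched code
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Keep every row from the left frame and record where each row came from.
checked = pd.merge(left, right, on="Postal Code", how="left", indicator=True)

# Rows marked 'left_only' have no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge used above loses no rows.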


# THIRD PART is to Explore and cluster the neighborhoods in Toronto

In [105]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(merge_df['Borough'].unique()),
        merge_df.shape[0]
    )
)

The dataframe has 15 boroughs and 103 neighborhoods.


Let's analyse the boroughs that contain the word Toronto.

In [106]:
merge_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [122]:
# filter and reset the index in one step to avoid modifying a slice in place
new_df = merge_df[merge_df['Borough'].str.contains('Toronto', na=False)].reset_index(drop=True)


In [123]:
new_df['Borough'].value_counts()

Downtown Toronto                                                17
Central Toronto                                                  9
West Toronto                                                     6
East Toronto                                                     4
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
East YorkEast Toronto                                            1
East TorontoBusiness reply mail Processing Centre969 Eastern     1
Name: Borough, dtype: int64

In [125]:
analyze_df = new_df

In [126]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(analyze_df['Borough'].unique()),
        analyze_df.shape[0]
    )
)

The dataframe has 7 boroughs and 39 neighborhoods.


In [129]:
analyze_df.columns

Index(['Postal Code', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')

In [127]:
address = 'TORONTO, CMA'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are -34.8899421, -56.0790982.
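
The latitude printed above is negative, which places the geocoded point in the southern hemisphere rather than in Toronto, Ontario, so the address query probably matched a different place. A minimal sketch of a sanity check on geocoder results follows. The bounding box limits are rough assumptions, and `looks_like_toronto` is a hypothetical helper.

```python
# Rough bounding box around the City of Toronto; the limits are approximate assumptions.
TORONTO_BOUNDS = {"lat": (43.5, 43.9), "lon": (-79.7, -79.1)}


def looks_like_toronto(lat, lon, bounds=TORONTO_BOUNDS):
    """Return True when (lat, lon) falls inside the assumed bounding box."""
    lat_min, lat_max = bounds["lat"]
    lon_min, lon_max = bounds["lon"]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


print(looks_like_toronto(-34.8899421, -56.0790982))  # geocoder result printed above: False
print(looks_like_toronto(43.654260, -79.360636))     # M5A row from the merged dataframe: True
```

Running such a check right after `geolocator.geocode(...)` would flag the mismatch before the coordinates are used to centre a map. A more specific query string, such as one naming the province and country, may also help.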


Create a map of Toronto with neighborhoods superimposed on top

In [130]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(analyze_df['Latitude'], analyze_df['Longitude'], analyze_df['Borough'], analyze_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [132]:
address = 'Downtown Toronto, CMA'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto City are 44.3860593, -79.6918184.


In [136]:
map_d_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(analyze_df['Latitude'], analyze_df['Longitude'], analyze_df['Borough'], analyze_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_d_toronto)  
    
map_d_toronto

# Let's use the Foursquare API to explore the neighbourhoods

In [131]:

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (credential redacted)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (credential redacted)
VERSION = '20180605' # Foursquare API version

print('Successfully Logged-In')

Successfully Logged-In


# Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [137]:
analyze_df_dwntwn_toronto = analyze_df[analyze_df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)


In [139]:
analyze_df_dwntwn_toronto.loc[0, 'Neighbourhood']

'Regent Park , Harbourfront'

In [140]:
radius = 500
Limit = 100 

In [142]:
neighborhood_latitude = analyze_df_dwntwn_toronto.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = analyze_df_dwntwn_toronto.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = analyze_df_dwntwn_toronto.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Regent Park , Harbourfront are 43.6542599, -79.3606359.


Now, let's get the top 100 venues that are in Regent Park , Harbourfront within a radius of 500 meters.

Let's create the URL for the GET request.

In [143]:
url='https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
neighborhood_latitude,
neighborhood_longitude,    
radius,
Limit)


In [144]:
url

'https://api.foursquare.com/v2/venues/explore?client_id=YOUR_FOURSQUARE_CLIENT_ID&client_secret=YOUR_FOURSQUARE_CLIENT_SECRET&v=20180605&ll=43.6542599,-79.3606359&radius=500&limit=100'
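
Displaying the request URL also displays the credentials embedded in it. A minimal sketch, using only the standard library, builds the same query string with `urlencode` and masks the credentials before printing. `build_explore_url` and `redact` are hypothetical helper names, not part of the Foursquare API, and the `"ID"` and `"SECRET"` values are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # Keep the comma in 'll' unescaped so the URL matches the hand-built one.
    return BASE_URL + "?" + urlencode(params, safe=",")


def redact(url, secret_keys=("client_id", "client_secret")):
    """Return the URL with the values of secret query parameters masked."""
    parts = urlsplit(url)
    query = [(k, "***" if k in secret_keys else v) for k, v in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(query, safe=",*")))


url = build_explore_url("ID", "SECRET", "20180605", 43.6542599, -79.3606359, 500, 100)
print(redact(url))
```

The unredacted `url` is still what gets passed to `requests.get`; only the copy shown in the notebook output is masked.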

Send the GET request


In [145]:
results = requests.get(url).json()
results

ing Glory Cafe',
       'location': {'address': '457 King St. E',
        'crossStreet': 'Gilead Place',
        'lat': 43.653946942635294,
        'lng': -79.36114884214422,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.653946942635294,
          'lng': -79.36114884214422}],
        'distance': 54,
        'postalCode': 'M5A 1L6',
        'cc': 'CA',
        'city': 'Toronto',
        'state': 'ON',
        'country': 'Canada',
        'formattedAddress': ['457 King St. E (Gilead Place)',
         'Toronto ON M5A 1L6',
         'Canada']},
       'categories': [{'id': '4bf58dd8d48988d143941735',
         'name': 'Breakfast Spot',
         'pluralName': 'Breakfast Spots',
         'shortName': 'Breakfast',
         'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/breakfast_',
          'suffix': '.png'},
         'primary': True}],
       'photos': {'count': 0, 'groups': []},
       'venuePage': {'id': '39686393'}},
      'referralId': 'e-0-4ae5b91f

In [146]:
results['response'].keys()

dict_keys(['suggestedFilters', 'headerLocation', 'headerFullLocation', 'headerLocationGranularity', 'totalResults', 'suggestedBounds', 'groups'])

In [147]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # flattened rows use the 'venue.categories' key instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [150]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Tandem Coffee,Coffee Shop,43.653559,-79.361809
1,Roselle Desserts,Bakery,43.653447,-79.362017
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Impact Kitchen,Restaurant,43.656369,-79.35698


In [151]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

44 venues were returned by Foursquare.


# Explore Neighborhoods in Downtown Toronto 

In [161]:
LIMIT=100

In [162]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
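
The list comprehension above indexes `categories[0]` directly, which raises an `IndexError` for any venue returned without a category. A minimal sketch of a more defensive row builder is given below. It assumes the same item layout that the function above reads (`item['venue']` with `name`, `location`, and `categories`); `venue_rows` is a hypothetical helper, and the sample items are made up.

```python
def venue_rows(name, lat, lng, items):
    """Build one tuple per venue, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when the venue carries no category instead of failing.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Made-up items mimicking the structure read by the function above.
sample_items = [
    {"venue": {"name": "Cafe A", "location": {"lat": 43.65, "lng": -79.36},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Spot", "location": {"lat": 43.66, "lng": -79.37},
               "categories": []}},
]
rows = venue_rows("Example Neighborhood", 43.654, -79.361, sample_items)
print(rows)
```

Rows with a `None` category can then be dropped or labelled before the one-hot encoding step.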

In [163]:
downtown_toronto_venues = getNearbyVenues(names=analyze_df_dwntwn_toronto['Neighbourhood'],
                                        latitudes=analyze_df_dwntwn_toronto['Latitude'], 
                                        longitudes=analyze_df_dwntwn_toronto['Longitude'], radius=500)

Regent Park , Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond , Adelaide , King
Harbourfront East , Union Station , Toronto Islands
Toronto Dominion Centre , Design Exchange
Commerce Court , Victoria Hotel
University of Toronto , Harbord
Kensington Market , Chinatown , Grange Park
CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport
Rosedale
St. James Town , Cabbagetown
First Canadian Place , Underground city
Church and Wellesley


In [164]:
print(downtown_toronto_venues.shape)
downtown_toronto_venues.head()

(1108, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park , Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park , Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park , Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park , Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park , Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


In [166]:
downtown_toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,57,57,57,57,57,57
"CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport",17,17,17,17,17,17
Central Bay Street,66,66,66,66,66,66
Christie,16,16,16,16,16,16
Church and Wellesley,80,80,80,80,80,80
"Commerce Court , Victoria Hotel",100,100,100,100,100,100
"First Canadian Place , Underground city",100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100
"Harbourfront East , Union Station , Toronto Islands",100,100,100,100,100,100
"Kensington Market , Chinatown , Grange Park",66,66,66,66,66,66


In [167]:
print('There are {} unique categories.'.format(len(downtown_toronto_venues['Venue Category'].unique())))

There are 202 unique categories.


Analyze each neighborhood

In [169]:
# one hot encoding
downtown_onehot = pd.get_dummies(downtown_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
downtown_onehot['Neighborhood'] = downtown_toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [downtown_onehot.columns[-1]] + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [177]:
downtown_onehot['Neighborhood']

0       Regent Park , Harbourfront
1       Regent Park , Harbourfront
2       Regent Park , Harbourfront
3       Regent Park , Harbourfront
4       Regent Park , Harbourfront
                   ...            
1103          Church and Wellesley
1104          Church and Wellesley
1105          Church and Wellesley
1106          Church and Wellesley
1107          Church and Wellesley
Name: Neighborhood, Length: 1108, dtype: object

In [178]:
downtown_onehot.shape

(1108, 202)
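
The shape above has 202 columns, the same as the number of unique categories, even though a 'Neighborhood' column was added afterwards. This suggests that one venue category is itself named 'Neighborhood' and that its dummy column was overwritten by the label column. A minimal sketch of one way to avoid the collision follows, on made-up data. The renamed column label `'Neighborhood (category)'` is an arbitrary choice.

```python
import pandas as pd

# Made-up venues; one category deliberately clashes with the label column name.
venues = pd.DataFrame({
    "Neighborhood": ["Berczy Park", "Berczy Park", "Christie"],
    "Venue Category": ["Coffee Shop", "Neighborhood", "Yoga Studio"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# Rename a clashing dummy column before adding the label column.
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (category)"})

# Insert the label column at position 0 instead of reordering afterwards.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

print(list(onehot.columns))
```

Using `insert(0, ...)` also avoids relying on the label column being the last one when moving it to the front.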

Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category.

In [179]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0
1,"CN Tower , King and Spadina , Railway Lands , ...",0.0,0.0,0.058824,0.058824,0.058824,0.117647,0.117647,0.117647,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.015152,0.015152,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.015152
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.025,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,...,0.0,0.0125,0.0125,0.0125,0.0,0.0,0.0,0.0,0.0,0.0
5,"Commerce Court , Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.01,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01
6,"First Canadian Place , Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.01,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.01
7,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.01,0.02,0.0,0.0,0.0,0.0,0.01,0.01,0.01
8,"Harbourfront East , Union Station , Toronto Is...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01
9,"Kensington Market , Chinatown , Grange Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.015152,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.015152


Let's print each neighborhood along with the top 5 most common venues

In [181]:
num_top_venues = 5

for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
            venue  freq
0     Coffee Shop  0.09
1    Cocktail Bar  0.05
2          Bakery  0.05
3        Beer Bar  0.04
4  Farmers Market  0.04


----CN Tower , King and Spadina , Railway Lands , Harbourfront West , Bathurst Quay , South Niagara , Island airport----
              venue  freq
0    Airport Lounge  0.12
1   Airport Service  0.12
2  Airport Terminal  0.12
3   Harbor / Marina  0.06
4             Plane  0.06


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.17
1       Sandwich Place  0.06
2   Italian Restaurant  0.05
3                 Café  0.05
4  Japanese Restaurant  0.03


----Christie----
           venue  freq
0  Grocery Store  0.25
1           Café  0.19
2           Park  0.12
3     Baby Store  0.06
4    Candy Store  0.06


----Church and Wellesley----
                 venue  freq
0          Coffee Shop  0.08
1  Japanese Restaurant  0.06
2     Sushi Restaurant  0.06
3           Restaurant  0.04
4              Gay Ba

Let's put this into a dataframe.

In [182]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [184]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions past the 3rd take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Beer Bar,Cheese Shop,Restaurant,Farmers Market,Italian Restaurant,Seafood Restaurant,Museum
1,"CN Tower , King and Spadina , Railway Lands , ...",Airport Lounge,Airport Service,Airport Terminal,Boat or Ferry,Coffee Shop,Boutique,Harbor / Marina,Sculpture Garden,Bar,Rental Car Location
2,Central Bay Street,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Restaurant,Bubble Tea Shop,Burger Joint,Salad Place,Japanese Restaurant,Wine Bar
3,Christie,Grocery Store,Café,Park,Nightclub,Candy Store,Restaurant,Baby Store,Italian Restaurant,Coffee Shop,Athletics & Sports
4,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Pub,Hotel,Bubble Tea Shop,Café,Yoga Studio
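
The column names above rely on a three-element suffix list with a fallback, which works for the ten columns used here. A minimal sketch of a general ordinal helper is shown below for cases with more columns; `ordinal` is a hypothetical helper name.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns)
```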


# Cluster Neighborhood 

In [186]:
from sklearn.cluster import KMeans

In [187]:
# set number of clusters
kclusters = 5

downtown_grouped_clustering = downtown_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(downtown_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 2, 1, 1, 1, 1, 1, 3])
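
Before interpreting the clusters, it can help to see how many neighborhoods fall into each one. The sketch below counts the first ten labels shown in the output above. In the notebook the same counting could be applied to all of `kmeans.labels_`, which is not reproduced here.

```python
from collections import Counter

first_ten_labels = [1, 0, 1, 2, 1, 1, 1, 1, 1, 3]  # copied from the output above

sizes = Counter(first_ten_labels)
for label, count in sorted(sizes.items()):
    print(f"cluster {label}: {count} neighborhoods")
```

Among these ten labels, seven belong to cluster 1, so most of these neighborhoods end up in a single group.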

In [188]:
analyze_df_dwntwn_toronto.columns

Index(['Postal Code', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')

In [194]:
analyze_df_dwntwn_toronto=analyze_df_dwntwn_toronto.rename(columns={'Neighbourhood' : 'Neighborhood'})

In [195]:
analyze_df_dwntwn_toronto.columns

Index(['Postal Code', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'], dtype='object')

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [196]:
neighborhoods_venues_sorted.columns

Index(['Cluster Labels', 'Neighborhood', '1st Most Common Venue',
       '2nd Most Common Venue', '3rd Most Common Venue',
       '4th Most Common Venue', '5th Most Common Venue',
       '6th Most Common Venue', '7th Most Common Venue',
       '8th Most Common Venue', '9th Most Common Venue',
       '10th Most Common Venue'],
      dtype='object')

In [198]:
analyzedf_merged = analyze_df_dwntwn_toronto

# join the sorted venues (with cluster labels) onto the Downtown Toronto data, which holds latitude/longitude for each neighborhood
analyzedf_merged = analyzedf_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

analyzedf_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,1,Coffee Shop,Bakery,Café,Park,Pub,Breakfast Spot,Theater,Restaurant,Bank,Event Space
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Coffee Shop,Clothing Store,Sandwich Place,Café,Hotel,Pizza Place,Italian Restaurant,Cosmetics Shop,Middle Eastern Restaurant,Japanese Restaurant
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Café,Coffee Shop,Cocktail Bar,Gastropub,Restaurant,Beer Bar,Lingerie Store,Italian Restaurant,Moroccan Restaurant,Seafood Restaurant
3,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Coffee Shop,Bakery,Cocktail Bar,Beer Bar,Cheese Shop,Restaurant,Farmers Market,Italian Restaurant,Seafood Restaurant,Museum
4,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Restaurant,Bubble Tea Shop,Burger Joint,Salad Place,Japanese Restaurant,Wine Bar


Visualize the clusters on a map

In [200]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [201]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(analyzedf_merged['Latitude'], analyzedf_merged['Longitude'], analyzedf_merged['Neighborhood'], analyzedf_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
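
The marker loop above indexes the palette with `cluster-1`, so cluster 0 wraps around to the last colour. A minimal sketch of an alternative follows: it builds one hex colour per cluster with the standard library and indexes it by the label directly. `cluster_colors` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_colors(k, saturation=0.8, value=0.9):
    """Return k evenly spaced hex colours, one per cluster label 0..k-1."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, saturation, value)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_colors(5)
for label in range(5):
    print(label, palette[label])  # colour used for markers in this cluster
```

Each string in `palette` can be passed as the `color` and `fill_color` of a `folium.CircleMarker`.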

Let's examine the first cluster


In [202]:
analyzedf_merged.loc[analyzedf_merged['Cluster Labels'] == 0, analyzedf_merged.columns[[1] + list(range(5, analyzedf_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Downtown Toronto,0,Airport Lounge,Airport Service,Airport Terminal,Boat or Ferry,Coffee Shop,Boutique,Harbor / Marina,Sculpture Garden,Bar,Rental Car Location
