# Web Scraping - An HTML Table

Import the packages required for web scraping. Other packages are installed or imported as needed.

In [137]:
#import packages
import pandas as pd
import requests
from bs4 import BeautifulSoup

Define the URL of the web page that holds the source data

In [138]:
# Define URL
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

Fetch the page at the above URL, locate the first table with BeautifulSoup, and convert it into a dataframe with pandas

In [139]:
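# Import and process data
from io import StringIO

res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
table = soup.find_all('table')[0]  # first table on the page
# Wrap the HTML in StringIO; newer pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(str(table)))[0]
df.head()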

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


# Cleanup DataFrame

1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood - __Done, as shown above__

2. Only process the cells that have an assigned borough. Ignore (remove) cells whose borough is Not assigned

In [140]:
# Remove "Not assigned" Boroughs
df = df[df.Borough!= "Not assigned"]
df.reset_index(drop=True, inplace=True)
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table - __No such records in source data__

In [141]:
# Verify number of records by Postal Code
code_counts=df["Postal Code"].value_counts()
num_dups = len(code_counts[code_counts>1])
print("Postal Codes with more than one record to be combined/consolidated: {}".format(num_dups))

Postal Codes with more than one record to be combined/consolidated: 0
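
Although the scraped table contains no duplicate postal codes, rule 3 can still be sketched on toy data. The rows below are made up for illustration; only the column names match the dataframe above.

```python
import pandas as pd

# Toy rows (made up) in which one postal code appears twice
dup_df = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Collapse duplicates: one row per (Postal Code, Borough), with the
# neighborhoods joined by ", " in their original order
combined = (
    dup_df.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

If duplicates ever appear in the source, the same groupby could be applied to df before the record count is taken.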


4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough - __No such records in source data__

In [142]:
# Not assigned neighborhoods
NA_neighborhoods = len(df[df.Neighborhood == "Not assigned"])
print("Number of 'Not assigned' neighborhoods: {}".format(NA_neighborhoods))

Number of 'Not assigned' neighborhoods: 0
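
Rule 4 is also not triggered by the current data, but a minimal sketch of how it could be applied is shown here. The two rows are made up for illustration.

```python
import pandas as pd

# Toy rows (made up): the first has a borough but no assigned neighborhood
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where the neighborhood is "Not assigned", copy the borough name into it
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]
print(toy)
```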


5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making - __Done__

6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe - __as shown below__

In [143]:
# Number of Dataframe records
num_records = df.shape[0]
print("Number of records in web-scraped and cleaned dataframe: {}".format(num_records))

Number of records in web-scraped and cleaned dataframe: 103


# Add Geo Coordinates to the Dataframe

## Research of Geo Location APIs

I tested geopy, geocoder, and pgeocode. pgeocode worked most consistently for deriving Geo-Coordinates from a postal code; the results returned by geopy and geocoder were not consistent for Canadian postal codes.
Two approaches are demonstrated below:
1. Using pgeocode
2. Using the given .csv file lookup

### 1. Using pgeocode

Install and import pgeocode. The install step can be skipped if the package is already installed.

In [144]:
# Install package
!pip install pgeocode



In [145]:
#import package
import pgeocode

Define functions that can be used with the pandas apply function later to derive Geo-Coordinates. Test the functions.

In [146]:
#test pgeocode
country = "CA"
geo = pgeocode.Nominatim(country)  # build the lookup once and reuse it

def get_latitude(postal_code):
    # latitude of the given postal code
    return geo.query_postal_code(postal_code).latitude

def get_longitude(postal_code):
    # longitude of the given postal code
    return geo.query_postal_code(postal_code).longitude

test_code = "M7Y"
print("Geo-Coordinates of {}, Country {}: \n Latitude\t: {}\n Longitude\t: {}".format(
    test_code, country, get_latitude(test_code), get_longitude(test_code)))

Geo-Coordinates of M7Y, Country CA: 
 Latitude	: 43.7804
 Longitude	: -79.2505


Use the above functions with pandas apply to add Geo-Coordinates to the dataframe

In [147]:
# Enrich dataframe with "Latitude" and "Longitude"
pgeo_df = df.copy()
pgeo_df["Latitude"] = pgeo_df["Postal Code"].apply(get_latitude)
pgeo_df["Longitude"] = pgeo_df["Postal Code"].apply(get_longitude)
pgeo_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.6662,-79.5282
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193
7,M3B,North York,Don Mills,43.745,-79.359
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7063,-79.3094
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783


### 2. Using .csv file lookup

Read the given csv file into a pandas dataframe

In [148]:
csv_codes_df = pd.read_csv("Geospatial_Coordinates.csv")
csv_codes_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Add Geo-Coordinates to the main dataframe with a left join on "Postal Code" against the above data from the csv file (DataFrame.join defaults to how="left")

In [149]:
#join the data frames df with csv_codes_df with "Postal Code" as key for join
geo_csv_df = df.copy()
geo_csv_df = geo_csv_df.join(csv_codes_df.set_index("Postal Code"), on = "Postal Code")
geo_csv_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
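
A left join keeps every postal code even when no coordinates are found, so it can be worth checking for unmatched rows. The sketch below uses pandas merge with indicator=True on toy data; the "M9Z" row and its borough are made up to force a miss.

```python
import pandas as pd

# Toy frames; "M9Z" is a made-up code with no match in the coordinates table
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Made-up Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a "_merge" column recording where each row came from
merged = left.merge(coords, on="Postal Code", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

Rows flagged as left_only carry NaN coordinates and would need another lookup source before mapping.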


### Analysis:
As the results above show, the Geo-Coordinates (Latitude, Longitude) derived by the two methods are not fully consistent.
The rest of the analysis uses geo_csv_df.
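
To put a number on the inconsistency, the gap between the two sources can be measured as a great-circle distance. This is a minimal sketch using the haversine formula; haversine_km is a helper written for this illustration, and the coordinates are copied from the two result tables above.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

# pgeocode result vs. CSV result for the same postal code
gap_m3a = haversine_km(43.7545, -79.33, 43.753259, -79.329656)
gap_m6a = haversine_km(43.7223, -79.4504, 43.718518, -79.464763)
print("M3A gap: {:.2f} km, M6A gap: {:.2f} km".format(gap_m3a, gap_m6a))
```

For these two codes the gap ranges from roughly a hundred metres to a bit over a kilometre, which matters when venues are searched within a 500 m radius.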

# Explore and Cluster neighborhoods of Toronto

### Filter the Borough records that contain "Toronto" in name

This analysis is limited to the Boroughs with "Toronto" in their name. Filter the main dataframe from the above analysis down to the "Toronto" Boroughs.

In [150]:
#Prepare Toronto dataframe
toronto_df = geo_csv_df[geo_csv_df["Borough"].str.contains("oronto")].reset_index(drop=True)
print("Number of Borough records with 'Toronto', in name: {}".format(toronto_df.shape[0]))
print()
toronto_df.head()

Number of Borough records with 'Toronto', in name: 39



Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


### Map Toronto Neighborhoods
1. Map Toronto based on Geo-Coordinates
2. Overlay neighborhoods in Toronto dataframe on this map

Install and import the geopy package, which can derive Geo-Coordinates from a name or address such as "Toronto", where pgeocode wouldn't help.

In [151]:
#Install geopy
!pip install geopy



In [152]:
#Import package
from geopy.geocoders import Nominatim

In [153]:
#Get Toronto Geo-Coordinates
address = 'Toronto'

geolocator = Nominatim(user_agent="my_explorer")
main_location = geolocator.geocode(address, timeout=5)
main_latitude = main_location.latitude
main_longitude = main_location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(main_latitude, main_longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Use folium to map the Toronto neighborhoods. Install and import the folium package; the install step can be skipped if folium is already available.

In [155]:
#Map the Toronto neighborhood
#!pip install folium
import folium # map rendering library

In [156]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[main_latitude, main_longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Explore Toronto Neighborhoods

Define the Foursquare API credentials used to explore neighborhoods

In [157]:
#Set Credentials
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET


### Explore top 100 venues in 500m radius

### Extract and build a dataframe of Venues with their Geo-Coordinates and Category, by Neighborhood

All the required information is in the "items" key of the JSON output returned by the Foursquare API. Build functions to get and clean up category data by venue:
1. get_nearby_venus(): calls the Foursquare API to get venue data for a given neighborhood
2. get_category_type(): extracts venue category from the API output data
3. get_AllNearbyVenues(): Loops through all records of a dataframe to call get_nearby_venus()

In [160]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Build function to extract data from json to a dataframe, for a given neighborhood

In [161]:
#import function to transform json to dataframe
from pandas import json_normalize # transform JSON into a flat pandas dataframe (pandas >= 1.0)

def get_nearby_venus(name, neighborhood_latitude, neighborhood_longitude, LIMIT = 100, radius=500):
    #Build GET request URL
    # create URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
    
    #Get query results from the url
    results = requests.get(url).json()
    #Extract data in items
    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues) # flatten JSON
    
    # filter the category for each row
    nearby_venues['Venue Category'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues["Neighborhood"] = name
    nearby_venues["Neighborhood Latitude"] = neighborhood_latitude
    nearby_venues["Neighborhood Longitude"] = neighborhood_longitude
    nearby_venues.rename(columns={'venue.name': 'Venue', 'venue.location.lat': 'Venue Latitude', 
                  'venue.location.lng': 'Venue Longitude'}, inplace=True)  
    filtered_columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    nearby_venues = nearby_venues.loc[:, filtered_columns]
    
    return nearby_venues
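
As an alternative to formatting the query string by hand, the URL can be assembled from a dictionary of parameters so that escaping is handled by the standard library. This is only a sketch: build_explore_url is a hypothetical helper, and the "ID" and "SECRET" values are placeholders rather than real credentials.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL; urlencode escapes the parameter values."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # "latitude,longitude"
        "radius": radius,
        "limit": limit,
    }
    return base + "?" + urlencode(params)

# Placeholder credentials, coordinates of M5A from the dataframe above
example_url = build_explore_url("ID", "SECRET", "20180605", 43.65426, -79.360636)
print(example_url)
```

The comma in the ll parameter is percent-encoded as %2C, which a server decodes back to a comma.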

Build function to process all neighborhoods in a dataframe

In [162]:
def get_AllNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    # start from an empty frame so the result always has the expected columns
    all_nearby_venues = pd.DataFrame(columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'])

    for name, lat, lng in zip(names, latitudes, longitudes):
        # pass the caller's radius and LIMIT through instead of hard-coded values
        temp_nearby_venues = get_nearby_venus(name, lat, lng, radius=radius, LIMIT=LIMIT)
        all_nearby_venues = pd.concat([all_nearby_venues, temp_nearby_venues])
        
    return all_nearby_venues
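
Calling pd.concat inside the loop copies the accumulated frame on every pass. A common alternative is to collect the per-neighborhood frames in a list and concatenate once. The sketch below shows the pattern with fake_fetch, a made-up stand-in for get_nearby_venus() so that no API call is needed.

```python
import pandas as pd

def fake_fetch(name, lat, lng):
    """Stand-in for get_nearby_venus(); returns one made-up venue per call."""
    return pd.DataFrame({
        "Neighborhood": [name],
        "Venue": ["Venue near " + name],
        "Venue Latitude": [lat],
        "Venue Longitude": [lng],
    })

hood_names = ["A", "B", "C"]
hood_lats = [43.65, 43.66, 43.67]
hood_lngs = [-79.36, -79.38, -79.39]

# Collect one frame per neighborhood, then concatenate a single time
frames = [fake_fetch(n, la, lo) for n, la, lo in zip(hood_names, hood_lats, hood_lngs)]
all_venues = pd.concat(frames, ignore_index=True)
print(all_venues)
```

ignore_index=True also gives the combined frame a fresh 0..n-1 index instead of repeating each neighborhood's own row numbers.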

In [163]:
#Call function to get venue data for all Toronto neighborhoods
names = toronto_df['Neighborhood']
latitudes = toronto_df['Latitude']
longitudes = toronto_df['Longitude']
toronto_venues = get_AllNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100)

## Explore all venues by neighborhoods in toronto dataframe

In [165]:
#Examine nearby toronto venues dataframe
toronto_venues.tail()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
12,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,Jonathan Ashbridge Park,43.664702,-79.319898,Park
13,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,Toronto Yoga Mamas,43.664824,-79.324335,Yoga Studio
14,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,Olliffe On Queen,43.664503,-79.324768,Butcher
15,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,ONE Academy,43.662253,-79.326911,Gym / Fitness Center
16,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,Revolution Recording,43.662561,-79.32694,Recording Studio


In [166]:
#count Venues per each neighborhood
toronto_venues.groupby("Neighborhood").count().head()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,58,58,58,58,58,58
"Brockton, Parkdale Village, Exhibition Place",24,24,24,24,24,24
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",17,17,17,17,17,17
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17,17,17,17,17,17
Central Bay Street,65,65,65,65,65,65


### Identify the number of unique categories across venues

In [114]:
num_of_cats = len(toronto_venues['Venue Category'].unique())
print("Number of unique Categories in Toronto area: {}".format(num_of_cats))

Number of unique Categories in Toronto area: 233


# Analyze Categories by Neighborhood

Convert venues by neighborhood into a feature matrix: a cross-tab with venue categories as columns, built with one-hot encoding.

In [115]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Drop the category column literally named "Neighborhood", if present; it would clash with the Neighborhood column added below
toronto_onehot.drop(columns="Neighborhood", inplace=True, errors="ignore")
init_columns = list(toronto_onehot.columns)

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = ["Neighborhood"] + init_columns
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
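
The wide table above is hard to read at a glance, so here is the same one-hot step on three made-up venue rows. Only the column names are taken from the notebook.

```python
import pandas as pd

# Three made-up venues in two neighborhoods
venues = pd.DataFrame({
    "Neighborhood": ["Berczy Park", "Berczy Park", "Christie"],
    "Venue Category": ["Coffee Shop", "Bakery", "Coffee Shop"],
})

# One column per category; astype(int) gives 0/1 regardless of pandas version
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="").astype(int)

# Put the neighborhood back as the first column
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```

Each row has exactly one 1, in the column of that venue's category.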


In [168]:
#Review shape of the dataframe
toronto_onehot.shape

(1622, 233)

## Group by Neighborhood, and evaluate (sum of) venues by each category

In [169]:
#toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index() <--- # leads to really small numbers
toronto_grouped = toronto_onehot.groupby('Neighborhood').sum().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Berczy Park,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,"Brockton, Parkdale Village, Exhibition Place",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,"CN Tower, King and Spadina, Railway Lands, Har...",0,1,1,1,2,3,2,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Central Bay Street,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,1


This dataframe will be used for cluster analysis using kmeans clustering method 

In [170]:
#Size of dataframe
toronto_grouped.shape

(39, 233)
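
The choice between sum and mean in the groupby affects what k-means sees. A small sketch on made-up one-hot rows: sum yields raw venue counts, while mean yields each category's share of the neighborhood's venues.

```python
import pandas as pd

# Made-up one-hot rows: 4 venues in Berczy Park, 2 in Christie
onehot = pd.DataFrame({
    "Neighborhood": ["Berczy Park"] * 4 + ["Christie"] * 2,
    "Bakery":      [1, 0, 0, 0, 0, 1],
    "Coffee Shop": [0, 1, 1, 1, 1, 0],
})

counts = onehot.groupby("Neighborhood").sum()   # raw counts per category
shares = onehot.groupby("Neighborhood").mean()  # fraction of venues per category
print(counts)
print(shares)
```

With counts, neighborhoods that simply have more venues sit far from the rest in Euclidean distance; with shares, every row sums to 1 and only the mix of categories matters. Which is preferable depends on the question being asked.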

### Neighborhood analysis with Top 10 Venues

List the top 10 venue categories for each neighborhood

In [174]:
num_top_venues = 10

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop   5.0
1        Cocktail Bar   3.0
2          Restaurant   2.0
3  Seafood Restaurant   2.0
4         Cheese Shop   2.0
5              Bakery   2.0
6            Beer Bar   2.0
7                Café   2.0
8            Creperie   1.0
9    Greek Restaurant   1.0


----Brockton, Parkdale Village, Exhibition Place----
                    venue  freq
0                    Café   3.0
1                  Bakery   2.0
2          Breakfast Spot   2.0
3             Coffee Shop   2.0
4      Italian Restaurant   1.0
5                 Stadium   1.0
6  Furniture / Home Store   1.0
7               Nightclub   1.0
8            Climbing Gym   1.0
9                     Bar   1.0


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue  freq
0  Gym / Fitness Center   1.0
1               Brewery   1.0
2            Skate Park   1.0
3            Restaurant   1.0
4      Recording Stu

                         venue  freq
0               Breakfast Spot   2.0
1                    Gift Shop   2.0
2                  Coffee Shop   1.0
3                    Bookstore   1.0
4                Movie Theater   1.0
5  Eastern European Restaurant   1.0
6                          Bar   1.0
7                         Bank   1.0
8                   Restaurant   1.0
9           Italian Restaurant   1.0


----Queen's Park, Ontario Provincial Government----
                 venue  freq
0          Coffee Shop   8.0
1                Diner   2.0
2     Sushi Restaurant   2.0
3          Yoga Studio   1.0
4    College Cafeteria   1.0
5        Burrito Place   1.0
6        Smoothie Shop   1.0
7   Mexican Restaurant   1.0
8  Fried Chicken Joint   1.0
9       Sandwich Place   1.0


----Regent Park, Harbourfront----
            venue  freq
0     Coffee Shop   8.0
1             Pub   3.0
2          Bakery   3.0
3            Park   3.0
4  Breakfast Spot   2.0
5            Café   2.0
6         Theate

Build a function that lists the top X (most popular) venue categories by neighborhood: sort the categories in descending order and select the top X.

In [175]:
#Sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
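
To see what this kind of function returns, here is a standalone sketch of the same idea applied to one made-up row. top_categories is a hypothetical name used only in this example.

```python
import pandas as pd

def top_categories(row, n):
    """Return the n category names with the highest counts in a grouped row."""
    counts = row.iloc[1:].astype(float)  # skip the leading 'Neighborhood' label
    return counts.sort_values(ascending=False).index[:n].tolist()

# Made-up grouped row: neighborhood label followed by category counts
row = pd.Series({"Neighborhood": "Berczy Park", "Bakery": 2, "Coffee Shop": 5, "Pub": 0})
print(top_categories(row, 2))
```

When several categories share the same count, their relative order depends on the sort algorithm, so rankings among ties should not be read as meaningful.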

Build a function that lists Top 10 Venues by neighborhood, as a cross-tab with the ranking(sorted) as columns

In [176]:

import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Restaurant,Café,Beer Bar,Cheese Shop,Department Store,Japanese Restaurant
1,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery,Breakfast Spot,Coffee Shop,Gym,Stadium,Burrito Place,Restaurant,Climbing Gym,Performing Arts Venue
2,"Business reply mail Processing Centre, South C...",Yoga Studio,Auto Workshop,Burrito Place,Light Rail Station,Farmers Market,Fast Food Restaurant,Butcher,Restaurant,Recording Studio,Brewery
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Boutique,Coffee Shop,Airport,Airport Food Court,Airport Gate,Sculpture Garden,Rental Car Location
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Japanese Restaurant,Café,Thai Restaurant,Salad Place,Bar,Burger Joint,Bubble Tea Shop


# Cluster Neighborhoods

Run k-means to cluster neighborhoods

In [122]:
#Import package
from sklearn.cluster import KMeans

In [177]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[:] 

array([1, 2, 2, 2, 0, 2, 1, 3, 2, 2, 2, 3, 2, 4, 0, 2, 2, 1, 2, 2, 2, 2,
       2, 1, 1, 3, 2, 2, 1, 1, 2, 3, 1, 2, 2, 2, 2, 3, 2])
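
For intuition about what the clustering step does, the sketch below implements a bare-bones version of Lloyd's algorithm in NumPy. It is not scikit-learn's implementation (which adds smarter initialization and convergence checks); lloyd_kmeans and the four toy rows are made up for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each row to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the rows assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy count vectors: two coffee-heavy rows and two restaurant-heavy rows
X = np.array([[8, 0, 1], [7, 1, 0], [0, 5, 4], [1, 6, 5]], dtype=float)
labels, centroids = lloyd_kmeans(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only which rows share a label carries meaning.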

### Build a dataframe that includes the cluster, with neighborhood and corresponding Top 10 venues

In [178]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_df

# join the cluster labels and top venues onto toronto_df, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Theater,Event Space,Electronics Store,Distribution Center
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1,Coffee Shop,Diner,Sushi Restaurant,Yoga Studio,Park,Mexican Restaurant,Italian Restaurant,Hobby Shop,Fried Chicken Joint,Distribution Center
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,4,Clothing Store,Coffee Shop,Cosmetics Shop,Bubble Tea Shop,Middle Eastern Restaurant,Café,Japanese Restaurant,Tea Room,Bookstore,Ramen Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Gastropub,Cocktail Bar,American Restaurant,Restaurant,Gym,Hotel,Creperie,Moroccan Restaurant
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Health Food Store,Trail,Pub,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant


### Visualize the clusters

In [179]:
# create map
map_clusters = folium.Map(location=[main_latitude, main_longitude], zoom_start=12)

# set color scheme for the clusters

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], \
                                  toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run 0..kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)   
    
map_clusters

# Analyze Clusters

Cluster count summary:

In [126]:
toronto_merged["Cluster Labels"].value_counts()

2    23
1     8
3     5
0     2
4     1
Name: Cluster Labels, dtype: int64

This does not look like a very helpful clustering: Cluster 2 holds 23 of the 39 neighborhoods. It could be improved by organizing the category data better, for example by grouping categories into broader ones such as Transport Services, Restaurants (of type 1), Restaurants (of type 2), Stores, Groceries, Bakeries & Coffee, and so on. 233 unique categories are too many to make sense of.

The records of each cluster are listed below. It is hard to make much sense of them by inspection alone; analysing them by a category grouping as suggested above would make the clusters more tangible.
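
One way the suggested grouping could be started is with simple keyword rules. The rules and group names below are hypothetical and far from complete; they only illustrate mapping fine-grained categories to broader ones before one-hot encoding.

```python
import pandas as pd

# Hypothetical keyword rules; the first match wins, so order matters
GROUP_RULES = [
    ("Coffee Shop", "Bakeries & Coffee"),
    ("Café", "Bakeries & Coffee"),
    ("Bakery", "Bakeries & Coffee"),
    ("Restaurant", "Restaurants"),
    ("Airport", "Transport Services"),
    ("Station", "Transport Services"),
    ("Store", "Stores"),
    ("Shop", "Stores"),
]

def broad_category(category):
    """Map a venue category to a broader group, or 'Other' if no rule matches."""
    for keyword, group in GROUP_RULES:
        if keyword in category:
            return group
    return "Other"

cats = pd.Series(["Coffee Shop", "Italian Restaurant", "Airport Lounge",
                  "Grocery Store", "Yoga Studio"])
grouped = cats.map(broad_category)
print(grouped.tolist())
```

Applying such a mapping to the 'Venue Category' column would shrink the feature matrix from hundreds of columns to a handful, at the cost of hand-maintained rules.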

### Cluster0

In [180]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, \
                     toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Central Bay Street,0,Coffee Shop,Italian Restaurant,Sandwich Place,Japanese Restaurant,Café,Thai Restaurant,Salad Place,Bar,Burger Joint,Bubble Tea Shop
10,"Harbourfront East, Union Station, Toronto Islands",0,Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant,Sporting Goods Shop,Scenic Lookout,Brewery,Fried Chicken Joint,Restaurant


### Cluster1

In [181]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, \
                     toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Regent Park, Harbourfront",1,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Theater,Event Space,Electronics Store,Distribution Center
1,"Queen's Park, Ontario Provincial Government",1,Coffee Shop,Diner,Sushi Restaurant,Yoga Studio,Park,Mexican Restaurant,Italian Restaurant,Hobby Shop,Fried Chicken Joint,Distribution Center
3,St. James Town,1,Coffee Shop,Café,Gastropub,Cocktail Bar,American Restaurant,Restaurant,Gym,Hotel,Creperie,Moroccan Restaurant
5,Berczy Park,1,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Restaurant,Café,Beer Bar,Cheese Shop,Department Store,Japanese Restaurant
17,Studio District,1,Café,Coffee Shop,Bakery,Gastropub,American Restaurant,Brewery,Yoga Studio,Fish Market,Italian Restaurant,Bookstore


### Cluster2

In [182]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, \
                     toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,The Beaches,2,Health Food Store,Trail,Pub,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant
7,Christie,2,Grocery Store,Café,Park,Nightclub,Diner,Italian Restaurant,Baby Store,Restaurant,Candy Store,Coffee Shop
9,"Dufferin, Dovercourt Village",2,Pharmacy,Bakery,Music Venue,Portuguese Restaurant,Middle Eastern Restaurant,Café,Brewery,Bar,Supermarket,Bank
11,"Little Portugal, Trinity",2,Bar,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Men's Store,Asian Restaurant,Restaurant,Café,Yoga Studio,Record Shop,Italian Restaurant
12,"The Danforth West, Riverdale",2,Greek Restaurant,Italian Restaurant,Coffee Shop,Bookstore,Restaurant,Ice Cream Shop,Furniture / Home Store,Yoga Studio,Liquor Store,Spa


### Cluster3

In [183]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, \
                     toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,"Richmond, Adelaide, King",3,Coffee Shop,Restaurant,Café,Gym,Deli / Bodega,Hotel,Thai Restaurant,Bookstore,Pizza Place,Clothing Store
13,"Toronto Dominion Centre, Design Exchange",3,Coffee Shop,Hotel,Café,Restaurant,Salad Place,Japanese Restaurant,Italian Restaurant,Seafood Restaurant,American Restaurant,Sushi Restaurant
16,"Commerce Court, Victoria Hotel",3,Coffee Shop,Café,Restaurant,Hotel,American Restaurant,Gym,Seafood Restaurant,Italian Restaurant,Japanese Restaurant,Vegetarian / Vegan Restaurant
34,Stn A PO Boxes,3,Coffee Shop,Seafood Restaurant,Café,Restaurant,Japanese Restaurant,Italian Restaurant,Hotel,Beer Bar,Cocktail Bar,Gym
36,"First Canadian Place, Underground city",3,Coffee Shop,Café,Restaurant,Hotel,Gym,American Restaurant,Seafood Restaurant,Steakhouse,Salad Place,Japanese Restaurant


### Cluster4

In [184]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, \
                     toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]].head()

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Garden District, Ryerson",4,Clothing Store,Coffee Shop,Cosmetics Shop,Bubble Tea Shop,Middle Eastern Restaurant,Café,Japanese Restaurant,Tea Room,Bookstore,Ramen Restaurant
