<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>
<h2>Robert Leon</h2>
<h3>Coursera IBM Data Science Capstone</h3>
<p>In this exercise, I will use the Foursquare API and scikit-learn's k-means clustering to classify Toronto neighborhoods according to the makeup of local business categories.</p>

The first step will be retrieving a list of neighborhoods in Toronto. A list is available for scraping at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [4]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [5]:
#URL of info source
zip_codes_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#create http request to query URL content
wiki_response = requests.get(zip_codes_url)

In [6]:
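#split into its own cell to avoid a new http request every time I rerun the cell
#create BeautifulSoup object to parse response content
#(naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning)
wikipage = BeautifulSoup(wiki_response.text, 'html.parser')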

print(wikipage.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"449ccf20-e1f6-4d16-813e-c13126a76ab2","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":969510799,"wgRevisionId":969510799,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toron

In [7]:
#use pd.read_html to create a list of dataframes and save the first one to a variable for easy referencing
#wikipage.table is a BeautifulSoup Tag object, so I call prettify() to turn it back into an HTML string that read_html can parse
wiki_page_table = pd.read_html(wikipage.table.prettify())[0]

In [8]:
#Confirm success!
wiki_page_table.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [9]:
#Create mask of all rows with 'Not assigned' as borough
table_mask = wiki_page_table['Borough']!='Not assigned'
#Apply mask to wiki_page_table and save as 'clean' version to use
clean_wiki_page_table = wiki_page_table[table_mask]

In [10]:
#Pandas is not letting me view all rows, changing the 'display.max_rows' setting's value to 'None'
pd.set_option('display.max_rows', None)

#Displaying dataframe in its entirety 
clean_wiki_page_table

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [22]:
#Display number of rows x number of columns for easy reference
clean_wiki_page_table.shape

(103, 3)

The next step is getting the geographic coordinates for each postal code in the list.

In [44]:
#import Geocoder library
import geocoder
#import json to handle geocoding api responses
import json
#import statistics package to average out boxed coordinates
import statistics

Here I create my dataframe to store each zip code's coordinates

In [129]:
#create the dataframe to store my results
lat_lon_columns=['Latitude','Longitude','Code Searched']
lat_lon_df=pd.DataFrame(columns=lat_lon_columns)

#loop through the postal codes and save the center coordinates of each one
#coordinates are returned as the corners of a bounding box; the center of the box seems best for a radius search
import os

#API key is read from an environment variable (the variable name is a placeholder) instead of being pasted into the notebook
google_api_key = os.environ['GOOGLE_API_KEY']

#collect rows in a list and build the dataframe once; DataFrame.append was removed in pandas 2.0
rows = []
for zipcode in clean_wiki_page_table['Postal Code']:
    g = geocoder.google('{}, Toronto, Ontario'.format(zipcode), key=google_api_key)
    #index 0 of each corner is used as latitude, index 1 as longitude
    latitude = statistics.mean([g.json['bbox']['northeast'][0], g.json['bbox']['southwest'][0]])
    longitude = statistics.mean([g.json['bbox']['northeast'][1], g.json['bbox']['southwest'][1]])
    rows.append({'Latitude': latitude, 'Longitude': longitude, 'Code Searched': zipcode})
lat_lon_df = pd.DataFrame(rows, columns=lat_lon_columns)

print(lat_lon_df)

      Latitude  Longitude Code Searched
0    43.750384 -79.335351           M3A
1    43.729376 -79.312923           M4A
2    43.649954 -79.352845           M5A
3    43.723346 -79.450757           M6A
4    43.662415 -79.389786           M7A
5    43.662091 -79.528267           M9A
6    43.810176 -79.190328           M1B
7    43.748739 -79.356410           M3B
8    43.707380 -79.311904           M4B
9    43.657723 -79.378585           M5B
10   43.706510 -79.447217           M6B
11   43.650063 -79.553815           M9B
12   43.785741 -79.156456           M1C
13   43.721230 -79.350863           M3C
14   43.690582 -79.309789           M4C
15   43.651446 -79.376139           M5C
16   43.691950 -79.430395           M6C
17   43.638168 -79.578721           M9C
18   43.766170 -79.185798           M1E
19   43.678474 -79.294502           M4E
20   43.645672 -79.374024           M5E
21   43.687424 -79.450271           M6E
22   43.769966 -79.217827           M1G
23   43.706052 -79.364867           M4G


After a long time troubleshooting null responses from the Foursquare API further below, I found that the above block originally logged the coordinates in the wrong order.
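A swapped latitude/longitude pair is easy to miss for Toronto, because both values still fall inside the valid latitude range. The sketch below shows one way to compute the center of a bounding box and catch a swap early. The helper names, the sample box, and the rough area bounds are all assumptions made for illustration; they are not part of the geocoder library.

```python
from statistics import mean

def bbox_center(bbox):
    """Return (latitude, longitude) at the center of a bounding box.

    Assumes each corner is ordered [latitude, longitude].
    """
    ne_lat, ne_lng = bbox['northeast']
    sw_lat, sw_lng = bbox['southwest']
    return mean([ne_lat, sw_lat]), mean([ne_lng, sw_lng])

def in_toronto_area(latitude, longitude):
    """Loose sanity check; the bounds are rough assumptions, not official limits."""
    return 43.0 <= latitude <= 44.5 and -80.5 <= longitude <= -78.5

# Hypothetical bounding box, made up for illustration
sample_bbox = {'northeast': [43.76, -79.32], 'southwest': [43.74, -79.35]}
center = bbox_center(sample_bbox)

print(center)
print(in_toronto_area(*center))               # correct order passes the check
print(in_toronto_area(center[1], center[0]))  # swapped order fails the check
```

Running a check like this right after geocoding would have flagged the swapped coordinates before any venue requests were made.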

In [131]:
#Create new DataFrame to hold merged toronto neighborhood data
t_neighborhood_columns = ['PostalCode', 'Burough', 'Neighborhood', 'Latitude', 'Longitude']
lat_lon_df.set_index('Code Searched', inplace=True)

#Look up each postal code's coordinates and collect the merged rows in a list
#(DataFrame.append was removed in pandas 2.0, so the dataframe is built once at the end)
merged_rows = []
for index, row in clean_wiki_page_table.iterrows():
    postcode = row['Postal Code']
    merged_rows.append({
        'PostalCode': postcode,
        'Burough': row['Borough'],
        'Neighborhood': row['Neighbourhood'],
        'Latitude': lat_lon_df.loc[postcode, 'Latitude'],
        'Longitude': lat_lon_df.loc[postcode, 'Longitude']
    })
toronto_neighborhoods = pd.DataFrame(merged_rows, columns=t_neighborhood_columns)

#Display dataframe
toronto_neighborhoods
    


Unnamed: 0,PostalCode,Burough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.750384,-79.335351
1,M4A,North York,Victoria Village,43.729376,-79.312923
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.649954,-79.352845
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723346,-79.450757
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662415,-79.389786
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.662091,-79.528267
6,M1B,Scarborough,"Malvern, Rouge",43.810176,-79.190328
7,M3B,North York,Don Mills,43.748739,-79.35641
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.70738,-79.311904
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657723,-79.378585


Hiding my keys for security

In [200]:
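import os

#read the credentials from environment variables (the variable names are placeholders) so they are not stored in the notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret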
VERSION = '20180604'
LIMIT = 50

Reusing this bit of code from a previous assignment. This function builds the request to the Foursquare API, parses the response as JSON, iterates through the nested JSON data to extract venue names, coordinates, and categories, and returns them as a dataframe to work with.


In [248]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        #print(url) #- had to print the url and visit it directly for troubleshooting. I learned I had interpreted the latitude and longitude results from geocoder in the wrong order
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode','PostalCode Latitude','PostalCode Longitude','Venue','Venue Latitude','Venue Longitude','Venue Category']
    
    return(nearby_venues)
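The function above indexes straight into the nested response, so an empty or unexpected response raises an error (or silently yields nothing) without saying why. Below is a minimal sketch of a more defensive extraction step. The helper name `extract_venues` and the sample payload are made up for illustration; the payload only mirrors the keys the function above reads and is not a real API response.

```python
def extract_venues(response_json):
    """Pull (name, lat, lng, category) tuples out of an explore-style response."""
    groups = response_json.get('response', {}).get('groups', [])
    if not groups:
        return []  # nothing came back, e.g. when the coordinates are wrong
    venues = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories', [])
        # a venue may come back without a category; record None instead of failing
        category = categories[0]['name'] if categories else None
        venues.append((venue['name'],
                       venue['location']['lat'],
                       venue['location']['lng'],
                       category))
    return venues

# Hypothetical payload with the same nesting the function above expects
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}

parsed = extract_venues(sample)
empty = extract_venues({'response': {}})
print(parsed)
print(empty)
```

Returning an empty list for a response without groups makes it possible to log which postal codes produced no venues instead of stopping the whole loop.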

Formatting the dataframe to show every venue returned for every postal code

In [249]:
toronto_venues = getNearbyVenues(names=toronto_neighborhoods['PostalCode'],
                                   latitudes=toronto_neighborhoods['Latitude'],
                                   longitudes=toronto_neighborhoods['Longitude']
                                  )
toronto_venues.head()

M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M5K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z


Unnamed: 0,PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.750384,-79.335351,Donalda Golf & Country Club,43.752816,-79.342741,Golf Course
1,M3A,43.750384,-79.335351,Island Foods,43.745866,-79.346035,Caribbean Restaurant
2,M3A,43.750384,-79.335351,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
3,M3A,43.750384,-79.335351,Galleria Supermarket,43.75352,-79.349518,Supermarket
4,M3A,43.750384,-79.335351,Brookbanks Park,43.751976,-79.33214,Park


Checking how many venues (4893) and postal codes (103) are in the data

In [203]:
toronto_venues.shape

(4893, 7)

In [204]:
toronto_venues.groupby('PostalCode').count().shape

(103, 6)

In [205]:
toronto_venues.groupby('PostalCode').count()
print('There are {} unique Venue Categories'.format(len(toronto_venues['Venue Category'].unique())))

There are 295 unique Venue Categories


Venue categories are categorical values. Since we'll be feeding them into the k-means model, we want each category broken out into its own column (dimension) indicating whether that category applies to the venue.

In [250]:
# break the Venue Category out into dummy columns 
toronto_onehot = pd.get_dummies(toronto_venues['Venue Category'], prefix="", prefix_sep="")

# add postal code column to dataframe
toronto_onehot.insert(0,'PostalCode',toronto_venues['PostalCode'])

#Having trouble seeing all columns, so changing settings to allow all to display
pd.set_option('display.max_columns', None)
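To make the encoding easier to follow, here is a toy version on four made-up venues. It one-hot encodes the category column the same way as the cell above, then sums the indicator columns per postal code to get category counts. The postal codes and venues are invented for illustration.

```python
import pandas as pd

# Made-up venues, not taken from the Foursquare results
venues = pd.DataFrame({
    'PostalCode': ['M1A', 'M1A', 'M1A', 'M2B'],
    'Venue Category': ['Park', 'Coffee Shop', 'Coffee Shop', 'Park'],
})

# one indicator column per category; dtype=int gives 0/1 instead of booleans
onehot = pd.get_dummies(venues['Venue Category'], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'PostalCode', venues['PostalCode'])

# summing the indicators per postal code yields a count of each category
grouped = onehot.groupby('PostalCode').sum().reset_index()
print(grouped)
```

Each row of the grouped table is the feature vector for one postal code, which is the shape the clustering model expects.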


I'm summarizing the above one-hot table into a dataframe of category counts broken out by zip code

In [217]:
toronto_grouped = toronto_onehot.groupby('PostalCode').sum().reset_index()
toronto_grouped

Unnamed: 0,PostalCode,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Amphitheater,Antique Shop,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Automotive Shop,BBQ Joint,Baby Store,Badminton Court,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Beach,Beach Bar,Beer Bar,Beer Store,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Boutique,Bowling Alley,Brazilian Restaurant,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Burger Joint,Burrito Place,Bus Line,Bus Station,Bus Stop,Business Service,Butcher,Café,Camera Store,Campground,Cantonese Restaurant,Caribbean Restaurant,Casino,Castle,Cheese Shop,Chinese Restaurant,Chiropractor,Chocolate Shop,Church,Circus,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,College Stadium,Comedy Club,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Curling Ice,Dance Studio,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Fabric Shop,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Gas Station,Gastropub,Gay Bar,General Entertainment,German Restaurant,Gift Shop,Golf Course,Golf Driving Range,Gourmet Shop,Government Building,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Hakka Restaurant,Harbor / Marina,Hardware Store,Hawaiian Restaurant,Health & Beauty 
Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Hockey Arena,Hockey Field,Home Service,Hookah Bar,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,Hungarian Restaurant,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indonesian Restaurant,Intersection,Italian Restaurant,Japanese Restaurant,Jewelry Store,Jewish Restaurant,Juice Bar,Kids Store,Kitchen Supply Store,Korean Restaurant,Lake,Latin American Restaurant,Laundromat,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Market,Martial Arts School,Massage Studio,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Monument / Landmark,Movie Theater,Moving Target,Museum,Music School,Music Store,Music Venue,Nail Salon,National Park,Neighborhood,New American Restaurant,Nightclub,Noodle House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Supply Store,Paintball Field,Pakistani Restaurant,Paper / Office Supplies Store,Park,Pastry Shop,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Pizza Place,Playground,Plaza,Poke Place,Pool,Pool Hall,Portuguese Restaurant,Pub,Racecourse,Racetrack,Ramen Restaurant,Record Shop,Recreation Center,Rental Car Location,Restaurant,Rock Climbing Spot,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Science Museum,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Shopping Plaza,Skate Park,Skating Rink,Ski Chalet,Smoke Shop,Smoothie Shop,Snack Place,Soccer Field,Soccer Stadium,Soup Place,South American Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Sri Lankan Restaurant,Stables,Steakhouse,Street Art,Supermarket,Supplement Shop,Sushi Restaurant,Szechuan Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Tattoo Parlor,Tea Room,Tech Startup,Tennis Court,Tennis Stadium,Thai Restaurant,Theater,Theme Park,Theme Restaurant,Thrift / Vintage Store,Tibetan 
Restaurant,Toy / Game Store,Track,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,M1B,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,2,17
1,M1C,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,2,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M1E,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,7,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,M1G,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,2,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,M1H,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,3,0,5,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
5,M1J,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,3,2,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,4,4,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
6,M1K,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,2,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,1,0,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
7,M1L,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,2,0,0,0,0,0,0,0,0,0,0,3,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0
8,M1M,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,1,2,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,2,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
9,M1N,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,1,0,0,0,0,4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1,1,2,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


I will classify and label by zip code, rather than by neighborhood. In retrospect, a neighborhood classification may have been more practical.

In [251]:
#import the KMeans package
from sklearn.cluster import KMeans

#We will identify 5 distinct clusters
kclusters = 5

#Prep the dataframe for fitting by dropping the PostalCode column
toronto_grouped_clustering = toronto_grouped.drop(columns=['PostalCode'])

#Fit the model using the new dataframe, and the predetermined number of clusters above
cluster_model = KMeans(n_clusters=kclusters, random_state=9).fit(toronto_grouped_clustering)

#print length of labels to verify it matches the number of zip codes
len(cluster_model.labels_)


103
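The number of clusters is fixed at 5 here. One common way to sanity-check a choice like that is to compare the model's inertia (within-cluster sum of squares) across several values of k and look for the point where the improvement levels off. The sketch below does this on synthetic points, which is an assumption made for illustration; it does not use the Toronto data and says nothing about whether 5 is the right value for it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (not the Toronto venue counts)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=1.0, size=(30, 2)) for c in centers])

# fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=9).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this the inertia drops sharply up to the true number of blobs and only slowly afterwards; real venue data is usually less clear-cut.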

Group the zipcodes with their labels to join further down into a final table with coordinates

In [252]:
toronto_grouped['Labels'] = cluster_model.labels_
toronto_grouped[['PostalCode','Labels']]

Unnamed: 0,PostalCode,Labels
0,M1B,4
1,M1C,1
2,M1E,1
3,M1G,1
4,M1H,2
5,M1J,1
6,M1K,2
7,M1L,2
8,M1M,2
9,M1N,1


I thought I would improve on the class example by feeding in the whole count of nearby venue categories rather than just the top 10. I found the data a bit overwhelming to process. I settled on averaging the counts per cluster and sorting by columns rather than rows. In the end, I checked only the top 10 categories anyway. I had gone a roundabout way to do it the same way as the assignment. Listen to your teachers!
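The averaging and sorting described above can be tried on a toy table first. The sketch below averages category counts per cluster label and keeps the largest values for each cluster. The table contents are made up for illustration, and `nlargest` is used here as a shorthand for sorting and slicing.

```python
import pandas as pd

# Made-up category counts with a cluster label per postal code
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M1B', 'M2A', 'M2B'],
    'Coffee Shop': [4, 2, 0, 1],
    'Park': [1, 1, 3, 5],
    'Bank': [0, 1, 1, 0],
    'Labels': [0, 0, 1, 1],
})

# numeric_only skips the text PostalCode column when averaging
cluster_means = toy.groupby('Labels').mean(numeric_only=True)

# iterrows yields (cluster label, Series of category means)
top_categories = {label: means.nlargest(2)
                  for label, means in cluster_means.iterrows()}

for label, top in top_categories.items():
    print('Cluster', label)
    print(top)
```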

In [240]:
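#average each category count per cluster; numeric_only skips the text PostalCode column
df_to_sort = toronto_grouped.groupby('Labels').mean(numeric_only=True)

#iterrows yields (cluster label, Series of category means); print each cluster's top 10 categories
for label, category_means in df_to_sort.iterrows():
    print(category_means.sort_values(ascending=False)[:10])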
    

Coffee Shop            3.000000
Hotel                  2.214286
Café                   2.000000
Japanese Restaurant    1.785714
Clothing Store         1.785714
Restaurant             1.571429
Park                   1.571429
Plaza                  1.285714
Gym                    1.142857
Italian Restaurant     1.071429
Name: 0, dtype: float64
Coffee Shop              5.208333
Pizza Place              2.750000
Bank                     2.708333
Sandwich Place           2.458333
Fast Food Restaurant     2.250000
Pharmacy                 2.125000
Gas Station              1.708333
Grocery Store            1.541667
Vietnamese Restaurant    1.125000
Park                     1.000000
Name: 1, dtype: float64
Coffee Shop                  3.379310
Grocery Store                1.586207
Bakery                       1.310345
Chinese Restaurant           1.206897
Bank                         1.172414
Burger Joint                 1.068966
Restaurant                   1.068966
Japanese Restaurant       

After much thought, I'm going to hypothesize on the nature of the clusters:<br/>
Cluster 0 - Tourist/commercial center, because of the hotels and plazas<br/>
Cluster 1 - High-density residential - banks, fast food, more pharmacies than grocery stores<br/>
Cluster 2 - Less dense residential - not many venues, characterized by grocery stores, banks, restaurants<br/>
Cluster 3 - Nightlife area - date attractions like restaurants, cafes, ice cream, sushi, bars<br/>
Cluster 4 - Far-out zoo area - fast food, zoo, gas stations<br/>

In [243]:
toronto_merged = toronto_neighborhoods.join(toronto_grouped[['PostalCode','Labels']].set_index('PostalCode'),on='PostalCode')

In [242]:
toronto_merged

Unnamed: 0,PostalCode,Burough,Neighborhood,Latitude,Longitude,Labels
0,M3A,North York,Parkwoods,43.750384,-79.335351,3
1,M4A,North York,Victoria Village,43.729376,-79.312923,2
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.649954,-79.352845,3
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723346,-79.450757,0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662415,-79.389786,0
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.662091,-79.528267,1
6,M1B,Scarborough,"Malvern, Rouge",43.810176,-79.190328,4
7,M3B,North York,Don Mills,43.748739,-79.35641,2
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.70738,-79.311904,1
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657723,-79.378585,0


Iterate through the data and plot the rows on a map

In [247]:
import folium
import numpy as np

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Toronto Coordinates to initialize map
latitude = 43.6532
longitude = -79.3832

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
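The marker colors above are looked up with `cluster-1`, so label 0 wraps around to the last color in the list. The colors stay distinct, but the mapping is shifted by one. A minimal sketch of indexing by the label directly, assuming the labels run from 0 to `kclusters - 1` as the k-means labels do (the `color_for_label` name is made up for illustration):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# sample one evenly spaced color per cluster from the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# label 0 gets the first color, label 1 the second, and so on
color_for_label = {label: rainbow[label] for label in range(kclusters)}
print(color_for_label)
```

With a lookup like this, the same dictionary can also be used to build a legend that matches the markers.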