# Coursera Capstone Week 3 Assignment

### Segmenting and Clustering Neighborhoods in Toronto

# Part 1

In [3]:
!pip install geocoder

#Import necessary packages
import pandas as pd
import geocoder


First, we need to scrape the table from Wikipedia. We will do this using pandas.

In [4]:
#use read_html to import tables on the page
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
print(dfs[0].head())

  Postal Code           Borough              Neighbourhood
0         M1A      Not assigned               Not assigned
1         M2A      Not assigned               Not assigned
2         M3A        North York                  Parkwoods
3         M4A        North York           Victoria Village
4         M5A  Downtown Toronto  Regent Park, Harbourfront


In [5]:
#check to see if there are more tables
print(dfs[1].head())

                                                  0   ...   17
0                                                NaN  ...  NaN
1  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...  ...  NaN
2                                                 NL  ...   YT
3                                                  A  ...    Y

[4 rows x 18 columns]


It looks like the first table is the one we want. In the next code cell, we'll select the first table (index 0).

### Requirements for data preparation:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
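
The rules above can be sketched on a toy table. The rows below are invented for illustration (they are not the Wikipedia data), and only the column names follow the scraped table. This is one possible way to apply the rules with pandas, not necessarily the approach the assignment expects.

```python
import pandas as pd

# Toy rows, invented for illustration only
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M9Z"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule: keep only rows with an assigned borough
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule: an unassigned neighbourhood takes the borough's name
unassigned = clean["Neighbourhood"] == "Not assigned"
clean.loc[unassigned, "Neighbourhood"] = clean.loc[unassigned, "Borough"]

# Rule: one row per postal code, neighbourhoods joined with commas
clean = (
    clean.groupby(["Postal Code", "Borough"], as_index=False)["Neighbourhood"]
    .agg(", ".join)
)

print(clean)
print(clean.shape)
```

On the real table, the last two steps turn out to be no-ops, as the checks further down show.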

In [6]:
df = dfs[0]

In [7]:
# Filter out the cells that have 'Not assigned' in the Borough column
df = df[df.Borough != "Not assigned"]

# Check result
print(df.head(10))

   Postal Code           Borough                                Neighbourhood
2          M3A        North York                                    Parkwoods
3          M4A        North York                             Victoria Village
4          M5A  Downtown Toronto                    Regent Park, Harbourfront
5          M6A        North York             Lawrence Manor, Lawrence Heights
6          M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government
8          M9A         Etobicoke      Islington Avenue, Humber Valley Village
9          M1B       Scarborough                               Malvern, Rouge
11         M3B        North York                                    Don Mills
12         M4B         East York              Parkview Hill, Woodbine Gardens
13         M5B  Downtown Toronto                     Garden District, Ryerson


Conveniently, the rows are already combined where Neighbourhoods share a single Postal Code. We don't need to do any additional processing for this. 

In [8]:
# Check for rows where Neighbourhood has a value of 'Not assigned'
print(df[df.Neighbourhood == "Not assigned"])

Empty DataFrame
Columns: [Postal Code, Borough, Neighbourhood]
Index: []


There are no entries in the 'Neighbourhood' column where the value is 'Not assigned'. We don't need to do any processing here, either.

### Conclusion
We now have a cleaned dataframe.
- [x] Three columns: Postal Code, Borough, and Neighbourhood
- [x] Only process cells with an assigned borough
- [x] Where neighbourhoods share a postal code, include them in a single entry for the postal code in the 'Neighbourhood' column, separated by commas
- [x] If the 'Neighbourhood' value is 'Not assigned', change it to be the same as the value in the 'Borough' column

In [9]:
df.shape

(103, 3)

# Part 2

Now we need to get the latitude and longitude values for each entry. Although the geocoder package was installed above, the cell below queries the geocodeapi.io search endpoint directly with requests.

In [20]:
import requests

longitudes = []
latitudes = []

# The request headers are the same for every call, so build them once
headers = {
    # This key will be deleted once the assignment is complete
    "apikey": "32fcbcb0-65d7-11eb-b90f-1ff6313abf17"}

for index, row in df.iterrows():
  # get Postal Code for each entry (outside the try so it is always defined)
  neigh = row['Postal Code']
  try:
    params = (
        ("text", neigh + ", Toronto, Ontario, Canada"),
    )
    response = requests.get('https://app.geocodeapi.io/api/v1/search', headers=headers, params=params)
    result = response.json()
    # coordinates are ordered [longitude, latitude]
    longitude = result['features'][0]['geometry']['coordinates'][0]
    latitude = result['features'][0]['geometry']['coordinates'][1]
    longitudes.append(longitude)
    latitudes.append(latitude)
  except Exception:
    # record missing coordinates so the lists stay aligned with the dataframe
    print('something broke - using default values', neigh)
    longitudes.append(None)
    latitudes.append(None)

df['longitude'] = longitudes
df['latitude'] = latitudes

print(df.head())


something broke - using default values M5K
  Postal Code           Borough  ...  longitude   latitude
2         M3A        North York  ... -79.328265  43.754227
3         M4A        North York  ... -79.313559  43.724686
4         M5A  Downtown Toronto  ... -79.363640  43.656078
5         M6A        North York  ... -79.452785  43.721307
6         M7A  Downtown Toronto  ...  26.171175  44.427782

[5 rows x 5 columns]
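
The loop indexes straight into `result['features'][0]`, so a response with no features raises an error and lands in the except branch. That is one plausible cause of the M5K failure, though the notebook does not record the actual exception. A minimal sketch of a helper that makes the "no result" case explicit follows. The response shape is inferred from the indexing in the cell above; the helper name `extract_lon_lat` and the sample payloads are hypothetical.

```python
from typing import Optional, Tuple


def extract_lon_lat(result: dict) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) from the first feature, or None if absent."""
    features = result.get("features") or []
    if not features:
        return None
    coords = features[0].get("geometry", {}).get("coordinates", [])
    if len(coords) < 2:
        return None
    # GeoJSON stores positions as [longitude, latitude], in that order
    return float(coords[0]), float(coords[1])


# Hypothetical payloads shaped like the response indexed above
found = {"features": [{"geometry": {"coordinates": [-79.328265, 43.754227]}}]}
empty = {"features": []}

print(extract_lon_lat(found))
print(extract_lon_lat(empty))
```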


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
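
The warning appears because `df` was created by filtering `dfs[0]`, so pandas cannot tell whether the new columns are meant for the original table or for a copy. One common way to avoid it is to take an explicit `.copy()` right after filtering. The sketch below uses an invented toy frame, not the notebook's data.

```python
import pandas as pd

# Toy frame, invented for illustration only
table = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# .copy() makes the filtered frame independent of `table`
assigned = table[table["Borough"] != "Not assigned"].copy()

# Adding columns now modifies only `assigned`
assigned["longitude"] = [-79.328265, -79.313559]
assigned["latitude"] = [43.754227, 43.724686]

print(assigned)
print("longitude" in table.columns)
```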


In [27]:
print(df[df['longitude'].isnull()])

   Postal Code           Borough  ... longitude  latitude
67         M5K  Downtown Toronto  ...       NaN       NaN

[1 rows x 5 columns]


Well, that's weird. The location call failed for one row (M5K). We should still have enough data, so we can just get rid of the row with NaN values for lat/long.

In [29]:
df = df[df['longitude'].notnull()]
df.shape

(102, 5)
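
Dropping rows with missing coordinates does not catch rows where the geocoder returned a wrong location. In the output above, M7A came back with a longitude of 26.171175, while every other row sits near -79. A rough sanity check could look like the sketch below. The bounding box limits are assumptions chosen for illustration, not official city limits, and `in_toronto_bbox` is a hypothetical helper.

```python
# Rough bounding box around Toronto; these limits are assumed for illustration
LAT_MIN, LAT_MAX = 43.5, 43.9
LON_MIN, LON_MAX = -79.7, -79.1


def in_toronto_bbox(lat: float, lon: float) -> bool:
    """True if the point falls inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# (postal code, latitude, longitude) taken from the output above
rows = [
    ("M3A", 43.754227, -79.328265),
    ("M5A", 43.656078, -79.363640),
    ("M7A", 44.427782, 26.171175),
]

suspect = [pc for pc, lat, lon in rows if not in_toronto_bbox(lat, lon)]
print(suspect)  # ['M7A']
```

Rows flagged this way could be geocoded again or dropped together with the NaN rows.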

Alright! Now we have our data and we are ready to do some exploratory analysis.

In [30]:
!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# First we'll find the latitude and longitude of Toronto
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

/bin/bash: conda: command not found
The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [44]:
import folium
# Now we can map the neighborhoods using folium, with the neighborhood data points superimposed over the map of Toronto
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for pc, lat, lng, borough, neighbourhood in zip(df['Postal Code'], df['latitude'], df['longitude'], df['Borough'], df['Neighbourhood']):
    label = 'Postal Code: {}\nBorough: {}\nNeighbourhood(s): {}'.format(pc, borough, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
toronto_map

To simplify things, we will zoom in on the downtown area. It seems to have the highest concentration of data points. 

In [42]:
# First we need to create a new dataframe containing just the downtown area
dt_df = df[df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dt_df.head(10)

| | Postal Code | Borough | Neighbourhood | longitude | latitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | -79.36364 | 43.656078 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 26.171175 | 44.427782 |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | -79.3776 | 43.657181 |
| 3 | M5C | Downtown Toronto | St. James Town | -79.375694 | 43.651386 |
| 4 | M5E | Downtown Toronto | Berczy Park | -79.374916 | 43.645466 |
| 5 | M5G | Downtown Toronto | Central Bay Street | -79.38171 | 43.64877 |
| 6 | M6G | Downtown Toronto | Christie | -79.38171 | 43.64877 |
| 7 | M5H | Downtown Toronto | Richmond, Adelaide, King | -79.38171 | 43.64877 |
| 8 | M5J | Downtown Toronto | Harbourfront East, Union Station, Toronto Islands | -79.38171 | 43.64877 |
| 9 | M5L | Downtown Toronto | Commerce Court, Victoria Hotel | -79.38171 | 43.64877 |


In [45]:
# Now we'll get the coordinates for Downtown Toronto as we did above

# Find the latitude and longitude of Downtown Toronto
address = 'Downtown Toronto, Ontario'

geolocator = Nominatim(user_agent="explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


Now we can make another folium map, this time focusing on Downtown Toronto.

In [41]:
# Map the downtown postal codes using folium, with the data points superimposed over the map of Downtown Toronto
dtt_map = folium.Map(location=[latitude, longitude], zoom_start=14)

# add markers to map
for pc, lat, lng, borough, neighbourhood in zip(dt_df['Postal Code'], dt_df['latitude'], dt_df['longitude'], dt_df['Borough'], dt_df['Neighbourhood']):
    label = 'Postal Code: {}\nBorough: {}\nNeighbourhood(s): {}'.format(pc, borough, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(dtt_map)  
    
dtt_map

We'll do the same thing as in the lab - looking at nearby venues.

In [46]:
# Setup and test Foursquare API request
LIMIT = 100
radius = 500
VERSION = '20180605' # Foursquare API version
# These will be reset
CLIENT_ID = 'C5KFFWHCPEDZHER2PK10Y3K2ZIZWN1G404VRJAVE2IHT0SQS'
CLIENT_SECRET = 'Y1P0UJZHJ21QI5FEL3F1FZJ3COKTC24LFQM2CHDLQCBLBIKT'
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '601d9b9fddc7935877ad183f'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-57eda381498ebe0e6ef40972-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/shops/apparel_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d103951735',
         'name': 'Clothing Store',
         'pluralName': 'Clothing Stores',
         'primary': True,
         'shortName': 'Apparel'}],
       'id': '57eda381498ebe0e6ef40972',
       'location': {'address': '220 Yonge St',
        'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'crossStreet': 'at Dundas St W',
        'distance': 50,
        'formattedAddress': ['220 Yonge St (at Dundas St W)',
         'Toronto ON M5B 2H1',
         'Canada'],
        'labeledLatLngs': 
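
The request URL above is built with string formatting, and the credentials are written directly into the cell. A sketch of an alternative reads the credentials from environment variables and lets `urllib.parse.urlencode` assemble the query string. The environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, radius=500, limit=100, version="20180605"):
    """Assemble the explore URL with the same parameters used in the notebook."""
    params = {
        # Hypothetical variable names; set them in your own environment
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", ""),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", ""),
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


url = build_explore_url(43.6563221, -79.3809161)
# Print only the endpoint so credentials never end up in the cell output
print(url.split("?")[0])
```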

Now that we've confirmed we can get venue data from the Foursquare API, let's get the data for each of our data points.

In [58]:
# First we'll create a function to be run for each postal code
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

# And now we'll run it for every postal code
dt_venues = getNearbyVenues(names = dt_df['Postal Code'],
                                latitudes = dt_df['latitude'],
                                longitudes = dt_df['longitude']
                                )

M5A
M7A
M5B
M5C
M5E
M5G
M6G
M5H
M5J
M5L
M5S
M5T
M5V
M4W
M5W
M4X
M5X
M4Y
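
`getNearbyVenues` assumes every item has at least one category, so `v['venue']['categories'][0]` would raise an `IndexError` on an item with an empty list. A sketch of a row extractor that tolerates that case is shown below. The field names come from the function above; the helper `venue_row`, the "Uncategorized" fallback label and the sample items are hypothetical.

```python
def venue_row(postal_code, lat, lng, item, fallback="Uncategorized"):
    """Flatten one Foursquare item into the 7-field tuple used in the notebook."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    # Use the first category when present, otherwise the fallback label
    category = categories[0]["name"] if categories else fallback
    return (
        postal_code,
        lat,
        lng,
        venue["name"],
        venue["location"]["lat"],
        venue["location"]["lng"],
        category,
    )


# Hypothetical items shaped like the API response shown earlier
items = [
    {"venue": {"name": "Tandem Coffee",
               "location": {"lat": 43.653559, "lng": -79.361809},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": []}},
]

rows = [venue_row("M5A", 43.656078, -79.36364, item) for item in items]
for row in rows:
    print(row)
```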


In [59]:
# And now we can check the size of our dataset
print(dt_venues.shape)
dt_venues.head()

(1569, 7)


| | Postal Code | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | M5A | 43.656078 | -79.36364 | Figs Breakfast & Lunch | 43.655675 | -79.364503 | Breakfast Spot |
| 1 | M5A | 43.656078 | -79.36364 | Tandem Coffee | 43.653559 | -79.361809 | Coffee Shop |
| 2 | M5A | 43.656078 | -79.36364 | Roselle Desserts | 43.653447 | -79.362017 | Bakery |
| 3 | M5A | 43.656078 | -79.36364 | Sukhothai | 43.658444 | -79.365681 | Thai Restaurant |
| 4 | M5A | 43.656078 | -79.36364 | The Yoga Lounge | 43.655515 | -79.364955 | Yoga Studio |


Ok, let's forge ahead with some further analysis. I'm curious what types of venues are available in Downtown Toronto, and their frequency.

In [60]:
venue_type_count = dt_venues.groupby(['Venue Category'])['Venue'].count()
# Okay now that we have the venues counted by category, let's check what is the most common
venue_type_count.sort_values(inplace=True, ascending=False)
print(venue_type_count.head())

Venue Category
Coffee Shop            155
Hotel                  114
Restaurant              89
Café                    88
Japanese Restaurant     57
Name: Venue, dtype: int64


Wow! That is a whole lot of coffee shops! Apparently Toronto is fueled by caffeine! Let's look at the top 10 to see if there are any other notable venue types. It seems like restaurants are broken down by type of cuisine, so the list may be deceiving...

In [61]:
print(venue_type_count.head(10))

Venue Category
Coffee Shop            155
Hotel                  114
Restaurant              89
Café                    88
Japanese Restaurant     57
Gym                     57
Seafood Restaurant      49
American Restaurant     43
Steakhouse              41
Asian Restaurant        40
Name: Venue, dtype: int64


Alright, now we're getting a clearer picture. Even if we do not count 'Coffee Shop' in the restaurant category, seven of the ten most common venue types in Downtown Toronto are places to eat (counting 'Café' and 'Steakhouse'). The top two of these (Restaurant and Café) account for 177 of the 1,569 venues in Downtown Toronto.
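
Because cuisine-specific categories are counted separately, it can help to roll them up. The sketch below does this for the top-10 counts printed above. The grouping rule (any category whose name contains "Restaurant") is an assumption made for illustration; it deliberately leaves out 'Steakhouse' and 'Café', which a broader rule might include.

```python
# Top-10 counts copied from the output above
top10 = {
    "Coffee Shop": 155,
    "Hotel": 114,
    "Restaurant": 89,
    "Café": 88,
    "Japanese Restaurant": 57,
    "Gym": 57,
    "Seafood Restaurant": 49,
    "American Restaurant": 43,
    "Steakhouse": 41,
    "Asian Restaurant": 40,
}

# Assumed rule: a category is a restaurant if its name contains "Restaurant"
restaurant_total = sum(n for name, n in top10.items() if "Restaurant" in name)
restaurant_and_cafe = top10["Restaurant"] + top10["Café"]

print(restaurant_total)     # 278
print(restaurant_and_cafe)  # 177
```

Under this rule, restaurants in the top 10 alone outnumber coffee shops.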

## Now to look at the postal codes


In [64]:
# one hot encoding
dt_onehot = pd.get_dummies(dt_venues[['Venue Category']], prefix="", prefix_sep="")

# add postal code column back to dataframe
dt_onehot['Postal Code'] = dt_venues['Postal Code'] 

# move the postal code column to the first column
fixed_columns = [dt_onehot.columns[-1]] + list(dt_onehot.columns[:-1])
dt_onehot = dt_onehot[fixed_columns]

dt_onehot.head()

Unnamed: 0,Postal Code,American Restaurant,Art Gallery,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Basketball Stadium,Beer Bar,Belgian Restaurant,Bistro,Bookstore,Breakfast Spot,Bubble Tea Shop,Burger Joint,Burrito Place,Café,Cheese Shop,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Cosmetics Shop,Creperie,Cupcake Shop,Deli / Bodega,Department Store,Diner,Discount Store,Distribution Center,Eastern European Restaurant,Electronics Store,Event Space,Falafel Restaurant,Farmers Market,...,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Opera House,Park,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pub,Ramen Restaurant,Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Shopping Mall,Spa,Speakeasy,Sporting Goods Shop,Steakhouse,Supermarket,Sushi Restaurant,Tailor Shop,Tea Room,Thai Restaurant,Theater,Thrift / Vintage Store,Vegetarian / Vegan Restaurant,Wine Bar,Yoga Studio
0,M5A,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M5A,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M5A,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,M5A,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4,M5A,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


Now we'll group the rows by postal code and take the mean frequency of each venue type.

In [65]:
dt_grouped = dt_onehot.groupby('Postal Code').mean().reset_index()
dt_grouped

Unnamed: 0,Postal Code,American Restaurant,Art Gallery,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Basketball Stadium,Beer Bar,Belgian Restaurant,Bistro,Bookstore,Breakfast Spot,Bubble Tea Shop,Burger Joint,Burrito Place,Café,Cheese Shop,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Cosmetics Shop,Creperie,Cupcake Shop,Deli / Bodega,Department Store,Diner,Discount Store,Distribution Center,Eastern European Restaurant,Electronics Store,Event Space,Falafel Restaurant,Farmers Market,...,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Opera House,Park,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pub,Ramen Restaurant,Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Shopping Mall,Spa,Speakeasy,Sporting Goods Shop,Steakhouse,Supermarket,Sushi Restaurant,Tailor Shop,Tea Room,Thai Restaurant,Theater,Thrift / Vintage Store,Vegetarian / Vegan Restaurant,Wine Bar,Yoga Studio
0,M4W,0.03,0.01,0.03,0.0,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.06,0.0,0.0,0.0,0.1,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.06,0.03,0.01,0.01,0.03,0.0,0.0,0.01,0.0,0.03,0.0,0.01,0.01,0.01,0.02,0.01,0.0,0.01,0.01,0.0
1,M4X,0.03,0.01,0.03,0.0,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.06,0.0,0.0,0.0,0.1,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.06,0.03,0.01,0.01,0.03,0.0,0.0,0.01,0.0,0.03,0.0,0.01,0.01,0.01,0.02,0.01,0.0,0.01,0.01,0.0
2,M4Y,0.03,0.01,0.03,0.0,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.06,0.0,0.0,0.0,0.1,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.06,0.03,0.01,0.01,0.03,0.0,0.0,0.01,0.0,0.03,0.0,0.01,0.01,0.01,0.02,0.01,0.0,0.01,0.01,0.0
3,M5A,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.178571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.035714,0.0,0.0,0.035714,0.035714,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.035714,0.0,0.0,0.035714,0.0,0.107143,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.035714,0.035714,0.0,0.0,0.035714
4,M5B,0.0,0.0,0.0,0.0,0.0,0.014493,0.0,0.0,0.0,0.0,0.0,0.014493,0.0,0.014493,0.014493,0.014493,0.028986,0.0,0.043478,0.0,0.086957,0.014493,0.0,0.0,0.014493,0.0,0.028986,0.0,0.0,0.0,0.014493,0.0,0.0,0.014493,0.0,0.014493,0.0,0.014493,0.0,...,0.014493,0.014493,0.014493,0.014493,0.0,0.0,0.0,0.028986,0.0,0.014493,0.014493,0.0,0.0,0.0,0.0,0.0,0.028986,0.014493,0.0,0.014493,0.028986,0.0,0.0,0.028986,0.014493,0.014493,0.014493,0.0,0.014493,0.014493,0.0,0.0,0.0,0.014493,0.014493,0.028986,0.0,0.0,0.014493,0.0
5,M5C,0.035294,0.011765,0.011765,0.011765,0.0,0.035294,0.0,0.0,0.011765,0.011765,0.011765,0.011765,0.011765,0.0,0.011765,0.011765,0.058824,0.0,0.023529,0.047059,0.070588,0.0,0.0,0.011765,0.0,0.0,0.023529,0.023529,0.0,0.0,0.023529,0.011765,0.0,0.0,0.0,0.011765,0.0,0.0,0.023529,...,0.0,0.011765,0.0,0.0,0.0,0.011765,0.023529,0.0,0.0,0.0,0.011765,0.0,0.0,0.011765,0.011765,0.011765,0.0,0.0,0.011765,0.0,0.023529,0.0,0.011765,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011765,0.0,0.011765,0.011765,0.0,0.011765,0.011765,0.0
6,M5E,0.0125,0.0125,0.0,0.0,0.0125,0.0375,0.0,0.0125,0.025,0.0,0.0125,0.0,0.025,0.0,0.0,0.0,0.0375,0.025,0.025,0.0375,0.1,0.0,0.0,0.0125,0.0,0.0125,0.0,0.0125,0.0,0.0125,0.0125,0.0125,0.0,0.0,0.0125,0.0,0.0,0.0,0.025,...,0.0,0.0,0.0,0.0,0.0125,0.0,0.0,0.0,0.0125,0.0,0.0,0.0125,0.0,0.0125,0.0,0.0125,0.0,0.0,0.0125,0.0,0.0375,0.0,0.0,0.0125,0.05,0.0,0.0,0.0,0.0125,0.0125,0.0,0.0125,0.0125,0.0,0.0125,0.0,0.0,0.0125,0.0,0.0125
7,M5G,0.03,0.01,0.03,0.0,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.06,0.0,0.0,0.0,0.1,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.06,0.03,0.01,0.01,0.03,0.0,0.0,0.01,0.0,0.03,0.0,0.01,0.01,0.01,0.02,0.01,0.0,0.01,0.01,0.0
8,M5H,0.03,0.01,0.03,0.0,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.06,0.0,0.0,0.0,0.1,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.06,0.03,0.01,0.01,0.03,0.0,0.0,0.01,0.0,0.03,0.0,0.01,0.01,0.01,0.02,0.01,0.0,0.01,0.01,0.0
9,M5J,0.03,0.01,0.03,0.0,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.06,0.0,0.0,0.0,0.1,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.06,0.03,0.01,0.01,0.03,0.0,0.0,0.01,0.0,0.03,0.0,0.01,0.01,0.01,0.02,0.01,0.0,0.01,0.01,0.0
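
Averaging one-hot rows per postal code gives, for each category, the share of that postal code's venues that fall in the category. In the table above, for example, M5A's Coffee Shop value of 0.178571 is consistent with 5 coffee shops out of 28 venues, since the smallest non-zero value in that row is 0.035714, or 1/28. The sketch below reproduces the same idea in plain Python on invented data.

```python
from collections import Counter

# Invented (postal code, category) pairs, for illustration only
venues = [
    ("M5A", "Coffee Shop"),
    ("M5A", "Coffee Shop"),
    ("M5A", "Bakery"),
    ("M5A", "Yoga Studio"),
    ("M5B", "Hotel"),
    ("M5B", "Coffee Shop"),
]

totals = Counter(pc for pc, _ in venues)  # venues per postal code
pairs = Counter(venues)                   # venues per (postal code, category)

# Share of each category within its postal code; this equals the mean of the
# category's one-hot column over that postal code's rows
freq = {(pc, cat): n / totals[pc] for (pc, cat), n in pairs.items()}

print(freq[("M5A", "Coffee Shop")])  # 0.5
print(freq[("M5B", "Hotel")])        # 0.5
```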


Next, we want to print out each postal code along with the top 5 venues in each one.

In [67]:
num_top_venues = 5

for pc in dt_grouped['Postal Code']:
    print("----"+pc+"----")
    temp = dt_grouped[dt_grouped['Postal Code'] == pc].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4W----
                 venue  freq
0          Coffee Shop  0.10
1                Hotel  0.08
2                 Café  0.06
3           Restaurant  0.06
4  Japanese Restaurant  0.04


----M4X----
                 venue  freq
0          Coffee Shop  0.10
1                Hotel  0.08
2                 Café  0.06
3           Restaurant  0.06
4  Japanese Restaurant  0.04


----M4Y----
                 venue  freq
0          Coffee Shop  0.10
1                Hotel  0.08
2                 Café  0.06
3           Restaurant  0.06
4  Japanese Restaurant  0.04


----M5A----
                  venue  freq
0           Coffee Shop  0.18
1            Restaurant  0.11
2        Breakfast Spot  0.07
3         Grocery Store  0.04
4  Gym / Fitness Center  0.04


----M5B----
                venue  freq
0         Coffee Shop  0.09
1               Hotel  0.06
2      Clothing Store  0.04
3  Italian Restaurant  0.04
4       Movie Theater  0.03


----M5C----
                venue  freq
0         Coffee Sho

We want this output in a dataframe, but first we need to create a function for sorting the venues.

In [68]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [78]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions past the third have no entry in indicators and use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
dt_venues_sorted = pd.DataFrame(columns=columns)
dt_venues_sorted['Postal Code'] = dt_grouped['Postal Code']

for ind in np.arange(dt_grouped.shape[0]):
    dt_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_grouped.iloc[ind, :], num_top_venues)

dt_venues_sorted.head()

| | Postal Code | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M4W | Coffee Shop | Hotel | Café | Restaurant | Gym | Japanese Restaurant | Salad Place | Asian Restaurant | Steakhouse | Deli / Bodega |
| 1 | M4X | Coffee Shop | Hotel | Café | Restaurant | Gym | Japanese Restaurant | Salad Place | Asian Restaurant | Steakhouse | Deli / Bodega |
| 2 | M4Y | Coffee Shop | Hotel | Café | Restaurant | Gym | Japanese Restaurant | Salad Place | Asian Restaurant | Steakhouse | Deli / Bodega |
| 3 | M5A | Coffee Shop | Restaurant | Breakfast Spot | Yoga Studio | Bakery | Diner | Discount Store | Electronics Store | Event Space | Food Truck |
| 4 | M5B | Coffee Shop | Hotel | Clothing Store | Italian Restaurant | Café | Pizza Place | Restaurant | Japanese Restaurant | Sandwich Place | Movie Theater |


We already knew this, but WOW are coffee shops popular! They are the most common venue in each of the first five postal codes.

## Now we are going to do some clustering

We are working with a smaller number of data points, so I think we will probably need fewer clusters than in the Manhattan example we looked at in the lab.

In [79]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 3

# drop the label column so only the venue frequencies are clustered
dt_grouped_clustering = dt_grouped.drop(columns='Postal Code')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dt_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1], dtype=int32)

Now we will create a new dataframe with the cluster label and top 10 venues for each postal code.

In [85]:
# add clustering labels
dt_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dt_merged = dt_df

# join dt_venues_sorted onto dt_df to add the cluster label and top venues for each postal code
dt_merged = dt_merged.join(dt_venues_sorted.set_index('Postal Code'), on='Postal Code')

dt_merged.head() # check the last columns!

| | Postal Code | Borough | Neighbourhood | longitude | latitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | -79.36364 | 43.656078 | 0 | Coffee Shop | Restaurant | Breakfast Spot | Yoga Studio | Bakery | Diner | Discount Store | Electronics Store | Event Space | Food Truck |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 26.171175 | 44.427782 | 2 | Supermarket | Grocery Store | Farmers Market | Gym | Eastern European Restaurant | Restaurant | Cupcake Shop | Deli / Bodega | Department Store | Diner |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | -79.3776 | 43.657181 | 0 | Coffee Shop | Hotel | Clothing Store | Italian Restaurant | Café | Pizza Place | Restaurant | Japanese Restaurant | Sandwich Place | Movie Theater |
| 3 | M5C | Downtown Toronto | St. James Town | -79.375694 | 43.651386 | 0 | Coffee Shop | Café | Seafood Restaurant | Cocktail Bar | American Restaurant | Bakery | Gastropub | Italian Restaurant | Hotel | Moroccan Restaurant |
| 4 | M5E | Downtown Toronto | Berczy Park | -79.374916 | 43.645466 | 0 | Coffee Shop | Seafood Restaurant | Hotel | Restaurant | Café | Cocktail Bar | Bakery | Beer Bar | Cheese Shop | Clothing Store |


## Finally, we will visualize the clusters

In [89]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dt_merged['latitude'], dt_merged['longitude'], dt_merged['Postal Code'], dt_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Conclusion

Oddly, there look to be only 5 points on the map. I had a look at the data, and it appears to be a byproduct of the geolocation API I chose to use. Many points overlap because several Postal Codes share exactly the same latitude and longitude coordinates. It would likely be more informative to go back and look at the larger dataset (everything, not just Downtown Toronto), or possibly to use a different geolocation API that provides more granular coordinates for Postal Codes.
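
The overlap can be checked directly by grouping postal codes on their coordinates. The sketch below uses the ten rows from the Downtown Toronto table shown earlier, not the full dataset, so the counts only describe that sample.

```python
from collections import defaultdict

# (postal code, longitude, latitude) copied from the Downtown Toronto table
rows = [
    ("M5A", -79.36364, 43.656078),
    ("M7A", 26.171175, 44.427782),
    ("M5B", -79.3776, 43.657181),
    ("M5C", -79.375694, 43.651386),
    ("M5E", -79.374916, 43.645466),
    ("M5G", -79.38171, 43.64877),
    ("M6G", -79.38171, 43.64877),
    ("M5H", -79.38171, 43.64877),
    ("M5J", -79.38171, 43.64877),
    ("M5L", -79.38171, 43.64877),
]

by_point = defaultdict(list)
for pc, lon, lat in rows:
    by_point[(lon, lat)].append(pc)

# Keep only coordinates shared by more than one postal code
shared = {point: pcs for point, pcs in by_point.items() if len(pcs) > 1}

print(shared)
print(len(by_point))  # 6 distinct points among these 10 postal codes
```

Five of these ten postal codes collapse onto a single point, which also means their venue profiles are identical and the clustering sees them as duplicates.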