# Toronto Neighborhoods: Explore, Analyze, and Cluster
### Toronto M Postal Code Data 

### Table of Contents
1. [Installing and Importing of Libraries](#libraries)
2. [Part 1 Data Identification and Preparation](#Part1)
3. [Part 2 Getting Latitude and Longitude of Neighbourhoods](#Part2)
4. [Part 3 Clustering Toronto Neighbourhoods](#Part3)

__ __

### 1. Installing and Importing of Libraries <a class="anchor" id="libraries"></a>

In [1]:
try:
    print("Installing Libraries...\n")
    !conda install -c conda-forge beautifulsoup4 --yes
    print("BeautifulSoup4 has been successfully installed!\n")
    !conda install -c conda-forge ProgressBar2 --yes
    print("ProgressBar has been successfully installed!\n")
    !conda install -c conda-forge lxml --yes
    print("lxml has been successfully installed!\n")
    !conda install -c conda-forge geopy --yes
    print("GeoPy has been successfully installed!\n")
    !conda install -c conda-forge folium=0.5.0 --yes
    print("Folium has been successfully installed!\n")
    print("Libraries have been successfully installed!\n")
except:
    print("ERROR: could not install Libraries!\n")

try:
    print("Importing libraries...\n")
    import numpy as np # library to handle data in a vectorized manner
    import pandas as pd # library for data analysis
    from bs4 import BeautifulSoup as bts # library for web scraping
    from pandas.io.json import json_normalize
    from IPython.display import Image 
    from IPython.core.display import HTML 
    import matplotlib as mp # library for visualization
    import matplotlib.cm as cm
    import matplotlib.colors as colors
    import requests # library to handle requests
    from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
    from sklearn.cluster import KMeans # import k-means from clustering stage
    import folium # map rendering library
    import lxml
    import re
    from time import sleep
    print("All libraries imported successfully!\n")
except:
    print("ERROR: Could not import all libraries!\n")

%matplotlib inline

Installing Libraries...

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    soupsieve-1.7.3            |           py36_0          52 KB  conda-forge
    cryptography-2.4.2         |   py36h1ba5d50_0         618 KB
    beautifulsoup4-4.7.1       |        py36_1001         140 KB  conda-forge
    openssl-1.1.1a             |    h14c3975_1000         4.0 MB  conda-forge
    libarchive-3.3.3           |       h5d8350f_5         1.5 MB
    grpcio-1.16.1              |   py36hf8bcb03_1         1.1 MB
    conda-4.6.2                |           py36_0         869 KB  conda-forge
    libssh2-1.8.0              |                1         239 KB  conda-forge
    python-3.6.8               |       h0371630_0        34.4 MB
    ---------------------------------

# Part 1 <a class="anchor" id="Part1"></a>
### __Data Identification and Preparation__

### 2. Scraping and Cleaning Toronto Neighborhood Data <a class="anchor" id="neighborhood_data"></a>


__Read the given Wikipedia page using the pandas `read_html` method__

In [2]:
try:
    print("Reading web page ...")
    url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
    wikipage = pd.read_html(url)
    print("Web page read  successful !")
except:
    print("ERROR: could not read web page.\n")

Reading web page ...
Web page read  successful !


__Check object type of wikipage object__

In [3]:
type(wikipage)

list

__`wikipage` is a list; get its length__

In [4]:
len(wikipage)

3

__The list has 3 elements; inspect them one by one__

In [5]:
wikipage[0]

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned


In [6]:
wikipage[1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,,Canadian postal codes,,,,,,,,,...,,,,,,,,,,
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL,NS,PE,NB,QC,ON,MB,SK,AB,...,L,M,N,P,R,S,T,V,X,Y
2,NL,NS,PE,NB,QC,ON,MB,SK,AB,BC,...,,,,,,,,,,
3,A,B,C,E,G,H,J,K,L,M,...,,,,,,,,,,
4,NL,NS,PE,NB,QC,ON,MB,SK,AB,BC,...,,,,,,,,,,
5,A,B,C,E,G,H,J,K,L,M,...,,,,,,,,,,


In [7]:
wikipage[2]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,NL,NS,PE,NB,QC,ON,MB,SK,AB,BC,NU/NT,YT,,,,,,
1,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


__`wikipage` is a list of tables, and the relevant data is in `wikipage[0]`__

__Create a data frame from `wikipage[0]`__

In [8]:
neighbour = pd.DataFrame(wikipage[0])
neighbour.head()

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


__Assign a proper header to the data frame. The header is the first row of the data frame itself: ['Postcode','Borough','Neighbourhood']__

In [9]:
header=neighbour[0:1].values.tolist()
header

[['Postcode', 'Borough', 'Neighbourhood']]

In [10]:
header = header[0]

In [11]:
neighbour.columns = header
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


__Delete row 0 and reset the index - Clean up neighbour data frame__

In [12]:
neighbour.drop(neighbour.index[:1], inplace=True)
neighbour.reset_index(drop=True, inplace=True)
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


__Drop all rows where Borough is "Not assigned" and reset the index - Clean up neighbour data frame__

In [13]:
# Replace all 'Not assigned' in borough with np.nan
neighbour.replace({'Borough': 'Not assigned' }, np.nan, inplace = True)
# Drop whole row with NaN in "Borough" column
neighbour.dropna(subset=["Borough"], axis=0, inplace=True)
# reset index, because we dropped some rows
neighbour.reset_index(drop=True, inplace=True)
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


__Merge neighbourhood with same postal code - Prepare neighbour data frame__

In [14]:
# Groupby and join can be used for the purpose
neighbour = neighbour.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


__Set Neighbourhood = Borough where Neighbourhood is equal to 'Not assigned' - Prepare neighbour data frame__

In [15]:
neighbour['Neighbourhood'] = np.where(neighbour['Neighbourhood'] == 'Not assigned', neighbour['Borough'], neighbour['Neighbourhood'])
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


__Neighbour data frame shape__

In [16]:
neighbour.shape

(103, 3)
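
The cleaning steps in Part 1 can also be chained into a single function. The following is a minimal sketch, assuming a raw table laid out like `wikipage[0]` (header in the first row). The function name `clean_postcodes` is hypothetical, and the sample uses a few rows copied from the table shown earlier so that it runs without a network connection.

```python
import numpy as np
import pandas as pd


def clean_postcodes(raw):
    """Hypothetical helper: apply the Part 1 cleaning steps to a raw table."""
    # Promote the first row to the header and drop it from the data.
    df = raw.iloc[1:].copy()
    df.columns = raw.iloc[0].tolist()
    # Keep only rows whose borough is assigned.
    df = df[df['Borough'] != 'Not assigned']
    # Merge neighbourhoods that share a postcode and borough.
    df = (df.groupby(['Postcode', 'Borough'])['Neighbourhood']
            .apply(', '.join)
            .reset_index())
    # Fall back to the borough name where no neighbourhood is assigned.
    df['Neighbourhood'] = np.where(df['Neighbourhood'] == 'Not assigned',
                                   df['Borough'], df['Neighbourhood'])
    return df


# A few rows in the same layout as wikipage[0].
sample = pd.DataFrame([
    ['Postcode', 'Borough', 'Neighbourhood'],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M5A', 'Downtown Toronto', 'Harbourfront'],
    ['M5A', 'Downtown Toronto', 'Regent Park'],
    ['M7A', "Queen's Park", 'Not assigned'],
])
cleaned = clean_postcodes(sample)
print(cleaned)
```

Filtering with a boolean mask replaces the `replace`/`dropna` pair used above; both approaches should give the same rows.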

# Part 2  <a class="anchor" id="Part2"></a>
### __Getting Latitude and Longitude of Neighbourhoods__

__ __

__Create dataframe 'toronto_df' with rows containing 'Toronto' in the Borough column and reset the index__

In [17]:
# Filter and reset the index in one step; this returns a new DataFrame rather than a view of `neighbour`
toronto_df = neighbour[neighbour['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(toronto_df.shape)
toronto_df.head()

(38, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M4E,East Toronto,The Beaches
1,M4K,East Toronto,"The Danforth West, Riverdale"
2,M4L,East Toronto,"The Beaches West, India Bazaar"
3,M4M,East Toronto,Studio District
4,M4N,Central Toronto,Lawrence Park


__Add two columns ('Latitude', 'Longitude') to dataframe toronto_df__

In [23]:
#toronto_df['Latitude'] = np.nan
#toronto_df['Longitude'] = np.nan
toronto_df.loc[:,'Latitude'] = np.nan
toronto_df.loc[:,'Longitude'] = np.nan


In [24]:
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,,
1,M4K,East Toronto,"The Danforth West, Riverdale",,
2,M4L,East Toronto,"The Beaches West, India Bazaar",,
3,M4M,East Toronto,Studio District,,
4,M4N,Central Toronto,Lawrence Park,,


__Gather the latitude and longitude coordinates for each neighbourhood using the geolocator__

In [25]:
try:
    for index in range(0, toronto_df['Neighbourhood'].shape[0]):
        geolocator = Nominatim(user_agent="ny_explorer")
        location = geolocator.geocode(toronto_df.loc[index,'Neighbourhood'].split(",")[0] + ", Toronto" + ", Canada")    
        if (location != None and location !=""):
            toronto_df.loc[index,'Latitude'] = location.latitude
            toronto_df.loc[index,'Longitude'] = location.longitude
            print(str(index) + "- Neighbourhood - " + toronto_df.loc[index,'Neighbourhood'].split(",")[0] + ", Toronto"  + ", Canada" + " Latitude = " + str(location.latitude) + "  Longitude = " + str(location.longitude))
        else:
            attempt=1 # retry up to 3 times if the lookup fails
            while(location is None):
                print(str(index) + "- Neighbourhood - " + "Failed to get Location Attempt - " + str(attempt))
                location = geolocator.geocode(toronto_df.loc[index,'Neighbourhood'].split(",")[0] + ", Toronto" + ", Canada")
                if (location != None):
                    toronto_df.loc[index,'Latitude'] = location.latitude
                    toronto_df.loc[index,'Longitude'] = location.longitude
                    print(str(index) + "- Neighbourhood - " + toronto_df.loc[index,'Neighbourhood'].split(",")[0] + ", Toronto"  + ", Canada" + " Latitude = " + str(location.latitude) + "  Longitude = " + str(location.longitude))
                attempt=attempt+1
                sleep(1)
                
                if(attempt>3): # after 3 failed attempts, fall back to the location of the related borough instead of the neighbourhood
            
                    location = geolocator.geocode(toronto_df.loc[index,'Borough'].split(",")[0] + ", Toronto" + ", Canada")
                    if (location != None):
                        toronto_df.loc[index,'Latitude'] = location.latitude
                        toronto_df.loc[index,'Longitude'] = location.longitude
                        print(str(index) + "- Borough - " + toronto_df.loc[index,'Borough'].split(",")[0] + ", Toronto"  + ", Canada" + " Latitude = " + str(location.latitude) + "  Longitude = " + str(location.longitude))
                    else:
                        location="" # sentinel value to exit the while loop
                        print("location not found for index " + str(index))

        sleep(1)
    print(toronto_df.shape)
except:
    print("ERROR: Failed to find the location of the given address. Please revise the address!\n")
toronto_df

0- Neighbourhood - The Beaches, Toronto, Canada Latitude = 43.6710244  Longitude = -79.296712
1- Neighbourhood - The Danforth West, Toronto, Canada Latitude = 43.6863598  Longitude = -79.3003158
2- Neighbourhood - The Beaches West, Toronto, Canada Latitude = 43.6710244  Longitude = -79.296712
3- Neighbourhood - Studio District, Toronto, Canada Latitude = 43.643105  Longitude = -79.39131
4- Neighbourhood - Lawrence Park, Toronto, Canada Latitude = 43.729199  Longitude = -79.4032525
5- Neighbourhood - Davisville North, Toronto, Canada Latitude = 43.7043123  Longitude = -79.3885169
6- Neighbourhood - North Toronto West, Toronto, Canada Latitude = 43.653963  Longitude = -79.387207
7- Neighbourhood - Davisville, Toronto, Canada Latitude = 43.7043123  Longitude = -79.3885169
8- Neighbourhood - Moore Park, Toronto, Canada Latitude = 43.6903876  Longitude = -79.3832965
9- Neighbourhood - Deer Park, Toronto, Canada Latitude = 43.68809  Longitude = -79.3940935
10- Neighbourhood - Rosedale, Toron

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.671024,-79.296712
1,M4K,East Toronto,"The Danforth West, Riverdale",43.68636,-79.300316
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.671024,-79.296712
3,M4M,East Toronto,Studio District,43.643105,-79.39131
4,M4N,Central Toronto,Lawrence Park,43.729199,-79.403252
5,M4P,Central Toronto,Davisville North,43.704312,-79.388517
6,M4R,Central Toronto,North Toronto West,43.653963,-79.387207
7,M4S,Central Toronto,Davisville,43.704312,-79.388517
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.690388,-79.383297
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.68809,-79.394093


# Part 3  <a class="anchor" id="Part3"></a>
### __Clustering Neighbourhoods__

__ __

#### Use geopy library to get the latitude and longitude values of Toronto City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [26]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.653963, -79.387207.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [28]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
folium.CircleMarker(
    [latitude, longitude],
    radius=5,
    popup='Toronto',
    color='red',
    fill=True,
    fill_color='red',
    fill_opacity=0.6,
    parse_html=False).add_to(map_toronto) 

# add a marker for each neighbourhood
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

Let's simplify the above map and segment and cluster only the neighborhoods in Central Toronto. To do so, we slice the original dataframe and create a new dataframe containing only the Central Toronto data.

In [30]:
Central_Toronto_data = toronto_df[toronto_df['Borough'] == 'Central Toronto'].reset_index(drop=True)
Central_Toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.729199,-79.403252
1,M4P,Central Toronto,Davisville North,43.704312,-79.388517
2,M4R,Central Toronto,North Toronto West,43.653963,-79.387207
3,M4S,Central Toronto,Davisville,43.704312,-79.388517
4,M4T,Central Toronto,"Moore Park, Summerhill East",43.690388,-79.383297


Let's get the geographical coordinates of Central Toronto.

In [31]:
address = 'Central Toronto, Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Central Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Central Toronto are 43.653963, -79.387207.


As we did with all of Toronto, let's visualize Central Toronto and the neighborhoods in it.

In [32]:
# create map of Central Toronto using latitude and longitude values
map_Central_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(Central_Toronto_data['Latitude'], Central_Toronto_data['Longitude'], Central_Toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Central_Toronto)  
    
map_Central_Toronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [33]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: KLM3YX5M4BPNZETT5BWB54OYFUOIPMY3OLWU1HYRJLWC4TC0
CLIENT_SECRET:2NZPASHCA2BNERJYMDDXZCL1WTU2NKX5XSKGOZQKYJWPGPXP


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [34]:
Central_Toronto_data.loc[0, 'Neighbourhood']

'Lawrence Park'

Get the neighborhood's latitude and longitude values.

In [36]:
neighborhood_latitude = Central_Toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = Central_Toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = Central_Toronto_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Lawrence Park are 43.729199, -79.4032525.


#### Now, let's get the top 100 venues within a radius of 500 meters of this neighborhood.

First, let's create the GET request URL and store it in **url**.

In [37]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url 



'https://api.foursquare.com/v2/venues/explore?&client_id=KLM3YX5M4BPNZETT5BWB54OYFUOIPMY3OLWU1HYRJLWC4TC0&client_secret=2NZPASHCA2BNERJYMDDXZCL1WTU2NKX5XSKGOZQKYJWPGPXP&v=20180604&ll=43.729199,-79.4032525&radius=500&limit=100'
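
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is a minimal sketch: `build_explore_url` is a hypothetical helper, and the credential values are placeholders rather than real keys. The endpoint, parameter names, and version string are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder values; substitute real credentials when sending a request.
CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
VERSION = '20180604'


def build_explore_url(lat, lng, radius=500, limit=100):
    """Hypothetical helper: build the venues/explore URL from a parameter dict."""
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters (for example the comma in `ll`).
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


explore_url = build_explore_url(43.729199, -79.4032525)
query = parse_qs(urlsplit(explore_url).query)
print(query['ll'], query['radius'], query['limit'])
```

`requests.get` also accepts such a dictionary through its `params` argument, which avoids building the URL string at all and keeps the secret out of displayed cell output.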

Send the GET request and examine the results.

In [38]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c5c5f4f6a60717af16a3e2a'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Bedford Park',
  'headerFullLocation': 'Bedford Park, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 55,
  'suggestedBounds': {'ne': {'lat': 43.733699004500004,
    'lng': -79.39703673823003},
   'sw': {'lat': 43.7246989955, 'lng': -79.40946826176996}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '543a91fb498e12bc6fd44f2e',
       'name': 'Bobbette & Belle',
       'location': {'address': '3347 Yonge St',
        'crossStreet': 'btwn Snowdon Ave & Golfdale Rd',
        'lat': 43.73133877297435,
        'lng': -79.40376944766918,
        'la
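
The response above nests the venue list under `response` → `groups` → `items`, next to a `meta` block carrying the status code. A defensive way to pull out the items might look like the sketch below. `extract_items` is a hypothetical helper, and the sample payload is a trimmed-down imitation of the structure printed above, not a full API response.

```python
def extract_items(payload):
    """Hypothetical helper: return the venue items, or [] if anything is missing."""
    # A non-200 code in `meta` signals that the request did not succeed.
    if payload.get('meta', {}).get('code') != 200:
        return []
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])


# Trimmed-down payload with the same nesting as the output above.
sample_payload = {
    'meta': {'code': 200},
    'response': {
        'groups': [{
            'type': 'Recommended Places',
            'name': 'recommended',
            'items': [{'venue': {'name': 'Bobbette & Belle'}}],
        }],
    },
}

items = extract_items(sample_payload)
print(len(items), items[0]['venue']['name'])
```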

Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [39]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the flattened column name produced by json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [40]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Bobbette & Belle,Bakery,43.731339,-79.403769
1,T-buds,Tea Room,43.731247,-79.40364
2,For The Win Cafe,Bubble Tea Shop,43.728636,-79.403255
3,Menchies Frozen Yogurt,Ice Cream Shop,43.728336,-79.403173
4,STACK,BBQ Joint,43.729311,-79.403241


And how many venues were returned by Foursquare?

In [41]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

55 venues were returned by Foursquare.


<a id='item2'></a>

## 2. Explore Neighborhoods in Central Toronto

#### Let's create a function to repeat the same process for all the neighborhoods in Central Toronto

In [42]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
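
One caveat with the function above: `v['venue']['categories'][0]` raises an `IndexError` if a venue comes back with an empty category list. A more tolerant row builder could look like the sketch below. `venue_row` is a hypothetical helper, and the second sample item is made up purely to exercise the empty-category case.

```python
def venue_row(name, lat, lng, item):
    """Hypothetical helper: build one output row, tolerating missing categories."""
    venue = item['venue']
    categories = venue.get('categories') or []
    # Use None instead of failing when the venue has no category.
    category = categories[0]['name'] if categories else None
    return (name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category)


sample_items = [
    {'venue': {'name': 'Bobbette & Belle',
               'location': {'lat': 43.73133877297435, 'lng': -79.40376944766918},
               'categories': [{'name': 'Bakery'}]}},
    # Made-up venue with no categories, to show the fallback.
    {'venue': {'name': 'Example Venue',
               'location': {'lat': 0.0, 'lng': 0.0},
               'categories': []}},
]

rows = [venue_row('Lawrence Park', 43.729199, -79.4032525, it) for it in sample_items]
for r in rows:
    print(r)
```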

#### Now run the above function on each neighborhood and create a new dataframe called *Central_Toronto_venues*.

In [45]:
Central_Toronto_venues = getNearbyVenues(names=Central_Toronto_data['Neighbourhood'],
                                   latitudes=Central_Toronto_data['Latitude'],
                                   longitudes=Central_Toronto_data['Longitude']
                                  )



Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville


#### Let's check the size of the resulting dataframe

In [46]:
print(Central_Toronto_venues.shape)
Central_Toronto_venues.head()

(311, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.729199,-79.403252,Bobbette & Belle,43.731339,-79.403769,Bakery
1,Lawrence Park,43.729199,-79.403252,T-buds,43.731247,-79.40364,Tea Room
2,Lawrence Park,43.729199,-79.403252,For The Win Cafe,43.728636,-79.403255,Bubble Tea Shop
3,Lawrence Park,43.729199,-79.403252,Menchies Frozen Yogurt,43.728336,-79.403173,Ice Cream Shop
4,Lawrence Park,43.729199,-79.403252,STACK,43.729311,-79.403241,BBQ Joint


Let's check how many venues were returned for each neighborhood

In [48]:
Central_Toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Davisville,35,35,35,35,35,35
Davisville North,35,35,35,35,35,35
"Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West",56,56,56,56,56,56
"Forest Hill North, Forest Hill West",5,5,5,5,5,5
Lawrence Park,55,55,55,55,55,55
"Moore Park, Summerhill East",4,4,4,4,4,4
North Toronto West,70,70,70,70,70,70
Roselawn,4,4,4,4,4,4
"The Annex, North Midtown, Yorkville",47,47,47,47,47,47


#### Let's find out how many unique categories can be curated from all the returned venues

In [49]:
print('There are {} unique categories.'.format(len(Central_Toronto_venues['Venue Category'].unique())))

There are 108 unique categories.


<a id='item3'></a>

## 3. Analyze Each Neighborhood

In [50]:
# one hot encoding
Central_Toronto_onehot = pd.get_dummies(Central_Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Central_Toronto_onehot['Neighbourhood'] = Central_Toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [Central_Toronto_onehot.columns[-1]] + list(Central_Toronto_onehot.columns[:-1])
Central_Toronto_onehot = Central_Toronto_onehot[fixed_columns]

Central_Toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,...,Tapas Restaurant,Tea Room,Thai Restaurant,Toy / Game Store,Trail,University,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,Yoga Studio
0,Lawrence Park,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Lawrence Park,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [51]:
Central_Toronto_onehot.shape

(311, 109)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [52]:
Central_Toronto_grouped = Central_Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
Central_Toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,...,Tapas Restaurant,Tea Room,Thai Restaurant,Toy / Game Store,Trail,University,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,Yoga Studio
0,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.028571,0.028571,0.0,0.0,0.0,0.0,0.0,0.0
1,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.028571,0.028571,0.0,0.0,0.0,0.0,0.0,0.0
2,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.035714,0.017857,...,0.017857,0.017857,0.035714,0.0,0.0,0.0,0.0,0.017857,0.0,0.017857
3,"Forest Hill North, Forest Hill West",0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.036364,0.018182,0.0,0.072727,...,0.0,0.036364,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.018182
5,"Moore Park, Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North Toronto West,0.0,0.028571,0.057143,0.014286,0.014286,0.014286,0.0,0.0,0.014286,...,0.014286,0.014286,0.014286,0.014286,0.0,0.014286,0.014286,0.0,0.0,0.0
7,Roselawn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0
8,"The Annex, North Midtown, Yorkville",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021277,...,0.0,0.0,0.042553,0.0,0.0,0.0,0.021277,0.0,0.021277,0.0


#### Let's confirm the new size

In [53]:
Central_Toronto_grouped.shape

(9, 109)
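
Taking the mean of a one-hot column within a neighbourhood gives the share of that neighbourhood's venues that fall in the category. The plain-Python sketch below computes the same quantity by counting, as a way to sanity-check the grouped table. `category_frequencies` is a hypothetical helper, and the neighbourhood and category labels in the sample are invented for illustration.

```python
from collections import Counter, defaultdict


def category_frequencies(pairs):
    """Hypothetical helper: share of each category per neighbourhood.

    `pairs` is an iterable of (neighbourhood, category) tuples, one per venue.
    """
    counts = defaultdict(Counter)
    for neighbourhood, category in pairs:
        counts[neighbourhood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }


# Invented sample: neighbourhood 'A' has four venues, 'B' has one.
pairs = [
    ('A', 'Café'), ('A', 'Café'), ('A', 'Park'), ('A', 'Bank'),
    ('B', 'Park'),
]
freqs = category_frequencies(pairs)
print(freqs)
```

A neighbourhood with only a handful of venues produces coarse frequencies (multiples of 0.25 for four venues, for instance), which is worth keeping in mind when comparing rows of the grouped table.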

#### Let's print each neighborhood along with the top 5 most common venues

In [54]:
num_top_venues = 5

for hood in Central_Toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = Central_Toronto_grouped[Central_Toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Davisville----
                venue  freq
0        Dessert Shop  0.09
1         Pizza Place  0.09
2      Sandwich Place  0.09
3                Café  0.06
4  Italian Restaurant  0.06


----Davisville North----
                venue  freq
0        Dessert Shop  0.09
1         Pizza Place  0.09
2      Sandwich Place  0.09
3                Café  0.06
4  Italian Restaurant  0.06


----Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West----
                venue  freq
0         Coffee Shop  0.11
1  Italian Restaurant  0.07
2                Café  0.05
3      Sandwich Place  0.05
4    Sushi Restaurant  0.04


----Forest Hill North, Forest Hill West----
                      venue  freq
0         Accessories Store   0.2
1                Playground   0.2
2  Mediterranean Restaurant   0.2
3                      Bank   0.2
4                      Park   0.2


----Lawrence Park----
                venue  freq
0  Italian Restaurant  0.09
1         Coffee Shop  0.07
2              B

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [55]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [59]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = Central_Toronto_grouped['Neighbourhood']

for ind in np.arange(Central_Toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Central_Toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Davisville,Dessert Shop,Pizza Place,Sandwich Place,Italian Restaurant,Seafood Restaurant,Sushi Restaurant,Coffee Shop,Café,Restaurant,Indoor Play Area
1,Davisville North,Dessert Shop,Pizza Place,Sandwich Place,Italian Restaurant,Seafood Restaurant,Sushi Restaurant,Coffee Shop,Café,Restaurant,Indoor Play Area
2,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Coffee Shop,Italian Restaurant,Sandwich Place,Café,Bagel Shop,Restaurant,Pizza Place,Thai Restaurant,Sushi Restaurant,Pub
3,"Forest Hill North, Forest Hill West",Accessories Store,Playground,Park,Bank,Mediterranean Restaurant,Farmers Market,Creperie,Deli / Bodega,Design Studio,Dessert Shop
4,Lawrence Park,Italian Restaurant,Coffee Shop,Bakery,Sushi Restaurant,Bank,Tea Room,Cosmetics Shop,Pub,Burger Joint,Asian Restaurant
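
The column labels above rely on a three-element suffix list plus an exception for everything else, which works for the top 10 but would produce labels such as "21th" if `num_top_venues` were raised past 20. A small ordinal helper avoids that. The sketch below is one possible version; `ordinal` is a hypothetical name, not part of the original notebook.

```python
def ordinal(n):
    """Hypothetical helper: return n with its English ordinal suffix."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i))
    for i in range(1, num_top_venues + 1)
]
print(columns)
```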


<a id='item4'></a>

## 4. Cluster Neighborhoods

Run *k*-means to group the neighbourhoods into 5 clusters.

In [60]:
# set number of clusters
kclusters = 5

# drop the label column so only the venue-frequency features remain
Central_Toronto_grouped_clustering = Central_Toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Central_Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 4, 2, 3, 2, 0, 2, 1, 2], dtype=int32)
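
The notebook fixes the number of clusters at 5. One common way to sanity-check such a choice is to compare silhouette scores across several values of *k*. The sketch below does this on synthetic data, not on the Toronto features: the three blobs, `X`, `scores`, and `best_k` are all assumptions made for illustration, and this is not the method used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (assumed): three well-separated 2-D blobs of 20 points each
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centres])

# Fit k-means for a range of k and record the mean silhouette score for each
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score gives the most compact, best-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

With only nine rows, as in the Central Toronto table, silhouette scores are noisy, so a check like this is a rough guide at best.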

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [61]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Central_Toronto_merged = Central_Toronto_data

# join the sorted venue table onto Central_Toronto_data so each neighbourhood keeps its latitude/longitude
Central_Toronto_merged = Central_Toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

Central_Toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4N,Central Toronto,Lawrence Park,43.729199,-79.403252,2,Italian Restaurant,Coffee Shop,Bakery,Sushi Restaurant,Bank,Tea Room,Cosmetics Shop,Pub,Burger Joint,Asian Restaurant
1,M4P,Central Toronto,Davisville North,43.704312,-79.388517,4,Dessert Shop,Pizza Place,Sandwich Place,Italian Restaurant,Seafood Restaurant,Sushi Restaurant,Coffee Shop,Café,Restaurant,Indoor Play Area
2,M4R,Central Toronto,North Toronto West,43.653963,-79.387207,2,Café,Coffee Shop,Art Gallery,Japanese Restaurant,Breakfast Spot,Sushi Restaurant,Bar,Chinese Restaurant,Exhibit,American Restaurant
3,M4S,Central Toronto,Davisville,43.704312,-79.388517,4,Dessert Shop,Pizza Place,Sandwich Place,Italian Restaurant,Seafood Restaurant,Sushi Restaurant,Coffee Shop,Café,Restaurant,Indoor Play Area
4,M4T,Central Toronto,"Moore Park, Summerhill East",43.690388,-79.383297,0,Playground,Gym,Park,Convenience Store,Yoga Studio,Falafel Restaurant,Creperie,Deli / Bodega,Design Studio,Dessert Shop
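
Because `join` keeps every row of the left-hand frame, a neighbourhood for which no venues were returned would end up with `NaN` in `Cluster Labels`, and the column would silently become float. The sketch below illustrates that behaviour on toy frames; `toy_data`, `toy_sorted`, and the values in them are hypothetical and stand in for the notebook's dataframes.

```python
import pandas as pd

# Hypothetical location table: neighbourhood 'C' has no venue data
toy_data = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Latitude': [43.70, 43.71, 43.72],
    'Longitude': [-79.40, -79.39, -79.38],
})

# Hypothetical clustered venue table covering only 'A' and 'B'
toy_sorted = pd.DataFrame({
    'Cluster Labels': [1, 0],
    'Neighbourhood': ['A', 'B'],
    '1st Most Common Venue': ['Café', 'Park'],
})

# Same join pattern as in the notebook: left rows are kept, missing matches become NaN
toy_merged = toy_data.join(toy_sorted.set_index('Neighbourhood'), on='Neighbourhood')

# Rows that found no match on the right-hand side
unmatched = toy_merged[toy_merged['Cluster Labels'].isna()]
print(unmatched['Neighbourhood'].tolist())

# Drop unmatched rows and restore integer labels before using them as list indices
toy_clean = toy_merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})
print(toy_clean)
```

In the output above every row has a label, so no cleanup is needed here; the check matters when the same code is run on a borough with sparser venue data.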


Finally, let's visualize the resulting clusters on a map:

In [63]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
# sample one evenly spaced colour per cluster from the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Central_Toronto_merged['Latitude'], Central_Toronto_merged['Longitude'], Central_Toronto_merged['Neighbourhood'], Central_Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
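
The colour scheme used for the markers boils down to sampling one colour per cluster from a matplotlib colormap and converting it to a hex string that folium accepts. A minimal standalone sketch follows; `palette` and `colour_for` are hypothetical names, and the helper indexes by label directly rather than by `cluster-1` as the cell above does.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Sample kclusters evenly spaced RGBA colours and convert each to '#rrggbb'
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

def colour_for(label):
    """Return the hex colour assigned to a cluster label (0-based)."""
    return palette[int(label)]

for label in range(kclusters):
    print(label, colour_for(label))
```

Either indexing scheme gives each cluster a distinct colour; indexing by the label itself just makes the mapping easier to read off a legend.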

<a id='item5'></a>

## 5. Examine Clusters

Now we can examine each cluster and identify the venue categories that distinguish it. Based on those defining categories, you can then assign a name to each cluster; this is left as an exercise for the reader.
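
One possible way to find the defining categories is to count how often each venue category appears among the top-venue columns of a cluster. The sketch below does this on a toy frame; `toy_merged`, its values, and the `signature` series are assumptions for illustration, not results from this notebook.

```python
import pandas as pd

# Hypothetical merged table with two clusters and two top-venue columns
toy_merged = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Cluster Labels': [0, 0, 1, 1],
    '1st Most Common Venue': ['Café', 'Café', 'Park', 'Park'],
    '2nd Most Common Venue': ['Pub', 'Bakery', 'Playground', 'Gym'],
})

# Collect every "Nth Most Common Venue" column
venue_cols = [c for c in toy_merged.columns if c.endswith('Most Common Venue')]

# Reshape to one (cluster, venue) pair per row
long = toy_merged.melt(id_vars='Cluster Labels', value_vars=venue_cols,
                       value_name='Venue')

# Count venue categories per cluster, then pick the most frequent one in each
table = pd.crosstab(long['Cluster Labels'], long['Venue'])
signature = table.idxmax(axis=1)
print(table)
print(signature)
```

Applied to `Central_Toronto_merged`, the same steps would give a quick shortlist of candidate names for each cluster, which can then be checked against the per-cluster tables below.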

#### Cluster 1

In [64]:
Central_Toronto_merged.loc[Central_Toronto_merged['Cluster Labels'] == 0, Central_Toronto_merged.columns[[1] + list(range(5, Central_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,0,Playground,Gym,Park,Convenience Store,Yoga Studio,Falafel Restaurant,Creperie,Deli / Bodega,Design Studio,Dessert Shop


#### Cluster 2

In [65]:
Central_Toronto_merged.loc[Central_Toronto_merged['Cluster Labels'] == 1, Central_Toronto_merged.columns[[1] + list(range(5, Central_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Central Toronto,1,Skating Rink,Trail,Bus Stop,Bank,Yoga Studio,Creperie,Deli / Bodega,Design Studio,Dessert Shop,Diner


#### Cluster 3

In [66]:
Central_Toronto_merged.loc[Central_Toronto_merged['Cluster Labels'] == 2, Central_Toronto_merged.columns[[1] + list(range(5, Central_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,2,Italian Restaurant,Coffee Shop,Bakery,Sushi Restaurant,Bank,Tea Room,Cosmetics Shop,Pub,Burger Joint,Asian Restaurant
2,Central Toronto,2,Café,Coffee Shop,Art Gallery,Japanese Restaurant,Breakfast Spot,Sushi Restaurant,Bar,Chinese Restaurant,Exhibit,American Restaurant
5,Central Toronto,2,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Bagel Shop,Restaurant,Pizza Place,Thai Restaurant,Sushi Restaurant,Pub
8,Central Toronto,2,Pizza Place,Coffee Shop,Grocery Store,Park,Metro Station,Greek Restaurant,Ice Cream Shop,Thai Restaurant,Bookstore,Indian Restaurant


#### Cluster 4

In [67]:
Central_Toronto_merged.loc[Central_Toronto_merged['Cluster Labels'] == 3, Central_Toronto_merged.columns[[1] + list(range(5, Central_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,Central Toronto,3,Accessories Store,Playground,Park,Bank,Mediterranean Restaurant,Farmers Market,Creperie,Deli / Bodega,Design Studio,Dessert Shop


#### Cluster 5

In [68]:
Central_Toronto_merged.loc[Central_Toronto_merged['Cluster Labels'] == 4, Central_Toronto_merged.columns[[1] + list(range(5, Central_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Central Toronto,4,Dessert Shop,Pizza Place,Sandwich Place,Italian Restaurant,Seafood Restaurant,Sushi Restaurant,Coffee Shop,Café,Restaurant,Indoor Play Area
3,Central Toronto,4,Dessert Shop,Pizza Place,Sandwich Place,Italian Restaurant,Seafood Restaurant,Sushi Restaurant,Coffee Shop,Café,Restaurant,Indoor Play Area
