# Toronto neighborhoods

**In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto.** 

The neighborhood data for Toronto is not readily available on the internet. However, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. 

Accordingly, this assignment consists of three notebooks, one for each part: 

- **Parts 1 and 2**: you will be required to get the data, clean it, and then read it into a pandas dataframe so that it is in a suitable structured format. 

- **Part 3**: once the data is in a structured format, you can analyze the dataset to explore and cluster the neighborhoods in the city of Toronto.

Let's start!

In [1]:
#!conda install -c conda-forge geopy --yes           # uncomment only if it is necessary to install the 'geopy' library

Import the necessary libraries to run this notebook:

In [2]:
import numpy as np                        # library to handle data in a vectorized manner
import pandas as pd                       # library for data analysis
from pandas import json_normalize         # transform a JSON file into a pandas dataframe (pandas >= 1.0)
from geopy.geocoders import Nominatim     # convert an address into latitude and longitude values
import folium                             # map rendering library
import requests                           # library to handle HTTP requests
from sklearn.cluster import KMeans        # k-means for the clustering stage
import matplotlib.cm as cm                # matplotlib colormaps
import matplotlib.colors as colors        # matplotlib color utilities

## Part 1: Postal codes

**Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M in order**

**1) to obtain the data that is in the table of postal codes and** 

**2) to transform the data into a pandas dataframe of the postal code of each neighborhood along with the borough name and neighborhood name.**

## 1.1. Obtain the data

**Use the BeautifulSoup package or *any other way* you are comfortable with to transform the data in the table on the Wikipedia page into the pandas dataframe.**

Instead of the BeautifulSoup package, I used the pandas function *'read_html'*, which reads the HTML tables on a page into a list of dataframes; the postal code table is the first one.

In [3]:
df_cp = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

Explore the data:

In [4]:
df_cp.shape         # dataframe dimension (rows, columns)

(288, 3)

In [5]:
df_cp.columns       # column names

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

In [6]:
df_cp               # dataframe display

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


## 1.2. Transform the data

**1.2.1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.**

Rename the columns to match the ones in the required dataframe:

In [7]:
df_cp.rename(columns={'Postcode':'PostalCode', 'Neighbourhood':'Neighborhood'}, inplace=True)   # columns rename 
df_cp.columns                                                                                   # verification

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

**1.2.2. Only process the cells that have an assigned borough. Ignore cells with a borough that is *'Not assigned'*.**

In [8]:
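filter_1 = (df_cp['Borough'] != 'Not assigned')      # boolean filter: 'Borough' different from 'Not assigned'
df_cp = df_cp[filter_1].copy()                       # apply the filter (copy so later assignments modify this dataframe)
(df_cp['Borough'] == 'Not assigned').any()           # verification: compare the values ('is' only compares object identity)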

False

In [9]:
df_cp                                         # visualize the resulting dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


**1.2.3. If a cell has a borough but a *'Not assigned'* neighborhood, then the neighborhood will be the same as the borough.** 

In [10]:
filter_21 = (df_cp['Neighborhood'] == 'Not assigned')   # create a filter
labels_21 = df_cp[filter_21].index                       # apply the filter and get the row labels
df_cp[filter_21]                                        # visualize the filtered dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
8,M7A,Queen's Park,Not assigned


There is only one row where the value of **'Neighborhood'** is *'Not assigned'*. For this row, we set the neighborhood equal to the borough.

In [11]:
# Change 'Neighborhood' if it is 'Not assigned'
# 1 - Valid only in this case, where labels_21 has a single element.
#df_cp.loc[labels_21[0], 'Neighborhood'] = df_cp.loc[labels_21[0], 'Borough']  # df.loc - access a cell by row label and column

# 2 - Valid in general, for any number of filtered rows.
for i in labels_21:
    df_cp.loc[i, 'Neighborhood'] = df_cp.loc[i, 'Borough']   # single .loc call; chained indexing may write to a copy
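
As a hedged alternative to the loop, the same replacement can be written as one boolean-mask assignment. The sketch below runs on a small toy dataframe; `toy` and `mask` are names introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# toy data: one row has a borough but no assigned neighborhood
toy = pd.DataFrame({
    'PostalCode':   ['M5A', 'M7A'],
    'Borough':      ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

# boolean mask of the rows to fix
mask = toy['Neighborhood'] == 'Not assigned'

# a single .loc call selects rows and column at once, so the write reaches `toy` itself
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']

print(toy)
```

Because the mask can select any number of rows, this form covers both the single-row case and the general case.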

Validation (three different ways):

In [12]:
print('Not assigned' in df_cp['Neighborhood'].values)    # first ('in' on a Series checks the index, so test the values)

False
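
A detail worth keeping in mind for membership checks like the first one: applied to a pandas Series, the `in` operator tests the index labels, not the stored values. A minimal sketch on a toy Series (illustrative values, not the notebook's data):

```python
import pandas as pd

s = pd.Series(['North York', 'Not assigned'], index=[2, 8])

label_check = 'Not assigned' in s                # looks at the index labels (2 and 8)
value_check = 'Not assigned' in s.values         # looks at the stored values
any_check   = bool((s == 'Not assigned').any())  # element-wise comparison, then reduce

print(label_check, value_check, any_check)       # False True True
```

So a `False` from the label check alone does not prove that the value is absent; the value-based checks do.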


In [13]:
filter_22 = (df_cp['Neighborhood'] == 'Not assigned')    # second      
df_cp[filter_22]

Unnamed: 0,PostalCode,Borough,Neighborhood


In [14]:
df_cp.loc[[8]]                                            # third

Unnamed: 0,PostalCode,Borough,Neighborhood
8,M7A,Queen's Park,Queen's Park


The value of **'Neighborhood'** is the same as the value of **'Borough'** in the row with label 8. That's right!

**1.2.4. When more than one neighborhood exists in one postal code area, these rows will be combined into one row with the neighborhoods separated with a comma.**

In [15]:
df_cp             # display the dataframe before combination

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [16]:
print('The city of Toronto has {} boroughs, {} postal codes and {} neighborhoods.'.format(
       len(df_cp['Borough'].unique()),len(df_cp['PostalCode'].unique()), df_cp.shape[0]))

The city of Toronto has 11 boroughs, 103 postal codes and 211 neighborhoods.


Get the row labels for the postal codes with more than one neighborhood:

In [17]:
filter_3 = df_cp[['PostalCode']].duplicated()
labels_3 = df_cp[filter_3].index
labels_3

Int64Index([  5,   7,  12,  16,  18,  23,  24,  25,  26,  28,
            ...
            268, 269, 270, 271, 272, 273, 283, 284, 285, 286],
           dtype='int64', length=108)

Combine the neighborhoods for these postal codes into one row:

In [18]:
col = df_cp.columns.get_loc('Neighborhood')          # integer position of the 'Neighborhood' column
for label in labels_3:
    row = df_cp.index.get_loc(label)                 # integer position of the duplicated row
    # append to the previous row; this assumes rows with the same postal code are adjacent
    df_cp.iloc[row-1, col] = df_cp.iloc[row-1, col] + ',' + df_cp.iloc[row, col]
    df_cp = df_cp.drop(label, axis=0)                # delete the duplicated row
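
The combination step can also be expressed as a group-by, which does not depend on duplicated postal codes being adjacent. This is a sketch on a toy dataframe built from a few of the rows shown earlier, offered as an alternative and not as the method used in this notebook; `toy` and `combined` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode':   ['M3A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Borough':      ['North York', 'Downtown Toronto', 'Downtown Toronto',
                     'North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Harbourfront', 'Regent Park',
                     'Lawrence Heights', 'Lawrence Manor'],
})

# one row per (PostalCode, Borough); neighborhoods joined with a comma;
# sort=False keeps the groups in order of first appearance
combined = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
       .agg(','.join)
       .reset_index()
)

print(combined)
```

The result already has a fresh 0-based index, so no separate index reset is needed afterwards.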

In [19]:
df_cp          # validation

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Harbourfront,Regent Park"
6,M6A,North York,"Lawrence Heights,Lawrence Manor"
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,"Rouge,Malvern"
14,M3B,North York,Don Mills North
15,M4B,East York,"Woodbine Gardens,Parkview Hill"
17,M5B,Downtown Toronto,"Ryerson,Garden District"


**1.2.5. Change the row labels to match the ones in the required dataframe.**

In [20]:
df_cp = df_cp.reset_index(drop = True)        # rows rename 
df_cp                                         # verification

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


**1.2.6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.**

In [21]:
print('The number of rows of my postal codes dataframe is: ', df_cp.shape[0])

The number of rows of my postal codes dataframe is:  103


## Part 2: Geographical coordinates of each postal code

**We need to get the latitude and the longitude coordinates of each neighborhood in order to utilize the Foursquare location data. We can use the Geocoder Python package or a csv file that has the geographical coordinates of each postal code to create the required dataframe.**

## 2.1. Obtain the data

I got the geographical coordinates from the provided csv file, which is available at http://cocl.us/Geospatial_data.

In [22]:
df_gc = pd.read_csv('http://cocl.us/Geospatial_data')

Explore the coordinates dataframe:

In [23]:
df_gc.shape      # dataframe dimensions (rows, columns)

(103, 3)

The geographical coordinates dataframe has the same number of rows as my postal codes dataframe. That's right!

In [24]:
df_gc.columns   # columns labels

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [25]:
df_gc.head()    # the first 5 rows

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


## 2.2. Transform the data

**2.2.1. Change the name of the first column in the coordinates dataframe to match the one in the postal codes dataframe to be able to join both dataframes.**

In [26]:
df_gc.rename(columns = {'Postal Code':'PostalCode'}, inplace = True)

In [27]:
df_gc.columns

Index(['PostalCode', 'Latitude', 'Longitude'], dtype='object')

**2.2.2. Join the postal codes and the geographical coordinates dataframes to obtain the required dataframe for segmentation and clustering.**

In [28]:
df_tn = df_cp.merge(df_gc, on = 'PostalCode')
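
A default `merge` is an inner join, so a postal code missing from either table would be dropped silently. The sketch below shows one way to check for that with the `how`, `validate` and `indicator` parameters of `DataFrame.merge`. It uses toy data; the postal code 'M9Z' and its borough are made up to produce an unmatched row.

```python
import pandas as pd

codes = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],          # 'M9Z' is hypothetical
                      'Borough': ['North York', 'North York', 'Hypothetical']})
coords = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every postal code, validate raises if a code is duplicated,
# indicator adds a '_merge' column saying where each row was found
checked = codes.merge(coords, on='PostalCode', how='left',
                      validate='one_to_one', indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)   # ['M9Z']
```

An empty `unmatched` list would confirm that every postal code received coordinates.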

The required dataframe for segmentation and clustering is: 

In [29]:
df_tn

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.654260,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937


## Part 3: Segmentation, clustering and visualization

**Explore and cluster the neighborhoods in Toronto using the Foursquare location data. You can decide to work with only boroughs that contain the word Toronto.**

You will get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. You will use the *k*-means clustering algorithm to complete this task. Finally, you will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## 3.1. Visualization of the neighborhoods of the city of Toronto

 Create a map of Toronto with neighborhoods superimposed on top.


**3.1.1. Get the latitude and longitude values of Toronto**

In [30]:
address = 'Toronto'

geolocator = Nominatim(user_agent = "toronto_explorer")
location   = geolocator.geocode(address)
latitude   = location.latitude
longitude  = location.longitude

print('The geographical coordinates of the city of Toronto are latitude = {} and longitude = {}.'.format(latitude, longitude))

The geographical coordinates of the city of Toronto are latitude = 43.653963 and longitude = -79.387207.
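
geopy's `geocode` returns `None` when it finds no match, in which case `location.latitude` would raise an `AttributeError`. The sketch below guards against that. To stay runnable offline it takes the geocoding function as an argument and is exercised with stand-in functions; `coordinates_or_default` and the stand-ins are hypothetical helpers, not part of geopy.

```python
from types import SimpleNamespace

def coordinates_or_default(geocode, address, default):
    """Return (latitude, longitude) from a geocode callable, or `default` if nothing is found."""
    location = geocode(address)
    if location is None:          # no match for this address
        return default
    return (location.latitude, location.longitude)

# stand-ins for geolocator.geocode, so the sketch runs without network access
found   = lambda address: SimpleNamespace(latitude=43.653963, longitude=-79.387207)
missing = lambda address: None

print(coordinates_or_default(found, 'Toronto', (0.0, 0.0)))
print(coordinates_or_default(missing, 'Nowhere', (0.0, 0.0)))
```

In the notebook, the first argument would be `geolocator.geocode`.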


**3.1.2. Create a map of Toronto neighborhoods**

From **Part 1** (step 1.2.4), we know **the city of Toronto has 11 boroughs, 103 postal codes and 211 neighborhoods**.

In [31]:
df_tn         # the required dataframe for segmentation and clustering 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.654260,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937


In [32]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_tn['Latitude'], df_tn['Longitude'], df_tn['Borough'], df_tn['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## 3.2. Segmentation and clustering of only the neighborhoods corresponding to boroughs that contain the word 'Toronto'

I decided to work with only boroughs that contain the word Toronto. 

### 3.2.1. Obtain the data

We visualize the boroughs of Toronto with the number of postal codes associated with them.

In [33]:
df_tn.groupby('Borough').count()[['PostalCode']]

Borough,PostalCode
Central Toronto,9
Downtown Toronto,18
East Toronto,5
East York,5
Etobicoke,12
Mississauga,1
North York,24
Queen's Park,1
Scarborough,17
West Toronto,6


We will segment and cluster only the neighborhoods corresponding to the following boroughs:
 - **Central Toronto**
 - **Downtown Toronto**
 - **East Toronto**
 - **West Toronto**

So let's slice the dataframe for the city of Toronto and create a new dataframe with these four boroughs:

In [34]:
toronto_boroughs = ['Central Toronto', 'Downtown Toronto', 'East Toronto', 'West Toronto']
filter_ = df_tn['Borough'].isin(toronto_boroughs)
df_t_cp = df_tn[filter_].reset_index(drop=True)
df_t_cp

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,M5H,Downtown Toronto,"Adelaide,King,Richmond",43.650571,-79.384568
8,M6H,West Toronto,"Dovercourt Village,Dufferin",43.669005,-79.442259
9,M5J,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.640816,-79.381752


### 3.2.2. Visualization of the neighborhoods to segment and cluster

As we did for the whole city of Toronto, let's visualize the neighborhoods of these four boroughs.

In [35]:
address = 'Toronto'

geolocator = Nominatim(user_agent = "toronto_explorer")
location   = geolocator.geocode(address)
latitude   = location.latitude
longitude  = location.longitude

print('The geographical coordinates of the city of Toronto are latitude = {} and longitude = {}.'.format(latitude, longitude))

The geographical coordinates of the city of Toronto are latitude = 43.653963 and longitude = -79.387207.


In [36]:
# create map of Toronto using latitude and longitude values
map_toronto_cp = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_t_cp['Latitude'],df_t_cp['Longitude'], df_t_cp['Borough'], df_t_cp['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_cp)  
    
map_toronto_cp

### 3.2.3. Explore the neighborhoods and segment them

**3.2.3.1. Define Foursquare Credentials and Version**

In [37]:
CLIENT_ID     = 'YOUR_FOURSQUARE_CLIENT_ID'     # your Foursquare ID (do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (do not publish real credentials)
VERSION       = '20190416' # Foursquare API version

**3.2.3.2. Let's get the top 100 venues that are in each neighborhood in the dataframe within a radius of 500 meters.**

Let's create a function for this process.

In [38]:
def getNearbyVenues(bor, names, latitudes, longitudes, radius=500, LIMIT=100):
    """Return a dataframe with up to LIMIT venues within `radius` meters of each neighborhood."""
    venues_list = []
    for borough, name, lat, lng in zip(bor, names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request and keep only the list of returned items
        results = requests.get(url).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            borough,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one dataframe with one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 'Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
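
The function above indexes directly into the JSON response, so a response without `groups`, or a venue without categories, would raise an error. Whether the API ever returns such responses is an assumption here. The sketch below shows a more defensive way to flatten a payload with the same structure; `extract_venues` is a hypothetical helper and `fake` is a hand-written payload, not real API output.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None   # venue without a category
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# hand-written payload that mimics the structure accessed in the function above
fake = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]}]}}

print(extract_venues(fake))
print(extract_venues({'response': {}}))   # []
```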

Run the above function on each neighborhood and create a new dataframe called *toronto_cp_venues*.

In [39]:
toronto_cp_venues = getNearbyVenues(bor       = df_t_cp['Borough'], 
                                    names     = df_t_cp['Neighborhood'],
                                   latitudes  = df_t_cp['Latitude'],
                                   longitudes = df_t_cp['Longitude'])

Harbourfront,Regent Park
Ryerson,Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide,King,Richmond
Dovercourt Village,Dufferin
Harbourfront East,Toronto Islands,Union Station
Little Portugal,Trinity
The Danforth West,Riverdale
Design Exchange,Toronto Dominion Centre
Brockton,Exhibition Place,Parkdale Village
The Beaches West,India Bazaar
Commerce Court,Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North,Forest Hill West
High Park,The Junction South
North Toronto West
The Annex,North Midtown,Yorkville
Parkdale,Roncesvalles
Davisville
Harbord,University of Toronto
Runnymede,Swansea
Moore Park,Summerhill East
Chinatown,Grange Park,Kensington Market
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown,St. James Town
First Canadian Place,Underground city


Let's check the resulting dataframe:

In [40]:
print(toronto_cp_venues.shape)
toronto_cp_venues.head()

(1702, 8)


Unnamed: 0,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636,Toronto Cooper Koo Family Cherry St YMCA Centre,43.653191,-79.357947,Gym / Fitness Center
3,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
4,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


Let's check how many venues were returned for each borough:

In [41]:
toronto_cp_venues.groupby('Borough').count()

Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Central Toronto,112,112,112,112,112,112,112
Downtown Toronto,1294,1294,1294,1294,1294,1294,1294
East Toronto,117,117,117,117,117,117,117
West Toronto,179,179,179,179,179,179,179


Let's check how many unique venue categories there are:

In [42]:
print('There are {} unique categories.'.format(len(toronto_cp_venues['Venue Category'].unique())))

There are 233 unique categories.


Let's check how the venues are distributed across the neighborhoods: 

In [43]:
toronto_cp_venues.groupby(['Borough', 'Neighborhood'])[['Venue Category']].count()

Borough,Neighborhood,Venue Category
Central Toronto,Davisville,34
Central Toronto,Davisville North,9
Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West",14
Central Toronto,"Forest Hill North,Forest Hill West",4
Central Toronto,Lawrence Park,3
Central Toronto,"Moore Park,Summerhill East",3
Central Toronto,North Toronto West,20
Central Toronto,Roselawn,1
Central Toronto,"The Annex,North Midtown,Yorkville",24
Downtown Toronto,"Adelaide,King,Richmond",100


**3.2.3.3. Analyze each neighborhood.**

In [44]:
# one hot encoding
toronto_cp_onehot = pd.get_dummies(toronto_cp_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough and neighborhood column back to dataframe
toronto_cp_onehot['Borough'] = toronto_cp_venues['Borough'] 
toronto_cp_onehot['Neighborhood'] = toronto_cp_venues['Neighborhood'] 

# move borough and neighborhood columns to the first two columns
borough = toronto_cp_onehot['Borough']
toronto_cp_onehot.drop(labels=['Borough'], axis=1,inplace = True)
toronto_cp_onehot.insert(0,'Borough', borough)

neigh = toronto_cp_onehot['Neighborhood']
toronto_cp_onehot.drop(labels=['Neighborhood'], axis=1,inplace = True)
toronto_cp_onehot.insert(1,'Neighborhood', neigh)

toronto_cp_onehot.head()

Unnamed: 0,Borough,Neighborhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,Downtown Toronto,"Harbourfront,Regent Park",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Downtown Toronto,"Harbourfront,Regent Park",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Downtown Toronto,"Harbourfront,Regent Park",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Downtown Toronto,"Harbourfront,Regent Park",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Downtown Toronto,"Harbourfront,Regent Park",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's examine the new dataframe size:

In [45]:
toronto_cp_onehot.shape

(1702, 234)
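
A quick consistency check on this shape: 233 unique categories plus the 'Borough' and 'Neighborhood' columns would give 235 columns, not 234. One plausible explanation, which the output above does not confirm, is that one venue category is itself named 'Neighborhood', so assigning the 'Neighborhood' column overwrote that dummy column. The toy sketch below reproduces the effect and shows one way to avoid it; the data and the renamed column label are made up for illustration.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'B'],
    'Venue Category': ['Café', 'Neighborhood', 'Park'],   # a category that collides with a column name
})

# naive version: the assignment replaces the 'Neighborhood' dummy column
naive = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
naive['Neighborhood'] = venues['Neighborhood']

# safer version: rename the colliding dummy column before adding the label column
dummies = pd.get_dummies(venues['Venue Category'])
dummies = dummies.rename(columns={'Neighborhood': 'Neighborhood (category)'})
safe = pd.concat([venues[['Neighborhood']], dummies], axis=1)

print(naive.shape[1], safe.shape[1])   # 3 4
```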

Let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category:

In [46]:
toronto_cp_grouped = toronto_cp_onehot.groupby(['Borough','Neighborhood']).mean().reset_index()
toronto_cp_grouped

Unnamed: 0,Borough,Neighborhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,Central Toronto,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Central Toronto,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0
3,Central Toronto,"Forest Hill North,Forest Hill West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Toronto,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Central Toronto,"Moore Park,Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Toronto,North Toronto West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05
7,Central Toronto,Roselawn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Central Toronto,"The Annex,North Midtown,Yorkville",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0
9,Downtown Toronto,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0


Let's confirm the new size:

In [47]:
toronto_cp_grouped.shape

(38, 234)
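
To see why the mean of the one-hot columns is a frequency, here is a minimal sketch on four toy venues (illustrative data). Each one-hot column holds 1 for venues of that category and 0 otherwise, so its mean within a neighborhood is the share of that neighborhood's venues in the category. The `crosstab` call computes the same table directly as a cross-check.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Park'],
})

# one-hot encode, then average per neighborhood
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
freq = onehot.groupby('Neighborhood').mean()

# the same table, computed directly as row-normalized counts
check = pd.crosstab(venues['Neighborhood'], venues['Venue Category'], normalize='index')

print(freq)
```

Neighborhood A has two cafés out of three venues, so its 'Café' frequency is 2/3; each row sums to 1.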

Let's print each neighborhood along with the top 5 most common venues:

In [48]:
num_top_venues = 5

for hood in toronto_cp_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_cp_grouped[toronto_cp_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Davisville----
                venue  freq
0      Sandwich Place  0.09
1        Dessert Shop  0.09
2                Café  0.06
3         Pizza Place  0.06
4  Italian Restaurant  0.06


----Davisville North----
            venue  freq
0  Breakfast Spot  0.11
1           Hotel  0.11
2            Park  0.11
3             Gym  0.11
4      Restaurant  0.11


----Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West----
                 venue  freq
0                  Pub  0.14
1          Coffee Shop  0.14
2          Pizza Place  0.07
3  American Restaurant  0.07
4     Sushi Restaurant  0.07


----Forest Hill North,Forest Hill West----
                venue  freq
0  Mexican Restaurant  0.25
1               Trail  0.25
2    Sushi Restaurant  0.25
3       Jewelry Store  0.25
4      Adult Boutique  0.00


----Lawrence Park----
            venue  freq
0            Park  0.33
1     Swim School  0.33
2        Bus Line  0.33
3  Adult Boutique  0.00
4          Museum  0.00


----Moore Par

Let's put that into a *pandas* dataframe.

1) First, let's write a function to sort the venues in descending order:

In [49]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
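
As a hedged alternative, `Series.nlargest` returns the top entries without sorting the whole row. The sketch below assumes, as in the function above, that the first two entries of a row are the 'Borough' and 'Neighborhood' labels; `top_categories` and the toy row are illustrative.

```python
import pandas as pd

def top_categories(row, num_top_venues, n_label_columns=2):
    """Return the category names with the highest frequencies in one grouped row."""
    freqs = row.iloc[n_label_columns:].astype(float)   # skip 'Borough' and 'Neighborhood'
    return freqs.nlargest(num_top_venues).index.tolist()

row = pd.Series({'Borough': 'X', 'Neighborhood': 'Y',
                 'Café': 0.5, 'Park': 0.3, 'Pub': 0.2})

print(top_categories(row, 2))   # ['Café', 'Park']
```

Many rows in the real data contain tied frequencies, so the order among tied categories may differ from the sort-based function.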

2) Let's create the new dataframe and display the top 10 venues for each neighborhood:

In [50]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough', 'Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:   # ind >= 3: no special suffix, fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Borough'] = toronto_cp_grouped['Borough']
neighborhoods_venues_sorted['Neighborhood'] = toronto_cp_grouped['Neighborhood']


for ind in np.arange(toronto_cp_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(toronto_cp_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Borough,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,Davisville,Sandwich Place,Dessert Shop,Sushi Restaurant,Café,Pizza Place,Coffee Shop,Italian Restaurant,Indian Restaurant,Deli / Bodega,Gourmet Shop
1,Central Toronto,Davisville North,Gym,Food & Drink Shop,Park,Breakfast Spot,Clothing Store,Sandwich Place,Restaurant,Burger Joint,Hotel,Discount Store
2,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",Pub,Coffee Shop,Convenience Store,Sushi Restaurant,Bagel Shop,Fried Chicken Joint,Sports Bar,American Restaurant,Supermarket,Pizza Place
3,Central Toronto,"Forest Hill North,Forest Hill West",Trail,Mexican Restaurant,Jewelry Store,Sushi Restaurant,Yoga Studio,Dim Sum Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
4,Central Toronto,Lawrence Park,Bus Line,Park,Swim School,Yoga Studio,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
5,Central Toronto,"Moore Park,Summerhill East",Restaurant,Gym,Playground,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
6,Central Toronto,North Toronto West,Clothing Store,Coffee Shop,Sporting Goods Shop,Bagel Shop,Fast Food Restaurant,Mexican Restaurant,Diner,Dessert Shop,Park,Chinese Restaurant
7,Central Toronto,Roselawn,Garden,Yoga Studio,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
8,Central Toronto,"The Annex,North Midtown,Yorkville",Coffee Shop,Sandwich Place,Café,Pizza Place,Liquor Store,Burger Joint,Jewish Restaurant,Pub,BBQ Joint,Cheese Shop
9,Downtown Toronto,"Adelaide,King,Richmond",Coffee Shop,Thai Restaurant,Café,Steakhouse,American Restaurant,Salad Place,Restaurant,Bar,Burger Joint,Bakery
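
The column names above rely on a list of three suffixes and an exception for everything else, which works for the top 10 but would produce labels such as '21th' for a longer ranking. A small helper makes the rule explicit; `ordinal` is a hypothetical function, sketched here as an option and not used by the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:          # 11th, 12th, 13th, ... are irregular
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[0], '|', columns[3], '|', columns[9])
```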


**3.2.3.4. Cluster neighborhoods.**

Run *k*-means to cluster the neighborhoods into 5 clusters:

In [51]:
# set number of clusters
kclusters = 5

toronto_cp_grouped_clustering = toronto_cp_grouped.drop(columns=['Borough','Neighborhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_cp_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 4, 3, 0, 1, 0, 0], dtype=int32)
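
Before interpreting the clusters, it can help to look at how many neighborhoods fall into each one. The sketch below counts only the ten labels printed above, so it illustrates the technique and does not summarize the full result; in the notebook the same call could be applied to `kmeans.labels_`.

```python
import numpy as np

first_labels = np.array([0, 0, 0, 0, 4, 3, 0, 1, 0, 0])   # the ten labels shown above
sizes = np.bincount(first_labels, minlength=5)             # one count per cluster 0..4

print(sizes)   # [7 1 0 1 1]
```

Among these ten rows, seven belong to cluster 0, which suggests checking whether one cluster dominates the full labelling as well.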

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood:

In [52]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_cp_merged = df_t_cp 

# merge df_t_cp with neighborhoods_venues_sorted to add latitude/longitude for each neighborhood
toronto_cp_merged = toronto_cp_merged.merge(neighborhoods_venues_sorted, on='Neighborhood')
# 'Borough' exists in both dataframes, so the merge yields 'Borough_x' and 'Borough_y'; keep one copy
toronto_cp_merged = toronto_cp_merged.drop(columns=['PostalCode','Borough_y'])
toronto_cp_merged = toronto_cp_merged.rename(columns={"Borough_x": "Borough"})
toronto_cp_merged 

#toronto_cp_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636,0,Coffee Shop,Bakery,Café,Pub,Park,Mexican Restaurant,Breakfast Spot,Restaurant,Theater,Dessert Shop
1,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937,0,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Middle Eastern Restaurant,Restaurant,Bubble Tea Shop,Pizza Place,Bar,Bookstore
2,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Restaurant,Café,Hotel,Cosmetics Shop,Italian Restaurant,Gastropub,Cocktail Bar,Bakery,Breakfast Spot
3,East Toronto,The Beaches,43.676357,-79.293031,0,Health Food Store,Coffee Shop,Pub,Yoga Studio,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
4,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Cocktail Bar,Seafood Restaurant,Cheese Shop,Steakhouse,Farmers Market,Pub,Café,Restaurant,Bakery
5,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Italian Restaurant,Bubble Tea Shop,Burger Joint,Ice Cream Shop,Bar,Café,Sandwich Place,Middle Eastern Restaurant,Sushi Restaurant
6,Downtown Toronto,Christie,43.669542,-79.422564,0,Grocery Store,Café,Park,Restaurant,Italian Restaurant,Convenience Store,Coffee Shop,Athletics & Sports,Nightclub,Baby Store
7,Downtown Toronto,"Adelaide,King,Richmond",43.650571,-79.384568,0,Coffee Shop,Thai Restaurant,Café,Steakhouse,American Restaurant,Salad Place,Restaurant,Bar,Burger Joint,Bakery
8,West Toronto,"Dovercourt Village,Dufferin",43.669005,-79.442259,0,Supermarket,Bakery,Pharmacy,Liquor Store,Middle Eastern Restaurant,Bank,Bar,Discount Store,Furniture / Home Store,Pool
9,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.640816,-79.381752,0,Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant,Scenic Lookout,Brewery,Bakery,Pizza Place,Fried Chicken Joint
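
The `Borough_x` and `Borough_y` columns handled above come from pandas' default merge suffixes, which are applied when both dataframes share a column that is not the join key. The sketch below reproduces this on two tiny hypothetical frames, `coords_demo` and `venues_demo`, standing in for `df_t_cp` and `neighborhoods_venues_sorted`. The postal codes are placeholders; the other values are copied from the table above.

```python
import pandas as pd

# hypothetical miniature of df_t_cp (postal codes are placeholders)
coords_demo = pd.DataFrame({
    'PostalCode': ['P1', 'P2'],
    'Borough': ['Downtown Toronto', 'East Toronto'],
    'Neighborhood': ['Harbourfront,Regent Park', 'The Beaches'],
    'Latitude': [43.65426, 43.676357],
    'Longitude': [-79.360636, -79.293031],
})

# hypothetical miniature of neighborhoods_venues_sorted after inserting the labels
venues_demo = pd.DataFrame({
    'Cluster Labels': [0, 0],
    'Borough': ['Downtown Toronto', 'East Toronto'],
    'Neighborhood': ['Harbourfront,Regent Park', 'The Beaches'],
    '1st Most Common Venue': ['Coffee Shop', 'Health Food Store'],
})

# 'Borough' appears on both sides, so pandas renames it to Borough_x / Borough_y
merged_demo = coords_demo.merge(venues_demo, on='Neighborhood')
print(list(merged_demo.columns))

# drop the redundant columns and restore the plain 'Borough' name
merged_demo = (merged_demo
               .drop(columns=['PostalCode', 'Borough_y'])
               .rename(columns={'Borough_x': 'Borough'}))
print(list(merged_demo.columns))
```

Note that `merge` performs an inner join by default, so a neighborhood missing from either dataframe would be dropped from the result.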


Finally, let's visualize the resulting clusters:

In [53]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, bor, cluster in zip(toronto_cp_merged['Latitude'], toronto_cp_merged['Longitude'], toronto_cp_merged['Neighborhood'], toronto_cp_merged['Borough'], toronto_cp_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ', ' + bor + ', Cluster ' + str(cluster), parse_html=True)
    # index the palette directly by the 0-based cluster label
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
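
The color scheme only needs one distinct hex color per cluster. As a minimal sketch, the palette can be built directly from `kclusters` evenly spaced samples of the `rainbow` colormap; the name `palette` is illustrative and is not used elsewhere in this notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# sample the colormap at kclusters evenly spaced points and convert each RGBA value to a hex string
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# pair each 0-based cluster label with its color
for cluster_label, hex_color in enumerate(palette):
    print(cluster_label, hex_color)
```

Indexing the palette by the cluster label itself keeps the mapping between labels and colors easy to read in the map popups.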

**3.2.3.5. Examine clusters.**

**Cluster 0**

In [54]:
toronto_cp_merged.loc[toronto_cp_merged['Cluster Labels'] == 0, toronto_cp_merged.columns[[1] + list(range(5,toronto_cp_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Harbourfront,Regent Park",Coffee Shop,Bakery,Café,Pub,Park,Mexican Restaurant,Breakfast Spot,Restaurant,Theater,Dessert Shop
1,"Ryerson,Garden District",Coffee Shop,Clothing Store,Café,Cosmetics Shop,Middle Eastern Restaurant,Restaurant,Bubble Tea Shop,Pizza Place,Bar,Bookstore
2,St. James Town,Coffee Shop,Restaurant,Café,Hotel,Cosmetics Shop,Italian Restaurant,Gastropub,Cocktail Bar,Bakery,Breakfast Spot
3,The Beaches,Health Food Store,Coffee Shop,Pub,Yoga Studio,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
4,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Cheese Shop,Steakhouse,Farmers Market,Pub,Café,Restaurant,Bakery
5,Central Bay Street,Coffee Shop,Italian Restaurant,Bubble Tea Shop,Burger Joint,Ice Cream Shop,Bar,Café,Sandwich Place,Middle Eastern Restaurant,Sushi Restaurant
6,Christie,Grocery Store,Café,Park,Restaurant,Italian Restaurant,Convenience Store,Coffee Shop,Athletics & Sports,Nightclub,Baby Store
7,"Adelaide,King,Richmond",Coffee Shop,Thai Restaurant,Café,Steakhouse,American Restaurant,Salad Place,Restaurant,Bar,Burger Joint,Bakery
8,"Dovercourt Village,Dufferin",Supermarket,Bakery,Pharmacy,Liquor Store,Middle Eastern Restaurant,Bank,Bar,Discount Store,Furniture / Home Store,Pool
9,"Harbourfront East,Toronto Islands,Union Station",Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant,Scenic Lookout,Brewery,Bakery,Pizza Place,Fried Chicken Joint


**Cluster 1**

In [55]:
toronto_cp_merged.loc[toronto_cp_merged['Cluster Labels'] == 1, toronto_cp_merged.columns[[1] + list(range(5,toronto_cp_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Roselawn,Garden,Yoga Studio,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


**Cluster 2**

In [56]:
toronto_cp_merged.loc[toronto_cp_merged['Cluster Labels'] == 2, toronto_cp_merged.columns[[1] + list(range(5,toronto_cp_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,Rosedale,Park,Playground,Trail,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


**Cluster 3**

In [57]:
toronto_cp_merged.loc[toronto_cp_merged['Cluster Labels'] == 3, toronto_cp_merged.columns[[1] + list(range(5,toronto_cp_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
28,"Moore Park,Summerhill East",Restaurant,Gym,Playground,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


**Cluster 4**

In [58]:
toronto_cp_merged.loc[toronto_cp_merged['Cluster Labels'] == 4, toronto_cp_merged.columns[[1] + list(range(5,toronto_cp_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Lawrence Park,Bus Line,Park,Swim School,Yoga Studio,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
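
The five cells above repeat the same selection with a different label each time. As an alternative sketch, `groupby` can produce one table per cluster in a single pass. The frame `clusters_demo` below is a hypothetical four-row excerpt, with values copied from the outputs above, that stands in for `toronto_cp_merged`.

```python
import pandas as pd

# hypothetical excerpt of toronto_cp_merged (values copied from the tables above)
clusters_demo = pd.DataFrame({
    'Neighborhood': ['Harbourfront,Regent Park', 'Roselawn', 'Rosedale', 'The Beaches'],
    'Cluster Labels': [0, 1, 2, 0],
    '1st Most Common Venue': ['Coffee Shop', 'Garden', 'Park', 'Health Food Store'],
})

# build one table per cluster label, dropping the label column from each table
cluster_tables = {
    label: group.drop(columns=['Cluster Labels'])
    for label, group in clusters_demo.groupby('Cluster Labels')
}

for label, table in cluster_tables.items():
    print('Cluster', label, '-', len(table), 'neighborhood(s)')
    print(table.to_string(index=False))
```

In the outputs shown above, clusters 1 to 4 each hold a single neighborhood, so a summary like this makes the imbalance visible at a glance.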
