# Project - Segmenting and Clustering Neighborhoods in Toronto

## Project tasks - 1

1. Start by creating a new Notebook for this assignment;


2. Use the Notebook to build the code to scrape the following Wikipedia [page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), in order to obtain the data in the table of postal codes and transform it into a pandas dataframe;


3. To create the above dataframe:
    - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood;
    - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned;
    - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
    - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
    - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
    - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.


4. Submit a link to your Notebook on your Github repository. (**10 marks**)

**Note**: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe.

Another way, which would help you learn for more complicated cases of web scraping, is to use the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.

### Code

#### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim # convert the address into latitude and longitude values
import folium # map rendering library
from sklearn.cluster import KMeans # kmeans

# matplotlib associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

#### Loading the data

In [2]:
df_raw = pd.read_html('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969')[0]
df_raw

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


#### Cleaning the data

Renaming the columns

In [3]:
df_raw.columns=['PostalCode', 'Borough', 'Neighborhood']
df_raw.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Dataframe dimensions

In [4]:
df_raw.shape

(180, 3)

Querying the "Not assigned" data in Borough column

In [5]:
df_raw.loc[df_raw['Borough'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
7,M8A,Not assigned,Not assigned
10,M2B,Not assigned,Not assigned
15,M7B,Not assigned,Not assigned
...,...,...,...
174,M4Z,Not assigned,Not assigned
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned


Creating a new dataframe without the data above

In [6]:
new_df = df_raw.drop(df_raw.loc[df_raw['Borough'] == 'Not assigned'].index)

# resetting the index and discarding the old index values
new_df.reset_index(drop=True, inplace=True)

Showing the first rows of the dataframe

In [7]:
new_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


The new size of the dataframe

In [8]:
print(f'The dataframe has {new_df.shape[0]} rows and {new_df.shape[1]} columns')

The dataframe has 103 rows and 3 columns
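In the page revision loaded above, the neighborhoods that share a postal code already appear combined (M5A shows "Regent Park, Harbourfront"), so the notebook only has to drop the unassigned boroughs. For a revision where the rows are still split, the three cleaning rules from the task list could be sketched roughly as follows. The function name `clean_postal_table` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def clean_postal_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the task's cleaning rules to a PostalCode/Borough/Neighborhood table."""
    # Rule 1: keep only rows with an assigned borough.
    out = raw[raw['Borough'] != 'Not assigned'].copy()

    # Rule 2: an unassigned neighborhood takes the name of its borough.
    unassigned = out['Neighborhood'] == 'Not assigned'
    out.loc[unassigned, 'Neighborhood'] = out.loc[unassigned, 'Borough']

    # Rule 3: rows sharing a postal code become one comma-separated row.
    out = (out.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
              .agg(', '.join)
              .reset_index())
    return out


# Toy input (made-up rows in the same layout as the Wikipedia table)
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})
cleaned = clean_postal_table(toy)
print(cleaned)
```

Passing `sort=False` to `groupby` keeps the postal codes in their order of first appearance, which matches the order of the Wikipedia table.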


## Project tasks - 2

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, **we need to get the latitude and the longitude coordinates of each neighborhood.** 

In an older version of this course, we used the Google Maps Geocoding API to get the latitude and longitude coordinates of each neighborhood. However, Google has since started [charging](http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/) for its API, so we will use the [Geocoder](https://geocoder.readthedocs.io/index.html) Python package instead.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a given postal code may return None, and the same call made again may return the coordinates. So, to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code.

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a [link](http://cocl.us/Geospatial_data) to a csv file that has the geographical coordinates of each postal code.
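The retry idea can be sketched as a loop with an upper bound on attempts, so that a postal code that never resolves does not hang the notebook. This is a minimal sketch: `geocode_with_retry`, its parameters, and the fake `flaky_lookup` are hypothetical names for illustration. In practice `lookup` would be a small wrapper around whichever geocoding call is used, such as the `geocoder.google` function mentioned in the task.

```python
import time
from typing import Callable, Optional, Tuple

LatLng = Tuple[float, float]


def geocode_with_retry(lookup: Callable[[str], Optional[LatLng]],
                       query: str,
                       max_attempts: int = 5,
                       pause_seconds: float = 1.0) -> Optional[LatLng]:
    """Call `lookup` until it returns coordinates or the attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        # Wait briefly before trying again, except after the last attempt.
        if attempt < max_attempts - 1:
            time.sleep(pause_seconds)
    return None


# Stand-in for an unreliable geocoder: fails twice, then succeeds.
calls = {'n': 0}


def flaky_lookup(query: str) -> Optional[LatLng]:
    calls['n'] += 1
    return None if calls['n'] < 3 else (43.65426, -79.360636)


result = geocode_with_retry(flaky_lookup, 'M5A, Toronto, Ontario', pause_seconds=0)
print(result, 'after', calls['n'], 'calls')
```

Bounding the loop is a design choice: an unbounded `while` would match the task text more literally, but it never terminates for a postal code the service cannot resolve.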

Use the Geocoder package or the csv file to create the following dataframe:

![image1](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1615939200000&hmac=mq080w2Ir4JW5B-xGJYu33K5idfavshLw9ogWHMeFtk)

**Important Note**: There is a limit on how many times you can call the geocoder.google function. It is 2500 times per day. This should be more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in Toronto.

Once you are able to create the above dataframe, submit a link to the new Notebook on your Github repository. (2 marks)

### Code

Since the geocoder library was not able to return the latitude and longitude of the Toronto neighborhoods, we will use the **csv** file to get them.

#### Loading and merging the dataset

In [9]:
coordinates_df = pd.read_csv('toronto_neighborhoods/Geospatial_Coordinates.csv')

# renaming the "Postal Code" column
coordinates_df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
coordinates_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
# merging the dataframes
df = pd.merge(new_df, coordinates_df, on='PostalCode')
df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


In [11]:
print(f'Now we have a dataframe with {df.shape[0]} rows and {df.shape[1]} columns')

Now we have a dataframe with 103 rows and 5 columns
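The default inner merge silently drops any postal code that has no match in the coordinates file. Here the row count stays at 103, so nothing was lost, but a check along the following lines could make that explicit. This is a sketch on toy data: the code `M9Z` and its borough are made up to force a mismatch.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coordinates = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every neighborhood row, validate raises if a postal code
# is duplicated on either side, and indicator adds a '_merge' column.
checked = pd.merge(neighborhoods, coordinates, on='PostalCode',
                   how='left', validate='one_to_one', indicator=True)

missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list would confirm that every postal code received coordinates.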


In [12]:
# listing the distinct boroughs in the data
df['Borough'].unique()

array(['North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Toronto/York', 'Mississauga'], dtype=object)

## Project tasks - 3

**Explore and cluster the neighborhoods in Toronto.** You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you. 

**Just make sure:**

- to add enough Markdown cells to explain what you decided to do and to report any observations you make. 
- to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

### Code

**Let's use geopy to get the geographical coordinates of Toronto**

In [13]:
address = 'Toronto, ON, Canada'

# defining the user_agent
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print(f'The geographical coordinates of Toronto are {latitude}, {longitude}')

The geographical coordinates of Toronto are 43.6534817, -79.3839347


**Creating a map of Toronto with neighborhoods superimposed on top**

In [14]:
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# adding markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = f'{neighborhood}, {borough}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        location=[lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)

toronto_map

Next, we will use the Foursquare API to explore the neighborhoods and segment them.

#### Defining Foursquare credentials and version

**The credentials were removed for sharing**

In [16]:
# defining the radius and the limit of venues to get
LIMIT = 100
radius = 500

#### Exploring the neighborhoods in Toronto

Creating a function to get each nearby venue information

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
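The list comprehension in `getNearbyVenues` assumes that every response has at least one group and that every venue has at least one category. A more defensive version of just the parsing step might look like the sketch below. The payload shape is inferred from the keys the function reads, and the helper name `extract_venue_rows` and the `'Unknown'` placeholder are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple


def extract_venue_rows(name: str, lat: float, lng: float,
                       payload: Dict[str, Any]) -> List[Tuple]:
    """Turn one explore-style JSON payload into rows for the venues dataframe."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to a placeholder when a venue carries no category.
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written payload in the shape the function above expects
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorized Spot',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Parkwoods', 43.753259, -79.329656, fake_payload)
print(rows)
```

An empty or malformed payload then yields an empty list instead of raising, so one failed neighborhood does not stop the whole loop.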

In [18]:
toronto_venues = getNearbyVenues(names=df['Neighborhood'],
                                latitudes=df['Latitude'],
                                longitudes=df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

**Let's check the size of the new dataframe**

In [19]:
toronto_venues.shape

(2119, 7)

In [20]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Brookbanks Pool,43.751389,-79.332184,Pool
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Parkwoods,43.753259,-79.329656,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
4,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena


**Now, let's check how many venues were returned for each neighborhood**

In [21]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",7,7,7,7,7,7
"Bathurst Manor, Wilson Heights, Downsview North",23,23,23,23,23,23
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",24,24,24,24,24,24
...,...,...,...,...,...,...
"Willowdale, Willowdale East",34,34,34,34,34,34
"Willowdale, Willowdale West",5,5,5,5,5,5
Woburn,4,4,4,4,4,4
Woodbine Heights,6,6,6,6,6,6


**Let's check how many unique categories can be curated from all returned venues**

In [22]:
print(f'There are {toronto_venues["Venue Category"].nunique()} unique categories')

There are 266 unique categories


#### Analyzing each Neighborhood

In [23]:
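# one-hot encoding of the venue categories
toronto_one_hot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# the venue data contains a category that is itself named "Neighborhood"; drop that
# dummy column so it does not clash with the neighborhood-name column added next
toronto_one_hot.drop('Neighborhood', axis=1, inplace=True)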

In [24]:
# add neighborhood column back to dataframe
toronto_one_hot['Neighborhood'] = toronto_venues['Neighborhood']

In [25]:
# move neighborhood column to the first column
fixed_columns = [toronto_one_hot.columns[-1]] + list(toronto_one_hot.columns[:-1])
toronto_one_hot = toronto_one_hot[fixed_columns]

toronto_one_hot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's examine the new dataframe

In [26]:
toronto_one_hot.shape

(2119, 266)

#### Next we will group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [27]:
toronto_grouped = toronto_one_hot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Let's confirm the new size**

In [28]:
toronto_grouped.shape

(96, 266)
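To see why the mean of the one-hot columns is a frequency, a toy example with two made-up neighborhoods helps. This is only an illustration of the technique, with invented data.

```python
import pandas as pd

# Made-up venues: neighborhood 'A' has four venues, 'B' has two
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Café', 'Pool', 'Bank', 'Café'],
})

# One column per category, 1.0 where the venue belongs to it
one_hot = pd.get_dummies(venues['Venue Category'], dtype=float)
one_hot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Averaging the indicators gives the share of venues in each category
freq = one_hot.groupby('Neighborhood').mean()
print(freq)
```

Two of the four venues in `A` are parks, so `freq.loc['A', 'Park']` is 0.5, and each row sums to 1.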

**Let's print each neighborhood along with its top 5 most common venues**

In [29]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("-----"+hood+"-----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

-----Agincourt-----
                       venue  freq
0                     Lounge  0.25
1  Latin American Restaurant  0.25
2             Breakfast Spot  0.25
3               Skating Rink  0.25
4              Metro Station  0.00


-----Alderwood, Long Branch-----
            venue  freq
0     Pizza Place  0.29
1             Pub  0.14
2             Gym  0.14
3  Sandwich Place  0.14
4    Skating Rink  0.14


-----Bathurst Manor, Wilson Heights, Downsview North-----
                       venue  freq
0                       Bank  0.09
1                Coffee Shop  0.09
2              Shopping Mall  0.04
3  Middle Eastern Restaurant  0.04
4             Sandwich Place  0.04


-----Bayview Village-----
                 venue  freq
0                 Café  0.25
1  Japanese Restaurant  0.25
2                 Bank  0.25
3   Chinese Restaurant  0.25
4                Motel  0.00


-----Bedford Park, Lawrence Manor East-----
                  venue  freq
0        Sandwich Place  0.08
1    Italian 

                venue  freq
0  Dim Sum Restaurant  0.25
1                Park  0.25
2            Bus Line  0.25
3         Swim School  0.25
4   Accessories Store  0.00


-----Leaside-----
                    venue  freq
0             Coffee Shop  0.12
1     Sporting Goods Shop  0.09
2  Furniture / Home Store  0.06
3            Burger Joint  0.06
4                    Bank  0.06


-----Little Portugal, Trinity-----
                   venue  freq
0                    Bar  0.14
1  Vietnamese Restaurant  0.05
2            Men's Store  0.05
3       Asian Restaurant  0.05
4                   Café  0.05


-----Malvern, Rouge-----
                  venue  freq
0  Fast Food Restaurant   1.0
1   Monument / Landmark   0.0
2         Luggage Store   0.0
3                Market   0.0
4   Martial Arts School   0.0


-----Milliken, Agincourt North, Steeles East, L'Amoreaux East-----
               venue  freq
0         Playground  0.25
1               Park  0.25
2   Asian Restaurant  0.25
3       Inter

4          Coffee Shop  0.06


-----Willowdale, Willowdale West-----
           venue  freq
0    Pizza Place   0.2
1    Coffee Shop   0.2
2    Supermarket   0.2
3       Pharmacy   0.2
4  Grocery Store   0.2


-----Woburn-----
                       venue  freq
0                Coffee Shop  0.50
1      Korean BBQ Restaurant  0.25
2          Indian Restaurant  0.25
3  Middle Eastern Restaurant  0.00
4        Monument / Landmark  0.00


-----Woodbine Heights-----
                venue  freq
0            Bus Stop  0.17
1                Park  0.17
2          Beer Store  0.17
3        Skating Rink  0.17
4  Athletics & Sports  0.17


-----York Mills West-----
                        venue  freq
0           Electronics Store  0.25
1                        Park  0.25
2  Construction & Landscaping  0.25
3           Convenience Store  0.25
4          Mexican Restaurant  0.00




**Let's put that into a pandas dataframe**

First we will define a function that sorts the venues in descending order

In [30]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create a new dataframe and display the top venues for each neighborhood

In [31]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to the number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # past the 3rd entry there is no special suffix, so fall back to "th"
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Latin American Restaurant,Skating Rink,Breakfast Spot,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Gym,Skating Rink,Sandwich Place,Pub,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Park,Mobile Phone Shop,Bridal Shop,Sandwich Place,Diner,Restaurant,Deli / Bodega,Intersection
3,Bayview Village,Café,Bank,Japanese Restaurant,Chinese Restaurant,Yoga Studio,Dim Sum Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Pizza Place,Thai Restaurant,Fast Food Restaurant,Butcher,Pub,Café,Sushi Restaurant
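The suffix lookup in the cell above is correct for the ten columns used here, but it would produce "11th" only by accident of the fallback and "21th" for a 21st column. If more columns were ever needed, a small helper could cover the general case. The name `ordinal` is hypothetical and not part of the original notebook.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are exceptions to the last-digit rule
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


columns = ['Neighborhood'] + [f'{ordinal(i)} Most Common Venue' for i in range(1, 11)]
print(columns)
```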


#### Cluster Neighborhoods

Run k-means to cluster the neighborhoods into 5 clusters

In [32]:
# set the number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
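The first ten labels are all 1, and the cluster tables further down show three of the five clusters holding a single neighborhood each, which suggests that k = 5 may not be a natural split for this data. One common way to compare candidate values of k is the silhouette score. The sketch below runs on synthetic points rather than on `toronto_grouped_clustering`, and `silhouette_by_k` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, random_state=0):
    """Fit k-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Synthetic data: three well-separated groups of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_features = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = silhouette_by_k(toy_features, range(2, 6))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Scores closer to 1 indicate tighter, better separated clusters. On the real data the same function could be called with the clustering dataframe, keeping in mind that a score is only one signal and the maps remain the main check.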

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [33]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# merge toronto_grouped with df to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Park,Food & Drink Shop,Pool,Bus Stop,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Portuguese Restaurant,Coffee Shop,Hockey Arena,Intersection,Pizza Place,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Department Store
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Coffee Shop,Pub,Bakery,Park,Theater,Restaurant,Breakfast Spot,Café,Event Space,Performing Arts Venue
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Clothing Store,Accessories Store,Furniture / Home Store,Boutique,Coffee Shop,Women's Store,Vietnamese Restaurant,Colombian Restaurant,Falafel Restaurant,Event Space
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1.0,Coffee Shop,Diner,Sushi Restaurant,Yoga Studio,Fried Chicken Joint,Sandwich Place,Smoothie Shop,Burrito Place,Beer Bar,Japanese Restaurant


We decided to drop the rows that have no venue data available

In [34]:
toronto_merged.dropna(inplace=True)

In [35]:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)

**Let's visualize the resulting clusters**

In [36]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lng, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label=folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

#### Examine Clusters

**Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster.**

**Cluster 1**

In [37]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,York,0,Park,Women's Store,Pool,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
35,East York,0,Park,Intersection,Convenience Store,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Yoga Studio
52,North York,0,Park,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
64,York,0,Park,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
91,Downtown Toronto,0,Park,Playground,Trail,Eastern European Restaurant,Dumpling Restaurant,Electronics Store,Drugstore,Donut Shop,Doner Restaurant,Dance Studio


**Cluster 2**

In [38]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1,Park,Food & Drink Shop,Pool,Bus Stop,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant
1,North York,1,Portuguese Restaurant,Coffee Shop,Hockey Arena,Intersection,Pizza Place,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Department Store
2,Downtown Toronto,1,Coffee Shop,Pub,Bakery,Park,Theater,Restaurant,Breakfast Spot,Café,Event Space,Performing Arts Venue
3,North York,1,Clothing Store,Accessories Store,Furniture / Home Store,Boutique,Coffee Shop,Women's Store,Vietnamese Restaurant,Colombian Restaurant,Falafel Restaurant,Event Space
4,Downtown Toronto,1,Coffee Shop,Diner,Sushi Restaurant,Yoga Studio,Fried Chicken Joint,Sandwich Place,Smoothie Shop,Burrito Place,Beer Bar,Japanese Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
97,Downtown Toronto,1,Coffee Shop,Café,Hotel,Restaurant,Gym,Japanese Restaurant,Seafood Restaurant,American Restaurant,Asian Restaurant,Deli / Bodega
99,Downtown Toronto,1,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Yoga Studio,Hotel,Fast Food Restaurant,Café,Pub
100,East Toronto,1,Yoga Studio,Skate Park,Auto Workshop,Brewery,Burrito Place,Comic Shop,Farmers Market,Fast Food Restaurant,Garden,Garden Center
101,Etobicoke,1,Baseball Field,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop


**Cluster 3**

In [39]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
98,Etobicoke,2,River,Pool,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant


**Cluster 4**

In [40]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Etobicoke,3,Bakery,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop


**Cluster 5**

In [41]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Scarborough,4,Fast Food Restaurant,Yoga Studio,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant
