## Project: Segmenting and Clustering Neighborhoods in Toronto

<h2>Table of contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#import_data">Import Data from wiki page</a></li>
    <li><a href="#get coordinates">Get the latitude and the longitude coordinates of each neighborhood</a></li>
</ol>
    
</div>
 
<hr>

In [1]:
import pandas as pd
import numpy as np
import requests
from geopy.geocoders import Nominatim
!conda install -c conda-forge lxml --yes
!conda install -c conda-forge geocoder --yes





<h2 id="import_data">Part 1. Import Data from wiki page</h2>

First, let's copy the link to the web page we want to explore.

In [2]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641"

Request the web page with the requests library.

In [3]:
r = requests.get(url)


Parse the content of the web page into a dataframe with the pandas function _read_html_.

In [4]:
wiki_toronto = pd.read_html(r.content, header = 0)[0]
wiki_toronto.rename(columns={'Neighbourhood':'Neighborhood', 'Postcode':'Postalcode'},inplace=True)


Delete rows where the Neighborhood is "Not assigned" (rows without a borough have no neighborhood either, so they are dropped too).

In [5]:
# filter and reset the index in one step so we work on a new dataframe, not a view of wiki_toronto
toronto_neighborhoods = wiki_toronto[wiki_toronto["Neighborhood"] != 'Not assigned'].reset_index(drop=True)
toronto_neighborhoods.head()


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Let's merge neighborhoods that share a postal code.

In [6]:
# join the names of all neighborhoods that share a postal code into one comma-separated string
temp_df = toronto_neighborhoods.groupby('Postalcode')['Neighborhood'].apply(', '.join)
temp_df = temp_df.reset_index(drop=False)
temp_df.rename(columns={'Neighborhood':'Neighborhood_joined'}, inplace=True)
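
The same result can be reached without the temporary dataframe and the follow-up merge. The block below is a minimal sketch on a toy table (the three rows are copied from the output above; the name `toy` is an assumption for illustration): grouping on both `Postalcode` and `Borough` and aggregating the names yields one row per postal code directly.

```python
import pandas as pd

# Toy stand-in for toronto_neighborhoods (rows taken from the table above)
toy = pd.DataFrame({
    'Postalcode': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Harbourfront', 'Regent Park'],
})

# sort=False keeps the postal codes in their order of first appearance
joined = (toy.groupby(['Postalcode', 'Borough'], sort=False)['Neighborhood']
             .agg(', '.join)
             .reset_index())
print(joined)
```

Grouping on both key columns assumes that every postal code belongs to exactly one borough, which holds for the rows shown here.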


In [7]:
toronto_neighborhoods = pd.merge(toronto_neighborhoods, temp_df, on='Postalcode') # merge the two dataframes
toronto_neighborhoods.drop(['Neighborhood'], axis=1, inplace=True)
toronto_neighborhoods.drop_duplicates(inplace=True)
toronto_neighborhoods.rename(columns={'Neighborhood_joined':'Neighborhood'}, inplace=True) # rename columns
toronto_neighborhoods.reset_index(drop=True, inplace=True) # reset index in place, otherwise the result is discarded
toronto_neighborhoods.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M9A,Etobicoke,Islington Avenue


Let's find out how many postal codes we have in total.

In [8]:
print("There are",toronto_neighborhoods['Postalcode'].value_counts().sum(), 'postal codes in the dataframe!')

There are 102 postal codes in the dataframe!


Let's check whether any neighborhood is still "Not assigned". If so, it should take the name of its borough.

In [9]:
toronto_neighborhoods[toronto_neighborhoods['Neighborhood']=='Not assigned']

Unnamed: 0,Postalcode,Borough,Neighborhood


No 'Not assigned' neighborhoods remain in the dataframe!

In [10]:
print('Size of the table is',toronto_neighborhoods.shape)

Size of the table is (102, 3)


<h2 id="get coordinates">Part 2. Get the latitude and the longitude coordinates of each neighborhood</h2>

In [11]:
# Define the dataframe columns for the final dataframe of Toronto neighborhoods with the coordinates.
column_names = ['PostalCode','Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# Create the dataframe for the data
neighborhoods = pd.DataFrame(columns=column_names)

Option 1. Double-click __here__ to see the code that gets the coordinates with geocoder (geocoder doesn't work!)

<!-- Import library:
import geocoder # import geocoder

rows = []
for postal_code, borough, neighborhood_name in zip(toronto_neighborhoods['Postalcode'],  toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood']):
    # reset for every postal code, otherwise the first result is reused for all rows
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    rows.append({'PostalCode': postal_code,
                 'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': lat_lng_coords[0],
                 'Longitude': lat_lng_coords[1]})

# build the dataframe once; DataFrame.append was removed in pandas 2.0
neighborhoods = pd.DataFrame(rows, columns=column_names)
--> 

Option 2. Double-click __here__ to see the code that gets the coordinates of the neighborhoods with Nominatim (doesn't work either!)

<!-- Import library:
geolocator = Nominatim(user_agent="foursquare_agent") # create the geocoder once, not per row
rows = []
for postal_code, borough, neighborhood_name in zip(toronto_neighborhoods['Postalcode'],  toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood']):
    address = "{}, Toronto, Ontario".format(postal_code)
    location = geolocator.geocode(address)
    # geocode returns None when the address is not found
    rows.append({'PostalCode': postal_code,
                 'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': location.latitude if location else None,
                 'Longitude': location.longitude if location else None})

# build the dataframe once; DataFrame.append was removed in pandas 2.0
neighborhoods = pd.DataFrame(rows, columns=column_names)
--> 

Option 3. Get the coordinates from a file:

Download the table from the link.

In [12]:
!wget -q -O 'Geospatial_coordinates.csv' https://cocl.us/Geospatial_data

Read the CSV file into a dataframe.

In [13]:
geocoord_by_postal_code = pd.read_csv('Geospatial_coordinates.csv', header = 0)
geocoord_by_postal_code.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Combine the two tables into one with five columns: PostalCode, Borough, Neighborhood, Latitude and Longitude.

In [14]:
rows = []
for postal_code, borough, neighborhood_name in zip(toronto_neighborhoods['Postalcode'],  
                                                    toronto_neighborhoods['Borough'], 
                                                    toronto_neighborhoods['Neighborhood']):
    # look up the coordinates recorded for this postal code
    filter_by_postal_code = geocoord_by_postal_code['Postal Code'] == postal_code
    filtered = geocoord_by_postal_code[filter_by_postal_code].reset_index(drop=True)
    rows.append({'PostalCode': postal_code,
                 'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': filtered.loc[0,'Latitude'],
                 'Longitude': filtered.loc[0,'Longitude']})

# build the dataframe once; DataFrame.append was removed in pandas 2.0
neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
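
The row-by-row lookup can also be written as a single join. The block below is a minimal sketch on toy data (the values are copied from the tables above; the names `areas` and `coords` are assumptions for illustration) that uses `DataFrame.merge` with different key names on each side.

```python
import pandas as pd

# Toy stand-ins for toronto_neighborhoods and geocoord_by_postal_code
areas = pd.DataFrame({
    'Postalcode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M4A', 'M3A'],
    'Latitude': [43.806686, 43.725882, 43.753259],
    'Longitude': [-79.194353, -79.315572, -79.329656],
})

# A left join keeps every neighborhood row, in its original order
merged = (areas.merge(coords, how='left', left_on='Postalcode', right_on='Postal Code')
               .drop(columns='Postal Code')
               .rename(columns={'Postalcode': 'PostalCode'}))
print(merged)
```

With a left join, a postal code that is missing from the coordinates file would show up as NaN in `Latitude` and `Longitude` instead of raising an error, so it is worth checking `merged.isna().sum()` afterwards.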


In [15]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 10 boroughs and 102 neighborhoods.


## Part 3. Explore and cluster the neighborhoods in Toronto

In [16]:
import folium # map rendering library

Use the geopy library to get the latitude and longitude of Toronto.

In [17]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [18]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [19]:
neighborhoods['Borough'].unique()


array(['North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

Segment and cluster only the neighborhoods in boroughs whose names contain 'York'. Let's slice the original dataframe and create a new dataframe with the York data.

In [20]:
condition = neighborhoods.Borough.str.contains('York')
york_data = neighborhoods[condition].reset_index(drop=True)
york_data.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
3,M3B,North York,Don Mills North,43.745906,-79.352188
4,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
5,M6B,North York,Glencairn,43.709577,-79.445073
6,M3C,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923
7,M4C,East York,Woodbine Heights,43.695344,-79.318389
8,M6C,York,Humewood-Cedarvale,43.693781,-79.428191
9,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512


Let's get the geographical coordinates of York.

In [21]:
address = 'York, Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of York are 43.6896191, -79.479188.


Visualize the York neighborhoods.

In [22]:
# create map of York using latitude and longitude values
map_york = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# add markers to map
for lat, lng, label in zip(york_data['Latitude'], york_data['Longitude'], york_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_york)  
    
map_york

#### Define Foursquare Credentials and Version

In [23]:
import os

VERSION = '20180605' # Foursquare API version

# Read the credentials from environment variables instead of hard-coding them in the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
print('Your credentials are loaded.')

Your credentials are loaded.


From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [24]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the flattened column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### A function to explore the venues within a 500-meter radius of each neighborhood in the borough

In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
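
The function indexes straight into `["response"]['groups'][0]['items']`, so a reply without a `groups` key raises `KeyError`. The block below is a minimal sketch of a more defensive lookup. The helper name `extract_items` and the shape of the failed payload are assumptions for illustration; only the successful shape (`response`, `groups`, `items`) comes from the code above.

```python
def extract_items(payload):
    """Return the venue items of an explore-style payload, or [] if they are missing."""
    groups = payload.get('response', {}).get('groups') or []
    if not groups:
        return []
    return groups[0].get('items', [])

# Successful shape, as used by getNearbyVenues
ok = {'response': {'groups': [{'items': [{'venue': {'name': 'Cafe'}}]}]}}

# Hypothetical error reply with an empty response object
failed = {'meta': {'code': 429}, 'response': {}}

print(len(extract_items(ok)))
print(extract_items(failed))
```

Returning an empty list lets the loop continue with the next neighborhood. Whether to skip silently, log the reply, or stop is a design choice that depends on why the request failed.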

#### Let's run the above function on each neighborhood and create a new dataframe called *york_venues*.

In [26]:
york_venues = getNearbyVenues(names=york_data['Neighborhood'],
                                   latitudes=york_data['Latitude'],
                                   longitudes=york_data['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Heights, Lawrence Manor
Don Mills North
Woodbine Gardens, Parkview Hill
Glencairn


KeyError: 'groups'

Let's check the size of the resulting dataframe

In [27]:
print(york_venues.shape)
york_venues.head()

NameError: name 'york_venues' is not defined

Let's check how many venues were returned for each neighborhood

In [29]:
york_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor, Downsview North, Wilson Heights",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",22,22,22,22,22,22
"CFB Toronto, Downsview East",2,2,2,2,2,2
Caledonia-Fairbanks,4,4,4,4,4,4
"Del Ray, Keelesdale, Mount Dennis, Silverthorn",5,5,5,5,5,5
Don Mills North,4,4,4,4,4,4
Downsview Central,4,4,4,4,4,4
Downsview Northwest,4,4,4,4,4,4
Downsview West,6,6,6,6,6,6


Let's find out how many unique categories the returned venues contain.

In [30]:
print('There are {} unique categories.'.format(len(york_venues['Venue Category'].unique())))

There are 120 unique categories.


## Analyze Each Neighborhood

In [31]:
# one hot encoding
york_onehot = pd.get_dummies(york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
york_onehot['Neighborhood'] = york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [york_onehot.columns[-1]] + list(york_onehot.columns[:-1])
york_onehot = york_onehot[fixed_columns]

york_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,...,Thai Restaurant,Theater,Toy / Game Store,Trail,Turkish Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
print('Dataframe size is ', york_onehot.shape)

Dataframe size is  (335, 121)


Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [33]:
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()
york_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,...,Thai Restaurant,Theater,Toy / Game Store,Trail,Turkish Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Women's Store,Yoga Studio
0,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,...,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CFB Toronto, Downsview East",0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Caledonia-Fairbanks,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0
5,"Del Ray, Keelesdale, Mount Dennis, Silverthorn",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0
6,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Downsview West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
print('The new size of the data frame is', york_grouped.shape)

The new size of the data frame is (33, 121)
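
To see why the mean of one-hot columns is a frequency, here is a minimal sketch on invented data (the neighborhoods 'Alpha' and 'Beta' and their venues are made up for illustration). Each venue contributes a single 1 in its category column, so the per-neighborhood mean is the share of venues in that category and every row sums to 1.

```python
import pandas as pd

# Invented venues: two in 'Alpha', four in 'Beta'
venues = pd.DataFrame({
    'Neighborhood': ['Alpha', 'Alpha', 'Beta', 'Beta', 'Beta', 'Beta'],
    'Venue Category': ['Park', 'Bank', 'Park', 'Park', 'Pool', 'Bank'],
})

# One column per category, one row per venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Mean of 0/1 indicators = fraction of the neighborhood's venues in each category
grouped = onehot.groupby('Neighborhood').mean()
print(grouped)
```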


#### Let's print each neighborhood along with the top 5 most common venues

In [35]:
num_top_venues = 5

for hood in york_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = york_grouped[york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Downsview North, Wilson Heights----
            venue  freq
0     Coffee Shop  0.10
1            Bank  0.10
2  Ice Cream Shop  0.05
3     Supermarket  0.05
4        Pharmacy  0.05


----Bayview Village----
                 venue  freq
0  Japanese Restaurant  0.25
1   Chinese Restaurant  0.25
2                 Café  0.25
3                 Bank  0.25
4                 Park  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0  Italian Restaurant  0.09
1         Coffee Shop  0.09
2      Sandwich Place  0.09
3       Grocery Store  0.05
4             Butcher  0.05


----CFB Toronto, Downsview East----
               venue  freq
0            Airport   0.5
1               Park   0.5
2  Accessories Store   0.0
3        Pizza Place   0.0
4           Pharmacy   0.0


----Caledonia-Fairbanks----
           venue  freq
0           Park  0.50
1           Pool  0.25
2  Women's Store  0.25
3  Luggage Store  0.00
4       Pharmacy  0.00


----Del Ray, Keele

#### Let's put that into a *pandas* dataframe
First, let's write a function to sort the venues in descending order.

In [36]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
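
The block below is a small sketch of the same idea on an invented row (the neighborhood name and frequencies are made up), using `Series.nlargest` as an alternative to sorting the whole row. Many categories share the same frequency in this dataset, and the default sort gives no guaranteed order among ties, which may be why the top-5 printout and the top-10 table list tied categories in a different order.

```python
import pandas as pd

# Invented row in the layout of york_grouped: the name first, then frequencies
row = pd.Series({'Neighborhood': 'Toy Village',
                 'Bank': 0.10, 'Park': 0.50, 'Pool': 0.25, 'Gym': 0.15})

# Drop the name and convert the remaining object values to floats
freqs = row.iloc[1:].astype(float)

# The three most frequent categories, highest first
top3 = freqs.nlargest(3).index.tolist()
print(top3)
```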

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [37]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions after the third take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Bank,Pharmacy,Bridal Shop,Fried Chicken Joint,Ice Cream Shop,Diner,Deli / Bodega,Middle Eastern Restaurant,Mobile Phone Shop
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Yoga Studio,Dog Run,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
2,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Greek Restaurant,Grocery Store,Indian Restaurant,Juice Bar,Liquor Store,Locksmith,Comfort Food Restaurant
3,"CFB Toronto, Downsview East",Airport,Park,Yoga Studio,Dog Run,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio,Deli / Bodega
4,Caledonia-Fairbanks,Park,Women's Store,Pool,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio


##  Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [38]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

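# keep only the numeric frequency columns (the positional axis argument was removed in pandas 2.0)
york_grouped_clustering = york_grouped.drop(columns='Neighborhood')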

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 1], dtype=int32)
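
For intuition about what the clustering step does, here is a minimal sketch of plain Lloyd's algorithm in NumPy on an invented frequency matrix. This is not scikit-learn's implementation, which among other things uses k-means++ initialization by default. The function name `kmeans_sketch` and the toy data are assumptions for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # start from k distinct rows chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared distance from every row to every centroid, shape (n, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Invented frequencies: columns are (Park, Restaurant)
X = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])

labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

The two park-heavy rows end up in one cluster and the two restaurant-heavy rows in the other. Which cluster gets the number 0 depends on the random start, which is why the notebook fixes `random_state`.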

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [39]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

york_merged = york_data

# join the cluster labels and top venues onto york_data, which holds the latitude/longitude of each neighborhood
york_merged = york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

york_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0.0,Park,Food & Drink Shop,Yoga Studio,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Pizza Place,Hockey Arena,Portuguese Restaurant,Intersection,Coffee Shop,Fried Chicken Joint,Frozen Yogurt Shop,Comfort Food Restaurant,Gas Station,Construction & Landscaping
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,1.0,Clothing Store,Accessories Store,Boutique,Gift Shop,Furniture / Home Store,Event Space,Coffee Shop,Women's Store,Vietnamese Restaurant,Athletics & Sports
3,M3B,North York,Don Mills North,43.745906,-79.352188,1.0,Gym,Caribbean Restaurant,Café,Japanese Restaurant,Yoga Studio,Dog Run,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
4,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,1.0,Pizza Place,Athletics & Sports,Pharmacy,Gastropub,Breakfast Spot,Intersection,Bank,Pet Store,Gym / Fitness Center,Food Court

The cluster labels are shown as floats because neighborhoods without any returned venues have no label (NaN) after the join. Drop those rows and convert the labels back to integers.


In [40]:
york_merged.dropna(subset = ['Cluster Labels'], axis = 0, inplace = True)
york_merged['Cluster Labels'] = york_merged['Cluster Labels'].astype("int")
york_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0,Park,Food & Drink Shop,Yoga Studio,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,Pizza Place,Hockey Arena,Portuguese Restaurant,Intersection,Coffee Shop,Fried Chicken Joint,Frozen Yogurt Shop,Comfort Food Restaurant,Gas Station,Construction & Landscaping
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,1,Clothing Store,Accessories Store,Boutique,Gift Shop,Furniture / Home Store,Event Space,Coffee Shop,Women's Store,Vietnamese Restaurant,Athletics & Sports
3,M3B,North York,Don Mills North,43.745906,-79.352188,1,Gym,Caribbean Restaurant,Café,Japanese Restaurant,Yoga Studio,Dog Run,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
4,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,1,Pizza Place,Athletics & Sports,Pharmacy,Gastropub,Breakfast Spot,Intersection,Bank,Pet Store,Gym / Fitness Center,Food Court


Finally, let's visualize the resulting clusters

In [41]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'], york_merged['Neighborhood'], york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
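
The markers above look up their color with `rainbow[cluster-1]`, so label 0 wraps around to the last color in the list. The block below is a sketch of a palette indexed by the label itself. It swaps in the standard-library `colorsys` module instead of matplotlib, and the helper name `cluster_palette` is an assumption for illustration.

```python
import colorsys

def cluster_palette(n):
    """Return n hex colors with evenly spaced hues."""
    palette = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(5)

# Index directly by cluster label: label 0 takes the first color
for cluster in range(5):
    print(cluster, palette[cluster])
```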

### Examine Clusters
We can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster. 

#### Cluster 0

In [42]:
york_merged.loc[york_merged['Cluster Labels'] == 0, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,0,Park,Food & Drink Shop,Yoga Studio,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
9,York,0,Park,Women's Store,Pool,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
18,North York,0,Airport,Park,Yoga Studio,Dog Run,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio,Deli / Bodega
21,North York,0,Construction & Landscaping,Park,Bakery,Yoga Studio,Dog Run,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio,Deli / Bodega
31,York,0,Park,Yoga Studio,Golf Course,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio,Deli / Bodega


#### Cluster 1

In [43]:
york_merged.loc[york_merged['Cluster Labels'] == 1, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,1,Pizza Place,Hockey Arena,Portuguese Restaurant,Intersection,Coffee Shop,Fried Chicken Joint,Frozen Yogurt Shop,Comfort Food Restaurant,Gas Station,Construction & Landscaping
2,North York,1,Clothing Store,Accessories Store,Boutique,Gift Shop,Furniture / Home Store,Event Space,Coffee Shop,Women's Store,Vietnamese Restaurant,Athletics & Sports
3,North York,1,Gym,Caribbean Restaurant,Café,Japanese Restaurant,Yoga Studio,Dog Run,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
4,East York,1,Pizza Place,Athletics & Sports,Pharmacy,Gastropub,Breakfast Spot,Intersection,Bank,Pet Store,Gym / Fitness Center,Food Court
5,North York,1,Pizza Place,Park,Pub,Japanese Restaurant,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice
6,North York,1,Coffee Shop,Gym,Beer Store,Sandwich Place,Sporting Goods Shop,Clothing Store,Art Gallery,Discount Store,Asian Restaurant,Chinese Restaurant
7,East York,1,Skating Rink,Dance Studio,Park,Beer Store,Curling Ice,Athletics & Sports,Department Store,Diner,Dim Sum Restaurant,Dessert Shop
8,York,1,Tennis Court,Field,Hockey Arena,Trail,Yoga Studio,Dim Sum Restaurant,Dessert Shop,Department Store,Deli / Bodega,Curling Ice
10,East York,1,Sporting Goods Shop,Coffee Shop,Shopping Mall,Burger Joint,Bank,Furniture / Home Store,Department Store,Liquor Store,Dessert Shop,Smoothie Shop
11,North York,1,Dog Run,Golf Course,Pool,Athletics & Sports,Mediterranean Restaurant,Discount Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice


#### Cluster 2

In [44]:
york_merged.loc[york_merged['Cluster Labels'] == 2, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,East York,2,Park,Convenience Store,Intersection,Yoga Studio,Dog Run,Construction & Landscaping,Cosmetics Shop,Curling Ice,Dance Studio,Deli / Bodega
32,North York,2,Park,Convenience Store,Yoga Studio,Golf Course,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice,Dance Studio,Deli / Bodega


#### Cluster 3

In [45]:
york_merged.loc[york_merged['Cluster Labels'] == 3, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,North York,3,Baseball Field,Yoga Studio,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio,Deli / Bodega,Department Store


#### Cluster 4

In [46]:
york_merged.loc[york_merged['Cluster Labels'] == 4, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,North York,4,Martial Arts School,Yoga Studio,Golf Course,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio,Deli / Bodega
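
One way to start naming the clusters is to count the most common venue in the first position of each cluster. The block below is a minimal sketch that uses the "1st Most Common Venue" values copied from the tables for clusters 0, 2, 3 and 4. Cluster 1 is left out because only part of its table is shown. The dictionary name `first_venues` is an assumption for illustration.

```python
from collections import Counter

# "1st Most Common Venue" per cluster, copied from the tables above
first_venues = {
    0: ['Park', 'Park', 'Airport', 'Construction & Landscaping', 'Park'],
    2: ['Park', 'Park'],
    3: ['Baseball Field'],
    4: ['Martial Arts School'],
}

# Most frequent first-place venue and its count, per cluster
dominant = {cluster: Counter(venues).most_common(1)[0]
            for cluster, venues in first_venues.items()}

for cluster, (venue, count) in dominant.items():
    print(cluster, venue, count)
```

Clusters 0 and 2 are both dominated by parks, while clusters 3 and 4 each hold a single neighborhood, so the second and third most common venues are needed to tell the park clusters apart.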
