# _Notebook for Segmenting and Clustering Neighborhoods in Toronto_

### **Part 1: Getting, cleaning, processing data**

In [2]:
# Importing libraries
import pandas as pd
import numpy as np
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests
# import k-means from clustering stage
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes # skip this line if folium is already installed
import folium # map rendering library
!conda install -c conda-forge geopy --yes # skip this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
!conda install -c anaconda lxml --yes # parser used by pandas read_html
import lxml


#### 1. Get info from Wikipedia page

In [67]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [68]:
# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
df0 = pd.read_html(url)[0]
df0.columns = ['PostalCode', 'Borough', 'Neighborhood']
df0

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor


#### 2. Ignore cells with a borough that is *__Not assigned__*

Let's see how many rows have a borough of 'Not assigned'

In [69]:
df0.Borough.value_counts()

Not assigned        77
Etobicoke           45
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

Then create a new dataframe without the 'Not assigned' boroughs

In [70]:
df1 = df0[df0.Borough != 'Not assigned']
print('New size is', df1.shape)

New size is (211, 3)


#### 3. More than one neighborhood can exist in one postal code area.
#### Such rows will be combined into one row, with the neighborhoods separated by commas

In [71]:
df2 = (df1.groupby(['PostalCode', 'Borough'])['Neighborhood']
          .apply(','.join)
          .to_frame(name='Neighborhood')
          .reset_index())
df2.head(3)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
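The grouping step can be tried on a tiny frame first. This is only a sketch: the three rows below are made up for illustration, and `agg` is used here in place of `apply`, which should give the same result for a simple string join.

```python
import pandas as pd

# Toy rows for illustration only (not the Wikipedia table)
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Highland Creek'],
})

# One row per (PostalCode, Borough); neighborhoods joined with a comma
combined = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .agg(','.join)
       .reset_index()
)
print(combined)
```

Within each group the neighborhoods keep the order in which they appear in the input rows.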


#### 4. If a row has a borough but its neighborhood is Not assigned, then the neighborhood will be the same as the borough.

In [72]:
# Check 'Not assigned' Neighborhood
df2.loc[df2['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


In [73]:
df2.loc[df2['Neighborhood'] == 'Not assigned', 'Neighborhood'] = \
df2.loc[df2['Neighborhood'] == 'Not assigned', 'Borough']

In [74]:
df2.loc[df2['PostalCode'] == 'M7A']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park
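The same replacement can also be written with `Series.where`, which avoids repeating the boolean mask on both sides of the assignment. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    'Borough': ["Queen's Park", 'Scarborough'],
    'Neighborhood': ['Not assigned', 'Woburn'],
})

# Keep the neighborhood where it is assigned; otherwise fall back to the borough
assigned = toy['Neighborhood'] != 'Not assigned'
toy['Neighborhood'] = toy['Neighborhood'].where(assigned, toy['Borough'])
print(toy)
```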


#### 5. Use the .shape attribute to print the number of rows of the dataframe.

In [75]:
df2.shape

(103, 3)

### **Part 2: Getting coordinates of each neighborhood**

#### 6. We need to get the latitude and the longitude coordinates of each neighborhood

Load the coordinates data from a CSV file

In [76]:
df_coord = pd.read_csv('Geospatial_Coordinates.csv')
df_coord.columns = ['PostalCode', 'Latitude', 'Longitude']
df_coord.head(3)

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711


Combine the two datasets on the PostalCode column

In [77]:
df_final = pd.merge(df2, df_coord, on='PostalCode')
df_final

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",43.739416,-79.588437
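An inner merge silently drops postal codes that have no coordinates. One way to check for that, sketched here on toy data, is a left merge with the `validate` and `indicator` options of `DataFrame.merge`. The postal code M9Z below is a made-up value used only to force a miss.

```python
import pandas as pd

# Toy frames for illustration; 'M9Z' is hypothetical and has no coordinates
left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Etobicoke']})
right = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# validate raises if a postal code is duplicated on either side;
# indicator adds a _merge column that marks unmatched rows
checked = left.merge(right, on='PostalCode', how='left',
                     validate='one_to_one', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list would mean every postal code found its coordinates.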


### **Part 3: Explore and cluster the neighborhoods in Toronto**

First, let's get the coordinates of Toronto so that we can center the map

In [78]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
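geopy's `geocode` returns `None` when it finds no match, so reading `.latitude` directly can fail. The helper below is a hypothetical wrapper (its name and the fake geocoder are assumptions made for this sketch) that falls back to default coordinates in that case.

```python
from collections import namedtuple

# Stand-in for a geopy Location; only the two attributes used here
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def fake_geocode(address):
    """Hypothetical geocoder used for the demo: knows a single address."""
    if address == 'Toronto, CA':
        return FakeLocation(43.653963, -79.387207)
    return None

def coords_or_default(geocode, address, default):
    """Call geocode(address); return (lat, lon), or default if nothing is found."""
    location = geocode(address)
    if location is None:
        return default
    return location.latitude, location.longitude

found = coords_or_default(fake_geocode, 'Toronto, CA', (0.0, 0.0))
fallback = coords_or_default(fake_geocode, 'Nowhere', (0.0, 0.0))
print(found, fallback)
```

In the notebook, `geolocator.geocode` could be passed in place of `fake_geocode`.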


In [79]:
df_final.Borough.value_counts() # just to check borough column

North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East York            5
East Toronto         5
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

Let's create a map of Toronto with postal code and borough markers

In [80]:
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, postcode in zip(df_final['Latitude'],
                                       df_final['Longitude'],
                                       df_final['Borough'],
                                       df_final['PostalCode']):
    label = '{}, {}'.format(postcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#863100',
        fill_opacity=0.7).add_to(map_tor)

map_tor

Let's restrict the analysis to rows whose postal code starts with M4

In [81]:
# startswith avoids the regex capture group, which makes str.contains emit a warning
df_explore = df_final.loc[df_final['PostalCode'].str.startswith('M4')].reset_index(drop=True)
# create map of Toronto with postal code and borough markers
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, postcode in zip(df_explore['Latitude'],
                                       df_explore['Longitude'],
                                       df_explore['Borough'],
                                       df_explore['PostalCode']):
    label = '{}, {}'.format(postcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#863100',
        fill_opacity=0.7).add_to(map_tor)

map_tor



In [82]:
df_explore

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4A,North York,Victoria Village,43.725882,-79.315572
1,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
2,M4C,East York,Woodbine Heights,43.695344,-79.318389
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4H,East York,Thorncliffe Park,43.705369,-79.349372
6,M4J,East York,East Toronto,43.685347,-79.338106
7,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
8,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
9,M4M,East Toronto,Studio District,43.659526,-79.340923


Foursquare credentials and API version

In [83]:
CLIENT_ID = '3Y4VRJ3XEEFJVFYIOJZI222GCS5YZJQWK5Y0DYVL43KLHFCM' # Foursquare ID
CLIENT_SECRET = '1JJEBLWZEKCSFWIZTIWR2K5TFYJ4XOC10Y4T2VFOG2JRC2FM' # Foursquare Secret
VERSION = '20190917' # Foursquare API version

Function that fetches the nearby venues for every neighborhood

In [84]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
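The function above indexes straight into the JSON response, so a single venue without a category raises an exception. Here is a minimal standard-library sketch of a more defensive variant. The helper names `build_explore_url` and `extract_venues` are made up for this example, and the payload used below only mirrors the keys that the notebook code reads; it is not a full Foursquare response.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL with URL-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; missing fields become None."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or [{}]
        rows.append((venue.get('name'), location.get('lat'),
                     location.get('lng'), categories[0].get('name')))
    return rows

# Placeholder credentials, for illustration only
url = build_explore_url('ID', 'SECRET', '20190917', 43.725882, -79.315572)
print(url)
```

`extract_venues(requests.get(url).json())` could then replace the chained indexing inside the loop.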

In [85]:
# Getting all the data
Tor_venues = getNearbyVenues(names=df_explore['Neighborhood'],
                                   latitudes=df_explore['Latitude'],
                                   longitudes=df_explore['Longitude']
                                  )

Victoria Village
Woodbine Gardens,Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Rosedale
Cabbagetown,St. James Town
Church and Wellesley


In [86]:
# checking resulting dataframe
print(Tor_venues.shape)
Tor_venues.head()

(411, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
1,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
2,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
3,Victoria Village,43.725882,-79.315572,Eglinton Ave E & Sloane Ave/Bermondsey Rd,43.726086,-79.31362,Intersection
4,Victoria Village,43.725882,-79.315572,Pizza Nova,43.725824,-79.31286,Pizza Place


In [87]:
print('There are {} unique categories.'.format(len(Tor_venues['Venue Category'].unique())))

There are 129 unique categories.


Let's analyze each neighborhood

In [88]:
# one hot encoding
Tor_onehot = pd.get_dummies(Tor_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Tor_onehot['Neighborhood'] = Tor_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Tor_onehot.columns[-1]] + list(Tor_onehot.columns[:-1])
Tor_onehot = Tor_onehot[fixed_columns]

Tor_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,Bank,Bar,...,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [89]:
Tor_onehot.shape

(411, 129)
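Note that Tor_onehot has 129 columns, the same as the number of unique categories, even though a Neighborhood column was added. That would be consistent with one venue category being literally named Neighborhood, in which case its dummy column gets overwritten. The sketch below, on made-up rows, shows one way to avoid such a clash by storing the label under a different column name (`Neighborhood Name` is an assumption of this example).

```python
import pandas as pd

# Toy venues for illustration; one category is literally called 'Neighborhood'
venues = pd.DataFrame({
    'Neighborhood': ['Leaside', 'Leaside', 'Rosedale'],
    'Venue Category': ['Coffee Shop', 'Neighborhood', 'Park'],
})

onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
# Insert the label as the first column, under a name no category uses
onehot.insert(0, 'Neighborhood Name', venues['Neighborhood'])

# Mean of the 0/1 columns = share of each category per neighborhood
freq = onehot.groupby('Neighborhood Name').mean()
print(onehot.shape)
print(freq)
```

With this approach every category keeps its own column, and the label column is always first.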

Let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category

In [90]:
Tor_grouped = Tor_onehot.groupby('Neighborhood').mean().reset_index()
Tor_grouped.head(3)

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,Bank,...,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint
0,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.021739,...,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Church and Wellesley,0.011111,0.011111,0.011111,0.011111,0.0,0.0,0.0,0.0,0.0,...,0.011111,0.011111,0.011111,0.0,0.0,0.011111,0.0,0.011111,0.0,0.011111
2,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.028571,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0


In [91]:
Tor_grouped.shape

(19, 129)

Let's print each neighborhood along with the top 5 most common venues

In [92]:
num_top_venues = 5

for hood in Tor_grouped['Neighborhood']:
    #print("----"+hood+"----")
    temp = Tor_grouped[Tor_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    #print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    #print('\n')

Let's put that into a pandas dataframe

In [93]:
# function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [94]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Tor_grouped['Neighborhood']

for ind in np.arange(Tor_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Tor_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Cabbagetown,St. James Town",Coffee Shop,Café,Restaurant,Pub,Bakery,Pizza Place,Italian Restaurant,General Entertainment,Caribbean Restaurant,Jewelry Store
1,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Hotel,Pub,Men's Store,Mediterranean Restaurant,Gym
2,Davisville,Dessert Shop,Sandwich Place,Gym,Sushi Restaurant,Coffee Shop,Café,Italian Restaurant,Pizza Place,Flower Shop,Brewery
3,Davisville North,Hotel,Gym,Breakfast Spot,Park,Clothing Store,Food & Drink Shop,Asian Restaurant,Sandwich Place,Dog Run,Electronics Store
4,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",Pub,Coffee Shop,Health & Beauty Service,Sushi Restaurant,Restaurant,Liquor Store,Light Rail Station,Fried Chicken Joint,Supermarket,Sports Bar
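The column labels and the top-venue lookup can also be sketched in a more compact form. The `ordinal` helper below is a hypothetical addition that also handles numbers such as 11 or 21, which the try/except approach would label as 11th and 21th. `Series.nlargest` is used as an alternative to sorting the whole row, and the frequencies are made up.

```python
import pandas as pd

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Toy category frequencies for one neighborhood (illustrative values)
freqs = pd.Series({'Coffee Shop': 0.20, 'Park': 0.05, 'Pub': 0.10, 'Bakery': 0.15})

top3 = freqs.nlargest(3).index.tolist()
labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(3)]
print(dict(zip(labels, top3)))
```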


### Clustering time

In [95]:
# set number of clusters
kclusters = 5

Tor_grouped_clustering = Tor_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Tor_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 1, 1, 1, 4, 0, 1, 2, 1, 0, 1, 3, 1, 1, 1, 1, 1, 1],
      dtype=int32)
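To make the clustering step less of a black box, here is a plain NumPy sketch of Lloyd's algorithm, the basic iteration behind k-means. It is an illustration only and not scikit-learn's implementation: it picks random starting points, whereas `KMeans` defaults to k-means++ initialization. The function name and the toy points are assumptions of this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated toy groups
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = lloyd_kmeans(X, k=2)
print(labels)
```

The label numbers themselves carry no meaning; only which rows share a label matters, which is also true for kmeans.labels_ above.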

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [96]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Tor_merged = df_explore

# join the sorted venues and cluster labels onto df_explore, which holds latitude/longitude for each neighborhood
Tor_merged = Tor_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [97]:
Tor_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4A,North York,Victoria Village,43.725882,-79.315572,1,Hockey Arena,Intersection,Financial or Legal Service,Coffee Shop,Pizza Place,Portuguese Restaurant,Dog Run,Diner,Food & Drink Shop,Electronics Store
1,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937,1,Fast Food Restaurant,Pizza Place,Pharmacy,Gym / Fitness Center,Athletics & Sports,Intersection,Bank,Gastropub,Pet Store,Breakfast Spot
2,M4C,East York,Woodbine Heights,43.695344,-79.318389,1,Skating Rink,Curling Ice,Video Store,Cosmetics Shop,Pharmacy,Park,Beer Store,Wings Joint,Financial or Legal Service,Fast Food Restaurant
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,3,Health Food Store,Pub,Trail,Wings Joint,Electronics Store,Fish & Chips Shop,Financial or Legal Service,Fast Food Restaurant,Farmers Market,Ethiopian Restaurant
4,M4G,East York,Leaside,43.70906,-79.363452,1,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Sushi Restaurant,Burger Joint,Fish & Chips Shop,Liquor Store,Mexican Restaurant,Electronics Store,Dessert Shop


Finally, let's visualize the resulting clusters

In [98]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
Tor_merged['Cluster Labels'] = Tor_merged['Cluster Labels'].astype(int)
markers_colors = []
for lat, lon, poi, cluster in zip(Tor_merged['Latitude'], Tor_merged['Longitude'], Tor_merged['Neighborhood'], Tor_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
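The color lookup only needs one distinct hex color per cluster label. As a sketch, the same idea can be written with the standard-library `colorsys` module standing in for `cm.rainbow`. The function name and the saturation and brightness values are assumptions of this example.

```python
import colorsys

def cluster_palette(k):
    """Return k hex colors spaced evenly around the hue wheel."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255)))
    return palette

palette = cluster_palette(5)
# Index the palette directly by cluster label (0 .. k-1)
color_for_label = {label: palette[label] for label in range(5)}
print(color_for_label)
```

Indexing by the label itself, rather than by the label minus one, keeps the mapping easy to read when building a legend.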

__That's all__