# Segmenting and Clustering Neighborhoods in Toronto

This notebook scrapes the following Wikipedia page,

> https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

to obtain the table of postal codes and load it into a pandas data frame.

## *Getting Wikipedia data*

#### Let's get started!

First, we have to import the libraries:

In [66]:
#import libraries
#!conda install -c conda-forge folium

In [67]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import json # library to handle JSON files
from geopy.geocoders import Nominatim
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

Now, we scrape the Wikipedia table using the BeautifulSoup and pandas libraries.

In [68]:
wiki = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki,'lxml')
table = soup.find("table",{"class":"wikitable sortable"})

In [69]:
#table #to see its content

#### We can now create a data frame by looping through BeautifulSoup Table.

I recommend that you see the contents of the table to better understand this process.

In [70]:
#Create data frame

columns=['Postal Code','Borough','Neighborhood']
df=pd.DataFrame(columns=columns)
p=[]
b=[]
n=[]

for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        p.append(cells[0].find(text=True).lstrip('\n').strip())
        b.append(cells[1].find(text=True).lstrip('\n').strip())
        n.append(cells[2].find(text=True).lstrip('\n').strip())

df['Postal Code']=p
df['Borough']=b
df['Neighborhood']=n
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
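
The loop above fills three parallel lists and then assigns them as columns. As a minimal sketch of an alternative, assuming the same three-cell row layout, we can collect one tuple per row and build the data frame in a single call. The HTML snippet and the names `sample_html`, `RowCollector`, and `sample_df` are made up for illustration, and the standard-library `HTMLParser` stands in for BeautifulSoup only to keep the sketch self-contained.

```python
from html.parser import HTMLParser

import pandas as pd

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, one tuple of cell strings each
        self._cells = None    # cells of the row being read, or None
        self._in_td = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            # Keep only data rows with exactly three cells, as in the loop above.
            if len(self._cells) == 3:
                self.rows.append(tuple(c.strip() for c in self._cells))
            self._cells = None


parser = RowCollector()
parser.feed(sample_html)
rows = parser.rows

sample_df = pd.DataFrame(rows, columns=["Postal Code", "Borough", "Neighborhood"])
print(sample_df)
```

Building the frame from a list of row tuples keeps each record together, so a skipped cell cannot shift one column against the others.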


#### Let's clean our data!
We only need the cells that have an assigned borough. Therefore, we can ignore the cells with a borough that is 'Not assigned'.

In order to do that let's filter our data where the Borough is not equal to 'Not assigned':

In [71]:
ng_tnt=df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
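
To make the filter above concrete, here is a small sketch on toy rows that mirror the first rows of the scraped table. The names `toy`, `mask`, and `kept` are illustrative only.

```python
import pandas as pd

# Toy rows mirroring the first rows of the scraped table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

mask = toy["Borough"] != "Not assigned"   # boolean Series, True for rows to keep
kept = toy[mask].reset_index(drop=True)    # renumber rows from 0 after filtering

print(mask.tolist())
print(kept)
```

Without `reset_index(drop=True)` the surviving rows would keep their old labels (2 and 3 here), which is easy to trip over when indexing by position later.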

More than one neighborhood can exist in one postal code area. Luckily, the table extracted from Wikipedia already lists all neighborhoods of a postal code in a single row, so we only need to check that each postal code appears once. If the number of unique values in the Postal Code column equals the number of rows, the grouping is correct.

In [72]:
ng_tnt['Postal Code'].unique().shape[0]==ng_tnt['Postal Code'].shape[0]

True
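
The same check can also be written with `Series.is_unique`, and `Series.duplicated` shows which rows are to blame when the check fails. A minimal sketch on made-up codes (the Series `codes` is hypothetical):

```python
import pandas as pd

# Toy data with one repeated postal code.
codes = pd.Series(["M3A", "M4A", "M5A", "M4A"], name="Postal Code")

print(codes.is_unique)                      # False, because M4A appears twice
repeats = codes[codes.duplicated(keep=False)]
print(repeats)                              # every row involved in a repeat
```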

#### Great! Now let's confirm that no neighborhood has a 'Not assigned' value:

In [73]:
ng_tnt[ng_tnt['Neighborhood'] == 'Not assigned'].shape[0]

0

And get the shape of the grouped data frame:

In [74]:
ng_tnt.shape

(103, 3)

## *Uploading data frame with latitude and longitude information*

#### I tried to get the latitude and longitude of each postal code area by using geocoder, but I failed.
> ## It was taking too long!

*This is what I did...*

```python
!pip install geocoder
import geocoder

def get_lat_log(postal):
    # Keep querying until the geocoder returns coordinates (this can loop for a long time)
    lat_lng_coords = None
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal))
        lat_lng_coords = g.latlng

    lat = lat_lng_coords[0]
    long = lat_lng_coords[1]
    return (lat, long)

lat_list = []
long_list = []

# Try a single postal code first
lat, long = get_lat_log(ng_tnt['Postal Code'][0])

# Then look up every postal code and collect the coordinates
for pc in ng_tnt['Postal Code']:
    lat, long = get_lat_log(pc)
    lat_list.append(lat)
    long_list.append(long)

ng_tnt['Latitude'] = lat_list
ng_tnt['Longitude'] = long_list
```
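
One likely reason the attempt above never finished is that the `while` loop has no exit when the service keeps returning nothing. Here is a minimal sketch of a bounded retry, assuming any lookup callable that returns a `(lat, lng)` pair or `None`. The names `geocode_with_retry` and `flaky_lookup` are hypothetical, and the stub replaces the real geocoder so the sketch runs offline.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable returning a (lat, lng) pair or None; it is a
    stand-in for a real geocoding call and is an assumption of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds * attempt)  # wait a little longer after each failure
    return None  # give up instead of looping forever


# Hypothetical stub that fails twice and then answers, to exercise the retry.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else (43.806686, -79.194353)


result = geocode_with_retry(flaky_lookup, "M1B, Toronto, Ontario")
never = geocode_with_retry(lambda q: None, "M1B, Toronto, Ontario", max_attempts=3)
print(result, calls["count"], never)
```

Returning `None` after the last attempt lets the caller decide what to do with a postal code that could not be resolved, for example falling back to a CSV file.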

#### So, I decided to read the csv file and merge the latitude and longitude with my data frame:

In [75]:
df_ltlg=pd.read_csv('https://cocl.us/Geospatial_data')
df_ltlg.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [76]:
ng_data=ng_tnt.merge(df_ltlg, on='Postal Code', how='left')
ng_data

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
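
A left merge keeps every row of the left frame even when no coordinates are found, so it is worth checking for unmatched postal codes. The sketch below uses toy frames (`left`, `coords`, and the made-up code `M9Z` are illustrative) and the `indicator=True` option of `DataFrame.merge`.

```python
import pandas as pd

left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M9Z"],   # M9Z is a made-up code
                     "Borough": ["North York", "North York", "Nowhere"]})
coords = pd.DataFrame({"Postal Code": ["M3A", "M4A"],
                       "Latitude": [43.753259, 43.725882],
                       "Longitude": [-79.329656, -79.315572]})

# indicator=True adds a _merge column saying where each row was found.
merged = left.merge(coords, on="Postal Code", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty `missing` list means every postal code received a latitude and longitude.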


Let's check whether the coordinates match the Capstone example:

In [77]:
check=ng_data[ng_data['Postal Code'] =='M9V'].reset_index(drop=True)
check

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


It does! So, now we are ready to analyze our data and cluster it for a better visualization.

## *Preparing data for clustering*

For ease of analysis, we consider only boroughs whose name contains the word 'Toronto'.

In [78]:
ng_data = ng_data[ng_data['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(ng_data.shape)
ng_data.head()

(39, 5)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


Plotting data without clusters for further analysis.

In [79]:
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [80]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, label in zip(ng_data['Latitude'], ng_data['Longitude'], ng_data['Borough']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Define Foursquare Credentials and Version

In [81]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # placeholder: never publish real credentials
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # placeholder: never publish real credentials
VERSION = '20180605'
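
Hard-coding the client ID and secret in a notebook exposes them to anyone who can read it. A minimal sketch of one alternative is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are conventions chosen for this sketch, not something Foursquare prescribes.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client id and secret from environment variables.

    The variable names used here are an assumption of this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None


# Demonstration with a plain dict standing in for the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print(client_id)
```

In the notebook, the call would simply be `load_foursquare_credentials()` with the variables set in the shell that launched Jupyter.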

#### Let's create a function that loops through our neighborhood data and gets the top venues

Notice that we set a default radius of 500 meters and a limit of 100 venues per location.

In [82]:
def getNearbyVenues(names, latitudes, longitudes, radius=500,limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print('done')
    return(nearby_venues)
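
Two details of the function above can be made a little more robust: the query string can be built from a dictionary so that values are URL-encoded, and a venue with an empty category list would otherwise raise an `IndexError`. The sketch below makes no network call. The helper names `build_explore_url` and `first_category` are hypothetical, and the sample items only imitate the fields the function above reads.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the same query parameters as the function above, URL-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


def first_category(item, default="Unknown"):
    """Return the first category name of one result item, or a default if absent."""
    categories = item.get("venue", {}).get("categories") or []
    return categories[0]["name"] if categories else default


url = build_explore_url("demo-id", "demo-secret", "20180605", 43.65426, -79.360636)

# Hypothetical items shaped like the entries the function above iterates over.
with_category = {"venue": {"name": "Tandem Coffee", "categories": [{"name": "Coffee Shop"}]}}
without_category = {"venue": {"name": "Somewhere", "categories": []}}

print(url)
print(first_category(with_category), first_category(without_category))
```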

Now we apply it on our Borough data

In [83]:
toronto_venues = getNearbyVenues(names=ng_data['Borough'],
                                   latitudes=ng_data['Latitude'],
                                   longitudes=ng_data['Longitude']
                                  )

done


And examine the dataframe size

In [84]:
print(toronto_venues.shape)
toronto_venues.head()

(1614, 7)


Unnamed: 0,Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Downtown Toronto,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Downtown Toronto,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Downtown Toronto,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,Downtown Toronto,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,Downtown Toronto,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


## *Analyze each Borough*

#### One-hot encode the venue categories, then group by borough and take the mean to get the frequency of each category

In [85]:
t1hot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
t1hot['Borough'] = toronto_venues['Borough'] 
t1hot = t1hot[[t1hot.columns[-1]] + list(t1hot.columns[:-1])]
tgb=t1hot.groupby('Borough').mean().reset_index()
print(tgb.shape)
tgb.head()

(4, 242)


Unnamed: 0,Borough,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Central Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009434,0.0,...,0.018868,0.0,0.009434,0.0,0.009434,0.0,0.0,0.0,0.0,0.009434
1,Downtown Toronto,0.000817,0.000817,0.000817,0.000817,0.001634,0.002451,0.001634,0.014706,0.001634,...,0.000817,0.002451,0.010621,0.001634,0.003268,0.005719,0.000817,0.001634,0.000817,0.005719
2,East Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024194,0.0,...,0.016129,0.0,0.0,0.0,0.0,0.008065,0.0,0.0,0.0,0.024194
3,West Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00625,...,0.0,0.0,0.01875,0.0,0.00625,0.00625,0.00625,0.0,0.0,0.0125
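
To see why the mean of one-hot columns is a frequency, here is a minimal sketch on made-up venues for two toy boroughs. The frame `venues` and its rows are hypothetical, and `dtype=float` is passed only so the dummy columns are numeric regardless of the pandas version.

```python
import pandas as pd

# Toy venue list: two boroughs, three categories (made-up rows).
venues = pd.DataFrame({
    "Borough": ["East", "East", "East", "East", "West", "West"],
    "Venue Category": ["Café", "Café", "Park", "Bar", "Bar", "Bar"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Borough", venues["Borough"])   # put the grouping key first

# The mean of a 0/1 column is the share of rows that are 1.
freq = onehot.groupby("Borough").mean().reset_index()
print(freq)
```

In the toy data, two of the four "East" venues are cafés, so the "Café" entry for "East" is 0.5, and each row of frequencies sums to 1.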


#### Now let's display the top 10 venues for each borough

First, we define a function that returns the index labels (venue categories) of the largest values in a row:

In [86]:
def top_n_val(row, limit):
    row2sort = row.iloc[1:]
    row_sorted = row2sort.sort_values(ascending=False)
    
    return row_sorted.index.values[0:limit]
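
Here is a small sketch of what this function does to one row, using made-up frequencies. The Series `row` is hypothetical and shaped like a row of the grouped frame, with the borough label first.

```python
import pandas as pd

# One toy row: the label first, then the category frequencies.
row = pd.Series({"Borough": "East", "Bar": 0.25, "Café": 0.5, "Park": 0.1, "Pub": 0.15})

frequencies = row.iloc[1:].astype(float)              # drop the leading Borough label
top3 = frequencies.sort_values(ascending=False).index[:3].tolist()
print(top3)   # ['Café', 'Bar', 'Pub']
```

When two categories have the same frequency, their relative order is not guaranteed by the default sort; passing `kind='stable'` to `sort_values` keeps tied entries in their original column order.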

Next, we create the data frame with the top 10 venues, one per column:

In [102]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

tgb_sorted = pd.DataFrame(columns=columns)
tgb_sorted['Borough'] = tgb['Borough']

for ind in np.arange(tgb.shape[0]):
    tgb_sorted.iloc[ind, 1:] = top_n_val(tgb.iloc[ind, :], num_top_venues)

tgb_sorted.head()

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,Coffee Shop,Sandwich Place,Café,Pizza Place,Park,Sushi Restaurant,Pub,Restaurant,Dessert Shop,Italian Restaurant
1,Downtown Toronto,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
2,East Toronto,Greek Restaurant,Coffee Shop,Café,Italian Restaurant,Brewery,Ice Cream Shop,Park,Restaurant,Yoga Studio,Light Rail Station
3,West Toronto,Café,Bar,Coffee Shop,Italian Restaurant,Bakery,Restaurant,Pizza Place,Breakfast Spot,Gift Shop,Park
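
The column names above get their suffixes from a short list plus an exception handler, which works for 1 to 10. A minimal sketch of a more general helper, assuming we might want more than 20 columns one day (the function name `ordinal` is made up here):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"                      # 10th through 20th always take 'th'
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Borough"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[1], columns[3], columns[10])
```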


## *Cluster Neighborhoods*

Run k-means to group the boroughs into 4 clusters. Note that `tgb` has only four rows (one per borough), so each borough ends up in its own cluster.

In [103]:
kclusters = 4
X = tgb.drop(columns='Borough')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)
kmeans.labels_[0:10] 

array([0, 3, 1, 2])
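
The four distinct labels above are what we should expect whenever the number of clusters equals the number of rows. A minimal sketch with four made-up frequency rows (the array `X` and the variable `model` are hypothetical stand-ins for the borough data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up frequency rows, standing in for the four borough rows.
X = np.array([
    [0.50, 0.10, 0.40],
    [0.20, 0.60, 0.20],
    [0.05, 0.05, 0.90],
    [0.30, 0.30, 0.40],
])

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(sorted(model.labels_.tolist()))   # each label is used exactly once
print(model.inertia_)                   # effectively zero: every row is its own centroid
```

To get clusters that group several rows together, the number of clusters has to be smaller than the number of rows, or the clustering has to run on a finer unit such as postal codes.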

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [104]:
tgb_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
tnb_merged = ng_data
tnb_merged = tnb_merged.join(tgb_sorted.set_index('Borough'), on='Borough')
print(tnb_merged.shape)
tnb_merged.head()

(39, 16)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Greek Restaurant,Coffee Shop,Café,Italian Restaurant,Brewery,Ice Cream Shop,Park,Restaurant,Yoga Studio,Light Rail Station


Finally, let's visualize the resulting clusters

In [105]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(tnb_merged['Latitude'], tnb_merged['Longitude'], tnb_merged['Neighborhood'], tnb_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
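
The color lookup `rainbow[cluster-1]` in the loop above relies on Python's negative indexing: label 0 maps to index -1, which is the last color. The sketch below shows the effect with a hypothetical four-color palette (these hex values are made up, not the ones matplotlib produces).

```python
# Hypothetical four-color palette; the map above builds its own with matplotlib.
palette = ["#8000ff", "#2adddd", "#d4dd80", "#ff0000"]
labels = [0, 1, 2, 3]

shifted = [palette[label - 1] for label in labels]   # indexing used in the loop above
direct = [palette[label] for label in labels]        # one color per label, in order

print(shifted[0] == palette[-1])   # True: label 0 wraps around to the last color
print(len(set(shifted)) == len(set(direct)) == 4)
```

Both variants give every cluster a distinct color. Indexing with `label` directly simply keeps the palette order aligned with the label order.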

## *Examine Clusters*

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.

#### Cluster 1

In [114]:
c1=tnb_merged[tnb_merged['Cluster Labels'] ==0]
print(c1.shape)
c1.head()

(9, 16)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Coffee Shop,Sandwich Place,Café,Pizza Place,Park,Sushi Restaurant,Pub,Restaurant,Dessert Shop,Italian Restaurant
19,M5N,Central Toronto,Roselawn,43.711695,-79.416936,0,Coffee Shop,Sandwich Place,Café,Pizza Place,Park,Sushi Restaurant,Pub,Restaurant,Dessert Shop,Italian Restaurant
20,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Coffee Shop,Sandwich Place,Café,Pizza Place,Park,Sushi Restaurant,Pub,Restaurant,Dessert Shop,Italian Restaurant
21,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307,0,Coffee Shop,Sandwich Place,Café,Pizza Place,Park,Sushi Restaurant,Pub,Restaurant,Dessert Shop,Italian Restaurant
23,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678,0,Coffee Shop,Sandwich Place,Café,Pizza Place,Park,Sushi Restaurant,Pub,Restaurant,Dessert Shop,Italian Restaurant


#### Cluster 2

In [115]:
c2=tnb_merged[tnb_merged['Cluster Labels']==1]
print(c2.shape)
c2.head()

(5, 16)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Greek Restaurant,Coffee Shop,Café,Italian Restaurant,Brewery,Ice Cream Shop,Park,Restaurant,Yoga Studio,Light Rail Station
12,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,1,Greek Restaurant,Coffee Shop,Café,Italian Restaurant,Brewery,Ice Cream Shop,Park,Restaurant,Yoga Studio,Light Rail Station
15,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,1,Greek Restaurant,Coffee Shop,Café,Italian Restaurant,Brewery,Ice Cream Shop,Park,Restaurant,Yoga Studio,Light Rail Station
17,M4M,East Toronto,Studio District,43.659526,-79.340923,1,Greek Restaurant,Coffee Shop,Café,Italian Restaurant,Brewery,Ice Cream Shop,Park,Restaurant,Yoga Studio,Light Rail Station
38,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,1,Greek Restaurant,Coffee Shop,Café,Italian Restaurant,Brewery,Ice Cream Shop,Park,Restaurant,Yoga Studio,Light Rail Station


#### Cluster 3

In [116]:
c3=tnb_merged[tnb_merged['Cluster Labels']==2]
print(c3.shape)
c3.head()

(6, 16)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,2,Café,Bar,Coffee Shop,Italian Restaurant,Bakery,Restaurant,Pizza Place,Breakfast Spot,Gift Shop,Park
11,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,2,Café,Bar,Coffee Shop,Italian Restaurant,Bakery,Restaurant,Pizza Place,Breakfast Spot,Gift Shop,Park
14,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191,2,Café,Bar,Coffee Shop,Italian Restaurant,Bakery,Restaurant,Pizza Place,Breakfast Spot,Gift Shop,Park
22,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763,2,Café,Bar,Coffee Shop,Italian Restaurant,Bakery,Restaurant,Pizza Place,Breakfast Spot,Gift Shop,Park
25,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325,2,Café,Bar,Coffee Shop,Italian Restaurant,Bakery,Restaurant,Pizza Place,Breakfast Spot,Gift Shop,Park


#### Cluster 4

In [117]:
c4=tnb_merged[tnb_merged['Cluster Labels']==3]
print(c4.shape)
c4.head()

(19, 16)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,3,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
