<h1>Torontos Neighborhood</h1>

This assignement clusters the neighborhood in Toronto.

<h3>Scrape the Wikipedia page</h3>

This part scrapes the wikipedia page given and extract a data frame from that.

In [27]:
# import libraries and packages
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd

In [28]:
# get data from webpage by using BuitifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

In [109]:
## extract data table
# separate table in headlines and columns
tableWiki = soup.findAll('table')[0]
headline = tableWiki.findAll('th')
rows = tableWiki.findAll('tr')

# get all table entries
postcode = []
borough = []
neighborhood = []
for i in range(1,len(rows)):
    temp_rows = rows[i].findAll('td')
    current_postcode = temp_rows[0].text
    current_borough = temp_rows[1].text
    current_neighborhood = temp_rows[2].text[:-1]
    # remove all rows where borough is 'Not assigned'
    if current_borough != 'Not assigned':
        # check whether postal code already exist
        exist = False
        for j, temp in enumerate(postcode):
            if temp == current_postcode:
                neighborhood[j] = neighborhood[j]+', '+current_neighborhood
                exist = True
        if exist == False:
            # check neighborhood for 'Not Assigned'
            if current_neighborhood == 'Not assigned':
                current_neighborhood = current_borough
            postcode.append(current_postcode)
            borough.append(current_borough)
            neighborhood.append(current_neighborhood)

# create data frame in pandas
df = pd.DataFrame({headline[0].text:postcode,
                  headline[1].text:borough,
                  headline[2].text[:-1]:neighborhood})
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [110]:
# show number of rows for the data frame
df.shape

(103, 3)

<h3>Get the coordinates of each neighborhood</h3>

This part uses the csv file to get the coordinates of each neighborhood.

In [111]:
# create dataframe from csv file
df_coordinates = pd.read_csv('Geospatial_Coordinates.csv')
df_coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [112]:
## append to the data frame the Latitude and Longitude
# sort values of each data frame
df.sort_values(by=['Postcode'], inplace = True)
df_coordinates.sort_values(by=['Postal Code'], inplace = True)
# create new data Frame
df_temp1 = pd.DataFrame(df_coordinates['Latitude'])
df_temp2 = pd.DataFrame(df_coordinates['Longitude'])
df['Latitude'] = df_temp1
df['Longitude'] = df_temp2
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
6,M1B,Scarborough,"Rouge, Malvern",43.727929,-79.262029
12,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.7942,-79.262029
18,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.778517,-79.346556
22,M1G,Scarborough,Woburn,43.77012,-79.408493
26,M1H,Scarborough,Cedarbrae,43.745906,-79.352188


<h3>Explore and Clustering the neighborhoods in Toronto</h3>

<h5>Create a map of Toronto</h5>

In [159]:
# import packages
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import json
from pandas.io.json import json_normalize
import numpy as np

This part creates a map of the different in neighborhoods in Toronto. The map is saved in the wd, because the presentation does not work on this browser. You can easily use "map_toronto" to plot the map.

In [213]:
latitude = 43.70011
longitude = -79.4163
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto.save(outfile = 'Map_Toronto.html')

In [214]:
# connect to Foursuqare api
CLIENT_ID = 'R32VDER01KIPAABYQGIPAL1LDDSPVLA1WOVUL5WCTGMPW5GA' 
CLIENT_SECRET = 'TS3S0DHH5JIOZDEBJJAAMURIBGIHPF1EL0CVCX1IXD4VTAKA' 
VERSION = '20180605' 

In this section we get the Top 100 venues in the first neighborhood. In the second cell the data are stored in a data frame.

In [215]:
neighborhood_id = 2
neighborhood_latitude = df.loc[neighborhood_id, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[neighborhood_id, 'Longitude'] # neighborhood longitude value
neighborhood_name = df.loc[neighborhood_id, 'Neighbourhood'] # neighborhood name


In [216]:
def getNearbyVenues(names, latitudes, longitudes, radius=300):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [217]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )
print(toronto_venues.shape)
toronto_venues.head()
toronto_venues.groupby('Neighborhood').count()
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

(1214, 7)
There are 219 uniques categories.


<h4>Analyzing und clustering each neighborhood</h4>

In [230]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

num_top_venues = 6

indicators = ['st', 'nd', 'rd']


# function for sorting venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

In [231]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue
6,M1B,Scarborough,"Rouge, Malvern",43.727929,-79.262029,0.0,Convenience Store,Playground,Women's Store,Dive Bar,Field,Fast Food Restaurant
12,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.7942,-79.262029,0.0,Breakfast Spot,Sandwich Place,Dog Run,Financial or Legal Service,Field,Fast Food Restaurant
18,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.778517,-79.346556,0.0,Clothing Store,Fast Food Restaurant,Women's Store,Coffee Shop,Shoe Store,Japanese Restaurant
22,M1G,Scarborough,Woburn,43.77012,-79.408493,2.0,Coffee Shop,Women's Store,Dive Bar,Financial or Legal Service,Field,Fast Food Restaurant
26,M1H,Scarborough,Cedarbrae,43.745906,-79.352188,0.0,Tennis Court,Women's Store,Fish Market,Financial or Legal Service,Field,Fast Food Restaurant


In [232]:
# create a map for clustering
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    if np.isnan(cluster) == False:
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[int(cluster)-1],
            fill=True,
            fill_color=rainbow[int(cluster)-1],
            fill_opacity=0.7).add_to(map_clusters)
       
map_clusters.save(outfile = 'Map_Toronto_Cluster.html')

<h4>Examine Cluster</h4>

In [233]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue
6,Scarborough,0.0,Convenience Store,Playground,Women's Store,Dive Bar,Field,Fast Food Restaurant
12,Scarborough,0.0,Breakfast Spot,Sandwich Place,Dog Run,Financial or Legal Service,Field,Fast Food Restaurant
18,Scarborough,0.0,Clothing Store,Fast Food Restaurant,Women's Store,Coffee Shop,Shoe Store,Japanese Restaurant
26,Scarborough,0.0,Tennis Court,Women's Store,Fish Market,Financial or Legal Service,Field,Fast Food Restaurant
32,Scarborough,0.0,Baseball Field,Women's Store,Flower Shop,Fish & Chips Shop,Financial or Legal Service,Field
38,Scarborough,0.0,Sporting Goods Shop,Sushi Restaurant,Breakfast Spot,Restaurant,Mexican Restaurant,Electronics Store
44,Scarborough,0.0,Photography Studio,Dive Bar,Financial or Legal Service,Field,Fast Food Restaurant,Farmers Market
51,Scarborough,0.0,Pizza Place,Café,Coffee Shop,Italian Restaurant,Japanese Restaurant,Dive Bar
58,Scarborough,0.0,Steakhouse,Coffee Shop,Bar,American Restaurant,Asian Restaurant,Seafood Restaurant
65,Scarborough,0.0,Park,Sandwich Place,Metro Station,Cheese Shop,Café,Indian Restaurant


In [234]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue
40,North York,1.0,Park,Construction & Landscaping,Women's Store,Dog Run,Financial or Legal Service,Field
23,East York,1.0,Convenience Store,Park,Bank,Dog Run,Financial or Legal Service,Field
48,Downtown Toronto,1.0,Park,Shopping Mall,Women's Store,Dive Bar,Field,Fast Food Restaurant
74,Central Toronto,1.0,Women's Store,Park,Market,Fast Food Restaurant,Dive Bar,Field
25,Downtown Toronto,1.0,Park,Fast Food Restaurant,Women's Store,Dive Bar,Financial or Legal Service,Field
37,West Toronto,1.0,Park,Other Great Outdoors,Trail,Women's Store,Dive Bar,Field
100,East Toronto,1.0,Park,Bus Line,Women's Store,Dive Bar,Field,Fast Food Restaurant
98,Etobicoke,1.0,Park,Women's Store,Dive Bar,Financial or Legal Service,Field,Fast Food Restaurant


In [235]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue
22,Scarborough,2.0,Coffee Shop,Women's Store,Dive Bar,Financial or Legal Service,Field,Fast Food Restaurant
24,Downtown Toronto,2.0,Coffee Shop,Women's Store,Dive Bar,Financial or Legal Service,Field,Fast Food Restaurant


In [236]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue
97,Downtown Toronto,3.0,Construction & Landscaping,Women's Store,Dog Run,Fish & Chips Shop,Financial or Legal Service,Field
