## IBM Data Science - Peer graded assignment
### Explore, segment and cluster the neighborhoods of Toronto city

In this assignment, I am going to explore Toronto city neighborhoods by using segmenting and clustering. <br>
The data is not readily available on the internet. There is a Wikipedia page that exists for Toronto neighborhood data. <br>
Here is the link below: 

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.


### Importing all the required libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analsysis

import requests
!pip install BeautifulSoup4
from bs4 import BeautifulSoup 

print('Required Libraries imported.')

Required Libraries imported.


### Scraping the required data from Wikipedia Url and extracting data

In [2]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib') 
table = soup.find('div', attrs = {'id':'container'})

print('Wikipedia Page Scrapped.')

Wikipedia Page Scrapped.


And only processing the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. <br>
If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.

In [3]:
postalCodes = [];
boroughs= [];
neighborhoods = [];
columnNum = 1;
passVal = False

for row in soup.find_all('td'):
    for cell in row:
        if cell.string and cell.string[0].isalpha() and len(cell.string) > 2:
            passVal = False
            if columnNum == 1:
                if passVal == False and cell.string[1].isdigit():
                    postalCodes.append(cell.string);   
                    columnNum = 2
                else:
                    continue
            elif columnNum == 2 :
                if cell.string == 'Not assigned':
                    passVal = True
                    del postalCodes[-1]
                    columnNum = 1
                    continue
                else:
                    boroughs.append(cell.string);      
                    columnNum = 3
            elif columnNum == 3 :
                if cell.string == 'Not assigned\n':
                    neighborhoods.append(boroughs[-1])
                else:
                    neighborhoods.append(cell.string); 
                columnNum = 1
                
print('Required Data Collected.')

Required Data Collected.


#### Defining columns for the Dataframe

In [6]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,PostalCode,Borough,Neighborhood


In [11]:
# Appending columns and 
for data in range(len(neighborhoods)):
    code = postalCodes[data]
    borough = boroughs[data]
    neighborhood_name = neighborhoods[data]

    df = df.append({ 'PostalCode': code,
                                   'Borough': borough,
                                   'Neighborhood': neighborhood_name}, ignore_index=True)

df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...
1,M1A,Not assigned,M2A
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,M9A
8,M1B,Scarborough,Malvern / Rouge
9,M2B,Not assigned,M3B


In [12]:
df.shape

(406, 3)

### Installing and importing Geo-coder library

In [14]:
import sys
!{sys.executable} -m pip install geocoder
import geocoder # import geocoder

print('GeoCoder Package installed.')

GeoCoder Package installed.


#### Defining new dataframe columns to include geoCodes

In [15]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)

df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude


#### Merging and Appending the dataframe with geocodes - Latitude and Longitudes

In [16]:
# initialize your variable to None
lat_lng_coords = None

for data in range(0, len(postalCodes)-1):
    code = postalCodes[data]
    borough = boroughs[data]
    neighborhood_name = neighborhoods[data]
    
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
    lat_lng_coords = g.latlng

    df = df.append({ 'PostalCode': code,
                                   'Borough': borough,
                                   'Neighborhood': neighborhood_name,
                                   'Latitude': lat_lng_coords[0],
                                   'Longitude': lat_lng_coords[1]}, ignore_index=True)
    
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1A,Not assigned,M2A,43.64869,-79.38544
1,M3A,North York,Parkwoods,43.752935,-79.335641
2,M4A,North York,Victoria Village,43.728102,-79.31189
3,M5A,Downtown Toronto,Regent Park / Harbourfront,43.650964,-79.353041
4,M6A,North York,Lawrence Manor / Lawrence Heights,43.723265,-79.451211
5,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.66179,-79.38939
6,M8A,Not assigned,M9A,43.64869,-79.38544
7,M1B,Scarborough,Malvern / Rouge,43.808626,-79.189913
8,M2B,Not assigned,M3B,43.64869,-79.38544
9,M4B,East York,Parkview Hill / Woodbine Gardens,43.707193,-79.311529


In [17]:
df.shape

(135, 5)

### Explore and cluster the neighborhoods in Toronto

In [23]:
import sys
!{sys.executable} -m pip install folium

from sklearn.cluster import KMeans   # import k-means from clustering stage
import folium # map rendering library
import matplotlib.cm as cm          # Matplotlib and associated plotting modules
import matplotlib.colors as colors
from pandas.io.json import json_normalize

print('Required Packages installed.')

Required Packages installed.


In [24]:
VERSION = '20180605' # Foursquare API version

neighborhood_name = df.loc[0, 'Neighborhood'] # neighborhood name
neighborhood_latitude = df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))
radius = 500 # define radius
LIMIT = 100 # limit of number of venues returned by Foursquare API

url = 'https://api.foursquare.com/v2/venues/explore?&client_id=SR2LDZKIXARRVINIY4RYR2XTETJT0IAIDSXPPLPSUTXLVUEV&client_secret=0533WJN2IE3CK2QREGNDBPRLXLDBG1URBRWIF0PUJ5GZEUZS&v=20180605&ll=43.67635739999999,-79.2930312&radius=500&limit=100'

results = requests.get(url).json()

Latitude and longitude values of M2A
 are 43.648690000000045, -79.38543999999996.


#### Extracting venue categories

In [25]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [26]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869
4,Dip 'n Sip,Coffee Shop,43.678897,-79.297745


Getting near by venues

In [27]:
def getNearbyVenues(names, latitudes, longitudes):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [39]:
toronto_venues = getNearbyVenues(names = df, latitudes = df['Latitude'], longitudes = df['Longitude'])

print(toronto_venues.shape)

toronto_venues.head()

PostalCode
Borough
Neighborhood
Latitude
Longitude
(25, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,PostalCode,43.64869,-79.38544,Glen Manor Ravine,43.676821,-79.293942,Trail
1,PostalCode,43.64869,-79.38544,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,PostalCode,43.64869,-79.38544,Grover Pub and Grub,43.679181,-79.297215,Pub
3,PostalCode,43.64869,-79.38544,Upper Beaches,43.680563,-79.292869,Neighborhood
4,PostalCode,43.64869,-79.38544,Dip 'n Sip,43.678897,-79.297745,Coffee Shop


In [40]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Borough,5,5,5,5,5,5
Latitude,5,5,5,5,5,5
Longitude,5,5,5,5,5,5
Neighborhood,5,5,5,5,5,5
PostalCode,5,5,5,5,5,5


In [41]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Trail,Coffee Shop,Health Food Store,Neighborhood,Pub
0,1,0,0,PostalCode,0
1,0,0,1,PostalCode,0
2,0,0,0,PostalCode,1
3,0,0,0,PostalCode,0
4,0,1,0,PostalCode,0


In [42]:
# examine the new dataframe size.

toronto_onehot.shape

(25, 5)

In [43]:
# group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Trail,Coffee Shop,Health Food Store,Pub
0,Borough,0.2,0.2,0.2,0.2
1,Latitude,0.2,0.2,0.2,0.2
2,Longitude,0.2,0.2,0.2,0.2
3,Neighborhood,0.2,0.2,0.2,0.2
4,PostalCode,0.2,0.2,0.2,0.2


In [44]:
toronto_grouped.shape

(5, 5)

In [45]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))

----Borough----
               venue  freq
0              Trail   0.2
1        Coffee Shop   0.2
2  Health Food Store   0.2
3                Pub   0.2
----Latitude----
               venue  freq
0              Trail   0.2
1        Coffee Shop   0.2
2  Health Food Store   0.2
3                Pub   0.2
----Longitude----
               venue  freq
0              Trail   0.2
1        Coffee Shop   0.2
2  Health Food Store   0.2
3                Pub   0.2
----Neighborhood----
               venue  freq
0              Trail   0.2
1        Coffee Shop   0.2
2  Health Food Store   0.2
3                Pub   0.2
----PostalCode----
               venue  freq
0              Trail   0.2
1        Coffee Shop   0.2
2  Health Food Store   0.2
3                Pub   0.2


In [47]:
# Sorting the venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [48]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Borough,Pub,Health Food Store,Coffee Shop
1,Latitude,Pub,Health Food Store,Coffee Shop
2,Longitude,Pub,Health Food Store,Coffee Shop
3,Neighborhood,Pub,Health Food Store,Coffee Shop
4,PostalCode,Pub,Health Food Store,Coffee Shop


### Running K-means to Cluster the neighborhood into 4

In [60]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

  return_n_iter=True)


array([0, 0, 0, 0, 0], dtype=int32)

In [63]:
# creating new dataframe with the cluster & top 10 venues for each neighborhood.

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# merging toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

ValueError: cannot insert Cluster Labels, already exists

### Toronto Map creation and visualizing the clusters

In [61]:
# Visualizing the resulting clusters

# creating map
mapClusters = folium.Map(location=[43.67635739999999, -79.2930312], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
mapClusters

KeyError: 'Latitude'