## Part 1. Postal codes
1. Use the BeautifulSoup library to extract postal codes from Wikipedia
1. Load the postal codes into a pandas data frame
1. While loading the data, check for 'Not assigned' and exclude or replace those cells
1. Use GroupBy to combine rows with repeating postal codes

In [2]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Import all necessary libraries

import requests # library to handle requests
from bs4 import BeautifulSoup # library to decode HTML pages
import pandas as pd # library to process data as dataframes
import numpy as np # library to handle data in a vectorized manner

from geopy.geocoders import Nominatim

import matplotlib.cm as cm # Matplotlib and associated plotting modules
import matplotlib.colors as colors

from sklearn.cluster import KMeans # import k-means from clustering stage

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library


In [12]:
# Download the wiki page and parse it into a soup
wiki_page = requests.get(wiki_url).text
soup = BeautifulSoup(wiki_page,'lxml')

Wikipedia marks its sortable tables with the classes __wikitable sortable__, so I use that class string to locate the postal code table on the page.<br>
Then I read every row of the table (each __tr__ tag); each row contains a postal code, a borough and a neighbourhood.
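
The lookup described above can be sketched on a tiny inline table. The HTML below is made up for illustration and is not the real Wikipedia page. The sketch uses the CSS selector `table.wikitable.sortable`, which does not depend on the order of the class names, and `html.parser` only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table, for illustration only
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# Select the table by its classes
sample_table = sample_soup.select_one("table.wikitable.sortable")

sample_rows = []
for tr in sample_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped
        sample_rows.append(cells)

print(sample_rows)
```

Each inner list then holds one postal code, borough and neighbourhood, ready for the 'Not assigned' checks.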

In [13]:
# Locate the postal code table and collect its rows
postal_table = soup.find('table',{'class':'wikitable sortable'})
postal_lines = postal_table.find_all('tr')

In [14]:
# Read <td> and collect the data into a dataframe
col1 = []
col2 = []
col3 = []

for tr in postal_lines:
    tds = tr.find_all('td')
    if len(tds) < 3:
        continue  # skip the header row and any row without three cells
    # columns: postal code, borough, neighbourhood
    cell1, cell2, cell3 = [td.text.strip() for td in tds[:3]]
    if cell2 != 'Not assigned':
        if cell3 == 'Not assigned':
            cell3 = cell2
        col1.append(cell1)
        col2.append(cell2)
        col3.append(cell3)

df = pd.DataFrame()
df['Postalcode'] = col1
df['Borough'] = col2
df['Neighborhood'] = col3
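
The same 'Not assigned' rules can also be applied after loading, with vectorized pandas operations instead of a row loop. This is a sketch of an alternative, not the method used above; the three sample rows are illustrative.

```python
import pandas as pd

# Illustrative rows only: one without a borough, one complete, one without a neighbourhood
raw = pd.DataFrame({
    'Postalcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: exclude rows whose borough is not assigned
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 2: where the neighbourhood is not assigned, reuse the borough name
clean['Neighborhood'] = clean['Neighborhood'].where(
    clean['Neighborhood'] != 'Not assigned', clean['Borough'])

clean = clean.reset_index(drop=True)
print(clean)
```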

In [15]:
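# Group neighbourhoods that share a postal code and borough into one comma-separated row
df_grouped = df.groupby(['Postalcode','Borough'])['Neighborhood'].apply(list)
# sample(frac=1) shuffles the rows, so the row order differs between runs
df_grouped = df_grouped.sample(frac=1).reset_index()
df_grouped['Neighborhood'] = df_grouped['Neighborhood'].str.join(', ')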

df_grouped.shape
df_grouped.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M9W | Etobicoke | Northwest |
| 1 | M2M | North York | Newtonbrook, Willowdale |
| 2 | M4T | Central Toronto | Moore Park, Summerhill East |
| 3 | M6E | York | Caledonia-Fairbanks |
| 4 | M3N | North York | Downsview Northwest |
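
For reference, the grouping step can also be written in a single pass by aggregating with `', '.join`. This is a minimal sketch on made-up rows, without the shuffling.

```python
import pandas as pd

# Illustrative rows: postal code M5A appears twice
rows = pd.DataFrame({
    'Postalcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Join the neighbourhood names of each (postal code, borough) group into one string
combined = (rows.groupby(['Postalcode', 'Borough'])['Neighborhood']
                .agg(', '.join)
                .reset_index())
print(combined)
```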


## Part 2. Add coordinates

Rather than rely on an unstable geocoding service, I load the coordinates from the provided CSV file.

In [16]:
url_coordinates = 'http://cocl.us/Geospatial_data'
dfCoords = pd.read_csv(url_coordinates)

In [17]:
dfAreas = df_grouped.merge(dfCoords, left_on='Postalcode',right_on='Postal Code')
dfAreas.drop(['Postal Code'], axis=1, inplace=True)
dfAreas.head()

| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M9W | Etobicoke | Northwest | 43.706748 | -79.594054 |
| 1 | M2M | North York | Newtonbrook, Willowdale | 43.789053 | -79.408493 |
| 2 | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 3 | M6E | York | Caledonia-Fairbanks | 43.689026 | -79.453512 |
| 4 | M3N | North York | Downsview Northwest | 43.761631 | -79.520999 |
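
The default inner merge silently drops postal codes that have no match in the coordinates file. A sketch of how one might check for such codes, using a left merge with `indicator=True`; the code `M0X` and its borough are hypothetical and exist only to show an unmatched row.

```python
import pandas as pd

areas = pd.DataFrame({
    'Postalcode': ['M4T', 'M6E', 'M0X'],   # 'M0X' is a made-up code with no coordinates
    'Borough': ['Central Toronto', 'York', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postal Code': ['M4T', 'M6E'],
    'Latitude': [43.689574, 43.689026],
    'Longitude': [-79.38316, -79.453512],
})

# A left merge keeps every area; the _merge column records where each row was found
checked = areas.merge(coords, left_on='Postalcode', right_on='Postal Code',
                      how='left', indicator=True)
missing_codes = checked.loc[checked['_merge'] == 'left_only', 'Postalcode'].tolist()
print(missing_codes)
```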


## Part 3. Cluster the neighbourhoods in Toronto

1. Select all areas in Toronto
1. Draw a map of those areas
1. Load venue data from Foursquare
1. Create 10 clusters
1. Draw the clusters on the map

In [102]:
# Select areas in Toronto
dfTorontoAreas = dfAreas[dfAreas['Borough'].str.contains('Toronto')].copy()
dfTorontoAreas.head()

| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 7 | M5P | Central Toronto | Forest Hill North, Forest Hill West | 43.696948 | -79.411307 |
| 11 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| 14 | M6H | West Toronto | Dovercourt Village, Dufferin | 43.669005 | -79.442259 |
| 21 | M6S | West Toronto | Runnymede, Swansea | 43.651571 | -79.48445 |


In [103]:
# Create map of Toronto
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for lat, lng, borough, neighborhood in zip(dfTorontoAreas['Latitude'], dfTorontoAreas['Longitude'], dfTorontoAreas['Borough'], dfTorontoAreas['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

# Draw the map
map_toronto
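
The map center above comes from a Nominatim lookup, which needs a network call. If that lookup is unavailable, one possible fallback (an assumption, not part of the original notebook) is to center the map on the mean of the area coordinates. A sketch using three of the coordinates shown earlier:

```python
import pandas as pd

# Three of the Toronto areas listed above
sample_areas = pd.DataFrame({
    'Latitude': [43.689574, 43.696948, 43.72802],
    'Longitude': [-79.38316, -79.411307, -79.38879],
})

# Use the mean latitude/longitude as an approximate map center
center = [sample_areas['Latitude'].mean(), sample_areas['Longitude'].mean()]
print(center)
```

The resulting `center` list has the same `[latitude, longitude]` shape that `folium.Map(location=...)` receives in the cell above.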

In [104]:
# Credentials are redacted here; never publish real API keys in a notebook
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100
radius = 500

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
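
The function above assumes every response contains at least one group and every venue at least one category. A sketch of a more defensive parsing step follows. `extract_venues` is a hypothetical helper, and the sample payload is made up to mirror the nesting the function above reads (`response` → `groups` → `items` → `venue`); it is not a real API response.

```python
def extract_venues(name, lat, lng, payload):
    """Return one tuple per venue found in an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hypothetical payload with two venues, the second without a category
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Park',
               'location': {'lat': 43.7281, 'lng': -79.3889},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Example Spot',
               'location': {'lat': 43.7279, 'lng': -79.3885},
               'categories': []}},
]}]}}

venue_rows = extract_venues('Lawrence Park', 43.72802, -79.38879, sample_payload)
print(venue_rows)
```

An empty or malformed response then yields an empty list instead of raising `KeyError` or `IndexError`.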

In [105]:
toronto_venues = getNearbyVenues(names=dfTorontoAreas['Neighborhood'],
                                   latitudes=dfTorontoAreas['Latitude'],
                                   longitudes=dfTorontoAreas['Longitude']
                                  )

Moore Park, Summerhill East
Forest Hill North, Forest Hill West
Lawrence Park
Dovercourt Village, Dufferin
Runnymede, Swansea
First Canadian Place, Underground city
The Beaches West, India Bazaar
Harbourfront East, Toronto Islands, Union Station
Christie
Business Reply Mail Processing Centre 969 Eastern
Rosedale
The Beaches
High Park, The Junction South
Davisville
Adelaide, King, Richmond
Ryerson, Garden District
Harbourfront
The Danforth West, Riverdale
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Studio District
Commerce Court, Victoria Hotel
Davisville North
The Annex, North Midtown, Yorkville
Design Exchange, Toronto Dominion Centre
Church and Wellesley
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Parkdale, Roncesvalles
Harbord, University of Toronto
Berczy Park
Cabbagetown, St. James Town
Roselawn
Little Portugal, Trinity
Central Bay Street
Brockton, Exhibition Place, Parkdale Village
Chinatown, Gra

In [106]:
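# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# a venue category named 'Neighborhood' would clash with the column inserted below, so drop it if present
toronto_onehot.drop('Neighborhood', axis=1, inplace=True, errors='ignore')
toronto_onehot.insert(0, 'Neighborhood', toronto_venues['Neighborhood'])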
toronto_onehot.shape

(1717, 234)

In [107]:
toronto_grouped = toronto_onehot.groupby(['Neighborhood']).mean().reset_index()
toronto_grouped.shape

(39, 234)
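
To make the two steps above concrete, here is a toy version of the one-hot encoding followed by the per-neighbourhood mean. The neighbourhoods `A` and `B` and their venues are invented. One venue is given the category `Neighborhood` to show why that column is dropped before the neighbourhood names are inserted (presumably the reason for the `drop` call above).

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Bakery', 'Bakery', 'Neighborhood'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
# Drop the category that would collide with the neighbourhood name column
onehot = onehot.drop(columns='Neighborhood', errors='ignore')
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the 0/1 columns is the share of each category per neighbourhood
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Because the dropped category still counts as a venue row, the shares for `B` add up to less than 1.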

In [108]:
import numpy as np # library to handle data in a vectorized manner

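def return_most_common_venues(row, num_top_venues):
    """Return the names of the num_top_venues most frequent venue categories in a row."""
    # skip the first entry, which holds the neighborhood name
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]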
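
When a neighbourhood has fewer distinct categories than `num_top_venues`, the function above fills the remaining slots with categories whose frequency is 0. A sketch of a variant that keeps only categories that actually occur; `top_categories` is a hypothetical name and the row is made up.

```python
import pandas as pd

def top_categories(row, n):
    """Return up to n category names with nonzero frequency, most frequent first."""
    freqs = row.iloc[1:].astype(float)   # skip the neighbourhood name
    freqs = freqs[freqs > 0]             # ignore categories that never occur
    return freqs.sort_values(ascending=False).index[:n].tolist()

# Illustrative row: name first, then one frequency per category
row = pd.Series({'Neighborhood': 'A', 'Bakery': 0.1, 'Cafe': 0.5, 'Park': 0.3, 'Bar': 0.0})

print(top_categories(row, 2))
print(top_categories(row, 10))
```

The second call returns only three names, so a caller that fills a fixed number of columns would need to pad the result.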

In [137]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide, King, Richmond | Coffee Shop | Bar | Thai Restaurant | Café | Restaurant | Cosmetics Shop | Sushi Restaurant | Bakery | Burger Joint | Steakhouse |
| 1 | Berczy Park | Coffee Shop | Cocktail Bar | Bakery | Café | Farmers Market | Seafood Restaurant | Beer Bar | Steakhouse | Cheese Shop | Greek Restaurant |
| 2 | Brockton, Exhibition Place, Parkdale Village | Café | Bakery | Breakfast Spot | Coffee Shop | Furniture / Home Store | Stadium | Restaurant | Italian Restaurant | Intersection | Bar |
| 3 | Business Reply Mail Processing Centre 969 Eastern | Yoga Studio | Garden Center | Skate Park | Restaurant | Recording Studio | Pizza Place | Park | Light Rail Station | Garden | Spa |
| 4 | CN Tower, Bathurst Quay, Island airport, Harbo... | Airport Service | Airport Lounge | Airport Terminal | Plane | Sculpture Garden | Bar | Coffee Shop | Boat or Ferry | Boutique | Harbor / Marina |
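
The column names above come from a three-item suffix list with a fallback to `th`, which is enough for ten columns but would label a 21st column as `21th`. A small sketch of a general ordinal helper; `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 10th to 20th, 110th to 120th, ...
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```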


In [138]:
kclusters = 10

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 9, 0, 2, 2, 2, 2, 9, 2], dtype=int32)
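
With 10 clusters for 39 neighbourhoods, some clusters may hold only one or two members, so it can help to look at the cluster sizes after fitting. A minimal sketch on made-up 2-D points (not the venue data), using `np.bincount` on the labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups of points
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Count how many points fall into each cluster
sizes = np.bincount(model.labels_, minlength=2)
print(model.labels_, sizes)
```

The label numbers themselves are arbitrary; only the grouping they describe is meaningful.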

In [139]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = dfTorontoAreas

# join the cluster labels and top venues onto dfTorontoAreas, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

| | Postalcode | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 | 1 | Playground | Tennis Court | Department Store | Event Space | Ethiopian Restaurant | Electronics Store | Eastern European Restaurant | Dumpling Restaurant | Donut Shop | Doner Restaurant |
| 7 | M5P | Central Toronto | Forest Hill North, Forest Hill West | 43.696948 | -79.411307 | 4 | Park | Jewelry Store | Trail | Sushi Restaurant | Yoga Studio | Dessert Shop | Ethiopian Restaurant | Electronics Store | Eastern European Restaurant | Dumpling Restaurant |
| 11 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 | 6 | Park | Swim School | Bus Line | Yoga Studio | Dim Sum Restaurant | Event Space | Ethiopian Restaurant | Electronics Store | Eastern European Restaurant | Dumpling Restaurant |
| 14 | M6H | West Toronto | Dovercourt Village, Dufferin | 43.669005 | -79.442259 | 9 | Bakery | Pharmacy | Gym / Fitness Center | Grocery Store | Furniture / Home Store | Middle Eastern Restaurant | Music Venue | Park | Pool | Portuguese Restaurant |
| 21 | M6S | West Toronto | Runnymede, Swansea | 43.651571 | -79.48445 | 9 | Pizza Place | Café | Sushi Restaurant | Coffee Shop | Italian Restaurant | Yoga Studio | Sandwich Place | Fish Market | Fish & Chips Shop | Indie Movie Theater |
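
If a neighbourhood returned no venues, the join above would leave its cluster label as NaN and turn the whole column into floats, which cannot be used as list indices when coloring the map. A sketch of how one might guard against that, on invented neighbourhoods `A`, `B` and `C`:

```python
import pandas as pd

areas = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
# 'B' is missing here, as if no venues had been found for it
labels = pd.DataFrame({'Neighborhood': ['A', 'C'], 'Cluster Labels': [2, 0]})

joined_raw = areas.join(labels.set_index('Neighborhood'), on='Neighborhood')

# Drop neighbourhoods without a cluster and restore integer labels
joined = joined_raw.dropna(subset=['Cluster Labels']).copy()
joined['Cluster Labels'] = joined['Cluster Labels'].astype(int)
print(joined)
```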


In [140]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
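
The color lookup used for the markers can be checked in isolation. The sketch below builds one hex color per cluster and indexes it directly by label; `palette` and `color_for` are hypothetical names. Indexing with `cluster-1`, as in the cell above, also gives each label its own color, because index `-1` wraps around to the last entry.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 10  # number of clusters, as in kclusters above

# One evenly spaced rainbow color per cluster, as hex strings
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

def color_for(cluster):
    """Return the hex color for a cluster label in the range 0..k-1."""
    return palette[int(cluster)]

print(palette)
```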