## Assignment: Segmenting and Clustering Neighborhoods in Toronto

### Part 1:
**Beautiful Soup** is used to extract postal code information from Wikipedia.  
First, the relevant modules are imported:

In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup

Then, the URL of the page containing the dataset is specified and passed to urlopen() to retrieve the HTML of the page.

In [2]:
#Specify the URL containing the dataset and pass it to urlopen():
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html = urlopen(url)

#create a Beautiful Soup object from the html
soup = BeautifulSoup(html, 'lxml')

The **find_all()** method is used to extract HTML tags from a webpage. As the postal codes are stored in a table, all "table" tags are extracted.

In [50]:
tablesOnWebsite = soup.find_all("table")

len(tablesOnWebsite) # 3 tables found on website

# first table is table of postal codes
tablePostalCodes_Toronto = tablesOnWebsite[0]

#extract the rows
rows = tablePostalCodes_Toronto.find_all('tr')
rows[:2]

[<tr>
 <th>Postal Code
 </th>
 <th>Borough
 </th>
 <th>Neighborhood
 </th></tr>,
 <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>
 </td></tr>]
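As a minimal offline sketch of the same row extraction, the snippet below parses a small hand-written HTML table instead of the live Wikipedia page. The sample markup, the variable names, and the use of Python's built-in `html.parser` (rather than `lxml`) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption; it only mimics the table layout shown above)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find_all("table")[0]

# Header cells use <th>, data cells use <td>
header = [th.get_text(strip=True) for th in sample_table.find_all("th")]
data = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in sample_table.find_all("tr")[1:]  # skip the header row
]

print(header)
print(data)
```

`get_text(strip=True)` removes the trailing newline that the notebook strips with `replace('\n','')`.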

Next, create a new dataframe and insert the rows in the dataframe:

In [4]:
# define the dataframe columns
column_names = ['Postal Code', 'Borough', 'Neighborhood']

# extract the rows and collect the cleaned values in a list of records
# (DataFrame.append was removed in pandas 2.0, so the dataframe is built once at the end)
records = []
for row in rows[1:]:
    cols = row.find_all('td')

    postal_code = cols[0].text.replace('\n', '')
    borough = cols[1].text.replace('\n', '')
    neighborhood = cols[2].text.replace('\n', '')

    # skip postal codes that have no borough assigned
    if borough == 'Not assigned':
        continue

    # a neighborhood that is not assigned falls back to the borough name
    if neighborhood == 'Not assigned':
        neighborhood = borough

    records.append({'Postal Code': postal_code,
                    'Borough': borough,
                    'Neighborhood': neighborhood})

# build the dataframe from the collected records
df_toronto = pd.DataFrame(records, columns=column_names)

df_toronto.head()

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


There are 103 postal codes in the dataframe:

In [5]:
df_toronto.shape

(103, 3)
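The two cleaning rules used in the loop (drop rows without a borough, and fall back to the borough name when the neighborhood is not assigned) can also be expressed with vectorized pandas operations. The sketch below uses three made-up rows, so the values are illustrative and not taken from the scraped table.

```python
import pandas as pd

# Toy rows mimicking the scraped table (illustrative values, not the real data)
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M9A", "Etobicoke", "Not assigned"],
    ],
    columns=["Postal Code", "Borough", "Neighborhood"],
)

# Rule 1: drop rows whose borough is not assigned
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood falls back to the borough name
unassigned = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighborhood"] = cleaned.loc[unassigned, "Borough"]

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```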

### Part 2:

In [12]:
locationData = pd.read_csv('Geospatial_Coordinates.csv')
locationData.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Merge the two datasets:

In [14]:
df_tor_merged = df_toronto.join(locationData.set_index('Postal Code'), on='Postal Code')
df_tor_merged.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
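A left join silently leaves `NaN` coordinates for any postal code missing from the coordinates file. One way to check for this, sketched below on toy data, is `DataFrame.merge` with `indicator=True`. The frames `codes` and `coords` and the postal code `M0X` are hypothetical and exist only for this example.

```python
import pandas as pd

# Toy inputs (illustrative only; "M0X" is a made-up code without coordinates)
codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left merge keeps every postal code; indicator=True adds a "_merge" column,
# and validate raises if a postal code is duplicated on either side
merged = codes.merge(coords, on="Postal Code", how="left",
                     validate="one_to_one", indicator=True)

# Rows flagged "left_only" found no matching coordinates
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```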


### Part 3:
Now we cluster the neighborhoods in Toronto. Only boroughs whose name contains the word "Toronto" are included.

In [17]:
#filter dataframe
df_filtered = df_tor_merged[df_tor_merged['Borough'].str.contains('Toronto')].reset_index(drop=True)
df_filtered.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 3 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 4 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
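The borough filter relies on `Series.str.contains`. A small sketch on a toy series (made-up values, including one missing entry) shows how the optional `na` and `regex` arguments affect the resulting mask. Whether the real data contains missing boroughs is not known from the notebook, so this is only a precaution.

```python
import pandas as pd

# Toy borough names, including one missing value
boroughs = pd.Series(["Downtown Toronto", "North York", None, "East Toronto"])

# na=False treats missing values as non-matches instead of propagating them;
# regex=False matches the literal substring
mask = boroughs.str.contains("Toronto", na=False, regex=False)
print(boroughs[mask].tolist())
```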


The next step is to retrieve the venues of each neighborhood and cluster the neighborhoods according to their most common venues.
First, define the Foursquare credentials and API version. (The code is hidden because the credentials are private.)

Next, we create a function that extracts the category of the venue:

In [20]:
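def get_category_type(row):
    # the categories are stored under either 'categories' or 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # return the name of the first category, or None if there is none
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']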
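For illustration, here is how the function behaves on two hand-made records. The dictionaries are assumptions that imitate the two key layouts the function handles; they are not real API output.

```python
def get_category_type(row):
    """Return the name of the first category of a venue record, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hand-made records (illustrative only)
flat_row = {'categories': [{'name': 'Coffee Shop'}, {'name': 'Bakery'}]}
nested_row = {'venue.categories': []}

first = get_category_type(flat_row)     # first category name
second = get_category_type(nested_row)  # empty list, so None
print(first, second)
```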

Now we create a function to retrieve the nearby venues for all neighborhoods in Toronto:

In [21]:
import requests # library to handle requests

In [22]:
def getNearbyVenues(postalCodes, latitudes, longitudes, radius=500):
    limit = 30
    venues_list=[]
    for postalCode, lat, lng in zip(postalCodes, latitudes, longitudes):
        #print(postalCode)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()
        results = results["response"]['groups'][0]['items']
        #results = requests.get(url).json()['response']['venues'] #own code
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            postalCode, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

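    # flatten the per-postal-code lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    # note: the 'Neighborhood' column holds the postal code that was passed in
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues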

In [23]:
toronto_venues = getNearbyVenues(postalCodes=df_filtered['Postal Code'],
                                   latitudes=df_filtered['Latitude'],
                                   longitudes=df_filtered['Longitude']
                                  )
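The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue without categories. The sketch below shows one defensive way to flatten a response without calling the API. The payload `mock_json` is a hand-made stand-in that only copies the keys accessed above, and `extract_venues` is a hypothetical helper, not part of the notebook.

```python
# Hand-made stand-in for requests.get(url).json() (an assumption for illustration)
mock_json = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Cafe",
                   "location": {"lat": 43.65, "lng": -79.36},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Uncategorized Spot",
                   "location": {"lat": 43.66, "lng": -79.37},
                   "categories": []}},
    ]}]}
}

def extract_venues(payload, postal_code, lat, lng):
    """Flatten one explore-style payload into row tuples (hypothetical helper)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0]["items"] if groups else []
    venue_rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None instead of raising IndexError on an empty list
        category = categories[0]["name"] if categories else None
        venue_rows.append((postal_code, lat, lng, venue["name"],
                           venue["location"]["lat"], venue["location"]["lng"],
                           category))
    return venue_rows

venue_rows = extract_venues(mock_json, "M5A", 43.65426, -79.360636)
print(venue_rows)
```

An empty or error payload yields an empty list here, so a single failed request would not abort the loop over postal codes.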

Now we analyse the neighborhood / postal code areas by using one-hot encoding:

In [24]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot["Neighborhood"] = toronto_venues['Neighborhood']

loc_neighColumn = toronto_onehot.columns.get_loc("Neighborhood")

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[loc_neighColumn]] + list(toronto_onehot.columns[:loc_neighColumn])+ list(toronto_onehot.columns[loc_neighColumn+1:])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()


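|   | Neighborhood | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Aquarium | Art Gallery | ... | Toy / Game Store | Trail | Train Station | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Wine Bar | Wine Shop | Wings Joint | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M5A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | M5A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | M5A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | M5A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | M5A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |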


In [25]:
toronto_onehot.shape

(860, 195)
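To make the encoding step concrete, here is a small sketch on three made-up venues. It uses `DataFrame.insert` to place the key column first, which is an alternative to the column reordering in the cell above and not the notebook's own approach.

```python
import pandas as pd

# Three made-up venues (illustrative only)
venues = pd.DataFrame({
    "Neighborhood": ["M5A", "M5A", "M4E"],
    "Venue Category": ["Coffee Shop", "Park", "Pub"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the neighborhood key in the first position
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```

Each row still represents one venue, which is why the one-hot dataframe has as many rows as venues were returned.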

#### Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category

In [26]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.shape

(39, 195)
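Because each one-hot column contains only 0s and 1s, its mean per group equals the share of that neighborhood's venues falling into the category. A toy sketch with made-up numbers:

```python
import pandas as pd

# Made-up one-hot rows: four venues in M5A, one in M4E
onehot = pd.DataFrame({
    "Neighborhood": ["M5A", "M5A", "M5A", "M5A", "M4E"],
    "Coffee Shop":  [1, 1, 0, 0, 0],
    "Park":         [0, 0, 1, 0, 0],
    "Pub":          [0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of venues in that category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)

m5a = grouped.set_index("Neighborhood").loc["M5A"]
```

The frequencies of each row sum to 1, so neighborhoods with many venues and neighborhoods with few venues end up on the same scale before clustering.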

Define a function that returns the most common venue categories of a row, sorted in descending order of frequency:

In [33]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues] #index values = categories of venues
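A quick usage sketch on a hand-made row (the frequencies are invented) shows what the function returns:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the leading 'Neighborhood' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hand-made row of category frequencies (illustrative only)
row = pd.Series({"Neighborhood": "M5A", "Coffee Shop": 0.5, "Park": 0.3, "Pub": 0.2})

top_two = return_most_common_venues(row, 2)
print(list(top_two))
```

Note that when a neighborhood has fewer distinct categories than `num_top_venues`, the remaining slots are filled with categories whose frequency is 0, so the tail of such a list carries no information.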

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [34]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

|   | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M4E | Pub | Trail | Health Food Store | Yoga Studio | Cuban Restaurant | Eastern European Restaurant | Donut Shop | Dog Run | Distribution Center | Diner |
| 1 | M4K | Greek Restaurant | Italian Restaurant | Ice Cream Shop | Yoga Studio | Bubble Tea Shop | Juice Bar | Bookstore | Spa | Restaurant | Dessert Shop |
| 2 | M4L | Fast Food Restaurant | Park | Fish & Chips Shop | Sandwich Place | Italian Restaurant | Burrito Place | Restaurant | Ice Cream Shop | Light Rail Station | Steakhouse |
| 3 | M4M | Café | Coffee Shop | Bakery | Yoga Studio | Bookstore | Seafood Restaurant | Sandwich Place | Brewery | Cheese Shop | Park |
| 4 | M4N | Park | Swim School | Bus Line | Yoga Studio | Dance Studio | Eastern European Restaurant | Donut Shop | Dog Run | Distribution Center | Diner |
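The column labels above rely on a three-element suffix list, which is sufficient for ten columns but would label an 11th column correctly only by accident and a 21st column as "21th". If more columns were ever needed, a small helper along the following lines could build the labels. The function `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st' (hypothetical helper)."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th and 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[:4])
```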


### Cluster neighborhoods
Run *k*-means to cluster the neighborhoods into 5 clusters.

In [35]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

In [36]:
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering) 
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
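As a minimal sketch of what the clustering step does, the example below runs `KMeans` on a tiny invented frequency matrix with two obvious groups. The data and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency matrix: rows are neighborhoods, columns are venue categories
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# n_init is set explicitly because its default differs between scikit-learn versions
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

The label numbers themselves are arbitrary; only the grouping of rows is meaningful, which is worth keeping in mind when interpreting "Cluster 0" versus "Cluster 4".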

In [38]:
toronto_merged = df_filtered

# join the cluster labels and top venues onto df_filtered, which holds latitude/longitude for each postal code
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Postal Code')

toronto_merged.head() # check the last columns!

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | 0 | Coffee Shop | Park | Breakfast Spot | Theater | Bakery | Distribution Center | Dessert Shop | Spa | Event Space | Restaurant |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 0 | Coffee Shop | Mexican Restaurant | Beer Bar | Smoothie Shop | Sandwich Place | Burrito Place | Café | Park | College Auditorium | Creperie |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 | 0 | Café | Coffee Shop | Theater | Burger Joint | Hotel | Sporting Goods Shop | Burrito Place | Ramen Restaurant | Plaza | Steakhouse |
| 3 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 | 0 | Gastropub | Café | Cocktail Bar | Coffee Shop | Ice Cream Shop | Creperie | Japanese Restaurant | Italian Restaurant | Restaurant | New American Restaurant |
| 4 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 | 4 | Pub | Trail | Health Food Store | Yoga Studio | Cuban Restaurant | Eastern European Restaurant | Donut Shop | Dog Run | Distribution Center | Diner |


Finally, visualize the clusters:

In [41]:
# get the coordinates of Toronto:
from geopy.geocoders import Nominatim
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
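The color scheme can be reduced to a few lines. The sketch below builds one hex color per cluster from the same `rainbow` colormap and indexes it directly by label; this is a simplified variant, not the notebook's exact code, which looks colors up with `cluster-1` so that label 0 takes the last color through negative indexing.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Sample kclusters evenly spaced colors from the rainbow colormap
rgba = cm.rainbow(np.linspace(0, 1, kclusters))
palette = [colors.rgb2hex(c) for c in rgba]

# Index directly by cluster label so that label 0 maps to the first color
cluster_colors = {label: palette[label] for label in range(kclusters)}
print(cluster_colors)
```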

### Examine Clusters:

Cluster 0: Lifestyle neighborhoods (very popular: cafés/coffee shops, bars, and a mix of restaurants)

Clusters 1 and 2 are outliers, each containing only 1 neighborhood.

Cluster 3: Recreational neighborhoods (very popular: parks, playgrounds, yoga studios). It could also be an outlier, as it contains only 2 neighborhoods.

Cluster 4: Many food / fast food places
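Cluster sizes like the ones mentioned above can be read off with `value_counts`. The sketch below uses made-up labels rather than the notebook's actual result; on the real data the same call would be applied to `toronto_merged['Cluster Labels']`.

```python
import pandas as pd

# Made-up labels for illustration (not the notebook's actual result)
example_labels = pd.Series([0, 0, 0, 4, 3, 0, 1, 3, 2, 4], name="Cluster Labels")

# Number of neighborhoods per cluster, ordered by cluster label
sizes = example_labels.value_counts().sort_index()
print(sizes)

# Clusters with a single member are candidates for outliers
singletons = sizes[sizes == 1].index.tolist()
print(singletons)
```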

In [42]:
# change cluster label in order to examine other clusters
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

|   | Borough | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Downtown Toronto | 0 | Coffee Shop | Park | Breakfast Spot | Theater | Bakery | Distribution Center | Dessert Shop | Spa | Event Space | Restaurant |
| 1 | Downtown Toronto | 0 | Coffee Shop | Mexican Restaurant | Beer Bar | Smoothie Shop | Sandwich Place | Burrito Place | Café | Park | College Auditorium | Creperie |
| 2 | Downtown Toronto | 0 | Café | Coffee Shop | Theater | Burger Joint | Hotel | Sporting Goods Shop | Burrito Place | Ramen Restaurant | Plaza | Steakhouse |
| 3 | Downtown Toronto | 0 | Gastropub | Café | Cocktail Bar | Coffee Shop | Ice Cream Shop | Creperie | Japanese Restaurant | Italian Restaurant | Restaurant | New American Restaurant |
| 5 | Downtown Toronto | 0 | Seafood Restaurant | Coffee Shop | Cocktail Bar | Beer Bar | Comfort Food Restaurant | Fish Market | Restaurant | Breakfast Spot | Jazz Club | Museum |
| 6 | Downtown Toronto | 0 | Coffee Shop | Café | Yoga Studio | Poke Place | Japanese Restaurant | Seafood Restaurant | Italian Restaurant | Bubble Tea Shop | Ice Cream Shop | Spa |
| 7 | Downtown Toronto | 0 | Grocery Store | Café | Park | Candy Store | Italian Restaurant | Restaurant | Athletics & Sports | Diner | Coffee Shop | Nightclub |
| 8 | Downtown Toronto | 0 | Café | Coffee Shop | Deli / Bodega | Steakhouse | Hotel | Speakeasy | Seafood Restaurant | Japanese Restaurant | Restaurant | Smoke Shop |
| 9 | West Toronto | 0 | Pharmacy | Bakery | Grocery Store | Bank | Bar | Pool | Middle Eastern Restaurant | Café | Supermarket | Music Venue |
| 11 | West Toronto | 0 | Asian Restaurant | Bar | Coffee Shop | Yoga Studio | Men's Store | Boutique | Brewery | Record Shop | Pizza Place | New American Restaurant |
