# Toronto Neighborhood Data Project

## Scope of the Project

The idea is to repeat the neighborhood clustering analysis that we performed for Manhattan (in the NYC_Neighborhood_Segmentation notebook), this time for Toronto. However, there are a few challenges we need to address.

The neighborhood data for New York was provided in an easily accessible JSON file. For Toronto, we will instead use Beautiful Soup to scrape the neighborhood data from Wikipedia. Because the data comes from a scraped web page, there is considerably more cleaning and formatting to do before we can start any analysis.

## Assignment #1 - Data Scraping

Here we use the Beautiful Soup library to get data from web pages (in particular, Wikipedia pages on Toronto).

The Wikipedia page we are using as the basis for our Toronto neighborhoods can be found at <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a>.

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim
import folium

import json
# json_normalize is called as pd.json_normalize below; the pandas.io.json import path is deprecated

from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

We now create a pandas dataframe of the Toronto postal codes, along with the borough and neighborhoods associated with each postal code.

There is a bit of data-cleaning that we need to do as well:

* First, we remove postal codes whose borough is "Not assigned". These are postal codes that have no borough or neighborhood associated with them.

* Next, we make sure that if more than one neighborhood belongs to the same postal code, they all appear on a single row, separated by commas. (The current Wikipedia table seems to do this already, but the check is cheap and guards against future changes to the table.)

* Then, if a postal code has a borough but no neighborhood name (the Neighbourhood value is "Not assigned" but the Borough name is present), we assign the borough name to the neighborhood value. (Again, the current Wikipedia table seems to have taken care of this already, but it is still worth guarding against future changes.)

* Finally, we check the shape of the dataframe to make sure we have the number of rows that we expect.
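
The steps above can be tried out on a small hand-made table before running them on the scraped data. This is only a sketch: the rows below are made up for illustration, and the column names are assumed to match the ones used in the next cell.

```python
import pandas as pd

# Toy stand-in for the scraped table; these rows are illustrative, not scraped.
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Regent Park',
                      'Harbourfront', 'Not assigned'],
})

# 1. Drop postal codes with no borough.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Collapse multiple rows per postal code into one comma-separated row.
clean = (clean.groupby(['Postal Code', 'Borough'], as_index=False)['Neighbourhood']
              .agg(', '.join))

# 3. Fall back to the borough name where the neighbourhood is missing.
clean['Neighbourhood'] = clean['Neighbourhood'].mask(
    clean['Neighbourhood'] == 'Not assigned', clean['Borough'])

# 4. Check the shape: three postal codes should remain.
print(clean)
print(clean.shape)
```

On this toy input, M1A is dropped, the two M5A rows are merged, and M7A takes its borough name as its neighbourhood.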

In [None]:
# Use the requests package to get the postal code page from Wikipedia

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url).text

# Use Beautiful Soup to parse the HTML and extract the first table on the page.
# This is not strictly necessary, since pandas read_html could be run on the page directly,
# but it is a bit more robust and leaves room for extra cleaning of the web source if needed.
post_code_page = BeautifulSoup(page, 'html.parser')
post_code_table = post_code_page.find('table')

# Use the read_html function in pandas to create a dataframe from the table.
# Recent pandas versions expect a file-like object rather than a raw HTML string.
from io import StringIO
post_code_df = pd.read_html(StringIO(str(post_code_table)))[0]

# Drop postal codes where Borough is "Not assigned" (copy so later assignments do not act on a view)
post_code_df = post_code_df[post_code_df['Borough'] != "Not assigned"].copy()

# In case there are multiple neighborhoods belonging to the same postal code, we group these together and drop the duplicates
post_code_df['Neighbourhood'] = post_code_df.groupby('Postal Code')['Neighbourhood'].transform(lambda x: ', '.join(x))
post_code_df = post_code_df.drop_duplicates()

# If a neighbourhood does not have a name assigned, but does have a Borough assigned, then make the neighbourhood name the same as the Borough
post_code_df['Neighbourhood'] = post_code_df['Neighbourhood'].mask(post_code_df['Neighbourhood'] == 'Not assigned', post_code_df['Borough'])

# Check that the dataframe is as expected
post_code_df

In [None]:
post_code_df.shape

## Assignment #2 - Adding Geo Coordinates to the Data Frame

First, we need the latitude and longitude of each postal code. We load this information from a file, **Geospatial_Coordinates.csv**, because the Google API for this data is no longer free and the free alternative, geocoder, does not seem to return consistent or reliable results.
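
Because the coordinates come from a separate file, it can be worth checking that every postal code actually finds a match when the two tables are merged. The sketch below uses made-up rows and placeholder coordinates in place of the real dataframes, and relies on the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Made-up rows standing in for post_code_df and Geospatial_Coordinates.csv.
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Borough': ['North York', 'North York', 'Etobicoke']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.75, 43.73],     # placeholder values
                       'Longitude': [-79.33, -79.32]})  # placeholder values

# indicator=True adds a _merge column recording where each row came from;
# validate='one_to_one' raises if a postal code is duplicated on either side.
merged = codes.merge(coords, how='left', on='Postal Code',
                     indicator=True, validate='one_to_one')

# Rows marked 'left_only' have no coordinates in the file.
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list on the real data would indicate that every postal code received coordinates.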

In [None]:
# Read the csv file into a pandas dataframe
geo_coordinates = pd.read_csv('Geospatial_Coordinates.csv')

# Check the data
geo_coordinates.head()

In [None]:
# Merge the two data frames based on postal code

post_code_coordinates = post_code_df.merge(geo_coordinates, how='left', on='Postal Code')

post_code_coordinates.head()

## Assignment #3 - Exploring and Clustering Neighborhoods in the Toronto Area

Finally, we use the Foursquare API and Folium to explore, cluster, and visualize neighborhoods in the Toronto area.

First, we use the Nominatim geocoder from geopy to get the latitude and longitude of Toronto, and then build a map of the area with Folium.
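
A geocoder's `geocode` call returns `None` when it finds no match, so reading `.latitude` directly can fail with an unhelpful error. A minimal sketch of a guard is shown here; the helper name `geocode_or_raise` and the stand-in geocoder classes are hypothetical and exist only so the example runs offline.

```python
def geocode_or_raise(geolocator, address):
    """Return (latitude, longitude) for address, or raise if nothing is found."""
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude


# Stand-in geocoder so the sketch runs without network access.
# The coordinates are placeholders, not real geocoding output.
class FakeLocation:
    latitude, longitude = 43.65, -79.38


class FakeGeolocator:
    def geocode(self, address):
        return FakeLocation() if address == 'Toronto, ON' else None


print(geocode_or_raise(FakeGeolocator(), 'Toronto, ON'))
```

In the notebook, the same helper could be called with the `Nominatim` instance in place of `FakeGeolocator()`.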

In [None]:
# Get coordinates for Toronto, Ontario
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("The geographical coordinates of Toronto are {}, {}.".format(latitude, longitude))

In [None]:
# Use Folium to create a map of Toronto

map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 10)

for lat, lng, borough, neighbourhood in zip(post_code_coordinates['Latitude'], post_code_coordinates['Longitude'], post_code_coordinates['Borough'], post_code_coordinates['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_toronto)
map_toronto

Next, we create a new dataframe containing only the neighbourhoods in Downtown Toronto.

In [None]:
downtown_toronto = post_code_coordinates[post_code_coordinates['Borough'] == 'Downtown Toronto'].reset_index(drop = True)
downtown_toronto.head()

In [None]:
downtown_toronto.shape

In [None]:
# Get the coordinates of Downtown Toronto for a new Folium map

address = 'Downtown Toronto, ON'

geolocator = Nominatim(user_agent = 'to_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
# Create a map of Downtown Toronto

map_downtown_toronto = folium.Map(location = [latitude, longitude], zoom_start = 13)

for lat, lng, label in zip(downtown_toronto['Latitude'], downtown_toronto['Longitude'], downtown_toronto['Neighbourhood']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_downtown_toronto)

map_downtown_toronto

Next, we set up access to the Foursquare API and get the latitude and longitude values of the first neighbourhood in the dataframe.

In [None]:
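# Connecting to Foursquare
# Credentials are read from environment variables (the variable names are our
# own choice) so that they are not stored in the notebook.
import os

CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')
VERSION = '20180604'
LIMIT = 100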

In [None]:
neighbourhood_latitude = downtown_toronto.loc[0, 'Latitude']
neighbourhood_longitude = downtown_toronto.loc[0, 'Longitude']

neighbourhood_name = downtown_toronto.loc[0, 'Neighbourhood']

print('Latitude and longitude values of {} are {}, {}'.format(neighbourhood_name, neighbourhood_latitude, neighbourhood_longitude))

Now, we request up to 100 venues within a radius of 500 meters of this first neighbourhood (Harbourfront and Regent Park).
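
The request URL is a base endpoint followed by a query string. As a sketch, the query string can be assembled from a dictionary with the standard library's `urlencode`, which also escapes the values. The credentials and coordinates below are placeholders, not real values.

```python
from urllib.parse import urlencode

# Placeholder values; real credentials come from your own Foursquare account.
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180604',
    'll': '{},{}'.format(43.65, -79.36),  # illustrative coordinates
    'query': '',
    'radius': 500,
    'limit': 100,
}

# urlencode escapes each value, so the comma in ll becomes %2C.
url = 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)
print(url)
```

Passing the same dictionary as `requests.get(base_url, params=params)` is an equivalent option that avoids building the string by hand.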

In [None]:
search_query = ''
radius = 500

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID,CLIENT_SECRET,neighbourhood_latitude,neighbourhood_longitude,VERSION,search_query,radius,LIMIT)

results = requests.get(url).json()

In [None]:
# Function that extracts the categories of the venues

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [None]:
# Put this data into a pandas dataframe

nearby_venues = results['response']['venues']
nearby_venues = pd.json_normalize(nearby_venues)

filtered_columns = ['name','categories','location.lat','location.lng']
nearby_venues = nearby_venues.loc[:,filtered_columns]

nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis = 1)

nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

In [None]:
# Check how many venues foursquare returned

print('{} venues were returned by Foursquare'.format(nearby_venues.shape[0]))

### Collecting Venues for all Downtown Toronto Neighbourhoods

Now, we repeat this process for all of the neighborhoods in Downtown Toronto.
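
When looping over many neighbourhoods, a single venue with an empty category list is enough to stop the whole run. The sketch below flattens one item defensively. The helper name `extract_venue` is hypothetical, and the sample items are hand-written to mirror only the fields this notebook reads; they are not real API output.

```python
def extract_venue(item):
    """Flatten one explore item into (name, lat, lng, category)."""
    venue = item['venue']
    categories = venue.get('categories') or []
    # Use None when the venue has no category instead of raising IndexError.
    category = categories[0]['name'] if categories else None
    return (venue['name'], venue['location']['lat'],
            venue['location']['lng'], category)


# Hand-written items shaped like the fields read in this notebook.
items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Spot', 'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]

rows = [extract_venue(item) for item in items]
print(rows)
```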

In [None]:
# Function to get nearby venues

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID,CLIENT_SECRET,VERSION,lat,lng,radius,LIMIT)
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # A venue may come back without a category, so guard the lookup
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

    columns = ['Neighbourhood', 'Neighbourhood Latitude','Neighbourhood Longitude','Venue','Venue Latitude','Venue Longitude','Venue Category']
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list], columns = columns)
    
    return nearby_venues

In [None]:
downtown_venues = getNearbyVenues(names = downtown_toronto['Neighbourhood'], latitudes = downtown_toronto['Latitude'], longitudes = downtown_toronto['Longitude'])

In [None]:
print(downtown_venues.shape)
downtown_venues.head()

In [None]:
# Check how many unique categories from all the returned venues

print('There are {} unique categories.'.format(len(downtown_venues['Venue Category'].unique())))

In [None]:
# Neighbourhood analysis

downtown_onehot = pd.get_dummies(downtown_venues[['Venue Category']], prefix = '', prefix_sep='')

downtown_onehot['Neighbourhood'] = downtown_venues['Neighbourhood']

fixed_columns = [downtown_onehot.columns[-1]] + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()

In [None]:
downtown_onehot.shape

In [None]:
downtown_grouped = downtown_onehot.groupby('Neighbourhood').mean().reset_index()
downtown_grouped.head()

Next, we display the top 5 most common venues for each neighbourhood.
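
The frequencies being ranked here come from one-hot encoding the venue categories and averaging per neighbourhood: the mean of a 0/1 column is the share of venues in that category. A small sketch on made-up venues shows the idea (the neighbourhood and category values are invented for illustration).

```python
import pandas as pd

# Made-up venues: neighbourhood A has 6 venues, neighbourhood B has 2.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Cafe', 'Park', 'Park', 'Bar', 'Bar', 'Bar'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Averaging the 0/1 columns gives each category's share per neighbourhood.
freq = onehot.groupby('Neighbourhood').mean()

# The two most common categories in neighbourhood A.
top2_a = freq.loc['A'].nlargest(2)
print(freq)
print(top2_a)
```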

In [None]:
num_top_venues = 5

for hood in downtown_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending = False).reset_index(drop = True).head(num_top_venues))
    print('\n')

### Additional Analysis of Neighbourhood Venues

First, we define a function that sorts a neighbourhood's venue categories from most to least common.

Then, we use it to build a pandas dataframe that displays the top venues for each neighbourhood.
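
As a rough illustration of the ranking step, the sketch below turns a small frequency table into a wide table with one column per rank. The `ordinal` helper is a hypothetical alternative to a fixed list of suffixes that also handles cases such as 11th and 22nd, and the frequencies are made up.

```python
import pandas as pd


def ordinal(n):
    """Return 1st, 2nd, 3rd, 4th, ... including 11th, 12th, 13th and 21st."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Made-up category frequencies for two neighbourhoods.
freq = pd.DataFrame(
    {'Bar': [0.1, 0.6], 'Cafe': [0.5, 0.1], 'Park': [0.4, 0.3]},
    index=pd.Index(['A', 'B'], name='Neighbourhood'))

num_top = 2
columns = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(num_top)]

# For each neighbourhood, keep the names of its num_top largest categories.
top = pd.DataFrame(
    [row.nlargest(num_top).index.tolist() for _, row in freq.iterrows()],
    index=freq.index, columns=columns).reset_index()
print(top)
```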

In [None]:
# Function to sort venues

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
# New data frame to display the top 10 venues for each neighbourhood

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
    

neighbourhoods_venues_sorted = pd.DataFrame(columns = columns)
neighbourhoods_venues_sorted['Neighbourhood'] = downtown_grouped['Neighbourhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

### Neighbourhood Clustering
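
K-means groups neighbourhoods whose venue-frequency rows are close to one another. Before running it on the real data, a toy example can make the behaviour concrete. The matrix below is made up, with two rows dominated by the first category and two by the last, so with two clusters we would expect those pairs to be grouped together.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows are neighbourhoods, columns are venue-category frequencies (made-up values).
X = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
])

# n_init and random_state are fixed so repeated runs give the same grouping.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(labels)
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful.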

In [None]:
# Set number of clusters

kclusters = 3

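downtown_grouped_clustering = downtown_grouped.drop(columns = 'Neighbourhood')

# n_init is set explicitly because its default changed in newer scikit-learn releases
kmeans = KMeans(n_clusters = kclusters, n_init = 10, random_state = 0).fit(downtown_grouped_clustering)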

kmeans.labels_[0:10]

In [None]:
# Create dataframe that includes cluster and top 10 venues for each neighbourhood

neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

downtown_merged = downtown_toronto

downtown_merged = downtown_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on = 'Neighbourhood')

downtown_merged.head()

In [None]:
map_clusters = folium.Map(location=[latitude,longitude], zoom_start = 13)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0,1,len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(downtown_merged['Latitude'], downtown_merged['Longitude'], downtown_merged['Neighbourhood'], downtown_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html = True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster -1],
        fill = True,
        fill_color = rainbow[cluster -1],
        fill_opacity = 0.7).add_to(map_clusters)

map_clusters

### Examine Each Cluster
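
Alongside viewing each cluster's rows, a quick summary of cluster sizes and the most frequent top venue per cluster can help when naming the clusters. This is a sketch on a made-up table that assumes the same `Cluster Labels` and `1st Most Common Venue` column names as the merged dataframe.

```python
import pandas as pd

# Made-up stand-in for the merged dataframe.
merged = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Cluster Labels': [0, 0, 1, 2],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Park', 'Airport Lounge'],
})

# Number of neighbourhoods in each cluster, ordered by cluster label.
sizes = merged['Cluster Labels'].value_counts().sort_index()

# Most frequent top venue within each cluster.
top_venue = merged.groupby('Cluster Labels')['1st Most Common Venue'].agg(
    lambda s: s.mode().iloc[0])

print(sizes)
print(top_venue)
```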

In [None]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 0, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

In [None]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 1, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

In [None]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 2, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]