## Capstone: Find the best neighborhood in Toronto to open Hairdressing Branches for Hair Co.

### Part 1: Load Data, Pre-processing

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
List_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(List_url).text

soup = BeautifulSoup(source, 'html.parser')

table=soup.find('table')

column_names = ['PostalCode','Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data
        
        
df = df[df['Borough'] != 'Not assigned']
df = df[df['Neighborhood'] != 'Not assigned']

df = df.reset_index(drop=True)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### Adding in the geospatial data

In [3]:
import geocoder  # required by get_geocode below

def get_geocode(postal_code, max_attempts=5):
    # retry a bounded number of times instead of looping forever on failed lookups
    for _ in range(max_attempts):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        if g.latlng is not None:
            latitude, longitude = g.latlng
            return latitude, longitude
    raise RuntimeError('No coordinates found for {}'.format(postal_code))


geo_df=pd.read_csv('http://cocl.us/Geospatial_data')


geo_df.rename(columns={'Postal Code':'PostalCode'}, inplace = True)
geo_df_merged = pd.merge(df, geo_df, on='PostalCode')
geo_df_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
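
The inner merge above silently drops any postal code that is missing from either frame. A minimal sketch of how unmatched rows could be surfaced, using toy frames that reuse the notebook's column names; the code `M0X` and its borough are made up purely for illustration:

```python
import pandas as pd

# Toy frames with the notebook's column names; "M0X" is a made-up code
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every scraped row; indicator=True adds a "_merge" column
checked = pd.merge(left, right, on="PostalCode", how="left", indicator=True)

# Rows tagged "left_only" have no coordinates and would vanish in an inner merge
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty on the real data, the inner merge loses nothing.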


### Search for Toronto Population Data

In [22]:
df_pop = pd.read_csv('https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Tables/File.cfm?T=1201&SR=1&RPP=9999&PR=0&CMA=0&CSD=0&S=22&O=A&Lang=Eng&OFT=CSV',encoding = 'unicode_escape', na_values='NaN')
#remove unnecessary columns and first row
df_pop = df_pop.iloc[1:,[0, 4]]

# Rename the columns 
df_pop = df_pop.rename(columns={'Geographic code':'PostalCode', 'Population, 2016':'Population_2016'})

df_pop = df_pop[df_pop['Population_2016'].notna()]

df_pop

Unnamed: 0,PostalCode,Population_2016
1,A0A,46587.0
2,A0B,19792.0
3,A0C,12587.0
4,A0E,22294.0
5,A0G,35266.0
...,...,...
1637,X0G,500.0
1638,X1A,20054.0
1639,Y0A,1641.0
1640,Y0B,6561.0
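
The population table covers forward sortation areas for all of Canada (`A0A` through `Y0B`), while the Wikipedia list used earlier covers only codes starting with `M`. The later inner merge filters these implicitly. As a rough sketch, the same selection could be made explicit; the helper name `toronto_fsas` is hypothetical, and the pattern is a simplified letter-digit-letter check:

```python
import re

# Simplified shape check: letter, digit, letter (e.g. "M2N")
FSA_PATTERN = re.compile(r"^[A-Z]\d[A-Z]$")

def toronto_fsas(codes):
    """Return the well-formed codes that start with 'M', normalised to upper case."""
    cleaned = (code.strip().upper() for code in codes)
    return [c for c in cleaned if FSA_PATTERN.match(c) and c.startswith("M")]

sample = ["A0A", "m2n", " M1B ", "X1A", "Canada", "M9V"]
print(toronto_fsas(sample))
```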


### Merge population data with original dataframe

In [24]:
df_new = pd.merge(geo_df_merged, df_pop, on='PostalCode', how='inner')

df_new = df_new.sort_values(by=['Population_2016'], ascending=False)

df_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Population_2016
59,M2N,North York,"Willowdale, Willowdale East",43.77012,-79.408493,75897.0
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,66108.0
33,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,58293.0
88,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437,55959.0
84,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.815252,-79.284577,54680.0


### Toronto list of amenities/venues that can be used for pinpointing locations with hairdressers

In [25]:

#FourSquare Credentials

CLIENT_ID = 'APO00QTF2Y3WAWZUT2YTZXSPZGDGHOYNY5FSI1ARNPVQ2WQU' # your Foursquare ID


CLIENT_SECRET = 'RWUKTJGS3Y1GOCBSRX1TMUUFMYPFL2BVV03GVUVIHH3G25UC' # your Foursquare Secret


VERSION = '20180605' # Foursquare API version
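
Hard-coding credentials in a shared notebook exposes them to every reader. One possible alternative, sketched below, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper itself are assumptions for illustration, not part of the Foursquare API:

```python
import os

# Hypothetical environment variable names chosen for this sketch
ID_VAR = "FOURSQUARE_CLIENT_ID"
SECRET_VAR = "FOURSQUARE_CLIENT_SECRET"

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping, failing loudly if absent."""
    missing = [name for name in (ID_VAR, SECRET_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[ID_VAR], env[SECRET_VAR]

# Demonstration with a plain dict standing in for the real environment
demo_env = {ID_VAR: "demo-id", SECRET_VAR: "demo-secret"}
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(demo_env)
```

In the notebook itself the call would be made without arguments so that the real environment is used.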

In [26]:
#Let's explore neighborhoods in our dataframe.
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas.io.json.json_normalize is deprecated)

LIMIT = 200 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
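
`getNearbyVenues` builds the query string by hand and indexes straight into the response, so a venue without a category or a response without groups raises an exception. The sketch below shows one way to make both steps more defensive. The helper names are hypothetical, and the payload shape is assumed to match the fields the function above already reads:

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"  # same endpoint as above

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=200):
    """Assemble the request URL, letting urlencode handle escaping."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, tolerating missing groups or categories."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or [{}]  # fall back to an empty category
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name"),
        ))
    return rows
```

Venues without a category come back with `None` in the last position, which can be dropped or filled before one-hot encoding.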

In [28]:
df_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Population_2016
59,M2N,North York,"Willowdale, Willowdale East",43.77012,-79.408493,75897.0
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,66108.0
33,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,58293.0
88,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437,55959.0
84,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.815252,-79.284577,54680.0


In [29]:
# Get venue location list
Venue_locations = getNearbyVenues(names=df_new['Neighborhood'],
                                   latitudes=df_new['Latitude'],
                                   longitudes=df_new['Longitude']
                                  )

Willowdale, Willowdale East
Malvern, Rouge
Fairview, Henry Farm, Oriole
South Steeles, Silverstone, Humbergate, Jamestown, Mount Olive, Beaumond Heights, Thistletown, Albion Gardens
Milliken, Agincourt North, Steeles East, L'Amoreaux East
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Steeles West, L'Amoreaux West
Kennedy Park, Ionview, East Birchmount Park
Guildwood, Morningside, West Hill
Woodbine Heights
Dorset Park, Wexford Heights, Scarborough Town Centre
Dufferin, Dovercourt Village
Del Ray, Mount Dennis, Keelsdale and Silverthorn
Downsview
Runnymede, The Junction North
Regent Park, Harbourfront
Brockton, Parkdale Village, Exhibition Place
Willowdale, Willowdale West
Northwest, West Humber - Clairville
High Park, The Junction South
Don Mills
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Caledonia-Fairbanks
New Toronto, Mimico South, Humber Bay Shores
Agincourt
Bathurst Manor, Wilson Heights, Downsview Nor

In [32]:
Venue_locations.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Wilson Heights, Downsview North",22,22,22,22,22,22
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
...,...,...,...,...,...,...
"Willowdale, Willowdale West",5,5,5,5,5,5
Woburn,4,4,4,4,4,4
Woodbine Heights,5,5,5,5,5,5
York Mills West,2,2,2,2,2,2


In [33]:
print('Unique Venue Categories:')
list(Venue_locations['Venue Category'].unique())

Unique Venue Categories:


['Grocery Store',
 'Ramen Restaurant',
 'Movie Theater',
 'Café',
 'Steakhouse',
 'Japanese Restaurant',
 'Coffee Shop',
 'Ice Cream Shop',
 'Arts & Crafts Store',
 'Juice Bar',
 'Plaza',
 'Shopping Mall',
 'Lounge',
 'Restaurant',
 'Sushi Restaurant',
 'Pet Store',
 'Sandwich Place',
 'Discount Store',
 'Electronics Store',
 'Fast Food Restaurant',
 'Bank',
 'Middle Eastern Restaurant',
 'Pizza Place',
 'Vietnamese Restaurant',
 'Bubble Tea Shop',
 'Hotel',
 'Print Shop',
 'Toy / Game Store',
 'Chocolate Shop',
 'Bakery',
 'Salon / Barbershop',
 'Burger Joint',
 'Pharmacy',
 'American Restaurant',
 'Clothing Store',
 'Theater',
 'Department Store',
 'Liquor Store',
 'Video Game Store',
 'Food Court',
 'Asian Restaurant',
 'Cosmetics Shop',
 'Burrito Place',
 'Sporting Goods Shop',
 'Bar',
 "Women's Store",
 'Deli / Bodega',
 'Boutique',
 'Tea Room',
 'Supplement Shop',
 'Distribution Center',
 'Shoe Store',
 'Mobile Phone Shop',
 'Jewelry Store',
 'Greek Restaurant',
 'Chinese Restaur

In [34]:
# Pick amenities similar to hairdressers that we can use
amenities_list = ['Plaza', 'Shopping Mall', 'Pet Store', 'Discount Store', 'Electronics Store', 'Bank', 'Hotel',
                  'Clothing Store', 'Theater', 'Video Game Store', 'Bus Station', 'Convenience Store', 'Playground',
                  'Bus Line', 'Spa', 'Pub', 'Yoga Studio', 'Gym', 'Bookstore', 'Bike Shop', 'Gas Station', 'Comic Shop',
                  'Candy Store', 'Baby Store', 'Butcher', 'Motel', 'Tailor Shop']

amenities_list_pd = pd.DataFrame(amenities_list)

amenities_list_pd = amenities_list_pd.rename(columns={0:'Venue Category'})

venue_new = pd.merge(Venue_locations, amenities_list_pd, on='Venue Category', how='right')

venue_new.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Alderwood, Long Branch",2,2,2,2,2,2
"Bathurst Manor, Wilson Heights, Downsview North",4,4,4,4,4,4
Bayview Village,1,1,1,1,1,1
"Bedford Park, Lawrence Manor East",2,2,2,2,2,2
Berczy Park,5,5,5,5,5,5
...,...,...,...,...,...,...
Westmount,1,1,1,1,1,1
"Wexford, Maryvale",1,1,1,1,1,1
"Willowdale, Willowdale East",8,8,8,8,8,8
"Willowdale, Willowdale West",1,1,1,1,1,1


### One-hot encoding

In [35]:
venue_new_onehot = pd.get_dummies(venue_new[['Venue Category']], prefix="", prefix_sep="")

venue_new_onehot['Neighborhood'] = venue_new['Neighborhood'] 

fixed_columns = [venue_new_onehot.columns[-1]] + list(venue_new_onehot.columns[:-1])
venue_new_onehot = venue_new_onehot[fixed_columns]

venue_new_onehot.head()

Unnamed: 0,Neighborhood,Baby Store,Bank,Bike Shop,Bookstore,Bus Line,Bus Station,Butcher,Candy Store,Clothing Store,...,Pet Store,Playground,Plaza,Pub,Shopping Mall,Spa,Tailor Shop,Theater,Video Game Store,Yoga Studio
0,"Willowdale, Willowdale East",0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,"Harbourfront East, Union Station, Toronto Islands",0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,"Harbourfront East, Union Station, Toronto Islands",0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,"Garden District, Ryerson",0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,"Richmond, Adelaide, King",0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [36]:
venue_grouped = venue_new_onehot.groupby('Neighborhood').mean().reset_index()
venue_grouped.shape


venue_grouped.head()

Unnamed: 0,Neighborhood,Baby Store,Bank,Bike Shop,Bookstore,Bus Line,Bus Station,Butcher,Candy Store,Clothing Store,...,Pet Store,Playground,Plaza,Pub,Shopping Mall,Spa,Tailor Shop,Theater,Video Game Store,Yoga Studio
0,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0
1,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0
2,Bayview Village,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,...,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0
4,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,...,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.0,0.0,0.0
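
One-hot encoding followed by a grouped mean gives, for each neighborhood, the share of its venues that fall in each category. The plain-Python sketch below reproduces that arithmetic on a tiny sample. The function name is hypothetical, and the sample pairs are chosen to mirror the first rows of the output above:

```python
from collections import Counter, defaultdict

def category_frequencies(venues):
    """Map (neighborhood, category) pairs to {neighborhood: {category: share}}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for hood, cats in counts.items()
    }

sample = [
    ("Bayview Village", "Bank"),
    ("Alderwood, Long Branch", "Pub"),
    ("Alderwood, Long Branch", "Gym"),
]
freqs = category_frequencies(sample)
print(freqs["Alderwood, Long Branch"])
```

A neighborhood with a single matching venue therefore gets a share of 1.0 for that category, which is worth keeping in mind when reading the clusters.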


### K-means clustering to determine optimal clusters to segment data

In [38]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np


venue_grouped_clustering = venue_grouped.drop(columns='Neighborhood')

# Use silhouette score to find optimal number of clusters to segment the data
kclusters = np.arange(2,10)
results = {}
for size in kclusters:
    model = KMeans(n_clusters=size, random_state=0).fit(venue_grouped_clustering)  # fixed seed so the chosen size is reproducible
    predictions = model.predict(venue_grouped_clustering)
    results[size] = silhouette_score(venue_grouped_clustering, predictions)

best_size = max(results, key=results.get)
best_size

8
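
For reference, the quantity maximised above is the mean silhouette coefficient. The standard definition, written for a point $i$ assigned to cluster $C_I$ with distance function $d$, is sketched here; the symbols are introduced only for this note:

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\; j \neq i} d(i, j),
\qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j),
\qquad
s(i) =
\begin{cases}
\dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} & \text{if } |C_I| > 1, \\[2ex]
0 & \text{if } |C_I| = 1.
\end{cases}
```

The score returned for each candidate size is the average of $s(i)$ over all rows. Values near 1 indicate well-separated clusters, and values near 0 indicate overlapping ones.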

In [39]:
#import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = best_size


# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venue_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([7, 3, 3, 0, 0, 0, 0, 3, 0, 0])

In [40]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond 3rd fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = venue_grouped['Neighborhood']

for ind in np.arange(venue_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(venue_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Alderwood, Long Branch",Pub,Gym,Yoga Studio,Discount Store,Bank,Bike Shop,Bookstore,Bus Line,Bus Station,Butcher
1,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Gas Station,Shopping Mall,Electronics Store,Bike Shop,Bookstore,Bus Line,Bus Station,Butcher,Candy Store
2,Bayview Village,Bank,Yoga Studio,Electronics Store,Bike Shop,Bookstore,Bus Line,Bus Station,Butcher,Candy Store,Clothing Store
3,"Bedford Park, Lawrence Manor East",Pub,Butcher,Yoga Studio,Electronics Store,Bank,Bike Shop,Bookstore,Bus Line,Bus Station,Candy Store
4,Berczy Park,Tailor Shop,Pub,Butcher,Hotel,Clothing Store,Yoga Studio,Discount Store,Bank,Bike Shop,Bookstore
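
The `indicators` list combined with the `try`/`except` fallback works for the ten ranks used here, but it would label later ranks incorrectly (for example "21th"). A small helper along the following lines handles the general case; the name `ordinal` is hypothetical:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["Neighborhood"] + [f"{ordinal(rank)} Most Common Venue" for rank in range(1, 11)]
print(columns[1], columns[-1])
```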


In [43]:

#Merge the Toronto data with geo coordinate data and make sure it's the right shape
venue_labels = pd.merge(df_new,venue_grouped, on='Neighborhood', how='right')
venue_labels.shape


venue_labels = venue_labels.drop(columns=amenities_list)
venue_labels.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Population_2016
0,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,20674.0
1,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259,37011.0
2,M2K,North York,Bayview Village,43.786947,-79.385975,23852.0
3,M5M,North York,"Bedford Park, Lawrence Manor East",43.733283,-79.41975,25975.0
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,9118.0


In [48]:
import numpy as np
from geopy.geocoders import Nominatim
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

In [49]:
address = 'Toronto'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The geographical coordinates of Toronto are {latitude}, {longitude}.')

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [53]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)
map_toronto

for lat, lng, borough, neighborhood in zip(
        venue_labels['Latitude'], 
        venue_labels['Longitude'], 
        venue_labels['Borough'], 
        venue_labels['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto
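
The map above draws every neighborhood in the same blue, so the k-means result is not visible on it. One way to distinguish clusters would be to give each label its own color and pass that as `color` and `fill_color` to `folium.CircleMarker`. The sketch below only generates the palette. It assumes a cluster label per neighborhood is available (for example from `kmeans.labels_`), and the helper name is hypothetical:

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hex colors, one per cluster label."""
    palette = []
    for i in range(n_clusters):
        # Spread hues around the color wheel at fixed saturation and brightness
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.75, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(8)  # 8 matches best_size in the run above
color_for_label_3 = palette[3]  # look up the color by cluster label
```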

### Discussion

Using the amenities list from Foursquare, we found that most neighborhoods were similar and that the greatest concentration of restaurants was in Central Toronto and Downtown Toronto.

When we built our K-means dataset, silhouette analysis showed a lot of similarity between neighborhoods and the most common restaurants within them. In effect, there were only 2 types of cluster, or neighborhood, in greater Toronto, and the vast majority of neighborhoods fell into 1 cluster. So Toronto's restaurants may be numerous, but they are concentrated homogeneously near the center of Toronto.

Of the 103 Toronto neighborhoods gathered, only 57 (55.3%) are above the median after-tax income and 39 (37.9%) are below it. The remaining 7 neighborhoods (6.8%) did not register, apparently because their populations are too low. The greatest concentration of affluence appears to be near central Toronto. We decided to keep all neighborhoods in the dataset regardless of income or population, as the majority were close enough.
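
The shares quoted above follow directly from the three neighborhood counts. A quick way to recompute them, rounding to one decimal place, is sketched here; the dictionary keys are labels chosen for this example:

```python
def share(count, total, digits=1):
    """Percentage of total represented by count, rounded to the given digits."""
    return round(100 * count / total, digits)

groups = {"above median": 57, "below median": 39, "not registered": 7}
total = sum(groups.values())

for label, count in groups.items():
    print(f"{label}: {count} of {total} = {share(count, total)}%")
```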

### Conclusion

I feel confident in the recommendation I have given to V, as it is backed up by data analysis and k-means clustering. Because it rests on hard data and facts, it should stand as a recommendation.

In the future, there is potential for further analysis by considering more amenity locations and by obtaining more accurate datasets on amenity locations, population size, and income.