# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone

## Table of contents
* [Introduction to Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction to Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a new Gym or Fitness Studio. This report will target individuals interested in opening a Gym or Fitness Studio in Toronto, Canada.

We will leverage data science to identify neighborhoods that are not already crowded with Gyms/Fitness Studios. We will also look for areas with no Gyms or Fitness Studios at all. The final analysis will provide the top location candidates and a summary of the pros and cons of each locale.


## Data <a name="data"></a>

Given the problem described above, the factors that will influence our decision are:
* Number of existing Gyms or Fitness Studios located in a given neighborhood
* Popularity and relative "foot traffic" in each respective neighborhood

The following data sources will be needed to extract/generate the required information:
* Wikipedia page listing all Toronto-area neighborhoods by postal code
* Toronto geospatial CSV file to link postal codes to their respective latitude and longitude coordinates
* Number of Gyms/Fitness Studios, along with their type and location in every neighborhood, obtained using the Foursquare API


## DataFrame #1: Toronto Neighborhood Wikipedia Page

In [3]:
import pandas as pd

#get wiki page into a DataFrame
link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df = pd.read_html(link)[0]

In [4]:
#remove "Not assigned" rows from [Borough] and reset the DataFrame index
#(chained so we work on a new DataFrame rather than modifying a slice of df in place)
toronto_neighborhoods = df[df["Borough"] != 'Not assigned'].reset_index(drop=True)
toronto_neighborhoods.head(12)

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [5]:
# Shape of DataFrame
toronto_neighborhoods.shape

(103, 3)

## DataFrame #2: Add Neighborhood Coordinates

In [6]:
import pandas as pd

# Read in Geospatial csv
link='http://cocl.us/Geospatial_data'
coord=pd.read_csv(link)
coord.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [7]:
#new DataFrame with coordinates added to Postal Codes
toronto_coord=toronto_neighborhoods.join(coord.set_index('Postal Code'), on='Postal Code')
toronto_coord.head(12)

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
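Because the join is keyed on `Postal Code`, a postal code that is absent from the coordinates file would silently end up with `NaN` coordinates. The following is a minimal sketch of one way to check for that, using two small made-up frames rather than the real Toronto tables; the frame names and the `M9Z` row are illustrative assumptions.

```python
import pandas as pd

# Illustrative stand-ins for toronto_neighborhoods and coord (made-up rows)
neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Hypothetical Area"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left merge keeps every neighborhood; indicator=True records where each row matched
merged = neighborhoods.merge(coords, on="Postal Code", how="left", indicator=True)

# Rows flagged "left_only" have no coordinates and would need attention
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` is empty for the real data, every neighborhood received coordinates and the join can be trusted as is.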


## Methodology <a name="methodology"></a>

We begin by identifying areas in Toronto that have a low density of Gyms / Fitness Centers.

First, we collect the required data: the location and category of venues, including every Gym / Fitness Center, around Toronto via the Foursquare API.

Second, we explore the geographic density of Gyms / Fitness Centers across Toronto neighborhoods, using heatmaps to identify the neighborhoods with a low number of fitness facilities and focusing our attention on them.

Finally, we focus on the most promising locations and create clusters of locations that meet basic requirements established in discussion with stakeholders: locations with no more than one fitness facility within a radius of 500 meters, and locations without any fitness facility within a radius of 800 meters. We present a map of all such locations and also use k-means clustering to identify neighborhoods that should serve as a starting point for final exploration and the search for an optimal venue location by interested parties.
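The radius-based requirements above come down to counting facilities within a given distance of a candidate point. The sketch below shows one possible way to do that with the haversine formula; the helper names (`haversine_m`, `count_within`, `meets_criterion`) and the sample coordinates are hypothetical and are not part of the notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def count_within(candidate, facilities, radius_m):
    """Count facilities whose distance to the candidate is at most radius_m."""
    lat, lon = candidate
    return sum(haversine_m(lat, lon, f_lat, f_lon) <= radius_m
               for f_lat, f_lon in facilities)

def meets_criterion(candidate, facilities, radius_m, max_facilities):
    """True when at most max_facilities lie within radius_m of the candidate."""
    return count_within(candidate, facilities, radius_m) <= max_facilities

# Made-up points: one facility roughly 300 m north, another roughly 2 km north
candidate = (43.6500, -79.3800)
facilities = [(43.6527, -79.3800), (43.6680, -79.3800)]

print(meets_criterion(candidate, facilities, radius_m=500, max_facilities=1))
print(meets_criterion(candidate, facilities, radius_m=800, max_facilities=0))
```

Keeping the radius and the allowed count as parameters lets the same helper express both the 500-meter and the 800-meter requirement.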

In [8]:
import json
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
import numpy as np

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

In [9]:
#Use all Toronto boroughs (no borough filter is applied here)
toronto_only = toronto_coord

### Function to get Venues near neighborhoods

In [10]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
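The function above indexes straight into the response (`['response']['groups'][0]['items']` and `['categories'][0]['name']`), so a venue without a category or a response without groups would raise an exception partway through the loop. As a hedged sketch, a more defensive parser could look like the following; `extract_venues` is a hypothetical helper, and the payload is made up to mirror only the nesting that `getNearbyVenues` relies on.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Assumes the nesting used by getNearbyVenues:
    response -> groups[0] -> items -> venue.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category instead of raising IndexError
        category = categories[0].get("name") if categories else None
        rows.append((venue.get("name"), location.get("lat"), location.get("lng"), category))
    return rows

# Made-up payload for illustration only
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Gym",
                           "location": {"lat": 43.65, "lng": -79.38},
                           "categories": [{"name": "Gym"}]}},
                {"venue": {"name": "Uncategorized Place",
                           "location": {"lat": 43.66, "lng": -79.39},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venues(sample)
empty = extract_venues({"response": {}})
print(rows)
print(empty)
```

Separating parsing from the HTTP request also makes the parsing logic testable without API credentials.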

## Foursquare API calls

### Get top 100 venues per Neighborhood near Toronto

In [11]:
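import os

#Foursquare API Credentials
#Read from environment variables instead of hard-coding secrets in the notebook
#(the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are an assumption)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret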
VERSION = '20200609' # Foursquare API version
LIMIT=100

# Toronto lat/long
latitude=43.651070
longitude=-79.347015

#create new DataFrame with Venues from Foursquare API
toronto_venues = getNearbyVenues(names=toronto_only['Neighborhood'],
                                   latitudes=toronto_only['Latitude'],
                                   longitudes=toronto_only['Longitude']
                                  )


Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [13]:
#quick glance at the 30 most common venue categories
print(toronto_venues.shape)
toronto_gyms=toronto_venues # note: still contains all venue categories, not only gyms
venue_counts=toronto_gyms['Venue Category'].value_counts()
venue_counts.head(30)

(2128, 7)


Coffee Shop                      179
Café                             100
Restaurant                        64
Park                              56
Pizza Place                       50
Italian Restaurant                44
Japanese Restaurant               43
Hotel                             43
Bakery                            43
Sandwich Place                    39
Clothing Store                    32
Gym                               32
Grocery Store                     29
Sushi Restaurant                  28
Bar                               28
Fast Food Restaurant              27
Pub                               26
American Restaurant               26
Bank                              25
Pharmacy                          24
Breakfast Spot                    22
Thai Restaurant                   21
Seafood Restaurant                21
Ice Cream Shop                    19
Gastropub                         18
Beer Bar                          18
Bookstore                         17
D
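The counts above cover every venue category, while the business question concerns fitness facilities only. One possible way to isolate them is sketched below on made-up rows that share the `Neighborhood` and `Venue Category` columns of `toronto_venues`; the keyword list is an assumption and would need to be adjusted to the categories Foursquare actually returns.

```python
import pandas as pd

# Made-up venue rows in the same column layout as toronto_venues
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C"],
    "Venue Category": ["Gym", "Coffee Shop", "Gym / Fitness Center", "Yoga Studio", "Park"],
})

# Assumption: these keywords identify fitness facilities
fitness_keywords = ["Gym", "Fitness", "Yoga", "Pilates"]
pattern = "|".join(fitness_keywords)

# Keep only rows whose category matches any keyword (case-insensitive)
is_fitness = venues["Venue Category"].str.contains(pattern, case=False, regex=True)
fitness_only = venues[is_fitness]

# Count fitness venues per neighborhood, keeping neighborhoods with zero
all_neighborhoods = venues["Neighborhood"].unique()
fitness_counts = (
    fitness_only.groupby("Neighborhood").size()
    .reindex(all_neighborhoods, fill_value=0)
)
print(fitness_counts.sort_values())
```

Reindexing against the full neighborhood list matters here, because neighborhoods with zero fitness venues are exactly the ones of interest.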

## Analysis <a name="analysis"></a>

## Preparation - One-Hot Encoding and Venue Grouping

In [24]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_gyms[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_gyms['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#Group Venues by Neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
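To make the effect of the encoding and grouping step concrete, here is a small sketch on made-up data: after one-hot encoding, the per-neighborhood mean of each indicator column is the share of that neighborhood's venues falling in each category. The frame below is illustrative only.

```python
import pandas as pd

# Made-up venues: neighborhood A has 4 venues, neighborhood B has 2
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Gym", "Coffee Shop", "Coffee Shop", "Park", "Gym", "Gym"],
})

# One indicator column per category (float so that means are straightforward)
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of each indicator per neighborhood = relative frequency of that category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

In this toy example half of neighborhood A's venues are coffee shops, while all of neighborhood B's venues are gyms, which is the kind of profile k-means later clusters on.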

Sort the top 10 venues for each neighborhood:

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [26]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions beyond 'rd' fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_gyms_sorted = pd.DataFrame(columns=columns)
neighborhoods_gyms_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_gyms_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_gyms_sorted.head()

|   | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agincourt | Lounge | Clothing Store | Breakfast Spot | Latin American Restaurant | Skating Rink | Women's Store | Doner Restaurant | Diner | Discount Store | Distribution Center |
| 1 | Alderwood, Long Branch | Pizza Place | Pharmacy | Skating Rink | Coffee Shop | Pub | Sandwich Place | Gym | Airport Lounge | Deli / Bodega | Ethiopian Restaurant |
| 2 | Bathurst Manor, Wilson Heights, Downsview North | Bank | Coffee Shop | Sandwich Place | Diner | Mobile Phone Shop | Ice Cream Shop | Middle Eastern Restaurant | Restaurant | Deli / Bodega | Fried Chicken Joint |
| 3 | Bayview Village | Chinese Restaurant | Café | Bank | Japanese Restaurant | Women's Store | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Dog Run |
| 4 | Bedford Park, Lawrence Manor East | Restaurant | Italian Restaurant | Coffee Shop | Sandwich Place | Pub | Butcher | Sushi Restaurant | Café | Cupcake Shop | Juice Bar |


## K-means Clustering

## Cluster Neighborhoods

In [27]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 1, 1, 1, 1, 1, 3, 1, 3, 1], dtype=int32)
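The number of clusters is fixed at 5 here without a stated justification. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic data (three well-separated blobs), not on `toronto_grouped_clustering`, and is offered only as an assumption about how the check could be done.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: three well-separated blobs in 2-D
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Fit k-means for a range of k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best-separated clustering by this measure
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On the real venue-frequency matrix the scores may be much closer together, in which case the choice of k remains a judgment call.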

In [28]:
# add clustering labels
neighborhoods_gyms_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_only

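# join the sorted venues and cluster labels onto toronto_only, which carries latitude/longitude for each neighborhood
# (neighborhoods with no returned venues get NaN, which is why Cluster Labels appears as a float)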
toronto_merged = toronto_merged.join(neighborhoods_gyms_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 0.0 | Park | Food & Drink Shop | Women's Store | Dog Run | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Doner Restaurant |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 1.0 | Hockey Arena | Portuguese Restaurant | French Restaurant | Intersection | Coffee Shop | Dumpling Restaurant | Drugstore | Donut Shop | Doner Restaurant | Deli / Bodega |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | 1.0 | Coffee Shop | Bakery | Pub | Park | Breakfast Spot | Café | Theater | Cosmetics Shop | Shoe Store | Restaurant |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 | 1.0 | Furniture / Home Store | Clothing Store | Accessories Store | Boutique | Vietnamese Restaurant | Coffee Shop | Event Space | Miscellaneous Shop | Women's Store | Dog Run |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 1.0 | Coffee Shop | Sushi Restaurant | Yoga Studio | Arts & Crafts Store | Café | Diner | Bar | Bank | Italian Restaurant | Beer Bar |


Let's check out the clusters on a map...

In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

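# add markers to the map, colored by cluster
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    # skip neighborhoods that returned no venues and therefore have no cluster label
    if pd.isna(cluster):
        continue
    cluster = int(cluster)  # labels become floats after the join; cast back for indexing
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)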
       
map_clusters

## Results and Discussion <a name="results"></a>

The analysis above shows that there are numerous "opportunity pockets" with a low density of fitness facilities around Toronto. Surprisingly, a number of areas in downtown Toronto fit our acceptance criteria.

In total, 57 zones surrounding Toronto fit our criteria for a new fitness facility location, based on the number of and distance to existing fitness facilities. This analysis looked only at areas close to Toronto with low fitness facility density. Low density in any of these pockets may have a sound explanation, and that explanation could make the pocket unsuitable for an additional fitness facility regardless of how few competitors are nearby. The recommended "opportunity pockets" should therefore be treated only as a starting point for more detailed analysis.

## Conclusion <a name="conclusion"></a>

The objective of this project was to identify Toronto neighborhoods with a low number of fitness facilities (Gyms and Fitness Centers) in order to help stakeholders narrow down the search for an optimal location. By calculating the density distribution of fitness facilities from Foursquare data, we identified several neighborhoods that justify further investigation and potential real estate investment. K-means clustering of those neighborhoods was then performed to create major areas of interest for stakeholders to investigate further.