# Capstone Final Project - The Battle of Neighborhoods

Sayaka Minegishi

## Introduction/ Business Problem

A majority of Americans are not getting the recommended amount of exercise. To encourage more individuals to exercise, recreational facilities need to be distributed evenly across neighborhoods so that everyone has equal access to them. This report aims to find the best neighborhood in Boston in which to open a new gym by taking into consideration the number of gyms/fitness centers and parks available, as well as the number of restaurants. The optimal location for a new gym would be a neighborhood with a relatively large number of restaurants compared to gyms or fitness centers and, to a lesser extent, parks. This report is specifically targeted at stakeholders who are considering opening a new gym or fitness center in Boston, Massachusetts.

#### Background:

With more jobs requiring people to sit in their office and work using computers all day, an increasing number of individuals are experiencing a sedentary lifestyle. 

According to the 2018 Physical Activity Guidelines for Americans, 2nd edition, released by the U.S. Department of Health and Human Services, adults are recommended to do at least 150 minutes a week of moderate-intensity aerobic physical activity, and muscle-strengthening activities of moderate or greater intensity on at least 2 days per week. 

However, the CDC reports that merely 23.2% of U.S. adults aged 18 and older met the Physical Activity Guidelines for both aerobic and muscle-strengthening activity, based on data from the 2018 National Health Interview Survey. 
To help maintain the health of Americans, there is a need to make recreational facilities equally accessible to all.


## Data

This analysis will employ data about the neighborhoods of Boston, MA, which is available as a CSV file at https://data.boston.gov/dataset/boston-neighborhoods. This file contains data on all the neighborhoods in Boston, including each neighborhood's name and area in square miles. The neighborhood names form the foundational column for our dataframes. 

I will utilize the Nominatim library from geopy.geocoders to find the geographical coordinates of each of the neighborhoods. 

The geographical coordinates of each neighborhood will be passed to the Foursquare API to search for restaurants, gyms and parks in each of the neighborhoods. We would like to place a new gym where there is a relatively large number of restaurants, on the assumption that there would be a greater need or tendency to exercise there. At the same time, we would like to place the gym where there are relatively few fitness centers or parks.




## Program

#### Install required packages and libraries

In [59]:
!pip install folium



In [60]:
#Import the libraries required
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json 

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests 

from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib 
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

#plotly
import plotly.graph_objects as go
import plotly.express as px

# import k-means
from sklearn.cluster import KMeans

import folium

print('Libraries imported.')

Libraries imported.


#### Find Neighborhoods to Compare 

We will first load the Boston Neighborhoods data and clean it.

In [61]:
#load Boston Neighborhoods data
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_3058bde3eca8429bb88d442216d35bc2 = 'https://s3-api.us-geo.objectstorage.softlayer.net'
else:
    endpoint_3058bde3eca8429bb88d442216d35bc2 = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'

client_3058bde3eca8429bb88d442216d35bc2 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=os.environ['IBM_API_KEY_ID'], # credential kept out of the notebook; the environment variable name is a placeholder
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url=endpoint_3058bde3eca8429bb88d442216d35bc2)

body = client_3058bde3eca8429bb88d442216d35bc2.get_object(Bucket='applieddatasciencecapstonecourser-donotdelete-pr-aabi7egcmpr7ui',Key='Boston_Neighborhoods.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

boston_data = pd.read_csv(body) #save the neighborhood data in a dataframe


In [62]:
#drop unnecessary columns
boston_data = boston_data.drop(['OBJECTID', 'Acres', 'Neighborhood_ID', 'ShapeSTArea', 'ShapeSTLength'], axis = 1)

In [63]:
#create a new dataframe with longitude and latitude values for each neighborhood
column_names = ['Neighborhood', 'Address', 'SqMiles', 'Latitude', 'Longitude'] #define the column names
boston_neighborhoods = pd.DataFrame(columns = column_names) #instantiate the dataframe

#add neighborhood column to the new dataframe
boston_neighborhoods['Neighborhood'] = boston_data['Name']

#add sqmiles column to the new dataframe
boston_neighborhoods['SqMiles'] = boston_data['SqMiles']

In [64]:
#form the address column
numrows = boston_neighborhoods.shape[0]

#build the address of each neighborhood in a single vectorized assignment
boston_neighborhoods['Address'] = boston_neighborhoods['Neighborhood'] + ', Massachusetts'



In [65]:
boston_neighborhoods #display dataframe

Unnamed: 0,Neighborhood,Address,SqMiles,Latitude,Longitude
0,Roslindale,"Roslindale, Massachusetts",2.51,,
1,Jamaica Plain,"Jamaica Plain, Massachusetts",3.94,,
2,Mission Hill,"Mission Hill, Massachusetts",0.55,,
3,Longwood,"Longwood, Massachusetts",0.29,,
4,Bay Village,"Bay Village, Massachusetts",0.04,,
5,Leather District,"Leather District, Massachusetts",0.02,,
6,Chinatown,"Chinatown, Massachusetts",0.12,,
7,North End,"North End, Massachusetts",0.2,,
8,Roxbury,"Roxbury, Massachusetts",3.29,,
9,South End,"South End, Massachusetts",0.74,,


We will now fill our boston_neighborhoods dataframe with the latitude and longitude values for each neighborhood.

In [66]:

geolocator = Nominatim(user_agent = "boston_explorer") #create the geocoder once and reuse it

for j in range(numrows):
    #find the geographical coordinates of the particular neighborhood
    address = boston_neighborhoods.at[j, 'Address']
    location = geolocator.geocode(address)
    if location is None: #leave NaN in place if the address cannot be geocoded
        continue

    boston_neighborhoods.at[j, 'Latitude'] = location.latitude
    boston_neighborhoods.at[j, 'Longitude'] = location.longitude


boston_neighborhoods.head()



Unnamed: 0,Neighborhood,Address,SqMiles,Latitude,Longitude
0,Roslindale,"Roslindale, Massachusetts",2.51,42.2912,-71.1245
1,Jamaica Plain,"Jamaica Plain, Massachusetts",3.94,42.3098,-71.1203
2,Mission Hill,"Mission Hill, Massachusetts",0.55,42.3326,-71.1036
3,Longwood,"Longwood, Massachusetts",0.29,42.3415,-71.1102
4,Bay Village,"Bay Village, Massachusetts",0.04,42.35,-71.0669
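
The lookup above assumes that every address resolves to a location. A minimal sketch of a more defensive version follows. The helper `geocode_addresses`, the `Location` tuple and the `fake_geocode` callable are hypothetical and are not part of the original notebook; in practice the callable would be the `geocode` method of a `Nominatim` instance, possibly wrapped in geopy's `RateLimiter` to space out requests.

```python
from collections import namedtuple

# Stand-in for the object returned by a geopy geocoder (hypothetical, for illustration only).
Location = namedtuple("Location", ["latitude", "longitude"])

def geocode_addresses(addresses, geocode):
    """Return a list of (lat, lng) tuples; (None, None) marks a failed lookup."""
    coords = []
    for address in addresses:
        location = geocode(address)
        if location is None:
            coords.append((None, None))
        else:
            coords.append((location.latitude, location.longitude))
    return coords

def fake_geocode(address):
    # Hypothetical lookup table standing in for a real geocoding service.
    known = {"Roslindale, Massachusetts": Location(42.2912, -71.1245)}
    return known.get(address)

coords = geocode_addresses(
    ["Roslindale, Massachusetts", "Nowhere, Massachusetts"], fake_geocode
)
print(coords)
```

Passing the geocoder in as an argument keeps the loop testable without network access.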


We will further clean our dataframe by dropping rows with NaN values.

In [67]:
boston_neighborhoods = boston_neighborhoods.dropna(axis = 'rows') #drop rows with NaN values and keep the result
boston_neighborhoods

Unnamed: 0,Neighborhood,Address,SqMiles,Latitude,Longitude
0,Roslindale,"Roslindale, Massachusetts",2.51,42.2912,-71.1245
1,Jamaica Plain,"Jamaica Plain, Massachusetts",3.94,42.3098,-71.1203
2,Mission Hill,"Mission Hill, Massachusetts",0.55,42.3326,-71.1036
3,Longwood,"Longwood, Massachusetts",0.29,42.3415,-71.1102
4,Bay Village,"Bay Village, Massachusetts",0.04,42.35,-71.0669
5,Leather District,"Leather District, Massachusetts",0.02,42.3523,-71.0573
6,Chinatown,"Chinatown, Massachusetts",0.12,42.3522,-71.0626
7,North End,"North End, Massachusetts",0.2,42.3651,-71.0545
8,Roxbury,"Roxbury, Massachusetts",3.29,42.3248,-71.095
9,South End,"South End, Massachusetts",0.74,42.3413,-71.0772
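
Note that `dropna` returns a new dataframe rather than modifying the original, so the result has to be assigned back (or `inplace = True` passed) for the rows to stay dropped. A small sketch on a toy dataframe, whose values are made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [42.29, np.nan, 42.33],
    "Longitude": [-71.12, -71.10, np.nan],
})

toy.dropna(axis="rows")          # result discarded: toy still has 3 rows
rows_before = len(toy)

toy = toy.dropna(axis="rows")    # result kept: only complete rows remain
rows_after = len(toy)

print(rows_before, rows_after)
```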


In [68]:
boston_neighborhoods.head()

Unnamed: 0,Neighborhood,Address,SqMiles,Latitude,Longitude
0,Roslindale,"Roslindale, Massachusetts",2.51,42.2912,-71.1245
1,Jamaica Plain,"Jamaica Plain, Massachusetts",3.94,42.3098,-71.1203
2,Mission Hill,"Mission Hill, Massachusetts",0.55,42.3326,-71.1036
3,Longwood,"Longwood, Massachusetts",0.29,42.3415,-71.1102
4,Bay Village,"Bay Village, Massachusetts",0.04,42.35,-71.0669


We will now visualize the neighborhoods in Boston.

In [69]:
#find the geographical coordinates of Boston

address = 'Boston, MA'
geolocator = Nominatim(user_agent = "boston_explorer")
location = geolocator.geocode(address)
latboston = location.latitude
lngboston = location.longitude


map_boston = folium.Map(location = [latboston, lngboston], zoom_start = 11) #create a map

#add markers 
for lat, lng, label in zip(boston_neighborhoods['Latitude'], boston_neighborhoods['Longitude'], boston_neighborhoods['Address']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_boston)
    
map_boston
        

#### Obtain Data from Foursquare API

We will now use the Foursquare API to obtain the number of gyms, parks and restaurants in each neighborhood.

In [70]:
#my Foursquare credentials, read from environment variables so that they are not stored in the notebook
#(the environment variable names below are placeholders)
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
ACCESS_TOKEN = os.environ.get('FOURSQUARE_ACCESS_TOKEN')


We will make a function to explore the venues around each Boston neighborhood using Foursquare.

In [71]:
#make a function to explore venues in each Boston neighborhood using Foursquare

def getNearbyVenues(names, latitudes, longitudes, radius = 500):
    venues_list = []
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)


In [72]:
#call the function and display venues
venuesaroundboston = getNearbyVenues(names = boston_neighborhoods['Neighborhood'], latitudes = boston_neighborhoods['Latitude'], longitudes = boston_neighborhoods['Longitude'])
                                 
venuesaroundboston.head() #view the first 5 rows                        

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Roslindale,42.291209,-71.124497,Guira Y Tambora,42.291845,-71.122254,Cuban Restaurant
1,Roslindale,42.291209,-71.124497,Peters Hill,42.293617,-71.128063,Scenic Lookout
2,Roslindale,42.291209,-71.124497,Roslindale House Of Pizza,42.287989,-71.126549,Pizza Place
3,Roslindale,42.291209,-71.124497,Target,42.288204,-71.126659,Big Box Store
4,Roslindale,42.291209,-71.124497,BCYF- Flaherty Pool,42.288133,-71.122913,Pool


Filter the results to find all the gyms in Boston neighborhoods, and store them in a dataframe.

In [73]:
gymsinboston = venuesaroundboston[venuesaroundboston['Venue Category'].str.contains('Gym')].copy() #dataframe containing all gyms in Boston (copied so the label can be changed safely)
gymsinboston['Venue Category'] = 'Gym' #change venue category label to 'Gym'
gymsinboston = gymsinboston.fillna(0)
gymsinboston.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
115,Bay Village,42.350011,-71.066948,Equinox Sports Club Boston,42.353189,-71.063053,Gym
142,Leather District,42.352322,-71.057343,Barry's Bootcamp,42.35401,-71.059776,Gym
174,Leather District,42.352322,-71.057343,Equinox Sports Club Boston,42.353189,-71.063053,Gym
178,Leather District,42.352322,-71.057343,Stay Fit at Hyatt,42.353963,-71.060688,Gym
187,Leather District,42.352322,-71.057343,Equinox Franklin Street,42.356074,-71.054484,Gym


We also find all the parks in Boston's neighborhoods, though we prioritize the presence of gyms in making our decisions about where to construct a new gym:

In [74]:
parksinboston = venuesaroundboston[venuesaroundboston['Venue Category'].str.contains('Park')].copy() #dataframe containing all parks in Boston (copied so the label can be changed safely)
parksinboston['Venue Category'] = 'Park' #change venue category label to 'Park'
parksinboston = parksinboston.fillna(0)
parksinboston



Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
27,Jamaica Plain,42.30982,-71.12033,Linden Path,42.305793,-71.122191,Park
28,Mission Hill,42.33256,-71.103608,Kevin W Fitzgerald Park,42.332031,-71.102734,Park
47,Longwood,42.341533,-71.110155,Riverway,42.340129,-71.109825,Park
74,Longwood,42.341533,-71.110155,Longwood Mall,42.343092,-71.111595,Park
87,Bay Village,42.350011,-71.066948,Elliot Norton Park,42.349124,-71.065949,Park
111,Bay Village,42.350011,-71.066948,Commonwealth Park,42.352752,-71.070626,Park
182,Leather District,42.352322,-71.057343,Post Office Square,42.35634,-71.055686,Park
277,Chinatown,42.352217,-71.062607,Boston Common,42.355487,-71.064882,Park
328,Chinatown,42.352217,-71.062607,Elliot Norton Park,42.349124,-71.065949,Park
364,North End,42.365097,-71.054495,Paul Revere Mall,42.365863,-71.053787,Park


Filter results to find all the restaurants and pizza places in Boston neighborhoods, and store in a dataframe.

In [75]:
restaurants_boston = venuesaroundboston[venuesaroundboston['Venue Category'].str.contains('Restaurant')] #dataframe to store all restaurants

pizzaplaces_boston = venuesaroundboston[venuesaroundboston['Venue Category']== 'Pizza Place'] #dataframe to store all pizza places

boston_gourmet = pd.concat([restaurants_boston, pizzaplaces_boston]) #dataframe containing all the restaurants and pizza places
                                    
boston_gourmet['Venue Category'] = 'Restaurant' #change the venue category to a broader 'Restaurant' label
boston_gourmet.fillna(0)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Roslindale,42.291209,-71.124497,Guira Y Tambora,42.291845,-71.122254,Restaurant
9,Jamaica Plain,42.30982,-71.12033,Vee Vee,42.31021,-71.115143,Restaurant
23,Jamaica Plain,42.30982,-71.12033,JP Seafood Cafe,42.310894,-71.114657,Restaurant
32,Mission Hill,42.33256,-71.103608,Lilly's Gourmet Pasta Express,42.332445,-71.100046,Restaurant
35,Mission Hill,42.33256,-71.103608,Laughing Monk Cafe,42.334077,-71.105563,Restaurant
36,Mission Hill,42.33256,-71.103608,Milkweed,42.332168,-71.099424,Restaurant
37,Mission Hill,42.33256,-71.103608,Mission Bar & Grill,42.333925,-71.105127,Restaurant
41,Mission Hill,42.33256,-71.103608,Mama's Place,42.333391,-71.106357,Restaurant
43,Mission Hill,42.33256,-71.103608,Mission Sushi & Wok,42.333834,-71.103637,Restaurant
44,Mission Hill,42.33256,-71.103608,Flames Restaurant II,42.333661,-71.105541,Restaurant


In [76]:
#make one big dataframe containing restaurants, gyms and parks:
boston_gymandfood = pd.concat([boston_gourmet, gymsinboston, parksinboston])
boston_gymandfood

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Roslindale,42.291209,-71.124497,Guira Y Tambora,42.291845,-71.122254,Restaurant
9,Jamaica Plain,42.30982,-71.12033,Vee Vee,42.31021,-71.115143,Restaurant
23,Jamaica Plain,42.30982,-71.12033,JP Seafood Cafe,42.310894,-71.114657,Restaurant
32,Mission Hill,42.33256,-71.103608,Lilly's Gourmet Pasta Express,42.332445,-71.100046,Restaurant
35,Mission Hill,42.33256,-71.103608,Laughing Monk Cafe,42.334077,-71.105563,Restaurant
36,Mission Hill,42.33256,-71.103608,Milkweed,42.332168,-71.099424,Restaurant
37,Mission Hill,42.33256,-71.103608,Mission Bar & Grill,42.333925,-71.105127,Restaurant
41,Mission Hill,42.33256,-71.103608,Mama's Place,42.333391,-71.106357,Restaurant
43,Mission Hill,42.33256,-71.103608,Mission Sushi & Wok,42.333834,-71.103637,Restaurant
44,Mission Hill,42.33256,-71.103608,Flames Restaurant II,42.333661,-71.105541,Restaurant


#### Analysis

We will now analyze each of the neighborhoods. We first apply one hot encoding on our dataframe, boston_gymandfood.

In [77]:
#use one hot encoding
boston_onehot = pd.get_dummies(boston_gymandfood[['Venue Category']], prefix = "", prefix_sep = "")

#put the neighborhood column back
boston_onehot['Neighborhood'] = boston_gymandfood['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [boston_onehot.columns[-1]] + list(boston_onehot.columns[:-1])
boston_onehot = boston_onehot[fixed_columns]

boston_onehot.head()

Unnamed: 0,Neighborhood,Gym,Park,Restaurant
0,Roslindale,0,0,1
9,Jamaica Plain,0,0,1
23,Jamaica Plain,0,0,1
32,Mission Hill,0,0,1
35,Mission Hill,0,0,1


Group the rows by neighborhood name and take the mean of each one-hot column, which gives the proportion of each venue category among the venues compared in that neighborhood.

In [78]:

#group rows by neighborhood and by taking the mean of each category of business
boston_grouped = boston_onehot.groupby('Neighborhood').mean().reset_index()
boston_grouped.head()

Unnamed: 0,Neighborhood,Gym,Park,Restaurant
0,Allston,0.027027,0.0,0.972973
1,Back Bay,0.111111,0.027778,0.861111
2,Bay Village,0.066667,0.133333,0.8
3,Beacon Hill,0.0,0.0625,0.9375
4,Brighton,0.083333,0.083333,0.833333
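
Since each one-hot row contains a single 1, the mean per neighborhood is the share of each category among that neighborhood's venues. A small sketch with made-up rows:

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Restaurant", "Restaurant", "Restaurant", "Gym",
                       "Park", "Restaurant"],
})

# One column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the 0/1 indicators = proportion of each category per neighborhood.
proportions = onehot.groupby("Neighborhood").mean().reset_index()
print(proportions)
```

In this toy data, neighborhood A has three restaurants and one gym, so its proportions are 0.75 and 0.25.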


In [79]:
boston_grouped.sort_values(by = 'Gym', ascending = True, inplace = True) #sort by gym proportion, lowest first, matching the order shown in the outputs below

In [80]:
def most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]



In [81]:
#order the venues and store in a dataframe
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
sorted_venues = pd.DataFrame(columns=columns)
sorted_venues['Neighborhood'] = boston_grouped['Neighborhood']

for ind in np.arange(boston_grouped.shape[0]):
    sorted_venues.iloc[ind, 1:] = most_common_venues(boston_grouped.iloc[ind, :], num_top_venues)

sorted_venues.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
9,East Boston,Restaurant,Park,Gym
16,Mission Hill,Restaurant,Park,Gym
15,Mattapan,Restaurant,Park,Gym
14,Longwood,Restaurant,Park,Gym
12,Jamaica Plain,Restaurant,Park,Gym


We will now conduct k-means clustering to cluster the Boston neighborhoods into 4 clusters.

In [82]:
#define the number of clusters
kclusters = 4
boston_data4clustering = boston_grouped.drop('Neighborhood', axis = 1)

#conduct k means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 4).fit(boston_data4clustering)


                    

In [83]:
sorted_venues.insert(0, 'Cluster Labels', kmeans.labels_) #add cluster labels 
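
The labels returned by `KMeans` are in the same row order as the data passed to `fit`, so attaching them by position is only safe when both dataframes share that order. A minimal sketch, using made-up proportions, that attaches the labels to the very dataframe the model was fitted on:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy proportions (made up): A and B are restaurant-heavy, C and D are gym-heavy.
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Gym": [0.00, 0.02, 0.50, 0.45],
    "Park": [0.05, 0.03, 0.25, 0.30],
    "Restaurant": [0.95, 0.95, 0.25, 0.25],
})

features = grouped.drop("Neighborhood", axis=1)
kmeans = KMeans(n_clusters=2, random_state=4, n_init=10).fit(features)

# labels_ follows the row order of `features`, which is also the order of `grouped`.
clustered = grouped.assign(**{"Cluster Labels": kmeans.labels_})
print(clustered[["Neighborhood", "Cluster Labels"]])
```

The numeric value of each label is arbitrary; only the grouping it induces is meaningful.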


We will now finalize our table to compare different neighborhoods.

In [84]:

boston_table_finalized = boston_neighborhoods.copy() #start from a copy of the neighborhood dataframe, which already holds the coordinates
boston_table_finalized= boston_table_finalized.join(sorted_venues.set_index('Neighborhood'), on='Neighborhood')
boston_table_finalized.head()

Unnamed: 0,Neighborhood,Address,SqMiles,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Roslindale,"Roslindale, Massachusetts",2.51,42.2912,-71.1245,3.0,Restaurant,Park,Gym
1,Jamaica Plain,"Jamaica Plain, Massachusetts",3.94,42.3098,-71.1203,0.0,Restaurant,Park,Gym
2,Mission Hill,"Mission Hill, Massachusetts",0.55,42.3326,-71.1036,3.0,Restaurant,Park,Gym
3,Longwood,"Longwood, Massachusetts",0.29,42.3415,-71.1102,0.0,Restaurant,Park,Gym
4,Bay Village,"Bay Village, Massachusetts",0.04,42.35,-71.0669,0.0,Restaurant,Park,Gym


In [85]:
boston_table_finalized.dropna(inplace = True) #drop neighborhoods whose data have missing values (NaN)

In [86]:
boston_table_finalized= boston_table_finalized.merge(boston_grouped, on = "Neighborhood")
boston_table_finalized

Unnamed: 0,Neighborhood,Address,SqMiles,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,Gym,Park,Restaurant
0,Roslindale,"Roslindale, Massachusetts",2.51,42.2912,-71.1245,3.0,Restaurant,Park,Gym,0.0,0.0,1.0
1,Jamaica Plain,"Jamaica Plain, Massachusetts",3.94,42.3098,-71.1203,0.0,Restaurant,Park,Gym,0.0,0.25,0.75
2,Mission Hill,"Mission Hill, Massachusetts",0.55,42.3326,-71.1036,3.0,Restaurant,Park,Gym,0.0,0.1,0.9
3,Longwood,"Longwood, Massachusetts",0.29,42.3415,-71.1102,0.0,Restaurant,Park,Gym,0.0,0.2,0.8
4,Bay Village,"Bay Village, Massachusetts",0.04,42.35,-71.0669,0.0,Restaurant,Park,Gym,0.066667,0.133333,0.8
5,Leather District,"Leather District, Massachusetts",0.02,42.3523,-71.0573,2.0,Restaurant,Gym,Park,0.139535,0.023256,0.837209
6,Chinatown,"Chinatown, Massachusetts",0.12,42.3522,-71.0626,2.0,Restaurant,Gym,Park,0.088889,0.044444,0.866667
7,North End,"North End, Massachusetts",0.2,42.3651,-71.0545,0.0,Restaurant,Park,Gym,0.022222,0.111111,0.866667
8,Roxbury,"Roxbury, Massachusetts",3.29,42.3248,-71.095,1.0,Gym,Restaurant,Park,0.5,0.25,0.25
9,South End,"South End, Massachusetts",0.74,42.3413,-71.0772,2.0,Restaurant,Park,Gym,0.105263,0.105263,0.789474


We will now visualize our clusters.

In [87]:
#visualize the clusters


# create map
map_clusters = folium.Map(location=[latboston, lngboston], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(boston_table_finalized['Latitude'], boston_table_finalized['Longitude'], boston_table_finalized['Neighborhood'], boston_table_finalized['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)], # cluster labels run from 0 to kclusters-1
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Examination of the Clusters

We will now examine each of the clusters:

CLUSTER 0:

In [88]:
Cluster0 = boston_table_finalized.loc[boston_table_finalized['Cluster Labels'] == 0, boston_table_finalized.columns[[1] + list(range(5, boston_table_finalized.shape[1]))]]
Cluster0

Unnamed: 0,Address,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,Gym,Park,Restaurant
1,"Jamaica Plain, Massachusetts",0.0,Restaurant,Park,Gym,0.0,0.25,0.75
3,"Longwood, Massachusetts",0.0,Restaurant,Park,Gym,0.0,0.2,0.8
4,"Bay Village, Massachusetts",0.0,Restaurant,Park,Gym,0.066667,0.133333,0.8
7,"North End, Massachusetts",0.0,Restaurant,Park,Gym,0.022222,0.111111,0.866667
12,"Charlestown, Massachusetts",0.0,Restaurant,Park,Gym,0.0,0.222222,0.777778
21,"South Boston Waterfront, Massachusetts",0.0,Restaurant,Park,Gym,0.0,0.153846,0.846154
22,"South Boston, Massachusetts",0.0,Restaurant,Park,Gym,0.0,0.153846,0.846154


CLUSTER 1:

In [89]:
Cluster1 = boston_table_finalized.loc[boston_table_finalized['Cluster Labels'] == 1, boston_table_finalized.columns[[1] + list(range(5, boston_table_finalized.shape[1]))]]
Cluster1

Unnamed: 0,Address,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,Gym,Park,Restaurant
8,"Roxbury, Massachusetts",1.0,Gym,Restaurant,Park,0.5,0.25,0.25


CLUSTER 2:

In [90]:
Cluster2= boston_table_finalized.loc[boston_table_finalized['Cluster Labels'] == 2, boston_table_finalized.columns[[1] + list(range(5, boston_table_finalized.shape[1]))]]
Cluster2

Unnamed: 0,Address,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,Gym,Park,Restaurant
5,"Leather District, Massachusetts",2.0,Restaurant,Gym,Park,0.139535,0.023256,0.837209
6,"Chinatown, Massachusetts",2.0,Restaurant,Gym,Park,0.088889,0.044444,0.866667
9,"South End, Massachusetts",2.0,Restaurant,Park,Gym,0.105263,0.105263,0.789474
10,"Back Bay, Massachusetts",2.0,Restaurant,Gym,Park,0.111111,0.027778,0.861111
13,"West End, Massachusetts",2.0,Restaurant,Gym,Park,0.2,0.04,0.76
15,"Downtown, Massachusetts",2.0,Restaurant,Park,Gym,0.166667,0.166667,0.666667
17,"Brighton, Massachusetts",2.0,Restaurant,Park,Gym,0.083333,0.083333,0.833333
18,"Hyde Park, Massachusetts",2.0,Restaurant,Gym,Park,0.166667,0.0,0.833333
20,"Dorchester, Massachusetts",2.0,Restaurant,Gym,Park,0.25,0.0,0.75


CLUSTER 3:

In [91]:
Cluster3 = boston_table_finalized.loc[boston_table_finalized['Cluster Labels'] == 3, boston_table_finalized.columns[[1] + list(range(5, boston_table_finalized.shape[1]))]]
Cluster3

Unnamed: 0,Address,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,Gym,Park,Restaurant
0,"Roslindale, Massachusetts",3.0,Restaurant,Park,Gym,0.0,0.0,1.0
2,"Mission Hill, Massachusetts",3.0,Restaurant,Park,Gym,0.0,0.1,0.9
11,"East Boston, Massachusetts",3.0,Restaurant,Park,Gym,0.0,0.071429,0.928571
14,"Beacon Hill, Massachusetts",3.0,Restaurant,Park,Gym,0.0,0.0625,0.9375
16,"Fenway, Massachusetts",3.0,Restaurant,Gym,Park,0.066667,0.0,0.933333
19,"Mattapan, Massachusetts",3.0,Restaurant,Park,Gym,0.0,0.0,1.0
23,"Allston, Massachusetts",3.0,Restaurant,Gym,Park,0.027027,0.0,0.972973


#### Further Visualization of Our Results

Here are the graphs comparing the proportion of each venue category in each neighborhood.

CLUSTER 0:

In [92]:
#Cluster 0
#create dataframe for graphing
cluster0_df = pd.DataFrame(columns = ['Neighborhood','Gym', 'Park', 'Restaurant'])
cluster0_df['Gym'] = Cluster0['Gym']
cluster0_df['Park'] = Cluster0['Park']
cluster0_df['Restaurant'] = Cluster0['Restaurant']
cluster0_df['Neighborhood'] = Cluster0['Address']


#make graph
fig0 = px.bar(cluster0_df, x = "Neighborhood", y = ['Gym', 'Park', 'Restaurant'], barmode= 'group', title = 'Proportion of Venue Categories in Cluster 0')
fig0.show()

CLUSTER 1:

In [93]:
#Cluster 1
#create dataframe for graphing
cluster1_df = pd.DataFrame(columns = ['Neighborhood','Gym', 'Park', 'Restaurant'])
cluster1_df['Gym'] = Cluster1['Gym']
cluster1_df['Park'] = Cluster1['Park']
cluster1_df['Restaurant'] = Cluster1['Restaurant']
cluster1_df['Neighborhood'] = Cluster1['Address']


#make graph
fig1 = px.bar(cluster1_df, x = "Neighborhood", y = ['Gym', 'Park', 'Restaurant'], barmode= 'group', title = 'Proportion of Venue Categories in Cluster 1')
fig1.show()

CLUSTER 2:

In [94]:
#Cluster 2
#create dataframe for graphing
cluster2_df = pd.DataFrame(columns = ['Neighborhood','Gym', 'Park', 'Restaurant'])
cluster2_df['Gym'] = Cluster2['Gym']
cluster2_df['Park'] = Cluster2['Park']
cluster2_df['Restaurant'] = Cluster2['Restaurant']
cluster2_df['Neighborhood'] = Cluster2['Address']


#make graph
fig2 = px.bar(cluster2_df, x = "Neighborhood", y = ['Gym', 'Park', 'Restaurant'], barmode= 'group', title = 'Proportion of Each Venue Category in Cluster 2')
fig2.show()

CLUSTER 3:

In [95]:
#Cluster 3
#create dataframe for graphing
cluster3_df = pd.DataFrame(columns = ['Neighborhood','Gym', 'Park', 'Restaurant'])
cluster3_df['Gym'] = Cluster3['Gym']
cluster3_df['Park'] = Cluster3['Park']
cluster3_df['Restaurant'] = Cluster3['Restaurant']
cluster3_df['Neighborhood'] = Cluster3['Address']


#make graph
fig3 = px.bar(cluster3_df, x = "Neighborhood", y = ['Gym', 'Park', 'Restaurant'], barmode= 'group', title = 'Proportion of Each Venue Category in Cluster 3')
fig3.show()
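
The four cells above repeat the same steps. One way to factor them out, as a sketch only, is a small helper; `cluster_frame` is a hypothetical name, and the toy table stands in for `boston_table_finalized`.

```python
import pandas as pd

def cluster_frame(table, label):
    """Return the plotting columns for one cluster, with Address renamed to Neighborhood."""
    subset = table.loc[table["Cluster Labels"] == label,
                       ["Address", "Gym", "Park", "Restaurant"]]
    return subset.rename(columns={"Address": "Neighborhood"})

# Toy stand-in for boston_table_finalized (values copied from two rows shown earlier).
table = pd.DataFrame({
    "Address": ["Roslindale, Massachusetts", "Roxbury, Massachusetts"],
    "Cluster Labels": [3.0, 1.0],
    "Gym": [0.0, 0.5],
    "Park": [0.0, 0.25],
    "Restaurant": [1.0, 0.25],
})

cluster3_df = cluster_frame(table, 3)
print(cluster3_df)
```

Each resulting frame could then be passed to `px.bar` in the same way as in the cells above.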

This concludes our analysis. Upon examination of our clusters, it becomes evident that Cluster 0 is the only cluster in which restaurants are consistently the most common venue and gyms the least common, without the mixed ordering seen in the other clusters (for example in Cluster 3, where Gym and Park alternate between the 2nd and 3rd most common venue). However, the neighborhoods in Cluster 3 consistently have the highest proportion of restaurants. Thus, neighborhoods in Cluster 3 should be prioritized when considering where to build a new fitness facility.

## Results and Discussion

In this analysis, we applied the k-means clustering technique to separate Boston neighborhoods into 4 distinct clusters, based on the similarities in the proportions of gyms, parks and restaurants present in each neighborhood. Examination of the clusters revealed that Cluster 0 was composed solely of neighborhoods whose most common venue was 'Restaurant', followed by 'Park' and 'Gym'. More than half of the neighborhoods in this cluster had very few or no gyms, while the proportions of restaurants were consistently very high, constituting at least 75% of the venues compared. Neighborhoods in Cluster 2 and Cluster 3 all shared 'Restaurant' as the most common venue type; however, unlike in Cluster 0, 'Park' and 'Gym' switched between being the 2nd and the 3rd most common venue. Upon examination of the proportions for each venue category, it becomes apparent that all of the neighborhoods in Cluster 2 contain at least one gym, while this is not so for Cluster 3. Unlike in Cluster 0, where every neighborhood had at least one park even when it had no gyms, there are two neighborhoods in Cluster 3 (namely Roslindale and Mattapan) where no gyms or parks are visible in the data, and restaurants comprise 100% of the venues compared. Cluster 1, which consists of only 1 neighborhood (Roxbury), differs from all other clusters in that gyms are the most common venue, followed by restaurants and parks. 


The results of this analysis guide us to prioritize the neighborhoods in Clusters 0 and 3 as potential locations for a new fitness facility such as a gym. In particular, neighborhoods in Cluster 3 have a higher proportion of restaurants relative to gyms and parks than the neighborhoods in Cluster 0. Restaurants comprise at least 90% of the venues compared in every Cluster 3 neighborhood, while the proportion of restaurants never exceeds 87% in Cluster 0 neighborhoods. Thus, stakeholders should especially consider neighborhoods in Cluster 3. Having made this recommendation, it must be noted that this analysis has not compared all possible factors that would make a neighborhood suitable for building a new gym. To further narrow down the best locations, a deeper analysis of the neighborhoods is recommended, including their demographics and property prices.
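
The comparison between the two clusters can be checked directly from the restaurant proportions listed in the cluster tables above. A short sketch, with the values copied by hand from those tables:

```python
# Restaurant proportions copied from the Cluster 0 and Cluster 3 tables.
cluster0_restaurant = [0.75, 0.8, 0.8, 0.866667, 0.777778, 0.846154, 0.846154]
cluster3_restaurant = [1.0, 0.9, 0.928571, 0.9375, 0.933333, 1.0, 0.972973]

highest_in_cluster0 = max(cluster0_restaurant)
lowest_in_cluster3 = min(cluster3_restaurant)

print(highest_in_cluster0, lowest_in_cluster3)
```

Every Cluster 3 neighborhood therefore has a higher restaurant proportion than any Cluster 0 neighborhood.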



## Conclusion

An analysis using k-means clustering was conducted to determine which Boston neighborhoods should be prioritized when stakeholders are considering the next location to build a gym. Upon examination of our clusters, it became evident that Cluster 3 consistently had the highest proportion of restaurants and a very small proportion of gyms and parks combined. Thus, neighborhoods in Cluster 3 should be prioritized when considering where to build a new fitness facility. These neighborhoods are: Roslindale, Mission Hill, East Boston, Beacon Hill, Fenway, Mattapan, and Allston. Roslindale and Mattapan should be especially considered, as restaurants made up 100% of the venues compared in these neighborhoods, with no gyms or parks visible in the data.

## Bibliography:

Alkson, Alex, and Lin, Polong. 2020. “Segmenting and Clustering Neighborhoods in New York City.” IBM Corporation. November 26, 2020. https://labs.cognitiveclass.ai/tools/jupyterlab/lab/tree/labs/DS0701EN/DS0701EN-3-3-2-Neighborhoods-New-York-py-v1.0.ipynb?lti=true.


“American Heart Association Recommendations for Physical Activity in Adults and Kids.” n.d. Accessed April 6, 2021. https://www.heart.org/en/healthy-living/fitness/fitness-basics/aha-recs-for-physical-activity-in-adults.


“Exercise or Physical Activity.” 2021. March 1, 2021. https://www.cdc.gov/nchs/fastats/exercise.htm.
