<h1>Capstone Project - The Battle of Neighborhoods (Week 2)</h1>

<h2>Introduction</h2>

What comes to mind when you think of working out? Probably a **gym**, right? In most areas, rural and urban, a gym is the primary place where people exercise and work on their fitness. For this project, I'm going to determine the best place to open a gym in the Greater Boston Area. There are a number of gyms in the city, ranging from Boston Sports Clubs in Back Bay to Planet Fitness in Downtown, but where is *really* the best place to be?

As the Greater Boston Area continues to expand and develop into its surrounding neighborhoods, such as Dorchester and Roxbury, this information will become more relevant, and the same approach can be adapted to target the development of movie theaters, housing complexes, and other venues instead of gyms. The target audience of this report is therefore developers and investors who are looking to profit from such development. According to Norada Real Estate Investments, average housing prices in Boston are increasing by about 5.7% per year. This might not seem like much until you realize that the most expensive neighborhood, Beacon Hill, has a median housing price of over $2 million. Having this kind of information for all neighborhoods can let investors know whether it's really worth placing a gym in an area where housing is so expensive.

<h2>Data</h2>

The primary dataset used in this project comes from the Foursquare API, specifically the venue data that lists the gyms in the Greater Boston Area. For the area surveyed, I will refer to the 22 neighborhoods listed in [this](https://en.wikipedia.org/wiki/Neighborhoods_in_Boston) Wikipedia article. I will use Python web scraping with Beautiful Soup to extract the neighborhood names, which I'll then pass to the Geocoder package and eventually to the Foursquare API to get venue information.

Once all of the data is sourced, we will be able to apply the techniques learned in this course, specifically k-means clustering, to answer the question at hand. The clustering is based on the venues (specifically the gyms) in the different neighborhoods.

<h2>Analysis</h2>

<h3>Install Necessary Dependencies</h3>

In [119]:
import numpy as np
import pandas as pd
import requests
#!conda install -c conda-forge bs4 --yes
import bs4
from urllib.request import urlopen
from bs4 import BeautifulSoup
#!conda install -c conda-forge folium=0.5.0 --yes 
import folium
#!conda install -c conda-forge geopy --yes
import geopy
from geopy.geocoders import Nominatim
#!conda install -c conda-forge geocoder --yes
import geocoder
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

<h3>Import Data</h3>

In this section we scrape the Wikipedia page to retrieve the names of the 22 neighborhoods in the Greater Boston Area. One issue that came up is that some neighborhood names on the page also list surrounding areas or sub-areas in parentheses. To avoid problems later on, everything except the neighborhood name itself was omitted.

In [30]:
neighborhoodData = requests.get("https://en.wikipedia.org/wiki/Neighborhoods_in_Boston").text
souped = BeautifulSoup(neighborhoodData, 'html.parser')

In [31]:
data = []

for row in souped.find_all("div", class_="div-col columns column-width")[0].find_all("li"):
    data.append(row.text)
    
df = pd.DataFrame({"Neighborhood": data})
df.info() # Should show that there are 22 entries (aka 22 neighborhoods)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Neighborhood  22 non-null     object
dtypes: object(1)
memory usage: 304.0+ bytes


In [34]:
print(df)

                                         Neighborhood
0                                             Allston
1                                            Back Bay
2                                         Bay Village
3                                         Beacon Hill
4                                            Brighton
5                                         Charlestown
6                          Chinatown/Leather District
7   Dorchester (divided for planning purposes into...
8                                            Downtown
9                                         East Boston
10                 Fenway Kenmore (includes Longwood)
11                                          Hyde Park
12                                      Jamaica Plain
13                                           Mattapan
14                                       Mission Hill
15                                          North End
16                                         Roslindale
17                          

In [42]:
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df['Neighborhood'] = df['Neighborhood'].replace({
    'Fenway Kenmore (includes Longwood)': 'Fenway Kenmore',
    'Dorchester (divided for planning purposes into Mid Dorchester and Dorchester)': 'Dorchester'})
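
The two hand-written replacements above work because only two names carry a parenthetical note. A more general option, sketched here under the assumption that any trailing text in parentheses is a note rather than part of the name, is to strip it with a regular expression. The helper name `strip_parenthetical` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def strip_parenthetical(names: pd.Series) -> pd.Series:
    """Remove a trailing '(...)' note from each name (hypothetical helper)."""
    return names.str.replace(r"\s*\(.*\)\s*$", "", regex=True).str.strip()

# Sample names copied from the scraped list above
sample = pd.Series([
    "Fenway Kenmore (includes Longwood)",
    "Dorchester (divided for planning purposes into Mid Dorchester and Dorchester)",
    "Allston",
])
cleaned = strip_parenthetical(sample)
print(cleaned.tolist())
```

Names without parentheses pass through unchanged, so the helper could be applied to the whole `Neighborhood` column at once.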

In [43]:
print(df)

                  Neighborhood
0                      Allston
1                     Back Bay
2                  Bay Village
3                  Beacon Hill
4                     Brighton
5                  Charlestown
6   Chinatown/Leather District
7                   Dorchester
8                     Downtown
9                  East Boston
10              Fenway Kenmore
11                   Hyde Park
12               Jamaica Plain
13                    Mattapan
14                Mission Hill
15                   North End
16                  Roslindale
17                     Roxbury
18                South Boston
19                   South End
20                    West End
21                West Roxbury


In [35]:
df.shape

(22, 1)

<h3>Latitude & Longitude</h3>

In this section we retrieve the coordinates of each neighborhood using the geocoder library. The method is similar to the one provided in the *Segmenting and Clustering Neighborhoods in Toronto* section, but is slightly modified to look up the neighborhood name rather than the postal code. Once the coordinates are retrieved, they are matched to their respective neighborhoods in the original dataframe initialized above.

In [50]:
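def getCoords(neighborhood, max_tries=5):
    # Retry a limited number of times so a failed lookup cannot loop forever
    for _ in range(max_tries):
        geo = geocoder.arcgis('{}, Boston, Massachusetts'.format(neighborhood))
        if geo.latlng is not None:
            return geo.latlng
    raise ValueError('Could not geocode {}'.format(neighborhood))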

In [52]:
coordinates = [ getCoords(neighborhood) for neighborhood in df["Neighborhood"].tolist() ]

In [58]:
dfCoordinates = pd.DataFrame(coordinates, columns=['Latitude', 'Longitude'])

df['Latitude'] = dfCoordinates['Latitude']
df['Longitude'] = dfCoordinates['Longitude']

print(df.head())

  Neighborhood   Latitude  Longitude
0      Allston  42.350531 -71.111091
1     Back Bay  42.349990 -71.087650
2  Bay Village  42.348165 -71.068470
3  Beacon Hill  42.358420 -71.068600
4     Brighton  42.352134 -71.124925
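
Since a geocoder can occasionally resolve a neighborhood name to the wrong place, a quick sanity check is to measure how far each returned point lies from the city center. The sketch below uses the haversine formula with a mean Earth radius of 3958.8 miles. The function name `haversine_miles` and the check itself are assumptions added for illustration, and the center coordinates are the ones used for the map in the next section.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (approximation)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

boston = (42.3601, -71.0589)

# A few rows of the coordinates table printed above
neighborhoods = {
    "Allston": (42.350531, -71.111091),
    "Back Bay": (42.349990, -71.087650),
    "Beacon Hill": (42.358420, -71.068600),
}

distances = {
    name: haversine_miles(*boston, lat, lon)
    for name, (lat, lon) in neighborhoods.items()
}
for name, miles in distances.items():
    print("{}: {:.1f} miles from the city center".format(name, miles))
```

A neighborhood whose distance looks implausible for its location would be worth looking up again by hand before its coordinates are sent to Foursquare.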


<h3>Initialize a Map</h3>

In this section we initialize a map centered on the general coordinates of the Greater Boston Area and add a marker for each individual neighborhood. The blue markers on the map mark the 22 neighborhoods retrieved above.

In [69]:
# Sourced from Google Search Response, Coordinates of Boston: 42.3601° N, 71.0589° W
latitude = 42.3601
longitude = -71.0589

In [121]:
map_bos = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_bos)  
    
map_bos

<h3>Foursquare API: Initialized & Implemented</h3>

This section sets up the API by defining the necessary request parameters, then calls the API to find the types and number of venues in each neighborhood. Because each neighborhood has *many* venues, a limit was placed on the number of venues pulled per neighborhood.

In [68]:
import os

# Read the Foursquare credentials from environment variables rather than hard-coding them.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are assumptions.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'

print('Credentials loaded.')

Credentials loaded.


In [73]:
print('The geographical coordinates of Boston, Massachusetts are {} N, {} W.'.format(latitude, abs(longitude)))

The geographical coordinates of Boston, Massachusetts are 42.3601 N, 71.0589 W.


In [82]:
radius = 1609.34 # 1 mile radius, in meters
LIMIT = 100 # maximum number of venues returned per neighborhood

venues_list = []

for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):

    # Let requests build and encode the query string
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': LIMIT}

    response = requests.get("https://api.foursquare.com/v2/venues/explore", params=params)
    response.raise_for_status() # stop early on an HTTP error
    results = response.json()["response"]["groups"][0]["items"]

    # Keep one row per venue: neighborhood info, venue name, venue location, first category
    for venue in results:
        venues_list.append((
            neighborhood,
            lat,
            lng,
            venue['venue']['name'],
            venue['venue']['location']['lat'],
            venue['venue']['location']['lng'],
            venue['venue']['categories'][0]['name']))
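
The nested indexing in the loop above assumes every item has at least one category. A slightly more defensive way to flatten the response is sketched below. The payload is a hand-made stand-in that mirrors only the fields the loop reads, and `flatten_items` is a hypothetical helper, not part of the Foursquare API.

```python
def flatten_items(payload, neighborhood, lat, lng):
    """Turn an explore-style payload into tuples like those stored in venues_list."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder instead of raising IndexError on an empty list
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-made sample; the second venue is invented to exercise the fallback
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Down Under Yoga",
               "location": {"lat": 42.345767, "lng": -71.109114},
               "categories": [{"name": "Yoga Studio"}]}},
    {"venue": {"name": "Example Venue",
               "location": {"lat": 42.35, "lng": -71.11},
               "categories": []}},
]}]}}

rows = flatten_items(sample_payload, "Allston", 42.350531, -71.111091)
for row in rows:
    print(row)
```

Keeping the parsing in a separate function also makes it possible to test it without calling the API.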

In [85]:
dfVenues = pd.DataFrame(venues_list)

dfVenues.columns = ['Neighborhood', 'Latitude', 'Longitude', 'Venue Name', 'Venue Lat', 'Venue Long', 'Venue Category']

print(dfVenues.head())
print(dfVenues.groupby(["Neighborhood"]).count())


  Neighborhood   Latitude  Longitude             Venue Name  Venue Lat  \
0      Allston  42.350531 -71.111091                   OTTO  42.350388   
1      Allston  42.350531 -71.111091  Boston House of Pizza  42.350281   
2      Allston  42.350531 -71.111091     South West Day Spa  42.345855   
3      Allston  42.350531 -71.111091  Gyu-Kaku Japanese BBQ  42.346322   
4      Allston  42.350531 -71.111091        Down Under Yoga  42.345767   

   Venue Long       Venue Category  
0  -71.115236          Pizza Place  
1  -71.113864          Pizza Place  
2  -71.108433                  Spa  
3  -71.106985  Japanese Restaurant  
4  -71.109114          Yoga Studio  
                            Latitude  Longitude  Venue Name  Venue Lat  \
Neighborhood                                                             
Allston                          100        100         100        100   
Back Bay                         100        100         100        100   
Bay Village                      100 

In [95]:
print('There are {} unique categories in the Greater Boston Area.'.format(len(dfVenues['Venue Category'].unique())))
print('There are: {}.'.format(dfVenues['Venue Category'].unique()))

There are 201 unique categories in the Greater Boston Area.
There are: ['Pizza Place' 'Spa' 'Japanese Restaurant' 'Yoga Studio' 'Park'
 'Chinese Restaurant' 'Gym / Fitness Center' 'Bakery' 'Udon Restaurant'
 'Shipping Store' 'Fried Chicken Joint' 'Trail' 'Israeli Restaurant'
 'Grocery Store' 'Coffee Shop' 'Food Court' 'Rock Club' 'Tapas Restaurant'
 'Sporting Goods Shop' 'Salad Place' 'Bubble Tea Shop'
 'Furniture / Home Store' 'Liquor Store' 'Thai Restaurant' 'Burrito Place'
 'BBQ Joint' 'Beer Garden' 'Baseball Stadium' 'Mexican Restaurant'
 'American Restaurant' 'Café' 'Lounge' 'Hotel' 'Bookstore'
 'Indie Movie Theater' "Doctor's Office" 'Burger Joint' 'Donut Shop'
 'Cycle Studio' 'Seafood Restaurant' 'Pub' 'Mediterranean Restaurant'
 'Tour Provider' 'Electronics Store' 'Ice Cream Shop' 'Sushi Restaurant'
 'Sandwich Place' 'Greek Restaurant' 'Noodle House' 'Music Venue'
 'Creperie' 'Big Box Store' 'Deli / Bodega'
 'Vegetarian / Vegan Restaurant' 'Falafel Restaurant' 'Wine Bar' 'Gym'

<h3>Inspect Neighborhoods</h3>

In this section we take the venue dataframe and one-hot encode the venue categories so that each neighborhood can be summarized by the types of venues found in it. As part of this process, I also looked at the most commonly found venues in each neighborhood and, based on my experience with the city, I was not surprised to find that some of the most common were Pizza Place, Italian Restaurant, and Coffee Shop. I then computed the fraction of each neighborhood's sampled venues that are categorized as Gym.

In [109]:
bos_onehot = pd.get_dummies(dfVenues[['Venue Category']], prefix="", prefix_sep="")

bos_onehot['Neighborhood'] = dfVenues['Neighborhood'] 

fixed_columns = [bos_onehot.columns[-1]] + list(bos_onehot.columns[:-1])
bos_onehot = bos_onehot[fixed_columns]
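
To see what the one-hot table is used for, the toy example below (made-up venues, not the Foursquare results) applies the same `get_dummies` step and then averages per neighborhood. Because each row is one venue, the mean of a 0/1 column is the fraction of that neighborhood's venues in the category. Passing `dtype=float` is a choice made here so that the averages are computed on numeric columns.

```python
import pandas as pd

# Made-up venues for two fictional neighborhoods
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Gym", "Pizza Place", "Pizza Place", "Park", "Gym", "Gym"],
})

# One column per category, one row per venue
toy_onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
toy_onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# Mean of each 0/1 column = share of the neighborhood's venues in that category
toy_grouped = toy_onehot.groupby("Neighborhood").mean()
print(toy_grouped)
```

In this toy table, neighborhood A has a gym share of 0.25 (one of four venues) and B has a share of 1.0 (two of two).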

In [102]:
num_top_venues = 5
bos_grouped = bos_onehot.groupby('Neighborhood').mean().reset_index()

for hood in bos_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = bos_grouped[bos_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Allston----
                 venue  freq
0          Coffee Shop  0.04
1          Pizza Place  0.04
2                 Park  0.04
3  American Restaurant  0.04
4               Bakery  0.04


----Back Bay----
                venue  freq
0      Clothing Store  0.05
1  Seafood Restaurant  0.05
2               Hotel  0.05
3      Ice Cream Shop  0.05
4         Coffee Shop  0.04


----Bay Village----
                venue  freq
0  Italian Restaurant  0.06
1                Park  0.04
2    Asian Restaurant  0.04
3         Coffee Shop  0.04
4                 Spa  0.04


----Beacon Hill----
           venue  freq
0    Coffee Shop  0.06
1           Park  0.05
2            Spa  0.05
3         Bakery  0.05
4  Historic Site  0.04


----Brighton----
                  venue  freq
0           Pizza Place  0.05
1         Grocery Store  0.04
2     Korean Restaurant  0.04
3             Rock Club  0.03
4  Gym / Fitness Center  0.03


----Charlestown----
                venue  freq
0  Italian Restaurant  0

In [108]:
dfGym = bos_grouped[['Neighborhood',"Gym"]]
print(dfGym)

                  Neighborhood       Gym
0                      Allston  0.010000
1                     Back Bay  0.010000
2                  Bay Village  0.030000
3                  Beacon Hill  0.010000
4                     Brighton  0.020000
5                  Charlestown  0.000000
6   Chinatown/Leather District  0.030000
7                   Dorchester  0.030000
8                     Downtown  0.010000
9                  East Boston  0.030000
10              Fenway Kenmore  0.020000
11                   Hyde Park  0.000000
12               Jamaica Plain  0.000000
13                    Mattapan  0.000000
14                Mission Hill  0.030000
15                   North End  0.010000
16                  Roslindale  0.012987
17                     Roxbury  0.040000
18                South Boston  0.030000
19                   South End  0.020000
20                    West End  0.010000
21                West Roxbury  0.013333
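
The `Gym` column is a fraction of the venues returned for a neighborhood, not a count, and it only covers venues whose first category is exactly `Gym`. Values such as 0.012987 and 0.013333 are consistent with one gym among 77 and 75 returned venues respectively, although the per-neighborhood totals are not shown above. The sketch below, on made-up rows, shows one way to report counts next to fractions and to include related categories such as `Gym / Fitness Center`. Which categories count as gym-like is an assumption made for illustration.

```python
import pandas as pd

# Made-up venues for two fictional neighborhoods
toy_venues = pd.DataFrame({
    "Neighborhood": ["A"] * 5 + ["B"] * 4,
    "Venue Category": ["Gym", "Gym / Fitness Center", "Pizza Place", "Park", "Bakery",
                       "Coffee Shop", "Gym / Fitness Center", "Hotel", "Spa"],
})

GYM_LIKE = {"Gym", "Gym / Fitness Center"}  # assumed grouping of categories

toy_venues["is_gym_like"] = toy_venues["Venue Category"].isin(GYM_LIKE)
gym_summary = toy_venues.groupby("Neighborhood")["is_gym_like"].agg(
    gym_count="sum", venue_count="count", gym_fraction="mean")
print(gym_summary)

# The fractions in the table above match simple ratios
print(round(1 / 77, 6), round(1 / 75, 6))
```

Reporting the venue count alongside the fraction makes it easier to tell a neighborhood with few gyms from one with few venues overall.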


<h3>Cluster Neighborhoods</h3>

In this section we create 5 clusters from the 22 neighborhoods using k-means on the gym fraction. After creating the clusters, we merge the cluster labels with the coordinates dataframe from earlier. Lastly, we visualize the clusters using folium, with differently colored markers showing which neighborhoods are in which cluster.

In [113]:
kclusters = 5

bos_grouped_clustering = dfGym.drop(columns='Neighborhood')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bos_grouped_clustering)

kmeans.labels_[0:10] 

array([2, 2, 1, 2, 3, 0, 1, 1, 2, 1], dtype=int32)
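
K-means assigns label numbers arbitrarily, so a label like 2 says nothing about whether a cluster has many or few gyms. One optional step, sketched below on the gym fractions printed earlier, is to renumber the clusters by ascending centroid so that label 0 is the lowest-gym cluster. The `n_init=10` argument and the relabeling itself are choices made for this illustration and are not in the original notebook, so the resulting numbers need not match the labels shown above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Gym fractions for the 22 neighborhoods, copied from the dfGym table
gym_fraction = np.array([
    0.01, 0.01, 0.03, 0.01, 0.02, 0.0, 0.03, 0.03, 0.01, 0.03, 0.02,
    0.0, 0.0, 0.0, 0.03, 0.01, 0.012987, 0.04, 0.03, 0.02, 0.01, 0.013333,
]).reshape(-1, 1)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(gym_fraction)

# rank[label] = position of that label's centroid when centroids are sorted ascending
order = np.argsort(km.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
ordered_labels = rank[km.labels_]

# Print each renumbered cluster with its size and mean gym fraction
cluster_means = []
for label in range(5):
    members = gym_fraction[ordered_labels == label].ravel()
    cluster_means.append(members.mean())
    print(label, len(members), members.mean())
```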

In [117]:
dfCopied = dfGym.copy()

dfCopied["Cluster Labels"] = kmeans.labels_
dfCopied = dfCopied.join(df.set_index("Neighborhood"), on="Neighborhood")
dfCopied.sort_values(["Cluster Labels"], inplace=True)

print(dfCopied)

                  Neighborhood       Gym  Cluster Labels   Latitude  Longitude
5                  Charlestown  0.000000               0  42.367771 -71.059016
13                    Mattapan  0.000000               0  42.278222 -71.096083
11                   Hyde Park  0.000000               0  42.274773 -71.119898
12               Jamaica Plain  0.000000               0  42.305849 -71.119092
18                South Boston  0.030000               1  42.352250 -71.055690
2                  Bay Village  0.030000               1  42.348165 -71.068470
6   Chinatown/Leather District  0.030000               1  42.352510 -71.060900
7                   Dorchester  0.030000               1  42.351355 -71.052848
14                Mission Hill  0.030000               1  42.335710 -71.109800
9                  East Boston  0.030000               1  42.351418 -71.056714
0                      Allston  0.010000               2  42.350531 -71.111091
16                  Roslindale  0.012987            

In [120]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, indexing colors directly by cluster label
for lat, lon, poi, cluster in zip(dfCopied['Latitude'], dfCopied['Longitude'], dfCopied['Neighborhood'], dfCopied['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<h3>Examine Clusters</h3>

In [123]:
print(dfCopied.loc[dfCopied['Cluster Labels'] == 0, dfCopied.columns[[1] + list(range(5, dfCopied.shape[1]))]])

    Gym
5   0.0
13  0.0
11  0.0
12  0.0


In [124]:
print(dfCopied.loc[dfCopied['Cluster Labels'] == 1, dfCopied.columns[[1] + list(range(5, dfCopied.shape[1]))]])

     Gym
18  0.03
2   0.03
6   0.03
7   0.03
14  0.03
9   0.03


In [125]:
print(dfCopied.loc[dfCopied['Cluster Labels'] == 2, dfCopied.columns[[1] + list(range(5, dfCopied.shape[1]))]])

         Gym
0   0.010000
16  0.012987
15  0.010000
21  0.013333
8   0.010000
3   0.010000
1   0.010000
20  0.010000


In [126]:
print(dfCopied.loc[dfCopied['Cluster Labels'] == 3, dfCopied.columns[[1] + list(range(5, dfCopied.shape[1]))]])

     Gym
4   0.02
19  0.02
10  0.02


In [127]:
print(dfCopied.loc[dfCopied['Cluster Labels'] == 4, dfCopied.columns[[1] + list(range(5, dfCopied.shape[1]))]])

     Gym
17  0.04
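
The five per-cluster printouts above can also be condensed into one table. The sketch below rebuilds the `Gym` and `Cluster Labels` columns from the printed output so that it runs on its own, and summarizes each cluster with a single `groupby`. On the notebook's `dfCopied`, the same `groupby` call would apply directly.

```python
import pandas as pd

# Values transcribed from the cluster printouts above
clusters = pd.DataFrame({
    "Gym": [0.0] * 4 + [0.03] * 6
           + [0.01, 0.012987, 0.01, 0.013333, 0.01, 0.01, 0.01, 0.01]
           + [0.02] * 3 + [0.04],
    "Cluster Labels": [0] * 4 + [1] * 6 + [2] * 8 + [3] * 3 + [4],
})

# One row per cluster: size plus mean, min, and max gym fraction
cluster_summary = clusters.groupby("Cluster Labels")["Gym"].agg(["count", "mean", "min", "max"])
print(cluster_summary)
```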


<h2>Conclusions</h2>

Based on the analysis conducted above, we can come to a few conclusions. In general, we see fewer gyms as you move away from the center of Boston and into its outlying neighborhoods. Cluster 0, which has no venues categorized as Gym within a one-mile radius of each neighborhood's coordinates, consists mostly of neighborhoods farther from the city center. On the other hand, the clusters with a moderate share of gyms tend to be closer to Downtown, where there are more businesses and closer proximity to the MBTA train system. It is possible that this share appears lower than it otherwise would because the Gym value is a fraction of the venues returned, and in Clusters 1 and 2 there are more businesses of every kind in the area.

As for where an investor should open a gym, I would suggest opening one in Cluster 0. Cluster 0 is a generally residential area that could benefit from development. Because it has no gyms nearby, it is quite possible that residents of Cluster 0 are traveling to gyms in other neighborhoods. On the other hand, I would discourage investors from opening a gym in Clusters 1 and 2. Because these areas are densely populated with venues ranging from bars to restaurants, there are also many gyms, and a new gym would therefore face severe competition. It would also have to convince gym-goers to switch from their current gym to the new one, which is ultimately not worth the risk.