# Data Science Capstone Project (week 2) - The Battle of Neighborhoods

__Business Problem__
A (hypothetical) client is a dentist who would like to __open a new dental office in the New York area__.
He has all the expertise and capital to open the clinic and is asking us to recommend one or more places where to establish the clinic. 


The client has two requirements:


1)	The business must be located in an area that presents good business opportunities and allow him to get enough patients


2)	Ideally, the location would be (or not far from) a nice place to live for a family, as he would like to live near his work. 


To address the first requirement, we look for locations that are relatively underserved, i.e. have a lower number of existing dental offices per capita. We also factor in economic data, like the median household income levels, as we can expect that a more affluent population creates better business conditions for a dental office.


To address the second requirement, we look at various aspects like: the presence of schools, parks, shopping and restaurants, and we look for lower levels of criminality.


<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

Sections:

1. <a href="#item1">Prepare NYC neighborhood dataframe</a>

2. <a href="#item2">Prepare NYC socio economic data </a>

3. <a href="#item3">Get from Foursquare list of dentists</a>

4. <a href="#item4">Cluster Neigborhoods to find best business location</a>

5. <a href="#item5">Analyze Neigborhoods for family life requirement</a>

</font>
</div>

In [1]:
import numpy as np 

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Prepare NYC neighborhood dataframe

In [2]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [3]:
#newyork_data

In [4]:
neighborhoods_data = newyork_data['features']

# transform this data of nested Python dictionaries into a pandas dataframe.

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

# loop through the data and fill the dataframe one row at a time.

for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [5]:
neighborhoods.shape

(306, 4)

In [98]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [7]:
neighborhoods.Borough.value_counts()

Queens           81
Brooklyn         70
Staten Island    63
Bronx            52
Manhattan        40
Name: Borough, dtype: int64

In [8]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


Use geopy library to get the latitude and longitude values of New York City.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

In [9]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


Create a map of New York with neighborhoods - different colors for each borough

In [10]:
borough_color = {'Queens' : 'grey', 'Brooklyn' : 'red', 'Staten Island' : 'green', 'Bronx' : 'orange', 'Manhattan' : 'blue'}

map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=borough_color[borough],
        fill=True,
        fill_color=borough_color[borough],   # '#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

---------------------

<a id='item2'></a>

## 2. Prepare NYC socio economic data

We have data on all aspects only until 2017; so we use this year from all datasets.

In [191]:
nyc_median_household_income = pd.read_csv('borough-medianhouseholdincome2018.csv')

In [219]:
nyc_median_household_income_2017 = nyc_median_household_income[['Borough','2017']].copy()
nyc_median_household_income_2017.rename(columns={'2017' : 'median_household_income'}, inplace=True)
nyc_median_household_income_2017['median_household_income'] = nyc_median_household_income_2017['median_household_income'].astype('int')
nyc_median_household_income_2017

Unnamed: 0,Borough,median_household_income
0,Manhattan,86693
1,Bronx,38110
2,Brooklyn,58027
3,Queens,65739
4,Staten Island,80711


In [196]:
nyc_pop_density = pd.read_csv('borough-populationdensity1000personspersquaremile.csv')

In [220]:
nyc_pop_density_2017 = nyc_pop_density[['Borough','2017']].copy()
nyc_pop_density_2017.rename(columns={'2017' : 'pop_density'}, inplace=True)
nyc_pop_density_2017

Unnamed: 0,Borough,pop_density
0,Manhattan,73.479122
1,Bronx,34.985434
2,Brooklyn,37.941149
3,Queens,21.684484
4,Staten Island,8.241244


In [199]:
nyc_unemployment = pd.read_csv('borough-unemploymentrate.csv')

In [216]:
nyc_unemployment_2017 = nyc_unemployment[['Borough','2017']].copy()
nyc_unemployment_2017.rename(columns={'2017' : 'unemployment'}, inplace=True)
nyc_unemployment_2017

Unnamed: 0,Borough,unemployment
0,Staten Island,0.043181
1,Manhattan,0.053543
2,Bronx,0.109212
3,Brooklyn,0.064241
4,Queens,0.051791


In [201]:
nyc_seriouscrimerateper1000residents = pd.read_csv('borough-seriouscrimerateper1000residents.csv')

In [217]:
nyc_seriouscrimerateper1000residents_2017 = nyc_seriouscrimerateper1000residents[['Borough','2017']].copy()
nyc_seriouscrimerateper1000residents_2017.rename(columns={'2017' : 'crimerateper1000residents'}, inplace=True)
nyc_seriouscrimerateper1000residents_2017

Unnamed: 0,Borough,crimerateper1000residents
0,Manhattan,16.458252
1,Bronx,14.552795
2,Brooklyn,11.281601
3,Queens,8.6117
4,Staten Island,5.970494


In [203]:
nyc_poverty_rate = pd.read_csv('borough-povertyrate.csv')

In [218]:
nyc_poverty_rate_2017 = nyc_poverty_rate[['Borough','2017']].copy()
nyc_poverty_rate_2017.rename(columns={'2017' : 'poverty_rate'}, inplace=True)
nyc_poverty_rate_2017

Unnamed: 0,Borough,poverty_rate
0,Manhattan,0.162231
1,Bronx,0.280333
2,Brooklyn,0.198106
3,Queens,0.121145
4,Staten Island,0.117867


Merge all data in a single dataframe

In [222]:
nyc_socioeconomic = nyc_median_household_income_2017.merge(nyc_pop_density_2017, on='Borough', how='left')

In [224]:
nyc_socioeconomic = nyc_socioeconomic.merge(nyc_unemployment_2017, on='Borough', how='left')
nyc_socioeconomic = nyc_socioeconomic.merge(nyc_seriouscrimerateper1000residents_2017, on='Borough', how='left')
nyc_socioeconomic = nyc_socioeconomic.merge(nyc_poverty_rate_2017, on='Borough', how='left')

In [225]:
nyc_socioeconomic

Unnamed: 0,Borough,median_household_income,pop_density,unemployment,crimerateper1000residents,poverty_rate
0,Manhattan,86693,73.479122,0.053543,16.458252,0.162231
1,Bronx,38110,34.985434,0.109212,14.552795,0.280333
2,Brooklyn,58027,37.941149,0.064241,11.281601,0.198106
3,Queens,65739,21.684484,0.051791,8.6117,0.121145
4,Staten Island,80711,8.241244,0.043181,5.970494,0.117867


<a id='item3'></a>

## 3. Get from Foursquare list of dentists

Let's explore the first neighborhood in our dataframe, to practice the code.

In [11]:
# Define Foursquare Credentials and Version

CLIENT_ID = '4X0PNNTVKSX0OQUM3EAPVKJMRLCOTJ1QNHPRZHF0A2WTF4AN' # your Foursquare ID
CLIENT_SECRET = '4LLOJHMWTZ1QEY2ALCGGWBZW4PMMRR2DCTLYA14KR5UQYNMK' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 4X0PNNTVKSX0OQUM3EAPVKJMRLCOTJ1QNHPRZHF0A2WTF4AN
CLIENT_SECRET:4LLOJHMWTZ1QEY2ALCGGWBZW4PMMRR2DCTLYA14KR5UQYNMK


In [113]:
neighborhoods[neighborhoods['Neighborhood'] == 'Wakefield']

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201


In [263]:
neighborhoods.loc[0, 'Neighborhood']

'Wakefield'

In [264]:
neighborhood_latitude = neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Wakefield are 40.89470517661, -73.84720052054902.


In [321]:
# Now, let's get the top 40 venues (then we focus on dentists) that are in Marble Hill within a radius of 500 meters.

category = '4bf58dd8d48988d178941735' # 'Dentist's Office'
#category = '4d4b7104d754a06370d81259, 4d4b7105d754a06374d81259' #, 4d4b7105d754a06377d81259' 

radius = 500
LIMIT = 40 # limit of number of venues returned by Foursquare API
time ='any'
day = 'any' # any day of the week, not current day

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}&limit={}&time={}&day={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    category,
    LIMIT,
    time,
    day)

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ddbfcfd949393001b80016a'},
  'headerLocation': 'Wakefield',
  'headerFullLocation': 'Wakefield, Bronx',
  'headerLocationGranularity': 'neighborhood',
  'query': "dentist's office",
  'totalResults': 0,
  'suggestedBounds': {'ne': {'lat': 40.899205181110005,
    'lng': -73.84125857127495},
   'sw': {'lat': 40.89020517211, 'lng': -73.8531424698231}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': []}]}}

In [292]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [293]:
# clean the json and structure it into a pandas dataframe.

venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

In [294]:
nearby_venues.head(10)

Unnamed: 0,name,categories,lat,lng
0,my neighbor park,Playground,40.895864,-73.844522


In [285]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

6 venues were returned by Foursquare.


Let's create a function to __repeat the same process to all the neighborhoods__

In [46]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}&limit={}&time={}&day={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            category,
            LIMIT,
            time,
            day)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [50]:
# run the above function on each neighborhood and create a new dataframe called manhattan_venues

NYC_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [51]:
print(NYC_venues.shape)
NYC_venues.head()

(1542, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Co-op City,40.874294,-73.829939,Dental Group NY,40.875545,-73.829761,Dentist's Office
1,Co-op City,40.874294,-73.829939,Advanced Dental Group,40.875545,-73.829761,Dentist's Office
2,Co-op City,40.874294,-73.829939,Cohen Gentle Dental,40.871569,-73.830243,Dentist's Office
3,Co-op City,40.874294,-73.829939,Smile-Savers Pediatric Dentistry,40.877143,-73.828029,Dentist's Office
4,Co-op City,40.874294,-73.829939,City Smiles Dental,40.87021,-73.827829,Dentist's Office


In [49]:
neighborhoods['Neighborhood']

0                      Wakefield
1                     Co-op City
2                    Eastchester
3                      Fieldston
4                      Riverdale
5                    Kingsbridge
6                    Marble Hill
7                       Woodlawn
8                        Norwood
9                 Williamsbridge
10                    Baychester
11                Pelham Parkway
12                   City Island
13                  Bedford Park
14            University Heights
15                Morris Heights
16                       Fordham
17                  East Tremont
18                    West Farms
19                  High  Bridge
20                       Melrose
21                    Mott Haven
22                   Port Morris
23                      Longwood
24                   Hunts Point
25                    Morrisania
26                     Soundview
27                  Clason Point
28                   Throgs Neck
29                  Country Club
30        

In [52]:
NYC_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Allerton,4,4,4,4,4,4
Annadale,2,2,2,2,2,2
Arden Heights,3,3,3,3,3,3
Arrochar,2,2,2,2,2,2
Astoria,13,13,13,13,13,13
Auburndale,3,3,3,3,3,3
Bath Beach,7,7,7,7,7,7
Battery Park City,5,5,5,5,5,5
Bay Ridge,9,9,9,9,9,9
Bay Terrace,8,8,8,8,8,8


In [53]:
print('There are {} uniques categories.'.format(len(NYC_venues['Venue Category'].unique())))

There are 3 uniques categories.


In [55]:
NYC_venues['Venue Category'].unique()

# There should be only "Dentist's Office"

array(["Dentist's Office", "Doctor's Office", 'Office'], dtype=object)

In [57]:
NYC_venues[NYC_venues['Venue Category'] != "Dentist's Office"]

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
27,Norwood,40.877224,-73.879391,Montefiore Hospital Clinic Building,40.881271,-73.881318,Doctor's Office
1484,Flatiron,40.739673,-73.990947,Healthy Chelsea Dental,40.738159,-73.995828,Office


In [72]:
NYC_venues[NYC_venues['Neighborhood'] == "Allerton"]

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1526,Allerton,40.865788,-73.859319,Williamsbridge Dental,40.863363,-73.858509,Dentist's Office
1527,Allerton,40.865788,-73.859319,Allerton Dental - Robert Garfinkel DDS,40.865795,-73.86288,Dentist's Office
1528,Allerton,40.865788,-73.859319,Dr Golden DDS,40.863796,-73.856626,Dentist's Office
1529,Allerton,40.865788,-73.859319,Dr. Garfinkel,40.865444,-73.86469,Dentist's Office


In [60]:
# one hot encoding
NYC_onehot = pd.get_dummies(NYC_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
NYC_onehot['Neighborhood'] = NYC_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [NYC_onehot.columns[-1]] + list(NYC_onehot.columns[:-1])
NYC_onehot = NYC_onehot[fixed_columns]

Unnamed: 0,Neighborhood,Dentist's Office,Doctor's Office,Office
0,Co-op City,1,0,0
1,Co-op City,1,0,0
2,Co-op City,1,0,0
3,Co-op City,1,0,0
4,Co-op City,1,0,0


In [64]:
NYC_onehot.head(10)

Unnamed: 0,Neighborhood,Dentist's Office,Doctor's Office,Office
0,Co-op City,1,0,0
1,Co-op City,1,0,0
2,Co-op City,1,0,0
3,Co-op City,1,0,0
4,Co-op City,1,0,0
5,Fieldston,1,0,0
6,Riverdale,1,0,0
7,Riverdale,1,0,0
8,Riverdale,1,0,0
9,Riverdale,1,0,0


In [62]:
NYC_onehot.shape

(1542, 4)

In [65]:
type(NYC_venues.groupby('Neighborhood').count())

pandas.core.frame.DataFrame

In [67]:
NYC_dentists_nr = NYC_venues.groupby('Neighborhood').count()  # [['Neighborhood','Venue']]

In [70]:
NYC_dentists_nr.columns

Index(['Neighborhood Latitude', 'Neighborhood Longitude', 'Venue',
       'Venue Latitude', 'Venue Longitude', 'Venue Category'],
      dtype='object')

In [71]:
NYC_dentists_nr

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Allerton,4,4,4,4,4,4
Annadale,2,2,2,2,2,2
Arden Heights,3,3,3,3,3,3
Arrochar,2,2,2,2,2,2
Astoria,13,13,13,13,13,13
Auburndale,3,3,3,3,3,3
Bath Beach,7,7,7,7,7,7
Battery Park City,5,5,5,5,5,5
Bay Ridge,9,9,9,9,9,9
Bay Terrace,8,8,8,8,8,8


In [73]:
NYC_dentists_nr.reset_index(level=0, inplace=True)

In [75]:
NYC_dentists_nr = NYC_dentists_nr[['Neighborhood','Venue']]

In [77]:
NYC_dentists_nr.rename(columns={'Venue' : 'Dentists'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


In [78]:
NYC_dentists_nr

Unnamed: 0,Neighborhood,Dentists
0,Allerton,4
1,Annadale,2
2,Arden Heights,3
3,Arrochar,2
4,Astoria,13
5,Auburndale,3
6,Bath Beach,7
7,Battery Park City,5
8,Bay Ridge,9
9,Bay Terrace,8


In [79]:
NYC_full_df = NYC_dentists_nr

<a id='item4'></a>

## 4. Cluster Neigborhoods to find best business location

We look for clusters of neighborhood with similar number of businesses. The neighborhoods with smaller number are the ones likely to be underserved.

Run k-means to cluster the neighborhood into 5 clusters.

In [81]:
# set number of clusters
kclusters = 5

NYC_full_df_clustering = NYC_full_df.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(NYC_full_df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 4, 2, 0, 0, 0, 0])

In [82]:
NYC_full_df.insert(0, 'Cluster Labels', kmeans.labels_)

In [83]:
NYC_full_df

Unnamed: 0,Cluster Labels,Neighborhood,Dentists
0,2,Allerton,4
1,2,Annadale,2
2,2,Arden Heights,3
3,2,Arrochar,2
4,4,Astoria,13
5,2,Auburndale,3
6,0,Bath Beach,7
7,0,Battery Park City,5
8,0,Bay Ridge,9
9,0,Bay Terrace,8


### Visualize clusters

In [109]:
# merge df to add latitude/longitude for each neighborhood

NYC_merged = neighborhoods.merge(NYC_full_df, on='Neighborhood', how='left')

NYC_merged.head() 

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Dentists
0,Bronx,Wakefield,40.894705,-73.847201,,
1,Bronx,Co-op City,40.874294,-73.829939,0.0,5.0
2,Bronx,Eastchester,40.887556,-73.827806,,
3,Bronx,Fieldston,40.895437,-73.905643,2.0,1.0
4,Bronx,Riverdale,40.890834,-73.912585,2.0,4.0


In [110]:
NYC_merged.shape

(306, 6)

In [111]:
NYC_full_df.shape

(232, 3)

In [136]:
nr_no_dentists = NYC_merged.shape[0] - NYC_full_df.shape[0]
print(str(nr_no_dentists) + ' neighboors have no dentists within 500m')

74 neighboors have no dentists within 500m


In [112]:
neighborhoods.shape

(306, 4)

In [144]:
# These neighboors had no dentists returned from Foursquare within 500m
NYC_merged[NYC_merged['Cluster Labels'].isnull() == True] #.shape[0]
NYC_merged[NYC_merged['Dentists'].isnull() == True]

0

In [137]:
# Define Cluster Label = 6 (nr. of clusters +1) for those with NaN, i.e. no dentists
# replace NaN with 0 in Dentists

NYC_merged['Cluster Labels'].replace(np.nan, kclusters+1 , inplace=True)  
NYC_merged['Dentists'].replace(np.nan, 0, inplace=True)  

In [175]:
NYC_merged['Cluster Labels'] = NYC_merged['Cluster Labels'].astype('int')
NYC_merged['Dentists'] = NYC_merged['Dentists'].astype('int')
NYC_merged.head(10)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Dentists
0,Bronx,Wakefield,40.894705,-73.847201,5,0
1,Bronx,Co-op City,40.874294,-73.829939,0,5
2,Bronx,Eastchester,40.887556,-73.827806,5,0
3,Bronx,Fieldston,40.895437,-73.905643,2,1
4,Bronx,Riverdale,40.890834,-73.912585,2,4
5,Bronx,Kingsbridge,40.881687,-73.902818,2,4
6,Manhattan,Marble Hill,40.876551,-73.91066,0,5
7,Bronx,Woodlawn,40.898273,-73.867315,2,2
8,Bronx,Norwood,40.877224,-73.879391,0,7
9,Bronx,Williamsbridge,40.881039,-73.857446,2,1


In [181]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters+1)
ys = [i + x + (i*x)**2 for i in range(kclusters+1)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

In [182]:
rainbow[5]

'#ff0000'

In [183]:
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(NYC_merged['Latitude'], NYC_merged['Longitude'], NYC_merged['Neighborhood'], NYC_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The red circles indicate neighboorhoods with no dentists within 500m

### Examine clusters

In [179]:
NYC_merged[['Cluster Labels','Dentists']].groupby('Cluster Labels').describe()

Unnamed: 0_level_0,Dentists,Dentists,Dentists,Dentists,Dentists,Dentists,Dentists,Dentists
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Cluster Labels,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
0,63.0,6.984127,1.853437,5.0,5.0,6.0,8.0,11.0
1,10.0,40.6,2.412928,38.0,39.25,40.0,40.0,45.0
2,141.0,2.22695,1.110783,1.0,1.0,2.0,3.0,4.0
3,6.0,32.166667,2.316607,29.0,30.5,32.5,33.75,35.0
4,16.0,16.0,2.804758,12.0,14.0,15.5,17.25,23.0
5,70.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [248]:
clusters = {}
for i in range(kclusters+1):
    clusters[i] = NYC_merged[['Cluster Labels','Borough','Neighborhood','Dentists']][NYC_merged['Cluster Labels'] == i ]

In [None]:
NYC_merged[['Cluster Labels','Borough','Neighborhood','Dentists']][NYC_merged['Cluster Labels'] == 5 ]

In [254]:
clusters[2].Borough.value_counts()

Queens           38
Brooklyn         33
Bronx            32
Staten Island    31
Manhattan         7
Name: Borough, dtype: int64

In [253]:
clusters[5].Borough.value_counts()

Staten Island    26
Queens           19
Brooklyn         13
Bronx            11
Manhattan         1
Name: Borough, dtype: int64

In [226]:
nyc_socioeconomic

Unnamed: 0,Borough,median_household_income,pop_density,unemployment,crimerateper1000residents,poverty_rate
0,Manhattan,86693,73.479122,0.053543,16.458252,0.162231
1,Bronx,38110,34.985434,0.109212,14.552795,0.280333
2,Brooklyn,58027,37.941149,0.064241,11.281601,0.198106
3,Queens,65739,21.684484,0.051791,8.6117,0.121145
4,Staten Island,80711,8.241244,0.043181,5.970494,0.117867


<a id='item5'></a>

## 5. Analyze Neigborhoods for family life requirement

As our client has indicated a preference for Staten Island, we focus on this Borough

In [258]:
si_pref_neigh = NYC_merged[(NYC_merged['Cluster Labels'] == 5) & (NYC_merged['Borough'] == 'Staten Island') ]

In [320]:
len(si_pref_neigh)

26

In [259]:
si_pref_neigh

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Dentists
197,Staten Island,St. George,40.644982,-74.079353,5,0
198,Staten Island,New Brighton,40.640615,-74.087017,5,0
200,Staten Island,Rosebank,40.615305,-74.069805,5,0
203,Staten Island,Todt Hill,40.597069,-74.111329,5,0
204,Staten Island,South Beach,40.580247,-74.079553,5,0
205,Staten Island,Port Richmond,40.633669,-74.129434,5,0
206,Staten Island,Mariner's Harbor,40.632546,-74.150085,5,0
207,Staten Island,Port Ivory,40.639683,-74.174645,5,0
210,Staten Island,Travis,40.586314,-74.190737,5,0
224,Staten Island,Park Hill,40.60919,-74.080157,5,0


In [297]:
# categoryId: include a list of comma-separated IDs if you want to select multiple categories
# Arts & Entertainment = 4d4b7104d754a06370d81259
# Food = 4d4b7105d754a06374d81259
# Outdoors & Recreation = 4d4b7105d754a06377d81259

category = '4d4b7105d754a06377d81259' #, 4d4b7105d754a06374d81259, 4d4b7105d754a06377d81259' 
LIMIT = 50 # limit of number of venues returned by Foursquare API

In [298]:
si_venues = getNearbyVenues(names=si_pref_neigh['Neighborhood'],
                            latitudes=si_pref_neigh['Latitude'],
                            longitudes=si_pref_neigh['Longitude'] )

St. George
New Brighton
Rosebank
Todt Hill
South Beach
Port Richmond
Mariner's Harbor
Port Ivory
Travis
Park Hill
Arlington
Midland Beach
Grant City
New Dorp Beach
Pleasant Plains
Clifton
Emerson Hill
Randall Manor
Howland Hook
Elm Park
Manor Heights
Willowbrook
Egbertville
Lighthouse Hill
Richmond Valley
Fox Hills


In [300]:
si_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,St. George,40.644982,-74.079353,Postcards 9/11 Memorial,40.646546,-74.076703,Plaza
1,St. George,40.644982,-74.079353,Lt. Lia Playground,40.643433,-74.079862,Playground
2,St. George,40.644982,-74.079353,North Shore Esplanade,40.646929,-74.076798,Harbor / Marina
3,St. George,40.644982,-74.079353,St. George Esplanade,40.645545,-74.074937,Scenic Lookout
4,St. George,40.644982,-74.079353,Fort Hill,40.641511,-74.080522,Park
5,St. George,40.644982,-74.079353,Maritime Hospital Quarantine Cemetery,40.641593,-74.07773,Park
6,St. George,40.644982,-74.079353,Seaview Fitness Center,40.646019,-74.08443,Gym / Fitness Center
7,St. George,40.644982,-74.079353,Westervelt Community Garden,40.644293,-74.084787,Botanical Garden
8,St. George,40.644982,-74.079353,Barrett Triangle,40.641731,-74.075859,Park
9,New Brighton,40.640615,-74.087017,Bocce Courts,40.6398,-74.09,Park


In [301]:
si_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Arlington,6,6,6,6,6,6
Clifton,5,5,5,5,5,5
Egbertville,3,3,3,3,3,3
Elm Park,2,2,2,2,2,2
Emerson Hill,1,1,1,1,1,1
Fox Hills,2,2,2,2,2,2
Grant City,6,6,6,6,6,6
Howland Hook,1,1,1,1,1,1
Lighthouse Hill,3,3,3,3,3,3
Manor Heights,4,4,4,4,4,4


In [302]:
print('There are {} uniques categories.'.format(len(si_venues['Venue Category'].unique())))

There are 29 uniques categories.


In [303]:
si_venues['Venue Category'].unique()

array(['Plaza', 'Playground', 'Harbor / Marina', 'Scenic Lookout', 'Park',
       'Gym / Fitness Center', 'Botanical Garden', 'Martial Arts Dojo',
       'Pilates Studio', 'Beach', 'Trail', 'Athletics & Sports', 'Gym',
       'Baseball Field', 'Stables', 'Basketball Court', 'Skate Park',
       'Cycle Studio', 'Sports Club', 'Tree', 'Skating Rink',
       'Yoga Studio', 'Pool', 'Dog Run', 'Surf Spot', 'Sculpture Garden',
       'Fishing Spot', 'Campground', 'Lighthouse'], dtype=object)

In [309]:
si_venues[si_venues['Venue Category'].isin(['Playground','Park','Pool'])]

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1,St. George,40.644982,-74.079353,Lt. Lia Playground,40.643433,-74.079862,Playground
4,St. George,40.644982,-74.079353,Fort Hill,40.641511,-74.080522,Park
5,St. George,40.644982,-74.079353,Maritime Hospital Quarantine Cemetery,40.641593,-74.07773,Park
8,St. George,40.644982,-74.079353,Barrett Triangle,40.641731,-74.075859,Park
9,New Brighton,40.640615,-74.087017,Bocce Courts,40.6398,-74.09,Park
10,New Brighton,40.640615,-74.087017,Skyline Playground,40.63919,-74.089867,Playground
11,New Brighton,40.640615,-74.087017,Mahoney Park,40.643793,-74.085313,Park
13,Rosebank,40.615305,-74.069805,White playground,40.61533,-74.06905,Park
14,Rosebank,40.615305,-74.069805,DeMatti Park,40.615036,-74.073499,Park
18,Todt Hill,40.597069,-74.111329,St Francis Woodlands,40.599524,-74.114515,Park


In [327]:
si_venues[['Neighborhood','Venue','Venue Category']][si_venues['Venue Category'].isin(['Playground','Park','Pool'])]

Unnamed: 0,Neighborhood,Venue,Venue Category
1,St. George,Lt. Lia Playground,Playground
4,St. George,Fort Hill,Park
5,St. George,Maritime Hospital Quarantine Cemetery,Park
8,St. George,Barrett Triangle,Park
9,New Brighton,Bocce Courts,Park
10,New Brighton,Skyline Playground,Playground
11,New Brighton,Mahoney Park,Park
13,Rosebank,White playground,Park
14,Rosebank,DeMatti Park,Park
18,Todt Hill,St Francis Woodlands,Park


In [317]:
si_venues[si_venues['Venue Category'].isin(['Playground','Park','Pool'])].groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Arlington,4,4,4,4,4,4
Clifton,2,2,2,2,2,2
Egbertville,1,1,1,1,1,1
Elm Park,1,1,1,1,1,1
Fox Hills,1,1,1,1,1,1
Mariner's Harbor,1,1,1,1,1,1
New Brighton,3,3,3,3,3,3
Park Hill,1,1,1,1,1,1
Pleasant Plains,2,2,2,2,2,2
Port Richmond,1,1,1,1,1,1


In [319]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(si_pref_neigh['Latitude'], si_pref_neigh['Longitude'], si_pref_neigh['Neighborhood'], si_pref_neigh['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The red dots on the map are the neighborhoods we recommend to check out.