# Capstone project

## 1. Choose the problem
- If someone wants to open a restaurant in a borough of New York City, where should they open it? Similarly, if a contractor wants to start their own business, where should they set up their office?

## 2. Explore the data
- We will explore the neighborhoods in Brooklyn to see which of them are worth opening a restaurant or another business in.

## 3. Choose the approach (model/solution)
- Since the data doesn't contain revenue for the venues in each neighborhood, we can only reason from the kinds of venues each neighborhood contains.
- The model of choice is therefore still KMeans, which groups neighborhoods with similar venue profiles.

## 4. Discussion and Conclusion section

Disclaimer: I reused the New York neighborhoods notebook.

Before we get the data and start exploring it, let's import all the dependencies that we will need.


In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>


## 2. Explore the data

In [2]:
!wget -q -O newyork_data.json https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

Data downloaded!


#### Load and explore the data


Next, let's load the data.


In [2]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Let's take a quick look at the data.


In [3]:
print(newyork_data.keys())
print(newyork_data['type'])
print(newyork_data['totalFeatures'])
print(newyork_data['features'][0].keys())
print(newyork_data['features'][0]['properties'])
print(newyork_data['crs'])
print(newyork_data['bbox'])

dict_keys(['type', 'totalFeatures', 'features', 'crs', 'bbox'])
FeatureCollection
306
dict_keys(['type', 'id', 'geometry', 'geometry_name', 'properties'])
{'name': 'Wakefield', 'stacked': 1, 'annoline1': 'Wakefield', 'annoline2': None, 'annoline3': None, 'annoangle': 0.0, 'borough': 'Bronx', 'bbox': [-73.84720052054902, 40.89470517661, -73.84720052054902, 40.89470517661]}
{'type': 'name', 'properties': {'name': 'urn:ogc:def:crs:EPSG::4326'}}
[-74.2492599487305, 40.5033187866211, -73.7061614990234, 40.9105606079102]


Notice how all the relevant data is in the _features_ key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.


In [4]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.


In [5]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a _pandas_ dataframe


The next task is essentially transforming this data of nested Python dictionaries into a _pandas_ dataframe. So let's start by creating an empty dataframe.


In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.


In [7]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data and collect one row per neighborhood.


In [8]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)
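
As an alternative to the explicit loop, `pd.json_normalize` can flatten the nested feature dictionaries in one call. The sketch below assumes the GeoJSON layout printed earlier (`properties.name`, `properties.borough`, and `geometry.coordinates` stored as `[longitude, latitude]`). The names `sample_features`, `flat` and `neighborhoods_alt` are illustrative and not part of the notebook; the second feature uses the rounded coordinates from the table output.

```python
import pandas as pd

# Two features shaped like the entries of newyork_data['features'] (trimmed for brevity)
sample_features = [
    {"geometry": {"type": "Point", "coordinates": [-73.84720052054902, 40.89470517661]},
     "properties": {"name": "Wakefield", "borough": "Bronx"}},
    {"geometry": {"type": "Point", "coordinates": [-73.829939, 40.874294]},
     "properties": {"name": "Co-op City", "borough": "Bronx"}},
]

# Nested dictionaries become dotted column names such as 'properties.name'
flat = pd.json_normalize(sample_features)

neighborhoods_alt = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    # coordinates are [longitude, latitude], so index 1 is the latitude
    "Latitude": flat["geometry.coordinates"].map(lambda lonlat: lonlat[1]),
    "Longitude": flat["geometry.coordinates"].map(lambda lonlat: lonlat[0]),
})

print(neighborhoods_alt)
```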

Quickly examine the resulting dataframe.


In [9]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### Use geopy library to get the latitude and longitude values of New York City.


In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.


In [11]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
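
The geocoding call above assumes a match is always found. geopy's `geocode` returns `None` when there is no result, so a small guard can avoid an `AttributeError`. This is a minimal sketch: `coordinates_or_none` is a hypothetical helper, and `StubGeolocator` is a made-up stand-in for `Nominatim` so the example runs without network access.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def coordinates_or_none(geolocator, address: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for the address, or None if nothing was found."""
    location = geolocator.geocode(address)
    if location is None:
        return None
    return (location.latitude, location.longitude)


class StubGeolocator:
    """Hypothetical offline stand-in that mimics the geocode() method used above."""

    def geocode(self, address):
        if address == 'New York City, NY':
            return SimpleNamespace(latitude=40.7127281, longitude=-74.0060152)
        return None


stub = StubGeolocator()
print(coordinates_or_none(stub, 'New York City, NY'))
print(coordinates_or_none(stub, 'No Such Place'))
```

In the notebook, the same helper could be called with the real `Nominatim(user_agent="ny_explorer")` instance instead of the stub.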


#### Create a map of New York with neighborhoods superimposed on top.


In [12]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle marker to reveal the name of the neighborhood and its respective borough.


New York City has 5 boroughs, so let's redraw the map with one fill color per borough to make them easier to tell apart.

In [13]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)
color_scheme = ['red','green','blue','gray','purple']
borough_list = neighborhoods['Borough'].unique().tolist()

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='white',
        fill=True,
        fill_color=color_scheme[borough_list.index(borough)],
        # '#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork
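
The lookup `color_scheme[borough_list.index(borough)]` scans the list for every marker. A dictionary gives the same mapping with a direct lookup and a fallback for unexpected names. This is a small sketch using the borough order printed by the notebook; `borough_colors` and `color_for` are illustrative names.

```python
# Borough order as returned by neighborhoods['Borough'].unique() in the notebook
borough_list = ['Bronx', 'Manhattan', 'Brooklyn', 'Queens', 'Staten Island']
color_scheme = ['red', 'green', 'blue', 'gray', 'purple']

# Pair each borough with its color once, instead of searching the list per marker
borough_colors = dict(zip(borough_list, color_scheme))


def color_for(borough, default='black'):
    """Return the fill color for a borough, or a default for unknown names."""
    return borough_colors.get(borough, default)


print(color_for('Brooklyn'))
print(color_for('Unknown'))
```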

In [17]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET


In [18]:
borough_list

['Bronx', 'Manhattan', 'Brooklyn', 'Queens', 'Staten Island']

To keep things simple, from here on we will analyze only Brooklyn (Manhattan has already been covered) to find out:
- Which neighborhoods of Brooklyn are worth opening a restaurant in.
- Which neighborhoods of Brooklyn are worth opening an office in.

In [37]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

def borough_visualize(neighborhoods, borough_name='Brooklyn'):

    borough_data = neighborhoods[neighborhoods['Borough'] == borough_name].reset_index(drop=True)
    
    # Check the location
    address = f'{borough_name}, NY'

    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of {} are {}, {}.'.format(borough_name,latitude, longitude))
    
    # create map using latitude and longitude values to visualize neighborhoods
    map_ = folium.Map(location=[latitude, longitude], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(borough_data['Latitude'], borough_data['Longitude'], borough_data['Neighborhood']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_)  
        
    map_

    borough_venues = getNearbyVenues(names=borough_data['Neighborhood'],latitudes=borough_data['Latitude'],longitudes=borough_data['Longitude'])

    return borough_venues, borough_data, map_
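
`getNearbyVenues` indexes straight into the response (`["response"]['groups'][0]['items']` and `categories[0]`), which raises if a neighborhood returns no groups or a venue has no category. The sketch below shows one defensive way to flatten a response with the same nesting. `venues_from_response` and the sample payload (including the venue name) are hypothetical and only mirror the keys used in the function above.

```python
def venues_from_response(payload, name, lat, lng):
    """Flatten one explore-style response into rows, skipping venues without a category."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        if not categories:
            continue  # nothing to cluster on without a category
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"],
        ))
    return rows


# Made-up payload with one complete venue and one venue lacking a category
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Pizza",
                   "location": {"lat": 40.65, "lng": -73.95},
                   "categories": [{"name": "Pizza Place"}]}},
        {"venue": {"name": "Uncategorized Spot",
                   "location": {"lat": 40.66, "lng": -73.96},
                   "categories": []}},
    ]}]}
}

rows = venues_from_response(sample_payload, "Midwood", 40.62, -73.96)
print(rows)
print(venues_from_response({}, "Midwood", 40.62, -73.96))
```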

In [38]:
borough_venues, borough_data, map_ = borough_visualize(neighborhoods)
map_

The geographical coordinates of Brooklyn are 40.6501038, -73.9495823.


In [74]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

def neighborhood_analyse(borough_venues, borough_data):
    # Apply one-hot encoding
    borough_onehot = pd.get_dummies(borough_venues[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    borough_onehot['Neighborhood'] = borough_venues['Neighborhood'] 

    # move neighborhood column to the first column
    fixed_columns = [borough_onehot.columns[-1]] + list(borough_onehot.columns[:-1])
    borough_onehot = borough_onehot[fixed_columns]
    
    print("--- Borough one-hot shape ---")
    print(borough_onehot.shape)
    borough_grouped = borough_onehot.groupby('Neighborhood').mean().reset_index()
    print("--- Borough neighborhood grouped shape ---")
    print(borough_grouped.shape)
    # ---------------------------------------------------------
    # ---------------------------------------------------------
    num_top_venues = 10

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            # ranks beyond the 3rd all take the 'th' suffix
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = borough_grouped['Neighborhood']

    for ind in np.arange(borough_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(borough_grouped.iloc[ind, :], num_top_venues)

    neighborhoods_venues_sorted.head()

    # ---------------------------------------------------------
    # ---------------------------------------------------------
    # Run k-means to cluster the neighborhoods into 6 clusters
    kclusters = 6

    # keep only the numeric frequency columns (pandas 2.x requires the keyword form)
    borough_grouped_clustering = borough_grouped.drop(columns='Neighborhood')

    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, n_init=100, random_state=0).fit(borough_grouped_clustering)

    # ---------------------------------------------------------
    # ---------------------------------------------------------
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

    borough_merged = borough_data.copy()

    # merge borough_grouped with borough_data to add latitude/longitude for each neighborhood
    borough_merged = borough_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

    # ---------------------------------------------------------
    # ---------------------------------------------------------
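    # create map centered on the mean coordinates of the borough's neighborhoods
    # (avoids relying on the global latitude/longitude computed for New York City)
    map_clusters = folium.Map(location=[borough_data['Latitude'].mean(), borough_data['Longitude'].mean()], zoom_start=11)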

    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster in zip(borough_merged['Latitude'], borough_merged['Longitude'], borough_merged['Neighborhood'], borough_merged['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)
        
    map_clusters
    # ---------------------------------------------------------
    # ---------------------------------------------------------
    return borough_merged, map_clusters
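
The first half of `neighborhood_analyse` (one-hot encoding, averaging per neighborhood, then ranking categories) can be followed on a toy table. This is a minimal sketch with made-up neighborhoods "A" and "B"; `toy_venues`, `grouped` and `top2` are illustrative names.

```python
import pandas as pd

# Made-up venues: neighborhood A is pizza-heavy, neighborhood B is gym-heavy
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Venue Category": ["Pizza Place", "Pizza Place", "Café", "Gym", "Gym", "Café"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of the one-hot columns is the share of each category in a neighborhood
grouped = onehot.groupby("Neighborhood").mean()

# Rank categories per neighborhood and keep the two most frequent
top2 = {
    hood: row.sort_values(ascending=False).index[:2].tolist()
    for hood, row in grouped.iterrows()
}

print(grouped)
print(top2)
```

Each row of `grouped` sums to 1, which is why the notebook can feed these rows to k-means without further scaling.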

In [60]:
borough_merged, map_clusters = neighborhood_analyse(borough_venues, borough_data)
map_clusters

--- Borough one-hot shape ---
(2534, 288)
--- Borough neighborhood grouped shape ---
(70, 288)
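
The grouped table has one row per neighborhood, and k-means is run on those category-frequency rows. The sketch below repeats that step on a hypothetical 4 x 3 frequency matrix (the column meanings are assumptions for illustration) and also counts the members of each cluster, a quick check that is useful here because several clusters in this run contain only one or two neighborhoods.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical shares of (restaurant, park, shop) venues for four neighborhoods
freq = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
])

# Same estimator as in the notebook, with fewer clusters for the toy data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)
labels = kmeans.labels_

# Number of neighborhoods assigned to each cluster label
cluster_sizes = np.bincount(labels, minlength=2)

print(labels)
print(cluster_sizes)
```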


In [114]:
def count_category(dataframe):
    df = dataframe.copy().reset_index(drop=True)
    df.drop(columns=["Neighborhood"], inplace=True)
    res_count = [len(df)]
    for column in df.columns:
        res_count.append(len([category for category in df[column] if 'restaurant' in category.lower()]))

    column_list =["Total neighborhood"]
    column_list.extend(df.columns.tolist())
 
    c_category = pd.DataFrame(columns=column_list)

    c_category.loc[len(c_category)] = res_count

    return c_category
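
The counting loop in `count_category` can also be expressed with pandas string methods. This is a sketch of an equivalent count on a two-row excerpt of the Cluster 2 table (first two rank columns only); `toy` and `restaurant_counts` are illustrative names.

```python
import pandas as pd

# Excerpt of the Cluster 2 neighborhoods, limited to the first two rank columns
toy = pd.DataFrame({
    "Neighborhood": ["Canarsie", "Paerdegat Basin"],
    "1st Most Common Venue": ["Caribbean Restaurant", "Business Service"],
    "2nd Most Common Venue": ["Gym", "Asian Restaurant"],
})

venue_columns = toy.drop(columns=["Neighborhood"])

# For each rank column, count the categories whose name mentions 'restaurant'
restaurant_counts = venue_columns.apply(
    lambda col: col.str.contains("restaurant", case=False, na=False).sum()
)

print(restaurant_counts)
```

With `na=False`, a missing category counts as a non-match instead of raising, which the list comprehension in `count_category` would do on a missing value.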

## Examine Clusters


#### Cluster 0


In [107]:
table = count_category(borough_merged.loc[borough_merged['Cluster Labels'] == 0, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]])
table

Unnamed: 0,Total neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,2,0,0,0,0,0,0,0,0,1,0


In [115]:
borough_merged.loc[borough_merged['Cluster Labels'] == 0, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Crown Heights,Museum,Pizza Place,Café,Bagel Shop,Playground,Electronics Store,Coffee Shop,Salon / Barbershop,Candy Store,Bus Station
46,Midwood,Pizza Place,Convenience Store,Pharmacy,Ice Cream Shop,Candy Store,Bagel Shop,Video Game Store,Farmers Market,Fast Food Restaurant,Field


#### Cluster 1


In [108]:
table = count_category(borough_merged.loc[borough_merged['Cluster Labels'] == 1, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]])
table

Unnamed: 0,Total neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,27,8,3,9,5,9,7,10,6,4,3


#### Cluster 2


In [109]:
table = count_category(borough_merged.loc[borough_merged['Cluster Labels'] == 2, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]])
table

Unnamed: 0,Total neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,2,1,1,1,0,1,1,2,0,0,1


In [116]:
borough_merged.loc[borough_merged['Cluster Labels'] == 2, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
28,Canarsie,Caribbean Restaurant,Gym,Thai Restaurant,Bus Line,Food,Asian Restaurant,Yemeni Restaurant,Fish Market,Field,Filipino Restaurant
59,Paerdegat Basin,Business Service,Asian Restaurant,Bus Line,Food,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flower Shop


#### Cluster 3


In [110]:
table = count_category(borough_merged.loc[borough_merged['Cluster Labels'] == 3, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]])
table

Unnamed: 0,Total neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,0,0,1,0,0,0,1,0,1,0


In [117]:
borough_merged.loc[borough_merged['Cluster Labels'] == 3, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,Mill Island,Locksmith,Pool,Yemeni Restaurant,Fish Market,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop


#### Cluster 4


In [111]:
table = count_category(borough_merged.loc[borough_merged['Cluster Labels'] == 4, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]])
table

Unnamed: 0,Total neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,37,7,10,10,8,7,5,11,7,6,12


#### Cluster 5


In [112]:
table = count_category(borough_merged.loc[borough_merged['Cluster Labels'] == 5, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]])
table

Unnamed: 0,Total neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,0,0,0,0,0,0,0,0,0,0


In [118]:
borough_merged.loc[borough_merged['Cluster Labels'] == 5, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
45,Bergen Beach,Harbor / Marina,Athletics & Sports,Baseball Field,Park,Playground,Food Court,Food & Drink Shop,Food,Flower Shop,Fish Market


## 4. Discussion and Conclusion

From the cluster tables above:
- Clusters 0 and 5 (Group A):
    - These neighborhoods have the fewest restaurants: a single restaurant entry (9th place) in Cluster 0 and none in Cluster 5.

- Cluster 3 (Group B):
    - A restaurant is the 3rd most common venue.

- Clusters 1 and 4 (Group C):
    - These areas have a diverse mix of stores, shops and restaurants. About a quarter of the top-5 entries are restaurants (34 of 135 in Cluster 1, 42 of 185 in Cluster 4).

- Cluster 2 (Group D):
    - Restaurants appear as the 1st, 2nd and 3rd most common venues.
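
The restaurant share of the top-5 slots can be checked directly from the count tables in the "Examine Clusters" section. The sketch below copies the neighborhood count and the first five rank counts of each cluster from those tables; `top5_hits` and `shares` are illustrative names.

```python
# cluster label -> (number of neighborhoods, restaurant counts for ranks 1 to 5),
# copied from the count tables printed for each cluster
top5_hits = {
    0: (2, [0, 0, 0, 0, 0]),
    1: (27, [8, 3, 9, 5, 9]),
    2: (2, [1, 1, 1, 0, 1]),
    3: (1, [0, 0, 1, 0, 0]),
    4: (37, [7, 10, 10, 8, 7]),
    5: (1, [0, 0, 0, 0, 0]),
}

# Share of top-5 slots occupied by a restaurant category in each cluster
shares = {
    cluster: sum(hits) / (n_neighborhoods * len(hits))
    for cluster, (n_neighborhoods, hits) in top5_hits.items()
}

for cluster, share in shares.items():
    print('Cluster {}: {:.1%} of top-5 slots are restaurants'.format(cluster, share))
```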

## Conclusion

If a stakeholder wants to open a restaurant:
- Groups B and C are the safest choices, because restaurants are already established in those places.

- Group A is a gamble, because restaurants aren't common in those places.

- In Group D it will be hard to succeed unless the new restaurant stands out from the rest.
