# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

New York is the most populous city in the United States and is well known for business and tourism. People visit New York from all over the world and enjoy many types of cuisine during their stay. To add to that variety, Mr. X wants to open a ‘Caribbean Restaurant’ in New York. The city has 5 boroughs, and Mr. X is unsure where to open the restaurant. His requirements for opening the restaurant are:

-	As part of the business strategy, the area should be known for a wide variety of food.
-	The neighborhood should already have a number of Caribbean restaurants, so that Mr. X can compete directly with them.

The goal of this analysis is to explore the boroughs of New York and help Mr. X find the most appropriate place to open the ‘Caribbean Restaurant’, one that meets the above conditions.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* The number of existing Caribbean restaurants in the neighborhood
* How popular the existing Caribbean restaurants in the neighborhood are


The following data sources will be needed to extract/generate the required information:
* Foursquare location data is the key data requirement for this problem, as it helps in exploring the boroughs of New York and understanding people's preferences across different types of food.
* The ZIP codes for all 5 boroughs will be scraped from the New York State Department of Health website: https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm. This will be the base data for exploring each ZIP code of all boroughs.
* Geographical coordinates (latitude and longitude) for each ZIP code will be taken from the publicly shared dataset at https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/table/


Setting up the libraries

In [137]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

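import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib modules used to build the cluster color palette in create_map
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans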

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Scraping all 5 Boroughs and NY Neighborhoods from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm
 

In [2]:
#Scraping Borough and New York City Neighborhoods from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm

df_list = pd.read_html("https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm")

df_list2 = df_list[0]
df_list2.rename(columns={0: 'Borough', 1: 'Neighborhood',2: 'Zip'}, inplace=True)

# Dropping first cell as heading
df_list2.drop(df_list2[df_list2['Borough'] == 'Borough'].index, inplace = True)

df_list2.head()

Unnamed: 0,Borough,Neighborhood,Zip
1,Bronx,Central Bronx,"10453, 10457, 10460"
2,Bronx Park and Fordham,"10458, 10467, 10468",
3,High Bridge and Morrisania,"10451, 10452, 10456",
4,Hunts Point and Mott Haven,"10454, 10455, 10459, 10474",
5,Kingsbridge and Riverdale,"10463, 10471",


Data Cleaning

In [3]:
#Data Cleaning 

df_list2.reset_index(inplace=True)
df_list3 = df_list2.copy()

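mbor = df_list3.loc[0,'Borough']
# Rows that do not start with a borough name were shifted one column to the left
# by the scrape, so shift them back and carry the last seen borough forward
boroughs = ['Bronx','Brooklyn','Manhattan','Queens','Staten Island']
for row,txt in df_list3.iterrows():
    if txt['Borough'] not in boroughs:
        df_list3.loc[row,'Zip'] = df_list3.loc[row,'Neighborhood']
        df_list3.loc[row,'Neighborhood'] = df_list3.loc[row,'Borough']
        df_list3.loc[row,'Borough'] = mbor
    else:
        mbor = txt['Borough']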
        

        
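# Expand the comma-separated ZIP codes into one row per ZIP code.
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first.
records = []
for row in df_list3.itertuples(index=False):
    # row[1] = Borough, row[2] = Neighborhood, row[3] = Zip (row[0] is the old index)
    for mr in row[3].split(','):
        records.append({'Borough': row[1], 'Neighborhood': row[2], 'Zip': mr.strip()})

df_list4 = pd.DataFrame(records, columns=['Borough', 'Neighborhood', 'Zip'])

df_list4['Zip'] = df_list4['Zip'].astype('int64')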
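
The expansion of comma-separated ZIP codes into one row per ZIP code can also be sketched with `str.split` and `DataFrame.explode`. This is only an illustrative alternative on a toy frame (the frame name `toy` is an assumption; the values are copied from the scraped output shown earlier), not the code the notebook ran.

```python
import pandas as pd

# Toy frame shaped like the cleaned table (values copied from the scraped output)
toy = pd.DataFrame({
    'Borough': ['Bronx', 'Bronx'],
    'Neighborhood': ['Central Bronx', 'Kingsbridge and Riverdale'],
    'Zip': ['10453, 10457, 10460', '10463, 10471'],
})

# Split the ZIP string into a list, then emit one row per list element
expanded = (
    toy.assign(Zip=toy['Zip'].str.split(','))
       .explode('Zip')
       .reset_index(drop=True)
)

# Strip the spaces left over from the split before casting to integers
expanded['Zip'] = expanded['Zip'].str.strip().astype('int64')

print(expanded)
```

The result has one row per ZIP code, with the borough and neighborhood repeated on each row.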



Let's check the volume of boroughs data imported

In [4]:
df_list4.shape

(178, 3)

Load the ZIP code CSV downloaded from https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/table/
 

In [5]:
#Load the ZIP code CSV downloaded from https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/table/
 
df_geo = pd.read_csv("C:/Users/290011261/Documents/coursera_practice/7_IBM_DataScience/9_Capstone/Week4/us-zip-code-latitude-and-longitude.csv",delimiter=';')

df_geo2 = df_geo[['Zip','Latitude','Longitude']].copy()
df_geo2.head()


|   | Zip | Latitude | Longitude |
|---|---|---|---|
| 0 | 67553 | 38.654948 | -99.32062 |
| 1 | 85743 | 32.335122 | -111.14888 |
| 2 | 75016 | 32.767268 | -96.777626 |
| 3 | 60401 | 41.350484 | -87.62408 |
| 4 | 80432 | 39.24344 | -105.79431 |


Merge the latitude and longitude data with the borough data

In [6]:
df_list5 = df_list4.merge(df_geo2, how='left', 
                     left_on=['Zip'], right_on=['Zip'])

df_list5.head()

|   | Borough | Neighborhood | Zip | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Bronx | Central Bronx | 10453 | 40.853017 | -73.91214 |
| 1 | Bronx | Central Bronx | 10457 | 40.846745 | -73.89861 |
| 2 | Bronx | Central Bronx | 10460 | 40.84095 | -73.88036 |
| 3 | Bronx | Bronx Park and Fordham | 10458 | 40.864166 | -73.88881 |
| 4 | Bronx | Bronx Park and Fordham | 10467 | 40.872265 | -73.86937 |


Let's check how many ZIP code rows each borough has after dropping the rows without coordinates

In [7]:
df_list5 = df_list5.dropna(axis=0)

df_list5['Borough'].value_counts()

Queens           61
Manhattan        41
Brooklyn         37
Bronx            25
Staten Island    12
Name: Borough, dtype: int64
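
Before dropping rows with missing coordinates, it can help to see which ZIP codes failed to match in the left merge. The sketch below uses the `indicator` option of `DataFrame.merge` on toy stand-ins for `df_list4` and `df_geo2`; the frames `zips` and `geo` and the ZIP code 99999 are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_list4 and df_geo2; ZIP 99999 is a made-up code with no coordinates
zips = pd.DataFrame({'Borough': ['Bronx', 'Bronx', 'Queens'],
                     'Zip': [10453, 10457, 99999]})
geo = pd.DataFrame({'Zip': [10453, 10457],
                    'Latitude': [40.853017, 40.846745],
                    'Longitude': [-73.91214, -73.89861]})

# indicator=True adds a '_merge' column telling where each row found a match
merged = zips.merge(geo, how='left', on='Zip', indicator=True)

# Rows marked 'left_only' have no coordinates and would be removed by dropna
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Zip'].tolist()
print(unmatched)
```

Listing the unmatched codes makes it explicit which rows `dropna` removes, instead of only observing that the row count shrinks.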

Let's plot the neighborhoods of all boroughs on a map of New York City

In [8]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [9]:
# create map of all 5 different boroughs to visualize their view

map_1 = folium.Map(location=[latitude, longitude], zoom_start=10)

def create_map1(boroughName,mcolor,mhex):
    
    df_map = df_list5[df_list5['Borough'].str.contains(boroughName, regex=False)]
    
    # add markers to map
    for lat, lng, borough, neighborhood in zip(df_map['Latitude'], df_map['Longitude'], df_map['Borough'], df_map['Neighborhood']):
        label = '{}, {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=mcolor,
            fill=True,
            fill_color=mhex,
            fill_opacity=0.7,
            parse_html=False).add_to(map_1)  

    return map_1

create_map1('Manhattan','blue', '#3186cc')
create_map1('Queens','yellow', '#ffff00')
create_map1('Brooklyn','red','#ff0000')
create_map1('Bronx','green','#40ff00')
create_map1('Staten Island','pink','#ff00bf')

The visual looks good:
    - Blue indicates Manhattan
    - Green indicates Bronx
    - Yellow indicates Queens
    - Red indicates Brooklyn
    - Pink indicates Staten Island
    
Except for Staten Island, all boroughs look densely populated and interesting to explore further. Staten Island is somewhat isolated and its markers are scattered, so we will exclude this borough from further analysis.


## Methodology <a name="methodology"></a>

In this project we will direct our efforts toward detecting the areas where Caribbean Restaurants exist and, in particular, how popular they are compared with other venues. We already excluded ‘Staten Island’ in the previous step because it has relatively few neighborhoods and a comparatively small population.

In the first step of our analysis, we will extract all the venues around each ZIP code. We will then determine the venue categories and analyze the most common venue categories of each borough. In the second step, we will segment the venues into clusters and explore the outcome using a Folium map. We will then narrow down the analysis to the areas where Caribbean Restaurants rank among the top venues and conclude which one is the most appropriate for opening the Caribbean Restaurant.


## Analysis <a name="analysis"></a>

Let's first set up the credentials for exploring venues using the Foursquare API

In [216]:
#hidden_Cell

#Credentials
# Replace the placeholders with your own Foursquare credentials; never publish real keys in a notebook

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret

# VERSION and LIMIT are required by getNearbyVenues below
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

Let's create a function to explore venues for each borough. The function will be used repeatedly for each borough

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
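
The function above indexes straight into the JSON response, so a single error response or a venue without a category would raise an exception and abort the whole loop. A more defensive parsing step could look like the sketch below. The helper name `parse_explore_items` and both sample payloads are hypothetical; the only assumption about the API is the nesting that `getNearbyVenues` itself relies on (`response` → `groups[0]` → `items` → `venue`).

```python
def parse_explore_items(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Assumes the JSON layout that getNearbyVenues indexes into:
    payload['response']['groups'][0]['items'][i]['venue'].
    """
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []  # nothing to parse, e.g. an error-style response

    rows = []
    for item in groups[0].get('items', []):
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or [{}]
        rows.append((
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            categories[0].get('name', 'Unknown'),  # venue without a category
        ))
    return rows


# Hypothetical payloads for illustration only
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 40.7, 'lng': -73.9},
               'categories': [{'name': 'Caribbean Restaurant'}]}},
    {'venue': {'name': 'Shop B', 'location': {'lat': 40.8, 'lng': -73.8},
               'categories': []}},
]}]}}
error_like = {'meta': {'code': 429}}

print(parse_explore_items(sample))
print(parse_explore_items(error_like))
```

With a helper like this, a failed request contributes an empty list for that ZIP code rather than stopping the run.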

Now let's write the code that runs the above function for each borough and builds a new dataframe called df_venues.

- Filter for each borough 
- Do the one-hot encoding and create the DataFrame of the top 10 most common venues in each neighborhood
- Segment the neighborhoods using 'K-Means' clustering

In [12]:
def run_borough_analysis(mborough):
    # filter the borough df
    df_bor = df_list5[df_list5['Borough'].str.contains(mborough, regex=False)]

    df_venues = getNearbyVenues(names=df_bor['Neighborhood'],
                                       latitudes=df_bor['Latitude'],
                                       longitudes=df_bor['Longitude']
                                      )



    # one hot encoding
    df_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    df_onehot['Neighborhood'] = df_venues['Neighborhood'] 

    # move neighborhood column to the first column
    fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
    df_onehot = df_onehot[fixed_columns]

    # Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
    df_grouped = df_onehot.groupby('Neighborhood').mean().reset_index()

    #First, let's write a function to sort the venues in descending order.
    def return_most_common_venues(row, num_top_venues):
        row_categories = row.iloc[1:]
        row_categories_sorted = row_categories.sort_values(ascending=False)

        return row_categories_sorted.index.values[0:num_top_venues]

    # Now let's create the new dataframe and display the top 10 venues for each neighborhood.

    num_top_venues = 10

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = df_grouped['Neighborhood']

    for ind in np.arange(df_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)



    # Run k-means to cluster the neighborhood into 5 clusters.

    # set number of clusters
    kclusters = 5

    df_grouped_clustering = df_grouped.drop(columns='Neighborhood')

    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_grouped_clustering)

    # Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

    # add clustering labels
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

    df_merged = df_bor

    # join the cluster labels and top venues onto the borough rows, which carry latitude/longitude for each ZIP code
    df_merged = df_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

    return df_merged
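
To make the one-hot and frequency steps inside `run_borough_analysis` concrete, here is a small sketch on a toy venue list. The neighborhood names `A` and `B` and the venue counts are invented for illustration; only the technique (indicator columns, per-neighborhood mean, sort by share) mirrors the function above.

```python
import pandas as pd

# Toy venue list; neighborhood names and counts are illustrative only
venues = pd.DataFrame({
    'Neighborhood': ['A'] * 6 + ['B'] * 3,
    'Venue Category': ['Pizza Place', 'Pizza Place', 'Pizza Place',
                       'Bank', 'Bank', 'Caribbean Restaurant',
                       'Caribbean Restaurant', 'Caribbean Restaurant', 'Bank'],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Mean of the indicators = share of each category within a neighborhood
freq = onehot.groupby('Neighborhood').mean()

# Top-2 categories per neighborhood, sorted by share
top2 = {name: row.sort_values(ascending=False).index[:2].tolist()
        for name, row in freq.iterrows()}

print(freq.round(2))
print(top2)
```

Each row of `freq` sums to 1, so the rows fed to K-Means describe the mix of venue categories in a neighborhood rather than the absolute number of venues.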


In [138]:
# Merge the output from above function into one DF

df_borough_analysis = pd.concat([run_borough_analysis('Manhattan'),run_borough_analysis('Queens'),
                                 run_borough_analysis('Brooklyn'),run_borough_analysis('Bronx')], sort=True)
                            


Central Harlem
Central Harlem
Central Harlem
Central Harlem
Central Harlem
Chelsea and Clinton
Chelsea and Clinton
Chelsea and Clinton
Chelsea and Clinton
Chelsea and Clinton
Chelsea and Clinton
East Harlem
East Harlem
Gramercy Park and Murray Hill
Gramercy Park and Murray Hill
Gramercy Park and Murray Hill
Gramercy Park and Murray Hill
Greenwich Village and Soho
Greenwich Village and Soho
Greenwich Village and Soho
Lower Manhattan
Lower Manhattan
Lower Manhattan
Lower Manhattan
Lower Manhattan
Lower Manhattan
Lower East Side
Lower East Side
Lower East Side
Upper East Side
Upper East Side
Upper East Side
Upper East Side
Upper West Side
Upper West Side
Upper West Side
Inwood and Washington Heights
Inwood and Washington Heights
Inwood and Washington Heights
Inwood and Washington Heights
Inwood and Washington Heights
Northeast Queens
Northeast Queens
Northeast Queens
Northeast Queens
North Queens
North Queens
North Queens
North Queens
North Queens
North Queens
North Queens
Central Queens


Let's look at the number of rows for each borough

In [140]:

df_borough_analysis['Borough'].value_counts()

Queens       61
Manhattan    41
Brooklyn     37
Bronx        25
Name: Borough, dtype: int64

Let's create a function for plotting a Folium map of the clusters of each individual borough

In [141]:
# create map
def create_map(dfname, noclusters):
    map_clusters = folium.Map(title="Your map title", location=[latitude, longitude], zoom_start=10)
    kclusters = noclusters  # use the requested number of clusters instead of a hard-coded 5

    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster, mb in zip(dfname['Latitude'], dfname['Longitude'], dfname['Neighborhood'], dfname['Cluster Labels'], dfname['Borough']):
        label = folium.Popup(str(mb) + ': '+ str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)

    return map_clusters



In [182]:
#Clusters Map - Manhattan

create_map (df_borough_analysis[df_borough_analysis['Borough'] == 'Manhattan'], 5)

The cluster map of Manhattan looks good. The colored clusters break down as follows:
    - Red: Italian and American restaurants and coffee shops.
    - Purple: Parks, gyms, coffee shops and wine bars.
    - Blue: Mexican restaurants, pharmacies, supermarkets etc.
    - Light blue: Chinese restaurants, coffee shops, grocery stores etc.
    - Orange: Theaters, coffee shops, hotels etc.

However, none of the Manhattan clusters is characterized by Caribbean restaurants.

In [17]:
#Clusters Map - Queens

create_map (df_borough_analysis[df_borough_analysis['Borough'] == 'Queens'], 5)

In the Queens map, the red cluster comes out as a very dense and strong one. Let's explore all clusters in the Queens map:
    - Red: Chinese and Korean restaurants, pizza places and bakeries, pharmacies, supermarkets etc.
    - Purple: Caribbean restaurants, fast food restaurants, sandwich and pizza places, pharmacies etc.
    - Blue: Beaches, delis/bodegas, department and discount stores etc.
    - Light blue: Sandwich and pizza places, ice cream shops, banks etc.
    - Orange: Chinese restaurants, donut shops, bus stations etc.

Interestingly, the second cluster shows Caribbean restaurants; we will explore this further later.

In [18]:
#Clusters Map - Brooklyn

create_map (df_borough_analysis[df_borough_analysis['Borough'] == 'Brooklyn'], 5)

Now let's look at the clusters of Brooklyn:
    - Red: Bars, Caribbean restaurants, coffee shops, chicken places, Mexican restaurants etc.
    - Purple: Pizza places, Italian and Chinese restaurants, pharmacies, bakeries, mobile phone shops etc.
    - Blue: Pizza places, supermarkets, discount stores, playgrounds etc.
    - Light blue: Caribbean restaurants, Chinese restaurants, banks, mobile phone shops etc.
    - Orange: Chinese restaurants, pizza places, Italian and American restaurants, supermarkets, pharmacies etc.



In [19]:
#Clusters Map - Bronx

create_map (df_borough_analysis[df_borough_analysis['Borough'] == 'Bronx'], 5)

Now let's look at the clusters of the Bronx:
    - Red: Pizza places, donut and coffee shops, pharmacies, Latin American, Caribbean and Chinese restaurants, banks etc.
    - Purple: Latin American restaurants, bars, pharmacies, pizza and donut shops, parks etc.
    - Blue: Pizza places, gyms, Mexican and Spanish restaurants etc.
    - Light blue: Pizza places, Caribbean restaurants, pharmacies, Chinese restaurants, gas stations etc.
    - Orange: Fast food restaurants, ice cream shops, pizza places, bars etc.

So far we have gained a good understanding of the types of venues in each cluster and each borough. We have also seen where Caribbean restaurants exist and which venues are located near them. The next step is to explore only the Caribbean restaurants across the boroughs on the map.

In [207]:
df_carb = df_borough_analysis[df_borough_analysis.apply(lambda row: row.astype(str).str.contains('Caribbean Restaurant').any(), axis=1)]
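
The filter above casts every cell of every row to a string, including coordinates and cluster labels. A narrower variant, sketched here on a toy frame, searches only the ranking columns. The frame `df`, the list `venue_cols` and the result `df_carb_toy` are hypothetical names; note that this version matches the category exactly, whereas the original matches any cell containing the text.

```python
import pandas as pd

# Toy frame with the same kind of columns as df_borough_analysis (values illustrative)
df = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    '1st Most Common Venue': ['Pizza Place', 'Caribbean Restaurant', 'Bank'],
    '2nd Most Common Venue': ['Caribbean Restaurant', 'Bank', 'Pharmacy'],
})

# Restrict the search to the ranking columns instead of casting every cell to str
venue_cols = [c for c in df.columns if c.endswith('Most Common Venue')]

# True where any ranking column equals the category exactly
mask = df[venue_cols].eq('Caribbean Restaurant').any(axis=1)

# .copy() so that later column assignments do not trigger SettingWithCopyWarning
df_carb_toy = df[mask].copy()

print(df_carb_toy['Neighborhood'].tolist())
```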

What is the size of the DataFrame that contains only the rows with Caribbean Restaurant?

In [208]:
df_carb.shape

(36, 16)

Let's create the map only for Caribbean Restaurants

In [209]:
create_map (df_carb, 5)

The above map looks interesting. Caribbean Restaurants appear in Brooklyn, Queens and the Bronx, and only marginally in Manhattan (East Harlem, where Caribbean Restaurant is the 10th most common venue category in the listing that follows). As a next step, we would like to see the popularity rank of Caribbean Restaurants in the neighborhoods where they appear.



In [210]:
# let's put the columns in order first

fixed_columns=list(df_carb.columns[10:16])+list(df_carb.columns[1:10])+list(df_carb.columns[0:1])
df_carb2 = df_carb[fixed_columns].copy()  # copy so that new columns can be added without SettingWithCopyWarning
df_carb2.head()

Unnamed: 0,Borough,Cluster Labels,Latitude,Longitude,Neighborhood,Zip,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
73,Manhattan,2,40.791586,-73.94575,East Harlem,10029,Mexican Restaurant,Pharmacy,Latin American Restaurant,Thai Restaurant,Sandwich Place,Bakery,Deli / Bodega,Supermarket,Donut Shop,Caribbean Restaurant
74,Manhattan,2,40.802395,-73.93359,East Harlem,10035,Mexican Restaurant,Pharmacy,Latin American Restaurant,Thai Restaurant,Sandwich Place,Bakery,Deli / Bodega,Supermarket,Donut Shop,Caribbean Restaurant
119,Queens,4,40.697188,-73.75948,Jamaica,11412,Chinese Restaurant,Donut Shop,Bus Station,Fast Food Restaurant,Bank,Pizza Place,Southern / Soul Food Restaurant,Deli / Bodega,Sandwich Place,Caribbean Restaurant
120,Queens,4,40.714261,-73.76824,Jamaica,11423,Chinese Restaurant,Donut Shop,Bus Station,Fast Food Restaurant,Bank,Pizza Place,Southern / Soul Food Restaurant,Deli / Bodega,Sandwich Place,Caribbean Restaurant
121,Queens,4,40.714144,-73.79324,Jamaica,11432,Chinese Restaurant,Donut Shop,Bus Station,Fast Food Restaurant,Bank,Pizza Place,Southern / Soul Food Restaurant,Deli / Bodega,Sandwich Place,Caribbean Restaurant


Assign a rank and a marker color based on the position of Caribbean Restaurant among the top common venues

In [211]:

for ind, col in enumerate(df_carb2.columns[6:16]):
    df_carb2.loc[df_carb2[col].str.contains('Caribbean Restaurant'), 'rank'] = ind+1

df_carb2.loc[df_carb2['rank'] == 1, 'marker_color'] = 'darkgreen'
df_carb2.loc[(df_carb2['rank'] >= 2) & (df_carb2['rank'] <= 3), 'marker_color'] = 'pink'
df_carb2.loc[(df_carb2['rank'] >= 4) & (df_carb2['rank'] <= 5), 'marker_color'] = 'orange'
df_carb2.loc[df_carb2['rank'] >= 6, 'marker_color'] = 'red'



Let's create the map of Caribbean Restaurants based on their rank

In [212]:
map_2 = folium.Map(title="Your map title", location=[latitude, longitude], zoom_start=10)


# add markers to the map, colored by the rank of 'Caribbean Restaurant'
for lat, lon, poi, rank, mb, mcolor in zip(df_carb2['Latitude'], df_carb2['Longitude'], df_carb2['Neighborhood'], df_carb2['rank'], df_carb2['Borough'], df_carb2['marker_color']):
    label = folium.Popup(str(mb) + ': '+ str(poi) + ' Rank ' + str(int(rank)), parse_html=True)
    folium.Marker(
        [lat, lon],
        popup=label,
        icon=folium.Icon(color=mcolor)).add_to(map_2)
map_2



As per the above map:
- Green icons mark the locations where Caribbean Restaurant is the most common venue category (rank 1)
- Pink icons mark ranks 2 to 3 and orange icons mark ranks 4 to 5
- Red icons mark ranks 6 to 10, which should be ruled out because of their lower popularity
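
The rank and color assignment can also be expressed without a loop over columns. The sketch below is an illustration on a toy frame with only three ranking columns (all values invented); it uses the same bands as the cell that assigns `marker_color` (rank 1, ranks 2 to 3, ranks 4 to 5, ranks 6 to 10) and `pd.cut` to map ranks to colors.

```python
import pandas as pd

# Toy ranking table (values illustrative); the last row has no Caribbean Restaurant
df = pd.DataFrame({
    '1st Most Common Venue': ['Caribbean Restaurant', 'Bank', 'Pizza Place'],
    '2nd Most Common Venue': ['Bank', 'Pizza Place', 'Bank'],
    '3rd Most Common Venue': ['Pizza Place', 'Caribbean Restaurant', 'Pharmacy'],
})
venue_cols = list(df.columns)

# Boolean matrix: True where a ranking column holds the category
hits = df[venue_cols].eq('Caribbean Restaurant').to_numpy()

# argmax gives the first True per row; rows without any hit become NaN
df['rank'] = pd.Series(hits.argmax(axis=1) + 1, index=df.index).where(hits.any(axis=1))

# Right-inclusive bins: (0, 1], (1, 3], (3, 5], (5, 10]
df['marker_color'] = pd.cut(df['rank'], bins=[0, 1, 3, 5, 10],
                            labels=['darkgreen', 'pink', 'orange', 'red'])

print(df[['rank', 'marker_color']])
```

Keeping the bands in a single `bins`/`labels` pair makes it harder for the legend and the code to drift apart.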


Based on this output, we will focus on the Rank 1 locations when suggesting where to open the Caribbean Restaurant. Let's see the list of those locations:

In [213]:
df_carb2.loc[df_carb2['rank'] == 1,['Borough','Neighborhood', 'Cluster Labels','Latitude','Longitude','Zip']]

|   | Borough | Neighborhood | Cluster Labels | Latitude | Longitude | Zip |
|---|---|---|---|---|---|---|
| 142 | Queens | Southeast Queens | 1 | 40.742944 | -73.70956 | 11004 |
| 143 | Queens | Southeast Queens | 1 | 40.756983 | -73.7148 | 11005 |
| 144 | Queens | Southeast Queens | 1 | 40.693538 | -73.73574 | 11411 |
| 145 | Queens | Southeast Queens | 1 | 40.670138 | -73.75141 | 11413 |
| 146 | Queens | Southeast Queens | 1 | 40.662538 | -73.73514 | 11422 |
| 147 | Queens | Southeast Queens | 1 | 40.732239 | -73.72108 | 11426 |
| 148 | Queens | Southeast Queens | 1 | 40.728235 | -73.74782 | 11427 |
| 149 | Queens | Southeast Queens | 1 | 40.719981 | -73.74127 | 11428 |
| 150 | Queens | Southeast Queens | 1 | 40.708833 | -73.73903 | 11429 |
| 49 | Brooklyn | Flatbush | 3 | 40.649059 | -73.93304 | 11203 |


## Results and Discussion <a name="results"></a>

Our analysis shows that although Southeast Queens has a large number of locations where Caribbean Restaurant ranks first, Flatbush in Brooklyn appears to be the better option. The reason is the higher competition in Queens, which has more restaurants than Brooklyn. Our recommendation is therefore to open the Caribbean Restaurant in the Flatbush neighborhood of Brooklyn.
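
The comparison behind this recommendation can be reproduced by counting the rank-1 rows per neighborhood. The sketch below rebuilds the rank-1 list from the ZIP codes shown in the previous table; the frame name `rank1` is an assumption.

```python
import pandas as pd

# Rank-1 rows copied from the preceding table (Borough, Neighborhood, Zip)
rank1 = pd.DataFrame(
    [('Queens', 'Southeast Queens', z) for z in
     (11004, 11005, 11411, 11413, 11422, 11426, 11427, 11428, 11429)]
    + [('Brooklyn', 'Flatbush', 11203)],
    columns=['Borough', 'Neighborhood', 'Zip'],
)

# Number of rank-1 ZIP codes per neighborhood
counts = rank1.groupby(['Borough', 'Neighborhood'])['Zip'].nunique()
print(counts)
```

One caveat when reading these counts: the top venue categories are computed per neighborhood and then joined back onto every ZIP code row, so the nine Southeast Queens rows share a single neighborhood-level ranking and do not represent nine separate measurements.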

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify the best place to open a Caribbean Restaurant among the 5 boroughs of New York, i.e. Manhattan, Brooklyn, Queens, the Bronx and Staten Island. When we plotted the neighborhoods of all 5 boroughs on a single map, we found that Staten Island has a relatively low population and low density, so it is not the optimal choice. We therefore ruled out Staten Island and focused on the remaining 4 boroughs. We identified the top 10 venue categories of each neighborhood and segmented the neighborhoods using K-Means clustering. The clustering output helped us segment each borough by common venue categories. We then identified the locations where Caribbean Restaurants are present and analyzed them on a Folium map. Finally, we found that Caribbean Restaurants are most popular in Southeast Queens and in Flatbush, Brooklyn. However, Queens has a relatively large number of restaurants, which may mean stiff competition, so Flatbush in Brooklyn is the optimal choice.

The final decision on the optimal restaurant location will be made by stakeholders based on the specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), noise levels and proximity to major roads, real estate availability, prices, and the social and economic dynamics of each neighborhood.
