<h1>Final Project</h1>

<h2>Selecting a suitable neighborhood in San Jose, California, for opening a new Indian Restaurant</h2>

<ul>
<li>Build a dataframe of neighborhoods in San Jose, California by scraping the data from a Wikipedia page</li>
<li>Get the geographical coordinates of the neighborhoods</li>
<li>Obtain the venue data for the neighborhoods from the Foursquare API</li>
<li>Explore and cluster the neighborhoods</li>
<li>Select the best cluster in which to open a new Indian Restaurant</li>
</ul>

<h3>1. Import Libraries</h3>

In [105]:
import numpy as np
import pandas as pd
import json as js
import requests
import os

#!conda install -c conda-forge geocoder
import geocoder



#!conda install -c conda-forge folium=0.5.0
import folium

from bs4 import BeautifulSoup
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

#!conda install -c conda-forge geopy
import geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy.exc import GeocoderTimedOut


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.pyplot as plt # plotting library
import matplotlib.colors as colors

from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs # the samples_generator submodule no longer exists in recent scikit-learn

print('Libraries imported.')

Libraries imported.


<h3>2. Scrape data from Wikipedia page into a DataFrame</h3>

In [106]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Neighborhoods_in_San_Jose,_California").text

In [107]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [108]:
# create a list to store neighborhood data
neighborhoodList = []

In [109]:
# append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text)

In [110]:
# create a new DataFrame from the list
sjc_df = pd.DataFrame({"Neighborhood": neighborhoodList})

sjc_df.head()

Unnamed: 0,Neighborhood
0,"The Alameda, San Jose"
1,"Almaden Valley, San Jose"
2,"Alum Rock, San Jose"
3,"Alviso, San Jose"
4,"Berryessa, San Jose"


In [112]:
# print the number of rows of the dataframe
sjc_df.shape

(46, 1)
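
The scraped titles come in several shapes ("The Alameda, San Jose", "Cambrian Park, California", "Downtown Historic District (San Jose, California)", "Santana Row"), which makes them ambiguous as geocoding queries. The helper below is a minimal sketch, not part of the original notebook, that rewrites each title as "neighborhood, city, state". The function name and the assumption that every title belongs to San Jose, California are illustrative only.

```python
import re


def to_geocoding_query(name, city="San Jose", state="California"):
    """Turn a Wikipedia category title into a 'place, city, state' query."""
    # drop a trailing parenthetical such as "(San Jose, California)"
    base = re.sub(r"\s*\(.*\)\s*$", "", name)
    # keep only the part before the first comma, i.e. the neighborhood itself
    base = base.split(",")[0].strip()
    # assumption: every scraped title refers to a place in the given city
    return "{}, {}, {}".format(base, city, state)


for title in ["The Alameda, San Jose",
              "Cambrian Park, California",
              "Downtown Historic District (San Jose, California)",
              "Santana Row"]:
    print(to_geocoding_query(title))
```

The normalized strings could then be passed to the geocoder in place of the raw titles.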

<h3>3. Get the geographical coordinates</h3>

In [113]:
geolocator = Nominatim(user_agent="Capstone")
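
GeocoderTimedOut is imported in section 1 but never handled, so a single timeout would abort the whole geocoding loop. The following is a hedged sketch of a generic retry wrapper; `call_with_retry` and its parameters are hypothetical names introduced here, and the commented usage line shows how it might be combined with the geolocator above.

```python
import time


def call_with_retry(func, *args, retries=3, delay=1.0, retry_on=(Exception,), **kwargs):
    """Call func(*args, **kwargs), retrying when one of retry_on is raised."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay)  # pause before trying again


# Hypothetical usage with the objects defined in this notebook:
# location = call_with_retry(geolocator.geocode, "Alviso, San Jose",
#                            retry_on=(GeocoderTimedOut,))
```

Passing the exception types explicitly keeps the wrapper independent of geopy, so it can be exercised with any callable.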

In [115]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ geolocator.geocode(neighborhood) for neighborhood in sjc_df["Neighborhood"].tolist() ]
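
Free-text queries can resolve to places far from San Jose, so it may help to reject results that fall outside the city. The sketch below checks each coordinate pair against a bounding box; the box values are rough assumptions chosen for illustration, not figures from this notebook. geopy's Nominatim geocoder also accepts `viewbox` and `bounded` arguments that may be able to restrict the search itself, which is not shown here.

```python
# Approximate bounding box around San Jose (assumed values for illustration)
SAN_JOSE_BBOX = {"south": 37.12, "north": 37.47, "west": -122.05, "east": -121.59}


def in_bbox(lat, lon, bbox=SAN_JOSE_BBOX):
    """Return True when the point lies inside the bounding box."""
    return (bbox["south"] <= lat <= bbox["north"]
            and bbox["west"] <= lon <= bbox["east"])


def keep_if_plausible(point, bbox=SAN_JOSE_BBOX):
    """Return (lat, lon) if it lies inside the box, otherwise None."""
    if point is None:
        return None
    lat, lon = point
    return (lat, lon) if in_bbox(lat, lon, bbox) else None


# coordinates copied from the geocoder output in this notebook
samples = {
    "The Alameda": (37.3381443, -121.9198405),
    "Blossom Valley": (32.863939, -116.8605804),
    "Buena Vista": (20.9601558, -76.9300667),
}
for name, point in samples.items():
    print(name, keep_if_plausible(point))
```

Points that come back as None can be treated like failed lookups and dropped before mapping or querying venues.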

In [116]:
coords

[Location(The Alameda, College Park, Rose Garden, San Jose, Santa Clara County, California, 95191, United States of America, (37.3381443, -121.9198405, 0.0)),
 Location(Almaden Valley, San Jose, Santa Clara County, California, 95120, United States, (37.2216072, -121.8617567, 0.0)),
 Location(Alum Rock, San Jose, Santa Clara County, California, 95127:95148, United States, (37.3660513, -121.8271756, 0.0)),
 Location(Alviso, San Jose, Santa Clara County, California, 95002, United States, (37.426051, -121.9752373, 0.0)),
 Location(Berryessa, San Jose, Santa Clara County, California, 93133, United States, (37.3863287, -121.8605104, 0.0)),
 Location(Blossom Valley, San Diego County, California, United States, (32.863939, -116.8605804, 0.0)),
 Location(San José, Buena Vista, Las Tunas, 75200, Cuba, (20.9601558, -76.9300667, 0.0)),
 Location(Burbank House, Arguello Mall, Lucie Stern Hall, Stanford, Santa Clara County, California, 94304, United States of America, (37.4240601, -122.165268884609,

In [117]:
dir(coords[0])

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_address',
 '_point',
 '_raw',
 '_tuple',
 'address',
 'altitude',
 'latitude',
 'longitude',
 'point',
 'raw']

In [118]:
print (coords[0].address)

The Alameda, College Park, Rose Garden, San Jose, Santa Clara County, California, 95191, United States of America


In [119]:
# collect (latitude, longitude) tuples in a temporary list; failed lookups are marked with 'NaN' strings

df_temp = []
for coord in coords:
    if coord is not None:
        df_temp.append((coord.latitude, coord.longitude))
    else:
        df_temp.append(("NaN", "NaN"))

print (df_temp)

[(37.3381443, -121.9198405), (37.2216072, -121.8617567), (37.3660513, -121.8271756), (37.426051, -121.9752373), (37.3863287, -121.8605104), (32.863939, -116.8605804), (20.9601558, -76.9300667), (37.4240601, -122.165268884609), (37.2564495, -121.931483923164), ('NaN', 'NaN'), (37.2863146, -121.8584326), (39.25596895, -123.213647574184), (37.3355895, -121.888761), (37.3359104, -121.8910758), (37.3894065, -121.835877448829), (29.580961, -81.70915), (37.2649431, -121.8180072), (37.3096638, -121.7835622), (37.3489602, -121.8942303), (36.59452545, -87.4174999883608), (37.3509298, -121.8598424), (37.3612014, -121.889860250647), (37.31889, -121.8162349), (40.7498417, -73.984251), (37.3384015, -121.874715), (33.4155389, -111.8782198), ('NaN', 'NaN'), (1.3603802, 103.877476531688), (37.3339826, -121.9240708), (37.3361905, -121.8905833), (37.3361188, -121.893828161873), (9.3347401, -83.5716552), (37.3209172, -121.948158209277), (37.2906975, -121.84104595283), (37.3412453, -121.9168804), ('NaN', '

In [120]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(df_temp, columns=['Latitude', 'Longitude'])

print (df_coords)

   Latitude Longitude
0   37.3381   -121.92
1   37.2216  -121.862
2   37.3661  -121.827
3   37.4261  -121.975
4   37.3863  -121.861
5   32.8639  -116.861
6   20.9602  -76.9301
7   37.4241  -122.165
8   37.2564  -121.931
9       NaN       NaN
10  37.2863  -121.858
11   39.256  -123.214
12  37.3356  -121.889
13  37.3359  -121.891
14  37.3894  -121.836
15   29.581  -81.7091
16  37.2649  -121.818
17  37.3097  -121.784
18   37.349  -121.894
19  36.5945  -87.4175
20  37.3509   -121.86
21  37.3612   -121.89
22  37.3189  -121.816
23  40.7498  -73.9843
24  37.3384  -121.875
25  33.4155  -111.878
26      NaN       NaN
27  1.36038   103.877
28   37.334  -121.924
29  37.3362  -121.891
30  37.3361  -121.894
31  9.33474  -83.5717
32  37.3209  -121.948
33  37.2907  -121.841
34  37.3412  -121.917
35      NaN       NaN
36  44.9312   -89.016
37  43.1676   131.938
38  33.3949  -111.878
39      NaN       NaN
40  6.09591  -74.8889
41  32.3744  -104.226
42 -27.5014  -58.8064
43  37.3085  -121.901
44  39.302

In [121]:
# merge the coordinates into the original dataframe
sjc_df['Latitude'] = df_coords['Latitude']
sjc_df['Longitude'] = df_coords['Longitude']

In [91]:
print (sjc_df)

                                         Neighborhood Latitude Longitude
0                               The Alameda, San Jose  37.3381   -121.92
1                            Almaden Valley, San Jose  37.2216  -121.862
2                                 Alum Rock, San Jose  37.3661  -121.827
3                                    Alviso, San Jose  37.4261  -121.975
4                                 Berryessa, San Jose  37.3863  -121.861
5                            Blossom Valley, San Jose  32.8639  -116.861
6                               Buena Vista, San Jose  20.9602  -76.9301
7             Burbank, Santa Clara County, California  37.4241  -122.165
8                           Cambrian Park, California  37.2564  -121.931
10                      Communications Hill, San Jose  37.2863  -121.858
11                          Coyote Valley, California   39.256  -123.214
12  Downtown Historic District (San Jose, California)  37.3356  -121.889
13                                  Downtown San Jo

In [122]:
# drop the rows whose geocoding failed (marked with the string 'NaN'); the original index is kept
sjc_df = sjc_df[sjc_df.Latitude != 'NaN']

print (sjc_df)

                                         Neighborhood Latitude Longitude
0                               The Alameda, San Jose  37.3381   -121.92
1                            Almaden Valley, San Jose  37.2216  -121.862
2                                 Alum Rock, San Jose  37.3661  -121.827
3                                    Alviso, San Jose  37.4261  -121.975
4                                 Berryessa, San Jose  37.3863  -121.861
5                            Blossom Valley, San Jose  32.8639  -116.861
6                               Buena Vista, San Jose  20.9602  -76.9301
7             Burbank, Santa Clara County, California  37.4241  -122.165
8                           Cambrian Park, California  37.2564  -121.931
10                      Communications Hill, San Jose  37.2863  -121.858
11                          Coyote Valley, California   39.256  -123.214
12  Downtown Historic District (San Jose, California)  37.3356  -121.889
13                                  Downtown San Jo

<h3>4. Create a map of San Jose, CA, with neighborhoods superimposed on top</h3>

In [123]:
# get the coordinates of San Jose, CA
address = 'San Jose, California'


location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Jose, CA are {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Jose, CA are 37.3361905, -121.8905833.


In [124]:
# create map of San Jose, CA using latitude and longitude values
map_sjc = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(sjc_df['Latitude'], sjc_df['Longitude'], sjc_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sjc)  
    
map_sjc

<h3>5. Use the Foursquare API to explore the neighborhoods</h3>

In [125]:
CLIENT_ID = 'C1TBKEYWPTCQUSIDXZKENUKS3JR5JPEM1ZS0U1JNFTUDIWMQ' # Foursquare ID
CLIENT_SECRET = 'W3W4MTVTE3QCVJXTY1DLZ04QBTHM2DR42BWSBDR5KQDHG14S' # Foursquare Secret
VERSION = '20180605'


print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: C1TBKEYWPTCQUSIDXZKENUKS3JR5JPEM1ZS0U1JNFTUDIWMQ
CLIENT_SECRET: W3W4MTVTE3QCVJXTY1DLZ04QBTHM2DR42BWSBDR5KQDHG14S
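
Credentials written into a notebook cell and printed in its output are visible to anyone who can read the notebook. One common alternative is to read them from environment variables and print only a masked form. The sketch below assumes hypothetical variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET; the helper functions are illustrative and not part of the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare ID and secret from a mapping of environment variables."""
    try:
        # hypothetical variable names, chosen for this example
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


def mask(secret, visible=4):
    """Show only the last few characters of a secret when printing it."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]


# demo with a stand-in mapping instead of the real environment
fake_env = {"FOURSQUARE_CLIENT_ID": "EXAMPLE_ID_1234",
            "FOURSQUARE_CLIENT_SECRET": "EXAMPLE_SECRET_5678"}
client_id, client_secret = load_foursquare_credentials(fake_env)
print("CLIENT_ID:", mask(client_id))
print("CLIENT_SECRET:", mask(client_secret))
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read from the real environment.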


<b>Now, let's get the top 100 venues that are within a radius of 2000 meters.</b>

In [126]:
radius = 2000
LIMIT = 100

venues = []

for lat, lng, neighborhood in zip(sjc_df['Latitude'], sjc_df['Longitude'], sjc_df['Neighborhood']):
    
    # create the API request URL for the Foursquare "explore" endpoint
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        lng,
        radius, 
        LIMIT)
    
    # make the GET request and keep only the list of recommended venues
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # keep only the relevant fields for each nearby venue:
    # neighborhood name and coordinates, venue name and coordinates, first category
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            lng, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
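
The loop above indexes straight into the response, so an error payload or a venue with no category would raise an exception partway through. The function below is a sketch of a more defensive flattening step. It assumes the same response layout the loop relies on (`response` → `groups` → `items` → `venue`); the function name and the "Unknown" placeholder are hypothetical choices.

```python
def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one 'explore' response into (neighborhood, ..., category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to a placeholder when a venue has no category
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# hand-written payload that mimics the layout used in the loop above
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Central YMCA",
               "location": {"lat": 37.337796, "lng": -121.919896},
               "categories": [{"name": "Gym"}]}},
    {"venue": {"name": "Uncategorized Place",  # made-up venue for the demo
               "location": {"lat": 37.34, "lng": -121.92},
               "categories": []}},
]}]}}
for row in extract_venues(sample, "The Alameda, San Jose", 37.338144, -121.919841):
    print(row)
```

An empty or malformed payload yields an empty list rather than an exception, so the outer loop can continue with the next neighborhood.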

In [127]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(2732, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,"The Alameda, San Jose",37.338144,-121.919841,Central YMCA,37.337796,-121.919896,Gym
1,"The Alameda, San Jose",37.338144,-121.919841,Luna Mexican Restaurant,37.333935,-121.91518,Mexican Restaurant
2,"The Alameda, San Jose",37.338144,-121.919841,Zona Rosa,37.333079,-121.914073,Mexican Restaurant
3,"The Alameda, San Jose",37.338144,-121.919841,Trader Joe's,37.340948,-121.909405,Grocery Store
4,"The Alameda, San Jose",37.338144,-121.919841,San Jose Municipal Rose Garden,37.331812,-121.92863,Garden


<b>Let's check how many venues were returned for each neighborhood</b>

In [128]:
venues_df.groupby(["Neighborhood"]).count()

Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
"Almaden Valley, San Jose",54,54,54,54,54,54
"Alum Rock, San Jose",100,100,100,100,100,100
"Alviso, San Jose",25,25,25,25,25,25
"Berryessa, San Jose",75,75,75,75,75,75
"Blossom Valley, San Jose",11,11,11,11,11,11
"Buena Vista, San Jose",1,1,1,1,1,1
"Burbank, Santa Clara County, California",100,100,100,100,100,100
"Cambrian Park, California",100,100,100,100,100,100
"College Park, San Jose",100,100,100,100,100,100
"Communications Hill, San Jose",92,92,92,92,92,92


<b>Let's find out how many unique categories can be curated from all the returned venues</b>

In [129]:

print('There are {} unique categories.'.format(venues_df['VenueCategory'].nunique()))

There are 275 unique categories.


In [130]:
# show the first 50 categories
venues_df['VenueCategory'].unique()[:50]

array(['Gym', 'Mexican Restaurant', 'Grocery Store', 'Garden',
       'Breakfast Spot', 'New American Restaurant', 'Wine Bar',
       'Sandwich Place', 'Museum', 'Coffee Shop', 'Burger Joint',
       'Frozen Yogurt Shop', 'Taco Place', 'Pet Store', 'Pizza Place',
       'Playground', 'Soccer Stadium', 'Buffet', 'Café',
       'Chinese Restaurant', 'Trail', 'Thai Restaurant',
       'Bubble Tea Shop', 'Bakery', 'Smoke Shop', 'Gym / Fitness Center',
       'Hockey Arena', 'Nail Salon', 'Park', 'Furniture / Home Store',
       'Yoga Studio', 'Shipping Store', 'Sporting Goods Shop',
       'Sushi Restaurant', 'Garden Center', 'Used Bookstore',
       'Cajun / Creole Restaurant', 'BBQ Joint', 'Clothing Store',
       'Health & Beauty Service', 'Pharmacy', 'Juice Bar',
       'Martial Arts Dojo', 'Automotive Shop', 'Ethiopian Restaurant',
       'Gift Shop', 'Mediterranean Restaurant', 'Ice Cream Shop',
       'Thrift / Vintage Store', 'Ramen Restaurant'], dtype=object)

In [148]:

# check if the results contain "Indian Restaurant"
"Indian Restaurant" in venues_df['VenueCategory'].unique()

True

<h3>6. Analyze Each Neighborhood</h3>

In [149]:
# one hot encoding
sjc_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sjc_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sjc_onehot.columns[-1]] + list(sjc_onehot.columns[:-1])
sjc_onehot = sjc_onehot[fixed_columns]

print(sjc_onehot.shape)
sjc_onehot.head()

(2732, 276)


Unnamed: 0,Neighborhoods,ATM,Acupuncturist,American Restaurant,Antique Shop,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,"The Alameda, San Jose",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"The Alameda, San Jose",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"The Alameda, San Jose",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"The Alameda, San Jose",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Alameda, San Jose",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [150]:
sjc_grouped = sjc_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(sjc_grouped.shape)
sjc_grouped

(40, 276)


Unnamed: 0,Neighborhoods,ATM,Acupuncturist,American Restaurant,Antique Shop,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,"Almaden Valley, San Jose",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alum Rock, San Jose",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.01,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
2,"Alviso, San Jose",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Berryessa, San Jose",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013333,0.04,...,0.013333,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Blossom Valley, San Jose",0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Buena Vista, San Jose",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Burbank, Santa Clara County, California",0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0
7,"Cambrian Park, California",0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
8,"College Park, San Jose",0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,...,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.02,0.0
9,"Communications Hill, San Jose",0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01087,0.0,0.0,0.0,0.0,0.0,0.01087,0.0,0.0,0.0
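
Taking the mean of the one-hot columns per neighborhood gives each category's share of that neighborhood's venues. The dependency-free sketch below reproduces the same calculation on toy data to make the meaning of the values explicit; the neighborhoods "A" and "B" and the function name are made up for illustration.

```python
from collections import Counter, defaultdict


def category_shares(rows):
    """rows: iterable of (neighborhood, category) pairs.

    Returns {neighborhood: {category: share of that neighborhood's venues}}.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for neighborhood, counter in counts.items()
    }


# toy data: neighborhood "A" has 4 venues, one of them an Indian restaurant
rows = [("A", "Indian Restaurant"), ("A", "Gym"), ("A", "Park"), ("A", "Café"),
        ("B", "Gym"), ("B", "Park")]
shares = category_shares(rows)
print(shares["A"]["Indian Restaurant"])           # 1 of 4 venues
print(shares["B"].get("Indian Restaurant", 0.0))  # none in "B"
```

A value of 0.01 in the table above therefore means that 1 percent of the venues returned for that neighborhood were in the given category.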


In [151]:
len(sjc_grouped[sjc_grouped["Indian Restaurant"] > 0])

12

In [158]:

sjc_ir = sjc_grouped[["Neighborhoods","Indian Restaurant"]]

In [159]:

sjc_ir.head()

Unnamed: 0,Neighborhoods,Indian Restaurant
0,"Almaden Valley, San Jose",0.0
1,"Alum Rock, San Jose",0.01
2,"Alviso, San Jose",0.0
3,"Berryessa, San Jose",0.013333
4,"Blossom Valley, San Jose",0.0


<h3>7. Cluster Neighborhoods</h3>
Run k-means to cluster the neighborhoods in San Jose, CA into 3 clusters, using the share of Indian restaurants among each neighborhood's venues as the only feature.

In [160]:
# set number of clusters
kclusters = 3

sjc_clustering = sjc_ir.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sjc_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 0, 2, 0, 0, 2, 1, 2, 0], dtype=int32)
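
Because the clustering uses a single feature, k-means here amounts to splitting the frequency values into three ranges. The sketch below is a plain-Python version of Lloyd's algorithm in one dimension, meant only to illustrate the assign-and-update loop; the notebook itself relies on scikit-learn's KMeans. The starting centers are chosen by hand for this example, and the values are a small sample of frequencies that appear in this notebook's tables.

```python
def kmeans_1d(values, initial_centers, max_iter=100):
    """Lloyd's algorithm on scalars; returns (labels, centers)."""
    centers = list(initial_centers)
    k = len(centers)
    labels = []
    for _ in range(max_iter):
        # assignment step: each value goes to its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # update step: each center moves to the mean of its members
        new_centers = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers


values = [0.0, 0.0, 0.0, 0.01, 0.01, 0.013333, 0.02, 0.02, 0.031746]
labels, centers = kmeans_1d(values, initial_centers=[0.0, 0.01, 0.02])
print(labels)
print(centers)
```

The result of Lloyd's algorithm depends on the starting centers, so the hand-picked start here is a simplification of whatever initialization a library uses.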

In [161]:
# copy the Indian Restaurant frequencies into a new dataframe that will also hold the cluster labels
sjc_merged = sjc_ir.copy()

# add clustering labels
sjc_merged["Cluster Labels"] = kmeans.labels_

In [162]:
sjc_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
sjc_merged.head()

Unnamed: 0,Neighborhood,Indian Restaurant,Cluster Labels
0,"Almaden Valley, San Jose",0.0,0
1,"Alum Rock, San Jose",0.01,2
2,"Alviso, San Jose",0.0,0
3,"Berryessa, San Jose",0.013333,2
4,"Blossom Valley, San Jose",0.0,0


In [163]:
# join sjc_merged with sjc_df to add latitude/longitude for each neighborhood
sjc_merged = sjc_merged.join(sjc_df.set_index("Neighborhood"), on="Neighborhood")

print(sjc_merged.shape)
sjc_merged.head() # check the last columns!

(40, 5)


Unnamed: 0,Neighborhood,Indian Restaurant,Cluster Labels,Latitude,Longitude
0,"Almaden Valley, San Jose",0.0,0,37.2216,-121.862
1,"Alum Rock, San Jose",0.01,2,37.3661,-121.827
2,"Alviso, San Jose",0.0,0,37.4261,-121.975
3,"Berryessa, San Jose",0.013333,2,37.3863,-121.861
4,"Blossom Valley, San Jose",0.0,0,32.8639,-116.861


In [164]:
# sort the results by Cluster Labels
print(sjc_merged.shape)
sjc_merged.sort_values(["Cluster Labels"], inplace=True)
sjc_merged

(40, 5)


Unnamed: 0,Neighborhood,Indian Restaurant,Cluster Labels,Latitude,Longitude
0,"Almaden Valley, San Jose",0.0,0,37.2216,-121.862
37,West Valley (California),0.0,0,-27.5014,-58.8064
36,West San Jose,0.0,0,32.3744,-104.226
34,South San Jose,0.0,0,33.3949,-111.878
33,SoFA District,0.0,0,43.1676,131.938
32,Silver Creek Valley,0.0,0,44.9312,-89.016
31,"Seven Trees, San Jose",0.0,0,37.2907,-121.841
30,Santana Row,0.0,0,37.3209,-121.948
29,"Santa Teresa, San Jose",0.0,0,9.33474,-83.5717
28,San Pedro Square,0.0,0,37.3361,-121.894


<b>Visualize the resulting clusters</b>

In [165]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(sjc_merged['Latitude'], sjc_merged['Longitude'], sjc_merged['Neighborhood'], sjc_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='orange',#rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<h3>8. Examine Clusters</h3>

<b>Cluster 0</b>

In [166]:
sjc_merged.loc[sjc_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Indian Restaurant,Cluster Labels,Latitude,Longitude
0,"Almaden Valley, San Jose",0.0,0,37.2216,-121.862
37,West Valley (California),0.0,0,-27.5014,-58.8064
36,West San Jose,0.0,0,32.3744,-104.226
34,South San Jose,0.0,0,33.3949,-111.878
33,SoFA District,0.0,0,43.1676,131.938
32,Silver Creek Valley,0.0,0,44.9312,-89.016
31,"Seven Trees, San Jose",0.0,0,37.2907,-121.841
30,Santana Row,0.0,0,37.3209,-121.948
29,"Santa Teresa, San Jose",0.0,0,9.33474,-83.5717
28,San Pedro Square,0.0,0,37.3361,-121.894


<b>Cluster 1</b>

In [167]:
sjc_merged.loc[sjc_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Indian Restaurant,Cluster Labels,Latitude,Longitude
25,"Palm Haven, San Jose",0.02,1,1.36038,103.877
16,"Evergreen, San Jose",0.031746,1,37.3097,-121.784
7,"Cambrian Park, California",0.02,1,37.2564,-121.931
20,"Luna Park, San Jose",0.02,1,37.3612,-121.89
35,"The Alameda, San Jose",0.02,1,37.3381,-121.92
17,"Japantown, San Jose",0.02,1,37.349,-121.894


<b>Cluster 2</b>

In [168]:
sjc_merged.loc[sjc_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Indian Restaurant,Cluster Labels,Latitude,Longitude
26,"Rose Garden, San Jose",0.01,2,37.334,-121.924
21,Meadowfair,0.01,2,37.3189,-121.816
8,"College Park, San Jose",0.01,2,37.3412,-121.917
6,"Burbank, Santa Clara County, California",0.01,2,37.4241,-122.165
3,"Berryessa, San Jose",0.013333,2,37.3863,-121.861
1,"Alum Rock, San Jose",0.01,2,37.3661,-121.827


<h3>9. Conclusion</h3>

Most of the Indian restaurants are concentrated in the northern part of San Jose. Cluster 1 has the highest share of Indian restaurants among its venues (about 2 to 3 percent), cluster 2 has a moderate share (about 1 percent), and the neighborhoods in cluster 0 have very few or none. Cluster 0 therefore offers the greatest opportunity for opening a new Indian restaurant, because there is little to no competition from existing ones. Indian restaurants in cluster 1, by contrast, are likely to face intense competition because of the high concentration of similar restaurants. From another perspective, the oversupply of Indian restaurants is mostly in the northern and central areas of the city, while the southern area still has very few or none. This project therefore recommends that entrepreneurs open new Indian restaurants in cluster 0 neighborhoods, where there is little to no competition, and avoid cluster 1 neighborhoods, which already have a high concentration of Indian restaurants.
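
K-means cluster labels are arbitrary identifiers, so the comparison between clusters rests on the average Indian restaurant share within each one. The sketch below ranks clusters by that average, using a handful of (cluster, share) pairs copied from the cluster tables in section 8; the function name is hypothetical.

```python
from collections import defaultdict


def mean_share_by_cluster(rows):
    """rows: iterable of (cluster label, Indian restaurant share) pairs."""
    grouped = defaultdict(list)
    for cluster, share in rows:
        grouped[cluster].append(share)
    return {cluster: sum(shares) / len(shares) for cluster, shares in grouped.items()}


# a few rows copied from the cluster tables in this notebook
rows = [(0, 0.0), (0, 0.0),
        (1, 0.02), (1, 0.031746), (1, 0.02),
        (2, 0.01), (2, 0.013333), (2, 0.01)]
means = mean_share_by_cluster(rows)

# clusters ordered from least to most existing competition
ranking = sorted(means, key=means.get)
print(ranking)
for cluster in ranking:
    print(cluster, round(means[cluster], 4))
```

Ranking by the mean rather than by the label number keeps the recommendation valid if a rerun assigns different label numbers to the same groups.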