# Project Overview

#### Business Case:

![image-2.png](attachment:image-2.png)

The stakeholder is “Amy Ruth's”, a successful restaurant located in uptown Manhattan, New York City. “Successful” is defined in this context as appearing in the Foursquare top venues list for that neighborhood. The restaurant owner wants to branch out and open a second restaurant in another large metropolitan city, or in New York itself if appropriate. The goal of the project is to identify another neighborhood in a US city that is similar to the one where “Amy Ruth’s” currently resides (zip code 10026).

#### Data and Processing:

The project requires three data sources:
- List of US cities: this information is extracted from Wikipedia with BeautifulSoup: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population
- Neighborhood definition: postal zip codes of the individual cities are used as a surrogate for the classical neighborhood definition (i.e. neighborhood names), because consistent neighborhood lists are hard to obtain for several US cities. Zip codes of US cities are available through the Python library and database “uszipcode”: https://pypi.org/project/uszipcode/
- The data characterizing a neighborhood is obtained through the Foursquare API
https://developer.foursquare.com/

A master table is generated from these three data sources. First, the zip codes are retrieved for the most populous US cities (>= 1M inhabitants) extracted from Wikipedia. Venue data is then retrieved through the Foursquare API for each zip code and added to the master table, with one row per zip code and one column per venue category. The venue categories are one-hot encoded and averaged per zip code, and a k-means clustering is performed on the result. “Similarity” is thus defined as membership in the same cluster: the cluster containing the neighborhood of “Amy Ruth’s” is identified, and its other zip codes are treated as similar neighborhoods. These zip codes are then analyzed to ensure that no restaurant of the same type already operates in the area, which avoids direct competition. The result is a list of US zip codes that could be of interest for a new “Amy Ruth’s” location.

# TO DO:
- Check NaN zip codes -> regenerate missing values

Odd "outlier" values for latitude/longitude, e.g. for San Jose:

| Index | City | State | Zipcode | Latitude | Longitude |
|---|---|---|---|---|---|
| 603 | San Jose | California | 62682 | 40.280 | -89.630 |
| 604 | San Jose | California | 87565 | 35.500 | -105.400 |
| 606 | San Jose | California | 95110 | 37.340 | -121.910 |

- Check for unique longitude/latitude values per zip code
- Get the optimal value of "k" for k-means clustering
- Check DBX values
- Check proximity to the original restaurant

Go through the identified cluster and examine zip codes which:
 -> do not have a "Southern / Soul Food Restaurant" venue in the top 10 list
 -> check geographical distance
 -> Euclidean vector distance
 -> check whether the number of restaurants is below average, or whether there are at least some restaurants in the area
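
The "geographical distance" item could be approached with the haversine formula. The following is a minimal sketch, not part of the original notebook: the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions, and the coordinates are the San Jose rows listed above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (an approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# zip code 95110 versus the suspicious zip code 62682 from the rows above
d_outlier = haversine_km(37.340, -121.910, 40.280, -89.630)
# distance of a point to itself, as a sanity check
d_self = haversine_km(37.340, -121.910, 37.340, -121.910)
print(round(d_outlier), d_self)
```

A distance of thousands of kilometres between two zip codes of the same city is a strong hint that one of them belongs to a different place with the same name.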


### Import libraries


In [1]:
!conda create --name myenv
#activate env
!activate myenv


Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/wsuser/.conda/envs/myenv



Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate myenv
#
# To deactivate an active environment, use
#
#     $ conda deactivate



In [2]:


!pip install bs4
from bs4 import BeautifulSoup

!pip install uszipcode
from uszipcode import SearchEngine

import numpy as np

import pandas as pd
import requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
!pip install folium
import folium # map rendering library

#!conda install -c conda-forge geopy --yes 
#!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


print ("Libraries imported")


### Download list of most populated cities in US

In [3]:
# get list of most populous US cities from Wikipedia

url = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population"
html = requests.get(url).text


### Extract table with BeautifulSoup

In [4]:

# parse the page and collect all tables with BeautifulSoup
soup = BeautifulSoup(html, "html5lib")

tables = soup.find_all("table")
table_index = -1

# find the table listing the cities; there is no break, so the last matching table wins
for i, candidate in enumerate(tables):
    if "New York City" in str(candidate) and "Chicago" in str(candidate):
        table_index = i

if table_index >= 0:
    table = tables[table_index]
    print("Table found")
else:
    print("No valid table found")
    table = None



Table found


### Extract required data from table and generate a pandas dataframe for cities with at least 1M inhabitants

In [5]:

table_contents = []

# loop over all rows in the table
for i, row in enumerate(table.find_all('tr')):

    if i == 0:
        # skip the header row
        continue

    arr = row.text.split("\n")

    # only rows with the full set of columns describe a city
    if len(arr) > 17:
        # one dictionary per city, later turned into a dataframe row
        cell = {}
        # city name without the footnote marker
        cell['City'] = arr[3].split("[")[0]
        # state name without non-breaking spaces
        cell['State'] = arr[5].replace("\xa0", "")
        # population estimate as an integer
        cell['Size Estimate'] = int(arr[7].replace(",", ""))

        table_contents.append(cell)

cities = pd.DataFrame(table_contents)

cities.head(10)

| | City | State | Size Estimate |
|---|---|---|---|
| 0 | New York City | New York | 8336817 |
| 1 | Los Angeles | California | 3979576 |
| 2 | Chicago | Illinois | 2693976 |
| 3 | Houston | Texas | 2320268 |
| 4 | Phoenix | Arizona | 1680992 |
| 5 | Philadelphia | Pennsylvania | 1584064 |
| 6 | San Antonio | Texas | 1547253 |
| 7 | San Diego | California | 1423851 |
| 8 | Dallas | Texas | 1343573 |
| 9 | San Jose | California | 1021795 |


### Download zipcode database

In [6]:

#download zip code database
search = SearchEngine(simple_zipcode=True) # simple_zipcode=False


Start downloading data for simple zipcode database, total size 9MB ...
  1 MB finished ...
  2 MB finished ...
  3 MB finished ...
  4 MB finished ...
  5 MB finished ...
  6 MB finished ...
  7 MB finished ...
  8 MB finished ...
  9 MB finished ...
  10 MB finished ...
  Complete!


### Retrieve all zipcodes for cities with 1M or more inhabitants

In [7]:
# cities with at least one million inhabitants
tmpdata = []

for city, state, size in zip(cities["City"], cities["State"], cities["Size Estimate"]):
    if size >= 1000000:
        # NOTE: by_city looks up the city name only, so zip codes of same-named
        # places in other states can slip in (see the San Jose rows in the TO DO list)
        #res = search.by_city_and_state(city, state)
        res = search.by_city(city=city, returns=0)
        if not len(res):
            print("Error occurred for {}".format(city))
        else:
            print("Retrieved {} zip codes for {}".format(len(res), city))
            for z in res:
                pcode = {'City': city, 'State': state, 'Zipcode': z.zipcode, 'Latitude': z.lat, 'Longitude': z.lng}
                tmpdata.append(pcode)
city_wzipcodes = pd.DataFrame(tmpdata)
city_wzipcodes.head()


Retrieved 99 zip codes for New York City
Retrieved 64 zip codes for Los Angeles
Retrieved 58 zip codes for Chicago
Retrieved 106 zip codes for Houston
Retrieved 53 zip codes for Phoenix
Retrieved 56 zip codes for Philadelphia
Retrieved 68 zip codes for San Antonio
Retrieved 36 zip codes for San Diego
Retrieved 63 zip codes for Dallas
Retrieved 32 zip codes for San Jose


| | City | State | Zipcode | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | New York City | New York | 10001 | 40.75 | -73.99 |
| 1 | New York City | New York | 10002 | 40.72 | -73.99 |
| 2 | New York City | New York | 10003 | 40.73 | -73.99 |
| 3 | New York City | New York | 10004 | 40.7 | -74.02 |
| 4 | New York City | New York | 10005 | 40.705 | -74.005 |
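
The TO DO list mentions San Jose zip codes whose coordinates lie far from California. One way to screen for such rows is to compare every zip code with the median coordinates of its city. The sketch below is an illustration under assumptions, not the notebook's method: the helper `flag_coordinate_outliers`, the 1-degree threshold, and the two rows marked as hypothetical are made up; the other three rows are the San Jose entries from the TO DO list.

```python
from statistics import median

def flag_coordinate_outliers(rows, max_offset_deg=1.0):
    """Return zip codes lying more than max_offset_deg (latitude or longitude)
    away from the median coordinates of their city."""
    by_city = {}
    for row in rows:
        by_city.setdefault(row["City"], []).append(row)

    flagged = []
    for city_rows in by_city.values():
        med_lat = median(r["Latitude"] for r in city_rows)
        med_lon = median(r["Longitude"] for r in city_rows)
        for r in city_rows:
            if (abs(r["Latitude"] - med_lat) > max_offset_deg
                    or abs(r["Longitude"] - med_lon) > max_offset_deg):
                flagged.append(r["Zipcode"])
    return flagged

rows = [
    {"City": "San Jose", "Zipcode": "62682", "Latitude": 40.280, "Longitude": -89.630},
    {"City": "San Jose", "Zipcode": "87565", "Latitude": 35.500, "Longitude": -105.400},
    {"City": "San Jose", "Zipcode": "95110", "Latitude": 37.340, "Longitude": -121.910},
    # hypothetical rows, added only so that the median reflects the real city
    {"City": "San Jose", "Zipcode": "hypothetical-1", "Latitude": 37.330, "Longitude": -121.890},
    {"City": "San Jose", "Zipcode": "hypothetical-2", "Latitude": 37.350, "Longitude": -121.880},
]

flagged = flag_coordinate_outliers(rows)
print(flagged)
```

The median is used instead of the mean so that a few distant rows do not shift the reference point. The threshold would need tuning for very large cities.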


### Set up foursquare information

In [8]:
# The code was removed by Watson Studio for sharing.

### Function to retrieve location data from foursquare

In [9]:

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Query the Foursquare explore endpoint around each zip code location.

    Relies on the globals CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT defined
    in the set-up cell. Returns a dataframe with one row per venue.
    """
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-zip-code lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Zipcode',
                  'Zipcode Latitude',
                  'Zipcode Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
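
`getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue that carries no category. The sketch below shows one possible way to flatten the response items defensively. The helper name `extract_venue_rows`, the `'Unknown'` label and the sample items are assumptions for illustration; only the field names are the ones accessed in the function above.

```python
def extract_venue_rows(zipcode, lat, lng, items):
    """Flatten explore-style items into row tuples.

    Venues without any category are labelled 'Unknown' instead of
    raising an IndexError.
    """
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((
            zipcode, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# hand-written sample items (hypothetical venues) with the same field layout
sample_items = [
    {"venue": {"name": "Venue A",
               "location": {"lat": 40.75, "lng": -73.99},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Venue B",
               "location": {"lat": 40.75, "lng": -73.99},
               "categories": []}},
]

rows = extract_venue_rows("10001", 40.75, -73.99, sample_items)
for row in rows:
    print(row)
```

Keeping the parsing separate from the HTTP request also makes it possible to test the flattening logic without API credentials.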

In [10]:

#df = city_wzipcodes[city_wzipcodes["City"] == "San Jose"] # REDO REMOVE
df = city_wzipcodes 

# clean up data 
df = df.dropna()
print (df.head())
print (df.shape)




            City     State Zipcode  Latitude  Longitude
0  New York City  New York   10001    40.750    -73.990
1  New York City  New York   10002    40.720    -73.990
2  New York City  New York   10003    40.730    -73.990
3  New York City  New York   10004    40.700    -74.020
4  New York City  New York   10005    40.705    -74.005
(572, 5)


In [11]:

df_venues = getNearbyVenues(df["Zipcode"],df["Latitude"] , df["Longitude"], radius=500)
df_venues.head()


| | Zipcode | Zipcode Latitude | Zipcode Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | 10001 | 40.75 | -73.99 | STORY | 40.750866 | -73.989272 | Gift Shop |
| 1 | 10001 | 40.75 | -73.99 | Louis Vuitton | 40.750274 | -73.988146 | Boutique |
| 2 | 10001 | 40.75 | -73.99 | Krispy Kreme Doughnuts | 40.74999 | -73.992149 | Donut Shop |
| 3 | 10001 | 40.75 | -73.99 | Vans Off The Wall | 40.750377 | -73.989716 | Shoe Store |
| 4 | 10001 | 40.75 | -73.99 | Victoria's Secret | 40.749745 | -73.987693 | Lingerie Store |


#### Display example venues for zip code "10026" where Amy Ruth's is located

In [38]:
print ((df_venues[df_venues["Zipcode"] == "10026"].head(20))["Venue"])

amy_ruths = df_venues[ df_venues["Venue"] == "Amy Ruth's" ]

#only one Amy Ruth's in entire data set
print (len(amy_ruths))
print (amy_ruths.head())

amy_ruths_category = str((amy_ruths["Venue Category"].values)[0])
print (amy_ruths_category)


1989                             Fieldtrip
1990                        Seasoned Vegan
1991                    Little Bean Coffee
1992        Cantina Taqueria & Tequila Bar
1993              Central Park - North End
1994                            Amy Ruth's
1995                        Shuteye Coffee
1996                           Bo's Bagels
1997                         Farmers' Gate
1998             tropical grill restaurant
1999                           North Woods
2000         Melba's American Comfort Food
2001                       iLoveKickboxing
2002                      67 Orange Street
2003                                Safari
2004                      Harlem Pizza Co.
2005                            The Winery
2006    Central Park - 110th St Playground
2007                            Monkey Cup
2008                 Sea & Sea Fish Market
Name: Venue, dtype: object
1
     Zipcode  Zipcode Latitude  Zipcode Longitude       Venue  Venue Latitude  \
1994   10026            40.801

### One hot encoding of venue information:


In [39]:
# one hot encoding
df_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

# add zipcode column back to dataframe
df_onehot['Zipcode'] = df_venues['Zipcode'] 

# move zipcode column to the first column
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]

df_onehot.head()


Unnamed: 0,Zipcode,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Service,Airport Terminal,...,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Yoshoku Restaurant,Zoo Exhibit
0,10001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,10001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,10001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,10001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,10001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### group rows by zip code, taking the mean frequency of occurrence of each category


In [40]:
df_grouped = df_onehot.groupby('Zipcode').mean().reset_index()
#print (df_grouped.head())
#print (df_grouped.shape)
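
Each venue row has exactly one 1 among its one-hot columns, so the per-zip-code mean of a column equals the share of that zip code's venues falling into the category. The following standard-library sketch reproduces this calculation for a single zip code. The function name `category_frequencies` and the example list are hypothetical.

```python
from collections import Counter

def category_frequencies(categories):
    """Relative frequency of each venue category within one zip code."""
    counts = Counter(categories)
    total = len(categories)
    return {name: count / total for name, count in counts.items()}

# hypothetical venue categories for one zip code
example = ["Coffee Shop", "Coffee Shop", "Hotel", "Pizza Place"]
freqs = category_frequencies(example)
print(freqs)
```

These frequencies are the feature vectors on which the clustering operates, which is why a zip code with only a handful of venues can produce very coarse values.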


In [74]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

### top 10 venues for each neighborhood.


In [75]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Zipcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # from the 4th entry onwards the ordinal suffix is always "th"
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
zipcodes_venues_sorted = pd.DataFrame(columns=columns)
zipcodes_venues_sorted['Zipcode'] = df_grouped['Zipcode']

for ind in np.arange(df_grouped.shape[0]):
    zipcodes_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)

zipcodes_venues_sorted.head()

#zipcodes_venues_sorted[zipcodes_venues_sorted["Zipcode"]=="10026"].head()



Unnamed: 0,Zipcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,690,Caribbean Restaurant,Coworking Space,Pizza Place,Zoo Exhibit,Fish Market,Eye Doctor,Fabric Shop,Factory,Falafel Restaurant,Farm
1,10001,Korean Restaurant,Hotel,Coffee Shop,Gym / Fitness Center,Burger Joint,American Restaurant,Sushi Restaurant,Clothing Store,Ramen Restaurant,Cosmetics Shop
2,10002,Cocktail Bar,Hotel,Ice Cream Shop,Asian Restaurant,Café,Wine Bar,Coffee Shop,Art Gallery,French Restaurant,Mexican Restaurant
3,10003,Japanese Restaurant,Coffee Shop,Italian Restaurant,Dessert Shop,Grocery Store,Pet Store,Wine Shop,Sushi Restaurant,Café,Pizza Place
4,10004,Pier,Boat or Ferry,Park,Gym / Fitness Center,Historic Site,Snack Place,Monument / Landmark,American Restaurant,Harbor / Marina,Flower Shop


### cluster neighborhoods, figure out good "k" value


In [None]:
import matplotlib.pyplot as plt

from sklearn.metrics import silhouette_score

# candidate numbers of clusters
k_values = list(range(2, 400))

cost = []  # inertia (within-cluster sum of squared distances) per k
sil = []   # average silhouette score per k

df_grouped_clustering = df_grouped.drop(columns='Zipcode')

# run k-means clustering for every candidate k
for kclusters in k_values:
    if kclusters % 5 == 0:
        print(kclusters)
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_grouped_clustering)
    cost.append(kmeans.inertia_)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(df_grouped_clustering, kmeans.labels_)
    sil.append(silhouette_avg)
    print("For n_clusters =", kclusters,
          "The average silhouette_score is :", silhouette_avg)

print(max(sil))
# plot the cost against K values
plt.plot(k_values, cost, color='g', linewidth=3)
plt.xlabel("Value of K")
plt.ylabel("Squared Error (Cost)")
plt.show()

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]



For n_clusters = 2 The average silhouette_score is : 0.38351553649658654
For n_clusters = 3 The average silhouette_score is : 0.10276201836510075
For n_clusters = 4 The average silhouette_score is : 0.18292794020168282
5
For n_clusters = 5 The average silhouette_score is : 0.3374738333460325
For n_clusters = 6 The average silhouette_score is : 0.13924267979229407
For n_clusters = 7 The average silhouette_score is : 0.21743111522692013
For n_clusters = 8 The average silhouette_score is : 0.1636248717010723
For n_clusters = 9 The average silhouette_score is : 0.17417758498356536
10
For n_clusters = 10 The average silhouette_score is : 0.17980501473133417
For n_clusters = 11 The average silhouette_score is : 0.18380729731293044
For n_clusters = 12 The average silhouette_score is : 0.19295689994877763
For n_clusters = 13 The average silhouette_score is : 0.24093783729261545
For n_clusters = 14 The average silhouette_score is : 0.19938464117629034
15
For n_clusters = 15 The average silhouet
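
The printed silhouette scores can also be ranked directly as a cross-check on the elbow plot. The sketch below uses the scores from the output above for k = 2 to 14, rounded to four decimals. The helper `rank_k_by_silhouette` and its `min_k` parameter are assumptions for illustration.

```python
# silhouette scores copied from the output above, rounded to four decimals
silhouette_by_k = {
    2: 0.3835, 3: 0.1028, 4: 0.1829, 5: 0.3375, 6: 0.1392,
    7: 0.2174, 8: 0.1636, 9: 0.1742, 10: 0.1798, 11: 0.1838,
    12: 0.1930, 13: 0.2409, 14: 0.1994,
}

def rank_k_by_silhouette(scores, min_k=2, top_n=3):
    """Return the top_n (k, score) pairs with k >= min_k, best score first."""
    candidates = [(k, s) for k, s in scores.items() if k >= min_k]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]

best_overall = rank_k_by_silhouette(silhouette_by_k)
best_fine_grained = rank_k_by_silhouette(silhouette_by_k, min_k=6)
print(best_overall)
print(best_fine_grained)
```

Within this range the silhouette score is highest for k = 2, which would split the zip codes into only two groups. Restricting the ranking to larger k is one way to trade separation quality against a more useful granularity.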

### re-cluster with appropriate "k"-value

In [68]:
# there is an elbow at k = 9

kmeans = KMeans(n_clusters=9, random_state=0).fit(df_grouped_clustering)


### create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [69]:
# add clustering labels
zipcodes_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

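df_merged = df

# join the zip code table (city, state, latitude/longitude) with the cluster labels and top venues
df_merged = df_merged.join(zipcodes_venues_sorted.set_index('Zipcode'), on='Zipcode')

# DBX -> FIX: drop rows without a cluster label (e.g. zip codes for which no venues were returned)
df_merged = df_merged.dropna()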

df_merged.head() # check the last columns!


Unnamed: 0,City,State,Zipcode,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,New York City,New York,10001,40.75,-73.99,0.0,Korean Restaurant,Hotel,Coffee Shop,Gym / Fitness Center,Burger Joint,American Restaurant,Sushi Restaurant,Clothing Store,Ramen Restaurant,Cosmetics Shop
1,New York City,New York,10002,40.72,-73.99,0.0,Cocktail Bar,Hotel,Ice Cream Shop,Asian Restaurant,Café,Wine Bar,Coffee Shop,Art Gallery,French Restaurant,Mexican Restaurant
2,New York City,New York,10003,40.73,-73.99,0.0,Japanese Restaurant,Coffee Shop,Italian Restaurant,Dessert Shop,Grocery Store,Pet Store,Wine Shop,Sushi Restaurant,Café,Pizza Place
3,New York City,New York,10004,40.7,-74.02,0.0,Pier,Boat or Ferry,Park,Gym / Fitness Center,Historic Site,Snack Place,Monument / Landmark,American Restaurant,Harbor / Marina,Flower Shop
4,New York City,New York,10005,40.705,-74.005,0.0,Seafood Restaurant,Coffee Shop,American Restaurant,Food Truck,Café,Wine Shop,Cocktail Bar,Restaurant,Pizza Place,Italian Restaurant


In [70]:
df_merged.shape

(505, 16)

# --> REMOVE ???

### compare neighborhoods of a specific city as well as with neighborhoods in different cities


# compare cluster IDs of neighborhoods
# --> REMOVE ???

zips = list(df_merged["Zipcode"])
vals = []

diff_n = 0
same_n = 0

for i1 in range(0,len(zips)):
    #print (zips[i1])
    #print (i1)
    for i2 in range(i1+1,len(zips)):
    
        #print (zips[i1],zips[i2])
        x1 =  df_merged[df_merged["Zipcode"]==zips[i1]]
        x2 =  df_merged[df_merged["Zipcode"]==zips[i2]]
        # append identical city, cluster labels
        #vals.append([ (x1["City"].values)[0] == (x2["City"].values)[0] , (x1["Cluster Labels"].values)[0] == (x2["Cluster Labels"].values)[0] ])
        
        #if (x1["City"].values)[0] == (x2["City"].values)[0] :
            

        #break
        if i2> 20: #REMOVE --------------------
            break
    break
#print (vals)
            

# --> REMOVE ???

# compare distance (Euclidean distance) between neighborhoods

### Find cluster information for respective zipcode

In [71]:
# Venue type -> "Southern / Soul Food Restaurant"
#print (info)

cluster_nr =   ((df_merged[df_merged["Zipcode"]=="10026"])["Cluster Labels"].values )[0]

print (cluster_nr)


0.0


In [72]:
cluster = df_merged[df_merged["Cluster Labels"]==cluster_nr]

print (cluster.shape)

cluster.head()



(400, 16)


Unnamed: 0,City,State,Zipcode,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,New York City,New York,10001,40.75,-73.99,0.0,Korean Restaurant,Hotel,Coffee Shop,Gym / Fitness Center,Burger Joint,American Restaurant,Sushi Restaurant,Clothing Store,Ramen Restaurant,Cosmetics Shop
1,New York City,New York,10002,40.72,-73.99,0.0,Cocktail Bar,Hotel,Ice Cream Shop,Asian Restaurant,Café,Wine Bar,Coffee Shop,Art Gallery,French Restaurant,Mexican Restaurant
2,New York City,New York,10003,40.73,-73.99,0.0,Japanese Restaurant,Coffee Shop,Italian Restaurant,Dessert Shop,Grocery Store,Pet Store,Wine Shop,Sushi Restaurant,Café,Pizza Place
3,New York City,New York,10004,40.7,-74.02,0.0,Pier,Boat or Ferry,Park,Gym / Fitness Center,Historic Site,Snack Place,Monument / Landmark,American Restaurant,Harbor / Marina,Flower Shop
4,New York City,New York,10005,40.705,-74.005,0.0,Seafood Restaurant,Coffee Shop,American Restaurant,Food Truck,Café,Wine Shop,Cocktail Bar,Restaurant,Pizza Place,Italian Restaurant


In [105]:
# DBX TO DO: go through list and identify zip codes in cluster which:
# -> do not have a "Southern / Soul Food Restaurant" venue in the top 10 list
# -> check geographical distance
# -> Euclidean vector distance
# -> check whether the number of restaurants is below average, or whether there are at least some restaurants in the area
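
Two of these open items, excluding zip codes that already list the target category and ranking the rest by Euclidean vector distance, could be combined as sketched below. This is an illustration with made-up data, not the notebook's implementation: the zip code labels "A" to "C", the three-element frequency vectors and the helper names are hypothetical.

```python
import math

TARGET_CATEGORY = "Southern / Soul Food Restaurant"

def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_candidates(reference, candidates, top_venues):
    """Rank candidate zip codes by closeness to the reference vector,
    skipping any whose top-venue list already contains TARGET_CATEGORY."""
    ranked = []
    for zipcode, vector in candidates.items():
        if TARGET_CATEGORY in top_venues[zipcode]:
            continue  # a restaurant of the same type would be direct competition
        ranked.append((zipcode, euclidean(reference, vector)))
    return sorted(ranked, key=lambda pair: pair[1])

# hypothetical frequency vectors (one value per venue category)
reference = [0.2, 0.1, 0.0]
candidates = {
    "A": [0.2, 0.1, 0.1],
    "B": [0.0, 0.0, 0.0],
    "C": [0.2, 0.1, 0.0],
}
top_venues = {
    "A": ["Coffee Shop", "Hotel"],
    "B": ["Hotel", "Park"],
    "C": [TARGET_CATEGORY, "Coffee Shop"],
}

ranked = rank_candidates(reference, candidates, top_venues)
print(ranked)
```

In the notebook, the reference vector would presumably be the `df_grouped` row for zip code 10026 and the candidates the rows of the other zip codes in the same cluster.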



### display cluster information of zip codes on map


In [None]:
# create map
address = 'San Jose, California'

geolocator = Nominatim(user_agent="us_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# set color scheme for the clusters; use the cluster count of the fitted model,
# because kclusters still holds the last value of the k-search loop
n_clusters = kmeans.n_clusters
colors_array = cm.rainbow(np.linspace(0, 1, n_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map (cluster_label avoids shadowing the "cluster" dataframe)
for lat, lon, poi, cluster_label in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Zipcode'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster_label), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster_label)],
        fill=True,
        fill_color=rainbow[int(cluster_label)],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters