INTRODUCTION

The target city of this project is The Hague, in the province of South Holland in the Netherlands. The project assesses the best locations for restaurants within the eight urban districts (boroughs) that make up The Hague. The code shown here for the Scheveningen district was also used to analyze the other districts; three districts were investigated in this project.

BACKGROUND

A group of immigrants from Asia and South America arrived to seek asylum in the Netherlands, and the Dutch government and the City Council of The Hague have allowed them to stay in The Hague. These asylum seekers have told the authorities that they would integrate better into Dutch society if they were given the opportunity to start and run restaurant businesses, which they say is their profession. However, The Hague, like other parts of the Netherlands, already has a wide variety of restaurants everywhere.
The City Council contracted me as a Data Scientist and Analyst to investigate the best possible locations for restaurants and to present recommendations on those locations. The result will be used to determine whether or not the council provides financial assistance to these immigrants to start the restaurant businesses they are interested in.

DESCRIPTION OF THE DATA AND HOW IT WILL BE USED TO SOLVE THE PROBLEM

The Hague comprises 8 districts, equivalent to boroughs, and a total of 38 neighborhoods. To segment and explore the neighborhoods, the required dataset will consist of the 8 boroughs (districts), the neighborhoods in each borough, and the latitude and longitude coordinates of each neighborhood. This dataset is available at https://www.postcode.nl/services/adresdata/api for a fee of 40 euro per year for 10,000 requests (excluding value added tax). It is also available at https://api.postcode.eu/nl/v1/addresses/latlon/{latitude}/{longitude}.
Alternatively, the data also appears to be available on Foursquare; I checked this by entering a location name in the search box. Whichever source is used, the dataset will be named haguecity_data.
Once downloaded, the dataset will first be opened as JSON data using with open() and then explored. The list of neighborhoods it returns will be assigned to a new variable, hgcneighborhoods_data, from which the features will be extracted. The data, expected to be nested Python dictionaries, will then be transformed into a pandas dataframe. First, the dataframe columns will be defined and the dataframe initialized. The dataframe will then be filled one row at a time by looping through the data.
The geopy library will be used to get the latitude and longitude values of The Hague, following the procedure learned in the lessons. A map of The Hague with the neighborhoods superimposed on top is considered necessary. However, I am skeptical about using folium for this, because it never worked for me in the lab exercises that used folium to create choropleth maps. This is a serious concern, and while grading peer-review assignments very recently I observed that other students had the same experience, including one of the best students, who has a very strong Python programming background.
The Foursquare API will be used to explore the neighborhoods and segment them by their top 100 venues, following procedures already learned in the courses. The get_category_type function will be used to extract the categories of the venues. The JSON will be cleaned and structured into a pandas dataframe.
The next step will be to explore the neighborhoods in The Hague by creating a function that repeats the exploration process for every neighborhood. Code will be written to run this function on each neighborhood and create a new dataframe called haguecity_venues.
Each neighborhood will then be analyzed, first using one-hot encoding so that k-means can be used later for clustering. The new dataframe will be examined and its rows grouped by neighborhood. Each neighborhood will be printed along with its top 5 most common venues, and the results will be put into a pandas dataframe using a function that sorts the venues in descending order. After creating the new dataframe, the top 10 venues will be displayed for each neighborhood.
The neighborhoods will then be clustered by running k-means to create, say, 5 clusters. This will lead to a new dataframe that includes the clusters as well as the top 10 venues for each neighborhood, which will be visualized if my earlier experience with folium does not repeat itself.
Finally, the clusters will be examined to determine the discriminating venue categories that distinguish each cluster. This will help in assigning a name to each cluster and in making recommendations to the council during the presentation.

In [None]:
#Download all the needed dependencies
import numpy as np
import pandas as pd #library for data analysis

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json #library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim #converts an address into latitude and longitude values

import requests #library to handle requests
from pandas import json_normalize #transform JSON file into pandas dataframe (pandas>=1.0; formerly in pandas.io.json)

#Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors 
%matplotlib inline

#Import k-means for clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium #map visualization library

print('Libraries imported')

In [None]:
#Create the dataframe with data gathered from Dutch web site on the Hague City(thcty)
thcty_data=pd.DataFrame({'Borough':['Hague Centrum', 'Hague Centrum', 'Hague Centrum', 'Hague Centrum', 'Hague Centrum', 'Hague Centrum', 'Escamp'
                                    ,'Escamp','Escamp','Escamp','Haagse Hout', 'Haagse Hout', 'Haagse Hout', 'Haagse Hout', 'Haagse Hout', 'Haagse Hout'
                                    ,'Laak','Laak','Laak','Leidschenveen-Ypenburg','Leidschenveen-Ypenburg','Loosduinen','Loosduinen','Loosduinen'
                                    ,'Loosduinen','Scheveningen','Scheveningen','Scheveningen','Scheveningen','Scheveningen','Scheveningen','Scheveningen'
                                    ,'Scheveningen','Segbroek','Segbroek','Segbroek','Segbroek','Segbroek'],
      'Neighborhood':['Archipelbuurt-Willemspark','Zeeheldenkwartier','Kortenbos','Transvaalkwartier','Schildersbuurt','Stationsbuurt'
                     ,'Rustenbuurt-Oostbroek','Wateringse Veld','Moerwijk','Bouwlust-Vrederust','Benoordenhout','Marlot','Haagse Rose','Mariahoeve'
                      ,'Bezuidenhout','Beatrixkwartier','Binckhorst','Spoorwijk','Laakkwartier','Forepark','Hornwijk','Kijkduin-Ockenburg','Kraayenstein-De Uithof'
                      ,'Bohemen','Waldeck','Oostduinen','Belgische Park','Westbroekpark','Van Stolkpark','Hof van Schreveningen','Statenkwartier'
                      ,'Intnl Zone','Duindorp','Bornen & Bloemenbuurt','Regentessekwartier','Valkenboskwartier','Vruchtenbuurt','Vogelwijk'],
      'Latitude':[52.09709, 52.0825203, 52.0770373, 52.0668307, 52.0684867, 52.07151, 52.0604259, 52.0271, 52.0478362, 52.037498, 52.0932446, 52.0991393
                  ,52.0861,52.0936444,52.0840958,52.0803668,52.0676951,52.0535406,52.0561421,52.0701741,52.0468617,52.0666,52.0339228,52.0656615,52.0584695
                 ,52.1155078, 52.1091988, 52.1039161, 52.0986338, 52.0927868, 52.0945228, 52.0595723, 52.0906178, 52.0728265, 52.0768491, 52.0714, 52.0681189
                  , 52.0780333],
      'Longitude':[4.3009372, 4.2995842, 4.3024605, 4.2911217, 4.3003207, 4.3163635, 4.2842477, 4.2897, 4.2892, 4.256879, 4.3223079, 4.3514929, 4.3109, 4.3592344
                   ,4.3396654, 4.334579, 4.3400508, 4.3152625, 4.3209061, 4.3925933, 4.3565011, 4.2212, 4.2470762, 4.2316372, 4.2420443, 4.3035489
                   ,4.2944283, 4.2936099, 4.2927918, 4.2619202, 4.2795905, 4.2218275, 4.2592868, 4.2538506, 4.283172, 4.2747, 4.2555854, 4.25177]},
                      columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])
thcty_data

In [None]:
schev_data=thcty_data[thcty_data['Borough']=='Scheveningen'].reset_index(drop=True)
schev_data

USING geopy LIBRARY TO GET THE LATITUDE AND LONGITUDE COORDINATES OF SCHEVENINGEN

In [None]:
#Get the geographical coordinates of Scheveningen
address='Scheveningen'
geolocator=Nominatim(user_agent="schev_explorer")
location=geolocator.geocode(address)
latitude=location.latitude
longitude=location.longitude
print('The geographical coordinates of Scheveningen are {},{}.'.format(latitude, longitude))
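
Nominatim's geocode returns None when it finds no match, in which case location.latitude in the cell above would raise an AttributeError. Below is a minimal sketch of a guard. It assumes a hypothetical helper, coordinates_or_fallback, and uses a stand-in geocoder with illustrative coordinates so that it runs offline; in the notebook the geocode argument would be geolocator.geocode.

```python
from collections import namedtuple

# Stand-in for the object geopy returns; only the two attributes used here.
Location = namedtuple("Location", ["latitude", "longitude"])

def coordinates_or_fallback(geocode, address, fallback):
    """Return (lat, lng) for address, or fallback if the geocoder finds nothing.

    geocode is any callable shaped like geolocator.geocode: it takes an
    address string and returns an object with .latitude and .longitude,
    or None when there is no match.
    """
    location = geocode(address)
    if location is None:
        return fallback
    return (location.latitude, location.longitude)

# Hypothetical offline geocoder used only for this illustration.
def fake_geocode(address):
    known = {"Scheveningen": Location(52.1155078, 4.3035489)}
    return known.get(address)

found = coordinates_or_fallback(fake_geocode, "Scheveningen", (0.0, 0.0))
missing = coordinates_or_fallback(fake_geocode, "Nowhere", (0.0, 0.0))
print(found, missing)
```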

CREATE A MAP OF SCHEVENINGEN WITH NEIGHBORHOODS SUPERIMPOSED ON TOP

In [None]:
#Create a map of Scheveningen (map_schev) using the latitude and longitude values returned above
map_schev=folium.Map(location=[latitude, longitude], zoom_start=12)

#Add markers to the map
for lat,lng,borough,neighborhood in zip(schev_data['Latitude'], schev_data['Longitude'], schev_data['Borough'], schev_data['Neighborhood']):
    label='{},{}'.format(neighborhood, borough)
    label=folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng],
                       radius=5,
                       popup=label,
                       color='blue',
                       fill=True,
                       fill_color='#3186cc',
                       fill_opacity=0.7).add_to(map_schev)
map_schev

THE Foursquare API WILL NOW BE USED TO EXPLORE THE NEIGHBORHOODS AND SEGMENT THEM

In [None]:
#Define foursquare credentials and version
CLIENT_ID='' # your foursquare ID
CLIENT_SECRET='' # your foursquare SECRET (keep the real value out of shared notebooks)
VERSION='20190314' # foursquare API version
print('Your credentials:')
print('CLIENT_ID:' +CLIENT_ID)
print('CLIENT_SECRET:' +CLIENT_SECRET)

EXPLORING THE FIRST NEIGHBORHOOD IN OUR DATAFRAME (schev_data)

In [None]:
#Get the name of the first neighborhood
schev_data.loc[0,'Neighborhood']

In [None]:
#Get the first neighborhood's latitude and longitude values
latitude=schev_data.loc[0,'Latitude'] # neighborhood's latitude value
longitude=schev_data.loc[0,'Longitude'] # neighborhood's longitude value
names=schev_data.loc[0,'Neighborhood'] # neighborhood's name
print('Latitude and longitude values of {} are {},{}.'.format(names, latitude, longitude))

GET THE TOP 100 VENUES THAT ARE IN SCHEVENINGEN WITHIN A RADIUS OF 500 METERS

In [None]:
#Create the GET request URL and name it url
LIMIT=100
radius=500
url='https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    latitude,
    longitude,
    radius,
    LIMIT)
url
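
The same request URL can also be assembled with the standard library's urllib.parse.urlencode, which escapes each value and avoids hand-counting the {} placeholders. This is only a sketch: build_explore_url is a hypothetical helper, and the credentials shown are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build a Foursquare v2 venues/explore URL from keyword parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" readable instead of %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")

example_url = build_explore_url("ID", "SECRET", "20190314", 52.1155078, 4.3035489)
print(example_url)
```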

In [None]:
#Send the GET request and examine the results
results=requests.get(url).json()
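
Indexing straight into results['response']['groups'][0]['items'] fails with a KeyError or IndexError if the request was rejected, for example because of bad credentials or an exhausted quota. The sketch below shows a defensive accessor. It assumes a hypothetical helper, extract_items, the response shape used in this notebook, and that meta.code carries the HTTP-style status; the payloads are hand-written for illustration.

```python
def extract_items(payload):
    """Return the list of venue items from an explore response, or [] if absent."""
    code = payload.get("meta", {}).get("code")
    if code is not None and code != 200:
        # Surface the failure instead of hiding it behind a KeyError later.
        raise RuntimeError("Foursquare request failed with code {}".format(code))
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])

# Hand-written payloads that mimic the shape indexed in this notebook.
ok_payload = {"meta": {"code": 200},
              "response": {"groups": [{"items": [{"venue": {"name": "Cafe A"}}]}]}}
empty_payload = {"meta": {"code": 200}, "response": {}}

print(len(extract_items(ok_payload)), len(extract_items(empty_payload)))
```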

DEFINE THE get_category_type FUNCTION WHICH WILL ALLOW THE CLEANING AND STRUCTURING OF THE json RESULTS INTO A PANDAS DATAFRAME

In [None]:
#Create a function that extracts the category of the venue
def get_category_type(row):
    #The key is 'categories' or 'venue.categories' depending on how the JSON was flattened
    try:
        categories_list=row['categories']
    except KeyError:
        categories_list=row['venue.categories']
    if len(categories_list)==0:
        return None
    else:
        return categories_list[0]['name']
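
To see how the function behaves with both key layouts and with an empty category list, here is a small check on hand-made rows. Plain dictionaries stand in for dataframe rows, and the venue data is invented for illustration. The function is repeated so that the block runs on its own.

```python
def get_category_type(row):
    """Return the name of the first category of a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Invented rows covering the three cases the function handles.
rows = [
    {'venue.categories': [{'name': 'Seafood Restaurant'}, {'name': 'Bar'}]},
    {'categories': [{'name': 'Beach'}]},
    {'venue.categories': []},
]
categories = [get_category_type(r) for r in rows]
print(categories)
```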

NOW TO CLEAN THE json AND STRUCTURE IT INTO PANDAS DATAFRAME

In [None]:
venues=results['response']['groups'][0]['items']
nearby_venues=json_normalize(venues) # flatten JSON

#Filter columns
filtered_columns=['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues=nearby_venues.loc[:,filtered_columns]

#Filter the category for each row
nearby_venues['venue.categories']=nearby_venues.apply(get_category_type, axis=1)

#Clean columns
nearby_venues.columns=[col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

EXPLORING NEIGHBORHOODS IN SCHEVENINGEN

In [None]:
#Create a function to repeat the same process for all neighborhoods in Scheveningen
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #Create the API request URL for this neighborhood (lat and lng come from the loop, not the global latitude and longitude)
        url='https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        #Make the GET request
        results=requests.get(url).json()["response"]["groups"][0]['items']
        #Keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    #Flatten the per-neighborhood lists into one dataframe
    nearby_venues=pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns=['Neighborhood',
                          'Neighborhood Latitude',
                          'Neighborhood Longitude',
                          'Venue',
                          'Venue Latitude',
                          'Venue Longitude',
                          'Venue Category']
    return nearby_venues
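
The expression v['venue']['categories'][0] raises an IndexError for a venue that has no category. The sketch below shows a tolerant row builder. It assumes a hypothetical helper, venue_row, and an invented placeholder label, 'Uncategorized'; the sample venue is made up.

```python
def venue_row(name, lat, lng, item, default_category='Uncategorized'):
    """Flatten one explore item into the 7-tuple used by getNearbyVenues."""
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else default_category
    return (name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category)

# Invented venue with an empty category list.
item_without_category = {'venue': {'name': 'Pier Kiosk',
                                   'location': {'lat': 52.117, 'lng': 4.280},
                                   'categories': []}}
row = venue_row('Oostduinen', 52.1155078, 4.3035489, item_without_category)
print(row)
```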

CREATE DATAFRAME WHICH CONTAINS NEARBY VENUES FOR EACH NEIGHBORHOOD OR GROUP OF NEIGHBORHOODS IN SCHEVENINGEN

In [None]:
schev_venues=getNearbyVenues(names=schev_data['Neighborhood'],
                          latitudes=schev_data['Latitude'],
                          longitudes=schev_data['Longitude']
                          )

In [None]:
#Check the size of the resulting dataframe
print(schev_venues.shape)
schev_venues.head()

In [None]:
#Check how many venues were returned for each neighborhood
schev_venues.groupby('Neighborhood').count()

In [None]:
#Find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(schev_venues['Venue Category'].unique())))

In [None]:
#Check whether any venue has 'Neighborhood' as its Venue Category (it would clash with the 'Neighborhood' column after one-hot encoding)
schev_venues[schev_venues['Venue Category']=='Neighborhood']

In [None]:
#If so, fix it by slightly changing the name of that Venue Category
schev_venues.loc[schev_venues['Venue Category']=='Neighborhood', 'Venue Category']='Neighborhoods'

In [None]:
#Check again; this should now return no rows
schev_venues[schev_venues['Venue Category']=='Neighborhood']

ANALYZE THE FIRST NEIGHBORHOOD - OOSTDUINEN

In [None]:
#onehot encoding
schev_onehot_=pd.get_dummies(schev_venues[['Venue Category']], prefix="", prefix_sep="")

#Merge the schev_venues and schev_onehot by using 'Neighborhood' column
schev_onehot=pd.concat([schev_venues['Neighborhood'], schev_onehot_], axis=1)

schev_onehot.head(10)

In [None]:
#Sanity check for finding any invalid value
schev_onehot.isnull().values.any()

In [None]:
#Examine the new dataframe size
schev_onehot.shape

In [None]:
#Group rows by neighborhood, taking the mean of the frequency of occurrence of each category
schev_grouped=schev_onehot.groupby('Neighborhood').mean().reset_index()
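
The mean of a one-hot column within a neighborhood is simply the share of that neighborhood's venues that fall in the category. Below is a small standard-library illustration with invented venues; category_shares is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def category_shares(pairs):
    """Map each neighborhood to {category: fraction of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    shares = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return shares

# Invented (neighborhood, venue category) pairs.
venues = [('Duindorp', 'Beach'), ('Duindorp', 'Restaurant'),
          ('Duindorp', 'Restaurant'), ('Duindorp', 'Bar'),
          ('Statenkwartier', 'Restaurant'), ('Statenkwartier', 'Museum')]
shares = category_shares(venues)
print(shares['Duindorp'])
```

Here two of Duindorp's four invented venues are restaurants, so its 'Restaurant' share is 0.5, the same value the grouped mean of a one-hot 'Restaurant' column would give.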

In [None]:
#Check the new dataframe
schev_grouped.head()

In [None]:
#Sanity check for finding any invalid value
schev_grouped.isnull().values.any()

CREATE THE NEW DATAFRAME AND DISPLAY THE TOP VENUES FOR EACH NEIGHBORHOOD

In [None]:
#First, a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories=row.iloc[1:]
    row_categories_sorted=row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues=2
indicators=['st', 'nd', 'rd']

#Create columns according to number of top venues
columns=['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

#Create a new dataframe
schev_venues_sorted=pd.DataFrame(columns=columns)
schev_venues_sorted['Neighborhood']=schev_grouped['Neighborhood']

#Fill in the most common venue categories row by row
for ind in np.arange(schev_grouped.shape[0]):
    schev_venues_sorted.iloc[ind,1:]=return_most_common_venues(schev_grouped.iloc[ind,:], num_top_venues)
schev_venues_sorted.head()
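
The indicators lookup above labels positions 1 to 3 and falls back to 'th' for everything else, so if num_top_venues were ever raised past 20, position 21 would be labelled '21th'. A sketch of a general ordinal helper follows; ordinal is a hypothetical name.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens always take 'th'.
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
print(columns)
```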

In [None]:
#Sanity check for finding any invalid value
schev_venues_sorted.isnull().values.any()

In [None]:
schev_venues_sorted['1st Most Common Venue'].count()

RUN k-means TO GROUP THE NEIGHBORHOODS INTO 4 CLUSTERS

In [None]:
#Set number of clusters
kclusters=4

schev_grouped_clustering=schev_grouped.drop(columns='Neighborhood')

#Run k-means clustering
kmeans=KMeans(n_clusters=kclusters, random_state=0).fit(schev_grouped_clustering)

#Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
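
The choice of 4 clusters is a judgment call. One common way to compare candidate values of k is the within-cluster sum of squared distances, which scikit-learn exposes on a fitted model as inertia_. The sketch below computes that quantity by hand for invented 2-D points and labels so that the arithmetic is visible; within_cluster_ss is a hypothetical helper, not part of the notebook.

```python
def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        for p in members:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dims))
    return total

# Invented points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = within_cluster_ss(points, [0, 0, 0, 0])
two_clusters = within_cluster_ss(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

A sharp drop in this value when moving from k to k+1, followed by small improvements afterwards, is the usual informal signal that k+1 is a reasonable number of clusters.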

CREATE A NEW DATAFRAME THAT INCLUDES THE CLUSTER AS WELL AS THE TOP VENUES FOR EACH NEIGHBORHOOD

In [None]:
#Add clustering labels
schev_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

schev_merged=schev_data

#Join schev_venues_sorted onto schev_data so each neighborhood keeps its latitude and longitude
schev_merged=schev_merged.join(schev_venues_sorted.set_index('Neighborhood'), on="Neighborhood")

schev_merged.head(10) #check the last columns!

FINALLY, VISUALIZE THE RESULTING CLUSTERS
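
A map of the clusters needs one distinct marker color per cluster label. The sketch below generates evenly spaced hex colors with the standard library; cluster_colors is a hypothetical helper. The resulting strings could be passed as color and fill_color to folium.CircleMarker, in the same way as in the map cell earlier in the notebook.

```python
import colorsys

def cluster_colors(k):
    """Return k visually distinct colors as '#rrggbb' strings."""
    colors = []
    for i in range(k):
        # Spread hues evenly around the color wheel at full saturation.
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        colors.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return colors

palette = cluster_colors(4)
print(palette)
```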

EXAMINE CLUSTERS

Now we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster.

In [None]:
#Cluster 1
schev_merged.loc[schev_merged['Cluster Labels']==0, schev_merged.columns[[1] + list(range(5, schev_merged.shape[1]))]]

In [None]:
#Cluster 2
schev_merged.loc[schev_merged['Cluster Labels']==1, schev_merged.columns[[1] + list(range(5, schev_merged.shape[1]))]]

In [None]:
#Cluster 3
schev_merged.loc[schev_merged['Cluster Labels']==2, schev_merged.columns[[1] + list(range(5, schev_merged.shape[1]))]]

In [None]:
#Cluster 4
schev_merged.loc[schev_merged['Cluster Labels']==3, schev_merged.columns[[1] + list(range(5, schev_merged.shape[1]))]]