## Recommendation for Hotel Construction in Bandung, Indonesia through Clustering Analysis

## Introduction: Problem Description

**Hotel Franchise A** decides to open a hotel in Bandung, Indonesia. Before construction begins, they want to determine in which part of the city the hotel should be built. A survey is distributed to the board of commissioners and other relevant stakeholders. Their first and most important criterion is that the hotel be located close to as many points of interest and attractions as possible. An analysis of neighborhood venues is therefore necessary.

## Introduction: Data

**1. Geo-Location**
Geo-location data are needed for the analysis, specifically the coordinates (latitude and longitude) of all boroughs and neighborhoods in Bandung, Indonesia. After searching, the writer decided to use the GeoJSON data of Bandung from https://github.com/tyohan/bandung-map-dataset. Thanks to its author.

We will extract the JSON into a pandas DataFrame of roughly the following form:

In [10]:
import pandas as pd
column_names = ['Boroughs', 'Neighborhoods', 'Latitude', 'Longitude'] 
neighborhoods = pd.DataFrame(columns=column_names)
neighborhoods

Unnamed: 0,Boroughs,Neighborhoods,Latitude,Longitude


**2. Venues Data**
To understand where venues are located, we need the list of venues in each neighborhood together with their latitude and longitude. Another important field is the venue category. To keep the project simple, we apply the K-Means clustering algorithm to venue categories only. Other information, such as Venue Summary and Distance, can be used as additional deciding factors. These data will be gathered through the Foursquare Open API. More about the Foursquare Open API can be read at https://foursquare.com/developers/apps

To extract data from Foursquare, we will crawl the Foursquare API and arrange the data in the following form:

In [10]:
column_names_ = ['Neighborhood', 'Neighborhood Latitude',
       'Neighborhood Longitude', 'Venue', 'Venue Summary', 'Venue Category',
       'Distance'] 
venues = pd.DataFrame(columns=column_names_)
venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Venue Category,Distance


## Methodology

## Part 1: Gathering Neighborhoods and Their Coordinates

To obtain the neighborhood data for the Bandung area, we extract the JSON data from https://github.com/tyohan/bandung-map-dataset. The neighborhood data are then transformed into a pandas DataFrame.

The raw JSON data contain Polygon-type coordinates. To simplify the extraction, we take the first coordinate of each polygon as the neighborhood's point. As a consequence, the extracted coordinate is a point on the **border** of the actual neighborhood rather than at its center.
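
The point extraction described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the inline feature and its property names (`borough`, `neighborhood`) are made up for the example, and the keys in the real dataset may differ. GeoJSON stores each position as `[longitude, latitude]`, so the order is swapped when building the row.

```python
import json

# A made-up GeoJSON feature; the property names are hypothetical placeholders.
raw = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"borough": "Borough X", "neighborhood": "Neighborhood Y"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[107.60, -6.90], [107.61, -6.90],
                         [107.61, -6.91], [107.60, -6.90]]]
      }
    }
  ]
}
"""

def first_vertex(feature):
    """Return (latitude, longitude) of the first vertex of the outer ring."""
    outer_ring = feature["geometry"]["coordinates"][0]
    lon, lat = outer_ring[0][:2]  # GeoJSON order is [longitude, latitude]
    return lat, lon

rows = []
for feature in json.loads(raw)["features"]:
    lat, lon = first_vertex(feature)
    props = feature["properties"]
    rows.append({
        "Boroughs": props["borough"],
        "Neighborhoods": props["neighborhood"],
        "Latitude": lat,
        "Longitude": lon,
    })

print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to obtain the four-column table shown in the Data section.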

This process yields the Borough, Neighborhood, Latitude, and Longitude for each neighborhood in Bandung, Indonesia. The summarized list contains 26 boroughs and 139 neighborhoods.

After that, we plotted the neighborhoods on a map of Bandung.

In [None]:
from PIL import Image
im = Image.open("bride.jpg")
im.rotate(45).show()

## Part 2: Gathering Venues Data in Bandung and Cleaning Them

The venue data are gathered with GET requests to the Foursquare API through a crawling function, credit to https://github.com/chenyang03/Foursquare_Crawler.

Through this function, venues are gathered for every neighborhood within a 1000 m radius of its latitude-longitude coordinate, with a limit of 500 venues per coordinate.
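
As a rough illustration of such a request, the sketch below only builds the GET URL for one coordinate and does not call the network. It assumes Foursquare's legacy v2 `venues/explore` endpoint and its `ll`, `radius`, and `limit` parameters; the crawler actually used may rely on a different endpoint. The function name `build_request_url`, the credentials, and the version date are placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint: Foursquare's legacy v2 "explore" call.
# The crawler used in this project may call a different one.
BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_request_url(lat, lon, client_id, client_secret,
                      radius=1000, limit=500, version="20180605"):
    """Build the GET URL for venues around one neighborhood coordinate."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,           # API version date (placeholder value)
        "ll": f"{lat},{lon}",   # latitude,longitude of the neighborhood
        "radius": radius,       # search radius in meters
        "limit": limit,         # maximum number of venues requested
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_request_url(-6.90, 107.60, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The API may cap the number of venues returned per call below the requested limit, so the response size should be checked rather than assumed.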

Using a function called GetVenuesDataset, credit to https://github.com/alidastgheib, we extract the crawled data described in the previous paragraph into a pandas DataFrame containing the following columns before saving them to CSV:


In [5]:
column_names_ = ['Neighborhood', 'Neighborhood Latitude',
       'Neighborhood Longitude', 'Venue', 'Venue Summary', 'Venue Category',
       'Distance'] 
venues = pd.DataFrame(columns=column_names_)
venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Summary,Venue Category,Distance


## Part 3: One-hot Encoding of Venues

The next step of the methodology is to identify the unique values of the venue categories. These unique categories are then grouped by neighborhood as a one-hot encoding: each unique category becomes a column, and each neighborhood's row counts how many venues of that category appear in it. Note that some categories are eliminated, since we are only interested in categories related to points of interest and attractions. These categories are eliminated manually based on domain knowledge.
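
A minimal sketch of this encoding step, using pandas on a toy table. The neighborhoods, categories, and the `excluded` list are made up for illustration and are not taken from the crawled dataset.

```python
import pandas as pd

# Toy venue rows (made-up values, not from the crawled dataset).
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Museum", "Park", "Museum", "Park", "ATM"],
})

# Hypothetical list of categories judged irrelevant to attractions.
excluded = ["ATM"]
venues = venues[~venues["Venue Category"].isin(excluded)]

# One indicator column per remaining category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Summing the indicators gives the count of each category per neighborhood.
counts = onehot.groupby("Neighborhood").sum()
print(counts)
```

Each row of `counts` is the feature vector of one neighborhood, which is the input the clustering step expects.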

## Part 4: Machine Learning Step, Clustering by K-Means

The one-hot DataFrame then undergoes K-Means clustering with K=5. This was done using the sklearn library.
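
The clustering call might look like the sketch below. The input is a synthetic stand-in for the one-hot count table, and `random_state` and `n_init` are choices made for this example only; the notebook's actual settings are not stated.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the one-hot count table:
# 40 neighborhoods x 6 venue categories (random values, not real data).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(40, 6))

# K=5 as in the text; random_state and n_init are assumptions of this sketch.
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)

labels = kmeans.labels_               # cluster index (0 to 4) per neighborhood
centroids = kmeans.cluster_centers_   # one row of category means per cluster
print(labels[:10])
print(centroids.shape)
```

Fixing `random_state` makes the cluster assignment reproducible between runs, which matters when the clusters are ranked afterwards.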

To rank the clusters, we take the centroid of each cluster, sum its values across the selected unique categories, and sort the clusters in descending order of that sum. The cluster with the highest sum is regarded as the best one in which to build the hotel.
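
The ranking rule can be sketched with pandas as follows. The centroid values and category names are made up for illustration and do not correspond to the results reported in the next section.

```python
import pandas as pd

# Made-up centroid values for three clusters and three categories.
centroids = pd.DataFrame(
    [[2.0, 1.5, 0.5],
     [0.2, 0.1, 0.0],
     [1.0, 3.0, 2.0]],
    index=["Cluster1", "Cluster2", "Cluster3"],
    columns=["Museum", "Park", "Mall"],
)

# Sum each centroid across the selected categories, then sort descending.
ranking = centroids.sum(axis=1).sort_values(ascending=False)
best_cluster = ranking.index[0]
print(ranking)
```

Because a centroid holds the average venue count per category within its cluster, a larger sum means the neighborhoods in that cluster have, on average, more attraction venues nearby.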

## Result and Discussion

The ranked clusters are:

| Cluster | Sum of centroid values |
|---|---|
| **Cluster1** | **18.125000** |
| Cluster4 | 12.555556 |
| Cluster3 | 12.500000 |
| Cluster5 | 11.375000 |
| Cluster2 | 2.921569 |

**Therefore, the first cluster is the ideal location to build the hotel (based on the criterion of proximity to points of interest and attraction venues).**

<img src="https://i.ibb.co/D7ZJ5Ss/result.png" width="200">

<img src="https://i.ibb.co/JH7kZHG/Viz.png" width="1000">

The first cluster is denoted by purple circles.

## Conclusion
After applying K-Means as the clustering algorithm, we find that the best neighborhoods in which to build the hotel (based on proximity to point-of-interest/attraction venues) are those in Cluster 1, which includes the Babakan Sari, Binong, Cibangkong, Gumuruh, Kacapiring, Kebon Gedang, Kebon Waru, and Maleer neighborhoods.

**Thank you for reading and assessing this project. Should you have any questions, you can contact me at satriaawb@yahoo.co.id**