<h1 align="center"> Battle of Neighborhoods - Clustering and Segmenting the Neighborhoods of Jakarta and Surabaya: Where Is a Good Place to Open a New Restaurant in the Area?</h1>

<p align ="center"> Muhammad Adisatriyo Pratama
<br>
<br>
15th December 2020
</p>




# 1. Introduction

In this report I analyze clusters of neighborhoods in two major metropolitan areas in Indonesia: Jakarta and Surabaya. Both are among the most popular and most populated metropolitan areas in the country. Although Jakarta has approximately three times the population of Surabaya, Surabaya has its own destinations, unique places, and landmarks to visit.

### Disclaimer :
We use the Foursquare API to get data about venues and places in each neighborhood. However, there are other popular venues and places in Jakarta and Surabaya that are not in the Foursquare data, so the results here are not a complete picture of the venues in each neighborhood. Although this article is not perfect, I hope the analysis is useful to you and others.

# 2. Business Problem

The aim of this report is to help tourists and business owners decide where to open new destinations or places, based on the kinds of venues each neighborhood already has. Once the data is obtained, the neighborhoods are clustered and segmented to see which of them are similar in terms of destinations and places. This can also help people who are deciding whether to move to another neighborhood.

# 3. Data Collecting

In this report we need the list of neighborhoods (Kecamatan) for Jakarta and Surabaya. Using the location of each neighborhood, we can search for the most popular venues and places in each category with the **Foursquare API**. We also need the geographical coordinates of each neighborhood in Jakarta and Surabaya so that we can visualize them on **OpenStreetMap using Folium**.

## 3.1 Jakarta

To get the neighborhoods (Kecamatan) in Jakarta, we scrape the data from: https://id.wikipedia.org/wiki/Daftar_kecamatan_dan_kelurahan_di_Daerah_Khusus_Ibukota_Jakarta

This Wikipedia page has several tables, one for each town in Jakarta. Each table lists the names of the neighborhoods (Kecamatan) in that town and the names of the villages (Kelurahan) in each neighborhood.

After processing, we limit the data and concatenate the 5 tables into 1 table containing:

1. *Neighborhood* : Name of the kecamatan; we call it a neighborhood to keep the report easy to read.
2. *Town* : Name of the administrative town for each neighborhood.

At the end, we obtain 48 rows of data, each representing one neighborhood.

## 3.2 Surabaya

We also scrape the neighborhood data for Surabaya from a Wikipedia page: https://id.wikipedia.org/wiki/Daftar_kecamatan_dan_kelurahan_di_Kota_Surabaya

This Wikipedia page contains just 1 table, with the same kind of information as the Jakarta page. Because the table contains some data that we do not need, we keep only the same information that we kept from the Jakarta tables.

## 3.3 Nominatim OpenStreetMap

The data scraped from the Wikipedia pages does not include the coordinates of each neighborhood, so we use the Nominatim OpenStreetMap API to get the *latitude* and *longitude* of each neighborhood.

To use the Nominatim OpenStreetMap API in Python, we can use the **geopy** library and import the **geopy.geocoders.Nominatim** class into the notebook.

We pass the neighborhood name as a keyword to the Nominatim object and get back the corresponding latitude and longitude, which we then add to the neighborhood tables for Jakarta and Surabaya.

## 3.4 Foursquare API

Foursquare is a company focusing on social media services. One of its products is Foursquare City Guide, commonly called Foursquare, which gives information about venues, places, and events within an area of interest. The app also provides personalized recommendations of places to go near the user's current location, based on other users' ratings of those places. With the Foursquare API we can make a call containing the neighborhood's location and get back information about the places and venues there.

After calling the Foursquare API for each neighborhood, we create a **Pandas Dataframe** object with the venue information for Jakarta and Surabaya. The information we obtain is as follows:

1. *Neighborhood* : Name of the kecamatan; we call it a neighborhood to keep the report easy to read.
2. *Town* : Name of the administrative town for each neighborhood.
3. *Latitude* : Latitude coordinates of the neighborhood.
4. *Longitude* : Longitude coordinates of the neighborhood.
5. *Venue* : Name of the venue.
6. *Venue Category* : Category of the venue.
7. *Venue Latitude* : Latitude coordinates of the venue.
8. *Venue Longitude* : Longitude coordinates of the venue.


# 4. Methodology

In this part, I collect data (data scraping) from the Wikipedia pages to get **neighborhood information** for **Jakarta** and **Surabaya**. After getting that information, I use the name of each neighborhood as a keyword to look up its **neighborhood coordinates** (latitude and longitude) using **Nominatim**, the *geopy.geocoders.Nominatim* class. Using the coordinates of each neighborhood, I use the **Foursquare API** to get relevant venues and places near the given **latitude** and **longitude**. From that information we create a pandas dataframe and find the **5 most common venues (categories) for each neighborhood**.

### Import library
Before we start collecting and processing data, we import the libraries that we use in this research notebook.

```python
# basic library
import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

# import folium library for map visualization
# !pip install folium # uncomment this if you have not already installed folium
import folium

# Import Nominatim from geopy.geocoders to look up latitude and longitude
# !pip install geopy # uncomment this if you have not already installed geopy
from geopy.geocoders import Nominatim

# import k-means for the clustering stage
from sklearn.cluster import KMeans
```
## 4.1 Data Collection

## Explore Jakarta

### In this part I wrangle data from the Wikipedia page to get the neighborhood data for Jakarta
URL : https://id.wikipedia.org/wiki/Daftar_kecamatan_dan_kelurahan_di_Daerah_Khusus_Ibukota_Jakarta

This Wikipedia page has several tables that we need, so we can use the **pandas.read_html()** function to get them as a list of dataframes.
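
The concatenation step can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: the helper name `combine_kecamatan_tables`, the `Kecamatan` column header, and the toy tables are assumptions made for illustration, so check them against what `pandas.read_html()` actually returns for the page.

```python
import pandas as pd

def combine_kecamatan_tables(tables, towns, column='Kecamatan'):
    """Stack one table per town into a single Neighborhood/Town dataframe.

    `column` is an assumed header name; adjust it to the scraped tables.
    """
    frames = []
    for table, town in zip(tables, towns):
        # keep one row per kecamatan, since each table also lists the kelurahan
        frame = table[[column]].drop_duplicates()
        frame = frame.rename(columns={column: 'Neighborhood'})
        frame['Town'] = town
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Toy tables standing in for the list returned by pd.read_html(url)
tables = [
    pd.DataFrame({'Kecamatan': ['Gambir', 'Gambir', 'Menteng'],
                  'Kelurahan': ['Cideng', 'Gambir', 'Menteng']}),
    pd.DataFrame({'Kecamatan': ['Tebet'], 'Kelurahan': ['Tebet Barat']}),
]
example_df = combine_kecamatan_tables(tables, ['Jakarta Pusat', 'Jakarta Selatan'])
print(example_df)
```

With the real page, `tables` would come from `pd.read_html(url)`, which needs an HTML parser such as lxml to be installed.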

### After processing the data, here is the head of the resulting dataframe that we will use for now.

![Jakarta Dataframe 1](assets/jakarta_df_1.png)

## Explore Surabaya

Here is the URL that we use for data scraping: https://id.wikipedia.org/wiki/Daftar_kecamatan_dan_kelurahan_di_Kota_Surabaya

The approach to getting the data is pretty much the same as what I did for the Jakarta neighborhoods.

### After processing the data, here is the head of the resulting dataframe that we will use for now.

![Surabaya Dataframe 1](assets/surabaya_df_1.png)

## Nominatim OpenStreetMap API

To get the latitude and longitude of each neighborhood in Jakarta and Surabaya, we can use the **geopy.geocoders.Nominatim** class, passing the neighborhood name as the search keyword.

First we create a Nominatim object. Here is the code:

```python
# Create Nominatim object as 'geolocator'
geolocator = Nominatim(user_agent='explorer')
```

To get the latitude and longitude of each neighborhood, I define some functions to apply to the corresponding dataframe.

Function definitions:

```python
# Geocode a neighborhood and return one coordinate, or None when Nominatim finds nothing
def _get_coordinate(neighborhood, city, attribute):
    location = geolocator.geocode(f'{neighborhood}, {city}, Indonesia')
    # geocode() returns None when there is no match
    return getattr(location, attribute) if location is not None else None

def get_latitude_jakarta(neighborhood):
    return _get_coordinate(neighborhood, 'Jakarta', 'latitude')

def get_longitude_jakarta(neighborhood):
    return _get_coordinate(neighborhood, 'Jakarta', 'longitude')

def get_latitude_surabaya(neighborhood):
    return _get_coordinate(neighborhood, 'Surabaya', 'latitude')

def get_longitude_surabaya(neighborhood):
    return _get_coordinate(neighborhood, 'Surabaya', 'longitude')
```

After defining these functions, we apply them to each dataframe, using the neighborhood name as the keyword.

#### Here are the updated dataframes.
Jakarta
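
As an illustration of applying a geocoder to a dataframe, here is a minimal sketch that calls the geocoder once per neighborhood and fills both coordinate columns. The helper `add_coordinates`, the stand-in `fake_geocode`, and the coordinate values are hypothetical and only there so that the example runs without network access; they are not taken from the notebook.

```python
from types import SimpleNamespace

import pandas as pd

def add_coordinates(df, geocode, city):
    """Return a copy of df with Latitude/Longitude columns filled from geocode()."""
    latitudes, longitudes = [], []
    for name in df['Neighborhood']:
        # one lookup per neighborhood; a missing match is stored as None
        location = geocode(f'{name}, {city}, Indonesia')
        latitudes.append(location.latitude if location is not None else None)
        longitudes.append(location.longitude if location is not None else None)
    out = df.copy()
    out['Latitude'] = latitudes
    out['Longitude'] = longitudes
    return out

def fake_geocode(query):
    # stand-in for geolocator.geocode with placeholder coordinates
    known = {'Gambir, Jakarta, Indonesia': SimpleNamespace(latitude=-6.17, longitude=106.82)}
    return known.get(query)

demo = add_coordinates(pd.DataFrame({'Neighborhood': ['Gambir', 'Nowhere']}),
                       fake_geocode, 'Jakarta')
print(demo)
```

In the notebook, `geolocator.geocode` would take the place of `fake_geocode`. Wrapping it in geopy's `RateLimiter` (from `geopy.extra.rate_limiter`) with `min_delay_seconds=1` is one way to avoid sending requests to the public Nominatim service too quickly.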

![Jakarta Dataframe 2](assets/jakarta_df_2.png)

Surabaya

![Surabaya Dataframe 2](assets/surabaya_df_2.png)

## 4.2 Map Visualization

We visualize the neighborhoods from both dataframes on an OpenStreetMap view using Folium.

### Jakarta Neighborhood Map View

To visualize Jakarta on OpenStreetMap, we first need the coordinates of Jakarta itself. Then we use the coordinates of each neighborhood to draw a circle marker representing it.

Here is the script for getting the coordinates of Jakarta:
```python
address = 'Jakarta'

location = geolocator.geocode(address)
jakarta_latitude = location.latitude
jakarta_longitude = location.longitude
print(f'Coordinates of Jakarta are {jakarta_latitude}, {jakarta_longitude}')
```

Script for visualizing the map using the Folium Python library:
```python
jakarta_map = folium.Map(location=[jakarta_latitude, jakarta_longitude], zoom_start=11)

for latitude, longitude, borough, neighborhood in zip(jakarta_df['Latitude'], jakarta_df['Longitude'], jakarta_df['Town'], jakarta_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(jakarta_map)

# Show the map   
jakarta_map
```
### Here is the map for Jakarta :
![Jakarta Map 1](assets/jakarta_map_1.png)

### We do the same to visualize Surabaya on OpenStreetMap :
![Surabaya Map 1](assets/surabaya_map_1.png)

## 4.3 Foursquare API

First we define the Foursquare API credentials and version. You can get your Foursquare credentials by signing up for Foursquare Developer. More: [Foursquare Developer API](https://foursquare.com/developers/)

We create a function to get nearby venues and places around each neighborhood. Here is the function:
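
A minimal sketch of defining the credentials is shown here. It reads them from environment variables rather than hard-coding them; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the version date are assumptions for illustration, not values from the notebook.

```python
import os

# Hypothetical environment variable names; set them before running the notebook
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

# The v parameter is a date in YYYYMMDD format; this value is only an example
VERSION = '20201215'

if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set')
```

Keeping the secret out of the notebook also makes it safer to publish the notebook on GitHub.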
```python
# Function that returns a dataframe of nearby venues and their categories for each neighborhood
def get_nearby_venues(names, latitudes, longitudes, radius=500):
    
    # create an empty list
    venues_list=[]
    
    # loop over each neighborhood name and its coordinates
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

        
    # Create pandas dataframe from venues_list
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return nearby_venues
```

Then, apply it to the dataframe : 

```python
jakarta_venues = get_nearby_venues(jakarta_df['Neighborhood'], jakarta_df['Latitude'], jakarta_df['Longitude'])
```

Here are the resulting dataframes for Jakarta and Surabaya.

***Jakarta DataFrame***

![Jakarta Dataframe 3](assets/jakarta_df_3.png)

***Surabaya DataFrame***

![Surabaya Dataframe 3](assets/surabaya_df_3.png)

## 4.4 One Hot Encoding

To find the top 5 most common venues, we need to transform the categorical venue data into numbers with One Hot Encoding, using the **pandas.get_dummies()** function. After getting the one hot encoded dataframe, we group the rows by neighborhood and take the mean of each venue category, which gives the frequency of each category in each neighborhood. Here are the results after computing these averages.
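
The encoding and averaging steps can be sketched on a toy dataframe as follows. The neighborhood names and categories here are made up for illustration; the column names follow the venues dataframe described in this report.

```python
import pandas as pd

# Toy venues table: three venues in neighborhood A, one in neighborhood B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Cafe'],
})

# one hot encode the categories: one column per category, one row per venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of each column per neighborhood is the share of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

In this toy example, two of the three venues in neighborhood A are cafes, so its `Cafe` value is 2/3 and its `Park` value is 1/3.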

### One hot encoding for Jakarta venues 
![Jakarta Dataframe ohe](assets/jakarta_df_ohe.png)

### One hot encoding for Surabaya venues 
![Surabaya Dataframe ohe](assets/surabaya_df_ohe.png)

### Find the 5 most common venues
To find the 5 most common venues in an area, we loop over the neighborhoods and pick the venue categories that occur most often in each one.

Script :
```python
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
```

#### Then we use these columns to create a new dataframe for Jakarta and for Surabaya.
Script for Jakarta :

```python
# create a new dataframe
jakarta_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
jakarta_neighborhoods_venues_sorted['Neighborhood'] = jakarta_grouped['Neighborhood']

for ind in np.arange(jakarta_grouped.shape[0]):
    jakarta_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(jakarta_grouped.iloc[ind, :], num_top_venues)
```
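
The script above calls `return_most_common_venues`, which is not shown in this report. A minimal sketch of such a helper, assuming each row holds the neighborhood name in its first position followed by one frequency per venue category, could look like this:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    """Return the names of the most frequent venue categories in one grouped row."""
    # skip the first entry, which is assumed to be the neighborhood name
    category_frequencies = row.iloc[1:].astype(float)
    ranked = category_frequencies.sort_values(ascending=False)
    return ranked.index.values[:num_top_venues]

# Toy row standing in for one row of the grouped dataframe
toy_row = pd.Series({'Neighborhood': 'A', 'Cafe': 0.2, 'Park': 0.5, 'Bar': 0.3})
top_two = list(return_most_common_venues(toy_row, 2))
print(top_two)
```

The actual implementation in the notebook may differ; this sketch only shows the sort-and-slice idea.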

***Jakarta DataFrame***

![Jakarta Dataframe 4](assets/jakarta_df_4.png)

***Surabaya DataFrame***

![Surabaya Dataframe 4](assets/surabaya_df_4.png)

# 5. Modeling

After we get the top 5 most common venues for each neighborhood in Jakarta and Surabaya, we can begin creating a clustering model using **K-Means Clustering** from Scikit-Learn.

We run K-Means to cluster and segment the neighborhoods into 5 different clusters based on their types of venues and places.

```python
# set number of clusters
kclusters = 5

# instantiate kmeans model
kmeans = KMeans(n_clusters=kclusters, random_state=0)
```
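
To show what fitting produces, here is a small toy run of K-Means. The feature values and the choice of 2 clusters are made up for illustration and are not the report's data; `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: two rows dominated by the first category, two by the second
toy_features = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_features)

# labels_ holds one cluster index per row, in the same order as the input rows
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.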

## 5.1 Prepare the data (features) for modeling

We use the grouped dataframes for Jakarta and Surabaya, which contain the one hot encoded venue frequencies, and drop the 'Neighborhood' column, which contains the neighborhood names (string dtype).
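
A minimal sketch of this preparation step on a toy grouped dataframe is shown here. The toy values are made up; presumably the notebook builds `jakarta_cluster` from `jakarta_grouped` in the same way, but that is an assumption.

```python
import pandas as pd

# Toy stand-in for the grouped dataframe of venue frequencies
toy_grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Cafe': [0.5, 0.0],
    'Park': [0.5, 1.0],
})

# keep only the numeric feature columns for K-Means
toy_cluster = toy_grouped.drop(columns='Neighborhood')
print(toy_cluster)
```

`drop` returns a new dataframe, so the grouped dataframe keeps its 'Neighborhood' column for the later merge.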

## 5.2 Begin modeling

### Clustering in Jakarta Neighborhood

```python
# fit the model on the one hot encoded features
jakarta_kmeans = kmeans.fit(jakarta_cluster)

# add clustering labels
jakarta_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', jakarta_kmeans.labels_)

jakarta_merged = jakarta_df

# join the sorted venues (with cluster labels) onto jakarta_df to add latitude/longitude for each neighborhood
jakarta_merged = jakarta_merged.join(jakarta_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

jakarta_merged.head()
```
#### The resulting dataframe : 
![Jakarta Dataframe 5](assets/jakarta_df_5.png)

### Clustering in Surabaya Neighborhood
For clustering in Surabaya, we use the same script as for Jakarta, with the dataframe of Surabaya neighborhoods.

#### The resulting dataframe :
![Surabaya Dataframe 5](assets/surabaya_df_5.png)

### The 'Cluster Labels' column indicates the resulting cluster for each neighborhood in Jakarta and Surabaya.

## 5.3 Visualizing the clusters

Using OpenStreetMap through Folium, we can plot each neighborhood at its coordinates, with a different color for each cluster.

Here is the script for the Jakarta clusters; with a different dataframe, we use the same script for the Surabaya clusters:
```python
# create map
jakarta_map_clusters = folium.Map(location=[jakarta_latitude, jakarta_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(jakarta_merged['Latitude'], jakarta_merged['Longitude'], jakarta_merged['Neighborhood'], jakarta_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(jakarta_map_clusters)

# Show the map       
jakarta_map_clusters
```

### Jakarta Clusters
![Jakarta Map 2](assets/jakarta_map_2.png)
### Surabaya Clusters
![Surabaya Map 2](assets/surabaya_map_2.png)


# 6. Results and Discussion

In this section we look at the clustering results for the Jakarta and Surabaya neighborhoods.
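
One way to inspect a single cluster is to filter the merged dataframe by its label, as in the sketch below. The helper name `neighborhoods_in_cluster` and the toy data are hypothetical, and the sketch assumes that "Cluster 1" in this report corresponds to label 0, since K-Means labels start at 0.

```python
import pandas as pd

def neighborhoods_in_cluster(merged, label):
    """Return the rows of `merged` whose 'Cluster Labels' value equals `label`."""
    return merged.loc[merged['Cluster Labels'] == label].reset_index(drop=True)

# Toy stand-in for the merged dataframe
toy_merged = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Cluster Labels': [0, 3, 0],
    '1st Most Common Venue': ['Convenience Store', 'Cafe', 'Convenience Store'],
})

cluster_1 = neighborhoods_in_cluster(toy_merged, 0)
print(cluster_1)
```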

## 6.1 Results in Jakarta Neighborhood

- **Cluster 1** : In this cluster the most common venue is the convenience store, which appears as both the first and second most common venue in the area.

![Jakarta Cluster 1](assets/jakarta_cluster_1.png)

- **Cluster 2** : This cluster has only one neighborhood, where the most common venue is some kind of dessert place.

![Jakarta Cluster 2](assets/jakarta_cluster_2.png)

- **Cluster 3** : This cluster also has only one neighborhood, with Resort as the most common venue in the area.

![Jakarta Cluster 3](assets/jakarta_cluster_3.png)

- **Cluster 4** : This cluster contains more neighborhoods. The most common venues here are quite diverse, as we can see from the variety of restaurants: Japanese, Chinese, Vegan, Mediterranean, Asian, and other types. With this variety of places and venues, it is a good area for a hangout, with venues like restaurants, coffee shops, theme parks, movie theaters, and clothing stores.

![Jakarta Cluster 4](assets/jakarta_cluster_4.png)

- **Cluster 5** : In this cluster most of the neighborhoods have Indonesian Restaurant as the first most common venue. We can say that this cluster is an area for having a meal because of the variety of restaurants. That said, if you want to open a new restaurant in these neighborhoods, this is the area with the most restaurant variety, so the existing restaurants are likely to be your competitors.

![Jakarta Cluster 5](assets/jakarta_cluster_5.png)

## 6.2 Results in Surabaya Neighborhood
- **Cluster 1** : In this cluster the most common venue is the convenience store, which appears as both the first and second most common venue in the area.

![Surabaya Cluster 1](assets/surabaya_cluster_1.png)

- **Cluster 2** : This area is quite popular for restaurants and other food places, as shown by their variety. If you want to open a new restaurant here, keep in mind that the neighborhoods already have a variety of restaurants.

![Surabaya Cluster 2](assets/surabaya_cluster_2.png)

- **Cluster 3** : This cluster covers the area that is popular for cafes and coffee shops. These neighborhoods are well suited for people who want to hang out or just enjoy a cup of coffee.

![Surabaya Cluster 3](assets/surabaya_cluster_3.png)

- **Cluster 4** : This cluster has only one neighborhood, with Soccer Stadium as the first most common venue.

![Surabaya Cluster 4](assets/surabaya_cluster_4.png)

- **Cluster 5** : This cluster also has just one neighborhood, with Wine Bar as the first most common venue in the area.

![Surabaya Cluster 5](assets/surabaya_cluster_5.png)

## 6.3 Discussion

From the resulting clusters in Jakarta and Surabaya, we can see that food places and restaurants are the most common venues, and the restaurant clusters contain the most neighborhoods. Although Jakarta is much bigger in area than Surabaya and has about three times its population, the neighborhoods are relatively similar, with restaurants and bars as the most common venues. There are also different venue types for leisure and hanging out, such as parks, movie theaters, golf courses, resorts, and even a soccer stadium.

# 7. Conclusion

After clustering the neighborhoods in Jakarta and Surabaya, we get several clusters and segments based on the venues and places from the Foursquare API. In both Jakarta and Surabaya there is a cluster where restaurants are the first most common venue in the area. Coming back to the question from the beginning of this article: if you want to open a new restaurant, I hope this analysis helps you decide which neighborhood to open it in.

Link to original notebook : [notebook](https://github.com/roynozoa/Coursera_Capstone/blob/master/IBM%20Data%20Science/BattleOfNeighborhoods_Notebook.ipynb)


Thanks for reading this article. Any suggestions or feedback will be appreciated. If you want to reach me, feel free to send a message through LinkedIn, and if you want to see more of my portfolio, you can find it on my GitHub account.

- [Linkedin](https://www.linkedin.com/in/muhammad-adisatriyo-pratama/)
- [GitHub](https://github.com/roynozoa)


## References :
- [Coursera Applied Data Science Capstone Course](https://www.coursera.org/learn/applied-data-science-capstone)

## Thanks to :
- [Foursquare Developer API](https://foursquare.com/developers/)
- [Indonesian Ministry of Internal Affairs (Kemendagri)](https://www.kemendagri.go.id/page/read/48/peraturan-menteri-dalam-negeri-no72-tahun-2019) Accessed via Wikipedia page [neighborhood Jakarta](https://id.wikipedia.org/wiki/Daftar_kecamatan_dan_kelurahan_di_Daerah_Khusus_Ibukota_Jakarta) and [neighborhood Surabaya](https://id.wikipedia.org/wiki/Daftar_kecamatan_dan_kelurahan_di_Kota_Surabaya).
- Python library packages (pandas, numpy, matplotlib, folium, scikit-learn, and geopy)

## Tools :
- Jupyter Notebooks using Visual Studio Code
- GitHub (version control)

Muhammad Adisatriyo Pratama - December 2020