# IBM Applied Data Science Capstone
### Week 5 Final Project
**_Opening a New Shopping Mall in Kuala Lumpur, Malaysia_**
- Build a dataframe of the neighborhoods in Kuala Lumpur by scraping the data from a Wikipedia page
- Get the geographical coordinates (latitude and longitude) of the neighborhoods
- Obtain the venue data for the neighborhoods from the Foursquare API
- Explore, cluster and analyze the neighborhoods
- Select the best cluster to open a new shopping mall
***
### 1. Import libraries

In [1]:
import numpy as np 
import pandas as pd
import json 
from geopy.geocoders import Nominatim 
import geocoder
import requests 
from bs4 import BeautifulSoup
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

### 2. Scrape data from the Wikipedia page into a DataFrame

In [2]:
data = requests.get("https://en.wikipedia.org/wiki/Category:Suburbs_in_Kuala_Lumpur").text

In [3]:
so = BeautifulSoup(data, 'html.parser')

In [4]:
# create a list to store neighborhood data
neighborhoodList = []

In [5]:
# append the data into the list
for row in so.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text)

In [6]:
# create a new DataFrame from the list
kl_df = pd.DataFrame({"Neighborhood": neighborhoodList})

kl_df.head()

Unnamed: 0,Neighborhood
0,Alam Damai
1,"Ampang, Kuala Lumpur"
2,Bandar Menjalara
3,Bandar Sri Permaisuri
4,Bandar Tasik Selatan


### 3. Get the geographical coordinates

In [7]:
# define a function to get coordinates
# (max_retries is an added parameter so a failed lookup cannot loop forever)
def get_latlng(neighborhood, max_retries=5):
    # try the lookup a bounded number of times
    for _ in range(max_retries):
        g = geocoder.arcgis('{}, Kuala Lumpur, Malaysia'.format(neighborhood))
        lat_lng_coords = g.latlng
        if lat_lng_coords is not None:
            return lat_lng_coords
    # give up and return empty coordinates for this neighborhood
    return [None, None]

In [8]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in kl_df["Neighborhood"].tolist() ]

In [9]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [10]:
# merge the coordinates into the original dataframe
kl_df['Latitude'] = df_coords['Latitude']
kl_df['Longitude'] = df_coords['Longitude']

In [11]:
# check the neighborhoods and the coordinates
print(kl_df.shape)
kl_df

(71, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Alam Damai,3.057690,101.743880
1,"Ampang, Kuala Lumpur",3.148492,101.696727
2,Bandar Menjalara,3.190350,101.625450
3,Bandar Sri Permaisuri,3.103910,101.712260
4,Bandar Tasik Selatan,3.072750,101.714610
5,Bandar Tun Razak,3.082800,101.722810
6,Bangsar,3.129200,101.678440
7,Bangsar Park,3.134780,101.672620
8,Bangsar South,3.111020,101.662830
9,Batu 11 Cheras,3.098980,101.734990
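A geocoder can return the same point for several different names, for example when a name is not recognised and the service falls back to a broader match. In the sorted table in section 7, Maluri and Miharja share the coordinates 3.14789, 101.69405. The sketch below shows one way to flag such rows before clustering. It uses plain Python on three rows copied from this notebook; the variable names are illustrative and not part of the original analysis.

```python
from collections import defaultdict

# Three (neighborhood, latitude, longitude) rows copied from this notebook's tables.
rows = [
    ("Maluri", 3.14789, 101.69405),
    ("Miharja", 3.14789, 101.69405),
    ("Kepong", 3.2175, 101.63763),
]

# Group neighborhood names by their coordinate pair.
by_point = defaultdict(list)
for name, lat, lng in rows:
    by_point[(lat, lng)].append(name)

# Keep only coordinate pairs shared by more than one neighborhood.
duplicates = {point: names for point, names in by_point.items() if len(names) > 1}

for point, names in duplicates.items():
    print(point, names)
```

On the full DataFrame, a comparable check would be `kl_df[kl_df.duplicated(subset=['Latitude', 'Longitude'], keep=False)]`.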


In [12]:
# save the DataFrame as CSV file
kl_df.to_csv("kl_df.csv", index=False)

### 4. Create a map of Kuala Lumpur with neighborhoods superimposed on top

In [13]:
# get the coordinates of Kuala Lumpur
address = 'Kuala Lumpur, Malaysia'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Kuala Lumpur, Malaysia are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Kuala Lumpur, Malaysia are 3.1516636, 101.6943028.


In [14]:
# create map of Kuala Lumpur using latitude and longitude values
map_kl = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_kl)  
    
map_kl

In [15]:
# save the map as HTML file
map_kl.save('map_klm.html')

### 5. Use the Foursquare API to explore the neighborhoods

In [38]:
# define Foursquare Credentials and Version
CLIENT_ID = 'your credentials' # your Foursquare ID
CLIENT_SECRET = 'your credentials' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: your credentials
CLIENT_SECRET: your credentials


**Now, let's get the top 100 venues that are within a radius of 2000 meters.**

In [17]:
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
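The request URL above is assembled with string formatting, which works for the numeric values used here but does not escape special characters. As a minimal sketch, the same query string can be built with `urllib.parse.urlencode` from the standard library. The helper name `build_explore_url` is hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Endpoint taken from the request URL used in this notebook.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the explore URL with URL-encoded query parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)


# Placeholder credentials and the Alam Damai coordinates from the table above.
url = build_explore_url("your credentials", "your credentials", "20180605",
                        3.05769, 101.74388, 2000, 100)

# Parse the query string back to confirm the values survive encoding.
query = parse_qs(urlparse(url).query)
print(query["ll"], query["radius"], query["limit"])
```

`requests.get` also accepts a `params` dictionary and performs the same encoding, so the dictionary could be passed to it directly instead of building the URL by hand.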

In [18]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(7080, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Alam Damai,3.05769,101.74388,Pengedar Shaklee Kuala Lumpur,3.061235,101.740696,Supplement Shop
1,Alam Damai,3.05769,101.74388,Machi Noodle 妈子面,3.057695,101.746635,Noodle House
2,Alam Damai,3.05769,101.74388,628火焰鑫茶室,3.058442,101.747947,Chinese Restaurant
3,Alam Damai,3.05769,101.74388,Restoran Ikbal,3.061134,101.75022,Restaurant
4,Alam Damai,3.05769,101.74388,Minang Tomyam,3.057185,101.749812,Seafood Restaurant


**Let's find out how many unique categories can be curated from all the returned venues**

In [19]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 311 unique categories.


In [20]:
# print out the first 50 categories
venues_df['VenueCategory'].unique()[:50]

array(['Supplement Shop', 'Noodle House', 'Chinese Restaurant',
       'Restaurant', 'Seafood Restaurant', 'Breakfast Spot',
       'Vegetarian / Vegan Restaurant', 'Food Court', 'Asian Restaurant',
       'Dim Sum Restaurant', 'Other Great Outdoors', 'Park', 'Coffee Shop',
       'Indian Restaurant', 'Bubble Tea Shop', 'Spa', 'Convenience Store',
       'Snack Place', 'Japanese Restaurant', 'Chinese Breakfast Place',
       'Food Truck', 'Pet Store', 'Dessert Shop', 'Farmers Market', 'Café',
       'Outlet Store', 'Cantonese Restaurant', 'Malay Restaurant',
       'Gym / Fitness Center', 'Athletics & Sports',
       'Fast Food Restaurant', 'Bakery', 'Steakhouse',
       'Middle Eastern Restaurant', 'Badminton Court', 'Hakka Restaurant',
       'Mamak Restaurant', 'Winery', 'Burger Joint', 'College Bookstore',
       'Grocery Store', 'Halal Restaurant', 'Playground',
       'Vietnamese Restaurant', 'Hostel', 'South Indian Restaurant',
       'Exhibit', 'Hotel', 'Chettinad Restaurant', 

In [21]:
# check if the results contain "Shopping Mall"
"Shopping Mall" in venues_df['VenueCategory'].unique()

True

### 6. Analyze Each Neighborhood

In [22]:
# one hot encoding
kl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kl_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [kl_onehot.columns[-1]] + list(kl_onehot.columns[:-1])
kl_onehot = kl_onehot[fixed_columns]

print(kl_onehot.shape)
kl_onehot.head()

(7080, 312)


Unnamed: 0,Neighborhoods,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Volleyball Court,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Alam Damai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Alam Damai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Alam Damai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Alam Damai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Alam Damai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category**

In [23]:
kl_grouped = kl_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(kl_grouped.shape)
kl_grouped

(71, 312)


Unnamed: 0,Neighborhoods,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Volleyball Court,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Alam Damai,0.00,0.0,0.00,0.00,0.000000,0.00,0.00,0.00,0.060000,...,0.000000,0.00,0.00,0.00,0.00,0.01,0.000000,0.00,0.00,0.00
1,"Ampang, Kuala Lumpur",0.00,0.0,0.00,0.00,0.000000,0.00,0.00,0.00,0.010000,...,0.000000,0.00,0.00,0.00,0.00,0.00,0.000000,0.00,0.00,0.00
2,Bandar Menjalara,0.00,0.0,0.01,0.00,0.000000,0.00,0.00,0.00,0.040000,...,0.000000,0.00,0.00,0.00,0.00,0.00,0.000000,0.00,0.00,0.00
3,Bandar Sri Permaisuri,0.01,0.0,0.00,0.00,0.000000,0.01,0.00,0.00,0.030000,...,0.000000,0.01,0.00,0.00,0.00,0.00,0.000000,0.00,0.00,0.00
4,Bandar Tasik Selatan,0.01,0.0,0.00,0.00,0.000000,0.00,0.00,0.00,0.100000,...,0.000000,0.00,0.00,0.01,0.00,0.00,0.000000,0.00,0.00,0.00
5,Bandar Tun Razak,0.01,0.0,0.00,0.00,0.000000,0.00,0.00,0.01,0.090000,...,0.000000,0.00,0.00,0.00,0.00,0.00,0.000000,0.00,0.00,0.00
6,Bangsar,0.00,0.0,0.00,0.00,0.000000,0.00,0.00,0.02,0.000000,...,0.000000,0.00,0.00,0.00,0.00,0.00,0.000000,0.00,0.01,0.00
7,Bangsar Park,0.00,0.0,0.00,0.00,0.000000,0.00,0.00,0.02,0.000000,...,0.000000,0.00,0.00,0.01,0.00,0.00,0.000000,0.00,0.01,0.00
8,Bangsar South,0.02,0.0,0.00,0.00,0.000000,0.00,0.00,0.00,0.010000,...,0.000000,0.00,0.00,0.00,0.00,0.00,0.000000,0.01,0.00,0.00
9,Batu 11 Cheras,0.00,0.0,0.00,0.00,0.000000,0.00,0.00,0.00,0.060000,...,0.000000,0.00,0.00,0.00,0.00,0.00,0.000000,0.00,0.00,0.00
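Because each one-hot column holds only 0 or 1, its mean within a neighborhood is the share of that neighborhood's venues that fall in the category. With the limit of 100 venues per request, a value of 0.01 corresponds to a single venue when all 100 are returned. The sketch below reproduces that calculation by hand on toy data; the neighborhoods "A" and "B" and their venues are made up for illustration and are not Foursquare results.

```python
from collections import Counter, defaultdict

# Toy (neighborhood, category) pairs; illustrative values only.
rows = [
    ("A", "Shopping Mall"),
    ("A", "Café"),
    ("A", "Café"),
    ("A", "Park"),
    ("B", "Café"),
    ("B", "Park"),
]

# Count venues per category within each neighborhood.
counts = defaultdict(Counter)
for neighborhood, category in rows:
    counts[neighborhood][category] += 1

# Share of each category among a neighborhood's venues,
# which is what the mean of a 0/1 column gives after groupby.
shares = {
    neighborhood: {cat: n / sum(c.values()) for cat, n in c.items()}
    for neighborhood, c in counts.items()
}

# Shopping mall share per neighborhood, 0.0 when the category is absent.
mall_share = {n: shares[n].get("Shopping Mall", 0.0) for n in shares}
print(mall_share)
```

A consequence worth keeping in mind is that these values are proportions of the returned venues, not counts of malls, so two neighborhoods with different numbers of returned venues are not directly comparable by count.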


In [24]:
len(kl_grouped[kl_grouped["Shopping Mall"] > 0])

39

**Create a new DataFrame for Shopping Mall data only**

In [25]:
kl_mall = kl_grouped[["Neighborhoods","Shopping Mall"]]

In [26]:
kl_mall.head()

Unnamed: 0,Neighborhoods,Shopping Mall
0,Alam Damai,0.0
1,"Ampang, Kuala Lumpur",0.01
2,Bandar Menjalara,0.01
3,Bandar Sri Permaisuri,0.0
4,Bandar Tasik Selatan,0.01


### 7. Cluster Neighborhoods
Run k-means to cluster the neighborhoods in Kuala Lumpur into 3 clusters.

In [27]:
# set number of clusters
kclusters = 3

kl_clustering = kl_mall.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kl_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 2, 2, 1, 0])
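The label numbers that k-means assigns are arbitrary: label 0 is not guaranteed to be the cluster with the lowest shopping mall frequency. One way to make the labels easier to read is to rank the clusters by their centers. The sketch below does this in plain Python; the function name `relabel_by_center` and the example centers and labels are hypothetical. With scikit-learn, the fitted centers would come from `kmeans.cluster_centers_`.

```python
def relabel_by_center(labels, centers):
    """Map arbitrary cluster ids to ranks: 0 = smallest center, 1 = next, and so on."""
    # Cluster ids sorted by their center value, smallest first.
    order = sorted(range(len(centers)), key=lambda k: centers[k])
    # Old cluster id -> rank of its center.
    rank = {old: new for new, old in enumerate(order)}
    return [rank[label] for label in labels]


# Hypothetical one-dimensional centers for three clusters, deliberately unordered.
centers = [0.048, 0.003, 0.021]
# Hypothetical labels for four neighborhoods.
labels = [0, 1, 2, 1]

ranked = relabel_by_center(labels, centers)
print(ranked)
```

After relabeling, a higher label always means a higher shopping mall frequency, which makes the cluster tables and the map legend easier to interpret.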

In [28]:
# create a new dataframe that holds the shopping mall frequency and the cluster label for each neighborhood
kl_merged = kl_mall.copy()

# add clustering labels
kl_merged["Cluster Labels"] = kmeans.labels_

In [29]:
kl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
kl_merged.head()

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels
0,Alam Damai,0.0,0
1,"Ampang, Kuala Lumpur",0.01,0
2,Bandar Menjalara,0.01,0
3,Bandar Sri Permaisuri,0.0,0
4,Bandar Tasik Selatan,0.01,0


In [30]:
# merge kl_merged with kl_df to add latitude/longitude for each neighborhood
kl_merged = kl_merged.join(kl_df.set_index("Neighborhood"), on="Neighborhood")

print(kl_merged.shape)
kl_merged.head() # check the last columns!

(71, 5)


Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Alam Damai,0.0,0,3.05769,101.74388
1,"Ampang, Kuala Lumpur",0.01,0,3.148492,101.696727
2,Bandar Menjalara,0.01,0,3.19035,101.62545
3,Bandar Sri Permaisuri,0.0,0,3.10391,101.71226
4,Bandar Tasik Selatan,0.01,0,3.07275,101.71461


In [31]:
# sort the results by Cluster Labels
print(kl_merged.shape)
kl_merged.sort_values(["Cluster Labels"], inplace=True)
kl_merged

(71, 5)


Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Alam Damai,0.000000,0,3.057690,101.743880
34,Kepong,0.000000,0,3.217500,101.637630
69,Titiwangsa,0.010000,0,3.180670,101.703220
37,Maluri,0.000000,0,3.147890,101.694050
39,Miharja,0.000000,0,3.147890,101.694050
41,Pantai Dalam,0.000000,0,3.094760,101.667470
44,Salak South,0.000000,0,3.081020,101.697240
46,Semarak,0.000000,0,3.179916,101.721437
47,Sentul Raya,0.000000,0,3.187431,101.691453
48,Setapak,0.000000,0,3.188160,101.704150


**Finally, let's visualize the resulting clusters**

In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kl_merged['Latitude'], kl_merged['Longitude'], kl_merged['Neighborhood'], kl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [34]:
# save the map as HTML file
map_clusters.save('map_clusterss.html')

### 8. Examine Clusters

#### Cluster 0

In [35]:
kl_merged.loc[kl_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Alam Damai,0.0,0,3.05769,101.74388
34,Kepong,0.0,0,3.2175,101.63763
69,Titiwangsa,0.01,0,3.18067,101.70322
37,Maluri,0.0,0,3.14789,101.69405
39,Miharja,0.0,0,3.14789,101.69405
41,Pantai Dalam,0.0,0,3.09476,101.66747
44,Salak South,0.0,0,3.08102,101.69724
46,Semarak,0.0,0,3.179916,101.721437
47,Sentul Raya,0.0,0,3.187431,101.691453
48,Setapak,0.0,0,3.18816,101.70415


#### Cluster 1

In [36]:
kl_merged.loc[kl_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
8,Bangsar South,0.02,1,3.11102,101.66283
10,"Batu, Kuala Lumpur",0.02,1,3.13576,101.70837
11,Brickfields,0.03,1,3.12916,101.68406
12,Bukit Bintang,0.02,1,3.14777,101.70855
66,Taman Tun Dr Ismail,0.03,1,3.15283,101.62271
30,KL Eco City,0.02,1,3.11714,101.67388
17,Bukit Tunku,0.02,1,3.17381,101.68276
31,"Kampung Baru, Kuala Lumpur",0.02,1,3.16546,101.71028
27,"Jalan Cochrane, Kuala Lumpur",0.02,1,3.132903,101.724678
38,Medan Tuanku,0.02,1,3.15926,101.69834


#### Cluster 2

In [37]:
kl_merged.loc[kl_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
6,Bangsar,0.05,2,3.1292,101.67844
42,"Pudu, Kuala Lumpur",0.04,2,3.13354,101.71307
36,Lembah Pantai,0.05,2,3.121202,101.663899
67,Taman U-Thant,0.04,2,3.1577,101.72452
7,Bangsar Park,0.06,2,3.13478,101.67262
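To compare the clusters at a glance, the shopping mall frequencies can be summarised per cluster. The sketch below does this for the rows displayed above only, which are a subset of the 71 neighborhoods, so the figures are indicative rather than exact. The dictionary `shown` is typed in by hand from those tables.

```python
from statistics import mean

# "Shopping Mall" frequencies copied from the rows displayed for each cluster above.
shown = {
    0: [0.0, 0.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    1: [0.02, 0.02, 0.03, 0.02, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02],
    2: [0.05, 0.04, 0.05, 0.04, 0.06],
}

# Minimum, mean, and maximum frequency per cluster.
summary = {
    cluster: {"min": min(values), "mean": round(mean(values), 3), "max": max(values)}
    for cluster, values in shown.items()
}

for cluster, stats in sorted(summary.items()):
    print(cluster, stats)
```

On the full DataFrame, a comparable summary would be `kl_merged.groupby('Cluster Labels')['Shopping Mall'].agg(['min', 'mean', 'max'])`.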


#### Observations:
Most of the shopping malls are concentrated in the central area of Kuala Lumpur. The share of venues that are shopping malls is highest in cluster 2 (0.04 to 0.06 in the rows above) and moderate in cluster 1 (0.02 to 0.03), while neighborhoods in cluster 0 have very few or no shopping malls (0.00 to 0.01). Cluster 0 therefore represents the greatest opportunity for new shopping malls, as there is little to no competition from existing ones. Shopping malls in cluster 2, meanwhile, are likely to face intense competition due to oversupply and high concentration.

From another perspective, the results also show that the oversupply of shopping malls is mostly in the central area of the city, while the suburbs still have very few. This project therefore recommends that property developers capitalize on these findings by opening new shopping malls in cluster 0 neighborhoods, where there is little to no competition. Developers with unique selling propositions that stand out from the competition can also consider neighborhoods in cluster 1, where competition is moderate. Lastly, developers are advised to avoid neighborhoods in cluster 2, which already have a high concentration of shopping malls and intense competition.