# IBM Applied Data Science Capstone Course by Coursera

#### Week 5 Final Report

##### Opening a New Shopping Mall in Johannesburg, South Africa

* Build a dataframe of neighborhoods in Johannesburg, South Africa by scraping the data from a Wikipedia page
* Get the geographical coordinates of the neighborhoods
* Obtain the venue data for the neighborhoods from the Foursquare API
* Explore and cluster the neighborhoods
* Select the best cluster to open a new shopping mall

#### 1. Import libraries

In [5]:
pip install geopy

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/53/fc/3d1b47e8e82ea12c25203929efb1b964918a77067a874b2c7631e2ec35ec/geopy-1.21.0-py2.py3-none-any.whl (104kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-1.21.0
Note: you may need to restart the kernel to use updated packages.


In [13]:
pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Collecting future (from geocoder)
  Downloading https://files.pythonhosted.org/packages/45/0b/38b06fd9b92dc2b68d58b75f900e97884c45bedd2ff83203d933cf5851c9/future-0.18.2.tar.gz (829kB)
Collecting click (from geocoder)
  Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
Building wheels f

In [16]:
pip install bs4

Note: you may need to restart the kernel to use updated packages.


In [26]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported.")

Libraries imported.


#### 2. Scrape data from Wikipedia page into a DataFrame

In [27]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Suburbs_in_South_Africa").text

In [28]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [30]:
# create a list to store neighborhood data
neighborhoodList = []

In [31]:
# append the text of each list item in the category listing to the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text)

In [32]:
# create a new DataFrame from the list
kl_df = pd.DataFrame({"Neighborhood": neighborhoodList})

kl_df.head()

| | Neighborhood |
|---|---|
| 0 | ► Lists of suburbs in South Africa (3 P) |
| 1 | ► Suburbs of Bloemfontein (30 P) |
| 2 | ► Suburbs of Cape Town (4 C, 136 P) |
| 3 | ► Suburbs of Centurion, Gauteng (17 P) |
| 4 | ► Suburbs of Durban (59 P) |


In [33]:
# print the number of rows of the dataframe
kl_df.shape

(10, 1)
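The scraped entries shown above carry a leading arrow marker and a trailing count such as `(4 C, 136 P)`. The following is a minimal sketch of how such labels could be cleaned before geocoding; the helper name `clean_category_label` and the regular expression are assumptions for illustration and are not part of the original notebook.

```python
import re

# Hypothetical pattern: matches a trailing count such as "(59 P)" or "(4 C, 136 P)".
_COUNT_SUFFIX = re.compile(r"\s*\((?:\d+\s*[CPF],?\s*)+\)\s*$")


def clean_category_label(raw):
    """Strip the arrow marker, left-to-right marks, and the trailing count."""
    text = raw.replace("\u200e", "")   # drop invisible left-to-right marks
    text = text.lstrip("► ").strip()   # drop the leading arrow and spaces
    return _COUNT_SUFFIX.sub("", text)


# Sample labels copied from the scraped output above.
labels = [
    "► Suburbs of Cape Town\u200e (4 C, 136 P)",
    "► Suburbs of Durban\u200e (59 P)",
]
cleaned = [clean_category_label(label) for label in labels]
print(cleaned)
```

Applied to the notebook, this could be used as `kl_df["Neighborhood"].map(clean_category_label)` before the geocoding step.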

#### 3. Get the geographical coordinates

In [34]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, South Africa, Johannesburg'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
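The `while` loop above keeps retrying until the geocoder returns a result, so a lookup that never succeeds would never finish. The following is a hedged sketch of a bounded-retry variant. The names `get_latlng_bounded`, `lookup`, `max_attempts`, and `delay_seconds` are assumptions introduced here, and the lookup is passed in as a callable so the example runs without network access.

```python
import time


def get_latlng_bounded(neighborhood, lookup, max_attempts=3, delay_seconds=0.0):
    """Return [lat, lng], or None if every attempt fails.

    `lookup` is any callable that takes an address string and returns
    [lat, lng] or None (a stand-in for the geocoder call in the notebook).
    """
    query = '{}, South Africa, Johannesburg'.format(neighborhood)
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # brief pause before the next attempt
    return None


# Illustrative stand-in lookup that fails once and then succeeds.
calls = []


def fake_lookup(address):
    calls.append(address)
    return [-26.2, 28.0] if len(calls) >= 2 else None


result = get_latlng_bounded("Sandton", fake_lookup)
print(result, len(calls))
```

In the notebook, the callable could be `lambda q: geocoder.arcgis(q).latlng`. Callers would then need to handle a `None` result, for example by dropping that row before building the coordinates dataframe.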

In [35]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in kl_df["Neighborhood"].tolist() ]

In [36]:
coords

[[-26.151542991373, 28.064935121559287],
 [-26.090876043046766, 28.160477523064245],
 [-25.993666542847475, 28.10447872162271],
 [-26.035349999999937, 27.952530000000024],
 [-26.20807364556529, 28.055775347625374],
 [46.68509700000001, 14.888173499999994],
 [-26.083493999999998, 28.138675499999998],
 [-26.188743351116486, 28.050263591176638],
 [-26.141598595733413, 28.02325556628718],
 [-26.50662852954593, 27.883307038432736]]

In [37]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [38]:
# merge the coordinates into the original dataframe
kl_df['Latitude'] = df_coords['Latitude']
kl_df['Longitude'] = df_coords['Longitude']

In [39]:
# check the neighborhoods and the coordinates
print(kl_df.shape)
kl_df

(10, 3)


| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | ► Lists of suburbs in South Africa (3 P) | -26.151543 | 28.064935 |
| 1 | ► Suburbs of Bloemfontein (30 P) | -26.090876 | 28.160478 |
| 2 | ► Suburbs of Cape Town (4 C, 136 P) | -25.993667 | 28.104479 |
| 3 | ► Suburbs of Centurion, Gauteng (17 P) | -26.03535 | 27.95253 |
| 4 | ► Suburbs of Durban (59 P) | -26.208074 | 28.055775 |
| 5 | ► Suburbs of Johannesburg (7 C, 31 P) | 46.685097 | 14.888173 |
| 6 | ► Suburbs of Kempton Park, Gauteng (8 P) | -26.083494 | 28.138675 |
| 7 | ► Suburbs of Pretoria (58 P) | -26.188743 | 28.050264 |
| 8 | ► University and college campuses in South Af... | -26.141599 | 28.023256 |
| 9 | ► Suburbs of Vereeniging (2 P) | -26.506629 | 27.883307 |
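One row in the table above has a latitude of about 46.69 and a longitude of about 14.89, far from the other rows, which all sit near latitude -26 and longitude 28. The following sketch shows one way to flag such rows with a bounding-box check. The box limits and the helper name `outside_bbox` are assumptions chosen for illustration, not values from the original notebook.

```python
# Rough bounding box around the area covered by the other rows.
# These limits are assumptions for illustration only.
LAT_RANGE = (-27.0, -25.5)
LNG_RANGE = (27.5, 28.6)


def outside_bbox(rows, lat_range=LAT_RANGE, lng_range=LNG_RANGE):
    """Return the names of rows whose coordinates fall outside the box."""
    flagged = []
    for name, lat, lng in rows:
        in_lat = lat_range[0] <= lat <= lat_range[1]
        in_lng = lng_range[0] <= lng <= lng_range[1]
        if not (in_lat and in_lng):
            flagged.append(name)
    return flagged


# Three rows copied from the table above.
rows = [
    ("Suburbs of Durban", -26.208074, 28.055775),
    ("Suburbs of Johannesburg", 46.685097, 14.888173),
    ("Suburbs of Vereeniging", -26.506629, 27.883307),
]
flagged = outside_bbox(rows)
print(flagged)
```

Flagged rows could be geocoded again with a more specific query or excluded before the venue search.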


In [40]:
# save the DataFrame as CSV file
kl_df.to_csv("kl_df.csv", index=False)

#### 4. Create a map of Johannesburg with neighborhoods superimposed on top

In [41]:
# get the coordinates of Johannesburg
address = 'South Africa, Johannesburg'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of South Africa, Johannesburg are {}, {}.'.format(latitude, longitude))

The geographical coordinates of South Africa, Johannesburg are -26.205, 28.049722.


In [42]:
# create map of Johannesburg using latitude and longitude values
map_kl = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_kl)  
    
map_kl

In [43]:
# save the map as HTML file
map_kl.save('map_kl.html')

#### 5. Use the Foursquare API to explore the neighborhoods

In [67]:
# define Foursquare Credentials and Version
CLIENT_ID = 'XXXX' # your Foursquare ID
CLIENT_SECRET = 'XXXX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: XXXX
CLIENT_SECRET:XXXX


##### Now, let's get the top 100 venues that are within a radius of 2000 meters.



In [45]:
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
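The loop above builds the query string by hand and indexes straight into the JSON response, so a missing key or an empty `groups` list would raise an exception. The following is a minimal sketch of a more defensive variant. The helper names `explore_params` and `extract_venues` are assumptions introduced here, and the sample payload only mirrors the fields the notebook reads.

```python
def explore_params(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the query parameters used by the explore request in the notebook."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]
        rows.append((
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name"),
        ))
    return rows


# Sample payload shaped like the fields the notebook reads (values from its output).
sample = {"response": {"groups": [{"items": [{"venue": {
    "name": "Loof Coffee",
    "location": {"lat": -26.160295, "lng": 28.07597},
    "categories": [{"name": "Coffee Shop"}],
}}]}]}}

params = explore_params("XXXX", "XXXX", "20180605", -26.151543, 28.064935, 2000, 100)
parsed = extract_venues(sample)
print(params["ll"], parsed)
```

With `requests`, the parameters could be passed as `requests.get("https://api.foursquare.com/v2/venues/explore", params=params, timeout=10)`, which leaves URL encoding to the library.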

In [46]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(372, 7)


| | Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|
| 0 | ► Lists of suburbs in South Africa (3 P) | -26.151543 | 28.064935 | Tortellino D'Oro | -26.146556 | 28.063648 | Italian Restaurant |
| 1 | ► Lists of suburbs in South Africa (3 P) | -26.151543 | 28.064935 | The Schwarma Co. | -26.157439 | 28.076384 | Middle Eastern Restaurant |
| 2 | ► Lists of suburbs in South Africa (3 P) | -26.151543 | 28.064935 | La Vie En Rose | -26.14834 | 28.055011 | Café |
| 3 | ► Lists of suburbs in South Africa (3 P) | -26.151543 | 28.064935 | The Residence Boutique Hotel | -26.164635 | 28.059165 | Hotel |
| 4 | ► Lists of suburbs in South Africa (3 P) | -26.151543 | 28.064935 | Loof Coffee | -26.160295 | 28.07597 | Coffee Shop |


##### Let's check how many venues were returned for each neighborhood

In [47]:
venues_df.groupby(["Neighborhood"]).count()

| Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|
| ► Lists of suburbs in South Africa (3 P) | 49 | 49 | 49 | 49 | 49 | 49 |
| ► Suburbs of Bloemfontein (30 P) | 6 | 6 | 6 | 6 | 6 | 6 |
| ► Suburbs of Cape Town (4 C, 136 P) | 18 | 18 | 18 | 18 | 18 | 18 |
| ► Suburbs of Centurion, Gauteng (17 P) | 12 | 12 | 12 | 12 | 12 | 12 |
| ► Suburbs of Durban (59 P) | 74 | 74 | 74 | 74 | 74 | 74 |
| ► Suburbs of Johannesburg (7 C, 31 P) | 4 | 4 | 4 | 4 | 4 | 4 |
| ► Suburbs of Kempton Park, Gauteng (8 P) | 11 | 11 | 11 | 11 | 11 | 11 |
| ► Suburbs of Pretoria (58 P) | 94 | 94 | 94 | 94 | 94 | 94 |
| ► Suburbs of Vereeniging (2 P) | 4 | 4 | 4 | 4 | 4 | 4 |
| ► University and college campuses in South Africa (1 C, 1 P) | 100 | 100 | 100 | 100 | 100 | 100 |


##### Let's find out how many unique categories can be curated from all the returned venues



In [48]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 106 unique categories.


In [49]:
# print out the first 50 categories
venues_df['VenueCategory'].unique()[:50]

array(['Italian Restaurant', 'Middle Eastern Restaurant', 'Café', 'Hotel',
       'Coffee Shop', 'Fast Food Restaurant', 'Indian Restaurant', 'Gym',
       'Sushi Restaurant', 'Golf Course', 'Japanese Restaurant',
       'Playground', 'Park', 'African Restaurant',
       'Portuguese Restaurant', 'Food & Drink Shop', 'Grocery Store',
       'Steakhouse', 'Gas Station', 'Bakery', 'Farmers Market',
       'Athletics & Sports', 'Juice Bar', 'Gym / Fitness Center', 'Road',
       'Shopping Mall', 'Seafood Restaurant', 'Ethiopian Restaurant',
       'Flea Market', 'Bookstore', 'Shop & Service', 'Airport Terminal',
       'Gastropub', 'Sports Club', 'Pizza Place', 'Chinese Restaurant',
       'Pub', 'Supermarket', 'Breakfast Spot', 'Swiss Restaurant',
       'Convenience Store', 'Video Game Store', 'Burger Joint',
       'Garden Center', 'Deli / Bodega', 'Basketball Court',
       'Fish Market', 'Climbing Gym', 'Restaurant', 'Arts & Crafts Store'],
      dtype=object)

In [50]:
# check if the results contain "Shopping Mall"
"Shopping Mall" in venues_df['VenueCategory'].unique()

True

#### 6. Analyze Each Neighborhood

In [51]:
# one hot encoding
kl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kl_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [kl_onehot.columns[-1]] + list(kl_onehot.columns[:-1])
kl_onehot = kl_onehot[fixed_columns]

print(kl_onehot.shape)
kl_onehot.head()

(372, 107)


Unnamed: 0,Neighborhoods,African Restaurant,Airport Terminal,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Basketball Court,Beach,Bistro,Bookstore,Boutique,Breakfast Spot,Burger Joint,Business Service,Café,Chinese Restaurant,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Deli / Bodega,Department Store,Diner,Donut Shop,Electronics Store,Ethiopian Restaurant,Farmers Market,Fast Food Restaurant,Fish Market,Flea Market,Flower Shop,Food & Drink Shop,Food Court,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Garden Center,Gas Station,Gastropub,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Historic Site,Hostel,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Italian Restaurant,Japanese Restaurant,Juice Bar,Korean Restaurant,Lake,Latin American Restaurant,Lounge,Market,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Mobile Phone Shop,Modern European Restaurant,Music Venue,Nightclub,Park,Performing Arts Venue,Pharmacy,Pizza Place,Playground,Plaza,Portuguese Restaurant,Pub,Public Art,Radio Station,Restaurant,Road,Rock Climbing Spot,Rugby Pitch,Scenic Lookout,Seafood Restaurant,Shop & Service,Shopping Mall,Snack Place,Spa,Sports Club,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Swiss Restaurant,Thai Restaurant,Theater,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Whisky Bar,Wine Bar
0,► Lists of suburbs in South Africa‎ (3 P),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,► Lists of suburbs in South Africa‎ (3 P),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,► Lists of suburbs in South Africa‎ (3 P),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,► Lists of suburbs in South Africa‎ (3 P),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,► Lists of suburbs in South Africa‎ (3 P),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


##### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [52]:
kl_grouped = kl_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(kl_grouped.shape)
kl_grouped

(10, 107)


Unnamed: 0,Neighborhoods,African Restaurant,Airport Terminal,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Basketball Court,Beach,Bistro,Bookstore,Boutique,Breakfast Spot,Burger Joint,Business Service,Café,Chinese Restaurant,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Deli / Bodega,Department Store,Diner,Donut Shop,Electronics Store,Ethiopian Restaurant,Farmers Market,Fast Food Restaurant,Fish Market,Flea Market,Flower Shop,Food & Drink Shop,Food Court,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Garden Center,Gas Station,Gastropub,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Historic Site,Hostel,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Italian Restaurant,Japanese Restaurant,Juice Bar,Korean Restaurant,Lake,Latin American Restaurant,Lounge,Market,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Mobile Phone Shop,Modern European Restaurant,Music Venue,Nightclub,Park,Performing Arts Venue,Pharmacy,Pizza Place,Playground,Plaza,Portuguese Restaurant,Pub,Public Art,Radio Station,Restaurant,Road,Rock Climbing Spot,Rugby Pitch,Scenic Lookout,Seafood Restaurant,Shop & Service,Shopping Mall,Snack Place,Spa,Sports Club,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Swiss Restaurant,Thai Restaurant,Theater,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Whisky Bar,Wine Bar
0,► Lists of suburbs in South Africa‎ (3 P),0.020408,0.020408,0.0,0.0,0.0,0.0,0.020408,0.0,0.020408,0.0,0.0,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.061224,0.0,0.0,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.020408,0.020408,0.0,0.040816,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.040816,0.0,0.040816,0.020408,0.020408,0.020408,0.0,0.0,0.102041,0.0,0.040816,0.0,0.0,0.061224,0.020408,0.020408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.020408,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.020408,0.020408,0.040816,0.0,0.0,0.0,0.0,0.020408,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,► Suburbs of Bloemfontein‎ (30 P),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"► Suburbs of Cape Town‎ (4 C, 136 P)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.055556,0.0,0.0,0.055556,0.0,0.0,0.0,0.055556,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.055556,0.0,0.0,0.0,0.0,0.055556,0.0,0.0
3,"► Suburbs of Centurion, Gauteng‎ (17 P)",0.0,0.0,0.0,0.0,0.083333,0.0,0.083333,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,► Suburbs of Durban‎ (59 P),0.0,0.0,0.013514,0.027027,0.0,0.013514,0.0,0.013514,0.0,0.013514,0.0,0.0,0.013514,0.0,0.0,0.027027,0.027027,0.0,0.094595,0.0,0.0,0.0,0.0,0.094595,0.0,0.0,0.013514,0.013514,0.013514,0.0,0.013514,0.013514,0.013514,0.0,0.148649,0.0,0.027027,0.0,0.0,0.0,0.0,0.013514,0.0,0.0,0.013514,0.0,0.0,0.027027,0.0,0.0,0.027027,0.013514,0.054054,0.0,0.0,0.013514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.0,0.0,0.013514,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,0.054054,0.013514,0.013514,0.0,0.013514,0.0,0.0,0.013514,0.013514,0.013514,0.0,0.027027,0.0,0.013514,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"► Suburbs of Johannesburg‎ (7 C, 31 P)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"► Suburbs of Kempton Park, Gauteng‎ (8 P)",0.0,0.0,0.0,0.0,0.0,0.0,0.181818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.181818,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,► Suburbs of Pretoria‎ (58 P),0.0,0.0,0.010638,0.053191,0.0,0.010638,0.0,0.010638,0.0,0.021277,0.0,0.010638,0.010638,0.0,0.010638,0.042553,0.010638,0.0,0.06383,0.0,0.0,0.0,0.0,0.085106,0.0,0.010638,0.010638,0.021277,0.010638,0.0,0.0,0.0,0.010638,0.0,0.138298,0.0,0.031915,0.010638,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010638,0.0,0.0,0.0,0.0,0.085106,0.0,0.0,0.010638,0.010638,0.0,0.0,0.0,0.0,0.0,0.0,0.010638,0.0,0.0,0.0,0.010638,0.0,0.010638,0.0,0.010638,0.0,0.010638,0.010638,0.0,0.010638,0.0,0.0,0.053191,0.021277,0.010638,0.0,0.010638,0.0,0.0,0.010638,0.010638,0.010638,0.0,0.010638,0.0,0.010638,0.0,0.010638,0.021277,0.010638,0.0,0.0,0.0,0.021277,0.010638,0.0,0.0,0.0,0.0
8,► Suburbs of Vereeniging‎ (2 P),0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,► University and college campuses in South Af...,0.02,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.04,0.0,0.06,0.01,0.0,0.0,0.02,0.09,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.02,0.0,0.04,0.01,0.02,0.0,0.0,0.03,0.01,0.03,0.01,0.0,0.06,0.0,0.01,0.01,0.01,0.01,0.0,0.0,0.01,0.03,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.03,0.0,0.0,0.01,0.02,0.0,0.01,0.04,0.0,0.0,0.0,0.0,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.03,0.0,0.03,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.01
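Because each one-hot row contains a single 1, the per-neighborhood mean of a category column equals the share of that neighborhood's venues that fall in the category. The following dependency-free sketch illustrates the same computation with the standard library. The function name `category_frequencies` and the toy rows are assumptions for illustration, not data from the notebook.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    freqs = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        freqs[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return freqs


# Toy (neighborhood, category) pairs, invented for illustration.
rows = [
    ("A", "Café"), ("A", "Shopping Mall"), ("A", "Café"), ("A", "Hotel"),
    ("B", "Gas Station"), ("B", "Gas Station"), ("B", "Pub"),
]
freqs = category_frequencies(rows)
print(freqs["A"]["Shopping Mall"])
```

A value such as 0.166667 in the `Shopping Mall` column therefore means that about one in six of the venues returned for that neighborhood was a shopping mall, not that the neighborhood has a particular number of malls.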


In [53]:
len(kl_grouped[kl_grouped["Shopping Mall"] > 0])

6

##### Create a new DataFrame for Shopping Mall data only

In [55]:
kl_mall = kl_grouped[["Neighborhoods","Shopping Mall"]]

In [56]:
kl_mall.head()

| | Neighborhoods | Shopping Mall |
|---|---|---|
| 0 | ► Lists of suburbs in South Africa (3 P) | 0.040816 |
| 1 | ► Suburbs of Bloemfontein (30 P) | 0.0 |
| 2 | ► Suburbs of Cape Town (4 C, 136 P) | 0.166667 |
| 3 | ► Suburbs of Centurion, Gauteng (17 P) | 0.083333 |
| 4 | ► Suburbs of Durban (59 P) | 0.027027 |


#### 7. Cluster Neighborhoods

##### Run k-means to cluster the neighborhoods in Johannesburg, South Africa into 3 clusters.

In [57]:
# set number of clusters
kclusters = 3

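# keep only the numeric feature; the positional axis argument is not accepted by recent pandas versions
kl_clustering = kl_mall.drop(columns=["Neighborhoods"])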

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kl_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 2, 1, 0, 0, 0, 0, 0, 0], dtype=int32)
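Since the clustering uses a single feature (the `Shopping Mall` frequency), k-means here amounts to splitting ten numbers into three groups around their means. The following is a simplified sketch of Lloyd's iteration in one dimension, written without scikit-learn. It is not the library's implementation (which uses k-means++ initialization by default); the function name `kmeans_1d` and the evenly spaced starting centroids are assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=100):
    """Simplified 1-D k-means (Lloyd's algorithm); requires k >= 2."""
    lo, hi = min(values), max(values)
    # Assumed deterministic start: k evenly spaced points between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Shopping Mall frequencies for the ten rows of kl_grouped, in row order.
mall_freq = [0.040816, 0.0, 0.166667, 0.083333, 0.027027, 0.0, 0.0, 0.010638, 0.0, 0.03]
labels, centroids = kmeans_1d(mall_freq, 3)
print(labels)
print(centroids)
```

For these ten values the sketch yields the same partition as the labels printed above: one group of eight low-frequency rows and two single-row groups. Cluster numbering from k-means is arbitrary in general, so only the grouping should be compared.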

In [58]:
# create a new dataframe that includes the Shopping Mall frequency and the cluster label for each neighborhood
kl_merged = kl_mall.copy()

# add clustering labels
kl_merged["Cluster Labels"] = kmeans.labels_

In [59]:
kl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
kl_merged.head()

| | Neighborhood | Shopping Mall | Cluster Labels |
|---|---|---|---|
| 0 | ► Lists of suburbs in South Africa (3 P) | 0.040816 | 0 |
| 1 | ► Suburbs of Bloemfontein (30 P) | 0.0 | 0 |
| 2 | ► Suburbs of Cape Town (4 C, 136 P) | 0.166667 | 2 |
| 3 | ► Suburbs of Centurion, Gauteng (17 P) | 0.083333 | 1 |
| 4 | ► Suburbs of Durban (59 P) | 0.027027 | 0 |


In [60]:
# join kl_merged with kl_df to add latitude/longitude for each neighborhood
kl_merged = kl_merged.join(kl_df.set_index("Neighborhood"), on="Neighborhood")

print(kl_merged.shape)
kl_merged.head() # check the last columns!

(10, 5)


| | Neighborhood | Shopping Mall | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | ► Lists of suburbs in South Africa (3 P) | 0.040816 | 0 | -26.151543 | 28.064935 |
| 1 | ► Suburbs of Bloemfontein (30 P) | 0.0 | 0 | -26.090876 | 28.160478 |
| 2 | ► Suburbs of Cape Town (4 C, 136 P) | 0.166667 | 2 | -25.993667 | 28.104479 |
| 3 | ► Suburbs of Centurion, Gauteng (17 P) | 0.083333 | 1 | -26.03535 | 27.95253 |
| 4 | ► Suburbs of Durban (59 P) | 0.027027 | 0 | -26.208074 | 28.055775 |


In [61]:
print(kl_merged.shape)
kl_merged.sort_values(["Cluster Labels"], inplace=True)
kl_merged

(10, 5)


| | Neighborhood | Shopping Mall | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | ► Lists of suburbs in South Africa (3 P) | 0.040816 | 0 | -26.151543 | 28.064935 |
| 1 | ► Suburbs of Bloemfontein (30 P) | 0.0 | 0 | -26.090876 | 28.160478 |
| 4 | ► Suburbs of Durban (59 P) | 0.027027 | 0 | -26.208074 | 28.055775 |
| 5 | ► Suburbs of Johannesburg (7 C, 31 P) | 0.0 | 0 | 46.685097 | 14.888173 |
| 6 | ► Suburbs of Kempton Park, Gauteng (8 P) | 0.0 | 0 | -26.083494 | 28.138675 |
| 7 | ► Suburbs of Pretoria (58 P) | 0.010638 | 0 | -26.188743 | 28.050264 |
| 8 | ► Suburbs of Vereeniging (2 P) | 0.0 | 0 | -26.506629 | 27.883307 |
| 9 | ► University and college campuses in South Af... | 0.03 | 0 | -26.141599 | 28.023256 |
| 3 | ► Suburbs of Centurion, Gauteng (17 P) | 0.083333 | 1 | -26.03535 | 27.95253 |
| 2 | ► Suburbs of Cape Town (4 C, 136 P) | 0.166667 | 2 | -25.993667 | 28.104479 |


##### Finally, let's visualize the resulting clusters



In [62]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kl_merged['Latitude'], kl_merged['Longitude'], kl_merged['Neighborhood'], kl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [63]:
# save the map as HTML file
map_clusters.save('map_clusters.html')


#### 8. Examine Clusters

##### Cluster 0

In [64]:
kl_merged.loc[kl_merged['Cluster Labels'] == 0]

| | Neighborhood | Shopping Mall | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | ► Lists of suburbs in South Africa (3 P) | 0.040816 | 0 | -26.151543 | 28.064935 |
| 1 | ► Suburbs of Bloemfontein (30 P) | 0.0 | 0 | -26.090876 | 28.160478 |
| 4 | ► Suburbs of Durban (59 P) | 0.027027 | 0 | -26.208074 | 28.055775 |
| 5 | ► Suburbs of Johannesburg (7 C, 31 P) | 0.0 | 0 | 46.685097 | 14.888173 |
| 6 | ► Suburbs of Kempton Park, Gauteng (8 P) | 0.0 | 0 | -26.083494 | 28.138675 |
| 7 | ► Suburbs of Pretoria (58 P) | 0.010638 | 0 | -26.188743 | 28.050264 |
| 8 | ► Suburbs of Vereeniging (2 P) | 0.0 | 0 | -26.506629 | 27.883307 |
| 9 | ► University and college campuses in South Af... | 0.03 | 0 | -26.141599 | 28.023256 |


##### Cluster 1

In [65]:
kl_merged.loc[kl_merged['Cluster Labels'] == 1]

| | Neighborhood | Shopping Mall | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 3 | ► Suburbs of Centurion, Gauteng (17 P) | 0.083333 | 1 | -26.03535 | 27.95253 |


##### Cluster 2

In [66]:
kl_merged.loc[kl_merged['Cluster Labels'] == 2]

| | Neighborhood | Shopping Mall | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | ► Suburbs of Cape Town (4 C, 136 P) | 0.166667 | 2 | -25.993667 | 28.104479 |


###### Observations:
The `Shopping Mall` values are the share of returned venues in each neighborhood that are shopping malls, not a count of malls. Cluster 2 has the highest share (about 0.17), cluster 1 has a moderate share (about 0.08), and cluster 0 has the lowest (between 0 and about 0.04), with four of its eight neighborhoods returning no shopping malls at all.

Cluster 0 therefore represents the best opportunity to open new shopping malls, as there is little to no competition from existing malls. Shopping malls in cluster 2, by contrast, are likely to face intense competition because of the high concentration of malls among nearby venues. This project recommends that property developers open new shopping malls in cluster 0 neighborhoods. Developers with unique selling propositions that stand out from the competition can also consider cluster 1, where competition is moderate. Lastly, developers are advised to avoid cluster 2, which already has a high concentration of shopping malls.
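One way to check how the clusters rank by shopping mall frequency is to average the `Shopping Mall` column within each cluster. The following sketch does this with the values from the tables above; the helper name `mean_by_cluster` is an assumption introduced for illustration.

```python
from collections import defaultdict


def mean_by_cluster(rows):
    """Return {cluster label: mean Shopping Mall frequency}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for frequency, cluster in rows:
        sums[cluster] += frequency
        counts[cluster] += 1
    return {cluster: sums[cluster] / counts[cluster] for cluster in sums}


# (Shopping Mall frequency, cluster label) pairs copied from kl_merged.
rows = [
    (0.040816, 0), (0.0, 0), (0.166667, 2), (0.083333, 1), (0.027027, 0),
    (0.0, 0), (0.0, 0), (0.010638, 0), (0.0, 0), (0.03, 0),
]
means = mean_by_cluster(rows)
ranking = sorted(means, key=means.get)  # clusters from lowest to highest frequency
print(means)
print(ranking)
```

With pandas, the equivalent would be `kl_merged.groupby("Cluster Labels")["Shopping Mall"].mean()`.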