<h1>The Battle of Neighborhood | Part 1</h1>

<h2>Introduction</h2>
<p>For many shoppers, visiting shopping malls is a great way to relax and enjoy themselves during weekends and holidays. They can buy groceries, browse fashion outlets, dine at restaurants, watch movies and do much more. Shopping malls are a one-stop destination for all types of shoppers. For retailers, the central location and the large crowds at shopping malls provide a great distribution channel to market their products and services. Property developers are also taking advantage of this trend and building more shopping malls to cater to the demand. As a result, there are many shopping malls in the city of Kuala Lumpur and many more are being built. Opening shopping malls allows property developers to earn consistent rental income. Of course, as with any business decision, opening a new shopping mall requires serious consideration and is a lot more complicated than it seems. In particular, the location of the shopping mall is one of the most important decisions and will determine whether the mall is a success or a failure.</p>

<h2>Business Problem</h2>
<p>The objective of this capstone project is to analyze and select the best locations in the city of Kuala Lumpur, Malaysia to open a new shopping mall. Using data science methodology and machine learning techniques such as clustering, this project aims to answer the business question: if a property developer is looking to open a new shopping mall in the city of Kuala Lumpur, Malaysia, where would you recommend that they open it?</p>

<h3>Target Audience of this project</h3>
<p>This project is particularly useful to property developers and investors looking to open or invest in new shopping malls in Kuala Lumpur, the capital city of Malaysia. It is timely because the city is currently suffering from an oversupply of shopping malls. Data from the National Property Information Center (NPIC) released last year showed that an additional 15 per cent will be added to existing mall space, and the agency predicted that total occupancy may dip below 86 per cent. The local newspaper The Malay Mail also reported in March last year that true occupancy rates in malls may be as low as 40 per cent in some areas, quoting a Financial Times (FT) article cataloging the country's continued obsession with building more shopping space despite chronic oversupply.</p>

<h2>Data Description</h2>
<h4>To solve the problem, we will need the following data:</h4>
<blockquote>
    <ul>
        <li>List of neighborhoods in Kuala Lumpur. This defines the scope of this project, which is confined to the city of Kuala Lumpur, the capital city of Malaysia in Southeast Asia.</li>
        <li>Latitude and Longitude coordinates of those neighborhoods. This is required in order to plot the map and also to get the venue data.</li>
        <li>Venue data, particularly data related to shopping malls. We will use this data to perform clustering on the neighborhoods.</li>
    </ul>
</blockquote>

<h4>Sources of data and methods to extract them</h4>
<p>The <a href='https://en.wikipedia.org/wiki/Category:Suburbs_in_Kuala_Lumpur'>Wikipedia page</a> contains a list of neighborhoods in Kuala Lumpur, with a total of 71 neighborhoods. We will use web scraping techniques to extract the data from the Wikipedia page, with the help of the Python 'requests' and 'beautifulsoup' libraries. Then we will use the Python 'Geocoder' library to get the latitude and longitude coordinates of each neighborhood.</p>

<p>The output shows the final dataset. The dataset consists of a single dataframe with 3 columns: 'Neighborhood', 'Latitude' and 'Longitude'. The 'Neighborhood' column contains the names of the neighborhoods in the Federal Territory of Kuala Lumpur.</p>

<h3>Import Libraries</h3>

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

!pip install geocoder
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

print("\nLibraries imported.")


Libraries imported.


<h3>Scrape data from Wikipedia page into a DataFrame</h3>

In [2]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Suburbs_in_Kuala_Lumpur").text

# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [3]:
# create a list to store neighborhood data
neighborhoodlist = []

# append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodlist.append(row.text)

In [4]:
# create a new DataFrame from the list
kuala_lumpur_df = pd.DataFrame({"Neighborhood": neighborhoodlist})

kuala_lumpur_df.head()

,Neighborhood
0,Alam Damai
1,"Ampang, Kuala Lumpur"
2,Bandar Menjalara
3,Bandar Sri Permaisuri
4,Bandar Tasik Selatan


In [5]:
# print the number of rows of the dataframe
kuala_lumpur_df.shape

(71, 1)

<h3>Get the geographical coordinates</h3>

In [6]:
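# define a function to get coordinates
def get_latlng(neighborhood, max_retries=5):
    # initialize the result to None
    lat_lng_coords = None
    # retry a bounded number of times so a failed lookup cannot loop forever
    for _ in range(max_retries):
        g = geocoder.arcgis('{}, Kuala Lumpur, Malaysia'.format(neighborhood))
        lat_lng_coords = g.latlng
        if lat_lng_coords is not None:
            break
    # fall back to an empty pair so the result still fits a two-column DataFrame
    return lat_lng_coords if lat_lng_coords is not None else [None, None]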
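
<p>Some scraped names, such as "Ampang, Kuala Lumpur", already carry the city as a disambiguation suffix, so the query string built in the function above repeats it. The following is a minimal sketch of one way to normalize the query first. The helper name <code>build_query</code> and its parameters are hypothetical and not part of the original notebook.</p>

```python
def build_query(neighborhood, city="Kuala Lumpur", country="Malaysia"):
    """Build a geocoding query without repeating the city name.

    Hypothetical helper: strips a trailing ", <city>" suffix from the
    scraped name before appending the city and country.
    """
    suffix = ", " + city
    name = neighborhood.strip()
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return "{}, {}, {}".format(name, city, country)


print(build_query("Ampang, Kuala Lumpur"))
print(build_query("Bangsar"))
```

<p>Whether the shorter query improves the geocoding result would need to be checked against the geocoder's actual responses.</p>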

In [7]:
# call the function to get the coordinates, store in a new list using list comprehension
co_ords = [ get_latlng(neighborhood) for neighborhood in kuala_lumpur_df["Neighborhood"].tolist() ]

In [8]:
co_ords[:5]

[[3.0576900000000364, 101.74388000000005],
 [3.1484988508598852, 101.69672774991264],
 [3.1903500000000236, 101.62545000000006],
 [3.1039100000000417, 101.71226000000007],
 [3.072750000000042, 101.71461000000005]]
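
<p>Geocoders can occasionally return a point far from the intended city, so a quick sanity check on the coordinates may be worthwhile. The sketch below flags missing pairs and pairs outside a rough bounding box. The box limits are assumed values chosen generously around Kuala Lumpur, and <code>outside_bounds</code> is a hypothetical helper, not part of the original notebook.</p>

```python
# Rough bounding box around Kuala Lumpur (assumed values, chosen generously)
LAT_RANGE = (2.9, 3.4)
LNG_RANGE = (101.5, 101.9)


def outside_bounds(coords, lat_range=LAT_RANGE, lng_range=LNG_RANGE):
    """Return the indices of pairs that are missing or fall outside the box."""
    flagged = []
    for i, pair in enumerate(coords):
        # treat a missing pair or a pair with a missing value as suspicious
        if pair is None or None in pair:
            flagged.append(i)
            continue
        lat, lng = pair
        in_lat = lat_range[0] <= lat <= lat_range[1]
        in_lng = lng_range[0] <= lng <= lng_range[1]
        if not (in_lat and in_lng):
            flagged.append(i)
    return flagged


sample = [
    [3.05769, 101.74388],
    [3.1484988508598852, 101.69672774991264],
    [3.19035, 101.62545],
    [48.8566, 2.3522],  # deliberately wrong point to show a flagged entry
]
print(outside_bounds(sample))
```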

In [9]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(co_ords, columns=['Latitude', 'Longitude'])
df_coords.head()

,Latitude,Longitude
0,3.05769,101.74388
1,3.148499,101.696728
2,3.19035,101.62545
3,3.10391,101.71226
4,3.07275,101.71461


In [10]:
# merge the coordinates into the original dataframe
kuala_lumpur_df['Latitude'] = df_coords['Latitude']
kuala_lumpur_df['Longitude'] = df_coords['Longitude']

# check the neighborhoods and the coordinates
print(kuala_lumpur_df.shape)
kuala_lumpur_df.head(10)

(71, 3)


,Neighborhood,Latitude,Longitude
0,Alam Damai,3.05769,101.74388
1,"Ampang, Kuala Lumpur",3.148499,101.696728
2,Bandar Menjalara,3.19035,101.62545
3,Bandar Sri Permaisuri,3.10391,101.71226
4,Bandar Tasik Selatan,3.07275,101.71461
5,Bandar Tun Razak,3.08276,101.72281
6,Bangsar,3.1292,101.67844
7,Bangsar Park,3.1292,101.67844
8,Bangsar South,3.11102,101.66283
9,Batu 11 Cheras,3.06187,101.74675
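
<p>In the rows shown above, Bangsar and Bangsar Park share exactly the same coordinates, which suggests the geocoder resolved both names to one point. The sketch below shows one way to list such groups using plain Python on a few of the rows. The helper name <code>shared_coordinates</code> and the rounding precision are assumptions for illustration.</p>

```python
from collections import defaultdict

# Rows copied from the displayed output: (neighborhood, latitude, longitude)
rows = [
    ("Bandar Tun Razak", 3.08276, 101.72281),
    ("Bangsar", 3.1292, 101.67844),
    ("Bangsar Park", 3.1292, 101.67844),
    ("Bangsar South", 3.11102, 101.66283),
]


def shared_coordinates(rows, ndigits=5):
    """Group names by rounded (lat, lng); keep groups with more than one name."""
    groups = defaultdict(list)
    for name, lat, lng in rows:
        groups[(round(lat, ndigits), round(lng, ndigits))].append(name)
    return {point: names for point, names in groups.items() if len(names) > 1}


print(shared_coordinates(rows))
```

<p>Neighborhoods that collapse onto a single point will receive identical venue data later, so it may be worth reviewing them before clustering.</p>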


In [11]:
# save the DataFrame as CSV file
kuala_lumpur_df.to_csv("Kuala_Lumpur.csv", index=False)
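
<p>Names that contain a comma, such as "Ampang, Kuala Lumpur", have to be quoted in the CSV file so that they are not split into two fields when the file is read back. The sketch below illustrates that round trip with Python's standard <code>csv</code> module and an in-memory buffer instead of pandas. The sample rows are taken from the output above and the variable names are illustrative.</p>

```python
import csv
import io

rows = [
    ["Neighborhood", "Latitude", "Longitude"],
    ["Alam Damai", "3.05769", "101.74388"],
    ["Ampang, Kuala Lumpur", "3.148499", "101.696728"],
]

# write the rows to an in-memory buffer; fields containing commas get quoted
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# read the text back and confirm the comma-containing name is still one field
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[2][0])
```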