# Capstone Project - Opening a Shopping Mall in the Right Place in India
### Applied Data Science Capstone by IBM/Coursera



## Introduction: Business Problem <a name="introduction"></a>

For many people, visiting a shopping mall to eat, shop, or explore is a great way to relax and enjoy themselves on weekends and holidays. They can dine at restaurants, order takeaway, go shopping, and more. As with any business decision, opening a shopping mall requires serious consideration and is a lot more complicated than it seems. In particular, the location and the competition in that area are among the most important factors that determine whether the mall will succeed or fail. This matters even more when the aim is to open one of the biggest shopping malls.<br><br>
The objective of this capstone project is to analyse and select the best location in India to open one of the biggest shopping malls. Using data science methodology and machine learning techniques such as clustering, this project aims to answer the business question: if a property developer is looking to open a new shopping mall, in which city, and where in that city, would you recommend that they open it?

## Data <a name="data"></a>

To solve the problem, we will need the following data: 

• The Wikipedia page (https://en.wikipedia.org/wiki/List_of_shopping_malls_in_India) listing the biggest malls in India.

• Latitude and longitude coordinates of the capital city and of its existing malls. These are required to plot the map and to get the venue data.

• Venue data, particularly data related to shopping malls. We will use this data to perform clustering on the area.


## Methodology <a name="methodology"></a>

This Wikipedia page (https://en.wikipedia.org/wiki/List_of_shopping_malls_in_India) contains a list of the malls of India. We will use web scraping techniques to extract the data from the Wikipedia page, with the help of Python's urllib and <b>beautifulsoup</b> packages. After importing this data into a dataframe, we will see that Delhi does not appear among the biggest shopping malls. Delhi is the national capital and has a large enough population to support one of the nation's biggest malls. Hence, we will focus on Delhi.

We will get the geographical coordinates of Delhi and its neighbours Gurugram and Ghaziabad using the <b>geopy</b> package (Nominatim geocoder), which will give us the latitude and longitude coordinates of these cities.

After that, we will use the <b>Foursquare</b> API to get the venue data of malls for those areas. Foursquare has one of the largest databases, with 105+ million places, and is used by over 125,000 developers. The Foursquare API provides many categories of venue data; we are particularly interested in the Shopping Mall category, which helps us solve the business problem put forward. This project makes use of many data science skills, from web scraping (Wikipedia), working with an API (Foursquare), data cleaning, and data wrangling, to machine learning (K-means clustering) and map visualization (Folium).
 

<h2>Web Scraping</h2>

In [111]:
import urllib.request
from bs4 import BeautifulSoup
url="https://en.wikipedia.org/wiki/List_of_shopping_malls_in_India"
page=urllib.request.urlopen(url)

In [112]:
soup=BeautifulSoup(page, "lxml")
#print(soup.prettify())
# Uncomment the line above to inspect the page and find the class of the required table; it is used in the next cell.

In [151]:
right_table=soup.find('table', class_='wikitable sortable')
#right_table
# Display is commented out because the table output takes up a lot of space.

In [152]:
import pandas as pd

# Collect the first three columns (mall name, city, year) of each five-cell table row
mall_names=[]
cities=[]
years=[]
for row in right_table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==5:
        mall_names.append(cells[0].find(text=True))
        cities.append(cells[1].find(text=True))
        years.append(cells[2].find(text=True))
df=pd.DataFrame(mall_names,columns=['Mall'])
df['City']=cities
df['Year']=years


In [115]:
df
#printing the table

Unnamed: 0,Mall,City,Year
0,LuLu International Shopping Mall,Kochi,2013\n
1,"World Trade Park, Jaipur",Jaipur,2012\n
2,DLF Mall of India,Noida,2016\n
3,Sarath City Capital Mall\n,Hyderabad,2018\n
4,"Z Square Mall, Kanpur",Kanpur,2010\n
5,Phoenix Marketcity (Bangalore),Bangalore,2010\n
6,Elante Mall,Chandigarh,2013\n
7,Esplanade One,Bhubaneswar,2018\n
8,Phoenix Marketcity (Chennai),Chennai,2013\n
9,Viviana Mall,Thane,2013\n
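
The scraped cells above still carry trailing newline characters (for example `2013\n`). The following is a minimal sketch of how they could be stripped, and how the absence of Delhi could be checked directly rather than by eye. The rows are copied from the table above; the helper name `clean_cell` is hypothetical and not part of the original notebook.

```python
# Rows copied from the scraped table above: (mall, city, year).
rows = [
    ("LuLu International Shopping Mall", "Kochi", "2013\n"),
    ("World Trade Park, Jaipur", "Jaipur", "2012\n"),
    ("DLF Mall of India", "Noida", "2016\n"),
    ("Sarath City Capital Mall\n", "Hyderabad", "2018\n"),
    ("Z Square Mall, Kanpur", "Kanpur", "2010\n"),
    ("Phoenix Marketcity (Bangalore)", "Bangalore", "2010\n"),
    ("Elante Mall", "Chandigarh", "2013\n"),
    ("Esplanade One", "Bhubaneswar", "2018\n"),
    ("Phoenix Marketcity (Chennai)", "Chennai", "2013\n"),
    ("Viviana Mall", "Thane", "2013\n"),
]

def clean_cell(value):
    """Strip surrounding whitespace, including newlines, from a scraped cell."""
    return value.strip() if isinstance(value, str) else value

cleaned = [tuple(clean_cell(cell) for cell in row) for row in rows]
cities = {city for _, city, _ in cleaned}

print(cleaned[3])
print("Delhi" in cities)
```

On the dataframe itself, the same clean-up could be done per column with pandas' `Series.str.strip()`.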


<h4>Conclusion From Web Scraping</h4> <br>We can see that Delhi is nowhere to be seen among the biggest malls of India. It is the national capital and heavily populated, so it should be a good choice for building India's biggest mall.<h3>Now let's work on the Delhi area.</h3>

<h3>Plan of action </h3><br>We will search for all of the malls in the entire Delhi area and in the major neighbouring cities, Gurugram and Ghaziabad.<br><br> We will locate each city and find all the malls in the required area using the <b>Foursquare API</b>. We will combine the details of the 3 areas and then visualize them on the map using <b>folium</b>. This will be followed by data cleaning, and then we will perform <b>K-means clustering</b>. <br><br> This will tell us which area has the fewest malls. That area (cluster) will be suitable for building our mall.<br><br><br>Let us import the required libraries.

In [116]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.display import HTML 
    
# transforming json file into a pandas dataframe
from pandas import json_normalize # pandas >= 1.0; replaces the deprecated pandas.io.json import

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Folium installed
Libraries imported.


Now let us prepare the connection to the Foursquare API. I've hidden my credentials here.

In [150]:
CLIENT_ID = '-------------'
CLIENT_SECRET = '-------' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: -------------
CLIENT_SECRET:-------
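
Rather than pasting credentials into the notebook and masking them afterwards, they could be read from environment variables. This is a minimal sketch only: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, and the helpers `load_credentials` and `mask`, are assumptions chosen for illustration, not names that Foursquare requires.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping, or raise if either is missing."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first")
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Example with a stand-in mapping instead of the real environment (made-up values).
demo_env = {"FOURSQUARE_CLIENT_ID": "abcd1234", "FOURSQUARE_CLIENT_SECRET": "wxyz9876"}
demo_id, demo_secret = load_credentials(demo_env)
print("CLIENT_ID:", mask(demo_id))
```

In the notebook, calling `load_credentials()` with no argument would read from the real environment.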


Now, we have to deal with 3 cities: Delhi, Gurugram and Ghaziabad. Let us start with Ghaziabad.

In [119]:
address = 'Gaziabad'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude,longitude)

28.6711527 77.4120356
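
The geocoder returns `None` when it cannot match an address, in which case `location.latitude` above would fail with an `AttributeError`. The following is a minimal sketch of a defensive wrapper. The name `geocode_city` is hypothetical, and a stand-in geocode function is used so the example runs without network access.

```python
from collections import namedtuple

def geocode_city(geocode, address):
    """Call a geocode function and return (lat, lng), raising if nothing is found."""
    location = geocode(address)
    if location is None:
        raise ValueError(f"Could not geocode {address!r}")
    return location.latitude, location.longitude

# Stand-in for geolocator.geocode, using the coordinates printed above.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])
known = {"Gaziabad": FakeLocation(28.6711527, 77.4120356)}
fake_geocode = known.get  # returns None for unknown addresses

print(geocode_city(fake_geocode, "Gaziabad"))
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`.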


Search for a specific venue category.

In [120]:
search_query = 'Mall'
radius = 1000000 # search radius in metres
print(search_query + ' .... OK!')

Mall .... OK!


Now, make a request for the data and store it in dataframe.

In [121]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
# print(url) to inspect the request URL
results = requests.get(url).json()
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
dataframe = json_normalize(venues)
dataframe.head(2)

  


Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.crossStreet,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress
0,530882fc498e6152e44fa1e7,"Wave Cinemas, Gaur Central Mall, RDC","[{'id': '4bf58dd8d48988d180941735', 'name': 'M...",v-1589734276,False,Gaur City Centre MALL,RDC,28.646943,77.440607,"[{'label': 'display', 'lat': 28.64694294797724...",3879,201001.0,IN,Ghāziābād,Uttar Pradesh,India,"[Gaur City Centre MALL (RDC), Ghāziābād 201001..."
1,4d1329220ad2f04da661b154,Shoppers Stop Shipra Mall,"[{'id': '4bf58dd8d48988d103951735', 'name': 'C...",v-1589734276,False,,,28.634457,77.369911,"[{'label': 'display', 'lat': 28.63445650327106...",5798,,IN,,,India,[India]
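
The request URL above is assembled with one long `format` call, which is easy to get wrong when a parameter is added or reordered. A minimal sketch of building the same query string from a dictionary is shown here. The endpoint and parameter names are the ones used above; the helper `build_search_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    """Build the venue search URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials; the coordinates are the ones printed for the first city.
url = build_search_url("ID", "SECRET", 28.6711527, 77.4120356, "20180604", "Mall", 1000000, 30)
print(url)
```

Alternatively, the same dictionary could be passed to `requests.get(SEARCH_ENDPOINT, params=params)`, which performs the encoding itself.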


<h3>Data cleaning and preprocessing</h3><br><br> Select the required columns for further processing.

In [122]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered_1 = dataframe.loc[:, filtered_columns].copy() # copy so the assignment below does not trigger SettingWithCopyWarning

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered_1['categories'] = dataframe_filtered_1.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered_1.columns = [column.split('.')[-1] for column in dataframe_filtered_1.columns]

dataframe_filtered_1=dataframe_filtered_1[['name','categories','lat','lng','distance']]
dataframe_filtered_1

Unnamed: 0,name,categories,lat,lng,distance
0,"Wave Cinemas, Gaur Central Mall, RDC",Multiplex,28.646943,77.440607,3879
1,Shoppers Stop Shipra Mall,Clothing Store,28.634457,77.369911,5798
2,Mahagun Metro Mall,Shopping Mall,28.645045,77.335597,8012
3,"Mall Road, Shimla, Himachal Pradesh",Scenic Lookout,28.655501,77.4352,2855
4,Gaur Central Mall,Shopping Mall,28.67396,77.440178,2766
5,Fun World Square Mall,Multiplex,28.680063,77.393266,2084
6,Opulent Mall,Shopping Mall,28.654135,77.436591,3056
7,Aggarwal Funcity Mall,Shopping Mall,28.660635,77.298245,11175
8,Angel Mega Mall,Shopping Mall,28.641133,77.327683,8891
9,Spice World Mall,Shopping Mall,28.586524,77.340725,11717
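
Since the same cleaning steps are repeated for each city, they could be wrapped in one helper. The sketch below works on the raw list of venue dictionaries (the `venues` value taken from the response) and uses only the fields visible in the output above. The name `venue_rows` is hypothetical, and the sample venues are abbreviated copies of rows shown in this notebook.

```python
def venue_rows(venues):
    """Flatten raw venue dicts into rows with name, category, lat, lng and distance."""
    rows = []
    for venue in venues:
        categories = venue.get("categories", [])
        location = venue.get("location", {})
        rows.append({
            "name": venue.get("name"),
            # first category name, or None when the venue has no category
            "categories": categories[0]["name"] if categories else None,
            "lat": location.get("lat"),
            "lng": location.get("lng"),
            "distance": location.get("distance"),
        })
    return rows

# Two venues shaped like the Foursquare response (fields abbreviated).
sample = [
    {"name": "Gaur Central Mall",
     "categories": [{"name": "Shopping Mall"}],
     "location": {"lat": 28.67396, "lng": 77.440178, "distance": 2766}},
    {"name": "Lift @ MGF Mall",
     "categories": [],
     "location": {"lat": 28.477965, "lng": 77.071177, "distance": 4302}},
]
rows = venue_rows(sample)
for row in rows:
    print(row)
```

Passing the result to `pd.DataFrame(rows)` would give the same five columns as the table above.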


<h3>REPEAT THIS PROCESS FOR DELHI AND GURUGRAM</h3>

In [125]:
address = 'Delhi'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude,longitude)
search_query = 'Mall'
radius = 1000000
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
# print(url) to inspect the request URL
results = requests.get(url).json()
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
dataframe = json_normalize(venues)
#dataframe

filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered_2 = dataframe.loc[:, filtered_columns].copy()

# filter the category for each row
dataframe_filtered_2['categories'] = dataframe_filtered_2.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered_2.columns = [column.split('.')[-1] for column in dataframe_filtered_2.columns]

dataframe_filtered_2=dataframe_filtered_2[['name','categories','lat','lng','distance']]
dataframe_filtered_2

28.6517178 77.2219388
Mall .... OK!




Unnamed: 0,name,categories,lat,lng,distance
0,V3S Mall,Shopping Mall,28.631343,77.2784,5964
1,the mall central craft cottage industries,Clothing Store,28.638347,77.210541,1858
2,"V3S Mall, East Center",Shopping Mall,28.637154,77.286716,6532
3,Aggarwal Funcity Mall,Shopping Mall,28.660635,77.298245,7519
4,"More, Moments Mall, Kirti Nagar",Department Store,28.634403,77.220848,1930
5,"Moments mall, kirti nagar",Department Store,28.634401,77.220849,1930
6,Big Bazaar Parsvnath mall,Department Store,28.674176,77.169526,5697
7,Ambience Mall,Shopping Mall,28.541012,77.155128,13946
8,Cross River Mall,Shopping Mall,28.657641,77.302267,7874
9,DLF City Center Mall,Shopping Mall,28.703163,77.158063,8468


In [124]:
address = 'Gurugram'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude,longitude)
search_query = 'Mall'
radius = 1000000
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
# print(url) to inspect the request URL
results = requests.get(url).json()
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
dataframe = json_normalize(venues)
#dataframe

filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered_3 = dataframe.loc[:, filtered_columns].copy()

# filter the category for each row
dataframe_filtered_3['categories'] = dataframe_filtered_3.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered_3.columns = [column.split('.')[-1] for column in dataframe_filtered_3.columns]

dataframe_filtered_3=dataframe_filtered_3[['name','categories','lat','lng','distance']]
dataframe_filtered_3

28.4646148 77.0299194
Mall .... OK!




Unnamed: 0,name,categories,lat,lng,distance
0,DT Star Mall,Shopping Mall,28.460202,77.05203,2218
1,Gurgaon Dreamz Mall,Shopping Mall,28.474049,77.019363,1473
2,Omaxe Mall,Shopping Mall,28.411493,77.04317,6054
3,DT Mega Mall,Shopping Mall,28.449477,77.093114,6410
4,Sahara Mall,Shopping Mall,28.479727,77.086637,5799
5,DLF Mega Mall,Shopping Mall,28.475803,77.093168,6313
6,Ambience Mall,Shopping Mall,28.505784,77.095846,7912
7,INOX : Dreamz MALL,Multiplex,28.478293,77.017434,1952
8,Ambience Mall,Shopping Mall,28.541012,77.155128,14911
9,Grand Mall,Shopping Mall,28.479967,77.090315,6152


<h3>Now, concatenate all 3 dataframes into 1 and use it for mapping and further analysis</h3>

In [143]:
dataframe_filtered=pd.concat([dataframe_filtered_1,dataframe_filtered_2,dataframe_filtered_3]).drop_duplicates().reset_index(drop=True)

In [144]:
dataframe_filtered

Unnamed: 0,name,categories,lat,lng,distance
0,"Wave Cinemas, Gaur Central Mall, RDC",Multiplex,28.646943,77.440607,3879
1,Shoppers Stop Shipra Mall,Clothing Store,28.634457,77.369911,5798
2,Mahagun Metro Mall,Shopping Mall,28.645045,77.335597,8012
3,"Mall Road, Shimla, Himachal Pradesh",Scenic Lookout,28.655501,77.435200,2855
4,Gaur Central Mall,Shopping Mall,28.673960,77.440178,2766
...,...,...,...,...,...
85,Manyavar @ MGF Metropolitan Mall,Men's Store,28.481079,77.080292,5258
86,Lift @ MGF Mall,,28.477965,77.071177,4302
87,Cross Point Mall,Shopping Mall,28.468426,77.083126,5224
88,DLF City Center Mall,Shopping Mall,28.479007,77.076236,4807
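
Because `distance` is measured from each city's own centre, a mall returned by two different searches differs in that column and would survive a plain `drop_duplicates()`. For example, Ambience Mall at 28.541012, 77.155128 appears in both the Delhi and the Gurugram tables above. The following is a minimal sketch of deduplicating on name and coordinates instead; `dedupe_venues` is a hypothetical helper, and the three rows are taken from those tables.

```python
def dedupe_venues(rows):
    """Keep the first row for each (name, lat, lng) key, ignoring distance."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["name"], row["lat"], row["lng"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    # same mall, found by the Delhi search and by the Gurugram search
    {"name": "Ambience Mall", "lat": 28.541012, "lng": 77.155128, "distance": 13946},
    {"name": "Ambience Mall", "lat": 28.541012, "lng": 77.155128, "distance": 14911},
    # a different venue that shares the name
    {"name": "Ambience Mall", "lat": 28.505784, "lng": 77.095846, "distance": 7912},
]
unique = dedupe_venues(rows)
print(len(unique))
```

With pandas, the equivalent would be `drop_duplicates(subset=['name', 'lat', 'lng'])`. Note that applying it would change the row count and therefore the cluster counts reported below.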


<h3>VISUALIZATION OF THE DATA</h3>

In [145]:
venues_map = folium.Map(location=[latitude, longitude], zoom_start=10) # generate map centred on the most recently geocoded city

# add a red circle marker at the map centre
folium.features.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the malls as blue circle markers
for lat, lng in zip(dataframe_filtered.lat, dataframe_filtered.lng):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map

There we have it, the visual representation of all the malls. Now it is time to cluster them to find the best place.<br><h3>K MEANS CLUSTERING</h3> <br><br>Now, it is time for our analysis. We will perform K-means clustering with 5 clusters. The number of clusters is ours to choose; after several trial runs with different values, k=5 seemed to give the most sensible grouping.
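
One common way to support a choice of k is the elbow method: fit K-means for several values of k and look for the point where the within-cluster sum of squares (`inertia_`) stops dropping sharply. This is a sketch of that general technique, not the procedure used in this notebook, and it runs on synthetic points rather than the real mall coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (lat, lng)-like points in three well-separated groups; not the real mall data.
rng = np.random.default_rng(0)
centres = np.array([[28.46, 77.03], [28.65, 77.22], [28.67, 77.41]])
points = np.vstack([c + rng.normal(scale=0.01, size=(20, 2)) for c in centres])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 4))
```

On the real data, `points` would be replaced by the `lat` and `lng` columns of the combined dataframe.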

In [146]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

# cluster on the coordinates only
mall_coordinates = dataframe_filtered[['lat','lng']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mall_coordinates)

# check cluster labels generated for the first 10 rows
kmeans.labels_[0:10] 

array([0, 0, 3, 0, 0, 0, 0, 3, 3, 3])

In [147]:
# add clustering labels
dataframe_filtered.insert(4, 'ClusterLabels', kmeans.labels_)

malls_clustered = dataframe_filtered


malls_clustered.head() # check the ClusterLabels column

Unnamed: 0,name,categories,lat,lng,ClusterLabels,distance
0,"Wave Cinemas, Gaur Central Mall, RDC",Multiplex,28.646943,77.440607,0,3879
1,Shoppers Stop Shipra Mall,Clothing Store,28.634457,77.369911,0,5798
2,Mahagun Metro Mall,Shopping Mall,28.645045,77.335597,3,8012
3,"Mall Road, Shimla, Himachal Pradesh",Scenic Lookout,28.655501,77.4352,0,2855
4,Gaur Central Mall,Shopping Mall,28.67396,77.440178,0,2766


<br><br><b>Now it is time to plot the clustered result for better visualization.</b><br><br>

In [148]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lng, cluster in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.ClusterLabels):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

       
map_clusters

Select the cluster with the minimum number of malls.

In [149]:
dataframe_filtered['ClusterLabels'].value_counts()

3    29
1    28
0    17
4    12
2     4
Name: ClusterLabels, dtype: int64

<br>We can see that label 2 has the fewest malls, so let us plot it on its own for a better understanding of the area we are dealing with.<br>
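
The sparsest cluster can also be picked programmatically instead of being read off the printed counts. In this minimal sketch the label list is rebuilt from the counts shown above, and `sparsest_cluster` is a hypothetical helper name.

```python
from collections import Counter

def sparsest_cluster(labels):
    """Return (label, count) for the cluster with the fewest members."""
    counts = Counter(labels)
    label = min(counts, key=counts.get)
    return label, counts[label]

# Labels rebuilt from the counts printed above (29, 28, 17, 12 and 4 malls).
labels = [3] * 29 + [1] * 28 + [0] * 17 + [4] * 12 + [2] * 4
print(sparsest_cluster(labels))  # (2, 4)
```

On the dataframe, `dataframe_filtered['ClusterLabels'].value_counts().idxmin()` would return the same label.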

In [142]:
map_clusterss = folium.Map(location=[latitude, longitude], zoom_start=11)

# one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers for the sparsest cluster only
for lat, lng, cluster in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.ClusterLabels):
    if cluster==2:
        folium.features.CircleMarker(
            [lat, lng],
            radius=5,
            color=rainbow[cluster],
            fill=True,
            fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(map_clusterss)
 
map_clusterss

<h2>RESULT</h2><br>Judging by the map, the sparsest cluster (label 2, with only 4 malls) lies in the area between New Delhi and Gurugram, so that area seems like the most promising place to open a mall.
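
To turn "the area between New Delhi and Gurugram" into a concrete starting point, the centre of the chosen cluster could be computed as the mean of its members' coordinates. This is only a sketch: `cluster_centroid` is a hypothetical helper, and the three points and labels below are made up for illustration, not the actual members of cluster 2.

```python
def cluster_centroid(points, labels, target):
    """Return the mean (lat, lng) of the points whose label equals target."""
    selected = [p for p, label in zip(points, labels) if label == target]
    if not selected:
        raise ValueError(f"No points with label {target}")
    lat = sum(p[0] for p in selected) / len(selected)
    lng = sum(p[1] for p in selected) / len(selected)
    return lat, lng

# Made-up coordinates and labels, for illustration only.
points = [(28.50, 77.10), (28.54, 77.16), (28.65, 77.22)]
labels = [2, 2, 1]
centroid = cluster_centroid(points, labels, 2)
print(centroid)
```

With the fitted model, `kmeans.cluster_centers_[2]` holds the corresponding centre directly.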