<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 6>IBM Applied Data Science Capstone Course by Coursera</font></h1>

# Week 5 Final Report
## Opening a New Shopping Mall in Mumbai, Maharashtra, India

- Build a dataframe of neighborhoods in Mumbai, India by scraping a Wikipedia page
- Get the geographical coordinates of the neighborhoods
- Obtain venue data for the neighborhoods from the Foursquare API
- Explore and cluster the neighborhoods
- Select the best cluster in which to open a new shopping mall

In [1]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

## Scraping data from Wikipedia page

In [2]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Suburbs_of_Mumbai").text

# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

# create a list to store neighborhood data
neighborhoodList = []


# append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text)
    
# create a new DataFrame from the list
Mumbai_df = pd.DataFrame({"Neighborhood": neighborhoodList})

Mumbai_df.head()

Unnamed: 0,Neighborhood
0,Andheri
1,Anushakti Nagar
2,Baiganwadi
3,Bandra
4,Bhandup
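
The scraping step above boils down to "collect the text of every `<li>` inside the `mw-category` div". As a rough, offline illustration of that idea, here is a sketch that uses the standard library's `html.parser` instead of BeautifulSoup, run on a small made-up HTML snippet. The class name `CategoryListParser` and the sample markup are assumptions for illustration only, not the notebook's method or the real page.

```python
from html.parser import HTMLParser


class CategoryListParser(HTMLParser):
    """Collect the text of <li> items nested inside a div with class 'mw-category'."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while we are inside the mw-category div
        self.in_item = False  # True while we are inside a relevant <li>
        self.items = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # count nested divs so we know when the outer one closes
            if self.div_depth or "mw-category" in classes:
                self.div_depth += 1
        elif tag == "li" and self.div_depth:
            self.in_item = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "li" and self.in_item:
            self.items.append("".join(self._buffer).strip())
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self._buffer.append(data)


# made-up snippet that mimics the shape of a category listing
sample_html = (
    '<div class="mw-category"><div class="mw-category-group"><ul>'
    '<li><a href="/wiki/Andheri">Andheri</a></li>'
    '<li><a href="/wiki/Anushakti_Nagar">Anushakti Nagar</a></li>'
    '</ul></div></div>'
    '<div id="footer"><ul><li>About</li></ul></div>'
)

parser = CategoryListParser()
parser.feed(sample_html)
names = parser.items
print(names)
```

The footer item is skipped because it sits outside the `mw-category` div, which is the same filtering the BeautifulSoup call relies on.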


In [3]:
# print the number of rows of the dataframe
Mumbai_df.shape

(40, 1)

## Get the geographical coordinates

In [4]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Mumbai'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

In [5]:
coordinates = list(map(get_latlng, Mumbai_df["Neighborhood"].tolist()))

coordinates

[[19.11848309908247, 72.84177419095158],
 [19.042830000000038, 72.92734000000007],
 [19.06293000000005, 72.92666000000008],
 [19.054220000000043, 72.84019000000006],
 [19.145560000000046, 72.94856000000004],
 [19.229360000000042, 72.85751000000005],
 [19.208660000000066, 72.82612000000006],
 [19.062200000000075, 72.90242000000006],
 [19.250030000000038, 72.85908000000006],
 [19.224720000000048, 72.86606000000006],
 [19.220110000000034, 73.09075000000007],
 [19.00538889189226, 72.85576887678867],
 [19.086476606699875, 72.9089562772808],
 [19.164550000000077, 72.84946000000008],
 [18.959290000000067, 72.83108000000004],
 [19.13790000000006, 72.84941000000003],
 [19.01493000000005, 72.84522000000004],
 [18.953937419095155, 72.82036732944775],
 [19.21195211212422, 72.83754191243007],
 [19.131400000000042, 72.93565000000007],
 [19.127560000000074, 72.82540000000006],
 [19.064940000000036, 72.88073000000003],
 [19.21094000000005, 72.84137000000004],
 [19.048530000000028, 72.93220000000008],


In [8]:
# merge the coordinates into the original dataframe
Mumbai_df['Latitude'] = [x[0] for x in coordinates]
Mumbai_df['Longitude'] = [x[1] for x in coordinates]

Mumbai_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Andheri,19.118483,72.841774
1,Anushakti Nagar,19.04283,72.92734
2,Baiganwadi,19.06293,72.92666
3,Bandra,19.05422,72.84019
4,Bhandup,19.14556,72.94856


In [9]:
Mumbai_df

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Andheri,19.118483,72.841774
1,Anushakti Nagar,19.04283,72.92734
2,Baiganwadi,19.06293,72.92666
3,Bandra,19.05422,72.84019
4,Bhandup,19.14556,72.94856
5,Borivali,19.22936,72.85751
6,Charkop,19.20866,72.82612
7,Chembur,19.0622,72.90242
8,Dahisar,19.25003,72.85908
9,Devipada,19.22472,72.86606


In [10]:
# save the DataFrame as CSV file
Mumbai_df.to_csv("Mumbai.csv", index=False)

## Create a map of Mumbai with neighborhoods superimposed on top

In [11]:
# get the coordinates of Mumbai
address = 'Mumbai, India'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Mumbai, India are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Mumbai, India are 18.9387711, 72.8353355.


In [13]:
# create map of Mumbai using latitude and longitude values
map_mumbai = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(Mumbai_df['Latitude'], Mumbai_df['Longitude'], Mumbai_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_mumbai)  
    
    
map_mumbai

In [15]:
# save the map as HTML file
map_mumbai.save('map_mumbai.html')

## Using the Foursquare API to explore the neighborhoods

In [16]:
CLIENT_ID = 'XXXXXXXXXXXXXXXXXXXXX' # your Foursquare ID
CLIENT_SECRET = 'XXXXXXXXXXXXXXXXXXXX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

### Now, let's get the top 100 venues that are within a radius of 3000 meters.

In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=3000, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
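
The function above does two things per neighborhood: it builds the request URL and it flattens each returned item into a row. Here is a small offline sketch of both steps, assuming the same endpoint and parameters used above. The helper names `build_explore_url` and `flatten_items` are hypothetical, and the guard for a venue with an empty `categories` list is an assumption about what the API might return, not something the notebook confirms.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=3000, limit=100):
    """Build the explore URL, letting urlencode escape the parameter values."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)


def flatten_items(name, lat, lng, items):
    """Turn the 'items' list of a response into one tuple per venue."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # assumed guard: use None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


url = build_explore_url("ID", "SECRET", "20180605", 19.1, 72.8)

# made-up items that mimic the shape accessed in getNearbyVenues
sample_items = [
    {"venue": {"name": "Cafe A", "location": {"lat": 19.11, "lng": 72.81},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unlabelled", "location": {"lat": 19.12, "lng": 72.82},
               "categories": []}},
]
rows = flatten_items("Andheri", 19.1, 72.8, sample_items)
print(url)
print(rows)
```

Separating the two steps makes it possible to test the row-flattening logic without calling the API.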

In [20]:
# get venues list into a new DataFrame
venues_df = getNearbyVenues(Mumbai_df.Neighborhood,
                            Mumbai_df.Latitude,
                            Mumbai_df.Longitude)

Andheri
Anushakti Nagar
Baiganwadi
Bandra
Bhandup
Borivali
Charkop
Chembur
Dahisar
Devipada
Dombivli
Eastern Suburbs (Mumbai)
Ghatkopar
Goregaon
Grant Road
Jogeshwari
Juhu
Kalyan
Kandivali
Kanjurmarg
Kausa
Kurla
Mahavir Nagar (Kandivali)
Mankhurd
Matharpacady, Mumbai
Mira Road
Mogra Village
Mulund
Mumbra
Pestom sagar
Seven Bungalows
Shil Phata
Sion, Mumbai
Thakur village
Tilak Nagar (Mumbai)
Vashi
Vikhroli
Wadala
Western Suburbs (Mumbai)
Worli
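
Several of the scraped names above carry a Wikipedia disambiguation suffix, such as "Sion, Mumbai" or "Tilak Nagar (Mumbai)". Passed to the geocoder as-is, the query would read "Sion, Mumbai, Mumbai". A minimal sketch of a clean-up step follows; the function name `clean_neighborhood` is hypothetical and the notebook itself does not perform this step.

```python
import re


def clean_neighborhood(name, city="Mumbai"):
    """Strip a trailing ', <city>' or ' (<city>)' disambiguator from a name."""
    pattern = r"\s*(?:,\s*{0}|\(\s*{0}\s*\))\s*$".format(re.escape(city))
    return re.sub(pattern, "", name).strip()


examples = [
    "Matharpacady, Mumbai",
    "Sion, Mumbai",
    "Tilak Nagar (Mumbai)",
    "Mahavir Nagar (Kandivali)",  # a different suffix, left untouched
    "Andheri",
]
cleaned = [clean_neighborhood(n) for n in examples]
print(cleaned)
```

Entries such as "Eastern Suburbs (Mumbai)" would become "Eastern Suburbs", which is a region rather than a single neighborhood, so it may deserve a manual check.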


In [21]:
venues_df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Andheri,19.118483,72.841774,Merwans Cake shop,19.1193,72.845418,Bakery
1,Andheri,19.118483,72.841774,Naturals,19.111204,72.837255,Ice Cream Shop
2,Andheri,19.118483,72.841774,Radha Krishna Veg Restaurant,19.11513,72.84306,Indian Restaurant
3,Andheri,19.118483,72.841774,Shawarma Factory,19.124591,72.840398,Falafel Restaurant
4,Andheri,19.118483,72.841774,Joey's Pizza,19.126762,72.830001,Pizza Place


### Let's check how many venues were returned for each neighborhood

In [22]:
venues_df.groupby(["Neighborhood"]).count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Andheri,100,100,100,100,100,100
Anushakti Nagar,24,24,24,24,24,24
Baiganwadi,33,33,33,33,33,33
Bandra,100,100,100,100,100,100
Bhandup,57,57,57,57,57,57
Borivali,100,100,100,100,100,100
Charkop,69,69,69,69,69,69
Chembur,88,88,88,88,88,88
Dahisar,100,100,100,100,100,100
Devipada,100,100,100,100,100,100


In [24]:
# Let's find out how many unique categories there are among all the returned venues
print('There are {} unique categories.'.format(len(venues_df['Venue Category'].unique())))

There are 178 unique categories.


In [25]:
# print the first 50 categories
venues_df['Venue Category'].unique()[:50]

array(['Bakery', 'Ice Cream Shop', 'Indian Restaurant',
       'Falafel Restaurant', 'Pizza Place', 'Coffee Shop',
       'Sandwich Place', 'Multiplex', 'Juice Bar', 'Breakfast Spot',
       'Pub', 'Seafood Restaurant', 'Theater', 'American Restaurant',
       'Fast Food Restaurant', 'Café', 'Snack Place', 'Brewery',
       'Food Truck', 'Bar', 'Cocktail Bar', 'Beach', 'Hotel',
       'Mughlai Restaurant', 'Mediterranean Restaurant', 'BBQ Joint',
       'Gym / Fitness Center', 'Lounge', 'Diner', 'Club House',
       'Dessert Shop', 'Chinese Restaurant',
       'Vegetarian / Vegan Restaurant', 'Italian Restaurant',
       'Comfort Food Restaurant', 'Spa', 'Electronics Store',
       'Movie Theater', 'Spanish Restaurant', 'Food', 'Asian Restaurant',
       'Plaza', 'Supermarket', 'Sports Bar', 'Concert Hall',
       'Shop & Service', 'Restaurant', 'Sculpture Garden', 'Garden',
       'Gym'], dtype=object)

In [27]:
# check if the results contain "Shopping Mall"
"Shopping Mall" in venues_df['Venue Category'].unique()

True

### Analyze Each Neighborhood

In [28]:
# one hot encoding
Mumbai_onehot = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Mumbai_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Mumbai_onehot.columns[-1]] + list(Mumbai_onehot.columns[:-1])
Mumbai_onehot = Mumbai_onehot[fixed_columns]

print(Mumbai_onehot.shape)
Mumbai_onehot.head()

(3340, 179)


Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Arcade,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Theme Park,Toy / Game Store,Track,Train Station,Travel & Transport,Vegetarian / Vegan Restaurant,Water Park,Whisky Bar,Wine Bar,Women's Store
0,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [29]:
Mumbai_grouped = Mumbai_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(Mumbai_grouped.shape)
Mumbai_grouped

(39, 179)


Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Arcade,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Theme Park,Toy / Game Store,Track,Train Station,Travel & Transport,Vegetarian / Vegan Restaurant,Water Park,Whisky Bar,Wine Bar,Women's Store
0,Andheri,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
1,Anushakti Nagar,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Baiganwadi,0.0,0.0,0.0,0.0,0.0,0.0,0.060606,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0
3,Bandra,0.0,0.0,0.0,0.0,0.02,0.0,0.03,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bhandup,0.0,0.0,0.0,0.0,0.017544,0.0,0.035088,0.0,0.0,...,0.0,0.0,0.0,0.070175,0.0,0.0,0.0,0.0,0.0,0.0
5,Borivali,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01,...,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0
6,Charkop,0.0,0.0,0.0,0.0,0.014493,0.0,0.0,0.0,0.014493,...,0.014493,0.0,0.0,0.014493,0.0,0.014493,0.014493,0.0,0.0,0.0
7,Chembur,0.0,0.0,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,...,0.0,0.0,0.0,0.011364,0.0,0.045455,0.0,0.0,0.0,0.0
8,Dahisar,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.02,...,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0
9,Devipada,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0


In [30]:
# count the neighborhoods that have at least one Shopping Mall among their venues
len(Mumbai_grouped[Mumbai_grouped["Shopping Mall"] > 0])

19

#### Create a new DataFrame for Shopping Mall data only

In [31]:
Mumbai_mall = Mumbai_grouped[["Neighborhoods","Shopping Mall"]]
Mumbai_mall.head()

Unnamed: 0,Neighborhoods,Shopping Mall
0,Andheri,0.0
1,Anushakti Nagar,0.0
2,Baiganwadi,0.0
3,Bandra,0.0
4,Bhandup,0.035088
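
The values in this column are shares, not counts: the mean of a one-hot column is the fraction of a neighborhood's venues that fall in that category. For example, Bhandup returned 57 venues and shows 0.035088, which is consistent with 2 of the 57 being shopping malls. The plain-Python sketch below reproduces that arithmetic; the function name `category_frequencies` and the filler categories are made up for illustration.

```python
from collections import Counter


def category_frequencies(categories):
    """Return each category's share of the venue list (mean of its one-hot column)."""
    counts = Counter(categories)
    total = len(categories)
    return {category: n / total for category, n in counts.items()}


# hypothetical venue list: 57 venues, 2 of them malls; the filler category is arbitrary
bhandup_like = ["Shopping Mall"] * 2 + ["Indian Restaurant"] * 55
freq = category_frequencies(bhandup_like)
mall_share = round(freq["Shopping Mall"], 6)
print(mall_share)
```

Because the shares depend on how many venues were returned, a neighborhood with few venues can show a high share from a single mall.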


## Cluster Neighborhoods

In [32]:
# set number of clusters
mclusters = 3

Mumbai_clustering = Mumbai_mall.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=mclusters, random_state=0).fit(Mumbai_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 1, 2, 0, 0, 1, 1, 0])
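
Since the clustering uses a single feature (the Shopping Mall share), k-means here amounts to splitting points on a line into three groups. The sketch below is a bare-bones version of the usual assign-then-update iteration in plain Python, not scikit-learn's implementation. The function name `kmeans_1d` and the starting centers are assumptions; the values are a handful of Shopping Mall shares that appear in this notebook's tables.

```python
def kmeans_1d(values, initial_centers, max_iter=100):
    """Plain 1-D k-means: assign each value to its nearest center, then re-average."""
    centers = list(initial_centers)
    labels = []
    for _ in range(max_iter):
        # assignment step: index of the closest center for every value
        labels = [
            min(range(len(centers)), key=lambda k, v=v: abs(v - centers[k]))
            for v in values
        ]
        # update step: each center becomes the mean of its members
        new_centers = []
        for k in range(len(centers)):
            members = [v for v, label in zip(values, labels) if label == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers


mall_shares = [0.0, 0.0, 0.0, 0.0, 0.035088, 0.018519, 0.020408,
               0.02, 0.016129, 0.01, 0.03125, 0.03, 0.03]
labels, centers = kmeans_1d(mall_shares, initial_centers=[0.0, 0.015, 0.03])
print(labels)
print(centers)
```

The cluster ids depend on the starting centers, so they need not match the labels printed above; what matters is which values end up grouped together.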

In [33]:
# create a new dataframe that includes the cluster label as well as the Shopping Mall frequency for each neighborhood
Mumbai_merged = Mumbai_mall.copy()

# add clustering labels
Mumbai_merged["Cluster Labels"] = kmeans.labels_

In [34]:
Mumbai_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
Mumbai_merged.head()

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels
0,Andheri,0.0,1
1,Anushakti Nagar,0.0,1
2,Baiganwadi,0.0,1
3,Bandra,0.0,1
4,Bhandup,0.035088,2


In [35]:
# join Mumbai_merged with Mumbai_df to add latitude/longitude for each neighborhood
Mumbai_merged = Mumbai_merged.join(Mumbai_df.set_index("Neighborhood"), on="Neighborhood")

print(Mumbai_merged.shape)
Mumbai_merged.head() # check the last columns!

(39, 5)


Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Andheri,0.0,1,19.118483,72.841774
1,Anushakti Nagar,0.0,1,19.04283,72.92734
2,Baiganwadi,0.0,1,19.06293,72.92666
3,Bandra,0.0,1,19.05422,72.84019
4,Bhandup,0.035088,2,19.14556,72.94856


In [36]:
# sort the results by Cluster Labels
print(Mumbai_merged.shape)
Mumbai_merged.sort_values(["Cluster Labels"], inplace=True)
Mumbai_merged

(39, 5)


Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
19,Kanjurmarg,0.018519,0,19.1314,72.93565
35,Vikhroli,0.020408,0,19.11109,72.92781
33,Tilak Nagar (Mumbai),0.02,0,18.99616,72.85281
32,Thakur village,0.016129,0,19.2102,72.87541
28,Pestom sagar,0.02,0,19.07063,72.9022
27,Mulund,0.02,0,19.17185,72.95564
26,Mogra Village,0.01,0,19.0988,72.91706
22,Mahavir Nagar (Kandivali),0.02,0,19.21094,72.84137
21,Kurla,0.02,0,19.06494,72.88073
18,Kandivali,0.01,0,19.211952,72.837542


#### Finally, let's visualize the resulting clusters

In [37]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(mclusters)
ys = [i+x+(i*x)**2 for i in range(mclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, indexing the palette directly by cluster label
for lat, lon, poi, cluster in zip(Mumbai_merged['Latitude'], Mumbai_merged['Longitude'], Mumbai_merged['Neighborhood'], Mumbai_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [38]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

## Examine Clusters

#### Cluster 0

In [40]:
Mumbai_merged.loc[Mumbai_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
19,Kanjurmarg,0.018519,0,19.1314,72.93565
35,Vikhroli,0.020408,0,19.11109,72.92781
33,Tilak Nagar (Mumbai),0.02,0,18.99616,72.85281
32,Thakur village,0.016129,0,19.2102,72.87541
28,Pestom sagar,0.02,0,19.07063,72.9022
27,Mulund,0.02,0,19.17185,72.95564
26,Mogra Village,0.01,0,19.0988,72.91706
22,Mahavir Nagar (Kandivali),0.02,0,19.21094,72.84137
21,Kurla,0.02,0,19.06494,72.88073
18,Kandivali,0.01,0,19.211952,72.837542


#### Cluster 1

In [41]:
Mumbai_merged.loc[Mumbai_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
11,Eastern Suburbs (Mumbai),0.0,1,19.005389,72.855769
36,Wadala,0.0,1,19.01716,72.85813
1,Anushakti Nagar,0.0,1,19.04283,72.92734
2,Baiganwadi,0.0,1,19.06293,72.92666
3,Bandra,0.0,1,19.05422,72.84019
31,"Sion, Mumbai",0.0,1,19.04359,72.8641
30,Shil Phata,0.0,1,18.94017,72.83486
29,Seven Bungalows,0.0,1,19.12856,72.82085
25,Mira Road,0.0,1,19.074161,72.86167
10,Dombivli,0.0,1,19.22011,73.09075


#### Cluster 2

In [42]:
Mumbai_merged.loc[Mumbai_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
4,Bhandup,0.035088,2,19.14556,72.94856
37,Western Suburbs (Mumbai),0.03125,2,19.197,72.82763
34,Vashi,0.03,2,19.0847,72.90484
13,Goregaon,0.03,2,19.16455,72.84946
12,Ghatkopar,0.03,2,19.086477,72.908956


## Observations

The clusters separate neighborhoods by the share of nearby venues that are shopping malls. In the tables above, cluster 2 has the highest share (about 3% or more of the venues returned), cluster 0 has a moderate share (about 1% to 2%), and cluster 1 has no shopping malls at all among the venues returned by Foursquare.

Cluster 1 therefore represents the greatest opportunity for new shopping malls, since there is little to no competition from existing malls. Meanwhile, shopping malls in cluster 2 are likely to face the most intense competition, because these neighborhoods already have the highest concentration of malls.

This project recommends that property developers capitalize on these findings by opening new shopping malls in cluster 1 neighborhoods. Developers with a unique selling proposition that stands out from the competition could also consider neighborhoods in cluster 0, where competition is moderate. Lastly, developers are advised to avoid neighborhoods in cluster 2, which already have a high concentration of shopping malls.