# Recommending to a Property Developer Where to Build a Restaurant in Mumbai

Build a dataframe of neighborhoods in Mumbai, India, by scraping the list from its Wikipedia page

Get the geographical coordinates of the neighborhoods

Obtain the venue data for the neighborhoods from the Foursquare API

Explore and cluster the neighborhoods

Select the best cluster to open a new restaurant

In [42]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
import geocoder 
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
from urllib.request import urlopen as uReq
import bs4 as bs
import re

# 1. Import Dataset

In [43]:
data=uReq('https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai').read()
soup=bs.BeautifulSoup(data,'lxml')
soup

In [44]:
content_lis = soup.find_all('span', attrs={'class':'mw-headline'})

In [45]:
neighborhoodList = []
for span in content_lis:
    neighborhoodList.append(span.getText().split('\n')[0])
print(neighborhoodList)

['Western Suburbs', 'Andheri', 'Bhayandar', 'Bandra', 'Borivali', 'Dahisar', 'Goregaon', 'Jogeshwari', 'Juhu', 'Kandivali west', 'Kandivali east', 'Khar', 'Malad', 'Santacruz', 'Vasai', 'Virar', 'Vile Parle', 'Eastern Suburbs', 'Bhandup', 'Ghatkopar', 'Kanjurmarg', 'Kurla', 'Mulund', 'Powai', 'Vidyavihar', 'Vikhroli', 'Harbour Suburbs', 'Chembur', 'Govandi', 'Mankhurd', 'Trombay', 'South Mumbai', 'Antop Hill', 'Byculla', 'Colaba', 'Dadar', 'Fort', 'Girgaon', 'Kalbadevi', 'Kamathipura', 'Matunga', 'Parel', 'Tardeo', 'Other', 'References']
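The scraped list mixes page section headings with neighborhood names: 'Western Suburbs', 'Eastern Suburbs', 'Harbour Suburbs', 'South Mumbai', 'Other' and 'References' appear to be headings rather than neighborhoods. A minimal sketch of filtering them by name instead of by row position follows. The `NON_NEIGHBORHOOD_HEADINGS` set and the `keep_neighborhoods` helper are illustrative names that are not part of the original notebook, and the set covers only the headings visible in the output above.

```python
# Headings from the scraped output that look like page sections rather than
# neighborhoods (an assumption based on the printed list).
NON_NEIGHBORHOOD_HEADINGS = {
    "Western Suburbs", "Eastern Suburbs", "Harbour Suburbs",
    "South Mumbai", "Other", "References",
}


def keep_neighborhoods(headings, excluded=NON_NEIGHBORHOOD_HEADINGS):
    """Return the headings with section names removed, preserving order."""
    return [h.strip() for h in headings if h.strip() not in excluded]


# A short sample taken from the scraped list
sample = ["Western Suburbs", "Andheri", "Bandra", "South Mumbai",
          "Colaba", "Other", "References"]
filtered = keep_neighborhoods(sample)
print(filtered)
```

Filtering by name keeps working if Wikipedia adds or reorders sections, whereas dropping fixed row indices silently removes the wrong rows.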


In [46]:
df=pd.DataFrame({'Neighborhoods':neighborhoodList})
df

Unnamed: 0,Neighborhoods
0,Western Suburbs
1,Andheri
2,Bhayandar
3,Bandra
4,Borivali
5,Dahisar
6,Goregaon
7,Jogeshwari
8,Juhu
9,Kandivali west


In [47]:
df.shape

(45, 1)

# 2. Get the geographical coordinates

In [48]:
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Mumbai, India'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
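The `while` loop above never terminates if the geocoder keeps returning `None` for a name. A hedged sketch of a bounded variant is shown here. The function name, the `max_attempts` and `pause_seconds` parameters, and the injected `geocode` callable are assumptions made for illustration; in the notebook the callable would be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def get_latlng_with_retries(neighborhood, geocode, max_attempts=5, pause_seconds=1.0):
    """Call geocode(query) until it returns coordinates or attempts run out.

    `geocode` is any callable that returns [lat, lng] or None.
    """
    query = '{}, Mumbai, India'.format(neighborhood)
    for attempt in range(max_attempts):
        lat_lng = geocode(query)
        if lat_lng is not None:
            return lat_lng
        # wait before the next attempt, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(pause_seconds)
    return None  # the caller decides how to handle an unresolved name


# Stand-in geocoder that fails twice and then succeeds (no network needed)
calls = []


def fake_geocode(query):
    calls.append(query)
    return [19.05422, 72.84019] if len(calls) >= 3 else None


result = get_latlng_with_retries("Bandra", fake_geocode, pause_seconds=0)
print(result, len(calls))
```

Returning `None` after the last attempt makes unresolved names visible in the resulting dataframe instead of hanging the notebook.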

In [49]:
coords = [ get_latlng(neighborhood) for neighborhood in df["Neighborhoods"].tolist() ]

In [50]:
coords

[[19.167730000000063, 72.85052000000007],
 [19.11848309908247, 72.84177419095158],
 [19.30746000000005, 72.85170000000005],
 [19.054220000000043, 72.84019000000006],
 [19.229360000000042, 72.85751000000005],
 [19.250030000000038, 72.85908000000006],
 [19.164550000000077, 72.84946000000008],
 [19.13790000000006, 72.84941000000003],
 [19.01493000000005, 72.84522000000004],
 [19.207110000000057, 72.83492000000007],
 [19.205750000000023, 72.86969000000005],
 [19.069120000000055, 72.84643000000005],
 [19.186550000000068, 72.84836000000007],
 [19.081770000000063, 72.84205000000003],
 [19.07934000000006, 72.83916000000005],
 [19.01657000000006, 72.85853000000003],
 [19.100580000000036, 72.84377000000006],
 [19.00538889189226, 72.85576887678867],
 [19.145560000000046, 72.94856000000004],
 [19.086476606699875, 72.9089562772808],
 [19.131400000000042, 72.93565000000007],
 [19.064940000000036, 72.88073000000003],
 [19.171850000000063, 72.95564000000007],
 [19.123110000000054, 72.90944000000007],


In [51]:
df_coors=pd.DataFrame(coords,columns=['Latitude', 'Longitude'])

In [52]:
df['Latitude']=df_coors['Latitude']
df['Longitude']=df_coors['Longitude']

In [53]:
df

Unnamed: 0,Neighborhoods,Latitude,Longitude
0,Western Suburbs,19.16773,72.85052
1,Andheri,19.118483,72.841774
2,Bhayandar,19.30746,72.8517
3,Bandra,19.05422,72.84019
4,Borivali,19.22936,72.85751
5,Dahisar,19.25003,72.85908
6,Goregaon,19.16455,72.84946
7,Jogeshwari,19.1379,72.84941
8,Juhu,19.01493,72.84522
9,Kandivali west,19.20711,72.83492


In [54]:
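# drop the region headings (Western Suburbs, Eastern Suburbs, Harbour Suburbs, South Mumbai)
# note: 'Other' (index 43) and 'References' (index 44) are also page headings and stay in the data
df.drop([0, 17, 26, 31], inplace=True)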

In [55]:
df.head()

Unnamed: 0,Neighborhoods,Latitude,Longitude
1,Andheri,19.118483,72.841774
2,Bhayandar,19.30746,72.8517
3,Bandra,19.05422,72.84019
4,Borivali,19.22936,72.85751
5,Dahisar,19.25003,72.85908


In [56]:
df.shape

(41, 3)

In [57]:
df=df.reset_index()

In [58]:
del df['index']

In [59]:
df.head()

Unnamed: 0,Neighborhoods,Latitude,Longitude
0,Andheri,19.118483,72.841774
1,Bhayandar,19.30746,72.8517
2,Bandra,19.05422,72.84019
3,Borivali,19.22936,72.85751
4,Dahisar,19.25003,72.85908


In [60]:
df.to_csv('df.csv',index=False)

# 3. Create a map of Mumbai with neighborhoods superimposed on top

In [61]:
# get the coordinates of Mumbai
address = 'Mumbai, India'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Mumbai, India are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Mumbai, India are 18.9387711, 72.8353355.


In [62]:
import folium  
map_mum = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhoods']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_mum)  
    
map_mum

In [63]:
map_mum.save('map_mum.html')

# 4. Use the Foursquare API to explore the neighborhoods

In [64]:
CLIENT_ID = 'ZZ5MY2SGCGUYP5FOEKAI4BBPEDRP1DZUSZGROJVVONT35XFI' 
CLIENT_SECRET = 'JHMZXMQRRZ2GOAJNJX41BX3CKQN3YEZ3GLG0KW3IFDWSMWKB' 
VERSION = '20191010' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: ZZ5MY2SGCGUYP5FOEKAI4BBPEDRP1DZUSZGROJVVONT35XFI
CLIENT_SECRET:JHMZXMQRRZ2GOAJNJX41BX3CKQN3YEZ3GLG0KW3IFDWSMWKB
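Hard-coding and printing API credentials exposes them to anyone who reads the notebook. One common alternative, sketched here under stated assumptions, is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical and not part of the original notebook.

```python
import os


def load_foursquare_credentials(environ=os.environ):
    """Read the credentials from environment variables (names are assumptions)."""
    client_id = environ.get("FOURSQUARE_CLIENT_ID")
    client_secret = environ.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first.")
    return client_id, client_secret


def mask(secret, visible=4):
    """Keep only the first few characters readable when printing a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demonstration with a plain dict standing in for the real environment
demo_env = {"FOURSQUARE_CLIENT_ID": "ABCDEFGH",
            "FOURSQUARE_CLIENT_SECRET": "12345678"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
print('CLIENT_SECRET: ' + mask(demo_secret))
```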


Now, let's get up to 100 venues within a 2,000-meter radius of each neighborhood

In [65]:
import requests
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhoods']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
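The loop above builds the URL by string formatting and indexes straight into the response, so a venue without categories or a response without groups raises an exception. A more defensive sketch follows, with the request parameters built as a dictionary and the parsing isolated in its own function. Both function names are illustrative, and the sample payload only mirrors the fields the notebook already reads; it is not a real API response.

```python
def build_explore_params(client_id, client_secret, version, lat, lng, radius, limit):
    """Query parameters for the explore endpoint, as a dict."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }


def parse_explore_items(payload, neighborhood, lat, lng):
    """Turn one decoded JSON response into venue rows, tolerating missing parts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Hand-written payload with the same shape the notebook reads (not real data)
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 19.1, "lng": 72.8},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 19.2, "lng": 72.9},
               "categories": []}},
]}]}}

params = build_explore_params("id", "secret", "20191010", 19.1, 72.8, 2000, 100)
rows = parse_explore_items(fake_payload, "Andheri", 19.118483, 72.841774)
print(params["ll"], rows)
```

In the notebook the dictionary would be passed as `requests.get("https://api.foursquare.com/v2/venues/explore", params=params, timeout=10)`, which lets `requests` handle URL encoding and avoids an indefinite wait on a stalled connection.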

In [66]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhoods', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(3217, 7)


Unnamed: 0,Neighborhoods,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Andheri,19.118483,72.841774,Merwans Cake shop,19.1193,72.845418,Bakery
1,Andheri,19.118483,72.841774,Radha Krishna Veg Restaurant,19.11513,72.84306,Indian Restaurant
2,Andheri,19.118483,72.841774,Naturals,19.111204,72.837255,Ice Cream Shop
3,Andheri,19.118483,72.841774,Tewari Bros Sweets,19.115305,72.834501,Indian Restaurant
4,Andheri,19.118483,72.841774,Shawarma Factory,19.124591,72.840398,Falafel Restaurant


Let's check how many venues were returned for each neighborhood


In [67]:
venues_df.groupby(['Neighborhoods']).count()

Neighborhoods,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Andheri,100,100,100,100,100,100
Antop Hill,82,82,82,82,82,82
Bandra,100,100,100,100,100,100
Bhandup,24,24,24,24,24,24
Bhayandar,19,19,19,19,19,19
Borivali,100,100,100,100,100,100
Byculla,47,47,47,47,47,47
Chembur,99,99,99,99,99,99
Colaba,100,100,100,100,100,100
Dadar,100,100,100,100,100,100


Let's find out how many unique categories can be curated from all the returned venues

In [68]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 204 unique categories.


In [69]:
# print out the list of categories
venues_df['VenueCategory'].unique()[:203]

array(['Bakery', 'Indian Restaurant', 'Ice Cream Shop',
       'Falafel Restaurant', 'Coffee Shop', 'Sandwich Place', 'Pub',
       'Pizza Place', 'Juice Bar', 'Fast Food Restaurant', 'Multiplex',
       'Seafood Restaurant', 'Snack Place', 'Breakfast Spot', 'Café',
       'Cocktail Bar', 'American Restaurant', 'Bar', 'BBQ Joint',
       'Gym / Fitness Center', 'Diner', 'Chinese Restaurant',
       'Electronics Store', 'Asian Restaurant', 'Department Store', 'Spa',
       'Lounge', 'Park', 'Liquor Store', 'Vegetarian / Vegan Restaurant',
       "Women's Store", 'Residential Building (Apartment / Condo)',
       'Smoke Shop', 'Food Truck', 'Fish Market', 'Martial Arts Dojo',
       'Tea Room', 'Athletics & Sports', 'Hotel', 'Burger Joint',
       'Clothing Store', 'Train Station', 'Restaurant', 'Soccer Field',
       'Playground', 'Mexican Restaurant', 'Gym', 'Shipping Store',
       'Dessert Shop', 'Sports Club', 'Gourmet Shop', 'Deli / Bodega',
       'Indie Movie Theater', 'Salad Pla

In [70]:
# check if the results contain the "Restaurant" category
"Restaurant" in venues_df['VenueCategory'].unique()

True

# 5. Analyze Each Neighborhood

In [71]:
# one hot encoding
mum_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
mum_onehot['Neighborhoods'] = venues_df['Neighborhoods'] 

# move neighborhood column to the first column
fixed_columns = [mum_onehot.columns[-1]] + list(mum_onehot.columns[:-1])
mum_onehot = mum_onehot[fixed_columns]

print(mum_onehot.shape)
mum_onehot.head()

(3217, 205)


Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Track,Trail,Train Station,Vegetarian / Vegan Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Andheri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group the rows by neighborhood and take the mean of each one-hot column, which gives the frequency of each category within a neighborhood

In [72]:
mum_grouped = mum_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(mum_grouped.shape)
mum_grouped

(41, 205)


Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Track,Trail,Train Station,Vegetarian / Vegan Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Andheri,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.01,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0
1,Antop Hill,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,0.0,...,0.0,0.0,0.02439,0.060976,0.0,0.0,0.0,0.0,0.0,0.0
2,Bandra,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.03,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bhandup,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.041667,0.0,...,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bhayandar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Borivali,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,...,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0
6,Byculla,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042553,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021277
7,Chembur,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020202,0.0,...,0.0,0.0,0.010101,0.020202,0.0,0.0,0.0,0.0,0.0,0.0
8,Colaba,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.03,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Dadar,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,...,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0
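To make the meaning of these values concrete: averaging a one-hot column within a neighborhood gives the share of that neighborhood's venues that fall in the category. The toy example below (made-up rows, not the notebook's data) checks this and shows that `pd.crosstab` with `normalize="index"` yields the same table in one step.

```python
import pandas as pd

# Made-up venues: neighborhood A has 4 venues, B has 2
toy = pd.DataFrame({
    "Neighborhoods": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Restaurant", "Bakery", "Bakery", "Coffee Shop",
                      "Restaurant", "Park"],
})

# Same steps as the notebook: one-hot encode, then average per neighborhood
onehot = pd.get_dummies(toy[["VenueCategory"]], prefix="", prefix_sep="", dtype=float)
onehot["Neighborhoods"] = toy["Neighborhoods"]
by_mean = onehot.groupby("Neighborhoods").mean()

# Equivalent: row-normalized contingency table
by_crosstab = pd.crosstab(toy["Neighborhoods"], toy["VenueCategory"],
                          normalize="index")

print(by_mean)
print(by_crosstab)
```

Neighborhood A has one "Restaurant" venue out of four, so its frequency is 0.25; neighborhood B has one out of two, so 0.5.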


In [73]:
len(mum_grouped[mum_grouped["Restaurant"] > 0])

33

Create a new DataFrame for Restaurant data only

In [74]:
mum_mall = mum_grouped[["Neighborhoods","Restaurant"]]

In [75]:
mum_mall.head()

Unnamed: 0,Neighborhoods,Restaurant
0,Andheri,0.0
1,Antop Hill,0.0
2,Bandra,0.02
3,Bhandup,0.083333
4,Bhayandar,0.105263


# 6. Cluster Neighborhoods

In [76]:
# set number of clusters
kclusters = 3

mum_clustering = mum_mall.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mum_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 1, 1, 2, 0, 2, 0, 0])
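The notebook fixes `kclusters = 3` without showing how that value was chosen. One common check, sketched here as an assumption rather than the author's method, is to compare the k-means inertia (within-cluster sum of squares) for several values of k and look for the point where it stops dropping sharply. The frequencies below are illustrative values in the same range as the `Restaurant` column, not the full dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative restaurant frequencies, one feature per neighborhood
freq = np.array([0.0, 0.0, 0.02, 0.083333, 0.105263,
                 0.06, 0.030303, 0.05, 0.07, 0.01]).reshape(-1, 1)

inertias = {}
for k in range(1, 6):
    # n_init is set explicitly so the behavior does not depend on the
    # scikit-learn version's default
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(freq)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 6))
```

With a single feature, the clusters are simply intervals of the restaurant frequency, so the choice of k decides how finely "low", "moderate" and "high" are split.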

In [77]:
# create a new dataframe that holds the restaurant frequency and the cluster label for each neighborhood
mum_merged = mum_mall.copy()

# add clustering labels
mum_merged["Cluster Labels"] = kmeans.labels_

In [78]:
mum_merged.head()

Unnamed: 0,Neighborhoods,Restaurant,Cluster Labels
0,Andheri,0.0,0
1,Antop Hill,0.0,0
2,Bandra,0.02,0
3,Bhandup,0.083333,1
4,Bhayandar,0.105263,1


In [79]:
# join mum_merged with df to add latitude/longitude for each neighborhood
mum_merged = mum_merged.join(df.set_index("Neighborhoods"), on="Neighborhoods")

print(mum_merged.shape)
mum_merged.head()

(41, 5)


Unnamed: 0,Neighborhoods,Restaurant,Cluster Labels,Latitude,Longitude
0,Andheri,0.0,0,19.118483,72.841774
1,Antop Hill,0.0,0,19.02635,72.86634
2,Bandra,0.02,0,19.05422,72.84019
3,Bhandup,0.083333,1,19.14556,72.94856
4,Bhayandar,0.105263,1,19.30746,72.8517


In [80]:
# sort the results by Cluster Labels
print(mum_merged.shape)
mum_merged.sort_values(["Cluster Labels"], inplace=True)
mum_merged.head()

(41, 5)


Unnamed: 0,Neighborhoods,Restaurant,Cluster Labels,Latitude,Longitude
0,Andheri,0.0,0,19.118483,72.841774
37,Vidyavihar,0.01,0,19.023261,72.8439
36,Vasai,0.0,0,19.07934,72.83916
35,Trombay,0.0,0,19.019,72.89799
33,Santacruz,0.02,0,19.08177,72.84205


Finally, let's visualize the resulting clusters

In [81]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mum_merged['Latitude'], mum_merged['Longitude'], mum_merged['Neighborhoods'], mum_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [82]:
# save the map as HTML file
map_clusters.save('mum_map_clusters.html')

In [83]:
mum_merged.loc[mum_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhoods,Restaurant,Cluster Labels,Latitude,Longitude
0,Andheri,0.0,0,19.118483,72.841774
37,Vidyavihar,0.01,0,19.023261,72.8439
36,Vasai,0.0,0,19.07934,72.83916
35,Trombay,0.0,0,19.019,72.89799
33,Santacruz,0.02,0,19.08177,72.84205
29,Other,0.016129,0,19.1716,72.95752
28,Mulund,0.012821,0,19.17185,72.95564
27,Matunga,0.01,0,19.02718,72.8559
26,Mankhurd,0.0,0,19.04853,72.9322
25,Malad,0.010753,0,19.18655,72.84836


In [84]:
mum_merged.loc[mum_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhoods,Restaurant,Cluster Labels,Latitude,Longitude
13,Girgaon,0.07,1,18.95696,72.81945
3,Bhandup,0.083333,1,19.14556,72.94856
31,Powai,0.072289,1,19.12311,72.90944
4,Bhayandar,0.105263,1,19.30746,72.8517


In [85]:
mum_merged.loc[mum_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhoods,Restaurant,Cluster Labels,Latitude,Longitude
10,Dahisar,0.05,2,19.25003,72.85908
38,Vikhroli,0.04,2,19.11109,72.92781
19,Kamathipura,0.06,2,18.96172,72.82627
34,Tardeo,0.05,2,18.97243,72.81483
32,References,0.045455,2,19.14435,72.93769
5,Borivali,0.06,2,19.22936,72.85751
7,Chembur,0.030303,2,19.0622,72.90242
21,Kandivali west,0.040541,2,19.20711,72.83492
12,Ghatkopar,0.0375,2,19.086477,72.908956
30,Parel,0.04,2,18.99566,72.83907


# Observation

The clusters separate the neighborhoods by the share of their venues that fall in the generic "Restaurant" category. Cluster 1 has the highest share (about 0.07 to 0.11), cluster 2 a moderate share (about 0.03 to 0.06 in the rows shown), and cluster 0 a very low share (0.02 or less in the rows shown). The high-share neighborhoods in cluster 1 (Girgaon, Bhandup, Powai and Bhayandar) lie mostly in the northern and north-eastern parts of the city. Restaurants there are likely to face intense competition because of the high concentration of existing restaurants. Most other neighborhoods, including several in South Mumbai such as Colaba and Dadar, fall in cluster 0 and have very few venues in this category, so a new restaurant there would face little to no competition from existing restaurants.

This project therefore recommends that property developers open new restaurants in cluster 0 neighborhoods, where competition is lowest. Developers with a unique selling proposition that stands out from the competition can also consider cluster 2 neighborhoods, where competition is moderate. Developers are advised to avoid cluster 1 neighborhoods, which already have a high concentration of restaurants and intense competition.
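Because k-means assigns label numbers arbitrarily, the ranking of clusters as low, moderate and high is worth confirming from the data rather than from the label. A minimal sketch follows, using six rows copied from the cluster tables above; on the full `mum_merged` dataframe the same `groupby` call would apply.

```python
import pandas as pd

# Six rows copied from the cluster outputs above (two per cluster)
sample = pd.DataFrame({
    "Neighborhoods": ["Andheri", "Santacruz", "Bhandup",
                      "Bhayandar", "Dahisar", "Borivali"],
    "Restaurant": [0.0, 0.02, 0.083333, 0.105263, 0.05, 0.06],
    "Cluster Labels": [0, 0, 1, 1, 2, 2],
})

# Size and restaurant-frequency range of each cluster, lowest mean first
summary = (sample.groupby("Cluster Labels")["Restaurant"]
           .agg(["count", "mean", "min", "max"])
           .sort_values("mean"))
print(summary)
```

Sorting by the mean puts the clusters in order of increasing restaurant share, which for these rows is cluster 0, then cluster 2, then cluster 1.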