# IBM Capstone Project

## Week 5 Project Notebook

### Objectives:
+ Goal: choose a location for opening a new multiplex in Mumbai, India
+ Build a dataframe of Mumbai neighborhoods and their geographical coordinates by scraping the data from a Wikipedia page
+ Obtain venue data for the neighborhoods from the Foursquare API
+ Explore and cluster the neighborhoods
+ Select the best cluster in which to open a multiplex

### Import dependencies

In [1]:
import numpy as np
import pandas as pd
import json
#!pip install folium
import folium

from geopy.geocoders import Nominatim

import matplotlib.colors as colors
import matplotlib.cm as cm
from sklearn.cluster import KMeans
import requests

### Getting the data from web

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai#Mumbai_neighbourhood_coordintes'
list_df = pd.read_html(url, header = 0)
df = list_df[0]
df.head()

Unnamed: 0,Area,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.1293,72.8434
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.82721
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.82927


### Printing the shape and head of the dataframe

In [3]:
print(df.shape)
df.head()

(93, 4)


Unnamed: 0,Area,Location,Latitude,Longitude
0,Amboli,"Andheri,Western Suburbs",19.1293,72.8434
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.82721
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.82927


### Getting the geographical coordinates of Mumbai

In [4]:
address = 'Mumbai'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Mumbai are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Mumbai are 18.9387711, 72.8353355.


### Plotting the areas on a map

In [5]:
# Use a name other than `map` so the Python builtin is not shadowed
mumbai_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add one marker per area to the map
for lat, lng, area, location in zip(df['Latitude'], df['Longitude'], df['Area'], df['Location']):
    label = '{}, {}'.format(area, location)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='darkred',
        fill=True,
        fill_color='orange',
        fill_opacity=0.3).add_to(mumbai_map)

mumbai_map

### Getting the data using the Foursquare API

In [6]:
## Credentials (redacted here; supply your own Foursquare keys and never publish them)
CLIENT_ID = '<YOUR_CLIENT_ID>' # your Foursquare ID
CLIENT_SECRET = '<YOUR_CLIENT_SECRET>' # your Foursquare secret
VERSION = '20201008'
LIMIT = 500
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

radius = 10000

Your credentials:
CLIENT_ID: <YOUR_CLIENT_ID>
CLIENT_SECRET: <YOUR_CLIENT_SECRET>


In [7]:
#Getting the data in a list
venues = []

for lat, long, area, location in zip(df['Latitude'], df['Longitude'], df['Area'], df['Location']): 
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append(( 
            area,
            location,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
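
The request loop above assumes every response contains `response -> groups[0] -> items` and that every venue has at least one category. The following is a minimal sketch, not the notebook's actual code, of how the URL could be built with percent-encoding and how a decoded response could be parsed more defensively. The helper names `build_explore_url` and `extract_venues` and the sample payload are hypothetical; only the endpoint and the field names come from the cell above.

```python
from urllib.parse import urlencode

# Endpoint taken from the notebook cell above
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the explore URL with every query parameter percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from a decoded explore response.

    Yields an empty list when the expected keys are missing, and uses None
    as the category when a venue carries no categories.
    """
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hypothetical payload, hand-written in the shape the notebook code expects
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Merwans Cake shop",
                           "location": {"lat": 19.1193, "lng": 72.845418},
                           "categories": [{"name": "Bakery"}]}},
                {"venue": {"name": "Uncategorised venue (made up)",
                           "location": {"lat": 19.12, "lng": 72.84},
                           "categories": []}},
            ]
        }]
    }
}

url = build_explore_url("ID", "SECRET", "20201008", 19.1293, 72.8434, 10000, 500)
print(url)
print(extract_venues(sample_payload))
print(extract_venues({"meta": {"code": 429}}))  # no "response" key: empty list
```

In the loop itself, `extract_venues(requests.get(url).json())` would take the place of the chained indexing, so one malformed response does not abort the whole run.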

In [8]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = [ 'Area', 'Location', 'Area Latitude', 'Area Longitude', 'Venue Name', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

### Printing the shape and head of the dataframe

In [9]:
print(venues_df.shape)
venues_df.head()

(8917, 8)


Unnamed: 0,Area,Location,Area Latitude,Area Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
0,Amboli,"Andheri,Western Suburbs",19.1293,72.8434,Merwans Cake shop,19.1193,72.845418,Bakery
1,Amboli,"Andheri,Western Suburbs",19.1293,72.8434,Hard Rock Cafe Andheri,19.135995,72.835335,American Restaurant
2,Amboli,"Andheri,Western Suburbs",19.1293,72.8434,Joey's Pizza,19.126762,72.830001,Pizza Place
3,Amboli,"Andheri,Western Suburbs",19.1293,72.8434,The Little Door,19.139265,72.83318,Pub
4,Amboli,"Andheri,Western Suburbs",19.1293,72.8434,Indigo Delicatessen,19.13645,72.827565,Mediterranean Restaurant


### Counting the number of venues returned by the Foursquare API

In [10]:
venues_df.groupby(["Area", "Location"]).count()

Area,Location,Area Latitude,Area Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
Aarey Milk Colony,"Goregaon,Western Suburbs",100,100,100,100,100,100
Agripada,South Mumbai,100,100,100,100,100,100
Altamount Road,South Mumbai,100,100,100,100,100,100
Amboli,"Andheri,Western Suburbs",100,100,100,100,100,100
Amrut Nagar,"Ghatkopar,Eastern Suburbs",100,100,100,100,100,100
Asalfa,"Ghatkopar,Eastern Suburbs",100,100,100,100,100,100
Ballard Estate,"Fort,South Mumbai",100,100,100,100,100,100
Bandstand Promenade,"Bandra,Western Suburbs",100,100,100,100,100,100
Bangur Nagar,"Goregaon,Western Suburbs",100,100,100,100,100,100
Bhandup,Eastern Suburbs,100,100,100,100,100,100


In [11]:
print('There are {} unique categories.'.format(len(venues_df['Venue Category'].unique())))

There are 136 unique categories.


In [12]:
venues_df.drop_duplicates(inplace = True)

In [13]:
venues_df.shape

(8917, 8)

### Getting the unique venue categories

In [14]:
venues_df['Venue Category'].unique()

array(['Bakery', 'American Restaurant', 'Pizza Place', 'Pub',
       'Mediterranean Restaurant', 'Brewery', 'Mughlai Restaurant',
       'Multiplex', 'Café', 'Chinese Restaurant', 'Sandwich Place',
       'Ice Cream Shop', 'Theater', 'Indian Restaurant', 'Coffee Shop',
       'Seafood Restaurant', 'Juice Bar', 'Spa', 'Cupcake Shop', 'Bar',
       'Hotel', 'South Indian Restaurant', 'Comfort Food Restaurant',
       'Club House', 'Dessert Shop', 'Bengali Restaurant', 'Donut Shop',
       'Italian Restaurant', 'Diner', 'Restaurant', 'Track', 'Tea Room',
       'Clothing Store', 'Toy / Game Store', 'Shopping Mall',
       'Deli / Bodega', 'Movie Theater', 'Salad Place', 'Snack Place',
       'Garden', 'Fast Food Restaurant', 'Burger Joint', 'Scenic Lookout',
       'Park', 'North Indian Restaurant', 'General Entertainment',
       'Lounge', 'Gym Pool', 'Food Truck', 'Plaza', 'Theme Park', 'Gym',
       'Beach', 'Farmers Market', 'Gym / Fitness Center', 'Food Court',
       'Sushi Restaura

### One-hot encoding

In [15]:
# one-hot encode the venue categories
venues_onehot = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")

# add the Area and Location columns back to the dataframe
venues_onehot['Area'] = venues_df['Area'] 
venues_onehot['Location'] = venues_df['Location'] 

# move the Area and Location columns to the front
fixed_columns = list(venues_onehot.columns[-2:]) + list(venues_onehot.columns[:-2])
venues_onehot = venues_onehot[fixed_columns]

print(venues_onehot.shape)
venues_onehot.head()

(8917, 138)


Unnamed: 0,Area,Location,Afghan Restaurant,Airport Service,American Restaurant,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,...,Thai Restaurant,Theater,Theme Park,Toy / Game Store,Track,Train Station,Vegetarian / Vegan Restaurant,Water Park,Wine Shop,Women's Store
0,Amboli,"Andheri,Western Suburbs",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Amboli,"Andheri,Western Suburbs",0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Amboli,"Andheri,Western Suburbs",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Amboli,"Andheri,Western Suburbs",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Amboli,"Andheri,Western Suburbs",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
venues 

[('Amboli',
  'Andheri,Western Suburbs',
  19.1293,
  72.8434,
  'Merwans Cake shop',
  19.119300215885474,
  72.84541776016009,
  'Bakery'),
 ('Amboli',
  'Andheri,Western Suburbs',
  19.1293,
  72.8434,
  'Hard Rock Cafe Andheri',
  19.13599450781993,
  72.8353350383582,
  'American Restaurant'),
 ('Amboli',
  'Andheri,Western Suburbs',
  19.1293,
  72.8434,
  "Joey's Pizza",
  19.126762155150107,
  72.83000121236746,
  'Pizza Place'),
 ('Amboli',
  'Andheri,Western Suburbs',
  19.1293,
  72.8434,
  'The Little Door',
  19.139264910758484,
  72.83317996821813,
  'Pub'),
 ('Amboli',
  'Andheri,Western Suburbs',
  19.1293,
  72.8434,
  'Indigo Delicatessen',
  19.1364504468811,
  72.82756504045669,
  'Mediterranean Restaurant'),
 ('Amboli',
  'Andheri,Western Suburbs',
  19.1293,
  72.8434,
  'Doolally Taproom',
  19.13591735538127,
  72.83309403406167,
  'Brewery'),
 ('Amboli',
  'Andheri,Western Suburbs',
  19.1293,
  72.8434,
  "Jaffer Bhai's Delhi Darbar",
  19.137714056593047,
  7

### Grouping the data with respect to area and location

In [17]:
venues_grouped = venues_onehot.groupby(["Area","Location"]).mean().reset_index()

print(venues_grouped.shape)
venues_grouped

(93, 138)


Unnamed: 0,Area,Location,Afghan Restaurant,Airport Service,American Restaurant,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,...,Thai Restaurant,Theater,Theme Park,Toy / Game Store,Track,Train Station,Vegetarian / Vegan Restaurant,Water Park,Wine Shop,Women's Store
0,Aarey Milk Colony,"Goregaon,Western Suburbs",0.00,0.00,0.01,0.00,0.000000,0.00,0.000000,0.00,...,0.000000,0.01,0.010000,0.01,0.01,0.000000,0.000000,0.000000,0.000000,0.0
1,Agripada,South Mumbai,0.00,0.00,0.00,0.01,0.010000,0.01,0.000000,0.00,...,0.010000,0.01,0.000000,0.01,0.00,0.000000,0.010000,0.000000,0.000000,0.0
2,Altamount Road,South Mumbai,0.00,0.00,0.00,0.01,0.010000,0.01,0.000000,0.00,...,0.010000,0.01,0.000000,0.01,0.00,0.000000,0.000000,0.000000,0.000000,0.0
3,Amboli,"Andheri,Western Suburbs",0.00,0.00,0.01,0.00,0.000000,0.00,0.000000,0.00,...,0.000000,0.01,0.000000,0.01,0.01,0.000000,0.000000,0.000000,0.000000,0.0
4,Amrut Nagar,"Ghatkopar,Eastern Suburbs",0.01,0.00,0.01,0.00,0.010000,0.00,0.000000,0.00,...,0.000000,0.01,0.010000,0.01,0.00,0.000000,0.010000,0.000000,0.000000,0.0
5,Asalfa,"Ghatkopar,Eastern Suburbs",0.01,0.00,0.01,0.00,0.010000,0.00,0.000000,0.00,...,0.000000,0.01,0.010000,0.01,0.00,0.000000,0.020000,0.000000,0.000000,0.0
6,Ballard Estate,"Fort,South Mumbai",0.00,0.00,0.00,0.01,0.010000,0.01,0.000000,0.00,...,0.010000,0.01,0.000000,0.01,0.00,0.000000,0.010000,0.000000,0.000000,0.0
7,Bandstand Promenade,"Bandra,Western Suburbs",0.00,0.00,0.00,0.00,0.020000,0.00,0.000000,0.00,...,0.000000,0.01,0.000000,0.01,0.00,0.000000,0.030000,0.000000,0.000000,0.0
8,Bangur Nagar,"Goregaon,Western Suburbs",0.00,0.00,0.01,0.00,0.000000,0.00,0.000000,0.00,...,0.000000,0.01,0.010000,0.00,0.01,0.000000,0.000000,0.020000,0.000000,0.0
9,Bhandup,Eastern Suburbs,0.01,0.01,0.01,0.00,0.000000,0.00,0.000000,0.01,...,0.000000,0.00,0.010000,0.01,0.00,0.000000,0.010000,0.000000,0.000000,0.0
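
Taking the mean of one-hot columns per area turns each category column into the share of that area's venues belonging to the category. The sketch below reproduces the idea in plain Python on a made-up list of 100 venue categories; `category_shares` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def category_shares(categories):
    """Return {category: fraction of venues in that category}.

    This equals the mean of each one-hot column over the same venues.
    """
    counts = Counter(categories)
    total = len(categories)
    return {category: count / total for category, count in counts.items()}


# Made-up area: 100 venues, 3 of which are multiplexes
sample_categories = ["Multiplex"] * 3 + ["Theater"] * 1 + ["Indian Restaurant"] * 96
shares = category_shares(sample_categories)

print(shares["Multiplex"])  # share of venues that are multiplexes
print(shares["Theater"])
```

Read this way, a `Multiplex` value of 0.03 for an area with 100 venues means 3 of those venues are multiplexes. A value such as 0.04918 would be consistent with 3 of 61 venues, although the notebook does not show the raw counts.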


In [18]:
len(venues_grouped[venues_grouped["Theater"] > 0])

66

In [19]:
len(venues_grouped[venues_grouped["Multiplex"] > 0])

84

In [20]:
len(venues_grouped[venues_grouped["Indie Movie Theater"] > 0])

10

In [21]:
len(venues_grouped[venues_grouped["Movie Theater"] > 0])

39

### Obtaining the useful data

In [22]:
venues_theater = venues_grouped[['Area','Location','Theater','Multiplex','Indie Movie Theater','Movie Theater']]

In [23]:
venues_theater.drop_duplicates()

Unnamed: 0,Area,Location,Theater,Multiplex,Indie Movie Theater,Movie Theater
0,Aarey Milk Colony,"Goregaon,Western Suburbs",0.01,0.030000,0.000000,0.01
1,Agripada,South Mumbai,0.01,0.010000,0.000000,0.00
2,Altamount Road,South Mumbai,0.01,0.010000,0.000000,0.00
3,Amboli,"Andheri,Western Suburbs",0.01,0.030000,0.000000,0.01
4,Amrut Nagar,"Ghatkopar,Eastern Suburbs",0.01,0.030000,0.000000,0.01
5,Asalfa,"Ghatkopar,Eastern Suburbs",0.01,0.020000,0.000000,0.01
6,Ballard Estate,"Fort,South Mumbai",0.01,0.010000,0.000000,0.00
7,Bandstand Promenade,"Bandra,Western Suburbs",0.01,0.010000,0.000000,0.01
8,Bangur Nagar,"Goregaon,Western Suburbs",0.01,0.030000,0.000000,0.00
9,Bhandup,Eastern Suburbs,0.00,0.020000,0.000000,0.01


In [24]:
venues_theater.head()

Unnamed: 0,Area,Location,Theater,Multiplex,Indie Movie Theater,Movie Theater
0,Aarey Milk Colony,"Goregaon,Western Suburbs",0.01,0.03,0.0,0.01
1,Agripada,South Mumbai,0.01,0.01,0.0,0.0
2,Altamount Road,South Mumbai,0.01,0.01,0.0,0.0
3,Amboli,"Andheri,Western Suburbs",0.01,0.03,0.0,0.01
4,Amrut Nagar,"Ghatkopar,Eastern Suburbs",0.01,0.03,0.0,0.01


### Clustering the neighbourhoods

In [25]:
kclusters = 3

# keep only the numeric feature columns (axis must be passed by keyword in recent pandas)
venues_clustering = venues_theater.drop(["Area", "Location"], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venues_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 1, 1, 0, 0, 0, 1, 1, 0, 0], dtype=int32)
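
To build intuition for what k-means does here, the following is a simplified one-dimensional sketch of Lloyd's algorithm applied only to the `Multiplex` column, using a handful of values copied from the tables in this notebook. It is not the scikit-learn implementation: `KMeans` above clusters on all four theatre-related columns and picks its own starting centroids, whereas the function `kmeans_1d` and the hand-picked starting centroids below are assumptions chosen so that the label numbering lines up with the notebook's output.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm on scalars: assign to nearest centroid, then recompute means."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each value
        new_labels = [
            min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for i in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == i]
            if members:
                centroids[i] = sum(members) / len(members)
    return labels, centroids


# Multiplex shares copied from the notebook tables:
# Aarey Milk Colony, Agripada, Altamount Road, Amboli, Amrut Nagar,
# Ballard Estate, Bandstand Promenade, Bangur Nagar, Naigaon, Nalasopara, Virar
multiplex = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.01, 0.03,
             0.04918, 0.058824, 0.051282]

# Hand-picked starting centroids (an assumption, not what KMeans uses)
labels, centroids = kmeans_1d(multiplex, centroids=[0.03, 0.01, 0.05])

print(labels)
print([round(c, 4) for c in centroids])
```

Even on this single column the values separate into a moderate, a sparse, and a dense group, which matches the labels the notebook assigns to these areas.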

In [26]:
venues_merged = venues_theater.copy()
venues_merged["Cluster Label"]=kmeans.labels_

### Assigning the neighbourhoods their clusters

In [27]:

venues_merged = venues_merged.join(df.drop('Location', axis = 1).set_index("Area"), on="Area")

print(venues_merged.shape)
venues_merged.head() # check the last columns!

(93, 9)


Unnamed: 0,Area,Location,Theater,Multiplex,Indie Movie Theater,Movie Theater,Cluster Label,Latitude,Longitude
0,Aarey Milk Colony,"Goregaon,Western Suburbs",0.01,0.03,0.0,0.01,0,19.148493,72.881756
1,Agripada,South Mumbai,0.01,0.01,0.0,0.0,1,18.9777,72.8273
2,Altamount Road,South Mumbai,0.01,0.01,0.0,0.0,1,18.9681,72.8095
3,Amboli,"Andheri,Western Suburbs",0.01,0.03,0.0,0.01,0,19.1293,72.8434
4,Amrut Nagar,"Ghatkopar,Eastern Suburbs",0.01,0.03,0.0,0.01,0,19.102077,72.912835


### Mapping the clusters

In [35]:
# create map
map_clusters = folium.Map(location=[latitude+.05, longitude+.05], zoom_start=10)

# set color scheme for the clusters
#x = np.arange(kclusters)
#ys = [i+x+(i*x)**2 for i in range(kclusters)]
#colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
#rainbow = [colors.rgb2hex(i) for i in colors_array]
rainbow = ['red', 'blue','brown']

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(venues_merged['Latitude'], venues_merged['Longitude'], venues_merged['Area'], venues_merged['Cluster Label']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # index the palette directly by label; cluster-1 relied on rainbow[-1] wrapping around for cluster 0
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Analysing the clusters

In [29]:
venues_merged.loc[venues_merged['Cluster Label'] == 0]

Unnamed: 0,Area,Location,Theater,Multiplex,Indie Movie Theater,Movie Theater,Cluster Label,Latitude,Longitude
0,Aarey Milk Colony,"Goregaon,Western Suburbs",0.01,0.03,0.0,0.01,0,19.148493,72.881756
3,Amboli,"Andheri,Western Suburbs",0.01,0.03,0.0,0.01,0,19.1293,72.8434
4,Amrut Nagar,"Ghatkopar,Eastern Suburbs",0.01,0.03,0.0,0.01,0,19.102077,72.912835
5,Asalfa,"Ghatkopar,Eastern Suburbs",0.01,0.02,0.0,0.01,0,19.091,72.901
8,Bangur Nagar,"Goregaon,Western Suburbs",0.01,0.03,0.0,0.0,0,19.167362,72.832252
9,Bhandup,Eastern Suburbs,0.0,0.02,0.0,0.01,0,19.14,72.93
16,"Chakala, Andheri",Western Suburbs,0.01,0.03,0.0,0.01,0,19.111388,72.860833
17,Chandivali,"Powai,Eastern Suburbs",0.01,0.04,0.0,0.01,0,19.11,72.9
27,D.N. Nagar,"Andheri,Western Suburbs",0.01,0.03,0.0,0.01,0,19.124085,72.831373
34,Dindoshi,"Malad,Western Suburbs",0.01,0.03,0.0,0.0,0,19.176382,72.864891


In [30]:
venues_merged.loc[venues_merged['Cluster Label'] == 1]

Unnamed: 0,Area,Location,Theater,Multiplex,Indie Movie Theater,Movie Theater,Cluster Label,Latitude,Longitude
1,Agripada,South Mumbai,0.01,0.01,0.0,0.0,1,18.9777,72.8273
2,Altamount Road,South Mumbai,0.01,0.01,0.0,0.0,1,18.9681,72.8095
6,Ballard Estate,"Fort,South Mumbai",0.01,0.01,0.0,0.0,1,18.95,72.84
7,Bandstand Promenade,"Bandra,Western Suburbs",0.01,0.01,0.0,0.01,1,19.042718,72.819132
10,Bhayandar,"Mira-Bhayandar,Western Suburbs",0.0,0.01,0.01,0.0,1,19.29,72.85
11,Bhuleshwar,South Mumbai,0.01,0.01,0.0,0.0,1,18.95,72.83
12,Breach Candy,South Mumbai,0.01,0.01,0.0,0.0,1,18.967,72.805
13,C.G.S. colony,"Antop Hill,South Mumbai",0.0,0.01,0.0,0.01,1,19.016378,72.856629
14,Carmichael Road,South Mumbai,0.01,0.01,0.0,0.0,1,18.9722,72.8113
15,Cavel,South Mumbai,0.01,0.01,0.0,0.0,1,18.9474,72.8272


In [31]:
venues_merged.loc[venues_merged['Cluster Label'] == 2]

Unnamed: 0,Area,Location,Theater,Multiplex,Indie Movie Theater,Movie Theater,Cluster Label,Latitude,Longitude
68,Naigaon,"Vasai,Western Suburbs",0.0,0.04918,0.0,0.0,2,19.351467,72.846343
69,Nalasopara,"Vasai,Western Suburbs",0.0,0.058824,0.0,0.0,2,19.4154,72.8613
90,Virar,Western Suburbs,0.0,0.051282,0.0,0.0,2,19.47,72.8


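# Observations

K-means yields three clusters, compared by the share of Foursquare venues within a 10 km radius of each area that are multiplexes. In the rows shown above, the third cluster (label 2) has the highest multiplex share (roughly 0.05 to 0.06), the first cluster (label 0) has a moderate share (roughly 0.02 to 0.04), and the second cluster (label 1) has the lowest share (about 0.01). Because multiplexes are sparsest in the second cluster, it is suggested that investors build the new multiplex in an area belonging to that cluster.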
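
The ranking of clusters can be checked numerically. The sketch below averages the `Multiplex` values per cluster using only the rows displayed in the three cluster tables above (the first and second clusters contain more areas than are shown), so the figures are indicative rather than exact. On the full dataframe the equivalent would be `venues_merged.groupby('Cluster Label')['Multiplex'].mean()`.

```python
from statistics import mean

# Multiplex shares copied from the cluster tables shown above (displayed rows only)
multiplex_by_cluster = {
    0: [0.03, 0.03, 0.03, 0.02, 0.03, 0.02, 0.03, 0.04, 0.03, 0.03],
    1: [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
    2: [0.04918, 0.058824, 0.051282],
}

cluster_means = {label: mean(values) for label, values in multiplex_by_cluster.items()}

# The cluster with the lowest mean share is the candidate for a new multiplex
sparsest = min(cluster_means, key=cluster_means.get)
densest = max(cluster_means, key=cluster_means.get)

for label, value in sorted(cluster_means.items()):
    print("Cluster {}: mean Multiplex share {:.4f}".format(label, value))
print("Sparsest cluster:", sparsest)
```

On these rows the cluster labelled 1 has the lowest mean multiplex share and the cluster labelled 2 the highest, in line with the recommendation above.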