## IBM Data Science Capstone Project
### Week 5: Final Project

#### Starting a new fast food chain in Pune
- Get data from the Wikipedia page on Pune neighborhoods using web scraping.
- Get the coordinates of these neighborhoods using the geocoder package.
- Explore these neighborhoods using the Foursquare API.
- Form clusters using the K-means algorithm.
- Visualize these clusters using folium.
- Conclusion.

In [2]:
pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

### Web Scraping using BeautifulSoup

In [2]:
data1 = requests.get("https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Pune").text

In [3]:
soup = BeautifulSoup(data1, 'html.parser')

### Neighborhoods in Pune City

In [10]:
neighborhoodlist=[]
for row in soup.findAll('ul')[1].findAll('li'):
    if row.text=='Yerwada':
        break
    neighborhoodlist.append(row.text)
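
The loop above collects list items until it meets the sentinel text `'Yerwada'`, and the sentinel itself is not added. A minimal sketch of the same stop-at-sentinel logic, using `itertools.takewhile` on plain strings, is shown below. The helper name `names_before`, the whitespace stripping, and the trailing `"Navigation link"` item are assumptions made for illustration and are not part of the notebook.

```python
from itertools import takewhile


def names_before(items, sentinel):
    """Return the stripped items that appear before the sentinel (sentinel excluded)."""
    stripped = (item.strip() for item in items)
    return list(takewhile(lambda name: name != sentinel, stripped))


# Hypothetical list-item texts; only the first three names and the sentinel
# come from the notebook.
raw_items = ["Ambegaon", "Aundh ", "Baner", "Yerwada", "Navigation link"]
print(names_before(raw_items, "Yerwada"))
```

If the sentinel neighborhood should be part of the result, it would need to be appended separately after the loop.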

In [11]:
pune_df=pd.DataFrame(neighborhoodlist)

In [12]:
pune_df.columns=['Neighborhood']

In [13]:
print(pune_df.shape)
pune_df.head()

(46, 1)


| | Neighborhood |
|---|---|
| 0 | Ambegaon |
| 1 | Aundh |
| 2 | Baner |
| 3 | Bavdhan Khurd |
| 4 | Bavdhan Budruk |


### Latitudes and Longitudes 

In [14]:
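def get_latlng(neighborhood, max_tries=5):
    # Query the ArcGIS geocoder, retrying a bounded number of times so that
    # a neighborhood that cannot be resolved does not loop forever.
    for _ in range(max_tries):
        g = geocoder.arcgis('{}, Pune, Maharashtra'.format(neighborhood))
        if g.latlng is not None:
            return g.latlng
    # No result after max_tries attempts: return placeholders (become NaN in pandas)
    return [None, None]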

In [15]:
coords = [ get_latlng(neighborhood) for neighborhood in pune_df["Neighborhood"].tolist() ]

In [16]:
coords

[[19.00496000000004, 73.94583000000006],
 [18.563450000000046, 73.81227000000007],
 [18.548200000000065, 73.77316000000008],
 [18.511100000000056, 73.77773000000008],
 [18.51827000000003, 73.76557000000008],
 [18.576020000000028, 73.77983000000006],
 [18.537230000000022, 73.83808000000005],
 [18.471870000000024, 73.86336000000006],
 [18.499220000000037, 73.75316000000004],
 [18.495100000000036, 73.72124000000008],
 [18.46628000000004, 73.85326000000003],
 [18.57856000000004, 73.89264000000003],
 [18.447020000000066, 73.80757000000006],
 [18.509650000000022, 73.83124000000004],
 [18.473650000000077, 73.97473000000008],
 [18.522320000000036, 73.89712000000003],
 [18.502530000000036, 73.92706000000004],
 [18.479790000000037, 73.83075000000008],
 [18.49150000000003, 73.82172000000008],
 [18.578450000000032, 73.87489000000005],
 [18.447320000000047, 73.86405000000008],
 [18.561140000000023, 73.85300000000007],
 [18.544620000000066, 73.93922000000003],
 [18.43825000000004, 73.89895000000007]

In [17]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [18]:
pune_df['Latitude']=df_coords['Latitude']
pune_df['Longitude']=df_coords['Longitude']

In [19]:
pune_df.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Ambegaon | 19.00496 | 73.94583 |
| 1 | Aundh | 18.56345 | 73.81227 |
| 2 | Baner | 18.5482 | 73.77316 |
| 3 | Bavdhan Khurd | 18.5111 | 73.77773 |
| 4 | Bavdhan Budruk | 18.51827 | 73.76557 |


### Coordinates of Pune City

In [20]:
address = 'Pune, Maharashtra'

geolocator = Nominatim(user_agent="http")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Pune, Maharashtra are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Pune, Maharashtra are 18.521428, 73.8544541.
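
With the city-centre coordinates available, one possible sanity check is to measure how far each geocoded neighborhood lies from the centre and flag results that look implausibly distant. The sketch below uses the haversine great-circle formula. The 30 km threshold and the names `haversine_km`, `MAX_KM`, and `suspect` are assumptions chosen for illustration; the coordinates are taken from the outputs above.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


center = (18.521428, 73.8544541)  # Pune, as returned by Nominatim above
geocoded = {
    "Ambegaon": (19.00496, 73.94583),
    "Aundh": (18.56345, 73.81227),
    "Baner": (18.5482, 73.77316),
}

MAX_KM = 30.0  # assumed cut-off for "still plausibly inside the city"
suspect = [
    name
    for name, (lat, lon) in geocoded.items()
    if haversine_km(center[0], center[1], lat, lon) > MAX_KM
]
print(suspect)
```

A flagged neighborhood is not necessarily wrong, but it may be worth re-geocoding with a more specific address before its venues are used in the analysis.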


### Map of Pune with locations superimposed

In [21]:
pune_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(pune_df['Latitude'], pune_df['Longitude'], pune_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(pune_map)  

pune_map

### Exploring the neighborhoods using Foursquare API

In [22]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


In [23]:
radius = 500
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(pune_df['Latitude'], pune_df['Longitude'], pune_df['Neighborhood']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append(( 
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
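
The request loop above assumes that every response contains at least one group and that every venue has at least one category. A more defensive variant might look like the sketch below, which builds the query string with `urllib.parse.urlencode` and skips incomplete items. The helper names `build_explore_url` and `extract_venues` and the sample payload are hypothetical; the payload only mimics the fields the notebook reads and is not a real API response.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)


def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one explore response into row tuples, skipping incomplete items."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        if not categories:
            continue  # no category to record for this venue
        rows.append((neighborhood, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0]["name"]))
    return rows


# Hypothetical payload shaped like the fields accessed in the notebook.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Picantos Mexican Grill",
               "location": {"lat": 18.560654, "lng": 73.812447},
               "categories": [{"name": "Mexican Restaurant"}]}},
    {"venue": {"name": "Uncategorized venue",
               "location": {"lat": 18.56, "lng": 73.81},
               "categories": []}},
]}]}}

url = build_explore_url("ID", "SECRET", "20180605", 18.56345, 73.81227)
rows = extract_venues(sample_payload, "Aundh", 18.56345, 73.81227)
print(url)
print(rows)
```

In the notebook, the payload would come from `requests.get(url).json()` rather than from a literal dictionary.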

In [24]:
venues_df=pd.DataFrame(venues)
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']
venues_df.head()

| | Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|
| 0 | Ambegaon | 19.00496 | 73.94583 | Manchar theatre | 19.002986 | 73.943859 | Indie Movie Theater |
| 1 | Ambegaon | 19.00496 | 73.94583 | My Idea Store | 19.007062 | 73.949491 | Mobile Phone Shop |
| 2 | Ambegaon | 19.00496 | 73.94583 | Axis Bank ATM | 19.00098 | 73.944656 | ATM |
| 3 | Aundh | 18.56345 | 73.81227 | Picantos Mexican Grill | 18.560654 | 73.812447 | Mexican Restaurant |
| 4 | Aundh | 18.56345 | 73.81227 | Baker's Basket | 18.560704 | 73.8131 | Restaurant |


In [55]:
venues_df.shape

(233, 7)

#### Number of venues per neighborhood

In [25]:
venues_df.groupby(["Neighborhood"]).count()

| Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|
| Ambegaon | 3 | 3 | 3 | 3 | 3 | 3 |
| Aundh | 17 | 17 | 17 | 17 | 17 | 17 |
| Balewadi | 3 | 3 | 3 | 3 | 3 | 3 |
| Baner | 4 | 4 | 4 | 4 | 4 | 4 |
| Bavdhan Budruk | 3 | 3 | 3 | 3 | 3 | 3 |
| Bavdhan Khurd | 6 | 6 | 6 | 6 | 6 | 6 |
| Bhamburde (now called Shivajinagar) | 16 | 16 | 16 | 16 | 16 | 16 |
| Bhugaon | 3 | 3 | 3 | 3 | 3 | 3 |
| Bhukum | 1 | 1 | 1 | 1 | 1 | 1 |
| Bibvewadi | 6 | 6 | 6 | 6 | 6 | 6 |


### Unique venue categories

In [26]:
venues_df["VenueCategory"].unique()

array(['Indie Movie Theater', 'Mobile Phone Shop', 'ATM',
       'Mexican Restaurant', 'Restaurant', 'Korean Restaurant',
       'Fast Food Restaurant', 'Indian Restaurant', 'Ice Cream Shop',
       'Sporting Goods Shop', 'Plaza', 'Snack Place', 'Clothing Store',
       'Bus Station', 'Beer Garden', 'Italian Restaurant', 'Mountain',
       'Café', 'Seafood Restaurant', 'Shop & Service', 'Pool',
       'Golf Course', 'Coffee Shop', 'Breakfast Spot', 'Lounge',
       'Asian Restaurant', 'Chinese Restaurant', 'Multiplex', 'Bookstore',
       'Food Court', 'Pharmacy', 'Pizza Place', 'Hotel', 'Diner',
       'Sandwich Place', "Men's Store", 'Tea Room', 'Cosmetics Shop',
       'Farm', 'Cheese Shop', 'Bakery', 'Smoke Shop',
       'Vegetarian / Vegan Restaurant', 'Gastropub',
       'South Indian Restaurant', 'Chocolate Shop', 'Convenience Store',
       'Furniture / Home Store', 'Gym', 'Juice Bar', 'Food Truck',
       'Video Store', 'Shopping Mall', 'Concert Hall', 'Garden',
       'Rental

In [27]:
print('{} unique venue categories'.format(len(venues_df["VenueCategory"].unique())))

84 unique venue categories


In [28]:
venues_df.VenueCategory.value_counts()

Indian Restaurant                36
Snack Place                      14
ATM                              11
Café                             11
Fast Food Restaurant              9
Restaurant                        9
Pizza Place                       8
Breakfast Spot                    8
Bakery                            7
Seafood Restaurant                6
Chinese Restaurant                6
Ice Cream Shop                    5
Tea Room                          4
Gym                               4
Bus Station                       4
Multiplex                         4
Vegetarian / Vegan Restaurant     3
Hotel                             3
Mobile Phone Shop                 3
Italian Restaurant                3
Lounge                            3
Coffee Shop                       3
Clothing Store                    2
Food Truck                        2
Men's Store                       2
Asian Restaurant                  2
Sandwich Place                    2
Convenience Store           

In [29]:
# Flag venues that are direct competitors (snack places or fast food restaurants)
SnackFF = venues_df['VenueCategory'].isin(['Snack Place', 'Fast Food Restaurant']).astype(int)

In [30]:
venues_df['SnackFF']=SnackFF

In [31]:
venues_df['SnackFF'].value_counts()

0    215
1     23
Name: SnackFF, dtype: int64

In [32]:
pune_ff=venues_df.groupby(["Neighborhood","SnackFF"])
pune_snackff=venues_df.drop(["VenueCategory"],axis=1)
pune_snackff=pune_snackff[pune_snackff["SnackFF"]==1]

In [33]:
pune_ns_df=pune_snackff[["Neighborhood","SnackFF"]]
pune_ns_df=pune_ns_df.groupby(["Neighborhood"]).sum()
pune_ns_df

| Neighborhood | SnackFF |
|---|---|
| Aundh | 3 |
| Balewadi | 1 |
| Bavdhan Khurd | 1 |
| Bibvewadi | 1 |
| Dhankawadi | 1 |
| Dhanori | 1 |
| Erandwane | 3 |
| Hadapsar | 3 |
| Kalas | 1 |
| Karve Nagar | 2 |
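
Because the rows are filtered to `SnackFF == 1` before grouping, neighborhoods with no snack place or fast food restaurant do not appear in this table at all, even though they may be the areas with the least competition. A possible alternative, sketched below on a small made-up frame, sums the flag per neighborhood without filtering first so that zero counts are kept. Only the column names come from the notebook; the rows and the name `snackff_counts` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of venues_df; only the column names match the notebook.
venues_df = pd.DataFrame({
    "Neighborhood": ["Aundh", "Aundh", "Baner", "Baner", "Balewadi"],
    "VenueCategory": ["Snack Place", "Fast Food Restaurant", "ATM", "Café", "Snack Place"],
})

targets = ["Snack Place", "Fast Food Restaurant"]
venues_df["SnackFF"] = venues_df["VenueCategory"].isin(targets).astype(int)

# Summing the flag per neighborhood keeps areas whose count is zero.
snackff_counts = venues_df.groupby("Neighborhood")["SnackFF"].sum()
print(snackff_counts.to_dict())
```

Neighborhoods for which the API returned no venues at all would still be missing and would need to be added from `pune_df` separately.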


### Forming clusters using K-Means 

In [34]:
clusters = 3

#pune_clusters = pune_ns_df.drop(["Neighborhoods"], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=clusters, random_state=42).fit(pune_ns_df)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 0, 0, 0, 0, 0, 1, 1, 0, 2])
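
Since the clustering here runs on a single feature (the `SnackFF` count), it may help to see what k-means does in one dimension. The sketch below is a plain-Python illustration of Lloyd's algorithm, not scikit-learn's implementation: it starts from evenly spaced centroids, which is an assumption, whereas `KMeans` defaults to k-means++ initialization. The function name `kmeans_1d` is hypothetical, and the counts are the ten values shown in the table above.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster 1-D values with Lloyd's algorithm; requires k >= 2."""
    lo, hi = min(values), max(values)
    # Assumption: evenly spaced starting centroids between the min and max value.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


counts = [3, 1, 1, 1, 1, 1, 3, 3, 1, 2]  # SnackFF counts from the table above
labels, centroids = kmeans_1d(counts, k=3)
print(labels)
print(centroids)
```

The label numbers differ from the notebook's output because cluster ids are arbitrary, but the grouping is the same: neighborhoods with one, two, and three competing outlets end up in separate clusters.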

In [35]:
pune_ns_df['ClusterLabel']=kmeans.labels_

In [36]:
pune_ns_df["Neighborhood"]=pune_ns_df.index
pune_ns_df.head()

| Neighborhood (index) | SnackFF | ClusterLabel | Neighborhood |
|---|---|---|---|
| Aundh | 3 | 1 | Aundh |
| Balewadi | 1 | 0 | Balewadi |
| Bavdhan Khurd | 1 | 0 | Bavdhan Khurd |
| Bibvewadi | 1 | 0 | Bibvewadi |
| Dhankawadi | 1 | 0 | Dhankawadi |


In [37]:
index=list(range(len(pune_ns_df))) # positional index 0..n-1 instead of a hard-coded list
pune_ns_df.index=index
pune_ns_df=pune_ns_df[["Neighborhood","SnackFF","ClusterLabel"]]
pune_ns_df.head()

| | Neighborhood | SnackFF | ClusterLabel |
|---|---|---|---|
| 0 | Aundh | 3 | 1 |
| 1 | Balewadi | 1 | 0 |
| 2 | Bavdhan Khurd | 1 | 0 |
| 3 | Bibvewadi | 1 | 0 |
| 4 | Dhankawadi | 1 | 0 |


In [38]:
neighborhoods=pune_ns_df["Neighborhood"].tolist()
new_df=pune_df[pune_df['Neighborhood'].isin(neighborhoods)]
new_df=new_df.set_index(['Neighborhood'])
pune_ns_df=pune_ns_df.set_index(['Neighborhood'])

In [39]:
pune_ns_df['Latitude']=new_df['Latitude']
pune_ns_df['Longitude']=new_df['Longitude']
pune_ns_df.head()

| Neighborhood | SnackFF | ClusterLabel | Latitude | Longitude |
|---|---|---|---|---|
| Aundh | 3 | 1 | 18.56345 | 73.81227 |
| Balewadi | 1 | 0 | 18.57602 | 73.77983 |
| Bavdhan Khurd | 1 | 0 | 18.5111 | 73.77773 |
| Bibvewadi | 1 | 0 | 18.47187 | 73.86336 |
| Dhankawadi | 1 | 0 | 18.46628 | 73.85326 |


In [40]:
pune_ns_df.index=index
pune_ns_df["Neighborhood"]=neighborhoods
pune_ns_df=pune_ns_df[["Neighborhood","SnackFF","ClusterLabel","Latitude","Longitude"]]
pune_ns_df.head()

| | Neighborhood | SnackFF | ClusterLabel | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Aundh | 3 | 1 | 18.56345 | 73.81227 |
| 1 | Balewadi | 1 | 0 | 18.57602 | 73.77983 |
| 2 | Bavdhan Khurd | 1 | 0 | 18.5111 | 73.77773 |
| 3 | Bibvewadi | 1 | 0 | 18.47187 | 73.86336 |
| 4 | Dhankawadi | 1 | 0 | 18.46628 | 73.85326 |


### Visualizing the clusters

In [41]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(clusters)
ys = [i+x+(i*x)**2 for i in range(clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(pune_ns_df['Latitude'], pune_ns_df['Longitude'], pune_ns_df['Neighborhood'], pune_ns_df['ClusterLabel']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Analyzing each cluster

In [43]:
pune_ns_df[pune_ns_df["ClusterLabel"]==0]

| | Neighborhood | SnackFF | ClusterLabel | Latitude | Longitude |
|---|---|---|---|---|---|
| 1 | Balewadi | 1 | 0 | 18.57602 | 73.77983 |
| 2 | Bavdhan Khurd | 1 | 0 | 18.5111 | 73.77773 |
| 3 | Bibvewadi | 1 | 0 | 18.47187 | 73.86336 |
| 4 | Dhankawadi | 1 | 0 | 18.46628 | 73.85326 |
| 5 | Dhanori | 1 | 0 | 18.57856 | 73.89264 |
| 8 | Kalas | 1 | 0 | 18.57845 | 73.87489 |
| 10 | Manjri | 1 | 0 | 18.48381 | 73.85814 |
| 11 | Nanded | 1 | 0 | 18.45642 | 73.792 |
| 12 | Parvati | 1 | 0 | 18.48696 | 73.85006 |
| 13 | Shivane | 1 | 0 | 18.46781 | 73.78897 |


In [44]:
pune_ns_df[pune_ns_df["ClusterLabel"]==1]

| | Neighborhood | SnackFF | ClusterLabel | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Aundh | 3 | 1 | 18.56345 | 73.81227 |
| 6 | Erandwane | 3 | 1 | 18.50965 | 73.83124 |
| 7 | Hadapsar | 3 | 1 | 18.50253 | 73.92706 |


In [45]:
pune_ns_df[pune_ns_df["ClusterLabel"]==2]

| | Neighborhood | SnackFF | ClusterLabel | Latitude | Longitude |
|---|---|---|---|---|---|
| 9 | Karve Nagar | 2 | 2 | 18.4915 | 73.82172 |


### Conclusion
Opening an outlet in a neighborhood from the first cluster (label 0) is the most promising start, since it would face minimal competition: each of these areas has only one snack place or fast food restaurant. The second cluster (label 1) would face the strongest competition, with three snack places or fast food restaurants per area. However, these may still prove to be good locations, since the existing outlets suggest good footfall. The third cluster (label 2) lies between the first and second in terms of the number of competing outlets, with two in Karve Nagar.
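
The ranking described in the conclusion can be reproduced by averaging the competitor count per cluster label and sorting. The sketch below does this for a subset of the rows shown in the cluster tables above; the variable names are illustrative and not part of the notebook.

```python
from collections import defaultdict

# (Neighborhood, SnackFF, ClusterLabel) rows copied from the cluster tables above.
rows = [
    ("Aundh", 3, 1),
    ("Balewadi", 1, 0),
    ("Bavdhan Khurd", 1, 0),
    ("Erandwane", 3, 1),
    ("Hadapsar", 3, 1),
    ("Karve Nagar", 2, 2),
]

counts_by_cluster = defaultdict(list)
for _name, count, label in rows:
    counts_by_cluster[label].append(count)

# Mean number of competing outlets per neighborhood in each cluster.
mean_by_cluster = {label: sum(c) / len(c) for label, c in counts_by_cluster.items()}

# Cluster labels ordered from least to most competition.
ranking = sorted(mean_by_cluster, key=mean_by_cluster.get)
print(mean_by_cluster)
print(ranking)
```

Ordering by the mean rather than by the label number avoids relying on the arbitrary ids that k-means assigns.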