# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Introduction
Who doesn’t love baked goods? Stockholm has many bakeries and an even bigger interest in baking.\
It is reasonable to assume that many people in Stockholm are considering opening a bakery of their own.\
However, it is difficult to know which districts lack bakeries and could support an additional one.

### Business Problem 
The goal of this project is to analyze and suggest which locations in Stockholm would be the most suitable for a new bakery.\
Since location appears to be the most important competitive advantage, knowing where to open a bakery is the first step towards success.

### Target Audience
The target audience for this project consists of baking entrepreneurs and bakery chains seeking to open a new venue in an optimal location.  

## Data
To carry out the analysis we need three sources of data:
* Wikipedia, for a list of all the districts in Stockholm City Centre (https://en.wikipedia.org/wiki/Stockholm_City_Centre)
* The ArcGIS geocoder (via the `geocoder` package), for the coordinates of each district
* Foursquare, for information about venues in the bakery category

By counting the bakeries in each district we can group the districts into clusters, which form the basis for selecting the location of a new bakery.

## Methodology

In this project we try to find which districts are best suited for a new bakery, based on the number of bakeries per capita in each district.

We use k-means clustering to group the districts by bakeries per capita and plot the resulting clusters on a map.
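
As a sketch of the two quantities involved, the per-capita ratio of a district and the standard k-means objective on that single feature can be written as follows. The symbols $b_d$, $p_d$, $r_d$, $C_j$ and $\mu_j$ are notation introduced here for illustration and do not appear in the notebook code.

```latex
r_d = \frac{b_d}{p_d}, \qquad
\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{d \in C_j} \left( r_d - \mu_j \right)^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{d \in C_j} r_d
```

Here $b_d$ is the number of bakeries found for district $d$, $p_d$ is its population, and $C_1,\dots,C_k$ are the clusters. For example, Djurgården with 1 bakery and 788 residents gives $r_d = 1/788 \approx 0.001269$.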

## Importing Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium

!conda install -c conda-forge geocoder -y
import geocoder

print("Libraries imported.")


## Getting the information about Districts in Stockholm

In [12]:
#Create dataframe from Wikipedia Site. 
df_sthlm=pd.read_html("https://en.wikipedia.org/wiki/Stockholm_City_Centre")[0]

# Drop the summary row and the columns not needed, then rename the remaining columns.
df_sthlm=df_sthlm.drop(38)
df_sthlm = df_sthlm.drop(df_sthlm.columns[[1, 3,5]], axis=1)
df_sthlm.columns = ['District', 'Population', 'Borough']
df_sthlm.head()

Unnamed: 0,District,Population,Borough
0,Djurgården,788,Östermalm
1,Fredhäll,4958,Kungsholmen
2,Gustav Vasa,12911,Norrmalm
3,Gärdet,18158,Östermalm
4,Hedvig Eleonora,10387,Östermalm


## Getting Geolocation of each District

In [24]:
# Define and call a function that returns the coordinates of each district.
def get_latlng(district, max_tries=5):
    # try a limited number of times so a failed lookup cannot loop forever
    for _ in range(max_tries):
        g = geocoder.arcgis('{}, Stockholm, Sweden'.format(district))
        if g.latlng is not None:
            return g.latlng
    # fall back to missing values if no coordinates were returned
    return [None, None]
coords = [get_latlng(district) for district in df_sthlm["District"].tolist()]

In [25]:
# Creating temporary dataframe to hold coordinates of each district. 
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [26]:
# Merging the two data frames
df_sthlm['Latitude'] = df_coords['Latitude']
df_sthlm['Longitude'] = df_coords['Longitude']
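
The column assignments above align rows by index label. Because a row was dropped from `df_sthlm` earlier, one cautious alternative is to assign by position so that index labels cannot cause mismatches. The sketch below uses a small stand-in frame built from the first three districts shown in this notebook; the names `df_demo`, `coords_demo` and `df_coords_demo`, and the gap in the index, are illustrative assumptions.

```python
import pandas as pd

# Stand-in for a frame whose index labels have a gap after a dropped row.
df_demo = pd.DataFrame(
    {"District": ["Djurgården", "Fredhäll", "Gustav Vasa"]},
    index=[0, 2, 3],
)

# Coordinates in the same order as the rows of df_demo.
coords_demo = [[59.32462, 18.0978], [59.33103, 18.00545], [59.3425, 18.04775]]
df_coords_demo = pd.DataFrame(coords_demo, columns=["Latitude", "Longitude"])

# .to_numpy() strips the index, so values are assigned by position.
df_demo["Latitude"] = df_coords_demo["Latitude"].to_numpy()
df_demo["Longitude"] = df_coords_demo["Longitude"].to_numpy()

print(df_demo)
```

With a plain label-aligned assignment, the row labelled 3 in this stand-in would have received no coordinates, because `df_coords_demo` only has labels 0 to 2.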

In [27]:
#Checking that the dataframe looks OK
df_sthlm

Unnamed: 0,District,Population,Borough,Latitude,Longitude
0,Djurgården,788,Östermalm,59.32462,18.0978
1,Fredhäll,4958,Kungsholmen,59.33103,18.00545
2,Gustav Vasa,12911,Norrmalm,59.3425,18.04775
3,Gärdet,18158,Östermalm,59.33361,18.11336
4,Hedvig Eleonora,10387,Östermalm,59.33524,18.08046
5,Hjorthagen-Värtahamnen,2225,Östermalm,59.35518,18.1002
6,Jakob,201,Norrmalm,59.329274,18.066006
7,Klara,1597,Norrmalm,59.334667,18.068028
8,Kristineberg,5572,Kungsholmen,59.33662,18.00486
9,Kungsholm,18465,Kungsholmen,59.32901,18.04854


## Plotting the Districts to a map of Stockholm

In [28]:
# Getting the geolocation of Stockholm
address = 'Stockholm, Sweden'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geolocation of Stockholm is {}, {}.'.format(latitude, longitude))

The geolocation of Stockholm is 59.3251172, 18.0710935.


In [215]:
# Creating a map of Stockholm using geolocation.
map_sthlm = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, district in zip(df_sthlm['Latitude'], df_sthlm['Longitude'], df_sthlm['District']):
    label = '{}'.format(district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sthlm)  

map_sthlm

So far it looks good: all the districts of Stockholm City Centre are plotted on the map.

The next step is to look at the venues in each district.

## Foursquare

In [166]:
# The code was removed by Watson Studio for sharing.

Client ID and Client Secret were entered here, but hidden.


### Getting the bakery venues within a radius of 500 m of each district center

In [179]:
radius = 500
LIMIT = 100

# Category IDs taken from Foursquare (https://developer.foursquare.com/docs/resources/categories)
categoryId ='4bf58dd8d48988d16a941735'
venues = []

for lat, long, district in zip(df_sthlm['Latitude'], df_sthlm['Longitude'], df_sthlm['District']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        categoryId,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            district,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
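
The loop above indexes straight into the JSON, so an item with a missing key would raise a `KeyError`. Below is a minimal sketch of a more defensive helper. The function name `extract_venue_rows` and the sample payload are hypothetical, and the payload shape is assumed only from the fields the loop above reads.

```python
def extract_venue_rows(district, lat, lng, items):
    """Flatten 'explore' items into tuples, skipping incomplete entries."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]
        if "name" not in venue or "lat" not in location or "lng" not in location:
            continue  # skip entries missing the fields we need
        rows.append((
            district, lat, lng,
            venue["name"], location["lat"], location["lng"],
            categories[0].get("name"),
        ))
    return rows


# Hypothetical payload shaped like response["groups"][0]["items"].
sample_items = [
    {"venue": {"name": "Bageriet Skansen",
               "location": {"lat": 59.325785, "lng": 18.100556},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Incomplete entry"}},  # no location, so it is skipped
]
rows = extract_venue_rows("Djurgården", 59.32462, 18.0978, sample_items)
print(rows)
```

In the notebook loop, such a helper could be called as `venues.extend(extract_venue_rows(district, lat, long, results))`.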

In [180]:
# convert the venues list into a new DataFrame
df_venues = pd.DataFrame(venues)

# define the column names
df_venues.columns = ['District', 'Latitude', 'Longitude', 'Venue Name', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

print(df_venues.shape)
df_venues

(202, 7)


Unnamed: 0,District,Latitude,Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
0,Djurgården,59.32462,18.0978,Bageriet Skansen,59.325785,18.100556,Bakery
1,Fredhäll,59.33103,18.00545,Bageriet 1935,59.330982,18.000153,Bakery
2,Gustav Vasa,59.3425,18.04775,Fabrique stenugnsbageri,59.342635,18.049212,Bakery
3,Gustav Vasa,59.3425,18.04775,Brunkebergs Bageri,59.33966,18.047594,Bakery
4,Gustav Vasa,59.3425,18.04775,Konditori Ritorno,59.341506,18.044752,Bakery
5,Gustav Vasa,59.3425,18.04775,Stinas bageri,59.344208,18.044825,Bakery
6,Gustav Vasa,59.3425,18.04775,Gateau,59.34122,18.048151,Bakery
7,Gustav Vasa,59.3425,18.04775,Nybergs Hembageri & Konditori,59.339767,18.051687,Bakery
8,Gustav Vasa,59.3425,18.04775,Bröd & Salt,59.343442,18.050245,Bakery
9,Hedvig Eleonora,59.33524,18.08046,Fabrique stenugnsbageri,59.335854,18.07638,Bakery
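
As a sanity check on the 500 m radius, the distance between a district center and a returned venue can be estimated with the haversine formula. This is a sketch using only the standard library; the helper name `haversine_m` is introduced here, and the coordinates are the Djurgården row from the table above.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Djurgården district center vs. Bageriet Skansen
distance = haversine_m(59.32462, 18.0978, 59.325785, 18.100556)
print(round(distance))
```

The result is roughly 200 m, which is inside the 500 m search radius.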


In [181]:
# Okay, let's see how many bakeries we have in Stockholm
print('There are a total of {} bakeries in Stockholm.'.format(len(df_venues.index)))


There are a total of 202 bakeries in Stockholm.
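
The total above counts one row per district and venue, so a bakery that lies within 500 m of two district centers is counted twice. A sketch of counting distinct bakeries by name and coordinates follows; the frame `df_demo` is made-up toy data, not Foursquare output.

```python
import pandas as pd

# Toy data (hypothetical): "Bakery A" falls inside the radius of two districts.
df_demo = pd.DataFrame(
    [
        ("District 1", "Bakery A", 59.3300, 18.0600),
        ("District 2", "Bakery A", 59.3300, 18.0600),
        ("District 2", "Bakery B", 59.3310, 18.0620),
    ],
    columns=["District", "Venue Name", "Venue Latitude", "Venue Longitude"],
)

# A venue is identified by its name and coordinates, regardless of district.
venue_key = ["Venue Name", "Venue Latitude", "Venue Longitude"]
n_rows = len(df_demo)
n_unique = len(df_demo.drop_duplicates(subset=venue_key))
print(n_rows, n_unique)
```

Applied to `df_venues`, the same `drop_duplicates(subset=venue_key)` call would give the number of distinct bakeries rather than the number of district and venue pairs.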


## Analyzing the Districts in Stockholm

In [182]:
# Let's see how many bakeries we have in each District
df_bakery = pd.DataFrame(venues)
df_bakery.columns = ['District', 'Latitude', 'Longitude', 'Venue Name', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
df_bakery['Bakery Freq'] = df_venues.groupby('District')['District'].transform('count')

# Keep only the district and its bakery count, then remove duplicate rows
df_bakery = df_bakery[['District', 'Bakery Freq']]
df_bakery = df_bakery.drop_duplicates()
df_bakery

Unnamed: 0,District,Bakery Freq
0,Djurgården,1
1,Fredhäll,1
2,Gustav Vasa,7
9,Hedvig Eleonora,6
15,Jakob,17
32,Klara,18
50,Kristineberg,1
51,Kungsholm,6
57,Lilla Essingen,3
60,Mariatorget,7


__Now, let's see how many bakeries we have per capita in each district as a basis for our clustering.__

In [210]:
# Merging with our first data frame to get population in each District
df_merged = pd.merge(df_bakery, df_sthlm, on = 'District')
# Add a new column with number of bakeries per person.
df_merged['Bakeries Per Capita'] = df_merged['Bakery Freq'].divide(df_merged['Population'])
df_merged
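
`pd.merge` defaults to an inner join, so districts with no rows in `df_bakery` drop out of `df_merged`. In the tables in this notebook, Gärdet appears in `df_sthlm` but not in `df_bakery`. As a sketch of one alternative, a left join keeps such districts with a count of zero. The frames `sthlm_demo`, `bakery_demo` and `merged_demo` are illustrative stand-ins built from three rows of this notebook.

```python
import pandas as pd

sthlm_demo = pd.DataFrame({
    "District": ["Djurgården", "Fredhäll", "Gärdet"],
    "Population": [788, 4958, 18158],
})
bakery_demo = pd.DataFrame({
    "District": ["Djurgården", "Fredhäll"],
    "Bakery Freq": [1, 1],
})

# Start from all districts and attach bakery counts where they exist.
merged_demo = pd.merge(sthlm_demo, bakery_demo, on="District", how="left")
# Districts without a match get NaN; treat that as zero bakeries found.
merged_demo["Bakery Freq"] = merged_demo["Bakery Freq"].fillna(0).astype(int)
merged_demo["Bakeries Per Capita"] = merged_demo["Bakery Freq"] / merged_demo["Population"]
print(merged_demo)
```

Whether zero-bakery districts should be clustered together with the rest is a design choice; keeping them at least makes them visible as candidates.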

Unnamed: 0,District,Bakery Freq,Population,Borough,Latitude,Longitude,Bakeries Per Capita
0,Djurgården,1,788,Östermalm,59.32462,18.0978,0.001269
1,Fredhäll,1,4958,Kungsholmen,59.33103,18.00545,0.000202
2,Gustav Vasa,7,12911,Norrmalm,59.3425,18.04775,0.000542
3,Hedvig Eleonora,6,10387,Östermalm,59.33524,18.08046,0.000578
4,Jakob,17,201,Norrmalm,59.329274,18.066006,0.084577
5,Klara,18,1597,Norrmalm,59.334667,18.068028,0.011271
6,Kristineberg,1,5572,Kungsholmen,59.33662,18.00486,0.000179
7,Kungsholm,6,18465,Kungsholmen,59.32901,18.04854,0.000325
8,Lilla Essingen,3,4519,Kungsholmen,59.32504,18.00677,0.000664
9,Mariatorget,7,14099,Maria-Gamla stan,59.318302,18.063466,0.000496


## Clustering Districts
__Here we group the districts with k-means clustering on bakeries per capita.__

In [211]:
# set number of clusters
kclusters = 5

# keep only the feature used for clustering
df_merged_clustering = df_merged[['Bakeries Per Capita']]


# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_merged_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 

array([0, 4, 4, 4, 1, 3, 4, 4, 4, 4, 4, 2, 0, 4, 0, 4, 4, 2, 0, 2, 4, 0,
       2, 4, 0, 0, 4, 4, 4], dtype=int32)

Let's add the cluster label of each district to the dataframe.

In [212]:
df_merged.insert(1, 'Cluster Labels', kmeans.labels_)
df_merged

Unnamed: 0,District,Cluster Labels,Bakery Freq,Population,Borough,Latitude,Longitude,Bakeries Per Capita
0,Djurgården,0,1,788,Östermalm,59.32462,18.0978,0.001269
1,Fredhäll,4,1,4958,Kungsholmen,59.33103,18.00545,0.000202
2,Gustav Vasa,4,7,12911,Norrmalm,59.3425,18.04775,0.000542
3,Hedvig Eleonora,4,6,10387,Östermalm,59.33524,18.08046,0.000578
4,Jakob,1,17,201,Norrmalm,59.329274,18.066006,0.084577
5,Klara,3,18,1597,Norrmalm,59.334667,18.068028,0.011271
6,Kristineberg,4,1,5572,Kungsholmen,59.33662,18.00486,0.000179
7,Kungsholm,4,6,18465,Kungsholmen,59.32901,18.04854,0.000325
8,Lilla Essingen,4,3,4519,Kungsholmen,59.32504,18.00677,0.000664
9,Mariatorget,4,7,14099,Maria-Gamla stan,59.318302,18.063466,0.000496


## Map of the Clusters

In [216]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['District'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

Here we will examine each cluster. 
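
Since k-means label numbers are arbitrary, a per-cluster summary sorted by mean makes it easier to see which cluster has the fewest bakeries per capita. The sketch below uses a handful of rows copied from the cluster tables in this notebook; the names `demo` and `summary` are illustrative, and in the notebook the same `groupby` could be applied to `df_merged`.

```python
import pandas as pd

# A few rows copied from the cluster tables in this notebook.
demo = pd.DataFrame({
    "District": ["Djurgården", "Norra Johannes", "Jakob", "Storkyrkan",
                 "Klara", "Fredhäll", "Kungsholm"],
    "Cluster Labels": [0, 0, 1, 2, 3, 4, 4],
    "Bakeries Per Capita": [0.001269, 0.001327, 0.084577, 0.004309,
                            0.011271, 0.000202, 0.000325],
})

# Summarize each cluster and sort from fewest to most bakeries per capita.
summary = (
    demo.groupby("Cluster Labels")["Bakeries Per Capita"]
    .agg(["count", "mean", "min", "max"])
    .sort_values("mean")
)
print(summary)
```

On these sample rows, cluster 4 sorts first (lowest mean) and cluster 1, which contains only Jakob, sorts last.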

### Cluster 0

In [217]:
df_merged.loc[df_merged['Cluster Labels'] == 0]

Unnamed: 0,District,Cluster Labels,Bakery Freq,Population,Borough,Latitude,Longitude,Bakeries Per Capita
0,Djurgården,0,1,788,Östermalm,59.32462,18.0978,0.001269
12,Norra Johannes,0,12,9043,Norrmalm,59.3237,18.074492,0.001327
14,Norra Sofia,0,7,7721,Katarina-Sofia,59.31215,18.08835,0.000907
18,Stureplan-Lärkstaden,0,10,8104,Östermalm,59.335578,18.073046,0.001234
21,Södra Högalid,0,4,4155,Katarina-Sofia,59.31726,18.03773,0.000963
24,Södra Station,0,4,4844,Katarina-Sofia,59.34627,18.03467,0.000826
25,Tekniska Högskolan,0,3,3442,Östermalm,59.3457,18.07194,0.000872


### Cluster 1

In [218]:
df_merged.loc[df_merged['Cluster Labels'] == 1]

Unnamed: 0,District,Cluster Labels,Bakery Freq,Population,Borough,Latitude,Longitude,Bakeries Per Capita
4,Jakob,1,17,201,Norrmalm,59.329274,18.066006,0.084577


### Cluster 2

In [219]:
df_merged.loc[df_merged['Cluster Labels'] == 2]

Unnamed: 0,District,Cluster Labels,Bakery Freq,Population,Borough,Latitude,Longitude,Bakeries Per Capita
11,Norra Adolf Fredrik,2,16,3816,Norrmalm,59.33789,18.06006,0.004193
17,Storkyrkan,2,13,3017,Maria-Gamla stan,59.32563,18.06982,0.004309
19,Södra Adolf Fredrik,2,16,3703,Norrmalm,59.33789,18.06006,0.004321
22,Södra Johannes,2,12,2011,Norrmalm,59.3237,18.074492,0.005967


### Cluster 3

In [220]:
df_merged.loc[df_merged['Cluster Labels'] == 3]

Unnamed: 0,District,Cluster Labels,Bakery Freq,Population,Borough,Latitude,Longitude,Bakeries Per Capita
5,Klara,3,18,1597,Norrmalm,59.334667,18.068028,0.011271


### Cluster 4

In [221]:
df_merged.loc[df_merged['Cluster Labels'] == 4]

Unnamed: 0,District,Cluster Labels,Bakery Freq,Population,Borough,Latitude,Longitude,Bakeries Per Capita
1,Fredhäll,4,1,4958,Kungsholmen,59.33103,18.00545,0.000202
2,Gustav Vasa,4,7,12911,Norrmalm,59.3425,18.04775,0.000542
3,Hedvig Eleonora,4,6,10387,Östermalm,59.33524,18.08046,0.000578
6,Kristineberg,4,1,5572,Kungsholmen,59.33662,18.00486,0.000179
7,Kungsholm,4,6,18465,Kungsholmen,59.32901,18.04854,0.000325
8,Lilla Essingen,4,3,4519,Kungsholmen,59.32504,18.00677,0.000664
9,Mariatorget,4,7,14099,Maria-Gamla stan,59.318302,18.063466,0.000496
10,Mellersta Högalid,4,4,9914,Maria-Gamla stan,59.31726,18.03773,0.000403
13,Norra Högalid,4,4,13166,Maria-Gamla stan,59.31726,18.03773,0.000304
15,Oscars Kyrka,4,2,15271,Östermalm,59.3333,18.0929,0.000131


## Results and Discussion

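Our analysis returned 202 bakery listings within 500 m of the district centers in central Stockholm, grouped into five clusters by bakeries per capita. Because the search radii overlap (Norra and Södra Adolf Fredrik, for example, were geocoded to the same coordinates), some bakeries are likely counted in more than one district.
It would also be interesting to include other venues such as cafés and bistros, since these may compete with bakeries.
Average income in each district would be another useful input, but I could not find that data.

The analysis suggests that several districts in Stockholm could support new bakeries. These are found in cluster 4, which has the fewest bakeries per capita (roughly 1 to 7 bakeries per 10,000 residents in the rows listed above).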

## Conclusion

For anyone interested in opening a new bakery in Stockholm, any of the districts in cluster 4 would be recommended!