# Capstone Project - The Battle of the Neighborhoods (Week 2)

### Introduction: Business Problem

Bratislava has grown steadily in recent years and has attracted so many new residents that infrastructure and facilities are now in short supply in several neighborhoods.

One shortcoming is the lack of supermarkets in neighborhoods close to the city center. These areas are now densely populated, and the few grocery shops there are always overcrowded, with long queues, which affects the quality of life of many residents.

This is an opportunity for stakeholders interested in running a supermarket in one of these areas, as they would hold a near-monopoly position there.

Our goal is to highlight these areas and propose the best neighborhood or neighborhoods for such a business, based on several criteria.

### Data

We will retrieve the boroughs of Bratislava from the Wikipedia page https://en.wikipedia.org/wiki/Boroughs_and_localities_of_Bratislava using web scraping.

Then we will obtain the latitude and longitude of each borough with the Nominatim geocoder from the Python geopy package.

We will also retrieve venue data for these boroughs through the Foursquare API. Our focus is on venues related to supermarkets, grocery shops, or shopping malls, in order to visualize the current presence of this kind of business in each area of the city. After that, we will explore the neighborhoods that lack supermarkets and lie close to the historical city center (Staré Mesto), rather than the suburbs of Bratislava, and propose the best area for a new supermarket.

#### Import Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import urllib

#!conda install -c conda-forge geopy --yes # uncomment this line if the package is not installed yet
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


#### Scrape data from the Wikipedia page into a DataFrame

In [2]:
wikipage = urllib.request.urlopen("https://en.wikipedia.org/wiki/Boroughs_and_localities_of_Bratislava")
content = wikipage.read()

In [3]:
scontent = content.decode("UTF-8")

In [4]:
wikitable = scontent[scontent.find("<table"):scontent.find("</table>")+8]

In [5]:
mydata = pd.read_html(wikitable, header = 0)[0]

In [6]:
mydata.head()

Unnamed: 0,Borough and district,Population Dec. 2005[3],Area in km²[3],Annexed,Location
0,Staré Mesto (I),42241,9.59,,
1,Podunajské Biskupice (II),19977,42.49,1972.0,
2,Ružinov (II),69674,39.7,,
3,Vrakuňa (II),18996,10.3,1972.0,
4,Nové Mesto (III),37040,37.48,,
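
The string slicing used above to isolate the first table can be wrapped in a small helper. This is a minimal sketch, not the notebook's own code: `first_table_html` is a hypothetical name, and the helper assumes the first table contains no nested tables. It searches for the closing tag only after the opening one and returns an empty string when no table is found.

```python
def first_table_html(html: str) -> str:
    """Return the markup of the first <table>...</table> in html, or '' if none."""
    start = html.find("<table")
    if start == -1:
        return ""
    # look for the closing tag only after the opening tag
    end = html.find("</table>", start)
    if end == -1:
        return ""
    return html[start:end + len("</table>")]


# tiny invented page, used only to illustrate the helper
sample = "<p>intro</p><table><tr><td>Staré Mesto (I)</td></tr></table><p>tail</p>"
print(first_table_html(sample))
```

`pandas.read_html` can also parse a whole page and return every table it finds, so the slicing step is a convenience rather than a requirement.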


Now that we have all the Wikipedia data on the boroughs of Bratislava, let's do some data cleaning.
- We only need the first column, so the rest will be deleted.
- Later steps showed that 'Bratislava' needs to be included in each borough name to get accurate results when looking up the borough coordinates.

In [7]:
col2 = 'Population Dec. 2005[3]'
col3 = 'Area in km²[3]'
del mydata[col2]
del mydata[col3]
del mydata['Annexed']
del mydata['Location']
# Adding 'Bratislava' to the tail of each Borough
mydata['Borough and district'] = mydata['Borough and district'] +', '+ "Bratislava"
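
The labelling step above can also be written as a helper that builds the geocoding query for one borough. This is only a sketch: `geocoding_query` is a hypothetical name, and dropping the Roman-numeral district marker such as `(I)` is an assumption. Whether Nominatim resolves the boroughs better with or without the marker would need to be checked against real results.

```python
import re

# matches a district marker such as " (I)" or " (IV)"
DISTRICT_MARKER = re.compile(r"\s*\([IVX]+\)")


def geocoding_query(borough_label: str, city: str = "Bratislava") -> str:
    """Drop the district marker and append the city name."""
    name = DISTRICT_MARKER.sub("", borough_label).strip()
    return f"{name}, {city}"


for label in ["Staré Mesto (I)", "Devínska Nová Ves (IV)"]:
    print(geocoding_query(label))
```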

Let's see how our dataframe looks so far

In [8]:
mydata.head()

Unnamed: 0,Borough and district
0,"Staré Mesto (I), Bratislava"
1,"Podunajské Biskupice (II), Bratislava"
2,"Ružinov (II), Bratislava"
3,"Vrakuňa (II), Bratislava"
4,"Nové Mesto (III), Bratislava"


In [9]:
mydata.shape

(18, 1)

Let's start with the geographical coordinates of Bratislava, Slovakia

In [10]:
geolocator = Nominatim(user_agent="get_coords")
location = geolocator.geocode("Bratislava")
print((location.latitude, location.longitude))

(48.1516988, 17.1093063)


Create a temporary dataframe holding the latitude and longitude of each borough

In [11]:
lat_coords=[]
lon_coords=[]
for index, row in mydata.iterrows():
    location = geolocator.geocode(row['Borough and district'])
    lat_coords.append(location.latitude)
    lon_coords.append(location.longitude)

# build the dataframe once, after all coordinates have been collected
coords = {'Latitude':lat_coords,'Longitude':lon_coords}
mydata_coords = pd.DataFrame(coords)

In [12]:
mydata_coords

Unnamed: 0,Latitude,Longitude
0,48.155137,17.101021
1,48.118048,17.208115
2,48.143488,17.160702
3,48.151912,17.209565
4,48.191963,17.096901
5,48.215235,17.149956
6,48.216978,17.19483
7,48.170906,17.010161
8,48.224308,16.98413
9,48.186545,17.029535
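
The geocoding loop above assumes every lookup succeeds. Nominatim's `geocode` returns `None` when it finds nothing, so a more defensive variant could record the misses instead of failing. The sketch below takes the geocoding function as a parameter so it can be tried without network access. `geocode_all` and `fake_geocode` are hypothetical names, and the fake lookup table is illustrative only.

```python
from types import SimpleNamespace


def geocode_all(names, geocode):
    """Geocode each name; return (rows, missing) without raising on misses."""
    rows, missing = [], []
    for name in names:
        location = geocode(name)
        if location is None:
            # keep the row so the result stays aligned with the input
            missing.append(name)
            rows.append({"Borough": name, "Latitude": None, "Longitude": None})
        else:
            rows.append({"Borough": name,
                         "Latitude": location.latitude,
                         "Longitude": location.longitude})
    return rows, missing


def fake_geocode(query):
    """Stand-in for geolocator.geocode, used only for this illustration."""
    known = {
        "Staré Mesto (I), Bratislava":
            SimpleNamespace(latitude=48.155137, longitude=17.101021),
    }
    return known.get(query)


rows, missing = geocode_all(
    ["Staré Mesto (I), Bratislava", "Nowhere, Bratislava"], fake_geocode)
print(rows)
print("not found:", missing)
```

In the notebook, `geolocator.geocode` could be passed in place of `fake_geocode`. geopy also provides `RateLimiter` in `geopy.extra.rate_limiter` for spacing out requests to the Nominatim service.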


Merge the coordinates into the original dataframe

In [13]:
mydata['Latitude'] = mydata_coords['Latitude']
mydata['Longitude'] = mydata_coords['Longitude']

In [14]:
mydata

Unnamed: 0,Borough and district,Latitude,Longitude
0,"Staré Mesto (I), Bratislava",48.155137,17.101021
1,"Podunajské Biskupice (II), Bratislava",48.118048,17.208115
2,"Ružinov (II), Bratislava",48.143488,17.160702
3,"Vrakuňa (II), Bratislava",48.151912,17.209565
4,"Nové Mesto (III), Bratislava",48.191963,17.096901
5,"Rača (III), Bratislava",48.215235,17.149956
6,"Vajnory (III), Bratislava",48.216978,17.19483
7,"Devín (IV), Bratislava",48.170906,17.010161
8,"Devínska Nová Ves (IV), Bratislava",48.224308,16.98413
9,"Dúbravka (IV), Bratislava",48.186545,17.029535


Rename the header of the first column and delete the last row of the dataframe: it is not a borough, just the total row of the table the data came from.

In [15]:
mydata.rename(columns={"Borough and district":"Borough"}, inplace=True)
mydata = mydata[mydata.Borough != "Total, Bratislava"]
mydata

Unnamed: 0,Borough,Latitude,Longitude
0,"Staré Mesto (I), Bratislava",48.155137,17.101021
1,"Podunajské Biskupice (II), Bratislava",48.118048,17.208115
2,"Ružinov (II), Bratislava",48.143488,17.160702
3,"Vrakuňa (II), Bratislava",48.151912,17.209565
4,"Nové Mesto (III), Bratislava",48.191963,17.096901
5,"Rača (III), Bratislava",48.215235,17.149956
6,"Vajnory (III), Bratislava",48.216978,17.19483
7,"Devín (IV), Bratislava",48.170906,17.010161
8,"Devínska Nová Ves (IV), Bratislava",48.224308,16.98413
9,"Dúbravka (IV), Bratislava",48.186545,17.029535


As we move forward, let's store the latitude and longitude of Bratislava in two variables

In [16]:
# get the coordinates of Bratislava
address = 'Bratislava, Slovakia'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

#### Install necessary package and import folium library that will be used for data visualization

In [17]:
!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if the package is already installed
import folium # map rendering library


##### Let's visualize the data we have so far: city center location and borough centers:

In [18]:
# create map of Bratislava using latitude and longitude values
map_br = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
folium.Marker([latitude, longitude], popup='Historical_Center').add_to(map_br)

for lat, lng, neighborhood in zip(mydata['Latitude'], mydata['Longitude'], mydata['Borough']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=45,
        popup=label,
        color='blue',
        fill=False).add_to(map_br)  
    
map_br

### Foursquare

Use the Foursquare API to explore the Boroughs

In [19]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: <redacted>
CLIENT_SECRET: <redacted>
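
The hidden cell presumably defines `CLIENT_ID`, `CLIENT_SECRET` and `VERSION`, since the request cell uses all three. One way to keep the secret out of the notebook is to read the values from environment variables. This is a sketch under assumptions: the variable names `FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET` and `FOURSQUARE_VERSION` are hypothetical, and the default version date is only a placeholder in the `YYYYMMDD` form that the Foursquare v2 API expects for its `v` parameter.

```python
import os


def load_foursquare_settings(env=None):
    """Read Foursquare settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    required = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "CLIENT_ID": env["FOURSQUARE_CLIENT_ID"],
        "CLIENT_SECRET": env["FOURSQUARE_CLIENT_SECRET"],
        # placeholder date; any YYYYMMDD value the API accepts would do
        "VERSION": env.get("FOURSQUARE_VERSION", "20180605"),
    }


# dummy values for illustration; real ones would come from os.environ
settings = load_foursquare_settings({
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
})
print(sorted(settings))
```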


##### Let's get the venues with a category related to 'Supermarket' that are within a radius of 2500 meters of each borough center.

In [20]:
radius = 2500
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(mydata['Latitude'], mydata['Longitude'], mydata['Borough']):
        
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&query=Supermarket&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,        
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
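
The request and parsing steps above can be separated into two small functions, which makes them easier to test without calling the API. This is a sketch, not the notebook's code: `build_explore_url` and `extract_venues` are hypothetical names, the sample payload is invented and only mimics the fields the loop above reads, and missing keys or an empty category list yield `None` instead of raising.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      query="Supermarket", radius=2500, limit=100):
    """Build the explore URL; urlencode escapes each parameter value."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


def extract_venues(payload, borough, lat, lng):
    """Flatten one explore response into tuples shaped like `venues` above."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]  # tolerate an empty list
        rows.append((borough, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0].get("name")))
    return rows


# invented payload with the same nesting as the fields read above
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Billa",
               "location": {"lat": 48.156347, "lng": 17.118284},
               "categories": [{"name": "Supermarket"}]}},
    {"venue": {"name": "Corner shop",
               "location": {"lat": 48.15, "lng": 17.11},
               "categories": []}},
]}]}}

url = build_explore_url("demo-id", "demo-secret", "20180605", 48.155137, 17.101021)
parsed = extract_venues(sample_payload, "Staré Mesto (I), Bratislava",
                        48.155137, 17.101021)
print(url)
print(parsed)
```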

In [21]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Borough', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(111, 7)


Unnamed: 0,Borough,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,"Staré Mesto (I), Bratislava",48.155137,17.101021,Billa,48.156347,17.118284,Supermarket
1,"Staré Mesto (I), Bratislava",48.155137,17.101021,Kaufland,48.165582,17.075952,Supermarket
2,"Staré Mesto (I), Bratislava",48.155137,17.101021,Lidl,48.144314,17.114057,Supermarket
3,"Staré Mesto (I), Bratislava",48.155137,17.101021,My Bratislava (Tesco),48.145238,17.114039,Department Store
4,"Staré Mesto (I), Bratislava",48.155137,17.101021,Billa,48.140633,17.120492,Supermarket


Let's see how many supermarkets were returned for each Borough

In [22]:
venues_df.groupby(["Borough"]).count()

Unnamed: 0_level_0,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Devínska Nová Ves (IV), Bratislava",3,3,3,3,3,3
"Dúbravka (IV), Bratislava",8,8,8,8,8,8
"Karlova Ves (IV), Bratislava",11,11,11,11,11,11
"Lamač (IV), Bratislava",9,9,9,9,9,9
"Petržalka (V), Bratislava",20,20,20,20,20,20
"Podunajské Biskupice (II), Bratislava",5,5,5,5,5,5
"Rača (III), Bratislava",4,4,4,4,4,4
"Rusovce (V), Bratislava",1,1,1,1,1,1
"Ružinov (II), Bratislava",21,21,21,21,21,21
"Staré Mesto (I), Bratislava",20,20,20,20,20,20


We see a high number of supermarkets in Petržalka, Ružinov and Staré Mesto, which reflects the fact that the majority of Bratislava's population lives in these 3 boroughs.
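
Because each query covers a 2500 meter radius around a borough center, neighboring boroughs may return the same venue, so the per-borough counts can overlap. A minimal sketch of removing such duplicates before plotting follows. `unique_venues` is a hypothetical name, the rows are assumed to use the same tuple layout as `venues`, and the second Billa row in the sample is invented to illustrate a duplicate.

```python
def unique_venues(rows):
    """Keep the first occurrence of each (name, latitude, longitude) venue."""
    seen = set()
    unique = []
    for row in rows:
        key = (row[3], row[4], row[5])  # VenueName, VenueLatitude, VenueLongitude
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


sample_rows = [
    ("Staré Mesto (I), Bratislava", 48.155137, 17.101021,
     "Billa", 48.156347, 17.118284, "Supermarket"),
    # invented duplicate: the same venue reported for a second borough
    ("Ružinov (II), Bratislava", 48.143488, 17.160702,
     "Billa", 48.156347, 17.118284, "Supermarket"),
    ("Staré Mesto (I), Bratislava", 48.155137, 17.101021,
     "Lidl", 48.144314, 17.114057, "Supermarket"),
]

deduplicated = unique_venues(sample_rows)
print(len(sample_rows), "rows ->", len(deduplicated), "unique venues")
```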

In [23]:
# print out the list of categories
venues_df['VenueCategory'].unique()

array(['Supermarket', 'Department Store', 'Grocery Store',
       'Farmers Market', 'Shopping Mall'], dtype=object)

Looking at the list of categories, they are all acceptable because their venues all relate to supermarkets, so none of them needs to be excluded.

### Methodology

- The first step is to visualize all the stores from our last dataframe on a map, together with the borough circles, to get an overview of how supermarkets are distributed across the boroughs.
- Then we will focus on the 3 most populated boroughs and look for areas that have no supermarket in close range.
- Finally, we will suggest one or two areas that the stakeholders could consider for running a business.

In [24]:
# create map of Supermarkets in Bratislava
map_markets = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
folium.Marker([latitude, longitude], popup='Historical_Center').add_to(map_markets)

for lat, lng, neighborhood in zip(mydata['Latitude'], mydata['Longitude'], mydata['Borough']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=45,
        popup=label,
        color='blue',
        fill=False).add_to(map_markets) 


for lat, lng, supermarket in zip(venues_df['VenueLatitude'], venues_df['VenueLongitude'], venues_df['VenueCategory']):
    label = '{}'.format(supermarket)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7).add_to(map_markets)  
    
map_markets

### Analysis

Let's zoom in a little to get a better view of the 3 most populated boroughs.

In [25]:
# create map focusing on 3 most populated Boroughs
map_zoom = folium.Map(location=[latitude, longitude], zoom_start=13)


folium.Circle([latitude, longitude], radius=1800, color='white', fill=True, fill_opacity=0.6).add_to(map_zoom)
folium.Circle([48.143488, 17.160702], radius=1800, color='white', fill=True, fill_opacity=0.6).add_to(map_zoom)
folium.Circle([48.111508, 17.106420], radius=1800, color='white', fill=True, fill_opacity=0.6).add_to(map_zoom)


for lat, lng, supermarket in zip(venues_df['VenueLatitude'], venues_df['VenueLongitude'], venues_df['VenueCategory']):
    label = '{}'.format(supermarket)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7).add_to(map_zoom)  
    
map_zoom

As I currently live in Bratislava, I already have a good idea of where our focus should be. Let's see whether the map supports it.

Starting with Petržalka, the number of supermarkets looks adequate: it is not as crowded as Staré Mesto, and the current view shows no clear opportunity.

In Ružinov, the northern part has a fair number of supermarkets spread across the areas where most people live, while to the south we see almost none. This makes sense: few residents live there, so it offers no opportunity.

Finally, let's examine Staré Mesto. To its west lies an elevated area that offers no opportunity, which is why it has no supermarkets. In contrast, one area slightly north of the center and another slightly east of it lack supermarkets even though many people live there. Both are very close to the city center, so besides the permanent residents, many tourists and people from other areas pass through regularly.

Let's focus on Staré Mesto and see whether there is a real opportunity.

In [26]:
# create map focusing on Bratislava's city center
map_center = folium.Map(location=[latitude, longitude], zoom_start=15)

folium.Circle([latitude, longitude], radius=1500, color='white', fill=True, fill_opacity=0.6).add_to(map_center)

for lat, lng, supermarket in zip(venues_df['VenueLatitude'], venues_df['VenueLongitude'], venues_df['VenueCategory']):
    label = '{}'.format(supermarket)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7).add_to(map_center)  
    
map_center

This view supports the claim from the introduction that supermarkets are scarce close to the city center.

Now it's time to propose 2 spots to the stakeholders for further examination, each with no supermarket within a radius of 500 meters.

In [27]:
# create map proposing 2 spots with radius=500 and no supermarket in range
map_spots = folium.Map(location=[latitude, longitude], zoom_start=15)

# add markers to map
folium.Marker([48.1514, 17.116], popup='A').add_to(map_spots)
folium.Marker([48.1537, 17.112], popup='B').add_to(map_spots)

folium.Circle([latitude, longitude], radius=1500, color='white', fill=True, fill_opacity=0.6).add_to(map_spots)
folium.Circle([48.1514, 17.116], radius=500, color='blue', fill=True, fill_opacity=0.1).add_to(map_spots)
folium.Circle([48.1537, 17.112], radius=500, color='blue', fill=True, fill_opacity=0.1).add_to(map_spots)

for lat, lng, supermarket in zip(venues_df['VenueLatitude'], venues_df['VenueLongitude'], venues_df['VenueCategory']):
    label = '{}'.format(supermarket)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7).add_to(map_spots)  
    
map_spots
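
The "no supermarket in range" condition is judged visually on the map above. It could also be checked numerically with the haversine great-circle distance. The sketch below is an illustration under assumptions: `haversine_m` and `venues_within` are hypothetical names, the Earth is treated as a sphere of radius 6371 km, and `sample_venues` holds only the five rows shown by `venues_df.head()`, not the full dataset.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def venues_within(spot, venues, radius_m=500):
    """Names of venues lying within radius_m meters of spot."""
    return [name for name, lat, lng in venues
            if haversine_m(spot[0], spot[1], lat, lng) <= radius_m]


# the five venues shown in the head of venues_df (name, latitude, longitude)
sample_venues = [
    ("Billa", 48.156347, 17.118284),
    ("Kaufland", 48.165582, 17.075952),
    ("Lidl", 48.144314, 17.114057),
    ("My Bratislava (Tesco)", 48.145238, 17.114039),
    ("Billa", 48.140633, 17.120492),
]

spots = {"A": (48.1514, 17.116), "B": (48.1537, 17.112)}
for label, spot in spots.items():
    print(label, venues_within(spot, sample_venues))
```

Running the same check against every row of `venues_df` would confirm or refute the visual impression for the complete set of venues.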

### Results and Discussion

We ended up with 2 spots close to the city center of Bratislava where there is no competition within 500 meters. This is a strong opportunity for stakeholders to start a business in those areas, since the near-monopoly position could bring high profits.

It would also give the residents of Bratislava more shopping options, shorten the queues at overcrowded supermarkets, and contribute to a better quality of life.

Our initial claim that there are not enough supermarkets in the capital of Slovakia is supported by the analysis. Using geographical data from Wikipedia and Foursquare on boroughs and venues, we highlighted an area that needs improvement and that mainly affects residents of the surrounding neighborhoods.

In general, we did not pay much attention to the rest of the boroughs, as they do not offer a good opportunity and are not worth investing in: the majority of Bratislava's residents live and work close to the city center.

### Conclusion

Our goal was achieved. We provided insights into the lack of supermarkets close to the city center and ended up with a suggestion based on our data analysis. We avoided unnecessary calculations and kept the approach simple, relying heavily on data visualization. The problem was obvious but needed attention, and the outcome rests on our observations together with first-hand knowledge of the city, its population, and its sites of interest.