# The Battle of Neighborhoods

## Introduction / Business problem:

A personal trainer is planning to open a new gym in Helsinki. They would like a location that is not too close to existing gyms but still has plenty of potential customers.

The task is to visualize how many gyms there are in each suburb of Helsinki, so that the personal trainer can make a better-informed decision. A good outcome would be a map showing the number of gyms in each suburb.

## Description of the data:

Data will be gathered from two sources:
- Wikipedia (web scraping)
- FourSquare API (search api: GET https://api.foursquare.com/v2/venues/search )

From Wikipedia we scrape a list of zip codes and suburbs in Helsinki and turn them into coordinates. From the FourSquare API we fetch data about the gyms in those suburbs.

#### Example of Wikipedia site and data format

From here: https://fi.wikipedia.org/wiki/Luettelo_Suomen_postinumeroista_kunnittain

The page is a nested bullet list with HTML like this:

<li><a href="/wiki/Helsinki" title="Helsinki">Helsinki</a>
<ul><li>00002 Helsinki / Helsingfors</li>
<li>00100 Helsinki / Helsingfors – <a href="/wiki/Postitalo_(Helsinki)" title="Postitalo (Helsinki)">Postitalo</a>, <a href="/wiki/Kamppi" title="Kamppi">Kamppi</a>, <a href="/wiki/Lepp%C3%A4suo" title="Leppäsuo">Leppäsuo</a>, <a href="/wiki/Etu-T%C3%B6%C3%B6l%C3%B6" title="Etu-Töölö">Etu-Töölö</a></li>
<li>...</li>
<li>... lots of other suburbs...</li>
<li>...</li>
<li>00970 Helsinki / Helsingfors – <a href="/wiki/Mellunm%C3%A4ki" title="Mellunmäki">Mellunmäki</a>, <a href="/wiki/Mellunkyl%C3%A4" title="Mellunkylä">Mellunkylä</a>, <a href="/wiki/Uutela" title="Uutela">Uutela</a></li>
<li>00980 Helsinki / Helsingfors – <a href="/wiki/Vuosaari" title="Vuosaari">Vuosaari</a>, <a href="/wiki/Meri-Rastila" title="Meri-Rastila">Meri-Rastila</a></li>
<li>00990 Helsinki / Helsingfors – <a href="/wiki/Vuosaari" title="Vuosaari">Vuosaari</a>, <a href="/wiki/Aurinkolahti" title="Aurinkolahti">Aurinkolahti</a></li></ul></li>

#### Example of a FourSquare response

```
{
  "response": {
    "venues": [
      {
        "id": "5642aef9498e51025cf4a7a5",
        "name": "Mr. Purple",
        "location": {
          "address": "180 Orchard St",
          "crossStreet": "btwn Houston & Stanton St",
          "lat": 40.72173744277209,
          "lng": -73.98800687282996,
          "labeledLatLngs": [
            {
              "label": "display",
              "lat": 40.72173744277209,
              "lng": -73.98800687282996
            }
          ],
          "distance": 8,
          "postalCode": "10002",
          "cc": "US",
          "city": "New York",
          "state": "NY",
          "country": "United States",
          "formattedAddress": [
            "180 Orchard St (btwn Houston & Stanton St)",
            "New York, NY 10002",
            "United States"
          ]
        },
        "categories": [
          {
            "id": "4bf58dd8d48988d1d5941735",
            "name": "Hotel Bar",
            "pluralName": "Hotel Bars",
            "shortName": "Hotel Bar",
            "icon": {
              "prefix": "https://ss3.4sqi.net/img/categories_v2/travel/hotel_bar_",
              "suffix": ".png"
            },
            "primary": true
          }
        ],
        "venuePage": {
          "id": "150747252"
        }
      }
    ]
  }
}
```
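
Only a few of these fields are needed later in this notebook: the venue name and its coordinates, plus the `id` if duplicates have to be detected. As a minimal sketch, the fields can be pulled out of a response shaped like the example above as follows. The `sample` dictionary is a trimmed copy of that example, and `extract_venues` is a hypothetical helper name, not part of the FourSquare API.

```python
import json

# Trimmed copy of the example response above (only the fields used later).
sample = json.loads("""
{
  "response": {
    "venues": [
      {
        "id": "5642aef9498e51025cf4a7a5",
        "name": "Mr. Purple",
        "location": {"lat": 40.72173744277209, "lng": -73.98800687282996}
      }
    ]
  }
}
""")


def extract_venues(payload):
    """Return (id, name, lat, lng) tuples for every venue in the payload."""
    venues = payload.get("response", {}).get("venues", [])
    return [
        (v["id"], v["name"], v["location"]["lat"], v["location"]["lng"])
        for v in venues
    ]


print(extract_venues(sample))
```

Using `.get` with defaults means that a payload without a `response` key yields an empty list instead of raising an error.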

# Execution

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from pandas import json_normalize # top-level import; the pandas.io.json path is deprecated since pandas 1.0
from sklearn.cluster import KMeans
!pip install folium --quiet
import folium
print('All imports ready')

All imports ready


## Data scraping to get all zip codes and suburb names of Helsinki

Helsinki has several suburbs. In order to get their latitude and longitude, let's first scrape all the zip codes from Wikipedia: https://fi.wikipedia.org/wiki/Luettelo_Suomen_postinumeroista_kunnittain

In [2]:
base_url_wikipedia = 'https://fi.wikipedia.org/wiki/Luettelo_Suomen_postinumeroista_kunnittain'
page_wikipedia = requests.get(base_url_wikipedia).content

In [3]:
soup = BeautifulSoup(page_wikipedia, 'html.parser')
content = soup.find(class_='mw-parser-output')
lists = content.find_all('ul')
zip_codes = lists[1].find_all('li')

helsinki = []
for z in zip_codes:
    if 'Helsinki / Helsingfors' in z.text.strip():
        helsinki.append(z.text.strip())

helsinki_suburbs = []
# skip the first match: it is the enclosing "Helsinki" list item, whose text contains every nested entry
for h in helsinki[1:]:
    parts = h.split('–')
    
    suburb = []
    zip_code_and_suburb = parts[0].split(' ')
    suburb.append(zip_code_and_suburb[0]) # adding zip code
    
    # adding the first suburb name listed for the zip code (empty if none is listed)
    if len(parts) == 2:
        suburb.append(parts[1].strip().split(',')[0].strip())
    else:
        suburb.append('')
    helsinki_suburbs.append(suburb)
    
print('Helsinki has', len(helsinki_suburbs),'suburbs')

Helsinki has 85 suburbs
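
The parsing logic in the loop above can also be isolated into a small function, which makes it easy to check against single entries. This is a sketch only: `parse_postal_entry` is a hypothetical name, and it uses `str.partition` instead of the `split` call above, so an entry with more than one dash keeps everything after the first dash instead of being treated as having no suburb.

```python
def parse_postal_entry(text):
    """Split '00100 Helsinki / Helsingfors – Postitalo, Kamppi' into parts.

    Returns (postal_code, first_suburb); first_suburb is '' when the
    entry lists no suburb names.
    """
    head, sep, tail = text.strip().partition('–')  # en dash, as on the page
    postal_code = head.split(' ')[0]
    first_suburb = tail.split(',')[0].strip() if sep else ''
    return postal_code, first_suburb


print(parse_postal_entry('00100 Helsinki / Helsingfors – Postitalo, Kamppi, Leppäsuo, Etu-Töölö'))
print(parse_postal_entry('00002 Helsinki / Helsingfors'))
```

The two example strings are the text of the first two list items shown in the data description.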


The coordinates are looked up with geopy instead of geocoder, because geocoder did not work. geopy is not perfect either: some of the suburbs were not found at all, and those were left out of the comparison. A suburb so obscure that the geocoder cannot find it is most likely not an optimal location for a gym anyway.

In [4]:
!pip install geopy --quiet

In [5]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="helsinki-data")

After the installation, let's load all the coordinates of different suburbs.

In [31]:
suburbs_with_coords = []
for sub in helsinki_suburbs:
    location = geolocator.geocode('{}, {}, Helsinki'.format(sub[0], sub[1]))
    
    # some of the suburbs are not found by the geocoder; those are skipped
    # (the geocoder library recommended by the course did not work)
    if location is not None:
        # build a new row instead of extending sub in place,
        # so that re-running the cell does not append the coordinates twice
        suburbs_with_coords.append(sub[:2] + [location.latitude, location.longitude])
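
Some lookups fail simply because the combined query is too specific. For entries without a suburb name, the query above becomes `'00002, , Helsinki'`. One possible way to recover more suburbs is to retry with progressively looser queries. The sketch below assumes this approach; it was not used for the results in this notebook. `geocode_with_fallback` is a hypothetical helper, and the dictionary is a stand-in for a real geocoder so that the example runs offline.

```python
def geocode_with_fallback(geocode, postal_code, name, city='Helsinki'):
    """Try progressively looser queries until one returns a result.

    `geocode` is any callable that takes a query string and returns a
    location object or None (for example `geolocator.geocode`).
    """
    queries = []
    if name:
        queries.append('{}, {}, {}'.format(postal_code, name, city))
        queries.append('{}, {}'.format(name, city))
    queries.append('{}, {}'.format(postal_code, city))

    for query in queries:
        location = geocode(query)
        if location is not None:
            return query, location
    return None, None


# Stand-in geocoder that only knows one looser query (illustration only).
known = {'Vuosaari, Helsinki': (60.205461, 25.128293)}
print(geocode_with_fallback(known.get, '00980', 'Vuosaari'))
print(geocode_with_fallback(known.get, '00002', ''))
```

With the real geocoder one would pass `geolocator.geocode` as the first argument. Returning the query that matched makes it possible to see afterwards which suburbs needed a fallback.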

With the coordinates in place, let's create a dataframe.

In [19]:
helsinki_df = pd.DataFrame(suburbs_with_coords, columns=['Postal Code', 'Neighborhood', 'Latitude', 'Longitude'])
helsinki_df.tail()

| | Postal Code | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 65 | 950 | Vartioharju | 60.226439 | 25.12496 |
| 66 | 960 | Vuosaari | 60.220808 | 25.135899 |
| 67 | 970 | Mellunmäki | 60.231909 | 25.123513 |
| 68 | 980 | Vuosaari | 60.205461 | 25.128293 |
| 69 | 990 | Vuosaari | 60.202969 | 25.152177 |
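
The postal codes in this output have lost their leading zeros (950 instead of 00950), which typically happens when the column is stored or read back as a number. Finnish postal codes are five digits long, as in the Wikipedia list, so they can be restored by zero-padding. A minimal sketch, where `format_postal_code` is a hypothetical helper:

```python
def format_postal_code(code):
    """Render a postal code as a five-character, zero-padded string."""
    return str(code).strip().zfill(5)


for raw in (950, 990, 2, '00100'):
    print(format_postal_code(raw))
```

On a dataframe column the same idea can be written as `helsinki_df['Postal Code'].astype(str).str.zfill(5)`.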


Let's use the coordinates of the first postal code in the dataframe as the center of Helsinki.

In [8]:
helsinki_lat = helsinki_df['Latitude'][0]
helsinki_long = helsinki_df['Longitude'][0]

## Creating a map of Helsinki with all suburbs marked

In [9]:
map_helsinki = folium.Map(location=[helsinki_lat, helsinki_long], zoom_start=11)

In [10]:
for lat, lng, zip_code, neighborhood in zip(helsinki_df['Latitude'], helsinki_df['Longitude'], helsinki_df['Postal Code'], helsinki_df['Neighborhood']):
    
    label = '{}, {}'.format(zip_code, neighborhood)
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_helsinki)

In [11]:
map_helsinki

The map doesn't render on GitHub. To see the map, go here: https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/f6a4e6da-2e18-4d5c-927d-6ab44f9bcba1/view?access_token=cb4033043cfa41e47305e62c3fbed6484e04baf2df3c37a2bf3dfd78949ba8ff

## Extracting gyms' location data from FourSquare

In order to reach our goal, we need the current locations of gyms in Helsinki.

In [32]:
CLIENT_ID = 'secret' # your Foursquare ID
CLIENT_SECRET = 'secret' # your Foursquare Secret

In [33]:
# @hidden_cell
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret

In [34]:
VERSION = '20200418' # Foursquare API version date in YYYYMMDD format
gym_category_ids = '4bf58dd8d48988d175941735,4bf58dd8d48988d1b2941735'

Let's create a function to extract that kind of data.

In [15]:
def getNearbyGyms(zip_codes, names, latitudes, longitudes):
    
    venues_list=[]
    for zip_code, name, lat, lng in zip(zip_codes, names, latitudes, longitudes):
        print(zip_code, name, lat, lng)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat,
            lng, 
            1000, # search radius in meters
            50, # the search endpoint returns at most 50 venues per request
            gym_category_ids
        )
            
        # make the GET request
        results = requests.get(url).json()['response']['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            zip_code,
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code',
                  'Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Gym Name', 
                  'Gym Latitude', 
                  'Gym Longitude']
    
    return nearby_venues
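
The request URL in the function above is assembled with string formatting. A slightly more robust option is to let the standard library encode the query parameters. The sketch below assumes the same endpoint and parameter names as above; `build_search_url` is a hypothetical helper, and the credentials and version string are placeholders.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'


def build_search_url(client_id, client_secret, version, lat, lng,
                     category_ids, radius=1000, limit=50):
    """Build a venues/search URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
        'categoryId': category_ids,
    }
    return '{}?{}'.format(SEARCH_ENDPOINT, urlencode(params))


# Placeholder credentials and version; coordinates of Punavuori from this notebook.
url = build_search_url('my-id', 'my-secret', 'YYYYMMDD', 60.163565, 24.939153,
                       '4bf58dd8d48988d175941735,4bf58dd8d48988d1b2941735')
print(url)
```

With `requests`, the same effect can be achieved by passing the dictionary directly: `requests.get(SEARCH_ENDPOINT, params=params)`.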

Then let's run the function with our Helsinki suburb data.

In [None]:
helsinki_gyms_by_suburb = getNearbyGyms(zip_codes=helsinki_df['Postal Code'], names=helsinki_df['Neighborhood'], latitudes=helsinki_df['Latitude'], longitudes=helsinki_df['Longitude'])

Based on the collected data, let's create two dataframes: one with all the gyms and their locations, and another with the number of gyms in each area. Both are built from the same per-suburb query results: a single search returns at most 50 venues, so all the gyms in Helsinki could not be fetched in one request.

### All the gyms dataframe:

In [29]:
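# the 1000 m search areas of neighboring suburbs overlap, so the same gym can appear several times;
# note that de-duplicating on the name alone also merges different gyms that share a name
all_gyms_in_helsinki_df = helsinki_gyms_by_suburb.drop_duplicates(subset=['Gym Name']).reset_index()[['Gym Name', 'Gym Latitude', 'Gym Longitude']]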
all_gyms_in_helsinki_df.tail()

| | Gym Name | Gym Latitude | Gym Longitude |
|---|---|---|---|
| 675 | iZENZEi Academy Helsinki – Taekwondoa, kuntoil... | 60.208717 | 25.142019 |
| 676 | Gympark | 60.202675 | 25.160864 |
| 677 | aurinko punttisali | 60.203204 | 25.16041 |
| 678 | Sali | 60.213165 | 25.155818 |
| 679 | RANTA PAJA | 60.202503 | 25.160644 |
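
Duplicates are dropped here by gym name. Every venue in the FourSquare response also carries an `id` (see the example response in the data description), which is a safer key: two branches of a chain share a name but not an id. The sketch below shows the idea on plain dictionaries. `dedupe_by_id` is a hypothetical helper, and the ids and names are made up for illustration.

```python
def dedupe_by_id(venues):
    """Keep the first occurrence of each venue, keyed on its FourSquare id."""
    seen = set()
    unique = []
    for venue in venues:
        if venue['id'] not in seen:
            seen.add(venue['id'])
            unique.append(venue)
    return unique


# Made-up ids and names for illustration only.
venues = [
    {'id': 'a1', 'name': 'Chain Gym'},  # returned for suburb A
    {'id': 'a1', 'name': 'Chain Gym'},  # same venue, returned again for suburb B
    {'id': 'b2', 'name': 'Chain Gym'},  # different branch with the same name
]
print([v['id'] for v in dedupe_by_id(venues)])
```

In the dataframe version this would mean also storing `v['id']` in `getNearbyGyms` and passing that column to `drop_duplicates`.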


### Number of gyms in each suburb dataframe:

In [30]:
helsinki_gyms_by_suburb['Gym Count'] = helsinki_gyms_by_suburb.groupby('Postal Code')['Neighborhood'].transform('count')
gym_count_by_suburb = helsinki_gyms_by_suburb.drop_duplicates(subset=['Postal Code']).reset_index()[['Postal Code','Neighborhood','Neighborhood Latitude','Neighborhood Longitude', 'Gym Count']]
gym_count_by_suburb.head()

| | Postal Code | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Gym Count |
|---|---|---|---|---|---|
| 0 | 2 | | 60.16741 | 24.942577 | 49 |
| 1 | 120 | Punavuori | 60.163565 | 24.939153 | 49 |
| 2 | 130 | Kaartinkaupunki | 60.165214 | 24.947222 | 49 |
| 3 | 140 | Ullanlinna | 60.158073 | 24.952387 | 49 |
| 4 | 150 | Eira | 60.156817 | 24.942843 | 49 |
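
Each count is the number of venues FourSquare returned within 1000 m of the suburb's coordinates. The identical counts of 49 for the central suburbs may mean that the per-request cap was reached there. As a sketch of how such a count could be cross-checked offline, the de-duplicated gym list can be counted against a suburb center with the haversine formula. This is an assumption about one possible check, not something done in this notebook, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within(center, gyms, radius_m=1000):
    """Count gyms whose distance from center is at most radius_m."""
    return sum(
        1 for lat, lng in gyms
        if haversine_m(center[0], center[1], lat, lng) <= radius_m
    )


# Coordinates copied from the outputs above: postal code 990 (Vuosaari)
# and the last five rows of the gyms dataframe.
vuosaari_990 = (60.202969, 25.152177)
gyms = [
    (60.208717, 25.142019),  # iZENZEi Academy Helsinki
    (60.202675, 25.160864),  # Gympark
    (60.203204, 25.160410),  # aurinko punttisali
    (60.213165, 25.155818),  # Sali
    (60.202503, 25.160644),  # RANTA PAJA
]
print(count_within(vuosaari_990, gyms))
```

Only five gyms are used here, so the result is not comparable to the real count for that suburb; the point is the distance check itself.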


## Heat map of competitors in the area

Finally, based on the gathered data, let's create a heat-map-like visualization in which the size of each blue circle represents the number of gyms in that suburb of Helsinki, that is, the number of competitors in the area.

In [27]:
map_helsinki_with_gyms = folium.Map(location=[helsinki_lat, helsinki_long], zoom_start=11)

# adding "heat map circles"
for lat, lng, zip_code, neighborhood, gym_count in zip(gym_count_by_suburb['Neighborhood Latitude'], gym_count_by_suburb['Neighborhood Longitude'], gym_count_by_suburb['Postal Code'], gym_count_by_suburb['Neighborhood'], gym_count_by_suburb['Gym Count']):
    
    label = '{}, {}'.format(zip_code, neighborhood)
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=gym_count,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_helsinki_with_gyms)
    
# adding gyms
for name, lat2, lng2 in zip(all_gyms_in_helsinki_df['Gym Name'], all_gyms_in_helsinki_df['Gym Latitude'], all_gyms_in_helsinki_df['Gym Longitude']):    
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat2, lng2],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#e076a9',
        fill_opacity=0.7,
        parse_html=False).add_to(map_helsinki_with_gyms)

map_helsinki_with_gyms

The map doesn't render on GitHub. To see the map, go here: https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/f6a4e6da-2e18-4d5c-927d-6ab44f9bcba1/view?access_token=cb4033043cfa41e47305e62c3fbed6484e04baf2df3c37a2bf3dfd78949ba8ff

On the map, the red circles mark the locations of individual gyms, and the blue circles give a heat-map-like view of the number of gyms in each suburb of Helsinki.
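
The map code passes the gym count directly as the circle radius in pixels, so a suburb with twice as many gyms gets a circle with four times the area. If the area should grow in proportion to the count instead, the radius can be scaled with a square root. The sketch below is one possible scaling, not the one used for the map above; `scaled_radius` and its default pixel sizes are assumptions.

```python
import math


def scaled_radius(gym_count, max_count, max_radius=30, min_radius=3):
    """Map a gym count to a marker radius in pixels.

    The square root makes the circle area, rather than its radius, grow
    in proportion to the count. Very small counts are clamped to
    min_radius so that the marker stays visible.
    """
    if max_count <= 0:
        return min_radius
    radius = max_radius * math.sqrt(gym_count / max_count)
    return max(min_radius, radius)


for count in (0, 12, 49):
    print(count, round(scaled_radius(count, max_count=49), 1))
```

In the map loop this would replace `radius=gym_count` with `radius=scaled_radius(gym_count, gym_count_by_suburb['Gym Count'].max())`.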

As we can see, the center of Helsinki is pretty busy and already has lots of supply, but there is also lots of demand. To form an even better basis for the decision, one could include the population of each area. However, I didn't have access to that kind of data.
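
If population figures became available, one simple way to combine them with the gym counts would be gyms per 1,000 residents. The sketch below only illustrates the calculation: the gym counts come from the table above, but the population numbers are placeholders and NOT real statistics, and `gyms_per_1000_residents` is a hypothetical helper.

```python
def gyms_per_1000_residents(gym_count, population):
    """Return gyms per 1,000 residents, or None when population is unknown."""
    if not population:
        return None
    return 1000 * gym_count / population


# Gym counts are from the table above; the population figures are
# placeholders, NOT real statistics.
areas = {
    'Punavuori': {'gym_count': 49, 'population': 10000},
    'Eira': {'gym_count': 49, 'population': 2000},
}
for name, row in areas.items():
    print(name, gyms_per_1000_residents(row['gym_count'], row['population']))
```

A lower value would point to an area with fewer gyms relative to its number of residents.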

The research could be continued once such data is available. In the meantime, a person living in Helsinki already has a fairly good picture of the population density of most suburbs. The personal trainer may also have preferences of their own for the location, in which case this data shows how many gyms there already are in those areas.