# Districts of Budapest (WIP)

This is my final capstone project for the IBM Applied Data Science Capstone course on Coursera.

## Introduction
Our (imaginary) business provides consulting services to entrepreneurs and investors who are looking into opening their own businesses in Budapest, Hungary.

After assessing a customer's needs, we would like to recommend the districts where their shop or service is most likely to succeed. To support this, we are going to measure and segment the districts of the Hungarian capital based on the most common venues in each area.

## Getting the data

### List of districts and population density
Scraped from [Wikipedia](https://en.wikipedia.org/wiki/List_of_districts_in_Budapest).

In [2]:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_districts_in_Budapest'
df = pd.read_html(url)[1]

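# Drop the summary row: keep only rows whose District contains 'kerület'.
# .copy() gives an independent frame, so later edits do not trigger SettingWithCopyWarning.
df = df[df['District'].str.contains('kerület')].copy()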

# A quick look at the data
df.head()

Unnamed: 0,District,Name,Population (2016),Area (km2),Population density (people per km2)
0,I. kerület,"Várkerület (""Castle District"")",25196,3.41,7388.8
1,II. kerület,-,89903,36.34,2473.9
2,III. kerület,"Óbuda-Békásmegyer (""Old Buda-Békásmegyer"")",130415,39.7,3285.0
3,IV. kerület,"Újpest (""New Pest"")",101558,18.82,5396.2
4,V. kerület,"Belváros-Lipótváros (""Inner City - Leopold Town"")",26284,2.59,10148.2


In [3]:
df.shape

(23, 5)

The district name, population and area columns are not needed, so we drop them and keep only the population density.

In [4]:
# Reassign instead of using inplace=True, since df was created by filtering another frame
df = df.drop(columns=['Name', 'Population (2016)', 'Area (km2)'])
df = df.rename(columns={'Population density (people per km2)': 'Population density'})

df.head()

Unnamed: 0,District,Population density
0,I. kerület,7388.8
1,II. kerület,2473.9
2,III. kerület,3285.0
3,IV. kerület,5396.2
4,V. kerület,10148.2


### Average cost of business space
Web scraping from [jofogas.hu](https://jofogas.hu), a popular real estate and bartering website.

### Most common venues
Retrieved from the Foursquare API.

In [5]:
import os
from urllib.parse import urlencode
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
import requests
import json

In [6]:
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20200411' # Foursquare API version

In [8]:
def get_url(endpoint, **params):
    """Build a Foursquare API URL for the given endpoint and query parameters."""
    api_endpoint = {
        'explore': "https://api.foursquare.com/v2/venues/explore?"
    }[endpoint]

    # Credentials and API version are sent with every request
    credentials = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
    }

    return api_endpoint + urlencode({**credentials, **params})


def query_district(district_name, venue_data):
    print('Getting venues for: ' + district_name)
    url = get_url('explore', near='Budapest, ' + district_name, limit=50)
    
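    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors instead of failing on a missing key below
    results = response.json()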
    
    for item in results['response']['groups'][0]['items']:
        venue = item['venue']
        venue_data.append((
            district_name,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            venue['categories'][0]['name'],
        ))
        
def get_foursquare_data():
    venue_data = []
    for district in df['District']:
        query_district(district, venue_data)
        
    venue_df = pd.DataFrame(
        venue_data,
        columns=["District", "Name", "Latitude", "Longitude", "Category"],
    )
    
    return venue_df
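
The parsing step inside `query_district` assumes that every venue has at least one category and that the response always contains a group. The following is a minimal sketch of a more defensive version of that step, kept separate from the network call so it can be tried offline. The helper name `extract_venues` and the sample payload are hypothetical; the payload only mirrors the keys that `query_district` reads and is not real Foursquare output.

```python
def extract_venues(district_name, payload):
    """Flatten one explore-style response into (district, name, lat, lng, category) tuples."""
    rows = []
    # Like query_district, look only at the first group, but tolerate its absence
    groups = payload.get('response', {}).get('groups', [])
    for group in groups[:1]:
        for item in group.get('items', []):
            venue = item['venue']
            # Some venues may have no category; record None instead of raising IndexError
            categories = venue.get('categories') or []
            category = categories[0]['name'] if categories else None
            rows.append((
                district_name,
                venue['name'],
                venue['location']['lat'],
                venue['location']['lng'],
                category,
            ))
    return rows


# Hypothetical payload with made-up venues, shaped like the keys used above
sample_payload = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Example Cafe',
                           'location': {'lat': 47.50, 'lng': 19.04},
                           'categories': [{'name': 'Coffee Shop'}]}},
                {'venue': {'name': 'Uncategorized Place',
                           'location': {'lat': 47.51, 'lng': 19.05},
                           'categories': []}},
            ]
        }]
    }
}

rows = extract_venues('I. kerület', sample_payload)
for row in rows:
    print(row)
```

Rows with a `None` category could then be dropped or relabeled before the one-hot encoding step.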

In [9]:
venues = get_foursquare_data()

Getting venues for: I. kerület
Getting venues for: II. kerület
Getting venues for: III. kerület
Getting venues for: IV. kerület
Getting venues for: V. kerület
Getting venues for: VI. kerület
Getting venues for: VII. kerület
Getting venues for: VIII. kerület
Getting venues for: IX. kerület
Getting venues for: X. kerület
Getting venues for: XI. kerület
Getting venues for: XII. kerület
Getting venues for: XIII. kerület
Getting venues for: XIV. kerület
Getting venues for: XV. kerület
Getting venues for: XVI. kerület
Getting venues for: XVII. kerület
Getting venues for: XVIII. kerület
Getting venues for: XIX. kerület
Getting venues for: XX. kerület
Getting venues for: XXI. kerület
Getting venues for: XXII. kerület
Getting venues for: XXIII. kerület


In [10]:
print(venues.shape)
venues.head()

(1150, 5)


Unnamed: 0,District,Name,Latitude,Longitude,Category
0,I. kerület,Budavári Palota,47.496198,19.039543,Castle
1,I. kerület,Zhao Zhou Teashop & Lab,47.497354,19.041026,Tea Room
2,I. kerület,Magyar Nemzeti Galéria | Hungarian National Ga...,47.496082,19.039468,Art Museum
3,I. kerület,Hotel Clark,47.498507,19.040412,Hotel
4,I. kerület,Várhegy,47.49757,19.038747,Scenic Lookout


In [11]:
# Find the city center, for map visualization
budapest_center = {
    'Latitude': (venues['Latitude'].min() + venues['Latitude'].max())/2,
    'Longitude': (venues['Longitude'].min() + venues['Longitude'].max())/2,
}
zoom_start=11

budapest_center


{'Latitude': 47.46947086504526, 'Longitude': 19.155755226305104}

In [12]:
# Approximate the center of each district by the median coordinates of its venues (for map visualization)
districts = venues[['District', 'Latitude', 'Longitude']].groupby('District').median().reset_index()

In [13]:
onehot = pd.get_dummies(venues[['Category']], prefix="", prefix_sep="")

onehot['District'] = venues['District'] 

# move District column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

# group by district: the mean of each one-hot column is that category's share of the district's venues
grouped = onehot.groupby('District').mean().reset_index()

grouped.head()


Unnamed: 0,District,Afghan Restaurant,Airport,Airport Service,American Restaurant,Animal Shelter,Aquarium,Art Museum,Arts & Crafts Store,Arts & Entertainment,...,Video Game Store,Video Store,Vietnamese Restaurant,Water Park,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio,Zoo,Zoo Exhibit
0,I. kerület,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.04,0.02,0.0,0.0,0.0
1,II. kerület,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0,0.0
2,III. kerület,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0
3,IV. kerület,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0
4,IX. kerület,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.02,0.0,0.0,0.02,0.02,0.0,0.0,0.0


My idea here is to keep the 10 features with the highest standard deviation, since they are the most likely to indicate differences between the individual districts. Normalizing the data by row should make those differences even easier to spot.
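
As a small illustration of the selection step, the sketch below ranks categories by their standard deviation across districts and keeps the top two. The districts and numbers are made up for the example; only the technique matches the notebook.

```python
import pandas as pd

# Toy per-district category shares (made-up numbers, not the real data)
toy = pd.DataFrame({
    'District': ['A', 'B', 'C'],
    'Hotel':  [0.10, 0.00, 0.02],
    'Park':   [0.02, 0.02, 0.02],
    'Bakery': [0.00, 0.08, 0.04],
})

# Standard deviation of each category across districts
spread = toy.drop(columns='District').std()

# Keep the two categories that vary the most between districts
top_features = spread.nlargest(2).index.tolist()
print(top_features)  # ['Hotel', 'Bakery']
```

`Park` is dropped because it has the same share everywhere, so it cannot help tell the districts apart.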

In [23]:
from sklearn.preprocessing import Normalizer

# Drop insignificant venues
to_keep = list(grouped.describe().transpose().nlargest(10, 'std').index)
features = grouped[to_keep]

# Normalize the rest by row
scaled_features = pd.DataFrame(Normalizer().fit_transform(features), index=features.index, columns=features.columns)

scaled_features

Unnamed: 0,Supermarket,Hotel,Coffee Shop,Hungarian Restaurant,Restaurant,Gym / Fitness Center,Bakery,Grocery Store,Dessert Shop,Beer Garden
0,0.0,0.780869,0.312348,0.468521,0.156174,0.156174,0.0,0.0,0.156174,0.0
1,0.0,0.169031,0.338062,0.0,0.338062,0.169031,0.676123,0.0,0.507093,0.0
2,0.179605,0.0,0.179605,0.718421,0.179605,0.538816,0.179605,0.0,0.179605,0.179605
3,0.152499,0.0,0.304997,0.152499,0.304997,0.457496,0.304997,0.304997,0.609994,0.0
4,0.223607,0.0,0.67082,0.0,0.0,0.223607,0.223607,0.0,0.447214,0.447214
5,0.0,0.557086,0.185695,0.557086,0.557086,0.0,0.0,0.0,0.185695,0.0
6,0.0,0.19245,0.7698,0.19245,0.3849,0.0,0.19245,0.0,0.3849,0.0
7,0.0,0.58346,0.437595,0.145865,0.58346,0.0,0.145865,0.0,0.29173,0.0
8,0.0,0.426401,0.852803,0.0,0.0,0.213201,0.0,0.0,0.0,0.213201
9,0.714286,0.142857,0.142857,0.571429,0.0,0.142857,0.285714,0.0,0.142857,0.0
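
To see what the row-wise normalization does, the sketch below reproduces the first row by hand. It assumes `Normalizer` is used with its default L2 norm, and takes the raw shares for I. kerület from the `results` table (0.1 for Hotel, 0.04 for Coffee Shop, and so on).

```python
import numpy as np

# Raw category shares for I. kerület, in the column order
# Supermarket, Hotel, Coffee Shop, Hungarian Restaurant, Restaurant,
# Gym / Fitness Center, Bakery, Grocery Store, Dessert Shop, Beer Garden
row = np.array([0.0, 0.10, 0.04, 0.06, 0.02, 0.02, 0.0, 0.0, 0.02, 0.0])

# L2 normalization: divide the row by its Euclidean length
norm = np.sqrt((row ** 2).sum())
scaled = row / norm

print(np.round(scaled, 6))
```

The Hotel entry comes out at about 0.780869, matching the first row of the table. After this step every district vector has length 1, so k-means compares the mix of categories rather than their absolute shares.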


In [24]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scaled_features)

# add clustering labels
results = pd.concat([grouped[['District']], districts[['Latitude', 'Longitude']],  features], axis=1)
results.insert(1, 'Cluster Labels', kmeans.labels_)
results

Unnamed: 0,District,Cluster Labels,Latitude,Longitude,Supermarket,Hotel,Coffee Shop,Hungarian Restaurant,Restaurant,Gym / Fitness Center,Bakery,Grocery Store,Dessert Shop,Beer Garden
0,I. kerület,0,47.497504,19.039827,0.0,0.1,0.04,0.06,0.02,0.02,0.0,0.0,0.02,0.0
1,II. kerület,2,47.511077,19.030827,0.0,0.02,0.04,0.0,0.04,0.02,0.08,0.0,0.06,0.0
2,III. kerület,1,47.538686,19.043071,0.02,0.0,0.02,0.08,0.02,0.06,0.02,0.0,0.02,0.02
3,IV. kerület,2,47.561244,19.089207,0.02,0.0,0.04,0.02,0.04,0.06,0.04,0.04,0.08,0.0
4,IX. kerület,3,47.484081,19.066096,0.02,0.0,0.06,0.0,0.0,0.02,0.02,0.0,0.04,0.04
5,V. kerület,0,47.499897,19.051031,0.0,0.12,0.04,0.12,0.12,0.0,0.0,0.0,0.04,0.0
6,VI. kerület,3,47.503376,19.0647,0.0,0.02,0.08,0.02,0.04,0.0,0.02,0.0,0.04,0.0
7,VII. kerület,0,47.499658,19.06675,0.0,0.08,0.06,0.02,0.08,0.0,0.02,0.0,0.04,0.0
8,VIII. kerület,3,47.489445,19.070286,0.0,0.04,0.08,0.0,0.0,0.02,0.0,0.0,0.0,0.02
9,X. kerület,1,47.470173,19.145878,0.1,0.02,0.02,0.08,0.0,0.02,0.04,0.0,0.02,0.0


In [25]:
from IPython.display import display, Markdown, HTML

for i in range(kclusters):
    display(Markdown(f'### Cluster {i}'))
    display(HTML(results[results['Cluster Labels'] == i].to_html()))
    profile = results[results['Cluster Labels'] == i].sum(axis=0).to_frame()
    profile.columns = ['Feature']
    # Keep only the venue-category rows (index= already names the axis, so axis=1 is not needed)
    profile.drop(index=['District', 'Cluster Labels', 'Latitude', 'Longitude'], inplace=True)
    display(HTML(profile.sort_values('Feature', ascending=False).to_html()))

### Cluster 0

Unnamed: 0,District,Cluster Labels,Latitude,Longitude,Supermarket,Hotel,Coffee Shop,Hungarian Restaurant,Restaurant,Gym / Fitness Center,Bakery,Grocery Store,Dessert Shop,Beer Garden
0,I. kerület,0,47.497504,19.039827,0.0,0.1,0.04,0.06,0.02,0.02,0.0,0.0,0.02,0.0
5,V. kerület,0,47.499897,19.051031,0.0,0.12,0.04,0.12,0.12,0.0,0.0,0.0,0.04,0.0
7,VII. kerület,0,47.499658,19.06675,0.0,0.08,0.06,0.02,0.08,0.0,0.02,0.0,0.04,0.0


Unnamed: 0,Feature
Hotel,0.3
Restaurant,0.22
Hungarian Restaurant,0.2
Coffee Shop,0.14
Dessert Shop,0.1
Gym / Fitness Center,0.02
Bakery,0.02
Supermarket,0.0
Grocery Store,0.0
Beer Garden,0.0


### Cluster 1

Unnamed: 0,District,Cluster Labels,Latitude,Longitude,Supermarket,Hotel,Coffee Shop,Hungarian Restaurant,Restaurant,Gym / Fitness Center,Bakery,Grocery Store,Dessert Shop,Beer Garden
2,III. kerület,1,47.538686,19.043071,0.02,0.0,0.02,0.08,0.02,0.06,0.02,0.0,0.02,0.02
9,X. kerület,1,47.470173,19.145878,0.1,0.02,0.02,0.08,0.0,0.02,0.04,0.0,0.02,0.0
14,XIX. kerület,1,47.459683,19.14695,0.08,0.04,0.04,0.08,0.02,0.04,0.02,0.02,0.02,0.0
17,XVII. kerület,1,47.482866,19.256146,0.12,0.02,0.0,0.02,0.04,0.02,0.0,0.02,0.1,0.0
18,XVIII. kerület,1,47.447584,19.166829,0.08,0.02,0.02,0.06,0.02,0.04,0.02,0.04,0.0,0.02
20,XXI. kerület,1,47.43245,19.070588,0.02,0.0,0.0,0.02,0.06,0.02,0.02,0.08,0.02,0.08
21,XXII. kerület,1,47.427965,19.036773,0.06,0.0,0.0,0.02,0.02,0.06,0.06,0.06,0.0,0.06
22,XXIII. kerület,1,47.417383,19.107207,0.08,0.0,0.0,0.04,0.04,0.02,0.0,0.0,0.04,0.06


Unnamed: 0,Feature
Supermarket,0.56
Hungarian Restaurant,0.4
Gym / Fitness Center,0.28
Beer Garden,0.24
Restaurant,0.22
Grocery Store,0.22
Dessert Shop,0.22
Bakery,0.18
Hotel,0.1
Coffee Shop,0.1


### Cluster 2

Unnamed: 0,District,Cluster Labels,Latitude,Longitude,Supermarket,Hotel,Coffee Shop,Hungarian Restaurant,Restaurant,Gym / Fitness Center,Bakery,Grocery Store,Dessert Shop,Beer Garden
1,II. kerület,2,47.511077,19.030827,0.0,0.02,0.04,0.0,0.04,0.02,0.08,0.0,0.06,0.0
3,IV. kerület,2,47.561244,19.089207,0.02,0.0,0.04,0.02,0.04,0.06,0.04,0.04,0.08,0.0
11,XII. kerület,2,47.491135,19.01997,0.0,0.0,0.02,0.0,0.0,0.06,0.1,0.0,0.02,0.04
13,XIV. kerület,2,47.51793,19.106544,0.02,0.0,0.0,0.02,0.0,0.1,0.04,0.0,0.04,0.04
15,XV. kerület,2,47.561733,19.100293,0.02,0.0,0.06,0.02,0.04,0.06,0.04,0.08,0.08,0.0
16,XVI. kerület,2,47.505499,19.145488,0.04,0.0,0.04,0.02,0.02,0.04,0.02,0.0,0.04,0.0
19,XX. kerület,2,47.435619,19.100047,0.06,0.0,0.0,0.0,0.04,0.02,0.06,0.04,0.04,0.04


Unnamed: 0,Feature
Bakery,0.38
Gym / Fitness Center,0.36
Dessert Shop,0.36
Coffee Shop,0.2
Restaurant,0.18
Supermarket,0.16
Grocery Store,0.16
Beer Garden,0.12
Hungarian Restaurant,0.08
Hotel,0.02


### Cluster 3

Unnamed: 0,District,Cluster Labels,Latitude,Longitude,Supermarket,Hotel,Coffee Shop,Hungarian Restaurant,Restaurant,Gym / Fitness Center,Bakery,Grocery Store,Dessert Shop,Beer Garden
4,IX. kerület,3,47.484081,19.066096,0.02,0.0,0.06,0.0,0.0,0.02,0.02,0.0,0.04,0.04
6,VI. kerület,3,47.503376,19.0647,0.0,0.02,0.08,0.02,0.04,0.0,0.02,0.0,0.04,0.0
8,VIII. kerület,3,47.489445,19.070286,0.0,0.04,0.08,0.0,0.0,0.02,0.0,0.0,0.0,0.02
10,XI. kerület,3,47.47589,19.04508,0.0,0.0,0.12,0.02,0.0,0.0,0.0,0.04,0.02,0.02
12,XIII. kerület,3,47.529305,19.077385,0.02,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.04,0.0


Unnamed: 0,Feature
Coffee Shop,0.44
Gym / Fitness Center,0.14
Dessert Shop,0.14
Beer Garden,0.08
Hotel,0.06
Supermarket,0.04
Hungarian Restaurant,0.04
Restaurant,0.04
Bakery,0.04
Grocery Store,0.04


In [26]:
clusters = [
    "Hotels and Restaurants",
    "Supermarkets",
    "Coffee shops and Gyms",
    "A little bit of everything"
]

District boundaries for the map: https://data2.openstreetmap.hu/hatarok/index.php?admin=9


In [32]:
budapest_map = folium.Map(location=[budapest_center['Latitude'], budapest_center['Longitude']], zoom_start=zoom_start)
color_map = [colors.rgb2hex(i) for i in cm.gist_rainbow(np.linspace(0, 1, kclusters))]


# Shade each district with the color of its cluster
for district, cluster in zip(results['District'], results['Cluster Labels']):
    district_name = district.replace("kerület", "kerulet")
    geojson = "./districts/" + district_name + ".geojson"
    color = color_map[cluster]

    # Map.choropleth() is deprecated and missing from recent folium releases;
    # the folium.Choropleth class is its replacement
    folium.Choropleth(
        geo_data=geojson,
        fill_color=color,
        fill_opacity=0.7,
        line_opacity=0.2,
    ).add_to(budapest_map)

for lat, lng, district, cluster in zip(results['Latitude'], results['Longitude'], results['District'], results['Cluster Labels']):
    district_name = district.replace("kerület", "District")

    # parse_html is an option of Popup, not of CircleMarker
    label = folium.Popup(f"<h5>{district_name}</h5><hr />{clusters[cluster]}", parse_html=False)

    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color="blue",
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
    ).add_to(budapest_map)

budapest_map