# Comparison of EU member capitals

## Introduction

This project aims to group similar capitals of EU countries into clusters and to compare similar neighbourhoods within those clusters, to help people who want to live abroad or simply discover places similar to the ones they love.

To achieve this, the model will consider the types of venues in each neighbourhood of each city, retrieved using the Foursquare API, and the cost of living in those cities, based on the NUMBEO Cost of Living Index.

## Data pre-processing

In [1]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import folium # plot maps
from geopy.geocoders import Nominatim # get coordinates
import numpy as np

### Neighbourhood Data

I collected these values manually and uploaded a CSV file to my website's server. Let's first import the table.

__Import neighbourhood table__

In [2]:
nbh_url = 'https://victorrodrigues.com.br/wp-content/uploads/2020/01/EU_capitals_neighborhoods.csv'
nbh = pd.read_csv(nbh_url, sep=';', decimal='.', thousands=',')

Let's see what it looks like:

In [3]:
nbh.head()

Unnamed: 0,Country,City,Borough,Neighbourhood,Area,Population,Density
0,Malta,Valletta,Valletta,,0.8,6444.0,8055
1,Luxembourg,Luxembourg City,Beggen,,1.7091,3746.0,2192
2,Luxembourg,Luxembourg City,Belair,,1.718,11494.0,6690
3,Luxembourg,Luxembourg City,North Bonnevoie-Verlorenkost,,0.6776,4296.0,6340
4,Luxembourg,Luxembourg City,South Bonnevoie,,2.3921,12734.0,5323


In [4]:
nbh.describe(include='all')

Unnamed: 0,Country,City,Borough,Neighbourhood,Area,Population,Density
count,485,485,479,152,475.0,479.0,485
unique,28,28,354,146,,,466
top,Germany,Berlin,Lefkosía,Lichtenberg,,,#DIV/0!
freq,108,108,19,2,,,10
mean,,,,,20.602107,79513.720251,
std,,,,,34.242399,89845.617733,
min,,,,,0.03,50.0,
25%,,,,,4.915,16555.0,
50%,,,,,9.31,47414.0,
75%,,,,,24.35,103099.0,
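
The summary above hints at a data-quality problem: Density was read as text (its most frequent value is the spreadsheet error `#DIV/0!`, which appears 10 times), and Area is missing for 10 of the 485 rows. One way to deal with this, sketched below on a small made-up frame, is to recompute density from Population and Area. The helper name `recompute_density` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def recompute_density(frame):
    """Return Population / Area, with NaN where Area is missing or zero."""
    area = pd.to_numeric(frame['Area'], errors='coerce')
    population = pd.to_numeric(frame['Population'], errors='coerce')
    # Replace zero areas with NaN so the division never produces inf
    return population / area.replace(0, np.nan)

# Made-up rows covering the three cases: valid, zero area, missing area
sample = pd.DataFrame({
    'Borough': ['A', 'B', 'C'],
    'Area': [2.0, 0.0, None],
    'Population': [1000.0, 500.0, 300.0],
})
sample['Density'] = recompute_density(sample)
```

Rows whose density cannot be computed end up as `NaN`, which pandas treats as missing in later aggregations.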


__Grouping the table__

I couldn't find data for all neighbourhoods, so I decided to work at the Borough level of the table. For that, we'll group the table:

In [5]:
# Sum only Area and Population (Neighbourhood and Density are not numeric)
df = nbh.groupby(['Country','City','Borough'], as_index=False)[['Area', 'Population']].sum()
df.head()

Unnamed: 0,Country,City,Borough,Area,Population
0,Austria,Vienna,Alsergrund,2.99,41958.0
1,Austria,Vienna,Brigittenau,5.68,86502.0
2,Austria,Vienna,Donaustadt,102.34,191008.0
3,Austria,Vienna,Döbling,24.9,72947.0
4,Austria,Vienna,Favoriten,31.8,204142.0
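
One thing to keep in mind: by default `groupby` drops rows whose grouping keys are missing, and the summary earlier shows only 479 non-null Borough values out of 485 rows. Below is a minimal sketch, on made-up data, of how to count those rows and, if desired, keep them with `dropna=False` (available in pandas 1.1 and later). The frame `toy` and its values are placeholders.

```python
import pandas as pd

# Made-up rows; one of them has no Borough
toy = pd.DataFrame({
    'Country': ['X', 'X', 'X', 'Y'],
    'City': ['P', 'P', 'P', 'Q'],
    'Borough': ['b1', 'b1', None, 'b2'],
    'Area': [1.0, 2.0, 5.0, 4.0],
    'Population': [10.0, 20.0, 50.0, 40.0],
})
keys = ['Country', 'City', 'Borough']

# How many rows would be silently dropped by the default grouping?
n_missing_borough = int(toy['Borough'].isna().sum())

# Default behaviour: the row without a Borough disappears
default_groups = toy.groupby(keys, as_index=False)[['Area', 'Population']].sum()

# dropna=False keeps it as its own group with a missing Borough
all_groups = toy.groupby(keys, as_index=False, dropna=False)[['Area', 'Population']].sum()
```

Whether to keep or drop such rows is a judgment call; the point is only to make the choice explicit.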


__Getting the coordinates of each borough__

Create a function to get the coordinates

In [6]:
geolocator = Nominatim(user_agent="eu_capitals")

def get_coordinates(borough, city, country, max_tries=5):
    """Return 'lat,lng' for a borough, or 'None,None' if it cannot be geocoded."""
    place = borough + ', ' + city + ', ' + country
    lat = None
    lng = None
    for _ in range(max_tries):  # in case of error, try up to max_tries times
        try:
            location = geolocator.geocode(place)
        except Exception:  # e.g. a timeout or service error: try again
            continue
        if location is not None:  # geocode returns None when nothing matches
            lat = location.latitude
            lng = location.longitude
        break
    coordinates = str(lat) + ',' + str(lng)
    return coordinates
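
The public Nominatim service allows at most one request per second, so it can help to pause between attempts. The sketch below is one possible way to separate the retry logic from the geocoder: `geocode_with_retry` is a hypothetical helper that accepts any callable (for instance `geolocator.geocode`), and `flaky_geocode` with its coordinates is a made-up stand-in so the example runs without network access.

```python
import time
from collections import namedtuple

def geocode_with_retry(geocode, place, max_tries=5, pause=1.0):
    """Call geocode(place), retrying on exceptions.

    Returns (lat, lng), or (None, None) if nothing was found or every
    attempt failed.
    """
    for _ in range(max_tries):
        try:
            location = geocode(place)
        except Exception:
            time.sleep(pause)  # wait before the next attempt
            continue
        if location is None:  # no match: retrying will not help
            return (None, None)
        return (location.latitude, location.longitude)
    return (None, None)

# Stand-in for geolocator.geocode: fails twice, then answers
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])
calls = {'n': 0}

def flaky_geocode(place):
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError('simulated timeout')
    return FakeLocation(48.2, 16.4)  # made-up coordinates

result = geocode_with_retry(flaky_geocode, 'Alsergrund, Vienna, Austria', pause=0)
```

Returning a tuple instead of a joined string would also make the later split step unnecessary, at the cost of changing the cells that follow.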

Apply the function to the dataframe

In [7]:
df['coordinates'] = df.apply(lambda x : get_coordinates(x.Borough, x.City, x.Country), axis=1)

Split the coordinates column into two

In [8]:
coords = df['coordinates'].str.split(",", n=1, expand=True)  # split once, reuse for both columns
df['lat'] = coords[0]
df['lng'] = coords[1]
df.drop(columns=['coordinates'], inplace=True)
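
Note that after the split both columns still hold strings, including the literal text `'None'` for failed lookups. If numeric columns are preferred, one option is `pd.to_numeric` with `errors='coerce'`, sketched here on a made-up frame so that the notebook's `df` is left untouched:

```python
import pandas as pd

# Made-up frame: one successful lookup and one failed lookup
toy = pd.DataFrame({'coordinates': ['48.2,16.4', 'None,None']})

parts = toy['coordinates'].str.split(',', n=1, expand=True)
# errors='coerce' turns anything that is not a number (here 'None') into NaN
toy['lat'] = pd.to_numeric(parts[0], errors='coerce')
toy['lng'] = pd.to_numeric(parts[1], errors='coerce')

# Failed lookups can then be found with isna() instead of a string comparison
missing = toy[toy['lat'].isna()]
```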

Check for boroughs whose coordinates could not be found

In [9]:
df[df.lat == 'None']

Unnamed: 0,Country,City,Borough,Area,Population,lat,lng
89,Cyprus,Nicosia,Synoikismós Anthoúpolis,0.44,1756.0,,
91,Cyprus,Nicosia,Énkomi,9.49,18010.0,,
114,Denmark,Copenhagen,Amager Vest,19.18,71755.0,,
115,Denmark,Copenhagen,Amager Øst,9.11,57673.0,,
122,Denmark,Copenhagen,Vesterbro/Kongens Enghave,8.18,67884.0,,
179,Hungary,Budapest,"Rákospalota, Pestújhely, Újpalota",26.95,79779.0,,
185,Hungary,Budapest,"Árpádföld, Cinkota, Mátyásföld, Sashalom, Ráko...",33.52,68235.0,,
189,Latvia,Riga,Central District,3.0,26466.0,,
190,Latvia,Riga,Kurzeme District,79.0,134817.0,,
191,Latvia,Riga,Latgale Suburb,50.0,197166.0,,
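
Several of the failures above are composite names joined by a slash or by commas (for example `Vesterbro/Kongens Enghave`). One possible fallback, which is my assumption and not something the notebook does, is to retry with each part of the name on its own. `candidate_queries` is a hypothetical helper that only builds the search strings:

```python
import re

def candidate_queries(borough, city, country):
    """Return search strings, from the full name to its individual parts."""
    queries = [f'{borough}, {city}, {country}']
    # Split composite names on commas and slashes
    parts = [p.strip() for p in re.split(r'[,/]', borough) if p.strip()]
    if len(parts) > 1:
        queries.extend(f'{part}, {city}, {country}' for part in parts)
    return queries

composite = candidate_queries('Vesterbro/Kongens Enghave', 'Copenhagen', 'Denmark')
simple = candidate_queries('Amager Vest', 'Copenhagen', 'Denmark')
```

Each string could then be passed to the geocoder in turn until one returns a location. Names that fail for other reasons, such as `Amager Vest`, would still need to be fixed by hand.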


__Create a map to visualize locations__

In [None]:
latitude = 44.8584319
longitude = 3.1515146

# create a map of Europe using the latitude and longitude values above
map_eu = folium.Map(location=[latitude, longitude], zoom_start=4)

# add markers to the map, skipping boroughs whose coordinates were not found
located = df[df['lat'] != 'None']
for lat, lng, borough, city in zip(located['lat'], located['lng'], located['Borough'], located['City']):
    label = '{}, {}'.format(borough, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [float(lat), float(lng)],  # coordinates were stored as strings
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_eu)

map_eu

### Cost of living index

We'll use the Cost of Living Index from NUMBEO as a variable for comparing cities (not neighbourhoods). Below is a short explanation of this index, provided by NUMBEO at [this link](https://www.numbeo.com/cost-of-living/cpi_explained.jsp):
> These indices are relative to New York City (NYC). Which means that for New York City, each index should be 100(%). If another city has, for example, rent index of 120, it means that on an average in that city rents are 20% more expensive than in New York City. If a city has rent index of 70, that means on an average in that city rents are 30% less expensive than in New York City.
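
To make the arithmetic in the quote concrete, here is a tiny sketch that turns an index value into a percentage difference from New York City. The function names are my own and purely illustrative.

```python
def percent_vs_nyc(index_value):
    """Percentage difference from New York City, whose index is 100 by definition."""
    return index_value - 100.0

def describe_index(index_value):
    """Describe an index value in words, following NUMBEO's explanation."""
    diff = percent_vs_nyc(index_value)
    if diff > 0:
        return f'{diff:g}% more expensive than NYC'
    if diff < 0:
        return f'{-diff:g}% less expensive than NYC'
    return 'same as NYC'

print(describe_index(120))  # the rent index of 120 from the quote
print(describe_index(70))   # the rent index of 70 from the quote
```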

In [None]:
url_cost_living = 'https://www.numbeo.com/cost-of-living/rankings_current.jsp'
cost_living = pd.read_html(url_cost_living)
cost_living

In [None]:
# Keep only the ranking table (the third table on the page) and drop its first column
cost_living = cost_living[2].iloc[:,1:]
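
Picking the table by position works as long as the page layout stays the same. A slightly more defensive option is to choose the table by its column names. In this sketch `pick_table` is a hypothetical helper, the `Cost of Living Index` column name is an assumption about what the page returns, and the two small frames stand in for the list produced by `pd.read_html`:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame that contains all the required columns."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError(f'no table has columns {list(required_columns)}')

# Stand-ins for the list of tables returned by pd.read_html (made-up content)
tables = [
    pd.DataFrame({'Menu': ['a']}),
    pd.DataFrame({
        'Rank': [1],
        'City': ['Sample City, Sample Country'],
        'Cost of Living Index': [100.0],
    }),
]
ranking = pick_table(tables, ['City', 'Cost of Living Index'])
```

If no table matches, the helper raises an error instead of silently returning the wrong table.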

In [None]:
# Rename the column City
cost_living = cost_living.rename(columns={'City' : 'CountryCity'})
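
The new name suggests that this column holds both the city and the country in one string. Assuming the format is `City, Country` with the country after the last comma, the two parts could be separated as sketched below, which would make it possible to match rows against the `City` and `Country` columns of the borough table. The sample strings and index numbers are placeholders.

```python
import pandas as pd

# Placeholder rows; the index numbers are not real NUMBEO values
toy = pd.DataFrame({
    'CountryCity': ['Vienna, Austria', 'New York, NY, United States'],
    'Cost of Living Index': [1.0, 2.0],
})

# Split on the last comma only, so names that contain commas stay intact
parts = toy['CountryCity'].str.rsplit(',', n=1, expand=True)
toy['City'] = parts[0].str.strip()
toy['Country'] = parts[1].str.strip()
```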

In [None]:
# Let's check that everything is OK
cost_living.head()