# The MVP

### We will create a function that takes a city name and a feature and produces a list of similar cities, together with a map visualising the result.

- We create the feature options (i.e. similar summers = sunshine_hours + high_temperatures).

### Create feature list (designed by us)
- E.g. the user gives a city and the features they are interested in, and we return a list of similar cities based on that input.
 
Use-case:
Let's say you live in Berlin but would like to move to a different city. There are things you like about Berlin and things you don't. Ideally, the next city you move to will still have the things you like about Berlin (or something similar), with less of the things you do not like.
Say you like the autumns in Berlin and the cost of living, but you would like to find a place with better safety. Our function will take the city and the features you like and provide a list of cities with similar levels of those features.

features = cost_of_living (soc_ec) and autumn (weather)
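
One way to hold the feature options is a plain dictionary that maps each user-facing option to the data columns behind it. The sketch below is only an illustration: the names `FEATURE_OPTIONS` and `expand_features` are hypothetical, and the grouping of columns under each option is an assumption (the column names themselves come from the `city_data.info()` output in section 1.1).

```python
# Hypothetical mapping from a user-facing option to the columns behind it.
# The grouping is an assumption made for illustration only.
FEATURE_OPTIONS = {
    'summer': ['summer_sun_hrs', 'summer_high'],
    'autumn': ['autumn_high', 'autumn_prec_days', 'autumn_sun_hrs'],
    'cost_of_living': ['cost_of_living'],
}


def expand_features(choices, options=FEATURE_OPTIONS):
    """Turn a list of option names into a de-duplicated list of column names."""
    columns = []
    for choice in choices:
        for column in options[choice]:
            if column not in columns:
                columns.append(column)
    return columns


# The use-case above: autumn weather plus cost of living
print(expand_features(['autumn', 'cost_of_living']))
# ['autumn_high', 'autumn_prec_days', 'autumn_sun_hrs', 'cost_of_living']
```

The resulting list can be passed as `col_list` to the clustering functions defined in section 2.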

## 1. Data and custom features
### 1.1 Import the sample data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# clustering
from sklearn.cluster import KMeans

# displaying on a map
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from IPython.display import HTML, Image, display

In [2]:
city_data = pd.read_json('../data/Combined_data.json')

In [3]:
city_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33 entries, 0 to 32
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   city                   33 non-null     object 
 1   autumn_high            33 non-null     int64  
 2   autumn_prec_days       33 non-null     int64  
 3   autumn_sun_hrs         33 non-null     int64  
 4   spring_high            33 non-null     int64  
 5   spring_prec_days       33 non-null     int64  
 6   spring_sun_hrs         33 non-null     int64  
 7   summer_high            33 non-null     int64  
 8   summer_prec_days       33 non-null     int64  
 9   summer_sun_hrs         33 non-null     int64  
 10  winter_high            33 non-null     int64  
 11  winter_prec_days       33 non-null     int64  
 12  winter_sun_hrs         33 non-null     int64  
 13  climate                33 non-null     float64
 14  cost_of_living         33 non-null     float64
 15  health_c

## 2. The MVP function
### 2.1 First creating mini-functions for the cluster groups and information lists for single feature input

In [4]:
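# Getting the clusters for one feature

def get_clusters(df, col, cluster_no=4):
    # col arrives as a one-element list, e.g. ['safety']
    col = col[0]
    # KMeans expects a 2-D input, so keep the feature as a one-column DataFrame
    X = df[[col]]
    clusters = KMeans(n_clusters=cluster_no)
    clusters.fit(X)
    
    # attach each city's cluster label and keep only the columns needed downstream
    df2 = df.assign(cluster=clusters.predict(X))
    df2 = df2[['city', 'lat', 'lng', 'cluster', col]]
    
    return df2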
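
KMeans starts from randomly chosen centroids, so the cluster numbers (and occasionally the groups themselves) can change from one run to the next. A minimal sketch of how the result could be made repeatable, assuming a fixed `random_state` is acceptable for this project; the seed value 42 is arbitrary, and the six safety values are copied from the output of `In [7]`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Safety scores for London, Paris, Berlin, Belgrade, Lisbon and Helsinki
X = np.array([[47.44], [48.03], [58.92], [62.02], [71.94], [77.24]])

# Fixing random_state (and n_init) makes repeated fits give identical labels
first = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
second = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print((first == second).all())  # True
```

The same two keyword arguments could be added to the `KMeans(...)` call inside `get_clusters`.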


In [5]:
clustered = get_clusters(city_data, ['safety'])
clustered.head()

Unnamed: 0,city,lat,lng,cluster,safety
0,Amsterdam,52.35,4.916667,0,67.32
1,Athens,37.983333,23.733333,2,50.49
2,Belgrade,44.833333,20.5,1,62.02
3,Berlin,52.516667,13.4,1,58.92
4,Bratislava,48.15,17.116667,0,68.68


In [6]:
# Getting the list for one feature

def get_list(df, col):
    col = col[0]
    separator = '--------------------------------------------------------------------------------------------'
    
    print(f'Clustering based on: {col}\n')
    # one block per cluster: {city: value} pairs, then the rounded cluster mean
    for i, k in enumerate(sorted(df['cluster'].unique())):
        cluster = df.loc[df['cluster'] == k]
        zipped = dict(zip(cluster['city'], cluster[col]))
        mean = round(cluster[col].mean())
        
        if i > 0:
            print(separator)
        print(zipped)
        print(f'\nThe average {col} of cluster {k} is {mean}')
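
As an alternative to printing, the per-cluster figures can be returned as a table with `groupby`. This is only a sketch: `summarise_clusters` is a hypothetical helper, and the sample rows are the first five cities from the `In [5]` output.

```python
import pandas as pd


def summarise_clusters(df, col):
    """Return count, mean, min and max of `col` for each cluster label."""
    return (df.groupby('cluster')[col]
              .agg(['count', 'mean', 'min', 'max'])
              .round(2))


# First five rows of the clustered safety data shown above
sample = pd.DataFrame({
    'city': ['Amsterdam', 'Athens', 'Belgrade', 'Berlin', 'Bratislava'],
    'cluster': [0, 2, 1, 1, 0],
    'safety': [67.32, 50.49, 62.02, 58.92, 68.68],
})

summary = summarise_clusters(sample, 'safety')
print(summary)
```

Returning a DataFrame keeps the summary available for later steps, for example sorting clusters by their mean.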



In [7]:
get_list(clustered, ['safety'])

Clustering based on: safety

{'Amsterdam': 67.32, 'Bratislava': 68.68, 'Bucharest': 72.64, 'Lisbon': 71.94, 'Luxembourg': 71.91, 'Madrid': 70.01, 'Nicosia': 68.46, 'Valletta': 68.57, 'Vilnius': 72.15, 'Warsaw': 71.22}

The average safety of cluster 0 is 70.0
--------------------------------------------------------------------------------------------
{'Belgrade': 62.02, 'Berlin': 58.92, 'Budapest': 63.82, 'Oslo': 63.36, 'Riga': 61.59, 'Skopje': 56.04, 'Sofia': 57.81, 'Stockholm': 55.39, 'Tirana': 61.25}

The average safety of cluster 1 is 60.0
--------------------------------------------------------------------------------------------
{'Athens': 50.49, 'Brussels': 48.28, 'Dublin': 50.42, 'London': 47.44, 'Paris': 48.03, 'Rome': 48.25, 'Sarajevo': 53.02}

The average safety of cluster 2 is 49.0
--------------------------------------------------------------------------------------------
{'Copenhagen': 74.77, 'Helsinki': 77.24, 'Ljubljana': 78.72, 'Reykjavik': 77.74, 'Tallinn': 77.11, 'Vie

In [8]:
# Putting single feature functions together

def one_feature(df, col):
    clusters = get_clusters(df, col)
    get_list(clusters, col)

    return clusters

In [9]:
one_feature(city_data, ['health_care'])

Clustering based on: health_care

{'Amsterdam': 69.45, 'Berlin': 69.68, 'Lisbon': 71.38, 'Ljubljana': 66.24, 'London': 70.28, 'Reykjavik': 66.63, 'Stockholm': 66.9, 'Tallinn': 71.0, 'Vilnius': 71.09, 'Zagreb': 65.16}

The average health_care of cluster 0 is 69.0
--------------------------------------------------------------------------------------------
{'Athens': 56.17, 'Bratislava': 57.17, 'Bucharest': 54.34, 'Riga': 60.73, 'Rome': 59.35, 'Sarajevo': 60.13, 'Skopje': 55.93, 'Sofia': 57.24, 'Valletta': 58.86, 'Warsaw': 54.65}

The average health_care of cluster 1 is 57.0
--------------------------------------------------------------------------------------------
{'Brussels': 74.5, 'Copenhagen': 78.15, 'Helsinki': 77.06, 'Luxembourg': 73.71, 'Madrid': 78.97, 'Oslo': 75.07, 'Paris': 78.58, 'Vienna': 78.83}

The average health_care of cluster 2 is 77.0
--------------------------------------------------------------------------------------------
{'Belgrade': 53.69, 'Budapest': 47.7, 'Dubli

Unnamed: 0,city,lat,lng,cluster,health_care
0,Amsterdam,52.35,4.916667,0,69.45
1,Athens,37.983333,23.733333,1,56.17
2,Belgrade,44.833333,20.5,3,53.69
3,Berlin,52.516667,13.4,0,69.68
4,Bratislava,48.15,17.116667,1,57.17
5,Brussels,50.833333,4.333333,2,74.5
6,Bucharest,44.433333,26.1,1,54.34
7,Budapest,47.5,19.083333,3,47.7
8,Copenhagen,55.666667,12.583333,2,78.15
9,Dublin,53.316667,-6.233333,3,51.5


### 2.2 Now creating mini-functions for the cluster groups and information lists for multiple feature input

In [10]:
# Getting the clusters for multiple features

def get_clusters_multi(df, col_list, cluster_no=4):
    X = df[col_list]

    clusters = KMeans(n_clusters=cluster_no)
    clusters.fit(X)
    
    # attach each city's cluster label; the feature columns themselves are not returned
    df2 = df.assign(cluster=clusters.predict(X))
    df2 = df2[['city', 'lat', 'lng', 'cluster']]
    
    return df2

In [11]:
get_clusters_multi(city_data, ['safety', 'summer_sun_hrs'])

Unnamed: 0,city,lat,lng,cluster
0,Amsterdam,52.35,4.916667,3
1,Athens,37.983333,23.733333,2
2,Belgrade,44.833333,20.5,0
3,Berlin,52.516667,13.4,1
4,Bratislava,48.15,17.116667,0
5,Brussels,50.833333,4.333333,3
6,Bucharest,44.433333,26.1,0
7,Budapest,47.5,19.083333,0
8,Copenhagen,55.666667,12.583333,3
9,Dublin,53.316667,-6.233333,3


In [12]:
# Getting the list for multiple features

def get_list_multi(df, col_list):
    separator = '--------------------------------------------------------------------------------------------'

    print(f'Clustering based on: {col_list}\n')
    # one line per cluster, listing the cities assigned to it
    for i, k in enumerate(sorted(df['cluster'].unique())):
        if i > 0:
            print(separator)
        print(list(df.loc[df['cluster'] == k, 'city']))


In [13]:
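# Note: `clustered` was built from 'safety' alone (In [5]), so the groups printed
# here are the safety-only clusters; col_list is used only for the heading.
get_list_multi(clustered, ['safety', 'summer_sun_hrs'])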

Clustering based on: ['safety', 'summer_sun_hrs']

['Amsterdam', 'Bratislava', 'Bucharest', 'Lisbon', 'Luxembourg', 'Madrid', 'Nicosia', 'Valletta', 'Vilnius', 'Warsaw']
--------------------------------------------------------------------------------------------
['Belgrade', 'Berlin', 'Budapest', 'Oslo', 'Riga', 'Skopje', 'Sofia', 'Stockholm', 'Tirana']
--------------------------------------------------------------------------------------------
['Athens', 'Brussels', 'Dublin', 'London', 'Paris', 'Rome', 'Sarajevo']
--------------------------------------------------------------------------------------------
['Copenhagen', 'Helsinki', 'Ljubljana', 'Reykjavik', 'Tallinn', 'Vienna', 'Zagreb']


In [14]:
# Putting multiple feature functions together

def multi_feature(df, col_list):
    clusters = get_clusters_multi(df, col_list)
    get_list_multi(clusters, col_list)
    
    return clusters

In [15]:
multi_feature(city_data, ['safety', 'summer_sun_hrs'])

Clustering based on: ['safety', 'summer_sun_hrs']

['Amsterdam', 'Brussels', 'Copenhagen', 'Dublin', 'London', 'Luxembourg', 'Reykjavik']
--------------------------------------------------------------------------------------------
['Athens', 'Lisbon', 'Madrid', 'Nicosia', 'Rome', 'Tirana', 'Valletta']
--------------------------------------------------------------------------------------------
['Belgrade', 'Bratislava', 'Bucharest', 'Budapest', 'Helsinki', 'Riga', 'Skopje', 'Sofia', 'Stockholm', 'Tallinn']
--------------------------------------------------------------------------------------------
['Berlin', 'Ljubljana', 'Oslo', 'Paris', 'Sarajevo', 'Vienna', 'Vilnius', 'Warsaw', 'Zagreb']


Unnamed: 0,city,lat,lng,cluster
0,Amsterdam,52.35,4.916667,0
1,Athens,37.983333,23.733333,1
2,Belgrade,44.833333,20.5,2
3,Berlin,52.516667,13.4,3
4,Bratislava,48.15,17.116667,2
5,Brussels,50.833333,4.333333,0
6,Bucharest,44.433333,26.1,2
7,Budapest,47.5,19.083333,2
8,Copenhagen,55.666667,12.583333,0
9,Dublin,53.316667,-6.233333,0
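
The four groups printed by `In [15]` for `['safety', 'summer_sun_hrs']` are the same four groups that clustering on `summer_sun_hrs` alone produces in `In [48]`. A likely reason is scale: sunshine hours range from roughly 160 to 380, while safety ranges from roughly 47 to 79, so sunshine dominates the distances KMeans works with. The sketch below standardises each column first; `get_clusters_scaled` is a hypothetical variant rather than part of the notebook, and `StandardScaler` is one of several possible scaling choices.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def get_clusters_scaled(df, col_list, cluster_no=4, seed=0):
    """Cluster on standardised columns so each feature carries similar weight."""
    # rescale every column to zero mean and unit variance
    X = StandardScaler().fit_transform(df[col_list])
    labels = KMeans(n_clusters=cluster_no, n_init=10,
                    random_state=seed).fit_predict(X)
    return df.assign(cluster=labels)


# Six cities with values copied from the safety and summer_sun_hrs outputs
sample = pd.DataFrame({
    'city': ['Amsterdam', 'Athens', 'Belgrade', 'Berlin', 'Bratislava', 'Brussels'],
    'safety': [67.32, 50.49, 62.02, 58.92, 68.68, 48.28],
    'summer_sun_hrs': [203, 347, 266, 219, 292, 189],
})

result = get_clusters_scaled(sample, ['safety', 'summer_sun_hrs'], cluster_no=2)
print(result)
```

Whether scaling gives more useful groups for the MVP would need to be checked on the full data set.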


### 2.3 Creating map function

In [45]:
# Getting the map

def get_map(df, col, cluster_no=4):
    # col and cluster_no are accepted but not used: the map is drawn only from
    # the 'lat', 'lng', 'city' and 'cluster' columns of df
    
    # Map of Europe: Europe's centre is given as 54.5260° N, 15.2551° E;
    # the map is centred slightly further south (51.5260° N)
    map_europe = folium.Map(location=[51.5260,15.2551], zoom_start=4, tiles='OpenStreetMap')
    # tiles: 'OpenStreetMap', 'Stamen Toner', 
    
    # color of the clusters (named cluster_colors so it does not shadow
    # the matplotlib.colors module imported as `colors`)
    cluster_colors = (['#F91100', '#8000ff', '#2adddd', '#FFFF00', 
                       '#F900FF', '#3FFF00', '#ff0000', '#009b48', 
                       '#0000FF', '#000000', '#ff5800', '#ffff00'])
    
    # one circle per city, coloured by its cluster label
    for lat, lng, city, cluster in zip(df['lat'], df['lng'], df['city'], df['cluster']):
        tooltip = f'{city}, cluster {cluster}'
        folium.CircleMarker(location=[lat, lng], 
                            radius=4, 
                            tooltip=tooltip, 
                            color=cluster_colors[cluster], 
                            fill=True, 
                            fill_color=cluster_colors[cluster], 
                            fill_opacity=1).add_to(map_europe)
    
    folium.TileLayer('cartodbpositron').add_to(map_europe) 
    # cartodbpositron, cartodbdark_matter, stamenwatercolor, stamenterrain
        
    return map_europe
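
The notebook imports `matplotlib.cm` and `matplotlib.colors`, while `get_map` uses a hand-written list of twelve colours. If the number of clusters ever needs to vary, the palette could be generated instead. A minimal sketch, assuming evenly spaced colours from the `rainbow` colormap are acceptable; `cluster_palette` is a hypothetical helper.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as mcolors


def cluster_palette(cluster_no):
    """Return one hex colour per cluster, evenly spaced over the rainbow colormap."""
    rgba = cm.rainbow(np.linspace(0, 1, cluster_no))
    return [mcolors.rgb2hex(c) for c in rgba]


palette = cluster_palette(4)
print(palette)
```

Each entry is a `#rrggbb` string, so the list can be indexed by cluster label in the same way as the fixed list in `get_map`.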

In [46]:
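# Note: `clustered` holds the safety clusters from In [5]; get_map only reads
# the 'cluster' column, so the feature name passed here does not change the map.
get_map(clustered, ['health_care'])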

### 2.4 Putting it all together in one function

In [47]:
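# The full function

def city_clusters(df, col_list, cluster_no=4):
    # note: cluster_no is not forwarded to the helpers, which use their default of 4 clusters
    
    if len(col_list) == 1:
        # a single feature also prints the per-cluster values and means
        clu_df = one_feature(df, col_list)
    
    else:
        # several features print only the city lists
        clu_df = multi_feature(df, col_list)
    
    geo_map = get_map(clu_df, col_list)

    return geo_map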
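
The use-case at the top asks for the cities similar to one particular city, whereas `city_clusters` prints every cluster. The lookup could be added as a small step on the clustered DataFrame. A minimal sketch, assuming "similar" simply means "in the same cluster"; `similar_cities` is a hypothetical helper and the sample rows come from the `In [5]` output.

```python
import pandas as pd


def similar_cities(clustered_df, city):
    """Return the other cities that share a cluster with `city`."""
    row = clustered_df.loc[clustered_df['city'] == city]
    if row.empty:
        raise ValueError(f'{city!r} is not in the data')
    label = row['cluster'].iloc[0]
    same = clustered_df.loc[clustered_df['cluster'] == label, 'city']
    return [c for c in same if c != city]


# First five rows of the safety clustering shown earlier
sample = pd.DataFrame({
    'city': ['Amsterdam', 'Athens', 'Belgrade', 'Berlin', 'Bratislava'],
    'cluster': [0, 2, 1, 1, 0],
})

print(similar_cities(sample, 'Berlin'))  # ['Belgrade']
```

Applied to the DataFrame returned by `one_feature` or `multi_feature`, this would give the list of similar cities that the MVP describes.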

In [48]:
city_clusters(city_data, ['summer_sun_hrs'])

Clustering based on: summer_sun_hrs

{'Athens': 347, 'Lisbon': 331, 'Madrid': 324, 'Nicosia': 384, 'Rome': 308, 'Tirana': 329, 'Valletta': 355}

The average summer_sun_hrs of cluster 0 is 340.0
--------------------------------------------------------------------------------------------
{'Belgrade': 266, 'Bratislava': 292, 'Bucharest': 279, 'Budapest': 259, 'Helsinki': 275, 'Riga': 260, 'Skopje': 274, 'Sofia': 267, 'Stockholm': 258, 'Tallinn': 270}

The average summer_sun_hrs of cluster 1 is 270.0
--------------------------------------------------------------------------------------------
{'Amsterdam': 203, 'Brussels': 189, 'Copenhagen': 204, 'Dublin': 168, 'London': 201, 'Luxembourg': 195, 'Reykjavik': 162}

The average summer_sun_hrs of cluster 2 is 189.0
--------------------------------------------------------------------------------------------
{'Berlin': 219, 'Ljubljana': 237, 'Oslo': 237, 'Paris': 223, 'Sarajevo': 230, 'Vienna': 232, 'Vilnius': 222, 'Warsaw': 228, 'Zagreb': 244}



In [49]:
city_clusters(city_data, ['purchasing_power'])

Clustering based on: purchasing_power

{'Athens': 40.69, 'Belgrade': 34.87, 'Skopje': 33.72, 'Tirana': 27.81, 'Valletta': 38.26}

The average purchasing_power of cluster 0 is 35.0
--------------------------------------------------------------------------------------------
{'Amsterdam': 81.63, 'Berlin': 98.54, 'Brussels': 88.03, 'Copenhagen': 90.21, 'Helsinki': 89.4, 'London': 82.61, 'Luxembourg': 103.08, 'Oslo': 85.25, 'Stockholm': 81.61, 'Vienna': 84.06}

The average purchasing_power of cluster 1 is 88.0
--------------------------------------------------------------------------------------------
{'Bucharest': 50.66, 'Budapest': 49.16, 'Lisbon': 46.65, 'Ljubljana': 58.61, 'Nicosia': 55.83, 'Riga': 49.76, 'Rome': 53.72, 'Sarajevo': 52.49, 'Sofia': 48.55, 'Vilnius': 59.3, 'Warsaw': 54.75, 'Zagreb': 50.31}

The average purchasing_power of cluster 2 is 52.0
--------------------------------------------------------------------------------------------
{'Bratislava': 61.82, 'Dublin': 67.69, 'M

In [50]:
city_clusters(city_data, ['pollution'])

Clustering based on: pollution

{'Athens': 57.3, 'Belgrade': 63.57, 'Brussels': 62.36, 'Budapest': 54.38, 'London': 58.57, 'Madrid': 52.06, 'Nicosia': 61.04, 'Paris': 64.23, 'Rome': 66.4, 'Sarajevo': 69.24, 'Sofia': 69.24, 'Warsaw': 65.58}

The average pollution of cluster 0 is 62.0
--------------------------------------------------------------------------------------------
{'Amsterdam': 30.79, 'Berlin': 39.45, 'Bratislava': 41.12, 'Dublin': 40.42, 'Lisbon': 34.91, 'Riga': 38.65, 'Zagreb': 31.37}

The average pollution of cluster 1 is 37.0
--------------------------------------------------------------------------------------------
{'Bucharest': 75.53, 'Skopje': 82.9, 'Tirana': 87.45, 'Valletta': 74.23}

The average pollution of cluster 2 is 80.0
--------------------------------------------------------------------------------------------
{'Copenhagen': 21.47, 'Helsinki': 13.19, 'Ljubljana': 23.24, 'Luxembourg': 21.59, 'Oslo': 25.6, 'Reykjavik': 15.33, 'Stockholm': 20.05, 'Tallinn': 22.1

In [51]:
city_clusters(city_data, ['summer_sun_hrs', 'purchasing_power', 'pollution'])

Clustering based on: ['summer_sun_hrs', 'purchasing_power', 'pollution']

['Belgrade', 'Bratislava', 'Bucharest', 'Budapest', 'Paris', 'Riga', 'Sarajevo', 'Skopje', 'Sofia', 'Warsaw']
--------------------------------------------------------------------------------------------
['Athens', 'Lisbon', 'Madrid', 'Nicosia', 'Rome', 'Tirana', 'Valletta']
--------------------------------------------------------------------------------------------
['Amsterdam', 'Berlin', 'Brussels', 'Copenhagen', 'Dublin', 'London', 'Luxembourg', 'Reykjavik']
--------------------------------------------------------------------------------------------
['Helsinki', 'Ljubljana', 'Oslo', 'Stockholm', 'Tallinn', 'Vienna', 'Vilnius', 'Zagreb']
