# IBM Applied Data Science Specialization - Capstone Project
# Moving to a similar neighborhood

## Table of contents
* [Introduction: Definition of the problem and objective](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Conclusion](#results)

## Definition of the problem and objective <a name="introduction"></a>

This project is motivated by my upcoming move from Mexico City to Madrid. I want to use the techniques learned in the course to find neighborhoods in Madrid that have the same types of attractions and places of interest (venues) as the neighborhood where I currently live.

To do this, I will collect the places of interest of my current neighborhood using the Foursquare API, and then I will explore the venues for each of the neighborhoods of Madrid. Finally, I will use KMEANS to find neighborhoods in Madrid similar to my current neighborhood in terms of the places of interest found on Foursquare.

## Data to be used to solve the problem <a name="data"></a>

In [1]:
import pandas as pd
import numpy as np
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
import io
import json
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.metrics import pairwise_distances
import folium

In [2]:
CLIENT_ID = 'xxxxx'
CLIENT_SECRET = 'xxxxx'
VERSION = '20180605' # Foursquare API version

In [3]:
def get_category_type(row):
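    # the column is named 'categories' or 'venue.categories' depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']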

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return (nearby_venues)


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

### For my current location

I live in a neighborhood in Mexico City called "Coyoacan". First I will use **Nominatim** from the **geopy.geocoders** library to get the latitude and longitude of this neighborhood.

In [4]:
cdmx_barrio = 'Coyoacan'
cdmx = 'CDMX'
address = cdmx_barrio + ', ' + cdmx

geolocator = Nominatim(user_agent="cdmx_explorer")
location = geolocator.geocode(address)
cdmx_latitud = location.latitude
cdmx_longitud = location.longitude
print(location, cdmx_latitud, cdmx_longitud)

Coyoacán, CDMX, México 19.32804005 -99.1510634069359


Then I will use the Foursquare API with the **explore** endpoint to get recommended venues near my current location. I will limit the search to a radius of 1000 meters and 100 venues at most.
- GET https://api.foursquare.com/v2/venues/explore
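As a minimal sketch of how the request URL for this endpoint is assembled, the query string can be built from a dictionary instead of a long format string. The helper name `build_explore_url` is hypothetical and not part of the notebook; the parameter names and values are the ones used in the cell that follows, and the credentials are placeholders.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL for one location (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude of the search center
        "radius": radius,                # search radius in meters
        "limit": limit,                  # maximum number of venues to return
    }
    # safe="," keeps the comma in the ll parameter unescaped
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")


example_url = build_explore_url(
    "xxxxx", "xxxxx", "20180605", 19.32804005, -99.1510634069359, 1000, 100
)
print(example_url)
```

Building the query from a dictionary keeps each parameter next to its name, which makes it harder to swap two positional arguments by accident.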

In [5]:
radius = 1000
LIMIT = 100

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, cdmx_latitud, cdmx_longitud, VERSION, radius, LIMIT)
print(url)

results = requests.get(url).json()
print('There are {} venues around your location.'.format(len(results['response']['groups'][0]['items'])))

cdmx_venues = results['response']['groups'][0]['items']

cdmx_nearby_venues = json_normalize(cdmx_venues)  # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.location.lat', 'venue.location.lng', 'venue.categories']
cdmx_nearby_venues = cdmx_nearby_venues.loc[:, filtered_columns]

# filter the category for each row
cdmx_nearby_venues['venue.categories'] = cdmx_nearby_venues.apply(get_category_type, axis=1)

# clean columns
cdmx_nearby_venues.columns = [col.split(".")[-1] for col in cdmx_nearby_venues.columns]

cdmx_nearby_venues['Neighborhood'] = cdmx_barrio
cdmx_nearby_venues['Neighborhood Latitude'] = cdmx_latitud
cdmx_nearby_venues['Neighborhood Longitude'] = cdmx_longitud
cols = cdmx_nearby_venues.columns.tolist()
cols = cols[-3:] + cols[:-3]
cdmx_nearby_venues = cdmx_nearby_venues[cols]

cdmx_nearby_venues.head()
print('{} venues were returned by Foursquare.'.format(cdmx_nearby_venues.shape[0]))
print('There are {} unique categories.'.format(len(cdmx_nearby_venues['categories'].unique())))

https://api.foursquare.com/v2/venues/explore?client_id=xxxxx&client_secret=xxxxx&ll=19.32804005,-99.1510634069359&v=20180605&radius=1000&limit=100
There are 84 venues around your location.
84 venues were returned by Foursquare.
There are 41 unique categories.


### For Madrid

To get data from Madrid neighborhoods I will use data from [Portal de datos abiertos del Ayuntamiento de Madrid](https://datos.madrid.es/portal/site/egob/). Specifically I will download a CSV file titled [Relación de barrios (superficie y perímetro)](https://datos.madrid.es/portal/site/egob/menuitem.c05c1f754a33a9fbe4b2e4b284f1a5a0/?vgnextoid=46b55cde99be2410VgnVCM1000000b205a0aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD&vgnextfmt=default). 

This file is a list of Districts and Neighborhoods in Madrid. 

I will geocode each neighborhood using **Nominatim** from **geopy.geocoders**.

In [6]:
url = "https://www.dropbox.com/s/77vqznq3bik3q57/madrid_barrios.csv?dl=1"
urlData = requests.get(url).content
barrios_df = pd.read_csv(io.StringIO(urlData.decode('utf-8')))

print(barrios_df.shape)
print(barrios_df.head())

(128, 2)
     Distrito       Barrio
0  Arganzuela       Atocha
1  Arganzuela     Delicias
2  Arganzuela     Imperial
3  Arganzuela   La Chopera
4  Arganzuela  Las Acacias


In [7]:
geo_barrios = pd.DataFrame(columns=['Distrito', 'Barrio', 'Latitud', 'Longitud'])
j = 0

# create the geocoder once, outside the loop
geolocator = Nominatim(user_agent="cdmx_explorer")

for idx, row in barrios_df.iterrows():
    address = row['Barrio'] + ', ' + row['Distrito'] + ', Madrid'
    location = geolocator.geocode(address, timeout=30)
    # neighborhoods that Nominatim cannot resolve are skipped
    if location is not None:
        latitude = location.latitude
        longitude = location.longitude
        geo_barrios.loc[j] = [row['Distrito'], row['Barrio'], latitude, longitude]
        j += 1

print(geo_barrios.shape)
print(geo_barrios.head())

(118, 4)
     Distrito       Barrio    Latitud  Longitud
0  Arganzuela       Atocha  40.405477 -3.689800
1  Arganzuela     Delicias  40.397292 -3.689495
2  Arganzuela     Imperial  40.406915 -3.717329
3  Arganzuela   La Chopera  40.394893 -3.699705
4  Arganzuela  Las Acacias  40.400759 -3.706995


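Then again, I will use the Foursquare API with the **explore** endpoint to get the recommended venues for each neighborhood, with a limit of 100 venues per neighborhood. Note that the call below does not pass a radius, so `getNearbyVenues` uses its default of 500 meters rather than the 1000 meters used for Coyoacan.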

In [8]:
madrid_venues = getNearbyVenues(names=geo_barrios['Barrio'],
                                   latitudes=geo_barrios['Latitud'],
                                   longitudes=geo_barrios['Longitud']
                                  )

print(madrid_venues.shape)
print(madrid_venues.head())
print('There are {} unique categories.'.format(len(madrid_venues['Venue Category'].unique())))

Atocha
Delicias
Imperial
La Chopera
Las Acacias
Legazpi
Palos de Moguer
Aeropuerto
Alameda de Osuna
Corralejos
Timon
Abrantes
Buenavista
Comillas
Opanel
Puerta Bonita
San Isidro
Vista Alegre
Cortes
Embajadores
Justicia
Palacio
Sol
Universidad
Castilla
Ciudad Jardin
Nueva Espana
Prosperidad
Almagro
Arapiles
Gaztambide
Rios Rosas
Trafalgar
Vallehermoso
Atalaya
Colina
Concepcion
Costillares
Pueblo Nuevo
Quintana
San Juan Bautista
San Pascual
Ventas
El Pardo
El Pilar
Mirasierra
Apostol Santiago
Canillas
Pinar del Rey
Piovera
Valdefuentes
Aluche
Campamento
Cuatro Vientos
Las aguilas
Lucero
Puerta del angel
Aravaca
Arguelles
Casa de Campo
Ciudad Universitaria
El Plantio
Valdemarin
Valdezarza
Fontarron
Horcajo
Marroquina
Media Legua
Pavones
Vinateros
Entrevias
Numancia
Palomeras Bajas
Palomeras Sureste
Portazgo
San Diego
Adelfas
Estrella
Ibiza
Jeronimos
Nino Jesus
Pacifico
Castellana
Fuente del Berro
Goya
Guindalera
Lista
Recoletos
Amposta
Arcos
Canillejas
El Salvador
Hellin
Rejas
Rosas
Siman

## Methodology <a name="methodology"></a>

In order to use KMEANS we have to transform the data so that they are all numeric. For this purpose I will take the following steps:
- Append the location of my CDMX neighborhood to the locations of the neighborhoods of Madrid (variable *geo_barrios*)
- Append the venues of my CDMX neighborhood to the venues of Madrid (variable *madrid_venues*)
- Use one-hot encoding to turn each venue category into its own 0/1 column
- Group the resulting matrix by neighborhood, taking the mean of each category column (that is, the share of the neighborhood's venues that fall in that category)
- Apply KMEANS, trying several values of K and keeping the one with the best Silhouette Coefficient
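To make the encoding and grouping steps concrete, here is a small plain-Python sketch on made-up data. It computes, for each neighborhood, the share of venues in each category, which is the same quantity obtained by one-hot encoding the categories and averaging per neighborhood. The function name `category_frequencies` and the toy venue list are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Toy (neighborhood, venue category) pairs; values are made up for illustration.
venues = [
    ("A", "Bar"), ("A", "Bar"), ("A", "Hotel"), ("A", "Park"),
    ("B", "Hotel"), ("B", "Hotel"),
]


def category_frequencies(pairs):
    """Return the sorted category list and, per neighborhood, the share of each category."""
    categories = sorted({cat for _, cat in pairs})
    counts = defaultdict(Counter)
    for hood, cat in pairs:
        counts[hood][cat] += 1  # equivalent to summing the one-hot column
    table = {}
    for hood, counter in counts.items():
        total = sum(counter.values())
        # dividing by the number of venues gives the mean of each one-hot column
        table[hood] = [counter[cat] / total for cat in categories]
    return categories, table


categories, freq = category_frequencies(venues)
print(categories)
print(freq)
```

Each neighborhood ends up as one numeric row whose entries sum to 1, which is the representation the clustering step works on.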

In [9]:
tmp = {'Distrito': cdmx, 'Barrio':cdmx_barrio, 'Latitud': cdmx_latitud, 'Longitud': cdmx_longitud}
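# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
geo_barrios = pd.concat([geo_barrios, pd.DataFrame([tmp])], ignore_index=True)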
print(geo_barrios.shape)
print(geo_barrios.head())

(119, 4)
     Distrito       Barrio    Latitud  Longitud
0  Arganzuela       Atocha  40.405477 -3.689800
1  Arganzuela     Delicias  40.397292 -3.689495
2  Arganzuela     Imperial  40.406915 -3.717329
3  Arganzuela   La Chopera  40.394893 -3.699705
4  Arganzuela  Las Acacias  40.400759 -3.706995


In [10]:
cols = madrid_venues.columns.tolist()
cdmx_nearby_venues.columns = cols

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
madrid_venues = pd.concat([madrid_venues, cdmx_nearby_venues], ignore_index=True)

print(madrid_venues.shape)
print(madrid_venues.head())

(3767, 7)
  Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0       Atocha              40.405477                 -3.6898   
1       Atocha              40.405477                 -3.6898   
2       Atocha              40.405477                 -3.6898   
3       Atocha              40.405477                 -3.6898   
4       Atocha              40.405477                 -3.6898   

                                     Venue  Venue Latitude  Venue Longitude  \
0                    Only You Hotel Atocha       40.407161        -3.688438   
1                           Bodegas Rosell       40.403803        -3.690620   
2                   Running Company Madrid       40.406714        -3.686904   
3                            Pandora's Vox       40.405600        -3.691992   
4  Jardín Tropical - Invernadero de Atocha       40.406866        -3.691204   

        Venue Category  
0                Hotel  
1   Spanish Restaurant  
2  Sporting Goods Shop  
3          Music Venue  

In [11]:
# one hot encoding
madrid_onehot = pd.get_dummies(madrid_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
madrid_onehot['Nborhood'] = madrid_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [madrid_onehot.columns[-1]] + list(madrid_onehot.columns[:-1])
madrid_onehot = madrid_onehot[fixed_columns]

print(madrid_onehot.shape)
print(madrid_onehot.head())

(3767, 264)
  Nborhood  Accessories Store  Adult Boutique  African Restaurant  \
0   Atocha                  0               0                   0   
1   Atocha                  0               0                   0   
2   Atocha                  0               0                   0   
3   Atocha                  0               0                   0   
4   Atocha                  0               0                   0   

   American Restaurant  Arcade  Arepa Restaurant  Argentinian Restaurant  \
0                    0       0                 0                       0   
1                    0       0                 0                       0   
2                    0       0                 0                       0   
3                    0       0                 0                       0   
4                    0       0                 0                       0   

   Art Gallery  Art Museum     ...       Used Bookstore  \
0            0           0     ...                    0  

In [12]:
madrid_grouped = madrid_onehot.groupby('Nborhood').mean().reset_index()
print(madrid_grouped.shape)
print(madrid_grouped.head())

(117, 264)
           Nborhood  Accessories Store  Adult Boutique  African Restaurant  \
0          Abrantes                0.0             0.0                 0.0   
1           Adelfas                0.0             0.0                 0.0   
2        Aeropuerto                0.0             0.0                 0.0   
3  Alameda de Osuna                0.0             0.0                 0.0   
4           Almagro                0.0             0.0                 0.0   

   American Restaurant  Arcade  Arepa Restaurant  Argentinian Restaurant  \
0                 0.00     0.0               0.0                     0.0   
1                 0.00     0.0               0.0                     0.0   
2                 0.00     0.0               0.0                     0.0   
3                 0.00     0.0               0.0                     0.0   
4                 0.01     0.0               0.0                     0.0   

   Art Gallery  Art Museum     ...       Used Bookstore  \
0   

### To find similar neighborhoods

I'm going to use KMEANS to group the neighborhoods in Madrid together with my neighborhood in CDMX to see which ones are most similar to mine.

I will try several numbers of clusters to find the best value for K. For this purpose I will use 2 metrics:
- Silhouette Coefficient, which is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters. The score is higher when clusters are dense and well separated.
- Calinski-Harabasz index, which is higher when clusters are dense and well separated.
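To illustrate what the Silhouette Coefficient measures, the sketch below computes it by hand on four toy points. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to the points of any other cluster, and the point's score is `(b - a) / max(a, b)`. The function name and the toy data are assumptions for illustration; the notebook itself relies on `sklearn.metrics.silhouette_score`.

```python
from math import dist


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points (plain-Python sketch, needs at least 2 clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for k, other in clusters.items()
            if k != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_coefficient(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # clusters that mix distant points
print(good, bad)
```

The well-separated labelling scores close to +1, while the mixed labelling scores below zero, which matches the interpretation given in the list above.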

In [13]:
madrid_grouped_clustering = madrid_grouped.drop(columns='Nborhood')
best_sc = -1  # the Silhouette Coefficient is never below -1
best_sc_k = 0
best_chi = 0
best_chi_k = 0

for x in range(3,12):
    kclusters = x
    # run k-means clustering
    kmeans_model = KMeans(n_clusters=kclusters, random_state=0).fit(madrid_grouped_clustering)
    # get cluster labels
    labels = kmeans_model.labels_
    # compute Silhouette Coefficient
    sc = metrics.silhouette_score(madrid_grouped_clustering, labels, metric='euclidean')
    print('Number of clusters: ', kclusters, ' Silhouette Coefficient: ', sc)
    # compute Calinski-Harabasz Index (calinski_harabaz_score in older scikit-learn releases)
    chi = metrics.calinski_harabasz_score(madrid_grouped_clustering, labels)
    print('Number of clusters: ', kclusters, ' Calinski-Harabaz Index: ', chi)
    if sc > best_sc:
        best_sc = sc
        best_sc_k = kclusters
    if chi > best_chi:
        best_chi = chi
        best_chi_k = kclusters  

print('Best # clusters according to Silhouette Coefficient: ', best_sc_k, ' with score = ', best_sc)
print('Best # clusters according to Calinski-Harabaz Index: ', best_chi_k, ' with score = ', best_chi)

Number of clusters:  3  Silhouette Coefficient:  0.132597341788
Number of clusters:  3  Calinski-Harabaz Index:  7.92341160907
Number of clusters:  4  Silhouette Coefficient:  0.142149664364
Number of clusters:  4  Calinski-Harabaz Index:  8.73471076551
Number of clusters:  5  Silhouette Coefficient:  0.13331619705
Number of clusters:  5  Calinski-Harabaz Index:  8.24378819744
Number of clusters:  6  Silhouette Coefficient:  0.138480531096
Number of clusters:  6  Calinski-Harabaz Index:  8.33687686577
Number of clusters:  7  Silhouette Coefficient:  0.126676030648
Number of clusters:  7  Calinski-Harabaz Index:  8.06296535194
Number of clusters:  8  Silhouette Coefficient:  0.0409799327639
Number of clusters:  8  Calinski-Harabaz Index:  7.80280487852
Number of clusters:  9  Silhouette Coefficient:  0.149387140622
Number of clusters:  9  Calinski-Harabaz Index:  7.72514704205
Number of clusters:  10  Silhouette Coefficient:  0.0139522144899
Number of clusters:  10  Calinski-Harabaz Ind

I choose K from the best Silhouette Coefficient to use for the final model.

I'm going to calculate the distance of each point to every cluster centroid using *KMeans.transform*. The column that corresponds to a point's own cluster holds the distance to its centroid.
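As a rough illustration of the cluster-distance matrix, the following plain-Python sketch computes the Euclidean distance from every sample to every centroid, producing one row per sample and one column per centroid. The function name and the toy coordinates are assumptions for illustration, not values from the notebook.

```python
from math import dist


def distances_to_centroids(samples, centroids):
    """Return a matrix with one row per sample and one column per centroid."""
    return [[dist(s, c) for c in centroids] for s in samples]


samples = [(0.0, 0.0), (3.0, 4.0)]
centroids = [(0.0, 0.0), (6.0, 8.0)]

distance_matrix = distances_to_centroids(samples, centroids)
# column index of the closest centroid for each sample (first one wins on ties)
nearest = [row.index(min(row)) for row in distance_matrix]
print(distance_matrix, nearest)
```

Selecting a single column of this matrix gives the distance of every sample to one particular centroid, which is what the later ranking step uses.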

In [15]:
kmeans_model = KMeans(n_clusters=best_sc_k, random_state=0).fit(madrid_grouped_clustering)
labels = kmeans_model.labels_
# get distance from centroids
distance = kmeans_model.transform(madrid_grouped_clustering)

## Analysis <a name="analysis"></a>

For each neighborhood we will look up its 12 most frequent venue categories. This will help us draw conclusions about which neighborhoods of Madrid are most similar to my neighborhood in Mexico City.

In [16]:
num_top_venues = 12

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd, 3rd
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # 4th and above
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = madrid_grouped['Nborhood']

for ind in np.arange(madrid_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(madrid_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()

(117, 13)


| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue | 11th Most Common Venue | 12th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abrantes | Soccer Field | Bakery | Athletics & Sports | Gym / Fitness Center | Fast Food Restaurant | Pizza Place | Fried Chicken Joint | French Restaurant | Fountain | Football Stadium | Fish & Chips Shop | Food Court |
| 1 | Adelfas | Bar | Tapas Restaurant | Hotel | Fast Food Restaurant | Bakery | Diner | Grocery Store | Skate Park | Supermarket | Gift Shop | Spanish Restaurant | Brewery |
| 2 | Aeropuerto | Food | Yoga Studio | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | French Restaurant | Fountain | Football Stadium | Food Truck | Food Court | Food & Drink Shop |
| 3 | Alameda de Osuna | Smoke Shop | Hotel | Tapas Restaurant | Restaurant | Market | Cocktail Bar | Gym | Grocery Store | Coffee Shop | Metro Station | Bookstore | Bakery |
| 4 | Almagro | Spanish Restaurant | Restaurant | Hotel | Mediterranean Restaurant | Bar | French Restaurant | Italian Restaurant | Supermarket | Cocktail Bar | Asian Restaurant | Japanese Restaurant | Plaza |


Add clustering labels and find cluster number and index value for my neighborhood in Mexico City.

In [18]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans_model.labels_)
# find index of CDMX neighborhood
cdmx_ind = neighborhoods_venues_sorted.index[neighborhoods_venues_sorted['Neighborhood'] == cdmx_barrio].tolist()
# find cluster of CDMX neighborhood
cdmx_cluster = neighborhoods_venues_sorted.loc[cdmx_ind]['Cluster Labels'].values[0]

- Find the distances of all the points to the centroid of the cluster where the Mexico City neighborhood is located.
- Add the cluster label to each point.
- Sort the rows by cluster and distance.
- Keep only the rows for the cluster where the Mexico City neighborhood is located.
- Get the position of Mexico City in this last dataframe.
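The steps above can be sketched in plain Python on toy data: keep the members of the target's cluster, sort them by distance to that cluster's centroid, and return the entries immediately before and after the target in the sorted order. The helper name `neighbors_in_cluster`, the `window` parameter, and all values are assumptions for illustration, not the notebook's own code.

```python
def neighbors_in_cluster(names, labels, dist_to_centroid, target, window=5):
    """Return up to `window` names on each side of `target` when its cluster
    is sorted by distance to the cluster centroid (hypothetical helper)."""
    target_label = labels[names.index(target)]
    # keep only the members of the target's cluster, sorted by distance
    members = sorted(
        (d, n)
        for n, l, d in zip(names, labels, dist_to_centroid)
        if l == target_label
    )
    ordered = [n for _, n in members]
    pos = ordered.index(target)
    lo = max(0, pos - window)
    # neighbors before the target, then neighbors after it (target excluded)
    return ordered[lo:pos] + ordered[pos + 1:pos + 1 + window]


names = ["Home", "A", "B", "C", "D"]
labels = [1, 1, 0, 1, 1]
# distance of each point to the centroid of cluster 1 (toy values)
dist_to_centroid = [0.30, 0.10, 0.05, 0.35, 0.90]

closest = neighbors_in_cluster(names, labels, dist_to_centroid, "Home", window=1)
print(closest)
```

Note that this ranks neighborhoods by how similar their distance to the centroid is to the target's, which is a proxy for similarity rather than a direct distance between two neighborhoods.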

In [20]:
# create dataframe with centroids distance to CDMX cluster
dist_df = pd.DataFrame(distance[:,cdmx_cluster], columns=['distance'])
dist_df['cluster'] = kmeans_model.labels_
# sort dataframe by cluster and distance
dist_df.sort_values(by=['cluster', 'distance'], inplace=True)
dist_df.reset_index(inplace=True)
# keep only the rows from the CDMX cluster
dist_df = dist_df.loc[dist_df['cluster'] == cdmx_cluster]
# get CDMX position in sorted dataframe
# cdmx_ind is a one-element list, so use isin instead of ==
cdmx_pos = dist_df.index[dist_df['index'].isin(cdmx_ind)].tolist()[0]

Select the neighborhoods closest to the one in Mexico City, according to the distances to the centroid.

In [21]:
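# get the neighborhoods closest to the CDMX neighborhood in the sorted dataframe
# (up to 5 rows after it and up to 5 rows before it; .loc slices are inclusive,
# so the second slice also contains the CDMX row itself)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
top10 = pd.concat([dist_df.loc[cdmx_pos+1:cdmx_pos+5],
                   dist_df.loc[cdmx_pos-5:cdmx_pos]], ignore_index=True)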
neighborhoods_venues_sorted = neighborhoods_venues_sorted.join(top10.set_index('index'), how='inner')

Join neighborhoods_venues_sorted dataframe to geo_barrios dataframe to get the final result. 

In [22]:
madrid_merged = geo_barrios

# join geo_barrios with neighborhoods_venues_sorted to add latitude/longitude for each selected neighborhood
madrid_merged = madrid_merged.set_index('Barrio').join(neighborhoods_venues_sorted.set_index('Neighborhood'), how="inner")

madrid_merged.reset_index(inplace=True)
madrid_merged = madrid_merged.rename(index=str, columns={"index": "Barrio"})
madrid_merged

| | Barrio | Distrito | Latitud | Longitud | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue | 11th Most Common Venue | 12th Most Common Venue | distance | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Timon | Barajas | 40.473171 | -3.584152 | 3 | Hotel | Spanish Restaurant | Restaurant | Tapas Restaurant | Coffee Shop | Gastropub | Brewery | Bistro | Mexican Restaurant | Bar | Asian Restaurant | Diner | 0.220967 | 3 |
| 1 | Colina | Ciudad Lineal | 40.458175 | -3.659664 | 3 | Chinese Restaurant | Spanish Restaurant | Smoke Shop | Seafood Restaurant | Fast Food Restaurant | Restaurant | Bar | Thai Restaurant | Bakery | Paper / Office Supplies Store | Convenience Store | Cupcake Shop | 0.221639 | 3 |
| 2 | Concepcion | Ciudad Lineal | 40.438206 | -3.649578 | 3 | Bar | Park | Restaurant | Spanish Restaurant | Supermarket | Plaza | Café | Burger Joint | General Entertainment | Grocery Store | Tapas Restaurant | Market | 0.198395 | 3 |
| 3 | El Pilar | Fuencarral-El Prado | 40.477269 | -3.705511 | 3 | Clothing Store | Italian Restaurant | Fast Food Restaurant | Restaurant | Burger Joint | American Restaurant | Department Store | Park | Boutique | Coffee Shop | Donut Shop | Sandwich Place | 0.236293 | 3 |
| 4 | Aluche | Latina | 40.385644 | -3.760741 | 3 | Restaurant | Tapas Restaurant | Bar | Spanish Restaurant | Pharmacy | Latin American Restaurant | Shopping Mall | Sandwich Place | Chinese Restaurant | Metro Station | Clothing Store | Gym | 0.211028 | 3 |
| 5 | Puerta del angel | Latina | 40.413722 | -3.727171 | 3 | Spanish Restaurant | Grocery Store | Farmers Market | Diner | Bakery | Sandwich Place | Beer Garden | Circus | Sporting Goods Shop | Bar | Coffee Shop | Thrift / Vintage Store | 0.205023 | 3 |
| 6 | El Plantio | Moncloa-Aravaca | 40.468442 | -3.817696 | 3 | Spanish Restaurant | Italian Restaurant | Café | Burger Joint | Mediterranean Restaurant | Big Box Store | Bookstore | Beer Garden | Sporting Goods Shop | Gym | Mobile Phone Shop | Restaurant | 0.198901 | 3 |
| 7 | Estrella | Retiro | 40.411762 | -3.666998 | 3 | Bar | Coffee Shop | Spanish Restaurant | Italian Restaurant | Dessert Shop | Pool | Rental Car Location | Food & Drink Shop | Chinese Restaurant | Sports Club | Metro Station | Grocery Store | 0.203335 | 3 |
| 8 | Almenara | Tetuan | 40.471193 | -3.692085 | 3 | Spanish Restaurant | Hotel | Chinese Restaurant | Library | Supermarket | Frozen Yogurt Shop | Bar | Fast Food Restaurant | Sandwich Place | Gym / Fitness Center | Café | Breakfast Spot | 0.226856 | 3 |
| 9 | Moscardo | Usera | 40.390016 | -3.705424 | 3 | Bakery | Clothing Store | Coffee Shop | Restaurant | Gym | Grocery Store | Burger Joint | Soccer Field | Nightclub | Gastropub | Brazilian Restaurant | Big Box Store | 0.226864 | 3 |


Draw a map of the selected neighborhoods in Madrid.

In [23]:
address = "Madrid, España"

geolocator = Nominatim(user_agent="cdmx_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(location, latitude, longitude)

Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28001, España 40.4167047 -3.7035825


In [24]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters (one color per cluster of the final model)
colors_array = cm.rainbow(np.linspace(0, 1, best_sc_k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(madrid_merged['Latitud'], madrid_merged['Longitud'], madrid_merged['Barrio'],
                                  madrid_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster - 1],
        fill=True,
        fill_color=rainbow[cluster - 1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

## Results and Conclusion <a name="results"></a>

The result of this exercise shows that it is possible to help people who are in a situation similar to the one described in this case, that is, someone who has to move to another city and wants to find conditions similar to those in their current residence using public data available through the Foursquare API.

One of the difficulties with KMEANS is choosing the value of K. To decide what value to use, I ran the algorithm with different K values and, for each one, calculated the Silhouette Coefficient and the Calinski-Harabasz Index. These two metrics indicate whether the resulting clusters are dense and well separated. With both indicators the best value for K was 4.

The venues of my neighborhood in Mexico City were added to the venues of the 117 neighborhoods in Madrid and I generated 4 clusters. To further reduce the options of candidate neighborhoods, I searched within the cluster where my neighborhood in Mexico City was assigned (cluster 2), for those neighborhoods in Madrid that were closest considering the Euclidean distance of each one of them to the cluster centroid.

The characteristics that distinguish my neighborhood, according to the Foursquare results, are the diversity of places to eat, shops and places to exercise. These same characteristics are present in almost all the selected neighborhoods.

The technique used to select the candidate neighborhoods is illustrated in the following figure:

<img src="https://www.dropbox.com/s/uvq9t6ikeuhv4qf/kmeans4.png?raw=1" width="600" height="600">

Of course, the final decision cannot be based solely on the results of this analysis. Rather, it should be considered a tool for narrowing down the options that must be investigated in greater detail.

For example, one way to enrich the results of the analysis would be to add demographic and socioeconomic attributes to each of the neighborhoods. This could result in more similar and homogeneous clusters.