# Moving to Barcelona or moving to Madrid? A data-driven decision

## Introduction

In Spain there is a long-standing trend for people looking to progress to move to one of the country's two main cities $ ^{[1,2]}$: Madrid or Barcelona. The same applies to international immigrants, whose final destination is one of these two cities in a large majority of cases.

This is mainly due to their wide range of facilities, their overall business structure and their rich offer of high-quality jobs. The pattern is likely similar in most countries worldwide, where the capital and other main cities tend to act as population attractors while the remaining regions gradually empty out.

But where to move? There are countless online and offline resources with information about where your dream job may be. There are, however, few to no options for checking rational, objective information about what you can do in your personal life once the working day is over, or for comparing the two cities in that respect.

It is true that some headhunting and talent acquisition companies provide a certain degree of advice on this, but there is no easy way to get such information yourself. Moving blindly to one place or the other can end in complete success or in a tremendous mistake, especially if you are simply dazzled by the flashing city lights and nightlife of Madrid or by the sun and sea of Barcelona.

Of course, not everybody has the same needs or follows the same drivers when moving to a different city. Here we focus on a specific profile that is becoming more and more frequent in Spain (and abroad) and that also accounts for a large share of internal and international migrants.

A growing part of the Spanish population is made up of single people living alone or, at most, couples with no children. In fact, Spain has one of the lowest birth rates in the world (196$ ^{th}$ out of 219 countries in an analysis performed in 2021), with 9.2 children born every year per 1,000 inhabitants. $ ^{[3]}$

This is driven by the balance between cost of living and salaries, which does not favor raising many children in Spanish families. To 'compensate', many people include one or more pets (mainly dogs and cats) in their families: Spain is the 15$ ^{th}$ country in the world by total number of pets (cats + dogs) owned by its population. Among those top 15 countries, Spain rises to the 9$ ^{th}$ position by ratio of pets per inhabitant $ ^{[4]}$, with 0.17 pets per person.

People more inclined to move and build a new future from scratch also tend to be younger on average and more interested in sports and a natural lifestyle. As of 2014, Eurostat ranked Spain among the top 10 countries in the European Union by share of people regularly practicing sports (46.3% of the total population). $ ^{[5]}$

Spaniards are admittedly not among the citizens with the highest uptake of vegan and vegetarian diets $ ^{[6]}$, but there is a progressive trend among Spanish citizens towards buying and consuming more organic and healthy food.

Spain was also considered the healthiest country in the world in 2020 as per the Bloomberg Global Health Index (out of a total of 163 countries in the analysis) $ ^{[7]}$.
In 2017 Spain was also the top European country by total organic farmland area, with 2.1M ha $ ^{[8]}$.

All of this supports the idea that both the people and the country are ready for a life with more sports, high-quality organic food and natural experiences in general.

Thus, the target group for this analysis is young people from Spain (or abroad) with no children, likely with pets, who love sports and the natural life and are willing to move to one of the country's two main cities to prosper and build a new future for themselves.

In conclusion, the idea of this study is to provide such people with a useful tool when trying to choose the right city to move to. But this is not all. Once the best candidate city is identified, we will also provide insights on the district and neighborhood that best match the migrant's economic capabilities and their requirements for a fulfilling life.

**References:**

[1] ***El Confidencial:*** https://www.elconfidencial.com/economia/2019-09-27/exodo-urbano-espana-llegadas-madrid-ciudades_2240155/. Press

[2] ***ABC:*** https://www.abc.es/sociedad/abci-espanoles-espana-lugar-origen-mas-frecuente-inmigracion-interna-cada-provincia-201901310239_noticia.html?ref=https:%2F%2Fwww.google.com%2F. Press

[3] ***World Population Review:*** https://worldpopulationreview.com/country-rankings/birth-rate-by-country

[4] ***PetSecure:*** https://www.petsecure.com.au/pet-care/a-guide-to-worldwide-pet-ownership/

[5] ***Eurostat - European Commission:*** https://ec.europa.eu/eurostat/statistics-explained/index.php/Statistics_on_sport_participation

[6] ***Wikipedia:*** https://en.wikipedia.org/wiki/Vegetarianism_by_country

[7] ***Bloomberg:*** https://www.bloomberg.com/news/articles/2019-02-24/spain-tops-italy-as-world-s-healthiest-nation-while-u-s-slips

[8] ***FIBL:*** https://orgprints.org/id/eprint/34608/7/Willer-2019-02-14-EUROPE.pdf

## Methodology

During this project we will follow a funneling process, defined by the following steps:
1. define the characteristics of Madrid and Barcelona based on features relevant to the target audience
2. compare both cities and decide whether there are statistically significant and meaningful differences between them
3. choose the best city to move to based on that comparison
4. for the selected city, incorporate socio-economic variables and define groups of neighborhoods based on common profiles
5. display diverse scenarios based on possible socio-economic characteristics of the migrant and match them with a possible final neighborhood selection.

Defining the geographic location of the districts and neighborhoods of Madrid and Barcelona requires several workarounds. In Spain there is no direct correspondence between postal codes and neighborhoods; instead, postal codes map to sets of streets located in a delimited area.

This is why, if we extract the list of districts and neighborhoods for Madrid $ ^{[9]}$ and Barcelona $ ^{[10]}$ from any online source, there is no straightforward way to match them with postal codes that geolocation packages such as `geopy` or `pgeocode` could use to return the expected coordinates.

Thus, after parsing the structure of districts and neighborhoods for Madrid and Barcelona (see table below), we need to run some additional searches on the `Geohack` site $ ^{[11]}$ and build, in a somewhat manual way, an additional table mapping neighborhoods to coordinate values. Once this file is generated, it can be imported from a .csv as a pandas dataframe and merged with the original one generated from the parsed info.

| City          | Districts      | Neighborhoods   |
| :------------ | :------------: | --------------: |
| Madrid        | 21             | 131             |
| Barcelona     | 10             | 73              |
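The merge of the parsed structure with the manually built coordinates file could be sketched as follows. This is a minimal illustration under stated assumptions: the column names (`neighborhood`, `latitude`, `longitude`), the in-memory stand-in for the .csv file and the coordinate values are all hypothetical placeholders, not the project's actual data.

```python
import io

import pandas as pd

# Parsed structure of districts and neighborhoods (illustrative rows only).
parsed = pd.DataFrame({
    "city": ["Barcelona", "Barcelona", "Barcelona"],
    "district": ["Gràcia", "Sants-Montjuïc", "Sant Martí"],
    "neighborhood": ["Vila de Gràcia", "Sants", "El Poblenou"],
})

# Stand-in for the manually built .csv file; coordinates are rough placeholders.
coords_csv = io.StringIO(
    "neighborhood,latitude,longitude\n"
    "Vila de Gràcia,41.40,2.16\n"
    "Sants,41.38,2.13\n"
)
coords = pd.read_csv(coords_csv)

# Left merge keeps every parsed neighborhood; `indicator` flags the rows
# that found no coordinates, and `validate` guards against duplicated keys.
merged = parsed.merge(
    coords, on="neighborhood", how="left",
    validate="one_to_one", indicator=True,
)

# Neighborhoods still missing coordinates in the manual table.
missing = merged.loc[merged["_merge"] == "left_only", "neighborhood"].tolist()
print(missing)
```

Listing the unmatched neighborhoods is a cheap way to spot spelling differences between the parsed names and the manually typed ones before moving on.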

Once we have created the dataframes with the respective locations for each city, we can evaluate the position of their neighborhoods via the `Folium` library.

With a clear picture of the shape and placement of the neighborhoods in Madrid and Barcelona, we can go a step further and use the `Foursquare` API to obtain up to 100 venues for each of them within a 1 km radius of their geographic center. Setting the radius to 1 km helps to gather the venues of each neighborhood exhaustively. Of course, there will be cases in which the same venue is returned for more than one neighborhood; to avoid duplication we will drop the duplicates from the resulting pandas dataframes.
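The deduplication step might look like the sketch below. It assumes each venue carries a unique identifier (here a hypothetical `venue_id` column); the rows are invented for illustration and do not come from the API.

```python
import pandas as pd

# Toy venue table: venue "v2" was returned for two overlapping neighborhoods.
venues = pd.DataFrame({
    "district": ["Gràcia", "Gràcia", "Eixample", "Eixample"],
    "neighborhood": ["Vila de Gràcia", "Vila de Gràcia",
                     "Dreta de l'Eixample", "Dreta de l'Eixample"],
    "venue_id": ["v1", "v2", "v2", "v3"],
    "venue": ["Gym One", "Green Bowl", "Green Bowl", "Dog Park"],
    "venue_category": ["Gym", "Salad Place", "Salad Place", "Dog Run"],
})

# Keep the first occurrence of each venue and discard later repeats.
deduplicated = (
    venues.drop_duplicates(subset="venue_id", keep="first")
    .reset_index(drop=True)
)
print(deduplicated)
```

Note that `keep="first"` assigns a shared venue to whichever neighborhood appears first in the table, which is an arbitrary choice; assigning it to the nearest neighborhood center would be a possible refinement.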

Once we have a dataframe for each city with information about the district, neighborhood, venue category and venue, we can group by the first three variables and get aggregated counts for the last. After that, we can extract from the available categories those related to natural life, filter the dataframes on them, and then aggregate those categories into a smaller set such as:
- healthy restaurants
- healthy shops
- pet places
- sports
- and walkways.

At this stage of the project, we obtain two dataframes per city, including only natural-category venues:
- one dataframe with absolute venue counts per natural category
- and one dataframe grouped by district and neighborhood with venue counts for each.
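The filtering and aggregation described above can be sketched with a simple lookup table. The mapping `CATEGORY_GROUPS` is a hypothetical example of how raw venue categories could be assigned to the five natural categories; the real list of categories would need to be reviewed by hand.

```python
import pandas as pd

# Hypothetical mapping from raw venue categories to natural categories.
CATEGORY_GROUPS = {
    "Vegetarian / Vegan Restaurant": "healthy restaurants",
    "Salad Place": "healthy restaurants",
    "Organic Grocery": "healthy shops",
    "Pet Store": "pet places",
    "Dog Run": "pet places",
    "Gym": "sports",
    "Trail": "walkways",
}

# Toy venue table for one city (values are illustrative).
venues = pd.DataFrame({
    "district": ["Gràcia"] * 3 + ["Sant Martí"] * 2,
    "neighborhood": ["Vila de Gràcia"] * 3 + ["El Poblenou"] * 2,
    "venue_category": ["Gym", "Salad Place", "Bar", "Dog Run", "Gym"],
})

# Unmapped categories (e.g. "Bar") become NaN and are filtered out.
venues["natural_category"] = venues["venue_category"].map(CATEGORY_GROUPS)
natural = venues.dropna(subset=["natural_category"])

# Dataframe 1: absolute venue counts per natural category.
absolute_counts = natural["natural_category"].value_counts()

# Dataframe 2: counts per district and neighborhood, one column per category.
by_neighborhood = (
    natural.groupby(["district", "neighborhood", "natural_category"])
    .size()
    .unstack(fill_value=0)
)
print(absolute_counts)
print(by_neighborhood)
```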

Merging the two dataframes of the same type (one per city) into one allows us to:
- **absolute counts per category:** display the differences between cities graphically (e.g. via a bar chart)
- **grouped counts per category and neighborhood:** run a statistical analysis comparing inter-group vs intra-group variability, likely a `Student's t-test`. We can also display the variability graphically via a box-and-whisker plot.
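To make the comparison concrete, the sketch below computes the test statistic for two samples of per-neighborhood venue counts. It uses Welch's variant of the t-test (which does not assume equal variances), a choice made here for illustration rather than one stated in the methodology; the helper `welch_t` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Invented per-neighborhood counts for one natural category.
madrid_counts = [1, 2, 3, 4, 5]
barcelona_counts = [2, 3, 4, 5, 6]

t_stat, dof = welch_t(madrid_counts, barcelona_counts)
print(t_stat, dof)  # -1.0 and 8.0 for these toy samples
```

The statistic is then compared against the t distribution with `dof` degrees of freedom to obtain a p-value; `scipy.stats.ttest_ind` with `equal_var=False` performs the same test and returns the p-value directly.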

Once those analyses are run, we can check whether there is a statistically significant difference between the cities and, in either case, select one as the preferred migration target based on the results.

Once the preferred city has been selected, we can add socio-economic information to the existing dataframes, whether the chosen city is Madrid $ ^{[12]}$ or Barcelona $ ^{[13]}$. Some additional information can also be included from the general databases of the Spanish National Statistics Institute (INE) $ ^{[14,15]}$.

With such a consolidated dataframe, granular down to the neighborhood level, we can generate the right data structure to run a `K-means` clustering analysis and define groups of neighborhoods based on common characteristics. This could allow us to stratify them not only by natural categories of venues but also by additional socio-economic variables such as average yearly household income or the price per m$^{2}$ of rented accommodation.
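A minimal sketch of such a clustering with scikit-learn is shown below. The feature names, the values and the choice of two clusters are assumptions for illustration only; in practice the number of clusters would have to be selected, for example with the elbow method.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented neighborhood-level features mixing venue counts and
# socio-economic variables; names and numbers are placeholders.
features = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "D", "E", "F"],
    "sports": [12, 11, 13, 2, 3, 1],
    "pet_places": [5, 6, 5, 1, 0, 1],
    "avg_household_income": [52000, 50000, 55000, 28000, 30000, 27000],
    "rent_per_m2": [18.0, 17.5, 19.0, 11.0, 12.0, 10.5],
}).set_index("neighborhood")

# Standardize first so that income (tens of thousands) does not dominate
# the Euclidean distances over venue counts (single digits).
scaled = StandardScaler().fit_transform(features)

model = KMeans(n_clusters=2, n_init=10, random_state=0)
features["cluster"] = model.fit_predict(scaled)
print(features["cluster"])
```

Scaling matters here because K-means relies on distances: without it, the variable with the largest numeric range would effectively decide the clusters on its own.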

Finally, we will generate several scenarios based on the economic capacity of migrants, linked to the jobs they might obtain at their destination. For this we can look at average salary values extracted from sites like Glassdoor $ ^{[16]}$. Once we are able to feed our dataset with such information, we can generate displays for the cluster of neighborhoods that best meets the migrant's needs, and cross their possible salary per role with the price of accommodation by neighborhood: those options with an accommodation price below a given threshold of the migrant's salary will be possible final destinations to move to.
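The affordability check could be sketched as below. Everything here is a hypothetical illustration: the helper `affordable_neighborhoods`, the 50 m² flat size, the 30% rent-to-salary threshold, the roles, the salaries and the rent values are assumptions, not figures taken from the cited sources.

```python
import pandas as pd


def affordable_neighborhoods(rents, annual_salary,
                             flat_size_m2=50, max_rent_share=0.30):
    """Return neighborhoods whose estimated monthly rent fits the budget."""
    # Estimated monthly rent for a flat of the given size.
    monthly_rent = rents["rent_per_m2"] * flat_size_m2
    # Maximum monthly rent allowed by the salary threshold.
    budget = annual_salary / 12 * max_rent_share
    result = rents.assign(monthly_rent=monthly_rent)
    return result[result["monthly_rent"] <= budget].sort_values("monthly_rent")


# Invented monthly rent prices per square meter for three neighborhoods.
rents = pd.DataFrame({
    "neighborhood": ["A", "B", "C"],
    "rent_per_m2": [10.0, 14.0, 20.0],
})

# Invented annual salaries per role.
salaries = {"data analyst": 30000, "software engineer": 42000}

options = {
    role: affordable_neighborhoods(rents, salary)["neighborhood"].tolist()
    for role, salary in salaries.items()
}
print(options)
```

With these toy numbers the lower salary leaves a budget of 750 per month, which rules out the most expensive neighborhood, while the higher salary covers all three.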

**References:**

[9] ***Wikipedia - Madrid districts:*** https://es.wikipedia.org/wiki/Anexo:Distritos_de_Madrid

[10] ***Wikipedia - Barcelona districts:*** https://en.wikipedia.org/wiki/Districts_of_Barcelona

[11] ***Geohack:*** https://geohack.toolforge.org/geohack.php?pagename=Corralejos&params=40_27_52_N_3_35_24_W_type:city(150000)_region:ES

[12] ***OpenData - Madrid:*** https://datos.madrid.es/portal/site/egob/#

[13] ***OpenData - Barcelona:*** https://opendata-ajuntament.barcelona.cat/data/en/dataset

[14] ***INE - sub-municipal data:*** https://www.ine.es/jaxiT3/Tabla.htm?t=30139

[15] ***INEBase - economy:*** https://www.ine.es/dyngs/INEbase/es/categoria.htm?c=Estadistica_P&cid=1254735570541

[16] ***Glassdoor:*** https://www.glassdoor.es/index.htm