# **Tourist Establishments in Spain: Exploratory Data Analysis**

## **Dataset**

The dataset for this study is built from the files generated in the notebooks called *Tourism_Capacity_xxxx_EDA*, which contain the number of tourist establishments registered between 2019 and 2023 in every Spanish province. The dataset holds one row per province and year for which data are available. Since the data cover 52 provinces (the 50 provinces plus the autonomous cities of Ceuta and Melilla) and 5 years, the final dataset consists of 260 rows and 3 columns (province, year and number of tourist establishments).

## **Goal**

Analyze the evolution of the number of tourist establishments between 2019 and 2023 in each of the Spanish provinces. This will allow us to check whether the recorded numbers were affected by the COVID-19 pandemic.

## **Questions we want to answer**

The main reason for analyzing a dataset is to formulate hypotheses and answer specific questions. The questions to be answered are listed below:

  1. **In general, which provinces show the highest number of tourist establishments over the years? In which year did they reach the highest number?**

  2. **In which areas (coastal, inland or island) are the best figures recorded? Does it make sense?**

  3. **In which provinces has there been an increase in the number of tourist establishments over the years? In which ones can a decrease be observed?**
  
  4. **In which year can the greatest variation (increase or decrease) in the number of tourist establishments be observed? Regardless of years, in which provinces can we observe the biggest change in the number of establishments between the beginning and the end of the period of study?**
  
  5. **Is it possible to appreciate the effect of the COVID-19 pandemic on the number of tourist establishments over the years of study? What is the general evolution of the number of establishments both globally and at the province level?**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
data_dir_new = '/content/gdrive/MyDrive/TFM/New/'

In [None]:
# Import necessary packages
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

## 1. Load and Explore Data

The first thing we are going to do is to access the data on tourist establishments, classified by year, for each of the Spanish provinces. These files are the ones that will allow us to build our final dataset.

In [None]:
# Read all the files containing the number of tourist establishments for each Spanish province between 2019 and 2023
df_estab_2019 = pd.read_excel(data_dir_new + '2019_estab_by_province.xlsx', usecols=['province', 'establishments'])
df_estab_2019['year'] = "2019"
df_estab_2020 = pd.read_excel(data_dir_new + '2020_estab_by_province.xlsx', usecols=['province', 'establishments'])
df_estab_2020['year'] = "2020"
df_estab_2021 = pd.read_excel(data_dir_new + '2021_estab_by_province.xlsx', usecols=['province', 'establishments'])
df_estab_2021['year'] = "2021"
df_estab_2022 = pd.read_excel(data_dir_new + '2022_estab_by_province.xlsx', usecols=['province', 'establishments'])
df_estab_2022['year'] = "2022"
df_estab_2023 = pd.read_excel(data_dir_new + '2023_estab_by_province.xlsx', usecols=['province', 'establishments'])
df_estab_2023['year'] = "2023"

In [None]:
# Create the new dataset
df_estab = pd.concat([df_estab_2019, df_estab_2020, df_estab_2021, df_estab_2022, df_estab_2023], axis=0)
df_estab

Unnamed: 0,province,establishments,year
0,Melilla,10,2019
1,Ceuta,2,2019
2,Cádiz,387,2019
3,Málaga,874,2019
4,Almería,218,2019
...,...,...,...
47,Bizkaia,353,2023
48,Araba/Álava,138,2023
49,Palencia,225,2023
50,Las Palmas,1284,2023


As we can see, the dimensions of the dataset coincide with those mentioned in the introduction.
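The truncated preview above does not show the shape directly, so the claim can be checked explicitly. The following is a minimal sketch, not the notebook's own code: `combine_yearly` and `check_panel` are hypothetical helpers, and the toy frames reuse the Madrid and Jaén values from this notebook. In the notebook itself, each yearly frame would come from `pd.read_excel`.

```python
import pandas as pd


def combine_yearly(frames_by_year):
    """Stack one DataFrame per year, tagging each row with its year (hypothetical helper)."""
    tagged = [df.assign(year=year) for year, df in frames_by_year.items()]
    # ignore_index=True gives a unique 0..n-1 index instead of repeating each file's index
    return pd.concat(tagged, axis=0, ignore_index=True)


def check_panel(df, n_provinces, n_years):
    """Return True if the frame has one row per province and year and 3 columns."""
    counts = df.groupby("province")["year"].nunique()
    return bool(
        df.shape == (n_provinces * n_years, 3)
        and len(counts) == n_provinces
        and (counts == n_years).all()
        and not df.duplicated(["province", "year"]).any()
    )


# Toy input: two provinces and two years, with values taken from this notebook
frames = {
    "2019": pd.DataFrame({"province": ["Madrid", "Jaén"], "establishments": [1066, 304]}),
    "2020": pd.DataFrame({"province": ["Madrid", "Jaén"], "establishments": [1085, 313]}),
}
df_toy = combine_yearly(frames)
print(df_toy.shape)
print(check_panel(df_toy, n_provinces=2, n_years=2))
```

For the full dataset, a call such as `check_panel(df_estab, n_provinces=52, n_years=5)` would be expected to return `True` if the 260 × 3 shape described in the introduction holds.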

In [None]:
df_estab.province.unique()

array(['Melilla', 'Ceuta', 'Cádiz', 'Málaga', 'Almería', 'Granada',
       'Sevilla', 'Huelva', 'Jaén', 'Córdoba', 'Castelló/Castellón',
       'Murcia', 'Illes Balears', 'Alacant/Alicante', 'Albacete',
       'Badajoz', 'Ciudad Real', 'València/Valencia', 'Gipuzkoa',
       'Toledo', 'Cáceres', 'Madrid', 'Cuenca', 'Ávila', 'Guadalajara',
       'Salamanca', 'Teruel', 'Segovia', 'Tarragona', 'Soria', 'Zamora',
       'Valladolid', 'Barcelona', 'Girona', 'La Rioja', 'Ourense',
       'Navarra', 'Zaragoza', 'Lleida', 'Huesca', 'Pontevedra', 'Burgos',
       'León', 'Cantabria', 'Asturias', 'Lugo', 'A Coruña', 'Bizkaia',
       'Araba/Álava', 'Palencia', 'Las Palmas', 'Santa Cruz de Tenerife'],
      dtype=object)

Now, we are going to start by carrying out a small study to examine the evolution of the number of tourist establishments but only in some specific provinces. This will allow us to find out if all of them follow the same pattern or, on the contrary, if there is any province that shows a different behaviour over the years of interest.

## 2. Study by Province

### **MADRID**

In [None]:
# Select province of interest
df_province = df_estab[df_estab['province'] == 'Madrid']
df_province

Unnamed: 0,province,establishments,year
21,Madrid,1066,2019
21,Madrid,1085,2020
21,Madrid,1134,2021
21,Madrid,1174,2022
21,Madrid,1233,2023


To better understand how these values have evolved over the years, we will visualise them using a bar chart. This will allow us to see whether these values tend to increase or decrease as we move closer to or further away from the pandemic.

In [None]:
# Barplot of Madrid
fig = px.bar(df_province, x = 'year', y = 'establishments',
             title = "Number of Tourist Establishments in Madrid",
             labels = {'year': 'Year of Study', 'establishments': 'Number of Establishments'})

fig.show()

As can be seen in the graph, there appears to have been an increase in the number of tourist establishments in Madrid since 2019. Now let's take a look at which years saw the biggest increase.

In [None]:
diff = df_province["establishments"].diff()
diff

21     NaN
21    19.0
21    49.0
21    40.0
21    59.0
Name: establishments, dtype: float64

By calculating the difference between the number of establishments registered each year, we see that the largest jump (59) occurred between 2022 and 2023. During the first year of the pandemic, the hardest of all, the number still increased, but by less than in any other year (19).
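Because every row of the filtered frame keeps the same index label, the `diff()` output above does not say which year each value belongs to. As a small sketch (using the Madrid values from the table above, with variable names chosen only for illustration), indexing by year makes the differences easier to read:

```python
import pandas as pd

# Madrid values copied from the table above
madrid = pd.DataFrame({
    "year": ["2019", "2020", "2021", "2022", "2023"],
    "establishments": [1066, 1085, 1134, 1174, 1233],
})

# Index by year so each difference is labelled with the year in which it was recorded
yearly = madrid.set_index("year")["establishments"]
change = yearly.diff().dropna().astype(int)
print(change)

# Years with the largest and the smallest year-over-year increase
print(change.idxmax(), change.idxmin())  # 2023 2020
```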

With this in mind, let's take a look at what happened in another of Spain's provinces. In this case, we will select a coastal province: **Valencia**.

### **VALENCIA**

In [None]:
# Select province of interest
df_province = df_estab[df_estab['province'] == 'València/Valencia']
df_province

Unnamed: 0,province,establishments,year
17,València/Valencia,383,2019
17,València/Valencia,413,2020
17,València/Valencia,440,2021
17,València/Valencia,456,2022
17,València/Valencia,495,2023


As in Madrid, we will visualize the evolution of the number of establishments using a barplot.

In [None]:
# Barplot of Valencia
fig = px.bar(df_province, x = 'year', y = 'establishments',
             title = "Number of Tourist Establishments in Valencia",
             labels = {'year': 'Year of Study', 'establishments': 'Number of Establishments'})

fig.show()

Although the number of establishments is much smaller compared to Madrid, the tendency is also to increase. Let's see when the biggest change occurred.

In [None]:
diff = df_province["establishments"].diff()
diff

17     NaN
17    30.0
17    27.0
17    16.0
17    39.0
Name: establishments, dtype: float64

In this scenario, the most substantial increase takes place during the same period as in Madrid (2022 to 2023). However, the first year of the pandemic also shows a notable increase, which contrasts with the situation in the capital. In this province, the smallest increase occurs between 2021 and 2022.

Now, let's look at the numbers achieved on one of the archipelagos: **Illes Balears**.

### **ILLES BALEARS**

In [None]:
# Select province of interest
df_province = df_estab[df_estab['province'] == 'Illes Balears']
df_province

Unnamed: 0,province,establishments,year
12,Illes Balears,1494,2019
12,Illes Balears,1507,2020
12,Illes Balears,1672,2021
12,Illes Balears,1736,2022
12,Illes Balears,1847,2023


At first glance we can see that higher numbers are reached compared to the other two provinces.

Now, let's visualize these numbers more easily with some graphs.

In [None]:
# Barplot of Illes Balears
fig = px.bar(df_province, x = 'year', y = 'establishments',
             title = "Number of Tourist Establishments in Illes Balears",
             labels = {'year': 'Year of Study', 'establishments': 'Number of Establishments'})

fig.show()

As can be seen in the graph, the trend is the same as in the two previous provinces: the number of establishments increases over the years. Let's check when the biggest increases occur.

In [None]:
diff = df_province["establishments"].diff()
diff

12      NaN
12     13.0
12    165.0
12     64.0
12    111.0
Name: establishments, dtype: float64

In this case, we notice that the smallest increase occurred during the first year of the pandemic, with a significantly smaller jump compared to subsequent years. However, despite the opening of numerous new establishments in the last year of the period, the most substantial increase occurred between 2020 and 2021.

Finally, let's take a look at how the number of tourist establishments behaves in one of the most depopulated provinces in Spain: **Jaén**

--> https://www.publico.es/sociedad/mapa-despoblacion-espana-cerca-20-provincias-han-perdido-millon-habitantes-medio-siglo.html

--> https://www.eleconomista.es/economia/noticias/11051135/02/21/Las-tres-Espanas-despobladas-23-provincias-con-un-pasado-similar-pero-con-futuros-muy-diferentes.html

### **JAÉN**

In [None]:
# Select province of interest
df_province = df_estab[df_estab['province'] == 'Jaén']
df_province

Unnamed: 0,province,establishments,year
8,Jaén,304,2019
8,Jaén,313,2020
8,Jaén,322,2021
8,Jaén,336,2022
8,Jaén,352,2023


As expected, this is the province with the lowest number of establishments so far. Although there are probably provinces with fewer establishments, such as Ceuta or Melilla, they will not be analyzed in this section.

Once again, let's check how the data for this province has evolved.

In [None]:
# Barplot of Jaén
fig = px.bar(df_province, x = 'year', y = 'establishments',
             title = "Number of Tourist Establishments in Jaén",
             labels = {'year': 'Year of Study', 'establishments': 'Number of Establishments'})

fig.show()

As in the previous cases, we see once again that the number of establishments has increased over the years. Let's see if the biggest jumps occur in the same years as in the rest of the provinces.

In [None]:
diff = df_province["establishments"].diff()
diff

8     NaN
8     9.0
8     9.0
8    14.0
8    16.0
Name: establishments, dtype: float64

In this last scenario, we see that the increases are practically the same throughout the period under study, although it is between 2022 and 2023 when we see a slightly higher increase.

**Therefore, we see that, in general, the number of tourist establishments behaves in a similar way in the selected provinces regardless of when the biggest jumps have taken place.**
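The four province-level comparisons can also be done in a single step. The sketch below assumes a wide table (one column per province, one row per year) built by hand from the four tables above. In the notebook, the same layout could be obtained from the long-format data with `df_estab.pivot(index='year', columns='province', values='establishments')`.

```python
import pandas as pd

years = ["2019", "2020", "2021", "2022", "2023"]
# Values copied from the four province tables above
values = {
    "Madrid": [1066, 1085, 1134, 1174, 1233],
    "València/Valencia": [383, 413, 440, 456, 495],
    "Illes Balears": [1494, 1507, 1672, 1736, 1847],
    "Jaén": [304, 313, 322, 336, 352],
}
wide = pd.DataFrame(values, index=pd.Index(years, name="year"))

# Year-over-year change for every province at once (the first row has no predecessor)
changes = wide.diff().dropna().astype(int)
print(changes)

# Year in which each province recorded its largest increase
print(changes.idxmax())
```

For these four provinces, the result agrees with the text: the largest increase falls in 2023 everywhere except Illes Balears, where it falls in 2021.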

## 3. Global Study: Comparison

Below, we will compare the number of establishments for all the Spanish provinces. This will allow us to know which are the places that, on average, have the highest number of tourist establishments and which type (coastal, inland or island) of provinces are the preferred ones for opening new tourist accommodations. Therefore, what we are going to do is to calculate the average number of establishments in each province over the 5 years of study.

In [None]:
# Get the average of establishments for each province during the selected time period
df_mean = df_estab.groupby('province').mean(numeric_only=True).round(0).sort_values(by='establishments')
df_mean

province,establishments
Ceuta,5.0
Melilla,9.0
Araba/Álava,128.0
Ciudad Real,145.0
Huelva,149.0
Albacete,151.0
Ourense,159.0
Valladolid,172.0
La Rioja,177.0
Córdoba,186.0


To visualize these numbers more easily we are going to represent them graphically from smallest to largest. This will allow us to see at a glance the provinces with the lowest and highest number of establishments.

In [None]:
fig = px.bar(df_mean, x = df_mean.index, y = 'establishments', text_auto='.2s',
             title = "Number of Tourist Establishments by Province",
             labels = {'province': 'Province', 'establishments': 'Number of Establishments'})

fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show()

In the barplot above we can see the following:

- The provinces that stand out for their number of establishments are Illes Balears (island), Barcelona (coastal), Las Palmas (island), Madrid (inland, capital of Spain), Asturias (coastal), Málaga (coastal) and Santa Cruz de Tenerife (island).

- The provinces with the lowest number of establishments are Ceuta and Melilla (Spanish autonomous cities on the North African coast), Álava (inland), Ciudad Real (inland), Huelva (coastal), Albacete (inland) and Ourense (inland).

As can be seen, the areas with the highest number of establishments are mostly island or coastal provinces, in addition to the country's capital, while those with the fewest establishments are the two autonomous cities on the African continent and mostly inland provinces of the mainland (the so-called empty Spain).
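The comparison by area type is read off the chart here, but it could also be computed. The following is only a sketch under stated assumptions: `area_of` is a hypothetical mapping that covers a handful of provinces using the labels given above, and it would need to be completed for all 52 provinces. The means are the rounded values from the `df_mean` output plus means computed from the four province tables in section 2.

```python
import pandas as pd

# Rounded five-year means: the first five come from the df_mean output above,
# the last four are computed from the province tables in section 2
mean_estab = pd.Series({
    "Araba/Álava": 128, "Ciudad Real": 145, "Huelva": 149, "Albacete": 151,
    "Ourense": 159, "Jaén": 325, "València/Valencia": 437,
    "Madrid": 1138, "Illes Balears": 1651,
}, name="establishments")

# Assumed classification (hypothetical mapping, to be completed for all 52 provinces)
area_of = {
    "Araba/Álava": "inland", "Ciudad Real": "inland", "Albacete": "inland",
    "Ourense": "inland", "Jaén": "inland", "Madrid": "inland",
    "Huelva": "coastal", "València/Valencia": "coastal",
    "Illes Balears": "island",
}

df_area = mean_estab.rename_axis("province").reset_index()
df_area["area"] = df_area["province"].map(area_of)

# The median is less sensitive than the mean to a single large province such as Madrid
summary = df_area.groupby("area")["establishments"].agg(["count", "median", "mean"])
print(summary)
```

With so few provinces the summary is not meaningful on its own. It only illustrates how the question about coastal, inland and island areas could be answered once every province has a label.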

If we recall the first study we made of the tourist numbers collected by the INE for each of the provinces, we will see that the results are very similar to those obtained here. Therefore, it seems that, in this sense, the results are consistent with the tourism situation in Spain. What does not seem to make so much sense is that the number of establishments also increases during the hardest years of the pandemic. We will continue to analyze the data and, at the end of this study, we will see what conclusions can be drawn.

Now, although we have seen approximately how the numbers behaved in all the Spanish provinces over time, we are going to see how they vary in general without taking into account each province separately. This will allow us to check whether the provinces analyzed reflect the overall behaviour of the number of tourist establishments in Spain as a whole over the selected time period.

In [None]:
# Calculate the total volume of tourist establishments in Spain for each of the years under study
df_overall = df_estab.groupby('year').sum(numeric_only=True)
df_overall

year,establishments
2019,20376
2020,21818
2021,23364
2022,24736
2023,26669


After calculating the total volume of establishments in Spain, we are going to represent these numbers in the same way as in the province-by-province analysis.

In [None]:
# Barplot of Spain
fig = px.bar(df_overall, x = df_overall.index, y = 'establishments',
             title = "Number of Tourist Establishments in Spain",
             labels = {'year': 'Year of Study', 'establishments': 'Number of Establishments'})

fig.show()

As we have seen in each of the provinces studied above, there is also an overall increase in the number of establishments over the years. This makes sense given that, if the number of tourist establishments in each province increases, and these are the values used to calculate the aggregate numbers, the latter will also increase. Now let's check when the biggest jumps occur.

In [None]:
diff = df_overall["establishments"].diff()
diff

year
2019       NaN
2020    1442.0
2021    1546.0
2022    1372.0
2023    1933.0
Name: establishments, dtype: float64

If we recall the results obtained for the 4 provinces previously analyzed, we saw that, in general, the largest increases occurred between 2022 and 2023. The exception is Illes Balears, which did register a large jump in this period, but the most notable increase occurred between 2020 and 2021. On the other hand, in all provinces except Valencia, the smallest increase was observed during the first year of the pandemic. It is between 2021 and 2022 when this province shows the smallest difference in the number of establishments.

If we look at the overall numbers, we see that 2023 was indeed the year in which the most tourist establishments were opened in Spain (1,933). The next biggest jump (1,546) occurred during the year following the outbreak of the pandemic, as was the case in Illes Balears. The year with the fewest openings (1,372, between 2021 and 2022) coincides with that seen for Valencia, while the first pandemic year shows the second lowest value (1,442).
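The ranking of the years described above can be reproduced directly from the national totals. This sketch uses the totals from the `df_overall` output and also adds the relative change, which the notebook does not compute. The column names are chosen only for illustration.

```python
import pandas as pd

# National totals copied from the df_overall output above
totals = pd.Series(
    [20376, 21818, 23364, 24736, 26669],
    index=pd.Index(["2019", "2020", "2021", "2022", "2023"], name="year"),
    name="establishments",
)

summary = pd.DataFrame({
    "abs_change": totals.diff(),
    "pct_change": totals.pct_change().mul(100).round(2),
}).dropna()

# Rank the years from the largest to the smallest absolute increase
ranking = summary["abs_change"].sort_values(ascending=False).index.tolist()
print(summary)
print(ranking)  # ['2023', '2021', '2020', '2022']
```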

In other words, the variation in the number of tourist accommodations does not happen in the same way in all the provinces previously analyzed, which makes sense given that each of them has its own characteristics. Even so, we can say that the behaviour seen at the national level is a combination of what has been seen in these provinces. The most significant finding is that in all cases we observe an increase in the number of tourist establishments over the years. It is also important to note that we have analyzed only a small part of the regions that make up Spain, so it is normal that the selected provinces do not accurately reflect the behaviour of the entire Spanish tourism business.

Finally, to conclude this section, we are going to represent in a line plot the evolution of the number of establishments both in each province and as a whole. In this way, we will see much better how the trend in each province matches that seen in Spain. To do so, we will combine in a single dataframe both the numbers registered for all the provinces in each of the years of study and the total number of establishments in Spain in those same years.

In [None]:
# Reset index of the aggregated dataframe and add province column
df_overall = df_overall.reset_index(inplace=False)
df_overall['province'] = 'Spain'
print(df_overall.index)
print(df_overall.columns)

RangeIndex(start=0, stop=5, step=1)
Index(['year', 'establishments', 'province'], dtype='object')


In [None]:
# Concatenate aggregated dataframe and dataframe with numbers by province
df_plot_overall = pd.concat([df_estab, df_overall[df_estab.columns]])
df_plot_overall

Unnamed: 0,province,establishments,year
0,Melilla,10,2019
1,Ceuta,2,2019
2,Cádiz,387,2019
3,Málaga,874,2019
4,Almería,218,2019
...,...,...,...
0,Spain,20376,2019
1,Spain,21818,2020
2,Spain,23364,2021
3,Spain,24736,2022


In [None]:
px.line(df_plot_overall, x="year", y="establishments", color='province', markers=True, title='Evolution of the number of tourist establishments')

Since the lines for each province are very small compared to the national line, we have checked each one individually to see its trend. All provinces show an increase in the number of establishments over the years, with two exceptions:

- **Melilla** -> Throughout 2020 the number of establishments remained at 10 but, during the following year, this number dropped to 9 and remained at that number until the end of the period of study.

- **Lleida** -> There was an increase in the number of establishments throughout 2020 and 2021 but, over the course of 2022, the number of establishments decreased below the number recorded in 2019. Although during the last year there was an increase again, the peak was reached in 2021.
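Two things in this step can be automated: putting small and large series on a comparable scale, and finding provinces whose numbers fall at some point. The sketch below is one possible approach, not the procedure used in this notebook. Madrid and Jaén come from the tables in section 2, and the Melilla series follows the description given above.

```python
import pandas as pd

years = ["2019", "2020", "2021", "2022", "2023"]
# Madrid and Jaén come from the tables in section 2; Melilla follows the description above
wide = pd.DataFrame({
    "Madrid": [1066, 1085, 1134, 1174, 1233],
    "Jaén": [304, 313, 322, 336, 352],
    "Melilla": [10, 10, 9, 9, 9],
}, index=pd.Index(years, name="year"))

# Rebase every series to 2019 = 100 so small and large provinces share one scale
indexed = wide.div(wide.loc["2019"]).mul(100).round(1)
print(indexed)

# Flag provinces with at least one year-over-year decrease
has_drop = wide.diff().lt(0).any()
print(has_drop[has_drop].index.tolist())  # ['Melilla']
```

A rebased table like `indexed` can be plotted with the same line chart, which avoids the national total flattening the provincial lines.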

To conclude this study, we are going to check which provinces have experienced the greatest increase between the first and the last year of the period studied; in other words, we will analyze which areas have been the preferred ones for business owners to open new establishments during the last five years.

In [None]:
# Compute, for each province, the change in establishments between 2019 and 2023
increases = []
provinces = df_estab.province.unique()

for p in provinces:

  # Sort by year so that rows 0 and 4 correspond to 2019 and 2023;
  # reset_index returns a new frame, which avoids modifying a slice in place
  df_province = df_estab[df_estab['province'] == p].sort_values('year').reset_index(drop=True)

  increases.append(df_province.loc[4, 'establishments'] - df_province.loc[0, 'establishments'])

df_increases = pd.DataFrame({'provinces': provinces, 'increases': increases})

df_increases.sort_values(by='increases', ascending=False).head(10)

Unnamed: 0,provinces,increases
44,Asturias,395
12,Illes Balears,353
28,Tarragona,290
36,Navarra,285
32,Barcelona,283
43,Cantabria,263
2,Cádiz,254
33,Girona,229
3,Málaga,205
40,Pontevedra,200


As we can see, Asturias and Illes Balears stand out above the rest of the provinces, with increases of 395 and 353 establishments respectively over these five years. They are followed by Tarragona, Navarra and Barcelona. If we recall the provinces with the highest number of tourist establishments, we will see that four of the seven (Asturias, Illes Balears, Barcelona and Málaga) are in this top 10, while Madrid, Las Palmas and Santa Cruz de Tenerife are not.
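The same first-to-last comparison can be written without an explicit loop. As a hedged alternative, the sketch below pivots a small long-format sample (the 2019 and 2023 values of the four provinces from section 2) and subtracts the two year columns by label. `df_sample` is a stand-in for `df_estab`.

```python
import pandas as pd

# Long-format sample with the same columns as df_estab (values from section 2)
rows = []
for province, first, last in [
    ("Madrid", 1066, 1233),
    ("València/Valencia", 383, 495),
    ("Illes Balears", 1494, 1847),
    ("Jaén", 304, 352),
]:
    rows.append({"province": province, "establishments": first, "year": "2019"})
    rows.append({"province": province, "establishments": last, "year": "2023"})
df_sample = pd.DataFrame(rows)

# One row per province, one column per year
wide = df_sample.pivot(index="province", columns="year", values="establishments")

# Selecting the years by label avoids relying on row order within each province
increases = (wide["2023"] - wide["2019"]).sort_values(ascending=False)
print(increases)
```

The value obtained for Illes Balears (353) matches the one in the table above.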

## 4. Answering Questions

### **1. In general, which provinces show the highest number of tourist establishments over the years? In which year did they reach the highest number?**

As we have seen in previous graphs, Illes Balears, Barcelona, Las Palmas and Madrid stand out in terms of number of tourist establishments, followed by Asturias, Málaga and Santa Cruz de Tenerife. In all these cases, the peak was recorded at the end of 2023.

### **2. In which areas (coastal, inland or island) are the best figures recorded? Does it make sense?**

Taking into account the answer to the first question, we see that the coastal areas, together with the islands and the capital, are the ones with the best figures, which makes sense.

### **3. In which provinces has there been an increase in the number of tourist establishments over the years? In which ones can a decrease be observed?**

As we have seen previously, in all the provinces except in Melilla and Lleida there has been an increase in the number of establishments over the period of study.

### **4. In which year can the greatest variation (increase or decrease) in the number of tourist establishments be observed? Regardless of years, in which provinces can we observe the biggest change in the number of establishments between the beginning and the end of the period of study?**

If we consider the aggregate numbers, we see that the largest increase was recorded between 2022 and 2023. On the other hand, and as we have seen previously, the provinces that have experienced the greatest increase in the number of establishments over the last few years are Asturias, Illes Balears, Tarragona, Navarra and Barcelona.

### **5. Is it possible to appreciate the effect of the COVID-19 pandemic on the number of tourist establishments over the years of study? What is the general evolution of the number of establishments both globally and at the province level?**

Regarding the general evolution, almost all provinces have seen an increase in the number of tourist accommodations over the years, and this trend is reflected at the national level as well. The largest increases occurred between 2020 and 2021 and between 2022 and 2023.

The first substantial increase took place in the year following the pandemic outbreak. During this time, some restrictions had been lifted, tourism was gradually resuming, and preventive health measures were in place to curb further contagion. This rise likely stemmed from entrepreneurs' urgent need to recover losses incurred during the months of lockdown, combined with people's desire to leave their homes, which also contributed to the proliferation of businesses. In other words, the number of establishments that closed due to COVID-19 appears to have been offset by the large number of new openings after the quarantine, which would explain why the volume of tourist accommodations was not affected that year.

The last year of the period, 2023, saw the strongest growth in tourist accommodation options. As mentioned in the introduction, many studies predicted that pre-pandemic tourism figures would be restored by the end of 2023, a prediction confirmed by our initial study. With restrictions gone and complete freedom to travel, it is logical that tourism numbers would continue to improve as we leave the pandemic behind.