# **Spanish Tourism Data: Exploratory Data Analysis**

## **Exploratory Data Analysis**

In this notebook we perform an Exploratory Data Analysis (EDA) of a specific dataset. EDA is an approach to analyzing datasets in order to summarize their main characteristics, often using statistical graphics and other data visualization methods. It allows us to identify possible errors, reveal outliers, check the relationships between variables (correlations) and their possible redundancy, and describe the data through graphical representations and summaries of its most significant aspects.

Considering this, the main EDA tasks include:

  - Cleansing the data.

  - Sampling the data.

  - Transforming data.

  - Generating summaries and plots.

  - Understanding the data and formulating hypotheses for testing.

## **Dataset**

The dataset to be explored, published on the website of the Spanish National Statistics Institute (https://www.ine.es/jaxiT3/Tabla.htm?t=52047&L=0), provides a measure of inbound tourism in each of Spain's provinces based on the position of mobile phones.

## **Goal**

Understand how tourism works in Spain to help entrepreneurs set up new businesses.

## **Questions we want to answer**

The main reason for analyzing a dataset is to formulate hypotheses and answer specific questions. The questions we want to answer are the following:

  1. **Which provinces have the highest number of tourists, overnights and average travel time?**

  2. **In which periods (summer, Christmas...), in general, are the best numbers achieved? And the worst?**

  3. **In which areas (inland or coastal) are there better numbers?**

  4. **In each province, in which month and year were the highest numbers reached? Are they the same in all provinces?**

  5. **In each province, in which month and year have the most anomalous numbers been recorded? Are they the same in all provinces?**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# Import necessary packages
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [None]:
data_dir = '/content/gdrive/My Drive/TFM/INE/'
data_dir_new = '/content/gdrive/MyDrive/TFM/New/'

We first study several provinces to see whether they all follow the same pattern in their tourism data over the study period (July 2019 to December 2023). This will show which months of the year have the highest tourism numbers and how the pandemic affected them. Then, a global comparison will show which provinces receive the most tourism, and which kind of province (coastal, inland or island) attracts the most visitors. The result of this analysis will indicate which Spanish provinces need more services and of what type (hotels, hostels, guesthouses, rural houses...). The final objective of this study is to understand in which areas it is most advisable for an entrepreneur to set up a new business.

## 1. Study by Province

In [None]:
df_bp = pd.read_excel(data_dir + 'INE_data_by_province.xlsx')
df_bp

Unnamed: 0,period,province,no_tourists,no_overnights,avg_monthly_travel_time (days)
0,2023-12,Albacete,15085,115895,7.7
1,2023-11,Albacete,12833,93364,7.3
2,2023-10,Albacete,17009,124112,7.3
3,2023-09,Albacete,17104,133101,7.8
4,2023-08,Albacete,21226,182420,8.6
...,...,...,...,...,...
2803,2019-11,Zaragoza,43417,426649,9.8
2804,2019-10,Zaragoza,49672,442630,8.9
2805,2019-09,Zaragoza,44523,345621,7.8
2806,2019-08,Zaragoza,51119,519279,10.2


In [None]:
df_bp.dtypes

period                             object
province                           object
no_tourists                         int64
no_overnights                       int64
avg_monthly_travel_time (days)    float64
dtype: object

To simplify the visualizations, we are going to change the *period* column to datetime format.

In [None]:
df_bp.period = pd.to_datetime(df_bp.period, format="%Y-%m")
df_bp.dtypes

period                            datetime64[ns]
province                                  object
no_tourists                                int64
no_overnights                              int64
avg_monthly_travel_time (days)           float64
dtype: object
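
With `period` stored as a datetime, calendar parts can be extracted through the `.dt` accessor, which is useful for the month-by-month questions posed at the start. The following is a minimal sketch on a few of the Albacete rows displayed earlier; the `year` and `month` column names are assumptions introduced for illustration and are not part of the INE file.

```python
import pandas as pd

# Small sample copied from the Albacete rows displayed above
sample = pd.DataFrame({
    "period": ["2023-12", "2023-11", "2023-10", "2023-09", "2023-08"],
    "no_tourists": [15085, 12833, 17009, 17104, 21226],
})
sample["period"] = pd.to_datetime(sample["period"], format="%Y-%m")

# Hypothetical helper columns (not present in the original dataset)
sample["year"] = sample["period"].dt.year
sample["month"] = sample["period"].dt.month

print(sample)
```

Keeping the month as a separate column makes it straightforward to group or filter by season later on.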

In [None]:
# Check the dimensions of the dataset
df_bp.shape

(2808, 5)

In [None]:
# Get information about the dataset
df_bp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2808 entries, 0 to 2807
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   period                          2808 non-null   datetime64[ns]
 1   province                        2808 non-null   object        
 2   no_tourists                     2808 non-null   int64         
 3   no_overnights                   2808 non-null   int64         
 4   avg_monthly_travel_time (days)  2808 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 109.8+ KB


#### As we can see, there are no missing values in the dataset. This can also be checked by using the `isnull()` and `sum()` methods.

In [None]:
df_bp.isnull().sum()

period                            0
province                          0
no_tourists                       0
no_overnights                     0
avg_monthly_travel_time (days)    0
dtype: int64

#### Again, we see that in this study it will not be necessary to deal with missing values.
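
Beyond missing values, it may be worth confirming that every province covers the same months and that no month is repeated. The 2808 rows would be consistent with 54 monthly records for each of 52 territories (50 provinces plus Ceuta and Melilla), but that is an inference rather than something the outputs above state. A minimal sketch of such a check on a toy frame built from rows shown earlier; the helper name `months_per_province` is hypothetical.

```python
import pandas as pd

# Toy frame with two provinces and two months each (values from the outputs above)
toy = pd.DataFrame({
    "period": pd.to_datetime(["2023-12", "2023-11", "2023-12", "2023-11"], format="%Y-%m"),
    "province": ["Albacete", "Albacete", "Madrid", "Madrid"],
    "no_tourists": [15085, 12833, 608805, 610855],
})

def months_per_province(df):
    """Count the distinct months recorded for each province (hypothetical helper)."""
    return df.groupby("province")["period"].nunique()

counts = months_per_province(toy)
print(counts)

# True when every province has the same number of months
complete = counts.nunique() == 1
# True when no (province, period) pair appears more than once
no_duplicates = not toy.duplicated(subset=["province", "period"]).any()
print(complete, no_duplicates)
```

On the full dataset the same function could be applied to `df_bp` directly.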

### **MADRID**

In [None]:
# Select province of interest
df_province = df_bp[df_bp['province'] == 'Madrid']
print(df_province.shape)
df_province

(54, 5)


Unnamed: 0,period,province,no_tourists,no_overnights,avg_monthly_travel_time (days)
1620,2023-12-01,Madrid,608805,3682622,6.0
1621,2023-11-01,Madrid,610855,3405314,5.6
1622,2023-10-01,Madrid,757454,4249639,5.6
1623,2023-09-01,Madrid,683907,3806872,5.6
1624,2023-08-01,Madrid,583072,3837767,6.6
1625,2023-07-01,Madrid,663257,4095496,6.2
1626,2023-06-01,Madrid,618147,3909494,6.3
1627,2023-05-01,Madrid,631076,3865416,6.1
1628,2023-04-01,Madrid,621867,3863172,6.2
1629,2023-03-01,Madrid,566664,3431482,6.1


Before we start reviewing the numbers recorded for Madrid, let's see how they have evolved over the whole period of study. To do this, we are going to implement a line plot for each of the variables that make up the dataset.

In [None]:
# Plot some graphs
fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['no_tourists'], name='Number of Tourists'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['no_overnights'], name='Number of Overnights'),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['avg_monthly_travel_time (days)'], name='Avg. Monthly Travel Time'),
    row=3, col=1
)

fig.update_layout(height=800, width=1200, title_text="Madrid Tourist Data")
fig.show()

These graphs provide the following information:

- The variables representing the number of tourists and the number of overnights behave similarly, although the latter shows more abrupt changes (peaks).

- The variable reflecting the average travel time behaves in the opposite way to the two previous variables.

- The first two variables decreased with the arrival of the pandemic, while the average travel time increased between February 2020 and January 2021. After this, all three variables gradually recovered their initial values.
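
One way to read the inverse behaviour of the third variable: in the Madrid rows shown above, the reported average travel time is close to the number of overnights divided by the number of tourists, so the ratio rises whenever tourists fall faster than overnights. The sketch below checks this on a handful of rows. That the INE derives the column this way is an assumption, and `overnights_per_tourist` is a hypothetical column name.

```python
import pandas as pd

# Madrid rows copied from the table above
madrid = pd.DataFrame({
    "period": ["2023-12", "2023-11", "2023-10", "2023-09", "2023-08"],
    "no_tourists": [608805, 610855, 757454, 683907, 583072],
    "no_overnights": [3682622, 3405314, 4249639, 3806872, 3837767],
    "avg_monthly_travel_time (days)": [6.0, 5.6, 5.6, 5.6, 6.6],
})

# Hypothetical derived column: overnights per tourist, rounded to one decimal
madrid["overnights_per_tourist"] = (
    madrid["no_overnights"] / madrid["no_tourists"]
).round(1)

print(madrid[["period", "avg_monthly_travel_time (days)", "overnights_per_tourist"]])
```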

However, what conclusions can we draw from this behaviour? As we go through the analysis, we will gain a better understanding of why tourism behaves as it does in each of the Spanish provinces and we will be able to answer the questions previously formulated.

For the moment, let's continue with the analysis of Madrid's tourism data by obtaining some statistical values.

In [None]:
# Get some statistical data from the province dataset
df_province.describe()

Unnamed: 0,period,no_tourists,no_overnights,avg_monthly_travel_time (days)
count,54,54.0,54.0,54.0
mean,2021-09-15 14:40:00,409694.240741,2994263.0,8.281481
min,2019-07-01 00:00:00,82328.0,1047787.0,5.6
25%,2020-08-08 18:00:00,210084.25,2384662.0,6.325
50%,2021-09-16 00:00:00,449761.0,3218248.0,7.35
75%,2022-10-24 06:00:00,566182.25,3856821.0,8.8
max,2023-12-01 00:00:00,757454.0,4422659.0,15.3
std,,190264.493262,941318.9,2.667914


The following can be observed:

- The average value of tourists is approximately 409694, the minimum is 82328 and the maximum is 757454.

- The average value of overnights is 2994263, the minimum is 1047787 and the maximum is 4422659.

- The mean of the average travel time (days) during the selected time period is 8.28, the minimum is 5.60 and the maximum is 15.30.

Now, we are going to check in which months the maximum and minimum values seen above are obtained and whether these values make sense or not.

In [None]:
print(df_province[df_province['no_tourists'] == df_province['no_tourists'].max()])
print(df_province[df_province['no_overnights'] == df_province['no_overnights'].max()])
print(df_province[df_province['avg_monthly_travel_time (days)'] == df_province['avg_monthly_travel_time (days)'].max()])

         period province  no_tourists  no_overnights  \
1622 2023-10-01   Madrid       757454        4249639   

      avg_monthly_travel_time (days)  
1622                             5.6  
         period province  no_tourists  no_overnights  \
1670 2019-10-01   Madrid       660919        4422659   

      avg_monthly_travel_time (days)  
1670                             6.7  
         period province  no_tourists  no_overnights  \
1658 2020-10-01   Madrid       164158        2506520   

      avg_monthly_travel_time (days)  
1658                            15.3  


As can be seen, the maximum number of tourists was recorded in October 2023, which means that the figures reached in the months preceding the pandemic have been surpassed. For overnights, the highest number was reached in October 2019, which makes sense because the COVID-19 pandemic had not yet started. October 2023 recorded a number of overnights very close to that maximum, but the pre-pandemic value has not been surpassed. Finally, in line with the graphs above, the average travel time increased considerably during the pandemic months, reaching its maximum in October 2020. With restrictions still in place, this may reflect travellers' reluctance to move around: those who did travel chose to stay longer in the same place.
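
The month of each maximum can also be located in one step with `idxmax` once `period` is the index. This is a minimal sketch on a few Madrid rows taken from the outputs above, not a replacement for the cells in this notebook. Note that `idxmax` and `idxmin` return only the first match, so the boolean filter is still preferable when ties matter.

```python
import pandas as pd

# A few Madrid rows copied from the outputs above, indexed by period
madrid = pd.DataFrame({
    "period": pd.to_datetime(["2023-12", "2023-10", "2020-10", "2019-10"], format="%Y-%m"),
    "no_tourists": [608805, 757454, 164158, 660919],
    "no_overnights": [3682622, 4249639, 2506520, 4422659],
    "avg_monthly_travel_time (days)": [6.0, 5.6, 15.3, 6.7],
}).set_index("period")

# For each column, the index label (period) at which the maximum occurs
peak_months = madrid.idxmax()
print(peak_months)
```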

In [None]:
print(df_province[df_province['no_tourists'] == df_province['no_tourists'].min()])
print(df_province[df_province['no_overnights'] == df_province['no_overnights'].min()])
print(df_province[df_province['avg_monthly_travel_time (days)'] == df_province['avg_monthly_travel_time (days)'].min()])

         period province  no_tourists  no_overnights  \
1664 2020-04-01   Madrid        82328        1251565   

      avg_monthly_travel_time (days)  
1664                            15.2  
         period province  no_tourists  no_overnights  \
1662 2020-06-01   Madrid       122046        1047787   

      avg_monthly_travel_time (days)  
1662                             8.6  
         period province  no_tourists  no_overnights  \
1621 2023-11-01   Madrid       610855        3405314   
1622 2023-10-01   Madrid       757454        4249639   
1623 2023-09-01   Madrid       683907        3806872   

      avg_monthly_travel_time (days)  
1621                             5.6  
1622                             5.6  
1623                             5.6  


If we look at the lows, the minimum numbers of tourists and overnights fall in the months right after the outbreak of the pandemic (April and June 2020), which makes sense. The minimum average travel time (5.6 days), however, is recorded in September, October and November 2023. Looking at this column, we see practically the same values as before the pandemic.

With this in mind, let's take a look at what happened in another of Spain's provinces. In this case we will select a coastal province: **Valencia**.

### **VALENCIA**

In [None]:
# Select province of interest
df_province = df_bp[df_bp['province'] == 'Valencia']
print(df_province.shape)
df_province

(54, 5)


Unnamed: 0,period,province,no_tourists,no_overnights,avg_monthly_travel_time (days)
2592,2023-12-01,Valencia,193108,1414980,7.3
2593,2023-11-01,Valencia,196416,1351362,6.9
2594,2023-10-01,Valencia,260748,1758858,6.7
2595,2023-09-01,Valencia,257181,1768700,6.9
2596,2023-08-01,Valencia,302090,2406803,8.0
2597,2023-07-01,Valencia,264001,1921034,7.3
2598,2023-06-01,Valencia,215883,1591899,7.4
2599,2023-05-01,Valencia,229997,1554800,6.8
2600,2023-04-01,Valencia,206776,1538651,7.4
2601,2023-03-01,Valencia,197817,1413137,7.1


As in Madrid, we will visualize the behavior of the different variables of the dataset by means of graphs.

In [None]:
# Plot some graphs
fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['no_tourists'], name='Number of Tourists'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['no_overnights'], name='Number of Overnights'),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['avg_monthly_travel_time (days)'], name='Avg. Monthly Travel Time'),
    row=3, col=1
)

fig.update_layout(height=800, width=1200, title_text="Valencia Tourist Data")
fig.show()

As we can see, the behavior is quite similar to that obtained for Madrid.

In [None]:
# Get some statistical data from the province dataset
df_province.describe()

Unnamed: 0,period,no_tourists,no_overnights,avg_monthly_travel_time (days)
count,54,54.0,54.0,54.0
mean,2021-09-15 14:40:00,151611.12963,1260761.0,8.87037
min,2019-07-01 00:00:00,39941.0,441708.0,6.5
25%,2020-08-08 18:00:00,102229.5,1027742.0,7.4
50%,2021-09-16 00:00:00,148002.5,1275572.0,8.2
75%,2022-10-24 06:00:00,199084.5,1481736.0,9.475
max,2023-12-01 00:00:00,302090.0,2406803.0,14.0
std,,64646.544952,434930.8,2.028442


In the case of Valencia, we have the following numbers:

- The average value of tourists is approximately 151611, the minimum is 39941 and the maximum is 302090.

- The average value of overnights is 1260761, the minimum is 441708 and the maximum is 2406803.

- The mean of the average travel time (days) during the selected time period is 8.87, the minimum is 6.50 and the maximum is 14.00.

Compared with Madrid, the first two variables take significantly lower values. This may be because Madrid is the capital, so tourists are more likely to choose to visit it. The average travel time, however, is slightly higher: both the mean and the minimum exceed those of Madrid. This may indicate that, as Valencia is a coastal province, people are more likely to extend their trips to relax on the beach, while people visiting Madrid tend to organize short trips to visit the most popular monuments.
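
A comparison like this one can be produced in a single table with `groupby` and named aggregations. The sketch below uses only three 2023 rows per province copied from the tables above, so its figures differ from the full-period statistics; the output column names are hypothetical.

```python
import pandas as pd

# Three 2023 rows per province, copied from the tables above
rows = pd.DataFrame({
    "province": ["Madrid"] * 3 + ["Valencia"] * 3,
    "period": ["2023-12", "2023-11", "2023-10"] * 2,
    "no_tourists": [608805, 610855, 757454, 193108, 196416, 260748],
    "avg_monthly_travel_time (days)": [6.0, 5.6, 5.6, 7.3, 6.9, 6.7],
})

# One row per province, one column per named aggregation
summary = rows.groupby("province").agg(
    mean_tourists=("no_tourists", "mean"),
    min_travel_time=("avg_monthly_travel_time (days)", "min"),
    mean_travel_time=("avg_monthly_travel_time (days)", "mean"),
)
print(summary)
```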

Now let's see which months and years these data correspond to.

In [None]:
print(df_province[df_province['no_tourists'] == df_province['no_tourists'].max()])
print(df_province[df_province['no_overnights'] == df_province['no_overnights'].max()])
print(df_province[df_province['avg_monthly_travel_time (days)'] == df_province['avg_monthly_travel_time (days)'].max()])

         period  province  no_tourists  no_overnights  \
2596 2023-08-01  Valencia       302090        2406803   

      avg_monthly_travel_time (days)  
2596                             8.0  
         period  province  no_tourists  no_overnights  \
2596 2023-08-01  Valencia       302090        2406803   

      avg_monthly_travel_time (days)  
2596                             8.0  
         period  province  no_tourists  no_overnights  \
2636 2020-04-01  Valencia        39941         560388   

      avg_monthly_travel_time (days)  
2636                            14.0  


In the case of Valencia, the maximum numbers of tourists and overnights are both reached in August 2023. As with the number of tourists in Madrid, this means that the values reached just before the pandemic began have been surpassed. Being a coastal province, it is also more likely to receive more tourists during the summer months. The maximum average travel time was reached in April 2020. This could be due to two things: either the mobility restrictions prevented people from returning to their home province, or people with a second residence are counted as tourists in the data. The latter could make sense given that many people moved to their second residence during the pandemic. **Still, this needs to be checked.** See https://www.caixabankresearch.com/es/analisis-sectorial/inmobiliario/segundas-residencias-espana-mar-o-montana (Alicante, Valencia and Malaga are the preferred provinces).

In [None]:
print(df_province[df_province['no_tourists'] == df_province['no_tourists'].min()])
print(df_province[df_province['no_overnights'] == df_province['no_overnights'].min()])
print(df_province[df_province['avg_monthly_travel_time (days)'] == df_province['avg_monthly_travel_time (days)'].min()])

         period  province  no_tourists  no_overnights  \
2636 2020-04-01  Valencia        39941         560388   

      avg_monthly_travel_time (days)  
2636                            14.0  
         period  province  no_tourists  no_overnights  \
2634 2020-06-01  Valencia        54543         441708   

      avg_monthly_travel_time (days)  
2634                             8.1  
         period  province  no_tourists  no_overnights  \
2645 2019-07-01  Valencia       214511        1397200   

      avg_monthly_travel_time (days)  
2645                             6.5  


As for the minimum values of tourists and overnight stays, we see that they occur in the same months as in Madrid (April and June 2020), in the middle of the pandemic. However, in this case the minimum average travel time was recorded in July 2019, which means that pre-pandemic values have not been recovered yet. Even so, we see that the behaviour of tourism in these two provinces is very similar.

Now, let's look at the numbers achieved on the islands.

### **ILLES BALEARS**

In [None]:
# Select province of interest
df_province = df_bp[df_bp['province'] == 'Illes Balears']
print(df_province.shape)
df_province

(54, 5)


Unnamed: 0,period,province,no_tourists,no_overnights,avg_monthly_travel_time (days)
378,2023-12-01,Illes Balears,170948,1120822,6.6
379,2023-11-01,Illes Balears,271513,1836712,6.8
380,2023-10-01,Illes Balears,1417327,8913811,6.3
381,2023-09-01,Illes Balears,1974683,12486206,6.3
382,2023-08-01,Illes Balears,2201001,15179957,6.9
383,2023-07-01,Illes Balears,2287816,14695616,6.4
384,2023-06-01,Illes Balears,1991387,11894543,6.0
385,2023-05-01,Illes Balears,1662382,9361026,5.6
386,2023-04-01,Illes Balears,1101841,6284990,5.7
387,2023-03-01,Illes Balears,358283,2352098,6.6


Let's visualize these numbers more easily with some graphs, as we did before.

In [None]:
# Plot some graphs
fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['no_tourists'], name='Number of Tourists'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['no_overnights'], name='Number of Overnights'),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['avg_monthly_travel_time (days)'], name='Avg. Monthly Travel Time'),
    row=3, col=1
)

fig.update_layout(height=800, width=1200, title_text="Illes Balears Tourist Data")
fig.show()

At first glance, the behavior of the variables in the islands is slightly different from that seen for the two previous provinces. The following is observed:

- First, the graphs are smoother, i.e. the changes in trend are not as sharp as in the previous two cases.

- The number of tourists and overnight stays increases considerably between March and July (peaks) and then decreases until December (valleys). As expected, the peaks are much smaller during the pandemic years (2020-2021).

- Regarding the third variable, as seen in the two previous provinces, an increase is observed after the outbreak of the pandemic. In this case, two peaks are clearly identified, one in April 2020 and the other in December 2020. The variable then returns to its usual values.
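
The seasonal shape described above can be made explicit by expressing each month as a share of the peak month. The sketch below does this for the 2023 Illes Balears rows shown earlier. The 50% cut-off used to label a month as high season is an arbitrary assumption, and `share_of_peak` is a hypothetical column name.

```python
import pandas as pd

# 2023 rows for Illes Balears, copied from the table above
balears = pd.DataFrame({
    "period": pd.to_datetime(
        ["2023-12", "2023-11", "2023-10", "2023-09", "2023-08",
         "2023-07", "2023-06", "2023-05", "2023-04", "2023-03"],
        format="%Y-%m",
    ),
    "no_tourists": [170948, 271513, 1417327, 1974683, 2201001,
                    2287816, 1991387, 1662382, 1101841, 358283],
})

# Tourists in each month as a fraction of the busiest month
balears["share_of_peak"] = balears["no_tourists"] / balears["no_tourists"].max()

# Assumed threshold: a month counts as high season at 50% of the peak or more
high_season = balears.loc[balears["share_of_peak"] >= 0.5, "period"].dt.month
print(sorted(high_season.tolist()))
```

With this threshold, the high season in the sample runs from May to October.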

In [None]:
# Get some statistical data from the province dataset
df_province.describe()

Unnamed: 0,period,no_tourists,no_overnights,avg_monthly_travel_time (days)
count,54,54.0,54.0,54.0
mean,2021-09-15 14:40:00,759410.4,5087378.0,8.068519
min,2019-07-01 00:00:00,21208.0,316260.0,5.6
25%,2020-08-08 18:00:00,142050.8,1121320.0,6.6
50%,2021-09-16 00:00:00,278906.5,1963418.0,7.25
75%,2022-10-24 06:00:00,1395502.0,8812546.0,8.35
max,2023-12-01 00:00:00,2287816.0,15179960.0,17.8
std,,781173.5,5008415.0,2.635521


In the case of the Balearic Islands, we have the following numbers:

- The average value of tourists is approximately 759410, the minimum is 21208 and the maximum is 2287816.

- The average value of overnights is 5087378, the minimum is 316260 and the maximum is 15179957.

- The mean of the average travel time (days) during the selected time period is 8.07, the minimum is 5.60 and the maximum is 17.80.

As can be seen, the numbers are higher than in the capital, although the minimums reached are lower. In other words, there is a more abrupt change between the minimums and the maximums. As for the average travel time, the values are very similar to those obtained for Madrid.
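
How abrupt the swings are can be summarized with the coefficient of variation (standard deviation divided by mean) of the number of tourists. The sketch below uses the `mean` and `std` values reported by `describe()` for the three provinces analyzed so far; the `cv` column name is hypothetical.

```python
import pandas as pd

# Mean and standard deviation of no_tourists, as reported by describe() above
stats = pd.DataFrame(
    {
        "mean": [409694.240741, 151611.12963, 759410.4],
        "std": [190264.493262, 64646.544952, 781173.5],
    },
    index=["Madrid", "Valencia", "Illes Balears"],
)

# Coefficient of variation: relative spread around the mean
stats["cv"] = stats["std"] / stats["mean"]
print(stats["cv"].round(2))
```

The values come out at roughly 0.46 for Madrid, 0.43 for Valencia and 1.03 for Illes Balears, which supports the reading that tourism in the islands is far more seasonal.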

Again, let's see in which months and years these numbers are reached.

In [None]:
print(df_province[df_province['no_tourists'] == df_province['no_tourists'].max()])
print(df_province[df_province['no_overnights'] == df_province['no_overnights'].max()])
print(df_province[df_province['avg_monthly_travel_time (days)'] == df_province['avg_monthly_travel_time (days)'].max()])

        period       province  no_tourists  no_overnights  \
383 2023-07-01  Illes Balears      2287816       14695616   

     avg_monthly_travel_time (days)  
383                             6.4  
        period       province  no_tourists  no_overnights  \
382 2023-08-01  Illes Balears      2201001       15179957   

     avg_monthly_travel_time (days)  
382                             6.9  
        period       province  no_tourists  no_overnights  \
422 2020-04-01  Illes Balears        21208         377119   

     avg_monthly_travel_time (days)  
422                            17.8  


Given that the results closely match those obtained in Valencia (summer 2023 maximums for tourists and overnights, and an April 2020 peak in average travel time), we can use the same reasoning to explain them.

In [None]:
print(df_province[df_province['no_tourists'] == df_province['no_tourists'].min()])
print(df_province[df_province['no_overnights'] == df_province['no_overnights'].min()])
print(df_province[df_province['avg_monthly_travel_time (days)'] == df_province['avg_monthly_travel_time (days)'].min()])

        period       province  no_tourists  no_overnights  \
422 2020-04-01  Illes Balears        21208         377119   

     avg_monthly_travel_time (days)  
422                            17.8  
        period       province  no_tourists  no_overnights  \
420 2020-06-01  Illes Balears        35726         316260   

     avg_monthly_travel_time (days)  
420                             8.9  
        period       province  no_tourists  no_overnights  \
385 2023-05-01  Illes Balears      1662382        9361026   

     avg_monthly_travel_time (days)  
385                             5.6  


The minimum values are obtained in the same months and years as in Madrid and Valencia except in the case of the average travel time. Even so, as in Madrid, this variable recorded its minimum in 2023, although in this case a little earlier (May instead of September). Again, this means that the average travel time also returns to pre-pandemic values.

Finally, let's take a look at how tourism behaves in one of the most depopulated provinces in Spain: **Jaén**

--> https://www.publico.es/sociedad/mapa-despoblacion-espana-cerca-20-provincias-han-perdido-millon-habitantes-medio-siglo.html

--> https://www.eleconomista.es/economia/noticias/11051135/02/21/Las-tres-Espanas-despobladas-23-provincias-con-un-pasado-similar-pero-con-futuros-muy-diferentes.html

### **JAÉN**

In [None]:
# Select province of interest
df_province = df_bp[df_bp['province'] == 'Jaén']
print(df_province.shape)
df_province

(54, 5)


Unnamed: 0,period,province,no_tourists,no_overnights,avg_monthly_travel_time (days)
1404,2023-12-01,Jaén,16256,132365,8.1
1405,2023-11-01,Jaén,14763,109054,7.4
1406,2023-10-01,Jaén,18697,137262,7.3
1407,2023-09-01,Jaén,17157,129812,7.6
1408,2023-08-01,Jaén,23870,171103,7.2
1409,2023-07-01,Jaén,18846,133587,7.1
1410,2023-06-01,Jaén,13973,120920,8.7
1411,2023-05-01,Jaén,15869,142254,9.0
1412,2023-04-01,Jaén,14719,138545,9.4
1413,2023-03-01,Jaén,12464,111022,8.9


Once again, let's check how the data for this province has evolved.

In [None]:
# Plot some graphs
fig = make_subplots(rows=3, cols=1)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['no_tourists'], name='Number of Tourists'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['no_overnights'], name='Number of Overnights'),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=df_province['period'], y=df_province['avg_monthly_travel_time (days)'], name='Avg. Monthly Travel Time'),
    row=3, col=1
)

fig.update_layout(height=800, width=1200, title_text="Jaén Tourist Data")
fig.show()

The behaviour of the three variables is very similar to that observed for Madrid and Valencia, although the numbers recorded in this province for the first two variables are much lower.

In [None]:
# Get some statistical data from the province dataset
df_province.describe()

Unnamed: 0,period,no_tourists,no_overnights,avg_monthly_travel_time (days)
count,54,54.0,54.0,54.0
mean,2021-09-15 14:40:00,12987.851852,131641.5,10.514815
min,2019-07-01 00:00:00,5182.0,68842.0,6.0
25%,2020-08-08 18:00:00,10872.25,109546.0,8.75
50%,2021-09-16 00:00:00,12512.0,129857.5,9.65
75%,2022-10-24 06:00:00,14857.75,149275.75,11.8
max,2023-12-01 00:00:00,23870.0,234855.0,18.1
std,,3740.385241,36378.132446,2.785299


In the case of Jaén, we have the following numbers:

- The average value of tourists is approximately 12988, the minimum is 5182 and the maximum is 23870.

- The average value of overnights is 131642, the minimum is 68842 and the maximum is 234855.

- The mean of the average travel time (days) during the selected time period is 10.51, the minimum is 6.00 and the maximum is 18.10.

As can be seen, the numbers are much lower than in the other three provinces, except in the case of the average travel time. This could be because Jaén is one of the cheapest Spanish provinces to live in, making it a destination where tourists can afford to stay longer. On the other hand, continuing with the second home reasoning, and connecting it with the previous one, it could make sense that people with second homes spend more time in this province.

As before, let's see in which months and years these numbers are reached.

In [None]:
print(df_province[df_province['no_tourists'] == df_province['no_tourists'].max()])
print(df_province[df_province['no_overnights'] == df_province['no_overnights'].max()])
print(df_province[df_province['avg_monthly_travel_time (days)'] == df_province['avg_monthly_travel_time (days)'].max()])

         period province  no_tourists  no_overnights  \
1408 2023-08-01     Jaén        23870         171103   

      avg_monthly_travel_time (days)  
1408                             7.2  
         period province  no_tourists  no_overnights  \
1451 2020-01-01     Jaén        16162         234855   

      avg_monthly_travel_time (days)  
1451                            14.5  
         period province  no_tourists  no_overnights  \
1449 2020-03-01     Jaén        11296         204607   

      avg_monthly_travel_time (days)  
1449                            18.1  


Similar to the situation in Valencia, the maximum number of tourists is reached in August 2023 and, checking the values of our dataframe, it slightly exceeds the number recorded in August 2019, which is the second highest value. For overnight stays, the maximum is reached in January 2020, a few months before the pandemic started. If we look at the graph, the highest values were recorded in the months just before and after the start of the pandemic. Increases are observed again in January 2022 and in August 2022 and 2023, but without returning to pre-pandemic levels. Finally, the maximum average travel time was reached in March 2020, the month in which the pandemic officially began. This is one month earlier than in Valencia and Illes Balears (April 2020), whereas Madrid reached its maximum in October 2020.

In [None]:
print(df_province[df_province['no_tourists'] == df_province['no_tourists'].min()])
print(df_province[df_province['no_overnights'] == df_province['no_overnights'].min()])
print(df_province[df_province['avg_monthly_travel_time (days)'] == df_province['avg_monthly_travel_time (days)'].min()])

         period province  no_tourists  no_overnights  \
1448 2020-04-01     Jaén         5182          90004   

      avg_monthly_travel_time (days)  
1448                            17.4  
         period province  no_tourists  no_overnights  \
1446 2020-06-01     Jaén         7761          68842   

      avg_monthly_travel_time (days)  
1446                             8.9  
         period province  no_tourists  no_overnights  \
1457 2019-07-01     Jaén        15399          92576   

      avg_monthly_travel_time (days)  
1457                             6.0  


As for the minimums, they occur in the same months as in Valencia.

**Therefore, we see that, in general, tourism behaves almost equally in all the selected provinces.**

## 2. Global Study: Comparison

Below, we will compare the numbers for all the Spanish provinces. This will allow us to check, for example, which are the ones that receive the most or the least tourism, or the ones where tourists stay the longest. In addition, we will also see what type of provinces are preferred by tourists: coastal, inland or islands.

**It is important to note that, for the first checks, we are not going to make a separation by months or periods of the year (summer, Christmas, etc.), but we are going to work with the average of all the values available for each province over all the months.**

In [None]:
# Get the average of the figures for each province during the selected time period
# numeric_only=True keeps the datetime 'period' column out of the average
df_mean = df_bp.groupby('province').mean(numeric_only=True)

In [None]:
df_mean.to_excel(data_dir_new + 'average_tourism_data_by_province.xlsx')

In [None]:
fig = px.bar(df_mean, x=df_mean.index, y='no_tourists')
fig.show()

As can be seen in the graph, the number of tourists differs considerably between provinces. Illes Balears and Barcelona stand out, followed by Las Palmas, Madrid, Malaga, Alicante, Santa Cruz de Tenerife and Girona. **That is, apart from the country's capital, they are all coastal provinces and islands.** The provinces with the lowest number of tourists are Ávila, Teruel, Soria, Segovia, Palencia, Melilla, Lugo, León, La Rioja, Jaén, Guadalajara, Cuenca, Ciudad Real, Ceuta and Albacete. **In other words, they are the two autonomous cities on the African continent (Ceuta and Melilla) and provinces that appear on the list of the most depopulated places in Spain.**
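
To move from a visual reading to an explicit ranking by type of province, the averages can be sorted and labelled. The sketch below uses only the four mean values reported by `describe()` in the previous section, so it is illustrative rather than representative. The `province_type` lookup is a hand-made, hypothetical mapping that a full analysis would need to extend to every province.

```python
import pandas as pd

# Mean no_tourists per province, taken from the describe() outputs above
means = pd.DataFrame(
    {"no_tourists": [409694.240741, 151611.12963, 759410.4, 12987.851852]},
    index=["Madrid", "Valencia", "Illes Balears", "Jaén"],
)

# Hypothetical hand-made lookup covering only the four provinces studied
province_type = {
    "Madrid": "inland",
    "Valencia": "coastal",
    "Illes Balears": "island",
    "Jaén": "inland",
}
means["type"] = means.index.map(province_type)

# Provinces ordered from most to fewest tourists
ranking = means.sort_values("no_tourists", ascending=False)
print(ranking)

# Average per type of province
by_type = means.groupby("type")["no_tourists"].mean()
print(by_type)
```

With only four provinces, the inland average is dominated by Madrid, which is a reminder that the capital should probably be treated separately in any inland versus coastal comparison.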

In [None]:
fig = px.bar(df_mean, x=df_mean.index, y='no_overnights')
fig.show()

Overnight stays show the same pattern as the number of tourists: the same provinces lead the ranking and the same provinces close it.

In [None]:
fig = px.bar(df_mean, x=df_mean.index, y='avg_monthly_travel_time (days)')
fig.show()

However, the average travel time is much more evenly distributed across provinces, and the ranking is reversed: the provinces with the fewest tourists and overnight stays are the ones with the longest trips. **As mentioned above, this may be because these areas are cheaper than others, which allows tourists to stay more days on the same budget.**
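
One way to put a number on this inverse relationship is a rank correlation between the number of tourists and the average travel time. The sketch below uses hypothetical values (the frame `demo` is not part of the notebook) and computes the Spearman coefficient as a Pearson correlation on ranks, so it only needs pandas.

```python
import pandas as pd

# Hypothetical per-province averages, for illustration only.
demo = pd.DataFrame(
    {
        "no_tourists": [900_000, 850_000, 400_000, 15_000, 12_000, 9_000],
        "avg_monthly_travel_time (days)": [6.5, 6.0, 6.8, 8.9, 9.4, 10.1],
    },
    index=["Illes Balears", "Barcelona", "Madrid", "Jaén", "Soria", "Teruel"],
)

# Spearman correlation is the Pearson correlation computed on the ranks.
ranks = demo.rank()
rho = ranks["no_tourists"].corr(ranks["avg_monthly_travel_time (days)"])
print(f"Spearman rho: {rho:.2f}")
```

A coefficient close to -1 on the real `df_mean` would support the observation that provinces with fewer tourists tend to have longer stays.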

Now, although we have seen approximately how the numbers behaved in certain specific provinces over time, we are going to see how the numbers vary in general without taking into account each province separately. This will allow us to check whether the studied provinces reflect the general behaviour of tourism in Spain as a whole over the selected time period.

In [None]:
df_nat = pd.read_excel(data_dir + 'INE_national_data.xlsx')
df_nat

Unnamed: 0,period,no_tourists,no_overnights,avg_monthly_travel_time (days)
0,2023-12,5710153,38998458,6.8
1,2023-11,5707346,37627264,6.6
2,2023-10,8899883,58090298,6.5
3,2023-09,9415601,61951741,6.6
4,2023-08,11210440,80592395,7.2
5,2023-07,10394305,69638650,6.7
6,2023-06,8298322,55741101,6.7
7,2023-05,8114596,51917877,6.4
8,2023-04,7151712,47912673,6.7
9,2023-03,5524605,38963532,7.1


We are going to change the period column to datetime format as we did before.

In [None]:
df_nat.period = pd.to_datetime(df_nat.period, format="%Y-%m")
df_nat.dtypes

period                            datetime64[ns]
no_tourists                                int64
no_overnights                              int64
avg_monthly_travel_time (days)           float64
dtype: object

In [None]:
df_nat.describe()

Unnamed: 0,period,no_tourists,no_overnights,avg_monthly_travel_time (days)
count,54,54.0,54.0,54.0
mean,2021-09-15 14:40:00,5179251.0,39237790.0,8.212963
min,2019-07-01 00:00:00,897381.0,9675762.0,6.2
25%,2020-08-08 18:00:00,3162696.0,26965750.0,7.1
50%,2021-09-16 00:00:00,4722302.0,37876740.0,7.65
75%,2022-10-24 06:00:00,7484834.0,51683050.0,8.575
max,2023-12-01 00:00:00,11210440.0,85225670.0,13.6
std,,2811330.0,17986200.0,1.821367


If we look at the national numbers, we can see the following:

- The average monthly number of tourists is approximately 5,179,251; the minimum is 897,381 and the maximum is 11,210,440.

- The average monthly number of overnight stays is approximately 39,237,790; the minimum is 9,675,762 and the maximum is 85,225,670.

- The mean of the average travel time over the selected time period is 8.21 days; the minimum is 6.2 days and the maximum is 13.6 days.

Now, we are going to check in which months the maximum and minimum values seen are obtained.

In [None]:
print(df_nat[df_nat['no_tourists'] == df_nat['no_tourists'].max()])
print(df_nat[df_nat['no_overnights'] == df_nat['no_overnights'].max()])
print(df_nat[df_nat['avg_monthly_travel_time (days)'] == df_nat['avg_monthly_travel_time (days)'].max()])

      period  no_tourists  no_overnights  avg_monthly_travel_time (days)
4 2023-08-01     11210440       80592395                             7.2
       period  no_tourists  no_overnights  avg_monthly_travel_time (days)
52 2019-08-01     10759089       85225670                             7.9
       period  no_tourists  no_overnights  avg_monthly_travel_time (days)
44 2020-04-01       897381       12231870                            13.6


In [None]:
print(df_nat[df_nat['no_tourists'] == df_nat['no_tourists'].min()])
print(df_nat[df_nat['no_overnights'] == df_nat['no_overnights'].min()])
print(df_nat[df_nat['avg_monthly_travel_time (days)'] == df_nat['avg_monthly_travel_time (days)'].min()])

       period  no_tourists  no_overnights  avg_monthly_travel_time (days)
44 2020-04-01       897381       12231870                            13.6
       period  no_tourists  no_overnights  avg_monthly_travel_time (days)
42 2020-06-01      1285909        9675762                             7.5
       period  no_tourists  no_overnights  avg_monthly_travel_time (days)
53 2019-07-01      9155467       57099975                             6.2
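
The six lookups above can be condensed with `idxmax` and `idxmin`, which return the index label of the extreme value of each column. The sketch below assumes the table is indexed by `period`; `nat_demo` is a hypothetical name, and its rows are the five national rows printed in the outputs above.

```python
import pandas as pd

# The five national rows printed in the outputs above.
nat_demo = pd.DataFrame(
    {
        "period": pd.to_datetime(
            ["2023-08", "2019-08", "2020-04", "2020-06", "2019-07"], format="%Y-%m"
        ),
        "no_tourists": [11210440, 10759089, 897381, 1285909, 9155467],
        "no_overnights": [80592395, 85225670, 12231870, 9675762, 57099975],
        "avg_monthly_travel_time (days)": [7.2, 7.9, 13.6, 7.5, 6.2],
    }
).set_index("period")

# For each column, the month in which the maximum and the minimum occur.
summary = pd.DataFrame(
    {"max_month": nat_demo.idxmax(), "min_month": nat_demo.idxmin()}
)
print(summary)
```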


As can be seen, the results obtained in the previous provinces are quite in line with what can be observed for Spain as a whole, especially the minimum values. In other words, in general, the provinces are representative of what is happening with tourism at the national level.

Now, we are going to see this using some graphs.

In [None]:
df_nat.set_index('period', inplace = True)

In [None]:
# Build one wide table per metric: a 'Spain' column holding the national series
df_plot_tourists = df_nat[['no_tourists']].rename(columns={'no_tourists': 'Spain'})
df_plot_overnights = df_nat[['no_overnights']].rename(columns={'no_overnights': 'Spain'})
df_plot_avg = df_nat[['avg_monthly_travel_time (days)']].rename(columns={'avg_monthly_travel_time (days)': 'Spain'})

provinces_list = list(df_bp['province'].unique())

# Add one column per province. Indexing each province by period lets pandas align
# the values by date instead of relying on both tables sharing the same row order.
for p in provinces_list:
    df_province = df_bp[df_bp['province'] == p].set_index('period')

    df_plot_tourists[p] = df_province['no_tourists']
    df_plot_overnights[p] = df_province['no_overnights']
    df_plot_avg[p] = df_province['avg_monthly_travel_time (days)']
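
As an alternative to the loop, the same wide layout (one row per period, one column per province) can be produced with `DataFrame.pivot`. This is a minimal sketch on hypothetical data: `bp_demo` and `nat_demo` mimic the column names of `df_bp` and `df_nat`, and every value is invented.

```python
import pandas as pd

# Hypothetical long-format data with the same column names as df_bp.
bp_demo = pd.DataFrame(
    {
        "period": pd.to_datetime(["2023-07-01", "2023-08-01"] * 2),
        "province": ["Jaén", "Jaén", "Valencia", "Valencia"],
        "no_tourists": [15_000, 17_000, 300_000, 340_000],
    }
)
# Hypothetical national totals indexed by period, like df_nat after set_index.
nat_demo = pd.DataFrame(
    {"no_tourists": [10_000_000, 11_000_000]},
    index=pd.to_datetime(["2023-07-01", "2023-08-01"]),
)

# One column per province, one row per period.
by_province = bp_demo.pivot(index="period", columns="province", values="no_tourists")

# Put the national series first; concat aligns both pieces on the period index.
wide = pd.concat([nat_demo["no_tourists"].rename("Spain"), by_province], axis=1)
print(wide)
```

`pivot` raises an error if a province has two rows for the same period, which doubles as a cheap sanity check on the data.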

### **Number of Tourists**

In [None]:
px.line(df_nat, x=df_nat.index, y="no_tourists", width=1000, height=400, title='Number of Tourists Nationwide')

In [None]:
px.line(df_plot_tourists, markers=True)

### **Number of Overnights**

In [None]:
px.line(df_nat, x=df_nat.index, y="no_overnights", width=1000, height=400, title='Number of Overnights Nationwide')

In [None]:
px.line(df_plot_overnights, markers=True)

### **Average Travel Time**

In [None]:
px.line(df_nat, x=df_nat.index, y="avg_monthly_travel_time (days)", width=1000, height=400, title='Average Travel Time Nationwide')

In [None]:
px.line(df_plot_avg, markers=True)

As can be seen in the graphs above, the peaks in each province coincide with the peaks observed at the national level. Therefore, although the numbers differ greatly from one province to another, the behaviour of tourism over time is very similar across the Spanish provinces.
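
Because the provinces differ so much in magnitude, the visual comparison can be backed up by rescaling each series and correlating it with the national one. The sketch below assumes a wide table shaped like `df_plot_tourists`; `wide_demo` and its values are hypothetical.

```python
import pandas as pd

periods = pd.to_datetime(["2019-08-01", "2020-04-01", "2021-08-01", "2023-08-01"])

# Hypothetical wide table: a 'Spain' column plus one column per province.
wide_demo = pd.DataFrame(
    {
        "Spain": [10_700_000, 900_000, 5_000_000, 11_200_000],
        "Valencia": [330_000, 20_000, 150_000, 340_000],
        "Jaén": [16_000, 1_500, 9_000, 17_000],
    },
    index=periods,
)

# Divide every column by its own maximum so that shapes, not magnitudes, are compared.
scaled = wide_demo / wide_demo.max()

# Pearson correlation of each province with the national series.
corr_with_spain = scaled.corr()["Spain"].drop("Spain")
print(corr_with_spain)
```

Plotting `scaled` instead of the raw table puts all lines on the same 0 to 1 axis, and provinces with a low correlation are the ones that deviate from the national pattern.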

## 3. Answering Questions

### **1. Which provinces have the highest number of tourists, overnights and average travel time?**

As we have seen in the previous graphs, Illes Balears and Barcelona stand out in terms of number of tourists and overnight stays, followed by Las Palmas, Madrid, Malaga, Alicante, Santa Cruz de Tenerife and Girona. For average travel time, the opposite is true: the longest trips are recorded in the most depopulated provinces of Spain, which are also the ones with the fewest tourists and overnight stays.


### **2. In which periods (summer, Christmas...), in general, are the best numbers achieved? And the worst?**

As can be seen in the graphs, the best numbers are obtained during the summer months (mainly August), while the lowest numbers are recorded in winter and spring.
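
The seasonal pattern can also be read directly from the data by averaging over calendar months. The sketch below uses only the ten national values for March to December 2023 shown in the table earlier, so its result describes that subset and not the whole period; `nat_2023` is a hypothetical name.

```python
import pandas as pd

# National tourist counts for March to December 2023, copied from the table above.
nat_2023 = pd.Series(
    [5524605, 7151712, 8114596, 8298322, 10394305,
     11210440, 9415601, 8899883, 5707346, 5710153],
    index=pd.to_datetime([f"2023-{m:02d}" for m in range(3, 13)], format="%Y-%m"),
    name="no_tourists",
)

# Average by calendar month; with several years of data, all Augusts are averaged together.
by_month = nat_2023.groupby(nat_2023.index.month).mean()
best_month = by_month.idxmax()
worst_month = by_month.idxmin()
print(best_month, worst_month)
```

Applied to the full `df_nat`, it may be worth excluding the pandemic months first, since they pull the spring averages down.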

### **3. In which areas (inland or coastal) are there better numbers?**

Taking into account the answer to the first question, we see that the coastal areas, together with the islands and the capital, are the ones with the best numbers.
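
This answer can be made more explicit by labelling each province and aggregating by label. The INE table does not contain such a label, so the lookup `area_type` below is an assumption of this sketch that would have to be written by hand for all provinces; the tourist figures are hypothetical as well.

```python
import pandas as pd

# Hypothetical per-province averages; only the province names are real.
mean_demo = pd.DataFrame(
    {"no_tourists": [900_000, 850_000, 400_000, 12_000, 9_000]},
    index=["Illes Balears", "Barcelona", "Madrid", "Soria", "Teruel"],
)

# Hand-written lookup (an assumption of this sketch, not part of the INE data).
area_type = {
    "Illes Balears": "island",
    "Barcelona": "coastal",
    "Madrid": "inland",
    "Soria": "inland",
    "Teruel": "inland",
}

mean_demo["area"] = mean_demo.index.map(area_type)

# The median is less sensitive than the mean to a single large province such as Madrid.
by_area = mean_demo.groupby("area")["no_tourists"].median()
print(by_area)
```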

### **4. In each province, in which month and year were the highest numbers reached? Are they the same in all provinces?**

As we saw in the province-level study, the highest numbers of tourists and overnight stays were reached in the months before the pandemic (mostly summer 2019) and in summer 2023. The longest trips, in contrast, were recorded during the pandemic, when very few tourists were travelling, so those values are not representative of normal conditions. As the pandemic recedes, trips become shorter again. This behaviour is very similar across the provinces and at the national level.
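
To answer this question for every province at once, rather than for a few selected ones, the peak month can be extracted per group. This is a minimal sketch on hypothetical data: `bp_demo` mimics the column names of `df_bp` and its values are invented.

```python
import pandas as pd

# Hypothetical long-format data with the same column names as df_bp.
bp_demo = pd.DataFrame(
    {
        "period": pd.to_datetime(["2019-08-01", "2020-04-01", "2023-08-01"] * 2),
        "province": ["Jaén"] * 3 + ["Valencia"] * 3,
        "no_tourists": [17_000, 1_500, 16_000, 330_000, 20_000, 340_000],
    }
)

# idxmax per group returns the row label of each province's maximum; loc fetches those rows.
peak_rows = bp_demo.loc[bp_demo.groupby("province")["no_tourists"].idxmax()]
peaks = peak_rows.set_index("province")["period"]
print(peaks)
```

Calling `peaks.value_counts()` then shows how many provinces share the same peak month.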

### **5. In each province, in which month and year have the most anomalous numbers been recorded? Are they the same in all provinces?**

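Predictably, the worst tourism numbers were recorded in the early months of the pandemic: at the national level, the minimum number of tourists corresponds to April 2020 and the minimum number of overnight stays to June 2020. The same pattern holds across the provinces.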
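
The notebook identifies anomalies by inspecting the minima and the plots. One possible way to formalise this, shown here only as a sketch and not as the method used above, is a per-province z-score. The frame `bp_demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data with the same column names as df_bp.
bp_demo = pd.DataFrame(
    {
        "period": pd.to_datetime(
            ["2019-08-01", "2020-04-01", "2021-08-01", "2022-08-01", "2023-08-01"] * 2
        ),
        "province": ["Jaén"] * 5 + ["Valencia"] * 5,
        "no_tourists": [17_000, 1_500, 15_000, 16_500, 16_000,
                        330_000, 20_000, 310_000, 335_000, 340_000],
    }
)

grouped = bp_demo.groupby("province")["no_tourists"]

# z-score of each month relative to its own province's mean and standard deviation.
bp_demo["z"] = (bp_demo["no_tourists"] - grouped.transform("mean")) / grouped.transform("std")

# Most anomalous month per province: the row with the largest absolute z-score.
idx = bp_demo["z"].abs().groupby(bp_demo["province"]).idxmax()
most_anomalous = bp_demo.loc[idx, ["province", "period", "z"]]
print(most_anomalous)
```

On real monthly data the seasonal cycle inflates the standard deviation, so comparing each month with the same calendar month of other years may give cleaner results.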