# **Correlations and Regressions**

## **Datasets**

The dataset to be used in this study comes from the combination of three different files containing information on each of the Spanish provinces:

- Average number of tourists, overnight stays and average travel time (in days) in 2023 (https://www.ine.es/jaxiT3/Datos.htm?t=52047).
- Total number of establishments dedicated to tourism in 2023 according to GeoFabrik.
- Resident population on 1st January 2023 (https://www.ine.es/jaxiT3/Tabla.htm?t=36725&L=0).

The merging of these three datasets will be carried out by means of the code assigned to each province, which will be included in each of them.

## **Goal**

The aim of this study is to perform a series of correlations and regressions to check the relationship between our different variables.


## **Useful Links:**

 - https://realpython.com/numpy-scipy-pandas-correlation-python/
 - https://www.eustat.eus/documentos/opt_1/tema_25/elem_3830/definicion.html

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
data_dir_INE = '/content/gdrive/MyDrive/TFM/INE/'
data_dir_new = '/content/gdrive/MyDrive/TFM/New/'
data_dir_code = '/content/gdrive/MyDrive/TFM/New/Code/'

In [None]:
# Import necessary packages
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

## 1. Load and Prepare Data

The first thing we are going to do is to obtain the average tourism data recorded in 2023 for each of the Spanish provinces. For this purpose, a simplified version of the file used to carry out the first study (**INE_Exploratory_Data_Analysis.ipynb**) has been uploaded.

In [None]:
# Read file with 2023 tourism numbers for each province
df_tourism = pd.read_excel(data_dir_INE + 'INE_data_by_province_2023.xlsx')
df_tourism

Unnamed: 0,period,province,no_tourists,no_overnights,avg_monthly_travel_time (days)
0,2023-12,Albacete,15085,115895,7.7
1,2023-11,Albacete,12833,93364,7.3
2,2023-10,Albacete,17009,124112,7.3
3,2023-09,Albacete,17104,133101,7.8
4,2023-08,Albacete,21226,182420,8.6
...,...,...,...,...,...
619,2023-05,Zaragoza,56135,404632,7.2
620,2023-04,Zaragoza,48076,383513,8.0
621,2023-03,Zaragoza,42400,313021,7.4
622,2023-02,Zaragoza,35315,255035,7.2


Next, we will calculate the average of each of the variables for each of the provinces.

In [None]:
provinces = ['Albacete', 'Alicante', 'Almería', 'Álava', 'Asturias', 'Ávila', 'Badajoz', 'Illes Balears', 'Barcelona', 'Bizkaia', 'Burgos', 'Cáceres', 'Cádiz', 'Cantabria',
             'Castellón', 'Ceuta', 'Ciudad Real', 'Córdoba', 'A Coruña', 'Cuenca', 'Gipuzkoa', 'Girona', 'Granada', 'Guadalajara', 'Huelva', 'Huesca', 'Jaén', 'León', 'Lleida',
             'Lugo', 'Madrid', 'Málaga', 'Melilla', 'Murcia', 'Navarra', 'Ourense', 'Palencia', 'Las Palmas', 'Pontevedra', 'La Rioja', 'Salamanca', 'Santa Cruz de Tenerife',
             'Segovia', 'Sevilla', 'Soria', 'Tarragona', 'Teruel', 'Toledo', 'Valencia', 'Valladolid', 'Zamora', 'Zaragoza']

avg_tourists = []
avg_overnights = []
avg_travel_time = []

for province in provinces:

  # Select data from a specific province
  df_province = df_tourism.loc[df_tourism['province'] == province]
  print(df_province)

  no_tourists = df_province['no_tourists'].mean()
  avg_tourists.append(no_tourists)
  no_overnights = df_province['no_overnights'].mean()
  avg_overnights.append(no_overnights)
  travel_time = df_province['avg_monthly_travel_time (days)'].mean()
  avg_travel_time.append(travel_time)

     period  province  no_tourists  no_overnights  \
0   2023-12  Albacete        15085         115895   
1   2023-11  Albacete        12833          93364   
2   2023-10  Albacete        17009         124112   
3   2023-09  Albacete        17104         133101   
4   2023-08  Albacete        21226         182420   
5   2023-07  Albacete        18086         145257   
6   2023-06  Albacete        13538         121436   
7   2023-05  Albacete        14187         123389   
8   2023-04  Albacete        13592         122614   
9   2023-03  Albacete        11884         108932   
10  2023-02  Albacete        10400          91074   
11  2023-01  Albacete        10822         105717   

    avg_monthly_travel_time (days)  
0                              7.7  
1                              7.3  
2                              7.3  
3                              7.8  
4                              8.6  
5                              8.0  
6                              9.0  
7             
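The per-province averaging done in the loop above can also be written with `groupby`. The following is only a minimal sketch: the small `sample` frame is a stand-in for `df_tourism` (four rows copied from the table above), and the name `avg_by_province` is introduced here for illustration.

```python
import pandas as pd

# Small stand-in for df_tourism (values copied from the table above).
sample = pd.DataFrame({
    "province": ["Albacete", "Albacete", "Zaragoza", "Zaragoza"],
    "no_tourists": [15085, 12833, 56135, 48076],
    "no_overnights": [115895, 93364, 404632, 383513],
    "avg_monthly_travel_time (days)": [7.7, 7.3, 7.2, 8.0],
})

# One row per province, each numeric column averaged over its months.
avg_by_province = (
    sample.groupby("province", as_index=False)[
        ["no_tourists", "no_overnights", "avg_monthly_travel_time (days)"]
    ].mean()
)
print(avg_by_province)
```

Note that `groupby` sorts the provinces by name, so the row order may differ from the hand-written `provinces` list used in the loop.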

Once we have calculated the data we are interested in, we can create our final dataset.

In [None]:
# Create a new dataset
datos = {
    'province' : provinces,
    'no_tourists': avg_tourists,
    'no_overnights': avg_overnights,
    'avg_monthly_travel_time (days)': avg_travel_time
}
df_avg_tourism = pd.DataFrame(datos)
df_avg_tourism

Unnamed: 0,province,no_tourists,no_overnights,avg_monthly_travel_time (days)
0,Albacete,14647.17,122275.9,8.433333
1,Alicante,487653.2,4044514.0,8.3
2,Almería,70619.92,600900.8,8.7
3,Álava,35820.33,214037.4,6.166667
4,Asturias,42173.67,341250.5,8.108333
5,Ávila,8651.417,66288.58,7.716667
6,Badajoz,54654.08,396741.2,7.633333
7,Illes Balears,1147973.0,7212594.0,6.466667
8,Barcelona,937805.8,5096908.0,5.433333
9,Bizkaia,60008.08,374469.7,6.275


As can be seen, we now have a dataset with the average tourism values for each of the Spanish provinces. The only thing missing is a column with the code of each province, which is added manually.
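As an alternative to adding the codes by hand, they could be attached from a lookup table. This is a minimal sketch under assumptions: `province_codes` is a hypothetical dictionary that only contains four codes taken from the tables in this notebook, and `df_demo` stands in for `df_avg_tourism`.

```python
import pandas as pd

# Hypothetical lookup table; only four codes are shown here, taken from
# the tables in this notebook. A real mapping would list every province.
province_codes = {"Albacete": 2, "Alicante": 3, "Almería": 4, "Álava": 1}

df_demo = pd.DataFrame({"province": ["Albacete", "Alicante", "Almería", "Álava"]})

# map() returns NaN for any province missing from the lookup,
# which makes omissions easy to spot.
df_demo["code"] = df_demo["province"].map(province_codes)
missing = df_demo[df_demo["code"].isna()]
print(df_demo)
print("Provinces without a code:", list(missing["province"]))
```

Keeping the mapping in code makes the step repeatable if the averages are ever recomputed.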

In [None]:
# Save the results in an Excel file
df_avg_tourism.to_excel(data_dir_new + 'average_tourism_data_by_province.xlsx')

Having obtained our first dataset, we are going to read the next one, which contains the total number of tourist establishments registered for each province throughout 2023.

In [None]:
# Read file with tourism establishments of each province
df_estab = pd.read_excel(data_dir_code + '2023_estab_by_province_code.xlsx')
df_estab

Unnamed: 0,province,establishments,code
0,Melilla,9,52
1,Ceuta,7,51
2,Cádiz,519,11
3,Málaga,802,29
4,Almería,245,4
5,Granada,545,18
6,Sevilla,360,41
7,Huelva,138,21
8,Jaén,324,23
9,Córdoba,194,14


Next, we will combine these two datasets by means of the code designated to each province. To do this, we have to read the file previously saved with the average tourism data but which also contains the province code (added manually, as mentioned above).

In [None]:
# Read file with average tourism data for each province plus the province code
df_tourism_code = pd.read_excel(data_dir_code + 'average_tourism_data_by_province_code.xlsx')
df_tourism_code

Unnamed: 0,province,no_tourists,no_overnights,avg_monthly_travel_time (days),code
0,Albacete,14647.17,122275.9,8.433333,2
1,Alicante,487653.2,4044514.0,8.3,3
2,Almería,70619.92,600900.8,8.7,4
3,Álava,35820.33,214037.4,6.166667,1
4,Asturias,42173.67,341250.5,8.108333,33
5,Ávila,8651.417,66288.58,7.716667,5
6,Badajoz,54654.08,396741.2,7.633333,6
7,Illes Balears,1147973.0,7212594.0,6.466667,7
8,Barcelona,937805.8,5096908.0,5.433333,8
9,Bizkaia,60008.08,374469.7,6.275,48


Comparing this dataset with the one obtained at the beginning, we see that the data are exactly the same, apart from the added `code` column.
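Instead of comparing the two tables by eye, the check can be sketched in code. The frames below are small stand-ins for `df_avg_tourism` and `df_tourism_code` (two rows copied from the tables above), so this is an illustration rather than the check actually run in this study.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Stand-ins for df_avg_tourism and df_tourism_code (two rows from the tables above).
df_avg_demo = pd.DataFrame({
    "province": ["Albacete", "Alicante"],
    "no_tourists": [14647.17, 487653.2],
})
df_code_demo = df_avg_demo.assign(code=[2, 3])

# Raises AssertionError if the shared columns differ; passes silently otherwise.
assert_frame_equal(df_code_demo.drop(columns="code"), df_avg_demo)
print("Shared columns match")
```

By default `assert_frame_equal` compares floats with a small tolerance, which is convenient when values have gone through an Excel round trip.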

In [None]:
# Combine the two datasets
df_merged = df_tourism_code.merge(df_estab, how='inner', on='code')
df_merged

Unnamed: 0,province_x,no_tourists,no_overnights,avg_monthly_travel_time (days),code,province_y,establishments
0,Albacete,14647.17,122275.9,8.433333,2,Albacete,147
1,Alicante,487653.2,4044514.0,8.3,3,Alacant/Alicante,575
2,Almería,70619.92,600900.8,8.7,4,Almería,245
3,Álava,35820.33,214037.4,6.166667,1,Araba/Álava,126
4,Asturias,42173.67,341250.5,8.108333,33,Asturias,1019
5,Ávila,8651.417,66288.58,7.716667,5,Ávila,219
6,Badajoz,54654.08,396741.2,7.633333,6,Badajoz,213
7,Illes Balears,1147973.0,7212594.0,6.466667,7,Illes Balears,1641
8,Barcelona,937805.8,5096908.0,5.433333,8,Barcelona,1304
9,Bizkaia,60008.08,374469.7,6.275,48,Bizkaia,295
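Because an inner merge silently drops any code that appears in only one file, it can be useful to check for unmatched provinces first. The sketch below uses two tiny stand-in frames (`left` and `right`, with values from the tables above) and the `indicator` and `validate` options of `merge`; the deliberate mismatch is only there to show what the check reports.

```python
import pandas as pd

# Stand-ins for df_tourism_code and df_estab (values from the tables above).
left = pd.DataFrame({"province": ["Albacete", "Almería"], "code": [2, 4]})
right = pd.DataFrame({"province": ["Almería", "Cádiz"],
                      "establishments": [245, 519], "code": [4, 11]})

# An outer merge with indicator=True labels each row as 'both',
# 'left_only' or 'right_only'; validate raises if a code is duplicated.
check = left.merge(right, how="outer", on="code",
                   suffixes=("", "_estab"), indicator=True,
                   validate="one_to_one")
unmatched = check[check["_merge"] != "both"]
print(unmatched[["code", "_merge"]])
```

If `unmatched` is empty for the real files, the inner merge keeps every province.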


Once this is done, we are going to delete and rename the necessary columns.

In [None]:
# Remove some columns
df_merged = df_merged.drop(['province_y'], axis=1)

In [None]:
# Rename some columns
df_final = df_merged.rename(columns={"province_x": "province"})

In [None]:
df_final

Unnamed: 0,province,no_tourists,no_overnights,avg_monthly_travel_time (days),code,establishments
0,Albacete,14647.17,122275.9,8.433333,2,147
1,Alicante,487653.2,4044514.0,8.3,3,575
2,Almería,70619.92,600900.8,8.7,4,245
3,Álava,35820.33,214037.4,6.166667,1,126
4,Asturias,42173.67,341250.5,8.108333,33,1019
5,Ávila,8651.417,66288.58,7.716667,5,219
6,Badajoz,54654.08,396741.2,7.633333,6,213
7,Illes Balears,1147973.0,7212594.0,6.466667,7,1641
8,Barcelona,937805.8,5096908.0,5.433333,8,1304
9,Bizkaia,60008.08,374469.7,6.275,48,295


The next step will be to add to this dataframe a new column containing the population data for each of the Spanish provinces.

In [None]:
# Read file which contains the population of each province in 2023
df_population = pd.read_excel(data_dir_INE + 'INE_provinces_population_2023.xlsx')
df_population

Unnamed: 0,province,population,code
0,Melilla,83002.0,52
1,Ceuta,81922.66,51
2,Cádiz,1262736.0,11
3,Málaga,1735934.0,29
4,Almería,732500.5,4
5,Granada,935187.5,18
6,Sevilla,1965086.0,41
7,Huelva,536940.2,21
8,Jaén,619492.0,23
9,Córdoba,775733.4,14


Taking into account that this dataset also contains the province code, we will make another merger between the previous dataset and this new file.

In [None]:
df_final = df_final.merge(df_population, how='inner', on='code')
df_final

Unnamed: 0,province_x,no_tourists,no_overnights,avg_monthly_travel_time (days),code,establishments,province_y,population
0,Albacete,14647.17,122275.9,8.433333,2,147,Albacete,387680.6
1,Alicante,487653.2,4044514.0,8.3,3,575,Alacant/Alicante,1933594.0
2,Almería,70619.92,600900.8,8.7,4,245,Almería,732500.5
3,Álava,35820.33,214037.4,6.166667,1,126,Araba/Álava,331528.4
4,Asturias,42173.67,341250.5,8.108333,33,1019,Asturias,1001910.0
5,Ávila,8651.417,66288.58,7.716667,5,219,Ávila,159010.4
6,Badajoz,54654.08,396741.2,7.633333,6,213,Badajoz,665022.2
7,Illes Balears,1147973.0,7212594.0,6.466667,7,1641,Illes Balears,1248273.0
8,Barcelona,937805.8,5096908.0,5.433333,8,1304,Barcelona,5705514.0
9,Bizkaia,60008.08,374469.7,6.275,48,295,Bizkaia,1135703.0


In [None]:
# Remove some columns
df_final = df_final.drop(['province_y', 'code'], axis=1)

In [None]:
# Rename some columns
df_final = df_final.rename(columns={"province_x": "province"})
df_final

Unnamed: 0,province,no_tourists,no_overnights,avg_monthly_travel_time (days),establishments,population
0,Albacete,14647.17,122275.9,8.433333,147,387680.6
1,Alicante,487653.2,4044514.0,8.3,575,1933594.0
2,Almería,70619.92,600900.8,8.7,245,732500.5
3,Álava,35820.33,214037.4,6.166667,126,331528.4
4,Asturias,42173.67,341250.5,8.108333,1019,1001910.0
5,Ávila,8651.417,66288.58,7.716667,219,159010.4
6,Badajoz,54654.08,396741.2,7.633333,213,665022.2
7,Illes Balears,1147973.0,7212594.0,6.466667,1641,1248273.0
8,Barcelona,937805.8,5096908.0,5.433333,1304,5705514.0
9,Bizkaia,60008.08,374469.7,6.275,295,1135703.0


After deleting and renaming some columns, we have our final dataset.

### **Feature Engineering**

However, before we move on to the correlations and regressions sections, we are going to carry out a small feature engineering process in order to add new variables that may be useful for the study. For the moment, the following are considered:

*   **Number of tourists/Number of overnights**: ratio between number of tourists and overnight stays.

*   **Number of tourists/Number of establishments**: this variable will allow us to know the approximate number of tourists staying in a tourist establishment in each province.

Note: **The number of tourists used to calculate both ratios is an average of the tourists received in each province throughout 2023, not the average number of tourists staying in establishments. Since what we want to check with the second variable is the number of tourists that approximately corresponds to a tourist accommodation in each province, it is possible that the figure obtained does not accurately reflect the reality.**

In [None]:
# Compute and add new columns (vectorized: one ratio per province)
df_final['tourists/overnights'] = df_final['no_tourists'] / df_final['no_overnights']
df_final['tourists/establishments'] = df_final['no_tourists'] / df_final['establishments']

df_final

Unnamed: 0,province,no_tourists,no_overnights,avg_monthly_travel_time (days),establishments,population,tourists/overnights,tourists/establishments
0,Albacete,14647.17,122275.9,8.433333,147,387680.6,0.119788,99.64059
1,Alicante,487653.2,4044514.0,8.3,575,1933594.0,0.120572,848.092464
2,Almería,70619.92,600900.8,8.7,245,732500.5,0.117523,288.244558
3,Álava,35820.33,214037.4,6.166667,126,331528.4,0.167355,284.28836
4,Asturias,42173.67,341250.5,8.108333,1019,1001910.0,0.123586,41.387308
5,Ávila,8651.417,66288.58,7.716667,219,159010.4,0.130511,39.504186
6,Badajoz,54654.08,396741.2,7.633333,213,665022.2,0.137758,256.591941
7,Illes Balears,1147973.0,7212594.0,6.466667,1641,1248273.0,0.159162,699.556876
8,Barcelona,937805.8,5096908.0,5.433333,1304,5705514.0,0.183995,719.176253
9,Bizkaia,60008.08,374469.7,6.275,295,1135703.0,0.160248,203.417232


We can now start to check the relationships between our variables.

## 2. Correlations

The first thing to know is that correlation coefficients quantify the association between variables or features of a dataset. We can find three different forms of correlation:

1. **Negative correlation.** In a plot, the y values tend to decrease as the x values increase. That is, large values of one feature correspond to small values of the other, and vice versa.

2. **Weak or no correlation.** There is no obvious trend. This occurs when an association between two features is not obvious or is hardly observable.

3. **Positive correlation.** In a plot, the y values tend to increase as the x values increase. In other words, large values of one feature correspond to large values of the other, and vice versa.

To check the correlation between the different variables in our dataset, we are going to use the Pandas *corr()* method, which calculates the pairwise correlation between all the numeric columns in the dataset. Although there are different correlation coefficients, in this study we will focus on **Pearson's** coefficient (the default for *corr()*), which is a measure of linear correlation between two variables.
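To make the definition of Pearson's coefficient concrete, here is a minimal sketch that computes it by hand and compares it with the pandas result. It uses only the first ten provinces shown in the tables above, so the value it prints is not the one reported for the full dataset; the names `demo`, `r_manual` and `r_pandas` are introduced for illustration.

```python
import numpy as np
import pandas as pd

# First ten provinces from the table above (not the full 52-row dataset).
demo = pd.DataFrame({
    "no_tourists": [14647.17, 487653.2, 70619.92, 35820.33, 42173.67,
                    8651.417, 54654.08, 1147973.0, 937805.8, 60008.08],
    "no_overnights": [122275.9, 4044514.0, 600900.8, 214037.4, 341250.5,
                      66288.58, 396741.2, 7212594.0, 5096908.0, 374469.7],
})

x = demo["no_tourists"].to_numpy()
y = demo["no_overnights"].to_numpy()

# Pearson's r: covariance of x and y divided by the product of their
# standard deviations (the normalising factors cancel out).
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient as computed by pandas.
r_pandas = demo["no_tourists"].corr(demo["no_overnights"])
print(r_manual, r_pandas)
```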

In [None]:
df_corr = df_final.corr(numeric_only=True)
df_corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,no_tourists,no_overnights,avg_monthly_travel_time (days),establishments,population,tourists/overnights,tourists/establishments
no_tourists,1.0,0.986549,-0.153963,0.823425,0.645295,0.096594,0.345816
no_overnights,0.986549,1.0,-0.07194,0.806306,0.597425,0.020057,0.35201
avg_monthly_travel_time (days),-0.153963,-0.07194,1.0,-0.121669,-0.130349,-0.970936,-0.426951
establishments,0.823425,0.806306,-0.121669,1.0,0.555035,0.046345,0.045669
population,0.645295,0.597425,-0.130349,0.555035,1.0,0.071296,0.20437
tourists/overnights,0.096594,0.020057,-0.970936,0.046345,0.071296,1.0,0.524238
tourists/establishments,0.345816,0.35201,-0.426951,0.045669,0.20437,0.524238,1.0


As we can see, there is a very strong positive correlation between the number of tourists and the number of overnight stays (0.99), and a somewhat weaker but still strong correlation between these two variables and the number of establishments (0.82 and 0.81). There is also a positive but more moderate correlation between population and the number of tourists (0.65), the number of overnight stays (0.60) and the number of establishments (0.56). These positive correlations can be explained as follows:

-  The greater the number of tourists in a province, the greater the number of overnight stays, which makes sense. The reverse is also true. If we report a large number of overnight stays over a period of time, it means that the number of tourists has been high. There may be cases where tourists only spend the day (do not stay overnight) in one province or, on the contrary, plan a long-stay trip and spend many more nights but, in general, the likelihood of more overnight stays increases with the number of tourists.

- It also makes sense that the greater the number of tourists, the greater the tourist offer. That is to say, if the number of tourists visiting a province is high, their spending will be greater compared to other areas, and businesspeople will be interested in opening new establishments in those provinces. However, the weaker correlation could be explained by two factors: the province does not have enough space on which new establishments can be built (protected spaces, legislation in the area...) or the tourist demand is so high that it cannot be completely covered even if new establishments are opened.

- In the case of the relationship between tourists and population, the same is usually true. If a province has a high population, it means that it is attractive to live in, as is the case with Madrid or Barcelona. The more people live in a place, the more services and activities are offered, which will also make it attractive to people from other areas. Therefore, it makes sense that the higher the population, the greater the number of arriving tourists. However, there may also be provinces where the population is much smaller due to their size, but they are equally attractive for tourism, or the opposite case. In other words, there is a relationship between the two variables and, in general, it makes sense that if one grows, the other will grow as well, but it does not always have to be that way.

- Finally, and considering the above, it makes sense that the larger the population of a province, the greater the tourist offer. On the one hand, all these establishments are not only intended for people from outside the province, but many people who live in it may also want to spend a weekend away from home without having to leave the region. On the other hand, we have already seen that the more people live in a province, the more money is generally spent on services and the more likely it is that a tourist will want to visit.
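The pairs discussed in the points above can also be ranked programmatically rather than read off the coloured matrix. The sketch below works on a 3x3 excerpt of the correlation matrix (values copied from the output above); with the real data, `corr` would simply be `df_corr`.

```python
import numpy as np
import pandas as pd

# A 3x3 excerpt of the correlation matrix above.
cols = ["no_tourists", "establishments", "population"]
corr = pd.DataFrame(
    [[1.0, 0.823425, 0.645295],
     [0.823425, 1.0, 0.555035],
     [0.645295, 0.555035, 1.0]],
    index=cols, columns=cols,
)

# Keep only the cells above the diagonal so each pair appears once,
# then sort the pairs by absolute correlation.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well.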

Now, if we look at the two new variables added, we see that there is a strong negative correlation (-0.97) between the ratio of tourists to overnights and the average travel time. That is, the higher the number of tourists per overnight stay (each tourist spends fewer nights), the shorter the average travel time, which makes sense. The ratio of tourists to establishments also has a negative correlation with the average travel time, although it is weaker (-0.43). One possible interpretation is that, if the hotel capacity in an area is reduced or insufficient, a greater number of tourists must be grouped in the same establishment. Since demand will be too high compared to the number of establishments available, the number of days tourists can be accommodated will be lower and, therefore, trips will be shorter.
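One way to see why the first of these correlations is so strong is to invert the ratio: overnights per tourist is, by construction, close to the average length of stay. The sketch below is only illustrative and assumes that the average travel time roughly measures nights per tourist; it uses the first five provinces from the table above, and the column `overnights_per_tourist` is introduced here.

```python
import pandas as pd

# First five provinces from the table above.
demo = pd.DataFrame({
    "province": ["Albacete", "Alicante", "Almería", "Álava", "Asturias"],
    "tourists/overnights": [0.119788, 0.120572, 0.117523, 0.167355, 0.123586],
    "avg_monthly_travel_time (days)": [8.433333, 8.3, 8.7, 6.166667, 8.108333],
})

# Overnights per tourist is the reciprocal of tourists per overnight;
# it is close to (but not identical to) the averaged monthly stay.
demo["overnights_per_tourist"] = 1 / demo["tourists/overnights"]
print(demo[["province", "overnights_per_tourist", "avg_monthly_travel_time (days)"]])
```

The two columns differ slightly because one is a ratio of yearly averages and the other is an average of monthly values.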

Once we have checked the correlation between the variables of interest, let's start implementing the regressions.

## 3. Regressions

**Linear regression** is the process of finding the linear function that is as close as possible to the actual relationship between variables. In other words, we are going to determine the linear function that best describes the association between these features. This linear function is also called the regression line.

To obtain these lines, we first plot the data points of the variables we are interested in. The variables are plotted in pairs, with one on the x-axis and the other on the y-axis. To do this, we will use the *scatter()* function of the *Plotly* library (used in previous studies). Then, using the *trendline* parameter, we draw an Ordinary Least Squares (OLS) regression line. This line makes the vertical distances from the data points to the regression line as small as possible overall. It is called “least squares” because the best-fit line is the one that minimizes the sum of the squared errors (residuals).

To evaluate the results, we will focus on the following coefficients:

  - **Slope**, which is interpreted as the change of y for a one unit increase in x.

  - **Intercept**, which is the estimated value of y when x=0.

  - **R-Squared**, which shows how well the data fit the regression model (goodness of fit).
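The three quantities listed above can be computed directly with NumPy. This is a minimal sketch using only the first ten provinces from the tables above, so its output will differ from the values reported for the full dataset; the variable names are chosen for illustration.

```python
import numpy as np

# First ten provinces from the tables above (not the full dataset,
# so the numbers differ from those reported for the full fit).
establishments = np.array([147, 575, 245, 126, 1019, 219, 213, 1641, 1304, 295])
no_tourists = np.array([14647.17, 487653.2, 70619.92, 35820.33, 42173.67,
                        8651.417, 54654.08, 1147973.0, 937805.8, 60008.08])

# Degree-1 least-squares fit: returns [slope, intercept].
slope, intercept = np.polyfit(establishments, no_tourists, 1)

# R-squared = 1 - (residual sum of squares / total sum of squares).
predicted = slope * establishments + intercept
ss_res = ((no_tourists - predicted) ** 2).sum()
ss_tot = ((no_tourists - no_tourists.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
print(slope, intercept, r_squared)
```

With Plotly, the model fitted by `trendline="ols"` can also be retrieved with `px.get_trendline_results(fig)`, which relies on the statsmodels package being installed.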

### **Population - Hotel Capacity**

In [None]:
fig = px.scatter(df_final, x="establishments", y="population", trendline="ols", trendline_color_override='darkblue')
fig.show()

First of all, we see that the slope of this regression line is 1993.28, which means that each additional establishment is associated with an increase in population of approximately 1993. The intercept indicates that, if the number of establishments were 0, the estimated population would be 59208. Finally, the r-squared (0.308) is closer to 0 than to 1, so the fit is not very good. This linear regression therefore confirms what we have seen before: there is a positive but not very strong linear correlation between the population and the number of tourist establishments.

### **Population - Number of Tourists**

In [None]:
fig = px.scatter(df_final, x="no_tourists", y="population", trendline="ols", trendline_color_override='darkblue')
fig.show()

In this second plot, we see a positive linear correlation very similar to the previous one: a positive slope and an r-squared closer to 0 than to 1. This is consistent with the correlation matrix, where the correlation between these variables (0.65) was positive and a little stronger than in the previous case (0.56).

### **Number of Tourists - Hotel Capacity**

In [None]:
fig = px.scatter(df_final, x="establishments", y="no_tourists", trendline="ols", trendline_color_override='darkblue')
fig.show()

In this last case, as we have seen in the correlation matrix above (0.82), we find a regression line that reflects a stronger positive linear correlation between the number of tourists and the number of establishments. The slope is positive: each additional establishment is associated with approximately 574 more tourists. The r-squared is closer to 1 than in the previous cases, which reflects a better fit.
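As a consistency check between the two sections, recall that for a simple linear regression with an intercept the r-squared equals the square of Pearson's coefficient. The short sketch below applies this to the three pairs studied, using the coefficients copied from the correlation matrix; the dictionary names are introduced for illustration.

```python
# Pearson coefficients read from the correlation matrix above.
pearson_r = {
    ("establishments", "population"): 0.555035,
    ("no_tourists", "population"): 0.645295,
    ("establishments", "no_tourists"): 0.823425,
}

# For a simple linear regression with an intercept, R-squared equals r squared.
r_squared = {pair: round(r ** 2, 3) for pair, r in pearson_r.items()}
for pair, value in r_squared.items():
    print(pair, value)
```

The first value reproduces the r-squared of 0.308 reported for the population and hotel capacity regression.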