# **OSM 2022 Markers Data: Exploratory Data Analysis**

## **Datasets**

The dataset for this study comes from a .pbf file downloaded from Geofabrik (https://download.geofabrik.de/europe/spain.html#), a website that provides up-to-date OpenStreetMap extracts for different regions. The *raw directory index* section of that page lists all the historical files available. Because the file corresponding to 2022 was updated in January of that year, we use the file from the beginning of 2023 instead, so that the tourist spots added throughout 2022 are not lost. Note that the data for the Canary Islands is stored separately and must be downloaded from another section of the website (http://download.geofabrik.de/africa/canary-islands.html#).

## **Goal**

Count all the tourist establishments in each Spanish province in 2022.

## **Useful Links**

- https://cienciadedatos.net/documentos/py40-puntos-interes-openstreetmap-python
- https://wiki.openstreetmap.org/wiki/Map_features#Tourism

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
data_dir_geom = '/content/gdrive/MyDrive/TFM/Geometry/'
data_dir_geofabrik = '/content/gdrive/MyDrive/TFM/Geofabrik/'
data_dir_new = '/content/gdrive/MyDrive/TFM/New/'

In [None]:
!pip install osmium

Collecting osmium
  Downloading osmium-3.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: osmium
Successfully installed osmium-3.7.0


In [None]:
!pip install pandas fiona shapely pyproj rtree

Collecting rtree
  Downloading Rtree-1.2.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (535 kB)
Installing collected packages: rtree
Successfully installed rtree-1.2.0


In [None]:
!pip install geopandas



In [None]:
!pip install geopy



In [None]:
# Import necessary packages
import osmium as osm
import pandas as pd
import geopandas as gpd
import numpy as np
import itertools
from shapely.geometry import Point, LineString, Polygon
from shapely import wkb
from geopy.distance import distance
import plotly.express as px
import plotly.graph_objects as go

## 1. Load and Explore Data

The first step is to extract the tourist establishments contained in the downloaded file. To extract a given map element from a .pbf file, we first need to know how it is stored in OSM. According to the Map features list in the OSM Wiki, tourist establishments are identified by the key *tourism* and are stored as elements of type *node* and *way*. Within this key there are different values, which let us keep only the points we are interested in. Therefore, from all the information stored in the .pbf file, we need to extract only the nodes and ways that have the key *tourism* with one of the values of interest.

To do this, we will use the **Pyosmium** library, which allows reading and extracting information from OSM files through the **SimpleHandler** class.
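
As a minimal, dependency-free sketch of the filtering rule described above, an element is kept when at least one filter key is present with one of the accepted values. The function name `matches_filter` and the sample tags are hypothetical and are not part of Pyosmium; tags are written as plain dictionaries for illustration only.

```python
def matches_filter(tags, custom_filter):
    """Return True if any filter key appears in `tags` with an accepted value."""
    return any(tags.get(key) in values for key, values in custom_filter.items())


# Hypothetical filter and tag sets (assumptions for this example).
tourism_filter = {'tourism': ['hotel', 'hostel', 'camp_site']}

hotel_tags = {'name': 'Example Hotel', 'tourism': 'hotel'}
museum_tags = {'name': 'Example Museum', 'tourism': 'museum'}
bench_tags = {'amenity': 'bench'}

print(matches_filter(hotel_tags, tourism_filter))   # True
print(matches_filter(museum_tags, tourism_filter))  # False
print(matches_filter(bench_tags, tourism_filter))   # False
```

An element without the key at all is rejected because `dict.get` returns `None`, which is never in the list of accepted values.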


In [None]:
class POIHandler(osm.SimpleHandler):
    '''
    Class to extract information from an osm.pbf file. Only elements identified
    as 'node' or 'area' are extracted. In addition, a filter can be applied to
    select only those that have tags with a certain key and value.

    The position of the areas is obtained by calculating the centroid of the
    polygon formed by their nodes.

    Arguments
    ---------

    custom_filter: dict
        Dictionary with the keys and values that the elements must have to be
        extracted. For example:

        `{'amenity': ['restaurant', 'bar']}` selects only those elements that
        have the key 'amenity' with value 'restaurant' or 'bar'.

        `{'amenity': ['restaurant', 'bar'], 'building': ['hotel']}` selects only
        those elements that have the key 'amenity' with value 'restaurant' or
        'bar', or those with the key 'building' with value 'hotel'.
    '''

    def __init__(self, custom_filter=None):
        osm.SimpleHandler.__init__(self)
        self.osm_data = []
        self.custom_filter = custom_filter

        if self.custom_filter:
            for key, value in self.custom_filter.items():
                if isinstance(value, str):
                    self.custom_filter[key] = [value]

    def node(self, node):
        if self.custom_filter is None:
            name = node.tags.get('name', '')
            self.tag_inventory(node, 'node', name)
        else:
            if any([node.tags.get(key) in self.custom_filter[key] for key in self.custom_filter.keys()]):
                name = node.tags.get('name', '')
                self.tag_inventory(node, 'node', name)

    def area(self, area):
        if self.custom_filter is None:
            name = area.tags.get('name', '')
            self.tag_inventory(area, 'area', name)
        else:
            if any([area.tags.get(key) in self.custom_filter[key] for key in self.custom_filter.keys()]):
                name = area.tags.get('name', '')
                self.tag_inventory(area, 'area', name)

    def tag_inventory(self, elem, elem_type, name):
        if elem_type == 'node':
            for tag in elem.tags:
                self.osm_data.append([elem_type,
                                       elem.id,
                                       name,
                                       elem.location.lon,
                                       elem.location.lat,
                                       pd.Timestamp(elem.timestamp),
                                       len(elem.tags),
                                       tag.k,
                                       tag.v])
        if elem_type == 'area':
            try:
                # A Polygon is created with the nodes that form the area to
                # calculate its centroid.
                nodes = list(elem.outer_rings())[0]
                polygon = Polygon([(node.lon, node.lat) for node in nodes])
                for tag in elem.tags:
                    self.osm_data.append([elem_type,
                                           elem.id,
                                           name,
                                           polygon.centroid.x,
                                           polygon.centroid.y,
                                           pd.Timestamp(elem.timestamp),
                                           len(elem.tags),
                                           tag.k,
                                           tag.v])
            except Exception:
                # Skip areas whose outer ring cannot be turned into a polygon.
                pass
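
The area branch of `tag_inventory` reduces each polygon to its centroid. As a rough illustration of what such a computation involves, the sketch below applies the standard planar (shoelace) centroid formula to a ring of `(x, y)` pairs. The function name `polygon_centroid` is hypothetical, and this is not presented as the exact routine used by shapely; it also treats longitude and latitude as planar coordinates, which is only an approximation.

```python
def polygon_centroid(coords):
    """Area-weighted centroid of a simple polygon given as (x, y) pairs."""
    # Close the ring if the last vertex does not repeat the first one.
    if coords[0] != coords[-1]:
        coords = list(coords) + [coords[0]]

    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        cross = x0 * y1 - x1 * y0  # signed contribution of this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross

    if twice_area == 0:
        raise ValueError('degenerate polygon with zero area')

    # The signed area is twice_area / 2, so the usual 6 * area is 3 * twice_area.
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


# Hypothetical 2 x 2 square with one corner at the origin.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(polygon_centroid(square))  # (1.0, 1.0)
```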

### **Spain**

In [None]:
# OSM markers Spain: extract elements whose 'tourism' tag has one of the values of interest
poi_handler_sp = POIHandler(custom_filter={'tourism':['alpine_hut', 'apartment', 'camp_pitch', 'camp_site', 'caravan_site', 'chalet',
                                          'guest_house', 'hostel', 'hotel', 'motel', 'wilderness_hut']})
poi_handler_sp.apply_file(data_dir_geofabrik + 'Spain/2022/spain-230101.osm.pbf')

As a result of the extraction, a list is obtained in which each element is in turn a list with the information associated with a node or area. To facilitate the management of the extracted data, we are going to store this information in a pandas dataframe.

In [None]:
colnames = ['type', 'id', 'name', 'lon', 'lat', 'timestamp','n_tags', 'tag_key',
            'tag_value']
df_poi_sp = pd.DataFrame(poi_handler_sp.osm_data, columns=colnames)
df_poi_sp.head()

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
0,node,21947483,,-3.702375,40.411544,2022-04-01 20:07:02+00:00,7,addr:city,Madrid
1,node,21947483,,-3.702375,40.411544,2022-04-01 20:07:02+00:00,7,addr:country,ES
2,node,21947483,,-3.702375,40.411544,2022-04-01 20:07:02+00:00,7,addr:housenumber,15
3,node,21947483,,-3.702375,40.411544,2022-04-01 20:07:02+00:00,7,addr:postcode,28012
4,node,21947483,,-3.702375,40.411544,2022-04-01 20:07:02+00:00,7,addr:street,Calle del Calvario


Each row contains one attribute (*tag_key* and *tag_value*) associated with an element. OpenStreetMap uses a free tagging system, so the same node can have any number of associated attributes. For example, the node with id 21947483 has 7 associated tags and therefore appears in 7 rows.
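
To make the consequence of this long format concrete, here is a small dependency-free sketch with hypothetical rows (the ids, names and tags are made up for illustration). It shows that counting rows overstates the number of elements, while counting distinct ids does not.

```python
# Hypothetical rows in the same long format: (element id, tag key, tag value).
rows = [
    (1, 'name', 'Example Hotel'),
    (1, 'tourism', 'hotel'),
    (1, 'stars', '4'),
    (2, 'tourism', 'camp_site'),
    (2, 'name', 'Example Camping'),
]

# Counting rows counts every tag of every element ...
n_rows = len(rows)

# ... whereas counting distinct ids gives the number of elements.
n_elements = len({elem_id for elem_id, _, _ in rows})

# Keeping a single tag key per element yields one row per element.
tourism_rows = [row for row in rows if row[1] == 'tourism']

print(n_rows, n_elements, len(tourism_rows))  # 5 2 2
```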

In [None]:
df_poi_sp.shape

(159742, 9)

Therefore, to avoid counting an element once per tag, we keep only the rows for a single tag: **tourism**.

In [None]:
tourism_sp = df_poi_sp[df_poi_sp['tag_key'].isin(['tourism'])]
tourism_sp

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
6,node,21947483,,-3.702375,40.411544,2022-04-01 20:07:02+00:00,7,tourism,apartment
17,node,25913327,NH Ciudad de la Imagen,-3.788176,40.398441,2022-02-19 10:04:33+00:00,13,tourism,hotel
22,node,26860899,Buchaca,1.357838,42.373091,2019-10-05 08:59:10+00:00,3,tourism,camp_site
24,node,26860903,Cadaques,3.281944,42.288056,2011-09-11 16:11:28+00:00,2,tourism,hotel
27,node,26860947,El Quinto,-1.859175,37.140828,2007-12-08 01:04:36+00:00,3,tourism,camp_site
...,...,...,...,...,...,...,...,...,...
159704,area,2251657820,Cabaña de Escobio,-5.159962,43.019472,2022-12-30 12:57:39+00:00,7,tourism,wilderness_hut
159707,area,2251752538,Albergue Hoces del Duratón,-3.809665,41.226235,2022-12-30 16:54:50+00:00,3,tourism,hostel
159708,area,2252037932,,0.873075,41.872825,2022-12-31 11:45:57+00:00,1,tourism,caravan_site
159727,area,2252121154,"Irurtzungo autokarabana gunea, Iturrizar",-1.825245,42.920637,2022-12-31 16:01:50+00:00,20,tourism,caravan_site


As we can see, the size of the dataset has been reduced considerably.

### **Canary Islands**

We are going to follow the same procedure with the data from the Canary Islands.

In [None]:
# OSM markers Canary Islands: extract elements whose 'tourism' tag has one of the values of interest
poi_handler_ci = POIHandler(custom_filter={'tourism':['alpine_hut', 'apartment', 'camp_pitch', 'camp_site', 'caravan_site', 'chalet',
                                          'guest_house', 'hostel', 'hotel', 'motel', 'wilderness_hut']})
poi_handler_ci.apply_file(data_dir_geofabrik + 'Canary_Islands/2022/canary-islands-230101.osm.pbf')

In [None]:
colnames = ['type', 'id', 'name', 'lon', 'lat', 'timestamp','n_tags', 'tag_key',
            'tag_value']
df_poi_ci = pd.DataFrame(poi_handler_ci.osm_data, columns=colnames)
df_poi_ci.head()

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
0,node,33104770,allsun Esquinzo Beach Hotel,-14.308186,28.073667,2017-07-20 14:09:17+00:00,3,name,allsun Esquinzo Beach Hotel
1,node,33104770,allsun Esquinzo Beach Hotel,-14.308186,28.073667,2017-07-20 14:09:17+00:00,3,operator,alltours
2,node,33104770,allsun Esquinzo Beach Hotel,-14.308186,28.073667,2017-07-20 14:09:17+00:00,3,tourism,hotel
3,node,33105081,Club Aldiana Fuerteventura,-14.316887,28.060005,2020-05-22 21:16:08+00:00,2,name,Club Aldiana Fuerteventura
4,node,33105081,Club Aldiana Fuerteventura,-14.316887,28.060005,2020-05-22 21:16:08+00:00,2,tourism,hotel


In [None]:
df_poi_ci.shape

(11736, 9)

In [None]:
tourism_ci = df_poi_ci[df_poi_ci['tag_key'].isin(['tourism'])]
tourism_ci

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
2,node,33104770,allsun Esquinzo Beach Hotel,-14.308186,28.073667,2017-07-20 14:09:17+00:00,3,tourism,hotel
4,node,33105081,Club Aldiana Fuerteventura,-14.316887,28.060005,2020-05-22 21:16:08+00:00,2,tourism,hotel
15,node,33105367,Iberostar Selection Fuerteventura Palace,-14.320332,28.056630,2021-08-26 11:32:02+00:00,11,tourism,hotel
18,node,33105714,disused hotel (ex: Dunas Club Calela del Sol),-14.288661,28.092809,2011-11-16 08:10:54+00:00,3,tourism,hotel
28,node,33106383,Iberostar Playa Gaviotas,-14.319421,28.056746,2020-08-27 16:14:39+00:00,11,tourism,hotel
...,...,...,...,...,...,...,...,...,...
11711,area,2220285062,Cordial Muelle Viejo,-15.761691,27.817757,2022-11-04 05:45:25+00:00,12,tourism,apartment
11724,area,2224436870,Hotel Drago de San Antonio,-16.719145,28.363781,2022-11-11 08:28:55+00:00,12,tourism,hotel
11727,area,714809,Barceló Fuerteventural Castillo,-13.855852,28.396796,2022-11-26 14:27:02+00:00,3,tourism,hotel
11731,area,2239160954,2,-13.841199,28.859583,2022-12-06 10:26:14+00:00,4,tourism,chalet


Again, we keep only the *tourism* rows so that each element is counted only once.

## 2. Prepare Data

To facilitate the study, we will now join the datasets so that we have the tourist establishments of both the mainland and the islands in the same dataset.

In [None]:
# Concat datasets
tourism = pd.concat([tourism_sp, tourism_ci], ignore_index=True)
tourism

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
0,node,21947483,,-3.702375,40.411544,2022-04-01 20:07:02+00:00,7,tourism,apartment
1,node,25913327,NH Ciudad de la Imagen,-3.788176,40.398441,2022-02-19 10:04:33+00:00,13,tourism,hotel
2,node,26860899,Buchaca,1.357838,42.373091,2019-10-05 08:59:10+00:00,3,tourism,camp_site
3,node,26860903,Cadaques,3.281944,42.288056,2011-09-11 16:11:28+00:00,2,tourism,hotel
4,node,26860947,El Quinto,-1.859175,37.140828,2007-12-08 01:04:36+00:00,3,tourism,camp_site
...,...,...,...,...,...,...,...,...,...
27893,area,2220285062,Cordial Muelle Viejo,-15.761691,27.817757,2022-11-04 05:45:25+00:00,12,tourism,apartment
27894,area,2224436870,Hotel Drago de San Antonio,-16.719145,28.363781,2022-11-11 08:28:55+00:00,12,tourism,hotel
27895,area,714809,Barceló Fuerteventural Castillo,-13.855852,28.396796,2022-11-26 14:27:02+00:00,3,tourism,hotel
27896,area,2239160954,2,-13.841199,28.859583,2022-12-06 10:26:14+00:00,4,tourism,chalet


Next, to reduce the risk of duplicates, we drop records that share the same name and keep only the first occurrence. Records with an empty name are excluded from this step and are all kept.

In [None]:
# Eliminate establishments that may be duplicated by name except when it is an empty value

# Filter rows where 'name' is not empty and remove duplicates, keeping the first occurrence
tourism_no_empty = tourism[tourism['name'] != ''].drop_duplicates(subset='name', keep='first')

# Filter rows where 'name' is empty
tourism_empty = tourism[tourism['name'] == '']

# Concatenate the DataFrames
tourism_cleaned = pd.concat([tourism_no_empty, tourism_empty])

tourism_cleaned

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
1,node,25913327,NH Ciudad de la Imagen,-3.788176,40.398441,2022-02-19 10:04:33+00:00,13,tourism,hotel
2,node,26860899,Buchaca,1.357838,42.373091,2019-10-05 08:59:10+00:00,3,tourism,camp_site
3,node,26860903,Cadaques,3.281944,42.288056,2011-09-11 16:11:28+00:00,2,tourism,hotel
4,node,26860947,El Quinto,-1.859175,37.140828,2007-12-08 01:04:36+00:00,3,tourism,camp_site
5,node,26860948,El Sur,-5.171844,36.721090,2012-09-27 16:10:10+00:00,2,tourism,hostel
...,...,...,...,...,...,...,...,...,...
27875,area,2059231940,,-16.672312,28.378086,2022-02-09 13:42:17+00:00,2,tourism,chalet
27876,area,2059231942,,-16.672171,28.377983,2022-02-09 13:42:17+00:00,2,tourism,chalet
27883,area,2106288226,,-17.923545,28.600040,2022-04-19 19:56:41+00:00,3,tourism,caravan_site
27884,area,2107079232,,-17.788747,28.842906,2022-04-21 07:19:37+00:00,2,tourism,caravan_site
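
The deduplication rule used above can also be read as a single pass over the records. The following dependency-free sketch uses a hypothetical helper, `drop_duplicate_names`, and made-up records; unlike the pandas version, it preserves the original record order instead of moving unnamed records to the end. Note that the rule treats any two records with the same non-empty name as one establishment.

```python
def drop_duplicate_names(records):
    """Keep the first record per non-empty name; keep every unnamed record."""
    seen = set()
    kept = []
    for record in records:
        name = record['name']
        if name == '':
            kept.append(record)   # unnamed records are never merged
        elif name not in seen:
            seen.add(name)
            kept.append(record)   # first occurrence of this name
    return kept


# Hypothetical records (assumptions for this example).
records = [
    {'id': 1, 'name': 'Hotel A'},
    {'id': 2, 'name': ''},
    {'id': 3, 'name': 'Hotel A'},   # same name as id 1, so it is dropped
    {'id': 4, 'name': ''},
    {'id': 5, 'name': 'Camping B'},
]

print([r['id'] for r in drop_duplicate_names(records)])  # [1, 2, 4, 5]
```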


Before starting to count the number of tourist establishments, there is one last step: to obtain the *geometry* column of each of them from the longitude and latitude values.

In [None]:
# Obtain geometry for each establishment (vectorized, no row-by-row loop)
geometry = gpd.points_from_xy(tourism_cleaned['lon'], tourism_cleaned['lat'])

tourism_cleaned = tourism_cleaned.set_geometry(geometry)
tourism_cleaned

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value,geometry
1,node,25913327,NH Ciudad de la Imagen,-3.788176,40.398441,2022-02-19 10:04:33+00:00,13,tourism,hotel,POINT (-3.78818 40.39844)
2,node,26860899,Buchaca,1.357838,42.373091,2019-10-05 08:59:10+00:00,3,tourism,camp_site,POINT (1.35784 42.37309)
3,node,26860903,Cadaques,3.281944,42.288056,2011-09-11 16:11:28+00:00,2,tourism,hotel,POINT (3.28194 42.28806)
4,node,26860947,El Quinto,-1.859175,37.140828,2007-12-08 01:04:36+00:00,3,tourism,camp_site,POINT (-1.85918 37.14083)
5,node,26860948,El Sur,-5.171844,36.721090,2012-09-27 16:10:10+00:00,2,tourism,hostel,POINT (-5.17184 36.72109)
...,...,...,...,...,...,...,...,...,...,...
27875,area,2059231940,,-16.672312,28.378086,2022-02-09 13:42:17+00:00,2,tourism,chalet,POINT (-16.67231 28.37809)
27876,area,2059231942,,-16.672171,28.377983,2022-02-09 13:42:17+00:00,2,tourism,chalet,POINT (-16.67217 28.37798)
27883,area,2106288226,,-17.923545,28.600040,2022-04-19 19:56:41+00:00,3,tourism,caravan_site,POINT (-17.92355 28.60004)
27884,area,2107079232,,-17.788747,28.842906,2022-04-21 07:19:37+00:00,2,tourism,caravan_site,POINT (-17.78875 28.84291)


In [None]:
# Select columns of interest
tourism_cleaned = tourism_cleaned[['id', 'geometry']]
tourism_cleaned

Unnamed: 0,id,geometry
1,25913327,POINT (-3.78818 40.39844)
2,26860899,POINT (1.35784 42.37309)
3,26860903,POINT (3.28194 42.28806)
4,26860947,POINT (-1.85918 37.14083)
5,26860948,POINT (-5.17184 36.72109)
...,...,...
27875,2059231940,POINT (-16.67231 28.37809)
27876,2059231942,POINT (-16.67217 28.37798)
27883,2106288226,POINT (-17.92355 28.60004)
27884,2107079232,POINT (-17.78875 28.84291)


Once this is done, we can represent our points on a map to check in which province each of them is located. That is, we will check if the geometry of each point falls within the geometry of a particular province.
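
To illustrate what a point-in-polygon check does conceptually, the sketch below implements the classic ray-casting test in plain Python and uses it to assign points to regions. This is an illustrative stand-in, not the routine GeoPandas uses internally, and it does not treat points lying exactly on a boundary specially. The function `point_in_polygon`, the two rectangular "provinces" and the point ids are all hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon `ring`."""
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y0 > y) != (y1 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


# Two hypothetical rectangular regions sharing an edge at x = 2.
provinces = {
    'West': [(0, 0), (2, 0), (2, 2), (0, 2)],
    'East': [(2, 0), (4, 0), (4, 2), (2, 2)],
}

# Hypothetical points keyed by id.
points = {101: (1.0, 1.0), 102: (3.0, 0.5), 103: (5.0, 5.0)}

assignment = {
    point_id: [name for name, ring in provinces.items()
               if point_in_polygon(x, y, ring)]
    for point_id, (x, y) in points.items()
}

print(assignment)  # {101: ['West'], 102: ['East'], 103: []}
```

Point 103 falls outside both regions and therefore gets an empty list, which is the situation to watch for when points lie outside every province polygon.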

## 3. Represent and Count Data

In this section, we count how many tourist establishments there were in each Spanish province in 2022. As in previous studies, we load the geometry of Spain (the geometries of Santa Cruz de Tenerife and Las Palmas are added later) and then plot and count the tourist spots per province. First, let's plot all the tourist points identified in the previous section on a map of Spain.

In [None]:
fig = px.scatter_mapbox(
    tourism_cleaned,
    lat=tourism_cleaned.geometry.y,
    lon=tourism_cleaned.geometry.x,
    size=None,
    #color_continuous_scale='Viridis',
    mapbox_style="open-street-map",
    zoom=5,
    center={"lat": 40.0, "lon": -3.5},
    title="Tourist Establishments in Spain"
)

fig.update_traces(marker=dict(size=15))

fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})
fig.show()

As shown, the points are distributed across the entire country. Due to the large number of tourist establishments, it is necessary to zoom in to distinguish specific areas.

In [None]:
spanish_provinces = gpd.read_file(data_dir_geom + 'spanish_provinces.geojson')
spanish_provinces

Unnamed: 0,rotulo,population,time_light_sun,geometry
0,Melilla,1173039.0,12,"POLYGON ((-2.93446 35.27442, -2.92929 35.27324..."
1,Ceuta,1884441.0,5,"POLYGON ((-5.33273 35.89270, -5.31086 35.88727..."
2,Cádiz,505665.5,11,"POLYGON ((-5.14234 37.00387, -5.11510 36.98897..."
3,Málaga,1077906.0,10,"POLYGON ((-4.32759 37.18435, -4.32333 37.18075..."
4,Almería,1849180.0,12,"POLYGON ((-2.18494 37.90366, -2.17201 37.88872..."
5,Granada,1589546.0,6,"POLYGON ((-2.34163 38.02603, -2.32309 38.01401..."
6,Sevilla,1696450.0,10,"POLYGON ((-5.72320 38.19244, -5.70785 38.18830..."
7,Huelva,1109913.0,8,"POLYGON ((-6.18021 37.94103, -6.14518 37.92050..."
8,Jaén,1827838.0,12,"POLYGON ((-2.55129 38.08412, -2.57299 38.07785..."
9,Córdoba,1714056.0,5,"POLYGON ((-5.00801 38.71523, -4.87696 38.68610..."


In [None]:
# Drop some columns
spanish_provinces = spanish_provinces.drop(['population', 'time_light_sun'], axis=1)
spanish_provinces

Unnamed: 0,rotulo,geometry
0,Melilla,"POLYGON ((-2.93446 35.27442, -2.92929 35.27324..."
1,Ceuta,"POLYGON ((-5.33273 35.89270, -5.31086 35.88727..."
2,Cádiz,"POLYGON ((-5.14234 37.00387, -5.11510 36.98897..."
3,Málaga,"POLYGON ((-4.32759 37.18435, -4.32333 37.18075..."
4,Almería,"POLYGON ((-2.18494 37.90366, -2.17201 37.88872..."
5,Granada,"POLYGON ((-2.34163 38.02603, -2.32309 38.01401..."
6,Sevilla,"POLYGON ((-5.72320 38.19244, -5.70785 38.18830..."
7,Huelva,"POLYGON ((-6.18021 37.94103, -6.14518 37.92050..."
8,Jaén,"POLYGON ((-2.55129 38.08412, -2.57299 38.07785..."
9,Córdoba,"POLYGON ((-5.00801 38.71523, -4.87696 38.68610..."


In [None]:
# Get geometry of Santa Cruz de Tenerife and Las Palmas
canary_islands = gpd.read_file(data_dir_geom + 'recintos_provinciales_inspire_canarias_wgs84.shp')
canary_islands

Unnamed: 0,INSPIREID,COUNTRY,NATLEV,NATLEVNAME,NATCODE,NAMEUNIT,CODNUT1,CODNUT2,CODNUT3,geometry
0,ES.IGN.BDDAE.34053500000,ES,https://inspire.ec.europa.eu/codelist/Administ...,Provincia,34053500000,Las Palmas,ES7,ES70,,"MULTIPOLYGON (((-15.69749 27.77109, -15.69750 ..."
1,ES.IGN.BDDAE.34053800000,ES,https://inspire.ec.europa.eu/codelist/Administ...,Provincia,34053800000,Santa Cruz de Tenerife,ES7,ES70,,"MULTIPOLYGON (((-18.00161 27.64707, -18.00158 ..."


In [None]:
# Define new rows to add
row1 = pd.DataFrame([[canary_islands.loc[0, 'NAMEUNIT'], canary_islands.loc[0, 'geometry']]], columns=spanish_provinces.columns)
row2 = pd.DataFrame([[canary_islands.loc[1, 'NAMEUNIT'], canary_islands.loc[1, 'geometry']]], columns=spanish_provinces.columns)

In [None]:
# Add new rows with geometry
spanish_provinces = pd.concat([spanish_provinces, row1, row2], ignore_index=True)
spanish_provinces

Unnamed: 0,rotulo,geometry
0,Melilla,"POLYGON ((-2.93446 35.27442, -2.92929 35.27324..."
1,Ceuta,"POLYGON ((-5.33273 35.89270, -5.31086 35.88727..."
2,Cádiz,"POLYGON ((-5.14234 37.00387, -5.11510 36.98897..."
3,Málaga,"POLYGON ((-4.32759 37.18435, -4.32333 37.18075..."
4,Almería,"POLYGON ((-2.18494 37.90366, -2.17201 37.88872..."
5,Granada,"POLYGON ((-2.34163 38.02603, -2.32309 38.01401..."
6,Sevilla,"POLYGON ((-5.72320 38.19244, -5.70785 38.18830..."
7,Huelva,"POLYGON ((-6.18021 37.94103, -6.14518 37.92050..."
8,Jaén,"POLYGON ((-2.55129 38.08412, -2.57299 38.07785..."
9,Córdoba,"POLYGON ((-5.00801 38.71523, -4.87696 38.68610..."


In [None]:
poly_dict = {}

for i in spanish_provinces.index:

  poly_dict[spanish_provinces['rotulo'][i]] = spanish_provinces['geometry'][i]

polygons = gpd.GeoSeries(poly_dict)
polygons

Melilla                   POLYGON ((-2.93446 35.27442, -2.92929 35.27324...
Ceuta                     POLYGON ((-5.33273 35.89270, -5.31086 35.88727...
Cádiz                     POLYGON ((-5.14234 37.00387, -5.11510 36.98897...
Málaga                    POLYGON ((-4.32759 37.18435, -4.32333 37.18075...
Almería                   POLYGON ((-2.18494 37.90366, -2.17201 37.88872...
Granada                   POLYGON ((-2.34163 38.02603, -2.32309 38.01401...
Sevilla                   POLYGON ((-5.72320 38.19244, -5.70785 38.18830...
Huelva                    POLYGON ((-6.18021 37.94103, -6.14518 37.92050...
Jaén                      POLYGON ((-2.55129 38.08412, -2.57299 38.07785...
Córdoba                   POLYGON ((-5.00801 38.71523, -4.87696 38.68610...
Castelló/Castellón        POLYGON ((-0.07991 40.73335, -0.06476 40.72742...
Murcia                    POLYGON ((-1.11971 38.73774, -1.11563 38.71141...
Illes Balears             MULTIPOLYGON (((3.19286 39.93983, 3.15416 39.9...
Alacant/Alic

In [None]:
points_tourism = gpd.GeoDataFrame(index=tourism_cleaned['id'], crs='epsg:4326', geometry= list(tourism_cleaned['geometry']))

In [None]:
# Check in which province each establishment of the dataset is located
estab_tourism = points_tourism.assign(**{key: points_tourism.within(geom) for key, geom in polygons.items()})
estab_tourism

id,geometry,Melilla,Ceuta,Cádiz,Málaga,Almería,Granada,Sevilla,Huelva,Jaén,...,León,Cantabria,Asturias,Lugo,A Coruña,Bizkaia,Araba/Álava,Palencia,Las Palmas,Santa Cruz de Tenerife
25913327,POINT (-3.78818 40.39844),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
26860899,POINT (1.35784 42.37309),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
26860903,POINT (3.28194 42.28806),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
26860947,POINT (-1.85918 37.14083),False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
26860948,POINT (-5.17184 36.72109),False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2059231940,POINT (-16.67231 28.37809),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2059231942,POINT (-16.67217 28.37798),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2106288226,POINT (-17.92355 28.60004),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2107079232,POINT (-17.78875 28.84291),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True


In [None]:
# Count the establishments per province: each province column holds booleans,
# so summing a column counts its True values
counts = estab_tourism.drop(columns='geometry').sum()

estab_tourism_number = counts.rename_axis('province').reset_index(name='establishments')

estab_tourism_number

Unnamed: 0,province,establishments
0,Melilla,9
1,Ceuta,7
2,Cádiz,519
3,Málaga,940
4,Almería,271
5,Granada,557
6,Sevilla,370
7,Huelva,153
8,Jaén,336
9,Córdoba,194


We now have the number of tourist establishments in each province in 2022.
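
The per-province counts add up to the total number of establishments only if every point falls inside exactly one province polygon. As a hedged sketch of how that could be checked, the example below counts establishments per province and lists the ids that matched no province at all. The membership table, province names and ids are hypothetical and stand in for the boolean table built earlier.

```python
from collections import Counter

# Hypothetical membership table: one dict of province flags per establishment id.
membership = {
    101: {'Madrid': True, 'Toledo': False},
    102: {'Madrid': False, 'Toledo': True},
    103: {'Madrid': True, 'Toledo': False},
    104: {'Madrid': False, 'Toledo': False},  # falls outside every polygon
}

# Establishments per province: count the True flags for each province.
per_province = Counter(
    province
    for flags in membership.values()
    for province, inside in flags.items()
    if inside
)

# Establishments that matched no province.
unassigned = [elem_id for elem_id, flags in membership.items()
              if not any(flags.values())]

print(dict(per_province))  # {'Madrid': 2, 'Toledo': 1}
print(unassigned)          # [104]
```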

In [None]:
# Save the results in an Excel file
estab_tourism_number.to_excel(data_dir_new + '2022_estab_by_province.xlsx')

Having saved this data in a file for later use, we now plot the counts in a bar chart to see more easily which provinces had the most tourist establishments in 2022.

In [None]:
fig = px.bar(estab_tourism_number, x='province', y='establishments')
fig.show()

As can be seen, the provinces with the greatest number of tourist establishments are Illes Balears, Barcelona, Las Palmas, Madrid, Asturias, Girona and Santa Cruz de Tenerife.