# **OSM 2019 Markers Data: Exploratory Data Analysis**

## **Datasets**

The dataset for this study comes from .pbf files published by Geofabrik (https://download.geofabrik.de/europe/spain.html#), a website that offers regularly updated OpenStreetMap extracts for download by region. The *raw directory index* section of that page lists all the historical files available. Because the file labelled 2019 was generated in January of that year, we use the file from the beginning of 2020 instead (`spain-200101.osm.pbf`), so that the tourist spots added during 2019 are not lost. Note that the Canary Islands are stored separately and must be downloaded from another section of the website (http://download.geofabrik.de/africa/canary-islands.html#).

## **Goal**

Count all the tourist establishments in each Spanish province in 2019, right before the pandemic.

## **Useful Links**

- https://cienciadedatos.net/documentos/py40-puntos-interes-openstreetmap-python
- https://wiki.openstreetmap.org/wiki/Map_features#Tourism

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
data_dir_geom = '/content/gdrive/MyDrive/TFM/Geometry/'
data_dir_geofabrik = '/content/gdrive/MyDrive/TFM/Geofabrik/'
data_dir_new = '/content/gdrive/MyDrive/TFM/New/'

In [None]:
!pip install osmium

Collecting osmium
  Downloading osmium-3.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: osmium
Successfully installed osmium-3.7.0


In [None]:
!pip install pandas fiona shapely pyproj rtree

Collecting rtree
  Downloading Rtree-1.2.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (535 kB)
Installing collected packages: rtree
Successfully installed rtree-1.2.0


In [None]:
!pip install geopandas



In [None]:
!pip install geopy



In [None]:
# Import necessary packages
import osmium as osm
import pandas as pd
import geopandas as gpd
import numpy as np
import itertools
from shapely.geometry import Point, LineString, Polygon
from shapely import wkb
from geopy.distance import distance
import plotly.express as px
import plotly.graph_objects as go

## 1. Load and Explore Data

The first task is to extract the tourist establishments contained in the downloaded file. To extract a given map element from a .pbf file, we first need to know how it is stored in OSM. According to the Map features list in the OSM Wiki, tourist establishments are identified by the key *tourism* and are stored as elements of type *node* and *way*. Each value of this key denotes a different kind of establishment, which lets us keep only the ones we are interested in. Therefore, out of everything stored in the .pbf file, we only need the nodes and ways that carry the key *tourism* with one of the values of interest.

To do this, we will use the **Pyosmium** library, which allows reading and extracting information from OSM files through the **SimpleHandler** class.
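
Before looking at the handler, the tag test it relies on can be tried on plain dictionaries. This is only a sketch: the helper name `matches_filter` and the sample elements are assumptions made for illustration and are not part of Pyosmium.

```python
def matches_filter(tags, custom_filter):
    """Return True if any filtered key has one of its accepted values in `tags`."""
    if custom_filter is None:
        # No filter means every element is accepted
        return True
    return any(tags.get(key) in values for key, values in custom_filter.items())


# Hypothetical elements, written as plain dicts instead of Pyosmium tag lists
tourism_filter = {'tourism': ['hotel', 'hostel', 'camp_site']}

hotel = {'name': 'Parador de Verín', 'tourism': 'hotel'}
museum = {'name': 'Some museum', 'tourism': 'museum'}
bakery = {'shop': 'bakery'}

print(matches_filter(hotel, tourism_filter))   # True
print(matches_filter(museum, tourism_filter))  # False
print(matches_filter(bakery, tourism_filter))  # False
```

An element without the key yields `None` from `tags.get(key)`, which is never in the list of accepted values, so it is rejected just like an element with an unwanted value.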


In [None]:
class POIHandler(osm.SimpleHandler):
    '''
    Class to extract information from an osm.pbf file. Only elements identified
    as 'node' or 'area' are extracted. In addition, a filtering can be applied to
    select only those that have tags with a certain key and value.

    The position of the areas is obtained by calculating the centroid of the
    polygon formed by their nodes.

    Arguments
    ---------

    custom_filter: dict
        Dictionary with the keys and values that the elements must have to be
        extracted. For example:

        `{'amenity': ['restaurant', 'bar']}` selects only those elements that
        have the key 'amenity' with value 'restaurant' or 'bar'.

        `{'amenity': ['restaurant', 'bar'], 'building': ['hotel']}` selects
        only those elements that have the key 'amenity' with value
        'restaurant' or 'bar', or those with the key 'building' with value
        'hotel'.
    '''

    def __init__(self, custom_filter=None):
        osm.SimpleHandler.__init__(self)
        self.osm_data = []
        self.custom_filter = custom_filter

        if self.custom_filter:
            for key, value in self.custom_filter.items():
                if isinstance(value, str):
                    self.custom_filter[key] = [value]

    def node(self, node):
        if self.custom_filter is None:
            name = node.tags.get('name', '')
            self.tag_inventory(node, 'node', name)
        else:
            if any([node.tags.get(key) in self.custom_filter[key] for key in self.custom_filter.keys()]):
                name = node.tags.get('name', '')
                self.tag_inventory(node, 'node', name)

    def area(self, area):
        if self.custom_filter is None:
            name = area.tags.get('name', '')
            self.tag_inventory(area, 'area', name)
        else:
            if any([area.tags.get(key) in self.custom_filter[key] for key in self.custom_filter.keys()]):
                name = area.tags.get('name', '')
                self.tag_inventory(area, 'area', name)

    def tag_inventory(self, elem, elem_type, name):
        if elem_type == 'node':
            for tag in elem.tags:
                self.osm_data.append([elem_type,
                                       elem.id,
                                       name,
                                       elem.location.lon,
                                       elem.location.lat,
                                       pd.Timestamp(elem.timestamp),
                                       len(elem.tags),
                                       tag.k,
                                       tag.v])
        if elem_type == 'area':
            try:
                # A Polygon is created with the nodes that form the area to
                # calculate its centroid.
                nodes = list(elem.outer_rings())[0]
                polygon = Polygon([(node.lon, node.lat) for node in nodes])
                for tag in elem.tags:
                    self.osm_data.append([elem_type,
                                           elem.id,
                                           name,
                                           polygon.centroid.x,
                                           polygon.centroid.y,
                                           pd.Timestamp(elem.timestamp),
                                           len(elem.tags),
                                           tag.k,
                                           tag.v])
            except Exception:
                # Skip areas whose outer ring is missing or cannot form a
                # polygon.
                pass
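
The handler places each area at the centroid of its outer ring. As a rough illustration of what that means, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula, treating longitude and latitude as planar coordinates. The function name and the sample footprint are hypothetical, and this is not Shapely's own code.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    # Close the ring if the last vertex does not repeat the first one
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError('degenerate polygon with zero area')
    return cx / (3 * area2), cy / (3 * area2)


# Hypothetical 4 x 2 rectangle with extra vertices along its bottom edge
footprint = [(0, 0), (1, 0), (2, 0), (4, 0), (4, 2), (0, 2)]

centroid = polygon_centroid(footprint)
vertex_mean = (sum(x for x, _ in footprint) / len(footprint),
               sum(y for _, y in footprint) / len(footprint))

print(centroid)     # (2.0, 1.0)
print(vertex_mean)  # pulled towards the densely sampled bottom edge
```

Unlike a plain average of the vertices, the centroid does not shift when one side of the outline happens to be mapped with more nodes than the others.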

### **Spain**

In [None]:
# OSM markers Spain: extract the elements whose 'tourism' tag has one of the values of interest
poi_handler_sp = POIHandler(custom_filter={'tourism':['alpine_hut', 'apartment', 'camp_pitch', 'camp_site', 'caravan_site', 'chalet',
                                          'guest_house', 'hostel', 'hotel', 'motel', 'wilderness_hut']})
poi_handler_sp.apply_file(data_dir_geofabrik + 'Spain/2019/spain-200101.osm.pbf')

The extraction yields a list of lists, each holding one tag of a node or area together with the information of that element. To make the extracted data easier to handle, we store it in a pandas DataFrame.

In [None]:
colnames = ['type', 'id', 'name', 'lon', 'lat', 'timestamp','n_tags', 'tag_key',
            'tag_value']
df_poi_sp = pd.DataFrame(poi_handler_sp.osm_data, columns=colnames)
df_poi_sp.head()

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
0,node,21947483,,-3.70235,40.411489,2018-07-15 13:45:36+00:00,5,source,yahoo_maps
1,node,21947483,,-3.70235,40.411489,2018-07-15 13:45:36+00:00,5,tourism,apartment
2,node,21947483,,-3.70235,40.411489,2018-07-15 13:45:36+00:00,5,addr:street,Calle del Calvario
3,node,21947483,,-3.70235,40.411489,2018-07-15 13:45:36+00:00,5,addr:postcode,28012
4,node,21947483,,-3.70235,40.411489,2018-07-15 13:45:36+00:00,5,addr:housenumber,15


Each row contains one attribute (*tag_key* and *tag_value*) of a node or area. OpenStreetMap uses a free tagging system, so the same element can carry any number of tags. For example, the node with id 21947483 has 5 tags and therefore appears in 5 rows.

In [None]:
df_poi_sp.shape

(112256, 9)

Therefore, to avoid counting an element once per tag, we keep only the rows whose tag key is **tourism**. An OSM element cannot repeat a key, so this leaves exactly one row per element.
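
The effect of this filter can be checked on the five rows printed by `df_poi_sp.head()`. The snippet below is a small sketch that rebuilds those rows with a reduced set of columns; the variable names are chosen here for illustration.

```python
import pandas as pd

# Rows copied from the head() output above, one row per tag
rows = [
    ('node', 21947483, 'source', 'yahoo_maps'),
    ('node', 21947483, 'tourism', 'apartment'),
    ('node', 21947483, 'addr:street', 'Calle del Calvario'),
    ('node', 21947483, 'addr:postcode', '28012'),
    ('node', 21947483, 'addr:housenumber', '15'),
]
sample = pd.DataFrame(rows, columns=['type', 'id', 'tag_key', 'tag_value'])

# Keep a single row per element by selecting only the 'tourism' tag
sample_tourism = sample[sample['tag_key'] == 'tourism']

print(len(sample))            # 5 rows in the one-row-per-tag table
print(sample['id'].nunique()) # 1 distinct element
print(len(sample_tourism))    # 1 row after filtering
```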

In [None]:
tourism_sp = df_poi_sp[df_poi_sp['tag_key'].isin(['tourism'])]
tourism_sp

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
1,node,21947483,,-3.702350,40.411489,2018-07-15 13:45:36+00:00,5,tourism,apartment
6,node,25913327,NH Ciudad de la Imagen,-3.788216,40.398389,2016-10-21 00:50:00+00:00,2,tourism,hotel
8,node,26860857,Alpha,-3.688983,40.317295,2019-03-06 08:15:58+00:00,3,tourism,camp_site
11,node,26860858,Camping l'Alqueria,-0.163515,38.985976,2016-08-28 19:53:29+00:00,2,tourism,camp_site
13,node,26860866,Arco Iris,-3.907500,40.381111,2012-07-15 17:30:43+00:00,2,tourism,camp_site
...,...,...,...,...,...,...,...,...,...
112231,area,1518890438,Hotel Rural Robles,-5.662785,40.133844,2019-12-29 21:43:53+00:00,10,tourism,hotel
112242,area,1518890482,Hotel Ruta Imperial,-5.666122,40.127340,2019-12-29 21:43:53+00:00,7,tourism,hotel
112248,area,1518890772,Cabañas Campus Zuasti,-1.747392,42.854190,2019-12-29 21:46:48+00:00,2,tourism,camp_site
112250,area,1518937062,Parador de Verín,-7.446803,41.943662,2019-12-30 00:22:15+00:00,3,tourism,hotel


As we can see, the size of the dataset has been reduced considerably.

### **Canary Islands**

We are going to follow the same procedure with the data from the Canary Islands.

In [None]:
# OSM markers Canary Islands: extract the elements whose 'tourism' tag has one of the values of interest
poi_handler_ci = POIHandler(custom_filter={'tourism':['alpine_hut', 'apartment', 'camp_pitch', 'camp_site', 'caravan_site', 'chalet',
                                          'guest_house', 'hostel', 'hotel', 'motel', 'wilderness_hut']})
poi_handler_ci.apply_file(data_dir_geofabrik + 'Canary_Islands/2019/canary-islands-200101.osm.pbf')

In [None]:
colnames = ['type', 'id', 'name', 'lon', 'lat', 'timestamp','n_tags', 'tag_key',
            'tag_value']
df_poi_ci = pd.DataFrame(poi_handler_ci.osm_data, columns=colnames)
df_poi_ci.head()

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
0,node,33104770,allsun Esquinzo Beach Hotel,-14.308186,28.073667,2017-07-20 14:09:17+00:00,3,name,allsun Esquinzo Beach Hotel
1,node,33104770,allsun Esquinzo Beach Hotel,-14.308186,28.073667,2017-07-20 14:09:17+00:00,3,tourism,hotel
2,node,33104770,allsun Esquinzo Beach Hotel,-14.308186,28.073667,2017-07-20 14:09:17+00:00,3,operator,alltours
3,node,33105081,Aldiana,-14.316887,28.060005,2011-12-13 08:50:05+00:00,2,name,Aldiana
4,node,33105081,Aldiana,-14.316887,28.060005,2011-12-13 08:50:05+00:00,2,tourism,hotel


In [None]:
df_poi_ci.shape

(9194, 9)

In [None]:
tourism_ci = df_poi_ci[df_poi_ci['tag_key'].isin(['tourism'])]
tourism_ci

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
1,node,33104770,allsun Esquinzo Beach Hotel,-14.308186,28.073667,2017-07-20 14:09:17+00:00,3,tourism,hotel
4,node,33105081,Aldiana,-14.316887,28.060005,2011-12-13 08:50:05+00:00,2,tourism,hotel
7,node,33105367,Iberostar Palace Fuerteventura,-14.320375,28.056673,2012-02-14 11:35:21+00:00,10,tourism,hotel
17,node,33105714,disused hotel (ex: Dunas Club Calela del Sol),-14.288661,28.092809,2011-11-16 08:10:54+00:00,3,tourism,hotel
20,node,33106383,Iberostar Playa Gaviotas,-14.319408,28.056813,2018-03-04 10:19:50+00:00,11,tourism,hotel
...,...,...,...,...,...,...,...,...,...
9170,area,20195871,Hotel Radisson Blu Resort,-15.690884,27.772012,2019-10-01 20:56:43+00:00,3,tourism,hotel
9173,area,1486737346,H10 Atlantic Sunset,-16.774649,28.119212,2019-11-13 09:06:22+00:00,6,tourism,hotel
9180,area,1486752956,Hard Rock Hotel Tenerife,-16.775028,28.121134,2019-12-21 14:00:05+00:00,7,tourism,hotel
9187,area,1495855946,Albergue El Pinar,-17.938215,28.701704,2019-11-21 22:13:13+00:00,4,tourism,hostel


Again, we keep only the **tourism** tag so that each element is counted once.

## 2. Prepare Data

To simplify the study, we now join the two datasets so that the tourist establishments from the Spain extract and from the Canary Islands extract are in a single dataset.

In [None]:
# Concat datasets
tourism = pd.concat([tourism_sp, tourism_ci], ignore_index=True)
tourism

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
0,node,21947483,,-3.702350,40.411489,2018-07-15 13:45:36+00:00,5,tourism,apartment
1,node,25913327,NH Ciudad de la Imagen,-3.788216,40.398389,2016-10-21 00:50:00+00:00,2,tourism,hotel
2,node,26860857,Alpha,-3.688983,40.317295,2019-03-06 08:15:58+00:00,3,tourism,camp_site
3,node,26860858,Camping l'Alqueria,-0.163515,38.985976,2016-08-28 19:53:29+00:00,2,tourism,camp_site
4,node,26860866,Arco Iris,-3.907500,40.381111,2012-07-15 17:30:43+00:00,2,tourism,camp_site
...,...,...,...,...,...,...,...,...,...
23057,area,20195871,Hotel Radisson Blu Resort,-15.690884,27.772012,2019-10-01 20:56:43+00:00,3,tourism,hotel
23058,area,1486737346,H10 Atlantic Sunset,-16.774649,28.119212,2019-11-13 09:06:22+00:00,6,tourism,hotel
23059,area,1486752956,Hard Rock Hotel Tenerife,-16.775028,28.121134,2019-12-21 14:00:05+00:00,7,tourism,hotel
23060,area,1495855946,Albergue El Pinar,-17.938215,28.701704,2019-11-21 22:13:13+00:00,4,tourism,hostel


Next, to reduce the risk of duplicates, we drop the records that share the same name, keeping the first occurrence. Records with an empty name are all kept, since an empty value says nothing about whether two records refer to the same establishment.
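
The rule can be tried on a handful of made-up records first. Everything in this sketch (ids, names and values) is hypothetical and only serves to show which rows survive.

```python
import pandas as pd

# Hypothetical records: two share a name, two have no name at all
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Hotel Sol', 'Hotel Sol', 'Camping Mar', '', ''],
    'tag_value': ['hotel', 'hotel', 'camp_site', 'hostel', 'chalet'],
})

# Named records: keep the first occurrence of each name
named = toy[toy['name'] != ''].drop_duplicates(subset='name', keep='first')

# Unnamed records: keep them all
unnamed = toy[toy['name'] == '']

cleaned = pd.concat([named, unnamed])
print(cleaned['id'].tolist())  # [1, 3, 4, 5]
```

One limitation to keep in mind is that a name-based rule cannot tell a true duplicate from two different establishments that happen to share a name, so both cases are collapsed into a single record.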

In [None]:
# Eliminate establishments that may be duplicated by name except when it is an empty value

# Filter rows where 'name' is not empty and remove duplicates, keeping the first occurrence
tourism_no_empty = tourism[tourism['name'] != ''].drop_duplicates(subset='name', keep='first')

# Filter rows where 'name' is empty
tourism_empty = tourism[tourism['name'] == '']

# Concatenate the DataFrames
tourism_cleaned = pd.concat([tourism_no_empty, tourism_empty])

tourism_cleaned

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value
1,node,25913327,NH Ciudad de la Imagen,-3.788216,40.398389,2016-10-21 00:50:00+00:00,2,tourism,hotel
2,node,26860857,Alpha,-3.688983,40.317295,2019-03-06 08:15:58+00:00,3,tourism,camp_site
3,node,26860858,Camping l'Alqueria,-0.163515,38.985976,2016-08-28 19:53:29+00:00,2,tourism,camp_site
4,node,26860866,Arco Iris,-3.907500,40.381111,2012-07-15 17:30:43+00:00,2,tourism,camp_site
5,node,26860889,Càmping Begur,3.200000,41.940278,2015-06-26 15:18:06+00:00,2,tourism,camp_site
...,...,...,...,...,...,...,...,...,...
23019,area,1228903226,,-15.705435,27.787274,2018-08-06 11:00:10+00:00,2,tourism,hotel
23020,area,1248811338,,-16.755689,28.319000,2018-09-10 13:13:12+00:00,1,tourism,caravan_site
23022,area,1250033538,,-16.500868,28.355481,2018-09-12 08:19:45+00:00,1,tourism,camp_site
23024,area,1250730576,,-17.975234,27.809832,2018-09-13 19:37:45+00:00,2,tourism,chalet


Before counting the tourist establishments, one last step remains: building the *geometry* column of each establishment from its longitude and latitude values.

In [None]:
# Build a point geometry for each establishment from its lon/lat columns
tourism_cleaned = gpd.GeoDataFrame(
    tourism_cleaned,
    geometry=gpd.points_from_xy(tourism_cleaned['lon'], tourism_cleaned['lat']),
)
tourism_cleaned

Unnamed: 0,type,id,name,lon,lat,timestamp,n_tags,tag_key,tag_value,geometry
1,node,25913327,NH Ciudad de la Imagen,-3.788216,40.398389,2016-10-21 00:50:00+00:00,2,tourism,hotel,POINT (-3.78822 40.39839)
2,node,26860857,Alpha,-3.688983,40.317295,2019-03-06 08:15:58+00:00,3,tourism,camp_site,POINT (-3.68898 40.31730)
3,node,26860858,Camping l'Alqueria,-0.163515,38.985976,2016-08-28 19:53:29+00:00,2,tourism,camp_site,POINT (-0.16352 38.98598)
4,node,26860866,Arco Iris,-3.907500,40.381111,2012-07-15 17:30:43+00:00,2,tourism,camp_site,POINT (-3.90750 40.38111)
5,node,26860889,Càmping Begur,3.200000,41.940278,2015-06-26 15:18:06+00:00,2,tourism,camp_site,POINT (3.20000 41.94028)
...,...,...,...,...,...,...,...,...,...,...
23019,area,1228903226,,-15.705435,27.787274,2018-08-06 11:00:10+00:00,2,tourism,hotel,POINT (-15.70544 27.78727)
23020,area,1248811338,,-16.755689,28.319000,2018-09-10 13:13:12+00:00,1,tourism,caravan_site,POINT (-16.75569 28.31900)
23022,area,1250033538,,-16.500868,28.355481,2018-09-12 08:19:45+00:00,1,tourism,camp_site,POINT (-16.50087 28.35548)
23024,area,1250730576,,-17.975234,27.809832,2018-09-13 19:37:45+00:00,2,tourism,chalet,POINT (-17.97523 27.80983)


In [None]:
# Select columns of interest
tourism_cleaned = tourism_cleaned[['id', 'geometry']]
tourism_cleaned

Unnamed: 0,id,geometry
1,25913327,POINT (-3.78822 40.39839)
2,26860857,POINT (-3.68898 40.31730)
3,26860858,POINT (-0.16352 38.98598)
4,26860866,POINT (-3.90750 40.38111)
5,26860889,POINT (3.20000 41.94028)
...,...,...
23019,1228903226,POINT (-15.70544 27.78727)
23020,1248811338,POINT (-16.75569 28.31900)
23022,1250033538,POINT (-16.50087 28.35548)
23024,1250730576,POINT (-17.97523 27.80983)


Once this is done, we can represent our points on a map to check in which province each of them is located. That is, we will check if the geometry of each point falls within the geometry of a particular province.
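
The containment test behind this step can be illustrated without any GIS library. The sketch below applies the classic ray-casting rule to two hypothetical square "provinces". The function name, the shapes and the points are assumptions made for illustration; this is not how GeoPandas implements `within`, and points lying exactly on a boundary are not treated specially here.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        # Does a horizontal ray going right from the point cross this edge?
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


# Two hypothetical adjacent provinces and three hypothetical establishments
provinces = {
    'Province A': [(0, 0), (2, 0), (2, 2), (0, 2)],
    'Province B': [(2, 0), (4, 0), (4, 2), (2, 2)],
}
points = {101: (1, 1), 102: (3, 1), 103: (5, 1)}

# One boolean per province for every point
membership = {
    point_id: {name: point_in_polygon(x, y, ring) for name, ring in provinces.items()}
    for point_id, (x, y) in points.items()
}
print(membership[101])  # {'Province A': True, 'Province B': False}
print(membership[103])  # {'Province A': False, 'Province B': False}
```

A point that lies outside every polygon, such as the third one, ends up with `False` everywhere and is therefore not counted in any province.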

## 3. Represent and Count Data

In this section, we count how many tourist establishments each Spanish province had before the pandemic. As in previous studies, we load the geometry of the Spanish provinces (the geometries of Santa Cruz de Tenerife and Las Palmas are added afterwards) and then plot and count the tourist spots per province. First, let's plot all the tourist points identified in the previous section on a map of Spain.

In [None]:
fig = px.scatter_mapbox(
    tourism_cleaned,
    lat=tourism_cleaned.geometry.y,
    lon=tourism_cleaned.geometry.x,
    mapbox_style="open-street-map",
    zoom=5,
    center={"lat": 40.0, "lon": -3.5},
    title="Tourist Establishments in Spain"
)

fig.update_traces(marker=dict(size=15))

fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})
fig.show()

As shown, the points are distributed across the entire country. Due to the large number of tourist establishments, it is necessary to zoom in to distinguish specific areas.

In [None]:
spanish_provinces = gpd.read_file(data_dir_geom + 'spanish_provinces.geojson')
spanish_provinces

Unnamed: 0,rotulo,population,time_light_sun,geometry
0,Melilla,1173039.0,12,"POLYGON ((-2.93446 35.27442, -2.92929 35.27324..."
1,Ceuta,1884441.0,5,"POLYGON ((-5.33273 35.89270, -5.31086 35.88727..."
2,Cádiz,505665.5,11,"POLYGON ((-5.14234 37.00387, -5.11510 36.98897..."
3,Málaga,1077906.0,10,"POLYGON ((-4.32759 37.18435, -4.32333 37.18075..."
4,Almería,1849180.0,12,"POLYGON ((-2.18494 37.90366, -2.17201 37.88872..."
5,Granada,1589546.0,6,"POLYGON ((-2.34163 38.02603, -2.32309 38.01401..."
6,Sevilla,1696450.0,10,"POLYGON ((-5.72320 38.19244, -5.70785 38.18830..."
7,Huelva,1109913.0,8,"POLYGON ((-6.18021 37.94103, -6.14518 37.92050..."
8,Jaén,1827838.0,12,"POLYGON ((-2.55129 38.08412, -2.57299 38.07785..."
9,Córdoba,1714056.0,5,"POLYGON ((-5.00801 38.71523, -4.87696 38.68610..."


In [None]:
# Drop some columns
spanish_provinces = spanish_provinces.drop(['population', 'time_light_sun'], axis=1)
spanish_provinces

Unnamed: 0,rotulo,geometry
0,Melilla,"POLYGON ((-2.93446 35.27442, -2.92929 35.27324..."
1,Ceuta,"POLYGON ((-5.33273 35.89270, -5.31086 35.88727..."
2,Cádiz,"POLYGON ((-5.14234 37.00387, -5.11510 36.98897..."
3,Málaga,"POLYGON ((-4.32759 37.18435, -4.32333 37.18075..."
4,Almería,"POLYGON ((-2.18494 37.90366, -2.17201 37.88872..."
5,Granada,"POLYGON ((-2.34163 38.02603, -2.32309 38.01401..."
6,Sevilla,"POLYGON ((-5.72320 38.19244, -5.70785 38.18830..."
7,Huelva,"POLYGON ((-6.18021 37.94103, -6.14518 37.92050..."
8,Jaén,"POLYGON ((-2.55129 38.08412, -2.57299 38.07785..."
9,Córdoba,"POLYGON ((-5.00801 38.71523, -4.87696 38.68610..."


In [None]:
# Get geometry of Santa Cruz de Tenerife and Las Palmas
canary_islands = gpd.read_file(data_dir_geom + 'recintos_provinciales_inspire_canarias_wgs84.shp')
canary_islands

Unnamed: 0,INSPIREID,COUNTRY,NATLEV,NATLEVNAME,NATCODE,NAMEUNIT,CODNUT1,CODNUT2,CODNUT3,geometry
0,ES.IGN.BDDAE.34053500000,ES,https://inspire.ec.europa.eu/codelist/Administ...,Provincia,34053500000,Las Palmas,ES7,ES70,,"MULTIPOLYGON (((-15.69749 27.77109, -15.69750 ..."
1,ES.IGN.BDDAE.34053800000,ES,https://inspire.ec.europa.eu/codelist/Administ...,Provincia,34053800000,Santa Cruz de Tenerife,ES7,ES70,,"MULTIPOLYGON (((-18.00161 27.64707, -18.00158 ..."


In [None]:
# Define new rows to add
row1 = pd.DataFrame([[canary_islands.loc[0, 'NAMEUNIT'], canary_islands.loc[0, 'geometry']]], columns=spanish_provinces.columns)
row2 = pd.DataFrame([[canary_islands.loc[1, 'NAMEUNIT'], canary_islands.loc[1, 'geometry']]], columns=spanish_provinces.columns)

In [None]:
# Add new rows with geometry
spanish_provinces = pd.concat([spanish_provinces, row1, row2], ignore_index=True)
spanish_provinces

Unnamed: 0,rotulo,geometry
0,Melilla,"POLYGON ((-2.93446 35.27442, -2.92929 35.27324..."
1,Ceuta,"POLYGON ((-5.33273 35.89270, -5.31086 35.88727..."
2,Cádiz,"POLYGON ((-5.14234 37.00387, -5.11510 36.98897..."
3,Málaga,"POLYGON ((-4.32759 37.18435, -4.32333 37.18075..."
4,Almería,"POLYGON ((-2.18494 37.90366, -2.17201 37.88872..."
5,Granada,"POLYGON ((-2.34163 38.02603, -2.32309 38.01401..."
6,Sevilla,"POLYGON ((-5.72320 38.19244, -5.70785 38.18830..."
7,Huelva,"POLYGON ((-6.18021 37.94103, -6.14518 37.92050..."
8,Jaén,"POLYGON ((-2.55129 38.08412, -2.57299 38.07785..."
9,Córdoba,"POLYGON ((-5.00801 38.71523, -4.87696 38.68610..."


In [None]:
# Map each province name to its geometry
poly_dict = dict(zip(spanish_provinces['rotulo'], spanish_provinces['geometry']))

polygons = gpd.GeoSeries(poly_dict)
polygons

Melilla                   POLYGON ((-2.93446 35.27442, -2.92929 35.27324...
Ceuta                     POLYGON ((-5.33273 35.89270, -5.31086 35.88727...
Cádiz                     POLYGON ((-5.14234 37.00387, -5.11510 36.98897...
Málaga                    POLYGON ((-4.32759 37.18435, -4.32333 37.18075...
Almería                   POLYGON ((-2.18494 37.90366, -2.17201 37.88872...
Granada                   POLYGON ((-2.34163 38.02603, -2.32309 38.01401...
Sevilla                   POLYGON ((-5.72320 38.19244, -5.70785 38.18830...
Huelva                    POLYGON ((-6.18021 37.94103, -6.14518 37.92050...
Jaén                      POLYGON ((-2.55129 38.08412, -2.57299 38.07785...
Córdoba                   POLYGON ((-5.00801 38.71523, -4.87696 38.68610...
Castelló/Castellón        POLYGON ((-0.07991 40.73335, -0.06476 40.72742...
Murcia                    POLYGON ((-1.11971 38.73774, -1.11563 38.71141...
Illes Balears             MULTIPOLYGON (((3.19286 39.93983, 3.15416 39.9...
Alacant/Alic

In [None]:
# Index the points by OSM id and declare WGS84 coordinates (EPSG:4326)
points_tourism = gpd.GeoDataFrame(index=tourism_cleaned['id'], crs='epsg:4326', geometry=list(tourism_cleaned['geometry']))

In [None]:
# Check in which province each establishment of the dataset is located
estab_tourism = points_tourism.assign(**{key: points_tourism.within(geom) for key, geom in polygons.items()})
estab_tourism

Unnamed: 0_level_0,geometry,Melilla,Ceuta,Cádiz,Málaga,Almería,Granada,Sevilla,Huelva,Jaén,...,León,Cantabria,Asturias,Lugo,A Coruña,Bizkaia,Araba/Álava,Palencia,Las Palmas,Santa Cruz de Tenerife
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
25913327,POINT (-3.78822 40.39839),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
26860857,POINT (-3.68898 40.31730),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
26860858,POINT (-0.16352 38.98598),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
26860866,POINT (-3.90750 40.38111),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
26860889,POINT (3.20000 41.94028),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1228903226,POINT (-15.70544 27.78727),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1248811338,POINT (-16.75569 28.31900),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1250033538,POINT (-16.50087 28.35548),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1250730576,POINT (-17.97523 27.80983),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True


In [None]:
# Count, for each province column, how many establishments fall inside it
estab_tourism_number = pd.DataFrame()

for column in estab_tourism:
    if column != 'geometry':
        number = np.count_nonzero(estab_tourism[column])
        new_row = {'province': column, 'establishments': number}
        estab_tourism_number = pd.concat([estab_tourism_number, pd.DataFrame([new_row])], ignore_index=True)

estab_tourism_number

Unnamed: 0,province,establishments
0,Melilla,10
1,Ceuta,2
2,Cádiz,387
3,Málaga,874
4,Almería,218
5,Granada,397
6,Sevilla,324
7,Huelva,135
8,Jaén,304
9,Córdoba,174


The table above gives the number of establishments in each province in 2019.
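
The same count can also be obtained by summing the boolean columns directly, since `True` counts as 1. The sketch below uses a small hypothetical membership table (province names, ids and values are made up) and also lists the establishments that fall in no province, which are left out of every count.

```python
import pandas as pd

# Hypothetical membership table: one row per establishment,
# one boolean column per province
toy_membership = pd.DataFrame(
    {
        'Province A': [True, False, False, False],
        'Province B': [False, True, True, False],
    },
    index=pd.Index([101, 102, 103, 104], name='id'),
)

# Summing a boolean column counts its True values
toy_counts = (
    toy_membership.sum()
    .rename_axis('province')
    .reset_index(name='establishments')
)

# Establishments that are not inside any province polygon
unassigned = toy_membership[~toy_membership.any(axis=1)].index.tolist()

print(toy_counts)
print(unassigned)  # [104]
```

Comparing the sum of all counts with the number of rows is a quick way to see how many points were left unassigned, for example because they lie slightly offshore or outside the simplified province boundaries.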

In [None]:
# Save the results in an Excel file
estab_tourism_number.to_excel(data_dir_new + '2019_estab_by_province.xlsx')

With the data saved to a file for later use, we plot the counts as a bar chart to see more easily which provinces had the most tourist establishments in 2019.

In [None]:
fig = px.bar(estab_tourism_number, x='province', y='establishments')
fig.show()

As can be seen, the provinces with the greatest number of tourist establishments are Illes Balears, Barcelona, Las Palmas, Madrid, Málaga, Santa Cruz de Tenerife and Girona.
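
A ranking like this is easier to read off when the counts are sorted. As a minimal sketch, the snippet below sorts only the ten rows displayed in the table of counts above, so its result does not reproduce the national ranking.

```python
# Counts copied from the ten rows displayed in the table above
# (an excerpt, not the full result)
counts_excerpt = {
    'Melilla': 10, 'Ceuta': 2, 'Cádiz': 387, 'Málaga': 874, 'Almería': 218,
    'Granada': 397, 'Sevilla': 324, 'Huelva': 135, 'Jaén': 304, 'Córdoba': 174,
}

# Sort provinces from most to fewest establishments
ranking = sorted(counts_excerpt.items(), key=lambda item: item[1], reverse=True)
top_three = ranking[:3]

print(top_three)  # [('Málaga', 874), ('Granada', 397), ('Cádiz', 387)]
```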