# Tourism Data Cleaning

This notebook loads and cleans tourism data published by the [World Tourism Organization](https://www.unwto.org/tourism-statistics/key-tourism-statistics). The source workbook, `unwto-all-data-download_2022.xlsx`, is downloaded from that website and saved in the `raw` folder. The cleaned data is saved in the `export` folder.

Outline:

0. [Pre-requisites, loading libraries and data](#0.-Pre-requisites,-loading-libraries-and-data)
1. [Identifying the sheets](#1.-Identifying-the-sheets)
2. [Filtering the data](#2.-Filtering-the-data)
3. [Cleaning the data](#3.-Cleaning-the-data)
4. [Rename columns for consistency](#4.-Rename-columns-for-consistency)
5. [Export the cleaned data](#5.-Export-the-cleaned-data)
6. [Summary](#6.-Summary)

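The cleaned data is saved in the `export` folder as three CSV files: `df_cleaned_Inbound_Tourism_Arrivals.csv`, `df_cleaned_Inbound_Tourism_Transport.csv`, and `df_cleaned_Inbound_Tourism_Regions.csv`.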

Author: [Perry Gabriel](https://www.github.com/prgabriel)

Submitted: 2025-02-27


## 0. Pre-requisites, loading libraries and data

In [None]:
import os
import numpy as np # type: ignore
import pandas as pd # type: ignore
from IPython.display import display # type: ignore

# Load the Excel file
file_path = "../data/raw/unwto-all-data-download_2022.xlsx"
xls = pd.ExcelFile(file_path)

# List sheet names to verify
print("Available Sheets:")
print(pd.DataFrame(xls.sheet_names, columns=["Sheet Names"]))

Available Sheets:
                       Sheet Names
0                            Index
1         Inbound Tourism-Arrivals
2          Inbound Tourism-Regions
3          Inbound Tourism-Purpose
4        Inbound Tourism-Transport
5    Inbound Tourism-Accommodation
6      Inbound Tourism-Expenditure
7           Domestic Tourism-Trips
8   Domestic Tourism-Accommodation
9      Outbound Tourism-Departures
10    Outbound Tourism-Expenditure
11              Tourism Industries
12                      Employment


## 1. Identifying the sheets

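The workbook stores its data across multiple sheets. This notebook uses three of them: inbound arrivals, arrivals by mode of transport, and arrivals by region. The first two rows of each sheet are skipped when loading.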
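
Sheet names in Excel workbooks can carry stray whitespace, and the `parse` call in the next cell uses a name with a leading space. The following is a minimal sketch of a lookup that tolerates such differences. The helper `find_sheet` and the toy list of names are assumptions made for illustration; they are not part of the notebook or of pandas.

```python
def find_sheet(sheet_names, wanted):
    """Return the actual sheet name that matches `wanted`,
    ignoring surrounding whitespace and letter case.
    Hypothetical helper, not a pandas function."""
    target = wanted.strip().lower()
    for name in sheet_names:
        if name.strip().lower() == target:
            return name
    raise KeyError(f"No sheet matching {wanted!r}")


# Toy list standing in for xls.sheet_names; the leading space is deliberate.
sheet_names = ["Index", " Inbound Tourism-Arrivals", "Inbound Tourism-Regions"]

arrivals_sheet = find_sheet(sheet_names, "Inbound Tourism-Arrivals")
print(repr(arrivals_sheet))
```

The returned value is the exact name stored in the workbook, so it could be passed to `xls.parse` unchanged.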


In [2]:
# Load the relevant sheets, skipping the first two rows
df_arrivals = xls.parse(" Inbound Tourism-Arrivals", skiprows=2)
df_transport = xls.parse("Inbound Tourism-Transport", skiprows=2)
df_regions = xls.parse("Inbound Tourism-Regions", skiprows=2)

# Display first few rows of each sheet to understand structure
print("\nInbound Tourism-Arrivals Sample:")
display(df_arrivals.head())

print("\nInbound Tourism-Transport Sample:")
display(df_transport.head())

print("\nInbound Tourism-Regions Sample:")
display(df_regions.head())


Inbound Tourism-Arrivals Sample:


Unnamed: 0,C.,S.,C. & S.,Basic data and indicators,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Units,Notes,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 39
0,4.0,0.0,4-0,AFGHANISTAN,,,,,,,...,,,,,,,,,,
1,,,,,Arrivals,,,,,,...,,,,,,,,,,
2,4.0,1.1,4-1.1,,,Total arrivals,,,Thousands,,...,..,..,..,..,..,..,..,..,..,
3,4.0,1.2,4-1.2,,,,Overnights visitors (tourists),,Thousands,,...,..,..,..,..,..,..,..,..,..,
4,4.0,1.3,4-1.3,,,,Same-day visitors (excursionists),,Thousands,,...,..,..,..,..,..,..,..,..,..,



Inbound Tourism-Transport Sample:


Unnamed: 0,C.,S.,C. & S.,Basic data and indicators,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Units,Notes,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 39
0,4.0,0.0,4-0,AFGHANISTAN,,,,,,,...,,,,,,,,,,
1,,,,,Arrivals by mode of transport,,,,,,...,,,,,,,,,,
2,4.0,1.19,4-1.19,,,Total,,,Thousands,,...,..,..,..,..,..,..,..,..,..,
3,4.0,1.2,4-1.20,,,,Air,,Thousands,,...,..,..,..,..,..,..,..,..,..,
4,4.0,1.21,4-1.21,,,,Water,,Thousands,,...,..,..,..,..,..,..,..,..,..,



Inbound Tourism-Regions Sample:


Unnamed: 0,C.,S.,C. & S.,Basic data and indicators,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Units,Notes,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 39
0,4.0,0.0,4-0,AFGHANISTAN,,,,,,,...,,,,,,,,,,
1,,,,,Arrivals by region,,,,,,...,,,,,,,,,,
2,4.0,1.5,4-1.5,,,Total,,,Thousands,,...,..,..,..,..,..,..,..,..,..,
3,4.0,1.6,4-1.6,,,,Africa,,Thousands,,...,..,..,..,..,..,..,..,..,..,
4,4.0,1.7,4-1.7,,,,Americas,,Thousands,,...,..,..,..,..,..,..,..,..,..,


## 2. Filtering the data

We keep only the columns relevant to the analysis: the country and indicator label, the transport-mode or region label where the sheet has one, and the yearly values. Rows are filtered later, in section 3.

This is what each sheet looks like before any changes:

In [3]:
display(df_arrivals)

Unnamed: 0,C.,S.,C. & S.,Basic data and indicators,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Units,Notes,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 39
0,4.0,0.0,4-0,AFGHANISTAN,,,,,,,...,,,,,,,,,,
1,,,,,Arrivals,,,,,,...,,,,,,,,,,
2,4.0,1.1,4-1.1,,,Total arrivals,,,Thousands,,...,..,..,..,..,..,..,..,..,..,
3,4.0,1.2,4-1.2,,,,Overnights visitors (tourists),,Thousands,,...,..,..,..,..,..,..,..,..,..,
4,4.0,1.3,4-1.3,,,,Same-day visitors (excursionists),,Thousands,,...,..,..,..,..,..,..,..,..,..,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,,,,VF,Arrivals of non-resident visitors at national ...,,,,,,...,,,,,,,,,,
1342,,,,THS,Arrivals of non-resident tourists in hotels an...,,,,,,...,,,,,,,,,,
1343,,,,TCE,Arrivals of non-resident tourists in all types...,,,,,,...,,,,,,,,,,
1344,,,,..,Data not available,,,,,,...,,,,,,,,,,


In [4]:
display(df_transport)

Unnamed: 0,C.,S.,C. & S.,Basic data and indicators,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Units,Notes,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 39
0,4.0,0.00,4-0,AFGHANISTAN,,,,,,,...,,,,,,,,,,
1,,,,,Arrivals by mode of transport,,,,,,...,,,,,,,,,,
2,4.0,1.19,4-1.19,,,Total,,,Thousands,,...,..,..,..,..,..,..,..,..,..,
3,4.0,1.20,4-1.20,,,,Air,,Thousands,,...,..,..,..,..,..,..,..,..,..,
4,4.0,1.21,4-1.21,,,,Water,,Thousands,,...,..,..,..,..,..,..,..,..,..,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,,,,VF,Arrivals of non-resident visitors at national ...,,,,,,...,,,,,,,,,,
1342,,,,THS,Arrivals of non-resident tourists in hotels an...,,,,,,...,,,,,,,,,,
1343,,,,TCE,Arrivals of non-resident tourists in all types...,,,,,,...,,,,,,,,,,
1344,,,,..,Data not available,,,,,,...,,,,,,,,,,


In [5]:
display(df_regions)

Unnamed: 0,C.,S.,C. & S.,Basic data and indicators,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Units,Notes,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 39
0,4.0,0.0,4-0,AFGHANISTAN,,,,,,,...,,,,,,,,,,
1,,,,,Arrivals by region,,,,,,...,,,,,,,,,,
2,4.0,1.5,4-1.5,,,Total,,,Thousands,,...,..,..,..,..,..,..,..,..,..,
3,4.0,1.6,4-1.6,,,,Africa,,Thousands,,...,..,..,..,..,..,..,..,..,..,
4,4.0,1.7,4-1.7,,,,Americas,,Thousands,,...,..,..,..,..,..,..,..,..,..,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2456,,,,VF,Arrivals of non-resident visitors at national ...,,,,,,...,,,,,,,,,,
2457,,,,THS,Arrivals of non-resident tourists in hotels an...,,,,,,...,,,,,,,,,,
2458,,,,TCE,Arrivals of non-resident tourists in all types...,,,,,,...,,,,,,,,,,
2459,,,,..,Data not available,,,,,,...,,,,,,,,,,


In [6]:
# Drop the helper columns. "Unnamed: 6" is dropped only from the arrivals sheet;
# in the transport and regions sheets it holds the mode / region label.
columns_to_drop = ["C.", "S.", "C. & S.", "Unnamed: 4", "Unnamed: 5", "Unnamed: 6", "Unnamed: 7", "Unnamed: 39", "Units", "Notes", "Series"]
columns_to_drop_transport = [col for col in columns_to_drop if col != "Unnamed: 6"]
columns_to_drop_regions = list(columns_to_drop_transport)

df_arrivals.drop(columns=columns_to_drop, inplace=True)
df_transport.drop(columns=columns_to_drop_transport, inplace=True)
df_regions.drop(columns=columns_to_drop_regions, inplace=True)
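
`DataFrame.drop` raises a `KeyError` when a listed label is missing from the frame. The sketch below shows the `errors="ignore"` option on a toy frame, which can help if a future release of the workbook omits one of the helper columns. The toy data and the name `helper_columns` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame that mimics a few columns of the raw sheets
toy = pd.DataFrame({
    "C.": [4.0, 4.0],
    "Basic data and indicators": ["AFGHANISTAN", None],
    "Units": [None, "Thousands"],
    1995: ["..", 304],
})

# "Notes" is not in the toy frame; errors="ignore" skips it instead of raising
helper_columns = ["C.", "Units", "Notes"]
trimmed = toy.drop(columns=helper_columns, errors="ignore")

print(trimmed.columns.tolist())
```

Because `drop` returns a new frame here, `toy` itself keeps all of its columns.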


After dropping these columns, the data looks like this:

In [7]:
# Display the updated dataframe
display(df_arrivals)

Unnamed: 0,Basic data and indicators,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,..,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,,..,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,,..,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,VF,,,,,,,,,,...,,,,,,,,,,
1342,THS,,,,,,,,,,...,,,,,,,,,,
1343,TCE,,,,,,,,,,...,,,,,,,,,,
1344,..,,,,,,,,,,...,,,,,,,,,,


In [8]:
display(df_transport)

Unnamed: 0,Basic data and indicators,Unnamed: 6,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,,Air,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,,Water,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,VF,,,,,,,,,,...,,,,,,,,,,
1342,THS,,,,,,,,,,...,,,,,,,,,,
1343,TCE,,,,,,,,,,...,,,,,,,,,,
1344,..,,,,,,,,,,...,,,,,,,,,,


In [9]:
display(df_regions)

Unnamed: 0,Basic data and indicators,Unnamed: 6,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,,Africa,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,,Americas,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2456,VF,,,,,,,,,,...,,,,,,,,,,
2457,THS,,,,,,,,,,...,,,,,,,,,,
2458,TCE,,,,,,,,,,...,,,,,,,,,,
2459,..,,,,,,,,,,...,,,,,,,,,,


## 3. Cleaning the data

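Cleaning proceeds in four steps: forward-fill the country name so that every row carries it, keep only the indicator rows from each country's fixed-size block, convert the yearly values to numbers, and drop rows that hold no numeric data. Any remaining gaps are filled with 0.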
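
As a small illustration of two of these steps, the sketch below forward-fills a label column and coerces a year column to numbers on toy data. The frame contents are made up for the example and are not taken from the workbook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Basic data and indicators": ["ALBANIA", None, None, "ALGERIA", None],
    1995: [None, "304", "..", None, "520"],
})

# Propagate each country name down to the rows beneath it
toy["Basic data and indicators"] = toy["Basic data and indicators"].ffill()

# '..' cannot be parsed as a number, so errors="coerce" turns it into NaN
toy[1995] = pd.to_numeric(toy[1995], errors="coerce")

print(toy)
```

Assigning the result back to the column, rather than calling the method with `inplace=True` on a selected column, modifies the original frame reliably.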

In [10]:
# Forward-fill the country name down the 'Basic data and indicators' column.
# Assigning the result back avoids calling an inplace method on a column selection.
df_arrivals['Basic data and indicators'] = df_arrivals['Basic data and indicators'].ffill()
df_transport['Basic data and indicators'] = df_transport['Basic data and indicators'].ffill()
df_regions['Basic data and indicators'] = df_regions['Basic data and indicators'].ffill()

# Display the updated dataframes
display(df_arrivals)
display(df_transport)
display(df_regions)

Unnamed: 0,Basic data and indicators,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
1,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
2,AFGHANISTAN,..,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,AFGHANISTAN,..,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,AFGHANISTAN,..,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,VF,,,,,,,,,,...,,,,,,,,,,
1342,THS,,,,,,,,,,...,,,,,,,,,,
1343,TCE,,,,,,,,,,...,,,,,,,,,,
1344,..,,,,,,,,,,...,,,,,,,,,,


Unnamed: 0,Basic data and indicators,Unnamed: 6,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
1,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
2,AFGHANISTAN,,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,AFGHANISTAN,Air,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,AFGHANISTAN,Water,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,VF,,,,,,,,,,...,,,,,,,,,,
1342,THS,,,,,,,,,,...,,,,,,,,,,
1343,TCE,,,,,,,,,,...,,,,,,,,,,
1344,..,,,,,,,,,,...,,,,,,,,,,


Unnamed: 0,Basic data and indicators,Unnamed: 6,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
1,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
2,AFGHANISTAN,,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,AFGHANISTAN,Africa,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,AFGHANISTAN,Americas,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2456,VF,,,,,,,,,,...,,,,,,,,,,
2457,THS,,,,,,,,,,...,,,,,,,,,,
2458,TCE,,,,,,,,,,...,,,,,,,,,,
2459,..,,,,,,,,,,...,,,,,,,,,,


In [11]:
# Each country occupies a 6-row block; keep positions 2 and 3 of each block
# (total arrivals and overnight visitors)
df_arrivals_filtered = df_arrivals.groupby(df_arrivals.index // 6).apply(lambda x: x.iloc[2:4]).reset_index(drop=True)

# Each country occupies a 6-row block; keep positions 3 to 5 of each block (Air, Water, Land)
df_transport_filtered = df_transport.groupby(df_transport.index // 6).apply(lambda x: x.iloc[3:6]).reset_index(drop=True)

# Each country occupies an 11-row block; keep positions 3 to 9 of each block (the region rows)
df_regions_filtered = df_regions.groupby(df_regions.index // 11).apply(lambda x: x.iloc[3:10]).reset_index(drop=True)
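
The block selection above relies on every country occupying the same number of rows. The sketch below shows the same idea on toy data using the row position modulo the block size, which is an alternative to `groupby(...).apply(...)` and not the method used in this notebook. It assumes a default integer index starting at 0; the labels and block size are invented for the example.

```python
import pandas as pd

# Two blocks of 3 rows each: a header row followed by two rows to keep
toy = pd.DataFrame({
    "label": ["A", "A-total", "A-overnight", "B", "B-total", "B-overnight"],
})

block_size = 3
position_in_block = toy.index % block_size  # 0, 1, 2, 0, 1, 2

# Keep positions 1 and 2 of every block (the same rows as iloc[1:3] per block)
keep_mask = (position_in_block >= 1) & (position_in_block < 3)
kept = toy[keep_mask].reset_index(drop=True)

print(kept["label"].tolist())
```

If a block ever had a different length, both this approach and the grouped one would silently select the wrong rows, so a length check on the raw sheet is worth considering.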

In [12]:
# Display the collected rows
display(df_arrivals_filtered)

Unnamed: 0,Basic data and indicators,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,..,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
1,AFGHANISTAN,..,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
2,ALBANIA,304,287,119,184,371,317,354,470,557,...,3256,3673,4131,4736,5118,5927,6406,2658,5689,7543.8
3,ALBANIA,..,..,..,..,..,..,..,..,..,...,2857,3341,3784,4070,4643,5340,6128,2604,5515,7104.7
4,ALGERIA,520,605,635,678,749,866,901,988,1166,...,2733,2301,1710,2039,2451,2657,2371,591,125,1398
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
443,ZAMBIA,163,264,341,362,404,457,492,565,413,...,915,947,932,956,1009,1072,1266,502,554,..
444,ZIMBABWE,1416,1597,1336,2090,2250,1967,2217,2041,2256,...,1833,1880,2057,2168,2423,2580,2294,639,381,1044
445,ZIMBABWE,1363,1577,1281,1986,2101,1868,2068,..,..,...,..,..,..,..,..,..,..,..,..,..
446,TF,,,,,,,,,,...,,,,,,,,,,


In [13]:
display(df_transport_filtered)

Unnamed: 0,Basic data and indicators,Unnamed: 6,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,Air,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
1,AFGHANISTAN,Water,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
2,AFGHANISTAN,Land,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,ALBANIA,Air,45,39,32,79,86,72,91,80,...,314,337.2,401,457,577.8,691.6,783.9,269.8,764.7,1250
4,ALBANIA,Water,83,78,19,33,152,79,103,111,...,182,198,211,276,393,439.3,468.4,64.1,205.9,383.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,ZIMBABWE,Water,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
668,ZIMBABWE,Land,..,..,..,1763,1746,1504,2311,1738,...,1610,1682,1776,1929,2112.8,2242.8,1973.4,562.1,233.4,676.8
669,VF,,,,,,,,,,...,,,,,,,,,,
670,THS,,,,,,,,,,...,,,,,,,,,,


In [14]:
df_transport_filtered.rename(columns={'Unnamed: 6': 'Arrival by mode of transport'}, inplace=True)
display(df_transport_filtered)

Unnamed: 0,Basic data and indicators,Arrival by mode of transport,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,Air,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
1,AFGHANISTAN,Water,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
2,AFGHANISTAN,Land,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,ALBANIA,Air,45,39,32,79,86,72,91,80,...,314,337.2,401,457,577.8,691.6,783.9,269.8,764.7,1250
4,ALBANIA,Water,83,78,19,33,152,79,103,111,...,182,198,211,276,393,439.3,468.4,64.1,205.9,383.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,ZIMBABWE,Water,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
668,ZIMBABWE,Land,..,..,..,1763,1746,1504,2311,1738,...,1610,1682,1776,1929,2112.8,2242.8,1973.4,562.1,233.4,676.8
669,VF,,,,,,,,,,...,,,,,,,,,,
670,THS,,,,,,,,,,...,,,,,,,,,,


In [15]:
df_regions_filtered.rename(columns={'Unnamed: 6': 'Arrival by Region'}, inplace=True)
display(df_regions_filtered)

Unnamed: 0,Basic data and indicators,Arrival by Region,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,Africa,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
1,AFGHANISTAN,Americas,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
2,AFGHANISTAN,East Asia and the Pacific,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,AFGHANISTAN,Europe,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,AFGHANISTAN,Middle East,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1561,VF,,,,,,,,,,...,,,,,,,,,,
1562,THS,,,,,,,,,,...,,,,,,,,,,
1563,TCE,,,,,,,,,,...,,,,,,,,,,
1564,..,,,,,,,,,,...,,,,,,,,,,


In [16]:
# Convert the year columns to numeric, coercing non-numeric entries such as '..' to NaN
df_arrivals_filtered_numeric = df_arrivals_filtered.drop(columns=["Basic data and indicators"]).apply(pd.to_numeric, errors='coerce')

# Each country contributes two consecutive rows (total arrivals and overnight visitors);
# take the per-year maximum of each pair
df_max_values = df_arrivals_filtered_numeric.groupby(np.arange(len(df_arrivals_filtered_numeric)) // 2).max()

# Add the "Basic data and indicators" column back to the dataframe
df_max_values["Basic data and indicators"] = df_arrivals_filtered["Basic data and indicators"].iloc[::2].values

# Reorder columns to place "Basic data and indicators" at the beginning
df_max_values = df_max_values[["Basic data and indicators"] + df_max_values.columns[:-1].tolist()]

# Display the new dataframe
display(df_max_values)

Unnamed: 0,Basic data and indicators,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,,,,,,,,,,...,,,,,,,,,,
1,ALBANIA,304.0,287.0,119.0,184.0,371.0,317.0,354.0,470.0,557.0,...,3256.0,3673.0,4131.0,4736.0,5118.0,5927.0,6406.0,2658.0,5689.0,7543.8
2,ALGERIA,520.0,605.0,635.0,678.0,749.0,866.0,901.0,988.0,1166.0,...,2733.0,2301.0,1710.0,2039.0,2451.0,2657.0,2371.0,591.0,125.0,1398.0
3,AMERICAN SAMOA,34.0,35.0,26.0,36.0,41.0,44.0,36.0,,,...,49.3,51.6,47.1,38.3,42.3,51.8,58.6,0.9,,
4,ANDORRA,,,,,9422.0,10991.0,11351.0,11507.0,11601.0,...,7676.0,7797.0,7850.0,8025.0,8152.0,8328.0,8235.0,5207.0,5422.0,8426.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219,VIET NAM,1351.0,1607.0,1716.0,1520.0,1782.0,2140.0,2330.0,2628.0,2429.0,...,7572.0,7874.0,7944.0,10013.0,12922.0,15498.0,18009.0,3837.0,157.0,3661.0
220,YEMEN,61.0,74.0,80.0,88.0,58.0,73.0,76.0,98.0,155.0,...,1323.0,1218.0,398.0,,,,,,,
221,ZAMBIA,163.0,264.0,341.0,362.0,404.0,457.0,492.0,565.0,413.0,...,915.0,947.0,932.0,956.0,1009.0,1072.0,1266.0,502.0,554.0,
222,ZIMBABWE,1416.0,1597.0,1336.0,2090.0,2250.0,1967.0,2217.0,2041.0,2256.0,...,1833.0,1880.0,2057.0,2168.0,2423.0,2580.0,2294.0,639.0,381.0,1044.0
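
To make the pairwise maximum concrete, here is a minimal sketch on toy data. Each country has two consecutive rows, and the grouped `max` keeps the larger value per year while skipping NaN. The numbers are illustrative and the variable names `pair_id` and `collapsed` are assumptions for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["ALBANIA", "ALBANIA", "ZAMBIA", "ZAMBIA"],
    2021: [5689.0, 5515.0, 554.0, np.nan],
    2022: [7543.8, 7104.7, np.nan, np.nan],
})

# Rows 0-1 form pair 0, rows 2-3 form pair 1
pair_id = np.arange(len(toy)) // 2

# max() ignores NaN; a pair with no values at all stays NaN
collapsed = toy.drop(columns="Country").groupby(pair_id).max()

# Re-attach the country name taken from the first row of each pair
collapsed.insert(0, "Country", toy["Country"].iloc[::2].values)

print(collapsed)
```

This only pairs rows correctly when every country contributes exactly two rows in sequence.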


In [17]:
# Drop rows where no year column holds a numeric value.
# .copy() gives an independent DataFrame, so later changes do not touch df_max_values.
df_cleaned_Inbound_Tourism_Arrivals = df_max_values.dropna(subset=df_max_values.columns[1:], how='all').copy()

# Replace the remaining missing values with 0
df_cleaned_Inbound_Tourism_Arrivals = df_cleaned_Inbound_Tourism_Arrivals.fillna(0)

# Display the cleaned DataFrame
display(df_cleaned_Inbound_Tourism_Arrivals)


Unnamed: 0,Basic data and indicators,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
1,ALBANIA,304.0,287.0,119.0,184.0,371.0,317.0,354.0,470.0,557.0,...,3256.0,3673.0,4131.0,4736.0,5118.0,5927.0,6406.0,2658.0,5689.0,7543.8
2,ALGERIA,520.0,605.0,635.0,678.0,749.0,866.0,901.0,988.0,1166.0,...,2733.0,2301.0,1710.0,2039.0,2451.0,2657.0,2371.0,591.0,125.0,1398.0
3,AMERICAN SAMOA,34.0,35.0,26.0,36.0,41.0,44.0,36.0,0.0,0.0,...,49.3,51.6,47.1,38.3,42.3,51.8,58.6,0.9,0.0,0.0
4,ANDORRA,0.0,0.0,0.0,0.0,9422.0,10991.0,11351.0,11507.0,11601.0,...,7676.0,7797.0,7850.0,8025.0,8152.0,8328.0,8235.0,5207.0,5422.0,8426.7
5,ANGOLA,9.0,21.0,45.0,52.0,45.0,51.0,67.0,91.0,107.0,...,650.0,595.0,592.0,397.0,261.0,218.0,218.0,64.0,64.0,130.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
218,"VENEZUELA, BOLIVARIAN REPUBLIC OF",879.0,960.0,933.0,813.0,702.0,602.0,792.0,590.0,435.0,...,1085.0,967.0,882.0,681.0,429.0,0.0,0.0,0.0,0.0,0.0
219,VIET NAM,1351.0,1607.0,1716.0,1520.0,1782.0,2140.0,2330.0,2628.0,2429.0,...,7572.0,7874.0,7944.0,10013.0,12922.0,15498.0,18009.0,3837.0,157.0,3661.0
220,YEMEN,61.0,74.0,80.0,88.0,58.0,73.0,76.0,98.0,155.0,...,1323.0,1218.0,398.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
221,ZAMBIA,163.0,264.0,341.0,362.0,404.0,457.0,492.0,565.0,413.0,...,915.0,947.0,932.0,956.0,1009.0,1072.0,1266.0,502.0,554.0,0.0


In [18]:
# Convert the year columns to numeric, coercing non-numeric entries such as '..' to NaN
df_transport_filtered_numeric = df_transport_filtered.drop(columns=["Basic data and indicators", "Arrival by mode of transport"]).apply(pd.to_numeric, errors='coerce')

# Keep rows with at least one numeric value; .copy() gives an independent DataFrame
df_cleaned_Inbound_Tourism_Transport = df_transport_filtered.loc[~df_transport_filtered_numeric.isna().all(axis=1)].copy()

# Replace '..' with 0.0 in the DataFrame
df_cleaned_Inbound_Tourism_Transport = df_cleaned_Inbound_Tourism_Transport.replace('..', 0.0)

# Display the new DataFrame
display(df_cleaned_Inbound_Tourism_Transport)


Unnamed: 0,Basic data and indicators,Arrival by mode of transport,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
3,ALBANIA,Air,45.0,39.0,32.0,79.0,86.0,72.0,91.0,80.0,...,314.0,337.2,401.0,457.0,577.8,691.6,783.9,269.8,764.7,1250.0
4,ALBANIA,Water,83.0,78.0,19.0,33.0,152.0,79.0,103.0,111.0,...,182.0,198.0,211.0,276.0,393.0,439.3,468.4,64.1,205.9,383.9
5,ALBANIA,Land,176.0,170.0,68.0,72.0,133.0,166.0,160.0,279.0,...,2760.0,3137.5,3519.0,4003.0,4146.8,4795.9,5153.8,2323.9,4718.1,5909.9
6,ALGERIA,Air,346.0,370.0,343.0,0.0,369.0,382.0,403.0,456.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,ALGERIA,Water,110.0,172.0,107.0,0.0,196.0,201.0,203.0,322.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
663,ZAMBIA,Air,41.0,31.0,36.0,42.0,46.0,57.0,62.0,71.0,...,241.0,262.4,272.8,285.5,294.0,319.0,326.5,89.2,105.6,0.0
664,ZAMBIA,Water,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,8.0,4.3,3.2,7.1,7.3,4.5,6.3,2.6,6.7,0.0
665,ZAMBIA,Land,122.0,233.0,305.0,320.0,358.0,401.0,430.0,494.0,...,666.0,680.3,655.9,663.7,707.8,748.6,933.6,409.8,442.0,0.0
666,ZIMBABWE,Air,0.0,0.0,0.0,324.0,504.0,463.0,452.0,340.0,...,223.0,198.0,280.6,239.0,310.1,337.2,320.8,77.3,147.4,367.0
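
The row filter used for the transport sheet can be sketched on toy data as follows. A coerced numeric copy of the year columns decides which rows to keep, and `.copy()` makes the result independent of the source frame. Unlike the cell above, this variant leaves gaps as NaN instead of writing 0; that difference is a deliberate assumption of the example, as are the toy values and the `Mode` column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["AFGHANISTAN", "ALBANIA", "ALGERIA"],
    "Mode": ["Air", "Air", "Air"],
    1995: ["..", 45, 346],
    1998: ["..", 79, ".."],
})

year_columns = [1995, 1998]

# Numeric view of the year columns; '..' becomes NaN
numeric = toy[year_columns].apply(pd.to_numeric, errors="coerce")

# Keep rows with at least one parsed number, as an independent frame
kept = toy.loc[numeric.notna().any(axis=1)].copy()

# Store the parsed numbers in place of the raw strings
kept[year_columns] = numeric.loc[kept.index]

print(kept)
```

Keeping NaN preserves the distinction between "not reported" and a true zero, at the cost of needing NaN-aware handling downstream.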


In [19]:
display(df_regions_filtered)

Unnamed: 0,Basic data and indicators,Arrival by Region,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFGHANISTAN,Africa,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
1,AFGHANISTAN,Americas,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
2,AFGHANISTAN,East Asia and the Pacific,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3,AFGHANISTAN,Europe,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
4,AFGHANISTAN,Middle East,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1561,VF,,,,,,,,,,...,,,,,,,,,,
1562,THS,,,,,,,,,,...,,,,,,,,,,
1563,TCE,,,,,,,,,,...,,,,,,,,,,
1564,..,,,,,,,,,,...,,,,,,,,,,


In [20]:
# Convert the year columns to numeric, coercing non-numeric entries such as '..' to NaN
df_regions_filtered_numeric = df_regions_filtered.drop(columns=["Basic data and indicators", "Arrival by Region"]).apply(pd.to_numeric, errors='coerce')

# Keep rows with at least one numeric value; .copy() gives an independent DataFrame
df_cleaned_Inbound_Tourism_Regions = df_regions_filtered.loc[~df_regions_filtered_numeric.isna().all(axis=1)].copy()

# Replace '..' with 0.0 in the DataFrame
df_cleaned_Inbound_Tourism_Regions = df_cleaned_Inbound_Tourism_Regions.replace('..', 0.0)

# Display the new DataFrame
display(df_cleaned_Inbound_Tourism_Regions)


Unnamed: 0,Basic data and indicators,Arrival by Region,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
7,ALBANIA,Africa,0.0,0.0,0.0,0.0,0.2,0.2,0.1,0.1,...,1.0,1.0,3.0,2.4,2.8,3.5,24.3,1.70,3.2,4.6
8,ALBANIA,Americas,0.0,0.0,0.0,0.0,11.0,14.0,14.0,16.0,...,73.3,90.0,96.8,105.0,125.3,148.8,156.7,30.00,115.8,177.4
9,ALBANIA,East Asia and the Pacific,0.0,0.0,0.0,0.0,1.0,2.0,2.0,3.0,...,23.6,31.0,33.0,36.3,54.4,68.1,68.2,5.00,8.4,26.8
10,ALBANIA,Europe,0.0,0.0,0.0,0.0,322.0,295.0,320.0,439.0,...,2963.6,3423.7,3759.4,4490.6,4694.3,5331.6,5796.1,2616.90,5172.8,6921.7
11,ALBANIA,Middle East,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,...,4.0,2.6,3.6,4.8,5.6,7.2,11.7,1.80,37.0,41.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1555,ZIMBABWE,Americas,40.0,48.0,62.0,119.0,116.0,117.0,112.0,65.0,...,54.0,66.8,76.8,92.6,121.0,120.3,101.2,17.10,20.7,100.1
1556,ZIMBABWE,East Asia and the Pacific,34.0,57.0,56.0,79.0,101.0,81.0,103.0,66.0,...,73.0,66.6,59.0,79.4,119.7,141.9,113.4,14.20,17.1,50.1
1557,ZIMBABWE,Europe,152.0,230.0,228.0,302.0,380.0,272.0,265.0,150.0,...,131.4,143.5,153.0,141.1,222.6,237.2,190.5,37.45,70.6,177.6
1558,ZIMBABWE,Middle East,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.4,2.3,0.7,1.3,2.8,2.4,4.4,0.30,0.3,3.8


## 4. Rename columns for consistency

We rename the `Basic data and indicators` column to `Country` in all three datasets. A shared key name will make it easier to merge the datasets later on.
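
As a sketch of why a shared key name helps, the toy example below merges a per-mode table with a totals table on `Country`. The frames, the string year labels, and the suffixes are assumptions for illustration; the notebook itself does not perform this merge.

```python
import pandas as pd

arrivals = pd.DataFrame({
    "Country": ["ALBANIA", "ZAMBIA"],
    "2021": [5689.0, 554.0],
})
transport = pd.DataFrame({
    "Country": ["ALBANIA", "ALBANIA", "ZAMBIA"],
    "Arrival by mode of transport": ["Air", "Land", "Air"],
    "2021": [764.7, 4718.1, 105.6],
})

# Both frames use "Country" as the key, so no left_on/right_on mapping is needed.
# The suffixes distinguish the two overlapping "2021" columns.
merged = transport.merge(
    arrivals, on="Country", how="left", suffixes=("_by_mode", "_total")
)

print(merged)
```

With differently named key columns, the same call would need `left_on` and `right_on`, and the result would carry both key columns.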


In [21]:
# Rename the 'Basic data and indicators' column to 'Country' for consistency.
# Assigning the result back avoids modifying a possible copy in place.
df_cleaned_Inbound_Tourism_Arrivals = df_cleaned_Inbound_Tourism_Arrivals.rename(columns={'Basic data and indicators': 'Country'})
display(df_cleaned_Inbound_Tourism_Arrivals)


Unnamed: 0,Country,1995,1996,1997,1998,1999,2000,2001,2002,2003,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
1,ALBANIA,304.0,287.0,119.0,184.0,371.0,317.0,354.0,470.0,557.0,...,3256.0,3673.0,4131.0,4736.0,5118.0,5927.0,6406.0,2658.0,5689.0,7543.8
2,ALGERIA,520.0,605.0,635.0,678.0,749.0,866.0,901.0,988.0,1166.0,...,2733.0,2301.0,1710.0,2039.0,2451.0,2657.0,2371.0,591.0,125.0,1398.0
3,AMERICAN SAMOA,34.0,35.0,26.0,36.0,41.0,44.0,36.0,0.0,0.0,...,49.3,51.6,47.1,38.3,42.3,51.8,58.6,0.9,0.0,0.0
4,ANDORRA,0.0,0.0,0.0,0.0,9422.0,10991.0,11351.0,11507.0,11601.0,...,7676.0,7797.0,7850.0,8025.0,8152.0,8328.0,8235.0,5207.0,5422.0,8426.7
5,ANGOLA,9.0,21.0,45.0,52.0,45.0,51.0,67.0,91.0,107.0,...,650.0,595.0,592.0,397.0,261.0,218.0,218.0,64.0,64.0,130.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
218,"VENEZUELA, BOLIVARIAN REPUBLIC OF",879.0,960.0,933.0,813.0,702.0,602.0,792.0,590.0,435.0,...,1085.0,967.0,882.0,681.0,429.0,0.0,0.0,0.0,0.0,0.0
219,VIET NAM,1351.0,1607.0,1716.0,1520.0,1782.0,2140.0,2330.0,2628.0,2429.0,...,7572.0,7874.0,7944.0,10013.0,12922.0,15498.0,18009.0,3837.0,157.0,3661.0
220,YEMEN,61.0,74.0,80.0,88.0,58.0,73.0,76.0,98.0,155.0,...,1323.0,1218.0,398.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
221,ZAMBIA,163.0,264.0,341.0,362.0,404.0,457.0,492.0,565.0,413.0,...,915.0,947.0,932.0,956.0,1009.0,1072.0,1266.0,502.0,554.0,0.0


In [22]:
# Rename the 'Basic data and indicators' column to 'Country' for consistency.
# Assigning the result back avoids modifying a possible copy in place.
df_cleaned_Inbound_Tourism_Transport = df_cleaned_Inbound_Tourism_Transport.rename(columns={'Basic data and indicators': 'Country'})
display(df_cleaned_Inbound_Tourism_Transport)


Unnamed: 0,Country,Arrival by mode of transport,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
3,ALBANIA,Air,45.0,39.0,32.0,79.0,86.0,72.0,91.0,80.0,...,314.0,337.2,401.0,457.0,577.8,691.6,783.9,269.8,764.7,1250.0
4,ALBANIA,Water,83.0,78.0,19.0,33.0,152.0,79.0,103.0,111.0,...,182.0,198.0,211.0,276.0,393.0,439.3,468.4,64.1,205.9,383.9
5,ALBANIA,Land,176.0,170.0,68.0,72.0,133.0,166.0,160.0,279.0,...,2760.0,3137.5,3519.0,4003.0,4146.8,4795.9,5153.8,2323.9,4718.1,5909.9
6,ALGERIA,Air,346.0,370.0,343.0,0.0,369.0,382.0,403.0,456.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,ALGERIA,Water,110.0,172.0,107.0,0.0,196.0,201.0,203.0,322.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
663,ZAMBIA,Air,41.0,31.0,36.0,42.0,46.0,57.0,62.0,71.0,...,241.0,262.4,272.8,285.5,294.0,319.0,326.5,89.2,105.6,0.0
664,ZAMBIA,Water,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,8.0,4.3,3.2,7.1,7.3,4.5,6.3,2.6,6.7,0.0
665,ZAMBIA,Land,122.0,233.0,305.0,320.0,358.0,401.0,430.0,494.0,...,666.0,680.3,655.9,663.7,707.8,748.6,933.6,409.8,442.0,0.0
666,ZIMBABWE,Air,0.0,0.0,0.0,324.0,504.0,463.0,452.0,340.0,...,223.0,198.0,280.6,239.0,310.1,337.2,320.8,77.3,147.4,367.0


In [23]:
# Rename the 'Basic data and indicators' column to 'Country' for consistency.
# Assigning the result back avoids modifying a possible copy in place.
df_cleaned_Inbound_Tourism_Regions = df_cleaned_Inbound_Tourism_Regions.rename(columns={'Basic data and indicators': 'Country'})
display(df_cleaned_Inbound_Tourism_Regions)


Unnamed: 0,Country,Arrival by Region,1995,1996,1997,1998,1999,2000,2001,2002,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
7,ALBANIA,Africa,0.0,0.0,0.0,0.0,0.2,0.2,0.1,0.1,...,1.0,1.0,3.0,2.4,2.8,3.5,24.3,1.70,3.2,4.6
8,ALBANIA,Americas,0.0,0.0,0.0,0.0,11.0,14.0,14.0,16.0,...,73.3,90.0,96.8,105.0,125.3,148.8,156.7,30.00,115.8,177.4
9,ALBANIA,East Asia and the Pacific,0.0,0.0,0.0,0.0,1.0,2.0,2.0,3.0,...,23.6,31.0,33.0,36.3,54.4,68.1,68.2,5.00,8.4,26.8
10,ALBANIA,Europe,0.0,0.0,0.0,0.0,322.0,295.0,320.0,439.0,...,2963.6,3423.7,3759.4,4490.6,4694.3,5331.6,5796.1,2616.90,5172.8,6921.7
11,ALBANIA,Middle East,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,...,4.0,2.6,3.6,4.8,5.6,7.2,11.7,1.80,37.0,41.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1555,ZIMBABWE,Americas,40.0,48.0,62.0,119.0,116.0,117.0,112.0,65.0,...,54.0,66.8,76.8,92.6,121.0,120.3,101.2,17.10,20.7,100.1
1556,ZIMBABWE,East Asia and the Pacific,34.0,57.0,56.0,79.0,101.0,81.0,103.0,66.0,...,73.0,66.6,59.0,79.4,119.7,141.9,113.4,14.20,17.1,50.1
1557,ZIMBABWE,Europe,152.0,230.0,228.0,302.0,380.0,272.0,265.0,150.0,...,131.4,143.5,153.0,141.1,222.6,237.2,190.5,37.45,70.6,177.6
1558,ZIMBABWE,Middle East,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.4,2.3,0.7,1.3,2.8,2.4,4.4,0.30,0.3,3.8



## 5. Export the cleaned data

The three cleaned DataFrames are written to the `export` folder as CSV files, one file per DataFrame.


In [24]:
# Define the export directory and file paths
export_dir = "../data/export"
export_file_paths = {
    "df_cleaned_Inbound_Tourism_Regions": os.path.join(export_dir, "df_cleaned_Inbound_Tourism_Regions.csv"),
    "df_cleaned_Inbound_Tourism_Transport": os.path.join(export_dir, "df_cleaned_Inbound_Tourism_Transport.csv"),
    "df_cleaned_Inbound_Tourism_Arrivals": os.path.join(export_dir, "df_cleaned_Inbound_Tourism_Arrivals.csv")
}

# Create the directory if it doesn't exist
os.makedirs(export_dir, exist_ok=True)

# Export the DataFrames to CSV files
df_cleaned_Inbound_Tourism_Regions.to_csv(export_file_paths["df_cleaned_Inbound_Tourism_Regions"], index=False)
df_cleaned_Inbound_Tourism_Transport.to_csv(export_file_paths["df_cleaned_Inbound_Tourism_Transport"], index=False)
df_cleaned_Inbound_Tourism_Arrivals.to_csv(export_file_paths["df_cleaned_Inbound_Tourism_Arrivals"], index=False)

print(f"DataFrames exported to {export_dir}")

DataFrames exported to ../data/export
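
A quick way to check an export is to read the file back and compare it with the frame that was written. The sketch below does this for a toy frame in a temporary directory; the file name `toy_export.csv` and the data are assumptions for the example and do not touch the real `export` folder.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Country": ["ALBANIA", "ZAMBIA"],
    "2021": [5689.0, 554.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_export.csv")
    toy.to_csv(path, index=False)   # same options as the export cell
    reloaded = pd.read_csv(path)    # read while the temp directory still exists

# Compare the contents after the round trip
print(reloaded["Country"].tolist() == toy["Country"].tolist())
print(reloaded["2021"].tolist() == toy["2021"].tolist())
```

Note that column labels always come back from a CSV as strings, so integer year labels in a DataFrame will be read back as text labels such as `"2021"`.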


## 6. Summary

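The three cleaned datasets (arrivals, arrivals by mode of transport, and arrivals by region) are saved as CSV files in the `export` folder and are ready for analysis. Because missing values were replaced with 0 during cleaning, a 0 in the exported files may mean "not reported" rather than no arrivals.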