<!-- Developed by Renan Amaral de Oliveira - https://www.linkedin.com/in/renanammaral/ -->
# <font color='#415a77'>**Assessing Turnaround Process Bottlenecks to Optimize Airport Operations**</font>
## <font color = '#415a77'> **Developed by <a href="https://www.linkedin.com/in/renanammaral/">Renan Amaral de Oliveira</a>**</font>
### <font color='#415a77'> **Identifying the Variables That Most Affect the Total Turnaround Time**</font>

<font color='#415a77'>

# **The Problem:**

Airports and airlines face a constant challenge: optimizing ground operations to keep air traffic flowing smoothly, reduce costs, and improve the passenger experience. A key performance indicator in this context is turnaround time.


### **What is Turnaround?**

Turnaround is one of the most critical processes in airport operations, as it directly affects the efficiency, punctuality, and profitability of airlines. This term refers to the time required for an aircraft to complete all ground activities between landing and the next takeoff. The faster and more efficient this process is, the lower the operational costs and the greater the airport’s capacity to handle a higher volume of flights.


### **The Importance of Turnaround:**

Optimizing turnaround reduces delays, improves punctuality, and increases operational efficiency, positively impacting the reputation of both airlines and airports. Moreover, a well-managed turnaround enables better fleet utilization, contributing to the sustainability of the aviation sector.


### **The Challenge:**

In this scenario, there is a need to develop an analytical model capable of identifying which processes may be creating operational bottlenecks and affecting the total ground time, considering various factors such as:

- Passenger disembarkation
- Baggage unloading
- Aircraft cleaning
- Restocking of supplies
- Refueling
- Maintenance and technical inspections
- Baggage and cargo loading
- Passenger boarding
- Crew change

### **Benefits of Optimizing Turnaround Time:**

- Delay reduction: With a faster and more efficient operational flow, flights depart on schedule, minimizing knock-on effects across the flight network.
- Increased operational capacity: Optimized processes allow more flights to be handled without the need for infrastructure expansion.
- Reduced operational costs: Less ground time means lower spending on fuel, staff, and airport fees.
- Improved passenger experience: Faster boarding and disembarkation lead to greater satisfaction and comfort for travelers.
- Sustainability: Less ground waiting time results in lower carbon emissions, contributing to environmental efficiency.
- Fleet optimization: With less idle time, aircraft can perform more flights per day, maximizing profitability.
- Greater predictability: With predictive analysis tools, bottlenecks can be anticipated and addressed before they impact operations.

### **Next Steps:**

To solve this problem, the following steps are necessary:

- Data collection and organization: Gather historical data on airport operations, including records of each process that affects turnaround.
- Exploratory data analysis: Identify the most important variables and their relationships with turnaround time.
- Building a predictive model: Use machine learning techniques to develop a model capable of identifying the processes that most affect turnaround time.
- Model validation: Assess the model’s accuracy using a test dataset.
- Implementation and monitoring: Integrate the model into a decision-support system and continuously monitor its performance.

By implementing a solution that identifies what drives turnaround time, airports and airlines can optimize their operations, reduce costs, and enhance customer satisfaction.

</font>
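
The model-building and validation steps listed above can be sketched on synthetic data. This is an illustration under stated assumptions, not the notebook's model: the three process names, their weights, the 20-minute baseline, and the noise level are all invented for the example, and ordinary least squares via `numpy.linalg.lstsq` stands in for whichever learner is eventually chosen.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic process durations in minutes (illustrative subset of processes).
processes = ["deboarding", "cleaning", "boarding"]
n_flights = 500
durations = rng.uniform(5, 15, size=(n_flights, len(processes)))

# Made-up ground truth: each process contributes a different share to turnaround.
true_weights = np.array([0.2, 1.0, 0.6])
turnaround = 20 + durations @ true_weights + rng.normal(0, 1.0, size=n_flights)

# Hold out 20% of the flights as a test set.
shuffled = rng.permutation(n_flights)
test_idx, train_idx = shuffled[:100], shuffled[100:]

def with_intercept(matrix: np.ndarray) -> np.ndarray:
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(matrix)), matrix])

# Fit ordinary least squares on the training flights only.
coef, *_ = np.linalg.lstsq(
    with_intercept(durations[train_idx]), turnaround[train_idx], rcond=None
)

# Validate on the held-out flights with the coefficient of determination (R^2).
predicted = with_intercept(durations[test_idx]) @ coef
residual_ss = np.sum((turnaround[test_idx] - predicted) ** 2)
total_ss = np.sum((turnaround[test_idx] - turnaround[test_idx].mean()) ** 2)
r2 = 1 - residual_ss / total_ss

# Rank processes by the absolute size of their fitted coefficient.
ranking = [processes[i] for i in np.argsort(-np.abs(coef[1:]))]
print(ranking, round(r2, 3))
```

Raw coefficients are comparable here only because every feature is a duration in minutes drawn from the same range; features on different scales would need standardizing before being ranked this way.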

# Initial Setup

## Imports

In [1]:
# installing the required packages
%pip install ydata-profiling
%pip install -q -U watermark

# importing data manipulation libraries
import pandas as pd
import numpy as np

# importing operating-system utilities
import os

# importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# importing machine learning libraries
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# importing machine learning models
from sklearn.linear_model import LinearRegression, LogisticRegression
import xgboost as xgb

# importing evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error, r2_score

# importing profiling library
from ydata_profiling import ProfileReport

# importing the warnings library and suppressing all warnings to keep the output clean
import warnings
warnings.filterwarnings('ignore')

Note: you may need to restart the kernel to use updated packages.


In [2]:
%reload_ext watermark
%watermark -a "Renan Amaral" -u -d -v -p pandas,n

Author: Renan Amaral

Last updated: 2025-07-09

Python implementation: CPython
Python version       : 3.12.7
IPython version      : 8.27.0

pandas: 2.2.2
n     : not installed



# Loading and visualizing data

In [5]:
# loading the dataset from the 'data/raw' folder
#print(os.getcwd())
df = pd.read_csv('/Users/renanoliveira/Documents/Portfolio/turnaround/data/raw/turnaround_data.csv')

In [6]:
df.head()

| | flight_number | airline | origin | destination | weather_condition | arrival_time | inblock_time | deboarding_start | deboarding_end | cleaning_start | ... | bags_unloading_end | bags_loading_start | bags_loading_end | boarding_start | boarding_end | door_close_time | offblock_time | takeoff_time | turnaround_minutes | delay_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | G32316 | Gol | REC | GRU | Clear | 2025-05-03 12:03:00 | 2025-05-03 12:06:00 | 2025-05-03 12:07:00 | 2025-05-03 12:15:00 | 2025-05-03 12:17:00 | ... | 2025-05-03 12:14:00 | 2025-05-03 12:27:00 | 2025-05-03 12:32:00 | 2025-05-03 12:34:00 | 2025-05-03 12:46:00 | 2025-05-03 12:50:00 | 2025-05-03 12:52:00 | 2025-05-03 12:58:00 | 46 | 10 |
| 1 | AZ6296 | Azul | GIG | SSA | Rain | 2025-03-10 06:00:00 | 2025-03-10 06:01:00 | 2025-03-10 06:03:00 | 2025-03-10 06:08:00 | 2025-03-10 06:11:00 | ... | 2025-03-10 06:10:00 | 2025-03-10 06:24:00 | 2025-03-10 06:31:00 | 2025-03-10 06:33:00 | 2025-03-10 06:43:00 | 2025-03-10 06:45:00 | 2025-03-10 06:47:00 | 2025-03-10 07:02:00 | 46 | 17 |
| 2 | AZ3059 | Azul | CNF | SSA | Storm | 2025-04-30 20:13:00 | 2025-04-30 20:14:00 | 2025-04-30 20:15:00 | 2025-04-30 20:22:00 | 2025-04-30 20:25:00 | ... | 2025-04-30 20:20:00 | 2025-04-30 20:32:00 | 2025-04-30 20:39:00 | 2025-04-30 20:42:00 | 2025-04-30 20:55:00 | 2025-04-30 20:57:00 | 2025-04-30 20:59:00 | 2025-04-30 21:14:00 | 45 | 16 |
| 3 | AZ9585 | Azul | GIG | CWB | Rain | 2025-06-19 18:40:00 | 2025-06-19 18:42:00 | 2025-06-19 18:43:00 | 2025-06-19 18:48:00 | 2025-06-19 18:50:00 | ... | 2025-06-19 18:51:00 | 2025-06-19 19:01:00 | 2025-06-19 19:07:00 | 2025-06-19 19:10:00 | 2025-06-19 19:21:00 | 2025-06-19 19:25:00 | 2025-06-19 19:28:00 | 2025-06-19 19:41:00 | 46 | 16 |
| 4 | G33907 | Gol | CGH | POA | Rain | 2025-05-02 16:37:00 | 2025-05-02 16:40:00 | 2025-05-02 16:42:00 | 2025-05-02 16:49:00 | 2025-05-02 16:52:00 | ... | 2025-05-02 16:48:00 | 2025-05-02 17:05:00 | 2025-05-02 17:10:00 | 2025-05-02 17:11:00 | 2025-05-02 17:24:00 | 2025-05-02 17:26:00 | 2025-05-02 17:28:00 | 2025-05-02 17:40:00 | 48 | 18 |
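
The preview suggests that `turnaround_minutes` is the gap between `inblock_time` and `offblock_time`. A minimal check of that reading, using only the five preview rows retyped by hand, so it is a sanity check rather than a validation over the full dataset:

```python
import pandas as pd

# The five preview rows from df.head(), retyped by hand (only the columns needed here).
preview = pd.DataFrame({
    "inblock_time": [
        "2025-05-03 12:06:00", "2025-03-10 06:01:00", "2025-04-30 20:14:00",
        "2025-06-19 18:42:00", "2025-05-02 16:40:00",
    ],
    "offblock_time": [
        "2025-05-03 12:52:00", "2025-03-10 06:47:00", "2025-04-30 20:59:00",
        "2025-06-19 19:28:00", "2025-05-02 17:28:00",
    ],
    "turnaround_minutes": [46, 46, 45, 46, 48],
})

# Parse the strings into timestamps and take the difference in whole minutes.
inblock = pd.to_datetime(preview["inblock_time"])
offblock = pd.to_datetime(preview["offblock_time"])
derived = ((offblock - inblock).dt.total_seconds() / 60).astype(int)

# Compare the derived values with the recorded column.
matches = bool((derived == preview["turnaround_minutes"]).all())
print(derived.tolist(), matches)
```

All five rows agree, which is consistent with turnaround being measured from in-block to off-block in this dataset. The same comparison could be run on the full `df` to confirm that it holds for every flight.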


# Exploratory Data Analysis

In [7]:
# using ydata-profiling to generate a report and saving it in the 'reports' folder
profile = ProfileReport(df, title="Turnaround Data Profiling Report", explorative=True)
profile.to_file("/Users/renanoliveira/Documents/Portfolio/turnaround/reports/turnaround_data_profiling_report.html")


In [8]:
# evaluating the column names
df.columns

Index(['flight_number', 'airline', 'origin', 'destination',
       'weather_condition', 'arrival_time', 'inblock_time', 'deboarding_start',
       'deboarding_end', 'cleaning_start', 'cleaning_end', 'refueling_start',
       'refueling_end', 'catering_start', 'catering_end',
       'bags_unloading_start', 'bags_unloading_end', 'bags_loading_start',
       'bags_loading_end', 'boarding_start', 'boarding_end', 'door_close_time',
       'offblock_time', 'takeoff_time', 'turnaround_minutes', 'delay_minutes'],
      dtype='object')
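
The column names pair most ground processes with a `_start` and an `_end` timestamp, so per-process durations can be derived mechanically. The sketch below assumes that naming pattern holds; `add_process_durations` is a hypothetical helper, not part of the notebook, and the demo values are the deboarding and cleaning timestamps of the first two flights in this dataset's previews.

```python
import pandas as pd

def add_process_durations(frame: pd.DataFrame) -> pd.DataFrame:
    """Add a `<process>_minutes` column for every `<process>_start`/`<process>_end` pair.

    Hypothetical helper: it relies on the `_start`/`_end` naming pattern.
    """
    out = frame.copy()
    for col in frame.columns:
        if not col.endswith("_start"):
            continue
        process = col[: -len("_start")]
        end_col = f"{process}_end"
        if end_col in frame.columns:
            start = pd.to_datetime(out[col])
            end = pd.to_datetime(out[end_col])
            out[f"{process}_minutes"] = (end - start).dt.total_seconds() / 60
    return out

# Two flights, two processes, with values copied from the dataset previews.
demo = pd.DataFrame({
    "deboarding_start": ["2025-05-03 12:07:00", "2025-03-10 06:03:00"],
    "deboarding_end": ["2025-05-03 12:15:00", "2025-03-10 06:08:00"],
    "cleaning_start": ["2025-05-03 12:17:00", "2025-03-10 06:11:00"],
    "cleaning_end": ["2025-05-03 12:26:00", "2025-03-10 06:21:00"],
})

durations = add_process_durations(demo)
print(durations[["deboarding_minutes", "cleaning_minutes"]])
```

On the full dataframe, the resulting `_minutes` columns could then be compared against `turnaround_minutes`, for example with `DataFrame.corrwith`, as a first look at which processes move together with total ground time.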

In [9]:
# evaluating the data types of the columns
df.dtypes

flight_number           object
airline                 object
origin                  object
destination             object
weather_condition       object
arrival_time            object
inblock_time            object
deboarding_start        object
deboarding_end          object
cleaning_start          object
cleaning_end            object
refueling_start         object
refueling_end           object
catering_start          object
catering_end            object
bags_unloading_start    object
bags_unloading_end      object
bags_loading_start      object
bags_loading_end        object
boarding_start          object
boarding_end            object
door_close_time         object
offblock_time           object
takeoff_time            object
turnaround_minutes       int64
delay_minutes            int64
dtype: object
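
Every timestamp column is stored as `object` (plain strings), so time arithmetic requires parsing them first. One possible way to do that, assuming all timestamp columns end in `_time`, `_start`, or `_end` and follow the `YYYY-MM-DD HH:MM:SS` layout seen in the preview; `parse_timestamp_columns` is a hypothetical helper name:

```python
import pandas as pd

def parse_timestamp_columns(
    frame: pd.DataFrame,
    suffixes: tuple[str, ...] = ("_time", "_start", "_end"),
) -> pd.DataFrame:
    """Return a copy with every column whose name ends in one of `suffixes` parsed to datetime."""
    out = frame.copy()
    for col in out.columns:
        if col.endswith(suffixes):
            # An explicit format fails loudly on malformed values instead of guessing.
            out[col] = pd.to_datetime(out[col], format="%Y-%m-%d %H:%M:%S")
    return out

# Small frame mirroring a few of the dataset's columns (values from the preview).
sample = pd.DataFrame({
    "airline": ["Azul", "Azul"],
    "arrival_time": ["2025-03-10 06:00:00", "2025-04-30 20:13:00"],
    "deboarding_start": ["2025-03-10 06:03:00", "2025-04-30 20:15:00"],
    "turnaround_minutes": [46, 45],
})

parsed = parse_timestamp_columns(sample)
print(parsed.dtypes)
```

Returning a copy keeps the raw strings available in the original frame, which helps when a later cell needs to inspect values that failed to parse.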

In [10]:
# evaluating the presence of null values
df.isnull().sum()

flight_number           0
airline                 0
origin                  0
destination             0
weather_condition       0
arrival_time            0
inblock_time            0
deboarding_start        0
deboarding_end          0
cleaning_start          0
cleaning_end            0
refueling_start         0
refueling_end           0
catering_start          0
catering_end            0
bags_unloading_start    0
bags_unloading_end      0
bags_loading_start      0
bags_loading_end        0
boarding_start          0
boarding_end            0
door_close_time         0
offblock_time           0
takeoff_time            0
turnaround_minutes      0
delay_minutes           0
dtype: int64

In [11]:
# dropping unnecessary columns
drop_cols = ['flight_number', 'origin', 'destination']
df.drop(columns=drop_cols, inplace=True)

In [12]:
# checking the dataframe after dropping the columns
df.head()

| | airline | weather_condition | arrival_time | inblock_time | deboarding_start | deboarding_end | cleaning_start | cleaning_end | refueling_start | refueling_end | ... | bags_unloading_end | bags_loading_start | bags_loading_end | boarding_start | boarding_end | door_close_time | offblock_time | takeoff_time | turnaround_minutes | delay_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gol | Clear | 2025-05-03 12:03:00 | 2025-05-03 12:06:00 | 2025-05-03 12:07:00 | 2025-05-03 12:15:00 | 2025-05-03 12:17:00 | 2025-05-03 12:26:00 | 2025-05-03 12:18:00 | 2025-05-03 12:26:00 | ... | 2025-05-03 12:14:00 | 2025-05-03 12:27:00 | 2025-05-03 12:32:00 | 2025-05-03 12:34:00 | 2025-05-03 12:46:00 | 2025-05-03 12:50:00 | 2025-05-03 12:52:00 | 2025-05-03 12:58:00 | 46 | 10 |
| 1 | Azul | Rain | 2025-03-10 06:00:00 | 2025-03-10 06:01:00 | 2025-03-10 06:03:00 | 2025-03-10 06:08:00 | 2025-03-10 06:11:00 | 2025-03-10 06:21:00 | 2025-03-10 06:13:00 | 2025-03-10 06:23:00 | ... | 2025-03-10 06:10:00 | 2025-03-10 06:24:00 | 2025-03-10 06:31:00 | 2025-03-10 06:33:00 | 2025-03-10 06:43:00 | 2025-03-10 06:45:00 | 2025-03-10 06:47:00 | 2025-03-10 07:02:00 | 46 | 17 |
| 2 | Azul | Storm | 2025-04-30 20:13:00 | 2025-04-30 20:14:00 | 2025-04-30 20:15:00 | 2025-04-30 20:22:00 | 2025-04-30 20:25:00 | 2025-04-30 20:33:00 | 2025-04-30 20:25:00 | 2025-04-30 20:30:00 | ... | 2025-04-30 20:20:00 | 2025-04-30 20:32:00 | 2025-04-30 20:39:00 | 2025-04-30 20:42:00 | 2025-04-30 20:55:00 | 2025-04-30 20:57:00 | 2025-04-30 20:59:00 | 2025-04-30 21:14:00 | 45 | 16 |
| 3 | Azul | Rain | 2025-06-19 18:40:00 | 2025-06-19 18:42:00 | 2025-06-19 18:43:00 | 2025-06-19 18:48:00 | 2025-06-19 18:50:00 | 2025-06-19 18:55:00 | 2025-06-19 18:52:00 | 2025-06-19 19:00:00 | ... | 2025-06-19 18:51:00 | 2025-06-19 19:01:00 | 2025-06-19 19:07:00 | 2025-06-19 19:10:00 | 2025-06-19 19:21:00 | 2025-06-19 19:25:00 | 2025-06-19 19:28:00 | 2025-06-19 19:41:00 | 46 | 16 |
| 4 | Gol | Rain | 2025-05-02 16:37:00 | 2025-05-02 16:40:00 | 2025-05-02 16:42:00 | 2025-05-02 16:49:00 | 2025-05-02 16:52:00 | 2025-05-02 17:00:00 | 2025-05-02 16:54:00 | 2025-05-02 17:04:00 | ... | 2025-05-02 16:48:00 | 2025-05-02 17:05:00 | 2025-05-02 17:10:00 | 2025-05-02 17:11:00 | 2025-05-02 17:24:00 | 2025-05-02 17:26:00 | 2025-05-02 17:28:00 | 2025-05-02 17:40:00 | 48 | 18 |


In [13]:
# evaluating the shape of the dataframe
rows, columns = df.shape
print(f"The dataframe has {rows} rows and {columns} columns.")

The dataframe has 100000 rows and 23 columns.


In [14]:
# filtering the dataframe to keep only rows where airline == 'Azul'
# (.copy() avoids SettingWithCopyWarning if columns are added later)
df = df[df['airline'] == 'Azul'].copy()

In [17]:
# describing the numeric columns of the dataframe
df.describe()

| | turnaround_minutes | delay_minutes |
|---|---|---|
| count | 33226.0 | 33226.0 |
| mean | 46.485674 | 13.499549 |
| std | 3.55393 | 4.849055 |
| min | 34.0 | 0.0 |
| 25% | 44.0 | 10.0 |
| 50% | 46.0 | 13.0 |
| 75% | 49.0 | 17.0 |
| max | 58.0 | 30.0 |
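
The quartiles in this summary are enough to work out outlier fences by hand. The sketch below applies Tukey's 1.5 × IQR rule, a common convention chosen here for illustration rather than a method used by the notebook, to the quartiles, minima, and maxima reported above.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the describe() output for the Azul subset.
turnaround_low, turnaround_high = tukey_fences(q1=44.0, q3=49.0)
delay_low, delay_high = tukey_fences(q1=10.0, q3=17.0)

# Compare the fences with the reported minimum and maximum of each column.
turnaround_has_outliers = 34.0 < turnaround_low or 58.0 > turnaround_high
delay_has_outliers = 0.0 < delay_low or 30.0 > delay_high

print(turnaround_low, turnaround_high, turnaround_has_outliers)
print(delay_low, delay_high, delay_has_outliers)
```

For `turnaround_minutes` the fences are 36.5 and 56.5, so both the minimum (34) and the maximum (58) fall outside them. For `delay_minutes` the fences are -0.5 and 27.5, so only the upper tail (maximum 30) exceeds them.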


In [18]:
# describing the object-typed columns (categorical values and timestamps still stored as strings)
df.select_dtypes(include=['object']).describe()

| | airline | weather_condition | arrival_time | inblock_time | deboarding_start | deboarding_end | cleaning_start | cleaning_end | refueling_start | refueling_end | ... | catering_end | bags_unloading_start | bags_unloading_end | bags_loading_start | bags_loading_end | boarding_start | boarding_end | door_close_time | offblock_time | takeoff_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | ... | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 | 33226 |
| unique | 1 | 5 | 31129 | 31152 | 31167 | 31212 | 31181 | 31206 | 31135 | 31174 | ... | 31153 | 31188 | 31252 | 31154 | 31194 | 31189 | 31251 | 31149 | 31180 | 31177 |
| top | Azul | Rain | 2025-05-21 12:07:00 | 2025-05-26 07:37:00 | 2025-02-04 07:29:00 | 2025-06-04 10:28:00 | 2025-03-10 18:05:00 | 2025-05-27 20:02:00 | 2025-05-03 11:11:00 | 2025-05-10 09:40:00 | ... | 2025-02-16 00:35:00 | 2025-04-02 15:27:00 | 2025-02-13 08:57:00 | 2025-04-11 22:55:00 | 2025-01-12 07:16:00 | 2025-03-15 12:12:00 | 2025-04-22 08:39:00 | 2025-03-09 07:09:00 | 2025-01-30 00:38:00 | 2025-03-11 15:00:00 |
| freq | 33226 | 6760 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | ... | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
