<a href="https://colab.research.google.com/github/sedcakmak/Airline-Passenger-Satisfaction-Data-Analysis/blob/main/Airline_Passenger_Satisfaction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ✈️ **INTRODUCTION**

---

This project uses the **Airline Passenger Satisfaction** dataset. The original dataset is available on Kaggle [here](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data).

## 📊 **Dataset Selection & Description**

I chose this dataset because its structure is clear and easy to understand. The column names are familiar (age, gender, flight distance, and so on), and the goal, finding out whether passengers were satisfied or not, is straightforward. It felt like a good starting point for learning data analysis without needing advanced cleaning or domain knowledge.

## 📋 **Project Objectives**
* Dataset Selection and Setup

* Statistical Summary

* Missing Data Analysis

* Outlier Detection

* Visualization



## 📥 **Dataset Selection and Setup**


In [291]:
import kagglehub
path = kagglehub.dataset_download("teejmahal20/airline-passenger-satisfaction")
print("Path to dataset files:", path)

import pandas as pd
train = pd.read_csv(f'{path}/train.csv')
test = pd.read_csv(f'{path}/test.csv')

df = pd.concat([train, test], ignore_index=True)

Path to dataset files: /kaggle/input/airline-passenger-satisfaction
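
As a quick check on what `pd.concat(..., ignore_index=True)` does, here is a minimal sketch on two tiny made-up frames. The names `toy_train` and `toy_test` and their values are illustrative assumptions, not data from the Kaggle files. It shows that the combined frame gets a fresh index, while a saved row-number column such as `Unnamed: 0` restarts at 0 for the second file.

```python
import pandas as pd

# Two tiny illustrative frames standing in for train.csv and test.csv
toy_train = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "Age": [25, 40, 61]})
toy_test = pd.DataFrame({"Unnamed: 0": [0, 1], "Age": [33, 52]})

# ignore_index=True discards the per-file indexes and builds a new RangeIndex
combined = pd.concat([toy_train, toy_test], ignore_index=True)

print(combined.index.tolist())          # fresh index for the combined frame
print(combined["Unnamed: 0"].tolist())  # saved row numbers restart in the second file
```

Because the saved row numbers repeat after concatenation, they do not identify rows in the combined frame.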



## 📊 **Statistical Summary**


In [292]:
print("="*13)
print("DATASET INFO:")
print("="*13)
df.info()

print()

print("="*20)
print("STATISTICAL SUMMARY:")
print("="*20)
df.describe()

DATASET INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         129880 non-null  int64  
 1   id                                 129880 non-null  int64  
 2   Gender                             129880 non-null  object 
 3   Customer Type                      129880 non-null  object 
 4   Age                                129880 non-null  int64  
 5   Type of Travel                     129880 non-null  object 
 6   Class                              129880 non-null  object 
 7   Flight Distance                    129880 non-null  int64  
 8   Inflight wifi service              129880 non-null  int64  
 9   Departure/Arrival time convenient  129880 non-null  int64  
 10  Ease of Online booking             129880 non-null  int64  
 11  Gate location            

| | Unnamed: 0 | id | Age | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129880.0 | 129487.0 |
| mean | 44158.7 | 64940.5 | 39.427957 | 1190.316392 | 2.728696 | 3.057599 | 2.756876 | 2.976925 | 3.204774 | 3.252633 | 3.441361 | 3.358077 | 3.383023 | 3.350878 | 3.632114 | 3.306267 | 3.642193 | 3.286326 | 14.713713 | 15.091129 |
| std | 31207.377062 | 37493.270818 | 15.11936 | 997.452477 | 1.32934 | 1.526741 | 1.40174 | 1.27852 | 1.329933 | 1.350719 | 1.319289 | 1.334049 | 1.287099 | 1.316252 | 1.180025 | 1.266185 | 1.176669 | 1.313682 | 38.071126 | 38.46565 |
| min | 0.0 | 1.0 | 7.0 | 31.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 16234.75 | 32470.75 | 27.0 | 414.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 | 0.0 |
| 50% | 38963.5 | 64940.5 | 40.0 | 844.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 4.0 | 3.0 | 0.0 | 0.0 |
| 75% | 71433.25 | 97410.25 | 51.0 | 1744.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 5.0 | 4.0 | 4.0 | 4.0 | 5.0 | 4.0 | 5.0 | 4.0 | 12.0 | 13.0 |
| max | 103903.0 | 129880.0 | 85.0 | 4983.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1592.0 | 1584.0 |
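
Most rating columns in the summary have a minimum of 0, while their maximum is 5. Assuming a 0 marks a non-answer rather than a real score (an assumption on my part, not something the output confirms), it may be worth counting those entries before reading too much into the means. A minimal sketch on made-up data, where the `toy` frame and the `rating_cols` list are illustrative:

```python
import pandas as pd

# Made-up ratings; only the column names mirror the real dataset
toy = pd.DataFrame({
    "Inflight wifi service": [3, 0, 5, 2],
    "Seat comfort": [4, 4, 0, 0],
    "Age": [25, 40, 61, 33],
})
rating_cols = ["Inflight wifi service", "Seat comfort"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column
zero_counts = (toy[rating_cols] == 0).sum()
print(zero_counts.to_string())
```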


In [293]:
# Most common passenger age
print("Most common passenger age:", df['Age'].mode()[0])

# Number of passengers under 18
print("Number of passengers under 18:", df[df['Age'] < 18].shape[0])

Most common passenger age: 39
Number of passengers under 18: 9847
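
The under-18 count above comes from filtering the frame and reading its shape. An equivalent approach, sketched here on a few made-up ages (the `ages` series is illustrative), is to sum the boolean mask directly; taking its mean gives the share instead of the count.

```python
import pandas as pd

ages = pd.Series([12, 17, 18, 39, 39, 64], name="Age")  # made-up ages

is_minor = ages < 18  # boolean mask, True where the passenger is under 18
print("Count under 18:", int(is_minor.sum()))     # True counts as 1
print(f"Share under 18: {is_minor.mean():.1%}")   # mean of a mask is a proportion
print("Most common age:", ages.mode()[0])         # mode() returns a Series, so take the first entry
```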


In [294]:
# Show the value counts for each categorical column
for col in df.select_dtypes(include=['object']).columns:
    print(f"{df[col].value_counts().to_string()}\n")

Gender
Female    65899
Male      63981

Customer Type
Loyal Customer       106100
disloyal Customer     23780

Type of Travel
Business travel    89693
Personal Travel    40187

Class
Business    62160
Eco         58309
Eco Plus     9411

satisfaction
neutral or dissatisfied    73452
satisfied                  56428
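
Raw counts are easier to compare as proportions. The sketch below rebuilds the `satisfaction` counts from the output above and converts them to shares; on the full frame, `df['satisfaction'].value_counts(normalize=True)` should give the same proportions. The `satisfaction_counts` name is just for illustration.

```python
import pandas as pd

# Counts copied from the value_counts output above
satisfaction_counts = pd.Series(
    {"neutral or dissatisfied": 73452, "satisfied": 56428}
)

# Divide each count by the total to get the share of each class
shares = satisfaction_counts / satisfaction_counts.sum()
print(shares.map("{:.1%}".format).to_string())
```

The two classes are not perfectly balanced, which is worth keeping in mind when reading the plots later on.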




## 🔍 **Missing Data Analysis**


In [295]:
# Number of missing values
df.isnull().sum()

| Column | Missing values |
|---|---|
| Unnamed: 0 | 0 |
| id | 0 |
| Gender | 0 |
| Customer Type | 0 |
| Age | 0 |
| Type of Travel | 0 |
| Class | 0 |
| Flight Distance | 0 |
| Inflight wifi service | 0 |
| Departure/Arrival time convenient | 0 |


In [296]:
# Percentage of missing values in the 'Arrival Delay in Minutes' column
print(f"Percentage: {df['Arrival Delay in Minutes'].isnull().mean():.2%}")

Percentage: 0.30%
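
The same percentage can be cross-checked by hand from numbers already printed in the statistical summary: the frame has 129880 rows, and `Arrival Delay in Minutes` has 129487 non-null values. A small sketch of that arithmetic (the variable names are mine):

```python
total_rows = 129880        # RangeIndex size reported by df.info()
non_null_arrival = 129487  # count for 'Arrival Delay in Minutes' in df.describe()

# Rows without an arrival delay value
missing = total_rows - non_null_arrival
print(missing)
print(f"{missing / total_rows:.2%}")
```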



### 🧹 **Handling Missing Data**


In [297]:
# 'Unnamed: 0' (a leftover row number) and 'id' are not needed for the analysis
df.drop(['Unnamed: 0', 'id'], axis=1, inplace=True)

In [298]:
# The missing values may simply mean that there was no arrival delay
# Rows with a missing 'Arrival Delay in Minutes' are dropped because they make up only 0.30% of the data
df.dropna(subset=['Arrival Delay in Minutes'], inplace=True)
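
One side effect of `dropna` is that the surviving rows keep their original index labels, which is why `df.info()` afterwards reports `129487 entries, 0 to 129879`. A minimal sketch on made-up delays (the `toy` frame is illustrative) shows the gap, along with an optional `reset_index(drop=True)` step that this notebook does not use:

```python
import numpy as np
import pandas as pd

# Made-up delays with two missing arrival values
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 12, 5, 40],
    "Arrival Delay in Minutes": [0.0, np.nan, 3.0, np.nan],
})

# Drop only the rows where the arrival delay is missing
dropped = toy.dropna(subset=["Arrival Delay in Minutes"])
print(dropped.index.tolist())  # original labels are kept, so the index has gaps

# Optional: renumber the rows from 0 after dropping
renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())
```

Leaving the gaps is harmless for the summaries here; renumbering only matters if later code relies on positional labels.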

In [299]:
df.info()  # double-check that the columns and rows were dropped

<class 'pandas.core.frame.DataFrame'>
Index: 129487 entries, 0 to 129879
Data columns (total 23 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Gender                             129487 non-null  object 
 1   Customer Type                      129487 non-null  object 
 2   Age                                129487 non-null  int64  
 3   Type of Travel                     129487 non-null  object 
 4   Class                              129487 non-null  object 
 5   Flight Distance                    129487 non-null  int64  
 6   Inflight wifi service              129487 non-null  int64  
 7   Departure/Arrival time convenient  129487 non-null  int64  
 8   Ease of Online booking             129487 non-null  int64  
 9   Gate location                      129487 non-null  int64  
 10  Food and drink                     129487 non-null  int64  
 11  Online boarding                    129487 no


## 🚨 **Outlier Detection**


In [303]:
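def detect_outliers(df, column):
    """Return the rows of df whose value in column lies outside the 1.5 * IQR limits."""
    Q1 = df[column].quantile(0.25)  # first quartile
    Q3 = df[column].quantile(0.75)  # third quartile
    IQR = Q3 - Q1                   # interquartile range
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    # Keep only the rows below the lower limit or above the upper limit
    return df[(df[column] < lower_limit) | (df[column] > upper_limit)]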

for col in ['Age', 'Flight Distance', 'Departure Delay in Minutes']:
    outliers = detect_outliers(df, col)
    print(f"Outliers in {col}:")
    print(outliers)
    print()

Outliers in Age:
Empty DataFrame
Columns: [Gender, Customer Type, Age, Type of Travel, Class, Flight Distance, Inflight wifi service, Departure/Arrival time convenient, Ease of Online booking, Gate location, Food and drink, Online boarding, Seat comfort, Inflight entertainment, On-board service, Leg room service, Baggage handling, Checkin service, Inflight service, Cleanliness, Departure Delay in Minutes, Arrival Delay in Minutes, satisfaction]
Index: []

[0 rows x 23 columns]

Outliers in Flight Distance:
        Gender   Customer Type  Age   Type of Travel     Class  \
80        Male  Loyal Customer   26  Business travel  Business   
173       Male  Loyal Customer   52  Business travel  Business   
201     Female  Loyal Customer   43  Business travel  Business   
215     Female  Loyal Customer   38  Business travel  Business   
379       Male  Loyal Customer   46  Business travel  Business   
...        ...             ...  ...              ...       ...   
129608    Male  Loyal Cust
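
To see why `Age` produces an empty result while `Flight Distance` does not, the IQR limits can be worked out by hand from the quartiles in the statistical summary. The sketch below uses a hypothetical helper, `iqr_bounds`, and quartiles copied from the `df.describe()` output; those were computed before the rows with missing arrival delays were dropped, so treat the limits as approximate.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) limits q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the df.describe() output (computed before dropping rows)
age_bounds = iqr_bounds(27.0, 51.0)
distance_bounds = iqr_bounds(414.0, 1744.0)

print("Age:", age_bounds)
print("Flight Distance:", distance_bounds)
```

With these limits, the observed ages (7 to 85) all fall inside the range, which matches the empty DataFrame above. The maximum flight distance of 4983 lies above the upper limit, so long flights get flagged.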


## 📉 **Visualization**


In [301]:
import seaborn as sns
import matplotlib.pyplot as plt
