<a href="https://colab.research.google.com/github/sedcakmak/Airline-Passenger-Satisfaction-Data-Analysis/blob/main/Airline_Passenger_Satisfaction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ✈️ **INTRODUCTION**

---

For this project, the **Airline Passenger Satisfaction** dataset is used. You can find the original dataset on Kaggle [here](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data).

## 📊 **Dataset Selection & Description**

I chose this dataset because its structure is clear and easy to understand. The column names are familiar (age, gender, flight distance, etc.), and the goal, understanding whether passengers were satisfied or not, is straightforward. It felt like a good starting point for learning data analysis without needing advanced cleaning or domain knowledge.

## 📋 **Project Objectives**
* Dataset Selection and Setup

* Statistical Summary

* Missing Data Analysis

* Outlier Detection

* Visualization



## 📥 **Dataset Selection and Setup**


In [None]:
import kagglehub
path = kagglehub.dataset_download("teejmahal20/airline-passenger-satisfaction")
print("Path to dataset files:", path)

import pandas as pd
train = pd.read_csv(f'{path}/train.csv')
test = pd.read_csv(f'{path}/test.csv')

df = pd.concat([train, test], ignore_index=True)
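
Before stacking two files with `pd.concat`, it can help to confirm that they share the same columns and that no `id` appears twice. The sketch below uses two small made-up frames (`train_demo` and `test_demo` are hypothetical stand-ins, not the real CSV files) to show the idea.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and test.csv
train_demo = pd.DataFrame({"id": [1, 2], "Age": [25, 40]})
test_demo = pd.DataFrame({"id": [3], "Age": [31]})

# The column lists should match exactly before the rows are stacked
same_columns = list(train_demo.columns) == list(test_demo.columns)

# ignore_index=True rebuilds a clean 0..n-1 index for the combined frame
combined_demo = pd.concat([train_demo, test_demo], ignore_index=True)

# Count ids that occur more than once after combining
duplicate_ids = int(combined_demo["id"].duplicated().sum())

print("Same columns:", same_columns)
print("Combined shape:", combined_demo.shape)
print("Duplicate ids:", duplicate_ids)
```

The same three checks could be run on `train`, `test`, and `df` from the cell above.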


## 📊 **Statistical Summary**


In [None]:
print("="*13)
print("DATASET INFO:")
print("="*13)
df.info()

print()

print("="*20)
print("STATISTICAL SUMMARY:")
print("="*20)
df.describe()

In [None]:
# Shows all values in each categorical column
for col in df.select_dtypes(include=['object']).columns:
    print(f"{df[col].value_counts().to_string().upper()}\n")

In [None]:
# Most common passenger age
print("Most common passenger age:", df['Age'].mode()[0])

# Number of passengers under 18
print("Number of passengers under 18:", len(df[df['Age'] < 18]))

# Number of senior passengers (over 65)
print("Number of senior passengers:", len(df[df['Age'] > 65]))
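
The counts above can also be expressed as shares of all passengers, which makes the age groups easier to compare. This is a minimal sketch on a made-up `ages` series (hypothetical values); taking the mean of a boolean mask gives the fraction of rows where the condition holds.

```python
import pandas as pd

# Made-up ages for illustration only
ages = pd.Series([7, 16, 25, 33, 41, 52, 58, 63, 70, 80], name="Age")

# The mean of a boolean mask is the fraction of True values
under_18_pct = (ages < 18).mean() * 100
over_65_pct = (ages > 65).mean() * 100

print(f"Under 18: {under_18_pct:.1f}%")
print(f"Over 65: {over_65_pct:.1f}%")
```

On the real data, `ages` would be replaced by `df['Age']`.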


### **Cleaning Data**


In [None]:
# Unnamed and id are not needed
df.drop(['Unnamed: 0', 'id'], axis=1, inplace=True)


## 🔍 **Missing Data Analysis**


In [None]:
# Number of missing values
df.isnull().sum()


### 🧹 **Handling Missing Data**


In [None]:
# Create a boolean flag to track which values were originally missing
df['Arrival_Delay_Missing'] = df['Arrival Delay in Minutes'].isnull()

In [None]:
# Fill missing values with the median (0.0 minutes)
df.fillna({'Arrival Delay in Minutes': df['Arrival Delay in Minutes'].median()}, inplace=True)

In [None]:
print(f"Imputed {df['Arrival_Delay_Missing'].sum()} missing values")
print(f"Median used for imputation: {df['Arrival Delay in Minutes'].median()} minutes") # The median of 0.0 indicates that at least 50% of flights had no arrival delay.
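
One detail worth noting: the median printed above is computed after the fill. A small sketch on made-up delays (the `delays` frame and its values are hypothetical) shows one way to store the median before imputing, so the reported value is exactly the one that was used.

```python
import numpy as np
import pandas as pd

# Made-up delays with one missing value
delays = pd.DataFrame({"Arrival Delay in Minutes": [0.0, 0.0, 5.0, np.nan, 120.0]})

# Flag the originally missing rows before changing anything
delays["Arrival_Delay_Missing"] = delays["Arrival Delay in Minutes"].isnull()

# median() skips NaN, so this is the median of the observed values only
median_before = delays["Arrival Delay in Minutes"].median()

# Fill with the stored median and keep the flag column untouched
delays["Arrival Delay in Minutes"] = delays["Arrival Delay in Minutes"].fillna(median_before)

remaining_missing = int(delays["Arrival Delay in Minutes"].isnull().sum())
print(f"Imputed {delays['Arrival_Delay_Missing'].sum()} value(s) with median {median_before}")
print("Remaining missing values:", remaining_missing)
```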

In [None]:
df.info() # double checking


## 🚨 **Outlier Detection**


In [None]:
# Checking for outliers in the rating columns

rating_columns = [
    'Inflight wifi service', 'Departure/Arrival time convenient',
    'Ease of Online booking', 'Gate location', 'Food and drink',
    'Online boarding', 'Seat comfort', 'Inflight entertainment',
    'On-board service', 'Leg room service', 'Baggage handling',
    'Checkin service', 'Inflight service', 'Cleanliness'
]

melted = pd.DataFrame()

for col in rating_columns:
    counts = df[col].value_counts().sort_index()
    counts.name = col
    melted = pd.concat([melted, counts], axis=1)

display(melted.fillna(0).astype(int))
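
The same kind of count table can be built without a loop, and values outside the expected scale can be listed directly. The sketch below is only an illustration: the `ratings` frame is made up, and `valid_values` assumes a 1 to 5 scale, which should be adjusted if another value (such as 0) is a legitimate answer.

```python
import pandas as pd

# Made-up ratings for two of the rating columns
ratings = pd.DataFrame({
    "Seat comfort": [0, 3, 5, 5],
    "Cleanliness": [1, 3, 3, 5],
})

# One value_counts per column, aligned on the union of observed values
counts_table = (
    ratings.apply(pd.Series.value_counts)
    .fillna(0)
    .astype(int)
    .sort_index()
)

# Assumption: valid ratings run from 1 to 5
valid_values = set(range(1, 6))
unexpected = sorted(set(counts_table.index) - valid_values)

print(counts_table)
print("Values outside the assumed scale:", unexpected)
```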


In [None]:
# Using Interquartile Range (IQR) (Age, Flight Distance, Departure Delay in Minutes, Arrival Delay in Minutes)

def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    return df[(df[column] < lower_limit) | (df[column] > upper_limit)]

for col in ['Age', 'Flight Distance', 'Departure Delay in Minutes', 'Arrival Delay in Minutes']:
    outliers = detect_outliers(df, col)
    print(f"Outliers in {col.upper()}:")
    print("="*40)
    # print(outliers.shape)
    print(outliers[col].describe().to_string())
    print()
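
To see how the IQR limits behave, here is a small worked sketch on made-up numbers. The helper names `iqr_limits` and `count_outliers` are hypothetical and simply repeat the rule used in `detect_outliers`. The second series illustrates that when most values are zero, the limits can collapse to zero, so every non-zero value is flagged. Whether this applies to the delay columns depends on their actual quartiles.

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return the (lower, upper) limits of the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(series, k=1.5):
    """Count values that fall outside the IQR limits."""
    lower, upper = iqr_limits(series, k)
    return int(((series < lower) | (series > upper)).sum())

# A spread-out series with one extreme value
spread = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
# A zero-heavy series: Q1 and Q3 are both 0, so the IQR is 0
zero_heavy = pd.Series([0] * 8 + [10, 60])

spread_limits = iqr_limits(spread)
zero_limits = iqr_limits(zero_heavy)

print("spread limits:", spread_limits, "outliers:", count_outliers(spread))
print("zero-heavy limits:", zero_limits, "outliers:", count_outliers(zero_heavy))
```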

In [None]:
# Boxplot of the IQR-flagged outliers only (not the full column) - Departure Delay
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
outliers = detect_outliers(df, 'Departure Delay in Minutes')
sns.boxplot(x=outliers['Departure Delay in Minutes'])
plt.title("Boxplot for Outlier Detection - Departure Delay in Minutes", fontweight = 'bold')
plt.show()

In [None]:
# Boxplot of the IQR-flagged outliers only (not the full column) - Arrival Delay
plt.figure(figsize=(8, 6))
outliers = detect_outliers(df, 'Arrival Delay in Minutes')
sns.boxplot(x=outliers['Arrival Delay in Minutes'])
plt.title("Boxplot for Outlier Detection - Arrival Delay in Minutes", fontweight = 'bold')
plt.show()



## 📉 **Visualization**


In [None]:
# Gender Distribution
ax = df['Gender'].value_counts().plot(kind='bar', figsize=(5, 7), color=['mediumpurple', 'darkorange'])
plt.title('Gender Distribution', fontweight='bold', fontsize=16)
plt.xlabel('Gender', fontweight='bold')
plt.ylabel('Number of Customers', fontweight='bold')
plt.xticks(rotation=0, fontweight='bold')

for i, v in enumerate(df['Gender'].value_counts().values):
    ax.text(i, v + 500, str(v), ha='center', va='bottom', fontstyle='italic', fontweight='bold')

plt.show()

In [None]:
# Age Distribution
df['Age'].plot(kind='hist', bins=20, figsize=(8, 6), color='steelblue')
plt.title('Age Distribution of Passengers', fontweight='bold', fontsize=16)
plt.xlabel('Age', fontweight='bold')
plt.ylabel('Number of Passengers', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Customer Satisfaction
satisfaction_counts = df['satisfaction'].value_counts()
labels = satisfaction_counts.index
sizes = satisfaction_counts.values
colors = ['#FD4954','#01BF7F']
explode = [0.05, 0.05]

plt.figure(figsize=(6, 6))
plt.pie(
    sizes,
    labels=labels,
    autopct='%1.1f%%',
    startangle=140,
    shadow=True,
    explode=explode,
    colors=colors
)
plt.title('Customer Satisfaction', fontsize=16, fontweight='bold')
plt.axis('equal')
plt.show()

In [None]:
# Satisfaction Across Different Classes
df.groupby('Class')['satisfaction'].value_counts().unstack().plot(kind='bar', figsize=(10, 6), color = ['#FD4954', '#01BF7F'])

plt.title('Customer Satisfaction by Class', fontweight='bold', fontsize=16)
plt.xlabel('Class', fontweight='bold')
plt.ylabel('Number of Customers', fontweight='bold')
plt.legend(title='Satisfaction')
plt.xticks(rotation=45)
plt.show()
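
Because the classes have different passenger counts, raw counts can make comparison harder. One possible complement is the share of satisfied passengers within each class. The sketch below uses a made-up `demo` frame (hypothetical rows) and `pd.crosstab` with `normalize="index"`, which turns each row into proportions.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "Class": ["Business", "Business", "Business", "Eco", "Eco", "Eco", "Eco", "Eco Plus"],
    "satisfaction": [
        "satisfied", "satisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "satisfied",
        "neutral or dissatisfied",
    ],
})

# normalize="index" divides each row by its total, giving per-class shares
rates = pd.crosstab(demo["Class"], demo["satisfaction"], normalize="index") * 100

print(rates.round(1))
```

On the real data, `demo` would be replaced by `df`, and `rates.plot(kind='bar', stacked=True)` would give a percentage version of the chart above.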

In [None]:
# Average Service Ratings by Flight Class

services = [
    'Inflight wifi service', 'Departure/Arrival time convenient',
    'Ease of Online booking', 'Gate location', 'Food and drink',
    'Online boarding', 'Seat comfort', 'Inflight entertainment',
    'On-board service', 'Leg room service', 'Baggage handling',
    'Checkin service', 'Inflight service', 'Cleanliness'
]

heatmap_df = pd.DataFrame()

for class_type in ['Business', 'Eco', 'Eco Plus']:
    class_data = df[df['Class'] == class_type][services].mean()
    heatmap_df[class_type] = class_data

heatmap_df = heatmap_df.T

plt.figure(figsize=(14, 6))
sns.heatmap(heatmap_df,
            annot=True,
            fmt='.1f',
            cmap='RdYlGn',
            cbar_kws={'label': 'Rating (1-5)'})

plt.title('Average Service Ratings by Flight Class', fontweight='bold', fontsize=16)
plt.xlabel('Service', fontweight='bold')
plt.ylabel('Flight Class', fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
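
The per-class averages behind the heatmap can also be computed in one step with `groupby`, which avoids hard-coding the class names. This is a sketch on a made-up `demo` frame with only two service columns (hypothetical values), not a replacement for the cell above.

```python
import pandas as pd

# Two service columns are enough to show the idea
service_cols = ["Seat comfort", "Cleanliness"]

# Made-up ratings for illustration only
demo = pd.DataFrame({
    "Class": ["Business", "Business", "Eco", "Eco", "Eco Plus"],
    "Seat comfort": [5, 4, 3, 2, 3],
    "Cleanliness": [4, 4, 3, 3, 2],
})

# One row per class, one column per service, values are mean ratings
class_means = demo.groupby("Class")[service_cols].mean()

print(class_means)
```

The result already has classes as rows and services as columns, so it could be passed to `sns.heatmap` without transposing.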

In [None]:
# Departure Delay vs Arrival Delay (Scatter Plot)
plt.figure(figsize=(10, 6))
plt.scatter(df['Departure Delay in Minutes'], df['Arrival Delay in Minutes'], alpha=0.5, color='steelblue')
plt.title('Departure Delay vs Arrival Delay', fontweight='bold', fontsize=16)
plt.xlabel('Departure Delay (Minutes)', fontweight='bold')
plt.ylabel('Arrival Delay (Minutes)', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()
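
A scatter plot shows the direction of a relationship, but a correlation coefficient is needed to say how strong it is. The sketch below computes the Pearson correlation with `Series.corr` on made-up delay values (hypothetical numbers). The result for the real dataset is not shown here and would need to be computed from `df`.

```python
import pandas as pd

# Made-up delays in minutes for illustration only
departure = pd.Series([0, 10, 20, 30], name="Departure Delay in Minutes")
arrival = pd.Series([0, 12, 18, 33], name="Arrival Delay in Minutes")

# Pearson correlation: +1 is a perfect positive linear relationship, 0 is none
r = departure.corr(arrival)

print(f"Pearson correlation: {r:.3f}")
```

On the real data this would be `df['Departure Delay in Minutes'].corr(df['Arrival Delay in Minutes'])`.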


### **CONCLUSION** ✈️
This project analyzed the Airline Passenger Satisfaction dataset to uncover key insights into passenger demographics, service ratings, delays, and satisfaction levels. The data showed that most passengers were adults, with a small percentage under 18 or over 65. Ratings were mostly within the expected 1–5 range, though significant outliers existed in delay-related fields.

#### **Data Quality and Outlier Analysis** 🔍

Outlier analysis using the IQR rule covered four numeric columns: Age, Flight Distance, Departure Delay in Minutes, and Arrival Delay in Minutes.

Age showed no outliers, which is consistent with reliable demographic data. Flight Distance had 2,855 flagged values, but their mean distance of 3,890 miles is typical for international flights, so these values likely represent legitimate long-haul routes rather than data errors.

However, delay data revealed significant operational issues. Departure delays contained 18,098 outliers, while arrival delays had 17,492 outliers. The most extreme cases were concerning: the longest departure delay reached 1,592 minutes (26.5 hours), and the longest arrival delay was 1,584 minutes (26.4 hours). These extreme delays suggest serious operational disruptions affecting thousands of passengers. Missing values in "Arrival Delay in Minutes" were minimal and successfully imputed using the median.

#### **Key Findings** 📊

Visualizations revealed a fairly balanced gender distribution and satisfaction levels leaning slightly positive, with notably higher satisfaction rates among Business class passengers. The heatmap showed that Business class consistently rated services higher than Eco and Eco Plus across all service categories, highlighting a clear service quality gap between passenger classes.

Most importantly, the scatter plot showed that departure delays and arrival delays rise together: when flights leave late, they tend to arrive late too. This suggests that fixing departure timing issues could improve overall flight performance.

#### **Final Thoughts** 💭

This analysis provides a strong foundation for further customer experience improvements, particularly focusing on Economy class service enhancement and operational delay reduction strategies.
