<a href="https://colab.research.google.com/github/sedcakmak/Airline-Passenger-Satisfaction-Data-Analysis/blob/main/Airline_Passenger_Satisfaction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ✈️ **INTRODUCTION**

---

For this project, the **Airline Passenger Satisfaction** dataset is used. You can find the original dataset on Kaggle [here](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data).

## 📊 **Dataset Selection & Description**

I chose this dataset because its structure is clear and easy to understand. The column names are familiar (age, gender, flight distance, etc.), and the goal of understanding whether passengers were satisfied or not is straightforward. It felt like a good starting point for learning data analysis without needing advanced cleaning or domain knowledge.

## 📋 **Project Objectives**
* Dataset Selection and Setup
* Statistical Summary
* Missing Data Analysis
* Outlier Detection
* Visualization

---



## 📥 **Setup: Install kagglehub & Download Dataset**

In [None]:
# The download cell below imports kagglehub, so that is the package to install
! pip install kagglehub

In [None]:
import kagglehub

path = kagglehub.dataset_download("teejmahal20/airline-passenger-satisfaction")

print("Path to dataset files:", path)

In [None]:
import os

# List the downloaded files using the path returned by kagglehub
os.listdir(path)

In [None]:
import os
import pandas as pd

# Build file paths from the kagglehub download path instead of hard-coding them
train = pd.read_csv(os.path.join(path, 'train.csv'))
test = pd.read_csv(os.path.join(path, 'test.csv'))

df = pd.concat([train, test], ignore_index=True)
df.head(10)

## 📊 **Statistical Summary**

In [None]:
print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # list of column names
df.info()                   # data types + non-null counts per column
df.describe()               # basic stats (numeric columns only)
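
`describe()` only covers numeric columns by default. A minimal sketch of how the same call can summarize text columns instead, using a tiny made-up frame (the `toy` values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Age": [25, 41, 33, 58, 19],
})

# Excluding numeric dtypes leaves the text columns, which are summarized by
# count, number of unique values, most frequent value (top) and its frequency (freq)
cat_summary = toy.describe(exclude=["number"])
print(cat_summary)
```

On the real data, the same call on `df` would give a quick overview of every categorical column in one table.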

In [None]:
# Number of males and females
gender_counts = df['Gender'].value_counts()
for gender, count in gender_counts.items():
    print(f"{gender}: {count}")

# Average age
print("Average age:", round(df['Age'].mean(), 2))

# Most common passenger age
print("Most common passenger age:", df['Age'].mode()[0])

# Youngest passenger
print("Youngest age:", df['Age'].min())

# Oldest passenger
print("Oldest age:", df['Age'].max())

# Number of passengers under 18
print("Number of passengers under 18:", df[df['Age'] < 18].shape[0])

In [None]:
# Create age bins
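# Intervals are right-closed by default, e.g. (19, 29] covers ages 20 to 29
age_bins = pd.cut(df['Age'], bins=[0, 19, 29, 39, 49, 59, 69, 79, 89])
# value_counts() already orders the bins from most to least frequent
print(age_bins.value_counts())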
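
`pd.cut` uses right-closed intervals by default, so with these edges a 19-year-old lands in `(0, 19]` and a 20-year-old in `(19, 29]`. A small sketch with made-up ages that sit on or next to the edges; the `labels` list is an illustrative addition, not part of the original analysis:

```python
import pandas as pd

# Made-up ages chosen to sit on or next to the bin edges
ages = pd.Series([7, 19, 20, 29, 30, 85])

edges = [0, 19, 29, 39, 49, 59, 69, 79, 89]
# Hypothetical readable labels, one per interval (len(edges) - 1 of them)
labels = ["0-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]

# right=True (the default) means each bin includes its upper edge
binned = pd.cut(ages, bins=edges, labels=labels)

# sort=False keeps the bins in age order instead of ordering them by count
print(binned.value_counts(sort=False))
```

Keeping the bins in age order makes the output easier to read as a distribution, while the default ordering by count is handier for spotting the largest group.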

In [None]:
# Shows the first 10 unique values of each categorical (text) column
for col in df.select_dtypes(include=['object']).columns:
    print(f"--- {col} ---")
    print(df[col].unique()[:10])
    print()

In [None]:
# Shows all values and their counts in each categorical column
for col in df.select_dtypes(include=['object']).columns:
    print(f"\n{df[col].value_counts().to_string()}\n")

In [None]:
# Pearson correlation between numeric columns
print(df.corr(numeric_only=True))
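
A full correlation matrix is hard to scan when there are many columns. One possible way to rank column pairs by strength is sketched below on a made-up frame (the column names `a`, `b`, `c` are placeholders, not dataset columns):

```python
import numpy as np
import pandas as pd

# Made-up numeric frame; column names are placeholders
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a (col1, col2) -> r Series and drop the masked cells
pairs = upper.stack().dropna()

# Order by strength regardless of sign, strongest first
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom of the list.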

In [None]:
# Wifi rating of 0 (the Kaggle dataset description lists 0 as "Not Applicable")
print("Number of passengers with wifi service = 0:", (df['Inflight wifi service'] == 0).sum())

print(df[df['Inflight wifi service'] == 0])

## 🔍 **Missing Data Analysis**

In [None]:
# Number of missing values
df.isnull().sum()


In [None]:
# Percentage of missing values in the 'Arrival Delay in Minutes' column
print(f"Percentage: {df['Arrival Delay in Minutes'].isnull().mean():.2%}")
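
The same idea extends to every column at once. A minimal sketch on a made-up frame (the column names only mimic the dataset's style, and the values are invented):

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap; values are invented for illustration
toy = pd.DataFrame({
    "Age": [25, 41, 33, 58],
    "Arrival Delay in Minutes": [0.0, np.nan, 12.0, 3.0],
    "Departure Delay in Minutes": [0, 5, 10, 0],
})

# isnull().mean() gives the fraction of missing rows per column
missing_pct = toy.isnull().mean().mul(100).round(2)

# Show only the columns that actually have gaps, largest share first
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```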

## 🧹 **Handling Missing Data**

In [None]:
# 'Unnamed: 0' (a leftover row index from the CSV) and 'id' are not needed for the analysis
df.drop(['Unnamed: 0', 'id'], axis=1, inplace=True)

In [None]:
# Rows with a missing 'Arrival Delay in Minutes' are dropped because they make up a very small share of the data
# One possible explanation for the gaps is that these flights had no arrival delay
df.dropna(subset=['Arrival Delay in Minutes'], inplace=True)
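
The guess that a missing arrival delay means "no delay" can be checked rather than assumed, for example by looking at the departure delay of the affected rows. The sketch below uses made-up rows and also shows median imputation as one possible alternative to dropping; neither step is part of the original analysis:

```python
import numpy as np
import pandas as pd

# Made-up rows; on the real data this check would run on df before dropna()
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 45, 10, 0, 120],
    "Arrival Delay in Minutes": [0.0, np.nan, 8.0, 2.0, np.nan],
})

# Rows where the arrival delay is missing
missing_rows = toy[toy["Arrival Delay in Minutes"].isnull()]

# If these flights left late, "missing means no delay" becomes less plausible
print(missing_rows["Departure Delay in Minutes"].describe())

# Alternative to dropping: fill the gaps with the column median (returns a copy)
median_delay = toy["Arrival Delay in Minutes"].median()
filled = toy.fillna({"Arrival Delay in Minutes": median_delay})
print(filled)
```

With so few missing rows, dropping and imputing should lead to very similar results, so the simpler `dropna` is a reasonable choice here.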

In [None]:
df.info() # double checking

## 🚨 **Outlier Detection**

In [None]:
Q1 = df['Age'].quantile(0.25)  # lower quartile
Q3 = df['Age'].quantile(0.75)  # upper quartile
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

outliers = df[(df['Age'] < lower_limit) | (df['Age'] > upper_limit)]
print(outliers)
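
The same 1.5 × IQR rule can be wrapped in a small helper and applied to several numeric columns. This is a sketch on made-up data; `iqr_bounds` is a hypothetical helper name, and the `Delay` column with its planted extreme value exists only for illustration:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits of the k * IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values; 300 is planted as an obvious extreme value
toy = pd.DataFrame({
    "Age": [22, 25, 31, 38, 40, 44, 52, 60],
    "Delay": [0, 0, 5, 8, 10, 12, 15, 300],
})

# Count the values outside the limits for every numeric column
for col in toy.select_dtypes(include="number").columns:
    low, high = iqr_bounds(toy[col])
    n_out = ((toy[col] < low) | (toy[col] > high)).sum()
    print(f"{col}: limits=({low:.2f}, {high:.2f}), outliers={n_out}")
```

Heavily skewed columns such as delays tend to flag many points under this rule, so a flagged value is a prompt for a closer look rather than proof of an error.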


## 📉 **Visualization**

In [None]:
import seaborn as sns
sns.boxplot(data=df, x='Age')
