# Business Understanding
Business Case for Airline Passenger Satisfaction Prediction

Airlines face fierce competition to stay top of mind for travelers worldwide. Understanding what drives a positive customer experience is key to making effective investment decisions and fostering passenger loyalty. Managing overhead costs is a constant focus for any airline. If the company is spending money in areas that do not improve the passenger experience, it can redirect that spending to areas that do.

Once an airline identifies the factors that matter most to its customers, it can review its policies and incentives to target those areas. Airlines already manage large sets of data points related to flight distance, aircraft, and delays. They can enact policies under which predetermined criteria trigger incentive programs or offers to offset potential negative experiences. They can also continue to target their most satisfied and loyal customers, focusing on what those customers care about most.

## Data Understanding
- [1.5 points] Load the dataset and appropriately define data types. What data type should be used to represent each data attribute? Discuss the attributes collected in the dataset. For datasets with a large number of attributes, only discuss a subset of relevant attributes.  
- [1.5 points] Verify data quality: Explain any missing values or duplicate data. Visualize entries that are missing/complete for different attributes. Are those mistakes? Why do these quality issues exist in the data? How do you deal with these problems? Give justifications for your methods (elimination or imputation).  

In [1]:
import pandas as pd

df = pd.read_csv('PassengerSatisfaction/test.csv')

In [2]:
# Let's get a quick view of the dataframe...
df

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,19556,Female,Loyal Customer,52,Business travel,Eco,160,5,4,...,5,5,5,5,2,5,5,50,44.0,satisfied
1,1,90035,Female,Loyal Customer,36,Business travel,Business,2863,1,1,...,4,4,4,4,3,4,5,0,0.0,satisfied
2,2,12360,Male,disloyal Customer,20,Business travel,Eco,192,2,0,...,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied
3,3,77959,Male,Loyal Customer,44,Business travel,Business,3377,0,0,...,1,1,1,1,3,1,4,0,6.0,satisfied
4,4,36875,Female,Loyal Customer,49,Business travel,Eco,1182,2,3,...,2,2,2,2,4,2,4,0,20.0,satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25971,25971,78463,Male,disloyal Customer,34,Business travel,Business,526,3,3,...,4,3,2,4,4,5,4,0,0.0,neutral or dissatisfied
25972,25972,71167,Male,Loyal Customer,23,Business travel,Business,646,4,4,...,4,4,5,5,5,5,4,0,0.0,satisfied
25973,25973,37675,Female,Loyal Customer,17,Personal Travel,Eco,828,2,5,...,2,4,3,4,5,4,2,0,0.0,neutral or dissatisfied
25974,25974,90086,Male,Loyal Customer,14,Business travel,Business,1127,3,3,...,4,3,2,5,4,5,4,0,0.0,satisfied


In [3]:
# Now, let's get ALL of the columns in the dataframe, so we know what we're working with
pd.set_option('display.max_columns', None) # This sets the display option to show all columns
df

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,19556,Female,Loyal Customer,52,Business travel,Eco,160,5,4,3,4,3,4,3,5,5,5,5,2,5,5,50,44.0,satisfied
1,1,90035,Female,Loyal Customer,36,Business travel,Business,2863,1,1,3,1,5,4,5,4,4,4,4,3,4,5,0,0.0,satisfied
2,2,12360,Male,disloyal Customer,20,Business travel,Eco,192,2,0,2,4,2,2,2,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied
3,3,77959,Male,Loyal Customer,44,Business travel,Business,3377,0,0,0,2,3,4,4,1,1,1,1,3,1,4,0,6.0,satisfied
4,4,36875,Female,Loyal Customer,49,Business travel,Eco,1182,2,3,4,3,4,1,2,2,2,2,2,4,2,4,0,20.0,satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25971,25971,78463,Male,disloyal Customer,34,Business travel,Business,526,3,3,3,1,4,3,4,4,3,2,4,4,5,4,0,0.0,neutral or dissatisfied
25972,25972,71167,Male,Loyal Customer,23,Business travel,Business,646,4,4,4,4,4,4,4,4,4,5,5,5,5,4,0,0.0,satisfied
25973,25973,37675,Female,Loyal Customer,17,Personal Travel,Eco,828,2,5,1,5,2,1,2,2,4,3,4,5,4,2,0,0.0,neutral or dissatisfied
25974,25974,90086,Male,Loyal Customer,14,Business travel,Business,1127,3,3,3,3,4,4,4,4,3,2,5,4,5,4,0,0.0,satisfied
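One way to make the data types explicit, rather than relying on what `read_csv` infers, is sketched below. The small frame is a toy stand-in for the real file. Treating the text columns as `category` and dropping the `Unnamed` columns (which typically appear when a CSV is saved with its index and read back) are assumptions, not steps the notebook takes.

```python
import pandas as pd

# Toy stand-in for the real file: a few columns from the table above.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": [19556, 90035, 12360],
    "Gender": ["Female", "Female", "Male"],
    "Class": ["Eco", "Business", "Eco"],
    "Flight Distance": [160, 2863, 192],
    "Inflight wifi service": [5, 1, 2],
    "satisfaction": ["satisfied", "satisfied", "neutral or dissatisfied"],
})

# Assumption: columns named "Unnamed: ..." are leftover row indexes with no meaning.
index_cols = [c for c in sample.columns if c.startswith("Unnamed")]
sample = sample.drop(columns=index_cols)

# Nominal text columns become categoricals; distances and ratings stay integers.
nominal_cols = ["Gender", "Class", "satisfaction"]
sample = sample.astype({col: "category" for col in nominal_cols})

print(sample.dtypes)
```

The same `astype` call could be applied to `df` with the full list of nominal columns once the team agrees on which attributes are nominal, ordinal, or numeric.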


In [4]:
!pip install numpy
import numpy as np

# Let's combine some of the ages into buckets...
# Define the age ranges for each category
bins = [0, 3, 12, 25, 65, np.inf]
labels = ['Toddler', 'Child', 'Young Adult', 'Adult', 'Elderly']

# Create a new column with the age categories
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Remove the 'Age' column in-place
df.drop('Age', axis=1, inplace=True)

df



Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group
0,0,19556,Female,Loyal Customer,Business travel,Eco,160,5,4,3,4,3,4,3,5,5,5,5,2,5,5,50,44.0,satisfied,Adult
1,1,90035,Female,Loyal Customer,Business travel,Business,2863,1,1,3,1,5,4,5,4,4,4,4,3,4,5,0,0.0,satisfied,Adult
2,2,12360,Male,disloyal Customer,Business travel,Eco,192,2,0,2,4,2,2,2,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied,Young Adult
3,3,77959,Male,Loyal Customer,Business travel,Business,3377,0,0,0,2,3,4,4,1,1,1,1,3,1,4,0,6.0,satisfied,Adult
4,4,36875,Female,Loyal Customer,Business travel,Eco,1182,2,3,4,3,4,1,2,2,2,2,2,4,2,4,0,20.0,satisfied,Adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25971,25971,78463,Male,disloyal Customer,Business travel,Business,526,3,3,3,1,4,3,4,4,3,2,4,4,5,4,0,0.0,neutral or dissatisfied,Adult
25972,25972,71167,Male,Loyal Customer,Business travel,Business,646,4,4,4,4,4,4,4,4,4,5,5,5,5,4,0,0.0,satisfied,Young Adult
25973,25973,37675,Female,Loyal Customer,Personal Travel,Eco,828,2,5,1,5,2,1,2,2,4,3,4,5,4,2,0,0.0,neutral or dissatisfied,Young Adult
25974,25974,90086,Male,Loyal Customer,Business travel,Business,1127,3,3,3,3,4,4,4,4,3,2,5,4,5,4,0,0.0,satisfied,Young Adult
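Because `right=False` makes each bin closed on the left and open on the right, it is worth checking how ages that fall exactly on a bin edge are labeled. The sketch below reuses the same `bins` and `labels` on a handful of hand-picked ages; the sample ages are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

bins = [0, 3, 12, 25, 65, np.inf]
labels = ['Toddler', 'Child', 'Young Adult', 'Adult', 'Elderly']

# Ages chosen to sit on or just below each bin edge.
ages = pd.Series([2, 3, 11, 12, 24, 25, 64, 65])

# right=False means intervals are [0, 3), [3, 12), [12, 25), [25, 65), [65, inf).
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

for age, group in zip(ages, groups):
    print(age, group)
```

An age equal to a bin edge lands in the upper group, so a 25-year-old is labeled "Adult" and a 65-year-old "Elderly".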


We're going to make an executive decision and DROP the "Ease of Online booking" column, because it doesn't appear to be that useful.

Other columns that are somewhat suspect, but not necessarily misleading or useless, are the Gate location column and, yes, age. We will keep age (now bucketed into Age_Group), because the data may tell us something we didn't know about how a passenger's age relates to general satisfaction. The same goes for Gender. We will also keep Gate location because, upon further reflection, it is actually a *rating* of satisfaction with the *location* of the passenger's gate. That could be very useful, indeed.

In [5]:
# Remove the 'Ease of Online booking' column in-place
df.drop('Ease of Online booking', axis=1, inplace=True)
df

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group
0,0,19556,Female,Loyal Customer,Business travel,Eco,160,5,4,4,3,4,3,5,5,5,5,2,5,5,50,44.0,satisfied,Adult
1,1,90035,Female,Loyal Customer,Business travel,Business,2863,1,1,1,5,4,5,4,4,4,4,3,4,5,0,0.0,satisfied,Adult
2,2,12360,Male,disloyal Customer,Business travel,Eco,192,2,0,4,2,2,2,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied,Young Adult
3,3,77959,Male,Loyal Customer,Business travel,Business,3377,0,0,2,3,4,4,1,1,1,1,3,1,4,0,6.0,satisfied,Adult
4,4,36875,Female,Loyal Customer,Business travel,Eco,1182,2,3,3,4,1,2,2,2,2,2,4,2,4,0,20.0,satisfied,Adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25971,25971,78463,Male,disloyal Customer,Business travel,Business,526,3,3,1,4,3,4,4,3,2,4,4,5,4,0,0.0,neutral or dissatisfied,Adult
25972,25972,71167,Male,Loyal Customer,Business travel,Business,646,4,4,4,4,4,4,4,4,5,5,5,5,4,0,0.0,satisfied,Young Adult
25973,25973,37675,Female,Loyal Customer,Personal Travel,Eco,828,2,5,5,2,1,2,2,4,3,4,5,4,2,0,0.0,neutral or dissatisfied,Young Adult
25974,25974,90086,Male,Loyal Customer,Business travel,Business,1127,3,3,3,4,4,4,4,3,2,5,4,5,4,0,0.0,satisfied,Young Adult


The rest of the dataset appears to be both relevant and well-documented. Gender could be converted to a binary 0 or 1, as it appears to be simply "Male" or "Female." "Customer Type" could also be converted into a binary representation, along with "Type of Travel." Class could be one-hot encoded or, even better, represented as a SCALE from least luxurious to most luxurious. Finally, the satisfaction category could be a binary to represent "satisfied" and "neutral or dissatisfied."
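The binary conversions suggested for "Customer Type" and "Type of Travel" could look roughly like the sketch below. The mappings, the helper `encode_binary`, and the new column names are hypothetical. Which label gets 1 is an arbitrary choice, and the toy frame only uses labels visible in the output above.

```python
import pandas as pd

sample = pd.DataFrame({
    "Customer Type": ["Loyal Customer", "disloyal Customer", "Loyal Customer"],
    "Type of Travel": ["Business travel", "Business travel", "Personal Travel"],
})

# Hypothetical mappings; the 0/1 assignment is arbitrary.
customer_map = {"disloyal Customer": 0, "Loyal Customer": 1}
travel_map = {"Personal Travel": 0, "Business travel": 1}

def encode_binary(series, mapping):
    """Map labels to 0/1 and fail loudly on any label missing from the mapping."""
    encoded = series.map(mapping)
    if encoded.isna().any():
        unknown = sorted(series[encoded.isna()].unique())
        raise ValueError(f"Unmapped labels in {series.name!r}: {unknown}")
    return encoded.astype("int64")

sample["Customer_Type_Binary"] = encode_binary(sample["Customer Type"], customer_map)
sample["Travel_Type_Binary"] = encode_binary(sample["Type of Travel"], travel_map)
print(sample)
```

Raising on unmapped labels is a design choice: it surfaces unexpected spellings or casing (note the lowercase "disloyal") instead of silently producing NaN.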

In [6]:
# GENDER

# Create a dictionary for mapping Male and Female
gender_map = {'Male': 0, 'Female': 1}

# Add a numeric column, then drop the original text column.
# Any label not in gender_map is left unchanged by fillna so it can be spotted later.
df['Gender_Numeric'] = df['Gender'].map(gender_map).fillna(df['Gender'])
df.drop('Gender', axis=1, inplace=True)
df

Unnamed: 0.1,Unnamed: 0,id,Customer Type,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group,Gender_Numeric
0,0,19556,Loyal Customer,Business travel,Eco,160,5,4,4,3,4,3,5,5,5,5,2,5,5,50,44.0,satisfied,Adult,1
1,1,90035,Loyal Customer,Business travel,Business,2863,1,1,1,5,4,5,4,4,4,4,3,4,5,0,0.0,satisfied,Adult,1
2,2,12360,disloyal Customer,Business travel,Eco,192,2,0,4,2,2,2,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied,Young Adult,0
3,3,77959,Loyal Customer,Business travel,Business,3377,0,0,2,3,4,4,1,1,1,1,3,1,4,0,6.0,satisfied,Adult,0
4,4,36875,Loyal Customer,Business travel,Eco,1182,2,3,3,4,1,2,2,2,2,2,4,2,4,0,20.0,satisfied,Adult,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25971,25971,78463,disloyal Customer,Business travel,Business,526,3,3,1,4,3,4,4,3,2,4,4,5,4,0,0.0,neutral or dissatisfied,Adult,0
25972,25972,71167,Loyal Customer,Business travel,Business,646,4,4,4,4,4,4,4,4,5,5,5,5,4,0,0.0,satisfied,Young Adult,0
25973,25973,37675,Loyal Customer,Personal Travel,Eco,828,2,5,5,2,1,2,2,4,3,4,5,4,2,0,0.0,neutral or dissatisfied,Young Adult,1
25974,25974,90086,Loyal Customer,Business travel,Business,1127,3,3,3,4,4,4,4,3,2,5,4,5,4,0,0.0,satisfied,Young Adult,0


Let's look for the types of "Class" flown in so we can convert it to a better type of metric or datatype.

In [7]:
# Get unique values
unique_classes = df['Class'].unique()

# Print the unique values
print("Unique values in the Class column:")
print(unique_classes)

Unique values in the Class column:
['Eco' 'Business' 'Eco Plus']


It would appear that there are three classes. So, let's just use the existing values. They aren't a binary classification, and ranking them numerically as I wanted to do above would mean assuming how far apart they are. It makes more sense to keep them the way they are. But we DO know that Business Class is usually more luxurious than Eco(nomy) and that Eco Plus is probably better than plain Eco.
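If the ordering mentioned above (Eco, then Eco Plus, then Business) is worth keeping without committing to numeric distances between classes, an ordered categorical is one option. This is a sketch on a toy series; the ordering itself is the assumption stated in the text, not something the data confirms.

```python
import pandas as pd

classes = pd.Series(["Eco", "Business", "Eco Plus", "Eco"], name="Class")

# Assumed ordering from least to most premium cabin.
class_order = ["Eco", "Eco Plus", "Business"]
class_dtype = pd.CategoricalDtype(categories=class_order, ordered=True)
ordered = classes.astype(class_dtype)

# Integer codes follow the declared order and can feed a model if needed.
print(ordered.cat.codes.tolist())

# Ordered categoricals also support comparisons against a category label.
print((ordered > "Eco").tolist())
```

The original labels stay readable in the frame, while `.cat.codes` gives a rank (0, 1, 2) only when a numeric form is actually required.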

In [8]:
# I CAN'T GET NO SATISFACTION
# Get unique values
unique_satisfaction = df['satisfaction'].unique()

# Print the unique values
print("Unique values in the satisfaction column:")
print(unique_satisfaction)

Unique values in the satisfaction column:
['satisfied' 'neutral or dissatisfied']


Okay, so as expected there are only two types of values in the "satisfaction" column. Great! That will help us convert it to a binary: 1 for satisfied, 0 for 'neutral or dissatisfied.'

In [9]:
# Convert satisfaction to a binary
# Create a mapping dictionary
satisfaction_map = {
    'neutral or dissatisfied': 0,
    'satisfied': 1
}

# Create a new column with the binary representation
df['Satisfaction_Binary'] = df['satisfaction'].map(satisfaction_map)
df

Unnamed: 0.1,Unnamed: 0,id,Customer Type,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group,Gender_Numeric,Satisfaction_Binary
0,0,19556,Loyal Customer,Business travel,Eco,160,5,4,4,3,4,3,5,5,5,5,2,5,5,50,44.0,satisfied,Adult,1,1
1,1,90035,Loyal Customer,Business travel,Business,2863,1,1,1,5,4,5,4,4,4,4,3,4,5,0,0.0,satisfied,Adult,1,1
2,2,12360,disloyal Customer,Business travel,Eco,192,2,0,4,2,2,2,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied,Young Adult,0,0
3,3,77959,Loyal Customer,Business travel,Business,3377,0,0,2,3,4,4,1,1,1,1,3,1,4,0,6.0,satisfied,Adult,0,1
4,4,36875,Loyal Customer,Business travel,Eco,1182,2,3,3,4,1,2,2,2,2,2,4,2,4,0,20.0,satisfied,Adult,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25971,25971,78463,disloyal Customer,Business travel,Business,526,3,3,1,4,3,4,4,3,2,4,4,5,4,0,0.0,neutral or dissatisfied,Adult,0,0
25972,25972,71167,Loyal Customer,Business travel,Business,646,4,4,4,4,4,4,4,4,5,5,5,5,4,0,0.0,satisfied,Young Adult,0,1
25973,25973,37675,Loyal Customer,Personal Travel,Eco,828,2,5,5,2,1,2,2,4,3,4,5,4,2,0,0.0,neutral or dissatisfied,Young Adult,1,0
25974,25974,90086,Loyal Customer,Business travel,Business,1127,3,3,3,4,4,4,4,3,2,5,4,5,4,0,0.0,satisfied,Young Adult,0,1


### Verify Data Quality

In [10]:
# Check for missing values
missing_values = df.isnull().sum()

# Display columns with missing values
print(missing_values[missing_values > 0])

Arrival Delay in Minutes    83
dtype: int64


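The 83 entries are true missing values (NaN), not zero-minute delays: `isnull()` does not count 0.0, which is a valid value here. They also explain why "Arrival Delay" is a float while "Departure Delay" is an integer, since pandas' default integer type cannot hold NaN. With only 83 affected rows out of roughly 26,000, either dropping those rows or imputing the arrival delay (for example, with the column median or the corresponding departure delay) should have little effect on the analysis. No other column has missing values.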
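The `isnull().sum()` output counts NaN entries only, and a delay of 0.0 is not NaN. The sketch below illustrates that distinction on toy data and shows two common ways to handle such rows. Neither option is taken from the notebook, and median imputation in particular is an assumption that should be checked against the real distribution of delays.

```python
import numpy as np
import pandas as pd

# Toy data: one genuinely missing arrival delay, plus a valid zero.
delays = pd.DataFrame({
    "Departure Delay in Minutes": [50, 0, 0, 12],
    "Arrival Delay in Minutes": [44.0, 0.0, np.nan, 6.0],
})

# isnull() flags only NaN; the 0.0 entry is a real value and is not counted.
missing = delays.isnull().sum()
print(missing[missing > 0])

# The NaN forces float64 here, because the default integer dtype cannot hold NaN.
print(delays.dtypes)

# Option A: drop the incomplete rows.
dropped = delays.dropna(subset=["Arrival Delay in Minutes"])

# Option B (assumption): impute with the column median.
median_delay = delays["Arrival Delay in Minutes"].median()
imputed = delays.fillna({"Arrival Delay in Minutes": median_delay})
print(imputed)
```

Whichever option is used, reporting the number of affected rows alongside the total row count keeps the decision easy to audit.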

Another potential problem would be duplicated passenger IDs. Let's check if any exist...

In [11]:
# Find duplicated IDs
duplicated_ids = df[df['id'].duplicated(keep=False)]

# Count occurrences of each duplicated ID
id_counts = duplicated_ids['id'].value_counts()

print("Duplicated IDs and their counts:")
print(id_counts)

Duplicated IDs and their counts:
Series([], Name: id, dtype: int64)


Additionally, there appear to be NO duplicated passenger IDs, which is phenomenal! On this check, at least, the dataset is clean.
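Unique IDs do not by themselves rule out the same response being recorded twice under different IDs. A complementary check, sketched here on toy data, compares every column except `id`; the column subset and sample rows are illustrative assumptions.

```python
import pandas as pd

sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Class": ["Eco", "Eco", "Business", "Eco"],
    "Flight Distance": [160, 160, 2863, 192],
    "satisfaction": ["satisfied", "satisfied", "satisfied", "neutral or dissatisfied"],
})

# Compare every column except the identifier.
feature_cols = [c for c in sample.columns if c != "id"]

# keep=False marks every member of a duplicated group, not just the repeats.
repeat_mask = sample.duplicated(subset=feature_cols, keep=False)

print("Rows sharing all non-id values:", int(repeat_mask.sum()))
print(sample[repeat_mask])
```

Matching rows are not necessarily errors, since two passengers can legitimately give identical answers, so any hits would need a closer look before removal.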