# Airline Passenger Satisfaction Prediction Table of Contents:

## [1. Data Preprocessing / Exploratory Data Analysis](#1.)

#### [1.1. Data Familiarization](#1.1.)

#### [1.2. Data Cleaning](#1.2.)
- [1.2.1. NaN Values](#1.2.1.)
- [1.2.2. Duplicate Values](#1.2.2.)
- [1.2.3. Outliers](#1.2.3.)

#### [1.3. Final Data Preparation](#1.3.)

#### [1.4. Data Visualization](#1.4.)
- [1.4.1. Correlation Matrix](#1.4.1.)
- [1.4.2. Features Visualization](#1.4.2.)


## [2. Machine Learning: Prediction](#2.)

#### [2.1. KNN, Decision Tree, Random Forest](#2.1.)

- [2.1.1. Random Forest: Hyperparameter tuning](#2.1.1.)

#### [2.2. XGBoost](#2.2.)
- [2.2.1. XGBoost: Hyperparameter tuning](#2.2.1.)

#### [2.3. Confusion Matrix](#2.3.)

#### [2.4. Classification Report](#2.4.)


## [3. Conclusion](#3.)

***

#### Data Description

Gender: Gender of the passengers (Female, Male)

Customer Type: The customer type (Loyal customer, disloyal customer)

Age: The actual age of the passengers

Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)

Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)

Flight distance: The flight distance of this journey

Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)

Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient

Ease of Online booking: Satisfaction level of online booking

Gate location: Satisfaction level of Gate location

Food and drink: Satisfaction level of Food and drink

Online boarding: Satisfaction level of online boarding

Seat comfort: Satisfaction level of Seat comfort

Inflight entertainment: Satisfaction level of inflight entertainment

On-board service: Satisfaction level of On-board service

Leg room service: Satisfaction level of Leg room service

Baggage handling: Satisfaction level of baggage handling

Check-in service: Satisfaction level of Check-in service

Inflight service: Satisfaction level of inflight service

Cleanliness: Satisfaction level of Cleanliness

Departure Delay in Minutes: Minutes delayed at departure

Arrival Delay in Minutes: Minutes delayed at arrival

Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)

# 1. Data Preprocessing / Exploratory Data Analysis
<a id="1."></a>

### 1.1. Data Familiarization
<a id="1.1."></a>

In [None]:
# Import necessary modules

import numpy as np 
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read in dataframe
df = pd.read_csv('/kaggle/input/airline-passenger-satisfaction/train.csv', index_col='id')

In [None]:
# Drop the unneeded column and sort by ascending IDs
df = df.drop('Unnamed: 0', axis=1)
df = df.sort_values('id', ascending=True)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.nunique()[:10].sort_values(ascending=False)

### 1.2. Data Cleaning
 <a id="1.2."></a>

#### 1.2.1. NaN Values
<a id="1.2.1."></a>

In [None]:
df.isnull().sum()

So, we see that there are 310 missing values in the Arrival Delay in Minutes column. At this point, we could either impute these missing values or remove the affected rows. We're going to drop the rows with NaN values: with over 100k observations (rows), losing 310 of them costs very little, whereas imputing the mean could somewhat skew the data.

In [None]:
# Dropping rows with NaN values

df = df.dropna().copy()

df.shape

As we can see, dropping the rows with NaN values had a very small impact on the data as a whole.
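
To make "very small impact" concrete, the share of rows lost can be computed directly. With 310 affected rows out of more than 100k, that share is roughly 0.3% or less. The snippet below is a minimal sketch of that arithmetic on a made-up toy frame (`toy` and its values are illustrative assumptions, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 5, 12, 0, 30, 7, 0, 45],
    "Arrival Delay in Minutes": [0.0, 3.0, np.nan, 0.0, 28.0, 9.0, np.nan, 50.0],
})

rows_before = len(toy)
# Count rows that contain at least one NaN in any column
rows_with_nan = int(toy.isnull().any(axis=1).sum())
cleaned = toy.dropna()

# Fraction of the dataset removed by dropna()
share_dropped = rows_with_nan / rows_before
print(f"dropped {rows_with_nan} of {rows_before} rows ({share_dropped:.1%})")
```

On the real dataframe, the same three lines (count, drop, divide) would report the actual share.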

#### 1.2.2. Duplicate Values
<a id="1.2.2."></a>

In [None]:
df.duplicated().any()

We check whether there are any duplicate rows; luckily, there are none. Now, let's move on to handling outliers in our dataset.

#### 1.2.3. Outliers
<a id="1.2.3."></a>

First, we'll look at the key summary statistics of the dataset (mean, std, etc.) to see if we can detect any anomalies; we'll use the describe function.

In [None]:
df.describe()

The max values of Departure Delay in Minutes and Arrival Delay in Minutes look to be extremely large: 1592 and 1584. We'll make a boxplot to visualize this.

In [None]:
sns.boxplot(x=df['Departure Delay in Minutes'])

In [None]:
sns.boxplot(x=df['Arrival Delay in Minutes'])

In [None]:
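# Show the rows behind the extreme delays in a single table
# (a notebook cell only displays the result of its last expression)
df.loc[(df['Departure Delay in Minutes'] > 1300) | (df['Arrival Delay in Minutes'] > 1250)]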

These visualizations confirm that the two maximum values are far larger than the rest of the data. There is also another anomalous delay shown above (around 1300 minutes). These observations appear to be genuine rather than data-entry errors, but they are definitely outliers. Because our dataset is extremely large, removing them costs almost nothing, and we do not want them to skew our data (especially when we perform Machine Learning later in this project). Therefore, we will remove these 2 data points.

In [None]:
df.shape

In [None]:
outliers = df[df['Arrival Delay in Minutes'] > 1250].index
df.drop(outliers, inplace=True)
df.shape

Above, we successfully removed the 2 rows that contained outliers. Now if we look at a boxplot of these 2 columns again, we can see the data is more concentrated than before.

In [None]:
sns.boxplot(x=df['Departure Delay in Minutes'])

In [None]:
sns.boxplot(x=df['Arrival Delay in Minutes'])
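
The thresholds above (1250 and 1300 minutes) were picked by eye from the boxplots. A common rule-based alternative is the 1.5 × IQR fence, which is also what a boxplot's whiskers are based on. The following is a minimal sketch on made-up delay values; `iqr_outlier_mask` is a hypothetical helper, not something used elsewhere in this notebook:

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Invented delays with one extreme value, for illustration only
delays = pd.Series([0, 0, 2, 5, 8, 10, 12, 15, 20, 1584],
                   name="Arrival Delay in Minutes")
mask = iqr_outlier_mask(delays)
print(delays[mask])
```

On heavily right-skewed data such as flight delays, this rule may flag many ordinary long delays as well, which is one reason to prefer a manual threshold when only the most extreme points should go.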

### 1.3. Final Data Preparation
<a id="1.3."></a>

In [None]:
df['satisfaction'].value_counts()

Here, we'll convert our y-variable's 2 categorical values (neutral or dissatisfied, and satisfied) into 0 and 1 so that our machine learning models can classify the data.

In [None]:
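# Map the label explicitly: pd.get_dummies returns one column per category,
# so assigning its result to a single column is ambiguous.
df['satisfaction'] = (df['satisfaction'] == 'satisfied').astype(int)
df['satisfaction']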

# 0 = neutral or dissatisfied
# 1 = satisfied

In [None]:
df.dtypes

The following features are all currently categorical data (dtype: "object"): Gender, Customer Type, Type of Travel, and Class. However, we need to convert them to numerical data so that our Machine Learning model(s) can use them. Therefore, we'll do just that using the get_dummies function from pandas.

In [None]:
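# get_dummies returns one indicator column per category, so keep exactly one:
# the first category in sorted order becomes 1, everything else 0.
# Note: Class has three levels, so the two non-first levels both map to 0.
for col in ['Gender', 'Customer Type', 'Type of Travel', 'Class']:
    df[col] = pd.get_dummies(df[col]).iloc[:, 0].astype(int)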
df.dtypes
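
Note that `pd.get_dummies` produces one indicator column per category, so a three-level feature such as Class cannot be represented fully by a single 0/1 column. A minimal sketch of a full one-hot encoding on a made-up toy frame (`toy` and its values are illustrative assumptions) could look like this:

```python
import pandas as pd

# Invented rows using the category names from the data description
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Class": ["Business", "Eco", "Eco Plus", "Eco"],
    "Age": [34, 51, 27, 45],
})

# drop_first=True drops one level per feature to avoid redundant columns;
# dtype=int gives 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["Gender", "Class"],
                         drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Here Class becomes two columns (Eco and Eco Plus), with Business as the implicit baseline, so no level is merged with another.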

### 1.4. Data Visualization
<a id="1.4."></a>

#### 1.4.1. Correlation Matrix
<a id="1.4.1."></a>

In [None]:
corr = df.corr()

np.fill_diagonal(corr.values, 0)

corr.replace(0, np.nan, inplace=True)

corr

Now, let's visualize our correlation matrix using a heatmap.

In [None]:
plt.figure(figsize = (18,9))
sns.heatmap(corr, annot=True)

Next, we'll print out the strongest correlations between all of our variables.

In [None]:
corr.unstack().sort_values(kind='quicksort', na_position='first').drop_duplicates(keep='first')

Finally, we'll print out the highest correlated variables to our y-variable (satisfaction).

In [None]:
# We call the absolute value func. because whether the variables are positively or negatively correlated to our y-variable is irrelevant, as they're still highly correlated

df.corr().abs()['satisfaction'].sort_values(ascending = False)

We can see that the variables most highly correlated to our y-variable are: 
- Class (~50%)
- Online boarding (~50%)
- Type of Travel (~44%)
- Inflight entertainment (~40%)
- Seat comfort (~35%)
- On-board service (~32%)
- Leg room service (~31%)
- Cleanliness (~31%)
- Flight distance (~30%)

We will use these insights to guide our data visualization.
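
The figures in the list above are correlation coefficients (values between -1 and 1) shown as percentages. When ranking features it can be useful to sort by absolute value while still keeping the sign. Below is a minimal sketch on synthetic data; `top_correlations` and the feature names are hypothetical and exist only for this illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)

# Synthetic features with decreasing relation to the target
toy = pd.DataFrame({
    "strong_feature": signal + rng.normal(scale=0.5, size=n),
    "weak_feature": signal + rng.normal(scale=3.0, size=n),
    "noise_feature": rng.normal(size=n),
    "satisfaction": (signal > 0).astype(int),
})

def top_correlations(frame: pd.DataFrame, target: str, k: int = 3) -> pd.Series:
    """Rank features by |correlation| with the target, keeping the sign."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(k)

ranked = top_correlations(toy, "satisfaction")
print(ranked)
```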

#### 1.4.2. Features Visualization
<a id="1.4.2."></a>

In [None]:
sns.boxplot(x='Inflight wifi service', y='Online boarding', data=df)

People who rated the inflight wifi service higher also tended to rate their online boarding experience higher.

In [None]:
sns.boxplot(x='satisfaction', y='Online boarding', data=df)

Some people weren't satisfied even though they had a good online boarding experience. On the other hand, some people who only had a decent online boarding experience turned out to be satisfied with their overall flying experience. Very interesting!

In [None]:
sns.lmplot(x='Arrival Delay in Minutes', y='Departure Delay in Minutes', 
                hue='satisfaction', data=df)

Above, we see the strong correlation between Departure Delay and Arrival Delay; the color shows whether the passenger was satisfied or not.

In [None]:
sns.scatterplot(x='Inflight wifi service', y='Ease of Online booking', 
                hue='satisfaction', data=df)

Note that the two variables in the graph above have a correlation of about 0.7. This graph is not what we expect a strong correlation to look like. So why does it look so odd?

In [None]:
df['Inflight wifi service'].value_counts()

In [None]:
df['Ease of Online booking'].value_counts()

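This is because both variables are ratings on a discrete 0-5 scale, so every passenger falls on one of only 36 possible grid points and many points are drawn on top of each other. The scatter plot therefore can't show how many passengers share each point, which is why the graph looks odd. Rest assured, though, that it is valid.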
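
When two features only take a handful of integer values, counting how many observations fall on each (x, y) combination shows the relationship more clearly than overlapping scatter points. A minimal sketch using `pd.crosstab` on a made-up toy frame (the ratings below are invented for illustration):

```python
import pandas as pd

# Invented ratings on the same 0-5 scale as the real columns
toy = pd.DataFrame({
    "Inflight wifi service": [3, 3, 3, 4, 4, 2, 5, 5, 3, 2],
    "Ease of Online booking": [3, 3, 3, 4, 4, 2, 5, 4, 3, 2],
})

# Rows: wifi rating, columns: booking rating, cells: number of passengers
counts = pd.crosstab(toy["Inflight wifi service"],
                     toy["Ease of Online booking"])
print(counts)
```

The resulting table could then be passed to `sns.heatmap(counts, annot=True, fmt='d')` so that the concentration along the diagonal becomes visible.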

Next, we'll use plotly to create some 3D visualizations showing how 3 variables, each highly correlated with satisfaction, relate to a passenger's overall airline satisfaction.

In [None]:
import plotly.express as px

fig = px.scatter_3d(df.head(1000), x='On-board service', y='Leg room service', z='Cleanliness', 
                   color='satisfaction')
fig.show()

Above, we see how on-board service, leg room service, and cleanliness affect a passenger's satisfaction. Looking at the features in the plotly graph above, we see that the higher the rating a passenger gives to each category, the more likely they are to be satisfied with their overall airline experience. Specifically, on-board service appears to matter most for a passenger's satisfaction.

In [None]:
import plotly.express as px

fig = px.scatter_3d(df.head(1000), x='Online boarding', y='Inflight entertainment', z='Seat comfort', 
                   color='satisfaction')
fig.show()

Above we see how online boarding, inflight entertainment, and seat comfort all contribute to a passenger's airline satisfaction. Overall, the insights from this plot are similar to the first plotly graph above.

# 2. Machine Learning: Prediction
<a id="2."></a>

First, we'll create our X and y variables.

In [None]:
X = df.drop('satisfaction', axis=1)
y = df['satisfaction']

Now, let's split our data into train and test sets using train_test_split.

In [None]:
# Fix random_state so the split (and the scores below) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
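
Two optional arguments of `train_test_split` are worth knowing here: `random_state` makes the split repeatable, and `stratify` keeps the class proportions the same in the train and test sets. The snippet below is a minimal sketch on synthetic data; the names `X_toy`, `y_toy`, `X_tr`, and `X_te` are placeholders, not the notebook's variables:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature_a": rng.normal(size=200),
                      "feature_b": rng.normal(size=200)})
# 60% class 0, 40% class 1
y_toy = pd.Series([0] * 120 + [1] * 80, name="satisfaction")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.33,
    random_state=42,   # same seed -> same split on every run
    stratify=y_toy,    # preserve the 60/40 class balance in both parts
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```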

### 2.1. KNN, Decision Tree, Random Forest
<a id="2.1."></a>

In [None]:
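# Candidate models, all with default hyperparameters
models = {'KNN': KNeighborsClassifier(),
          'Decision Tree': DecisionTreeClassifier(),
          'Random Forest': RandomForestClassifier()}

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training set and return its test-set accuracy."""
    np.random.seed(42)  # seeds NumPy's global RNG, which the models fall back on

    model_scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # .score() returns mean accuracy for classifiers
        model_scores[name] = model.score(X_test, y_test)
    return model_scores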

In [None]:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores

Looking at the scores of our models we see that our Decision Tree and Random Forest models both perform very well! However, our KNN model doesn't perform well; therefore, we'll discard it. Now, let's visualize our results!

In [None]:
model_comp = pd.DataFrame(model_scores, index=['accuracy'])
model_comp.T.plot.bar();
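
One possible reason for KNN's weaker score, which this notebook does not verify, is feature scale: KNN relies on distances, so a feature measured in thousands (such as flight distance) can drown out 0-5 ratings. The following is a minimal sketch on synthetic data, under that assumption, comparing KNN with and without `StandardScaler`; all variable names here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
rating = rng.integers(0, 6, size=n)         # informative, small scale (0-5)
distance = rng.uniform(100, 4000, size=n)   # uninformative, large scale
X_toy = np.column_stack([rating, distance])
y_toy = (rating >= 3).astype(int)           # label depends on the rating only

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42)

# Plain KNN: distances are dominated by the large-scale feature
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)
# Scaled KNN: both features contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print(f"raw: {raw_score:.3f}  scaled: {scaled_score:.3f}")
```

Whether scaling would close the gap on the real data is untested here; it would be a reasonable experiment before discarding KNN.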

We'll now take our best performing model (Random Forest) and tune its hyperparameters.

#### 2.1.1. Random Forest: Hyperparameter tuning
<a id="2.1.1."></a>

*Randomized Search CV is not run in the final version of this project because it is computationally expensive and therefore takes a very long time to complete. However, in the few runs that were tried, it did not increase the accuracy of the Random Forest model.*

In [None]:
# rf_grid = {"n_estimators": np.arange(10, 1000, 50),
#           "max_depth": [None, 3, 5, 10],
#           "min_samples_split": np.arange(2, 20, 2),
#           "min_samples_leaf": np.arange(1, 20, 2)}

# rs_rf = RandomizedSearchCV(RandomForestClassifier(),
#                           param_distributions=rf_grid,
#                           cv=5,
#                           verbose=True)

# rs_rf.fit(X_train, y_train);

# rs_rf.best_params_

In [None]:
# rf = RandomForestClassifier(n_estimators=210, min_samples_split=14, min_samples_leaf=15, max_depth=None)
# rf.fit(X_train, y_train)
# rf_pred = rf.predict(X_test)
# rf.score(X_test, y_test)

Tuning our Random Forest model's hyperparameters doesn't make much of a difference in the accuracy score of the model.
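
For context on the cost: the commented-out grid spans 20 × 4 × 9 × 10 = 7,200 combinations, and `RandomizedSearchCV` with its default `n_iter=10` and `cv=5` fits 50 forests (plus one final refit), some with close to 1,000 trees. A minimal sketch of a cheaper search is shown below on synthetic data from `make_classification`; the smaller grid and the `n_iter=4`, `cv=3` settings are illustrative assumptions, not tuned choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the search finishes quickly
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 5, 15],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled combinations
    cv=3,              # 4 combinations x 3 folds = 12 fits, plus one refit
    random_state=42,   # makes the sampled combinations repeatable
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the full dataset, passing `n_jobs=-1` to run the fits in parallel, or searching on a subsample of the training rows, are further ways to reduce the wall-clock time.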

Now, let's use **XGBoost** to see how well it scores!

### 2.2. XGBoost
<a id="2.2."></a>

In [None]:
xgb = XGBClassifier(eval_metric='logloss', use_label_encoder=False)

xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)
xgb.score(X_test, y_test)

It looks like our XGBoost model performs the best out of all our models, with an accuracy score above 96%! Now, let's tune its hyperparameters.

#### 2.2.1. XGBoost: Hyperparameter tuning
<a id="2.2.1."></a>

In [None]:
params_xgb = {'n_estimators': [50,100,250,400,600,800,1000], 'learning_rate': [0.2,0.5,0.8,1]}
rs_xgb =  RandomizedSearchCV(xgb, param_distributions=params_xgb, cv=5)
rs_xgb.fit(X_train, y_train)
xgb_pred_2 = rs_xgb.predict(X_test)
rs_xgb.score(X_test, y_test)

Once again, we see that tuning our model's hyperparameters doesn't make much of a difference.

**Overall, we see that our best performing model is our XGBoost model.** We'll now evaluate this model using other metrics. 

### 2.3. Confusion Matrix
<a id="2.3."></a>

In [None]:
print(confusion_matrix(y_test, xgb_pred))

In [None]:
sns.set(font_scale=1.5) # Increase font size

def plot_conf_mat(y_test, xgb_pred):
    
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, xgb_pred),
                     annot=True, # Annotate the boxes
                     cbar=False,
                    fmt='g', # don't use scientific notation
                    cmap='Blues')
    
    # confusion_matrix puts true labels on rows (y-axis) and predictions on columns (x-axis)
    plt.xlabel("predicted label", weight='bold')
    plt.ylabel("true label", weight='bold')
    
plot_conf_mat(y_test, xgb_pred)
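
In scikit-learn's `confusion_matrix`, rows correspond to the true labels and columns to the predicted labels. Wrapping the array in a labeled DataFrame makes that orientation explicit. A minimal sketch with made-up labels (`y_true` and `y_hat` are invented, not the model's output):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented labels: 0 = neutral or dissatisfied, 1 = satisfied
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
# For binary labels, ravel() yields the four cells in this order
tn, fp, fn, tp = cm.ravel()

labels = ["neutral or dissatisfied", "satisfied"]
cm_df = pd.DataFrame(
    cm,
    index=[f"true: {name}" for name in labels],
    columns=[f"predicted: {name}" for name in labels],
)
print(cm_df)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```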

### 2.4. Classification Report
<a id="2.4."></a>

In [None]:
print(classification_report(y_test, xgb_pred))
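
The precision, recall, and F1 values in the classification report can be derived by hand from the confusion-matrix counts. The snippet below is a minimal worked example on made-up labels (not the model's predictions) that checks the hand calculation against scikit-learn's own metric functions:

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Invented labels: 0 = neutral or dissatisfied, 1 = satisfied
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Metrics for the positive class (1 = satisfied)
precision = tp / (tp + fp)   # of predicted "satisfied", how many were right
recall = tp / (tp + fn)      # of actual "satisfied", how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(precision_score(y_true, y_hat),
      recall_score(y_true, y_hat),
      f1_score(y_true, y_hat))
```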

# 3. Conclusion
<a id="3."></a>

Overall, our XGBoost model performs extremely well across all metrics, not just the accuracy score. This is consistent with the common experience in the DS/ML community that gradient-boosted tree models such as XGBoost are among the strongest performers on tabular data.

Beyond our model, however, this data shows that many important factors determine whether a passenger is satisfied with their experience flying with an airline. The features most strongly correlated with satisfaction are: **Class, Online boarding, Type of Travel, Inflight entertainment, Seat comfort, On-board service, Leg room service, Cleanliness, and Flight distance.** We saw that by using these features, along with other flying-experience metrics, we can predict with very high accuracy **(above 96%)** whether or not a passenger is satisfied with their airline flying experience.

# **Thanks for reading my notebook! My objective for this notebook was to clearly explain my thought process to help guide readers. Feel free to upvote this notebook and leave feedback via the comment section!**