<h2>AIRLINE PASSENGER SATISFACTION DATASET</h2>

The data contains details of customers who have already flown with an airline. The main purpose of this dataset is to predict whether a future customer will be satisfied with the service, and to identify which aspects of the service should be emphasized to generate more satisfied customers. The data consists of 129880 observations in total (train data: 103904, test data: 25976) and 25 columns.

**ATTRIBUTES:**

>**Id:** Id number of the passengers

>**Gender:** Gender of the passengers (Female, Male)

>**Customer Type:** The customer type (Loyal customer, disloyal customer)

>**Age:** The actual age of the passengers

>**Type of Travel:** Purpose of the flight of the passengers (Personal Travel, Business Travel)

>**Class:** Travel class in the plane of the passengers (Business, Eco, Eco Plus)

>**Flight Distance:** The flight distance of this journey

>**Inflight wifi service:** Satisfaction level of the inflight wifi service (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Departure/Arrival time convenient:** Satisfaction level of Departure/Arrival time convenient (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Ease of Online booking:** Satisfaction level of online booking (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Gate location:** Satisfaction level of Gate location (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Food and drink:** Satisfaction level of Food and drink service (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Online boarding:** Satisfaction level of online boarding (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied) 

>**Seat comfort:** Satisfaction level of Seat comfort (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Inflight entertainment:** Satisfaction level of inflight entertainment (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**On-board service:** Satisfaction level of On-board service (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Leg room service:** Satisfaction level of Leg room service (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Baggage handling:** Satisfaction level of baggage handling (1,2,3,4,5/ 1=Least Satisfied to 5=Most Satisfied)

>**Checkin service:** Satisfaction level of Check-in service (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Inflight service:** Satisfaction level of inflight service (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Cleanliness:** Satisfaction level of Cleanliness (0,1,2,3,4,5/ 0=Not Applicable; 1=Least Satisfied to 5=Most Satisfied)

>**Departure Delay in Minutes:** Minutes delayed when departure

>**Arrival Delay in Minutes:** Minutes delayed when arrival

>**Satisfaction:** /output column/ Airline satisfaction level ('satisfied', 'neutral or dissatisfied')

In [None]:
#Importing Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import boxcox
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix,classification_report
import warnings
warnings.filterwarnings("ignore")
sns.set()
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Two datasets are available: train and test. I will check the first few rows and the column names of each. For convenience, I will then combine the two and continue my operations with a single dataset.

In [None]:
data_train=pd.read_csv("../input/airline-passenger-satisfaction/train.csv")
data_test=pd.read_csv("../input/airline-passenger-satisfaction/test.csv")
data_train.head()

In [None]:
data_test.head()

In [None]:
data_train.columns

In [None]:
data_test.columns

In [None]:
print(f"Train data has {data_train.shape[0]} rows and  {data_train.shape[1]} columns.")
print("Distribution of target value:\n")
data_train.satisfaction.value_counts()

In [None]:
print(f"Test data has {data_test.shape[0]} rows and {data_test.shape[1]} columns.")
print("Distribution of target value:\n")
data_test.satisfaction.value_counts()

Combining the two datasets into one

In [None]:
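# DataFrame.append was removed in pandas 2.0; pd.concat keeps the same row order and index values
data=pd.concat([data_train,data_test])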
data.head()

In [None]:
data.shape

Looking at the data types, and descriptive statistics of features

In [None]:
data.info()

There are both numeric and object type features in data.

In [None]:
data.describe().T

>For the Age column, the youngest passenger is 7 years old and the oldest passenger is 85 years old. The average age is 39. Looking at the quartiles, age appears to be fairly evenly distributed.

>For the Flight Distance column, the minimum value is 414 and the maximum value is 4983. The average flight distance is 1190. Looking at the quartiles, there appear to be outliers, because the gap between the 3rd quartile and the maximum value is large.

>For the Departure Delay in Minutes and Arrival Delay in Minutes columns, the minimum value is 0 (no delay on that flight) and the maximum value is around 1500. Comparing the 3rd quartile with the maximum shows that there are many outliers.

>There are many categorical features rated on a 0-5 scale. Looking at the averages of these ratings, the highest level of satisfaction is in the Inflight service category with an average of 3.64, while the lowest is in the Inflight wifi service category with an average of 2.72.

Checking unique values of target column

In [None]:
data.satisfaction.unique()

Checking null values

In [None]:
data.isna().sum()

The 'Arrival Delay in Minutes' column has 393 null values. I will deal with them later.

Checking whether data has duplicate values

In [None]:
data.duplicated().sum()

Checking number of unique elements in features

In [None]:
data.nunique()

In [None]:
data.loc[data["Customer Type"]=="disloyal Customer","Customer Type"]="Disloyal Customer"
data.loc[data["Type of Travel"]=="Business travel","Type of Travel"]="Business Travel"

<h2> DATA VISUALIZATION </h2>

**Visualizing target column**

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.countplot(x='satisfaction', data=data, palette=["#f08080","#87cefa"])

plt.subplot(1, 2, 2)
plt.pie(data['satisfaction'].value_counts(), labels=["neutral or dissatisfied","satisfied"], explode=[0, 0.05], autopct='%1.2f%%', shadow=True,colors=["lightcoral","lightskyblue"])
plt.title('satisfaction', fontsize=15)

plt.show()

Our target column consists of two categories: "neutral or dissatisfied" and "satisfied". Neutral or dissatisfied passengers are more numerous in the data, but as the graph shows, the difference is small enough that we do not have an imbalance problem.
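
The balance claim can also be checked numerically. The following is a minimal sketch on a toy series; the counts are made up for illustration, and `toy_target` is a hypothetical stand-in for `data["satisfaction"]`.

```python
import pandas as pd

# Toy stand-in for data["satisfaction"]; the counts are illustrative only.
toy_target = pd.Series(
    ["neutral or dissatisfied"] * 57 + ["satisfied"] * 43, name="satisfaction"
)

# normalize=True returns class shares instead of raw counts.
class_share = toy_target.value_counts(normalize=True)

# Ratio of the largest class to the smallest one; values near 1 mean balance.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share.round(2))
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 supports treating the target as balanced; how large a ratio is acceptable is a judgment call that depends on the model and the metric.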

**Visualizing categorical features**

Let's visualize the categorical features with count plots: first the counts of each category, then the counts split by the target column (satisfaction).

In [None]:
categorics=['Gender', 'Customer Type','Type of Travel', 'Class','Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness']
for i in categorics:
  plt.figure(figsize=(16,6))
  plt.subplot(1,2,1)
  sns.countplot(x=data[i],palette="Pastel1")

  plt.subplot(1,2,2)
  sns.countplot(x=data[i],hue=data.satisfaction, palette=["#f08080","#87cefa"])
  plt.show()

Female and male passenger counts are close to each other.

In the Customer Type feature, which has two groups (Loyal and Disloyal customers), Loyal customers outnumber Disloyal customers. Roughly half of the Loyal customers are satisfied and half are neutral or dissatisfied, but among Disloyal customers the number of satisfied passengers is lower than the number of neutral or dissatisfied ones.

The Type of Travel feature consists of two categories, Personal and Business travel. More passengers travel for business than for personal reasons. While satisfied passengers are the larger group in Business travel, very few passengers on Personal travel are satisfied.

The Class feature is divided into three categories: Eco, Business, and Eco Plus. While the numbers of passengers in the Business and Eco classes are close to each other, the Eco Plus class has far fewer passengers. The majority of passengers in Business class are satisfied, whereas the majority of passengers in Eco class are neutral or dissatisfied.

The other features have 6 categories: scores from 1 to 5 (increasing satisfaction) plus 0, which means not applicable. Neutral or dissatisfied passengers are the larger group in every category of the Departure/Arrival time convenient feature. For the other features, as we might expect, neutral or dissatisfied passengers dominate at low scores such as 0-1-2, while satisfied passengers dominate at high scores such as 4-5.

Removing the ['Gender', 'Customer Type', 'Type of Travel', 'Class'] features from the categorics list so that only the categorical features with 0-5 scores stay in the list.

In [None]:
for i in ['Gender','Customer Type','Type of Travel','Class']:
  categorics.remove(i)

In [None]:
data[categorics].mean().sort_values(ascending=False)

In [None]:
total = float(len(data))
ax = data[categorics].mean().sort_values(ascending=False).plot(kind="barh",ylabel="Features",colormap="Pastel1",xticks=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5],figsize=(14,6))
plt.title('Average satisfaction ratings of services', fontsize=16)
for p in ax.patches:
    count = '{:.1f}'.format(p.get_width())
    x, y = p.get_x() + p.get_width()+0.15, p.get_y()
    ax.annotate(count, (x, y), ha='right')
plt.show()

The features with the highest average satisfaction rating are Inflight service and Baggage handling, with an average of 3.6. The feature with the lowest average rating is Inflight wifi service, with an average of 2.7.

In [None]:
ax = data[categorics].std().sort_values(ascending=False).plot(kind="barh",ylabel="Features",colormap="Pastel1",figsize=(14,6))
plt.title('Standard deviation of service ratings', fontsize=16)
for p in ax.patches:
    count = '{:.1f}'.format(p.get_width())
    x, y = p.get_x() + p.get_width()+0.05, p.get_y()
    ax.annotate(count, (x, y), ha='right')
plt.show()

I also checked the standard deviation of each rating to see how much the scores spread. The values are close to each other.

In [None]:
# numeric_only=True is required in pandas 2.0+, where mean() no longer silently skips text columns
data[data["Class"]=="Business"].mean(numeric_only=True)[4:18].plot(kind="barh",legend=True,ylabel="Features",colormap="Pastel2",figsize=(14,6),label="Business Class",title="Average satisfaction ratings of Business and Eco Class passengers")
data[data["Class"]=="Eco"].mean(numeric_only=True)[4:18].plot(kind="barh",legend=True,colormap="Pastel1",label="Eco Class",xticks=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5])
plt.show()

We can see that Business Class passengers give higher ratings to services than Eco Class passengers.

In [None]:
plt.figure(figsize=(7,7))
sns.catplot(y='Departure/Arrival time convenient',col='Type of Travel',x ='Customer Type',
            hue='satisfaction',row='Class', data=data, kind= 'bar',palette='Pastel1')
plt.show()

In [None]:
def percentage(x):
  return round(100*x.count()/data.shape[0],2)
table1=data.pivot_table(index=["Gender"],columns=["satisfaction"],aggfunc={"satisfaction":["count",percentage]},fill_value=0)
table1

In [None]:
gender="female"
for i,j,k,l in table1.values:
  print("Satisfaction rate for {} is: {:.3f}".format(gender,j/(i+j)))
  gender="male"

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.pie(data.loc[data.Gender=="Female",'satisfaction'].value_counts(), labels=["neutral or dissatisfied","satisfied"], explode=[0, 0.05], autopct='%1.2f%%', shadow=True,colors=["lightcoral","lightskyblue"])
plt.title('Satisfaction (Female)', fontsize=15)

plt.subplot(1, 2, 2)
plt.pie(data.loc[data.Gender=="Male",'satisfaction'].value_counts(), labels=["neutral or dissatisfied","satisfied"], explode=[0, 0.05], autopct='%1.2f%%', shadow=True,colors=["lightcoral","lightskyblue"])
plt.title('Satisfaction (Male)', fontsize=15)

plt.show()

When we look at the satisfaction rates of women and men, we see that both are around 43-44%. Gender does not dominate satisfaction. The dissatisfaction rate is higher for both genders.
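
The per-gender rates can also be computed without relying on the row order of a pivot table. The sketch below uses a small made-up frame (`toy` is hypothetical) and takes the mean of a boolean "is satisfied" flag within each group.

```python
import pandas as pd

# Toy stand-in for the Gender and satisfaction columns; values are illustrative.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "satisfaction": [
        "satisfied", "neutral or dissatisfied", "neutral or dissatisfied",
        "satisfied", "satisfied", "neutral or dissatisfied", "neutral or dissatisfied",
    ],
})

# True where the passenger is satisfied; the mean of a boolean column is a rate.
is_satisfied = toy["satisfaction"].eq("satisfied")
rate_by_gender = is_satisfied.groupby(toy["Gender"]).mean()

for gender, rate in rate_by_gender.items():
    print(f"Satisfaction rate for {gender} is: {rate:.3f}")
```

Because the result is indexed by the group label, the printed name always matches the rate next to it.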

In [None]:
data.pivot_table(index=["Customer Type","Class"],columns=["satisfaction"],aggfunc={"satisfaction":["count",percentage]})

In [None]:
ax = data.pivot_table(index=["Customer Type","Class"],columns=["satisfaction"],aggfunc={"satisfaction":"count"}).plot(kind="barh",figsize=(24,6))
plt.title('Satisfaction based on Customer Type and Class', fontsize=16)
for p in ax.patches:
    # named "label" so the percentage() helper defined earlier is not overwritten
    label = '{:.1f}%'.format(100 * p.get_width()/total)
    x,y  = p.get_x() + p.get_width()+1000, p.get_y()
    ax.annotate(label, (x, y),ha='right')
plt.show()

We can see that satisfied loyal customers in Business class form the largest satisfied group (30.4% of all passengers). Neutral or dissatisfied loyal customers in Eco class form the largest dissatisfied group (27.5% of all passengers).

In [None]:
data.pivot_table(index=["Type of Travel","Class"],columns=["satisfaction"],aggfunc={"satisfaction":["count",percentage]})

In [None]:
ax = data.pivot_table(index=["Type of Travel","Class"],columns=["satisfaction"],aggfunc={"satisfaction":"count"}).plot(kind="barh",figsize=(24,6))
plt.title('Satisfaction based on Travel Type and Class', fontsize=16)
for p in ax.patches:
    # named "label" so the percentage() helper defined earlier is not overwritten
    label = '{:.1f}%'.format(100 * p.get_width()/total)
    x,y  = p.get_x() + p.get_width()+1100, p.get_y()
    ax.annotate(label, (x, y),ha='right')
plt.show()

Satisfied Business class passengers on business travel form the largest satisfied group (33.0% of all passengers). Neutral or dissatisfied Eco class passengers on personal travel form the largest dissatisfied group (22.8% of all passengers). We can say that travel type and class are both major factors in satisfaction.
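
The percentages annotated on the bars above are counts divided by the total number of passengers, so they describe each group's share of all passengers rather than the satisfaction rate inside a group. The sketch below, on a made-up frame (`toy` is hypothetical), shows both views side by side using the `normalize` option of `pd.crosstab`.

```python
import pandas as pd

# Toy stand-in for the Class and satisfaction columns; values are illustrative.
toy = pd.DataFrame({
    "Class": ["Business"] * 6 + ["Eco"] * 4,
    "satisfaction": (
        ["satisfied"] * 4 + ["neutral or dissatisfied"] * 2    # Business
        + ["satisfied"] * 1 + ["neutral or dissatisfied"] * 3  # Eco
    ),
})

# Each cell divided by the grand total: share of all passengers.
share_of_total = pd.crosstab(toy["Class"], toy["satisfaction"], normalize="all")

# Each cell divided by its row total: rate within the class.
rate_within_class = pd.crosstab(toy["Class"], toy["satisfaction"], normalize="index")

print(share_of_total.round(3))
print(rate_within_class.round(3))
```

In this toy data, satisfied Business passengers are 40% of all passengers, while about 67% of Business passengers are satisfied. The two numbers answer different questions, so it helps to state which one a chart shows.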

In [None]:
data.pivot_table(index=["Gender","Customer Type"],columns=["Inflight wifi service"],aggfunc={"satisfaction":["count",percentage]})

In [None]:
ax = pd.crosstab([data["Gender"], data["Customer Type"]],data["Inflight wifi service"],
            rownames=['Gender ', " Customer Type"],
            colnames=["Inflight wifi service"],
            dropna=False).plot(kind="bar",figsize=(30,6),rot=0)
plt.title('Inflight wifi service ratings based on Gender and Customer Type', fontsize=16)
for p in ax.patches:
    label = '{:.1f}%'.format(100 * p.get_height()/total)
    x, y = p.get_x() + p.get_width(), p.get_height()
    ax.annotate(label, (x, y),ha='right')
plt.show()

When we compare the scores given to the service based on gender and customer type, we see a similar distribution. Since the number of loyal customers is high in data, their ratio seems to be higher. Gender is not a discriminative factor in scores.

In [None]:
data.pivot_table(index="Class",columns=["Food and drink"],aggfunc={"satisfaction":["count",percentage]})

In [None]:
ax = pd.crosstab(data["Class"],data["Food and drink"],
            rownames=['Class '],
            colnames=['Food and drink'],
            dropna=False).plot(kind="bar",figsize=(30,6),rot=0)
plt.title('Food and drink service points based on Class', fontsize=16)
for p in ax.patches:
    label = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width()
    y = p.get_height()
    ax.annotate(label, (x, y), ha='right')
plt.show()

The number of passengers in the eco plus class is low in the data. The number of business class passengers and eco class passengers is very close to each other. Business class passengers seem to give more points to the food and drink service.

In [None]:
data.pivot_table(index=["Type of Travel","Cleanliness"],columns=['satisfaction'],aggfunc={"satisfaction":["count",percentage]},fill_value=0)

In [None]:
ax = data.pivot_table(index=["Type of Travel","Cleanliness"],columns=["satisfaction"],aggfunc={"satisfaction":"count"},fill_value=0).T.plot(kind="bar",figsize=(30,6),rot=0)
plt.title('Cleanliness service points based on Type of Travel and Satisfaction', fontsize=16)
for p in ax.patches:
    label = '{:.1f}%'.format(100 * p.get_height()/total)
    x, y = p.get_x() + p.get_width(), p.get_height()
    ax.annotate(label, (x, y), ha='right')
plt.show()

While business travel passengers were more satisfied, personal travel passengers were more dissatisfied. Neutral or dissatisfied passengers give similar ratings to cleanliness regardless of the type of travel. But among satisfied passengers, business travel passengers give higher ratings to cleanliness. The number of passengers who are satisfied and on personal travel is quite low.

Dividing age column to four groups by looking at the quartiles to check if any pattern will be seen in different groups

In [None]:
data["Age Group"]=pd.cut(data.Age,[np.min(data.Age),np.percentile(data.Age,25),np.percentile(data.Age,50),np.percentile(data.Age,75),np.max(data.Age)+1], right=False)

In [None]:
data[["Age","Age Group"]][:5]
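
As an alternative to passing the percentiles to `pd.cut` by hand, quartile bins can be built with `pd.qcut`. This is a minimal sketch on made-up ages (`toy_age` is hypothetical). Note that `pd.qcut` produces right-closed intervals, unlike the `right=False` bins used above, so the bin edges are labelled differently.

```python
import pandas as pd

# Toy stand-in for data["Age"]; the values are illustrative only.
toy_age = pd.Series([7, 15, 22, 27, 33, 40, 44, 51, 60, 85], name="Age")

# q=4 splits the values at the 25th, 50th and 75th percentiles.
age_group = pd.qcut(toy_age, q=4)

# Number of passengers per quartile bin, in bin order.
group_sizes = age_group.value_counts().sort_index()
print(group_sizes)
```

The lowest bin includes the minimum value automatically, so no manual `+1` adjustment at the upper edge is needed.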

In [None]:
table2=data.pivot_table(index=["Age Group"],columns=["satisfaction"],aggfunc={"satisfaction":"count"},fill_value=0)
table2

In [None]:
ax = table2.plot(kind="bar", figsize=(16,8), color=["#f08080","#87cefa"])
plt.title('Satisfaction based on Age Groups', fontsize=16)
for p in ax.patches:
    label = '{:.1f}%'.format(100 * p.get_height()/total)
    x, y = p.get_x() + p.get_width(), p.get_height()
    ax.annotate(label, (x, y), ha='right')
plt.show()

While the majority of passengers between the ages of [40,51) are satisfied, the rate of dissatisfaction is higher for passengers in other age ranges.

In [None]:
table3=data.pivot_table(index=["Age Group"],columns=["Baggage handling",],aggfunc={"Baggage handling":"count"})
table3

In [None]:
ax = table3.plot(kind="bar",figsize=(26,6))
plt.title('Baggage Handling service ratings based on Age Groups', fontsize=16)
for p in ax.patches:
    label = '{:.1f}%'.format(100 * p.get_height()/total)
    x, y = p.get_x() + p.get_width(), p.get_height()
    ax.annotate(label, (x, y), ha='right')
plt.show()

When we look at the scores given to the baggage handling service by age group, the scores are similar across groups. Age group does not appear to have an obvious effect on the scores.

In [None]:
data[(data['Departure Delay in Minutes'] == 0 ) & (data['Arrival Delay in Minutes'] == 0 )].groupby('satisfaction')["id"].count().reset_index().set_index("satisfaction")

Even among flights with no departure or arrival delay, satisfied passengers are still the smaller group.

**Visualizing numeric features**

Creating a pairplot to see the distribution of numeric features and their relation with other numeric features.

In [None]:
numerics=['Departure Delay in Minutes', 'Arrival Delay in Minutes','Flight Distance',"Age"]
sns.pairplot(data[[*numerics,"satisfaction"]],hue="satisfaction")
plt.show()

While some numeric features are visibly related (Arrival Delay in Minutes and Departure Delay in Minutes), others appear unrelated to each other (Flight Distance and Age).

In [None]:
sns.boxplot(x="Age",y="satisfaction",data=data)
plt.show()

Checking outliers with boxplot

In [None]:
plt.figure(figsize=(20, 6))
for i,j in enumerate(numerics):
  plt.subplot(1,len(numerics),i+1)
  sns.boxplot(x=data[j])  # pass the column explicitly as x so the box is drawn horizontally

There are outliers in data. I will handle them later.

In [None]:
fig, ax = plt.subplots(1,len(numerics),figsize=(20,5))
fig.suptitle("Distribution of numeric features",y=1)
for i,j in enumerate(numerics):
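  # sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
  sns.histplot(x=data[j],kde=True,stat="density",ax=ax[i])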
  ax[i].set_xlabel(j)
fig.tight_layout(pad=1.5)

The Arrival Delay in Minutes and Departure Delay in Minutes columns peak at 0. As the delay in minutes increases, the number of occurrences decreases. The values in the Flight Distance column are mostly concentrated in the 0-1000 range. The data also contains passengers of all ages.

In [None]:
sns.scatterplot(x=data['Arrival Delay in Minutes'],y=data['Departure Delay in Minutes'])
plt.show()

In [None]:
data[['Arrival Delay in Minutes','Departure Delay in Minutes']].corr()

The Arrival Delay in Minutes and Departure Delay in Minutes columns are highly positively correlated. Correlated features will be checked again with a heatmap.

**Dropping unnecessary columns**

In [None]:
data.drop(["Unnamed: 0","id","Age Group"],axis=1,inplace=True)
data_backup=data.copy()
data.head()

**Checking correlation between features by creating a heatmap**

In [None]:
plt.figure(figsize=(22,10))
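# numeric_only=True is required in pandas 2.0+, because data still contains text columns here
sns.heatmap(data.corr(numeric_only=True), vmin=-1, vmax=1, center=0,cmap=sns.diverging_palette(20, 220, n=200),square=True,annot=True,fmt='.2f',)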
plt.show()

* "Ease of Online booking" and "Inflight wifi service" are positively correlated (0.71).
* "Inflight entertainment" and "Food and drink" are positively correlated (0.62).
* "Inflight entertainment" and "Seat comfort" are positively correlated (0.61).
* "Inflight service" and "Baggage handling" are positively correlated (0.63).
* "Cleanliness" and "Food and drink" are positively correlated (0.66).
* "Cleanliness" and "Seat comfort" are positively correlated (0.68).
* "Cleanliness" and "Inflight entertainment" are positively correlated (0.69).

The 'Departure Delay in Minutes' and 'Arrival Delay in Minutes' columns are highly positively correlated (0.97), as we have seen. Normally we should drop one of them. Since the 'Arrival Delay in Minutes' column has null values, it would be our first choice. But both columns consist mostly of zero values, so I do not expect them to be very important features for the model. I will drop both of these columns.
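
Reading strongly correlated pairs off a large heatmap by eye is error-prone, so they can also be listed programmatically. The following is a minimal sketch on synthetic data; the column names, the noise level and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: two nearly identical columns and one unrelated column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "departure_delay": base,
    "arrival_delay": base + rng.normal(scale=0.1, size=200),
    "age": rng.normal(size=200),
})

# Absolute correlations, so strong negative pairs would be caught as well.
corr = toy.corr().abs()

# Keep only the strict upper triangle: each pair once, diagonal excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format, then keep the pairs above the (assumed) threshold.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

Applied to the real data, the same pattern would list the delay pair along with any other pair above the chosen threshold.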

In [None]:
data.drop(["Arrival Delay in Minutes","Departure Delay in Minutes"],axis=1,inplace=True)

**Checking correlation to target**

In [None]:
data_temp=data.copy()
data_temp["satisfaction"]=data_temp["satisfaction"].map({"satisfied":1,"neutral or dissatisfied":0})
# numeric_only=True skips the remaining text columns (required in pandas 2.0+)
data_temp.corr(numeric_only=True)['satisfaction'].sort_values().drop('satisfaction').plot(kind='barh',title="Correlation with Satisfaction")
plt.show()

The features that correlate somewhat more strongly with customer satisfaction are 'Inflight wifi service', 'Flight Distance', 'Cleanliness', 'Leg room service', 'On-board service', 'Seat comfort', 'Inflight entertainment', and 'Online boarding'.

Among the features, "Online boarding" has the highest correlation with the target, so I will check its correlation with the other features.

In [None]:
data_temp.corr(numeric_only=True)['Online boarding'].sort_values().drop(['Online boarding','satisfaction']).plot(kind='barh',title="Correlation with Online Boarding service")
plt.show()

In [None]:
sns.boxplot(x=data['Inflight wifi service'], y = data_temp['Online boarding'])
plt.show()

As the score given to the Inflight wifi service increases, the spread of the Online boarding scores narrows and their level rises. Passengers who rate the inflight wifi higher are more likely to give a higher rating to online boarding as well.

<h2> PRE-PROCESSING </h2>

**Encoding categoric features** 

In [None]:
data.head()

In [None]:
#categorics
data[["Gender","Customer Type","Type of Travel","Class","satisfaction"]].head()

We have to transform our categorical features into numeric ones so that the models can learn from them.

In [None]:
#mapping ordinal features
data["Class"] = data["Class"].map({'Business':2, 'Eco Plus':1, 'Eco':0})
data["satisfaction"]=data["satisfaction"].map({"satisfied":1,"neutral or dissatisfied":0})

In [None]:
#for nominal features,
data_new=pd.get_dummies(data,drop_first=True)
#I use the drop_first parameter so the model does not receive redundant (perfectly collinear) dummy columns
data_new.reset_index(inplace=True)
data_new.drop("index",axis=1,inplace=True)
data_new.head()

In [None]:
data_new[["Gender_Male","Customer Type_Loyal Customer","Type of Travel_Personal Travel","Class","satisfaction"]].head()

**Outlier Detection**

I will use Local Outlier Factor Method to detect and drop outliers.

In [None]:
df_local=data_new.copy()
temp = df_local.drop("satisfaction", axis=1)
local_outlier = LocalOutlierFactor(n_neighbors=2).fit_predict(temp)
outlier_local=list(np.where(local_outlier == -1)[0])
del temp
print(f"Outlier Count: {len(outlier_local)} \nSample Count: {len(df_local)} \nFraction: {round(len(outlier_local)/len(df_local),3)}")
df_local=df_local.drop(outlier_local).reset_index(drop=True)
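
To illustrate what `fit_predict` returns, here is a minimal sketch on synthetic 2-D data with one point placed far from the rest. The data and the choice of `n_neighbors=20` (scikit-learn's default) are assumptions for illustration; the cell above uses `n_neighbors=2`, which compares each point only with its two nearest neighbours and is therefore a much more local criterion.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 200 points from a standard normal cloud plus one point far away from it.
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_point = np.array([[12.0, 12.0]])
X = np.vstack([inliers, far_point])

# fit_predict labels each row: -1 for outliers, 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)

# Row positions flagged as outliers; the far point sits at position 200.
outlier_idx = np.where(labels == -1)[0]
print("Flagged rows:", outlier_idx)
```

A few points on the edge of the cloud may be flagged along with the far point, which is why it is worth printing the outlier fraction before dropping rows, as the cell above does.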

**Feature Transformation**

Some machine learning models assume that the features are normally distributed. I will try to bring the distribution of my features closer to a normal distribution with some transformations. Different methods can be tried to see which one works better for the data. I will mostly check the "Flight Distance" and "Age" columns.

Methods I use:
1. Log Transformation
2. Square Root Transformation
3. Box Cox Transformation
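
Before applying these to the real columns, a small sketch on synthetic data shows what each transformation does to skewness. The lognormal sample below is an assumption standing in for a right-skewed, strictly positive feature such as a distance. Box-Cox computes (x^λ - 1)/λ for λ ≠ 0 and log(x) for λ = 0, with λ estimated from the data, and it requires strictly positive input.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive sample (illustrative only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=6.5, sigma=0.8, size=5000)

x_log = np.log(x)
x_sqrt = np.sqrt(x)
# With no lambda given, boxcox returns the transformed data and the fitted lambda.
x_boxcox, fitted_lambda = stats.boxcox(x)

# Skewness near 0 indicates a more symmetric distribution.
for name, values in [("original", x), ("log", x_log),
                     ("sqrt", x_sqrt), ("box-cox", x_boxcox)]:
    print(f"{name:>8}: skewness = {stats.skew(values):.3f}")
print(f"Fitted Box-Cox lambda: {fitted_lambda:.3f}")
```

Skewness is only one aspect of normality, so a low value here does not guarantee that a formal normality test will pass.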

In [None]:
#Log Transformation
df_log=df_local.copy()
df_log["Flight Distance"]=np.log(df_log["Flight Distance"])
df_log["Age"]=np.log(df_log["Age"])

In [None]:
#Square-Root Transformation
df_sqrt=df_local.copy()
df_sqrt["Flight Distance"]=np.sqrt(df_sqrt["Flight Distance"])
df_sqrt["Age"]=np.sqrt(df_sqrt["Age"])

In [None]:
#Box Cox Transformation
df_boxcox=df_local.copy()
df_boxcox["Flight Distance"],lmbda=boxcox(df_boxcox["Flight Distance"],lmbda=None)
df_boxcox["Age"],lmbda=boxcox(df_boxcox["Age"],lmbda=None)

Visualizing Transformed Features

In [None]:
#Flight Distance feature
plt.figure(figsize=(20, 12))

plt.subplot(2, 4, 1)
plt.boxplot(df_local['Flight Distance'])
plt.title('Flight Distance')

plt.subplot(2, 4, 2)
plt.boxplot(df_log["Flight Distance"])
plt.title('Flight Distance (Log Transformation)')

plt.subplot(2, 4, 3)
plt.boxplot(df_sqrt['Flight Distance'])
plt.title('Flight Distance (Square Root Transformation)')

plt.subplot(2, 4, 4)
plt.boxplot(df_boxcox['Flight Distance']);
plt.title('Flight Distance (Box Cox Transformation)')

plt.subplot(2, 4, 5)
plt.hist(df_local['Flight Distance'])
plt.title('Flight Distance')

plt.subplot(2, 4, 6)
plt.hist(df_log["Flight Distance"])
plt.title('Flight Distance (Log Transformation)')

plt.subplot(2, 4, 7)
plt.hist(df_sqrt['Flight Distance'])
plt.title('Flight Distance (Square Root Transformation)')

plt.subplot(2, 4, 8)
plt.hist(df_boxcox['Flight Distance']);
plt.title('Flight Distance (Box Cox Transformation)')

plt.show()

In [None]:
#Age Feature
plt.figure(figsize=(20, 12))

plt.subplot(2, 4, 1)
plt.boxplot(df_local['Age'])
plt.title('Age')

plt.subplot(2, 4, 2)
plt.boxplot(df_log["Age"])
plt.title('Age (Log Transformation)')

plt.subplot(2, 4, 3)
plt.boxplot(df_sqrt['Age'])
plt.title('Age (Square Root Transformation)')

plt.subplot(2, 4, 4)
plt.boxplot(df_boxcox['Age']);
plt.title('Age (Box Cox Transformation)')

plt.subplot(2, 4, 5)
plt.hist(df_local['Age'])
plt.title('Age')

plt.subplot(2, 4, 6)
plt.hist(df_log["Age"])
plt.title('Age (Log Transformation)')

plt.subplot(2, 4, 7)
plt.hist(df_sqrt['Age'])
plt.title('Age (Square Root Transformation)')

plt.subplot(2, 4, 8)
plt.hist(df_boxcox['Age']);
plt.title('Age (Box Cox Transformation)')

plt.show()

Checking Normality of transformed features

In [None]:
for j in ["Flight Distance","Age"]:
  transforms=[df_local[j], df_log[j], df_sqrt[j], df_boxcox[j]]
  processes=["original","log","square root","box cox"]
  for i,k in zip(transforms,processes):
    print(f"Normality for {j} Feature ({k}):",stats.shapiro(i))

Even after transformation, my features still do not follow a normal distribution. That's why I'm going to continue without transformation.

**Splitting data**

Splitting the data into train and test sets with a 0.7 train / 0.3 test ratio, so I can train my models on the train data and then test their performance on the test data.

In [None]:
X_train, X_test, y_train, y_test=train_test_split(df_local.drop("satisfaction",axis=1),df_local["satisfaction"],test_size=0.3,random_state=42)

In [None]:
print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

**Feature Scaling**

We use feature scaling to bring the independent features into a fixed range so that each feature contributes approximately equally to the model. Because my features are not normally distributed, I do not use Standard Scaler. I use MinMax Scaler, which transforms the data range to (0,1).

In [None]:
scaler=MinMaxScaler()
scaler.fit(X_train)
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)
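
The scaler is fitted on the train data only, so the test data is scaled with the train minimum and maximum. A minimal sketch on made-up numbers (the toy arrays are hypothetical) shows the formula and one consequence: a test value outside the train range falls outside (0,1).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data; values are illustrative only.
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[15.0], [40.0]])

# Fit on train only, then apply the same min and max to test.
scaler_toy = MinMaxScaler().fit(X_train_toy)
scaled_test = scaler_toy.transform(X_test_toy)

# The same computation by hand: (x - train_min) / (train_max - train_min).
train_min = X_train_toy.min(axis=0)
train_max = X_train_toy.max(axis=0)
manual = (X_test_toy - train_min) / (train_max - train_min)

print(scaled_test.ravel())  # 40 is above the train maximum, so it maps above 1
```

Fitting on the train split alone keeps information about the test data out of the preprocessing step.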

<h2> MODEL </h2>

Models I use:
* Gaussian Naive Bayes
* Linear SVC
* Logistic Regression
* K-Nearest Neighbors
* Decision Tree
* Voting Classifier
* Bagging Classifier
* Random Forest
* AdaBoost 
* Stochastic Gradient Boosting
* XGBoost

In [None]:
#Creating a function that creates a dataframe for testing model performance
def model_perf(model,X_train,X_test,y_train,y_test,pred,model_name):
  """Takes the data, returns a dataframe that calculates the performance of the model"""
  cv_results=cross_val_score(model,X_train,y_train,cv=5)
  perf_df=pd.DataFrame({"Mean_CV":np.mean(cv_results),"Std_CV":np.std(cv_results),'Train_Score':model.score(X_train,y_train),"Test_Score":model.score(X_test,y_test),"Precision_Score":precision_score(y_test,pred),"Recall_Score":recall_score(y_test,pred),"F1_Score":f1_score(y_test,pred)},index=[model_name])
  return perf_df
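
For reference, the precision, recall and F1 columns in this table can be derived from confusion-matrix counts. The following sketch computes them by hand on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical), treating class 1 ("satisfied") as the positive class.

```python
import numpy as np

# Toy labels; 1 = satisfied, 0 = neutral or dissatisfied (illustrative only).
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Confusion-matrix counts for the positive class.
tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))  # predicted 1, truly 1
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))  # predicted 1, truly 0
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))  # predicted 0, truly 1

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```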

**Gaussian Naive Bayes**

In [None]:
nb=GaussianNB().fit(X_train_scaled,y_train)
pred_nb = nb.predict(X_test_scaled)
perf_nb=model_perf(nb,X_train_scaled,X_test_scaled,y_train,y_test,pred_nb,"Gaussian NB")
perf_nb

**Linear SVC**

In [None]:
svc=LinearSVC()
parameters={"C":[0.01,0.1,1,10]}
searcher=GridSearchCV(svc,parameters,cv=5,n_jobs=-1).fit(X_train_scaled,y_train)
best_model_svc=searcher.best_estimator_
pred_svc = best_model_svc.predict(X_test_scaled)
print("Best Parameters:",searcher.best_params_)
perf_svc=model_perf(best_model_svc,X_train_scaled,X_test_scaled,y_train,y_test,pred_svc,"Linear SVC")
perf_svc

**Logistic Regression**

In [None]:
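# liblinear supports both the "l1" and "l2" penalties searched below; the default lbfgs solver rejects "l1"
log=LogisticRegression(random_state=42,solver="liblinear")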
params={"C":[0.001,0.01,0.1,1,10],"penalty":["l1","l2"]}
searcher=GridSearchCV(log,params,cv=5,n_jobs=-1).fit(X_train_scaled,y_train)
best_model_log=searcher.best_estimator_
pred_log = best_model_log.predict(X_test_scaled)
print("Best Parameters:",searcher.best_params_)
y_pred_log_proba=best_model_log.predict_proba(X_test_scaled)[:,1]
print("ROC AUC Score:",roc_auc_score(y_test,y_pred_log_proba))
perf_log=model_perf(best_model_log,X_train_scaled,X_test_scaled,y_train,y_test,pred_log,"Logistic Regression")
perf_log

**KNN**

In [None]:
knn=KNeighborsClassifier()
params={"n_neighbors":np.arange(3,10,2)}
searcher=GridSearchCV(knn,params,cv=5,n_jobs=-1).fit(X_train_scaled,y_train)
best_model_knn=searcher.best_estimator_
pred_knn = best_model_knn.predict(X_test_scaled)
print("Best Parameters:",searcher.best_params_)
perf_knn=model_perf(best_model_knn,X_train_scaled,X_test_scaled,y_train,y_test,pred_knn,"KNN")
perf_knn

**Decision Tree**

In [None]:
dt=DecisionTreeClassifier(random_state=42)
parameters={"max_depth":[*range(3,10,2),None],"max_features":[*range(3,10,2),None],"min_samples_leaf":list(range(1,10,2)),"criterion":["gini","entropy"]}
searcher=GridSearchCV(dt,parameters,cv=5,n_jobs=-1).fit(X_train_scaled,y_train)
best_model_dt=searcher.best_estimator_
pred_dt = best_model_dt.predict(X_test_scaled)
print("Best Parameters:",searcher.best_params_)
perf_dt=model_perf(best_model_dt,X_train_scaled,X_test_scaled,y_train,y_test,pred_dt,"Decision Tree")
perf_dt

**Ensemble Learning - Voting Classifier**

In [None]:
classifiers=[("Logistic Regression",best_model_log),("KNN",best_model_knn),("Decision Tree",best_model_dt)]
vc=VotingClassifier(estimators=classifiers).fit(X_train_scaled,y_train)
pred_vc=vc.predict(X_test_scaled)
perf_vc=model_perf(vc,X_train_scaled,X_test_scaled,y_train,y_test,pred_vc,"Voting Classifier")
perf_vc

**Ensemble Learning - Bagging Classifier**

In [None]:
base_dt=DecisionTreeClassifier(random_state=42)
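#Base estimator is passed positionally: its keyword is base_estimator in older scikit-learn releases and estimator in newer ones
bc=BaggingClassifier(base_dt,n_estimators=300,oob_score=True,n_jobs=-1).fit(X_train_scaled,y_train)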
pred_bc=bc.predict(X_test_scaled)
print("OOB Score:",bc.oob_score_)
perf_bc=model_perf(bc,X_train_scaled,X_test_scaled,y_train,y_test,pred_bc,"Bagging Classifier")
perf_bc

**Ensemble Learning - Random Forest**

In [None]:
rf=RandomForestClassifier(random_state=42,n_estimators=300).fit(X_train_scaled,y_train)
pred_rf=rf.predict(X_test_scaled)
perf_rf=model_perf(rf,X_train_scaled,X_test_scaled,y_train,y_test,pred_rf,"Random Forest")
perf_rf

**Ensemble Learning - AdaBoost**

In [None]:
base_ada_dt=DecisionTreeClassifier(max_depth=1,random_state=42)
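#Base estimator is passed positionally: its keyword is base_estimator in older scikit-learn releases and estimator in newer ones
adb=AdaBoostClassifier(base_ada_dt,n_estimators=100).fit(X_train_scaled,y_train)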
pred_adb=adb.predict(X_test_scaled)
y_pred_adb_proba=adb.predict_proba(X_test_scaled)[:,1]
print("ROC AUC Score:",roc_auc_score(y_test,y_pred_adb_proba))
perf_adb=model_perf(adb,X_train_scaled,X_test_scaled,y_train,y_test,pred_adb,"AdaBoost")
perf_adb

**Ensemble Learning - Stochastic Gradient Boosting Classifier**

In [None]:
sgb=GradientBoostingClassifier(n_estimators=300,max_depth=11,subsample=0.8,max_features=0.6,random_state=42).fit(X_train_scaled,y_train) #Parameters tuned beforehand with GridSearchCV
pred_sgb=sgb.predict(X_test_scaled)
perf_sgb=model_perf(sgb,X_train_scaled,X_test_scaled,y_train,y_test,pred_sgb,"Stochastic Gradient Boosting")
perf_sgb

**Ensemble Learning - XGBoost (Extreme Gradient Boosting)**

In [None]:
xgb=XGBClassifier(random_state=42, max_depth=9, min_child_weight=3, n_estimators=100) #Parameters tuned beforehand with GridSearchCV
xgb.fit(X_train_scaled,y_train)
pred_xgb = xgb.predict(X_test_scaled)
perf_xgb=model_perf(xgb,X_train_scaled,X_test_scaled,y_train,y_test,pred_xgb,"XGBoost")
perf_xgb

<h2>MODEL RESULTS</h2>

In [None]:
pd.concat([perf_nb, perf_svc, perf_log, perf_knn, perf_dt, perf_vc, perf_bc, perf_rf, perf_adb, perf_sgb, perf_xgb])

Accuracy alone is not a sufficient measure of performance in classification problems, so I also looked at precision, recall, and F1 score. I also checked the mean and standard deviation of the cross-validation scores to see how much the results vary between folds. All models performed well and close to each other. In terms of F1 score, the Bagging Classifier, Random Forest, Stochastic Gradient Boosting, and XGBoost have the highest scores. The classification report and confusion matrix of the Random Forest model are shown below.
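
Why accuracy by itself can mislead is easiest to see on a small imbalanced example. The sketch below uses made-up labels (not the airline data) and assumes only scikit-learn's metric functions: a classifier that misses half of the positive class still reaches 90% accuracy, while recall and F1 expose the gap.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 10 samples, only 2 positives (1 = satisfied)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that finds one of the two positives and raises no false alarms
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)    # 9 of 10 predictions are correct
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 1
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 1 / 2
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Here F1 is about 0.67 even though accuracy is 0.90, which is the kind of discrepancy the extra columns of the results table are meant to catch.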

In [None]:
perf_rf

In [None]:
print(classification_report(y_test,pred_rf))

In [None]:
plt.figure(figsize=(12, 8))
cf_matrix=confusion_matrix(y_test,pred_rf)
group_names = ["True Negative","False Positive","False Negative","True Positive"]
group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
ax=sns.heatmap(cf_matrix, annot=labels, fmt="", cmap='Blues',xticklabels=["neutral or dissatisfied","satisfied"], yticklabels=["neutral or dissatisfied","satisfied"])
ax.set_xlabel('Predicted Label',fontsize = 15)
ax.set_ylabel('Actual Label',fontsize = 15)
plt.show()

As the confusion matrix shows, the model predicted most of the test data correctly. Looking at the misclassifications:

* The model wrongly predicted 'satisfied' for 395 passengers who were actually 'neutral or dissatisfied'. They make up 1.03% of the test data.
* The model wrongly predicted 'neutral or dissatisfied' for 1023 passengers who were actually 'satisfied'. They make up 2.66% of the test data.
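
The counts and shares quoted above can be read directly off a confusion matrix. A minimal sketch, using made-up labels rather than the airline data and assuming the encoding 0 = neutral or dissatisfied, 1 = satisfied:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels (0 = neutral or dissatisfied, 1 = satisfied)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total = tn + fp + fn + tp

fp_share = fp / total  # actually dissatisfied, predicted satisfied
fn_share = fn / total  # actually satisfied, predicted dissatisfied
print(f"False positives: {fp} ({fp_share:.2%} of the test set)")
print(f"False negatives: {fn} ({fn_share:.2%} of the test set)")
```

Dividing by the total number of test samples mirrors how the percentages in the heatmap annotations are computed.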

<h2>CONCLUSION</h2>

> It would be more meaningful if the data were divided into three groups: satisfied, neutral, and dissatisfied passengers. Because neutral passengers are included in the dissatisfied group, the dissatisfaction rate is inflated for all services, which made it difficult to draw meaningful conclusions.

> Based on the visualizations and the model's feature importances, the services that affect satisfaction the most are Online boarding, Inflight wifi service, Inflight entertainment, Seat comfort, Cleanliness, and On board service.

> Gender has no obvious effect on overall satisfaction and scores.

> Passengers between 40 and 51 years old are more likely to be satisfied.

> The majority of personal travel passengers are not satisfied; incentive campaigns can be organized for them.

> While Business class passengers are generally satisfied, the majority of Eco class passengers are not. Extra services can be added for Eco class.
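
The feature-importance ranking referred to in the conclusions can be obtained from a fitted tree ensemble through its `feature_importances_` attribute. The sketch below is only an illustration on synthetic data: `rf_demo` and the `feature_i` column names are placeholders, not the notebook's model or the airline columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the scaled training data; column names are placeholders
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; sorting them ranks the features
importances = pd.Series(rf_demo.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so a ranking like this is best read alongside the visualizations rather than on its own.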