<h3>Airline Passenger Satisfaction</h3>
<br>
<br>

<h3>Tasks</h3>
<br>
<p>What factors lead to customer satisfaction for an Airline?</p>
<br>
<p>Predict passenger satisfaction</p>
<br>
<br>

<h3>Data</h3>
<br>

<ol>
    <li>Gender: Gender of the passengers (Female, Male)</li>
    <li>Customer Type: The customer type (Loyal customer, disloyal customer)</li>
    <li>Age: The actual age of the passengers</li>
    <li>Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)</li>
    <li>Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)</li>
    <li>Flight distance: The flight distance of this journey</li>
    <li>Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)</li>
    <li>Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient</li>
    <li>Ease of Online booking: Satisfaction level of online booking</li>
    <li>Gate location: Satisfaction level of Gate location</li>
    <li>Food and drink: Satisfaction level of Food and drink</li>
    <li>Online boarding: Satisfaction level of online boarding</li>
    <li>Seat comfort: Satisfaction level of Seat comfort</li>
    <li>Inflight entertainment: Satisfaction level of inflight entertainment</li>
    <li>On-board service: Satisfaction level of On-board service</li>
    <li>Leg room service: Satisfaction level of Leg room service</li>
    <li>Baggage handling: Satisfaction level of baggage handling</li>
    <li>Check-in service: Satisfaction level of Check-in service</li>
    <li>Inflight service: Satisfaction level of inflight service</li>
    <li>Cleanliness: Satisfaction level of Cleanliness</li>
    <li>Departure Delay in Minutes: Number of minutes the departure was delayed</li>
    <li>Arrival Delay in Minutes: Number of minutes the arrival was delayed</li>
    <li>Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)</li>
</ol>


<h4>Imports</h4>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import plot_confusion_matrix, classification_report

<h4>EDA</h4>

In [None]:
# load the train data to df 
df = pd.read_csv('../input/airline-passenger-satisfaction/train.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Remove the unnecessary features
df = df.drop(['Unnamed: 0', 'id'], axis = 1)
df.info()

In [None]:
# Checking if there are nan values in the data
sns.set_style('whitegrid')
plt.figure(figsize=(14,10))
sns.set_context('paper', font_scale=1.4)

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
# How many are there?
np.isnan(df['Arrival Delay in Minutes']).value_counts()

In [None]:
df['Arrival Delay in Minutes'].mean()

In [None]:
# Not too many. Replace the NaN values with the mean.
df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].fillna(df['Arrival Delay in Minutes'].mean())
# Checking again
np.isnan(df['Arrival Delay in Minutes']).value_counts()

In [None]:
## I am doing the same for the test data as above ##

In [None]:
# load the test data into test
test = pd.read_csv('../input/airline-passenger-satisfaction/test.csv')

test = test.drop(['Unnamed: 0', 'id'], axis = 1)
test.info()

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(14,10))
sns.set_context('paper', font_scale=1.4)

sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
np.isnan(test['Arrival Delay in Minutes']).value_counts()

In [None]:
test['Arrival Delay in Minutes'] = test['Arrival Delay in Minutes'].fillna(test['Arrival Delay in Minutes'].mean())
np.isnan(test['Arrival Delay in Minutes']).value_counts()
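
<p>The cells above fill the gaps in each file with that file's own mean. A common alternative, shown here only as a sketch on toy data, is to compute the mean on the training data and reuse it for the test data, so that no statistic is derived from the test set. The frames <code>toy_train</code> and <code>toy_test</code> and their values are hypothetical stand-ins, not part of the original notebook.</p>

```python
import numpy as np
import pandas as pd

col = 'Arrival Delay in Minutes'

# Toy stand-ins for the train and test frames (hypothetical values)
toy_train = pd.DataFrame({col: [0.0, 10.0, np.nan, 20.0]})
toy_test = pd.DataFrame({col: [5.0, np.nan]})

# mean() skips NaN, so this is the mean of the observed training values
train_mean = toy_train[col].mean()

toy_train[col] = toy_train[col].fillna(train_mean)
# Reuse the training statistic for the test data
toy_test[col] = toy_test[col].fillna(train_mean)
```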

In [None]:
## -------------------------------------------- ##

In [None]:
# Graph of neutral or dissatisfied vs satisfied customers
sns.set_style('whitegrid')
plt.figure(figsize=(8,6))
sns.set_context('paper', font_scale=1.5)

sns.countplot(x='satisfaction', data = df).set_title('Neutral or Dissatisfied vs Satisfied')

In [None]:
# What does customer satisfaction look like by Age?
sns.set_style('whitegrid')
plt.figure(figsize=(10,8))
sns.set_context('paper', font_scale=1.5)

sns.histplot(x='Age', data = df,
             hue ='satisfaction').set_title('Customer satisfaction by Age')

In [None]:
df[(df['Age'] >= 40) & (df['Age'] <= 60)]['Age'].count()

In [None]:
df[(df['Age'] < 40)]['Age'].count()

<p>In the 40 to 60 age range, satisfied customers outnumber dissatisfied ones.
This age range accounts for 43% of customers.</p>
<br>
<p>Customers under 40 years old = 49%</p>
<p>Customers over 60 years old = 8%</p>
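<p>The shares above can be derived in one step by binning the ages. The following is a minimal sketch on toy data; the helper name <code>age_band_shares</code> and the sample ages are assumptions made for illustration.</p>

```python
import pandas as pd

def age_band_shares(ages: pd.Series) -> pd.Series:
    """Return the share (in %) of passengers under 40, from 40 to 60, and over 60."""
    # Bins are right-closed: (-inf, 39], (39, 60], (60, inf)
    bands = pd.cut(ages,
                   bins=[-float('inf'), 39, 60, float('inf')],
                   labels=['under 40', '40 to 60', 'over 60'])
    return (bands.value_counts(normalize=True, sort=False) * 100).round(1)

# Toy ages (hypothetical); in the notebook this would be df['Age']
toy_ages = pd.Series([25, 39, 40, 55, 60, 61, 30, 45, 35, 70])
shares = age_band_shares(toy_ages)
```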

In [None]:
# Graphs of customer satisfaction by Class, Customer Type and Type of Travel.

sns.set_style('whitegrid')
fig, ax = plt.subplots(1,3, figsize=(18,16))
sns.set_context('paper', font_scale=1.5)

ax[0].set_title('Customer Satisfaction by Class')
sns.countplot(x='satisfaction', data = df, hue = 'Class', ax=ax[0])

ax[1].set_title('Customer Satisfaction by Customer Type')
sns.countplot(x='satisfaction', data = df, hue = 'Customer Type', ax=ax[1])

ax[2].set_title('Customer Satisfaction by Type of Travel')
sns.countplot(x='satisfaction', data = df, hue = 'Type of Travel', ax=ax[2])

<p>In the Eco and Eco Plus classes, neutral or dissatisfied customers outnumber satisfied ones.
The opposite holds in Business class, where there are more satisfied customers than neutral or dissatisfied ones.</p>
<br>

<p>Among the roughly 84k loyal customers, satisfaction is split almost evenly, with a slight majority being neutral or dissatisfied (about 40k satisfied vs. 44k neutral or dissatisfied).</p>

<p>Among the roughly 19k disloyal customers, neutral or dissatisfied customers outnumber satisfied ones by slightly more than 3 to 1 (about 14.5k vs. 4.5k). However, there are almost 4.5 times fewer disloyal customers than loyal customers.</p>
<br>

<p>Most business travelers were satisfied, although the margin over neutral or dissatisfied business travelers is not large.</p>
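<p>The counts quoted above are read off the charts. A cross-tabulation gives the same numbers directly. This is a sketch on a toy frame; the rows below are hypothetical and only mimic the column names and labels of the dataset.</p>

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same column names as the dataset
toy = pd.DataFrame({
    'Customer Type': ['Loyal Customer', 'Loyal Customer', 'Loyal Customer', 'Loyal Customer',
                      'disloyal Customer', 'disloyal Customer', 'disloyal Customer',
                      'disloyal Customer'],
    'satisfaction': ['satisfied', 'satisfied', 'neutral or dissatisfied',
                     'neutral or dissatisfied', 'satisfied', 'neutral or dissatisfied',
                     'neutral or dissatisfied', 'neutral or dissatisfied'],
})

# Absolute counts per customer type and satisfaction level
counts = pd.crosstab(toy['Customer Type'], toy['satisfaction'])

# Shares within each customer type (every row sums to 1)
row_shares = pd.crosstab(toy['Customer Type'], toy['satisfaction'], normalize='index')
```

<p>Swapping <code>'Customer Type'</code> for <code>'Class'</code> or <code>'Type of Travel'</code> would give the numbers behind the other two charts.</p>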

In [None]:
# A glance at the correlations
sns.set_style('whitegrid')
plt.figure(figsize=(25,15))
sns.set_context('paper', font_scale=1.4)

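# Correlations are only defined for numeric columns, so select those first
corr_mx = df.select_dtypes(include='number').corr()
sns.heatmap(corr_mx, annot=True, cmap='Blues')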

In [None]:
# We know that there are differences between the Business class and the Eco and Eco Plus classes.
# We will see what the average ratings look like for features where customers gave scores from 0 to 5.
# We will consider these features separately for the Business class and for the Eco and Eco Plus classes.

# Looking at Mean values in Eco and Eco Plus classes
df[df['Class'] != 'Business'].describe()

In [None]:
# Creates data frames of summary statistics (including the mean) for satisfied and for neutral or dissatisfied customers
# Eco and Eco Plus Classes
df_s = df[(df['satisfaction'] != 'neutral or dissatisfied') & (df['Class'] != 'Business')].describe()
df_nds = df[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Class'] != 'Business')].describe()

# Creates a data frame that contains only the row of mean values for the selected features
# satisfied
df_s_mean = df_s[1:2][['Inflight wifi service', 'Departure/Arrival time convenient', 
                       'Ease of Online booking', 'Gate location', 'Food and drink', 
                       'Online boarding', 'Seat comfort', 'Inflight entertainment', 'On-board service', 
                       'Leg room service', 'Baggage handling','Checkin service', 'Inflight service', 
                       'Cleanliness']]
# Changing  the name of index from 'mean' to 'satisfied'
df_s_mean = df_s_mean.rename(index = {'mean':'satisfied'})

###

# Creates a data frame that contains only the row of mean values for the selected features
# neutral or dissatisfied
df_nds_mean = df_nds[1:2][['Inflight wifi service', 'Departure/Arrival time convenient', 
                       'Ease of Online booking', 'Gate location', 'Food and drink', 
                       'Online boarding', 'Seat comfort', 'Inflight entertainment', 'On-board service', 
                       'Leg room service', 'Baggage handling','Checkin service', 'Inflight service', 
                       'Cleanliness']]
# Changing  the name of index from 'mean' to 'neutral or dissatisfied'
df_nds_mean = df_nds_mean.rename(index = {'mean':'neutral or dissatisfied'})

###

# Combines two data frames into one
final_mean = pd.concat([df_nds_mean, df_s_mean])
final_mean

In [None]:
# Graph of Mean Grades in Eco and Eco Plus Class by selected features
final_mean.T.plot(figsize=(16,10), fontsize=15, kind = 'bar', 
                          title='Mean Grades in Eco and Eco Plus Class')

<p>Let's focus on neutral or dissatisfied customers. Several features receive low average grades from them.</p>
<p>Top 3 lowest-rated:</p>
<ul>
<li>Inflight wifi service</li>
<li>Ease of Online booking</li>
<li>Online boarding</li>
</ul>
<p>These indicators correlate with each other, so it may be worth examining them in practice: for example, checking how well the wifi works (Inflight wifi service), and reviewing the booking system and application or creating a more user-friendly interface (Online boarding, Ease of Online booking).</p>

<p>Satisfied customers rated these three features highly, so perhaps these services simply did not always work properly.</p>
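
<p>The table of mean grades was built above from <code>describe()</code> output. A more compact route to the same kind of table is a <code>groupby</code>, sketched here on toy data. The rows and the shortened <code>rating_cols</code> list are hypothetical; the notebook uses all fourteen rating columns.</p>

```python
import pandas as pd

# Shortened list of rating columns for the sketch
rating_cols = ['Inflight wifi service', 'Ease of Online booking', 'Online boarding']

# Toy frame (hypothetical rows)
toy = pd.DataFrame({
    'Class': ['Eco', 'Eco Plus', 'Eco', 'Business', 'Eco'],
    'satisfaction': ['satisfied', 'neutral or dissatisfied', 'neutral or dissatisfied',
                     'satisfied', 'satisfied'],
    'Inflight wifi service': [4, 2, 1, 5, 5],
    'Ease of Online booking': [5, 2, 2, 4, 3],
    'Online boarding': [4, 1, 3, 5, 4],
})

# Keep Eco and Eco Plus only, then average each rating per satisfaction group
economy = toy[toy['Class'] != 'Business']
mean_ratings = economy.groupby('satisfaction')[rating_cols].mean()
```

<p>The result has one row per satisfaction group and one column per rating, which is the same layout as <code>final_mean</code>.</p>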

In [None]:
## Check the same for business class ##

In [None]:
# Creates data frames of summary statistics (including the mean) for satisfied and for neutral or dissatisfied customers
# Business class
df_sb = df[(df['satisfaction'] != 'neutral or dissatisfied') & (df['Class'] == 'Business')].describe()
df_ndsb = df[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Class'] == 'Business')].describe()

# Creates a data frame that contains only the row of mean values for the selected features
# satisfied
df_sb = df_sb[1:2][['Inflight wifi service', 'Departure/Arrival time convenient', 
                       'Ease of Online booking', 'Gate location', 'Food and drink', 
                       'Online boarding', 'Seat comfort', 'Inflight entertainment', 'On-board service', 
                       'Leg room service', 'Baggage handling','Checkin service', 'Inflight service', 
                       'Cleanliness']]
# Changing  the name of index from 'mean' to 'satisfied'
df_sb = df_sb.rename(index = {'mean':'satisfied'})

###

# Creates a data frame that contains only the row of mean values for the selected features
# neutral or dissatisfied
df_ndsb = df_ndsb[1:2][['Inflight wifi service', 'Departure/Arrival time convenient', 
                       'Ease of Online booking', 'Gate location', 'Food and drink', 
                       'Online boarding', 'Seat comfort', 'Inflight entertainment', 'On-board service', 
                       'Leg room service', 'Baggage handling','Checkin service', 'Inflight service', 
                       'Cleanliness']]
# Changing  the name of index from 'mean' to 'neutral or dissatisfied'
df_ndsb = df_ndsb.rename(index = {'mean':'neutral or dissatisfied'})


# Combines two data frames into one
final_meanb = pd.concat([df_ndsb, df_sb])
final_meanb

In [None]:
# Graph of Mean Grades in Business Class by selected features
final_meanb.T.plot(figsize=(16,10), fontsize=15, kind = 'bar', 
                          title='Mean Grades in Business Class')

The top 3 Eco-class features are also rated low in Business class, and satisfied and neutral or dissatisfied customers rate them fairly similarly.

In addition, neutral or dissatisfied Business class customers expected better:
<ul>
<li>Inflight entertainment</li>
<li>Cleanliness</li>
</ul>

In [None]:
# How does flight distance affect customer satisfaction?

sns.set_style('whitegrid')
plt.figure(figsize=(8,6))
sns.set_context('paper', font_scale=1.5)

sns.histplot(x='Flight Distance', data = df, bins = 20, hue ='satisfaction', 
             kde = True).set_title('Histogram of Flight Distance')

In [None]:
# How is this feature distributed in the Eco and Eco Plus class?

sns.set_style('whitegrid')
plt.figure(figsize=(8,6))
sns.set_context('paper', font_scale=1.5)

sns.histplot(x='Flight Distance', data = df[df['Class'] != 'Business'], bins = 20, hue ='satisfaction', 
             kde = True).set_title('Histogram of Flight Distance by Eco and Eco Plus Class')

In [None]:
# How is this feature distributed in the Business class?

sns.set_style('whitegrid')
plt.figure(figsize=(8,6))
sns.set_context('paper', font_scale=1.5)

sns.histplot(x='Flight Distance', data = df[df['Class'] == 'Business'], bins = 20, hue ='satisfaction', 
             kde = True).set_title('Histogram of Flight Distance by Business Class')

<p>The first chart, which covers all customers, shows that a large proportion of flights fall in the 0-1250 distance range. Depending on the specific distance, neutral or dissatisfied customers outnumber satisfied ones by a factor of two or more.
Most of them are in the 250 - 750 range.</p>
<br>
<p>The same pattern appears in the Eco and Eco Plus classes, where neutral or dissatisfied customers outnumber satisfied ones by an even wider margin.</p>
<br>

<p>In the Business class, satisfied customers generally prevail.
In the 0-1250 distance range their lead is visible, while above 1250 it is substantial.</p>

<br>
<p>To increase customer satisfaction, some extras could be considered, if possible, for Eco and Eco Plus customers who fly distances of 250 - 750.</p>
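
<p>The ranges mentioned above are read off the histograms. To put numbers on them, the distances can be binned and the share of satisfied customers computed per bin. This is a sketch on toy data; the helper <code>satisfied_share_by_distance</code>, the bin edges and the rows are assumptions chosen for illustration.</p>

```python
import pandas as pd

def satisfied_share_by_distance(frame: pd.DataFrame, edges: list) -> pd.Series:
    """Share of satisfied customers in each flight distance band."""
    bands = pd.cut(frame['Flight Distance'], bins=edges)
    is_satisfied = frame['satisfaction'].eq('satisfied')
    # observed=False keeps empty bands in the result (their share is NaN)
    return is_satisfied.groupby(bands, observed=False).mean()

# Toy frame (hypothetical rows)
toy = pd.DataFrame({
    'Flight Distance': [300, 500, 700, 900, 1500, 2000],
    'satisfaction': ['neutral or dissatisfied', 'neutral or dissatisfied', 'satisfied',
                     'neutral or dissatisfied', 'satisfied', 'satisfied'],
})

share = satisfied_share_by_distance(toy, edges=[0, 250, 750, 1250, 4000])
```

<p>Applying the same helper to <code>df[df['Class'] != 'Business']</code> and <code>df[df['Class'] == 'Business']</code> would give the per-class comparison.</p>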

In [None]:
# Departure Delay in Minutes vs Arrival Delay in Minutes
sns.set_style('whitegrid')
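sns.set_context('paper', font_scale=1.5)

# jointplot is a figure-level function and creates its own figure,
# so the size is set with height instead of plt.figure
g = sns.jointplot(x='Departure Delay in Minutes', y='Arrival Delay in Minutes', 
              data = df, hue = 'satisfaction', height = 8)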

g.fig.suptitle('Departure Delay in Minutes vs Arrival Delay in Minutes')

In [None]:
# Histogram of Departure Delay
sns.set_style('whitegrid')
plt.figure(figsize=(8,6))
sns.set_context('paper', font_scale=1.5)

sns.histplot(x='Departure Delay in Minutes', data = df[df['Departure Delay in Minutes'] > 0], 
             bins = 50, hue ='satisfaction', kde = True).set_title('Histogram of Departure Delay')

In [None]:
# Histogram of Arrival Delay
sns.set_style('whitegrid')
plt.figure(figsize=(8,6))
sns.set_context('paper', font_scale=1.5)

sns.histplot(x='Arrival Delay in Minutes', data = df[df['Arrival Delay in Minutes'] > 0], 
             bins = 50, hue ='satisfaction', kde = True).set_title('Histogram of Arrival Delay')

In general, a delayed departure leads to a delayed arrival.

Most delayed departures and arrivals are delayed by about 30 minutes, but some delays reach 200 minutes.

How do delays translate into satisfaction?

In [None]:
# Graphs of Customer Satisfaction by Departure and Arrival Delay
sns.set_style('whitegrid')
fig, ax = plt.subplots(1,4, figsize=(25,10))
sns.set_context('paper', font_scale=1.9)

ax[0].set_title('Customer Satisfaction \nby Flight without any delays\n')
sns.countplot(x='satisfaction', order = ['neutral or dissatisfied', 'satisfied'], 
              data = df[(df['Departure Delay in Minutes'] == 0) & (df['Arrival Delay in Minutes'] == 0)], 
              ax=ax[0])

ax[1].set_title('Customer Satisfaction \nby Flight with delayed departure and arrival\n')
sns.countplot(x='satisfaction', order = ['neutral or dissatisfied', 'satisfied'], 
              data = df[(df['Departure Delay in Minutes'] > 0) & (df['Arrival Delay in Minutes'] > 0)], 
              ax=ax[1])

ax[2].set_title('Customer Satisfaction \nby Flight with on-time departure \nand delayed arrival')
sns.countplot(x='satisfaction', order = ['neutral or dissatisfied', 'satisfied'], 
              data = df[(df['Departure Delay in Minutes'] == 0) & (df['Arrival Delay in Minutes'] > 0)], 
              ax=ax[2])

ax[3].set_title('Customer Satisfaction \nby Flight with delayed departure \nand on-time arrival')
sns.countplot(x='satisfaction', order = ['neutral or dissatisfied', 'satisfied'], 
              data = df[(df['Departure Delay in Minutes'] > 0) & (df['Arrival Delay in Minutes'] == 0)], 
              ax=ax[3])

<p>Neutral or dissatisfied customers dominate in every group, which makes sense because there are more of them overall.
However, arriving on time despite a delayed departure appears to matter: in that group the numbers of satisfied and neutral or dissatisfied customers are almost equal.</p>
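<p>The four panels can also be summarised as one number per group. The sketch below labels each flight with its delay group and computes the share of satisfied customers in each. The helper <code>delay_group</code>, the group labels and the toy rows are hypothetical.</p>

```python
import numpy as np
import pandas as pd

def delay_group(frame: pd.DataFrame) -> pd.Series:
    """Label each flight by whether its departure and/or arrival was delayed."""
    dep = (frame['Departure Delay in Minutes'] > 0).to_numpy()
    arr = (frame['Arrival Delay in Minutes'] > 0).to_numpy()
    labels = np.select(
        [~dep & ~arr, dep & arr, ~dep & arr, dep & ~arr],
        ['no delay', 'both delayed', 'late arrival only', 'late departure only'],
        default='unknown',  # never reached: the four conditions cover every case
    )
    return pd.Series(labels, index=frame.index, name='delay group')

# Toy frame (hypothetical rows)
toy = pd.DataFrame({
    'Departure Delay in Minutes': [0, 10, 0, 20, 15, 0],
    'Arrival Delay in Minutes': [0, 5, 7, 0, 0, 0],
    'satisfaction': ['satisfied', 'neutral or dissatisfied', 'neutral or dissatisfied',
                     'satisfied', 'neutral or dissatisfied', 'neutral or dissatisfied'],
})

# Share of satisfied customers within each delay group
satisfied_share = toy['satisfaction'].eq('satisfied').groupby(delay_group(toy)).mean()
```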

<h4>Prediction</h4>

In [None]:
df.info()

In [None]:
# First deal with features that are objects, using get_dummies

#train data
train = pd.get_dummies(df, drop_first = False, columns = ['Gender', 'Customer Type', 
                                                         'Type of Travel', 'Class'])
#test data
test = pd.get_dummies(test, drop_first = False, columns = ['Gender', 'Customer Type', 
                                                         'Type of Travel', 'Class'])

In [None]:
train.info()

In [None]:
# Changing feature satisfaction:
# 1 - satisfied
# 0 - neutral or dissatisfied

# A vectorised comparison yields a plain integer column (no per-row lambda needed)
train['satisfaction'] = (train['satisfaction'] == 'satisfied').astype(int)
test['satisfaction'] = (test['satisfaction'] == 'satisfied').astype(int)

In [None]:
# preparing train_test_set
X_train = train.drop('satisfaction', axis = 1)
y_train = train['satisfaction']

X_test = test.drop('satisfaction', axis = 1)
y_test = test['satisfaction']

X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])

In [None]:
# making a dictionary in which we include two models with some parameters pre-set.
model_params = {
    'random_forest' :{
        'model' : RandomForestClassifier(),
        'params' : {
            'n_estimators' : [1, 5, 10, 100]
        }
    },
    
    'logistics_regression' : {
        'model' : LogisticRegression(solver = 'lbfgs', multi_class = 'auto'),
        'params' : {
            'C' : [0.1, 1, 10, 100],
            'solver' : ['lbfgs', 'liblinear']
        }
    }
}

In [None]:
# implement GridSearchCV for two models using a loop and a previously created dictionary
# in the created variable scores, we save best_score and best_params for each model
scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, n_jobs=-1)
    print(mp['model'])
    print('\nfitting...')
    # fit on the training set only, so the test set stays unseen during model selection
    clf.fit(X_train, y_train)
    scores.append({
        'model' : model_name,
        'best_score' : clf.best_score_,
        'best_params' : clf.best_params_
    })
    print('\nscore is appended\n')

In [None]:
# making data frame with best scores and best params
sc = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
sc

<h4>Random Forest</h4>

In [None]:
# implementing Random Forest model with best_params
clf_rfc = RandomForestClassifier(n_estimators =  100)
clf_rfc.fit(X_train, y_train)

In [None]:
# confusion matrix RFC
sns.set_style("whitegrid", {'axes.grid' : False})
plot_confusion_matrix(clf_rfc,
                     X_test,
                     y_test,
                     values_format = 'd',
                     display_labels=['neutral or \ndissatisfied', 'satisfied'])

In [None]:
# Predict on the test set with the RFC model
predictions_rfc = clf_rfc.predict(X_test)

# Create a classification report for the RFC model 
print(classification_report(y_test,predictions_rfc))

Random Forest did quite well, with a precision of 96%.
The remaining metrics range from 94% to 98%.
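
<p>For reference, the metrics in the classification report follow directly from the confusion matrix. The sketch below recomputes precision, recall and F1 for the positive class ("satisfied") from a 2x2 matrix in the scikit-learn layout (rows are true labels, columns are predictions). The matrix values are made up for illustration and are not the notebook's results.</p>

```python
import numpy as np

def precision_recall_f1(conf):
    """Metrics for the positive class of a 2x2 confusion matrix.

    Layout: [[TN, FP], [FN, TP]] (rows are true labels, columns are predictions).
    """
    tn, fp, fn, tp = np.asarray(conf).ravel()
    precision = tp / (tp + fp)  # of all predicted satisfied, how many really are
    recall = tp / (tp + fn)     # of all truly satisfied, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up matrix, not taken from the model above
toy_conf = [[90, 10],
            [20, 80]]
precision, recall, f1 = precision_recall_f1(toy_conf)
```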