#  Air passenger satisfaction




## 1. Introduction:

### Context

This kernel contains an exploration of the relationship between air passenger satisfaction and various factors from ticket purchase to arrival at the destination.

Dataset source: https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction

### Content

__The dataset contains the following columns:__

Gender: Gender of the passengers (Female, Male)

Customer Type: The customer type (Loyal customer, disloyal customer)

Age: The actual age of the passengers

Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)

Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)

Flight distance: The flight distance of this journey

Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)

Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient

Ease of Online booking: Satisfaction level of online booking

Gate location: Satisfaction level of Gate location

Food and drink: Satisfaction level of Food and drink

Online boarding: Satisfaction level of online boarding

Seat comfort: Satisfaction level of Seat comfort

Inflight entertainment: Satisfaction level of inflight entertainment

On-board service: Satisfaction level of On-board service

Leg room service: Satisfaction level of Leg room service

Baggage handling: Satisfaction level of baggage handling

Check-in service: Satisfaction level of Check-in service

Inflight service: Satisfaction level of inflight service

Cleanliness: Satisfaction level of Cleanliness

Departure Delay in Minutes: Minutes delayed when departure

Arrival Delay in Minutes: Minutes delayed when Arrival

Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)

## 2. Importing data and libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix

from scipy.stats import norm
from matplotlib import pyplot as plt
from scipy import stats

%matplotlib inline


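# The seaborn styles were renamed in matplotlib 3.6; fall back to the old name on older versions
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)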
#plt.rcParams['figure.dpi'] = 100

In [None]:
df = pd.read_csv('/kaggle/input/airline-passenger-satisfaction/train.csv')
df.shape

## 3. First look at the data

In [None]:
df.head()

Two columns in the dataset, "Unnamed: 0" and "id", don't contain useful information, so we can drop them.

In [None]:
df = df.drop('Unnamed: 0', axis=1)
df = df.drop('id', axis=1)

Let's check the dataset for null values:

In [None]:
df.isnull().sum()

Only one column, "Arrival Delay in Minutes", contains nulls. However, this feature should correlate strongly with "Departure Delay in Minutes", because, as far as I know, an aircraft usually spends about the same time flying a given route. We'll check this later.


## 4. Exploratory Data Analysis

### Target

In [None]:
df['satisfaction'].unique()

In [None]:
sns.countplot(x='satisfaction',data=df, palette="Set1");

As we can see, there are more neutral or dissatisfied passengers than satisfied ones.
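To put numbers on the class balance, the counts and shares of each target value can be printed directly. The snippet below is a minimal sketch on a made-up toy Series (the 6/4 split is an assumption for illustration, not the real dataset); on the notebook data one would use `df['satisfaction']` instead.

```python
import pandas as pd

# Toy stand-in for df['satisfaction']; the values are made up for illustration.
satisfaction = pd.Series(
    ["neutral or dissatisfied"] * 6 + ["satisfied"] * 4, name="satisfaction"
)

# Absolute counts and relative shares of each class.
counts = satisfaction.value_counts()
shares = satisfaction.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

If the shares are far from 50/50, accuracy alone becomes a less informative metric for the model built later.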

### Categorical features

We have four categorical features: "Gender", "Customer Type", "Type of Travel" and "Class".

__(!)__ I DON'T THINK features such as "Inflight wifi service", "Departure/Arrival time convenient", "Ease of Online booking", "Gate location", "Food and drink", "Online boarding", "Seat comfort", "Inflight entertainment", "On-board service", "Leg room service", "Baggage handling", "Check-in service", "Inflight service", "Cleanliness" ARE CATEGORICAL, because their values can be ordered and compared with each other (for example, level 5 of wifi service satisfaction is better than level 1).



Let's look at the target distribution depending on categorical features:

In [None]:
sns.catplot(x="satisfaction", col="Gender", col_wrap=2, data=df, kind="count", height=3.5, aspect=1.0, palette="Set1");

The target distributions are approximately the same for both genders.

In [None]:
sns.catplot(x="satisfaction", col="Customer Type", col_wrap=2, data=df, kind="count", height=3.5, aspect=1.0, palette="Set1");

There are more neutral or dissatisfied passengers among disloyal customers, but I'm not sure what loyalty means here, so I can't explain this distribution.

In [None]:
sns.catplot(x="satisfaction", col="Class", col_wrap=3, data=df, kind="count", height=3.5, aspect=1.0, palette="Set1");

This picture is quite expected: Business class has the largest number of satisfied passengers.

In [None]:
sns.catplot(x="satisfaction", col="Type of Travel", col_wrap=2, data=df, kind="count", height=3.5, aspect=1.0, palette="Set1");

These plots are very interesting: they suggest that people don't like spending their own money on flights :)
Business travel expenses are usually paid by the employer, and as a result most of those passengers are satisfied with the flight.

By the way, it's a pity that the dataset has no information about ticket purchases. I think information about discounts and sales would correlate well with the target and other features.

### Numerical features

In [None]:
num_features = df.columns.drop(["Gender", "Customer Type", "Class", "Type of Travel", "satisfaction"])
num_features

Let's look at the correlation matrix of the numerical features and test the hypothesis of a relationship between "Arrival Delay in Minutes" and "Departure Delay in Minutes":

In [None]:
corr_matrix = df[num_features].corr()
corr_matrix = np.round(corr_matrix, 2)
corr_matrix[np.abs(corr_matrix) < 0.3] = 0
plt.figure(figsize=(15,7))
sns.heatmap(corr_matrix, annot=True, linewidths=.5, cmap='coolwarm');


In [None]:
#sns.pairplot(df[['Cleanliness', 'Food and drink', 'Seat comfort', 'Inflight entertainment']], height=2.5)
#plt.show();

As we previously assumed, "Departure Delay in Minutes" and "Arrival Delay in Minutes" are strongly correlated, so we can drop one of them.
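The single pair of delay columns can also be checked directly, without reading it off a heatmap. The snippet below is a hedged sketch on synthetic delays (the distributions, noise level, and the five injected nulls are assumptions for illustration); `Series.corr` skips rows where either value is missing, so the nulls in the arrival column do not need to be filled first.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two delay columns (not the real data).
rng = np.random.default_rng(0)
departure = rng.exponential(scale=15.0, size=500)
arrival = departure + rng.normal(0.0, 3.0, size=500)
toy = pd.DataFrame({
    "Departure Delay in Minutes": departure,
    "Arrival Delay in Minutes": arrival,
})

# Inject a few missing arrival delays (.loc slicing is inclusive: rows 0..4).
toy.loc[:4, "Arrival Delay in Minutes"] = np.nan

n_missing = toy["Arrival Delay in Minutes"].isnull().sum()

# Pearson correlation; rows with a NaN in either column are ignored.
pair_corr = toy["Departure Delay in Minutes"].corr(toy["Arrival Delay in Minutes"])

print(n_missing, round(pair_corr, 3))
```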

In [None]:
df = df.drop('Arrival Delay in Minutes', axis=1)

__Flight distance__

In [None]:
plt.figure(figsize = (15, 7))
sns.distplot(df['Flight Distance'], fit=norm, color='grey');
fig = plt.figure()
res = stats.probplot(df['Flight Distance'], plot=plt)

The distribution is clearly not normal.
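A visual check like this can be complemented with a number such as sample skewness, which is close to 0 for a symmetric distribution and positive for a long right tail. The following is a minimal sketch on two synthetic samples (both are assumptions for illustration, not columns of the dataset); on the notebook data the same call could be applied to `df['Flight Distance']`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic samples: one roughly normal, one with a long right tail.
symmetric = rng.normal(loc=40.0, scale=15.0, size=5000)
right_skewed = rng.exponential(scale=1000.0, size=5000)

# Sample skewness: near 0 for symmetric data, clearly positive for a right tail.
skew_symmetric = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)

print(round(skew_symmetric, 2), round(skew_right, 2))
```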

_Satisfaction + Flight Distance_

In [None]:
plt.figure(figsize = (15, 7))
fig = sns.kdeplot(df.loc[df['satisfaction'] == 'neutral or dissatisfied', 'Flight Distance'], label="neutral or dissatisfied", color='red');
fig = sns.kdeplot(df.loc[df['satisfaction'] == 'satisfied', 'Flight Distance'], label="satisfied", color='blue');
fig.figure.suptitle("Satisfaction + Flight Distance", fontsize = 16);
plt.xlabel('Flight Distance', fontsize=14);
plt.ylabel('Distribution', fontsize=14);

In [None]:
plt.figure(figsize = (15, 7))
# Each curve describes a subgroup, so both conditions must hold (& rather than |)
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Class'] == 'Business'), 'Flight Distance'], label="Business - neutral or dissatisfied", color='red', linestyle='--');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'satisfied') & (df['Class'] == 'Business'), 'Flight Distance'], label="Business - satisfied", color='blue', linestyle='--');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Class'] != 'Business'), 'Flight Distance'], label="Eco - neutral or dissatisfied", color='red');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'satisfied') & (df['Class'] != 'Business'), 'Flight Distance'], label="Eco - satisfied", color='blue');
fig.figure.suptitle("Satisfaction + Class+ Flight Distance", fontsize = 16);
plt.xlabel('Flight Distance', fontsize=14);
plt.ylabel('Distribution', fontsize=14);

The distribution of satisfied passengers is more even in both classes
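A curve labelled, say, "Business - satisfied" is meant to describe only the rows that meet both conditions. In pandas that is an element-wise AND (`&`); an OR (`|`) would also pull in every satisfied passenger of any class and every Business passenger of any satisfaction level. The toy frame below (four made-up rows, an assumption for illustration) shows the difference.

```python
import pandas as pd

# Four made-up passengers, one per satisfaction/class combination.
toy = pd.DataFrame({
    "satisfaction": ["satisfied", "satisfied",
                     "neutral or dissatisfied", "neutral or dissatisfied"],
    "Class": ["Business", "Eco", "Business", "Eco"],
    "Flight Distance": [2500, 600, 1800, 400],
})

is_satisfied = toy["satisfaction"] == "satisfied"
is_business = toy["Class"] == "Business"

# AND: only satisfied Business passengers.
both = toy.loc[is_satisfied & is_business, "Flight Distance"]
# OR: anyone who is satisfied or flies Business (three of the four rows).
either = toy.loc[is_satisfied | is_business, "Flight Distance"]

print(both.tolist())    # [2500]
print(either.tolist())  # [2500, 600, 1800]
```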

In [None]:
plt.figure(figsize = (15, 7))
# Each curve describes a subgroup, so both conditions must hold (& rather than |)
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Type of Travel'] == 'Personal Travel'), 'Flight Distance'], label="Personal Travel - neutral or dissatisfied", color='red', linestyle='--');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'satisfied') & (df['Type of Travel'] == 'Personal Travel'), 'Flight Distance'], label="Personal Travel - satisfied", color='blue', linestyle='--');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Type of Travel'] != 'Personal Travel'), 'Flight Distance'], label="Business Travel - neutral or dissatisfied", color='red');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'satisfied') & (df['Type of Travel'] != 'Personal Travel'), 'Flight Distance'], label="Business Travel - satisfied", color='blue');
fig.figure.suptitle("Satisfaction + Type of travel + Flight Distance", fontsize = 16);
plt.xlabel('Flight Distance', fontsize=14);
plt.ylabel('Distribution', fontsize=14);

The distribution of dissatisfied passengers on Personal Travel differs from all the others

__Age__

In [None]:
plt.figure(figsize = (15, 7))
sns.distplot(df['Age'], fit=norm, color='grey');
fig = plt.figure()
res = stats.probplot(df['Age'], plot=plt)

The age distribution, by contrast, is fairly close to normal

In [None]:
plt.figure(figsize = (15, 7))
fig = sns.kdeplot(df.loc[df['satisfaction'] == 'neutral or dissatisfied', 'Age'], label="neutral or dissatisfied", color='red');
fig = sns.kdeplot(df.loc[df['satisfaction'] == 'satisfied', 'Age'], label="satisfied", color='blue');
fig.figure.suptitle("Satisfaction + Age", fontsize = 16);
plt.xlabel('Age', fontsize=14);
plt.ylabel('Distribution', fontsize=14);

In [None]:
plt.figure(figsize = (15, 7))
# Each curve describes a subgroup, so both conditions must hold (& rather than |)
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Class'] == 'Business'), 'Age'], label="Business - neutral or dissatisfied", color='red', linestyle='--');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'satisfied') & (df['Class'] == 'Business'), 'Age'], label="Business - satisfied", color='blue', linestyle='--');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Class'] != 'Business'), 'Age'], label="Eco - neutral or dissatisfied", color='red');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'satisfied') & (df['Class'] != 'Business'), 'Age'], label="Eco - satisfied", color='blue');
fig.figure.suptitle("Satisfaction + Class+ Age", fontsize = 16);
plt.xlabel('Age', fontsize=14);
plt.ylabel('Distribution', fontsize=14);

In [None]:
plt.figure(figsize = (15, 7))
# Each curve describes a subgroup, so both conditions must hold (& rather than |)
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Type of Travel'] == 'Personal Travel'), 'Age'], label="Personal Travel - neutral or dissatisfied", color='red', linestyle='--');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'satisfied') & (df['Type of Travel'] == 'Personal Travel'), 'Age'], label="Personal Travel - satisfied", color='blue', linestyle='--');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'neutral or dissatisfied') & (df['Type of Travel'] != 'Personal Travel'), 'Age'], label="Business Travel - neutral or dissatisfied", color='red');
fig = sns.kdeplot(df.loc[(df['satisfaction'] == 'satisfied') & (df['Type of Travel'] != 'Personal Travel'), 'Age'], label="Business Travel - satisfied", color='blue');
fig.figure.suptitle("Satisfaction + Type of travel + Age", fontsize = 16)
plt.xlabel('Age', fontsize=14);
plt.ylabel('Distribution', fontsize=14);

In general, looking at the "Age" distribution across the different categorical features, we can say that the number of satisfied passengers grows from about age 33. We can assume this is because older adults fly business class more often, both on business travel (employers seldom buy young workers a business class ticket) and on personal travel (the older a person is, the more affluent they tend to be).

__Departure Delay in Minutes__

In [None]:
plt.figure(figsize = (15, 7))
sns.distplot(df['Departure Delay in Minutes'], fit=norm, color='grey');
fig = plt.figure()
res = stats.probplot(df['Departure Delay in Minutes'], plot=plt)

In [None]:
plt.figure(figsize = (15, 7))
fig = sns.kdeplot(df.loc[df['satisfaction'] == 'neutral or dissatisfied', 'Departure Delay in Minutes'], label="neutral or dissatisfied", color='red');
fig = sns.kdeplot(df.loc[df['satisfaction'] == 'satisfied', 'Departure Delay in Minutes'], label="satisfied", color='blue');
fig.figure.suptitle("Satisfaction + Departure Delay in Minutes", fontsize = 16);
plt.xlabel('Departure Delay in Minutes', fontsize=14);
plt.ylabel('Distribution', fontsize=14);

There is no big difference between satisfied and dissatisfied people here

__Inflight wifi service, Departure/Arrival time convenient, Ease of Online booking, Gate location, Food and drink, Online boarding, Seat comfort, Inflight entertainment, On-board service, Leg room service, Baggage handling, Checkin service, Inflight service, Cleanliness__

In [None]:
features_0_5 = num_features.drop(["Age", "Flight Distance", "Departure Delay in Minutes", "Arrival Delay in Minutes"])
features_0_5

In [None]:
for feature in features_0_5:
    print(feature, df[feature].unique())

Almost all of these features take values from 0 to 5, so we can plot their distributions on a single graph (for clarity).

In [None]:
plt.figure(figsize = (20, 10))
for feature in features_0_5:    
    fig = sns.lineplot(data=df[feature].value_counts(sort=False), linewidth=2, label=feature)
fig.figure.suptitle("Count + Feature_0_5", fontsize = 16);
plt.xlabel('Value', fontsize=14);
plt.ylabel('Count', fontsize=14);

In [None]:
plt.figure(figsize = (20, 10))
for feature in features_0_5:    
    fig = sns.lineplot(data=df.loc[df['satisfaction'] == 'neutral or dissatisfied', feature].value_counts(sort=False), linewidth=2, label=feature)
fig.figure.suptitle("Count + Feature_0_5 - neutral or dissatisfied ", fontsize = 16);
plt.xlabel('Value', fontsize=14);
plt.ylabel('Count', fontsize=14);

In [None]:
plt.figure(figsize = (20, 10))
for feature in features_0_5:    
    fig = sns.lineplot(data=df.loc[df['satisfaction'] != 'neutral or dissatisfied', feature].value_counts(sort=False), linewidth=2.5, label=feature)
fig.figure.suptitle("Count + Feature_0_5 - satisfied ", fontsize = 16);
plt.xlabel('Value', fontsize=14);
plt.ylabel('Count', fontsize=14);

We can conclude that, for all investigated features, satisfied passengers are concentrated at the values "4" and "5".
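The same comparison can be made numerically with a cross-tabulation that shows, for each rating, what share of passengers ended up in each target class. This is a minimal sketch on a made-up toy frame (the eight rows are an assumption for illustration); on the notebook data, `toy` would be `df` and the column any of `features_0_5`.

```python
import pandas as pd

# Made-up ratings and outcomes for a single 0-5 feature.
toy = pd.DataFrame({
    "Online boarding": [1, 2, 2, 4, 4, 5, 5, 5],
    "satisfaction": [
        "neutral or dissatisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
        "satisfied", "satisfied", "satisfied", "satisfied",
    ],
})

# normalize="index" turns each row (rating) into shares that sum to 1.
share = pd.crosstab(toy["Online boarding"], toy["satisfaction"], normalize="index")

print(share)
```

Reading down the "satisfied" column then shows directly how the share of satisfied passengers changes with the rating.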

## 5. Digitizing categorical features

To build predictive models and analyze the data mathematically, we need to encode the target and the categorical features as numbers. Since the target is binary, we can use logistic regression.

In [None]:
df.info()

__Target__

In [None]:
df['satisfaction'].unique()

In [None]:
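# Assign the result back: an inplace replace on a selected column may not update df in recent pandas versions
df['satisfaction'] = df['satisfaction'].map({'neutral or dissatisfied': 0, 'satisfied': 1})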

In [None]:
df['satisfaction'].unique()

__Gender__

In [None]:
df['Gender'].unique()

In [None]:
# The 0/1 assignment is arbitrary; assign back instead of using an inplace replace
df['Gender'] = df['Gender'].map({'Female': 0, 'Male': 1})

In [None]:
df['Gender'].unique()

__Customer Type__

In [None]:
df['Customer Type'].unique()

In [None]:
df['Customer Type'] = df['Customer Type'].map({'disloyal Customer': 0, 'Loyal Customer': 1})

In [None]:
df['Customer Type'].unique()

__Type of Travel__

In [None]:
df['Type of Travel'].unique()

In [None]:
df['Type of Travel'] = df['Type of Travel'].map({'Personal Travel': 0, 'Business travel': 1})

In [None]:
df['Type of Travel'].unique()

__Class__

In [None]:
df['Class'].unique()

In [None]:
ClassD = pd.get_dummies(df['Class'])
df = pd.concat([df, ClassD],axis =1)

In [None]:
df = df.drop('Class', axis=1)

In [None]:
df.head()
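
For reference, `pd.get_dummies` creates one indicator column per distinct value, so every row has exactly one of the new columns set. The snippet below is a small sketch on a made-up four-row frame (an assumption for illustration); `dtype=int` is optional and only forces 0/1 integers instead of booleans.

```python
import pandas as pd

# Made-up stand-in for df['Class'].
toy = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus", "Eco"]})

# One indicator column per class value; dtype=int yields 0/1 instead of True/False.
class_dummies = pd.get_dummies(toy["Class"], dtype=int)

print(class_dummies)
```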

## 6. PCA

The features describing satisfaction levels (from 0 to 5) can logically be divided into two groups: inflight and boarding.
Using Principal Component Analysis, we can reduce each group to two components and compare the total variance before and after the reduction. (I've heard of the hypothesis that adding such components to a dataset can improve a predictive model.)


__Inflight features__

In [None]:
features_0_5

In [None]:
num_features = df.columns.drop(["Gender", "Customer Type", "Type of Travel", "satisfaction", "Flight Distance", "Departure Delay in Minutes", "Age", "Business", "Eco", "Eco Plus"])
num_features

In [None]:
inflight_features = ['Inflight wifi service', 'Departure/Arrival time convenient', 'Food and drink', 'Seat comfort', 'Inflight entertainment', 'Inflight service', 'Cleanliness']
inflight_features

__Boarding features__

In [None]:
boardinf_features = num_features.drop(inflight_features)
boardinf_features

In [None]:
def reduce_dims(data, dims=2, method='pca'):
    """Project `data` onto `dims` components using PCA or t-SNE."""
    if method == 'pca':
        dim_reducer = PCA(n_components=dims, random_state=42)
    elif method == 'tsne':
        dim_reducer = TSNE(n_components=dims, learning_rate=250, random_state=42)
    else:
        raise ValueError("method must be 'pca' or 'tsne'")

    components = dim_reducer.fit_transform(data)

    # Reuse the input index (if any) so the components align with the original rows
    colnames = ['component_' + str(i) for i in range(1, dims+1)]
    return dim_reducer, pd.DataFrame(data=components, columns=colnames, index=getattr(data, 'index', None))

__Dimensionality reduction of inflight features__

In [None]:
dim_reducer2d, inflight_components_2d = reduce_dims(df[inflight_features], dims=2, method='pca')
inflight_components_2d.head(2)

In [None]:
df[inflight_features].shape, inflight_components_2d.shape

In [None]:
variance_before_inflight_features = np.var(df[inflight_features], axis=0, ddof=1).sum()

In [None]:
variance_after_inflight_features = np.var(inflight_components_2d, axis=0, ddof=1).sum()
variance_after_inflight_features

In [None]:
variance_after_inflight_features / variance_before_inflight_features

The difference between the two variance totals is large: the two components keep only about 65% of the total variance, so by using PCA we drop about 35% of the information.
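The ratio computed by hand above should match what scikit-learn reports in `PCA.explained_variance_ratio_`, summed over the kept components. The following is a minimal sketch on random 0-5 ratings (the toy data and the `rating_i` column names are assumptions for illustration) that compares the two.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random 0-5 "ratings" as a stand-in for one feature group.
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    rng.integers(0, 6, size=(300, 5)),
    columns=[f"rating_{i}" for i in range(5)],
)

pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(toy)

# Manual ratio: total variance of the components over total variance of the inputs.
variance_before = toy.var(axis=0, ddof=1).sum()
variance_after = components.var(axis=0, ddof=1).sum()
manual_ratio = variance_after / variance_before

# The same quantity as reported by scikit-learn.
sklearn_ratio = pca.explained_variance_ratio_.sum()

print(round(manual_ratio, 4), round(sklearn_ratio, 4))
```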

In [None]:
inflight_components_2d

__Dimensionality reduction of boarding features__

In [None]:
dim_reducer2d, boardinf_components_2d = reduce_dims(df[boardinf_features], dims=2, method='pca')
boardinf_components_2d.head(2)

In [None]:
df[boardinf_features].shape, boardinf_components_2d.shape

In [None]:
variance_before_boardinf_features = np.var(df[boardinf_features], axis=0, ddof=1).sum()

In [None]:
variance_after_boardinf_features = np.var(boardinf_components_2d, axis=0, ddof=1).sum()
variance_after_boardinf_features

In [None]:
variance_after_boardinf_features / variance_before_boardinf_features

Here the situation is worse than for the inflight features: we drop about half of the information.

In [None]:
boardinf_components_2d

__Making dataset with PCA components__

In [None]:
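# Copy the frame: a plain assignment would only alias df, so the new columns would be added to df as well
df_with_pca_components = df.copy()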

In [None]:
df_with_pca_components['inflight_component_1'] = inflight_components_2d['component_1']

In [None]:
df_with_pca_components['inflight_component_2'] = inflight_components_2d['component_2']

In [None]:
df_with_pca_components['boarding_component_1'] = boardinf_components_2d['component_1']

In [None]:
df_with_pca_components['boarding_component_2'] = boardinf_components_2d['component_2']

In [None]:
df_with_pca_components.info()

## 7. Prediction

In [None]:
y = df['satisfaction']
X = df.drop('satisfaction', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=42)

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)
print(classification_report(y_test, predictions))
confusion_matrix(y_test, predictions)
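
With default settings, `LogisticRegression` may stop before converging when the features are on very different scales (ratings from 0 to 5 next to distances in the thousands). One possible variant, sketched below on synthetic data from `make_classification` (the data, the scaled-up first column, and `max_iter=1000` are all assumptions for illustration), standardizes the features in a pipeline and also reports ROC AUC from predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the airline dataset).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=42
)
X_toy[:, 0] *= 1000.0  # mimic one column on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, shuffle=True, random_state=42
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

# ROC AUC needs scores or probabilities, not hard class labels.
proba = scaled_model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)

print(round(auc, 3))
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.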

In [None]:
y = df_with_pca_components['satisfaction']
X = df_with_pca_components.drop('satisfaction', axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=42)

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)
print(classification_report(y_test, predictions))
confusion_matrix(y_test, predictions)

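As we can see, the hypothesis that adding new components to a dataset can improve a predictive model doesn't hold here: the prediction results are the same. This is to be expected for a linear model such as logistic regression, because PCA components are linear combinations of features that are already in the dataset.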
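One way to see why a linear model gains nothing from such columns: after centering, each PCA component is an exact linear combination of the original features, so appending the components does not increase the rank of the feature matrix. The sketch below checks this on random toy ratings (the data and the helper `centered_rank` are assumptions for illustration). In practice, regularization and solver convergence can still cause small numerical differences between two fitted models.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 0-5 "ratings" as a stand-in for the original features.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 6, size=(200, 4)).astype(float)

# Append two PCA components computed from the same features.
components = PCA(n_components=2, random_state=42).fit_transform(X_toy)
X_augmented = np.hstack([X_toy, components])


def centered_rank(matrix):
    """Rank after removing column means (the intercept absorbs constant offsets)."""
    return np.linalg.matrix_rank(matrix - matrix.mean(axis=0))


rank_original = centered_rank(X_toy)
rank_augmented = centered_rank(X_augmented)

# Six columns after augmentation, but no new independent directions.
print(X_augmented.shape[1], rank_original, rank_augmented)
```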