The following notebook presents a variety of functions useful for exploratory data analysis, as well as a Factor Analysis, all using the "Airline Passenger Satisfaction" dataset. Only the train dataset will be used here. Let's begin with the exploratory data analysis (EDA).    

# **1. EXPLORATORY DATA ANALYSIS**

In [None]:
#Packages
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
!pip install factor_analyzer  
from factor_analyzer import FactorAnalyzer

In [None]:
#Import dataset
df = pd.read_csv("../input/airline-passenger-satisfaction/train.csv")

Let's start by looking at the head of the data to see what we have at hand.

In [None]:
df.head()

We have 25 columns. The first two columns don't seem to contain any important information for our purpose. We can therefore remove them.

In [None]:
df.drop(['Unnamed: 0', 'id'], axis=1, inplace=True)

The describe function can also help in giving a proper overview of the data. 

In [None]:
df.describe()

We see that our dataset contains 103,904 observations. Of the 23 columns we currently have, 14 seem to be representing responses, on a scale of 1 to 5, to a survey evaluating different aspects of the flights (Inflight wifi service, food and drink, online boarding, seat comfort, etc). These 14 columns will be very important for our upcoming factor analysis. Let's now look for missing values. 

In [None]:
sns.heatmap(df.isnull(), cbar=False)
df.isnull().sum()

Almost every column is complete. The only missing values are 310 NaN in the "Arrival Delay in Minutes" column. Before we decide what to do about those, let's look at a correlation plot of all the variables.

In [None]:
plt.figure(figsize=(20,10))
#numeric_only=True (pandas >= 1.5) skips text columns such as Gender or Class
c = df.corr(numeric_only=True)
sns.heatmap(c)

Some variables are quite highly correlated, especially the ones that seem to be answers to a survey. However, what really stands out is the extremely high correlation (0.98) between "Departure Delay in Minutes" and "Arrival Delay in Minutes". That makes sense: if the plane leaves later than expected, it should arrive later as well. Considering this high correlation and the fact that we had 310 NaN in the "Arrival Delay in Minutes" column, I decided to simply remove that column from the dataset.

In [None]:
df.drop(['Arrival Delay in Minutes'], axis=1, inplace=True)
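
The decision above rests on two checks: how many values are missing per column, and which pairs of variables are most strongly correlated. Here is a minimal sketch of both checks on a toy dataframe. The column names and values are made up for illustration and are not taken from the airline dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the airline dataset).
toy = pd.DataFrame({
    "dep_delay": [0, 5, 10, 30, 60, 120],
    "arr_delay": [0, 4, 12, np.nan, 58, 125],
    "distance": [300, 1200, 450, 800, 2500, 600],
})

# Number of missing values per column.
missing = toy.isnull().sum()

# Absolute correlations; keep only the upper triangle so each pair appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

print(missing)
print(pairs.head())
```

Sorting the pairs makes it easy to spot a nearly redundant column, which is the argument used here for dropping the one that also contains the missing values.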

We can now check for duplicates, i.e. rows that would be identical. It is important to do so to ensure that all our information is necessary and relevant. 

In [None]:
print(df[df.duplicated()])

It prints an empty dataframe, which means that we have no duplicates in our dataset. We can now look more closely at the "satisfaction" variable, since it is in reality the one that really separates the observations. 

In [None]:
df['satisfaction'].describe()

We only have 2 classes (unique=2). Looking back at the head of the table, we see that we have satisfied customers, and neutral or dissatisfied ones. That second class is the dominant one, as shown above. I will transform the variable into a binary one, to facilitate future analysis.

In [None]:
#Assign the result back instead of relying on an in-place replace on an attribute
df['satisfaction'] = df['satisfaction'].map({'satisfied': 1, 'neutral or dissatisfied': 0})
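
One thing worth checking after a recoding like this is whether every label was actually converted. The sketch below uses a toy Series with a deliberately misspelled label (an assumption for illustration only) and `Series.map`, which turns any label missing from the mapping into NaN so that it is easy to detect.

```python
import pandas as pd

# Toy labels (hypothetical), including one unexpected spelling.
labels = pd.Series(["satisfied", "neutral or dissatisfied", "Satisfied"])

mapping = {"satisfied": 1, "neutral or dissatisfied": 0}
encoded = labels.map(mapping)  # labels missing from the mapping become NaN

# Original labels that were not converted.
unmapped = labels[encoded.isna()].unique()

print(encoded.tolist())
print(unmapped)
```

An empty `unmapped` array would mean that all labels were recoded.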

We can now start asking really interesting questions. Let's first look at the average score for each class, on the 14 variables that were surveyed.

In [None]:
eco = df[df['Class']=='Eco'][df.columns[6:20]].mean().mean()
eco_plus = df[df['Class']=='Eco Plus'][df.columns[6:20]].mean().mean()
business = df[df['Class']=='Business'][df.columns[6:20]].mean().mean()
print(eco, eco_plus, business)

As expected, Business class was better rated. Eco and Eco Plus got practically the same grade, which suggests that customers who paid for Eco Plus don't feel they got their money's worth. Let's now look at how each variable was rated, for each class.

In [None]:
df.groupby('Class')[df.columns[6:20]].mean()

This gives some great insight into what could be improved by the company. For instance, we see that wifi was poorly evaluated across all classes. We also see that online boarding was much less convenient for customers not in Business class. Let's now look at overall satisfaction, across classes.

In [None]:
plt.subplot(1,2,1)
df.Class.value_counts().plot(kind='bar', figsize=(10,5))
plt.title('Observations per class')
plt.subplot(1,2,2)
df[df['satisfaction']==0].Class.value_counts().plot(kind='bar', figsize=(10,5))
plt.title('Neutral or dissatisfied per class')

Results are clear. Customers in the "Eco" class are not as numerous as ones in the business class but they still had a very large chunk of unhappy customers. To put exact numbers to it:

In [None]:
eco_proportion = len(df[df['Class']=='Eco'])/len(df)
#Count only the Eco rows that are also dissatisfied (both conditions in one mask)
bad_proportion = len(df[(df['Class']=='Eco') & (df['satisfaction']==0)])/len(df[df['satisfaction']==0])
print(eco_proportion, bad_proportion)

The "Eco" class customers accounted for about 45% of total customers, but for a noticeably larger share of the unhappy ones (the second value printed above). Let's look again at the 14 surveyed variables, just for the Eco class.

In [None]:
df[df['Class']=='Eco'][df.columns[6:20]].mean()

A possible recommendation could be to simply improve the wifi quality, or make online booking easier to use for the "Eco" class customers. This kind of simple EDA helps give a good portrayal of the situation.
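
The class comparisons made in this section boil down to three numbers per class: its share of all customers, its share of dissatisfied customers, and its own dissatisfaction rate. A minimal sketch on a toy dataframe (made-up rows, assuming `satisfaction` is coded 0 for neutral or dissatisfied and 1 for satisfied):

```python
import pandas as pd

# Toy data (hypothetical rows, not the airline dataset).
toy = pd.DataFrame({
    "Class": ["Eco", "Eco", "Eco", "Eco Plus",
              "Business", "Business", "Business", "Business"],
    "satisfaction": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Share of each class among all customers.
class_share = toy["Class"].value_counts(normalize=True)

# Share of each class among dissatisfied customers only (satisfaction == 0).
dissatisfied_share = toy.loc[toy["satisfaction"] == 0, "Class"].value_counts(normalize=True)

# Dissatisfaction rate within each class (mean of a True/False flag).
rate_by_class = toy["satisfaction"].eq(0).groupby(toy["Class"]).mean()

print(class_share)
print(dissatisfied_share)
print(rate_by_class)
```

Comparing `class_share` with `dissatisfied_share` shows whether a class is over-represented among unhappy customers, while `rate_by_class` answers the related question of how likely a customer of a given class is to be unhappy.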

# **2. FACTOR ANALYSIS**

The idea of factor analysis is to describe the variability among correlated variables using fewer variables, called factors. It is based on the idea that some "latent" variables exist, which cannot be measured with a single observed variable. For instance, instead of having multiple variables to describe the price (price of ticket, price of extra baggage, price of food, etc.), we could work with one latent variable called price.
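
To make the idea of a latent variable concrete, here is a small simulation. Everything in it is an assumption for illustration: the variable names, the loadings (0.9, 0.8, 0.7) and the noise levels are invented. One hidden "price level" drives three observed price variables, which end up strongly correlated with each other even though the hidden variable is never observed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# One hidden ("latent") variable that is never observed directly.
price_level = rng.normal(size=n)

# Three observed variables, each = loading * latent variable + independent noise.
observed = pd.DataFrame({
    "ticket_price": 0.9 * price_level + 0.3 * rng.normal(size=n),
    "baggage_price": 0.8 * price_level + 0.4 * rng.normal(size=n),
    "food_price": 0.7 * price_level + 0.5 * rng.normal(size=n),
})

# The shared latent variable shows up as high pairwise correlations.
corr = observed.corr()
print(corr.round(2))
```

Factor analysis works in the opposite direction: it starts from a correlation matrix like this one and tries to recover a small number of latent variables that could have produced it.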

So can our 14 variables be summarized by fewer latent variables (factors)? Let's find out. To figure out how many factors we need, we can look at eigenvalues, which measure how much of the variables' variance a factor explains. An eigenvalue greater than one means that the factor explains more variance than a single variable does.
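
As a rough illustration of that rule, the sketch below computes the eigenvalues of a correlation matrix directly with NumPy. The data is simulated under an assumed setup (two latent factors driving six variables), so the numbers say nothing about the airline dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Toy data: two latent factors drive six observed variables (hypothetical setup).
f1, f2 = rng.normal(size=n), rng.normal(size=n)
noise = rng.normal(scale=0.5, size=(n, 6))
data = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# The eigenvalues of a correlation matrix sum to the number of variables,
# so a value above 1 means "more than one variable's worth" of variance.
n_keep = int((eigenvalues > 1).sum())

print(eigenvalues.round(2))
print(n_keep)
```

The "eigenvalue greater than one" rule is only a heuristic. Looking at where the eigenvalue curve flattens out is a common complement to it.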

In [None]:
#Subset of the data
x = df[df.columns[6:20]]

#No rotation is needed just to inspect the eigenvalues
fa = FactorAnalyzer(rotation=None)
fa.fit(x)

#Get eigenvalues and plot them (scree plot)
ev, v = fa.get_eigenvalues()
print(ev)
plt.plot(range(1,x.shape[1]+1),ev)

We will only use 3 factors here, given the big dropoff in eigenvalue after the 3rd factor. Let's see what factors are created, and what variables they contain. A loading cutoff of 0.5 will be used here. 

In [None]:
fa = FactorAnalyzer(n_factors=3, rotation='varimax')
fa.fit(x)
loads = fa.loadings_
print(loads)
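
A raw array of loadings is hard to read, so it can help to label it and apply the cutoff explicitly. The sketch below does this on a small made-up loadings matrix; the variable names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical loadings for 4 variables on 2 factors (made-up numbers).
loads = np.array([
    [0.78, 0.10],
    [0.65, 0.22],
    [0.05, 0.81],
    [0.30, 0.41],
])
variables = ["var_a", "var_b", "var_c", "var_d"]

table = pd.DataFrame(loads, index=variables, columns=["Factor1", "Factor2"])

cutoff = 0.5
# Keep loadings whose absolute value reaches the cutoff; the rest become NaN.
strong = table.where(table.abs() >= cutoff)

# For each factor, list the variables assigned to it.
members = {col: strong[col].dropna().index.tolist() for col in strong.columns}
print(members)
```

In this notebook the same idea would apply to `fa.loadings_` with `x.columns` as the row labels. Variables that stay below the cutoff on every factor, like `var_d` in the toy example, are simply not assigned to any factor.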

Here are the 3 factors, the variables they contain and a possible interpretation of each:
1. **Comfort**: Food and drink, Seat comfort, Inflight entertainment, Cleanliness
2. **Service**: On-board service, Baggage handling, Inflight service
3. **Convenience**: Inflight wifi service, Departure/Arrival time convenient, Ease of Online booking, Gate location

Now, that's great, but how do we know if our factors are any good? Well, Cronbach's alpha can be used to measure whether or not the variables of a factor form a "coherent" factor. A value above 0.6 for the alpha is in practice deemed acceptable. Here is the code to get Cronbach's alpha using the pingouin package.

In [None]:
!pip install pingouin
import pingouin as pg

In [None]:
#Create factors
factor1 = df[['Food and drink', 'Seat comfort', 'Inflight entertainment', 'Cleanliness']]
factor2 = df[['On-board service', 'Baggage handling', 'Inflight service']]
factor3 = df[['Inflight wifi service', 'Departure/Arrival time convenient', 'Ease of Online booking', 'Gate location']]

#Get cronbach alpha
factor1_alpha = pg.cronbach_alpha(factor1)
factor2_alpha = pg.cronbach_alpha(factor2)
factor3_alpha = pg.cronbach_alpha(factor3)

print(factor1_alpha, factor2_alpha, factor3_alpha)
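
For readers curious about what is being measured, the textbook formula for Cronbach's alpha is k/(k-1) * (1 - sum of item variances / variance of the total score), where k is the number of items. Below is a minimal by-hand sketch of that formula on toy survey answers. The helper name `cronbach_alpha_by_hand` and the data are invented for illustration, and this is not a claim about how pingouin implements it internally.

```python
import pandas as pd

def cronbach_alpha_by_hand(items: pd.DataFrame) -> float:
    """Textbook Cronbach's alpha for a DataFrame with one column per item."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed score
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Toy answers from 5 respondents to 3 related questions (hypothetical data).
toy = pd.DataFrame({
    "q1": [1, 2, 3, 4, 5],
    "q2": [2, 2, 3, 5, 5],
    "q3": [1, 3, 3, 4, 4],
})
alpha = cronbach_alpha_by_hand(toy)
print(round(alpha, 3))
```

The more the items move together, the larger the variance of the total score is relative to the individual item variances, and the closer alpha gets to 1.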

The alphas are evaluated at 0.87, 0.79 and 0.76, all above the 0.6 threshold, which indicates that the factors are coherent and useful. We could use these new factors as variables for other analyses or for prediction. In this notebook, we will leave it at that.
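
For readers who do want to take that next step, one simple option is to turn each factor into a single score by averaging its items. This is only a sketch under that assumption, on made-up answers and with two items per factor. Averaging gives every item the same weight, which is a simplification compared with scores derived from the factor loadings.

```python
import pandas as pd

# Toy answers (hypothetical values) for a few of the survey columns.
toy = pd.DataFrame({
    "Food and drink": [5, 2, 4],
    "Seat comfort": [4, 1, 4],
    "On-board service": [3, 5, 2],
    "Baggage handling": [4, 5, 1],
})

# Items assigned to each factor (shortened lists, for illustration only).
factor_items = {
    "comfort": ["Food and drink", "Seat comfort"],
    "service": ["On-board service", "Baggage handling"],
}

# One column per factor: the row-wise mean of its items.
scores = pd.DataFrame(
    {name: toy[cols].mean(axis=1) for name, cols in factor_items.items()}
)
print(scores)
```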



Thanks a lot for reading!