# Business Problem:

* Can we group the dimensions of passenger satisfaction into a smaller number of factors?

In order to do that, we'll perform Factor Analysis.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
dftrain = pd.read_csv('train.csv')
dftest = pd.read_csv('test.csv')
df_ori = pd.concat([dftrain, dftest], sort=False) # concatenate the train and test for generalizability
df = df_ori.copy()
df = df.iloc[:,8:24] # keep the 14 rating columns and the 2 delay columns
df = df.dropna() # drop rows with missing values
df

Unnamed: 0,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes
0,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0
1,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0
2,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0
3,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0
4,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25971,3,3,3,1,4,3,4,4,3,2,4,4,5,4,0,0.0
25972,4,4,4,4,4,4,4,4,4,5,5,5,5,4,0,0.0
25973,2,5,1,5,2,1,2,2,4,3,4,5,4,2,0,0.0
25974,3,3,3,3,4,4,4,4,3,2,5,4,5,4,0,0.0


The dataframe above contains the 14 service ratings and the two delay measures that we treat as dimensions of passenger satisfaction.
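Before dropping rows, it can help to see how many values are actually missing in each column. The following is a minimal sketch on a small made-up frame; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two of the survey columns (values are made up)
toy = pd.DataFrame({
    "Seat comfort": [5, 1, 4, 3],
    "Arrival Delay in Minutes": [18.0, np.nan, 0.0, 9.0],
})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
toy_clean = toy.dropna()                # drop every row that contains a NaN

print(missing_per_column)
print(f"rows before: {len(toy)}, rows after: {len(toy_clean)}")
```

Running the same two calls on `df` before the `dropna()` step would show which columns lose rows.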

# Importing Factor Analyzer
Let's install `factor_analyzer`, a Python package for factor analysis.

In [3]:
# install Factor Analyzer into the notebook's environment
%pip install factor_analyzer

Collecting factor_analyzer
  Downloading factor_analyzer-0.3.2.tar.gz (40 kB)
Building wheels for collected packages: factor-analyzer
  Building wheel for factor-analyzer (setup.py) ... done
  Created wheel for factor-analyzer: filename=factor_analyzer-0.3.2-py3-none-any.whl size=40380 sha256=33069bd81d654ab773f55172723ba1460a8941977e99bb4ff439bb77f26480bb
  Stored in directory: /root/.cache/pip/wheels/8d/9e/4c/fd4cb92cecf157b13702cc0907e5c56ddc48e5388134dc9f1a
Successfully built factor-analyzer
Installing collected packages: factor-analyzer
Successfully installed factor-analyzer-0.3.2
Note: you may need to restart the kernel to use updated packages.


In [3]:
from factor_analyzer import FactorAnalyzer # import the installed package into the notebook

# Testing a few Assumptions
1. Bartlett's Test of Sphericity: checks whether the observed variables are intercorrelated at all, by testing the observed correlation matrix against the identity matrix
2. Kaiser-Meyer-Olkin (KMO) Test: measures the suitability of the data for factor analysis

[Source](https://www.datacamp.com/community/tutorials/introduction-factor-analysis)

In [4]:
# Bartlett's

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
calculate_bartlett_sphericity(df)

(1100454.3463635414, 0.0)
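The first value returned is the chi-square statistic and the second is the p-value. As a rough illustration of what the test computes, the statistic can be reproduced from the correlation matrix R with the standard formula chi-square = -(n - 1 - (2p + 5)/6) * ln|R|, where n is the number of rows and p the number of variables. The helper name `bartlett_statistic` and the random data below are assumptions for this sketch; they are not part of `factor_analyzer`.

```python
import numpy as np

def bartlett_statistic(data):
    """Return (chi_square, degrees_of_freedom) for Bartlett's test of sphericity."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)      # p x p correlation matrix
    _, logdet = np.linalg.slogdet(corr)         # log-determinant, numerically stable
    chi_square = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return chi_square, dof

# Synthetic data: three columns that share a common component,
# so they are clearly correlated with each other
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
correlated = base + 0.5 * rng.normal(size=(500, 3))

chi_square, dof = bartlett_statistic(correlated)
print(chi_square, dof)
```

The p-value then comes from the chi-square distribution with p(p - 1)/2 degrees of freedom. A large statistic, as in the output above, means the correlation matrix is far from an identity matrix.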

In [5]:
# KMO

from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(df)
print(kmo_model)

0.7347314786302451


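Both tests indicate that the data are suitable for factor analysis: Bartlett's test is significant (a p-value of 0.0, so the correlation matrix is not an identity matrix), and the overall KMO of 0.73 is above the 0.6 that is commonly treated as the minimum adequate value.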
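For intuition, the overall KMO value compares the squared correlations between variables with their squared partial correlations: the smaller the partial correlations, the closer KMO gets to 1. The following is a minimal NumPy sketch of that standard formula on synthetic data; `kmo_overall` is a hypothetical helper written for this illustration, not the library's implementation.

```python
import numpy as np

def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure computed from the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations are obtained by rescaling the inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)   # mask selecting off-diagonal entries
    r2 = np.sum(corr[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + p2)

# Synthetic data: four columns driven by one common component
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
data = common + 0.5 * rng.normal(size=(500, 4))

kmo_value = kmo_overall(data)
print(kmo_value)
```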

# Identify the Number of Factors

In [6]:
# instantiate the Factor Analyzer
fa = FactorAnalyzer() 

# Fit the dataframe using Factor Analyzer
fa.fit(df)

# Identify the eigenvalues
ev, v = fa.get_eigenvalues() #eigenvalues

# display the eigenvalues
ev

array([3.80389837, 2.3721328 , 2.16956004, 1.96182794, 1.06300589,
       0.94961228, 0.69607891, 0.53723377, 0.51374997, 0.46784182,
       0.36634408, 0.32884042, 0.29332142, 0.25443665, 0.18743275,
       0.03468288])

Five of the eigenvalues above are greater than 1, so by the Kaiser criterion we retain five factors.
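The count can be checked directly on the eigenvalues printed above. This short sketch copies the array from the output; it also shows that the eigenvalues of a correlation matrix sum to the number of variables (16 here), so the retained eigenvalues carry about 71% of the total.

```python
import numpy as np

# Eigenvalues copied from the output above
ev = np.array([3.80389837, 2.3721328, 2.16956004, 1.96182794, 1.06300589,
               0.94961228, 0.69607891, 0.53723377, 0.51374997, 0.46784182,
               0.36634408, 0.32884042, 0.29332142, 0.25443665, 0.18743275,
               0.03468288])

n_factors = int((ev > 1).sum())               # Kaiser criterion: eigenvalues above 1
explained = ev[:n_factors].sum() / ev.sum()   # share of the total carried by those eigenvalues

print(n_factors)   # 5
print(round(ev.sum(), 2), round(explained, 3))
```

Note that the sixth eigenvalue (0.95) falls just below the threshold, so a six-factor solution is a natural alternative to compare.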

In [7]:
# Refit the factor analyzer with five factors and a varimax rotation, then print the factor loadings for each variable.
fa = FactorAnalyzer(n_factors=5, rotation='varimax')
fa.fit(df)
print(fa.loadings_)

[[ 9.51222546e-02  1.34785546e-01 -9.00112235e-03  6.14101713e-01
   4.65372148e-01]
 [-9.57623180e-03  5.54625878e-02 -2.96640256e-04  5.89526400e-01
  -6.49139954e-03]
 [-3.22972077e-02  3.10800908e-02 -2.35280251e-03  7.72955085e-01
   4.48606165e-01]
 [ 1.25845037e-02 -4.67148475e-02  4.77284147e-03  6.82653008e-01
  -1.11332163e-01]
 [ 7.70829828e-01  4.10135188e-03 -1.80185151e-02  3.06650437e-02
   3.46802833e-02]
 [ 2.89549138e-01  1.22384915e-01 -9.53510997e-03  1.08246330e-01
   7.54004531e-01]
 [ 7.56387764e-01  7.95257133e-02 -1.38440024e-02 -2.64576490e-02
   2.09396959e-01]
 [ 7.67525558e-01  4.66055426e-01 -7.83337084e-03  4.09450708e-02
   2.32560644e-02]
 [ 8.52709959e-02  7.01341669e-01 -1.92813548e-02  1.03362675e-02
   4.71336116e-02]
 [ 5.78302900e-02  4.86147655e-01  2.34399007e-02  4.31284274e-02
   9.26339940e-02]
 [ 3.64249934e-02  7.64505665e-01  6.93860766e-03  4.62035699e-02
  -3.50870358e-02]
 [ 1.13416171e-01  2.87751429e-01 -1.30508044e-02 -2.70974644e-02

The raw array above is hard to read, so we put the loadings into a labeled dataframe.

In [8]:
lmatrix = pd.DataFrame(fa.loadings_, index = list(df.columns), columns = ['Factor 1', 'Factor 2', 'Factor 3', 'Factor 4', 'Factor 5'])
lmatrix #loading matrix

Unnamed: 0,Factor 1,Factor 2,Factor 3,Factor 4,Factor 5
Inflight wifi service,0.095122,0.134786,-0.009001,0.614102,0.465372
Departure/Arrival time convenient,-0.009576,0.055463,-0.000297,0.589526,-0.006491
Ease of Online booking,-0.032297,0.03108,-0.002353,0.772955,0.448606
Gate location,0.012585,-0.046715,0.004773,0.682653,-0.111332
Food and drink,0.77083,0.004101,-0.018019,0.030665,0.03468
Online boarding,0.289549,0.122385,-0.009535,0.108246,0.754005
Seat comfort,0.756388,0.079526,-0.013844,-0.026458,0.209397
Inflight entertainment,0.767526,0.466055,-0.007833,0.040945,0.023256
On-board service,0.085271,0.701342,-0.019281,0.010336,0.047134
Leg room service,0.05783,0.486148,0.02344,0.043128,0.092634
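One way to read a loading matrix is through communalities: the sum of squared loadings in a row is the share of that variable's variance captured by the five factors. The sketch below uses four rows copied from the loading tables in this notebook; the frame name `sample` is an assumption for illustration. `FactorAnalyzer` also has a `get_communalities()` method that should give the same quantity for the fitted model.

```python
import pandas as pd

factors = ["Factor 1", "Factor 2", "Factor 3", "Factor 4", "Factor 5"]

# Four rows copied from the loading tables in this notebook
sample = pd.DataFrame.from_dict({
    "Cleanliness":              [0.854195, 0.084949, 0.000647, -0.001291, 0.097845],
    "Online boarding":          [0.289549, 0.122385, -0.009535, 0.108246, 0.754005],
    "Checkin service":          [0.113416, 0.287751, -0.013051, -0.027097, 0.133295],
    "Arrival Delay in Minutes": [-0.017345, -0.01942, 0.995885, -0.0008, -0.008277],
}, orient="index", columns=factors)

# Communality: sum of squared loadings across the factors, per variable
communality = (sample ** 2).sum(axis=1)
print(communality.round(3))
```

A low communality, as for Checkin service here, suggests the variable is only weakly described by the extracted factors.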


Let's sort the loading matrix by each factor in turn, using a cut-off value of 0.2 for the loadings.

In [9]:
lmatrix.sort_values('Factor 1', ascending=False)

Unnamed: 0,Factor 1,Factor 2,Factor 3,Factor 4,Factor 5
Cleanliness,0.854195,0.084949,0.000647,-0.001291,0.097845
Food and drink,0.77083,0.004101,-0.018019,0.030665,0.03468
Inflight entertainment,0.767526,0.466055,-0.007833,0.040945,0.023256
Seat comfort,0.756388,0.079526,-0.013844,-0.026458,0.209397
Online boarding,0.289549,0.122385,-0.009535,0.108246,0.754005
Checkin service,0.113416,0.287751,-0.013051,-0.027097,0.133295
Inflight wifi service,0.095122,0.134786,-0.009001,0.614102,0.465372
On-board service,0.085271,0.701342,-0.019281,0.010336,0.047134
Leg room service,0.05783,0.486148,0.02344,0.043128,0.092634
Baggage handling,0.036425,0.764506,0.006939,0.046204,-0.035087


In [10]:
lmatrix.sort_values('Factor 2', ascending=False)

Unnamed: 0,Factor 1,Factor 2,Factor 3,Factor 4,Factor 5
Inflight service,0.035749,0.799371,-0.044377,0.046369,-0.058022
Baggage handling,0.036425,0.764506,0.006939,0.046204,-0.035087
On-board service,0.085271,0.701342,-0.019281,0.010336,0.047134
Leg room service,0.05783,0.486148,0.02344,0.043128,0.092634
Inflight entertainment,0.767526,0.466055,-0.007833,0.040945,0.023256
Checkin service,0.113416,0.287751,-0.013051,-0.027097,0.133295
Inflight wifi service,0.095122,0.134786,-0.009001,0.614102,0.465372
Online boarding,0.289549,0.122385,-0.009535,0.108246,0.754005
Cleanliness,0.854195,0.084949,0.000647,-0.001291,0.097845
Seat comfort,0.756388,0.079526,-0.013844,-0.026458,0.209397


In [11]:
lmatrix.sort_values('Factor 3', ascending=False)

Unnamed: 0,Factor 1,Factor 2,Factor 3,Factor 4,Factor 5
Arrival Delay in Minutes,-0.017345,-0.01942,0.995885,-0.0008,-0.008277
Departure Delay in Minutes,-0.01568,-0.014231,0.968664,9.1e-05,-0.006186
Leg room service,0.05783,0.486148,0.02344,0.043128,0.092634
Baggage handling,0.036425,0.764506,0.006939,0.046204,-0.035087
Gate location,0.012585,-0.046715,0.004773,0.682653,-0.111332
Cleanliness,0.854195,0.084949,0.000647,-0.001291,0.097845
Departure/Arrival time convenient,-0.009576,0.055463,-0.000297,0.589526,-0.006491
Ease of Online booking,-0.032297,0.03108,-0.002353,0.772955,0.448606
Inflight entertainment,0.767526,0.466055,-0.007833,0.040945,0.023256
Inflight wifi service,0.095122,0.134786,-0.009001,0.614102,0.465372


In [12]:
lmatrix.sort_values('Factor 4', ascending=False)

Unnamed: 0,Factor 1,Factor 2,Factor 3,Factor 4,Factor 5
Ease of Online booking,-0.032297,0.03108,-0.002353,0.772955,0.448606
Gate location,0.012585,-0.046715,0.004773,0.682653,-0.111332
Inflight wifi service,0.095122,0.134786,-0.009001,0.614102,0.465372
Departure/Arrival time convenient,-0.009576,0.055463,-0.000297,0.589526,-0.006491
Online boarding,0.289549,0.122385,-0.009535,0.108246,0.754005
Inflight service,0.035749,0.799371,-0.044377,0.046369,-0.058022
Baggage handling,0.036425,0.764506,0.006939,0.046204,-0.035087
Leg room service,0.05783,0.486148,0.02344,0.043128,0.092634
Inflight entertainment,0.767526,0.466055,-0.007833,0.040945,0.023256
Food and drink,0.77083,0.004101,-0.018019,0.030665,0.03468


In [13]:
lmatrix.sort_values('Factor 5', ascending=False)

Unnamed: 0,Factor 1,Factor 2,Factor 3,Factor 4,Factor 5
Online boarding,0.289549,0.122385,-0.009535,0.108246,0.754005
Inflight wifi service,0.095122,0.134786,-0.009001,0.614102,0.465372
Ease of Online booking,-0.032297,0.03108,-0.002353,0.772955,0.448606
Seat comfort,0.756388,0.079526,-0.013844,-0.026458,0.209397
Checkin service,0.113416,0.287751,-0.013051,-0.027097,0.133295
Cleanliness,0.854195,0.084949,0.000647,-0.001291,0.097845
Leg room service,0.05783,0.486148,0.02344,0.043128,0.092634
On-board service,0.085271,0.701342,-0.019281,0.010336,0.047134
Food and drink,0.77083,0.004101,-0.018019,0.030665,0.03468
Inflight entertainment,0.767526,0.466055,-0.007833,0.040945,0.023256


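Assigning each variable to the factor on which it loads most strongly gives the following grouping:

* Factor 1: Cleanliness, Food and drink, Inflight entertainment, Seat comfort
* Factor 2: Inflight service, Baggage handling, On-board service, Leg room service
* Factor 3: Arrival Delay and Departure Delay
* Factor 4: Ease of Online booking, Gate location, Inflight wifi service, Departure/Arrival time convenient
* Factor 5: Online boarding

Checkin service is not listed because its highest loading (0.29, on Factor 2) is weak. Inflight wifi service and Ease of Online booking also load moderately on Factor 5 (0.47 and 0.45).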
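The grouping above can also be derived programmatically instead of by reading five sorted tables. The sketch below rebuilds the loading matrix from the values shown in this notebook, picks the factor with the largest absolute loading for each variable, and keeps only variables whose peak loading reaches a threshold. The threshold of 0.4 and the names `strongest`, `peak`, and `groups` are assumptions chosen for this illustration.

```python
import pandas as pd

factors = ["Factor 1", "Factor 2", "Factor 3", "Factor 4", "Factor 5"]

# Loadings copied from the tables in this notebook
rows = {
    "Inflight wifi service":             [0.095122, 0.134786, -0.009001, 0.614102, 0.465372],
    "Departure/Arrival time convenient": [-0.009576, 0.055463, -0.000297, 0.589526, -0.006491],
    "Ease of Online booking":            [-0.032297, 0.03108, -0.002353, 0.772955, 0.448606],
    "Gate location":                     [0.012585, -0.046715, 0.004773, 0.682653, -0.111332],
    "Food and drink":                    [0.77083, 0.004101, -0.018019, 0.030665, 0.03468],
    "Online boarding":                   [0.289549, 0.122385, -0.009535, 0.108246, 0.754005],
    "Seat comfort":                      [0.756388, 0.079526, -0.013844, -0.026458, 0.209397],
    "Inflight entertainment":            [0.767526, 0.466055, -0.007833, 0.040945, 0.023256],
    "On-board service":                  [0.085271, 0.701342, -0.019281, 0.010336, 0.047134],
    "Leg room service":                  [0.05783, 0.486148, 0.02344, 0.043128, 0.092634],
    "Baggage handling":                  [0.036425, 0.764506, 0.006939, 0.046204, -0.035087],
    "Checkin service":                   [0.113416, 0.287751, -0.013051, -0.027097, 0.133295],
    "Inflight service":                  [0.035749, 0.799371, -0.044377, 0.046369, -0.058022],
    "Cleanliness":                       [0.854195, 0.084949, 0.000647, -0.001291, 0.097845],
    "Departure Delay in Minutes":        [-0.01568, -0.014231, 0.968664, 9.1e-05, -0.006186],
    "Arrival Delay in Minutes":          [-0.017345, -0.01942, 0.995885, -0.0008, -0.008277],
}
loadings = pd.DataFrame.from_dict(rows, orient="index", columns=factors)

strongest = loadings.abs().idxmax(axis=1)   # factor with the largest absolute loading
peak = loadings.abs().max(axis=1)           # size of that loading

threshold = 0.4  # hypothetical cut-off for this sketch
groups = {
    factor: strongest[(strongest == factor) & (peak >= threshold)].index.tolist()
    for factor in factors
}
for factor, members in groups.items():
    print(factor, members)
```

With this threshold every variable except Checkin service is assigned to exactly one factor; lowering the threshold to 0.2 would place Checkin service in Factor 2.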

Of course, we could try other numbers of factors, but I will stop here and leave further exploration to you.