Import the relevant libraries and create a random state variable to be used throughout the notebook:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
%matplotlib inline
rstate = 42 #establish a fixed random state

### Exploring the data

Load the data from csv to dataframes and check out some basic properties:

In [None]:
directory = "/kaggle/input/airline-passenger-satisfaction/"
feature_tables = ['train.csv', 'test.csv']
df_train_str = directory + feature_tables[0]
df_test_str = directory + feature_tables[1]
df_train = pd.read_csv(df_train_str)
df_test = pd.read_csv(df_test_str)

In [None]:
df_train.sample(5, random_state=rstate)

In [None]:
df_train.shape #the number of records is rather high, so sampling will probably be required to avoid overloading the processor

In [None]:
df_train.info()

The training data is almost ready for analysis as is. The only nulls are in "Arrival Delay in Minutes", and since there are so few of them (about 300 out of 100K records), these records can simply be removed. We can also see that some columns have the object dtype because they hold categorical variables. They will need to be converted to dummy (indicator) variables later on.

In [None]:
df_train.isnull().sum()
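
As a minimal sketch of the clean-up described above, the snippet below counts missing values per column and drops the affected rows. The `toy` frame and its values are made up for illustration and only stand in for `df_train`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 12, 5, 30],
    "Arrival Delay in Minutes": [0.0, np.nan, 7.0, 28.0],
})

null_counts = toy.isnull().sum()   # number of nulls per column
null_share = toy.isnull().mean()   # fraction of nulls per column
toy_clean = toy.dropna(axis=0)     # drop every row that contains a null

print(null_counts)
print(null_share)
print(toy_clean.shape)
```

Looking at the share of nulls rather than the raw count makes it easier to judge whether dropping rows is acceptable.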

The same applies for the test data:

In [None]:
df_test.info()

We can see that the mean of every survey category is between 2 and 3.5, with a standard deviation of 1 to 1.5. Hence, the categories are rather balanced. The flight distance has a very high variance, so it should be treated carefully. The mean and median age are both around 40, so nothing unusual there either. "id" & "Unnamed..." carry little information.

In [None]:
df_train.describe()

Once we plot the initial correlation matrix (before the data is cleaned), we immediately notice a pattern of 3 correlated chunks with 4 categories each:
1. Comfort related categories: seat comfort, food and drink etc.
2. Flight order related categories: wifi, ease of online booking etc.
3. Service related categories: leg room, on-board service etc.

Also, the departure and arrival delays appear to be heavily correlated.

In [None]:
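plt.figure(figsize=(20,15))
sns.heatmap(df_train.corr(numeric_only=True),annot=True,cmap='YlGnBu') #numeric_only skips the categorical (object) columns
plt.tight_layout() #the parentheses are required for the call to take effect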
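
Reading correlated pairs off a large heatmap can be error-prone, so here is a small sketch that ranks the pairs numerically instead. The data is synthetic and the column names (`dep_delay`, `arr_delay`, `age`, `gender`) are placeholders, not the real table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "dep_delay": base,
    "arr_delay": base + rng.normal(scale=0.1, size=200),  # nearly a copy of dep_delay
    "age": rng.normal(size=200),                          # independent noise
    "gender": ["F", "M"] * 100,                           # non-numeric, excluded below
})

corr = toy.corr(numeric_only=True)
# Keep only the upper triangle so that every pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
pairs = pairs.sort_values(key=np.abs, ascending=False)

print(pairs.head(3))
```

Sorting by absolute value keeps strong negative correlations near the top as well.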

Since we will eventually use the data for supervised learning tasks, it is important to check whether the labels are balanced. If they are not, the training process has to be adjusted to minimize the bias caused by uneven labeling. As can be seen, the labeled train set is balanced, so no further action is needed.

In [None]:
sns.countplot(x='satisfaction',data=df_train)
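
The balance can also be checked numerically. The sketch below uses made-up label counts (43 vs 57), which are an assumption for illustration and not the real class sizes:

```python
import pandas as pd

# Hypothetical label column with made-up counts
labels = pd.Series(
    ["satisfied"] * 43 + ["neutral or dissatisfied"] * 57,
    name="satisfaction",
)

shares = labels.value_counts(normalize=True)    # class proportions
imbalance_ratio = shares.max() / shares.min()   # 1.0 means perfectly balanced

print(shares)
print(round(imbalance_ratio, 2))
```

On the real data the same two lines would be applied to `df_train['satisfaction']`.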

In order for the labels to be used in the training, we need to convert them to binary. "satisfied" will be assigned 1 and all other values 0:

In [None]:
d = {'satisfied': True, 'neutral or dissatisfied': False} #create a dictionary to use map on
df_train['sat_label'] = df_train['satisfaction'].map(d) #map the values according to the dictionary to a new column
df_train.drop('satisfaction',inplace=True,axis=1) #erase old column
df_train["sat_label"] = df_train["sat_label"].astype(int) #convert to int to use later in models

In [None]:
#convert the same way for the testset
df_test['sat_label'] = df_test['satisfaction'].map(d)
df_test.drop('satisfaction',inplace=True,axis=1)
df_test["sat_label"] = df_test["sat_label"].astype(int)
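
Since the same conversion is applied to both sets, it could be wrapped in a helper. The sketch below is one possible way to do it; `encode_satisfaction` is a hypothetical name, and the check for unmapped values is an addition that the cells above do not perform:

```python
import pandas as pd

def encode_satisfaction(df, mapping):
    """Return a copy of df where 'satisfaction' is replaced by an integer 'sat_label'."""
    out = df.copy()
    mapped = out["satisfaction"].map(mapping)
    if mapped.isnull().any():
        # map() yields NaN for any value that is missing from the dictionary
        raise ValueError("unmapped satisfaction values found")
    out["sat_label"] = mapped.astype(int)
    return out.drop(columns="satisfaction")

mapping = {"satisfied": True, "neutral or dissatisfied": False}
toy = pd.DataFrame({"satisfaction": ["satisfied", "neutral or dissatisfied", "satisfied"]})
encoded = encode_satisfaction(toy, mapping)
print(encoded)
```

Failing loudly on an unexpected label avoids silently propagating NaN values into the `astype(int)` step.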

Make sure the change was applied:

In [None]:
df_train['sat_label'].value_counts()

In [None]:
df_test['sat_label'].value_counts()

Display the correlation differently, highlighting the features most correlated with the satisfaction label. It is easy to see that online boarding takes the lead with a correlation of about 0.5, while another ~10 features have correlations of 0.2-0.4. This is not a bad start. If a single feature is this strongly correlated with the label, prediction will probably be very solid.

In [None]:
df_train.corr(numeric_only=True)['sat_label'].sort_values().drop('sat_label').plot(kind='bar') #numeric_only skips the categorical columns that are not encoded yet

Further inspection of the strongest predictor shows nothing out of the ordinary. 4 is the most frequent score, which is quite high, and it suggests that satisfied customers tend to rank "Online boarding" at 4 or 5.

In [None]:
df_train['Online boarding'].value_counts() #counts per score, most frequent first

Here it is seen very clearly in a box plot. The satisfied customers (sat_label = 1) score this parameter between 4 and 5 (first to third quartile), while the others sit in the range of 2 to 3.

In [None]:
sns.boxplot(x='sat_label',y = 'Online boarding',data=df_train)
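
The quartiles that the box plot draws can also be computed directly. This is a small sketch on made-up scores, so the numbers are not the ones from the dataset:

```python
import pandas as pd

# Made-up scores: four satisfied and four unsatisfied passengers
toy = pd.DataFrame({
    "sat_label": [1, 1, 1, 1, 0, 0, 0, 0],
    "Online boarding": [4, 5, 4, 5, 2, 3, 3, 2],
})

quartiles = (
    toy.groupby("sat_label")["Online boarding"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()  # one row per label, one column per quantile
)
print(quartiles)
```

Each row of `quartiles` corresponds to one box in the plot: the 0.25 and 0.75 columns are the box edges and 0.5 is the median line.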

### Pre processing

First, all the categorical variables need to be transformed to indicator variables. Most of them have only 2 values, but class has 3, and therefore 2 columns are created to encode it. This is essential for calculating the correlation now and for the training later (in case the algorithms that will be used cannot handle categorical features).

In [None]:
Gender_cat = pd.get_dummies(df_train['Gender'],drop_first=True)
Customer_cat = pd.get_dummies(df_train['Customer Type'],drop_first=True)
Travel_cat = pd.get_dummies(df_train['Type of Travel'],drop_first=True)
Class_cat = pd.get_dummies(df_train['Class'],drop_first=True)
df_train = pd.concat([df_train,Gender_cat,Customer_cat,Travel_cat,Class_cat],axis =1) # add all the newly created columns to the existing dataframe
df_train.drop(['Gender','Customer Type','Type of Travel','Class'],inplace =True,axis = 1) #erase the old categorical columns

Same thing is done for the testset:

In [None]:
Gender_cat = pd.get_dummies(df_test['Gender'],drop_first=True)
Customer_cat = pd.get_dummies(df_test['Customer Type'],drop_first=True)
Travel_cat = pd.get_dummies(df_test['Type of Travel'],drop_first=True)
Class_cat = pd.get_dummies(df_test['Class'],drop_first=True)
df_test = pd.concat([df_test,Gender_cat,Customer_cat,Travel_cat,Class_cat],axis =1)
df_test.drop(['Gender','Customer Type','Type of Travel','Class'],inplace =True,axis = 1)
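
Encoding the two sets separately works as long as both contain every category. The sketch below shows one way to guard against a category that is missing from the test set. `encode_categoricals` is a hypothetical helper, the frames are made up, and unlike the cells above it keeps the column name as a prefix (for example `Class_Eco`):

```python
import pandas as pd

def encode_categoricals(df, columns):
    """Replace the given categorical columns with 0/1 indicator columns."""
    dummies = pd.get_dummies(df[columns], drop_first=True, dtype=int)
    return pd.concat([df.drop(columns=columns), dummies], axis=1)

train = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus"], "Age": [30, 41, 25]})
test = pd.DataFrame({"Class": ["Eco", "Business"], "Age": [52, 37]})  # no 'Eco Plus' rows

train_enc = encode_categoricals(train, ["Class"])
test_enc = encode_categoricals(test, ["Class"])
# Give the test frame exactly the training columns; missing indicators become 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```

Aligning on the training columns keeps the feature order identical for both sets, which matters once the frames are converted to numpy arrays.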

Sample the data to make sure it is OK:

In [None]:
df_train.sample(random_state=rstate)

In [None]:
df_test.sample(random_state=rstate)

Now we can plot a more comprehensive matrix that takes the additional variables into consideration. Nevertheless, the strongest pattern was already seen before: the 3 subgroups of survey categories.

In [None]:
plt.figure(figsize=(26,20))
sns.heatmap(df_train.corr(),annot = True,cmap='YlGnBu')

Once we check correlation against the label once more we see more relationships (especially negative ones) with type of travel and class:

In [None]:
df_train.corr()['sat_label'].sort_values().drop('sat_label').plot(kind='bar', color='maroon')

Since flight distance was marked before as a high variance feature, let's plot a histogram to visualize the exact distribution:

In [None]:
plt.hist(df_train['Flight Distance'])

We see a very long, thick tail indeed. Though this feature is correlated relatively strongly with the label, it might be difficult for some of the algorithms to process. Also, we noticed before that arrival and departure delays are very heavily correlated, which means there is little sense in using both of them as features. The plot below shows how strong the relationship between them is (not surprising: once there is a delay in departure, assuming flight time is approximately the same, the delay in arrival will be very close).

In [None]:
sns.lmplot(x='Departure Delay in Minutes',y='Arrival Delay in Minutes',data=df_train)

Drop irrelevant columns (as discussed above) and NAN records:

In [None]:
df_train.drop(['Unnamed: 0','id', 'Arrival Delay in Minutes'],axis=1,inplace=True)
df_train.dropna(axis=0,inplace=True)

Same for the testset:

In [None]:
df_test.drop(['Unnamed: 0','id', 'Arrival Delay in Minutes'],axis=1,inplace=True)
df_test.dropna(axis=0,inplace=True)

After trying out different aggregations (groupings) of age (I've erased them since the notebook is long enough without them), we see that correlation is not increased so we leave it as is. Now the data is ready for analysis and contains no nulls:

In [None]:
df_train.isnull().sum()

After recognizing the 3 subgroups as detailed above, we aggregate the relevant features into 3 groups (with a minor overlap) and take the value of each subgroup as the mean of its components. This should reduce the noise within each group of correlated features and allow a better fit using the different algorithms. Age, flight distance and departure delay are also removed to reduce the number of dimensions, since they are not expected to add much value (arrival delay was already dropped above).

In [None]:
df_train_grouped = df_train.copy() #copy to avoid modifying the original dataframe
df_train_grouped['Order'] = df_train[['Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking', 'Gate location']].mean(axis=1)
df_train_grouped['Comfort'] = df_train[['Food and drink','Online boarding','Seat comfort', 'Inflight entertainment']].mean(axis=1)
df_train_grouped['Service'] = df_train[['Inflight entertainment','On-board service','Leg room service', 'Baggage handling']].mean(axis=1)
df_train_grouped.drop(['Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking', 'Gate location', 'Food and drink','Online boarding','Seat comfort', 'Inflight entertainment', 'On-board service','Leg room service', 'Baggage handling', 'Age', 'Flight Distance', 'Departure Delay in Minutes'],axis=1,inplace=True)

Same for testset:

In [None]:
df_test_grouped = df_test.copy()
df_test_grouped['Order'] = df_test[['Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking', 'Gate location']].mean(axis=1)
df_test_grouped['Comfort'] = df_test[['Food and drink','Online boarding','Seat comfort', 'Inflight entertainment']].mean(axis=1)
df_test_grouped['Service'] = df_test[['Inflight entertainment','On-board service','Leg room service', 'Baggage handling']].mean(axis=1)
df_test_grouped.drop(['Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking', 'Gate location', 'Food and drink','Online boarding','Seat comfort', 'Inflight entertainment', 'On-board service','Leg room service', 'Baggage handling', 'Age', 'Flight Distance', 'Departure Delay in Minutes'],axis=1,inplace=True)
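
Because the aggregation is repeated for both sets, it could be expressed once as a function. The sketch below uses the column groups from the cells above, while `add_group_means`, `GROUPS` and `EXTRA_DROPS` are hypothetical names and the single passenger row is made up:

```python
import pandas as pd

# Column groups copied from the cells above
GROUPS = {
    "Order": ["Inflight wifi service", "Departure/Arrival time convenient",
              "Ease of Online booking", "Gate location"],
    "Comfort": ["Food and drink", "Online boarding", "Seat comfort",
                "Inflight entertainment"],
    "Service": ["Inflight entertainment", "On-board service",
                "Leg room service", "Baggage handling"],
}
EXTRA_DROPS = ["Age", "Flight Distance", "Departure Delay in Minutes"]

def add_group_means(df, groups=GROUPS, extra_drops=EXTRA_DROPS):
    """Return a copy of df with one mean column per group and the source columns removed."""
    out = df.copy()
    for name, cols in groups.items():
        out[name] = df[cols].mean(axis=1)
    used = sorted({col for cols in groups.values() for col in cols})
    return out.drop(columns=used + list(extra_drops))

# One made-up passenger: every survey score is 4 except entertainment
survey_cols = sorted({col for cols in GROUPS.values() for col in cols})
row = {col: 4 for col in survey_cols}
row["Inflight entertainment"] = 2
row.update({"Age": 35, "Flight Distance": 800,
            "Departure Delay in Minutes": 0, "sat_label": 1})
toy_grouped = add_group_means(pd.DataFrame([row]))
print(toy_grouped)
```

With such a helper, both sets would go through the same call, e.g. `add_group_means(df_train)` and `add_group_means(df_test)`, which removes the risk of the two cells drifting apart.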

In [None]:
df_train_grouped.sample(random_state=rstate)

In [None]:
df_test_grouped.sample(random_state=rstate)

### Feature importance analysis

In [None]:
#Import the relevant libraries
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import MeanShift

Due to the heavy processing, a random sample of about 1/10 of the records will sometimes be used in the following sections. Otherwise, the available machine will not be able to process the task in a reasonable amount of time. The data is also standardized and converted to numpy arrays to improve performance.

In [None]:
sample_size = 10000
df_train_grouped_sample = df_train_grouped.sample(sample_size,random_state=rstate) #save the sample
df_train_grouped_sample_std = StandardScaler().fit_transform(df_train_grouped_sample) #scaled data
scaler = StandardScaler().fit(df_train_grouped) #fit the scaler on the entire trainset only
df_train_grouped_std = scaler.transform(df_train_grouped) #save a version of the entire scaled trainset
df_test_grouped_std = scaler.transform(df_test_grouped) #scale the testset with the trainset statistics
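
One detail worth illustrating is that the label column does not need to be scaled at all. The following sketch, on made-up frames, scales only the features with statistics taken from the train split and keeps the labels as plain integers. It is an alternative arrangement under those assumptions, not the layout used in the cells above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up train and test frames with a single feature
train = pd.DataFrame({"Cleanliness": [1, 3, 5, 3], "sat_label": [0, 0, 1, 1]})
test = pd.DataFrame({"Cleanliness": [5, 1], "sat_label": [1, 0]})

feature_cols = [c for c in train.columns if c != "sat_label"]
scaler = StandardScaler().fit(train[feature_cols])  # statistics from the train split only
x_train = scaler.transform(train[feature_cols])
x_test = scaler.transform(test[feature_cols])       # reuse the same mean and std
y_train = train["sat_label"].to_numpy()             # labels stay untouched
y_test = test["sat_label"].to_numpy()

print(np.round(x_test, 3))
```

Fitting the scaler once and reusing it keeps the test features on the same scale the model saw during training.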

We begin the clustering task with K-means, the most basic algorithm, using the silhouette score to determine the optimal number of clusters:

In [None]:
range_n_clusters = list(range(3,10)) #candidate cluster counts 3 to 9 (the upper bound of range is exclusive)
print("Number of clusters from 3 to 9: \n", range_n_clusters)
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=rstate) #fixed seed so the scores are reproducible
    preds = clusterer.fit_predict(df_train_grouped_sample_std)
    score = silhouette_score(df_train_grouped_sample_std, preds)
    print("For %d clusters, average Silhouette score is %.2f" % (n_clusters, score))
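
To show how the silhouette score singles out a cluster count, here is a self-contained sketch on synthetic blobs. The four blob centers are an assumption chosen so that the structure is obvious, which is rarely the case on real survey data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well separated synthetic blobs
centers = [(-5, -5), (5, 5), (-5, 5), (5, -5)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)  # cluster count with the highest score
print(best_k)
```

Collecting the scores in a dictionary makes it possible to pick the best count programmatically instead of reading it off the printed lines.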

Best score is obtained with 5 clusters which turn out rather balanced:

In [None]:
kclusters = 5
kmeans = KMeans(n_clusters=kclusters, random_state=rstate).fit(df_train_grouped)
cluster_results_kmeans = kmeans.labels_
np.bincount(cluster_results_kmeans)

Add the cluster id as an additional column and display the mean values for every cluster once grouped:

In [None]:
Summary_kmeans = df_train_grouped.copy()
Summary_kmeans.insert(0, 'K Cluster Label', kmeans.labels_) #input the column that contains the labels of each record to the table
Summary_kmeans_full = Summary_kmeans 
Summary_kmeans = Summary_kmeans.groupby(['K Cluster Label']).mean() #group by cluster id
Summary_kmeans

It is easy to see that clusters 0 and 2 have significantly higher satisfaction ratios. Looking at the means of those clusters compared to the others, we can see that:
1. Cleanliness is higher
2. Inflight service is higher
3. Gender is meaningless
4. There are fewer disloyal customers
5. There are fewer eco/eco-plus passengers
6. The scores for the 3 synthetic subgroups are higher (a very good sign!)
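
The comparison in the list above can be made explicit by subtracting the overall mean from each cluster mean. This is a sketch on a made-up table with only two features, so the values carry no meaning beyond illustration:

```python
import pandas as pd

# Made-up records with a cluster id, the label and one feature
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "sat_label": [1, 1, 0, 0, 1, 0],
    "Cleanliness": [5, 4, 2, 3, 4, 3],
})

profile = toy.groupby("cluster").mean()              # mean of every column per cluster
diff = profile - toy.drop(columns="cluster").mean()  # deviation from the overall mean
ranked = profile.sort_values("sat_label", ascending=False)

print(diff)
print(ranked.index.tolist())
```

Positive entries in `diff` mark features on which a cluster scores above average, which is the kind of pattern listed above for the high satisfaction clusters.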

Now we'll try agglomerative clustering. It suits our requirement well because we can set the number of clusters to 2, and the merging will continue all the way down to two groups and expose what their members have in common:

In [None]:
agglom = AgglomerativeClustering(n_clusters = 2, linkage = 'complete') #two clusters are set, aiming for a large difference in satisfaction labels
AC_labels = agglom.fit_predict(df_train_grouped_sample)
cluster_results_AC = agglom.labels_
np.bincount(cluster_results_AC)

Analyzing the results of agglomerative clustering, the picture is rather clear and the same trends can be seen as listed for Kmeans. Now the same groupby is performed again:

In [None]:
Summary_AC = df_train_grouped_sample.copy()
Summary_AC.insert(0, 'AC Cluster Labels', AC_labels)
Summary_AC = Summary_AC.groupby(['AC Cluster Labels']).mean()
Summary_AC

Another algorithm we'll try is spectral clustering, and the same conclusions can be derived from it. Spectral clustering also shows the best separation ratio (75% vs 18% satisfaction). It fits our task well because it embeds the features in a lower-dimensional space before reducing the information to 2 clusters, similar to agglomerative clustering:

In [None]:
from sklearn.cluster import SpectralClustering
Spec = SpectralClustering(n_clusters=2, assign_labels='discretize', random_state=rstate).fit(df_train_grouped_sample)
cluster_results_SC = Spec.labels_
np.bincount(cluster_results_SC)

In [None]:
#Same process of addition of cluster id to table
Summary_SC = df_train_grouped_sample.copy()
Summary_SC.insert(0, 'SC Cluster Labels', cluster_results_SC)
Summary_SC = Summary_SC.groupby(['SC Cluster Labels']).mean()
Summary_SC

And the results show the same trends again (for the third time) so I think the conclusions are strong.

### Prediction of overall satisfaction label

Import the relevant libraries:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay #replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

Preparing the train & test vectors for use:

In [None]:
x_train = np.delete(df_train_grouped_std, 3, 1) #remove label (column 3 holds sat_label) from input vector
y_train = df_train_grouped['sat_label'].to_numpy() #take the labels from the unscaled dataframe
x_test = np.delete(df_test_grouped_std, 3, 1) #remove label from input vector, this time using the testset
y_test = df_test_grouped['sat_label'].to_numpy()

In [None]:
#Needs to be set to int (binary) to allow classification algorithms to perform fit
y_train = y_train.astype(int)
y_test = y_test.astype(int)

We create a generalized function to deal with fitting, predicting satisfaction and measuring results. For each run we print out the ROC area under curve score, classification metrics and a confusion matrix:

In [None]:
def run_model(model, x_train, y_train, x_test, y_test, verbose=True):
    if not verbose:
        model.fit(x_train,y_train, verbose=0) #only valid for estimators whose fit accepts a verbose argument
    else:
        model.fit(x_train,y_train)
    y_pred = model.predict(x_test)
    roc_auc = roc_auc_score(y_test, y_pred) #computed from the hard class predictions
    print("ROC_AUC = {}".format(roc_auc))
    print(classification_report(y_test,y_pred,digits=5))
    ConfusionMatrixDisplay.from_estimator(model, x_test, y_test,cmap=plt.cm.Blues, normalize = 'all')

    return model, roc_auc #function returns model object and ROC_AUC
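
A side note on the metric: when `roc_auc_score` receives hard 0/1 predictions, as in the function above, the value equals the balanced accuracy, since the ROC curve then has a single threshold. The sketch below illustrates this on synthetic data and also computes the score from predicted probabilities. The dataset and model settings are assumptions made for the example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

auc_hard = roc_auc_score(y_te, clf.predict(X_te))              # single threshold
auc_soft = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # full ranking of scores
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

print(auc_hard, bal_acc, auc_soft)
```

Passing probabilities (or `decision_function` output) evaluates the ranking quality over all thresholds, which is the usual meaning of ROC AUC.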

First up is random forest, a classic decision-tree based algorithm. Since some of the optimizations take an extremely long time, I ran them offline (many hours) and display the grid search code here as a comment for reference, while the actual run uses the fixed best parameters. Other algorithms that run quickly will be tuned by initializing a GridSearchCV object:

In [None]:
#Since runtime is extremely long for a full grid search, we will use its best parameters for setup:
params_rf = {'max_depth': 25, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 1200,'random_state': rstate}

#hparam_rf = {'criterion': ['gini', 'entropy'], 'n_estimators': [50,100,200,500,1200],'max_depth':[5,10,25,35], 'min_samples_split':[2,3]} #an integer min_samples_split must be at least 2

#model_rf = GridSearchCV(RandomForestClassifier(), param_grid=hparam_rf,scoring = 'roc_auc', n_jobs=-1)
model_rf = RandomForestClassifier(**params_rf) #unpack the dictionary into keyword arguments
model_rf, roc_auc_rf = run_model(model_rf, x_train, y_train, x_test, y_test) #pass the model together with the vectors for training and prediction
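
For readers unfamiliar with the grid search mentioned above, here is a small sketch of the mechanism on synthetic data. A single decision tree and a tiny grid are used so that it finishes quickly. Both are assumptions for the example and not the settings of the offline run:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

grid = {"max_depth": [2, 4, 8], "min_samples_split": [2, 4]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=grid,
    scoring="roc_auc",  # same scoring as in the notebook
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `best_params_` holds the winning combination, which can then be hard-coded into a parameter dictionary in the same way as `params_rf`.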

Random forest generates outstanding results, and as can be seen from the feature importance analysis, our synthetic aggregated features do a great job helping out (50% accumulated feature importance)! The results are better than those of the top rated Kaggle notebooks I've seen:

In [None]:
model_rf.feature_importances_
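
The raw importance array is easier to read once it is paired with feature names. The sketch below does this on synthetic data with placeholder names. In the notebook, the names would presumably be the columns of `df_train_grouped` without `sat_label`, which is an assumption about the column order:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names
names = ["f0", "f1", "f2", "f3", "f4"]
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

importances = (
    pd.Series(clf.feature_importances_, index=names)
    .sort_values(ascending=False)  # most important feature first
)
print(importances)
```

The importances of a random forest sum to 1, so the values can be read directly as shares, e.g. when adding up the contribution of the three aggregated features.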

Since random forest performed very well, we will continue with another decision tree based algorithm - lightGBM:

In [None]:
#listing all parameters for grid search:
hparam_lgb = {'n_estimators': [50, 100, 200],'max_depth':[5,10,15,20],'num_leaves': [25, 50, 100], 'random_state': [rstate]}

In [None]:
model_lgb = GridSearchCV(lgb.LGBMClassifier(), param_grid=hparam_lgb,scoring = 'roc_auc', n_jobs=-1)
model_lgb, roc_auc_lgb = run_model(model_lgb, x_train, y_train, x_test, y_test)

Same principle as in random forest - the 3 subgroups have a major impact on prediction (last 3 features in the list):

In [None]:
model_lgb.best_estimator_.feature_importances_

Now we will try to apply a different algorithmic basis - support vectors:

In [None]:
#Same principle as in random forest - we'll use the optimal hyperparameters obtained using grid search as the fixed input.
params_svc = {'C': 1,
              'kernel': 'linear',
              'degree': 3, #only used by the 'poly' kernel, ignored here
              'gamma': 'scale', #ignored by the linear kernel as well
              'random_state': rstate}

#hparam_svc = {'C': [1,2,3],'kernel':['rbf', 'linear'],'degree': [2, 3, 4], 'gamma': ['scale', 'auto']}

In [None]:
model_svc = SVC(**params_svc)
#model_svc = GridSearchCV(SVC(), param_grid=hparam_svc,scoring = 'roc_auc', n_jobs=-1)
model_svc, roc_auc_svc = run_model(model_svc, x_train, y_train, x_test, y_test)

Results are not as good as those of the previous, decision tree-based, algorithms. For the final run we'll use AdaBoost, expecting a performance similar to LightGBM & random forest. In practice, the results are worse.

In [None]:
hparam_ada = {'learning_rate':[0.8, 1.0, 1.1],'n_estimators':[50,100,150,200,500,1000], 'random_state':[rstate]}
model_ada = GridSearchCV(AdaBoostClassifier(), param_grid=hparam_ada,scoring = 'roc_auc', n_jobs=-1)
model_ada, roc_auc_ada = run_model(model_ada, x_train, y_train, x_test, y_test)

Also, we see the same effect in feature importance once more:

In [None]:
model_ada.best_estimator_.feature_importances_