# Airlines Customer satisfaction

This dataset was provided by an airline. The company's real name is withheld, so it is referred to here as Invistico Airlines.

The dataset contains details of customers who have already flown with the airline. Their feedback on various aspects of the service has been consolidated with their flight data.

The main purpose of this dataset is to predict whether a future customer will be satisfied with the service, given the values of the other parameters.

The airline also needs to know which aspects of its service to emphasize in order to generate more satisfied customers.

Dataset: https://www.kaggle.com/sjleshrac/airlines-customer-satisfaction/

## Importing Libraries and Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
df=pd.read_csv('../input/airlines-customer-satisfaction/Invistico_Airline.csv')
df.head()

## EDA and Visualization

In [None]:
#df.info()

In [None]:
df.isna().sum()

There are some null values in the column 'Arrival Delay in Minutes'.

In [None]:
df['Arrival Delay in Minutes'].describe()

In [None]:
df_c=df.copy()
df.dropna(inplace=True)

Rows containing null values are dropped because there are few of them compared to the total number of entries.
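Before dropping rows, it can help to check what share of the data is affected. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the airline dataset); the same two lines could be run on `df_c`, the copy taken before the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the airline data (values are made up).
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 5, 12, 0, 30, 7, 0, 45, 3, 0],
    "Arrival Delay in Minutes": [0.0, 4.0, np.nan, 0.0, 28.0, 9.0, 0.0, 50.0, 2.0, 0.0],
})

# Share of rows that contain at least one missing value.
null_row_fraction = toy.isna().any(axis=1).mean()
print(f"Rows with nulls: {null_row_fraction:.1%}")

# Dropping is reasonable when this share is small; otherwise imputation may be preferable.
cleaned = toy.dropna()
print(f"Rows kept: {len(cleaned)} of {len(toy)}")
```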

In [None]:
category = ["satisfaction", "Gender", "Customer Type", "Type of Travel", "Class"]
for c in category:
    print ("{} \n".format(df[c].value_counts()))
df['satisfaction']=df['satisfaction'].map({'satisfied':1,'dissatisfied':0})

In [None]:
sn.countplot(x="satisfaction", data=df)
plt.title('Airlines Customer satisfaction Count')
plt.xticks([0,1],['Dissatisfied',"Satisfied"])
plt.show()

In our data, the numbers of satisfied and dissatisfied customers are almost equal, so the dataset is balanced.
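The balance claim can also be checked numerically rather than read off the chart. This is a small sketch on made-up labels; on the real data the same calls would be applied to `df['satisfaction']`.

```python
import pandas as pd

# Hypothetical labels: 1 = satisfied, 0 = dissatisfied (made-up values).
labels = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 0, 1], name="satisfaction")

# Proportion of each class.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Ratio of the larger class to the smaller one; values near 1 indicate balance.
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```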

In [None]:
fig,axs = plt.subplots(2,2,figsize=(14, 14))
cols=['Gender', 'Customer Type', 'Type of Travel', 'Class']
c=0
for i in range(2):
  for j in range(2):
    sn.countplot(data=df,x=cols[c],hue='satisfaction',ax=axs[i][j])
    axs[i][j].set_title('Customer Satisfaction as per {}'.format(cols[c]))
    axs[i][j].legend(['Dissatisfied',"Satisfied"])
    c+=1

From the above charts, we can conclude that:
* Female customers are comparatively more satisfied than male customers.
* Loyal customers are more satisfied than disloyal ones.
* People who travel for business are more satisfied than those who travel for personal reasons.
* More people travel in Business class, and they are also comparatively more satisfied than customers travelling in Economy or Economy Plus class.
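The comparisons above can be backed with satisfaction rates per group. A minimal sketch, assuming `satisfaction` is already encoded as 0/1 as in the notebook; the rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "Class": ["Business", "Eco", "Eco", "Business", "Business", "Eco Plus"],
    "satisfaction": [1, 1, 0, 1, 1, 0],
})

# With a 0/1 target, the group mean is the share of satisfied customers.
for col in ["Gender", "Class"]:
    print(toy.groupby(col)["satisfaction"].mean(), "\n")

gender_rate = toy.groupby("Gender")["satisfaction"].mean()
```

Rates are easier to compare than raw counts when the groups differ in size.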

In [None]:
fg=sn.displot(df,x='Age',binwidth=0.55,hue='satisfaction')
fg.fig.set_figwidth(24.27)
fg.fig.set_figheight(14.7)
plt.show()

Customers between 38 and 60 years old are more satisfied than customers in other age groups.
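One way to check the age observation is to bin ages and compare satisfaction rates per bin. This sketch uses made-up ages and labels, and the bin edges are an assumption chosen to match the 38 to 60 range mentioned above.

```python
import pandas as pd

# Hypothetical ages and 0/1 satisfaction labels (made-up values).
toy = pd.DataFrame({
    "Age": [22, 35, 41, 45, 52, 58, 63, 70],
    "satisfaction": [0, 0, 1, 1, 1, 0, 0, 1],
})

# Assumed bin edges: up to 37, 38 to 60, and above 60 (intervals are right-closed).
bins = [0, 37, 60, 120]
labels = ["<38", "38-60", ">60"]
toy["age_group"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Share of satisfied customers in each age group.
rate_by_age = toy.groupby("age_group", observed=True)["satisfaction"].mean()
print(rate_by_age)
```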

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
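# Restrict to numeric columns: recent pandas versions raise an error if corr() sees string columns
sn.heatmap(df.select_dtypes(include='number').corr(),cmap='gist_earth',annot=True)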
plt.show()

Flight Distance, Departure/Arrival time convenient, Gate location, Departure Delay in Minutes and Arrival Delay in Minutes show very low correlation with customer satisfaction. So, we are going to drop those columns to reduce model complexity.
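Instead of reading weak features off the heatmap, the correlations with the target can be ranked directly. The sketch below uses synthetic data, and the 0.1 cutoff is an assumed threshold rather than one stated in the notebook. Note that Pearson correlation only captures linear relationships, so a low value does not by itself prove a feature is uninformative.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tied to the target, one unrelated (illustrative only).
rng = np.random.default_rng(0)
n = 2000
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "satisfaction": target,
    "Seat comfort": target * 2 + rng.normal(0, 1, n),  # correlated with the target
    "Gate location": rng.normal(0, 1, n),              # independent of the target
})

# Absolute correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["satisfaction"].drop("satisfaction").abs().sort_values(ascending=False)
)
print(corr_with_target)

# Features below the assumed threshold are candidates for removal.
threshold = 0.1
weak = corr_with_target[corr_with_target < threshold].index.tolist()
print("Candidates to drop:", weak)
```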

In [None]:
df.drop(['Flight Distance','Departure/Arrival time convenient','Gate location','Departure Delay in Minutes','Arrival Delay in Minutes'],axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
X = df.iloc[:,1:].values
y = df.iloc[:,0].values
X.shape

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder',OneHotEncoder(),[0,1,3,4])],remainder='passthrough')
# np.float was removed in NumPy 1.24; the built-in float is equivalent
X = np.array(ct.fit_transform(X),dtype=float)

In [None]:
X.shape

## Model Selection

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, recall_score
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.9, random_state=0)

sc_X  = StandardScaler()
X_train_sc = sc_X.fit_transform(X_train)
X_test_sc = sc_X.transform(X_test)

min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
# Reuse the scaler fitted on the training data; refitting on the test set would scale it differently
X_test_minmax = min_max_scaler.transform(X_test)

In [None]:
#function to plot learning curve for any classifier
from sklearn.model_selection import learning_curve, validation_curve
def plotLearningCurves(X_train, y_train, classifier, title):
    train_sizes, train_scores, test_scores = learning_curve(
            classifier, X_train, y_train, cv=5, scoring="accuracy")
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # The plotted values are accuracy scores, not errors
    plt.plot(train_sizes, train_scores_mean, label="Training Score")
    plt.plot(train_sizes, test_scores_mean, label="Cross Validation Score")
    
    plt.legend()
    plt.grid()
    plt.title(title, fontsize = 18, y = 1.03)
    plt.xlabel('Train Sizes', fontsize = 14)
    plt.ylabel('Score', fontsize = 14)
    plt.tight_layout()

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_reg1=LogisticRegression(max_iter=2500)
log_reg1.fit(X_train_sc,y_train)
pred_log1=log_reg1.predict(X_test_sc)

print('Confusion Matrix is\n',confusion_matrix(y_test,pred_log1))
print('Accuracy is', accuracy_score(y_test,pred_log1))

In [None]:
log_reg3=LogisticRegression(max_iter=2500)
log_reg3.fit(X_train,y_train)
pred_log3=log_reg3.predict(X_test)

print("Test Scores")
print('Confusion Matrix is\n',confusion_matrix(y_test,pred_log3))
print('Accuracy is\n', accuracy_score(y_test,pred_log3))

'''
pred_log_train=log_reg3.predict(X_train)
print("Train Scores")
print('Confusion Matrix is\n',confusion_matrix(y_train,pred_log_train))
print('Accuracy is', accuracy_score(y_train,pred_log_train))
'''

In [None]:
log_reg2=LogisticRegression(max_iter=2500)
log_reg2.fit(X_train_minmax,y_train)
pred_log2=log_reg2.predict(X_test_minmax)

print('Confusion Matrix is\n',confusion_matrix(y_test,pred_log2))
print('Accuracy is', accuracy_score(y_test,pred_log2))

Data scaled with MinMax scaling performed better than data scaled with the other method and better than unscaled data.
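A comparison of scaling options is more reliable when the scaler is fitted only on the training part of each split. One way to get that is to wrap the scaler and the model in a pipeline and cross-validate. The sketch below uses synthetic data from `make_classification` as a stand-in, so its scores say nothing about the airline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the encoded feature matrix and labels.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)

# Each candidate bundles its scaler with the classifier, so the scaler
# is refitted on the training folds only during cross-validation.
candidates = {
    "unscaled": make_pipeline(LogisticRegression(max_iter=2500)),
    "standard": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2500)),
    "minmax": make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=2500)),
}

scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```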

In [None]:
plt.figure(figsize = (16,5))
title = 'Logistic Regression Learning Curve'
plotLearningCurves(X_train_minmax, y_train, log_reg2,title)


As the training size increases, the training score and the cross-validation score converge, which means the gap between them in accuracy shrinks.

## KNeighbours Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
for k in range(10,18):
  knn = KNeighborsClassifier(n_neighbors=k,metric='minkowski',p=2) 
  knn.fit(X_train_sc,y_train)
  pred_knn = knn.predict(X_test_sc)

  print("k=",k)
  print('Confusion Matrix is ',confusion_matrix(y_test,pred_knn))
  print('Accuracy is', accuracy_score(y_test,pred_knn))
  print('\n')

The KNN algorithm performed best when n_neighbors equals 11. So let's find the model's performance on both the training and test sets.
Performance was better with data scaled using StandardScaler.
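Choosing `k` by test-set accuracy lets the test set influence the model. An alternative is to pick `k` by cross-validation on the training data and keep the test set for the final check. A minimal sketch on synthetic data (an assumption, not the airline dataset), using the same range of `k` as above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate k.
cv_accuracy = {}
for k in range(10, 18):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_accuracy[k] = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()

best_k = max(cv_accuracy, key=cv_accuracy.get)
print("Best k by cross-validation:", best_k)
```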

In [None]:
knn = KNeighborsClassifier(n_neighbors=11,metric='minkowski',p=2) 
knn.fit(X_train_sc,y_train)

knn_train = knn.predict(X_train_sc)
knn_test= knn.predict(X_test_sc)

print("For Test")
print('Confusion Matrix is \n',confusion_matrix(y_test,knn_test))
print('Accuracy is', accuracy_score(y_test,knn_test))
print('\n')

print("For Train")
print('Confusion Matrix is\n ',confusion_matrix(y_train,knn_train))
print('Accuracy is', accuracy_score(y_train,knn_train))
print('\n')

In [None]:
plt.figure(figsize = (16,5))
title = 'kNeighbours Learning Curve'
# Use the standardized features, matching the data the KNN model was fitted on above
plotLearningCurves(X_train_sc, y_train, knn,title)

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
for d in range(20,30):
  dtc = DecisionTreeClassifier(criterion='entropy', max_depth=d,max_leaf_nodes=1000)
  dtc.fit(X_train,y_train)
  pred_dtc=dtc.predict(X_test)
  print("d=",d)
  print(accuracy_score(y_test,pred_dtc))

The Decision Tree classifier performed best with max_depth set to 25, max_leaf_nodes set to 1000, and entropy as the criterion.
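The depth search can also be done with scikit-learn's `validation_curve`, which returns training and cross-validation scores for each value of a hyperparameter. The sketch below runs on synthetic data with an assumed depth range of 2 to 11, so the selected depth is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)

# Assumed range of depths to evaluate.
depths = np.arange(2, 12)
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X_toy, y_toy,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)

# Rows correspond to depths, columns to CV folds; pick the depth with the best mean CV score.
best_depth = int(depths[cv_scores.mean(axis=1).argmax()])
print("Best max_depth by cross-validation:", best_depth)
```

A widening gap between the training and cross-validation scores at larger depths is a sign of overfitting.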

In [None]:
dtc_best=DecisionTreeClassifier(criterion='entropy', max_depth=25,max_leaf_nodes=1000)
dtc_best.fit(X_train,y_train)
#pred_dtc=dtc.predict(X_test)
plt.figure(figsize = (16,5))
title = 'Decision Tree Learning Curve'
plotLearningCurves(X_train, y_train, dtc_best,title)

The cross-validation score increases with the training size and converges toward the training score, which suggests that the model is learning well.

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=40, criterion='entropy', max_depth=40,max_leaf_nodes=4100)

rfc.fit(X_train_sc, y_train)

pred_rfc = rfc.predict(X_test_sc)
rfc_train= rfc.predict(X_train_sc)
print('Test Score:',accuracy_score(y_test,pred_rfc))
print('Train Score:',accuracy_score(y_train,rfc_train))

print('Confusion Matrix for test set  \n',confusion_matrix(y_test,pred_rfc))

#0.9477466379221846
#0.9887918518192618

In [None]:
plt.figure(figsize = (16,5))
title = 'Random Forest Learning Curve'
plotLearningCurves(X_train, y_train, rfc,title)

We can see that as the training size increases, the cross-validation score and the training score converge, but some gap between them remains.
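The remaining gap can be put into numbers by subtracting the mean cross-validation score from the mean training score at each training size. This is a sketch on synthetic data with a smaller forest than the one above, so the values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)
train_sizes, train_scores, cv_scores = learning_curve(
    model, X_toy, y_toy, cv=5, scoring="accuracy"
)

# Gap between mean training accuracy and mean CV accuracy per training size.
gap = train_scores.mean(axis=1) - cv_scores.mean(axis=1)
for size, g in zip(train_sizes, gap):
    print(f"train size {size}: gap {g:.3f}")
```

A gap that stays large at the biggest training size points to overfitting, which limiting tree depth or adding data may reduce.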

## Conclusion

In our problem of classifying customers as satisfied or dissatisfied, the best accuracy was achieved using the Random Forest classifier. The best train and test scores achieved are 0.98 and 0.95 respectively.
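The introduction also asks which service aspects the airline should emphasize. Since the random forest performed best, its `feature_importances_` attribute is one possible starting point. The sketch below fits a forest on synthetic data; the feature names are placeholders attached to synthetic columns, so the ranking says nothing about the real dataset. On the notebook's data, the names would have to be recovered from the fitted `ColumnTransformer`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names attached to synthetic columns (hypothetical mapping).
feature_names = [
    "Seat comfort", "Inflight entertainment", "Online support",
    "Ease of Online booking", "Cleanliness",
]

# Synthetic data: 3 informative features and 2 noise features.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

rfc_toy = RandomForestClassifier(n_estimators=40, criterion="entropy", random_state=0)
rfc_toy.fit(X_toy, y_toy)

# Impurity-based importances sum to 1; higher values mean the forest relied more on the feature.
importances = pd.Series(rfc_toy.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so permutation importance on held-out data is a common cross-check.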