# Airline Satisfaction - Classification

The following dataframe was imported from kaggle.com. It contains reviews of a certain airline made by close to 130 thousand passengers. Each passenger is identified by an ID number, followed by some identifying features of the passenger, the type of flight, and the passenger's ratings in several different categories (explained further down in the notebook). Each category is rated from 0 to 5 stars. The final column states whether the passenger was overall "satisfied" with the airline and the flight or was "neutral or dissatisfied".<br>
Link to the dataset on kaggle: https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction<br><br>

Our purpose in this notebook is to predict whether a passenger who is not in the given dataframe was satisfied with the airline, given all the data of the passenger's review except the last column (the overall satisfaction). In other words, we predict which of the 2 possible answers the passenger would choose: "satisfied" or "neutral or dissatisfied".

In [None]:
#Importing the libraries used
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

We next have to import the dataframe into the notebook. On kaggle.com the data is already divided into a train set and a test set. We first combine the 2 dataframes into a single one, so that we do not have to apply every change twice (it will be divided again later in the notebook into a train and a test set, although not necessarily the same ones as on download).<br>
When the .csv files are read into jupyter as dataframes, a new index is added, leaving the index stored in the files as a column of no significance ("Unnamed: 0"). After merging the 2 datasets we again give the dataframe a new index. Because reset_index() would otherwise store the old index in a column named "index", we first name the index "i", then apply the reset_index() function, then drop the column now named "i".<br>
Henceforth, we shall refer to the whole dataset as "fulldf" instead of the train and test sets imported to the notebook (until it is divided again).

In [None]:
defaultTest = pd.read_csv("../input/airline-passenger-satisfaction/test.csv") #importing the dataframe 'test.csv' from the appropriate folder
defaultTrain = pd.read_csv("../input/airline-passenger-satisfaction/train.csv") #importing the dataframe 'train.csv' from the appropriate folder
fulldf = pd.concat([defaultTest, defaultTrain]) #merging  the 2 datasets, so that we can work as if we were given the undivided
#dataset in the first place.
fulldf.index.name = 'i' #renaming the index column so that the new index will not have the same name as the old
fulldf = fulldf.reset_index() #Resetting the indexes for appearance's sake.
fulldf = fulldf.drop(['i', 'Unnamed: 0'], axis = 1) #removing the columns of indexes which we changed: the change was not
#necessary, only for the sake of appearance
fulldf

## Data Cleaning

Before delving into the data cleaning, we first have to understand what it is that we have to do. We start by checking whether there are any duplicates in the data provided. Because every reviewer has a unique ID, it is enough to check whether there are 2 reviews (2 records in the dataset) with the same ID. To do that, we group the dataset by ID using groupby() and check whether the size of any of the groups is greater than 1 (as we can see below). From this we can gather that no duplicate IDs exist, hence there are no duplicate reviews in the dataset.

In [None]:
dupl = fulldf.groupby(['id']).size()>1
dupl.value_counts()
#Checking that there are no duplicates (no values that are true)
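
As a minimal illustration of the same check outside pandas, the sketch below counts how often each ID occurs and keeps the IDs that occur more than once. The IDs are made up for the example and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical reviewer IDs; 17 appears twice on purpose.
ids = [11, 17, 23, 17, 42]

# Count the occurrences of every ID, mirroring groupby(['id']).size().
counts = Counter(ids)

# Keep only the IDs that occur more than once, mirroring the "> 1" test.
duplicate_ids = [i for i, n in counts.items() if n > 1]
print(duplicate_ids)
```

An empty list here corresponds to a value_counts() result that contains no True values.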

After checking the dataset for duplicates, we no longer need the ID attribute, as it is of little to no use in predicting the overall satisfaction of the reviewer.

In [None]:
fulldf = fulldf.drop(['id'], axis = 1) #Another unneeded attribute

Let us first examine the dataset we now have using info(). Here we can see the name of every attribute, the count of non-null values for each attribute and the type of data each attribute has. Keeping in mind that the dataset has 129880 records, we can derive that the only attribute that has NaN values is "Arrival Delay in Minutes". We will deal with it shortly. Furthermore, as seen below, there are several attributes where the type of data is "object". From a quick review of the dataset, we can see that these are strings, which for the purposes of machine learning we shall have to encode as numerical values.

In [None]:
fulldf.info()

In order to encode the attributes with strings we first have to understand what kinds of strings each attribute contains. From the code below we can see that every attribute with strings has only a small number of possible values. For example, the attribute "Gender" has only 2 possible values: "Male" and "Female". The values of all other string attributes can be seen below, together with the number of records for each value.

In [None]:
print('Gender:\n', fulldf['Gender'].value_counts(), '\n')
print('Customer Type:\n', fulldf['Customer Type'].value_counts(), '\n')
print('Type of Travel:\n', fulldf['Type of Travel'].value_counts(), '\n')
print('Class:\n', fulldf['Class'].value_counts(), '\n')
print('satisfaction:\n', fulldf['satisfaction'].value_counts(), '\n')

Now that we know what each "object" attribute consists of (in terms of values), we can encode them as we see fit. The most logical encoding (for me at least) is one with integers from 0 onward, where 0 is the "worst" of the options and the numbers go upwards to the "best". Where the values have no such order, and so no "worst" or "best", as is the case with the "Gender" and "Type of Travel" attributes, the assignment is arbitrary.

In [None]:
#Encoding values from string to numerical

fulldf['Gender'] = fulldf['Gender'].replace({"Male": 0, "Female": 1})
fulldf['satisfaction'] = fulldf['satisfaction'].replace({"neutral or dissatisfied": 0, "satisfied": 1})
fulldf['Type of Travel'] = fulldf['Type of Travel'].replace({"Personal Travel": 0, "Business travel": 1})
fulldf['Customer Type'] = fulldf['Customer Type'].replace({"disloyal Customer": 0, "Loyal Customer": 1})
fulldf['Class'] = fulldf['Class'].replace({"Eco": 0, "Eco Plus": 1, "Business": 2})
fulldf
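
The mapping idea can be sketched in plain Python as follows. The sample values are made up, and "First" is a hypothetical value that does not appear in the dataset. It is included only to show that a value missing from the mapping is left as a string, which is also how replace() treats values it has no mapping for.

```python
# Ordinal mapping for an attribute with a natural order (same mapping as above).
class_order = {"Eco": 0, "Eco Plus": 1, "Business": 2}

# Hypothetical sample values; "First" is deliberately not in the mapping.
sample = ["Business", "Eco", "Eco Plus", "First"]

# Mapped values become integers, unmapped values are kept unchanged.
encoded = [class_order.get(value, value) for value in sample]

# Collecting unmapped values makes leftover strings easy to spot.
unmapped = [value for value in sample if value not in class_order]
print(encoded, unmapped)
```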

Below are the histograms of every attribute, with the attribute values on the x axis and the number of records with that value on the y axis. For our purposes they are of little significance, except to visually understand the approximate distributions of values over the dataframe. This in turn lets us check that the data is not "capped" on either end, meaning that values above (or below) a certain limit were not rounded to that limit. Another use for the histograms is seeing that there are sufficient records for most if not all values of every attribute. For example, if all reviewers were men, the "Gender" attribute would be of no use to the model.

In [None]:
fulldf.hist(bins=50, figsize=(20,15))

### Dealing with NaN values
Below, we list the records which have NaN values in any cell. As we know from before, the only attribute that has NaN values is "Arrival Delay in Minutes", so all the records listed below have the value of said attribute as NaN.

In [None]:
fulldf[fulldf.isna().any(axis=1)]

Below we create a dataframe of the correlations between every pair of attributes. The dataframe represents the linear correlation between the attributes (non-linear relations can receive very low values, as the function considers only linear relations). The values range between -1 and 1, where positive values represent positive relations ("as X increases, Y increases"), while negative values represent negative relations ("as X increases, Y decreases"). Every attribute has an exact linear relation with itself (1.0000). A final thing to note is that the table is symmetric, so it has many duplicates: we could remove the upper right triangle (cutting the table diagonally from the upper-left corner to the lower-right corner) and not lose any data.<br>
Using this data, we plot a heatmap using seaborn, to see the higher values more easily.

In [None]:
plt.figure(figsize=(13,13))
sns.heatmap(fulldf.corr(), cmap = 'Blues', annot=True, fmt=".2f")
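
To make the "linear only" remark concrete, here is a small sketch of the Pearson correlation coefficient, which is the default measure used by corr(). The helper name `pearson` and the sample values are assumptions made for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values chosen for illustration.
x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]   # exact linear relation
parabola = [v ** 2 for v in x]    # exact, but non-linear, relation

print(pearson(x, linear))
print(pearson(x, parabola))
```

The parabola depends on x exactly, yet its linear correlation with x is zero, which is why a low value in the heatmap does not rule out a non-linear relation.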

We can see from the correlations above that the attributes "Arrival Delay in Minutes" and "Departure Delay in Minutes" have an extremely high linear correlation (0.9653). They are not exact duplicates, but as we have NaN values in "Arrival Delay in Minutes" and none in "Departure Delay in Minutes", we can copy values from the latter to the former where needed. Again, it is not a perfect match, nor do we expect the copied values to be an exact replica of what happened in reality, but we expect their effect on the dependent variable (the satisfaction) to be practically the same. All that remains, after making the decision, is to make said changes, which is done below.<br>
To visualize the codependence of the 2 attributes, we first plot a scatter graph using matplotlib. Although most values are in the bottom left-hand corner, the tendency is undeniable.

In [None]:
fig = plt.figure(figsize = (10,7))
plt.scatter(fulldf['Departure Delay in Minutes'], fulldf['Arrival Delay in Minutes'], alpha = 0.1)

In [None]:
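#Assigning the result back instead of using inplace = True on a selected column, which newer pandas versions warn about
fulldf['Arrival Delay in Minutes'] = fulldf['Arrival Delay in Minutes'].fillna(fulldf['Departure Delay in Minutes'])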
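
The effect of this fill can be sketched on a few made-up delay values (not taken from the dataset): wherever the arrival delay is missing, the departure delay of the same record is used instead.

```python
import math

# Hypothetical delays in minutes; NaN marks a missing arrival delay.
departure = [0.0, 15.0, 42.0, 3.0]
arrival = [0.0, float("nan"), 40.0, float("nan")]

# Keep the recorded arrival delay where present, otherwise fall back to
# the departure delay of the same record.
filled = [d if math.isnan(a) else a for a, d in zip(arrival, departure)]
print(filled)
```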

Next, because the 2 attributes are so closely related (as mentioned in the cell above), we can remove one of them. We decide which of the 2 is unneeded by checking which one has the slightly weaker linear relation to the overall satisfaction.

In [None]:
fulldf = fulldf.drop(['Departure Delay in Minutes'], axis = 1) #removing the departure delay in minutes -
#correlation is 0.96 - very high, no need for both as one follows mostly from the other

In [None]:
fulldf.info() #Presenting the final dataset info before machine learning algorithms for future reference

One last thing we have to do before we split the data into train and test sets is to normalize the values. As all our values are now numerical, and there are no NaN values, we can do that without too much trouble. The normalization will only be done on the values which will be given to us in the future for prediction (all data except the attribute which we have decided to predict: the overall satisfaction of the reviewer). In this case, the overall satisfaction is already normalized, as it consists of the values 0 and 1 only, but in general it is not necessary to normalize the attribute being predicted.<br>
We choose normalization and not standardization due to there being no outliers in this dataset.<br><br>
To normalize the data, we first split it into x and y, as mentioned above.

In [None]:
x = fulldf.drop(['satisfaction'], axis = 1)
y = fulldf['satisfaction']

Next, we use the min_max_scaler.fit_transform() method to achieve normalization of the data.

In [None]:
z = x.values #returns a numpy array
min_max_scaler = MinMaxScaler()
z_scaled = min_max_scaler.fit_transform(z)
x = pd.DataFrame(z_scaled)
x
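
Min-max normalization rescales every column linearly with $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$, so that the smallest value becomes 0 and the largest becomes 1. A minimal sketch on made-up flight distances follows. The helper name `min_max_scale` is an assumption for this example, and mapping a constant column to 0 is a choice made here for simplicity.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical flight distances, not taken from the dataset.
distances = [200, 450, 1200, 3200]
scaled = min_max_scale(distances)
print(scaled)
```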

## Models
Now that we are finished with data cleaning, we have a dataset with no NaN values, in which all attributes are numerical and normalized, and we can split the data using the train_test_split() method. The random_state parameter (an integer) fixes the random shuffling, so that over multiple reruns the resulting dataframes remain the same. The test size is set to 0.1 (10%), which results in close to 13000 test records (which is enough in my opinion).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.1, random_state = 0)
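
The role of the random state can be sketched with the standard library alone. The function `split_indices` below is a hypothetical helper and does not reproduce the exact shuffling or rounding of train_test_split(). It only shows that a fixed seed gives the same split on every run.

```python
import random

def split_indices(n_records, test_size, seed):
    """Shuffle record indices with a fixed seed and cut off a test share."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_size)
    # Returns (train indices, test indices).
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(20, test_size=0.1, seed=0)
train_b, test_b = split_indices(20, test_size=0.1, seed=0)
print(len(train_a), len(test_a), test_a == test_b)
```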

### Dummy Classifier

Now that we have split our data, we can get down to the models. Before choosing and applying the models to the data, we need a baseline model to compare all others to: if the models thereafter are worse than the base model, we are doing something wrong. If a model is better than the base model, on the other hand, it means that the model works and predicts better than a "dummy" model. The base model we will use is the "Dummy Classifier", which will, regardless of the input, predict the most frequent value in y_train. It does so because of the strategy we have passed to the model. We could choose a different strategy, but this one seems to return the best results.<br><br>
After using the model.fit() and model.predict() methods, we can use the metrics.accuracy_score() method to see which percentage of the predictions made by the dummy classifier were correct. In our case, the percentage of correct answers is 56.7%. Not bad, although not unexpected: it almost certainly had to be over 50%, since out of the 2 options the dummy classifier chose the one which appears most often.

In [None]:
dummyModel = DummyClassifier(strategy="most_frequent")
dummyModel.fit(X_train, y_train)
predictionsDummy = dummyModel.predict(X_test)

accuracyDummy = metrics.accuracy_score(y_test, predictionsDummy)
accuracyDummy
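
What the "most_frequent" strategy amounts to can be sketched by hand: its accuracy on the test set is simply the share of test labels that equal the majority label of the train set. The labels below are made up for the example.

```python
from collections import Counter

# Hypothetical labels: 0 = "neutral or dissatisfied", 1 = "satisfied".
y_train_demo = [0, 0, 1, 0, 1, 0]
y_test_demo = [0, 1, 0, 0, 1]

# The most frequent training label becomes the prediction for every record.
majority_label = Counter(y_train_demo).most_common(1)[0][0]
predictions = [majority_label] * len(y_test_demo)

# Accuracy is the share of test labels equal to the majority label.
accuracy = sum(p == t for p, t in zip(predictions, y_test_demo)) / len(y_test_demo)
print(majority_label, accuracy)
```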

We shall plot a confusion matrix to visualize how many values the dummy classifier predicted correctly (the confusion matrix is explained in detail later in the notebook).

In [None]:
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, predictionsDummy), cmap = "Blues", annot=True, fmt = '.0f')

### KNN: K Nearest Neighbor

Now that we know what we strive for (a model that returns an accuracy of more than 0.56), we can start applying a model to the data that does something more than the dummy classifier. We will first use the KNN model. The KNN model (K Nearest Neighbors) assigns to a new record the label of the records from the train set that are closest to it. If K = 1, the label is that of the single closest neighbor; if K > 1, the label is decided by the K closest neighbors (in terms of data). The "weights" parameter specifies whether all K neighbors count equally or whether closer neighbors get a larger say than those further away (the latter is what 'distance' means, and it is what we use here).<br><br>
In the cell below we define a function whose purpose is to determine the best K value, meaning the number of neighbors for which the model works best on this data. True, the differences between the different values do not seem very significant, but I believe it gives perspective.<br>
*Note: Because the function takes several minutes to run, we record its results in comments and leave the function call commented out, so as not to delay the running of the notebook.

In [None]:
def chooseKNN():
    bestK = 1 #saving the number of neighbors with the highest score
    bestScore = 0 #saving the value of the highest score (not named "max", which would shadow the built-in)
    for k in range(1,20):
        modelKNN = KNeighborsClassifier(n_neighbors = k, weights='distance')
        modelKNN.fit(X_train, y_train)
        accuracy = modelKNN.score(X_test, y_test)
        if (accuracy > bestScore):
            bestK = k
            bestScore = accuracy
    print(bestK, "  ", bestScore)
#chooseKNN()
#The best number of neighbors returned was: 9
#Its accuracy was: 0.9335540498922081

We have identified the best number of neighbors for this data: 9. We shall run the model one last time to record the predictions made, so as to evaluate the model using measures other than the built-in score() method.

In [None]:
modelKNN = KNeighborsClassifier(n_neighbors = 9, weights='distance')
modelKNN.fit(X_train, y_train)
predictionsKNN = modelKNN.predict(X_test)
accuracyKNN = metrics.accuracy_score(y_test, predictionsKNN) #Returns the same value as the score() method in the
#previous cell
accuracyKNN
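
For intuition, here is a minimal sketch of distance-weighted voting, in which every one of the K nearest neighbors votes for its label with weight 1/distance. It is a simplified illustration and not scikit-learn's implementation. The function name `knn_predict` and the sample points are assumptions made for the example.

```python
import math
from collections import defaultdict

def knn_predict(train_points, train_labels, query, k):
    """Predict a label from the k nearest points, weighting votes by 1/distance."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        if distance == 0:
            # An exact match decides the label on its own.
            return label
        votes[label] += 1.0 / distance
    return max(votes, key=votes.get)

# Hypothetical 2-feature records, already scaled to [0, 1].
points = [(0.1, 0.1), (0.9, 0.8), (0.8, 0.9), (0.2, 0.3)]
labels = [0, 1, 1, 0]
print(knn_predict(points, labels, query=(0.3, 0.2), k=3))
```

With k=3 the two close neighbors labelled 0 outweigh the single distant neighbor labelled 1, so the query is labelled 0.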

Below we look for the best value of the cv parameter (the number of K-folds) for the cross validation score. We try the values from 2 to 19. The cv value means the number of "folds" the training set is divided into.<br>
In cross validation we divide the training set into a certain number of parts of (almost) identical length. The same number of runs is then made, and in each run a different part is set aside: the model is trained on the remaining parts and evaluated on the part that was set aside. The performance of each run is recorded, and the mean value is displayed.<br><br>
After these runs we have identified that 19 is the best number of k-folds for this data (out of the given range at least). As we were hoping, the cross validation accuracy (almost) matches the accuracy of the original model, meaning the model works as expected, most probably without overfitting or underfitting.<br>
*Note: As with the KNN value, because the function takes several minutes to run, we record its results in comments and leave the function call commented out, so as not to delay the running of the notebook.

In [None]:
def chooseKFold(model):
    bestFolds = 2 #saving the number of folds with the highest score
    bestScore = 0 #saving the value of the highest score (not named "max", which would shadow the built-in)
    for i in range(2,20):
        accuracy = cross_val_score(model, X_train, y_train, cv = i).mean()
        if (accuracy > bestScore):
            bestFolds = i
            bestScore = accuracy
    print('Best number of folds:', bestFolds, "\ncross_val_score with", bestFolds, 'folds:', bestScore)

In [None]:
#chooseKFold(modelKNN)
#The best number of folds returned was: 19
#Its cross_val_score was: 0.9317832165238031
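
How a training set is cut into folds can be sketched as follows. This is a plain, unshuffled split written for illustration. It ignores the stratification by label that scikit-learn applies when cross validating a classifier, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_records, n_folds):
    """Yield (train, validation) index lists for consecutive folds."""
    base, extra = divmod(n_records, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional record.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = [i for i in range(n_records) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_records=10, n_folds=3))
for train, validation in folds:
    print(len(train), validation)
```

Every record lands in exactly one validation part, and each run trains on all the other parts.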

We now want to know exactly how well our model has performed on the task at hand. We shall plot a confusion matrix to understand how many of the answers the model got right and wrong per label for the test set.<br>
Each cell from left to right and top to bottom: the model said 0 and it was 0, the model said 1 and it was 0, the model said 0 and it was 1, the model said 1 and it was 1.<br>
This is, of course, where 0 is "neutral or dissatisfied" and 1 is "satisfied".<br>
In other words, the diagonal cells from the top left corner are the values the model predicted correctly, and all others (in this case only 2) are the ones the model predicted incorrectly.<br>
Another thing to note is that in general, the confusion matrix is built thus: the x axis represents the predicted labels, whereas the y axis represents the actual labels.<br>
The confusion matrix is shown as a heatmap using seaborn.

In [None]:
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, predictionsKNN), cmap = "Blues", annot=True, fmt = '.0f')
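
The cell layout described above can be reproduced by hand on a few made-up labels: rows are the actual labels and columns are the predicted labels. The function name `confusion_counts` is an assumption made for this sketch.

```python
def confusion_counts(actual, predicted):
    """Return a 2x2 matrix: rows are actual labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical labels: 0 = "neutral or dissatisfied", 1 = "satisfied".
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))
```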

Next, we want to show the "classification report" of the model at hand (KNN). The report shows us several things:
- Precision - There are 2 values:<br>
    Precision for 0: out of all predicted 0's, how many were predicted correctly? The answer according to the report is 0.92, or 92%.<br>
    Precision for 1: out of all predicted 1's, how many were predicted correctly? The answer according to the report is 0.96, or 96%.<br>
    The equation for precision is:
    $$ Precision = \frac{TruePositive}{TruePositive + FalsePositive} $$<br>

- Recall - There are 2 values:<br>
    Recall for 0: out of all true 0's, how many were predicted correctly? The answer according to the report is 0.97, or 97%.<br>
    Recall for 1: out of all true 1's, how many were predicted correctly? The answer according to the report is 0.89, or 89%.<br>
    The equation for recall is:
    $$ Recall = \frac{TruePositive}{TruePositive + FalseNegative} $$<br>

- F1 score - The overall evaluation of the model for each label, combining precision and recall. It is calculated for each value by the equation:
  $$ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} $$<br>

- Accuracy - The share of predictions that were made correctly. The report shows us that the accuracy of the KNN model is 0.93, or 93%.

- Support - The column states the number of true occurrences of each of the 2 values in the test set, as well as their total (the last 3 values of support).

In [None]:
print(classification_report(y_test, predictionsKNN))
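
As a worked example of the equations above, the sketch below computes precision, recall and F1 for label 1, plus the accuracy, from the four cells of a confusion matrix. The counts are made up for illustration and are not the notebook's results.

```python
def report_for_label_1(tn, fp, fn, tp):
    """Precision, recall and F1 for label 1 from the four confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = 50, 10, 20, 20
precision, recall, f1 = report_for_label_1(tn, fp, fn, tp)

# Accuracy counts the correct predictions of both labels.
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(precision, recall, f1, accuracy)
```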

### Logistic Regression

For our second model we will choose Logistic Regression. This model is chosen because it makes no assumptions regarding the distribution of labels, because it works with probabilities which it calculates for each record in the dataframe, and because it is a fairly simple model that nevertheless works well with many datasets. The model tries to fit an S shape to the data (as opposed to linear regression, which fits a straight line). At the edges of the S the model is very confident about the label, and closer to the middle of the S it is less so. We can also set a threshold "on the S", above which the model will predict a certain value. This is usually done when one value is preferred over the other. As we try to be objective with the reviews, we will not do it.<br><br>
As done with the KNN model, we define the model, use the fit() and predict() methods, and calculate and present the accuracy of the model, in this case 0.87 or 87%.

In [None]:
modelLogReg = LogisticRegression()
modelLogReg.fit(X_train, y_train)
predictionsLogReg = modelLogReg.predict(X_test)
accuracyLogReg = modelLogReg.score(X_test, y_test)
accuracyLogReg
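
The "S shape" is the logistic (sigmoid) function. A minimal sketch of how a score is turned into a probability and then into a label follows. The names `sigmoid` and `predict_label` are assumptions made for this example, and z stands for the weighted sum of a record's features plus the intercept.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn the probability for label 1 into a 0/1 prediction."""
    return 1 if sigmoid(z) >= threshold else 0

# Far from 0 the probability is close to 0 or 1; at 0 it is exactly 0.5.
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```

Raising the threshold, for example to 0.9, makes the model predict label 1 only when it is very confident.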

Below, again, the function for identifying the optimal k-fold was called. As is clearly visible, the result is very similar to the accuracy of the original Logistic Regression model, meaning the model works as it should, most probably without overfitting or underfitting.

In [None]:
#chooseKFold(modelLogReg)
print('The values returned by the function were:\n', 'The optimal K-fold is: 12 and its score is 0.8742771104951578')

As in KNN, we show the confusion matrix as a heatmap and the classification report.

In [None]:
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, predictionsLogReg), cmap = "Blues", annot=True, fmt = '.0f')

In [None]:
print(classification_report(y_test, predictionsLogReg))

In the following cells, we will plot the ROC curve for the 3 models we have. The ROC curve describes the performance of a model in terms of its confusion matrix. When the threshold is changed from the default (0.5 in our case), the confusion matrix changes too, and more specifically the balance between its left column (predicted 0) and its right column (predicted 1): as one increases, the other decreases. Likewise, when the bottom cell of a column increases, the top cell of the same column cannot decrease. For the right column this means that the true positives and the false positives grow together, and it is this relationship, expressed as rates, that the ROC curve shows plainly.<br>
To plot the curve, we first calculate and save the predicted probabilities of every model we want to present. The probabilities are returned as a 2D array, but we only need a single column. The second column (label "1") was chosen. We use the metrics.roc_curve() method to return 3 values, the first 2 of which will be used as the x and y axes, as their names imply (fpr = false positive rate; tpr = true positive rate).<br>
The code that comes after plots the graphs, names the axes and shows the legend.

In [None]:
probsKNN = modelKNN.predict_proba(X_test)[:, 1]
probsLogReg = modelLogReg.predict_proba(X_test)[:, 1]
dummyProbs = dummyModel.predict_proba(X_test)[:, 1]

In [None]:
fprLR, tprLR, thresholdsLR = metrics.roc_curve(y_test, probsLogReg)
fprKNN, tprKNN, thresholdsKNN = metrics.roc_curve(y_test, probsKNN)
fprDummy, tprDummy, thresholdsDummy = metrics.roc_curve(y_test, dummyProbs)
fig = plt.figure()
axes = fig.add_axes([0,0,1,1])
axes.plot(fprLR, tprLR, label = "LogReg")
axes.plot(fprKNN, tprKNN, label = "KNN")
axes.plot(fprDummy, tprDummy, label = "Dummy")
axes.set_xlabel("False positive rate")
axes.set_ylabel("True positive rate")
axes.set_title("ROC Curve for KNN, Logistic regression, Dummy")
axes.legend()
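
How the points of a ROC curve arise can be sketched by sweeping the threshold over the predicted probabilities and recording the false positive rate and true positive rate at each step. This is a simplified illustration, not scikit-learn's implementation, and the labels and scores are made up.

```python
def roc_points(actual, scores):
    """Return (fpr, tpr) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical labels and predicted probabilities for label 1.
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_points(actual, scores))
```

As the threshold drops, more records are predicted as 1, so both rates can only stay the same or grow until the curve reaches (1, 1).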

For completeness, we calculate the AUC of each model (the area under the ROC curve) using the built-in method metrics.auc(). The results are plain. Out of the 3 models shown, the KNN is the best, although the logistic regression is not bad either. The dummy, as predicted, should not be used.

In [None]:
print('AUC of Logistic Regression model:', metrics.auc(fprLR, tprLR))
print('AUC of KNN model:', metrics.auc(fprKNN, tprKNN))
print('AUC of Dummy model:', metrics.auc(fprDummy, tprDummy))
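
The area itself can be computed with the trapezoidal rule, which is also what metrics.auc() uses. The sketch below applies it to made-up ROC points. `trapezoid_auc` is a hypothetical helper name.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve given by points sorted by increasing fpr."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        area += width * (tpr[i] + tpr[i - 1]) / 2
    return area

# Hypothetical ROC points for illustration.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
print(trapezoid_auc(fpr, tpr))

# A constant prediction yields only the diagonal from (0, 0) to (1, 1).
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))
```

The diagonal gives an area of 0.5, which is why a model that always predicts the same label, like the dummy, cannot score higher.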

Let us show the accuracy scores of the 3 models:

In [None]:
fig = plt.figure()
accuracies = [accuracyDummy, accuracyKNN, accuracyLogReg]
accuraciesSize = np.arange(len(accuracies))
axes = fig.add_axes([0,0,1,1])
axes.bar(['Dummy', 'KNN', 'Logistic Regression'], accuracies)
axes.set_ylabel("Accuracy Score")
axes.set_title("Accuracy Scores of the different models")

To conclude, we have tested 2 models (3 if you count the dummy) on the dataset to predict the passengers' overall satisfaction with the airline from the rest of their survey answers. The best model out of the 3 is undoubtedly the KNN (for this specific dataset), with a high accuracy (0.93) and a high AUC of the ROC curve.