# Introduction

I came across this dataset while looking for more interesting classification problems. It was already fairly clean, so running a simple classifier on it was not much trouble. Since I have flown all over the world in the past 4 years, I was interested to see whether the provided variables could be useful in determining customer satisfaction, and I found that almost all of them are. Because the problem is a simple binary classification, I figured I would not need to run several models, so I went with Random Forests, my favorite classifier.

I also think the dataset could be skewed in a number of ways. There is no information on how it was created, but I assume it came from a simple survey sent to passengers. People often answer surveys without giving them much thought, which could skew the numbers heavily.

Looking into the dataset further, I found that some variables could have been better. A simple example is the satisfaction level itself. The dataset provides only 2 classes: neutral or dissatisfied, and satisfied. I think it would be better if satisfaction were rated on a scale rather than classified, or if neutral or dissatisfied were split into two classes. I also wish ticket prices were included, since I assume price plays a relatively important role in satisfaction.

In addition, it would be interesting to explore the satisfaction levels of economy class passengers further. I assume that business class and first class passengers would be satisfied with their flights regardless of the airline, and that economy class passengers would be more prone to being dissatisfied. Determining what makes economy class passengers happier without being too costly could increase ticket sales, since economy class tickets are the most common.

Overall, with Random Forests, I was able to get 96% accuracy, which is not surprising since the dataset was fairly clean and the problem is a binary classification.

## Initial Cleaning and Exploration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

I dropped Arrival Delay after comparing it with Departure Delay, since the two are essentially the same. I also dropped Gender, on the assumption that it is not important to overall satisfaction with the flight.

In [None]:
train = pd.read_csv('/kaggle/input/airline-passenger-satisfaction/train.csv')
test = pd.read_csv('/kaggle/input/airline-passenger-satisfaction/test.csv')
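# Combine both files into one frame; ignore_index avoids duplicate row labels
d = pd.concat([train, test], ignore_index=True)
# Drop identifier columns and the two features excluded above
d = d.drop(columns=['Unnamed: 0', 'id', 'Arrival Delay in Minutes', 'Gender'])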
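The comparison between the two delay columns can be checked with a simple correlation. The sketch below is only an illustration on synthetic data: the delay values are made up, and the column name `Departure Delay in Minutes` is assumed by analogy with `Arrival Delay in Minutes`. On the real data, the same `.corr()` call would be applied to `train` before dropping the column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two delay columns (NOT the airline data)
rng = np.random.default_rng(0)
dep = rng.exponential(scale=15.0, size=1000).round()
# Arrival delay modeled as departure delay plus a little noise, floored at 0
arr = np.clip(dep + rng.normal(0, 3, size=1000), 0, None).round()

delays = pd.DataFrame({
    'Departure Delay in Minutes': dep,  # assumed column name
    'Arrival Delay in Minutes': arr,
})

# Pearson correlation between the two columns
corr = delays['Departure Delay in Minutes'].corr(delays['Arrival Delay in Minutes'])
print('Correlation:', round(corr, 3))
```

A correlation close to 1 would support keeping only one of the two columns.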

In [None]:
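# Encode the target: 0 = neutral or dissatisfied, 1 = satisfied
ydict = {'neutral or dissatisfied': 0,
         'satisfied': 1}
y = d['satisfaction'].map(ydict)

# Everything except the target is a feature
d = d.drop(columns='satisfaction')
X = d

# One-hot encode the categorical columns
X = pd.get_dummies(X, columns=['Customer Type', 'Type of Travel', 'Class'])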

I was surprised to see that Flight Distance scores significantly higher than all the other variables. The flights in the dataset are mostly domestic, and for that reason alone I had assumed that distance would not matter much. After all, if it is all about the destination, I would think that a customer would go with the cheapest ticket.

Credit for the feature-scoring code goes to [this article](https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e).

In [None]:
# Score every feature against the target with the chi-squared statistic
# (k='all' keeps all columns; chi2 requires non-negative feature values)
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X, y)
# Pair each column name with its score and print them in descending order
featureScores = pd.DataFrame({'Feature': X.columns, 'Score': fit.scores_})
print(featureScores.nlargest(len(featureScores), 'Score'))
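One caveat when reading these scores: scikit-learn's `chi2` is computed from sums of feature values per class, so the statistic grows with a feature's numeric scale. If Flight Distance takes much larger values than the other columns (likely, though not shown here), that may account for part of its large score. The sketch below uses synthetic data with illustrative names (`x_small`, `X_syn`, `y_syn`) to show the effect: the same feature multiplied by 100 gets a score 100 times larger.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Synthetic data (NOT the airline data): one non-negative feature
# that is mildly related to a binary label
rng = np.random.default_rng(0)
y_syn = rng.integers(0, 2, size=500)
x_small = rng.integers(1, 6, size=500) + y_syn

# Second column is the same feature on a 100x larger scale
X_syn = np.column_stack([x_small, x_small * 100])

scores, pvalues = chi2(X_syn, y_syn)
ratio = scores[1] / scores[0]
print('Scores:', scores)
print('Ratio of scores:', ratio)
```

Because of this, it may be worth rescaling the features to a common range (for example with `MinMaxScaler`) before comparing chi-squared scores, or cross-checking against the Random Forest's own `feature_importances_`.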

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)

In [None]:
# Fix the seed so the reported accuracy is reproducible
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, pred))
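An accuracy figure is easier to interpret next to a trivial baseline. The sketch below is a hedged illustration on synthetic data from `make_classification`, not the airline dataset: `DummyClassifier` always predicts the most frequent class, and a Random Forest is compared against it. All variable names here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (NOT the airline data)
X_syn, y_syn = make_classification(n_samples=1000, n_features=10,
                                   n_informative=5, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn,
                                      test_size=0.25, random_state=1)

# Baseline: always predict the majority class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(Xtr, ytr)
forest = RandomForestClassifier(random_state=1).fit(Xtr, ytr)

baseline_acc = accuracy_score(yte, baseline.predict(Xte))
forest_acc = accuracy_score(yte, forest.predict(Xte))
print('Baseline accuracy:', baseline_acc)
print('Random Forest accuracy:', forest_acc)
```

Applied to the real split, the same baseline would be fit on `X_train, y_train` and scored on `X_test, y_test`, giving a reference point for the accuracy printed above.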