**Context**
This dataset contains an airline passenger satisfaction survey. Which factors are most strongly correlated with passenger satisfaction (or dissatisfaction)? Can you predict passenger satisfaction?

**Content**
Gender: Gender of the passengers (Female, Male)

Customer Type: The customer type (Loyal customer, disloyal customer)

Age: The actual age of the passengers

Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)

Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)

Flight distance: The flight distance of this journey

Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)

Departure/Arrival time convenient: Satisfaction level with the convenience of the departure/arrival time

Ease of Online booking: Satisfaction level of online booking

Gate location: Satisfaction level of Gate location

Food and drink: Satisfaction level of Food and drink

Online boarding: Satisfaction level of online boarding

Seat comfort: Satisfaction level of Seat comfort

Inflight entertainment: Satisfaction level of inflight entertainment

On-board service: Satisfaction level of On-board service

Leg room service: Satisfaction level of Leg room service

Baggage handling: Satisfaction level of baggage handling

Check-in service: Satisfaction level of Check-in service

Inflight service: Satisfaction level of inflight service

Cleanliness: Satisfaction level of Cleanliness

Departure Delay in Minutes: Departure delay, in minutes

Arrival Delay in Minutes: Arrival delay, in minutes

Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)

Note that this dataset is a modified version of a dataset by John D. It has been cleaned up for the purposes of classification.
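
The data dictionary marks 0 as "Not Applicable" for the inflight wifi rating, which means a 0 is a placeholder rather than a very low score. A minimal sketch, on made-up ratings rather than the real data, of how leaving the placeholder in place can distort a simple average:

```python
import numpy as np
import pandas as pd

# Toy ratings, not taken from the dataset; 0 stands for "Not Applicable".
toy = pd.DataFrame({"Inflight wifi service": [0, 3, 5, 0, 4]})

# Replace the 0 placeholder with NaN so it is excluded from statistics.
wifi = toy["Inflight wifi service"].replace(0, np.nan)

mean_with_zeros = toy["Inflight wifi service"].mean()  # zeros counted as scores
mean_rated_only = wifi.mean()                          # NaN values are skipped
print(mean_with_zeros, mean_rated_only)
```

Whether to mask the zeros depends on the model: tree-based models such as a random forest can often cope with the raw 0 code, so this is an option to consider rather than a required step.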

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Load the data
df = pd.read_csv('/kaggle/input/airline-passenger-satisfaction/train.csv')
test = pd.read_csv('/kaggle/input/airline-passenger-satisfaction/test.csv')

df.shape, test.shape

In [None]:
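# Combine the two dataframes
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df_all = pd.concat([df, test], ignore_index=True)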

df_all.shape

In [None]:
# View the data
df_all

In [None]:
# Inspect the column types and non-null counts
df_all.info()

In [None]:
# Summary statistics for the numeric columns
df_all.describe().T

In [None]:
# Drop the 'Unnamed: 0' and 'id' columns
df_all.drop(['Unnamed: 0', 'id'], axis=1, inplace=True)


In [None]:
# Show the columns of dtype object
df_all.select_dtypes('object').head()

In [None]:
df_all.head(5)

In [None]:
# Count the unique values in each object column
df_all[['Gender','Customer Type','Type of Travel','Class','satisfaction']].nunique()

In [None]:
# Inspect the satisfaction column
df_all['satisfaction'].value_counts()

In [None]:
# Encode 'neutral or dissatisfied' as 0 and 'satisfied' as 1
df_all['satisfaction'] = df_all['satisfaction'].replace({'satisfied': 1, 'neutral or dissatisfied' : 0}).astype(int)

In [None]:
# Check the satisfaction column after encoding
df_all['satisfaction'].value_counts()

In [None]:
# Inspect the Gender column
df_all['Gender'].value_counts()

In [None]:
# Encode Female as 0 and Male as 1
df_all['Gender'] = df_all['Gender'].replace({'Male': 1, 'Female' : 0}).astype(int)

In [None]:
# Check the Gender column after encoding
df_all['Gender'].value_counts()

In [None]:
# Inspect the Customer Type column
df_all['Customer Type'].value_counts()

In [None]:
# Encode 'Loyal Customer' as 0 and 'disloyal Customer' as 1
df_all['Customer Type'] = df_all['Customer Type'].replace({'Loyal Customer': 0, 'disloyal Customer' : 1}).astype(int)

In [None]:
# Check the Customer Type column after encoding
df_all['Customer Type'].value_counts()

In [None]:
# Inspect the Type of Travel column
df_all['Type of Travel'].value_counts()

In [None]:
# Encode 'Business travel' as 0 and 'Personal Travel' as 1
df_all['Type of Travel'] = df_all['Type of Travel'].replace({'Business travel': 0, 'Personal Travel' : 1}).astype(int)

In [None]:
# Check the Type of Travel column after encoding
df_all['Type of Travel'].value_counts()

In [None]:
# Inspect the Class column
df_all['Class'].value_counts()

In [None]:
# Encode Business as 0, Eco as 1 and Eco Plus as 2
df_all['Class'] = df_all['Class'].replace({'Business': 0, 'Eco' : 1, 'Eco Plus' : 2}).astype(int)

In [None]:
# Check the Class column after encoding
df_all['Class'].value_counts()
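
The dictionary-based encodings above rely on every label being spelled exactly as in the mapping (for example `'Business travel'` versus `'Personal Travel'`). A minimal sketch, on a made-up column rather than the real data, of one way to detect labels the mapping missed before casting to `int`; the names `toy`, `class_codes` and `unmapped` are illustrative only:

```python
import pandas as pd

# Toy column, not taken from the dataset.
toy = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus", "Eco"]})

class_codes = {"Business": 0, "Eco": 1, "Eco Plus": 2}

# Series.map returns NaN for any label missing from the dictionary,
# so unmapped categories can be detected before the integer cast.
encoded = toy["Class"].map(class_codes)
unmapped = toy.loc[encoded.isna(), "Class"].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped categories: {list(unmapped)}")

toy["Class"] = encoded.astype(int)
print(toy["Class"].tolist())
```

Integer codes impose an order on the categories. If that order is not meaningful for a given column, one-hot encoding with `pd.get_dummies` is a common alternative.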

In [None]:
# Overview of the data after the adjustments
df_all.head()

In [None]:
# All object columns have been converted to int
df_all.info()

In [None]:
# Check for null values
df_all.isnull().sum()

In [None]:
# Fill missing values in "Arrival Delay in Minutes" with 0.
# Only this column is filled; the other columns of those rows are left untouched.
df_all['Arrival Delay in Minutes'] = df_all['Arrival Delay in Minutes'].fillna(0)
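
When imputing a single column, it matters whether the assignment targets rows or a column: assigning through a boolean row mask overwrites every column of the matching rows, including the target. A minimal sketch on a made-up frame (the values are illustrative, not from the dataset) comparing the two forms:

```python
import numpy as np
import pandas as pd

# Toy frame, not taken from the dataset.
toy = pd.DataFrame({
    "Age": [30, 45, 52],
    "Arrival Delay in Minutes": [5.0, np.nan, 12.0],
})
mask = toy["Arrival Delay in Minutes"].isnull()

# Row-wise assignment: every column of the matching rows becomes 0.
row_wise = toy.copy()
row_wise[mask] = 0

# Column-wise fill: only the delay column changes.
col_wise = toy.copy()
col_wise["Arrival Delay in Minutes"] = col_wise["Arrival Delay in Minutes"].fillna(0)

print(row_wise["Age"].tolist())  # Age in the masked row is overwritten
print(col_wise["Age"].tolist())  # Age is preserved
```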

In [None]:
df_all.isnull().sum()

**Building the Classification Model**

In [None]:
# Instantiate the random forest classifier
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_jobs=-1, n_estimators=200, random_state=420)


In [None]:
# Select the feature columns used for training (everything except the target)
feats = [c for c in df_all.columns if c not in ['satisfaction']]
feats

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into train and test (80% / 20%)
train, test = train_test_split(df_all, test_size=0.20, random_state=42)

# Split the remaining train set into train and validation (80% / 20%)
train, valid = train_test_split(train, test_size=0.20, random_state=42)

train.shape, valid.shape, test.shape
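
The two-stage split above leaves roughly 64% of the rows for training, 16% for validation and 20% for testing. A minimal sketch on a made-up frame of the same pattern with the `stratify` argument added, which keeps the class balance similar across the splits; stratification is a suggestion here, not something the notebook does:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 60/40 class balance, not taken from the dataset.
toy = pd.DataFrame({"x": range(100), "satisfaction": [0] * 60 + [1] * 40})

# Split off 20% for test, then 20% of the remainder for validation,
# stratifying on the target both times.
train_toy, test_toy = train_test_split(
    toy, test_size=0.20, random_state=42, stratify=toy["satisfaction"])
train_toy, valid_toy = train_test_split(
    train_toy, test_size=0.20, random_state=42,
    stratify=train_toy["satisfaction"])

print(train_toy.shape, valid_toy.shape, test_toy.shape)
print(test_toy["satisfaction"].mean())
```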

In [None]:
# Train the model
rfc.fit(train[feats], train['satisfaction'])

In [None]:
# Predict on the validation data
preds_val = rfc.predict(valid[feats])

preds_val

In [None]:
# Evaluate the model's performance

# Import the metric
from sklearn.metrics import accuracy_score

In [None]:
# Accuracy of the validation predictions
accuracy_score(valid['satisfaction'], preds_val)

In [None]:
# Measure the accuracy on the test data
preds_test = rfc.predict(test[feats])

accuracy_score(test['satisfaction'], preds_test)
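
Accuracy alone does not show which class the errors fall on. A minimal sketch, using made-up labels and predictions rather than the model's output, of how a confusion matrix and a per-class report break the same predictions down:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Toy labels and predictions, not produced by the model above.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(acc)
print(cm)
print(classification_report(
    y_true, y_pred,
    target_names=["neutral or dissatisfied", "satisfied"]))
```

On the notebook's data, the same calls could be applied to `test['satisfaction']` and `preds_test`.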

In [None]:
df_all.info()

In [None]:
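# Simulate a prediction with fictitious data.
# Values follow the column order in feats; wrapping them in a DataFrame keeps
# the feature names the model was trained with.
simulacao = pd.DataFrame(
    [[1,0,45,0,0,460,3,2,5,3,5,3,3,4,5,4,3,5,3,5,5,10],
     [0,1,15,1,1,2500,3,2,5,1,4,3,3,2,3,4,3,1,2,3,15,20],
     [1,0,25,1,2,1320,4,5,4,4,3,4,4,1,4,5,4,2,3,5,10,10]],
    columns=feats)
print(rfc.predict(simulacao))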

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 15))

# Plot the importance of each column (each input variable)
pd.Series(rfc.feature_importances_, index=feats).sort_values().plot.barh()
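
The `feature_importances_` values plotted above are impurity-based and computed on the training data. As a complementary check, scikit-learn also offers permutation importance, which measures the drop in score on held-out data when a column is shuffled. A minimal sketch on synthetic data (the `signal` and `noise` columns are invented for illustration and are not part of the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends only on "signal"; "noise" is irrelevant.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y = (X["signal"] > 0).astype(int)

# Fit on the first 300 rows and hold out the last 100 for scoring.
X_fit, X_held = X.iloc[:300], X.iloc[300:]
y_fit, y_held = y.iloc[:300], y.iloc[300:]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_fit, y_fit)

# Shuffle each column several times and record the mean drop in accuracy.
result = permutation_importance(
    model, X_held, y_held, n_repeats=5, random_state=0)
perm = pd.Series(result.importances_mean, index=X.columns).sort_values()
print(perm)
```

With the notebook's objects, the equivalent call would take `rfc`, `valid[feats]` and `valid['satisfaction']`.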