# Exploratory Data Analysis (EDA)

# Intro
This dataset records whether a customer was satisfied with the airline after travelling with it. It also contains the customers' feedback ratings on several aspects of the service, along with their demographic data.

Data set URL : https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction


# STEP #1: Import Libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import random as rnd

from collections import Counter

# From Sklearn, sub-library model_selection, train_test_split so I can, well, split to training and test sets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
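# plot_roc_curve was removed in newer scikit-learn releases; fall back to the equivalent RocCurveDisplay.from_estimator
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator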
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline
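# Newer matplotlib releases renamed the seaborn styles with a "seaborn-v0_8-" prefix
try:
    plt.style.use("seaborn-whitegrid")
except OSError:
    plt.style.use("seaborn-v0_8-whitegrid")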


# STEP #2: IMPORT DATASET

In [None]:
dataset = pd.read_csv("../input/airlines-customer-satisfaction/Invistico_Airline.csv")
data = dataset.sample(frac=.05)
data.sample(10)

# STEP #3: Explore / Visualize Data set

## STEP #3.1: Check the count / data type for each feature in the dataset

In [None]:
data.info()

As shown above, the Arrival Delay in Minutes feature has missing values.

## STEP #3.2: Features analysis

By checking the features, there are two types of features:
   * Categorical Features: satisfaction, Gender, Customer Type, Type of Travel, Class.
   * Numerical Features: Age, Flight Distance, Seat comfort, Departure/Arrival time convenient, Food and drink, Gate location, Inflight wifi service, Inflight entertainment, Online support, Ease of Online booking, On-board service, Leg room service, Baggage handling, Checkin service, Cleanliness, Online boarding, Departure Delay in Minutes, Arrival Delay in Minutes.
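
The same split can also be derived from the column dtypes instead of being typed by hand. The sketch below is only an illustration on a tiny made-up frame (the `toy` DataFrame and its values are assumptions, only the column names come from the dataset); it treats every non-numeric column as categorical.

```python
import pandas as pd

# Tiny stand-in for the airline data: column names follow the dataset,
# the values are made up for illustration.
toy = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
    "Class": ["Eco", "Business", "Eco Plus"],
    "Age": [25, 41, 63],
    "Flight Distance": [460, 2300, 1100],
})

# Numeric dtypes are treated as numerical features, everything else as categorical.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Note that the 0 to 5 rating columns would land in the numerical group with this approach, which matches the grouping used in this notebook.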

## STEP #3.3: Categorical Analysis:

Check the distinct values of each categorical feature.

In [None]:
category = ["satisfaction", "Gender", "Customer Type", "Type of Travel", "Class"]
for c in category:
    print ("{} \n".format(data[c].value_counts()))

So we conclude that each categorical feature takes a small, fixed set of values.
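
Because each feature has only a few distinct labels, a dictionary lookup is enough to encode it. One thing worth checking is that `Series.map` silently returns `NaN` for any label missing from the dictionary. The sketch below shows such a check on made-up data; the `toy` frame and the `"First"` label are hypothetical and only serve to trigger the unmapped case.

```python
import pandas as pd

# Made-up sample: "First" is a hypothetical label that the mapping does not cover.
toy = pd.DataFrame({"Class": ["Eco", "Business", "Eco Plus", "First"]})
class_mapping = {"Business": 1, "Eco Plus": 2, "Eco": 3}

encoded = toy["Class"].map(class_mapping)

# Labels that were not found in the dictionary come back as NaN.
unmapped = sorted(set(toy.loc[encoded.isna(), "Class"]))
print("Unmapped labels:", unmapped)
```

An empty list here means every label was covered by the mapping.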

## STEP #3.4: Mapping

In [None]:
# Map satisfied / dissatisfied to numbers
satisfaction_mapping = {"satisfied": 1,"dissatisfied": 0 }
data['satisfaction']  = data['satisfaction'].map(satisfaction_mapping)

# Map Male / Female to numbers
Gender_mapping = {"Male": 1,"Female": 2 }
data['Gender']  = data['Gender'].map(Gender_mapping)

# Map Loyal / disloyal customers to numbers
Customer_Type_mapping = {"Loyal Customer": 1,"disloyal Customer": 0 }
data['Customer Type']  = data['Customer Type'].map(Customer_Type_mapping)

# Map Business travel / Personal Travel to numbers
Type_of_Travel_mapping = {"Business travel": 1,"Personal Travel": 2 }
data['Type of Travel']  = data['Type of Travel'].map(Type_of_Travel_mapping)

# Map Business / Eco Plus / Eco to numbers
Class_mapping = {"Business": 1,"Eco": 3, "Eco Plus": 2 }
data['Class']  = data['Class'].map(Class_mapping)


## STEP #3.5: Numerical Analysis

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(121)
sns.countplot(x='Class',data=data)
plt.subplot(122)
sns.countplot(x='Class',hue='satisfaction',data=data)

In [None]:
numericVar = ["Age", "Flight Distance", "Seat comfort", "Departure/Arrival time convenient", "Food and drink", "Gate location", "Inflight wifi service", "Inflight entertainment", "Online support", "Ease of Online booking", "On-board service", "Leg room service", "Baggage handling", "Checkin service", "Cleanliness", "Online boarding", "Departure Delay in Minutes", "Arrival Delay in Minutes"]

fig, axs = plt.subplots(nrows=9, ncols=2,figsize=(20,25))

row = 0
col = 0
for n in numericVar:
    if(col==2):
        row+=1
        col=0
    axs[row,col].hist(data[n], bins = 50)
    axs[row,col].set_xlabel(n)
    axs[row,col].set_ylabel('Frequency')
    axs[row,col].set_title("{} distribution with hist".format(n))
    
    col+=1
    
fig.tight_layout()

In [None]:
plt.figure(figsize=(80,30))
sns.countplot(x='Age',hue='satisfaction',data=data)

## STEP #3.6: Prepare the Data for Training / Data Cleaning
Find the best way to fill the missing values in each feature.
### Find the missing values in the dataset
Count the null values in each column to decide whether to drop the affected rows or replace the values.


In [None]:
# import missingno
# # Plot graphic of missing values
# missingno.matrix(data, figsize = (30,20))


sns.heatmap(data.isnull(),cmap='Blues')

In [None]:
## Fill Missing Values
data['Arrival Delay in Minutes']=data['Arrival Delay in Minutes'].fillna(data['Arrival Delay in Minutes'].mean()) 

In [None]:
sns.heatmap(data.isnull(),cmap='Blues')
# sns.heatmap(dataset.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')

In [None]:
data.isnull().sum()

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(x='Gender',y='Age',data=data)

As shown above, the Arrival Delay in Minutes column has only a few missing values (nowhere near 90% of the rows), so we keep the column and its rows, with the gaps filled by the column mean.
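
The share of missing values per column can be measured directly instead of being read off the heatmap. The sketch below uses a small made-up frame (the `toy` data is an assumption; only the column names come from the dataset) to show both the measurement and the mean fill.

```python
import numpy as np
import pandas as pd

# Made-up delays; one arrival delay is missing.
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 12, 5, 40],
    "Arrival Delay in Minutes": [0.0, np.nan, 8.0, 37.0],
})

# Share of missing values per column (0.25 means 25% of the rows).
missing_share = toy.isnull().mean()
print(missing_share)

# Fill the gaps with the column mean, which is computed from the non-missing rows only.
col = "Arrival Delay in Minutes"
toy[col] = toy[col].fillna(toy[col].mean())
print(toy[col].tolist())
```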

==================================================================================================================

## STEP #3.7: Visualization

### 7.1- Correlation Between numeric values (Satisfaction, Gender, Customer Type, Age, Type of Travel, Class, Flight Distance, Seat comfort, Departure/Arrival time convenient, Food and drink, Gate location, Inflight wifi service, Inflight entertainment, Online support, Ease of Online booking, On-board service, Leg room service, Baggage handling, Checkin service, Cleanliness, Online boarding, Departure Delay in Minutes, Arrival Delay in Minutes)

In [None]:
list1 =["satisfaction", "Gender", "Customer Type", "Age" , "Type of Travel", "Class" , "Flight Distance" , "Seat comfort" ,"Departure/Arrival time convenient" ,"Food and drink"
, "Gate location" ,"Inflight wifi service","Inflight entertainment","Online support","Ease of Online booking"
,"On-board service","Leg room service","Baggage handling","Checkin service","Cleanliness","Online boarding"
,"Departure Delay in Minutes", "Arrival Delay in Minutes"]

plt.subplots(figsize=(15,15)) 
sns.heatmap(data[list1].corr(), annot = True, fmt = ".2f")
plt.show()

* The Departure Delay in Minutes and Arrival Delay in Minutes features have the weakest correlation with all other features, so these two features are going to be dropped.
* The Food and drink feature has the strongest correlation with the Seat comfort feature (0.72).
* The Ease of Online booking, Online boarding, Online support, Cleanliness, Baggage handling, and Inflight wifi service features have the second strongest correlations among themselves (~0.6).
* The Food and drink, Online support, Cleanliness, Baggage handling, Gate location, On-board service, and Inflight wifi service features have the next strongest correlations among themselves (~0.5).
* The satisfaction feature has a strong correlation with Inflight entertainment and Ease of Online booking.
* Now, visualize each of these features with the satisfaction feature.
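
Reading individual cells off a 23 x 23 heatmap is error-prone, so it can help to pull out the target column and rank it. The sketch below does this on synthetic data: the `toy` frame, the random generator seed, and the way the target is built are all assumptions made for illustration, not values from the airline dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features: a 0-5 rating and a delay in minutes.
entertainment = rng.integers(0, 6, size=n)
delay = rng.integers(0, 120, size=n)

# Synthetic target that depends on the rating plus noise and ignores the delay.
noise = rng.normal(0, 1, size=n)
satisfaction = (entertainment + noise > 2.5).astype(int)

toy = pd.DataFrame({
    "satisfaction": satisfaction,
    "Inflight entertainment": entertainment,
    "Departure Delay in Minutes": delay,
})

# Correlation of every feature with the target, ranked by absolute value.
corr_with_target = toy.corr()["satisfaction"].drop("satisfaction")
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

On the real data the same three lines at the end could be applied to `data[list1]`.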

In [None]:
#define a function, so that we can make bar chart for every feature. 
def barchart(feature):
    g = sns.barplot(x=feature,y="satisfaction",data=data)
    g = g.set_ylabel("Satisfaction Probability")

## STEP #3.8: Check features with satisfaction

In [None]:
# For Gender feature.
barchart('Gender')

In [None]:
# For Customer Type feature.
barchart('Customer Type')

In [None]:
# For Class feature.
barchart('Class')

In [None]:
# For Type of Travel feature.
barchart('Type of Travel')

In [None]:
# For Age feature.
g = sns.FacetGrid(data, col = "satisfaction")
g.map(sns.distplot, "Age", bins = 25)
plt.show()

Passengers aged between 20 and 40 show high dissatisfaction, while passengers older than 40 show high satisfaction.
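
An observation like this can be checked numerically by binning the ages and computing the satisfaction rate per bin. The sketch below is a minimal illustration; the `toy` values, the bin edges, and the band labels are made up and are not results from the dataset.

```python
import pandas as pd

# Made-up ages and encoded satisfaction (1 = satisfied, 0 = dissatisfied).
toy = pd.DataFrame({
    "Age": [22, 27, 35, 38, 45, 52, 60, 67],
    "satisfaction": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Right-inclusive bins: (0, 20], (20, 40], (40, 60], (60, 120].
bins = [0, 20, 40, 60, 120]
labels = ["<=20", "21-40", "41-60", ">60"]
toy["Age band"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Mean of a 0/1 column is the share of satisfied passengers in each band.
rate_by_band = toy.groupby("Age band", observed=True)["satisfaction"].mean()
print(rate_by_band)
```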

In [None]:
# For Flight distance feature.
g = sns.FacetGrid(data, col = "satisfaction")
g.map(sns.distplot, "Flight Distance", bins = 25)
plt.show()

In [None]:
sns.barplot(x="Customer Type", y="satisfaction", hue="Gender", data=data);

In [None]:
sns.barplot(x="Class", y="satisfaction", hue="Gender", data=data);

In [None]:
sns.barplot(x="Type of Travel", y="satisfaction", hue="Gender", data=data);

===========================================================================================

## STEP #3.9: Feature Engineering

### 8.1- Arrival Delay in Minutes Feature

The Arrival Delay in Minutes feature shows only a weak correlation with the other features, as noted in the correlation analysis above, so this feature is going to be dropped.

In [None]:
data = data.drop(['Arrival Delay in Minutes'], axis=1)
data.head(10)

### 8.2- Departure Delay in Minutes Feature

The Departure Delay in Minutes feature shows only a weak correlation with the other features, as noted in the correlation analysis above, so this feature is going to be dropped.

In [None]:
data = data.drop(['Departure Delay in Minutes'], axis=1)
data.head(10)

# STEP #4: Split dataset to train / test data

In [None]:
X = data.drop('satisfaction',axis=1).values 
y = data['satisfaction'].values

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state = 100)

In [None]:
# y = data.satisfaction
# data = data.drop('satisfaction',axis=1)

# # split train and test data
# x_train,x_test,y_train,y_test = train_test_split(data,y,test_size=0.2)

print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

# STEP #5: Fitting and Tuning an Algorithm

## Define a general function for showing the Learning curve for any classifier

In [None]:
def plotLearningCurves(X_train, y_train, classifier, title):
    train_sizes, train_scores, test_scores = learning_curve(
            classifier, X_train, y_train, cv=5, scoring="accuracy")
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # The scores are accuracies (scoring="accuracy"), plotted against the training set size
    plt.plot(train_sizes, train_scores_mean, label="Training Accuracy")
    plt.plot(train_sizes, test_scores_mean, label="Cross Validation Accuracy")
    
    plt.legend()
    plt.grid()
    plt.title(title, fontsize = 18, y = 1.03)
    plt.xlabel('Training Set Size', fontsize = 14)
    plt.ylabel('Accuracy', fontsize = 14)
    plt.tight_layout()

In [None]:
def plotValidationCurves(X_train, y_train, classifier, param_name, param_range, title):
    train_scores, test_scores = validation_curve(
        classifier, X_train, y_train, param_name = param_name, param_range = param_range,
        cv=5, scoring="accuracy")

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # The scores are accuracies (scoring="accuracy"), plotted against the parameter values
    plt.plot(param_range, train_scores_mean, label="Training Accuracy")
    plt.plot(param_range, test_scores_mean, label="Cross Validation Accuracy")

    plt.legend()
    plt.grid()
    plt.title(title, fontsize = 18, y = 1.03)
    plt.xlabel(param_name, fontsize = 14)
    plt.ylabel('Accuracy', fontsize = 14)
    plt.tight_layout()

## STEP #5.1: Logistic Regression Classifier

In [None]:
# # LogisticRegression
# from sklearn.linear_model import LogisticRegression
# LR = LogisticRegression()
# LR.fit(X_train,y_train)

# # making predictions on the testing set 
# y_perdict_test = LR.predict(X_test)

# # from sklearn.metrics import confusion_matrix,classification_report
# # cm = confusion_matrix(y_test,y_perdict_test)
# # sns.heatmap(cm,annot=True)

  
# # comparing actual response values (y_test) with predicted response values (y_pred) 
# from sklearn.metrics import accuracy_score

# LR_accuracy = accuracy_score(y_test,y_perdict_test)
# print("Logistic Regression model accuracy:", LR_accuracy) 

## STEP #5.2: Random Forest Classifier

In [None]:
# Choose some initial parameters combinations to try
rondomForestClf = RandomForestClassifier(n_estimators=9, # no of sample trees
                             # max_features=['log2', 'sqrt','auto'], 
                             # criterion=['entropy', 'gini'],
                             max_depth=2, # max depth of each ensemble tree
                             min_samples_split=2, # min no of samples in each node
                             min_samples_leaf=1 # min no of samples in each leaf
                            )

# Fit the best algorithm to the data. 
rondomForestClf.fit(X_train, y_train)
rondomForestPredictions1 = rondomForestClf.predict(X_test)

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Random Forest Learning Curve'
plotLearningCurves(X_train, y_train, rondomForestClf, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Random Forest Validation Curve'
param_name = 'n_estimators'
param_range = [4, 6, 9]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, rondomForestClf, param_name, param_range, title)

In [None]:
print(accuracy_score(y_test, rondomForestPredictions1))
print(confusion_matrix(y_test, rondomForestPredictions1))
print(classification_report(y_test, rondomForestPredictions1))

In [None]:
rondomForestClf_disp = plot_roc_curve(rondomForestClf, X_test, y_test)
plt.show()

### Keep the tree depth at 2 and raise the min_samples_split / min_samples_leaf values

In [None]:
# change some parameters combinations to increase the accuracy
# increase the number of samples split / leaf numbers
rondomForestClf = RandomForestClassifier(n_estimators=9, # no of sample trees
                             # max_features=['log2', 'sqrt','auto'], 
                             # criterion=['entropy', 'gini'],
                             max_depth=2, # max depth of each ensemble tree
                             min_samples_split=5, # min no of samples in each node
                             min_samples_leaf=3 # min no of samples in each leaf
                            )

# Fit the best algorithm to the data. 
rondomForestClf.fit(X_train, y_train)
rondomForestPredictions2 = rondomForestClf.predict(X_test)

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Random Forest Learning Curve'
plotLearningCurves(X_train, y_train, rondomForestClf, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Random Forest Validation Curve'
param_name = 'n_estimators'
param_range = [4, 6, 9]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, rondomForestClf, param_name, param_range, title)

In [None]:
print(accuracy_score(y_test, rondomForestPredictions2))
print(confusion_matrix(y_test, rondomForestPredictions2))
print(classification_report(y_test, rondomForestPredictions2))

### So changing only the sample split / leaf numbers, without changing the tree depth, decreases the accuracy

In [None]:
# change some parameters combinations to increase the accuracy
# change the tree depth with preserving the other parameters
rondomForestClf = RandomForestClassifier(n_estimators=9, # no of sample trees
                             # max_features=['log2', 'sqrt','auto'], 
                             # criterion=['entropy', 'gini'],
                             max_depth=3, # max depth of each ensemble tree
                             min_samples_split=5, # min no of samples in each node
                             min_samples_leaf=1 # min no of samples in each leaf
                            )

# Fit the best algorithm to the data. 
rondomForestClf.fit(X_train, y_train)

rondomForestPredictions3 = rondomForestClf.predict(X_test)

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Random Forest Learning Curve'
plotLearningCurves(X_train, y_train, rondomForestClf, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Random Forest Validation Curve'
param_name = 'n_estimators'
param_range = [4, 6, 9]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, rondomForestClf, param_name, param_range, title)

In [None]:
print(accuracy_score(y_test, rondomForestPredictions3))
print(confusion_matrix(y_test, rondomForestPredictions3))
print(classification_report(y_test, rondomForestPredictions3))

In [None]:
rondomForestClf_disp = plot_roc_curve(rondomForestClf, X_test, y_test)
plt.show()

### As shown above, increasing the depth with the same number of samples for split / leaf increases the accuracy slightly, so next we will keep the depth and try increasing the sample split / leaf numbers

In [None]:
# change some parameters combinations to increase the accuracy
# change the samples split / leaf numbers with preserving the other parameters
rondomForestClf = RandomForestClassifier(n_estimators=9, # no of sample trees
                             # max_features=['log2', 'sqrt','auto'], 
                             # criterion=['entropy', 'gini'],
                             max_depth=3, # max depth of each ensemble tree
                             min_samples_split=8, # min no of samples in each node
                             min_samples_leaf=3 # min no of samples in each leaf
                            )

# Fit the best algorithm to the data. 
rondomForestClf.fit(X_train, y_train)

rondomForestPredictions4 = rondomForestClf.predict(X_test)

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Random Forest Learning Curve'
plotLearningCurves(X_train, y_train, rondomForestClf, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Random Forest Validation Curve'
param_name = 'n_estimators'
param_range = [4, 6, 9]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, rondomForestClf, param_name, param_range, title)

In [None]:
print(accuracy_score(y_test, rondomForestPredictions4))
print(confusion_matrix(y_test, rondomForestPredictions4))
print(classification_report(y_test, rondomForestPredictions4))

### Accuracy increased after increasing the sample split / leaf numbers, so the next experiment will keep the sample split / leaf numbers and try increasing the tree depth

In [None]:
# change some parameters combinations to increase the accuracy
# change the tree depth with preserving the other parameters
rondomForestClf = RandomForestClassifier(n_estimators=9, # no of sample trees
                             # max_features=['log2', 'sqrt','auto'], 
                             # criterion=['entropy', 'gini'],
                             max_depth=5, # max depth of each ensemble tree
                             min_samples_split=8, # min no of samples in each node
                             min_samples_leaf=5 # min no of samples in each leaf
                            )

# Fit the best algorithm to the data. 
rondomForestClf.fit(X_train, y_train)

rondomForestPredictions5 = rondomForestClf.predict(X_test)

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Random Forest Learning Curve'
plotLearningCurves(X_train, y_train, rondomForestClf, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Random Forest Validation Curve'
param_name = 'n_estimators'
param_range = [4, 6, 9]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, rondomForestClf, param_name, param_range, title)

In [None]:
print(accuracy_score(y_test, rondomForestPredictions5))
print(confusion_matrix(y_test, rondomForestPredictions5))
print(classification_report(y_test, rondomForestPredictions5))

### As shown above, the accuracy increased significantly when the tree depth increased to 5, so the tree depth parameter has a large impact on the accuracy. Let's increase it further to see whether the accuracy keeps improving

In [None]:
# change some parameters combinations to increase the accuracy
# change the tree depth with preserving the other parameters
rondomForestClf = RandomForestClassifier(n_estimators=9, # no of sample trees
                             # max_features=['log2', 'sqrt','auto'], 
                             # criterion=['entropy', 'gini'],
                             max_depth=8, # max depth of each ensemble tree
                             min_samples_split=8, # min no of samples in each node
                             min_samples_leaf=5 # min no of samples in each leaf
                            )

# Fit the best algorithm to the data. 
rondomForestClf.fit(X_train, y_train)

rondomForestPredictions6 = rondomForestClf.predict(X_test)

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Random Forest Learning Curve'
plotLearningCurves(X_train, y_train, rondomForestClf, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Random Forest Validation Curve'
param_name = 'n_estimators'
param_range = [4, 6, 9]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, rondomForestClf, param_name, param_range, title)

In [None]:
print(accuracy_score(y_test, rondomForestPredictions6))
print(confusion_matrix(y_test, rondomForestPredictions6))
print(classification_report(y_test, rondomForestPredictions6))

So the best parameters are max_depth=3, min_samples_leaf=1, min_samples_split=5, which produced the prediction rondomForestPredictions3.
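
Comparing single train/test runs can be noisy, especially because the forests above are built without a fixed `random_state`. A minimal sketch of a steadier comparison is shown below: it scores a few of the parameter combinations with 5-fold cross-validation and a fixed seed. The synthetic `X_toy` / `y_toy` data, the seed, and the choice of three settings are assumptions for illustration; the printed numbers say nothing about the airline data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0)

# A few of the combinations tried manually in this section.
settings = [
    {"max_depth": 2, "min_samples_split": 2, "min_samples_leaf": 1},
    {"max_depth": 3, "min_samples_split": 5, "min_samples_leaf": 1},
    {"max_depth": 5, "min_samples_split": 8, "min_samples_leaf": 5},
]

results = []
for params in settings:
    # Fixed random_state so that differences come from the parameters, not the seed.
    clf = RandomForestClassifier(n_estimators=9, random_state=0, **params)
    scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")
    results.append((params, scores.mean(), scores.std()))

for params, mean_acc, std_acc in results:
    print(params, f"accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")
```

The standard deviation across folds gives a rough sense of whether a difference between two settings is larger than the fold-to-fold noise.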

In [None]:
rondomForestClf = RandomForestClassifier()

parameters = {'n_estimators': [4, 6, 9], # no of sample trees
              # 'auto' is left out: for classifiers it was an alias of 'sqrt', and newer scikit-learn releases reject it
              'max_features': ['log2', 'sqrt'], 
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 8], # max depth of each ensemble tree
              'min_samples_split': [2, 5, 8], # min no of samples in each node
              'min_samples_leaf': [1, 3, 5] # min no of samples in each leaf
             }

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(rondomForestClf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rondomForestClf = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
rondomForestClf.fit(X_train, y_train)

predictions = rondomForestClf.predict(X_test)

print(grid_obj.best_estimator_)

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Random Forest Learning Curve'
plotLearningCurves(X_train, y_train, rondomForestClf, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Random Forest Validation Curve'
param_name = 'n_estimators'
param_range = [4, 6, 9]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, rondomForestClf, param_name, param_range, title)

## STEP #5.3: K Nearest Neighbors Classifier

### STEP #5.3.1: Results for neighbors = 1

In [None]:
knn = KNeighborsClassifier(1)
knn.fit(X_train, y_train)
knnPredictions1 = knn.predict(X_test)
print(accuracy_score(y_test, knnPredictions1))
print(confusion_matrix(y_test, knnPredictions1))
print(classification_report(y_test, knnPredictions1))

In [None]:
knn_disp = plot_roc_curve(knn, X_test, y_test)
plt.show()

In [None]:
plt.figure(figsize = (16,5))
title = 'KNN k=1 Learning Curve'
plotLearningCurves(X_train, y_train, knn, title)

### STEP #5.3.2: Results for neighbors = 3

In [None]:
knn = KNeighborsClassifier(3)
knn.fit(X_train, y_train)
knnPredictions3 = knn.predict(X_test)
print(accuracy_score(y_test, knnPredictions3))
print(confusion_matrix(y_test, knnPredictions3))
print(classification_report(y_test, knnPredictions3))

In [None]:
knn_disp = plot_roc_curve(knn, X_test, y_test)
plt.show()

In [None]:
plt.figure(figsize = (16,5))
title = 'KNN k=3 Learning Curve'
plotLearningCurves(X_train, y_train, knn, title)

### STEP #5.3.3: Results for neighbors = 5

In [None]:
knn = KNeighborsClassifier(5)
knn.fit(X_train, y_train)
knnPredictions5 = knn.predict(X_test)
print(accuracy_score(y_test, knnPredictions5))
print(confusion_matrix(y_test, knnPredictions5))
print(classification_report(y_test, knnPredictions5))

In [None]:
knn_disp = plot_roc_curve(knn, X_test, y_test)
plt.show()

In [None]:
plt.figure(figsize = (16,5))
title = 'KNN k=5 Learning Curve'
plotLearningCurves(X_train, y_train, knn, title)

### STEP #5.3.4: Results for neighbors = 7

In [None]:
knn = KNeighborsClassifier(7)
knn.fit(X_train, y_train)
knnPredictions7 = knn.predict(X_test)
print(accuracy_score(y_test, knnPredictions7))
print(confusion_matrix(y_test, knnPredictions7))
print(classification_report(y_test, knnPredictions7))

In [None]:
knn_disp = plot_roc_curve(knn, X_test, y_test)
plt.show()

In [None]:
plt.figure(figsize = (16,5))
title = 'KNN k=7 Learning Curve'
plotLearningCurves(X_train, y_train, knn, title)

### As shown above, the accuracy increased significantly when the number of neighbors increased to 5
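
The four KNN experiments above can be condensed into one loop that records both training and test accuracy for each k. The sketch below runs on synthetic data (`X_toy`, `y_toy`, and the seeds are assumptions), so its numbers are not comparable to the results above. It also illustrates why k=1 is a special case: each training point is its own nearest neighbor, so the training accuracy is perfect while the test accuracy is usually lower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the airline features and target.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

scores = {}
for k in (1, 3, 5, 7):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # score() returns the mean accuracy on the given data.
    scores[k] = (knn_k.score(X_tr, y_tr), knn_k.score(X_te, y_te))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k}: train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```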
### Validation Curve


In [None]:
# call general function to fit the classifier and draw the validation curve
knn = KNeighborsClassifier()
title = 'KNN Validation Curve'
param_name = 'n_neighbors'
param_range = np.arange(1,9,2)
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, knn, param_name, param_range, title)

## STEP #5.4: Support Vector Machine Classifier

In [None]:
# Train an SVC model with the chosen kernel
svclassifier = SVC(C = 0.1 , gamma =1 ,kernel='sigmoid')
svclassifier.fit(X_train, y_train)
# Make prediction
SvcPredictions1 = svclassifier.predict(X_test)
# Evaluate our model
print(classification_report(y_test,SvcPredictions1))

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Support Vector Machine Learning Curve'
plotLearningCurves(X_train, y_train, svclassifier, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Support Vector Machine Validation Curve'
param_name = 'C'
param_range = [0.1,1, 10]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, svclassifier, param_name, param_range, title)

Initially tried C=0.1, gamma=1, and kernel=sigmoid, which gave accuracy = 0.9. Next, change the C parameter to 1.

In [None]:
# Train an SVC model with the chosen kernel
svclassifier = SVC(C = 1 , gamma =1 ,kernel='sigmoid')
svclassifier.fit(X_train, y_train)
# Make prediction
SvcPredictions2 = svclassifier.predict(X_test)
# Evaluate our model
print(classification_report(y_test,SvcPredictions2))

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Support Vector Machine Learning Curve'
plotLearningCurves(X_train, y_train, svclassifier, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Support Vector Machine Validation Curve'
param_name = 'C'
param_range = [0.1,1, 10]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, svclassifier, param_name, param_range, title)

Now with C=1, gamma=1, and kernel=sigmoid, accuracy = 0.9 (no change). Next, change the C parameter to 10.

In [None]:
# Train an SVC model with the chosen kernel
svclassifier = SVC(C = 10 , gamma =1 ,kernel='sigmoid')
svclassifier.fit(X_train, y_train)
# Make prediction
SvcPredictions3 = svclassifier.predict(X_test)
# Evaluate our model
print(classification_report(y_test,SvcPredictions3))

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Support Vector Machine Learning Curve'
plotLearningCurves(X_train, y_train, svclassifier, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Support Vector Machine Validation Curve'
param_name = 'C'
param_range = [0.1,1, 10]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, svclassifier, param_name, param_range, title)

Now with C=10, gamma=1, and kernel=sigmoid, accuracy = 0.9 (no change). Next, change the gamma parameter to 0.01.

In [None]:
# Train an SVC model with the chosen kernel
svclassifier = SVC(C = 10 , gamma =0.01 ,kernel='sigmoid')
svclassifier.fit(X_train, y_train)
# Make prediction
SvcPredictions4 = svclassifier.predict(X_test)
# Evaluate our model
print(classification_report(y_test,SvcPredictions4))

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Support Vector Machine Learning Curve'
plotLearningCurves(X_train, y_train, svclassifier, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Support Vector Machine Validation Curve'
param_name = 'C'
param_range = [0.1,1, 10]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, svclassifier, param_name, param_range, title)

For the C and gamma parameters, the effect is only slightly noticeable. Now change the kernel parameter to rbf.

In [None]:
# Train an SVC model with the chosen kernel
svclassifier = SVC(C = 10 , gamma =0.01 ,kernel='rbf')
svclassifier.fit(X_train, y_train)
# Make prediction
SvcPredictions5 = svclassifier.predict(X_test)
# Evaluate our model
print(classification_report(y_test,SvcPredictions5))

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Support Vector Machine Learning Curve'
plotLearningCurves(X_train, y_train, svclassifier, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Support Vector Machine Validation Curve'
param_name = 'C'
param_range = [0.1,1, 10]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, svclassifier, param_name, param_range, title)

The rbf kernel produces a high-variance model, as the gap between the training curve and the cross-validation curve increased. Now change the gamma parameter to 1 and check.
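
The gap between the two curves can also be computed as a number rather than judged by eye. The sketch below calls `learning_curve` on synthetic data (`X_toy`, `y_toy`, and the seed are assumptions) and subtracts the mean cross-validation accuracy from the mean training accuracy at each training set size; a large positive gap is the usual sign of high variance.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    SVC(C=10, gamma=1, kernel="rbf"), X_toy, y_toy, cv=5, scoring="accuracy")

# Average over the 5 folds, then take the difference per training set size.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)

for size, g in zip(train_sizes, gap):
    print(f"train size={size}: gap={g:.3f}")
```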

In [None]:
# Train an SVC model with the chosen kernel
svclassifier = SVC(C = 10 , gamma =1 ,kernel='rbf')
svclassifier.fit(X_train, y_train)
# Make prediction
SvcPredictions6 = svclassifier.predict(X_test)
# Evaluate our model
print(classification_report(y_test,SvcPredictions6))

In [None]:
# call general function to fit the classifier and draw the learning curve
plt.figure(figsize = (16,5))
title = 'Support Vector Machine Learning Curve'
plotLearningCurves(X_train, y_train, svclassifier, title)

In [None]:
# call general function to fit the classifier and draw the validation curve
title = 'Support Vector Machine Validation Curve'
param_name = 'C'
param_range = [0.1,1, 10]
plt.figure(figsize = (16,5))
plotValidationCurves(X_train, y_train, svclassifier, param_name, param_range, title)

Changing the gamma parameter to 1 increased the gap (higher variance), so the best parameters are C=10, gamma=0.01, kernel=sigmoid, which produced the prediction SvcPredictions4.
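
A grid search keeps the score of every combination it tried in its `cv_results_` attribute, which makes it easy to see how C and gamma interact instead of changing one value at a time. The sketch below runs the sigmoid grid on synthetic data (`X_toy`, `y_toy`, and the seed are assumptions, so the ranking it prints does not apply to the airline data) and lists the combinations from best to worst.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [1, 0.1, 0.01], "kernel": ["sigmoid"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

# One row per parameter combination, sorted by mean cross-validation accuracy.
columns = ["param_C", "param_gamma", "mean_test_score", "std_test_score"]
table = pd.DataFrame(search.cv_results_)[columns].sort_values(
    "mean_test_score", ascending=False)

print(table.to_string(index=False))
print("Best parameters:", search.best_params_)
```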

In [None]:
param_grid = {'C': [0.1,1, 10], 'gamma': [1,0.1,0.01],'kernel': ['sigmoid']}
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
svclassifier7 = grid.fit(X_train,y_train)
SvcPredictions7 = svclassifier7.predict(X_test)
print(grid.best_estimator_)

In [None]:
print(accuracy_score(y_test, SvcPredictions7))
print(confusion_matrix(y_test, SvcPredictions7))
print(classification_report(y_test, SvcPredictions7))

In [None]:
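# Plot the ROC curve of the grid-searched model evaluated above (svclassifier still holds the last manual rbf model)
svm_disp = plot_roc_curve(grid.best_estimator_, X_test, y_test)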
plt.show()

## STEP #5.5: Neural Network Classifier

In [None]:
def neural_network_learning_curve(classifier):
    # call general function to fit the classifier and draw the learning curve
    title = 'Neural Network Learning Curve'
    plt.figure(figsize = (16,5))
    plotLearningCurves(X_train, y_train, classifier, title)

In [None]:
def neural_network_validation_curve(classifier):
    # call general function to fit the classifier and draw the validation curve
    title = 'Neural Network Validation Curve'
    param_name="alpha"
    param_range = np.logspace(-6, -1, 5)
    plt.figure(figsize = (16,5))
    plotValidationCurves(X_train, y_train, classifier, param_name, param_range, title)

In [None]:
nnclf1 = MLPClassifier(hidden_layer_sizes=(2, 1), activation='logistic', solver='adam', max_iter=500, 
                    alpha=1e-5, random_state=1)

nnclf1.fit(X_train, y_train)

In [None]:
NeuralNetworkPredictions1 = nnclf1.predict(X_test)
print(confusion_matrix(y_test,NeuralNetworkPredictions1))
print(classification_report(y_test,NeuralNetworkPredictions1))

In [None]:
neural_network_learning_curve(nnclf1)

In [None]:
neural_network_validation_curve(nnclf1)

> ### As shown, we built a neural network classifier with 2 hidden layers containing 3 nodes in total (2 and 1), and we got .84 accuracy

In [None]:
nnclf2 = MLPClassifier(hidden_layer_sizes=(2, 2), activation='logistic', solver='adam', max_iter=500, 
                    alpha=1e-5, random_state=1)

nnclf2.fit(X_train, y_train)

In [None]:
NeuralNetworkPredictions2 = nnclf2.predict(X_test)
print(confusion_matrix(y_test,NeuralNetworkPredictions2))
print(classification_report(y_test,NeuralNetworkPredictions2))

In [None]:
neural_network_learning_curve(nnclf2)

In [None]:
neural_network_validation_curve(nnclf2)

### Here we added another node to the second hidden layer, so we have .85 accuracy

In [None]:
nnclf3 = MLPClassifier(hidden_layer_sizes=(2, 3), activation='logistic', solver='adam', max_iter=500, 
                    alpha=1e-5, random_state=1)

nnclf3.fit(X_train, y_train)

In [None]:
NeuralNetworkPredictions3 = nnclf3.predict(X_test)
print(confusion_matrix(y_test,NeuralNetworkPredictions3))
print(classification_report(y_test,NeuralNetworkPredictions3))

In [None]:
neural_network_learning_curve(nnclf3)

In [None]:
neural_network_validation_curve(nnclf3)

### Then we added another node to the second hidden layer and still received the same accuracy, .85

In [None]:
nnclf4 = MLPClassifier(hidden_layer_sizes=(3, 2), activation='logistic', solver='adam', max_iter=500, 
                    alpha=1e-5, random_state=1)

nnclf4.fit(X_train, y_train)

In [None]:
NeuralNetworkPredictions4 = nnclf4.predict(X_test)
print(confusion_matrix(y_test,NeuralNetworkPredictions4))
print(classification_report(y_test,NeuralNetworkPredictions4))

In [None]:
neural_network_learning_curve(nnclf4)

In [None]:
neural_network_validation_curve(nnclf4)

### Here we swapped the number of nodes in our 2 hidden layers: instead of (2, 3) we made it (3, 2), and we received .88 accuracy

In [None]:
nnclf5 = MLPClassifier(hidden_layer_sizes=(3, 3), activation='logistic', solver='adam', max_iter=500, 
                    alpha=1e-5, random_state=1)

nnclf5.fit(X_train, y_train)

In [None]:
NeuralNetworkPredictions5 = nnclf5.predict(X_test)
print(confusion_matrix(y_test,NeuralNetworkPredictions5))
print(classification_report(y_test,NeuralNetworkPredictions5))

In [None]:
neural_network_learning_curve(nnclf5)

In [None]:
neural_network_validation_curve(nnclf5)

### Then we added another node to the second hidden layer and received .89 accuracy

In [None]:
nnclf6 = MLPClassifier(hidden_layer_sizes=(5, 3), activation='logistic', solver='adam', max_iter=500, 
                    alpha=1e-5, random_state=1)

nnclf6.fit(X_train, y_train)

In [None]:
NeuralNetworkPredictions6 = nnclf6.predict(X_test)
print(confusion_matrix(y_test,NeuralNetworkPredictions6))
print(classification_report(y_test,NeuralNetworkPredictions6))

In [None]:
neural_network_learning_curve(nnclf6)

In [None]:
neural_network_validation_curve(nnclf6)

### Finally, we got the best accuracy of .90 with 2 hidden layers containing 8 nodes in total (5 and 3), for prediction: NeuralNetworkPredictions6
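To make the layer and node counts above concrete: each entry of `hidden_layer_sizes` is the number of nodes in one hidden layer, and the fitted weight matrices in `coefs_` show how the layers connect. The sketch below is only an illustration on synthetic data with 10 features (an assumption, so the shapes differ from the airline data).

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in with 10 input features (an assumption, not the airline data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

for sizes in [(2, 1), (3, 2), (5, 3)]:
    clf = MLPClassifier(hidden_layer_sizes=sizes, activation="logistic",
                        solver="adam", max_iter=500, alpha=1e-5, random_state=1)
    clf.fit(X_demo, y_demo)
    # One weight matrix per connection: input -> hidden 1 -> hidden 2 -> output
    weight_shapes = [w.shape for w in clf.coefs_]
    n_params = sum(w.size for w in clf.coefs_) + sum(b.size for b in clf.intercepts_)
    print(sizes, "hidden nodes:", sum(sizes),
          "weight shapes:", weight_shapes, "parameters:", n_params)
```

Even the largest network tried here has very few trainable parameters, which may be one reason the small changes in layer size move the accuracy only slightly.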

# STEP #6: Comparison of all classifiers' performance (AUC curve)

In [None]:
# Import the classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

 

# Instantiate the classifiers and make a list
classifiers = [LogisticRegression(random_state=1234), 
               SVC(),
               KNeighborsClassifier(), 
               RandomForestClassifier(random_state=1234),
               MLPClassifier()]

# Compute the ROC points and the AUC from each model's test-set predictions
lr_fpr1, lr_tpr1, _ = roc_curve(y_test, rondomForestPredictions3)
lr_fpr2, lr_tpr2, _ = roc_curve(y_test, knnPredictions5)
lr_fpr3, lr_tpr3, _ = roc_curve(y_test, SvcPredictions4)
lr_fpr4, lr_tpr4, _ = roc_curve(y_test, NeuralNetworkPredictions2)

auc1 = roc_auc_score(y_test, rondomForestPredictions3)
auc2 = roc_auc_score(y_test, knnPredictions5)
# Use the same SVM predictions as for the ROC curve above (SvcPredictions4)
auc3 = roc_auc_score(y_test, SvcPredictions4)
auc4 = roc_auc_score(y_test, NeuralNetworkPredictions2)

# Build the table in one go: DataFrame.append was removed in pandas 2.0, and
# Class.__class__.__name__ returns the metaclass name rather than the class name
result_table = pd.DataFrame([
    {'classifiers': RandomForestClassifier.__name__, 'fpr': lr_fpr1, 'tpr': lr_tpr1, 'auc': auc1},
    {'classifiers': KNeighborsClassifier.__name__, 'fpr': lr_fpr2, 'tpr': lr_tpr2, 'auc': auc2},
    {'classifiers': SVC.__name__, 'fpr': lr_fpr3, 'tpr': lr_tpr3, 'auc': auc3},
    {'classifiers': MLPClassifier.__name__, 'fpr': lr_fpr4, 'tpr': lr_tpr4, 'auc': auc4},
], columns=['classifiers', 'fpr', 'tpr', 'auc'])

 
fig = plt.figure(figsize=(8,6))

# for i in result_table.index:
#     plt.plot(result_table.loc[i]['fpr'], 
#              result_table.loc[i]['tpr'], 
#              label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))


plt.plot(result_table.loc[0]['fpr'], 
         result_table.loc[0]['tpr'], 
         label="RandomForestClassifier, AUC={:.3f}".format( result_table.loc[0]['auc']))

plt.plot(result_table.loc[1]['fpr'], 
         result_table.loc[1]['tpr'], 
         label="KNeighborsClassifier, AUC={:.3f}".format( result_table.loc[1]['auc']))

plt.plot(result_table.loc[2]['fpr'], 
         result_table.loc[2]['tpr'], 
         label="SVM, AUC={:.3f}".format( result_table.loc[2]['auc']))

plt.plot(result_table.loc[3]['fpr'], 
         result_table.loc[3]['tpr'], 
         label="MLPClassifier, AUC={:.3f}".format( result_table.loc[3]['auc']))
    
# plt.plot([0,1], [0,1], color='orange', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)

plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)

plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')

plt.show()

As shown in the curves above, the four classifiers differ: the random forest classifier has the largest area under the curve and SVM the smallest, while the neural network performs better than k-nearest neighbors.
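One caveat about the comparison above: the curves are computed from hard 0/1 predictions, which gives each ROC curve only a single interior point. Passing scores such as `predict_proba(...)[:, 1]` (or `decision_function` for `SVC`) to `roc_curve` traces the full curve. The sketch below shows the difference on synthetic data; the dataset and the variable names are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the airline data (an assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=1234).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)            # 0/1 class labels
scores = clf.predict_proba(X_te)[:, 1]     # probability of the positive class

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)

print("ROC points from hard labels:  ", len(fpr_hard))
print("ROC points from probabilities:", len(fpr_soft))
print("AUC from hard labels:  ", round(roc_auc_score(y_te, hard_labels), 3))
print("AUC from probabilities:", round(roc_auc_score(y_te, scores), 3))
```

The two AUC values generally differ, so a ranking of models based on hard labels is worth re-checking with scores.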

# STEP #7: Apply AutoML (e.g. auto-sklearn) and compare its performance to your best model

### Install autosklearn on Kaggle
!apt-get remove swig
!apt-get install swig3.0 build-essential -y
!ln -s /usr/bin/swig3.0 /usr/bin/swig
!apt-get install build-essential
!pip install --upgrade setuptools
!pip install auto-sklearn

In [None]:
# Regression AutoML

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import os  
import autosklearn.regression

tmp_folder='/tmp/autosklearn_regression_example_tmp'
output_folder='/tmp/autosklearn_regression_example_out'
    

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder = tmp_folder,
    output_folder = output_folder,
)
automl.fit(X_train, y_train, dataset_name='Airlines' )

print(automl.show_models())
predictions = automl.predict(X_test)
print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))

In [None]:
# Cross-Validation AutoML

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification


automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_cv_example_tmp1',
    output_folder='/tmp/autosklearn_cv_example_out1',
    delete_tmp_folder_after_terminate=False,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)

# fit() changes the data in place, but refit needs the original data. We
# therefore copy the data. In practice, one should reload the data
automl.fit(X_train.copy(), y_train.copy(), dataset_name='AirLines')
# During fit(), models are fit on individual cross-validation folds. To use
# all available data, we call refit() which trains all models in the
# final ensemble on the whole dataset.
automl.refit(X_train.copy(), y_train.copy())

print(automl.show_models())

predictions = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))


# STEP #8: Conclusion

With this notebook we learned the basics of EDA with Pandas and Matplotlib, as well as the foundations
for applying the classification models of the scikit-learn library.
Through EDA we found a strong impact of features like Gender and Age on Satisfaction.

We then built a simple baseline model with Pandas, using only these features.
Again using Pandas, we also created a dataset that can be used by the scikit-learn classifiers for prediction.

We applied Random Forest, k-nearest neighbors, Support Vector Machine (SVM) and Multilayer Perceptron (MLP).


According to AutoML, the best ML model for this task and set of features was Random Forest, with an accuracy of 85.7%.


# STEP #9: References

This kernel would have been impossible to make without these amazing tutorials:
Most basic matplotlib: https://towardsdatascience.com/plt-xxx-or-ax-xxx-that-is-the-question-in-matplotlib-8580acf42f44

Understand the difference between all the methods (add_subplot, add_subplots, add_axes ...): https://towardsdatascience.com/the-many-ways-to-call-axes-in-matplotlib-2667a7b06e06

Matplotlib grid documentation: https://matplotlib.org/tutorials/intermediate/gridspec.html

Great tutorial for beginners: https://github.com/rougier/matplotlib-tutorial

50 beautiful plots using matplotlib: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/

Pandas plotting capabilities: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html