# Airline Passenger Satisfaction

Understanding customer satisfaction is crucial for airlines that want to improve their future services. This project's goal is to build a predictive model that forecasts satisfaction levels from the provided dataset.
By analyzing various features such as flight distance, inflight services, and customer demographics, we aim to uncover patterns that contribute to passenger satisfaction.

### About this dataset

The dataset comprises 24 columns:

*   Unnamed: 0
*   Gender: Female/Male
*   customer_type: loyal/disloyal customer
*   age
*   type_of_travel: Personal/Business travel
*   customer_class: Eco, Eco Plus, Business
*   flight_distance
*   14 columns with passenger ratings on different services; each column contains values from 0 to 5
*   departure_delay_in_minutes
*   arrival_delay_in_minutes
*   satisfaction; target column that contains 2 unique values: "neutral or dissatisfied" and "satisfied"

## Importing the libraries

In [1]:
!pip install imblearn==0.0

Collecting imblearn==0.0
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.12.0-py3-none-any.whl (257 kB)
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.12.0 imblearn-0.0


In [2]:
!pip install xgboost==2.0.3

Collecting xgboost==2.0.3
  Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl (297.1 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.0.3


In [3]:
!pip install lightgbm==4.3.0

Collecting lightgbm==4.3.0
  Downloading lightgbm-4.3.0-py3-none-manylinux_2_28_x86_64.whl (3.1 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.3.0


In [4]:
!pip install missingno==0.5.2

Collecting missingno==0.5.2
  Downloading missingno-0.5.2-py3-none-any.whl (8.7 kB)
Installing collected packages: missingno
Successfully installed missingno-0.5.2


In [5]:
!pip install mlxtend==0.23.1

Collecting mlxtend==0.23.1
  Downloading mlxtend-0.23.1-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.1


In [6]:
import pandas as pd
import numpy as np
import itertools

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.metrics import (
    f1_score, roc_auc_score, precision_score, recall_score, accuracy_score, confusion_matrix
)
from sklearn.model_selection import (
    learning_curve, validation_curve, train_test_split, KFold, StratifiedKFold,
    cross_val_score, GridSearchCV, RandomizedSearchCV, cross_validate, RepeatedStratifiedKFold
)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline as IMBPipeline
from imblearn.over_sampling import SMOTE, RandomOverSampler
from scipy.stats.mstats import winsorize, trim
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier, plot_importance
from scipy.stats import loguniform, beta, uniform
import missingno as msno
import warnings
warnings.filterwarnings('ignore')

## Data overview

In [7]:
# Read the first 10% of the rows of the CSV dataset
percent_to_read = 0.1  # 10%
csv_path = 'airline_passenger_satisfaction.csv'
n_rows = len(pd.read_csv(csv_path, usecols=[0]))  # count rows by loading a single column
df = pd.read_csv(csv_path, nrows=int(percent_to_read * n_rows))
df

,Unnamed: 0,Gender,customer_type,age,type_of_travel,customer_class,flight_distance,inflight_wifi_service,departure_arrival_time_convenient,ease_of_online_booking,...,inflight_entertainment,onboard_service,leg_room_service,baggage_handling,checkin_service,inflight_service,cleanliness,departure_delay_in_minutes,arrival_delay_in_minutes,satisfaction
0,0,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12983,12983,Female,Loyal Customer,23,Personal Travel,Eco,1276,1,5,1,...,4,5,5,4,5,4,4,54,50.0,neutral or dissatisfied
12984,12984,Male,disloyal Customer,38,Business travel,Eco,834,2,2,2,...,5,5,1,5,2,5,5,0,0.0,neutral or dissatisfied
12985,12985,Female,Loyal Customer,40,Personal Travel,Eco,733,4,1,4,...,5,2,1,3,1,2,5,0,3.0,neutral or dissatisfied
12986,12986,Male,Loyal Customer,7,Personal Travel,Eco,577,3,3,3,...,5,2,1,3,1,3,5,12,38.0,neutral or dissatisfied


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12988 entries, 0 to 12987
Data columns (total 24 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         12988 non-null  int64  
 1   Gender                             12988 non-null  object 
 2   customer_type                      12988 non-null  object 
 3   age                                12988 non-null  int64  
 4   type_of_travel                     12988 non-null  object 
 5   customer_class                     12988 non-null  object 
 6   flight_distance                    12988 non-null  int64  
 7   inflight_wifi_service              12988 non-null  int64  
 8   departure_arrival_time_convenient  12988 non-null  int64  
 9   ease_of_online_booking             12988 non-null  int64  
 10  gate_location                      12988 non-null  int64  
 11  food_and_drink                     12988 non-null  int

In [9]:
df.describe()

,Unnamed: 0,age,flight_distance,inflight_wifi_service,departure_arrival_time_convenient,ease_of_online_booking,gate_location,food_and_drink,online_boarding,seat_comfort,inflight_entertainment,onboard_service,leg_room_service,baggage_handling,checkin_service,inflight_service,cleanliness,departure_delay_in_minutes,arrival_delay_in_minutes
count,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12988.0,12957.0
mean,6493.5,39.18032,1202.762088,2.739914,3.058362,2.763628,2.980674,3.208346,3.247459,3.42624,3.352787,3.376501,3.350323,3.63012,3.298506,3.640514,3.280875,14.679704,15.09408
std,3749.456983,15.118738,1004.368897,1.327355,1.524264,1.40071,1.284871,1.335532,1.345311,1.316665,1.332684,1.285852,1.312321,1.181304,1.266812,1.177696,1.313533,38.269905,38.758802
min,0.0,7.0,31.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,3246.75,27.0,414.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.75,3.0,2.0,0.0,0.0
50%,6493.5,40.0,852.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,3.0,0.0,0.0
75%,9740.25,51.0,1755.25,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,5.0,4.0,12.0,13.0
max,12987.0,85.0,4983.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1305.0,1280.0


The full dataset contains 129,880 rows; the 10% sample loaded here has 12,988. The `describe()` output shows that arrival_delay_in_minutes has missing values (12,957 non-null entries, so 31 are missing in the sample). Because this column is likely highly correlated with departure_delay_in_minutes, one option is to fill the missing values from the latter; the preprocessing section settles how they are handled.
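
A minimal sketch of the fill option mentioned above, run on a small hypothetical frame (the column names match the dataset, the values are made up). It assumes the departure delay is an acceptable stand-in for a missing arrival delay:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names match the dataset, values do not
toy = pd.DataFrame({
    'departure_delay_in_minutes': [0, 25, 11, 54],
    'arrival_delay_in_minutes': [0.0, np.nan, 9.0, np.nan],
})

# Replace each missing arrival delay with the departure delay of the same row
toy['arrival_delay_in_minutes'] = toy['arrival_delay_in_minutes'].fillna(
    toy['departure_delay_in_minutes']
)
print(toy)
```

Rows that already have an arrival delay are left untouched; only the missing entries are filled.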

The first column, Unnamed: 0, contains row indexes. It was most likely written when the CSV was saved together with its index, and it carries no information, so it will be dropped.

Upon inspecting the standard deviation and summary statistics using the `describe()` method, we can observe that both the `departure_delay_in_minutes` and `arrival_delay_in_minutes` columns contain outliers. In this sample the maximum values reach 1305 and 1280 minutes, respectively, while the 75th percentiles are only 12 and 13 minutes. In the preprocessing stage, we will address these outliers to ensure data quality and robust modeling.
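
One common way to tame such outliers is to cap values at an upper quantile. The sketch below is only an illustration on hypothetical delays; the 0.75 threshold is an assumption made for this example, not a choice made in the notebook:

```python
import pandas as pd

# Hypothetical delays in minutes with one extreme value
delays = pd.Series([0, 0, 3, 5, 12, 20, 45, 1305], name='departure_delay_in_minutes')

# Cap everything above the chosen quantile (threshold picked only for illustration)
upper = delays.quantile(0.75)
capped = delays.clip(upper=upper)
print(f'cap at {upper}, max after capping: {capped.max()}')
```

Capping keeps every row, which matters when the extreme delays belong to otherwise valid records.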

Additionally, the mean passenger rating across the 14 service columns is approximately 3.2 out of 5, with per-column means ranging from about 2.7 to 3.6.

## Defining functions

In this section, we define functions to facilitate our data analysis process. I've selected the Plotly library for its ability to create interactive visualizations and unique graphing options.

In [10]:
# Divide all the columns into 3 categories: object columns, passenger ratings, and numerical columns
def define_list_columns(df):
    object_columns = []
    passenger_ratings = []
    numerical_columns = []
    for column in df.columns:
        if df[column].dtypes == 'object':
            object_columns.append(column)
        elif 5 <= len(df[column].value_counts()) <= 6:
            passenger_ratings.append(column)
        else:
            numerical_columns.append(column)

    return object_columns, passenger_ratings, numerical_columns

# Pie chart to analyze categorical columns
def plot_pie(df, name, title, color_map=None):
    fig = px.pie(df, names=name, title=title, color=name, color_discrete_map=color_map)
    fig.update_layout(autosize=False, width=500, height=500, title_x=0.5)
    fig.show()

# Function to plot histograms of passengers' ratings for different services
def create_trace(df, column):
    return go.Histogram(x = df[column].apply(str).sort_values(), name = column)

# Another function to combine the histograms into a single plot
def passengers_ratings_plot(df, passenger_ratings):
    n_cols = 5
    fig = make_subplots(rows=3, cols=n_cols)
    for i, name in enumerate(passenger_ratings):
        # Fill the grid row by row; Plotly rows and columns are 1-based
        row, col = divmod(i, n_cols)
        fig.add_trace(create_trace(df, name), row=row + 1, col=col + 1)

    fig.update_layout(title_text='How do passengers evaluate different services?', title_x=0.5)
    fig.show()

object_columns, passenger_ratings, numerical_columns = define_list_columns(df)

In [11]:
# defining dictionaries with colors for each graph
gender_color_map = {'Male': '#0000FF', 'Female': '#FF69B4'}
customer_type_map = {'Loyal Customer': '#FF3D00', 'disloyal Customer': '#FFB300'}
type_travel_map = {'Business travel': '#4CAF50', 'Personal Travel': '#FFC107'}
customer_class_map = {'Eco': '#AED581', 'Eco Plus': '#FFD54F', 'Business': '#4FC3F7'}

## Data Analysis

### Gender

In [12]:
plot_pie(df, 'Gender', 'Male vs Female', gender_color_map)

### Customer type

In [13]:
plot_pie(df, 'customer_type', 'Loyal VS Disloyal', customer_type_map)

### Type of travel

In [14]:
plot_pie(df, 'type_of_travel', 'Business VS Personal', type_travel_map)

### Customer class

In [15]:
plot_pie(df, 'customer_class', 'Customer Class', customer_class_map)

From the graphs above, we can observe that male and female passengers are represented about equally. The passengers are predominantly loyal customers, and business trips are more frequent than personal travel. Among the available classes, Business is the most popular choice, followed by Eco, while Eco Plus is the least used.

### Passenger ratings

In [16]:
passengers_ratings_plot(df, passenger_ratings)

The average rating across these columns is approximately 3.2 out of 5 (see the `describe()` output). It's evident from the distributions that most passengers rate the services higher than the mean. Notably, passengers tend to be satisfied with food and drinks, cleanliness onboard, seat comfort, and baggage handling. However, the inflight wifi service receives the lowest ratings, which may be attributed to its availability primarily on long-distance flights and often requiring additional payment.

The comfort level regarding legroom and onboard dishes varies according to the class. As expected, passengers in the business class enjoy more comfortable seating and better food services, leading to higher satisfaction ratings.

### Distribution of values of other columns

In [17]:
fig = px.histogram(df, x = 'age', title = 'Distribution of age')
fig.show()

The distribution appears to be bimodal, with peaks at ages 25 and 39, indicating that individuals of these ages tend to travel more frequently than others. Additionally, the age range of passengers varies, with the youngest being 7 years old and the oldest being 85.

In [18]:
fig = px.histogram(df, x = 'flight_distance', marginal = 'rug', title = 'Distribution of flight distance values')
fig.show()

The distribution appears to follow a log-normal pattern, implying that taking the natural logarithm of each value would give a distribution closer to normal. The maximum flight distance in this sample is 4983 (the dataset does not state whether the unit is kilometres or miles), which is plausible in reality. However, there is a notable concentration of flights shorter than 150, which might indicate outliers or unusual cases.

In a practical project scenario, it would be advisable to seek additional information from the airline company to better understand such extreme values. Unfortunately, in the absence of such information, I will address this by removing values lying outside the 0.1 and 0.95 quantiles during the preprocessing step.
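
The quantile filter and the log transform described above could look roughly like this. The distances are hypothetical, and `log_flight_distance` is a column name introduced only for this sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical flight distances; not taken from the dataset
toy = pd.DataFrame({'flight_distance': [31, 120, 214, 460, 562, 852, 1142, 1755, 2500, 4983]})

# Keep only rows whose distance lies between the 0.1 and 0.95 quantiles (inclusive)
low, high = toy['flight_distance'].quantile([0.10, 0.95])
filtered = toy[toy['flight_distance'].between(low, high)].copy()

# Natural log of the remaining distances, as suggested by the log-normal shape
filtered['log_flight_distance'] = np.log(filtered['flight_distance'])
print(filtered)
```

Note that the two bounds are asymmetric, so more rows are removed from the short-distance end than from the long-distance end.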

In [19]:
fig = px.histogram(df, x = 'arrival_delay_in_minutes', title = 'Distribution of arrival delay values')
fig.show()

In [20]:
fig = px.histogram(df, x = 'departure_delay_in_minutes', title = 'Distribution of departure delay values')
fig.show()

The distributions of values in these two columns have similar shapes, which is consistent with (though not proof of) a strong correlation between them. Delays are typically minimal: the median of both columns is 0 minutes, and the 75th percentiles are 12 and 13 minutes.
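
Similar histograms alone do not establish correlation, so it is worth checking it directly. A minimal sketch on a handful of illustrative rows; in the notebook the same call would be applied to the two columns of `df`:

```python
import pandas as pd

# Illustrative delays in minutes; stand-ins for the two df columns
toy = pd.DataFrame({
    'departure_delay_in_minutes': [25, 1, 0, 11, 0, 54, 0, 12],
    'arrival_delay_in_minutes': [18, 6, 0, 9, 0, 50, 3, 38],
})

# Pearson correlation between the two delay columns (rows with NaN are skipped)
corr = toy['departure_delay_in_minutes'].corr(toy['arrival_delay_in_minutes'])
print(f'Pearson correlation: {corr:.2f}')
```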

## Target column analysis

The "satisfaction" column in the dataset indicates whether passengers were satisfied with their flying experience. Let's explore the distribution of values in this column and identify any dependencies.

In [21]:
fig = px.histogram(df, x = 'satisfaction', title = 'Class distribution')
fig.update_layout(title_x = 0.5)
fig.show()

The ratio of classes is "neutral or dissatisfied" : "satisfied" = 1 : 1.3, so the classes are roughly balanced.
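
The class ratio can be computed rather than read off the histogram. A small sketch with hypothetical labels; in the notebook `labels` would be `df['satisfaction']`:

```python
import pandas as pd

# Hypothetical labels chosen to give a 1 : 1.3 split
labels = pd.Series(['neutral or dissatisfied'] * 10 + ['satisfied'] * 13, name='satisfaction')

counts = labels.value_counts()
# Ratio of the larger class to the smaller one; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()
print(counts)
print(f'majority : minority = {imbalance_ratio:.2f} : 1')
```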

In [22]:
px.sunburst(df, path = ['satisfaction', 'type_of_travel', 'customer_class'], color = 'satisfaction', color_discrete_map={"satisfied": '#FF5733', "neutral or dissatisfied": '#33FF57'})

The sunburst graph provides a comprehensive view of the relationship between satisfaction, type of travel, and customer class. Notably, passengers traveling in Business class tend to report higher satisfaction levels. Additionally, the visualization suggests a pattern where passengers traveling for business purposes are more inclined to choose the Business class, aligning with the expectation of enhanced comfort and services for work-related travel.

### Split the data into features and target variable

In [23]:
X = df.drop('satisfaction', axis=1)
y = df['satisfaction'].map({'neutral or dissatisfied': 0, 'satisfied': 1})

## Data Preprocessing

### Handling missing values

In [24]:
X.isnull().sum()

Unnamed: 0                            0
Gender                                0
customer_type                         0
age                                   0
type_of_travel                        0
customer_class                        0
flight_distance                       0
inflight_wifi_service                 0
departure_arrival_time_convenient     0
ease_of_online_booking                0
gate_location                         0
food_and_drink                        0
online_boarding                       0
seat_comfort                          0
inflight_entertainment                0
onboard_service                       0
leg_room_service                      0
baggage_handling                      0
checkin_service                       0
inflight_service                      0
cleanliness                           0
departure_delay_in_minutes            0
arrival_delay_in_minutes             31
dtype: int64

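There are 31 missing values in the "arrival_delay_in_minutes" column of this sample. Since this column is highly correlated with "departure_delay_in_minutes" and the missing entries make up less than 1% of the rows, we can drop them. The pipeline defined below leaves this column out of the selected features, so `remainder = 'drop'` removes it entirely.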

## Definition of the Data Transformation Pipeline

In [25]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12988 entries, 0 to 12987
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         12988 non-null  int64  
 1   Gender                             12988 non-null  object 
 2   customer_type                      12988 non-null  object 
 3   age                                12988 non-null  int64  
 4   type_of_travel                     12988 non-null  object 
 5   customer_class                     12988 non-null  object 
 6   flight_distance                    12988 non-null  int64  
 7   inflight_wifi_service              12988 non-null  int64  
 8   departure_arrival_time_convenient  12988 non-null  int64  
 9   ease_of_online_booking             12988 non-null  int64  
 10  gate_location                      12988 non-null  int64  
 11  food_and_drink                     12988 non-null  int

As noted in the data overview, the Unnamed: 0 column contains only row indexes, so it is removed from the dataset.

In [26]:
X.drop(['Unnamed: 0'], axis = 1, inplace = True)

Now, let's delve into the remaining columns to determine the most appropriate approach for defining our data transformation pipeline.

In [27]:
df_obj = X.select_dtypes(include=object)
df_obj

,Gender,customer_type,type_of_travel,customer_class
0,Male,Loyal Customer,Personal Travel,Eco Plus
1,Male,disloyal Customer,Business travel,Business
2,Female,Loyal Customer,Business travel,Business
3,Female,Loyal Customer,Business travel,Business
4,Male,Loyal Customer,Business travel,Business
...,...,...,...,...
12983,Female,Loyal Customer,Personal Travel,Eco
12984,Male,disloyal Customer,Business travel,Eco
12985,Female,Loyal Customer,Personal Travel,Eco
12986,Male,Loyal Customer,Personal Travel,Eco


We can note that there are four columns with data type 'object' in our dataset: Gender, customer_type, type_of_travel, and customer_class. The first three columns are binary categorical variables, while the customer_class column is categorical and encompasses three distinct values.
Given this, a reasonable approach would be to map binary categorical variables to 0 and 1, while employing one-hot encoding for the customer_class column.

In [28]:
numerics = ['int64']
df_num = X.select_dtypes(include=numerics)
df_num

,age,flight_distance,inflight_wifi_service,departure_arrival_time_convenient,ease_of_online_booking,gate_location,food_and_drink,online_boarding,seat_comfort,inflight_entertainment,onboard_service,leg_room_service,baggage_handling,checkin_service,inflight_service,cleanliness,departure_delay_in_minutes
0,13,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25
1,25,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1
2,26,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0
3,25,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11
4,61,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12983,23,1276,1,5,1,4,4,1,4,4,5,5,4,5,4,4,54
12984,38,834,2,2,2,4,5,2,5,5,5,1,5,2,5,5,0
12985,40,733,4,1,4,3,5,4,4,5,2,1,3,1,2,5,0
12986,7,577,3,3,3,3,5,3,5,5,2,1,3,1,3,5,12


The table includes numerical columns in the dataset. In addition to the columns age, flight_distance, and departure_delay_in_minutes, there are 14 additional columns representing passenger ratings ranging from 0 to 5. Due to the large number of rating columns, only a subset will be chosen for transformation. To promote uniformity and enhance the algorithm's robustness, we will scale these selected numerical columns using a standard scaler. This preprocessing step will enable consistent feature representation and facilitate optimal model performance.
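
For reference, standard scaling replaces each value x with (x - mean) / std, computed per column. The sketch below checks this on hypothetical ages; it is an illustration of `StandardScaler`, not part of the notebook's pipeline:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages; StandardScaler expects a 2-D array (rows x features)
ages = np.array([[13.0], [25.0], [26.0], [61.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# Same result by hand: subtract the mean, divide by the population std (ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)
print(scaled.ravel())
```

After scaling, the column has mean 0 and standard deviation 1, which keeps features with large ranges such as flight_distance from dominating distance-based models.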

Now, we are prepared to define our comprehensive data transformation pipeline, incorporating all the transformations mentioned previously. 

In [29]:
columns_to_scale = ['age', 'flight_distance', 'inflight_wifi_service', 
                    'ease_of_online_booking', 'food_and_drink', 
                    'online_boarding', 'onboard_service', 
                    'baggage_handling', 'checkin_service', 
                    'departure_delay_in_minutes']

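class BinaryCategoricalMapper(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn the category -> code mapping once, in order of first appearance,
        # so that train and test data are always encoded the same way
        self.mappings_ = {
            column: {value: code for code, value in enumerate(X[column].unique())}
            for column in X.columns
        }
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for column, mapping in self.mappings_.items():
            X[column] = X[column].map(mapping)
        return X

    def get_feature_names_out(self, input_features=None):
        return input_features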

ct = ColumnTransformer(
    [
        ('scaler', StandardScaler(), columns_to_scale),
        ('ohe', OneHotEncoder(), ['customer_class']),
        ('binary_mapper', BinaryCategoricalMapper(), ['Gender', 'customer_type', 'type_of_travel'])
    ],
    verbose_feature_names_out=False,
    remainder = 'drop'

)

In [30]:
ct

In [31]:
new_data = pd.DataFrame(ct.fit_transform(X), columns=ct.get_feature_names_out())
new_data

,age,flight_distance,inflight_wifi_service,ease_of_online_booking,food_and_drink,online_boarding,onboard_service,baggage_handling,checkin_service,departure_delay_in_minutes,customer_class_Business,customer_class_Eco,customer_class_Eco Plus,Gender,customer_type,type_of_travel
0,-1.731714,-0.739560,0.195951,0.168758,1.341580,-0.183949,0.484910,0.313123,0.553769,0.269682,0.0,0.0,1.0,0.0,0.0,0.0
1,-0.937966,-0.963590,0.195951,0.168758,-1.653596,-0.183949,-1.848263,-0.533431,-1.814472,-0.357467,1.0,0.0,0.0,0.0,1.0,1.0
2,-0.871821,-0.060500,-0.557456,-0.545193,1.341580,1.302753,0.484910,0.313123,0.553769,-0.383598,1.0,0.0,0.0,1.0,0.0,1.0
3,-0.937966,-0.637999,-0.557456,1.596661,-0.904802,-0.927300,-1.070539,-0.533431,-1.814472,-0.096155,1.0,0.0,0.0,1.0,0.0,1.0
4,1.443277,-0.984499,0.195951,0.168758,0.592786,1.302753,-0.292814,0.313123,-0.235645,-0.383598,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12983,-1.070258,0.072922,-1.310863,-1.259144,0.592786,-1.670652,1.262634,0.313123,1.343182,1.027486,0.0,1.0,0.0,1.0,0.0,0.0
12984,-0.078073,-0.367172,-0.557456,-0.545193,1.341580,-0.927300,1.262634,1.159678,-1.025059,-0.383598,0.0,1.0,0.0,0.0,1.0,1.0
12985,0.054218,-0.467737,0.949358,0.882709,1.341580,0.559402,-1.070539,-0.533431,-1.814472,-0.383598,0.0,1.0,0.0,1.0,0.0,0.0
12986,-2.128588,-0.623064,0.195951,0.168758,1.341580,-0.183949,-1.070539,-0.533431,-1.814472,-0.070024,0.0,1.0,0.0,0.0,0.0,0.0


In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify = y, random_state=12345, shuffle=True)

### Model selection

In [33]:
model_pipeline = IMBPipeline([
    ('trans', ct),
    ('sampler', SMOTE()),
    ('dim_reduction', PCA(n_components=0.8)),
    ('classifier', Perceptron())
])

In [34]:
model_pipeline.fit(X_train,y_train)

In [35]:
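sampler_configs = [
    {
        'sampler': [None],  # The step is bypassed
    },
    {
        'sampler': [SMOTE(n_jobs=-1)],
        # A float is the target minority/majority ratio after resampling and must
        # lie in (0, 1]; values below the current ratio make the fit fail
        'sampler__sampling_strategy': ['minority', 0.9, 0.7]
    },
]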


dim_reduction_configs = [
    {
        'dim_reduction': [None]
    },
    {
        'dim_reduction': [PCA()],
        'dim_reduction__n_components': [0.5, 0.7, 0.9]
    },
    {
        'dim_reduction': [LDA()]
    },
    {
        'dim_reduction': [SFS(estimator=LogisticRegression(), cv=None, scoring='f1')],
        'dim_reduction__k_features' : [5, 10] 
    }
]

classifier_configs = [
    {
        'classifier__eta0' : loguniform(0.001,100),
        'classifier': [Perceptron()] ,
        'classifier__max_iter': [1,5,10,15,50,100] ,
        'classifier__class_weight' : [None, 'balanced']

    },
    {
        'classifier': [LogisticRegression(solver='saga')],
        'classifier__C' : [0.01, 0.03, 0.05, 0.07],
        'classifier__penalty': ['l1','l2'],
        'classifier__class_weight' : [None, 'balanced']

    },
    {
        'classifier': [KNeighborsClassifier()],
        'classifier__n_neighbors': [5, 10, 15],
        'classifier__weights':['uniform', 'distance']

    },
    {
        'classifier' : [RandomForestClassifier()],
        'classifier__n_estimators' : [50, 100, 150, 200],
        'classifier__max_depth': [1, 2, 3, None],
    }
]

In [36]:
# Merge one dict from each list into a single parameter grid per combination
all_configs = [
    dict(itertools.chain(*(e.items() for e in configuration)))
    for configuration in itertools.product(sampler_configs, dim_reduction_configs, classifier_configs)
]

In [37]:
f'Number of all possible configurations: {len(all_configs)}'

'Number of all possible configurations: 32'
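
The count is the product of the number of dictionaries in each list (2 sampler, 4 dimensionality-reduction and 4 classifier configurations give 2 × 4 × 4 = 32), not the number of individual hyperparameter values. A reduced sketch of the same merge, with hypothetical placeholder strings instead of real estimators:

```python
import itertools

# Stand-ins for the three lists of parameter grids (contents are hypothetical)
samplers = [{'sampler': [None]}, {'sampler': ['smote']}]
reducers = [{'dim_reduction': [None]}, {'dim_reduction': ['pca']}]
classifiers = [{'classifier': ['knn']}, {'classifier': ['forest']}, {'classifier': ['logreg']}]

# One merged dict per (sampler, reducer, classifier) combination
merged = [
    {key: value for part in combo for key, value in part.items()}
    for combo in itertools.product(samplers, reducers, classifiers)
]
print(len(merged))  # 2 * 2 * 3 combinations
```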

In [38]:
rs = RandomizedSearchCV(model_pipeline,
    param_distributions=all_configs,
    n_iter=len(all_configs) * 5,
    n_jobs=-1,
    cv = 2,
    scoring='f1'
)

In [None]:
scores = cross_validate(rs, X_train, y_train, scoring='f1', cv = 5, return_estimator=True, verbose=3)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


In [None]:
for index, estimator in enumerate(scores['estimator']):
    print(estimator.best_estimator_.get_params()['dim_reduction'])
    print(estimator.best_estimator_.get_params()['classifier'],estimator.best_estimator_.get_params()['classifier'].get_params())
    print(scores['test_score'][index])
    print('-'*10)

None
KNeighborsClassifier() {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
nan
----------
None
KNeighborsClassifier() {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
nan
----------
None
RandomForestClassifier(max_depth=2, n_estimators=50) {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 2, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
nan
----------
None
KNeighborsClassifier(n_neighbors=15) {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbor

In [None]:
for estimator in scores['estimator']:
    best_model = estimator.best_estimator_
    best_model.fit(X_train, y_train)  # refit on the full training set
    pred_train = best_model.predict(X_train)
    pred_test = best_model.predict(X_test)
    f1_train = f1_score(y_train, pred_train)
    f1_test = f1_score(y_test, pred_test)
    print(f'F1 on training set:{f1_train}, F1 on test set:{f1_test}')

ValueError: pos_label=1 is not a valid label. It should be one of ['neutral or dissatisfied', 'satisfied']
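
This error is what `f1_score` raises when the target still holds the original string labels, because the default `pos_label=1` does not match either class. Mapping the labels to 0/1, as done when the target is split off, avoids it. An alternative, sketched below on hypothetical predictions, is to name the positive class explicitly:

```python
from sklearn.metrics import f1_score

# Hypothetical predictions using the original string labels
y_true = ['satisfied', 'neutral or dissatisfied', 'satisfied', 'satisfied']
y_pred = ['satisfied', 'satisfied', 'satisfied', 'neutral or dissatisfied']

# With string labels, the positive class has to be named explicitly
score = f1_score(y_true, y_pred, pos_label='satisfied')
print(round(score, 3))
```

The same mismatch probably explains the `nan` test scores above, since a scorer that fails is reported as `nan` by default.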
