# **Hackathon: Shinkansen Travel Experience**

## **Organization**

This project has 3 notebook files.

1. Data_preparation_TheNormals_Hackathon
2. Model_Building_TheNormals_Hackathon
3. Prediction_TheNormals_Hackathon

This is the second notebook. It reads the data that the first notebook saved to .csv files and builds several models. The best-performing models are stored for use in the next notebook, which makes the predictions.
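Handing a fitted model from one notebook to the next can be sketched with `joblib`, which this notebook imports. This is a minimal illustration, not the project's actual storage code: the synthetic data, the small random forest, and the file name `best_model.joblib` are all assumptions made for the example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption); the real notebook uses the Shinkansen features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=13)

# A small model that plays the role of a "best performing" model.
demo_model = RandomForestClassifier(n_estimators=20, random_state=13).fit(X_demo, y_demo)

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(demo_model, model_path)

# In the prediction notebook, the model would be loaded back and used as-is.
restored_model = joblib.load(model_path)
same_predictions = bool((restored_model.predict(X_demo) == demo_model.predict(X_demo)).all())
print("restored model reproduces predictions:", same_predictions)
```

The restored object behaves like the original estimator, so the prediction notebook only needs the file path and the same library versions.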

## **Problem Statement**

The goal is to determine whether passengers were delighted with their travel on the Shinkansen Bullet Train in Japan. This machine learning exercise aims to determine the relative importance of each parameter in its contribution to passengers' overall travel experience. The dataset contains a random sample of individuals who traveled on this train. The on-time performance of the trains, along with passenger information, is published in a file named 'Traveldata_train.csv'. These passengers were later asked to provide feedback on various parameters related to the travel, along with their overall experience. The collected details are available in the survey report labeled 'Surveydata_train.csv'.

In the survey, each passenger was explicitly asked whether they were satisfied with their overall travel experience or not, and that is captured in the data of the survey report under the variable labeled 'Overall_Experience', which is the target variable.

The objective of this problem is to understand which parameters play an important role in swaying passenger feedback towards a positive scale. We are provided with test data containing the travel data and the survey data of passengers. Both the test data and the train data are collected at the same time and belong to the same population.

## **Solution**

The problem statement describes a supervised classification problem. We will try several machine learning classifiers, compare their accuracy on the test set, and select the best one as our final model.
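The selection procedure described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the three candidate classifiers, their settings, and the generated data are hypothetical stand-ins, not the models or data used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption, stands in for the survey data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=13
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=13)

# Hypothetical candidate models.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=13),
    "gradient_boosting": GradientBoostingClassifier(random_state=13),
}

# Fit each candidate on the training split and score it on the held-out split.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Keep the candidate with the highest held-out accuracy.
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

The rest of the notebook delegates this search to FLAML's `AutoML`, which also tunes hyperparameters within a time budget.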

### **Loading Libraries**



In [136]:
%pip install flaml[automl] matplotlib openml



In [137]:
!pip3 install catboost



In [138]:
! pip install -U accelerate
! pip install -U transformers



In [139]:
!pip install datasets



In [140]:
# libraries for data manipulation
import pandas as pd
import numpy as np

from minio.error import ServerError
from flaml.automl.data import load_openml_dataset

# library to pretty-print Python data structures
from pprint import pprint

# library for train test split
from sklearn.model_selection import train_test_split

# library for scaling data
from sklearn.preprocessing import MinMaxScaler

# library for metrics
from sklearn.metrics import accuracy_score

# libraries for GridSearchCV and RandomizedSearchCV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# library for imputation
from sklearn.impute import KNNImputer

# libraries for cross validation
from sklearn.model_selection import cross_val_score, KFold

# library for ensemble models
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier, StackingClassifier

# library for XGBoostClassifier model
from xgboost import XGBClassifier

# library for lightgbm
from lightgbm import LGBMClassifier

# library for AutoML
from flaml import AutoML

# library for random number generator
import random

# library to save and load files
import joblib

# ignore Warnings
import warnings
warnings.filterwarnings("ignore")

### **Reading train dataframe from .csv file**



In [141]:
# mount google colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [142]:
# read df_train from .csv files
df_train = pd.read_csv('/content/drive/MyDrive/DScourse/Hackathon/KNN3/train_data.csv')

In [143]:
# read df_pred (the prediction data) from the .csv file
# df_pred is not used for training, but it must go through the same
# preprocessing as the training data so that the fitted models can
# predict on it here and in the next notebook
df_pred = pd.read_csv('/content/drive/MyDrive/DScourse/Hackathon/KNN3/pred_data.csv')

### **Data Preprocessing**



In [144]:
df_train.head()

Unnamed: 0,ID,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins,Overall_Experience,...,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,Legroom,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding
0,98800001,1.0,1.0,52.0,,1,272,0.0,5.0,0,...,4.0,2.0,3.0,2.0,2.0,3.0,2.0,4.0,2.0,1.0
1,98800002,0.0,1.0,48.0,0.0,0,2200,9.0,0.0,0,...,4.0,1.0,4.0,4.0,5.0,2.0,1.0,2.0,4.0,4.0
2,98800003,1.0,1.0,43.0,1.0,1,1061,77.0,119.0,1,...,2.0,4.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0
3,98800004,1.0,1.0,44.0,1.0,1,780,13.0,18.0,0,...,3.0,2.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0
4,98800005,1.0,1.0,50.0,1.0,1,1981,0.0,0.0,1,...,2.0,4.0,5.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0


In [145]:
df_pred.head()

Unnamed: 0,ID,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins,Seat_Comfort,...,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,Legroom,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding
0,99900001,1.0,,36.0,1.0,1,532,0.0,0.0,3.0,...,2.0,5.0,4.0,5.0,5.0,5.0,5.0,4.0,5.0,1.0
1,99900002,1.0,0.0,21.0,1.0,1,1425,9.0,28.0,0.0,...,3.0,1.0,3.0,3.0,5.0,3.0,4.0,3.0,5.0,3.0
2,99900003,0.0,1.0,60.0,1.0,1,2832,0.0,0.0,5.0,...,5.0,5.0,5.0,2.0,2.0,2.0,2.0,4.0,2.0,5.0
3,99900004,1.0,1.0,29.0,0.0,0,1352,0.0,0.0,3.0,...,1.0,3.0,5.0,1.0,3.0,2.0,5.0,5.0,5.0,1.0
4,99900005,0.0,0.0,18.0,1.0,1,1610,17.0,0.0,5.0,...,5.0,5.0,5.0,5.0,,3.0,5.0,5.0,5.0,5.0


#### **Separate Train and Test data**



In [146]:
# saving a copy of df_train for future reference
df_train_copy = df_train.copy()

# splitting X, y
X = df_train.drop(['ID','Overall_Experience'], axis = 1)
y = df_train['Overall_Experience']

In [147]:
X.shape, y.shape

((94379, 23), (94379,))

In [148]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13)

In [149]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((66065, 23), (28314, 23), (66065,), (28314,))

#### **Impute Missing values**



In [150]:
# creating a copy of the prediction data
df_pred_orig = df_pred.copy()

In [151]:
# checking for missing values
X_train.isnull().sum()[X_train.isnull().sum() != 0], X_test.isnull().sum()[X_test.isnull().sum() != 0], df_pred.isnull().sum()[df_pred.isnull().sum() != 0]

(Gender                       61
 Customer_Type              6284
 Age                          26
 Type_Travel                6403
 Departure_Delay_in_Mins      41
 Arrival_Delay_in_Mins       256
 Seat_Comfort                 49
 Arrival_Time_Convenient    6275
 Catering                   6153
 Platform_Location            18
 Onboard_Wifi_Service         18
 Onboard_Entertainment        10
 Online_Support               60
 Ease_of_Online_Booking       50
 Onboard_Service            5225
 Legroom                      61
 Baggage_Handling            101
 CheckIn_Service              56
 Cleanliness                   5
 Online_Boarding               5
 dtype: int64,
 Gender                       16
 Customer_Type              2667
 Age                           7
 Type_Travel                2823
 Departure_Delay_in_Mins      16
 Arrival_Delay_in_Mins       101
 Seat_Comfort                 12
 Arrival_Time_Convenient    2655
 Catering                   2588
 Platform_Location          

In [152]:
# making ID column index to be retained in original prediction dataframe
df_pred.set_index('ID', inplace = True)

In [153]:
df_pred.head()

ID,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins,Seat_Comfort,Seat_Class,...,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,Legroom,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding
99900001,1.0,,36.0,1.0,1,532,0.0,0.0,3.0,1,...,2.0,5.0,4.0,5.0,5.0,5.0,5.0,4.0,5.0,1.0
99900002,1.0,0.0,21.0,1.0,1,1425,9.0,28.0,0.0,0,...,3.0,1.0,3.0,3.0,5.0,3.0,4.0,3.0,5.0,3.0
99900003,0.0,1.0,60.0,1.0,1,2832,0.0,0.0,5.0,0,...,5.0,5.0,5.0,2.0,2.0,2.0,2.0,4.0,2.0,5.0
99900004,1.0,1.0,29.0,0.0,0,1352,0.0,0.0,3.0,1,...,1.0,3.0,5.0,1.0,3.0,2.0,5.0,5.0,5.0,1.0
99900005,0.0,0.0,18.0,1.0,1,1610,17.0,0.0,5.0,0,...,5.0,5.0,5.0,5.0,,3.0,5.0,5.0,5.0,5.0


In [154]:
# declaring KNNImputer
imputer = KNNImputer(n_neighbors=14)

# imputing missing values
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train), columns = X_train.columns)
X_test_imputed = pd.DataFrame(imputer.transform(X_test), columns = X_test.columns)
df_pred_imputed = pd.DataFrame(imputer.transform(df_pred), columns = df_pred.columns)

In [155]:
X_train_imputed = X_train_imputed.round(decimals=0).astype(int)

X_test_imputed = X_test_imputed.round(decimals = 0).astype(int)

df_pred_imputed = df_pred_imputed.round(decimals = 0).astype(int)

In [156]:
# checking for null values
X_train_imputed.isnull().sum()[X_train_imputed.isnull().sum() != 0], X_test_imputed.isnull().sum()[X_test_imputed.isnull().sum() != 0],  df_pred_imputed.isnull().sum()[df_pred_imputed.isnull().sum() != 0]

(Series([], dtype: int64), Series([], dtype: int64), Series([], dtype: int64))
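The imputation pattern used above (fit the imputer on the training split only, reuse it for the other frames, then round back to integers) can be shown on a toy example. This is a sketch under assumptions: the two column names are borrowed from the dataset, but the values and `n_neighbors=2` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frames with missing values (values are invented for the example).
train_demo = pd.DataFrame({
    "Age": [20, 30, 40, 50, np.nan],
    "Seat_Comfort": [1, 2, np.nan, 4, 5],
})
test_demo = pd.DataFrame({"Age": [25, np.nan], "Seat_Comfort": [np.nan, 3]})

# Fit on the training frame only, so no information leaks from the test frame.
imputer_demo = KNNImputer(n_neighbors=2)
train_filled = pd.DataFrame(imputer_demo.fit_transform(train_demo), columns=train_demo.columns)
test_filled = pd.DataFrame(imputer_demo.transform(test_demo), columns=test_demo.columns)

# Neighbour averages can be fractional; round so ordinal columns stay integer-coded.
train_filled = train_filled.round(0).astype(int)
test_filled = test_filled.round(0).astype(int)
print(train_filled)
print(test_filled)
```

Values that were present before imputation are left unchanged; only the missing cells are replaced by the rounded neighbour average.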

In [157]:
df_pred_imputed.head()

Unnamed: 0,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins,Seat_Comfort,Seat_Class,...,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,Legroom,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding
0,1,1,36,1,1,532,0,0,3,1,...,2,5,4,5,5,5,5,4,5,1
1,1,0,21,1,1,1425,9,28,0,0,...,3,1,3,3,5,3,4,3,5,3
2,0,1,60,1,1,2832,0,0,5,0,...,5,5,5,2,2,2,2,4,2,5
3,1,1,29,0,0,1352,0,0,3,1,...,1,3,5,1,3,2,5,5,5,1
4,0,0,18,1,1,1610,17,0,5,0,...,5,5,5,5,4,3,5,5,5,5


### **Building Models**



#### **Functions to build models and obtain scores**

In [158]:
# predict on the prediction data and save the results to a .csv file
def predict(model, model_name, predictiondata):
    # use 'ID' as the index on a new dataframe, so the caller's
    # dataframe is not modified and the cell can be re-run safely
    X = predictiondata.set_index('ID')

    # predict; the raw outputs of 'nn_model9a' are rounded to 0/1 first
    # (the check uses model_name, since the model object is never equal to a string)
    y_pred = np.round(model.predict(X), 0).astype(int) if model_name == 'nn_model9a' else model.predict(X).astype(int)

    # create a DataFrame for the current model's predictions
    prediction_df = pd.DataFrame({'ID': X.index, 'Overall_Experience': y_pred.flatten()})

    # save predictions to a CSV file for the current model
    # (pandas >= 1.5 names this argument 'lineterminator'; older versions use 'line_terminator')
    filename = "/content/drive/MyDrive/DScourse/Hackathon/AutoML/Predictions/" + model_name + "_pred.csv"
    prediction_df.to_csv(filename, index=False, header=True, lineterminator='\r\n')

#### **Model 1: Unscaled imputed data, KNNImputer, all original columns**

In [66]:
train_x = X_train_imputed.copy()

test_x = X_test_imputed.copy()

pred_x = pd.concat([df_pred_orig['ID'], df_pred_imputed.copy()], axis = 1)

In [67]:
pred_x.head()

Unnamed: 0,ID,Gender,Customer_Type,Age,Type_Travel,Travel_Class,Travel_Distance,Departure_Delay_in_Mins,Arrival_Delay_in_Mins,Seat_Comfort,...,Onboard_Wifi_Service,Onboard_Entertainment,Online_Support,Ease_of_Online_Booking,Onboard_Service,Legroom,Baggage_Handling,CheckIn_Service,Cleanliness,Online_Boarding
0,99900001,1,1,36,1,1,532,0,0,3,...,2,5,4,5,5,5,5,4,5,1
1,99900002,1,0,21,1,1,1425,9,28,0,...,3,1,3,3,5,3,4,3,5,3
2,99900003,0,1,60,1,1,2832,0,0,5,...,5,5,5,2,2,2,2,4,2,5
3,99900004,1,1,29,0,0,1352,0,0,3,...,1,3,5,1,3,2,5,5,5,1
4,99900005,0,0,18,1,1,1610,17,0,5,...,5,5,5,5,4,3,5,5,5,5


In [52]:
# declare the AutoML object
automl1 = AutoML()

# setting for AutoML
settings = {
    "time_budget": 600,  # total running time in seconds
    "metric": 'accuracy',
    # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": 'classification',  # task type
    "estimator_list": ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'histgb', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor'],
    "log_file_name": '/content/drive/MyDrive/DScourse/Hackathon/AutoML/hackathon1.log',  # flaml log file
    "seed": 13,    # random seed
}

In [55]:
# The main flaml automl API
automl1.fit(X_train=train_x, y_train=y_train, **settings)

[flaml.automl.logger: 01-12 02:25:03] {1679} INFO - task = classification
[flaml.automl.logger: 01-12 02:25:03] {1690} INFO - Evaluation method: holdout
[flaml.automl.logger: 01-12 02:25:03] {1788} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 01-12 02:25:03] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.logger: 01-12 02:25:03] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-12 02:25:03] {2344} INFO - Estimated sufficient time budget=12282s. Estimated necessary time budget=302s.
[flaml.automl.logger: 01-12 02:25:03] {2391} INFO -  at 0.8s,	estimator lgbm's best error=0.1563,	best estimator lgbm's best error=0.1563
[flaml.automl.logger: 01-12 02:25:03] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-12 02:25:04] {2391} INFO -  at 0.9s,	estimator lgbm's best error=0.1563,	best estimator lgbm's best error=0.1563
[flaml.automl.l

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 01-12 02:28:28] {2391} INFO -  at 205.3s,	estimator lrl1's best error=0.4138,	best estimator lgbm's best error=0.0437
[flaml.automl.logger: 01-12 02:28:28] {2218} INFO - iteration 113, current learner rf
[flaml.automl.logger: 01-12 02:28:28] {2391} INFO -  at 205.6s,	estimator rf's best error=0.1218,	best estimator lgbm's best error=0.0437
[flaml.automl.logger: 01-12 02:28:28] {2218} INFO - iteration 114, current learner lrl1
[flaml.automl.logger: 01-12 02:28:29] {2391} INFO -  at 206.2s,	estimator lrl1's best error=0.4138,	best estimator lgbm's best error=0.0437
[flaml.automl.logger: 01-12 02:28:29] {2218} INFO - iteration 115, current learner lrl1
[flaml.automl.logger: 01-12 02:28:30] {2391} INFO -  at 206.9s,	estimator lrl1's best error=0.4138,	best estimator lgbm's best error=0.0437
[flaml.automl.logger: 01-12 02:28:30] {2218} INFO - iteration 116, current learner lrl1
[flaml.automl.logger: 01-12 02:28:34] {2391} INFO -  at 211.7s,	estimator lrl1's best error=

In [56]:
# retrieve best config and best learner
print('Best ML learner:', automl1.best_estimator)
print('Best hyperparameter config:', automl1.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl1.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl1.best_config_train_time))

Best ML learner: lgbm
Best hyperparameter config: {'n_estimators': 489, 'num_leaves': 3341, 'min_child_samples': 26, 'learning_rate': 0.6810175093276573, 'log_max_bin': 7, 'colsample_bytree': 0.7474731186737704, 'reg_alpha': 0.0026114080577972036, 'reg_lambda': 0.4440306981875064}
Best accuracy on validation data: 0.9563
Training duration of best run: 16.24 s


In [57]:
# obtain best model
automl1.model.estimator

##### Predict on the held-out test data

In [58]:
# compute predictions on the held-out test set
# note: this cell passes X_test (the split before imputation); the imputed copy is test_x
y_pred1 = automl1.predict(X_test)
y_pred_proba1 = automl1.predict_proba(X_test)[:,1]

In [59]:
# compute different metric values on the held-out test set
from flaml.ml import sklearn_metric_loss_score
print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred1, y_test))
print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba1, y_test))
print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba1, y_test))

accuracy = 0.9539450448541358
roc_auc = 0.9924621781639097
log_loss = 0.1903503946566573
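The cell above reports metrics through FLAML's `sklearn_metric_loss_score`, subtracting the result from 1 for accuracy and ROC AUC because the helper is used there as a loss. As a cross-check, the same three quantities can be computed directly with scikit-learn. The sketch below uses a tiny hand-made example; the labels and probabilities are invented and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# Invented ground truth and predicted probabilities of the positive class.
y_true_demo = np.array([0, 1, 1, 0, 1, 0])
proba_demo = np.array([0.1, 0.8, 0.6, 0.4, 0.9, 0.3])

# Hard labels from a 0.5 threshold (assumption for the example).
label_demo = (proba_demo >= 0.5).astype(int)

acc = accuracy_score(y_true_demo, label_demo)   # higher is better
auc = roc_auc_score(y_true_demo, proba_demo)    # higher is better
ll = log_loss(y_true_demo, proba_demo)          # lower is better
print("accuracy =", acc, "roc_auc =", auc, "log_loss =", ll)
```

Accuracy needs hard labels, while ROC AUC and log loss need the predicted probabilities, which is why the notebook keeps both `y_pred` and `y_pred_proba`.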


In [69]:
predict(automl1, 'automl1', pred_x)

#### **Model 2: Unscaled imputed data, KNNImputer, all original columns for stacked model - ensemble**

In [70]:
train_x = X_train_imputed.copy()

test_x = X_test_imputed.copy()

pred_x = pd.concat([df_pred_orig['ID'], df_pred_imputed.copy()], axis = 1)

In [76]:
# declare the AutoML object
automl2 = AutoML()

# setting for AutoML
settings = {
    "time_budget": 600,  # total running time in seconds
    "metric": 'accuracy',
    # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": 'classification',  # task type
    "estimator_list": ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor'],
    "log_file_name": '/content/drive/MyDrive/DScourse/Hackathon/AutoML/hackathon1.log',  # flaml log file
    "ensemble": True,
    "seed": 13,    # random seed
}

In [77]:
# The main flaml automl API
automl2.fit(X_train=train_x, y_train=y_train, **settings)

[flaml.automl.logger: 01-12 02:50:23] {1679} INFO - task = classification
[flaml.automl.logger: 01-12 02:50:23] {1690} INFO - Evaluation method: holdout
[flaml.automl.logger: 01-12 02:50:23] {1788} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 01-12 02:50:23] {1900} INFO - List of ML learners in AutoML Run: ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor']
[flaml.automl.logger: 01-12 02:50:23] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 01-12 02:50:23] {2344} INFO - Estimated sufficient time budget=10036s. Estimated necessary time budget=188s.
[flaml.automl.logger: 01-12 02:50:23] {2391} INFO -  at 1.2s,	estimator xgboost's best error=0.1563,	best estimator xgboost's best error=0.1563
[flaml.automl.logger: 01-12 02:50:23] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-12 02:50:23] {2391} INFO -  at 1.4s,	estimator lgbm's best error=0.1563,	best estimator xgboost's 

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 01-12 02:51:12] {2391} INFO -  at 50.1s,	estimator lrl2's best error=0.2345,	best estimator xgboost's best error=0.0442
[flaml.automl.logger: 01-12 02:51:12] {2218} INFO - iteration 74, current learner extra_tree
[flaml.automl.logger: 01-12 02:51:14] {2391} INFO -  at 51.6s,	estimator extra_tree's best error=0.1095,	best estimator xgboost's best error=0.0442
[flaml.automl.logger: 01-12 02:51:14] {2218} INFO - iteration 75, current learner lrl2
[flaml.automl.logger: 01-12 02:51:14] {2391} INFO -  at 52.6s,	estimator lrl2's best error=0.2345,	best estimator xgboost's best error=0.0442
[flaml.automl.logger: 01-12 02:51:14] {2218} INFO - iteration 76, current learner extra_tree
[flaml.automl.logger: 01-12 02:51:17] {2391} INFO -  at 54.8s,	estimator extra_tree's best error=0.1095,	best estimator xgboost's best error=0.0442
[flaml.automl.logger: 01-12 02:51:17] {2218} INFO - iteration 77, current learner rf
[flaml.automl.logger: 01-12 02:51:17] {2391} INFO -  at 55.4s,

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 01-12 03:00:16] {2391} INFO -  at 594.3s,	estimator lrl1's best error=0.4138,	best estimator xgboost's best error=0.0422
[flaml.automl.logger: 01-12 03:00:16] {2218} INFO - iteration 186, current learner lrl1
[flaml.automl.logger: 01-12 03:00:17] {2391} INFO -  at 594.9s,	estimator lrl1's best error=0.4138,	best estimator xgboost's best error=0.0422
[flaml.automl.logger: 01-12 03:00:17] {2218} INFO - iteration 187, current learner lrl1
[flaml.automl.logger: 01-12 03:00:17] {2391} INFO -  at 595.6s,	estimator lrl1's best error=0.4138,	best estimator xgboost's best error=0.0422
[flaml.automl.logger: 01-12 03:00:17] {2218} INFO - iteration 188, current learner lrl1
[flaml.automl.logger: 01-12 03:00:22] {2391} INFO -  at 600.4s,	estimator lrl1's best error=0.3435,	best estimator xgboost's best error=0.0422
[flaml.automl.logger: 01-12 03:00:22] {2525} INFO - [('xgboost', {'n_jobs': -1, 'n_estimators': 284, 'max_leaves': 92, 'min_child_weight': 0.15784619863870844, 'lea

In [78]:
# retrieve best config and best learner
print('Best ML learner:', automl2.best_estimator)
print('Best hyperparameter config:', automl2.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl2.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl2.best_config_train_time))

Best ML learner: xgboost
Best hyperparameter config: {'n_estimators': 284, 'max_leaves': 92, 'min_child_weight': 0.15784619863870844, 'learning_rate': 0.42764856004198537, 'subsample': 1.0, 'colsample_bylevel': 1.0, 'colsample_bytree': 0.8942072410001078, 'reg_alpha': 0.007949362705976761, 'reg_lambda': 0.15683325226091974}
Best accuracy on validation data: 0.9578
Training duration of best run: 3.314 s


In [79]:
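# obtain best model (with "ensemble": True the fitted model is a StackingClassifier, which has no .estimator attribute)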
automl2.model.estimator

AttributeError: 'StackingClassifier' object has no attribute 'estimator'
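The error above says the fitted object is a `StackingClassifier`, which does not expose `.estimator`. A scikit-learn `StackingClassifier` keeps its fitted base models in `estimators_` and the meta-model in `final_estimator_`. The sketch below demonstrates this on a small, hypothetical stack built directly with scikit-learn; that the object behind `automl2.model` exposes the same attributes is an assumption based on the class name in the error message.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a hypothetical two-model stack, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=13)

stack_demo = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=13)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=13)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack_demo.fit(X_demo, y_demo)

# Names of the base learners as passed in, and their fitted counterparts.
base_names = [name for name, _ in stack_demo.estimators]
fitted_base = stack_demo.estimators_

# The fitted meta-model that combines the base learners' outputs.
meta_model = stack_demo.final_estimator_

# There is no single ".estimator" on a stack, which matches the AttributeError.
has_estimator_attr = hasattr(stack_demo, "estimator")
print(base_names, type(meta_model).__name__, has_estimator_attr)
```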

##### Predict on the held-out test data

In [82]:
# compute predictions on the held-out test set
y_pred2 = automl2.predict(test_x)
y_pred_proba2 = automl2.predict_proba(test_x)[:,1]

In [83]:
# compute different metric values on the held-out test set
from flaml.ml import sklearn_metric_loss_score
print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred2, y_test))
print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba2, y_test))
print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba2, y_test))

accuracy = 0.9225824680370135
roc_auc = 0.9740513155440499
log_loss = 0.20551637566581302


In [None]:
predict(automl2, 'automl2', pred_x)

#### **Model 3: Unscaled imputed data, KNNImputer, Seat_Class deleted**

In [84]:
train_x = X_train_imputed.copy()
train_x.drop('Seat_Class', axis = 1, inplace = True)

test_x = X_test_imputed.copy()
test_x.drop('Seat_Class', axis = 1, inplace = True)

pred_x = pd.concat([df_pred_orig['ID'], df_pred_imputed.copy()], axis = 1)
pred_x.drop('Seat_Class', axis = 1, inplace = True)

In [85]:
train_x.shape

(66065, 22)

In [90]:
# declare the AutoML object
automl3 = AutoML()

# setting for AutoML
settings = {
    "time_budget": 600,  # total running time in seconds
    "metric": 'accuracy',
    # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": 'classification',  # task type
    "estimator_list": ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor'],
    "log_file_name": '/content/drive/MyDrive/DScourse/Hackathon/AutoML/hackathon1.log',  # flaml log file
    #"ensemble": True,
    "seed": 13,    # random seed
}

In [91]:
# The main flaml automl API
automl3.fit(X_train=train_x, y_train=y_train, **settings)

[flaml.automl.logger: 01-12 03:13:51] {1679} INFO - task = classification
[flaml.automl.logger: 01-12 03:13:51] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-12 03:13:51] {1788} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 01-12 03:13:51] {1900} INFO - List of ML learners in AutoML Run: ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor']
[flaml.automl.logger: 01-12 03:13:51] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 01-12 03:13:54] {2344} INFO - Estimated sufficient time budget=30595s. Estimated necessary time budget=575s.
[flaml.automl.logger: 01-12 03:13:54] {2391} INFO -  at 3.5s,	estimator xgboost's best error=0.1547,	best estimator xgboost's best error=0.1547
[flaml.automl.logger: 01-12 03:13:54] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-12 03:13:56] {2391} INFO -  at 4.9s,	estimator lgbm's best error=0.1547,	best estimator xgboost's best 

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 01-12 03:19:42] {2391} INFO -  at 351.5s,	estimator lrl2's best error=0.2429,	best estimator xgboost's best error=0.0463
[flaml.automl.logger: 01-12 03:19:42] {2218} INFO - iteration 51, current learner lrl2
[flaml.automl.logger: 01-12 03:19:52] {2391} INFO -  at 361.4s,	estimator lrl2's best error=0.2429,	best estimator xgboost's best error=0.0463
[flaml.automl.logger: 01-12 03:19:52] {2218} INFO - iteration 52, current learner lrl2
[flaml.automl.logger: 01-12 03:20:02] {2391} INFO -  at 371.3s,	estimator lrl2's best error=0.2367,	best estimator xgboost's best error=0.0463
[flaml.automl.logger: 01-12 03:20:02] {2218} INFO - iteration 53, current learner kneighbor
[flaml.automl.logger: 01-12 03:20:21] {2391} INFO -  at 390.4s,	estimator kneighbor's best error=0.3205,	best estimator xgboost's best error=0.0463
[flaml.automl.logger: 01-12 03:20:21] {2218} INFO - iteration 54, current learner rf
[flaml.automl.logger: 01-12 03:20:23] {2391} INFO -  at 392.2s,	estimato

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 01-12 03:23:54] {2391} INFO -  at 603.6s,	estimator lrl1's best error=0.3515,	best estimator xgboost's best error=0.0463
[flaml.automl.logger: 01-12 03:23:56] {2627} INFO - retrain xgboost for 1.6s
[flaml.automl.logger: 01-12 03:23:56] {2630} INFO - retrained model: XGBClassifier(base_score=None, booster=None, callbacks=[],
              colsample_bylevel=0.5909530535241774, colsample_bynode=None,
              colsample_bytree=1.0, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy='lossguide', importance_type=None,
              interaction_constraints=None, learning_rate=0.14016333421486576,
              max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=0, max_leaves=147,
              min_child_weight=0.04742539395483736, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_e

In [92]:
# retrieve best config and best learner
print('Best ML learner:', automl3.best_estimator)
print('Best hyperparameter config:', automl3.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl3.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl3.best_config_train_time))

Best ML learner: xgboost
Best hyperparameter config: {'n_estimators': 91, 'max_leaves': 147, 'min_child_weight': 0.04742539395483736, 'learning_rate': 0.14016333421486576, 'subsample': 1.0, 'colsample_bylevel': 0.5909530535241774, 'colsample_bytree': 1.0, 'reg_alpha': 2.766143654293635, 'reg_lambda': 0.04798910206552531}
Best accuracy on validation data: 0.9537
Training duration of best run: 1.607 s


In [94]:
# obtain best model
automl3.model.estimator

##### Predict on the held-out test data

In [95]:
# compute predictions on the held-out test set
y_pred3 = automl3.predict(test_x)
y_pred_proba3 = automl3.predict_proba(test_x)[:,1]

In [96]:
# compute different metric values on the held-out test set
from flaml.ml import sklearn_metric_loss_score
print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred3, y_test))
print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba3, y_test))
print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba3, y_test))

accuracy = 0.9572296390478209
roc_auc = 0.9937487713452847
log_loss = 0.10062098556037102


In [97]:
predict(automl3, 'automl3', pred_x)

#### **Model 4: Unscaled imputed data, KNNImputer, Seat_Class deleted for stacked model**

In [123]:
# copy first, so dropping columns in place does not alter the shared imputed dataframes
train_x = X_train_imputed.copy()
train_x.drop('Seat_Class', axis = 1, inplace = True)

test_x = X_test_imputed.copy()
test_x.drop('Seat_Class', axis = 1, inplace = True)

pred_x = pd.concat([df_pred_orig['ID'], df_pred_imputed.copy()], axis = 1)
pred_x.drop('Seat_Class', axis = 1, inplace = True)

In [124]:
train_x.shape

(66065, 22)

In [129]:
# declare the AutoML object
automl4 = AutoML()

# setting for AutoML
settings = {
    "time_budget": 600,  # total running time in seconds
    "metric": 'accuracy',
    # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": 'classification',  # task type
    "estimator_list": ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor'],
    "log_file_name": '/content/drive/MyDrive/DScourse/Hackathon/AutoML/hackathon1.log',  # flaml log file
    "ensemble": True,
    "seed": 13,    # random seed
}

In [130]:
# The main flaml automl API
automl4.fit(X_train=train_x, y_train=y_train, **settings)

[flaml.automl.logger: 01-12 03:40:07] {1679} INFO - task = classification
[flaml.automl.logger: 01-12 03:40:07] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-12 03:40:07] {1788} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 01-12 03:40:07] {1900} INFO - List of ML learners in AutoML Run: ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor']
[flaml.automl.logger: 01-12 03:40:07] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 01-12 03:40:10] {2344} INFO - Estimated sufficient time budget=34296s. Estimated necessary time budget=644s.
[flaml.automl.logger: 01-12 03:40:10] {2391} INFO -  at 3.9s,	estimator xgboost's best error=0.1547,	best estimator xgboost's best error=0.1547
[flaml.automl.logger: 01-12 03:40:10] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-12 03:40:11] {2391} INFO -  at 5.0s,	estimator lgbm's best error=0.1547,	best estimator xgboost's best 

In [131]:
# retrieve best config and best learner
print('Best ML learner:', automl4.best_estimator)
print('Best hyperparameter config:', automl4.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl4.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl4.best_config_train_time))

Best ML learner: xgboost
Best hyperparameter config: {'n_estimators': 91, 'max_leaves': 147, 'min_child_weight': 0.04742539395483736, 'learning_rate': 0.14016333421486576, 'subsample': 1.0, 'colsample_bylevel': 0.5909530535241774, 'colsample_bytree': 1.0, 'reg_alpha': 2.766143654293635, 'reg_lambda': 0.04798910206552531}
Best accuracy on validation data: 0.9537
Training duration of best run: 11.05 s


##### Predict on the held-out test data

In [132]:
# compute class predictions and positive-class probabilities on the testing dataset
y_pred4 = automl4.predict(test_x)
y_pred_proba4 = automl4.predict_proba(test_x)[:,1]

In [133]:
# compute accuracy, ROC AUC and log loss on the testing dataset
from flaml.ml import sklearn_metric_loss_score
print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred4, y_test))
print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba4, y_test))
print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba4, y_test))

accuracy = 0.8783993783993784
roc_auc = 0.9472366056108122
log_loss = 0.2946694316508978
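
As a cross-check on the FLAML helper, the same three metrics can be computed directly with scikit-learn. The sketch below is only illustrative: the arrays are toy stand-ins for `y_test`, `y_pred4` and `y_pred_proba4`, and the 0.5 threshold for hard labels is an assumption, not something taken from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# Toy stand-ins for y_test and y_pred_proba4 (illustrative values only)
y_true = np.array([0, 0, 1, 1, 1, 0])
y_proba = np.array([0.10, 0.40, 0.80, 0.65, 0.30, 0.20])  # P(class == 1)

# Hard labels at an assumed 0.5 threshold, standing in for y_pred4
y_label = (y_proba >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_label),   # uses hard labels
    "roc_auc": roc_auc_score(y_true, y_proba),     # uses probabilities
    "log_loss": log_loss(y_true, y_proba),         # uses probabilities
}
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
```

Note that scikit-learn's functions take the true labels first, while `sklearn_metric_loss_score` in the cell above takes the predictions first.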


In [None]:
predict(automl4, 'automl4', pred_x)

#### **Model 5: Unscaled imputed data, KNNImputer, Seat_Class deleted, Arrival_Delay binned**

In [159]:
# Arrival-delay bin edges: <= 0 (on time or early), 1-10, 11-120 and 121+ minutes
delay_edges = [-float('inf'), 0, 10, 120, float('inf')]

# Training features: drop Seat_Class and replace the raw arrival delay with its bin code (0-3)
train_x = X_train_imputed.copy()
train_x.drop('Seat_Class', axis = 1, inplace = True)
train_x['Arrival_Delay_Bins'] = pd.cut(train_x['Arrival_Delay_in_Mins'], bins = delay_edges, labels = [0, 1, 2, 3])
train_x.drop('Arrival_Delay_in_Mins', axis = 1, inplace = True)

# Held-out test features: same transformation
test_x = X_test_imputed.copy()
test_x.drop('Seat_Class', axis = 1, inplace = True)
test_x['Arrival_Delay_Bins'] = pd.cut(test_x['Arrival_Delay_in_Mins'], bins = delay_edges, labels = [0, 1, 2, 3])
test_x.drop('Arrival_Delay_in_Mins', axis = 1, inplace = True)

# Prediction features: keep ID next to the imputed columns, then apply the same transformation
pred_x = pd.concat([df_pred_orig['ID'], df_pred_imputed.copy()], axis = 1)
pred_x.drop('Seat_Class', axis = 1, inplace = True)
pred_x['Arrival_Delay_Bins'] = pd.cut(pred_x['Arrival_Delay_in_Mins'], bins = delay_edges, labels = [0, 1, 2, 3])
pred_x.drop('Arrival_Delay_in_Mins', axis = 1, inplace = True)
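
To see how the bin edges treat boundary values, here is a small sketch on made-up delays. It reuses the edges from the cell above and codes the four bins as 0-3; the sample values are chosen only to sit on and around the edges.

```python
import pandas as pd

# Same bin edges as in the notebook cell; codes 0-3 correspond to the four bins
DELAY_EDGES = [-float("inf"), 0, 10, 120, float("inf")]

# Illustrative delays placed on and next to each edge
delays = pd.Series([-5, 0, 1, 10, 11, 120, 121], name="Arrival_Delay_in_Mins")

# pd.cut uses right-closed intervals by default:
# (-inf, 0], (0, 10], (10, 120], (120, inf]
bins = pd.cut(delays, bins=DELAY_EDGES, labels=[0, 1, 2, 3])

print(pd.DataFrame({"delay": delays, "bin": bins}))
```

Because the intervals are closed on the right, a delay of exactly 10 or 120 minutes stays in the lower bin, and bin 0 also absorbs any negative (early) arrivals.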

In [160]:
train_x.shape

(66065, 22)

In [161]:
# create a FLAML AutoML instance
automl5 = AutoML()

# setting for AutoML
settings = {
    "time_budget": 600,  # total running time in seconds
    "metric": 'accuracy',
    # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": 'classification',  # task type
    "estimator_list": ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor'],
    "log_file_name": '/content/drive/MyDrive/DScourse/Hackathon/AutoML/hackathon1.log',  # flaml log file
    #"ensemble": True,
    "seed": 13,    # random seed
}
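
The runs shown here all pass the same `hackathon1.log` file name. One possible way to keep each model's search log separate is to build the settings from a small helper. This is a sketch only: `make_settings` and the `automl_logs` directory are hypothetical names, and the directory is assumed to exist before the run starts.

```python
from pathlib import Path

# Hypothetical log directory (the notebook itself writes to a Google Drive path)
LOG_DIR = Path("automl_logs")


def make_settings(model_name, time_budget=600, seed=13):
    """Build a settings dict like the ones above, with a per-model log file."""
    return {
        "time_budget": time_budget,   # total running time in seconds
        "metric": "accuracy",
        "task": "classification",
        "estimator_list": ["xgboost", "xgb_limitdepth", "rf", "lgbm", "lrl1",
                           "lrl2", "catboost", "extra_tree", "kneighbor"],
        "log_file_name": str(LOG_DIR / f"{model_name}.log"),
        "seed": seed,                 # random seed
    }


settings5 = make_settings("automl5")
settings6 = make_settings("automl6")
print(settings5["log_file_name"])
print(settings6["log_file_name"])
```

The resulting dict can be unpacked into `fit` the same way as before, for example `automl5.fit(X_train=train_x, y_train=y_train, **settings5)`.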

In [162]:
# The main flaml automl API
automl5.fit(X_train=train_x, y_train=y_train, **settings)

[flaml.automl.logger: 01-12 04:04:27] {1679} INFO - task = classification
[flaml.automl.logger: 01-12 04:04:27] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-12 04:04:27] {1788} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 01-12 04:04:27] {1900} INFO - List of ML learners in AutoML Run: ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor']
[flaml.automl.logger: 01-12 04:04:27] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 01-12 04:04:28] {2344} INFO - Estimated sufficient time budget=9131s. Estimated necessary time budget=171s.
[flaml.automl.logger: 01-12 04:04:28] {2391} INFO -  at 1.3s,	estimator xgboost's best error=0.1547,	best estimator xgboost's best error=0.1547
[flaml.automl.logger: 01-12 04:04:28] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-12 04:04:29] {2391} INFO -  at 2.2s,	estimator lgbm's best error=0.1547,	best estimator xgboost's best e

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 01-12 04:09:18] {2391} INFO -  at 291.1s,	estimator lrl2's best error=0.2262,	best estimator xgboost's best error=0.0474
[flaml.automl.logger: 01-12 04:09:18] {2218} INFO - iteration 38, current learner extra_tree
[flaml.automl.logger: 01-12 04:09:20] {2391} INFO -  at 293.4s,	estimator extra_tree's best error=0.1111,	best estimator xgboost's best error=0.0474
[flaml.automl.logger: 01-12 04:09:20] {2218} INFO - iteration 39, current learner lrl2
[flaml.automl.logger: 01-12 04:09:29] {2391} INFO -  at 303.0s,	estimator lrl2's best error=0.2252,	best estimator xgboost's best error=0.0474
[flaml.automl.logger: 01-12 04:09:30] {2218} INFO - iteration 40, current learner xgboost
[flaml.automl.logger: 01-12 04:09:33] {2391} INFO -  at 306.1s,	estimator xgboost's best error=0.0474,	best estimator xgboost's best error=0.0474
[flaml.automl.logger: 01-12 04:09:33] {2218} INFO - iteration 41, current learner xgboost
[flaml.automl.logger: 01-12 04:09:43] {2391} INFO -  at 316

In [163]:
# retrieve best config and best learner
print('Best ML learner:', automl5.best_estimator)
print('Best hyperparameter config:', automl5.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl5.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl5.best_config_train_time))

Best ML learner: xgboost
Best hyperparameter config: {'n_estimators': 737, 'max_leaves': 208, 'min_child_weight': 0.10024508277504081, 'learning_rate': 0.1439509194030956, 'subsample': 0.913964501756151, 'colsample_bylevel': 0.5735740572022017, 'colsample_bytree': 1.0, 'reg_alpha': 0.6572032565494386, 'reg_lambda': 0.029197533116278977}
Best accuracy on validation data: 0.9536
Training duration of best run: 14.37 s


In [164]:
# obtain best model
automl5.model.estimator

##### Predict on the held-out test data

In [165]:
# compute class predictions and positive-class probabilities on the testing dataset
y_pred5 = automl5.predict(test_x)
y_pred_proba5 = automl5.predict_proba(test_x)[:,1]

In [166]:
# compute accuracy, ROC AUC and log loss on the testing dataset
from flaml.ml import sklearn_metric_loss_score
print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred5, y_test))
print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba5, y_test))
print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba5, y_test))

accuracy = 0.9569824115278661
roc_auc = 0.9933979959585009
log_loss = 0.12026265805894386


In [167]:
predict(automl5, 'automl5', pred_x)

#### **Model 6: Unscaled imputed data, KNNImputer, Seat_Class deleted, Total_Delay**

In [168]:
train_x = X_train_imputed.copy()
train_x.drop('Seat_Class', axis = 1, inplace = True)
train_x['Total_Delay'] = train_x['Arrival_Delay_in_Mins'] + train_x['Departure_Delay_in_Mins']
#train_x['Arrival_Delay_Bins'] = pd.cut(train_x['Arrival_Delay_in_Mins'], bins = [-float('inf'), 0, 10, 120, float('inf')], labels = ['0', '1-10', '11-120', '121+'])
#train_x['Arrival_Delay_Bins'].replace(['0', '1-10', '11-120', '121+'], [0, 1, 2, 3], inplace = True)
train_x.drop(['Arrival_Delay_in_Mins', 'Departure_Delay_in_Mins'], axis = 1, inplace = True)

test_x = X_test_imputed.copy()
test_x.drop('Seat_Class', axis = 1, inplace = True)
test_x['Total_Delay'] = test_x['Arrival_Delay_in_Mins'] + test_x['Departure_Delay_in_Mins']
#test_x['Arrival_Delay_Bins'] = pd.cut(test_x['Arrival_Delay_in_Mins'], bins = [-float('inf'), 0, 10, 120, float('inf')], labels = ['0', '1-10', '11-120', '121+'])
#test_x['Arrival_Delay_Bins'].replace(['0', '1-10', '11-120', '121+'], [0, 1, 2, 3], inplace = True)
test_x.drop(['Arrival_Delay_in_Mins', 'Departure_Delay_in_Mins'], axis = 1, inplace = True)

pred_x = pd.concat([df_pred_orig['ID'], df_pred_imputed.copy()], axis = 1)
pred_x.drop('Seat_Class', axis = 1, inplace = True)
pred_x['Total_Delay'] = pred_x['Arrival_Delay_in_Mins'] + pred_x['Departure_Delay_in_Mins']
#pred_x['Arrival_Delay_Bins'] = pd.cut(pred_x['Arrival_Delay_in_Mins'], bins = [-float('inf'), 0, 10, 120, float('inf')], labels = ['0', '1-10', '11-120', '121+'])
#pred_x['Arrival_Delay_Bins'].replace(['0', '1-10', '11-120', '121+'], [0, 1, 2, 3], inplace = True)
pred_x.drop(['Arrival_Delay_in_Mins', 'Departure_Delay_in_Mins'], axis = 1, inplace = True)
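
The same three steps are applied to the train, test and prediction frames, so they could be folded into one helper. The sketch below assumes only the column names used above; `add_total_delay` is a hypothetical name and the toy frame stands in for `X_train_imputed`.

```python
import pandas as pd


def add_total_delay(df):
    """Return a new frame with Seat_Class dropped and the two delays summed."""
    out = df.drop(columns=["Seat_Class"])  # drop() returns a new frame
    out["Total_Delay"] = out["Arrival_Delay_in_Mins"] + out["Departure_Delay_in_Mins"]
    return out.drop(columns=["Arrival_Delay_in_Mins", "Departure_Delay_in_Mins"])


# Toy frame standing in for X_train_imputed (values are illustrative)
toy = pd.DataFrame({
    "Seat_Class": [1.0, 0.0],
    "Arrival_Delay_in_Mins": [28.0, 0.0],
    "Departure_Delay_in_Mins": [9.0, 17.0],
    "Age": [21.0, 18.0],
})

prepared = add_total_delay(toy)
print(prepared)
```

With such a helper, the three blocks above would reduce to calls like `train_x = add_total_delay(X_train_imputed)`, and the input frames are left unmodified.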

In [169]:
train_x.columns

Index(['Gender', 'Customer_Type', 'Age', 'Type_Travel', 'Travel_Class',
       'Travel_Distance', 'Seat_Comfort', 'Arrival_Time_Convenient',
       'Catering', 'Platform_Location', 'Onboard_Wifi_Service',
       'Onboard_Entertainment', 'Online_Support', 'Ease_of_Online_Booking',
       'Onboard_Service', 'Legroom', 'Baggage_Handling', 'CheckIn_Service',
       'Cleanliness', 'Online_Boarding', 'Total_Delay'],
      dtype='object')

In [170]:
# create a FLAML AutoML instance
automl6 = AutoML()

# setting for AutoML
settings = {
    "time_budget": 600,  # total running time in seconds
    "metric": 'accuracy',
    # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": 'classification',  # task type
    "estimator_list": ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor'],
    "log_file_name": '/content/drive/MyDrive/DScourse/Hackathon/AutoML/hackathon1.log',  # flaml log file
    #"ensemble": True,
    "seed": 13,    # random seed
}

In [171]:
# The main flaml automl API
automl6.fit(X_train=train_x, y_train=y_train, **settings)

[flaml.automl.logger: 01-12 04:21:11] {1679} INFO - task = classification
[flaml.automl.logger: 01-12 04:21:11] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-12 04:21:11] {1788} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 01-12 04:21:11] {1900} INFO - List of ML learners in AutoML Run: ['xgboost', 'xgb_limitdepth', 'rf', 'lgbm', 'lrl1', 'lrl2', 'catboost', 'extra_tree', 'kneighbor']
[flaml.automl.logger: 01-12 04:21:11] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 01-12 04:21:12] {2344} INFO - Estimated sufficient time budget=9873s. Estimated necessary time budget=185s.
[flaml.automl.logger: 01-12 04:21:12] {2391} INFO -  at 1.3s,	estimator xgboost's best error=0.1547,	best estimator xgboost's best error=0.1547
[flaml.automl.logger: 01-12 04:21:12] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-12 04:21:13] {2391} INFO -  at 2.2s,	estimator lgbm's best error=0.1547,	best estimator xgboost's best e

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 01-12 04:22:30] {2391} INFO -  at 79.3s,	estimator lrl2's best error=0.2310,	best estimator catboost's best error=0.0509
[flaml.automl.logger: 01-12 04:22:30] {2218} INFO - iteration 24, current learner xgboost
[flaml.automl.logger: 01-12 04:22:32] {2391} INFO -  at 80.9s,	estimator xgboost's best error=0.0847,	best estimator catboost's best error=0.0509
[flaml.automl.logger: 01-12 04:22:32] {2218} INFO - iteration 25, current learner lrl2
[flaml.automl.logger: 01-12 04:22:41] {2391} INFO -  at 89.9s,	estimator lrl2's best error=0.2310,	best estimator catboost's best error=0.0509
[flaml.automl.logger: 01-12 04:22:41] {2218} INFO - iteration 26, current learner catboost
[flaml.automl.logger: 01-12 04:23:07] {2391} INFO -  at 116.4s,	estimator catboost's best error=0.0482,	best estimator catboost's best error=0.0482
[flaml.automl.logger: 01-12 04:23:07] {2218} INFO - iteration 27, current learner kneighbor
[flaml.automl.logger: 01-12 04:23:26] {2391} INFO -  at 135.

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 01-12 04:31:11] {2391} INFO -  at 600.4s,	estimator lrl1's best error=0.3484,	best estimator xgb_limitdepth's best error=0.0475
[flaml.automl.logger: 01-12 04:31:14] {2627} INFO - retrain xgb_limitdepth for 3.2s
[flaml.automl.logger: 01-12 04:31:14] {2630} INFO - retrained model: XGBClassifier(base_score=None, booster=None, callbacks=[],
              colsample_bylevel=0.8976632531695978, colsample_bynode=None,
              colsample_bytree=0.9891267920793386, device=None,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.08756537759052367,
              max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=11, max_leaves=None,
              min_child_weight=8.384287675156479, missing=nan,
              monotone_constr

In [172]:
# retrieve best config and best learner
print('Best ML learner:', automl6.best_estimator)
print('Best hyperparameter config:', automl6.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl6.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl6.best_config_train_time))

Best ML learner: xgb_limitdepth
Best hyperparameter config: {'n_estimators': 266, 'max_depth': 11, 'min_child_weight': 8.384287675156479, 'learning_rate': 0.08756537759052367, 'subsample': 0.8485132172169152, 'colsample_bylevel': 0.8976632531695978, 'colsample_bytree': 0.9891267920793386, 'reg_alpha': 0.098118480584988, 'reg_lambda': 0.05551336945491275}
Best accuracy on validation data: 0.9525
Training duration of best run: 3.161 s


In [173]:
# obtain best model
automl6.model.estimator

##### Predict on the held-out test data

In [175]:
# compute class predictions and positive-class probabilities on the testing dataset
y_pred6 = automl6.predict(test_x)
y_pred_proba6 = automl6.predict_proba(test_x)[:,1]

In [176]:
# compute accuracy, ROC AUC and log loss on the testing dataset
from flaml.ml import sklearn_metric_loss_score
print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred6, y_test))
print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba6, y_test))
print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba6, y_test))

accuracy = 0.9541569541569541
roc_auc = 0.9933064204282956
log_loss = 0.10364181346102141


In [None]:
predict(automl6, 'automl6', pred_x)

#### **Binning**



Creating Age Bins

In [None]:
# bin df_train['Age'] into 4 bins
#X_train_imputed['Age_Bins'] = pd.cut(X_train_imputed['Age'], bins = [0, 20, 40, 60, float('inf')], labels = ['0-20', '21-40', '41-60', '61+'])
#X_train_imputed['Age_Bins'].replace(['0-20', '21-40', '41-60', '61+'], [0, 1, 2, 3], inplace = True)

#X_test_imputed['Age_Bins'] = pd.cut(X_test_imputed['Age'], bins = [0, 20, 40, 60, float('inf')], labels = ['0-20', '21-40', '41-60', '61+'])
#X_test_imputed['Age_Bins'].replace(['0-20', '21-40', '41-60', '61+'], [0, 1, 2, 3], inplace = True)

#df_pred_imputed['Age_Bins'] = pd.cut(df_pred_imputed['Age'], bins = [0, 20, 40, 60, float('inf')], labels = ['0-20', '21-40', '41-60', '61+'])
#df_pred_imputed['Age_Bins'].replace(['0-20', '21-40', '41-60', '61+'], [0, 1, 2, 3], inplace = True)

In [None]:
#X_train_imputed['Age_Bins'].value_counts(), X_test_imputed['Age_Bins'].value_counts(), df_pred_imputed['Age_Bins'].value_counts()

Creating Arrival_Delay_Bins

In [None]:
# bin 'Arrival_Delay_in_Mins' into 4 bins coded 0-3: <= 0, 1-10, 11-120 and 121+ minutes
delay_edges = [-float('inf'), 0, 10, 120, float('inf')]

X_train_imputed['Arrival_Delay_Bins'] = pd.cut(X_train_imputed['Arrival_Delay_in_Mins'], bins = delay_edges, labels = [0, 1, 2, 3])

X_test_imputed['Arrival_Delay_Bins'] = pd.cut(X_test_imputed['Arrival_Delay_in_Mins'], bins = delay_edges, labels = [0, 1, 2, 3])

df_pred_imputed['Arrival_Delay_Bins'] = pd.cut(df_pred_imputed['Arrival_Delay_in_Mins'], bins = delay_edges, labels = [0, 1, 2, 3])

Creating Total_Delay Bins

In [None]:
#X_train_imputed['Total_Delay'] = X_train_imputed['Arrival_Delay_in_Mins'] + X_train_imputed['Departure_Delay_in_Mins']

#X_test_imputed['Total_Delay'] = X_test_imputed['Arrival_Delay_in_Mins'] + X_test_imputed['Departure_Delay_in_Mins']

#df_pred_imputed['Total_Delay'] = df_pred_imputed['Arrival_Delay_in_Mins'] + df_pred_imputed['Departure_Delay_in_Mins']

In [None]:
X_train_imputed['Arrival_Delay_Bins'].describe()

count     66065
unique        4
top           0
freq      37106
Name: Arrival_Delay_Bins, dtype: int64

In [None]:
# bin 'Total_Delay' into 4 bins
#X_train_imputed['Total_Delay_Bins'] = pd.cut(X_train_imputed['Total_Delay'], bins = [-float('inf'), 0, 10, 120, float('inf')], labels = ['0', '1-10', '11-120', '121+'])
#X_train_imputed['Total_Delay_Bins'].replace(['0', '1-10', '11-120', '121+'], [0, 1, 2, 3], inplace = True)

#X_test_imputed['Total_Delay_Bins'] = pd.cut(X_test_imputed['Total_Delay'], bins = [-float('inf'), 0, 10, 120, float('inf')], labels = ['0', '1-10', '11-120', '121+'])
#X_test_imputed['Total_Delay_Bins'].replace(['0', '1-10', '11-120', '121+'], [0, 1, 2, 3], inplace = True)

#df_pred_imputed['Total_Delay_Bins'] = pd.cut(df_pred_imputed['Total_Delay'], bins = [-float('inf'), 0, 10, 120, float('inf')], labels = ['0', '1-10', '11-120', '121+'])
#df_pred_imputed['Total_Delay_Bins'].replace(['0', '1-10', '11-120', '121+'], [0, 1, 2, 3], inplace = True)

In [None]:
X_train_imputed['Arrival_Delay_Bins'].value_counts(), X_test_imputed['Arrival_Delay_Bins'].value_counts(), df_pred_imputed['Arrival_Delay_Bins'].value_counts()

(0    37106
 2    16570
 1    10733
 3     1656
 Name: Arrival_Delay_Bins, dtype: int64,
 0    15815
 2     7221
 1     4611
 3      667
 Name: Arrival_Delay_Bins, dtype: int64,
 0    19850
 2     9097
 1     5750
 3      905
 Name: Arrival_Delay_Bins, dtype: int64)
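
Raw counts are hard to compare across splits of different sizes. A quick sketch of turning them into per-split proportions follows; the counts are copied from the output above, and the `train`, `test` and `pred` column names are just labels chosen for this illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above, indexed by bin code
counts = pd.DataFrame({
    "train": {0: 37106, 1: 10733, 2: 16570, 3: 1656},
    "test": {0: 15815, 1: 4611, 2: 7221, 3: 667},
    "pred": {0: 19850, 1: 5750, 2: 9097, 3: 905},
})

# Divide each column by its own total to get the share of each bin per split
shares = counts / counts.sum()

print(shares.round(3))
```

For these counts, each bin's share differs by less than one percentage point between the three splits, which is in line with the problem statement's note that the data sets come from the same population.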

Creating Travel_Distance Bins

In [None]:
# create 3 bins for Travel_Distance coded 0-2: 0-1000, 1001-3000 and 3001+
distance_edges = [0, 1000, 3000, float('inf')]

X_train_imputed['Travel_Distance_Bins'] = pd.cut(X_train_imputed['Travel_Distance'], bins = distance_edges, labels = [0, 1, 2])

X_test_imputed['Travel_Distance_Bins'] = pd.cut(X_test_imputed['Travel_Distance'], bins = distance_edges, labels = [0, 1, 2])

df_pred_imputed['Travel_Distance_Bins'] = pd.cut(df_pred_imputed['Travel_Distance'], bins = distance_edges, labels = [0, 1, 2])

In [None]:
X_train_imputed['Travel_Distance_Bins'].value_counts(), X_test_imputed['Travel_Distance_Bins'].value_counts(), df_pred_imputed['Travel_Distance_Bins'].value_counts()

(1    44603
 0    11436
 2    10026
 Name: Travel_Distance_Bins, dtype: int64,
 1    18989
 0     5032
 2     4293
 Name: Travel_Distance_Bins, dtype: int64,
 1    23994
 0     6120
 2     5488
 Name: Travel_Distance_Bins, dtype: int64)

In [None]:
df_pred_imputed.head()

| | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | Seat_Comfort | Seat_Class | ... | Online_Support | Ease_of_Online_Booking | Onboard_Service | Legroom | Baggage_Handling | CheckIn_Service | Cleanliness | Online_Boarding | Arrival_Delay_Bins | Travel_Distance_Bins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 36.0 | 1.0 | 1.0 | 532.0 | 0.0 | 0.0 | 3.0 | 1.0 | ... | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 1.0 | 0 | 0 |
| 1 | 1.0 | 0.0 | 21.0 | 1.0 | 1.0 | 1425.0 | 9.0 | 28.0 | 0.0 | 0.0 | ... | 3.0 | 3.0 | 5.0 | 3.0 | 4.0 | 3.0 | 5.0 | 3.0 | 2 | 1 |
| 2 | 0.0 | 1.0 | 60.0 | 1.0 | 1.0 | 2832.0 | 0.0 | 0.0 | 5.0 | 0.0 | ... | 5.0 | 2.0 | 2.0 | 2.0 | 2.0 | 4.0 | 2.0 | 5.0 | 0 | 1 |
| 3 | 1.0 | 1.0 | 29.0 | 0.0 | 0.0 | 1352.0 | 0.0 | 0.0 | 3.0 | 1.0 | ... | 5.0 | 1.0 | 3.0 | 2.0 | 5.0 | 5.0 | 5.0 | 1.0 | 0 | 1 |
| 4 | 0.0 | 0.0 | 18.0 | 1.0 | 1.0 | 1610.0 | 17.0 | 0.0 | 5.0 | 0.0 | ... | 5.0 | 5.0 | 3.714286 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 0 | 1 |
