# Tabular Playground Series - Nov 2021


## 1. Exploratory data analysis

Load the required libraries and open the datasets downloaded from this <a href="https://www.kaggle.com/c/tabular-playground-series-nov-2021/data" target="_blank">link</a>.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

from IPython.display import display
from IPython.display import HTML

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
import warnings

pd.options.display.max_columns = 110
pd.options.display.max_rows = 400


from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import StandardScaler


from tensorflow import keras
from tensorflow.keras.regularizers import l1
from tensorflow.keras.regularizers import l2
from tensorflow.keras.backend import sigmoid

from tensorflow.keras import activations
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import initializers
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import InputLayer
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.layers import LayerNormalization
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.callbacks import EarlyStopping

warnings.filterwarnings('ignore')  


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train = pd.read_csv("/kaggle/input/tabular-playground-series-nov-2021/train.csv")
test = pd.read_csv("/kaggle/input/tabular-playground-series-nov-2021/test.csv")
submission = pd.read_csv("/kaggle/input/tabular-playground-series-nov-2021/sample_submission.csv")
train.set_index("id", inplace=True)
# Check nan values
print("The train has {} features with nan values.".format(list(train.isnull().sum().values > 0).count(True)))
print("The test has {} features with nan values.".format(list(test.isnull().sum().values > 0).count(True)))
print("The sample_submission has {} features with nan values.".format(list(submission.isnull().sum().values > 0).count(True)))

As seen above, this is a classical binary classification task with continuous numeric features (x values) and a binary target (y values, 1 or 0). Formally, only 70 features have nonzero mutual information with the target, and removing the features with zero mutual information should increase accuracy. Experimentally, however, I found that removing them decreases accuracy by 3-6%. The file `train_mutual_clf.csv` with the mutual information values can be downloaded <a href="https://github.com/Vadim-Maklakov/Data-Science/blob/main/07_Kaggle_Tabular_Playground_Series-Nov%202021/train_mutual_clf.csv" target="_blank">here</a>. Next, let's count the outliers.

In [None]:
%%time
def dfoutlsidx(dataframe):
    """
    Find indexes of rows with outlier values, i.e. values less than
    quantile 25% - 1.5 IQR or more than quantile 75% + 1.5 IQR, for the
    continuous features.
    Parameters
    ----------
        dataframe : tested pandas dataframe, the target is the last column.
    Returns
    -------
        list of indexes of rows with outlier values.
    """
    # work on the passed dataframe, not on the global `train`
    df = dataframe.copy()
    outliers = set()
    # all columns except the last one (target)
    features = list(df.columns)[:-1]
    for feature in features:
        quant_25 = df[feature].quantile(0.25)
        quant_75 = df[feature].quantile(0.75)
        delta = 1.5*(quant_75 - quant_25)
        mask = (df[feature] < quant_25 - delta) | (df[feature] > quant_75 + delta)
        outliers.update(df[mask].index)
    return list(outliers)

outls_idx = dfoutlsidx(train)
print("Train dataset contains {:,} rows with outlier values out of {:,} rows.\n\
Share of rows with outliers {:.3f}%".format(len(outls_idx), train.shape[0],
                              len(outls_idx)/train.shape[0]*100.0))

As seen above, almost all rows of the `train` dataset contain outlier values.
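
The same IQR rule can also be checked without a Python loop over features. The sketch below is a minimal, hypothetical variant (the function name `iqr_outlier_rows` and the toy array are assumptions, not part of this notebook) that flags rows holding at least one value outside the 1.5 IQR fences:

```python
import numpy as np


def iqr_outlier_rows(x, k=1.5):
    """Return a boolean mask of rows that hold at least one IQR outlier."""
    x = np.asarray(x, dtype=float)
    # Per-column quartiles (linear interpolation, as in pandas by default)
    q25, q75 = np.percentile(x, [25, 75], axis=0)
    delta = k * (q75 - q25)
    # True where a single value lies outside the fences of its column
    outside = (x < q25 - delta) | (x > q75 + delta)
    # A row counts as an outlier row if any of its values is outside
    return outside.any(axis=1)


# Toy data (assumption): only the last row has an extreme value
toy = np.array([[1.0, 10.0],
                [2.0, 11.0],
                [3.0, 12.0],
                [4.0, 13.0],
                [100.0, 14.0]])
mask = iqr_outlier_rows(toy)
print(mask.sum(), "of", len(toy), "rows contain outliers")
```

With many features, even a small per-feature outlier rate adds up quickly at the row level, which may explain why the share of affected rows is so high.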

## 2. Determine the model

To be honest, I was surprised that DL, like ML, has no clear criteria for building a model: the number of hidden layers, the total number of neurons, and the number of neurons in each layer. You have to build the model empirically, based on rules of thumb and your own experience or imagination. For <a href="https://www.heatonresearch.com/2017/06/01/hidden-layers.html" target="_blank">example:</a>

`I have a few rules of thumb that I use to choose hidden layers. There are many rule-of-thumb methods for determining an acceptable number of neurons to use in the hidden layers, such as the following:
1. The number of hidden neurons should be between the size of the input layer and the size of the output layer.
2. The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
3. The number of hidden neurons should be less than twice the size of the input layer.`
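
For this dataset (100 input features, 1 output), the three rules quoted above work out to concrete numbers. The helper below is only an illustrative sketch; `hidden_neuron_rules` is a hypothetical name, not something from the quoted article or from AutoKeras:

```python
def hidden_neuron_rules(n_inputs, n_outputs):
    """Evaluate the three rule-of-thumb values for the hidden neuron count."""
    return {
        # Rule 1: somewhere between the output size and the input size
        "between_output_and_input": (n_outputs, n_inputs),
        # Rule 2: 2/3 of the input size plus the output size
        "two_thirds_rule": round(2 * n_inputs / 3 + n_outputs),
        # Rule 3: strictly less than twice the input size
        "upper_bound": 2 * n_inputs,
    }


rules = hidden_neuron_rules(n_inputs=100, n_outputs=1)
print(rules)
```

The 32-unit hidden layers used later in this notebook fall inside the first range and stay well below the other two values.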

In the first iteration, I tried xgboost, random forest, and SVM from scikit-learn to solve this binary classification problem. When the cross-validation pipeline with these three algorithms on the train dataset had not finished after more than 12 hours, I decided to use TensorFlow. The SVM had the highest accuracy, about 0.64 (train:test ratio = 1:4, 3 trials).

In the second iteration, I started with TensorFlow and tried to write my own functions to determine the optimal number of neurons, hidden layers, activation functions, and so on. I tried to use KerasClassifier, but I couldn't hook the validation loss metric into it. As a result, I got a pile of spaghetti code with monstrous time costs. Realizing that I was sinking into an abyss of writing time-consuming functions, I searched the Internet and found <a href="https://autokeras.com/" target="_blank">AutoKeras</a>.

With the help of AutoKeras and Keras Tuner, I created 3 models in about five hours:
1. `automl_clf`, a standard binary classifier with default settings from AutoKeras.
2. `automl_tuner`, which is `automl_clf` regularized with l1 and l2 and renamed.
3. `automl_regr`, a standard regression model with a linear output head and default settings from AutoKeras.

The code for finding all three models can be found at this <a href="https://github.com/Vadim-Maklakov/Data-Science/blob/main/07_Kaggle_Tabular_Playground_Series-Nov%202021/automl_hyper_tuning" target="_blank">link</a>.

After the search finished, these models were exported to `json` or dumped to a text file using the `get_config` method and then retyped manually. In these cases, an acceptable calculation speed for finding models and hyperparameters is obtained with batch_size = 1024 - 2048. Experimentally, I found that only StandardScaler gives the maximum `validation_accuracy` and the minimum `validation_loss`.



## 3. Train models

### 3.1 Train the `automl_clf` model

Define the required functions:

In [None]:
%%time
def dfsplit(dataframe, scaler=None):
    """
    Split dataframe into x_train, x_test, y_train, y_test with a 4:1 ratio.
    Possible scale/transform options for x features:
    1. None - do not scale or transform
    2. "ptbc" - PowerTransformer by Box-Cox
    3. "ptyj" - PowerTransformer by Yeo-Johnson
    4. "rb" - RobustScaler
    5. "ss" - StandardScaler
    To prevent data leakage, a separate scaler/transformer instance is
    fitted on each of the train and test parts.
    Parameters
    ----------
        dataframe : pandas dataframe with numeric values of features,
        the target is the last column.
        scaler : None or str, optional. The default is None.
    Returns
    -------
        x_train, x_test, y_train, y_test - numpy arrays.
    """
    # Validate the scaler name before doing any work
    scalers = [None, "ptbc", "ptyj", "rb", "ss"]
    if scaler not in scalers:
        raise ValueError("Value error for 'scaler'! Enter None or "
                         "'ptbc' or 'ptyj' or 'rb' or 'ss' value for scaler!")
    df = dataframe.copy()
    # split dataframe for train and test x and y nparrays
    x_all, y_all = df.iloc[:, :-1].values, np.ravel(df.iloc[:, [-1]].values)
    x_train, x_test, y_train, y_test = train_test_split(x_all, y_all,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y_all)
    # Transform or scale; each part gets its own freshly fitted instance
    if scaler == "ptbc":
        # Box-Cox needs strictly positive input, so rescale to [1, 2] first
        x_train = PowerTransformer(method='box-cox').fit_transform(
            MinMaxScaler(feature_range=(1, 2)).fit_transform(x_train))
        x_test = PowerTransformer(method='box-cox').fit_transform(
            MinMaxScaler(feature_range=(1, 2)).fit_transform(x_test))
    elif scaler == "ptyj":
        x_train = PowerTransformer().fit_transform(x_train)
        x_test = PowerTransformer().fit_transform(x_test)
    elif scaler == "rb":
        x_train = RobustScaler().fit_transform(x_train)
        x_test = RobustScaler().fit_transform(x_test)
    elif scaler == "ss":
        x_train = StandardScaler().fit_transform(x_train)
        x_test = StandardScaler().fit_transform(x_test)
    return x_train, x_test, y_train, y_test


def df_transform(dataframe, scaler=None, y=True):
    """
    Scale/transform all x features of the dataframe without splitting it.
    Possible scale/transform options for x features:
    1. None - do not scale or transform
    2. "ptbc" - PowerTransformer by Box-Cox
    3. "ptyj" - PowerTransformer by Yeo-Johnson
    4. "rb" - RobustScaler
    5. "ss" - StandardScaler
    Parameters
    ----------
        dataframe : pandas dataframe with numeric values of features.
        scaler : None or str, optional. The default is None.
        y : bool, True if the last column of dataframe is the target,
        optional. The default is True.
    Returns
    -------
        If y==True: x_all, y_all - numpy arrays.
        If y==False: x_all - numpy array.
    """
    # Validate arguments before doing any work
    scalers = [None, "ptbc", "ptyj", "rb", "ss"]
    if scaler not in scalers:
        raise ValueError("Value error for 'scaler'! Enter None or "
                         "'ptbc' or 'ptyj' or 'rb' or 'ss' value for scaler!")
    if y not in [True, False]:
        raise ValueError("Y value error! Enter True or False!")
    df = dataframe.copy()
    # separate x and y nparrays
    if y:
        x_all, y_all = df.iloc[:, :-1].values, np.ravel(df.iloc[:, [-1]].values)
    else:
        x_all = df.iloc[:, :].values
    # Transform or scale x_all
    if scaler == "ptbc":
        # Box-Cox needs strictly positive input, so rescale to [1, 2] first
        x_all = PowerTransformer(method='box-cox').fit_transform(
            MinMaxScaler(feature_range=(1, 2)).fit_transform(x_all))
    elif scaler == "ptyj":
        x_all = PowerTransformer().fit_transform(x_all)
    elif scaler == "rb":
        x_all = RobustScaler().fit_transform(x_all)
    elif scaler == "ss":
        x_all = StandardScaler().fit_transform(x_all)
    if y:
        return x_all, y_all
    return x_all
    
    
def automl_clf(shape_x, learn_rate=0.01):
    """
    Model created manually from the json file of the auto-keras model.
    Parameters
    ----------
        shape_x : integer, number of features (columns) in the dataset.
        learn_rate : float, value for learning_rate of optimizer.
        Default value of learn_rate = 0.01.
    Returns
    -------
       	 model : the keras model
    """
    model = Sequential()
    # 0.Input
    model.add(InputLayer(input_shape=(shape_x,), dtype='float64', name="input_1"))
    # Normalization input == StandardScaler
    model.add(Normalization(name='normalization'))
    
    # Hidden layer 1
    # 1.1 Initializer for first hidden layer input linear
    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_1"))
    # 1.2 Activation for first hidden layer
    model.add(layers.Activation(activations.relu, name="relu_1"))
    model.add(layers.Dropout(.25))
    
    # Hidden layer 2
    # 2.1 Initializer for second hidden layer input linear
    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_2"))
    # 2.2 Activation for second hidden layer
    model.add(layers.Activation(activations.relu, name="relu_2"))
    model.add(layers.Dropout(.25))
    
    # Hidden layer 3
    # 3.1 Initializer for third hidden layer input linear
    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_3"))
    # 3.2 Activation for third hidden layer
    model.add(layers.Activation(activations.relu, name="relu_3"))
    model.add(layers.Dropout(.25))
    
    # 4. Final sigmoid
    model.add(layers.Dense(units=1, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_4"))
    model.add(layers.Activation(activations.sigmoid, name="sigmoid_1"))
    
    model.compile(loss='binary_crossentropy', 
              optimizer = tf.keras.optimizers.Adam(learning_rate=learn_rate),
              metrics=['accuracy',tf.keras.metrics.AUC(name='auc')])
    return model


def train_model(model, dataframe, batch_sz=16384, stop_no=30, scaler=None,
                estimator="clf"):
    """
    Scale / transform numeric features, then fit and train the model.
    Parameters
    ----------
        model : keras model for fitting data.
        dataframe : pandas dataframe with numeric values of x and y.
        batch_sz : integer, size of the batch, optional. The default is 16384.
        stop_no : integer, patience of the EarlyStopping callback (number of
        epochs without val_loss improvement), optional. The default is 30.
        scaler : None or str, available values - None, "ptbc", "ptyj",
        "rb", "ss", optional. Default is None.
        estimator : str, "clf" or "regr", optional. Default is "clf".
    Returns
    -------
        model : keras fitted and trained model
        hist_stat : pandas dataframe with values of metrics and epochs for
        the model.
    """
    # Validate estimator before the long training run
    estimators = ["clf", "regr"]
    if estimator not in estimators:
        raise ValueError("Estimator value error! Enter 'clf' or 'regr'!")
    callbacks = [EarlyStopping(monitor='val_loss',mode='min',
                               patience=stop_no,restore_best_weights=True)]
    df = dataframe.copy()
    # split and scale or transform features
    x_train, x_test, y_train, y_test = dfsplit(df, scaler=scaler)
    # Fit and train model
    history = model.fit(x_train, y_train,
                        batch_size=batch_sz,
                        epochs=10000,
                        validation_data=(x_test,y_test),
                        callbacks=callbacks,
                        verbose=0)
    # Export history to dataframe, best epoch first
    hist_stat = pd.DataFrame(history.history)
    hist_stat["epochs"] = np.array(list(hist_stat.index))+1
    if estimator == "clf":
        hist_stat.sort_values("val_accuracy", ascending=False, inplace=True)
    else:
        hist_stat.sort_values("val_mean_squared_error", ascending=True, inplace=True)
    hist_stat.reset_index(drop=True, inplace=True)
    return model, hist_stat


# Get model and model history
automl_clf_ss, automl_stat_clf_ss = train_model(automl_clf(train.shape[1]-1), 
                                          train, batch_sz=2048, scaler='ss')
automl_stat_clf_ss

### 3.2 Train the `automl_tuner` model

In [None]:
%%time
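def automl_tuner(shape_x):
    """
    Classifier model `automl_clf` regularized with l1 and l2.
    Parameters
    ----------
        shape_x : integer, number of features (columns) in the dataset.
    Returns: model, the keras model.
    """
    learning_rate = 0.0012589254117941675
    l1_kernel=0.0023713737056616554
    l2_bias = 0.0007943282347242813
    l1_val = 0.0001258925411794166
    
    model = Sequential()
    # 0.Input
    model.add(InputLayer(input_shape=(shape_x,), dtype='float64', name="input_1"))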
    model.add(Normalization(name='normalization'))
   
    # Hidden layer 1
    # 1.1 Initializer for first hidden layer input linear
    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_1"))
    # l1 regularization
    model.add(layers.Dense(
        units=32, kernel_regularizer = tf.keras.regularizers.l1(l1_kernel),
        bias_regularizer=tf.keras.regularizers.l2(l2_bias),
        activity_regularizer=tf.keras.regularizers.l1(l1_val)))
    # 1.2 Activation for first hidden layer
    model.add(layers.Activation(activations.relu, name="relu_1"))
    model.add(layers.Dropout(.25))
    
    # Hidden layer 2
    # 2.1 Initializer for second hidden layer input linear
    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_2"))
    # l1 regularization
    model.add(layers.Dense(
        units=32, kernel_regularizer = tf.keras.regularizers.l1(l1_kernel),
        bias_regularizer=tf.keras.regularizers.l2(l2_bias),
        activity_regularizer=tf.keras.regularizers.l1(l1_val)))
    # 2.2 Activation for second hidden layer
    model.add(layers.Activation(activations.relu, name="relu_2"))
    model.add(layers.Dropout(.25))
    
    # Hidden layer 3
    # 3.1 Initializer for third hidden layer input linear
    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_3"))
    # l1 regularization
    model.add(layers.Dense(
        units=32, kernel_regularizer = tf.keras.regularizers.l1(l1_kernel),
        bias_regularizer=tf.keras.regularizers.l2(l2_bias),
        activity_regularizer=tf.keras.regularizers.l1(l1_val)))
    # 3.2 Activation for third hidden layer
    model.add(layers.Activation(activations.relu, name="relu_3"))
    model.add(layers.Dropout(.25))
    
    # 4. Final sigmoid
    model.add(layers.Dense(units=1, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_4"))
    model.add(layers.Activation(activations.sigmoid, name="sigmoid_1"))
    
    model.compile(loss='binary_crossentropy', 
              optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate),
              metrics=['accuracy',tf.keras.metrics.AUC(name='auc')])
    return model

automl_clf_ss_tuner, automl_stat_clf_ss_tuner = train_model(automl_tuner(train.shape[1]-1), 
                                          train, batch_sz=2048, scaler='ss')
automl_stat_clf_ss_tuner

As seen above, the regularized classifier does not radically improve validation accuracy: the gain is within statistical error, while the number of epochs roughly doubles compared with the non-regularized classifier.
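
For reference, the kernel and bias penalties that `automl_tuner` adds to the loss can be written out by hand. The sketch below uses NumPy rather than Keras, with a made-up weight matrix and bias vector. It assumes the usual definitions (l1 penalty = coefficient times the sum of absolute values, l2 penalty = coefficient times the sum of squares), which is how the Keras `l1` and `l2` regularizers are documented:

```python
import numpy as np


def l1_penalty(values, coef):
    """l1 penalty: coef * sum(|w|)."""
    return coef * np.sum(np.abs(values))


def l2_penalty(values, coef):
    """l2 penalty: coef * sum(w^2)."""
    return coef * np.sum(np.square(values))


# Made-up parameters of a tiny dense layer (assumption, for illustration)
kernel = np.array([[0.5, -1.0],
                   [2.0, 0.0]])
bias = np.array([0.1, -0.2])

# Coefficients taken from the `automl_tuner` cell
l1_kernel = 0.0023713737056616554
l2_bias = 0.0007943282347242813

# Extra term added to the binary cross-entropy by this one layer
extra_loss = l1_penalty(kernel, l1_kernel) + l2_penalty(bias, l2_bias)
print(extra_loss)
```

Because these terms are added to the loss at every step, the optimizer has to balance them against the classification error, which is consistent with the longer training observed above.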

### 3.3 Train the `automl_regr` model

In [None]:
%%time
def automl_regr(shape_x, learn_rate=0.001):
    """
    Regression model created manually from the json file of the auto-keras model.
    Parameters
    ----------
        shape_x : integer, number of features (columns) in the dataset.
        learn_rate : float, value for learning_rate of optimizer.
        Default value of learn_rate = 0.001.
    Returns
    -------
       	 model : the keras model
    """
    model = Sequential()
    # 0.Input
    model.add(InputLayer(input_shape=(shape_x,), dtype='float64', name="input_1"))
    # Normalization input
    model.add(Normalization(name='normalization'))
    
    # Hidden layer 1
    # 1.1 Initializer for first hidden layer input linear
    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_1"))
    # 1.2 Activation for first hidden layer
    model.add(layers.Activation(activations.relu, name="relu_1"))
    
    # Hidden layer 2
    # 2.1 Initializer for second hidden layer input linear
    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_2"))
    # 2.2 Activation for second hidden layer
    model.add(layers.Activation(activations.relu, name="relu_2"))
    model.add(layers.Dropout(.25))
    
    # 3. Final linear 
    model.add(layers.Dense(units=1, kernel_initializer="GlorotUniform",
        bias_initializer='zeros', name="layer_3"))
    model.add(layers.Activation(activations.linear, name="regression_head_1"))
    
    model.compile(loss='mean_squared_error', 
              optimizer = tf.keras.optimizers.Adam(learning_rate=learn_rate),
              metrics=['mean_squared_error'])
    return model


automl_regr_ss, automl_regr_ss_stat = train_model(automl_regr(train.shape[1]-1), 
                                              train, batch_sz=2048, scaler='ss',
                                              estimator="regr")
automl_regr_ss_stat

## 4. Determine the best estimator

In [None]:
%%time
# Transform and divide x and y for train dataset
x_all, y_all = df_transform(train, scaler="ss")

# Predict y_all for all models  and select best estimator using accuracy metric
y_pred_automl_clf = automl_clf_ss.predict(x_all, batch_size=2048, verbose=1)
y_pred_automl_clf_conv = np.where(y_pred_automl_clf < 0.5, 0, 1)


y_pred_automl_clf_tuner = automl_clf_ss_tuner.predict(x_all, batch_size=2048, 
                                                      verbose=1)
y_pred_automl_clf_tuner_conv = np.where(y_pred_automl_clf_tuner < 0.5, 0, 1)


y_pred_automl_regr_ss = automl_regr_ss.predict(x_all, batch_size=2048, 
                                                      verbose=1)
y_pred_automl_regr_ss_conv = np.where(y_pred_automl_regr_ss < 0.5, 0, 1)

Compare accuracy:

In [None]:
%%time
accuracy_automl_clf = accuracy_score(y_all, y_pred_automl_clf_conv)
accuracy_automl_clf_tuner = accuracy_score(y_all, y_pred_automl_clf_tuner_conv)
accuracy_automl_regr = accuracy_score(y_all, y_pred_automl_regr_ss_conv)
print(f"Accuracy for `automl_clf` model: {accuracy_automl_clf:.4f}.")
print(f"Accuracy for `automl_clf_tuner` model: {accuracy_automl_clf_tuner:.4f}.")
print(f"Accuracy for `automl_regr` model: {accuracy_automl_regr:.4f}.")

As seen above, the `automl_regr` model has the best accuracy.
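
The comparison above reduces every model, including the regression one, to hard 0/1 labels with a 0.5 cut-off before scoring. A minimal NumPy sketch of that step on made-up predictions (the arrays are illustrative, not notebook output):

```python
import numpy as np

# Made-up ground truth and raw model outputs shaped like Keras predictions
y_true = np.array([0, 1, 1, 0, 1])
y_raw = np.array([[0.2], [0.7], [0.4], [0.5], [1.3]])

# Same rule as in the notebook: below 0.5 becomes 0, everything else 1
y_label = np.where(y_raw < 0.5, 0, 1).ravel()

# Accuracy is the share of labels that match the ground truth
accuracy = np.mean(y_label == y_true)
print(y_label, accuracy)
```

A regression output is not bounded to the [0, 1] interval, so values such as 1.3 are simply mapped to class 1 by this rule.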

## 5. Predict target for test dataset using `automl_regr` model

In [None]:
# Set id as index for the test dataset
test.set_index("id", inplace=True)

# Set id as index for submission
# submission.set_index("id", inplace=True)

# Convert test x values  with StandardScaler
test_ss = df_transform(test, scaler="ss", y=False)

# Predict target and convert to binary format
predict_target = automl_regr_ss.predict(test_ss, batch_size=2048, 
                                                    verbose=1)
predict_target_conv = np.where(predict_target < 0.5, 0, 1)

# Fill `target` column
submission["target"] = predict_target_conv

# Save predict
submission.to_csv("submission_pred.csv",  index=False)

submission.head(18)

## 6. Conclusions


1. AutoKeras gives a fully workable model and may in the future eliminate the time-consuming manual work of building an optimal model and selecting hyperparameters.


2. In this case, the l1/l2 regularization of the classifier model did not bring any visible improvements and only increased the execution time. It turned out that a simple regression has accuracy comparable with the l1/l2-regularized classifier. l1/l2 regularization isn't always a silver bullet. A possible reason is the huge number of outliers.


3. I was pleased with the speed of the old GTX 1050 graphics card with 2 GB of RAM on a dataset containing 600K rows and 100 columns (60M cells), which takes several minutes to process and cross-validate. As I wrote earlier, when I tried to choose the optimal algorithm from classic ML among boosting, random forest, and SVM, cross-validation took more than 12 hours. ML is dead, long live DL: using ML is justified on small datasets with several thousand rows, and as the amount of data grows, ML loses to DL in speed. Advice for those who do not have a modern video card: use an older TF version. For example, the GTX 1050 with 2 GB of RAM handles this dataset correctly with TF 2.5, while out-of-memory problems begin with later TF versions.


4. The dataset itself turned out to be a tough nut to crack. Only StandardScaler gave high accuracy; all other transformations from classical ML (MinMaxScaler, RobustScaler, QuantileTransformer, PowerTransformer, KBinsDiscretizer, Normalizer) gave lower accuracy and longer execution times. Also, the outlier removal algorithms from scikit-learn don't work with this dataset.


5. I avoided data leakage everywhere, but fitting the standard scaler on the entire train dataset can increase accuracy by roughly 0.005.
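
For comparison with point 5, a common leakage-free convention is to compute the scaling statistics on the training part only and reuse them for the held-out part. The sketch below shows this with plain NumPy on random data. It illustrates that convention and is not the scaling used in `dfsplit`; all names and data in it are hypothetical:

```python
import numpy as np

# Random stand-in data (assumption): 800 train rows, 200 held-out rows
rng = np.random.default_rng(42)
x_train = rng.normal(loc=5.0, scale=2.0, size=(800, 3))
x_test = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Statistics come from the training part only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

# The same mean and std are applied to both parts
x_train_ss = (x_train - mean) / std
x_test_ss = (x_test - mean) / std

print(x_train_ss.mean(axis=0).round(6), x_test_ss.mean(axis=0).round(3))
```

The training part ends up with exactly zero mean and unit variance, while the held-out part is only approximately standardized, since it never contributes to the statistics.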




Created on March 08, 2022

@author: Vadim Maklakov, used some ideas from public Internet resources.

© 3-clause BSD License

Software environment on local host: Debian 11, Python 3.8.12, TensorFlow 2.5.1 for code above, TensorFlow 2.8 for defining model with Auto-Keras.

See the required installed and imported Python modules in cell No. 1.