<big>For classical machine learning algorithms, we often use the most popular Python library, Scikit-learn. With Scikit-learn you can fit models and search for optimal parameters, but this can sometimes take hours.</big><br><br>

<big>I want to show you how to use the Scikit-learn library and get results faster without changing the code. To do this, we will use another Python library, <strong><a href='https://github.com/intel/scikit-learn-intelex'>Intel® Extension for Scikit-learn*</a></strong>.</big><br><br>

<big>I will show you how to <strong>speed up your kernel more than 17 times</strong> without changing your code!</big>

In [None]:
import pandas as pd
import numpy as np
from timeit import default_timer as timer
from IPython.display import HTML
from sklearn.model_selection import train_test_split

<h2>Importing data</h2>

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
sample_sub = pd.read_csv('../input/tabular-playground-series-nov-2021/sample_submission.csv')

In [None]:
train.head()

In [None]:
test.head()

<h2>Preprocessing</h2>

<big>Split the data into features and target.</big>

In [None]:
x = train.drop(['id', 'target'], axis=1)
y = train['target']
x_test = test.drop(['id'], axis=1)

<big>Split the data into train and validation sets.</big>

In [None]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.1, random_state=0)

<big>Normalize the data.</big>

In [None]:
from sklearn.preprocessing import StandardScaler

scaler_x = StandardScaler().fit(x_train)
x_train = scaler_x.transform(x_train)
x_val = scaler_x.transform(x_val)
x_test = scaler_x.transform(x_test)
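
<big>As a rough illustration of what the scaling step does, the sketch below reproduces <code>StandardScaler</code> by hand on a tiny made-up array. The statistics come from the training rows only and are then reused for the validation rows. The arrays <code>demo_train</code> and <code>demo_val</code> are hypothetical and are not part of the competition data.</big>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, used only to illustrate the transformation.
demo_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
demo_val = np.array([[2.0, 30.0]])

demo_scaler = StandardScaler().fit(demo_train)

# StandardScaler subtracts the per-column mean and divides by the
# per-column population standard deviation learned from the training data.
mean = demo_train.mean(axis=0)
std = demo_train.std(axis=0)  # ddof=0, matching StandardScaler
manual_val = (demo_val - mean) / std

scaled_val = demo_scaler.transform(demo_val)
print(np.allclose(scaled_val, manual_val))
```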

<h2>Installing Intel® Extension for Scikit-learn</h2>

<big>Use Intel® Extension for Scikit-learn* to speed up Scikit-learn estimators.</big>

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off

<big>Patch the original Scikit-learn.</big>

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

<h2>Using Optuna to select parameters for the Logistic Regression algorithm</h2><br><br>
<big>Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.</big><br><br>
<big>We adjust hyperparameters for the best result.</big><br><br>
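<big>Parameters of interest:</big><br>
<ul>
<li><big><code>C</code> - Inverse of the regularization strength; smaller values mean stronger regularization.</big></li><br>
<li><big><code>solver</code> - Algorithm to use in the optimization problem. The objective below does not set it, so the Scikit-learn default is used and only <code>C</code> is searched.</big></li><br>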
</ul>
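
<big>To make the role of <code>C</code> more concrete, here is a minimal sketch on synthetic data (not the competition data). It fits the default L2-penalized model with a small and a large <code>C</code> and compares the size of the learned weights. The names <code>X_demo</code>, <code>y_demo</code> and <code>coef_norm</code> are hypothetical helpers introduced only for this illustration.</big>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, used only to illustrate the effect of C.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

def coef_norm(C):
    """Fit a logistic regression with the given C and return the norm of its weights."""
    model = LogisticRegression(C=C, random_state=0, max_iter=1000).fit(X_demo, y_demo)
    return float(np.linalg.norm(model.coef_))

# Smaller C means stronger regularization, so the weights shrink toward zero.
norm_strong = coef_norm(0.001)
norm_weak = coef_norm(100.0)
print(f"C=0.001 -> {norm_strong:.3f}, C=100 -> {norm_weak:.3f}")
```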

In [None]:
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

In [None]:
def objective(trial):
    # Search space: only the inverse regularization strength C is tuned.
    params = {
        'C': trial.suggest_float('C', 1e-9, 1.0),
        'random_state': 0,
        'n_jobs': -1,
    }
    model = LogisticRegression(**params).fit(x_train, y_train)
    # Score the validation set by ROC AUC on the positive-class probabilities.
    y_pred = model.predict_proba(x_val)[:, 1]
    fpr, tpr, _ = roc_curve(y_val, y_pred)
    score = auc(fpr, tpr)
    return score
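
<big>The objective computes ROC AUC in two steps, with <code>roc_curve</code> followed by <code>auc</code>. As a small sanity check, the sketch below compares that with Scikit-learn's one-step <code>roc_auc_score</code> on made-up labels and scores. The arrays <code>y_true_demo</code> and <code>y_score_demo</code> are hypothetical and only serve this comparison.</big>

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and scores, used only for the comparison.
rng = np.random.default_rng(0)
y_true_demo = rng.integers(0, 2, size=200)
y_score_demo = rng.random(200)

# Two-step computation, as in the objective function.
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, y_score_demo)
auc_two_step = auc(fpr_demo, tpr_demo)

# One-step computation on the same inputs.
auc_one_step = roc_auc_score(y_true_demo, y_score_demo)
print(np.isclose(auc_two_step, auc_one_step))
```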

<big><strong>Select parameters</strong></big>

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="maximize",
                            pruner=optuna.pruners.HyperbandPruner())

<big>Let's see the execution time with Intel® Extension for Scikit-learn.</big>

In [None]:
start = timer()
study.optimize(objective, n_trials=20)
select_params_opt = timer() - start
f"Intel® Extension for Scikit-learn selection time: {select_params_opt:.2f} s"
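
<big>The start/stop timing pattern is repeated several times in this notebook. One possible way to package it, shown here only as a sketch, is a small context manager. The names <code>timed</code> and <code>timings</code> are hypothetical, and the loop inside the <code>with</code> block is a stand-in for a call such as <code>study.optimize(...)</code>.</big>

```python
from contextlib import contextmanager
from timeit import default_timer as timer

@contextmanager
def timed(results, key):
    """Store the wall-clock duration of the enclosed block in results[key]."""
    start = timer()
    try:
        yield
    finally:
        results[key] = timer() - start

timings = {}
with timed(timings, "demo"):
    total = sum(i * i for i in range(100_000))  # stand-in for the real workload
print(f"demo time: {timings['demo']:.4f} s")
```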

<h2>Training the model with the selected parameters</h2>

In [None]:
x_full = np.concatenate([x_train, x_val])
y_full = np.concatenate([y_train, y_val])

In [None]:
start = timer()
final_model = LogisticRegression(**study.best_params, random_state=0, n_jobs=-1).fit(x_full, y_full)
train_opt = timer() - start
f"Intel® Extension for Scikit-learn train final model time: {train_opt:.2f} s"

<h2>Prediction</h2>

<big>Predict and save the results in 'submission.csv'.</big>

In [None]:
y_pred = final_model.predict_proba(x_test)[:, 1]
sample_sub['target'] = y_pred
sample_sub.to_csv('submission.csv', index=False)
sample_sub.head()

<h2>Now we use the same algorithms with the original Scikit-learn</h2>
<big>Let’s run the same code with the original Scikit-learn and compare its execution time with that of the version patched by Intel® Extension for Scikit-learn.</big>

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [None]:
from sklearn.linear_model import LogisticRegression
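
<big>The estimator is imported again after unpatching so that the name refers to the stock class. A quick way to see which implementation a class comes from is to look at its <code>__module__</code>, as in the sketch below. The helper <code>implementation_of</code> is hypothetical, and the exact package name reported while the patch is active is an assumption about the extension's internals, not something this notebook confirms.</big>

```python
from sklearn.linear_model import LogisticRegression

def implementation_of(estimator_cls):
    """Return the top-level package that defines the given class."""
    return estimator_cls.__module__.split(".")[0]

# Without the patch this should report 'sklearn'. With the patch active it is
# expected to name a different package (an assumption, see the note above).
print(implementation_of(LogisticRegression))
```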

<big><strong>Select parameters.</strong></big>

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="maximize",
                            pruner=optuna.pruners.HyperbandPruner())

<big>Let's see the execution time without the patch.</big>

In [None]:
start = timer()
study.optimize(objective, n_trials=20)
select_params_original = timer() - start
f"Original Scikit-learn selection time: {select_params_original:.2f} s"

In [None]:
start = timer()
final_model = LogisticRegression(**study.best_params, random_state=0, n_jobs=-1).fit(x_full, y_full)
train_original = timer() - start
f"Original Scikit-learn train final model time: {train_original:.2f} s"

In [None]:
HTML(f'<h2>Selecting parameters speedup: {(select_params_original/select_params_opt):.2f}x</h2>'
     f'(from {select_params_original:.2f} seconds to {select_params_opt:.2f} seconds)'
     f'<h2>Training final model speedup: {(train_original/train_opt):.2f}x</h2>'
     f'(from {train_original:.2f} seconds to {train_opt:.2f} seconds)')

<h2>Conclusions</h2>
<big>We can see that even a single classical machine learning algorithm can give you a pretty high score. We also used the well-known libraries Scikit-learn and Optuna, as well as the increasingly popular Intel® Extension for Scikit-learn. Note that Intel® Extension for Scikit-learn gives you the opportunity to:</big>

* <big>Use your Scikit-learn code for training and inference without modification.</big>
* <big>Speed up the parameter selection and training stages.</big>
* <big>Get predictions of similar quality.</big>