<big>For classical machine learning algorithms, we often use the most popular Python library, Scikit-learn. With Scikit-learn you can fit models and search for optimal parameters, but this can sometimes take hours.</big><br><br>

<big>I want to show you how to use the Scikit-learn library and get results faster without changing the code. To do this, we will use another Python library, <a href='https://github.com/intel/scikit-learn-intelex'>Intel® Extension for Scikit-learn*</a>.</big><br><br>

<big>I will show you how to <strong>speed up your kernel by more than 2x</strong> without changing your code!</big>

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

<h2>Importing data</h2>

In [None]:
train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')

<big>Let's look at the data.</big>

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.shape, test.shape

# Preprocessing

<big>Let's add a feature with the number of missing values in each row.</big>

In [None]:
train['n_nan'] = train.isnull().sum(axis=1)
test['n_nan'] = test.isnull().sum(axis=1)
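
<big>To see what this feature contains, here is a minimal sketch on a made-up three-row frame (the <code>toy</code> name and its values are illustrative assumptions, not competition data).</big>

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (made-up data, not the competition set)
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0],
    "f2": [np.nan, np.nan, 6.0],
})

# isnull() marks each missing cell; sum(axis=1) counts the marks per row
toy["n_nan"] = toy.isnull().sum(axis=1)
print(toy["n_nan"].tolist())
```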

<big>Fill missing feature values with the column mean.</big>

In [None]:
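# Skip the last two columns (expected to be the 'claim' target and the 'n_nan' counter)
FEATURES = train.columns[:-2]
for col in FEATURES:
    avg_val_train = train[col].mean()
    avg_val_test = test[col].mean()
    # Assign the result back instead of using inplace=True on a column selection,
    # which newer pandas versions warn about
    train[col] = train[col].fillna(avg_val_train)
    test[col] = test[col].fillna(avg_val_test)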
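
<big>The cell above fills each set with its own mean. A common variation, which this notebook does not use, is to compute the means on the training data only and reuse them for the test data. The sketch below shows that variant on made-up frames; <code>toy_train</code>, <code>toy_test</code> and their values are assumptions for illustration.</big>

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for the real train/test data
toy_train = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [4.0, 6.0, np.nan]})
toy_test = pd.DataFrame({"f1": [np.nan, 10.0], "f2": [7.0, np.nan]})

# Column means computed on the training frame only (NaN values are skipped)
train_means = toy_train.mean()

# fillna with a Series fills each column with the value under the same label
filled_train = toy_train.fillna(train_means)
filled_test = toy_test.fillna(train_means)
print(filled_test)
```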

<big>Split the data into training and validation sets.</big>

In [None]:
X = train.drop(['claim'], axis=1)
y = train['claim']

In [None]:
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)
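
<big>As a quick illustration of what <code>test_size=0.1</code> and <code>random_state=0</code> do, here is a small sketch on 100 made-up samples: 10% of the rows go to the validation part, and a fixed seed makes the split reproducible. The <code>toy_x</code> array is an assumption for illustration.</big>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples identified by their index
toy_x = np.arange(100)

# Same arguments as in the notebook: hold out 10% with a fixed seed
part_a, part_b = train_test_split(toy_x, test_size=0.1, random_state=0)

# Repeating the call with the same seed gives the same split
part_a_again, part_b_again = train_test_split(toy_x, test_size=0.1, random_state=0)

print(len(part_a), len(part_b))
```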

<big>Normalize the data.</big>

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler_x = MinMaxScaler()

In [None]:
scaler_x.fit(x_train)
x_train = scaler_x.transform(x_train)
x_val = scaler_x.transform(x_val)
x_test = scaler_x.transform(test)
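
<big>The scaler learns the minimum and maximum of each column from the data passed to <code>fit</code> and then applies <code>(x - min) / (max - min)</code>. The sketch below checks this by hand on made-up numbers (the <code>fit_data</code> and <code>new_data</code> arrays are assumptions). It also shows that values outside the fitted range map outside [0, 1], which can happen for the validation and test sets.</big>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data: the scaler is fitted on one array, applied to another
fit_data = np.array([[0.0], [5.0], [10.0]])
new_data = np.array([[2.5], [12.0]])

scaler = MinMaxScaler().fit(fit_data)
scaled = scaler.transform(new_data)

# The same transformation written out by hand
col_min = fit_data.min(axis=0)
col_max = fit_data.max(axis=0)
manual = (new_data - col_min) / (col_max - col_min)

print(scaled.ravel())
```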

<h2>Installing Intel(R) Extension for Scikit-learn</h2>

<big>Use Intel® Extension for Scikit-learn* to speed up Scikit-learn estimators.</big>

In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log

<big>Patch the original Scikit-learn. Call <code>patch_sklearn()</code> before importing the estimators you want to accelerate.</big>

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
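
<big>If the same code also has to run where the extension is not installed, one option is to guard the patch with a fallback. This is only a sketch of that idea: the <code>patched</code> flag and the <code>try/except</code> wrapper are assumptions added for illustration and are not part of the notebook.</big>

```python
# Try to enable the accelerated implementations; fall back to stock Scikit-learn
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
    patched = True
except ImportError:
    # scikit-learn-intelex is not installed, so nothing is patched
    patched = False

# Import estimators only after the patch attempt so the patched classes are picked up
from sklearn.linear_model import Ridge

print("Intel Extension active:", patched)
```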

# Using optuna to select parameters for Ridge algorithm
<big>Ridge regression adds a penalty term to the ordinary least-squares loss: the L2 term, equal to the sum of the squared coefficients. A coefficient lambda controls the strength of that penalty; Scikit-learn exposes it as the <code>alpha</code> parameter.</big><br><br>
<big>We adjust hyperparameters for the best result.</big><br><br>

<big>Parameter that we select:</big><br>
<big>* <code>alpha</code> - Regularization parameter. Regularization improves the solution and reduces the variance of estimates.<br> </big>
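
<big>To make the effect of <code>alpha</code> concrete, here is a minimal NumPy sketch of the closed-form ridge solution on made-up data. It assumes no intercept for brevity, and the helper <code>ridge_weights</code> is a hypothetical name, not a Scikit-learn function. A larger <code>alpha</code> shrinks the coefficient vector.</big>

```python
import numpy as np

# Made-up regression problem: 50 samples, 3 features, known true weights plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge_weights(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 (no intercept, for brevity)."""
    n_features = X.shape[1]
    # Solve (X^T X + alpha * I) w = X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Norm of the coefficient vector for increasing penalty strength
norms = [np.linalg.norm(ridge_weights(X_toy, y_toy, a)) for a in (0.0, 1.0, 100.0)]
print(norms)
```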


In [None]:
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
import optuna

In [None]:
def objective_ridge(trial):
    params = {
        'alpha': trial.suggest_float('alpha', 0.0, 2.0),
    }
    model = Ridge(**params).fit(x_train, y_train)
    y_pred = model.predict(x_val)
    # ROC AUC is a score to maximize (the study uses direction="maximize"), not a loss
    score = roc_auc_score(y_val, y_pred)
    return score
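
<big>Ridge is a regressor, so its predictions are not probabilities and can fall outside [0, 1]. ROC AUC depends only on how the predictions rank the samples, so the raw outputs can be passed to <code>roc_auc_score</code> directly. The sketch below checks this on made-up labels and scores (both arrays are assumptions for illustration).</big>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up binary labels and unbounded regression-style scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([-0.2, 0.4, 0.35, 1.3])

auc_raw = roc_auc_score(y_true, scores)

# An increasing transformation keeps the ranking, so the AUC stays the same
auc_shifted = roc_auc_score(y_true, 10 * scores + 3)

print(auc_raw, auc_shifted)
```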

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="maximize",
                            pruner=optuna.pruners.HyperbandPruner())

<big><strong>Select parameters</strong></big>

<big>Let's see the execution time with Intel(R) Extension for Scikit-learn.</big>

In [None]:
%%time
study.optimize(objective_ridge, n_trials=100)
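
<big><code>%%time</code> is a notebook cell magic, so it only works inside Jupyter or IPython. Outside a notebook, a small helper based on <code>time.perf_counter</code> can measure the same thing. The <code>timed</code> helper below is a hypothetical name used for this sketch.</big>

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example with a cheap built-in call standing in for the real workload
result, elapsed = timed(sum, range(1_000_000))
print(f"sum = {result}, took {elapsed:.4f} s")
```

<big>With the objects defined in this notebook, the call would look like <code>timed(study.optimize, objective_ridge, n_trials=100)</code>.</big>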

<big><strong>Training the model with the selected parameters.</strong></big>

In [None]:
full_x = np.concatenate((x_train, x_val), axis=0)
full_y = np.concatenate((y_train, y_val), axis=0)

In [None]:
%%time
final_model = Ridge(**study.best_params).fit(full_x, full_y)

<big><strong>Prediction.</strong></big>

In [None]:
y_pred = final_model.predict(x_test)

<big>Save the results in 'submission.csv'.</big>

In [None]:
sample_submission['claim'] = y_pred
sample_submission.to_csv('submission.csv', index=False)
sample_submission.head(10)

# Now we use the same algorithms with original scikit-learn

<big>Let’s run the same code with the original Scikit-learn and compare its execution time with that of the version patched by Intel(R) Extension for Scikit-learn.</big>

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [None]:
from sklearn.linear_model import Ridge

<big><strong>Select parameters</strong></big>

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="maximize",
                            pruner=optuna.pruners.HyperbandPruner())

<big>Let's see the execution time without the patch.</big>

In [None]:
%%time
study.optimize(objective_ridge, n_trials=100)

In [None]:
%%time
final_model = Ridge(**study.best_params).fit(full_x, full_y)

# Conclusions
<big>We can see that a single classical machine learning algorithm can give you a pretty high score. We used the well-known libraries Scikit-learn and Optuna, as well as the increasingly popular Intel® Extension for Scikit-learn. Note that Intel® Extension for Scikit-learn gives you the opportunity to:</big>

* <big>Use your Scikit-learn code for training and inference without modification.</big>
* <big>Speed up selection of parameters <strong>from 2 minutes to 46 seconds.</strong></big>
* <big>Get predictions of similar quality.</big>
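
<big>The timings quoted above (2 minutes versus 46 seconds) can be turned into a speedup factor with a couple of lines of arithmetic. The variable names are chosen for this sketch only.</big>

```python
# Timings reported above for the parameter search
baseline_seconds = 2 * 60   # original Scikit-learn: 2 minutes
patched_seconds = 46        # with Intel Extension for Scikit-learn

speedup = baseline_seconds / patched_seconds
time_saved_pct = 100 * (baseline_seconds - patched_seconds) / baseline_seconds

print(f"speedup: {speedup:.2f}x, time saved: {time_saved_pct:.1f}%")
```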