<big>For classical machine learning algorithms, we often use the most popular Python library, Scikit-learn. With Scikit-learn you can fit models and search for optimal parameters, but this can sometimes take hours.</big><br><br>

<big>I want to show you how to use the Scikit-learn library and get results faster without changing the code. To do this, we will use another Python library, <strong><a href='https://github.com/intel/scikit-learn-intelex'>Intel® Extension for Scikit-learn*</a></strong>.</big><br><br>

<big>I will show you how to <strong>speed up your kernel by more than 3 times</strong> without changing your code!</big>

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

<h2>Importing data</h2>

In [None]:
data = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv', parse_dates=True)
test_data = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
semp_sub = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')
pseudolabels = pd.read_csv('../input/tps-lightautoml-baseline-with-pseudolabels/lightautoml_with_pseudolabelling_kernel_version_15.csv')

In [None]:
data['date_time'] = pd.to_datetime(data['date_time'])
test_data['date_time'] = pd.to_datetime(test_data['date_time'])

In [None]:
data.head()

<h2>Preprocessing</h2>

<big>I added some features based on the date.</big>

In [None]:
def make_new_features(df):
    df["month"] = df["date_time"].dt.month
    df["day_of_week"] = df["date_time"].dt.dayofweek
    df["day_of_year"] = df["date_time"].dt.dayofyear
    df["hour"] = df["date_time"].dt.hour
    df["quarter"] = df["date_time"].dt.quarter
    df["week_of_year"] = df["date_time"].dt.isocalendar().week.astype("int")
    df["working_hours"] =  df["hour"].isin(np.arange(8, 21, 1)).astype("int")
    # Use the frame passed in (df), not the global training frame, so test_data gets its own flags.
    df["is_weekend"] = (df["date_time"].dt.dayofweek >= 5).astype("int")

In [None]:
make_new_features(data)
make_new_features(test_data)
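
<big>To make the date features concrete, here is a minimal sketch on a toy frame. The two timestamps and the name <code>toy</code> are made up for illustration and are not part of the competition data.</big>

```python
import pandas as pd

# Two hand-picked timestamps: a Saturday evening and a Monday morning.
toy = pd.DataFrame(
    {"date_time": pd.to_datetime(["2021-07-03 22:00", "2021-07-05 09:00"])}
)

toy["month"] = toy["date_time"].dt.month
toy["day_of_week"] = toy["date_time"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["hour"] = toy["date_time"].dt.hour
# Hours 8..20 inclusive, the same range as np.arange(8, 21, 1).
toy["working_hours"] = toy["hour"].between(8, 20).astype("int")
toy["is_weekend"] = (toy["day_of_week"] >= 5).astype("int")

print(toy)
```

<big>The Saturday row gets <code>is_weekend = 1</code> and <code>working_hours = 0</code>, while the Monday row gets the opposite flags.</big>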

<big><strong>Pseudolabeling</strong></big><br><br>
<big>I took the previously predicted labels and added them to the test dataset.</big>

In [None]:
for col in ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']:
    test_data[col] = pseudolabels[col]

<big>Now let's combine the test and train datasets.</big>

In [None]:
full_data = pd.concat([data, test_data]).reset_index(drop = True)
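
<big>The same pseudolabeling idea on a toy example: labels predicted earlier are attached to the unlabeled rows, and the result is stacked under the training rows. All frames and values here (<code>train</code>, <code>test</code>, <code>earlier_predictions</code>) are hypothetical stand-ins.</big>

```python
import pandas as pd

train = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "target": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"feature": [4.0, 5.0]})

# Hypothetical predictions from an earlier model, standing in for the pseudolabels file.
earlier_predictions = pd.DataFrame({"target": [41.0, 49.0]})

# Column assignment aligns on the row index, so both frames must share the same index.
test_labeled = test.copy()
test_labeled["target"] = earlier_predictions["target"]

# Stack the labeled test rows under the training rows and rebuild a clean 0..n-1 index.
combined = pd.concat([train, test_labeled]).reset_index(drop=True)
print(combined)
```

<big>Because the assignment aligns on the index, it is worth checking that the pseudolabel file has the same row order as the test set before combining.</big>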

<big>I added a new feature to the dataset: <code>date_time</code> converted to a number of seconds.</big> 
<big>It was selected using <code>feature_importances_</code>.</big>

In [None]:
test_data = test_data.drop(['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides'], axis=1)
all_data = [full_data, test_data]

for df in all_data:
    df['date_time'] = df['date_time'].astype('datetime64[ns]').astype(np.int64)/10**9
data = data.sample(frac=1)
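
<big>A small check of the timestamp conversion, using two made-up timestamps. Casting to <code>datetime64[ns]</code> first gives nanoseconds since 1970-01-01, so dividing by 10**9 gives seconds.</big>

```python
import numpy as np
import pandas as pd

# Midnight on the Unix epoch and exactly one day later.
stamps = pd.Series(pd.to_datetime(["1970-01-01 00:00:00", "1970-01-02 00:00:00"]))

# Nanoseconds since the epoch as int64, then scaled to seconds (float).
seconds = stamps.astype("datetime64[ns]").astype(np.int64) / 10**9

print(seconds.tolist())  # [0.0, 86400.0]
```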

<big>The next step is to split the data into features and targets.</big>

In [None]:
x_data = full_data.drop(['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides'], axis=1)
y_data = full_data[['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']]
x_data.shape, y_data.shape

<big>Now split the data into training and validation sets.</big>

In [None]:
x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

<h2>Installing Intel(R) Extension for Scikit-learn</h2>

<big>Use Intel® Extension for Scikit-learn* to speed up Scikit-learn estimators.</big>

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off

<big>Patch the original scikit-learn. The patch has to be applied before the estimators are imported from <code>sklearn</code>.</big>

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
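
<big>If the notebook might run where the extension is not installed, one possible pattern is to guard the patch. This is only a sketch; the flag name <code>accelerated</code> is my own and is not part of either library.</big>

```python
# Try to enable the accelerated implementations; fall back to stock scikit-learn otherwise.
try:
    from sklearnex import patch_sklearn

    patch_sklearn()
    accelerated = True
except ImportError:
    # scikit-learn-intelex is not installed, so the stock estimators will be used.
    accelerated = False

print("Intel Extension for Scikit-learn active:", accelerated)

# Import estimators from sklearn only after this point, so that a successful
# patch is picked up by those imports.
```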

<h2>Using optuna to select parameters for Stacking algorithm</h2><br><br>
<big>Stacking, or stacked generalization, is an ensemble of machine learning algorithms.

It combines the outputs of the individual estimators and computes the final prediction from them. Stacking makes it possible to use the strengths of each individual estimator by feeding their outputs as inputs to a final estimator.</big><br><br>
<big>We adjust hyperparameters for the best result.</big><br><br>
<big>Parameters for Random Forest:</big><br>
<big>* <code>n_estimators</code> - The number of trees to be used in the algorithm.<br></big>
<big>* <code>min_samples_split</code> - The minimum number of samples required to split an internal node.<br><br> </big>
<big>Parameter for SVR:</big><br>
<big>* <code>C</code> - Regularization parameter. The strength of the regularization is inversely proportional to <code>C</code>.<br></big><br>
<big>Parameter for Lasso:</big><br>
<big>* <code>alpha</code> - Constant that multiplies the L1 penalty term. Larger values mean stronger regularization, which reduces the variance of the estimates.<br> </big>
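
<big>The mechanics of stacking can be sketched with NumPy alone. This is a simplified illustration on synthetic data, not the pipeline used in this notebook: the base models are plain least-squares fits, the helper names are hypothetical, and a single held-out split replaces the cross-validated predictions that <code>StackingRegressor</code> uses.</big>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)


def fit_linear(features, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_linear(coef, features):
    design = np.column_stack([features, np.ones(len(features))])
    return design @ coef


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


train, hold = slice(0, 100), slice(100, 200)

# Level 0: two deliberately weak base models, each sees only one feature.
base_a = fit_linear(X[train, :1], y[train])
base_b = fit_linear(X[train, 1:], y[train])

# The base predictions on held-out rows become the inputs of the final estimator.
meta_features = np.column_stack([
    predict_linear(base_a, X[hold, :1]),
    predict_linear(base_b, X[hold, 1:]),
])

# Level 1: the final estimator learns how to combine the base predictions.
final = fit_linear(meta_features, y[hold])
stacked_pred = predict_linear(final, meta_features)

# Errors are measured on the rows the final estimator was fitted on.
rmse_a = rmse(y[hold], meta_features[:, 0])
rmse_b = rmse(y[hold], meta_features[:, 1])
rmse_stacked = rmse(y[hold], stacked_pred)
print(rmse_a, rmse_b, rmse_stacked)
```

<big>Each base model alone misses half of the signal, while the final estimator recovers most of it by weighting the two outputs. A fair comparison would score all three on rows that none of the models has seen.</big>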


In [None]:
from sklearn.multioutput import RegressorChain
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.metrics import mean_squared_log_error
import numpy as np
import optuna 
from sklearn.svm import SVR
from sklearn.linear_model import Lasso

In [None]:
def get_stacking_regressor( C1=None,
                            n_estimators=None, min_samples_split=None,
                            alpha1=None, alpha2=None
                            ):
    svr = SVR(C=C1)
    rf = RandomForestRegressor(n_estimators=n_estimators, min_samples_split=min_samples_split,
                               random_state=0, n_jobs=-1)
    lasso = Lasso(alpha=alpha1, random_state=0, max_iter=100000)

    
    lasso_f = Lasso(alpha=alpha2, random_state=0, max_iter=100000)
    stacking_estimators = [
        ('svr', svr),
        ('rf', rf),
        ('lasso', lasso),
    ]
    
    return StackingRegressor(estimators=stacking_estimators, final_estimator=lasso_f)

In [None]:
def objective(trial):
    params ={
        'n_estimators': trial.suggest_int('n_estimators', 1300, 2000),
        'alpha1': trial.suggest_float('alpha1', 0.0, 0.15),
        'alpha2': trial.suggest_float('alpha2', 0.0, 0.05),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 50),
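        # suggest_loguniform is deprecated in recent Optuna releases; log=True samples the same range.
        'C1': trial.suggest_float('C1', 1e-3, 1e2, log=True),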
    }
    model = RegressorChain(get_stacking_regressor(**params), random_state=47).fit(x_train, y_train)
    y_pred = model.predict(x_val)
    loss = np.sqrt(mean_squared_log_error(y_val, np.abs(y_pred)))
    return loss
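
<big>The loss returned by <code>objective</code> is the root mean squared logarithmic error (RMSLE). As a sketch, it can be written out with NumPy as below. The function name <code>rmsle</code> is my own, and the absolute value mirrors the <code>np.abs(y_pred)</code> guard above, since the logarithm needs non-negative inputs.</big>

```python
import numpy as np


def rmsle(y_true, y_pred):
    """sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2)) over all elements."""
    y_true = np.asarray(y_true, dtype=float)
    # Fold negative predictions back to positive values before taking the log.
    y_pred = np.abs(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


print(rmsle([1.0, 2.0], [1.0, 2.0]))  # 0.0 for a perfect prediction
print(rmsle([0.0], [np.e - 1.0]))     # log(1 + (e - 1)) - log(1 + 0) = 1
```

<big>Because the error is computed on logarithms, it penalizes relative rather than absolute differences, which suits targets that span several orders of magnitude.</big>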



<big><strong>Select parameters</strong></big>

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="minimize",
                            pruner=optuna.pruners.HyperbandPruner())

<big>Let's see the execution time.</big>

In [None]:
%%time
study.optimize(objective, n_trials=10)
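
<big>The <code>%%time</code> magic only works inside IPython or Jupyter. In a plain script, a rough equivalent can be sketched with the standard library. The helper <code>timed</code> and the toy workload are hypothetical stand-ins for the <code>study.optimize</code> call.</big>

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("toy workload", timings):
    # Stand-in for an expensive call such as study.optimize(objective, n_trials=10).
    total = sum(i * i for i in range(100_000))

print(f"toy workload took {timings['toy workload']:.4f} s")
```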

<h2>Training the model with the selected parameters</h2>

In [None]:
%%time
new_model_rf = RegressorChain(get_stacking_regressor(**study.best_params)).fit(x_data, y_data)


<h2>Prediction</h2>

In [None]:
%%time
y_pred = new_model_rf.predict(test_data)

<big>Save the results in 'submission.csv'.</big>

In [None]:
semp_sub['target_carbon_monoxide'] = y_pred[:, 0]
semp_sub['target_benzene'] = y_pred[:, 1]
semp_sub['target_nitrogen_oxides'] = y_pred[:, 2]
semp_sub.to_csv('submission.csv', index=False)
semp_sub.head()

<h2>Now we use the same algorithms with the original scikit-learn</h2>

<big>Let’s run the same code with the original scikit-learn and compare its execution time with that of the version patched by Intel(R) Extension for Scikit-learn.</big>

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Lasso

<big>Select parameters for Stacking algorithm.</big>

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="minimize",
                            pruner=optuna.pruners.HyperbandPruner())

<big>Let's see the execution time without the patch.</big>

In [None]:
%%time
study.optimize(objective, n_trials=10)

In [None]:
%%time
new_model_rf = RegressorChain(get_stacking_regressor(**study.best_params)).fit(x_data, y_data)

<h2>Conclusions</h2>
<big>We can see that using only one classical machine learning algorithm may give you a pretty high accuracy score. We also used the well-known libraries Scikit-learn and Optuna, as well as the increasingly popular library Intel® Extension for Scikit-learn. Note that Intel® Extension for Scikit-learn gives you opportunities to:</big>

* <big>Use your Scikit-learn code for training and inference without modification.</big>
* <big>Speed up selection of parameters <strong>from 1 hour 41 minutes to 31 minutes.</strong></big>
* <big>Get predictions of similar quality.</big>
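
<big>The speedup factor follows directly from the two timings reported above. A short arithmetic check:</big>

```python
# Parameter search time reported above, converted to minutes.
baseline_minutes = 1 * 60 + 41  # original scikit-learn: 1 hour 41 minutes
patched_minutes = 31            # with Intel Extension for Scikit-learn

speedup = baseline_minutes / patched_minutes
print(f"{speedup:.2f}x")  # 3.26x
```

<big>This is consistent with the "more than 3 times" claim in the introduction.</big>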
