<big>For classical machine learning algorithms, we often use the most popular Python library, Scikit-learn. With Scikit-learn you can fit models and search for optimal parameters, but this can sometimes take hours.</big><br><br>

<big>I want to show you how to use the Scikit-learn library and get results faster without changing the code. To do this, we will use another Python library, <a href='https://github.com/intel/scikit-learn-intelex'>Intel® Extension for Scikit-learn*</a>.</big><br><br>

<big>I will show you how to <strong>speed up your kernel more than 2 times</strong> without changing your code!</big>

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

<h2>Importing data</h2>

In [None]:
data = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv', parse_dates=True)
test_data = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
semp_sub = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')
pseudolabels = pd.read_csv('../input/psd-sub/submission_psd.csv')


In [None]:
data.head()

<h2>Preprocessing</h2>

<big><strong>Pseudolabeling</strong></big><br><br>
<big>I took the previously predicted labels and added them to the test dataset.</big>

In [None]:
for col in ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']:
    test_data[col] = pseudolabels[col]

<big>Now let's combine the test and train datasets.</big>

In [None]:
full_data = pd.concat([data, test_data]).reset_index(drop = True)
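
<big>A toy illustration of the pseudolabeling step above, as a minimal sketch. The two small frames and their values are made up for this example and are not the competition data.</big>

```python
import pandas as pd

# Hypothetical labelled training rows.
train = pd.DataFrame({"sensor_1": [1.0, 2.0], "target": [10.0, 20.0]})
# Hypothetical unlabelled test rows and earlier predictions for them.
test = pd.DataFrame({"sensor_1": [3.0, 4.0]})
pseudo = pd.DataFrame({"target": [29.5, 41.0]})

# Attach the predicted labels to the test rows, then stack both sets.
test_labelled = test.assign(target=pseudo["target"])
full = pd.concat([train, test_labelled]).reset_index(drop=True)
print(full)
```

<big>The model is then trained on all four rows, treating the earlier predictions as if they were true labels.</big>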

<big>I added new features to the dataset.</big> 
<big>They were obtained by researching combinations of original features using <code>feature_importances_</code>.</big>

In [None]:
test_data = test_data.drop(['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides'], axis=1)
all_data = [full_data, test_data]

for df in all_data:
    df['date_time'] = df['date_time'].astype('datetime64[ns]').astype(np.int64)/10**9
    df['S1xS2'] = df['sensor_1'] * df['sensor_2']
    df['S2xS5'] = df['sensor_2'] * df['sensor_5']
    df['S2^2'] = df['sensor_2']**2
data = data.sample(frac=1)
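
<big>To see what these transformations produce, here is a small sketch on a made-up two-row frame (the values are assumptions chosen so the result is easy to check by hand).</big>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date_time": ["1970-01-01 00:00:00", "1970-01-02 00:00:00"],
    "sensor_1": [2.0, 3.0],
    "sensor_2": [5.0, 7.0],
})

# Timestamps become seconds since the Unix epoch (nanoseconds / 10**9).
as_ns = pd.to_datetime(df["date_time"]).astype("datetime64[ns]").astype(np.int64)
df["date_time"] = as_ns / 10**9
# Interaction and squared features, built the same way as in the cell above.
df["S1xS2"] = df["sensor_1"] * df["sensor_2"]
df["S2^2"] = df["sensor_2"] ** 2
print(df)
```

<big>The second row is exactly one day after the epoch, so its <code>date_time</code> value is 86400 seconds.</big>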

<big>The next step is to split the data into features and targets.</big>

In [None]:
x_data = full_data.drop(['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides'], axis=1)
y_data = full_data[['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']]
x_data.shape, y_data.shape

<big>Now split the data into training and validation sets.</big>

In [None]:
x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

<h2>Installing Intel(R) Extension for Scikit-learn</h2>

<big>Use Intel® Extension for Scikit-learn* to speed up Scikit-learn estimators.</big>

In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log

<big>Patch the original Scikit-learn.</big>

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

<h2>Using optuna to select parameters for Random Forest Regressor</h2><br><br>
<big>Random Forest is an ensemble of Decision Trees. The work of this algorithm can be represented as a collective decision made by some expert committee.</big><br><br>
<big>We adjust hyperparameters for the best result.</big><br><br>
<big>The parameters that we select:</big><br>
<big>1. <code>n_estimators</code> -  the number of trees to be used in the algorithm.<br></big>
<big>2. <code>max_depth</code> -  the maximum depth of each tree.<br></big>
<big>3. <code>min_samples_split</code> - the minimum number of samples required to split an internal node.<br> </big>
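
<big>The effect of these three parameters can be checked on a small synthetic problem. This is a minimal sketch: the data and the parameter values are assumptions chosen for illustration, not the values tuned later.</big>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(
    n_estimators=20,       # number of trees in the committee
    max_depth=3,           # no tree may grow deeper than this
    min_samples_split=10,  # nodes with fewer samples are not split
    random_state=0,
).fit(X, y)

depths = [tree.get_depth() for tree in forest.estimators_]
print(len(forest.estimators_), max(depths))

# The "committee decision": the forest prediction is the mean of the tree predictions.
tree_mean = np.mean([tree.predict(X[:5]) for tree in forest.estimators_], axis=0)
print(np.allclose(tree_mean, forest.predict(X[:5])))
```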

In [None]:
from sklearn.multioutput import RegressorChain
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
import numpy as np
import optuna
import matplotlib.pyplot as plt

In [None]:
def objective_rf(trial):
    params ={
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
        'max_depth': trial.suggest_int('max_depth', 3, 70),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 50),
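        # 'squared_error' replaced the 'mse' alias, which was removed in scikit-learn 1.2 (use 'mse' on versions below 1.0)
        'criterion': trial.suggest_categorical('criterion', ['squared_error']),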
        'n_jobs': -1 
        
    }
    model = RegressorChain(RandomForestRegressor(**params), random_state=47).fit(x_train, y_train)
    y_pred = model.predict(x_val)
    loss = np.sqrt(mean_squared_log_error(y_val, y_pred))
    return loss
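
<big>The loss returned by the objective is the root mean squared logarithmic error (RMSLE). As a sketch of what <code>np.sqrt(mean_squared_log_error(...))</code> computes with its default uniform averaging over targets, here is the same quantity written with NumPy only. The helper name <code>rmsle</code> and the sample arrays are made up for this example.</big>

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error over all targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([[1.0, 3.0], [7.0, 15.0]])
y_pred = np.array([[1.0, 3.0], [3.0, 7.0]])
print(rmsle(y_true, y_true))  # perfect predictions give zero loss
print(rmsle(y_true, y_pred))
```

<big>Because the error is measured on <code>log(1 + y)</code>, predicting 3 instead of 7 costs as much as predicting 7 instead of 15: the metric penalizes relative rather than absolute differences.</big>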



<big><strong>Select parameters</strong></big>

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="minimize",
                            pruner=optuna.pruners.HyperbandPruner())

<big>Let's see the execution time.</big>

In [None]:
%%time
study.optimize(objective_rf, n_trials=40)

<h2>Training the model with the selected parameters</h2>

In [None]:
%%time
new_model_rf = RegressorChain(RandomForestRegressor(**study.best_params, n_jobs=-1)).fit(x_data, y_data)


<big>Let's look at the importance of features in training.</big>

In [None]:
fet0 = new_model_rf.estimators_[0].feature_importances_
fet1 = new_model_rf.estimators_[1].feature_importances_
fet2 = new_model_rf.estimators_[2].feature_importances_
fets = [fet0, fet1, fet2]

In [None]:
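feature_names = x_data.columns

for fet in fets:
    # RegressorChain feeds earlier targets to later estimators as extra features,
    # so keep only the importances of the original columns.
    fet = fet[:len(feature_names)]
    order = np.argsort(fet)  # sort the bars while keeping labels aligned with values
    plt.figure()
    plt.barh(feature_names[order], fet[order])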
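
<big>One thing to keep in mind when reading these importances: in a <code>RegressorChain</code>, each estimator after the first also receives the earlier targets as input features, so its <code>feature_importances_</code> vector is longer than the list of original columns. A small sketch on synthetic data (shapes and values are assumptions for illustration):</big>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 4 input features
Y = np.abs(rng.normal(size=(100, 3)))  # 3 targets

chain = RegressorChain(RandomForestRegressor(n_estimators=5, random_state=0)).fit(X, Y)
lengths = [len(est.feature_importances_) for est in chain.estimators_]
print(lengths)
```

<big>With 4 input features and 3 targets the lengths are 4, 5 and 6. Only the first 4 entries of each vector correspond to the original columns, in their original order.</big>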

<h2>Prediction</h2>

In [None]:
%%time
y_pred = new_model_rf.predict(test_data)

<big>Save the results in 'submission.csv'.</big>

In [None]:
semp_sub['target_carbon_monoxide'] = y_pred[:, 0]
semp_sub['target_benzene'] = y_pred[:, 1]
semp_sub['target_nitrogen_oxides'] = y_pred[:, 2]
semp_sub.to_csv('submission.csv', index=False)
semp_sub.head()

<h2>Now we use the same algorithms with original scikit-learn</h2>

<big>Let’s run the same Scikit-learn code without the patching offered by Intel® Extension for Scikit-learn and compare its execution time with the execution time of the patched Scikit-learn.</big>
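
<big>The notebook relies on the <code>%%time</code> magic for this comparison. Outside a notebook, the same measurement could be sketched with the standard library. The helper <code>timed</code> and the two stand-in workloads below are hypothetical and only illustrate the pattern.</big>

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical stand-ins for the patched and unpatched runs.
_, t_fast = timed(sum, range(10_000))
_, t_slow = timed(sum, range(1_000_000))
# Guard against a zero reading on very coarse timers.
print(f"speedup: {t_slow / max(t_fast, 1e-9):.1f}x")
```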

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [None]:
from sklearn.ensemble import RandomForestRegressor

<big>Select parameters for Random Forest Regressor.</big>

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="minimize",
                            pruner=optuna.pruners.HyperbandPruner())

<big>Let's see the execution time without the patch.</big>

In [None]:
%%time
study.optimize(objective_rf, n_trials=40)

In [None]:
%%time
new_model_rf = RegressorChain(RandomForestRegressor(**study.best_params, n_jobs=-1)).fit(x_data, y_data)

<h2>Conclusions</h2>
<big>We can see that a single classical machine learning algorithm can give a fairly high accuracy score. We used the well-known libraries Scikit-learn and Optuna, as well as the increasingly popular Intel® Extension for Scikit-learn. Note that Intel® Extension for Scikit-learn lets you:</big>

* <big>Use your Scikit-learn code for training and inference without modification.</big>
* <big>Speed up selection of parameters <strong>from 45 minutes to 20 minutes.</strong></big>
* <big>Get predictions of similar quality.</big>
