<big>For classical machine learning algorithms, we often use the most popular Python library, Scikit-learn. With Scikit-learn you can fit models and search for optimal parameters, but training can sometimes run for hours.</big><br><br>

<big>I want to show you how to use the Scikit-learn library and get results faster without changing your code. To do this, we will make use of another Python library, <strong> <a href='https://github.com/intel/scikit-learn-intelex'>Intel® Extension for Scikit-learn*</a></strong>.</big><br><br>

<big>I will show you how to <strong>speed up your kernel by more than 4 times</strong> without changing your code!</big>

<big>Import libraries</big>

In [None]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split

# Preprocessing

<big>Importing data</big>

In [None]:
train = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')
sample_sub = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')
pseudo = pd.read_csv('../input/blending-tool-tps-aug-2021/file1_7.85192_file2_7.85192_blend.csv')

<big><strong>Pseudo-labeling</strong></big><br><br>
<big>I took the labels predicted earlier by a blend of models and added them to the test dataset, so that the test rows can be used for training as well.</big>

In [None]:
test['loss'] = pseudo['loss']
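
<big>As a minimal sketch of this pseudo-labeling step, assuming toy data (the column <code>f0</code> and all values below are made up for illustration): earlier predictions are attached to the unlabeled test rows, which are then stacked with the labeled train rows. Note that an assignment like <code>test['loss'] = pseudo['loss']</code> aligns rows by index, so it relies on both frames sharing the same index and row order.</big>

```python
import pandas as pd

# Toy stand-ins for the real files (hypothetical values).
train = pd.DataFrame({"f0": [0.1, 0.4, 0.9], "loss": [1.0, 3.0, 7.0]})
test = pd.DataFrame({"f0": [0.2, 0.8]})
pseudo = pd.DataFrame({"id": [10, 11], "loss": [2.0, 6.0]})  # earlier predictions

# Both frames use the default RangeIndex, so index alignment pairs row i with row i.
assert test.index.equals(pseudo.index)
test["loss"] = pseudo["loss"]

# The pseudo-labeled test rows can now serve as extra training data.
full_data = pd.concat([train, test], ignore_index=True)
print(full_data)
```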

<big>Let's look at the test and train sets.</big>

In [None]:
train.head()

In [None]:
test.head()

In [None]:
test.shape, train.shape

<big>Drop 'id' field</big>

In [None]:
train.drop(['id'], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)

<big>I added features that were selected by studying the model's <code>feature_importances_</code>.</big>

In [None]:
all_data = [train, test]

In [None]:
for df in all_data:
    df['f77^2/f52^2'] = (df['f77']**2)/(df['f52']**2)
    df['f74^2/f81^2'] = (df['f74']**2)/(df['f81']**2)
    df['f77/f69'] = df['f77']/df['f69']
    df['f81^2/f77^2'] = (df['f81']**2)/(df['f77']**2)
    df['f96/f28'] = df['f96']/df['f28']
    df['f96^2/f73^2'] = (df['f96']**2)/(df['f73']**2)
    df['f78/f28'] = df['f78']/df['f28']
    df['f73/f28'] = df['f73']/df['f28']
    df['f66/f69'] = df['f66']/df['f69']
    df['f46^2/f4^2'] = (df['f46']**2)/(df['f4']**2)
    df['f4/f75'] = df['f4']/df['f75']
    df['f69^2/f96^2'] = (df['f69']**2)/(df['f96']**2)
    df['f25/f69'] = df['f25']/df['f69']
    df['f78/f69'] = df['f78']/df['f69']
    df['f96^2/f77^2'] = (df['f96']**2)/(df['f77']**2)
    df['f4^2/f52^2'] = (df['f4']**2)/(df['f52']**2)
    df['f66^2/f52^2'] = (df['f66']**2)/(df['f52']**2)
    df['f4^2/f81^2'] = (df['f4']**2)/(df['f81']**2)
    df['f46^2/f81^2'] = (df['f46']**2)/(df['f81']**2)
    df['f47/f69'] = df['f47']/df['f69']
    df['f74xf70'] = df['f74']*df['f70']
    df['f46^2/f66^2'] = (df['f46']**2)/(df['f66']**2)
    df['f74/f47'] = df['f74']/df['f47']
    df['f96^2xf69^2'] = (df['f96']**2)*(df['f69']**2)
    df['f66/f46'] = df['f66']/df['f46']
    df['f25xf96'] = df['f25']*df['f96']
    df['f28xf81'] = df['f28']*df['f81']
    df['f52xf66'] = df['f52']*df['f66']
    df['f46^2xf81^2'] = (df['f46']**2)*(df['f81']**2)
    df['f46xf74'] = df['f46']*df['f74']
    df['f28_log'] = np.log2(df['f28'])
    df['f28xf70'] = df['f28']*df['f70']
    df['f52_log'] = np.log2(df['f52'])
    df['f47_log'] = np.log2(df['f47'])
    df['f66xf73'] = df['f66']*df['f73']
    df['f69_log'] = np.log2(df['f69'])
    df['f96/f78'] = df['f96']/df['f78']
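
<big>The notebook does not say which model produced the importances, so the following is only a sketch of one way candidate features might be ranked with <code>feature_importances_</code>. It assumes a <code>RandomForestRegressor</code> and synthetic data; the columns <code>a</code> and <code>b</code> are hypothetical.</big>

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.uniform(1.0, 2.0, 500),
    "b": rng.uniform(1.0, 2.0, 500),
})
# Synthetic target that depends on the ratio of the two base features.
y = df["a"] / df["b"] + rng.normal(0.0, 0.01, 500)

# Candidate engineered features, named in the same style as the notebook.
df["axb"] = df["a"] * df["b"]
df["a/b"] = df["a"] / df["b"]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(df, y)

# Rank all columns by importance; the most useful candidates rise to the top.
ranking = pd.Series(model.feature_importances_, index=df.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```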
    
    

In [None]:
test.shape, train.shape

In [None]:
test.fillna(0, inplace=True)
train.fillna(0, inplace=True)
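
<big><code>fillna</code> replaces only NaN values. Ratio features can also produce infinities (a non-zero value divided by zero), which scikit-learn scalers reject. A hedged sketch on toy data (the values are made up) of converting infinities to NaN before filling:</big>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": [1.0, 0.0, 4.0], "den": [2.0, 0.0, 0.0]})
df["num/den"] = df["num"] / df["den"]  # 0.5, NaN (0/0), inf (4/0)

# fillna alone leaves the infinite value in place.
only_filled = df.fillna(0)

# Turning +/-inf into NaN first lets a single fillna handle both cases.
cleaned = df.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned)
```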

<big>Delete features whose correlation with the target <code>loss</code> is close to zero (absolute value below 0.001).</big>

In [None]:
corr = train.corr()
# Columns whose correlation with the target lies within (-0.001, 0.001)
columns_to_delete = corr.index[corr['loss'].abs() < 0.001]

In [None]:
train.drop(columns_to_delete, axis=1, inplace=True)
test.drop(columns_to_delete, axis=1, inplace=True)

<big>Concatenate the train and test sets.</big>

In [None]:
full_data = pd.concat([train, test])

<big>Split the data into 'X' and 'y' </big>

In [None]:
X = full_data.drop(['loss'], axis=1)
y = full_data['loss']
X_test = test.drop(['loss'], axis=1)

<big>Normalize data.</big>

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler_x = MinMaxScaler()
scaler_y = StandardScaler()

In [None]:
scaler_x.fit(X)
X = scaler_x.transform(X)
X_test = scaler_x.transform(X_test)

In [None]:
scaler_y.fit(y.to_numpy().reshape(-1, 1))
y = scaler_y.transform(y.to_numpy().reshape(-1, 1)).ravel()
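
<big>A small sketch of the same scaling round trip, assuming toy arrays (the names <code>X_demo</code>, <code>y_demo</code> and <code>scaler_y_demo</code> are hypothetical). It shows why the target is reshaped to a column: scalers expect 2D input, both when transforming and when mapping values back to the original units.</big>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
y_demo = np.array([5.0, 7.0, 12.0])

# Features: each column is rescaled to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_demo)

# Target: reshape to a single column, scale, then flatten again.
scaler_y_demo = StandardScaler()
y_scaled = scaler_y_demo.fit_transform(y_demo.reshape(-1, 1)).ravel()

# Values on the scaled target are mapped back the same way.
y_restored = scaler_y_demo.inverse_transform(y_scaled.reshape(-1, 1)).ravel()
print(X_scaled.min(axis=0), X_scaled.max(axis=0), y_restored)
```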

# Installing Intel(R) Extension for Scikit-learn

<big>Use Intel® Extension for Scikit-learn* to speed up Scikit-learn estimators.</big>

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off

<big>Patch the original scikit-learn. Apply the patch before importing the estimators you want to accelerate.</big>

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
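
<big>The ordering can be sketched as follows. The <code>try</code>/<code>except</code> fallback is an addition for illustration, so the snippet still runs with stock scikit-learn when the extension is not installed; the printed module name is only a hint and may differ between versions.</big>

```python
# Sketch: apply the patch first, then import the estimator.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
    patched = True
except ImportError:
    # scikit-learn-intelex is not installed; stock scikit-learn is used instead.
    patched = False

from sklearn.svm import NuSVR  # imported after patching on purpose

# The defining module hints at which implementation is active.
print("patched:", patched, "| NuSVR comes from:", NuSVR.__module__)
```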

# Train nuSVR model
<big>Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection. One advantage of support vector machines is that they are effective in high-dimensional spaces.</big><br><br>

<big>NuSVR is similar to SVR, but uses a parameter nu to control the number of support vectors. Nu replaces the parameter epsilon of epsilon-SVR.</big><br><br>
<big>Selecting the parameters is slow and computationally intensive, so I selected them in advance.</big><br><br>
<big>Parameters: </big><br>
<big>* <code>C</code> - Penalty parameter of the error term. The strength of the regularization is inversely proportional to <code>C</code>.<br></big>
<big>* <code>nu</code> - An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.<br><br> </big>
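
<big>To illustrate the role of <code>nu</code>, here is a minimal sketch on synthetic data (the data and the helper <code>support_fraction</code> are hypothetical and not part of the competition setup). It fits NuSVR with a small and a large <code>nu</code> and compares the share of training points that end up as support vectors.</big>

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 6.0, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(0.0, 0.2, size=200)

def support_fraction(nu):
    """Fit NuSVR on the toy data and return the share of support vectors."""
    model = NuSVR(C=1.0, nu=nu).fit(X_demo, y_demo)
    return len(model.support_) / len(X_demo)

frac_low = support_fraction(0.2)
frac_high = support_fraction(0.8)
print(f"nu=0.2 -> {frac_low:.2f}, nu=0.8 -> {frac_high:.2f}")
```

<big>A larger <code>nu</code> keeps more support vectors, which is one reason why training with <code>nu</code> close to 1 on a large dataset is expensive.</big>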

In [None]:
from sklearn.svm import NuSVR

In [None]:
params = {'C': 0.9335786569734156, 'nu': 0.9426690319592885}

In [None]:
%%time
final_model = NuSVR(**params).fit(X, y)

# Prediction

In [None]:
%%time
y_pred = final_model.predict(X_test)

In [None]:
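# inverse_transform expects 2D input, so reshape to a column and flatten back
y_pred = scaler_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()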

<big>Save the results in 'submission.csv'.</big>

In [None]:
sample_sub['loss'] = y_pred
sample_sub.to_csv('submission.csv', index=False)
sample_sub.head(10)

# Now we use the same algorithm with the original scikit-learn

<big>Unfortunately, the original scikit-learn <strong>cannot finish training the model within 9 hours</strong> on the provided data.</big><br>
<big>On 10% of the total dataset, the patched version trains in 1 minute 25 seconds, while the stock version takes 33 minutes 42 seconds.</big>

In [None]:
# from sklearnex import unpatch_sklearn
# unpatch_sklearn()

In [None]:
# from sklearn.svm import NuSVR

In [None]:
# %%time
# final_model = NuSVR(**params).fit(X, y)
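
<big>A rough sketch of how such a comparison might be timed on a 10% subsample. It assumes synthetic data, and the sizes and parameter values are illustrative, so it does not reproduce the timings quoted above. Running it once with the patch applied and once after <code>unpatch_sklearn()</code> (re-importing <code>NuSVR</code> afterwards) gives the two measurements.</big>

```python
import time

import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = X_demo[:, 0] + rng.normal(0.0, 0.1, size=2000)

# Draw a 10% subsample without replacement.
idx = rng.choice(len(X_demo), size=len(X_demo) // 10, replace=False)
X_small, y_small = X_demo[idx], y_demo[idx]

# Time a single fit with a wall-clock timer.
start = time.perf_counter()
NuSVR(C=0.9, nu=0.9).fit(X_small, y_small)
elapsed = time.perf_counter() - start
print(f"fit on {len(idx)} rows took {elapsed:.3f} s")
```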

# Conclusions
<big>We can see that even a single classical machine learning algorithm can give you a pretty high accuracy score. We also used the well-known libraries Scikit-learn and Optuna, as well as the increasingly popular Intel® Extension for Scikit-learn. Note that Intel® Extension for Scikit-learn lets you:</big>

* <big>Use your Scikit-learn code for training and inference without modification.</big>
* <big>Speed up selection of parameters <strong>from 9+ hours to 2 hours and 30 minutes.</strong></big>
* <big>Get predictions of similar quality.</big>