<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Tabular Playground Series - Dec 2021</center></h1>
</div>

<center><a><img src="https://i.ibb.co/PWvpT9F/header.png" alt="header" border="0" width=800 height=400 class="center"></a></center>

For classical machine learning algorithms, we often use Scikit-learn, the most popular Python library for the task. With Scikit-learn you can fit models and search for optimal parameters, but these runs sometimes take hours. Anyone who uses Scikit-learn would be interested in speeding them up.

I want to show you how to use the Scikit-learn library and get results faster without changing your code. To do this, we will use another Python library, [**Intel® Extension for Scikit-learn***](https://github.com/intel/scikit-learn-intelex). It accelerates Scikit-learn and does not require any changes to code written for Scikit-learn.

I will show you how to **speed up** your kernel without changing your code!

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Importing Libraries and Data</center></h1>
</div>

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
from IPython.display import HTML
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt

### Reading Data

In [None]:
PATH_TRAIN      = '../input/tabular-playground-series-dec-2021/train.csv'
PATH_TEST       = '../input/tabular-playground-series-dec-2021/test.csv'
PATH_SUBMISSION = '../input/tabular-playground-series-dec-2021/sample_submission.csv'

In [None]:
train_data = pd.read_csv(PATH_TRAIN)
test_data  = pd.read_csv(PATH_TEST)
submission = pd.read_csv(PATH_SUBMISSION)

### Reduce DataFrame memory usage

The data is large relative to the RAM of a Kaggle notebook instance, so we reduce memory usage by converting columns to smaller data types.
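To see why narrower data types help, here is a minimal sketch on a synthetic NumPy array. The array and the helper name `smallest_int_dtype` are assumptions made for illustration; they are not part of the competition data or of the function used in this notebook. The helper picks the narrowest signed integer type whose range contains the values, and the two prints compare the memory footprint before and after.

```python
import numpy as np

def smallest_int_dtype(values):
    """Return the narrowest signed integer dtype that can hold all values."""
    lo, hi = int(values.min()), int(values.max())
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    raise ValueError("values do not fit in a 64-bit signed integer")

# Hypothetical binary indicator column stored as int64,
# the type pandas uses by default for integer CSV columns.
column = np.array([0, 1, 1, 0] * 250_000, dtype=np.int64)
compact = column.astype(smallest_int_dtype(column))

print(column.dtype, column.nbytes)    # 8 bytes per value
print(compact.dtype, compact.nbytes)  # 1 byte per value
```

For a column of 0/1 flags the values are unchanged, while the storage per value drops from eight bytes to one.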

In [None]:
def reduce_memory_usage(df):
    """Downcast each column to the narrowest dtype that covers its value range."""
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                # Inclusive bounds: a value equal to the dtype limit still fits.
                # Columns that need the full int64 range are left unchanged.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)

            else:
                # Only the range is checked here; float16 and float32 also reduce precision.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
        else:
            # String columns are stored as categoricals
            df[col] = df[col].astype('category')

    return df

In [None]:
train_data = reduce_memory_usage(train_data)
test_data  = reduce_memory_usage(test_data)

In [None]:
train_data = train_data.drop(['Id', 'Soil_Type7', 'Soil_Type15'], axis = 1)
test_data = test_data.drop(['Id', 'Soil_Type7', 'Soil_Type15'], axis = 1)

In [None]:
train_data.info()

Run the garbage collector to release memory that is no longer referenced:

In [None]:
import gc

gc.collect()

### Intel® Extension for Scikit-learn installation:

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off > /dev/null 2>&1

### Accelerate Scikit-learn with two lines of code:

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

Set up logging to a file to track which calls run the accelerated version.
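As a minimal standard-library sketch of this pattern, a handler attached to the root logger receives records from any library logger whose effective level lets them through, and the file can then be filtered for unique lines. The file name `demo_log.txt`, the logger name `demo_library`, and the message texts are made up for illustration and are not actual sklearnex output.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "demo_log.txt")

root = logging.getLogger()
handler = logging.FileHandler(log_path)
handler.setLevel(logging.DEBUG)    # same as the numeric level 10
root.addHandler(handler)

# Hypothetical library logger; its records propagate to the root handlers.
lib_logger = logging.getLogger("demo_library")
lib_logger.setLevel(logging.INFO)  # otherwise the root default (WARNING) filters INFO out

lib_logger.info("fit: running accelerated version")
lib_logger.info("fit: running accelerated version")
lib_logger.info("predict: fallback to original version")

# Detach and close the handler so the file is flushed before reading it
root.removeHandler(handler)
handler.close()

with open(log_path) as f:
    accelerated = sorted({line.strip() for line in f
                          if "running accelerated version" in line})
print(accelerated)
```

The set comprehension plays the role of `sort | uniq`: repeated messages collapse into one line per accelerated call.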

In [None]:
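import logging

# Attach a file handler to the root logger so that log records are written to log.txt
logger = logging.getLogger()
fh     = logging.FileHandler('log.txt')

fh.setLevel(logging.DEBUG)  # numeric level 10
logger.addHandler(fh)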

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Feature importance</center></h1>
</div>

One of the most basic questions we might ask of a model is: which features have the biggest impact on its predictions?

This concept is called feature importance.

There are multiple ways to measure feature importance. In this kernel we use permutation importance, computed with the ELI5 library.

In [None]:
X, y = train_data.drop(['Cover_Type'], axis = 1), train_data['Cover_Type']

In [None]:
from sklearn.model_selection import train_test_split
from timeit import default_timer as timer

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state = 42)

### ELI5

ELI5 provides a way to compute feature importances for any black-box estimator by measuring how much the score decreases when a feature is not available.
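The idea can be sketched without ELI5: shuffle one column of the validation data, re-score the model, and record how much the score drops. The toy data, the fixed `predict` rule, and the function name `permutation_importance_sketch` are assumptions made for illustration. This is a simplified stand-in, not ELI5's implementation.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=42):
    """Mean drop in accuracy after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)
    baseline = accuracy(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling breaks the link between feature j and the target
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops.append(baseline - accuracy(y, predict(X_shuffled)))
        importances[j] = np.mean(drops)
    return importances

# Toy data: the label depends only on feature 0; feature 1 is pure noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Stand-in for a fitted model: a rule that looks only at feature 0.
def predict(X):
    return (X[:, 0] > 0).astype(int)

importances = permutation_importance_sketch(predict, X_toy, y_toy)
print(importances)
```

Shuffling the noise feature leaves the predictions untouched, so its importance is zero, while shuffling feature 0 costs a large share of the accuracy. Because the scores are computed on held-out data and the model is never refit, the method works for any estimator that exposes a prediction function.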

In [None]:
import eli5
from eli5.sklearn import PermutationImportance
from timeit import default_timer as timer
from sklearn.ensemble import RandomForestClassifier

In [None]:
timeFirstI  = timer()
modelRF     = RandomForestClassifier(random_state = 42).fit(X_train, y_train)
perm        = PermutationImportance(modelRF, random_state = 42).fit(X_val, y_val)
timeSecondI = timer()

In [None]:
print("Total time with Intel Extension: {} seconds".format(timeSecondI - timeFirstI))

In [None]:
eli5.show_weights(perm, feature_names = X.columns.tolist())

In [None]:
pi_features = eli5.explain_weights_df(perm, feature_names = X_train.columns.tolist())
pi_features = pi_features.loc[pi_features['weight'] >= 0.0001]['feature'].tolist()

In [None]:
pi_features[:5]

In [None]:
X_trainPI = X_train.loc[:, pi_features]
X_valPI   = X_val.loc[:, pi_features]

In [None]:
X_trainPI[:5]

### Accelerated functions:

In [None]:
!cat log.txt | grep 'running accelerated version' | sort | uniq

### Default Scikit-learn

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [None]:
import eli5
from eli5.sklearn import PermutationImportance
from timeit import default_timer as timer
from sklearn.ensemble import RandomForestClassifier

In [None]:
timeFirstD  = timer()
modelRF     = RandomForestClassifier(random_state = 42).fit(X_train, y_train)
perm        = PermutationImportance(modelRF, random_state = 42).fit(X_val, y_val)
timeSecondD = timer()

In [None]:
print("Total time with default Scikit-learn: {} seconds".format(timeSecondD - timeFirstD))

In [None]:
eli5.show_weights(perm, feature_names = X.columns.tolist())

In [None]:
eli5_speedup = round((timeSecondD - timeFirstD) / (timeSecondI - timeFirstI), 2)
HTML(f'<h2>ELI5 speedup: {eli5_speedup}x</h2>'
     f'(from {round((timeSecondD - timeFirstD), 2)} to {round((timeSecondI - timeFirstI), 2)} seconds)')

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>CatBoost</center></h1>
</div>

In [None]:
test_data = test_data.loc[:, pi_features]

In [None]:
from catboost import CatBoostClassifier

cat_params = {
    'iterations': 20000,
    'depth': 7,
    'task_type' : 'GPU',
    'l2_leaf_reg': 5,
    'eval_metric': 'Accuracy',
}

cat = CatBoostClassifier(**cat_params)
cat.fit(X_trainPI, y_train, eval_set=(X_valPI, y_val))

In [None]:
predictions = cat.predict(test_data)
submission['Cover_Type'] = predictions
predictions[:5]

In [None]:
submission.to_csv("submission.csv", index = False)

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Conclusion</center></h1>
</div>

**Intel® Extension for Scikit-learn** lets you:
* Use your Scikit-learn code for training and inference without modification.
* Speed up your kernel.

*Please upvote if you liked it.*

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Other notebooks with sklearnex usage</center></h1>
</div>

### [[predict sales] Stacking with scikit-learn-intelex](https://www.kaggle.com/alexeykolobyanin/predict-sales-stacking-with-scikit-learn-intelex)

### [[TPS-Aug] NuSVR with Intel Extension for Sklearn](https://www.kaggle.com/alexeykolobyanin/tps-aug-nusvr-with-intel-extension-for-sklearn)

### [Using scikit-learn-intelex for What's Cooking](https://www.kaggle.com/kppetrov/using-scikit-learn-intelex-for-what-s-cooking?scriptVersionId=58739642)

### [Fast KNN using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-knn-using-scikit-learn-intelex-for-mnist?scriptVersionId=58738635)

### [Fast SVC using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-mnist?scriptVersionId=58739300)

### [Fast SVC using scikit-learn-intelex for NLP](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-nlp?scriptVersionId=58739339)

### [Fast AutoML with Intel Extension for Scikit-learn](https://www.kaggle.com/lordozvlad/fast-automl-with-intel-extension-for-scikit-learn)

### [[Titanic] AutoML with Intel Extension for Sklearn](https://www.kaggle.com/lordozvlad/titanic-automl-with-intel-extension-for-sklearn)