<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Tabular Playground Series - Nov 2021</center></h1>
</div>

<center><a><img src="https://i.ibb.co/PWvpT9F/header.png" alt="header" border="0" width=800 height=400 class="center"></a>

<h1> Fast AutoML and Intel® Extension for Scikit-learn* - Kaggle Tabular Playground Series - November 2021 </h1>

AutoML greatly simplifies building high-quality models, but it can be slow, especially on large problems. In this notebook, we show how to accelerate the AutoML frameworks EvalML and AutoGluon with Intel® Extension for Scikit-learn*, which speeds up Scikit-learn algorithms seamlessly: one pip package installation and two lines of code.

This notebook solves a binary classification task, but you can use it as a template for many other competitions with a few changes, depending on the task type (multiclass or regression) and your needs.

I will show you how to **speed up** your kernel without changing your code using **Intel® Extension for Scikit-learn**.

In this kernel we use the following AutoML implementations:
* EvalML
* AutoGluon 

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Importing Libraries and Data</center></h1>
</div>

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt

random_state = 42

### Reading Data

In [None]:
PATH_TRAIN      = '../input/tabular-playground-series-nov-2021/train.csv'
PATH_TEST       = '../input/tabular-playground-series-nov-2021/test.csv'
PATH_SUBMISSION = '../input/tabular-playground-series-nov-2021/sample_submission.csv'

In [None]:
PATH_AUTOGLUON_SUBMISSION = 'submission_autogluon.csv'
PATH_EVALML_SUBMISSION    = 'submission_evalml.csv'

In [None]:
id_column  = 'id'
train_data = pd.read_csv(PATH_TRAIN, index_col = id_column)
test_data  = pd.read_csv(PATH_TEST, index_col = id_column)
submission = pd.read_csv(PATH_SUBMISSION, index_col = id_column)

In [None]:
train_data[:5]

In [None]:
train_data.info()

### Reduce DataFrame memory usage

Since the data and the AutoML task are quite large for the RAM of a Kaggle notebook instance, we reduce memory usage by switching to smaller data types.

In [None]:
label    = 'target'
features = [col for col in train_data.columns if 'f' in col]

cont_features = []
disc_features = []

for col in features:
    if train_data[col].dtype =='float64':
        cont_features.append(col)
    else:
        disc_features.append(col)

train_data[cont_features] = train_data[cont_features].astype('float32')
train_data[disc_features] = train_data[disc_features].astype('uint8')
# apply the same down-casting to the test set
test_data[cont_features] = test_data[cont_features].astype('float32')
test_data[disc_features] = test_data[disc_features].astype('uint8')

In [None]:
train_data.info()

Memory usage was reduced from 467 MB to 238 MB
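
The down-casting step can also be sketched as a small reusable helper. This is a minimal illustration on synthetic data, not the notebook's exact procedure: the function name `downcast_numeric` and the columns `f0`, `f1`, `target` of the demo frame are hypothetical, and the `uint8` cast is only applied when all values fit in the range 0 to 255.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with float64 columns cast to float32 and small ints to uint8."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == "float64":
            out[col] = out[col].astype("float32")
        elif (
            pd.api.types.is_integer_dtype(out[col])
            and out[col].min() >= 0
            and out[col].max() <= 255
        ):
            out[col] = out[col].astype("uint8")
    return out


# Synthetic stand-in for the competition data (hypothetical column names).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "f0": rng.normal(size=1000),
    "f1": rng.normal(size=1000),
    "target": rng.integers(0, 2, size=1000),
})

before = demo.memory_usage(deep=True).sum()
small = downcast_numeric(demo)
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Note that `float32` keeps roughly 7 significant decimal digits, so this trades a little precision for memory.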

Collect garbage to reduce memory usage

In [None]:
import gc

gc.collect()
!> log.txt

<center><a><img src="https://editor.analyticsvidhya.com/uploads/64117evalml_logo.png" alt="header" border="0" width=300 height=200 class="center"></a>

## EvalML with optimized Scikit-learn

### EvalML Installation

In [None]:
!python3 -m pip install -q evalml==0.30.0 > /dev/null 2>&1

### Intel® Extension for Scikit-learn installation:

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off > /dev/null 2>&1

### Accelerate Scikit-learn with two lines of code:

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
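
The patch should be applied before any Scikit-learn estimators are imported, so that later imports resolve to the accelerated implementations. A defensive variant of the two lines above might look like the following sketch; the wrapper name `enable_sklearn_acceleration` is hypothetical, and the fallback to stock Scikit-learn when the package is missing is an assumption about what you want, not something the extension does for you.

```python
def enable_sklearn_acceleration() -> bool:
    """Patch scikit-learn if scikit-learn-intelex is installed; report the outcome."""
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        # Package missing: stock scikit-learn keeps working, just without acceleration.
        return False
    patch_sklearn()
    return True


# Call this BEFORE importing scikit-learn estimators in the rest of the notebook.
accelerated = enable_sklearn_acceleration()
print("sklearnex patch applied:", accelerated)
```

The same package also provides `unpatch_sklearn()` to restore the stock implementations, which is handy for timing a cell with and without acceleration.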

Set up logging to track accelerated cases:

In [None]:
import logging

logger = logging.getLogger()
fh = logging.FileHandler('log.txt')
fh.setLevel(logging.DEBUG)  # DEBUG == 10: the handler accepts records of every level
logger.addHandler(fh)
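
One detail of Python's `logging` module is worth keeping in mind here: a record must pass the logger's level before any handler sees it, so a permissive handler level alone is not enough. The sketch below demonstrates this with the standard library only; the logger name `demo_acceleration`, the file name, and the messages are made up for the demonstration and are unrelated to what sklearnex actually emits.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "demo_log.txt")

demo_logger = logging.getLogger("demo_acceleration")  # hypothetical logger name
demo_logger.propagate = False                         # keep the demo isolated
demo_logger.setLevel(logging.WARNING)                 # WARNING is the root default

handler = logging.FileHandler(log_path)
handler.setLevel(logging.DEBUG)                       # same as the numeric level 10
demo_logger.addHandler(handler)

# The logger level is WARNING, so this INFO record never reaches the handler.
demo_logger.info("first message")

# After lowering the logger's own level, INFO records are written to the file.
demo_logger.setLevel(logging.INFO)
demo_logger.info("second message")

demo_logger.removeHandler(handler)
handler.close()

with open(log_path) as f:
    logged = f.read().splitlines()
print(logged)
```

If the log file stays empty in your own run, checking the level of the logger itself (not only the handler) is a reasonable first step.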

In [None]:
from sklearn.model_selection import train_test_split

train_data, valid_data = train_test_split(train_data, test_size = 0.1, random_state = random_state)
X_train, y_train = train_data.drop(['target'], axis = 1), train_data['target']
X_valid, y_valid = valid_data.drop(['target'], axis = 1), valid_data['target']

In [None]:
from evalml.automl import AutoMLSearch

automl = AutoMLSearch(X_train = X_train, y_train = y_train, problem_type='binary', max_time = 60 * 5, objective = 'AUC')
automl.search()

In [None]:
automl.rankings

In [None]:
print("Number of pipelines:", len(automl.results['search_order']))

In [None]:
predictions = automl.best_pipeline.predict_proba(X_valid)

In [None]:
from sklearn.metrics import roc_auc_score

print("ROC AUC score on validation data:", roc_auc_score(y_valid, predictions.iloc[:, 1].values))
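
For intuition about the metric: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way with NumPy on a tiny hand-made example; `pairwise_auc` is a hypothetical helper written for illustration, and it compares all positive/negative pairs, so it is only suitable for small inputs.

```python
import numpy as np


def pairwise_auc(y_true, scores) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


# Positives score 0.35 and 0.8, negatives 0.1 and 0.4: 3 of 4 pairs are ordered correctly.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```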

### List of algorithms which are accelerated by sklearnex

In [None]:
!cat log.txt | grep 'running accelerated version' | sort | uniq
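
Outside a notebook, where the `!` shell escape is not available, the same filtering can be done in plain Python. This is a rough equivalent of the `grep | sort | uniq` pipeline above; the helper name `unique_matching_lines` is hypothetical, and the log lines written below are simulated placeholders, not real sklearnex output.

```python
import os
import tempfile


def unique_matching_lines(path: str, needle: str) -> list:
    """Return the sorted, de-duplicated lines of `path` that contain `needle`."""
    with open(path) as f:
        return sorted({line.rstrip("\n") for line in f if needle in line})


# Simulated log content for the demonstration only.
demo_path = os.path.join(tempfile.mkdtemp(), "log.txt")
with open(demo_path, "w") as f:
    f.write("estimator_b.fit: running accelerated version\n")
    f.write("estimator_a.fit: some other message\n")
    f.write("estimator_b.fit: running accelerated version\n")
    f.write("estimator_a.predict: running accelerated version\n")

accelerated_calls = unique_matching_lines(demo_path, "running accelerated version")
for line in accelerated_calls:
    print(line)
```

Python sorts by code point, while the shell `sort` follows the current locale, so the order may differ slightly for mixed-case or non-ASCII lines.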

In [None]:
gc.collect()
!> log.txt

<center><a><img src="https://user-images.githubusercontent.com/16392542/77208906-224aa500-6aba-11ea-96bd-e81806074030.png" alt="header" border="0" width=300 height=200 class="center"></a>

### AutoGluon Installation

In [None]:
!pip install autogluon.tabular[all] -q --progress-bar off

In [None]:
from autogluon.tabular import TabularPredictor

In [None]:
# use only gradient boosting (LightGBM, XGBoost), Random Forest and KNN to reduce execution time
hyperparameters = {
    'GBM': [
        {'extra_trees': True, 'seed': random_state, 'ag_args': {'name_suffix': 'XT'}},
        {},
    ],
    'RF': [
        {'criterion': 'gini', 'random_state': random_state, 'max_features': 'log2',
         'ag_args': {'name_suffix': 'Gini_Log2', 'problem_types': ['binary']},
         'ag_args_fit': {'use_daal': True}},
        {'criterion': 'gini', 'random_state': random_state, 'max_features': 'sqrt',
         'ag_args': {'name_suffix': 'Gini_Sqrt', 'problem_types': ['binary']},
         'ag_args_fit': {'use_daal': True}},
    ],
    'XGB': {},
    'KNN': {}
}

autogluon_predictor = TabularPredictor(
    label = label,
    eval_metric = "roc_auc",
    learner_kwargs = {'ignored_columns': [id_column]}
).fit(
    train_data = train_data,
    hyperparameters = hyperparameters,
    verbosity = 2,
    presets = 'best_quality',
    time_limit = 60 * 5,
)

In [None]:
leaderboard = autogluon_predictor.leaderboard(valid_data)

In [None]:
leaderboard

### List of algorithms which are accelerated by sklearnex

In [None]:
!cat log.txt | grep 'running accelerated version' | sort | uniq

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Prediction</center></h1>
</div>

### EvalML

In [None]:
predictions       = automl.best_pipeline.predict_proba(test_data)
EvalML_submission = predictions
submission.target = predictions.iloc[:, 1].values
submission[:5]

In [None]:
submission.to_csv(PATH_EVALML_SUBMISSION)

### AutoGluon

In [None]:
predictions          = autogluon_predictor.predict_proba(test_data)
AutoGluon_submission = predictions
submission.target    = predictions.iloc[:, 1]
submission[:5]

In [None]:
submission.to_csv(PATH_AUTOGLUON_SUBMISSION)

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Blending</center></h1>
</div>

In [None]:
X_valid.shape, y_valid.shape

In [None]:
evalML_pred    = automl.best_pipeline.predict_proba(X_valid)
autoGluon_pred = autogluon_predictor.predict_proba(X_valid)

In [None]:
table_pred       = {"evalML_pred": evalML_pred.iloc[:, 1].values, "autoGluon_pred": autoGluon_pred.iloc[:, 1]}
final_train_data = pd.DataFrame(data = table_pred)

In [None]:
final_train_data[:5]

In [None]:
from sklearn.linear_model import LogisticRegression

logReg = LogisticRegression()

logReg.fit(final_train_data, y_valid)

In [None]:
# build the blender's input from the test-set predictions of both frameworks
table_pred      = {"evalML_pred": EvalML_submission.iloc[:, 1].values, "autoGluon_pred": AutoGluon_submission.iloc[:, 1]}
final_test_data = pd.DataFrame(data = table_pred)

In [None]:
final_test_data[:5]

In [None]:
predictions = logReg.predict_proba(final_test_data)[:, 1]
predictions[:5]

In [None]:
submission.target = predictions
submission[:5]

In [None]:
submission.to_csv("submission.csv")
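
The blending steps above can be sketched end to end on synthetic data: fit a logistic regression on the base models' validation predictions, then apply it to their test predictions. Everything in this sketch is a stand-in. The names `model_a` and `model_b`, the helper `fake_model_scores`, and the noisy scores are hypothetical and only mimic two base models of different quality.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)


def fake_model_scores(y, noise):
    """Hypothetical base-model probabilities: the label plus noise, clipped to [0, 1]."""
    return np.clip(y + rng.normal(scale=noise, size=len(y)), 0.0, 1.0)


y_valid_demo = rng.integers(0, 2, size=500)
y_test_demo = rng.integers(0, 2, size=200)  # unknown in a real competition

# One column per base model; the same column names and order are used for both frames.
valid_meta = pd.DataFrame({
    "model_a": fake_model_scores(y_valid_demo, 0.6),
    "model_b": fake_model_scores(y_valid_demo, 0.8),
})
test_meta = pd.DataFrame({
    "model_a": fake_model_scores(y_test_demo, 0.6),
    "model_b": fake_model_scores(y_test_demo, 0.8),
})

# Fit the blender on held-out validation predictions, then blend the test predictions.
blender = LogisticRegression()
blender.fit(valid_meta, y_valid_demo)
blended = blender.predict_proba(test_meta)[:, 1]
print(blended[:5])
```

Fitting the blender on held-out validation predictions, rather than on predictions for the rows the base models were trained on, keeps it from simply rewarding whichever base model overfits the most.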

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Conclusion</center></h1>
</div>

**Intel® Extension for Scikit-learn** gives you the opportunity to:
* Use your Scikit-learn code for training and inference without modification.
* Speed up your kernel.
* Evaluate more pipelines in EvalML within the same time limit.
* Get better prediction quality in AutoGluon within the same time limit.


*Please upvote if you liked it.*

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Other notebooks with sklearnex usage</center></h1>
</div>

### [[predict sales] Stacking with scikit-learn-intelex](https://www.kaggle.com/alexeykolobyanin/predict-sales-stacking-with-scikit-learn-intelex)

### [[TPS-Aug] NuSVR with Intel Extension for Sklearn](https://www.kaggle.com/alexeykolobyanin/tps-aug-nusvr-with-intel-extension-for-sklearn)

### [Using scikit-learn-intelex for What's Cooking](https://www.kaggle.com/kppetrov/using-scikit-learn-intelex-for-what-s-cooking?scriptVersionId=58739642)

### [Fast KNN using  scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-knn-using-scikit-learn-intelex-for-mnist?scriptVersionId=58738635)

### [Fast SVC using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-mnist?scriptVersionId=58739300)

### [Fast SVC using scikit-learn-intelex for NLP](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-nlp?scriptVersionId=58739339)

### [Fast AutoML with Intel Extension for Scikit-learn](https://www.kaggle.com/lordozvlad/fast-automl-with-intel-extension-for-scikit-learn)

### [[Titanic] AutoML with Intel Extension for Sklearn](https://www.kaggle.com/lordozvlad/titanic-automl-with-intel-extension-for-sklearn)