# Tutorial 1: Basics


In this tutorial you will learn how to:
* run LightAutoML GPU version training on tabular data
* obtain feature importances and reports
* configure resource usage in LightAutoML


### 0.1. Import libraries

Here we will import the libraries we use in this kernel:
- Standard Python libraries for timing, working with the OS, etc.
- Essential Python DS libraries like numpy, pandas, scikit-learn and torch
- LightAutoML modules: presets for AutoML, task and report generation module

In [1]:
# Standard python libraries
import os
# Optional: set the device to run
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1"

import time

# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch

# LightAutoML presets, task and report generation
from lightautoml_gpu.automl.presets.gpu.tabular_gpu_presets import TabularAutoMLGPU
from lightautoml_gpu.tasks import Task
from lightautoml_gpu.report.gpu import ReportDeco

Level "INFO2: 17" already defined, skipping...
Level "INFO3: 13" already defined, skipping...
'pdf' extra dependecy package 'weasyprint' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.
'pdf' extra dependecy package 'weasyprint' isn't installed. Look at README.md in repo 'LightAutoML' for installation instructions.


### 0.2. Constants

Here we set up the constants used in the kernel:
- `N_THREADS` - number of vCPUs for LightAutoML model creation
- `N_FOLDS` - number of folds in LightAutoML inner CV
- `RANDOM_STATE` - random seed for better reproducibility
- `TEST_SIZE` - holdout data part size
- `TIMEOUT` - time limit in seconds for model training

In [2]:
RANDOM_STATE = 42
TEST_SIZE = 0.2
TIMEOUT = 300
N_THREADS = 4
N_FOLDS = 5

In [3]:
DATASET_DIR = './data/'
DATASET_NAMES = ['higgs.csv', 'Fashion-MNIST.csv']
DATASET_FULLNAME = [os.path.join(DATASET_DIR, name) for name in DATASET_NAMES]

### 0.3. Data loading
Let's check the data we have:

In [4]:
# Load the dataset
data = pd.read_csv('./data/higgs.csv')

# Dataset description; the 'target' key is reused below when scoring
data_info_ = {
    'path': 'openml/higgs.csv',
    'target': 'class',
    'task_type': 'binary',
    'read_csv_params': {'na_values': '?'}
}

# Missing values are encoded as '?': replace them with NaN and cast to float
for col in data.columns:
    if data[col].isin(['?']).any():
        data[col] = data[col].replace('?', np.nan).astype(np.float32)

# Look at the first rows
data.head()
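
The `read_csv_params` entry in `data_info_` hints at an alternative to the replacement loop: letting pandas treat `'?'` as missing while reading. The sketch below shows this on a tiny in-memory CSV; the column names `feat_a` and `feat_b` and the values are made up for illustration and are not part of the higgs data.

```python
import io

import pandas as pd

# Tiny made-up CSV where '?' marks a missing value (hypothetical columns).
csv_text = "class,feat_a,feat_b\n1,0.5,?\n0,?,1.2\n1,0.7,0.3\n"

# na_values tells pandas to parse '?' as NaN, so the columns stay numeric.
df = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(df.dtypes)
print(df.isna().sum())
```

With this option the affected columns are parsed as floats directly, so no per-column replacement is needed afterwards.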


### 0.4. Data splitting for train-holdout
Since we have only one file with target values, we split it 80%/20% to obtain a holdout set:

In [5]:
tr_data, te_data = train_test_split(
    data, 
    test_size=TEST_SIZE, 
    stratify=data['class'], 
    random_state=RANDOM_STATE
)

print(f'Data splitted. Parts sizes: tr_data = {tr_data.shape}, te_data = {te_data.shape}')
tr_data = tr_data.reset_index(drop=True)
te_data = te_data.reset_index(drop=True)
tr_data.head()

Data splitted. Parts sizes: tr_data = (78440, 29), te_data = (19610, 29)


| | class | lepton_pT | lepton_eta | lepton_phi | missing_energy_magnitude | missing_energy_phi | jet1pt | jet1eta | jet1phi | jet1b-tag | ... | jet4eta | jet4phi | jet4b-tag | m_jj | m_jjj | m_lv | m_jlv | m_bb | m_wbb | m_wwbb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.033086 | -0.027325 | 0.556388 | 0.716491 | -1.623269 | 1.044414 | -0.14955 | -1.633134 | 2.173076 | ... | 0.891493 | 0.128023 | 0.0 | 1.221998 | 1.333303 | 1.101426 | 0.886849 | 1.525385 | 1.250846 | 1.04227 |
| 1 | 1 | 2.149442 | 0.240516 | -1.208732 | 0.803575 | 1.224382 | 0.504847 | 0.587181 | 0.661531 | 1.086538 | ... | -0.587601 | -0.081836 | 3.101961 | 1.161402 | 1.038688 | 1.479556 | 1.069424 | 0.603161 | 0.783799 | 0.821149 |
| 2 | 0 | 0.669081 | 0.802496 | 1.645025 | 1.346262 | -1.145997 | 0.690627 | -1.126907 | 0.855008 | 0.0 | ... | 0.040348 | -1.595084 | 3.101961 | 1.03714 | 0.983492 | 0.995939 | 0.926378 | 0.886266 | 0.912128 | 0.88306 |
| 3 | 0 | 0.444346 | -0.500674 | -0.364785 | 0.716306 | 0.833619 | 0.939249 | -0.048547 | -0.799354 | 0.0 | ... | -2.398159 | 0.857178 | 0.0 | 1.584398 | 1.213435 | 0.983564 | 0.895563 | 0.841721 | 1.141312 | 0.922072 |
| 4 | 0 | 0.434464 | 0.240516 | -0.117872 | 1.407808 | 1.084599 | 1.574911 | -1.624993 | 0.106601 | 0.0 | ... | -0.089573 | -0.513003 | 3.101961 | 1.234359 | 1.151361 | 0.988237 | 0.614409 | 0.679219 | 0.704437 | 0.701936 |
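
To see what the `stratify` argument does in the split above, here is a minimal pure-Python sketch of a stratified split on synthetic labels. It is not scikit-learn's implementation: the helper `stratified_split` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) that keep per-class proportions roughly equal."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Take the same fraction of every class for the holdout part.
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced labels: 70 zeros and 30 ones (hypothetical data).
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

train_pos_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f'train positives: {train_pos_rate:.2f}, test positives: {test_pos_rate:.2f}')
```

Both parts end up with the same share of positive rows, which keeps the holdout score comparable to the scores measured during training.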


## 1. Task definition

### 1.1. Task type


In the cell below we create a `Task` object - the class that defines which task the LightAutoML model should solve, with a specific loss and metric if necessary (more info can be found [here](https://lightautoml.readthedocs.io/en/latest/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task) in our documentation):

In [6]:
task = Task('binary', device='gpu')

### 1.2. Feature roles setup

To solve the task, we need to set up column roles. The **only role you must set up is the target role**; everything else (drop, numeric, categorical, group, weights etc.) is up to the user - LightAutoML models have automatic column typization inside:

In [7]:
roles = {
    'target': 'class',
}
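
As an illustration of an optional role, the sketch below adds a `drop` entry and checks that every column named in the roles dict exists. The `row_id` column and the shortened column list are hypothetical (the higgs data has no such column), and `columns_in_roles` is a helper written for this example, not a LightAutoML function.

```python
# Hypothetical column list; 'row_id' is NOT part of the higgs data used above.
columns = ['row_id', 'class', 'lepton_pT', 'lepton_eta']

# Roles dict with the mandatory target role plus an optional 'drop' role.
roles_with_drop = {
    'target': 'class',
    'drop': ['row_id'],
}


def columns_in_roles(roles):
    """Flatten a roles dict into the list of column names it mentions."""
    cols = []
    for value in roles.values():
        cols.extend([value] if isinstance(value, str) else list(value))
    return cols


# Columns referenced in roles but absent from the data would indicate a typo.
missing = [c for c in columns_in_roles(roles_with_drop) if c not in columns]
print(f'Columns missing from the data: {missing}')
```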

### 1.3. LightAutoML model creation - TabularAutoML preset

In the next cell we create a LightAutoML model with the `TabularAutoMLGPU` class (the GPU version of the `TabularAutoML` preset), which has a default model structure like in the image below:

<img src="../../imgs/tutorial_blackbox_pipeline.png" alt="TabularAutoML preset pipeline" style="width:85%;"/>

All of this takes just a few lines. Let's go through the parameters we can set:
- `task` - the type of the ML task (the only **must have** parameter)
- `timeout` - time limit in seconds for model training
- `cpu_limit` - vCPU count for the model to use
- `reader_params` - parameter changes for the Reader object inside the preset, which handles the first step of data preparation: automatic feature typization, preliminary filtering of almost-constant features, correct CV setup etc. For example, we set `n_jobs` threads for the typization algorithm, `cv` folds and `random_state` as the seed for the inner CV.

**Important note**: `reader_params` is one of the keys of the YAML config used inside the `TabularAutoML` preset. [More details](https://github.com/sberbank-ai-lab/LightAutoML/blob/master/lightautoml/automl/presets/tabular_config.yml) on its structure, with explanatory comments, can be found at the attached link. Each key from this config can be overridden with user settings when the preset object is initialized. For more information about the available parameters (for example, the ML algorithms that can be used in `general_params->use_algos`), please take a look at our [article on TowardsDataScience](https://towardsdatascience.com/lightautoml-preset-usage-tutorial-2cce7da6f936).

Moreover, to receive the automatic report for our model we will use the `ReportDeco` decorator and work with the decorated version in the same way as we do with the usual one (see section 4.1).
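
One way to picture how user settings override config defaults is a recursive dictionary merge. The sketch below is generic Python under that assumption, not LightAutoML's actual code; `deep_update` is a hypothetical helper and the default values are made up.

```python
import copy


def deep_update(defaults, overrides):
    """Return a copy of `defaults` with `overrides` merged in recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested section: merge key by key instead of replacing it whole.
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Made-up defaults standing in for the YAML config (illustrative values only).
default_config = {
    'reader_params': {'n_jobs': 1, 'cv': 5, 'random_state': 0},
    'general_params': {'use_algos': 'auto'},
}

# The user changes only the keys they care about.
user_settings = {'reader_params': {'n_jobs': 2, 'random_state': 42}}

config = deep_update(default_config, user_settings)
print(config['reader_params'])
```

Keys that the user does not mention, such as `cv` here, keep their default values.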

In [8]:
automl = TabularAutoMLGPU(
    task=task,
    timeout=TIMEOUT,
    reader_params={'n_jobs': 2, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)

## 2. AutoML training

To run AutoML training, use the `fit_predict` method with the following arguments:

- `train_data` - dataset to train on.
- `roles` - roles dict.
- `verbose` - controls the verbosity: the higher the value, the more messages are shown.
    - `<1`: messages are not displayed;
    - `>=1`: the computation process for layers is displayed;
    - `>=2`: the information about fold processing is also displayed;
    - `>=3`: the hyperparameter optimization process is also displayed;
    - `>=4`: the training process for every algorithm is displayed.

Note: out-of-fold predictions are calculated during training and returned by the `fit_predict` method.
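
The idea of out-of-fold prediction can be sketched in a few lines of plain Python. The example uses a toy "model" that predicts the mean target of its training folds; it only illustrates the mechanics and is not how LightAutoML trains its models. The helpers `k_fold_indices` and `oof_predict` are hypothetical names.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous validation blocks."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def oof_predict(y, n_folds):
    """Out-of-fold predictions from a toy model: the mean target of the training folds."""
    oof = [None] * len(y)
    for valid_idx in k_fold_indices(len(y), n_folds):
        valid_set = set(valid_idx)
        # "Fit" on every row outside the current validation fold.
        train_targets = [t for i, t in enumerate(y) if i not in valid_set]
        prediction = sum(train_targets) / len(train_targets)
        # Each row is predicted by a model that never saw it during fitting.
        for i in valid_idx:
            oof[i] = prediction
    return oof


y = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # toy binary target
oof = oof_predict(y, n_folds=5)
print(oof)
```

Because every row receives a prediction from a model fitted without it, scoring these predictions gives a validation estimate for the whole training set.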

In [9]:
%%time 
oof_pred = automl.fit_predict(tr_data, roles = roles, verbose = 1)

[11:14:23] Stdout logging level is INFO.
[11:14:23] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[11:14:23] Task: binary

[11:14:23] Start automl preset with listed constraints:
[11:14:23] - time: 300.00 seconds
[11:14:23] - CPU: 4 cores
[11:14:23] - memory: 16 GB

[11:14:23] Train data shape: (78440, 29)
[11:14:32] Feats was rejected during automatic roles guess: []
[11:14:32] Layer 1 train process start. Time left 291.06 secs
[11:14:32] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:14:44] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7598962501666371
[11:14:44] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:14:44] Time left 279.46 secs

Default metric period is 5 because AUC is/are not implemented for GPU

[11:14:56] Selector_CatBoostGPU fitting and predicting completed
[11:14:56] Start fitting Lvl_0_Pipe_1_Mod_0_CatBoostGPU ...

Default metric period is 5 because AUC is/are not implemented for GPU
Default metric period is 5 because AUC is/are not implemented for GPU

[11:15:18] Time limit exceeded after calculating fold 1
[11:15:18] Fitting Lvl_0_Pipe_1_Mod_0_CatBoostGPU finished. score = 0.8085333969760496
[11:15:18] Lvl_0_Pipe_1_Mod_0_CatBoostGPU fitting and predicting completed
[11:15:18] Start fitting Lvl_0_Pipe_1_Mod_2_XGB ...
[11:15:39] Fitting Lvl_0_Pipe_1_Mod_2_XGB finished. score = 0.8064250733140045
[11:15:39] Lvl_0_Pipe_1_Mod_2_XGB fitting and predicting completed
[11:15:39] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_XGB ... Time budget is 128.75 secs
[11:17:50] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_XGB completed
[11:17:50] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_XGB ...
[11:18:04] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_XGB finished. score = 0.8077146571412942
[11:18:04] Lvl_0_Pipe_1_Mod_3_Tuned_XGB fitting and predicting completed
[11:18:04] Time left 78.74 secs

[11:18:04] Time limit exceeded in one 

## 3. Prediction on holdout and model evaluation

In [10]:
%%time

te_pred = automl.predict(te_data)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')

Prediction for te_data:
array([[0.4453373 ],
       [0.3911723 ],
       [0.87689215],
       ...,
       [0.6192466 ],
       [0.89121807],
       [0.67450213]], dtype=float32)
Shape = (19610, 1)
CPU times: user 330 ms, sys: 12.3 ms, total: 342 ms
Wall time: 539 ms


In [11]:
auc_val = roc_auc_score(tr_data[data_info_['target']].values, oof_pred.data[:, 0])
print(f'OOF score: {auc_val}')
auc_test = roc_auc_score(te_data[data_info_['target']].values, te_pred.data[:, 0])
print(f'HOLDOUT score: {auc_test}')

OOF score: 0.8091652926417789
HOLDOUT score: 0.8116664670828967
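
For intuition about the metric used above: ROC AUC equals the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force on four made-up rows; it illustrates the definition and is not a replacement for `roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy labels and scores (made up for the example).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(f'AUC: {roc_auc(y_true, y_score)}')
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. The pairwise loop is quadratic in the number of rows, so it is only suitable for small examples.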


## 4. Model analysis

### 4.1. Reports

You can obtain the description of the resulting pipeline:

In [12]:
print(automl.create_model_str_desc())

Final prediction for new objects (level 0) = 
	 0.35872 * (2 averaged models Lvl_0_Pipe_1_Mod_0_CatBoostGPU) +
	 0.23596 * (5 averaged models Lvl_0_Pipe_1_Mod_2_XGB) +
	 0.40532 * (5 averaged models Lvl_0_Pipe_1_Mod_3_Tuned_XGB) 
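
The description above is a weighted average of the three models' predictions. A minimal sketch of that blending step, using the weights printed above and made-up per-model probabilities for three rows:

```python
# Blend weights copied from the model description above.
weights = {
    'Lvl_0_Pipe_1_Mod_0_CatBoostGPU': 0.35872,
    'Lvl_0_Pipe_1_Mod_2_XGB': 0.23596,
    'Lvl_0_Pipe_1_Mod_3_Tuned_XGB': 0.40532,
}

# Made-up probabilities for three rows (illustrative values only).
model_preds = {
    'Lvl_0_Pipe_1_Mod_0_CatBoostGPU': [0.40, 0.90, 0.10],
    'Lvl_0_Pipe_1_Mod_2_XGB': [0.50, 0.80, 0.20],
    'Lvl_0_Pipe_1_Mod_3_Tuned_XGB': [0.60, 0.85, 0.15],
}


def blend(model_preds, weights):
    """Weighted sum of per-model predictions, computed row by row."""
    n_rows = len(next(iter(model_preds.values())))
    return [
        sum(weights[name] * preds[row] for name, preds in model_preds.items())
        for row in range(n_rows)
    ]


blended = blend(model_preds, weights)
print(f'Weights sum: {sum(weights.values()):.5f}')
print(f'Blended predictions: {blended}')
```

Since the weights sum to one, the blended values stay within the range of the individual model predictions.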


LightAutoML also provides `ReportDeco` for this purpose; use it to build reports:

In [13]:
RD = ReportDeco(output_path = 'tabularAutoML_model_report')

automl_rd = RD(
    TabularAutoMLGPU(
        task = task, 
        timeout = TIMEOUT,
        reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
    )
)

In [14]:
%%time
oof_pred = automl_rd.fit_predict(tr_data, roles = roles, verbose = 1)

[11:18:06] Stdout logging level is INFO.
[11:18:06] Task: binary

[11:18:06] Start automl preset with listed constraints:
[11:18:06] - time: 300.00 seconds
[11:18:06] - CPU: 4 cores
[11:18:06] - memory: 16 GB

[11:18:06] Train data shape: (78440, 29)
[11:18:16] Feats was rejected during automatic roles guess: []
[11:18:16] Layer 1 train process start. Time left 290.02 secs
[11:18:16] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:18:25] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7598962501666371
[11:18:25] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:18:25] Time left 280.35 secs

Default metric period is 5 because AUC is/are not implemented for GPU

[11:18:37] Selector_CatBoostGPU fitting and predicting completed
[11:18:37] Start fitting Lvl_0_Pipe_1_Mod_0_CatBoostGPU ...

Default metric period is 5 because AUC is/are not implemented for GPU
Default metric period is 5 because AUC is/are not implemented for GPU

[11:19:00] Time limit exceeded after calculating fold 1
[11:19:00] Fitting Lvl_0_Pipe_1_Mod_0_CatBoostGPU finished. score = 0.808178117586074
[11:19:00] Lvl_0_Pipe_1_Mod_0_CatBoostGPU fitting and predicting completed
[11:19:00] Start fitting Lvl_0_Pipe_1_Mod_2_XGB ...
[11:19:22] Fitting Lvl_0_Pipe_1_Mod_2_XGB finished. score = 0.8064250733140045
[11:19:22] Lvl_0_Pipe_1_Mod_2_XGB fitting and predicting completed
[11:19:22] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_XGB ... Time budget is 126.02 secs
[11:21:32] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_XGB completed
[11:21:32] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_XGB ...
[11:21:46] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_XGB finished. score = 0.8077146571412942
[11:21:46] Lvl_0_Pipe_1_Mod_3_Tuned_XGB fitting and predicting completed
[11:21:46] Time left 79.42 secs

[11:21:46] Time limit exceeded in one o


`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(


CPU times: user 4min 1s, sys: 42.7 s, total: 4min 43s
Wall time: 3min 45s


The report is now available in the `tabularAutoML_model_report` folder:

In [15]:
!ls tabularAutoML_model_report

feature_importance.png		       test_roc_curve_1.png
lama_interactive_report.html	       valid_distribution_of_logits.png
test_distribution_of_logits_1.png      valid_pie_f1_metric.png
test_pie_f1_metric_1.png	       valid_pr_curve.png
test_pr_curve_1.png		       valid_preds_distribution_by_bins.png
test_preds_distribution_by_bins_1.png  valid_roc_curve.png


In [16]:
%%time

te_pred = automl_rd.predict(te_data)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')


`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(


Prediction for te_data:
array([[0.4418166 ],
       [0.3866924 ],
       [0.8699182 ],
       ...,
       [0.61608374],
       [0.89125025],
       [0.6804474 ]], dtype=float32)
Shape = (19610, 1)
CPU times: user 6.95 s, sys: 4.17 s, total: 11.1 s
Wall time: 2.98 s


In [17]:
auc_val = roc_auc_score(tr_data[data_info_['target']].values, oof_pred.data[:, 0])
print(f'OOF score: {auc_val}')
auc_test = roc_auc_score(te_data[data_info_['target']].values, te_pred.data[:, 0])
print(f'HOLDOUT score: {auc_test}')

OOF score: 0.8091149385362273
HOLDOUT score: 0.8117723169223295


## 5. Multi-GPU results

Here is an example of how to run a multi-GPU configuration.

In [18]:
import cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

`cluster` is an object that connects all GPUs and handles their communication. Pass the indices of the GPUs you want to use for LightAutoML training through the `CUDA_VISIBLE_DEVICES` parameter.

Other settings are also passed to `cluster`, but you can leave them unchanged, as shown in the example.

After that, a `client` instance is created; pass it to the `automl` object if you want to run multi-GPU training.

Finally, you can run training.

In [19]:
cluster = LocalCUDACluster(rmm_managed_memory=True, CUDA_VISIBLE_DEVICES="0, 1",
                               protocol="ucx", enable_nvlink=True,
                               memory_limit="30GB")

client = Client(cluster)
client.run(cudf.set_allocator, "managed")

Perhaps you already have a cluster running?
Hosting the HTTP server on port 34015 instead




2023-10-12 11:21:57,865 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-tsyldut0', purging
2023-10-12 11:21:57,866 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-aaesqj3s', purging
2023-10-12 11:21:57,866 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-10-12 11:21:57,866 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-10-12 11:21:57,992 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-10-12 11:21:57,992 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize


{'ucx://127.0.0.1:48475': None, 'ucx://127.0.0.1:57647': None}

In [20]:
%%time
# task = Task(task_types['higgs.csv'], device='mgpu')

automl = TabularAutoMLGPU(
    task=Task('binary', device='mgpu'),
    timeout=TIMEOUT,
    config_path='./data/dp.yml',
    client=client,
    general_params = {'parallel_folds': True} # stands for compute parallel. True for DataParallel
)


oof_pred = automl.fit_predict(tr_data, roles = roles, verbose = 1)

[11:22:02] Stdout logging level is INFO.
[11:22:02] Task: binary

[11:22:02] Start automl preset with listed constraints:
[11:22:02] - time: 300.00 seconds
[11:22:02] - CPU: 4 cores
[11:22:02] - memory: 16 GB

[11:22:02] Train data shape: (78440, 29)
[11:22:10] Feats was rejected during automatic roles guess: []
[11:22:10] Layer 1 train process start. Time left 292.15 secs
[11:22:10] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[11:22:30] Time limit exceeded after calculating fold(s) [0 1]
[11:22:30] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = 0.7619489118408146
[11:22:30] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[11:22:30] Time left 271.66 secs

Default metric period is 5 because AUC is/are not implemented for GPU

[11:22:41] Selector_CatBoostGPU fitting and predicting completed
[11:22:42] Start fitting Lvl_0_Pipe_1_Mod_0_CatBoostGPU ...

Default metric period is 5 because AUC is/are not implemented for GPU
Default metric period is 5 because AUC is/are not implemented for GPU
Default metric period is 5 because AUC is/are not implemented for GPU
Default metric period is 5 because AUC is/are not implemented for GPU

[11:23:03] Time limit exceeded after calculating fold(s) [2 3]
[11:23:03] Fitting Lvl_0_Pipe_1_Mod_0_CatBoostGPU finished. score = 0.8061328238067021
[11:23:03] Lvl_0_Pipe_1_Mod_0_CatBoostGPU fitting and predicting completed
[11:23:03] Start fitting Lvl_0_Pipe_1_Mod_2_XGB ...
[11:23:19] Fitting Lvl_0_Pipe_1_Mod_2_XGB finished. score = 0.8064250733140045
[11:23:19] Lvl_0_Pipe_1_Mod_2_XGB fitting and predicting completed
[11:23:19] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_XGB ... Time budget is 97.50 secs
[11:25:05] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_XGB completed
[11:25:05] Start fitting Lvl_0_Pipe_1_Mod_3_Tuned_XGB ...
[11:25:16] Fitting Lvl_0_Pipe_1_Mod_3_Tuned_XGB finished. score = 0.8082896600400371
[11:25:16] Lvl_0_Pipe_1_Mod_3_Tuned_XGB fitting and predicting completed
[11:25:16] Time left 105.90 secs

[11:25:16] Time limit exceeded 

In [21]:
te_pred = automl.predict(te_data)

In [22]:
automl.predict(te_data)

array([[0.4493517 ],
       [0.37712663],
       [0.8564067 ],
       ...,
       [0.6270436 ],
       [0.90256083],
       [0.6837865 ]], dtype=float32)

In [23]:
auc_test = roc_auc_score(te_data[data_info_['target']].values, te_pred.data[:, 0])
print(f'HOLDOUT score: {auc_test}')

HOLDOUT score: 0.8122404752233056
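
The multi-GPU log above reports folds in pairs (`[0 1]`, `[2 3]`), which is consistent with folds being processed in batches, one fold per GPU, when `parallel_folds` is enabled. The sketch below only illustrates that scheduling idea under this assumption; it is not LightAutoML's code, and `batch_folds` is a hypothetical helper.

```python
def batch_folds(n_folds, n_gpus):
    """Group fold indices into batches of at most n_gpus folds."""
    folds = list(range(n_folds))
    return [folds[i:i + n_gpus] for i in range(0, n_folds, n_gpus)]


# Five CV folds on two GPUs, matching N_FOLDS and the two visible devices above.
schedule = batch_folds(n_folds=5, n_gpus=2)

for step, batch in enumerate(schedule):
    # Within a batch, each fold is assigned to its own device.
    assignment = {f'gpu:{gpu}': fold for gpu, fold in enumerate(batch)}
    print(f'step {step}: {assignment}')
```

With an odd number of folds the last batch is smaller, so one device would be idle during that step.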


## Additional materials

- [Official LightAutoML github repo](https://github.com/sberbank-ai-lab/LightAutoML)
- [LightAutoML documentation](https://lightautoml.readthedocs.io/en/latest)