# Python Client for Driverless AI

A static, intuitive Python API for Driverless AI.

Documentation:
https://docs.h2o.ai/driverless-ai/pyclient/docs/html/index.html

# DriverlessAI Python Client - demo and training

## Python environment set up

It is recommended to use a virtual environment to run this notebook. The following steps will create a virtual environment and install the required packages.

To create a virtual environment, run the following command in the terminal:

```bash
python3.8 -m venv dai_verizon_training
```
To activate the virtual environment, run the following command in the terminal:
```bash
source dai_verizon_training/bin/activate
```
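
To confirm that the environment is active before installing anything, the interpreter can be checked from Python. This is a minimal sketch using only the standard library; `in_virtualenv` is a helper name introduced here for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the `source .../bin/activate` step was probably skipped in the current shell.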

Create a `requirements.txt` file listing the following packages:
```bash
pandas
driverlessai
matplotlib
scipy
scikit-learn
datatable
qtoml
ipykernel
```

To install the required packages, run the following command in the terminal:
```bash
pip install -r requirements.txt
```
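
After installation, a quick check can report which of the listed packages are present in the active environment. The sketch below uses `importlib.metadata` from the standard library (Python 3.8+); the helper name `installed_versions` is an assumption made for this example.

```python
from importlib import metadata

# Distribution names copied from the requirements.txt listing above.
REQUIRED = ["pandas", "driverlessai", "matplotlib", "scipy",
            "scikit-learn", "datatable", "qtoml", "ipykernel"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```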

To use with VS Code Jupyter Notebook kernel, run the following command in the terminal:
```bash
python -m ipykernel install --name=dai_verizon_training
```

**Note**  
If you prefer to use Jupyter Notebook, you can install the required packages using the following command:
```bash
pip install notebook
```

To deactivate the virtual environment, run the following command in the terminal:
```bash
deactivate
```



In [1]:
import driverlessai
import pandas as pd
import json
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

#import h2o
import datatable as dt
from zipfile import ZIP_DEFLATED, ZipFile
import qtoml
from typing import Dict
import os
import uuid
import pprint

In [2]:
driverlessai.__version__

'1.10.5.post1'

## Connect to a standalone DAI instance

Replace the value of the variable `dai_url` with the URL of your DAI instance, as provided in the Aquarium Lab.
![Alt text](Aquarium_Lab_selection.png)

![Alt text](Aquarium_Lab_DAI_training.png)


In [3]:
import warnings
warnings.filterwarnings('ignore')

user_name = "training"
user_password = "training"
dai_url = "https://34-209-212-180.aquarium-instance.h2o.ai/"

dai = driverlessai.Client(
    address=dai_url, 
    username=user_name,
    password=user_password,
    verify=False
)
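
Outside a training lab, hard-coded credentials are usually best avoided. One possible approach, sketched below, is to build the keyword arguments for `driverlessai.Client` from environment variables. The variable names `DAI_URL`, `DAI_USERNAME`, `DAI_PASSWORD` and `DAI_VERIFY_TLS` are hypothetical choices for this example and are not defined by the client library; the keyword names `address`, `username`, `password` and `verify` are the ones used in the cell above.

```python
import os


def connection_settings(env=os.environ):
    """Build keyword arguments for driverlessai.Client from a mapping.

    DAI_URL, DAI_USERNAME, DAI_PASSWORD and DAI_VERIFY_TLS are hypothetical
    variable names chosen for this sketch.
    """
    required = ("DAI_URL", "DAI_USERNAME", "DAI_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "address": env["DAI_URL"],
        "username": env["DAI_USERNAME"],
        "password": env["DAI_PASSWORD"],
        # Certificate verification stays on unless explicitly disabled.
        "verify": env.get("DAI_VERIFY_TLS", "true").lower() != "false",
    }


# Demonstration with an in-memory mapping instead of real environment variables.
demo_env = {
    "DAI_URL": "https://dai.example.invalid/",
    "DAI_USERNAME": "training",
    "DAI_PASSWORD": "training",
    "DAI_VERIFY_TLS": "false",
}
kwargs = connection_settings(demo_env)
print({k: ("***" if k == "password" else v) for k, v in kwargs.items()})
# With the client installed: dai = driverlessai.Client(**kwargs)
```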

In [4]:
# Verify connection and the server version
dai.server.version

'1.10.5'

**Example of how to access function help from the notebook**

In [5]:
help(dai.datasets)

Help on Datasets in module driverlessai._datasets object:

class Datasets(builtins.object)
 |  Datasets(client: '_core.Client') -> None
 |  
 |  Interact with datasets on the Driverless AI server.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, client: '_core.Client') -> None
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  create(self, data: Union[str, ForwardRef('pandas.DataFrame')], data_source: str = 'upload', data_source_config: Dict[str, str] = None, force: bool = False, name: str = None) -> 'Dataset'
 |      Create a dataset on the Driverless AI server and return a Dataset
 |      object corresponding to the created dataset.
 |      
 |      Args:
 |          data: path to data file(s) or folder, a Pandas DataFrame,
 |              or query string for SQL based data sources
 |          data_source: name of connector to use for data transfer
 |              (use ``driverlessai.connectors.list()`` to see configured names)
 |          data_sour

## Datasets

**List available datasets**

In [6]:
dai.datasets.list()

    | Type    | Key                                  | Name
----+---------+--------------------------------------+----------------------------
  0 | Dataset | f7a57df8-da2c-11ed-bbe0-0242ac110002 | CustomerChurn.test
  1 | Dataset | f7a51052-da2c-11ed-bbe0-0242ac110002 | CustomerChurn.train
  2 | Dataset | ddaf4dac-da2c-11ed-bbe0-0242ac110002 | CustomerChurnImbalance.csv
  3 | Dataset | c8754950-da09-11ed-bbe0-0242ac110002 | AmazonReviews.test
  4 | Dataset | c87493de-da09-11ed-bbe0-0242ac110002 | AmazonReviews.train
  5 | Dataset | 3ff5c726-da09-11ed-bbe0-0242ac110002 | AmazonFineFoodReviews.csv
  6 | Dataset | a2e7d276-da03-11ed-bbe0-0242ac110002 | CallCenterSmall.test
  7 | Dataset | a2e75904-da03-11ed-bbe0-0242ac110002 | CallCenterSmall.train
  8 | Dataset | 2feaf03c-da03-11ed-bbe0-0242ac110002 | CallCenterSmall.csv
  9 | Dataset | 47919262-d8c4-11ed-8c99-0242ac110002 | Credit_Score_Test.csv
 10 | Dataset | 40dffc1a-d8c4-11ed-8c99-0242ac110002 | Credit_Score_Train.csv
 11 | Dataset

Upload a dataset to the DAI server. 

First, list available connectors.

In [7]:
dai.connectors.list()

['upload', 'file', 'hdfs', 's3', 'recipe_file', 'recipe_url']

In [8]:
help(dai.datasets.create)

Help on method create in module driverlessai._datasets:

create(data: Union[str, ForwardRef('pandas.DataFrame')], data_source: str = 'upload', data_source_config: Dict[str, str] = None, force: bool = False, name: str = None) -> 'Dataset' method of driverlessai._datasets.Datasets instance
    Create a dataset on the Driverless AI server and return a Dataset
    object corresponding to the created dataset.
    
    Args:
        data: path to data file(s) or folder, a Pandas DataFrame,
            or query string for SQL based data sources
        data_source: name of connector to use for data transfer
            (use ``driverlessai.connectors.list()`` to see configured names)
        data_source_config: dictionary of configuration options for
            advanced connectors
        force: create new dataset even if dataset with same name already exists
        name: dataset name on the Driverless AI server
    
    Examples::
    
        ds = dai.datasets.create(
            data='s3:

Simple connector: create a dataset in DAI from a local file (relative to the notebook location) using the default option `data_source="upload"`.

In [9]:
ds = dai.datasets.create('./data/creditcard_small.csv', force=True)

Complete 100.00% - [4/4] Computed stats for column default payment next month


`ds` is now a pointer to the dataset in DAI.

In [10]:
ds.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAt_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMt1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,20000,2,2,1,24,-2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,True
2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,True
3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,False
4,50000,2,2,1,37,1,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,False
5,50000,1,2,1,57,2,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,False


Example use of the JDBC connector to create a DAI dataset from a PostgreSQL database.
```python
jdbc_config = {
    'jdbc_jar': '/data/postgresql-42.2.9.jar',
    'jdbc_driver': 'org.postgresql.Driver',
    'jdbc_url': 'jdbc:postgresql://mr-dl2:5432/h2oaidev',
    'jdbc_username': 'h2oaitester',
    'jdbc_password': 'h2oaitesterreadonly'
}

jdbc_ds = dai.datasets.create(   
    data='SELECT * FROM creditcardtrain',
    data_source='jdbc',
    data_source_config=jdbc_config,
    name='beta_jdbc_test'
)
```
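
The connector list printed earlier for this lab instance does not include `jdbc`, so the JDBC call would only work on a server where that connector is configured. A small guard like the one below can make that explicit before calling `dai.datasets.create`. This is a sketch; `require_connector` is a helper name introduced here, and the list is the one shown in the `dai.connectors.list()` output above.

```python
def require_connector(name, available):
    """Return name if it is among the configured connectors, else raise ValueError."""
    if name not in available:
        raise ValueError(
            f"Connector {name!r} is not configured; available: {sorted(available)}"
        )
    return name


# Connector names as printed by dai.connectors.list() in this notebook.
available = ['upload', 'file', 'hdfs', 's3', 'recipe_file', 'recipe_url']

print(require_connector("upload", available))
try:
    require_connector("jdbc", available)
except ValueError as err:
    print(err)
```

Against a live server the second argument would be `dai.connectors.list()`.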

Set column types and datetime formats. This is only needed to "force" DAI to consider specific logical data types, and is equivalent to modifying them via the GUI.

Valid logical types:

- ``'categorical'``
- ``'date'``
- ``'datetime'``
- ``'id'``
- ``'numerical'``
- ``'text'``
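
A typo in a logical type name is easy to make, so it can help to validate the mapping locally before sending it to the server. The helper below is a sketch that checks a mapping against the list above; `check_logical_types` is a name introduced for this example and is not part of the client.

```python
VALID_LOGICAL_TYPES = {"categorical", "date", "datetime", "id", "numerical", "text"}


def check_logical_types(mapping):
    """Raise ValueError if any requested logical type is not in the valid set."""
    for column, types in mapping.items():
        # Accept either a single type name or a list of names per column.
        requested = [types] if isinstance(types, str) else list(types)
        unknown = sorted(set(requested) - VALID_LOGICAL_TYPES)
        if unknown:
            raise ValueError(f"Column {column!r}: unknown logical type(s) {unknown}")
    return mapping


print(check_logical_types({"SEX": ["numerical", "categorical"]}))
try:
    check_logical_types({"SEX": ["categoric"]})
except ValueError as err:
    print(err)
```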


In [11]:
ds.set_logical_types({'SEX': ['numerical', 'categorical']})
ds.set_datetime_format({'AGE': 'YY'})

Display dataset metadata

In [12]:
print(ds.name, "|", ds.key)
print("Creation Timestamp:", ds.creation_timestamp)
print("Columns:", ds.columns)
print('Shape:', ds.shape)
print("Head:")
display(ds.head())
print("Tail:")
display(ds.tail())
print("Summary:")
print(ds.column_summaries()[1:3])

creditcard_small.csv | 8052b888-30a8-11ee-b104-0242ac110002
Creation Timestamp: 1690921076.0646183
Columns: ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAt_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMt1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default payment next month']
Shape: (23999, 25)
Head:


ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAt_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMt1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,20000,2,2,1,24,-2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,True
2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,True
3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,False
4,50000,2,2,1,37,1,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,False
5,50000,1,2,1,57,2,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,False


Tail:


ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAt_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMt1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
23995,30000,1,2,2,25,0,0,0,0,0,0,30274,30517,30485,30533,29148,29782,1932,3900,1700,1200,1180,2500,False
23996,80000,1,2,1,25,1,2,2,0,0,0,80906,82789,80903,80215,63296,49854,3800,6,3636,2646,2000,1830,False
23997,20000,1,2,1,25,0,0,0,0,0,0,14447,15455,17562,17322,17119,17350,1552,2659,1419,606,500,1000,False
23998,10000,1,2,2,26,0,0,0,0,0,0,8882,9933,9825,17506,16608,9176,1300,2200,1300,320,1820,1000,False
23999,20000,1,5,2,26,0,0,0,0,0,-2,20564,20284,19394,39950,0,0,3055,1467,1096,1000,0,0,False


Summary:
--- LIMIT_BAL ---

 1e+04|████████████████████
      |████████████
      |████████
      |████
      |██
      |
      |
      |
      |
 1e+06|

Data Type: int
Logical Types: []
Datetime Format: 
Count: 23999
Missing: 0
Mean: 1.65e+05
SD: 1.29e+05
Min: 1e+04
Max: 1e+06
Unique: 79
Freq: 2740
--- SEX ---

 1|████████████
 2|████████████████████

Data Type: int
Logical Types: ['categorical', 'numerical']
Datetime Format: 
Count: 23999
Missing: 0
Mean: 1.63
SD: 0.483
Min: 1
Max: 2
Unique: 2
Freq: 15078
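
The creation timestamp printed above looks like seconds since the Unix epoch. Assuming that interpretation (it is consistent with the experiment date shown later in this notebook), it can be converted to a readable UTC time with the standard library:

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds: float) -> datetime:
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Value copied from the "Creation Timestamp" line in the output above.
created = to_utc(1690921076.0646183)
print(created.isoformat(timespec="seconds"))  # 2023-08-01T20:17:56+00:00
```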



Split into train and test.

In [13]:
ds_split = ds.split_to_train_test(
    train_size=0.7, 
    train_name='cc_train', 
    test_name='cc_test',
    seed=1
)
display(ds_split)

Complete


{'train_dataset': <class 'Dataset'> 8207414e-30a8-11ee-b104-0242ac110002 cc_train,
 'test_dataset': <class 'Dataset'> 820799be-30a8-11ee-b104-0242ac110002 cc_test}
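
As a sanity check on the split, the expected row counts can be worked out locally. The sketch below assumes the train size is the row count times `train_size`, truncated to an integer; how the server actually rounds is not documented in this notebook, but the result agrees with the shapes reported in the experiment summary further down.

```python
def split_sizes(n_rows: int, train_size: float):
    """Return (train_rows, test_rows), truncating the train share to an integer."""
    n_train = int(n_rows * train_size)
    return n_train, n_rows - n_train


# 23999 rows is the dataset shape printed earlier.
print(split_sizes(23999, 0.7))  # (16799, 7200)
```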

View top records in the `train_dataset`:

In [14]:
ds_split["train_dataset"].head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAt_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMt1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,True
4,50000,2,2,1,37,1,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,False
6,50000,1,1,2,37,3,0,0,0,0,0,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,False
9,140000,2,3,1,28,6,0,2,0,0,0,11285,14096,12108,12211,11793,3719,3329,0,432,1000,1000,1000,False
10,20000,1,3,2,35,7,-2,-2,-2,-1,-1,0,0,0,0,13007,13912,0,0,0,13007,1122,0,False


Download splits.

In [15]:
for dataset in ds_split.values():
    dataset.download()

Downloaded 'dasapoma.1690921078.9764829.csv'
Downloaded 'tuderuka.1690921079.0794213.csv'


In [16]:
dai.datasets.list()[:3]

    | Type    | Key                                  | Name
----+---------+--------------------------------------+----------------------
  0 | Dataset | 820799be-30a8-11ee-b104-0242ac110002 | cc_test
  1 | Dataset | 8207414e-30a8-11ee-b104-0242ac110002 | cc_train
  2 | Dataset | 8052b888-30a8-11ee-b104-0242ac110002 | creditcard_small.csv

Get a dataset by its key (**replace with the correct dataset key value**). Here the key is taken from the first dataset in the list.

In [17]:
dai.datasets.list()[0].key

'820799be-30a8-11ee-b104-0242ac110002'

In [18]:
dataset_key = dai.datasets.list()[0].key
test_tt = dai.datasets.get(dataset_key)
test_tt.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAt_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMt1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
8508,20000,2,2,2,53,4,4,3,3,2,2,18396,17826,18547,19118,18679,19812,0,1300,1170,0,1600,0,False
13818,160000,1,2,1,60,0,0,0,0,0,0,18056,19385,20093,20643,20920,21336,1626,1334,884,758,763,942,False
8374,20000,2,2,3,47,1,2,0,0,0,0,9325,8562,9571,9912,10036,10400,1000,1159,500,432,600,0,False
15578,180000,1,2,1,48,-1,-1,-1,-1,-1,-1,1294,1294,1466,1294,2324,264,1294,1466,1294,2324,264,264,False
22428,180000,2,2,1,34,0,-1,-1,-1,-1,-2,13954,5589,4312,2490,0,0,5589,4312,2490,0,0,150,False
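
Relying on list position is fragile because the listing changes as datasets are added. A lookup by name is one alternative. The sketch below uses `SimpleNamespace` stand-ins that only mimic the `.name` and `.key` attributes used elsewhere in this notebook; `find_by_name` is a helper introduced for illustration.

```python
from types import SimpleNamespace


def find_by_name(items, name):
    """Return the first item whose .name equals name, or None if there is none."""
    return next((item for item in items if item.name == name), None)


# Stand-ins for the objects returned by dai.datasets.list(); keys and names
# are copied from the listing above.
listing = [
    SimpleNamespace(name="cc_test", key="820799be-30a8-11ee-b104-0242ac110002"),
    SimpleNamespace(name="cc_train", key="8207414e-30a8-11ee-b104-0242ac110002"),
    SimpleNamespace(name="creditcard_small.csv", key="8052b888-30a8-11ee-b104-0242ac110002"),
]

match = find_by_name(listing, "cc_train")
print(match.key)
```

With a live connection, `find_by_name(dai.datasets.list(), "cc_train")` would be the equivalent call, and the resulting key could then be passed to `dai.datasets.get`.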


Get the first dataset from the list of all datasets.

In [19]:
test1 = dai.datasets.list()[0]
test1.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAt_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMt1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
8508,20000,2,2,2,53,4,4,3,3,2,2,18396,17826,18547,19118,18679,19812,0,1300,1170,0,1600,0,False
13818,160000,1,2,1,60,0,0,0,0,0,0,18056,19385,20093,20643,20920,21336,1626,1334,884,758,763,942,False
8374,20000,2,2,3,47,1,2,0,0,0,0,9325,8562,9571,9912,10036,10400,1000,1159,500,432,600,0,False
15578,180000,1,2,1,48,-1,-1,-1,-1,-1,-1,1294,1294,1466,1294,2324,264,1294,1466,1294,2324,264,264,False
22428,180000,2,2,1,34,0,-1,-1,-1,-1,-2,13954,5589,4312,2490,0,0,5589,4312,2490,0,0,150,False


In [20]:
del test_tt
del test1

### Data Recipes

It's possible to modify a dataset through a code snippet or recipe.  

Preview what a new dataset would look like if the code snippet transformation were applied.  
Keep the first 3 columns of the original dataset.

Official Recipe Github location: https://github.com/h2oai/driverlessai-recipes

In [21]:
code = "return X[:, [0,1,2]]"

ds.modify_by_code_preview(code)

Complete


ID,LIMIT_BAL,SEX
1,20000,2
2,120000,2
3,90000,2
4,50000,2
5,50000,1
6,50000,1
7,500000,1
8,100000,2
9,140000,2
10,20000,1


Apply the transformation to create a new dataset.

In [22]:
new_ds_1 = ds.modify_by_code(code, names=['modify_by_code'])
new_ds_1['modify_by_code'].head()

Complete                                                         


ID,LIMIT_BAL,SEX
1,20000,2
2,120000,2
3,90000,2
4,50000,2
5,50000,1


Apply a transformation from a recipe file.

In [23]:
!cat ./recipes/first_three_col.py

"""
This is a data recipe based on live data modification from GUI
"""

from h2oaicore.data import CustomData


class DataRecipeNapubinu(CustomData):
    _display_name = "Data Recipe Napubinu"
    _description = "Generated by live code"

    @staticmethod
    def create_data(X):
        import datatable as dt
        X = dt.Frame(X)
        return X[:, [0,1,2,]]  # return dt.Frame, pd.DataFrame, np.ndarray or a list or named dict of those


In [24]:
new_ds_2 = ds.modify_by_recipe('./recipes/first_three_col.py', names=['modify_by_recipe_file'])
new_ds_2['modify_by_recipe_file'].head()

Complete                                                         


ID,LIMIT_BAL,SEX
1,20000,2
2,120000,2
3,90000,2
4,50000,2
5,50000,1


Apply a transformation from a recipe URL.

In [25]:
#new_ds_3 = ds.modify_by_recipe(
#    'https://github.com/h2oai/driverlessai-recipes/raw/master/data/GroupAgg.py', 
#    names=['modify_by_recipe_url']
#)
#new_ds_3['modify_by_recipe_url'].head()

The new datasets are now on the Driverless AI server.

In [26]:
dai.datasets.list()[:6]

    | Type    | Key                                  | Name
----+---------+--------------------------------------+-----------------------
  0 | Dataset | 88a6fdfa-30a8-11ee-b104-0242ac110002 | modify_by_recipe_file
  1 | Dataset | 86a0746e-30a8-11ee-b104-0242ac110002 | modify_by_code
  2 | Dataset | 820799be-30a8-11ee-b104-0242ac110002 | cc_test
  3 | Dataset | 8207414e-30a8-11ee-b104-0242ac110002 | cc_train
  4 | Dataset | 8052b888-30a8-11ee-b104-0242ac110002 | creditcard_small.csv
  5 | Dataset | f7a57df8-da2c-11ed-bbe0-0242ac110002 | CustomerChurn.test

## Experiments

Search expert settings by keyword.

In [27]:
dai.experiments.search_expert_settings('ensemble')

recipe | default_value: auto
max_rows_final_blender | default_value: 1000000
min_rows_final_blender | default_value: 10000
ensemble_accuracy_switch | default_value: 5
num_ensemble_folds | default_value: 4
early_stopping | default_value: True
nfeatures_max | default_value: -1
ngenes_max | default_value: -1
nfeatures_min | default_value: -1
text_dominated_limit_tuning | default_value: True
image_dominated_limit_tuning | default_value: True
image_auto_num_final_models | default_value: 0
tournament_style | default_value: auto
tournament_remove_poor_scores_before_final_model_factor | default_value: 0.3
included_individuals | default_value: []
num_hyperopt_individuals_final | default_value: -1
drop_constant_model_final_ensemble | default_value: True
params_tensorflow | default_value: {}
min_learning_rate_final | default_value: 0.01
max_learning_rate_final | default_value: 0.05
fixed_ensemble_level | default_value: -1
ensemble_meta_learner | default_value: blender
cross_validate_meta_learner 

In [28]:
dai.experiments.search_expert_settings('rolling_test_method')

rolling_test_method | default_value: tta
rolling_test_method_max_splits | default_value: 1000


Define the settings dictionary that will be used to start the experiment.   
Accuracy is set to 10 to force a long-running experiment and demonstrate early stopping.  
The available models are also limited to LightGBM and XGBoost.

In [29]:
settings = {
    'task': 'classification',
    'target_column': "default payment next month",
    'accuracy': 10,
    'time': 10,
    'interpretability': 6,
    'scorer': 'AUCPR',
    'feature_brain_level': 0,
    'make_python_scoring_pipeline': 'off',
    'make_mojo_scoring_pipeline': 'off',
    'make_autoreport': False,
    'check_leakage': 'off',
    'max_nestimators': 100,
    'included_models': ["LightGBM", "XGBoostGBM"],
    'fixed_ensemble_level': 0,
    'fixed_num_folds': 3,
}
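
The accuracy, time and interpretability dials each take an integer from 1 to 10. A local check, sketched below, can catch out-of-range values before the settings are sent to the server; `check_dials` is a helper name introduced for this example.

```python
def check_dials(settings):
    """Raise ValueError if a dial is present but not an integer from 1 to 10."""
    for dial in ("accuracy", "time", "interpretability"):
        if dial not in settings:
            continue
        value = settings[dial]
        if not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"{dial} must be an integer from 1 to 10, got {value!r}")
    return settings


# Dial values taken from the settings dictionary above.
check_dials({"accuracy": 10, "time": 10, "interpretability": 6})
try:
    check_dials({"accuracy": 11})
except ValueError as err:
    print(err)
```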

Preview the experiment for the given datasets and settings.

In [30]:
dai.experiments.preview(
    **ds_split,
    **settings,
)

ACCURACY [10/10]:
- Training data size: *16,799 rows, 24 cols*
- Feature evolution: *[LightGBM, XGBoostGBM]*, *3-fold CV*
- Final pipeline: *One of [LightGBM, XGBoostGBM], single final model, validated with 3-fold CV*

TIME [10/10]:
- Feature evolution: *8 individuals*, up to *500 iterations*
- Early stopping: After *50* iterations of no improvement

INTERPRETABILITY [6/10]:
- Feature pre-pruning strategy: None
- Monotonicity constraints: disabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, CatOriginal, Cat, ClusterTE, Frequent, Interactions, NumCatTE, NumToCatTE, NumToCatWoE, OneHotEncoding, Original, WeightOfEvidence]

[LightGBM, XGBoostGBM] models to train:
- Model and feature tuning: *132*
- Feature evolution: *6024*
- Final pipeline: *4*
- MOJO DISABLED

Estimated runtime: *90 minutes*

Estimated max CPU memory usage: *1.0GB*
Finish/Abort (if not done) in: *90 minutes*/*7 days*


Start an experiment in async mode with our dataset splits and chosen settings.

In [31]:
ex = dai.experiments.create_async(
    **ds_split, 
    **settings
)

Experiment launched at: https://34-209-212-180.aquarium-instance.h2o.ai/#/experiment?key=8d98b592-30a8-11ee-b104-0242ac110002


Display some information about the experiment.

In [32]:
print("Name:", ex.name)
print("Datasets:", ex.datasets)
print("Train Dataset Head:")
display(ex.datasets['train_dataset'].head(1))
print("Target:", ex.settings['target_column'])
print("Scorer:", ex.metrics()['scorer'])
print("Task:", ex.settings['task'])
print("Status:", ex.status(verbose=2))
print("Creation Timestamp:", ex.creation_timestamp)
print("Run Duration:", ex.run_duration)
print("Web Page:", ex.gui())

Name: davupitu
Datasets: {'train_dataset': <class 'Dataset'> 8207414e-30a8-11ee-b104-0242ac110002 cc_train, 'validation_dataset': None, 'test_dataset': <class 'Dataset'> 820799be-30a8-11ee-b104-0242ac110002 cc_test}
Train Dataset Head:


ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAt_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMt1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,True


Target: default payment next month
Scorer: AUCPR
Task: classification
Status: Running 0.00% - Update configuration with overrides done.
Creation Timestamp: 1690921098.3901393
Run Duration: 0.00943136215209961
Web Page: https://34-209-212-180.aquarium-instance.h2o.ai/#/experiment?key=8d98b592-30a8-11ee-b104-0242ac110002


List running experiments.

In [33]:
for e in dai.experiments.list():
    if e.is_running():
        print(e)
        display(e.metrics())

davupitu (8d98b592-30a8-11ee-b104-0242ac110002)


{'scorer': 'AUCPR',
 'val_score': None,
 'val_score_sd': None,
 'val_roc_auc': None,
 'val_pr_auc': None,
 'test_score': None,
 'test_score_sd': None,
 'test_roc_auc': None,
 'test_pr_auc': None}

Monitor the experiment and finish it early once the validation score exceeds 0.53.

In [34]:
import time
from IPython.display import clear_output

while ex.is_running():
    time.sleep(1)
    # grab experiment status
    status = ex.status(verbose=2)
    # grab run duration 
    duration = ex.run_duration
    # grab current metrics
    metrics = ex.metrics()   
    # grab notifications
    notifications = ex.notifications()
    # pretty print info
    clear_output(wait=True)
    print(status, " - Run Time: ", duration, " - Validation ", metrics['scorer'], ": ", sep='', end='')
    if metrics['val_score'] is not None:
        print(round(metrics['val_score'], 4), '+/-', round(metrics['val_score_sd'], 4))
        if metrics['val_score'] > 0.53:
            ex.finish()
    else:
        print()
    if notifications:
        for n in notifications:
            print("\n", n["content"])
    print()
    ex.log.tail(3)
    time.sleep(5)
    
print("\nTest ", ex.metrics()['scorer'], ": ", round(ex.metrics()['test_score'], 4), sep='')

Finishing 98.00% - Testing Deployment... - Run Time: 123.17287278175354 - Validation AUCPR: 0.5279 +/- 0.0053

 Data is slightly imbalanced.
Target class fraction: 22.317%
Imbalance ratio: 1:3.481


 Automatically dropping ID column(s) during training: ['ID']. 
If unexpected, then check config.toml values for drop_id_columns, max_relative_cardinality, and max_absolute_cardinality.

 Number of tree iterations for tuning and feature evolution models (20) is smaller than iterations for final models (100). 
If unexpected, change max_nestimators_feature_evolution_factor=0.2 to 1.0 in expert settings.
Maximum learning rate for tuning and feature evolution models (0.1) can be larger than learning rate for final models (0.03):
If unexpected, change min_learning_rate, max_learning_rate, min_learning_rate_final, max_learning_rate_final in expert settings.
Scores shown during tuning and feature evolution may underestimate performance of final pipeline.
Please wait for final scores.


 Final model
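
The loop above runs until the experiment stops on its own or the score threshold is reached. A variant with an overall time limit is sketched below. `FakeExperiment` is a stand-in written for this example so the code runs without a server; it exposes only `is_running()`, `metrics()` and `finish()`, mirroring the calls used above, and `wait_for_score` is a hypothetical helper name.

```python
import time


def wait_for_score(experiment, target, timeout_s, poll_s=5.0):
    """Poll until the experiment stops, the score passes target, or time runs out."""
    deadline = time.monotonic() + timeout_s
    while experiment.is_running():
        score = experiment.metrics()["val_score"]
        if score is not None and score > target:
            experiment.finish()
            return "target reached"
        if time.monotonic() >= deadline:
            return "timed out"
        time.sleep(poll_s)
    return "completed"


class FakeExperiment:
    """Stand-in that replays a fixed sequence of validation scores."""

    def __init__(self, scores):
        self._scores = list(scores)
        self._current = None
        self._running = True

    def is_running(self):
        return self._running

    def metrics(self):
        # Advance to the next score while any remain; then repeat the last one.
        if self._scores:
            self._current = self._scores.pop(0)
        return {"val_score": self._current}

    def finish(self):
        self._running = False


fake = FakeExperiment([None, 0.51, 0.54])
result = wait_for_score(fake, target=0.53, timeout_s=10, poll_s=0.01)
print(result)
```

On timeout the helper only returns; whether to call `finish()` or leave the experiment running is left to the caller.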

View completed experiment metrics

In [35]:
print("\nTest ", ex.metrics()['scorer'], ": ", round(ex.metrics()['test_score'], 4), sep='')


Test AUCPR: 0.5295


View variable importance for completed experiment.

In [36]:
ex.variable_importance()

gain,interaction,description
1.0,7_OHE:PAt_0.5,One-hot encoding for column(s) ['PAt_0'] binned into 11 bins (sorting order - lexical). Bin # 5 with levels ['2.0'] [internal:lexical]
0.349721,2_OHE:PAY_2.5,One-hot encoding for column(s) ['PAY_2'] binned into 11 bins (sorting order - lexical). Bin # 5 with levels ['2.0'] [internal:lexical]
0.172756,5_OHE:PAY_5.4,One-hot encoding for column(s) ['PAY_5'] binned into 10 bins (sorting order - lexical). Bin # 4 with levels ['2.0'] [internal:lexical]
0.124092,17_PAY_AMT2,PAY_AMT2 (Original)
0.0860688,7_OHE:PAt_0.6,One-hot encoding for column(s) ['PAt_0'] binned into 11 bins (sorting order - lexical). Bin # 6 with levels ['3.0'] [internal:lexical]
0.0659421,16_PAY_AMT1,PAY_AMT1 (Original)
0.0621113,15_LIMIT_BAL,LIMIT_BAL (Original)
0.0584669,18_PAY_AMT3,PAY_AMT3 (Original)
0.0319246,7_OHE:PAt_0.4,One-hot encoding for column(s) ['PAt_0'] binned into 11 bins (sorting order - lexical). Bin # 4 with levels ['1.0'] [internal:lexical]
0.0275835,14_BILL_AMt1,BILL_AMt1 (Original)


In [37]:
ex.metrics()

{'scorer': 'AUCPR',
 'val_score': 0.5278693661337515,
 'val_score_sd': 0.005286134487174329,
 'val_roc_auc': 0.761491554836079,
 'val_pr_auc': 0.5278693661337515,
 'test_score': 0.5295384804743956,
 'test_score_sd': 0.01225873542165657,
 'test_roc_auc': 0.7653244059471658,
 'test_pr_auc': 0.5295384804743956}

Manage experiment artifacts.

In [38]:
ex.artifacts.list()

['logs', 'summary', 'test_predictions', 'train_predictions']

In [39]:
help(ex.artifacts)

Help on ExperimentArtifacts in module driverlessai._experiments object:

class ExperimentArtifacts(builtins.object)
 |  ExperimentArtifacts(experiment: 'Experiment') -> None
 |  
 |  Interact with files created by an experiment on the Driverless AI server.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, experiment: 'Experiment') -> None
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  create(self, artifact: str) -> None
 |      (Re)build certain artifacts, if possible.
 |      
 |      (re)buildable artifacts:
 |      
 |      - ``'autodoc'``
 |      - ``'mojo_pipeline'``
 |      - ``'python_pipeline'``
 |      
 |      Args:
 |          artifact: name of artifact to (re)build
 |  
 |  download(self, only: Union[str, List[str]] = None, dst_dir: str = '.', file_system: Union[ForwardRef('fsspec.spec.AbstractFileSystem'), NoneType] = None, include_columns: Union[List[str], NoneType] = None, overwrite: bool = False) -> Dict[str, str]
 |      Download e

Create the AutoDoc for the experiment, then download the AutoDoc, the experiment summary file, and the test predictions.

In [40]:
ex.artifacts.create('autodoc')

Generating autodoc...


In [41]:
print("Available artifacts:", ex.artifacts.list())
artifacts = ex.artifacts.download(['autodoc', 'summary','test_predictions'],
                                  dst_dir='./artifacts',
                                  overwrite=True)
pd.read_csv(artifacts['test_predictions']).head()

Available artifacts: ['autodoc', 'logs', 'summary', 'test_predictions', 'train_predictions']
Downloaded 'artifacts/report.docx'
Downloaded 'artifacts/h2oai_experiment_summary_8d98b592-30a8-11ee-b104-0242ac110002.zip'
Downloaded 'artifacts/test_preds.csv'


Unnamed: 0,default payment next month.0,default payment next month.1,default payment next month.predicted(th=0.26929)
0,0.656023,0.343977,1
1,0.869641,0.130359,0
2,0.64347,0.35653,1
3,0.850476,0.149524,0
4,0.883235,0.116765,0
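
The last column of the predictions file is the class label at the threshold named in its header (`th=0.26929`). The sketch below reproduces those labels from the class-1 probabilities shown above. It assumes a probability at or above the threshold maps to class 1; the output does not say whether the comparison is strict.

```python
# Threshold taken from the column header "predicted(th=0.26929)".
THRESHOLD = 0.26929

# Class-1 probabilities from the five rows shown above.
probabilities = [0.343977, 0.130359, 0.35653, 0.149524, 0.116765]

# Assumption: probability >= threshold is labelled 1, otherwise 0.
labels = [int(p >= THRESHOLD) for p in probabilities]
print(labels)  # [1, 0, 1, 0, 0]
```

The result matches the `predicted` column in the table, which suggests the threshold is applied to the class-1 probability.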


In [42]:
dai.experiments.list()[0:5]

    | Type       | Key                                  | Name
----+------------+--------------------------------------+--------------
  0 | Experiment | 8d98b592-30a8-11ee-b104-0242ac110002 | davupitu
  1 | Experiment | 8abfb07a-da34-11ed-bbe0-0242ac110002 | Downsample
  2 | Experiment | 10a90338-da2d-11ed-bbe0-0242ac110002 | Default
  3 | Experiment | 8fcf7c6a-da0e-11ed-bbe0-0242ac110002 | PyTorch NLP
  4 | Experiment | 250392da-da0a-11ed-bbe0-0242ac110002 | Baseline NLP

**Get existing experiment by key**

In [43]:
ex_baseline_nlp = dai.experiments.get("250392da-da0a-11ed-bbe0-0242ac110002")

In [44]:
ex_baseline_nlp.artifacts.list()

['autodoc',
 'logs',
 'mojo_pipeline',
 'python_pipeline',
 'summary',
 'test_predictions',
 'train_predictions']

Get the experiment settings.   
This is useful for starting a new experiment with similar settings. 

In [45]:
baseline_nlp_settings = ex_baseline_nlp.settings
for ind, (key, val) in enumerate(baseline_nlp_settings.items()):
    if ind < 10:
        print(ind, key, val)

0 task classification
1 target_column PositiveReview
2 drop_columns ['UserId', 'ProductId', 'Id', 'Score', 'HelpfulnessDenominator', 'ProfileName', 'HelpfulnessNumerator', 'Time']
3 accuracy 5
4 time 3
5 interpretability 7
6 scorer AUC
7 max_runtime_minutes 135
8 min_num_cores_per_gpu 2
9 cv_in_cv_overconfidence_protection on
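
To start a new experiment from these settings, copy the dictionary and override only what should change. The sketch below uses a plain dictionary with a few of the values printed above; `derive_settings` is a helper name introduced for this example.

```python
def derive_settings(base, drop=(), **overrides):
    """Return a copy of base without the keys in drop, updated with overrides."""
    derived = {key: value for key, value in base.items() if key not in drop}
    derived.update(overrides)
    return derived


# A subset of the settings printed above for the baseline NLP experiment.
base = {
    "task": "classification",
    "target_column": "PositiveReview",
    "accuracy": 5,
    "time": 3,
    "interpretability": 7,
    "scorer": "AUC",
    "max_runtime_minutes": 135,
}

new_settings = derive_settings(base, drop=("max_runtime_minutes",), accuracy=3)
print(new_settings)
```

The original dictionary is left unchanged, so the baseline settings stay available for comparison.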


View experiment summary.

In [46]:
ex.summary()

Status: Complete
Experiment: davupitu (8d98b592-30a8-11ee-b104-0242ac110002)
  Version: 1.10.5, 2023-08-01 20:20, Py client
  Settings: 10/10/6, seed=222569145, GPUs disabled
  Train data: cc_train (16799, 25)
  Validation data: N/A
  Test data: [Test] (7200, 24)
  Target column: default payment next month (binary, 22.317% target class)
System specs: Docker/Linux, 31 GB, 8 CPU cores, 0/0 GPU
  Max memory usage: 0.83 GB, 0 GB GPU, 0 GB MOJO
Recipe: AutoDL (7 iterations, 8 individuals)
  Validation scheme: stratified, 3 internal holdouts (3-fold CV)
  Feature engineering: 0 features scored (15 selected)
Timing:
  Data preparation: 9.53 secs
  Shift/Leakage detection: 1.05 secs
  Model and feature tuning: 51.78 secs (22 of 132 models trained)
  Feature evolution: 0.50 secs (0 of 6024 model trained)
  Final pipeline training: 33.00 secs (4 models trained)
  Python / MOJO scorer building: 0.00 secs / 24.61 secs
Validation score: AUCPR = 0 (constant preds of -1.247)
Validation score: AUCPR =

**Make predictions**.

In [47]:
p = ex.predict(
    ds_split['test_dataset'], 
    include_columns=ds_split['test_dataset'].columns,
    enable_mojo=False
)
pd.read_csv(p.download(dst_dir='./artifacts')).head()

Complete
Downloaded 'artifacts/8d98b592-30a8-11ee-b104-0242ac110002_preds_84aca679.csv'


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAt_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month,default payment next month.0,default payment next month.1
0,8508,20000,2,2,2,53,4,4,3,3,...,19812,0,1300,1170,0,1600,0,0,0.656023,0.343977
1,13818,160000,1,2,1,60,0,0,0,0,...,21336,1626,1334,884,758,763,942,0,0.869641,0.130359
2,8374,20000,2,2,3,47,1,2,0,0,...,10400,1000,1159,500,432,600,0,0,0.64347,0.35653
3,15578,180000,1,2,1,48,-1,-1,-1,-1,...,264,1294,1466,1294,2324,264,264,0,0.850476,0.149524
4,22428,180000,2,2,1,34,0,-1,-1,-1,...,0,5589,4312,2490,0,0,150,0,0.883235,0.116765


Calculate Shapley values for the original features.

In [48]:
p = ex.predict(
    ex.datasets['train_dataset'], 
    include_shap_values_for_original_features=True
)
pd.read_csv(p.download()).head()

Complete
Downloaded '8d98b592-30a8-11ee-b104-0242ac110002_preds_fff56bf5.csv'


Unnamed: 0,default payment next month.0,default payment next month.1,contrib_AGE,contrib_BILL_AMT2,contrib_BILL_AMT3,contrib_BILL_AMT4,contrib_BILL_AMT5,contrib_BILL_AMT6,contrib_BILL_AMt1,contrib_EDUCATION,...,contrib_PAY_6,contrib_PAY_AMT1,contrib_PAY_AMT2,contrib_PAY_AMT3,contrib_PAY_AMT4,contrib_PAY_AMT5,contrib_PAY_AMT6,contrib_PAt_0,contrib_SEX,contrib_bias
0,0.669573,0.330427,0.0,-0.000911,0.0,0.0,0.0,0.0,-0.006887,0.0,...,0.0,0.148405,0.089687,-0.038416,0.001716,0.0,-0.005975,-0.11431,0.0,-1.386639
1,0.836617,0.163383,0.0,0.000756,0.0,0.0,0.0,0.0,-0.009164,0.0,...,0.0,-0.049478,-0.053282,-0.041875,-0.005939,0.0,-0.005975,-0.060593,0.0,-1.386639
2,0.660804,0.339196,0.0,0.000756,0.0,0.0,0.0,0.0,-0.009206,0.0,...,0.0,-0.046143,-0.04295,-0.03526,0.012508,0.0,0.012126,0.854431,0.0,-1.386639
3,0.821736,0.178264,0.0,0.006223,0.0,0.0,0.0,0.0,-0.021002,0.0,...,0.0,-0.054734,0.231547,0.043113,0.005334,0.0,-0.005975,-0.140515,0.0,-1.386639
4,0.75232,0.24768,0.0,-0.002624,0.0,0.0,0.0,0.0,0.01002,0.0,...,0.0,0.134703,0.216618,0.13833,-0.077954,0.0,0.012126,-0.140496,0.0,-1.386639


Load predictions from the DAI server into a local pandas DataFrame.

In [49]:
p.to_pandas().head()

Unnamed: 0,default payment next month.0,default payment next month.1,contrib_AGE,contrib_BILL_AMT2,contrib_BILL_AMT3,contrib_BILL_AMT4,contrib_BILL_AMT5,contrib_BILL_AMT6,contrib_BILL_AMt1,contrib_EDUCATION,...,contrib_PAY_6,contrib_PAY_AMT1,contrib_PAY_AMT2,contrib_PAY_AMT3,contrib_PAY_AMT4,contrib_PAY_AMT5,contrib_PAY_AMT6,contrib_PAt_0,contrib_SEX,contrib_bias
0,0.669573,0.330427,0.0,-0.000911,0.0,0.0,0.0,0.0,-0.006887,0.0,...,0.0,0.148405,0.089687,-0.038416,0.001716,0.0,-0.005975,-0.11431,0.0,-1.386639
1,0.836617,0.163383,0.0,0.000756,0.0,0.0,0.0,0.0,-0.009164,0.0,...,0.0,-0.049478,-0.053282,-0.041875,-0.005939,0.0,-0.005975,-0.060593,0.0,-1.386639
2,0.660804,0.339196,0.0,0.000756,0.0,0.0,0.0,0.0,-0.009206,0.0,...,0.0,-0.046143,-0.04295,-0.03526,0.012508,0.0,0.012126,0.854431,0.0,-1.386639
3,0.821736,0.178264,0.0,0.006223,0.0,0.0,0.0,0.0,-0.021002,0.0,...,0.0,-0.054734,0.231547,0.043113,0.005334,0.0,-0.005975,-0.140515,0.0,-1.386639
4,0.75232,0.24768,0.0,-0.002624,0.0,0.0,0.0,0.0,0.01002,0.0,...,0.0,0.134703,0.216618,0.13833,-0.077954,0.0,0.012126,-0.140496,0.0,-1.386639
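To make the `contrib_*` columns above easier to read: Shapley contributions are additive, so summing every `contrib_*` value in a row (including `contrib_bias`) gives the raw model score for that row. The sketch below assumes that this raw score is in logit space for a binary classifier, which is an assumption about the model's link function and is not shown in the output above. The row values and the selection of columns are made-up toy numbers, not taken from the experiment.

```python
import math

# Toy row with hypothetical values; a real row would come from p.to_pandas().
row = {
    "contrib_PAY_AMT1": 0.15,
    "contrib_PAY_AMT2": 0.09,
    "contrib_PAY_AMT3": -0.11,
    "contrib_bias": -1.39,
}

# Sum every contribution, bias included, to get the raw score (assumed to be a logit).
logit = sum(value for name, value in row.items() if name.startswith("contrib_"))

# Map the assumed logit to a probability with the logistic function.
probability = 1.0 / (1.0 + math.exp(-logit))

# Rank the features (bias excluded) by the size of their contribution.
ranked = sorted(
    (name for name in row if name != "contrib_bias"),
    key=lambda name: abs(row[name]),
    reverse=True,
)

print("raw score:", round(logit, 2))
print("probability:", round(probability, 3))
print("largest contribution:", ranked[0])
```

Ranking by absolute contribution is a simple way to see which original features pushed a single prediction the most, in either direction.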


## Recipes

Display the first 10 model recipes, including any models that come from custom recipes.

In [50]:
dai.recipes.models.list()[0:10]

    | Type        | Key   | Name
----+-------------+-------+------------------------
  0 | ModelRecipe |       | Aggregator
  1 | ModelRecipe |       | Constant
  2 | ModelRecipe |       | DecisionTree
  3 | ModelRecipe |       | FTRL
  4 | ModelRecipe |       | GLM
  5 | ModelRecipe |       | ImageAuto
  6 | ModelRecipe |       | ImbalancedLightGBM
  7 | ModelRecipe |       | ImbalancedXGBoostGBM
  8 | ModelRecipe |       | IsolationForestAnomaly
  9 | ModelRecipe |       | KMeans

In [51]:
m = dai.recipes.models.list()[0]

In [52]:
print("Name:", m.name)
print("Custom Recipe:", m.is_custom)
print("Unsupervised Recipe:", m.is_unsupervised)

Name: Aggregator
Custom Recipe: False
Unsupervised Recipe: True


In [53]:
s = dai.recipes.scorers.list()[0]

In [54]:
print("Name:", s.name)
print("Custom Recipe:", s.is_custom)

Name: ACCURACY
Custom Recipe: False


In [55]:
dai.recipes.transformers.list()

    | Type              | Key   | Name
----+-------------------+-------+--------------------------------------------------------------------------------------------
  0 | TransformerRecipe |       | AggregatorTransformer
  1 | TransformerRecipe |       | AutovizRecommendationsTransformer
  2 | TransformerRecipe |       | BERTTransformer
  3 | TransformerRecipe |       | BinnerTransformer
  4 | TransformerRecipe |       | CVCatNumEncodeTransformer
  5 | TransformerRecipe |       | CVTECUMLTransformer
  6 | TransformerRecipe |       | CVTargetEncodeTransformer
  7 | TransformerRecipe |       | CatOriginalTransformer
  8 | TransformerRecipe |       | CatTransformer
  9 | TransformerRecipe |       | ClusterDistCUMLDaskTransformer
 10 | TransformerRecipe |       | ClusterDistCUMLTransformer
 11 | TransformerRecipe |       | ClusterDistTransformer
 12 | TransformerRecipe |       | ClusterIdAllNumTransformer
 13 | TransformerRecipe |       | ClusterTETransformer
 14 | TransformerRecipe |     

In [56]:
t = dai.recipes.transformers.list()[62]

In [57]:
print("Name:", t.name)
print("Custom Recipe:", t.is_custom)

Name: CountMissingNumericsPerRowTransformer|count_missing_values_transformer_a2e835ff_content.py
Custom Recipe: True


### Writing an example Transformer recipe

Official Github repo: https://github.com/h2oai/driverlessai-recipes

Frequently asked questions on writing a recipe: https://github.com/h2oai/driverlessai-recipes/blob/master/FAQ.md#references

Writing Example Transformer step-by-step: https://github.com/h2oai/driverlessai-recipes/tree/master/how_to_write_a_recipe


#### What is a transformer recipe? 
A transformer (or feature) recipe is a collection of programmatic steps, the same steps that a data scientist would write as code to build a column transformation. The recipe makes it possible to engineer the same feature in training and in production.
Transformer recipes, and recipes in general, let a data scientist extend the strengths of DriverlessAI with custom logic. Custom recipes can bring in nuanced knowledge about specific domains, e.g. financial crimes, cybersecurity, or anomaly detection. They also make it possible to extend DriverlessAI with custom solutions for time-series problems.

#### How to write a simple DAI recipe? 
The structure of a recipe that works with DriverlessAI is quite straightforward.

1. DriverlessAI provides a `CustomTransformer` base class that must be extended to write a recipe. The `CustomTransformer` class lets you add a customized transformation function. In the following example we create a transformer that computes the `log10` of a column. The transformed column is returned to DriverlessAI as a new column that will be used for modeling.

```{python eval=FALSE}
class ExampleLogTransformer(CustomTransformer):
	pass  # placeholder body; attributes and methods are added in the steps below
```
`ExampleLogTransformer` is the class name of the new transformer. The `CustomTransformer` in parentheses is the base class being extended.

2. In the next step, declare the types of problem the custom transformer applies to:
   a. Are you solving a regression problem?
   b. Are you solving a binary classification problem?
   c. Are you solving a multiclass classification problem?

Each of these flags is enabled or disabled depending on the kinds of problem the custom transformer supports. The following example shows how this is done.

```{python eval=FALSE}
class ExampleLogTransformer(CustomTransformer):
	_regression = True
	_binary = True
	_multiclass = True
```
In the above example we are building a `log10` transformer, which is applicable to regression, binary, and multiclass problems. Therefore we set all three flags to `True`.


3. In the next step, we tackle four more settings of a transformer. They are as follows:
   a. Output type - What is the output type of this transformer?
   b. Reproducibility - Is this transformer reproducible? That is, is it deterministic, or deterministic once a seed is set?
   c. Model inclusion/exclusion - Which model types this transformer should be restricted to, or excluded from.
   d. Custom package requirements - Does this transformer require any custom packages?
      

```{python eval=FALSE}
class ExampleLogTransformer(CustomTransformer):
	_regression = True
	_binary = True
	_multiclass = True
	_numeric_output = True
	_is_reproducible = True
	_excluded_model_classes = ['tensorflow']
	_modules_needed_by_name = ["custom_package==1.0.0"]
```
In the above example we have set `_numeric_output` to `True` because our output is numeric. We have set `_is_reproducible` to `True`, advising DriverlessAI that if the user asks for a reproducible model, this transformer is capable of producing a reproducible result. As an example, we have excluded `tensorflow` using `_excluded_model_classes`. If you want the transformer to run only with a specific kind of model, for example `catboost`, use `_included_model_classes=['CatBoostModel']` instead of `_excluded_model_classes`. Purely as an illustration, we have also listed `custom_package` version `1.0.0` as a package required for this transformation.

4. Next we look at DriverlessAI's ability to check the custom recipe. When the following function returns `True`, DriverlessAI checks the workings of the transformer using a synthetic dataset. If it returns `False`, DriverlessAI ingests the recipe but skips the check.

```{python eval=FALSE}
class ExampleLogTransformer(CustomTransformer):
	_regression = True
	_binary = True
	_multiclass = True
	_numeric_output = True
	_is_reproducible = True
	_excluded_model_classes = ['tensorflow']
	_modules_needed_by_name = ["custom_package==1.0.0"]

	@staticmethod
	def do_acceptance_test():
		return True
```
In this example we enable the acceptance test by returning `True` from the `do_acceptance_test` function.

5. In the following example we set the parameters for the type of column that we require as input, the minimum and the maximum number of columns that we need to be able to provide an output, along with the relative importance of the transformer. 

The column type, or `col_type`, can take one of nine values:

	a. "all"         - all column types
	b. "any"         - any column types
	c. "numeric"     - numeric int/float column
	d. "categorical" - string/int/float column considered a categorical for feature engineering
	e. "numcat"      - allow both numeric or categorical
	f. "datetime"    - string or int column with raw datetime such as '%Y/%m/%d %H:%M:%S' or '%Y%m%d%H%M'
	g. "date"        - string or int column with raw date such as '%Y/%m/%d' or '%Y%m%d'
	h. "text"        - string column containing text (and hence not treated as categorical)
	i. "time_column" - the time column specified at the start of the experiment (unmodified)

Please note that if `col_type` is set to `"all"`, all the columns in the dataframe are provided to this transformer and no selection of columns occurs.

`min_cols` and `max_cols` take either integers or the strings `all` and `any`. When a string is used, it should match the corresponding `col_type` (`all` with `col_type="all"`, `any` with `col_type="any"`).

The `relative_importance` takes a positive value, with a default of `1`. If the value is greater than `1`, the transformer is over-represented: it is likely to be used more often than other transformers in the specific experiment. If it is less than `1`, the transformer is under-represented and is less likely to be used than other transformers. If it is set to `1`, it is as likely to be used as any other transformer whose relative importance is also `1`.
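One way to build intuition for `relative_importance` is to treat it as a sampling weight. The sketch below is only an illustration of "more or less likely to be used"; it is not DriverlessAI's actual selection algorithm, and the transformer names and weights are hypothetical.

```python
import random

# Hypothetical relative_importance values for three transformers.
relative_importance = {
    "ExampleLogTransformer": 2.0,  # over-represented
    "OriginalTransformer": 1.0,    # default weight
    "FrequentTransformer": 1.0,    # default weight
}

# Normalise the weights into selection probabilities.
total = sum(relative_importance.values())
selection_probability = {
    name: weight / total for name, weight in relative_importance.items()
}

# Draw 1000 weighted samples with a fixed seed to see the effect in counts.
rng = random.Random(0)
names = list(relative_importance)
weights = [relative_importance[name] for name in names]
draws = rng.choices(names, weights=weights, k=1000)
counts = {name: draws.count(name) for name in names}

print(selection_probability)
print(counts)
```

With these weights, the transformer with importance `2` is drawn about twice as often as each transformer with importance `1`.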

```{python eval=FALSE}
class ExampleLogTransformer(CustomTransformer):
	_regression = True
	_binary = True
	_multiclass = True
	_numeric_output = True
	_is_reproducible = True
	_excluded_model_classes = ['tensorflow']
	_modules_needed_by_name = ["custom_package==1.0.0"]

	@staticmethod
	def do_acceptance_test():
		return True

	@staticmethod
	def get_default_properties():
		return dict(col_type = "numeric", min_cols = 1, max_cols = 1, relative_importance = 1)
```

In the above example, as we are dealing with a numeric column (recall, that we are calculating the log10 of a given column) we set the `col_type` to `numeric`. We set the `min_cols` and `max_cols` to `1` as we need only one column, and the `relative_importance` to `1`.

6. The custom transformer class has two fundamental methods that are required to make a transformer. They are:
   a. `fit_transform` - fits the transformation on the training dataset and returns the output column.
   b. `transform` - transforms the testing or production dataset, and is always applied after `fit_transform`.


```{python eval=FALSE}
# Only the imports this minimal example actually uses
import datatable as dt
import numpy as np
from h2oaicore.transformer_utils import CustomTransformer

class ExampleLogTransformer(CustomTransformer):
    _regression = True
    _binary = True
    _multiclass = True
    _numeric_output = True
    _is_reproducible = True
    _excluded_model_classes = ['tensorflow']
    _modules_needed_by_name = ["custom_package==1.0.0"]  # placeholder from the earlier steps; replace with a real requirement or remove

    @staticmethod
    def do_acceptance_test():
        return True


    @staticmethod
    def get_default_properties():
        return dict(col_type = "numeric", min_cols = 1, max_cols = 1, relative_importance = 1)


    def fit_transform(self, X: dt.Frame, y: np.array = None):
        X_pandas = X.to_pandas()
        X_p_log = np.log10(X_pandas)
        
        return X_p_log


    def transform(self, X: dt.Frame, y: np.array = None):
        X_pandas = X.to_pandas()
        X_p_log = np.log10(X_pandas)
        
        return X_p_log
```
In the above example, we write `fit_transform` and `transform` for the training and testing data, respectively. The response variable `y` is available in `fit_transform`. The input frame is named `X`; it is converted to a pandas frame with `to_pandas()`, then the `log10` of the column is computed and returned. The conversion with `to_pandas()` is used here for ease of understanding.
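The core of the transformation can be tried locally, without DriverlessAI, before it is wrapped in a recipe. One thing such a check surfaces is that `np.log10` returns `-inf` for zero and `nan` for negative inputs. The sketch below shows one possible way to guard against that; `safe_log10` is a hypothetical helper, not part of the recipe above, and mapping non-positive values to `nan` is a design choice made only for this illustration.

```python
import numpy as np


def safe_log10(values):
    """Return log10 of each value, with NaN where the input is not positive."""
    arr = np.asarray(values, dtype=float)
    out = np.full(arr.shape, np.nan)
    positive = arr > 0
    # Only evaluate the logarithm where it is defined.
    out[positive] = np.log10(arr[positive])
    return out


# Small synthetic column that includes a zero and a negative value.
sample = [100.0, 1.0, 0.0, -5.0]
result = safe_log10(sample)
print(result)
```

Whether non-positive values should become missing values, be clipped, or raise an error depends on the dataset, so treat this as a starting point only.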

#### Add debugging and error handling

To simplify recipe development and to enable recipe debugging, we will add code that prints messages and reports errors to the DAI experiment log.

```python
import logging

import datatable as dt
import numpy as np
import pandas as pd
from h2oaicore.systemutils import config, print_debug
from h2oaicore.systemutils import make_experiment_logger, loggerinfo, loggerwarning
from h2oaicore.transformer_utils import CustomTransformer

class ExampleLogTransformer(CustomTransformer):
    _regression = True
    _binary = True
    _multiclass = True
    _numeric_output = True
    _is_reproducible = True
    _excluded_model_classes = ['tensorflow']
    _modules_needed_by_name = ["lifelines==0.27.7"] # Not really needed, added as demo

    @staticmethod
    def do_acceptance_test():
        return True

    @property
    def logger(self):
        from h2oaicore import application_context
        from h2oaicore.systemutils import exp_dir
        # Don't assign to self, not picklable
        return make_experiment_logger(experiment_id=application_context.context.experiment_id, tmp_dir=None,
                                      experiment_tmp_dir=exp_dir())

    @staticmethod
    def get_default_properties():
        return dict(col_type = "numeric", min_cols = 1, max_cols = 1, relative_importance = 1)

    def fit_transform(self, X: dt.Frame, y: np.array = None):
        logger = self.logger
        loggerinfo(logger, "Start Example Transformer fit_transform .....")
        try:
            X_pandas = X.to_pandas()
            X_p_log = np.log10(X_pandas)
        except Exception as e:
            # Print the error message into the DAI log file, then re-raise
            loggerinfo(logger, 'Error during Example transformer fit_transform. Exception raised: %s' % str(e))
            raise
        return X_p_log


    def transform(self, X: dt.Frame, y: np.array = None):
        logger = self.logger
        loggerinfo(logger, "Start Example Transformer transform.......")
        try:
            X_pandas = X.to_pandas()
            X_p_log = np.log10(X_pandas)
        except Exception as e:
            # Print the error message into the DAI log file, then re-raise
            loggerinfo(logger, 'Error during Example transformer transform. Exception raised: %s' % str(e))
            raise
        return X_p_log
```
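The log-then-re-raise pattern used in the recipe can be exercised locally with the standard `logging` module standing in for the DAI experiment logger. This is a sketch under that assumption: `log10_with_logging` and the logger name are hypothetical, and `math.log10` is used in place of `np.log10` so the failure path is easy to trigger.

```python
import logging
import math

logging.basicConfig(level=logging.INFO)
# Stand-in for the DAI experiment logger (hypothetical name).
logger = logging.getLogger("example_transformer")


def log10_with_logging(values):
    """Apply log10 to each value, logging start-up and any failure."""
    logger.info("Start log10 transform .....")
    try:
        result = [math.log10(v) for v in values]
    except Exception as e:
        # Record the error first, then re-raise so the caller still sees it.
        logger.error("Error during log10 transform. Exception raised: %s", e)
        raise
    return result


print(log10_with_logging([1, 10, 100]))

try:
    log10_with_logging([0])  # math.log10(0) raises ValueError
except ValueError:
    print("The error was logged and re-raised")
```

Re-raising after logging keeps the failure visible to the caller while still leaving a trace in the log.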

In [58]:
help(dai.recipes.create)

Help on method create in module driverlessai._recipes:

create(recipe: str) -> None method of driverlessai._recipes.Recipes instance
    Create a recipe on the Driverless AI server.
    
    Args:
        recipe: path to recipe or url for recipe
    
    Examples::
    
        dai.recipes.create(
            recipe='https://github.com/h2oai/driverlessai-recipes/blob/master/scorers/regression/explained_variance.py'
        )



In [68]:
!cat ./recipes/example_transformer_recipe.py

from h2oaicore.systemutils import print_debug
import logging
from h2oaicore.systemutils import config


from h2oaicore.transformer_utils import CustomTransformer
import datatable as dt
import numpy as np
import pandas as pd
import logging

from h2oaicore.systemutils import config

class ExampleLogTransformer(CustomTransformer):
    _regression = True
    _binary = True
    _multiclass = True
    _numeric_output = True
    _is_reproducible = True
    _excluded_model_classes = ['tensorflow']
    _modules_needed_by_name = ["lifelines==0.27.7"] # Not really needed, added as demo

    @staticmethod
    def do_acceptance_test():
        return True

    @property
    def logger(self):
        from h2oaicore import application_context
        from h2oaicore.systemutils import exp_dir
        # Don't assign to self, not picklable
        return make_experiment_logger(experiment_id=application_context.context.experiment_id, tmp_dir=None,
                                      experiment_tmp_dir=

**Create/Add recipe to DAI**

In [60]:
dai.recipes.create("./recipes/example_transformer_recipe.py")

Complete 100.00%
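Because a recipe file has to be valid Python before the server can use it, a quick local parse before calling `dai.recipes.create` can catch indentation and syntax mistakes early. The sketch below uses only the standard `ast` module; `custom_transformer_classes` is a hypothetical helper, and it checks syntax and class declarations only, not whether the recipe passes the DAI acceptance test.

```python
import ast


def custom_transformer_classes(source: str) -> list:
    """Return names of classes in `source` that list CustomTransformer as a base."""
    tree = ast.parse(source)  # raises SyntaxError on invalid code
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            bases = [b.id for b in node.bases if isinstance(b, ast.Name)]
            if "CustomTransformer" in bases:
                names.append(node.name)
    return names


# Inline sample standing in for the contents of a recipe file.
sample = '''
class ExampleLogTransformer(CustomTransformer):
    _regression = True

    @staticmethod
    def do_acceptance_test():
        return True
'''

print(custom_transformer_classes(sample))
```

For a real file, the same function can be called on the text read from the recipe path.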


In [61]:
for item in dai.recipes.transformers.list():
    if "ExampleLogTransformer" in item.name:
        custom_transformer_name = item.name
        print("Found custom transformer: " + custom_transformer_name)

Found custom transformer: ExampleLogTransformer|example_transformer_recipe_750a3ae1_content.py
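The name printed above appears to combine the recipe class name and the uploaded file name, separated by `|`. Assuming that format (it is inferred from this output, not from documentation), the two parts can be separated as in this small sketch:

```python
# Name copied from the output above.
recipe_name = "ExampleLogTransformer|example_transformer_recipe_750a3ae1_content.py"

# partition() splits on the first "|" and tolerates names without one.
class_name, _, source_file = recipe_name.partition("|")

print("Class:", class_name)
print("File:", source_file)
```

Built-in transformers such as `OriginalTransformer` have no `|` in their names, in which case `source_file` is simply an empty string.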


### Train experiment with the custom transformer

In [62]:
settings_with_custom_transformer = {
'task': 'classification',
 'target_column': 'Churn',
 'accuracy': 5,
 'time': 3,
 'interpretability': 7,
 'scorer': 'LOGLOSS',
 'max_runtime_minutes': 60,
 'make_python_scoring_pipeline': 'off',
 'make_mojo_scoring_pipeline': 'off',
 'make_triton_scoring_pipeline': 'off',
 'benchmark_mojo_latency': 'off',
 'transformers': [custom_transformer_name, "FrequentTransformer",'OneHotEncodingTransformer',
                  'OriginalTransformer','CVCatNumEncodeTransformer']
}
datasets = {"train_dataset": dai.datasets.get("f7a51052-da2c-11ed-bbe0-0242ac110002"),
            "test_dataset": dai.datasets.get("f7a57df8-da2c-11ed-bbe0-0242ac110002")}

dai.experiments.preview(**datasets, **settings_with_custom_transformer)

ACCURACY [5/10]:
- Training data size: *4,594 rows, 20 cols*
- Feature evolution: *[Constant, GLM, LightGBM, XGBoostGBM]*, *3-fold CV, 3 reps*
- Final pipeline: *Blend of up to 4 [Constant, GLM, LightGBM, XGBoostGBM] models, each averaged across 3-fold CV splits, 3 reps*

TIME [3/10]:
- Feature evolution: *4 individuals*, up to *30 iterations*
- Early stopping: After *5* iterations of no improvement

INTERPRETABILITY [7/10]:
- Feature pre-pruning strategy: Permutation Importance FS
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, ExampleLog, Frequent, OneHotEncoding, Original]

[Constant, GLM, LightGBM, XGBoostGBM] models to train:
- Model and feature tuning: *144*
- Feature evolution: *576*
- Final pipeline: *36*
- MOJO DISABLED

Estimated runtime: *10 minutes*

Estimated max CPU memory usage: *1.0GB*
Finish/Abort (if not done) in: *1 hour*/*7 days*


Start the experiment with the custom transformer

In [63]:
ex_with_custom_transformer = dai.experiments.create_async(**datasets, **settings_with_custom_transformer)

Experiment launched at: https://34-209-212-180.aquarium-instance.h2o.ai/#/experiment?key=13070468-30a9-11ee-b104-0242ac110002


In [64]:
print("Name:", ex_with_custom_transformer.name)
print("Datasets:", ex_with_custom_transformer.datasets)
print("Train Dataset Head:")
display(ex_with_custom_transformer.datasets['train_dataset'].head(1))
print("Target:", ex_with_custom_transformer.settings['target_column'])
print("Scorer:", ex_with_custom_transformer.metrics()['scorer'])
print("Task:", ex_with_custom_transformer.settings['task'])
print("Status:", ex_with_custom_transformer.status(verbose=2))
print("Creation Timestamp:", ex_with_custom_transformer.creation_timestamp)
print("Run Duration:", ex_with_custom_transformer.run_duration)
print("Web Page:", ex_with_custom_transformer.gui())

Name: miperuwu
Datasets: {'train_dataset': <class 'Dataset'> f7a51052-da2c-11ed-bbe0-0242ac110002 CustomerChurn.train, 'validation_dataset': None, 'test_dataset': <class 'Dataset'> f7a57df8-da2c-11ed-bbe0-0242ac110002 CustomerChurn.test}
Train Dataset Head:


customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,Female,False,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No


Target: Churn
Scorer: LOGLOSS
Task: classification
Status: Running 0.00% - Update configuration with overrides.
Creation Timestamp: 1690921322.2503042
Run Duration: 0.49209022521972656
Web Page: https://34-209-212-180.aquarium-instance.h2o.ai/#/experiment?key=13070468-30a9-11ee-b104-0242ac110002
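The `Creation Timestamp` and `Run Duration` values above are raw floats. Assuming they are a Unix epoch time in seconds and an elapsed time in seconds (an assumption based on their magnitude, not confirmed by the output), they can be made readable with the standard library:

```python
from datetime import datetime, timedelta, timezone

# Values copied from the output above.
creation_timestamp = 1690921322.2503042
run_duration = 0.49209022521972656

# Interpret the timestamp as seconds since the Unix epoch, in UTC.
created_at = datetime.fromtimestamp(creation_timestamp, tz=timezone.utc)

# Interpret the duration as elapsed seconds.
elapsed = timedelta(seconds=run_duration)

print("Created at:", created_at.isoformat(timespec="seconds"))
print("Elapsed:", elapsed)
```

Using an explicit UTC timezone avoids results that depend on the local timezone of the machine running the notebook.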


In [65]:
ex_with_custom_transformer.variable_importance()