## Example notebook for the ATOM pipeline
---------------------------------

Load one of the three imported datasets before instantiating the ATOM
class. All three are provided by scikit-learn; they are small and easy
for models to fit. You can learn more about them
at https://scikit-learn.org/stable/datasets/index.html.

    load_breast_cancer: binary classification
    load_wine: multi-class classification
    load_boston: regression (removed in scikit-learn 1.2)

In [1]:
# Import packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston, load_wine, load_breast_cancer
from atom import ATOM

# Load the dataset and transform to a pd.DataFrame
dataset = load_breast_cancer()

data = np.c_[dataset.data, dataset.target]
columns = np.append(dataset.feature_names, ["target"])
data = pd.DataFrame(data, columns=columns)
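
The cell above builds the DataFrame by hand from the feature array and the target vector. As a possible shortcut, recent scikit-learn versions (0.23 and later) can return the same table directly through the `as_frame` argument. The sketch below assumes such a version is installed; the name `data_alt` is only illustrative and is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True makes the loader return pandas objects instead of NumPy arrays
bunch = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final "target" column
data_alt = bunch.frame

print(data_alt.shape)  # (569, 31)
```

Unlike the `np.c_` approach, the target column keeps its integer dtype here instead of being cast to float.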

In [2]:
# Initialize the ATOM class for ML task exploration
atom = ATOM(data,
            target='target',
            n_jobs=1,
            verbose=2)

# Perform some data cleaning steps
atom.balance(oversample=1.)
atom.outliers(max_sigma=5)

# Select the 20 best features according to an F-test
atom.feature_selection(strategy='univariate', max_features=20)

Algorithm task: binary classification.

Number of features: 30
Number of instances: 569
Size of training set: 398
Size of test set: 171

Performing oversampling...
Handling outliers...
Performing feature selection...
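
The training and test set sizes reported above are consistent with a 70/30 split of the 569 instances. The split fraction does not appear in the output, so the value `0.3` below is an assumption, as is the rounding convention (test size rounded up, which is what scikit-learn's `train_test_split` does for a fractional `test_size`). A minimal sketch of the arithmetic:

```python
import math

n_instances = 569
test_fraction = 0.3  # assumption: not stated in the output above

# Round the test size up; the training set takes the remainder
n_test = math.ceil(test_fraction * n_instances)
n_train = n_instances - n_test

print(n_train, n_test)  # 398 171
```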


In [3]:
# Fit the pipeline with the selected models
atom.fit(models=['LDA', 'RF', 'LGBM'],
         metric='accuracy',
         max_iter=10,
         init_points=3,
         cv=3,
         bagging=5)


Models in pipeline: ['LDA', 'RF', 'LGBM']


Running BO for Linear Discriminant Analysis...
Final statistics for Linear Discriminant Analysis:         
Best hyperparameters: {'solver': 'svd', 'n_components': 1, 'tol': 0.0689}
Best Accuracy on the BO: 0.9739
Accuracy on the test set: 0.9474
Elapsed time: 5.6 seconds
--------------------------------------------------
Bagging Accuracy score --> Mean: 0.9532   Std: 0.0064
Elapsed time: 0.1 seconds


Running BO for Random Forest...
Final statistics for Random Forest:         
Best hyperparameters: {'n_estimators': 100, 'max_features': 0.8, 'criterion': 'entropy', 'bootstrap': True, 'min_samples_split': 12, 'min_samples_leaf': 2}
Best Accuracy on the BO: 0.9558
Accuracy on the test set: 0.9415
Elapsed time: 15.3 seconds
--------------------------------------------------
Bagging Accuracy score --> Mean: 0.9392   Std: 0.0070
Elapsed time: 1.8 seconds


Running BO for Light GBM...
Final statistics for Light GBM:         
Best hyperparameters: {
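
The "Bagging Accuracy score" lines in the output above report a mean and a standard deviation over several rounds (`bagging=5`). How ATOM computes them is not shown in this notebook. One common approach, assumed here purely for illustration, is to refit the model on bootstrap resamples of the training set and score each fit on the same test set. The sketch uses scikit-learn directly with default LDA hyperparameters and an assumed 70/30 split, so its numbers will not match the ones reported above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

# Assumption: 70/30 split with an arbitrary seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

scores = []
for seed in range(5):  # five rounds, mirroring bagging=5
    # Draw a bootstrap sample (with replacement) from the training set
    X_bs, y_bs = resample(X_train, y_train, random_state=seed)
    model = LinearDiscriminantAnalysis().fit(X_bs, y_bs)
    # Score every refit on the same held-out test set (accuracy)
    scores.append(model.score(X_test, y_test))

print(f"Mean: {np.mean(scores):.4f}   Std: {np.std(scores):.4f}")
```

A small standard deviation across rounds suggests the model's test accuracy is not sensitive to which training samples it happened to see.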

## Analyze results