![](https://media1.tenor.com/images/1c519d623788d3c049c47b0873dc5bc7/tenor.gif?itemid=15055034)

### In this notebook, I compare several commonly used AutoML frameworks. AutoML does exactly what the guy above says: you tell the framework to do the heavy lifting.


### This notebook also gives some background on each framework, along with code snippets. Do let me know your thoughts about AutoML and about any other framework you know of.


### Contents

1. [What is AUTOML](#at)

2. [Auto-sklearn](#ask)

3. [MLBOX](#mlbox)

4. [H2o](#h2o)

5. [TPOT](#tpo)

6. [Autokeras](#ak)

7. [PyCaret](#py)

8. [Hyperopt](#hp)

9. [Other frameworks](#oth)

<a id="at"></a>

### What Is AutoML?

AutoML (automated machine learning) refers to the automated end-to-end process of applying machine learning in real and practical scenarios.

A typical machine learning workflow includes the following four steps:

![](https://yqintl.alicdn.com/7e193d8335256ae97a8bbb94afd225435e7c60f2.png)


Traditionally, every step, from ingesting data to pre-processing, optimization, and predicting outcomes, is controlled and performed by humans. With AutoML, the user essentially focuses on two major aspects: data acquisition/collection and prediction. All the steps that take place in between can be automated while still delivering a model that is well optimized and ready to make predictions.

The success of machine learning in a wide range of applications has led to an ever-growing demand for machine learning systems that can be used off the shelf by non-experts¹. AutoML aims to automate as many steps of an ML pipeline as possible, with a minimum amount of human effort and without compromising the model’s performance.
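To make the idea concrete, here is a minimal sketch of the kind of loop an AutoML tool automates: try several candidate models and keep the one with the best cross-validated score. The synthetic data, the two candidates, and their settings are assumptions made for illustration; real frameworks search far larger spaces with smarter strategies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidate models; an AutoML tool would generate these itself
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Score every candidate with 5-fold cross-validation and keep the best one
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```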


### Advantages

#### The advantages of AutoML can be summed up in three major points:

    - Increases productivity by automating repetitive tasks. This enables a data scientist to focus more on the problem rather than the models.

    - Automating the ML pipeline also helps to avoid errors that might creep in manually.

    - Ultimately, AutoML is a step towards democratizing machine learning by making the power of ML accessible to everybody.

### Let us begin to look at the various frameworks, along with code.

In [None]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
from sklearn.model_selection import train_test_split        

In [None]:
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.head()

In [None]:
y = df.target.values
x_data = df.drop(['target'], axis = 1)

# Normalize
#x = (x_data - np.min(x_data)) / (np.max(x_data) - np.min(x_data)).values
#x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2,random_state=0)
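The commented-out lines above hint at min-max normalization. A minimal sketch of column-wise min-max scaling with pandas could look like this; the values are made up, and only the column names are borrowed from heart.csv. A constant column would cause a division by zero, so it would need separate handling.

```python
import pandas as pd

# Made-up values; the column names mirror two columns of heart.csv
x_demo = pd.DataFrame({"age": [29, 45, 77], "chol": [126, 240, 564]})

# Column-wise min-max scaling: each column is mapped onto [0, 1]
x_scaled = (x_demo - x_demo.min()) / (x_demo.max() - x_demo.min())
print(x_scaled)
```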

<a id="ask"></a>

### AUTO-SKLEARN


![](https://miro.medium.com/max/512/1*s2myX8bJIp9mQ2V_htcEpw.png)


### We begin with auto-sklearn, an AutoML toolkit built on top of scikit-learn, one of the most widely used ML frameworks.

### Auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning and ensemble construction. [Here](http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf) is the paper for this framework, in case you are interested. 

### It creates a pipeline and optimizes it using Bayesian search. Two components are added to the Bayesian hyperparameter optimization of an ML framework: meta-learning for initializing the Bayesian optimizer, and automated ensemble construction from the configurations evaluated during optimization. It performs well on small and medium-sized datasets, but it cannot be applied to modern deep learning systems that yield state-of-the-art performance on large datasets.
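The ensemble-construction step can be sketched as greedy ensemble selection: start from an empty ensemble and repeatedly add (with replacement) the model whose inclusion gives the best validation accuracy of the averaged predictions. This is a simplified illustration on synthetic predictions, in the spirit of the approach described in the paper, not auto-sklearn's actual code; the function name and the three toy "models" are made up.

```python
import numpy as np

def greedy_ensemble_selection(val_probs, y_val, ensemble_size=5):
    """Greedy forward selection with replacement.

    val_probs: array of shape (n_models, n_samples) holding P(class = 1).
    Returns the chosen model indices; repeated indices act as weights.
    """
    chosen = []
    running_sum = np.zeros(val_probs.shape[1])
    for _ in range(ensemble_size):
        best_idx, best_acc = None, -1.0
        for idx in range(val_probs.shape[0]):
            # Accuracy of the averaged prediction if model idx were added
            avg = (running_sum + val_probs[idx]) / (len(chosen) + 1)
            acc = np.mean((avg >= 0.5).astype(int) == y_val)
            if acc > best_acc:
                best_idx, best_acc = idx, acc
        chosen.append(best_idx)
        running_sum += val_probs[best_idx]
    return chosen

# Toy validation labels and three hypothetical models of decreasing quality
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
val_probs = np.stack([np.clip(y_val + rng.normal(0, s, 200), 0, 1)
                      for s in (0.4, 0.6, 1.0)])

single_acc = [np.mean((p >= 0.5).astype(int) == y_val) for p in val_probs]
chosen = greedy_ensemble_selection(val_probs, y_val, ensemble_size=5)
print("single-model accuracies:", single_acc)
print("selected model indices:", chosen)
```

Because models are picked with replacement, a strong model can appear several times, which effectively gives it a larger weight in the average.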


### Let's build a model now.


### Here are the main features of the API. You can:

    - Set time and memory limits
    - Restrict the search space by selecting or excluding some preprocessing methods or some estimators
    - Specify some resampling strategies (e.g. 5-fold cv)
    - Perform some parallel computation with the SMAC algorithm (sequential model-based algorithm configuration) that stores some data on a shared file system
    - Save your model as you would do with scikit-learn (pickle)
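As a small illustration of the last point, this is how persistence with pickle looks for a plain scikit-learn estimator. It is a sketch on synthetic data; the assumption, taken from the list above, is that a fitted auto-sklearn model can be saved the same way.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a simple model on synthetic data (illustrative only)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the fitted estimator to bytes and load it back
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model gives the same predictions as the original
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```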

In [None]:
#!pip install -U scikit-learn

import sklearn
print(sklearn.__version__)

In [None]:
!curl -OL https://github.com/AxeldeRomblay/mlbox/tarball/3.0-dev

In [None]:
!apt-get -y remove swig

!apt-get -y install swig3.0 build-essential

!ln -s /usr/bin/swig3.0 /usr/bin/swig

#!pip install --upgrade setuptools
#!pip install auto-sklearn
#!pip install --no-cache-dir -v pyrfr


#try:
 #   import autosklearn.classification
#except:
 #   pass

In [None]:
!pip install git+https://github.com/automl/auto-sklearn

### Is ensemble default in auto-sklearn?

    - Yes. See the Introduction section of the paper mentioned above.
    
    
Below is a simple auto-sklearn pipeline    

In [None]:
#!pip uninstall -y scikit-learn

#import sklearn
#print(sklearn.__version__)

In [None]:
#!pip install scikit-learn

#import sklearn
#print(sklearn.__version__)

**Ignore the errors from MLBox and auto-sklearn.** There is a package issue on Kaggle, so the installation sometimes works and sometimes doesn't. That is why the code is commented out, but it does work when the packages install correctly.


A related thread: https://www.kaggle.com/general/64808

In [None]:
# from sklearn import model_selection, metrics
#import sklearn
#import autosklearn
#import autosklearn.classification

#%timeit

#X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(x_data, y, random_state=1)

#automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600,
#per_run_time_limit=300,resampling_strategy='cv', resampling_strategy_arguments={'folds': 5},
#include_preprocessors=["no_preprocessing"],ensemble_size=2)

# Do not construct ensembles in parallel to avoid using more than one
# core at a time. The ensemble will be constructed after auto-sklearn
# finished fitting all machine learning models.

#automl.fit(X_train, y_train)

# This call to fit_ensemble uses all models trained in the previous call
# to fit to build an ensemble which can be used with automl.predict()

#automl.fit_ensemble(y_train, ensemble_size=50)

#print(automl.show_models())

#predictions = automl.predict(X_test)

#print(automl.sprint_statistics())

#print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))    

### Pros:

    - Easy to get started if you know sklearn
    
    - Has parameters/hyperparameters similar to the scikit-learn API, so setting up a quick baseline shouldn't take long.
    
    - Supports model persistence and parallel computation
    
    
### Cons:

    - Installation of the package is not straightforward and involves lots of dependencies, be it on Kaggle or outside.
    
    - The results are not as self-explanatory as in other tools, and it offers fewer options than they do. 
    
    - Takes a lot of time to return the results. For a simple baseline with the above dataset, it took nearly an hour.

<a id="mlbox"></a>

### ML-BOX


MLBox is a powerful Automated Machine Learning python library.

According to the official document, it provides the following features:

    - Fast reading and distributed data preprocessing/cleaning/formatting
    
    - Highly robust feature selection and leak detection as well as accurate hyper-parameter optimization
    
    - State-of-the art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,...)
    
    - Prediction with model interpretation


MLBox focuses on the below three points in particular in comparison to the other libraries:

    - Drift Identification – A method to make the distribution of train data similar to the test data.
    - Entity Embedding – A categorical feature encoding technique inspired by word2vec.
    - Hyperparameter Optimization


### MLBox architecture

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2017/07/03230616/Screenshot-from-2017-07-03-23-05-23.png)


MLBox main package contains 3 sub-packages:

    - Pre-processing: reading and pre-processing data

    - Optimization: testing or optimizing a wide range of learners
    
    - Prediction: predicting the target on a test dataset
    

In [None]:
!pip install mlbox

In [None]:
#import warnings
#warnings.filterwarnings("ignore")

#from mlbox.preprocessing.reader import Reader
#from mlbox.preprocessing.drift_thresholder import Drift_thresholder
#from mlbox.optimisation.optimiser import Optimiser 
#from mlbox.prediction.predictor import Predictor

**Ignore the errors from MLBox and auto-sklearn.** There is a package issue on Kaggle, so the installation sometimes works and sometimes doesn't.


A related thread: https://www.kaggle.com/general/64808

### Inputs to MLBox

### If you have a train set and a test set, as in any Kaggle competition, you can feed these two paths directly to MLBox along with the target name.

### Otherwise, if fed a train set only, MLBox creates a test set.

In [None]:
paths = ["../input/titanic/train.csv", "../input/titanic/test.csv"] 

target_name = "Survived"

### Reading and preprocessing

The Reader class of MLBox is in charge of preparing the data.

It basically provides methods and utilities to:

    - Read in the data with the correct separator (csv, xls, json, and h5) and load it
    - Clean the data by:
        - deleting Unnamed columns
        - inferring column types (float, int, list)
        - processing dates and extracting relevant information from them: year, month, day, dayofweek, hour, etc.
        - removing duplicates
    - Prepare train and test splits

In [None]:
#rd = Reader(sep=",")
#df = rd.train_test_split(paths, target_name)

#### When this function is done running, it creates a folder named save where it dumps the target encoder for later use.

In [None]:
#df["train"].head()

#### DRIFT REMOVAL

    This is an innovative feature I haven't encountered in other packages.

    The main idea is to automatically detect and remove variables that have a distribution that is substantially different between the train and the test set.

    This happens quite a lot and we generally talk about biased data. 

    You could have for example a situation when the train set has a population of young people whereas the test has elderly only. This indicates that the age feature is not robust and may lead to a poor performance of the model when testing. So it has to be discarded.

#### How does MLBox compute drifts for individual variables? 


    MLBox builds a classifier that separates train from test data. It then uses the ROC score related to this classifier as a measure of the drift.

    This makes sense:

    If the drift score is high (i.e. the ROC score is high), it is easy to discern train data from test data, which means that the two distributions are very different.
    Otherwise, if the drift score is low (i.e. the ROC score is low), the classifier is not able to separate the two distributions correctly.
    MLBox provides a class called Drift_thresholder that takes as input the train and test sets as well as the target and computes a drift score for each one of the variables.

    Drift_thresholder then deletes the variables that have a drift score higher than a threshold (default 0.6).
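The idea described above can be sketched with scikit-learn: label every row by its origin (train or test), fit a classifier on a single feature, and use the cross-validated ROC AUC as the drift measure. This is an illustration of the principle on synthetic data, not MLBox's implementation; the helper name `drift_auc`, the classifier choice, and the data are assumptions, and MLBox's exact scoring may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def drift_auc(train_col, test_col, seed=0):
    """ROC AUC of a classifier that tries to tell train rows from test rows."""
    X = np.concatenate([train_col, test_col]).reshape(-1, 1)
    origin = np.concatenate([np.zeros(len(train_col), dtype=int),
                             np.ones(len(test_col), dtype=int)])
    clf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=seed)
    # Out-of-fold probabilities, so the score is not inflated by overfitting
    proba = cross_val_predict(clf, X, origin, cv=3, method="predict_proba")[:, 1]
    return roc_auc_score(origin, proba)

rng = np.random.default_rng(0)

# "age" drifts: young people in train, elderly people in test
train_age, test_age = rng.normal(30, 5, 300), rng.normal(70, 5, 300)
# A stable feature drawn from the same distribution in both sets
train_noise, test_noise = rng.normal(0, 1, 300), rng.normal(0, 1, 300)

age_drift = drift_auc(train_age, test_age)
noise_drift = drift_auc(train_noise, test_noise)
print("age:", round(age_drift, 3), "stable feature:", round(noise_drift, 3))
```

A feature such as the drifting age column would be dropped under a threshold rule, while the stable feature, whose AUC stays close to 0.5, would be kept.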

In [None]:
#dft = Drift_thresholder()
#df = dft.fit_transform(df)

### The heavy lifting : optimizing

#### All the functionalities inside this sub-package can be used via the command-

    from mlbox.optimisation import *

The hyper-parameter optimisation method in this library uses the hyperopt library, which is very fast. You can optimise almost anything with it, from the choice of missing-value imputation method to the depth of an XGBoost model. The library creates a high-dimensional space of the parameters to be optimised and chooses the combination of parameters that gives the best validation score.

#### This section performs the optimisation of the pipeline and tries different configurations of the parameters:

    - NA encoder (missing values encoder)
    - CA encoder (categorical features encoder)
    - Feature selector (OPTIONAL)
    - Stacking estimator - feature engineer (OPTIONAL)
    - Estimator (classifier or regressor)

In [None]:
#opt = Optimiser()

# First, evaluate the default pipeline configuration (LightGBM), without any AutoML or complex grid search.

# This should be the first baseline

#warnings.filterwarnings('ignore', category=DeprecationWarning)
#score = opt.evaluate(None, df)

Now, we get to define a space of multiple configurations:

    ne__numerical_strategy: how to handle missing data in numerical features
    ce__strategy: how to encode categorical variables
    fs: feature selection
    stck: meta-features stacker
    est: final estimator

In [None]:
space = {
        'ne__numerical_strategy':{"search":"choice","space":[0, "mean"]},
        'ce__strategy':{"search":"choice", "space":["label_encoding", "random_projection", "entity_embedding"]}, 
        'fs__threshold':{"search":"uniform", "space":[0.001, 0.2]}, 
        'est__strategy':{"search":"choice", "space":["RandomForest", "ExtraTrees", "LightGBM"]},
        'est__max_depth':{"search":"choice", "space":[8, 9, 10, 11, 12, 13]}
        }
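To show how a space in this format can be read, here is a small sketch that draws random configurations from it: a "choice" entry picks one of the listed values, and a "uniform" entry draws a float between the two bounds. The helper `sample_configuration` is hypothetical and performs plain random sampling; hyperopt offers smarter strategies such as TPE, and this is not how MLBox is implemented internally.

```python
import random

# A small space in the same dictionary format as the cell above
demo_space = {
    "est__strategy": {"search": "choice", "space": ["RandomForest", "LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [8, 10, 12]},
    "fs__threshold": {"search": "uniform", "space": [0.001, 0.2]},
}

def sample_configuration(space, rng):
    """Draw one configuration: a pick for "choice", a float for "uniform"."""
    config = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            config[name] = rng.choice(spec["space"])
        elif spec["search"] == "uniform":
            low, high = spec["space"]
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported search type: {spec['search']}")
    return config

rng = random.Random(0)
samples = [sample_configuration(demo_space, rng) for _ in range(3)]
for config in samples:
    print(config)
```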

Step 1: 

Create an object of class Optimiser, which takes ‘scoring’ and ‘n_folds’ as parameters. Scoring is the metric against which we want to optimise our hyper-parameter space, and n_folds is the number of folds of cross-validation.

    - Scoring values for classification: "accuracy", "roc_auc", "f1", "log_loss", "precision", "recall"
    - Scoring values for regression: "mean_absolute_error", "mean_squarred_error", "median_absolute_error", "r2"

In [None]:
#opt = Optimiser(scoring="accuracy",n_folds=5)

Step 2:

Use the optimise function of the object created above. It takes the hyper-parameter space, the dictionary created by train_test_split, and the number of iterations as parameters, and returns the best hyper-parameters from the hyper-parameter space.

In [None]:
#opt.evaluate(params, df)

#best=opt.optimise(space, df, 40)

#### There is clearly good potential for further improvement if we define a better search space, add stacking operations, or try other feature selection techniques. You can also see the best hyper-parameters.

#### Running predictions

#### We fit the optimal pipeline and predict on our test dataset.

In [None]:
import pandas as pd

#prd = Predictor()
#prd.fit_predict(best, df)

The above method saves the feature importance, drift variables coefficients and the final predictions into a separate folder named ‘save’.

<a id="h2o"></a>

### H2o


H2O is an open-source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that makes building machine learning models easy. It has two major products: H2O-3, which is open source, and Driverless AI, which is a paid product. The rest are related to their big data offerings, as shown below.


H2O includes an automatic machine learning module that uses its own algorithms to build a pipeline. It performs an exhaustive search over its feature engineering methods and model hyperparameters to optimize its pipelines.

H2O automates some of the most difficult data science and machine learning workflows, such as feature engineering, model validation, model tuning, model selection and model deployment. In addition to this, it also offers automatic visualizations and machine learning interpretability (MLI).


![](https://miro.medium.com/max/512/1*vQe69lEIajJFl86sWjDrnQ.png)

In [None]:
import h2o
from h2o.automl import H2OAutoML

In [None]:
h2o.init()
#df = h2o.import_file("../input/heart-disease-uci/heart.csv")
df = h2o.import_file("../input/titanic/train.csv")

In [None]:
train, test = df.split_frame([0.7], seed=42)

#### We already have our train and test sets, so we just need to choose our response variable, as well as the predictors.

In [None]:
train.head(2)

In [None]:
y = "Survived"

ignore = ["Survived", "PassengerId", "Name"] 

x = list(set(train.names) - set(ignore))

In [None]:
splits = df.split_frame(ratios=[0.7], seed=1)

train = splits[0]

test = splits[1]

In [None]:
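y = "Survived" 

x = df.columns 

x.remove(y) 

x.remove("PassengerId")

x.remove("Name")

# "Survived" is read as a number; convert it to a factor so that H2O treats the task as classification

train[y] = train[y].asfactor()

test[y] = test[y].asfactor()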


Now we are ready to run AutoML. Below you can see some of the default parameters that we could change for AutoML.

In [None]:
#H2OAutoML(nfolds=5, max_runtime_secs=3600, max_models=None, stopping_metric='AUTO', stopping_tolerance=None, stopping_rounds=3, seed=None, project_name=None)

aml = H2OAutoML(max_models=25, max_runtime_secs_per_model=30, seed=42)

%time aml.train(x=x, y=y, training_frame=train)

In [None]:
aml = H2OAutoML(max_runtime_secs=120, seed=1)

aml.train(x=x,y=y, training_frame=train)

The only required parameters for H2O's AutoML are y, training_frame, and a stopping criterion: max_runtime_secs, which lets us train AutoML for a given number of seconds, and/or max_models, which caps the number of models trained. Please note that max_runtime_secs has a default value, while max_models does not. For this task, we set a constraint on the number of models. The seed is the usual parameter that we set for reproducibility purposes. A project_name can also be passed to tell separate AutoML runs apart, although we do not set one here.

The second line of code has the parameters that we need in order to train our model. For now, we will just pass x, y, and the training frame. Please note that the parameter x is optional: if you were using all the columns in your dataset, you would not need to declare it. A leaderboard frame can be used to score and rank models on the leaderboard, but we will use the cross-validation scores to do so because we will check the performance of our models with the test set.

Once AutoML is finished, print the leaderboard, and check out the results.

In [None]:
lb = aml.leaderboard

# lb.head(rows=lb.nrows)

lb.head()

We can also print a leaderboard with the training time, in milliseconds, of each model and the time it takes each model to predict each row, in milliseconds:

In [None]:
from h2o.automl import get_leaderboard

lb2 = get_leaderboard(aml, extra_columns='ALL')

lb2.head(rows=lb2.nrows)

By looking at the leaderboard, we can see the best model at the top. The stacked ensembles will usually combine a GLM, a Distributed Random Forest, an Extremely Randomized Forest, a GBM, an XGBoost model, and a Deep Learning model, provided you give AutoML enough time to train all of them. Let's explore the coefficients of the metalearner to see the models in the Stacked Ensemble with their relative importance.

First, let's retrieve the metalearner, which we can do as follows:

In [None]:
# Get model ids for all models in the AutoML Leaderboard

model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])

# Get the "Best of Family" Stacked Ensemble model

se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_BestOfFamily" in mid][0])

# Get the Stacked Ensemble metalearner model

metalearner = h2o.get_model(se.metalearner()['name'])
metalearner.coef()

#### From the list above, we can see that the most important model used in our Stacked Ensemble is GLM.

#### We can also plot the standardized coefficients with the following code (assuming you retrieved the metalearner from the step above):

In [None]:
metalearner.std_coef_plot()
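To see what metalearner coefficients mean outside of H2O, here is a sketch of the same stacking idea with scikit-learn: base models produce out-of-fold predictions, and a logistic regression (a GLM) learns how much weight to give each of them. The data and the two base models are arbitrary choices for illustration; this is not what H2O runs internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Two hypothetical base models
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
]

# The final estimator plays the role of the metalearner
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X, y)

# One coefficient per base model in the binary case: a larger absolute value
# means the metalearner relies more on that model's predictions
meta_coefs = dict(zip([name for name, _ in base_models],
                      stack.final_estimator_.coef_[0]))
print(meta_coefs)
```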

#### Now let's check the performance on our test set

In [None]:
aml.leader.model_performance(test_data=test)

In [None]:
#%matplotlib inline
#aml.leader.model_performance(test_data=test).plot()

In [None]:
# Lastly, let's make some predictions on our test set.


pred = aml.predict(test)

pred.head()

#### Saving the Leader Model

You can also save and download your model in order to deploy it to production.

In [None]:
h2o.save_model(aml.leader, path="./output")

### Pros:

    - Very intuitive, with lots of custom options to build models
    
    - Easy to get started and build a baseline
    
    - H2O Flow, the web UI, enables quick development and sharing of the analytical model
    
    - Readily available algorithms that are easy to use in your analytical projects

    - Faster than Python's scikit-learn for supervised learning
    
    - Well documented and suitable for fast training or self-study

 
### Cons:

    - The best models are mostly stacked ensembles, and there are not many options for deep learning, especially the latest methods. H2O is still ahead of the other tools here, but more DL model options would make it better. 

<a id="tpo"></a>


![](https://miro.medium.com/max/450/0*dCD9QwVjhVnKKz6U.jpg)


### TPOT(Tree-Based Pipeline Optimization Tool)


TPOT is a tree-based pipeline optimization tool that uses genetic algorithms to optimize machine learning pipelines. TPOT is built on top of scikit-learn and uses its own regressor and classifier methods. TPOT explores thousands of possible pipelines and finds the one that best fits the data.

TPOT cannot automatically process natural language inputs. Additionally, it is not able to process categorical strings, which must be integer-encoded before being passed in as data.
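As a sketch of the integer encoding mentioned above, string columns can be replaced by category codes with pandas. The rows are made up; only the column names come from the Titanic data. Missing values would receive the code -1 with this approach, so they may need separate treatment.

```python
import pandas as pd

# Made-up rows using two Titanic column names
demo = pd.DataFrame({"Sex": ["male", "female", "female"],
                     "Embarked": ["S", "C", "S"]})

# Replace each string column by integer category codes
# (categories are sorted, so "female" -> 0 and "male" -> 1)
encoded = demo.apply(lambda col: col.astype("category").cat.codes)
print(encoded)
```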



![](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1537396029/output_2_0_d7uh0v.png)


#### SOURCE - DATACAMP

In [None]:
df = pd.read_csv("../input/titanic/train.csv") 
df.head()

In [None]:
df = df.fillna(-999)

df_class = df['Survived'].values

In [None]:
from sklearn.model_selection import train_test_split

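# Stratified 75/25 split of the row indices, keeping the class balance of 'Survived'
training_indices, validation_indices = train_test_split(df.index,
                                                        stratify=df_class,
                                                        train_size=0.75, test_size=0.25)
testing_indices = validation_indices  # the same split under a second name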

In [None]:
training_indices.size, validation_indices.size

### For TPOT, everything needs to be float or int, so for the purpose of this example we drop the columns that are not.

In [None]:
#df.info()
df.drop(['Name', 'PassengerId', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)

TPOTClassifier has a wide variety of parameters, and you can read all about them in the TPOT documentation. The most notable ones are:

    generations: Number of iterations to run the pipeline optimization process. The default is 100.

    population_size: Number of individuals to retain in the genetic programming population every generation. The default is 100.

    offspring_size: Number of offspring to produce in each genetic programming generation. The default is 100.

    mutation_rate: Mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. Default is 0.9

    crossover_rate: Crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to "breed" every generation.

    scoring: Function used to evaluate the quality of a given pipeline for the classification problem like accuracy, average_precision, roc_auc, recall, etc. The default is accuracy.

    cv: Cross-validation strategy used when evaluating pipelines. The default is 5.

    random_state: The seed of the pseudo-random number generator used in TPOT. Use this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.
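To give a feel for what these parameters control, here is a toy genetic algorithm in pure Python. Instead of ML pipelines it evolves bit strings, and the fitness is simply the number of ones. It is a deliberately simplified sketch (single-bit mutation, one-point crossover, the best individual always kept), not TPOT's algorithm; the function `evolve` and its defaults are made up for this illustration.

```python
import random

def evolve(fitness, genome_length=20, population_size=30, generations=15,
           mutation_rate=0.9, crossover_rate=0.1, seed=0):
    """Toy genetic algorithm over bit strings; returns (best, history)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]
        offspring = [population[0][:]]  # keep the best individual unchanged
        while len(offspring) < population_size:
            child = rng.choice(parents)[:]
            r = rng.random()
            if r < crossover_rate:
                # "Breed": splice the child with a second parent
                mate = rng.choice(parents)
                cut = rng.randrange(1, genome_length)
                child = child[:cut] + mate[cut:]
            elif r < crossover_rate + mutation_rate:
                # Mutate: flip one randomly chosen bit
                pos = rng.randrange(genome_length)
                child[pos] = 1 - child[pos]
            offspring.append(child)
        population = offspring
    best = max(population, key=fitness)
    return best, history

best, history = evolve(fitness=sum)
print("best fitness per generation:", history)
print("final best fitness:", sum(best))
```

In TPOT, an individual is a whole pipeline and its fitness is the cross-validated score, which is why every generation is far more expensive than in this toy example.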

In [None]:
#from tpot import TPOTClassifier
#from tpot import TPOTRegressor

#tpot = TPOTClassifier(generations=5, verbosity=2)
#tpot.fit(df.drop('Survived', axis=1).loc[training_indices].values, df.loc[training_indices,'Survived'].values)

In the run above, 5 generations were computed, each reporting the best internal cross-validation score found so far on the training set. 

As the output shows, the best pipeline has a CV accuracy score of 74%, and it consists of an ExtraTrees (ET) classifier. 

In [None]:
#tpot.score(df.drop('Survived',axis=1).loc[validation_indices].values,  df.loc[validation_indices, 'Survived'].values)

### You can also tell TPOT to export the Python code for the optimized pipeline to a text file with the export function. I personally think this is an amazing feature:

In [None]:
#tpot.export('pipeline.py')

### PROS:

    - Very easy to get started, install and build a quick model
    
    - Easy preprocessing


### CONS:

    TPOT can take a long time to finish its search. Running TPOT isn’t as simple as fitting one model on the dataset. It is considering multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with numerous preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyper-parameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline. That’s why it usually takes a long time to execute and isn’t feasible for large datasets.

    TPOT can recommend different solutions for the same dataset. If you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may result in different pipeline recommendations. When two TPOT runs recommend different pipelines, this means that the TPOT runs didn't converge due to lack of time or that multiple pipelines perform more-or-less the same on your dataset.

<a id="ak"></a>


### AUTO-KERAS


#### AutoKeras is one of the most widely used AutoML frameworks. As the name says, it is an AutoML system based on Keras. Keras is a popular and widely used DL framework, so the support is huge.

#### The API’s design follows the classic design of the scikit-learn API; hence, it’s extremely simple to use. The current version provides functionalities to automatically search for hyperparameters during the deep learning process. AutoKeras aims to simplify the ML process through the use of automated Neural Architecture Search (NAS) algorithms. Neural Architecture Search essentially replaces the deep learning engineer/practitioner with a set of algorithms that automatically tunes the model.


#### For deep learning, for now you have the ImageClassifier, the BayesianSearcher, a Graph module, a PreProcessor, a LayerTransformer, a NetTransformer, a ClassifierGenerator and some utilities. This is an evolving package.


#### For tabular datasets, AutoKeras has a class called StructuredDataClassifier, which is what we will use. AutoKeras also has AutoModel, which combines multiple inputs. In simple words, StructuredDataClassifier, TextClassifier, etc. are all Task APIs. 

When doing a classical task such as image classification/regression or text classification/regression, you can use the simplest APIs provided by AutoKeras, called Task APIs: ImageClassifier, ImageRegressor, TextClassifier, TextRegressor, ... In this case you have one input (image, text, tabular data, ...) and one output (classification or regression). 

However, when a task requires a multi-input/multi-output architecture, you cannot use the Task API directly, and this is where AutoModel comes into play with the I/O API. 

GraphAutoModel works like the Keras functional API. It assembles different blocks (convolutions, LSTM, GRU, ...) into a model, then looks for the best hyperparameters given the architecture you provided. 
    

AutoModel is mainly used for multi-modal and multi-task problems. 

Multi-modal data means each data instance has multiple forms of information. For example, a photo can be saved as an image. Besides the image, it may also have when and where it was taken as its attributes, which can be represented as structured data.

Multi-task means that we want to predict multiple targets from the same input features. For example, we not only want to classify an image according to its content, but we also want to regress its quality as a float number between 0 and 1.

    
#### Now, let's do the modeling using StructuredDataClassifier

In [None]:
!pip install autokeras

!pip install git+https://github.com/keras-team/keras-tuner.git@1.0.2rc1

In [None]:
import pandas as pd
import numpy as np
import autokeras as ak

x_data = pd.read_csv("../input/titanic/train.csv")

print(type(x_data))

y = x_data.pop('Survived')

print(type(y))

In [None]:
y_train = pd.DataFrame(y)

print(type(y_train)) 

# You can also use numpy.ndarray for x_train and y_train.

x_train = x_data.to_numpy().astype(str)  # np.unicode has been removed from recent NumPy releases

y_train = y.to_numpy()

print(type(x_train)) 

print(type(y_train)) 

In [None]:
# Preparing testing data.

x_test = pd.read_csv("../input/titanic/test.csv")

In [None]:
import sklearn

from sklearn import model_selection, metrics

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_data, y, random_state=1)

#### The arguments used in structured data classification class:

#### Arguments

column_names: A list of strings specifying the names of the columns. The length of the list should be equal to the number of columns of the data excluding the target column. Defaults to None. If None, it will be obtained from the header of the csv file or the pandas.DataFrame.

column_types: Dict. The keys are the column names. The values should either be 'numerical' or 'categorical', indicating the type of that column. Defaults to None. If not None, the column_names need to be specified. If None, it will be inferred from the data.

num_classes: Int. Defaults to None. If None, it will be inferred from the data.

multi_label: Boolean. Defaults to False.

loss: A Keras loss function. Defaults to use 'binary_crossentropy' or 'categorical_crossentropy' based on the number of classes.

metrics: A list of Keras metrics. Defaults to use 'accuracy'.

project_name: String. The name of the AutoModel. Defaults to 'structured_data_classifier'.

max_trials: Int. The maximum number of different Keras Models to try. The search may finish before reaching the max_trials. Defaults to 100.

directory: String. The path to a directory for storing the search outputs. Defaults to None, which would create a folder with the name of the AutoModel in the current directory.

objective: String. Name of model metric to minimize or maximize. Defaults to 'val_accuracy'.

tuner: Union[str, Type[autokeras.engine.tuner.AutoTuner]]. String or subclass of AutoTuner. If string, it should be one of 'greedy', 'bayesian', 'hyperband' or 'random'. It can also be a subclass of AutoTuner. If left unspecified, it uses a task specific tuner, which first evaluates the most commonly used models for the task before exploring other models.

overwrite: Boolean. Defaults to False. If False, reloads an existing project of the same name if one is found. Otherwise, overwrites the project.

seed: Int. Random seed.

**kwargs: Any arguments supported by AutoModel.

In [None]:
clf = ak.StructuredDataClassifier(overwrite=True , max_trials=3)

In [None]:
# Feed the structured data classifier with training data.

clf.fit(x_train, y_train, epochs=10)

In [None]:
# Predict with the best model.

predicted_y = clf.predict(x_test)

# Evaluate the best model with testing data.

print(clf.evaluate(x_test, y_test))

#### The Evaluate Returns:

Scalar test loss (if the model has a single output and no metrics) or list of scalars (if the model has multiple outputs and/or metrics). The attribute model.metrics_names will give you the display labels for the scalar outputs.


The model can also be saved/exported, just like any Keras model.

In [None]:
model = clf.export_model()  # returns the best model as a regular Keras model
model.summary()

### Customized Search Space

Advanced users can customize the search space by using AutoModel instead of StructuredDataClassifier.

You can configure the StructuredDataBlock for some high-level settings, e.g., categorical_encoding to control whether CategoricalToNumerical is used. You can also leave these arguments unspecified, in which case the different choices are tuned automatically. Being able to pin or free such choices is the major difference between using AutoModel and StructuredDataClassifier.

### Pros:

    - Quick to set up, and a baseline can be built quickly.
    
    - Imitates the Keras API and is therefore easy to use.
    
    
### Cons:

    - Compared with other frameworks, it is a bit of a black box when it comes to knowing which model AutoML chose as the best.
    
    - As the name says, it depends on Keras and only builds DL models, whereas frameworks like H2O also use other ML models, which can be useful in production settings where explainability is key.

<a id="py"></a>

### PyCaret


PyCaret's classification module (pycaret.classification) is a supervised machine learning module which is used for classifying the elements into a binary group based on various techniques and algorithms. Some common use cases of classification problems include predicting customer default (yes or no), customer churn (customer will leave or stay), disease found (positive or negative).

The PyCaret classification module can be used for Binary or Multi-class classification problems. It has over 18 algorithms and 14 plots to analyze the performance of models. Be it hyper-parameter tuning, ensembling or advanced techniques like stacking, PyCaret's classification module has it all.

In [None]:
!pip install pycaret
from pycaret.datasets import get_data
dataset = get_data('credit')

In [None]:
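# Sample 95% of the rows for modeling and keep the remaining 5% as unseen data.
# Drop by the original index before resetting it; otherwise the wrong rows are removed.
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
data = data.reset_index(drop=True)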

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

### Setting up Environment in PyCaret

The setup() function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in PyCaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline.

When setup() is executed, PyCaret's inference algorithm automatically infers the data types of all features based on certain properties. The data types should be inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, Enter can be pressed to continue, or quit can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret, as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment. These tasks are performed differently for each data type, so it is very important that the types are correctly configured.

In [None]:
!pip uninstall -y pandas

In [None]:
!pip install pandas

In [None]:
from pycaret.classification import *

### PyCaret needs an input to be entered after the next command, and since Kaggle doesn't support that, it is commented out

In [None]:
# exp_clf101 = setup(data = data, target = 'default', session_id=123)

### Comparing All Models

Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using stratified cross validation for metric evaluation. The output prints a score grid that shows average Accuracy, AUC, Recall, Precision, F1 and Kappa across the folds (10 by default) of all the available models in the model library.

In [None]:
# compare_models()

Two simple words of code (not even a line) have created over 15 models using 10 fold stratified cross validation and evaluated the 6 most commonly used classification metrics (Accuracy, AUC, Recall, Precision, F1, Kappa). The score grid printed above highlights the highest performing metric for comparison purposes only. The grid by default is sorted using 'Accuracy' (highest to lowest) which can be changed by passing the sort parameter. For example compare_models(sort = 'Recall') will sort the grid by Recall instead of Accuracy. If you want to change the fold parameter from the default value of 10 to a different value then you can use the fold parameter. For example compare_models(fold = 5) will compare all models on 5 fold cross validation. Reducing the number of folds will improve the training time.
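To make "stratified cross validation" a little more concrete, here is a minimal pure-Python sketch of how samples can be dealt into folds so that every fold keeps roughly the same class ratio. This is only an illustration of the idea, not PyCaret's implementation; the function name `stratified_folds` and the toy labels are made up for the example, and no shuffling is done.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold so class ratios stay similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy imbalanced target: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, n_folds=5)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # each fold: 20 samples, 4 positives
```

Each fold then serves once as the validation set while the remaining folds are used for training, and the per-fold metrics are averaged into one row of the score grid.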

### Create a Model

While compare_models() is a powerful function and often a starting point in any experiment, it does not return any trained models. PyCaret's recommended experiment workflow is to use compare_models() right after setup to evaluate top performing models and finalize a few candidates for continued experimentation. As such, the function that actually allows you to create a model is unimaginatively called create_model(). This function creates a model and scores it using stratified cross validation. Similar to compare_models(), the output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold.


There are 18 classifiers available in the model library of PyCaret. 

We look at a Decision Tree ('dt') as an example.

In [None]:
# dt = create_model('dt')

In [None]:
# The trained model object is stored in the variable 'dt'.
# print(dt)

### Tune a Model

When a model is created using the create_model() function it uses the default hyperparameters. In order to tune hyperparameters, the tune_model() function is used. This function automatically tunes the hyperparameters of a model on a pre-defined search space and scores it using stratified cross validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold.

In [None]:
# tuned_dt = tune_model(dt)

The tune_model() function runs a random grid search of hyperparameters over a pre-defined search space. By default, it is set to optimize Accuracy, but this can be changed using the optimize parameter.

For example, tune_model(dt, optimize = 'AUC') will search for the hyperparameters of a Decision Tree Classifier that result in the highest AUC. In this example, we have used the default metric, Accuracy, only for the sake of simplicity. Generally, when the dataset is imbalanced (such as the credit dataset we are working with), Accuracy is not a good metric to rely on.
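The point about imbalanced data can be checked with a few lines of plain Python. The sketch below uses a made-up target with 22% positives (an assumption for illustration, not the actual credit data) and a "model" that always predicts the majority class: accuracy looks respectable while recall on the positive class is zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos


# Hypothetical imbalanced target: 22 positives, 78 negatives.
y_true = [1] * 22 + [0] * 78
# A "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.78
print(recall(y_true, y_pred))    # 0.0
```

This is why optimizing for a metric such as AUC or Recall is usually a better choice than Accuracy when one class is rare.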

### Plot a Model

Before model finalization, the plot_model() function can be used to analyze the performance across different aspects such as AUC, confusion_matrix, decision boundary etc. This function takes a trained model object and returns a plot based on the test / hold-out set.

In [None]:
# AUC plot

# plot_model(tuned_dt, plot = 'auc')

In [None]:
# Precision-Recall curve

# plot_model(tuned_dt, plot = 'pr')

In [None]:
# Feature importance plot

# plot_model(tuned_dt, plot='feature')

In [None]:
# Confusion matrix

# plot_model(tuned_dt, plot = 'confusion_matrix')

###  Finalize Model for Deployment

Model finalization is the last step in the experiment. A normal machine learning workflow in PyCaret starts with setup(), followed by comparing all models using compare_models() and shortlisting a few candidate models (based on the metric of interest) to perform several modeling techniques such as hyperparameter tuning, ensembling, stacking etc. This workflow will eventually lead you to the best model for use in making predictions on new and unseen data. The finalize_model() function fits the model onto the complete dataset including the test/hold-out sample (30% in this case). The purpose of this function is to train the model on the complete dataset before it is deployed in production.
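The idea behind finalization can be sketched without PyCaret: first fit on the training split and score on the hold-out, then refit the same kind of model on every row before deployment. The toy classifier `MeanThresholdClassifier`, the data and the 70/30 split below are all hypothetical stand-ins chosen to keep the example self-contained; this is not what finalize_model() does internally.

```python
class MeanThresholdClassifier:
    """Toy 1-D classifier: predict 1 when x exceeds the midpoint of the class means."""

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.threshold_ = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        self.n_seen_ = len(xs)  # number of rows used for fitting
        return self

    def predict(self, xs):
        return [int(x > self.threshold_) for x in xs]


# Toy data: ten rows with alternating classes.
xs = [1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 10.0, 5.0, 11.0]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

split = 7  # first 70% for training, last 30% as hold-out
train_x, train_y = xs[:split], ys[:split]
hold_x, hold_y = xs[split:], ys[split:]

# 1) Fit on the training split and estimate performance on the hold-out.
model = MeanThresholdClassifier().fit(train_x, train_y)
hold_acc = sum(p == t for p, t in zip(model.predict(hold_x), hold_y)) / len(hold_y)

# 2) "Finalize": refit the same model type on every available row.
final_model = MeanThresholdClassifier().fit(xs, ys)
print(hold_acc, model.n_seen_, final_model.n_seen_)
```

Once the hold-out rows have been used for fitting, they can no longer give an unbiased performance estimate, so any evaluation should be done before this step.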

In [None]:
# final_dt = finalize_model(tuned_dt)

# Final Decision Tree model parameters for deployment
# print(final_dt)

### Pros:

    - Easy to set up and train a baseline in fewer than 10 lines of code.

    - Easy to tune and stack models and to get baseline hyperparameters.


### Cons:

    - No dedicated DL options that I could see; these might be added in later versions.

<a id="hp"></a>

### Hyperopt


HyperOpt is an open-source library for large-scale AutoML, and HyperOpt-Sklearn is a wrapper around it that supports AutoML for the popular Scikit-Learn machine learning library, including its suite of data preparation transforms and classification and regression algorithms.


### HyperOpt and HyperOpt-Sklearn

HyperOpt is an open-source Python library for Bayesian optimization developed by James Bergstra.

It is designed for large-scale optimization for models with hundreds of parameters and allows the optimization procedure to be scaled across multiple cores and multiple machines.

The library was explicitly used to optimize machine learning pipelines, including data preparation, model selection, and model hyperparameters.


HyperOpt is challenging to use directly, requiring the optimization procedure and search space to be carefully specified.
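To show what "specifying the optimization procedure and search space" involves, the sketch below spells out the three ingredients by hand: a search space, an objective to minimize, and a search loop. It deliberately uses plain random search from the standard library rather than HyperOpt's TPE algorithm, and none of the names (`search_space`, `objective`, `random_search`) belong to the HyperOpt API; the objective is a made-up stand-in for a real validation loss.

```python
import random

# Search space: each hyperparameter maps to a sampler (names are illustrative).
search_space = {
    "max_depth": lambda rng: rng.randint(1, 10),
    "learning_rate": lambda rng: rng.uniform(0.01, 0.5),
}


def objective(params):
    """Stand-in loss with its minimum at max_depth=4, learning_rate=0.1."""
    return (params["max_depth"] - 4) ** 2 + (params["learning_rate"] - 0.1) ** 2


def random_search(space, objective, max_evals, seed=0):
    """Sample max_evals configurations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = {name: sample(rng) for name, sample in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(search_space, objective, max_evals=50)
print(best_params, best_loss)
```

A Bayesian optimizer keeps the same three ingredients but uses the losses seen so far to decide where to sample next, instead of sampling blindly.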

An extension to HyperOpt was created called HyperOpt-Sklearn that allows the HyperOpt procedure to be applied to data preparation and machine learning models provided by the popular Scikit-Learn open-source machine learning library.

HyperOpt-Sklearn wraps the HyperOpt library and allows for the automatic search of data preparation methods, machine learning algorithms, and model hyperparameters for classification and regression tasks.

In [None]:
!pip install hyperopt

In [None]:
from pandas import read_csv

dataframe = read_csv("../input/solar-data/sonar.csv", header=None)

# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]

In [None]:
# define search

from hpsklearn import HyperoptEstimator
from hpsklearn import any_classifier
from hpsklearn import any_preprocessing
from hyperopt import tpe
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

model = HyperoptEstimator(classifier=any_classifier('cla'), preprocessing=any_preprocessing('pre'), algo=tpe.suggest, max_evals=50, trial_timeout=30)

In [None]:
# minimally prepare dataset

X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [None]:
#!pip uninstall -y hyperopt

#!pip uninstall -y sklearn

# !pip install sklearn

# !pip install git+https://github.com/hyperopt/hyperopt-sklearn.git

In [None]:
# define search

estim = HyperoptEstimator(algo=tpe.suggest, max_evals=50, trial_timeout=30)

# perform the search
# model.fit(X_train, y_train)

# summarize performance
# acc = model.score(X_test, y_test)
# print("Accuracy: %.3f" % acc)

# summarize the best model
# print(model.best_model())

#### Hyperopt, like auto-sklearn and MLBox, seems to have issues in this environment, so the calls are commented out.


#### More on hyperopt-sklearn here: https://hyperopt.github.io/hyperopt-sklearn/


At the end of the run, the best-performing model is evaluated on the holdout dataset and the Pipeline discovered is printed for later use.

In [None]:
#estim.fit( X_train, y_train )


#print( estim.score( X_test, y_test ) )


#print( estim.best_model() )

### Pros:

    - Easy to set up and train a baseline in fewer than 10 lines of code.

    - Makes hyperparameter searches easy.


### Cons:

    - No dedicated DL options that I could see; these might be added in later versions.

<a id="oth"></a>

### Other frameworks, such as Cloud AutoML by Google and TransmogrifAI by Salesforce, are also on the market, but neither can be tried here: the former is a paid product and the latter needs Spark to be installed. If you are interested, you can read about them [here](https://cloud.google.com/automl/) and [here](https://transmogrif.ai/)


### Cloud AutoML is a suite of machine learning products from Google that enables developers with limited machine learning expertise to train high-quality models specific to their business needs by leveraging Google’s state-of-the-art transfer learning and Neural Architecture Search technology. Cloud AutoML provides a simple graphical user interface (GUI) to train, evaluate, improve, and deploy models based on your own data. 



### TransmogrifAI

### TransmogrifAI is an open-source automated machine learning library from Salesforce. The company’s flagship ML platform, Einstein, is also powered by TransmogrifAI. It is an end-to-end AutoML library for structured data, written in Scala, that runs on top of Apache Spark.

### TransmogrifAI is especially useful when you need to:
    
    - Rapidly train good-quality machine learning models with minimal hand tuning.
     
    - Build modular, reusable, strongly-typed machine learning workflows.
    

#### You may read the examples [here](https://docs.transmogrif.ai/en/stable/)

### The reason for this notebook was to survey the AutoML frameworks that are widely used and to try them hands-on. Do let me know your thoughts, and upvote. :)


### There are debates about whether AutoML will replace data scientists in the future. My personal take is that we are still a long way from that: DS is more than just .fit(), .predict() and hyperparameter tuning. Data preparation plays a huge role, and that is not something AutoML can do for you. What you feed in determines what you get back, so the work that happens before modeling cannot be replaced.