# **CIS 520: Machine Learning, Fall 2020**
# **Week 14, Worksheet 2**
## **Auto Machine Learning**


- **Content Creator:** Shaozhe Lyu
- **Content Checkers:** Michael Zhou, Siyun Hu
- **Acknowledgements:** This notebook contains an excerpt from the [auto-sklearn documentation](https://automl.github.io/auto-sklearn/master/).



Auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning, and ensemble construction. To learn more about the technology behind auto-sklearn, read the paper published at [NIPS 2015](https://proceedings.neurips.cc/paper/2015/file/11d0e6287202fced83f79975ec59a3a6-Paper.pdf).

We will start by installing the packages swig and auto-sklearn:

In [None]:
!sudo apt-get install swig -y
!pip install Cython numpy
!pip install pipelineprofiler
!pip install auto-sklearn

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  swig3.0
Suggested packages:
  swig-doc swig-examples swig3.0-examples swig3.0-doc
The following NEW packages will be installed:
  swig swig3.0
0 upgraded, 2 newly installed, 0 to remove and 14 not upgraded.
Need to get 1,100 kB of archives.
After this operation, 5,822 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig3.0 amd64 3.0.12-1 [1,094 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig amd64 3.0.12-1 [6,460 B]
Fetched 1,100 kB in 1s (1,258 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 2.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: 

In [None]:
!git clone https://github.com/automl/auto-sklearn.git

Cloning into 'auto-sklearn'...
remote: Enumerating objects: 152, done.
remote: Counting objects: 100% (152/152), done.
remote: Compressing objects: 100% (120/120), done.
remote: Total 36673 (delta 75), reused 54 (delta 30), pack-reused 36521
Receiving objects: 100% (36673/36673), 33.64 MiB | 10.15 MiB/s, done.
Resolving deltas: 100% (29288/29288), done.


Next, we will create a new file "requirements.txt" and then copy the block below into the file:

In [None]:
setuptools

numpy>=1.9.0
scipy>=0.14.1

joblib
scikit-learn>=0.22.0,<0.23

dask
distributed>=2.2.0
lockfile
pyyaml
pandas>=1.0
liac-arff

ConfigSpace>=0.4.14,<0.5
pynisher>=0.6.1
pyrfr>=0.7,<0.9
smac>=0.13.1,<0.14


Using "requirements.txt", we will install all necessary dependencies at their pinned versions:

In [None]:
!pip install -r requirements.txt



After installing all the dependencies, select **Restart runtime** so that the newly installed package versions are loaded, and then import the auto machine learning library:

In [None]:
import pandas as pd
import numpy as np
import PipelineProfiler
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
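
After the restart, it can help to confirm that the versions pinned in "requirements.txt" are the ones actually installed. The sketch below is one way to do that, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_version` is made up for this example and is not part of auto-sklearn.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names are taken from requirements.txt and the pip commands above.
for name in ["scikit-learn", "smac", "ConfigSpace", "auto-sklearn"]:
    print(name, installed_version(name))
```

If scikit-learn is reported outside the `>=0.22.0,<0.23` range, the runtime has probably not been restarted since the install.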

## 1. Auto-sklearn Classification

## Data Loading



In [None]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)


## Building and Fitting
**time_left_for_this_task**, optional (default=3600):
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

**per_run_time_limit**, optional (default=1/10 of time_left_for_this_task):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
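
To get a feel for how the two limits interact, the rough arithmetic below counts how many runs would fit in the budget if every run used its full time limit. This is only a back-of-the-envelope sketch: the helper `max_full_length_runs` is hypothetical, it assumes runs execute one after another, it assumes the default is computed by integer division, and it ignores the time auto-sklearn spends on meta-learning and ensemble building.

```python
def max_full_length_runs(time_left_for_this_task, per_run_time_limit=None):
    """Rough count of sequential runs that fit if each run hits its time limit."""
    if per_run_time_limit is None:
        # Documented default: 1/10 of the total budget (integer division assumed).
        per_run_time_limit = time_left_for_this_task // 10
    return time_left_for_this_task // per_run_time_limit


print(max_full_length_runs(120, 30))  # 4
print(max_full_length_runs(3600))     # 10
```

In practice many models finish well before the limit, so the search usually evaluates more configurations than this worst-case count suggests.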

In [None]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_classification_example_tmp',
    output_folder='/tmp/autosklearn_classification_example_out',
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')



AutoSklearnClassifier(dask_client=None,
                      delete_output_folder_after_terminate=True,
                      delete_tmp_folder_after_terminate=True,
                      disable_evaluator_output=False, ensemble_nbest=50,
                      ensemble_size=50, exclude_estimators=None,
                      exclude_preprocessors=None, get_smac_object_callback=None,
                      include_estimators=None, include_preprocessors=None,
                      initial_configurations_via_metal...
                      max_models_on_disc=50, memory_limit=3072,
                      metadata_directory=None, metric=None, n_jobs=None,
                      output_folder='/tmp/autosklearn_classification_example_out',
                      per_run_time_limit=30, resampling_strategy='holdout',
                      resampling_strategy_arguments=None, seed=1,
                      smac_scenario_args=None, time_left_for_this_task=120,
                      tmp_folder='/tmp/auto

Now we print the final ensemble constructed by auto-sklearn. The output lists each pipeline in the ensemble together with its weight, so you can see which kinds of classifiers auto-sklearn combined.

In [None]:
print(automl.show_models())

[(0.320000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'gradient_boosting', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'no_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'most_frequent', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:__choice__': 'polynomial', 'classifier:gradient_boosting:early_stop': 'valid', 'classifier:gradient_boosting:l2_regularization': 8.495727973549814e-08, 'classifier:gradient_boosting:learning_rate': 0.10895774269386836, 'classifier:gradient_boosting:loss': 'auto', 'classifier:gradient_boosting:max_bins': 255, 'classifier:gradient_boosting:max_depth': 'None', 'classifier:gradient_boosting:max_leaf_nodes': 6, 'classifier:gradient_boosting:min_samples_leaf': 17, 'classifier:gradient_boosting:scor

## Get the Score of the Final Ensemble

If you want a higher score, try increasing the **time_left_for_this_task** and **per_run_time_limit** parameters; a larger budget gives auto-sklearn a better chance of finding good models.

In [None]:
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

Accuracy score: 0.9440559440559441


## Questions

How well did the above method do compared to a standard random forest?
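
One way to set up this comparison is sketched below, assuming scikit-learn's `RandomForestClassifier` with its default settings and the same train/test split as above. The choices `n_estimators=100` and `random_state=1` are illustrative, not prescribed by the worksheet.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data and split as the auto-sklearn example, so the scores are comparable.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Untuned baseline: a random forest with mostly default hyperparameters.
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)

rf_accuracy = accuracy_score(y_test, rf.predict(X_test))
print("Random forest accuracy:", rf_accuracy)
```

Compare the printed value with the auto-sklearn accuracy on the same test set, keeping in mind that a single split on a small dataset gives a noisy estimate.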

What did it find was the most accurate model? (What contributed the most to the ensemble?)
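
The `show_models()` output is a list of `(weight, pipeline)` pairs, so one way to approach this question is to sort the pairs by weight. The helper `rank_ensemble_members` below is a hypothetical name, and the weights and pipeline labels are placeholders for illustration only, not results from this run.

```python
def rank_ensemble_members(weighted_models, top_k=3):
    """Return the top_k (weight, model) pairs, largest ensemble weight first."""
    return sorted(weighted_models, key=lambda pair: pair[0], reverse=True)[:top_k]


# Placeholder data in the same (weight, model) shape as the printed ensemble.
example_ensemble = [
    (0.2, "pipeline_a"),
    (0.5, "pipeline_b"),
    (0.3, "pipeline_c"),
]

for weight, model in rank_ensemble_members(example_ensemble, top_k=2):
    print(f"{weight:.2f}  {model}")
```

With a fitted model, `automl.get_models_with_weights()` should return pairs in this shape in the auto-sklearn version used here; treat that as an assumption and check it against your installed version. Note that the largest ensemble weight does not necessarily belong to the pipeline with the highest individual accuracy.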


AutoML can also be extended to use text descriptions of problems to pick hyperparameters:
* Use vector embeddings of dataset title, description and keywords
* For each new dataset, find the most similar prior dataset and use its hyperparameters
* The similarity metric is learned (supervised)
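
The lookup step described in the list above can be sketched as a nearest-neighbor search over dataset embeddings. This is a simplified illustration: it uses plain cosine similarity in place of the learned similarity metric, and the embeddings, dataset names, and hyperparameters are all invented toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_dataset(new_embedding, prior_datasets):
    """Return (name, hyperparameters) of the prior dataset closest to new_embedding."""
    best = max(
        prior_datasets,
        key=lambda name: cosine_similarity(new_embedding, prior_datasets[name][0]),
    )
    return best, prior_datasets[best][1]


# Toy stand-ins for embeddings of each dataset's title, description, and keywords.
prior_datasets = {
    "tumor_measurements": ([0.9, 0.1, 0.0], {"classifier": "gradient_boosting"}),
    "movie_reviews": ([0.0, 0.2, 0.9], {"classifier": "random_forest"}),
}

name, hyperparameters = most_similar_dataset([0.8, 0.3, 0.1], prior_datasets)
print(name, hyperparameters)
```

In the approach described above, the similarity function would itself be trained with supervision rather than fixed to cosine similarity.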
