# FLIP(01):  Advanced Data Science
**(Tools Module 04: TPOT)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 06 - TPOT API

## Classification

class tpot.TPOTClassifier(generations=100, population_size=100,
                          offspring_size=None, mutation_rate=0.9,
                          crossover_rate=0.1,
                          scoring='accuracy', cv=5,
                          subsample=1.0, n_jobs=1,
                          max_time_mins=None, max_eval_time_mins=5,
                          random_state=None, config_dict=None,
                          warm_start=False,
                          memory=None,
                          periodic_checkpoint_folder=None,
                          early_stop=None,
                          verbosity=0,
                          disable_update_check=False)

[source](https://github.com/EpistasisLab/tpot/blob/master/tpot/base.py)

Automated machine learning for supervised classification tasks.

The TPOTClassifier performs an intelligent search over machine learning pipelines that can contain supervised classification models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the [`scikit-learn API`](http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects). The TPOTClassifier will also search over the hyperparameters of all objects in the pipeline.

By default, TPOTClassifier will search over a broad range of supervised classification algorithms, transformers, and their parameters. However, the algorithms, transformers, and hyperparameters that the TPOTClassifier searches over can be fully customized using the `config_dict` parameter.

Read more in the [User Guide](https://epistasislab.github.io/tpot/using/#tpot-with-code).
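As a rough illustration of the `config_dict` customization mentioned above, the sketch below writes a small custom configuration as a plain Python dictionary and counts how many hyperparameter settings each operator contributes to the search space. The layout (operator import path mapped to a dictionary of candidate parameter values) follows TPOT's custom-configuration convention. The particular operators, the value grids, and the helper `count_combinations` are illustrative assumptions, not part of TPOT.

```python
from itertools import product

# Hypothetical custom search space: each key is the import path of a
# scikit-learn operator, each value maps a parameter name to candidate values.
tpot_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.naive_bayes.BernoulliNB": {
        "alpha": [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0],
        "fit_prior": [True, False],
    },
    "sklearn.neighbors.KNeighborsClassifier": {
        "n_neighbors": list(range(1, 11)),
        "weights": ["uniform", "distance"],
    },
}


def count_combinations(param_grid):
    """Return the number of distinct parameter settings in one operator's grid."""
    # product() of zero iterables yields a single empty tuple, so an operator
    # with no tunable parameters counts as exactly one setting.
    return sum(1 for _ in product(*param_grid.values()))


for operator, grid in tpot_config.items():
    print(operator, count_combinations(grid))

# The dictionary would then be passed as TPOTClassifier(config_dict=tpot_config).
```

Counting the settings per operator gives a quick feel for how large a custom search space is before handing it to TPOT.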

<table>
    <tr>
        <td>Parameters:</td>
        <td>**generations**: int, optional (default=100)<br>

&emsp;Number of iterations to run the pipeline optimization process. Must be a positive number.
<br><br>
&emsp;Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.
<br><br>
&emsp;TPOT will evaluate population_size + generations × offspring_size pipelines in total. 
<br><br>
**population_size**: int, optional (default=100)
<br>
&emsp;Number of individuals to retain in the genetic programming population every generation. Must be a positive number.
<br><br>
&emsp;Generally, TPOT will work better when you give it more individuals with which to optimize the pipeline. 
<br><br>
**offspring_size**: int, optional (default=None)
<br>
&emsp;Number of offspring to produce in each genetic programming generation. Must be a positive number. If None, the number of offspring equals population_size.
<br><br>
**mutation_rate**: float, optional (default=0.9)
<br>
&emsp;Mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.
<br><br>
&emsp;mutation_rate + crossover_rate cannot exceed 1.0.
<br><br>
&emsp;We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. 
<br><br>
**crossover_rate**: float, optional (default=0.1)
<br>
&emsp;Crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to "breed" every generation.
<br><br>
&emsp;mutation_rate + crossover_rate cannot exceed 1.0.
<br><br>
&emsp;We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. 
<br><br>
**scoring**: string or callable, optional (default='accuracy')
<br>
&emsp;Function used to evaluate the quality of a given pipeline for the classification problem. The following built-in scoring functions can be used:
<br><br>
&emsp;'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss','precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc'
<br><br>
&emsp;If you would like to use a custom scorer, you can pass the callable object/function with signature scorer(estimator, X, y).
<br><br>
&emsp;If you would like to use a metric function, you can pass the callable function to this parameter with the signature score_func(y_true, y_pred). TPOT assumes that any function with "error" or "loss" in the function name is meant to be minimized, whereas any other functions will be maximized. This scoring type was deprecated in version 0.9.1 and will be removed in version 0.11.
<br><br>
&emsp;See the section on scoring functions for more details. 
<br><br>
**cv**: int, cross-validation generator, or an iterable, optional (default=5)
<br>
&emsp;Cross-validation strategy used when evaluating pipelines.
<br><br>
&emsp;Possible inputs:<br>

&emsp;1. integer, to specify the number of folds in a StratifiedKFold,<br>
&emsp;2. An object to be used as a cross-validation generator, or<br>
&emsp;3. An iterable yielding train/test splits.<br>
<br><br>
**subsample**: float, optional (default=1.0)
<br><br>
&emsp;Fraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0].
<br><br>
&emsp;Setting subsample=0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process. 
<br><br>
**n_jobs**: integer, optional (default=1)
<br><br>
&emsp;Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.
<br><br>
&emsp;Setting *n_jobs=-1* will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets.
<br><br>
**max_time_mins**: integer or None, optional (default=None)
<br><br>
&emsp;How many minutes TPOT has to optimize the pipeline.
<br><br>
&emsp;If not None, this setting will override the generations parameter and allow TPOT to run until max_time_mins minutes elapse. 
<br><br>
**max_eval_time_mins**: integer, optional (default=5)
<br><br>
&emsp;How many minutes TPOT has to evaluate a single pipeline.
<br><br>
&emsp;Setting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines. 
<br><br>
**random_state**: integer or None, optional (default=None)
<br><br>
&emsp;The seed of the pseudo random number generator used in TPOT.
<br><br>
&emsp;Use this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. 
<br><br>
**config_dict**: Python dictionary, string, or None, optional (default=None)
<br><br>
&emsp;A configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process.
<br><br>
&emsp;Possible inputs are:
<br>
&emsp;1. Python dictionary, TPOT will use your custom configuration,<br>
&emsp;2. string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors,<br>
&emsp;3. string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies,<br>
&emsp;4. string 'TPOT sparse', TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or<br>
&emsp;5. None, TPOT will use the default TPOTClassifier configuration.
<br><br>
&emsp;See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations. 
<br><br>
**warm_start**: boolean, optional (default=False)
<br><br>
&emsp;Flag indicating whether the TPOT instance will reuse the population from previous calls to fit().
<br><br>
&emsp;Setting warm_start=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off. 
<br><br>
**memory**: a sklearn.externals.joblib.Memory object or string, optional (default=None)
<br><br>
&emsp;If supplied, the pipeline will cache each transformer after calling fit. This avoids refitting transformers within a pipeline when the parameters and input data are identical to those of another pipeline fitted during the optimization process. See the [scikit-learn documentation](http://scikit-learn.org/stable/modules/pipeline.html#caching-transformers-avoid-repeated-computation) for more details about memory caching.
<br><br>
&emsp;Possible inputs are:
<br><br>
&emsp;1. String 'auto': TPOT uses memory caching with a temporary directory and cleans it up upon shutdown, or<br>
&emsp;2. Path of a caching directory, TPOT uses memory caching with the provided directory and TPOT does NOT clean the caching directory up upon shutdown, or<br>
&emsp;3. Memory object, TPOT uses the instance of sklearn.externals.joblib.Memory for memory caching and TPOT does NOT clean the caching directory up upon shutdown, or<br>
&emsp;4. None, TPOT does not use memory caching.
<br><br>
**periodic_checkpoint_folder**: path string, optional (default: None)
<br>
&emsp;If supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing.
&emsp;Currently once per generation but not more often than once per 30 seconds.
<br><br>
&emsp;Useful in multiple cases:
<br><br>
&emsp;1. Sudden death before TPOT could save optimized pipeline<br>
&emsp;2. Track its progress<br>
&emsp;3. Grab pipelines while it's still optimizing<br>
<br>
**early_stop**: integer, optional (default: None)
<br>
&emsp;How many generations TPOT waits for an improvement in the optimization process.
<br><br>
&emsp;Ends the optimization process if there is no improvement in the given number of generations. 
<br><br>
**verbosity**: integer, optional (default=0)
<br>
&emsp;How much information TPOT communicates while it's running.
<br><br>
&emsp;Possible inputs are:<br>
<br><br>
&emsp;0, TPOT will print nothing,<br>
&emsp;1, TPOT will print minimal information,<br>
&emsp;2, TPOT will print more information and provide a progress bar, or<br>
&emsp;3, TPOT will print everything and provide a progress bar.<br>
<br>
**disable_update_check**: boolean, optional (default=False)
<br>
&emsp;Flag indicating whether the TPOT version checker should be disabled.
<br><br>
&emsp;The update checker will tell you when a new version of TPOT has been released. </td>
    </tr>
    <tr>
    <td>Attributes:</td>
    <td>
**fitted_pipeline_**: scikit-learn Pipeline object
<br>
&emsp;The best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset. 
<br><br>
**pareto_front_fitted_pipelines_**: Python dictionary
<br>
&emsp;Dictionary containing all pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset.
<br><br>
&emsp;The TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the pipeline) and the predictive performance of the pipeline.
<br><br>
&emsp;Note: pareto_front_fitted_pipelines_ is only available when *verbosity=3*. 
<br><br>
**evaluated_individuals_**: Python dictionary
<br>
&emsp;Dictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline).
<br><br>
&emsp;This attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated. </td>
    </tr>
</table>
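The two numeric constraints stated in the parameter table can be checked with a few lines of plain Python. This is a minimal sketch, not TPOT's own validation code: the helper names `total_pipelines` and `check_rates` are hypothetical, and treating `offspring_size=None` as "same as `population_size`" is an assumption made explicit in the code.

```python
def total_pipelines(generations=100, population_size=100, offspring_size=None):
    """Search budget from the table: population_size + generations * offspring_size."""
    # Assumption: an unset offspring_size falls back to population_size.
    if offspring_size is None:
        offspring_size = population_size
    for name, value in (
        ("generations", generations),
        ("population_size", population_size),
        ("offspring_size", offspring_size),
    ):
        if value <= 0:
            raise ValueError(f"{name} must be a positive number, got {value}")
    return population_size + generations * offspring_size


def check_rates(mutation_rate=0.9, crossover_rate=0.1):
    """Raise ValueError if the rates violate the documented constraints."""
    for name, rate in (("mutation_rate", mutation_rate),
                       ("crossover_rate", crossover_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must lie in [0.0, 1.0], got {rate}")
    # A small tolerance guards against floating-point rounding in the sum.
    if mutation_rate + crossover_rate > 1.0 + 1e-9:
        raise ValueError("mutation_rate + crossover_rate cannot exceed 1.0")


check_rates()                                                   # defaults are valid
print(total_pipelines())                                        # default settings
print(total_pipelines(generations=5, population_size=50))      # settings from the example
```

With the default settings the budget comes to 10,100 pipeline evaluations, which is why the example in this session reduces `generations` and `population_size`.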

Example

In [None]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
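# fit() initializes the genetic programming algorithm, finds the pipeline with the highest average k-fold cross-validation score, and then trains that pipeline on the entire set of provided samples; the TPOT instance can then be used as a fitted model.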
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')

Generation 1 - Current best internal CV score: 0.9725347652485199

Functions

<table>
    <tr>
        <td><font color='blue'>fit</font>(features, classes[, sample_weight, groups])</td>
        <td>Run the TPOT optimization process on the given training data.</td>
    </tr>
    
    <tr>
        <td><font color='blue'>predict</font>(features)</td>
        <td>Use the optimized pipeline to predict the classes for a feature set.</td>
    </tr>
    
    <tr>
        <td><font color='blue'>predict_proba</font>(features)</td>
        <td>Use the optimized pipeline to estimate the class probabilities for a feature set.</td>
    </tr>
    
    <tr>
        <td><font color='blue'>score</font>(testing_features, testing_classes)</td>
        <td>Returns the optimized pipeline's score on the given testing data using the user-specified scoring function.</td>
    </tr>
    
    <tr>
        <td><font color='blue'>export</font>(output_file_name)</td>
        <td>Export the optimized pipeline as Python code.</td>
    </tr>
</table>

In [None]:
fit(features, classes, sample_weight=None, groups=None)

Run the TPOT optimization process on the given training data.

Uses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. This pipeline optimization procedure uses internal k-fold cross-validation to avoid overfitting on the provided data. At the end of the pipeline optimization procedure, the best pipeline is then trained on the entire set of provided samples.
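The "select by average k-fold score, then retrain on everything" step described above can be sketched without TPOT. The example below is a toy under stated assumptions: the two candidate "pipelines" (`MajorityClassifier`, `ThresholdClassifier`), the contiguous fold splitter, and the data are all hypothetical stand-ins, and no genetic programming is involved.

```python
from collections import Counter
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


class MajorityClassifier:
    """Toy candidate: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class ThresholdClassifier:
    """Toy candidate: splits one feature halfway between the two class means."""
    def fit(self, X, y):
        zeros = [row[0] for row, label in zip(X, y) if label == 0]
        ones = [row[0] for row, label in zip(X, y) if label == 1]
        self.threshold = (mean(zeros) + mean(ones)) / 2
        return self

    def predict(self, X):
        return [1 if row[0] > self.threshold else 0 for row in X]


def cv_accuracy(make_model, X, y, k=5):
    """Average accuracy of a freshly built model over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = make_model().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        scores.append(sum(p == y[i] for p, i in zip(preds, test_idx)) / len(test_idx))
    return mean(scores)


X = [[0.1], [0.9], [0.2], [0.8], [0.15], [0.85], [0.3], [0.7], [0.25], [0.75]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

candidates = {"majority": MajorityClassifier, "threshold": ThresholdClassifier}
cv_scores = {name: cv_accuracy(cls, X, y) for name, cls in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# The winning candidate is refitted on all provided samples.
final_model = candidates[best_name]().fit(X, y)
print(best_name, cv_scores)
```

Scoring each candidate only on held-out folds is what keeps the selection from rewarding a pipeline that merely memorizes the training data.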

<table>
    <tr>
        <td>**Parameters**:</td>
        <td>**features**: array-like {n_samples, n_features}
<br>
&emsp;Feature matrix
<br><br>
&emsp;TPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values. As such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed) using <font color='blue'>median value imputation</font>.
<br><br>
&emsp;If you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT. 
<br><br>
**classes**: array-like {n_samples}
<br><br>
&emsp;List of class labels for prediction 
<br><br>
**sample_weight**: array-like {n_samples}, optional
<br>
&emsp;Per-sample weights. Higher weights force TPOT to put more emphasis on those points. 
<br><br>
**groups**: array-like, with shape {n_samples, }, optional
<br>
&emsp;Group labels for the samples used when performing cross-validation.
<br><br>
&emsp;This parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as <font color='blue'>sklearn.model_selection.GroupKFold.</font></td>
    </tr>
    <tr>
        <td>**Returns**:</td>
        <td>**self**: object
<br>
&emsp;Returns a copy of the fitted TPOT object </td>
    </tr>
</table>
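The median value imputation described for `features` can be sketched in plain Python as follows. The helper `impute_median` and the sample matrix are hypothetical; the sketch assumes missing values are encoded as NaN and that every column has at least one observed value.

```python
import math
from statistics import median


def impute_median(rows):
    """Replace NaN entries with the median of the observed values in that column."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        medians.append(median(observed))
    return [
        [medians[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
features = [
    [1.0, 10.0],
    [nan, 20.0],
    [3.0, nan],
    [5.0, 40.0],
]
imputed = impute_median(features)
print(imputed)
```

To use a different strategy, such as mean or most-frequent imputation, the same kind of preprocessing would be applied to the feature matrix before it is passed to `fit`.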

In [None]:
predict(features)

Use the optimized pipeline to predict the classes for a feature set.

<table>
    <tr>
        <td>**Parameters**:</td>
        <td>**features:** array-like {n_samples, n_features}
<br>
&emsp;Feature matrix </td>
    </tr>
    <tr>
        <td>**Returns**:</td>
        <td>**predictions:**  array-like {n_samples}
<br>
&emsp;Predicted classes for the samples in the feature matrix </td>
    </tr>

</table>

In [None]:
predict_proba(features)

Use the optimized pipeline to estimate the class probabilities for a feature set.

Note: This function will only work for pipelines whose final classifier supports the *predict_proba* function. TPOT will raise an error otherwise.
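The restriction in the note above amounts to a capability check on the final step of the pipeline. The sketch below illustrates the idea with two hypothetical stand-in classes. The helper `safe_predict_proba` and the choice of `RuntimeError` are assumptions, since the text does not say which exception TPOT raises.

```python
class ProbabilisticClassifier:
    """Toy final estimator that exposes predict_proba."""
    def predict_proba(self, features):
        # One row per sample, one column per class.
        return [[0.5, 0.5] for _ in features]


class HardLabelClassifier:
    """Toy final estimator that only produces hard class labels."""
    def predict(self, features):
        return [0 for _ in features]


def safe_predict_proba(final_estimator, features):
    """Return class probabilities, or fail clearly if they are unsupported."""
    if not hasattr(final_estimator, "predict_proba"):
        raise RuntimeError(
            "The final classifier of this pipeline does not support predict_proba."
        )
    return final_estimator.predict_proba(features)


print(safe_predict_proba(ProbabilisticClassifier(), [[1.0], [2.0]]))
```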

<table>
    <tr>
        <td>**Parameters**:</td>
        <td>**features:** array-like {n_samples, n_features}
<br>
&emsp;Feature matrix </td>
    </tr>
    <tr>
        <td>**Returns**:</td>
        <td>**predictions:**  array-like {n_samples, n_classes}
<br>
&emsp;The class probabilities of the input samples </td>
    </tr>

</table>

In [None]:
score(testing_features, testing_classes)

Returns the optimized pipeline's score on the given testing data using the user-specified scoring function.

The default scoring function for TPOTClassifier is 'accuracy'.
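When accuracy is not the right measure, the `scoring` parameter also accepts a callable with the signature `scorer(estimator, X, y)`. The following sketch shows one such callable that averages per-class recall. The scorer `macro_recall_scorer`, the stand-in estimator `ParityClassifier`, and the data are hypothetical and serve only to make the example runnable without TPOT.

```python
class ParityClassifier:
    """Hypothetical stand-in estimator: predicts 1 for odd first features, else 0."""
    def predict(self, X):
        return [int(row[0]) % 2 for row in X]


def macro_recall_scorer(estimator, X, y):
    """Custom scorer with signature scorer(estimator, X, y); higher is better."""
    predictions = estimator.predict(X)
    recalls = []
    for label in sorted(set(y)):
        total = sum(1 for truth in y if truth == label)
        hits = sum(1 for truth, pred in zip(y, predictions)
                   if truth == label and pred == label)
        recalls.append(hits / total)
    # Unweighted mean of the per-class recalls.
    return sum(recalls) / len(recalls)


X_test = [[1], [2], [3], [4], [5], [6]]
y_test = [1, 0, 1, 0, 0, 0]
score = macro_recall_scorer(ParityClassifier(), X_test, y_test)
print(score)

# The callable would be supplied as TPOTClassifier(scoring=macro_recall_scorer).
```

Because the scorer receives the estimator itself, it can call `predict`, `predict_proba`, or anything else the estimator provides.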


<table>
    <tr>
        <td>**Parameters**:</td>
        <td>**testing_features:** array-like {n_samples, n_features}
<br>
&emsp;Feature matrix of the testing set
<br><br>
**testing_classes:** array-like {n_samples} 
<br>
&emsp;List of class labels for prediction in the testing set 
        </td>
    </tr>
    <tr>
        <td>**Returns**:</td>
        <td>**accuracy_score:** float
<br>
&emsp;The estimated test set accuracy according to the user-specified scoring function. </td>
    </tr>
</table>

In [None]:
export(output_file_name)

Export the optimized pipeline as Python code.

See the [usage documentation](https://epistasislab.github.io/tpot/using/#tpot-with-code) for example usage of the export function. 

<table>

    <tr>
        <td>**Parameters**:</td>
        <td>**output_file_name:** string 
<br>
&emsp;String containing the path and file name of the desired output file 
        </td>
    </tr>
    
    <tr>
        <td>**Returns**:</td>
        <td>Does not return anything</td>
    </tr>
    
</table>

## Regression

class tpot.TPOTRegressor(generations=100, population_size=100,
                         offspring_size=None, mutation_rate=0.9,
                         crossover_rate=0.1,
                         scoring='neg_mean_squared_error', cv=5,
                         subsample=1.0, n_jobs=1,
                         max_time_mins=None, max_eval_time_mins=5,
                         random_state=None, config_dict=None,
                         warm_start=False,
                         memory=None,
                         periodic_checkpoint_folder=None,
                         early_stop=None,
                         verbosity=0,
                         disable_update_check=False)

[source](https://github.com/EpistasisLab/tpot/blob/master/tpot/base.py)

Automated machine learning for supervised regression tasks.

The TPOTRegressor performs an intelligent search over machine learning pipelines that can contain supervised regression models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the [`scikit-learn API`](http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects). The TPOTRegressor will also search over the hyperparameters of all objects in the pipeline.

By default, TPOTRegressor will search over a broad range of supervised regression models, transformers, and their hyperparameters. However, the models, transformers, and parameters that the TPOTRegressor searches over can be fully customized using the `config_dict` parameter.

Read more in the [User Guide](https://epistasislab.github.io/tpot/using/#tpot-with-code).

[Parameters](https://epistasislab.github.io/tpot/api/)
[Attributes](https://epistasislab.github.io/tpot/api/)

In [None]:
from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release.
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

Functions
<br>
[Functions](https://epistasislab.github.io/tpot/api/)

In [None]:
fit(features, target, sample_weight=None, groups=None)

Run the TPOT optimization process on the given training data.

Uses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. This pipeline optimization procedure uses internal k-fold cross-validation to avoid overfitting on the provided data. At the end of the pipeline optimization procedure, the best pipeline is then trained on the entire set of provided samples.

[Parameters](https://epistasislab.github.io/tpot/api/)

In [None]:
predict(features)

Use the optimized pipeline to predict the target values for a feature set. 

In [None]:
score(testing_features, testing_target)

Returns the optimized pipeline's score on the given testing data using the user-specified scoring function.

The default scoring function for TPOTRegressor is 'neg_mean_squared_error'.
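The TPOTRegressor signature lists `scoring='neg_mean_squared_error'`, that is, the mean squared error with its sign flipped so that a larger score always means a better pipeline. A minimal sketch of that quantity, using a hypothetical helper and made-up predictions:

```python
def neg_mean_squared_error(y_true, y_pred):
    """Negated mean squared error: closer to zero (larger) is better."""
    squared_errors = [(truth - pred) ** 2 for truth, pred in zip(y_true, y_pred)]
    return -sum(squared_errors) / len(squared_errors)


y_true = [3.0, 5.0, 2.0, 4.0]
close_predictions = [3.0, 4.0, 2.0, 5.0]   # off by one on two samples
poor_predictions = [1.0, 1.0, 1.0, 1.0]    # constant guess

close_score = neg_mean_squared_error(y_true, close_predictions)
poor_score = neg_mean_squared_error(y_true, poor_predictions)
print(close_score, poor_score)
```

The score is therefore never positive, and a pipeline scoring -0.5 is preferred over one scoring -7.5.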

In [None]:
export(output_file_name)

Export the optimized pipeline as Python code.

See the [`usage documentation`](https://epistasislab.github.io/tpot/using/#tpot-with-code) for example usage of the export function. 