# FLIP(01):  Advanced Data Science
**(Tools Module 04: TPOT)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 01 - TPOT on the command line

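To use TPOT via the command line, run the `tpot` command followed by the path to the data file. Additional arguments, such as the column separator, the target column name, and the output file for the exported pipeline, can be appended to the same call.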
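
As a minimal sketch, a command-line call can be assembled in Python and then pasted into a shell. The flags used here are the ones documented in the argument table in this session; the data path `data/mnist.csv`, the target column `class`, the output file name, and the helper `build_tpot_command` are illustrative assumptions, not part of TPOT.

```python
import shlex


def build_tpot_command(data_path, **options):
    """Return the argv list for a `tpot` call.

    Hypothetical helper: keyword names are used verbatim as flag names,
    so target="class" becomes "-target class".
    """
    argv = ["tpot", data_path]
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv


# Hypothetical data file, column name, and output file.
argv = build_tpot_command(
    "data/mnist.csv",
    target="class",
    mode="classification",
    o="tpot_exported_pipeline.py",
    g=5,
    p=20,
    cv=5,
    s=42,
    v=2,
)

# shlex.join quotes each argument so the line is safe to paste into a shell.
command_line = shlex.join(argv)
print(command_line)
```

Once TPOT is installed, the list form can also be passed directly to `subprocess.run(argv)` instead of being pasted into a terminal.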

TPOT offers several arguments that can be provided at the command line. To see brief descriptions of these arguments, enter the following command:

In [None]:
tpot --help

Detailed descriptions of the command-line arguments are below.

<table>
   <tr>
      <td>Argument</td>
      <td>Parameter</td>
      <td>Valid values</td>
      <td>Effect</td>
   </tr>
   <tr>
      <td>-is</td>
      <td>INPUT_SEPARATOR</td>
      <td>Any string</td>
      <td>Character used to separate columns in the input file.</td>
   </tr>
   <tr>
      <td>-target</td>
      <td>TARGET_NAME</td>
      <td>Any string</td>
      <td>Name of the target column in the input file.</td>
   </tr>
   <tr>
      <td>-mode</td>
      <td>TPOT_MODE</td>
      <td>['classification', 'regression']</td>
      <td>Whether TPOT is being used for a supervised classification or regression problem.</td>
   </tr>
   <tr>
      <td>-o</td>
      <td>OUTPUT_FILE</td>
      <td>String path to a file</td>
      <td>File to export the code for the final optimized pipeline.</td>
   </tr>
   <tr>
      <td>-g</td>
      <td>GENERATIONS</td>
      <td>Any positive integer</td>
      <td>Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.
<br><br>
TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.</td>
   </tr>
   <tr>
      <td>-p</td>
      <td>POPULATION_SIZE</td>
      <td>Any positive integer</td>
      <td>Number of individuals to retain in the genetic programming (GP) population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.
<br><br>
TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.</td>
   </tr>
   <tr>
      <td>-os</td>
      <td>OFFSPRING_SIZE</td>
      <td>Any positive integer</td>
      <td>Number of offspring to produce in each GP generation.
<br><br>
By default, OFFSPRING_SIZE = POPULATION_SIZE.</td>
   </tr>
   <tr>
      <td>-mr</td>
      <td>MUTATION_RATE</td>
      <td>[0.0, 1.0]</td>
      <td>GP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation.
<br><br>
We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.</td>
   </tr>
   <tr>
      <td>-xr</td>
      <td>CROSSOVER_RATE</td>
      <td>[0.0, 1.0]</td>
      <td>GP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to "breed" every generation.
<br><br>
We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.</td>
   </tr>
   <tr>
      <td>-scoring</td>
      <td>SCORING_FN</td>
      <td>'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',
'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted',
'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'my_module.scorer_name*'</td>
      <td>Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression.
<br><br>
TPOT assumes that any function with "error" or "loss" in the name is meant to be minimized, whereas any other functions will be maximized.
<br><br>
*my_module.scorer_name: You can also specify your own scoring function, or the full Python path to an existing one.
<br><br>
See the section on scoring functions for more details.</td>
   </tr>
   <tr>
      <td>-cv</td>
      <td>CV</td>
      <td>Any integer > 1</td>
      <td>Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.</td>
   </tr>
   <tr>
      <td>-sub</td>
      <td>SUBSAMPLE</td>
      <td>(0.0, 1.0]</td>
      <td>Fraction of the training instances to use. Setting it to 0.5 means that TPOT randomly selects half of the training samples for the pipeline optimization process.</td>
   </tr>
   <tr>
      <td>-njobs</td>
      <td>NUM_JOBS</td>
      <td>Any positive integer or -1</td>
      <td>Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process.
<br><br>
Assigning this to -1 will use as many cores as available on the computer.</td>
   </tr>
   <tr>
      <td>-maxtime</td>
      <td>MAX_TIME_MINS</td>
      <td>Any positive integer</td>
      <td>How many minutes TPOT has to optimize the pipeline.
<br><br>
If provided, this setting will override the "generations" parameter and allow TPOT to run until it runs out of time.</td>
   </tr>
   <tr>
      <td>-maxeval</td>
      <td>MAX_EVAL_MINS</td>
      <td>Any positive integer</td>
      <td>How many minutes TPOT has to evaluate a single pipeline.
<br><br>
Setting this parameter to higher values will allow TPOT to consider more complex pipelines but will also allow TPOT to run longer.</td>
   </tr>
   <tr>
      <td>-s</td>
      <td>RANDOM_STATE</td>
      <td>Any positive integer</td>
      <td>Random number generator seed for reproducibility.
<br><br>
Set this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.</td>
   </tr>
   <tr>
      <td>-config</td>
      <td>CONFIG_FILE</td>
      <td>String or file path</td>
      <td>Operators and parameter configurations in TPOT:
<ul>
<li>Path for configuration file: TPOT will use the path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process.</li>
<li>String 'TPOT light': TPOT will use a built-in configuration with only fast models and preprocessors.</li>
<li>String 'TPOT MDR': TPOT will use a built-in configuration specialized for genomic studies.</li>
<li>String 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.</li>
</ul>
See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.</td>
   </tr>
   <tr>
      <td>-memory</td>
      <td>MEMORY</td>
      <td>String or file path</td>
      <td>If supplied, the pipeline will cache each transformer after calling fit. This avoids refitting a transformer within a pipeline when its parameters and input data are identical to those of another pipeline already fitted during the optimization process. Memory caching modes in TPOT:
<ul>
<li>Path for a caching directory: TPOT uses memory caching with the provided directory and does NOT clean the caching directory up upon shutdown.</li>
<li>String 'auto': TPOT uses memory caching with a temporary directory and cleans it up upon shutdown.</li>
</ul>
</td>
   </tr>
   <tr>
      <td>-cf</td>
      <td>CHECKPOINT_FOLDER</td>
      <td>Folder path</td>
      <td>If supplied, the path to a folder you have already created, in which TPOT will periodically save the best pipeline found so far while optimizing.
<br><br>
This is useful in several cases:
<ul>
<li>sudden death before TPOT could save an optimized pipeline</li>
<li>progress tracking</li>
<li>grabbing a pipeline while TPOT is working</li>
</ul>
Example:
<br>
<code>mkdir my_checkpoints</code>
<br>
<code>-cf ./my_checkpoints</code></td>
   </tr>
   <tr>
      <td>-es</td>
      <td>EARLY_STOP</td>
      <td>Any positive integer</td>
      <td>Number of consecutive generations without improvement that TPOT tolerates before stopping.
<br><br>
The optimization process ends early if there is no improvement within the given number of generations.</td>
   </tr>
   <tr>
      <td>-v</td>
      <td>VERBOSITY</td>
      <td>{0, 1, 2, 3}</td>
      <td>How much information TPOT communicates while it is running.
<br><br>
0 = none, 1 = minimal, 2 = high, 3 = all.
<br><br>
A setting of 2 or higher will add a progress bar during the optimization procedure.</td>
   </tr>
</table>
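
The evaluation budget stated in the GENERATIONS and POPULATION_SIZE rows can be checked with a little arithmetic. The sketch below only restates the formula from the table, POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE, together with the stated default OFFSPRING_SIZE = POPULATION_SIZE; the function name `total_pipelines_evaluated` and the example values are hypothetical.

```python
def total_pipelines_evaluated(population_size, generations, offspring_size=None):
    """Return POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE."""
    if offspring_size is None:
        # Per the table, OFFSPRING_SIZE defaults to POPULATION_SIZE.
        offspring_size = population_size
    return population_size + generations * offspring_size


# -p 20 -g 5 with the default offspring size: 20 + 5 * 20
print(total_pipelines_evaluated(20, 5))      # 120

# -p 20 -g 5 -os 10: 20 + 5 * 10
print(total_pipelines_evaluated(20, 5, 10))  # 70
```

Multiplying this count by the number of cross-validation folds (`-cv`) gives a rough sense of how many model fits a run requires, which helps when choosing `-g`, `-p`, and `-maxtime`.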