Introduction

For many bioinformatics problems that involve classifying individuals into clinical categories from high-dimensional biological data, the performance of a machine learning (ML) model depends greatly on the problem it is applied to [@doi:10.1186/s13040-017-0154-4;@pmid:29218881]. In addition, choosing a classifier is merely one step of the arduous process that leads to predictions. To detect patterns among features (e.g., clinical variables) and their associations with the outcome (e.g., clinical diagnosis), a data scientist typically has to design and test several complex ML frameworks that consist of data exploration, feature engineering, model selection and prediction. Automated machine learning (AutoML) systems were developed to automate this challenging and time-consuming process. These intelligent systems increase the accessibility and scalability of various machine learning applications by efficiently solving an optimization problem to discover pipelines that yield satisfactory outcomes, such as prediction accuracy. Consequently, AutoML allows data scientists to focus their effort on applying their expertise to other important research components, such as developing meaningful hypotheses or communicating the results.

Grid search, random search [@url:http://www.jmlr.org/papers/v13/bergstra12a.html], Bayesian optimization [@arxiv:1012.2599] and evolutionary algorithms (EAs) [@isbn:3642072852] are four common approaches to building AutoML systems for diverse applications. Both grid search and random search can be too computationally expensive, and therefore impractical, for exploring all possible combinations of the hyperparameters of a model with a high-dimensional search space, for example, one with more than 10 hyperparameters [@arxiv:1603.09441]. Bayesian optimization is implemented in both auto-sklearn [@url:https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning] and Auto-WEKA [@arxiv:1208.3719; @url:http://www.jmlr.org/papers/v18/16-261.html] for model selection and hyperparameter optimization. Although both systems allow simple ML pipelines that include data preprocessing, feature engineering and single-model prediction, they cannot build the more complex pipelines or stacked models that are necessary for complicated prediction problems. EAs, on the other hand, can generate highly extensible and complex ML pipelines and ensemble models for data scientists. For example, Recipe [@doi:10.1007/978-3-319-55696-3_16] uses a grammar-based EA to build and optimize ML pipelines based on a fully configurable grammar. Autostacker [@arxiv:1803.00684] uses a basic EA to look for flexible combinations of many ML algorithms that yield better performance. DEvol (https://github.com/joeddav/devol) was designed specifically for deep neural networks and can optimize complex model architectures by using an EA to tune hyperparameters related to the convolutional/dense layers and the optimizer. GAMA [@doi:10.21105/joss.01132], released more recently, automatically ensembles the best ML pipelines evaluated by an asynchronous EA instead of simply using the single best pipeline for prediction.
Increasingly, EAs are providing AutoML systems with high flexibility in building pipelines from a large search space of ML algorithms and their hyperparameters.
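
To make the cost argument for grid search concrete, the following minimal sketch counts how many configurations an exhaustive grid must evaluate as the number of hyperparameters grows, and contrasts this with a random search that evaluates a fixed budget of samples. The search space (10 hyperparameters named `hp0` to `hp9`, each with 5 candidate values) and the helper names `grid_size` and `random_search_samples` are assumptions chosen for illustration; they are not taken from any of the cited systems.

```python
import random


def grid_size(n_hyperparameters, values_per_hyperparameter=5):
    """Number of configurations an exhaustive grid search must evaluate."""
    return values_per_hyperparameter ** n_hyperparameters


def random_search_samples(space, budget, seed=0):
    """Draw `budget` random configurations from `space` (name -> candidate values)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(budget)
    ]


# Hypothetical search space: 10 hyperparameters, 5 candidate values each.
space = {f"hp{i}": [0.001, 0.01, 0.1, 1.0, 10.0] for i in range(10)}

n_grid = grid_size(len(space))                       # 5 ** 10 configurations
samples = random_search_samples(space, budget=100)   # fixed evaluation budget

print(f"grid search evaluations:   {n_grid}")
print(f"random search evaluations: {len(samples)}")
```

With these assumed sizes the grid contains 5^10 = 9,765,625 configurations, whereas random search evaluates only as many as its budget allows, at the price of covering a very small fraction of the space.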

Tree-based Pipeline Optimization Tool (TPOT) is an EA-based AutoML system that uses genetic programming (GP) [@raw:GPbook] to optimize a series of feature selectors, preprocessors and ML models with the objective of maximizing classification accuracy. While most AutoML systems focus primarily on model selection and hyperparameter optimization, TPOT also pays attention to feature selection and feature engineering by evaluating complete pipelines based on a cross-validated score such as mean squared error or balanced accuracy. Given no a priori knowledge about the problem, TPOT has been shown to frequently outperform standard machine learning analyses [@doi:10.1145/2908812.2908918; @arxiv:1607.08878]. Efforts have been made to specialize TPOT for human genetics research, resulting in a useful extended version of TPOT, TPOT-MDR, that features Multifactor Dimensionality Reduction and an Expert Knowledge Filter [@doi:10.1145/3071178.3071212]. However, at the current stage, TPOT remains computationally expensive when analyzing large datasets such as those in genome-wide association studies (GWAS) or gene expression analyses. Consequently, the application of TPOT to real-world datasets has been limited to small sets of features [@pmid:30815180].

In this work, we introduce two new features implemented in TPOT that help increase the system's scalability. First, the Feature Set Selector (FSS) allows users to pass specific subsets of the features, so that each pipeline begins by selecting one subset and is then evaluated only on that smaller set of features rather than on the entire dataset, which reduces the computational expense of TPOT. Consequently, FSS increases TPOT's efficiency on large datasets by slicing the data into smaller sets of features (e.g., genes) and allowing a genetic algorithm to select the best subset in the final pipeline. Second, Template enables the option for strongly typed GP, a method to enforce type constraints in genetic programming. By letting users specify a desired structure of the resulting machine learning pipeline, Template helps reduce TPOT computation time and potentially provides more interpretable results.
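
The two ideas can be illustrated with a small, self-contained sketch. It is not TPOT's implementation or API: the feature-set names, the gene names, the hyphen-separated template string and the helpers `select_feature_set` and `matches_template` are all hypothetical, chosen only to show (i) slicing a dataset down to one named feature subset before the rest of the pipeline runs and (ii) checking that a candidate pipeline follows a user-specified sequence of step types.

```python
# Hypothetical named feature subsets (e.g. gene sets); all names are made up.
FEATURE_SETS = {
    "set_A": ["gene1", "gene2"],
    "set_B": ["gene3", "gene4", "gene5"],
}


def select_feature_set(rows, columns, set_name, feature_sets=FEATURE_SETS):
    """Keep only the columns that belong to the chosen feature set."""
    wanted = feature_sets[set_name]
    idx = [columns.index(name) for name in wanted]
    return [[row[i] for i in idx] for row in rows], wanted


def matches_template(pipeline_steps, template):
    """Check that the pipeline's step types follow the template, slot by slot."""
    expected = template.split("-")
    return len(pipeline_steps) == len(expected) and all(
        step_type == slot
        for (step_type, _name), slot in zip(pipeline_steps, expected)
    )


columns = ["gene1", "gene2", "gene3", "gene4", "gene5"]
rows = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.1, 1.2, 1.3, 1.4, 1.5],
]

# Feature-set selection: downstream steps see 3 columns instead of 5.
subset, names = select_feature_set(rows, columns, "set_B")

# Template check: each pipeline is a list of (step type, step name) pairs.
template = "Selector-Transformer-Classifier"
ok_pipeline = [("Selector", "set_B"), ("Transformer", "scaler"), ("Classifier", "tree")]
bad_pipeline = [("Classifier", "tree"), ("Selector", "set_B")]

print(names, subset)
print(matches_template(ok_pipeline, template), matches_template(bad_pipeline, template))
```

In this sketch, an optimizer would treat the choice of `set_name` as one more value to search over, while the template check would discard any candidate whose step types do not line up with the requested structure, which shrinks the space of pipelines to explore.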