# Data-driven Advice for Applying Machine Learning to Bioinformatics Problems


Data-driven advice for applying machine learning to bioinformatics problems.

As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.

As bioinformatics grows, it must keep pace with both new data and new algorithms.

**To give current researchers data-driven algorithm recommendations, the paper thoroughly analyzes 13 commonly used, state-of-the-art machine learning algorithms on a set of 165 publicly available classification problems.**



It presents a number of statistical and visual comparisons of algorithm performance, and quantifies the effect of model selection and algorithm tuning for each algorithm and dataset.

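**The analysis culminates in a recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, along with general guidelines for applying machine learning to supervised classification problems.**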

## 1. introduction

Introduction

The bioinformatics field is increasingly relying on machine learning (ML) algorithms to conduct predictive analytics and gain greater insights into the complex biological processes of the human body [1]. For example, ML algorithms have been applied to great success in GWAS, and have proven effective at detecting patterns of epistasis within the human genome. Recently, deep learning algorithms were used to detect cancer metastases on high-resolution pathology images [3] at levels comparable to human pathologists. These results, among others, indicate heavy interest in ML development and analysis for bioinformatics applications.

Owing to the development of open source ML packages and active research in the ML field, researchers can easily choose from dozens of ML algorithm implementations to build predictive models of complex data. Although having several readily-available ML algorithm implementations is advantageous to bioinformatics researchers seeking to move beyond simple statistics, many researchers experience “choice overload” and find difficulty in selecting the right ML algorithm for their problem at hand. As a result, some ML-oriented bioinformatics projects could be improved simply through the use of a better ML algorithm.

ML is widely used in bioinformatics, with GWAS as a representative example.

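Deep learning has detected cancer metastases at levels comparable to human pathologists, and results like this reflect heavy interest in ML for bioinformatics.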

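There are so many ML algorithms that researchers struggle to choose one ("choice overload"). As a result, some projects could be improved simply by using a better algorithm.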

ML researchers are aware of the challenges that algorithm selection presents to ML practitioners. As a result, there have been some efforts to empirically assess different algorithms across sets of problems, beginning in the mid-1990s with the StatLog project.

ML researchers are aware of how difficult algorithm selection is for ML practitioners. As a result, there have been attempts to assess algorithms empirically since the mid-1990s.

Early work in this field also emphasized bioinformatics applications. More recently, Caruana et al. and Fernández-Delgado et al. analyzed several supervised learning algorithms, coupled with some parameter tuning. The aforementioned literature often compared many algorithms but on relatively few example problems (between 4 and 12), with only one using upwards of 112 example problems. In the time since these assessments, researchers have moved towards standardized, open source implementations of ML algorithms (e.g. scikit-learn and Weka), and the number of publicly available datasets that can be used for comparison has skyrocketed, leading to the creation of decentralized, collaboration-based analyses such as the OpenML project. However, the value of focused, reproducible ML experiments is still paramount. These observations motivated our work, in which we conduct a contemporary, open source, and thorough comparison of ML algorithms across a large set of publicly available problems, including several bioinformatics problems.

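Earlier studies compared many algorithms but on relatively few example problems (between 4 and 12), with only one using upwards of 112. Since then, researchers have moved toward standardized, open source implementations of ML algorithms, and the number of publicly available datasets usable for comparison has skyrocketed, leading to decentralized, collaboration-based analyses such as the OpenML project.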


Even so, the value of focused, reproducible ML experiments remains paramount.

In this paper, we take a detailed look at 13 popular open source ML algorithms and analyze their performance across a set of 165 supervised classification problems in order to provide data-driven advice to practitioners who wish to apply ML to their datasets. A key part of this comparison is a full hyperparameter optimization of each algorithm.

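**To give practitioners data-driven advice, we take a detailed look at 13 open source ML algorithms and analyze their performance across 165 classification problems. A key part of this comparison is a full hyperparameter optimization of each algorithm.**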

The results highlight the importance of selecting the right ML algorithm for each problem, which can improve prediction accuracy significantly on some problems. Further, we empirically quantify the effect of hyperparameter (i.e. algorithm parameter) tuning for each ML algorithm, demonstrating marked improvements in the predictive accuracy of nearly all ML algorithms. We show that the underlying behaviors of various ML algorithms cluster in terms of performance, as might be expected. Finally, based on the results of the experiments, we provide a refined set of recommendations for ML algorithms and parameters as a starting point for future researchers.

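Algorithm choice changes accuracy. The paper demonstrates this and provides a refined set of algorithm and parameter recommendations as a starting point for future researchers.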

## 2. methods

Methods

In this study, we compared 13 popular ML algorithms from scikit-learn, a widely used ML library implemented in Python. Each algorithm and its hyperparameters are described in Table 1. The algorithms include Naïve Bayes algorithms, common linear classifiers, tree-based algorithms, distance-based classifiers, ensemble algorithms, and non-linear, kernel-based strategies. The goal was to represent the most common classes of algorithms used in literature, as well as recent state-of-the-art algorithms such as Gradient Tree Boosting.

The study covers **13 scikit-learn algorithms**:
Naïve Bayes, linear classifiers, tree-based algorithms, and so on.

Table 1 describes each algorithm and its hyperparameters.
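
To make the algorithm families above concrete, here is a minimal sketch that instantiates one scikit-learn estimator per family and scores each with cross-validation. It is not the paper's Table 1: the specific estimators, their settings, the synthetic dataset, and the 5-fold setup are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One illustrative estimator per family named in the text.
# These picks are assumptions, not the paper's actual list of 13 algorithms.
families = {
    "naive_bayes": GaussianNB(),
    "linear": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "distance": KNeighborsClassifier(),
    "ensemble": RandomForestClassifier(n_estimators=50, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "kernel": SVC(kernel="rbf"),
}

# Synthetic stand-in for a real classification dataset.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Mean cross-validated balanced accuracy per family (default hyperparameters).
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
    for name, model in families.items()
}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {score:.3f}")
```

Because every estimator shares the same `fit`/`predict` interface, swapping algorithms is a one-line change, which is what makes this kind of side-by-side comparison practical.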

For each algorithm, the hyperparameters were tuned using a fixed grid search with 10-fold cross-validation. In our results, we compare the average balanced accuracy over the 10 folds in order to account for class imbalance. We used expert knowledge about the reasonable hyperparameters to specify the ranges of values to tune for each algorithm. It is worth noting that we did not attempt to control for the number of total hyperparameter combinations budgeted to each algorithm. As a result, algorithms with more parameters have an advantage in the sense that they have more training attempts on each dataset. However, it is our goal to report as close to the best performance as possible for each algorithm on each dataset, and for this reason we chose to optimize each algorithm as thoroughly as possible.

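Hyperparameters were tuned with a fixed grid search under 10-fold cross-validation, algorithms were compared on the resulting average balanced accuracy, and each algorithm was optimized as thoroughly as possible.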
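
The tuning procedure described above can be sketched with scikit-learn's `GridSearchCV`. This is a rough illustration under stated assumptions, not the authors' code: the helper `tune_with_grid_search`, the tiny grid, the synthetic imbalanced dataset, and the use of stratified, shuffled folds are all hypothetical choices. scikit-learn's built-in `"balanced_accuracy"` scorer (mean per-class recall) stands in for the paper's balanced accuracy metric, which may be defined slightly differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold


def tune_with_grid_search(X, y, estimator, param_grid, n_folds=10, seed=0):
    """Exhaustively search param_grid, scoring by mean balanced accuracy.

    Hypothetical helper: stratified, shuffled folds are an assumption.
    """
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    search = GridSearchCV(
        estimator, param_grid, scoring="balanced_accuracy", cv=cv
    )
    search.fit(X, y)
    # best_score_ is the mean balanced accuracy over the folds
    # for the best hyperparameter combination.
    return search.best_params_, search.best_score_


# Synthetic, imbalanced stand-in for a benchmark dataset (80% / 20% classes).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Illustrative grid only; the real ranges come from the paper's Table 1.
param_grid = {"n_estimators": [10, 50], "max_features": ["sqrt", None]}

# The grid size is each algorithm's "budget" of training attempts per dataset.
n_combinations = len(ParameterGrid(param_grid))

best_params, best_score = tune_with_grid_search(
    X, y, RandomForestClassifier(random_state=0), param_grid
)
print(n_combinations, best_params, round(best_score, 3))
```

Counting combinations with `ParameterGrid` makes the caveat in the text visible: an algorithm with more hyperparameters gets a larger grid, and therefore more chances to find a good configuration on each dataset.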

The algorithms were compared on 165 supervised classification datasets from the Penn Machine Learning Benchmark (PMLB) [13]. PMLB is a collection of publicly available classification problems that have been standardized to the same format and collected in a central location with easy access via Python. Although not limited to problems in biology and medicine, PMLB includes many biomedical classification problems, including tasks such as disease diagnosis, post-operative decision making, and exon boundary identification in DNA, among others. A sample of the biomedical classification tasks contained in PMLB is listed in Table 2.

PMLB includes many biomedical classification problems, and Table 2 lists a sample of the biomedical classification tasks it contains.

Prior to evaluating each ML algorithm, we scaled the features of every dataset by subtracting the mean and scaling the features to unit variance. This scaling step was necessitated by some ML algorithms, such as the distance-based classifiers, which assume that the features of the datasets will be scaled appropriately beforehand.

Before evaluating the ML algorithms, the features of every dataset were scaled to zero mean and unit variance.
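
As a small illustration of the scaling step, the sketch below standardizes two features on very different scales and checks the result against the formula written out by hand. The data is synthetic and the variable names are made up for the example. Using scikit-learn's `StandardScaler` is an assumption about how the step could be done, not a statement of the authors' exact code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (hypothetical data).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(170.0, 10.0, size=200),  # large-valued feature
    rng.normal(0.5, 0.1, size=200),     # small-valued feature
])

# Subtract each feature's mean and divide by its standard deviation.
X_scaled = StandardScaler().fit_transform(X)

# The same transformation written out by hand (population std, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Without this step, a distance-based classifier would be dominated by the first feature simply because its raw values are larger.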

The entire experimental design consisted of over 5.5 million ML algorithm and parameter evaluations in total, resulting in a rich set of data that is analyzed from several viewpoints in Section 3. As an additional contribution of this work, we have provided the complete code required to conduct the algorithm and hyperparameter optimization study, as well as access to the analysis and results. Doing so allows researchers to easily compare algorithm performance on the datasets that are most similar to their own, and to conduct further analysis pertaining to their research.

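The full experimental design comprised over 5.5 million ML algorithm and parameter evaluations, yielding a rich set of data that Section 3 analyzes from several viewpoints. The authors also provide the complete code.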