# Example classification analysis using ShinyLearner

*By Erica Suh and Stephen Piccolo*

This notebook illustrates how to perform a benchmark comparison of classification algorithms across multiple datasets. We assume the reader has a moderate level of understanding of shell scripting and Python scripting. We also assume that the user's operating system is UNIX-based.

### Preparing the data

The Penn Machine Learning Benchmarks (PMLB) repository contains a large number of datasets that can be used to test machine-learning algorithms. We can access this repository using the Python module named `pmlb`, which can be installed via `pip`.

In [11]:
%%bash

#pip3 install --upgrade pip
pip3 install pmlb



For demonstration purposes, we will fetch 10 biology-related datasets from PMLB. First, we define a list of the unique identifiers for these datasets. (In the cell below, only the first two identifiers are active; the other eight are commented out.)

In [18]:
datasets = ['analcatdata_aids', 'ann-thyroid']
#            'breast-cancer',
#            'dermatology',
#            'diabetes',
#            'hepatitis',
#            'iris',
#            'liver-disorder',
#            'molecular-biology_promoters',
#            'yeast']

PMLB's [GitHub repository](https://github.com/EpistasisLab/penn-ml-benchmarks) demonstrates how `pmlb` is used in Python scripts.

ShinyLearner requires that each input data file have exactly one feature named 'Class', which contains the class labels. So we must modify the PMLB data to meet this requirement. After modifying the data, we save each dataframe to a file with a [supported file extension](https://github.com/srp33/ShinyLearner/blob/master/InputFormats.md).

In [19]:
from pmlb import fetch_data
import os
import shutil

new_directories = ["./Demo_Datasets/", "./Demo_Results_Basic/", "./Demo_Results_ParamsOptimized/"]
for directory in new_directories:
    if os.path.exists(directory):
        shutil.rmtree(directory)
    os.makedirs(directory)

for data in datasets:
    curr_data = fetch_data(data)
    curr_data = curr_data.rename(columns={'target': 'Class'})  # Rename 'target' to 'Class'
    curr_data.to_csv('./Demo_Datasets/{}.tsv'.format(data), sep='\t', index=True)  # Save to a .tsv file
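
As a quick sanity check on the 'Class' requirement described above, the following minimal sketch counts how many header columns in a tab-separated file are named exactly `Class`. The helper name `count_class_columns` and the example data are hypothetical; this is not part of ShinyLearner or `pmlb`.

```python
import csv
import io


def count_class_columns(tsv_text):
    """Return how many header columns are named exactly 'Class'."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    return sum(1 for name in header if name == "Class")


# Made-up example data. The empty first header cell mimics the unnamed
# index column that to_csv(..., index=True) writes.
example = "\tfeature_a\tfeature_b\tClass\n0\t1.5\t2.0\t1\n1\t0.3\t4.1\t0\n"

if count_class_columns(example) != 1:
    raise ValueError("Expected exactly one column named 'Class'")
```

To check a saved file, read its contents (for example with `open(path).read()`) and pass the text to the helper.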

### Performing a benchmark comparison of 10 classification algorithms

For this initial analysis, we will apply 10 different classification algorithms to the 10 datasets we have prepared. We will use Monte Carlo cross validation (with *no* hyperparameter optimization). To keep the execution time reasonable, we will only do 2 iterations of Monte Carlo cross validation.
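
To illustrate the idea behind Monte Carlo cross validation, here is a minimal sketch that repeatedly splits sample indices at random into training and test sets. It is not ShinyLearner's implementation; the function name, the `test_fraction` default, and the seed are assumptions chosen only for illustration.

```python
import random


def monte_carlo_splits(n_samples, n_iterations, test_fraction=1 / 3, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Each iteration draws a fresh random split, so a given sample may land
    in the test set in several iterations or in none.
    """
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    splits = []
    for _ in range(n_iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test_indices = sorted(indices[:n_test])
        train_indices = sorted(indices[n_test:])
        splits.append((train_indices, test_indices))
    return splits


# Two iterations, matching the number used in this notebook.
for train_idx, test_idx in monte_carlo_splits(n_samples=9, n_iterations=2):
    print("train:", train_idx, "test:", test_idx)
```

Unlike k-fold cross validation, the test sets from different iterations are drawn independently and may overlap.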

ShinyLearner is executed within a Docker container. The ShinyLearner [web application](http://bioapps.byu.edu/shinylearner/) enables us to more easily build commands for executing ShinyLearner within Docker at the command line. We used this tool to create a template command. Below we modify that template and execute ShinyLearner for each dataset. We also indicate that we want to one-hot encode (`--ohe`) and scale the data (`--scale`) and that we want to impute any missing values (`--impute`).

(*This process takes several minutes to execute. You won't see any output until the analysis has completed.*)

In [21]:
%%bash

function runShinyLearner {
  dataset_file_path="$1"  
  dataset_file_name="$(basename "$dataset_file_path")"
  dataset_name="${dataset_file_name/\.tsv/}"
  dataset_dir_path="$(pwd)/Demo_Results_Basic/$dataset_name"
  
  mkdir -p "$dataset_dir_path"

  docker run --rm \
    -v "$(pwd)/Demo_Datasets":/InputData \
    -v "$dataset_dir_path":/OutputData \
    --user $(id -u):$(id -g) \
    srp33/shinylearner:version48 \
    /UserScripts/classification_montecarlo \
      --data "/InputData/$dataset_file_name" \
      --description "$dataset_name" \
      --iterations 2 \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/svm/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/earth/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/HoeffdingTree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/mlp/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/multilayer_perceptron/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/SimpleLogistic/default*" \
      --output-dir "/OutputData" \
      --ohe true \
      --scale true \
      --impute true \
      --verbose false
}

rm -rf Demo_Results_Basic

for dataset_file_path in ./Demo_Datasets/*.tsv
do
  runShinyLearner "$dataset_file_path"
done

***********************************************************************
Command that was executed for analysis on Mon Apr 29 11:51:14 MDT 2019:

/UserScripts/classification_montecarlo --data /InputData/analcatdata_aids.tsv --description analcatdata_aids --iterations 2 --classif-algo /AlgorithmScripts/Classification/tsv/sklearn/svm/default* --classif-algo /AlgorithmScripts/Classification/tsv/mlr/earth/default* --classif-algo /AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/default* --classif-algo /AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default* --classif-algo /AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/default* --classif-algo /AlgorithmScripts/Classification/arff/weka/HoeffdingTree/default* --classif-algo /AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/default* --classif-algo /AlgorithmScripts/Classification/tsv/mlr/mlp/default* --classif-algo /AlgorithmScripts/Classification/tsv/sklearn/multilayer_perceptron/default* --cl
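
For readers who prefer to drive Docker from Python, the sketch below assembles the same kind of command as an argument list that could be passed to `subprocess.run`. It only builds the list and does not invoke Docker. The function name `build_shinylearner_command` and its parameters are hypothetical; the flags and the image tag are copied from the bash cell above, and the `--user` option is omitted for brevity.

```python
import shlex


def build_shinylearner_command(dataset_file_name, input_dir, output_dir,
                               algorithm_patterns, iterations=2,
                               image="srp33/shinylearner:version48"):
    """Build a docker argument list mirroring the bash function above."""
    # Strip a trailing ".tsv" to obtain the dataset name.
    if dataset_file_name.endswith(".tsv"):
        dataset_name = dataset_file_name[:-len(".tsv")]
    else:
        dataset_name = dataset_file_name

    command = [
        "docker", "run", "--rm",
        "-v", f"{input_dir}:/InputData",
        "-v", f"{output_dir}:/OutputData",
        image,
        "/UserScripts/classification_montecarlo",
        "--data", f"/InputData/{dataset_file_name}",
        "--description", dataset_name,
        "--iterations", str(iterations),
    ]
    # Each pattern is passed literally; no shell expands the wildcard.
    for pattern in algorithm_patterns:
        command += ["--classif-algo", pattern]
    command += [
        "--output-dir", "/OutputData",
        "--ohe", "true",
        "--scale", "true",
        "--impute", "true",
        "--verbose", "false",
    ]
    return command


example_command = build_shinylearner_command(
    "analcatdata_aids.tsv",
    "/path/to/Demo_Datasets",  # placeholder host path
    "/path/to/Demo_Results_Basic/analcatdata_aids",  # placeholder host path
    ["/AlgorithmScripts/Classification/tsv/sklearn/svm/default*"],
)
print(" ".join(shlex.quote(arg) for arg in example_command))
```

Building the command as a list avoids shell quoting issues, because each argument reaches Docker exactly as written.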

### Repeating the benchmark comparison with hyperparameter optimization

ShinyLearner provides an option to optimize a classification algorithm's hyperparameters. To accomplish this, it uses nested cross validation. This process requires more computational time, but it often increases classification accuracy. In the code below, we execute the same 10 classification algorithms on the same 10 datasets. There are some differences in the code below compared to the code above:

1. We store the output in `Demo_Results_ParamsOptimized` rather than `Demo_Results_Basic`.
2. We use the `nestedclassification_montecarlo` user script rather than `classification_montecarlo`.
3. The path specified for each classification algorithm ends with `*` rather than `default*`. This tells ShinyLearner to evaluate all hyperparameter combinations, not just default ones.
4. We indicate that we want to use 2 "outer" iterations and 2 "inner" iterations. In the previous example, we executed 2 iterations of Monte Carlo cross validation. This time, within each of the 2 outer iterations, we also use 2 rounds of nested ("inner") cross validation to optimize hyperparameters.
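
The outer/inner structure described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not ShinyLearner's implementation: the function names, the `test_fraction` value, the threshold "classifier", and the made-up data are all hypothetical.

```python
import random


def random_split(indices, test_fraction, rng):
    """Randomly split indices into (train, held_out) lists."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_held_out = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_held_out:], shuffled[:n_held_out]


def nested_monte_carlo(indices, param_grid, score_fn, outer_iterations=2,
                       inner_iterations=2, test_fraction=1 / 3, seed=0):
    """Return one (best_param, outer_test_score) pair per outer iteration."""
    rng = random.Random(seed)
    results = []
    for _ in range(outer_iterations):
        outer_train, outer_test = random_split(indices, test_fraction, rng)
        # Inner splits use only the outer training data, so the outer
        # test set never influences the hyperparameter choice.
        inner_splits = [random_split(outer_train, test_fraction, rng)
                        for _ in range(inner_iterations)]
        mean_scores = {}
        for param in param_grid:
            scores = [score_fn(tr, va, param) for tr, va in inner_splits]
            mean_scores[param] = sum(scores) / len(scores)
        best_param = max(mean_scores, key=mean_scores.get)
        results.append((best_param, score_fn(outer_train, outer_test, best_param)))
    return results


# Toy data: one numeric feature, with label 1 whenever the value exceeds 5.
features = list(range(12))
labels = [0] * 6 + [1] * 6


def threshold_accuracy(train_idx, eval_idx, threshold):
    """Accuracy of 'predict 1 if feature > threshold' on eval_idx.

    This toy model has nothing to fit, so train_idx is unused.
    """
    correct = sum((features[i] > threshold) == bool(labels[i]) for i in eval_idx)
    return correct / len(eval_idx)


for best, score in nested_monte_carlo(range(12), (5.5, 2.5, 8.5), threshold_accuracy):
    print("best threshold:", best, "outer accuracy:", score)
```

The key design point is that hyperparameters are chosen using only the inner splits, and the outer test set is used once per outer iteration to estimate performance.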

(*This process takes several minutes to execute...much longer than the previous example. You won't see any output until the analysis has completed.*)

In [22]:
%%bash

function runShinyLearner {
  dataset_file_path="$1"
  
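  dataset_file_name="$(basename "$dataset_file_path")"
  dataset_name="${dataset_file_name/\.tsv/}"

  # Create the output directory up front, as in the basic example
  mkdir -p "$(pwd)/Demo_Results_ParamsOptimized/$dataset_name"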

  docker run --rm \
    -v "$(pwd)/Demo_Datasets":/InputData \
    -v "$(pwd)/Demo_Results_ParamsOptimized/$dataset_name":/OutputData \
    --user $(id -u):$(id -g) \
    srp33/shinylearner:version484 \
    /UserScripts/nestedclassification_montecarlo \
      --data "/InputData/$dataset_file_name" \
      --description "$dataset_name" \
      --outer-iterations 2 \
      --inner-iterations 2 \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/svm/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/earth/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/HoeffdingTree/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/mlp/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/multilayer_perceptron/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/SimpleLogistic/*" \
      --output-dir "/OutputData" \
      --ohe true \
      --scale true \
      --impute true \
      --verbose false
}

# Remove stale output for this analysis only (keep the Demo_Results_Basic results)
rm -rf Demo_Results_ParamsOptimized

for dataset_file_path in ./Demo_Datasets/*.tsv
do
  runShinyLearner "$dataset_file_path"
done

***********************************************************************
Command that was executed for analysis on Mon Apr 29 11:55:42 MDT 2019:

/UserScripts/nestedclassification_montecarlo --data /InputData/analcatdata_aids.tsv --description analcatdata_aids --outer-iterations 2 --inner-iterations 2 --classif-algo /AlgorithmScripts/Classification/tsv/sklearn/svm/* --classif-algo /AlgorithmScripts/Classification/tsv/mlr/earth/* --classif-algo /AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/* --classif-algo /AlgorithmScripts/Classification/tsv/sklearn/decision_tree/* --classif-algo /AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/* --classif-algo /AlgorithmScripts/Classification/arff/weka/HoeffdingTree/* --classif-algo /AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/* --classif-algo /AlgorithmScripts/Classification/tsv/mlr/mlp/* --classif-algo /AlgorithmScripts/Classification/tsv/sklearn/multilayer_perceptron/* --classif-algo /AlgorithmScripts/C

scripts/nestedclassification: line 59:   185 Killed                  java $(getJavaArgs) -jar shinylearner.jar ANALYSIS_DATA_FILE=$analysisDataFile EXPERIMENT_FILE=$tmpDir/ie2 DEBUG=$verbose OUTPUT_BENCHMARK_FILE_PATH=$tmpDir/icb OUTPUT_PREDICTIONS_FILE_PATH=$tmpDir/ip NUM_CORES=$numCores TEMP_DIR=$tmpDir 2> /dev/null


### Clean up

In [None]:
%%bash

rm -rf Demo_Datasets

### Analyzing and visualizing the results

Please see the document called `Demo_Analysis.Rmd`, which contains R code for analyzing and visualizing the results.