## ShinyLearner: Using PMLB Datasets

The Penn Machine Learning Benchmarks (PMLB) team provides a Python wrapper named `pmlb`, which can be installed with `pip`.

In [5]:
!pip install pmlb

You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


As an example, we will fetch 10 biology-related datasets from PMLB.

In [6]:
datasets = ['analcatdata_aids',
            'ann-thyroid',
            'breast-cancer',
            'dermatology',
            'diabetes',
            'hepatitis',
            'iris',
            'liver-disorder',
            'molecular-biology_promoters',
            'yeast']

PMLB's [GitHub repository](https://github.com/EpistasisLab/penn-ml-benchmarks) demonstrates how to use `pmlb` in Python scripts. ShinyLearner additionally requires that each input data file have exactly one column named `Class`, so we rename PMLB's `target` column accordingly. We then save each dataframe to a file with a [supported file extension](https://github.com/srp33/ShinyLearner/blob/master/InputFormats.md).
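The `Class` requirement is easy to check before handing a file to ShinyLearner. The snippet below is a minimal sketch using only the standard library; the helper name `count_class_columns` and the miniature `sample` text are hypothetical and are not part of `pmlb` or ShinyLearner.

```python
import csv
import io


def count_class_columns(tsv_text, label="Class"):
    """Return how many header columns in tab-separated text are named `label`."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)  # the first row holds the column names
    return sum(1 for name in header if name == label)


# Hypothetical miniature file: an unnamed index column, two feature
# columns, and the label column (illustrative values only).
sample = "\tsepal-length\tsepal-width\tClass\n0\t5.1\t3.5\t0\n1\t4.9\t3.0\t0\n"

# A file is acceptable under the rule above when the count is exactly 1.
assert count_class_columns(sample) == 1
```

To check a file on disk, its contents can be read with `open(path).read()` and passed to the same helper.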

In [7]:
from pmlb import fetch_data

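for data in datasets:
    curr_data = fetch_data(data)  # download the dataset as a pandas DataFrame
    curr_data = curr_data.rename(columns={'target': 'Class'})  # ShinyLearner expects the label column to be named 'Class'
    curr_data.to_csv('./new_{}.tsv'.format(data), sep='\t', index=True)  # write a tab-separated file, keeping the row index as the first column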

This [web application](http://bioapps.byu.edu/shinylearnerweb/) makes it easier to build ShinyLearner commands. By following its step-by-step process, we obtained the command below, which we run once per dataset to produce the output files.
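The longest part of that command is the `--classif-algo` value, which is a single comma-separated string of algorithm paths, each ending in `default*`. As a rough sketch of how such a string could be assembled rather than typed by hand, the hypothetical helper `build_classif_algo` below joins a few of the paths that appear in the command; it is an illustration only and not part of ShinyLearner.

```python
def build_classif_algo(algorithms, root="AlgorithmScripts/Classification"):
    """Join algorithm sub-paths into one comma-separated --classif-algo value."""
    return ",".join("{}/{}/default*".format(root, algo) for algo in algorithms)


# Three of the sub-paths used in the command (format/library/algorithm).
algos = [
    "tsv/sklearn/svm",
    "tsv/sklearn/decision_tree",
    "arff/weka/SimpleLogistic",
]

print(build_classif_algo(algos))
```

Keeping the algorithms in a list makes it simpler to add or drop one without editing a very long line.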

In [8]:
%%bash

function runShinyLearner {
  localInFileName="$1"
  description="$2"

  /UserScripts/nestedclassification_montecarlo \
    --data "InputData/$localInFileName" \
    --description "$description" \
    --outer-iterations 5 \
    --inner-iterations 2 \
    --classif-algo "AlgorithmScripts/Classification/tsv/sklearn/svm/default*,AlgorithmScripts/Classification/tsv/mlr/earth/default*,AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/default*,AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default*,AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/default*,AlgorithmScripts/Classification/arff/weka/HoeffdingTree/default*,AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/default*,AlgorithmScripts/Classification/tsv/mlr/mlp/default*,AlgorithmScripts/Classification/tsv/sklearn/multilayer_perceptron/default*,AlgorithmScripts/Classification/arff/weka/SimpleLogistic/default*" \
    --output-dir "OutputData" \
    --ohe true \
    --scale true \
    --impute true \
    --verbose false
}

for dataset in "analcatdata_aids" "iris"
do
  runShinyLearner "new_${dataset}.tsv" "${dataset}"
done

#runShinyLearner "new_analcatdata_aids.tsv" "analcatdata_aids"
#runShinyLearner "new_ann-thyroid.tsv" "ann-thyroid"
#runShinyLearner "new_breast-cancer.tsv" "breast-cancer"
#runShinyLearner "new_dermatology.tsv" "dermatology"
#runShinyLearner "new_diabetes.tsv" "diabetes"
#runShinyLearner "new_hepatitis.tsv" "hepatitis"
#runShinyLearner "new_iris.tsv" "iris"
#runShinyLearner "new_liver-disorder.tsv" "liver-disorder"
#runShinyLearner "new_molecular-biology_promoters.tsv" "molecular-biology_promoters"
#runShinyLearner "new_yeast.tsv" "yeast"

***********************************************************************
Command that was executed for analysis on Mon Aug  6 18:21:35 UTC 2018:

/UserScripts/nestedclassification_montecarlo --data InputData/new_analcatdata_aids.tsv --description analcatdata_aids --outer-iterations 5 --inner-iterations 2 --classif-algo AlgorithmScripts/Classification/tsv/sklearn/svm/default*,AlgorithmScripts/Classification/tsv/mlr/earth/default*,AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/default*,AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default*,AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/default*,AlgorithmScripts/Classification/arff/weka/HoeffdingTree/default*,AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/default*,AlgorithmScripts/Classification/tsv/mlr/mlp/default*,AlgorithmScripts/Classification/tsv/sklearn/multilayer_perceptron/default*,AlgorithmScripts/Classification/arff/weka/SimpleLogistic/default* --output-dir OutputData --ohe t