find_feature_importance fails for multiclass data #274

alexmsalmeida · 2021-08-18T16:17:46Z

Hi,

Thanks for developing mikropml - it's an amazing package for implementing machine learning methods in microbiome data.

I have been trying to apply it to a complex multiclass dataset I am working with. However, it seems that whenever I run run_ml() with find_feature_importance = TRUE, I get the following error

Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname,  : 
  subscript out of bounds

The error does not occur if I use a dataset with just two classes. I was also able to reproduce the error with the dataset otu_mini_multi that is provided as an example in the repo, so it does not seem to be specific to my data.

Any ideas on what could be the issue? I am using R v3.6.1, mikropml v1.0.0, caret v6.0-88 and future.apply v1.8.1.

Thanks in advance,
Alex

kelly-sovacool · 2021-08-19T20:52:55Z

I wasn't able to reproduce the error with R 4.0.3 and mikropml 1.1.0, nor with your versions of the software (except that I had to use future.apply 1.7.0 because 1.8.1 requires R >= 4). Can you provide the code that reproduced the error using the otu_mini_multi dataset?

Here's the code I used for testing:

library(mikropml)
ml_results <- run_ml(otu_mini_multi,
  "glmnet",
  outcome_colname = "dx",
  find_feature_importance = TRUE,
  seed = 2019,
  cv_times = 2
)

And here's how I created the conda environment with your software versions:

mamba create -n R-3.6.1 r-base=3.6.1 r-caret=6.0-88 r-mikropml=1.0.0 r-future.apply

alexmsalmeida · 2021-08-19T21:39:24Z

Thanks for following this up. Interesting... if I use your exact code it works. However, if I generate the otu_mini_multi from the otu_large_multi.csv file, I get the error. See my code below:

library(mikropml)

otu_large_multi <- read.delim("otu_large_multi.csv", sep = ",")
otu_mini_multi <- otu_large_multi[, 1:11]

ml_results <- run_ml(otu_mini_multi,
                     "glmnet",
                     outcome_colname = "dx",
                     find_feature_importance = TRUE,
                     seed = 2019,
                     cv_times = 2
)

Using 'dx' as the outcome column.
Training the model...
Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname,  : 
  subscript out of bounds
In addition: Warning messages:
....

Both datasets seem to be identical, so not sure what is going on.

alexmsalmeida · 2021-08-19T21:49:29Z

I seem to have figured it out. If I read in the file with "stringsAsFactors = FALSE" it works. If I recall correctly this is the default in R v4 now, so this might have been the reason.
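For anyone landing here later, a minimal sketch of the difference, assuming a tiny made-up stand-in for otu_large_multi.csv (the values and the temporary file are illustrative, not the real dataset). Passing stringsAsFactors explicitly gives the same result on R 3.x and R 4.x, since only the default changed in R 4.0.0:

```r
# Write a tiny stand-in CSV (hypothetical values, not the real otu_large_multi.csv).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "dx,Otu00001,Otu00002",
  "normal,350,268",
  "adenoma,568,1320",
  "carcinoma,151,756"
), tmp)

# R < 4.0.0 behaves like the first call by default; R >= 4.0.0 like the second.
as_factor <- read.delim(tmp, sep = ",", stringsAsFactors = TRUE)
as_char <- read.delim(tmp, sep = ",", stringsAsFactors = FALSE)

class(as_factor$dx) # "factor"
class(as_char$dx)   # "character"
```

Checking class() on the outcome column before calling run_ml() is a quick way to spot which of the two you ended up with.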

kelly-sovacool · 2021-08-19T21:51:16Z

Ahh stringsAsFactors strikes again! Glad you figured it out.

kelly-sovacool · 2021-08-19T21:54:48Z

By the way, you can also instead read in the file with readr::read_csv("otu_large_multi.csv"). It won't convert strings to factors unless you explicitly specify the col_types. https://readr.tidyverse.org/reference/read_delim.html
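A rough sketch of that readr behaviour, assuming readr is installed and using a small made-up CSV in place of otu_large_multi.csv (the values below are illustrative only):

```r
# Write a tiny stand-in CSV (hypothetical values, not the real otu_large_multi.csv).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "dx,Otu00001,Otu00002",
  "normal,350,268",
  "adenoma,568,1320",
  "carcinoma,151,756"
), tmp)

# With guessed column types, text columns come back as character.
guessed <- readr::read_csv(tmp)

# A factor is only produced when requested through col_types.
explicit <- readr::read_csv(
  tmp,
  col_types = readr::cols(dx = readr::col_factor())
)

class(guessed$dx)  # "character"
class(explicit$dx) # "factor"
```

Because the result does not depend on the stringsAsFactors default, the same script should behave the same way on R 3.6 and R 4.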

alexmsalmeida · 2021-08-19T22:05:45Z

Ah, excellent, thanks for the tip. I will also eventually have to update to R v4, but have been a bit hesitant to do so fearing it will break all my scripts.

Thanks again for the help!

alexmsalmeida changed the title ~~feature_importance fails for multiclass data~~ find_feature_importance fails for multiclass data Aug 18, 2021

kelly-sovacool added the reprex Needs a minimal reproducible example label Aug 19, 2021

kelly-sovacool closed this as completed Aug 19, 2021

kelly-sovacool removed the reprex Needs a minimal reproducible example label Aug 19, 2021
