find_feature_importance fails for multiclass data #274

Closed
alexmsalmeida opened this issue Aug 18, 2021 · 6 comments

Comments

@alexmsalmeida

Hi,

Thanks for developing mikropml - it's an amazing package for implementing machine learning methods in microbiome data.

I have been trying to apply it to a complex multiclass dataset I am working with. However, whenever I run run_ml() with find_feature_importance = TRUE, I get the following error:

Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname,  : 
  subscript out of bounds

The error does not occur if I use a dataset with just two classes. I was also able to reproduce the error with the dataset otu_mini_multi that is provided as an example in the repo, so it does not seem to be specific to my data.

Any ideas on what could be the issue? I am using R v3.6.1, mikropml v1.0.0, caret v6.0-88 and future.apply v1.8.1.

Thanks in advance,
Alex

@alexmsalmeida alexmsalmeida changed the title feature_importance fails for multiclass data find_feature_importance fails for multiclass data Aug 18, 2021
@kelly-sovacool
Member

I wasn't able to reproduce the error with R 4.0.3 and mikropml 1.1.0, nor with your versions of the software (except I had to use future.apply 1.7.0 because 1.8.1 requires R >= 4). Can you provide the code that reproduced the error using the otu_mini_multi dataset?

Here's the code I used for testing:

library(mikropml)
ml_results <- run_ml(otu_mini_multi,
  "glmnet",
  outcome_colname = "dx",
  find_feature_importance = TRUE,
  seed = 2019,
  cv_times = 2
)

And here's how I created the conda environment with your software versions:

mamba create -n R-3.6.1 r-base=3.6.1 r-caret=6.0-88 r-mikropml=1.0.0 r-future.apply

@kelly-sovacool kelly-sovacool added the reprex Needs a minimal reproducible example label Aug 19, 2021
@alexmsalmeida
Author

Thanks for following this up. Interesting... if I use your exact code it works. However, if I generate the otu_mini_multi from the otu_large_multi.csv file, I get the error. See my code below:

library(mikropml)

otu_large_multi <- read.delim("otu_large_multi.csv", sep = ",")
otu_mini_multi <- otu_large_multi[, 1:11]

ml_results <- run_ml(otu_mini_multi,
                     "glmnet",
                     outcome_colname = "dx",
                     find_feature_importance = TRUE,
                     seed = 2019,
                     cv_times = 2
)

Using 'dx' as the outcome column.
Training the model...
Training complete.
Finding feature importance...
Error in calc_perf_metrics(test_data, trained_model, outcome_colname,  : 
  subscript out of bounds
In addition: Warning messages:
....

Both datasets seem to be identical, so I'm not sure what is going on.
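
Two data frames can look identical when printed and still differ in column type, which is worth ruling out in a case like this. The following is a minimal sketch of such a check; the CSV text and the names `from_factors` and `from_strings` are made up for illustration and are not taken from otu_large_multi.csv.

```r
# Hypothetical stand-in data; the values are invented for this example.
csv_text <- "dx,Otu00001,Otu00002\nnormal,350,268\nadenoma,568,1320\ncancer,151,756"

# Read the same text twice, once with each stringsAsFactors setting.
from_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
from_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The outcome values match element by element...
values_match <- all(as.character(from_factors$dx) == from_strings$dx)

# ...but the column classes do not.
classes_a <- sapply(from_factors, class)
classes_b <- sapply(from_strings, class)
differing <- names(classes_a)[classes_a != classes_b]

print(values_match)
print(differing)
```

Running `str()` on each data frame gives the same information at a glance.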

@alexmsalmeida
Author

I seem to have figured it out. If I read in the file with stringsAsFactors = FALSE, it works. If I recall correctly, this is now the default in R v4, so this might have been the reason.
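
As a sketch of the fix described above, passing the argument explicitly makes the result independent of the R version's default. The temporary file and its contents are assumptions made so the example is self-contained; substitute the real path to otu_large_multi.csv.

```r
# Write a tiny, made-up CSV so the example runs without the real file.
path <- tempfile(fileext = ".csv")
writeLines(c("dx,Otu00001", "normal,350", "adenoma,568", "cancer,151"), path)

# Option 1: state the argument explicitly, so R 3.x and R 4.x behave the same.
fixed <- read.delim(path, sep = ",", stringsAsFactors = FALSE)

# Option 2: the data were already loaded with factors; convert them back.
loaded <- read.delim(path, sep = ",", stringsAsFactors = TRUE)
is_fct <- vapply(loaded, is.factor, logical(1))
loaded[is_fct] <- lapply(loaded[is_fct], as.character)

print(class(fixed$dx))
print(class(loaded$dx))
```

Either route leaves the dx column as a character vector, which is the form that worked in this thread.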

@kelly-sovacool
Member

Ahh stringsAsFactors strikes again! Glad you figured it out.

@kelly-sovacool
Member

By the way, you can instead read in the file with readr::read_csv("otu_large_multi.csv"). It won't convert strings to factors unless you explicitly specify the col_types. https://readr.tidyverse.org/reference/read_delim.html
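
A small sketch of the readr behaviour mentioned above, assuming the readr package is installed. The temporary file and its contents are invented for illustration.

```r
# Write a tiny, made-up CSV so the example does not depend on otu_large_multi.csv.
path <- tempfile(fileext = ".csv")
writeLines(c("dx,Otu00001", "normal,350", "adenoma,568", "cancer,151"), path)

if (requireNamespace("readr", quietly = TRUE)) {
  # Default type guessing: text columns come back as character vectors.
  guessed <- readr::read_csv(path, col_types = readr::cols())

  # A factor appears only when it is requested through col_types.
  as_factor <- readr::read_csv(
    path,
    col_types = readr::cols(dx = readr::col_factor())
  )

  print(class(guessed$dx))
  print(class(as_factor$dx))
}
```

Passing `col_types = readr::cols()` keeps the default guessing and also silences the column specification message.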

@kelly-sovacool kelly-sovacool removed the reprex Needs a minimal reproducible example label Aug 19, 2021
@alexmsalmeida
Author

Ah, excellent, thanks for the tip. I will also eventually have to update to R v4, but have been a bit hesitant to do so fearing it will break all my scripts.

Thanks again for the help!


2 participants