
Error in { : task 1 failed - "'n' must be a positive integer >= 'x'" #684

Closed
andzandz11 opened this issue Jul 7, 2017 · 17 comments

@andzandz11 commented Jul 7, 2017

My code fails and I have no idea why. I followed a tutorial from http://blog.revolutionanalytics.com/2015/10/the-5th-tribe-support-vector-machines-and-caret.html


file.txt

Minimal, runnable code:

library(caret)
library(kernlab)
library(doParallel)

load(file = "file.txt")        # loads the oligo data frame
mydat <- oligo[, -1]           # predictors: everything but the label column

cluster <- makeCluster(detectCores())
registerDoParallel(cluster)

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     search = "grid",
                     summaryFunction = multiClassSummary,
                     classProbs = TRUE)

svm.tune <- train(x = mydat,
                  y = oligo[, "sci_name"],
                  method = "svmRadial",
                  trControl = ctrl,
                  metric = "ROC",
                  tuneLength = 9)


Error in { : task 1 failed - "'n' must be a positive integer >= 'x'"
In addition: Warning messages:
1: In .local(x, ...) : Variable(s) `' constant. Cannot scale data.
2: In train.default(x = mydat, y = oligo[, "sci_name"], method = "svmRadial",  :
  The metric "ROC" was not in the result set. logLoss will be used instead.

Session Info:

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doParallel_1.0.10 iterators_1.0.8   foreach_1.4.3     kernlab_0.9-25    caret_6.0-76     
[6] ggplot2_2.2.1     lattice_0.20-35  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11       magrittr_1.5       splines_3.4.0      MASS_7.3-47        munsell_0.4.3     
 [6] colorspace_1.3-2   rlang_0.1.1        minqa_1.2.4        stringr_1.2.0      car_2.1-5         
[11] plyr_1.8.4         tools_3.4.0        nnet_7.3-12        pbkrtest_0.4-7     grid_3.4.0        
[16] gtable_0.2.0       nlme_3.1-131       mgcv_1.8-17        quantreg_5.33      e1071_1.6-8       
[21] class_7.3-14       MatrixModels_0.4-1 lme4_1.1-13        lazyeval_0.2.0     tibble_1.3.3      
[26] Matrix_1.2-9       nloptr_1.0.4       reshape2_1.4.2     ModelMetrics_1.1.0 codetools_0.2-15  
[31] stringi_1.1.5      compiler_3.4.0     scales_0.4.1       stats4_3.4.0       SparseM_1.77    

@topepo (Owner) commented Jul 7, 2017

I'm currently troubleshooting an error with the same message from my own work. It looks like it is only an issue when the underlying model fit fails and that model generates an S4 object.

(EDITED) I didn't see that you had attached a file.

@andzandz11 (Author) commented Jul 7, 2017

So maybe I did subset my data too harshly and my model no longer makes sense?

@topepo (Owner) commented Jul 7, 2017

I'll take a look in a bit, but with SVMs there is a decent chance that the model fails when building the secondary Platt probability model.

topepo added a commit that referenced this issue Jul 7, 2017

@topepo (Owner) commented Jul 7, 2017

There was a bug in the summary code. However...

For your data, I'm not sure that anything will help; there are going to be big problems when you have very few cases in the classes:

> table(table(oligo[,"sci_name"]))

 1  2  3  4  5  8 
10  9  6  4  6  1 

There were a lot of model errors saying:

model fit failed for Fold03.Rep1: sigma=0.00359, C= 1.00 Error in .local(object, ...) : test vector does not match model !

This is directly related to the small frequencies of some of the taxa.

Even something like leave-one-out will probably fail since, in 10 cases, a model will be built without one of the classes and will fail when predicting that class.

Also, for future reference, set the seed prior to running train so that we get the same random numbers/resamples etc.
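
For instance, a minimal sketch (reusing ctrl and mydat from the report above; the seed value itself is arbitrary):

library(caret)

set.seed(42)  # fixes the resampling indices so repeated runs are comparable
svm.tune <- train(x = mydat,
                  y = oligo[, "sci_name"],
                  method = "svmRadial",
                  trControl = ctrl,
                  tuneLength = 9)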

@topepo (Owner) commented Jul 7, 2017

You might be able to do something at the genus level with a model for Actinidia versus not-Actinidia. That may not help much in your context, though.

@andzandz11 (Author) commented Jul 7, 2017

Thanks, that helps!
The data is just a subset of a much bigger dataset. Maybe I overdid the subsetting.
However, I have expanded my sample from 100 (as above) to 400 sequences (running right now), and the SVM has been running for half an hour (edit: one hour) and counting. I have 300,000 sequences (= observations) in my full dataset. I am beginning to think a radial SVM might not be the best idea.
In any case, the issue is solved. Feel free to close.

My underlying problem is that I need a machine learning algorithm that can learn to classify 50,000 classes from just 150,000 observations (each with 1,024 data values, integers between 0 and 16). My classes have "master" classes (hierarchical, 2 levels deep), and I need to find the deepest class that can be assigned. Data values within one class are supposed to be 98% identical between observations, while observations belonging to different classes are supposed to be <98% identical. However, only 60% of my data follows that trend. That is what I need the machine learning for: I need to look at the content of my observations instead of just calculating the distance between two observations, because with pure distance I get an error rate of 40%. Maybe you have an idea what to use. Anything appreciated :D
These guys use naive Bayes: http://aem.asm.org/content/73/16/5261.long
But they only go down to genus. I need to dig deeper.
Have a nice day!

@topepo (Owner) commented Jul 7, 2017

A few things:

  • Your predictors are (mostly?) binary, so a parametric nonlinear model like the RBF kernel doesn't really buy you anything. If you really want to use SVMs, go linear. I would strongly suggest trees, since they can find/isolate important subpopulations for prediction and they also do feature selection in the process.
  • Perhaps you wouldn't have this issue with the larger set, but some of your predictors seemed pretty sparse and get resampled down to a single value. You might try adding the "zv" preprocessing method to your train calls; it will eliminate these zero-variance predictors prior to the model fit (see the sketch after this list).
  • Can you compute a similarity measure based on non-conserved genetic regions between species (or SNPs or some other DNA-based probe)? If there are species that are extremely genetically similar, you might think about pooling them into a joint class. That would reduce the number of classes while still giving you more specificity than the genus.
  • A hierarchical system might help too. If you have a good genus model, you could develop separate species-within-genus models, and that might help the situation.
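
A minimal sketch combining the first two points (assuming mydat and ctrl as defined in the original report; a sketch, not a tested recipe for these data):

library(caret)

set.seed(42)
svm.tune <- train(x = mydat,
                  y = oligo[, "sci_name"],
                  method = "svmLinear",  # linear kernel instead of the RBF
                  preProcess = "zv",     # drop zero-variance predictors within each resample
                  trControl = ctrl)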

@andzandz11 (Author) commented Jul 7, 2017

My predictors can in theory have values between 1 and, uh, I don't know what the maximum is; theoretically 200, practically more like 10. So, sorry, I cannot use the binary approaches. Maybe I should convert them to binary, hmm. Good point!
I am using machine learning because my data is too complex for trees. That is the first reason why I am here: I made phylogenetic trees with the DNA data, and a lot of sequences end up in ambiguous positions.
I am also using machine learning because my data is too complex for distance calculations :D
http://media.springernature.com/lw785/springer-static/image/art%3A10.1186%2F1756-0381-7-4/MediaObjects/13040_2013_Article_103_Fig6_HTML.jpg
Distance would be BLAST. Tree (phylogenetic, not the machine learning kind; what is that, like random forest?) would be somewhere near NJ.
Basically I am trying to recreate https://biodatamining.biomedcentral.com/articles/10.1186/1756-0381-7-4#MOESM2 using k-mers instead of single characters from the DNA sequence. My problem is that the paper uses 250 sequences, but I have 300,000.

@topepo (Owner) commented Jul 7, 2017

I should have been more specific... I meant tree-based machine learning models (e.g. boosted trees and not phylogenetic trees).

@andzandz11 (Author) commented Jul 7, 2017

temp.txt

I still don't understand why I can't get a ROC value.
I temporarily switched from the species to the genus label to give the model more observations per class, and upped the number of observations from 100 to 1000.
(I installed the recent GitHub version.)

load(file = "temp.txt")
mydat <- oligo[, -1]

library(magrittr)
library(caret)
library(doParallel)

cluster <- makeCluster(detectCores())  # uses all cores; the convention would be to leave one for the OS
registerDoParallel(cluster)

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 10,
                     search = "grid",
                     summaryFunction = multiClassSummary,
                     classProbs = TRUE)

# collapse "Genus_species" labels to the genus, then strip non-alphanumerics
svm.tune <- train(x = mydat,
                  y = oligo[, "sci_name"] %>% as.character %>%
                      gsub("_.*", "", .) %>% gsub("[^[:alnum:]]", "", .),
                  method = "rf",
                  trControl = ctrl,
                  metric = "ROC",
                  tuneLength = 4)

gives me:


In train.default(x = mydat, y = oligo[, "sci_name"] %>% as.character %>%  :
  The metric "ROC" was not in the result set. logLoss will be used instead.
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

> svm.tune
Random Forest 

1000 samples
1024 predictors
 110 classes: 'Acis', 'Acorus', 'Actinidia', 'Aerva', 'Agathis', 'Alisma', 'Allium', 'Alstonia', 'Alternanthera', 'Altingia', 'Amaranthus', 'Ancistrocladus', 'Angelica', 'Apium', 'Apoballis', 'Apocynum', 'Aponogeton', 'Araucaria', 'Aridarum', 'Arisaema', 'Arum', 'Asclepias', 'Asimina', 'Aspidosperma', 'Avicennia', 'Baldellia', 'Beauverdia', 'Boophone', 'Brassaiopsis', 'Bucephalandra', 'Bupleurum', 'Carum', 'Catharanthus', 'Celosia', 'Centella', 'Cicuta', 'Clivia', 'Colocasia', 'Conicosia', 'Conium', 'Crinum', 'Cryptocoryne', 'Cynanchum', 'Cyrtanthus', 'Damasonium', 'Daucus', 'Decalepis', 'Deeringothamnus', 'Echinodorus', 'Eleutherococcus', 'Fatsia', 'Foeniculum', 'Galanthus', 'Gilliesia', 'Habranthus', 'Haemanthus', 'Hannonia', 'Hedera', 'Hippeastrum', 'Homalomena', 'Ilex', 'Justicia', 'Lapiedra', 'Leucocoryne', 'Leucojum', 'Ligusticum', 'Limnocharis', 'Liquidambar', 'Luronium', 'Lycoris', 'Macropanax', 'Meeboldia', 'Miersia', 'Narcissus', 'Nothoscordum', 'Oenanthe', 'Oplopanax', 'Osmo... <truncated>

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 894, 905, 896, 902, 898, 898, ... 
Resampling results across tuning parameters:

  mtry  logLoss    AUC  Accuracy   Kappa      Mean_F1  Mean_Sensitivity  Mean_Specificity
     2  0.3607889  NaN  0.9747318  0.9734459  NaN      NaN               0.9997664       
    16  0.2349366  NaN  0.9848907  0.9841323  NaN      NaN               0.9998607       
   128  0.2270863  NaN  0.9697814  0.9682439  NaN      NaN               0.9997158       
  1024  0.5466890  NaN  0.9593647  0.9572440  NaN      NaN               0.9996206       
  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value  Mean_Detection_Rate  Mean_Balanced_Accuracy
  NaN                  NaN                  0.008861199          NaN                   
  NaN                  NaN                  0.008953552          NaN                   
  NaN                  NaN                  0.008816194          NaN                   
  NaN                  NaN                  0.008721497          NaN                   

logLoss was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 128.

If I go back to species level (by changing y to
y = oligo[, "sci_name"] %>% as.character %>% gsub("[^[:alnum:]]", "", .)
), it stops completely:

Something is wrong; all the logLoss metric values are missing:
    logLoss         AUC         Accuracy       Kappa        Mean_F1    Mean_Sensitivity
 Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA     
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA     
 Median : NA   Median : NA   Median : NA   Median : NA   Median : NA   Median : NA     
 Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN     
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA     
 Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA     
 NA's   :4     NA's   :4     NA's   :4     NA's   :4     NA's   :4     NA's   :4       
 Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Detection_Rate Mean_Balanced_Accuracy
 Min.   : NA      Min.   : NA         Min.   : NA         Min.   : NA         Min.   : NA           
 1st Qu.: NA      1st Qu.: NA         1st Qu.: NA         1st Qu.: NA         1st Qu.: NA           
 Median : NA      Median : NA         Median : NA         Median : NA         Median : NA           
 Mean   :NaN      Mean   :NaN         Mean   :NaN         Mean   :NaN         Mean   :NaN           
 3rd Qu.: NA      3rd Qu.: NA         3rd Qu.: NA         3rd Qu.: NA         3rd Qu.: NA           
 Max.   : NA      Max.   : NA         Max.   : NA         Max.   : NA         Max.   : NA           
 NA's   :4        NA's   :4           NA's   :4           NA's   :4           NA's   :4             
Error: Stopping
In addition: Warning messages:
1: In train.default(x = mydat, y = oligo[, "sci_name"] %>% as.character %>%  :
  The metric "ROC" was not in the result set. logLoss will be used instead.
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Sorry, maybe this is a new issue. I am not a data scientist and I am new to GitHub.

It also crashed when using naive_bayes.


Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :2     NA's   :2    
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

@andzandz11 (Author) commented Jul 7, 2017

https://stackoverflow.com/questions/38945574/something-is-wrong-all-the-accuracy-metric-values-are-missing

This error is caused by the fact that there were missing values in resampled performance measures. That might happen if there is a resample where one of the outcome classes (in this case default) has zero samples so sensitivity or specificity is undefined.

If this is true, this is really bad. It makes machine learning totally unsuited for the life sciences, because we have normally distributed data and there will always be classes with few observations; from time to time a resample will miss them entirely. The algorithm must not be allowed to break every time that happens.
Maybe a simple na.omit = TRUE somewhere in the predict function would do the trick?

I do not really want to up-sample or down-sample, because I want predictions that reflect the real distribution of my data. I also cannot delete or add anything to my dataset, because it is, in a way, a food network: if I add or leave something out, it completely loses its meaning.

If I have a class in my testing set that is not in my model, I want the accuracy to be 0. If I have a class in my model that is not in my testing set, I want the accuracy to be NA. (But I still want the other classes to have an accuracy that can serve as the basis for further calculations, without one NA screwing up my whole results.)

So let me describe what I did in another tool that has nothing to do with machine learning but with distance/similarity-based classification: I cross-validated (LOOCV) by testing each of the n observations against the other n-1, where n is the total number of observations. And it was perfect. However, it was not machine learning and not very reliable. But it seems I cannot do this with machine learning, because building 300,000 models and validating each with one observation would take far too long.

Edit: I think I need a custom sampling function. Assume I have DNA sequences that carry a species class and a genus class (one level higher).

I want to simulate the following case: I have the full database minus one sequence and test that sequence against the database (LOOCV).
What I cannot do: sample one sequence per run; that would take forever.
What I could do: because I already know that I can get to genus level with 97% success (previous trials with phylogenetic distances), I can sample one sequence per genus without influencing sequences in another genus.
What I need is:
Sample once from each class (species), but not twice per superclass (genus). Mark the samples so they are not sampled again. Do not sample classes with only one sequence (but leave them in the dataset as honeypots). Repeat 9 times (the maximum number of sequences per genus is 10).
Repeat everything 10 times (resetting the do-not-sample flag). Average the classification success per genus. Average the classification success per species.
Oh boy, I had not even heard of machine learning until two days ago...
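
Something like this is roughly what I mean; a rough, untested sketch (trainControl's index argument takes any user-built list of training-row indices, so a custom scheme should be possible; the do-not-sample-again bookkeeping is omitted here):

library(caret)

species <- as.character(oligo[, "sci_name"])
genus   <- gsub("_.*", "", species)

one_holdout <- function() {
  # pick one random row per species, skipping singleton species
  picks <- unlist(lapply(split(seq_along(species), species), function(idx) {
    if (length(idx) > 1) sample(idx, 1) else integer(0)
  }))
  picks[!duplicated(genus[picks])]  # keep at most one pick per genus
}

set.seed(1)
holdouts    <- replicate(10, one_holdout(), simplify = FALSE)
train_index <- lapply(holdouts, function(h) setdiff(seq_along(species), h))

ctrl <- trainControl(index = train_index,
                     summaryFunction = multiClassSummary,
                     classProbs = TRUE)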

@topepo (Owner) commented Jul 8, 2017

I still don't understand why I can't get a ROC value.

ROC curves are for cases where you have two classes. multiClassSummary has a measure (AUC) that is the average of the one-versus-all results. From ?multiClassSummary:

multiClassSummary computes some overall measures of performance (e.g. overall accuracy and the Kappa statistic) and several averages of statistics calculated from "one-versus-all" configurations. For example, if there are three classes, three sets of sensitivity values are determined and the average is reported with the name "Mean_Sensitivity". The same is true for a number of statistics generated by confusionMatrix. With two classes, the basic sensitivity is reported with the name "Sensitivity".

The problem is that you are holding out an entire class (or perhaps more) during resampling, and this results in all of the class-specific measures being missing, since many of their component values are undefined. Note that log loss, accuracy, and the other class-independent metrics are still estimated.

If this is true, this is really bad. It makes machine learning totally unsuited for the life sciences, because we have normally distributed data and there will always be classes with few observations; from time to time a resample will miss them entirely. The algorithm must not be allowed to break every time that happens.

I understand your point of view but strongly disagree. Take the sensitivity issue: how can you estimate the false positive rate if there are no positives? I don't see it as a shortcoming of the model; you just don't have enough per-species data to support the type of model that you want.
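
As a toy illustration (made-up vectors, not your data): when a class never occurs in the holdout, its one-versus-all sensitivity is 0/0 and comes back missing, which then makes the averaged Mean_* statistics missing too:

library(caret)

# class "c" exists as a factor level but has no observations
obs  <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
pred <- factor(c("a", "b", "b", "b"), levels = c("a", "b", "c"))

cm <- confusionMatrix(pred, obs)
cm$byClass[, "Sensitivity"]
# the entry for Class: c comes back missing (0/0) since there are no true "c" cases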

@schelhorn commented Jul 8, 2017

I guess @topepo knows a thing or two about the life sciences ;) There are ways to deal with class imbalances, and AUROC isn't your best choice in that case anyway: kappa or log loss plus case weights probably make more sense.

@andzandz11 (Author) commented Jul 8, 2017

Sorry, I was frustrated. You both have been incredibly helpful.
I think that is why the popular RDP classifier (which uses naive Bayes) only goes down to genus: we don't currently have the databases to support machine learning down to species level in my field of study. The average number of sequences per species is 2.2, ranging from 1 to 10 (3.2 if I exclude all singletons).
Still, just calculating the distances between the sequences yields 65% identification success, so "it" is in the data. What I really need is a sampling method that only samples once per class (without sampling singleton classes at all). I will try to find one.
I know that I cannot cross-validate properly if a whole class gets held out, but I need that to fail silently without affecting the other classes. Ultimately I need a confidence value per class, not per model, because when I start a new study on a particular organism I need to know in advance what the chance is that I can classify that organism and its close relatives with the database and my learning model.
Thanks!

andzandz11 closed this Jul 8, 2017

@topepo (Owner) commented Jul 8, 2017

Sorry, I was frustrated.

No problem at all.

With parallel processing, leave-one-out might be doable. You might also consider using the groupKFold function along with the index argument of trainControl.
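
A minimal sketch of the groupKFold route (assuming the genus labels derived in the earlier comment; groupKFold keeps all rows of a group together, so whole groups are held out per resample):

library(caret)

genus <- gsub("_.*", "", as.character(oligo[, "sci_name"]))

set.seed(42)
folds <- groupKFold(genus, k = 10)  # a list of training-row indices

ctrl <- trainControl(method = "cv",
                     index = folds,
                     summaryFunction = multiClassSummary,
                     classProbs = TRUE)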

@schelhorn commented Jul 8, 2017

I guess supervised machine learning just isn't a paradigm that fits your problem well, at least not with a general-purpose toolbox such as caret. Are there any reasons you aren't using the published taxonomic binning methods? From an applied bioinformatics point of view, your problem has been solved, hasn't it?

@andzandz11 (Author) commented Jul 9, 2017

Not really in my field.
[attached image: Fig. 5 from the BioData Mining paper linked above]
I am not working in metagenomics. I have (multiple) single sequences (different markers that I need to combine) that need to be identified to species level with >95% accuracy; as in: medically relevant.
The problem is that with BLAST you cannot really combine many markers effectively, as BLAST has no notion of the errors this might introduce: the uninformative characters have the same impact as the informative ones.
DNA-BAR needs an alignment, which cannot be produced for a lot of organisms. BLOG is also machine learning. All the other approaches are inferior to machine learning. The problem is that in their "theoretical" evaluations the sample sets contain an unrealistically high number of replicates per species (~10). I was not aware of that when I started looking into this. I am sure machine learning is the way to go, just not yet, while sequence repositories average only 2-3 sequences per species.
