Error in { : task 1 failed - "'n' must be a positive integer >= 'x'" #684
Comments
I'm currently troubleshooting an error with the same message in my own work. It looks like it is only an issue when the underlying model fit fails and that model generates an S4 object. (EDITED) I didn't see that you had attached a file.
So maybe I did subset my data too harshly and my model no longer makes sense?
I'll take a look in a bit, but with SVMs there is a decent chance that the model fails when building the secondary Platt probability model.
There was a bug in the summary code. However... for your data, I'm not sure that anything will help; there are going to be big problems when you have very few cases in some of the classes:

    > table(table(oligo[, "sci_name"]))

     1  2  3  4  5  8
    10  9  6  4  6  1

There were also a lot of model errors. This is directly related to the small frequencies of some of the taxa. Even something like leave-one-out will probably fail since, in 10 cases, a model will be built without one of the classes and will then fail at predicting that class. Also, for future reference, set the seed prior to running
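One workaround (a sketch of my own, not from this thread) is to drop the rare taxa before training and to set the seed up front. `oligo` and `sci_name` are the names from the attached file; the values here are synthetic stand-ins:

```r
set.seed(42)  # set the seed before any resampling so runs are reproducible

# synthetic stand-in for the attached data: a 'sci_name' column with one rare class
oligo <- data.frame(
  sci_name = sample(c(rep("taxon_A", 12), rep("taxon_B", 8), "taxon_C")),
  stringsAsFactors = FALSE
)

# keep only taxa with enough observations to survive resampling
min_obs <- 5
counts  <- table(oligo[, "sci_name"])
keep    <- names(counts)[counts >= min_obs]
oligo_f <- oligo[oligo$sci_name %in% keep, , drop = FALSE]

table(oligo_f$sci_name)  # taxon_C (1 observation) is gone
```

The threshold `min_obs` is arbitrary; it should be at least as large as the number of resamples a class must appear in.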
You might be able to do something at the genus level with a model for Actinidia versus not-Actinidia. That may not help much in your context, though.
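For reference, collapsing the species labels into a two-class genus problem is a one-liner in base R (the example labels are hypothetical, not from the attached file):

```r
# collapse species labels to 'Actinidia' vs everything else
sci_name <- c("Actinidia deliciosa", "Actinidia arguta", "Malus domestica")
genus2   <- factor(ifelse(grepl("^Actinidia", sci_name), "Actinidia", "other"))
table(genus2)
```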
Thanks, that helps! My problem is that I need to find a machine learning algorithm that can learn to classify 50,000 classes with just 150,000 observations (each with 1,024 data values, integers between 0 and 16). My classes have "master" classes (a hierarchy, 2 levels deep), and I need to find the deepest class that can be assigned. Data values within one class are supposed to be 98% identical between observations, while observations belonging to different classes are supposed to be <98% identical. However, only 60% of my data follows that trend. That is why I need machine learning: I need to look at the content of my observations instead of just calculating the distance between two observations, because with a pure distance approach I get an error rate of 40%. Maybe you have an idea what to use. Anything appreciated :D
A few things:
My predictors can in theory have values between 1 and, uhh, I don't know what the maximum is; theoretically 200, practically more like 10. So, sorry, I cannot use the binary things. Maybe I should convert them to binary, hmm. Good point!
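Converting count predictors to presence/absence is straightforward in base R; a small sketch with made-up values (the column names and counts are illustrative only):

```r
# made-up counts per predictor, in the 1-200 range mentioned above (0 = absent)
counts <- data.frame(k1 = c(0, 3, 10), k2 = c(1, 0, 200))

# binary presence/absence version: 1 if the count is positive, else 0
binary <- as.data.frame(lapply(counts, function(col) as.integer(col > 0)))
binary
```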
I should have been more specific... I meant tree-based machine learning models (e.g. boosted trees, not phylogenetic trees).
I still don't understand why I can't get a ROC?
gives me:
If I go back to species level it stops completely.
Sorry, maybe this is a new issue. I am no data scientist and am new to GitHub. It also crashed when using naive_bayes.
If this is true, this is really bad. It makes machine learning totally unsuited for the life sciences, because we have normally distributed data and there will always be classes with few observations; they will be resampled from time to time. The algorithm must not be allowed to break every time. I do not really want to up-sample or down-sample, because I want predictions that reflect the real distribution of my data. I also cannot delete or add anything to my dataset, because it is (kind of) a food network; if I add or leave something out, it completely loses its meaning.

If I have a class in my testing set that is not in the model, I want its accuracy to be NA, but I still want the other classes to have an accuracy that can serve as the basis for further calculations, without one NA screwing over my whole results.

Let me describe what I did in another tool that has nothing to do with machine learning but with distance-similarity-based classification: I cross-validated (LOOCV) each class n-1 times with 1 observation, where n is the total number of observations. It was perfect. However, it was not machine learning and not very reliable. In machine learning I cannot do this, it seems, because building 300,000 models and validating with one observation each time would take way too long.

edit: I think I need a custom sampling function. Assume I have DNA sequences that have a species class and a genus class (which is a higher level). I want to simulate the following case: I have the full database minus one sequence and test that sequence against the database (LOOCV).
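The leave-one-sequence-out scheme described above can be expressed as a list of training-row indices. Building that list is plain base R; my assumption (not stated in the thread) is that it would then be handed to caret via the `index` argument of `trainControl()`:

```r
n <- 10  # stand-in for the number of sequences in the database

# one resample per sequence: train on everything except row i, hold out row i
loo_index <- lapply(seq_len(n), function(i) setdiff(seq_len(n), i))
names(loo_index) <- paste0("Fold", seq_len(n))

# each training set has n - 1 rows and never contains its held-out row
lengths(loo_index)
```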
ROC curves are for cases where you have two classes.
The problem is that you are holding out an entire class (or perhaps more) during resampling, and this results in all of the class-specific measures being missing since many of their values are missing. Note that log-loss, accuracy, and other class-independent metrics are still estimated.
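The mechanics are easy to see in base R: when a class never appears in the holdout, its sensitivity is 0/0, which comes out as NaN, and any average over the class-specific values inherits it. A toy illustration (not caret's internal code):

```r
obs  <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))  # class "c" never observed
pred <- factor(c("a", "b", "b"), levels = c("a", "b", "c"))

# per-class sensitivity: true positives / actual positives
sens <- sapply(levels(obs), function(cl) {
  sum(pred == cl & obs == cl) / sum(obs == cl)
})
sens        # class "c": 0/0 = NaN
mean(sens)  # the NaN propagates into the macro average
```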
I understand your point of view but strongly disagree. Take the sensitivity issue: how can you estimate the true positive rate if there are no positives? I don't see it as a shortcoming of the model; you just don't have enough per-species data to support the type of model that you want.
I guess @topepo knows a thing or two about the life sciences ;) There are ways to deal with class imbalances, and AUROC isn't your best choice in that case anyway -
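Down-sampling to the smallest class is one of the standard imbalance tools; caret exposes it as `downSample()` and as the `sampling = "down"` option of `trainControl()`. A base-R version of the idea, with made-up data:

```r
set.seed(1)
y <- factor(c(rep("common", 8), rep("rare", 2)))

# down-sample: keep min(class sizes) observations of every class
n_min <- min(table(y))
keep  <- unlist(lapply(levels(y), function(cl) {
  sample(which(y == cl), n_min)
}))
table(y[keep])  # both classes now have the same number of observations
```

As noted above, this does change the class distribution the model sees, so it trades calibration for balanced per-class performance.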
Sorry, I was frustrated. You both have been incredibly helpful.
No problem at all. With parallel processing, leave-one-out might be doable. You might also consider using the
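With base R's `parallel` package, the per-fold fits can be farmed out with `mclapply()` (forking, so it falls back to serial on Windows). A sketch with a trivial stand-in for the real model fit:

```r
library(parallel)

n <- 8
# leave-one-out training sets: everything except row i
folds <- lapply(seq_len(n), function(i) setdiff(seq_len(n), i))

# stand-in for an expensive model fit on one training set
fit_one <- function(train_idx) length(train_idx)

# fit the n models in parallel (mclapply forks; use 1 core on Windows)
cores <- if (.Platform$OS.type == "windows") 1L else 2L
fits  <- mclapply(folds, fit_one, mc.cores = cores)
unlist(fits)  # each "model" saw n - 1 rows
```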
I guess supervised machine learning just isn't a paradigm that fits your problem well - at least not with a general-purpose toolbox such as

My code fails. No idea why. I followed a tutorial from http://blog.revolutionanalytics.com/2015/10/the-5th-tribe-support-vector-machines-and-caret.html
If you are filing a bug, make sure these boxes are checked before submitting your issue - thank you!

- `update.packages(oldPkgs = "caret", ask = FALSE)`
- `sessionInfo()`
- file.txt
Minimal, runnable code:
Session Info: