Undefined offset in metric class while training Multilayer Perceptron Classifier #64
Hi @DivineOmega, thanks for the bug report! By any chance, do any of the training sets contain 10 or fewer samples? Also, here is line 107 of MCC. For some reason, the integer 0 is showing up in a prediction ... do you have a directory named 0, or one that would evaluate to 0 or false under type coercion? (See https://www.php.net/manual/en/language.types.boolean.php#language.types.boolean.casting)
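Illustratively, a confusion-matrix-style metric tallies counts indexed by class label, so an unexpected label breaks the lookup. A hypothetical sketch of the failure mode (not the actual MCC source; the class names here are invented):

```php
$classes = ['positive', 'negative'];  // invented labels for illustration

// Tally table keyed by class label, as a confusion-matrix metric would build.
$matrix = array_fill_keys($classes, array_fill_keys($classes, 0));

$labels      = ['positive', 'negative'];
$predictions = ['positive', 0];  // a bad network output collapsing to 0

foreach ($predictions as $i => $prediction) {
    // Reading a key that was never initialized raises "Undefined offset: 0".
    $count = $matrix[$prediction][$labels[$i]];

    $matrix[$prediction][$labels[$i]] = $count + 1;
}
```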
Hi @andrewdalpino. Thanks for looking into this. I have two classes, and there's no directory named 0. To confirm I've only got the two classes, I ran the following code:

```php
$dataset = DatasetHelper::buildLabeled();

dd($dataset->possibleOutcomes());
```

And got back an array containing just my two class labels.
FYI, I'm using:

```
$ composer show | grep rubix/ml
rubix/ml  dev-master d0872a0  A high-level machine learning and deep learning library for the PHP language.
```
@DivineOmega Hmmmm ... this is a mysterious one. How often does this error occur? For example, out of 100 trainings, how many of them would error, in your estimation? Does training seem normal when these errors occur? Is the loss decreasing steadily? I'm wondering if the network is outputting NaN values because training went awry for some reason.
@andrewdalpino So far, with that dataset, it has failed 5/5 times (3 with FBeta, 2 with MCC). Here's the log from the latest training session. Loss is decreasing.
Hmm, everything seems normal up to epoch 5. I'm going to give this some thought.
@andrewdalpino I've just re-run the training with the exact same dataset and learner/metric configuration. In this case, it failed after epoch 3.
@andrewdalpino Tried again, and it failed after the 8th epoch. This dataset does not appear to be training successfully at all at the moment.
While attempting to debug this, training succeeded. The only change I made was adding 2 new samples (1 to each class). However, training did fail once with these additional samples as well, so I'd assume the success was just a coincidence.
The change of dataset can definitely be ruled out, as training just completed on the original dataset after epoch 13.
The number of training rounds that the algorithm executes should not matter; rather, I was looking at how the loss was steadily decreasing while the MCC was steadily increasing over time. In the log, we see a downward jump in the MCC at epoch 2; however, this may just be the algorithm escaping a local minimum, so no problems are observed from the logs.

I'm going to need to do some more digging to find out what's really going on. Thanks for the extra info @DivineOmega, it's very helpful. Roughly what is the training success rate after the latest trials?

One thing you can try in the meantime is decreasing the learning rate of the Adam optimizer. You can also try using the non-adaptive Stochastic optimizer to rule out issues due to momentum. Also, feel free to join our chat: https://t.me/RubixML
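A minimal sketch of those two suggestions, assuming the stock Rubix ML optimizer classes and that the default Adam rate of 0.001 is currently in use:

```php
use Rubix\ML\NeuralNet\Optimizers\Adam;
use Rubix\ML\NeuralNet\Optimizers\Stochastic;

// Drop the Adam learning rate an order of magnitude below the default.
$optimizer = new Adam(0.0001);

// Or swap in plain (non-adaptive) SGD to rule out momentum as a factor.
$optimizer = new Stochastic(0.0001);
```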
@andrewdalpino Today and yesterday, on datasets of that size and above, I've seen 16 failures of this type and only maybe 2 or 3 successes (so around an 11-16% success rate).
@andrewdalpino Interestingly, after lowering the Adam optimiser learning rate by 10x, the training completed first time without issue.
@andrewdalpino Since lowering the Adam optimiser learning rate by 10x, the training has yet to fail once over around 5 or 6 training sessions. This works around the problem for now but doesn't solve the root cause. It might help to narrow it down, though?
Summarizing what we talked about in chat: this error is caused by a chain of silent errors, starting with numerical under/overflow due to a learning rate that is too high for the user's particular dataset. As a result, the network produces NaN values at the output layer, which in turn produce a prediction of 0.

The solution is to decrease the learning rate of the Gradient Descent optimizer to prevent the network from blowing up. To aid the user in identifying when the network has become unstable, we will catch NaN values before scoring the validation set and then throw an informative exception.

Here is a good article on exploding gradients and why decreasing the learning rate has the effect of stabilizing training: https://machinelearningmastery.com/exploding-gradients-in-neural-networks/
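A rough sketch of the guard being described, assuming a simple pre-scoring check over the network's raw output (the actual patch may differ):

```php
use RuntimeException;

$activations = [0.93, NAN, 0.07];  // example of an unstable output row

// Hypothetical guard: surface NaN outputs with an informative exception
// before they are scored by the validation metric.
foreach ($activations as $activation) {
    if (is_nan($activation)) {
        throw new RuntimeException(
            'Network output contains NaN values, try lowering the learning rate.'
        );
    }
}
```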
Thanks again for the great bug report, @DivineOmega. You can test out the fix on the latest dev-master, or you can wait until the next release.
Describe the bug
When attempting to train a Multilayer Perceptron Classifier, I occasionally get an "Undefined offset" exception from within the metric class. I have been able to replicate this with both the MCC and FBeta metrics. Unfortunately, the exception does not occur consistently, even with the same dataset.
To Reproduce
The following code is capable of recreating this error occasionally.
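A minimal sketch of the kind of setup involved, assuming standard Rubix ML classes; the layer sizes, batch size, and optimizer settings here are illustrative rather than the reporter's exact configuration:

```php
use Rubix\ML\Classifiers\MultilayerPerceptron;
use Rubix\ML\NeuralNet\Layers\Dense;
use Rubix\ML\NeuralNet\Layers\Activation;
use Rubix\ML\NeuralNet\ActivationFunctions\ReLU;
use Rubix\ML\NeuralNet\Optimizers\Adam;

// Build the labeled dataset from the directory structure (see below).
$dataset = DatasetHelper::buildLabeled();

$estimator = new MultilayerPerceptron([
    new Dense(100),
    new Activation(new ReLU()),
], 128, new Adam(0.001));

$estimator->train($dataset); // intermittently triggers the undefined offset
```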
The labelled dataset used is a series of text files split into directories whose names indicate their class. It is built using the following function.
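A hypothetical reconstruction of such a builder, assuming directory names double as class labels and each text file is one sample (the reporter's real helper presumably also vectorizes the raw text before training):

```php
use Rubix\ML\Datasets\Labeled;

class DatasetHelper
{
    public static function buildLabeled(): Labeled
    {
        $samples = [];
        $labels = [];

        // Hypothetical layout: dataset/<class name>/<sample>.txt
        foreach (glob('dataset/*', GLOB_ONLYDIR) as $dir) {
            foreach (glob($dir . '/*.txt') as $file) {
                $samples[] = [file_get_contents($file)];
                $labels[] = basename($dir);
            }
        }

        return new Labeled($samples, $labels);
    }
}
```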
Expected behavior
Training should complete without any errors within the metric class.