
Undefined offset in metric class while training Multilayer Perceptron Classifier #64

Closed
DivineOmega opened this issue Apr 4, 2020 · 16 comments

@DivineOmega
Contributor

DivineOmega commented Apr 4, 2020

Describe the bug

When attempting to train a Multilayer Perceptron Classifier, I occasionally get the following type of exception. I have been able to replicate this with both the MCC and FBeta metrics. Unfortunately, this exception does not occur consistently, even with the same dataset.

[2020-04-04 22:32:21] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
[stacktrace]
#0 /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php(107): Illuminate\\Foundation\\Bootstrap\\HandleExceptions->handleError()
#1 /[REDACTED]/vendor/rubix/ml/src/Classifiers/MultilayerPerceptron.php(414): Rubix\\ML\\CrossValidation\\Metrics\\MCC->score()
#2 /[REDACTED]/vendor/rubix/ml/src/Classifiers/MultilayerPerceptron.php(360): Rubix\\ML\\Classifiers\\MultilayerPerceptron->partial()
#3 /[REDACTED]/vendor/rubix/ml/src/Pipeline.php(189): Rubix\\ML\\Classifiers\\MultilayerPerceptron->train()
#4 /[REDACTED]/vendor/rubix/ml/src/PersistentModel.php(191): Rubix\\ML\\Pipeline->train()
#5 /[REDACTED]/app/Console/Commands/TrainModel.php(89): Rubix\\ML\\PersistentModel->train()
#6 [internal function]: App\\Console\\Commands\\TrainModel->handle()
#7 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): call_user_func_array()
#8 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()
#9 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\\Container\\Util::unwrapIfClosure()
#10 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\\Container\\BoundMethod::callBoundMethod()
#11 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/Container.php(592): Illuminate\\Container\\BoundMethod::call()
#12 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Command.php(134): Illuminate\\Container\\Container->call()
#13 /[REDACTED]/vendor/symfony/console/Command/Command.php(255): Illuminate\\Console\\Command->execute()
#14 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Command.php(121): Symfony\\Component\\Console\\Command\\Command->run()
#15 /[REDACTED]/vendor/symfony/console/Application.php(912): Illuminate\\Console\\Command->run()
#16 /[REDACTED]/vendor/symfony/console/Application.php(264): Symfony\\Component\\Console\\Application->doRunCommand()
#17 /[REDACTED]/vendor/symfony/console/Application.php(140): Symfony\\Component\\Console\\Application->doRun()
#18 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Symfony\\Component\\Console\\Application->run()
#19 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(129): Illuminate\\Console\\Application->run()
#20 /[REDACTED]/artisan(37): Illuminate\\Foundation\\Console\\Kernel->handle()
#21 {main}
"}

To Reproduce

The following code is capable of recreating this error occasionally.

$estimator = new PersistentModel(
    new Pipeline(
        [
            new TextNormalizer(),
            new WordCountVectorizer(10000, 3, new NGram(1, 3)),
            new TfIdfTransformer(),
            new ZScaleStandardizer()
        ],
        new MultilayerPerceptron([
            new Dense(100),
            new PReLU(),
            new Dense(100),
            new PReLU(),
            new Dense(100),
            new PReLU(),
            new Dense(50),
            new PReLU(),
            new Dense(50),
            new PReLU(),
        ], 100, null, 1e-4, 1000, 1e-4, 10, 0.1, null, new MCC())
    ),
    new Filesystem($modelPath.'classifier.model')
);

$estimator->setLogger(new Screen('train-model'));

$estimator->train($dataset);

The labelled dataset used is a series of text files split into directories whose names indicate their classes. This dataset is built using the following function.

    public static function buildLabeled(): Labeled
    {
        $samples = $labels = [];

        $directories = glob(storage_path('app/dataset/*'));

        foreach($directories as $directory) {
            foreach (glob($directory.'/*.txt') as $file) {
                $text = file_get_contents($file);
                $samples[] = [$text];
                $labels[] = basename($directory);
            }
        }

        return Labeled::build($samples, $labels);
    }

Expected behavior

Training should complete without any errors within the metric class.

@DivineOmega DivineOmega added the bug Something isn't working label Apr 4, 2020
@DivineOmega DivineOmega changed the title Illegal offset in metric class while training Multilayer Perceptron Classifier Undefined offset in metric class while training Multilayer Perceptron Classifier Apr 4, 2020
@andrewdalpino
Member

andrewdalpino commented Apr 5, 2020

Hi @DivineOmega thanks for the bug report!

By any chance, do any of the training sets contain 10 or fewer samples?

Also, here is line 107 of MCC

[screenshot of MCC.php around line 107]

For some reason, the integer 0 is showing up in a prediction ... do you have a directory named 0 or one that would evaluate to 0 or false under type-coercion? (see https://www.php.net/manual/en/language.types.boolean.php#language.types.boolean.casting)
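
For reference, here is a minimal, self-contained sketch of how a false prediction would trigger exactly this notice through PHP's array-key coercion. The increment mirrors line 107 of MCC.php; the surrounding setup is hypothetical.

// Counters keyed by class label, as the metric builds them.
$falsePos = ['positive' => 0, 'non-positive' => 0];

// A prediction that is not a valid class label.
$prediction = false;

// PHP silently casts the boolean false to the integer key 0, which was never
// initialised above, so the increment emits "Undefined offset: 0" (escalated
// to an ErrorException by Laravel's error handler).
++$falsePos[$prediction];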

@DivineOmega
Contributor Author

DivineOmega commented Apr 5, 2020

Hi @andrewdalpino. Thanks for looking into this.

My two classes are positive with 527 samples and non-positive with 1035.

There's no directory named 0, and I do not think my buildLabeled function would generate a 0 class. If that were the case, I'd expect the metric to fail immediately and every time, whereas it sometimes fails only after several epochs.

To confirm that I've got only the two classes, I ran the following code.

$dataset = DatasetHelper::buildLabeled();
dd($dataset->possibleOutcomes());

And got the following array:

array:2 [
  0 => "non-positive"
  1 => "positive"
]

@DivineOmega
Contributor Author

FYI, I'm using the dev-master d0872a0 version of rubix/ml via Composer, which is the latest version as of right now.

$ composer show | grep rubix/ml
rubix/ml                              dev-master d0872a0 A high-level machine learning and deep learning library for the PHP language.

@andrewdalpino
Member

andrewdalpino commented Apr 5, 2020

@DivineOmega Hmmmm ... this is a mysterious one

How often does this error occur? For example, out of 100 trainings, how many of them would error, in your estimation?

Does training seem normal when these errors occur? Is the loss decreasing steadily? I'm wondering if the network is outputting NaN values because training went awry for some reason.

@DivineOmega
Contributor Author

@andrewdalpino So far, with that dataset it has failed 5/5 times (3 with FBeta, 2 with MCC).

$ cat storage/logs/laravel.log | grep "Undefined offset"
[2020-04-04 21:15:02] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 21:34:52] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 22:04:19] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 22:32:21] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
[2020-04-05 08:24:02] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
$ cat storage/logs/laravel.log | grep "Undefined offset" | wc -l
5

Here's the log from the latest training session. Loss is decreasing.

$ php artisan ml:train
[2020-04-05 08:02:48] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 08:02:52] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 08:02:57] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 08:02:59] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=MCC
[2020-04-05 08:02:59] train-model.INFO: Training started
[2020-04-05 08:07:14] train-model.INFO: Epoch 1 score=0.49118783277336 loss=0.30282976376964
[2020-04-05 08:11:25] train-model.INFO: Epoch 2 score=0.50325846378583 loss=0.18621958479585
[2020-04-05 08:15:38] train-model.INFO: Epoch 3 score=0.50114946967608 loss=0.1244821070199
[2020-04-05 08:19:52] train-model.INFO: Epoch 4 score=0.55362054479179 loss=0.096733785356479

   ErrorException 

  Undefined offset: 0

  at vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107
    103|                         ++$trueNeg[$class];
    104|                     }
    105|                 }
    106|             } else {
  > 107|                 ++$falsePos[$prediction];
    108|                 ++$falseNeg[$label];
    109|             }
    110|         }
    111| 

      +5 vendor frames 
  6   app/Console/Commands/TrainModel.php:89
      Rubix\ML\PersistentModel::train()

      +14 vendor frames 
  21  artisan:37
      Illuminate\Foundation\Console\Kernel::handle()

@andrewdalpino
Member

Hmm, everything seems normal up to epoch 5.

I'm going to give this some thought

@DivineOmega
Contributor Author

@andrewdalpino I've just re-run the training with the exact same dataset and learner/metric configuration. In this case, it fails after epoch 3.

$ php artisan ml:train
[2020-04-05 10:45:40] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 10:45:45] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 10:45:49] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 10:45:51] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=MCC
[2020-04-05 10:45:51] train-model.INFO: Training started
[2020-04-05 10:49:51] train-model.INFO: Epoch 1 score=0.56864319578525 loss=0.28955885998412
[2020-04-05 10:53:56] train-model.INFO: Epoch 2 score=0.4425972854422 loss=0.15849405919593
[2020-04-05 10:58:04] train-model.INFO: Epoch 3 score=0.48331552310563 loss=0.13056189684066

   ErrorException 

  Undefined offset: 0

  at vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107
    103|                         ++$trueNeg[$class];
    104|                     }
    105|                 }
    106|             } else {
  > 107|                 ++$falsePos[$prediction];
    108|                 ++$falseNeg[$label];
    109|             }
    110|         }
    111| 

      +5 vendor frames 
  6   app/Console/Commands/TrainModel.php:89
      Rubix\ML\PersistentModel::train()

      +14 vendor frames 
  21  artisan:37
      Illuminate\Foundation\Console\Kernel::handle()

@DivineOmega
Contributor Author

DivineOmega commented Apr 5, 2020

@andrewdalpino Tried again, and it failed after the 8th epoch. Training on this dataset does not appear to be succeeding at all at the moment.

I'm wondering if it is an error in the labeled dataset's stratifiedSplit method. I'll do some testing.
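
For what it's worth, a quick check of that hypothesis might look something like the following rough sketch (it assumes the Labeled dataset's stratifiedSplit() and possibleOutcomes() methods as used elsewhere in this thread; the 0.1 ratio matches the learner's hold-out):

$dataset = DatasetHelper::buildLabeled();

// Split the dataset while preserving class proportions.
[$left, $right] = $dataset->stratifiedSplit(0.1);

// Both halves should still report exactly the two expected classes.
dd($left->possibleOutcomes(), $right->possibleOutcomes());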

@DivineOmega
Contributor Author

While attempting to debug this, training succeeded. The only difference I made was to add 2 new samples (1 to each class). However, it did fail once with these additional samples as well, so I assume this is just a coincidence.

[2020-04-05 13:06:53] train-model.INFO: Epoch 7 score=0.52481517908426 loss=0.042373131883474
[2020-04-05 13:06:53] train-model.INFO: Parameters restored from snapshot at epoch 4.
[2020-04-05 13:06:53] train-model.INFO: Training complete

@DivineOmega
Contributor Author

The change of dataset can definitely be ruled out, as training just completed on the original dataset after epoch 13.

[2020-04-05 14:05:29] train-model.INFO: Epoch 13 score=0.50379156418009 loss=0.028950132464314
[2020-04-05 14:05:29] train-model.INFO: Parameters restored from snapshot at epoch 3.
[2020-04-05 14:05:29] train-model.INFO: Training complete

@andrewdalpino
Member

andrewdalpino commented Apr 5, 2020

The number of training rounds that the algorithm executes should not matter; rather, I was looking at whether the loss was steadily decreasing and the MCC steadily increasing over time. In the log from your run that failed after epoch 3, we see a downward jump in the MCC at epoch 2; however, this may just be the algorithm escaping a local minimum, so no problems are apparent from the logs.

I'm going to need to do some more digging to find out what's really going on

Thanks for the extra info @DivineOmega it's very helpful

After the latest trials, roughly what is the training success rate?

One thing you can try in the meantime is decreasing the learning rate of the Adam optimizer. You can also try using the non-adaptive Stochastic optimizer to rule out issues due to momentum.
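
For example, relative to the reproduction snippet above, that means passing an explicit optimizer as the third constructor argument instead of null. A rough sketch (the namespaces are the library's optimizer classes as I remember them, and the learning-rate values are only illustrative):

use Rubix\ML\NeuralNet\Optimizers\Adam;
use Rubix\ML\NeuralNet\Optimizers\Stochastic;

// Adam with a lower learning rate than the default ...
new MultilayerPerceptron([
    // ... hidden layers as before ...
], 100, new Adam(0.0001), 1e-4, 1000, 1e-4, 10, 0.1, null, new MCC());

// ... or the non-adaptive Stochastic optimizer, to rule out momentum.
new MultilayerPerceptron([
    // ... hidden layers as before ...
], 100, new Stochastic(0.001), 1e-4, 1000, 1e-4, 10, 0.1, null, new MCC());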

Also, feel free to join our chat https://t.me/RubixML

@andrewdalpino andrewdalpino self-assigned this Apr 5, 2020
@DivineOmega
Contributor Author

@andrewdalpino Today and yesterday, on datasets of that size and above, I've seen 16 failures of this type and only maybe 2 or 3 successes (so around an 11–16% success rate).

@DivineOmega
Contributor Author

DivineOmega commented Apr 6, 2020

@andrewdalpino Interestingly, after lowering the Adam optimiser learning rate by 10x, the training completed the first time without issue.

$ php artisan ml:train
[2020-04-05 22:27:14] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 22:27:18] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 22:27:22] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 22:27:24] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=FBeta
[2020-04-05 22:27:24] train-model.INFO: Training started
[2020-04-05 22:31:49] train-model.INFO: Epoch 1 score=0.66972334473112 loss=0.3231510276241
[2020-04-05 22:36:49] train-model.INFO: Epoch 2 score=0.72543492395247 loss=0.20104000225191
[2020-04-05 22:41:36] train-model.INFO: Epoch 3 score=0.74259303923404 loss=0.11534214676067
[2020-04-05 22:46:22] train-model.INFO: Epoch 4 score=0.75292705742216 loss=0.08167593375029
[2020-04-05 22:50:53] train-model.INFO: Epoch 5 score=0.78349036224602 loss=0.062994060070852
[2020-04-05 22:55:24] train-model.INFO: Epoch 6 score=0.78586685471343 loss=0.053025888192447
[2020-04-05 22:59:44] train-model.INFO: Epoch 7 score=0.7640015718736 loss=0.049605014056035
[2020-04-05 23:04:15] train-model.INFO: Epoch 8 score=0.75258053059604 loss=0.043536833530061
[2020-04-05 23:08:41] train-model.INFO: Epoch 9 score=0.76595700309472 loss=0.040312908446744
[2020-04-05 23:12:55] train-model.INFO: Epoch 10 score=0.76807362257247 loss=0.037231249873399
[2020-04-05 23:17:15] train-model.INFO: Epoch 11 score=0.75719922146547 loss=0.034125440250398
[2020-04-05 23:21:37] train-model.INFO: Epoch 12 score=0.75719922146547 loss=0.035133840655248
[2020-04-05 23:25:57] train-model.INFO: Epoch 13 score=0.76033834586466 loss=0.03846565249604
[2020-04-05 23:30:19] train-model.INFO: Epoch 14 score=0.77751091875429 loss=0.032385857444644
[2020-04-05 23:34:40] train-model.INFO: Epoch 15 score=0.7703448072442 loss=0.032105124285677
[2020-04-05 23:39:04] train-model.INFO: Epoch 16 score=0.7703448072442 loss=0.032299122346931
[2020-04-05 23:39:04] train-model.INFO: Parameters restored from snapshot at epoch 6.
[2020-04-05 23:39:04] train-model.INFO: Training complete

@DivineOmega
Contributor Author

@andrewdalpino Since lowering the Adam optimiser learning rate by 10x, the training has yet to fail once over around 5 or 6 training sessions.

This works around the problem for now, but does not solve the root cause. It might help to narrow it down, though?

@andrewdalpino
Member

andrewdalpino commented Apr 7, 2020

Summarizing what we talked about in chat ...

This error is caused by a chain of silent errors, starting with numerical under/overflow due to a learning rate that is too high for the user's particular dataset. As a result, the network produces NaN values at the output layer, which in turn produce a prediction of false when run through the argmax function. This false value is then silently converted (thanks, PHP) to the integer 0 when used as the key of the array entry that accumulates false positives in the MCC and FBeta metrics.
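
To illustrate the middle step, here is a toy sketch (not the library's actual implementation) of an argmax built on array_search, showing how NaN outputs silently turn into a prediction of false:

// Output activations after the network has blown up.
$activations = [NAN, NAN];

// NAN never compares equal to anything, so the search fails and returns
// false instead of a class index.
$prediction = array_search(max($activations), $activations);

var_dump($prediction); // bool(false), which is later coerced to array key 0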

The solution to this is to decrease the learning rate of the Gradient Descent optimizer to prevent the network from blowing up. To aid the user in identifying when the network has become unstable, we will catch NaN values before scoring the validation set and then throw an informative exception.
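
A minimal sketch of what such a guard could look like (illustrative only, not the actual patch):

// $activations is assumed here to be a flat array of output-layer values
// for a single sample, checked before it reaches the metric.
foreach ($activations as $activation) {
    if (is_nan($activation)) {
        throw new RuntimeException('Numerical instability detected: the network'
            . ' is outputting NaN values. Try lowering the learning rate.');
    }
}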

Here is a good article on exploding gradients and why decreasing the learning rate has the effect of stabilizing training: https://machinelearningmastery.com/exploding-gradients-in-neural-networks/

@andrewdalpino andrewdalpino added this to Backlog in Roadmap via automation Apr 8, 2020
@andrewdalpino andrewdalpino moved this from Backlog to In progress in Roadmap Apr 8, 2020
@andrewdalpino andrewdalpino moved this from In progress to Review in Roadmap Apr 11, 2020
@andrewdalpino andrewdalpino moved this from Review to Completed in Roadmap Apr 11, 2020
@andrewdalpino
Member

Thanks again for the great bug report @DivineOmega

You can test out the fix on the latest dev-master, or you can wait until the next release.
