
63 test results #78

Merged: cwmeijer merged 9 commits from 63-test-results into master on Mar 3, 2021
Conversation

@cwmeijer (Contributor) commented Feb 16, 2021

This PR adds an assert statement to every existing test aimed at experiments. The assert checks the result, for instance the loss or the rank. I didn't discuss this kind of assertion with anyone, so it's definitely worth looking at these specific assertions and checking whether you agree with what I did. I also had to change the experiments and scripts so that the results can be read from the tests.
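As a rough illustration of the kind of assertion being added (the function name run_basic_experiment and the result keys are assumed placeholders, not this repository's actual API):

```python
# Hypothetical sketch of a result-checking test in the spirit of this PR;
# run_basic_experiment and the result keys are illustrative placeholders.
def test_basic_experiment_result():
    result = run_basic_experiment(epochs=1)   # tiny run on minimal input data
    assert result['final_loss'] < 10.0        # sanity bound on the returned loss
    assert result['rank.10'] == 1             # degenerate rank score when there are at most 10 instances
```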

@cwmeijer cwmeijer marked this pull request as ready for review February 16, 2021 16:30
@cwmeijer (Contributor, Author) commented:

I'll have a look at the conflicts now, so I'll turn this into a draft again.

@cwmeijer cwmeijer marked this pull request as draft February 22, 2021 12:49
@cwmeijer cwmeijer marked this pull request as ready for review February 23, 2021 11:15
@cwmeijer cwmeijer requested a review from bhigy February 23, 2021 11:15
@@ -170,6 +171,9 @@ def val_loss(net):
     result["validation loss"] = validation_loss
     wandb.log(result)
 
+    # Return loss of the final model for automated testing
+    return {'final_loss': loss_value}
Contributor:

Is there a reason why we return different criteria (validation_loss, final_loss) for asr and basic experiments? I would make this consistent by returning the same criteria or both.

Contributor:

Or maybe better, do as in (e.g.) mtl.py and save and return all intermediate scores.

Contributor (Author):

I'll have a look at whether it can always return the result dict (which is written to JSON anyway).

@bhigy (Contributor) commented Feb 24, 2021

You might have to save the results at each epoch.

Little reminder: please also check experiments/flickr8k/pip_ind.py.
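
For reference, a minimal sketch of saving the results at each epoch and returning them all, in the spirit of this suggestion; train_one_epoch and compute_validation_metrics are hypothetical helpers, not this repository's functions:

```python
import json

def run_experiment(net, data, epochs=1, results_path="results.json"):
    """Sketch only: collect one result dict per epoch, write them to JSON,
    and return them all so tests can assert on any intermediate score."""
    results = []
    for epoch in range(1, epochs + 1):
        step_loss = train_one_epoch(net, data)                 # hypothetical helper
        result = {"epoch": epoch, "step_loss": step_loss}
        result.update(compute_validation_metrics(net, data))   # hypothetical helper
        results.append(result)
        with open(results_path, "w") as fp:                    # results are written to JSON as well
            json.dump(results, fp, indent=2)
    return results
```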

Commit message: "The challenge here is to have the experiments return sensible results for the user while also being meaningful to test against (sensitive to code/logic changes). This commit manages to do this for most tests. refs #63"
@cwmeijer cwmeijer requested a review from bhigy March 2, 2021 13:56
@bhigy (Contributor) left a comment

Looks pretty good to me, but I am wondering whether it makes sense to have the step_loss in the results. That is only the loss for one training step, right? The rest of the results are computed on the full validation set, which makes it a bit confusing to me.

@cwmeijer (Contributor, Author) commented Mar 2, 2021

All experiment functions now return sensible results that can be useful to the user. These consist of performance metrics for every step. Because the tests usually perform only a single update step with minimal input data, the performance metrics are often trivial. For instance, rank.10 is always 1 because the tests contain no more than 10 instances. I therefore added a training loss to the results so that they include a measure that is very sensitive to logic/code changes. Because different machines produced slightly different rounding, I had to use an approximate check instead of an exact one, which is why I chose to include pandas in the test environment.
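
A sketch of what such an approximate check could look like with pandas' testing helpers (assumed usage; the helper name and tolerance are illustrative, not the repository's actual test code):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def assert_results_close(actual, expected, rtol=1e-4):
    """Illustrative helper: compare two result dicts approximately, since exact
    float equality can differ slightly between machines due to rounding."""
    assert_frame_equal(pd.DataFrame([actual]),
                       pd.DataFrame([expected]),
                       check_exact=False,   # compare with a tolerance instead of exact equality
                       check_like=True,     # ignore column order of the result dicts
                       rtol=rtol)
```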

@cwmeijer (Contributor, Author) commented Mar 2, 2021

The step_loss is the training loss at that time step, so it is just another performance metric for the current step. It is not the most useful performance metric ever, but it is somewhat informative for a user and, of course, useful for the test. If you don't agree, we can look for other solutions.

@bhigy (Contributor) commented Mar 2, 2021

No, I agree. I didn't really think about the degenerate scores we get when testing with just one batch. With this in mind, it makes sense to add the last training loss.

Is there a specific reason why you don't add it in asr.py though?

@bhigy (Contributor) commented Mar 2, 2021

Checking the code again, I think step_loss is a bit misleading. It is actually the mean training loss over the whole epoch, right?
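
For clarity, under that reading the value would be accumulated roughly as follows (a sketch with placeholder names, not the repository's code):

```python
# Assumed illustration: a per-epoch mean training loss, which is what
# "step_loss" appears to report according to this reading.
result = {}
running_loss, n_batches = 0.0, 0
for batch in train_loader:                   # train_loader is a placeholder
    loss_value = training_step(net, batch)   # hypothetical helper returning a float
    running_loss += loss_value
    n_batches += 1
result["step_loss"] = running_loss / n_batches   # mean over the whole epoch
```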

@cwmeijer cwmeijer merged commit 10c1c0f into master Mar 3, 2021
@cwmeijer cwmeijer deleted the 63-test-results branch March 3, 2021 15:27
@egpbos egpbos mentioned this pull request Apr 27, 2021