63 test results #78
Conversation
I'll have a look at the conflicts now, so I'll turn this into a draft again.
platalea/basic.py
```
@@ -170,6 +171,9 @@ def val_loss(net):
result["validation loss"] = validation_loss
wandb.log(result)

# Return loss of the final model for automated testing
return {'final_loss': loss_value}
```
Is there a reason why we return different criteria (`validation_loss`, `final_loss`) for the asr and basic experiments? I would make this consistent by returning the same criterion, or both.
Or maybe better, do as in e.g. `mtl.py` and save and return all intermediate scores.
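For reference, a minimal sketch of that pattern (collect the per-epoch result dict, persist it, and return everything at the end); the helpers `train_one_epoch`, `val_loss`, and `save_json` and the config keys are placeholders, not the actual platalea API:

```python
def experiment(net, data, config):
    """Hypothetical training loop that keeps all intermediate scores."""
    results = []
    for epoch in range(1, config['epochs'] + 1):
        loss_value = train_one_epoch(net, data['train'])  # placeholder helper
        result = {
            'epoch': epoch,
            'step_loss': loss_value,           # training loss of this step
            'validation loss': val_loss(net),  # placeholder helper
        }
        results.append(result)
        save_json(results, 'result.json')      # persisted every epoch
    return results                             # all intermediate scores
```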
I'll have a look at whether it can always return the result dict (which is written to JSON anyway).
You might have to save the results at each epoch.
Little reminder: please also check `experiments/flickr8k/pip_ind.py`.
The challenge here is to have the experiments return results that are sensible for the user while also being meaningful to test against (i.e. sensitive to code/logic changes). This commit manages to do this for most tests. refs #63
Looks pretty good to me, but I am wondering whether it makes sense to have the `step_loss` in the results. That is only the loss for one training step, right? The rest of the results are computed on the full validation set, which makes it a bit confusing to me.
All experiment functions now return sensible results that can be useful to the user. These consist of performance metrics for every step. Because the tests usually perform only a single update step, with minimal input data, the performance metrics are often trivial. For instance, rank.10 is always 1 because the tests use no more than 10 instances. I therefore added a training loss to the results, so that they include a measure that is very sensitive to logic/code changes. Because different machines produced slightly different roundings, I had to use an approximate check instead of an exact one, which is why I chose to include pandas in the test environment.
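For illustration only (not the literal test code in this PR), an approximate comparison with pandas could look like the following; `run_basic_experiment`, the result keys, and the expected values are made up:

```python
import pandas as pd

def test_basic_experiment_results():
    # Run the experiment on the tiny test data (placeholder entry point).
    result = run_basic_experiment(epochs=1)
    # Made-up reference values: rank.10 is trivially 1 with <= 10 instances,
    # step_loss is the metric that is sensitive to code/logic changes.
    expected = pd.Series({'rank.10': 1.0, 'step_loss': 7.42})
    actual = pd.Series(result, dtype=float)[expected.index]
    # Approximate comparison, since exact floats differ slightly per machine.
    pd.testing.assert_series_equal(actual, expected, check_exact=False, rtol=1e-3)
```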
The `step_loss` is the training loss at that time step, so it is just another performance metric about the current step. It is not the most useful performance metric ever, but it is somewhat informative for a user, and of course useful for the test. If you don't agree, we can look for other solutions.
No, I agree. I didn't really think about the degenerate scores we get when testing with just one batch. With this, it makes sense to add the last training loss. Is there a specific reason why you don't add it in …
Checking the code again, I think …
This PR adds an assert statement to every existing test aimed at experiments. The assert checks the result, for instance the loss or the rank. I didn't discuss this kind of assertion with anyone, so it's definitely worth looking at these specific assertions and seeing if you agree with what I did. I also had to change the experiments and scripts in order to be able to read the results from the tests.
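As a concrete picture of the kind of assertion described above (the function name, result keys, and values below are illustrative, not the exact ones in the repository):

```python
def test_experiment_result():
    # Hypothetical shape of the added checks: run the experiment with the tiny
    # test configuration and assert on the returned results instead of only
    # checking that the run completes.
    results = run_experiment(tiny_test_config)   # placeholder entry point
    last = results[-1]                           # metrics of the final step
    assert last['rank.10'] == 1                  # trivial with the tiny test set
    assert abs(last['step_loss'] - 7.42) < 1e-2  # made-up reference loss value
```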