
63 test results #78

Merged: cwmeijer merged 9 commits from 63-test-results into master on Mar 3, 2021
Conversation

@cwmeijer (Contributor) commented Feb 16, 2021

This PR adds an assert statement to every existing test aimed at experiments. The assert checks the result, for instance the loss or the rank. I didn't discuss this kind of assertion with anyone, so it's definitely worth looking at these specific assertions and checking whether you agree with what I did. I also had to change the experiments and scripts so that the results can be read from the tests.
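As a rough illustration of the kind of assertion being added (the function name run_basic_experiment and the result keys are assumed placeholders, not this repository's actual API):

```python
# Hypothetical sketch of a result-checking test in the spirit of this PR;
# run_basic_experiment and the result keys are illustrative placeholders.
def test_basic_experiment_result():
    result = run_basic_experiment(epochs=1)   # tiny run on minimal input data
    assert result['final_loss'] < 10.0        # sanity bound on the returned loss
    assert result['rank.10'] == 1             # degenerate rank score when there are at most 10 instances
```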

@cwmeijer cwmeijer marked this pull request as ready for review February 16, 2021 16:30
@cwmeijer (Contributor, Author) commented:

I'll have a look at the conflicts now, so I'll turn this into a draft again.

@cwmeijer cwmeijer marked this pull request as draft February 22, 2021 12:49
@cwmeijer cwmeijer marked this pull request as ready for review February 23, 2021 11:15
@cwmeijer cwmeijer requested a review from bhigy February 23, 2021 11:15
@@ -170,6 +171,9 @@ def val_loss(net):
     result["validation loss"] = validation_loss
     wandb.log(result)
 
+    # Return loss of the final model for automated testing
+    return {'final_loss': loss_value}
Contributor:

Is there a reason why we return different criteria (validation_loss, final_loss) for asr and basic experiments? I would make this consistent by returning the same criteria or both.

Contributor:

Or maybe better, do as in (e.g.) mtl.py and save and return all intermediate scores.

Contributor (Author):

I'll have a look at whether it can always return the result dict (which is written to JSON anyway).

@bhigy (Contributor) commented Feb 24, 2021

You might have to save the results at each epoch.

Little reminder: please also check experiments/flickr8k/pip_ind.py.
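
For reference, a minimal sketch of saving the results at each epoch and returning them all, in the spirit of this suggestion; train_one_epoch and compute_validation_metrics are hypothetical helpers, not this repository's functions:

```python
import json

def run_experiment(net, data, epochs=1, results_path="results.json"):
    """Sketch only: collect one result dict per epoch, write them to JSON,
    and return them all so tests can assert on any intermediate score."""
    results = []
    for epoch in range(1, epochs + 1):
        step_loss = train_one_epoch(net, data)                 # hypothetical helper
        result = {"epoch": epoch, "step_loss": step_loss}
        result.update(compute_validation_metrics(net, data))   # hypothetical helper
        results.append(result)
        with open(results_path, "w") as fp:                    # results are written to JSON as well
            json.dump(results, fp, indent=2)
    return results
```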

Commit message: "The challenge here is to have the experiments return sensible results for the user while also being meaningful to test against (sensitive to code/logic changes). This commit manages to do this for most tests. refs #63"
@cwmeijer cwmeijer requested a review from bhigy March 2, 2021 13:56
@bhigy (Contributor) left a comment

Looks pretty good to me, but I am wondering whether it makes sense to have the step_loss in the results. That is only the loss for one training step, right? The rest of the results are computed on the full validation set, which makes it a bit confusing to me.

@cwmeijer (Contributor, Author) commented Mar 2, 2021

All experiment functions now return sensible results that can be useful to the user. These consist of performance metrics for every step. Because the tests usually perform only a single update step with minimal input data, the performance metrics are often trivial. For instance, rank.10 is always 1 because the tests contain no more than 10 instances. I therefore added a training loss to the results so that they include a measure that is very sensitive to logic/code changes. Because different machines produced slightly different rounding, I had to use an approximate check instead of an exact one, which is why I chose to include pandas in the test environment.
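
A sketch of what such an approximate check could look like with pandas' testing helpers (assumed usage; the helper name and tolerance are illustrative, not the repository's actual test code):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def assert_results_close(actual, expected, rtol=1e-4):
    """Illustrative helper: compare two result dicts approximately, since exact
    float equality can differ slightly between machines due to rounding."""
    assert_frame_equal(pd.DataFrame([actual]),
                       pd.DataFrame([expected]),
                       check_exact=False,   # compare with a tolerance instead of exact equality
                       check_like=True,     # ignore column order of the result dicts
                       rtol=rtol)
```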

@cwmeijer (Contributor, Author) commented Mar 2, 2021

The step_loss is the training loss at that time step, so it is just another performance metric for the current step. It is not the most useful performance metric ever, but it is somewhat informative for a user and, of course, useful for the test. If you don't agree, we can look for other solutions.

@bhigy (Contributor) commented Mar 2, 2021

No, I agree. I didn't really think about the degenerate scores we get when testing with just one batch. With this in mind, it makes sense to add the last training loss.

Is there a specific reason why you don't add it in asr.py though?

@bhigy (Contributor) commented Mar 2, 2021

Checking the code again, I think step_loss is a bit misleading. It is actually the mean training loss over the whole epoch, right?
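
For clarity, under that reading the value would be accumulated roughly as follows (a sketch with placeholder names, not the repository's code):

```python
# Assumed illustration: a per-epoch mean training loss, which is what
# "step_loss" appears to report according to this reading.
result = {}
running_loss, n_batches = 0.0, 0
for batch in train_loader:                   # train_loader is a placeholder
    loss_value = training_step(net, batch)   # hypothetical helper returning a float
    running_loss += loss_value
    n_batches += 1
result["step_loss"] = running_loss / n_batches   # mean over the whole epoch
```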

@cwmeijer cwmeijer merged commit 10c1c0f into master Mar 3, 2021
@cwmeijer cwmeijer deleted the 63-test-results branch March 3, 2021 15:27
@egpbos egpbos mentioned this pull request Apr 27, 2021