Increase coverage #57

Closed
egpbos opened this issue Jan 12, 2021 · 16 comments · Fixed by #101

Comments

egpbos (Collaborator) commented Jan 12, 2021

We should take a look at the coverage reports (now that they are fixed, see #19) and figure out why some parts of the code are not covered by the experiment runs on CI. For instance, it seems that encoders.py contains a lot of unused models. Are they still used in some other dependent package, or can we remove them?

Related to #41.

bhigy (Contributor) commented Feb 25, 2021

I checked encoders.py. A couple of the models (SpeechEncoderMultiConv and SpeechEncoderVGG) are probably not used, but they could be useful in the future if we want to experiment. I also realized we don't have an experiment covering the VQ architecture, so I will add one.

bhigy (Contributor) commented Feb 25, 2021

@egpbos or @cwmeijer, is there a way to run the tests locally, or should I resort to trial and error via pull requests?

bhigy (Contributor) commented Feb 26, 2021

I added the experiment with the VQ architecture in the 57-increase-coverage branch. I think we are good regarding encoders.py.

bhigy assigned egpbos and unassigned bhigy on Feb 26, 2021
egpbos (Collaborator, Author) commented Mar 1, 2021

> @egpbos or @cwmeijer, is there a way to run the tests locally, or should I resort to trial and error via pull requests?

Yes, inside the repo directory just run tox to run the test suite. Install it with pip install tox, conda, or whatever you like; it doesn't matter, tox handles everything else.
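Concretely (assuming you're in the root of a clone of the repo):

```shell
pip install tox   # or install tox via conda, pipx, etc.
tox               # runs the full test suite; tox takes care of the environments
```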

bhigy (Contributor) commented Mar 1, 2021

Anything we need to do regarding this issue?

egpbos (Collaborator, Author) commented Mar 1, 2021

The VQ test looks great! Let's make a PR for the branch; it looks mergeable to me and increases coverage by 2.36%, see https://app.codecov.io/gh/spokenlanguage/platalea/compare/master...57-increase-coverage/overview.

I went through the coverage report in detail and other ways to increase coverage would be:

  1. Experiments: add more tests to also cover a broader parameter space. Specifically (excluding paths that trigger warnings or errors, which would be a bit of overkill to cover imho):
    a. mtl_asr, pip_ind and pip_seq have an uncovered path triggered by if data['train'].dataset.is_slt()
    b. pip_ind and pip_seq can also be run again with input from previous models (as we did before in the command-line-based CI setup, but now we'd have to store such input files as testing artifacts in the repo, so that tox can reach them easily)
    c. We can increase transformer coverage by adding --trafo_dropout=0.1
  2. Remove (or add in experiments) the unused attention models LinearAttention and ScalarAttention.
  3. To test the logging of training and validation loss in all experiments, we could add a configuration parameter for how often those log functions should be triggered. Currently these are hard-coded to every 100th and every 400th step, respectively. By making them configurable and setting them to 1, we will cover them as well (a rough sketch of what I mean follows after this list).
  4. We don't cover the Librispeech dataset, because we don't run that experiment in the test suite. Is it possible/feasible to build a test version of that dataset, like we did with flickr1d?
  5. introspect.py is not covered at all; what is it for? Can it be removed?
  6. The adadelta optimizer is not covered; maybe add it with an option in some random test.
  7. rank_eval.cosine is not covered; can it be deleted?
  8. The noam and constant schedulers are not covered; also maybe add them with an option to some random test (or add a separate test to the suite).
  9. bleu_score and score_slt are not used. I suspect slt will be used if an SLT dataset is used, as mentioned in 1.a., correct? That leaves the BLEU score: is that still useful? If so, how do we trigger it?
  10. xer.py, wow... what is that? I never saw it before, but it seems hugely inefficient, with all the for-loops. In any case, a lot of it is also uncovered.
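Regarding point 3, here is a rough, self-contained sketch of the pattern I mean. This is not the actual platalea training loop, just an illustration of exposing the two intervals (currently hard-coded to 100 and 400) as parameters so tests can set them to 1:

```python
import torch

def train(model, batches, optimizer, loss_fn,
          train_log_interval=100, val_log_interval=400):
    # The two intervals mirror the values that are currently hard-coded;
    # making them parameters lets the tests set them to 1.
    for step, (x, y) in enumerate(batches, start=1):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if step % train_log_interval == 0:
            print(f"step {step}: train loss {loss.item():.4f}")
        if step % val_log_interval == 0:
            print(f"step {step}: validation pass would run (and be logged) here")

# Toy usage, with both intervals set to 1 the way a test would do it:
model = torch.nn.Linear(4, 1)
data = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
train(model, data, torch.optim.SGD(model.parameters(), lr=0.1),
      torch.nn.MSELoss(), train_log_interval=1, val_log_interval=1)
```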

egpbos (Collaborator, Author) commented Mar 1, 2021

My guess is that 3. will actually give the biggest boost and is also the most valuable to cover: all cost and validation paths are currently uncovered.

bhigy (Contributor) commented Mar 2, 2021

Here are my comments on each of the points above. In general, I am in favor of keeping the tests to a minimum and only for important parts, but let's discuss that during our next meeting.
  1.
    a. Running the same experiment twice seems a bit too much to me, but we can try to trigger alternative paths to cover some of the components you mention in the next points (e.g. keep pip_ind as is but trigger the SLT code for pip_seq).
    b. Same here. Seems a bit overkill to me.
    c. This seems reasonable.
  2. Same as above, feels like too much to me.
  3. Seems feasible.
  4. Not sure we should cover LibriSpeech, and if we do, do we want to generate small test sets for each dataset?
  5. introspect.py is used to extract hidden representations from the model to perform analyses on them. We should definitely keep that.
  6. See 1.a.
  7. Seems to be used by rank_eval.ranking.
  8. See 1.a.
  9. Both should be related to SLT.
  10. xer.py contains the metrics for ASR, so we should keep it. I am surprised that it is mostly not covered. I didn't try to optimize it, and that would be low priority.

egpbos (Collaborator, Author) commented Mar 2, 2021

OK great, agreed on the overkill parts; 100% is not a goal in itself. So, to summarize, here is a list of reasonable/feasible actions to increase coverage:

  • Trigger SLT path in pip_seq test (this should cover 1.a. and 9.).
  • Add --trafo_dropout=0.1 option to transformer test (covers 1.c.).
  • Add configuration parameter for at which step interval to log train and validation losses.
  • Set these now configurable train and validation loss intervals to 1 (or max number of steps in that test) in all tests (covers 3. and probably 7.).
  • Pick one test in which to use the adadelta optimizer (covers 6.).
  • Pick one test in which to use the noam scheduler and one test in which to use the constant scheduler (covers 8.). A rough sketch of both follows below.
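For the optimizer and scheduler items, something along these lines is what I have in mind. This is plain PyTorch for illustration, not the actual platalea configuration code, and the dictionary keys are only placeholders for whatever option names we end up with:

```python
import torch

model = torch.nn.Linear(4, 1)

# Let a config option select the optimizer, so one test can pick adadelta.
optimizers = {
    "adam": lambda params: torch.optim.Adam(params, lr=1e-3),
    "adadelta": lambda params: torch.optim.Adadelta(params),
}
optimizer = optimizers["adadelta"](model.parameters())

# Same idea for the scheduler: "constant" keeps the learning rate fixed, while
# a noam-style schedule (from "Attention is All You Need") can be written as a LambdaLR.
d_model, warmup = 512, 4000
schedulers = {
    "constant": lambda opt: torch.optim.lr_scheduler.LambdaLR(opt, lambda step: 1.0),
    "noam": lambda opt: torch.optim.lr_scheduler.LambdaLR(
        opt,
        lambda step: d_model ** -0.5 * min((step + 1) ** -0.5,
                                           (step + 1) * warmup ** -1.5)),
}
scheduler = schedulers["noam"](optimizer)

# A few dummy steps just to show the call order.
for _ in range(3):
    optimizer.step()
    scheduler.step()
```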

Two remaining questions:

  1. Should we just remove LinearAttention and ScalarAttention or are they still useful?
  2. introspect.py sounds like it should be in utils; should we move it there?

bhigy (Contributor) commented Mar 2, 2021

I'd keep the alternative attention mechanisms, which could serve in future experiments.
As for introspect.py, it is part of the architecture of some models, so it makes sense to me to keep it where it is.

egpbos (Collaborator, Author) commented May 20, 2021

@bhigy: For the SLT path, do we actually need a test dataset in a different language than English? Or does the actual data not matter for "smoke testing" (testing whether a run works at all)?

In the first case, we could add some Japanese sentences to flickr1d. Is the stuff you used for those experiments public data?

If it doesn't matter, we could just put some lorem ipsum there.

bhigy (Contributor) commented May 20, 2021

It doesn't matter if the captions are actually in a different language, but we need to set the language to 'jp' for is_slt() to be True and thus have some value for the raw_jp field in the metadata. The data can be fake, copied from the English (maybe simpler), or taken from one of the files /home/bjrhigy/corpora/flickr8k/dataset_multilingual*.json.
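Something as simple as this would do for faking it; apart from raw_jp and the 'jp' language tag mentioned above, the field names below are only assumptions about the flickr1d metadata layout, so adapt them to the real schema:

```python
import json

# Dummy stand-in for the flickr1d metadata; only "raw_jp" is taken from the
# real setup, the surrounding structure is assumed.
dataset = {
    "images": [
        {"sentences": [{"raw": "A dog runs through the grass."},
                       {"raw": "Two children play football."}]},
    ]
}

for image in dataset["images"]:
    for sentence in image["sentences"]:
        sentence["raw_jp"] = sentence["raw"]  # fake "Japanese" caption, copied from English

with open("dataset_multilingual_test.json", "w") as f:  # hypothetical file name
    json.dump(dataset, f, indent=2)
```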

egpbos (Collaborator, Author) commented May 26, 2021

Ok, the SLT path gave a nice +0.9% in the coverage. That was the last of the wish list!

One final question @bhigy: we now get these warnings from that pip_seq test with Japanese texts:

tests/test_experiments.py::test_pip_seq_experiment
  /home/runner/work/platalea/platalea/.tox/py38/lib/python3.8/site-packages/nltk/translate/bleu_score.py:516: UserWarning: 
  The hypothesis contains 0 counts of 2-gram overlaps.
  Therefore the BLEU score evaluates to 0, independently of
  how many N-gram overlaps of lower order it contains.
  Consider using lower n-gram order or use SmoothingFunction()
    warnings.warn(_msg)

tests/test_experiments.py::test_pip_seq_experiment
  /home/runner/work/platalea/platalea/.tox/py38/lib/python3.8/site-packages/nltk/translate/bleu_score.py:516: UserWarning: 
  The hypothesis contains 0 counts of 3-gram overlaps.
  Therefore the BLEU score evaluates to 0, independently of
  how many N-gram overlaps of lower order it contains.
  Consider using lower n-gram order or use SmoothingFunction()
    warnings.warn(_msg)

tests/test_experiments.py::test_pip_seq_experiment
  /home/runner/work/platalea/platalea/.tox/py38/lib/python3.8/site-packages/nltk/translate/bleu_score.py:516: UserWarning: 
  The hypothesis contains 0 counts of 4-gram overlaps.
  Therefore the BLEU score evaluates to 0, independently of
  how many N-gram overlaps of lower order it contains.
  Consider using lower n-gram order or use SmoothingFunction()
    warnings.warn(_msg)

What do these mean? Is there anything we can do about these? Or is it just because flickr1d is too small with only 50 sentences?

bhigy (Contributor) commented May 26, 2021

I think the size of the dataset is the issue. Could you quickly check what happens when you run it in normal conditions?

egpbos (Collaborator, Author) commented May 26, 2021

Turns out it's not the small dataset (the warnings also occur with the full dataset), but the small hidden layer size that I had set for the tests. Increasing that to 8 gets rid of the warnings, so I'll push that to the PR branch as well and then we can wrap it up.
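For future reference, the warning itself simply comes from nltk's BLEU implementation whenever the hypothesis shares no n-grams of a given order with the reference. Here is a made-up standalone reproduction (nothing to do with our data), plus the SmoothingFunction escape hatch the warning mentions:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy sentences: every unigram of the hypothesis appears in the reference,
# but there is no overlapping 2-gram, so nltk warns and the score is 0.
reference = [["a", "dog", "runs", "in", "the", "park"]]
hypothesis = ["dog", "park"]

print(sentence_bleu(reference, hypothesis))  # 0.0, with the UserWarning

# With smoothing, zero n-gram counts no longer collapse the whole score to 0.
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, hypothesis, smoothing_function=smooth))
```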

egpbos (Collaborator, Author) commented May 26, 2021

Our badge has now turned from red to orange, yay ;)
[Screenshot: the coverage badge, 2021-05-26 17:36]
