Increase coverage #57

Closed
egpbos opened this issue Jan 12, 2021 · 16 comments · Fixed by #101

Comments

egpbos (Collaborator) commented Jan 12, 2021

We should take a look at the coverage reports (now that they are fixed, see #19) and figure out why some parts of the code are not covered by the experiment runs on CI. For instance, it seems that encoders.py contains a lot of unused models. Are they still used in some other dependent package, or can we remove them?

Related to #41.

bhigy (Contributor) commented Feb 25, 2021

I checked encoders.py. A couple of the models (SpeechEncoderMultiConv and SpeechEncoderVGG) are probably not used, but they could be useful in the future if we want to experiment. I also realized we don't have an experiment covering the VQ architecture, so I will add one.

bhigy (Contributor) commented Feb 25, 2021

@egpbos or @cwmeijer, is there a way to run the tests locally, or should I resort to trial and error via pull requests?

bhigy (Contributor) commented Feb 26, 2021

I added the experiment with the VQ architecture in the 57-increase-coverage branch. I think we are good regarding encoders.py.

bhigy assigned egpbos and unassigned bhigy on Feb 26, 2021
egpbos (Collaborator, Author) commented Mar 1, 2021

> @egpbos or @cwmeijer, is there a way to run the tests locally, or should I resort to trial and error via pull requests?

Yes, inside the repo directory just run tox to run the test suite. Install it with pip install tox, conda, or whatever you like; it doesn't matter, tox handles everything else.
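Concretely (assuming you're in the root of a clone of the repo):

```shell
pip install tox   # or install tox via conda, pipx, etc.
tox               # runs the full test suite; tox takes care of the environments
```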

bhigy (Contributor) commented Mar 1, 2021

Anything we need to do regarding this issue?

egpbos (Collaborator, Author) commented Mar 1, 2021

The VQ test looks great! Let's make a PR for the branch; it looks mergeable to me and increases coverage by 2.36%, see https://app.codecov.io/gh/spokenlanguage/platalea/compare/master...57-increase-coverage/overview.

I went through the coverage report in detail and other ways to increase coverage would be:

  1. Experiments: add more tests to also cover a broader parameter space. Specifically (excluding paths that trigger warnings or errors, which would be a bit of overkill to cover imho):
    a. mtl_asr, pip_ind and pip_seq have an uncovered path triggered by if data['train'].dataset.is_slt()
    b. pip_ind and pip_seq can also be run again with input from previous models (as we did before in the command-line-based CI setup, but now we'd have to store such input files as testing artifacts in the repo, so that tox can reach them easily)
    c. We can increase transformer coverage by adding --trafo_dropout=0.1
  2. Remove (or add in experiments) the unused attention models LinearAttention and ScalarAttention.
  3. To test the logging of training and validation loss in all experiments, we could add a configuration parameter for how often those log functions should be triggered. Currently these are hard-coded to every 100th and every 400th step, respectively. By making them configurable and setting them to 1, we will cover them as well (a rough sketch of what I mean follows after this list).
  4. We don't cover the Librispeech dataset, because we don't run that experiment in the test suite. Is it possible/feasible to build a test version of that dataset, like we did with flickr1d?
  5. introspect.py is not covered at all; what is it for? Can it be removed?
  6. The adadelta optimizer is not covered; maybe add it with an option in some random test.
  7. rank_eval.cosine is not covered; can it be deleted?
  8. The noam and constant schedulers are not covered; also maybe add them with an option to some random test (or add a separate test to the suite).
  9. bleu_score and score_slt are not used. I suspect slt will be used if an SLT dataset is used, as mentioned in 1.a., correct? That leaves the BLEU score: is that still useful? If so, how do we trigger it?
  10. xer.py, wow... what is that? I never saw it before, but it seems hugely inefficient, with all the for-loops. In any case, a lot of it is also uncovered.
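Regarding point 3, here is a rough, self-contained sketch of the pattern I mean. This is not the actual platalea training loop, just an illustration of exposing the two intervals (currently hard-coded to 100 and 400) as parameters so tests can set them to 1:

```python
import torch

def train(model, batches, optimizer, loss_fn,
          train_log_interval=100, val_log_interval=400):
    # The two intervals mirror the values that are currently hard-coded;
    # making them parameters lets the tests set them to 1.
    for step, (x, y) in enumerate(batches, start=1):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if step % train_log_interval == 0:
            print(f"step {step}: train loss {loss.item():.4f}")
        if step % val_log_interval == 0:
            print(f"step {step}: validation pass would run (and be logged) here")

# Toy usage, with both intervals set to 1 the way a test would do it:
model = torch.nn.Linear(4, 1)
data = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
train(model, data, torch.optim.SGD(model.parameters(), lr=0.1),
      torch.nn.MSELoss(), train_log_interval=1, val_log_interval=1)
```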

egpbos (Collaborator, Author) commented Mar 1, 2021

My guess is that 3. will actually give the biggest boost and is also the most valuable to cover: all cost and validation paths are currently uncovered.

bhigy (Contributor) commented Mar 2, 2021

Here are my comments on each of the points above. In general, I am in favor of keeping the tests to a minimum and only for important parts, but let's discuss that during our next meeting.
  1.
    a. Running the same experiment twice seems a bit too much to me, but we can try to trigger alternative paths to cover some of the components you mention in the next points (e.g. keep pip_ind as is but trigger the SLT code for pip_seq).
    b. Same here. Seems a bit overkill to me.
    c. This seems reasonable.
  2. Same as above, feels like too much to me.
  3. Seems feasible.
  4. Not sure we should cover LibriSpeech, and if we do, do we want to generate small test sets for each dataset?
  5. introspect.py is used to extract hidden representations from the model to perform analyses on them. We should definitely keep that.
  6. See 1.a.
  7. Seems to be used by rank_eval.ranking.
  8. See 1.a.
  9. Both should be related to SLT.
  10. xer.py contains the metrics for ASR, so we should keep it. I am surprised that it is mostly not covered. I didn't try to optimize it, and that would be low priority.

egpbos (Collaborator, Author) commented Mar 2, 2021

OK great, agreed on the overkill parts; 100% is not a goal in itself. So, to summarize, here is a list of reasonable/feasible actions to increase coverage:

  • Trigger SLT path in pip_seq test (this should cover 1.a. and 9.).
  • Add --trafo_dropout=0.1 option to transformer test (covers 1.c.).
  • Add configuration parameter for at which step interval to log train and validation losses.
  • Set these now configurable train and validation loss intervals to 1 (or max number of steps in that test) in all tests (covers 3. and probably 7.).
  • Pick one test in which to use the adadelta optimizer (covers 6.).
  • Pick one test in which to use the noam scheduler and one test in which to use the constant scheduler (covers 8.). A rough sketch of both follows below.
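For the optimizer and scheduler items, something along these lines is what I have in mind. This is plain PyTorch for illustration, not the actual platalea configuration code, and the dictionary keys are only placeholders for whatever option names we end up with:

```python
import torch

model = torch.nn.Linear(4, 1)

# Let a config option select the optimizer, so one test can pick adadelta.
optimizers = {
    "adam": lambda params: torch.optim.Adam(params, lr=1e-3),
    "adadelta": lambda params: torch.optim.Adadelta(params),
}
optimizer = optimizers["adadelta"](model.parameters())

# Same idea for the scheduler: "constant" keeps the learning rate fixed, while
# a noam-style schedule (from "Attention is All You Need") can be written as a LambdaLR.
d_model, warmup = 512, 4000
schedulers = {
    "constant": lambda opt: torch.optim.lr_scheduler.LambdaLR(opt, lambda step: 1.0),
    "noam": lambda opt: torch.optim.lr_scheduler.LambdaLR(
        opt,
        lambda step: d_model ** -0.5 * min((step + 1) ** -0.5,
                                           (step + 1) * warmup ** -1.5)),
}
scheduler = schedulers["noam"](optimizer)

# A few dummy steps just to show the call order.
for _ in range(3):
    optimizer.step()
    scheduler.step()
```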

Two remaining questions:

  1. Should we just remove LinearAttention and ScalarAttention or are they still useful?
  2. introspect.py sounds like it should be in utils; should we move it there?

bhigy (Contributor) commented Mar 2, 2021

I'd keep the alternative attention mechanisms, which could serve in future experiments.
As for introspect.py, it is part of the architecture of some models, so it makes sense to me to keep it where it is.

egpbos (Collaborator, Author) commented May 20, 2021

@bhigy: For the SLT path, do we actually need a test dataset in a different language than English? Or does the actual data not matter for "smoke testing" (testing whether a run works at all)?

In the first case, we could add some Japanese sentences to flickr1d. Is the stuff you used for those experiments public data?

If it doesn't matter, we could just put some lorem ipsum there.

bhigy (Contributor) commented May 20, 2021

It doesn't matter if the captions are actually in a different language, but we need to set the language to 'jp' for is_slt() to be True and thus have some value for the raw_jp field in the metadata. The data can be fake, copied from the English (maybe simpler), or taken from one of the files /home/bjrhigy/corpora/flickr8k/dataset_multilingual*.json.
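Something as simple as this would do for faking it; apart from raw_jp and the 'jp' language tag mentioned above, the field names below are only assumptions about the flickr1d metadata layout, so adapt them to the real schema:

```python
import json

# Dummy stand-in for the flickr1d metadata; only "raw_jp" is taken from the
# real setup, the surrounding structure is assumed.
dataset = {
    "images": [
        {"sentences": [{"raw": "A dog runs through the grass."},
                       {"raw": "Two children play football."}]},
    ]
}

for image in dataset["images"]:
    for sentence in image["sentences"]:
        sentence["raw_jp"] = sentence["raw"]  # fake "Japanese" caption, copied from English

with open("dataset_multilingual_test.json", "w") as f:  # hypothetical file name
    json.dump(dataset, f, indent=2)
```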

egpbos (Collaborator, Author) commented May 26, 2021

Ok, the SLT path gave a nice +0.9% in the coverage. That was the last of the wish list!

One final question @bhigy: we now get these warnings from that pip_seq test with Japanese texts:

tests/test_experiments.py::test_pip_seq_experiment
  /home/runner/work/platalea/platalea/.tox/py38/lib/python3.8/site-packages/nltk/translate/bleu_score.py:516: UserWarning: 
  The hypothesis contains 0 counts of 2-gram overlaps.
  Therefore the BLEU score evaluates to 0, independently of
  how many N-gram overlaps of lower order it contains.
  Consider using lower n-gram order or use SmoothingFunction()
    warnings.warn(_msg)

tests/test_experiments.py::test_pip_seq_experiment
  /home/runner/work/platalea/platalea/.tox/py38/lib/python3.8/site-packages/nltk/translate/bleu_score.py:516: UserWarning: 
  The hypothesis contains 0 counts of 3-gram overlaps.
  Therefore the BLEU score evaluates to 0, independently of
  how many N-gram overlaps of lower order it contains.
  Consider using lower n-gram order or use SmoothingFunction()
    warnings.warn(_msg)

tests/test_experiments.py::test_pip_seq_experiment
  /home/runner/work/platalea/platalea/.tox/py38/lib/python3.8/site-packages/nltk/translate/bleu_score.py:516: UserWarning: 
  The hypothesis contains 0 counts of 4-gram overlaps.
  Therefore the BLEU score evaluates to 0, independently of
  how many N-gram overlaps of lower order it contains.
  Consider using lower n-gram order or use SmoothingFunction()
    warnings.warn(_msg)

What do these mean? Is there anything we can do about these? Or is it just because flickr1d is too small with only 50 sentences?

bhigy (Contributor) commented May 26, 2021

I think the size of the dataset is the issue. Could you quickly check what happens when you run it in normal conditions?

egpbos (Collaborator, Author) commented May 26, 2021

Turns out it's not the small dataset (the warnings also occur with the full dataset), but the small hidden layer size that I had set for the tests. Increasing that to 8 gets rid of the warnings, so I'll push that to the PR branch as well and then we can wrap it up.
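For future reference, the warning itself simply comes from nltk's BLEU implementation whenever the hypothesis shares no n-grams of a given order with the reference. Here is a made-up standalone reproduction (nothing to do with our data), plus the SmoothingFunction escape hatch the warning mentions:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy sentences: every unigram of the hypothesis appears in the reference,
# but there is no overlapping 2-gram, so nltk warns and the score is 0.
reference = [["a", "dog", "runs", "in", "the", "park"]]
hypothesis = ["dog", "park"]

print(sentence_bleu(reference, hypothesis))  # 0.0, with the UserWarning

# With smoothing, zero n-gram counts no longer collapse the whole score to 0.
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, hypothesis, smoothing_function=smooth))
```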

egpbos (Collaborator, Author) commented May 26, 2021

Our badge has now turned from red to orange, yay ;)
[Screenshot: the coverage badge, 2021-05-26 17:36]
