trim-*: discard unmatched #10

thermokarst · 2018-01-02T22:46:14Z

This cutadapt parameter controls whether unmatched reads are discarded. It would be pretty useful to wrap (and straightforward).

This recently came up on the forum.

thermokarst · 2018-01-30T21:44:20Z

Actually, this might not be quite so straightforward: the trim-* methods operate on SampleData[SequencesWithQuality] (and the paired variant), and these QIIME 2 types don't have any on-disk formats that allow for empty files.

cc @ebolyen

mikerobeson · 2018-02-06T19:44:50Z

@thermokarst Would --untrimmed-output untrimmed.fastq help in this case? It might be easier to add this flag and simply write to another output file, even if it ends up being empty, since --discard-untrimmed is only shorthand for --untrimmed-output /dev/null as per the guide. I do not know much about the QIIME 2 API, but I do use cutadapt often. :-)

thermokarst · 2018-02-06T19:55:07Z

Thanks @mikerobeson! I updated my comment above to clarify that I was discussing QIIME 2 types, specifically. Without going into too much detail, QIIME 2 handles a bunch of file validation when reading and saving files. Right now, it is not possible in QIIME 2 to save an artifact of type SampleData[SequencesWithQuality] that is completely empty (read: no data in it). @ebolyen has some clever ideas about how to make the QIIME 2 framework more capable of dealing with these types of cases, but we are still in the planning phases. I know on the surface this problem sounds trivial ("just don't save a file if it doesn't have any data!"), but the reality is a bit more complicated with respect to type theory. Stay tuned!

mikerobeson · 2018-02-06T20:02:24Z

Thanks for the update @thermokarst! I completely understand these sorts of issues. Thank you for looking into this. Cheers!

thermokarst · 2018-02-22T02:25:41Z

This recently came up on the forum.

thermokarst · 2019-03-27T02:45:52Z

Stewing on this problem a bit. One thing about the --discard-untrimmed flag: including it will completely change the resulting reads, because any unmatched reads will be "discarded" rather than being left as-is. I was originally concerned about "empty" output files, but now my concern is that we probably don't want to go changing this functionality underneath people's feet... Options:

1. New methods: trim_single_and_discard and trim_paired_and_discard
2. If we supported optional outputs, add a boolean parameter in the original trim_single and trim_paired methods
3. Just add this behavior into trim_single and trim_paired with no way of disabling or opting out. We could add some new methods to q2-demux (or somewhere else?), merge_single/merge_paired, which could have an overlap_method parameter (error_on_overlapping_sample for merging "datasets", concatenate for merging "matched samples"); these could allow you to concatenate the matched and unmatched reads.
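For illustration only, the merge_single idea from the last option might look roughly like the sketch below. It is not code from q2-demux: it assumes reads are held in plain `{sample_id: [reads]}` dictionaries rather than QIIME 2 artifacts, and the function body is a guess at what the two proposed `overlap_method` values would mean.

```python
from typing import Dict, List


def merge_single(matched: Dict[str, List[str]],
                 unmatched: Dict[str, List[str]],
                 overlap_method: str = 'error_on_overlapping_sample'
                 ) -> Dict[str, List[str]]:
    """Hypothetical merge of two {sample_id: [reads]} mappings."""
    allowed = ('error_on_overlapping_sample', 'concatenate')
    if overlap_method not in allowed:
        raise ValueError('Unknown overlap_method: %r' % overlap_method)

    # Samples that appear in both inputs.
    shared = set(matched) & set(unmatched)
    if shared and overlap_method == 'error_on_overlapping_sample':
        raise ValueError('Samples present in both inputs: %s'
                         % ', '.join(sorted(shared)))

    # Copy the first input, then append reads from the second per sample.
    merged = {sample: list(reads) for sample, reads in matched.items()}
    for sample, reads in unmatched.items():
        merged.setdefault(sample, []).extend(reads)
    return merged
```

With `overlap_method='concatenate'`, the matched and unmatched reads of a sample end up back in one list, which is the "opt out" path described in that option.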

Thoughts @mikerobeson, @ebolyen, @nbokulich?

nbokulich · 2019-03-27T15:25:40Z

> now my concern is that we probably don't want to go changing this functionality underneath people's feet...

Why not add the parameter to trim_* and set it to False by default? That way you are not changing the functionality under people's feet.
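A minimal sketch of that opt-in parameter, assuming the method ultimately shells out to cutadapt. The function name `build_trim_command` and its arguments are hypothetical; only the cutadapt flags (`--front`, `-o`, `--discard-untrimmed`) are real.

```python
from typing import List


def build_trim_command(in_fastq: str, out_fastq: str, front_adapter: str,
                       discard_untrimmed: bool = False) -> List[str]:
    """Build a cutadapt argument list (hypothetical helper)."""
    cmd = ['cutadapt', '--front', front_adapter, '-o', out_fastq]
    # Off by default, so existing callers keep their untrimmed reads.
    if discard_untrimmed:
        cmd.append('--discard-untrimmed')
    cmd.append(in_fastq)
    return cmd
```

Callers that never pass `discard_untrimmed` get exactly the command they got before, which is the point of defaulting to False.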

There is still the concern about having an empty output: just raise an error if the output will be empty (after all, users would be consciously specifying that they want to drop any non-matches, so it is better to wait and raise an error [== notification] than to produce an empty output that is not useful to them).

I do not like the other options as much for these reasons:

> New methods: trim_single_and_discard and trim_paired_and_discard

Method sprawl just confuses people, especially if the methods are largely redundant.

> If we supported optional outputs, add a boolean parameter in the original trim_single and trim_paired methods

The discarded output would be useless. I could see some cases where it may be useful (e.g., filtering out different marker genes that have primers attached, for those folks mixing ITS + 16S), but by and large this would not be useful. Still, I prefer this out of your three options.

> Just add this behavior into trim_single and trim_paired with no way of disabling or opting out.

Meh. This breaks the current use case where we use trim_* to remove adapters that may or may not be hanging off of a set of sequences. We don't want to break that functionality for the sake of supporting a slightly different but distinct use case.

I really think that we can kill 2 birds with one stone here. Your option 2 is most palatable to me, but why not just discard those reads and raise an error if the output is empty?
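As a rough sketch of "raise an error if the output is empty": the helpers below are hypothetical and assume the trimmed reads are written as one gzipped FASTQ file per sample, and that "empty" means every one of those files has no reads. Whether a single empty sample should also be an error is left open here.

```python
import gzip
from typing import Iterable


def has_reads(fastq_gz_path: str) -> bool:
    """Return True if the gzipped file decompresses to at least one byte."""
    with gzip.open(fastq_gz_path, 'rb') as fh:
        return fh.read(1) != b''


def check_not_all_empty(fastq_gz_paths: Iterable[str]) -> None:
    """Raise if none of the output files contain any data (assumed policy)."""
    if not any(has_reads(path) for path in fastq_gz_paths):
        raise ValueError('No reads remain after discarding untrimmed reads; '
                         'check the adapter sequences.')
```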

thermokarst · 2019-03-27T16:00:53Z

Yep, let's just discard outright. When the framework supports optional outputs we can start plopping those reads into their own artifact. 🎚️

mikerobeson · 2019-03-27T16:12:33Z

Hello everyone. I think @nbokulich hit all the points quite well and I largely agree with them. I'd also echo his question:

> ...why not just discard those reads and raise an error if the output is empty?

This must be addressed not only here: if I remember correctly, this "empty file" issue has come up before. Didn't this occur with the ITSxpress plugin? Or was that determined to be another issue? If not, I can still imagine several cases where this "empty file" issue can arise in both of these plugins.

Sometimes it can be helpful to have cutadapt write out the untrimmed sequences, i.e. with:

--untrimmed-output
--untrimmed-paired-output

This helps determine how many off-target reads, or which types of contamination, are present. At least my colleagues tend to ask me for the reads that did not make the "cut". Haha, see what I did there? 😆
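To make that concrete, here is a small sketch of how the untrimmed file could be used to estimate the off-target fraction. It assumes standard four-line FASTQ records and that cutadapt was run with both `-o trimmed.fastq` and `--untrimmed-output untrimmed.fastq`; the helper names and file names are only for illustration.

```python
import gzip


def count_fastq_records(path: str) -> int:
    """Count records, assuming exactly four lines per FASTQ record."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4


def untrimmed_fraction(trimmed_path: str, untrimmed_path: str) -> float:
    """Fraction of all reads that cutadapt left untrimmed."""
    trimmed = count_fastq_records(trimmed_path)
    untrimmed = count_fastq_records(untrimmed_path)
    total = trimmed + untrimmed
    # Avoid dividing by zero when both files are empty.
    return untrimmed / total if total else 0.0
```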

mikerobeson · 2019-03-27T16:15:54Z

Oops! Sorry @thermokarst I did not see your response prior to posting mine. :-)


thermokarst added the enhancement label Jan 2, 2018

thermokarst self-assigned this Jan 30, 2018

thermokarst mentioned this issue Mar 9, 2018

Expose --m (min output read length) parameter #17

Closed

thermokarst mentioned this issue Mar 27, 2019

ENH: Adds discard_untrimmed to trim methods #29

Merged

nbokulich closed this as completed in #29 Mar 27, 2019

nbokulich pushed a commit that referenced this issue Mar 27, 2019

ENH: Adds discard_untrimmed to trim methods (#29)

16819d7

* ENH: Adds `discard_untrimmed` to filter methods. Fixes #10
* SQUASH: addressing @nbokulich's demands
