Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trim-*: discard unmatched #10

Closed
thermokarst opened this issue Jan 2, 2018 · 10 comments
Closed

trim-*: discard unmatched #10

thermokarst opened this issue Jan 2, 2018 · 10 comments
Assignees

Comments

@thermokarst
Copy link
Contributor

This cutadapt parameter controls if unmatched reads should be discarded - this would be pretty useful to wrap (and straightforward).

This recently came up on the forum.

@thermokarst thermokarst self-assigned this Jan 30, 2018
@thermokarst
Copy link
Contributor Author

thermokarst commented Jan 30, 2018

Actually, this might not be quite so straightforward --- the trim-* methods operate on SampleData[SequencesWithQuality] (and the paired variant) --- these QIIME 2 types don't have any on-disk formats that allow for empty files.

cc @ebolyen

@mikerobeson
Copy link
Member

@thermokarst Would --untrimmed-output untrimmed.fastq help in this case? It might be easier to add this flag and simply write to another output file. Even if it ends up being empty. As --discard-untrimmed is only a shorthand for --untrimmed-output /dev/null as per the guide. I do not know much about the QIIME 2 API. But I do use cutadapt often. :-)

@thermokarst
Copy link
Contributor Author

Thanks @mikerobeson! I updated my comment above to clarify that I was discussing QIIME 2 types, specifically. Without going into too much detail, QIIME 2 handles a bunch of file validation when reading and saving files. Right now, it is not possible in QIIME 2 to save an artifact of type SampleData[SequencesWithQuality] that is completely empty (read: no data in it). @ebolyen has some clever ideas about how to make the QIIME 2 framework more capable of dealing with these types of cases, but we are still in the planning phases. I know on the surface this problem sounds trivial ("just don't save a file if it doesn't have any data!"), but the reality is a bit more complicated with respect to type theory. Stay tuned!

@mikerobeson
Copy link
Member

Thanks for the update @thermokarst! I completely understand these sorts of issues. Thank you for looking into this. Cheers!

@thermokarst
Copy link
Contributor Author

This recently came up on the forum.

@thermokarst
Copy link
Contributor Author

Stewing on this problem a bit. One thing about this --discard-unmatched flag ---the inclusion of this flag will completely change to resulting reads --- any unmatched reads will be "discarded," rather than being left as-is. I was originally concerned about "empty" output files, but now my concern is that we probably don't want to go changing this functionality underneath people's feet... Options:

  • New methods: trim_single_and_discard and trim_paired_and_discard
  • If we supported optional outputs, add a boolean parameter in the original trim_single and trim_paired methods
  • Just add this behavior into trim_single and trim_paired with no way of disabling or opting out. We could add some new methods to q2-demux (or somewhere else?) --- merge_single/merge_paired (this could have an overlap_method parameter: error_on_overlapping_sample - for merging "datasets", concatenate - for merging "matched samples"), these could allow you to concatenate the matched and unmatched reads.

Thoughts @mikerobeson, @ebolyen, @nbokulich?

@nbokulich
Copy link
Member

now my concern is that we probably don't want to go changing this functionality underneath people's feet...

Why not add the parameter to trim_* and set it to False by default? That way you are not changing the functionality under people's feet.

There still is the concern about having an empty output — just raise and error if the output will be empty (after all, users would be consciously specifying that they want to drop any non-matches, so better to wait and raise an error [== notification] than an empty output that is not useful to them).

I do not like the other options as much for these reasons:

New methods: trim_single_and_discard and trim_paired_and_discard

Method sprawl just confuses people especially if the methods are largely redundant.

If we supported optional outputs, add a boolean parameter in the original trim_single and trim_paired methods

The discarded output would be useless. I could see some cases where it may be useful (e.g., filter out different marker genes that have primers attached, e.g., for those folks mixing ITS + 16S), but by and large this would not be useful. Still, I prefer this out of your three options

Just add this behavior into trim_single and trim_paired with no way of disabling or opting out.

Meh. This breaks the current use case where we use trim_* to remove adapters that may or may not be hanging off of a set of sequences. We don't want to break that functionality for the sake of supported a slightly different but distinct use case.

I really think that we can kill 2 birds with one stone here. Your option 2 is most palatable to me, but why not just discard those reads and raise an error if the output is empty?

@thermokarst
Copy link
Contributor Author

Yep, let's just discard outright. When the framework supports optional outputs we can start plopping those reads into their own artifact. 🎚️

@mikerobeson
Copy link
Member

Hello everyone. I think @nbokulich hit all the points quite well and I largely agree with them. I'd also echo his question:

...why not just discard those reads and raise an error if the output is empty?

As this must not only be addressed here but, if I remember correctly, this "empty file" issue has come up before. Didn't this occur with the ITSxpress plugin? Or was that determined to be another issue? If not, I can still imagine several cases where this "empty file" issue can arise in both of these plugins.

Sometimes it can be helpful to have cutadapt to write out the untrimmed sequences, i.e.

  • --untrimmed-output
  • --untrimmed-paired-output.

This helps to determine how many off-targets or the types of contamination that can be encountered. At least my colleagues tend to ask me for the reads that did not make the "cut". Haha, see what I did there? 😆

@mikerobeson
Copy link
Member

Oops! Sorry @thermokarst I did not see your response prior to posting mine. :-)

nbokulich pushed a commit that referenced this issue Mar 27, 2019
* ENH: Adds `discard_untrimmed` to filter methods

Fixes #10

* SQUASH: addressing @nbokulich's demands
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants