possible for .uc files to be relabel_sha1-aware? #129
Comments
This is not possible yet, but could easily be added. Will look into it soon. |
Great, this would be really helpful. |
Do you want only the header of the representative sequences to be replaced by their sha1 hash in the uc file? If the identifiers of all sequences in the uc file are replaced by their sha1 hash, all sequences in the same cluster will get identical identifiers since their sequences are identical, making them somewhat meaningless. |
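The point about identical identifiers can be illustrated with a small sketch. It assumes the label is the SHA1 hex digest of the sequence after upper-casing and replacing U with T; that normalization is an assumption about how `--relabel_sha1` behaves, and the read names and sequences are made up for illustration.

```python
import hashlib


def sha1_label(sequence: str) -> str:
    """Return a sha1-based label for a sequence.

    Assumption: the sequence is upper-cased and U is replaced by T
    before hashing; check the vsearch manual for the exact rule.
    """
    normalized = sequence.upper().replace("U", "T")
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()


# Hypothetical reads: readA and readB are the same sequence, readC differs.
reads = {
    "readA": "ACGTACGT",
    "readB": "acgtacgt",
    "readC": "ACGTACGA",
}
labels = {name: sha1_label(seq) for name, seq in reads.items()}

# Every member of a dereplicated cluster hashes to the same label,
# so the label identifies the cluster, not the individual read.
print(labels["readA"] == labels["readB"])
print(labels["readA"] == labels["readC"])
```

Because the label depends only on the sequence, replacing every identifier in the uc file would erase the per-read (and per-sample) information.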
Ah, good point, hadn't thought of that... My use case is that I want to dereplicate sequences that were demultiplexed with QIIME, and build a BIOM table. For that purpose, I need to define OTU/observation ids, and the sha1 of the sequences is perfect for that as it would allow merging of tables that are created in different runs of vsearch. But, I'm realizing there is a problem with this: I need the sample identifiers (which are part of the input sequence identifiers in the QIIME demultiplexed sequence files) when I build the BIOM table so relabeling these in the uc file wouldn't work. The input file I'm working with looks like:
Where, for example, Could the original sequence identifiers be included in the fasta comment field when passing
to this:
(not sure if those sha1 values match up to those exact sequences, just created the example). Then I can parse the fasta file to build a map of the old to new identifiers from the fasta file to use when creating my BIOM table. |
Hello @gregcaporaso
I would describe that as 'adding a label.' To me, 're-labeling' implies changing the names of the reads because each read represents something different.
These methods all end with a fasta file. In order to build an abundance table / BIOM file, remap your original reads to this fasta file. For example:
Does this do what you want it to do? Edit: remove example of OTUs. |
No, I'm not looking to do OTU clustering. Instead, I want the dereplicated sequences to represent my OTUs. I know how to do this, but including the hashes with the vsearch output that I'm already generating would speed up the process. |
Ah ok. The example code I provided does this. |
Yes, but it requires two runs of vsearch, and the second one is going to be relatively slow. I have all of the information that I need from the |
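A minimal sketch of building per-sample counts from dereplication output alone, without a second search. It assumes the tab-separated uc layout with the record type in column 1, the query label in column 9 and the target (seed) label in column 10, and it assumes QIIME-style read identifiers of the form `SampleID_number`. The records below are invented for illustration, and `sample_of` and `count_table` are hypothetical helper names.

```python
import csv
from collections import Counter, defaultdict


def sample_of(label: str) -> str:
    """Extract the sample id, assuming labels look like 'SampleID_42 ...'."""
    read_id = label.split()[0]
    return read_id.rsplit("_", 1)[0]


def count_table(uc_lines):
    """Map each cluster seed to a Counter of reads per sample."""
    table = defaultdict(Counter)
    for row in csv.reader(uc_lines, delimiter="\t"):
        if not row or row[0] not in ("S", "H"):
            continue  # skip cluster summary (C) and any other records
        query, target = row[8], row[9]
        # An S record is the seed itself; an H record points at its seed.
        seed = query if row[0] == "S" else target
        table[seed.split()[0]][sample_of(query)] += 1
    return table


# Invented uc records: one cluster with three reads, one singleton.
uc_lines = [
    "S\t0\t8\t*\t*\t*\t*\t*\tS1_1\t*",
    "H\t0\t8\t100.0\t*\t0\t0\t*\tS2_5\tS1_1",
    "H\t0\t8\t100.0\t*\t0\t0\t*\tS1_2\tS1_1",
    "S\t1\t8\t*\t*\t*\t*\t*\tS2_6\t*",
    "C\t0\t3\t*\t*\t*\t*\t*\tS1_1\t*",
    "C\t1\t1\t*\t*\t*\t*\t*\tS2_6\t*",
]
table = count_table(uc_lines)
for seed, counts in table.items():
    print(seed, dict(counts))
```

The table is keyed by the seed's original identifier, which is why a way to translate that identifier into its sha1 label is still needed before tables from different runs can be merged.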
Point taken. Remapping with global alignments is very slow. Remapping with What if the
What if the new sha1 label was listed in an additional column? Like this:
You can then pull those hashes from the Edit: I'm not sure I like this option because it breaks compatibility with USEARCH. There has got to be a better option, and I think it may be remapping with |
@colinbrislawn I'll try to add the search_exact command soon. Thanks for the suggestion. |
What if we always include the full previous FASTA header after a space after the new identifier (sha1/md5/prefix+number) when relabelling? Example: Before relabelling:
After relabelling with sha1:
I think this will be quite general and should break few scripts. We could introduce another option, e.g. |
@torognes, that solution works. And the resulting file is still valid fasta, so no (correct) fasta parsers should break. Thanks for the quick responses on this. vsearch is awesome! @colinbrislawn, your solution would result in the file not being .uc anymore like you mention. While the content would be what we need, there's a lot of benefit to sticking with a file format that other tools are already using. (As a side note, my initial thought was that an adapted uc file could just add the seed's hash in the last existing column for the |
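If relabelled headers take the form proposed above (new identifier, a space, then the full previous header), the map from new to old identifiers can be read directly from the FASTA file. The sketch below assumes exactly that layout. The example record is made up, with its hash computed in place instead of copied from real output, and `header_map` is a hypothetical helper name.

```python
import hashlib


def header_map(fasta_lines):
    """Map each new identifier to the previous header kept after the first space."""
    mapping = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line
        new_id, _, old_header = line[1:].rstrip("\n").partition(" ")
        mapping[new_id] = old_header
    return mapping


# Invented record: a sha1 label followed by a QIIME-style original header.
sequence = "ACGTACGT"
digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()
fasta_lines = [
    f">{digest} S1_1 orig_bc=ACGT\n",
    f"{sequence}\n",
]

old_by_new = header_map(fasta_lines)
print(old_by_new[digest])
```

Inverting this map on the first token of the old header gives the lookup from original seed identifiers to sha1 labels.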
@torognes I really like your idea of having @gregcaporaso I agree that preserving file formats is important, and really like Torbjørn's suggestion because it yields valid .fasta and .uc files. |
The Remember that there is also an |
Excellent, thanks so much! |
I'm using vsearch in dereplication mode. In my output fasta I would like to have the sequences relabeled based on their sha1, and I'd like to have a .uc file generated as well. This works with the following command:
However, the issue I'm running into is that the sequence identifiers in the .uc file are the original identifiers, not the relabeled (sha1) identifiers. Would it be possible for vsearch to write the relabeled identifiers instead of the original identifiers to the .uc file? (Or is it possible, and I'm just missing that option?)
Thanks!