Different results between shuffle --two-pass and without, resulting in duplicated sequences #364
Comments
Tagging @nick-youngblut in case you have any additional context or ideas! |
On second look, I think this might be the effect of a separate issue: without the |
Yes, it's the cause.
Yes. It keeps the
So you need to manually delete |
Maybe I should add a flag |
Ah, I see. I think that flag would be important to have: reusing the temp files means that we otherwise can't pipe shuffle outputs to splits, etc. perhaps |
Since the assemblies are often sorted by contig size, you can also try I'm not sure how to describe the way how
|
Thanks so much, that is good to know. I was trying to avoid having to check the size of the input file ahead of time, and to split into a variable number of files containing a set number of contigs (as opposed to a set number of files containing a variable number of contigs). So while |
I've added a flag to recreate
Please try the new binaries: |
Hello! This is a possible duplicate of #225, but I'm making a new issue because that one didn't mention a potential difference in the headers. Happy to make this a comment on the other if that's preferable.
Prerequisites
seqkit version
Describe your issue
describe the problem
I am attempting to split an assembly into a number of chunks for easier processing in parallel. Because the assemblies are often sorted by contig size, I shuffled the assembly prior to (1) removing short contigs and (2) splitting into 200-contig chunks. However, I noticed that some downstream tools are complaining about duplicated sequences in the resulting split files. I think I've traced this down to using the --two-pass mode.
provide a reproducible example
I have tried, but note the difficulty reproducing below
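One way to confirm the symptom described above, independent of seqkit itself, is to count record IDs across the split files. The following is a minimal sketch and not part of the original report. It assumes plain or gzipped FASTA input and treats the text up to the first whitespace in each header as the record ID; the function names `fasta_ids` and `duplicated_ids` are hypothetical helpers.

```python
import gzip
from collections import Counter


def fasta_ids(path):
    """Yield the ID (header text up to the first whitespace) of each FASTA record."""
    # Assumption: files ending in .gz are gzip-compressed, everything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                yield fields[0] if fields else ""


def duplicated_ids(paths):
    """Return {record_id: count} for every ID seen more than once across all files."""
    counts = Counter()
    for path in paths:
        counts.update(fasta_ids(path))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

Calling `duplicated_ids` on the list of split files returns an empty dict when every contig appears exactly once, and otherwise maps each repeated ID to the number of times it was seen.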
But it's difficult to reproduce!
Going off of issue #225, I tried moving this to a new directory and seqkit worked as expected! Something appears to be happening when writing the index file; I have attached both the index from the bad run and the one from the successful run; the latter has all the expected headers.
bad473.assembly.fasta.seqkit.fai.gz
good473.assembly.fasta.seqkit.fai.gz
473.assembly.fasta.gz
Strangely, it seems like the index for the "bad" run processed with two pass has more headers than are present in the fasta! Is it possible this is arising because the seqkit shuffle --two-pass or some other seqkit command is doing in-place modification of the input file?
I realize this is a strange issue; any insight would be appreciated!
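The mismatch between an index and its FASTA can be checked directly. Below is a minimal sketch, under the assumption that the index is a tab-separated, faidx-style file whose first column is the record name, and that comparing on the first whitespace-delimited token of each name is sufficient. The helper names are hypothetical, and the file names in the usage note are placeholders.

```python
import gzip


def _open_text(path):
    # Assumption: .gz suffix means gzip-compressed; otherwise plain text.
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def _first_token(text):
    fields = text.split()
    return fields[0] if fields else ""


def fasta_ids(path):
    """IDs of all FASTA records, in file order."""
    with _open_text(path) as handle:
        return [_first_token(line[1:]) for line in handle if line.startswith(">")]


def index_ids(path):
    """IDs taken from the first tab-separated column of each non-empty index line."""
    with _open_text(path) as handle:
        return [_first_token(line.split("\t", 1)[0]) for line in handle if line.strip()]


def compare_index_to_fasta(fasta_path, index_path):
    """Summarise which IDs appear only in the index or only in the FASTA."""
    from_fasta = fasta_ids(fasta_path)
    from_index = index_ids(index_path)
    return {
        "fasta_records": len(from_fasta),
        "index_entries": len(from_index),
        "index_only": sorted(set(from_index) - set(from_fasta)),
        "fasta_only": sorted(set(from_fasta) - set(from_index)),
    }
```

For example, `compare_index_to_fasta("assembly.fasta", "assembly.fasta.seqkit.fai")` would list under `index_only` any names that a stale index carries but the current FASTA does not.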