Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different results between shuffle --two-pass and without, resulting in duplicated sequences #364

Closed
4 tasks done
nickp60 opened this issue Feb 9, 2023 · 8 comments
Closed
4 tasks done

Comments

@nickp60
Copy link

nickp60 commented Feb 9, 2023

Hello! This is a possible duplicate of #225, but I'm making a new issue because that one didn't mention a potential difference in the headers. Happy to make this a comment on the other if thats preferable.

Prerequisites

  • make sure you're are using the latest version by seqkit version
  • v2.3.1 on linux via conda
  • read the usage

Describe your issue

  • describe the problem
    I am attempting to split an assembly into a number of chunks for easier processing in parrallel. Because the assemblies are often sorted by contig size, I shuffled the assembly prior to (1) removing short contigs and (2) splitting into 200 contig chunks. However, I noticed that some downstream tools are complaining about duplicated sequences in the resulting split files. I think I've traced this down to using the --two-pass mode.

  • provide a reproducible example
    I have tried, but note the difficulty reproducing below

# expected
$ seqkit stats tmpall_small/473.assembly.fasta
file                             format  type  num_seqs    sum_len  min_len  avg_len  max_len
tmpall_small/473.assembly.fasta  FASTA   DNA      2,122  5,192,331       56  2,446.9   36,157

# shuffled, expected results
$ seqkit shuffle tmpall_small/473.assembly.fasta  | seqkit stats
[INFO] read sequences ...
[INFO] 2122 sequences loaded
[INFO] shuffle ...
[INFO] output ...
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   DNA      2,122  5,192,331       56  2,446.9   36,157

# something strange in 2pass mode
$ seqkit shuffle 473.assembly.fasta --two-pass | seqkit stats
[INFO] create and read FASTA index ...
[INFO] read sequence IDs from FASTA index ...
[INFO] 2182 sequences loaded
[INFO] shuffle ...
[INFO] output ...
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   DNA      4,233  5,121,864        0    1,210   36,157

But, its difficult to reproduce!

Going off of issue #225 , I tried moving this to a new directory and seqkit worked as expected! Something appears to be happening when writing the index file; I have attached both the index form the bad run and the one from the successful run; the latter has all the expected headers.

bad473.assembly.fasta.seqkit.fai.gz

good473.assembly.fasta.seqkit.fai.gz

473.assembly.fasta.gz

Strangely, it seems like the index for the "bad" run processed with two pass has more headers than are present in the fasta! Is it possible this is arising because the seqkit shuffle --two-pass or some other seqkit command is doing in-place modification of the input file?

I realize this is a strange issue; any insight would be appreciated!

@nickp60
Copy link
Author

nickp60 commented Feb 9, 2023

Tagging @nick-youngblut in case you have any additional context or ideas!

@nickp60
Copy link
Author

nickp60 commented Feb 9, 2023

On second look, I think this might be the effect of a separate issue: without the --keep-temp option, shouldn't the .seqkit.fai files be deleted? I'm noticing the timestamp on the "bad" .seqkit.fai predates my assembly, so potentially there is an issue arising from (a) seqkit shuffle re-using an existing index instead of generating a new one, (b) seqkit shuffle and seqkit split leaving the index without --keep-temp being enabled, or both?

@shenwei356
Copy link
Owner

seqkit shuffle re-using an existing index instead of generating a new one

Yes, it's the cause.

seqkit shuffle and seqkit split leaving the index without --keep-temp being enabled.

Yes. It keeps the .fai file for plain text fasta files.
It only delete these tmp files created for STDIN and compressed files if --keep-temp not given.

  -k, --keep-temp       keep temporary FASTA and .fai file when using 2-pass mode
# in 2-pass mode.
if (isStdin(file) || !isPlainFile(file)) && !keepTemp {
	checkError(os.Remove(newFile))
	checkError(os.Remove(newFile + ".seqkit.fai"))
}

So you need manually delete .fai file if the fasta file changed.

@shenwei356
Copy link
Owner

Maybe I should add a flag --force for deleting the potential existing .fai files, to make sure it matches the fasta file.
Commands affected: subseq, sort, split, and shuffle.

@nickp60
Copy link
Author

nickp60 commented Feb 10, 2023

Ah I see.I think that flag would be important to have: reusing the temp files means that we otherwise can't pipe shuffle outputs to splits, etc. perhaps --clean or something would be better, as there is already a --force argument for some cmds? Thanks for the fast reply!

@shenwei356
Copy link
Owner

I am attempting to split an assembly into a number of chunks for easier processing in parrallel. Because the assemblies are often sorted by contig size, I shuffled the assembly prior to (1) removing short contigs and (2) splitting into 200 contig chunks.

Since the assemblies are often sorted by contig size, you can also try seqkit split2 -p for splitting the contigs into N parts, which would produce relatively even partitions. And there's no need to shuffle.

I'm not sure how to describe the way how seqkit split2 distributes reads into parts, but here's an example to illustrate it.

$ seqkit seq -n seqs.fa
1
2
3
4

seqkit split:

$ seqkit split -p 2 -O split seqs.fa

$ seqkit seq -n  split/seqs.part_001.fa 
1
2

$ seqkit seq -n  split/seqs.part_002.fa 
3
4

seqkit split2:

$ seqkit split2 -p 2 -O split2 seqs.fa

$ seqkit seq -n  split2/seqs.part_001.fa 
1
3

$ seqkit seq -n  split2/seqs.part_002.fa 
2
4

@nickp60
Copy link
Author

nickp60 commented Feb 10, 2023

Thanks so much, that is good to know. I was trying to avoid having to check the size of the input file ahead of time and splitting into an variable number of files containing a set number of contigs (as opposed to a set number of files containing a variable number of contigs). So while seqkit split2 -p does a great job of splitting the partitions evenly, seqkit split2 -s appears to keep the input order.

@shenwei356
Copy link
Owner

shenwei356 commented Feb 14, 2023

I've added a flag to recreate .fai file, improved the log info, and updated the help message.

  • seqkit faidx/sort/shuffle/split/subseq:
    • new flag -U/--update-faidx: update the FASTA index file if it exists.
    • improve log info and update help message.

Please try the new binaries:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants