Different results between shuffle --two-pass and without, resulting in duplicated sequences #364

nickp60 · 2023-02-09T22:27:15Z

Hello! This is a possible duplicate of #225, but I'm making a new issue because that one didn't mention a potential difference in the headers. Happy to make this a comment on the other if thats preferable.

Prerequisites

make sure you're are using the latest version by seqkit version
v2.3.1 on linux via conda
read the usage

Describe your issue

describe the problem
I am attempting to split an assembly into a number of chunks for easier processing in parrallel. Because the assemblies are often sorted by contig size, I shuffled the assembly prior to (1) removing short contigs and (2) splitting into 200 contig chunks. However, I noticed that some downstream tools are complaining about duplicated sequences in the resulting split files. I think I've traced this down to using the --two-pass mode.
provide a reproducible example
I have tried, but note the difficulty reproducing below

# expected
$ seqkit stats tmpall_small/473.assembly.fasta
file                             format  type  num_seqs    sum_len  min_len  avg_len  max_len
tmpall_small/473.assembly.fasta  FASTA   DNA      2,122  5,192,331       56  2,446.9   36,157

# shuffled, expected results
$ seqkit shuffle tmpall_small/473.assembly.fasta  | seqkit stats
[INFO] read sequences ...
[INFO] 2122 sequences loaded
[INFO] shuffle ...
[INFO] output ...
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   DNA      2,122  5,192,331       56  2,446.9   36,157

# something strange in 2pass mode
$ seqkit shuffle 473.assembly.fasta --two-pass | seqkit stats
[INFO] create and read FASTA index ...
[INFO] read sequence IDs from FASTA index ...
[INFO] 2182 sequences loaded
[INFO] shuffle ...
[INFO] output ...
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   DNA      4,233  5,121,864        0    1,210   36,157

But, its difficult to reproduce!

Going off of issue #225 , I tried moving this to a new directory and seqkit worked as expected! Something appears to be happening when writing the index file; I have attached both the index form the bad run and the one from the successful run; the latter has all the expected headers.

bad473.assembly.fasta.seqkit.fai.gz

good473.assembly.fasta.seqkit.fai.gz

473.assembly.fasta.gz

Strangely, it seems like the index for the "bad" run processed with two pass has more headers than are present in the fasta! Is it possible this is arising because the seqkit shuffle --two-pass or some other seqkit command is doing in-place modification of the input file?

I realize this is a strange issue; any insight would be appreciated!

The text was updated successfully, but these errors were encountered:

nickp60 · 2023-02-09T22:31:59Z

Tagging @nick-youngblut in case you have any additional context or ideas!

nickp60 · 2023-02-09T22:42:55Z

On second look, I think this might be the effect of a separate issue: without the --keep-temp option, shouldn't the .seqkit.fai files be deleted? I'm noticing the timestamp on the "bad" .seqkit.fai predates my assembly, so potentially there is an issue arising from (a) seqkit shuffle re-using an existing index instead of generating a new one, (b) seqkit shuffle and seqkit split leaving the index without --keep-temp being enabled, or both?

shenwei356 · 2023-02-10T01:27:30Z

seqkit shuffle re-using an existing index instead of generating a new one

Yes, it's the cause.

seqkit shuffle and seqkit split leaving the index without --keep-temp being enabled.

Yes. It keeps the .fai file for plain text fasta files.
It only delete these tmp files created for STDIN and compressed files if --keep-temp not given.

  -k, --keep-temp       keep temporary FASTA and .fai file when using 2-pass mode

# in 2-pass mode.
if (isStdin(file) || !isPlainFile(file)) && !keepTemp {
	checkError(os.Remove(newFile))
	checkError(os.Remove(newFile + ".seqkit.fai"))
}

So you need manually delete .fai file if the fasta file changed.

shenwei356 · 2023-02-10T04:53:51Z

Maybe I should add a flag --force for deleting the potential existing .fai files, to make sure it matches the fasta file.
Commands affected: subseq, sort, split, and shuffle.

nickp60 · 2023-02-10T13:39:30Z

Ah I see.I think that flag would be important to have: reusing the temp files means that we otherwise can't pipe shuffle outputs to splits, etc. perhaps --clean or something would be better, as there is already a --force argument for some cmds? Thanks for the fast reply!

shenwei356 · 2023-02-10T14:04:49Z

I am attempting to split an assembly into a number of chunks for easier processing in parrallel. Because the assemblies are often sorted by contig size, I shuffled the assembly prior to (1) removing short contigs and (2) splitting into 200 contig chunks.

Since the assemblies are often sorted by contig size, you can also try seqkit split2 -p for splitting the contigs into N parts, which would produce relatively even partitions. And there's no need to shuffle.

I'm not sure how to describe the way how seqkit split2 distributes reads into parts, but here's an example to illustrate it.

$ seqkit seq -n seqs.fa
1
2
3
4

seqkit split:

$ seqkit split -p 2 -O split seqs.fa

$ seqkit seq -n  split/seqs.part_001.fa 
1
2

$ seqkit seq -n  split/seqs.part_002.fa 
3
4

seqkit split2:

$ seqkit split2 -p 2 -O split2 seqs.fa

$ seqkit seq -n  split2/seqs.part_001.fa 
1
3

$ seqkit seq -n  split2/seqs.part_002.fa 
2
4

nickp60 · 2023-02-10T14:18:22Z

Thanks so much, that is good to know. I was trying to avoid having to check the size of the input file ahead of time and splitting into an variable number of files containing a set number of contigs (as opposed to a set number of files containing a variable number of contigs). So while seqkit split2 -p does a great job of splitting the partitions evenly, seqkit split2 -s appears to keep the input order.

shenwei356 · 2023-02-14T05:42:27Z

I've added a flag to recreate .fai file, improved the log info, and updated the help message.

seqkit faidx/sort/shuffle/split/subseq:
- new flag -U/--update-faidx: update the FASTA index file if it exists.
- improve log info and update help message.

Please try the new binaries:

shenwei356 added the enhancement label Feb 10, 2023

shenwei356 mentioned this issue Feb 14, 2023

panic: runtime error: invalid memory address or nil pointer dereference #365

Closed

shenwei356 added a commit that referenced this issue Feb 14, 2023

improve fasta index file operations. #364, #365

15b1662

shenwei356 closed this as completed Mar 1, 2023

shenwei356 mentioned this issue Mar 17, 2023

Update SeqKit to v2.4.0 bioconda/bioconda-recipes#39957

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Different results between shuffle --two-pass and without, resulting in duplicated sequences #364

Different results between shuffle --two-pass and without, resulting in duplicated sequences #364

nickp60 commented Feb 9, 2023

nickp60 commented Feb 9, 2023

nickp60 commented Feb 9, 2023 •

edited

Loading

shenwei356 commented Feb 10, 2023

shenwei356 commented Feb 10, 2023

nickp60 commented Feb 10, 2023

shenwei356 commented Feb 10, 2023

nickp60 commented Feb 10, 2023

shenwei356 commented Feb 14, 2023 •

edited

Loading

Different results between shuffle --two-pass and without, resulting in duplicated sequences #364

Different results between shuffle --two-pass and without, resulting in duplicated sequences #364

Comments

nickp60 commented Feb 9, 2023

Prerequisites

Describe your issue

But, its difficult to reproduce!

nickp60 commented Feb 9, 2023

nickp60 commented Feb 9, 2023 • edited Loading

shenwei356 commented Feb 10, 2023

shenwei356 commented Feb 10, 2023

nickp60 commented Feb 10, 2023

shenwei356 commented Feb 10, 2023

nickp60 commented Feb 10, 2023

shenwei356 commented Feb 14, 2023 • edited Loading

nickp60 commented Feb 9, 2023 •

edited

Loading

shenwei356 commented Feb 14, 2023 •

edited

Loading