Possible bug seqkit stats? #6

Closed
lakhujanivijay opened this issue Jan 9, 2017 · 3 comments

Comments

@lakhujanivijay

Tool used: seqkit

Dummy fasta file (fasta.fa):

>test1
GCATCGATCAGCTACGATCATCACTA
GNNNNNNTACATCAGCACTACATCACTNNNNN
>test2
GTACGCTACGANNNGCTACGACTACGATATATATATATATATATATATATATATATATATATAT
GCTACGATCACNTACATCGACTA
>test3
GTGTGCTACATCATCACTACGTACTACAT
>test4
AA

Command:

./seqkit stat fasta.fa

Output:

file      format  type  num_seqs  sum_len  min_len  avg_len  max_len
fasta.fa  FASTA   DNA          4      176        0       44       87

Problem:
min_len = 0, but the minimum length should be 2 (sequence ID "test4").

Validation using seqkit:

Command:

./seqkit fx2tab -l fasta.fa

Output:

test1	GCATCGATCAGCTACGATCATCACTAGNNNNNNTACATCAGCACTACATCACTNNNNN		58
test2	GTACGCTACGANNNGCTACGACTACGATATATATATATATATATATATATATATATATATATATGCTACGATCACNTACATCGACTA		87
test3	GTGTGCTACATCATCACTACGTACTACAT		29
test4	AA		2

Note: the length of sequence test4 is 2.

Is this a bug, or did I misunderstand something?
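For an independent cross-check of the numbers above, the summary can be recomputed without seqkit. The following is only a minimal sketch: `read_fasta` and `length_stats` are hypothetical helper names, not part of seqkit, and the dictionary keys are simply copied from the header of the `stat` table. On the dummy file it should agree with the per-sequence lengths reported by `fx2tab -l` (58, 87, 29, 2), so the smallest length is that of test4.

```python
from io import StringIO

# The dummy FASTA from this report, embedded so the sketch is self-contained.
FASTA_TEXT = """\
>test1
GCATCGATCAGCTACGATCATCACTA
GNNNNNNTACATCAGCACTACATCACTNNNNN
>test2
GTACGCTACGANNNGCTACGACTACGATATATATATATATATATATATATATATATATATATAT
GCTACGATCACNTACATCGACTA
>test3
GTGTGCTACATCATCACTACGTACTACAT
>test4
AA
"""


def read_fasta(handle):
    """Yield (name, sequence) pairs, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_stats(lengths):
    """Summarise sequence lengths; keys mirror the columns of the stat table."""
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


lengths = [len(seq) for _, seq in read_fasta(StringIO(FASTA_TEXT))]
stats = length_stats(lengths)
print(stats)
```

To run it against a real file, replace `StringIO(FASTA_TEXT)` with an open file handle, e.g. `open("fasta.fa")`.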

@shenwei356
Owner

shenwei356 commented Jan 9, 2017

Sorry about that. It's fixed in the latest version (v0.4.3); please update.

Affected versions: v0.4.0, v0.4.1, v0.4.2

@lakhujanivijay
Author

Thanks! Perfect!

One more question, please: how can I get contigs from scaffolds using seqkit? Please see below.

Input scaffolds file:

>test1
GCATCGATCAGCTACGATCATCACTA
GNNNNNNTACATCAGCACTACATCACTNNNNN
>test2
GTACGCTACGANNNGCTACGACTACGATATATATATATATATATATATATATATATATATATAT
GCTACGATCACNTACATCGACTA
>test3
GTGTGCTACATCATCACTACGTACTACAT
>test4
AA

Desired output contigs file:

>test1
GCATCGATCAGCTACGATCATCACTA
G
>test1_2
ACATCAGCACTACATCACT
>test2
GTACGCTACGA
>test2_1
GCTACGACTACGATATATATATATATATATATATATATATATATATATAT
GCTACGATCACNTACATCGACTA
>test3
GTGTGCTACATCATCACTACGTACTACAT
>test4
AA

i.e., essentially splitting the sequences at n/N.
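As a rough illustration of the splitting idea (not seqkit's own method), here is a minimal Python sketch. The function name `split_scaffold` and the `_1`, `_2`, ... suffix scheme are assumptions made for this example only. It splits at every run of one or more N/n, so its output differs from the hand-written example above in places: it would also split test2 at the single N in its second line, and it numbers every contig uniformly.

```python
import re


def split_scaffold(name, seq):
    """Split one scaffold at runs of N/n.

    Returns a list of (contig_name, contig_sequence) pairs. Contig names
    use a hypothetical "<scaffold>_<index>" scheme, counting from 1.
    """
    # re.split yields empty strings when the sequence starts or ends
    # with N, so those are filtered out.
    pieces = [p for p in re.split(r"[Nn]+", seq) if p]
    return [(f"{name}_{i}", piece) for i, piece in enumerate(pieces, start=1)]


def scaffolds_to_contigs(records):
    """Apply split_scaffold to an iterable of (name, sequence) pairs."""
    contigs = []
    for name, seq in records:
        contigs.extend(split_scaffold(name, seq))
    return contigs


# Small demonstration on made-up sequences.
for contig_name, contig_seq in scaffolds_to_contigs(
    [("demo1", "ACGTNNNNACGT"), ("demo2", "AA")]
):
    print(f">{contig_name}\n{contig_seq}")
```

If only longer gaps should count as scaffold breaks, the pattern could be tightened, for example to `[Nn]{2,}`; which threshold is appropriate depends on how the scaffolds were built.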

@shenwei356
Owner

We talked about this: https://www.biostars.org/p/211400/
