Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqkit locate misses embedded patterns #368

Closed
ZacharyArdern opened this issue Feb 21, 2023 · 5 comments
Closed

seqkit locate misses embedded patterns #368

ZacharyArdern opened this issue Feb 21, 2023 · 5 comments

Comments

@ZacharyArdern
Copy link

ZacharyArdern commented Feb 21, 2023

Describe your issue

Seqkit locate does not seem to find fully overlapping patterns on the same strand (whether in greedy or non-greedy mode)

For instance, the regex below for searching for ORFs, adapted from that provided, does not find fully overlapping (embedded) ORFs on the same strand.

#with genome file "test.txt":
cat test.txt | seqkit locate -i -p "[AT]TG(?:.{3})+?T" -r | awk '{if (($6-$5)>=60 && NR>=2 ) print }' > ORFs.txt ;

e.g. this predicts an ORF from positions 123-260 on the plus strand, but misses the embedded ORF from 128-247 (found e.g. with prodigal, in test_all.txt)
prodigal -i test.txt -g 1 -f gff -s test_all.txt > test.gff ;

faidx test.txt LVMU01000403.1:123-260 ;
faidx test.txt LVMU01000403.1:128-247 ;

test.txt

@shenwei356
Copy link
Owner

shenwei356 commented Feb 22, 2023

For some reasons I can't recall, I chose to remove these "embedded" regions when searching with regular expressions before. It looks unnecessary now, cause we already have the -G/--non-greedy flag.

Thanks for reporting :)

$ seqkit faidx test.txt LVMU01000403.1:123-260 \
    | seqkit locate -r -i -p "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)" -M \
    | csvtk cut -t -f -patternName

seqID   pattern strand  start   end
LVMU01000403.1:123-260  A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        +       1       138
LVMU01000403.1:123-260  A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        +       6       125   * i.e. region 128-247
LVMU01000403.1:123-260  A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        +       82      138
LVMU01000403.1:123-260  A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        +       106     138
LVMU01000403.1:123-260  A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        -       36      47


$ seqkit faidx test.txt LVMU01000403.1:123-260 \
    | seqkit locate -r -i -p "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)" -M -G \
    | csvtk cut -t -f -patternName

seqID   pattern strand  start   end
LVMU01000403.1:123-260  A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        +       1       138
LVMU01000403.1:123-260  A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        -       36      47

@ZacharyArdern
Copy link
Author

ZacharyArdern commented Feb 22, 2023

Thank you for the quick response again @shenwei356 !

I have tested it, and see just one issue, with the example you give for finding ORFs, rather than seqkit locate

  • the regex you give also finds sequences where a stop immediately follows the start codon.

e.g.

LVMU01000403.1:12414-12659(+)
ATGTAAACCGACACCCGTCGCCTGAACGATACGTTTAAGTGTCCTTTGTTTGATCATCGTATTATCTCGCCAAATTACCTATCCAACCGAAGTGTACTATACATTCGGCGGGCCAGTTTAGCACAAAGAGCCTCGAAACCCAAATTCCAGTCAATTCTTAATCAGCTTGCTTACGCAGGAATGCTGGGATATCCAGATAATCCGGCTCTTTCGCAGTTTGCGGCGCATTGTCATTCACGACTTTAG

These few cases can be filtered out easily afterwards, so this is just a minor comment in case you know how to improve the regex so it excludes these non-ORF cases.

@shenwei356
Copy link
Owner

the regex you give also finds sequences where a stop immediately follows the start codon

I'm afraid it's not.

A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)
      ----------
      It means at least one triple-base codon.
      +  means one or more
$ echo -ne ">s\nATGTAA\n" | seqkit locate -r -i -p "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)"
seqID   patternName     pattern strand  start   end     matched

$ echo -ne ">s\nATGcccTAA\n" | seqkit locate -r -i -p "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)"
seqID   patternName     pattern strand  start   end     matched
s       A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)        +       1       9       ATGcccTAA

@ZacharyArdern
Copy link
Author

ZacharyArdern commented Feb 23, 2023

@shenwei356 my statement wasn't fully clear, but this regex includes a stop within an "ORF" if it is immediately after the start codon:

echo -ne ">s\nATGTAAGGGTGA\n" | seqkit locate -r -i -p "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)"
seqID	patternName	pattern	strand	start	end	matched
s	A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)	A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)	+	1	12	ATGTAAGGGTGA

@shenwei356
Copy link
Owner

I see. The regular expression is not perfect for finding ORF.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants