Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vsearch --usearch_global not showing "full alignment" instead only the segment pair #545

Closed
gbbio opened this issue Nov 23, 2023 · 3 comments
Assignees
Labels

Comments

@gbbio
Copy link

gbbio commented Nov 23, 2023

Hello,

If I use usearch_global to search a short primer sequence against a reference (or preferably search the reference against the primer) I don't get the "full alignment" back. For example:

primer query:   ACAGTGACATGGGGACGTAT
reference:       CAGTGACATGGGGACGTAT...

Qry    2 + CAGTGACATGGGGACGTAT 20
           |||||||||||||||||||
Tgt    1 + CAGTGACATGGGGACGTAT 19


19 cols, 19 ids (100.0%), 0 gaps (0.0%)

The first base of the primer is not there and the identity is 100%. With a global alignment I was expecting something like:

Qry    1 + ACAGTGACATGGGGACGTAT 20
            |||||||||||||||||||
Tgt    1 + -CAGTGACATGGGGACGTAT 19


20 cols, 19 ids (95.0%), 0 gaps (0.0%)

Is this intended?

How can I achieve the above result like the second example? I want to align 1 primer sequence against many references and also detect multiple matches per reference. So preferably using --usearch_global where my file with references is the query. And get the alignment as qrow and trow columns.

Many thanks in advance.

@torognes
Copy link
Owner

Hi!

Yes, this is intended. VSEARCH uses global alignments, but the terminal gap penalties are very low by default, and terminal gaps are not shown in the alignments. It may therefore look like local alignments. This is similar to USEARCH.

By default, the identity between the query and target is calculated as the percentage of matching nucleotides in the aligned region, excluding gaps at the ends (in the terminal regions). If you want to, you can choose to use a different definition of identity, for instance the CD-HIT definition, which is the number of matching columns divided by the shortest sequence length. In your case, when aligning primers, the primer length will almost always be the shortest sequence. I think that will give you the result you want. Use the option --iddef 0 to use this identity definition. In the example above, the result will be 95%.

You can use the userout and userfields options to get a tab-separated text file with the information you may need. For instance, using --userfields query+target+id0+alnlen+qilo+qihi+tilo+tihi+qrow+trow could be useful in this case. Like this:

vsearch --usearch_global primer.fasta --db ref.fasta --alnout alnout.txt --id 0.8 --iddef 0 --userfields query+target+id0+alnlen+qilo+qihi+tilo+tihi+qrow+trow --userout userout.tsv

Please see the manual for more info about these options.

Some remarks: You cannot get multiple matches within the same database sequence for one query, so you may have to reverse the search by having the primers in the database (as you indicate). If you are using short sequences (less than 32 nucleotides) in the database you need to use the --minseqlength 1 option. You will also need to use the --maxaccepts 0 option to find all matches to each query, not just the best match.

I hope this works for you.

@torognes torognes self-assigned this Nov 23, 2023
@gbbio
Copy link
Author

gbbio commented Nov 23, 2023

Thanks for the reply.

The command I am using now is:

vsearch --usearch_global genomes.fa --db primer --id 0.5 --qmask none --strand both --minwordmatches 0 --wordlength 3 --maxaccepts 0 --maxrejects 0 -maxgaps 15 -mincols 5 -userout {output.userout} -qmask none -userfields query+target+id+alnlen+mism+gaps+qilo+qihi+tilo+tihi+qstrand+qrow+trow+ql+tl

I will check the CD-HIT definition and maybe I can do some parsing myself with the help of the aln or caln column. A tseq or qseq column would maybe also give some more options but I see they are not available.

Thanks again for the reply and thanks for this tool.

@frederic-mahe
Copy link
Collaborator

Thanks @gbbio I've created a new issue regarding the missing --userfields options (see issue #548).

Also, regression tests added to our test-suite (see frederic-mahe/vsearch-tests@bdb1a50)

Please close the issue if you consider it solved.

@gbbio gbbio closed this as completed Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants