Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

too few hits #454

Open
fmaumusINRA opened this issue May 17, 2021 · 9 comments
Open

too few hits #454

fmaumusINRA opened this issue May 17, 2021 · 9 comments

Comments

@fmaumusINRA
Copy link

Dear Soeding lab members,
I was trying to use mmseqs for tblastn-like analysis.
With a given aa query db and nt targetdb, I have results of blastall and blast+ giving approx 11k and 14k hits (evalue 0.001), respectively.
Aiming at speeding this process, I wanted to use easy-search.
At first, I was getting only about 500-700 hits with default settings and varying sensitivity.
I then tried out the "--max-seqs" option
--max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [0.000]
It seems that by default the max seq number is set to 300. When I switched this to --max-seqs 100000, I am now getting 2.5K hits.
mmseqs easy-search prt.fa genome.fa out /tmp/ -s 7 --max-seqs 100000
That's much better but still far from blast.
Would you have any suggestion to address this discrepancy?

Thanks a lot for your help !

@milot-mirdita
Copy link
Member

We don't (/can't) generate similar k-mers for nucleotide searches, so the sensitivity parameter doesn't really affect anything in this case. You could try reducing the k-mer size a bit, that might result in more hits getting passed through the prefilter.

@fmaumusINRA
Copy link
Author

Thanks a lot for your response. I tried reducing with "-k 5" which improved a bit and then got an error when trying "-k 4".
But isn't it doing a BLASTP against translated ORFs anyhow?

@milot-mirdita
Copy link
Member

Sorry you are right, I misread the initial thread. Maybe try reducing the --min-length of ORFs extracted from the genome the default 30aa (90 nucl) is quite long, that might be one issue. Could you share the two FASTA files, then I can take a look.

@fmaumusINRA
Copy link
Author

My query is a reverse transcriptase (RT) domain. For instance

Athila41
LDAGVIYPISDSTWVSPVHCVPKKGGMTVVKNEKDELIPTRTITGHRMCIDYRKLNAASRKDHFPLPFIDQMLERLANHPYYCFLDGYSGFFQIPIHPNDQEKTTFTCPYGTFAYKRMPFGLCNAPATFQRCMTSIFSDLIEEMVEVFMDDFSVYGPSFSSCLLNLGRVLTRCEETNLVLNWEKCHFMVKEGIVLDHKISEKGIEVDKGKVEVMMQLQPPKTVKDIRSFLGHAGFYRRFIKDFA
It is about 240 aa so I wonder if --min-length will help.

My target is a plant genome. The genome is too big to share (1Gb). It is from here https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=APLD01.
Thanks a lot for your attempts and suggestions!!!

@milot-mirdita
Copy link
Member

--min-length would affect the genome in this case.

@milot-mirdita
Copy link
Member

I get over 37k hits, but I also split the genome db creation from the search:

mmseqs createdb APLD01.1.fsa_nt.gz APLD01.2.fsa_nt.gz ntdb
mmseqs easy-search asd.fa ntdb res.m8 tmp -s 7 --max-seqs 30000000

--min-length is not changing much though.

@fmaumusINRA
Copy link
Author

OK, maybe I over simplified the question because I was actually counting merged hits with a minimum size of 520 nt (empirical, a bit less than RT size).
I indeed get 37061 hits with mmseqs, 34268 remain after merging overlaping hits, and 1099 merged hits pass the size filter.
With blast+, I get 8159 hits, 6262 merged, and 2947 merged hits pass the size filter.
So yes, I get more raw hits with mmseqs but they are short :(
Sorry for misleading at first !

@milot-mirdita
Copy link
Member

The ORF extraction step might be a bit too naive to deal well with a plant. I guess BLAST is able to extract longer fragments for some reason, but I don't know how.

@fmaumusINRA
Copy link
Author

Thanks a lot for your help, see you soon with another question, I am exploring still

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants