too few hits #454

fmaumusINRA · 2021-05-17T19:48:37Z

Dear Soeding lab members,
I was trying to use mmseqs for tblastn-like analysis.
With a given aa query db and nt targetdb, I have results of blastall and blast+ giving approx 11k and 14k hits (evalue 0.001), respectively.
Aiming at speeding this process, I wanted to use easy-search.
At first, I was getting only about 500-700 hits with default settings and varying sensitivity.
I then tried out the "--max-seqs" option
--max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [0.000]
It seems that by default the max seq number is set to 300. When I switched this to --max-seqs 100000, I am now getting 2.5K hits.
mmseqs easy-search prt.fa genome.fa out /tmp/ -s 7 --max-seqs 100000
That's much better but still far from blast.
Would you have any suggestion to address this discrepancy?

Thanks a lot for your help !

milot-mirdita · 2021-05-18T12:19:34Z

We don't (/can't) generate similar k-mers for nucleotide searches, so the sensitivity parameter doesn't really affect anything in this case. You could try reducing the k-mer size a bit, that might result in more hits getting passed through the prefilter.

fmaumusINRA · 2021-05-18T12:53:20Z

Thanks a lot for your response. I tried reducing with "-k 5" which improved a bit and then got an error when trying "-k 4".
But isn't it doing a BLASTP against translated ORFs anyhow?

milot-mirdita · 2021-05-18T12:57:34Z

Sorry you are right, I misread the initial thread. Maybe try reducing the --min-length of ORFs extracted from the genome the default 30aa (90 nucl) is quite long, that might be one issue. Could you share the two FASTA files, then I can take a look.

fmaumusINRA · 2021-05-18T13:19:33Z

My query is a reverse transcriptase (RT) domain. For instance

Athila41
LDAGVIYPISDSTWVSPVHCVPKKGGMTVVKNEKDELIPTRTITGHRMCIDYRKLNAASRKDHFPLPFIDQMLERLANHPYYCFLDGYSGFFQIPIHPNDQEKTTFTCPYGTFAYKRMPFGLCNAPATFQRCMTSIFSDLIEEMVEVFMDDFSVYGPSFSSCLLNLGRVLTRCEETNLVLNWEKCHFMVKEGIVLDHKISEKGIEVDKGKVEVMMQLQPPKTVKDIRSFLGHAGFYRRFIKDFA
It is about 240 aa so I wonder if --min-length will help.

My target is a plant genome. The genome is too big to share (1Gb). It is from here https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=APLD01.
Thanks a lot for your attempts and suggestions!!!

milot-mirdita · 2021-05-18T13:20:54Z

--min-length would affect the genome in this case.

milot-mirdita · 2021-05-18T13:33:55Z

I get over 37k hits, but I also split the genome db creation from the search:

mmseqs createdb APLD01.1.fsa_nt.gz APLD01.2.fsa_nt.gz ntdb
mmseqs easy-search asd.fa ntdb res.m8 tmp -s 7 --max-seqs 30000000

--min-length is not changing much though.

fmaumusINRA · 2021-05-18T13:53:57Z

OK, maybe I over simplified the question because I was actually counting merged hits with a minimum size of 520 nt (empirical, a bit less than RT size).
I indeed get 37061 hits with mmseqs, 34268 remain after merging overlaping hits, and 1099 merged hits pass the size filter.
With blast+, I get 8159 hits, 6262 merged, and 2947 merged hits pass the size filter.
So yes, I get more raw hits with mmseqs but they are short :(
Sorry for misleading at first !

milot-mirdita · 2021-05-18T14:37:30Z

The ORF extraction step might be a bit too naive to deal well with a plant. I guess BLAST is able to extract longer fragments for some reason, but I don't know how.

fmaumusINRA · 2021-05-18T16:05:27Z

Thanks a lot for your help, see you soon with another question, I am exploring still

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

too few hits #454

too few hits #454

fmaumusINRA commented May 17, 2021

milot-mirdita commented May 18, 2021

fmaumusINRA commented May 18, 2021

milot-mirdita commented May 18, 2021

fmaumusINRA commented May 18, 2021

milot-mirdita commented May 18, 2021

milot-mirdita commented May 18, 2021

fmaumusINRA commented May 18, 2021

milot-mirdita commented May 18, 2021

fmaumusINRA commented May 18, 2021

too few hits #454

too few hits #454

Comments

fmaumusINRA commented May 17, 2021

milot-mirdita commented May 18, 2021

fmaumusINRA commented May 18, 2021

milot-mirdita commented May 18, 2021

fmaumusINRA commented May 18, 2021

milot-mirdita commented May 18, 2021

milot-mirdita commented May 18, 2021

fmaumusINRA commented May 18, 2021

milot-mirdita commented May 18, 2021

fmaumusINRA commented May 18, 2021