Species "all" is not known to RepeatMasker when running -species all #241

Closed
Ruiqi-CUB opened this issue Dec 8, 2023 · 4 comments

@Ruiqi-CUB

Thanks for developing this awesome software. I am running the following command with the -species all option but encountered an error message. Could you please have a look?

nohup RepeatMasker -pa 48 -a -e ncbi -dir all_mask_result -nolow -species all reference-genome.fna

but I got the following messages:

Species "all" is not known to RepeatMasker.  There may
not be any TE families defined in the libraries for this
species/clade or there may be an error in the spelling.
Please check your entry against the NCBI Taxonomy database
and/or try using a broader clade or related species instead.
The full list of species/clades defined in the library may be
obtained using the famdb.py script.

Species/Taxa Search:
   [NCBI Taxonomy ID: ]

Here is the software version information:
RepeatMasker version 4.1.3-p1
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
Using Master RepeatMasker Database: RepeatMasker/Libraries/RepeatMaskerLib.h5
Title : Dfam withRBRM
Version : 3.6
Date : 2022-04-12
Families : 63,852

@Ruiqi-CUB Ruiqi-CUB added the bug label Dec 8, 2023
@rmhubley
Member

rmhubley commented Dec 8, 2023

I am not sure which versions of RepeatMasker supported the "all" synonym (which maps to NCBI taxid 1, the "root" node) as a way to search the entire database against your sequence, but newer versions (4.1.3 - 4.1.6) don't accept it, as you reported. I am conflicted about this: without some care, I am not sure a whole-database search is practical given the current size of the Dfam database and the current architecture of RepeatMasker. If you really want to try this, you can get around the error message you are seeing by using any taxon below the root. For example:

nohup RepeatMasker -pa 48 -a -e ncbi -dir all_mask_result -nolow -species 'cellular organisms' reference-genome.fna
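
The same substitution can be scripted. The following is a minimal Python sketch, not part of RepeatMasker: the helper name `build_repeatmasker_cmd` and the `SPECIES_FALLBACK` mapping are hypothetical, and only flags that already appear in this thread are used. It builds the argument list and prints it; nothing is executed.

```python
import shlex

# Hypothetical mapping: "all" is rejected by RepeatMasker 4.1.3 - 4.1.6,
# so substitute the clade suggested in this thread.
SPECIES_FALLBACK = {"all": "cellular organisms"}


def build_repeatmasker_cmd(genome, species, out_dir, threads=48, nolow=False):
    """Return the RepeatMasker argument list; nothing is executed here."""
    species = SPECIES_FALLBACK.get(species.lower(), species)
    cmd = ["RepeatMasker", "-pa", str(threads), "-a", "-e", "ncbi",
           "-dir", out_dir, "-species", species]
    if nolow:
        # Opt-in only: -nolow also skips simple-repeat detection
        # before the search against TE families.
        cmd.append("-nolow")
    cmd.append(genome)
    return cmd


cmd = build_repeatmasker_cmd("reference-genome.fna", "all", "all_mask_result")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the run; keeping it as a list avoids shell quoting problems with the space in 'cellular organisms'.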

There are two other things to consider. The first is that this will produce a tremendous number of false positives (a multiple testing problem caused by searching with many unrelated query sequences). The second is that you are using '-nolow', which doesn't simply omit simple repeats from the report; it also skips identifying them before the search against TE families. Many TE families contain stretches of tandem or low-complexity sequence, and they will falsely label tandem repeat sequences if this option is used.
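
To give a feel for the multiple testing point, here is a toy Python calculation. It assumes a deliberately simplified model in which every family in the library has the same small, independent chance of producing a spurious hit. The per-family rate and the clade size below are made-up illustrative numbers; only the 63,852 total comes from the library header quoted in this issue. It is not how RepeatMasker scores matches.

```python
def expected_false_hits(n_families, per_family_rate):
    """Expected number of spurious hits under the toy independence model."""
    return n_families * per_family_rate


def prob_at_least_one_false_hit(n_families, per_family_rate):
    """Chance that one or more families produce a spurious hit."""
    return 1.0 - (1.0 - per_family_rate) ** n_families


ALL_FAMILIES = 63_852    # "Families" line from the library header in this issue
CLADE_FAMILIES = 1_000   # hypothetical size of a clade-restricted library
RATE = 1e-4              # hypothetical per-family false-positive rate

for label, n in [("clade only", CLADE_FAMILIES), ("whole library", ALL_FAMILIES)]:
    print(label,
          expected_false_hits(n, RATE),
          prob_at_least_one_false_hit(n, RATE))
```

Under these assumptions the expected count grows linearly with the number of families searched, which is the intuition behind restricting the search to a relevant clade.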

@rmhubley rmhubley pinned this issue Dec 8, 2023
@rmhubley rmhubley closed this as completed Dec 8, 2023
@Ruiqi-CUB
Author

Thanks a lot Robert for the quick reply!
The reason I would like to use -species all is that it has been reported that an increasing number of TEs have been horizontally transferred into my study system from bacteria and viruses. Do you think I should run it with them in addition to cellular organisms and then concatenate the results?

Also, regarding the -nolow option, I have a separate step that identifies simple repeats before this one. Do you think that is good practice?

@rmhubley
Member

rmhubley commented Dec 8, 2023

RepeatMasker removes only low-divergence simple repeats before searching against the TE libraries, and then searches for the remaining higher-divergence simple repeats at the end. In this fashion we avoid false matches against TE families that contain simple repeats in their models (consensus/pHMM), while still obtaining better alignments to the TEs when the simple repeat contributes a larger alignment to the family. So, if you pre-mask the genome before running RepeatMasker, you should take that into account.
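
The ordering matters because a TE model with an internal simple repeat can match a plain microsatellite. The following toy Python sketch illustrates that idea only; it is not RepeatMasker's algorithm. The regex-based masker, the example sequence, and the `te_internal_stretch` fragment are all hypothetical, and "low divergence" is approximated here as perfect tandem copies.

```python
import re


def mask_perfect_tandem_repeats(seq, min_copies=6):
    """Replace perfect tandem repeats (unit 1-6 bp) with N of equal length."""
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_copies - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


# Hypothetical genome fragment: unique flanks around a (CA)10 microsatellite.
genome = "GGTCTT" + "CA" * 10 + "TTGACC"

# Hypothetical low-complexity stretch found inside some TE model.
te_internal_stretch = "CACACACA"

# Without masking first, the stretch "hits" the microsatellite.
print(te_internal_stretch in genome)   # True

# With perfect simple repeats masked first, the spurious hit is gone.
masked = mask_perfect_tandem_repeats(genome)
print(te_internal_stretch in masked)   # False
```

The flip side is the trade-off described above: if a separate tool pre-masks simple repeats more aggressively, TE alignments that would have extended through such a region may end up shorter.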

@Ruiqi-CUB
Author

Appreciate the insight Robert!

@rmhubley rmhubley unpinned this issue Sep 11, 2024