output files explanation #33
Comments
Hi Diego, I understand the confusion because I have seen the same issue. You are correct, there should be non-LTR elements reported for common plant genomes. The non-LTR finding method is based on the program MGEScan-nonLTR, which I rewrote almost entirely because I had so much trouble interpreting the results or even getting it to work. Unfortunately, it appears that the issues are not completely resolved. This is a difficult one, but it is high on my list to resolve ASAP. I'll file this as a bug and get back to this issue when I can. Thanks for the report,
Thanks Evan for your quick response, I ran tephra:
If I want to use this information to search for copies in the genome, which set should I use for the blastn search? Is one of these the consensus members of the different families?
Sorry, I missed this part of the question. "Families" here are multi-copy groups (2336 in this report) and "singletons" (2696 in this report) are elements not grouped with any family. For most plant genomes you see a distribution where the most common family size is 1, meaning there are many small families and a few larger ones (I'll update with a reference). So, if you add the "elements in families" to the "singletons" you'll get the total number:
The last line is just a check (mostly for me) to make sure the numbers are correct after all the classification and annotation steps are complete. BTW, this will be documented in detail in the manuscript to make it clear what all of these numbers are and how they are derived, and this is a good reminder, so thanks for asking.
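The relationship described above (elements in families plus singletons equals the total) can be sketched as a small sanity check. This is only an illustration, not tephra's own code: the function name and the family-element count of 6000 are hypothetical, while 2336 families and 2696 singletons come from the report discussed here. The second check assumes that a "multi-copy" family has at least two members.

```python
def check_summary_counts(n_families, n_family_elements, n_singletons, n_total):
    """Return a list of inconsistencies found in the summary counts."""
    problems = []
    # Every element is either in a family or a singleton.
    if n_family_elements + n_singletons != n_total:
        problems.append("family elements + singletons != total")
    # Assumption: a multi-copy family holds at least two elements.
    if n_family_elements < 2 * n_families:
        problems.append("fewer than two elements per family")
    return problems


# 6000 and 8696 are made-up values used only to exercise the check.
print(check_summary_counts(2336, 6000, 2696, 8696))
```

A report with more families than elements in families would fail the second check, which flags a counting or logging problem rather than a property of the genome.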
I see... but what I don't understand is why there are more multi-copy families than "elements in families"?
Could something be wrong with the LTRs specifically? The same happens with the Copia LTRs... And also...
I'm looking forward to seeing your paper! You did excellent work compiling several tools to classify and annotate TEs! Actually, we recently published (a month ago) a MITE discovery tool called MITE Tracker (https://www.ncbi.nlm.nih.gov/pubmed/30285604), which gave better results than the other available MITE discovery tools and can work with large genomes. I noticed that you didn't incorporate a specific tool for that type of TE (From the
If you are interested, you are more than welcome to look at it, incorporate it into tephra, and ask us any questions regarding the algorithm we used. Best, Diego
Thank you for the clarification, I misunderstood the real issue. Indeed, this is a bug. I did some testing yesterday and I can recreate this issue in some cases. I believe this is a logging problem and not a problem with the annotations, but I should be able to resolve the issue today and make a new release. Thank you for the reference for MITE Tracker! I will take a look for sure, this work should be very helpful for this project.
Hi Evan, I've been checking the output files against the log and I found other inconsistencies in the TEs. Should I wait for the new release and run it all over again? Or, if I want the sequences (classified into the different families) to find more copies in the genome (by blasting), can I just cat both files and use that?
Cool! Let us know if you have any questions.
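For reference, the "cat both files" idea can be sketched as follows. This is a minimal sketch under assumptions, not a recommendation confirmed by the maintainer: the function names are hypothetical, the inputs are assumed to be plain FASTA files, and skipping records whose ID was already seen is an added safeguard rather than documented tephra behavior.

```python
def read_fasta(path):
    """Yield (record_id, header, sequence) for each record in a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield _record(header, chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # The record ID is the first whitespace-delimited token of the header.
    parts = header.split()
    return (parts[0] if parts else ""), header, "".join(chunks)


def pool_fasta(paths, out_path):
    """Write all records from `paths` to `out_path`, skipping repeated IDs.

    Returns the number of records written.
    """
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for path in paths:
            for record_id, header, seq in read_fasta(path):
                if record_id in seen:
                    continue
                seen.add(record_id)
                out.write(f">{header}\n{seq}\n")
                written += 1
    return written
```

The pooled file could then serve as the query set for a blastn search against the genome.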
Sorry to bother you again, but where can I find, or how can I filter out, the sequences from the unclassified LTRs that have protein domain matches?
Because you also have these:
but I think I want to use the classified ones plus the unclassified ones with protein matches for the copy search with blastn. But I don't know how to filter them from the
Thank you for the patience. I have made a new release today (v0.12.3) that should address all the issues about the family/element numbers (new features have also been added but the usage is the same). Concerning the issue with non-LTRs, I do not think this is a bug because it works fine for some species, such as Arabidopsis. I believe the problem is that elements in some species have too high a divergence from the models being used. This will take more research, and I'll likely create a new issue for that. Please let me know if the other questions/issues have been resolved with the latest changes.
I believe all of the issues described above have been resolved in v0.12.4 (specifically, the issue with the non-LTRs). Please let me know if there are any more questions or if I missed something. I'll leave this one open for a while or until I get a response. Thanks.
Hi Evan! Thanks again, Diego
Hi Evan,
I ran tephra:
nohup tephra all -c tephra_config.yml &
since I want all types of TEs. I used a repbase.fasta from Solanum tuberosum for the repeatdb and left the other databases with the default options.
I'm having trouble understanding the different output files, especially regarding the LTRs.
Could you explain the difference between "elements" and "singletons"? Why are they in separate files?
What are the 2336 families, and why don't they match the elements or singletons?
this is from the log:
nano nohup.out
If I want to use this information to search for copies in the genome, which set should I use for the blastn search? Is one of these the consensus members of the different families?
Another issue is that I have 0 non-LTR matches, neither LINEs nor SINEs.
I know for a fact that there are SINEs and LINEs; in fact, the Repbase from GIRI that I use has LINEs and SINEs.
Why do you think that happened?
Best
Diego