Virome decontamination setting on non-viral metagenomes #40
Comments
Hi ! Best, |
Hi Simon, Thank you for the informative and punctual reply - I will proceed with using the virome decontamination setting for my future VirSorter runs. |
I realize this issue is closed but I had a related question that seems to fit best here. I have two non-viral metagenome assemblies that are part of the same study. After a normal virsorter run I get this message for only one of the assemblies
My question is whether it is better to use |
Hi Jarrod, That is a good question. The "warning" message is a recent addition, so I'm reopening the issue as it may be useful to a number of folks. My recommendation here would be to re-run VirSorter in virome decontamination mode for this one assembly, and compare the results to the "regular" mode. By comparing the results, I mean counting how many additional viral contigs are identified, and manually inspecting a few of them. If these look like genuine viruses, then I would recommend running everything as "virome decontamination", for consistency. A quick background on this: regular VirSorter estimates a number of parameters (ratio of genes with PFAM hits, ratio of genes with viral hits, average gene size, etc.) from the data themselves. The original idea was that it would be more accurate to estimate these from the whole (mostly microbial) dataset, and then look for sequences that look "more viral than the average". In some cases, e.g. if a dataset has a substantial portion of viruses, this approach fails because the "average" is already viral. So the "virome decontamination" mode uses pre-computed parameters from bacterial/archaeal genomes in RefSeq instead of estimates from the data. Overall, this virome decontamination mode tends to work with all kinds of datasets (microbial or viral), but it is sensitive to "unusual" microbial genomes (i.e. genomes not well captured by RefSeq). |
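To make the background point above concrete, here is a minimal sketch of why a virus-rich dataset can hide viral contigs when the background is estimated from the data themselves. It uses a plain binomial tail probability as a stand-in for an enrichment test; this is an illustrative assumption, not VirSorter's actual statistic, and all numbers (gene counts, background ratios) are hypothetical.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability of observing at least k successes in n trials at rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical contig: 20 predicted genes, 8 of them with a viral hit.
n_genes, n_viral = 20, 8

# Hypothetical background ratios of genes with viral hits:
p_reference = 0.05   # fixed background, e.g. derived from microbial reference genomes
p_from_data = 0.30   # background estimated from a dataset that is already virus-rich

p_ref = binomial_tail(n_viral, n_genes, p_reference)
p_self = binomial_tail(n_viral, n_genes, p_from_data)

print(f"vs. fixed reference background: p = {p_ref:.2e}")
print(f"vs. data-derived background:    p = {p_self:.2e}")
```

With the fixed reference background the contig stands out clearly, whereas against the virus-rich, data-derived background the same contig looks close to average and would not be flagged as enriched.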
Hi Simon, for the record, here are the commands I ran for each assembly...
|
Hi Simon Here are some details. Maybe too many, maybe too few :) The two datasets: EP: 753612 contigs For the EP dataset, only 12 hits were detected in the The one thing I am not sure about is the "best way" to test the hits. Suggestions? Straight up blastp against This table breaks down the number of hits by treatment for each category from each dataset.
|
Hi Jarrod,
Hopefully this is somewhat clear, please let me know if you have any question or if the data doesn't seem to match my assumptions :-) Btw, are these metagenomes anything special, e.g. samples filtered in a specific way ? |
Hi Simon
Great explanation! I will look at those data today. And the EP assembly
has a lot of viral hits, but it looks like it was a little below VirSorter's
warning threshold. So I agree that decontamination for both sets is looking
like the best way forward.
These are nearshore marine samples, collected at 3-5 m depth and filtered
through 0.22 micron filters. Even Kaiju classification on the contigs and
Kraken on the short reads (for each sample) show a relatively high
percentage of viral hits, like 15% of total datasets. I honestly don’t have
enough experience with these kinds of samples to know if this is “normal”
or not.
…On Wed, Dec 11, 2019 at 10:40 simroux ***@***.***> wrote:
Hi Jarrod,
Thanks for the additional information :-) Here is how I typically look at
these cases:
- The file "VIRSorter_affi-contigs.tab" in the folder "Metric_files" has
the gene-by-gene annotation that VirSorter uses to make its calls. Columns
are: Gene ID, Start, Stop, Length, Strand, Hit in phage cluster database,
score, e-value, Phage cluster category, Hit in pfam, score, e-value. You
should be able to "grep" individual contigs from this file (adding "-gene"
at the end of the contig name if needed). I like to look at this file
because these are the exact data VirSorter looked at to calculate
enrichment / depletion statistics, and so if a contig is almost entirely
unknown, with e.g. only ~ 10% PFAM, but was not considered as "depleted in
PFAM hits" in the regular mode, I know that I should use the virome
decontamination mode.
- In your case, based on the results you've seen here for category 1
sequences, it seems like you should use the virome decontamination mode.
Basically VirSorter uses 2 types of metrics: viral hallmark genes (which
work the same in regular vs virome mode since it's simple presence/absence)
and enrichment/depletion stats (which will be different between regular
and virome modes). For a contig to be category 1, it needs to be
significant in some enrichment stat + have ≥1 hallmark gene. The fact that
you find "new" category 1 contigs means that there were sequences with a
hallmark gene (so most are likely viral) that were not considered enriched in
viral genes or depleted in PFAM in regular mode, but are considered
enriched/depleted in virome mode. That suggests to me that the background
stats computed from the whole dataset were too "stringent", i.e. the
overall percentage of phage cluster & PFAM affiliation was too similar to a
"normal" virus genome, and in the end there was very little significant
enrichment/depletion.
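
As a rough sketch of the manual inspection described above, the following Python snippet tallies, per contig, the fraction of genes with a phage cluster hit and with a PFAM hit. The column order follows the list given above; everything else is an assumption for illustration: that contig header lines start with ">", that fields are separated by "|", and that "-" marks a missing hit. The example lines are made up, so check them against your own "VIRSorter_affi-contigs.tab" before relying on this.

```python
from typing import Dict, Iterable, Tuple


def hit_ratios(
    lines: Iterable[str], sep: str = "|", no_hit: str = "-"
) -> Dict[str, Tuple[int, float, float]]:
    """Return {contig: (n_genes, fraction with phage cluster hit, fraction with PFAM hit)}."""
    counts: Dict[str, list] = {}
    contig = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            # Assumed header layout: contig name is the first field after ">".
            contig = line[1:].split(sep)[0]
            counts[contig] = [0, 0, 0]
            continue
        fields = line.split(sep)
        if contig is None or len(fields) < 12:
            continue  # skip anything that does not look like a gene line
        counts[contig][0] += 1
        if fields[5] != no_hit:   # column 6: hit in phage cluster database
            counts[contig][1] += 1
        if fields[9] != no_hit:   # column 10: hit in PFAM
            counts[contig][2] += 1
    return {c: (n, v / n, p / n) for c, (n, v, p) in counts.items() if n > 0}


# Made-up example lines in the assumed layout (not real VirSorter output).
example = [
    ">contig_1|3|l",
    "contig_1-gene_1|1|300|300|+|Phage_cluster_12|85.2|1e-20|1|-|-|-",
    "contig_1-gene_2|350|900|551|-|-|-|-|-|PF00001|60.1|1e-10",
    "contig_1-gene_3|950|1500|551|+|-|-|-|-|-|-|-",
]

ratios = hit_ratios(example)
for name, (n, viral, pfam) in ratios.items():
    print(f"{name}: {n} genes, {viral:.0%} phage cluster hits, {pfam:.0%} PFAM hits")
```

For a real run, the same function could be called on the open file, e.g. `hit_ratios(open("Metric_files/VIRSorter_affi-contigs.tab"))`, and contigs with a very low PFAM fraction picked out for a closer look.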
Hopefully this is somewhat clear, please let me know if you have any
question or if the data doesn't seem to match my assumptions :-) Btw, are
these metagenomes anything special, e.g. samples filtered in a specific way
?
|
Hi Jarrod, Best, |
Hi Simon,
Firstly, thank you for this tool - it has been of great use in a viral identification pipeline. I had a question about the virome decontamination setting for the CyVerse implementation of VirSorter. I ran a metagenome generated from a sample, collected on a 0.1μm filter, that was not treated or processed in any way to specifically enrich for viruses. Out of curiosity, I ran VirSorter on this metagenome with and without the virome decontamination function. In the case of the former, cat-1 and cat-2 predictions were more than double those obtained if the virome decontamination setting was not applied. Could you please comment as to what may be going on here? I am not sure what approach is best for my data.
Thank you for your time!