Virome decontamination setting on non-viral metagenomes #40

Open
Binvir opened this issue May 17, 2019 · 9 comments

Comments

@Binvir

Binvir commented May 17, 2019

Hi Simon,

Firstly, thank you for this tool - it has been of great use in a viral identification pipeline. I had a question about the virome decontamination setting for the CyVerse implementation of VirSorter. I ran a metagenome generated from a sample, collected on a 0.1μm filter, that was not treated or processed in any way to specifically enrich for viruses. Out of curiosity, I ran VirSorter on this metagenome with and without the virome decontamination function. In the case of the former, cat-1 and cat-2 predictions were more than double those obtained if the virome decontamination setting was not applied. Could you please comment as to what may be going on here? I am not sure what approach is best for my data.

Thank you for your time!

@simroux
Owner

simroux commented May 18, 2019

Hi!
So this is something we've seen in the past: if your dataset contains a large share of viruses (~30% or more), then "virome decontamination" works better than "regular" mode. This is the case for viromes, but also typically for small size fractions (see e.g. https://www.nature.com/articles/s41564-018-0225-4). So the short answer is: "virome decontamination" mode is probably best for your data.

Best,
Simon

@Binvir
Author

Binvir commented May 20, 2019

Hi Simon,

Thank you for the informative and punctual reply - I will proceed with using the virome decontamination setting for my future VirSorter runs.

@Binvir Binvir closed this as completed May 20, 2019
@jarrodscott

Hi @simroux and @Binvir

I realize this issue is closed but I had a related question that seems to fit best here. I have two non-viral metagenome assemblies that are part of the same study. After a normal VirSorter run I get this message for only one of the assemblies:

More than 25% of the bp in contigs >= 10kb were predicted as viral (estimated ratio: 28.77%)...
You may want to use the virome decontamination mode on this dataset, as it seems to have lot of viruses

My question is whether it is better to use --virome decontamination on both datasets for consistency or use it only on the one high in putative virus contigs?
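
For readers wondering what the warning measures, the ratio can be approximated along these lines. This is a minimal sketch of what the message describes (the share of base pairs in contigs >= 10kb that sit in predicted viral contigs), not VirSorter's own code. The function name, the input structures, and the choice to count whole contigs as viral are assumptions made for illustration.

```python
def viral_bp_ratio(contig_lengths, viral_contigs, min_len=10_000):
    """Percent of bp in contigs >= min_len that belong to predicted viral contigs.

    contig_lengths: dict of contig ID -> length in bp (hypothetical input)
    viral_contigs:  set of contig IDs predicted as viral (hypothetical input)
    """
    long_contigs = {c: n for c, n in contig_lengths.items() if n >= min_len}
    total_bp = sum(long_contigs.values())
    if total_bp == 0:
        return 0.0
    viral_bp = sum(n for c, n in long_contigs.items() if c in viral_contigs)
    return 100.0 * viral_bp / total_bp


# Toy example with made-up contig lengths; c3 is ignored because it is < 10kb
lengths = {"c1": 12_000, "c2": 30_000, "c3": 8_000, "c4": 18_000}
ratio = viral_bp_ratio(lengths, {"c1", "c3"})
print(f"estimated ratio: {ratio:.2f}%")
if ratio > 25:
    print("more than 25% of bp predicted as viral: consider --virome")
```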

@simroux simroux reopened this Dec 10, 2019
@simroux
Owner

simroux commented Dec 10, 2019

Hi Jarrod,

That is a good question. The "warning" message is a recent addition, so I'm reopening the issue as it may be useful to a number of folks.

My recommendation here would be to re-run VirSorter in virome decontamination mode for this one assembly, and compare the results to the "regular" mode. By comparing the results, I mean counting how many additional viral contigs are identified, and manually inspecting a few of them. If these look like genuine viruses, then I would recommend running everything as "virome decontamination", for consistency.

A quick background on this: regular VirSorter estimates a number of parameters (ratio of genes with PFAM hits, ratio of genes with viral hits, average gene size, etc.) from the data themselves. The original idea was that it would be more accurate to estimate these from the whole (mostly microbial) dataset, and then look for sequences that look "more viral than the average". In some cases, e.g. if a dataset has a substantial portion of viruses, this approach fails as the "average" is already viral. So the "virome decontamination" mode uses pre-computed parameters from bacterial/archaeal genomes in RefSeq instead of estimates from the data. In the end, this virome decontamination mode tends to work with all kinds of datasets (microbial or viral), but is sensitive to "unusual" microbial genomes (i.e. those not well captured by RefSeq genomes).
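
The effect of the background choice can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the thread does not say which statistical test VirSorter uses, so a plain binomial lower-tail probability stands in for its depletion statistic, and the gene counts and background rates are made up.

```python
from math import comb


def depletion_pvalue(hits, genes, background_rate):
    """Lower-tail probability P(X <= hits) for X ~ Binomial(genes, background_rate)."""
    return sum(
        comb(genes, k) * background_rate**k * (1 - background_rate) ** (genes - k)
        for k in range(hits + 1)
    )


# Hypothetical contig: 20 genes, only 2 of them with a PFAM hit
genes, pfam_hits = 20, 2

# Made-up background PFAM hit rates (assumptions, not values from VirSorter)
dataset_rate = 0.20    # estimated from a virus-rich dataset ("regular" mode idea)
reference_rate = 0.70  # pre-computed from microbial genomes ("virome" mode idea)

p_regular = depletion_pvalue(pfam_hits, genes, dataset_rate)
p_virome = depletion_pvalue(pfam_hits, genes, reference_rate)

print(f"vs dataset background:   p = {p_regular:.3g}")
print(f"vs reference background: p = {p_virome:.3g}")
```

With these toy numbers the same contig is not significantly depleted in PFAM hits against the virus-rich dataset average, but is strongly depleted against the microbial reference rate, which is the failure mode described above.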

@jarrodscott

jarrodscott commented Dec 10, 2019

Hi Simon,
Excellent, good idea.
As a preemptive measure, I ran both assemblies using regular and --virome :-) and the analyses just finished earlier today. For both assemblies the --virome results have ~5x the number of entries in the VIRSorter_global-phage-signal file as the regular settings. I am looking through the data now. I will provide a summary here when I am finished...

for the record, here are the commands I ran for each assembly...

wrapper_phage_contigs_sorter_iPlant.pl -f WA-contigs.fa --ncpu 25 --db 2 --wdir VIRSORTER/WA --data-dir virsorter-data --diamond
wrapper_phage_contigs_sorter_iPlant.pl -f WA-contigs.fa --ncpu 25 --db 2 --wdir VIRSORTER_VIROME/WA --data-dir virsorter-data --diamond --virome

@jarrodscott

Hi Simon

Here are some details. Maybe too many, maybe too few :)

The two datasets:

EP: 753612 contigs
WA: 574305 contigs. This was the assembly VirSorter flagged for decontamination.

For the EP dataset, only 12 hits were detected in the regular and not in the virome. Similarly in the WA dataset, only 9 hits were detected in the regular and not in the virome. So, --virome seems to pick up everything regular does. When the two methods overlap, they appear to assign the same category. I eyeballed this, so not quantitative.
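
To put numbers on the overlap instead of eyeballing it, a comparison along these lines could work. This is a minimal sketch: it assumes the contig IDs and categories from each run have already been parsed into dictionaries. The parsing step, the `compare_modes` helper, and the toy IDs are all hypothetical and not part of VirSorter.

```python
def compare_modes(regular, virome):
    """Compare two runs given as dicts of contig ID -> category."""
    reg_ids, vir_ids = set(regular), set(virome)
    shared = reg_ids & vir_ids
    # Shared contigs that were assigned the same category in both runs
    same_category = {c for c in shared if regular[c] == virome[c]}
    return {
        "only_regular": sorted(reg_ids - vir_ids),
        "only_virome": sorted(vir_ids - reg_ids),
        "shared": len(shared),
        "same_category": len(same_category),
    }


# Toy data: contig IDs and categories are made up
regular_calls = {"c1": 1, "c2": 2, "c3": 2}
virome_calls = {"c1": 1, "c2": 1, "c4": 3, "c5": 2}

summary = compare_modes(regular_calls, virome_calls)
for key, value in summary.items():
    print(key, value)
```

Prophage predictions may need extra care, since a fragment of a contig could be named differently from the full contig between runs.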

The one thing I am not sure about is the "best way" to test the hits. Suggestions? Straight up blastp against nr or RefSeq? I tested a handful of genes from a handful of contigs and basically most hits are low percent identity and hypothetical proteins from an assortment of taxa. Some high percent phage hits. Definitely not seeing anything that screams microbe and not virus.

This table breaks down the number of hits by treatment for each category in each dataset.

| category | EP-REGULAR | EP-VIROME | per_change | WA-REGULAR | WA-VIROME | per_change |
|---|---:|---:|---:|---:|---:|---:|
| 1_Complete_phage_contigs_cat_1_sure | 1735 | 5779 | 233.08 | 2163 | 6128 | 183.31 |
| 2_Complete_phage_contigs_cat_2_some_what_sure | 9869 | 41561 | 321.13 | 8956 | 37571 | 319.51 |
| 3_Complete_phage_contigs_cat_3_not_so_sure | 285 | 3596 | 1161.75 | 145 | 4238 | 2822.76 |
| 4_Prophages_cat_1_sure | 8 | 2 | -75.00 | 70 | 0 | -100.00 |
| 5_Prophages_cat_2_some_what_sure | 54 | 25 | -53.70 | 244 | 23 | -90.57 |
| 6_Prophages_cat_3_not_so_sure | 4 | 19 | 375.00 | 4 | 32 | 700.00 |
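
The per_change columns are consistent with a percent change relative to the regular run. A short sketch of that arithmetic, using counts from the table (the function name is just for illustration):

```python
def percent_change(regular, virome):
    """Percent change from the regular count to the --virome count."""
    return round(100.0 * (virome - regular) / regular, 2)


# (regular, virome) counts taken from the table above
ep_cat1 = percent_change(1735, 5779)
wa_cat3 = percent_change(145, 4238)
wa_prophage_cat1 = percent_change(70, 0)

print(ep_cat1, wa_cat3, wa_prophage_cat1)
```

The function would fail with a division by zero for a category with no hits in the regular run, which does not occur in this table.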

@simroux
Owner

simroux commented Dec 11, 2019

Hi Jarrod,
Thanks for the additional information :-) Here is how I typically look at these cases:

  • The file "VIRSorter_affi-contigs.tab" in the folder "Metric_files" has the gene-by-gene annotation that VirSorter uses to make its calls. Columns are: Gene ID, Start, Stop, Length, Strand, Hit in phage cluster database, score, e-value, Phage cluster category, Hit in pfam, score, e-value. You should be able to "grep" individual contigs from this file (adding "-gene" at the end of the contig name if needed). I like to look at this file because these are the exact data VirSorter looked at to calculate enrichment / depletion statistics, and so if a contig is almost entirely unknown, with e.g. only ~ 10% PFAM, but was not considered as "depleted in PFAM hits" in the regular mode, I know that I should use the virome decontamination mode.

  • In your case, based on the results you've seen here for category 1 sequences, it seems like you should use the virome decontamination mode. Basically VirSorter uses 2 types of metrics: viral hallmark genes (which work the same in regular vs virome mode since it's simple presence/absence) and enrichment/depletion stats (which will be different between regular and virome modes). For a contig to be category 1, it needs to be significant in some enrichment stat + have ≥1 hallmark gene. The fact that you find "new" category 1 contigs means that there were sequences that have a hallmark gene (so most are likely viral) and were not considered enriched in viral genes or depleted in PFAM in regular mode, but are considered enriched/depleted in virome mode. That suggests to me that the background stats computed from the whole dataset were too "stringent", i.e. the overall percentage of phage cluster & PFAM affiliation was too similar to a "normal" virus genome, and in the end there was very little significant enrichment/depletion.
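
The manual check on the gene-by-gene annotation can be sketched as below. It uses the column order listed above, but the "|" field separator, the "-" placeholder for "no hit", and the example gene rows are assumptions for illustration, so check them against your own VIRSorter_affi-contigs.tab before relying on this. The category 1 rule is a direct restatement of the description above, not VirSorter's code.

```python
PHAGE_HIT_COL = 5  # "Hit in phage cluster database"
PFAM_HIT_COL = 9   # "Hit in pfam"


def hit_fraction(gene_lines, column, sep="|", no_hit="-"):
    """Fraction of genes whose field in `column` is not the no-hit placeholder."""
    rows = [line.rstrip("\n").split(sep) for line in gene_lines if line.strip()]
    if not rows:
        return 0.0
    return sum(1 for fields in rows if fields[column] != no_hit) / len(rows)


def is_category_1(hallmark_genes, enrichment_significant):
    """Category 1 as described above: >= 1 hallmark gene plus a significant enrichment stat."""
    return hallmark_genes >= 1 and enrichment_significant


# Made-up gene rows for one contig, e.g. as pulled out with grep
genes = [
    "contigA-gene_1|1|900|900|+|Phage_cluster_12|85.0|1e-20|1|-|-|-",
    "contigA-gene_2|950|1500|551|+|-|-|-|-|PF00589|60.2|1e-15",
    "contigA-gene_3|1600|2400|801|-|-|-|-|-|-|-|-",
    "contigA-gene_4|2500|3000|501|-|-|-|-|-|-|-|-",
]

pfam_fraction = hit_fraction(genes, PFAM_HIT_COL)
phage_fraction = hit_fraction(genes, PHAGE_HIT_COL)
print(f"PFAM hits: {pfam_fraction:.0%}, phage cluster hits: {phage_fraction:.0%}")
```

A contig with a low PFAM fraction that regular mode did not flag as "depleted in PFAM hits" is the kind of case where the virome decontamination mode is worth trying.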

Hopefully this is somewhat clear, please let me know if you have any questions or if the data don't seem to match my assumptions :-) Btw, are these metagenomes anything special, e.g. samples filtered in a specific way?

@jarrodscott

jarrodscott commented Dec 11, 2019 via email

@simroux
Owner

simroux commented Dec 11, 2019

Hi Jarrod,
"filtered through 0.22 micro filters" is the key thing here, and the results make a lot of sense. Metagenomes from cells collected on 0.22 µm filters look like "regular" metagenomes, but metagenomes from the filtrate that passes through a 0.22 µm filter typically have tons of viruses and look more like a viral metagenome, even if no other steps are taken to enrich for viral particles. We've seen this in the past, e.g. in the rumen microbiome (https://www.nature.com/articles/s41564-018-0225-4), and we also had to use the virome decontamination mode on these size fractions :-)

Best,
Simon
