
Small pipeline to cluster viral genomes based on their k-mer content. WiP

rnajena/viralclust

ViralClust - Find representative viruses for your dataset

License: GPL v3 | Python 3.8 | Nextflow | conda


DISCLAIMER

This pipeline is work in progress. Some bugs are known to me, others are not. Before getting desperate, please check the Issues that have already been opened and discussed. If you can't find your problem there, don't hesitate to send me an email or open an issue yourself. I am not responsible for any results produced with ViralClust, nor for the conclusions you draw from them.


Overview: What is this about?

Have you ever wanted to compare your specific virus of interest with all other viruses from its genus? Or even its family? For some taxonomic clades, many different genomes are available that can be used for comparative genomics.

However, more often than not, viral genome datasets are redundant and thus introduce bias into your downstream analyses. Think about a consensus genome of Flavivirus built from 2,000 Dengue virus genomes and 5 Zika virus genomes. You may start to see the problem here. Clustering the input sequences is a good way to remove this redundancy. However, depending on the scientific question at hand, it is hard to determine whether a given cluster algorithm is appropriate.

Thus, ViralClust was developed: a Nextflow pipeline that runs several cluster methods and implementations on your dataset at once. Combining the results with meta information from the NCBI allows you to explore the representative genomes produced by each tool and to decide on the clustering that fits your question.

For example, clustering all available Filoviridae with cd-hit-est usually leads to a large cluster containing all Zaire Ebola viruses, which can be valuable if you want to compare this species as a whole. If you are interested in subtle changes within the species, you may want to use another approach that divides the "Zaire cluster" into smaller sub-clusters representing different outbreaks and epidemics.


Installation

In order to run ViralClust, I recommend creating a conda environment dedicated to Nextflow. Of course, you can install Nextflow on your system however you like, but since some users do not have sudo permissions, the conda way has proved to be the simplest.

  • First install conda on your system: You'll find the latest installer here.
  • Next, make sure that conda is part of your $PATH variable, which is usually the case. For any issues, please refer to the respective installation guide.

Warning: Currently, ViralClust runs on Linux systems. Windows and macOS support may follow at some point, but neither has been tested so far.

  • Create a conda environment and install Nextflow within this environment:

    conda create -n nextflow -c bioconda nextflow
    conda activate nextflow

    Alternative: You can also use the environment.yml provided, after cloning the repo:

    conda env create -f environment.yml
  • Clone the GitHub repository for the latest version of ViralClust, or download the latest stable release version here.

    `git clone https://github.com/klamkiew/ViralClust.git && cd ViralClust`
  • Done!


Quickstart

You may ask yourself how to run ViralClust now. First of all, you do not have to worry about any dependencies, since Nextflow takes care of them via individual conda environments for each step. You only need a stable internet connection when you run ViralClust for the very first time. If you have not done so yet, now is the time to activate your conda environment:

conda activate nextflow

And we're ready to go!

nextflow run viralclust.nf --fasta "data/test_input.fasta"

This might take a little while, since the individual conda environments for each step of ViralClust are created on the first run. In the meantime, let's talk about parameters and options.

Parameters & Options

Let us briefly go over the most important parameters and options. A more detailed overview of all flags and parameters is available in the help message of the pipeline, which is also reproduced at the end of this file.

Input sequences: --fasta <PATH>

--fasta <PATH> is the main parameter you have to set. It tells ViralClust where your genomes are located. <PATH> refers to a multi-FASTA file that stores all the genomes you want to cluster.
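To make the expected input format concrete, here is a minimal sketch of reading a multi-FASTA file into memory. The function name `read_fasta` and the toy records are purely illustrative and are not part of ViralClust; the pipeline's own parsing may differ.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse multi-FASTA lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append them to the current record.
            records[header] += line.upper()
    return records


# Toy input (hypothetical identifiers and sequences).
example = """>genome_A
ACGTAC
GTTT
>genome_B
ggcc
""".splitlines()

genomes = read_fasta(example)
print(len(genomes), "genomes loaded")
```

Each line starting with `>` opens a new record, and all following lines up to the next `>` belong to that genome.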

Specific genomes of interest: --goi <PATH>

--goi <PATH> is similar to the --fasta parameter, but the sequences stored in this specific fasta file are your genomes of interest, or GOI for short. Using this parameter tells ViralClust to include all genomes present in goi.fasta in the final set of representative sequences. You have a secret in-house lab strain that is not published yet? Put it in your goi.fasta.

Evaluate and rate cluster: --eval and --ncbi

--eval and --ncbi are two parameters that do more for you than just clustering. Since ViralClust runs several clustering algorithms, it can be hard to decide which one produced the most appropriate results. Worry not, since --eval is here to help you. In addition to the clustering results, you'll get a brief overview of the clusters produced by the different algorithms. With --ncbi enabled, ViralClust further scans your genome identifiers (the lines in your fasta file starting with >) for GenBank accession IDs and uses them to retrieve further information from the NCBI about the taxonomy of the sequence, as well as the accession date and country. Note that using --ncbi implicitly also sets --eval.
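As a rough illustration of what scanning identifiers for accession IDs can look like, the sketch below uses a regular expression. The pattern is an assumption about common GenBank/RefSeq nucleotide accession shapes (one or two letters plus digits, or a two-letter prefix with an underscore, each with an optional version suffix); it is not the expression ViralClust itself uses, and the headers are made up.

```python
import re
from typing import Optional

# Assumed accession shapes: "AB123456.1"-style or "NC_123456.2"-style,
# with the ".version" suffix being optional.
ACCESSION = re.compile(r"\b(?:[A-Z]{2}_\d{6,9}|[A-Z]{1,2}\d{5,8})(?:\.\d+)?\b")


def extract_accession(header: str) -> Optional[str]:
    """Return the first accession-like token in a FASTA header, or None."""
    match = ACCESSION.search(header)
    return match.group(0) if match else None


# Hypothetical headers for illustration only.
headers = [
    ">AB123456.1 hypothetical virus isolate X, complete genome",
    ">NC_123456.2 hypothetical reference genome",
    ">lab_strain_01 unpublished in-house sequence",
]
found = [extract_accession(h) for h in headers]
print(found)
```

Headers without an accession-like token, such as an unpublished lab strain, simply yield `None`, so no meta information could be looked up for them.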

Update the NCBI metainformation database: --update_ncbi

--update_ncbi is used whenever you need to update the database of ViralClust. The first time you run the pipeline with --ncbi enabled, this is done automatically for you. Each viral GenBank entry currently available from the NCBI is processed, and for each entry ViralClust stores the accession ID, taxonomy, accession date and accession country for future use.

Specify the output path: --output <PATH>

--output <PATH> specifies the output directory where all results are stored. By default, this is a folder called ViralClust_result, which is created in your current working directory.

Determine the numbers of cores used: --cores and --max_cores

--max_cores sets the maximum number of CPU cores used overall, and --cores sets the maximum number of cores used by one individual process. The default values cause ViralClust to use all available cores, but for each individual step in the pipeline, only 1 core is used.

There are many more parameters, especially ones directly connected to the behaviour of Nextflow, which are not explained here. For those, please refer to the clustering section and the complete help message of ViralClust.
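The interplay of the two values can be sketched with simple arithmetic. This assumes that the scheduler fills the overall core budget with processes that each request `--cores` CPUs, which is a simplification and not a statement about how Nextflow schedules in every case; the function name is hypothetical.

```python
def max_parallel_processes(max_cores: int, cores: int) -> int:
    """Upper bound on concurrently running processes under the assumption above."""
    if cores < 1 or max_cores < cores:
        raise ValueError("need 1 <= cores <= max_cores")
    # Each process reserves `cores` CPUs out of a total budget of `max_cores`.
    return max_cores // cores


# With a budget of 8 cores and 1 core per process, up to 8 steps may run at once.
print(max_parallel_processes(max_cores=8, cores=1))
# Raising --cores to 3 leaves room for at most 2 concurrent processes.
print(max_parallel_processes(max_cores=8, cores=3))
```

In other words, increasing --cores speeds up individual multi-threaded steps at the cost of running fewer steps side by side.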


Cluster Tools

Since ViralClust is nothing without the great work of awesome minds, it is only fair to give credit where credit is due. Currently, five different approaches are used to cluster the input genomes. CD-HIT, Sumaclust and vsearch all implement the same algorithmic idea, but with subtle differences in their respective heuristics. I further utilize the clustering module of MMSeqs2. Last but not least, ViralClust implements a k-mer based clustering method, which is realized with the help of UMAP and HDBSCAN.

For all tools, the respective manual and/or GitHub page is linked. First, because I think all of them are great tools, which you are implicitly using by using ViralClust. Second, because ViralClust offers the possibility to set all parameters of all tools; if you need something very specific, you can check the respective documentation.

If you use any of the results provided by ViralClust in a scientific publication, I would be grateful to be cited. In my eyes, it is only fair that you cite not only ViralClust, but also the clustering method you ultimately decided on, even if ViralClust assisted you in the decision.
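To give an idea of what "k-mer based" means here, the sketch below turns a genome into a normalized k-mer frequency vector using only the standard library. It is an illustration under assumptions: the function name `kmer_profile` is hypothetical, the handling of ambiguous bases is my own choice, and the subsequent UMAP and HDBSCAN steps are not shown.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_profile(sequence: str, k: int = 3) -> List[float]:
    """Return relative frequencies of all 4**k DNA k-mers in a fixed order."""
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate k-mers over ACGT only; windows with other symbols (e.g. N) are ignored.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


profile = kmer_profile("ACGTACGT", k=2)
print(len(profile), "dimensions")
```

Every genome becomes a point in a 4^k-dimensional space regardless of its length, which is what makes the vectors suitable input for a dimension reduction followed by density-based clustering.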

  • CD-HIT:

    • Weizhong Li & Adam Godzik, "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences". Bioinformatics, (2006) 22:1658-9
    • Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics, (2012), 28 (23): 3150-3152
  • sumaclust:

    • Mercier C, Boyer F, Bonin A, Coissac E (2013) SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. Available: http://metabarcoding.org/sumatra.
  • vsearch:

    • Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584
  • MMSeqs2:

    • Steinegger, M., Söding, J. "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets". Nat Biotechnol 35, 1026–1028 (2017)
  • UMAP & HDBscan:

    • McInnes, L, Healy, J, "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction", ArXiv e-prints 1802.03426, 2018
    • L. McInnes, J. Healy, S. Astels, "hdbscan: Hierarchical density based clustering" In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017

Graphical Workflow

Workflow graph

Help Message

This section reproduces the help message of ViralClust.

____________________________________________________________________________________________

Welcome to ViralClust - your pipeline to cluster viral genome sequences once and for all!
____________________________________________________________________________________________

Usage example:
nextflow run viralclust.nf --update_ncbi

or

nextflow run viralclust.nf --fasta "genomes.fasta"

or both

nextflow run viralclust.nf --update_ncbi --fasta "genomes.fasta"

____________________________________________________________________________________________

Mandatory Input:
--fasta PATH                      Path to a multiple fasta sequence file, storing all genomes that shall be clustered.
                                  Usually, this parameter has to be set, unless the parameter --ncbi_update has been set.

Optional Input:
--goi PATH                        Path to a (multiple) fasta sequence file with genomes that have to end
                                  up in the final set of representative genomes, e.g. strains of your lab that are
                                  of special interest. This parameter is optional.
____________________________________________________________________________________________

Options:
--eval                            After clustering, calculate basic statistics of clustering results. For each
                                  tool, the minimum, maximum, average and median cluster sizes are calculated,
                                  as well as the average distance of two representative genomes.

--ncbi                            Additionally to the evaluation performed by --eval, NCBI metainformation
                                  is included for all genomes of the input set. Therefore, the identifier of fasta records are
                                  scanned for GenBank accession IDs, which are then used to retrieve information about the taxonomy,
                                  accession date and accession country of a sequence. Implicitly calls --eval.
                                  Attention: If no database is available at data, setting this flag
                                  implicitly sets --ncbi_update.

--ncbi_update                     Downloads all current GenBank entries from the NCBI FTP server and processes the data to
                                  the databank stored at data.

Cluster options:
--cdhit_params                    Additional parameters for CD-HIT-EST cluster analysis. [default -c 0.9]
                                  You can use nextflow run viralclust.nf --cdhit_help
                                  For more information and options, we refer to the CD-HIT manual.

--hdbscan_params                  Additional parameters for HDBscan cluster analysis. [default -k 7]
                                  For more information and options, please use
                                  nextflow run viralclust.nf --hdbscan_help.

--sumaclust_params                Additional parameters for sumaclust cluster analysis. [default -t 0.9]
                                  You can use nextflow run viralclust.nf --sumaclust_help.
                                  For more information and options, we refer to the sumaclust manual.

--vclust_params                   Additional parameters for vsearch cluster analysis. [default --id 0.9]
                                  You can use nextflow run viralclust.nf --vclust_help
                                  For more information and options, we refer to the vsearch manual.

--mmseqs_params                   Additional parameters for MMSeqs2 cluster analysis. [default --min-seq-id 0.9]
                                  You can use nextflow run viralclust.nf --mmseqs_help
                                  For more information and options, we refer to the MMSeqs2 manual.

Computing options:
--cores INT                       max cores per process for local use [default 1]
--max_cores INT                   max cores used on the machine for local use [default 8]
--memory INT                      max memory in GB for local use [default 16.GB]
--output PATH                     name of the result folder [default viralclust_results]
--permanentCacheDir PATH          location for auto-download data like databases [default data]
--condaCacheDir PATH              location for storing the conda environments [default conda]
--workdir PATH                    working directory for all intermediate results [default /tmp/nextflow-work-$USER]

Nextflow options:
-with-report rep.html             cpu / ram usage (may cause errors)
-with-dag chart.html              generates a flowchart for the process tree
-with-timeline time.html          timeline (may cause errors)
____________________________________________________________________________________________
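The cluster size statistics mentioned for --eval in the help message (minimum, maximum, average and median) can be sketched as follows. The data structure, a mapping from a representative to its cluster members, and all names are assumptions made for illustration; ViralClust's actual output format may differ.

```python
from statistics import mean, median
from typing import Dict, List


def cluster_size_stats(clusters: Dict[str, List[str]]) -> Dict[str, float]:
    """Summarize cluster sizes for one clustering result."""
    sizes = [len(members) for members in clusters.values()]
    return {
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
    }


# Hypothetical clustering: representative -> members (including itself).
clusters = {
    "rep1": ["rep1", "a", "b"],
    "rep2": ["rep2"],
    "rep3": ["rep3", "c"],
}
stats = cluster_size_stats(clusters)
print(stats)
```

Comparing these numbers across tools gives a first impression of whether a method tends to produce a few large clusters or many small ones.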