This package is built around a collection of publicly available tools and personal scripts tied together for analyzing metagenomic microbiome datasets.
The pipeline was developed by QiLi (liqi at, Lab of Algal Genomics).
External software used by this pipeline is copyrighted by its respective authors.
- 1.Preprocess
- 2.Assembly
- 3.Binning
- 4.Taxonomy
- 5.Bins Evaluation
- 6.Phylogenetic analysis
- 7.Functional analysis
Poor-quality or technical sequences, such as adapters, in the sequencing data can easily lead to suboptimal downstream analyses. Many read preprocessing tools are available for quality control, such as FastQC and Trimmomatic. Here we use Trimmomatic to clean our sequencing datasets, for example:
java -jar xx/software/trimmomatic-0.xx.jar PE -threads 30 -phred64 xx/R1.fq xx/R2.fq R1_paired_trimmed.fq R1_unpaired_trimmed.fq R2_paired_trimmed.fq R2_unpaired_trimmed.fq ILLUMINACLIP:xx/Trimmomatic-0.xx/adapters/TruSeqxxx.fa:2:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:20
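The `SLIDINGWINDOW:4:20` and `MINLEN:20` steps in the command above can be illustrated with a short Python sketch. It is a simplified model of sliding-window quality trimming, not Trimmomatic's actual implementation. The function names are hypothetical, and the default offset of 64 corresponds to the `-phred64` option.

```python
from typing import List, Optional, Tuple


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def sliding_window_cut(scores: List[int], window: int = 4,
                       min_avg: float = 20.0) -> int:
    """Return the number of leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window
    whose mean quality drops below min_avg (a simplification).
    """
    if len(scores) < window:
        return len(scores)  # too short to evaluate a full window
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_avg:
            return start
    return len(scores)


def trim_read(seq: str, quality: str, offset: int = 64, window: int = 4,
              min_avg: float = 20.0,
              min_len: int = 20) -> Optional[Tuple[str, str]]:
    """Trim one read; return None if it ends up shorter than min_len."""
    keep = sliding_window_cut(phred_scores(quality, offset), window, min_avg)
    if keep < min_len:
        return None
    return seq[:keep], quality[:keep]
```

Calling `trim_read(seq, quality)` on each record of a FASTQ file would give either a trimmed read or `None` for reads that should be discarded.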
There are two main approaches to sequence assembly: the overlap/layout/consensus approach and the de Bruijn graph approach. Because of the characteristics of next-generation sequencing platforms and the incompleteness of reference genomes, de novo assembly is appropriate for metagenomic assembly. Based on our experience, we use the SPAdes Genome Assembler to assemble our data. Run SPAdes as follows:
-o output -k 21,33,55 --pe1-1 xx/R1_paired_trimmed.fq --pe1-2 xx/R2_paired_trimmed.fq --pe1-s xx/R1R2_unpaired_trimmed.fq --mp1-1 xx/MP_R1_paired_trimmed.fq --mp1-2 xx/MP_R2_paired_trimmed.fq --mp1-s xx/MP_R1R2_unpaired_trimmed.fq --pacbio xx/filtered_subreads.fastq --careful --cov-cutoff 3 --disable-gzip-output -t 20 -m 500
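To make the de Bruijn graph approach mentioned above more concrete, here is a minimal Python sketch that builds such a graph from reads for a single k-mer size. It only illustrates the data structure and is not how SPAdes is implemented. The function name `de_bruijn_graph` is hypothetical, and real assemblers add error correction, graph simplification and several k values (such as the `-k 21,33,55` list above).

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, Dict[str, int]]:
    """Build a de Bruijn graph whose nodes are (k-1)-mers.

    Each k-mer in a read adds one edge from its (k-1)-base prefix to its
    (k-1)-base suffix. Edge counts record how often the k-mer was seen.
    """
    graph = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return {prefix: dict(successors) for prefix, successors in graph.items()}


if __name__ == "__main__":
    # Two overlapping toy reads; a contig corresponds to a path through the graph.
    for node, successors in de_bruijn_graph(["ATGGC", "TGGCA"], k=3).items():
        print(node, "->", successors)
```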
To bin metagenomic sequences, we first need to collect some features of the assembled fragments. The output of some assemblers, such as SPAdes, Velvet and ABySS, provides the coverage value of each contig in the header. If the assembly contains no coverage information, we can map reads to contigs using BWA or Bowtie to obtain it instead. For each contig, GC content, length and coverage are extracted from the assembly file using my Perl script. The script outputs a file with 4 columns: contig name, contig length, GC content and coverage. Additionally, we use essential gene sets to recover some assembly fragments from the unassigned sequences. The essential gene information contained in the assembly is obtained by the same Perl script:
-r scaffolds.fasta -1 R1_paired_trimmed.fq -2 R2_paired_trimmed.fq -m 500 -I 0 -X 600 -p phred64 -t 30
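The 4-column table described above (contig name, length, GC content, coverage) can be sketched in Python as follows. This is a stand-in for illustration, not the Perl script used by the pipeline. It assumes SPAdes-style headers of the form `NODE_<id>_length_<len>_cov_<cov>`, reports GC as a percentage of all bases, and uses hypothetical function names.

```python
import re
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumption: coverage is embedded in the header as "_cov_<number>".
COV_PATTERN = re.compile(r"_cov_([0-9]+(?:\.[0-9]+)?)")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_features(
    lines: Iterable[str],
) -> List[Tuple[str, int, float, Optional[float]]]:
    """Return (name, length, GC percent, coverage) for each contig.

    Coverage is None when the header carries no "_cov_" field, in which
    case it would have to come from read mapping instead.
    """
    rows = []
    for name, seq in read_fasta(lines):
        seq = seq.upper()
        gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
        match = COV_PATTERN.search(name)
        cov = float(match.group(1)) if match else None
        rows.append((name, len(seq), round(gc, 2), cov))
    return rows
```

With a file on disk, `contig_features(open("scaffolds.fasta"))` would return one row per contig, ready to be written out as a tab-separated table.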
Binning the metagenomic assembly using selected features.
Selecting the contigs in each area may be time-consuming, depending entirely on the complexity of the metagenomics project. Sometimes you need to combine more than one level value to obtain the optimal initial group.
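As a rough illustration of selecting contigs in an area, the sketch below picks contigs whose GC content and coverage fall inside a rectangular window. It assumes that an "area" means a region of the GC-versus-coverage plane and that rows follow the 4-column layout (name, length, GC content, coverage). The function name, the window bounds and the minimum length are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, int, float, Optional[float]]


def select_contigs(rows: Sequence[Row],
                   gc_range: Tuple[float, float],
                   cov_range: Tuple[float, float],
                   min_length: int = 0) -> List[str]:
    """Return names of contigs inside the given GC and coverage window."""
    gc_lo, gc_hi = gc_range
    cov_lo, cov_hi = cov_range
    return [
        name
        for name, length, gc, cov in rows
        if length >= min_length
        and cov is not None  # contigs without coverage cannot be placed
        and gc_lo <= gc <= gc_hi
        and cov_lo <= cov <= cov_hi
    ]


if __name__ == "__main__":
    # Toy rows, invented purely for illustration.
    table = [
        ("c1", 5000, 40.0, 30.0),
        ("c2", 800, 41.0, 29.0),
        ("c3", 6000, 65.0, 30.0),
        ("c4", 7000, 42.0, 100.0),
    ]
    print(select_contigs(table, gc_range=(35, 45), cov_range=(20, 40),
                         min_length=1000))
```

Several windows can be combined by taking the union of the returned name lists, which is one way to assemble an initial group from more than one region.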
We assigned taxonomy to the genome bins using TAXAassign, with some code modified for efficiency and accuracy. Instead of nucleotide sequences searched with BLASTN, we used deduced amino acid sequences searched with DIAMOND BLASTP to produce protein sequence alignments against the NCBI non-redundant (nr) protein database. For example:
-c 30 -r 100 -t 60 -m 60 -q 50 -a "60,65,70,80,95,95" -f All_bins.faa
Since genomes reconstructed from metagenomic data usually vary substantially in quality, we proposed a set of quality criteria with quantitative thresholds to evaluate these genomes for subsequent analyses.
Other related software: CheckM.
To infer phylogenetic relationships among bacteria, a whole-genome based and alignment-free Composition Vector Tree (CVTree) method was applied to the comparison and clustering of the 68 genomes we extracted from our assemblies.
cvtree -i species.list -p data -o CVTree_k6.txt -k 6
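The idea behind an alignment-free, composition-based comparison can be sketched in Python as below. This is a deliberately simplified version: it compares raw k-mer frequencies of protein sequences, whereas the CVTree method additionally subtracts a background predicted from shorter k-mers before computing the correlation. The function names are hypothetical, and the distance `(1 - C) / 2` follows the form used in the CVTree publications.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def composition_vector(proteins: Iterable[str], k: int) -> Dict[str, float]:
    """Relative frequency of every overlapping k-mer across all proteins."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cv_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Distance (1 - C) / 2, where C is the cosine of the two vectors.

    With raw (non-negative) frequencies the result lies between 0 and 0.5.
    """
    dot = sum(value * b.get(kmer, 0.0) for kmer, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an empty composition vector")
    return (1.0 - dot / (norm_a * norm_b)) / 2.0
```

Computing `cv_distance` for every pair of genomes gives a distance matrix, from which a tree can be built with a standard method such as neighbor joining.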
Other related software: PHYLIP, MEGAN.
Functional characterization and annotation of protein-encoding genes were performed with MOCAT2. To further compare the functional potential of each group, the predicted ORFs were analyzed using the GhostKOALA service on the KEGG website. When the results are returned by email, open the links and compare the pathways of interest.
Other related software: MEGAN.
