Skip to content


Folders and files

Last commit message
Last commit date

Latest commit



78 Commits

Repository files navigation

Metagenomics workflow from raw sequencing data to metagenome-assembled genomes (MAGs)

This workflow is generated to process the analysis for metagenomics sequencing samples in TEDDY study (dbGaP Study Accession: phs001443.v1.p1).

Data processing for metagenomics assembly

  • a. Tools
  • b. Download and PreProcess the Raw Data
  • c. Assembly
  • d. Quantification of scaffolds
  • e. ORF prediction and functional annotation with KEGG and MetaCyc
  • f. Genome binning and Quantification of MAGs
  • g. Protein clustering


  1. NCBI SRA Toolkit to download the sequencing file with sra format, and then convert it to fastq format
  2. Assembly tool metaspades
  3. pullseq to filter sequences by a minimum length or maximum length
  4. reads mapping and calculation: bowtie2 for read mappping shrinksam to remove sequences that failed to map to the reference add_read_count.rb
  5. Prodigal for ORF prediction
  6. KofamScan KEGG ko term annotation
  7. Diamond alignment
  8. BBtools

Download and PreProcess the Raw Data

We downloaded sequencing data from each subject separetly and covert it to fastq format using NCBI SRA Toolkit, then co-assemblied the samples in each subject. The above steps were run using script assembly.batch on OU OSCER server. The instructions of this script are as follows.


  • sratoolkit (tested v2.9.6-1)
  • SPAdes (tested v3.13.1)
  • Python (tested version: 2.7.5)

1. download data from dbGaP

$ prefetch --option-file sra_list.txt

# an example for `sra_list.txt`(all the srr samples in a subject):

2. convert sra files to fastq format

$ for i in *.sra
    fastq-dump --split-files --gzip $i

# left and right reads
$ cat *_1.fastq.gz >SUBJECTID_1.fastq.gz
$ cat *_2.fastq.gz >SUBJECTID_2.fastq.gz

In the following assembly process, we will co-assemly the samples from the same subject. Therefore, we also merged the fastq files from the same subject.

3. PreProcess the Raw Data

The metagenomics data from this dbGaP Study (phs001442) is well proprocessed with quality control and removing human genome sequences. If your data is the raw sequencing data without preprocess, BBtools could be used to do this step.

Assembly using metaspades

$ -1 /work/TEDDY/ncbi/dbGaP-21828/sra/SUBJECTID/SUBJECTID_1_.fastq.gz \
                -2 /work/TEDDY/ncbi/dbGaP-21828/sra/SUBJECTID/SUBJECTID_2_.fastq.gz \
                -o SUBJECTID \
                -t 20  

The default memorial is 250G, if you need a larger one, please add a parameter setting like -m 500 to 500G.

Notes: Output is a directory SUBJECTID, which including several output files from metaspades assembly process. In this directory, scaffolds.fasta file is the co-assemblied metagenome in this subject, and was used in the following analysis.

Script abundanceANDbinning.batch is used to calculate the read counts and RPKM abundance for each scaffold in a subject, and then do genome binning. The instructions of this batch script are as follows.


  • pullseq (tested v1.0.2)
  • Bowtie2 (tested v2.3.5.1)
  • samtools (tested v0.1.19)
  • shrinksam (tested v0.9.0)
  • MetaBAT 2 (tested v2.12.1)
  • CheckM (tested v1.1.2)
  • add_read_count.rb
  • HMMER (tested 3.2.1)
  • pplacer (tested 1.1.alpha19)
  • python v2.7

The inputs used in this batch script are SUBJECTID_scaffolds.fasta from above assembly process, fastq files for each SRA run sample (from the above download process), SUBJECTID_srr.txt including all the SRA accessions in this subject, total_reads including the total sequencing reads of each SRA run (obtained from the log file in download process). script is used when runing this batch script, needs to give a full path in the batch script or put into one of your $PATH directories.

The main outputs are mapped.counted.result file that shows read counts of each scaffold, (srrID)_rpkm.txtfile that shows rpkm abundance of each scaffold, output_bin directory that includes all the binned genomes (MAGs), checkm_wf_out checkM output directory.

Abundance Calculation for each scaffold in a subject

1. fillter scaffolds

#Select scaffolds that sequence length > 1000; 
$ pullseq -i SUBJECTID_scaffolds.fasta -m 1000 > SUBJECTID_scaffolds_min1000.fasta

2. map reads into scaffolds with bowtie2

#Create bowtie2 index file
$ bowtie2-build  SUBJECTID_scaffolds_min1000.fasta  SUBJECTID_min1000

# Sequence alignment for each SRR runs; the reference is the co-assemblied scaffolds of the subject where SRR runs from.

$ bowtie2 -x SUBJECTID_min1000 -1 SUBJECTID_eachSRR_1.fastq.gz \
                               -2 SUBJECTID_eachSRR_2.fastq.gz \
                               -S SUBJECTID_eachSRR_alignment.sam \
                               -p 19

# 'SUBJECTID_eachSRR_alignment.sam' is also the file used for genome binning
# You can obtain the number of aligned reads in the output file 
# and the number of total read can be obtained in the log file of assembly
# Mapping rate = the number of aligned reads/ the number of total readsRPKM Calculation (shrinksam)

3. Count Mapped Reads

# we use `shrinksam` to remove unmapped reads from bowtie2-created SAM files,
# it will generate a SAM file with only the reads that map to an assembled scaffold.

$ shrinksam -i SUBJECTID_eachSRR_alignment.sam > SUBJECTID_eachSRR_mapped.sam

# Then we count the reads numbers mapped to each scaffold through `add_read_count.rb`
$ add_read_count.rb SUBJECTID_eachSRR_mapped.sam  SUBJECTID_scaffolds_min1000.fasta > SUBJECTID_eachSRR.reads.counted

# we just filter the lines containing scaffold name in the output fasta files

$ grep -e ">" SUBJECTID_eachSRR.reads.counted > mapped.counted.result
$ sed -i "s/>//g" mapped.counted.result
$ sed -i "s/read_count_//g" mapped.counted.result

# an example of `mapped.count.result`
NODE_8_length_295416_cov_86.298909  read_length_150    1805
NODE_9_length_294755_cov_77.048012  read_length_150     267
NODE_10_length_287310_cov_147.7697  read_length_150     2469
NODE_11_length_283217_cov_82.5382   read_length_150     423
NODE_12_length_273911_cov_216.2412  read_length_150     16

4. RPKM Calculation for each assembled scaffold

$ Python mapped.counted.result total_reads_file srrID

# total_reads_file is a file that includes total reads number in each sample, an example:
10004483	SRR7768473
10005994	SRR7768567
10007320	SRR7768664

# srrID is just the srrID sample ID number (such as SRR7768473)

# to calculate the rpkm, scaffold length is all needed. 
# MetaSpades predicted scaffold name is like 'NODE_8_length_295416_cov_86.298909'.
# the number 295416 is the length of this scaffold.

Important notes: Here we only got RPKM for each scaffold; For the proteins in the same scaffod, they should have the same RPKM with this scaffold.

Genome binning

1.sam to sorted bam for binning

$ samtools view -bS SUBJECTID_eachSRR_alignment.sam -@ 19 > SUBJECTID_eachSRR_alignment.bam
$ samtools sort SUBJECTID_eachSRR_alignment.bam  -@ 19 -o SUBJECTID_eachSRR_sorted > SUBJECTID_eachSRR_sorted.bam

# SUBJECTID_eachSRR_alignment.sam is from Bowtie2 output result in read mapping process.

2. metabat binning

$ jgi_summarize_bam_contig_depths --outputDepth output_depth.txt *_sorted.bam
# *_sorted.bam are the bam files from the same subjects.

$ metabat -i SUBJECTID_scaffolds_min1000.fasta -o output_bin -a output_depth.txt -m 2000

# output_bins is the directory including all the binned genomes, each genome in a file.

3. qualtify for the binned genomes

$ checkm lineage_wf -f CheckM.txt -t 10 -x fa --pplacer_threads 1 output_bins checkm_wf_out 

Protein prediction and read counts for each protein (ORF)

1. Protein prediction through Prodigal

The batch script is prodigal.batch, The Requirment is Prodigal (tested v2.6.3)

$ prodigal -i SUBJECTID_scaffolds_min1000.fasta \
           -a SUBJECTID_scaffolds_min1000_protein.fasta \
           -p meta -f gff -o SUBJECTID_predicted.gff

# input file is the scaffolds of each subject ``
# two output files: 
`SUBJECTID_scaffolds_min1000_protein.fasta` (protein translations file); 
`SUBJECTID_predicted.gff` (inforamtion for each CDS, we will use the CDS length to get the protein abundance) 

2.1 Map scaffold RPKM to each protein that located in this scaffold.

The principle is that for the proteins in the same scaffod, they should have the same RPKM with this scaffold. This is analysis for metagenome, which is not same with metatranscriptomics.

2.2 read counts for each protein

Some statistic tools, such as DEGseq2, they use the read counts (not the normalized RPKM) as the inputs here we add a method to get protein read counts directly from the scaffold read counts in the above. We divide the scaffold read counts into each protein according to their CDS length.

Functional Annotation of the predicted ORF

1. KEGG annotation through KofamScan

$ kofam_scan/exec_annotation -o SUBJECTID_kofam.txt SUBJECTID_scaffolds_min1000_protein.fasta --tmp-dir tmp_dirSUBJECTID --cpu 10 

map the annotated KO term into pathways through KEGG Mapper:

2. Metacyc reaction annotation by aligning database from Metacyc pathway database

$ diamond blastp --query SUBJECTID_scaffolds_min1000_protein.fasta \
                 --db uniprot-seq-ids.fasta \
                 --out Metacyc_protein_top5hit.blst \
                 --outfmt 6  --evalue 0.001 --max-target-seqs 5 --sensitive
$ python Metacyc_protein_top5hit.blst Metacyc_protein_RXN_key_sen

Protein clustering

cd-hit.batch was used to perform hierarchical clustering of all proteins on OU OSCER server. The instructions of this script are as follows.


  • CD-HIT (tested v4.8.1)
# 90% identity
$ cd-hit -i full_length_min50.fa \
         -o nr_90_0.9 \
         -c 0.9 -n 5  -d 0  -g 1 -p 1 -T 35 -M 0 -G 0 -aS 0.9 -aL 0.9 > nr_90_0.9.log
$ cd-hit-2d -i nr_90_0.9 \
            -i2 fragment_min50.fa \
            -o nr_90_0.9_fragment \
            -c 0.9 -n 5  -d 0  -g 1 -p 1 -T 35 -M 0 -G 0 -aS 0.9 > nr_90_0.9_fragment.log

# 65% identity
$ cd-hit -i nr_90_0.9 \
       -o nr_65_0.9 \
       -c 0.65 -n 4  -d 0  -g 1 -p 1 -T 35 -M 0 -G 0 -aS 0.9 -aL 0.9 > nr_65_0.9.log
$ cd-hit-2d -i nr_65_0.9 \
          -i2 nr_90_0.9_fragment \
          -o nr_65_0.9_fragment \
          -c 0.65 -n 4  -d 0  -g 1 -p 1 -T 40 -M 0 -G 0 -aS 0.9 > nr_65_0.9_fragment.log

# 40% identity
$ cd-hit -i nr_65_0.9 \
       -o nr_40_0.9 \
       -c 0.4 -n 2  -d 0  -g 1 -p 1 -T 35 -M 0 -G 0 -aS 0.9 -aL 0.9 > nr_40_0.9.log

$ cd-hit-2d -i nr_40_0.9 \
            -i2 nr_65_0.9_fragment \
            -o nr_40_0.9_fragment \
            -c 0.4 -n 2  -d 0  -g 1 -p 1 -T 30 -M 0 -G 0 -aS 0.9 > nr_40_0.9_fragment.log

In the end, CD_HIT scripts was used to merge the full-length and fragment clusters in each step, and was used to merge the clusters in three levels.

Other scripts

  • get each protein length from .gff files.

  • get the read counts of each protein cluster in each sample.

  • get the RPKM anundance of each protein cluster in each sample.

  • GLMM_sh.R performm GLMM statisctical comparison between IA and controls.

  • map the statistically significant protein clusters into high quality MAGs.

  • MAG_row_fisher.R perform fisher's exact test to find the enrichment of the IA-associated protein clusters in MAGs level.

  • gtdbtk.batch taxonomy assignment for MAGs using GTDB-Tk (tested v1.3.0).

  • ANIcalculator.batch gANI calculation using ANIcalculator (tested ANIcalculator v1.0).

  • aldex_module.R KEGG module comparison between genome groups using ALDEx2.

  • phylolm_analysis.r Phylogenetic regression analysis.

    compare teddy MAGs with three references:

  • compare_with_MAGs_in3references.batch compare the high-quality MAGs with MAGs in Almeida et al. 2019; Nayfach et al. 2019; Pasolli et al. 2019 and using mash (tested Mash v2.2)

  • select the best hit in each reference genomes, and then calculate the ANI using ANIcalculation

  • output the table including mapping information with three reference genomes


Metagenomics workflow from sequencing data to Metagenome-Assembled Genomes (MAGs)







No packages published