This repository provides code and files to reproduce the data and figures from the manuscript "Critical assessment of pan-genomics of metagenome-assembled genomes", by Tang Li and Yanbin Yin* (*corresponding author). The Python and shell scripts cover downloading genome data, simulating metagenome-assembled genomes (MAGs) from complete genomes, analyzing pan-genomes, performing Clusters of Orthologous Groups (COG) functional annotation, and comparing phylogenetic trees. The R code covers reformatting data, generating plots, and combining plots.
- FastANI: Calculate Average Nucleotide Identity (ANI).
- Prokka: Prokaryotic genome annotation.
- BLAST+: Compare sequences against a database.
- Roary: Pan-genome analysis.
- Anvi'o: Pan-genome analysis.
- BPGA: Pan-genome analysis.
- FastTree: Phylogenetic tree construction.
- The full dataset generated in this study is too large to store on GitHub. Example data for Escherichia coli are available online for testing MAG simulation, generating mixed MAG datasets, extracting and comparing core genes, and evaluating downstream analysis.
- Anaconda is used to create a conda environment for running the Python scripts. The required packages are listed in conda_list and can be installed using: conda create --name <env> --file conda_list
- Information about the R packages needed to run the R code can be found in R_packages.
- The four supplementary tables for this manuscript can be found in the folder supplementary tables.

Genome_Data_Collection: collect and analyze genome data.
- download_all_complete_genome_fasta.sh: download complete bacterial genomes from NCBI RefSeq.
- download_genus_contaminaton_genomes.sh: download bacterial genomes to use as contamination datasets.
- fastANI.sh: calculate average nucleotide identity (ANI) for bacterial species.
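
As a rough illustration of how the ANI values from this step might be summarized, the sketch below reads FastANI's tab-delimited output (query, reference, ANI, matched fragments, total fragments) and reports the mean ANI plus any genome pairs below a cutoff. The function names, the example file names, and the 95% cutoff are assumptions made for illustration; they are not taken from the repository's scripts.

```python
import csv
import io
from statistics import mean


def read_fastani(handle):
    """Yield (query, reference, ani) tuples from FastANI tab-delimited output."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        yield row[0], row[1], float(row[2])


def summarize_ani(handle, threshold=95.0):
    """Return (mean ANI, pairs below threshold), ignoring self-comparisons.

    The 95.0 default is an assumed cutoff, not a value from the manuscript.
    """
    pairs = [(q, r, ani) for q, r, ani in read_fastani(handle) if q != r]
    below = [(q, r, ani) for q, r, ani in pairs if ani < threshold]
    average = mean(ani for _, _, ani in pairs) if pairs else None
    return average, below


# Hypothetical two-line FastANI output used only to exercise the functions.
example = (
    "g1.fna\tg2.fna\t98.5\t900\t1000\n"
    "g1.fna\tg3.fna\t93.0\t700\t1000\n"
)
average_ani, low_pairs = summarize_ani(io.StringIO(example))
print(average_ani, low_pairs)
```

In practice the handle would be an opened FastANI output file rather than an in-memory string.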

17_species: pan-genome analysis for 17 species.
- prokka.sh: annotate genomes with Prokka.
- gen_gff.sh: rename the .gff files from the Prokka results.
- roary_species.sh: run pan-genome analysis with Roary.
- sbatch_roary.sh: run multiple jobs for pan-genome analysis.

MAG_Simulation: simulate MAGs from complete genomes.
- fragmentation.py: fragmentation simulation; randomly cut the genome into fragments (random number of fragments).
- fragmentation_avrg_length.py: fragmentation simulation; randomly cut the genome into fragments (random fragment lengths).
- incompleteness.py: incompleteness simulation; remove a percentage of the sequence length from each fragment.
- contamination.py: contamination simulation; add fragments from other genomes of the same species (intraspecies).
- contamination_genus.py: contamination simulation; add fragments from other genomes of the same genus (interspecies).
- random_distribution: generate random numbers following an F distribution for the simulation.
- generate_numbers.sh: generate numbers for the genome list to assign random fragmentation/incompleteness/contamination values.
- simulation.sh: automated simulation script.
- batch_files.sh: batch files for simulation.
- multiple_dataset.sh: generate multiple datasets to test variation between datasets.
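
The fragmentation and incompleteness steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the logic of fragmentation.py or incompleteness.py: the function names are hypothetical, cut points are drawn uniformly, and the removed bases are taken from the two ends of each fragment (the scripts may remove sequence differently).

```python
import random


def fragment_genome(sequence, n_fragments, rng):
    """Cut `sequence` into `n_fragments` pieces at distinct random positions."""
    if not 1 <= n_fragments <= len(sequence):
        raise ValueError("n_fragments must be between 1 and len(sequence)")
    # Choose n_fragments - 1 interior cut points so every piece is non-empty.
    cuts = sorted(rng.sample(range(1, len(sequence)), n_fragments - 1))
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def trim_fragment(fragment, fraction, rng):
    """Remove int(len * fraction) bases, split at random between both ends.

    Trimming from the ends is an assumption made for this sketch.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    n_remove = int(len(fragment) * fraction)
    left = rng.randint(0, n_remove)
    return fragment[left:len(fragment) - (n_remove - left)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
genome = "ACGT" * 250    # toy 1,000 bp sequence standing in for a complete genome
fragments = fragment_genome(genome, n_fragments=10, rng=rng)
incomplete = [trim_fragment(f, fraction=0.1, rng=rng) for f in fragments]
print([len(f) for f in fragments])
print([len(f) for f in incomplete])
```

Passing an explicit random.Random instance keeps repeated simulations reproducible, which helps when generating several datasets from the same genomes.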

Mixed_datasets: generate and analyze mixed datasets containing MAGs and complete genomes.
- rad_combine.sh: generate mixed datasets with different percentages of MAGs.
- copy_ori_file.py: generate mixed datasets by combining the original and simulated MAG datasets.
- Pan-genome_and_summary.sh: perform pan-genome analysis for mixed datasets.
- loop_rad_combine.sh: run rad_combine.sh multiple times.
- roary_sum.py: summarize Roary results for multiple mixed datasets.
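
One way to picture a mixed dataset is as a random choice of which genomes are replaced by their simulated MAG versions. The sketch below is an assumption-laden illustration of that idea (the function name, genome ids, and the 30% value are hypothetical); it does not reproduce rad_combine.sh or copy_ori_file.py.

```python
import random


def assign_mixed_dataset(genome_ids, mag_percent, rng):
    """Label each genome 'MAG' or 'complete' so that mag_percent% are MAGs."""
    if not 0 <= mag_percent <= 100:
        raise ValueError("mag_percent must be between 0 and 100")
    genome_ids = list(genome_ids)
    n_mags = round(len(genome_ids) * mag_percent / 100)
    # Sample without replacement so no genome is chosen twice.
    mag_ids = set(rng.sample(genome_ids, n_mags))
    return {g: ("MAG" if g in mag_ids else "complete") for g in genome_ids}


rng = random.Random(0)  # fixed seed for a reproducible example
ids = [f"genome_{i:03d}" for i in range(100)]  # hypothetical genome ids
assignment = assign_mixed_dataset(ids, mag_percent=30, rng=rng)
print(sum(label == "MAG" for label in assignment.values()))
```

Repeating the call with different seeds gives several independent mixed datasets at the same MAG percentage.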

Three_tools: perform pan-genome analysis using three different tools.
- Anvi'o:
  - Anvi'o.sh: use Anvi'o for pan-genome analysis.
  - genbank-parser.py: generate a tab-delimited file defining external gene calls from the GenBank files produced by Prokka.
- BPGA:
  - gen_faa.sh: rename the .faa files.
  - sort_faa.sh: prepare .faa files for BPGA pan-genome analysis.
  - summary.sh: summarize BPGA pan-genome results.
- Roary:
  - gen_gff.sh: rename the .gff files.
  - prokka_generate.sh: generate the Prokka annotation script.
  - bash_prokka.sh: run Prokka annotation.
  - roary.sh: generate the Roary pan-genome analysis script.
  - sbatch_roary.sh: run Roary analysis.
  - rad_roary_sum.py: summarize Roary results.
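
Summarizing Roary results could look roughly like the sketch below, which assumes the input follows the layout of Roary's summary_statistics.txt (category, strain range, and gene count separated by tabs). The function names and the numbers in the example are illustrative assumptions, not output from this study or code from rad_roary_sum.py.

```python
def parse_summary_statistics(text):
    """Map each category (first column) to its gene count (last column)."""
    counts = {}
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip lines that do not end in an integer count.
        if len(fields) < 2 or not fields[-1].strip().isdigit():
            continue
        counts[fields[0].strip()] = int(fields[-1])
    return counts


def core_fraction(counts):
    """Fraction of all gene families that fall in the core category."""
    return counts["Core genes"] / counts["Total genes"]


# Made-up counts in the assumed summary_statistics.txt layout.
example = (
    "Core genes\t(99% <= strains <= 100%)\t3000\n"
    "Soft core genes\t(95% <= strains < 99%)\t100\n"
    "Shell genes\t(15% <= strains < 95%)\t1400\n"
    "Cloud genes\t(0% <= strains < 15%)\t500\n"
    "Total genes\t(0% <= strains <= 100%)\t5000\n"
)
counts = parse_summary_statistics(example)
print(counts["Core genes"], core_fraction(counts))
```

Applying the same parser to the summary file of each dataset yields comparable core and total gene counts across runs.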

COG_analysis: perform Clusters of Orthologous Groups (COG) analysis.
- prepare_files.sh: prepare files for COG functional analysis of core gene families after pan-genome analysis.
- rpsblast_batch.sh: use RPS-BLAST to perform a domain search and then determine COG categories.
- rps_select.py: select non-overlapping domains for each gene.
- rps2COG.py: extract COG information from rpsblast results after selecting non-overlapping domains.
- core_rps.sh: run RPS-BLAST.
- rpsblast_to_COG.sh: select non-overlapping domains from RPS-BLAST results and assign COG categories.
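
Selecting non-overlapping domains can be illustrated with a simple greedy pass: take hits in order of increasing E-value and keep a hit only if it does not overlap a hit already kept for the same gene. This is one plausible strategy shown under assumptions; the hit fields, the function name, and the greedy E-value criterion are hypothetical and may differ from what rps_select.py does.

```python
from collections import defaultdict


def select_non_overlapping(hits):
    """Greedily keep the best-scoring hits that do not overlap per query.

    Each hit is a dict with keys: query, start, end, evalue, domain.
    Coordinates are inclusive positions on the query sequence.
    """
    kept = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        accepted = kept[hit["query"]]
        # Two intervals are disjoint when one ends before the other starts.
        if all(hit["end"] < k["start"] or hit["start"] > k["end"] for k in accepted):
            accepted.append(hit)
    return {q: sorted(v, key=lambda h: h["start"]) for q, v in kept.items()}


# Hypothetical hits for one gene; the COG ids are placeholders.
hits = [
    {"query": "gene1", "start": 1, "end": 100, "evalue": 1e-50, "domain": "COG0001"},
    {"query": "gene1", "start": 50, "end": 150, "evalue": 1e-20, "domain": "COG0002"},
    {"query": "gene1", "start": 120, "end": 200, "evalue": 1e-10, "domain": "COG0003"},
]
selected = select_non_overlapping(hits)
print([h["domain"] for h in selected["gene1"]])
```

In this toy case the middle hit overlaps the best hit and is dropped, while the third hit is kept because it only overlapped the rejected one.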

Core_gene: extract and analyze core genes.
- Anvi'o:
  - Anvio_extract_core.sh: extract core genes from Anvi'o outputs.
  - extract_all_fasta_seq.py: extract all gene sequences in gene clusters.
  - Anvio_extract_ref_core_seq.py: select the longest core sequence from each gene cluster as the core representative sequence.
  - blast_generate.sh: generate BLAST scripts.
  - Anvio_core_gene_compare.sh: core gene comparison.
  - Anvio_summary.sh: summarize core gene comparison results.
- BPGA:
  - BPGA_extract_ref_core_seq.py: extract representative core genes from BPGA results under different core gene thresholds.
  - BPGA_extract_core.sh: extract core gene representative sequences.
  - blast_generate.sh: generate BLAST scripts.
  - BPGA_core_gene_compare.sh: core gene comparison.
  - BPGA_summary.sh: summarize core gene comparison results.
- Roary:
  - gene_id_extract.py: extract core gene cluster ids.
  - extract_core_gene_faa.py: extract core gene representative sequences.
  - DNA_to_protein.py: translate DNA sequences to proteins.
  - Roary_extract_core_gene.sh: extract core gene representative sequences from the pan-genome reference.
  - blast_generate.sh: generate BLAST scripts.
  - Roary_core_gene_compare.sh: core gene comparison.
  - Roary_summary.sh: summarize core gene comparison results.
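
Choosing the longest sequence in each gene cluster as its representative, as described for the Anvi'o workflow above, can be sketched as follows. The data layout (a dict of clusters mapping sequence ids to sequences), the function name, and the alphabetical tie-break are assumptions for illustration rather than details of Anvio_extract_ref_core_seq.py.

```python
def longest_representatives(clusters):
    """Return {cluster_id: (seq_id, sequence)} using the longest member.

    `clusters` maps cluster ids to {seq_id: sequence}. Ties are broken by
    taking the alphabetically first seq_id (an assumed, arbitrary rule).
    """
    representatives = {}
    for cluster_id, members in clusters.items():
        if not members:
            continue  # nothing to choose from in an empty cluster
        best_id = max(sorted(members), key=lambda seq_id: len(members[seq_id]))
        representatives[cluster_id] = (best_id, members[best_id])
    return representatives


# Hypothetical clusters with toy protein sequences.
clusters = {
    "GC_0001": {"genomeA_1": "MKT", "genomeB_7": "MKTAYI", "genomeC_3": "MKTAY"},
    "GC_0002": {"genomeB_2": "MSL", "genomeA_9": "MSV"},
}
reps = longest_representatives(clusters)
print(reps)
```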

Phylogenetic_analysis: compare phylogenetic trees constructed from MAGs and complete genomes.
- change_tree_id.py: replace tree ids using a new id file.
- gene_pre_abs_tree.sh: change genome ids for tree comparison.
- fasttree_generate.sh: generate phylogenetic trees based on the core gene alignment.
- core_alignment_tree.sh: build a tree from the core gene alignment using FastTree.
- tree_compare.sh: compare two phylogenetic trees using the ETE3 toolkit.
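
Two trees can only be compared when their leaves carry matching ids, which is why the ids are rewritten first. The sketch below shows one way to relabel leaves in a Newick string with a regular expression; the function name and id mapping are hypothetical, and it assumes unquoted labels without whitespace in a tree wrapped in parentheses. It is not the code in change_tree_id.py.

```python
import re

# A leaf label starts right after "(" or "," and runs up to ":", ",", ")" or ";".
LEAF_LABEL = re.compile(r"(?<=[(,])([^():,;]+)")


def rename_newick_ids(newick, id_map):
    """Replace leaf labels found in `id_map`; everything else is left as-is."""
    return LEAF_LABEL.sub(lambda m: id_map.get(m.group(1), m.group(1)), newick)


tree = "(A:0.1,(B:0.2,C:0.3):0.05);"
new_ids = {"A": "genome_1", "C": "genome_3"}  # hypothetical id mapping
renamed = rename_newick_ids(tree, new_ids)
print(renamed)
```

Branch lengths and internal nodes are untouched because they follow ":" or ")" rather than "(" or ",".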
- These R scripts were used to generate the figures and supplementary figures in the manuscript. For example, "Fig2.AB.frag_data.R" was used to generate Figure 2.A and Figure 2.B. The input files for generating Figure 2 can be found in "Fig2.frag_incomp".