Step01: specify the set of strains
Load the strain list from the run directory. It contains NCBI RefSeq accession numbers or the names of user-provided GenBank files (without the file extension).
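As a rough illustration of this step, the sketch below parses a strain list. It assumes a plain-text file with one accession number or file name per line; that layout and the function names `parse_strain_list` and `read_strain_list` are assumptions made for this example, not the pipeline's documented interface.

```python
from pathlib import Path


def parse_strain_list(lines):
    """Return strain identifiers from an iterable of text lines.

    Assumptions: one entry per line, blank lines are ignored, and a trailing
    '.gbk' is dropped because names are expected without file ending.
    """
    strains = []
    for line in lines:
        name = line.strip()
        if not name:
            continue
        if name.endswith(".gbk"):
            name = name[: -len(".gbk")]
        strains.append(name)
    return strains


def read_strain_list(path):
    """Read a strain list file (hypothetical helper) from the run directory."""
    return parse_strain_list(Path(path).read_text().splitlines())


print(parse_strain_list(["NC_01", "", "NC_02.gbk", "  my_strain  "]))
```

Separating parsing from file access keeps the parsing logic easy to check without touching the file system.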
Step03: extract gene sequences from GenBank (*.gbk) file
Extract genes from GenBank (*.gbk) file as nucleotide and amino acid sequences
- Input:
In folder ./data/TestSet/:
*.gbk file
- Output:
In folder ./data/TestSet/nucleotide_fna:
*.fna file (nucleotide sequences)
In folder ./data/TestSet/protein_faa:
*.faa file (amino acid sequences)
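The extraction itself depends on a GenBank parser and is not shown here. As a small companion sketch, the `*.fna` and `*.faa` outputs are FASTA files, which can be read back with a few lines of standard-library Python. The function name `parse_fasta` and the record names in the example are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records; the real header layout may differ.
example = """>NC_01|gene_0001
ATGAAACGT
TAA
>NC_01|gene_0002
ATGGGC
""".splitlines()
records = list(parse_fasta(example))
print(records)
```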
Step04: extract metadata from GenBank (*.gbk) file (Alternative: provide manually curated metadata table)
Extract metadata (e.g., country, collection_date, host, strain) from the GenBank file, or provide a tab-separated values (TSV) file such as:
strain | location | host age | serotype | benzylpenicillin MIC (ug/mL) | ... |
---|---|---|---|---|---|
NC_01 | Germany | 35 | 23A | 0.016 | ... |
NC_02 | Switzerland | 66 | 23B | 4 | ... |
- Input:
In folder ./data/TestSet/:
*.gbk file
- Output:
In folder ./data/TestSet/:
metainfo.tsv (metadata for visualization)
User-provided metadata:
- -mi --metainfo_fpath: the absolute path to the meta-information file (e.g., /path/meta.out)
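A user-provided metadata table like the example in this step could be loaded as sketched below. The function name `load_metadata` is hypothetical, and the code assumes that the first column is called `strain`, as in the example table.

```python
import csv
import io


def load_metadata(handle):
    """Read a tab-separated metadata table into {strain: {column: value}}.

    Assumption: the table has a header row with a 'strain' column.
    All values are kept as strings.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        strain = row.pop("strain")
        table[strain] = row
    return table


# Example rows taken from the table in this step (tab-separated).
example = io.StringIO(
    "strain\tlocation\thost age\tserotype\n"
    "NC_01\tGermany\t35\t23A\n"
    "NC_02\tSwitzerland\t66\t23B\n"
)
metadata = load_metadata(example)
print(metadata["NC_02"]["serotype"])  # 23B
```

For a file on disk, pass an open file handle (opened with `newline=""`) instead of the `io.StringIO` object.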
Step05: compute gene clusters
All-against-all protein sequence comparison with DIAMOND, followed by clustering of genes using MCL
- Input:
In folder ./data/TestSet/protein_faa/:
*.faa file
- Output:
In folder ./data/TestSet/protein_faa/diamond_matches/:
allclusters.cpk (dictionary for gene clusters)
diamond_geneCluster_dt: {clusterID: [count_strains, [memb1, ...], count_genes]}
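To make the dictionary layout concrete, here is a minimal sketch with made-up clusters. It assumes that the `.cpk` file is a Python pickle dump and that members are named `strain|gene`; both are assumptions for illustration, as is the helper `single_copy_core`.

```python
import io
import pickle

# Hypothetical clusters following the layout described above:
# {clusterID: [count_strains, [member, ...], count_genes]}
clusters = {
    "GC_1": [3, ["NC_01|g1", "NC_02|g7", "NC_03|g4"], 3],
    "GC_2": [2, ["NC_01|g2", "NC_01|g9", "NC_02|g3"], 3],
    "GC_3": [1, ["NC_03|g5"], 1],
}

# Round-trip through pickle in memory (stands in for reading allclusters.cpk).
buffer = io.BytesIO()
pickle.dump(clusters, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)


def single_copy_core(cluster_dt, n_strains):
    """Return IDs of clusters found exactly once in every strain."""
    return sorted(
        cid
        for cid, (count_strains, members, count_genes) in cluster_dt.items()
        if count_strains == n_strains and count_genes == n_strains
    )


print(single_copy_core(loaded, n_strains=3))  # ['GC_1']
```

Comparing `count_strains` with `count_genes` is a quick way to spot clusters that contain duplicated genes, such as `GC_2` here.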
Step06: build alignments, gene trees from gene clusters and run phylogeny-based post-processing
Load the nucleotide sequences of each gene cluster, construct nucleotide and amino acid alignments, build a gene tree from the nucleotide alignment, split paralogs, and export the gene tree as a JSON file for visualization
- Input:
In folder ./data/TestSet/protein_faa/diamond_matches/:
allclusters.cpk file
- Output:
In folder ./data/TestSet/protein_faa/diamond_matches/:
allclusters_final.tsv (final gene clusters)
In folder ./data/TestSet/geneCluster/:
GC*.fna (nucleotide fasta)
GC*_na_aln.fa (nucleotide alignment)
GC*.faa (amino acid fasta)
GC*_aa_aln.fa (amino acid alignment)
GC*_tree.json (gene tree in json file)
Step07: construct core gene SNP matrix
Call SNPs in strictly core genes (without gene duplication) and build SNP matrix for strain tree
- Output:
In folder ./data/TestSet/geneCluster/:
SNP_whole_matrix.aln (SNP matrix as pseudo alignment)
snp_pos.cpk (snp positions)
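The idea of a SNP matrix as a pseudo alignment can be sketched as follows: keep only the alignment columns in which strains differ and concatenate them per strain. This is a simplified illustration, not the pipeline's implementation; the function name `call_snps` and the rule of skipping columns with gaps or ambiguous bases are assumptions.

```python
def call_snps(alignment):
    """Find variable columns in an alignment given as {strain: sequence}.

    Returns (positions, snp_matrix), where snp_matrix maps each strain to the
    concatenated bases at the variable positions (a pseudo alignment).
    Assumption: columns containing gaps or ambiguous bases are skipped.
    """
    strains = sorted(alignment)
    length = len(alignment[strains[0]])
    if any(len(alignment[s]) != length for s in strains):
        raise ValueError("all sequences must have the same aligned length")

    positions = []
    for i in range(length):
        column = {alignment[s][i].upper() for s in strains}
        # Keep the column only if it is pure A/C/G/T and not constant.
        if column <= set("ACGT") and len(column) > 1:
            positions.append(i)

    snp_matrix = {s: "".join(alignment[s][i] for i in positions) for s in strains}
    return positions, snp_matrix


# Toy alignment of one core gene across three strains.
aln = {
    "NC_01": "ATGCCA-TA",
    "NC_02": "ATGTCAGTA",
    "NC_03": "ATGTCAGTC",
}
positions, matrix = call_snps(aln)
print(positions)  # [3, 8]
print(matrix)
```

Positions are zero-based here; column 6 is dropped because it contains a gap.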
Step08: build the strain tree using core gene SNPs
Use FastTree to build the core genome phylogeny and refine it further with RAxML
- Input:
In folder ./data/TestSet/geneCluster/:
SNP_whole_matrix.aln
- Output:
In folder ./data/TestSet/geneCluster/:
strain_tree.nwk
Step09: infer gene gain and loss events
Use an ancestral reconstruction algorithm (TreeTime) to infer gain and loss events
- Output:
In folder ./data/TestSet/geneCluster/:
genePresence.aln (gene presence and absence pattern)
GC000*_patterns.json (gene gain/loss pattern for each gene cluster)
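The ancestral reconstruction itself is not reproduced here. The sketch below only illustrates what a gene presence/absence pattern can look like before reconstruction: one character per gene cluster for each strain. The `1`/`0` encoding, the `strain|gene` member naming, and the function name `presence_absence` are assumptions for illustration.

```python
def presence_absence(cluster_members, strains):
    """Build one '1'/'0' string per strain, one character per gene cluster.

    cluster_members maps clusterID -> list of member names "strain|gene"
    (hypothetical naming). Clusters are ordered by sorted cluster ID.
    """
    cluster_ids = sorted(cluster_members)
    patterns = {}
    for strain in strains:
        bits = []
        for cid in cluster_ids:
            present = any(
                member.split("|", 1)[0] == strain
                for member in cluster_members[cid]
            )
            bits.append("1" if present else "0")
        patterns[strain] = "".join(bits)
    return patterns


members = {
    "GC_1": ["NC_01|g1", "NC_02|g7", "NC_03|g4"],
    "GC_2": ["NC_01|g2", "NC_02|g3"],
    "GC_3": ["NC_03|g5"],
}
patterns = presence_absence(members, ["NC_01", "NC_02", "NC_03"])
# Print in a FASTA-like layout, similar in spirit to a pseudo alignment.
for strain, pattern in patterns.items():
    print(f">{strain}\n{pattern}")
```

Treating presence and absence as a two-state character per cluster is what allows a tree-based method to place gain and loss events on branches of the strain tree.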
Step10: export gene cluster json file
Export json file for gene cluster datatable visualization
- Output:
In folder ./data/TestSet/geneCluster/:
geneCluster.json (gene cluster json for datatable visualization)
Step11: export tree and metadata json file
Export json files for strain tree and metadata visualization
- Input:
In folder ./data/TestSet/:
metainfo.tsv (metadata table)
In folder ./data/TestSet/geneCluster/:
strain_tree.nwk (strain tree)
- Output:
In folder ./data/TestSet/geneCluster/:
coreGenomeTree.json (strain tree visualization)
strainMetainfo.json (strain metadata table visualization)
- Data collection for visualization (sending data to server):
In folder ./data/TestSet/vis/:
geneCluster.json
coreGenomeTree.json
strainMetainfo.json
In folder ./data/TestSet/vis/geneCluster/:
GC000*_na_aln.fa
GC000*_aa_aln.fa
GC000*_tree.json
GC000*_patterns.json
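The data collection layout can be sketched as a simple copy step. This assumes that the visualization files are copied from `geneCluster/` into `vis/` and `vis/geneCluster/`; the function name `collect_for_vis` and the copy-based approach are assumptions, and the actual pipeline may gather the files differently.

```python
import shutil
import tempfile
from pathlib import Path


def collect_for_vis(run_dir):
    """Copy visualization files from geneCluster/ into vis/ (sketch only)."""
    run_dir = Path(run_dir)
    src = run_dir / "geneCluster"
    vis = run_dir / "vis"
    vis_clusters = vis / "geneCluster"
    vis_clusters.mkdir(parents=True, exist_ok=True)

    copied = []
    # Summary JSON files go directly into vis/.
    for name in ("geneCluster.json", "coreGenomeTree.json", "strainMetainfo.json"):
        if (src / name).exists():
            shutil.copy(src / name, vis / name)
            copied.append(name)
    # Per-cluster alignments, trees and gain/loss patterns go into vis/geneCluster/.
    for suffix in ("_na_aln.fa", "_aa_aln.fa", "_tree.json", "_patterns.json"):
        for path in sorted(src.glob(f"GC*{suffix}")):
            shutil.copy(path, vis_clusters / path.name)
            copied.append(f"geneCluster/{path.name}")
    return copied


# Demonstration on a temporary run directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    gene_cluster_dir = Path(tmp) / "geneCluster"
    gene_cluster_dir.mkdir()
    for name in ("geneCluster.json", "GC00000001_tree.json", "GC00000001_na_aln.fa"):
        (gene_cluster_dir / name).write_text("{}")
    copied = collect_for_vis(tmp)
print(copied)
```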