SynGAP

A toolkit for comparative genomics and transcriptomics research of related species.

SynGAP (Synteny-based Gene Structure Annotation Polisher) is a command-line tool written in Python 3 for Linux. A Docker image is also provided for other operating systems such as macOS and Windows. SynGAP supports two main workflows: (1) genome annotation polishing for related species (dual, master, triple, and custom); (2) differential gene expression analysis of related species (genepair, evi, and eviplot).

Find the source code and documentation at https://github.com/yanyew/SynGAP
Find detailed documentation at https://www.yuque.com/yanyew/gc786d
For questions about SynGAP, please contact 360875601w@gmail.com
If you use SynGAP, please cite: Wu F, Mai Y, Chen C, et al. SynGAP: a synteny-based toolkit for gene structure annotation polishing[J]. Genome Biology, 2024, 25(1): 218. https://doi.org/10.1186/s13059-024-03359-8

Installation

conda (recommended)

conda install -c conda-forge -c bioconda syngap

manually

cd ~/code  # or any directory of your choice
git clone https://github.com/yanyew/SynGAP.git
cd ~/code/SynGAP
conda env create -f SynGAP.environment.yaml -c conda-forge -c bioconda
export PATH=~/code/SynGAP:$PATH

Docker image

docker pull yanyew/syngap:1.2.5
docker run -it yanyew/syngap:1.2.5
conda activate syngap # activate the conda environment for SynGAP

Dependencies

python >=3.10
biopython >=1.81
jcvi >=1.3.6
bedtools >=2.31.0
last >=1454
emboss >=6.6.0
gffread >=0.12.7
seqkit >=2.4.0
diamond >=2.1.8
perl-bioperl >=1.7.8
kneed >=0.8.3
numpy >=1.26.0
pandas >=2.1.1
matplotlib-base >=3.8.0
scikit-image >=0.22.0
pybedtools >=0.9.0
deap >=1.4.1
more-itertools
crossmap
graphviz
webcolors
ortools-python
ftpretty
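
For a manual installation, it can help to confirm that the external command-line tools are reachable before running SynGAP. The following is a minimal sketch, not part of SynGAP itself: the executable names in TOOLS are assumptions based on the names these packages usually install (for example, LAST provides lastal and lastdb), and missing_tools is a hypothetical helper.

```python
import shutil

# Assumed executable names; adjust the list to match your environment.
TOOLS = ["bedtools", "lastal", "lastdb", "gffread", "seqkit", "diamond"]


def missing_tools(names):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that the executables exist; it does not verify the minimum versions listed above.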

Usage

genome annotation polishing

dual

SynGAP dual is a module for the mutual correction of gene structure annotations between two species. It takes the genome sequences and genome annotations of both species as input. For example:

syngap dual \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--sp1=Ath \
--sp2=Aha

In the results directory, there are several key output files:

Result File                           Description
*.SynGAP.gff3                         the full polished genome annotation file (original + polished)
*.SynGAP.clean.gff3                   the polished genome annotation file (only polished)
*.SynGAP.clean.miss_annotated.gff3    only the polished annotations that are miss-annotated in the original genome annotation
*.SynGAP.clean.mis_annotated.gff3     only the polished annotations that are mis-annotated in the original genome annotation
*.anchors.gap                         the gaps where mis-annotation or miss-annotation of gene models (MAGs) may exist
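
One quick way to see what polishing changed is to count features by type in the original annotation and in the corresponding *.SynGAP.gff3 file. The sketch below is an illustration only and is not part of SynGAP. It assumes standard nine-column, tab-delimited GFF3, and the function names are hypothetical.

```python
import sys
from collections import Counter


def count_gff3_features(lines):
    """Count GFF3 records by feature type (column 3), skipping comments and blanks."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a nine-column GFF3 record (e.g. embedded FASTA)
        counts[fields[2]] += 1
    return counts


def count_gff3_file(path):
    """Open a GFF3 file and count its records by feature type."""
    with open(path) as handle:
        return count_gff3_features(handle)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python count_features.py original.gff3 polished.SynGAP.gff3
    before = count_gff3_file(sys.argv[1])
    after = count_gff3_file(sys.argv[2])
    for feature in sorted(set(before) | set(after)):
        print(feature, before[feature], after[feature], sep="\t")
```

A rise in the gene count between the two files gives a rough idea of how many gene models were added; it does not distinguish miss-annotated from mis-annotated cases.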

master

You can also choose to polish the gene structure annotations of a single species against a Core set that we have curated. The Core set includes several plant and animal species with high-quality genome annotations:

plant                      animal
Aristolochia fimbriata     Bos taurus
Arabidopsis thaliana       Caenorhabditis elegans
Brachypodium distachyon    Canis lupus familiaris
Cucumis sativus            Drosophila melanogaster
Citrus sinensis            Danio rerio
Fragaria vesca             Felis catus
Glycine max                Gallus gallus
Musa acuminata             Homo sapiens
Oryza sativa               Mus musculus
Solanum lycopersicum       Ovis aries
Vitis vinifera             Pan troglodytes
Zea mays                   Sus scrofa
                           Xenopus tropicalis

To use SynGAP master, first download the database from the link below, which includes plant.tar.gz and animal.tar.gz. Download whichever one you need.
https://tbtools.cowtransfer.com/s/85ed3920aa7f47
Then import the downloaded database:

syngap initdb \
--sp=plant \
--file=plant.tar.gz

After importing the database, run SynGAP master:

syngap master \
--sp=plant \
--sp1fa=Brassica_rapa_ro18.SCU_BraROA_2.3.dna.toplevel.fa \
--sp1gff=Brassica_rapa_ro18.SCU_BraROA_2.3.53.chr.gff3 \
--sp1=Bra

triple

To polish the annotations of three species in combination, use SynGAP triple.

syngap triple \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--sp3fa=Brassica_rapa_ro18.SCU_BraROA_2.3.dna.toplevel.fa \
--sp3gff=Brassica_rapa_ro18.SCU_BraROA_2.3.53.chr.gff3 \
--sp1=Ath \
--sp2=Aha \
--sp3=Bra

custom

If you are only interested in annotation polishing within a specific synteny block, or prefer to use synteny results from software other than jcvi, you can provide the *.anchors file that contains the block and use SynGAP custom.

syngap custom \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--custom_anchors=Ath.Aha.originalid.anchors \
--sp1=Ath \
--sp2=Aha
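
When preparing a custom *.anchors file, it may be useful to check how many blocks and gene pairs it contains. The sketch below assumes the layout that jcvi writes for anchors files: blocks introduced by a line starting with ###, followed by whitespace-delimited lines holding the two gene IDs and a score. Whether SynGAP custom accepts other layouts is not stated here, so treat this as an assumption. The function name and the gene IDs in the usage note are hypothetical.

```python
def read_anchor_blocks(lines):
    """Group gene pairs from a jcvi-style anchors file into synteny blocks."""
    blocks = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("###"):
            # A "###" line starts a new synteny block.
            current = []
            blocks.append(current)
            continue
        if line.startswith("#"):
            continue  # ignore any other comment line
        if current is None:
            # Pairs found before the first "###" form an implicit first block.
            current = []
            blocks.append(current)
        fields = line.split()
        if len(fields) >= 2:
            current.append((fields[0], fields[1]))
    # Drop blocks that ended up with no gene pairs.
    return [block for block in blocks if block]
```

For example, passing an open file handle for Ath.Aha.originalid.anchors returns a list of blocks, each a list of (sp1 gene, sp2 gene) tuples, so len(blocks) gives the number of synteny blocks.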

differential gene expression analysis of related species

SynGAP includes another module, genepair, which generates high-confidence cross-species homologous gene pairs by combining the improved synteny (from SynGAP dual or triple) with best two-way BLAST. SynGAP evi then computes an additional parameter, the expression variation index (EVI), which is calculated from the gene expression level, the difference in expression level, and the difference in expression trend across time-series transcriptome data.

genepair

SynGAP genepair takes the genome sequences and genome annotations of the paired objects as input.

syngap genepair \
--sp1fa=Can.fa \
--sp1gff=Can.SynGAP.gff3 \
--sp2fa=Sly.fa \
--sp2gff=Sly.SynGAP.gff3 \
--sp1=Can \
--sp2=Sly

SynGAP genepair generates several key output files (see below); *.final.genepair is used as input to SynGAP evi.

Result File             Description
*.final.genepair        the full gene pairs file (syntenic + best two-way BLAST)
*.Synteny.genepair      the syntenic gene pairs
*.2wayblast.genepair    the best two-way BLAST gene pairs

evi

Based on the gene pairs between two species and time-series transcriptome data, evi calculates the EVI for each gene pair. The input expression file should be a tab-delimited text file of normalized expression values such as FPKM, RPKM, or TPM (we recommend TPM).

syngap evi \
--genepair=Can.Sly.final.genepair \
--sp1exp=Can.S1_S7.transcript.TPM.xls \
--sp2exp=Sly.S1_S7.transcript.TPM.xls
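
If your expression values are in FPKM or RPKM, a common way to obtain TPM is to rescale each sample so that its values sum to one million. The sketch below shows that rescaling for a single sample. It is an illustration and not part of SynGAP; the function name and the gene IDs and values in the usage note are made up.

```python
def fpkm_to_tpm(fpkm_by_gene):
    """Rescale one sample's FPKM (or RPKM) values so that they sum to 1e6."""
    total = sum(fpkm_by_gene.values())
    if total <= 0:
        raise ValueError("the sample has no positive expression values")
    return {gene: value / total * 1e6 for gene, value in fpkm_by_gene.items()}
```

For instance, fpkm_to_tpm({"geneA": 10.0, "geneB": 30.0}) assigns geneA one quarter and geneB three quarters of the one million total. Apply the function to each time point separately, since the scaling factor differs between samples.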

There are several key output files:

Result File                                  Description
*.final.genepair.EVI.xls                     the final EVI result file, in which the gene pairs are ranked by EVI
*.final.genepair.EVI.threshold.txt           the EVI threshold; gene pairs whose EVI exceeds the threshold are considered to show marked differential expression signals
*.final.genepair.EVI.pdf                     the ranked dotplot of EVI for all gene pairs
*.final.genepair.EVI.indexweight.pdf         the stacked barplot of the three indexes contributing to EVI, which can help to adjust the weights of the three indexes
*.final.genepair.EVI.indexweightratio.pdf    the percentage stacked barplot of the three indexes contributing to EVI, which can help to adjust the weights of the three indexes

eviplot

If you are interested in specific gene pairs, you can highlight them using eviplot.

syngap eviplot \
--EVI=Can.Sly.final.genepair.EVI.xls \
--highlightid=highlight.id \
--outgraph=Can.Sly.highlight.EVI.pdf

The format of highlight.id is as follows:

GeneID1 GeneID2 Label
Capana06g001783 transcript:Solyc06g059840.3.1 CaBCKDH
Capana02g002339 transcript:Solyc02g081745.1.1 CaAT3
Capana04g000751 transcript:Solyc04g077240.3.1 CaBCAT
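
Before running eviplot, you may want to check that every line of highlight.id has exactly three fields. The sketch below is a hypothetical helper, not part of SynGAP. It splits on any whitespace because the exact delimiter is an assumption, and it treats a header line like any other row. A label that contains spaces would be reported as an error.

```python
def parse_highlight_lines(lines):
    """Return (gene_id_1, gene_id_2, label) tuples, raising on malformed lines."""
    rows = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 fields, found {len(fields)}"
            )
        rows.append(tuple(fields))
    return rows
```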