A toolkit for comparative genomics and transcriptomics research of related species.
SynGAP (Synteny-based Gene Structure Annotation Polisher) is a command-line tool written in Python 3 for Linux. A Docker image is also provided for other operating systems such as macOS and Windows. It supports two main workflows: (1) polishing the genome annotations of related species (dual, master, triple, and custom); (2) differential gene expression analysis of related species (genepair, evi, and eviplot).
Find the source code and documentation at https://github.com/yanyew/SynGAP
Find detailed documentation at https://www.yuque.com/yanyew/gc786d
For any questions about SynGAP, please contact 360875601w@gmail.com
If you use SynGAP, please cite: Wu F, Mai Y, Chen C, et al. SynGAP: a synteny-based toolkit for gene structure annotation polishing[J]. Genome Biology, 2024, 25(1): 218.
https://doi.org/10.1186/s13059-024-03359-8
conda install -c conda-forge -c bioconda syngap
cd ~/code # or any directory of your choice
git clone https://github.com/yanyew/SynGAP.git
cd ~/code/SynGAP
conda env create -f SynGAP.environment.yaml -c conda-forge -c bioconda
export PATH=~/code/SynGAP:$PATH
docker pull yanyew/syngap:1.2.5
docker run -it yanyew/syngap:1.2.5
conda activate syngap # activate the conda environment for SynGAP
python >=3.10
biopython >=1.81
jcvi >=1.3.6
bedtools >=2.31.0
last >=1454
emboss >=6.6.0
gffread >=0.12.7
seqkit >=2.4.0
diamond >=2.1.8
perl-bioperl >=1.7.8
kneed >=0.8.3
numpy >=1.26.0
pandas >=2.1.1
matplotlib-base >=3.8.0
scikit-image >=0.22.0
pybedtools >=0.9.0
deap >=1.4.1
more-itertools
crossmap
graphviz
webcolors
ortools-python
ftpretty
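After installation, it can be useful to confirm that the external command-line tools are reachable on `PATH`. The following is a minimal sketch, not part of SynGAP: `missing_tools` is a hypothetical helper, and the executable names (for example `lastal` for the `last` package) are assumptions about how the packages listed above expose their binaries.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
ASSUMED_EXECUTABLES = ["bedtools", "lastal", "gffread", "seqkit", "diamond"]


def missing_tools(executables, which=shutil.which):
    """Return the executables that cannot be found on PATH."""
    return [name for name in executables if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it inside the activated `syngap` environment so that the lookup reflects the environment's `PATH`.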
SynGAP dual is a module for mutually correcting the gene structure annotations of two species. It takes the genome sequences and genome annotations of both species as input. For example:
syngap dual \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--sp1=Ath \
--sp2=Aha
In the results directory, there are several key output files:
Result File | Description |
---|---|
*.SynGAP.gff3 | the full polished genome annotation file (original + polished) |
*.SynGAP.clean.gff3 | the polished genome annotation file (only polished) |
*.SynGAP.clean.miss_annotated.gff3 | only the polished annotations that are miss-annotated in the original genome annotation |
*.SynGAP.clean.mis_annotated.gff3 | only the polished annotations that are mis-annotated in the original genome annotation |
*.anchors.gap | the gaps where mis-annotation or miss-annotation of gene models (MAGs) may exist |
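
To get a quick sense of what polishing changed, one option is to count features by type in the original and polished GFF3 files and compare the totals. The sketch below relies only on the standard GFF3 layout (nine tab-delimited columns, feature type in column 3). `count_feature_types` is a hypothetical helper and the example records are made up for illustration.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count GFF3 features by type (column 3), skipping comments and blank lines."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line (e.g. embedded FASTA)
        counts[fields[2]] += 1
    return counts


# Made-up records, for illustration only.
example = [
    "##gff-version 3",
    "Chr1\texample\tgene\t100\t900\t.\t+\t.\tID=g1",
    "Chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "Chr1\texample\tgene\t2000\t2600\t.\t-\t.\tID=g2",
]
print(count_feature_types(example)["gene"])  # prints 2
```

For a real file, pass an open file handle instead of the list, for example `count_feature_types(open(path))`, where `path` points to one of the `*.SynGAP.gff3` outputs.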
You can also choose to polish the gene structure annotations of a single species against a curated Core set. The Core set includes several plant and animal species with high-quality genome annotations:
plant | animal |
---|---|
Aristolochia fimbriata | Bos taurus |
Arabidopsis thaliana | Caenorhabditis elegans |
Brachypodium distachyon | Canis lupus familiaris |
Cucumis sativus | Drosophila melanogaster |
Citrus sinensis | Danio rerio |
Fragaria vesca | Felis catus |
Glycine max | Gallus gallus |
Musa acuminata | Homo sapiens |
Oryza sativa | Mus musculus |
Solanum lycopersicum | Ovis aries |
Vitis vinifera | Pan troglodytes |
Zea mays | Sus scrofa |
| | Xenopus tropicalis |
To use SynGAP master, first download the database from the link below, which includes plant.tar.gz and animal.tar.gz. Choose the one you need.
https://tbtools.cowtransfer.com/s/85ed3920aa7f47
Then import the downloaded database:
syngap initdb \
--sp=plant \
--file=plant.tar.gz
After importing the database, run SynGAP master:
syngap master \
--sp=plant \
--sp1fa=Brassica_rapa_ro18.SCU_BraROA_2.3.dna.toplevel.fa \
--sp1gff=Brassica_rapa_ro18.SCU_BraROA_2.3.53.chr.gff3 \
--sp1=Bra
To polish the annotations of three species in combination, use SynGAP triple.
syngap triple \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--sp3fa=Brassica_rapa_ro18.SCU_BraROA_2.3.dna.toplevel.fa \
--sp3gff=Brassica_rapa_ro18.SCU_BraROA_2.3.53.chr.gff3 \
--sp1=Ath \
--sp2=Aha \
--sp3=Bra
If you only want to polish annotations within a specific synteny block, or prefer to use synteny results from software other than jcvi, you can provide an *.anchors file that contains the block and use SynGAP custom.
syngap custom \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--custom_anchors=Ath.Aha.originalid.anchors \
--sp1=Ath \
--sp2=Aha
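When preparing a custom anchors file, it may help to inspect how many blocks and gene pairs it contains. The sketch below assumes the jcvi-style layout, in which lines starting with `#` (typically `###`) separate synteny blocks and each remaining line lists two gene IDs followed by a score. `read_anchor_blocks` is a hypothetical helper and the gene IDs are placeholders. Whether SynGAP custom accepts other layouts is not stated here.

```python
def read_anchor_blocks(lines):
    """Group gene pairs into synteny blocks; a line starting with '#' opens a new block."""
    blocks = []
    current = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if current:
                blocks.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed lines
        current.append((fields[0], fields[1]))
    if current:
        blocks.append(current)
    return blocks


# Placeholder gene IDs, for illustration only.
example = [
    "###",
    "geneA1\tgeneB1\t500",
    "geneA2\tgeneB2\t480",
    "###",
    "geneA9\tgeneB7\t300",
]
blocks = read_anchor_blocks(example)
print(len(blocks), [len(block) for block in blocks])  # prints 2 [2, 1]
```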
SynGAP includes another module, genepair, which generates high-confidence cross-species homologous gene pairs by combining the improved synteny (from SynGAP dual or triple) with best two-way BLAST. SynGAP evi then uses an additional metric, the expression variation index (EVI), which is calculated from the gene expression level, the difference in expression level, and the difference in expression trend in time-series transcriptome data.
SynGAP genepair takes the genome sequences and genome annotations of the paired objects as input.
syngap genepair \
--sp1fa=Can.fa \
--sp1gff=Can.SynGAP.gff3 \
--sp2fa=Sly.fa \
--sp2gff=Sly.SynGAP.gff3 \
--sp1=Can \
--sp2=Sly
SynGAP genepair generates several key output files (see below); *.final.genepair is used by SynGAP evi.
Result File | Description |
---|---|
*.final.genepair | the full gene pairs file (syntenic + best two-way BLAST) |
*.Synteny.genepair | the syntenic gene pairs |
*.2wayblast.genepair | the best two-way BLAST gene pairs |
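
The general idea behind best two-way BLAST pairs (often called reciprocal best hits) can be sketched as follows: a pair is kept only when each gene is the other's top-scoring hit. This is an illustration of the concept under simplified assumptions (one score per pair, ties resolved by first occurrence), not SynGAP's actual implementation. The function names and gene IDs are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) to an alignment score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Placeholder scores: species 1 genes vs species 2 genes, and the reverse search.
forward = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
reverse = {("b1", "a1"): 210, ("b2", "a1"): 140}
print(reciprocal_best_hits(forward, reverse))  # prints [('a1', 'b1')]
```

Here `a2` is dropped because its best hit `b2` prefers `a1`, so the relationship is not reciprocal.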
Based on the gene pairs between two species and time-series transcriptome data, evi calculates the EVI for each gene pair. The input expression file should be a tab-delimited text file of normalized expression values such as FPKM, RPKM, or TPM (we recommend TPM).
syngap evi \
--genepair=Can.Sly.final.genepair \
--sp1exp=Can.S1_S7.transcript.TPM.xls \
--sp2exp=Sly.S1_S7.transcript.TPM.xls
There are several key output files:
Result File | Description |
---|---|
*.final.genepair.EVI.xls | the final EVI result file, in which the gene pairs are ranked by EVI |
*.final.genepair.EVI.threshold.txt | the EVI threshold. Gene pairs with an EVI exceeding the threshold are considered to show marked differential expression signals |
*.final.genepair.EVI.pdf | the ranked dotplot of EVI for all gene pairs |
*.final.genepair.EVI.indexweight.pdf | the stacked barplot of the three indexes contributing to EVI, which can help to adjust the weight of three indexes |
*.final.genepair.EVI.indexweightratio.pdf | the percentage stacked barplot of the three indexes contributing to EVI, which can help to adjust the weight of three indexes |
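
For the expression input to evi, TPM is the recommended normalization. If only raw read counts and transcript lengths are available, TPM can be derived with the standard formula: divide each count by the transcript length, then scale so that the values in a sample sum to one million. The sketch below shows that calculation for a single sample. `counts_to_tpm` is a hypothetical helper and the numbers are made up. Dedicated quantification tools usually report TPM directly and may use effective rather than raw lengths.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM for one sample.

    counts, lengths_bp: dicts keyed by transcript ID.
    """
    # Length-normalized rate: reads per base of transcript.
    rates = {tid: counts[tid] / lengths_bp[tid] for tid in counts}
    total = sum(rates.values())
    if total == 0:
        return {tid: 0.0 for tid in counts}
    # Scale so that all values in the sample sum to one million.
    return {tid: rate / total * 1e6 for tid, rate in rates.items()}


# Made-up values, for illustration only.
tpm = counts_to_tpm({"t1": 100, "t2": 300}, {"t1": 1000, "t2": 1500})
for transcript_id, value in tpm.items():
    print(transcript_id, round(value, 1))
```

In this example `t2` has three times the reads of `t1` but is 1.5 times longer, so its TPM is twice that of `t1`.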
If you are interested in specific gene pairs, you can highlight them using eviplot.
syngap eviplot \
--EVI=Can.Sly.final.genepair.EVI.xls \
--highlightid=highlight.id \
--outgraph=Can.Sly.highlight.EVI.pdf
The format of highlight.id is as follows:
GeneID1 | GeneID2 | Label |
---|---|---|
Capana06g001783 | transcript:Solyc06g059840.3.1 | CaBCKDH |
Capana02g002339 | transcript:Solyc02g081745.1.1 | CaAT3 |
Capana04g000751 | transcript:Solyc04g077240.3.1 | CaBCAT |