Genome Preparation

Tao edited this page Jul 30, 2022 · 14 revisions

Believe it or not, preparing genomes for the pipeline usually takes considerable effort. Once the genomes are ready, you are halfway done.

1. Check what's available

plaBiPD gives you an up-to-date, systematic view of which genomes have been published (https://www.plabipd.de/portal/web/guest/angiosperm-phylogenetic-view).

2. Resources

Genomes can be downloaded from public repositories such as [Phytozome](https://phytozome.jgi.doe.gov/pz/portal.html), [CoGe](https://genomevolution.org/CoGe/), [GigaDB](http://gigadb.org/), [NCBI](https://www.ncbi.nlm.nih.gov/genome/browse/#), [PLAZA](https://bioinformatics.psb.ugent.be/plaza/), or [Ensembl](http://plants.ensembl.org/index.html).

3. Download

Required files:

Sequences (either pep or cds) of all predicted genes in fasta format.

GFF file

4. Process the sequence file

Shorten the gene names and add unique identifiers

Gene names in downloaded fasta files follow all kinds of naming conventions.

For example, headers may look like:

>Aradu.20JM2 genotype-assembly-annot=V14167.a1.M1

>DTZ79_01g11390

>AT1G01010.1 | NAC domain containing protein 1 | Chr1:3760-5630 FORWARD LENGTH=429 | 201606

>Bv1_000040_cpku.t1 cDNAEvidence=88.9

Except for the second name (which may still need a meaningful species prefix), all of these need to be shortened: remove everything after the first white space, preferably replace the dots in the names, and add a unique species prefix 3-5 letters long. Names such as >ath_AT1G01010 and >Aradu_20JM2 look good.
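The renaming rules above could be sketched in Python roughly as follows. This is only an illustration: the function names `shorten_header` and `rename_fasta`, the `bvu` prefix, and the optional `drop_suffix` pattern (used here to drop a trailing transcript number such as `.1`) are assumptions, not part of the pipeline.

```python
import re
from typing import Optional


def shorten_header(header: str, prefix: Optional[str] = None,
                   drop_suffix: Optional[str] = None) -> str:
    """Return a shortened fasta header such as '>ath_AT1G01010'."""
    # Keep only the text before the first white space, without the leading '>'.
    name = header.lstrip(">").split()[0]
    # Optionally remove a suffix, e.g. r"\.\d+$" for a transcript number.
    if drop_suffix:
        name = re.sub(drop_suffix, "", name)
    # Replace the remaining dots in the name.
    name = name.replace(".", "_")
    # Add the species prefix when one is given.
    return ">" + (f"{prefix}_{name}" if prefix else name)


def rename_fasta(in_path: str, out_path: str, prefix: Optional[str] = None,
                 drop_suffix: Optional[str] = None) -> None:
    """Copy a fasta file, rewriting header lines only."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(shorten_header(line.strip(), prefix, drop_suffix) + "\n")
            else:
                dst.write(line)
```

For instance, `shorten_header(">AT1G01010.1 | NAC domain containing protein 1", "ath", r"\.\d+$")` gives `>ath_AT1G01010`. Whatever rule you choose, apply the same one when building the bed file so that the names stay identical.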

Note a few other things as well.

  1. Check that the total number of sequences in the fasta file matches the number of gene records in the GFF file (rows with 'gene' in the 3rd column).
  2. Make sure the fasta file holds only one transcript or protein sequence per coding region, with no alternative transcripts.
  3. Remove the '*' at the end of each sequence (it may cause problems when building the sequence database with Diamond later).
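Checks 1 and 3 can be scripted. The sketch below assumes a standard tab-separated GFF with the feature type in the 3rd column; the helper names `read_fasta`, `strip_terminal_stops`, and `count_gff_genes` are hypothetical. Removing alternative transcripts (check 2) depends on how each annotation names its isoforms, so it is not covered here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def strip_terminal_stops(lines):
    """Return (header, sequence) pairs with any trailing '*' removed."""
    return [(name, seq.rstrip("*")) for name, seq in read_fasta(lines)]


def count_gff_genes(gff_lines):
    """Count GFF rows whose 3rd column is exactly 'gene'."""
    total = 0
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip comment and directive lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "gene":
            total += 1
    return total
```

Both functions accept any iterable of lines, so an open file handle can be passed directly; comparing `len(strip_terminal_stops(pep_handle))` with `count_gff_genes(gff_handle)` covers check 1.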

Once it's ready, name the sequence file "abc.pep", where abc is the species abbreviation.

5. Process the GFF file

GFF files are usually quite large and contain all the information about coding regions, exon-intron structures, non-coding regions, UTRs, etc. We only need to extract the information for coding genes and prepare a four-column text file named "abc.bed", in the following format:

athChr1 ath_AT1G01010 3631 5899

athChr1 ath_AT1G01020 5928 8737

athChr1 ath_AT1G01030 11649 13714

athChr1 ath_AT1G01040 23146 31227

athChr1 ath_AT1G01050 31170 33153

Note that gene names must be exactly the same as those in the "abc.pep" fasta file.

Please also add the unique species abbreviation in front of the chromosome ID in the first column.

The same term 'Chr1' cannot be used for different species. Chromosomes from different species must be distinguishable, otherwise it will cause problems.
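A minimal sketch of the GFF-to-bed extraction is given below. It assumes a GFF3-style 9th column where the gene name is stored under the `ID` key, and it writes tab-separated columns; both are assumptions to check against your own files (some repositories store IDs as `ID=gene:...` or use a different key). The function names are hypothetical.

```python
def gff_to_bed_rows(gff_lines, prefix, id_key="ID"):
    """Build (chromosome, gene, start, end) rows from 'gene' features."""
    rows = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        # Parse 'key=value;key=value' attributes from the 9th column.
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        gene_id = attrs.get(id_key)
        if gene_id is None:
            continue
        # Use the same naming rule as in the fasta file: prefix + '_' + name,
        # with dots replaced.
        gene = f"{prefix}_{gene_id.replace('.', '_')}"
        # Prepend the species abbreviation to the chromosome ID.
        rows.append((f"{prefix}{fields[0]}", gene, int(fields[3]), int(fields[4])))
    return rows


def format_bed(rows):
    """Format rows as four-column lines (tab-separated by assumption)."""
    return ["\t".join(str(value) for value in row) for row in rows]
```

With `prefix="ath"`, a gene feature on `Chr1` spanning 3631 to 5899 with `ID=AT1G01010` becomes the row `athChr1 ath_AT1G01010 3631 5899`.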

Also note that the first two letters (case-sensitive) of column 1 are what distinguish the species. For example, you may have two species abbreviations, 'art' (Arabidopsis thaliana) and 'arl' (Arabidopsis lyrata). Using such abbreviations is fine, but the first two letters in column 1 of the bed files (which MCScanX uses later) must differ between species. Otherwise, when comparing 'art' and 'arl', the software cannot tell which gene comes from which species. To avoid this redundancy, the bed files may look like this:

artChr1 art_AT1G20100 1000 3000

...

aRlChr3 arl_AL3G23500 4000 6000

...

This is very important when you have many species: check and design unique species abbreviations, and also avoid redundancy in the first two letters of the 1st column of the bed files.
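With many species, the two-letter rule is easy to check automatically. The sketch below is one possible way; the function name `find_prefix_clashes` and the input layout (a dict mapping each species abbreviation to the lines of its bed file) are assumptions.

```python
from collections import defaultdict


def find_prefix_clashes(beds):
    """Map each two-letter column-1 prefix shared by several species
    to the sorted list of those species."""
    owners = defaultdict(set)
    for species, lines in beds.items():
        for line in lines:
            fields = line.split()
            if fields:
                # The comparison is case-sensitive: 'ar' and 'aR' differ.
                owners[fields[0][:2]].add(species)
    return {pre: sorted(sp) for pre, sp in owners.items() if len(sp) > 1}
```

An empty result means that no two species share the first two letters of column 1. Chromosome names of `artChr1` and `arlChr3` would be reported under `ar`, while `artChr1` and `aRlChr3` pass.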

P.S. Normally, the number of genes in the fasta file should match the number of lines in the bed file, or at least be roughly similar. If the two differ substantially, you have to find out why.
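When the counts differ, listing the names found in only one of the two files is a quick way to locate the cause. A minimal sketch, assuming the gene name is the first token of each fasta header and the 2nd column of each bed line (the function name `compare_ids` is hypothetical):

```python
def compare_ids(pep_lines, bed_lines):
    """Return (names only in the fasta file, names only in the bed file)."""
    pep_ids = {
        line[1:].split()[0]
        for line in pep_lines
        if line.startswith(">") and line[1:].strip()
    }
    bed_ids = set()
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 2:
            bed_ids.add(fields[1])
    return sorted(pep_ids - bed_ids), sorted(bed_ids - pep_ids)
```

Two empty lists mean that the names in "abc.pep" and "abc.bed" agree exactly.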

Now you finally have the '.pep' and '.bed' files prepared for each genome.