Skip to content

schmelling/genome_assembly

Repository files navigation

Genome assembly

This repository contains some of my scripts, that I wrote during my lab rotation in the Genomics, Evolution, and Development Lab at Michigan State University

The amazing Amanita bisporigera genome project

A.bisporigera is one of the most toxic fungi in the world due to its cyclic peptides, which inhibit the RNA polymerase II. The inhibition of this polymerase affects the protein synthesis, which can be deadly. An assembled and annotated genome would be a great help studying this fungi. There are several problems with the assembly of this genome.

  1. This fungi is dikaryotic, so every cell has two nuclei. Currently it is not clear, how the genomic content is distributed between these nuclei.
  2. This fungi can't be cultured in the lab. So all the tissues are collected in nature, which means that the sequencing material is somehow contaminated.
  3. Like most fungi, we have to deal with repeats and overlapping coding regions.

Another big problem that I have to deal with is myself. I'm an undergrad student, molecular biologist by training, and have never worked in the field of informatics or bioinformatics.

I have access to two genome assemblies (Illumina data, SPAdes and Velvet Optimizer) and one transcriptome assembly (Illumina data, Trinity). I recently got two new assemblies, generated by Joshua Herr. The one assembly is for the A.bisporigera genome (Illumina and PacBio data, SPAdes assembler) and the other is for a related species Amanita phalloides (Illumina data, SPAdes assembler), also one of the most dangerous fungi known.

This wiki has two purposes. On the one hand it is used as a documentation for myself and on the other hand it should help other people to reproduce what I did or do the same things on their own data. Because of that I will write down the basic code one needs to use to run the program, this serve as a jump start for others, and I will write down the code that I specifically used.

Work Flow

  • Evaluation of genome assemblies
  • Annotate RNAseq data
  • Comparing the RNAseq data with other related transcriptomes

Evaluation of genome assemblies

Assembly stats

Evaluating genome assemblies is an important step on the way to a complete annotate genome. There are several papers out there, which show that different assembler generate different results on the same sequencing data. The Assemblathon 2 paper describes these issues. My take home message are the following statements from this paper.

  1. Don’t trust the results of a single assembly. If possible, generate several assemblies (with different assemblers and/or different assembler parameters).
  2. Do not place too much faith in a single metric.
  3. Potentially choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error free bases).
  4. If you are interested in generating a genome assembly for the purpose of genic analysis (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs), then it may not be necessary to be concerned by low N50/NG50 values or by a small assembly size.
  5. Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.

I used two different scripts to get different statistics on the genome assemblies. assemstats2.py from the GED lab and my own script assembly_stats.py.

Stats of interest:

  • Genome Length and # of contigs
  • Trimmed Genome Length and # of contigs (contigs < 500bp)
  • Min Median Mean Max contig size
  • N50 N90
  • NG50 NG90 (suggestion in Assemblathon 2 paper)
  • of contigs > than average gene size (suggestion in Assemblathon 2 paper)

Download:
curl -O https://raw.githubusercontent.com/schmelling/Genome-Assembly/master/assembly_stats.py
Usage:
python assembly_stats.py <assembly.fasta> <minimum contig size> <estimated genome size> <average gene size>
python assembly_stats.py assembly.fasta 500 55000000 2000

The estimated genome size of 55Mbp was calculated by Joshua Herr. This genome size is similar to related species, the acutual genome size can still vary a bit.

CEGMA

CEGMA or Core Eukaryotic Genes Mapping Approach is a tool to analyze a eukaryotic genome assembly, providing a metric for assessing the completeness of genomes and transcriptomes. It maps a set of conserved eukaryotic core genes to the assembly fasta file and returns a set of seven files. I personally only used the output.completeness_report file to see the percent completeness of my assembly. I ran CEGMA on the iPlant Collaborative, because it's free, easy to use and no installation is required.

Running CEGMA on iPlant

  1. Create an account on iPlant Collaborative
  2. Open the Discovery Environment
  3. Upload your fasta files
  4. Search for CEGMA under Apps and save it under favorites
  5. Click on CEGMA App, specify the Analysis Name and choose fasta file under Main
  6. Run CEGMA

Transcriptome completeness

Another way to evaluate a genome assembly is to analyze the coverage of transcriptome data, if they are available. I used BLAT an BLAST-like alignment tool, which align fasta files (nucleotide, protein) faster than BLAST does. The usage is similar to BLAST and well documented on the web site. The default output file is a .psl file, which some programs use as input files, but there are different formats to choose from.

BLAT & baa.pl

Download:
mkdir blat
cd blat
curl -O http://genome-test.cse.ucsc.edu/~kent/exe/linux/blatSuite.zip
unzip blatSuite.zip
Usage:
./blat <database> <query> <output.psl>
./blat VelvetOp_assembly.fasta Trinity.fasta assembly.psl -t=dna -q=rna -noHead -out=blast8

I used baa.pl to analyze the .psl output file for completeness of my genome assembly. This program analyses the .psl file and returns four statistics.

  • Percentage of transcripts with a BLAT entry (15661/15752): 0.994222955815135
  • Total % coverage of all positions (10303278 / 10491059): 0.982100853688841
  • Number of transcripts mapping to a single contig/scaffold: (14750/15661) 0.941830023625567
  • Average number of contigs/scaffolds per mapped transcript: 1.08715918523721

Download:
git clone git://github.com/josephryan/baa.pl
Install:
cd baa.pl/
perl Makefile.PL
make
make install
Usage:
baa.pl <blat file> <fasta query used in blat analysis>
baa.pl VelvetOp_assembly.psl Trinity.fasta

Genome Annotation

Genome annotation is the process of attaching biological information to sequences. It consists of three main steps:

  1. identifying portions of the genome that do not code for proteins
  2. identifying elements on the genome, a process called gene prediction, and
  3. attaching biological information to these elements.

You separate genome annotation into two many parts the structural annotation and the functional annotation.

Structural annotation consists of the identification of genomic elements.

  • ORFs and their localization
  • gene structure
  • coding regions
  • location of regulatory motifs

Functional annotation consists of attaching biological information to genomic elements.

  • biochemical function
  • biological function
  • involved regulation and interactions
  • expression

The annotation of the RNAseq data will provide a way of the functional annotation, gene prediction will provide a way for structural annotation.

Annotate RNAseq data

Transcriptome sequencing or RNAseq provides a rapid and inexpensive method to access the gene space. After sequencing the transcriptome of interest, you need to assemble the single reads to useful contigs. You could either map the read to a reference genome, this obviously works only if you work with model organism that have reference genomes. Otherwise you need to run through a de-novo assembly. If you have to do that the khmer Eel pond mRNAseq protocol provides a working and well documented way. I already started with assembled RNAseq data using the khmer protocols, so I could skip the first steps and only did step 5 building transcript families and step 6 annotating transcript families. Annotation of RNAseq data has two great purposes. First you get an idea about putative functions of your mRNAs and second you may find orthologs, which are indicate 'real' genes. Mapping the orthologs sequences back to a genome assembly, you have another way to evaluate a genome assembly by the coverage of orthologs.

Building & annotating transcript families

I had to make some few adjustments to the original eel pond protocol, but I documented all my changes. The original eel pond protocol blasts against a mouse database and I wanted to look at the uniprot database. You will see that I didn't write down my specific code in this area, but I think you should be able to figure out the code that I used, given what I wrote down in this section.

Change python code to find homologs and orthologs

This section provides all necessary changes for you to work with the uniprot database instead of the mouse one. Right now the eel pond protocol don't has two versions.

pico ~/eel-pond/namedb.py to open file for changes.

Changes:

  • line 4 fp = open('mouse.namedb') to fp = open('uniprot.namedb')
  • line 6 mouse_names = cPickle.load(fp) to uniprot_names = cPickle.load(fp)
  • line 9 mouse_fullname = cPickle.load(open('mouse.namedb.fullname')) to uniprot_fullname = cPickle.load(open('uniprot.namedb.fullname'))
  • line 10 mouse_seqs = screed.ScreedDB('mouse.protein.faa') to uniprot_seqs = screed.ScreedDB('uniprot_sprot.fasta')

Crtl + X and Y and Enter to save changes.

pico ~/eel-pond/annotate-seqs.py to open file for changes.

Changes:

  • line 37 if o and transform_name(o) in namedb.mouse_names: to if o and transform_name(o) in namedb.uniprot_names:
  • line 38 annot = namedb.mouse_names[transform_name(o)] to annot = namedb.uniprot_names[transform_name(o)]
  • line 53 annot = namedb.mouse_names.get(transform_name(h)) to annot = namedb.uniprot_names.get(transform_name(h))
  • line 99 if o and transform_name(o) in namedb.mouse_names: to if o and transform_name(o) in namedb.uniprot_names:
  • line 100 annot = namedb.mouse_names[transform_name(o)] to annot = namedb.uniprot_names[transform_name(o)]
  • line 107 if h and transform_name(h[0][0]) in namedb.mouse_names: to if h and transform_name(h[0][0]) in namedb.uniprot_names:
  • line 110 annot = namedb.mouse_names[transform_name(h)] to annot = namedb.uniprot_names[transform_name(h)]

Crtl + X and Y and Enter to save changes.

Building transcript families

As I mentioned earlier I didn't had to assemble the RNAseq data with Trinity. If you have to that you have to start with step 1 of the eel pond protocol. Because I had the assemble data I could jump to the building of transcript families. This step will cluster sequences by homology into transcript families.

gzip -c Trinity.fasta > trinity-assembly-raw.fa.gz
/usr/local/share/khmer/scripts/do-partition.py -x 1e9 -N 4 --threads 4 TriA trinity-assembly-raw.fa.gz
python /usr/local/share/eel-pond/rename-with-partitions.py TriA trinity-assembly-raw.fa.gz.part
mv trinity-assembly-raw.fa.gz.part.renamed.fasta.gz trinity-assembly.renamed.fa.gz

Download the sequences from the uniprot database:

curl -O ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

Preparing BLAST database and run BLAST

Before I could run BLAST on the uniprot sequences and my RNAseq data, I had to build BLAST database out of these fast files.

gunzip uniprot_sprot.fasta.gz
gunzip trinity-assembly.renamed.fa.gz
formatdb -i trinity-assembly.renamed.fa -o T -p F
formatdb -i uniprot_sprot.fasta -o T -p T

After building BLAST databases I finally could start to blast for homologs and orthologs in my RNAseq data. -e 1e-3 is a e value threshold, -p specifies the blast program, -a 8 is the number of processors, -v 4 is the number of database sequences to show one-line descriptions, -b 4 is the number of database sequence to show alignments

blastall -i trinity-assembly.renamed.fa -d uniprot_sprot.fasta -e 1e-3 -p blastx -o TriA.x.uniprot -a 8 -v 4 -b 4
blastall -i uniprot_sprot.fasta -d trinity-assembly.renamed.fa -e 1e-3 -p tblastn -o uniprot.x.TriA -a 8 -v 4 -b 4

After finishing the blastall runs I needed to filter the homologs and orthologs to annotate my Trinity.fasta file. All these script are provided by the eel pond package.

python /usr/local/share/eel-pond/make-uni-best-hits.py TriA.x.uniprot TriA.x.uniprot.homol
python /usr/local/share/eel-pond/make-reciprocal-best-hits.py TriA.x.uniprot uniprot.x.TriA uniprot.x.TriA.ortho
python /usr/local/share/eel-pond/make-namedb.py uniprot_sprot.fasta uniprot.namedb
python -m screed.fadbm uniprot_sprot.fasta
python /usr/local/share/eel-pond/annotate-seqs.py trinity-assembly.renamed.fa uniprot.x.TriA.ortho TriA.x.uniprot.homol

Extracting orthologs sequences

The next step was to extract all of the annotated orthologs from the annotated fasta file. To do so I used my own script extract.py.

Download:
curl -O https://raw.githubusercontent.com/schmelling/Genome-Assembly/master/extract.py
Usage:
python extract.py <annotated fasta> <new fasta> <separator>
python extract.py destroying_angel.fa ortho.fa ortho

Match orthologs to genome

I wanted to know how many of these orthologs,I found, are covered in my genome assembly, because then I would know how much of these 'real' genes are actually im my assembly.

First I created a BLAST database from my assembly fasta and then blast the transcriptome data against this database. I used -evalue 1e-3 as an e value threshold and -outfmt 6 to get a the results in a tab format

makeblastdb -in <genome assembly fasta> -out <database name> -dbtype nucl
makeblastdb -in VelvetOp_assembly.fa -out VelvetOp_DNA_db -dbtype nucl
blastn -query <transcriptome data> -out <output file> -db <database>
blastn -query trinity-destroyingangel.renamed.fa -out Velvet_RNA_BLAST.txt -db VelvetOp_DNA_db -outfmt 6 -evalue 1e-3

Next I filtered the 'match' and 'no-match' hits from the ortholog fasta file and the BLAST results using my own script find_match_2.py.

Download:
curl -O https://raw.githubusercontent.com/schmelling/Genome-Assembly/master/find_match_2.py
Usage:
python find_match_2.py <fasta file> <BLAST file> <kind>
python find_match_2.py ortho.fa Velvet_RNA_BLAST ortho.rna

Comparative RNAseq analysis

Comparing transcriptomes of related species can give insights in conserved genes or regulatory networks. It can also serve to analyze phylogenies between species. I compared the transcriptome of A.bisporigera with the transcriptome of A.muscaria, A.thiersii and V.volvacea to investigate conserved and closely related genes in the A.bisporigera transcriptome.

Download the different transcriptomes from JGI:
wget http://genome.jgi.doe.gov/Amamu1/download/Amamu1_GeneCatalog_transcripts_20120806.nt.fasta.gz
wget http://genome.jgi.doe.gov/Amath1/download/Amath1_GeneCatalog_transcripts_20120702.nt.fasta.gz
wget http://genome.jgi.doe.gov/Volvo1/download/Volvo1_GeneCatalog_transcripts_20130703.nt.fasta.gz

Building transcript families with the Eel Pond protocol here.

tblastx:
makeblastdb -in TRANSCRIPT_FASTA -out DATABASE_NAME -dbtype nucl
tblastx -query TRANSCRIPT_ONE -db TRANSCRIPT_TWO -out T1.x.T2 -evalue 1e-3 -outfmt 6
for reciprocal best hit
tblastx -query TRANSCRIPT_TWO -db TRANSCRIPT_ONE -out T2.x.T1 -evalue 1e-3 -outfmt 6

filter best hits:
curl -O https://github.com/schmelling/Genome-Assembly/raw/master/best_hit.py
python best_hit.py <blast result> <output file> <best>

Calculating the number of hits:
wc -l <best output file>

Create a csv file with transcript names:
curl -O https://github.com/schmelling/Genome-Assembly/raw/master/transcript_comparison.py
python transcript_comparison.py <trinity.fasta> <best output 1> <best output 2> <best output 3> <csv output file>

About

Scripts of Amanita project

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages