# Worksheet 02 - Annotating Genomes With `Prokka`, and Calculating a Pangenome with `Roary` <img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">

## Introduction

In this worksheet, you will use [`Prokka`](http://www.vicbioinformatics.com/software.prokka.shtml) to annotate assemblies of the bacterial plant pathogen *Pantoea agglomerans*, and [`Roary`](https://github.com/sanger-pathogens/Roary) to calculate a pangenome from those annotations.

The `Prokka` software conducts rapid annotation of microbial genomes, and the `Roary` software enables rapid pangenome identification in a tool that can be run on a desktop or laptop computer.

## Running cells in this notebook

<div class="alert alert-info" role="alert">
This is an interactive notebook, which means you are able to run the code that is written in each of the cells.
<br /><br />
To run the code in a cell, you should:
<br /><br />
<ol>
<li>Place your mouse cursor in the cell, and click to make it active (this gives the cell <i>focus</i>)
<li>Hold down the <code>Shift</code> key, and press the <code>Return</code> key.
</ol>
</div>

If this is successful, you should see the input marker to the left of the cell change from

```
In [ ]:
```

to (for example)

```
In [1]:
```

and you may see output appear below the cell.

### Related online documentation

**Publications**

* "Roary: Rapid large-scale prokaryote pan genome analysis", Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill,
*Bioinformatics*, (2015). [doi:10.1093/bioinformatics/btv421](http://dx.doi.org/10.1093/bioinformatics/btv421)


**Software**

* `pyani` homepage: [http://widdowquinn.github.io/pyani/](http://widdowquinn.github.io/pyani/)
* `Roary` software: [https://github.com/sanger-pathogens/Roary](https://github.com/sanger-pathogens/Roary)
* `Prokka` homepage: [http://www.vicbioinformatics.com/software.prokka.shtml](http://www.vicbioinformatics.com/software.prokka.shtml)

### Requirements

<div class="alert alert-success">
To complete this worksheet, you will need:
<ul>
<li>an active internet connection
<li>a local, working installation of <b>pyani</b>
<li>a local, working installation of <b>Prokka</b>
<li>a local, working installation of <b>Roary</b>
</ul>
</div>

## 1. Downloading *Pantoea agglomerans* assemblies

We will use the `genbank_get_genomes_by_taxon.py` script, part of `pyani`, to download all available *Pantoea agglomerans* assembly sequences to the subdirectory `Pantoea_agglomerans`. To do this, issue the following command at the terminal:

```
genbank_get_genomes_by_taxon.py -v --format fasta --email <your_email> \
  -o Pantoea_agglomerans -t 549
```

This should generate verbose output describing the operation of downloading (at least) 15 genome assembly sequences. Once the command is complete, you should see the following (or similar) contents of the `Pantoea_agglomerans` subdirectory when issuing the `tree` command:

```
$ tree Pantoea_agglomerans/
Pantoea_agglomerans/
├── GCF_000241285.1.fasta
├── GCF_000330765.1.fasta
├── GCF_000475055.1.fasta
├── GCF_000627115.1.fasta
├── GCF_000687245.1.fasta
├── GCF_000710215.1.fasta
├── GCF_000731125.1.fasta
├── GCF_000743785.1.fasta
├── GCF_000743785.2.fasta
├── GCF_000757415.1.fasta
├── GCF_000814075.1.fasta
├── GCF_001288285.1.fasta
├── GCF_001558735.1.fasta
├── GCF_001597625.1.fasta
├── GCF_001598475.1.fasta
├── classes.txt
└── labels.txt
```

## 2. Fixing downloaded files for use with `Prokka`

<div class="alert alert-warning">
The assembly FASTA files downloaded from NCBI may not work directly with <b>Prokka</b>, as their contig names are too long (there is a 20 character limit), and give the following error:
</div>

```
Contig ID must <= 20 chars long: gi|358545120|dbj|BAEF01000003.1|
```

<div class="alert alert-warning">
To avoid this, we will rename the contigs in each downloaded file <i>in place</i> (i.e. overwriting the original file rather than writing a copy), giving them unique ID numbers from 1 to $n$, where $n$ is the number of contigs in the FASTA file. This is done by executing the cell below.
</div>

In [None]:
import os

# Use Biopython for sequence reading/writing
from Bio import SeqIO

# Input directory, containing FASTA files
indir = "Pantoea_agglomerans"

# Loop over all FASTA files in the input directory and renumber the
# contigs in each file from 1 to n, where n is the number of contigs
# in that file. Each file is overwritten in place.
for filename in [fname for fname in os.listdir(indir)
                 if os.path.splitext(fname)[-1] == '.fasta']:
    infile = os.path.join(indir, filename)
    # Read all records into memory before overwriting the file
    seqrecords = list(SeqIO.parse(infile, 'fasta'))
    # enumerate() counts from 0 by default, so start at 1
    for idx, record in enumerate(seqrecords, start=1):
        record.id = str(idx)
    SeqIO.write(seqrecords, infile, 'fasta')

When the cell above has finished running, we have an input dataset that is suitable for running `Prokka`.
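As a quick sanity check on the renaming step, the sketch below scans each FASTA file and reports any contig ID longer than the 20-character limit quoted in the error message. It is a minimal illustration, not part of the original worksheet: the helper name `long_contig_ids` is hypothetical, and it assumes the contig ID is the first whitespace-delimited token of each header line.

```python
from pathlib import Path

MAX_ID_LENGTH = 20  # limit quoted in the Prokka error message


def long_contig_ids(fasta_path, max_length=MAX_ID_LENGTH):
    """Return contig IDs in a FASTA file that exceed max_length characters.

    Hypothetical helper: assumes the ID is the first whitespace-delimited
    token of each header line.
    """
    too_long = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                contig_id = fields[0] if fields else ""
                if len(contig_id) > max_length:
                    too_long.append(contig_id)
    return too_long


# Report the status of every FASTA file in the download directory
indir = Path("Pantoea_agglomerans")
for fasta in sorted(indir.glob("*.fasta")):
    problems = long_contig_ids(fasta)
    status = "OK" if not problems else f"{len(problems)} long IDs"
    print(f"{fasta.name}: {status}")
```

If every file reports `OK`, the contig IDs should no longer trigger the length error.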

## 3. Annotating genomes with `Prokka`

We will use `Prokka` with default settings to annotate the downloaded assemblies.

For a single input assembly, such as `Pantoea_agglomerans/GCF_000241285.1.fasta`, a simple default command is:

```
prokka Pantoea_agglomerans/GCF_000241285.1.fasta --genus Pantoea --species agglomerans
```

<div class="alert alert-info" role="alert">
On my laptop, annotating the genome above takes ≈10 min, so annotating all 15 genomes could be expected to take approximately 2.5 hours. Because of this, the <b>Prokka</b> <b>.gff</b> output for 15 <i>P. agglomerans</i> genomes has already been prepared using the script <b>scripts/02-prokka.sh</b>, and the output placed in the subdirectory <b>prokka_out</b>.
</div>

This script can be run by issuing the following command at the terminal:

```
sh scripts/02-prokka.sh
```

It will generate one subdirectory under `prokka_out` for each input sequence:

```
$ tree -L 1 prokka_out/
prokka_out/
├── GCF_000241285.1
├── GCF_000330765.1
├── GCF_000475055.1
├── GCF_000627115.1
├── GCF_000687245.1
├── GCF_000710215.1
├── GCF_000731125.1
├── GCF_000743785.1
├── GCF_000743785.2
├── GCF_000757415.1
├── GCF_000814075.1
├── GCF_001288285.1
├── GCF_001558735.1
├── GCF_001597625.1
└── GCF_001598475.1
```

and each subdirectory will contain the annotation for that input sequence:

```
$ tree prokka_out/GCF_000241285.1/
prokka_out/GCF_000241285.1/
├── PROKKA_05212016.err
├── PROKKA_05212016.faa
├── PROKKA_05212016.ffn
├── PROKKA_05212016.fna
├── PROKKA_05212016.fsa
├── PROKKA_05212016.gbk
├── PROKKA_05212016.gff
├── PROKKA_05212016.log
├── PROKKA_05212016.sqn
├── PROKKA_05212016.tbl
└── PROKKA_05212016.txt
```
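The batch annotation described above could be sketched in Python as follows. This is not the content of `scripts/02-prokka.sh`, which is not shown here; it is an assumed equivalent that builds one `prokka` command per assembly and uses Prokka's `--outdir` option to write each result to `prokka_out/<accession>`. The function name `prokka_command` is hypothetical, and the sketch only prints the commands rather than running them.

```python
from pathlib import Path


def prokka_command(fasta_path, outdir="prokka_out"):
    """Build a Prokka command line (as a list of arguments) for one assembly.

    Hypothetical helper: the output subdirectory is named after the input
    file, e.g. GCF_000241285.1.fasta -> prokka_out/GCF_000241285.1
    """
    fasta_path = Path(fasta_path)
    accession = fasta_path.stem  # strips the trailing ".fasta" only
    return [
        "prokka",
        "--outdir", str(Path(outdir) / accession),
        "--genus", "Pantoea",
        "--species", "agglomerans",
        str(fasta_path),
    ]


# Print one command per downloaded assembly
for fasta in sorted(Path("Pantoea_agglomerans").glob("*.fasta")):
    print(" ".join(prokka_command(fasta)))
```

To run the commands instead of printing them, each list could be passed to `subprocess.run(cmd, check=True)`.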

## 4. Constructing a pangenome with `Roary`

We will construct a default pangenome for *P. agglomerans* using `Roary`, by issuing the following command at the terminal: 

```
roary -f roary_out -v prokka_out/*/*.gff
```

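The `-f` flag sets the subdirectory in which `Roary` places its output. The name on disk may carry a suffix that varies from run to run (`roary_<something>`); the example listing below comes from an output directory named `roary`: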

```
$ tree roary/
roary/
├── accessory.header.embl
├── accessory.tab
├── accessory_binary_genes.fa
├── accessory_binary_genes.fa.newick
├── accessory_graph.dot
├── blast_identity_frequency.Rtab
├── clustered_proteins
├── core_accessory.header.embl
├── core_accessory.tab
├── core_accessory_graph.dot
├── fixed_input_files
│   ├── GCF_000330765.1.gff
│   ├── GCF_000475055.1.gff
│   ├── GCF_000627115.1.gff
│   ├── GCF_000687245.1.gff
│   └── GCF_000710215.1.gff
├── gene_presence_absence.Rtab
├── gene_presence_absence.csv
├── number_of_conserved_genes.Rtab
├── number_of_genes_in_pan_genome.Rtab
├── number_of_new_genes.Rtab
├── number_of_unique_genes.Rtab
└── summary_statistics.txt
```
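Before running the `roary` command above, it can help to confirm which annotation files the `prokka_out/*/*.gff` pattern will match, since `Roary` only sees the files the shell expands. The sketch below is a minimal check under that assumption; the function name `find_gff_files` is hypothetical.

```python
from pathlib import Path


def find_gff_files(prokka_dir="prokka_out"):
    """Return sorted .gff paths one level below prokka_dir.

    Hypothetical helper mirroring the shell pattern prokka_dir/*/*.gff
    """
    return sorted(Path(prokka_dir).glob("*/*.gff"))


gff_files = find_gff_files()
print(f"{len(gff_files)} GFF files found")
for path in gff_files:
    print(path)
```

For this worksheet, one `.gff` file per annotated assembly would be expected.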

`Roary` defines (arbitrary) cutoffs for four classes of gene, representing different levels of conservation across the input genome set:

* Core genes (99-100% of input strains)
* Soft core genes (95-99%)
* Shell genes (15-95%)
* Cloud genes (0-15%)

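The numbers of genes placed in each class are summarised in the `summary_statistics.txt` file. Other useful outputs in the listing above include `gene_presence_absence.csv` and `gene_presence_absence.Rtab`, which record the genes found in each input genome.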
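To make the four cutoffs concrete, the sketch below assigns a class to a gene from the number of genomes that carry it, and tallies classes from a presence/absence table. Several things here are assumptions rather than documented `Roary` behaviour: the function names are hypothetical, the class boundaries are treated as half-open (a gene at exactly 95% counts as soft core, for example), and the table is assumed to be tab-separated with the gene name in the first column and one 0/1 column per genome. The example table is made up for illustration.

```python
import csv
import io


def classify_gene(n_present, n_genomes):
    """Assign a conservation class from a presence count.

    Assumption: boundaries are half-open, with the lower bound inclusive.
    """
    fraction = n_present / n_genomes
    if fraction >= 0.99:
        return "core"
    if fraction >= 0.95:
        return "soft core"
    if fraction >= 0.15:
        return "shell"
    return "cloud"


def count_classes(rtab_handle):
    """Tally classes from a tab-separated presence/absence table.

    Assumed layout: header row, gene name in column 1, then one 0/1
    column per genome.
    """
    reader = csv.reader(rtab_handle, delimiter="\t")
    header = next(reader)
    n_genomes = len(header) - 1  # first column holds the gene name
    counts = {"core": 0, "soft core": 0, "shell": 0, "cloud": 0}
    for row in reader:
        n_present = sum(int(value) for value in row[1:])
        counts[classify_gene(n_present, n_genomes)] += 1
    return counts


# Toy table with made-up genes across 10 made-up genomes
header = ["Gene"] + [f"genome{i}" for i in range(1, 11)]
rows = [
    ["geneX"] + ["1"] * 10,              # present in all genomes
    ["geneY"] + ["1"] * 5 + ["0"] * 5,   # present in half
    ["geneZ"] + ["1"] + ["0"] * 9,       # present in one genome only
]
example = "\n".join("\t".join(row) for row in [header] + rows) + "\n"
print(count_classes(io.StringIO(example)))
```

With a real run, the same function could be given an open handle to `gene_presence_absence.Rtab`, and the totals compared against `summary_statistics.txt`.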