# Roary: the pan genome pipeline

## Introduction

A pan-genome is the repertoire of genes from a group of genomes.  It includes genes present in all strains (*core genome*) and those present in only some of the strains (*accessory genome*). 

In this tutorial, we use [Roary](https://sanger-pathogens.github.io/Roary/) to calculate the pan-genome of 5 bacterial strains.

For this tutorial, you will need to install **[Roary](https://github.com/sanger-pathogens/Roary)** and **[Prokka](https://github.com/tseemann/prokka)**.  


## Prepare data for input into Roary

### Download tutorial dataset

The dataset consists of five *Salmonella enterica* serovar Weltevreden (*S.* Weltevreden) genome assemblies from the following study:

**A Phylogenetic and Phenotypic Analysis of Salmonella enterica Serovar Weltevreden, an Emerging Agent of Diarrheal Disease in Tropical Regions.**  
Makendi C, Page AJ, Wren BW, Le Thi Phuong T, Clare S, *et al*.  
*PLOS Neglected Tropical Diseases* (2016) **10(2)**:e0004446.  
[doi:10.1371/journal.pntd.0004446](http://dx.doi.org/10.1371/journal.pntd.0004446)







```bash
wget https://github.com/vaofford/pathogen-informatics-training/tree/roary_tutorial/Notebooks/Roary/example
```

The accessions and metadata for the five assemblies are:

| Assembly accession | Year of isolation | Country | Isolation source |
| :---: | :---: | :---: | :---: |
| LN890522 | 1998 | New Caledonia | Human Stool |
| LN890523 | 1998 | New Caledonia | Human Stool |
| LN890524 | 1999 | Reunion | Human Blood |
| LN890525 | 1999 | Reunion | Human Blood |
| LN890526 | 1999 | Reunion | Human Blood |

### Genome annotation


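Roary takes annotated assemblies in GFF3 format as input, so we first annotate each assembly with Prokka. To run Prokka on just one of the files (e.g. LN890522.fna), we would use the command below.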

`prokka --kingdom Bacteria --genus Salmonella --locustag LN890522 --outdir prokka_LN890522 LN890522.fna`

As we have multiple genomes to annotate, it is easier to use a loop to run Prokka on all five .fna files, automatically updating the locus tag, output directory and input filename for each one.

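```bash
# Annotate every .fna file in the current directory with Prokka
for fn in *.fna
do
    # Strip the .fna extension to get the accession, e.g. LN890522
    fname=$(basename "$fn" .fna)
    prokka --kingdom Bacteria --genus Salmonella --locustag "$fname" --outdir "prokka_$fname" "$fn"
done
```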
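
Before launching the full annotation, it can help to preview the commands the loop would run. The sketch below is a dry run only: `preview_prokka_cmd` is a hypothetical helper introduced here for illustration (it is not part of Prokka), and the two filenames are taken from the dataset table above. It prints each command instead of executing it, so Prokka does not need to be installed to try it.

```bash
# Print the Prokka command for one FASTA file without running it.
# preview_prokka_cmd is a hypothetical helper, not part of Prokka.
preview_prokka_cmd() {
    local fn="$1"
    local fname
    # Derive the locus tag and output directory name from the filename
    fname=$(basename "$fn" .fna)
    echo "prokka --kingdom Bacteria --genus Salmonella --locustag $fname --outdir prokka_$fname $fn"
}

# Preview the commands for two of the tutorial assemblies
for fn in LN890522.fna LN890523.fna
do
    preview_prokka_cmd "$fn"
done
```

Each printed line should show the accession reused as both the locus tag and the suffix of the output directory, which is how the loop keeps the five annotations separate.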

Prokka writes the following files to each output directory:

| Extension | Description |
| :---: | --- |
| .gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
| .gbk | This is a standard GenBank file derived from the master .gff. If the input to Prokka was a multi-FASTA, then this will be a multi-GenBank, with one record for each sequence. |
| .fna | Nucleotide FASTA file of the input contig sequences. |
| .faa | Protein FASTA file of the translated CDS sequences. |
| .ffn | Nucleotide FASTA file of all the annotated sequences, not just CDS. |
| .sqn | An ASN1 format "Sequin" file for submission to GenBank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
| .fsa | Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
| .tbl | Feature Table file, used by "tbl2asn" to create the .sqn file. |
| .err | Unacceptable annotations - the NCBI discrepancy report. |
| .log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. |
| .txt | Statistics relating to the annotated features found. |
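
Of these outputs, the .gff file is the one Roary takes as input, so it is worth confirming that every annotation run produced one. The following is a minimal sketch under stated assumptions: `list_prokka_gffs` is a hypothetical helper (not part of Prokka or Roary), and it assumes the output directories follow the `prokka_<accession>` naming used with `--outdir` in this tutorial.

```bash
# List the non-empty .gff files found in each Prokka output directory
# and warn about any directory that has none.
# list_prokka_gffs is a hypothetical helper; the prokka_* pattern
# assumes the --outdir naming used earlier in this tutorial.
list_prokka_gffs() {
    local base="${1:-.}"
    local dir gff found
    for dir in "$base"/prokka_*/
    do
        # Skip the unexpanded pattern when no directory matches
        [ -d "$dir" ] || continue
        found=0
        for gff in "$dir"*.gff
        do
            # -s is true only for files that exist and are not empty
            if [ -s "$gff" ]; then
                echo "$gff"
                found=1
            fi
        done
        [ "$found" -eq 1 ] || echo "WARNING: no .gff file in $dir" >&2
    done
}

# Check the output directories under the current directory
list_prokka_gffs .
```

The file paths go to standard output and the warnings to standard error, so the list of paths can be redirected to a file without the warnings mixed in.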

## Quality Control

## Running Roary

## Drawing trees

## Querying the pangenome

## Visualising results using Phandango

Note: if your filenames contain the "|" character, make sure you have the latest version of Prokka installed; otherwise tbl2asn will not generate the .gbk file, which is used by Prokka.
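
A quick way to find out whether this note applies is to look for the "|" character in the input filenames before annotating. The sketch below assumes the input files use the `.fna` extension, as in this tutorial; `find_pipe_names` is a hypothetical helper and not part of Prokka or Roary.

```bash
# Print any .fna file in a directory whose name contains "|".
# find_pipe_names is a hypothetical helper, not part of Prokka or Roary.
find_pipe_names() {
    local dir="${1:-.}"
    local fn
    for fn in "$dir"/*.fna
    do
        # Skip the unexpanded pattern when no .fna files are present
        [ -e "$fn" ] || continue
        case "$(basename "$fn")" in
            *"|"*) echo "$fn" ;;
        esac
    done
}

# Check the current directory
find_pipe_names .
```

If the function prints nothing, none of the `.fna` filenames in that directory contain the character.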

## References

**Roary: Rapid large-scale prokaryote pan genome analysis.**  
Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill  
*Bioinformatics* (2015) **31(22)**:3691-3693.  
[doi:10.1093/bioinformatics/btv421](http://dx.doi.org/10.1093/bioinformatics/btv421) [PMID:26198102](https://www.ncbi.nlm.nih.gov/pubmed/26198102) 

**Prokka: rapid prokaryotic genome annotation**  
Seemann T.  
*Bioinformatics* (2014) **30(14)**:2068-9.  
[doi:10.1093/bioinformatics/btu153](http://dx.doi.org/10.1093/bioinformatics/btu153) [PMID:24642063](https://www.ncbi.nlm.nih.gov/pubmed/24642063)

**Phandango**   
[http://jameshadfield.github.io/phandango/](http://jameshadfield.github.io/phandango/)

**A Phylogenetic and Phenotypic Analysis of Salmonella enterica Serovar Weltevreden, an Emerging Agent of Diarrheal Disease in Tropical Regions.**  
Makendi C, Page AJ, Wren BW, Le Thi Phuong T, Clare S, *et al.*  
*PLOS Neglected Tropical Diseases* (2016) **10(2)**:e0004446.  
[doi:10.1371/journal.pntd.0004446](http://dx.doi.org/10.1371/journal.pntd.0004446) [PMID:26867150](https://www.ncbi.nlm.nih.gov/pubmed/26867150)
