Skip to content

sheppardlab/pGWAS

Repository files navigation

pGWAS

pGWAS and functional filtering pipeline to identify candidate adaptive genetic traits in bacteria

This repository contains scripts and methods accompanying the Méric, Mageiros et al. study “A pathogen in plain sight: disease-causing genotypes of the commensal skin bacterium Staphylococcus epidermidis” currently submitted for publication. This pGWAS method was developed by Leonardos Mageiros, Guillaume Méric and Samuel Sheppard at the University of Bath / Milner Centre for Evolution (United Kingdom), with considerable input from Koji Yahara (University of Tokyo, Japan), Xavier Didelot (Imperial College London).

This manual was written by Guillaume Méric, Leonardos Mageiros and Samuel K. Sheppard. All scripts for the method can be found in this repository, while example input and output files can be downloaded from Figshare (doi:10.6084/m9.figshare.5856861). General inquiries, including technical, should be directed to s.k.sheppard@bath.ac.uk.

Required languages and libraries:

Perl (https://www.perl.org/), Bioperl (http://bioperl.org/), Blast2 (http://www.ncbi.nlm.nih.gov/BLAST/), Python (https://www.python.org/), NumPy (http://www.numpy.org/), Biopython (http://biopython.org/), R (https://www.r-project.org/).

Principle:

Principle of the method Steps detailed below are directly relating to the numbers from the figure above. Script names are marked in red. Examples in the text and example files relate to the S. epidermidis pGWAS study that this method accompanies.

Steps 1 to 3. Bacterial isolation, DNA extraction, whole genome sequencing and genome assembly. An appropriate sampling of bacterial strains is a crucial step in every microbiological study to limit interpretation bias of any result, and this is also valid for GWAS. Depending on the complexity of the phenotype of interest to associate with genetic elements, for example, antibiotic resistance (1), toxicity (2), biofilm formation (3), host association (4), it can be necessary to perform a broad, species-wide sampling to avoid misinterpreting lineage-specific adaptive traits, which has been shown to occur for instance in Escherichia coli (5), for species-wide adaptive traits. DNA extraction, sequencing and assemblies can be performed in numerous ways. Our laboratory has been preferring the QIAGEN DNA extraction kits and Illumina MiSeq or HiSeq sequencers in various studies (3, 4, 6-8). For assembly, SPAdes (9) or Velvet (10), or any other suitable software can be used. The output of these steps should be contiguous assemblies of whole bacterial genomes.

Steps 4 and 5. Definition of groups to study, population structure and preparation of dataset to analyse using GWAS. This method associates genetic elements from the pangenome of bacteria of interest with simple and complex phenotypes. Typically, what we call simple phenotypes are traits that will be caused by the presence of very few genetic elements, such as specific antibiotic resistance, or the biosynthesis of specific metabolic compounds. Complex phenotypes are likely to be linked to the presence of various elements, pathways or linked genes, for instance, biofilm formation, environmental survival, host association, or pathogenicity/healthy carriage.

To perform the study, two groups to compare have to be defined. For simple phenotypes, this definition is relatively easy. A bacterium resists or not to a particular antibiotic, or produces or not a particular compound. Defining groups is more complicated for complex, quantitative phenotype, of which the distribution of values would have to be examined before deciding on groups. For instance, we recently published a GWAS analysis in the bacterium Campylobacter, in which we defined groups from a quantitative phenotypic scores of biofilm formation based on high and low percentiles (3). For more stringency, we structured the quantitative phenotypical measures in 3 percentiles (“high”, “medium” and “low”) and defined “high” and “low” as the groups to use in the association (6).

The next step is to examine the population structure of the dataset of interest by constructing a robust phylogenetic tree (PhyML (11), FastTree (12), ClonalFrameML (13) or other methods) mapped with the two groups to compare (Step 4). This step is crucial as it will inform on whether GWAS is a suitable method for the analysis. Indeed, GWAS is a preferred method when the two groups of interest are present in various lineages of the population. A certain level of lineage-specific adaptations are to be expected in bacteria, but many phenotypes are commonly found across bacterial populations. If however a phenotype of choice is clearly appearing has having been inherited in the common ancestor of a particular lineage, and only that lineage, a classical pangenome content comparison may be a more parsimonious way to understand the genetic basis of this particular phenotypical variation.

Indeed, a particularity of our approach is that it offers multiple ways of drastically reducing the effect of clonal inheritance to identify candidate adaptive genetic elements that are not simply co-inherited. In other words, our approach attempts to separate recent adaptive events that would have occurred at the tip of a phylogenetic tree from old adaptive events that would have occurred in deep branches. A first step is to select pairs of strains from both groups that are very closely related on a robust phylogenetic tree (or to set up an arbitrary, very low genetic distance threshold between strains) to define pairs. The selection of pairs should follow as much as possible the diversity of strains in the whole population, which will limit lineage-specific over-sampling bias when interpreting the results. Our method works for 15 pairs of strains or more for a single GWAS group. If it is possible to assemble 30 pairs or more, the pairs should be split into two or more independent groups (of similar species diversity coverage) that can be treated as technical replicate sets. We call this approach of isolates selection a “matched dataset” approach. Once a sub-population of isolates is selected, a robust phylogenetic tree based on a core genome alignment should be constructed as a guiding tree for the association.

The output of these steps should be a robust phylogenetic tree (in Newick format) of a carefully selected subset of the original dataset (or more, if replicate datasets can be assembled), suitable for further analysis.

Step 6 and 7. Automatic annotation of genomes and creation of a pan-genome list. This step allows the creation of a reference-free list of all unique genes and their various alleles present in the genomic dataset of interest. This is based on the previously published reference pangenome approach (14) and the “pangenome.pl” script (included in the “1_Pangenome” directory of example_files.zip). Briefly, after automatic annotation of the contig files with RAST (15, 16), Prokka (17) or similar methods, the script uses BLAST to compare every next entry of a gene (CDS) list (see input file formats in the exemple_files folder) to every entry of this gene list. When two nucleotide sequences share more than 70% sequence identity on more than 10% of the sequence length, they are considered identical or alleles of the same gene. When the local alignment falls below this threshold, the two sequences are considered different, and of two different genes (14).

The first file to be processed is typically the most commonly accepted reference genome for a given species, but can be any list of well-annotated genes or genome. The rationale behind chosing a well-annotated reference strain as the first file to be processed is that a large majority of its genes will be present in most isolates from the dataset and classified by the script as alleles of already existing genes, rather than novel unique genes. This way, the gene will retain its accepted nomenclature name, and ease up the interpretation of further results with previously published literature by retaining commonly accepted gene names.

The input format of the gene lists that include nucleotide sequences is typically the output of the “databank scan” function in BIGSdb (18-20), but example files are included for information (in “1_Pangenome\pangenome_input” of example_files.zip).

The output will consist of 9 various files containing summaries of the BLAST operations performed by the script. Two of these files contain the pangenome list that include all genes identified as unique in a BIGSdb format (“union.txt” file) and in a multi-FASTA format (“union.fas”). Examples of these files are included in the “1_Pangenome” directory of example_files.zip.

Step 8a. Quantitative allelic pan-genome wide association (pGWAS).

• Principle Our method is based on the fragmentation of genomes into consecutive 30-bp k-mers, or “words”, which are statistically associated with one phenotypical group of interest or another. This approach is alignment-free and allow for the detection of any genomic variant, whether caused by point mutation, homologous recombination, or lateral gene transfer, and will control for the effect of population structure and the possible clonal inheritance of genetic variants by simulating random word evolution and comparing the simulation with observed distributions. Details about the procedure have been published previously (3, 4). Here, we present an updated method as the association and mapping are now performed on the whole dataset pangenome, rather than a single reference genome, making our approach entirely reference-free. In practice, this means that a GWAS run is performed on a reference gene sequence comprised by all the alleles of all the genes in the dataset.

• Preparation of input files An example of all files described below is included in the directory “2_GWAS_input_files” of the example_files.zip.

  • Contigs: Create a folder containing the contigs of the strains to be analysed. Contig files must be named only with the identifier used in the tree (for simplicity, we use the BIGSdb isolate identifier number) and have a suffix of .ctg. If two technical replicates of GWAS are performed, as discussed above, create two separate folders.
  • Guiding tree(s): One (or two) trees in Newick format, created at Step 5.
  • Reference FASTA and BED files: The FASTA file contains concatenated sequences from the pangenome list “union.txt” file created out of the GWAS strains followed by all the different alleles that can be obtained by parsing the “duplicates.log” file produced from the pangenome script in Step 7. The “duplicates.log” file contains sequence information of all alleles of all the unique genes of the dataset included in the “union.txt” file. All gene and allele sequences from the dataset must be included in this file, and all corresponding headers and gene names must remain identical in all files taking care though to remove alleles with 100% sequence identity. It is recommended to make sure that the nucleotide letters in the FASTA files are in capitals. We used the “concatenate_sequences.pl”, included in the “3_GWAS_input_preperation” directory of example_files.zip, to concatenate the sequences into one FASTA header. This sequence is the reference sequence for the GWAS run. The corresponding BED file have to be created (example files included in the “2_GWAS_input_files” directory of example_files.zip) and will contain the annotations and positions corresponding to the FASTA concatenated sequence. This step is crucial for the subsequent mapping of associated words to genes of the pangenome list. . If two technical replicates of GWAS are performed, as discussed above, repeat this procedure for the second replicate set as well.
  • Isolate status file: A comma separated file for each replicate set with 3 columns: , < strain id >, . The “group” value should be a one letter designation of the phenotype assigned in the corresponding isolate.

• GWAS execution The analysis is performed by executing “assomap_given_phylo.py”, which can be found in the “4_GWAS_scripts” directory of example_files.zip. The script has to be run for each technical replicate dataset. The command to use is as follows:

Python ./ assomap_given_phylo.py -s 0.0001 -d <contigs_folder> -p <isolates_status_file> -f 30 -t -r <reference_fasta_file> -c

For the last argument the groups must be indicated by two letters separated by an underscore based on the naming that you have on your isolate status file (see previous point). For example, if your phenotypes in the status file are H and D for “healthy carriage” and “disease”, the argument in the command must then be H_D. The script will run our GWAS algorithm on the genomic datasets.

Step 8b. Mapping of associated words to the pangenome Three files will be produced. To perform the mapping of the word association on the reference pangenome list, the script “summarize_assoc_words.pl” (included in the “4_GWAS_scripts” directory of example_files.zip) has to be run for each GWAS with the following syntax:

perl ./summarise_assoc_words.pl –f -k -p 0.0001 –o -a -r

This script will output 9 files summarising the hits of the corresponding GWAS run. Files with the suffix “.overlapGenes” contain all the words associated with the two groups of interest, mapped to corresponding genes from the pangenome list. Results that are mapped on alleles different than the alleles included in the “union.txt” file have to be replaced with the allele name included in the pangenome list using the “duplicates.log” file. The score of association is a positive integer for one group and negative for the other. If words are found to be associated with both phenotypes, we typically keep the highest absolute score when summarising the association per gene. In the case of two technical replicate GWAS datasets, genes that appear as associated in both groups can be taken into account after merging the results of each replicate (“allele 1” and “other alleles” results).

Steps 11 to 14. Quantitative phenotyping of bacterial isolates and statistical correlation of GWAS signal with in vitro or in vivo phenotype scores

• Principle of the approach At this stage, a list of genes containing words associated statistically with two groups of interest is obtained. If a simple biological groups were studied (e.g. “antibiotic resistance”), the list may be straight-forward to interpret, especially if the expected genes, and might inform on other linked genotypes (3) are present. However, in the case of complex phenotypes, it is expected that many lowly significant hits would be obtained, without any major strongly significantly associated hit. This would be particularly the case for groups defined on complex phenotypes that could be explained by various genetic factors. For instance, “virulence” can have many genetic components. To inform and filter the list of associated elements, we present a novel approach that requires the further phenotypical testing of bacterial strains used in the GWAS, as described above. These phenotypes can be multiple and have to be carefully chosen. For instance, if one wants to associate genotypes involved in “virulence” as a general property, possible interesting phenotypes to test in vitro could be cell invasion, adhesion, elicited immune response or even specific antibiotic resistance. The output of our procedure will be lists of words, mapped to corresponding genes, that are associated in the two groups of interest, but also correlated (using Fisher’s exact test) with phenotypic scores (i.e., the word will be present in isolates that have a high phenotypical score). This step constitute a major improvement of current GWAS methods, as it allows the functional filtering of association results, something that is, to our knowledge, unprecedented.

• Execution of the correlation script Three inputs have to be prepared, followed by the execution of 3 scripts. Scripts and sample files are located in the “5_correlation_scripts” directory of example_files.zip:

  • Associated words in FASTA format: This file contains all associated words as they were produced by the GWAS run, in a multi-FASTA format. Each word must be under one single FASTA header in the file, and each FASTA header must have the format “>GENENAME_uniqueID”. For an example of this, see the “all_assoc_words.fas” sample file.
  • Contigs: A folder has to be created, containing all contigs of the strains included in the in vitro assays
  • Phenotype scores file: A file has to be created with the results of these assays. That file should contain 3 tab-delineated columns: “strain ID”, “phenotype score”, “group name” (see “sample_phenotype” file). Note: if results are unclear at the end of the correlation step, we recommend to split the continuous phenotypic scores in this phenotype file in 3 percentiles, and assign the top percentile the value “1”, the middle percentile the value “0.5” and the low percentile the value “0”.

Execute the “1_split_words_per_gene.pl” script with as an argument the name of the words multi-FASTA file. A folder named “split_words” will be created and contain one file including all the words associated with one gene as an output. Then, execute the “2_multi_count_words_per_word.pl”, using as an argument the name of the folder containing the contigs, the name of the folder containing the split associated words and an output folder name. This script will count the presence/absence of each word in all the contigs.

Finally execute the correlation R script (“3_correlation_script_V2.R”), using giving as an argument the name of the phenotypic scores file, and the folder with the counted words. The output has to be redirected into a new output text file. The functionally-filtered list of associated genes will be obtained by ordered the resulting list from the lowest to highest Fisher’s exact test p values.

References

  1. Chewapreecha C, et al. (2014) Comprehensive identification of single nucleotide polymorphisms associated with beta-lactam resistance within pneumococcal mosaic genes. Plos Genet 10(8):e1004547.
  2. Laabei M, et al. (2014) Predicting the virulence of MRSA from its genome sequence. Genome Res 24(5):839-849.
  3. Pascoe B, et al. (2015) Enhanced biofilm formation and multi-host transmission evolve from divergent genetic backgrounds in Campylobacter jejuni. Environ Microbiol.
  4. Sheppard SK, et al. (2013) Genome-wide association study identifies vitamin B5 biosynthesis as a host specificity factor in Campylobacter. PNAS 110(29):11923-11927.
  5. Meric G, Kemsley EK, Falush D, Saggers EJ, & Lucchini S (2013) Phylogenetic distribution of traits associated with plant colonization in Escherichia coli. Environ Microbiol 15(2):487-501.
  6. Meric G, et al. (2015) Ecological Overlap and Horizontal Gene Transfer in Staphylococcus aureus and Staphylococcus epidermidis. Genome Biol Evol 7(5):1313-1328.
  7. Sheppard SK, et al. (2014) Cryptic ecology among host generalist Campylobacter jejuni in domestic animals. Mol Ecol 23(10):2442-2451.
  8. Sheppard SK, et al. (2013) Progressive genome-wide introgression in agricultural Campylobacter coli. Mol Ecol 22(4):1051-1064.
  9. Bankevich A, et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology : a journal of computational molecular cell biology 19(5):455-477.
  10. Zerbino DR & Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5):821-829.
  11. Guindon S, Delsuc F, Dufayard JF, & Gascuel O (2009) Estimating maximum likelihood phylogenies with PhyML. Methods Mol Biol 537:113-137.
  12. Price MN, Dehal PS, & Arkin AP (2010) FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One 5(3):e9490.
  13. Didelot X & Wilson DJ (2015) ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput Biol 11(2):e1004041.
  14. Meric G, et al. (2014) A reference pan-genome approach to comparative bacterial genomics: identification of novel epidemiological markers in pathogenic Campylobacter. PLoS One 9(3):e92798.
  15. Aziz RK, et al. (2008) The RAST Server: rapid annotations using subsystems technology. Bmc Genomics 9:75.
  16. Overbeek R, et al. (2014) The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res 42(1):D206-214.
  17. Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30(14):2068-2069.
  18. Jolley KA & Maiden MC (2010) BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics 11:595.
  19. Maiden MC, et al. (2013) MLST revisited: the gene-by-gene approach to bacterial genomics. Nat Rev Microbiol 11(10):728-736.
  20. Sheppard SK, Jolley KA, & Maiden MCJ (2012) A Gene-By-Gene Approach to Bacterial Population Genomics: Whole Genome MLST of Campylobacter. Genes 3(2):261-277.

About

pGWAS and functional filtering pipeline to identify candidate adaptive genetic traits in bacteria

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published