- NOTE FOR USERS
- NOTE FOR DEVELOPERS
- Third-party Packages
- Recent changes
- 1. Producing and validating the config file
- 2. Fix sequences for analysis
- 3. Defining CDS features on each genome (optional)
- 4. Design primers to each genome in bulk
- 5. Deduplicate primer sets (optional)
- 6. Screen primers against
- 7. Test primers against input sequences for crosshybridisation with
- 8. Classify primers by predicted diagnostic capability with
- 9. Extract and assess candidate metabarcoding markers
- FURTHER INFORMATION
Full documentation about usage of
diagnostic_primers can be found on ReadTheDocs:
If you use
diagnostic_primers, please cite one or both of the following papers:
- Pritchard L et al. (2012) "Alignment-Free Design of Highly Discriminatory Diagnostic Primer Sets for Escherichia coli O104:H4 Outbreak Strains." PLoS ONE 7(4): e34498. doi:10.1371/journal.pone.0034498 - Method description and application to human bacterial pathogens, sub-serotype resolution
- Pritchard L et al. (2013) "Detection of phytopathogens of the genus Dickeya using a PCR primer prediction pipeline for draft bacterial genome sequences." Plant Pathology, 62, 587-596 doi:10.1111/j.1365-3059.2012.02678.x - Application to plant pathogens, species-level resolution
The default branch for this repository is a development branch:
diagnostic_primers. If you are looking for code to reproduce work from Pritchard et al. (2012) or Pritchard et al. (2013), please checkout the
master branch, or download release v0.1.3.
The default master branch for development is
diagnostic_primers. We appreciate it when you raise issues, or make contributions via pull request, especially if you follow the guidelines on the wiki. Writing tests to demonstrate/catch bugs, or to improve coverage, is especially appreciated!
This package automates discovery of discriminatory PCR or qPCR primers that can distinguish between arbitrary subgroups of input genomes or other biological sequences of interest. The package also aids identification of novel metabarcoding marker sequences. The package can be used in a number of ways:
pdpcommand-line program can be used to conduct this analysis in an interactive or scripted manner.
diagnostic_primersPython model can be used within your own programs to automate primer and/or marker design.
This tool depends on some third-party software packages
Primer3is the tool used to design primers. For compatibility with
EMBOSS, version 1 of the software is essential. [webpage]
EMBOSS: This suite of tools is used to interact with
Primer3and to perform in silico cross-hybridisation checks with
BLAST+: This tool is used to screen primers against a database of off-target sequences with the
prodigal: This program is used to identify candidate prokaryotic CDS features when using the
pdp filtersubcommand. [webpage]
MAFFT: This is required to align candidate metabarcoding marker sequences, when using the
pdp extractsubcommand. [webpage]
These packages can all be installed using the
conda package manager, and the
bioconda: a channel for the
condapackage manager, specialising in bioinformatics software. [webpage]
conda install primer3=1.1.4 emboss blast prodigal mafft
The new version of
find_differential_primers) now uses a subcommand model (like the tools
subversion). These execute the following subtasks, some or all of which may be required to perform a specific primer/marker design run. This change has been made for flexibility in composing workflows.
pdp config: Process/validate the configuration file and stitch input contig fragments/replace ambiguity symbols as necessary.
pdp filter: Filter input genome sequences so that primers are designed only to targeted regions.
e3: Design amplifying primers on the input sequences
bs: Filter designed primers against a database of negative examples
ps: Filter designed primers on their ability to amplify each input sequence
cl: Classify designed primers by specificity for each class of input sequence
ex: Extract amplicon sequences corresponding to diagnostic primer sets
Each of these subcommands has specific help, accessible with
pdp <subcommand> -h or
pdp <subcommand> --help.
This section describes a worked example analysis, proceeding from defining a config file to producing a diagnostic primer set result. All the files required for this analysis can be found in the subdirectory
1. Producing and validating the config file
We begin with a small set of bacterial genomes: three Pectobacterium species. These are defined as
.fasta sequences in the directory
$ ls tests/walkthrough/sequences/ GCF_000011605.1.fasta GCF_000291725.1.fasta GCF_000749845.1.fasta
A basic config file defining the three genomes is also provided as
tests/walkthrough/pectoconf.tab in tab-separated tabular format (you can generate equivalent files for your own data using Excel, and saving the sheet as tab-separated text format).
# Pectobacterium genomes downloaded from GenBank/NCBI; genomovars inferred from ANIm # Annotated Pba: genomovar 1 Pba_SCRI1043 Pectobacterium,atrosepticum_NCBI,gv1 tests/walkthrough/sequences/GCF_000011605.1.fasta - # Annotated Pwa: genomovars 2, 3 Pwa_CFBP_3304 Pectobacterium,wasabiae_NCBI,gv2 tests/walkthrough/sequences/GCF_000291725.1.fasta - # Annotated Pb : genomovar 7 Pbe_NCPPB_2795 Pectobacterium,betavasculorum_NCBI,gv7 tests/walkthrough/sequences/GCF_000749845.1.fasta -
Four columns are required (no headers are read), one row per input genome:
- a unique name for the genome
- a list of comma-separated labels for this genome - this will determine specificity of predicted primer sets
- path to a
FASTAfile of the input genome sequence - this can be a draft genome in multiple fragments
- (optional) path to a
.gfffile descrigbing genome features - if no features are provided, this should be a hyphen/dash (
In the walkthrough config file no features are defined, so this column contains only the symbol
- to mark it as being empty.
Comment lines can be included, starting with
# as the first character. These are ignored in the analysis
To confirm that the config file is correctly-formatted, we use the
pdp config --validate command:
$ pdp config --validate tests/walkthrough/pectoconf.tab WARNING: Validation problems Pbe_NCPPB_2795 requires stitch (tests/walkthrough/sequences/GCF_000749845.1.fasta) Pwa_CFBP_3304 requires stitch (tests/walkthrough/sequences/GCF_000291725.1.fasta) Pwa_CFBP_3304 has non-N ambiguities (tests/walkthrough/sequences/GCF_000291725.1.fasta)
This tells us that the first two genome files are in multiple parts so must be concatenated for this analysis, and that the second file also has ambiguity base symbols that are not
N. These must be replaced by
N for the analysis to proceed using the third-party tools.
2. Fix sequences for analysis
pdp config command will fix input genomes for analysis, when the
--fix_sequences argument is supplied. In this case,
pdp writes new sequences having the necessary changes (stitching, replacing ambiguity symbols) and generates a new config file that refers to the fixed sequences. The path to the new config file is given with the argument
$ pdp config --fix_sequences tests/walkthrough/fixed.json \ tests/walkthrough/pectoconf.tab
This writes corrected sequences to the
tests/walkthrough/sequences subdirectory, and a new config file to
JSON, rather than tabular format):
$ tree tests/walkthrough/ tests/walkthrough/ ├── fixed.json ├── pectoconf.tab └── sequences ├── GCF_000011605.1.fasta ├── GCF_000291725.1.fasta ├── GCF_000291725.1_concat.fas ├── GCF_000291725.1_concat_noambig.fas ├── GCF_000749845.1.fasta └── GCF_000749845.1_concat.fas
3. Defining CDS features on each genome (optional)
For prokaryotic genomes, we can use a genecaller to predict gene features with the
pdp filter --prodigal command. This generates a
GFF file describing CDS feature locations on each genome, and
FASTA files of the CDS sequences. This can be useful because, in the primer design stage, we can take CDS locations into account to retain only primers that amplify within CDS regions.
To use the genecaller, we must provide an appropriate config file (the
fixed.json config file), and the path to a new config file that will contain information about the predicted features (we'll call this
fixed_with_features.json). We will tell
prodigal to place the predicted gene locations in the subdirectory
pdp filter --prodigal --outdir tests/walkthrough/prodigal \ tests/walkthrough/fixed.json \ tests/walkthrough/fixed_with_features.json
The new directory containing genecaller output is created, as is the new config file:
$ tree tests/walkthrough/ tests/walkthrough/ ├── fixed.json ├── fixed_with_features.json ├── pectoconf.tab ├── prodigal │ ├── GCF_000011605.1.features │ ├── GCF_000011605.1.gff │ ├── GCF_000291725.1_concat_noambig.features │ ├── GCF_000291725.1_concat_noambig.gff │ ├── GCF_000749845.1_concat.features │ └── GCF_000749845.1_concat.gff └── sequences […]
4. Design primers to each genome in bulk
Using the config file
fixed_with_features.json we can design primers to each input genome with the EMBOSS
ePrimer3 package. At a minimum, we need to give the
pdp eprimer3 command the input config file, and the path to an output config file that will contain additional information about the primers designed to each genome. We can also use the
--outdir argument to provide a path to a directory where
pdp will put the
ePrimer3 output files.
If we wish to use only primers that amplify within the CDS regions we designed in the previous step, we need to give the
--filter argument. Otherwise
pdp eprimer3 will ignore this information.
$ pdp eprimer3 --filter --outdir tests/walkthrough/eprimer3 \ tests/walkthrough/fixed_with_features.json \ tests/walkthrough/with_primers.json
This places the output of
ePrimer3 into its own directory (
.eprimer3 files), and generates
JSON files that describe the primers for each of the genomes. Additionally, a
.bed file is produced describing the location of each primer set on the source genome, which can be visualised using a tool such as IGV.
$ tree tests/walkthrough/ tests/walkthrough/ ├── eprimer3 │ ├── GCF_000011605.1.eprimer3 │ ├── GCF_000011605.1_named.bed │ ├── GCF_000011605.1_named.eprimer3 │ ├── GCF_000011605.1_named.json │ ├── GCF_000291725.1_concat_noambig.eprimer3 │ ├── GCF_000291725.1_concat_noambig_named.bed │ ├── GCF_000291725.1_concat_noambig_named.eprimer3 │ ├── GCF_000291725.1_concat_noambig_named.json │ ├── GCF_000749845.1_concat.eprimer3 │ ├── GCF_000749845.1_concat_named.bed │ ├── GCF_000749845.1_concat_named.eprimer3 │ └── GCF_000749845.1_concat_named.json ├── fixed.json ├── fixed_with_features.json ├── pectoconf.tab ├── prodigal […] ├── sequences […] └── with_primers.json
5. Deduplicate primer sets (optional)
When designing thermodynamically plausible primers to closely-related genomes, it is very likely that identical primer sets will be created on distinct genomes due to their sequence conservation. Carrying these duplicate primer sets through to later analysis stages can result in significant unnecessary computational load. To remove identical primer sets, we can use the
pdp dedupe command:
$ pdp dedupe --dedupedir tests/walkthrough/deduped \ tests/walkthrough/with_primers.json \ tests/walkthrough/deduped_primers.json
This examines the primers described in the input config file (
with_primers.json) and places equivalent sets of new
.bed files, with all redundant primers removed, in the specified directory (
tests/walkthrough/deduped), and creates a new config file (
deduped_primers.json) that points to these files.
6. Screen primers against
BLASTN database (optional)
Now that primers have been designed and deduplicated, it can be useful to screen them against a
BLASTN database to identify and filter out any primers that have potential for off-target amplification. In general, we advise that this step is used not to demonstrate potential for target amplification, but only to exclude primers that might have potential for off-target binding, in order to remove non-specific primer sets.
pdp blastscreen subcommand expects the path to a suitable pre-compiled
BLASTN database to be provided with the argument
--db (there is a
BLASTN E. coli genome database in the subdirectory
tests/walkthrough/blastdb/). By specifiying
--outdir, we place
BLAST output in the
tests/walkthrough/blastn subdirectory. The input config file use is the
deduped_primers.json file produced in the last step, having locations of the non-redundant primer files, and we specify that the command writes a new config file called
screened.json that points to a reduced set of primers, with the potentially off-target/non-specific sets excluded.
No sequences are deleted as a result of this action.
$ pdp blastscreen --db tests/walkthrough/blastdb/e_coli_screen.fna \ --outdir tests/walkthrough/blastn \ tests/walkthrough/deduped_primers.json \ tests/walkthrough/screened.json
This screen produces the new subdirectory
tests/walkthrough/blastn that contains all the primer sequences derived from each genome in FASTA format, and the tabular format output (
.blasttab) of each BLAST search. New files are added to the
deduped subdirectory (with suffix
_screened), describing the reduced sets of primers, post-screening.
$ tree tests/walkthrough/ tests/walkthrough/ ├── blastdb │ ├── e_coli_screen.fna.nhr │ ├── e_coli_screen.fna.nin │ └── e_coli_screen.fna.nsq ├── blastn │ ├── GCF_000011605.1_primers.blasttab │ ├── GCF_000011605.1_primers.fasta │ ├── GCF_000291725.1_concat_noambig_primers.blasttab │ ├── GCF_000291725.1_concat_noambig_primers.fasta │ ├── GCF_000749845.1_concat_primers.blasttab │ └── GCF_000749845.1_concat_primers.fasta […] ├── deduped │ ├── GCF_000011605.1_named_deduped.bed │ ├── GCF_000011605.1_named_deduped.json │ ├── GCF_000011605.1_named_deduped_screened.bed │ ├── GCF_000011605.1_named_deduped_screened.fasta │ ├── GCF_000011605.1_named_deduped_screened.json │ ├── GCF_000291725.1_concat_noambig_named_deduped.bed │ ├── GCF_000291725.1_concat_noambig_named_deduped.json │ ├── GCF_000291725.1_concat_noambig_named_deduped_screened.bed │ ├── GCF_000291725.1_concat_noambig_named_deduped_screened.fasta │ ├── GCF_000291725.1_concat_noambig_named_deduped_screened.json │ ├── GCF_000749845.1_concat_named_deduped.bed │ ├── GCF_000749845.1_concat_named_deduped.json │ ├── GCF_000749845.1_concat_named_deduped_screened.bed │ ├── GCF_000749845.1_concat_named_deduped_screened.fasta │ └── GCF_000749845.1_concat_named_deduped_screened.json […] ├── fixed.json ├── fixed_with_features.json ├── pectoconf.tab ├── prodigal […] ├── sequences […] └── with_primers.json
7. Test primers against input sequences for crosshybridisation with
To identify which primers might be diagnostically useful for any of the classes/labels defined in the config file, we test how they potentially amplify the other genomes from the input set. We use the
primersearch to do this.
pdp primersearch subcommand, we pass the appropriate
.json config file and specify a directory to hold
primersearch output with
--outdir, as well provide the path to write a new config file that will hold data about this crosshybridisation screen.
pdp primersearch \ --outdir tests/walkthrough/primersearch \ tests/walkthrough/screened.json \ tests/walkthrough/primersearch.json
After running the command, a new directory
tests/walkthrough/primersearch is produced, containing a new primer file (
.tab - required for
primersearch input) and
.json files for each genome's
primersearch results. In addition, there are
.bed files describing all amplicons produced on a specific genome:
.bed files can be visualised using a tool like
$ tree tests/walkthrough tests/walkthrough [...] ├── primersearch │ ├── Pba_SCRI1043_amplicons.bed │ ├── Pba_SCRI1043_amplicons.json │ ├── Pba_SCRI1043_primers.primertab │ ├── Pba_SCRI1043_primersearch.json │ ├── Pba_SCRI1043_ps_Pba_SCRI1043.primersearch │ ├── Pba_SCRI1043_ps_Pbe_NCPPB_2795.primersearch │ ├── Pba_SCRI1043_ps_Pwa_CFBP_3304.primersearch │ ├── Pbe_NCPPB_2795_amplicons.bed │ ├── Pbe_NCPPB_2795_amplicons.json │ ├── Pbe_NCPPB_2795_primers.primertab │ ├── Pbe_NCPPB_2795_primersearch.json │ ├── Pbe_NCPPB_2795_ps_Pba_SCRI1043.primersearch │ ├── Pbe_NCPPB_2795_ps_Pbe_NCPPB_2795.primersearch │ ├── Pbe_NCPPB_2795_ps_Pwa_CFBP_3304.primersearch │ ├── Pwa_CFBP_3304_amplicons.bed │ ├── Pwa_CFBP_3304_amplicons.json │ ├── Pwa_CFBP_3304_primers.primertab │ ├── Pwa_CFBP_3304_primersearch.json │ ├── Pwa_CFBP_3304_ps_Pba_SCRI1043.primersearch │ ├── Pwa_CFBP_3304_ps_Pbe_NCPPB_2795.primersearch │ ├── Pwa_CFBP_3304_ps_Pwa_CFBP_3304.primersearch │ └── target_amplicons.json ├── primersearch.json [...]
primersearch.json config file contains information about this crosshybridisation screen, and can be used for identification and extraction of diagnostic primer sequence sets.
8. Classify primers by predicted diagnostic capability with
To classify primer sets by their ability to amplify only genomes belonging to a specific named group/label in the configuration file, we use the
pdp classify subcommand. This examines the
primersearch output and reports back primer sets that amplify all genomes having a specific label/group, and only those genomes.
We pass as input the
.json file produced by the
pdp primersearch run, and the path to a directory for the output of the
pdp classify subcommand:
pdp classify \ tests/walkthrough/primersearch.json \ tests/walkthrough/classify
The new directory contains
.ePrimer3 format files for each set of primers diagnostic to a given class, and summary information for thos primers in
The output directory also contains
.bed files describing all regions on each genome that are amplified by primers having a particular specificity. For instance, the
Pbe_NCPPB_2795_Pectobacterium_amplicons.* files contain information about the regions of the genome
Pbe_NCPPB_2795 that are amplified by all primer sets with specificity to the label/group
$ tree tests/walkthrough/classify tests/walkthrough/classify ├── Pba_SCRI1043_Pectobacterium_amplicons.bed ├── Pba_SCRI1043_Pectobacterium_amplicons.json ├── Pba_SCRI1043_atrosepticum_NCBI_amplicons.bed ├── Pba_SCRI1043_atrosepticum_NCBI_amplicons.json ├── Pba_SCRI1043_gv1_amplicons.bed ├── Pba_SCRI1043_gv1_amplicons.json ├── Pbe_NCPPB_2795_Pectobacterium_amplicons.bed ├── Pbe_NCPPB_2795_Pectobacterium_amplicons.json ├── Pbe_NCPPB_2795_betavasculorum_NCBI_amplicons.bed ├── Pbe_NCPPB_2795_betavasculorum_NCBI_amplicons.json ├── Pbe_NCPPB_2795_gv7_amplicons.bed ├── Pbe_NCPPB_2795_gv7_amplicons.json ├── Pectobacterium_primers.ePrimer3 ├── Pectobacterium_primers.json ├── Pwa_CFBP_3304_Pectobacterium_amplicons.bed ├── Pwa_CFBP_3304_Pectobacterium_amplicons.json ├── Pwa_CFBP_3304_gv2_amplicons.bed ├── Pwa_CFBP_3304_gv2_amplicons.json ├── Pwa_CFBP_3304_wasabiae_NCBI_amplicons.bed ├── Pwa_CFBP_3304_wasabiae_NCBI_amplicons.json ├── atrosepticum_NCBI_primers.ePrimer3 ├── atrosepticum_NCBI_primers.json ├── betavasculorum_NCBI_primers.ePrimer3 ├── betavasculorum_NCBI_primers.json ├── gv1_primers.ePrimer3 ├── gv1_primers.json ├── gv2_primers.ePrimer3 ├── gv2_primers.json ├── gv7_primers.ePrimer3 ├── gv7_primers.json ├── results.json ├── summary.tab ├── wasabiae_NCBI_primers.ePrimer3 └── wasabiae_NCBI_primers.json
summary.tab file contains a table showing how many primer sets were designed that are potentially specific to each of the input genomes and classes/labels:
$ cat tests/walkthrough/classify/summary.tab Group NumPrimers Primers Pectobacterium 4 tests/walkthrough/classify/Pectobacterium_primers.json atrosepticum_NCBI 1 tests/walkthrough/classify/atrosepticum_NCBI_primers.json betavasculorum_NCBI 2 tests/walkthrough/classify/betavasculorum_NCBI_primers.json gv1 1 tests/walkthrough/classify/gv1_primers.json gv2 2 tests/walkthrough/classify/gv2_primers.json gv7 2 tests/walkthrough/classify/gv7_primers.json
9. Extract and assess candidate metabarcoding markers
The candidate diagnostic primer sets designed using
pdp can be useful for generating group-specific metabarcoding markers. Using the
pdp extract subcommand, the region(s) predicted to be amplified by each candidate primer set (in the
primersearch step) are extracted from the target genomes, aligned (using
MAFFT) and then compared to each other to estimate their sequence diversity.
Where a set of primers is predicted to amplify a specific subgroup of the input genomes, and the predicted amplicons are sequence-diverse, those regions and primers are potentially useful markers for metabarcoding.
pdp extract produces an output table of diversity measures for each primer set to assess the sequence diversity of each primer set's predicted amplicons.
pdp extract with paths to the
JSON config file describing
primersearch results, and to a
JSON output output from the
pdp classify step, corresponding to one of the groups/labels that the resulting primers are specific to. We also supply the path to an output directory for the command to write to.
pdp extract \ tests/walkthrough/primersearch.json \ tests/walkthrough/classify/Pectobacterium_primers.json \ tests/walkthrough/extract
The new output directory is created (if needed), and a further subdirectory named for primer specificity. Within that directory, all predicted amplicons for each group-specific primer are written as
FASTA files, and a corresponding
.aln alignment file (produced using
MAFFT). An additional
distances_summary.tab file is also produced, describing sequence diversity measures for amplicons produced by each primer set:
$ tree tests/walkthrough/extract tests/walkthrough/extract └── Pectobacterium_primers ├── GCF_000011605.1_primer_00001.aln ├── GCF_000011605.1_primer_00001.fasta ├── GCF_000011605.1_primer_00002.aln ├── GCF_000011605.1_primer_00002.fasta ├── GCF_000011605.1_primer_00003.aln ├── GCF_000011605.1_primer_00003.fasta ├── GCF_000749845.1_concat_primer_00003.aln ├── GCF_000749845.1_concat_primer_00003.fasta └── distances_summary.tab
distances_summary file describes, for each primer set, sequence distance (min, max, mean and stdev), and diversity/evenness measures (Shannon Index and Shannon Evenness), to aid selection of candidate markers. It also provides a count of the number of unique amplicon sequences observed, and the count of sequences that are not unique (
unique + nonunique = total count).
$ cat tests/walkthrough/extract/Pectobacterium_primers/distances_summary.tab primer dist_mean dist_sd dist_min dist_max unique nonunique shannon_index shannon_evenness GCF_000011605.1_primer_00001 0.0267 0.0153 0.0100 0.0400 3 0 1.10 1.00 GCF_000011605.1_primer_00002 0.0600 0.0000 0.0600 0.0600 3 0 1.10 1.00 GCF_000011605.1_primer_00003 0.0400 0.0100 0.0300 0.0500 3 0 1.10 1.00 GCF_000749845.1_concat_primer_00003 0.1133 0.0115 0.1000 0.1200 3 0 1.10 1.00
For further technical information, please read the comments contained within the top of each '*.py' file as well as the Supporting Information ('Methods S1' document) of doi:10.1371/journal.pone.0034498.