Ariel Vina-Rodriguez edited this page Jan 29, 2020 · 9 revisions

Overview

ThDyHybrid is a program for simple modeling of the hybridization of primers and probes onto a set of target sequences. It aims at a relatively rapid and simple selection of candidate primers and probes for PCR and microarray detection and identification of RNA viruses. Although it does not need the target sequences to be aligned, it can take advantage of low-quality alignments (like the ones obtained by simple BLAST queries at the NCBI site) to set common coordinates for selecting (and reporting) the genomic region to analyze. The expected user is the biologist who needs the new assay. The program needs to be easy to use (with a graphical user interface, GUI) and to install. In fact, no installation is needed: just run the supplied executable. No modification of the computer system will occur, and thus no admin rights are needed.

Motivation

I have been working on RNA virus detection by PCR since 1993, and have observed the following pattern in the workflow for new assay design:

A researcher (typically a molecular biologist working with a "new" virus) decides that a new diagnostic or screening PCR is needed. The researcher collects available sequences from the NCBI GenBank site and from personal sources and builds a multiple alignment. Taking into account this alignment, experimental information and the published work of other authors, a promising region within the genome is chosen. Some free or commercial software then helps to choose a set of primers or probes to test experimentally and to select the ones that will be used further. Among many others, some of the difficulties frequently found are:

  • Too many sequences to align.
  • Too few sequences to align.
  • High variability (the sequences diverge by more than 15-30 %, measured as % of point differences).
  • The software processes only one sequence at a time.
  • The parameters are difficult to select because they are text-based, depend on expert experience, etc. (number of mismatches to allow, maximum Tm difference between primers, etc.)

Due to these difficulties, a common practice has been to print the alignment and visually choose a set of candidate primers and probes, which are then partially checked with the software and submitted to experiments.

The writings and software of SantaLucia et al. address many of these problems from a stricter thermodynamic point of view. Kaderali and Lebers et al. used this knowledge to modify the widely used dynamic programming algorithm for DNA alignment.

Primer and probe design

(from my PhD thesis: "RNA virus detection and identification using techniques based on DNA hybridization")

Although there is well-known software for the design of primers and probes and for modeling DNA hybridization, we found no simple way to adapt it to the specific characteristics of RNA viruses that influence the design of primers and probes for PCR, or of probes for a low-density microarray, for virus identification. Even when the goal is to design a highly specific (RT-q)PCR, the target is always a group of similar but still different sequences, with up to 10 or 20 % of point nucleotide differences affecting most positions. Tools that find a set of specific and compatible primers from only one given sequence are hardly useful. The task becomes even more complicated when trying to design an assay that detects a whole viral group (for example, a given genus), especially if we need to control the detection efficiency for each of the subgroups (for example, species) that form the broader group. The mere possibility of breaking the target group into a classification tree of variable depth is not implemented in most software, and manual workarounds significantly complicate the task.

Further complications arise when some incomplete sequences, which do not cover the entire target region, are among the few representatives of some of the subgroups and therefore cannot be ignored. Additionally, to represent all this sequence diversity, the number of sequences selected for analysis may well be in the order of a few hundred. We did not find any software that simultaneously copes with all these requirements. A widely used solution is to simply print the alignment and manually scan for possible candidates, deducing partial consensus sequences. But as the number of sequences grows into the hundreds this approach becomes impracticable, and it is always time-consuming and error-prone. We partially automated this solution using an Excel workbook, which is publicly available at GitHub as VisualOligoDeg.

VisualOligoDeg facilitates the visual selection of candidate primers or probes from an existing MSA by interactively constructing consensus sequences with a selected degree of degeneracy, modeling some characteristics (Tm, ∆G, etc.) of the possible hybridization, and easily grouping and filtering sequences. Some implementation details are: its use and installation do not depend on any software other than MS Excel; the truncated and reclassified aligned sequences can be re-exported to a new text/fasta alignment file; the NN parameters used can be modified; part of the workbook cells are regenerated during the import of a fasta file, potentially correcting errors inadvertently introduced by the user; part of the functionality is implemented in VBA and includes a set of functions to import and export all the code from the workbook, allowing us to track the code with git (a distributed version control system for software source code, https://git-scm.com/ ).
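To illustrate what constructing a consensus with a given degree of degeneracy means, here is a minimal C++ sketch. It is not the VBA code of VisualOligoDeg: the function names (`degenerate_consensus`, `degeneracy`), the bit-mask encoding and the choice to ignore gaps are all assumptions made for this example. Only the IUPAC ambiguity codes themselves are standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// IUPAC codes indexed by a bit mask of observed bases (A=1, C=2, G=4, T=8).
// Index 0 (no base observed, e.g. an all-gap column) maps to '-'.
static const std::string kIupac = "-ACMGRSVTWYHKDBN";

// Hypothetical helper: bit mask of a single aligned symbol.
unsigned base_mask(char b) {
    switch (b) {
        case 'A': case 'a': return 1u;
        case 'C': case 'c': return 2u;
        case 'G': case 'g': return 4u;
        case 'T': case 't': case 'U': case 'u': return 8u;
        default: return 0u;  // gaps and unknown symbols are ignored (assumption)
    }
}

// Degenerate consensus over columns [first, first + len) of an alignment.
std::string degenerate_consensus(const std::vector<std::string>& aln,
                                 std::size_t first, std::size_t len) {
    std::string out;
    for (std::size_t col = first; col < first + len; ++col) {
        unsigned mask = 0u;
        for (const auto& seq : aln)
            if (col < seq.size()) mask |= base_mask(seq[col]);
        out += kIupac[mask];
    }
    return out;
}

// Degeneracy: number of distinct non-degenerate oligos the consensus encodes.
unsigned long long degeneracy(const std::string& consensus) {
    unsigned long long n = 1;
    for (char c : consensus) {
        const std::size_t mask = kIupac.find(c);
        assert(mask != std::string::npos);  // only upper-case IUPAC codes expected
        unsigned bits = 0u;
        for (std::size_t m = mask; m != 0; m >>= 1)
            bits += static_cast<unsigned>(m & 1u);
        if (bits > 1u) n *= bits;
    }
    return n;
}
```

For the three aligned sequences ACGT, ACGA and GCGT, the four-column consensus is RCGW, which encodes 2 × 2 = 4 plain oligos. A user-selected degeneracy limit could then be applied by rejecting windows whose `degeneracy` exceeds it.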

After the selection of primer or probe candidates, and before or after their validation with other tools, it may be desirable to predict their hybridization characteristics on a set of additional target sequences not originally present in the MSA used during the design. A question arises: is it imperative to add a sequence to the MSA before testing it against a set of probes? The same problem appeared when we tried to predict the result of hybridizing, onto our low-density microarray, the amplicon from a viral strain with a known sequence. We decided to use a different model to avoid building or updating the MSA, and developed a second software tool: ThDy_DNAHybrid.

Most of the algorithms used for sequence alignment, and often reused for the design of primers and probes, originate in the phylogenetic analysis of sequences. There, each position is thought to carry some amount of phylogenetic information; thus, sequences are treated as text, nucleotides as letters and differences as letter substitutions. But as the primary concept during probe design we may prefer the percentage of probes or targets in the hybridized state, which is what determines the efficiency of a PCR or of a microarray detection. The percentage of probes hybridized is described by the equilibrium constant K of the reaction, which is thermodynamically (Chapter 1-ref [6]) related to the reaction ∆G by ∆G = -RT ln K at a given temperature T, where the gas constant R = 1.9872 cal/(mol·K). SantaLucia, J. provides a “rule of thumb” to illustrate this relation: “every −1.4 kcal/mol in ∆G results in a change in the equilibrium constant by a multiplicative factor of 10”, and “−4.2 kcal/mol (= -1.4×3) equals a K change by 1000”.
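The relation between ∆G and K, and the rule of thumb quoted above, can be checked numerically with a short C++ sketch. The function names are hypothetical, and the temperature of 37 °C (310.15 K) is an assumption of this example; the text does not state the temperature at which the rule of thumb applies.

```cpp
#include <cassert>
#include <cmath>

// Gas constant in cal/(mol*K), the value given in the text.
constexpr double kGasConstantCal = 1.9872;

// Equilibrium constant from dG (kcal/mol) at T (Kelvin): K = exp(-dG / (R*T)).
double equilibrium_constant(double dG_kcal, double T_kelvin) {
    assert(T_kelvin > 0.0);
    return std::exp(-dG_kcal * 1000.0 / (kGasConstantCal * T_kelvin));
}

// Change in dG (kcal/mol) that multiplies K by `factor`: ddG = -R*T*ln(factor).
double dG_for_fold_change(double factor, double T_kelvin) {
    assert(factor > 0.0 && T_kelvin > 0.0);
    return -kGasConstantCal * T_kelvin * std::log(factor) / 1000.0;
}
```

At 310.15 K, `dG_for_fold_change(10.0, 310.15)` gives about -1.42 kcal/mol, in line with the quoted -1.4 kcal/mol per factor of 10. Conversely, a ∆G change of exactly -4.2 kcal/mol changes K by a factor of roughly 900, which the rule of thumb rounds to 1000.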

Thus, rather than using the phylogenetic or plain-text approach, we may prefer a chemical or thermodynamic approximation to describe those reactions. Beginning with a rough approximation, we may add more and more factors to make our predictions more accurate. In the text approach, this incremental refinement translates into more rules dictated by “experts”, which may be confusing, especially when it is quantitatively unclear which rule predominates in each situation. The thermodynamic approach instead incorporates the new factors into the model itself, into the calculation, potentially making the final interpretation or use of the prediction simpler.

ThDy_DNAHybrid uses the thermodynamic NN (nearest-neighbor) method to describe DNA hybridization. ∆G is calculated from the ∆H and ∆S accumulated over adjacent pairs of dinucleotides. Thus, the relative position of the interacting nucleotides on both DNA strands is used, which sends us back to an alignment, but now of only two sequences. Instead of trying every possible combination of relative positions or annealings, we search for the most stable one. This problem was solved by Kaderali using the dynamic-programming approach of the Smith–Waterman alignment algorithm, modified by replacing the alignment objective function with ∆G or Tm based on ∆H and ∆S, and by introducing the dependency on pairs of adjacent dinucleotides following the NN model. We have further adapted the algorithm to extract every position at which the stability of the hybridization passes a user-defined value, that is, every position at which a probe could produce a measurable signal. The modification can also track all the target sequences that have at least that level of interaction with a given probe.
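The following C++ sketch illustrates the two ideas of the previous paragraph in a deliberately simplified form; it is not the ThDy_DNAHybrid implementation. It accumulates ∆H and ∆S over adjacent dinucleotides and then reports every target position whose ∆G passes a threshold. Several assumptions apply: the stacking values are the commonly cited unified SantaLucia (1998) Watson-Crick parameters and should be checked against the published table; initiation, symmetry, salt and mismatch terms are omitted; and the scan is ungapped and scores mismatches as zero, whereas the program uses a dynamic-programming alignment. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Thermo { double dH; double dS; };  // kcal/mol and cal/(mol*K)

// Watson-Crick stacks keyed by the 5'->3' dinucleotide of one strand.
// Values assumed from the unified SantaLucia (1998) set; verify before use.
const std::map<std::string, Thermo>& nn_table() {
    static const std::map<std::string, Thermo> t = {
        {"AA", {-7.9, -22.2}},  {"TT", {-7.9, -22.2}},
        {"AT", {-7.2, -20.4}},  {"TA", {-7.2, -21.3}},
        {"CA", {-8.5, -22.7}},  {"TG", {-8.5, -22.7}},
        {"GT", {-8.4, -22.4}},  {"AC", {-8.4, -22.4}},
        {"CT", {-7.8, -21.0}},  {"AG", {-7.8, -21.0}},
        {"GA", {-8.2, -22.2}},  {"TC", {-8.2, -22.2}},
        {"CG", {-10.6, -27.2}}, {"GC", {-9.8, -24.4}},
        {"GG", {-8.0, -19.9}},  {"CC", {-8.0, -19.9}},
    };
    return t;
}

// dG = dH - T*dS, with dS converted from cal to kcal.
double delta_g(const Thermo& t, double T_kelvin) {
    return t.dH - T_kelvin * t.dS / 1000.0;
}

// Stack sum for a perfectly matched duplex of `seq` (upper-case ACGT only).
Thermo duplex_thermo(const std::string& seq) {
    Thermo sum{0.0, 0.0};
    for (std::size_t i = 0; i + 1 < seq.size(); ++i) {
        const Thermo& nn = nn_table().at(seq.substr(i, 2));
        sum.dH += nn.dH;
        sum.dS += nn.dS;
    }
    return sum;
}

// Ungapped scan: `target` is given in the same sense as `probe`, so identical
// letters mean a Watson-Crick pair with the complementary strand. Only stacks
// whose two positions both match contribute (crude mismatch model).
std::vector<std::size_t> scan_hits(const std::string& probe, const std::string& target,
                                   double T_kelvin, double max_dG) {
    std::vector<std::size_t> hits;
    if (probe.size() < 2 || target.size() < probe.size()) return hits;
    for (std::size_t off = 0; off + probe.size() <= target.size(); ++off) {
        Thermo sum{0.0, 0.0};
        for (std::size_t i = 0; i + 1 < probe.size(); ++i) {
            if (probe[i] == target[off + i] && probe[i + 1] == target[off + i + 1]) {
                const Thermo& nn = nn_table().at(probe.substr(i, 2));
                sum.dH += nn.dH;
                sum.dS += nn.dS;
            }
        }
        if (delta_g(sum, T_kelvin) <= max_dG) hits.push_back(off);  // stable enough
    }
    return hits;
}
```

With these values, `duplex_thermo("GCGC")` sums three stacks (GC, CG, GC), and `scan_hits("GCGC", "AAGCGCAA", 310.15, -5.0)` reports only offset 2, the perfect match. Relaxing the threshold to -2.0 kcal/mol also reports offsets 0 and 4, where a single GC stack survives. This mirrors how a user-defined stability value controls which positions are reported.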

Additional details of the implementation are: it is written in C++11/17; it includes a GUI built with the Nana C++ GUI library, a new, simple and modern way to build GUIs in C++ for Windows and Linux; the parameters of each run are saved in a “project” file, which can be re-run and/or manually inspected and edited as simple text; a sub-library, ProgParam, was developed to manage the parameters and subprograms that form the software and to transparently join the core functionality with the user interface; the sub-library Unit makes conversion between physical or chemical units easy, helping to avoid errors and offering great flexibility to the final user; and results can be saved in a set of text files and presented as interactive tables.
