
Purpose and Overview

Samuel E. Miller edited this page Aug 16, 2019 · 14 revisions

Post-processing tools such as Percolator and PeptideProphet are routinely used to refine peptide-spectrum matches (PSMs) from database searches. Postnovo plays the analogous role for de novo sequencing: it refines de novo sequence predictions by combining the output of multiple de novo sequencing tools and deriving new metrics for a machine learning algorithm. Postnovo increases the yield of accurate de novo sequences at a given false discovery rate (FDR) by about an order of magnitude. It also reliably estimates the posterior error probability of each de novo sequence, which makes FDR control possible.
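To make the link between posterior error probabilities (PEPs) and FDR control concrete, here is a minimal Python sketch. It relies on the standard relation that the expected FDR of an accepted set equals the mean PEP of its members. The function name `accept_at_fdr` and the example values are hypothetical, and this is not necessarily how Postnovo implements its own filtering.

```python
def accept_at_fdr(peps, target_fdr):
    """Return how many top-ranked sequences can be accepted at target_fdr.

    peps: posterior error probabilities, one per reported sequence.
    The estimated FDR of the n best sequences is the mean of their PEPs.
    """
    ranked = sorted(peps)  # most confident sequences first
    accepted = 0
    running_sum = 0.0
    for n, pep in enumerate(ranked, start=1):
        running_sum += pep
        if running_sum / n <= target_fdr:
            accepted = n
        else:
            # The running mean of an ascending list never decreases,
            # so no larger set can satisfy the target.
            break
    return accepted


# Hypothetical PEPs for five de novo sequences.
example_peps = [0.5, 0.01, 0.2, 0.02, 0.05]
print(accept_at_fdr(example_peps, target_fdr=0.05))
```

In this example the three most confident sequences have a mean PEP below 0.05, while adding the fourth pushes the estimate above the target.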

Postnovo considers the full set of de novo sequence candidates for each spectrum (1 from Novor, 20 from PepNovo+, 20 from a modified version of DeepNovo), not just the top prediction. Sequence candidates from each tool are compared using a new, highly efficient algorithm to find consensus sequences (the longest and the top-ranking common subsequences between tools), which tend to be more accurate than predictions from a single tool. Consensus sequences from each combination of tools, together with the top predictions of the individual tools, are fed into the Postnovo predictive models (one model for each tool and each combination of tools; 7 in total for 3 tools). The reported sequence is selected by its score, which is the estimated posterior probability that the sequence is correct.
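The following Python sketch illustrates two ideas from the paragraph above: why 3 tools give 7 models, and what a consensus between two candidate sequences looks like. It assumes that a "common subsequence" is a contiguous run of residues shared by both candidates, and it uses a textbook dynamic-programming search as a stand-in, not Postnovo's own efficient algorithm. The function names and the example peptides are hypothetical.

```python
from itertools import combinations

TOOLS = ("Novor", "PepNovo+", "DeepNovo")


def tool_combinations(tools):
    """All non-empty combinations of tools: singles, then pairs, then the triple."""
    return [
        combo
        for size in range(1, len(tools) + 1)
        for combo in combinations(tools, size)
    ]


def longest_common_substring(a, b):
    """Longest contiguous run of residues shared by sequences a and b.

    Classic O(len(a) * len(b)) dynamic programme, keeping only two rows.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous residue pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(len(tool_combinations(TOOLS)))  # 2**3 - 1 non-empty combinations
print(longest_common_substring("PEPTIDEK", "AEPTIDER"))
```

With 20 candidates per tool, a pairwise search like this would be run for every pair of candidates, which is presumably why an efficient comparison algorithm matters in practice.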

The score of a sequence candidate is predicted from a set of metrics. In addition to the selection of consensus sequences, the new metrics introduced by Postnovo are key to the increase in predictive power. The metrics fall into three groups.

  1. Identification of potential sequence errors. These are largely mono-/dipeptide and di-/dipeptide substitution errors, which are common in de novo sequencing mainly because of missing and low-intensity fragment peaks (see Muth and Renard's paper for more detail on these errors).

  2. Comparison of de novo sequences generated at different fragment mass tolerances. The critical observation is that sequences that are resilient to reparameterization are more likely to be correct than sequences that are sensitive to it.

  3. Comparison of de novo sequences from different precursor spectra representing the same peptide species. The consistency of a sequence prediction within a spectrum cluster is positively correlated with its accuracy. Postnovo leverages the state-of-the-art spectrum clustering tool, MaRaCluster, for this analysis.
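The mono-/dipeptide substitutions mentioned in the first group arise because some single residues have almost the same mass as a pair of residues, so a missing fragment peak leaves the two indistinguishable. The Python sketch below lists such pairs from standard monoisotopic residue masses. It is only an illustration of the underlying mass coincidence, not Postnovo's error-detection code; the function name `isobaric_dipeptides` and the tolerance values are assumptions.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da for unmodified amino acids.
# Isoleucine is omitted because it has exactly the same mass as leucine (L).
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def isobaric_dipeptides(residue, tol=0.02):
    """Unordered residue pairs whose summed mass is within tol Da of residue."""
    target = MONO_MASS[residue]
    return [
        (x, y)
        for x, y in combinations_with_replacement(sorted(MONO_MASS), 2)
        if abs(MONO_MASS[x] + MONO_MASS[y] - target) <= tol
    ]


# Asparagine is indistinguishable from two glycines by mass alone.
print(isobaric_dipeptides("N", tol=0.001))
# Tryptophan is close to several residue pairs at a looser tolerance.
print(isobaric_dipeptides("W", tol=0.02))
```

The same table also shows a di-/dipeptide case: the pairs A+D and E+G have the same elemental composition and therefore the same mass, so either pair can be substituted for the other when the fragment peak between the two residues is missing.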

Complete training and testing capabilities allow the user to build upon the default machine learning models or create their own models with other spectra. Also, the Postnovo framework is flexible, facilitating the future addition of other tools such as Peaks for further improvements in de novo sequence accuracy.