# Lab 6: Support Vector Machines and Percolator

In this lab we'll take an in-depth look at Percolator, a proteomics tool that highlights two of the methods we've discussed in the course so far: support vector machines (SVMs) and cross-validation. We'll start by discussing how Percolator works, then look at what its logging messages mean, and finally try to break it by violating the machine learning prime directive.

## Exercise 1: What is Percolator and how does it work?

Percolator is a proteomics tool that uses a machine learning algorithm to re-score peptide-spectrum matches (PSMs) from a proteomics experiment ([Käll et al](https://www.nature.com/articles/nmeth1113)). It is extremely popular and is integrated with other tools like Proteome Discoverer, Mascot, and EncyclopeDIA. As a tool, Percolator both boosts the number of peptides we can detect in proteomics experiments with its machine learning algorithm and provides a consistent statistical framework to interpret the resulting identifications. Although Percolator was specifically developed for proteomics data, it provides an excellent example of using an SVM and cross-validation to great effect.

Here's how it works:

### Target-decoy competition can be used to assess error rates.

In proteomics experiments, we are often interested in quantifying how confident we are in the assignment of peptides to tandem mass spectra by a database search engine. By far the most popular method to assign confidence estimates to our PSMs is target-decoy competition (also known as the target-decoy approach). The principles behind target-decoy competition are surprisingly simple, but powerful for interpreting proteomics data.

To perform target-decoy competition, we start by generating decoy peptides from a protein database by shuffling the real peptide sequences (the *targets*), while keeping the terminal residues in place to preserve enzymatic cleavage sites. These new decoy sequences should not be present in the sample that was analyzed, and thus let us see how well incorrect PSMs are scored. If we only allow one peptide to be assigned to each spectrum, then the target and decoy sequences must compete against one another for each assignment. *A small aside: Reversing peptide sequences is just one instance of shuffling the sequence.*
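The decoy generation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by any particular search engine: the function names `shuffle_decoy` and `reverse_decoy` are hypothetical, and the sketch assumes peptides are plain strings of one-letter amino acid codes with no modifications.

```python
import random


def shuffle_decoy(peptide: str, rng: random.Random) -> str:
    """Shuffle the interior residues of a peptide, keeping both termini fixed."""
    if len(peptide) < 3:
        # Nothing to shuffle between the two terminal residues.
        return peptide
    interior = list(peptide[1:-1])
    rng.shuffle(interior)
    return peptide[0] + "".join(interior) + peptide[-1]


def reverse_decoy(peptide: str) -> str:
    """Reverse the interior residues: one specific instance of shuffling."""
    if len(peptide) < 3:
        return peptide
    return peptide[0] + peptide[-2:0:-1] + peptide[-1]


# Hypothetical target peptide, used only for illustration.
target = "PEPTIDEK"
rng = random.Random(42)  # fixed seed so the shuffle is reproducible

shuffled = shuffle_decoy(target, rng)
reversed_decoy = reverse_decoy(target)
print(target, shuffled, reversed_decoy)
```

Both decoys have the same amino acid composition, and therefore the same mass, as the target they were derived from. Note that a shuffle can occasionally reproduce the original sequence, especially for short peptides, so a real pipeline would need to handle that case.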

We might expect to see a score distribution from our search engine that looks like the one below:
![](images/tdc.jpg)

We can then draw a vertical line that is our score threshold, above which we accept our PSMs and below which we reject them. Using the target and decoy PSMs that we accept, we can then estimate the false discovery rate (FDR) for this set of PSMs as:

$$ FDR = \frac{\#~decoys}{\#~targets} $$ 

*The exact formula Percolator uses is slightly more complicated, but is approximately equivalent to this under most conditions.*
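To make the simple ratio above concrete, here is a small sketch that applies it at a given score threshold. It implements only the simplified formula shown in this section, not Percolator's exact calculation. The function name `estimate_fdr`, the `(score, is_decoy)` tuple layout, and the scores themselves are made up for illustration.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as (# decoys) / (# targets) among PSMs scoring above threshold.

    psms is a list of (score, is_decoy) tuples. Returns None when no target
    PSMs are accepted, because the ratio is undefined in that case.
    """
    accepted = [is_decoy for score, is_decoy in psms if score > threshold]
    n_decoys = sum(accepted)
    n_targets = len(accepted) - n_decoys
    if n_targets == 0:
        return None
    return n_decoys / n_targets


# Hypothetical PSMs: (search engine score, is_decoy).
psms = [
    (9.1, False), (8.4, False), (7.9, False), (7.2, True),
    (6.8, False), (5.5, True), (5.1, False), (4.0, True),
]

for threshold in (8.0, 6.0, 3.0):
    print(threshold, estimate_fdr(psms, threshold))
```

Lowering the threshold accepts more target PSMs, but it also lets in more decoys, so the estimated FDR rises.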

**Questions to discuss:**
1. What assumptions does target-decoy competition rely on to be accurate?
2. When would decoy sequences provide a poor estimate of the false discovery rate?

### Percolator uses an SVM to separate high-scoring target sequences from decoy sequences.

Thus far we've discussed supervised and unsupervised learning in this course, and we've only seen SVMs used for supervised learning tasks. However, Percolator's task is actually an instance of something new: semi-supervised learning.

Semi-supervised learning tasks are those where the labels you want your model to predict are noisy, meaning that some are incorrect or missing. In the case of Percolator, we ideally want to separate our correct PSMs from our incorrect PSMs. However, we don't know which are the correct PSMs to begin with! Instead, we know that target PSMs are either correct or incorrect and we assume that our decoy PSMs are all incorrect.
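One way to picture this labeling problem is to build a training set in which every decoy is a negative example and only the high-scoring targets are treated as (probably correct) positive examples. The sketch below is an illustration under stated assumptions, not Percolator's actual code: the function name `build_training_set`, the dictionary keys, the feature values, and the fixed score threshold are all hypothetical.

```python
def build_training_set(psms, score_threshold):
    """Split PSMs into features X and labels y for one round of SVM training.

    Positives (+1): target PSMs scoring above score_threshold. These labels
    are noisy, since some high-scoring targets are still incorrect.
    Negatives (-1): all decoy PSMs, which we assume are incorrect.
    Low-scoring targets are left out because their labels are unknown.
    """
    positives = [p for p in psms if not p["is_decoy"] and p["score"] > score_threshold]
    negatives = [p for p in psms if p["is_decoy"]]
    X = [p["features"] for p in positives + negatives]
    y = [1] * len(positives) + [-1] * len(negatives)
    return X, y


# Hypothetical PSMs, each with a search engine score and two made-up features.
psms = [
    {"score": 9.1, "is_decoy": False, "features": [9.1, 0.2]},
    {"score": 8.4, "is_decoy": False, "features": [8.4, 0.5]},
    {"score": 5.1, "is_decoy": False, "features": [5.1, 3.0]},
    {"score": 7.2, "is_decoy": True, "features": [7.2, 2.5]},
    {"score": 4.0, "is_decoy": True, "features": [4.0, 4.1]},
]

X, y = build_training_set(psms, score_threshold=6.0)
print(X)
print(y)
```

The resulting `X` and `y` could then be passed to any linear SVM implementation. The target scoring 5.1 is excluded from training because we cannot say whether it is correct or incorrect.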