-
To expand on question 2, I'm wondering if it would be useful to include more features as well. For example, would adding a column denoting the length of the peptide be helpful? There's a lot to possibly experiment with, so I thought I'd ask.
-
Hi @CCranney and @jessegmeyerlab 👋 @jspaezp gave some great advice above. Here are some other thoughts and answers to your questions:
The requirement for a tab-delimited proteins column is an unfortunate artifact of design choices made for the PIN file format when Percolator was created. In practice, this column can't be used effectively for confidence estimation without placing strict requirements on the user. Instead, mokapot will perform its own protein inference if you provide it a FASTA file and tell it how the proteins were digested; see the example in our cookbook. As an alternative, you can also use any DataFrame as input to mokapot through the Python API. The only downside to this method is that you will need to specify the column meanings for mokapot to use them correctly (see this earlier question). A big upgrade for mokapot is in the works, and this column will no longer be required 😉
@jspaezp had some good suggestions, but I thought I would elaborate. By default, mokapot trains a linear SVM to try to separate the high-scoring target PSMs from the decoy PSMs that are provided; any score that generally indicates the quality of a PSM (like your cosine similarity score) or helps calibrate the score between PSMs (features like peptide length, charge state, etc.) will generally be helpful to include. As @jspaezp mentioned, there are two considerations with any linear model: (1) features should behave monotonically (i.e. higher = better, lower = worse) and (2) categorical variables (such as charge state) should be one-hot encoded. Scikit-learn has some good documentation on categorical variables. A word of caution, however: it is possible to include features that discriminate between target and decoy PSMs irrespective of their quality. As such, my recommendation is to perform an entrapment experiment with your selected features and verify that the q-values estimated by mokapot are less than or equal to the empirical q-values you can calculate from the entrapment partition of your FASTA file.
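To make the one-hot encoding point concrete, here is a minimal sketch in plain Python (no scikit-learn dependency assumed). The column names `Charge2`, `Charge3`, etc. are illustrative, not a requirement of mokapot or Percolator:

```python
# Minimal sketch: one-hot encode precursor charge states for a
# PIN-style feature table. Each charge state becomes its own 0/1
# indicator column so a linear model is not forced to treat charge
# as a monotonic quantity.

def one_hot_charge(psms, charges=(2, 3, 4)):
    """Replace a single integer 'charge' key with one indicator per state."""
    encoded = []
    for psm in psms:
        row = {k: v for k, v in psm.items() if k != "charge"}
        for z in charges:
            row[f"Charge{z}"] = 1 if psm["charge"] == z else 0
        encoded.append(row)
    return encoded

psms = [{"score": 0.91, "charge": 2}, {"score": 0.45, "charge": 3}]
print(one_hot_charge(psms))
```

Scikit-learn's `OneHotEncoder` does the same thing at scale; the stdlib version above just shows the shape of the transformation.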
This is indeed unexpected. I have a few questions: Are the zoDIAq native scores included as features in mokapot? This would be valid as long as the zoDIAq scores do not use target/decoy label information as part of their calculation. If they are included as features, then I would never expect mokapot to perform worse, except in the case of unstable model training due to relatively few PSMs. One note: it is nearly impossible to robustly compare two implementations of TDC without a ground truth. My suggestion for evaluating mokapot would be to either compare using an entrapment experiment design, or to use mokapot's
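For the entrapment idea mentioned above, here is a hedged sketch of how the empirical q-values could be computed. It assumes accepted target PSMs are sorted best-first, each is flagged as coming from the real or the entrapment partition of the FASTA, and the `multiplier` accounts for the relative size of the entrapment partition (1.0 if the partitions are equal in size); this is an illustration of the general idea, not mokapot code:

```python
# Hedged sketch: empirical q-value estimates from an entrapment
# experiment. Entrapment hits among accepted targets stand in for
# false discoveries; the running entrapment fraction approximates
# the FDP at each score threshold.

def empirical_qvalues(is_entrapment, multiplier=1.0):
    """is_entrapment: booleans for accepted target PSMs, best score first."""
    qvals, entrap_hits = [], 0
    for i, entrap in enumerate(is_entrapment, start=1):
        entrap_hits += entrap
        qvals.append(multiplier * entrap_hits / i)
    # q-values must be monotone: take the running minimum from the worst
    # score upward.
    for i in range(len(qvals) - 2, -1, -1):
        qvals[i] = min(qvals[i], qvals[i + 1])
    return qvals

print(empirical_qvalues([False, False, True, False]))
```

Comparing these empirical values against mokapot's estimated q-values (expecting estimated ≥ empirical at matched thresholds would indicate trouble, estimated ≤ empirical is reassuring) is the verification step suggested above.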
-
Thank you both so much for your replies. I have a few more follow-ups, as well as some results from applying suggestions from above.
@wfondrie This is great to know, thank you! I'm experimenting with using a FASTA file now. I notice that mokapot has a make_decoys() function designed to add decoys to a FASTA file. In identifying PSMs, zodiaq has already identified both targets and decoys, the decoys having been created with the initial library file. If we were trying to identify target vs. decoy proteins, would the decoy proteins of the FASTA file (perhaps as generated by mokapot) need to match the decoy proteins of the original library file?
@wfondrie zoDIAq uses what we call the Match Count and Cosine (MaCC) score, which in pseudocode is cosine_score * (number_of_matching_peaks ** (1 / 5)). If I understand your question correctly, the MaCC score is calculated independently of our knowledge of whether a PSM is for a target or decoy peptide (is that what you meant by "zoDIAq scores do not use target/decoy label information as part of their calculation"?). I did not include the MaCC score in this initial run, because both the cosine score and the number of matching peaks were already included and I wondered whether a score based on these two variables would be redundant. As I'll show in a moment, I did include it in my recent reanalysis.
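For reference, the MaCC pseudocode above translates directly into a runnable form (variable names are mine, not zoDIAq's internal ones):

```python
# The MaCC score as described above: the cosine similarity scaled by
# the fifth root of the matched-peak count, so spectra with more
# matched peaks score higher at equal cosine similarity.

def macc_score(cosine, shared_peaks):
    return cosine * shared_peaks ** (1 / 5)

print(macc_score(0.9, 32))  # 32 ** (1/5) is 2, so this is ~1.8
```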
@jspaezp thank you for the suggestion! I suspect we do have a relatively small number of PSMs, so I included this flag in my recent reanalysis (see below). I noticed a slight improvement, so am using those results.

Regarding the suggested features, I was able to include the peptide length and the above-described MaCC score. Generally, it looks like we get a slight improvement in quality, in that clearly bad targets are now excluded, but the number of identified peptides is still about half of what we get when purely using the MaCC score. Notably, this graph looks much better/in line with what we would expect: I went ahead and added the one-hot encoding for charge, and the results got slightly worse: I still need to experiment with some of your other suggestions, including the log number of candidates and the delta mass difference (we were thinking of average peak m/z differences rather than precursor m/z differences).

But I guess my main question regarding those is: do you anticipate that adding any of these features could potentially double the number of identified peptides, putting it in the ballpark of what zodiaq is already calculating? Also, we were planning on trying out Percolator after experimenting with mokapot. In your view as contributors to both, would you anticipate that Percolator could improve these numbers at all? It would not surprise me if one of the primary hurdles we are experiencing is a minimal number of PSMs. The data I am feeding in has ~185k PSM matches, only ~4k of which are considered significant.
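The remaining features mentioned above (peptide length, log number of candidates, delta mass) are cheap to derive. A sketch, assuming hypothetical per-PSM record keys rather than zoDIAq's actual column names:

```python
import math

# Illustrative sketch: derive the extra calibration features discussed
# above from a per-PSM record. Input keys ("peptide", "n_candidates",
# "observed_mz", "theoretical_mz") are assumptions for this example.

def add_features(psm):
    out = dict(psm)
    out["PepLen"] = len(psm["peptide"])              # peptide length
    out["lnNumCandidates"] = math.log(psm["n_candidates"])  # log candidate count
    out["absDM"] = abs(psm["observed_mz"] - psm["theoretical_mz"])  # abs mass error
    return out

psm = {"peptide": "LESLIEK", "n_candidates": 1,
       "observed_mz": 845.4721, "theoretical_mz": 845.4705}
print(add_features(psm))
```

Note the absolute value on the mass error: a signed error is not monotonic (both large positive and large negative errors are bad), so `absDM` is the form that suits a linear model.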
-
It is hard to say for sure, but I have seen datasets where the best feature is the absDM, so in at least some datasets it does have a major impact. If you are going down the feature-engineering path, you could also try other summary metrics for the MS2 error.
I would not, to be completely honest. I think even in Will's paper there is a comparison of the Percolator and mokapot scores for an experiment, where he shows they are the same for most practical purposes.
You could also try adding that as a feature for mokapot! By default, mokapot will try to use the weights of all the features, and if that is worse than using the best single feature, it will fall back to just using the best column. Having said that, I would agree with Will in the sense that having some ground truth here would be handy to benchmark the feature additions and implementations.
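The fallback idea described above can be sketched as follows. This is an illustration of the concept, not mokapot's actual implementation, and it uses a deliberately simple target-decoy competition to count acceptances:

```python
# Hedged sketch of the "fall back to the best single feature" idea:
# accept the learned score only if it yields at least as many target
# PSMs at the q-value threshold as the best individual feature does.

def n_accepted(scores, is_target, threshold=0.01):
    """Count targets accepted at q <= threshold via a simple running
    decoy/target ratio, scanning from the best score down."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    targets = decoys = best = 0
    for i in order:
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        if targets and decoys / targets <= threshold:
            best = targets
    return best

def pick_score(model_scores, feature_columns, is_target):
    """Return "model" or the name of the single feature to fall back on."""
    counts = {name: n_accepted(col, is_target)
              for name, col in feature_columns.items()}
    best_feature = max(counts, key=counts.get)
    if n_accepted(model_scores, is_target) >= counts[best_feature]:
        return "model"
    return best_feature
```

The practical takeaway is that including a candidate feature should never leave you worse off than your best existing score under this scheme, which is why trying MaCC alongside its components is low-risk.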
-
Hi,
I'm exploring using mokapot to score PSMs identified using an in-house-developed Python package (zodiaq). We have a scoring mechanism, but wanted to explore additional scoring methods. I've made a simple script that runs mokapot on the output of zodiaq identification and got some promising results. However, in the process I stumbled across a few questions I thought I'd ask here.
1. My first question involves formatting the `.pin` file to allow for scoring of proteins in addition to peptides and PSMs. Looking over examples in the documentation and elsewhere, it looks like you need the final column to be the `Proteins` column, and that proteins need to be separated by tabs. I was thinking that doing this would result in a `mokapot.proteins.txt` output file generated at the end, but am not seeing that occur. Do you have an idea of what I'm doing wrong?

1.1. Here is an example of one line in my `.pin` file. There are several additional columns included from my zodiaq identification output, but more on that next.

2. Among the additional columns are `cosine` (the cosine similarity score, which suggests how similar the "shape" of a peptide spectrum is to a pattern found in the query file) and `shared` (the number of matching peaks between library/query spectra). I'm going to run mokapot using just those added columns to experiment, but are there other ways to maximize results that you have found effective?
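For anyone following along, here is a sketch of the PIN row layout being assumed in question 1: tab-delimited columns with `Proteins` last, where multiple proteins are themselves joined by tabs (which is exactly why `Proteins` must be the final column). The column names beyond the required ones, and the protein accessions, are hypothetical:

```python
# Sketch of a PIN row: tab-delimited fields, Proteins column last.
# Multiple proteins are tab-separated too, so the row's field count
# varies and only a trailing Proteins column can absorb that.

header = ["SpecId", "Label", "ScanNr", "cosine", "shared", "Peptide", "Proteins"]
psm = {
    "SpecId": "run1_00042", "Label": 1, "ScanNr": 42,
    "cosine": 0.93, "shared": 21, "Peptide": "K.LESLIEK.V",
    "Proteins": ["sp|P12345|EXAMPLE1", "sp|P67890|EXAMPLE2"],  # hypothetical IDs
}

fields = [str(psm[col]) for col in header[:-1]] + psm["Proteins"]
line = "\t".join(fields)
print(line)
```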