-
To expand on question 2, I'm wondering if it would be useful to include more features as well. For example, would adding a column denoting the length of the peptide be helpful? There's a lot to possibly experiment with, so I thought I'd ask.
-
Hi @CCranney and @jessegmeyerlab 👋 @jspaezp gave some great advice above. Here are some other thoughts and answers to your questions:
The requirement for a tab-delimited proteins column is an unfortunate artifact of design choices made for the PIN file format when Percolator was created. In practice, this column can't be used effectively for confidence estimation without placing strict requirements on the user. Instead, mokapot will perform its own protein inference if you provide it a FASTA file and tell it how the proteins were digested; see the example in our cookbook. As an alternative, you can also use any DataFrame as input to mokapot through the Python API. The only downside to this method is that you will need to specify the column meanings for mokapot to use them correctly (see this earlier question). A big upgrade for mokapot is in the works, and this column will no longer be required 😉
@jspaezp had some good suggestions, but I thought I would elaborate. By default, mokapot trains a linear SVM to try to separate the high-scoring target PSMs from the decoy PSMs that are provided; any score that generally indicates the quality of a PSM (like your cosine similarity score) or helps calibrate the score between PSMs (features like peptide length, charge state, etc.) will generally be helpful to include. As @jspaezp mentioned, there are two considerations with any linear model: (1) features should behave monotonically (i.e. higher = better, lower = worse) and (2) categorical variables (such as charge state) should be one-hot encoded. Scikit-learn has some good documentation on categorical variables. A word of caution, however: it is possible to include features that discriminate between target and decoy PSMs irrespective of their quality. As such, my recommendation is to perform an entrapment experiment with your selected features and verify that the q-values estimated by mokapot are less than or equal to the empirical q-values you can calculate from the entrapment partition of your FASTA file.
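To make the one-hot encoding point concrete, here is a minimal sketch in plain Python (no scikit-learn dependency assumed). The column names `Charge2`, `Charge3`, etc. are illustrative, not a requirement of mokapot or Percolator:

```python
# Minimal sketch: one-hot encode precursor charge states for a
# PIN-style feature table. Each charge state becomes its own 0/1
# indicator column so a linear model is not forced to treat charge
# as a monotonic quantity.

def one_hot_charge(psms, charges=(2, 3, 4)):
    """Replace a single integer 'charge' key with one indicator per state."""
    encoded = []
    for psm in psms:
        row = {k: v for k, v in psm.items() if k != "charge"}
        for z in charges:
            row[f"Charge{z}"] = 1 if psm["charge"] == z else 0
        encoded.append(row)
    return encoded

psms = [{"score": 0.91, "charge": 2}, {"score": 0.45, "charge": 3}]
print(one_hot_charge(psms))
```

Scikit-learn's `OneHotEncoder` does the same thing at scale; the stdlib version above just shows the shape of the transformation.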
This is indeed unexpected. I have a few questions: Are the zoDIAq native scores included as features in mokapot? This would be valid as long as the zoDIAq scores do not use target/decoy label information as part of their calculation. If they are included as features, then I would never expect mokapot to perform worse, except in the case of unstable model training due to relatively few PSMs. One note: it is nearly impossible to robustly compare two implementations of TDC without a ground truth. My suggestion for evaluating mokapot would be to either compare using an entrapment experiment design, or to use mokapot's
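For the entrapment idea mentioned above, here is a hedged sketch of how the empirical q-values could be computed. It assumes accepted target PSMs are sorted best-first, each is flagged as coming from the real or the entrapment partition of the FASTA, and the `multiplier` accounts for the relative size of the entrapment partition (1.0 if the partitions are equal in size); this is an illustration of the general idea, not mokapot code:

```python
# Hedged sketch: empirical q-value estimates from an entrapment
# experiment. Entrapment hits among accepted targets stand in for
# false discoveries; the running entrapment fraction approximates
# the FDP at each score threshold.

def empirical_qvalues(is_entrapment, multiplier=1.0):
    """is_entrapment: booleans for accepted target PSMs, best score first."""
    qvals, entrap_hits = [], 0
    for i, entrap in enumerate(is_entrapment, start=1):
        entrap_hits += entrap
        qvals.append(multiplier * entrap_hits / i)
    # q-values must be monotone: take the running minimum from the worst
    # score upward.
    for i in range(len(qvals) - 2, -1, -1):
        qvals[i] = min(qvals[i], qvals[i + 1])
    return qvals

print(empirical_qvalues([False, False, True, False]))
```

Comparing these empirical values against mokapot's estimated q-values (expecting estimated ≥ empirical at matched thresholds would indicate trouble, estimated ≤ empirical is reassuring) is the verification step suggested above.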
-
Thank you both so much for your replies. I have a few more follow-ups, as well as some results from applying suggestions from above.
@wfondrie This is great to know, thank you! I'm experimenting with using a FASTA file now. I notice that mokapot has a make_decoys() function designed to add decoys to a FASTA file. In identifying PSMs, zodiaq has already identified both targets and decoys, the decoys having been created with the initial library file. If we were trying to identify target vs. decoy proteins, would the decoy proteins of the FASTA file (perhaps as generated by mokapot) need to match the decoy proteins of the original library file?
@wfondrie zoDIAq uses what we call the Match Count and Cosine (MaCC) score, which in pseudocode is cosine_score * (number_of_matching_peaks ** (1 / 5)). If I understand your question correctly, the MaCC score is calculated independently of our knowledge of whether a PSM is for a target or decoy peptide (is that what you meant by "zoDIAq scores do not use target/decoy label information as part of their calculation"?). I did not include the MaCC score in this initial run, because both the cosine score and the number of matching peaks were already included and I wondered whether a score based on these two variables would be redundant. As I'll show in a moment, I did include it in my recent reanalysis.
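For reference, the MaCC pseudocode above translates directly into a runnable form (variable names are mine, not zoDIAq's internal ones):

```python
# The MaCC score as described above: the cosine similarity scaled by
# the fifth root of the matched-peak count, so spectra with more
# matched peaks score higher at equal cosine similarity.

def macc_score(cosine, shared_peaks):
    return cosine * shared_peaks ** (1 / 5)

print(macc_score(0.9, 32))  # 32 ** (1/5) is 2, so this is ~1.8
```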
@jspaezp thank you for the suggestion! I suspect we do have a relatively small number of PSMs, so I included this flag in my recent reanalysis (see below). I noticed a slight improvement, so am using those results.

Regarding the suggested features, I was able to include the peptide length and the above-described MaCC score. Generally, it looks like we get a slight improvement in quality, in that clearly bad targets are now excluded, but the number of identified peptides is still about half of what we get when purely using the MaCC score. Notably, this graph looks much better/in line with what we would expect: I went ahead and added the one-hot encoding for charge, and the results got slightly worse: I still need to experiment with some of your other suggestions, including the log number of candidates and the delta mass difference (we were thinking of average peak m/z differences rather than precursor m/z differences).

But I guess my main question regarding those is: do you anticipate that adding any of these features could potentially double the number of identified peptides, putting it in the ballpark of what zodiaq is already calculating? Also, we were planning on trying out Percolator after experimenting with mokapot. In your view as contributors to both, would you anticipate that Percolator could improve these numbers at all? It would not surprise me if one of the primary hurdles we are experiencing is a minimal number of PSMs. The data I am feeding in has ~185k PSM matches, only ~4k of which are considered significant.
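The remaining features mentioned above (peptide length, log number of candidates, delta mass) are cheap to derive. A sketch, assuming hypothetical per-PSM record keys rather than zoDIAq's actual column names:

```python
import math

# Illustrative sketch: derive the extra calibration features discussed
# above from a per-PSM record. Input keys ("peptide", "n_candidates",
# "observed_mz", "theoretical_mz") are assumptions for this example.

def add_features(psm):
    out = dict(psm)
    out["PepLen"] = len(psm["peptide"])              # peptide length
    out["lnNumCandidates"] = math.log(psm["n_candidates"])  # log candidate count
    out["absDM"] = abs(psm["observed_mz"] - psm["theoretical_mz"])  # abs mass error
    return out

psm = {"peptide": "LESLIEK", "n_candidates": 1,
       "observed_mz": 845.4721, "theoretical_mz": 845.4705}
print(add_features(psm))
```

Note the absolute value on the mass error: a signed error is not monotonic (both large positive and large negative errors are bad), so `absDM` is the form that suits a linear model.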
-
It is hard to say for sure, but I have seen datasets where the best feature is the absDM, so in at least some datasets it does have a major impact. If you are going down the feature-engineering path, you could also try other summary metrics for the MS2 error.
I would not, to be completely honest. I think even in Will's paper there is a comparison of the Percolator and mokapot scores for an experiment, where he shows they are the same for most practical purposes.
You could also try adding that as a feature for mokapot! By default, mokapot will try to use the weights of all the features, and if that is worse than using the best single feature, it will fall back to just using the best column. Having said that, I would agree with Will in the sense that having some ground truth here would be handy to benchmark the feature additions and implementations.
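The fallback idea described above can be sketched as follows. This is an illustration of the concept, not mokapot's actual implementation, and it uses a deliberately simple target-decoy competition to count acceptances:

```python
# Hedged sketch of the "fall back to the best single feature" idea:
# accept the learned score only if it yields at least as many target
# PSMs at the q-value threshold as the best individual feature does.

def n_accepted(scores, is_target, threshold=0.01):
    """Count targets accepted at q <= threshold via a simple running
    decoy/target ratio, scanning from the best score down."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    targets = decoys = best = 0
    for i in order:
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        if targets and decoys / targets <= threshold:
            best = targets
    return best

def pick_score(model_scores, feature_columns, is_target):
    """Return "model" or the name of the single feature to fall back on."""
    counts = {name: n_accepted(col, is_target)
              for name, col in feature_columns.items()}
    best_feature = max(counts, key=counts.get)
    if n_accepted(model_scores, is_target) >= counts[best_feature]:
        return "model"
    return best_feature
```

The practical takeaway is that including a candidate feature should never leave you worse off than your best existing score under this scheme, which is why trying MaCC alongside its components is low-risk.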
-
Hi,
I'm exploring using mokapot to score PSMs identified using an in-house-developed Python package (zodiaq). We have a scoring mechanism, but wanted to explore additional scoring methods. I've made a simple script that runs mokapot on the output of zodiaq identification and got some promising results. However, in the process I stumbled across a few questions I thought I'd ask here.
1. My first question involves formatting the `.pin` file to allow for scoring of proteins in addition to peptides and PSMs. Looking over examples in the documentation and elsewhere, it looks like you need the final column to be the `Proteins` column, and that proteins need to be separated by tabs. I was thinking that doing this would result in a `mokapot.proteins.txt` output file generated at the end, but am not seeing that occur. Do you have an idea of what I'm doing wrong?

1.1. Here is an example of one line in my `.pin` file. There are several additional columns included from my zodiaq identification output, but more on that next.

2. Among the additional columns are `cosine` (the cosine similarity score, which suggests how similar the "shape" of a peptide spectrum is to a pattern found in the query file) and `shared` (the number of matching peaks between library/query spectra). I'm going to run mokapot using just those added columns to experiment, but are there other ways to maximize results that you have found effective?
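For anyone following along, here is a sketch of the PIN row layout being assumed in question 1: tab-delimited columns with `Proteins` last, where multiple proteins are themselves joined by tabs (which is exactly why `Proteins` must be the final column). The column names beyond the required ones, and the protein accessions, are hypothetical:

```python
# Sketch of a PIN row: tab-delimited fields, Proteins column last.
# Multiple proteins are tab-separated too, so the row's field count
# varies and only a trailing Proteins column can absorb that.

header = ["SpecId", "Label", "ScanNr", "cosine", "shared", "Peptide", "Proteins"]
psm = {
    "SpecId": "run1_00042", "Label": 1, "ScanNr": 42,
    "cosine": 0.93, "shared": 21, "Peptide": "K.LESLIEK.V",
    "Proteins": ["sp|P12345|EXAMPLE1", "sp|P67890|EXAMPLE2"],  # hypothetical IDs
}

fields = [str(psm[col]) for col in header[:-1]] + psm["Proteins"]
line = "\t".join(fields)
print(line)
```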