Argument "useful_properties" must only contain column names of the featureData slot. #49
Hi Antonio

Thank you for your interest in MSqRob. The error you get says that at least one of the inputs for the "useful_properties" argument is not a column name of the featureData slot of your MSnSet.

Fractionated data is always more difficult to analyze because some peptides might be seen in multiple fractions. There are multiple ways to deal with this, but I never did an in-depth comparison of what works best, so I can only give a few suggestions:
Probably there are other, and maybe better, ways to deal with fractions, but as I said, I never really dug very deeply into that. I hope this already helps. Let me know if my suggestions worked and if you have any further problems.

Best regards
Ludger
Dear Ludger

Thanks for your fast answer! I fixed my issue by making sure that the columns indicating the peptide sequence and the protein group were called exactly as in the script (peptide and prot) AND by making sure that all the remaining column names started with sumIntensity_. That is the standard moFF format; however, I had gotten rid of this prefix because I tried doing some post-processing of the moFF output to aggregate the fractions. So now it looks like this:
I tried dealing with fractionation by implementing the Levenberg-Marquardt algorithm described in the MaxLFQ paper (http://www.mcponline.org/content/13/9/2513.long) in Python (with scipy.optimize.root). However, I didn't have much success: this is how the log2 fold change distribution looks for E. coli proteins (which should be around log2(3) = 1.58) and Homo sapiens proteins (which should be around log2(1) = 0).

If I do what you propose, which is inputting the moFF file directly and considering fraction a random effect, I get a much better-looking result, since the separation between the two organisms is much clearer. However, I still find many E. coli proteins with an estimate of 0. This could be due not just to an error in the quantification, but also to missing identifications in the upstream analysis. There are also a few Homo sapiens proteins with positive estimates, albeit really few. Finally, I think MSqRob's tendency to assign estimates around 0 is actually "cheating" when benchmarking the tool with this dataset, since one of the target estimates (for Homo sapiens) is 0.

So I would say the best way to go when dealing with fractionated data, for now, is considering it a random effect, as you said. Nevertheless, I think it is improvable, since the results from the original MaxLFQ paper (Figure 3 I) show cleaner normal distributions. Please also let me know if you have any ideas on how to improve the result, or if you need clarification on how the dataset is structured.

Thanks once again for your time
Cheers
Antonio
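A rough R sketch of this kind of fraction-level normalization (the attempt described above used Python with scipy.optimize.root; the matrix layout and the use of base R optim instead of Levenberg-Marquardt are assumptions made only for this illustration):

```r
# Sketch (assumed layout): `mat` is a peptides x runs matrix of log2
# intensities, with NA for peptides not seen in a given run/fraction.
# One normalization shift is estimated per column by minimizing the squared
# differences of shifted log2 intensities over all column pairs, which is
# conceptually what the MaxLFQ "delayed normalization" does. Base optim()
# (BFGS) is used here instead of Levenberg-Marquardt for simplicity.
estimate_shifts <- function(mat) {
  n <- ncol(mat)
  objective <- function(free) {
    f <- c(0, free)  # fix the first shift at 0 for identifiability
    total <- 0
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        d <- (mat[, i] - f[i]) - (mat[, j] - f[j])
        total <- total + sum(d^2, na.rm = TRUE)
      }
    }
    total
  }
  fit <- optim(rep(0, n - 1), objective, method = "BFGS")
  c(0, fit$par)  # shifts to subtract from each column of `mat`
}

# Usage: normalized <- sweep(mat, 2, estimate_shifts(mat))
```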
Hi Antonio

I'm happy you were able to solve the issue.

The ridge regression in MSqRob is what causes the tendency to assign estimates around 0. It is conceptually equivalent to putting a normally distributed "prior" around 0 on the data. You then actually assume, before you have seen the data, that the estimate must be around zero, and only if there is enough evidence for differential abundance in the data will your estimate be further away from 0. This will inevitably introduce a bit of bias in the data: e.g. if your true differential abundance is at 1.58, the estimates will on average be pulled slightly below that value. I agree that if you want to benchmark MSqRob on its FC estimates, you cannot use it for the Homo sapiens proteins, but you can still do it for the E. coli proteins.

I personally find it very difficult to compare the plot you made here to the histograms in the paper, also because in the paper, replicate groups were filtered for 2 out of 3 valid values in the proteinGroups file. I also see no reason why the histogram of the estimates for all the proteins should follow a normal distribution. A normal distribution can be expected if you sample observations of a phenomenon where there is purely random noise. However, there are many processes that make the log fold change estimates non-random: proteins interact with each other and can be translated or degraded in response to up- or downregulation of other proteins, there is ionization competition between co-eluting peptides, and so on. So in my opinion the shape of the estimates' distribution doesn't say anything about the quality of the fold change estimates.

I think it might be clearer if you would redo the analysis with the same preprocessing protocol and make, for example, boxplots or violin plots of the estimates for each method and draw a horizontal line for the true fold change. If you then compare MSqRob to MaxLFQ, I would expect the E. coli estimates of MSqRob to be a bit below the true value on average, but much less variable. (Although there might also be the impact of moFF; if you only want to assess MSqRob, you should use the peptides.txt file from the original authors' dataset PXD000279.)

Finally, I personally think the best way to benchmark MSqRob is to compare the rankings in terms of p-value. This is because the p-value is the measure of statistical significance on which the proteins should be ranked (i.e. the lower the p-value, the higher the probability that the protein is truly differentially abundant if you have no prior knowledge).

Cheers
Ludger
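A minimal R/ggplot2 sketch of the comparison plot suggested above; the data frame `results` and its columns (method, organism, estimate) are assumptions made for illustration:

```r
library(ggplot2)

# `results` is assumed to have one row per protein with columns:
#   method   - e.g. "MSqRob" or "MaxLFQ"
#   organism - "E. coli" or "H. sapiens"
#   estimate - log2 fold change estimate
ggplot(results, aes(x = method, y = estimate)) +
  geom_violin() +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  # true log2 fold changes: log2(3) for the E. coli spike-in, 0 for the human background
  geom_hline(data = data.frame(organism = c("E. coli", "H. sapiens"),
                               truth = c(log2(3), 0)),
             aes(yintercept = truth), linetype = "dashed") +
  facet_wrap(~ organism) +
  labs(y = "log2 fold change estimate")
```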
Hi Ludger

A thousand thanks for such a comprehensive answer! Now I clearly understand the meaning of the high abundance of 0 estimates. I will do what you said over the weekend and share it here 👍 I agree that the p-value information should be taken into account. I have a couple more questions regarding MSqRob, but maybe I should open a new issue :)

Thanks!
Hi Antonio

You're welcome! I don't think it is necessary to correct for E. coli proteins with estimates of 0. However, when the result is not statistically significant, we cannot reject our null hypothesis. This is a much weaker conclusion, because it can mean that the null hypothesis is indeed true, but it can also mean that the null hypothesis is not true and we just didn't collect enough evidence against it (unless we e.g. did some statistical power calculations upfront, but these also rely on certain assumptions, e.g. regarding the minimal effect size).

A nice analogy would be that I would like to prove that there are gnomes living in my garden. My null hypothesis would be that there are 0 gnomes in my garden and my alternative hypothesis would be that there is at least 1 gnome living in my garden. If I go out and try to collect gnomes, and I find a gnome, I will be very happy (hello Nobel Prize!). However, if I can't find any gnome, it could be that there are indeed none, but I could also have used a wrong way to look for them, they could have been on holiday, etc. This is why it is so difficult to prove that something does not exist. In science, we mostly try to disprove the null hypothesis, or in our case: we want to find proteins that are different from 0. Thus, I think it is not needed to correct for the zeros in the E. coli proteins. The zeros indicate that there is not enough evidence in the data to reject the null hypothesis for these proteins.

With "the replicate groups were filtered for 2 out of 3 valid values in the proteinGroups file" I indeed meant that the authors of the MaxLFQ paper only retained the peptides for which an intensity was observed in at least 2 of the 3 replicates per group, or at least, this is how I interpreted this sentence in their paper: "Replicate groups were filtered for two out of three valid values and averaged". This means that they very likely removed a large part of their dataset (as in my experience, the largest part of the dataset has more missing values than that).

I didn't look in detail at what kind of preprocessing they did exactly, but when you do a comparison, you should make sure to use the same preprocessing. So ideally, you should try to reproduce their preprocessing pipeline and apply it to both methods, or use your own preprocessing pipeline for both methods. Of course you could also argue that the preprocessing is an inherent part of a method, but I think this is not a valid argument for this kind of comparison, because you don't look at which proteins you find and which you don't; you look at how nicely the distributions are separated. And of course, the distributions will be much more separated if you only use proteins for which there is a lot of data.
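A minimal R sketch of that kind of "2 out of 3 valid values per group" filter, assuming a hypothetical intensity matrix and group assignment (an illustration of the idea as interpreted above, not the exact MaxQuant/Perseus procedure):

```r
# `intensities`: features x samples matrix of intensities, NA = missing value
# `groups`: factor of length ncol(intensities) giving each sample's replicate group
keep_valid_values <- function(intensities, groups, min_valid = 2) {
  # count observed (non-NA) values per feature within each group
  valid_per_group <- sapply(levels(groups), function(g) {
    rowSums(!is.na(intensities[, groups == g, drop = FALSE]))
  })
  # keep features with at least `min_valid` observed values in every group
  intensities[apply(valid_per_group >= min_valid, 1, all), , drop = FALSE]
}
```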
Hi there
First of all, thanks for developing such a useful tool and supporting moFF with Import2MSnSet.
I am trying to set up a quantification pipeline using the Compomics workflow (SearchGUI, PeptideShaker and moFF). I obtained a moFF output file for the MaxLFQ dataset (available at https://www.ebi.ac.uk/pride/archive/projects/PXD000279) and I implemented a Levenberg-Marquardt minimisation, similar to the one explained in the paper, to normalize the XICs. That way, I get a moFF file with 2+6 columns.
However, when I try to preprocess the MSnSet, I get this error:

Error in preprocess_MSnSet(MSnSet = peptides, accession = "prot", exp_annotation = exp_annot, : Argument "useful_properties" must only contain column names of the featureData slot.
I set useful_properties to the names of the first two columns in the file, which store the protein group and the peptide sequence respectively. Defaulting to NULL also gives the error. These are the only 2 slots under featureData > data in the peptides object. I saw #32 and I realised I had stripped sumIntensity_ from the column names, but adding it back did not fix it either.
What could be the problem? I can share my file if needed.
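A minimal R sketch of the check this error implies, assuming the MSnSet is called `peptides` as in the call above (the column names "prot" and "peptide" are just this example's values):

```r
library(MSnbase)  # provides fData() for MSnSet objects

# The error means every entry of `useful_properties` must match a column of
# the feature data exactly (case-sensitive), so inspect those columns first:
colnames(fData(peptides))

# and only pass names that actually occur there, e.g.:
useful <- c("prot", "peptide")
stopifnot(all(useful %in% colnames(fData(peptides))))

# then call the preprocessing with those names
# (other arguments as in the original call):
# preprocessed <- preprocess_MSnSet(MSnSet = peptides, accession = "prot",
#                                   exp_annotation = exp_annot,
#                                   useful_properties = useful)
```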
Moreover, does it make sense to preprocess the files, or can the sample fractionation actually be considered an extra effect? In that case, would it be fixed or random? Run effects are considered random, so I guess fractions should be too, though I don't know why.
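For illustration, in lme4-style mixed-model notation (a sketch of the concept only, not MSqRob's own interface, and with hypothetical column names), treating fraction as a random effect would look like this:

```r
library(lme4)

# `pep`: long-format data frame with hypothetical columns
#   log2_intensity, condition (the effect of interest),
#   run, fraction, and sequence (the peptide identifier)
fit <- lmer(log2_intensity ~ condition +     # fixed effect of interest
              (1 | run) + (1 | fraction) +   # random intercepts per run and fraction
              (1 | sequence),                # peptide-specific baseline
            data = pep)
```

In general, random effects are used for nuisance grouping factors (such as runs and fractions) whose specific levels are not themselves of interest, which is why run and fraction are handled the same way in this sketch.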
Thanks in advance!
Best regards
Antonio