Determine with what certainty one distribution can be chosen as a better approximator for a desired value than another distribution. #5328
Comments
I don't quite understand what you are aiming for. Simplified: you have two treatments (processing methods) and want to show that one is "better", where "better" means having a larger coefficient, or one closer to 1.

In the model we don't support one-sided alternatives, so you are not testing equal versus higher or smaller. Your restriction matrix

```
[[ 1. -1.  0.]
 [ 0.  1.  0.]]
```

tests in its first line whether the two coefficients are the same, and in its second line whether the second coefficient is different from zero. AFAICS, the second condition is irrelevant for your objective. The first condition rejects in either direction, i.e. for alternatives where the first coefficient is larger than the second or where it is smaller. So your test outcome is that equality of the coefficients is rejected with a very low p-value (1e-66 when the second line is still included, and most likely a bit larger if the second line is dropped).

t_test can be interpreted as a one-sided test in the models. The problem is that joint hypotheses with inequality alternatives are difficult to implement. Actually, in this case we can also use the confidence interval boundary point for rejecting the one-sided alternative (for the null that a coefficient is 0, i.e. the original value is 1). At alpha=0.025, the one-sided test has the first coefficient significantly positive and the second coefficient significantly negative.

Aside: I don't understand why your second coefficient is negative when the mean in the plot is around 0.7. (Correction: I just saw that you subtract 1.)
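The one-sided reading of a two-sided test described above can be sketched with scipy on synthetic data (the sample values and names here are illustrative assumptions, not the issue's data): when the estimate is positive, the one-sided p-value for the "greater" alternative is half the two-sided p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Assumed stand-in for one group's "Volume Change Factor" minus 1.
a = rng.normal(loc=0.5, scale=0.1, size=50)

two = stats.ttest_1samp(a, 0.0)                         # H1: mean != 0
one = stats.ttest_1samp(a, 0.0, alternative='greater')  # H1: mean > 0

# With a positive sample mean, the one-sided p-value is exactly half
# of the two-sided one.
print(two.pvalue, one.pvalue)
```

The same halving logic carries over to the coefficient tests reported by `t_test` in statsmodels, which is what makes the one-sided interpretation possible there.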
A coefficient closer to 1, or, after I have subtracted 1 from all values, a coefficient closer to zero (which I believe makes things easier to test hypothesis-wise).
I'm not really interested in the one-sided comparison. Whether it overshoots or undershoots I can just read from the sign of the factor in the GLM. But I am surprised that this is so difficult/ambiguous. It is fairly obvious to me from the graphic that I should prefer “Generic”. But given the data, how certain can I say I am of this? This seems like something that should be tested all the time, e.g. in QC or manufacturing. How is this usually done?
@josef-pkt and I cannot run this
Ok, what about this: For me to prefer one procedure two criteria have to be met:
I can get:
so the probability of obtaining the data given 1. and 2. should be Pa*(1-P0), right? The problem with this is that it does not take the comparison between the two distributions into account. I could get the scores for both, PA = Pa*(1-P0) and PB = Pb*(1-P0), and the ratio PA/PB would tell me how many times better a choice A is than B:

```python
from os import path

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

volume_path = path.abspath('data/volumes.csv')
df = pd.read_csv(volume_path)
df = df.loc[df['Processing'] != 'Unprocessed']
df = df.loc[((df['Processing'] == 'Legacy') & (df['Template'] == 'Legacy')) |
            ((df['Processing'] == 'Generic') & (df['Template'] == 'Generic'))]
df['Volume Change Factor'] = df['Volume Change Factor'] - 1

model = smf.mixedlm('Q("Volume Change Factor") ~ Processing - 1', df, groups='uID')
fit = model.fit()
print(fit.summary())

# Row [1, -1, 0]: are the two coefficients equal?
omnibus_tests = np.eye(len(fit.params))
omnibus_tests = np.delete(omnibus_tests, (1, 2), axis=0)
omnibus_tests[0, 1] = -1
print(omnibus_tests)
anova_0 = fit.f_test(omnibus_tests)
p_0 = anova_0.pvalue
print(anova_0, p_0)

# Row [1, 0, 0]: is the first coefficient zero?
omnibus_tests = np.eye(len(fit.params))
omnibus_tests = np.delete(omnibus_tests, (1, 2), axis=0)
print(omnibus_tests)
anova_a = fit.f_test(omnibus_tests)
p_a = anova_a.pvalue
print(anova_a, p_a)

# Row [0, 1, 0]: is the second coefficient zero?
omnibus_tests = np.eye(len(fit.params))
omnibus_tests = np.delete(omnibus_tests, (1, 2), axis=0)
omnibus_tests[0, 0] = 0
omnibus_tests[0, 1] = 1
print(omnibus_tests)
anova_b = fit.f_test(omnibus_tests)
p_b = anova_b.pvalue
print(anova_b, p_b)

print('Should I take A?')
p_A = p_a * (1 - p_0)
print(p_A)
print('Should I take B?')
p_B = p_b * (1 - p_0)
print(p_B)
print('How much better is A than B?')
print(p_A / p_B)
```

This tells me how many times better a choice “Generic” is. What do you say @josef-pkt? Did I nail it, or am I performing operations with p-values which are not supposed to be performed with p-values?
Perhaps take a look at an Anderson-Darling test, which will show you (with a p-value) whether your two sets of samples are drawn from the same distribution?
I think it's the latter. I think the main problem is in finding a definition for a measure of how much A is better than B. In general "better" should be defined outside the statistical model, in terms of a utility, objective, or loss function for the usage of the estimate. Otherwise, there are purely statistical distance measures for distributions, for example the Kullback-Leibler distance, which is also used when a misspecified MLE model approximates some underlying "true" model. Being a "gazillion" times better in some measure doesn't really have an intrinsic meaning in terms of a meaningful distance/quality measure.
Yes, that makes sense.
The cost here is the switching cost, which I cannot estimate. But ideally I would give the reader a metric for how much closer to the mark a new method gets them versus an old one. Maybe some users will jump on as soon as it's even 10% better; for others it might have to be closer to a gazillion. What kind of performance metric would you find most convincing here? I would add also that these are percentage values: I'm measuring how much the process changes the total volume of the input, so negative values are impossible and a score of 2 is way better than a score of 0. I guess this also wholly invalidates my subtraction of 1, since the negative and positive ranges are not, in the extremes, equivalent.
@clham I read the Wikipedia article, but I am unsure how the Anderson-Darling test could be used here. Can you provide an example?
@josef-pkt I could give this “better” another try, but I think it's quite clearly defined already, and these are all the same statement in different sentences: what is the probability that, given the samples, by picking “Generic” I will actually end up picking the process which in the long run causes MORE volume change?
My understanding of your issue is that you have two groups or samples: 1) treatment A and 2) treatment B. Is there a third "no treatment" group? You're trying to evaluate:
With an Anderson-Darling test, you could determine whether the distribution of the A group is different from the distribution of the B group. That satisfies your first point above. (Also, I was thinking of the k-sample non-parametric test, not the one based on a specified distribution.)
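For illustration, the k-sample Anderson-Darling test mentioned here is available in scipy. The data below are synthetic stand-ins for the two treatment groups (the locations, scales, and names are assumptions, not the issue's measurements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
legacy = rng.normal(loc=0.7, scale=0.1, size=60)    # assumed stand-in
generic = rng.normal(loc=1.05, scale=0.1, size=60)  # assumed stand-in

res = stats.anderson_ksamp([legacy, generic])
# A small significance level suggests the samples come from different
# distributions; note it says nothing about which one is closer to 1.
print(res.statistic, res.significance_level)
```

Note that scipy caps the returned significance level to the [0.001, 0.25] range and warns when the true value falls outside it.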
There is a pre-treatment group which consists only of ones, and thus breaks the GLM.
I think that would be satisfied by a t-test as well; the problem is more determining which sample comes from a distribution which gives me a higher probability of getting values close to 1.
The problem is that this is a Bayesian statement, i.e. you want to use a posterior |
The issue with a t-test is that it assumes normality and, based on the graph in the OP, that may not be a safe assumption. If you run an A-D test, you can figure out whether treatments A and B are different. If they are, the next question is "which is better", and the question there is what the penalty function is. You could compare the variance vs 1 (A has more total variance from 1 than B), but that assumes that squared error is the correct penalty. If your engineering problem has a set maximum loss (before the volume lost is "out of spec"), you might need to cook up a function that only trips when the spec is breached (and then it is binary?), or something else that is use-case specific.
In general the t-test is very robust to deviations from normality as long as the deviations are not huge. But it is only a test of the mean(s) and doesn't say anything about other distributional properties. |
@josef-pkt But that requires me to set an arbitrary CI. Why not 99%? Why not 90%? I don't want to do that, that's up to the user (as is even more clear in your salary example). I am putting forward a choice A over B and I would like to give an estimate of certainty so that people can look at that and compare it to whatever their personalized significance threshold is for this particular question.
@clham OK, I'm willing to assume that squared error is the correct penalty. How then do I go about getting a metric for how much more likely A is to deviate from 1 than B is? I think this is what makes it tricky. I find it really curious that it's easy to recommend or dismiss A over nothing, but not over another distribution.
An assumption here is that there WILL be deviance (because of measurement errors if nothing else). It seems the key metric would be average deviance or average squared deviance. Would just a standard deviation (vs 1) or mean absolute error be sufficient?
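The squared-deviance suggestion above can be made concrete as a root-mean-square deviation from the target value 1, computed per group. This is a sketch on synthetic stand-in data (the group names and values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
legacy = rng.normal(loc=0.7, scale=0.1, size=60)    # assumed stand-in
generic = rng.normal(loc=1.05, scale=0.1, size=60)  # assumed stand-in

def rms_deviation(x, target=1.0):
    """Root mean squared deviation from the target value."""
    return np.sqrt(np.mean((x - target) ** 2))

print(rms_deviation(legacy), rms_deviation(generic))
```

Under the squared-error penalty, whichever group has the smaller RMS deviation is "closer" to the target.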
That would be a ratio of t-values. How would I interpret that?
So, this really gets into "what are you trying to sell" (not a stats problem). But is it useful for your purposes to say (for example) "Treatment A has on average a 20% lower volume loss than treatment B"?
@clham No, what I am trying to communicate is how certain they can be of getting values closer to 1 if they switch from B to A. Clearly a stats problem. You recommended an approach which ends up with a ratio of t-values. How would you interpret that ratio? Say the t-score of A is 1.5 and that of B is -3: can you say that you are thus half as likely to get values significantly deviating from 1 if you choose A over B?
I don't know to what extent that is useful. I would expect people to ask: how sure are you? They don't care so much about the specific percentage difference in my sample, but about with what certainty this generalizes to whatever sample they have.
Interesting; so it sounds like the question is more "what are the chances that mean(A) < mean(B)", under the condition that samples from A and B are heterogeneous. Would Welch's t-test be helpful here? (h/t to Google)
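Welch's t-test, as suggested, is the unequal-variances variant of the two-sample t-test and is available in scipy via `equal_var=False`. A sketch on synthetic stand-ins for the two groups (the values and names are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(loc=1.05, scale=0.05, size=60)  # assumed stand-in for A
b = rng.normal(loc=0.7, scale=0.15, size=60)   # assumed stand-in for B

# equal_var=False selects Welch's t-test, which does not assume the
# two samples share a common variance.
res = stats.ttest_ind(a, b, equal_var=False)
print(res.statistic, res.pvalue)
```

A small p-value here only indicates that the two means differ; it does not by itself say which group is closer to the target of 1.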
I don't think anything frequentist would help in making a probability statement about the posterior "belief". The p-value for testing equality of the two means is still an interpretable measure of uncertainty.
@josef-pkt you might be right, but I don't actually want the distribution. I.e. it's going to be z likely that A will get you 0.1 closer to 1 than B will, z' likely that it will get you 0.11 closer to 1 than B will, z'' likely that it will get you 0.13 closer to 1 than B will, etc. I just want to know the integral. How likely is it that it will get you closer to 1 than B will, in general.
@clham that again just compares the distributions; that bit is, afaict, handled with a t-test. The problem remains determining how likely it is that one of them is closer to 1 than the other is.
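One direct plug-in estimate of exactly that quantity, the probability that a draw from A lands closer to 1 than a draw from B, can be sketched by comparing all pairs of observed draws. This is synthetic stand-in data, and treating all cross-pairs as comparable assumes the two samples are independent:

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(loc=1.05, scale=0.1, size=60)  # assumed stand-in for A
b = rng.normal(loc=0.7, scale=0.1, size=60)   # assumed stand-in for B

# Fraction of (A, B) pairs in which the A draw is closer to the target 1.
closer = np.abs(a[:, None] - 1.0) < np.abs(b[None, :] - 1.0)
p_hat = closer.mean()
print(p_hat)
```

This is only a point estimate; a bootstrap over the two samples would be one way to attach an uncertainty interval to it.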
Can we reframe the question: if mean(A) is closer to 1 than mean(B), then a < b. You can say with X confidence that BECAUSE we are x% sure that a < b, we are x% sure that A is closer to 1 than B. (If A were larger, we would fail to reject the null that it is in fact closer.)
@clham that's the problem: if mean(A) is closer to 1 than mean(B), we cannot say that a < b. It could be that A scatters around 0.99 and B scatters around 0.8. It could also be that A scatters around 1.21 and B scatters around 0.8, in which case the two are clearly very different from each other, but there would be little evidence to prefer A.
Measure A vs B as the absolute value of the difference from 1 for each distribution. This way, a mean of 1.21 would be represented as 0.21, as would a mean of 0.79. Whichever is less is in fact "closer".
@clham I also thought of that, but taking the absolute value will mess up the distribution as soon as it crosses the target value: [-1, -2, 0, 0, 0, 1, 1, 2] will have a different mean and SD than [1, 2, 0, 0, 0, 1, 1, 2].
Presumably the body of the distribution is well away from 1? It's certainly wrong from a "pure stats" perspective, but from a "useful stats" perspective, is it good enough? As an alternative, can you shift everything over: add e.g. 3 to all your data points (or subtract, if the mean is on the "wrong" side)? This will force (nearly) everything to have a consistent sign, but keep the directionality and relative variance correct.
@clham that sounds like some serious fudging :) At any rate, I have decided to use a t-test to establish a significant difference between the two samples, and the ratio of RMSEs in order to claim a better suitability of one over the other. I maintain that this doesn't fully address my hypothesis, but it seems this is as good as it gets :(
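The approach settled on above, a t-test for a difference between the samples plus a ratio of RMSEs around the target of 1, could look like the following sketch. All data and names here are synthetic stand-ins, not the issue's CSV:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
legacy = rng.normal(loc=0.7, scale=0.1, size=60)    # assumed stand-in
generic = rng.normal(loc=1.05, scale=0.1, size=60)  # assumed stand-in

# Welch's t-test for a significant difference between the two samples.
t_res = stats.ttest_ind(generic, legacy, equal_var=False)

def rmse(x, target=1.0):
    """Root mean squared error around the target value."""
    return np.sqrt(np.mean((x - target) ** 2))

# A ratio above 1 means "generic" sits closer to the target on average.
ratio = rmse(legacy) / rmse(generic)
print(t_res.pvalue, ratio)
```

As noted in the thread, this establishes a difference and quantifies relative closeness under squared error, but it is not a probability statement about which process is better.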
Reopening this, at least temporarily, so I can look at it again. I'm not going to do any Bayesian inference and posterior beliefs. However, we have more for comparing predictive distributions now. For metric measurements we could use one-sided gof tests, e.g. ks_test. (We don't have general stochastic dominance measures yet, like first or second order) |
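A one-sided two-sample KS comparison, a rough version of the first-order stochastic dominance check mentioned here, can be sketched with scipy's `ks_2samp` and its `alternative` argument. This is an illustration on synthetic data, not the statsmodels functionality being referred to:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
a = rng.normal(loc=0.7, scale=0.1, size=60)   # assumed stand-in
b = rng.normal(loc=1.05, scale=0.1, size=60)  # assumed stand-in

# alternative='greater': H1 is that the empirical CDF of `a` lies above
# that of `b`, i.e. `a` tends to produce smaller values.
res = stats.ks_2samp(a, b, alternative='greater')
print(res.statistic, res.pvalue)
```

Rejecting here supports the claim that one sample is stochastically smaller than the other, which is a weaker statement than a full dominance measure.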
I have the following situation, I am testing two variations of a process against each other. I know that the ideal outcome value is 1. I would like to know which process I can prefer (and with what certainty) based on a sampling of their output values.
(for the same set of inputs, so there's a common random factor, but that bit I think is trivial).
Please see the figure below and find a full minimal working example here.
The way I see it (for a given level of the process variation variable) H0 is that there is a greater or equal effect of the other level.
Since I am unsure how to encode this variable I have opted to instead test the (imho incomplete) H0 that there is no effect for the given level and that it is not significantly different from the other level. This still lacks the constraint that there shouldn't be a greater effect from the other level. I have no idea how that could be encoded.
At any rate I fail long before getting there.
I have tried to subtract one (to make zero the ideal value), and to model the data without an intercept, so I can handle the two levels individually, but given the code:
I get
Based on this it seems that both levels are almost equal in how well they approximate the value, but that is clearly not the case, as seen in the figure.
Any idea how I can perform this comparison with statsmodels?