
Determine with what certainty one distribution can be chosen as a better approximator for a desired value than another distribution. #5328

Open
TheChymera opened this issue Oct 18, 2018 · 29 comments

Comments

@TheChymera

TheChymera commented Oct 18, 2018

I have the following situation: I am testing two variations of a process against each other. I know that the ideal outcome value is 1. I would like to know which process I should prefer (and with what certainty) based on a sampling of their output values.

(for the same set of inputs, so there's a common random factor, but that bit I think is trivial).

Please see the figure below and find a full minimal working example here.

[Figure: boxplot of the output values for the two process variations]

The way I see it, for a given level of the process-variation variable, H0 is that the other level has a greater or equal effect.
Since I am unsure how to encode this, I have opted to instead test the (imho incomplete) H0 that there is no effect for the given level and that it is not significantly different from the other level. This still lacks the constraint that the other level shouldn't have a greater effect, and I have no idea how that could be encoded.

At any rate I fail long before getting there.

I have tried to subtract one (to make zero the ideal value), and to model the data without an intercept, so I can handle the two levels individually, but given the code:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from os import path

volume_path = path.abspath('volumes.csv')
df = pd.read_csv(volume_path)

# Keep only the two matched Processing/Template combinations of interest.
df = df.loc[df['Processing']!='Unprocessed']
df = df.loc[((df['Processing']=='Legacy') & (df['Template']=='Legacy')) | ((df['Processing']=='Generic') & (df['Template']=='Generic'))]

# Shift the target value to zero, so each coefficient measures deviation from the ideal.
df['Volume Change Factor'] = df['Volume Change Factor']-1

# Mixed model without an intercept, so each Processing level gets its own coefficient.
model = smf.mixedlm('Q("Volume Change Factor") ~ Processing-1', df, groups='uID')

fit = model.fit()
summary = fit.summary()
print(fit)
print(summary)

# Contrast matrix: row 0 tests equality of the two coefficients,
# row 1 tests whether the second coefficient (Legacy) is zero.
omnibus_tests = np.eye(len(fit.params))
omnibus_tests = np.delete(omnibus_tests, (2), axis=0)
omnibus_tests[0,1] = -1
print("Generic better than Legacy?")
print(omnibus_tests)
anova = fit.f_test(omnibus_tests)
print(anova)

# Contrast matrix: row 0 tests equality of the two coefficients,
# row 1 tests whether the first coefficient (Generic) is zero.
omnibus_tests = np.eye(len(fit.params))
omnibus_tests = np.delete(omnibus_tests, (2), axis=0)
omnibus_tests[0,1] = -1
omnibus_tests[1,0] = 1
omnibus_tests[1,1] = 0
print("Legacy better than Generic?")
print(omnibus_tests)
anova = fit.f_test(omnibus_tests)
print(anova)

I get

/usr/lib64/python3.6/site-packages/statsmodels/regression/mixed_linear_model.py:2092: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
<statsmodels.regression.mixed_linear_model.MixedLMResultsWrapper object at 0x7f57bc5d5f98>
                 Mixed Linear Model Regression Results
=======================================================================
Model:            MixedLM Dependent Variable: Q("Volume Change Factor")
No. Observations: 136     Method:             REML                     
No. Groups:       68      Scale:              0.0038                   
Min. group size:  2       Likelihood:         160.0737                 
Max. group size:  2       Converged:          Yes                      
Mean group size:  2.0                                                  
------------------------------------------------------------------------
                        Coef.   Std.Err.     z     P>|z|  [0.025  0.975]
------------------------------------------------------------------------
Processing[Generic]      0.036     0.009    4.071  0.000   0.019   0.053
Processing[Legacy]      -0.275     0.009  -31.364  0.000  -0.293  -0.258
uID Var                  0.001     0.013                                
=======================================================================

Generic better than Legacy?
[[ 1. -1.  0.]
 [ 0.  1.  0.]]
<F test: F=array([[578.95292511]]), p=1.1575867403053478e-66, df_denom=134, df_num=2>
Legacy better than Generic?
[[ 1. -1.  0.]
 [ 1.  0.  0.]]
<F test: F=array([[578.95292511]]), p=1.1575867403053893e-66, df_denom=134, df_num=2>

Based on this it seems that both levels are almost equal in how well they approximate the value, but that is clearly not the case, as seen in the figure.

Any idea how I can perform this comparison with statsmodels?

@josef-pkt
Member

I don't quite understand what you are aiming for.

simplified: You have two treatments (processing methods) and want to show that one is "better", where "better" is having a larger coefficient or one closer to 1.
Is that correct?

In the model we don't support one-sided alternatives, so you are not testing equal versus higher or smaller.

[[ 1. -1. 0.]
[ 0. 1. 0.]]

The first line tests whether the two coefficients are the same; the second line tests whether the second coefficient is different from zero. AFAICS, the second condition is irrelevant for your objective.

The first condition rejects for either direction, i.e. for alternatives that the first coefficient is larger than the second or that it is smaller. I.e. your test outcome is that equality of coefficients is rejected with a very low p-value (1e-66 with the second line still included, and most likely a bit larger if the second line is dropped).

t_test can be interpreted as a one-sided test in the models. The problem is that joint hypotheses with inequality alternatives are difficult to implement.
I'm not sure whether t_test([[1, -1, 0]]) provides enough information to recover the sign of the test statistic and which tail is relevant for the one-sided test. (It shouldn't be too difficult to add a one-sided alternative for this type of test.)

Actually, in this case we can also use the confidence interval boundary for rejecting in favor of the one-sided alternative (for the null that a coefficient is 0, i.e. the original value is 1). At alpha=0.025, the one-sided test has the first coefficient significantly positive and the second coefficient significantly negative.

aside: I don't understand why your second coefficient is negative, when the mean in the plot is around 0.7. (correction: I just saw that you subtract 1)
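
A minimal sketch of this one-sided/confidence-interval reading, assuming the fitted MixedLM results object fit from the script above (fe_params and bse_fe hold the fixed-effect estimates and their standard errors):

from scipy import stats

# One-sided p-values for each fixed-effect coefficient against 0,
# derived from the reported z statistics.
z = fit.fe_params / fit.bse_fe
print(stats.norm.sf(z))   # H1: coefficient > 0
print(stats.norm.cdf(z))  # H1: coefficient < 0

# Two-sided 95% intervals; each bound also acts as a one-sided test at alpha=0.025.
print(fit.conf_int(alpha=0.05))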

@TheChymera
Author

TheChymera commented Oct 19, 2018

simplified: You have two treatments (processing methods) and want to show that one is "better", where "better" is having a larger coefficient or one closer to 1.
Is that correct?

A coefficient closer to 1, or after I have subtracted 1 from all, a coefficient closer to zero (which I believe makes things easier to test hypothesis-wise).

In the model we don't support one-sided alternatives, so you are not testing equal versus higher or smaller.

I'm not really interested in the one-sided comparison; whether it overshoots or undershoots I can just read from the sign of the factor in the GLM.

But I am surprised that this is so difficult/ambiguous. It is fairly obvious to me from the graphic that I should prefer “Generic”. But given the data, how certain can I say I am of this? This seems like something that should be tested all the time e.g. in QC or manufacturing. How is this usually done?

@TheChymera
Author

@josef-pkt Also, I cannot run t = fit.t_test(np.array([[1,-1,0]])); it fails with:

ValueError: r_matrix for t-test should have 2 columns
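
One possible reading of this error (an assumption, not confirmed in the thread) is that t_test here only accepts contrasts over the two fixed-effect coefficients, excluding the variance component, so a 1x2 contrast might be accepted:

# Hypothetical: contrast over the two fixed effects only (Generic - Legacy).
t = fit.t_test(np.array([[1, -1]]))
print(t)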

@TheChymera
Author

TheChymera commented Oct 19, 2018

Ok, what about this: For me to prefer one procedure two criteria have to be met:

  1. the distributions need to be significantly different from each other (A!=B)
  2. the distribution needs to not be significantly different from 0 (A=0)

I can get:

  • P0 The probability of obtaining the data given A=B
  • Pa The probability of obtaining the data given A=0

so the probability of obtaining the data given 1. and 2. should be Pa*(1-P0) --- right?

The problem with this is that it does not take into account the comparison between the two distributions. I could get the scores for both:

PA = Pa*(1-P0)
PB = Pb*(1-P0)

and I could get the ratio: PA/PB tells me how many times better a choice A is than B:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from os import path

volume_path = path.abspath('data/volumes.csv')
df = pd.read_csv(volume_path)

df = df.loc[df['Processing']!='Unprocessed']
df = df.loc[((df['Processing']=='Legacy') & (df['Template']=='Legacy')) | ((df['Processing']=='Generic') & (df['Template']=='Generic'))]

df['Volume Change Factor'] = df['Volume Change Factor']-1

model = smf.mixedlm('Q("Volume Change Factor") ~ Processing-1', df, groups='uID')

fit = model.fit()
summary = fit.summary()
print(fit)
print(summary)

# P0: p-value for the null that the two coefficients are equal (A = B).
omnibus_tests = np.eye(len(fit.params))
omnibus_tests = np.delete(omnibus_tests, (1,2), axis=0)
omnibus_tests[0,1] = -1
print(omnibus_tests)
anova_0 = fit.f_test(omnibus_tests)
p_0 = anova_0.pvalue
print(anova_0, p_0)

# Pa: p-value for the null that the first coefficient (Generic) is zero (A = 0).
omnibus_tests = np.eye(len(fit.params))
omnibus_tests = np.delete(omnibus_tests, (1,2), axis=0)
print(omnibus_tests)
anova_a = fit.f_test(omnibus_tests)
p_a = anova_a.pvalue
print(anova_a, p_a)

# Pb: p-value for the null that the second coefficient (Legacy) is zero (B = 0).
omnibus_tests = np.eye(len(fit.params))
omnibus_tests = np.delete(omnibus_tests, (1,2), axis=0)
omnibus_tests[0,0] = 0
omnibus_tests[0,1] = 1
print(omnibus_tests)
anova_b = fit.f_test(omnibus_tests)
p_b = anova_b.pvalue
print(anova_b, p_b)

print('Should I take A?')
p_A = p_a*(1-p_0)
print(p_A)

print('Should I take B?')
p_B = p_b*(1-p_0)
print(p_B)

print('How much better is A than B?')
print(p_A/p_B)

This tells me “Generic” is about 5.737335433895926e+58 times more certain a choice than “Legacy”. I believe that is referred to as a gazillion, which seems about right from the graph.

What do you say @josef-pkt ? Did I nail it or am I performing operations with p-values which are not supposed to be performed with p-values?

@clham

clham commented Oct 19, 2018

Perhaps take a look at an Anderson-Darling test, which will show you (with a P value) if your two sets of samples are drawn from the same distribution?

@josef-pkt
Member

Did I nail it or am I performing operations with p-values which are not supposed to be performed with p-values?

I think it's the latter.
I think the main problem is that P0, Pa, Pb condition on different assumptions, and in general we cannot compare probabilities that have different conditioning sets.
Also Pa*(1-P0) assumes independence of the two events, which I guess doesn't hold, although it might if A and B are independent.

I think the main problem is finding a definition of a measure of how much A is better than B.
E.g. a direct measure would be just comparing the means, saying for example that the first is significantly higher, both in statistical significance and in "subject" significance.
Similar would be measures on distributions, like for example a one-sided ks test (or AD test).

In general "better" should be defined outside the statistical model in terms of a utility, objective or loss function for the usage of the estimate.
Eg. if this were industrial quality control, then there might be measures for how much we could save in cost if the new procedure has higher output and lower variance in output.

Otherwise there are purely statistical distance measures for distributions, for example the Kullback-Leibler divergence, which is also used when a misspecified MLE model approximates some underlying "true" model.

Being a "gazillion" times better in some measure doesn't really have an intrinsic meaning in terms of a meaningful distance/quality measure.

@TheChymera
Author

TheChymera commented Oct 19, 2018

I think it's the latter.
I think the main problem is that P0, Pa, Pb condition on different assumptions, and in general we cannot compare probabilities that have different conditioning sets.
Also Pa*(1-P0) assumes independence of the two events, which I guess doesn't hold, although it might if A and B are independent.

Yes, that makes sense.

Eg. if this were industrial quality control, then there might be measures for how much we could save in cost if the new procedure has higher output and lower variance in output.

The cost here is the switching cost, which I cannot estimate. But ideally I would give the reader a metric for how much closer to the mark a new method gets them versus an old one. Maybe some users will jump on as soon as it's even 10% better; for others it might have to be closer to a gazillion.

What kind of performance metric would you find most convincing here? I would also add that these are percentage values; I'm measuring how much the process changes the total volume of the input, so negative values are impossible and a score of 2 is way better than a score of 0. I guess this also wholly invalidates my subtracting 1, since the negative and positive ranges are not, at the extremes, equivalent.

@TheChymera
Author

@clham I read the Wikipedia article, but I am unsure how the Anderson-Darling test could be used here. Can you provide an example?

@TheChymera
Author

TheChymera commented Oct 19, 2018

@josef-pkt I could give this “better” another try, but I think it's quite clearly defined already and this is all the same statement in different sentences:

What is the probability that given the samples, by picking “Generic” I will actually end up picking the process which in the long run causes MORE volume change?

@clham

clham commented Oct 19, 2018

@clham I read the Wikipedia article, but I am unsure how the Anderson-Darling test could be used here. Can you provide an example?

My understanding of your issue is that you have 2 groups or samples 1) treatment A and 2) Treatment B. Is there a third "no treatment" group?

You're trying to evaluate:

Ok, what about this: For me to prefer one procedure two criteria have to be met:
the distributions need to be significantly different from each other (A!=B)
the distribution needs to not be significantly different from 0 (A=0)

With an Anderson-Darling test, you could determine whether the distribution of the A group is different from the distribution of the B group. That satisfies your first point above. (Also, I was thinking of the k-sample non-parametric test, not the one based on a specified distribution.)
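
For concreteness, a minimal sketch of the k-sample Anderson-Darling test with SciPy, under the assumption that the two groups are pulled from the same volumes.csv as in the scripts above (the variable names here are hypothetical):

import pandas as pd
from scipy import stats

df = pd.read_csv('volumes.csv')
# Raw (unshifted) Volume Change Factor values for each group.
generic = df.loc[(df['Processing']=='Generic') & (df['Template']=='Generic'), 'Volume Change Factor'].values
legacy = df.loc[(df['Processing']=='Legacy') & (df['Template']=='Legacy'), 'Volume Change Factor'].values

# k-sample Anderson-Darling test: H0 is that both samples come from the same
# (unspecified) distribution.
result = stats.anderson_ksamp([generic, legacy])
print(result.statistic, result.significance_level)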

@TheChymera
Author

My understanding of your issue is that you have 2 groups or samples 1) treatment A and 2) Treatment B. Is there a third "no treatment" group?

There is a pre-treatment group whose values are all ones, and thus it breaks the GLM.

With an Anderson-Darling test, you could determine whether the distribution of the A group is different from the distribution of the B group. That satisfies your first point above.

I think that would be satisfied by a t-test as well; the problem is more about determining which sample comes from a distribution that gives me a higher probability of getting values close to 1.

@josef-pkt
Member

What is the probability that given the samples, by picking “Generic” I will actually end up picking the process which in the long run causes MORE volume change?

The problem is that this is a Bayesian statement, i.e. you want to use a posterior probability. Frequentist hypothesis testing is always conditional on the Null hypothesis.
The frequentist interpretation is clearer as an estimation result, i.e. given our data and estimates, we would get on average xx more under treatment A than under treatment B, with a confidence interval of (low, upp).
(e.g. in empirical economics the statement would be that the salary under A is e.g. $2540 higher than under B with 95% confidence interval ($1999, $3001), which is the uncertainty because our sample is just a random draw from some sampling universe.)
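
A minimal sketch of reporting such an estimate with a confidence interval, reusing the hypothetical generic/legacy arrays from the Anderson-Darling sketch above; note this simple comparison ignores the pairing by uID that the mixed model accounts for:

from statsmodels.stats.weightstats import DescrStatsW, CompareMeans

cm = CompareMeans(DescrStatsW(generic), DescrStatsW(legacy))

# Estimated difference in means and its 95% confidence interval
# (not assuming equal variances; the pairing by uID is ignored here).
print(generic.mean() - legacy.mean())
print(cm.tconfint_diff(alpha=0.05, usevar='unequal'))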

@clham

clham commented Oct 19, 2018

My understanding of your issue is that you have 2 groups or samples 1) treatment A and 2) Treatment B. Is there a third "no treatment" group?

There is a pre-treatment group whose values are all ones, and thus it breaks the GLM.

With an Anderson-Darling test, you could determine whether the distribution of the A group is different from the distribution of the B group. That satisfies your first point above.

I think that would be satisfied by a t-test as well; the problem is more about determining which sample comes from a distribution that gives me a higher probability of getting values close to 1.

The issue with a t-test is that it assumes normality, and, based on the graph in the OP, that may not be a safe assumption.

If you run an A-D test, you can figure out whether treatments A and B are different. If they are, the next question is "which is better" -- the question here is what the penalty function is. You could compare the variance vs 1 (A has more total variance from 1 than B), but that assumes that squared error is the correct penalty. If your engineering problem has a set maximum loss (before the lost volume is "out of spec"), you might need to cook up a function that only trips when the spec is breached (and then it is binary?), or something else that is use-case specific.
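
A minimal sketch of the squared-error comparison against the target value 1, reusing the hypothetical generic/legacy arrays from the Anderson-Darling sketch above:

import numpy as np

# Mean squared deviation from the target value 1 for each group,
# and their ratio as a crude "how much better" measure.
msd_generic = np.mean((generic - 1.0)**2)
msd_legacy = np.mean((legacy - 1.0)**2)
print(msd_generic, msd_legacy, msd_legacy / msd_generic)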

@josef-pkt
Member

The issue with a t-test is that it assumes normality, and, based on the graph in the OP, that may not be a safe assumption.

In general the t-test is very robust to deviations from normality as long as the deviations are not huge. But it is only a test of the mean(s) and doesn't say anything about other distributional properties.

@TheChymera
Author

(e.g. in empirical economics the statement would be that the salary under A is e.g. $2540 higher than under B with 95% confidence interval ($1999, $3001), which is the uncertainty because our sample is just a random draw from some sampling universe.)

@josef-pkt But that requires me to set an arbitrary CI level. Why not 99%? Why not 90%? I don't want to do that; that's up to the user (as is even more clear in your salary example). I am putting forward a choice of A over B, and I would like to give an estimate of certainty so that people can look at it and compare it to whatever their personalized significance threshold is for this particular question.

You could compare the variance vs 1 (A has more total variance from 1 than B), but that assumes that squared error is the correct penalty

@clham ok, I'm willing to assume that squared error is the correct penalty. How then do I go about getting a metric for how much more likely A is to deviate from 1 than B is? I think this is what makes it tricky. I find it really curious that it's easy to recommend or dismiss A over nothing, but not over another distribution.

@clham

clham commented Oct 19, 2018

How then do I go about getting a metric for how much more likely A is to deviate from 1 than B is? I think this is what makes it tricky.

An assumption here is that there WILL be deviation (because of measurement error if nothing else). It seems the key metric would be average deviation or average squared deviation -- would just a standard deviation (vs 1) or mean absolute error be sufficient?

@TheChymera
Author

An assumption here is that there WILL be deviation (because of measurement error if nothing else). It seems the key metric would be average deviation or average squared deviation -- would just a standard deviation (vs 1) or mean absolute error be sufficient?

That would be a ratio of t-values. How would I interpret that?

@clham

clham commented Oct 19, 2018

So, this really gets into "what are you trying to sell". (not a stats problem) But is it useful for your purposes to say (for example) "Treatment A has on average a 20% lower volume loss than treatment B"?

@TheChymera
Author

TheChymera commented Oct 19, 2018

So, this really gets into "what are you trying to sell". (not a stats problem)

@clham No, what I am trying to communicate is how certain they can be of getting values closer to 1 if they switch from B to A. Clearly a stats problem.

You recommended an approach which ends up with a ratio of t-values. How would you interpret that ratio? Say the t-score of A is 1.5 and that of B is -3: can you say that you are thus half as likely to get values significantly deviating from 1 if you choose A over B?

is it useful for your purposes to say (for example) "Treatment A has on average a 20% lower volume loss than treatment B"?

I don't know how useful that is. I would expect people to ask: how sure are you? They don't care so much about the specific percentage difference in my sample, but about the certainty with which this generalizes to whatever sample they have.

@clham

clham commented Oct 19, 2018

Interesting; so it sounds like the question is more "what are the chances that mean(A) < mean(B)", under the condition that the samples from A and B are heterogeneous. Would Welch's t-test be helpful here? (h/t to Google)
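
A minimal sketch of Welch's t-test with SciPy, reusing the hypothetical generic/legacy arrays from the Anderson-Darling sketch above:

from scipy import stats

# Welch's t-test: compares the two group means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(generic, legacy, equal_var=False)
print(t_stat, p_value)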

@josef-pkt
Member

I don't think anything frequentist would help in making a probability statement about the posterior "belief".

The pvalue for testing equality of the two means is still an interpretable measure of uncertainty.
Any confidence interval defined at alpha > pvalue would not include zero, i.e. the point where both means are equal; i.e. the p-value is a measure of the width of the maximal confidence interval for which we can reject the null of equal means.

@TheChymera
Author

TheChymera commented Oct 19, 2018

The problem is that this is a Bayesian statement, i.e. you want to use a posterior probability.

@josef-pkt you might be right, but I don't actually want the whole distribution, i.e. it's going to be z likely that A will get you 0.1 closer to 1 than B will, z' likely that it will get you 0.11 closer to 1 than B will, z'' likely that it will get you 0.13 closer to 1 than B will, etc. I just want to know the integral: how likely is it that A will get you closer to 1 than B will, in general?

Would Welch's t-test be helpful here?

@clham that again just compares the distributions; that bit is afaict handled with a t-test. The problem remains determining how likely it is that one of them is closer to 1 than the other.

@clham

clham commented Oct 19, 2018

Can we reframe the question: if mean(A) is closer to 1 than mean(B), then A < B. You can say with X confidence that BECAUSE we are x% sure A < B, we are x% sure that A is closer to 1 than B. (If A were larger, we would fail to reject the null that it is in fact closer.)

@TheChymera
Author

@clham that's the problem: if mean(A) is closer to 1 than mean(B) we cannot say that A < B; it could be that A scatters around 0.99 and B scatters around 0.8. It could also be that A scatters around 1.21 and B scatters around 0.8, in which case the two are clearly very different from each other but there would be little evidence to prefer A.

@clham

clham commented Oct 19, 2018

Measure A vs B as the absolute difference from 1 for each distribution. This way, a mean of 1.21 would be represented as 0.21, as would a mean of 0.79. Whichever is less is in fact "closer".
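
A minimal sketch of this absolute-deviation approach, reusing the hypothetical generic/legacy arrays from the Anderson-Darling sketch above (the next comment points out a caveat with the folding):

import numpy as np
from scipy import stats

# Fold each sample onto its absolute deviation from the target value 1.
abs_dev_generic = np.abs(generic - 1.0)
abs_dev_legacy = np.abs(legacy - 1.0)
print(abs_dev_generic.mean(), abs_dev_legacy.mean())

# Welch's t-test on the folded values (distorted near the target, see below).
print(stats.ttest_ind(abs_dev_generic, abs_dev_legacy, equal_var=False))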

@TheChymera
Author

TheChymera commented Oct 19, 2018

@clham I also thought of that, but the absolute bit will mess up the distribution as soon as it crosses the target value: [-1,-2,0,0,0,1,1,2] will have a different mean and SD than [1,2,0,0,0,1,1,2].

@clham

clham commented Oct 19, 2018

Presumably the body of the distribution is well away from 1? It's certainly wrong from a "pure-stats" perspective, but, from a "useful-stats" perspective, is it good enough?

As an alternative, can you shift everything over a few units, e.g. add 3 to all your data points (or subtract if the mean is on the "wrong" side)? This will force (nearly) everything to have a consistent sign, but keep the directionality and relative variance correct.

@TheChymera
Author

@clham that sounds like some serious fudging :)

At any rate, I have decided to use a t-test to establish a significant difference between the two samples, and the ratio of RMSEs in order to claim better suitability of one over the other. I maintain that this doesn't fully address my hypothesis, but it seems this is as good as it gets :(
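
A minimal sketch of that final approach, reusing the hypothetical generic/legacy arrays from the Anderson-Darling sketch above: a t-test for a difference between the two samples, plus the ratio of RMSEs around the target value 1:

import numpy as np
from scipy import stats

# Difference between the two samples.
print(stats.ttest_ind(generic, legacy, equal_var=False))

# RMSE of each group around the target value 1, and their ratio.
rmse_generic = np.sqrt(np.mean((generic - 1.0)**2))
rmse_legacy = np.sqrt(np.mean((legacy - 1.0)**2))
print(rmse_generic, rmse_legacy, rmse_legacy / rmse_generic)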

@josef-pkt
Member

Reopening this, at least temporarily, so I can look at it again.

I'm not going to do any Bayesian inference and posterior beliefs.

However, we have more tools for comparing predictive distributions now.
e.g. the "stochastically larger" probability P(y1 > y2) in Brunner-Munzel style would provide an effect size measure and a test for whether and how much "better" (in terms of larger values) one distribution is compared to the other. It only requires nonparametric ordinal information.

For metric measurements we could use one-sided goodness-of-fit tests, e.g. a ks test.

(We don't have general stochastic dominance measures yet, like first or second order)
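
A minimal sketch of the two suggestions with SciPy, reusing the hypothetical generic/legacy arrays from the Anderson-Darling sketch above (the alternative keyword for ks_2samp needs a reasonably recent SciPy):

from scipy import stats

# Brunner-Munzel test: its statistic relates to P(generic > legacy) and uses
# only ordinal information.
print(stats.brunnermunzel(generic, legacy))

# One-sided two-sample Kolmogorov-Smirnov test.
print(stats.ks_2samp(generic, legacy, alternative='less'))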
