Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Nested ANOVA incorrect results -- wrong matrix definition #8506

Open
mfissore opened this issue Nov 7, 2022 · 5 comments
Open

BUG: Nested ANOVA incorrect results -- wrong matrix definition #8506

mfissore opened this issue Nov 7, 2022 · 5 comments

Comments

@mfissore
Copy link

mfissore commented Nov 7, 2022

Description of the bug

statsmodels.formula.api.ols run on data with nested factors (i.e. not complete crossing) gives incorrect results.

This problem has been reported at least twice independently

  1. https://stackoverflow.com/questions/72950135/nested-anova-in-statsmodels
  2. https://stats.stackexchange.com/questions/594889/nested-anova-in-python-and-singular-matrices

Code Sample

I'll replicate the code from example 2) above:

import pandas as pd
import statsmodels.formula.api as smf

nobs = 5
nB = 4
nA = 3
A = [f"A{i}" for i in range(1,nA+1) for h in range(nobs*nB)]
B = [f"B{i}{j}" for i in range(1,nA+1) for j in range(1,nB+1) for h in range(nobs)]
obs = list(range(5))*nA*nB

data = (30,-19,-31,-14,-14,0,20,32,11,13,7,5,3,-5,8,28,15,20,20,-48,
       24,16,-18,11,-10,14,11,27,11,8,20,18,8,32,-25,20,-12,0,-5,44,
       14,-18,33,-9,-7,14,8,-19,6,-38,18,16,7,-4,21,-25,13,36,18,-7)

df = pd.DataFrame(data ={
    "A": A,
    "B": B,
    "Observation_number": obs,
},dtype="category")
df = df.join(pd.Series(data=data,name="y"))

model = smf.ols("y ~ A + A:B",data=df).fit()
anova_df = sm.stats.anova_lm(model)

The crossing factor A:B has 33 DOF, which is only right if Statsmodels crossed factors nested in A1 with A2.

           df        sum_sq     mean_sq         F    PR(>F)
A          3.0   2506.300000  835.433333  2.180719  0.102465
A:B       33.0  10338.241297  313.280039  0.817750  0.726097
Residual  48.0  18388.800000  383.100000       NaN       NaN

Running model.summary() shows the condition number is very low:

[2] The smallest eigenvalue is 4.14e-33. This might indicate that there are strong multicollinearity problems or that the design > matrix is singular.

which you would expect when all-zero columns are left in the model matrix.

Requested change

Add automatic detection and ignore of all-zero columns (inappropriately crossed factors) or give user capability to manually do so. As Statsmodel works currently, any nested model fitted with OLS will give completely erroneous results.

@mfissore
Copy link
Author

mfissore commented Nov 7, 2022

Problem #8336 is seemingly similar (wrong number of DOF) but does not reference the zero columns, high condition number issue.

@mfissore mfissore changed the title Nested ANOVA incorrect results -- wrong matrix definition BUG: Nested ANOVA incorrect results -- wrong matrix definition Nov 7, 2022
@josef-pkt
Copy link
Member

I just saw the statsstackexchange question

Do you have the expected output?

I never really figured out nested effects and formulas for it.

trying out some things
The following has a full rank exog, note I'm recoding the nested category to use the same names in each higher level

nobs = 5
nB = 4
nA = 3
A = [f"A{i}" for i in range(1,nA+1) for h in range(nobs*nB)]
B = [f"B{j}" for i in range(1,nA+1) for j in range(1,nB+1) for h in range(nobs)]
obs = list(range(5))*nA*nB

data = (30,-19,-31,-14,-14,0,20,32,11,13,7,5,3,-5,8,28,15,20,20,-48,
       24,16,-18,11,-10,14,11,27,11,8,20,18,8,32,-25,20,-12,0,-5,44,
       14,-18,33,-9,-7,14,8,-19,6,-38,18,16,7,-4,21,-25,13,36,18,-7)

df = pd.DataFrame(data ={
    "A": A,
    "B": B,
    "Observation_number": obs,
},dtype="category")
df = df.join(pd.Series(data=data,name="y"))

results = smf.ols("y ~ A/B",data=df).fit()
anova_df = sm.stats.anova_lm(results)
print(anova_df)

            df        sum_sq     mean_sq         F    PR(>F)
A          2.0    441.233333  220.616667  0.575872  0.566051
A:B        9.0   2656.900000  295.211111  0.770585  0.643738
Residual  48.0  18388.800000  383.100000       NaN       NaN

@josef-pkt
Copy link
Member

We can keep this issue because it has a complete example.

related possibility:
Is R just dropping singular columns or is there a special treatment for nested models.
That a\b is supposed to be equivalent to a + a:b does not sound like recoding the nested levels, so maybe R just uses the default dropping of collinear variables.

@mfissore
Copy link
Author

@josef-pkt yes, I think it does drop the collinear variables; in particular, looking into the DesignMatrix produced by Patsy I see some columns are all-empty, which are the interactions between the nested factors (say B) and the levels of A in which they do not appear.

I tried dropping them manually but then Statsmodels complains that the DesignInfo is no longer matching the DesignMatrix.

Is there any way when I build the model that I can specify to drop collinear columns?

@JSEFERINO
Copy link

Aun con este cambio no realiza correctamente el Anova anidada

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants