BUG: Nested ANOVA incorrect results -- wrong matrix definition #8506

mfissore · 2022-11-07T18:12:40Z

Description of the bug

statsmodels.formula.api.ols run on data with nested factors (i.e. not complete crossing) gives incorrect results.

This problem has been reported at least twice independently

Code Sample

I'll replicate the code from example 2) above:

import pandas as pd
import statsmodels.formula.api as smf

nobs = 5
nB = 4
nA = 3
A = [f"A{i}" for i in range(1,nA+1) for h in range(nobs*nB)]
B = [f"B{i}{j}" for i in range(1,nA+1) for j in range(1,nB+1) for h in range(nobs)]
obs = list(range(5))*nA*nB

data = (30,-19,-31,-14,-14,0,20,32,11,13,7,5,3,-5,8,28,15,20,20,-48,
       24,16,-18,11,-10,14,11,27,11,8,20,18,8,32,-25,20,-12,0,-5,44,
       14,-18,33,-9,-7,14,8,-19,6,-38,18,16,7,-4,21,-25,13,36,18,-7)

df = pd.DataFrame(data ={
    "A": A,
    "B": B,
    "Observation_number": obs,
},dtype="category")
df = df.join(pd.Series(data=data,name="y"))

model = smf.ols("y ~ A + A:B",data=df).fit()
anova_df = sm.stats.anova_lm(model)

The crossing factor A:B has 33 DOF, which is only right if Statsmodels crossed factors nested in A1 with A2.

           df        sum_sq     mean_sq         F    PR(>F)
A          3.0   2506.300000  835.433333  2.180719  0.102465
A:B       33.0  10338.241297  313.280039  0.817750  0.726097
Residual  48.0  18388.800000  383.100000       NaN       NaN

Running model.summary() shows the condition number is very low:

[2] The smallest eigenvalue is 4.14e-33. This might indicate that there are strong multicollinearity problems or that the design > matrix is singular.

which you would expect when all-zero columns are left in the model matrix.

Requested change

Add automatic detection and ignore of all-zero columns (inappropriately crossed factors) or give user capability to manually do so. As Statsmodel works currently, any nested model fitted with OLS will give completely erroneous results.

The text was updated successfully, but these errors were encountered:

mfissore · 2022-11-07T18:15:47Z

Problem #8336 is seemingly similar (wrong number of DOF) but does not reference the zero columns, high condition number issue.

josef-pkt · 2022-11-08T19:47:35Z

I just saw the statsstackexchange question

Do you have the expected output?

I never really figured out nested effects and formulas for it.

trying out some things
The following has a full rank exog, note I'm recoding the nested category to use the same names in each higher level

nobs = 5
nB = 4
nA = 3
A = [f"A{i}" for i in range(1,nA+1) for h in range(nobs*nB)]
B = [f"B{j}" for i in range(1,nA+1) for j in range(1,nB+1) for h in range(nobs)]
obs = list(range(5))*nA*nB

data = (30,-19,-31,-14,-14,0,20,32,11,13,7,5,3,-5,8,28,15,20,20,-48,
       24,16,-18,11,-10,14,11,27,11,8,20,18,8,32,-25,20,-12,0,-5,44,
       14,-18,33,-9,-7,14,8,-19,6,-38,18,16,7,-4,21,-25,13,36,18,-7)

df = pd.DataFrame(data ={
    "A": A,
    "B": B,
    "Observation_number": obs,
},dtype="category")
df = df.join(pd.Series(data=data,name="y"))

results = smf.ols("y ~ A/B",data=df).fit()
anova_df = sm.stats.anova_lm(results)
print(anova_df)

            df        sum_sq     mean_sq         F    PR(>F)
A          2.0    441.233333  220.616667  0.575872  0.566051
A:B        9.0   2656.900000  295.211111  0.770585  0.643738
Residual  48.0  18388.800000  383.100000       NaN       NaN

josef-pkt · 2022-11-08T19:57:07Z

We can keep this issue because it has a complete example.

related possibility:
Is R just dropping singular columns or is there a special treatment for nested models.
That a\b is supposed to be equivalent to a + a:b does not sound like recoding the nested levels, so maybe R just uses the default dropping of collinear variables.

mfissore · 2022-11-26T00:09:15Z

@josef-pkt yes, I think it does drop the collinear variables; in particular, looking into the DesignMatrix produced by Patsy I see some columns are all-empty, which are the interactions between the nested factors (say B) and the levels of A in which they do not appear.

I tried dropping them manually but then Statsmodels complains that the DesignInfo is no longer matching the DesignMatrix.

Is there any way when I build the model that I can specify to drop collinear columns?

JSEFERINO · 2023-09-30T14:35:16Z

Aun con este cambio no realiza correctamente el Anova anidada

mfissore changed the title ~~Nested ANOVA incorrect results -- wrong matrix definition~~ BUG: Nested ANOVA incorrect results -- wrong matrix definition Nov 7, 2022

josef-pkt added comp-regression comp-stats prio-elev labels Nov 8, 2022

This was referenced Nov 8, 2022

BUG/ENH: constrained/collinear fit and degrees of freedom for terms, df_terms, anova #8512

Open

Nested ANOVA #8336

Open

josef-pkt mentioned this issue Mar 6, 2023

SUMM: singular exog (again) #8721

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Nested ANOVA incorrect results -- wrong matrix definition #8506

BUG: Nested ANOVA incorrect results -- wrong matrix definition #8506

mfissore commented Nov 7, 2022

mfissore commented Nov 7, 2022 •

edited

josef-pkt commented Nov 8, 2022

josef-pkt commented Nov 8, 2022

mfissore commented Nov 26, 2022

JSEFERINO commented Sep 30, 2023

BUG: Nested ANOVA incorrect results -- wrong matrix definition #8506

BUG: Nested ANOVA incorrect results -- wrong matrix definition #8506

Comments

mfissore commented Nov 7, 2022

Description of the bug

Code Sample

Requested change

mfissore commented Nov 7, 2022 • edited

josef-pkt commented Nov 8, 2022

josef-pkt commented Nov 8, 2022

mfissore commented Nov 26, 2022

JSEFERINO commented Sep 30, 2023

mfissore commented Nov 7, 2022 •

edited