R² adjusted strange after including interaction term #1792

Closed
t-pfaff opened this Issue Jun 26, 2014 · 30 comments

Projects

None yet

3 participants

@t-pfaff
t-pfaff commented Jun 26, 2014

In my analysis, R² adj. jumps up after including an interaction term, e.g., from 0.38 to 0.97. This is surely not due to the data. In Stata I get the same values for R², but much lower and more plausible values for R² adj.

I don't have an example with simulated data.

My models are (2nd model with interaction between month and bank holiday):

model = 'sales ~ school_holidays + C(weekday) + C(month) + C(bank_holiday_upcoming) + rain + wind_speed + C(wind_direction_category) + temperature + global_solar_radiation'

model = 'sales ~ school_holidays + C(weekday) + C(month) * C(bank_holiday_upcoming) + rain + wind_speed + C(wind_direction_category) + temperature + global_solar_radiation'

import statsmodels.formula.api as smf
smf.ols(formula = model, data=df).fit()

R² adj.:
unbenannt

Coefficients are identical with Stata except bank_holiday_upcoming and the interaction. Stata omits different categories for the interaction than Statsmodels. However, that shouldn't be the reason for the vast difference in values of R² adj.

Is there something wrong with the calculation of R² adj. with such interactions?

@t-pfaff t-pfaff changed the title from R2_adjusted strange after including interaction term to R² adjusted strange after including interaction term Jun 26, 2014
@josef-pkt
Member

Is rsquared the same as in Stata?

the only things that adjust rsquared are
res.model.nobs, res.model.k_constant, res.model.df_resid

Can you check if those are the same as in Stata?
k_constant is whether we detect that there is a constant

However, if any of those are wrong then they should also affect other results.

My only guess is that k_constant is wrong and we use the rsquared for the model without constant.

@t-pfaff
t-pfaff commented Jun 26, 2014

No, R² is also much lower in Stata for the model w/ interaction.
So possibly an upward bias in Statsmodels for both R² and R² adj.

The R² and R² adj. in Stata for the models without constant are close to the ones in Statsmodels, but not the same (little bit lower in Stata, e.g., 0.94 instead of 0.96).

@josef-pkt
Member

I need more information or a example that replicates this to figure out what's going on

What is res_ols.model.k_constant?
and try to add hasconst=True to the model call, i.e.
smf.ols(formula = model, data=df, hasconst=True).fit()

If that doesn't fix it, then I need an example or at least the full summary printout of both Stata and statsmodels.

@t-pfaff
t-pfaff commented Jun 26, 2014

res.model.k_constant is 1 for the model w/o interaction and erroneously 0 for the model w/ interaction.
hasconst=True fixes the bug. res.model.k_constant is 1 for both models and R² and R²adj. are now the same than in Stata.

@jseabold
Member

Can you please post an example of this, so we can fix it?

@t-pfaff
t-pfaff commented Jun 26, 2014

I wouldn't know how to quickly come up with an example with random numbers, but I could send you the data as csv if you want.

@jseabold
Member

A gist with a subset of your data would be helpful.

@josef-pkt
Member

related #1724 FAQ-D
I don't find an issue for detection of implicit constant.
I thought there was also a question somewhere about whether patsy gives us the hasconst information.

@jseabold
Member

Yeah that's almost certainly the issue.

@josef-pkt
Member

@t-pfaff can you just copy the list of exog names or the parameter table to a comment?
I would like to see how patsy encoded the variables with the interaction effects.

From my interpretation of the formula, I thought there would still be a constant in the design matrix, exog, in the model with the interaction effect.

@t-pfaff
t-pfaff commented Jun 26, 2014

est.params
Intercept 2.695109e+02
C(weekday)[T.1] 1.186743e+01
C(weekday)[T.2] 1.070368e+01
C(weekday)[T.3] 4.644066e+01
C(weekday)[T.4] 2.279567e+01
C(weekday)[T.5] 4.526613e+01
C(weekday)[T.6] 1.845712e+01
C(month)[T.2] 1.168990e+01
C(month)[T.3] 2.665760e+01
C(month)[T.4] 2.981669e+01
C(month)[T.5] 2.377399e+01
C(month)[T.6] 3.144761e+01
C(month)[T.7] 2.985957e+01
C(month)[T.8] 3.177678e+01
C(month)[T.9] 4.114404e+01
C(month)[T.10] 3.385812e+01
C(month)[T.11] 2.926375e+01
C(month)[T.12] 1.201812e+01
C(bank_holiday_upcoming)[T.1] -6.556388e+00
C(bank_holiday_upcoming)[T.2] -2.196862e+01
C(wind_direction_category)[T.4] -4.275858e-01
C(wind_direction_category)[T.5] -2.108426e+00
C(wind_direction_category)[T.6] -1.402545e+00
C(wind_direction_category)[T.7] -5.134967e+00
C(month)[T.2]:C(bank_holiday_upcoming)[T.1] -1.294535e-14
C(month)[T.3]:C(bank_holiday_upcoming)[T.1] -6.662056e+01
C(month)[T.4]:C(bank_holiday_upcoming)[T.1] -3.063782e+01
C(month)[T.5]:C(bank_holiday_upcoming)[T.1] -2.474670e+01
C(month)[T.6]:C(bank_holiday_upcoming)[T.1] -2.053468e+01
C(month)[T.7]:C(bank_holiday_upcoming)[T.1] 2.301371e-13
C(month)[T.8]:C(bank_holiday_upcoming)[T.1] 5.302206e-14
C(month)[T.9]:C(bank_holiday_upcoming)[T.1] -6.584540e-14
C(month)[T.10]:C(bank_holiday_upcoming)[T.1] 1.238800e+01
C(month)[T.11]:C(bank_holiday_upcoming)[T.1] 6.830759e-14
C(month)[T.12]:C(bank_holiday_upcoming)[T.1] 1.235954e+02
C(month)[T.2]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.3]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.4]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.5]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.6]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.7]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.8]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.9]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.10]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.11]:C(bank_holiday_upcoming)[T.2] 0.000000e+00
C(month)[T.12]:C(bank_holiday_upcoming)[T.2] -2.196862e+01
school_holidays -3.892149e+01
rain -1.496713e+00
wind_speed -2.080651e-01
temperature 3.801379e-01
global_solar_radiation 1.686657e-02
Length: 51, dtype: float64

@t-pfaff
t-pfaff commented Jun 26, 2014

The first 50 observations below, read as dataframe and then run:

for sign in ['+','*']:
    model = 'sales ~ school_holidays + C(weekday) + C(month)'+sign+'C(bank_holiday_upcoming) + rain + wind_speed + C(wind_direction_category) + temperature + global_solar_radiation'
    est = smf.ols(formula = model, hasconst=True, data=df).fit()
        if sign == '+':
            print 'R2 adj. no interaction:', est.rsquared, est.model.k_constant
        else:
            print 'R2 adj. w/ interaction:', est.rsquared, est.model.k_constant

,date,sales,branch_coded,product_category,rain,global_solar_radiation,temperature,wind_speed,wind_direction_category,school_holidays,bank_holiday_upcoming,weekday,month
0,2011-01-02,210.0,C,snacks,0.8,104.25591666666668,1.1804166666666671,10.42383333333333,6,1,0,6,1
1,2011-01-03,232.0,C,snacks,0.6,38.58089285714284,-0.5233928571428569,9.903749999999997,6,1,0,0,1
2,2011-01-04,267.0,C,snacks,0.0,65.47017857142858,0.13791666666666685,12.267202380952385,6,1,0,1,1
3,2011-01-05,340.5,C,snacks,1.5,81.59625000000003,-2.7635119047619066,13.821011904761908,4,1,0,2,1
4,2011-01-06,198.0,C,snacks,9.3,23.895357142857147,3.7742261904761896,13.072083333333328,5,0,0,3,1
5,2011-01-07,274.0,C,snacks,4.6,26.646785714285702,3.2508928571428566,7.230416666666662,4,0,0,4,1
6,2011-01-08,264.0,C,snacks,6.2,47.101805555555536,10.081458333333336,20.044236111111115,6,0,0,5,1
7,2011-01-09,276.5,C,snacks,0.0,77.52125000000002,4.074750000000001,16.74625,6,0,0,6,1
8,2011-01-10,265.0,C,snacks,0.0,85.83583333333333,1.8101190476190487,11.593214285714279,5,0,0,0,1
9,2011-01-11,353.0,C,snacks,0.7,86.23107142857141,2.3351190476190467,14.314583333333335,5,0,0,1,1
10,2011-01-12,293.0,C,snacks,12.8,33.032857142857125,3.4066666666666663,13.52458333333333,6,0,0,2,1
11,2011-01-13,266.0,C,snacks,10.0,22.87511904761905,8.733511904761901,14.246011904761904,6,0,0,3,1
12,2011-01-14,237.5,C,snacks,8.2,16.29505952380952,9.618452380952396,20.60017857142858,6,0,0,4,1
13,2011-01-15,412.0,C,snacks,0.0,33.05861111111114,8.189305555555556,18.903263888888887,6,0,0,5,1
14,2011-01-16,318.5,C,snacks,0.0,105.11983333333336,9.864333333333333,18.15875000000001,6,0,0,6,1
15,2011-01-17,267.0,C,snacks,2.8,49.73470238095235,7.957916666666662,11.485416666666675,6,0,0,0,1
16,2011-01-18,228.5,C,snacks,3.7,26.407559523809525,4.500357142857141,6.357440476190478,5,0,0,1,1
17,2011-01-19,270.0,C,snacks,4.6,37.17071428571428,2.4166071428571425,7.560654761904765,7,0,0,2,1
18,2011-01-20,373.0,C,snacks,0.0,36.11267857142858,0.29053571428571423,8.183452380952383,5,0,0,3,1
19,2011-01-21,244.5,C,snacks,2.1,36.933571428571426,-1.1698809523809526,7.789345238095237,7,0,0,4,1
20,2011-01-22,262.0,C,snacks,2.1,33.70597222222221,1.3907638888888894,6.318680555555552,6,0,0,5,1
21,2011-01-23,281.0,C,snacks,0.4,34.248916666666666,3.4540833333333323,7.616000000000001,7,0,0,6,1
22,2011-01-24,248.5,C,snacks,5.0,27.61772455089817,3.0025748502994034,7.366347305389218,7,0,0,0,1
23,2011-01-25,248.5,C,snacks,1.9,62.67363095238093,3.1476190476190475,15.408928571428586,7,0,0,1,1
24,2011-01-26,303.0,C,snacks,0.0,60.42630952380952,1.1420238095238109,8.828869047619051,4,0,0,2,1
25,2011-01-27,237.0,C,snacks,0.0,66.37928571428576,-1.4107142857142867,12.79648809523809,3,0,0,3,1
26,2011-01-28,273.5,C,snacks,0.0,131.88958333333338,-2.3451785714285704,11.911488095238091,3,0,0,4,1
27,2011-01-29,300.0,C,snacks,0.0,120.4815277777778,-3.965694444444444,9.412430555555552,3,1,0,5,1
28,2011-01-30,292.0,C,snacks,0.0,60.79808333333333,-1.6692500000000017,7.313500000000001,4,1,0,6,1
29,2011-01-31,247.0,C,snacks,0.0,37.22166666666666,-2.083333333333333,4.088511904761904,6,1,0,0,1
30,2011-02-01,249.0,C,snacks,1.5,41.8797619047619,-4.092440476190474,16.058571428571426,6,1,0,1,2
31,2011-02-02,254.0,C,snacks,0.8,34.432916666666664,0.8997619047619049,20.05113095238097,6,0,0,2,2
32,2011-02-03,553.0,C,snacks,2.3,60.94982142857138,4.0494047619047615,15.771726190476189,6,0,0,3,2
33,2011-02-04,404.5,C,snacks,0.5,17.058154761904756,6.643809523809521,27.699940476190488,6,0,0,4,2
34,2011-02-05,278.0,C,snacks,1.4,26.34548611111111,10.119652777777787,34.00375,6,0,0,5,2
35,2011-02-06,266.0,C,snacks,0.1,46.01866666666666,8.967166666666667,26.13041666666667,6,0,0,6,2
36,2011-02-07,264.0,C,snacks,0.0,161.21107142857144,9.021845238095235,22.857976190476204,6,0,0,0,2
37,2011-02-08,264.5,C,snacks,0.0,155.39041666666677,6.061607142857143,12.638630952380948,7,0,0,1,2
38,2011-02-09,290.5,C,snacks,0.0,164.73910714285722,4.5698214285714265,6.951309523809522,4,0,0,2,2
39,2011-02-10,304.5,C,snacks,10.0,71.87726190476185,6.453214285714287,16.744761904761912,6,0,0,3,2
40,2011-02-11,347.0,C,snacks,1.2,34.12898809523808,5.816011904761907,5.912023809523809,7,0,0,4,2
41,2011-02-12,326.0,C,snacks,8.3,36.26229166666664,0.08486111111111108,12.83368055555556,3,0,0,5,2
42,2011-02-13,315.0,C,snacks,0.0,132.51949999999994,5.953416666666668,6.091250000000001,5,0,0,6,2
43,2011-02-14,258.0,C,snacks,3.9,162.84869047619043,5.477440476190479,10.252857142857145,4,0,0,0,2
44,2011-02-15,243.0,C,snacks,0.8,83.19101190476191,5.384702380952383,9.980892857142852,5,0,0,1,2
45,2011-02-16,286.5,C,snacks,0.0,140.20702380952392,7.008273809523812,9.452678571428574,4,0,0,2,2
46,2011-02-17,340.5,C,snacks,0.0,136.8748809523811,-0.0059523809523808644,15.235773809523812,3,0,0,3,2
47,2011-02-18,258.0,C,snacks,0.0,40.94494047619047,-0.6382142857142854,11.752023809523811,3,0,0,4,2
48,2011-02-21,251.0,C,snacks,0.0,250.7772023809523,-3.1190476190476177,12.655238095238097,3,0,0,0,2
49,2011-02-22,287.5,C,snacks,0.0,255.18964285714284,-3.0063095238095245,9.898154761904769,3,0,0,1,2
50,2011-02-23,256.0,C,snacks,0.0,146.8395833333333,-0.19791666666666652,11.755357142857141,4,0,0,2,2

@josef-pkt
Member

OK, there is still something wrong. Given that the intercept is in the parameters, we should have detected the constant. (I'm adding bug instead of enhancement label)

Also it's an example where I guess there are many zero columns in the interaction, where there are no observations for the specific combination. From what I checked before, that should still be correct.

@t-pfaff If this is public data, then I think it would make a good test case.

My guess is now that hasconst detection get's confused by the columns of zeros. If that's the case, then I can come up with an example to replicate.

@t-pfaff
t-pfaff commented Jun 26, 2014

@josef-pkt No, it's not public data.

@jseabold
Member

Then do you need to remove it? This is public. OTOH, if it's ok to post a snippet like this, it's fine to include in the test suite.

@t-pfaff
t-pfaff commented Jun 26, 2014

Oh, the 50 obs. are OK to be public. The whole dataset wouldn't be public.

@josef-pkt
Member

got it.
There is something wrong AFAICS

The code currently should raise an exception, because it doesn't allow several constants (columns with var = 0 )
only the exception is caught right away, and k_constant is set to zero

in _handle_constant

try:  # to detect where the constant is
                const_idx = np.where(self.exog.var(axis=0) == 0)[0].squeeze()
                self.k_constant = const_idx.size
                if self.k_constant > 1:
                    raise ValueError("More than one constant detected.")
                else:
                    self.const_idx = const_idx
            except:  # should be an index error but who knows, means no const
                self.const_idx = None
                self.k_constant = 0

@jseabold you want to fix it? or I look into it later today

@josef-pkt josef-pkt added the prio-high label Jun 26, 2014
@josef-pkt
Member

I have a "baby" example somewhere in an issue, where I checked how OLS behaves if there are no observations in a category that could be used as simple test case.

@josef-pkt
Member

@t-pfaff Can you check one of your interaction examples between statsmodels and Stata to see if the other numbers are the same, loglike, standard error, ....

I don't think we have a test case yet for this case with zero columns, I just found out recently that patsy is not providing a "shortened" design matrix.

@jseabold
Member

Earliest I'll be able to look is tomorrow. I'm hoping to see about preparing a 0.5.1 then/this weekend too.

@josef-pkt
Member

Ok, I start to prepare a PR

changes:

  • use np.ptp instead of var (I had recently a case where var == an eps, ptp==0)
  • don't raise on multiple "constant" columns, take const_idx if column mean == 1 or otherwise first != 0
@t-pfaff
t-pfaff commented Jun 26, 2014

loglikehood and standard errors are OK.

@josef-pkt
Member

thanks @t-pfaff
I'm not able to replicate this yet, so there might be something else going on.
I might have to ping you later if I don't manage.

@josef-pkt
Member

@t-pfaff which version of statsmodels are you using?

@t-pfaff
t-pfaff commented Jun 26, 2014

0.5.0

@jseabold
Member

I'm also not sure I saw a problem using recent dev version but am not certain. One reason I said I'd have to look later.

@josef-pkt
Member

The _handle_constant code is obviously wrong, however it's not used anymore. !?
(my recent surprise that hasconst in anova_lm is fixed in master, which I didn't try to track down. #1213 #1770)

>>> res2.model.k_constant
1.0
>>> res2.model.data.k_constant
0

simple example

import numpy as np
from statsmodels.regression.linear_model import OLS

y = np.random.randn(20) 
x = np.column_stack((np.ones(20), np.zeros(20)))
res = OLS(y, x).fit()

print res.model.k_constant
print res.model.data.const_idx is None

res2 = OLS.from_formula('y ~ x1 + x2', data={'x1': np.zeros(20), 'x2': 0.1*np.ones(20)}).fit()
print(res2.model.k_constant)
print(res2.model.data.k_constant)
print(res2.model.data.const_idx is None)
print res2.summary()
@jseabold
Member

On Thu, Jun 26, 2014 at 11:10 AM, Josef Perktold notifications@github.com
wrote:

The _handle_constant code is obviously wrong, however it's not used
anymore. !?
(my recent surprise that hasconst in anova_lm is fixed in master, which I
didn't try to track down. #1213
#1213 #1770
#1770)

I put the hash of the fix in that issue. We need some DRY code?

res2.model.k_constant
1.0
res2.model.data.k_constant
0

simple example

import numpy as np
from statsmodels.regression.linear_model import OLS

y = np.random.randn(20)
x = np.column_stack((np.ones(20), np.zeros(20)))
res = OLS(y, x).fit()

print res.model.k_constant
print res.model.data.const_idx is None

res2 = OLS.from_formula('y ~ x1 + x2', data={'x1': np.zeros(20), 'x2': 0.1*np.ones(20)}).fit()
print(res2.model.k_constant)
print(res2.model.data.k_constant)
print(res2.model.data.const_idx is None)
print res2.summary()

Reply to this email directly or view it on GitHub
#1792 (comment)
.

@josef-pkt
Member

Thanks, I missed that comment.

I also just tracked down Kevin's check for implicit constant in RegressionModel._has_constant
It's only for RegressionModels, and substitutes for the base code.
So we should still have the problem of this issue for non-regression models.

@josef-pkt
Member

I think this is all fixed after merging #1797

@josef-pkt josef-pkt closed this Jul 29, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment