add_constant incorrectly detects constant column #1025

Closed
ajmarks opened this Issue Aug 7, 2013 · 6 comments

@ajmarks
ajmarks commented Aug 7, 2013

statsmodels/statsmodels/tools/tools.py:245 checks for columns with unit variance, not zero variance when looking for constant columns. Any z-scored data will, of course, have unit variance. The line should be
if np.any(data.var(0) == 0):
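
To make the report concrete, here is a minimal NumPy-only sketch. The helper names `unit_variance_check` and `zero_variance_check` are hypothetical stand-ins for the behavior described above, not the actual statsmodels source, and the unit-variance helper uses `np.isclose` rather than `==` to sidestep floating-point round-off.

```python
import numpy as np

def unit_variance_check(data):
    """Hypothetical stand-in for the reported check (variance equal to one)."""
    # np.isclose instead of == so round-off in the z-scoring does not matter.
    return bool(np.any(np.isclose(data.var(0), 1.0)))

def zero_variance_check(data):
    """The proposed check: a constant column has zero variance."""
    return bool(np.any(data.var(0) == 0))

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(200, 3))

# z-scored data: every column has unit variance, but none is constant.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# The same data with a genuine constant column prepended.
with_const = np.column_stack([np.ones(len(z)), z])

print(unit_variance_check(z))           # flags z-scored data as having a constant
print(zero_variance_check(z))           # does not
print(zero_variance_check(with_const))  # flags only the real constant column
```

Under these assumptions the unit-variance test fires on z-scored data that has no constant at all, while the zero-variance test fires only once a real constant column is present.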

@josef-pkt
Member

Thank you for reporting; that is clearly a bug.

How did you get this? Do you have an example that triggered this?

I don't find any code in statsmodels that uses this function based on a quick file content search.

@josef-pkt
Member

I see now that it's used through _pandas_add_constant.

@jseabold jseabold added a commit to jseabold/statsmodels that referenced this issue Aug 7, 2013
@jseabold jseabold TST: Add regression test for #1025. f2dd90c
@ajmarks
ajmarks commented Aug 7, 2013

Yeah, sm.add_constant(df_x) is a handy utility function. I was running regressions on normalized data using sm.add_constant((df_x-df_x.mean())/df_x.std()), which was giving me a puzzling error in my code, and I traced it back to here.

@josef-pkt
Member

@ajmarks Out of curiosity: what's the reason that you are z-scoring?

I have wondered several times whether we should offer it, or make standardized beta coefficients available in the results class, but so far I haven't seen a strong reason to do it. (Not strong enough to put it at the top of my priorities, even though I already have large parts of the code.)

@ajmarks
ajmarks commented Aug 7, 2013

When I'm running exploratory multiple regressions, I like to z-score everything so that I can quickly see how much of the variation in the endog comes from which exog. In particular, my current dataset has variables that vary hugely in scale (e.g. some on the order of 1e9 and some on the order of 1e2), so this saves me from having to look up each time whether a parameter of 5e-7 is tiny or huge relative to its variable. The ironic thing is that, since any linear regression with an intercept passes through [mean(y), ...mean(x_i)...]^t, the const regression parameter will be zero when it's run on z-scored data, but some of my other code assumes the constant is there.
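
The intercept claim can be checked with a small NumPy sketch. It assumes the response is z-scored (or at least centered) along with the regressors; the data, the seed, and the `zscore` helper here are made up for illustration.

```python
import numpy as np

def zscore(a):
    """Center each column and scale it to unit (population) variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

rng = np.random.default_rng(1)
n = 500

# Two regressors on wildly different scales, roughly 1e9 and 1e2.
X = rng.normal(size=(n, 2)) * np.array([1e9, 1e2])
y = 3.0 + X @ np.array([2e-9, 0.05]) + rng.normal(size=n)

Xz, yz = zscore(X), zscore(y)

# Ordinary least squares with an explicit constant column.
design = np.column_stack([np.ones(n), Xz])
params, *_ = np.linalg.lstsq(design, yz, rcond=None)
intercept, betas = params[0], params[1:]

print("intercept:", intercept)      # zero up to round-off
print("standardized betas:", betas)
```

Because every z-scored column has mean zero, the fitted plane passes through the origin, so the constant carries no information here even though downstream code may still expect the column to exist.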

It also helps when I want to plot a bunch of stuff next to each other. For dataframes, it's super easy to z-score:

import numpy as np

def zscore(df):
    # A constant column has zero std, so z-scoring it would divide by zero.
    if np.any(df.var(0) == 0):
        raise Exception("Ain't gonna work")
    return (df - df.mean()) / df.std()
@josef-pkt
Member

Thanks for the explanation, I'm always interested in hearing different approaches and practices for this (which vary widely across fields).

Ok, that sounds like a case where standardized beta coefficients in the results would be useful. Although, if z-scoring is easy and you don't need the original and the standardized params at the same time, then it's not really important to have it.

Aside: standardized beta coefficients show "how much of the variation in the endog comes from which [exog]" only if the exog/design matrix is orthogonal; otherwise, correlation between exogs prevents an exact variance decomposition.
(I once looked at some alternative measures of each variable's contribution, such as the reduction in r_squared when any one variable is left out of the regression. IIRC that was the alternative proposed in an article that criticized standardized beta coefficients as indicators for the variance decomposition.)
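
A rough NumPy sketch of that aside, on synthetic data: with exactly orthogonal z-scored exogs the squared standardized betas sum to R^2 and match the drop-one reduction in R^2, while with correlated exogs they do not. The helpers `fit` and `drop_one_r2_reduction` are illustrative names, not statsmodels functions.

```python
import numpy as np

def zscore(a):
    """Center each column and scale it to unit (population) variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def fit(X, y):
    """OLS with an intercept; returns (slopes, r_squared)."""
    design = np.column_stack([np.ones(len(y)), X])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ params
    centered = y - y.mean()
    return params[1:], 1.0 - (resid @ resid) / (centered @ centered)

def drop_one_r2_reduction(X, y):
    """Reduction in R^2 when each column is left out in turn."""
    _, full = fit(X, y)
    return np.array([full - fit(np.delete(X, j, axis=1), y)[1]
                     for j in range(X.shape[1])])

rng = np.random.default_rng(2)
n = 400

# QR of a centered matrix gives exactly orthogonal, mean-zero columns.
raw = rng.normal(size=(n, 2))
q, _ = np.linalg.qr(raw - raw.mean(axis=0))
X_orth = zscore(q)

# Mix the columns: correlation 0.8 between them, variance still one.
X_corr = np.column_stack([X_orth[:, 0],
                          0.8 * X_orth[:, 0] + 0.6 * X_orth[:, 1]])

noise = rng.normal(size=n)
results = {}
for name, X in [("orthogonal", X_orth), ("correlated", X_corr)]:
    y = zscore(X @ np.array([1.0, 1.0]) + noise)
    betas, r2 = fit(X, y)
    results[name] = (betas, r2, drop_one_r2_reduction(X, y))
    print(name, "sum of beta^2:", (betas ** 2).sum(), "R^2:", r2)
```

In the correlated case the gap between the sum of squared betas and R^2 is the cross term between the exogs, which is the part that cannot be assigned to any single variable.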

@jseabold jseabold closed this in 8a1e7b2 Aug 14, 2013
@jseabold jseabold added a commit that referenced this issue Aug 14, 2013
@jseabold jseabold MAINT: Merge branch 'maintenance/0.5.x'. Merge to close issues.
* maintenance/0.5.x:
  BUG: fix warning arguments in GenericLikelihoodModel
  MAINT: Add name to .mailmap.
  ENH: Pandas Series no longer inherits from ndarray. Closes #1036.
  TST: Fixed test for Anaconda on Windows
  TST: Make test compatible with pandas 0.7.x
  BUG: Fail gracefully when not enough obs given for order.
  BUG: Handle non-string names in lag name making.
  TST: Test for issue 1038.
  DOC: fix docstrings so Latex finishes
  REF: add shapes to Transf_gen, small cleanups
  add explicit shapes to TestTransf2: fixes the fail w/scipy PR/2588
  BUG: acorr_breush_godfrey fix nlags choice closes #676
  BUG: return only yfitted if return_sorted is False closes #922
  TST: Add regression test for #1025.
  BUG: Check for 0 variance not unit. Closes #1025.
  BUG: robust.norms.TrimmedMean fix typos in psi_deriv closes #425
  ENH: Bump Python and NumPy versions. Remove 2.5 only code.
72528f4
@PierreBdR PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@jseabold jseabold BUG: Check for 0 variance not unit. Closes #1025. 1adfdf0
@PierreBdR PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@jseabold jseabold TST: Add regression test for #1025. aa7b9ba
@PierreBdR PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@jseabold jseabold MAINT: Merge branch 'maintenance/0.5.x'. Merge to close issues.
5d36f4f
@yarikoptic yarikoptic added a commit to yarikoptic/statsmodels that referenced this issue Oct 23, 2014
@yarikoptic yarikoptic Merge tag 'v0.5.0' into debian
Version 0.5.0

* tag 'v0.5.0':
  DOC: Update release notes with maint branch changes.
  MAINT: Fix mailmap entry.
  BUG: fix warning arguments in GenericLikelihoodModel
  MAINT: Add name to .mailmap.
  ENH: Pandas Series no longer inherits from ndarray. Closes #1036.
  TST: Fixed test for Anaconda on Windows
  TST: Make test compatible with pandas 0.7.x
  BUG: Fail gracefully when not enough obs given for order.
  BUG: Handle non-string names in lag name making.
  TST: Test for issue 1038.
  DOC: fix docstrings so Latex finishes
  REF: add shapes to Transf_gen, small cleanups
  add explicit shapes to TestTransf2: fixes the fail w/scipy PR/2588
  BUG: acorr_breush_godfrey fix nlags choice closes #676
  BUG: return only yfitted if return_sorted is False closes #922
  TST: Add regression test for #1025.
  BUG: Check for 0 variance not unit. Closes #1025.
  BUG: robust.norms.TrimmedMean fix typos in psi_deriv closes #425
  ENH: Bump Python and NumPy versions. Remove 2.5 only code.
e9c3418