dtype bug object arrays (raises in clustered standard errors code) #1875

Closed
kcarlson00 opened this Issue Aug 8, 2014 · 11 comments


@kcarlson00

I ran into this bug while trying to get clustered standard errors on a dataset with >=500k rows. I also propose a fix.

Here is the error, where bf is a RegressionResults instance estimated from data=dat:

In [22]: cov = bf.get_robustcov_results(cov_type='cluster', groups=dat['fips'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-01bf5ce21d9d> in <module>()
----> 1 cov = bf.get_robustcov_results(cov_type='cluster', groups=dat['fips'].astype(np.uint32))

/Users/a1kic01/Projects/statsmodels/statsmodels/regression/linear_model.pyc in get_robustcov_results(self, cov_type, use_t, **kwds)
   1793                     self.n_groups = n_groups = len(np.unique(groups))
   1794                 res.cov_params_default = sw.cov_cluster(self, groups,
-> 1795                                                  use_correction=use_correction)
   1796
   1797             elif groups.ndim == 2:

/Users/a1kic01/Projects/statsmodels/statsmodels/stats/sandwich_covariance.py in cov_cluster(results, group, use_correction)
    531         clusters = np.unique(group)
    532
--> 533     scale = S_crosssection(xu, group)
    534
    535     nobs, k_params = xu.shape

/Users/a1kic01/Projects/statsmodels/statsmodels/stats/sandwich_covariance.py in S_crosssection(x, group)
    485
    486     '''
--> 487     x_group_sums = group_sums(x, group).T  #TODO: why transposed
    488
    489     return S_white_simple(x_group_sums)

/Users/a1kic01/Projects/statsmodels/statsmodels/stats/sandwich_covariance.py in group_sums(x, group)
    431     #TODO: transpose return in group_sum, need test coverage first
    432     return np.array([np.bincount(group, weights=x[:, col])
--> 433                             for col in range(x.shape[1])])
    434
    435

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

The method runs successfully if the return line of the function _get_sandwich_arrays is changed from

return xu, hessian_inv

to

return xu.astype('float'), hessian_inv

The function was returning an array with dtype('O').
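As an interim workaround on the user side, a sketch with synthetic stand-in data (y, Z, and fips here play the roles of the endog, the regressor frame, and the cluster variable): casting the pandas objects to float64 before constructing the model avoids the object array entirely.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-ins for the endog, regressor frame, and cluster variable
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Z = pd.DataFrame({'dummy': [True, False, True, False, True, False],
                  'year': np.arange(2001, 2007, dtype=np.uint16)})
fips = np.array([1, 1, 1, 2, 2, 2], dtype=np.uint16)

# Casting to float64 before constructing the model avoids the object array
res = sm.OLS(np.asarray(y, dtype=np.float64),
             np.asarray(Z, dtype=np.float64)).fit()
cov = res.get_robustcov_results(cov_type='cluster', groups=fips)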

@josef-pkt
Member

Thanks for the report.

BTW: I'm currently adding robust covariances to all the fit methods. RegressionModel.fit already has the option in master; the other models should be merged within a few days (#1867, #1870).
(Usage is currently only in test cases.)

About dtype('O'):
Do you have a minimal example to replicate this?
I wonder whether this is a symptom rather than the cause. My guess is that if there is a dtype problem, it needs to be intercepted and cast to float earlier on.
What are the dtypes of results.resid, model.exog, and group?
Can you show a few lines and the dtypes of your data?

@kcarlson00

Sorry, no minimal example right now. It's a lot of data but IIRC I got it
to work with a smaller dataset before. Here are some results that should
answer your questions. It seems that you may be right about upstream dtype
problems.

Z is the dataframe of regressors. It has no constant but a full set of dummy variables. It comes from taking the dataframe dat, adding dummies for the variable fips (the cluster variable), and adding a quadratic function of time for each fips value (group-specific quadratic trends).

In [175]: Z.shape
Out[175]: (1264101, 203)

In [162]: bf
Out[162]: <statsmodels.regression.linear_model.RegressionResultsWrapper at
0x12a753dd0>

In [163]: bf.resid.dtype
Out[163]: dtype('O')

In [164]: b?

In [165]: b.exog.dtype
Out[165]: dtype('O')

In [166]: dat['fips'].dtype
Out[166]: dtype('uint16')

In [167]: b
Out[167]: <statsmodels.regression.linear_model.OLS at 0x12a7a4d50>


In [177]: print Z.dtypes.to_string()
meduc_cat_I2      bool
meduc_cat_I3      bool
meduc_cat_I4      bool
meduc_cat_I5      bool
mage_cat_I2       bool
mage_cat_I3       bool
mrace_cat_I2      bool
mrace_cat_I3      bool
mrace_cat_I4      bool
mrace_cat_I5      bool
tbo_cat_I2        bool
tbo_cat_I3        bool
tbo_cat_I4        bool
tbo_cat_I5        bool
tbo_cat_I6        bool
year            uint16
calmo            uint8
36001             bool
36003             bool
36005             bool
36007             bool
36009             bool
36011             bool
36013             bool
36015             bool
36017             bool
36019             bool
36021             bool
36023             bool
36025             bool
36027             bool
36029             bool
36031             bool
36033             bool
36035             bool
36037             bool
36039             bool
36041             bool
36043             bool
36045             bool
36047             bool
36049             bool
36051             bool
36053             bool
36055             bool
36057             bool
36059             bool
36061             bool
36063             bool
36065             bool
36067             bool
36069             bool
36071             bool
36073             bool
36075             bool
36077             bool
36079             bool
36081             bool
36083             bool
36085             bool
36087             bool
36089             bool
36091             bool
36093             bool
36095             bool
36097             bool
36099             bool
36101             bool
36103             bool
36105             bool
36107             bool
36109             bool
36111             bool
36113             bool
36115             bool
36117             bool
36119             bool
36121             bool
36123             bool
36001Xmn        uint16
36003Xmn        uint16
36005Xmn        uint16
36007Xmn        uint16
36009Xmn        uint16
36011Xmn        uint16
36013Xmn        uint16
36015Xmn        uint16
36017Xmn        uint16
36019Xmn        uint16
36021Xmn        uint16
36023Xmn        uint16
36025Xmn        uint16
36027Xmn        uint16
36029Xmn        uint16
36031Xmn        uint16
36033Xmn        uint16
36035Xmn        uint16
36037Xmn        uint16
36039Xmn        uint16
36041Xmn        uint16
36043Xmn        uint16
36045Xmn        uint16
36047Xmn        uint16
36049Xmn        uint16
36051Xmn        uint16
36053Xmn        uint16
36055Xmn        uint16
36057Xmn        uint16
36059Xmn        uint16
36061Xmn        uint16
36063Xmn        uint16
36065Xmn        uint16
36067Xmn        uint16
36069Xmn        uint16
36071Xmn        uint16
36073Xmn        uint16
36075Xmn        uint16
36077Xmn        uint16
36079Xmn        uint16
36081Xmn        uint16
36083Xmn        uint16
36085Xmn        uint16
36087Xmn        uint16
36089Xmn        uint16
36091Xmn        uint16
36093Xmn        uint16
36095Xmn        uint16
36097Xmn        uint16
36099Xmn        uint16
36101Xmn        uint16
36103Xmn        uint16
36105Xmn        uint16
36107Xmn        uint16
36109Xmn        uint16
36111Xmn        uint16
36113Xmn        uint16
36115Xmn        uint16
36117Xmn        uint16
36119Xmn        uint16
36121Xmn        uint16
36123Xmn        uint16
36001Xmn2       uint16
36003Xmn2       uint16
36005Xmn2       uint16
36007Xmn2       uint16
36009Xmn2       uint16
36011Xmn2       uint16
36013Xmn2       uint16
36015Xmn2       uint16
36017Xmn2       uint16
36019Xmn2       uint16
36021Xmn2       uint16
36023Xmn2       uint16
36025Xmn2       uint16
36027Xmn2       uint16
36029Xmn2       uint16
36031Xmn2       uint16
36033Xmn2       uint16
36035Xmn2       uint16
36037Xmn2       uint16
36039Xmn2       uint16
36041Xmn2       uint16
36043Xmn2       uint16
36045Xmn2       uint16
36047Xmn2       uint16
36049Xmn2       uint16
36051Xmn2       uint16
36053Xmn2       uint16
36055Xmn2       uint16
36057Xmn2       uint16
36059Xmn2       uint16
36061Xmn2       uint16
36063Xmn2       uint16
36065Xmn2       uint16
36067Xmn2       uint16
36069Xmn2       uint16
36071Xmn2       uint16
36073Xmn2       uint16
36075Xmn2       uint16
36077Xmn2       uint16
36079Xmn2       uint16
36081Xmn2       uint16
36083Xmn2       uint16
36085Xmn2       uint16
36087Xmn2       uint16
36089Xmn2       uint16
36091Xmn2       uint16
36093Xmn2       uint16
36095Xmn2       uint16
36097Xmn2       uint16
36099Xmn2       uint16
36101Xmn2       uint16
36103Xmn2       uint16
36105Xmn2       uint16
36107Xmn2       uint16
36109Xmn2       uint16
36111Xmn2       uint16
36113Xmn2       uint16
36115Xmn2       uint16
36117Xmn2       uint16
36119Xmn2       uint16
36121Xmn2       uint16
36123Xmn2       uint16


@josef-pkt
Member

IIRC, pandas converts bool to object in asarray.
numpy converts bool to int and is happy to work with it.

The dtype information that you provided should be enough to come up with a simple example.
I'm still a bit surprised that the estimation works at all.
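For illustration, a minimal sketch with hypothetical column names: mixing bool and integer columns is enough to make pandas fall back to object dtype when the frame is converted to an ndarray, while numpy on its own happily treats bools as integers.

import numpy as np
import pandas as pd

# Hypothetical columns mimicking the dtypes reported above
Z = pd.DataFrame({'dummy': [True, False, True],
                  'year': np.array([2001, 2002, 2003], dtype=np.uint16)})

print(np.asarray(Z).dtype)             # dtype('O') -- what the model ends up with
print(np.asarray(Z['dummy']).dtype)    # bool
print(np.bincount([0, 0, 1], weights=np.asarray(Z['dummy'])))  # fine: bool -> float
# np.bincount([0, 0, 1], weights=np.asarray(Z)[:, 0])  # raises the TypeError above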

That's a good example to help decide on dtypes. Currently we don't enforce anything, which also means that we don't enforce any precision for the calculation and don't upcast (and use more memory).
The main open question that we have avoided so far is whether we should enforce a precision or leave the memory-precision trade-off to the user.

The main calculation in the OLS default is np.linalg.pinv, which is essentially an SVD. The precision used depends on the dtype of the data and on which dtypes are available in BLAS/LAPACK.
If the original exog is not in the right format, then calling the linalg routines might require casting and/or making a copy of the data.
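For illustration (not statsmodels code), np.linalg.pinv keeps the precision of its input, so float32 data goes through the single-precision LAPACK path:

import numpy as np

a32 = np.random.rand(50, 3).astype(np.float32)
print(np.linalg.pinv(a32).dtype)                     # float32
print(np.linalg.pinv(a32.astype(np.float64)).dtype)  # float64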

AFAIU, but never verified: if your endog/exog are float32, then we should do the calculation in single precision and not increase the memory by upcasting to float64.
However, now I think that's not true in current master, since we now make a copy for wexog (open issue).
Can you check the dtype of model.wexog?

And another check:
Can you compare the params and bse that you currently have with those you get when you cast your data to float64 before calling OLS?
If the results agree at rtol=1e-14 or so, then we have some implicit upcasting to float64 somewhere (done by numpy). If they don't agree at that tight rtol, then the calculations are most likely done in lower precision.
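A sketch of that check with synthetic float32 data (hypothetical names, not the reporter's dataset): fit once as-is and once after casting to float64, then compare.

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
X32 = rng.rand(1000, 3).astype(np.float32)
y32 = (X32.sum(axis=1) + 0.1 * rng.randn(1000)).astype(np.float32)

res32 = sm.OLS(y32, X32).fit()
res64 = sm.OLS(y32.astype(np.float64), X32.astype(np.float64)).fit()

# Agreement at rtol=1e-14 points to implicit upcasting to float64 somewhere;
# disagreement means the float32 fit really ran in lower precision.
print(np.allclose(res32.params, res64.params, rtol=1e-14))
print(np.allclose(res32.bse, res64.bse, rtol=1e-14))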

Related: we have several open issues on dtype and need to do something for 0.6, at least to avoid object arrays.

@josef-pkt josef-pkt added this to the 0.6 milestone Aug 8, 2014
@kcarlson00

model.wexog also had dtype object.

After casting to float64, the difference in params relative to the original results is shown below. The bool variables have fairly large differences in their coefficient estimates.

In [190]: diffb
Out[190]:
meduc_cat_I2 0.000001
meduc_cat_I3 0.000002
meduc_cat_I4 0.000001
meduc_cat_I5 0.000008
mage_cat_I2 -0.000001
mage_cat_I3 -0.000003
mrace_cat_I2 0.000001
mrace_cat_I3 0.000002
mrace_cat_I4 0.000007
mrace_cat_I5 0.000010
tbo_cat_I2 0.000001
tbo_cat_I3 0.000001
tbo_cat_I4 0.000002
tbo_cat_I5 0.000003
tbo_cat_I6 0.000005
...
36095Xmn2 -1.842190e-10
36097Xmn2 -2.034383e-10
36099Xmn2 3.590988e-10
36101Xmn2 1.394164e-10
36103Xmn2 3.712720e-11
36105Xmn2 1.599791e-10
36107Xmn2 1.054748e-10
36109Xmn2 1.648896e-10
36111Xmn2 -2.413664e-11
36113Xmn2 -3.466309e-10
36115Xmn2 1.909970e-11
36117Xmn2 1.592547e-10
36119Xmn2 -5.432136e-12
36121Xmn2 5.687027e-10
36123Xmn2 1.954272e-10

The difference in bse is :

meduc_cat_I3 -8.072165e-11
meduc_cat_I4 -5.989187e-11
meduc_cat_I5 3.399636e-10
mage_cat_I2 -1.088019e-13
mage_cat_I3 -1.596732e-10
mrace_cat_I2 -8.138534e-11
mrace_cat_I3 9.939183e-11
mrace_cat_I4 -3.335376e-11
mrace_cat_I5 1.064006e-09
tbo_cat_I2 -3.972178e-11
tbo_cat_I3 3.417489e-12
tbo_cat_I4 5.938428e-11
tbo_cat_I5 1.594356e-10
tbo_cat_I6 6.032685e-11
...
36095Xmn2 2.364396e-14
36097Xmn2 2.187335e-14
36099Xmn2 -3.104320e-14
36101Xmn2 -1.707559e-14
36103Xmn2 -1.593746e-14
36105Xmn2 -1.485975e-14
36107Xmn2 7.731554e-15
36109Xmn2 -1.917699e-14
36111Xmn2 4.415359e-15
36113Xmn2 3.554644e-14
36115Xmn2 -2.054292e-15
36117Xmn2 -1.839360e-14
36119Xmn2 2.379282e-15
36121Xmn2 -4.566703e-14
36123Xmn2 -1.441707e-14
Length: 203, dtype: float64


@josef-pkt
Member

I forgot to ask: which version of statsmodels and of pandas are you using?

It looks like with your versions there is no full upcasting to float64, and double precision is not used everywhere. This can be a feature or an undesired default behavior.

@kcarlson00

In [194]: statsmodels.__version__
Out[194]: '0.6.0.dev-2e806fc'

Pandas is 0.14.0


@josef-pkt
Member

I sent a message to the mailing list.

@kcarlson00 Are you ok with a change in statsmodels where we always upcast to float64 by default?

@kcarlson00

Seems reasonable to use the most precise calculation by default. Also it
should give results that exactly match Stata, which does all calculations
with double precision.


@josef-pkt
Member

Stata, which does all calculations with double precision.

Not in my experience: I ran into problems in Stata with float32 data (which didn't match float64 statsmodels results).
Now I have "always use double precision" in my Stata startup commands.

@kcarlson00

Interesting. I stated that based on this guide:
http://blog.stata.com/tag/precision/

It says the calculations sometimes use quad precision, but always at least double.


@josef-pkt josef-pkt changed the title from dtype bug in clustered standard errors code to dtype bug object arrays (raises in clustered standard errors code) Aug 10, 2014
@josef-pkt josef-pkt added the comp-base label Aug 10, 2014
@jseabold
Member

Closing as duplicate of #880.

@jseabold jseabold closed this Sep 26, 2014