remove_data= option does not appear to remove data fully #7494

Open
mmacpherson opened this issue Jun 9, 2021 · 7 comments
@mmacpherson

Describe the bug

In linear/logistic models fit via the formula interface, calling .save() with remove_data=True appears to remove some but not all instances of length-nobs structures from the fitted model object.

This seems at odds with the documentation of the remove_data option, which suggests that all such structures would be removed.

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def invlogit(s):
    return np.exp(s) / (1 + np.exp(s))

n = 1000000 # use large nobs to make size differences obvious
df = pd.DataFrame(
    dict(x1=np.random.normal(size=n),
         x2=np.random.normal(size=n),
         e=np.random.normal(size=n),
        )
).assign(y=lambda f: (invlogit(1 + 2 * f.x1 - f.x2 + f.e) > 0.5).astype(int))

model = smf.logit("y ~ x1 * x2", data=df).fit()

model.save("model.sm", remove_data=False)
model.save("model-nodata.sm", remove_data=True)

model._results.model.data.frame = model._results.model.data.frame.iloc[0:0]
model._results.model.data.orig_exog = model._results.model.data.orig_exog.iloc[0:0]
model._results.model.data.orig_endog = model._results.model.data.orig_endog.iloc[0:0]
model._results.model.wexog = None
model._results.model.wendog = None

model.save("model-nodata-plus.sm", remove_data=True)

!ls -atlr model*.sm

Yields:

-rw------- 1 root root 120003651 Jun  9 16:03 model.sm
-rw------- 1 root root  80003591 Jun  9 16:03 model-nodata.sm
-rw------- 1 root root      3578 Jun  9 16:03 model-nodata-plus.sm

Calling save with remove_data=True does reduce the size of the serialized model, but some copies of the data/design matrix evidently remain. "Manually" removing the additional structures as above appears to get all copies of the data out of the model. (I confirmed that prediction still works from the deserialized model, at least for the simple formula given above.)

Expected Output

Based on the documentation of the remove_data option (ref), I would have expected all instances of these nobs-length arrays/dataframes to be removed from the model.

Remove data arrays, all nobs arrays from result and model.

Two options come to mind: (1) if it is not the intent of the remove_data option to fully remove all length-nobs structures from the fitted model object, the documentation could be revised to reflect this, or (2) remove_data could be revised so that it fully removes all length-nobs structures from the object.

Output of import statsmodels.api as sm; sm.show_versions()

INSTALLED VERSIONS
------------------
Python: 3.6.12.final.0
OS: Linux 5.10.25-linuxkit #1 SMP Tue Mar 23 09:27:39 UTC 2021 x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
statsmodels
===========
Installed: 0.12.2 (/venv/lib/python3.6/site-packages/statsmodels)
Required Dependencies
=====================
cython: Not installed
numpy: 1.19.0 (/venv/lib/python3.6/site-packages/numpy)
scipy: 1.5.4 (/venv/lib/python3.6/site-packages/scipy)
pandas: 1.1.0 (/venv/lib/python3.6/site-packages/pandas)
    dateutil: 2.8.1 (/venv/lib/python3.6/site-packages/dateutil)
patsy: 0.5.1 (/venv/lib/python3.6/site-packages/patsy)
Optional Dependencies
=====================
matplotlib: 3.3.4 (/venv/lib/python3.6/site-packages/matplotlib)
    backend: module://ipykernel.pylab.backend_inline 
cvxopt: Not installed
joblib: 1.0.1 (/venv/lib/python3.6/site-packages/joblib)
Developer Tools
================
IPython: 7.16.1 (/venv/lib/python3.6/site-packages/IPython)
    jinja2: 2.11.3 (/venv/lib/python3.6/site-packages/jinja2)
sphinx: 4.0.2 (/venv/lib/python3.6/site-packages/sphinx)
    pygments: 2.8.1 (/venv/lib/python3.6/site-packages/pygments)
pytest: 6.2.3 (/venv/lib/python3.6/site-packages/pytest)
virtualenv: Not installed
@josef-pkt
Member

josef-pkt commented Jun 9, 2021

Thanks for reporting.
If not all nobs arrays are removed, then it's a bug.

However, it's intentional that we don't remove model.data.frame when formulas are used. We need it to rebuild the formula information because formula information cannot be pickled.

Did you try to load the saved pickle and predict with model.data.frame removed?

AFAIR, we don't need orig_exog orig_endog, so those should be removed also in the formula case.

related:
For the remove_data function, but not for save, we could add an option to remove all pandas objects, but then pickling with formulas will not work. That would only be for the use case where the estimated model stays in memory.

#6858 for an alternative to patsy, which is not connected to statsmodels yet.
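
For reference, here is a quick way to check which nobs-length arrays survive on a formula-fit results object after calling remove_data() directly. The nobs_sized_attrs helper is an ad-hoc illustration written for this issue, not part of statsmodels, and it continues from the reproduction code above (model, n):

def nobs_sized_attrs(obj, nobs, label):
    # report attributes whose first dimension still has length nobs
    hits = []
    for name, value in vars(obj).items():
        shape = getattr(value, "shape", None)
        if shape and shape[0] == nobs:
            hits.append(f"{label}.{name}")
    return hits

model._results.remove_data()
res = model._results
print(nobs_sized_attrs(res, n, "results")
      + nobs_sized_attrs(res.model, n, "model")
      + nobs_sized_attrs(res.model.data, n, "model.data"))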

@mmacpherson
Author

Thank you in turn for the speedy reply.

Did you try to load the saved pickle and predict with model.data.frame removed?

Yes, I should have included that code. Here it is:

sm.load("model-nodata-plus.sm").predict({"x1": 0, "x2": 0})

which yields a pd.Series as the response, which at least suggests that the formula information has been pickled?

0    0.85621
dtype: float64

Two thoughts here:

  1. I haven't completely removed model.data.frame, only the data rows from it, i.e. the column metadata is left intact. Prediction does not appear to work if I del the frame or set it to None.
  2. I'm using a small subset of the Wilkinson formula language, and for full support the frame may be required. Perhaps the full data matrix would be required for e.g. splines or stateful transforms?

@josef-pkt
Member

AFAIK, when we have categorical variables in the formula, patsy needs the full original dataset to rebuild the category levels and encoding. The same holds for stateful transforms.
In those cases your approach should fail to rebuild the model when unpickling.
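
To illustrate that failure mode, here is a small self-contained sketch. The column name g, the formula, and the expectation that prediction fails are assumptions based on the explanation above, not verified against a particular statsmodels version:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

n = 1000
df = pd.DataFrame({
    "x1": np.random.normal(size=n),
    "g": np.random.choice(["a", "b", "c"], size=n),
})
df["y"] = (df.x1 + (df.g == "a") + np.random.normal(size=n) > 0).astype(int)

cat_model = smf.logit("y ~ x1 + C(g)", data=df).fit()

# empty the frame but keep the column structure, as in the workaround above
cat_model._results.model.data.frame = cat_model._results.model.data.frame.iloc[0:0]
cat_model.save("cat-model-nodata-plus.sm", remove_data=True)

# expected (per the explanation above) to fail, because patsy can no longer
# rebuild the category levels for C(g) from the emptied frame
try:
    print(sm.load("cat-model-nodata-plus.sm").predict({"x1": 0.0, "g": "b"}))
except Exception as exc:
    print(type(exc).__name__, exc)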

@RogerHar

I'm having the same issue but I'm not using patsy, just sm.OLS directly.

Under statsmodels version 0.12.2, remove_data=True only reduces the size of the pickle file by around a half. Removing wendog and wexog 'by hand' in addition reduces it to a couple of kB, so it appears that remove_data is failing to remove wendog and wexog.

Under statsmodels version 0.11.0, remove_data=True appears to successfully remove wendog and wexog by itself. So this appears to be a new issue.

import numpy as np
import statsmodels.api as sm
sm.__version__

n = 1000000
x = np.linspace(0, 10, n)
y = x + np.random.normal(size=n)
fit = sm.OLS(y, x).fit()

fit.save("fit_sm0.12.2", remove_data=False)
fit.save("fit_sm0.12.2_remove_data", remove_data=True)
fit.model.wendog = None
fit.model.wexog = None
fit.save("fit_sm0.12.2_remove_weogs")
!ls -l fit_*

Yields:

'0.12.2'
-rw-r--r--. 1 rogerhar users 40002361 Jul 12 14:37 fit_sm0.12.2
-rw-r--r--. 1 rogerhar users 16002230 Jul 12 13:45 fit_sm0.12.2_remove_data
-rw-r--r--. 1 rogerhar users     2140 Jul 12 13:48 fit_sm0.12.2_remove_weogs

The same under statsmodels v0.11.0 gives 2133 bytes with remove_data=True alone, i.e. no need to also remove wendog and wexog 'by hand'.

(I'm afraid I don't have sufficient knowledge of statsmodels or GitHub to find or investigate the relevant changeset myself.)
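
One quick way to confirm which attributes account for the remaining file size is to pickle the candidates individually. This is an ad-hoc snippet for this issue, not a statsmodels API, and the list of candidate attribute names is a guess; it continues from the script above, after the remove_data=True save and before nulling anything by hand:

import pickle

for name in ["endog", "exog", "wendog", "wexog", "pinv_wexog"]:
    value = getattr(fit.model, name, None)
    if value is not None:
        print(name, len(pickle.dumps(value)))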

@josef-pkt josef-pkt added this to the 0.13 milestone Jul 12, 2021
@josef-pkt
Member

@RogerHar Thanks for reporting

I'll try to check this before the next release, which will be soon (maybe around 2 months or less).
It's implemented in a way that is a bit tricky because the keys for what to remove are added at different levels of the class hierarchy.
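
Just to sketch what that means (class and attribute names below are illustrative, not the actual statsmodels implementation): each results class registers the data attributes it owns in a list, remove_data wipes whatever is in that list, and an attribute a subclass forgets to register quietly survives.

import numpy as np


class Model:
    def __init__(self, endog, exog):
        self.endog, self.exog = endog, exog
        self.wendog, self.wexog = endog.copy(), exog.copy()


class BaseResults:
    def __init__(self, model):
        self.model = model
        # names of data attributes to wipe; subclasses extend this list
        self._data_attr = ["model.endog", "model.exog"]

    def remove_data(self):
        # wipe every registered attribute; anything not registered survives
        for name in self._data_attr:
            owner, attr = self, name
            while "." in attr:
                head, attr = attr.split(".", 1)
                owner = getattr(owner, head)
            setattr(owner, attr, None)


class RegressionResults(BaseResults):
    def __init__(self, model):
        super().__init__(model)
        # if a subclass forgets to extend the list here (e.g. omitting
        # "model.wendog"/"model.wexog"), those arrays stay in the pickle
        self._data_attr.extend(["model.wexog", "model.wendog"])


x = np.arange(5.0)
res = RegressionResults(Model(x, x))
res.remove_data()
print(res.model.endog, res.model.wendog)  # both None once registered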

@josef-pkt
Member

OLS/WLS part fixed in #7595
