remove_data= option does not appear to remove data fully #7494

Open
mmacpherson opened this issue Jun 9, 2021 · 7 comments
@mmacpherson

Describe the bug

In linear/logistic models fit via the formula interface, calling .save() with remove_data=True appears to remove some but not all instances of length-nobs structures from the fitted model object.

This seems at odds with the documentation of the remove_data option, which suggests that all such structures would be removed.

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def invlogit(s):
    return np.exp(s) / (1 + np.exp(s))

n = 1000000 # use large nobs to make size differences obvious
df = pd.DataFrame(
    dict(x1=np.random.normal(size=n),
         x2=np.random.normal(size=n),
         e=np.random.normal(size=n),
        )
).assign(y=lambda f: (invlogit(1 + 2 * f.x1 - f.x2 + f.e) > 0.5).astype(int))

model = smf.logit("y ~ x1 * x2", data=df).fit()

model.save("model.sm", remove_data=False)
model.save("model-nodata.sm", remove_data=True)

model._results.model.data.frame = model._results.model.data.frame.iloc[0:0]
model._results.model.data.orig_exog = model._results.model.data.orig_exog.iloc[0:0]
model._results.model.data.orig_endog = model._results.model.data.orig_endog.iloc[0:0]
model._results.model.wexog = None
model._results.model.wendog = None

model.save("model-nodata-plus.sm", remove_data=True)

!ls -atlr model*.sm

Yields:

-rw------- 1 root root 120003651 Jun  9 16:03 model.sm
-rw------- 1 root root  80003591 Jun  9 16:03 model-nodata.sm
-rw------- 1 root root      3578 Jun  9 16:03 model-nodata-plus.sm

Calling save with remove_data=True does reduce the size of the serialized model, but some copies of the data/design matrix evidently remain. "Manually" removing the additional structures as above appears to get all copies of the data out of the model. (I confirmed that prediction still works from the deserialized model, at least for the simple formula given above.)

Expected Output

Based on the documentation of the remove_data option (ref), I would have expected all instances of these nobs-length arrays/dataframes to be removed from the model.

Remove data arrays, all nobs arrays from result and model.

Two options come to mind: (1) if it is not the intent of the remove_data option to fully remove all length-nobs structures from the fitted model object, the documentation could be revised to reflect this, or (2) remove_data could be revised so that it fully removes all length-nobs structures from the object.

Output of import statsmodels.api as sm; sm.show_versions()

INSTALLED VERSIONS
------------------
Python: 3.6.12.final.0
OS: Linux 5.10.25-linuxkit #1 SMP Tue Mar 23 09:27:39 UTC 2021 x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
statsmodels
===========
Installed: 0.12.2 (/venv/lib/python3.6/site-packages/statsmodels)
Required Dependencies
=====================
cython: Not installed
numpy: 1.19.0 (/venv/lib/python3.6/site-packages/numpy)
scipy: 1.5.4 (/venv/lib/python3.6/site-packages/scipy)
pandas: 1.1.0 (/venv/lib/python3.6/site-packages/pandas)
    dateutil: 2.8.1 (/venv/lib/python3.6/site-packages/dateutil)
patsy: 0.5.1 (/venv/lib/python3.6/site-packages/patsy)
Optional Dependencies
=====================
matplotlib: 3.3.4 (/venv/lib/python3.6/site-packages/matplotlib)
    backend: module://ipykernel.pylab.backend_inline 
cvxopt: Not installed
joblib: 1.0.1 (/venv/lib/python3.6/site-packages/joblib)
Developer Tools
================
IPython: 7.16.1 (/venv/lib/python3.6/site-packages/IPython)
    jinja2: 2.11.3 (/venv/lib/python3.6/site-packages/jinja2)
sphinx: 4.0.2 (/venv/lib/python3.6/site-packages/sphinx)
    pygments: 2.8.1 (/venv/lib/python3.6/site-packages/pygments)
pytest: 6.2.3 (/venv/lib/python3.6/site-packages/pytest)
virtualenv: Not installed
@josef-pkt
Member

josef-pkt commented Jun 9, 2021

Thanks for reporting.
If not all nobs arrays are removed, then it's a bug.

However, it's intentional that we don't remove model.data.frame when formulas are used. We need it to rebuild the formula information because formula information cannot be pickled.

Did you try to load the saved pickle and predict with model.data.frame removed?

AFAIR, we don't need orig_exog orig_endog, so those should be removed also in the formula case.

related:
For the remove_data function, but not for save, we could add an option to remove all pandas objects, but then pickling with formulas will not work. That would only be for the use case where the estimated model stays in memory.

#6858 for an alternative to patsy, which is not connected to statsmodels yet.
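
For reference, here is a quick way to check which nobs-length arrays survive on a formula-fit results object after calling remove_data() directly. The nobs_sized_attrs helper is an ad-hoc illustration written for this issue, not part of statsmodels, and it continues from the reproduction code above (model, n):

def nobs_sized_attrs(obj, nobs, label):
    # report attributes whose first dimension still has length nobs
    hits = []
    for name, value in vars(obj).items():
        shape = getattr(value, "shape", None)
        if shape and shape[0] == nobs:
            hits.append(f"{label}.{name}")
    return hits

model._results.remove_data()
res = model._results
print(nobs_sized_attrs(res, n, "results")
      + nobs_sized_attrs(res.model, n, "model")
      + nobs_sized_attrs(res.model.data, n, "model.data"))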

@mmacpherson
Author

Thank you in turn for the speedy reply.

Did you try to load the saved pickle and predict with model.data.frame removed?

Yes, I should have included that code. Here it is:

sm.load("model-nodata-plus.sm").predict({"x1": 0, "x2": 0})

which yields a pd.Series as the response, which at least suggests that the formula information has been pickled?

0    0.85621
dtype: float64

Two thoughts here:

  1. I haven't completely removed model.data.frame, only the data rows from it, i.e. the column metadata is left intact. Prediction does not appear to work if I del the frame or set it to None.
  2. I'm using a small subset of the Wilkinson formula language, and for full support the frame may be required. Perhaps the full data matrix would be required for e.g. splines or stateful transforms?

@josef-pkt
Member

AFAIK, when we have categorical variables in the formula, patsy needs the full original dataset to rebuild the category levels and encoding. The same holds for stateful transforms.
In those cases your approach should fail to rebuild the model when unpickling.
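
To illustrate that failure mode, here is a small self-contained sketch. The column name g, the formula, and the expectation that prediction fails are assumptions based on the explanation above, not verified against a particular statsmodels version:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

n = 1000
df = pd.DataFrame({
    "x1": np.random.normal(size=n),
    "g": np.random.choice(["a", "b", "c"], size=n),
})
df["y"] = (df.x1 + (df.g == "a") + np.random.normal(size=n) > 0).astype(int)

cat_model = smf.logit("y ~ x1 + C(g)", data=df).fit()

# empty the frame but keep the column structure, as in the workaround above
cat_model._results.model.data.frame = cat_model._results.model.data.frame.iloc[0:0]
cat_model.save("cat-model-nodata-plus.sm", remove_data=True)

# expected (per the explanation above) to fail, because patsy can no longer
# rebuild the category levels for C(g) from the emptied frame
try:
    print(sm.load("cat-model-nodata-plus.sm").predict({"x1": 0.0, "g": "b"}))
except Exception as exc:
    print(type(exc).__name__, exc)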

@RogerHar

I'm having the same issue but I'm not using patsy, just sm.OLS directly.

Under statsmodels version 0.12.2, remove_data=True only reduces the size of the pickle file by around a half. Removing wendog and wexog 'by hand' in addition reduces it to a couple of kB, so it appears that remove_data is failing to remove wendog and wexog.

Under statsmodels version 0.11.0, remove_data=True appears to successfully remove wendog and wexog by itself. So this appears to be a new issue.

import numpy as np
import statsmodels.api as sm
sm.__version__

n = 1000000
x = np.linspace(0, 10, n)
y = x + np.random.normal(size=n)
fit = sm.OLS(y, x).fit()

fit.save("fit_sm0.12.2", remove_data=False)
fit.save("fit_sm0.12.2_remove_data", remove_data=True)
fit.model.wendog = None
fit.model.wexog = None
fit.save("fit_sm0.12.2_remove_weogs")
!ls -l fit_*

Yields:

'0.12.2'
-rw-r--r--. 1 rogerhar users 40002361 Jul 12 14:37 fit_sm0.12.2
-rw-r--r--. 1 rogerhar users 16002230 Jul 12 13:45 fit_sm0.12.2_remove_data
-rw-r--r--. 1 rogerhar users     2140 Jul 12 13:48 fit_sm0.12.2_remove_weogs

The same under statsmodels v0.11.0 gives 2133 bytes with remove_data=True alone, i.e. no need to also remove wendog and wexog 'by hand'.

(I'm afraid I don't have sufficient knowledge of statsmodels or GitHub to find or investigate the relevant changeset myself.)
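
One quick way to confirm which attributes account for the remaining file size is to pickle the candidates individually. This is an ad-hoc snippet for this issue, not a statsmodels API, and the list of candidate attribute names is a guess; it continues from the script above, after the remove_data=True save and before nulling anything by hand:

import pickle

for name in ["endog", "exog", "wendog", "wexog", "pinv_wexog"]:
    value = getattr(fit.model, name, None)
    if value is not None:
        print(name, len(pickle.dumps(value)))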

@josef-pkt josef-pkt added this to the 0.13 milestone Jul 12, 2021
@josef-pkt
Member

@RogerHar Thanks for reporting

I'll try to check this before the next release, which will be soon (maybe around 2 months or less).
It's implemented in a way that is a bit tricky because the keys for what to remove are added at different levels of the class hierarchy.
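
Just to sketch what that means (class and attribute names below are illustrative, not the actual statsmodels implementation): each results class registers the data attributes it owns in a list, remove_data wipes whatever is in that list, and an attribute a subclass forgets to register quietly survives.

import numpy as np


class Model:
    def __init__(self, endog, exog):
        self.endog, self.exog = endog, exog
        self.wendog, self.wexog = endog.copy(), exog.copy()


class BaseResults:
    def __init__(self, model):
        self.model = model
        # names of data attributes to wipe; subclasses extend this list
        self._data_attr = ["model.endog", "model.exog"]

    def remove_data(self):
        # wipe every registered attribute; anything not registered survives
        for name in self._data_attr:
            owner, attr = self, name
            while "." in attr:
                head, attr = attr.split(".", 1)
                owner = getattr(owner, head)
            setattr(owner, attr, None)


class RegressionResults(BaseResults):
    def __init__(self, model):
        super().__init__(model)
        # if a subclass forgets to extend the list here (e.g. omitting
        # "model.wendog"/"model.wexog"), those arrays stay in the pickle
        self._data_attr.extend(["model.wexog", "model.wendog"])


x = np.arange(5.0)
res = RegressionResults(Model(x, x))
res.remove_data()
print(res.model.endog, res.model.wendog)  # both None once registered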

@josef-pkt
Member

OLS/WLS part fixed in #7595
