remove_data= option does not appear to remove data fully #7494
Comments
Thanks for reporting. However, it's intentional that we don't remove model.data.frame when formulas are used: we need it to rebuild the formula information, because formula information cannot be pickled. Did you try to load the saved pickle and predict with model.data.frame removed? AFAIR we don't need orig_exog/orig_endog, so those should be removed in the formula case as well. Related: #6858 for an alternative to patsy, which is not connected to statsmodels yet.
Thank you in turn for the speedy reply.
Yes, I should have included the code. Here it is: `sm.load("model-nodata-plus.sm").predict({"x1": 0, "x2": 0})`, which yields a pd.Series as the response; that at least suggests that the formula information has been pickled?
Two thoughts here:
AFAIK, when we have categorical variables in the formula, then patsy needs the full original dataset to rebuild the category levels and encoding. The same goes for stateful transforms.
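A small illustration of that point (synthetic data, not from the thread): with a categorical term, the encoding patsy applies to new observations depends on the levels it recorded when the design matrix was first built, so that state has to remain reachable for prediction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"g": rng.choice(["a", "b", "c"], size=90),
                   "x": rng.normal(size=90)})
df["y"] = (df["g"] == "b") + df["x"] + rng.normal(size=90)

res = smf.ols("y ~ C(g) + x", data=df).fit()

# patsy recorded the levels of g ("a", "b", "c") when fitting; encoding
# these new rows reuses that stored information even though "a" never
# appears in the new data
new = pd.DataFrame({"g": ["b", "c"], "x": [0.0, 0.0]})
pred = res.predict(new)
print(pred)
```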
I'm having the same issue, but I'm not using patsy, just sm.OLS directly. Under statsmodels version 0.12.2, remove_data=True only reduces the size of the pickle file by around a half. Additionally removing wendog and wexog by hand reduces it to a couple of kB, so it appears that remove_data is failing to remove wendog and wexog. Under statsmodels version 0.11.0, remove_data=True appears to successfully remove wendog and wexog by itself, so this appears to be a new issue.

```python
import numpy as np
import statsmodels.api as sm

sm.__version__

n = 1000000
x = np.linspace(0, 10, n)
y = x + np.random.normal(size=n)

fit = sm.OLS(y, x).fit()
fit.save("fit_sm0.12.2", remove_data=False)
fit.save("fit_sm0.12.2_remove_data", remove_data=True)

# additionally remove wendog and wexog by hand
fit.model.wendog = None
fit.model.wexog = None
fit.save("fit_sm0.12.2_remove_weogs")

!ls -l fit_*
```

Yields:
The same under statsmodels v0.11.0 gives 2133 bytes with remove_data=True alone, i.e. there is no need to also remove wendog and wexog by hand. (I'm afraid I don't have sufficient knowledge of statsmodels or GitHub to find or investigate the relevant changeset myself.)
@RogerHar Thanks for reporting. I'll try to check this before the next release, which will be soon (maybe in around two months or less).
For OLS/WLS, @bash's changes were done to fix #6887.
The OLS/WLS part was fixed in #7595.
Describe the bug
In linear/logistic models fit via the formula interface, calling `.save()` with `remove_data=True` appears to remove some but not all instances of length-nobs structures from the fitted model object. This seems at odds with the documentation of the `remove_data` option, which suggests that all such structures would be removed.
Code Sample, a copy-pastable example if possible
Yields:
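The code sample and its output were not preserved in this copy of the issue. A minimal reproduction consistent with the description (file names and data are illustrative, not the original) might look like:

```python
import os
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = df["x1"] + df["x2"] + rng.normal(size=n)

res = smf.ols("y ~ x1 + x2", data=df).fit()
res.save("model-full.pickle", remove_data=False)
res.save("model-nodata.pickle", remove_data=True)

# remove_data shrinks the file, but with the formula interface the
# result stays far larger than a few kB because length-nobs structures
# (e.g. model.data.frame) are retained
for f in ("model-full.pickle", "model-nodata.pickle"):
    print(f, os.path.getsize(f))
```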
Calling save with `remove_data=True` does result in a reduction of the serialized model size, but it's apparent that some copies of the data/design matrix are still in there. "Manually" removing additional structures as above appears to get all copies of the data out of the model. (I confirm that prediction still works from the deserialized model, at least for the simple formula given above.)

Expected Output
Based on the documentation of the `remove_data` option (ref), I would have expected all instances of these nobs-length arrays/dataframes to be removed from the model. Two options occur to me: (1) if it's not the intent of the `remove_data` option to fully remove all length-nobs structures from the fitted model objects, the documentation might be revised to reflect this, or (2) `remove_data` might be revised so that it fully removes all length-nobs structures from the object.

Output of `import statsmodels.api as sm; sm.show_versions()`