
BUG/ENH: Saving a model with remove_data=True breaks .summary() #9147

Open

RoelVerbelen opened this issue Feb 11, 2024 · 11 comments

@RoelVerbelen

It would be great if a model saved with remove_data=True could still produce the summary statistics. Perhaps it's worth keeping the minimal amount of data needed to reproduce these statistics? Especially when using the formula API, where the original data is still saved, see here and here.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# load "growth curves pf pigs" dataset
my_data = sm.datasets.get_rdataset("dietox", "geepack").data

# make GLM model
my_formula = "Weight ~ Time"
my_family = sm.families.Gaussian(link=sm.families.links.Identity())
model = smf.glm(formula=my_formula, data=my_data, family=my_family)
result = model.fit()

# save model with "remove_data=True" 
result.save('test_model.pickle', remove_data=True)
new_result = sm.load('test_model.pickle')
print(new_result.summary())

Output:

[screenshot: traceback raised by print(new_result.summary())]

@josef-pkt (Member)

See, for example, #6887.

You need to call summary() before remove_data so that the summary statistics are computed and cached; computation of the extra statistics is lazy but cached. Afterwards, calling summary again will use the cached values.

This behavior is intentional. remove_data removes all data-length arrays, and afterwards no new statistics can be computed. This means users can only call remove_data after they have computed everything they want that needs the data.
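
As a quick illustration of the lazy-but-cached evaluation (a sketch; _cache is a private statsmodels attribute, inspected here only for demonstration):

result = smf.glm("Weight ~ Time", data=my_data).fit()
print("llf" in result._cache)  # False: the log-likelihood is not computed yet
print(result.summary())        # computing the summary populates the cache
print("llf" in result._cache)  # True: later calls reuse the cached value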

@josef-pkt (Member)

Closing as duplicate.

There should be unit tests for this for all core models, though likely not for the more recently added models.

@RoelVerbelen (Author)

Thanks @josef-pkt for pointing that out.

Unfortunately, that doesn't seem to work for the example above:

result.summary() # works
result.save('test_model.pickle', remove_data=True)
new_result = sm.load('test_model.pickle')
new_result.summary() # fails

Same for:

result.summary() # works
result.remove_data()
result.summary() # fails

@bashtage bashtage reopened this Feb 12, 2024
@bashtage (Member)

Sounds like it needs another look.

@josef-pkt (Member)

OK, that's a bug.

It fails in computing pseudo_rsquared, which cannot be cached because it's a method with arguments, but which only needs cached attributes.

The problem:

- llnull and llf are deleted (set to None) in the cache.
- The same happens if I use numpy arrays in GLM instead of the formula interface.
- It's also unrelated to pickling; remove_data alone has the same effect.
- I thought we had a unit test that includes GLM.

Cache after remove_data:

result._cache
{'null': None,
 'llnull': None,
 'mu': None,
 'llf': None,
 'deviance': None,
 'pearson_chi2': None,
 'bse': array([0.52072095, 0.07095736]),
 'tvalues': array([30.16077478, 97.89963309]),
 'pvalues': array([7.74789429e-200, 0.00000000e+000]),
 'aic': None,
 'bic_deviance': None,
 'bic_llf': None,
 'fittedvalues': None,
 'null_deviance': None,
 'resid_anscombe': None,
 'resid_anscombe_scaled': None,
 'resid_anscombe_unscaled': None,
 'resid_deviance': None,
 'resid_pearson': None,
 'resid_response': None,
 'resid_working': None,
 'resid': None,
 'wresid': None}
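
A minimal sketch reproducing the failure path without pickling (continuing from the example in the original report):

result.summary()           # computes and caches llf, llnull, ...
result.remove_data()       # the cached entries above are set to None
result.pseudo_rsquared()   # fails, even though it only needs the scalar llf and llnull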

@josef-pkt (Member)

The bug was introduced in #4421.

llf and llnull are @cache_readonly; aic, which is not nulled, is @cached_value.

#4421 moved from an explicit list of things to remove to removing things by default. It also looks like it evaluates cached attributes before removing them, so remove_data itself does a lot of computation. The original reason for adding remove_data was only predict, which does not need any extra results statistics.

Maybe revert most of #4421? I'm not even sure what the current behavior is; it seems to have changed in a large way compared to what I had initially added.
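
Schematically, the decorator-based removal from #4421 works roughly like this (a simplified sketch, not the actual statsmodels source):

# Null every cached attribute whose class-level descriptor is a
# data-length cache; scalar @cached_value statistics are meant to survive.
for name in dir(type(results)):
    descriptor = getattr(type(results), name, None)
    if isinstance(descriptor, cache_readonly):  # llf, llnull, resid, ...
        results._cache[name] = None
    # attributes decorated with cached_value (e.g. aic) are left in place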

@josef-pkt (Member) commented Feb 12, 2024

OLS summary after remove_data also fails, but it fails in the diagnostics, in jarque_bera(self.wresid).

Unit tests: in the base tests, test_remove_data_pickle calls summary before remove_data but not after. So it looks like a working summary after remove_data was never guaranteed generically by the base tests. The original unit tests only checked that predict works correctly after remove_data.
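
The OLS case can be reproduced the same way (a sketch using the data from the original report):

res_ols = smf.ols("Weight ~ Time", data=my_data).fit()
res_ols.summary()      # works, and caches the scalar statistics
res_ols.remove_data()
res_ols.summary()      # fails inside jarque_bera(res_ols.wresid)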

@josef-pkt (Member) commented Feb 12, 2024

Correction: I'm still not sure how this currently works. The current code still uses the whitelist; #4421 added additional None-ing based on the identity of the cache decorator.

In OLS, there are no cached results statistics set to None.

AFAICS, the bug is in the interaction of #4421 and #6056:
https://github.com/statsmodels/statsmodels/pull/6056/files#diff-caa0f13bb4c8b24512745dc6ef3d5eb4781a45610dd793ed488e0ece329b6759L142

#4421 introduced remove_data behavior that depends on isinstance checks against specific cache decorators. #6056 then made all decorators aliases of the same (pandas) decorator.

But I still don't see why OLS is not affected; it uses the same @cache_readonly decorator.
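
As a toy sketch of why the aliasing defeats the isinstance check (not the actual code):

class PandasCacheReadonly:      # stand-in for the pandas decorator
    pass

# After #6056 the statsmodels decorator names are all aliases of the
# same class, so a check meant to match only data-length caches now
# matches the value caches (llf, llnull) as well:
cache_readonly = PandasCacheReadonly
cached_data = PandasCacheReadonly
cached_value = PandasCacheReadonly
print(cached_data is cached_value)  # True: indistinguishable by isinstance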

Update: this sounded familiar, so I looked for the issue: #7511 (comment). #7511 was a quick fix that removed triggering the lazy evaluation, but it did not change the underlying decorator problem.

@josef-pkt (Member)

@RoelVerbelen, just as background information on the motivation for summary after remove_data: what is your use case for summary from a pickled model/results?

Related issue for summary_col: #7368

I think we can extend the unit tests to verify that summary() works after remove_data if summary has been called before.

Related, given that this has come up several times: we could add a run_summary=False or similar (summary=False?) option to remove_data and to pickle/save. This would at least provide obvious, visible documentation of this behavior. Whether the default should be True is not clear: it's not backwards compatible in terms of performance, and it might not be the correct default if the main use case for pickling is still predict.
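
A sketch of what that option could look like (hypothetical API, not implemented; run_summary is just the name floated above):

def save(self, fname, remove_data=False, run_summary=False):
    if remove_data and run_summary:
        self.summary()       # populate the cache before the data is removed
    if remove_data:
        self.remove_data()
    # ... pickle self to fname as before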

@josef-pkt josef-pkt added this to the 0.15 milestone Feb 12, 2024
@josef-pkt josef-pkt changed the title Saving a model with remove_data=True breaks .summary() BUG: Saving a model with remove_data=True breaks .summary() Feb 12, 2024
@josef-pkt josef-pkt changed the title BUG: Saving a model with remove_data=True breaks .summary() BUG/ENH: Saving a model with remove_data=True breaks .summary() Feb 12, 2024
@josef-pkt (Member)

Two points here, so I will open a separate issue for the ENH; this issue is then just the bug fix.

@RoelVerbelen (Author)

Hi @josef-pkt,

My motivation for using .summary() after .remove_data() is that I'm setting up modelling pipelines where I build (many) GLMs in a Python script, save the GLMs, and then run Quarto reports to evaluate them.

Within the model evaluation Quarto reports, I'd like to print the model summary, as it gives a useful short overview of the model object and its key statistics.

The data sets I'm working with are very large, so I'm looking for ways to reduce the size of the saved statsmodels objects. Currently they contain a lot of duplication, since the data has to be saved alongside the model for the formula API to work. That's not ideal.

FYI, in case you're interested: I have the same modelling pipeline set up in R, where I've managed to substantially reduce the size of GLMs/GAMs (mgcv) by removing all "nobs arrays" and keeping only the first row of the model frame, without affecting functionality.

#' Reduce glm object size (for saving purposes)
#'
#' @param object glm model
#'
#' @return glm model with certain components removed
reduce_glm_size <- function(object) {
  
  object$residuals <- NULL
  object$fitted.values <- NULL
  object$effects <- NULL
  object$qr$qr <- NULL
  object$linear.predictors <- NULL
  object$working.weights <- NULL
  object$z <- NULL
  object$wt <- NULL
  object$weights <- NULL
  object$prior.weights <- NULL
  object$hat <- NULL
  object$model <- object$model[1, ]
  object$data <- NULL
  object$offset <- NULL
  
  object
}
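
For what it's worth, a rough Python analogue of this approach might look like the sketch below (untested; model.data.frame is a statsmodels internal for formula-based models and may change between versions):

result.summary()  # cache the statistics while the data is still present
# keep only the first row of the stored formula data, mirroring
# object$model[1, ] in the R function above
result.model.data.frame = result.model.data.frame.iloc[:1]
result.remove_data()
result.save("test_model.pickle")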
