
BUG/ENH: Saving a model with remove_data=True breaks .summary() #9147

Open

RoelVerbelen opened this issue Feb 11, 2024 · 11 comments

@RoelVerbelen

It would be great if a model saved with remove_data=True could still produce the summary statistics. Perhaps it's worth keeping the minimal amount of data needed to reproduce these statistics? Especially when using the formula API, where the original data is still saved, see here and here.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# load "growth curves pf pigs" dataset
my_data = sm.datasets.get_rdataset("dietox", "geepack").data

# make GLM model
my_formula = "Weight ~ Time"
my_family = sm.families.Gaussian(link=sm.families.links.Identity())
model = smf.glm(formula=my_formula, data=my_data, family=my_family)
result = model.fit()

# save model with "remove_data=True" 
result.save('test_model.pickle', remove_data=True)
new_result = sm.load('test_model.pickle')
print(new_result.summary())

Output:

[screenshot: traceback raised by print(new_result.summary())]

@josef-pkt (Member)

See, for example, #6887.

You need to call summary() before remove_data so that the summary statistics are computed and cached; computation of the extra statistics is lazy but cached. Afterwards, calling summary again will use the cached values.

This behavior is intentional. remove_data removes all data-length arrays, and afterwards no new statistics can be computed. This means users can only call remove_data after they have computed everything they want that needs the data.
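
As a quick illustration of the lazy-but-cached evaluation (a sketch; _cache is a private statsmodels attribute, inspected here only for demonstration):

result = smf.glm("Weight ~ Time", data=my_data).fit()
print("llf" in result._cache)  # False: the log-likelihood is not computed yet
print(result.summary())        # computing the summary populates the cache
print("llf" in result._cache)  # True: later calls reuse the cached value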

@josef-pkt (Member)

Closing as duplicate.

There should be unit tests for this for all core models, though likely not for the more recently added models.

@RoelVerbelen (Author)

Thanks @josef-pkt for pointing that out.

Unfortunately, that doesn't seem to work for the example above:

result.summary() # works
result.save('test_model.pickle', remove_data=True)
new_result = sm.load('test_model.pickle')
new_result.summary() # fails

Same for:

result.summary() # works
result.remove_data()
result.summary() # fails

@bashtage bashtage reopened this Feb 12, 2024
@bashtage (Member)

Sounds like it needs another look.

@josef-pkt (Member)

OK, that's a bug.

It fails in computing pseudo_rsquared, which cannot be cached because it's a method with arguments, but which only needs cached attributes.

The problem:

- llnull and llf are deleted (set to None) in the cache.
- The same happens if I use numpy arrays in GLM instead of the formula interface.
- It's also unrelated to pickling; remove_data alone has the same effect.
- I thought we had a unit test that includes GLM.

Cache after remove_data:

result._cache
{'null': None,
 'llnull': None,
 'mu': None,
 'llf': None,
 'deviance': None,
 'pearson_chi2': None,
 'bse': array([0.52072095, 0.07095736]),
 'tvalues': array([30.16077478, 97.89963309]),
 'pvalues': array([7.74789429e-200, 0.00000000e+000]),
 'aic': None,
 'bic_deviance': None,
 'bic_llf': None,
 'fittedvalues': None,
 'null_deviance': None,
 'resid_anscombe': None,
 'resid_anscombe_scaled': None,
 'resid_anscombe_unscaled': None,
 'resid_deviance': None,
 'resid_pearson': None,
 'resid_response': None,
 'resid_working': None,
 'resid': None,
 'wresid': None}
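
A minimal sketch reproducing the failure path without pickling (continuing from the example in the original report):

result.summary()           # computes and caches llf, llnull, ...
result.remove_data()       # the cached entries above are set to None
result.pseudo_rsquared()   # fails, even though it only needs the scalar llf and llnull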

@josef-pkt (Member)

The bug was introduced in #4421.

llf and llnull are @cache_readonly; aic, which is not nulled, is @cached_value.

#4421 moved from an explicit list of things to remove to removing things by default. It also looks like it evaluates cached attributes before removing them, so remove_data itself does a lot of computation. The original reason for adding remove_data was only predict, which does not need any extra results statistics.

Maybe revert most of #4421? I'm not even sure what the current behavior is; it seems to have changed in a large way compared to what I had initially added.
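
Schematically, the decorator-based removal from #4421 works roughly like this (a simplified sketch, not the actual statsmodels source):

# Null every cached attribute whose class-level descriptor is a
# data-length cache; scalar @cached_value statistics are meant to survive.
for name in dir(type(results)):
    descriptor = getattr(type(results), name, None)
    if isinstance(descriptor, cache_readonly):  # llf, llnull, resid, ...
        results._cache[name] = None
    # attributes decorated with cached_value (e.g. aic) are left in place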

@josef-pkt (Member) commented Feb 12, 2024

OLS summary after remove_data also fails, but it fails in the diagnostics, in jarque_bera(self.wresid).

Unit tests: in the base tests, test_remove_data_pickle calls summary before remove_data but not after. So it looks like a working summary after remove_data was never guaranteed generically by the base tests. The original unit tests only checked that predict works correctly after remove_data.
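
The OLS case can be reproduced the same way (a sketch using the data from the original report):

res_ols = smf.ols("Weight ~ Time", data=my_data).fit()
res_ols.summary()      # works, and caches the scalar statistics
res_ols.remove_data()
res_ols.summary()      # fails inside jarque_bera(res_ols.wresid)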

@josef-pkt (Member) commented Feb 12, 2024

Correction: I'm still not sure how this currently works. The current code still uses the whitelist; #4421 added additional None-ing based on the identity of the cache decorator.

In OLS, there are no cached results statistics set to None.

AFAICS, the bug is in the interaction of #4421 and #6056:
https://github.com/statsmodels/statsmodels/pull/6056/files#diff-caa0f13bb4c8b24512745dc6ef3d5eb4781a45610dd793ed488e0ece329b6759L142

#4421 introduced remove_data behavior that depends on isinstance checks against specific cache decorators. #6056 then made all decorators aliases of the same (pandas) decorator.

But I still don't see why OLS is not affected; it uses the same @cache_readonly decorator.
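
As a toy sketch of why the aliasing defeats the isinstance check (not the actual code):

class PandasCacheReadonly:      # stand-in for the pandas decorator
    pass

# After #6056 the statsmodels decorator names are all aliases of the
# same class, so a check meant to match only data-length caches now
# matches the value caches (llf, llnull) as well:
cache_readonly = PandasCacheReadonly
cached_data = PandasCacheReadonly
cached_value = PandasCacheReadonly
print(cached_data is cached_value)  # True: indistinguishable by isinstance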

Update: this sounded familiar, so I looked for the issue: #7511 (comment). #7511 was a quick fix that removed triggering the lazy evaluation, but it did not change the underlying decorator problem.

@josef-pkt (Member)

@RoelVerbelen, just as background information on the motivation for summary after remove_data: what is your use case for summary from a pickled model/results?

Related issue for summary_col: #7368

I think we can extend the unit tests to verify that summary() works after remove_data if summary has been called before.

Related, given that this has come up several times: we could add a run_summary=False or similar (summary=False?) option to remove_data and to pickle/save. This would at least provide obvious, visible documentation of this behavior. Whether the default should be True is not clear: it's not backwards compatible in terms of performance, and it might not be the correct default if the main use case for pickling is still predict.
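
A sketch of what that option could look like (hypothetical API, not implemented; run_summary is just the name floated above):

def save(self, fname, remove_data=False, run_summary=False):
    if remove_data and run_summary:
        self.summary()       # populate the cache before the data is removed
    if remove_data:
        self.remove_data()
    # ... pickle self to fname as before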

@josef-pkt josef-pkt added this to the 0.15 milestone Feb 12, 2024
@josef-pkt josef-pkt changed the title Saving a model with remove_data=True breaks .summary() BUG: Saving a model with remove_data=True breaks .summary() Feb 12, 2024
@josef-pkt josef-pkt changed the title BUG: Saving a model with remove_data=True breaks .summary() BUG/ENH: Saving a model with remove_data=True breaks .summary() Feb 12, 2024
@josef-pkt (Member)

Two points here, so I will open a separate issue for the ENH; this issue is then just the bug fix.

@RoelVerbelen (Author)

Hi @josef-pkt,

My motivation for using .summary() after .remove_data() is that I'm setting up modelling pipelines where I build (many) GLMs in a Python script, save the GLMs, and then run Quarto reports to evaluate them.

Within the model evaluation Quarto reports, I'd like to print the model summary, as it gives a useful short overview of the model object and its key statistics.

The data sets I'm working with are very large, so I'm looking for ways to reduce the size of the saved statsmodels objects. Currently they contain a lot of duplication, since the data has to be saved alongside the model for the formula API to work. That's not ideal.

FYI, in case you're interested: I have the same modelling pipeline set up in R, where I've managed to substantially reduce the size of GLMs/GAMs (mgcv) by removing all "nobs arrays" and keeping only the first row of the model frame, without affecting functionality.

#' Reduce glm object size (for saving purposes)
#'
#' @param object glm model
#'
#' @return glm model with certain components removed
reduce_glm_size <- function(object) {
  
  object$residuals <- NULL
  object$fitted.values <- NULL
  object$effects <- NULL
  object$qr$qr <- NULL
  object$linear.predictors <- NULL
  object$working.weights <- NULL
  object$z <- NULL
  object$wt <- NULL
  object$weights <- NULL
  object$prior.weights <- NULL
  object$hat <- NULL
  object$model <- object$model[1, ]
  object$data <- NULL
  object$offset <- NULL
  
  object
}
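
For what it's worth, a rough Python analogue of this approach might look like the sketch below (untested; model.data.frame is a statsmodels internal for formula-based models and may change between versions):

result.summary()  # cache the statistics while the data is still present
# keep only the first row of the stored formula data, mirroring
# object$model[1, ] in the R function above
result.model.data.frame = result.model.data.frame.iloc[:1]
result.remove_data()
result.save("test_model.pickle")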
