Bug: loading certain scipy functions fails #184

BenjaminBossan · 2022-10-11T14:35:32Z

Certain scipy functions cannot be loaded correctly. Here is a test to reproduce:

from scipy import stats

def test_scipy_stats(tmp_path):
    estimator = FunctionTransformer(func=stats.zipf)
    save_load_round(estimator, tmp_path / "file.skops")

zipf is only one of a few failing scipy functions that I tested. The error message is

AttributeError: module 'builtins' has no attribute 'method'

with a really long, unhelpful stacktrace. However, the problem already occurs during the generation of the state, which looks really weird:

Click to expand

{
  "__class__": "FunctionTransformer",
  "__module__": "sklearn.preprocessing._function_transformer",
  "content": {
    "__class__": "dict",
    "__module__": "builtins",
    "content": {
      "func": {
        "__class__": "zipf_gen",
        "__module__": "scipy.stats._discrete_distns",
        "content": {
          "__class__": "dict",
          "__module__": "builtins",
          "content": {
            "_stats_has_moments": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "true",
              "is_json": true
            },
            "_random_state": {
              "__class__": "RandomState",
              "__module__": "numpy.random.mtrand",
              "content": {
                "__class__": "dict",
                "__module__": "builtins",
                "content": {
                  "bit_generator": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "\"MT19937\"",
                    "is_json": true
                  },
                  "state": {
                    "__class__": "dict",
                    "__module__": "builtins",
                    "content": {
                      "key": {
                        "__class__": "ndarray",
                        "__module__": "numpy",
                        "type": "numpy",
                        "file": "140209309201776.npy"
                      },
                      "pos": {
                        "__class__": "str",
                        "__module__": "builtins",
                        "content": "335",
                        "is_json": true
                      }
                    },
                    "key_types": {
                      "__class__": "list",
                      "__module__": "builtins",
                      "content": [...]
                    }
                  },
                  "has_gauss": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "0",
                    "is_json": true
                  },
                  "gauss": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "0.0",
                    "is_json": true
                  }
                },
                "key_types": {
                  "__class__": "list",
                  "__module__": "builtins",
                  "content": [...]
                }
              }
            },
            "_rvs_uses_size_attribute": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "false",
              "is_json": true
            },
            "_rvs_size_warned": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "false",
              "is_json": true
            },
            "_ctor_param": {
              "__class__": "dict",
              "__module__": "builtins",
              "content": {
                "a": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "1",
                  "is_json": true
                },
                "b": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "Infinity",
                  "is_json": true
                },
                "name": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "\"zipf\"",
                  "is_json": true
                },
                "badvalue": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                },
                "moment_tol": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "1e-08",
                  "is_json": true
                },
                "values": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                },
                "inc": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "1",
                  "is_json": true
                },
                "longname": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "\"A Zipf\"",
                  "is_json": true
                },
                "shapes": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                },
                "extradoc": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                },
                "seed": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                }
              },
              "key_types": {
                "__class__": "list",
                "__module__": "builtins",
                "content": [...]
              }
            },
            "badvalue": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "NaN",
              "is_json": true
            },
            "a": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "1",
              "is_json": true
            },
            "b": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "Infinity",
              "is_json": true
            },
            "moment_tol": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "1e-08",
              "is_json": true
            },
            "inc": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "1",
              "is_json": true
            },
            "shapes": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "\"a\"",
              "is_json": true
            },
            "_parse_arg_template": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "\"\\ndef _parse_args(self, a,  loc=0):\\n    return (a, ), loc, 1\\n\\ndef _parse_args_rvs(self, a,  loc=0, size=None):\\n    return self._argcheck_rvs(a,  loc, 1, size=size)\\n\\ndef _parse_args_stats(self, a,  loc=0, moments='mv'):\\n    return (a, ), loc, 1, moments\\n\"",
              "is_json": true
            },
            "numargs": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "1",
              "is_json": true
            },
            "vecentropy": {
              "__class__": "vectorize",
              "__module__": "numpy",
              "content": {
                "__class__": "dict",
                "__module__": "builtins",
                "content": {
                  "pyfunc": {
                    "__class__": "method",
                    "__module__": "builtins",
                    "content": {
                      "__class__": "dict",
                      "__module__": "builtins",
                      "content": {},
                      "key_types": {
                        "__class__": "list",
                        "__module__": "builtins",
                        "content": []
                      }
                    }
                  },
                  "cache": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "false",
                    "is_json": true
                  },
                  "signature": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "null",
                    "is_json": true
                  },
                  "_ufunc": {
                    "__class__": "dict",
                    "__module__": "builtins",
                    "content": {},
                    "key_types": {
                      "__class__": "list",
                      "__module__": "builtins",
                      "content": []
                    }
                  },
                  "__doc__": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "null",
                    "is_json": true
                  },
                  "otypes": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "null",
                    "is_json": true
                  },
                  "excluded": {
                    "__class__": "set",
                    "__module__": "builtins"
                  },
                  "_in_and_out_core_dims": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "null",
                    "is_json": true
                  }
                },
                "key_types": {
                  "__class__": "list",
                  "__module__": "builtins",
                  "content": [...]
                }
              }
            },
            "name": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "\"zipf\"",
              "is_json": true
            },
            "extradoc": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "null",
              "is_json": true
            },
            "__doc__": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "\"A Zipf (Zeta) discrete random variable.\\n\\n    As an instance of the `rv_discrete` class, `zipf` object inherits from it\\n    a collection of generic methods (see below for the full list),\\n    and completes them with details specific for this particular distribution.\\n    \\n    Methods\\n    -------\\n    rvs(a, loc=0, size=1, random_state=None)\\n        Random variates.\\n    pmf(k, a, loc=0)\\n        Probability mass function.\\n    logpmf(k, a, loc=0)\\n        Log of the probability mass function.\\n    cdf(k, a, loc=0)\\n        Cumulative distribution function.\\n    logcdf(k, a, loc=0)\\n        Log of the cumulative distribution function.\\n    sf(k, a, loc=0)\\n        Survival function  (also defined as ``1 - cdf``, but `sf` is sometimes more accurate).\\n    logsf(k, a, loc=0)\\n        Log of the survival function.\\n    ppf(q, a, loc=0)\\n        Percent point function (inverse of ``cdf`` --- percentiles).\\n    isf(q, a, loc=0)\\n        Inverse survival function (inverse of ``sf``).\\n    stats(a, loc=0, moments='mv')\\n        Mean('m'), variance('v'), skew('s'), and/or kurtosis('k').\\n    entropy(a, loc=0)\\n        (Differential) entropy of the RV.\\n    expect(func, args=(a,), loc=0, lb=None, ub=None, conditional=False)\\n        Expected value of a function (of one argument) with respect to the distribution.\\n    median(a, loc=0)\\n        Median of the distribution.\\n    mean(a, loc=0)\\n        Mean of the distribution.\\n    var(a, loc=0)\\n        Variance of the distribution.\\n    std(a, loc=0)\\n        Standard deviation of the distribution.\\n    interval(alpha, a, loc=0)\\n        Endpoints of the range that contains fraction alpha [0, 1] of the\\n        distribution\\n\\n    See Also\\n    --------\\n    zipfian\\n\\n    Notes\\n    -----\\n    The probability mass function for `zipf` is:\\n\\n    .. math::\\n\\n        f(k, a) = \\\\frac{1}{\\\\zeta(a) k^a}\\n\\n    for :math:`k \\\\ge 1`, :math:`a > 1`.\\n\\n    `zipf` takes :math:`a > 1` as shape parameter. :math:`\\\\zeta` is the\\n    Riemann zeta function (`scipy.special.zeta`)\\n\\n    The Zipf distribution is also known as the zeta distribution, which is\\n    a special case of the Zipfian distribution (`zipfian`).\\n\\n    The probability mass function above is defined in the \\\"standardized\\\" form.\\n    To shift distribution use the ``loc`` parameter.\\n    Specifically, ``zipf.pmf(k, a, loc)`` is identically\\n    equivalent to ``zipf.pmf(k - loc, a)``.\\n\\n    References\\n    ----------\\n    .. [1] \\\"Zeta Distribution\\\", Wikipedia,\\n           https://en.wikipedia.org/wiki/Zeta_distribution\\n\\n    Examples\\n    --------\\n    >>> from scipy.stats import zipf\\n    >>> import matplotlib.pyplot as plt\\n    >>> fig, ax = plt.subplots(1, 1)\\n    \\n    Calculate the first four moments:\\n    \\n    >>> a = 6.5\\n    >>> mean, var, skew, kurt = zipf.stats(a, moments='mvsk')\\n    \\n    Display the probability mass function (``pmf``):\\n    \\n    >>> x = np.arange(zipf.ppf(0.01, a),\\n    ...               zipf.ppf(0.99, a))\\n    >>> ax.plot(x, zipf.pmf(x, a), 'bo', ms=8, label='zipf pmf')\\n    >>> ax.vlines(x, 0, zipf.pmf(x, a), colors='b', lw=5, alpha=0.5)\\n    \\n    Alternatively, the distribution object can be called (as a function)\\n    to fix the shape and location. This returns a \\\"frozen\\\" RV object holding\\n    the given parameters fixed.\\n    \\n    Freeze the distribution and display the frozen ``pmf``:\\n    \\n    >>> rv = zipf(a)\\n    >>> ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1,\\n    ...         label='frozen pmf')\\n    >>> ax.legend(loc='best', frameon=False)\\n    >>> plt.show()\\n    \\n    Check accuracy of ``cdf`` and ``ppf``:\\n    \\n    >>> prob = zipf.cdf(x, a)\\n    >>> np.allclose(x, zipf.ppf(prob, a))\\n    True\\n    \\n    Generate random numbers:\\n    \\n    >>> r = zipf.rvs(a, size=1000)\\n\\n    Confirm that `zipf` is the large `n` limit of `zipfian`.\\n\\n    >>> from scipy.stats import zipfian\\n    >>> k = np.arange(11)\\n    >>> np.allclose(zipf.pmf(k, a), zipfian.pmf(k, a, n=10000000))\\n    True\\n\\n    \"",
              "is_json": true
            }
          },
          "key_types": {
            "__class__": "list",
            "__module__": "builtins",
            "content": [...]
          }
        }
      },
      "inverse_func": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "null",
        "is_json": true
      },
      "validate": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "false",
        "is_json": true
      },
      "accept_sparse": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "false",
        "is_json": true
      },
      "check_inverse": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "true",
        "is_json": true
      },
      "feature_names_out": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "null",
        "is_json": true
      },
      "kw_args": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "null",
        "is_json": true
      },
      "inv_kw_args": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "null",
        "is_json": true
      },
      "_sklearn_version": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "\"1.1.1\"",
        "is_json": true
      }
    },
    "key_types": {
      "__class__": "list",
      "__module__": "builtins",
      "content": [...]
    }
  }
}

("key_types" omitted for readability)

The text was updated successfully, but these errors were encountered:

E-Aho · 2022-10-17T18:57:30Z

In the code that @BenjaminBossan has given, the core of the error seems to come from serialising a bound method of an instance of the stats.zipf_gen class.

I tried just adding "method" as a special case in the gettype function in _utils.py, but it didn't work.

Basically, I think it's not serialising enough information for bound methods. As you can see in your state, the "pyfunc" is just a "method", but it doesn't at all say what it is a method of.

It should be serialising the object (here: an instance of stats.zipf_gen) and the method it's pointing to (here: _entropy) I think

E-Aho · 2022-10-17T20:29:22Z

I can confirm that adding dummy MethodType dispatch functions can get the provided example working.

I'll raise a draft PR that has the code needed to get the above example working, but I think it'll need more work to make sure it works in other cases of bound methods getting serialised.

E-Aho · 2022-10-17T21:23:01Z

Draft PR above ^ shows the fix for this, still needs a bit of work before I raise the full PR but it's there if you want to check the fix :)

BenjaminBossan · 2022-10-24T10:26:53Z

Besides the problem with methods, there is also a problem with circular references. Here are some simple examples that fail at the moment:

class ObjWithCircularRef:
    def __init__(self):
        self.x = [123]
        self.y = {"a": self.x}
        self.x.append(self.y)

class ObjWithCircularRef:
    def __init__(self):
        self.x = [123, self, 456]

Fixing this might also fix sklearn.cluster.Birch.

adrinjalali · 2022-10-31T11:28:43Z

Reopening since this is not fixed IIUC.

E-Aho · 2022-10-31T18:14:42Z

Yep, bug still exists, but for a different reason now.

We can now persist methods in general, but circular references (as exist in stats.zipf) aren't handled in the right way, as @BenjaminBossan mentions above.

E-Aho · 2022-11-11T21:53:28Z

Jumping back on this once I've merged in #209

E-Aho · 2022-11-15T17:16:41Z

So I've looked at how Pickle handles this type of problem, and I think I understand their method.

The main takeaway is this:

In addition to a memo, Pickle's implementation creates a stack of objects and opcodes during dump and load. This comes with a few benefits, but for this issue in particular, the main benefit is that it allows Pickle to pop out of any recursive sections, and gotos to jump back to objects and continue the stack from there.

Link to part of the pickle source code that shows this with a comment

Link to part of the pickletools documentation describing the purpose of POP_MARK

I also came up with a minimal reproduction of the error:

def wrap(func):
    return func

class CircularObject:
    def a(self):
        return
    
    def __init__(self):
        self.b = wrap(self.a)

def test_circular_self_reference():
  obj = CircularObject()

  import pickle
  pkl_obj = pickle.dumps(obj)   # works fine
  
  import skops
  skp_obj = skops.io.dumps(obj)  # recursive stack error

You can also use the pickletools dis method to look at what the stack looked like for the above example. When I did this, I could see once it reaches b, and once it pushes the global null_func on the stack, it performs a binget op to return to a previously memoized position in the stack, and continues from that position on the stack.

To implement a stack like pickle has, we would need a pretty big refactor.

For the issue referenced originally, there's likely a much simpler workaround we could implement, but I also think having some a stack implementation would be a good idea to consider for a more general fix.

adrinjalali · 2022-11-17T10:04:22Z

Could you try the code from #204 ? For me your test doesn't fail, but I haven't tested if the loaded object is actually valid.

BenjaminBossan · 2022-11-17T10:26:16Z

Could you try the code from #204 ? For me your test doesn't fail, but I haven't tested if the loaded object is actually valid.

It's probably a bug in the pasted code, where the variable obj is overridden by the pickle call. When you remove that, there is indeed an error (at least on main).

To implement a stack like pickle has, we would need a pretty big refactor.

Yes, we explicitly didn't want to "re-invent" the whole pickle machinery, at that point we might as well use pickle :) So hopefully, for the much narrower scope we set ourselves for skops, we find a way to avoid this problem. Absolute worst case scenario would be special casing objects known to have that problem and giving custom code an escape hatch via __getstate__ but I'm sure we can come up with something better.

E-Aho · 2022-11-17T10:27:51Z

It's probably a bug in the pasted code, where the variable obj is overridden by the pickle call. When you remove that, there is indeed an error (at least on main).

Ah yep, I tried to combine two tests but forgot to rename the variables, my bad! I can double check it still fails on #204 this eve

blaisethom · 2023-08-27T09:54:54Z

Hi, I think I'm getting the same error when trying to serialise a keras model:

from tensorflow.keras import layers, models
from sklearn.pipeline import Pipeline
from skops.io import dump
model = models.Sequential(
    [
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3))
    ]
)
pipeline = Pipeline(
    [("classifier", model)]
)
dump(pipeline, 'keras-test.skops')`

Just posting in case it's helpful to anyone else and/or helps with the bug fixing!

BenjaminBossan · 2023-08-28T14:05:26Z

@blaisethom Thanks for reporting. Please note that keras is not supported at the moment, so I wouldn't expect it to work.

adrinjalali · 2023-08-29T13:08:24Z

@blaisethom have you tried with scikeras? I'm not sure if it'd make a difference, but if we're to support this, scikeras would be a better place to start.

blaisethom · 2023-08-29T13:10:42Z

Hi, @adrinjalali - yes, I realised after posting that I should have done that and did so but it gives the same error.

adrinjalali · 2023-08-29T13:42:51Z

@blaisethom if you're kind enough to open a new issue with a mini reproducible it'd be very helpful

blaisethom · 2023-08-29T14:25:12Z

Done

BenjaminBossan added bug Something isn't working persistence Secure persistence feature labels Oct 11, 2022

E-Aho mentioned this issue Oct 17, 2022

Add Method type in dispatch calls #195

Merged

BenjaminBossan closed this as completed in #195 Oct 31, 2022

adrinjalali reopened this Oct 31, 2022

blaisethom mentioned this issue Aug 29, 2023

Bug: dump fails on keras models with RecursionError: maximum recursion depth exceeded, after calling fit #388

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: loading certain scipy functions fails #184

Bug: loading certain scipy functions fails #184

BenjaminBossan commented Oct 11, 2022

E-Aho commented Oct 17, 2022 •

edited

E-Aho commented Oct 17, 2022

E-Aho commented Oct 17, 2022

BenjaminBossan commented Oct 24, 2022

adrinjalali commented Oct 31, 2022

E-Aho commented Oct 31, 2022

E-Aho commented Nov 11, 2022

E-Aho commented Nov 15, 2022 •

edited

adrinjalali commented Nov 17, 2022

BenjaminBossan commented Nov 17, 2022

E-Aho commented Nov 17, 2022

blaisethom commented Aug 27, 2023

BenjaminBossan commented Aug 28, 2023

adrinjalali commented Aug 29, 2023

blaisethom commented Aug 29, 2023

adrinjalali commented Aug 29, 2023

blaisethom commented Aug 29, 2023

Bug: loading certain scipy functions fails #184

Bug: loading certain scipy functions fails #184

Comments

BenjaminBossan commented Oct 11, 2022

E-Aho commented Oct 17, 2022 • edited

E-Aho commented Oct 17, 2022

E-Aho commented Oct 17, 2022

BenjaminBossan commented Oct 24, 2022

adrinjalali commented Oct 31, 2022

E-Aho commented Oct 31, 2022

E-Aho commented Nov 11, 2022

E-Aho commented Nov 15, 2022 • edited

adrinjalali commented Nov 17, 2022

BenjaminBossan commented Nov 17, 2022

E-Aho commented Nov 17, 2022

blaisethom commented Aug 27, 2023

BenjaminBossan commented Aug 28, 2023

adrinjalali commented Aug 29, 2023

blaisethom commented Aug 29, 2023

adrinjalali commented Aug 29, 2023

blaisethom commented Aug 29, 2023

E-Aho commented Oct 17, 2022 •

edited

E-Aho commented Nov 15, 2022 •

edited