Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: loading certain scipy functions fails #184

Open
BenjaminBossan opened this issue Oct 11, 2022 · 17 comments · Fixed by #195
Open

Bug: loading certain scipy functions fails #184

BenjaminBossan opened this issue Oct 11, 2022 · 17 comments · Fixed by #195
Labels
bug Something isn't working persistence Secure persistence feature

Comments

@BenjaminBossan
Copy link
Collaborator

Certain scipy functions cannot be loaded correctly. Here is a test to reproduce:

from scipy import stats

def test_scipy_stats(tmp_path):
    estimator = FunctionTransformer(func=stats.zipf)
    save_load_round(estimator, tmp_path / "file.skops")

zipf is only one of a few failing scipy functions that I tested. The error message is

AttributeError: module 'builtins' has no attribute 'method'

with a really long, unhelpful stacktrace. However, the problem already occurs during the generation of the state, which looks really weird:

Click to expand
{
  "__class__": "FunctionTransformer",
  "__module__": "sklearn.preprocessing._function_transformer",
  "content": {
    "__class__": "dict",
    "__module__": "builtins",
    "content": {
      "func": {
        "__class__": "zipf_gen",
        "__module__": "scipy.stats._discrete_distns",
        "content": {
          "__class__": "dict",
          "__module__": "builtins",
          "content": {
            "_stats_has_moments": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "true",
              "is_json": true
            },
            "_random_state": {
              "__class__": "RandomState",
              "__module__": "numpy.random.mtrand",
              "content": {
                "__class__": "dict",
                "__module__": "builtins",
                "content": {
                  "bit_generator": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "\"MT19937\"",
                    "is_json": true
                  },
                  "state": {
                    "__class__": "dict",
                    "__module__": "builtins",
                    "content": {
                      "key": {
                        "__class__": "ndarray",
                        "__module__": "numpy",
                        "type": "numpy",
                        "file": "140209309201776.npy"
                      },
                      "pos": {
                        "__class__": "str",
                        "__module__": "builtins",
                        "content": "335",
                        "is_json": true
                      }
                    },
                    "key_types": {
                      "__class__": "list",
                      "__module__": "builtins",
                      "content": [...]
                    }
                  },
                  "has_gauss": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "0",
                    "is_json": true
                  },
                  "gauss": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "0.0",
                    "is_json": true
                  }
                },
                "key_types": {
                  "__class__": "list",
                  "__module__": "builtins",
                  "content": [...]
                }
              }
            },
            "_rvs_uses_size_attribute": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "false",
              "is_json": true
            },
            "_rvs_size_warned": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "false",
              "is_json": true
            },
            "_ctor_param": {
              "__class__": "dict",
              "__module__": "builtins",
              "content": {
                "a": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "1",
                  "is_json": true
                },
                "b": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "Infinity",
                  "is_json": true
                },
                "name": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "\"zipf\"",
                  "is_json": true
                },
                "badvalue": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                },
                "moment_tol": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "1e-08",
                  "is_json": true
                },
                "values": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                },
                "inc": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "1",
                  "is_json": true
                },
                "longname": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "\"A Zipf\"",
                  "is_json": true
                },
                "shapes": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                },
                "extradoc": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                },
                "seed": {
                  "__class__": "str",
                  "__module__": "builtins",
                  "content": "null",
                  "is_json": true
                }
              },
              "key_types": {
                "__class__": "list",
                "__module__": "builtins",
                "content": [...]
              }
            },
            "badvalue": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "NaN",
              "is_json": true
            },
            "a": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "1",
              "is_json": true
            },
            "b": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "Infinity",
              "is_json": true
            },
            "moment_tol": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "1e-08",
              "is_json": true
            },
            "inc": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "1",
              "is_json": true
            },
            "shapes": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "\"a\"",
              "is_json": true
            },
            "_parse_arg_template": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "\"\\ndef _parse_args(self, a,  loc=0):\\n    return (a, ), loc, 1\\n\\ndef _parse_args_rvs(self, a,  loc=0, size=None):\\n    return self._argcheck_rvs(a,  loc, 1, size=size)\\n\\ndef _parse_args_stats(self, a,  loc=0, moments='mv'):\\n    return (a, ), loc, 1, moments\\n\"",
              "is_json": true
            },
            "numargs": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "1",
              "is_json": true
            },
            "vecentropy": {
              "__class__": "vectorize",
              "__module__": "numpy",
              "content": {
                "__class__": "dict",
                "__module__": "builtins",
                "content": {
                  "pyfunc": {
                    "__class__": "method",
                    "__module__": "builtins",
                    "content": {
                      "__class__": "dict",
                      "__module__": "builtins",
                      "content": {},
                      "key_types": {
                        "__class__": "list",
                        "__module__": "builtins",
                        "content": []
                      }
                    }
                  },
                  "cache": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "false",
                    "is_json": true
                  },
                  "signature": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "null",
                    "is_json": true
                  },
                  "_ufunc": {
                    "__class__": "dict",
                    "__module__": "builtins",
                    "content": {},
                    "key_types": {
                      "__class__": "list",
                      "__module__": "builtins",
                      "content": []
                    }
                  },
                  "__doc__": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "null",
                    "is_json": true
                  },
                  "otypes": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "null",
                    "is_json": true
                  },
                  "excluded": {
                    "__class__": "set",
                    "__module__": "builtins"
                  },
                  "_in_and_out_core_dims": {
                    "__class__": "str",
                    "__module__": "builtins",
                    "content": "null",
                    "is_json": true
                  }
                },
                "key_types": {
                  "__class__": "list",
                  "__module__": "builtins",
                  "content": [...]
                }
              }
            },
            "name": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "\"zipf\"",
              "is_json": true
            },
            "extradoc": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "null",
              "is_json": true
            },
            "__doc__": {
              "__class__": "str",
              "__module__": "builtins",
              "content": "\"A Zipf (Zeta) discrete random variable.\\n\\n    As an instance of the `rv_discrete` class, `zipf` object inherits from it\\n    a collection of generic methods (see below for the full list),\\n    and completes them with details specific for this particular distribution.\\n    \\n    Methods\\n    -------\\n    rvs(a, loc=0, size=1, random_state=None)\\n        Random variates.\\n    pmf(k, a, loc=0)\\n        Probability mass function.\\n    logpmf(k, a, loc=0)\\n        Log of the probability mass function.\\n    cdf(k, a, loc=0)\\n        Cumulative distribution function.\\n    logcdf(k, a, loc=0)\\n        Log of the cumulative distribution function.\\n    sf(k, a, loc=0)\\n        Survival function  (also defined as ``1 - cdf``, but `sf` is sometimes more accurate).\\n    logsf(k, a, loc=0)\\n        Log of the survival function.\\n    ppf(q, a, loc=0)\\n        Percent point function (inverse of ``cdf`` --- percentiles).\\n    isf(q, a, loc=0)\\n        Inverse survival function (inverse of ``sf``).\\n    stats(a, loc=0, moments='mv')\\n        Mean('m'), variance('v'), skew('s'), and/or kurtosis('k').\\n    entropy(a, loc=0)\\n        (Differential) entropy of the RV.\\n    expect(func, args=(a,), loc=0, lb=None, ub=None, conditional=False)\\n        Expected value of a function (of one argument) with respect to the distribution.\\n    median(a, loc=0)\\n        Median of the distribution.\\n    mean(a, loc=0)\\n        Mean of the distribution.\\n    var(a, loc=0)\\n        Variance of the distribution.\\n    std(a, loc=0)\\n        Standard deviation of the distribution.\\n    interval(alpha, a, loc=0)\\n        Endpoints of the range that contains fraction alpha [0, 1] of the\\n        distribution\\n\\n    See Also\\n    --------\\n    zipfian\\n\\n    Notes\\n    -----\\n    The probability mass function for `zipf` is:\\n\\n    .. math::\\n\\n        f(k, a) = \\\\frac{1}{\\\\zeta(a) k^a}\\n\\n    for :math:`k \\\\ge 1`, :math:`a > 1`.\\n\\n    `zipf` takes :math:`a > 1` as shape parameter. :math:`\\\\zeta` is the\\n    Riemann zeta function (`scipy.special.zeta`)\\n\\n    The Zipf distribution is also known as the zeta distribution, which is\\n    a special case of the Zipfian distribution (`zipfian`).\\n\\n    The probability mass function above is defined in the \\\"standardized\\\" form.\\n    To shift distribution use the ``loc`` parameter.\\n    Specifically, ``zipf.pmf(k, a, loc)`` is identically\\n    equivalent to ``zipf.pmf(k - loc, a)``.\\n\\n    References\\n    ----------\\n    .. [1] \\\"Zeta Distribution\\\", Wikipedia,\\n           https://en.wikipedia.org/wiki/Zeta_distribution\\n\\n    Examples\\n    --------\\n    >>> from scipy.stats import zipf\\n    >>> import matplotlib.pyplot as plt\\n    >>> fig, ax = plt.subplots(1, 1)\\n    \\n    Calculate the first four moments:\\n    \\n    >>> a = 6.5\\n    >>> mean, var, skew, kurt = zipf.stats(a, moments='mvsk')\\n    \\n    Display the probability mass function (``pmf``):\\n    \\n    >>> x = np.arange(zipf.ppf(0.01, a),\\n    ...               zipf.ppf(0.99, a))\\n    >>> ax.plot(x, zipf.pmf(x, a), 'bo', ms=8, label='zipf pmf')\\n    >>> ax.vlines(x, 0, zipf.pmf(x, a), colors='b', lw=5, alpha=0.5)\\n    \\n    Alternatively, the distribution object can be called (as a function)\\n    to fix the shape and location. This returns a \\\"frozen\\\" RV object holding\\n    the given parameters fixed.\\n    \\n    Freeze the distribution and display the frozen ``pmf``:\\n    \\n    >>> rv = zipf(a)\\n    >>> ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1,\\n    ...         label='frozen pmf')\\n    >>> ax.legend(loc='best', frameon=False)\\n    >>> plt.show()\\n    \\n    Check accuracy of ``cdf`` and ``ppf``:\\n    \\n    >>> prob = zipf.cdf(x, a)\\n    >>> np.allclose(x, zipf.ppf(prob, a))\\n    True\\n    \\n    Generate random numbers:\\n    \\n    >>> r = zipf.rvs(a, size=1000)\\n\\n    Confirm that `zipf` is the large `n` limit of `zipfian`.\\n\\n    >>> from scipy.stats import zipfian\\n    >>> k = np.arange(11)\\n    >>> np.allclose(zipf.pmf(k, a), zipfian.pmf(k, a, n=10000000))\\n    True\\n\\n    \"",
              "is_json": true
            }
          },
          "key_types": {
            "__class__": "list",
            "__module__": "builtins",
            "content": [...]
          }
        }
      },
      "inverse_func": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "null",
        "is_json": true
      },
      "validate": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "false",
        "is_json": true
      },
      "accept_sparse": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "false",
        "is_json": true
      },
      "check_inverse": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "true",
        "is_json": true
      },
      "feature_names_out": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "null",
        "is_json": true
      },
      "kw_args": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "null",
        "is_json": true
      },
      "inv_kw_args": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "null",
        "is_json": true
      },
      "_sklearn_version": {
        "__class__": "str",
        "__module__": "builtins",
        "content": "\"1.1.1\"",
        "is_json": true
      }
    },
    "key_types": {
      "__class__": "list",
      "__module__": "builtins",
      "content": [...]
    }
  }
}

("key_types" omitted for readability)

@BenjaminBossan BenjaminBossan added bug Something isn't working persistence Secure persistence feature labels Oct 11, 2022
@E-Aho
Copy link
Collaborator

E-Aho commented Oct 17, 2022

In the code that @BenjaminBossan has given, the core of the error seems to come from serialising a bound method of an instance of the stats.zipf_gen class.

I tried just adding "method" as a special case in the gettype function in _utils.py, but it didn't work.

Basically, I think it's not serialising enough information for bound methods. As you can see in your state, the "pyfunc" is just a "method", but it doesn't at all say what it is a method of.

It should be serialising the object (here: an instance of stats.zipf_gen) and the method it's pointing to (here: _entropy) I think

@E-Aho
Copy link
Collaborator

E-Aho commented Oct 17, 2022

I can confirm that adding dummy MethodType dispatch functions can get the provided example working.

I'll raise a draft PR that has the code needed to get the above example working, but I think it'll need more work to make sure it works in other cases of bound methods getting serialised.

@E-Aho
Copy link
Collaborator

E-Aho commented Oct 17, 2022

Draft PR above ^ shows the fix for this, still needs a bit of work before I raise the full PR but it's there if you want to check the fix :)

@BenjaminBossan
Copy link
Collaborator Author

Besides the problem with methods, there is also a problem with circular references. Here are some simple examples that fail at the moment:

class ObjWithCircularRef:
    def __init__(self):
        self.x = [123]
        self.y = {"a": self.x}
        self.x.append(self.y)

class ObjWithCircularRef:
    def __init__(self):
        self.x = [123, self, 456]

Fixing this might also fix sklearn.cluster.Birch.

@adrinjalali
Copy link
Member

Reopening since this is not fixed IIUC.

@adrinjalali adrinjalali reopened this Oct 31, 2022
@E-Aho
Copy link
Collaborator

E-Aho commented Oct 31, 2022

Yep, bug still exists, but for a different reason now.

We can now persist methods in general, but circular references (as exist in stats.zipf) aren't handled in the right way, as @BenjaminBossan mentions above.

@E-Aho
Copy link
Collaborator

E-Aho commented Nov 11, 2022

Jumping back on this once I've merged in #209

@E-Aho
Copy link
Collaborator

E-Aho commented Nov 15, 2022

So I've looked at how Pickle handles this type of problem, and I think I understand their method.

The main takeaway is this:

In addition to a memo, Pickle's implementation creates a stack of objects and opcodes during dump and load. This comes with a few benefits, but for this issue in particular, the main benefit is that it allows Pickle to pop out of any recursive sections, and gotos to jump back to objects and continue the stack from there.

Link to part of the pickle source code that shows this with a comment

Link to part of the pickletools documentation describing the purpose of POP_MARK

I also came up with a minimal reproduction of the error:

def wrap(func):
    return func

class CircularObject:
    def a(self):
        return
    
    def __init__(self):
        self.b = wrap(self.a)

def test_circular_self_reference():
  obj = CircularObject()

  import pickle
  pkl_obj = pickle.dumps(obj)   # works fine
  
  import skops
  skp_obj = skops.io.dumps(obj)  # recursive stack error

You can also use the pickletools dis method to look at what the stack looked like for the above example. When I did this, I could see once it reaches b, and once it pushes the global null_func on the stack, it performs a binget op to return to a previously memoized position in the stack, and continues from that position on the stack.

To implement a stack like pickle has, we would need a pretty big refactor.

For the issue referenced originally, there's likely a much simpler workaround we could implement, but I also think having some a stack implementation would be a good idea to consider for a more general fix.

@adrinjalali
Copy link
Member

Could you try the code from #204 ? For me your test doesn't fail, but I haven't tested if the loaded object is actually valid.

@BenjaminBossan
Copy link
Collaborator Author

Could you try the code from #204 ? For me your test doesn't fail, but I haven't tested if the loaded object is actually valid.

It's probably a bug in the pasted code, where the variable obj is overridden by the pickle call. When you remove that, there is indeed an error (at least on main).

To implement a stack like pickle has, we would need a pretty big refactor.

Yes, we explicitly didn't want to "re-invent" the whole pickle machinery, at that point we might as well use pickle :) So hopefully, for the much narrower scope we set ourselves for skops, we find a way to avoid this problem. Absolute worst case scenario would be special casing objects known to have that problem and giving custom code an escape hatch via __getstate__ but I'm sure we can come up with something better.

@E-Aho
Copy link
Collaborator

E-Aho commented Nov 17, 2022

It's probably a bug in the pasted code, where the variable obj is overridden by the pickle call. When you remove that, there is indeed an error (at least on main).

Ah yep, I tried to combine two tests but forgot to rename the variables, my bad! I can double check it still fails on #204 this eve

@blaisethom
Copy link

Hi, I think I'm getting the same error when trying to serialise a keras model:

from tensorflow.keras import layers, models
from sklearn.pipeline import Pipeline
from skops.io import dump
model = models.Sequential(
    [
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3))
    ]
)
pipeline = Pipeline(
    [("classifier", model)]
)
dump(pipeline, 'keras-test.skops')`

Just posting in case it's helpful to anyone else and/or helps with the bug fixing!

@BenjaminBossan
Copy link
Collaborator Author

@blaisethom Thanks for reporting. Please note that keras is not supported at the moment, so I wouldn't expect it to work.

@adrinjalali
Copy link
Member

@blaisethom have you tried with scikeras? I'm not sure if it'd make a difference, but if we're to support this, scikeras would be a better place to start.

@blaisethom
Copy link

Hi, @adrinjalali - yes, I realised after posting that I should have done that and did so but it gives the same error.

@adrinjalali
Copy link
Member

@blaisethom if you're kind enough to open a new issue with a mini reproducible it'd be very helpful

@blaisethom
Copy link

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working persistence Secure persistence feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants