Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In model_builder, _validate_data changes input type #314

Open
donnut opened this issue Feb 27, 2024 · 1 comment
Open

In model_builder, _validate_data changes input type #314

donnut opened this issue Feb 27, 2024 · 1 comment

Comments

@donnut
Copy link

donnut commented Feb 27, 2024

When calling predict_posterior the line

X_pred = self._validate_data(X_pred)

validates the input data using sklearn's check_array. This check function does not only check, but also modifies the data. If the input is a pandas DataFrame, the output is a numpy array. This may causes problems in _data_setter, an abstract function implemented by the user, as the user expects a DataFrame back.

Is this done intentionally? I would like to suggest to return data of the same type as the input data.

@pdb5627
Copy link
Contributor

pdb5627 commented Mar 1, 2024

It is intentional, but may not be the best way forward. When I worked on ModelBuilder, I was focused on getting it to be able to integrate into a scikit-learn-based Pipeline. It's been a while, but if I remember right, scikit-learn transformers and estimators accept a variety of input types and then "normalize" them to numpy arrays for the computations. So that was the idea of _validate_data. All the base _validate_data does is call check_array or check_X_y from sklearn.utils.validation. I suppose it would be just as easy to call the validation function directly rather than through _validate_data. In my own fork of this code, I actually don't use _validate_data or check_array or check_X_y at all. In my use, the input is always pandas objects rather than numpy arrays. I have to call set_output(transform='pandas') on the transformers in the pipeline, but it works for me.

From comments in some other issues (#196, #246), it seems there is a desire to refactor ModelBuilder to suit a more general use case, and then specialize it for scikit-learn as an additional class/sub-class. Seems like a good idea. There are other API's besides scikit-learn, for example sktime, which also might be useful to work with ModelBuilder-based PYMC models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants