pymc3.Data converts input data to float64 type - so int data cannot later be used as an index #3813

hottwaj · 2020-02-19T17:23:42Z

Hi there guys

I'd like to create a model that I want to fit many times to different datasets for cross validation purposes.

One of my columns of input data is categorical, so I use it to index a vector of RVs depending on which category is presented in each sample of data. Something like this:

cat_mvs = [pymc3.Normal(c, mu = 0, sd = 0.05) for c in unique_categories]

cat_mv_vector = pymc3.math.stack(cat_mvs)
cat_data = pymc3.Data('Categorical Input Data', category_codes)
sample_cat_mv = cat_mv_vector[cat_data]

Note that my category_codes data is a numpy array of integers

That last line of code above triggers an error. Here's the traceback, which ends inside theano:

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/tensor/var.py in __getitem__(self, args)
    568                             TensorVariable, TensorConstant,
    569                             theano.tensor.sharedvar.TensorSharedVariable))):
--> 570                 return self.take(args[axis], axis)
    571             else:
    572                 return theano.tensor.subtensor.advanced_subtensor(self, *args)

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/tensor/var.py in take(self, indices, axis, mode)
    612 
    613     def take(self, indices, axis=None, mode='raise'):
--> 614         return theano.tensor.subtensor.take(self, indices, axis, mode)
    615 
    616     # COPYING

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/tensor/subtensor.py in take(a, indices, axis, mode)
   2448             return advanced_subtensor1(a.flatten(), indices)
   2449         elif axis == 0:
-> 2450             return advanced_subtensor1(a, indices)
   2451         else:
   2452             if axis < 0:

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/gof/op.py in __call__(self, *inputs, **kwargs)
    613         """
    614         return_list = kwargs.pop('return_list', False)
--> 615         node = self.make_node(*inputs, **kwargs)
    616 
    617         if config.compute_test_value != 'off':

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/tensor/subtensor.py in make_node(self, x, ilist)
   1701         ilist_ = theano.tensor.as_tensor_variable(ilist)
   1702         if ilist_.type.dtype not in theano.tensor.integer_dtypes:
-> 1703             raise TypeError('index must be integers')
   1704         if ilist_.type.ndim != 1:
   1705             raise TypeError('index must be vector')

TypeError: index must be integers

It seems that within pymc3.Data(), my category_codes data is being coerced to float64, which is not a valid indexing type.
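The same failure mode can be sketched without pymc3 or theano. This is only a NumPy analogy (NumPy raises IndexError where theano raises TypeError), and the array names and values are made up for illustration:

```python
import numpy as np

# Stand-in for the stacked vector of per-category RVs (hypothetical values).
cat_values = np.array([0.10, -0.20, 0.05])

# Integer category codes, one per data sample.
category_codes = np.array([0, 2, 1, 2])

# Integer codes index the vector as expected.
picked = cat_values[category_codes]

# Mimic the coercion described above: the codes become float64.
float_codes = category_codes.astype("float64")

try:
    cat_values[float_codes]
    float_index_ok = True
except IndexError:
    # NumPy rejects index arrays that are not integer (or boolean) typed.
    float_index_ok = False

# Casting back to an integer dtype restores indexing.
restored = cat_values[float_codes.astype("int64")]
```

The values survive the round trip through float64; only the dtype stops them from being usable as an index.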

Looking at the source for pymc3.Data() I think the problem is ultimately in the called function pymc3.model.pandas_to_array which converts its input data to a float on its last line, see https://github.com/pymc-devs/pymc3/blob/master/pymc3/model.py#L1495

Can pymc3.Data() and/or pymc3.model.pandas_to_array be changed to preserve the input data type?

Thanks!

hottwaj · 2020-02-19T17:29:00Z

A temporary workaround seems to be to use theano to cast my index data back to ints:

from theano import tensor as tt
sample_cat_mv = cat_mv_vector[tt.cast(cat_data, 'int8')]
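One caveat with this workaround: 'int8' only holds values up to 127, so it suits small numbers of categories. A check along the following lines could confirm that the codes fit before casting. The helper name is hypothetical and is not part of pymc3 or theano:

```python
import numpy as np

def fits_in_dtype(codes, dtype="int8"):
    """Return True if every code lies within the range of the integer dtype."""
    info = np.iinfo(np.dtype(dtype))
    codes = np.asarray(codes)
    return bool(codes.min() >= info.min and codes.max() <= info.max)

small = np.array([0, 5, 127])   # fits in int8
large = np.array([0, 5, 200])   # needs a wider integer type

small_ok = fits_in_dtype(small)
large_ok = fits_in_dtype(large)
large_ok_int16 = fits_in_dtype(large, dtype="int16")
```

If the check fails, casting to a wider type such as 'int64' instead of 'int8' avoids the problem.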

rpgoldman · 2020-02-19T17:40:22Z

I have something like this in my code:

medium_id = self.Data("medium", df_value("medium_idx", dtype='int'))

self.Data is a wrapper that lets me store some information about the data before invoking pm.Data, and df_value pulls information out of a Pandas DataFrame. Internally it does df[key].to_numpy(dtype=dtype).

So I believe that what happens here is that pm.Data, if given something that has a dtype specified, will retain that dtype.

TBQH, I am not sure why this works. The code inside pm.Data invokes pymc3.util.pandas_to_array() on its data argument, which in turn invokes pm.theanof.floatX(), and nothing along that path seems to take the dtype of the array into account.

One thing to do would be to simply sidestep pm.Data and use theano.shared()

AlexAndorra · 2020-02-19T21:47:33Z

Hi!
I think you can also use something like pm.intX(pm.Data("category_codes", category_codes)). I tested it several times to do what you want to do and it should work.

Robert's solution (using theano.shared() instead of pm.Data) is also quite elegant, depending on how familiar you are with theano!
PyMCheers 🖖

rpgoldman · 2020-02-19T21:49:12Z

FWIW, this makes me wonder why pm.Data does not accept a dtype argument.

AlexAndorra · 2020-02-19T21:58:43Z

Yeah, I wondered the same thing. Could be a useful PR. I don't have time to work on it right now but I'll keep that in mind 😉

hottwaj · 2020-02-23T12:20:14Z

Hi guys, thanks for your comments! I would be happy to have a go at a PR, but what form should the fix take? From what you said I think the choices are 1&2 and A&B below?

either:

1. changing pymc3.Data to accept a dtype kwarg
2. or, getting pymc3.Data to check the input array's dtype automatically?

and

A. calling theano.shared directly from pymc3.Data() (possibly dropping the call to pymc3.util.pandas_to_array()?)
B. or, changing pymc3.util.pandas_to_array() to accept/check for different dtypes?

Cheers

rpgoldman · 2020-02-23T16:33:07Z

My preference would be to do number 1, and ideally to have pymc3.Data check input dtype as well, and raise a UserWarning on data type mismatch (e.g., when an integer is turned into a float).
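A minimal sketch of that behaviour, using plain NumPy rather than pymc3. The function name and its float default are assumptions for illustration, not the eventual pymc3 API:

```python
import warnings

import numpy as np

def coerce_data(data, dtype=None):
    """Convert data to an array, warning if integers would become floats.

    Hypothetical helper: dtype=None mimics the coerce-to-float default
    described in this issue.
    """
    arr = np.asarray(data)
    target = np.dtype("float64" if dtype is None else dtype)
    if np.issubdtype(arr.dtype, np.integer) and np.issubdtype(target, np.floating):
        warnings.warn(
            f"integer data ({arr.dtype}) is being converted to {target}",
            UserWarning,
        )
    return arr.astype(target)

codes = np.array([0, 2, 1])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    as_float = coerce_data(codes)                # int -> float: warns
    as_int = coerce_data(codes, dtype="int64")   # explicit dtype: no warning
```

An explicit dtype keeps the integers intact, while the default path still converts but tells the user about it.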

I don't have a strong opinion on A vs. B, but I haven't looked into this yet.

The argument for this is the principle of minimal surprise (I didn't expect that integer to turn into a float!), but the argument against this is that programmers using numpy are notoriously sloppy about types, because numpy tries to be clever about this. Personally, I'm not a big fan of that approach, but my bet is that there are a ton of people who sloppily use 1 or 0 to initialize arrays, confident that they will turn into 1.0 or 0.0...
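To illustrate the kind of type cleverness at issue, here is a small NumPy sketch (the values are arbitrary):

```python
import numpy as np

ints = np.array([1, 0])        # integer literals give an integer array
floats = np.array([1, 0.0])    # a single float literal promotes the whole array

# Assigning a float into an integer array silently truncates it.
ints[0] = 0.7
```

Code that initialises an array with 1 or 0 and expects floats therefore depends on something else performing the conversion later.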

AlexAndorra · 2020-02-23T23:05:42Z

Yeah, I think the best would probably be 1 and 2 😆 But if that is too big a change, and if we assume that people are sloppy about types, then maybe 2 would be better, as it automates the process.

No strong opinion on A and B either -- maybe a slight preference for A: if theano.shared offers a robust solution, why rely on another utility? But I haven't looked into this yet and don't have the same knowledge of the code base as Robert.

rpgoldman · 2020-02-25T20:25:32Z

@hottwaj To return to this, I think 1 + B would be the right approach. I note that there is already an intX function in theanof.py that could help.

I'm not sure what dtypes we should permit, though. Anything other than just int and float? Or do we also need to handle categoricals in pm.Data()? @AlexAndorra ?

How about making a WIP pull request to get this started?

AlexAndorra · 2020-02-26T09:13:00Z

Yeah intX looks quite useful; I'm already using it when I need to use pm.Data for indexing -- pm.intX(pm.Data("category_codes", category_codes)).

I'm not sure we need any other types than int and float, as pm.Data is used either for predictors or for indexers. So, in both cases, these are numbers and I don't see how strings (categoricals, etc.) could be used, as you need them for computation in the model.
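In practice, string categories would be encoded as integer codes before they reach the model. One way to do that with plain NumPy is sketched below (the label values are made up):

```python
import numpy as np

labels = np.array(["red", "blue", "red", "green"])

# categories holds the sorted unique labels; codes holds, for each sample,
# the position of its label within categories.
categories, codes = np.unique(labels, return_inverse=True)

# Indexing the categories with the codes recovers the original labels.
round_trip = categories[codes]
```

The resulting codes array has an integer dtype, which is what an indexer needs.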

hottwaj · 2020-02-26T09:46:12Z

Great thanks guys, I will go with 1+B and add support for int alongside existing assumption of float. It will maybe take me a week or so to find a time slot to squeeze the work into. Will submit a PR to indicate WIP. Cheers!

hottwaj · 2020-02-26T10:41:22Z

Actually rather than sitting on this I have done some initial changes and submitted a PR.

pandas_to_array now accepts a dtype kwarg and assumes float by default. Only float and int types (including numpy float/int 32/64 types) are accepted.

(minor issue: should e.g. int8, int16 be accepted? is there a generic way of testing that a dtype is a numpy int/float type?)
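On the generic test: NumPy's np.issubdtype checks a dtype against the abstract np.integer and np.floating categories, which cover all widths. A small sketch follows; the helper name is hypothetical and is not what the PR uses:

```python
import numpy as np

def numeric_kind(dtype):
    """Return 'int', 'float', or None for any other dtype."""
    dt = np.dtype(dtype)
    if np.issubdtype(dt, np.integer):
        return "int"
    if np.issubdtype(dt, np.floating):
        return "float"
    return None

names = ["int8", "int16", "uint32", "int64", "float32", "float64", "bool", "complex128"]
kinds = {name: numeric_kind(name) for name in names}
```

Note that np.integer also matches unsigned types such as uint32, and that bool is not treated as an integer by this test.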

Data() also accepts a dtype kwarg. The dtype kwarg is set to None by default, indicating that Data() should try to determine the dtype for itself, in which case it uses the following logic:

use dtype of passed input data, if the input data has a dtype attr
otherwise test if input data is instance of int
otherwise assume float
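Read literally, the three rules above could be sketched as follows. This is an illustration of the described logic only, with a hypothetical function name, and is not the code from the PR:

```python
import numpy as np

def infer_dtype(data):
    """Guess a dtype name for data using the three rules listed above."""
    # Rule 1: anything carrying a dtype attribute (e.g. a numpy array) keeps it.
    if hasattr(data, "dtype"):
        return str(data.dtype)
    # Rule 2: a bare Python int is treated as integer data.
    if isinstance(data, int):
        return "int"
    # Rule 3: everything else is assumed to be float.
    return "float"

from_array = infer_dtype(np.array([1, 2], dtype="int32"))
from_int = infer_dtype(3)
from_float = infer_dtype(2.5)
from_list = infer_dtype([1, 2])  # a plain list has no dtype, so it falls through
```

Under this reading a plain Python list of ints would still be treated as float, so callers who want integer data would pass a numpy array or an explicit dtype.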

So I've deviated a bit from what we agreed and implemented more of a 2+B :) Happy to revert to an implementation in the style of 1+B if that's what you'd prefer though.

Thanks!

* Initial changes to allow pymc3.Data() to support both int and float input data (previously all input data was coerced to float) WIP for #3813
* added exception for invalid dtype input to pandas_to_array
* Refined implementation
* Finished dtype conversion handling
* Added SharedVariable option to getattr_value
* Added dtype handling to set_data function
* Added tests for pm.Data used for index variables
* Added tests for using pm.data as RV input
* Ran Black on data tests files
* Added release note
* Updated release notes
* Updated code in light of Luciano's comments
* Fixed implementation of integer checking
* Simplified implementation of type checking
* Corrected implementation for other uses of pandas_to_array

Co-authored-by: hottwaj <jonathan.a.clarke@gmail.com>

hottwaj added a commit to hottwaj/pymc3 that referenced this issue Feb 26, 2020

Initial changes to allow pymc3.Data() to support both int and float input data (previously all input data was coerced to float) WIP for pymc-devs#3813

05462f2

hottwaj mentioned this issue Feb 26, 2020

allow pymc3.Data() to support both int and float input data #3816

Closed

This was linked to pull requests May 13, 2020

allow pymc3.Data() to support both int and float input data #3816

Closed

Expand pm.Data capacities #3925

Merged

AlexAndorra removed a link to a pull request May 17, 2020

allow pymc3.Data() to support both int and float input data #3816

Closed

lucianopaz closed this as completed in #3925 May 18, 2020
