Refactor convert_observed_data #7299

lhelleckes · 2024-05-06T09:32:05Z

Description

To improve the type hints of the convert_observed_data function and ultimately resolve issue #7277, the generator branch was moved into its own statement with an early return. This will make it easier to apply dtypes to the other data structures in the next step.

Related Issue

Closes #
Related to BUG: pm.Data does not accept dtype argument #7277

Checklist

Checked that the pre-commit linting/style checks pass
Included tests that prove the fix is effective or that the new feature works (not applicable since refactoring)
Added necessary documentation (docstrings and/or example notebooks)
If you are a pro: each commit corresponds to a relevant logical change

Type of change

📚 Documentation preview 📚: https://pymc--7299.org.readthedocs.build/en/7299/

welcome · 2024-05-06T09:32:08Z

💖 Thanks for opening this pull request! 💖 The PyMC community really appreciates your time and effort to contribute to the project. Please make sure you have read our Contributing Guidelines and filled in our pull request template to the best of your ability.

michaelosthege · 2024-05-06T11:00:29Z

Good news is that all tests except test_skewstudentt_logp worked.

Flaky tests are annoying because we can't XFAIL them (I forgot).

michaelosthege · 2024-05-08T10:59:37Z

@lhelleckes you can rebase now :)

ricardoV94 · 2024-05-16T12:49:35Z

pymc/pytensorf.py

+    if isgenerator(data):
+        return floatX(generator(data))


I think this was on purpose, for something fancy in VI. @ferrine can you confirm whether it's true and whether we still need it?

Yes, the refactor only moves the code, but if we can delete it that'd be even better!

Doesn't wrapping the generator in floatX consume it immediately?

nvm it was also done before

User-provided observed data comes through this, therefore I'm pretty certain that this generator branch is still relevant for VI.

One next step after this PR could be to split the function into two overloads - one of which applies to generator type data, and the other to all the rest.
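To make the overload idea concrete, here is a minimal sketch using typing.overload. It is not PyMC code: convert_data and GeneratorData are hypothetical names, the return types are placeholders, and the list-to-float conversion only stands in for "everything that is not a generator".

```python
from collections.abc import Generator
from typing import Any, overload


class GeneratorData:
    """Hypothetical stand-in for whatever wraps generator input."""

    def __init__(self, source: Generator) -> None:
        self.source = source  # stored as-is, not consumed


@overload
def convert_data(data: Generator) -> GeneratorData: ...
@overload
def convert_data(data: list) -> list[float]: ...


def convert_data(data: Any) -> Any:
    # Generator branch returns early, so the code below only sees
    # non-generator input and can be typed more narrowly.
    if isinstance(data, Generator):
        return GeneratorData(data)
    # Placeholder for the array-like path (assumption: a plain float cast).
    return [float(x) for x in data]
```

With the generator case split off like this, a type checker can report a distinct return type per input kind, and a dtype argument would only need to be threaded through the non-generator path.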

The more important next step should be the introduction of a dtype kwarg so we can get #7277 fixed.

Just a guess, but that might be for stochastic optimization? If so, that generator could even be infinite.
Why not (floatX(value) for value in generator)?
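For illustration, a generator expression like the one suggested stays lazy even when the source never ends. This is a plain-Python sketch: as_float is a stand-in for a per-item cast and is not PyMC's floatX, and infinite_batches is a made-up data source.

```python
from itertools import count, islice


def as_float(value):
    # Stand-in for a per-item float conversion (assumption, not pm.floatX).
    return float(value)


def infinite_batches():
    # Hypothetical endless data source, e.g. for stochastic optimization.
    yield from count(0)


# Wrapping in a generator expression does not pull any items yet.
lazy = (as_float(v) for v in infinite_batches())

# Items are converted only as they are requested.
first_three = list(islice(lazy, 3))
print(first_three)  # [0.0, 1.0, 2.0]
```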

If the output can be a generator, the type signature is wrong

No, even with a generator, the returned value is a TensorVariable:

I see some black magic.

aseyboldt · 2024-05-17T08:39:51Z

Couldn't the observed value be supposed to be an integer?

michaelosthege · 2024-05-17T14:40:10Z

Couldn't the observed value be supposed to be an integer?

Yes, in that case it will result in a NumPy array too:

>>> type(pm.floatX(5))
<class 'numpy.ndarray'>

aseyboldt · 2024-05-20T14:32:25Z

I meant that it might be an array (pytensor or numpy) with integer dtype. Unless I'm missing some context we can't just convert that to a float type.

michaelosthege · 2024-05-21T18:41:38Z

I meant that it might be an array (pytensor or numpy) with integer dtype. Unless I'm missing some context we can't just convert that to a float type.

If you follow the branching, you'll find that we made that conversion all the time already.

In my opinion we should merge this and continue adding a dtype kwarg to fix #7277.

The whole "preparing generator data for VI" logic should be refactored. I would probably even give it its own pm.GeneratorData container.
That is out of scope for this PR, though.

If y'all agree I can take the first step towards putting generator data into a pm.GeneratorData container. (to_graphviz style, deprecation warning, ...)

aseyboldt · 2024-05-22T13:39:21Z

If you follow the branching, you'll find that we made that conversion all the time already.

Sorry, I don't know what you mean. Can you point me to an example? I don't think we are automatically converting data that a user specified as an int type to a float type, are we?

michaelosthege · 2024-05-22T14:11:02Z

If you follow the branching, you'll find that we made that conversion all the time already.

Sorry, I don't know what you mean. Can you point me to an example? I don't think we are automatically converting data that a user specified as an int type to a float type, are we?

main branch:

pymc/pymc/pytensorf.py

Lines 119 to 133 in fd11cf0

    
    elif isgenerator(data):
        ret = generator(data)
    else:
        ret = np.asarray(data)
    # type handling to enable index variables when data is int:
    if hasattr(data, "dtype"):
        if "int" in str(data.dtype):
            return intX(ret)
        # otherwise, assume float:
        else:
            return floatX(ret)
    # needed for uses of this function other than with pm.Data:
    else:
        return floatX(ret)

When data is a generator, elif isgenerator(data) is the first branch that evaluates to True.
Then hasattr(data, "dtype") is False, so the function returns floatX(ret).

Whether we should do that is a separate, VI-specific question, which is IMO best dealt with by splitting the generator case off.
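The dtype dispatch described here can be traced with a small stand-alone sketch. It mirrors only the hasattr/dtype branching of the quoted snippet and reports which cast would be chosen instead of performing it; FakeArray and chosen_cast are hypothetical helpers, not part of PyMC.

```python
from dataclasses import dataclass
from inspect import isgenerator


@dataclass
class FakeArray:
    """Minimal stand-in for an array: it only carries a dtype string."""

    dtype: str


def chosen_cast(data) -> str:
    # Same decision structure as the quoted snippet, returning the name
    # of the cast that would be applied rather than applying it.
    if hasattr(data, "dtype"):
        if "int" in str(data.dtype):
            return "intX"
        return "floatX"
    # No dtype attribute: generators, lists, and plain scalars land here.
    return "floatX"


print(chosen_cast(FakeArray("int64")))     # intX
print(chosen_cast(FakeArray("float32")))   # floatX

gen = (i for i in range(3))
print(isgenerator(gen), chosen_cast(gen))  # True floatX
```

In this sketch, input that carries an integer dtype keeps an integer cast, while a generator has no dtype attribute and therefore always takes the float path.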

welcome · 2024-05-23T18:40:30Z

Congrats on merging your first pull request! 🎉 We here at PyMC are proud of you! 💖 Thank you so much for your contribution 🎁

michaelosthege added the maintenance label May 6, 2024

michaelosthege assigned lhelleckes May 6, 2024

lhelleckes force-pushed the refactor_convert_observed_data branch 2 times, most recently from b939c22 to c2f7bec Compare May 16, 2024 08:54

Refactor convert_observed data to simplify typing

b97a5b0

lhelleckes force-pushed the refactor_convert_observed_data branch from c2f7bec to b97a5b0 Compare May 16, 2024 11:02

ricardoV94 reviewed May 16, 2024


michaelosthege approved these changes May 17, 2024


michaelosthege merged commit 931a5af into pymc-devs:main May 23, 2024

michaelosthege added the no releasenotes Skipped in automatic release notes generation label May 23, 2024

michaelosthege mentioned this pull request May 23, 2024

Split convert observed data #7334

Merged

9 tasks

This was referenced Jun 17, 2024

Imputation does not work in combination with pm.Data #4441

Open

Add type annotations to convert_observed_data #4650

Closed
