Refactor convert_observed_data #7299

Merged


Contributor

@lhelleckes lhelleckes commented May 6, 2024

Description

To improve the type hints of the convert_observed_data function, and ultimately to resolve issue #7277, the generator branch was moved into its own statement with an early return. This will make it easier to apply dtypes to the other data structures in the next step.

Related Issue

Checklist

Type of change

  • New feature / enhancement
  • Bug fix
  • Documentation
  • Maintenance
  • Other (please specify):

📚 Documentation preview 📚: https://pymc--7299.org.readthedocs.build/en/7299/


welcome bot commented May 6, 2024

💖 Thanks for opening this pull request! 💖 The PyMC community really appreciates your time and effort to contribute to the project. Please make sure you have read our Contributing Guidelines and filled in our pull request template to the best of your ability.

@michaelosthege
Member

The good news is that all tests except test_skewstudentt_logp passed.

Flaky tests are annoying because we can't XFAIL them (I forgot).

@michaelosthege
Member

@lhelleckes you can rebase now :)

@lhelleckes lhelleckes force-pushed the refactor_convert_observed_data branch 2 times, most recently from b939c22 to c2f7bec Compare May 16, 2024 08:54
@lhelleckes lhelleckes force-pushed the refactor_convert_observed_data branch from c2f7bec to b97a5b0 Compare May 16, 2024 11:02
Comment on lines +84 to +85
if isgenerator(data):
    return floatX(generator(data))
Member

I think this was on purpose, for something fancy in VI. @ferrine can you confirm whether it's true and whether we still need it?

Member

Yes, the refactor only moves the code, but if we can delete it that'd be even better!

Member

Doesn't wrapping the generator in floatX consume it immediately?

Member

nvm it was also done before

Member

User-provided observed data comes through this, therefore I'm pretty certain that this generator branch is still relevant for VI.

One next step after this PR could be to split the function into two overloads - one of which applies to generator type data, and the other to all the rest.

The more important next step should be the introduction of a dtype kwarg so we can get #7277 fixed.
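A minimal sketch of what splitting the function into two overloads could look like, written in plain Python so it runs without PyMC. The names `convert_data` and `LazyStream` are hypothetical, and `float` stands in for the real conversion; this is not the PyMC implementation.

```python
from inspect import isgenerator
from typing import Any, Generator, Iterable, List, overload


class LazyStream:
    """Hypothetical stand-in for whatever the generator branch returns."""

    def __init__(self, source: Generator[Any, None, None]) -> None:
        self.source = source


@overload
def convert_data(data: Generator[Any, None, None]) -> LazyStream: ...
@overload
def convert_data(data: Iterable[float]) -> List[float]: ...


def convert_data(data):
    # Generator input: return early without iterating over it.
    if isgenerator(data):
        return LazyStream(data)
    # Everything else: materialize and convert eagerly.
    return [float(x) for x in data]
```

With the generator case declared first, a type checker can report a different return type for generator input than for all other input, which is the point of the proposed split.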

Member

Just a guess, but that might be for stochastic optimization? If so, that generator could even be infinite.
Why not (floatX(value) for value in generator)?
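To illustrate the suggestion in plain Python: a generator expression converts each value only when it is requested, so even an endless source is never consumed up front. Here `float` stands in for floatX and `batches` is a hypothetical data source; as the following comments note, PyMC's own generator wrapper works differently.

```python
from itertools import count, islice


def batches():
    """Hypothetical endless data source, e.g. minibatches for stochastic optimization."""
    for i in count():
        yield i


# float() stands in for floatX. Building the expression does not pull
# any value from the source; conversion happens one item at a time.
lazy = (float(value) for value in batches())

first_three = list(islice(lazy, 3))
print(first_three)  # [0.0, 1.0, 2.0]
```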

Member

If the output can be a generator, the type signature is wrong

Member

No, even with a generator, the returned value is a TensorVariable:

[Screenshot]

Member

I see some black magic.

@aseyboldt
Member

Couldn't the observed value be supposed to be an integer?

@michaelosthege
Member

> Couldn't the observed value be supposed to be an integer?

Yes, in that case it will result in a NumPy array too:

>>> type(pm.floatX(5))
<class 'numpy.ndarray'>

@aseyboldt
Member

I meant that it might be an array (pytensor or numpy) with integer dtype. Unless I'm missing some context we can't just convert that to a float type.

@michaelosthege
Member

> I meant that it might be an array (pytensor or numpy) with integer dtype. Unless I'm missing some context we can't just convert that to a float type.

If you follow the branching, you'll find that we made that conversion all the time already.

In my opinion we should merge this and continue adding a dtype kwarg to fix #7277.


The whole "preparing generator data for VI" should be refactored. I would probably even give it its own pm.GeneratorData container.
This is not the scope of this PR though.

If y'all agree I can take the first step towards putting generator data into a pm.GeneratorData container. (to_graphviz style, deprecation warning, ...)
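A rough sketch of how such a container and a deprecation path might fit together, assuming bare generators stay accepted for a transition period. The class body, the function name `convert_observed`, and the warning text are all hypothetical; nothing here reflects an existing PyMC API.

```python
import warnings
from inspect import isgenerator


class GeneratorData:
    """Hypothetical container that marks a generator as a lazy data source."""

    def __init__(self, source):
        if not isgenerator(source):
            raise TypeError("GeneratorData expects a generator")
        self.source = source


def convert_observed(data):
    """Toy stand-in for the conversion function, not the PyMC implementation."""
    if isinstance(data, GeneratorData):
        return data
    if isgenerator(data):
        # Transition period: still accept bare generators, but warn.
        warnings.warn(
            "Passing a bare generator is deprecated; wrap it in GeneratorData.",
            DeprecationWarning,
            stacklevel=2,
        )
        return GeneratorData(data)
    # Non-generator data is converted eagerly; float() stands in for floatX.
    return [float(x) for x in data]
```

Keeping the generator case behind an explicit container would let the main conversion function drop its generator branch once the deprecation period ends.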

@aseyboldt
Member

> If you follow the branching, you'll find that we made that conversion all the time already.

Sorry, I don't know what you mean. Can you point me to an example? I don't think we automatically convert data that a user specified as an int type to a float type, do we?

@michaelosthege
Member

> > If you follow the branching, you'll find that we made that conversion all the time already.
>
> Sorry, I don't know what you mean. Can you point me to an example? I don't think we automatically convert data that a user specified as an int type to a float type, do we?

main branch:

pymc/pymc/pytensorf.py

Lines 119 to 133 in fd11cf0

elif isgenerator(data):
    ret = generator(data)
else:
    ret = np.asarray(data)
# type handling to enable index variables when data is int:
if hasattr(data, "dtype"):
    if "int" in str(data.dtype):
        return intX(ret)
    # otherwise, assume float:
    else:
        return floatX(ret)
# needed for uses of this function other than with pm.Data:
else:
    return floatX(ret)

When data is a generator, isgenerator(data) is the first condition that evaluates to True.
Then hasattr(data, "dtype") is False, so the function returns floatX(ret).

Whether we should do that is a separate, VI-specific question, which is IMO best dealt with by separating out the generator case.
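The dtype check in the snippet quoted above can be traced with a small stand-alone sketch. It mirrors only that check and returns the name of the cast instead of calling it; `FakeIntArray` and `chosen_cast` are hypothetical helpers, not PyMC code.

```python
class FakeIntArray:
    """Hypothetical stand-in for an array-like object with an integer dtype."""

    dtype = "int64"


def chosen_cast(data) -> str:
    """Name the cast that the dtype check quoted above would select."""
    if hasattr(data, "dtype"):
        if "int" in str(data.dtype):
            return "intX"
        return "floatX"
    # No dtype attribute: generators, lists, and scalars all land here.
    return "floatX"


gen = (i for i in range(3))
print(chosen_cast(gen))             # floatX
print(chosen_cast(FakeIntArray()))  # intX
print(chosen_cast([1, 2, 3]))       # floatX
```

Because the check inspects the original input rather than the converted array, anything without a dtype attribute takes the float path, including a generator and a plain list of Python integers.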

@michaelosthege michaelosthege merged commit 931a5af into pymc-devs:main May 23, 2024
20 checks passed

welcome bot commented May 23, 2024

Congrats on merging your first pull request! 🎉 We here at PyMC are proud of you! 💖 Thank you so much for your contribution 🎁

@michaelosthege michaelosthege added the no releasenotes Skipped in automatic release notes generation label May 23, 2024
@michaelosthege michaelosthege mentioned this pull request May 23, 2024
9 tasks
Labels
maintenance no releasenotes Skipped in automatic release notes generation
4 participants