-
Notifications
You must be signed in to change notification settings - Fork 860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/memoize key #1572
Feature/memoize key #1572
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1572 +/- ##
==========================================
+ Coverage 97.13% 97.19% +0.06%
==========================================
Files 56 68 +12
Lines 2091 2137 +46
Branches 342 343 +1
==========================================
+ Hits 2031 2077 +46
Misses 31 31
Partials 29 29
|
@henryre I just noticed it won't work in the expected way. The snorkel pattern is to return the x_mapped so the cache will change the data point. def test_decorator_mapper_memoized_use_memoize_key(self) -> None:
square_hit_tracker = SquareHitTracker()
@lambda_mapper(memoize=True, memoize_key=lambda x: x.num)
def square(x: DataPoint) -> DataPoint:
x.num_squared = square_hit_tracker(x.num)
return x
x8 = self._get_x()
x8_mapped = square(x8)
assert x8_mapped is not None
self.assertEqual(x8_mapped.num_squared, 64)
self.assertEqual(square_hit_tracker.n_hits, 1)
x8_with_another_text = self._get_x(text="Henry is still having fun")
x8_with_another_text_mapped = square(x8_with_another_text)
assert x8_with_another_text_mapped is not None
self.assertEqual(x8_with_another_text_mapped.num_squared, 64)
self.assertEqual(square_hit_tracker.n_hits, 1)
# This should fail :/
self.assertEqual(x8_with_another_text_mapped, x8_mapped) |
Hi @Wirg, thanks for putting this up! Based on the example you put up, the expected behavior would be |
Hi @henryre , I hope you're going well. Small bump on this PR. |
Hi @henryre , Another bump for this pr. What do you want me to do ? Do you have some change ? Do you want to give up on this feature ? |
Hi @Wirg, sorry for the delay here and thanks for the reminder! Taking a look today! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wirg just a couple of small suggestions, then we can get this merged! Thanks again for working on this!
test/map/test_core.py
Outdated
x.num_squared = square_hit_tracker(x.num) | ||
return x | ||
|
||
x8 = self._get_x() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be preferable to have an example of canonical usage in this unit test. For example, using a custom unique ID:
x1 = SimpleNamespace(uid="id1", num=8, unhashable=some_unhashable_pandas_object)
x2 = SimpleNamespace(uid="id1", num=8, unhashable=some_unhashable_pandas_object)
...
and then we use memoize_key = lambda x: x.uid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wirg let me know when you can make this change, then happy to approve the PR! Thanks again for your patience!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I did the change. I am not fully satisfied. If tomorrow someone change :
get_hashable
to support pandas dataframememoize_key
not to be used
The test won't fail.
name: str, | ||
pre: List["BaseMapper"], | ||
memoize: bool, | ||
memoize_key: Optional[HashingFunction] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to prefer None
as the default instead of using get_hashable
as the default? We'd then be able to avoid the Optional
everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see 2 reasons :
- function are mutable and the "good practice" is to avoid mutable in default parameters. I don't really see a situation where we would mutate this function tho.
- we will have to import get_hashable in all the subclasses and all the wrappers of BaseMapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point about mutating functions in modules, hadn't thought of that route. Sounds good!
I finally changed the test. I am not fully satisfied. If tomorrow someone changes :
The test won't fail. |
@Wirg good thinking! You could add an additional field in the test called |
@henryre so I added a EDIT : nevermind. I rebased and those were fixed. Waiting for your review.
|
05ceb78
to
24de6cb
Compare
@henryre small bump I don't know what I should be doing regarding codecov. |
@Wirg looks like codecov was being a bit temperamental, will go ahead and merge in. Thanks for your hard work here! |
Description of proposed changes
Feature
Currently, there is no way to decide the key to be memoized when using preprocessor(memoize=True).
This leads to 2 issues :
Example : We are trying to evaluate the reliability of a paragraph in a blog post.
We could evaluate the reliability of the paragraph and of the website.
The preprocessing corresponding to those 2 tasks will share the same key for memoize, which is not ideal : a website can have a few thousand paragraphs so we will evaluate website reliability a lot more than necessary.
Result
Implementation
Add a
memoize_key : Optional[HashingFunction]
to theBaseMapper
, if provided and notNone
, it will be used instead ofget_hashable
to define the hash of the input.memoize_key
has been made accessible to the different functions providingmemoize
api.Related issue(s)
#1561
Test plan
Checklist
Need help on these? Just ask!
tox -e complex
and/ortox -e spark
if appropriate.@henryre , I will be waiting for your review.
I ran
tox -e doc
, but it did not produce any change and I have a bunch ofWARNING: toctree contains reference to nonexisting document
, is it normal ?By the way, how would you like to discuss my use case further ?