Feature/memoize key #1572

Wirg · 2020-04-08T07:03:22Z

Description of proposed changes

Feature

Currently, there is no way to choose the key used for memoization when using preprocessor(memoize=True).
This leads to two issues:

Memoization cannot be done for unhashable classes (typically a group of pandas rows); the input has to be wrapped or subclassed first.
The memoization key cannot be specific to a preprocessor.
Example: we are trying to evaluate the reliability of a paragraph in a blog post.
We could evaluate the reliability of both the paragraph and the website.
The preprocessors for those two tasks will share the same memoization key, which is not ideal: a website can have a few thousand paragraphs, so website reliability will be evaluated far more often than necessary.

Result

@preprocessor(memoize=True, memoize_key=lambda p: p.base_website_url)
def add_website_reliability(paragraph):
    paragraph.website_reliability = evaluate_reliability(paragraph.base_website_url)
    return paragraph

Implementation

Add a memoize_key: Optional[HashingFunction] parameter to BaseMapper. If it is provided and not None, it is used instead of get_hashable to compute the hash of the input.

memoize_key is exposed by the functions and classes that provide the memoize API.
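
The mechanism described above can be illustrated with a minimal, self-contained sketch. It is not Snorkel's implementation: `memoized`, `default_key`, and `evaluate_reliability` are hypothetical stand-ins (with `default_key` playing the role of get_hashable), and the URL is made up for the example.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Hashable, Optional

HashingFunction = Callable[[Any], Hashable]


def default_key(x: Any) -> Hashable:
    # Hypothetical stand-in for get_hashable: hash every attribute of the data point.
    return tuple(sorted(vars(x).items()))


def memoized(memoize_key: Optional[HashingFunction] = None):
    """Cache a mapper's output under memoize_key(x), or default_key(x) if None."""
    key_fn = memoize_key if memoize_key is not None else default_key

    def decorator(f):
        cache: Dict[Hashable, Any] = {}

        def wrapper(x):
            key = key_fn(x)
            if key not in cache:
                cache[key] = f(x)
            return cache[key]

        return wrapper

    return decorator


calls = []


def evaluate_reliability(url: str) -> float:
    # Hypothetical expensive call; records each URL it is asked about.
    calls.append(url)
    return 0.5


@memoized(memoize_key=lambda p: p.base_website_url)
def add_website_reliability(paragraph):
    paragraph.website_reliability = evaluate_reliability(paragraph.base_website_url)
    return paragraph


paragraphs = [
    SimpleNamespace(text=f"paragraph {i}", base_website_url="https://example.com")
    for i in range(3)
]
for p in paragraphs:
    add_website_reliability(p)

# Number of times the website was actually evaluated for the three paragraphs.
print(len(calls))
```

With the key set to the website URL, the three paragraphs trigger a single reliability evaluation in this sketch.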

Related issue(s)

#1561

Test plan

Checklist

Need help on these? Just ask!

I have read the CONTRIBUTING document.
I have updated the documentation accordingly.
I have added tests to cover my changes.
I have run tox -e complex and/or tox -e spark if appropriate.
All new and existing tests passed.

@henryre, I will be waiting for your review.
I ran tox -e doc, but it did not produce any change, and I got a bunch of "WARNING: toctree contains reference to nonexisting document" messages. Is that normal?
By the way, how would you like to discuss my use case further?

codecov · 2020-04-08T07:17:48Z

Codecov Report

Merging #1572 into master will increase coverage by 0.06%.
The diff coverage is 92.30%.

@@            Coverage Diff             @@
##           master    #1572      +/-   ##
==========================================
+ Coverage   97.13%   97.19%   +0.06%     
==========================================
  Files          56       68      +12     
  Lines        2091     2137      +46     
  Branches      342      343       +1     
==========================================
+ Hits         2031     2077      +46     
  Misses         31       31              
  Partials       29       29

Impacted Files	Coverage Δ
...el/classification/training/loggers/checkpointer.py	`96.90% <ø> (ø)`
snorkel/labeling/lf/core.py	`100.00% <ø> (ø)`
snorkel/preprocess/core.py	`100.00% <ø> (ø)`
snorkel/slicing/sf/nlp.py	`100.00% <ø> (ø)`
snorkel/classification/training/trainer.py	`89.83% <50.00%> (ø)`
snorkel/labeling/model/label_model.py	`95.54% <50.00%> (ø)`
snorkel/labeling/lf/nlp.py	`100.00% <100.00%> (ø)`
snorkel/map/core.py	`100.00% <100.00%> (ø)`
snorkel/preprocess/nlp.py	`86.66% <100.00%> (ø)`
snorkel/slicing/utils.py	`94.80% <100.00%> (+0.13%)`	⬆️
... and 14 more

Wirg · 2020-04-09T09:10:26Z

@henryre I just noticed it won't work as expected.

The Snorkel pattern is to return x_mapped, so a cache hit replaces the data point with the cached one.

    def test_decorator_mapper_memoized_use_memoize_key(self) -> None:
        square_hit_tracker = SquareHitTracker()

        @lambda_mapper(memoize=True, memoize_key=lambda x: x.num)
        def square(x: DataPoint) -> DataPoint:
            x.num_squared = square_hit_tracker(x.num)
            return x

        x8 = self._get_x()
        x8_mapped = square(x8)
        assert x8_mapped is not None
        self.assertEqual(x8_mapped.num_squared, 64)
        self.assertEqual(square_hit_tracker.n_hits, 1)
        x8_with_another_text = self._get_x(text="Henry is still having fun")
        x8_with_another_text_mapped = square(x8_with_another_text)
        assert x8_with_another_text_mapped is not None
        self.assertEqual(x8_with_another_text_mapped.num_squared, 64)
        self.assertEqual(square_hit_tracker.n_hits, 1)
        # This should fail :/
        self.assertEqual(x8_with_another_text_mapped, x8_mapped)
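
The behaviour in question can be reproduced outside Snorkel with a small sketch. It assumes a cache that stores the mapped object under the custom key; `memoize_by` is a hypothetical helper rather than a Snorkel function, and the text values are made up.

```python
from types import SimpleNamespace


def memoize_by(key_fn):
    # Hypothetical helper: cache the mapper's return value under key_fn(x).
    def decorator(f):
        cache = {}

        def wrapper(x):
            key = key_fn(x)
            if key not in cache:
                cache[key] = f(x)
            return cache[key]

        return wrapper

    return decorator


@memoize_by(lambda x: x.num)
def square(x):
    x.num_squared = x.num ** 2
    return x


x8 = SimpleNamespace(num=8, text="Henry has fun")
x8_other = SimpleNamespace(num=8, text="Henry is still having fun")

x8_mapped = square(x8)
x8_other_mapped = square(x8_other)

# On the cache hit, the object mapped first is returned instead of the new input.
same_object = x8_other_mapped is x8_mapped
returned_text = x8_other_mapped.text
second_input_was_mapped = hasattr(x8_other, "num_squared")
print(same_object, returned_text, second_input_was_mapped)
```

In this sketch the second call returns the first mapped object, so the second data point's text is not carried through and the second input is never mapped itself.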

henryre · 2020-04-10T18:26:34Z

Hi @Wirg, thanks for putting this up! Based on the example you put up, the expected behavior would be square(x8_with_another_text) == x8_mapped since the hashing function was (intentionally) "poorly chosen" in the test. Are you saying self.assertEqual(x8_with_another_text_mapped, x8_mapped) will trigger an AssertionError in the current implementation?

Wirg · 2020-04-28T09:16:28Z

Hi @henryre ,

I hope you're doing well.

Small bump on this PR.

Wirg · 2020-05-31T18:20:00Z

Hi @henryre ,

Another bump for this PR.

What do you want me to do?

Do you have changes to request? Do you want to give up on this feature?

henryre · 2020-05-31T18:22:41Z

Hi @Wirg, sorry for the delay here and thanks for the reminder! Taking a look today!

henryre

@Wirg just a couple of small suggestions, then we can get this merged! Thanks again for working on this!

henryre · 2020-06-01T01:09:56Z

test/map/test_core.py

+            x.num_squared = square_hit_tracker(x.num)
+            return x
+
+        x8 = self._get_x()


It might be preferable to have an example of canonical usage in this unit test. For example, using a custom unique ID:

x1 = SimpleNamespace(uid="id1", num=8, unhashable=some_unhashable_pandas_object)
x2 = SimpleNamespace(uid="id1", num=8, unhashable=some_unhashable_pandas_object)
...

and then we use memoize_key = lambda x: x.uid

Great idea.

@Wirg let me know when you can make this change, then happy to approve the PR! Thanks again for your patience!

So I made the change. I am not fully satisfied. If tomorrow someone changes:

get_hashable to support pandas DataFrames

memoize_key so that it is no longer used

the test won't fail.

henryre · 2020-06-01T01:12:11Z

snorkel/map/core.py

+        name: str,
+        pre: List["BaseMapper"],
+        memoize: bool,
+        memoize_key: Optional[HashingFunction] = None,


Any reason to prefer None as the default instead of using get_hashable as the default? We'd then be able to avoid the Optional everywhere

I see two reasons:

Functions are mutable, and the "good practice" is to avoid mutable default parameters. I don't really see a situation where we would mutate this function, though.

We would have to import get_hashable in all the subclasses and all the wrappers of BaseMapper.

Great point about mutating functions in modules, hadn't thought of that route. Sounds good!
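
The trade-off discussed in this thread can be sketched as follows. The names `Mapper`, `LowercaseMapper`, and `fallback_hash` are hypothetical and only illustrate resolving a None default in one place; they are not the classes changed in this PR.

```python
from typing import Any, Callable, Hashable, Optional

HashingFunction = Callable[[Any], Hashable]


def fallback_hash(x: Any) -> Hashable:
    # Hypothetical default hashing function, playing the role of get_hashable.
    return repr(x)


class Mapper:
    def __init__(self, memoize_key: Optional[HashingFunction] = None) -> None:
        # The None default is resolved here, in a single place.
        self._key_fn = fallback_hash if memoize_key is None else memoize_key

    def key(self, x: Any) -> Hashable:
        return self._key_fn(x)


class LowercaseMapper(Mapper):
    # A subclass forwards the Optional value and never imports the default itself.
    def __init__(self, memoize_key: Optional[HashingFunction] = None) -> None:
        super().__init__(memoize_key=memoize_key)


point = {"uid": "id1", "rows": [1, 2, 3]}
default_mapper = LowercaseMapper()
custom_mapper = LowercaseMapper(memoize_key=lambda x: x["uid"])

print(default_mapper.key(point))
print(custom_mapper.key(point))
```

Only the base class refers to the fallback function; subclasses and wrappers pass the Optional value through unchanged.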

Wirg · 2020-06-20T21:20:23Z

@henryre

I finally changed the test. I am not fully satisfied. If tomorrow someone changes:

get_hashable to support pandas DataFrames
memoize_key so that it is no longer used

the test won't fail.

henryre · 2020-06-20T22:46:22Z

@Wirg good thinking! You could add an additional field in the test called not_used and have different values for the two data points
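
A sketch of how such a guard could work, assuming a cache hit returns the object mapped first. `make_square` and `default_key` are hypothetical stand-ins, not Snorkel functions, and a plain list stands in for the unhashable pandas object.

```python
from types import SimpleNamespace


def default_key(x):
    # Hypothetical fallback that can hash anything by using the repr of all fields.
    return repr(sorted(vars(x).items()))


def make_square(memoize_key=None):
    """Build a memoized mapper plus a record of how often its body runs."""
    key_fn = memoize_key if memoize_key is not None else default_key
    cache, hits = {}, []

    def square(x):
        key = key_fn(x)
        if key not in cache:
            hits.append(x.uid)
            x.num_squared = x.num ** 2
            cache[key] = x
        return cache[key]

    return square, hits


def make_points():
    # Same uid and num; only not_used differs between the two data points.
    x1 = SimpleNamespace(uid="id1", num=8, unhashable=[1, 2], not_used=1)
    x2 = SimpleNamespace(uid="id1", num=8, unhashable=[1, 2], not_used=2)
    return x1, x2


# With memoize_key honoured, the second call is a cache hit.
square, hits = make_square(memoize_key=lambda x: x.uid)
x1, x2 = make_points()
square(x1)
keyed_result = square(x2)
keyed_hits = len(hits)

# If memoize_key were ignored, not_used would make the two keys differ.
square, hits = make_square()
x1, x2 = make_points()
square(x1)
square(x2)
unkeyed_hits = len(hits)

print(keyed_hits, unkeyed_hits, keyed_result.not_used)
```

Because not_used differs, a fallback hash that covers every field produces two distinct keys, so asserting a single hit only passes when the custom key is actually used.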

Wirg · 2020-06-21T10:54:15Z

@henryre so I added a not_used int.

EDIT: never mind. I rebased and those were fixed. Waiting for your review.

I am still encountering new test failures.
I fixed F541 (f-string used with no parameters).
I am facing a typing failure due to torch.nn.Linear usage in snorkel/slicing/utils.
I am not sure what my next steps are.
Is this already fixed, and should I rebase?

Wirg · 2020-06-28T20:59:22Z

@henryre small bump

I don't know what I should be doing regarding codecov.
Am I expected to add more tests? If so, where?

henryre · 2020-07-03T21:52:40Z

@Wirg looks like codecov was being a bit temperamental, will go ahead and merge in. Thanks for your hard work here!

brahmaneya requested a review from henryre April 9, 2020 16:59

Wirg mentioned this pull request Apr 12, 2020

Choose a memoization key in preprocessor(memoize=True) #1561

Closed

henryre reviewed Jun 1, 2020

Arnault Chazareix added 8 commits June 21, 2020 13:21

fix some typos

a52e617

add HashingFunction to snorkel.types

c50e821

add memoize_key to snorkel.map.core

8e29585

impact memoize_key in nlp preprocess, lfs, sfs

7d8d77e

add test for memoize_key

fd00dcb

fix typing with Optional[HashingFunction], black

8baee97

improve memoize_key tests to be more canonical

bc52960

black it

24de6cb

Wirg force-pushed the feature/memoize-key branch from 05ceb78 to 24de6cb Compare June 21, 2020 11:22

henryre merged commit f12338f into snorkel-team:master Jul 3, 2020
