Feature/memoize key #1572

Merged on Jul 3, 2020 (8 commits)

Contributor

@Wirg Wirg commented Apr 8, 2020

Description of proposed changes

Feature

Currently, there is no way to choose the key used for memoization when using preprocessor(memoize=True).
This leads to two issues:

  • Memoization cannot be done for unhashable classes (typically a group of pandas rows) unless we wrap or subclass them.
  • The memoization key cannot be specific to a preprocessor.
    Example: we are trying to evaluate the reliability of a paragraph in a blog post.
    We could evaluate the reliability of both the paragraph and the website.
    The preprocessors for those two tasks will share the same memoization key, which is not ideal: a website can have a few thousand paragraphs, so we will evaluate website reliability far more often than necessary.

Result

@preprocessor(memoize=True, memoize_key=lambda p: p.base_website_url)
def add_website_reliability(paragraph):
    paragraph.website_reliability = evaluate_reliability(paragraph.base_website_url)
    return paragraph

Implementation

Add memoize_key: Optional[HashingFunction] to BaseMapper. If it is provided and not None, it is used instead of get_hashable to compute the hash of the input.

memoize_key is also exposed through the functions that provide the memoize API.
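
To make the mechanism concrete, here is a minimal sketch of a memoizing mapper that accepts an optional key function. It is not snorkel's BaseMapper: the names MemoizingMapper and default_hashable are hypothetical stand-ins introduced for illustration, the reliability score is a placeholder, and details of the real implementation (such as copying the data point before mapping) are left out.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Hashable, Optional

HashingFunction = Callable[[Any], Hashable]


def default_hashable(x: Any) -> Hashable:
    """Hypothetical stand-in for get_hashable: hash a data point by its fields."""
    return frozenset(vars(x).items())


class MemoizingMapper:
    """Simplified, hypothetical mapper; not snorkel's BaseMapper."""

    def __init__(
        self,
        f: Callable[[Any], Any],
        memoize: bool = False,
        memoize_key: Optional[HashingFunction] = None,
    ) -> None:
        self._f = f
        self.memoize = memoize
        # Fall back to the default hashing function when no key is given.
        self._memoize_key = default_hashable if memoize_key is None else memoize_key
        self._cache: Dict[Hashable, Any] = {}

    def __call__(self, x: Any) -> Any:
        if not self.memoize:
            return self._f(x)
        key = self._memoize_key(x)
        if key not in self._cache:
            self._cache[key] = self._f(x)
        return self._cache[key]


calls = []


def add_website_reliability(paragraph: Any) -> Any:
    calls.append(paragraph.base_website_url)
    # Placeholder score; a real evaluation would be expensive.
    paragraph.website_reliability = len(paragraph.base_website_url)
    return paragraph


mapper = MemoizingMapper(
    add_website_reliability,
    memoize=True,
    memoize_key=lambda p: p.base_website_url,
)
p1 = SimpleNamespace(text="first paragraph", base_website_url="example.org")
p2 = SimpleNamespace(text="second paragraph", base_website_url="example.org")
mapper(p1)
mapper(p2)  # same website, so the expensive function is not called again
```

With the key set to the website URL, the second paragraph hits the cache, so the website is evaluated once regardless of how many paragraphs it has.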

Related issue(s)

#1561

Test plan

Checklist

Need help on these? Just ask!

  • I have read the CONTRIBUTING document.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have run tox -e complex and/or tox -e spark if appropriate.
  • All new and existing tests passed.

@henryre, I will be waiting for your review.
I ran tox -e doc, but it did not produce any changes, and I get a number of "WARNING: toctree contains reference to nonexisting document" messages. Is that normal?
By the way, how would you like to discuss my use case further?

codecov bot commented Apr 8, 2020

Codecov Report

Merging #1572 into master will increase coverage by 0.06%.
The diff coverage is 92.30%.

@@            Coverage Diff             @@
##           master    #1572      +/-   ##
==========================================
+ Coverage   97.13%   97.19%   +0.06%     
==========================================
  Files          56       68      +12     
  Lines        2091     2137      +46     
  Branches      342      343       +1     
==========================================
+ Hits         2031     2077      +46     
  Misses         31       31              
  Partials       29       29              
Impacted Files                                          Coverage Δ
...el/classification/training/loggers/checkpointer.py   96.90% <ø> (ø)
snorkel/labeling/lf/core.py                             100.00% <ø> (ø)
snorkel/preprocess/core.py                              100.00% <ø> (ø)
snorkel/slicing/sf/nlp.py                               100.00% <ø> (ø)
snorkel/classification/training/trainer.py              89.83% <50.00%> (ø)
snorkel/labeling/model/label_model.py                   95.54% <50.00%> (ø)
snorkel/labeling/lf/nlp.py                              100.00% <100.00%> (ø)
snorkel/map/core.py                                     100.00% <100.00%> (ø)
snorkel/preprocess/nlp.py                               86.66% <100.00%> (ø)
snorkel/slicing/utils.py                                94.80% <100.00%> (+0.13%) ⬆️
... and 14 more

Contributor Author

Wirg commented Apr 9, 2020

@henryre I just noticed this won't work as expected.

The snorkel pattern is for the mapper to return x_mapped, so on a cache hit the returned data point is the cached one rather than the one that was passed in:

    def test_decorator_mapper_memoized_use_memoize_key(self) -> None:
        square_hit_tracker = SquareHitTracker()

        @lambda_mapper(memoize=True, memoize_key=lambda x: x.num)
        def square(x: DataPoint) -> DataPoint:
            x.num_squared = square_hit_tracker(x.num)
            return x

        x8 = self._get_x()
        x8_mapped = square(x8)
        assert x8_mapped is not None
        self.assertEqual(x8_mapped.num_squared, 64)
        self.assertEqual(square_hit_tracker.n_hits, 1)
        x8_with_another_text = self._get_x(text="Henry is still having fun")
        x8_with_another_text_mapped = square(x8_with_another_text)
        assert x8_with_another_text_mapped is not None
        self.assertEqual(x8_with_another_text_mapped.num_squared, 64)
        self.assertEqual(square_hit_tracker.n_hits, 1)
        # This should fail :/
        self.assertEqual(x8_with_another_text_mapped, x8_mapped)
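
The behavior in question can be reproduced without snorkel. The following is only a toy sketch: memoized_square is a hypothetical function keyed on x.num, and the text values are placeholders. It shows that when the cache stores the mapped data point itself, a second input with the same key gets back the first input's object, including its other fields.

```python
from types import SimpleNamespace

cache = {}


def memoized_square(x, key=lambda x: x.num):
    """Toy memoized mapper that caches the mapped data point under key(x)."""
    k = key(x)
    if k not in cache:
        x.num_squared = x.num ** 2
        cache[k] = x
    return cache[k]


x8 = SimpleNamespace(num=8, text="Henry has fun")
x8_other = SimpleNamespace(num=8, text="Henry is still having fun")

first = memoized_square(x8)
second = memoized_square(x8_other)

# The cache hit returns the first mapped object, so the second input's
# text is replaced by the first one's and x8_other itself is never mapped.
print(second is first)
print(second.text)
```

This is why the choice of key matters: every field that is not part of the key is taken from whichever data point reached the cache first.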

@brahmaneya brahmaneya requested a review from henryre April 9, 2020 16:59
Member

henryre commented Apr 10, 2020

Hi @Wirg, thanks for putting this up! Based on the example you put up, the expected behavior would be square(x8_with_another_text) == x8_mapped since the hashing function was (intentionally) "poorly chosen" in the test. Are you saying self.assertEqual(x8_with_another_text_mapped, x8_mapped) will trigger an AssertionError in the current implementation?

Contributor Author

Wirg commented Apr 28, 2020

Hi @henryre ,

I hope you're doing well.

Small bump on this PR.

Contributor Author

Wirg commented May 31, 2020

Hi @henryre ,

Another bump for this PR.

What would you like me to do?

Do you have any changes to request? Do you want to give up on this feature?

Member

henryre commented May 31, 2020

Hi @Wirg, sorry for the delay here and thanks for the reminder! Taking a look today!

Member

@henryre henryre left a comment

@Wirg just a couple of small suggestions, then we can get this merged! Thanks again for working on this!

x.num_squared = square_hit_tracker(x.num)
return x

x8 = self._get_x()
Member

It might be preferable to have an example of canonical usage in this unit test. For example, using a custom unique ID:

x1 = SimpleNamespace(uid="id1", num=8, unhashable=some_unhashable_pandas_object)
x2 = SimpleNamespace(uid="id1", num=8, unhashable=some_unhashable_pandas_object)
...

and then we use memoize_key = lambda x: x.uid
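
A self-contained sketch of that canonical usage, under stated assumptions: memoize_by is a hypothetical decorator written for this example (it is not part of snorkel), and a plain dict stands in for the unhashable pandas object.

```python
from types import SimpleNamespace


def memoize_by(key):
    """Hypothetical decorator: cache f's result under key(x)."""

    def decorator(f):
        cache = {}

        def wrapper(x):
            k = key(x)
            if k not in cache:
                cache[k] = f(x)
            return cache[k]

        return wrapper

    return decorator


# A dict is unhashable, like the pandas object in the suggestion above.
unhashable = {"rows": [1, 2, 3]}
x1 = SimpleNamespace(uid="id1", num=8, unhashable=unhashable)
x2 = SimpleNamespace(uid="id1", num=8, unhashable=unhashable)

hits = []


@memoize_by(lambda x: x.uid)
def square(x):
    hits.append(x.uid)
    x.num_squared = x.num ** 2
    return x


square(x1)
square(x2)  # same uid, so this is served from the cache
```

Because only uid is hashed, the unhashable field never has to be wrapped or subclassed.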

Contributor Author

Great idea.

Member

@Wirg let me know when you can make this change, then happy to approve the PR! Thanks again for your patience!

Contributor Author

I made the change, but I am not fully satisfied. If someone later changes:

  • get_hashable to support pandas DataFrames, or
  • the mapper so that memoize_key is no longer used,

the test won't fail.

name: str,
pre: List["BaseMapper"],
memoize: bool,
memoize_key: Optional[HashingFunction] = None,
Member

@henryre henryre Jun 1, 2020

Any reason to prefer None as the default instead of using get_hashable as the default? We'd then be able to avoid the Optional everywhere

Contributor Author

I see two reasons:

  • Functions are mutable, and the "good practice" is to avoid mutable objects as default parameters. I don't really see a situation where we would mutate this function, though.
  • We would have to import get_hashable in all the subclasses and all the wrappers of BaseMapper.
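
The second point can be illustrated with a small sketch. The class names and get_hashable_stub below are hypothetical placeholders, not snorkel code: with None as the default, the fallback is resolved in one place and subclasses or wrappers only forward the argument.

```python
from typing import Any, Callable, Hashable, Optional

HashingFunction = Callable[[Any], Hashable]


def get_hashable_stub(x: Any) -> Hashable:
    """Placeholder for the library's default hashing function."""
    return repr(x)


class Base:
    def __init__(self, memoize_key: Optional[HashingFunction] = None) -> None:
        # The default is resolved here and nowhere else.
        self.memoize_key = get_hashable_stub if memoize_key is None else memoize_key


class Wrapper(Base):
    # Wrappers forward None; they never need to reference the default.
    def __init__(
        self, name: str, memoize_key: Optional[HashingFunction] = None
    ) -> None:
        super().__init__(memoize_key=memoize_key)
        self.name = name


def by_uid(x: Any) -> Hashable:
    return x["uid"]


default_wrapper = Wrapper("default")
custom_wrapper = Wrapper("custom", memoize_key=by_uid)
```

The cost of this choice is the Optional annotation on every signature, which is the trade-off raised in the review question.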

Member

Great point about mutating functions in modules, hadn't thought of that route. Sounds good!

Contributor Author

Wirg commented Jun 20, 2020

@henryre

I finally changed the test, but I am not fully satisfied. If someone later changes:

  • get_hashable to support pandas DataFrames, or
  • the mapper so that memoize_key is no longer used,

the test won't fail.

Member

henryre commented Jun 20, 2020

@Wirg good thinking! You could add an additional field in the test called not_used and have different values for the two data points
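
One way to see why a differing not_used field closes that gap, as a rough sketch rather than the test that was merged: count_calls is a hypothetical helper, and the "full" key below only imitates a default that hashes every field. If the custom key were ignored, the two data points would hash differently and the call count would change.

```python
from types import SimpleNamespace


def count_calls(points, key):
    """Return how many times the mapped function runs when caching under key(x)."""
    cache = {}
    n_calls = 0
    for x in points:
        k = key(x)
        if k not in cache:
            n_calls += 1
            cache[k] = x.num ** 2
    return n_calls


points = [
    SimpleNamespace(uid="id1", num=8, not_used=1),
    SimpleNamespace(uid="id1", num=8, not_used=2),
]

# Keyed on uid: the second point is a cache hit.
calls_with_uid_key = count_calls(points, key=lambda x: x.uid)

# Keyed on every field: not_used differs, so both points miss the cache.
calls_with_full_key = count_calls(
    points, key=lambda x: tuple(sorted(vars(x).items()))
)
```

A test asserting a single call therefore only passes when the uid key is actually the one being used.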

Contributor Author

Wirg commented Jun 21, 2020

@henryre I added a not_used int field.

EDIT: never mind the failures described below. I rebased and they were fixed. Waiting for your review.

I am still encountering new test failures.
I fixed F541 (f-string used with no placeholders).
I am facing a typing failure due to torch.nn.Linear usage in snorkel/slicing/utils.
I am not sure what my next steps should be.
Is this already fixed, so that I should rebase?

Contributor Author

Wirg commented Jun 28, 2020

@henryre small bump

I don't know what I should do about codecov.
Am I expected to add more tests? If so, where?

Member

henryre commented Jul 3, 2020

@Wirg looks like codecov was being a bit temperamental, will go ahead and merge in. Thanks for your hard work here!

@henryre henryre merged commit f12338f into snorkel-team:master Jul 3, 2020