
Update documentation to emphasize the need to avoid broad LabelingFunctions for LabelModel #1713

Closed
rjurney opened this issue Oct 9, 2022 · 3 comments

Comments

@rjurney
Copy link

rjurney commented Oct 9, 2022

Issue description

I tried Snorkel on five different problems, and while the framework does work, MajorityLabelVoter and/or MajorityClassVoter outperformed the LabelModel on every one of them. This problem has been noted in the past. In frustration, I asked around and learned the secret to using Snorkel from two different people:

Snorkel's LabelModel doesn't work with broad coverage LabelingFunctions

This is the key to making the LabelModel work: you have to write fairly narrow LabelingFunctions that combine to achieve the coverage you need. A single LF with 50% coverage will break the LabelModel.
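The narrow-vs-broad distinction can be sketched without Snorkel itself. In Snorkel's convention an LF returns a class label or abstains (-1); coverage is the fraction of examples on which an LF does not abstain. The texts, rules, and function names below are purely illustrative, not from this issue:

```python
ABSTAIN = -1
HAM, SPAM = 0, 1

# Hypothetical toy corpus; contents are illustrative only.
texts = [
    "check out my channel",
    "subscribe to my channel",
    "great song, love it",
    "free money click here",
]

# A broad LF: fires on most of the data by ORing several noisy signals.
def lf_broad(text):
    return SPAM if any(w in text for w in ("channel", "free", "click")) else ABSTAIN

# Narrow LFs: each targets one precise pattern and abstains otherwise.
def lf_subscribe(text):
    return SPAM if "subscribe" in text else ABSTAIN

def lf_free_money(text):
    return SPAM if "free money" in text else ABSTAIN

def coverage(lf, docs):
    """Fraction of documents on which the LF does not abstain."""
    return sum(lf(d) != ABSTAIN for d in docs) / len(docs)

print(coverage(lf_broad, texts))       # 0.75 -- the kind of broad LF at issue
print(coverage(lf_subscribe, texts))   # 0.25
print(coverage(lf_free_money, texts))  # 0.25
```

The point of the advice above is to prefer many LFs like the last two, whose overlaps and conflicts give the LabelModel signal to estimate accuracies from, over one catch-all LF like the first.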

For me, this method produced roughly a 3% performance bump across two classification problems.

Big Question

Where should I make this contribution? It would make Snorkel much more useful to its users, many of whom have hit the same frustrations I did. I would like it to get picked up on the website, but the tutorials haven't been updated in a while. Is that still the right place for this PR? https://github.com/snorkel-team/snorkel-tutorials

cc @ajratner @henryre

rjurney commented Oct 12, 2022

Working on some updates; I'll make them over there. :) It looks like the website builds from the tutorials project.

ajratner (Contributor) commented

Hi @rjurney, first of all, you can check out the WRENCH benchmark/paper for examples of where and under what conditions we might expect MV to outperform a more sophisticated LabelModel. It's certainly not expected that MV would do better on every problem, but of course this depends entirely on the types of problems you are looking at...

For example, if we have a very small number of labeling functions (as was the case in the issue you linked) we wouldn't necessarily expect a learned LabelModel to outperform Majority Vote.

Another setting where both theory and empirical results (as published in all the main Snorkel papers) predict MV > learned LabelModel is when your LFs are mostly low precision, e.g. worse than random accuracy, which might indeed happen if you optimized for writing very high recall, low precision LFs. But I'll emphasize that it's not the high recall per se, but the low precision, that would be the issue here, again, as per the original Snorkel papers!
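For reference, the MV baseline being compared here is simple enough to sketch without Snorkel: per-example majority over non-abstaining votes, abstaining on ties or when no LF fires. This is a minimal stand-in for roughly what MajorityLabelVoter computes, not Snorkel's actual implementation (which, among other things, handles tie-breaking differently):

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(L):
    """L is a label matrix as a list of rows: one list of LF outputs per
    example, with -1 meaning abstain. Returns one prediction per row."""
    preds = []
    for row in L:
        votes = Counter(v for v in row if v != ABSTAIN)
        if not votes:
            preds.append(ABSTAIN)  # no LF fired
            continue
        top = votes.most_common()
        if len(top) > 1 and top[0][1] == top[1][1]:
            preds.append(ABSTAIN)  # tie between classes
        else:
            preds.append(top[0][0])
    return preds

# Toy matrix: 3 LFs voting on 4 examples.
L = [
    [1, 1, -1],    # two votes for class 1
    [0, -1, -1],   # a single vote for class 0
    [1, 0, -1],    # one-vote tie
    [-1, -1, -1],  # everyone abstains
]
print(majority_vote(L))  # [1, 0, -1, -1]
```

Since MV weights every non-abstaining LF equally, a few confident low-precision LFs can dominate it, and they mislead a learned LabelModel even more, which is the failure mode described above.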

Anyway, thanks for posting this to help users navigate this!

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.
