
Update documentation to emphasize the need to avoid broad LabelingFunctions for LabelModel #1713

Closed
rjurney opened this issue Oct 9, 2022 · 3 comments

Comments

@rjurney
Copy link

rjurney commented Oct 9, 2022

Issue description

I tried Snorkel on five different problems, and while the framework does work, MajorityLabelVoter and/or MajorityClassVoter outperformed the LabelModel on every one of them. This problem has been noted in the past. In frustration, I asked around and learned the secret to using Snorkel from two different people:

Snorkel's LabelModel doesn't work with broad coverage LabelingFunctions

This is the key to making the LabelModel work: you have to write fairly narrow LabelingFunctions that combine to achieve the coverage you need. A single LF with 50% coverage will break the LabelModel.
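The narrow-vs-broad distinction can be sketched without Snorkel itself. In Snorkel's convention an LF returns a class label or abstains (-1); coverage is the fraction of examples on which an LF does not abstain. The texts, rules, and function names below are purely illustrative, not from this issue:

```python
ABSTAIN = -1
HAM, SPAM = 0, 1

# Hypothetical toy corpus; contents are illustrative only.
texts = [
    "check out my channel",
    "subscribe to my channel",
    "great song, love it",
    "free money click here",
]

# A broad LF: fires on most of the data by ORing several noisy signals.
def lf_broad(text):
    return SPAM if any(w in text for w in ("channel", "free", "click")) else ABSTAIN

# Narrow LFs: each targets one precise pattern and abstains otherwise.
def lf_subscribe(text):
    return SPAM if "subscribe" in text else ABSTAIN

def lf_free_money(text):
    return SPAM if "free money" in text else ABSTAIN

def coverage(lf, docs):
    """Fraction of documents on which the LF does not abstain."""
    return sum(lf(d) != ABSTAIN for d in docs) / len(docs)

print(coverage(lf_broad, texts))       # 0.75 -- the kind of broad LF at issue
print(coverage(lf_subscribe, texts))   # 0.25
print(coverage(lf_free_money, texts))  # 0.25
```

The point of the advice above is to prefer many LFs like the last two, whose overlaps and conflicts give the LabelModel signal to estimate accuracies from, over one catch-all LF like the first.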

For me, this method produced roughly a 3% performance bump across two classification problems.

Big Question

Where should I make this contribution? It would make Snorkel much more useful to its users, many of whom have hit the same frustrations I did. I would like it to get picked up on the website, but the tutorials haven't been updated in a while. Is that still the right place for this PR? https://github.com/snorkel-team/snorkel-tutorials

cc @ajratner @henryre

rjurney commented Oct 12, 2022

Working on some updates; I'll make them over there. :) It looks like the website builds from the tutorials project.

ajratner (Contributor) commented

Hi @rjurney, first of all, you can check out the WRENCH benchmark/paper for examples of where and under what conditions we might expect MV to outperform a more sophisticated LabelModel. It's certainly not expected that MV would do better on every problem, but of course this depends entirely on the types of problems you are looking at...

For example, if we have a very small number of labeling functions (as was the case in the issue you linked) we wouldn't necessarily expect a learned LabelModel to outperform Majority Vote.

Another setting where both theory and empirical results (as published in all the main Snorkel papers) predict MV > learned LabelModel is when your LFs are mostly low precision, e.g. worse than random accuracy, which might indeed happen if you optimized for writing very high recall, low precision LFs. But I'll emphasize that it's not the high recall per se, but the low precision, that would be the issue here, again, as per the original Snorkel papers!
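For reference, the MV baseline being compared here is simple enough to sketch without Snorkel: per-example majority over non-abstaining votes, abstaining on ties or when no LF fires. This is a minimal stand-in for roughly what MajorityLabelVoter computes, not Snorkel's actual implementation (which, among other things, handles tie-breaking differently):

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(L):
    """L is a label matrix as a list of rows: one list of LF outputs per
    example, with -1 meaning abstain. Returns one prediction per row."""
    preds = []
    for row in L:
        votes = Counter(v for v in row if v != ABSTAIN)
        if not votes:
            preds.append(ABSTAIN)  # no LF fired
            continue
        top = votes.most_common()
        if len(top) > 1 and top[0][1] == top[1][1]:
            preds.append(ABSTAIN)  # tie between classes
        else:
            preds.append(top[0][0])
    return preds

# Toy matrix: 3 LFs voting on 4 examples.
L = [
    [1, 1, -1],    # two votes for class 1
    [0, -1, -1],   # a single vote for class 0
    [1, 0, -1],    # one-vote tie
    [-1, -1, -1],  # everyone abstains
]
print(majority_vote(L))  # [1, 0, -1, -1]
```

Since MV weights every non-abstaining LF equally, a few confident low-precision LFs can dominate it, and they mislead a learned LabelModel even more, which is the failure mode described above.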

Anyway, thanks for posting this to help users navigate this!

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.
