labelmodel.fit on a superset of data changes predictions of subset #1581

Closed
srimugunthan opened this issue Apr 23, 2020 · 5 comments

@srimugunthan

srimugunthan commented Apr 23, 2020

Issue description

We have a dataset in which each record has either one label or multiple labels.
To verify the label model's predictions, we filtered the original data down to the records with only one label. Running labelmodel.fit on this single-label data gave an accuracy of more than 90%.

But when we ran labelmodel.fit on the whole dataset, the accuracy on those same single-label data points dropped drastically, to 30%.

Code example/repro steps

I was able to reproduce the bug with a generated label matrix: https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb
Although the accuracy drop on the generated data is not drastic, it illustrates the scenario.

Expected behavior

The subset of data with single labels should have the same accuracy whether the label model is fit on that subset alone or on the whole dataset.

System info

Snorkel 0.9.3 on Linux

@srimugunthan
Author

srimugunthan commented Apr 24, 2020

Hi,
In the original example, in which the drop was from 90% to 30%, I found an issue in the code.
It happens only when I use PandasParallelLFApplier to get the label matrix. With PandasLFApplier it is fine.

I checked the matrices generated by PandasLFApplier and PandasParallelLFApplier, and they were different.
Below is the code from the notebook that I used to check.

df_full = pd.concat([df_single, df_multilabel])
df_full.index.is_unique
True

lm1 = applier.apply(df=df_full)
lm2 = applier_regex.apply(df=df_full, n_parallel=8)

np.array_equal(lm1, lm2)
False

Is there anything I am missing?
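The np.array_equal check above only says that the two matrices differ, not how. A minimal sketch of a follow-up diagnostic, assuming both label matrices are 2-D integer NumPy arrays of the same shape: it lists the rows that differ and tests whether the two matrices contain the same rows in a different order. The helper names differing_rows and same_rows_any_order are hypothetical and not part of Snorkel.

```python
import numpy as np


def differing_rows(lm1, lm2):
    """Return the positions of rows where the two label matrices disagree."""
    lm1, lm2 = np.asarray(lm1), np.asarray(lm2)
    if lm1.shape != lm2.shape:
        raise ValueError("label matrices must have the same shape")
    return np.flatnonzero((lm1 != lm2).any(axis=1))


def same_rows_any_order(lm1, lm2):
    """Return True if lm2 holds exactly the rows of lm1, possibly reordered."""
    lm1, lm2 = np.asarray(lm1), np.asarray(lm2)
    if lm1.shape != lm2.shape:
        return False
    # Sort the rows of each matrix lexicographically, then compare.
    # np.lexsort treats the last key as primary, so the columns are reversed.
    order1 = np.lexsort(lm1.T[::-1])
    order2 = np.lexsort(lm2.T[::-1])
    return np.array_equal(lm1[order1], lm2[order2])


# Toy example (made-up values): lm_b is lm_a with its rows shuffled.
lm_a = np.array([[0, -1], [1, 1], [-1, 0]])
lm_b = lm_a[[2, 0, 1]]
mismatched = differing_rows(lm_a, lm_b)
reordered = same_rows_any_order(lm_a, lm_b)
```

If same_rows_any_order returns True while np.array_equal returns False, the votes themselves agree and only the row order differs, which would point at row alignment rather than at the labeling functions.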

@ajratner
Contributor

Hi @srimugunthan thanks for surfacing this! At the current moment, the master branch version of Snorkel is not configured to support multi-label, though we've certainly applied Snorkel here (e.g. https://www.snorkel.org/blog/superglue / multi-task formulation...). So I'm not surprised there are some issues here. Perhaps, since Snorkel's label model is expecting a single label, it's just taking e.g. the last one per data point, but this order is getting shuffled when applied in parallel?

Either way, we'll look into this to make sure it's not an issue with PandasParallelLFApplier. If, as I suspect, it's just an issue with multi-label support, we'll put it on the roadmap!

@srimugunthan
Author

srimugunthan commented May 5, 2020

@ajratner @henryre

  1. I have checked in the spam classification example code with PandasParallelLFApplier and plain PandasLFApplier: https://github.com/srimugunthan/snorkeldebugging/blob/master/spamClassify.ipynb
    I do see that the label matrices are different, although the summary metrics are the same.

  2. Isn't the multi-task formulation meant for hierarchical labelling? For the multi-label case (same level, many labels), we used the approach suggested in this article: https://towardsdatascience.com/using-snorkel-for-multi-label-annotation-cc2aa217986a We look at the label model's predicted probabilities and pick additional labels whose probability is close to that of the maximum-probability class. Let me know if this approach can be followed.

  3. In the original example notebook I shared ( https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb ), I see the single-label accuracy shrink by 4 to 6% when multi-label data is added. This is not much, and I am not sure whether it qualifies as an issue. But you can reproduce it from the notebook and let us know your comments.
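The selection rule described in point 2 can be sketched roughly as follows, assuming probs is an array of per-class probabilities with one row per data point, such as the output of the label model's predict_proba. The function name labels_near_max and the tol threshold are hypothetical choices for illustration; neither comes from Snorkel nor from the linked article.

```python
import numpy as np


def labels_near_max(probs, tol=0.1):
    """For each row, return the classes whose probability is within tol of the row maximum."""
    probs = np.asarray(probs, dtype=float)
    top = probs.max(axis=1, keepdims=True)
    # The top class always qualifies; others qualify only if they are close to it.
    mask = probs >= top - tol
    return [np.flatnonzero(row).tolist() for row in mask]


# Toy probabilities (made-up values): the first row has two near-tied classes.
probs = np.array([[0.48, 0.45, 0.07],
                  [0.90, 0.05, 0.05]])
picked = labels_near_max(probs, tol=0.1)
```

The value of tol would have to be tuned on labeled data, since a threshold that is too loose assigns extra labels to records that are genuinely single-label.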

@henryre
Member

henryre commented May 17, 2020

Hi @srimugunthan, sorry for the delayed reply! In response to the PandasParallelLFApplier issue, I've opened up #1589. In the meantime, you can either use the standard PandasLFApplier or sort the index of the original DataFrame before using the PandasParallelLFApplier so that the index matches.
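A minimal sketch of the index-sorting workaround, under the assumption that the mismatch comes from the parallel applier returning rows in index order rather than in the DataFrame's original row order. The helper name prepare_for_parallel_applier and the toy frames are hypothetical; only pandas is used here.

```python
import pandas as pd


def prepare_for_parallel_applier(df):
    """Return df with a unique, sorted index so row order and index order agree."""
    if not df.index.is_unique:
        raise ValueError("index must be unique before sorting")
    if df.index.is_monotonic_increasing:
        return df
    return df.sort_index()


# Toy frames (made-up values): concatenation leaves the index out of order.
df_single = pd.DataFrame({"text": ["a", "b"]}, index=[10, 11])
df_multilabel = pd.DataFrame({"text": ["c", "d"]}, index=[3, 4])
df_full = pd.concat([df_single, df_multilabel])   # index order: 10, 11, 3, 4
df_sorted = prepare_for_parallel_applier(df_full)  # index order: 3, 4, 10, 11

# Both appliers would then be run on the same sorted frame, e.g.:
# lm1 = applier.apply(df=df_sorted)
# lm2 = applier_regex.apply(df=df_sorted, n_parallel=8)
```

Any gold labels used for evaluation need to be reordered the same way, so that each row of the label matrix still lines up with its label.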

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
