LabelModel produces equal probability for labeled data #1422
Comments
Thanks for pointing this out!

To see the logging information when verbose=True, configure Python's logging before fitting:

import logging
logging.basicConfig(level=logging.INFO)

Example:

import numpy as np
from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
L = np.array([[0, 1], [0, 1]])
label_model.fit(L)
label_model.predict_proba(L)
# array([[0.5, 0.5], [0.5, 0.5]])
label_model.predict(L, tie_break_policy="abstain")
# array([-1, -1])
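As a side note (my own addition, not from the thread): in this particular toy matrix the two LFs disagree on every row, so a 50/50 output is arguably expected here; a plain majority vote over the same matrix gives the same tie.

import numpy as np
from snorkel.labeling import MajorityLabelVoter

L = np.array([[0, 1], [0, 1]])   # same toy matrix as above
mv = MajorityLabelVoter(cardinality=2)
mv.predict_proba(L)
# array([[0.5, 0.5], [0.5, 0.5]])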
I took a look at the logs, and the results are very strange:
So, the loss becomes
I have a similar issue with probs_train outputting [0.5, 0.5]. The prob for a
Here are the logs for training the LabelModel with roughly 10,000 data points.
@vtang13 can you provide an example of the L matrix that will help reproduce this error? Can you also print

I tried creating a matrix with the LFs that you have and did not get the same results:

import numpy as np
from snorkel.labeling import LabelModel

L = np.array([[1, -1, -1, -1, 1, -1, -1],
              [1, -1, -1, -1, -1, -1, -1],
              [-1, -1, -1, -1, -1, 0, -1]])
label_model = LabelModel(cardinality=2)
label_model.fit(L)
label_model.predict_proba(L)
# array([[0.019964  , 0.980036  ],
#        [0.1229839 , 0.8770161 ],
#        [0.98669756, 0.01330244]])
Hi @paroma
It looks like the issue may occur when there are too many negative examples. It doesn't occur when the dataset is more balanced. Below is a subset of my L_train with more balanced positive/negative examples.
Training results:
This looks like the expected output! If you think it is an issue related to class balance, you can try passing in an estimate of the class balance.
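As a concrete illustration of that suggestion (my sketch, not code from this thread), assuming it refers to the class_balance argument of LabelModel.fit, and assuming L_train is the user's label matrix; the [0.9, 0.1] prior is a made-up placeholder for a negative-heavy dataset:

from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
# Hypothetical prior: ~90% of points belong to class 0, ~10% to class 1.
label_model.fit(L_train, class_balance=[0.9, 0.1])
probs_train = label_model.predict_proba(L_train)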
Thanks @paroma, I don't have labelled data yet but I will try those parameters once I do. Here is the full L_train to reproduce the issue: https://pastebin.com/02mEznra
Thank you for access to the L matrix! Looking at the full L_train and running
That's very informative. Thanks @paroma for looking into this.
Just quickly tacking onto the great answer from @paroma: one thing we are working on is making sure that the
For example, in your setting we probably would want to have a higher prior weight on the LFs over the class balance... we'll iterate here and push some updates soon!
(Note: Marking "feature request" for our reference, as technically the
Hi, I have observed the same behavior of the LabelModel: one LF (out of 15) which was highly accurate but had low coverage was ignored (resulting in 0.5/0.5 probabilities). In fact, I wanted to ask whether it would be possible to specify some priors about the LFs by hand in the LabelModel. In my case, I'm 99% sure that this LF gives the correct label even though it has low coverage, and I'd like the LabelModel not to lose this knowledge.
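There is no built-in way (that I know of) to hand-set per-LF priors in this version, but as a rough workaround sketch one can inspect the learned per-LF weights and manually override the output wherever a fully trusted LF fired. This is my own illustration, not a LabelModel feature: it assumes label_model is the fitted LabelModel, L_train is your label matrix, and the LF index 3 is a made-up example.

import numpy as np

weights = label_model.get_weights()     # learned accuracy estimate for each LF
print(weights)                          # check whether the trusted LF is being down-weighted

trusted_lf = 3                          # hypothetical column index of the trusted LF
probs = label_model.predict_proba(L_train)
fired = L_train[:, trusted_lf] != -1    # rows where the trusted LF did not abstain
probs[fired] = np.eye(2)[L_train[fired, trusted_lf]]  # force that LF's label on those rows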
Hi @s2948044 @vtang13 @xsway, thanks first of all for bringing this issue to light in such detail, and @vtang13 for sharing your label matrix! It turns out that this is actually a bug due to incorrect parameter clamping; I've corrected this (it's a one-line fix, more or less), confirmed the fix on @vtang13's label matrix and on a new synthetic test that replicates the problem and confirms the solution. PR being submitted right now! Thank you all for the great help here!
Just in case anyone is curious... the
Note that what @paroma said above about the
Note also that we'll be steadily pushing extensions, improvements, and additions to the
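For readers curious what "parameter clamping" refers to: it simply means keeping the learned parameters inside a valid probability range during training. The actual bounds used in the fix are not shown in this thread, so the snippet below is only a generic illustration with placeholder bounds, not the code from the PR.

import numpy as np

def clamp_params(mu: np.ndarray, low: float = 0.01, high: float = 0.99) -> np.ndarray:
    """Clip estimated conditional probabilities into (low, high) so no LF is driven
    to a degenerate 'always right' / 'always wrong' estimate. Bounds are placeholders."""
    return np.clip(mu, low, high)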
Thanks a lot for addressing this issue so fast! I checked the updated code and it now works as expected: I get > 0.5 probability for the positive class in the cases where the low-coverage LF was activated (and where before I had 0.5 probability).
Thanks @ajratner for the detailed explanation and speedy resolution!
Thanks for pointing this out and helping us to improve the new version! :)
Issue description
I am using Snorkel to create binary text classification training examples with 9 labeling functions. However, I find that some data points trained with the label model receive a probabilistic label with equal probabilities (i.e. [0.5 0.5]), despite receiving votes from only one class across the labeling functions (e.g. [-1 0 -1 -1 0 -1 -1 -1 -1], so only class 0 or ABSTAIN). Why is that?

Besides, I find that setting verbose = True when defining the LabelModel does not print the logging information.

Lastly, if producing the label [0.5 0.5] is normal behavior, then such data points should also be removed when filtering out unlabeled data points, because they do not contribute to the training of a classifier; and with a classifier that does not support probabilistic labels (e.g. FastText), using argmax on [0.5 0.5] will always lead to class 0 (which is undesired).
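To make that last point concrete, here is a small sketch (my own illustration, not code from this issue) of dropping such tied points before training a non-probabilistic classifier; it assumes label_model is the fitted LabelModel, L_train is the label matrix, and texts is a hypothetical list of raw training examples aligned with L_train.

import numpy as np

# Ties come back as -1 with tie_break_policy="abstain", so they can be filtered
# out together with genuinely unlabeled points.
preds = label_model.predict(L_train, tie_break_policy="abstain")
keep = preds != -1
texts_kept = [t for t, k in zip(texts, keep) if k]
labels_kept = preds[keep]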
Code example/repro steps
Below I show my code for defining the LabelModel:

Below I show some of the logs I print:
Expected behavior
I expect that if a data point receives labels from only a single class, then it should not get equal probabilities for both classes once the label model is trained.
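In the spirit of that expectation, a hypothetical synthetic check (my own sketch; not the regression test added in the fix) would look roughly like this:

import numpy as np
from snorkel.labeling import LabelModel

# Each row gets votes from at most one class, so the predicted probability for
# that class is expected to exceed 0.5 once the model is trained.
L = np.array([
    [0, -1, -1],
    [0, 0, -1],
    [-1, 1, 1],
])
label_model = LabelModel(cardinality=2)
label_model.fit(L, seed=123)
print(label_model.predict_proba(L))  # expectation: rows 0-1 lean toward class 0, row 2 toward class 1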
System info