PandasParallelLFApplier does not preserve the order of the rows #1524

Closed
hcvazquez opened this issue Dec 3, 2019 · 2 comments

@hcvazquez

Issue description

I'm using PandasParallelLFApplier to apply labeling functions to a pandas dataframe with 5000 rows.

Code example/repro steps

Using PandasParallelLFApplier

# Apply the LFs to the unlabeled training data
applier = PandasParallelLFApplier(lfs)
topic_labeling = applier.apply(df[:5000])
topic_labeling
output: array([
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       ...,
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1,  1]])

Same code using PandasLFApplier

output: array([
       [-1, -1, -1, -1],
       [-1,  1, -1, -1],
       [-1, -1, -1, -1],
       ...,
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1,  1]])

The second row differs: PandasLFApplier returns [-1, 1, -1, -1], while PandasParallelLFApplier returns [-1, -1, -1, -1].

Expected behavior

I would expect the same result from both appliers. Labeling coverage and overlaps are the same for both, so the difference must be in the order of the rows.

System info

  • How you installed Snorkel (conda, pip, source):
  • Build command you used (if compiling from source):
  • OS:
  • Python version: 3.6.8
  • Snorkel version: 0.9.3
  • Versions of any other relevant libraries:
    dask==2.8.1
    pandas==0.25.3
    numpy==1.16.4
henryre self-assigned this on Dec 11, 2019
@henryre
Member

henryre commented Dec 11, 2019

Hi @hcvazquez, great question. This is due to index sorting, which isn't reflected well in the docs right now (updating them is on our list). It was discussed in this Spectrum thread: https://spectrum.chat/snorkel/help/how-to-use-the-pandasparallelapplier~cf50f563-28e6-418c-93a3-337384566c13
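
For readers landing here, a minimal sketch of how index sorting can produce the mismatch described above. It uses only pandas and a toy labeling function, not Snorkel itself. The data, `toy_lf`, and the use of `sort_index()` as a stand-in for the parallel applier's partitioning step are all assumptions made for illustration, and `reset_index(drop=True)` is shown as one possible workaround rather than an officially documented fix.

```python
import pandas as pd

# Hypothetical toy data: a DataFrame whose index is out of order,
# e.g. after shuffling or sampling from a larger frame.
df = pd.DataFrame({"text": ["a", "b", "c", "d"]}, index=[2, 0, 3, 1])


def toy_lf(row):
    # Stand-in for a labeling function: votes 1 on "b", abstains (-1) otherwise.
    return 1 if row["text"] == "b" else -1


# Row-by-row application keeps the positional order of the frame.
serial = df.apply(toy_lf, axis=1).to_numpy()

# Assumption: the parallel path ends up with rows sorted by index.
# sort_index() stands in for that step here.
parallel_like = df.sort_index().apply(toy_lf, axis=1).to_numpy()

# Possible workaround: give the frame a clean 0..n-1 index before applying,
# so that index order and positional order coincide.
df_fixed = df.reset_index(drop=True)
fixed = df_fixed.sort_index().apply(toy_lf, axis=1).to_numpy()

print(serial)         # labels in the original row order
print(parallel_like)  # same labels, but in index-sorted order
print(fixed)          # matches the original row order again
```

In this sketch both outputs contain the same votes, so coverage and overlap statistics agree, but the rows no longer line up with the input frame unless the index is already sorted.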

@henryre
Member

henryre commented Dec 22, 2019

Closing for now, feel free to re-open!
