Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Training With Sparse Matrices #1629

Closed
wants to merge 19 commits into from

Conversation

talolard
Copy link

@talolard talolard commented Feb 12, 2021

Disclaimer

This PR isn't done. It does what it's supposed to do and has tests, but style and code cleanliness might not be there.

I'm not totally confidant this implementation is a fit. I'd appreciate if someone could take a look and let me know if I'm on track before I polish this. Maybe @bhancock8 who replied to the original issue ?

Description of proposed changes

Adds support for training and inference with sparse matrices.

This PR adds a few convenience functions to help the user work with sparse matrices representations of L_ind / or the objective matrix (do either have a formal name ? ).

I presume most users will call 'train_model_from_sparse_event_cooccurence', which takes a list of tuples representing L_ind indices and value (which is always 1), populate a sparse matrix and runs training.

train_model_from_sparse_event_cooccurence calls 'train_model_from_known_objective' which gets a dense numpy representation of O and trains. When I use Snorkel I call this function and calculate O elsewhere, it's faster.

Internally, there is some refactoring in LabelModel to support train_model_from_known_objective, constants are set differently and the tree and clique data calculations are moved a little.

Related issue(s)

Fixes #1625

Test plan

I wrote tests in test_sparse_data_helpers.
Basically the tests create an L matrix in standard format, and then compare the output of normal Snorkel to Sparse Snorkel.

Checklist

Need help on these? Just ask!

  • I have read the CONTRIBUTING document.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have run tox -e complex and/or tox -e spark if appropriate.
  • All new and existing tests passed.

Copy link
Member

@bhancock8 bhancock8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this draft! As you mentioned, there would be a separate pass on style, linting, etc. before this could be merged, so I mostly just took a look at the high level structure (except for where I couldn't help myself to comment on individual lines).

I can cc a couple of the original authors of the label model to double check correctness of the new methods once the structure is in a good place. From a structural standpoint, you're right that this doesn't quite match the style of the methods around it. The changes to label_model.py are fairly general, mostly breaking large blocks of code into smaller methods, which isn't a bad idea anyway, especially given the size of that file. But the helper methods aren't object-oriented like the other primary classes in labeling/model/.

There are two ways that I could see this moving forward:

  1. Move these helper methods to a sparse_label_model project under snorkel/contrib (see the README there for details), where they can be used by others as helpful methods, but are less integrated and not guaranteed the same level of long-term support.
  2. Create a SparseLabelModel class that inherits from LabelModel and adds these new functionalities as part of specialized fit methods. This keeps the new complexity away from the simple case where L is dense in label_model.py, while maintaining the object-centric design used in that module.

snorkel/labeling/model/sparse_data_helpers.py Outdated Show resolved Hide resolved
snorkel/labeling/model/sparse_data_helpers.py Outdated Show resolved Hide resolved
snorkel/types/data.py Outdated Show resolved Hide resolved
snorkel/labeling/model/label_model.py Outdated Show resolved Hide resolved
Copy link
Member

@bhancock8 bhancock8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is moving in the right direction—the class abstraction and subdirectory feel pretty good stylistically—no added complexity in the base case, but similar usage patterns.

I added a few hyper-local suggestions just from quick browsing. There are still some linting issues, of course, and I'll do a deeper pass once you feel like most of the big structural movement is done. Explicitly re-request review once tests are passing. I'm also tagging @fredsala for help reviewing the label model logic.

@codecov
Copy link

codecov bot commented Feb 19, 2021

Codecov Report

Merging #1629 (d579832) into master (ed77718) will decrease coverage by 0.94%.
The diff coverage is 82.14%.

@@            Coverage Diff             @@
##           master    #1629      +/-   ##
==========================================
- Coverage   97.21%   96.26%   -0.95%     
==========================================
  Files          68       72       +4     
  Lines        2151     2276     +125     
  Branches      345      358      +13     
==========================================
+ Hits         2091     2191     +100     
- Misses         31       52      +21     
- Partials       29       33       +4     
Impacted Files Coverage Δ
...abel_model/sparse_example_eventlist_label_model.py 47.82% <47.82%> (ø)
...parse_label_model/sparse_event_pair_label_model.py 61.53% <61.53%> (ø)
snorkel/labeling/model/label_model.py 94.58% <89.18%> (-0.97%) ⬇️
...odel/sparse_label_model/base_sparse_label_model.py 91.30% <91.30%> (ø)
...l/sparse_label_model/sparse_label_model_helpers.py 100.00% <100.00%> (ø)

@talolard
Copy link
Author

I think the coverage tool isn't picking up on some of the tests.
Static methods in

sparse_example_eventlist_label_model.py and sparse_event_pair_label_model.py get tested explicitly in the two tests I marked with @pytest.mark.complex

Copy link
Member

@bhancock8 bhancock8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really close here! There are a handful of .idea files we need to pull out of the commit, and then some small formatting tweaks. I agree—codecov doesn't seem to be giving credit for @pytest.mark.complex methods but I can see that they're actually covered. If you can address the above comments and do one more light pass for typos/spacing/formatting, I think this should be ready to land!

@@ -129,9 +129,10 @@ dmypy.json
# Editors
.vscode/
.code-workspace*

.idea/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's cool to add this to .gitignore, but then let's not add all the files in the .idea directory. Ditto with workspace.code-workspace—let's not add it to the repo, but if you want to add that type of file to .gitignore, that's fine.


Indexing throughout this module is 0 based, with the assumption that "abstains" are ommited.

When working with larger datasets, it can be convenient to load the data in sparse format. This module
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Let's move this summary of this file's purpose above the indexing disclaimer.

Case 2:
The user has a list of 3-tuples(i,j,k) such that for document i, labeling function j predicted class k.

The Case 3:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop "The" for consistency

and the user has a list of tuples (i,j) that indicate that event j occoured for example i.

Case 2:
The user has a list of 3-tuples(i,j,k) such that for document i, labeling function j predicted class k.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: space before (i,j,k)


The Case 3:
user has a list of 3-tuples (i,j,c) where i and j range over [0,num_funcs*num_classes] such that
the events i and j were observed to have co-occur c times.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extra space here and in case 5

rows = []
cols = []
data = []
cliquesets_list = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the parentheses, since it's just a list, not a tuple

if not is_augmented:
# This is the usual mode
L_shift = L + 1 # convert to {0, 1, ..., k}
self._set_constants(L_shift) # TODO - Why do we need this here ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._get_augmented_label_matrix uses at least self.cardinality, which is set in this method. Remove the TODO?

@github-actions
Copy link

github-actions bot commented Jun 2, 2021

This pull request is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LabelModel should support loading sparse matrices
2 participants