Support Training With Sparse Matrices #1629

talolard · 2021-02-12T23:12:12Z

Disclaimer

This PR isn't done. It does what it's supposed to do and has tests, but style and code cleanliness might not be there.

I'm not totally confident this implementation is a fit. I'd appreciate it if someone could take a look and let me know if I'm on track before I polish this. Maybe @bhancock8, who replied to the original issue?

Description of proposed changes

Adds support for training and inference with sparse matrices.

This PR adds a few convenience functions to help the user work with sparse-matrix representations of L_ind and/or the objective matrix (does either have a formal name?).

I presume most users will call 'train_model_from_sparse_event_cooccurence', which takes a list of tuples representing L_ind indices and values (the value is always 1), populates a sparse matrix, and runs training.

train_model_from_sparse_event_cooccurence calls 'train_model_from_known_objective', which takes a dense numpy representation of O and trains. When I use Snorkel, I call this function directly and calculate O elsewhere, which is faster.
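To make the "calculate O elsewhere" idea concrete, here is a rough, pure-Python sketch (not the code in this PR). It assumes O is the event co-occurrence matrix divided by the number of examples, i.e. (L_ind.T @ L_ind) / n. The function name `objective_from_event_pairs` and its parameters are hypothetical.

```python
from collections import defaultdict
from itertools import product


def objective_from_event_pairs(pairs, num_examples, num_events):
    """Build a dense objective matrix O from sparse (example, event) pairs.

    Assumption: O[a][b] is the fraction of examples in which events a and b
    both occurred, i.e. (L_ind.T @ L_ind) / num_examples.
    """
    # Group the events observed for each example.
    events_by_example = defaultdict(set)
    for example_idx, event_idx in pairs:
        events_by_example[example_idx].add(event_idx)

    # Count co-occurrences; every event also co-occurs with itself (diagonal).
    counts = [[0] * num_events for _ in range(num_events)]
    for events in events_by_example.values():
        for a, b in product(events, repeat=2):
            counts[a][b] += 1

    return [[c / num_examples for c in row] for row in counts]


# Three examples, two labeling functions, two classes -> four possible events.
pairs = [(0, 0), (0, 2), (1, 1), (1, 2), (2, 0)]
O = objective_from_event_pairs(pairs, num_examples=3, num_events=4)
```

Only the non-zero entries of L_ind are ever touched, so the dense L_ind never has to be built; the resulting O is small (num_events by num_events) regardless of the number of examples.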

Internally, there is some refactoring in LabelModel to support train_model_from_known_objective: constants are set differently, and the tree and clique data calculations are moved a little.

Related issue(s)

Fixes #1625

Test plan

I wrote tests in test_sparse_data_helpers.
Basically the tests create an L matrix in standard format, and then compare the output of normal Snorkel to Sparse Snorkel.
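A comparison like this needs a way to turn a standard L matrix into the sparse input. The sketch below is one possible conversion, not the helper used in the tests. It assumes abstains are encoded as -1 and that the event for labeling function j voting class k sits at column j * cardinality + k; the name `dense_to_event_pairs` is hypothetical.

```python
def dense_to_event_pairs(L, cardinality, abstain=-1):
    """Convert a dense label matrix into (example, event) pairs.

    Assumption: event index = labeling_function_index * cardinality + label,
    with abstains omitted entirely.
    """
    pairs = []
    for i, row in enumerate(L):
        for j, label in enumerate(row):
            if label != abstain:
                pairs.append((i, j * cardinality + label))
    return pairs


# Three examples, two labeling functions, two classes; -1 means abstain.
L = [
    [0, -1],
    [1, 1],
    [-1, 0],
]
pairs = dense_to_event_pairs(L, cardinality=2)
```

Feeding `L` to the regular model and `pairs` to the sparse one should then give the same predictions if the two code paths agree.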

Checklist

Need help on these? Just ask!

I have read the CONTRIBUTING document.
I have updated the documentation accordingly.
I have added tests to cover my changes.
I have run tox -e complex and/or tox -e spark if appropriate.
All new and existing tests passed.


bhancock8

Thanks for submitting this draft! As you mentioned, there would be a separate pass on style, linting, etc. before this could be merged, so I mostly just took a look at the high-level structure (except where I couldn't help commenting on individual lines).

I can cc a couple of the original authors of the label model to double check correctness of the new methods once the structure is in a good place. From a structural standpoint, you're right that this doesn't quite match the style of the methods around it. The changes to label_model.py are fairly general, mostly breaking large blocks of code into smaller methods, which isn't a bad idea anyway, especially given the size of that file. But the helper methods aren't object-oriented like the other primary classes in labeling/model/.

There are two ways that I could see this moving forward:

Move these helper methods to a sparse_label_model project under snorkel/contrib (see the README there for details), where they can be used by others as helpful methods, but are less integrated and not guaranteed the same level of long-term support.
Create a SparseLabelModel class that inherits from LabelModel and adds these new functionalities as part of specialized fit methods. This keeps the new complexity away from the simple case where L is dense in label_model.py, while maintaining the object-centric design used in that module.
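The second option could look roughly like the sketch below. The `LabelModel` here is a minimal stand-in written only so the example runs on its own; it is not Snorkel's class, and the method names `fit_from_objective`, `_objective_from_dense`, and `_train` are hypothetical.

```python
class LabelModel:
    """Minimal stand-in for the real LabelModel (assumption, not Snorkel code)."""

    def __init__(self, cardinality=2):
        self.cardinality = cardinality
        self.O = None
        self.trained = False

    def fit(self, L_train):
        # Dense path: derive the objective from the full label matrix.
        self.O = self._objective_from_dense(L_train)
        self._train()

    def _objective_from_dense(self, L_train):
        raise NotImplementedError("dense logic lives in the real class")

    def _train(self):
        # Placeholder for the optimization loop shared by both paths.
        self.trained = True


class SparseLabelModel(LabelModel):
    """Adds a specialized fit method without touching the dense path."""

    def fit_from_objective(self, O):
        # Sparse path: the caller supplies a precomputed objective matrix.
        self.O = O
        self._train()


model = SparseLabelModel(cardinality=3)
model.fit_from_objective([[1.0]])
```

The base class keeps its existing `fit` signature, so users with a dense L see no change, while sparse users get an extra entry point on the subclass.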

snorkel/labeling/model/sparse_data_helpers.py

snorkel/types/data.py

snorkel/labeling/model/label_model.py


bhancock8

Nice! This is moving in the right direction—the class abstraction and subdirectory feel pretty good stylistically—no added complexity in the base case, but similar usage patterns.

I added a few hyper-local suggestions just from quick browsing. There are still some linting issues, of course, and I'll do a deeper pass once you feel like most of the big structural movement is done. Explicitly re-request review once tests are passing. I'm also tagging @fredsala for help reviewing the label model logic.

snorkel/labeling/model/sparse_label_model/base_sparse_label_model.py

snorkel/labeling/model/sparse_label_model/sparse_label_model_helpers.py

snorkel/labeling/model/label_model.py

codecov · 2021-02-19T11:06:03Z

Codecov Report

Merging #1629 (d579832) into master (ed77718) will decrease coverage by 0.94%.
The diff coverage is 82.14%.

@@            Coverage Diff             @@
##           master    #1629      +/-   ##
==========================================
- Coverage   97.21%   96.26%   -0.95%     
==========================================
  Files          68       72       +4     
  Lines        2151     2276     +125     
  Branches      345      358      +13     
==========================================
+ Hits         2091     2191     +100     
- Misses         31       52      +21     
- Partials       29       33       +4

Impacted Files	Coverage Δ
...abel_model/sparse_example_eventlist_label_model.py	`47.82% <47.82%> (ø)`
...parse_label_model/sparse_event_pair_label_model.py	`61.53% <61.53%> (ø)`
snorkel/labeling/model/label_model.py	`94.58% <89.18%> (-0.97%)`	⬇️
...odel/sparse_label_model/base_sparse_label_model.py	`91.30% <91.30%> (ø)`
...l/sparse_label_model/sparse_label_model_helpers.py	`100.00% <100.00%> (ø)`

talolard · 2021-02-19T11:25:23Z

I think the coverage tool isn't picking up on some of the tests. Static methods in sparse_example_eventlist_label_model.py and sparse_event_pair_label_model.py get tested explicitly in the two tests I marked with @pytest.mark.complex.

bhancock8

Looking really close here! There are a handful of .idea files we need to pull out of the commit, and then some small formatting tweaks. I agree—codecov doesn't seem to be giving credit for @pytest.mark.complex methods but I can see that they're actually covered. If you can address the above comments and do one more light pass for typos/spacing/formatting, I think this should be ready to land!

bhancock8 · 2021-02-23T00:51:48Z

.gitignore

@@ -129,9 +129,10 @@ dmypy.json
 # Editors
 .vscode/
 .code-workspace*
-
+.idea/


It's cool to add this to .gitignore, but then let's not add all the files in the .idea directory. Ditto with workspace.code-workspace—let's not add it to the repo, but if you want to add that type of file to .gitignore, that's fine.

bhancock8 · 2021-03-03T22:05:22Z

snorkel/labeling/model/sparse_label_model/base_sparse_label_model.py

+
+Indexing throughout this module is 0 based, with the assumption that "abstains" are ommited.
+
+When working with larger datasets, it can be convenient to load the data in sparse format. This module


Nit: Let's move this summary of this file's purpose above the indexing disclaimer.

bhancock8 · 2021-03-03T22:05:29Z

snorkel/labeling/model/sparse_label_model/base_sparse_label_model.py

+Case 2:
+    The user has a list of 3-tuples(i,j,k) such that for document i, labeling function j predicted class k.
+
+The Case 3:


Drop "The" for consistency

bhancock8 · 2021-03-03T22:05:46Z

snorkel/labeling/model/sparse_label_model/base_sparse_label_model.py

+    and the user has a list of tuples (i,j) that indicate that event j occoured for example i.
+
+Case 2:
+    The user has a list of 3-tuples(i,j,k) such that for document i, labeling function j predicted class k.


Nit: space before (i,j,k)

bhancock8 · 2021-03-03T22:06:09Z

snorkel/labeling/model/sparse_label_model/base_sparse_label_model.py

+
+The Case 3:
+    user has a list of 3-tuples (i,j,c) where i and j range over [0,num_funcs*num_classes] such that
+    the events  i and j were observed to have co-occur c times.


Nit: extra space here and in case 5

bhancock8 · 2021-03-03T22:07:33Z

snorkel/labeling/model/sparse_label_model/base_sparse_label_model.py

+        rows = []
+        cols = []
+        data = []
+        cliquesets_list = (


Remove the parentheses, since it's just a list, not a tuple
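For reference, a small illustration of the point (the variable names are made up, not taken from the diff): wrapping an expression in parentheses only groups it, and a tuple needs a comma.

```python
rows = [0, 1, 2]

# Parentheses around a list expression only group it; the result is a list.
wrapped = ([r * 2 for r in rows])
plain = [r * 2 for r in rows]

# A tuple requires a comma, so this is a one-element tuple holding a list.
as_tuple = ([r * 2 for r in rows],)
```

Since `wrapped` and `plain` are identical, the parentheses are harmless but misleading to a reader expecting a tuple.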

bhancock8 · 2021-03-03T22:10:43Z

snorkel/labeling/model/label_model.py

+        if not is_augmented:
+            # This is the usual mode
+            L_shift = L + 1  # convert to {0, 1, ..., k}
+            self._set_constants(L_shift)  # TODO - Why do we need this here ?


self._get_augmented_label_matrix uses at least self.cardinality, which is set in this method. Remove the TODO?

github-actions · 2021-06-02T12:23:59Z

This pull request is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

talolard added 4 commits February 12, 2021 19:19

Set up basic scaffolding for sparse amtrix support

b1c805a

Added logic and tests to load L_ind as a list of tuples

9624761

refactor clique data creation so that it happens during create_tree

e0c3f2d

Added a function that predicts probs from a 'cliqueset' so that we don't do the same calulcation twice

212db85

bhancock8 reviewed Feb 14, 2021


talolard added 5 commits February 14, 2021 16:06

sparse predictor returns a dict of [tuple,list]

dd3b0c7

refactor to SparseLabelModel class

b7352fd

Continued refactoring towards classes and improved tests

c7692bf

Moved KnownDimensions type to sparse_label_model

664ac6e

Differentiate between different kinds of sparse inputs

bcb39f7

talolard mentioned this pull request Feb 15, 2021

TestLabelModelAdvanced fails when changing cardinality #1631

Closed

talolard added 6 commits February 15, 2021 16:18

Added tests that compare a sparse models output to regular model

17e935e

Added tests to check that event sparse model trains the same as a regular model

133c1d2

Added documentation

8996bec

Pass mypy checks

c46a121

Pass mypy checks

01f6b56

Pass mypy checks

a2f267a

bhancock8 reviewed Feb 16, 2021


bhancock8 requested a review from fredsala February 16, 2021 21:51

talolard added 4 commits February 19, 2021 10:48

Comply with tox docstrings

735ab51

Ensure seed setting happens first in training

e190d04

Resolve 'Nits'

f9b2629

Pass tox

d579832

bhancock8 reviewed Mar 3, 2021


github-actions bot added the no-pr-activity label Jun 2, 2021

github-actions bot closed this Jun 10, 2021


talolard commented Feb 12, 2021 • edited by henryre
