
Revisions to crowdsourcing tutorial #64

Merged · 3 commits into master · Aug 11, 2019
Conversation

bhancock8 (Member)

Made a pass over the Crowdsourcing Tutorial.

  • Mostly local edits
  • A couple renamings ("answers" -> "labels", "crowd worker" -> "crowdworkers", etc.)
  • Simplify to one LF per crowdworker instead of splitting in 2

Some markdown diffs will appear larger than they really are because I split lines on sentence boundaries (rather than at arbitrary character counts) to make future diffs clearer.

Test plan:
tox -e crowdsourcing

@bhancock8 bhancock8 requested review from henryre and brahmaneya and removed request for henryre August 10, 2019 01:32
# # Crowdsourcing Tutorial

# %% [markdown]
# In this tutorial, we'll provide a simple walkthrough of how to use Snorkel in conjuction with crowdsourcing to create a training set for a sentiment analysis task.
Collaborator: conjunction

# Since our objective is to classify tweets as positive or negative, we limited
# the dataset to tweets that were either positive or negative.
# Label options were positive, negative, or one of three other options saying they weren't sure if it was positive or negative; we use only the positive/negative labels.
# We've also altered the dataset to reflect a realistic crowdsourcing pipeline where only a subset of our available training set have recieved crowd labels.
Collaborator: received

Collaborator: also have -> has?

# Snorkel's ability to build high-quality datasets from multiple noisy labeling
# signals makes it an ideal framework to approach this problem.
# We will treat each crowdworker's labels as coming from a single labeling function (LF).
# This will allow us to learn a weight for how much much to trust the labels from each crowdworker.
Collaborator: much much -> much
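The "one LF per crowdworker" idea above can be sketched in plain Python. This is an illustrative simplification, not the tutorial's actual code: in the tutorial these would be Snorkel `LabelingFunction` objects, and the names `worker_labels` and `make_worker_lf` are hypothetical.

```python
# Simplified sketch: one labeling function per crowdworker.
# An LF emits that worker's label for an example, or ABSTAIN if the
# worker never labeled it. The label model can then learn a per-LF
# (i.e., per-worker) accuracy weight.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical crowd labels: worker_id -> {tweet_id: label}
worker_labels = {
    "worker_0": {"t1": POSITIVE, "t2": NEGATIVE},
    "worker_1": {"t1": POSITIVE},
}

def make_worker_lf(worker_id):
    """Return an LF closure that looks up this worker's label, else abstains."""
    labels = worker_labels[worker_id]
    def lf(tweet_id):
        return labels.get(tweet_id, ABSTAIN)
    return lf

lfs = {w: make_worker_lf(w) for w in worker_labels}
print(lfs["worker_1"]("t2"))  # worker_1 never labeled t2 -> -1 (ABSTAIN)
```

Because most workers label only a small subset of examples, each of these LFs abstains on most rows, which is exactly the sparsity the review comments below discuss.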

L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

# %% [markdown]
# Note that because our dev set is so small and our LFs are relatively sparse, many LFs will appear to have zero coverage.
# Fortunately, our label model learns weights for LFs based on their coverage on the training set, which is generally much larger.
Collaborator: based on their outputs on the training set.
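For intuition on the coverage point above: an LF's coverage is simply the fraction of examples on which it does not abstain, so a sparse per-worker LF can easily show zero coverage on a tiny dev set while still covering part of the larger training set. A minimal sketch (the label column is made up for illustration):

```python
# Coverage of one LF = fraction of examples where it does not abstain.
ABSTAIN = -1
lf_outputs = [1, -1, -1, 0, 1, -1, -1, -1]  # this LF's column of L_train (illustrative)

coverage = sum(y != ABSTAIN for y in lf_outputs) / len(lf_outputs)
print(coverage)  # 0.375
```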

@@ -295,15 +273,16 @@ def encode_text(text):
sklearn_model = LogisticRegression(solver="liblinear")
sklearn_model.fit(train_vectors, probs_to_preds(Y_train_prob))
Collaborator: We can use label_model.predict instead of label_model.predict_proba followed by probs_to_preds
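The reviewer's suggestion works because converting probabilistic labels to hard predictions is, at its core, a row-wise argmax over class probabilities. A pure-Python sketch of that conversion (Snorkel's actual `probs_to_preds` additionally handles ties and abstains, so this is a simplification):

```python
# Hard predictions from class-probability rows via row-wise argmax.
Y_prob = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]]  # illustrative probabilities

def probs_to_preds_sketch(probs):
    """Pick the index of the highest-probability class in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

print(probs_to_preds_sketch(Y_prob))  # [0, 1, 1]
```

Collapsing the two steps into a single `predict` call, as suggested, avoids materializing the intermediate probability matrix in user code.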

# * We showed how the LabelModel learns to combine inputs from crowd workers and other LFs by appropriately weighting them to generate high quality probabilistic labels.
# * We showed that a classifier trained on the combined labels can achieve a fairly high accuracy while also generalizing to new, unseen examples.
# * We demonstrated how to combine crowdsourced labels with other programmatic LFs to improve coverage.
# * We used the `LabelModel` to learn how to combine inputs from crowdworkers and other LFs to generate high quality probabilistic labels.
Collaborator: I don't think we should say 'to learn how'. Just 'We used the LabelModel to combine inputs...'

@bhancock8 (Member, Author)

Comments addressed! Ready for another look.

@brahmaneya brahmaneya self-requested a review August 11, 2019 02:26
@bhancock8 bhancock8 merged commit 8d4fd69 into master Aug 11, 2019
@bhancock8 bhancock8 deleted the crowdsourcing_edits branch August 11, 2019 02:32
ajratner pushed a commit that referenced this pull request Aug 13, 2019
* Revisions to crowdsourcing tutorial

* Run tox

* Address comments
ajratner added a commit that referenced this pull request Aug 14, 2019
* First rev on lighter-weight intro tutorial

* Fixing @brahmaneya edits in PR

* Editing down code to minimal form as suggested by HE

* Simplifying code; adding TF example; editing

* Editing pass over text

* Transfer tags w jupytext build, minor edits

* Filtering abstain values

* Added a stub for SFs

* Trying to fix style check

* Skipping env_* files in flake

* Style fixes

* PR changes requested by @henryre

* Quieting nltk output

* Moved to getting_started

* Silencing LogReg warning

* Forgot to sync style fix...

* Spelling fix

* Addressing comments

* First rev on lighter-weight intro tutorial

* Fixing @brahmaneya edits in PR

* Hot fix LF names (#63)

* Mtl updates (#41)

* [EASY] Update Scorer import paths (#58)

* Save MTL updates in progress

* Give more API hints

* More text updates

* Hide unnecessary helpers in utils

* Drop unused import and sync notebook

* Save MTL updates in progress

* Give more API hints

* More text updates

* Hide unnecessary helpers in utils

* Drop unused import and sync notebook

* Address comments

* Rename mtl to multitask so file and tutorial match

* Update Scorer import paths

* Update names of loss and output funcs

* Update name of ce_loss_from_outputs()

* Rename SnorkelClassifier to MultitaskClassifier (#59)

* Update Scorer import

* Update Scorer import in vrd_tutorial

* Remove unused import

* [EASY] Add links to RTD in multitask tutorial (#65)

* Add links to RTD in multitask tutorial

* Separate sentences

* Sync multitask.ipynb

* Editing down code to minimal form as suggested by HE

* Add Drybell tutorial (#62)

* Add drybell tutorial

* Add to tox and README

* Install Java on Travis

* Pass JAVA_HOME

* Add README

* Update API

* Revisions to crowdsourcing tutorial (#64)

* Revisions to crowdsourcing tutorial

* Run tox

* Address comments

* Simplifying code; adding TF example; editing

* Recsys novel (#61)

* Recsys first commit

* Backup

* Added second version

* Recsys modeling work

* Add review processing and LFs.

* First complete draft

* typo

* Add ipynb

* Add comments, refactor

* Address comments

* Updated ipynb

* Update tox.ini to allow sync / test for recsys (but not by default)

* Address comments

* Update ipynb

* Address final comments

* Fix determinism of TF tutorial (#67)

* Fix determinism of TF tutorial

* Add os PYTHONGHASHSEED back

* Editing pass over text

* Transfer tags w jupytext build, minor edits

* Filtering abstain values

* Added a stub for SFs

* Trying to fix style check

* Skipping env_* files in flake

* Style fixes

* PR changes requested by @henryre

* Quieting nltk output

* Separate download scripts, feedback session (#55)

* Moved to getting_started

* Silencing LogReg warning

* Slicing spam (#18)

* Forgot to sync style fix...

* Add link checking (#72)

* Add link checking

* Fix

* Only run Travis on changed tutorials (#74)

* Only run Travis on changed tutorials

* Fix

* Address comments

* Fix

* Fix a couple links (#75)

* Fix link

* Fix links

* Fix Travis branch check (#77)

* Spelling fix

* Stop training on dev set [EASY] (#71)

* Stop training on dev set

* Update image link

* Add style to run envs (#78)

* Add style to run envs

* Simplify

* Make travis faster for spouse [EASY] (#80)

* Make travis faster for spouse

* remove extra cell

* sync

* all caps for constant

* More verbose build script (#81)

* Add markdown build mode (#79)

* Add markdown build mode

* Fix kwarg

* Be a bit more opinionated

* [EASY] Update path to snorkel to reflect ownership transfer (#82)

* Update path to snorkel to reflect ownership transfer

* Restore path to snorkel-superglue on HazyResearch

* Make travis only run on changed dirs (#85)

* Make travis only run on changed dirs

* Small fix

* Add space

* Update tutorials with MultitaskClassifier API changes (#68)

* Update multitask and scene_graph to last_op

* Update spam tutorial

* Update notebooks

* Sync visual_relation notebook

* Update spam notebooks

* Run tox -e fix

* Remove unused import

* Configure markdown generation (#87)

* Configure markdown generation

* Add comments

* [EASY] Replace `mtl` with `multitask` in README (#83)

* Deploy tutorial pages via Travis (#88)

* Deploy tutorial pages via Travis

* Fix commands

* Update readme (#73)

* Update readme

* Address comment

* Addressing comments