
Revisions to crowdsourcing tutorial #64

Merged · 3 commits into master · Aug 11, 2019
Conversation

bhancock8 (Member)

Made a pass over the Crowdsourcing Tutorial.

  • Mostly local edits
  • A couple renamings ("answers" -> "labels", "crowd worker" -> "crowdworkers", etc.)
  • Simplify to one LF per crowdworker instead of splitting in 2

Some markdown diffs will appear larger than they really are because I split lines on sentence boundaries (rather than at arbitrary character counts) to make future diffs clearer.

Test plan:
tox -e crowdsourcing

@bhancock8 bhancock8 requested review from henryre and brahmaneya and removed request for henryre August 10, 2019 01:32
# # Crowdsourcing Tutorial

# %% [markdown]
# In this tutorial, we'll provide a simple walkthrough of how to use Snorkel in conjuction with crowdsourcing to create a training set for a sentiment analysis task.
Collaborator: conjunction

# Since our objective is to classify tweets as positive or negative, we limited
# the dataset to tweets that were either positive or negative.
# Label options were positive, negative, or one of three other options saying they weren't sure if it was positive or negative; we use only the positive/negative labels.
# We've also altered the dataset to reflect a realistic crowdsourcing pipeline where only a subset of our available training set have recieved crowd labels.
Collaborator: received

Collaborator: also have -> has?

# Snorkel's ability to build high-quality datasets from multiple noisy labeling
# signals makes it an ideal framework to approach this problem.
# We will treat each crowdworker's labels as coming from a single labeling function (LF).
# This will allow us to learn a weight for how much much to trust the labels from each crowdworker.
Collaborator: much much -> much
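The "one LF per crowdworker" idea above can be sketched in plain Python. This is an illustrative simplification, not the tutorial's actual code: in the tutorial these would be Snorkel `LabelingFunction` objects, and the names `worker_labels` and `make_worker_lf` are hypothetical.

```python
# Simplified sketch: one labeling function per crowdworker.
# An LF emits that worker's label for an example, or ABSTAIN if the
# worker never labeled it. The label model can then learn a per-LF
# (i.e., per-worker) accuracy weight.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical crowd labels: worker_id -> {tweet_id: label}
worker_labels = {
    "worker_0": {"t1": POSITIVE, "t2": NEGATIVE},
    "worker_1": {"t1": POSITIVE},
}

def make_worker_lf(worker_id):
    """Return an LF closure that looks up this worker's label, else abstains."""
    labels = worker_labels[worker_id]
    def lf(tweet_id):
        return labels.get(tweet_id, ABSTAIN)
    return lf

lfs = {w: make_worker_lf(w) for w in worker_labels}
print(lfs["worker_1"]("t2"))  # worker_1 never labeled t2 -> -1 (ABSTAIN)
```

Because most workers label only a small subset of examples, each of these LFs abstains on most rows, which is exactly the sparsity the review comments below discuss.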

L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

# %% [markdown]
# Note that because our dev set is so small and our LFs are relatively sparse, many LFs will appear to have zero coverage.
# Fortunately, our label model learns weights for LFs based on their coverage on the training set, which is generally much larger.
Collaborator: based on their outputs on the training set.
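For intuition on the coverage point above: an LF's coverage is simply the fraction of examples on which it does not abstain, so a sparse per-worker LF can easily show zero coverage on a tiny dev set while still covering part of the larger training set. A minimal sketch (the label column is made up for illustration):

```python
# Coverage of one LF = fraction of examples where it does not abstain.
ABSTAIN = -1
lf_outputs = [1, -1, -1, 0, 1, -1, -1, -1]  # this LF's column of L_train (illustrative)

coverage = sum(y != ABSTAIN for y in lf_outputs) / len(lf_outputs)
print(coverage)  # 0.375
```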

@@ -295,15 +273,16 @@ def encode_text(text):
sklearn_model = LogisticRegression(solver="liblinear")
sklearn_model.fit(train_vectors, probs_to_preds(Y_train_prob))
Collaborator: We can use label_model.predict instead of label_model.predict_proba followed by probs_to_preds
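The reviewer's suggestion works because converting probabilistic labels to hard predictions is, at its core, a row-wise argmax over class probabilities. A pure-Python sketch of that conversion (Snorkel's actual `probs_to_preds` additionally handles ties and abstains, so this is a simplification):

```python
# Hard predictions from class-probability rows via row-wise argmax.
Y_prob = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]]  # illustrative probabilities

def probs_to_preds_sketch(probs):
    """Pick the index of the highest-probability class in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

print(probs_to_preds_sketch(Y_prob))  # [0, 1, 1]
```

Collapsing the two steps into a single `predict` call, as suggested, avoids materializing the intermediate probability matrix in user code.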

# * We showed how the LabelModel learns to combine inputs from crowd workers and other LFs by appropriately weighting them to generate high quality probabilistic labels.
# * We showed that a classifier trained on the combined labels can achieve a fairly high accuracy while also generalizing to new, unseen examples.
# * We demonstrated how to combine crowdsourced labels with other programmatic LFs to improve coverage.
# * We used the `LabelModel` to learn how to combine inputs from crowdworkers and other LFs to generate high quality probabilistic labels.
Collaborator: I don't think we should say 'to learn how'. Just 'We used the LabelModel to combine inputs...'

@bhancock8 (Member, Author)

Comments addressed! Ready for another look.

@brahmaneya brahmaneya self-requested a review August 11, 2019 02:26
@bhancock8 bhancock8 merged commit 8d4fd69 into master Aug 11, 2019
@bhancock8 bhancock8 deleted the crowdsourcing_edits branch August 11, 2019 02:32
ajratner pushed a commit that referenced this pull request Aug 13, 2019
* Revisions to crowdsourcing tutorial

* Run tox

* Address comments
ajratner added a commit that referenced this pull request Aug 14, 2019
* First rev on lighter-weight intro tutorial

* Fixing @brahmaneya edits in PR

* Editing down code to minimal form as suggested by HE

* Simplifying code; adding TF example; editing

* Editing pass over text

* Transfer tags w jupytext build, minor edits

* Filtering abstain values

* Added a stub for SFs

* Trying to fix style check

* Skipping env_* files in flake

* Style fixes

* PR changes requested by @henryre

* Quieting nltk output

* Moved to getting_started

* Silencing LogReg warning

* Forgot to sync style fix...

* Spelling fix

* Addressing comments

* First rev on lighter-weight intro tutorial

* Fixing @brahmaneya edits in PR

* Hot fix LF names (#63)

* Mtl updates (#41)

* [EASY] Update Scorer import paths (#58)

* Save MTL updates in progress

* Give more API hints

* More text updates

* Hide unnecessary helpers in utils

* Drop unused import and sync notebook

* Save MTL updates in progress

* Give more API hints

* More text updates

* Hide unnecessary helpers in utils

* Drop unused import and sync notebook

* Address comments

* Rename mtl to multitask so file and tutorial match

* Update Scorer import paths

* Update names of loss and output funcs

* Update name of ce_loss_from_outputs()

* Rename SnorkelClassifier to MultitaskClassifier (#59)

* Update Scorer import

* Update Scorer import in vrd_tutorial

* Remove unused import

* [EASY] Add links to RTD in multitask tutorial (#65)

* Add links to RTD in multitask tutorial

* Separate sentences

* Sync multitask.ipynb

* Editing down code to minimal form as suggested by HE

* Add Drybell tutorial (#62)

* Add drybell tutorial

* Add to tox and README

* Install Java on Travis

* Pass JAVA_HOME

* Add README

* Update API

* Revisions to crowdsourcing tutorial (#64)

* Revisions to crowdsourcing tutorial

* Run tox

* Address comments

* Simplifying code; adding TF example; editing

* Recsys novel (#61)

* Recsys first commit

* Backup

* Added second version

* Recsys modeling work

* Add review processing and LFs.

* First complete draft

* typo

* Add ipynb

* Add comments, refactor

* Address comments

* Updated ipynb

* Update tox.ini to allow sync / test for recsys (but not by default)

* Address comments

* Update ipynb

* Address final comments

* Fix determinism of TF tutorial (#67)

* Fix determinism of TF tutorial

* Add os PYTHONGHASHSEED back

* Editing pass over text

* Transfer tags w jupytext build, minor edits

* Filtering abstain values

* Added a stub for SFs

* Trying to fix style check

* Skipping env_* files in flake

* Style fixes

* PR changes requested by @henryre

* Quieting nltk output

* Separate download scripts, feedback session (#55)

* Moved to getting_started

* Silencing LogReg warning

* Slicing spam (#18)

* Forgot to sync style fix...

* Add link checking (#72)

* Add link checking

* Fix

* Only run Travis on changed tutorials (#74)

* Only run Travis on changed tutorials

* Fix

* Address comments

* Fix

* Fix a couple links (#75)

* Fix link

* Fix links

* Fix Travis branch check (#77)

* Spelling fix

* Stop training on dev set [EASY] (#71)

* Stop training on dev set

* Update image link

* Add style to run envs (#78)

* Add style to run envs

* Simplify

* Make travis faster for spouse [EASY] (#80)

* Make travis faster for spouse

* remove extra cell

* sync

* all caps for constant

* More verbose build script (#81)

* Add markdown build mode (#79)

* Add markdown build mode

* Fix kwarg

* Be a bit more opinionated

* [EASY] Update path to snorkel to reflect ownership transfer (#82)

* Update path to snorkel to reflect ownership transfer

* Restore path to snorkel-superglue on HazyResearch

* Make travis only run on changed dirs (#85)

* Make travis only run on changed dirs

* Small fix

* Add space

* Update tutorials with MultitaskClassifier API changes (#68)

* Update multitask and scene_graph to last_op

* Update spam tutorial

* Update notebooks

* Sync visual_relation notebook

* Update spam notebooks

* Run tox -e fix

* Remove unused import

* Configure markdown generation (#87)

* Configure markdown generation

* Add comments

* [EASY] Replace `mtl` with `multitask` in README (#83)

* Deploy tutorial pages via Travis (#88)

* Deploy tutorial pages via Travis

* Fix commands

* Update readme (#73)

* Update readme

* Address comment

* Addressing comments