
docs: spacy DocBin cookbook #1642

Merged
merged 2 commits into from Jul 31, 2022

Conversation

ignacioct
Contributor

@ignacioct ignacioct commented Jul 26, 2022

Closes #420

Following our work in #420 and #1635, I'm adding a small example to the Cookbook. Is this enough, or do we need something closer to an actual sentence or dataset?

@frascuchon
Member

frascuchon commented Jul 28, 2022

Hi, @ignacioct, and thanks for your work!!

Maybe it would be nice to use a dataset and show a "real" workflow. The conll2003 dataset could be a candidate. You could do something like this:

import spacy
import rubrix as rb

from datasets import load_dataset

ds = load_dataset("conll2003", split="train")
rds = rb.DatasetForTokenClassification.from_datasets(ds, tags="ner_tags")
nlp = spacy.blank("en")  # A blank nlp pipeline works faster

db = rds.prepare_for_training(framework="spacy", lang=nlp)
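The `db` returned by `prepare_for_training(framework="spacy", ...)` is a spaCy `DocBin`, which can then be written to disk and fed to `spacy train`. A minimal spaCy-only sketch of that save/load step (the example texts and the `train.spacy` filename are illustrative, not from this PR):

```python
import pathlib
import tempfile

import spacy
from spacy.tokens import DocBin

# A blank pipeline is enough for tokenization; no trained model is needed.
nlp = spacy.blank("en")

# Pack a couple of Docs into a DocBin (prepare_for_training builds one like this).
docs = [nlp(t) for t in ["Rubrix exports annotations.", "spaCy trains on DocBin files."]]
db = DocBin(docs=docs)

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "train.spacy"
    db.to_disk(path)  # .spacy files are what `spacy train` consumes

    # Round-trip check: load the file back and recover the Docs.
    restored = list(DocBin().from_disk(path).get_docs(nlp.vocab))

print(len(restored))  # 2
```

In a real workflow you would keep the `.spacy` file and point the `[paths.train]` setting of a spaCy training config at it.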

@frascuchon frascuchon changed the title feat: saving a docbin to disk docs: spacy DocBin cookbook Jul 28, 2022
@ignacioct
Contributor Author


Okay, using that dataset I believe there's one unsupported row. Is this a problem for the example, or can we go ahead?

2022-07-28 13:25:14.367 | WARNING  | datasets.builder:download_and_prepare:577 - Reusing dataset conll2003 (/Users/ignacio/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee)
2022-07-28 13:25:14.372 | WARNING  | rubrix.client.datasets:_remove_unsupported_columns:252 - Following columns are not supported by the TokenClassificationRecord model and are ignored: ['pos_tags', 'chunk_tags']
2022-07-28 13:25:16.324 | WARNING  | rubrix.client.datasets:from_datasets:761 - Ignoring row with no tokens.

@frascuchon
Member

No, it's okay; it's just a warning. You can go ahead. Thanks!

@ignacioct
Contributor Author

That's already implemented, so we can go ahead and merge :) @frascuchon

Member

@frascuchon frascuchon left a comment


Great!

@frascuchon frascuchon merged commit 625d153 into argilla-io:master Jul 31, 2022
frascuchon pushed a commit that referenced this pull request Aug 18, 2022
(cherry picked from commit 625d153)
frascuchon pushed a commit that referenced this pull request Aug 22, 2022
(cherry picked from commit 625d153)

- docs: Improve cookbook spacy docbin (#1691)
(cherry picked from commit 3f75323)
Successfully merging this pull request may close these issues.

Add example to Cookbook to export training data for spaCy v3
2 participants