
docs: spacy DocBin cookbook #1642

Merged
merged 2 commits into from Jul 31, 2022

Conversation

ignacioct
Contributor

@ignacioct ignacioct commented Jul 26, 2022

Closes #420

Following our work in #420 and #1635, I'm adding a small example to the Cookbook. Is this enough, or do we need something closer to an actual sentence or dataset?

@frascuchon
Member

frascuchon commented Jul 28, 2022

Hi, @ignacioct, and thanks for your work!!

Maybe it would be nice to use a dataset and show a "real" workflow. The conll2003 dataset could be a candidate. You could do something like this:

import spacy
import rubrix as rb

from datasets import load_dataset

ds = load_dataset("conll2003", split="train")
rds = rb.DatasetForTokenClassification.from_datasets(ds, tags="ner_tags")
nlp = spacy.blank("en")  # A blank nlp pipeline works faster

db = rds.prepare_for_training(framework="spacy", lang=nlp)
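The `db` returned by `prepare_for_training(framework="spacy", ...)` is a spaCy `DocBin`, which can then be written to disk and fed to `spacy train`. A minimal spaCy-only sketch of that save/load step (the example texts and the `train.spacy` filename are illustrative, not from this PR):

```python
import pathlib
import tempfile

import spacy
from spacy.tokens import DocBin

# A blank pipeline is enough for tokenization; no trained model is needed.
nlp = spacy.blank("en")

# Pack a couple of Docs into a DocBin (prepare_for_training builds one like this).
docs = [nlp(t) for t in ["Rubrix exports annotations.", "spaCy trains on DocBin files."]]
db = DocBin(docs=docs)

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "train.spacy"
    db.to_disk(path)  # .spacy files are what `spacy train` consumes

    # Round-trip check: load the file back and recover the Docs.
    restored = list(DocBin().from_disk(path).get_docs(nlp.vocab))

print(len(restored))  # 2
```

In a real workflow you would keep the `.spacy` file and point the `[paths.train]` setting of a spaCy training config at it.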

@frascuchon frascuchon changed the title feat: saving a docbin to disk docs: spacy DocBin cookbook Jul 28, 2022
@ignacioct
Contributor Author


Okay, using that dataset I believe there's one unsupported row. Is this a problem for the example, or can we go ahead?

2022-07-28 13:25:14.367 | WARNING  | datasets.builder:download_and_prepare:577 - Reusing dataset conll2003 (/Users/ignacio/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee)
2022-07-28 13:25:14.372 | WARNING  | rubrix.client.datasets:_remove_unsupported_columns:252 - Following columns are not supported by the TokenClassificationRecord model and are ignored: ['pos_tags', 'chunk_tags']
2022-07-28 13:25:16.324 | WARNING  | rubrix.client.datasets:from_datasets:761 - Ignoring row with no tokens.

@frascuchon
Member

No, it's okay; it's just a warning. You can go ahead. Thanks!

@ignacioct
Contributor Author

That's already implemented, so we can go ahead and merge :) @frascuchon

Member

@frascuchon frascuchon left a comment


Great!

@frascuchon frascuchon merged commit 625d153 into argilla-io:master Jul 31, 2022
frascuchon pushed a commit that referenced this pull request Aug 18, 2022
(cherry picked from commit 625d153)
frascuchon pushed a commit that referenced this pull request Aug 22, 2022
(cherry picked from commit 625d153)

- docs: Improve cookbook spacy docbin (#1691)
(cherry picked from commit 3f75323)
Successfully merging this pull request may close these issues.

Add example to Cookbook to export training data for spaCy v3
2 participants