Unable to prepare_for_training() dataset with spacy pipeline #1890

Closed
jamnicki opened this issue Nov 13, 2022 · 1 comment · Fixed by #1891
jamnicki commented Nov 13, 2022

Describe the bug
Calling prepare_for_training() on a dataset fails when the spaCy pipeline passed as lang includes components that are not yet initialized.
I don't think Argilla should interfere with spaCy's behaviour like that, because it can be confusing when working with spaCy.

To Reproduce
Steps to reproduce the behavior:

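import spacy

# Blank English pipeline with an "ner" component that is added but never initialized
nlp = spacy.blank("en")
nlp.add_pipe("ner")

# rg_dataset_train is an Argilla DatasetForTokenClassification loaded beforehand
docbin_train = rg_dataset_train.prepare_for_training(framework="spacy", lang=nlp)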
Full error stack trace
  ---------------------------------------------------------------------------
  KeyError                                  Traceback (most recent call last)
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/language.py:1026, in Language.__call__(self, text, disable, component_cfg)
     1025 try:
  -> 1026     doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
     1027 except KeyError as e:
     1028     # This typically happens if a component is not initialized
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/pipeline/trainable_pipe.pyx:56, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__()
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/util.py:1670, in raise_error(proc_name, proc, docs, e)
     1669 def raise_error(proc_name, proc, docs, e):
  -> 1670     raise e
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/pipeline/trainable_pipe.pyx:52, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__()
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/pipeline/transition_parser.pyx:253, in spacy.pipeline.transition_parser.Parser.predict()
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/pipeline/transition_parser.pyx:274, in spacy.pipeline.transition_parser.Parser.greedy_parse()
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/model.py:315, in Model.predict(self, X)
      312 """Call the model's `forward` function with `is_train=False`, and return
      313 only the output, instead of the `(output, callback)` tuple.
      314 """
  --> 315 return self._func(self, X, is_train=False)[0]
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/ml/tb_framework.py:33, in forward(model, X, is_train)
       32 def forward(model, X, is_train):
  ---> 33     step_model = ParserStepModel(
       34         X,
       35         model.layers,
       36         unseen_classes=model.attrs["unseen_classes"],
       37         train=is_train,
       38         has_upper=model.attrs["has_upper"],
       39     )
       41     return step_model, step_model.finish_steps
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/ml/parser_model.pyx:213, in spacy.ml.parser_model.ParserStepModel.__init__()
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/model.py:291, in Model.__call__(self, X, is_train)
      289 """Call the model's `forward` function, returning the output and a
      290 callback to compute the gradients via backpropagation."""
  --> 291 return self._func(self, X, is_train=is_train)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/chain.py:55, in forward(model, X, is_train)
       54 for layer in model.layers:
  ---> 55     Y, inc_layer_grad = layer(X, is_train=is_train)
       56     callbacks.append(inc_layer_grad)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/model.py:291, in Model.__call__(self, X, is_train)
      289 """Call the model's `forward` function, returning the output and a
      290 callback to compute the gradients via backpropagation."""
  --> 291 return self._func(self, X, is_train=is_train)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/chain.py:55, in forward(model, X, is_train)
       54 for layer in model.layers:
  ---> 55     Y, inc_layer_grad = layer(X, is_train=is_train)
       56     callbacks.append(inc_layer_grad)
  
      [... skipping similar frames: Model.__call__ at line 291 (1 times)]
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/chain.py:55, in forward(model, X, is_train)
       54 for layer in model.layers:
  ---> 55     Y, inc_layer_grad = layer(X, is_train=is_train)
       56     callbacks.append(inc_layer_grad)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/model.py:291, in Model.__call__(self, X, is_train)
      289 """Call the model's `forward` function, returning the output and a
      290 callback to compute the gradients via backpropagation."""
  --> 291 return self._func(self, X, is_train=is_train)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/with_array.py:32, in forward(model, Xseq, is_train)
       31 if isinstance(Xseq, Ragged):
  ---> 32     return cast(Tuple[SeqT, Callable], _ragged_forward(model, Xseq, is_train))
       33 elif isinstance(Xseq, Padded):
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/with_array.py:87, in _ragged_forward(model, Xr, is_train)
       86 layer: Model[ArrayXd, ArrayXd] = model.layers[0]
  ---> 87 Y, get_dX = layer(Xr.dataXd, is_train)
       89 def backprop(dYr: Ragged) -> Ragged:
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/model.py:291, in Model.__call__(self, X, is_train)
      289 """Call the model's `forward` function, returning the output and a
      290 callback to compute the gradients via backpropagation."""
  --> 291 return self._func(self, X, is_train=is_train)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/concatenate.py:44, in forward(model, X, is_train)
       43 def forward(model: Model[InT, OutT], X: InT, is_train: bool) -> Tuple[OutT, Callable]:
  ---> 44     Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
       45     if isinstance(Ys[0], list):
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/concatenate.py:44, in <listcomp>(.0)
       43 def forward(model: Model[InT, OutT], X: InT, is_train: bool) -> Tuple[OutT, Callable]:
  ---> 44     Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
       45     if isinstance(Ys[0], list):
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/model.py:291, in Model.__call__(self, X, is_train)
      289 """Call the model's `forward` function, returning the output and a
      290 callback to compute the gradients via backpropagation."""
  --> 291 return self._func(self, X, is_train=is_train)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/chain.py:55, in forward(model, X, is_train)
       54 for layer in model.layers:
  ---> 55     Y, inc_layer_grad = layer(X, is_train=is_train)
       56     callbacks.append(inc_layer_grad)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/model.py:291, in Model.__call__(self, X, is_train)
      289 """Call the model's `forward` function, returning the output and a
      290 callback to compute the gradients via backpropagation."""
  --> 291 return self._func(self, X, is_train=is_train)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/layers/hashembed.py:61, in forward(model, ids, is_train)
       58 def forward(
       59     model: Model[Ints1d, OutT], ids: Ints1d, is_train: bool
       60 ) -> Tuple[OutT, Callable]:
  ---> 61     vectors = cast(Floats2d, model.get_param("E"))
       62     nV = vectors.shape[0]
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/thinc/model.py:216, in Model.get_param(self, name)
      215 if not self._params.has_param(self.id, name):
  --> 216     raise KeyError(
      217         f"Parameter '{name}' for model '{self.name}' has not been allocated yet."
      218     )
      219 return self._params.get_param(self.id, name)
  
  KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
  
  The above exception was the direct cause of the following exception:
  
  ValueError                                Traceback (most recent call last)
  Cell In [9], line 2
        1 # with nlp.select_pipes(disable=COMPONENT):
  ----> 2 docbin_train = rg_dataset_train.prepare_for_training(framework="spacy", lang=nlp)
        3 docbin_test = rg_dataset_test.prepare_for_training(framework="spacy", lang=nlp)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/argilla/client/datasets.py:855, in DatasetForTokenClassification.prepare_for_training(self, framework, lang)
      851 if lang is None:
      852     raise ValueError(
      853         "Please provide a spacy language model to prepare the dataset for training with the spacy framework."
      854     )
  --> 855 return self._prepare_for_training_with_spacy(nlp=lang)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/argilla/client/datasets.py:66, in _requires_spacy.<locals>.check_if_spacy_installed(*args, **kwargs)
       61 except ModuleNotFoundError:
       62     raise ModuleNotFoundError(
       63         f"'spacy' must be installed to use `{func.__name__}`"
       64         "You can install 'spacy' with the command: `pip install spacy`"
       65     )
  ---> 66 return func(*args, **kwargs)
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/argilla/client/datasets.py:911, in DatasetForTokenClassification._prepare_for_training_with_spacy(self, nlp)
      908 if record.annotation is None:
      909     continue
  --> 911 doc = nlp(record.text)
      912 entities = []
      914 for anno in record.annotation:
  
  File ~/miniconda3/envs/inzynierka/lib/python3.8/site-packages/spacy/language.py:1029, in Language.__call__(self, text, disable, component_cfg)
     1026     doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
     1027 except KeyError as e:
     1028     # This typically happens if a component is not initialized
  -> 1029     raise ValueError(Errors.E109.format(name=name)) from e
     1030 except Exception as e:
     1031     error_handler(name, proc, [doc], e)
  
  ValueError: [E109] Component 'ner' could not be run. Did you forget to call `initialize()`?
ValueError: [E109] Component 'ner' could not be run. Did you forget to call `initialize()`?
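
A minimal sketch of the failure and two possible ways around it, assuming spaCy v3 is installed. The traceback shows that the failing call is `nlp(record.text)`, which runs every pipeline component. The sample text and the `(label, start, end)` annotation below are made up for illustration, and this is not necessarily how the linked fix (#1891) handles it.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
nlp.add_pipe("ner")  # added but not initialized, as in the report

text = "Argilla works with spaCy."  # hypothetical record text

# Running the full pipeline calls the uninitialized "ner" component.
try:
    nlp(text)
    full_pipeline_error = None
except ValueError as err:
    full_pipeline_error = str(err)

# Option 1: tokenize only. make_doc() does not call any pipeline component.
doc = nlp.make_doc(text)

# Option 2: temporarily disable the uninitialized component.
with nlp.select_pipes(disable=["ner"]):
    doc_disabled = nlp(text)

# Hypothetical annotation in (label, start_char, end_char) form.
annotation = [("ORG", 0, 7)]
spans = [doc.char_span(start, end, label=label) for label, start, end in annotation]
# char_span() returns None when the offsets do not align with token boundaries.
doc.ents = [span for span in spans if span is not None]

doc_bin = DocBin(docs=[doc])
```

Both options only need the tokenizer to turn annotated character offsets into entity spans, so the state of the trainable components does not matter when building the DocBin.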

Environment:

  • OS: Ubuntu 22.04.1 LTS
  • Argilla Version: 1.0.1
  • ElasticSearch Version: 7.10.2
jamnicki added the "type: bug" label (indicates an unexpected problem or unintended behavior) on Nov 13, 2022
@davidberenstein1957
Member

Good morning @jamnicki, thanks for the response. We mostly just copied the spaCy docs; however, we know it can sometimes be challenging to keep docs adequately updated. Thanks for your input and proactive PR. We always love those. ❤️
