Unable to use en_core_web_trf transformer in spaCy NER: Hugging Face model errors #13829
BenReichwein asked this question in Help: Coding & Implementations
Description
I’m trying to train a spaCy NER pipeline using the built-in en_core_web_trf transformer model as my base. However, every time I point my config at en_core_web_trf, I run into Hugging Face download/initialization errors. I’ve tried multiple variations of config.cfg (including switching to roberta-base under [components.transformer.model]), but nothing lets spaCy successfully load and fine-tune the transformer. As soon as I run training, I get an OSError about “not a valid model identifier” (or a missing vectors path). I suspect I’m confusing en_core_web_trf (a spaCy pipeline package) with a raw Hugging Face model. Any guidance on how to properly set up config.cfg (or install prerequisites) so that python -m spacy train can find and load en_core_web_trf would be greatly appreciated.
Environment
Python version: 3.10.12
spaCy version: 3.6.2
spaCy-transformers version: 1.4.1
Transformers (Hugging Face) version: 4.35.2
OS: Ubuntu 22.04.2 LTS
GPU/CPU: CPU only (no CUDA available)
Virtual environment: venv with pip install spacy[transformers] transformers
What I’ve Tried
Standard en_core_web_trf install & loading
Result:
When I run python -c "import spacy; nlp = spacy.load('en_core_web_trf')" it works.
But as soon as I try to reference en_core_web_trf in my training config.cfg, I get a Hugging Face error.
Using raw roberta-base under [components.transformer.model]
In some examples, the spaCy docs show a configuration like:
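(The snippet referenced here didn’t survive formatting; the corresponding block in the spaCy docs looks roughly like this:)

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}
```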
Result:
spaCy successfully downloads “roberta-base” from Hugging Face and initializes the transformer.
But I’d rather leverage the pretrained config in en_core_web_trf (which already includes RoBERTa plus pipeline defaults).
Setting vectors = null and init_tok2vec = null
I removed comments from config.cfg so that vectors = null is valid. That fixed a previous E884: vectors could not be found at 'null # …' error. However, it didn’t resolve the underlying transformer‐loading problem.
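For reference, a sketch of the relevant sections after dropping the trailing comments (spaCy’s config reader otherwise treats the value plus comment as a literal string, which is what produced E884):

```ini
[paths]
vectors = null
init_tok2vec = null

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
```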
Full config.cfg (latest attempt)
Command Used
Observed Errors
Using name = "en_core_web_trf" under [components.transformer.model]
Switching to roberta-base works, but en_core_web_trf fails
With name = "roberta-base" I get no model-loading errors.
However, changing back to name = "en_core_web_trf" fails again. The model is installed correctly on disk, but spaCy is still trying to fetch it from Hugging Face.
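The failing variant (reconstructed; the original snippet was lost in formatting) points `name` at the spaCy package name instead of a Hugging Face repo ID:

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "en_core_web_trf"
```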
→ Checking the installed package location prints something like /home/user/.local/lib/python3.10/site-packages/en_core_web_trf/en_core_web_trf-3.6.2.
Yet spaCy’s training CLI still complains that en_core_web_trf isn’t a valid model on Hugging Face. It seems that spaCy isn’t looking at the locally installed package name but instead trying to download from the HF hub.
What I Suspect
The training CLI (spacy train ...) may not automatically detect a locally installed en_core_web_trf package and still tries to treat it as a Hugging Face repo ID.
Perhaps I need to specify a different path syntax, or explicitly load/initialize the pipeline first and then call resume_training instead of initialize.
I’m not sure if there’s a hidden “transformers_data” folder or environment variable that spaCy expects for en_core_web_trf when training from scratch.
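One configuration pattern that may be relevant here (a hedged sketch, untested in this environment): spaCy treats the `name` under [components.transformer.model] strictly as a Hugging Face model ID, while whole components from an installed pipeline package are reused via `source` instead:

```ini
# Reuse the trained transformer and ner components from the installed
# en_core_web_trf package, rather than pointing `name` at a
# (nonexistent) Hugging Face repo ID.
[components.transformer]
source = "en_core_web_trf"

[components.ner]
source = "en_core_web_trf"
```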
Things I’ve Checked
Confirmed spacy-transformers is installed:
Confirmed Hugging Face hub access works (downloading roberta-base, for example, succeeds).
Tried both with and without a GPU; same error on CPU/GPU.
Tried building a minimal config.cfg with only the transformer & ner components (no extra custom layers), but it still refuses to find en_core_web_trf.
What I’m Hoping For
Clarification on how to reference a locally installed spaCy transformer pipeline like en_core_web_trf in config.cfg so that spacy train will use it, rather than trying to re-download it from Hugging Face.
Any examples of a working config.cfg that directly uses en_core_web_trf under [components.transformer.model].
Notes on workflow: if spaCy’s CLI can’t initialize a transformer from a locally installed pip package, should I instead call nlp = spacy.load("en_core_web_trf") manually and then use nlp.resume_training()? If so, are there code snippets or recommended best practices?
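On that last point, a minimal sketch of the load-then-resume pattern. A blank English pipeline stands in for en_core_web_trf here so the snippet runs without the transformer weights; with the real pipeline you would write `nlp = spacy.load("en_core_web_trf")` and skip the `initialize()` call:

```python
import spacy
from spacy.training import Example

# Stand-in for: nlp = spacy.load("en_core_web_trf")
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("ORG")

# Toy training data; entities are (start_char, end_char, label) offsets.
train_data = [("I work at Acme Corp", {"entities": [(10, 19, "ORG")]})]
examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in train_data]

nlp.initialize(lambda: examples)   # not needed when loading a trained pipeline
optimizer = nlp.resume_training()  # reuse existing weights instead of re-init

losses = {}
nlp.update(examples, sgd=optimizer, losses=losses)
print(sorted(losses))
```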
Thank you in advance for any guidance or configuration examples that let me successfully fine-tune a transformer-based NER model using en_core_web_trf.