Unable to use en_core_web_trf transformer in spaCy NER: Hugging Face model errors #13829
BenReichwein asked this question in Help: Coding & Implementations
Description
I’m trying to train a spaCy NER pipeline using the built-in en_core_web_trf transformer model as my base. However, every time I point my config at en_core_web_trf, I run into Hugging Face download/initialization errors. I’ve tried multiple variations of config.cfg (including switching to roberta-base under [components.transformer.model]), but nothing lets spaCy successfully load and fine-tune the transformer. As soon as I run training, I get an OSError about “not a valid model identifier” (or a missing vectors path). I suspect I’m confusing en_core_web_trf (a spaCy pipeline package) with a raw Hugging Face model. Any guidance on how to properly set up config.cfg (or install prerequisites) so that python -m spacy train can find and load en_core_web_trf would be greatly appreciated.
Environment
Python version: 3.10.12
spaCy version: 3.6.2
spaCy-transformers version: 1.4.1
Transformers (Hugging Face) version: 4.35.2
OS: Ubuntu 22.04.2 LTS
GPU/CPU: CPU only (no CUDA available)
Virtual environment: venv with pip install spacy[transformers] transformers
What I’ve Tried
Standard en_core_web_trf install & loading
Result:
When I run python -c "import spacy; nlp = spacy.load('en_core_web_trf')" it works.
But as soon as I try to reference en_core_web_trf in my training config.cfg, I get a Hugging Face error.
Using raw roberta-base under [components.transformer.model]
In some examples, the spaCy docs show a configuration like:
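(The snippet referenced here didn’t survive formatting; the corresponding block in the spaCy docs looks roughly like this:)

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}
```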
Result:
spaCy successfully downloads “roberta-base” from Hugging Face and initializes the transformer.
But I’d rather leverage the pretrained config in en_core_web_trf (which already includes RoBERTa plus pipeline defaults).
Setting vectors = null and init_tok2vec = null
I removed comments from config.cfg so that vectors = null is valid. That fixed a previous E884: vectors could not be found at 'null # …' error. However, it didn’t resolve the underlying transformer‐loading problem.
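For reference, a sketch of the relevant sections after dropping the trailing comments (spaCy’s config reader otherwise treats the value plus comment as a literal string, which is what produced E884):

```ini
[paths]
vectors = null
init_tok2vec = null

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
```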
Full config.cfg (latest attempt)
Command Used
Observed Errors
Using name = "en_core_web_trf" under [components.transformer.model]
Switching to roberta-base works, but en_core_web_trf fails
With name = "roberta-base" I get no model-loading errors.
However, changing back to name = "en_core_web_trf" fails again. The model is installed correctly on disk, but spaCy is still trying to fetch it from Hugging Face.
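The failing variant (reconstructed; the original snippet was lost in formatting) points `name` at the spaCy package name instead of a Hugging Face repo ID:

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "en_core_web_trf"
```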
→ Checking the installed package location prints something like /home/user/.local/lib/python3.10/site-packages/en_core_web_trf/en_core_web_trf-3.6.2.
Yet spaCy’s training CLI still complains that en_core_web_trf isn’t a valid model on Hugging Face. It seems that spaCy isn’t looking at the locally installed package name but instead trying to download from the HF hub.
What I Suspect
The training CLI (spacy train ...) may not automatically detect a locally installed en_core_web_trf package and still tries to treat it as a Hugging Face repo ID.
Perhaps I need to specify a different path syntax, or explicitly load/initialize the pipeline first and then call resume_training instead of initialize.
I’m not sure if there’s a hidden “transformers_data” folder or environment variable that spaCy expects for en_core_web_trf when training from scratch.
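One configuration pattern that may be relevant here (a hedged sketch, untested in this environment): spaCy treats the `name` under [components.transformer.model] strictly as a Hugging Face model ID, while whole components from an installed pipeline package are reused via `source` instead:

```ini
# Reuse the trained transformer and ner components from the installed
# en_core_web_trf package, rather than pointing `name` at a
# (nonexistent) Hugging Face repo ID.
[components.transformer]
source = "en_core_web_trf"

[components.ner]
source = "en_core_web_trf"
```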
Things I’ve Checked
Confirmed spacy-transformers is installed:
Confirmed Hugging Face hub access works (downloading roberta-base, for example, succeeds).
Tried both with and without a GPU; same error on CPU/GPU.
Tried building a minimal config.cfg with only the transformer & ner components (no extra custom layers), but it still refuses to find en_core_web_trf.
What I’m Hoping For
Clarification on how to reference a locally installed spaCy transformer pipeline like en_core_web_trf in config.cfg so that spacy train will use it, rather than trying to re-download it from Hugging Face.
Any examples of a working config.cfg that directly uses en_core_web_trf under [components.transformer.model].
Notes on workflow: if spaCy’s CLI can’t initialize a transformer from a locally installed pip package, should I instead call nlp = spacy.load("en_core_web_trf") manually and then use nlp.resume_training()? If so, are there code snippets or recommended best practices?
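On that last point, a minimal sketch of the load-then-resume pattern. A blank English pipeline stands in for en_core_web_trf here so the snippet runs without the transformer weights; with the real pipeline you would write `nlp = spacy.load("en_core_web_trf")` and skip the `initialize()` call:

```python
import spacy
from spacy.training import Example

# Stand-in for: nlp = spacy.load("en_core_web_trf")
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("ORG")

# Toy training data; entities are (start_char, end_char, label) offsets.
train_data = [("I work at Acme Corp", {"entities": [(10, 19, "ORG")]})]
examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in train_data]

nlp.initialize(lambda: examples)   # not needed when loading a trained pipeline
optimizer = nlp.resume_training()  # reuse existing weights instead of re-init

losses = {}
nlp.update(examples, sgd=optimizer, losses=losses)
print(sorted(losses))
```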
Thank you in advance for any guidance or configuration examples that let me successfully fine-tune a transformer-based NER model using en_core_web_trf.