
Error running run_re.py - TypeError: TextInputSequence must be str #14

Closed · gohjiayi opened this issue Jan 21, 2022 · 1 comment

@gohjiayi

Hi there,

I was running run_re.py for biobert_re when I encountered the following error:

01/21/2022 16:21:02 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
01/21/2022 16:21:16 - INFO - utils_re -   Creating features from dataset file at /data/jiayi/n2c2/dataset_re
Traceback (most recent call last):
  File "run_re.py", line 230, in <module>
    main()
  File "run_re.py", line 103, in main
    REDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/utils_re.py", line 132, in __init__
    self.features = glue_convert_examples_to_features(
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 40, in glue_convert_examples_to_features
    return _glue_convert_examples_to_features(
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 74, in _glue_convert_examples_to_features
    batch_encoding = tokenizer(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2418, in __call__
    return self.batch_encode_plus(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2609, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 409, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextInputSequence must be str

Upon further debugging, this appears to be an issue on the Hugging Face side, as discussed in the comment here. To fix it, I simply added a use_fast=False parameter at line 99 of run_re.py, as shown below.

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=False
)

Hope this will be helpful to anyone else encountering the same issue.
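For context, the TypeError comes from the fast (Rust-based) tokenizer rejecting None where a str is expected, e.g. when an example's second sequence is missing. As an alternative to use_fast=False, one can sanitize the batch before tokenization. This is a minimal sketch with a hypothetical helper, not part of run_re.py:

```python
# Hypothetical helper: convert (text_a, text_b) examples into inputs that
# both fast and slow tokenizers accept, by dropping a None second sequence.
def sanitize_pairs(examples):
    batch = []
    for text_a, text_b in examples:
        if text_b is None:
            batch.append(text_a)            # single-sequence input
        else:
            batch.append((text_a, text_b))  # sequence-pair input
    return batch

pairs = [("aspirin causes nausea", None), ("drug", "reaction")]
print(sanitize_pairs(pairs))  # ['aspirin causes nausea', ('drug', 'reaction')]
```

The sanitized list can then be passed to the tokenizer as usual; whether this fits depends on how the downstream features consume the pair structure.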

@smitkiri (Owner) commented Jan 21, 2022

As mentioned in the fix on the Hugging Face issue, this only affects newer versions of transformers. This project used transformers v3.0.2 (as specified in the requirements file), where the issue does not occur.
Thanks for pointing it out; it could be useful for folks running this project with newer versions of transformers!
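Since the failure is version-dependent, one could gate the use_fast choice on the installed transformers version. The threshold below is an assumption (fast tokenizers became the default in the v4 line; the pinned v3.0.2 predates it), and the helper is hypothetical:

```python
# Hypothetical helper: decide whether to pass use_fast=False based on the
# transformers version string. The (4, 0, 0) threshold is an assumption.
def needs_slow_tokenizer(version, threshold=(4, 0, 0)):
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= threshold

print(needs_slow_tokenizer("3.0.2"))   # False: pinned version is unaffected
print(needs_slow_tokenizer("4.15.0"))  # True: pass use_fast=False
```

In run_re.py this could feed the tokenizer call, e.g. use_fast=not needs_slow_tokenizer(transformers.__version__), though simply pinning the requirements-file version avoids the problem entirely.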
