
Error running run_re.py - TypeError: TextInputSequence must be str #14

Closed · gohjiayi opened this issue Jan 21, 2022 · 1 comment

@gohjiayi

Hi there,

I was running run_re.py for biobert_re when I encountered the following error:

01/21/2022 16:21:02 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
01/21/2022 16:21:16 - INFO - utils_re -   Creating features from dataset file at /data/jiayi/n2c2/dataset_re
Traceback (most recent call last):
  File "run_re.py", line 230, in <module>
    main()
  File "run_re.py", line 103, in main
    REDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/utils_re.py", line 132, in __init__
    self.features = glue_convert_examples_to_features(
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 40, in glue_convert_examples_to_features
    return _glue_convert_examples_to_features(
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 74, in _glue_convert_examples_to_features
    batch_encoding = tokenizer(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2418, in __call__
    return self.batch_encode_plus(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2609, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 409, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextInputSequence must be str

Upon further debugging, this appears to be an issue on the Hugging Face side, as discussed in the comment here. To fix it, I simply added a use_fast=False parameter at line 99 of run_re.py, as shown below.

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=False
)

Hope this will be helpful to anyone else encountering the same issue.
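For context, the TypeError comes from the fast (Rust-based) tokenizer rejecting None where a str is expected, e.g. when an example's second sequence is missing. As an alternative to use_fast=False, one can sanitize the batch before tokenization. This is a minimal sketch with a hypothetical helper, not part of run_re.py:

```python
# Hypothetical helper: convert (text_a, text_b) examples into inputs that
# both fast and slow tokenizers accept, by dropping a None second sequence.
def sanitize_pairs(examples):
    batch = []
    for text_a, text_b in examples:
        if text_b is None:
            batch.append(text_a)            # single-sequence input
        else:
            batch.append((text_a, text_b))  # sequence-pair input
    return batch

pairs = [("aspirin causes nausea", None), ("drug", "reaction")]
print(sanitize_pairs(pairs))  # ['aspirin causes nausea', ('drug', 'reaction')]
```

The sanitized list can then be passed to the tokenizer as usual; whether this fits depends on how the downstream features consume the pair structure.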

@smitkiri (Owner) commented Jan 21, 2022

As mentioned in the fix on the Hugging Face issue, this only affects newer versions of transformers. This project used transformers v3.0.2 (as specified in the requirements file), where the issue does not occur.
Thanks for pointing it out; it could be useful for folks running this project with newer versions of transformers!
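Since the failure is version-dependent, one could gate the use_fast choice on the installed transformers version. The threshold below is an assumption (fast tokenizers became the default in the v4 line; the pinned v3.0.2 predates it), and the helper is hypothetical:

```python
# Hypothetical helper: decide whether to pass use_fast=False based on the
# transformers version string. The (4, 0, 0) threshold is an assumption.
def needs_slow_tokenizer(version, threshold=(4, 0, 0)):
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= threshold

print(needs_slow_tokenizer("3.0.2"))   # False: pinned version is unaffected
print(needs_slow_tokenizer("4.15.0"))  # True: pass use_fast=False
```

In run_re.py this could feed the tokenizer call, e.g. use_fast=not needs_slow_tokenizer(transformers.__version__), though simply pinning the requirements-file version avoids the problem entirely.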
