Reproducing the POS Tagger Results using Huggingface Tokenizer Offsets #36

Closed
nu11us opened this issue Mar 27, 2022 · 2 comments


nu11us commented Mar 27, 2022

Hi, I was planning to implement the same POS tagger architecture with the bertweet-base model through Huggingface, but since bertweet-base is not supported by PreTrainedTokenizerFast, the offset_mappings are not accessible, which makes it hard to recover the embeddings for a given token for POS tagging (I planned on pooling the subword embeddings per token in Tweebank). The tokenizer doesn't seem to deviate from the Huggingface RoBERTa tokenizers except for the normalization functionality, so is there any way to use this feature, or could it be added (perhaps in a setting that doesn't use normalization)? It already works for the bertweet-large model, so I assume it's not impossible.
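
For reference, here is a rough sketch of the pooling I have in mind, written against bertweet-large (whose fast tokenizer already works). It uses word_ids() from the fast tokenizer rather than raw offset_mapping; the mean-pooling and the example sentence are only illustrative:

import torch
from transformers import AutoModel, AutoTokenizer

# bertweet-large already ships a fast tokenizer, so subword-to-word alignment is available
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large", use_fast=True)
model = AutoModel.from_pretrained("vinai/bertweet-large")

words = ["@USER", "loves", "rainy", "Mondays"]  # one pre-tokenized tweet, Tweebank-style
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (num_subwords, hidden_size)

# Mean-pool the subword embeddings that belong to each original word
word_ids = encoding.word_ids(batch_index=0)  # None for special tokens, else the word index
pooled = torch.stack([
    hidden[[i for i, w in enumerate(word_ids) if w == idx]].mean(dim=0)
    for idx in range(len(words))
])  # (num_words, hidden_size): one vector per token to feed a POS classifier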

@datquocnguyen
Member

bertweet-base should run without issue in legacy mode: https://github.com/huggingface/transformers/tree/main/examples/legacy/token-classification

Here is an example for sequence labeling with bertweet-base:

cd transformers/examples/legacy/token-classification
 
# Training configuration (this example uses the WNUT16 NER data; point DATA_DIR and LABELS at Tweebank files for POS tagging)
TASK_NAME=ner
SEED=1000
OUTPUT_DIR=evalBERTweet_data/ner-wnut16-s1000-bertweet-base
MAX_LENGTH=128
BERT_MODEL=bertweet-base
BATCH_SIZE=32
NUM_EPOCHS=50
SAVE_STEPS=20
PEAK_LR=1e-5
WARMUP=200
METRIC=f1
DATA_DIR=NER/wnut16
LABELS=NER/wnut16/labels.txt
 
# Fine-tune, evaluate and predict with the legacy run_ner.py script
python3 run_ner.py \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--labels $LABELS \
--seed $SEED \
--per_device_train_batch_size $BATCH_SIZE \
--tokenizer_name $BERT_MODEL \
--num_train_epochs $NUM_EPOCHS \
--learning_rate $PEAK_LR \
--warmup_steps $WARMUP \
--data_dir $DATA_DIR \
--do_train \
--do_eval \
--do_predict \
--evaluation_strategy epoch \
--save_strategy epoch \
--save_total_limit 3 \
--metric_for_best_model $METRIC \
--load_best_model_at_end \
--overwrite_output_dir 

@datquocnguyen
Member

@nu11us I recently developed a fast tokenizer for bertweet-base. You can experiment with it by installing transformers from this branch:
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
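
For example, after installing transformers from that branch (e.g. cd transformers && pip install -e .), something along these lines should expose the offsets; this is just a sketch, and the exact behavior may change before the fast tokenizer is merged upstream:

from transformers import AutoTokenizer

# Assumes transformers was installed from the branch above; until the fast tokenizer
# is merged upstream, the flag and behavior below may differ slightly.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=True)

enc = tokenizer("SC has first two presumptive cases of coronavirus", return_offsets_mapping=True)
print(enc["offset_mapping"])  # per-subword (start, end) character spans, usable to pool subwords per token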

If you find it useful, please comment in that thread (huggingface/transformers#17254), so that the fast tokenizer can be merged into the main transformers library soon.
