Reproducing the POS Tagger Results using Huggingface Tokenizer Offsets #36

Closed
nu11us opened this issue Mar 27, 2022 · 2 comments


nu11us commented Mar 27, 2022

Hi, I was planning to implement the same POS tagger architecture with the bertweet-base model through Huggingface, but since bertweet-base is not supported by PreTrainedTokenizerFast, the offset_mappings are not accessible, which makes it hard to recover the embeddings for a given token for POS tagging (I planned on pooling the subword embeddings per token in Tweebank). The tokenizer doesn't seem to deviate from the Huggingface RoBERTa tokenizers except for the normalization functionality, so is there any way to use this feature, or could it be added (perhaps in a setting that doesn't use normalization)? It already works for the bertweet-large model, so I assume it's not impossible.
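
For reference, here is a rough sketch of the pooling I have in mind, written against bertweet-large (whose fast tokenizer already works). It uses word_ids() from the fast tokenizer rather than raw offset_mapping; the mean-pooling and the example sentence are only illustrative:

import torch
from transformers import AutoModel, AutoTokenizer

# bertweet-large already ships a fast tokenizer, so subword-to-word alignment is available
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large", use_fast=True)
model = AutoModel.from_pretrained("vinai/bertweet-large")

words = ["@USER", "loves", "rainy", "Mondays"]  # one pre-tokenized tweet, Tweebank-style
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (num_subwords, hidden_size)

# Mean-pool the subword embeddings that belong to each original word
word_ids = encoding.word_ids(batch_index=0)  # None for special tokens, else the word index
pooled = torch.stack([
    hidden[[i for i, w in enumerate(word_ids) if w == idx]].mean(dim=0)
    for idx in range(len(words))
])  # (num_words, hidden_size): one vector per token to feed a POS classifier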

@datquocnguyen
Member

bertweet-base should run without issue in legacy mode: https://github.com/huggingface/transformers/tree/main/examples/legacy/token-classification

Here is an example for sequence labeling with bertweet-base:

cd transformers/examples/legacy/token-classification
 
# Training configuration (this example uses the WNUT16 NER data; point DATA_DIR and LABELS at Tweebank files for POS tagging)
TASK_NAME=ner
SEED=1000
OUTPUT_DIR=evalBERTweet_data/ner-wnut16-s1000-bertweet-base
MAX_LENGTH=128
BERT_MODEL=bertweet-base
BATCH_SIZE=32
NUM_EPOCHS=50
SAVE_STEPS=20
PEAK_LR=1e-5
WARMUP=200
METRIC=f1
DATA_DIR=NER/wnut16
LABELS=NER/wnut16/labels.txt
 
# Fine-tune, evaluate and predict with the legacy run_ner.py script
python3 run_ner.py \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--labels $LABELS \
--seed $SEED \
--per_device_train_batch_size $BATCH_SIZE \
--tokenizer_name $BERT_MODEL \
--num_train_epochs $NUM_EPOCHS \
--learning_rate $PEAK_LR \
--warmup_steps $WARMUP \
--data_dir $DATA_DIR \
--do_train \
--do_eval \
--do_predict \
--evaluation_strategy epoch \
--save_strategy epoch \
--save_total_limit 3 \
--metric_for_best_model $METRIC \
--load_best_model_at_end \
--overwrite_output_dir 

@datquocnguyen
Member

@nu11us I recently developed a fast tokenizer for bertweet-base. You can experiment with it by installing transformers from this branch:
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
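
For example, after installing transformers from that branch (e.g. cd transformers && pip install -e .), something along these lines should expose the offsets; this is just a sketch, and the exact behavior may change before the fast tokenizer is merged upstream:

from transformers import AutoTokenizer

# Assumes transformers was installed from the branch above; until the fast tokenizer
# is merged upstream, the flag and behavior below may differ slightly.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=True)

enc = tokenizer("SC has first two presumptive cases of coronavirus", return_offsets_mapping=True)
print(enc["offset_mapping"])  # per-subword (start, end) character spans, usable to pool subwords per token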

If you find it useful, please comment in that thread (huggingface/transformers#17254), so that the fast tokenizer can be merged into the main transformers library soon.
