
Repeated tokens are generated compared to beam search when using fast_contrastive_search on T5 #5

Closed
wuzhiye7 opened this issue Jun 22, 2022 · 4 comments


wuzhiye7 commented Jun 22, 2022

I used fast_contrastive_search, copied from https://github.com/yxuansu/SimCTG/blob/main/SimCTGEncDec/SimCTGT5/simctgt5.py, as follows:
[screenshot of the code]
but it generated repeated tokens:
[screenshot of the repeated output]
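Since neither screenshot survived, here is a rough sketch of the shape of the copied call; the model name and decoding parameter values are assumptions filled in from the rest of this thread, not the exact values from the screenshot (the working setup appears in the reply further below):

from transformers import BertTokenizer
from transformers.models.mt5.modeling_mt5 import MT5ForConditionalGeneration
from simctg.simctgt5 import SimCTGT5

# assumed checkpoint; the actual model is only named later in this thread
model_name = 'imxly/t5-pegasus'
tokenizer = BertTokenizer.from_pretrained(model_name)
t5model = MT5ForConditionalGeneration.from_pretrained(model_name)
model = SimCTGT5(model_name, user_defined_model=t5model,
                 user_defined_tokenizer=tokenizer, special_token_list=[])
ids = tokenizer.encode('我不会贴假睫毛呀,好难!', return_tensors='pt')
# beam_width / alpha / decoding_len are illustrative values, not the originals
output = model.fast_contrastive_search(input_ids=ids, beam_width=5, alpha=0.5,
        decoding_len=30, start_of_sequence_token_id=tokenizer.cls_token_id,
        end_of_sequence_token_id=tokenizer.sep_token_id, early_stop=True)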


yxuansu commented Jun 22, 2022


Hi @wuzhiye7,

Can you send me the name of the Chinese BART Hugging Face model and your inputs? I would like to test the instance and give you some feedback.


wuzhiye7 commented Jun 22, 2022

Model: t5-pegasus-base, Hugging Face model hub: imxly/t5-pegasus
[screenshot]
input_tokens: ['[CLS]', '我', '不会', '贴', '假', '睫', '毛', '呀', ',', '好', '难', '!', '[SEP]']
input_ids: [[101, 1909, 6932, 4745, 463, 3466, 2644, 840, 5661, 1266, 5314, 5658, 102], [101, 32018, 1909, 7117, 7914, 4913, 3399, 179, 505, 1963, 3443, 26300, 2808, 6312, 135, 40959, 731, 31348, 15699, 5661, 24630, 1963, 463, 3466, 2644, 637, 199, 2374, 2106, 4866, 5661, 541, 1963, 26300, 4745, 198, 28756, 4745, 615, 26257, 198, 179, 102]]
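For reference, the ids above can be mapped back to tokens with the checkpoint's tokenizer; a minimal sketch (t5-pegasus ships a BERT-style vocab, so BertTokenizer is used rather than the usual T5 SentencePiece tokenizer):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('imxly/t5-pegasus')
# the first list above (the encoder input); should recover the
# input_tokens listed, e.g. ['[CLS]', '我', '不会', '贴', ...]
src_ids = [101, 1909, 6932, 4745, 463, 3466, 2644, 840, 5661, 1266, 5314, 5658, 102]
print(tokenizer.convert_ids_to_tokens(src_ids))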

@yxuansu


yxuansu commented Jun 22, 2022


Hi @wuzhiye7,

I have tested the case on my end. Please follow the instructions below:

(1) First, install simctg from pip:

pip install simctg --upgrade

(2) Second, run the example below:

from simctg.simctgt5 import SimCTGT5
model_name = 'imxly/t5-pegasus'
# initialize tokenizer (this checkpoint uses a BERT-style vocab)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)
# initialize model
from transformers.models.mt5.modeling_mt5 import MT5ForConditionalGeneration
t5model = MT5ForConditionalGeneration.from_pretrained(model_name)
model = SimCTGT5(model_name, user_defined_model=t5model, user_defined_tokenizer=tokenizer, special_token_list=[])

print('------------------------------------------')
# prepare input
text = '我不会贴假睫毛呀,好难!'
ids = tokenizer.encode(text, return_tensors='pt')
print('The input text is: {}'.format(text))
print('------------------------------------------')
# generate result with contrastive search
output = model.fast_contrastive_search(input_ids=ids, beam_width=5, alpha=0.5, decoding_len=30,
        start_of_sequence_token_id=tokenizer.cls_token_id,
        end_of_sequence_token_id=tokenizer.sep_token_id, early_stop=True)
output_text = ''.join(tokenizer.convert_ids_to_tokens(output))
print('The output text is: {}'.format(output_text))
'''
  ------------------------------------------
  The input text is: 我不会贴假睫毛呀,好难!
  ------------------------------------------
  The output text is: 如何贴假睫毛?我是女生
'''
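Note that alpha weights the degeneration penalty in contrastive search: alpha = 0 reduces to greedy decoding, and larger values penalize repetition more strongly. For the beam-search side of the comparison in the issue title, the same underlying model can be decoded with the standard transformers generate API; a sketch, assuming this checkpoint's BERT-style special-token ids need to be passed explicitly:

# plain beam search on the same model, for comparison with contrastive search
# (uses the standard transformers generate API, not part of simctg)
beam_ids = t5model.generate(ids, num_beams=5, max_length=30,
        decoder_start_token_id=tokenizer.cls_token_id,
        eos_token_id=tokenizer.sep_token_id)
print(''.join(tokenizer.convert_ids_to_tokens(beam_ids[0])))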

P.S. If you are interested, the source code of the simctg package is located here: https://github.com/yxuansu/SimCTG/tree/main/simctg.

Please let me know if you have any questions.

wuzhiye7 commented


Thanks, it works now.
