
Repeated tokens are generated compared to beam search when using fast_contrastive_search on T5 #5

Closed
wuzhiye7 opened this issue Jun 22, 2022 · 4 comments


wuzhiye7 commented Jun 22, 2022

I used fast_contrastive_search, copied from https://github.com/yxuansu/SimCTG/blob/main/SimCTGEncDec/SimCTGT5/simctgt5.py, as follows:
[screenshot of the code]
but it generated repeated tokens:
[screenshot of the repeated output]
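Since neither screenshot survived, here is a rough sketch of the shape of the copied call; the model name and decoding parameter values are assumptions filled in from the rest of this thread, not the exact values from the screenshot (the working setup appears in the reply further below):

from transformers import BertTokenizer
from transformers.models.mt5.modeling_mt5 import MT5ForConditionalGeneration
from simctg.simctgt5 import SimCTGT5

# assumed checkpoint; the actual model is only named later in this thread
model_name = 'imxly/t5-pegasus'
tokenizer = BertTokenizer.from_pretrained(model_name)
t5model = MT5ForConditionalGeneration.from_pretrained(model_name)
model = SimCTGT5(model_name, user_defined_model=t5model,
                 user_defined_tokenizer=tokenizer, special_token_list=[])
ids = tokenizer.encode('我不会贴假睫毛呀,好难!', return_tensors='pt')
# beam_width / alpha / decoding_len are illustrative values, not the originals
output = model.fast_contrastive_search(input_ids=ids, beam_width=5, alpha=0.5,
        decoding_len=30, start_of_sequence_token_id=tokenizer.cls_token_id,
        end_of_sequence_token_id=tokenizer.sep_token_id, early_stop=True)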


yxuansu commented Jun 22, 2022


Hi @wuzhiye7,

Can you send me the name of the Chinese BART Hugging Face model and your inputs? I would like to test the instance and give you some feedback.


wuzhiye7 commented Jun 22, 2022

Model: t5-pegasus-base, Hugging Face model hub: imxly/t5-pegasus
[screenshot]
input_tokens: ['[CLS]', '我', '不会', '贴', '假', '睫', '毛', '呀', ',', '好', '难', '!', '[SEP]']
input_ids: [[101, 1909, 6932, 4745, 463, 3466, 2644, 840, 5661, 1266, 5314, 5658, 102], [101, 32018, 1909, 7117, 7914, 4913, 3399, 179, 505, 1963, 3443, 26300, 2808, 6312, 135, 40959, 731, 31348, 15699, 5661, 24630, 1963, 463, 3466, 2644, 637, 199, 2374, 2106, 4866, 5661, 541, 1963, 26300, 4745, 198, 28756, 4745, 615, 26257, 198, 179, 102]]
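For reference, the ids above can be mapped back to tokens with the checkpoint's tokenizer; a minimal sketch (t5-pegasus ships a BERT-style vocab, so BertTokenizer is used rather than the usual T5 SentencePiece tokenizer):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('imxly/t5-pegasus')
# the first list above (the encoder input); should recover the
# input_tokens listed, e.g. ['[CLS]', '我', '不会', '贴', ...]
src_ids = [101, 1909, 6932, 4745, 463, 3466, 2644, 840, 5661, 1266, 5314, 5658, 102]
print(tokenizer.convert_ids_to_tokens(src_ids))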

@yxuansu


yxuansu commented Jun 22, 2022


Hi @wuzhiye7,

I have tested the case on my end. Please follow the instructions below:

(1) First, install simctg from pip:

pip install simctg --upgrade

(2) Second, run the example below:

from simctg.simctgt5 import SimCTGT5
model_name = 'imxly/t5-pegasus'
# initialize tokenizer (this checkpoint uses a BERT-style vocab)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)
# initialize model
from transformers.models.mt5.modeling_mt5 import MT5ForConditionalGeneration
t5model = MT5ForConditionalGeneration.from_pretrained(model_name)
model = SimCTGT5(model_name, user_defined_model=t5model, user_defined_tokenizer=tokenizer, special_token_list=[])

print('------------------------------------------')
# prepare input
text = '我不会贴假睫毛呀,好难!'
ids = tokenizer.encode(text, return_tensors='pt')
print('The input text is: {}'.format(text))
print('------------------------------------------')
# generate result with contrastive search
output = model.fast_contrastive_search(input_ids=ids, beam_width=5, alpha=0.5, decoding_len=30,
        start_of_sequence_token_id=tokenizer.cls_token_id,
        end_of_sequence_token_id=tokenizer.sep_token_id, early_stop=True)
output_text = ''.join(tokenizer.convert_ids_to_tokens(output))
print('The output text is: {}'.format(output_text))
'''
  ------------------------------------------
  The input text is: 我不会贴假睫毛呀,好难!
  ------------------------------------------
  The output text is: 如何贴假睫毛?我是女生
'''
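Note that alpha weights the degeneration penalty in contrastive search: alpha = 0 reduces to greedy decoding, and larger values penalize repetition more strongly. For the beam-search side of the comparison in the issue title, the same underlying model can be decoded with the standard transformers generate API; a sketch, assuming this checkpoint's BERT-style special-token ids need to be passed explicitly:

# plain beam search on the same model, for comparison with contrastive search
# (uses the standard transformers generate API, not part of simctg)
beam_ids = t5model.generate(ids, num_beams=5, max_length=30,
        decoder_start_token_id=tokenizer.cls_token_id,
        eos_token_id=tokenizer.sep_token_id)
print(''.join(tokenizer.convert_ids_to_tokens(beam_ids[0])))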

P.S. If you are interested, the source code of the simctg package is located here: https://github.com/yxuansu/SimCTG/tree/main/simctg.

Please let me know if you have any questions.

wuzhiye7 commented


Thanks, it works now.
