
about the repetition of the ground-truth #18

Closed
LHRYANG opened this issue Oct 25, 2022 · 3 comments

LHRYANG commented Oct 25, 2022

Hi Yixuan, I have a question about the calculation of the repetition rate of the ground truth. I used the code you provided:

# parse the generated results into a list of text
import json
in_f = r'./simctg_contrasive.json'
with open(in_f) as f:
    item_list = json.load(f)

text_list = []
for item in item_list:
    text = item['generated_result']['0']['continuation']
    text_list.append(text)

# compute the evaluation results
from simctg.evaluation import measure_repetition_and_diversity
rep_2, rep_3, rep_4, diversity = measure_repetition_and_diversity(text_list)
print('The result of rep-2 is {}, rep-3 is {}, rep-4 is {}, and diversity is {}'.format(rep_2, rep_3, rep_4, round(diversity, 2)))
'''
   The result of rep-2 is 3.93, rep-3 is 0.78, rep-4 is 0.31, and diversity is 0.95
'''
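
As a sanity check, my understanding of the paper's definitions is rep-n = 100 * (1 - |unique n-grams| / |total n-grams|) and diversity = the product of (1 - rep-n/100) over n = 2, 3, 4, which is consistent with the printed numbers:

# Sanity check (my understanding of the paper's definitions; not the library code):
diversity_check = (1 - 3.93/100) * (1 - 0.78/100) * (1 - 0.31/100)
print(round(diversity_check, 2))  # 0.95, matching the printed diversity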

I can reproduce the result you reported in your paper:

The result of rep-2 is 3.93, rep-3 is 0.78, rep-4 is 0.31, and diversity is 0.95

However, when I change the line "text = item['generated_result']['0']['continuation']" to "text = item['reference_continuation_text']", it outputs:

The result of rep-2 is 5.44, rep-3 is 1.28, rep-4 is 0.43, and diversity is 0.93

which is different from the human score reported in your paper.
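
For clarity, the only change is in the loop that builds the text list:

text_list = []
for item in item_list:
    # use the human reference continuation instead of the generated one
    text = item['reference_continuation_text']
    text_list.append(text)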

Could you help me solve this issue?
Thanks a lot!

yxuansu (Owner) commented Oct 25, 2022

Hi @LHRYANG -- Thank you for your interest in our work. Have you tried to truncate the reference text to its first 128 tokens and then measure the diversity?
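
For example, something along these lines (a rough sketch; the evaluation code tokenizes on whitespace):

# Rough sketch: keep only the first 128 whitespace tokens of each
# reference before measuring diversity.
truncated_list = []
for text in text_list:
    tokens = text.strip().split()
    truncated_list.append(' '.join(tokens[:128]))

rep_2, rep_3, rep_4, diversity = measure_repetition_and_diversity(truncated_list)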

LHRYANG (Author) commented Oct 26, 2022

Following your suggestion, I truncated the reference text by adding two lines to your original eval_text code (the "if len(token_list) > 128:" check below):

def eval_text(text, ngram):
    token_list = text.strip().split()
    # the two added lines: truncate the reference to its first 128 tokens
    if len(token_list) > 128:
        token_list = token_list[0:128]
    start_idx, end_idx = 0, ngram
    total_num = 0
    ngram_set = set()
    while end_idx < len(token_list):
        one_ngram_list = token_list[start_idx:end_idx]
        assert len(one_ngram_list) == ngram
        one_ngram = ' '.join(one_ngram_list)
        # loop body completed here for runnability: record the n-gram
        # and slide the window one token to the right
        ngram_set.add(one_ngram)
        total_num += 1
        start_idx += 1
        end_idx += 1
    return len(ngram_set), total_num

The output is: "The result of rep-2 is 4.53, rep-3 is 1.07, rep-4 is 0.37, and diversity is 0.94", still different from what you reported.

yxuansu (Owner) commented Oct 26, 2022

Hi @LHRYANG — I will double check the results on my end. Feel free to report your replicated numbers in your work :-)

yxuansu closed this as completed Oct 26, 2022