I am reading your CodeRL paper. It uses the APPS benchmark for the performance comparison with Codex. Do you have any comparison results on the HumanEval dataset?
In our case, the HumanEval dataset would not be the best evaluation benchmark. HumanEval is framed as a docstring-to-code task, in which the function signature and its docstring (in a code comment block) are given; this makes it well suited to zero-shot evaluation of larger LMs such as CodeGen and Codex.
In our paper, we focus on natural-language descriptions of a problem and generate the program from scratch.
One workaround is to reformulate HumanEval as a text-to-code task (see the sketch below), but the comparison might then not be fair to the current baselines.
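As a rough illustration of that reformulation (not part of CodeRL or the paper), here is a minimal sketch that turns a HumanEval-style prompt, i.e. Python source ending in a function signature plus docstring, into a plain text problem description. The parsing heuristics and the prompt wording are assumptions for illustration only.

```python
import ast

def humaneval_to_text_prompt(prompt: str) -> str:
    """Rewrite a HumanEval-style prompt (signature + docstring) into a
    natural-language problem description, so the model must write the
    whole program from scratch rather than complete a code stub.

    `prompt` is assumed to be valid Python source whose last top-level
    statement is a function definition carrying a docstring.
    """
    tree = ast.parse(prompt)
    # Take the last function definition (helper imports/defs may precede it).
    func = [n for n in tree.body if isinstance(n, ast.FunctionDef)][-1]
    docstring = ast.get_docstring(func) or ""
    args = ", ".join(a.arg for a in func.args.args)
    return (
        f"Write a Python function named `{func.name}` that takes ({args}).\n"
        f"Problem description: {docstring}\n"
    )
```

Whether such a rewritten prompt puts docstring-completion baselines at a disadvantage is exactly the fairness concern mentioned above.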