I am reading your CodeRL paper. It uses the APPS benchmark for the performance comparison with Codex. Do you have any comparison results on the HumanEval dataset?
In our case, the HumanEval dataset would not be the best evaluation benchmark. HumanEval is framed as a docstring-to-code task, in which the function signature and its docstring (in a code comment block) are given; this makes it well suited to zero-shot evaluation of larger LMs such as CodeGen and Codex.
In our paper, we focus on natural-language descriptions of a problem and generate the program from scratch.
One workaround is to reformulate HumanEval as a text-to-code task (see the sketch below), but the comparison might then not be fair to the current baselines.
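As a rough illustration of that reformulation (not part of CodeRL or the paper), here is a minimal sketch that turns a HumanEval-style prompt, i.e. Python source ending in a function signature plus docstring, into a plain text problem description. The parsing heuristics and the prompt wording are assumptions for illustration only.

```python
import ast

def humaneval_to_text_prompt(prompt: str) -> str:
    """Rewrite a HumanEval-style prompt (signature + docstring) into a
    natural-language problem description, so the model must write the
    whole program from scratch rather than complete a code stub.

    `prompt` is assumed to be valid Python source whose last top-level
    statement is a function definition carrying a docstring.
    """
    tree = ast.parse(prompt)
    # Take the last function definition (helper imports/defs may precede it).
    func = [n for n in tree.body if isinstance(n, ast.FunctionDef)][-1]
    docstring = ast.get_docstring(func) or ""
    args = ", ".join(a.arg for a in func.args.args)
    return (
        f"Write a Python function named `{func.name}` that takes ({args}).\n"
        f"Problem description: {docstring}\n"
    )
```

Whether such a rewritten prompt puts docstring-completion baselines at a disadvantage is exactly the fairness concern mentioned above.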