Dataset for fine-tuning on python for Code generation task #42

hitesh-anand · 2022-04-05T11:20:31Z

Dear Sir,

For the text-to-code generation task, the model is fine-tuned on the Concode Java dataset, but I want to fine-tune it on a Python dataset. While figuring out how to do this, I came across the following issue: https://github.com/salesforce/CodeT5/issues/36, where it is mentioned that we can fine-tune on the Python subset of CodeSearchNet.

However, the Python subset of CodeSearchNet contains various fields such as repo, path, url, original_string, etc., whereas the Concode dataset contains only two fields for each function: code and nl. Could you please guide me on how to create a similar dataset for Python, so that I can fine-tune the model for text-to-code generation on Python?

yuewang-cuhk · 2022-04-12T01:28:02Z

Hi, if you want to use the Python subset of CodeSearchNet to train a text-to-code generation model, you can get the nl and code information from it as well. Besides the fields you listed, the CodeSearchNet dataset contains docstrings (nl) and code_tokens (code). You just need to filter out the examples with empty docstrings.
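
The conversion described above could be sketched roughly as follows. This is a minimal illustration, not the script used by the CodeT5 authors. It assumes each input line is a JSON object with `docstring` and `code_tokens` keys, as in the CodeSearchNet `.jsonl` files, and that the target format is one JSON object per line with `nl` and `code` keys, as in Concode. The helper names `to_concode_record` and `convert` are hypothetical, and collapsing the docstring whitespace is an extra choice made here for illustration.

```python
import json


def to_concode_record(entry):
    """Map one CodeSearchNet-style record to a {"nl", "code"} pair.

    Returns None when the docstring or the token list is empty,
    so the caller can drop the example.
    """
    docstring = (entry.get("docstring") or "").strip()
    code_tokens = entry.get("code_tokens") or []
    if not docstring or not code_tokens:
        return None
    return {
        # Collapse newlines and repeated spaces (an assumption, not a requirement).
        "nl": " ".join(docstring.split()),
        # Join the pre-tokenized code with single spaces.
        "code": " ".join(code_tokens),
    }


def convert(lines):
    """Yield Concode-style JSON lines from CodeSearchNet-style JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = to_concode_record(json.loads(line))
        if record is not None:
            yield json.dumps(record)
```

To process a file, one could open a CodeSearchNet `.jsonl` split, pass the file object to `convert`, and write each yielded string on its own line of the output file.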

yuewang-cuhk closed this as completed Apr 12, 2022
