Dataset for fine-tuning on python for Code generation task #42

Closed
hitesh-anand opened this issue Apr 5, 2022 · 1 comment

Comments

@hitesh-anand

Dear Sir,

For the text-to-code generation task, the model is fine-tuned on the Concode Java dataset, but I want to fine-tune it on a Python dataset. While figuring out how to do this, I came across the following issue: https://github.com/salesforce/CodeT5/issues/36, where it is mentioned that we can fine-tune on the Python subset of CodeSearchNet.

However, the Python subset of CodeSearchNet contains various fields such as repo, path, url, original string, etc., whereas the Concode dataset contains only two fields for each function: code and nl. Could you please guide me on how to create a similar dataset for Python, so that I can fine-tune the model for text-to-code generation on Python?

@yuewang-cuhk
Contributor

Hi, if you want to use the Python subset of CodeSearchNet to train a text-to-code generation model, you can get the nl and code information from it as well. Besides the fields you listed, the CodeSearchNet dataset also contains docstrings (nl) and code_tokens (code). You just need to filter out the examples with empty docstrings.
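
A minimal sketch of that conversion, under some assumptions: each CodeSearchNet example is one JSON object per line with a `docstring` string and a `code_tokens` list, and the target file uses one JSON object per line with the two keys `nl` and `code`, as described in the question. The function names `to_concode_records` and `convert_file`, as well as any file paths you pass in, are hypothetical and not part of the CodeT5 repository.

```python
import json


def to_concode_records(jsonl_lines):
    """Yield {"nl", "code"} dicts from CodeSearchNet-style JSON lines.

    Assumes each line is a JSON object with a "docstring" string and a
    "code_tokens" list; all other fields (repo, path, url, ...) are dropped.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        # Collapse newlines and repeated whitespace in the docstring.
        nl = " ".join((example.get("docstring") or "").split())
        # Join the tokenized code into a single space-separated string.
        code = " ".join(example.get("code_tokens") or [])
        # Skip examples with an empty docstring (or no code).
        if not nl or not code:
            continue
        yield {"nl": nl, "code": code}


def convert_file(src_path, dst_path):
    """Write the filtered records to dst_path, one JSON object per line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in to_concode_records(src):
            dst.write(json.dumps(record) + "\n")
```

Calling `convert_file` once per split (train, validation, test) should give files with only the two fields. Whether the fine-tuning script expects exactly this layout is worth checking against the existing Concode files before training.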
