Questions about data preprocessing #80

Open
qkim2525 opened this issue Jan 3, 2023 · 0 comments
qkim2525 commented Jan 3, 2023

Hi.

My two colleagues and I are interested in replicating the results of CodeT5-base on the code generation task with our own dataset.
However, we're having a few hiccups preprocessing the data, and we hope you don't mind a few questions.

Mainly, we're wondering how you dealt with the Google BigQuery data alongside CodeSearchNet.
As far as we know, CodeSearchNet consists of nicely isolated, function-level code blocks,
while the BigQuery data and the other extra C/C# data from open-source GitHub repositories is, as far as we can guess, not presented in such a convenient manner.

Our own data is in a similar position: none of the code is isolated into function-level blocks; most of it consists of complete files.
We were wondering whether you did any preprocessing of your own so that the extra C/C# data would match the format of CodeSearchNet, or whether you just used it raw.
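
To make the question concrete, the kind of function-level splitting we have in mind looks roughly like the sketch below. It uses Python's standard `ast` module on Python source purely as an illustration; our C/C# files would need a different parser, and `extract_functions` is a name we made up, not something from the CodeT5 codebase.

```python
import ast


def extract_functions(source: str) -> list[str]:
    """Return the source text of every function definition in a Python file.

    Hypothetical helper for illustration only: it also picks up methods and
    nested functions, and it does not include decorators.
    """
    tree = ast.parse(source)
    blocks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment returns None if location info is missing.
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                blocks.append(segment)
    return blocks


whole_file = (
    "import os\n"
    "\n"
    "def foo(x):\n"
    "    return x + 1\n"
    "\n"
    "def bar(y):\n"
    "    return y * 2\n"
)
function_blocks = extract_functions(whole_file)
```

Here `function_blocks` holds one string per function, which is the shape we understand CodeSearchNet samples to have, while `whole_file` is the shape of our raw data.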

And if you did use it raw, did that affect performance compared to training the model only on CodeSearchNet data?
Thank you in advance.

P.S.
My colleagues are also wondering how you dealt with whitespace, since the paper wasn't entirely clear on that point.
One argues that you discarded whitespace altogether, while the other argues that you only collapsed each run of repeated whitespace into a single instance.
For example:
A. '\s\s\s' --> ''
B. '\s\s\s' --> '\s'
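
In case the notation above is ambiguous, here is a minimal sketch of the two interpretations using Python's `re` module. The function names are ours, and for option B we assume that the single remaining whitespace character is a plain space; neither is taken from the paper or the CodeT5 code.

```python
import re


def strip_all_whitespace(code: str) -> str:
    # Option A: delete every whitespace character.
    return re.sub(r"\s+", "", code)


def collapse_whitespace(code: str) -> str:
    # Option B: replace each run of whitespace with one space (our assumption).
    return re.sub(r"\s+", " ", code)


sample = "int   add(int a,\n\tint b)"
option_a = strip_all_whitespace(sample)  # "intadd(inta,intb)"
option_b = collapse_whitespace(sample)   # "int add(int a, int b)"
```

Option A merges adjacent tokens such as `int` and `add`, which is part of why we are unsure which of the two was intended.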
