
The (sub)tokenizer logic used to produce the seq2seq dataset? #1

Closed
frankxu2004 opened this issue Jul 14, 2020 · 2 comments

frankxu2004 commented Jul 14, 2020

Hi Uri Alon,

Thanks for the impressive work, and especially thank you for releasing the data, which has been hard to collect for various previous publications since there are so many variants and versions.

I am interested in the Java seq2seq dataset you presented, and I am wondering what tokenization logic is used. Is it BPE or some Java-specific heuristic? Thank you!


urialon commented Jul 15, 2020

Hi @frankxu2004 ,
Thank you for your interest in our work!

To create the seq2seq data we used this script: https://github.com/tech-srl/slm-code-generation/blob/master/baselines/prep_baseline.py
I think it is pretty self-explanatory (see the available flags), but let me know if you have any additional questions about this script.

That script can be run on the "Java-small-json" file to produce the "Java-small-seq2seq" data: https://github.com/tech-srl/slm-code-generation#java

We used a regular-expression-based heuristic that splits tokens into subtokens. So, for example, a variable called currentIndex would be split into ['current', 'index'].
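For illustration, here is a minimal sketch of what such a regex-based subtoken splitter might look like. The regex and function name below are assumptions made for this example, not the exact code from prep_baseline.py; see the linked script for the actual logic.

```python
import re

# Hypothetical sketch of a regex-based subtokenizer (not the exact code
# from prep_baseline.py): splits identifiers on camelCase boundaries,
# underscores, acronym runs, and digit runs, then lowercases each piece.
SUBTOKEN_RE = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+')

def split_subtokens(token):
    """Split a source-code token into lowercase subtokens."""
    return [piece.lower() for piece in SUBTOKEN_RE.findall(token)]

print(split_subtokens('currentIndex'))       # ['current', 'index']
print(split_subtokens('MAX_VALUE'))          # ['max', 'value']
print(split_subtokens('parseHTTPResponse'))  # ['parse', 'http', 'response']
```

This style of splitting, as opposed to BPE, keeps subtokens aligned with the natural identifier boundaries in Java code.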

I hope it helps,
Uri


urialon commented Jul 23, 2020

Closing due to inactivity. If you have any further questions, feel free to re-open or create a new issue.

urialon closed this as completed Jul 23, 2020