
The (sub)tokenizer logic used to produce the seq2seq dataset? #1

Closed
frankxu2004 opened this issue Jul 14, 2020 · 2 comments

frankxu2004 commented Jul 14, 2020

Hi Uri Alon,

Thanks for the impressive work, and especially thank you for releasing the data, which has been hard to collect for various previous publications since there are so many variants and versions.

I am interested in the Java seq2seq dataset you presented, and I am wondering what tokenization logic is used. Is it BPE or some Java-specific heuristic? Thank you!


urialon commented Jul 15, 2020

Hi @frankxu2004 ,
Thank you for your interest in our work!

To create the seq2seq data we used this script: https://github.com/tech-srl/slm-code-generation/blob/master/baselines/prep_baseline.py
I think it is pretty self-explanatory (see the available flags), but let me know if you have any additional questions about this script.

That script can be run on the "Java-small-json" file to produce the "Java-small-seq2seq" data: https://github.com/tech-srl/slm-code-generation#java

We used a regular-expression-based heuristic that splits tokens into subtokens. So, for example, a variable called currentIndex would be split into ['current', 'index'].
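For illustration, here is a minimal sketch of what such a regex-based subtoken splitter might look like. The regex and function name below are assumptions made for this example, not the exact code from prep_baseline.py; see the linked script for the actual logic.

```python
import re

# Hypothetical sketch of a regex-based subtokenizer (not the exact code
# from prep_baseline.py): splits identifiers on camelCase boundaries,
# underscores, acronym runs, and digit runs, then lowercases each piece.
SUBTOKEN_RE = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+')

def split_subtokens(token):
    """Split a source-code token into lowercase subtokens."""
    return [piece.lower() for piece in SUBTOKEN_RE.findall(token)]

print(split_subtokens('currentIndex'))       # ['current', 'index']
print(split_subtokens('MAX_VALUE'))          # ['max', 'value']
print(split_subtokens('parseHTTPResponse'))  # ['parse', 'http', 'response']
```

This style of splitting, as opposed to BPE, keeps subtokens aligned with the natural identifier boundaries in Java code.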

I hope it helps,
Uri


urialon commented Jul 23, 2020

Closing due to inactivity. If you have any further questions, feel free to re-open or create a new issue.

urialon closed this as completed Jul 23, 2020