Source code parsing pipeline #4

dhas · 2020-06-08T05:28:15Z

I wasn't able to understand how you arrived at the dataset you provide in your code2vec/data directory. Could you clarify your source code parsing pipeline? If I understand correctly, you started with the parsed tokens serialized as JSON from http://groups.inf.ed.ac.uk/cup/codeattention/ and converted them into the *.txt files in code2vec/data. Am I right?

Would you be able to add the code for doing this to the repo? I need to parse sources written in C, which is why I'm seeking a clearer picture of the parsing step.

Thanks

wangyu1997 · 2020-08-26T02:35:12Z

I have the same problem.

dhas · 2020-08-26T06:24:37Z

@wangyu1997 - You may want to take a look at https://github.com/JetBrains-Research/astminer. They have a great parsing pipeline and a small implementation of Code2Vec, which can get you started.

wangyu1997 · 2020-08-27T05:56:55Z

@dhas Thank you for the reply. After reviewing your code, I noticed that all the variables in your terminal_idxs.txt are represented like "@var_xx". Could you tell me more about the details? Thanks!
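
To make the question concrete, here is a minimal sketch of what such placeholder naming could look like. It assumes that each distinct variable name is replaced by a numbered `@var_<n>` placeholder in order of first appearance; that scheme, the function name `normalize_variables`, and the sample tokens are illustrative assumptions, not the repository's actual preprocessing.

```python
def normalize_variables(tokens, variable_names):
    """Replace every token found in `variable_names` with a placeholder.

    Assumption for illustration: placeholders are numbered in order of
    first appearance ("@var_0", "@var_1", ...). The real pipeline may
    number or scope them differently.
    """
    mapping = {}      # original variable name -> placeholder
    normalized = []
    for tok in tokens:
        if tok in variable_names:
            if tok not in mapping:
                mapping[tok] = f"@var_{len(mapping)}"
            normalized.append(mapping[tok])
        else:
            normalized.append(tok)
    return normalized, mapping


# Hypothetical token stream for: int total = count + total;
tokens = ["int", "total", "=", "count", "+", "total", ";"]
normalized, mapping = normalize_variables(tokens, {"total", "count"})
print(normalized)
print(mapping)
```

Replacing identifiers this way keeps the terminal vocabulary small, since arbitrary variable names collapse into a handful of placeholders.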

sonoisa · 2020-08-27T10:31:33Z

Hi @dhas, @wangyu1997,
I just uploaded a Jupyter Notebook (Scala script) to run the preprocessing.

Preprocessing to create path-contexts: https://github.com/sonoisa/code2vec/blob/master/create_path_contexts.ipynb
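
For readers new to the term, a path-context in the code2vec sense is a triple of two AST terminals and the path that connects them through their lowest common ancestor. The sketch below shows the idea on a tiny hand-built tree. It is not the notebook's Scala code: `Node`, `leaf_paths`, `path_contexts`, the node labels, and the `^` (up) and `_` (down) separators are all assumptions chosen for illustration.

```python
from itertools import combinations


class Node:
    """Toy AST node: a label, optional children, and a token for leaves."""

    def __init__(self, label, children=(), token=None):
        self.label = label
        self.children = list(children)
        self.token = token


def leaf_paths(node, prefix=()):
    """Yield (leaf, root_to_leaf_path) for every leaf under `node`."""
    path = prefix + (node,)
    if not node.children:
        yield node, path
    for child in node.children:
        yield from leaf_paths(child, path)


def path_contexts(root):
    """Return (token_a, path, token_b) for every pair of leaves."""
    contexts = []
    for (a, pa), (b, pb) in combinations(list(leaf_paths(root)), 2):
        # Length of the shared prefix; pa[i - 1] is the lowest common ancestor.
        i = 0
        while i < min(len(pa), len(pb)) and pa[i] is pb[i]:
            i += 1
        up = [n.label for n in reversed(pa[i:])]   # from leaf a upward
        top = pa[i - 1].label                      # lowest common ancestor
        down = [n.label for n in pb[i:]]           # downward to leaf b
        path = "^".join(up + [top]) + "_" + "_".join(down)
        contexts.append((a.token, path, b.token))
    return contexts


# Hypothetical tree for the statement: x = y + 1
root = Node("Assign", [
    Node("Name", token="x"),
    Node("BinOp", [Node("Name", token="y"), Node("Num", token="1")]),
])
contexts = path_contexts(root)
for ctx in contexts:
    print(ctx)
```

To apply the same idea to C sources, the toy tree would be replaced by the output of a real C parser, which is the part a tool such as astminer handles.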
