Source code parsing pipeline #4

Open
dhas opened this issue Jun 8, 2020 · 4 comments

dhas commented Jun 8, 2020

Hi @sonoisa,

I wasn't able to understand how you arrived at the dataset you provide in your code2vec/data directory. Could you clarify your source code parsing pipeline? If I understand correctly, you started with the parsed tokens serialized as JSON from http://groups.inf.ed.ac.uk/cup/codeattention/ and converted them into the *.txt files in code2vec/data. Am I right?

Would you be able to add the code for doing this to the repo? I need to parse sources written in C, which is why I'm seeking a clearer picture of the parsing step.

Thanks

@wangyu1997

I have the same problem.

Author

dhas commented Aug 26, 2020

@wangyu1997 - You may want to take a look at https://github.com/JetBrains-Research/astminer. They have a great parsing pipeline and a small implementation of Code2Vec, which can get you started.

@wangyu1997

@dhas Thank you for the reply. After reviewing your code, I noticed that all the variables in your terminal_idxs.txt are represented like "@var_xx". Could you explain this in more detail? Thanks!
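
For anyone with the same question: placeholders like "@var_xx" usually come from a preprocessing step that renames variables, so the model cannot rely on the original identifier names. The sketch below only illustrates that general idea and is not taken from this repository. The function name `anonymize_variables`, the token list, and the numbering scheme are all assumptions made for the example.

```python
def anonymize_variables(tokens, variable_names):
    """Replace each known variable name with a stable placeholder.

    Hypothetical helper: the first distinct variable seen becomes
    "@var_0", the next "@var_1", and so on. Repeated occurrences of
    the same variable reuse the same placeholder.
    """
    placeholders = {}
    result = []
    for token in tokens:
        if token in variable_names:
            if token not in placeholders:
                placeholders[token] = f"@var_{len(placeholders)}"
            result.append(placeholders[token])
        else:
            # Keywords, literals, and operators are left untouched.
            result.append(token)
    return result, placeholders


# Illustrative input: a tokenized statement sequence with two variables.
tokens = ["int", "count", "=", "0", ";", "count", "+=", "step", ";"]
anonymized, mapping = anonymize_variables(tokens, {"count", "step"})
print(anonymized)
print(mapping)
```

In a real pipeline the set of variable names would come from the parser (for example, declarations found in the AST) rather than being passed in by hand.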

Owner

sonoisa commented Aug 27, 2020

Hi @dhas, @wangyu1997,
I just uploaded a Jupyter Notebook (Scala script) to run the preprocessing.
