This repository has been archived by the owner on Jul 30, 2024. It is now read-only.

Questions about the python and java datasets. #19

Closed
shengqiangzhang opened this issue Dec 4, 2020 · 3 comments

@shengqiangzhang

shengqiangzhang commented Dec 4, 2020

Hi @wasiahmad ,

The model in A Transformer-based Approach for Source Code Summarization takes a sequence of tokens as input, but my model takes an abstract syntax tree (AST). I therefore need to find the original source code (an executable source code snippet) that corresponds to each token sequence, and then parse it into an AST.

I have downloaded the data from their original work, but the dataset used in your paper differs in size from their original dataset. For example, the train set of the original Python dataset has more than 100,000 examples, while yours has about 50,000.

I want to compare with your model, so I selected the experiment dataset provided with your paper.
Since the token sequences cannot be parsed into ASTs, I need to find the corresponding original source code from their original work.

Unfortunately, I cannot find the original source code for every token sequence.
If you could provide the corresponding original code files (your experiment datasets differ in size from the original datasets), I believe I can convert them to ASTs and compare my experimental results with yours.
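
For illustration, here is a minimal sketch of one way such a lookup could be attempted: compare each token sequence with each raw snippet after stripping all whitespace. This assumes the preprocessing only re-spaced the code; if it also rewrote literals or split identifiers, many sequences will not match. The helper names (`whitespace_key`, `build_index`, `recover_sources`) and the sample data are hypothetical and are not taken from either dataset.

```python
def whitespace_key(text: str) -> str:
    """Collapse a snippet to a whitespace-free key (assumed normalization)."""
    return "".join(text.split())


def build_index(raw_snippets):
    """Map each normalized key to the first raw snippet that produces it."""
    index = {}
    for snippet in raw_snippets:
        index.setdefault(whitespace_key(snippet), snippet)
    return index


def recover_sources(token_lines, raw_snippets):
    """Return (matched, unmatched); matched maps a token line to its raw source."""
    index = build_index(raw_snippets)
    matched, unmatched = {}, []
    for line in token_lines:
        source = index.get(whitespace_key(line))
        if source is None:
            unmatched.append(line)
        else:
            matched[line] = source
    return matched, unmatched


# Hypothetical sample data: two raw snippets and two tokenized lines.
raw = ["def add(a, b):\n    return a + b\n", "def noop():\n    pass\n"]
tokens = ["def add ( a , b ) : return a + b", "def sub ( a , b ) : return a - b"]
matched, unmatched = recover_sources(tokens, raw)
```

Sequences that end up in `unmatched` would still need the original files or a closer reproduction of the preprocessing.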

Thank you.

@wasiahmad
Owner

Hi, I understand your need. A few things to note.

  1. The preprocessed Python dataset we used was shared by the authors of Bolin et al., 2019, because we were unable to reproduce their results using the dataset we preprocessed ourselves.

  2. Note that these datasets are extremely noisy, so you may not be able to use the full data if you use AST-based methods.

  3. We also performed some naive experiments using ASTs; you can find the details in the paper. We did this only for the Java dataset, and the dataset (java_with_sbt.zip) is available through the Google Drive link we provide.

  4. The AST extraction from the original Java code was done by our co-author Saikat (https://github.com/saikat107). I have asked him to reply in this thread.
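
As a rough illustration of point 2, the sketch below parses Python snippets with the standard-library `ast` module and skips any that fail, which is one simple way to see how much of a noisy dataset survives an AST-based pipeline. The function name `parse_or_skip` and the sample snippets are hypothetical. Note that `ast.parse` only accepts the syntax of the running interpreter, so code written for another Python version may be skipped as well.

```python
import ast


def parse_or_skip(snippets):
    """Parse each snippet; return (trees, skipped) where trees holds (index, AST) pairs."""
    trees, skipped = [], []
    for i, code in enumerate(snippets):
        try:
            trees.append((i, ast.parse(code)))
        except (SyntaxError, ValueError):
            # Noisy or truncated snippets land here and are excluded.
            skipped.append(i)
    return trees, skipped


# Hypothetical samples: one valid function and one with broken syntax.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def broken(:\n    pass\n",
]
trees, skipped = parse_or_skip(samples)
print(len(trees), "parsed;", len(skipped), "skipped")
```

Keeping the original index next to each tree makes it possible to drop the matching summaries for skipped snippets so that the code and summary files stay aligned.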

Thanks!

@saikat107
Collaborator

Hi @shengqiangzhang ,

As @wasiahmad mentioned, we used the same processed dataset that Bolin et al., 2019 used. However, to the best of my knowledge, the Python dataset is from this paper and can be found here. You can find the description of the raw data here.

I hope that helps. Let me know if you have further questions. Feel free to close the issue if not.

Thanks!

@shengqiangzhang
Author

shengqiangzhang commented Dec 7, 2020

Hi @saikat107 @wasiahmad ,

Thank you for your help. I am trying to transform the input data into AST format.
