Generating embeddings for Python and Java #104

Closed
Avv22 opened this issue Oct 13, 2021 · 5 comments


Avv22 commented Oct 13, 2021

Hello,

Thanks again for your work.

Can you please explain how to use the model to generate embeddings for a source file in Python, and also in Java? Do we have to train your model on a Java dataset and a Python dataset before we can use it to generate source code embeddings? Also, is it possible to get embeddings of a fixed size, say 100, represented as numerical data for each file?

Contributor

urialon commented Oct 19, 2021

Hi Avra,
Thank you for your interest in this work! Sorry again for the delayed response.

Yes, in order to use code2vec for Python, you will have to train the model on a Python dataset.
Have you seen this section in the README? https://github.com/tech-srl/code2vec#extending-to-other-languages

Uri

Author

Avv22 commented Nov 1, 2021

@urialon, the astminer team helped me produce training, test, and validation data in python.c2v format. How should we proceed to train the code2vec model? Once the model is trained, can we feed it the same data to produce embeddings? We have 20k Python files, which we split into train, test, and validation sets before running the astminer tool on them, and we would like to produce embeddings for that same code. What do you think?

Contributor

urialon commented Nov 2, 2021

I am not sure what your python.c2v looks like, but try to continue running the preprocess.sh script starting from this line: https://github.com/tech-srl/code2seq/blob/master/preprocess.sh#L54 (and adapt all paths according to your files).
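
To make the suggestion above more concrete, here is a minimal sketch of the kind of counting such a preprocessing step performs. It assumes that the remaining part of the script builds frequency histograms of targets, tokens, and paths, and that each line of a .c2v file has the form `target token,path,token token,path,token ...`. Both are assumptions to check against the script and your own files. `build_histograms` is a hypothetical helper, not part of code2vec, code2seq, or astminer.

```python
from collections import Counter


def build_histograms(lines):
    """Count targets, tokens, and paths in c2v-style lines.

    Assumed line format (not verified against any specific tool):
        target tok,path,tok tok,path,tok ...
    """
    targets, tokens, paths = Counter(), Counter(), Counter()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        targets[parts[0]] += 1
        for context in parts[1:]:
            fields = context.split(",")
            if len(fields) != 3:
                continue  # skip malformed contexts
            left, path, right = fields
            tokens[left] += 1
            tokens[right] += 1
            paths[path] += 1
    return targets, tokens, paths


# Hypothetical sample data, made up for illustration only.
sample = [
    "get|name self,101,name name,202,return",
    "set|name self,101,name",
]
targets, tokens, paths = build_histograms(sample)
print(targets.most_common())
print(tokens.most_common())
print(paths.most_common())
```

Running something like this on your own python.c2v file is also a quick way to confirm that the file has the shape the rest of the pipeline expects.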

Avv22 closed this as completed Nov 22, 2021

Author

Avv22 commented Nov 22, 2021

@urialon.

Thank you. So I will train your model on the 150k Python dataset. How can I save the model so that I can use it later on another Python dataset to generate embeddings? Does preprocess.sh do this automatically and save the model?

Once the model is trained on the 150k Python dataset you specified, we would also like to use it to generate one embedding vector for each Python file in our own dataset. Can we do that? We do not want to predict method names; we want a single embedding that is representative of a whole file. We would like to do the same for our 20k Python files.
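
The thread does not say how to get a single vector per file. code2vec represents individual methods, so one possible workaround is to average the vectors of all methods that belong to the same file. This is not the authors' method. The sketch below assumes you already have method-level vectors paired with the file they came from; `pool_by_file` and the sample data are hypothetical.

```python
def pool_by_file(method_vectors):
    """Average method-level vectors into one vector per file.

    method_vectors: iterable of (file_name, vector) pairs, where every
    vector is a list of floats of the same length.
    """
    sums = {}
    counts = {}
    for file_name, vector in method_vectors:
        if file_name not in sums:
            sums[file_name] = [0.0] * len(vector)
            counts[file_name] = 0
        if len(vector) != len(sums[file_name]):
            raise ValueError(f"inconsistent vector length in {file_name}")
        # Element-wise running sum for this file.
        sums[file_name] = [s + v for s, v in zip(sums[file_name], vector)]
        counts[file_name] += 1
    # Divide each running sum by the number of methods in the file.
    return {
        name: [s / counts[name] for s in total]
        for name, total in sums.items()
    }


# Hypothetical two-dimensional vectors, for illustration only.
vectors = [
    ("a.py", [1.0, 2.0]),
    ("a.py", [3.0, 4.0]),
    ("b.py", [0.5, 0.5]),
]
file_embeddings = pool_by_file(vectors)
print(file_embeddings)
```

The length of each file vector equals the length of the method vectors, which is set by the trained model's configuration, so every file gets a vector of the same fixed size. Averaging treats all methods equally; weighting by method length would be another option.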

Contributor

urialon commented Dec 3, 2021

Hi @Avra2,

preprocess.sh only preprocesses the data; it does not train the model.
However, train.sh trains and saves the checkpoints. See:
https://github.com/tech-srl/code2vec#training-a-model-from-scratch
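
As a rough illustration of the train-then-reuse workflow described above, the sketch below only assembles the two command lines: one for training (which writes checkpoints) and one for loading a saved checkpoint to export vectors for another dataset. The flag names (`--data`, `--test`, `--save`, `--load`, `--export_code_vectors`) are assumptions based on the repository's README and train.sh and should be checked against the repository before use. All paths and the helper functions `train_command` and `export_command` are hypothetical.

```python
def train_command(data_prefix, val_file, save_prefix):
    """Argument list for training; checkpoints go under save_prefix.

    Flag names are assumed from the code2vec README, not verified here.
    """
    return [
        "python3", "-u", "code2vec.py",
        "--data", data_prefix,
        "--test", val_file,
        "--save", save_prefix,
    ]


def export_command(checkpoint, c2v_file):
    """Argument list for loading a checkpoint and exporting code vectors."""
    return [
        "python3", "code2vec.py",
        "--load", checkpoint,
        "--test", c2v_file,
        "--export_code_vectors",
    ]


# Hypothetical paths, for illustration only.
train = train_command(
    "data/python/python",
    "data/python/python.val.c2v",
    "models/python/saved_model",
)
export = export_command(
    "models/python/saved_model",
    "data/my_files/my_files.test.c2v",
)
print(" ".join(train))
print(" ".join(export))
```

The commands are printed rather than executed so that they can be compared with train.sh and the README first.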
