Generating embeddings for Python and Java #104

Avv22 · 2021-10-13T05:16:21Z

Hello,

Thanks again for your work.

Can you please explain how to use the model to generate embeddings for a source file in Python and also in Java? Do we have to train your model on a Java dataset and a Python dataset before we can use it to generate embeddings of source code? Also, is it possible to get embeddings of a fixed size, say 100, represented as numerical data for each file?

urialon · 2021-10-19T13:34:31Z

Hi Avra,
Thank you for your interest in this work! Sorry again for the delayed response.

Yes, in order to use code2vec for Python, you will have to train the model on a Python dataset.
Have you seen this section in the README? https://github.com/tech-srl/code2vec#extending-to-other-languages

Uri

Avv22 · 2021-11-01T14:13:40Z

@urialon The astminer team helped me produce training, testing, and validation python.c2v data. How should we proceed next to train the code2vec model? We have 20k Python files that we split into train, test, and validation sets before feeding them to the astminer tool. Once code2vec is trained, we would like to feed the same Python code back in to produce embeddings. What do you think?

urialon · 2021-11-02T20:49:48Z

I am not sure what your python.c2v files look like, but try to continue running the preprocess.sh script starting from this line: https://github.com/tech-srl/code2seq/blob/master/preprocess.sh#L54 (and adapt all paths to match your files).
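Before continuing from that line, it may help to check that each line of the .c2v file has the shape the preprocessing step expects. The sketch below assumes the raw layout is one example per line: a label, followed by space-separated contexts, each written as `left_token,path,right_token`. The function name `parse_c2v_line` and the sample line are hypothetical and only for illustration; astminer output may differ in details.

```python
def parse_c2v_line(line):
    """Split one .c2v line into (label, [(left_token, path, right_token), ...]).

    Assumed layout (not verified against astminer output):
    label ctx1 ctx2 ... where each ctx is "left,path,right".
    """
    parts = line.strip().split(" ")
    label, raw_contexts = parts[0], parts[1:]
    contexts = []
    for raw in raw_contexts:
        if not raw:
            continue  # skip empty fields left by space padding
        fields = raw.split(",")
        if len(fields) != 3:
            raise ValueError(f"malformed context: {raw!r}")
        contexts.append(tuple(fields))
    return label, contexts


# Hypothetical sample line; real paths are usually hashed or numeric IDs.
sample = "get|name this,P1,name name,P2,String  "
label, contexts = parse_c2v_line(sample)
print(label, len(contexts))
```

Running this over the first few lines of each split is a cheap way to catch a malformed file before adapting the paths in the script.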

Avv22 · 2021-11-22T00:58:57Z

@urialon.

Thank you. So I will train your model on the 150k Python dataset. How can we save the model to use it later on another Python dataset to generate embeddings? Does preprocess.sh do that automatically and save the model?

Once we have trained the model on the 150k Python dataset you specified, we would also like to use it to generate one embedding vector for each Python file in our own dataset. Can we do that? We don't want to generate method names, just one embedding that is representative of each file. We would like to do the same for our 20k Python files.

urialon · 2021-12-03T14:36:00Z

Hi @Avra2 ,

preprocess.sh only preprocesses the data; it does not train the model.
train.sh, however, trains the model and saves the checkpoints. See:
https://github.com/tech-srl/code2vec#training-a-model-from-scratch
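The thread does not say how to get one vector per file. code2vec produces a vector per method, so one possible approach (an assumption here, not something the authors recommend in this thread) is to average the method vectors that belong to the same file. The sketch below assumes the per-method vectors have already been obtained and paired with their source file path; the function name `file_embeddings` and the input layout are hypothetical.

```python
from collections import defaultdict


def file_embeddings(method_vectors):
    """Mean-pool per-method vectors into one vector per file.

    method_vectors: iterable of (file_path, vector) pairs, where vector is
    a list of floats. This pairing is an assumed input format, not a
    code2vec output format.
    """
    grouped = defaultdict(list)
    for path, vec in method_vectors:
        grouped[path].append(vec)

    result = {}
    for path, vecs in grouped.items():
        dim = len(vecs[0])
        if any(len(v) != dim for v in vecs):
            raise ValueError(f"inconsistent vector sizes for {path}")
        # Average each dimension across all methods in the file.
        result[path] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return result


# Toy two-dimensional vectors for illustration only.
pairs = [("a.py", [1.0, 2.0]), ("a.py", [3.0, 4.0]), ("b.py", [5.0, 6.0])]
print(file_embeddings(pairs))
```

The output size equals the size of the method vectors, which is fixed by the trained model's configuration, so every file ends up with a vector of the same length. Getting exactly 100 dimensions would require either training with a matching vector size or applying a separate dimensionality reduction step afterwards.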

Avv22 closed this as completed Nov 22, 2021
