A Comparative Study of Various Code Embeddings in Software Semantic Matching

Abstract

Code reuse is fundamental to software engineering, and open source code repositories have become rich sources of reusable code. The ability to search those repositories for functionally equivalent code would therefore be a substantial benefit. In this study, we examine how machine learning techniques used in Natural Language Processing (NLP) to represent words and documents as vectors can be applied to representing code fragments in vector space. To do so, we amass a large corpus of programming tasks implemented in multiple programming languages. We then apply existing document embedding techniques to this corpus, mapping each code fragment to a point in vector space, and study to what extent these document embeddings capture the semantics of software code. Finally, we design and implement a code-matching application that locates functionally equivalent code fragments based on vector embeddings, and we use this application to evaluate the different embeddings.

Requirements

astor
Flask
gensim
javalang
matplotlib
regex
scikit-learn (imported as sklearn)

Install packages with pipenv:

$ pipenv install

Experiments

Sample Search Engine

A proof-of-concept code search engine.
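To make the idea of matching code by vector embeddings concrete, here is a minimal sketch of the ranking step such a search engine might perform. It assumes that fragments are compared with cosine similarity, which the source does not state. The `search` function, the `INDEX` contents, and the three-dimensional vectors are hypothetical, hand-written values and not output from the project's models.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=3):
    """Rank indexed fragments by similarity to the query, best match first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical index: fragment name -> embedding vector.
INDEX = {
    "sum/python": [0.9, 0.1, 0.0],
    "sum/java": [0.8, 0.2, 0.1],
    "max/python": [0.1, 0.9, 0.2],
}

results = search([1.0, 0.0, 0.0], INDEX, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the two "sum" fragments rank ahead of the "max" fragment. A linear scan like this is adequate for a small index; a larger corpus would likely need an approximate nearest-neighbor structure.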

Repository: waingram/code-embeddings
