The source code of successful projects is evolving all the time, resulting in hundreds of thousands of code changes stored in source code repositories. This wealth of data can be useful, e.g., to find changes similar to a planned code change or examples of recurring code improvements. However, searching for specific kinds of code changes across a large number of code repositories is non-trivial.
This paper presents DiffSearch, the first scalable and precise search engine for code changes. The demo of the tool is online.
Given a query that describes possibly abstracted versions of a code snippet before and after a change, the approach returns a set of changes that are guaranteed to match the query. To ensure scalability, DiffSearch indexes code changes in a one-time preprocessing step, mapping them into a feature space, and then performs an efficient search in the feature space for each query. The approach is designed in a mostly language-agnostic way, and we present implementations for Java, JavaScript, and Python.
@ARTICLE{9935264,
author={Grazia, Luca Di and Bredl, Paul and Pradel, Michael},
journal={IEEE Transactions on Software Engineering},
title={DiffSearch: A Scalable and Precise Search Engine for Code Changes},
year={2023},
volume={49},
number={4},
pages={2366-2380},
doi={10.1109/TSE.2022.3218859}
}
Queries are formulated in an extension of Java. The structure of a query is the following:
old code --> new code
Here’s an example query:
x = foo(); --> x = bar();
Queries are compared against the lines changed in a commit, by checking that everything in the query indeed appears in the commit. If a commit changes more than what is asked for in a query, then this commit counts as a match. In contrast, if a query contains code not among the changed lines, it’ll not match. For example, the above query will match
x = foo(); y = z; --> x = bar(); z = y;
but it won’t match the following:
foo(); --> x = bar();
The syntax of queries follows the Java language and also allows for some syntactically incomplete code snippets. For example, a query may open a block with “{“ but not close it with a matching “}”.
In addition to the syntax of the target programming language, queries support some special symbols (note: the symbols are case-sensitive):
LT which matches any literal, e.g., a number, a char, or a string ID which matches any identifier, e.g., a variable name or method name EXPR which matches any expression, e.g., x < y + 1 or x.foo binOP which matches any binary operator, e.g., + and - unOP which matches any unary operator, e.g., ++ and -- OP which matches any assign operator, e.g., = , -=, and +=
These symbols can also be labeled with a number (0 to 3) to match the same syntactic element multiple times, e.g., ID<1> and LT<2>. For example, this query searches for code changes where the binary operands inside an if condition changes:
if (ID<0> binOP LT<0>) { --> if (ID<0> binOP LT<0>) {
The query matches this code change:
if (x > 0) { --> if (x < 0) {
But not this one (because x changes to y):
if (x > 0) { --> if (y < 0) {
Another special symbol is _ , which indicates that one part of the code change is empty. It is useful for finding inserted and deleted code. For example, the following query looks for an inserted return statement:
_ --> return EXPR;
There is a last special symbol, <...> , which represents a wildcard. It matches zero, one, or more statements or expressions and helps with queries where some other code may be interspersed between the code to find. For example, the following query searches for any code change that replaces some (potentially empty) code with a return statement that returns a variable:
<...> --> return ID;
Query A:
<...> <...>
ID<0>(); ID<1>();
<...> --> <...>
ID<0>(); ID<1>();
Matching change A:
g(); g();
f(); z();
h(); --> h();
j(); j();
f(); } z();
Query B:
<...> try {
<...>
--> } catch (ID ID) {
<...> }
Matching change B:
x=3; try {
--> x=3;
} catch (Exception e) {
System.out.println("oops");}
Query C:
<...>
ID = 23; --> <...>
<...>
Matching change C:
a=2; a=2;
b=5; b=5;
c=7; --> c=7;
d=23; d=23;
e=1; e=1;
f=2; f=2;
More DiffSearch queries are in our dedicated Wiki page.
- Java 11 and Python 3.7
- Linux Operating System, such as Ubuntu
- ANTLR 4 ->
apt install antlr
- Maven ->
apt install maven
FOR THE DIFFSEARCH SERVER:
-
Install:
chmod 754 installDiffSearch.sh
sed -i -e 's/\r$//' installDiffSearch.sh
./installDiffSearch.sh
-
Run:
mvn compile
mvn exec:java -Dexec.mainClass=research.diffsearch.main.App -Dexec.args="-g -lang java"
-
Extra:
- If you received the error: " OSError: [Errno 98] Address already in use", you can use the following commands:
fuser -k 5002/tcp
fuser -k 8843/tcp
- If you received the error: " OSError: [Errno 98] Address already in use", you can use the following commands:
-
More DiffSearch arguments are in the Wiki page
FOR THE GUI:
- clone the repository: https://github.com/sola-st/DiffSearch-UI
- Run the command:
ng serve --proxy-config src/proxy.conf.json
- Go in "http://localhost:4200/diffsearch" with your browser