NCDSearch

NCDSearch is a grep-like tool to find similar source code fragments in files. In software development, developers may copy and modify source code fragments to reuse existing functionalities in code. This tool is developed to detect such code fragments that are not identified by a simple search tool like grep.

For example, if a line if (this.distance < another.distance) return true; (in source code of the tool) is given as a query, the tool reports similar lines such as if (this.distance > another.distance) return false; and if (thislen > anotherlen) return true;. You can find two bug fix examples in a company in our paper published in ICSME 2018 Industry Track.

Takashi Ishio, Naoto Maeda, Kensuke Shibuya, Katsuro Inoue: Cloned Buggy Code Detection in Practice Using Normalized Compression Distance. In proceedings of ICSME 2018, pp.591-594, November 2018.

The tool supports famous similarity metrics such as Normalized Compression Distance and Normalized Levenshtein Distance. The default search strategy is Lempel-Ziv Jaccard Distance (proposed in https://arxiv.org/abs/1708.03346). It is significantly faster than existing NCD, while it keeps similar output to NCD using the Deflate (zip) algorithm for code clone detection. The details are described in our technical paper.

Takashi Ishio, Naoto Maeda, Kensuke Shibuya, Kenho Iwamoto, Katsuro Inoue, NCDSearch: Sliding Window-Based Code Clone Search Using Lempel-Ziv Jaccard Distance. IEICE Transactions on Information and Systems, vol.E105-D, No.5, May 2022.

Build Information

The project uses Maven with Maven Assembly Plugin. The following command builds a runnable jar ncdsearch.jar:

     mvn package

The project includes Eclipse project file and pom.xml for dependencies. The main class is ncdsearch.SearchMain. The repository includes Lexer files generated by ANTLR4. The grammar files are maintained in another repository (https://github.com/takashi-ishio/sarf-lexer)

Usage

Online usage example is available at:

The following table is a list of major options.

Option	Description
`-e` [query]	Specify a query. All arguments after the option are regarded as a query.
`-q` [filename]	Read a query from a specified file. `-` represents a standard input.
`-sline` [line-number]	Specify a start line number of a query in the file of `-q` option.
`-eline` [line-number]	Specify an end line number of a query in the file of `-q` option.
`-lang` [language]	Specify a programming language of a query. Required for `-e` option.
`-th` [threshold]	Specify a threshold of the distance. The default is 0.5.
`-thread` [num]	Specify the number of threads for concurrent search.
`-encoding` [encoding]	Specify a text encoding of source files. The default encoding is UTF-8.
`-l` [filelist]	Search files listed in the specified text file (one file name per line)
`-git` [git-dir]	Search files in the specified git repository instead of normal files.
`-v`	Show configuration and progress.
`-json`	Enable a JSON format report.
`-pos`	Report the detected source code locations in detail.
`-a` [algorithm]	Specify an algorithm to compute a distance. The default is `lzjd`.
`-link` [style]	If one of `eclipse`, `vscode`, and `fileurl` is given, file names are printed using a clickable format on particular environments. The default is `none` (just a file name).
`-testconfig`	This option does not execute a search but print the current configuration.

Query Code Fragment

You can input code fragments using STDIN, a query file, or command line arguments.

    java -jar ncdsearch.jar dir_or_file -lang java -q - < query
    java -jar ncdsearch.jar dir_or_file -lang java -q query.txt
    java -jar ncdsearch.jar dir_or_file -lang java -q query.txt -sline 10 -eline 20
    java -jar ncdsearch.jar dir_or_file -lang java -e my code fragments

The -q option specifies a file for a code fragment. -sline and -eline specifies lines of code in a file as a query. In the third example, lines 10 through 20 of the query.txt file are selected as a query.
- If a file name is "-", the tool reads a query from the standard input.
The -e option directly specifies a code fragment in command line arguments. Note that all the arguments after -e are regarded as a query code snippet.

You can specify multiple directories or files to be searched.

    java -jar ncdsearch.jar dir1 dir2 -lang java < query

The tool recognizes a programming language by either -lang option or a file name of -q option. The -lang option overrides -q option's file name.

It should be noted that the tool reports all similar code fragments by default. If you would like to see only exact matches, you can use -a tld option that uses Token-level Levenshtein Distance instead of the default strategy.

    java -jar ncdsearch.jar dir_or_file -lang java -e public static void main(String[] args)

Output Format

The tool reports a result in a CSV format by default. For example, an execution with a query:

    java -jar ncdsearch.jar -lang java -e "if (this.distance > another.distance) return true;"

would report a line like this:

    path/to/src/ncdsearch/Fragment.java,81,81,0.10714285714285714

Each line of an output represents a similar source code fragment detected by the tool.

The first column is the file name including the code fragment.
The second and third columns indicate the lines of the first and last tokens of the fragment.
- You may specify -pos option to extract char positions in the lines.
The last column indicates the normalized compression distance between the query and the code fragment. Since it is a distance, more similar code fragments have smaller values.

According to the report, you may find a similar line of code in a file. For example:

    81:   if (this.distance < another.distance) return true;

The tool also supports a JSON format. Add -json option to use the format. The -pos option with the json option reports source code tokens in addition to locations.

Source File Encoding

The tool assumes UTF-8 by default. Please specify -encoding option to choose a charset, e.g. -encoding UTF-16. A list of supported encodings is dependent on a platform.
A list for Oracle Java SE is available at: https://docs.oracle.com/javase/jp/8/docs/technotes/guides/intl/encoding.doc.html

Programming Language

The -lang option specifies a programming language.
The tool accepts file extensions: java (Java), c (C/C++), cs (C#), js (JavaScript), cbl (Cobol), vb (Visual Basic 6) and txt (plain text).
It also accepts ccfxprep files that are generated by the CCFinderX preprocessor.

A programming language option activates a lexical analysis to extract tokens. It ignores white space and comments in the specified language. The plain text mode regards a single line as a single token but ignores indentation (leading white space and trailing white space of lines).

Each language option automatically searches source files using the following extensions.

C/C++: .c, .cc, .cp, .cpp, .cx, .cxx, .c+, .c++, .h, .hh, .hxx, .h+, .h++, .hp, .hpp
Java 8: .java
JavaScript: .js
C#: .cs
Python: .py
COBOL: .cbl
Visual Basic 6 and .NET (experimental): .vb
Plain Text: .txt, .html, .md
CCFinderX: .ccfx, .ccfinderx, .ccfxprep

You can include additional files using -i (extension) option. For example, the following command searches .java files as plain text.

    java -jar ncdsearch.jar dir_or_file -lang txt -i .java -e "// line comment"

You can use multiple -i options in a command line to search additional files.

You also can use -l [filename] option that specifies a text file including a list of file names.

Full Scan Mode

For efficiency, the tool compares a query with sampled lines of code by default. It is fast, but may result in false negatives. If your query is small enough, you should specify -full option that checks all tokens so that you can get more results.

    java -jar ncdsearch.jar dir_or_file -lang java -full -e identifier

Verbose Mode

If a result is different from your expectation, you can try -v to see the configuration and progress of the search.

    java -jar ncdsearch.jar dir_or_file -lang java -v < query

Multi-threading

The program uses N working threads if -thread N option is provided. The multi-threading execution processes N files in parallel. File locations in the output may be differently ordered in each execution.

Although N can be an arbitrary number (e.g. 2, 4, or 8), an effective value of N is dependent on available CPU resources. A larger amount of memory is also required to store N files in memory at once.

Algorithms

You can specify an algorithm using -a option. This tool supports the following algorithms.

lzjd: Lempel-Ziv Jaccard Distance. The default strategy.
clzjd: Character-level Lempel-Ziv Jaccard Distance. It internally uses char instead of byte for comparison. This may be important to find similar multi-byte strings.
tld: Token-level Levenshtein Distance. A distance is measured by the number of add/delete/modify operations to transform a query string to a code fragment in source code.
ntld: Normalized Token-level Levenshtein Distance. A distance is a variant of Token-level Levenshtein Distance normalized by the length of a query string.
nbld: Normalized Byte-level Levenshtein Distance. A distance is measured by the number of different bytes, normalized by the length of a query string..
blcs: Normalized Byte-level Longest Common Subsequence. A distance is measured by the length of common byte subsequence normalized by the length of a query string.
zip: Normalized Compression Distance with Deflate algorithm that has been used in gzip.
xz, zstd, bzip2, snappy, folca: Normalized Compression Distance with the corresponding compression algorithm. These algorithms are provided just for experiments to see an impact of compression algorithms.

Normalized Compression Distance (NCD)

Using the -a zip option, you can use a traditional Normalized Compression Distance (https://en.wikipedia.org/wiki/Normalized_compression_distance) defined as follows:

    NCD(x, y) = (Z(xy)-min(Z(x), Z(y)) / max(Z(x), Z(y))

where Z(x), Z(y), and Z(xy) are data size obtained by a data compression algorithm (Deflate in case of the zip option). If two data x and y are similar, then NCD(x, y) results in a small value.

Token-level Levenshtein Distance

The -a ntld option uses Normalized Levenshtein Distance on tokens, i.e. the ratio of added, removed, and modified tokens, as a distance metric.

The -a tld option uses Levenshtein Distance on tokens without normalization. It simply counts the number of added, removed, and modified tokens and reports a code fragment whose distance is at most a given threshold. For example, -q "a < b" -th 1 -full matches a > b and a < c.

Evaluation Dataset

The evaluation-dataset/cbcd-dataset.json file includes 53 queries created from the information in J. Li and M. D. Ernst: CBCD: Cloned Buggy Code Detector.
As we found several errors in the paper, we manually checked the version history of three projects and revised the queries and answers.

Each object includes a query source code fragment and code clones that should be detected.

query: A file name including the query content.
queryloc: The source code location of the query.
- path shows a project name (postgres/git/linux) and a commit ID including the code.
- file, sline, and eline represent the file path, start line number and end line number in the version.
type: Code clone type.
path: A project name and a commit ID including code clones.
answers: A list of source code locations. The format is the same as queryloc.

License

The source code of this project is available under the MIT License. The full license description is included in the LICENSE file.

This project uses the following components.

GNU Trove (https://bitbucket.org/trove4j/trove): LGPL License
ANTLR4 (http://www.antlr.org/license.html): BSD License
XZ (https://tukaani.org/xz/java.html): public domain
ZStd (https://github.com/luben/zstd-jni): 3-clause BSD License

Grammar files for ANTLR4 come from https://github.com/antlr/grammars-v4/ except for COBOL. C/C++ grammars are slightly edited so that the tool can recognize macro directives as tokens (see the header comment). COBOL grammar is obtained from ProLeap COBOL Parser (https://github.com/uwol/proleap-cobol-parser) that is distributed under MIT License. VisualBasic6 grammar is obtained from ProLeap VB6 Parser (https://github.com/uwol/vb6parser) that is distributed under MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 226 Commits
.settings		.settings
evaluation-dataset		evaluation-dataset
grammars/sarf/lexer/lang		grammars/sarf/lexer/lang
src		src
tests		tests
.classpath		.classpath
.gitignore		.gitignore
.project		.project
LICENSE		LICENSE
NCDSearch_evaluation_2022.pdf		NCDSearch_evaluation_2022.pdf
NCDSearch_motivation_2018.pdf		NCDSearch_motivation_2018.pdf
README.md		README.md
pom.xml		pom.xml

License

takashi-ishio/NCDSearch

Folders and files

Latest commit

History

Repository files navigation

NCDSearch

Build Information

Usage

Query Code Fragment

Output Format

Source File Encoding

Programming Language

Full Scan Mode

Verbose Mode

Multi-threading

Algorithms

Normalized Compression Distance (NCD)

Token-level Levenshtein Distance

Evaluation Dataset

License

About

Resources

License

Stars

Watchers

Forks

Languages