GitHub - zhuwq585/MSCCD: A multilingual syntactic code clone detector based on ANTLR parser generation.

Features

Coming soon.

See our paper (accepted at ICPC 2022) at https://arxiv.org/pdf/2204.01028.pdf

Docker Image

You can use the provided Docker image to avoid setting up the environment dependencies yourself.

Link: https://drive.google.com/file/d/17zsCf-5FnKbE1iPw6Ca4onW5ckQX69eQ/view?usp=sharing

MSCCD is in /root/MSCCD

Remember to update to the newest version of MSCCD with git pull.

Environment dependence

We have tested MSCCD on Ubuntu 18.04 LTS and macOS Monterey.

MSCCD mainly depends on these environments:

Python v3.6.9
Java 11 (must be newer than Java 9; when using a different version, remember to set it by editing modules/msccd_tokenizers/pom.xml)
Maven v3.8.5
jinja2 (pip3)
ujson (pip3)
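
As a quick sanity check before installing, the dependencies listed above can be probed from Python. The sketch below is not part of MSCCD: the helper name check_environment is hypothetical, and it only reports whether each tool or module can be found, not whether the Java or Maven version matches.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report which of the dependencies named in this README can be found.

    Hypothetical helper, not part of MSCCD. It checks presence only;
    the versions of Java and Maven are not inspected.
    """
    return {
        # The README lists Python v3.6.9 as the tested version.
        "python>=3.6": sys.version_info >= (3, 6),
        # Executables are looked up on PATH.
        "java": shutil.which("java") is not None,
        "mvn": shutil.which("mvn") is not None,
        # pip3 packages are probed without importing them.
        "jinja2": importlib.util.find_spec("jinja2") is not None,
        "ujson": importlib.util.find_spec("ujson") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```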

We added some interfaces and methods to ANTLR 4.8 and packaged the result as a .jar file for MSCCD. Please install the provided antlr-4.8-modified.jar into your local Maven repository:

mvn install:install-file -Dfile=./lib/antlr-4.8-modified.jar -DgroupId=org.nagoya_u.ertl.sa -DartifactId=antlr-v4.8-modified -Dversion=4.8 -Dpackaging=jar

Generate a tokenizer for the target language

First, edit ./parserConfig.json :

parser: The path of the grammar folder, which contains the g4 files and sometimes Java programs.
grammarName: The grammar name defined in the g4 file. It can also be checked in pom.xml (for grammars from grammars-v4).
startSymbol: The start symbol of the grammar. It can be checked in pom.xml or the g4 file.
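
For illustration, the three items above could be filled in as follows. The sketch assumes that parserConfig.json is a flat JSON object with exactly these keys, which this README does not confirm, so compare it against the parserConfig.json shipped with the repository before relying on it. The helper name and the grammar path, grammar name, and start symbol shown are hypothetical.

```python
import json

REQUIRED_KEYS = ("parser", "grammarName", "startSymbol")


def build_parser_config(parser, grammar_name, start_symbol):
    """Return parserConfig.json text for one grammar (assumed flat layout)."""
    config = {
        "parser": parser,
        "grammarName": grammar_name,
        "startSymbol": start_symbol,
    }
    # All three items are described in the README, so none may be empty.
    missing = [key for key in REQUIRED_KEYS if not config[key]]
    if missing:
        raise ValueError(f"empty parserConfig items: {missing}")
    return json.dumps(config, indent=4)


# Hypothetical values; take the real ones from the grammar's g4 file or pom.xml.
print(build_parser_config("path/to/grammar/folder", "Java9", "compilationUnit"))
```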

Then, generate the tokenizer by:

python3 tokenizerGeneration.py

Configure the tool

The tool is configured through config.json, which contains the following items:

inputProject: A list of paths. Each path represents a project in which you want to detect clones.
keywordsList: The path of the keyword list.
languageExtensionName: A list of the extension names of the target language.
minTokens: The minimum size of the token bag in clone detection.
minTokensForBagGeneration: The minimum size of the token bag in tokenization. A smaller value will provide a larger range of token bag sizes in clone detection; a bigger one will make the tokenizer faster when you don’t want small bags.
detectionThreshold: The similarity threshold, a number in the range (0, 1). If the overlap similarity of a code pair is higher than the threshold, the pair is reported as a clone. A higher threshold will increase accuracy and reduce recall, and vice versa.
maxRound: The max granularity value to detect.
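tokenizer: The name of the generated tokenizer. It is the same as "grammarName" in parserConfig.json.
threadNum_tokenizer: The number of threads used for tokenization.
threadNum_detection: The number of threads used for clone detection.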
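
The items above can be sketched as a Python dict and checked before writing them to config.json. This is only an illustration: it assumes config.json is a flat JSON object keyed by the item names above, all values are hypothetical, and validate_config is not part of MSCCD. The positive-integer check on the token limits is an assumption as well; the README only states the range of detectionThreshold.

```python
import json


def validate_config(config):
    """Check a config dict against the constraints described in this README.

    Returns a list of problem descriptions; an empty list means no problem
    was found. Hypothetical helper, not part of MSCCD.
    """
    problems = []
    for key in ("inputProject", "languageExtensionName"):
        if not isinstance(config.get(key), list) or not config[key]:
            problems.append(f"{key} must be a non-empty list")
    threshold = config.get("detectionThreshold")
    if not isinstance(threshold, (int, float)) or not 0 < threshold < 1:
        problems.append("detectionThreshold must lie in the range (0, 1)")
    # Assumption: token bag sizes are positive integers.
    for key in ("minTokens", "minTokensForBagGeneration"):
        if not isinstance(config.get(key), int) or config[key] < 1:
            problems.append(f"{key} must be a positive integer")
    return problems


# Hypothetical example values; only the key names come from the README.
example = {
    "inputProject": ["path/to/projectA", "path/to/projectB"],
    "keywordsList": "path/to/keywords.txt",
    "languageExtensionName": ["java"],
    "minTokens": 50,
    "minTokensForBagGeneration": 30,
    "detectionThreshold": 0.7,
    "maxRound": 5,
    "tokenizer": "Java9",
    "threadNum_tokenizer": 4,
    "threadNum_detection": 4,
}

print(validate_config(example))
print(json.dumps(example, indent=4))
```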

Execute MSCCD

Users often need to run several detections on the same project, so MSCCD saves the necessary data in a task object to save time on later executions.

Execute for the first time

In this mode, the tool generates a new task from the configuration file and executes it.

1 Edit the config.json file, and check the grammar file, keyword list file, and your input file.

2 Run python3 controller.py and wait for the result.

3 Check the output in tasks/task[taskId]/. For each execution, a folder named detection* is created to hold the result files.
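
The result folders from step 3 can also be located programmatically. The following is a minimal sketch, assuming the tasks/task[taskId]/detection* layout described above; list_detection_dirs is a hypothetical helper, not a function provided by MSCCD.

```python
from pathlib import Path


def list_detection_dirs(msccd_root, task_id):
    """Return the detection* folders of one task, sorted by name.

    Hypothetical helper. msccd_root is the folder that contains tasks/.
    """
    task_dir = Path(msccd_root) / "tasks" / f"task{task_id}"
    if not task_dir.is_dir():
        raise FileNotFoundError(f"no such task folder: {task_dir}")
    # Keep only directories; files such as fileList.txt are ignored.
    return sorted(p for p in task_dir.glob("detection*") if p.is_dir())
```

For example, list_detection_dirs("/root/MSCCD", 1) would list the detection folders of tasks/task1 in the Docker image. Sorting is by name, so a numeric suffix of 10 sorts before 2.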

Execute from a generated task

In this mode, the tool is executed from a previously generated task. The detection granularity (required) and threshold (optional) can be changed from the command line.

Run python3 controller.py [taskId] ([statementThreshold]).

For example, python3 controller.py 1 executes from tasks/task1, and python3 controller.py 2 0.9 executes from tasks/task2 with detectionThreshold set to 0.9.
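
To make the effect of the threshold concrete, the sketch below compares two token bags. It assumes that the overlap similarity is the size of the multiset intersection divided by the size of the larger bag, a common definition in token-bag clone detectors. This README does not spell out the formula, so see the paper for the definition MSCCD actually uses. The function names and the sample bags are hypothetical.

```python
from collections import Counter


def overlap_similarity(bag_a, bag_b):
    """Multiset overlap of two token bags, normalized by the larger bag.

    Assumed definition, not taken from the MSCCD source code.
    """
    overlap = sum((bag_a & bag_b).values())  # shared tokens, with frequency
    larger = max(sum(bag_a.values()), sum(bag_b.values()))
    return overlap / larger if larger else 0.0


def is_clone(bag_a, bag_b, threshold):
    """A pair is reported when its similarity is higher than the threshold."""
    return overlap_similarity(bag_a, bag_b) > threshold


# Hypothetical token bags: token text mapped to frequency.
bag_a = Counter({"if": 2, "return": 1, "x": 3})
bag_b = Counter({"if": 1, "return": 1, "x": 3, "y": 1})

print(round(overlap_similarity(bag_a, bag_b), 3))  # 0.833
print(is_clone(bag_a, bag_b, 0.7))  # True
print(is_clone(bag_a, bag_b, 0.9))  # False
```

Under this assumption, raising the threshold from 0.7 to 0.9 drops the sample pair from the report, which matches the trade-off described for detectionThreshold.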

Check the detection results.

For each task, all the data is saved in the tasks/task* folder, including the configuration, the file list, and the token bags. The files are:

file	description
fileList.txt	Each line represents a source file, formatted as (projectId, file path). The index of each file within its project is defined as fileId.
tokenBags	Each line represents a token bag, with the data fields separated by '@ @': projectId @ @ fileId @ @ bagId @ @ granularity value @ @ number of keywords @ @ number of symbols @ @ number of tokens @ @ start line in original file -- end line in original file @ @ tokens (token text :: frequency)
taskData.obj	Configurations
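
A line of tokenBags can be split into its fields along the lines of the sketch below. It is not taken from MSCCD: the class and function names are hypothetical, the numeric fields are assumed to be integers, and the sample line is made up. The separator pattern accepts the two '@' signs with or without whitespace between them, and the token list is kept as raw text because its inner layout is not fully specified here.

```python
import re
from typing import NamedTuple

# Two '@' signs, optionally separated and surrounded by whitespace.
FIELD_SEPARATOR = re.compile(r"\s*@\s*@\s*")


class TokenBag(NamedTuple):
    project_id: int
    file_id: int
    bag_id: int
    granularity: int
    keyword_count: int
    symbol_count: int
    token_count: int
    start_line: int
    end_line: int
    tokens: str  # raw "token text :: frequency" data, left unparsed


def parse_token_bag(line):
    """Parse one tokenBags line into a TokenBag (hypothetical helper)."""
    # maxsplit=8 keeps any '@' characters inside the token list intact.
    parts = FIELD_SEPARATOR.split(line.rstrip("\n"), maxsplit=8)
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    numbers = [int(value) for value in parts[:7]]
    start, end = (int(value) for value in parts[7].split("--"))
    return TokenBag(*numbers, start, end, parts[8])


# Made-up sample line following the field order in the table above.
sample = "0 @ @ 3 @ @ 7 @ @ 1 @ @ 4 @ @ 10 @@ 25 @@ 12 -- 30@@ if :: 2"
print(parse_token_bag(sample))
```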

The results of each detection are saved in the tasks/task*/detection* folder.

file	description
pairs.file	Reported clones in the form [[projectId,fileId,bagId],[projectId,fileId,bagId]]
info.obj	Execution times...
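
Reported pairs can be read back into Python with a few lines. The sketch assumes that pairs.file holds one pair per line and that each line is valid JSON in the form shown above; neither is confirmed by this README. The helper names are hypothetical.

```python
import json


def parse_pair_line(line):
    """Parse one reported clone pair into two (projectId, fileId, bagId) tuples."""
    left, right = json.loads(line)
    if len(left) != 3 or len(right) != 3:
        raise ValueError(f"unexpected pair layout: {line!r}")
    return tuple(left), tuple(right)


def read_pairs(lines):
    """Parse every non-blank line of an iterable of pairs.file lines."""
    return [parse_pair_line(line) for line in lines if line.strip()]


# Made-up sample content in the assumed one-pair-per-line layout.
print(read_pairs(["[[0,1,2],[0,3,4]]\n", "[[0,1,5],[1,0,2]]\n"]))
```

With a real result, the open file object can be passed directly, for example read_pairs(open(path_to_pairs_file)).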

Scripts:

scripts/blockPairOutput.py : generates an output file in CSV format: [file1Path,startLine,endLine,file2Path,startLine,endLine]
- python3 scripts/blockPairOutput.py taskId detectionId outputFile
scripts/filePairOutput.py : generates an output file in CSV format: [file1Path,file2Path]
- This is useful when MSCCD is executed as a file-level clone detector (that is, when maxRound in config.json is set to 1 or 0).
- python3 scripts/filePairOutput.py taskId detectionId outputFile
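
The CSV written by scripts/blockPairOutput.py can be loaded with the standard csv module. The sketch below assumes a comma-separated file without a header row, in the column order given above; the column names used as dict keys, the helper name, and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical key names for the six columns described above.
COLUMNS = ("file1Path", "startLine1", "endLine1",
           "file2Path", "startLine2", "endLine2")
LINE_COLUMNS = ("startLine1", "endLine1", "startLine2", "endLine2")


def read_block_pairs(handle):
    """Read block-level clone pairs from an open CSV file object."""
    rows = []
    for record in csv.reader(handle):
        if len(record) != len(COLUMNS):
            continue  # skip blank or malformed rows
        row = dict(zip(COLUMNS, record))
        for key in LINE_COLUMNS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Made-up sample content standing in for a real output file.
sample_csv = io.StringIO(
    "projectA/Foo.java,10,42,projectB/Bar.java,7,39\n"
    "projectA/Foo.java,50,80,projectA/Baz.java,12,44\n"
)
for pair in read_block_pairs(sample_csv):
    print(pair["file1Path"], pair["startLine1"], "<->", pair["file2Path"])
```

If the real output turns out to contain a header row, the integer conversion will fail on it and the first row needs to be skipped explicitly.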

Coming soon

Speed up
Analysis scripts to make the detection results easier to read and use
