Pre-training PLBART using CodeSearchNet

In our work, we pre-trained PLBART on a large collection of source code and natural language descriptions from GitHub and Stack Overflow. Other pre-trained models, such as CodeBERT and GraphCodeBERT, are instead pre-trained on the CodeSearchNet dataset. We therefore investigate how PLBART performs when it is pre-trained on CodeSearchNet.

To pre-train PLBART on CodeSearchNet, run the following scripts in order.

bash setup.sh
bash binarize.sh
bash pretrain.sh
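
The three commands above depend on each other, so a later step should not start if an earlier one fails. A minimal sketch of a wrapper that enforces this, assuming it is launched from the directory that contains the three scripts; `STEPS`, `run_pipeline`, and the `runner` parameter are hypothetical names introduced here for illustration and are not part of the repository.

```python
import subprocess

# Script names are taken from this README; the order matters.
STEPS = ("setup.sh", "binarize.sh", "pretrain.sh")


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each script with bash and stop at the first failure.

    `runner` is injectable (an assumption of this sketch) so the
    sequencing can be checked without launching real jobs.
    """
    completed = []
    for script in steps:
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status.
        runner(["bash", script], check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` runs the three scripts in sequence; if one exits with an error, the exception stops the remaining steps from starting.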

Pre-training Data Statistics

We used 1,880,853 docstrings. The number of functions used per language is detailed below.

| Language | Num Examples |
|---|---|
| Java | 1,524,722 |
| Python | 1,069,208 |
| JavaScript | 1,841,822 |
| PHP | 921,770 |
| Go | 696,935 |
| Ruby | 159,342 |
| Total | 6,213,799 |
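
As a quick sanity check on the statistics above, the per-language function counts can be summed and compared with the reported total. This is only an illustrative sketch; the variable names are made up here, and the numbers are copied from the table.

```python
# Function counts per language, copied from the table above.
num_examples = {
    "Java": 1_524_722,
    "Python": 1_069_208,
    "Javascript": 1_841_822,
    "PHP": 921_770,
    "Go": 696_935,
    "Ruby": 159_342,
}

reported_total = 6_213_799
computed_total = sum(num_examples.values())

# Share of the pre-training functions contributed by each language, in percent.
share = {lang: 100 * n / computed_total for lang, n in num_examples.items()}

print(computed_total == reported_total)
```

The `share` dictionary makes the imbalance between languages easy to inspect: Ruby contributes far fewer functions than Java or Javascript.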

[Note]

  • We pre-trained PLBART on CodeSearchNet using 8 GeForce RTX 2080 (11 GB) GPUs; pre-training took about 11.5 days.
  • We have published the checkpoint here.

Experiments

  • We fine-tuned PLBART-CSNet on all the downstream tasks on which PLBART was evaluated.
  • The scripts are provided in the root_directory/scripts/plbart_csnet directory.
  • We compare PLBART-CSNet with PLBART; the results are as follows.

Code to Text Generation

Dataset: CodeSearchNet

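| Methods | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
|---|---|---|---|---|---|---|---|
| CodeBERT | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
| PLBART | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
| PLBART-CSNet | 14.48 | 16.00 | 17.61 | 20.07 | 19.81 | 24.48 | 18.74 |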
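
The Overall column in the table above is consistent with an unweighted mean of the six per-language scores. The README does not state how it was computed, so treat the sketch below as a consistency check under that assumption; `scores`, `reported`, and `overall` are illustrative names.

```python
# Per-language scores in table order: Ruby, Javascript, Go, Python, Java, PHP.
scores = {
    "CodeBERT": [12.16, 14.90, 18.07, 19.06, 17.65, 25.16],
    "PLBART": [14.11, 15.56, 18.91, 19.30, 18.45, 23.58],
    "PLBART-CSNet": [14.48, 16.00, 17.61, 20.07, 19.81, 24.48],
}

# Overall values as reported in the table.
reported = {"CodeBERT": 17.83, "PLBART": 18.32, "PLBART-CSNet": 18.74}


def overall(per_language):
    """Unweighted mean over languages, rounded to two decimals."""
    return round(sum(per_language) / len(per_language), 2)


for model, per_language in scores.items():
    print(model, overall(per_language), reported[model])
```

Because the mean is unweighted, each language counts equally regardless of the size of its test set.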

Text to Code Generation

Dataset: Concode

| Methods | EM | BLEU | CodeBLEU |
|---|---|---|---|
| GPT-2 | 17.35 | 25.37 | 29.69 |
| CodeGPT-2 | 18.25 | 28.69 | 32.71 |
| CodeGPT-adapted | 20.10 | 32.79 | 35.98 |
| PLBART | 18.75 | 36.69 | 38.52 |
| PLBART-CSNet | 18.60 | 36.79 | 38.81 |

Code to Code Generation

Task: Translation

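| Methods | Java to C# BLEU | Java to C# EM | Java to C# CodeBLEU | C# to Java BLEU | C# to Java EM | C# to Java CodeBLEU |
|---|---|---|---|---|---|---|
| CodeBERT | 79.9 | 59.0 | 85.1 | 72.1 | 58.8 | 79.4 |
| GraphCodeBERT | 80.6 | 59.4 | - | 72.6 | 58.8 | - |
| PLBART | 83.0 | 64.6 | 87.9 | 78.4 | 65.0 | 85.3 |
| PLBART-CSNet | 81.6 | 61.6 | 86.8 | 78.0 | 63.5 | 84.9 |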

Task: Defect Detection, Clone Detection

| Methods | Vulnerability Detection | Clone Detection |
|---|---|---|
| CodeBERT | 62.08 | 96.5 |
| GraphCodeBERT | - | 97.1 |
| PLBART | 63.18 | 97.2 |
| PLBART-CSNet | 59.44 | 97.4 |

Task: Code Refinement

| Methods | Small EM | Small BLEU | Medium EM | Medium BLEU |
|---|---|---|---|---|
| CodeBERT | 16.40 | 77.42 | 5.16 | 91.07 |
| GraphCodeBERT | 17.30 | 80.58 | 9.10 | 72.64 |
| PLBART | 19.21 | 77.02 | 8.98 | 88.50 |
| PLBART-CSNet | 19.13 | 76.95 | 11.60 | 88.08 |