If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.
We group the papers by code authorship attribution, clone detection, defect detection and repair, code summarization, code search, code completion, code translation, code question answering, problem classification, method name prediction, and type prediction.
This repository is based on our paper, Source Code Data Augmentation for Deep Learning: A Survey. You can cite it as follows:
@article{zhuo2023source,
title={Source Code Data Augmentation for Deep Learning: A Survey},
author={Terry Yue Zhuo and Zhou Yang and Zhensu Sun and Yufei Wang and Li Li and Xiaoning Du and Zhenchang Xing and David Lo},
year={2023},
eprint={2305.19915},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Authors: Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, David Lo
Note: This repository is a work in progress. More papers from our survey will be added soon. Direct inquiries to terry.zhuo@monash.edu or open an issue here.

Code Authorship Attribution

Paper | Evaluation Datasets |
---|---|
Natural Attack for Pre-trained Models of Code (ICSE'22) | GCJ |
RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding Style Transformation (ICSE'22) | GCJ, GitHub |
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23) | GCJ |
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23) | GCJ |

Clone Detection

Paper | Datasets |
---|---|
Contrastive Code Representation Learning (EMNLP'21) | JavaScript (paper-specific) |
Data Augmentation by Program Transformation (JSS'22) | BigCloneBench |
Natural Attack for Pre-trained Models of Code (ICSE'22) | BigCloneBench |
Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings (ICSE'22) | POJ-104, GCJ |
Heloc: Hierarchical contrastive learning of source code representation (ICPC'22) | GCJ, OJClone |
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22) | BinaryCorp-3M |
Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection (ArXiv'22) | POJ-104, Codeforces |
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts (ACL'22) | POJ-104, BigCloneBench |
ReACC: A retrieval-augmented code completion framework (ACL'22) | CodeNet |
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22) | POJ-104 |
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23) | BigCloneBench |
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23) | --- |
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23) | POJ-104 |
Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection (ICPC'23) | CLCDSA |
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23) | BigCloneBench |
A Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning (JoS'23) | POJ-104, BigCloneBench |
CONCORD: Clone-aware Contrastive Learning for Source Code (ISSTA'23) | CodeNet (Java), POJ-104 |
Neuro-symbolic Zero-Shot Code Cloning with Cross-Language Intermediate Representation (ArXiv'23) | CodeNet (C, COBOL) |
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23) | BigCloneBench |

Defect Detection and Repair

Paper | Datasets |
---|---|
Adversarial Examples for Models of Code (OOPSLA'20) | VarMisuse |
Self-Supervised Bug Detection and Repair (NeurIPS'21) | RANDOMBUGS, PYPIBUGS |
Semantic-Preserving Adversarial Code Comprehension (COLING'22) | Defects4J |
Path-sensitive code embedding via contrastive learning for software vulnerability detection (ISSTA'22) | D2A, Fan, Devign |
Natural Attack for Pre-trained Models of Code (ICSE'22) | Devign |
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22) | SySeVR |
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts (ACL'22) | REVEAL, CodeXGLUE |
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23) | Refactory, CodRep1 |
MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation (SANER'23) | Refactory, CodRep1 |
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23) | Devign |
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23) | Devign, CodeChef |
MUFIN: Improving Neural Repair Models with Back-Translation (ArXiv'23) | Defects4J (paper-specific), QuixBugs (paper-specific) |
Leveraging Causal Inference for Explainable Automatic Program Repair (IJCNN'22) | Defects4J, QuixBugs, BugAID |
Deepdebug: Fixing python bugs using stack traces, backtranslation, and code skeletons (ArXiv'21) | paper-specific |
Break-It-Fix-It: Unsupervised Learning for Program Repair (ArXiv'21) | paper-specific, DeepFix |
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23) | Devign, Bug2Fix |
InferFix: End-to-End Program Repair with LLMs over Retrieval-Augmented Prompts (ArXiv'23) | InferredBugs |
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair (FSE'23) | TFix, Bug2Fix, Defects4J |
Too Few Bug Reports? Exploring Data Augmentation for Improved Changeset-based Bug Localization (ArXiv'23) | Locus data |

Code Summarization

Paper | Datasets |
---|---|
Training Deep Code Comment Generation Models via Data Augmentation (Internetware'20) | TL-CodeSum |
Retrieval-Based Neural Source Code Summarization (ICSE'20) | PCSD, JCSD |
Generating adversarial computer programs using optimized obfuscations (ICLR'21) | Python-150K, Code2Seq Data |
Contrastive code representation learning (EMNLP'21) | JavaScript (paper-specific) |
A search-based testing framework for deep neural networks of source code embedding (ICST'21) | paper-specific |
Retrieval-Augmented Generation for Code Summarization via Hybrid GNN (ICLR'21) | CCSD (paper-specific) |
BASHEXPLAINER: Retrieval-Augmented Bash Code Comment Generation based on Fine-tuned CodeBERT (ICSME'22) | BASHEXPLAINER Data |
Data Augmentation by Program Transformation (JSS'22) | DeepCom |
Adversarial robustness of deep code comment generation (TOSEM'22) | CCSD (paper-specific) |
Do Not Have Enough Data? An Easy Data Augmentation for Code Summarization (PAAP'22) | --- |
Semantic robustness of models of source code (SANER'22) | Python-150K, Code2Seq Data |
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22) | CodeSearchNet (Python, Java) |
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23) | --- |
Exploring Data Augmentation for Code Generation Tasks (EACL'23) | CodeSearchNet (CodeXGLUE) |
Bash Comment Generation Via Data Augmentation and Semantic-Aware Codebert (ArXiv'23) | BASHEXPLAINER Data |
READSUM: Retrieval-Augmented Adaptive Transformer for Source Code Summarization (Access'23) | PCSD |
Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization (ArXiv'23) | PCSD, CCSD, DeepCom |
Two Birds with One Stone: Boosting Code Generation and Code Search via a Generative Adversarial Network (OOPSLA'23) | CodeSearchNet (Python, Java) |
Better Language Models of Code through Self-Improvement (ACL'23) | CodeSearchNet |

Code Search

Paper | Datasets |
---|---|
AugmentedCode: Examining the Effects of Natural Language Resources in Code Retrieval Models (ArXiv'21) | CodeSearchNet |
CoSQA: 20,000+ Web Queries for Code Search and Question Answering (ACL'21) | CoSQA |
A search-based testing framework for deep neural networks of source code embedding (ICST'21) | paper-specific |
Semantic-Preserving Adversarial Code Comprehension (COLING'22) | CodeSearchNet |
Exploring Representation-Level Augmentation for Code Search (EMNLP'22) | CodeSearchNet |
Cross-Modal Contrastive Learning for Code Search (ICSME'22) | AdvTest, CoSQA |
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22) | CodeSearchNet |
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22) | CodeSearchNet (Python, Java) |
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23) | AdvTest, WebQueryTest |
CoCoSoDa: Effective Contrastive Learning for Code Search (ICSE'23) | CodeSearchNet |
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering (EACL'23) | WebQueryTest |
A Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning (JoS'23) | CodeSearchNet |
Rethinking Negative Pairs in Code Search (EMNLP'23) | CodeSearchNet |
Towards Better Multilingual Code Search through Cross-Lingual Contrastive Learning (Internetware'23) | XLCoST |
MCodeSearcher: Multi-View Contrastive Learning for Code Search (Internetware'23) | CodeSearchNet (Python, Java), CoSQA, StaQC, WebQuery |
MulCS: Towards a Unified Deep Representation for Multilingual Code Search (SANER'23) | CodeSearchNet (Python, Java), paper-specific |
Two Birds with One Stone: Boosting Code Generation and Code Search via a Generative Adversarial Network (OOPSLA'23) | CodeSearchNet (Python, Java) |

Code Completion

Paper | Datasets |
---|---|
Generative Code Modeling with Graphs (ICLR'19) | ExprGen Data (paper-specific) |
Adversarial Robustness of Program Synthesis Models (AIPLANS'21) | ALGOLISP |
ReACC: A retrieval-augmented code completion framework (ACL'22) | PY150 (CodeXGLUE), GitHub Java (CodeXGLUE) |
Test-Driven Multi-Task Learning with Functionally Equivalent Code Transformation for Neural Code Generation (ASE'22) | MBPP |
How Important are Good Method Names in Neural Code Generation? A Model Robustness Perspective (ArXiv'22) | refined CONCODE, refined PyTorrent |
A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities (ArXiv'22) | CodeSearchNet (Python, Java) |
ReCode: Robustness Evaluation of Code Generation Models (ACL'23) | HumanEval, MBPP |
CLAWSAT: Towards Both Robust and Accurate Code Models (SANER'23) | --- |
Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning (ICSE'23) | ATLAS, TFix |
RustGen: An Augmentation Approach for Generating Compilable Rust Code with Large Language Models (DeployableGenerativeAI'23) | paper-specific |
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23) | GitHub Java (CodeXGLUE) |
Domain Adaptive Code Completion via Language Models and Decoupled Domain Databases (ASE'23) | paper-specific |
APICom: Automatic API Completion via Prompt Learning and Adversarial Training-based Data Augmentation (Internetware'23) | paper-specific |
Better Language Models of Code through Self-Improvement (ACL'23) | CONCODE |

Code Translation

Paper | Datasets |
---|---|
Leveraging Automated Unit Tests for Unsupervised Code Translation (ICLR'23) | paper-specific |
Exploring Data Augmentation for Code Generation Tasks (EACL'23) | CodeTrans (CodeXGLUE) |
Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages (EACL'23) | Transcoder Data |
ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning (ICSE'23) | CodeTrans (CodeXGLUE) |
Code Translation with Compiler Representations (ICLR'23) | Transcoder Data |
Data Augmentation for Code Translation with Comparable Corpora and Multiple References (EMNLP'23) | Transcoder Data |
Assessing and Improving Syntactic Adversarial Robustness of Pre-trained Models for Code Translation (ArXiv'23) | AVATAR |
Multi-target Backdoor Attacks for Code Pre-trained Models (ACL'23) | Transcoder Data |

Code Question Answering

Paper | Datasets |
---|---|
CoSQA: 20,000+ Web Queries for Code Search and Question Answering (ACL'21) | CoSQA |
Semantic-Preserving Adversarial Code Comprehension (COLING'22) | CodeQA |
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering (EACL'23) | CoSQA |
MCodeSearcher: Multi-View Contrastive Learning for Code Search (Internetware'23) | WebQuery (paper-specific) |

Problem Classification

Paper | Datasets |
---|---|
Generating Adversarial Examples for Holding Robustness of Source Code Processing Models (AAAI'20) | OJ |
Generating Adversarial Examples of Source Code Classification Models via Q-Learning-Based Markov Decision Process (QRS'21) | OJ |
Heloc: Hierarchical contrastive learning of source code representation (ICPC'22) | GCJ, OJ |
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (ArXiv'22) | POJ-104 (CodeXGLUE) |
Bridging pre-trained models and downstream tasks for source code understanding (ICSE'22) | POJ-104 |
Boosting Source Code Learning with Data Augmentation: An Empirical Study (ArXiv'23) | Java250, Python800 |
MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation (SANER'23) | Java250, Python800 |
Code Difference Guided Adversarial Example Generation for Deep Code Models (ASE'23) | GCJ |
An Enhanced Data Augmentation Approach to Support Multi-Class Code Readability Classification (SEKE'22) | paper-specific |
Improving Multi-Class Code Readability Classification with An Enhanced Data Augmentation Approach (International Journal of Software Engineering and Knowledge Engineering) | paper-specific |

Method Name Prediction

Paper | Datasets |
---|---|
Adversarial Examples for Models of Code (OOPSLA'20) | Code2vec |
A search-based testing framework for deep neural networks of source code embedding (ICST'21) | paper-specific |
On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations (IST'21) | Code2Seq |
Data Augmentation by Program Transformation (JSS'22) | Code2vec |
Discrete Adversarial Attack to Models of Code (PLDI'23) | Code2vec |

Type Prediction

Paper | Datasets |
---|---|
Adversarial Robustness for Code (ICML'21) | DeepTyper |
Contrastive code representation learning (EMNLP'21) | DeepTyper |
Cross-Lingual Transfer Learning for Statistical Type Inference (ISSTA'22) | DeepTyper, Typilus (Python), CodeSearchNet (Java) |

We thank Steven Y. Feng et al. for their open-source paper list on DataAug4NLP.