Resources:
Pre-Training:
- CodeGen (CodeGen-Mono 16B)
- CodeGeeX: An Open Multilingual Code Generation Model
- The StarCoder models are 15.5B-parameter models trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded.
- CodeXGLUE: a benchmark covering code-code, text-code, code-text, and text-text tasks.
Code Smells:
- QScored: A Large Dataset of Code Smells and Quality Metrics
Code Review:
- CodeBERT
- CodeReviewer is a model pre-trained with code change and code review data to support code review tasks.
- GraphCodeBERT is a pre-trained model for programming languages that takes the inherent structure of code (i.e., data flow) into account. It is a multilingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
- CodeBERT is a pre-trained model for programming languages. It is a multilingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
AI Server:
- FauxPilot is an attempt to build a locally hosted alternative to GitHub Copilot. It uses the Salesforce CodeGen models inside NVIDIA's Triton Inference Server with the FasterTransformer backend.
Datasets:
- GitHub-Code-Clean is a cleaner version of the GitHub-Code dataset, obtained by applying the following filters: average line length < 100, alphanumeric character fraction > 0.25, and removal of auto-generated files (by keyword search).
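The three GitHub-Code-Clean filters listed above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual preprocessing code: the keyword list, the number of header lines inspected, and all function names are assumptions made for the example.

```python
# Hypothetical marker phrases; the keyword list actually used for the
# dataset is not given in this document.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def average_line_length(text: str) -> float:
    """Mean number of characters per line (0.0 for empty input)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def alphanumeric_fraction(text: str) -> float:
    """Share of characters that are letters or digits (0.0 for empty input)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def looks_auto_generated(text: str, head_lines: int = 5) -> bool:
    """Keyword search over the first few lines (head_lines is an assumption)."""
    head = "\n".join(text.splitlines()[:head_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text: str) -> bool:
    """Apply the three filters with the thresholds quoted in the list above."""
    return (
        average_line_length(text) < 100
        and alphanumeric_fraction(text) > 0.25
        and not looks_auto_generated(text)
    )
```

With this sketch, a short hand-written function passes, while a file dominated by one very long line, a file made mostly of punctuation, or a file whose header says it is auto-generated is dropped.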
XLCoST for text-to-code synthesis is a subset of the XLCoST benchmark for text-to-code generation at the snippet level and the program level, covering 7 programming languages: Python, C, C#, C++, Java, JavaScript, and PHP.
APPS is a benchmark for code generation with 10,000 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. The APPS metric is also available on the Hub at codeparrot/apps_metric.
CodeComplex consists of 4,200 Java programs submitted to programming competitions by human programmers, together with complexity labels annotated by a group of algorithm experts.
github-jupyter-parsed contains markdown and code pairs. The preprocessing script is provided in preprocessing.py. The data is deduplicated and consists of 451,662 examples.
https://huggingface.co/datasets/semeru/Text-Code-concode-Java
Generate the source code of class member functions in Java, given a natural language description and the class environment. The class environment is the programmatic context provided by the rest of the class, including the other member variables and member functions of the class. Models are evaluated by exact match and BLEU.
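As an illustration of the exact-match part of that evaluation, here is a minimal sketch. It assumes predictions and references are plain strings and that differences in whitespace should be ignored; the benchmark's official scorer may normalize differently, and the function names are made up for this example. BLEU is left out here since it is normally computed with an existing library.

```python
from typing import Sequence


def normalize(code: str) -> str:
    """Collapse runs of whitespace so formatting differences do not count.

    Assumption for this sketch; the official scorer may differ.
    """
    return " ".join(code.split())


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ["int add ( int a ) { return a + 1 ; }", "void f ( ) { }"]
    refs = ["int add ( int a ) {  return a + 1 ; }", "void g ( ) { }"]
    print(exact_match(preds, refs))  # 0.5
```

Exact match is strict: the second prediction differs from its reference only in the method name and still scores zero, which is why it is reported alongside BLEU.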
https://huggingface.co/datasets/semeru/completeformer_java_data
AutoCompleteFormer