Awesome AI Datasets for engineer efficiency

Resources:

Pre-Training:

CodeGen (CodeGen-Mono 16B)
CodeGeeX: An Open Multilingual Code Generation Model
StarCoder models are 15.5B-parameter models trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded.
CodeXGLUE: a benchmark covering code-code, text-code, code-text, and text-text tasks.

Code Smells:

QScored: A Large Dataset of Code Smells and Quality Metrics

Code Review:

CodeBERT
- CodeReviewer is a model pre-trained on code change and code review data to support code review tasks.
- GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e. data flow. It is a multi-programming-lingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
- CodeBERT is a pre-trained model for programming languages. It is a multi-programming-lingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

AI Server

FauxPilot is an attempt to build a locally hosted alternative to GitHub Copilot. It uses the Salesforce CodeGen models inside NVIDIA's Triton Inference Server with the FasterTransformer backend.

Programming

Common

GitHub-Code-Clean is a cleaner version of the Github-code dataset, with the following filters applied: average line length < 100, alphanumeric character fraction > 0.25, and removal of auto-generated files (keyword search).
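The three filters listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dataset's actual preprocessing code: the keyword list, the number of header lines scanned, and all function names are hypothetical, since the source says only "keyword search".

```python
# Hypothetical keyword list; the source only mentions a "keyword search".
AUTO_GENERATED_KEYWORDS = (
    "auto-generated",
    "autogenerated",
    "automatically generated",
    "do not edit",
)


def average_line_length(text: str) -> float:
    """Mean number of characters per line (0.0 for empty input)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def alphanumeric_fraction(text: str) -> float:
    """Share of characters that are letters or digits (0.0 for empty input)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def looks_auto_generated(text: str, head_lines: int = 5) -> bool:
    """Search the first few lines for a generator marker (assumed heuristic)."""
    head = "\n".join(text.splitlines()[:head_lines]).lower()
    return any(keyword in head for keyword in AUTO_GENERATED_KEYWORDS)


def keep_file(text: str) -> bool:
    """Apply the three filters with the thresholds quoted in the description."""
    return (
        average_line_length(text) < 100
        and alphanumeric_fraction(text) > 0.25
        and not looks_auto_generated(text)
    )
```

A file passes only if all three checks succeed, so a minified one-line file or a file whose header says "do not edit" would be dropped under these assumptions.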

Text To Code

XLCoST for text-to-code synthesis is a subset of the XLCoST benchmark for text-to-code generation at the snippet level and program level, covering 7 programming languages: Python, C, C#, C++, Java, JavaScript, and PHP.

Performance

APPS is a benchmark for code generation with 10,000 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. The APPS metric is also available on the Hugging Face Hub as codeparrot/apps_metric.
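Evaluating generated code against a specification usually comes down to running it on test cases and counting how many pass. The sketch below illustrates that idea only; it does not reproduce codeparrot/apps_metric. The function names are hypothetical, and the candidate is assumed to be a Python callable that maps an input string to an output string. A real harness would run untrusted code in a sandbox with timeouts.

```python
from typing import Callable, Iterable, Tuple


def pass_rate(
    solution: Callable[[str], str],
    test_cases: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (input, expected_output) cases the candidate solves."""
    cases = list(test_cases)
    if not cases:
        return 0.0
    passed = 0
    for given, expected in cases:
        try:
            # Compare after stripping surrounding whitespace (assumed convention).
            if solution(given).strip() == expected.strip():
                passed += 1
        except Exception:
            # A runtime error counts as a failed case.
            pass
    return passed / len(cases)


def candidate(stdin_text: str) -> str:
    """Toy generated solution: add two integers read from the input."""
    a, b = map(int, stdin_text.split())
    return str(a + b)


cases = [("1 2", "3"), ("10 -4", "6\n"), ("x y", "0")]
rate = pass_rate(candidate, cases)  # the third case raises and is counted as failed
```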

Complex

CodeComplex consists of 4,200 Java programs submitted to programming competitions by human programmers, together with complexity labels annotated by a group of algorithm experts.

ByLanguages

Python

github-jupyter-parsed is a dataset of markdown and code pairs. The preprocessing script is provided in preprocessing.py. The data is deduplicated and consists of 451,662 examples.
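The source does not say how the deduplication was done. One common approach is exact deduplication by content hash, sketched below as an assumption rather than a description of preprocessing.py. The function names and the choice to strip surrounding whitespace before hashing are hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (markdown, code)


def pair_key(markdown: str, code: str) -> str:
    """Hash a markdown/code pair after trimming surrounding whitespace."""
    digest = hashlib.sha256()
    digest.update(markdown.strip().encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(code.strip().encode("utf-8"))
    return digest.hexdigest()


def deduplicate(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the first occurrence of each distinct pair, preserving order."""
    seen = set()
    unique = []
    for markdown, code in pairs:
        key = pair_key(markdown, code)
        if key not in seen:
            seen.add(key)
            unique.append((markdown, code))
    return unique


pairs = [
    ("Load data", "import pandas as pd"),
    ("Load data", "import pandas as pd\n"),  # differs only by trailing newline
    ("Plot", "df.plot()"),
]
unique_pairs = deduplicate(pairs)
```

Exact hashing only removes identical pairs; near-duplicates (for example, the same cell with a renamed variable) would need a fuzzy method such as MinHash, which is not shown here.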

Java

semeru/Text-Code-concode-Java

https://huggingface.co/datasets/semeru/Text-Code-concode-Java

The task is to generate the source code of class member functions in Java, given a natural language description and the class environment. The class environment is the programmatic context provided by the rest of the class, including other member variables and member functions in the class. Models are evaluated by exact match and BLEU.
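Of the two metrics mentioned above, exact match is the simpler one and can be sketched directly. The snippet below is a minimal illustration, not the benchmark's official scorer: the function names are hypothetical, and collapsing whitespace before comparison is an assumption. BLEU is not covered here; it is normally computed with an existing library.

```python
from typing import Sequence


def normalize(code: str) -> str:
    """Collapse whitespace runs so formatting alone does not cause a mismatch."""
    return " ".join(code.split())


def exact_match_accuracy(
    predictions: Sequence[str], references: Sequence[str]
) -> float:
    """Share of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


predictions = [
    "int getX() { return x; }",
    "void  setX(int v) {\n x = v; }",
    "int size() { return n; }",
]
references = [
    "int getX() { return x; }",
    "void setX(int v) { x = v; }",
    "int size() { return count; }",
]
accuracy = exact_match_accuracy(predictions, references)
```

In this toy example the first two predictions match their references once whitespace is normalized, and the third differs in an identifier.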

semeru/completeformer_java_data

https://huggingface.co/datasets/semeru/completeformer_java_data

AutoCompleteFormer

unit-mesh/awesome-datasets-for-unit-mesh
