unit-mesh/awesome-datasets-for-unit-mesh

Awesome AI Datasets for Engineering Efficiency

Resources:

Pre-Training:

Code Smells:

  • QScored: a large dataset of code smells and quality metrics.

Code Review:

  • CodeBERT
    • CodeReviewer is a model pre-trained on code change and code review data to support code review tasks.
    • GraphCodeBERT is a pre-trained model for programming languages that takes the inherent structure of code (i.e., data flow) into account. It is a multilingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
    • CodeBERT is a pre-trained model for programming languages. It is a multilingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

AI Server

  • FauxPilot is an attempt to build a locally hosted alternative to GitHub Copilot. It uses the Salesforce CodeGen models inside NVIDIA's Triton Inference Server with the FasterTransformer backend.

Programming

Common

  • GitHub-Code-Clean is a cleaner version of the GitHub-Code dataset, with the following filters applied: average line length < 100, alphanumeric character fraction > 0.25, and removal of auto-generated files (by keyword search).
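The three filters above can be sketched as a single predicate over a file's text. This is a minimal illustration, not the dataset's actual cleaning script: the function names, the keyword list, and the choice to scan only the first few lines are all assumptions, since the description only says "keyword search".

```python
# Hypothetical keyword list; the dataset description does not say which keywords are used.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated", "do not edit")


def average_line_length(text: str) -> float:
    """Mean number of characters per line (0.0 for empty input)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def alphanumeric_fraction(text: str) -> float:
    """Share of characters that are letters or digits (0.0 for empty input)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def looks_autogenerated(text: str, head_lines: int = 5) -> bool:
    """Keyword search over the first few lines (an assumed heuristic)."""
    head = "\n".join(text.splitlines()[:head_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text: str) -> bool:
    """Apply the three filters with the thresholds quoted in the list entry."""
    return (
        average_line_length(text) < 100
        and alphanumeric_fraction(text) > 0.25
        and not looks_autogenerated(text)
    )
```

A predicate of this shape can be passed to whatever filtering step a pipeline already has, for example a list comprehension over file contents.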

Text To Code

XLCost for text-to-code synthesis is a subset of the XLCoST benchmark for text-to-code generation at the snippet level and the program level, covering 7 programming languages: Python, C, C#, C++, Java, JavaScript, and PHP.

Performance

APPS is a benchmark for code generation with 10,000 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. The APPS metric is also available on the hub as codeparrot/apps_metric.

Complex

This dataset consists of 4,200 Java programs submitted to programming competitions by human programmers, with complexity labels annotated by a group of algorithm experts.

ByLanguages

Python

github-jupyter-parsed contains markdown and code pairs. The preprocessing script is provided in preprocessing.py. The data is deduplicated and consists of 451,662 examples.
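To make "markdown and code pairs" concrete, here is a rough sketch of how such pairs could be pulled from a notebook file. It is not the dataset's preprocessing.py: the function names and the pairing rule (each markdown cell paired with the code cell immediately after it) are assumptions for illustration only.

```python
import json


def cell_text(cell: dict) -> str:
    """Notebook cells store source as either a string or a list of lines."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def markdown_code_pairs(notebook_json: str) -> list:
    """Pair each markdown cell with the code cell that directly follows it.

    Assumed pairing rule; the real preprocessing may differ.
    """
    cells = json.loads(notebook_json).get("cells", [])
    pairs = []
    for prev, curr in zip(cells, cells[1:]):
        if prev.get("cell_type") == "markdown" and curr.get("cell_type") == "code":
            pairs.append((cell_text(prev), cell_text(curr)))
    return pairs
```

Code cells that do not directly follow a markdown cell are dropped under this rule, which keeps every pair a description followed by the code it describes.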

Java

semeru/Text-Code-concode-Java

https://huggingface.co/datasets/semeru/Text-Code-concode-Java

The task is to generate the source code of class member functions in Java, given a natural language description and the class environment. The class environment is the programmatic context provided by the rest of the class, including its other member variables and member functions. Models are evaluated by exact match and BLEU.
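As an illustration of the exact match metric mentioned above, the following sketch scores a batch of predictions against references. Collapsing whitespace before comparison is an assumption made here; the benchmark's official evaluator may compare strings differently. BLEU is left out because it is usually computed with an existing library rather than by hand.

```python
def normalize(code: str) -> str:
    """Collapse runs of whitespace (an assumed normalization, not the official one)."""
    return " ".join(code.split())


def exact_match(predictions: list, references: list) -> float:
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Exact match is strict: a prediction that differs from the reference by a single identifier scores zero for that example, which is why it is typically reported alongside a softer metric such as BLEU.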

semeru/completeformer_java_data

https://huggingface.co/datasets/semeru/completeformer_java_data

AutoCompleteFormer
