
Releases: symflower/eval-dev-quality

v0.5.0

06 Jun 07:18
efe1ea3

Highlights 🌟

  • Ollama 🦙 support
  • Mac and Windows 🖥️ support
  • Support for any inference endpoint 🧠 as long as it implements the OpenAI inference API (see the sketch after this list)
  • More complex "write test" task cases 🔢 for both Java and Go
  • Evaluation now measures the processing time ⏱️ that it takes a model to compute a response
  • Evaluation now counts the number of characters 💬 in model responses to give an idea of which models give brief, efficient responses
  • Multiple runs 🏃 built right into the evaluation tool
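
Any server that speaks the OpenAI chat-completions protocol can therefore be plugged in as an inference endpoint. The following is a rough illustration only, not the evaluation tool's actual code: the endpoint URL, model name, prompt, and API key are placeholders (the URL points at a local Ollama instance as an example).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage, chatRequest and chatResponse mirror the minimal subset of the
// OpenAI chat-completions schema that a compatible endpoint has to accept.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

func main() {
	// Placeholder endpoint and model: any OpenAI-compatible server works,
	// e.g. a local Ollama or vLLM instance.
	endpoint := "http://localhost:11434/v1/chat/completions"

	payload, err := json.Marshal(chatRequest{
		Model: "llama3",
		Messages: []chatMessage{
			{Role: "user", Content: "Write a unit test for this Go function: ..."},
		},
	})
	if err != nil {
		panic(err)
	}

	request, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	request.Header.Set("Content-Type", "application/json")
	request.Header.Set("Authorization", "Bearer placeholder-api-key") // Often optional for local servers.

	response, err := http.DefaultClient.Do(request)
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	// Print the first returned completion.
	var result chatResponse
	if err := json.NewDecoder(response.Body).Decode(&result); err != nil {
		panic(err)
	}
	if len(result.Choices) > 0 {
		fmt.Println(result.Choices[0].Message.Content)
	}
}
```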

Pull Requests

Merged

  • Development 🛠️
    • CI
      • fix, Let Windows CI error on failures, and inject API tokens only once per provider by @zimmski in #104
      • Cancel previous runs of the CI when a new push happens to a PR by @Munsio in #140
    • Tooling
  • Documentation 📚
    • Readme
      • Reduce fluff of the intro to a minimum, and add summary CSVs for the v0.4.0 results by @zimmski in #86
      • New section explaining the evaluation, and its tasks and cases by @bauersimon in #87
      • Update according to newest blog post release by @bauersimon in #88
      • Use the most cost-effective model that is still good for usage showcase because Claude Opus is super expensive by @zimmski in #84
  • Evaluation ⏱️
    • Multiple Runs
      • Support multiple runs in a single evaluation by @bauersimon in #109
      • Option to execute multiple runs non-interleaved by @bauersimon in #120
      • fix, Do not cancel successive runs if previous runs had problems by @bauersimon in #129
    • Testdata Repository
      • Use Git to avoid copying the repository on each model run by @Munsio in #114
      • fix, Use empty Git config in temporary repositories to not inherit any user configuration by @bauersimon in #146
      • fix, Reset repository per task to not bleed task results into subsequent tasks by @bauersimon in #148
    • Tests
      • Remove the need to change the provider registry in tests to make test code concurrency safe by @ruiAzevedo19 in #137
      • Move the error used in the evaluation tests to a variable, to avoid copying it in the test suites by @ruiAzevedo19 in #138
    • Language Support
      • fix, Java test file path needs to be OS aware by @Munsio and @ruiAzevedo19 in #155
      • Require at least symflower v36800 as it fixes Java coverage extraction in examples with exceptions by @bauersimon in #14
      • fix, Do not ignore Go coverage count if there are failing tests by @ahumenberger in #161
    • fix, Empty model responses should be handled as errors by @Munsio in #97
    • refactor, Move evaluation logic into evaluation package for isolation of concern by @zimmski in #136
  • Models 🤖
    • New Models
      • Ollama Support
        • Installation and Update
          • Ollama tool automated installation by @bauersimon in #95
          • Ollama tool version check and update if version is outdated by @bauersimon in #118
          • Update Ollama to 0.1.41 to have all the latest Windows fixes by @bauersimon in #154
        • Provider Integration
      • Generic OpenAI API provider by @bauersimon in #112
    • Allow to retry a model when it errors by @ruiAzevedo19 in #125
    • Clean up query attempt code by @zimmski in #132
    • Explicitly check the interface that is setting the query attempts, to ensure the model implements all its methods by @ruiAzevedo19 in #139
  • Reports 🗒️
    • CSV
      • Replace model dependent evaluation result with report file since that contains all the evaluation information by @bauersimon in #85
      • Additional CSVs to sum up metrics for all models overall and per language by @Munsio in #94
      • fix, Sort map by model before creating the CSV output to be deterministic by @Munsio in #99
    • Metrics
      • Measure processing time of model responses by @bauersimon in #106
      • Measure how many characters were present in a model response and generated test files by @ruiAzevedo19 in #142
      • Make sure to use uint64 consistently for metrics and scoring, and allow more task cases by always working on a clean repository by @zimmski in #133
  • Operating Systems 🖥️
  • Tasks 🔢

Closed

Issues

Closed

  • #83 Add additional CSV files that sum up: overall, per-language
  • #91 Integrate Ollama
  • #92 Empty responses should not be tested but should fail
  • #98 Non-deterministic test output leads to flaky CI jobs
  • #101 Unable to run benchmark tasks on Windows due to incorrect directory creation syntax
  • #105 Measure Model response time
  • #108 Multiple Runs
  • #111 Generic OpenAI API provider
  • #113 Optimize repository handling in multiple runs per model
  • #116 Preload/Unload Ollama models before prompting
  • #117 Fixed Ollama version
  • #119 Multiple runs without interleaving
  • #123 Give models a retry on error
  • #128 Track how many characters were present in code part / complete response
  • #131 Follow-up: Allow to retry a model when it errors
  • #145 git repository change requires the GPG password
  • #147 Repository not reset for multiple tasks
  • #158 Deal with failing tests

v0.4.0

26 Apr 12:42
8a38762

Deep dive into evaluation with this version: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

This release's major additions are

  • Java as a new language,
  • automatic Markdown report with an SVG chart,
  • and lots of automation and testing to make the evaluation benchmark super reliable.

Features

  • Java language adapter with “java/plain” repository #62
  • Scoring through metric points and ranking of models #42
  • Automatic categorization of models depending on their worst result #36 #39 #48
  • Fully log results per model and repository #25 #53
  • Migrate to symflower test instead of redoing test execution logic #62
  • Automatic installation of Symflower for RAG and general source code analytics to not reinvent the wheel #50
  • Generate test file paths through language adapters #60 (see the adapter sketch after this list)
  • Generate import / package paths through language adapters #63
  • Generate test framework name through language adapters #63
  • Human readable categories with description #57
  • Summary report as Markdown file with links to results #57 #77
  • Summary bar chart for overall results of categories as SVG in Markdown file #57
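
The language adapters referenced above encapsulate the language-specific conventions. The following is a minimal sketch under assumed names (the interface and methods below are illustrative, not the repository's actual API) of how such an adapter can derive test file paths, import paths, and the test framework name:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// LanguageAdapter is a hypothetical interface sketching what a language
// adapter provides: where a test file lives, how import paths are derived,
// and which test framework the prompt should ask for.
type LanguageAdapter interface {
	TestFilePath(sourceFile string) string
	ImportPath(sourceFile string) string
	TestFramework() string
}

// goAdapter places the test file next to its source file with a "_test.go"
// suffix and derives a (simplified) import path from the source directory.
type goAdapter struct{}

func (goAdapter) TestFilePath(sourceFile string) string {
	return strings.TrimSuffix(sourceFile, ".go") + "_test.go"
}

func (goAdapter) ImportPath(sourceFile string) string {
	return filepath.ToSlash(filepath.Dir(sourceFile))
}

func (goAdapter) TestFramework() string {
	return "Go testing"
}

func main() {
	var adapter LanguageAdapter = goAdapter{}
	fmt.Println(adapter.TestFilePath(filepath.Join("plain", "plain.go"))) // plain/plain_test.go (plain\plain_test.go on Windows)
	fmt.Println(adapter.ImportPath(filepath.Join("plain", "plain.go")))   // plain
	fmt.Println(adapter.TestFramework())                                  // Go testing
}
```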

Bug fixes

  • More reliable parsing of code fences #70 #69
  • Do not exit process but instead panic for reliable testing and traces #69

v0.3.0

04 Apr 12:50
a0f48bc

The first README of "DevQualityEval" (our final name for the benchmark) is online at https://github.com/symflower/eval-dev-quality. We are looking for feedback on how to make it more direct, less fluffy, and more interesting for developers 🚨🔦 Please help! We are currently sifting through the first benchmark and writing a report.

v0.2.0

03 Apr 12:42
df10ae2

This release makes the following tasks possible:

  • Add providers, models, and languages easily by implementing a common interface (see the sketch after this list)
  • Evaluate with any model that openrouter.ai offers, and with Symflower's symbolic execution
  • Add repositories that should be evaluated using Go as language
  • Run tests of Go repositories and query their coverage as the first evaluation benchmark task
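
As a rough sketch of what such a common interface could look like (the `Provider` interface and `echoProvider` below are hypothetical names, not the repository's exact types), adding a new provider comes down to satisfying a small Go interface:

```go
package main

import "fmt"

// Provider is a hypothetical sketch of a common provider interface: it
// identifies itself, lists the models it serves, and answers a query with
// one of them.
type Provider interface {
	ID() string
	Models() []string
	Query(model string, prompt string) (response string, err error)
}

// echoProvider is a stand-in implementation showing how a new provider (or a
// language, via an analogous interface) plugs in simply by satisfying the
// interface.
type echoProvider struct{}

func (echoProvider) ID() string       { return "echo" }
func (echoProvider) Models() []string { return []string{"echo/echo"} }
func (echoProvider) Query(model string, prompt string) (string, error) {
	return "echo(" + model + "): " + prompt, nil
}

func main() {
	var provider Provider = echoProvider{}
	response, err := provider.Query("echo/echo", "Write tests for plain.go")
	if err != nil {
		panic(err)
	}
	fmt.Println(response)
}
```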

More to come. If you want to contribute, let us know.

v0.1.0

29 Mar 20:00
b0c59b4

This release includes all the basic components needed to move forward with creating an evaluation benchmark for LLMs and friends to compare and evolve the code quality of code generation. The only big exceptions are a well-documented README, an interface to a generic LLM API service, and tasks so that people who want to contribute can help. These will follow soon.