
Releases: symflower/eval-dev-quality

v0.5.0

06 Jun 07:18
efe1ea3

Highlights 🌟

  • Ollama 🦙 support
  • Mac and Windows 🖥️ support
  • Support for any inference endpoint 🧠 as long as it implements the OpenAI inference API (see the sketch after this list)
  • More complex "write test" task cases 🔢 for both Java and Go
  • Evaluation now measures the processing time ⏱️ that it takes a model to compute a response
  • Evaluation now counts the number of characters 💬 in model responses to give an idea of which models give brief, efficient responses
  • Multiple runs 🏃 built right into the evaluation tool
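
Any server that speaks the OpenAI chat-completions protocol can therefore be plugged in as an inference endpoint. The following is a rough illustration only, not the evaluation tool's actual code: the endpoint URL, model name, prompt, and API key are placeholders (the URL points at a local Ollama instance as an example).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage, chatRequest and chatResponse mirror the minimal subset of the
// OpenAI chat-completions schema that a compatible endpoint has to accept.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

func main() {
	// Placeholder endpoint and model: any OpenAI-compatible server works,
	// e.g. a local Ollama or vLLM instance.
	endpoint := "http://localhost:11434/v1/chat/completions"

	payload, err := json.Marshal(chatRequest{
		Model: "llama3",
		Messages: []chatMessage{
			{Role: "user", Content: "Write a unit test for this Go function: ..."},
		},
	})
	if err != nil {
		panic(err)
	}

	request, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	request.Header.Set("Content-Type", "application/json")
	request.Header.Set("Authorization", "Bearer placeholder-api-key") // Often optional for local servers.

	response, err := http.DefaultClient.Do(request)
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	// Print the first returned completion.
	var result chatResponse
	if err := json.NewDecoder(response.Body).Decode(&result); err != nil {
		panic(err)
	}
	if len(result.Choices) > 0 {
		fmt.Println(result.Choices[0].Message.Content)
	}
}
```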

Pull Requests

Merged

  • Development 🛠️
    • CI
      • fix, Let Windows CI error on failures, and inject API tokens only once per provider by @zimmski in #104
      • Cancel previous runs of the CI when a new push happens to a PR by @Munsio in #140
    • Tooling
  • Documentation 📚
    • Readme
      • Reduce fluff of the intro to a minimum, and add summary CSVs for the v0.4.0 results by @zimmski in #86
      • New section explaining the evaluation, and its tasks and cases by @bauersimon in #87
      • Update according to newest blog post release by @bauersimon in #88
      • Use the most cost-effective model that is still good for usage showcase because Claude Opus is super expensive by @zimmski in #84
  • Evaluation ⏱️
    • Multiple Runs
      • Support multiple runs in a single evaluation by @bauersimon in #109
      • Option to execute multiple runs non-interleaved by @bauersimon in #120
      • fix, Do not cancel successive runs if previous runs had problems by @bauersimon in #129
    • Testdata Repository
      • Use Git to avoid copying the repository on each model run by @Munsio in #114
      • fix, Use empty Git config in temporary repositories to not inherit any user configuration by @bauersimon in #146
      • fix, Reset repository per task to not bleed task results into subsequent tasks by @bauersimon in #148
    • Tests
      • Remove the need to change the provider registry in tests to make test code concurrency safe by @ruiAzevedo19 in #137
      • Move the error used in the evaluation tests to a variable, to avoid copying it in the test suites by @ruiAzevedo19 in #138
    • Language Support
      • fix, Java test file path needs to be OS aware by @Munsio and @ruiAzevedo19 in #155
      • Require at least symflower v36800 as it fixes Java coverage extraction in examples with exceptions by @bauersimon in #14
      • fix, Do not ignore Go coverage count if there are failing tests by @ahumenberger in #161
    • fix, Empty model responses should be handled as errors by @Munsio in #97
    • refactor, Move evaluation logic into evaluation package for isolation of concern by @zimmski in #136
  • Models 🤖
    • New Models
      • Ollama Support
        • Installation and Update
          • Ollama tool automated installation by @bauersimon in #95
          • Ollama tool version check and update if version is outdated by @bauersimon in #118
          • Update Ollama to 0.1.41 to have all the latest Windows fixes by @bauersimon in #154
        • Provider Integration
      • Generic OpenAI API provider by @bauersimon in #112
    • Allow to retry a model when it errors by @ruiAzevedo19 in #125
    • Clean up query attempt code by @zimmski in #132
    • Explicitly check the interface that is setting the query attempts, to ensure the model implements all its methods by @ruiAzevedo19 in #139
  • Reports 🗒️
    • CSV
      • Replace model dependent evaluation result with report file since that contains all the evaluation information by @bauersimon in #85
      • Additional CSVs to sum up metrics for all models overall and per language by @Munsio in #94
      • fix, Sort map by model before creating the CSV output to be deterministic by @Munsio in #99
    • Metrics
      • Measure processing time of model responses by @bauersimon in #106
      • Measure how many characters were present in a model response and generated test files by @ruiAzevedo19 in #142
      • Make sure to use uint64 consistently for metrics and scoring, and allow more task cases by always working on a clean repository by @zimmski in #133
  • Operating Systems 🖥️
  • Tasks 🔢

Closed

Issues

Closed

  • #83 Add additional CSV files that sum up: overall, per-language
  • #91 Integrate Ollama
  • #92 Empty responses should not be tested but should fail
  • #98 Non-deterministic test output leads to flaky CI jobs
  • #101 Unable to run benchmark tasks on Windows due to incorrect directory creation syntax
  • #105 Measure Model response time
  • #108 Multiple Runs
  • #111 Generic OpenAI API provider
  • #113 Optimize repository handling in multiple runs per model
  • #116 Preload/Unload Ollama models before prompting
  • #117 Fixed Ollama version
  • #119 Multiple runs without interleaving
  • #123 Give models a retry on error
  • #128 Track how many characters were present in code part / complete response
  • #131 Follow-up: Allow to retry a model when it errors
  • #145 git repository change requires the GPG password
  • #147 Repository not reset for multiple tasks
  • #158 Deal with failing tests

v0.4.0

26 Apr 12:42
8a38762

Deep dive into evaluation with this version: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

This release's major additions are

  • Java as a new language,
  • automatic Markdown report with an SVG chart,
  • and lots of automation and testing to make the evaluation benchmark super reliable.

Features

  • Java language adapter with “java/plain” repository #62
  • Scoring through metric points and ranking of models #42
  • Automatic categorization of models depending on their worst result #36 #39 #48
  • Fully log results per model and repository #25 #53
  • Migrate to symflower test instead of redoing test execution logic #62
  • Automatic installation of Symflower for RAG and general source code analytics to not reinvent the wheel #50
  • Generate test file paths through language adapters #60 (see the adapter sketch after this list)
  • Generate import / package paths through language adapters #63
  • Generate test framework name through language adapters #63
  • Human readable categories with description #57
  • Summary report as Markdown file with links to results #57 #77
  • Summary bar chart for overall results of categories as SVG in Markdown file #57
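
The language adapters referenced above encapsulate the language-specific conventions. The following is a minimal sketch under assumed names (the interface and methods below are illustrative, not the repository's actual API) of how such an adapter can derive test file paths, import paths, and the test framework name:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// LanguageAdapter is a hypothetical interface sketching what a language
// adapter provides: where a test file lives, how import paths are derived,
// and which test framework the prompt should ask for.
type LanguageAdapter interface {
	TestFilePath(sourceFile string) string
	ImportPath(sourceFile string) string
	TestFramework() string
}

// goAdapter places the test file next to its source file with a "_test.go"
// suffix and derives a (simplified) import path from the source directory.
type goAdapter struct{}

func (goAdapter) TestFilePath(sourceFile string) string {
	return strings.TrimSuffix(sourceFile, ".go") + "_test.go"
}

func (goAdapter) ImportPath(sourceFile string) string {
	return filepath.ToSlash(filepath.Dir(sourceFile))
}

func (goAdapter) TestFramework() string {
	return "Go testing"
}

func main() {
	var adapter LanguageAdapter = goAdapter{}
	fmt.Println(adapter.TestFilePath(filepath.Join("plain", "plain.go"))) // plain/plain_test.go (plain\plain_test.go on Windows)
	fmt.Println(adapter.ImportPath(filepath.Join("plain", "plain.go")))   // plain
	fmt.Println(adapter.TestFramework())                                  // Go testing
}
```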

Bug fixes

  • More reliable parsing of code fences #70 #69
  • Do not exit process but instead panic for reliable testing and traces #69

v0.3.0

04 Apr 12:50
a0f48bc

The first README of "DevQualityEval" (our final name for the benchmark) is online at https://github.com/symflower/eval-dev-quality. We are looking for feedback on how to make it more direct, less fluffy, and more interesting for developers 🚨🔦 Please help! We are currently sifting through the first benchmark and writing a report.

v0.2.0

03 Apr 12:42
df10ae2

This release makes the following tasks possible:

  • Add providers, models, and languages easily by implementing a common interface (see the sketch after this list)
  • Evaluate with any model that openrouter.ai offers, and with Symflower's symbolic execution
  • Add repositories that should be evaluated using Go as language
  • Run tests of Go repositories and query their coverage as the first evaluation benchmark task
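
As a rough sketch of what such a common interface could look like (the `Provider` interface and `echoProvider` below are hypothetical names, not the repository's exact types), adding a new provider comes down to satisfying a small Go interface:

```go
package main

import "fmt"

// Provider is a hypothetical sketch of a common provider interface: it
// identifies itself, lists the models it serves, and answers a query with
// one of them.
type Provider interface {
	ID() string
	Models() []string
	Query(model string, prompt string) (response string, err error)
}

// echoProvider is a stand-in implementation showing how a new provider (or a
// language, via an analogous interface) plugs in simply by satisfying the
// interface.
type echoProvider struct{}

func (echoProvider) ID() string       { return "echo" }
func (echoProvider) Models() []string { return []string{"echo/echo"} }
func (echoProvider) Query(model string, prompt string) (string, error) {
	return "echo(" + model + "): " + prompt, nil
}

func main() {
	var provider Provider = echoProvider{}
	response, err := provider.Query("echo/echo", "Write tests for plain.go")
	if err != nil {
		panic(err)
	}
	fmt.Println(response)
}
```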

More to come. If you want to contribute, let us know.

v0.1.0

29 Mar 20:00
b0c59b4

This release includes all the basic components needed to move forward with creating an evaluation benchmark for LLMs and friends to compare and evolve the code quality of code generation. The only big exceptions are a well-documented README, an interface to a generic LLM API service, and tasks so that people who want to contribute can help. These will follow soon.