Bring in the current blog post and its information, especially the blog post image, to showcase the evaluation
README extension: The nice thing about generating tests is that it is easy to automatically check whether the result is correct: the tests need to compile and provide 100% coverage. But one can only write such tests if one understands the source, so implicitly we are evaluating the language understanding of the LLM.
Log Maven commands because they can be faulty (remember that the "surefire" plugin needs a fixed version because GitHub's is too old). With that it is easier to debug (symflower v36847)
Think about excluding the "perplexity" models because they have a "per request" cost, and they are the only ones that do that.
Snowflake against Databricks would be a nice comparison since they align company-wise and are new
Include more models (We have the main problem that there are multiple models coming out every day. We should not wait for a "new version" of the eval; we should test these models right away and compare them. Big problem: how do we promote findings?)
Figure out the "perfect" coverage score so we can display percentage of coverage reached
Make coverage metric fair
Save the descriptions of the models as well: https://openrouter.ai/api/v1/models The reason is that these can change over time, and we need to know after a while what they were, e.g. right now I would like to know if mistral-7b-instruct for the last evaluation was v0.1 or not
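A minimal sketch of how such a snapshot could be archived per evaluation run, assuming the public models endpoint needs no authentication and that a local `model-descriptions` directory is an acceptable place for the snapshots:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Fetch the current model list with descriptions and metadata.
	response, err := http.Get("https://openrouter.ai/api/v1/models")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()

	data, err := io.ReadAll(response.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Store one snapshot per evaluation run so we can later check which
	// model version (e.g. mistral-7b-instruct v0.1) was actually evaluated.
	if err := os.MkdirAll("model-descriptions", 0o755); err != nil {
		log.Fatal(err)
	}
	file := filepath.Join("model-descriptions", time.Now().UTC().Format("2006-01-02T15-04-05")+".json")
	if err := os.WriteFile(file, data, 0o644); err != nil {
		log.Fatal(err)
	}
	log.Println("saved model descriptions to", file)
}
```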
Bar charts should have their value on the bar. The axis values do not work that well.
Pick an example or several examples per category: the goal is to find interesting results automatically, because it will get harder and harder to go through results manually.
Charts to showcase data
Total-scores vs costs scatter plot. Result is an upper-left-corner sweet spot: cheap and good results.
Pie chart of the whole evaluation's costs: for each LLM, show how much it costs. Result is to see which LLMs cost the most to run the eval.
Reporting and documentation on writing deep-dives
What are results that align with expectations? What are results against expectations? E.g. are there small LLMs that are better than big ones?
Are there big LLMs that totally fail?
Are there small LLMs that are surprisingly good?
What about LLMs where the community doesn't know that much yet: e.g. Snowflake, DBRX, ...
Order models by open-weight, allows commercial use, closed, and price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache 2.0, so commercial use is allowed. It should be better rated than GPT-4.
Distinguish between latency (time-to-first-token) and throughput (tokens generated per second)
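A minimal sketch of how the two metrics differ, assuming we can record a timestamp for every streamed token (the `tokenTimes` input is hypothetical, not an existing API of the eval):

```go
package metrics

import "time"

// LatencyAndThroughput computes the time-to-first-token and the generation
// throughput in tokens per second from the time the request was sent and the
// timestamps of all received tokens.
func LatencyAndThroughput(requestSent time.Time, tokenTimes []time.Time) (latency time.Duration, tokensPerSecond float64) {
	if len(tokenTimes) == 0 {
		return 0, 0
	}

	latency = tokenTimes[0].Sub(requestSent)

	if generationTime := tokenTimes[len(tokenTimes)-1].Sub(tokenTimes[0]); generationTime > 0 {
		tokensPerSecond = float64(len(tokenTimes)-1) / generationTime.Seconds()
	}

	return latency, tokensPerSecond
}
```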
Documentation
Clean up and extend README
Better examples for contributions
Overhaul explanation of "why" we need evaluation, i.e. why is it good to evaluate for an empty function that does nothing.
Write down a playbook for evaluations, e.g. one thing that should happen is that we let the benchmark play 5 times and then sum up points, but ... the runs should have at least one hour break in between to not run into cached responses.
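A minimal sketch of that playbook step, assuming the evaluation binary is called `eval-dev-quality` with an `evaluate` subcommand (illustrative names; summing up the points would happen over the produced result files afterwards):

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	const runs = 5
	for i := 1; i <= runs; i++ {
		cmd := exec.Command("eval-dev-quality", "evaluate")
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			log.Printf("run %d failed: %v", i, err)
		}

		if i < runs {
			// Wait at least one hour between runs to avoid cached responses.
			time.Sleep(time.Hour)
		}
	}
}
```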
Write Tutorial for using Ollama
YouTube video for using Ollama
Tooling & Installation
Rescore existing models / evals with fixes, e.g. when we build a better code repair tool, the LLM answer did not change, so we should rescore right away with the new version of the tool over a whole result of an eval.
Automatic tool installation with fixed version
Go
Java
Ensure that non-critical CLI input validation (such as unavailable models) does not panic
Take a look at current leaderboards and evals to know what could be interesting. Current popular code leaderboards are [LiveCodeBench](https://huggingface.co/spaces/livecodebench/leaderboard), the [BigCode models leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), [CyberSecEval](https://huggingface.co/spaces/facebook/CyberSecEval) and [CanAICode](https://huggingface.co/spaces/mike-ravkine/can-ai-code-results)
Let the Java test case for "No test files" actually identify and report an error that there are no test files (needs to be implemented in symflower test)
LLM
Log request and response in their own files, so both can be used 1:1 (character for character) directly for debugging them
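A minimal sketch, assuming the provider's HTTP client can be wired up with a custom `RoundTripper` (the file naming is illustrative):

```go
package logging

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"os"
	"time"
)

// dumpTransport writes every request and its response verbatim into separate
// files so they can be replayed character for character when debugging.
type dumpTransport struct {
	next http.RoundTripper
}

func (t dumpTransport) RoundTrip(request *http.Request) (*http.Response, error) {
	id := time.Now().UnixNano()

	if dump, err := httputil.DumpRequestOut(request, true); err == nil {
		os.WriteFile(fmt.Sprintf("request-%d.log", id), dump, 0o644)
	}

	response, err := t.next.RoundTrip(request)
	if err != nil {
		return nil, err
	}

	if dump, err := httputil.DumpResponse(response, true); err == nil {
		os.WriteFile(fmt.Sprintf("response-%d.log", id), dump, 0o644)
	}

	return response, nil
}
```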
Improve LLM prompt
Add an app-name to the requests so people know we are the eval. https://openrouter.ai/docs#quick-start shows that other openapi packages implement custom headers, but the one Go package we are using does not implement that. So do a PR to contribute.
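Until that contribution lands, a minimal sketch of injecting the attribution headers ourselves via a custom transport; the header names follow the OpenRouter quick-start documentation, and the referer/title values are illustrative:

```go
package openrouter

import "net/http"

// appNameTransport adds the OpenRouter attribution headers to every request.
type appNameTransport struct {
	next http.RoundTripper
}

func (t appNameTransport) RoundTrip(request *http.Request) (*http.Response, error) {
	request = request.Clone(request.Context())
	request.Header.Set("HTTP-Referer", "https://github.com/symflower/eval-dev-quality")
	request.Header.Set("X-Title", "DevQualityEval")

	return t.next.RoundTrip(request)
}

// Client returns an HTTP client that attributes all requests to the eval.
func Client() *http.Client {
	return &http.Client{
		Transport: appNameTransport{next: http.DefaultTransport},
	}
}
```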
Prepare language and evaluation logic for multiple files:
Use symflower symbols to receive files
Sandboxed execution (Sandbox execution #17), e.g. with Docker as its first implementation
Timeout for test execution (we've seen tests that take > 15 minutes to execute in some benchmarks)
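A minimal sketch, assuming tests are executed via an external command (using `symflower test` as an illustration) in the repository directory:

```go
package execute

import (
	"context"
	"os/exec"
	"time"
)

// TestWithTimeout runs the test command and kills it when the timeout is
// exceeded, so a single hanging test run cannot stall the whole benchmark.
func TestWithTimeout(repositoryPath string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "symflower", "test")
	cmd.Dir = repositoryPath

	return cmd.CombinedOutput()
}
```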
Do an evaluation with different temperatures
Failing tests should receive a score penalty
Evaluation tasks
Introduce the interface for doing "evaluation tasks" so we can easily add them
Add evaluation task for "querying the relative test file path of a relative implementation file path" e.g. "What is the test relative file path for some/implementation/file.go" ... it is "some/implementation/file_test.go" for most cases.
Add evaluation task for transpilation Go->Java and Java->Go
Scoring, Categorization, Bar Charts split by language.
Check determinism of models, e.g. execute each plain repository X times, and then check if the results are stable.
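A minimal sketch of such a stability check, assuming a hypothetical `queryModel` callback that returns the raw model response for a plain repository:

```go
package determinism

import "crypto/sha256"

// IsStable queries the model "runs" times with an identical prompt and
// reports whether all responses are byte-for-byte identical.
func IsStable(queryModel func() (string, error), runs int) (bool, error) {
	seen := map[[32]byte]bool{}
	for i := 0; i < runs; i++ {
		response, err := queryModel()
		if err != nil {
			return false, err
		}
		seen[sha256.Sum256([]byte(response))] = true
	}

	return len(seen) == 1, nil
}
```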
Code repair
Own task category
0-shot, 1-shot, ...
With LLM repair
With tool repair
Do test file paths through symflower symbols
Task for models
Query REAL costs of all the testing of a model: the reason this is interesting is that some models have HUGE outputs, and since more output means more costs, this should be addressed in the score.
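A minimal sketch of how such a sum could be computed from the token usage reported per request, assuming the per-token prices in USD come from the saved model descriptions (see the models endpoint above):

```go
package cost

// Usage is the token usage reported for a single request.
type Usage struct {
	PromptTokens     int
	CompletionTokens int
}

// Total sums the cost in USD of all requests made while testing one model, so
// that models with huge outputs are charged accordingly in the score.
func Total(usages []Usage, promptPriceUSD float64, completionPriceUSD float64) (totalUSD float64) {
	for _, usage := range usages {
		totalUSD += float64(usage.PromptTokens)*promptPriceUSD + float64(usage.CompletionTokens)*completionPriceUSD
	}

	return totalUSD
}
```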
Move towards generated cases so models cannot integrate fixed cases to always have 100% score
Think about adding more training data generation features: this will also help with dynamic cases
Heard that Snowflake Arctic is very open with how they gathered training data... so we can see what LLM creators think and want of training data
Think about a commercial effort for the eval, so that we can balance some of the costs that go into maintaining this eval
The v0.5.0 is mainly meant for introducing more variety. There are three main goals.
Tasks:
symflower test with a deeper execution coverage export (requires at least symflower v36800, see Require at least symflower v36800 #144)
TODO: sort and sort out
Model and Provider to be in the same package (Preload Ollama models before inference and unload afterwards #121 (comment))
https://www.reddit.com/r/LocalLLaMA/comments/1cihrdt/comment/l2d4im0/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button