Improvements to agbench #5776

Draft: wants to merge 11 commits into base: main

ekzhu (Collaborator) commented Mar 1, 2025

  1. Add host network support in Docker and remove unused requirements from the argument check.
  2. Use Pandas to simplify the summary statistics calculations.
  3. Add running time to the summary statistics.

Using tabulation method defined in '/home/ekzhu/autogen/python/packages/agbench/benchmarks/HumanEval/Scripts/custom_tabulate.py'

    Task Id       Trial 0 Success      Trial 0 Time
--  ------------  -----------------  --------------
 0  HumanEval_0   True                            3
 1  HumanEval_1   False                          15
 2  HumanEval_2   True                            2
 3  HumanEval_3   True                           11
 4  HumanEval_4   True                            4
 5  HumanEval_5   True                            2
 6  HumanEval_6   False                          18
 7  HumanEval_7   True                            2
 8  HumanEval_8   True                            2
 9  HumanEval_9   True                           12
10  HumanEval_10  False                          11
11  HumanEval_11  True                            2
12  HumanEval_12  True                            3
13  HumanEval_13  True                            1
14  HumanEval_14  True                            4
15  HumanEval_15  True                            1
16  HumanEval_16  True                            2
17  HumanEval_17  False                          76
18  HumanEval_18  True                            4
19  HumanEval_19  True                            3
20  HumanEval_20  True                            5
21  HumanEval_21  True                            3
22  HumanEval_22  True                            1
23  HumanEval_23  True                            2
24  HumanEval_24                                nan

Summary Statistics

           Successes    Failures    Missing    Total    Average Success Rate    Average Time    Total Time
-------  -----------  ----------  ---------  -------  ----------------------  --------------  ------------
Trial 0           20           4          1       25                     0.8           7.875           189

CAUTION: 'autogenbench tabulate' is in early preview and is not thoroughly tested.
Please do not cite values from these calculations in academic work without first inspecting and verifying the results in the run logs yourself.

The output above is what the default tabulate output looks like now.

ekzhu requested a review from afourney on March 1, 2025 23:51

codecov bot commented Mar 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.60%. Comparing base (05b14f1) to head (fae7869).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5776      +/-   ##
==========================================
+ Coverage   70.16%   75.60%   +5.44%     
==========================================
  Files         262      189      -73     
  Lines       14712    12734    -1978     
  Branches      243        0     -243     
==========================================
- Hits        10322     9627     -695     
+ Misses       4200     3107    -1093     
+ Partials      190        0     -190     
Flag Coverage Δ
unittests 75.60% <ø> (+5.44%) ⬆️


ekzhu changed the title from "Add host network support in Docker and remove unused requirements from argument check." to "Improvements to agbench" on Mar 2, 2025
ekzhu marked this pull request as draft on March 4, 2025 17:48