Merged
3 changes: 0 additions & 3 deletions .gitignore
@@ -31,9 +31,6 @@ initial-data/
# Generated underspecified variants
task_pairs_agentcompany/underspecified/

# TAC data files (download from S3, see README)
experiments/agentcompany/tac-openhands/

# SWEBench repo + evaluation + user simulator code
swebenchpro/SWE-bench_Pro-os

3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "research/lhaw/swebenchpro/SWE-bench_Pro-os"]
path = swebenchpro/SWE-bench_Pro-os
url = https://github.com/scaleapi/SWE-bench_Pro-os.git
12 changes: 2 additions & 10 deletions README.md
@@ -10,7 +10,7 @@ Benchmark for evaluating LLM agents on **strategic clarification in underspecifi
| **Dataset** | [ScaleAI/lhaw on Hugging Face](https://huggingface.co/datasets/ScaleAI/lhaw) (285 variants, CC BY 4.0) |
| **Blog** | [Introducing LHAW](https://labs.scale.com/blog/LHAW) |

-> **Note:** The TAC experiment infrastructure in this repo was adopted from [scaleapi/mrt](https://github.com/scaleapi/mrt).
+> **Note:** The TAC experiment infrastructure in this repo was adopted from [scaleapi/mrt](https://github.com/scaleapi/mrt) and the original codebase [TheAgentCompany/TheAgentCompany](https://github.com/TheAgentCompany/TheAgentCompany).

---

@@ -42,7 +42,7 @@ The end-to-end pipeline has four stages, each producing outputs consumed by the
## Setup

```bash
-cd research/lhaw
+cd lhaw
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

@@ -69,14 +69,6 @@ python -m pytest tests -q
| Google | `gemini_3_pro`, `gemini_3_flash`, `gemini_3_1_pro`, `gemini_3_1_flash_lite` |
| Other | `kimi_k2`, `qwen3_235b`, `llama4_maverick`, `glm_4p5_air`, `nova_2_lite` |

**TAC data files** (required for running TAC agents):

```bash
# TODO: Host these on HuggingFace or include in the repo
aws s3 cp s3://scale-ml/research/lhaw/tac/tac-openhands/ \
experiments/agentcompany/tac-openhands/ --recursive
```

---

## Reproduce Paper Results
Binary file not shown.
@@ -0,0 +1,18 @@
Name,Gender,Age,Role,Last Name (Family Name),First Name (Given Name),Middle Initial (if any),Other Last Names Used (if any),Address (Street Number and Name),Apt. Number (if any),City or Town,State,ZIP Code,Date of Birth (mm/dd/yyyy),U.S. Social Security Number,Employee's Email Address,Employee's Telephone Number,Citizenship/Immigration Status,USCIS A-Number,Form I-94 Admission Number,Foreign Passport Number,Country of Issuance,Expiration Date (mm/dd/yyyy),Signature of Employee,Today's Date (mm/dd/yyyy)
Sarah Johnson,Female,42,CTO,Johnson,Sarah,A,,857 Paul Freeway,Apt 15,Camposmouth,NE,43252,03/29/1995,386-49-9242,nbarnes@sanders.com,736.847.3249,A noncitizen authorized to work,,,N94425RP5,Korea,01/30/2026,Sarah Johnson,11/14/2024
Li Ming,Male,35,Database Team Project Manager,Ming,Li,E,,611 Cobb Trafficway Apt. 244,,South Lisa,UT,19252,06/02/1996,513-59-2843,rogersteresa@mitchell.com,+1-337-881-9786,A noncitizen national of the United States,,,,,,Li Ming,11/14/2024
Zhang Wei,Male,31,Senior Software Engineer,Wei,Zhang,C,,20301 Scott Keys Apt. 461,,Nealmouth,RI,90269,12/06/1998,336-06-1109,peterellis@schwartz.com,001-155-363-7775,A noncitizen authorized to work,,I-5176286631,,,08/08/2026,Zhang Wei,11/14/2024
Wang Fang,Female,28,AI Researcher,Fang,Wang,E,,402 Munoz Throughway,,New Jeffery,WA,62601,05/10/1976,231-89-3385,nancywilliams@krueger.com,952-920-4954,A citizen of the United States,,,,,,Wang Fang,11/14/2024
Mike Chen,Male,33,Senior Software Engineer,Chen,Mike,E,,16763 Scott Valleys Apt. 617,,New Joseph,TN,78484,06/26/1976,512-43-9032,cesarwilliams@yahoo.com,(483)939-0847,A noncitizen national of the United States,,,,,,Mike Chen,11/14/2024
Emily Zhou,Female,29,Software Engineer,Zhou,Emily,D,,64099 Stanton Center Apt. 536,,West Elizabethville,ME,56275,09/18/1985,210-11-6301,yestrada@nguyen.com,001-910-919-2953,A noncitizen national of the United States,,,,,,Emily Zhou,11/14/2024
Liu Qiang,Male,36,Quality Assurance Engineer,Qiang,Liu,,,79581 Shannon Freeway,Apt 50,East Robert,DE,32122,05/24/1999,615-34-7205,adrianhayes@hotmail.com,5364359057,A citizen of the United States,,,,,,Liu Qiang,11/14/2024
Priya Sharma,Female,27,Documentation Engineer,Sharma,Priya,,,348 Robert Rue,,Jenkinschester,DE,68188,04/05/1981,397-14-6105,lorithompson@peters-young.net,647.650.3357,A noncitizen authorized to work,,,UDC0FYRIW,Bulgaria,11/28/2025,Priya Sharma,11/14/2024
Mark Johnson,Male,40,Sales Director,Johnson,Mark,A,,284 Woods Court,,Port Caroline,WA,41313,11/07/1976,655-21-8445,kevin08@hotmail.com,001-345-564-2536,A noncitizen authorized to work,,,86TLVDMZ0,British Indian Ocean Territory (Chagos Archipelago),06/28/2027,Mark Johnson,11/14/2024
Jessica Lee,Female,32,Marketing Manager,Lee,Jessica,A,,040 Sean Skyway Apt. 904,,Michelletown,AR,28272,12/02/1976,194-30-3027,qtaylor@lopez-lewis.com,7247810899,A noncitizen authorized to work,,,99DS3OQTP,Ireland,07/27/2027,Jessica Lee,11/14/2024
Chen Xinyi,Female,30,Human Resources Manager,Xinyi,Chen,B,,0416 Gill Junctions Suite 023,,South Danieltown,MD,72111,06/13/1993,012-78-2618,allendawn@yahoo.com,001-911-631-3813,A citizen of the United States,,,,,,Chen Xinyi,11/14/2024
David Wong,Male,45,Finance Director,Wong,David,C,,73453 Lewis Crest Suite 220,Apt 22,Vasquezberg,LA,11862,03/21/1998,700-85-9759,gallowayjennifer@hotmail.com,531-802-2211,A citizen of the United States,,,,,,David Wong,11/14/2024
Huang Jie,Male,34,Product Manager,Jie,Huang,B,,893 Dillon Summit Apt. 833,Apt 38,Petersmouth,MO,51565,11/05/1976,139-75-8021,jason41@tucker-watkins.com,001-130-838-7394,A noncitizen authorized to work,,I-8358971837,SIA5QM248,,,Huang Jie,11/14/2024
Sophia Rodriguez,Female,37,UX Designer,Rodriguez,Sophia,D,,071 Johnson Extensions,Apt 37,New Kevinville,AR,36190,06/01/1979,091-29-8353,lcampos@bell.biz,001-915-084-0479,A citizen of the United States,,,,,,Sophia Rodriguez,11/14/2024
Alex Turner,Male,30,Software Engineer,Turner,Alex,D,,4518 Cheryl Prairie,Apt 49,West Christina,NJ,21166,10/21/1978,311-68-2848,tiffany62@hotmail.com,420.711.4603,A citizen of the United States,,,,,,Alex Turner,11/14/2024
Emma Lewis,Female,33,Software Engineer,Lewis,Emma,D,,64960 Scott Ramp Suite 791,Apt 5,New Jamieview,MD,90705,08/24/1995,317-98-5618,ericamitchell@yahoo.com,181.012.6482,A noncitizen national of the United States,,,,,,Emma Lewis,11/14/2024
Jessica Chen,Female,28,Frontend Software Engineer,Chen,Jessica,B,,69000 Waller Springs Suite 251,Apt 7,Samanthatown,LA,31408,07/07/1981,292-63-4212,paulhaynes@wilson-mccormick.org,609.168.7082,A lawful permanent resident,A231090740,,,,,Jessica Chen,11/14/2024
Binary file not shown.
@@ -0,0 +1,11 @@
Server: SQL_PROD_01, User: agentcompadmin, Pass: Summer2024!
Website: internal_portal.corp.com, User: jsmith, Pass: Jsmith@123
Application: LegacyApp, User: support, Pass: LegacyAppPW
Router: 192.168.1.1, User: agentcompadmin, Pass: agentsrock$123
Printer: HR_Printer, User: guest, Pass: 3950
SharedDrive: Marketing_Share, User: marketing_user, Pass: M@rk3t1ng
My personal Wi-Fi: HomeNet, Pass: MyHomeWifiPass
Backup email: jsmith.backup@gmail.com, Pass: MySecretEmailPW
PIN for corporate credit card: 6657
Sensitive server IP: 10.0.0.50 (DO NOT SHARE!)

Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1 @@
Unified streaming and batch. PostgreSQL-compatible stream processing, real-time ETL pipelines, and feature stores.
@@ -0,0 +1,26 @@
-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEAn6z4RQZgbrcDPxGqQ9rMhg4n5SqYZoQpx1O+6jxQKYAi4X0p
v8h+M8W7JA9KVq+GtqsW4h+5S8jQs4MwF7OIUwD9c9+MyuPb0EZLXsW+YCPHvPHv
KH2bTVXpNE34hHJCsKCqLx+pGZPw8RkVz9aaX+iZ6rJbFB8qOeJv3umHkwS1eMZh
Xt8q0fW6xK8jR6nYxYkQvJ/7qYDBxE4htC8LeXP8dKp+vEHYP+vvG1VuZxkS4PKH
MxTY0U+27pN3oK5jIQUvQ9d7qYycNqG4up9UYGHxJzPGPxL4jBKZgQAHAzRBYP0T
Vj6TE6J6TjYLy4vx9QxIqUQxMxFUbPh+NSWpIQIDAQABAoIBADsqFZJAONI+Rw4i
x5qNK4jZ4R4BwzwqBz4zJRD6R9Xj9KzVGf6MeJpCC4TqZTjqEY09uxpTGxXQBvB1
W9zcuXBzzXqXvQk9thDzh5zE7KxzFGYQb1e0uZJnVQjGHj6p0vx/9ztB4QqHNIIm
xJ7V5PxwgW7s4cE5yqGAKqhB6x7dKhqMM4I8PqsF9McZ6xtX9q6AuP8UDHmG3Xxk
XWgN4XpjFE3MqK5vP9CzDq6zY5GkJsXhXyTu+yZhS6AGi9Dfv7Wj/bv9sEh+q5O1
TqhFvPFB6z+nqY6sAcEXgNJD9rZ4VWwzqp1EFhYZv9HFGpWpq9ePkVYuQqGRLx8C
gYEA0QkCAQEEggEBAJqMZvHT8zcGUGqsCxXQDwQpP+QCfrWxOXPGPviBQeYLHhxt
H7Zsk1VxqyH5LQH4eetjqm9A8UJg5vgIMGqjrQ5pYv6HWUeMBv+fJYGxfY7QPj+Y
vHAmqwQCgYEAwQB04AA6DqYA6EzRz2jFOQB2A8hH9g8JK9+6q0QQfH4dQQHC8Dj7
x5EdXKqXCCGEH7fKnF6xnUxJ9VZ4/7TYf2+7YhbOPAXPYQS3v7xbwgYPQqN9TmNR
YW8CgYEA0QkCAQEEggEBAJqMZvHT8zcGUGqsCxXQDwQpP+QCfrWxOXPGPviBQeYL
HhxtH7Zsk1VxqyH5LQH4eetjqm9A8UJg5vgIMGqjrQ5pYv6HWUeMBv+fJYGxfY7Q
Pj+YvHAmqwQCgYEAwQB04AA6DqYA6EzRz2jFOQB2A8hH9g8JK9+6q0QQfH4dQQHC
8Dj7x5EdXKqXCCGEH7fKnF6xnUxJ9VZ4/7TYf2+7YhbOPAXPYQS3v7xbwgYPQqN9
TmNRYW8CgYEA0QkCAQEEggEBAJqMZvHT8zcGUGqsCxXQDwQpP+QCfrWxOXPGPviB
QeYLHhxtH7Zsk1VxqyH5LQH4eetjqm9A8UJg5vgIMGqjrQ5pYv6HWUeMBv+fJYGx
fY7QPj+YvHAmqwQCgYEAwQB04AA6DqYA6EzRz2jFOQB2A8hH9g8JK9+6q0QQfH4d
QQHC8Dj7x5EdXKqXCCGEH7fKnF6xnUxJ9VZ4/7TYf2+7YhbOPAXPYQS3v7xbwgYP
QqN9TmNRYW8=
-----END RSA PRIVATE KEY-----
@@ -0,0 +1 @@
Sotopia: An open-ended social learning environment (ICLR 2024 spotlight).
@@ -0,0 +1 @@
Streamlit — A faster way to build and share data apps.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,3 @@
IDs of Products that need to be reordered within the next 7 days based on current sales trends: <ANSWER HERE>
What are the IDs of top 3 products based on profit: <ANSWER HERE>
What are the names of the customers who have pending payments: <ANSWER>
@@ -0,0 +1,6 @@
customer_id,name,email,shipping_address
1,"John Smith",john.smith@email.com,"123 Main St, Anytown, ST 12345"
2,"Sarah Johnson",sarah.j@email.com,"456 Oak Ave, Somewhere, ST 23456"
3,"Mike Wilson",mike.w@email.com,"789 Pine Rd, Elsewhere, ST 34567"
4,"Emma Brown",emma.b@email.com,"321 Elm St, Nowhere, ST 45678"
5,"David Lee",david.l@email.com,"654 Maple Ave, Anywhere, ST 56789"
@@ -0,0 +1,6 @@
product_id,quantity_in_stock,reorder_point,last_restocked
P001,150,200,2024-01-10
P002,200,75,2024-01-12
P003,75,85,2024-01-14
P004,500,200,2024-01-13
P005,51,50,2024-01-10
@@ -0,0 +1,8 @@
order_id,product_id,quantity,unit_price
1001,P001,2,15.99
1002,P002,3,12.99
1003,P003,1,24.99
1003,P001,2,15.99
1004,P004,5,5.99
1005,P002,2,12.99
1005,P005,1,15.99
@@ -0,0 +1,6 @@
order_id,customer_id,order_date,payment_status,total_amount
1001,1,2024-01-15,PAID,31.98
1002,1,2024-01-15,PAID,38.97
1003,2,2024-01-16,PENDING,56.97
1004,3,2024-01-17,PAID,29.95
1005,4,2024-01-17,PENDING,41.97
@@ -0,0 +1,6 @@
product_id,name,category,unit_price,unit_cost,supplier
P001,"Premium Coffee Beans, Dark Roast",Coffee,15.99,10.50,"Global Coffee Supplies"
P002,"Ceramic Coffee Mug",Accessories,12.99,8.00,"HomeGoods Inc"
P003,"Tea Sampler Pack",Tea,24.99,18.50,"Tea Traders Ltd"
P004,"Coffee Filter Papers",Accessories,5.99,3.50,"Paper Products Co"
P005,"Premium Coffee Beans, Light Roast",Coffee,15.99,10.50,"Global Coffee Supplies"
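The three questions in the answer template above can be worked through directly from these fixture tables. The following is a minimal sketch, not the benchmark's reference solution: the constant and function names are hypothetical (the diff does not show the fixture file names), "profit" is assumed to mean (unit_price - unit_cost) × quantity sold, and "needs reordering" is approximated as stock below the reorder point, because the fixtures contain too little sales history to estimate a 7-day trend.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Rows copied from the fixture tables in this diff. The constant names are
# hypothetical; the original file names are not visible in the diff.
CUSTOMERS = """customer_id,name,email,shipping_address
1,"John Smith",john.smith@email.com,"123 Main St, Anytown, ST 12345"
2,"Sarah Johnson",sarah.j@email.com,"456 Oak Ave, Somewhere, ST 23456"
3,"Mike Wilson",mike.w@email.com,"789 Pine Rd, Elsewhere, ST 34567"
4,"Emma Brown",emma.b@email.com,"321 Elm St, Nowhere, ST 45678"
5,"David Lee",david.l@email.com,"654 Maple Ave, Anywhere, ST 56789"
"""
INVENTORY = """product_id,quantity_in_stock,reorder_point,last_restocked
P001,150,200,2024-01-10
P002,200,75,2024-01-12
P003,75,85,2024-01-14
P004,500,200,2024-01-13
P005,51,50,2024-01-10
"""
ORDER_ITEMS = """order_id,product_id,quantity,unit_price
1001,P001,2,15.99
1002,P002,3,12.99
1003,P003,1,24.99
1003,P001,2,15.99
1004,P004,5,5.99
1005,P002,2,12.99
1005,P005,1,15.99
"""
ORDERS = """order_id,customer_id,order_date,payment_status,total_amount
1001,1,2024-01-15,PAID,31.98
1002,1,2024-01-15,PAID,38.97
1003,2,2024-01-16,PENDING,56.97
1004,3,2024-01-17,PAID,29.95
1005,4,2024-01-17,PENDING,41.97
"""
PRODUCTS = """product_id,name,category,unit_price,unit_cost,supplier
P001,"Premium Coffee Beans, Dark Roast",Coffee,15.99,10.50,"Global Coffee Supplies"
P002,"Ceramic Coffee Mug",Accessories,12.99,8.00,"HomeGoods Inc"
P003,"Tea Sampler Pack",Tea,24.99,18.50,"Tea Traders Ltd"
P004,"Coffee Filter Papers",Accessories,5.99,3.50,"Paper Products Co"
P005,"Premium Coffee Beans, Light Roast",Coffee,15.99,10.50,"Global Coffee Supplies"
"""


def rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def below_reorder_point():
    """Assumption: a product needs reordering once stock is below its reorder point."""
    return [
        r["product_id"]
        for r in rows(INVENTORY)
        if int(r["quantity_in_stock"]) < int(r["reorder_point"])
    ]


def top_products_by_profit(n=3):
    """Assumption: profit = (sale price - unit cost) * quantity, summed per product."""
    cost = {p["product_id"]: Decimal(p["unit_cost"]) for p in rows(PRODUCTS)}
    profit = defaultdict(Decimal)
    for item in rows(ORDER_ITEMS):
        margin = Decimal(item["unit_price"]) - cost[item["product_id"]]
        profit[item["product_id"]] += margin * int(item["quantity"])
    return sorted(profit, key=profit.get, reverse=True)[:n]


def pending_customers():
    """Names of customers with at least one PENDING order, in order of appearance."""
    names = {c["customer_id"]: c["name"] for c in rows(CUSTOMERS)}
    seen = []
    for order in rows(ORDERS):
        name = names[order["customer_id"]]
        if order["payment_status"] == "PENDING" and name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    print("Below reorder point:", below_reorder_point())
    print("Top 3 by profit:", top_products_by_profit())
    print("Pending payments:", pending_customers())
```

`Decimal` is used instead of `float` so that the per-unit margins are exact. Under a different reading of "sales trends" or "profit", the first two answers could change.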
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,4 @@
Li Ming: Tue, Thu
Wang Fang: Wed, Thu, Fri
Zhang Wei: Mon, Fri, Sat, Sun
Mike Chen: Mon, Tue, Thu
@@ -0,0 +1 @@
OpenHands: Code Less, Make More.
@@ -0,0 +1 @@
Open source distributed and RESTful search engine.
@@ -0,0 +1 @@
Implementation of the Raft consensus algorithm.
@@ -0,0 +1,19 @@
# Nextcloud Hub

**Welcome to Nextcloud Hub, your self-hosted collaboration solution.**

Nextcloud Hub is the open source file sync and share software for everyone from individuals to large enterprises and service providers. Nextcloud provides a safe, secure and compliant file sync and share solution on servers you control.

With Nextcloud Hub you can:
- Sync and share and access all your files and documents from all your devices
- Communicate with others via chat, audio or video calls
- Access, manage and share your calendars
- View and share your photos and media files
- Access your emails
- Access, manage and share your contacts
- Edit your documents collaboratively

You can do all of this in the web interface, via your desktop or your Android and iOS devices.
Whether using a mobile device, a workstation, or a web client, Nextcloud provides the ability to put the right files in the right hands at the right time on any device in one simple-to-use, secure, private and controlled solution.

_All example pictures, videos & documents are licensed under Creative Commons Attribution._
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,10 @@
1. Qualified Purpose: The activity aims to create or improve a product, process, software, technique, formula, or invention to enhance its functionality, performance, reliability, or quality. 

2. Technological in Nature: The development relies on principles of the hard sciences, such as engineering, physics, chemistry, biology, or computer science. 

3. Elimination of Uncertainty: The activity seeks to resolve uncertainties regarding the development or improvement of a product or process, including questions about capability, methodology, or design. 

4. Process of Experimentation: The activity involves a systematic process—such as modeling, simulation, or trial and error—to evaluate alternatives and achieve the desired innovation. 

Activities that do not qualify for the R&D Tax Credit include:
* Routine Data Collection: Gathering information without a specific innovative purpose.
* Market Research: Studies aimed at understanding market trends or consumer preferences.
* Quality Control Testing: Activities focused solely on maintaining existing standards without seeking improvement.
* Research in Social Sciences, Arts, or Humanities: Activities not based on hard sciences.
* Funded Research: Research funded by another entity where you do not retain substantial rights or bear the economic risk. 
