
Test and Evaluation #6

@rothoma2


Hello,

I'm a security researcher looking into the topic of LLMs for offensive security. I came across your project and tested it against a few vulnerable applications. I wanted to share some issues I ran into and some early feedback. Some items may amount to feature requests and some may just be questions.

Test 1:

I tested against an instance of JuiceShop. Initially the run looked successful, but eventually it hung. I waited for 30+ minutes before finding out I had run out of Claude credits.

  • I had created an API key with a $10 limit for Claude. It burned through the credit, almost 2M tokens, before it was able to find any vulnerability (although I think the SQL injection tester was on the right track to find one).
  • After this I switched to Claude 3.7 Sonnet to reduce costs and got some progress on another application. Perhaps the documentation could change the default model or be more explicit about the cost implications.
  • I didn't get any error once the API key requests started being rejected; it would have been nice to get one.
  • Is there a way to log the LLM outputs or the proxy traffic to files in more detail, for review?
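As a stopgap until the tool offers structured logging, console output can be captured with standard shell tools. A minimal sketch (the `echo` line is a self-contained stand-in; the project's actual CLI invocation would go there):

```shell
# Capture the run's console output to a file while still seeing it live.
# Replace the `echo` line with the real tool invocation; echo stands in
# here only so the snippet runs on its own.
LOG="run.log"
echo "agent output: SQL injection tester started" 2>&1 | tee "$LOG"
```

Placing `2>&1` before the pipe sends stderr through `tee` as well, so error messages (like rejected API requests) would end up in the log too.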

No vulnerabilities were found within the $11 spent.

Test 2:

I tested against javulna (https://github.com/edu-secmachine/javulna) using the Sonnet model. I gave it some time and it made progress, but the API enumerator failed, the coordinator took over that role, and some of the agents, such as the SQL injection one, never got beyond thinking.

After around $2 it had not been able to find any vulnerability.

Test 3:

I went for a static analysis of the javulna application, again using Sonnet. This time I was not very hopeful, because it was using very low-level tools such as grep and cat rather than any specialized SAST tool, but it was able to find 8 vulnerabilities (the most across these applications). I was pleasantly surprised by this.
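For context on why this works at all: even grep alone can flag classic Java vulnerability patterns. A minimal illustration of the kind of low-level search described above (the file and regex below are hypothetical examples of mine, not the tool's actual logic):

```shell
# Create a tiny Java file containing a classic string-concatenated SQL
# query, then grep for the pattern -- the sort of low-level search an
# agent limited to grep/cat could perform.
mkdir -p demo/src
cat > demo/src/UserDao.java <<'EOF'
String sql = "SELECT * FROM users WHERE name = '" + name + "'";
EOF
# Flag lines where a SQL string literal is concatenated with a variable.
grep -rnE '"(SELECT|INSERT|UPDATE|DELETE)[^"]*"[[:space:]]*\+' demo/src
```

Crude pattern matching like this produces false positives, of course, which is where the LLM's judgment of each hit presumably adds value.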

Questions:

  • Are there any recommended vulnerable applications that you test against?
  • Is there a way to produce logs for further diagnostics and share them?

I'm happy to continue testing and providing feedback. I have recordings of these tests, so I can share them after a bit of editing on my side.
