
Test and Evaluation #6

@rothoma2


Hello,

I'm a security researcher looking into the topic of LLMs for offensive security. I came across your project and tested it against a few vulnerable applications. I wanted to share some issues I ran into and some early feedback. Some items may amount to feature requests and some may just be questions.

Test 1:

I tested against an instance of JuiceShop. Initially the run looked successful, but eventually it hung. I waited for 30+ minutes before finding out I had run out of Claude credits.

  • I had created an API key with a $10 limit for Claude. It burned through the credit, almost 2M tokens, before it was able to find any vulnerability (although I think the SQL injection tester was on the right track to find one).
  • After this I switched to Claude 3.7 Sonnet to reduce costs and got some progress on another application. Perhaps the documentation could change the default model or be more explicit about the cost implications.
  • I didn't get any error once the API key requests started being rejected; it would have been nice to get one.
  • Is there a way to log the LLM outputs or the proxy traffic to files in more detail, for review?
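As a stopgap until the tool offers structured logging, console output can be captured with standard shell tools. A minimal sketch (the `echo` line is a self-contained stand-in; the project's actual CLI invocation would go there):

```shell
# Capture the run's console output to a file while still seeing it live.
# Replace the `echo` line with the real tool invocation; echo stands in
# here only so the snippet runs on its own.
LOG="run.log"
echo "agent output: SQL injection tester started" 2>&1 | tee "$LOG"
```

Placing `2>&1` before the pipe sends stderr through `tee` as well, so error messages (like rejected API requests) would end up in the log too.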

No vulnerabilities were found within the $11 spent.

Test 2:

I tested against javulna (https://github.com/edu-secmachine/javulna) using the Sonnet model. I gave it some time and it made progress, but the API enumerator failed, the coordinator took over that role, and some of the agents, such as the SQL injection one, never got beyond thinking.

After around $2 it had not been able to find any vulnerability.

Test 3:

I went for a static analysis of the javulna application, again using Sonnet. This time I was not very hopeful, because it was using very low-level tools such as grep and cat rather than any specialized SAST tool, but it was able to find 8 vulnerabilities (the most across these applications). I was pleasantly surprised by this.
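For context on why this works at all: even grep alone can flag classic Java vulnerability patterns. A minimal illustration of the kind of low-level search described above (the file and regex below are hypothetical examples of mine, not the tool's actual logic):

```shell
# Create a tiny Java file containing a classic string-concatenated SQL
# query, then grep for the pattern -- the sort of low-level search an
# agent limited to grep/cat could perform.
mkdir -p demo/src
cat > demo/src/UserDao.java <<'EOF'
String sql = "SELECT * FROM users WHERE name = '" + name + "'";
EOF
# Flag lines where a SQL string literal is concatenated with a variable.
grep -rnE '"(SELECT|INSERT|UPDATE|DELETE)[^"]*"[[:space:]]*\+' demo/src
```

Crude pattern matching like this produces false positives, of course, which is where the LLM's judgment of each hit presumably adds value.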

Questions:

  • Are there any recommended vulnerable applications that you test against?
  • Is there a way to produce logs for further diagnostics and share them?

I'm happy to continue testing and providing feedback. I have recordings of these tests, so I can share them after a bit of editing on my side.
