Skip to content

AI Gauntlet

Samuele Giampieri edited this page Jun 22, 2026 · 2 revisions

AI Gauntlet

An LLM doesn't need a CVE to be exploited, just the right words. AI Gauntlet finds the words that work.

Every chatbot, copilot, and LLM-powered API you expose is a new kind of front door: one that speaks plain language, makes its own decisions, and can be talked into things its builders never intended. The right message can make an assistant ignore its safety rules, reveal the secret system prompt it was given, hand back data it was meant to protect, or quietly obey instructions hidden inside a document it was only asked to summarize. Traditional scanners walk straight past all of this - there is no open port to flag and no CVE to match, just a model that can be persuaded.

AI Gauntlet is RedAmon's answer to that blind spot. It takes the AI and LLM endpoints your recon already uncovered and runs them through a battery of real adversarial attacks - prompt injection, jailbreaks, data-leak attempts, toxic-content coaxing, and more - powered by four industry red-team tools (garak, PyRIT, Giskard, and promptfoo). Then it tells you, in plain numbers, which attacks worked, how often, and how serious the result is, and hands you the exact conversation as proof.

It is the offensive follow-up to Adversarial AI Recon: recon finds the AI surface, the Gauntlet breaks it. You pick the endpoints, choose a tool, confirm you are authorized to test the target, and launch. Minutes later every weakness is a finding you can read, sort, export, and drop into a report - mapped to the OWASP LLM Top 10 so it lines up with the rest of your assessment.

And it all runs on your own machine: the model that grades the answers is local, so your prompts, the responses, and the proof never leave your environment. You can probe a sensitive target hard without leaking a single token to a third-party API.

Why it matters

  • This is where AI apps actually break. Prompt injection and jailbreaks sit at the top of the OWASP LLM Top 10, and they never surface in a network or web scan.
  • "It refused once" is not a test. Real attackers rephrase, encode, and escalate across many turns - so the Gauntlet does too, and measures how often the model eventually gives in.
  • Findings you can stand behind. Every result is a success rate over real attempts plus a saved transcript - not a hunch - so you can prove it, prioritize it, and re-test it after a fix.

On this page


What it does, in plain terms

An LLM endpoint (a chatbot, an assistant API, a model behind /v1/chat/completions) can be tricked into doing things it should not: ignoring its rules, leaking its hidden instructions, producing harmful content, or following instructions hidden in the text it reads. The AI Gauntlet sends those tricky prompts for you, checks whether the model fell for them, and scores how often it did.

You get, per attack, an Attack Success Rate (ASR) - the percentage of attempts that worked - plus a severity, the OWASP-LLM category, and a link to the exact conversation that proves it. The higher the ASR, the more reliably the target can be abused.

You stay in control: it is off by default, it only attacks the endpoints you select, and a launch is blocked until you tick the box confirming you are authorized to test the target.


Before you start

  1. Run AI Surface Recon first. The target list is filled from the LLM endpoints that Adversarial AI Recon discovered. No endpoints yet? You can still type a custom URL to attack.
  2. Turn the feature on. In Project Settings, enable the AI Attack Surface profile (it is off by default because it sends real attack prompts). The four tools turn on with it.
  3. Nothing external to configure. A local model grades the answers automatically - you do not need any API key or cloud account.
  4. Confirm authorization. Each launch needs the Rules-of-Engagement checkbox. See Rules of Engagement.

Opening the AI Gauntlet

Open it from the left menu (AI Gauntlet) or the AI Gauntlet tab in the Red Zone. The page has four parts: a filter bar at the top, a grid of tool cards, the configure-and-launch panel that opens when you pick a tool, and a findings table at the bottom.

AI Gauntlet page overview

Across the whole page you will see colored attack chips. Each is a type of AI weakness; this is what they mean:

Chip What it means
Prompt Injection Make the model follow the attacker's instructions instead of yours
Jailbreak Bypass the model's safety rules (DAN, crescendo, skeleton key)
System Prompt Leak Get the model to reveal its hidden instructions
Data Disclosure Leak secrets, training data, or personal information
Encoding Bypass Smuggle a banned request past filters using base64 / rot13 / unicode
Toxicity / Harmful Force toxic or harmful output
Bias / Stereotypes Surface discriminatory behavior
Hallucination Confidently state false things / agree with false claims
Harmful Generation Coax out malware, exploits, or other dangerous artifacts
Insecure Output Output that breaks whatever renders it (ANSI / XSS / markdown exfil)
Supply Chain Invent fake package names (dependency-confusion risk)

Use the filter bar to narrow the tools: click one or more chips and only tools that cover all of them stay shown. Click Clear to reset.

Attack chip filter bar


The four tools - which one to use

Each tool is a card showing its name, style, purpose, and chips. Click Configure & launch to open it. Only one scan runs at a time; while one runs, the other cards show Locked.

The four tool cards

All four attack the same endpoints, but in very different ways. They complement each other, so a good workflow is garak first for breadth, then PyRIT, Giskard and promptfoo to go deeper and cross-check what garak found.

garak - the broad sweep

An open-source LLM vulnerability scanner, the rough equivalent of Nmap or Nuclei but for language models. It fires hundreds of ready-made adversarial prompts, grouped into 40 probe families, one shot each, and scores how often each family gets the model to misbehave. Coverage is its strength: in a single run it touches prompt injection, jailbreaks, encoding tricks, data leakage, system-prompt extraction, toxicity, insecure output and more.

Reach for it as your first pass on any new target to see where it is weak, then dig into those areas with the other three. Start with the four default families and add more by attack-chip once you spot a soft spot. It is single-shot, so it is the fastest way to get a wide picture.

PyRIT - the persistent attacker

Microsoft's open red-team framework. Instead of one prompt, it holds a conversation with the target and escalates turn by turn, with one local model playing the attacker and another grading each reply. This is how real adversaries work: a model that flatly refuses an obvious request will often give in once the ask is built up gradually (crescendo), reframed, or buried in a wall of fake-compliant examples (many-shot).

Reach for it when a target looks solid against single-shot attacks but you suspect it can be worn down over a chat, or when you have a specific goal in mind (type it into Custom objective and PyRIT works toward exactly that). Raise Max turns for a more determined attacker, at the cost of a longer run.

Giskard - the tailored safety scan

An open-source LLM testing library that writes its own test cases. You describe what your app does in Target purpose, and Giskard generates adversarial tests aimed at that domain, then runs a safety and quality scan across detectors like prompt injection, information disclosure, hallucination, bias and sycophancy. Because the tests are built from your description, it surfaces weaknesses specific to your application rather than generic payloads, for example a banking assistant leaking its account logic or a support bot agreeing to a refund that was never authorized.

Reach for it whenever you can describe the app's job; the more precise the purpose, the sharper the findings. Note that Giskard reports whether an issue exists (not how often it triggers), so its results read as "found / not found" with a severity rather than a percentage.

promptfoo - the second opinion

A popular open-source LLM eval and red-team runner. It runs curated, well-known public attack datasets (jailbreak and harmful-content corpora) against the target and grades each response locally, and you can layer encoding strategies (base64, rot13, leetspeak and more) on top to test whether the model quietly decodes a banned request and then complies.

Reach for it to corroborate. Because it draws on independent, community-built datasets, a weakness confirmed by promptfoo and one of the other tools is far stronger evidence than any single tool on its own, exactly the kind of cross-tool agreement the report highlights.

garak PyRIT Giskard promptfoo
Style broad single-shot multi-turn conversation tailored safety scan dataset eval
Best for wide coverage, first pass resilience over a chat app-specific safety confirming a finding
You pick probe families attack strategies detectors plugins (+ strategies)
Result ASR per family ASR per attack issue found per detector ASR per plugin

What each probe tests

This is the full list of what you can select per tool, in plain language, so you can choose what to run. The Chip column ties each one to a weakness type above.

garak probes (40)

Picking a family runs all of its variants. The four defaults (checked when you open the card) are a good first pass. Five families need a target type we cannot attack over plain HTTP chat (multimodal, model internals, an agent, file output), so they are not shown - they would only ever return nothing.

Defaults (checked on open):

  • promptinject (Prompt Injection) - Slips attacker instructions into the input to hijack the model away from its original task. It checks whether the model will abandon what it was told to do and instead follow the injected command, for example printing an attacker-chosen string. A hit means anyone who can get text in front of your model can redirect its behavior.
  • dan (Jailbreak) - Fires "Do Anything Now" and developer-mode prompts that try to talk the model out of its safety persona so it answers with no restrictions. garak ships many DAN variants, so this family is a broad test of whether common, copy-paste jailbreaks still work. If any variant lands, your guardrails can be bypassed by prompts that are already circulating freely online.
  • encoding (Encoding Bypass) - Hides a banned request inside an encoding such as base64, ROT13, hex, or unusual Unicode, then asks the model to decode and act on it. The aim is to slip past keyword and safety filters that only inspect the literal text. A hit shows the model will happily decode obfuscated payloads and then comply, defeating naive input filtering.
  • leakreplay (Data Disclosure) - Feeds the opening of known copyrighted text (news articles, books) and checks whether the model completes it word for word. Verbatim completion is strong evidence the passage was memorized from training data. It matters both for copyright exposure and as proof the model can regurgitate sensitive material it was trained on.

Prompt / indirect injection:

  • latentinjection (Prompt Injection) - Hides injection payloads inside content the model is asked to process, such as a document, email, web page, or translation, rather than in the user's direct prompt. This is the "indirect" prompt injection behind real-world attacks on RAG and agent pipelines, where the malicious instruction rides in on third-party data. A hit means your model obeys instructions buried in the data it ingests, not just the user it is talking to.
  • goodside (Prompt Injection) - Runs Riley Goodside's well-known injection tricks (instruction smuggling via tags, ThreatenJSON, "who is Riley", glitch-token strings, and more). These are classic, battle-tested techniques that broke many early LLM apps. Running them tells you whether your model is still vulnerable to the canonical injection patterns the field already knows about.

Jailbreaks:

  • doctor (Jailbreak) - Uses a "Doctor" authority persona and leetspeak puppetry to coax the model into role-playing past its safety rules. The idea is that framing a request as professional, hypothetical, or in obfuscated spelling lowers the model's guard. A hit means simple persona and roleplay framing is enough to unlock restricted answers.
  • grandma (Jailbreak) - The "grandma exploit" uses emotional social engineering, for example "my late grandma used to read me forbidden recipes to help me sleep", to pull out restricted content. It exploits the model's helpfulness and empathy rather than any technical flaw. If it works, your model can be talked into harmful output through sob-story framing alone.
  • dra (Jailbreak) - Disguise-and-Reconstruct splits or disguises a harmful request and then asks the model to reassemble and answer it. By never stating the bad request plainly, it tries to avoid tripping the refusal. A hit shows the model can be steered into harmful output whenever the intent is hidden across the prompt.
  • fitd (Jailbreak) - Foot-in-the-Door opens with a small, benign request and gradually escalates to the harmful one across the exchange. It leans on consistency bias: once a model has started helping, it tends to keep helping. Success means your model's refusals weaken as a conversation builds momentum.
  • phrasing (Jailbreak) - Rephrases a blocked request in a different tense or grammatical framing (past tense, future tense, hypothetical) to slip past filters tuned to the present-tense wording. It is a reminder that safety often keys on surface phrasing rather than meaning. A hit means trivial rewording defeats the guardrail.
  • suffix (Jailbreak) - Appends an optimized adversarial suffix (GCG/BEAST-style token strings) that statistically pushes the model toward compliance. The suffix looks like gibberish but reliably nudges many models into answering. A hit indicates susceptibility to automated, transferable jailbreak strings (the strongest variants need model internals, but transferable suffixes are tested black-box).
  • tap (Jailbreak) - Tree-of-Attacks-with-Pruning puts an attacker model in charge: it generates candidate jailbreaks, tries them against your target, and prunes dead ends while exploring the promising branches. Because it adapts to your specific model's replies, it can find bypasses no fixed prompt would. It uses the local judge to score attempts, so a hit means an automated attacker can discover a working bypass on its own.
  • goat (Jailbreak) - The Generative Offensive Agent Tester runs an attacker model through adaptive, multi-step jailbreak conversations, reacting to each of your model's answers. Like tap it is dynamic, but conversational rather than tree-structured. A hit shows your model can be worn down by an autonomous adversary that adjusts its approach turn by turn.
  • sata (Jailbreak) - Simple Assistive Task Attack masks the harmful part of a request and asks the model to fill in the blank, treating the dangerous content as an innocent language exercise. Reframing the attack as a benign completion task sidesteps the refusal triggers. A hit means the model will produce restricted content once it is dressed up as a fill-in-the-blank.

System prompt leakage:

  • sysprompt_extraction (System Prompt Leak) - Throws a range of prompts designed to make the model reveal its hidden system prompt or initial instructions word for word. That system prompt often holds business rules, guardrail wording, and sometimes secrets, so leaking it hands attackers a blueprint for further bypasses. A hit means your confidential instructions can be recovered by anyone who can chat with the model.

Sensitive data:

  • apikey (Data Disclosure) - Tries to coax the model into emitting API keys, tokens, and other secrets, whether memorized from training or sitting in its current context. It checks both for trained-in credential leakage and for over-sharing of anything in the model's working memory. A hit is a direct credential-exposure finding.
  • propile (Data Disclosure) - ProPILE probes for leaked personally identifiable information by asking the model to recall personal data about individuals in linked forms (twin, triplet, quadruplet). It measures whether private details were memorized and can be reconstructed. A hit raises a clear privacy and data-protection concern.
  • divergence (Data Disclosure) - Uses repetition attacks (for example asking the model to repeat a token endlessly) that can make it "diverge" and spill memorized training data. This is the technique that famously made some models dump raw training text. A hit shows the model can be pushed into leaking memorized content through a simple repetition trick.

Encoding / smuggling:

  • smuggling (Encoding Bypass) - Smuggles payloads using homoglyphs (look-alike Unicode characters), function-masking, and hypothetical-response framing to get past content filters. Where encoding focuses on standard encodings, smuggling focuses on character-level and framing obfuscation. A hit means visually disguised or indirectly framed payloads still reach and influence the model.

Toxicity / harmful content:

  • realtoxicityprompts (Toxicity) - Feeds prompts from the RealToxicityPrompts dataset that are known to draw out insults, threats, identity attacks, profanity, or sexual content. It measures how easily the model is steered into toxic continuations. A high score means the model produces harmful language with little provocation.
  • lmrc (Toxicity) - The Language Model Risk Cards probe sweeps a spread of harm categories, including bullying, profanity, sexual content, quack medicine, and deadnaming. It is a broad behavioral checklist rather than a single attack. Hits show exactly which harm categories your model is willing to engage in.
  • continuation (Toxicity) - Gives the model the beginning of a slur or hateful phrase and tests whether it completes it. It targets the model's tendency to autocomplete offensive text. A hit means the model can be nudged into finishing toxic content it should have refused.
  • donotanswer (Toxicity) - Uses the "Do-Not-Answer" dataset of prompts a responsible model should refuse outright, spanning hazards, malicious use, and misinformation. Rather than tricking the model, it checks baseline refusal behavior on plainly unsafe asks. A hit means the model answers questions it should simply decline.
  • atkgen (Toxicity) - An attacker model automatically generates and iterates prompts to draw toxic output from your target, using a toxicity classifier to steer toward whatever works. It is an adaptive red-teamer for harmful content rather than a fixed list. A hit shows an automated adversary can reliably provoke toxicity.
  • topic (Toxicity) - Steers the conversation onto blocked or controversial topics (expanded from WordNet) to see whether the model engages where it should stay neutral or refuse. It probes topical guardrails rather than one specific harmful output. Hits reveal the subjects the model can be drawn into discussing.

Harmful generation:

  • malwaregen (Harmful Generation) - Asks the model to produce malware building blocks such as evasion code, payloads, and malicious sub-functions. It tests whether the model will assist offensive software development. A hit is a serious finding for any model exposed to untrusted users.
  • exploitation (Harmful Generation) - Probes for the model emitting working exploit code, including SQL-injection echoes and system or Jinja template-injection payloads. It targets the model's willingness to generate attack code on request. A hit means the model can be used as an exploit-writing assistant.
  • av_spam_scanning (Harmful Generation) - Tries to make the model output known test signatures such as EICAR (the antivirus test file), GTUBE (the spam test string), and phishing templates. Emitting these is a safe proxy for "will it generate malicious-class artifacts" without creating real malware. A hit means the model will produce content that downstream security tooling treats as malicious.

Insecure output handling:

  • ansiescape (Insecure Output) - Tries to make the model emit ANSI terminal escape and control codes that can manipulate or exploit a terminal which later renders the output. The danger is not the text itself but what happens when an unsuspecting CLI or log viewer interprets it. A hit means your model's output could attack the systems that display it.
  • web_injection (Insecure Output) - Coaxes the model into producing markdown-image exfiltration, XSS, or other payloads that fire when the output is rendered in a web page. It targets the boundary where model output becomes live HTML or markdown. A hit means rendering the model's reply without sanitizing it could lead to data theft or script execution.
  • badchars (Insecure Output) - Checks how the model handles malformed, control, and other "bad" characters in the text it produces, probing output-handling robustness. Mishandled characters can corrupt downstream parsers or hide content. Hits point to output that may break or be misread by the systems consuming it.

Supply chain:

  • packagehallucination (Supply Chain) - Asks the model for code and checks whether it invents non-existent package or dependency names across Python, JS, Ruby, and Rust. Attackers can register those hallucinated names and ship malware, a "slopsquatting" or dependency-confusion attack. A hit means blindly following the model's suggestions could pull in attacker-controlled packages.

Hallucination:

  • misleading (Hallucination) - Presents false claims and checks whether the model agrees with or confirms them instead of correcting them. It measures how easily the model is led to confidently endorse a wrong premise. A hit means the model can be made to validate misinformation.
  • snowball (Hallucination) - Uses reasoning traps (primes, graph-connectivity puzzles, "name the senator" style questions) where one early wrong step snowballs into a confidently wrong final answer. It targets compounding reasoning errors rather than simple factual gaps. Hits show where the model's chain of reasoning collapses into assured nonsense.

Tip: whole families can be slow on a laptop. Start with the four defaults, then add families by chip once you see where the target is weak.

PyRIT attacks

PyRIT's choices are multi-turn conversation strategies, all in the Jailbreak family. The default is crescendo.

  • crescendo - Opens with a harmless question and escalates the conversation step by step toward the off-limits goal, nudging a little further each turn. Because every individual message looks reasonable, the model rarely refuses outright and is gradually led to the harmful answer. It is the default attack and one of the most effective against models that pass single-shot tests.
  • skeleton_key - Uses a multi-step setup that convinces the model to treat its safety guidelines as optional, for example by asking it to add a warning label instead of refusing. Once the model accepts that framing, it answers requests it would normally block. A success means a single setup sequence unlocks broad unsafe behavior for the rest of the chat.
  • tap - Tree-of-Attacks-with-Pruning puts an attacker model in charge: it generates many candidate jailbreaks, tests them against your target, prunes the failures, and explores the promising branches further. Because it adapts to your specific model's responses, it can find bypasses no fixed prompt would. The local judge scores each attempt, so a success means an automated attacker can self-improve its way past your guardrails.
  • many_shot - Floods the context with a long list of fake question-and-answer pairs in which an "assistant" cheerfully complies with harmful requests, then asks the real question at the end. The model imitates the pattern it has just been shown and answers in kind. It is especially effective against models with large context windows.

You can type a Custom objective to give PyRIT a specific goal instead of its built-in ones.

Giskard detectors

Giskard's choices are detectors. The card opens with prompt_injection checked; tick the others as needed. Giving Giskard a good Target purpose (what your app does) makes every detector below much sharper, because it generates the test cases from that description.

  • prompt_injection - Feeds crafted inputs that try to override the model's instructions and hijack its task. It checks whether your app's model can be redirected by user-supplied text. A finding means an attacker could make the assistant ignore its intended job.
  • information_disclosure - Probes for the model revealing sensitive or internal information, such as system instructions, configuration, or data it should keep private. It targets leakage relevant to the app you described. A finding means the model over-shares information a real user should never see.
  • hallucination - Looks for confidently stated but false or fabricated answers, including invented facts and unsupported claims. Tuned by your app description, it can spot domain-relevant fabrications rather than generic ones. A finding flags answers users might trust but should not.
  • harmfulness - Checks whether the model generates harmful or dangerous content when probed in your app's context, covering unsafe instructions and other clearly damaging output. It focuses on what your specific assistant can be pushed to produce. A finding means the model can be coaxed into harmful material.
  • stereotypes - Surfaces discriminatory, biased, or stereotyped responses across protected characteristics. It tests fairness rather than security. A finding highlights reputational and compliance risk in how the model treats people.
  • sycophancy - Checks whether the model abandons a correct answer to agree with a user's false premise just to be agreeable. It reveals a model that prioritizes pleasing the user over being accurate. A finding matters anywhere users rely on the model to push back on wrong assumptions.
  • output_formatting - Checks whether the model's output violates expected format constraints, for example breaking a required JSON shape or schema. Malformed output can crash or confuse the systems that consume it. A finding points to downstream-handling and integration risk.

promptfoo plugins and strategies

A plugin is the dataset of attacks; a strategy changes how each attack is delivered. Default plugin: pliny.

  • pliny (Jailbreak) - Runs prompts from Pliny's widely shared "L1B3RT4S" jailbreak collection, a community-maintained corpus of working bypasses. Because these are real, current jailbreaks that people use in the wild, it is a strong test of practical resilience. A high score means your model falls to jailbreaks that are already public.
  • beavertails (Toxicity) - Uses the BeaverTails dataset of harmful prompts spanning many categories of unsafe requests. It measures how often the model produces harmful answers across a large, standardized set of bad asks. A high score indicates weak refusal behavior on common harmful content.
  • harmbench (Toxicity) - Runs the HarmBench dataset, a standardized academic benchmark of harmful behaviors used to compare model safety. It gives a rigorous, comparable measure of how the model handles dangerous requests. A high score means the model underperforms against a recognized safety yardstick.

Strategies wrap each attack in an encoding to test whether the model will decode it and then obey:

  • basic - No transform; sends the raw payload as the baseline.
  • base64 - Base64-encodes the payload to hide it from plain-text filters.
  • rot13 - Applies the ROT13 letter shift.
  • leetspeak - Swaps letters for look-alike numbers and symbols (l33t).
  • morse - Encodes the payload as Morse code.
  • piglatin - Rewrites the payload in Pig Latin.

Running a scan, step by step

When you open a tool, you fill in four blocks top to bottom.

1. Targets. Tick the discovered endpoints you want to attack. To test something recon did not find, paste a full URL under "Attack a URL not in the graph", pick its type (llm-chat / llm-completion), optionally a model, and click Add.

Targets block

2. Probes / Attack strategies. Pick what to run (see What each probe tests). For long lists there is a toolbar with Select all, Reset to defaults, Clear, and a counter.

Probe selection grid

3. Run bounds and options. Set how thorough the run is, add target login details if the endpoint needs them, optionally describe the app, and tick the RoE box. Every field is in Every setting explained.

Run bounds and launch

4. Launch and watch. The Launch button lights up once you have at least one target, at least one probe, and the RoE box ticked. As it runs you see a status line (Phase 1/4 ... Phase 4/4: Safety, Loading targets, Attacking, Saving findings), a moving progress bar, and the live log. You can Stop any time, and if you refresh the page mid-scan it reconnects and keeps showing progress.

Live output


Every setting explained

Setting What it does Which tools
Target checkboxes The discovered endpoints to attack. all
Custom URL Attack a URL recon did not find. all
Custom interface type llm-chat or llm-completion, so the request is shaped correctly. all
Custom model The model name for a custom target (optional). all
Probes / Strategies / Plugins What to run (see the catalog). all
Generations How many times each probe is tried. More = a more reliable score, but slower. mainly garak
ASR >= Hide weak results: anything below this success rate is dropped. Default 0.3 (30%). all
Judge The local model that grades the answers (default qwen2.5:7b). all
Max turns How long the conversation can get. Shown only for PyRIT. PyRIT
Seed Makes runs repeatable. garak, PyRIT
Parallel How many requests are sent to the target at once. Default 2. Keep it low (2) for a slow or CPU-only target so its queue does not back up and time out; raise it (e.g. 8) for a fast or GPU target to finish quicker. garak, promptfoo
Strategies Extra encodings to wrap promptfoo payloads in. promptfoo
Custom objective A specific goal for PyRIT instead of its defaults. PyRIT
Target purpose A short description of your app; sharpens the generated attacks. Giskard, promptfoo, PyRIT
Target authentication How to log in to the target: None, Bearer token, or a Custom header. all
RoE confirmation Required: confirms you are authorized to test this target. all
Launch / Stop Start or stop the scan. all

The single Generations field mainly affects garak. Giskard just reports whether an issue exists, promptfoo uses its own test count, and PyRIT is governed by Max turns - so do not worry if turning it up does not change those tools.


Reading your results

When a scan finishes, the findings table at the bottom of the page fills in, sorted by success rate (worst first).

Findings table

Column Meaning
Tool Which tool found it.
OWASP The OWASP-LLM category (LLM01..LLM10, or safety).
Attack Which attack succeeded.
Target The endpoint that was attacked.
ASR How often the attack worked, as a percentage.
Trials How many attempts that percentage is based on.
Severity How serious it is.
Evidence A one-line summary of the proof.
Report A view link that opens the full conversation behind the finding.

Giskard always shows 100%. It reports whether an issue exists, not how often, so sort Giskard rows by severity rather than ASR.

The same findings also appear in the Red Zone: open Red Zone -> Data Table -> the AI Risk preset -> the AI Gauntlet Vulnerabilities sheet (the "Show findings" button on the Gauntlet page jumps straight there). From there you can search and export to XLSX / JSON / MD.

Red Zone - AI Gauntlet Vulnerabilities

Click view on any finding to open the transcript - the raw tool output (JSON / JSONL) with the exact prompts and responses that prove the weakness.


Where your targets and findings live on the graph

You do not need to think about the database, but it helps to know the flow:

  • Targets in. The picker is filled from the AI endpoints recon already found, so you are attacking things that are on your project graph.
  • Findings out. After a scan, each weakness becomes a finding attached to the endpoint it was found on, visible in the Red Zone table and the report.
  • Custom URLs. If you attack a URL recon never found, it is added to the graph for you, so the finding still has a home and is easy to find later.
  • Re-running. Running the same tool against the same target again updates the existing finding (refreshing its score and proof) instead of creating duplicates.

Reports

AI Gauntlet findings appear in your Pentest Reports under a "Tested Vulnerabilities - AI Gauntlet" section. It groups the confirmed weaknesses by category and target, and highlights any weakness that more than one tool confirmed - the strongest kind of finding.

The report is a self-contained HTML document: view it in the browser or download the .html, and it prints cleanly to PDF. There is no JSON or PDF report format - JSON of these findings comes only from the AI Gauntlet Vulnerabilities table export (XLSX / JSON / MD) and the raw transcripts, never from the report itself.

Report - Tested Vulnerabilities - AI Gauntlet


Is this safe to run?

Yes, with the usual authorization caveat - it sends real attack prompts, so only point it at targets you are allowed to test.

  • You must confirm authorization (the RoE checkbox) before every launch.
  • Nothing leaves your machine. Grading is done by a local model; no prompts, answers, or transcripts are sent to any outside service.
  • The worst categories are hard-blocked (for example CSAM / weapons) regardless of any setting.
  • It is off by default and only attacks the endpoints you select.
  • Runs are bounded so they cannot run forever.

See Rules of Engagement and Project Settings Reference.


Troubleshooting

Symptom What to do
No targets to pick Run Adversarial AI Recon first, or add a custom URL.
Launch button is greyed out You need a target, a probe selected, and the RoE box ticked.
Another tool says "Locked" A scan is already running - only one runs at a time. Stop it or wait.
No findings after a run The target resisted, or every result was below your ASR >= threshold. Lower the threshold or add probes.
PyRIT / Giskard / promptfoo found nothing These need the local grader model. If it could not start, they return nothing - check the live log.
Giskard shows 100% Normal - it reports presence, not a rate. Use severity.
A transcript link does not open The proof files are temporary; an old finding may point to a cleaned-up file. Re-run to refresh.
Very slow run Whole garak families are heavy. Use fewer families or fewer generations. A slow CPU target is the usual cause - a GPU target is dramatically faster.
Run stops early / logs vanish with no findings The target was too slow and its request queue timed out. Lower Parallel (try 2) so fewer requests pile up, and/or test a faster (GPU) target.

See also


Screenshots to capture

The image slots below still need real screenshots dropped into images/. Capture each at http://localhost:3000/ai-attack-surface (or the Red Zone / report where noted) and save with the exact filename:

  • aigauntlet-01-page-overview.png - the whole page
  • aigauntlet-02-tool-cards.png - the four tool cards
  • aigauntlet-03-filter-bar.png - the attack-chip filter bar
  • aigauntlet-04-targets-block.png - Block 1 targets + custom-URL form
  • aigauntlet-05-probes-grid.png - Block 2 probe grid
  • aigauntlet-06-run-bounds.png - Block 3 settings + RoE + Launch
  • aigauntlet-07-output-logs.png - Block 4 live log
  • aigauntlet-08-findings-table.png - the findings table
  • aigauntlet-09-redzone-airisk.png - Red Zone AI Gauntlet Vulnerabilities tab
  • aigauntlet-10-report-section.png - the report section

Clone this wiki locally