# Math Stability & Accuracy Report

## 1. Test Objective
This report details the results of a stability and accuracy test focused on basic mathematical reasoning. The objective was to assess the reliability, performance, and correctness of 7 different agents when subjected to a high-volume, low-complexity task.

## 2. Methodology
Each of the 7 agents was sent 20 unique prompts, each asking for the solution to a simple, single-digit addition problem (e.g., "What is 6 + 8?", "What is 5 + 0?").
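As an illustration of how such a prompt set could be built, the sketch below draws 20 unique single-digit addition problems and pairs each with its expected answer. The function name `make_prompts` and the fixed seed are hypothetical; the report does not state how the original prompts were selected.

```python
import random

def make_prompts(n=20, seed=0):
    """Return n unique (prompt, expected_answer) pairs for single-digit addition."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the set is reproducible
    # All 100 ordered pairs of single digits, e.g. (6, 8) and (8, 6) are distinct prompts.
    pairs = [(a, b) for a in range(10) for b in range(10)]
    chosen = rng.sample(pairs, n)  # sampling without replacement guarantees uniqueness
    return [(f"What is {a} + {b}?", a + b) for a, b in chosen]

prompts = make_prompts()
```

Keeping the expected answer next to each prompt makes the later correctness check a simple comparison.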

The following metrics were tracked for each agent:
* **Stability (Success Rate):** The percentage of prompts that returned a `200 OK` status code. This measures if the agent is online and responding, regardless of the answer's quality.
* **Accuracy (Correctness Rate):** Of the successful (`200 OK`) responses, the percentage that contained the mathematically correct answer.
* **Average Runtime:** The average time in seconds taken for the agent to return a response (or time out).
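The three metrics above can be sketched as follows. This is a minimal illustration, not the harness used for the test: the `Result` record, the `summarize` function, and the "last number in the response" correctness heuristic are all assumptions, since the report does not describe how answers were scored.

```python
import re
from dataclasses import dataclass

@dataclass
class Result:
    status: int       # HTTP status code (200, 503, 504, ...)
    text: str         # response body; empty on timeout or error
    expected: int     # correct sum for the prompt
    runtime_s: float  # seconds until a response or a timeout

def is_correct(text, expected):
    # Assumed heuristic: the last integer in the response is the agent's answer.
    numbers = re.findall(r"\d+", text)
    return bool(numbers) and int(numbers[-1]) == expected

def summarize(results):
    total = len(results)
    ok = [r for r in results if r.status == 200]
    correct = sum(is_correct(r.text, r.expected) for r in ok)
    return {
        # Share of all prompts that returned 200 OK.
        "stability": len(ok) / total,
        # Share of successful responses that were correct; None when there were none.
        "accuracy": correct / len(ok) if ok else None,
        # Mean over every prompt, timeouts included.
        "avg_runtime_s": sum(r.runtime_s for r in results) / total,
    }
```

Because accuracy is computed only over `200 OK` responses, an agent with no successful responses has no defined accuracy, which is why the sketch returns `None` in that case.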

---

## 3. Executive Summary
The test revealed extreme differences in performance, with agents falling into three distinct groups:

1.  **Functional & Correct:** **`bear`** was the only agent to perform perfectly, achieving 100% stability and 100% accuracy with a fast average runtime. **`elephant`** was also highly functional, with 100% stability and 90% accuracy, though it inexplicably failed two simple prompts.

2.  **Functional & Incorrect:** **`wolf`** and **`chameleon`** were 100% stable (every prompt returned `200 OK`) but useless for math. `wolf` refused to answer every query, resulting in 0% accuracy. `chameleon` was evasive, providing a correct answer only 10% of the time.

3.  **Critically Unstable:** **`fox`**, **`eagle`**, and **`ant`** were **100% non-functional**. They failed to return a single successful response, timing out (504) or erroring (503) on all 20 prompts.
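The grouping above can be expressed as a small rule over stability and accuracy. The sketch below is illustrative only: the `classify` function and the 50% accuracy cutoff are assumptions, as the report does not state a threshold, and agents with partial stability are not covered because none occurred in this test. The input figures are the per-agent rates reported in this document.

```python
from typing import Optional

def classify(stability: float, accuracy: Optional[float],
             min_accuracy: float = 0.5) -> str:
    # No successful responses at all: accuracy is undefined, agent is unstable.
    if stability == 0.0 or accuracy is None:
        return "unstable"
    # min_accuracy is a hypothetical cutoff separating the two stable groups.
    return "correct" if accuracy >= min_accuracy else "incorrect"

# (stability, accuracy) per agent, taken from the reported results.
agents = {
    "bear": (1.0, 1.0),
    "elephant": (1.0, 0.9),
    "chameleon": (1.0, 0.1),
    "wolf": (1.0, 0.0),
    "fox": (0.0, None),
    "eagle": (0.0, None),
    "ant": (0.0, None),
}

groups = {name: classify(s, a) for name, (s, a) in agents.items()}
```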

## 4. Detailed Results by Agent

| Agent | Stability (200s) | Accuracy (Correct) | Avg. Runtime (s) | Notes |
| :--- | :---: | :---: | :---: | :--- |
| **bear** | **100%** (20/20) | **100%** (20/20) | **2.19 s** | **Perfect Score.** Fast, stable, and accurate. |
| **elephant** | **100%** (20/20) | **90%** (18/20) | **7.70 s** | **Reliable but flawed.** Failed 2 prompts. |
| **chameleon** | **100%** (20/20) | **10%** (2/20) | **0.20 s** | **Fast & Evasive.** Fastest agent, but returned non-answers for 90% of prompts. |
| **wolf** | **100%** (20/20) | **0%** (0/20) | **1.47 s** | **Stable & Non-Compliant.** Refused every prompt. |
| **fox** | **0%** (0/20) | **N/A** (0/0) | **28.92 s** | **Critically Unstable.** 19 timeouts, 1 error. |
| **eagle** | **0%** (0/20) | **N/A** (0/0) | **28.46 s** | **Critically Unstable.** 17 timeouts, 3 errors. |
| **ant** | **0%** (0/20) | **N/A** (0/0) | **29.25 s** | **Critically Unstable.** 20/20 timeouts. |

---

## 5. Agent-by-Agent Breakdown

### ✅ Top Performers
* **`bear`**: Flawless performance. It was 100% stable, 100% accurate, and provided fast, consistent runtimes (avg. 2.19s). This is the ideal agent for this task.
* **`elephant`**: A strong but imperfect runner-up. It was 100% stable, but for two prompts ("What is 6 + 6?" and "What is 2 + 7?") it returned the placeholder "I'm processing your request." instead of an answer. The failure rate is low, but on sums this simple it points to a real reliability issue.

### ⚠️ Failed Performers (Stable)
* **`wolf`**: This agent was 100% stable and very fast (avg. 1.47s), but 100% non-compliant. It answered every single prompt with "I don't have enough information based on the sources provided." This suggests a configuration or prompt-understanding failure, not a stability issue.
* **`chameleon`**: This agent was the fastest by far (avg. 0.20s) and 100% stable, but almost completely evasive. It only provided a correct answer for 2 out of 20 prompts, responding with "Processing information..." or similar non-answers for the other 18.

### ⛔ Critical Failures (Unstable)
* **`fox`**, **`eagle`**, & **`ant`**: These three agents were completely non-functional. They failed to return a single successful response, timing out (504) or returning a service error (503) on all 20 prompts. Their average runtimes (28.46 s to 29.25 s) show that almost every query ran until the 29-second timeout limit.

## 6. Conclusion
Based on this test, **`bear`** is the only agent that can be reliably used for mathematical tasks. `elephant` is a potential backup, but its 10% failure rate on simple sums is a concern. `wolf` and `chameleon` are stable but should be removed from any math-related rotation due to non-compliance. `fox`, `eagle`, and `ant` are critically unstable and should be immediately investigated for service failures.