[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/17nIj3kRgNnJD4Gu0OUavGxbpqa-RefSd/view?usp=sharing)

# Gateway Metrics Evaluation with Flotorch Eval

This notebook demonstrates how to track and evaluate **gateway metrics** (latency, cost, token usage) using the **Flotorch SDK** alongside the **Flotorch Eval** library. Gateway metrics are automatically tracked by the Flotorch Gateway for every LLM call, providing essential operational insights for performance monitoring, cost management, and usage analytics.

---

### **Use Case Overview**

The **Agriculture & Farming Assistant** helps users explore comprehensive agriculture and farming knowledge covering:

**Core Agricultural Topics:**
- **Types of Agricultural Systems**
- **Major Crop Production**
- **Livestock Management**
- **Soil Management**
- **Irrigation and Water Management**
- **Pest and Disease Management**
- **Sustainable Agriculture**

Alongside gateway metrics, this notebook evaluates the **faithfulness** of the model’s answers, that is, whether the generated responses are **factually grounded** in the retrieved context.

---

### **Notebook Workflow**

We’ll follow a structured evaluation process:

1. **Iterate Questions** – Loop through each question in the `gt.json` file (Ground Truth).  
2. **Retrieve Context** – Fetch relevant passages from the Knowledge Base.  
3. **Generate Answer** – Use the system prompt and LLM to produce a response.  
4. **Store Results** – Log each question, retrieved context, generated answer, ground truth, and **gateway metrics (latency, cost, tokens)** from response headers.  
5. **Evaluate Metrics** – Use `LLMEvaluator` from Flotorch Eval to assess response quality and extract gateway metrics.  
6. **Display Results** – Summarize **gateway metrics** (latency, cost, token usage) and evaluation scores in comparison tables and analysis.

---

### **Metrics Evaluated**

#### **1. Gateway Metrics (Primary Focus)**

**Gateway metrics** are automatically tracked by the Flotorch Gateway for every LLM call. These metrics provide essential operational insights:

| Metric | Source | Description | Unit |
|--------|--------|-------------|-------|
| **LATENCY** | Gateway | Measures total and average latency across LLM calls | Milliseconds (ms) |
| **COST** | Gateway | Tracks total cost of LLM operations | USD ($) |
| **TOKEN_USAGE** | Gateway | Monitors total token consumption | Token count |

**How Gateway Metrics Work:**

1. When you call `FlotorchLLM.invoke()` with `return_headers=True`, the gateway returns response headers containing:
   - Request/response latency information
   - Cost breakdown per operation
   - Token usage (input tokens + output tokens)

2. Pass these headers as the `metadata` of each `EvaluationItem`; Flotorch Eval extracts the gateway metrics from them.

3. Gateway metrics are computed automatically, so **no additional configuration is required**: pass the headers as metadata and the evaluator handles the rest.
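
To make the roll-up concrete, here is a minimal sketch of how per-call gateway data could be aggregated into the four reported figures (total latency, average latency, total cost, total tokens). It assumes each call has already been reduced to a plain dict with the hypothetical keys `latency_ms`, `cost`, `input_tokens`, and `output_tokens`. The real header names and the extraction logic inside Flotorch Eval may differ, and the numbers below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GatewaySummary:
    total_latency_ms: float
    average_latency_ms: float
    total_cost: float
    total_tokens: int


def summarize_gateway_calls(calls: List[Dict[str, float]]) -> GatewaySummary:
    """Roll per-call gateway data up into totals and averages.

    NOTE: the keys (latency_ms, cost, input_tokens, output_tokens) are
    hypothetical stand-ins, not the actual Flotorch header names.
    """
    if not calls:
        return GatewaySummary(0.0, 0.0, 0.0, 0)
    total_latency = sum(c.get("latency_ms", 0.0) for c in calls)
    total_cost = sum(c.get("cost", 0.0) for c in calls)
    # Token usage = input tokens + output tokens, summed over all calls
    total_tokens = sum(
        int(c.get("input_tokens", 0)) + int(c.get("output_tokens", 0)) for c in calls
    )
    return GatewaySummary(
        total_latency_ms=total_latency,
        average_latency_ms=total_latency / len(calls),
        total_cost=total_cost,
        total_tokens=total_tokens,
    )


# Illustrative, made-up values for two LLM calls
calls = [
    {"latency_ms": 820.0, "cost": 0.0004, "input_tokens": 350, "output_tokens": 60},
    {"latency_ms": 1180.0, "cost": 0.0006, "input_tokens": 410, "output_tokens": 90},
]
summary = summarize_gateway_calls(calls)
print(summary)
```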

**Why Gateway Metrics Matter:**

- **Performance Monitoring**: Track latency to identify bottlenecks and optimize response times
- **Cost Management**: Monitor spending to budget and optimize model usage
- **Usage Analytics**: Understand token consumption patterns to plan capacity

For more details, see the [Flotorch Eval documentation](https://github.com/FissionAI/flotorch-eval/tree/develop).

#### **2. Faithfulness Metric**

We evaluate **Faithfulness** to measure how faithfully a generated answer reflects the supporting context retrieved from our knowledge base. A high score means every claim in the answer is grounded in the evidence; a low score indicates hallucinations or contradictions.

#### Ragas Faithfulness (Flotorch `evaluation_engine="ragas"`)
According to the [Ragas documentation](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/), the pipeline:
- breaks the generated answer into atomic claims using an evaluator LLM,
- checks each claim against retrieved context passages for support, and
- reports the ratio of supported claims to total claims as a score between 0.0 and 1.0.
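
The final scoring step reduces to a simple ratio. The sketch below assumes the claim extraction and verification steps (which Ragas performs with an evaluator LLM) have already produced one boolean verdict per claim; `faithfulness_from_verdicts` is a hypothetical helper for illustration, not part of Ragas or Flotorch Eval.

```python
import math
from typing import Sequence


def faithfulness_from_verdicts(verdicts: Sequence[bool]) -> float:
    """Return supported_claims / total_claims for one answer.

    `verdicts` holds one entry per atomic claim: True if the retrieved
    context supports the claim, False otherwise. An answer with zero claims
    is treated as undefined (NaN) here; this is an assumption, and Ragas
    may handle that case differently.
    """
    if not verdicts:
        return math.nan
    return sum(verdicts) / len(verdicts)


# Three claims extracted from an answer; the context supports two of them.
print(faithfulness_from_verdicts([True, True, False]))
```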

---

### **Evaluation Engine**

- `evaluation_engine="ragas"`: keeps every metric inside the [**Ragas**](https://docs.ragas.io/en/stable/getstarted/) rubric for retrieval-aware evaluations (faithfulness, answer relevance, context precision, aspect critic, etc.), which is the engine configured in this notebook.


---

### **Requirements**

- Flotorch account with configured LLM, embedding model, and Knowledge Base.  
- `gt.json` containing question–answer pairs for evaluation.  
- `prompt.json` containing the system and user prompt templates.  
---

#### **Documentation References**
- [**flotorch-eval GitHub repo**](https://github.com/FissionAI/flotorch-eval/tree/develop) — reference implementation with sample notebooks and evaluation pipelines.
- [**Ragas Faithfulness Documentation**](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/) — detailed explanation of the metric.
---

## 1. Install Dependencies

First, we install the required libraries. The `-q` flag enables a "quiet" installation that hides the lengthy output.

-   `flotorch`: The main Python SDK for interacting with all Flotorch services, including LLMs and Knowledge Bases.
-   `flotorch-eval[llm]`: The evaluation library. We use the `[llm]` extra to install dependencies required for LLM-based (Ragas) evaluations.
-   `opentelemetry-instrumentation-httpx`: OpenTelemetry instrumentation for the `httpx` HTTP client, installed at a pinned version.

In [None]:
# Install the Flotorch SDK and Flotorch Eval (with the [llm] extra used for Ragas evaluations)
# You can safely ignore dependency errors during installation.

%pip install -q flotorch==3.1.0b1 "flotorch-eval[llm]==2.0.0"
%pip install -q opentelemetry-instrumentation-httpx==0.58b0

## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a stronger, separate model such as `flotorch/gpt-4o`, which helps keep the judgment independent of the model being evaluated.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

### Example

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key | `sk_...` |
| `FLOTORCH_BASE_URL` | Gateway endpoint | `https://gateway.flotorch.cloud` |
| `inference_model_name` | The LLM your agent uses to generate answers | `flotorch/gpt-4o-mini` |
| `evaluation_llm_model_name` | The LLM used to evaluate the answers | `flotorch/gpt-4o` |
| `evaluation_embedding_model_name` | Embedding model for semantic similarity checks | `open-ai/text-embedding-ada-002` |
| `knowledge_base_repo` | The ID of your Flotorch Knowledge Base | `agriculture-farming` |
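
Because the configuration cell that follows ships with `<...>` placeholders, it may help to fail fast before any API call is made. This optional sketch reports settings that are empty or still contain an angle-bracket placeholder; `find_unfilled_placeholders` is a hypothetical helper, not part of the Flotorch SDK.

```python
import re
from typing import Dict, List

# Matches placeholders such as "<your-model-name>"
PLACEHOLDER = re.compile(r"<[^<>]+>")


def find_unfilled_placeholders(config: Dict[str, str]) -> List[str]:
    """Return the names of settings that are empty or still contain a <placeholder>."""
    return [
        name for name, value in config.items()
        if not value or PLACEHOLDER.search(value)
    ]


# Illustrative settings: one filled in, one placeholder left, one empty
settings = {
    "FLOTORCH_BASE_URL": "https://gateway.flotorch.cloud",
    "inference_model_name": "flotorch/<your-model-name>",
    "knowledge_base_repo": "",
}
print(find_unfilled_placeholders(settings))
```

In the notebook, you could build `settings` from the variables defined in the configuration cell and raise an error if the returned list is non-empty.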

In [None]:
import getpass  # Prompts for secrets without echoing them to the notebook output

# Flotorch credentials used by the SDK and the evaluator
FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ")
print("API key received" if FLOTORCH_API_KEY else "Warning: no API key entered")

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ")  # e.g. https://gateway.flotorch.cloud

inference_model_name = "flotorch/<your-model-name>"  # Model generating answers
evaluation_llm_model_name = "flotorch/<your_model_name>"  # Model judging answer quality
evaluation_embedding_model_name = "flotorch/<embedding_model_name>"  # Embedding model for similarity checks

knowledge_base_repo = "<your_knowledge_base_id>"  # Knowledge Base ID

## 3. Import Required Libraries

### Purpose
Import all required components for evaluating the RAG assistant.

### Key Components
- `json` : Loads configuration files and ground truth data from disk
- `tqdm` : Shows a lightweight progress bar while iterating over evaluation items
- `FlotorchLLM` : Connects to the Flotorch inference endpoint for answer generation
- `FlotorchVectorStore` : Retrieves context snippets from the configured knowledge base
- `memory_utils` : Utility helpers for extracting text from vector-store search results
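- `LLMEvaluator`, `EvaluationItem`, `MetricKey` : Configure and run metric scoring for the generated answers
- `display_llm_evaluation_results` : Renders the evaluation results as a formatted table
- `files` (from `google.colab`) : Uploads the input files when running in Colab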



In [None]:
# Required imports
import json
from typing import List
from tqdm import tqdm  # Use standard tqdm for simple progress bars
from google.colab import files  # Colab-only: used to upload gt.json and prompt.json

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey
from flotorch_eval import display_llm_evaluation_results

print("Imported necessary libraries successfully")

## 4. Load Data and Prompts

### Purpose
Here, we upload and load our ground truth questions (`gt.json`) and the agent prompts (`prompt.json`).

### Files Required

**1. `gt.json` (Ground Truth)**  
Contains question-answer pairs for evaluation. Each `answer` is the expected correct response.

```json
[
  {
    "question": "What percentage of global greenhouse gas emissions does agriculture account for?",
    "answer": "Agriculture accounts for approximately 10 to 12 percent of global greenhouse gas emissions directly and additional emissions through associated land use changes."
  },
  {
    "question": "What percentage of global agricultural land is irrigated and how much food does it produce?",
    "answer": "Approximately 20 percent of global agricultural land is irrigated, yet this land produces about 40 percent of the world's food."
  }
]
```

**2. `prompt.json` (Agent Prompts)**  
Defines the system prompt and user prompt template with `{context}` and `{question}` placeholders for dynamic formatting.

```json
{
  "system_prompt": "You are a helpful agriculture and farming assistant. Answer based only on the context provided.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```

### Instructions
Run the next cell and upload `gt.json`, then `prompt.json`, when prompted. The cell uses Colab's `files.upload()`; outside Colab, replace the uploads with `open()` calls on your local file paths.
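
Before running the experiment, it can be worth checking that both files match the shapes shown above. The sketch below validates plain Python objects as returned by `json.load`; the helper names are hypothetical, and the checked prompt keys simply mirror the example `prompt.json`.

```python
from typing import Any, List


def validate_ground_truth(items: Any) -> List[str]:
    """Return a list of problems found in a gt.json payload (empty list = OK)."""
    if not isinstance(items, list):
        return ["gt.json must contain a JSON array"]
    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
            continue
        for key in ("question", "answer"):
            # Each entry needs a non-blank string for both fields
            if not isinstance(item.get(key), str) or not item[key].strip():
                problems.append(f"item {i} is missing a non-empty '{key}'")
    return problems


def validate_prompts(config: Any) -> List[str]:
    """Check the prompt.json keys and placeholders used by this notebook."""
    if not isinstance(config, dict):
        return ["prompt.json must contain a JSON object"]
    problems = []
    if "system_prompt" not in config:
        problems.append("missing 'system_prompt'")
    template = config.get("user_prompt_template", "")
    for placeholder in ("{context}", "{question}"):
        if placeholder not in template:
            problems.append(f"user_prompt_template lacks {placeholder}")
    return problems


# Illustrative inputs: the second ground-truth item has an empty question
gt = [
    {"question": "What share of farmland is irrigated?", "answer": "About 20 percent."},
    {"question": "", "answer": "orphan answer"},
]
prompts = {
    "system_prompt": "You are a helpful agriculture and farming assistant.",
    "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:",
}
print(validate_ground_truth(gt))
print(validate_prompts(prompts))
```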

In [None]:
print("Please upload your Ground Truth file (gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully — {len(ground_truth)} items\n")


print("Please upload your Prompts file (prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully, keys: {list(prompt_config)}")

## 5. Define Helper Function

### Purpose
Create a prompt-formatting helper for LLM message construction.

### Functionality
The `create_messages` function:
- Builds the final prompt that will be sent to the LLM.
- Accepts system prompt, user prompt template, question, and retrieved context chunks
- Replaces `{context}` and `{question}` placeholders in the user prompt
- Returns a list of message dicts (`{"role": ..., "content": ...}`) ready for LLM consumption
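
One design choice worth noting: `create_messages` fills the template with `str.replace` rather than `str.format`. If a prompt template ever contains literal braces (for example, a JSON output example), `str.format` treats them as replacement fields and raises an error, whereas `str.replace` only touches the two named placeholders. The comparison below uses a made-up template to illustrate this; one caveat of the replace approach is that a literal `{question}` inside retrieved context would also be substituted, since context is inserted first.

```python
# Made-up template containing literal braces in a JSON example
template = 'Context:\n{context}\n\nQuestion:\n{question}\n\nReply as JSON: {"answer": "..."}'
context = "Drip irrigation delivers water directly to the root zone."
question = "What does drip irrigation do?"

# str.replace substitutes only the two known placeholders; other braces survive.
filled = template.replace("{context}", context).replace("{question}", question)
print(filled)

# str.format treats every {...} in the template as a field, so the JSON example
# is parsed as a field named '"answer"' and the lookup raises KeyError.
try:
    template.format(context=context, question=question)
    format_failed = False
except KeyError as err:
    format_failed = True
    print("str.format failed:", err)
```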

In [None]:
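from typing import Dict, List, Optional, Union


def create_messages(system_prompt: str, user_prompt_template: str, question: str,
                    context: Optional[Union[List[str], str]] = None) -> List[Dict[str, str]]:
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    `context` may be a list of passages (joined with separators) or a single string.
    """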
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages


## 6. Initialize Clients

### Purpose
Set up the infrastructure for RAG pipeline execution.

### Components Initialized
1. **FlotorchLLM** (`inference_llm`): Connects to the LLM endpoint for generating answers based on retrieved context
2. **FlotorchVectorStore** (`kb`): Connects to the Knowledge Base for semantic search and context retrieval
3. **Prompt Variables**: Extracts system prompt and user prompt template from `prompt_config` for dynamic message formatting

These clients power the evaluation loop by retrieving relevant context and generating answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

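# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
# prompt.json defines "user_prompt_template"; fall back to a "user_prompt" key if present
user_prompt_template = prompt_config.get(
    "user_prompt_template", prompt_config.get("user_prompt", "{question}")
)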

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

### Purpose
Execute the full RAG pipeline for each question to generate answers for evaluation.

### Pipeline Steps
For each question in `ground_truth`, the loop performs:

1. **Retrieve Context**: Searches the Knowledge Base (`kb.search()`) to fetch relevant context passages
2. **Build Messages**: Uses `create_messages()` to format the system prompt, user prompt, question, and retrieved context into LLM-ready messages
3. **Generate Answer**: Invokes the inference LLM (`inference_llm.invoke()`) with `return_headers=True` to capture response metadata (cost, latency, tokens)
4. **Store for Evaluation**: Packages question, generated answer, expected answer, context, and metadata into an `EvaluationItem` object

### Error Handling
A `try...except` block gracefully handles API failures, storing error messages as evaluation items to ensure the loop completes without crashes.
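
The loop records failures and moves on. If transient network or rate-limit errors are common, a small retry wrapper with exponential backoff is one option. `call_with_retries` below is a hypothetical, generic sketch, not a Flotorch SDK feature; it could wrap a call via a lambda, e.g. `call_with_retries(lambda: inference_llm.invoke(messages=messages, return_headers=True))`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call `fn`, retrying on any exception with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("attempts must be >= 1")


# Demo with a stand-in function that fails twice before succeeding
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky, attempts=3, base_delay=0.01)
print(result)
```

Retrying on every exception is a simplification; in practice you may want to retry only errors you expect to be transient, since retrying an authentication failure just adds delay.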


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        # return_headers=True captures gateway metrics (LATENCY, COST, TOKEN_USAGE) automatically
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        # Headers carry the raw per-request data from which Flotorch Eval derives
        # the gateway metrics:
        # - LATENCY: total and average latency (ms)
        # - COST: total cost of operations (USD)
        # - TOKEN_USAGE: total tokens consumed (input + output)
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers, # Gateway metrics extracted from headers automatically
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"Experiment completed. {len(evaluation_items)} items are ready for evaluation.")

## 8. Initialize the Evaluator

### Using Ragas Engine

Now that we have our `evaluation_items` list (containing the generated answers), we can set up the `LLMEvaluator`.

This class is the core component of the **Flotorch-Eval** library — think of it as the *“head judge”* for our evaluation process. It coordinates metric calculations, semantic comparisons, and LLM-based judgments using the configuration we provide.

### Parameter Insights

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — The evaluator uses:
  - an **LLM** (`inferencer_model`) for reasoning-based checks, and  
  - an **embedding model** (`embedding_model`) for semantic and contextual similarity evaluations.  
- **`evaluation_engine`** — Here, we set this to `"ragas"`, meaning the evaluator will use the **[Ragas framework](https://docs.ragas.io/en/stable/getstarted/)** for metric computation.  
  Ragas is well-suited for RAG-style evaluations and handles metrics such as:
  - **Faithfulness**
  - **Answer Relevance**
  - **Context Precision**
  - **Aspect Critic (custom maliciousness check)**  
- **`metrics`** — In this configuration, we evaluate only **`MetricKey.FAITHFULNESS`**.

### Gateway Metrics - Automatic Extraction

**Important**: Gateway metrics (LATENCY, COST, TOKEN_USAGE) are **automatically extracted** from the `metadata` field (response headers) of each `EvaluationItem`.

When you pass headers from `FlotorchLLM.invoke(return_headers=True)` as metadata, the evaluator automatically:
- Extracts **LATENCY** (total_latency_ms, average_latency_ms) from gateway response headers
- Calculates **COST** (total_cost) from gateway pricing data
- Aggregates **TOKEN_USAGE** (total_tokens) from gateway headers

**No additional configuration needed** - gateway metrics are collected transparently! See the [Flotorch Eval documentation](https://github.com/FissionAI/flotorch-eval/tree/develop) for details.

### Faithfulness Metric

**Definition**: Faithfulness evaluates how factually consistent a generated response is with the retrieved context. It measures whether all claims in the generated answer can be supported by the context. The score ranges from 0 to 1, calculated as: **Number of claims supported by context / Total number of claims in the response**. This metric is crucial for preventing hallucinations and ensuring the AI doesn't fabricate information beyond what's provided in the source documents.

**How It Works**:
1. Breaks the answer into individual claims
2. Checks each claim against retrieved context  
3. Score = (Supported claims) / (Total claims)

**Example**:

*Context*: "Agriculture employs approximately 26 percent of the global workforce and contributes significantly to the gross domestic product of many nations, particularly in developing countries. The three main types of irrigation methods are surface irrigation, sprinkler irrigation, and drip irrigation."

*Faithful Answer* (Correct): "Agriculture employs approximately 26 percent of the global workforce. The three main irrigation methods are surface, sprinkler, and drip irrigation." → **Score: 1.0**

*Unfaithful Answer* (Incorrect): "Agriculture employs 50 percent of the global workforce, and there are five main types of irrigation methods." → **Score: 0.0** (contains unsupported claims)


In [None]:
# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.FAITHFULNESS,
    ],
    evaluation_engine="ragas",
)

print("LLMEvaluator client initialized.")

## 9. Run Evaluation

### Purpose
Execute the evaluation process to score all generated answers using the **faithfulness** metric.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- For each evaluation item:
  - The evaluator scores **faithfulness** by comparing the generated answer against the retrieved context.

- Collect the following outputs:
  - Faithfulness scores
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute faithfulness scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.
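
The asynchronous cell relies on Jupyter/Colab allowing top-level `await`. In a plain Python script that is a syntax error, so the coroutine has to be driven by an event loop, for example with `asyncio.run`. The sketch uses a stand-in `StubEvaluator` (hypothetical, for illustration only) so it runs without credentials; with the real client you would pass `evaluator_client` and `evaluation_items` instead.

```python
import asyncio
from typing import Any, List


class StubEvaluator:
    """Stand-in exposing an async aevaluate() like LLMEvaluator; returns dummy data."""

    async def aevaluate(self, items: List[Any]) -> dict:
        await asyncio.sleep(0)  # pretend to do asynchronous work
        return {"evaluated": len(items)}


async def run_evaluation(evaluator: Any, items: List[Any]) -> Any:
    # Await the evaluator's coroutine inside an async function.
    return await evaluator.aevaluate(items)


# asyncio.run() creates an event loop, runs the coroutine, and closes the loop.
# Inside Jupyter/Colab a loop is already running, so use `await` there instead.
results = asyncio.run(run_evaluation(StubEvaluator(), ["q1", "q2"]))
print(results)
```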


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")

### Synchronous Evaluation (uncomment the code below to run synchronously)

In [None]:
# print("Starting evaluation... This may take a few minutes.")

# eval_results = evaluator_client.evaluate(evaluation_items)

# print("Evaluation complete.")

## 10. View Per-Question Results

### Purpose
Display evaluation results in a formatted table for easy analysis and comparison.


In [None]:
display_llm_evaluation_results(eval_results)

## 11. View Raw JSON Results

### Purpose
Display the complete evaluation results in JSON format for detailed inspection and programmatic access.

### Output Structure
The JSON output includes for each question:
- **model**: The evaluation LLM model used
- **input_query**: The original question
- **context**: Full retrieved context passages (not truncated)
- **generated_answer**: Complete LLM-generated response
- **groundtruth_answer**: Expected correct answer
- **evaluation_metrics**: Dictionary containing:

  **Gateway Metrics (Automatically Tracked):**
  - **total_latency_ms**: Total latency across all LLM operations (milliseconds)
  - **average_latency_ms**: Average latency per operation (milliseconds)
  - **total_cost**: Total cost of all LLM operations (USD)
  - **total_tokens**: Total token consumption (input + output tokens)

  **Quality Metrics:**
  - **faithfulness**: Faithfulness score (0.0 to 1.0)
  - **average_score**: Average of all evaluated metrics

**Gateway Metrics Details:**

These gateway metrics are automatically extracted from the Flotorch Gateway response headers. They provide operational insights:
- **LATENCY**: Track response times to optimize performance
- **COST**: Monitor spending across all model calls
- **TOKEN_USAGE**: Understand consumption patterns for capacity planning

All gateway metrics are collected transparently - simply pass headers as metadata and the evaluator extracts them automatically.

This raw JSON format is useful for further analysis, exporting results, cost tracking, performance monitoring, or integrating with other tools.
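
For exporting, one minimal approach is to write the results to disk with the standard `json` module. Because the exact type of `eval_results` is not documented here, the sketch passes `default=str` so that values `json` cannot serialize natively are stringified instead of raising `TypeError`. `save_results`, the file name, and the sample payload (made-up values shaped like the fields listed above) are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def save_results(results: Any, path: str = "gateway_eval_results.json") -> Path:
    """Write evaluation results to a JSON file and return its path."""
    out = Path(path)
    with out.open("w", encoding="utf-8") as f:
        # default=str stringifies values json cannot serialize natively
        json.dump(results, f, indent=2, default=str)
    return out


# Illustrative payload with made-up values
sample = [{
    "input_query": "What percentage of global agricultural land is irrigated?",
    "evaluation_metrics": {"faithfulness": 1.0, "total_latency_ms": 950.0,
                           "total_cost": 0.0005, "total_tokens": 420},
}]
path = save_results(sample)
print(path.read_text(encoding="utf-8"))
```

In Colab, `files.download(str(path))` from `google.colab` would then download the file to your machine.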

In [None]:
print("--- Aggregate Evaluation Results ---")
print(json.dumps(eval_results, indent=2, default=str))  # default=str handles non-JSON-native values

## 12. Summary

### What We Accomplished

This notebook provided a complete, step-by-step workflow for tracking and evaluating gateway metrics (latency, cost, token usage) using Flotorch Eval.

### Workflow Summary

1. **Configured Infrastructure**
   - Set up `FlotorchLLM` for answer generation
   - Connected to `FlotorchVectorStore` for context retrieval
   - Initialized `LLMEvaluator` with Ragas engine for faithfulness scoring

2. **Generated Responses**
   - Loaded ground truth questions from `gt.json`
   - Retrieved relevant context from the Knowledge Base for each question
   - Generated answers using the inference LLM with retrieved context
   - **Captured gateway metrics** by using `return_headers=True` with `FlotorchLLM.invoke()`
   - **Automatically collected LATENCY, COST, and TOKEN_USAGE** from gateway response headers

3. **Evaluated Gateway Metrics & Quality Metrics**
   - **Automatically extracted gateway metrics** (LATENCY, COST, TOKEN_USAGE) from response headers
   - Scored each generated answer using the Ragas faithfulness metric
   - Combined operational metrics (latency, cost, tokens) with quality metrics (faithfulness) for comprehensive analysis

4. **Visualized Results**
   - Displayed **gateway metrics** (latency, cost, tokens) in a formatted table for quick analysis
   - Exported complete results as JSON including **all gateway metrics** (latency, cost, tokens)
   - Analyzed operational metrics (performance, cost, usage) alongside quality metrics
   - Monitored gateway metrics for cost and performance optimization

### Key Takeaways

#### Gateway Metrics (Automatically Tracked)
- **LATENCY**: Automatically tracked from Flotorch Gateway response headers - measures total and average latency across all LLM calls
- **COST**: Automatically extracted from gateway pricing data - tracks total cost of all LLM operations in USD
- **TOKEN_USAGE**: Automatically aggregated from gateway headers - monitors total token consumption (input + output)

**Gateway Metrics Benefits:**
- **No additional configuration required** - simply use `return_headers=True` with `FlotorchLLM.invoke()`
- **Transparent collection** - metrics are automatically extracted from response headers
- **Comprehensive tracking** - every LLM operation is automatically monitored
- **Cost optimization** - track spending patterns to optimize model usage
- **Performance monitoring** - identify latency bottlenecks for faster responses
- **Usage analytics** - understand token consumption for capacity planning

For detailed gateway metrics documentation, see the [Flotorch Eval GitHub repository](https://github.com/FissionAI/flotorch-eval/tree/develop).

#### Faithfulness Example (Ragas)

**Example Context**: “Corn requires 150-200 pounds of nitrogen per acre for optimal yields. Apply nitrogen in split applications: 50% at planting, 25% at vegetative stage, 25% at reproductive stage.”

- **Ragas Outcome**
  - Faithful answer: “Corn needs 150-200 pounds of nitrogen per acre, split into three applications at planting, vegetative, and reproductive stages” → score ≈ 1.0
  - Unfaithful answer: “Corn requires 300 pounds of nitrogen applied all at once at planting” → unsupported claim lowers score toward 0

