[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1QNILXJMSKUPHozqmSIq-tsHqpHRAFNPp/view?usp=sharing)

# Evaluating the Network Security Protocol Advisor with Flotorch Eval

This notebook provides a step-by-step guide to **evaluate a question-answering agent (RAG)** using the **Flotorch SDK** and **Flotorch Eval** library.  
The use case here is a **Network Security Protocol Advisor** — an LLM-powered assistant that answers operational and architectural questions about the "**Network Security Protocols – Comprehensive Guide**" covering encryption standards, authentication layers, VPN designs, wireless hardening, email security, and best-practice implementations across the OSI stack.

---


### **Use Case Overview**

The **Network Security Protocol Advisor** helps SecOps teams, architects, and auditors quickly retrieve grounded guidance on:
- **Encryption Protocols** (TLS handshake hardening, IPsec modes, AES modes, RSA vs. ECC tradeoffs)
- **Authentication Protocols** (Kerberos ticket flows, AD/LDAP binding over TLS, RADIUS vs. TACACS+ AAA policies, OAuth 2.0 and OpenID Connect safeguards)
- **Virtual Private Networks** (site-to-site IPsec negotiation, SSL/TLS clientless access, WireGuard key distribution, HA/failover topologies)
- **Wireless Security** (WPA3 SAE, 802.1X/EAP methods, WIDS/WIPS containment, Bluetooth and IoT segmentation strategies)
- **Email/Web/Perimeter Security** (SPF/DKIM/DMARC enforcement, web security controls, firewall ACL design, intrusion detection/prevention tuning)

Relevant passages are retrieved from the **Network Security Knowledge Base** compiled from the comprehensive guide above, ensuring that every answer maps back to vetted implementation details and policy controls.

This notebook focuses on evaluating **retrieval coverage** using the **DeepEval Context Recall metric**: verifying that the retriever consistently pulls the *right* network security passages, so that answers can cite the encryption suites, AAA flows, tunneling modes, or monitoring steps found in the source documentation.

---

### **Notebook Workflow**

We'll follow a structured evaluation process:

1. **Iterate Questions** – Loop through each network security scenario in the ground-truth set.  
2. **Retrieve Context** – Fetch relevant protocol guidance from the Network Security Knowledge Base.  
3. **Generate Answer** – Use the system prompt and LLM to produce a technically sound response.  
4. **Store Results** – Log each question, retrieved context, generated answer, and reference answer.  
5. **Evaluate Context Recall** – Use `LLMEvaluator` from Flotorch Eval to run the DeepEval Context Recall check.  
6. **Display Results** – Summarize recall outcomes in an at-a-glance table.

---

### **Metric Evaluated — Context Recall**

We track **Context Recall** to ensure the retriever surfaces all *relevant* portions of the underlying network security corpus. A score of 1 indicates the retrieved snippets contained the critical facts (cipher negotiation steps, AAA requirements, IDS workflows) needed to answer the question; 0 means the retriever missed essential context or returned unrelated passages.

#### DeepEval Context Recall (Flotorch `evaluation_engine="deepeval"`)
- Compares the reference (expected) answer with the retrieved context snippets to determine whether the retrieved evidence covers every fact the answer requires.  
- Highlights retrieval gaps so teams can expand the knowledge base, adjust search parameters, or refine prompt instructions.  
- Prioritizes *coverage* over style, which is crucial when validating that VPN runbooks, WPA3 onboarding steps, or SPF/DKIM policies were surfaced before the model responded.

---

### **Evaluation Engine**

- `evaluation_engine="auto"` — lets Flotorch Eval mix Ragas and DeepEval according to the priority routing described in the [flotorch-eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop) (Ragas first, DeepEval as fallback) so that faithfulness, answer relevance, and context metrics always run on the best available backend.
- `evaluation_engine="deepeval"` — routes metrics through DeepEval’s engine (faithfulness, answer relevancy, context relevancy, context precision, context recall, hallucination) while still capturing Flotorch gateway telemetry as documented in the [Flotorch Eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop). This mode is showcased later in the notebook.

In this notebook we rely on the DeepEval pathway to ensure network security guidance always references the exact encryption suites, access controls, and monitoring playbooks documented in the source guide.

---

### **Requirements**

- Flotorch account with configured LLM, embedding model, and Network Security Knowledge Base.  
- `gt.json` (or another ground-truth file) containing network security protocol Q&A pairs for evaluation.  
- `prompt.json` containing the system and user prompt templates tailored to network security analysts.

---
#### **Documentation References**
- [**flotorch-eval GitHub repo**](https://github.com/FissionAI/flotorch-eval/tree/develop) — reference implementation with sample notebooks and evaluation pipelines.  
- [**DeepEval Context Recall Documentation**](https://deepeval.com/docs/metrics-contextual-recall) — detailed explanation of the context recall metric.

## 1. Install Dependencies

First, we install the required libraries at pinned versions so the notebook runs reproducibly.

-   `flotorch`: The main Python SDK for interacting with all Flotorch services, including LLMs and Knowledge Bases.
-   `flotorch-eval`: The evaluation library. The install cell pins it without extras; the `[llm]` extra pulls in the dependencies required for LLM-based (Ragas) evaluations if you need them.
-   `opentelemetry-instrumentation-httpx`: Pinned to a specific version in the same cell.

In [None]:
# Install the Flotorch SDK (flotorch) and the evaluation library (flotorch-eval)
# You can safely ignore dependency errors during installation.

%pip install flotorch==3.1.0b1 flotorch-eval==2.0.0
%pip install opentelemetry-instrumentation-httpx==0.58b0

## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a powerful, separate model like `flotorch/gpt-4o` to ensure an unbiased, high-quality judgment.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

### Example :

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key | `sk_...` |
| `FLOTORCH_BASE_URL` | Gateway endpoint | `https://gateway.flotorch.cloud` |
| `inference_model_name` | The LLM your agent uses to generate answers | `flotorch/gpt-4o-mini` |
| `evaluation_llm_model_name` | The LLM used to evaluate the answers | `flotorch/gpt-4o` |
| `evaluation_embedding_model_name` | Embedding model for semantic similarity checks | `open-ai/text-embedding-ada-002` |
| `knowledge_base_repo` | The ID of your Flotorch Knowledge Base | `digital-twin` |

In [None]:
import getpass  # Prompts for the key without echoing it in the notebook output

# Collect the Flotorch credentials
try:
    FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ")
    print("Success")
except getpass.GetPassWarning as e:
    print(f"Warning: {e}")
    FLOTORCH_API_KEY = ""

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ")  # Gateway or console endpoint

inference_model_name = "flotorch/<your-model-name>"  # Model generating answers
evaluation_llm_model_name = "flotorch/<your_model_name>"  # Model judging answer quality
evaluation_embedding_model_name = "flotorch/<embedding_model_name>"  # Embedding model for similarity checks

knowledge_base_repo = "<your_knowledge_base_id>" #Knowledge_base ID

## 3. Import Required Libraries

### Purpose
Import all required components for evaluating the RAG assistant.

### Key Components
- `json` : Loads configuration files and ground truth data from disk
- `tqdm` : Shows a lightweight progress bar while iterating over evaluation items
- `FlotorchLLM` : Connects to the Flotorch inference endpoint for answer generation
- `FlotorchVectorStore` : Retrieves context snippets from the configured knowledge base
- `memory_utils` : Utility helpers for extracting text from vector-store search results
- `LLMEvaluator`, `EvaluationItem`, `MetricKey` : Run metric scoring for the generated answers

In [None]:
#Required imports
import json
from typing import List
from tqdm import tqdm # Use standard tqdm for simple progress bars
from google.colab import files

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey
from flotorch_eval.llm_eval import display_llm_evaluation_results

print("Imported necessary libraries successfully")

## 4. Load Data and Prompts

### Purpose
Here, we load our ground truth questions (`gt.json`) and the agent prompts (`prompt.json`) by uploading them through the Colab file picker.

### Files Required

**1. `gt.json` (Ground Truth)**  
Contains question-answer pairs for evaluation. Each `answer` is the expected correct response.

```json
[
  {
    "question": "What port does LDAPS (LDAP over SSL/TLS) use?",
    "answer": "LDAPS uses port 636."
  },
  {
    "question": "What ports does RADIUS use for authentication and accounting?",
    "answer": "RADIUS uses UDP port 1812 for authentication and port 1813 for accounting."
  }
]
```

**2. `prompt.json` (Agent Prompts)**  
Defines the system prompt and user prompt template with `{context}` and `{question}` placeholders for dynamic formatting.

```json
{
  "system_prompt": "You are a helpful Network Security assistant. Answer based only on the context provided.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```

### Instructions
Run the next cell and, when prompted, upload `gt.json` first and `prompt.json` second. The uploaded filenames are stored in `gt_path` and `prompts_path`.
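
Because Context Recall is scored against each expected `answer`, an entry with a missing or empty field can distort the results without raising an error. The following is a minimal sketch of a pre-flight check; `validate_ground_truth` is a hypothetical helper, not part of the Flotorch SDK, and it assumes the file follows the `question`/`answer` layout shown above.

```python
from typing import Any, List


def validate_ground_truth(items: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data looks usable."""
    if not isinstance(items, list):
        return ["top-level JSON value must be a list of objects"]
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected an object, got {type(item).__name__}")
            continue
        # Both fields must be non-empty strings for the evaluation to be meaningful.
        for key in ("question", "answer"):
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty '{key}'")
    return problems


sample = [
    {"question": "What port does LDAPS (LDAP over SSL/TLS) use?", "answer": "LDAPS uses port 636."},
    {"question": "What ports does RADIUS use for authentication and accounting?"},
]
print(validate_ground_truth(sample))
```

In the notebook, the same check could be applied to `ground_truth` right after `json.load`.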

In [None]:
print("Please upload your Ground Truth file (gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully — {len(ground_truth)} items\n")


print("Please upload your Prompts file (prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully: {len(prompt_config)} keys")

## 5. Define Helper Function

### Purpose
Create a prompt-formatting helper for LLM message construction.

### Functionality
The `create_messages` function:
- Builds the final prompt that will be sent to the LLM.
- Accepts system prompt, user prompt template, question, and retrieved context chunks
- Replaces `{context}` and `{question}` placeholders in the user prompt
- Returns a structured message list with (`{role: ..., content: ...}`) fields ready for LLM consumption
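
Plain string replacement passes a template through unchanged when a placeholder is absent, so a template without `{context}` would silently drop the retrieved passages. A minimal sketch of a guard against that is shown here; `missing_placeholders` and `REQUIRED_PLACEHOLDERS` are hypothetical names introduced for illustration.

```python
from typing import List

# Placeholders the user prompt template is expected to contain.
REQUIRED_PLACEHOLDERS = ("{context}", "{question}")


def missing_placeholders(template: str) -> List[str]:
    """Return the required placeholders that do not appear in the template."""
    return [p for p in REQUIRED_PLACEHOLDERS if p not in template]


good = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
bad = "Question:\n{question}\n\nAnswer:"
print(missing_placeholders(good))
print(missing_placeholders(bad))
```

Running this check on `user_prompt_template` before the experiment loop makes a misconfigured prompt file visible early.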

In [None]:
def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    """
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages

## 6. Initialize Clients

### Purpose
Set up the infrastructure for RAG pipeline execution.

### Components Initialized
1. **FlotorchLLM** (`inference_llm`): Connects to the LLM endpoint for generating answers based on retrieved context
2. **FlotorchVectorStore** (`kb`): Connects to the Knowledge Base for semantic search and context retrieval
3. **Prompt Variables**: Extracts system prompt and user prompt template from `prompt_config` for dynamic message formatting

These clients power the evaluation loop by retrieving relevant context and generating answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
user_prompt_template = prompt_config.get("user_prompt_template", "{question}")

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

### Purpose
Execute the full RAG pipeline (retrieve, prompt, generate) for each question to produce the answers and contexts that will be evaluated.

### Pipeline Steps
For each question in `ground_truth`, the loop performs:

1. **Retrieve Context**: Searches the Knowledge Base (`kb.search()`) to fetch relevant context passages
2. **Build Messages**: Uses `create_messages()` to format the system prompt, user prompt, question, and retrieved context into LLM-ready messages
3. **Generate Answer**: Invokes the inference LLM (`inference_llm.invoke()`) with `return_headers=True` to capture response metadata (cost, latency, tokens)
4. **Store for Evaluation**: Packages question, generated answer, expected answer, context, and metadata into an `EvaluationItem` object

### Error Handling
A `try...except` block gracefully handles API failures, storing error messages as evaluation items to ensure the loop completes without crashes.
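
Failed items are stored with an empty context and an `error` entry in their metadata, so they would likely score as retrieval misses if evaluated as-is. The sketch below separates them before scoring. It assumes the error convention described above and uses a small `Item` dataclass as a stand-in for `EvaluationItem`, so it runs without the Flotorch libraries; `split_failures` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Item:
    """Stand-in for EvaluationItem with only the fields this sketch needs."""
    question: str
    generated_answer: str
    context: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_failures(items: List[Item]) -> Tuple[List[Item], List[Item]]:
    """Separate items whose metadata carries an 'error' key from the rest."""
    ok = [item for item in items if "error" not in (item.metadata or {})]
    failed = [item for item in items if "error" in (item.metadata or {})]
    return ok, failed


items = [
    Item("What port does LDAPS use?", "LDAPS uses port 636.", ["LDAPS listens on TCP 636."]),
    Item("What ports does RADIUS use?", "Error: timeout", [], {"error": "timeout"}),
]
ok, failed = split_failures(items)
print(f"{len(ok)} scorable, {len(failed)} failed")
```

Reporting the failed questions separately keeps API outages from being mistaken for retrieval gaps.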


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers,
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"Experiment completed. {len(evaluation_items)} items are ready for evaluation.")

## 8. Initialize the Evaluator (DeepEval)

### Using DeepEval Context Recall

Now that we have our `evaluation_items` list (containing the generated answers), we switch the `LLMEvaluator` to the **DeepEval** backend so every score reflects how completely the retrieved passages cover the required network security facts.

This class remains the *“head judge”* for the evaluation loop; we’re simply selecting the DeepEval rubric that specializes in context adequacy for encryption suites, AAA flows, VPN modes, and detection runbooks.

### Parameter Insights (DeepEval Mode)

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — DeepEval-powered scoring still needs the evaluator LLM and embeddings for semantic checks.  
- **`evaluation_engine="deepeval"`** — Routes metrics through DeepEval, which (per the [flotorch-eval repository](https://github.com/FissionAI/flotorch-eval/tree/develop)) unlocks the following metric keys:  
  - **`MetricKey.FAITHFULNESS`**
  - **`MetricKey.ANSWER_RELEVANCY`**
  - **`MetricKey.CONTEXT_RELEVANCY`**
  - **`MetricKey.CONTEXT_PRECISION`**
  - **`MetricKey.CONTEXT_RECALL`**
  - **`MetricKey.HALLUCINATION`**
  These are the same metrics surfaced in Flotorch’s *auto* mode when Ragas prerequisites (like embeddings) are missing.  
- **`metrics`** — For this notebook we register only `MetricKey.CONTEXT_RECALL`, keeping the focus on whether the retrieved snippets contain every protocol detail needed.  
- **`metric_configs`** — Pass DeepEval-specific arguments such as a `"threshold"` (e.g., `0.8`) to trigger pass/fail decisions.  
- **Thresholds** — Set between `0.0–1.0`; network security reviews often target `1.0` to ensure no encryption, AAA, or VPN step is missed.

DeepEval’s contextual recall rubric expects each test case to include the `input`, `actual_output`, `expected_output`, and `retrieval_context` fields so it can compare the gold-standard answer with the passages returned by your retriever. The judge extracts every statement from the `expected_output` and computes the percentage that can be attributed to at least one snippet in the `retrieval_context`, i.e., `Contextual Recall = attributable statements / total statements`, as documented in the official guidance ([DeepEval Contextual Recall docs](https://deepeval.com/docs/metrics-contextual-recall)).

### DeepEval Context Recall Metric

**Definition**: verifies that the retrieved network security context fully covers the facts stated in the expected (reference) answer, using the [DeepEval context recall rubric](https://deepeval.com/docs/metrics-contextual-recall). A score of 1 means every critical detail (cipher choice, tunnel mode, authentication exchange, IDS response) appears in the retrieved snippets; anything below the threshold signals missing evidence.

**How It Works**:
1. DeepEval extracts key statements from the expected answer using a grading LLM.  
2. Each statement is checked against the retrieved network security passages, with rationales describing any missing coverage.  
3. The final context-recall score reflects the proportion of supported statements and is compared against the configured threshold.  
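
The arithmetic behind these steps can be sketched in a few lines. This is an illustration only, not DeepEval's implementation: the real metric uses a grading LLM for both statement extraction and attribution, whereas `naive_judge` here is a toy substring check, and returning `0.0` for an empty statement list is an assumption of this sketch.

```python
from typing import Callable, List


def contextual_recall(
    statements: List[str],
    retrieval_context: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Fraction of expected-answer statements attributable to the retrieved context."""
    if not statements:
        return 0.0  # assumption: nothing to attribute counts as zero recall
    supported = sum(1 for s in statements if is_supported(s, retrieval_context))
    return supported / len(statements)


def naive_judge(statement: str, retrieval_context: List[str]) -> bool:
    """Toy judge: supported if any snippet contains the statement (case-insensitive)."""
    needle = statement.lower()
    return any(needle in snippet.lower() for snippet in retrieval_context)


# Statements taken from the expected answer of the RADIUS sample question.
statements = ["UDP port 1812 for authentication", "port 1813 for accounting"]
context = ["RADIUS uses UDP port 1812 for authentication."]
score = contextual_recall(statements, context, naive_judge)
threshold = 0.7
print(score, score >= threshold)
```

Here only one of the two statements is covered by the retrieved snippet, so the score falls below the threshold and the item would be marked as failing.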

**Example**:

*Question*: "Describe the TLS 1.3 handshake phases and why forward secrecy matters."

- *Pass Scenario* (Score = 1.0): The retrieved TLS section covers ClientHello, ServerHello, certificate validation, ECDHE key exchange, and the forward secrecy requirements, so every statement in the expected answer is supported.  
- *Fail Scenario* (Score = 0.0): The expected answer discusses the TLS handshake and forward secrecy, but the retrieved snippets only cover IPsec tunnel mode, so DeepEval flags the missing TLS context and the evaluation fails.  

This mirrors the broader RAG workflow while delivering guardrail-ready signals tailored to network security coverage.  


In [None]:
# Configure DeepEval Context Recall thresholds
metric_args = {
    "context_recall": {"threshold": 0.7},
}

# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.CONTEXT_RECALL,
    ],
    evaluation_engine="deepeval",
    metric_configs=metric_args
)

print("LLMEvaluator client initialized.")

## 9. Run Evaluation (DeepEval)

### Purpose
Execute the evaluation process to score all generated answers using the **Context Recall** metric.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- For each evaluation item:
  - The evaluator scores **Context Recall** by comparing the expected answer against the retrieved context.

- Collect the following outputs:
  - Context Recall scores
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.
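
The difference between the two execution styles can be illustrated with plain `asyncio`. This sketch does not call Flotorch Eval: `score_item` is a hypothetical stand-in for one judged scoring call, with a short sleep simulating network latency. It uses `asyncio.run` because it is written as a standalone script; inside a notebook, where an event loop is already running, top-level `await` is used instead.

```python
import asyncio
from typing import List


async def score_item(item: str) -> float:
    """Hypothetical stand-in for one LLM-judged scoring call."""
    await asyncio.sleep(0.01)  # simulates network latency
    return float(len(item) % 2)


async def score_sequentially(items: List[str]) -> List[float]:
    """Await each call before starting the next one."""
    return [await score_item(item) for item in items]


async def score_concurrently(items: List[str]) -> List[float]:
    """Start all calls together; gather returns results in input order."""
    return list(await asyncio.gather(*(score_item(item) for item in items)))


items = ["q1", "q22", "q333"]
sequential = asyncio.run(score_sequentially(items))
concurrent = asyncio.run(score_concurrently(items))
print(sequential == concurrent)
```

Both paths produce the same scores in the same order; the concurrent one overlaps the waiting time of the individual calls, which is where the speed-up comes from.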

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute Context Recall scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.

### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")


### Synchronous Evaluation (uncomment the code below to run synchronously)

In [None]:
# print("Starting evaluation... This may take a few minutes.")

# eval_results = evaluator_client.evaluate(evaluation_items)

# print("Evaluation complete.")

## 10. View Per-Question Results (DeepEval)

### Purpose
Display contextual recall scores in a compact table so security reviewers can confirm that each answer had the proper supporting passages retrieved.

In [None]:
display_llm_evaluation_results(eval_results)

## 11. View Raw JSON Results

### Purpose
Display the complete evaluation results in JSON format for detailed inspection and programmatic access.

### Output Structure
The JSON output includes for each question:
- **model**: The evaluation LLM model used
- **input_query**: The original question
- **context**: Full retrieved context passages (not truncated)
- **generated_answer**: Complete LLM-generated response
- **groundtruth_answer**: Expected correct answer
- **evaluation_metrics**: Dictionary containing:
  - **context_recall**: DeepEval contextual recall score between `0` and `1`
  - **total_latency_ms**: Total evaluation time in milliseconds
  - **total_cost**: Cost of evaluation in USD
  - **total_tokens**: Token count for evaluation

This raw JSON output is useful for follow-up audits, regression tracking, or downstream automation.
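
As one example of downstream use, the sketch below computes the mean Context Recall and lists the questions that fall below a threshold. It assumes the results are available as a list of per-question records using the field names listed above; the exact top-level shape of `eval_results` may differ, and `summarize_recall` and the sample records are hypothetical.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize_recall(records: List[Dict[str, Any]], threshold: float) -> Dict[str, Any]:
    """Compute mean context recall and collect questions scoring below the threshold."""
    scored = [
        (r["input_query"], r["evaluation_metrics"]["context_recall"])
        for r in records
        if r.get("evaluation_metrics", {}).get("context_recall") is not None
    ]
    scores = [score for _, score in scored]
    return {
        "count": len(scored),
        "mean_context_recall": mean(scores) if scores else None,
        "below_threshold": [q for q, score in scored if score < threshold],
    }


# Illustrative records only; not real evaluation output.
records = [
    {"input_query": "What port does LDAPS use?", "evaluation_metrics": {"context_recall": 1.0}},
    {"input_query": "What ports does RADIUS use?", "evaluation_metrics": {"context_recall": 0.5}},
]
summary = summarize_recall(records, threshold=0.7)
print(summary)
```

Tracking the `below_threshold` list across runs gives a simple regression signal for retriever changes.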

In [None]:
print("--- Raw Evaluation Results ---")
print(json.dumps(eval_results, indent=2))

## 12. Summary

### What We Accomplished

This notebook delivered an end-to-end workflow for evaluating a Network Security Protocol Advisor with Flotorch Eval using the DeepEval contextual recall metric.

### Workflow Summary

1. **Configured Infrastructure**
   - Set up `FlotorchLLM` for answer generation.  
   - Connected to `FlotorchVectorStore` for protocol retrieval.  
   - Initialized `LLMEvaluator` with the DeepEval engine targeting contextual recall.  

2. **Generated Responses**
   - Loaded network security ground-truth questions from `gt.json`.  
   - Retrieved relevant guide excerpts for each question.  
   - Generated answers with the inference LLM and captured gateway metadata (latency, cost, tokens).  

3. **Evaluated Contextual Recall**
   - Ran DeepEval contextual recall scoring over every response.  
   - Verified whether each expected answer’s statements were supported by the retrieved context.  
   - Recorded contextual recall scores alongside gateway diagnostics for governance.  

4. **Visualized Results**
   - Displayed per-question contextual recall scores in a reviewer-friendly table.  
   - Exported the full JSON payload for auditing or automation.  

### Key Takeaways

- **Context Recall = 1.0** signals complete retrieval coverage; lower scores highlight missing network security facts in the retrieved snippets.  
- DeepEval judges **Expected Answer ↔ Retrieved Context**, which is ideal for validating encryption, AAA, VPN, and detection guidance coverage.  
- Monitoring contextual recall keeps focus on retriever quality, helping SecOps teams close documentation gaps before they impact production playbooks.