# Notebook 3 (Industrial Edition): Parallel Evaluation & Multi-Critic Reflection

## Introduction: Building a Scalable AI Governance System

This notebook explores a parallelism pattern crucial for enterprise-grade AI applications: **Parallel Evaluation & Multi-Critic Reflection**. The core idea is to move beyond a single, monolithic evaluation step and instead create a "panel" of specialized AI critics that analyze a piece of content simultaneously, each with a unique area of expertise.

### Why is this essential for production AI?

In a real-world business context, a "good" piece of content isn't just well-written. It must also be factually accurate, on-brand, legally compliant, and ethically sound. A single LLM call trying to check for all these things at once is prone to errors and hallucinations. By creating a team of specialists, we improve the reliability and depth of the evaluation process. By running them in parallel, we do so without creating a massive performance bottleneck.

### Role in a Large-Scale System: Implementing Scalable Governance & Quality Assurance

This architecture is the foundation for any automated AI governance or quality control workflow. It is essential for systems that generate or process high-stakes content, such as:
- **Marketing Automation:** Ensuring all generated ad copy and social media posts are on-brand and compliant.
- **Legal Tech:** Verifying AI-drafted contracts for factual accuracy and legal risk.
- **Customer Support:** Auditing AI-generated support responses for politeness, accuracy, and adherence to company policy.

We will build a content review workflow using LangGraph, where an initial draft is fanned out to a team of parallel critics before a final editor makes a decision based on their collective feedback.
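
Before bringing in LangGraph, the fan-out/fan-in idea can be sketched with the standard library alone. The critics below are hypothetical stand-ins that sleep instead of calling a model, and their names and latencies are assumptions chosen for illustration. The point is only that the total wall-clock time tracks the slowest critic rather than the sum of all three.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_critic(name: str, delay_s: float):
    """Build a stand-in critic that 'thinks' for delay_s seconds."""
    def critic(content: str) -> dict:
        time.sleep(delay_s)  # placeholder for an LLM or tool call
        return {"critic": name, "chars_reviewed": len(content)}
    return critic


# Hypothetical critics and latencies, chosen only for illustration.
critics = {
    "fact_checker": make_critic("fact_checker", 0.3),
    "brand_voice": make_critic("brand_voice", 0.1),
    "risk_assessor": make_critic("risk_assessor", 0.2),
}


def review_in_parallel(content: str) -> dict:
    """Fan out to every critic, then fan in by collecting all results."""
    with ThreadPoolExecutor(max_workers=len(critics)) as pool:
        futures = {name: pool.submit(fn, content) for name, fn in critics.items()}
        return {name: fut.result() for name, fut in futures.items()}


start = time.perf_counter()
results = review_in_parallel("Sample draft")
elapsed = time.perf_counter() - start
print(f"{len(results)} critiques collected in {elapsed:.2f}s")
```

Run sequentially, the same three stand-ins would take roughly the sum of their delays. The graph built later in this notebook expresses the same shape declaratively instead of with an explicit thread pool.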

## Part 1: Setup and Environment

We'll install our standard libraries and configure the environment. For this notebook, we will use the `Tavily` search API for our Fact-Checker agent.

In [None]:
%pip install -U langchain langchain-community langgraph langsmith langchain-huggingface transformers accelerate bitsandbytes torch tavily-python

### 1.2: API Keys and Environment Configuration

We will need LangSmith, Hugging Face, and Tavily API keys.

In [None]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("LANGCHAIN_API_KEY")
_set_env("HUGGING_FACE_HUB_TOKEN")
_set_env("TAVILY_API_KEY")

# Configure LangSmith for tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Industrial - Parallel Evaluation"

## Part 2: Core Components for the Critic Panel

This system requires several distinct components: the LLM, a search tool for the fact-checker, structured output models for the critiques, and specialized prompts for each critic role.

### 2.1: The Language Model (LLM)

We will continue to use `meta-llama/Meta-Llama-3-8B-Instruct` for its strong instruction-following capabilities, which are essential for making our agents adopt their specific critic personas.

In [None]:
from langchain_huggingface import HuggingFacePipeline
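from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # 4-bit quantization is configured through BitsAndBytesConfig; passing
    # load_in_4bit directly to from_pretrained is deprecated in recent transformers releases.
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)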

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=2048,
    do_sample=False, # We want deterministic, critical evaluation
    repetition_penalty=1.1
)

llm = HuggingFacePipeline(pipeline=pipe)

print("LLM Initialized. Ready to power our panel of critics.")

LLM Initialized. Ready to power our panel of critics.


### 2.2: The Fact-Checker's Tool

Our Fact-Checker critic needs a tool to verify claims against the real world. We will use the `TavilySearchResults` tool for this purpose.

In [None]:
from langchain_community.tools.tavily_search import TavilySearchResults

search_tool = TavilySearchResults(max_results=3)

### 2.3: Structured Output Models (Pydantic)

To ensure our critics provide consistent and machine-readable feedback, we'll define a Pydantic schema for their output. This allows the final editor to easily parse and aggregate the feedback.

In [None]:
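# Import from pydantic directly; the langchain_core.pydantic_v1 shim is deprecated in recent langchain-core releases.
from pydantic import BaseModel, Field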
from typing import List, Literal

class Critique(BaseModel):
    """A structured critique from a specialist critic."""
    is_compliant: bool = Field(description="Whether the content meets the specific criteria of this critic.")
    feedback: str = Field(description="Detailed feedback explaining why the content is or is not compliant. Provide actionable suggestions if non-compliant.")

class FinalDecision(BaseModel):
    """The final decision made by the chief editor after reviewing all critiques."""
    decision: Literal["Approve", "Request Revisions", "Reject"] = Field(description="The final verdict for the content.")
    summary_of_feedback: str = Field(description="A summary of all critiques, justifying the final decision.")
    revision_instructions: str = Field(description="If the decision is 'Request Revisions', provide clear, actionable instructions for the author.", default="")
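
To make the value of a fixed schema concrete, here is a minimal sketch of turning raw model text into a validated critique using only the standard library. `CritiqueRecord` and `parse_critique` are hypothetical names introduced for this illustration; `CritiqueRecord` simply mirrors the two fields of `Critique`, and the sketch assumes the model was asked to answer with a JSON object.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class CritiqueRecord:
    """Stand-in that mirrors the two fields of the Critique schema."""
    is_compliant: bool
    feedback: str


def parse_critique(raw_text: str) -> CritiqueRecord:
    """Extract the outermost brace-delimited span from model text and validate it."""
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output.")
    data = json.loads(match.group(0))  # raises a ValueError subclass on bad JSON
    if not isinstance(data.get("is_compliant"), bool):
        raise ValueError("'is_compliant' must be a boolean.")
    if not isinstance(data.get("feedback"), str):
        raise ValueError("'feedback' must be a string.")
    return CritiqueRecord(data["is_compliant"], data["feedback"])


raw = 'Sure, here is my review: {"is_compliant": false, "feedback": "Avoid hype."}'
critique = parse_critique(raw)
print(critique)
```

Rejecting malformed output at this boundary means a downstream aggregation step only ever sees well-typed fields, which is the same guarantee the Pydantic models are meant to provide.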

### 2.4: Defining the Critic & Editor Prompts

Each node in our graph needs a carefully crafted prompt to define its persona and objective.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

# The Fact-Checker Critic is special because it uses a tool, so no prompt is defined for it here.
# It is built as an agent in Part 3, using a prompt pulled from the LangChain Hub.

brand_voice_prompt = ChatPromptTemplate.from_messages([
    ("system",
     # Trailing spaces keep the concatenated sentences from running together.
     "You are a meticulous Brand Voice Analyst. Your sole job is to evaluate a piece of content against the company's brand voice guidelines. "
     "Brand Voice Guidelines: We are professional, but approachable. We use clear and concise language. We avoid hype and exaggeration. We are optimistic and focus on customer empowerment. "
     "Evaluate the following content based *only* on these guidelines."),
    ("human", "Content to evaluate:\n\n---\n{content_to_review}\n---")
])

risk_assessor_prompt = ChatPromptTemplate.from_messages([
    ("system",
     # Trailing spaces keep the concatenated sentences from running together.
     "You are a cautious Risk Assessor. Your job is to evaluate a piece of content for potential legal, ethical, reputational, or security risks. "
     "Look for: promissory language, unsupported claims, sensitive data, controversial topics, and potential for misinterpretation. "
     "Evaluate the following content based *only* on these risk criteria."),
    ("human", "Content to evaluate:\n\n---\n{content_to_review}\n---")
])

chief_editor_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are the Chief Editor. You have received feedback on a piece of content from your specialist critics. Your task is to synthesize their feedback, make a final decision (Approve, Request Revisions, or Reject), and provide a clear justification."),
    ("human", "Content under review:\n\n---\n{content_to_review}\n---"
     "\nHere is the feedback from your team:\n\n{critiques}\n\nBased on this, what is your final decision?")
])

## Part 3: Building the Parallel Evaluation Graph

We will now construct the graph, orchestrating the parallel critiques and the final aggregation step.

### 3.1: Defining the Graph State

The state needs to track the content being reviewed, the critiques from each parallel branch, the final decision, and our performance log.

In [None]:
from typing import TypedDict, Annotated, List, Dict
import operator

class GraphState(TypedDict):
    content_to_review: str
    # The dictionary stores critiques from each parallel critic.
    # operator.or_ merges the per-critic dicts (dict | dict, Python 3.9+); operator has no `update` function.
    critiques: Annotated[Dict[str, Critique], operator.or_]
    final_decision: FinalDecision
    performance_log: Annotated[List[str], operator.add]
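
The reducers attached through `Annotated` decide how updates from parallel branches are combined into one state. The sketch below imitates that merge in plain Python so the behavior is easy to inspect. It is not LangGraph's implementation; `apply_updates` is a hypothetical helper, the string critiques are placeholders, and the sketch assumes a dict-merging reducer (`operator.or_`, Python 3.9+) for `critiques` and `operator.add` for `performance_log`.

```python
import operator

# Partial updates, shaped like the dicts three parallel critic nodes might return.
updates = [
    {"critiques": {"FactChecker": "fc-critique"}, "performance_log": ["[FactChecker] done"]},
    {"critiques": {"BrandVoice": "bv-critique"}, "performance_log": ["[BrandVoice] done"]},
    {"critiques": {"RiskAssessor": "ra-critique"}, "performance_log": ["[RiskAssessor] done"]},
]

# One reducer per state key: dicts are merged, lists are concatenated.
reducers = {"critiques": operator.or_, "performance_log": operator.add}


def apply_updates(state: dict, node_updates: list) -> dict:
    """Fold each node's partial update into the state using the key's reducer."""
    merged = dict(state)
    for update in node_updates:
        for key, value in update.items():
            if key in reducers and key in merged:
                merged[key] = reducers[key](merged[key], value)
            else:
                merged[key] = value  # keys without a reducer are overwritten
    return merged


state = apply_updates({"critiques": {}, "performance_log": []}, updates)
print(sorted(state["critiques"]))
print(len(state["performance_log"]))
```

Without a reducer, three branches writing to the same key in the same step would conflict. With one, each critic can return only its own entry and the entries accumulate.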

### 3.2: Defining the Graph Nodes (The Critics and Editor)

We will define a node for each critic and one for the chief editor. The Fact-Checker node is more complex as it's a mini-agent that can use a tool.

In [None]:
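from langchain import hub
from langchain.agents import create_xml_agent, AgentExecutor
import time

# Node 1: Fact-Checker Agent
# This is a more complex node. It's a self-contained agent that can use the search tool.
# The hub prompt "hwchase17/xml-agent-convo" is written for XML-style agents, so it is paired
# with create_xml_agent; create_tool_calling_agent expects a different prompt layout and a
# model with native tool-calling support.
fact_checker_prompt = hub.pull("hwchase17/xml-agent-convo")
fact_checker_agent = create_xml_agent(llm, [search_tool], fact_checker_prompt)
fact_checker_executor = AgentExecutor(agent=fact_checker_agent, tools=[search_tool], handle_parsing_errors=True)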

def fact_checker_node(state: GraphState):
    """An agent that verifies the factual claims in the content."""
    print("--- CRITIC: Fact-Checker is investigating... ---")
    start_time = time.time()
    
    response = fact_checker_executor.invoke({
        "input": f"Verify the factual claims in the following content. Determine if it is compliant (factually accurate) or not, and provide detailed feedback. Content: {state['content_to_review']}"
    })
    
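    # The agent's output is natural language, so we use another LLM call to structure it.
    # Note: this relies on the model wrapper implementing with_structured_output;
    # wrappers that do not implement it need an explicit output parser instead.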
    structured_llm = llm.with_structured_output(Critique)
    critique = structured_llm.invoke(f"Based on the following analysis, please provide a structured critique: {response['output']}")
    
    execution_time = time.time() - start_time
    log_entry = f"[FactChecker] Completed in {execution_time:.2f}s."
    print(log_entry)
    
    return {"critiques": {"FactChecker": critique}, "performance_log": [log_entry]}

In [None]:
# Node 2: Brand Voice Analyst
def brand_voice_node(state: GraphState):
    """A critic that evaluates content against brand voice guidelines."""
    print("--- CRITIC: Brand Voice Analyst is reviewing... ---")
    start_time = time.time()
    
    brand_chain = brand_voice_prompt | llm.with_structured_output(Critique)
    critique = brand_chain.invoke({"content_to_review": state['content_to_review']})
    
    execution_time = time.time() - start_time
    log_entry = f"[BrandVoice] Completed in {execution_time:.2f}s."
    print(log_entry)
    
    return {"critiques": {"BrandVoice": critique}, "performance_log": [log_entry]}

In [None]:
# Node 3: Risk Assessor
def risk_assessor_node(state: GraphState):
    """A critic that evaluates content for potential risks."""
    print("--- CRITIC: Risk Assessor is scanning... ---")
    start_time = time.time()
    
    risk_chain = risk_assessor_prompt | llm.with_structured_output(Critique)
    critique = risk_chain.invoke({"content_to_review": state['content_to_review']})
    
    execution_time = time.time() - start_time
    log_entry = f"[RiskAssessor] Completed in {execution_time:.2f}s."
    print(log_entry)
    
    return {"critiques": {"RiskAssessor": critique}, "performance_log": [log_entry]}

In [None]:
# Node 4: Chief Editor (Aggregation & Decision)
def chief_editor_node(state: GraphState):
    """Aggregates critiques and makes a final decision."""
    print("--- EDITOR: Chief Editor is making a decision... ---")
    start_time = time.time()
    
    # Format the critiques for the editor's prompt
    critiques_str = ""
    for critic_name, critique_obj in state['critiques'].items():
        critiques_str += f"- {critic_name} Critique:\n  - Compliant: {critique_obj.is_compliant}\n  - Feedback: {critique_obj.feedback}\n\n"
    
    editor_chain = chief_editor_prompt | llm.with_structured_output(FinalDecision)
    final_decision = editor_chain.invoke({
        "content_to_review": state['content_to_review'],
        "critiques": critiques_str
    })
    
    execution_time = time.time() - start_time
    log_entry = f"[ChiefEditor] Completed in {execution_time:.2f}s."
    print(log_entry)
    
    return {"final_decision": final_decision, "performance_log": [log_entry]}
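
The formatting step inside the editor node can be isolated and tested without a model. The following is a minimal sketch under stated assumptions: `CritiqueRecord` is a hypothetical stand-in for `Critique`, and `format_critiques` is a hypothetical helper that sorts critics by name so the editor's prompt is identical across runs, whatever order the parallel branches were merged in.

```python
from dataclasses import dataclass


@dataclass
class CritiqueRecord:
    """Stand-in that mirrors the two fields of the Critique schema."""
    is_compliant: bool
    feedback: str


def format_critiques(critiques: dict) -> str:
    """Render critiques as a bulleted block, ordered by critic name."""
    blocks = []
    for name in sorted(critiques):
        item = critiques[name]
        blocks.append(
            f"- {name} Critique:\n"
            f"  - Compliant: {item.is_compliant}\n"
            f"  - Feedback: {item.feedback}"
        )
    return "\n\n".join(blocks)


# Hypothetical critiques, used only to exercise the helper.
sample = {
    "RiskAssessor": CritiqueRecord(False, "Remove the guarantee."),
    "BrandVoice": CritiqueRecord(False, "Tone down the hype."),
}
print(format_critiques(sample))
```

A stable ordering matters with greedy decoding: identical critiques then produce an identical prompt, which makes the editor's decision easier to reproduce and to compare across runs.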

### 3.3: Assembling the Graph

This graph has a "fan-out, fan-in" structure. The entry point fans out to all three critic nodes, which run in parallel. After they all complete, their results are automatically aggregated into the `critiques` dictionary in the state, and the flow converges on the `chief_editor` node.

In [None]:
from langgraph.graph import StateGraph, START, END

# Initialize a new graph
workflow = StateGraph(GraphState)

# Define the nodes
workflow.add_node("fact_checker", fact_checker_node)
workflow.add_node("brand_voice_analyst", brand_voice_node)
workflow.add_node("risk_assessor", risk_assessor_node)
workflow.add_node("chief_editor", chief_editor_node)

# Fan out from START to all three critics so they run in the same step.
# (set_entry_point accepts a single node name, so each branch gets its own edge.)
workflow.add_edge(START, "fact_checker")
workflow.add_edge(START, "brand_voice_analyst")
workflow.add_edge(START, "risk_assessor")

# After the critics finish, their results converge and are passed to the chief editor
workflow.add_edge(["fact_checker", "brand_voice_analyst", "risk_assessor"], "chief_editor")

# The editor's decision is the final step
workflow.add_edge("chief_editor", END)

# Compile the graph
app = workflow.compile()

print("Graph constructed and compiled successfully.")
print("The content review system is online.")

Graph constructed and compiled successfully.
The content review system is online.


### 3.4: Visualizing the Graph

The visualization clearly shows the fan-out/fan-in structure.

**Diagram Description:** The `__start__` node has three arrows pointing to `fact_checker`, `brand_voice_analyst`, and `risk_assessor` respectively. Each of these three critic nodes then has an arrow pointing to the single `chief_editor` node, which in turn points to `__end__`.

In [None]:
# from IPython.display import Image
# Image(app.get_graph().draw_png())
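
If rendering is unavailable, the same structure can be checked from a plain edge list. This is a small sketch, not part of the LangGraph API: `execution_levels` is a hypothetical helper that groups nodes by their longest distance from `__start__`, assuming the graph is acyclic. Nodes that land in the same group have no dependency on one another and can run in parallel.

```python
from collections import defaultdict

# Edge list transcribed from the diagram description.
edges = [
    ("__start__", "fact_checker"),
    ("__start__", "brand_voice_analyst"),
    ("__start__", "risk_assessor"),
    ("fact_checker", "chief_editor"),
    ("brand_voice_analyst", "chief_editor"),
    ("risk_assessor", "chief_editor"),
    ("chief_editor", "__end__"),
]


def execution_levels(edge_list):
    """Group nodes by longest path length from the source (acyclic graphs only)."""
    preds = defaultdict(set)
    nodes = set()
    for src, dst in edge_list:
        preds[dst].add(src)
        nodes.update((src, dst))

    level = {}

    def depth(node):
        # A node runs one level after its deepest predecessor; sources sit at level 0.
        if node not in level:
            level[node] = 1 + max((depth(p) for p in preds[node]), default=-1)
        return level[node]

    grouped = defaultdict(list)
    for node in sorted(nodes):
        grouped[depth(node)].append(node)
    return [grouped[i] for i in sorted(grouped)]


levels = execution_levels(edges)
for index, group in enumerate(levels):
    print(index, group)
```

The three critics share one level and `chief_editor` sits alone in the next, which is the fan-out/fan-in shape the diagram describes.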

## Part 4: Running and Analyzing the Governance Workflow

Let's test our system with a sample social media post that has a few potential issues.

In [None]:
import json

content_to_review = "BIG NEWS! Our new QuantumLeap AI processor is 500% faster than any competitor, guaranteed! This will revolutionize the industry. Studies show it cures procrastination. Get yours now!"

inputs = {
    "content_to_review": content_to_review,
    "performance_log": []
}

def _jsonable(obj):
    """Convert Pydantic models (Critique, FinalDecision) into plain dicts for json.dumps."""
    if hasattr(obj, "model_dump"):
        return obj.model_dump()  # Pydantic v2
    if hasattr(obj, "dict"):
        return obj.dict()  # Pydantic v1
    return str(obj)

step_counter = 1
final_state = None

# stream_mode="values" yields the full state after each step, not a dict keyed by node name.
for state_snapshot in app.stream(inputs, stream_mode="values"):
    if not state_snapshot.get("critiques"):
        continue  # Initial snapshot: no critic has reported yet

    is_critic_step = state_snapshot.get("final_decision") is None
    step_name = "Critic Panel Execution (Parallel)" if is_critic_step else "Chief Editor Decision"
    print(f"\n{'*' * 100}")
    print(f"**Step {step_counter}: {step_name}**")
    print(f"{'*' * 100}")

    print(f"\nCurrent State{' (Aggregated)' if is_critic_step else ''}:")
    print(json.dumps(state_snapshot, indent=4, default=_jsonable))

    print(f"\n{'-' * 100}")
    print("State Analysis:")
    if is_critic_step:
        print("This is the parallel evaluation step. All three critics ran simultaneously. The wall-clock time for this stage was determined by the longest-running critic. The `critiques` dictionary now contains detailed, structured feedback from each specialist.")
    else:
        print("The Chief Editor has aggregated the parallel feedback and made an informed, final decision. It provides a clear summary and actionable revision instructions, creating a closed-loop quality control system.")
    print(f"{'-' * 100}")
    step_counter += 1
    final_state = state_snapshot

****************************************************************************************************
**Step 1: Critic Panel Execution (Parallel)**
****************************************************************************************************
--- CRITIC: Fact-Checker is investigating... ---
--- CRITIC: Brand Voice Analyst is reviewing... ---
--- CRITIC: Risk Assessor is scanning... ---
[BrandVoice] Completed in 4.88s.
[RiskAssessor] Completed in 5.15s.
[FactChecker] Completed in 9.21s.

Current State (Aggregated):
{
    'content_to_review': 'BIG NEWS! Our new QuantumLeap AI processor is 500% faster than any competitor, guaranteed! This will revolutionize the industry. Studies show it cures procrastination. Get yours now!',
    'critiques': {
        'BrandVoice': {'is_compliant': false, 'feedback': 'The language uses hype (\'BIG NEWS!\', \'revolutionize\') and exaggeration (\'500% faster... guaranteed\'), which violates our brand voice guidelines. We should tone it down to be more