One problem. Three frameworks. Production-grade results.
Build agentic AI systems with LangChain, LangGraph, and CrewAI — from first principles to a deployable, tested, and evaluated pipeline.
Most agentic AI tutorials stop at "hello world." They show you how to call an LLM, maybe wire up a tool, and call it a day. Real production systems need PII masking, structured outputs, guardrails, fallback classifiers, golden-dataset evaluation, and a clear understanding of when to use which framework.
This course builds the same real-world problem — enterprise email classification — in three frameworks side by side, so you can see exactly what each one changes and make an informed choice for your own projects.
- What You Will Learn
- What Is an Agent?
- Agent vs Conventional LLM
- The ReAct Pattern
- Multi-Agent Systems
- The Problem: Email Classification
- Shared Components
- Approach 1 — LangChain (ReAct Agent)
- Approach 2 — LangGraph (State Machine)
- Approach 3 — CrewAI (Multi-Agent)
- Framework Comparison
- When to Use What
- Evaluation & Production Readiness
- Getting Started
- Testing
- Project Structure
- License & Author
| Area | What you get |
|---|---|
| Theory | Agents, ReAct, multi-agent systems: the concepts behind the code |
| Three Frameworks | The same problem solved in LangChain, LangGraph, and CrewAI for direct comparison |
| Production Skills | PII masking, guardrails, offline testing, structured outputs, fallback classifiers |
| Evaluation | Golden dataset, precision/recall/F1, confusion matrix, production readiness gate |
An agent is a system that controls the flow, not just the output. A conventional LLM takes a prompt and returns a single response. An agent perceives input, reasons about what to do, acts (often by calling tools), observes the result, and iterates until the goal is met.
The key difference: the agent decides what happens next, not the developer hardcoding every step. This loop — perceive, reason, act, observe, repeat — is what makes a system truly "agentic."
| Aspect | Conventional LLM | Agent |
|---|---|---|
| Flow | Single prompt → single output | Multi-step reasoning loop |
| State | Stateless | Explicit state management |
| Tools | None or ad-hoc | Controlled, evidence-based tool usage |
| Governance | Hard to govern | Guardrails at every step |
| Observability | Low | Full traceability and audit trail |
| Best for | Text generation, Q&A | Workflows, decisions, actions with business impact |
If your task is simple text generation, a regular LLM is fine. If you need workflows, decisions, or actions with real consequences — you need an agent.
ReAct stands for Reason + Act — the pattern at the heart of modern agent frameworks.
The agent reasons about what to do, calls a tool to act, observes the result, and repeats until it has enough evidence to produce a final answer. Every step is explicit and auditable — you can trace exactly why the agent made each decision.
Why this matters for production:
- Reasoning is explicit and logged, not hidden inside a single LLM call
- Prevents blind tool calls — the agent explains why before acting
- Makes debugging straightforward — inspect any step in the loop
- Most modern agent frameworks implement structured ReAct under the hood
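The loop described above can be sketched in plain Python. This is a minimal illustration, not how any particular framework implements it: `scripted_policy` stands in for the LLM, `scan_urls` is a toy stand-in for a real tool, and every name and decision rule here is an assumption made for the example.

```python
from typing import Callable

# Hypothetical tool: flags links that point at a raw 192.x IP address.
def scan_urls(text: str) -> str:
    return "suspicious" if "http://192." in text else "clean"

TOOLS: dict[str, Callable[[str], str]] = {"scan_urls": scan_urls}

def scripted_policy(email: str, observations: list[str]) -> dict:
    """Stand-in for the LLM: returns a thought plus a tool call or a final answer."""
    if not observations and "http" in email:
        return {"thought": "Email has a link; check it first.", "tool": "scan_urls"}
    if "suspicious" in observations:
        return {"thought": "URL scan flagged the link.", "final": "phishing"}
    return {"thought": "No risk evidence found.", "final": "other"}

def react_loop(email: str, max_steps: int = 5) -> tuple[str, list[dict]]:
    trace: list[dict] = []          # every step is recorded for later auditing
    observations: list[str] = []
    for _ in range(max_steps):
        step = scripted_policy(email, observations)    # Reason
        trace.append(step)
        if "final" in step:
            return step["final"], trace
        result = TOOLS[step["tool"]](email)            # Act
        observations.append(result)                    # Observe
        trace.append({"observation": result})
    return "other", trace   # step budget exhausted: fall back to a safe default

label, trace = react_loop("Verify your account at http://192.0.2.1/login")
```

The returned `trace` holds the thought, the tool observation, and the final decision in order, which is the audit trail the bullets above refer to.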
A multi-agent system uses multiple specialized agents, each with a focused role, collaborating to solve a problem.
Why decompose into multiple agents?
- Separation of concerns — each agent has one job and does it well
- Stronger safety — risk assessment is isolated from classification
- Natural governance — maps directly to how enterprise teams are structured
- Extensibility — add a new agent (e.g., "Compliance Reviewer") without rewriting the pipeline
We solve the same problem three different ways to make the framework differences concrete and measurable.
Goal: Classify incoming emails into one of six categories, then route each to the appropriate downstream action.
| Category | Example | Action |
|---|---|---|
| `phishing` | "Verify your account immediately" | Quarantine + human review |
| `spam` | "Limited time offer! Win a free iPhone" | Quarantine |
| `invoice` | "Invoice #2026-042 - payment due in 30 days" | Accounts payable queue |
| `meeting` | "Team sync Thursday 10 AM - Zoom link" | Calendar suggestion |
| `support` | "Ticket #5432 - production outage" | Support ticket |
| `other` | Everything else | Inbox |
All three approaches share the same four-step pipeline — only the classification engine changes:
- Preprocess — mask PII (SSN, email, IBAN, phone, credit card) before the LLM sees anything
- Classify — the LLM or agent produces a structured label + confidence + rationale
- Guardrails — enforce business rules (confidence thresholds, phishing → human review)
- Route — map the classification to a concrete downstream action
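One way to picture the four steps is as plain functions composed in order. The sketch below is illustrative only: `Result`, `guardrails`, `run_pipeline`, and the 0.6 confidence threshold are assumptions made for this example, not values taken from the repository.

```python
from dataclasses import dataclass, replace
from typing import Callable

@dataclass(frozen=True)
class Result:
    label: str
    confidence: float
    requires_human_review: bool = False

def guardrails(result: Result, min_confidence: float = 0.6) -> Result:
    """Assumed business rules: low confidence or phishing forces human review."""
    if result.confidence < min_confidence or result.label == "phishing":
        return replace(result, requires_human_review=True)
    return result

def run_pipeline(
    raw_email: str,
    preprocess: Callable[[str], str],
    classify: Callable[[str], Result],
    route: Callable[[Result], str],
) -> str:
    masked = preprocess(raw_email)   # 1. Preprocess
    result = classify(masked)        # 2. Classify (the step that differs per framework)
    checked = guardrails(result)     # 3. Guardrails
    return route(checked)            # 4. Route

# Usage with trivial stand-ins for the other three steps.
action = run_pipeline(
    "Team sync Thursday 10 AM",
    preprocess=lambda text: text,
    classify=lambda text: Result("meeting", 0.45),
    route=lambda r: "human_review" if r.requires_human_review else "calendar_suggestion",
)
```

Because the classifier is passed in as a callable, swapping one framework for another leaves the other three steps untouched.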
All three implementations reuse the same core modules. This isolation is intentional — it lets you see exactly what each framework changes.
```python
from enum import Enum

from pydantic import BaseModel


class EmailLabel(str, Enum):
    PHISHING = "phishing"
    SPAM = "spam"
    INVOICE = "invoice"
    MEETING = "meeting"
    SUPPORT = "support"
    OTHER = "other"


class EmailClassification(BaseModel):
    label: EmailLabel
    confidence: float            # 0.0 – 1.0
    rationale: str               # no PII allowed
    indicators: list[str]        # key signals detected
    requires_human_review: bool  # True if ambiguous or high-risk
```

Pydantic enforces the contract: if the LLM returns invalid JSON, validation fails fast instead of silently passing garbage downstream.
Sensitive data is replaced with placeholder tags before the email reaches the LLM. Social security numbers become [SSN], credit card numbers become [CREDIT_CARD], and so on. This is a hard requirement in any regulated environment — you never want to send raw PII to a third-party API.
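A regex-based masker along these lines might look as follows. This is a rough sketch under stated assumptions: the patterns are simplified and will miss many real-world formats, and the `[EMAIL]`, `[IBAN]`, and `[PHONE]` tags are assumed by analogy with `[SSN]` and `[CREDIT_CARD]`.

```python
import re

# Order matters: the IBAN pattern runs before the card pattern so that the
# card pattern cannot consume the digit groups inside an IBAN.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[IBAN]", re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[CREDIT_CARD]", re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")),
    ("[PHONE]", re.compile(r"(?<!\w)\+?\d{1,3}[ -]?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b")),
]

def mask_pii(text: str) -> str:
    """Replace each match with its placeholder tag, one pattern at a time."""
    for tag, pattern in PII_PATTERNS:
        text = pattern.sub(tag, text)
    return text

masked = mask_pii("SSN 123-45-6789, card 4111 1111 1111 1111, mail bob@example.com")
```

Masking runs before any model call, so the classifier only ever sees the placeholder tags.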
The keyword fallback is a deterministic safety net. When the LLM returns low confidence or is unavailable, the system degrades gracefully to rule-based classification instead of failing. The fallback is honest about its limitations: its confidence is capped at 0.8.
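A keyword classifier with a confidence cap could be sketched like this. The keyword lists, the scoring formula, and the 0.5 default are assumptions for illustration; only the 0.8 cap comes from the text.

```python
KEYWORDS: dict[str, tuple[str, ...]] = {
    "phishing": ("verify your account", "password expired", "confirm your identity"),
    "spam": ("limited time offer", "win a free", "act now"),
    "invoice": ("invoice", "payment due"),
    "meeting": ("meeting", "team sync", "zoom link"),
    "support": ("ticket", "outage"),
}
MAX_FALLBACK_CONFIDENCE = 0.8  # rules never claim more certainty than this

def keyword_fallback(text: str) -> tuple[str, float]:
    lowered = text.lower()
    # Count how many keywords of each label appear in the text.
    hits = {label: sum(kw in lowered for kw in kws) for label, kws in KEYWORDS.items()}
    best_label, best_hits = max(hits.items(), key=lambda item: item[1])
    if best_hits == 0:
        return "other", 0.5   # assumed default when nothing matches
    # Assumed scoring: more keyword hits raise confidence, up to the cap.
    confidence = min(MAX_FALLBACK_CONFIDENCE, 0.5 + 0.15 * best_hits)
    return best_label, confidence

label, confidence = keyword_fallback("Limited time offer! Win a free iPhone")
```

The cap keeps a rule-based guess from ever looking more certain than a confident LLM answer, so downstream guardrails can still treat fallback results with suspicion.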
The router maps each label to a downstream action. The critical override: if `requires_human_review` is `True`, human review always takes priority regardless of the label.
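The override can be expressed by checking the review flag before the lookup. In this sketch the action identifiers are assumed names derived from the category table, not the strings used in `src/routing.py`.

```python
ACTIONS = {
    "phishing": "quarantine_and_human_review",
    "spam": "quarantine",
    "invoice": "accounts_payable_queue",
    "meeting": "calendar_suggestion",
    "support": "support_ticket",
    "other": "inbox",
}

def route(label: str, requires_human_review: bool) -> str:
    # The override is checked first so that no label can bypass it.
    if requires_human_review:
        return "human_review"
    # Unknown labels land in the inbox rather than raising.
    return ACTIONS.get(label, "inbox")
```

For example, `route("invoice", True)` returns `"human_review"` even though invoices normally go to the accounts payable queue.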
Four evidence-gathering tools that make the agent truly agentic:
| Tool | Purpose | When the agent calls it |
|---|---|---|
| `check_sender_reputation` | Domain risk score lookup | Unfamiliar or suspicious sender |
| `scan_urls` | Check URLs for malicious patterns | Email contains links |
| `lookup_known_contacts` | Verify sender against directory | Any visible sender address |
| `check_invoice_registry` | Validate invoice numbers | Email mentions invoices or payments |
In production, these would call real APIs (threat intelligence, CRM, invoice management). Here they use simulated databases so the entire pipeline can be tested offline.
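A simulated tool of this kind can be as small as a dictionary lookup. The sketch below is an assumption about shape only: the return schema, the sample domains, and the neutral 0.5 score for unknown domains are invented for illustration.

```python
# Simulated reputation data; a production version would query a threat-intelligence API.
_DOMAIN_RISK = {"example.com": 0.05, "secure-login-update.test": 0.95}

def check_sender_reputation(sender: str) -> dict:
    """Return a risk score in [0, 1] for the sender's domain (hypothetical schema)."""
    domain = sender.rsplit("@", 1)[-1].strip().lower()
    known = domain in _DOMAIN_RISK
    score = _DOMAIN_RISK.get(domain, 0.5)   # unknown domains get a neutral score
    return {"domain": domain, "risk_score": score, "known": known}

report = check_sender_reputation("Alice@Example.com")
```

Because the lookup is deterministic and in-memory, the same function can be exercised in unit tests without network access.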
Source: `src/langchain_agent.py` + `src/tools.py`
The LLM decides which tools to call based on the email content. Different emails trigger different tool paths — a phishing email triggers scan_urls + check_sender_reputation, while an invoice triggers check_invoice_registry + lookup_known_contacts.
```python
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model=llm,
    tools=[check_sender_reputation, scan_urls,
           lookup_known_contacts, check_invoice_registry],
    prompt=SYSTEM_PROMPT,
)
```

What makes it a real agent:
- The LLM controls the flow — it chooses which tools to invoke and when to stop
- It follows a full ReAct loop (Reason → Act → Observe → Repeat)
- Tool calls produce evidence the classification is based on, not just LLM intuition
- The `tools_used` field in the output provides a complete audit trail
Best for: Evidence-based classification, tool-heavy workflows, rapid prototyping.
Trade-off: Control flow is decided by the LLM — harder to guarantee deterministic execution paths at scale.
Source: `src/langgraph_agent.py`
LangGraph makes the control flow explicit as a directed graph with typed state. Every node is a function, every edge is declared, and branching logic is visible — not buried in code.
```python
from langgraph.graph import END

# `graph` is the StateGraph instance created earlier in the module.
graph.add_node("preprocess", preprocess_node)
graph.add_node("classify", classify_node)
graph.add_node("guardrails", guardrails_node)
graph.add_node("fallback", fallback_node)
graph.add_node("route", route_node)

graph.add_edge("preprocess", "classify")
graph.add_edge("classify", "guardrails")
graph.add_conditional_edges("guardrails", decide_path, {...})
graph.add_edge("fallback", "route")
graph.add_edge("route", END)
```

What makes it enterprise-grade:
- Deterministic flow — every edge is explicit and auditable
- Typed state: `TypedDict` state flows immutably through the graph
- Conditional edges: branching logic is declared, not hidden in if-else chains
- Node-level testing — each function can be unit tested in complete isolation
Best for: Regulated environments, compliance-sensitive pipelines, anything that needs a reviewable audit trail.
When a compliance team asks "Can you prove the system follows your documented process?" — with LangGraph, the graph is the documentation.
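Node-level testing works because each node is an ordinary function over the state dictionary. The sketch below uses only the standard library; `EmailState`, the bodies of `guardrails_node` and `decide_path`, the 0.6 threshold, and the `"fallback"` / `"route"` return values are all hypothetical and are not taken from `src/langgraph_agent.py`.

```python
from typing import TypedDict

class EmailState(TypedDict, total=False):
    masked_text: str
    label: str
    confidence: float
    requires_human_review: bool

def guardrails_node(state: EmailState) -> EmailState:
    # Return only the keys this node changes instead of mutating the input.
    return {"requires_human_review": state["label"] == "phishing"}

def decide_path(state: EmailState) -> str:
    # Hypothetical branching: low-confidence results go through the fallback node.
    return "fallback" if state["confidence"] < 0.6 else "route"

def test_low_confidence_goes_to_fallback() -> None:
    state: EmailState = {"label": "invoice", "confidence": 0.3}
    assert guardrails_node(state) == {"requires_human_review": False}
    assert decide_path(state) == "fallback"
    assert "requires_human_review" not in state   # input was not mutated

test_low_confidence_goes_to_fallback()
```

No graph, model, or API key is involved, so a test like this runs in milliseconds and pins down one branch of the documented process.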
Source: `src/crewai_agent.py`
CrewAI decomposes the problem across three specialized agents that communicate through a sequential task pipeline. Each agent has a focused role, a goal, and a backstory that shapes its reasoning.
```python
from crewai import Agent, Crew, Process

classifier_agent = Agent(role="Email Classifier", ...)
risk_agent = Agent(role="Risk Analyst", ...)
policy_agent = Agent(role="Policy Router", ...)

crew = Crew(
    agents=[classifier_agent, risk_agent, policy_agent],
    tasks=[classify_task, risk_task, policy_task],
    process=Process.sequential,
)
```

What makes it powerful:
- Role separation — each agent has a focused responsibility
- Natural team mapping — mirrors how human organizations structure decision-making
- Extensible — add a new agent (e.g., "Compliance Reviewer") without rewriting anything
- Collaborative reasoning — each agent builds on the previous agent's output
Trade-off: Three LLM calls per email instead of one. For simple classification, that's overkill. For genuinely complex problems with multiple reasoning domains, it's the natural choice.
| Dimension | LangChain | LangGraph | CrewAI |
|---|---|---|---|
| Architecture | Linear chain | State machine (DAG) | Multi-agent team |
| Control flow | Implicit (LLM decides) | Explicit edges | Role-based |
| State management | Pass-through | Typed `TypedDict` | Shared context |
| Branching | In code | Conditional edges | Agent delegation |
| Testability | Integration tests | Node-level unit tests | Agent-level tests |
| Governance | Manual | Built-in | Per-agent policies |
| Best for | Prototypes, tool-heavy workflows | Enterprise, compliance | Complex collaboration |
| Learning curve | Low | Medium | Medium |
| Criterion | Winner | Why |
|---|---|---|
| Best accuracy | LangChain / LangGraph | Both ~90%; tools give LangChain an edge on hard cases |
| Best speed | LangGraph | Single LLM call, ~3x faster than alternatives |
| Best auditability | LangGraph | Explicit edges, typed state, fully reviewable |
| Best safety | LangChain | Tool evidence provides the richest audit trail |
| Best cost | LangGraph | 1 LLM call vs multi-turn (LangChain) vs 3 calls (CrewAI) |
| Best extensibility | CrewAI | Adding a new agent is trivial |
Recommendation: LangGraph for production deployments. LangChain for prototyping and discovery.
Choose LangChain when:
- You need a quick prototype or proof of concept
- The workflow is mostly linear with few branches
- You want rich tool integration with an evidence trail
Choose LangGraph when:
- You need auditability, governance, and compliance
- The workflow has complex branching or conditional logic
- You are in a regulated environment (banking, healthcare, insurance)
- You want to unit test each pipeline step in isolation
Choose CrewAI when:
- The problem naturally decomposes into distinct roles
- You want collaborative multi-agent reasoning
- You need to mirror a human team structure
- Extensibility matters more than tight control
Before deploying any classifier, run it against the golden dataset — 30 hand-labelled emails covering all 6 categories with easy, medium, and hard difficulty levels.
```bash
make evaluate       # Baseline (keyword fallback, no API key needed)
make evaluate-all   # All approaches (requires OPENAI_API_KEY)
make compare        # Side-by-side comparison with automated verdict
```

The evaluation computes per-class precision, recall, F1, a confusion matrix, and a production readiness gate:
```python
MIN_WEIGHTED_F1 = 0.70         # Overall quality bar
MIN_PHISHING_RECALL = 0.80     # Must catch phishing (safety-critical)
MIN_PHISHING_PRECISION = 0.60  # Prevent alert fatigue
```

All three thresholds must pass for a "production ready" verdict. Results are saved to `data/eval_results/` as CSV (per-sample predictions) and JSON (summary metrics).
Why these thresholds? Missing a phishing email is a security breach. Too many false alarms means people stop trusting the system. The gate balances both risks.
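A gate like this can be computed from the true and predicted labels alone. The following is a standard-library sketch of the idea, not the code in `src/evaluate.py`; the function names are assumptions, while the three thresholds are the ones quoted above.

```python
from collections import Counter

MIN_WEIGHTED_F1 = 0.70
MIN_PHISHING_RECALL = 0.80
MIN_PHISHING_PRECISION = 0.60

def per_class_metrics(y_true: list[str], y_pred: list[str], label: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one label; 0.0 when undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def production_ready(y_true: list[str], y_pred: list[str]) -> bool:
    support = Counter(y_true)
    # Weighted F1: per-class F1 averaged by each class's share of the true labels.
    weighted_f1 = sum(
        per_class_metrics(y_true, y_pred, label)[2] * count / len(y_true)
        for label, count in support.items()
    )
    precision, recall, _ = per_class_metrics(y_true, y_pred, "phishing")
    return (
        weighted_f1 >= MIN_WEIGHTED_F1
        and recall >= MIN_PHISHING_RECALL
        and precision >= MIN_PHISHING_PRECISION
    )
```

A classifier that misses one of two phishing emails has a phishing recall of 0.5 and fails the gate, however well it does on the other categories.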
```bash
git clone https://github.com/ruslanmv/agentic-ai-tutorial.git
cd agentic-ai-tutorial
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
make install                # or: pip install -r requirements.txt
```

```bash
export OPENAI_API_KEY="sk-..."
```

```bash
make test      # Run 80+ offline tests (no API key needed)
make run       # Run all three approaches side by side
make compare   # Head-to-head comparison with verdict
```

```bash
make help              # Show every available command
make install           # Install production dependencies
make install-dev       # Install production + dev dependencies
make test              # Run offline test suite
make test-all          # Run all tests including integration (needs API key)
make lint              # Run the ruff linter
make evaluate          # Evaluate keyword fallback against golden dataset
make evaluate-all      # Evaluate all approaches (needs API key)
make compare           # Compare all frameworks with automated verdict
make compare-baseline  # Compare baseline only (no API key)
make run-langchain     # Run only the LangChain ReAct agent
make run-langgraph     # Run only the LangGraph state machine
make run-crewai        # Run only the CrewAI multi-agent system
make clean             # Remove build artifacts
```

```bash
python -m src.langchain_agent
python -m src.langgraph_agent
python -m src.crewai_agent
```

All shared components are tested offline, with no API key required. The test suite covers PII masking, keyword fallback, routing, guardrails, tool functions, evaluation metrics, and JSON parsing.
```bash
make test       # 80+ offline tests (~1.5 seconds)
make test-all   # Includes integration tests (needs API key)
```

```text
tests/
├── test_preprocessing.py     # PII masking patterns and edge cases
├── test_fallback.py          # Keyword classifier for all 6 categories
├── test_routing.py           # Label → action mapping
├── test_guardrails.py        # Policy enforcement and threshold logic
├── test_tools.py             # All four tool functions (offline)
├── test_evaluate.py          # Metrics computation + production gate
├── test_langchain_agent.py   # ReAct agent + JSON parsing strategies
├── test_langgraph_agent.py   # State machine node-level tests
└── test_crewai_agent.py      # Multi-agent crew tests
```
```text
agentic-ai-tutorial/
├── Makefile                      # All convenience commands
├── README.md                     # This file
├── requirements.txt              # Production dependencies
├── pyproject.toml                # Package metadata + dev dependencies
│
├── src/                          # Application code
│   ├── schema.py                 # Pydantic models (EmailLabel, EmailClassification)
│   ├── preprocessing.py          # PII masking (SSN, email, IBAN, phone, credit card)
│   ├── fallback.py               # Keyword-based fallback classifier
│   ├── routing.py                # Label → action routing
│   ├── tools.py                  # Agent tools (sender reputation, URL scan, etc.)
│   ├── evaluate.py               # Evaluation pipeline (precision, recall, F1, prod gate)
│   ├── langchain_agent.py        # Approach 1: ReAct agent with tools
│   ├── langgraph_agent.py        # Approach 2: State machine (DAG)
│   └── crewai_agent.py           # Approach 3: Multi-agent team
│
├── tests/                        # 80+ offline tests + integration tests
├── data/
│   └── golden_dataset.csv        # 30 hand-labelled emails for evaluation
├── examples/
│   ├── run_all.py                # Run all 3 approaches side by side
│   └── compare_frameworks.py     # Head-to-head comparison with verdict
├── docs/
│   └── blog.md                   # Extended tutorial
└── assets/
    ├── architecture.mermaid      # Mermaid source diagrams
    └── svg/                      # SVG diagrams used in this README
```
An agent is a system that reasons, acts, and iterates toward a goal using tools and state — rather than producing a single response. Multi-agent systems decompose complex workflows across specialized roles. LangChain enables ReAct-style agents with dynamic tool selection. LangGraph provides deterministic state-machine orchestration for enterprise workflows. CrewAI enables collaborative multi-agent designs. Agentic systems trade simplicity for control, safety, and auditability — which is why they are preferred for production in regulated environments.
Apache 2.0 | Created by ruslanmv