Agentic AI Concepts

From Theory to Production Code

One problem. Three frameworks. Production-grade results.

Build agentic AI systems with LangChain, LangGraph, and CrewAI — from first principles to a deployable, tested, and evaluated pipeline.

Why This Course Exists

Most agentic AI tutorials stop at "hello world." They show you how to call an LLM, maybe wire up a tool, and call it a day. Real production systems need PII masking, structured outputs, guardrails, fallback classifiers, golden-dataset evaluation, and a clear understanding of when to use which framework.

This course builds the same real-world problem — enterprise email classification — in three frameworks side by side, so you can see exactly what each one changes and make an informed choice for your own projects.

Table of Contents

What You Will Learn
What Is an Agent?
Agent vs Conventional LLM
The ReAct Pattern
Multi-Agent Systems
The Problem: Email Classification
Shared Components
Approach 1 — LangChain (ReAct Agent)
Approach 2 — LangGraph (State Machine)
Approach 3 — CrewAI (Multi-Agent)
Framework Comparison
When to Use What
Evaluation & Production Readiness
Getting Started
Testing
Project Structure
License & Author

What You Will Learn

Theory
Agents, ReAct, and multi-agent systems: the concepts behind the code

Three Frameworks
The same problem solved in LangChain, LangGraph, and CrewAI for direct comparison

Production Skills
PII masking, guardrails, offline testing, structured outputs, fallback classifiers

Evaluation
Golden dataset, precision/recall/F1, confusion matrix, production readiness gate

What Is an Agent?

An agent is a system that controls the flow, not just the output. A conventional LLM takes a prompt and returns a single response. An agent perceives input, reasons about what to do, acts (often by calling tools), observes the result, and iterates until the goal is met.

The key difference: the agent decides what happens next, not the developer hardcoding every step. This loop — perceive, reason, act, observe, repeat — is what makes a system truly "agentic."

Agent vs Conventional LLM

Aspect	Conventional LLM	Agent
Flow	Single prompt → single output	Multi-step reasoning loop
State	Stateless	Explicit state management
Tools	None or ad-hoc	Controlled, evidence-based tool usage
Governance	Hard to govern	Guardrails at every step
Observability	Low	Full traceability and audit trail
Best for	Text generation, Q&A	Workflows, decisions, actions with business impact

If your task is simple text generation, a regular LLM is fine. If you need workflows, decisions, or actions with real consequences — you need an agent.

The ReAct Pattern

ReAct stands for Reason + Act — the pattern at the heart of modern agent frameworks.

The agent reasons about what to do, calls a tool to act, observes the result, and repeats until it has enough evidence to produce a final answer. Every step is explicit and auditable — you can trace exactly why the agent made each decision.

Why this matters for production:

Reasoning is explicit and logged, not hidden inside a single LLM call
Prevents blind tool calls — the agent explains why before acting
Makes debugging straightforward — inspect any step in the loop
Most modern agent frameworks implement structured ReAct under the hood
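
The loop described above can be sketched without any framework. This is a minimal illustration, not the course's implementation: `scripted_policy` is a hypothetical stand-in for the LLM, `scan_urls` is a toy tool, and the `finish:<label>` convention is an assumption made for this example. The point is that every step records a thought before the action, so the trace can be inspected afterwards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    thought: str      # why the agent chose this action
    action: str       # tool name, or "finish"
    observation: str  # tool result, or the final answer


def scan_urls(text: str) -> str:
    """Hypothetical stand-in for a real URL scanner."""
    return "suspicious" if "http://" in text else "clean"


TOOLS: dict[str, Callable[[str], str]] = {"scan_urls": scan_urls}


def scripted_policy(email: str, trace: list[Step]) -> tuple[str, str]:
    """Stand-in for the LLM: returns (thought, action) for the next step."""
    if not trace:
        return "The email may contain links, so scan them first.", "scan_urls"
    finding = trace[-1].observation
    label = "phishing" if finding == "suspicious" else "other"
    return f"The URL scan came back {finding}; that is enough evidence.", f"finish:{label}"


def run_react(email: str, policy=scripted_policy, max_steps: int = 5) -> tuple[str, list[Step]]:
    trace: list[Step] = []
    for _ in range(max_steps):
        thought, action = policy(email, trace)             # reason
        if action.startswith("finish:"):
            answer = action.split(":", 1)[1]
            trace.append(Step(thought, "finish", answer))
            return answer, trace
        observation = TOOLS[action](email)                 # act
        trace.append(Step(thought, action, observation))   # observe
    return "other", trace  # step budget exhausted: return a safe default


answer, trace = run_react("Verify your account at http://bad.example")
for step in trace:
    print(f"{step.action}: {step.thought} -> {step.observation}")
```

The `max_steps` budget is a design choice worth keeping in a real agent too: it bounds cost and prevents a confused model from looping forever.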

Multi-Agent Systems

A multi-agent system uses multiple specialized agents, each with a focused role, collaborating to solve a problem.

Why decompose into multiple agents?

Separation of concerns — each agent has one job and does it well
Stronger safety — risk assessment is isolated from classification
Natural governance — maps directly to how enterprise teams are structured
Extensibility — add a new agent (e.g., "Compliance Reviewer") without rewriting the pipeline

The Problem: Email Classification

We solve the same problem three different ways to make the framework differences concrete and measurable.

Goal: Classify incoming emails into one of six categories, then route each to the appropriate downstream action.

Category	Example	Action
`phishing`	"Verify your account immediately"	Quarantine + human review
`spam`	"Limited time offer! Win a free iPhone"	Quarantine
`invoice`	"Invoice #2026-042 — payment due in 30 days"	Accounts payable queue
`meeting`	"Team sync Thursday 10 AM — Zoom link"	Calendar suggestion
`support`	"Ticket #5432 — production outage"	Support ticket
`other`	Everything else	Inbox

All three approaches share the same four-step pipeline — only the classification engine changes:

Preprocess — mask PII (SSN, email, IBAN, phone, credit card) before the LLM sees anything
Classify — the LLM or agent produces a structured label + confidence + rationale
Guardrails — enforce business rules (confidence thresholds, phishing → human review)
Route — map the classification to a concrete downstream action
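
As a rough sketch of the guardrails step, the rules can be written as a pure function applied after classification. The `Classification` dataclass, the function name, and the 0.6 threshold are assumptions for illustration; the repository's own values may differ.

```python
from dataclasses import dataclass, replace

MIN_CONFIDENCE = 0.6  # assumed threshold, not taken from the repository


@dataclass(frozen=True)
class Classification:
    label: str
    confidence: float
    requires_human_review: bool = False


def apply_guardrails(result: Classification) -> Classification:
    """Enforce business rules after classification, whatever the model said."""
    if result.label == "phishing" or result.confidence < MIN_CONFIDENCE:
        # Phishing and low-confidence results always get a human in the loop.
        return replace(result, requires_human_review=True)
    return result


print(apply_guardrails(Classification("phishing", 0.97)))
print(apply_guardrails(Classification("meeting", 0.42)))
print(apply_guardrails(Classification("invoice", 0.91)))
```

Because the function never trusts the model's own `requires_human_review` value for phishing, a confident but wrong model cannot talk its way past the rule.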

Shared Components

All three implementations reuse the same core modules. This isolation is intentional — it lets you see exactly what each framework changes.

Schema — `src/schema.py`

from enum import Enum

from pydantic import BaseModel

class EmailLabel(str, Enum):
    PHISHING = "phishing"
    SPAM     = "spam"
    INVOICE  = "invoice"
    MEETING  = "meeting"
    SUPPORT  = "support"
    OTHER    = "other"

class EmailClassification(BaseModel):
    label:                 EmailLabel
    confidence:            float        # 0.0 – 1.0
    rationale:             str          # no PII allowed
    indicators:            list[str]    # key signals detected
    requires_human_review: bool         # True if ambiguous or high-risk

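Pydantic enforces the contract: if the LLM's output is not valid JSON or does not match the schema (an unknown label, a missing field, a non-numeric confidence), validation raises an error immediately instead of silently passing garbage downstream.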

PII Masking — `src/preprocessing.py`

Sensitive data is replaced with placeholder tags before the email reaches the LLM. Social security numbers become [SSN], credit card numbers become [CREDIT_CARD], and so on. This is a hard requirement in any regulated environment — you never want to send raw PII to a third-party API.
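
A minimal sketch of this kind of masking, using regular expressions. The patterns below are simplified assumptions written for this example (US-style SSNs, 16-digit cards, a loose phone format) and are not the repository's actual rules; only the `[SSN]` and `[CREDIT_CARD]` tags come from the text above, the other tag names are guesses.

```python
import re

# Order matters: longer, more specific patterns run before looser ones.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[IBAN]", re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[CREDIT_CARD]", re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("[PHONE]", re.compile(r"\+?\d{1,3}[ -]?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b")),
]


def mask_pii(text: str) -> str:
    """Replace each PII match with its placeholder tag."""
    for tag, pattern in PII_PATTERNS:
        text = pattern.sub(tag, text)
    return text


print(mask_pii("SSN 123-45-6789, card 4111 1111 1111 1111, mail bob@example.com"))
```

Regex masking is easy to test offline but will miss unusual formats, so a production system would typically pair it with a dedicated PII detection service.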

Keyword Fallback — `src/fallback.py`

A deterministic safety net. When the LLM returns low confidence or is unavailable, the system degrades gracefully to rule-based classification instead of failing. The fallback is honest about its limitations — confidence is capped at 0.8.
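
One way such a fallback could look is a keyword count per category with a hard confidence ceiling. The keyword lists and the scoring formula here are invented for illustration; only the 0.8 cap comes from the text above.

```python
MAX_FALLBACK_CONFIDENCE = 0.8  # the cap described above

# Hypothetical keyword lists; a real deployment would tune these on data.
KEYWORDS = {
    "phishing": ("verify your account", "password", "suspended"),
    "spam": ("limited time", "free", "winner"),
    "invoice": ("invoice", "payment due"),
    "meeting": ("meeting", "sync", "zoom"),
    "support": ("ticket", "outage"),
}


def classify_by_keywords(text: str) -> tuple[str, float]:
    """Return (label, confidence) using keyword hits only, with no LLM call."""
    lowered = text.lower()
    hits = {
        label: sum(keyword in lowered for keyword in keywords)
        for label, keywords in KEYWORDS.items()
    }
    label, count = max(hits.items(), key=lambda item: item[1])
    if count == 0:
        return "other", 0.5
    # More hits raise confidence, but never above the cap.
    return label, min(MAX_FALLBACK_CONFIDENCE, 0.5 + 0.15 * count)


print(classify_by_keywords("Invoice #2026-042, payment due in 30 days"))
print(classify_by_keywords("See you at lunch"))
```

Capping the score keeps the fallback from ever outranking a confident LLM result in downstream comparisons.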

Routing — `src/routing.py`

Maps each label to a downstream action. The critical override: if requires_human_review is True, that always takes priority regardless of the label.
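
The override can be sketched as a lookup table guarded by a single check. The action strings below are hypothetical names loosely based on the category table earlier in this document, not identifiers from `src/routing.py`.

```python
# Hypothetical action names; the labels match the six categories above.
ROUTES = {
    "phishing": "quarantine",
    "spam": "quarantine",
    "invoice": "accounts_payable_queue",
    "meeting": "calendar_suggestion",
    "support": "support_ticket",
    "other": "inbox",
}


def route(label: str, requires_human_review: bool) -> str:
    """Map a label to an action; human review always wins."""
    if requires_human_review:
        return "human_review"
    # Unknown labels fall back to the inbox rather than raising.
    return ROUTES.get(label, "inbox")


print(route("invoice", requires_human_review=False))
print(route("invoice", requires_human_review=True))
```

Checking the flag first, before the table lookup, is what makes the override impossible to bypass by adding a new label.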

Tools — `src/tools.py`

Four evidence-gathering tools that make the agent truly agentic:

Tool	Purpose	When the agent calls it
`check_sender_reputation`	Domain risk score lookup	Unfamiliar or suspicious sender
`scan_urls`	Check URLs for malicious patterns	Email contains links
`lookup_known_contacts`	Verify sender against directory	Any visible sender address
`check_invoice_registry`	Validate invoice numbers	Email mentions invoices or payments

In production, these would call real APIs (threat intelligence, CRM, invoice management). Here they use simulated databases so the entire pipeline can be tested offline.
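
To make the "simulated database" idea concrete, here is a hedged sketch of what an offline tool might look like. The function name matches the table above, but the body, the return shape, and the sample domains are assumptions made for this example.

```python
# Simulated reputation data: domain -> risk score in [0, 1]. Values are invented.
REPUTATION_DB = {
    "example.com": 0.1,
    "paypa1-secure.example": 0.9,
}

UNKNOWN_DOMAIN_RISK = 0.5  # assumed neutral score for domains not in the table


def check_sender_reputation(sender: str) -> dict:
    """Look up the sender's domain in an in-memory table instead of a live API."""
    domain = sender.rsplit("@", 1)[-1].lower()
    score = REPUTATION_DB.get(domain)
    if score is None:
        return {"domain": domain, "known": False, "risk_score": UNKNOWN_DOMAIN_RISK}
    return {"domain": domain, "known": True, "risk_score": score}


print(check_sender_reputation("billing@paypa1-secure.example"))
print(check_sender_reputation("someone@new-domain.example"))
```

Swapping the dictionary lookup for a real threat-intelligence call would leave the function's signature, and therefore the agent's tool contract, unchanged.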

Approach 1 — LangChain (ReAct Agent)

Source: src/langchain_agent.py + src/tools.py

The LLM decides which tools to call based on the email content. Different emails trigger different tool paths — a phishing email triggers scan_urls + check_sender_reputation, while an invoice triggers check_invoice_registry + lookup_known_contacts.

# The prebuilt ReAct agent constructor lives in the langgraph package.
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model=llm,
    tools=[check_sender_reputation, scan_urls,
           lookup_known_contacts, check_invoice_registry],
    prompt=SYSTEM_PROMPT,
)

What makes it a real agent:

The LLM controls the flow — it chooses which tools to invoke and when to stop
It follows a full ReAct loop (Reason → Act → Observe → Repeat)
Tool calls produce evidence the classification is based on, not just LLM intuition
The tools_used field in the output provides a complete audit trail

Best for: Evidence-based classification, tool-heavy workflows, rapid prototyping.

Trade-off: Control flow is decided by the LLM — harder to guarantee deterministic execution paths at scale.

Approach 2 — LangGraph (State Machine)

Source: src/langgraph_agent.py

LangGraph makes the control flow explicit as a directed graph with typed state. Every node is a function, every edge is declared, and branching logic is visible — not buried in code.

from langgraph.graph import END

# `graph` is the StateGraph built over the pipeline's typed state.
graph.add_node("preprocess", preprocess_node)
graph.add_node("classify",   classify_node)
graph.add_node("guardrails", guardrails_node)
graph.add_node("fallback",   fallback_node)
graph.add_node("route",      route_node)

graph.add_edge("preprocess", "classify")
graph.add_edge("classify",   "guardrails")
graph.add_conditional_edges("guardrails", decide_path, {...})
graph.add_edge("fallback",   "route")
graph.add_edge("route",      END)

What makes it enterprise-grade:

Deterministic flow — every edge is explicit and auditable
Typed state — TypedDict state flows immutably through the graph
Conditional edges — branching logic is declared, not hidden in if-else chains
Node-level testing — each function can be unit tested in complete isolation
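
To illustrate node-level testing, the sketch below defines a typed state, one node, and a conditional-edge function as plain Python, then exercises them without building a graph or calling an LLM. The `EmailState` fields, the 0.6 threshold, and the function bodies are assumptions; only the names `decide_path`, `guardrails_node`, `fallback`, and `route` echo the snippet above.

```python
from typing import TypedDict

CONFIDENCE_THRESHOLD = 0.6  # assumed value for illustration


class EmailState(TypedDict, total=False):
    masked_text: str
    label: str
    confidence: float
    requires_human_review: bool


def guardrails_node(state: EmailState) -> dict:
    """A node returns only the keys it updates."""
    return {"requires_human_review": state["label"] == "phishing"}


def decide_path(state: EmailState) -> str:
    """Conditional edge: pick the name of the next node."""
    if state["confidence"] < CONFIDENCE_THRESHOLD:
        return "fallback"
    return "route"


# Each function is testable on its own with a hand-built state.
confident: EmailState = {"label": "invoice", "confidence": 0.92}
unsure: EmailState = {"label": "phishing", "confidence": 0.41}

print(decide_path(confident), guardrails_node(confident))
print(decide_path(unsure), guardrails_node(unsure))
```

Since the branching decision is an ordinary function of the state, a reviewer can read and test it without knowing anything about the framework.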

Best for: Regulated environments, compliance-sensitive pipelines, anything that needs a reviewable audit trail.

When a compliance team asks "Can you prove the system follows your documented process?" — with LangGraph, the graph is the documentation.

Approach 3 — CrewAI (Multi-Agent)

Source: src/crewai_agent.py

CrewAI decomposes the problem across three specialized agents that communicate through a sequential task pipeline. Each agent has a focused role, a goal, and a backstory that shapes its reasoning.

from crewai import Agent, Crew, Process

classifier_agent = Agent(role="Email Classifier", ...)
risk_agent       = Agent(role="Risk Analyst", ...)
policy_agent     = Agent(role="Policy Router", ...)

crew = Crew(
    agents=[classifier_agent, risk_agent, policy_agent],
    tasks=[classify_task, risk_task, policy_task],
    process=Process.sequential,
)

What makes it powerful:

Role separation — each agent has a focused responsibility
Natural team mapping — mirrors how human organizations structure decision-making
Extensible — add a new agent (e.g., "Compliance Reviewer") without rewriting anything
Collaborative reasoning — each agent builds on the previous agent's output

Trade-off: Three LLM calls per email instead of one. For simple classification, that's overkill. For genuinely complex problems with multiple reasoning domains, it's the natural choice.
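
The sequential handoff, and why it costs three calls, can be shown with plain functions standing in for the three agents. This sketch does not use the CrewAI API; the function bodies and the shared-context dictionary are hypothetical simplifications in which each function represents one LLM call.

```python
from typing import Callable

Context = dict[str, str]
AgentFn = Callable[[Context], Context]


def classifier(ctx: Context) -> Context:
    label = "phishing" if "verify your account" in ctx["email"].lower() else "other"
    return {**ctx, "label": label}


def risk_analyst(ctx: Context) -> Context:
    # Builds on the classifier's output rather than re-reading the email.
    return {**ctx, "risk": "high" if ctx["label"] == "phishing" else "low"}


def policy_router(ctx: Context) -> Context:
    return {**ctx, "action": "quarantine" if ctx["risk"] == "high" else "inbox"}


def run_sequential(email: str, agents: list[AgentFn]) -> tuple[Context, int]:
    """Run agents in order; return the final context and the number of calls made."""
    ctx: Context = {"email": email}
    calls = 0
    for agent in agents:
        ctx = agent(ctx)
        calls += 1
    return ctx, calls


result, calls = run_sequential(
    "Please verify your account today", [classifier, risk_analyst, policy_router]
)
print(result["label"], result["risk"], result["action"], calls)
```

Adding a fourth role is a one-line change to the list, which is the extensibility benefit, and it also adds a fourth call per email, which is the cost.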

Framework Comparison

Dimension	LangChain	LangGraph	CrewAI
Architecture	Linear chain	State machine (DAG)	Multi-agent team
Control flow	Implicit (LLM decides)	Explicit edges	Role-based
State management	Pass-through	Typed `TypedDict`	Shared context
Branching	In code	Conditional edges	Agent delegation
Testability	Integration tests	Node-level unit tests	Agent-level tests
Governance	Manual	Built-in	Per-agent policies
Best for	Prototypes, tool-heavy workflows	Enterprise, compliance	Complex collaboration
Learning curve	Low	Medium	Medium

Head-to-Head Results

Criterion	Winner	Why
Best accuracy	LangChain / LangGraph	Both ~90%; tools give LangChain an edge on hard cases
Best speed	LangGraph	Single LLM call, ~3x faster than alternatives
Best auditability	LangGraph	Explicit edges, typed state, fully reviewable
Best safety	LangChain	Tool evidence provides the richest audit trail
Best cost	LangGraph	1 LLM call vs multi-turn (LangChain) vs 3 calls (CrewAI)
Best extensibility	CrewAI	Adding a new agent is trivial

Recommendation: LangGraph for production deployments. LangChain for prototyping and discovery.

When to Use What

Choose LangChain when:

You need a quick prototype or proof of concept
The workflow is mostly linear with few branches
You want rich tool integration with an evidence trail

Choose LangGraph when:

You need auditability, governance, and compliance
The workflow has complex branching or conditional logic
You are in a regulated environment (banking, healthcare, insurance)
You want to unit test each pipeline step in isolation

Choose CrewAI when:

The problem naturally decomposes into distinct roles
You want collaborative multi-agent reasoning
You need to mirror a human team structure
Extensibility matters more than tight control

Evaluation & Production Readiness

Before deploying any classifier, run it against the golden dataset — 30 hand-labelled emails covering all 6 categories with easy, medium, and hard difficulty levels.

make evaluate           # Baseline (keyword fallback, no API key needed)
make evaluate-all       # All approaches (requires OPENAI_API_KEY)
make compare            # Side-by-side comparison with automated verdict

The evaluation computes per-class precision, recall, F1, a confusion matrix, and a production readiness gate:

MIN_WEIGHTED_F1       = 0.70   # Overall quality bar
MIN_PHISHING_RECALL   = 0.80   # Must catch phishing (safety-critical)
MIN_PHISHING_PRECISION = 0.60  # Prevent alert fatigue

All three thresholds must pass for a "production ready" verdict. Results are saved to data/eval_results/ as CSV (per-sample predictions) and JSON (summary metrics).

Why these thresholds? Missing a phishing email is a security breach. Too many false alarms means people stop trusting the system. The gate balances both risks.
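
A sketch of how such a gate could be computed with the standard library alone. The three thresholds are the ones listed above; the function names are hypothetical, and the support-weighted definition of F1 is an assumption (it matches the common "weighted average" convention, but the repository may compute it differently).

```python
MIN_WEIGHTED_F1 = 0.70
MIN_PHISHING_RECALL = 0.80
MIN_PHISHING_PRECISION = 0.60


def per_class_metrics(y_true: list[str], y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Precision, recall, F1, and support for every label seen."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return metrics


def weighted_f1(metrics: dict[str, dict[str, float]]) -> float:
    """Average per-class F1, weighted by how many true samples each class has."""
    total = sum(m["support"] for m in metrics.values())
    return sum(m["f1"] * m["support"] for m in metrics.values()) / total if total else 0.0


def production_ready(y_true: list[str], y_pred: list[str]) -> bool:
    metrics = per_class_metrics(y_true, y_pred)
    phishing = metrics.get("phishing", {"precision": 0.0, "recall": 0.0})
    return (
        weighted_f1(metrics) >= MIN_WEIGHTED_F1
        and phishing["recall"] >= MIN_PHISHING_RECALL
        and phishing["precision"] >= MIN_PHISHING_PRECISION
    )


truth = ["phishing", "phishing", "spam", "invoice", "other"]
missed = ["phishing", "other", "spam", "invoice", "other"]
print(production_ready(truth, truth), production_ready(truth, missed))
```

In the second case a single missed phishing email drops phishing recall to 0.5, so the gate fails even though most predictions are correct.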

Getting Started

Prerequisites

git clone https://github.com/ruslanmv/agentic-ai-tutorial.git
cd agentic-ai-tutorial
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
make install                   # or: pip install -r requirements.txt

Set your API key

export OPENAI_API_KEY="sk-..."

Quick start

make test                  # Run 80+ offline tests (no API key needed)
make run                   # Run all three approaches side by side
make compare               # Head-to-head comparison with verdict

All available commands

make help                  # Show every available command
make install               # Install production dependencies
make install-dev           # Install production + dev dependencies
make test                  # Run offline test suite
make test-all              # Run all tests including integration (needs API key)
make lint                  # Run the ruff linter
make evaluate              # Evaluate keyword fallback against golden dataset
make evaluate-all          # Evaluate all approaches (needs API key)
make compare               # Compare all frameworks with automated verdict
make compare-baseline      # Compare baseline only (no API key)
make run-langchain         # Run only the LangChain ReAct agent
make run-langgraph         # Run only the LangGraph state machine
make run-crewai            # Run only the CrewAI multi-agent system
make clean                 # Remove build artifacts

Run a single approach

python -m src.langchain_agent
python -m src.langgraph_agent
python -m src.crewai_agent

Testing

All shared components are tested offline — no API key required. The test suite covers PII masking, keyword fallback, routing, guardrails, tool functions, evaluation metrics, and JSON parsing.

make test          # 80+ offline tests (~1.5 seconds)
make test-all      # Includes integration tests (needs API key)

tests/
├── test_preprocessing.py     # PII masking patterns and edge cases
├── test_fallback.py          # Keyword classifier for all 6 categories
├── test_routing.py           # Label → action mapping
├── test_guardrails.py        # Policy enforcement and threshold logic
├── test_tools.py             # All four tool functions (offline)
├── test_evaluate.py          # Metrics computation + production gate
├── test_langchain_agent.py   # ReAct agent + JSON parsing strategies
├── test_langgraph_agent.py   # State machine node-level tests
└── test_crewai_agent.py      # Multi-agent crew tests

Project Structure

agentic-ai-tutorial/
├── Makefile                       # All convenience commands
├── README.md                      # This file
├── requirements.txt               # Production dependencies
├── pyproject.toml                 # Package metadata + dev dependencies
│
├── src/                           # Application code
│   ├── schema.py                  # Pydantic models (EmailLabel, EmailClassification)
│   ├── preprocessing.py           # PII masking (SSN, email, IBAN, phone, credit card)
│   ├── fallback.py                # Keyword-based fallback classifier
│   ├── routing.py                 # Label → action routing
│   ├── tools.py                   # Agent tools (sender reputation, URL scan, etc.)
│   ├── evaluate.py                # Evaluation pipeline (precision, recall, F1, prod gate)
│   ├── langchain_agent.py         # Approach 1 — ReAct agent with tools
│   ├── langgraph_agent.py         # Approach 2 — State machine (DAG)
│   └── crewai_agent.py            # Approach 3 — Multi-agent team
│
├── tests/                         # 80+ offline tests + integration tests
├── data/
│   └── golden_dataset.csv         # 30 hand-labelled emails for evaluation
├── examples/
│   ├── run_all.py                 # Run all 3 approaches side by side
│   └── compare_frameworks.py      # Head-to-head comparison with verdict
├── docs/
│   └── blog.md                    # Extended tutorial
└── assets/
    ├── architecture.mermaid       # Mermaid source diagrams
    └── svg/                       # SVG diagrams used in this README

Key Takeaway

An agent is a system that reasons, acts, and iterates toward a goal using tools and state — rather than producing a single response. Multi-agent systems decompose complex workflows across specialized roles. LangChain enables ReAct-style agents with dynamic tool selection. LangGraph provides deterministic state-machine orchestration for enterprise workflows. CrewAI enables collaborative multi-agent designs. Agentic systems trade simplicity for control, safety, and auditability — which is why they are preferred for production in regulated environments.

License & Author

Apache 2.0 | Created by ruslanmv
