Agentic AI Concepts

From Theory to Production Code

One problem. Three frameworks. Production-grade results.

Build agentic AI systems with LangChain, LangGraph, and CrewAI — from first principles to a deployable, tested, and evaluated pipeline.

Python 3.10+ | License: Apache 2.0 | Tests: 80+ | LangChain | LangGraph | CrewAI



Shared pipeline: Preprocess → Classify → Guardrails → Route


Why This Course Exists

Most agentic AI tutorials stop at "hello world." They show you how to call an LLM, maybe wire up a tool, and call it a day. Real production systems need PII masking, structured outputs, guardrails, fallback classifiers, golden-dataset evaluation, and a clear understanding of when to use which framework.

This course builds the same real-world problem — enterprise email classification — in three frameworks side by side, so you can see exactly what each one changes and make an informed choice for your own projects.


What You Will Learn

  • Theory: agents, ReAct, and multi-agent systems, the concepts behind the code
  • Three Frameworks: the same problem solved in LangChain, LangGraph, and CrewAI for direct comparison
  • Production Skills: PII masking, guardrails, offline testing, structured outputs, fallback classifiers
  • Evaluation: golden dataset, precision/recall/F1, confusion matrix, production readiness gate


What Is an Agent?

An agent is a system that controls the flow, not just the output. A conventional LLM takes a prompt and returns a single response. An agent perceives input, reasons about what to do, acts (often by calling tools), observes the result, and iterates until the goal is met.

Agent loop: Perceive → Reason → Act → Observe → Iterate

The key difference: the agent decides what happens next, rather than the developer hardcoding every step. This loop (perceive, reason, act, observe, repeat) is what makes a system "agentic."



Agent vs Conventional LLM

| Aspect | Conventional LLM | Agent |
|---|---|---|
| Flow | Single prompt → single output | Multi-step reasoning loop |
| State | Stateless | Explicit state management |
| Tools | None or ad-hoc | Controlled, evidence-based tool usage |
| Governance | Hard to govern | Guardrails at every step |
| Observability | Low | Full traceability and audit trail |
| Best for | Text generation, Q&A | Workflows, decisions, actions with business impact |

If your task is simple text generation, a regular LLM is fine. If you need workflows, decisions, or actions with real consequences — you need an agent.



The ReAct Pattern

ReAct stands for Reason + Act — the pattern at the heart of modern agent frameworks.

ReAct pattern: Reason → Act → Observe → Repeat

The agent reasons about what to do, calls a tool to act, observes the result, and repeats until it has enough evidence to produce a final answer. Every step is explicit and auditable — you can trace exactly why the agent made each decision.

Why this matters for production:

  • Reasoning is explicit and logged, not hidden inside a single LLM call
  • Prevents blind tool calls — the agent explains why before acting
  • Makes debugging straightforward — inspect any step in the loop
  • Most modern agent frameworks implement structured ReAct under the hood
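
The loop described above can be sketched without any framework. This is a minimal illustration, not the tutorial's implementation: `reason` is a hard-coded stand-in for the LLM, `scan_urls` here is a toy check rather than the real tool, and the `Step` record is an assumed shape for the audit trail.

```python
from dataclasses import dataclass


@dataclass
class Step:
    thought: str      # why the agent chose the action
    action: str       # which tool it called
    observation: str  # what the tool returned


def scan_urls(text: str) -> str:
    # Toy tool: treats any plain-http link as malicious (illustrative rule only).
    return "malicious" if "http://" in text else "clean"


TOOLS = {"scan_urls": scan_urls}


def reason(email: str, trace: list[Step]) -> tuple[str, str]:
    """Stand-in for the LLM: returns (thought, tool name or 'finish')."""
    if "http" in email and not trace:
        return "The email contains a link, so check it first.", "scan_urls"
    return "Enough evidence collected.", "finish"


def react(email: str, max_steps: int = 5) -> tuple[str, list[Step]]:
    trace: list[Step] = []
    for _ in range(max_steps):                  # bounded loop: never spin forever
        thought, action = reason(email, trace)  # Reason
        if action == "finish":
            break
        observation = TOOLS[action](email)      # Act
        trace.append(Step(thought, action, observation))  # Observe + log
    malicious = any(step.observation == "malicious" for step in trace)
    return ("phishing" if malicious else "other"), trace
```

Because every iteration appends a `Step`, the returned trace shows which tool ran and the stated reason for running it, which is the property the bullets above rely on.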


Multi-Agent Systems

A multi-agent system uses multiple specialized agents, each with a focused role, collaborating to solve a problem.

Multi-agent: Classifier → Risk Analyst → Policy Router

Why decompose into multiple agents?

  • Separation of concerns — each agent has one job and does it well
  • Stronger safety — risk assessment is isolated from classification
  • Natural governance — maps directly to how enterprise teams are structured
  • Extensibility — add a new agent (e.g., "Compliance Reviewer") without rewriting the pipeline
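
As a rough sketch of this decomposition, the three roles can be modelled as plain functions that each read a shared context and add one field. The function names, the keyword rule, and the context keys are illustrative assumptions; in the tutorial these roles are LLM-backed agents.

```python
def classifier(ctx: dict) -> dict:
    # Role 1: assigns a label (toy keyword rule in place of an LLM).
    text = ctx["email"].lower()
    label = "phishing" if "verify your account" in text else "other"
    return {**ctx, "label": label}


def risk_analyst(ctx: dict) -> dict:
    # Role 2: assesses risk using only the classifier's output.
    return {**ctx, "risk": "high" if ctx["label"] == "phishing" else "low"}


def policy_router(ctx: dict) -> dict:
    # Role 3: turns the risk assessment into an action.
    action = "quarantine + human review" if ctx["risk"] == "high" else "inbox"
    return {**ctx, "action": action}


# Adding a new role (for example a compliance reviewer) means appending here.
PIPELINE = [classifier, risk_analyst, policy_router]


def run(email: str) -> dict:
    ctx = {"email": email}
    for agent in PIPELINE:
        ctx = agent(ctx)  # each agent builds on the previous agent's output
    return ctx
```

Keeping risk assessment in its own function means it can be tested and changed without touching classification, which is the separation-of-concerns argument in miniature.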


The Problem: Email Classification

We solve the same problem three different ways to make the framework differences concrete and measurable.

Goal: Classify incoming emails into one of six categories, then route each to the appropriate downstream action.

| Category | Example | Action |
|---|---|---|
| phishing | "Verify your account immediately" | Quarantine + human review |
| spam | "Limited time offer! Win a free iPhone" | Quarantine |
| invoice | "Invoice #2026-042 — payment due in 30 days" | Accounts payable queue |
| meeting | "Team sync Thursday 10 AM — Zoom link" | Calendar suggestion |
| support | "Ticket #5432 — production outage" | Support ticket |
| other | Everything else | Inbox |

All three approaches share the same four-step pipeline — only the classification engine changes:

Pipeline: Preprocess → Classify → Guardrails → Route

  1. Preprocess — mask PII (SSN, email, IBAN, phone, credit card) before the LLM sees anything
  2. Classify — the LLM or agent produces a structured label + confidence + rationale
  3. Guardrails — enforce business rules (confidence thresholds, phishing → human review)
  4. Route — map the classification to a concrete downstream action
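
The guardrails step (step 3) could look roughly like the sketch below. The `Classification` dataclass is a simplified stand-in for the tutorial's schema, and the 0.6 threshold is an assumed value chosen for illustration; the source does not state the actual threshold.

```python
from dataclasses import dataclass, replace

CONFIDENCE_THRESHOLD = 0.6  # assumption: the real threshold is not given in the text


@dataclass(frozen=True)
class Classification:
    label: str
    confidence: float
    requires_human_review: bool = False


def apply_guardrails(c: Classification) -> Classification:
    # Business rules from the pipeline description:
    #   - phishing always goes to a human
    #   - low-confidence results are never trusted blindly
    if c.label == "phishing" or c.confidence < CONFIDENCE_THRESHOLD:
        return replace(c, requires_human_review=True)
    return c
```

The rules are deterministic and sit outside the model, so they hold even when the classifier is confidently wrong.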


Shared Components

All three implementations reuse the same core modules. This isolation is intentional — it lets you see exactly what each framework changes.

Schema — src/schema.py

from enum import Enum

from pydantic import BaseModel

class EmailLabel(str, Enum):
    PHISHING = "phishing"
    SPAM     = "spam"
    INVOICE  = "invoice"
    MEETING  = "meeting"
    SUPPORT  = "support"
    OTHER    = "other"

class EmailClassification(BaseModel):
    label:                 EmailLabel
    confidence:            float        # 0.0 – 1.0
    rationale:             str          # no PII allowed
    indicators:            list[str]    # key signals detected
    requires_human_review: bool         # True if ambiguous or high-risk

Pydantic enforces the contract: if the LLM returns malformed JSON, an unknown label, or a response with a missing field, validation raises an error immediately instead of silently passing garbage downstream.
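
To make the fail-fast behaviour concrete without requiring Pydantic, here is a standard-library sketch of the same kind of check. `parse_classification` and `REQUIRED_FIELDS` are hypothetical names; in the tutorial itself Pydantic's validation plays this role, and the explicit range check on `confidence` is an added assumption based on the schema comment.

```python
import json
from enum import Enum


class EmailLabel(str, Enum):
    PHISHING = "phishing"
    SPAM = "spam"
    INVOICE = "invoice"
    MEETING = "meeting"
    SUPPORT = "support"
    OTHER = "other"


REQUIRED_FIELDS = {"label", "confidence", "rationale",
                   "indicators", "requires_human_review"}


def parse_classification(raw: str) -> dict:
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    data["label"] = EmailLabel(data["label"])  # unknown label raises ValueError
    if not 0.0 <= float(data["confidence"]) <= 1.0:
        raise ValueError("confidence must be within [0.0, 1.0]")
    return data
```

Every failure surfaces as an exception at the parsing boundary, so the caller can switch to the fallback classifier instead of routing on a bad value.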

PII Masking — src/preprocessing.py

Sensitive data is replaced with placeholder tags before the email reaches the LLM. Social security numbers become [SSN], credit card numbers become [CREDIT_CARD], and so on. This is a hard requirement in any regulated environment — you never want to send raw PII to a third-party API.
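
A regex-based masker along these lines might look as follows. The `[SSN]` and `[CREDIT_CARD]` tags come from the text above; the other tag names and all of the patterns are assumptions, deliberately simplified, and would need hardening for real data (US-style SSNs and 16-digit cards only, for instance).

```python
import re

# Order matters: longer, more specific patterns run first so that a phone
# pattern cannot consume part of an IBAN or card number.
PATTERNS = [
    ("[IBAN]", re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[CREDIT_CARD]", re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("[PHONE]", re.compile(
        r"(?<!\d)(?:\+?\d{1,3}[ -]?)?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b")),
]


def mask_pii(text: str) -> str:
    """Replace each recognised PII span with its placeholder tag."""
    for tag, pattern in PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

Masking happens before any model call, so the placeholder text is the only version of the email a third-party API ever receives.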

Keyword Fallback — src/fallback.py

A deterministic safety net. When the LLM returns low confidence or is unavailable, the system degrades gracefully to rule-based classification instead of failing. The fallback is honest about its limitations — confidence is capped at 0.8.
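
One way such a fallback could work is sketched below. The keyword lists and the scoring formula are invented for illustration; only the 0.8 confidence cap comes from the description above.

```python
MAX_FALLBACK_CONFIDENCE = 0.8  # cap stated in the text

# Hypothetical keyword lists, loosely based on the example emails.
KEYWORDS = {
    "phishing": ("verify your account", "password", "suspended"),
    "spam": ("limited time offer", "win a free", "unsubscribe"),
    "invoice": ("invoice", "payment due"),
    "meeting": ("meeting", "team sync", "zoom"),
    "support": ("ticket", "outage"),
}


def classify_by_keywords(text: str) -> tuple[str, float]:
    lowered = text.lower()
    # Count how many keywords of each category appear in the email.
    hits = {label: sum(kw in lowered for kw in kws)
            for label, kws in KEYWORDS.items()}
    label, count = max(hits.items(), key=lambda item: item[1])
    if count == 0:
        return "other", 0.5
    # More matching keywords means more confidence, but never above the cap.
    return label, min(MAX_FALLBACK_CONFIDENCE, 0.5 + 0.15 * count)
```

The cap keeps a rule-based guess from ever looking as certain as a strong model prediction, so downstream guardrails can still treat it with appropriate caution.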

Routing — src/routing.py

Maps each label to a downstream action. The critical override: if requires_human_review is True, that always takes priority regardless of the label.
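
The override can be expressed in a few lines. The action strings mirror the category table earlier in this README; the `"human review"` return value and the function signature are assumptions about how `src/routing.py` might expose this.

```python
ACTIONS = {
    "phishing": "quarantine + human review",
    "spam": "quarantine",
    "invoice": "accounts payable queue",
    "meeting": "calendar suggestion",
    "support": "support ticket",
    "other": "inbox",
}


def route(label: str, requires_human_review: bool) -> str:
    # The review flag is checked first so no label can bypass it.
    if requires_human_review:
        return "human review"
    return ACTIONS.get(label, "inbox")  # unknown labels fall back to the inbox
```

Checking the flag before the lookup is what makes the override unconditional: a confident "invoice" label still goes to a person if a guardrail raised the flag.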

Tools — src/tools.py

Four evidence-gathering tools that make the agent truly agentic:

| Tool | Purpose | When the agent calls it |
|---|---|---|
| check_sender_reputation | Domain risk score lookup | Unfamiliar or suspicious sender |
| scan_urls | Check URLs for malicious patterns | Email contains links |
| lookup_known_contacts | Verify sender against directory | Any visible sender address |
| check_invoice_registry | Validate invoice numbers | Email mentions invoices or payments |

In production, these would call real APIs (threat intelligence, CRM, invoice management). Here they use simulated databases so the entire pipeline can be tested offline.
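
A simulated tool of this kind might be as small as the sketch below. The function name matches the table above, but the signature, the return shape, the domains, and the scores are all made up for illustration.

```python
# Hypothetical in-memory stand-in for a threat-intelligence service.
REPUTATION_DB = {
    "example.com": 0.1,             # known, low risk
    "paypa1-secure.example": 0.95,  # known, high risk
}
DEFAULT_RISK = 0.5  # assumed score for domains with no record


def check_sender_reputation(sender: str) -> dict:
    """Return a risk score for the sender's domain (0 = safe, 1 = dangerous)."""
    domain = sender.rsplit("@", 1)[-1].lower()
    return {
        "domain": domain,
        "risk_score": REPUTATION_DB.get(domain, DEFAULT_RISK),
        "known": domain in REPUTATION_DB,
    }
```

Swapping the dictionary lookup for a real API call later would not change the interface the agent sees, which is what keeps the offline tests meaningful.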



Approach 1 — LangChain (ReAct Agent)

Source: src/langchain_agent.py + src/tools.py

LangChain ReAct architecture

The LLM decides which tools to call based on the email content. Different emails trigger different tool paths — a phishing email triggers scan_urls + check_sender_reputation, while an invoice triggers check_invoice_registry + lookup_known_contacts.

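from langgraph.prebuilt import create_react_agent

# The prebuilt ReAct agent ships with LangGraph; `llm` (a chat model) and
# SYSTEM_PROMPT are assumed to be defined earlier in the module.
agent = create_react_agent(
    model=llm,
    tools=[check_sender_reputation, scan_urls,
           lookup_known_contacts, check_invoice_registry],
    prompt=SYSTEM_PROMPT,
)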

What makes it a real agent:

  • The LLM controls the flow — it chooses which tools to invoke and when to stop
  • It follows a full ReAct loop (Reason → Act → Observe → Repeat)
  • Tool calls produce evidence the classification is based on, not just LLM intuition
  • The tools_used field in the output provides a complete audit trail

Best for: Evidence-based classification, tool-heavy workflows, rapid prototyping.

Trade-off: Control flow is decided by the LLM — harder to guarantee deterministic execution paths at scale.



Approach 2 — LangGraph (State Machine)

Source: src/langgraph_agent.py

LangGraph state machine architecture

LangGraph makes the control flow explicit as a directed graph with typed state. Every node is a function, every edge is declared, and branching logic is visible — not buried in code.

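from langgraph.graph import END, START, StateGraph

# `graph` is a StateGraph built over the pipeline's TypedDict state; the node
# functions and decide_path are assumed to be defined elsewhere in the module.
graph.add_node("preprocess", preprocess_node)
graph.add_node("classify",   classify_node)
graph.add_node("guardrails", guardrails_node)
graph.add_node("fallback",   fallback_node)
graph.add_node("route",      route_node)

graph.add_edge(START,        "preprocess")   # entry point
graph.add_edge("preprocess", "classify")
graph.add_edge("classify",   "guardrails")
graph.add_conditional_edges("guardrails", decide_path, {...})
graph.add_edge("fallback",   "route")
graph.add_edge("route",      END)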

What makes it enterprise-grade:

  • Deterministic flow — every edge is explicit and auditable
  • Typed state — a TypedDict state flows through the graph, with each node returning its updates
  • Conditional edges — branching logic is declared, not hidden in if-else chains
  • Node-level testing — each function can be unit tested in complete isolation

Best for: Regulated environments, compliance-sensitive pipelines, anything that needs a reviewable audit trail.

When a compliance team asks "Can you prove the system follows your documented process?" — with LangGraph, the graph is the documentation.
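
To show the shape of a conditional edge without installing LangGraph, here is a framework-free sketch. It mimics the idea of nodes, declared edges, and a `decide_path` function; it is not the LangGraph API, and the node logic, the 0.6 threshold, and the state fields are illustrative assumptions.

```python
from typing import TypedDict

END = "__end__"
THRESHOLD = 0.6  # assumed confidence threshold


class State(TypedDict, total=False):
    email: str
    label: str
    confidence: float
    action: str


def classify_node(state: State) -> State:
    # Toy classifier: confident about invoices, unsure about everything else.
    is_invoice = "invoice" in state["email"].lower()
    return {"label": "invoice" if is_invoice else "other",
            "confidence": 0.9 if is_invoice else 0.3}


def fallback_node(state: State) -> State:
    return {"label": "other", "confidence": 0.5}


def route_node(state: State) -> State:
    is_invoice = state["label"] == "invoice"
    return {"action": "accounts payable queue" if is_invoice else "inbox"}


def decide_path(state: State) -> str:
    # The conditional edge: name the next node based on the current state.
    return "route" if state["confidence"] >= THRESHOLD else "fallback"


NODES = {"classify": classify_node, "fallback": fallback_node, "route": route_node}
EDGES = {"classify": decide_path,
         "fallback": lambda state: "route",
         "route": lambda state: END}


def run(email: str) -> tuple[State, list[str]]:
    state: State = {"email": email}
    visited: list[str] = []
    node = "classify"
    while node != END:
        state = {**state, **NODES[node](state)}  # merge the node's updates
        visited.append(node)
        node = EDGES[node](state)
    return state, visited
```

The `visited` list is the audit trail: for any input it records exactly which declared path the email took, and each node function can be unit tested on its own.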



Approach 3 — CrewAI (Multi-Agent)

Source: src/crewai_agent.py

CrewAI multi-agent architecture

CrewAI decomposes the problem across three specialized agents that communicate through a sequential task pipeline. Each agent has a focused role, a goal, and a backstory that shapes its reasoning.

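from crewai import Agent, Crew, Process

# Each Agent also takes a goal and a backstory (elided here); the three tasks
# are assumed to be defined alongside the agents.
classifier_agent = Agent(role="Email Classifier", ...)
risk_agent       = Agent(role="Risk Analyst", ...)
policy_agent     = Agent(role="Policy Router", ...)

crew = Crew(
    agents=[classifier_agent, risk_agent, policy_agent],
    tasks=[classify_task, risk_task, policy_task],
    process=Process.sequential,
)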

What makes it powerful:

  • Role separation — each agent has a focused responsibility
  • Natural team mapping — mirrors how human organizations structure decision-making
  • Extensible — add a new agent (e.g., "Compliance Reviewer") without rewriting anything
  • Collaborative reasoning — each agent builds on the previous agent's output

Trade-off: Three LLM calls per email instead of one. For simple classification, that's overkill. For genuinely complex problems with multiple reasoning domains, it's the natural choice.



Framework Comparison

Framework comparison: LangChain vs LangGraph vs CrewAI

| Dimension | LangChain | LangGraph | CrewAI |
|---|---|---|---|
| Architecture | ReAct agent (tool loop) | State machine (DAG) | Multi-agent team |
| Control flow | Implicit (LLM decides) | Explicit edges | Role-based |
| State management | Pass-through | TypedDict | Shared context |
| Branching | In code | Conditional edges | Agent delegation |
| Testability | Integration tests | Node-level unit tests | Agent-level tests |
| Governance | Manual | Built-in | Per-agent policies |
| Best for | Prototypes, tool-heavy workflows | Enterprise, compliance | Complex collaboration |
| Learning curve | Low | Medium | Medium |

Head-to-Head Results

| Criterion | Winner | Why |
|---|---|---|
| Best accuracy | LangChain / LangGraph | Both ~90%; tools give LangChain an edge on hard cases |
| Best speed | LangGraph | Single LLM call, ~3x faster than alternatives |
| Best auditability | LangGraph | Explicit edges, typed state, fully reviewable |
| Best safety | LangChain | Tool evidence provides the richest audit trail |
| Best cost | LangGraph | 1 LLM call vs multi-turn (LangChain) vs 3 calls (CrewAI) |
| Best extensibility | CrewAI | Adding a new agent is trivial |

Recommendation: LangGraph for production deployments. LangChain for prototyping and discovery.



When to Use What

Choose LangChain when:

  • You need a quick prototype or proof of concept
  • The workflow is mostly linear with few branches
  • You want rich tool integration with an evidence trail

Choose LangGraph when:

  • You need auditability, governance, and compliance
  • The workflow has complex branching or conditional logic
  • You are in a regulated environment (banking, healthcare, insurance)
  • You want to unit test each pipeline step in isolation

Choose CrewAI when:

  • The problem naturally decomposes into distinct roles
  • You want collaborative multi-agent reasoning
  • You need to mirror a human team structure
  • Extensibility matters more than tight control


Evaluation & Production Readiness

Before deploying any classifier, run it against the golden dataset — 30 hand-labelled emails covering all 6 categories with easy, medium, and hard difficulty levels.

make evaluate           # Baseline (keyword fallback, no API key needed)
make evaluate-all       # All approaches (requires OPENAI_API_KEY)
make compare            # Side-by-side comparison with automated verdict

The evaluation computes per-class precision, recall, F1, a confusion matrix, and a production readiness gate:

MIN_WEIGHTED_F1       = 0.70   # Overall quality bar
MIN_PHISHING_RECALL   = 0.80   # Must catch phishing (safety-critical)
MIN_PHISHING_PRECISION = 0.60  # Prevent alert fatigue

All three thresholds must pass for a "production ready" verdict. Results are saved to data/eval_results/ as CSV (per-sample predictions) and JSON (summary metrics).

Why these thresholds? Missing a phishing email is a security breach. Too many false alarms means people stop trusting the system. The gate balances both risks.
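
The gate logic can be sketched with the standard library alone. The three thresholds are the ones listed above; the function names and the metrics dictionary layout are assumptions and may not match `src/evaluate.py`.

```python
MIN_WEIGHTED_F1 = 0.70
MIN_PHISHING_RECALL = 0.80
MIN_PHISHING_PRECISION = 0.60


def per_class_metrics(y_true: list[str], y_pred: list[str]) -> dict:
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall,
                          "f1": f1, "support": tp + fn}
    return metrics


def weighted_f1(metrics: dict) -> float:
    # Each class's F1 is weighted by how many true samples it has.
    total = sum(m["support"] for m in metrics.values())
    return sum(m["f1"] * m["support"] for m in metrics.values()) / total


def production_ready(metrics: dict) -> bool:
    phishing = metrics.get("phishing", {"precision": 0.0, "recall": 0.0})
    return (weighted_f1(metrics) >= MIN_WEIGHTED_F1
            and phishing["recall"] >= MIN_PHISHING_RECALL
            and phishing["precision"] >= MIN_PHISHING_PRECISION)
```

Note that a classifier can clear the overall F1 bar and still fail the gate: missing one of two phishing emails gives a phishing recall of 0.5, which is below the 0.80 floor.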



Getting Started

Prerequisites

git clone https://github.com/ruslanmv/agentic-ai-tutorial.git
cd agentic-ai-tutorial
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
make install                   # or: pip install -r requirements.txt

Set your API key

export OPENAI_API_KEY="sk-..."

Quick start

make test                  # Run 80+ offline tests (no API key needed)
make run                   # Run all three approaches side by side
make compare               # Head-to-head comparison with verdict

All available commands

make help                  # Show every available command
make install               # Install production dependencies
make install-dev           # Install production + dev dependencies
make test                  # Run offline test suite
make test-all              # Run all tests including integration (needs API key)
make lint                  # Run the ruff linter
make evaluate              # Evaluate keyword fallback against golden dataset
make evaluate-all          # Evaluate all approaches (needs API key)
make compare               # Compare all frameworks with automated verdict
make compare-baseline      # Compare baseline only (no API key)
make run-langchain         # Run only the LangChain ReAct agent
make run-langgraph         # Run only the LangGraph state machine
make run-crewai            # Run only the CrewAI multi-agent system
make clean                 # Remove build artifacts

Run a single approach

python -m src.langchain_agent
python -m src.langgraph_agent
python -m src.crewai_agent


Testing

All shared components are tested offline — no API key required. The test suite covers PII masking, keyword fallback, routing, guardrails, tool functions, evaluation metrics, and JSON parsing.

make test          # 80+ offline tests (~1.5 seconds)
make test-all      # Includes integration tests (needs API key)

tests/
├── test_preprocessing.py     # PII masking patterns and edge cases
├── test_fallback.py          # Keyword classifier for all 6 categories
├── test_routing.py           # Label → action mapping
├── test_guardrails.py        # Policy enforcement and threshold logic
├── test_tools.py             # All four tool functions (offline)
├── test_evaluate.py          # Metrics computation + production gate
├── test_langchain_agent.py   # ReAct agent + JSON parsing strategies
├── test_langgraph_agent.py   # State machine node-level tests
└── test_crewai_agent.py      # Multi-agent crew tests


Project Structure

agentic-ai-tutorial/
├── Makefile                       # All convenience commands
├── README.md                      # This file
├── requirements.txt               # Production dependencies
├── pyproject.toml                 # Package metadata + dev dependencies
│
├── src/                           # Application code
│   ├── schema.py                  # Pydantic models (EmailLabel, EmailClassification)
│   ├── preprocessing.py           # PII masking (SSN, email, IBAN, phone, credit card)
│   ├── fallback.py                # Keyword-based fallback classifier
│   ├── routing.py                 # Label → action routing
│   ├── tools.py                   # Agent tools (sender reputation, URL scan, etc.)
│   ├── evaluate.py                # Evaluation pipeline (precision, recall, F1, prod gate)
│   ├── langchain_agent.py         # Approach 1 — ReAct agent with tools
│   ├── langgraph_agent.py         # Approach 2 — State machine (DAG)
│   └── crewai_agent.py            # Approach 3 — Multi-agent team
│
├── tests/                         # 80+ offline tests + integration tests
├── data/
│   └── golden_dataset.csv         # 30 hand-labelled emails for evaluation
├── examples/
│   ├── run_all.py                 # Run all 3 approaches side by side
│   └── compare_frameworks.py      # Head-to-head comparison with verdict
├── docs/
│   └── blog.md                    # Extended tutorial
└── assets/
    ├── architecture.mermaid       # Mermaid source diagrams
    └── svg/                       # SVG diagrams used in this README


Key Takeaway

An agent is a system that reasons, acts, and iterates toward a goal using tools and state — rather than producing a single response. Multi-agent systems decompose complex workflows across specialized roles. LangChain enables ReAct-style agents with dynamic tool selection. LangGraph provides deterministic state-machine orchestration for enterprise workflows. CrewAI enables collaborative multi-agent designs. Agentic systems trade simplicity for control, safety, and auditability — which is why they are preferred for production in regulated environments.



License & Author

Apache 2.0 | Created by ruslanmv
