# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language (LCEL) to build coordinated, multi-actor, stateful applications that include cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Because we use cycles rather than loops, we can also compose rather complex flows through our graph in a much more readable and natural fashion, effectively recreating application flowcharts in code in an almost 1-to-1 fashion.

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1:  Dependencies

We'll first install all our required libraries.

> NOTE: If you're running this locally - please skip this step.

In [1]:
#!pip install -qU langchain langchain_openai langchain-community langgraph arxiv

## Task 2: Environment Variables

We'll want to set both our OpenAI API key and our LangSmith environment variables.

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [3]:
os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY")

In [4]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE6 - LangGraph - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Tavily Search Results](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/tools/tavily_search/tool.py)
- [Arxiv](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools/arxiv)

#### Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be an instantiated tool object (for example, `ArxivQueryRun()`), not the class itself.

In [5]:
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tavily_tool = TavilySearchResults(max_results=5)

tool_belt = [
    tavily_tool,
    ArxivQueryRun(),
]

### Model

Now we can set-up our model! We'll leverage the familiar OpenAI model suite for this example - but it's not *necessary* to use with LangGraph. LangGraph supports all models - though you might not find success with smaller models - as such, they recommend you stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API - we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API).

In [6]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o", temperature=0)

Now that we have our model set-up, let's "put on the tool belt", which is to say: We'll bind our LangChain formatted tools to the model in an OpenAI function calling format.

In [7]:
model = model.bind_tools(tool_belt)

#### Question #1:

How does the model determine which tool to use?

The model determines which tool to use via **OpenAI-style function calling**, which leverages the tool schema bound to the model.

Here’s how it works:
- When tools are bound using `.bind_tools(tool_belt)`, LangChain passes each tool's metadata (name, description, input schema) to the model.
- During inference, the model:
  - Analyzes the user query
  - Predicts a `function_call` in its output (if needed)
  - Specifies the **tool name** and the **arguments** required

The LangChain agent inspects the model’s output:

```python
AIMessage(content="", additional_kwargs={"tool_calls": [...]})
```

Then:
- If `tool_calls` is present, the state routes to the corresponding tool node
- The specified tool runs and returns output
- This output is added to the state and fed back into the agent node

**In short**: The model selects a tool by predicting a function call structured around the tool’s name and expected input format — using function-calling API behavior.
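
To make the dispatch step concrete, here is a minimal, library-free sketch. `FakeAIMessage`, `TOOLS`, and `run_tool_calls` are hypothetical stand-ins (not LangChain APIs), and the tools are stubs that only return strings. The sketch assumes each tool call is a dict with `name`, `args`, and `id`, which matches the `tool_calls` entries visible in the cell outputs later in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for an AIMessage that may carry tool calls."""
    content: str = ""
    tool_calls: list = field(default_factory=list)


# Hypothetical registry: tool name -> plain Python callable (stubbed here).
TOOLS = {
    "arxiv": lambda query: f"arxiv results for {query!r}",
    "tavily_search_results_json": lambda query: f"web results for {query!r}",
}


def run_tool_calls(message):
    """Run every tool the model asked for and return (call id, output) pairs."""
    results = []
    for call in message.tool_calls:
        tool = TOOLS[call["name"]]     # the model chose the tool by name
        output = tool(**call["args"])  # ...and supplied the arguments
        results.append((call["id"], output))
    return results


message = FakeAIMessage(
    tool_calls=[{"name": "arxiv", "args": {"query": "QLoRA"}, "id": "call_1"}]
)
print(run_tool_calls(message))
```

A message with no tool calls produces an empty result list, which is the case where the graph would route to the end instead of the tool node.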

## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StateGraph` which uses an `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below - but this `AgentState` object is one that is stored in a `TypedDict` with the key `messages` and the value is a `Sequence` of `BaseMessages` that will be appended to whenever the state changes.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
  - `{"messages" : []}`
2. Our user submits a query to our application.
  - New State: `HumanMessage(#1)`
  - `{"messages" : [HumanMessage(#1)]}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
  - New State: `AgentMessage(#1, additional_kwargs {"function_call" : "WebSearchTool"})`
  - `{"messages" : [HumanMessage(#1), AgentMessage(#1, ...)]}`
4. We pass our state object to a "conditional node" (more on this later) which reads the last state to determine if we need to use a tool - which it can determine properly because of our provided object!

In [8]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
import operator
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
  messages: Annotated[list, add_messages]
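
The `Annotated[list, add_messages]` hint tells LangGraph how to merge a node's return value into the existing state. The sketch below imitates only the append behaviour, using a hypothetical `append_messages` reducer and `apply_update` helper with strings in place of message objects. The real `add_messages` does more (for example, it can replace a message that has the same ID), so treat this as an illustration rather than the library's implementation.

```python
def append_messages(existing, new):
    """Simplified reducer: return a new list with the update appended."""
    return list(existing) + list(new)


def apply_update(state, update):
    """Merge a node's partial update into the state using the reducer."""
    merged = dict(state)
    merged["messages"] = append_messages(state["messages"], update["messages"])
    return merged


state = {"messages": []}
state = apply_update(state, {"messages": ["HumanMessage(#1)"]})
state = apply_update(state, {"messages": ["AIMessage(#1)"]})
print(state)
```

This is why a node can return only its new message: the reducer, not the node, is responsible for keeping the history.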

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to each other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [9]:
from langgraph.prebuilt import ToolNode

def call_model(state):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt)
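
As a rough sketch of the async option mentioned in the note above, a node can be written as a coroutine with the same input and output shape as `call_model`. `StubModel` and `call_model_async` are hypothetical names; the stub stands in for a chat model with an async `ainvoke` method so the example runs without an API key.

```python
import asyncio


class StubModel:
    """Hypothetical stand-in for a chat model exposing an async `ainvoke`."""

    async def ainvoke(self, messages):
        await asyncio.sleep(0)  # pretend to wait on a network call
        return f"echo: {messages[-1]}"


stub_model = StubModel()


async def call_model_async(state):
    """Async counterpart of `call_model`: reads messages, returns one new message."""
    response = await stub_model.ainvoke(state["messages"])
    return {"messages": [response]}


result = asyncio.run(call_model_async({"messages": ["hello"]}))
print(result)
```

Inside a notebook, where an event loop is already running, `await call_model_async(...)` would replace the `asyncio.run(...)` call.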

Now we have two total nodes. We have:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [10]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x1224b4050>

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [11]:
uncompiled_graph.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x1224b4050>

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks whether the last message contains any `tool_calls`.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or the END (finish the graph).

It's important to highlight that `add_conditional_edges` accepts an optional third parameter, a dictionary that maps the possible outputs of the conditional function to node names. In this case `should_continue` returns either `"action"` or `END`, which are already valid destinations, so no mapping is needed.

In [12]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)

<langgraph.graph.state.StateGraph at 0x1224b4050>
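
When a mapping is supplied as the third argument (the helpfulness graph later in this notebook uses one), the router's return value is treated as a label and looked up in that mapping. The sketch below shows the lookup idea in plain Python. `resolve_next_node`, `router`, and `END_SENTINEL` are hypothetical names, and this is an illustration of the concept, not LangGraph's internal routing code.

```python
END_SENTINEL = "__end__"  # hypothetical stand-in for LangGraph's END constant


def router(state):
    """Return a label describing what should happen next."""
    return "continue" if state["pending_tool_calls"] else "end"


def resolve_next_node(state, router, mapping=None):
    """Translate the router's label into a node name via the mapping.

    Without a mapping, the label itself is treated as the node name.
    """
    label = router(state)
    if mapping is None:
        return label
    if label not in mapping:
        raise KeyError(f"router returned unmapped label: {label!r}")
    return mapping[label]


mapping = {"continue": "action", "end": END_SENTINEL}
print(resolve_next_node({"pending_tool_calls": ["arxiv"]}, router, mapping))
print(resolve_next_node({"pending_tool_calls": []}, router, mapping))
```

The practical consequence is that every string the router can return needs a matching key in the mapping; a label with no key has nowhere to go.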

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [13]:
uncompiled_graph.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x1224b4050>

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)

All that's left to do now is to compile our workflow - and we're off!

In [14]:
compiled_graph = uncompiled_graph.compile()

#### Question #2:

Is there any specific limit to how many times we can cycle?

If not, how could we impose a limit to the number of cycles?

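**Default Behaviour**: Nothing in our graph's own logic limits cycling, so a conditional edge that never returns `END` would loop indefinitely. LangGraph guards against this with a `recursion_limit` (25 steps by default) and raises a `GraphRecursionError` when it is exceeded. The limit can be changed per call, for example `compiled_graph.invoke(inputs, config={"recursion_limit": 10})`.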

**To impose a limit**, you can:
1. **Track the number of cycles** via message count:
   ```python
   if len(state["messages"]) > MAX_CYCLE_LIMIT:
       return "end"
   ```
2. **Introduce a counter in state**:
   - Add a `loop_count: int` key in `AgentState`
   - Increment it at every node
   - Terminate the loop conditionally when `loop_count > threshold`

3. **Add a guardrail node**:
   - Before every tool call or agent invocation, route through a check node
   - Abort if a threshold is crossed

**Best Practice**: Use `len(state["messages"])` as a simple heuristic in small-scale experiments. In production, it is better to use a dedicated `loop_counter` in the state schema.
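
As a minimal sketch of option 2, the snippet below keeps a `loop_count` in state and stops once it reaches a threshold. `MAX_LOOPS`, `agent_step`, and `should_continue_with_limit` are hypothetical names, the state is a plain dict, and the "agent" is a stub that always asks for another tool call, so only the counter can end the loop.

```python
MAX_LOOPS = 3  # hypothetical threshold


def agent_step(state):
    """Stub agent: always requests a tool and increments the counter."""
    return {
        "messages": state["messages"] + ["tool request"],
        "loop_count": state["loop_count"] + 1,
    }


def should_continue_with_limit(state):
    """Route to 'action' until the counter reaches MAX_LOOPS, then stop."""
    if state["loop_count"] >= MAX_LOOPS:
        return "end"
    return "action"


state = {"messages": ["question"], "loop_count": 0}
while True:
    state = agent_step(state)
    if should_continue_with_limit(state) == "end":
        break

print(state["loop_count"], len(state["messages"]))
```

A dedicated counter measures agent turns directly, whereas `len(state["messages"])` also grows with every tool result, so the two limits are not interchangeable.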

## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fares:

In [15]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="Who is the current captain of the Winnipeg Jets?")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_XEB3nQERJ34NN9w6GFDGO1nf', 'function': {'arguments': '{"query":"current captain of the Winnipeg Jets 2023"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 27, 'prompt_tokens': 162, 'total_tokens': 189, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_f5bdcc3276', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-f208f6e2-0749-4ad7-964a-5c57b745274c-0', tool_calls=[{'name': 'tavily_search_results_json', 'args': {'query': 'current captain of the Winnipeg Jets 2023'}, 'id': 'call_XEB3nQERJ34NN9w6GFDGO1nf', 'type': 'tool_call'}], usage_metadata={'input_tokens': 162, 'output_t

Let's look at what happened:

1. Our state object was populated with our request
2. The state object was passed into our entry point (agent node) and the agent node added an `AIMessage` to the state object and passed it along the conditional edge
3. The conditional edge received the state object, found the "tool_calls" `additional_kwarg`, and sent the state object to the action node
4. The action node ran the requested tool, added its result to the state object as a `ToolMessage`, and passed it along the edge to the agent node
5. The agent node added a response to the state object and passed it along the conditional edge
6. The conditional edge received the state object, could not find the "tool_calls" `additional_kwarg` and passed the state object to END where we see it output in the cell above!

Now let's look at an example that shows multiple tool usage - all with the same flow!

In [16]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_Nc6MLS3zCcNiZGJWTzvrvyQC', 'function': {'arguments': '{"query":"QLoRA"}', 'name': 'arxiv'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 178, 'total_tokens': 195, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_f5bdcc3276', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-b924709e-f1cb-4542-b1e9-c38a71d352af-0', tool_calls=[{'name': 'arxiv', 'args': {'query': 'QLoRA'}, 'id': 'call_Nc6MLS3zCcNiZGJWTzvrvyQC', 'type': 'tool_call'}], usage_metadata={'input_tokens': 178, 'output_tokens': 17, 'total_tokens': 195, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'a

#### Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

**User Input**: *"Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!"*

---

**LangGraph Execution Steps**

1. **Initialize State**
   - The graph starts with an empty message state.
   - A `HumanMessage` containing the user input is appended.

2. **Step 1 — Agent Node Execution**
   - The `agent` node receives the state and invokes the LLM (GPT-4o).
   - The model interprets the user intent and returns a `function_call`:

     ```json
     {
       "function": {
         "name": "arxiv",
         "arguments": {"query": "QLoRA"}
       }
     }
     ```

   - This is added to the message sequence as an `AIMessage`.

3. **Step 2 — Conditional Edge**
   - The `should_continue` function detects a `tool_call` (Arxiv).
   - The state is routed to the `action` node.

4. **Step 3 — Action Node Execution**
   - The `arxiv` tool is invoked with `query="QLoRA"`.
   - It returns metadata (title, abstract, and **author list**).
   - A `ToolMessage` with this result is appended to state.

5. **Step 4 — Back to Agent Node**
   - The updated state (including Arxiv output) is passed back to the `agent` node.
   - The agent analyzes the tool result and now decides to **loop** over the authors and fetch their latest Tweets using Tavily.
   - It produces another `function_call`:

     ```json
     {
       "function": {
         "name": "tavily_search_results_json",
         "arguments": {"query": "Tim Dettmers latest tweet"}
       }
     }
     ```

6. **Step 5 — Conditional Edge**
   - A new tool call is detected, routing again to `action`.

7. **Step 6 — Action Node Execution**
   - The `tavily_search_results_json` tool fetches web search results for the author (e.g., link to tweet thread or profile).
   - This result is appended as another `ToolMessage`.

8. **Step 7 — Agent Node Final Response**
   - With both tools’ results in state, the agent constructs a natural language reply:
   
     > *"Here are the latest tweets or posts from the authors of the QLoRA paper..."*

9. **Step 8 — Conditional Edge**
   - No further `tool_calls` detected.
   - State is routed to the `END` node.
   - Final answer is returned.

---

**Final Outcome**
- The system has used **multiple reasoning steps**, **memory (state)**, and **cyclic routing** to handle a multi-instruction user query.
- This shows LangGraph’s power to handle **tool-chaining** and **multi-turn inference** in a structured, modular way.

# 🤝 Breakout Room #2

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [17]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["question"])]}

def parse_output(input_state):
  return input_state["messages"][-1].content

agent_chain = convert_inputs | compiled_graph | parse_output
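
The `|` operator composes the three steps into one runnable, so data flows left to right. A rough plain-Python equivalent is sketched below, with a hypothetical `stub_graph` standing in for `compiled_graph` and strings standing in for message objects; the `_sketch` function names are also hypothetical.

```python
def convert_inputs_sketch(input_object):
    """Wrap the question in the state shape the graph expects."""
    return {"messages": [input_object["question"]]}


def stub_graph(state):
    """Hypothetical stand-in for the compiled graph: appends one reply."""
    return {"messages": state["messages"] + ["stub answer"]}


def parse_output_sketch(state):
    """Return only the final message."""
    return state["messages"][-1]


def agent_chain_sketch(input_object):
    """Apply the three steps in the same left-to-right order as the chain."""
    return parse_output_sketch(stub_graph(convert_inputs_sketch(input_object)))


print(agent_chain_sketch({"question": "What is RAG?"}))
```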

In [18]:
agent_chain.invoke({"question" : "What is RAG?"})

"RAG stands for Retrieval-Augmented Generation. It is a technique used in natural language processing (NLP) that combines retrieval-based methods with generative models to improve the quality and accuracy of generated text. Here's how it generally works:\n\n1. **Retrieval**: The system first retrieves relevant information from a large corpus or database. This step involves searching for documents, passages, or data that are pertinent to the input query or context.\n\n2. **Augmentation**: The retrieved information is then used to augment the input to a generative model. This means that the generative model has access to additional context or facts that can help it produce more accurate and contextually relevant responses.\n\n3. **Generation**: Finally, the generative model uses both the original input and the retrieved information to generate a response. This can be in the form of text, answers to questions, or other types of content.\n\nRAG is particularly useful in scenarios where the

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]
```

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions.

In [19]:
questions = [
    "What is the purpose of the LoRA technique in model fine-tuning?",
    "How does QLoRA differ from traditional LoRA?",
    "What type of datasets were used to evaluate QLoRA?",
    "Which large language model was used with QLoRA in the experiments?",
    "What are the benefits of quantization in QLoRA?",
    "Explain the concept of attention in transformer models."
]

answers = [
    {"must_mention": ["reduce", "parameters"]},             # LoRA goal
    {"must_mention": ["quantized", "efficient"]},           # QLoRA differences
    {"must_mention": ["OpenAssistant", "Guanaco"]},         # Datasets used
    {"must_mention": ["LLaMA", "7B"]},                      # Model used
    {"must_mention": ["memory", "storage"]},                # Quantization benefits
    {"must_mention": ["self-attention", "context"]},        # Attention mechanism
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [20]:
from langsmith import Client

client = Client()

dataset_name = f"Retrieval Augmented Generation - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    outputs=answers,
    dataset_id=dataset.id,
)

{'example_ids': ['63c5889b-1215-4faa-b8f8-1e55c47af95d',
  '90dd45cf-b7ca-4564-b211-4a2abab385f5',
  '8e74a38e-97af-4980-b408-cb7d9aaa1bb3',
  '424c597d-8c4b-4931-b1bc-102734f61da0',
  '89de680a-a33e-4e4c-a280-52e6d6db13d7',
  '61a9a691-7e40-488c-9a93-59cff664a9ad'],
 'count': 6}

#### Question #3:

How are the correct answers associated with the questions?

> NOTE: Feel free to indicate if this is problematic or not

In the evaluation section, questions and their "expected answers" are paired using:

- A list of `questions`
- A parallel list of `answers`, where each item is a dictionary specifying required content (via `must_mention`)

Example:
```python
questions = [
    "What optimizer is used in QLoRA?",
    "Who authored the QLoRA paper?",
]

answers = [
    {"must_mention": ["paged", "optimizer"]},
    {"must_mention": ["Tim", "Dettmers"]},
]
```

Then, they're passed into `create_examples()` method:

```python
client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=answers,
    dataset_id=dataset.id,
)
```

This relies on list-order alignment: index 0 in `questions` matches index 0 in `answers`, and so on.

---

**Is this problematic?**

Yes, potential issues include:
1. **Order Sensitivity**: Misalignment or reordering of one list causes incorrect pairing.
2. **Scalability**: With larger datasets, parallel lists become harder to manage.
3. **Validation Overhead**: Requires manual verification to ensure correctness.

---

**Recommendation:**
Use structured data (e.g., a list of dictionaries or a DataFrame):

```python
qa_data = [
    {"question": "...", "must_mention": ["..."]},
    ...
]
```

This avoids mismatches and scales better.
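
A minimal sketch of that recommendation: both lists are derived from the same records, so they cannot drift out of alignment. The variable names `example_inputs` and `example_outputs` are illustrative, and the two records reuse the example questions shown above.

```python
qa_data = [
    {"question": "What optimizer is used in QLoRA?",
     "must_mention": ["paged", "optimizer"]},
    {"question": "Who authored the QLoRA paper?",
     "must_mention": ["Tim", "Dettmers"]},
]

# Derive both lists from the same records so the pairing is explicit.
example_inputs = [{"question": row["question"]} for row in qa_data]
example_outputs = [{"must_mention": row["must_mention"]} for row in qa_data]

print(example_inputs[1], example_outputs[1])
```

The two lists can then be passed to `client.create_examples(inputs=example_inputs, outputs=example_outputs, dataset_id=dataset.id)` in the same way as the cell above.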

### Task 2: Adding Evaluators

Now we can add a custom evaluator to see if our responses contain the expected information.

We'll be using a fairly naive, case-sensitive substring match to determine if our response contains specific strings.

In [21]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return EvaluationResult(key="must_mention", score=score)

#### Question #4:

What are some ways you could improve this metric as-is?

> NOTE: Alternatively you can suggest where gaps exist in this method.

The current `must_mention` metric:

```python
score = all(phrase in prediction for phrase in required)
```

##### **Gaps:**
1. **Exact string match only**
   - Misses paraphrases, synonyms, or reworded concepts
   - False negatives occur if the model uses alternate phrasing

2. **Lack of semantic understanding**
   - No comprehension of meaning, intent, or context

3. **No fuzzy matching**
   - Spelling variations or minor formatting differences are ignored

4. **No hallucination detection**
   - The metric doesn't penalize inaccurate or fabricated answers

---

##### **Improvements:**

| Approach | Description |
|---------|-------------|
| Embedding similarity | Use cosine similarity between expected terms and response |
| LLM-as-a-judge | Prompt a model like GPT-4 to score the response based on whether key concepts are semantically present |
| Regex or fuzzy string match | Allow more flexible token matches |
| NER / QA Extraction | Use NLP techniques to extract facts and validate entities or phrases |
| BLEU / ROUGE scores | Use established text similarity metrics for fuzzy comparison |
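
As one possible sketch of the "fuzzy string match" row, the functions below use only the standard library's `difflib`. `fuzzy_contains`, `must_mention_fraction`, and the `0.8` threshold are assumptions chosen for illustration, not part of LangSmith. The score becomes the fraction of required phrases found rather than an all-or-nothing boolean.

```python
from difflib import SequenceMatcher


def fuzzy_contains(phrase, text, threshold=0.8):
    """True if `phrase` roughly matches some word window in `text` (case-insensitive)."""
    phrase_l = phrase.lower()
    text_l = text.lower()
    if phrase_l in text_l:
        return True
    words = text_l.split()
    n = len(phrase_l.split())
    # Slide a window of n words over the text and compare each to the phrase.
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        if SequenceMatcher(None, phrase_l, window).ratio() >= threshold:
            return True
    return False


def must_mention_fraction(prediction, required):
    """Fraction of required phrases found, instead of all-or-nothing."""
    if not required:
        return 1.0
    hits = sum(fuzzy_contains(phrase, prediction) for phrase in required)
    return hits / len(required)


print(must_mention_fraction("QLoRA uses a Paged optimizer.", ["paged", "optimizer"]))
print(must_mention_fraction("It is memory efficient.", ["memory", "storage"]))
```

This still cannot recognize paraphrases or detect fabricated answers; those gaps need the embedding or LLM-as-a-judge approaches from the table.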

---

### Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [22]:
experiment_results = client.evaluate(
    agent_chain,
    data=dataset_name,
    evaluators=[must_mention],
    experiment_prefix=f"RAG Pipeline - Evaluation - {uuid4().hex[0:4]}",
    metadata={"version": "1.0.0"},
)

View the evaluation results for experiment: 'RAG Pipeline - Evaluation - ca4d-79e953b0' at:
https://smith.langchain.com/o/cc1700a3-14e9-4dd6-9611-c9c0b0000632/datasets/394420f8-9314-42fc-8ac4-80fa300e2cbe/compare?selectedSessions=e9330b55-ef15-40d9-b716-00ba436ac825





In [23]:
experiment_results

| Unnamed: 0 | inputs.question | outputs.output | error | reference.must_mention | feedback.must_mention | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of the LoRA technique in m... | LoRA, which stands for Low-Rank Adaptation, is... | | [reduce, parameters] | True | 7.96149 | 424c597d-8c4b-4931-b1bc-102734f61da0 | 8f8ae3c9-cbdd-43f1-9464-7e1407612336 |
| 1 | What type of datasets were used to evaluate QL... | The QLoRA approach was evaluated using a varie... | | [OpenAssistant, Guanaco] | False | 5.690482 | 61a9a691-7e40-488c-9a93-59cff664a9ad | 845ae151-4e09-4c41-8fec-292e9ede3d0c |
| 2 | Which large language model was used with QLoRA... | In the experiments involving QLoRA, large lang... | | [LLaMA, 7B] | False | 3.994545 | 63c5889b-1215-4faa-b8f8-1e55c47af95d | b9be2731-f869-427a-960b-5f855ae3a673 |
| 3 | How does QLoRA differ from traditional LoRA? | QLoRA (Quantized Low-Rank Adaptation) and LoRA... | | [quantized, efficient] | False | 8.599974 | 89de680a-a33e-4e4c-a280-52e6d6db13d7 | c849d4cd-5a02-4409-a7b0-b8f058474c0a |
| 4 | Explain the concept of attention in transforme... | Attention in transformer models is a mechanism... | | [self-attention, context] | True | 7.90509 | 8e74a38e-97af-4980-b408-cb7d9aaa1bb3 | 24363190-77de-4b0a-8d21-346abf8bb67d |
| 5 | What are the benefits of quantization in QLoRA? | Quantization in QLoRA (Quantized Low-Rank Adap... | | [memory, storage] | False | 10.284491 | 90dd45cf-b7ca-4564-b211-4a2abab385f5 | 29f47a39-ef30-4ad6-a1ad-26ae2a383666 |


## Part 2: LangGraph with Helpfulness:

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've done evaluation - let's see if we can add an extra step where we review the content we've generated to confirm if it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again - we can check the length of the state object, so we don't need additional state for this.

In [24]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### Activity #5:

Please write markdown for the following cells to explain what each is doing.

>This initializes a new StateGraph named graph_with_helpfulness_check that uses our existing AgentState format to track messages.  We add two nodes:
>
>+ "agent": calls the language model (call_model)
>+ "action": executes the selected tool (tool_node)

In [25]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x125ee6d50>

> This sets the entry point of the graph to be the "agent" node.  That means every time the graph starts processing, it begins by invoking the LLM to handle the input.

In [26]:
graph_with_helpfulness_check.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x125ee6d50>

#### Activity #4:

Please write what is happening in our `tool_call_or_helpful` function!

> This defines a custom conditional edge function to route state transitions. It performs three checks in order:
> 
> + If there's a tool_call, it returns "action" (run a tool).
> + If the message history is too long (more than 10 messages), it ends the run to break the loop.
> + Otherwise, it uses a helpfulness-check LLM (GPT-4) to assess if the agent’s last message is helpful enough.
> 
>    + If the LLM says it's helpful ("Y"), we end.
>    + If not ("N"), we continue cycling.

In [27]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

  if len(state["messages"]) > 10:
    # The return value must match a key in the conditional-edge mapping ("end", not "END")
    return "end"

  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4")

  helpfulness_chain = prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"
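
The order of the three checks can be exercised without calling a model. In the sketch below, `route_with_helpfulness` is a hypothetical re-statement of the routing logic that takes the judge as a parameter, messages are plain dicts rather than LangChain message objects, and the two judges are stubs that always answer "N" or "Y".

```python
def route_with_helpfulness(state, judge, max_messages=10):
    """Mirror the three checks: tool call, length limit, helpfulness verdict."""
    last = state["messages"][-1]
    if last.get("tool_calls"):
        return "action"  # 1. the agent asked for a tool
    if len(state["messages"]) > max_messages:
        return "end"  # 2. hard stop on long histories
    verdict = judge(state["messages"][0]["content"], last["content"])
    return "end" if "Y" in verdict else "continue"  # 3. the judge decides


def always_no(query, response):
    """Hypothetical judge stub: never satisfied."""
    return "N"


def always_yes(query, response):
    """Hypothetical judge stub: always satisfied."""
    return "Y"


short_history = {"messages": [{"content": "q"}, {"content": "a"}]}
long_history = {"messages": [{"content": "q"}] + [{"content": "a"}] * 10}
tool_request = {
    "messages": [{"content": "q"}, {"content": "", "tool_calls": [{"name": "arxiv"}]}]
}

print(route_with_helpfulness(tool_request, always_yes))
print(route_with_helpfulness(short_history, always_no))
print(route_with_helpfulness(long_history, always_no))
```

Passing the judge in as an argument is a design choice for testability; it lets the length limit be checked with a judge that never approves, which is the case the limit exists for.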

> This adds conditional edges from the "agent" node. Based on the result from tool_call_or_helpful, the state flows:
> 
> + To "action" if a tool needs to be used
> + To "agent" again if the response was unhelpful
> + To END if either the response was helpful or we've hit the loop limit

In [28]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)

<langgraph.graph.state.StateGraph at 0x125ee6d50>

> This connects the "action" node back to the "agent" node. After a tool is invoked and its result is added to the state, control returns to the agent to interpret the result  and continue the loop.

In [29]:
graph_with_helpfulness_check.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x125ee6d50>

> This compiles the constructed graph into a runnable LangGraph pipeline called *agent_with_helpfulness_check*. The graph is now ready for inference calls with cyclical agent-tool interactions and helpfulness evaluation.

In [30]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

> The code block below tests the enhanced graph pipeline with a multi-part question involving:
> 
> + A technical concept: LoRA
> + A researcher: Tim Dettmers
> + A foundational ML concept: Attention
> 
> It uses astream() to stream intermediate updates as the graph executes. For each update:
> 
> + It prints the node name ('agent' or 'action')
> + Then prints the message history added at that step
> 
> This allows us to observe the full execution trace:
> 
> + How many times the agent cycles through the loop
> + Which tools it uses (e.g., Arxiv, Tavily)
> + When and how the final response is returned
> 
> With the helpfulness check, the loop terminates only when the model generates a response judged to be helpful enough, or the message count exceeds 10.

In [31]:
inputs = {"messages" : [HumanMessage(content="Related to machine learning, what is LoRA? Also, who is Tim Dettmers? Also, what is Attention?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_O2DzQXM8VAn4FXPIfTmykmrB', 'function': {'arguments': '{"query": "LoRA machine learning"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_cJCPuRxlquTjKZm4lvdxeyoE', 'function': {'arguments': '{"query": "Tim Dettmers"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_LpUig4V5Hc0sDpFGXpHFrue7', 'function': {'arguments': '{"query": "Attention mechanism machine learning"}', 'name': 'arxiv'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 72, 'prompt_tokens': 177, 'total_tokens': 249, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_f5bdcc3276', 'finish_reason': 'tool_calls', 'logprobs': None},
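
Reading the raw printout to count loop iterations gets tedious. As a rough sketch, the update chunks can instead be tallied per node. `tally_updates` is a hypothetical helper, and the sample chunks are hand-written placeholders that only imitate the `{node: {"messages": [...]}}` shape of the updates, not real model output.

```python
from collections import Counter


def tally_updates(chunks):
    """Count how many updates each node emitted across a run."""
    visits = Counter()
    for chunk in chunks:
        for node in chunk:  # each chunk maps a node name to its state update
            visits[node] += 1
    return visits


# Made-up chunks in the same shape as the streamed updates.
sample_chunks = [
    {"agent": {"messages": ["<tool call request>"]}},
    {"action": {"messages": ["<tool results>"]}},
    {"agent": {"messages": ["<final answer>"]}},
]

visits = tally_updates(sample_chunks)
print(dict(visits))  # {'agent': 2, 'action': 1}
```

With the real graph, the streaming loop could append each `chunk` to a list and pass that list to `tally_updates` once the run finishes.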

### Task 4: LangGraph for the "Patterns" of GenAI

Let's ask our system about the 4 patterns of Generative AI:

1. Prompt Engineering
2. RAG
3. Fine-tuning
4. Agents

In [32]:
patterns = ["prompt engineering", "RAG", "fine-tuning", "LLM-based agents"]

In [34]:
from IPython.display import Markdown, display

for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene??"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  display(Markdown(messages["messages"][-1].content))
  print("\n\n")

**Prompt Engineering: Definition and Importance**

Prompt engineering is the process of designing and refining input prompts to effectively guide the behavior of AI models, particularly large language models (LLMs) like GPT-4, to produce desired outputs. It involves crafting instructions in natural language that can be interpreted and understood by AI models to perform specific tasks. This technique is crucial for improving the accuracy and effectiveness of generative AI tools, which can create text, images, video, and more. By refining prompts, users can enhance the quality and relevance of the AI's responses, making it a vital skill for anyone interacting with AI models.

**History of Prompt Engineering**

The history of prompt engineering is closely tied to the development and evolution of natural language processing (NLP) and artificial intelligence (AI) systems. Initially, early language models struggled to consistently understand or follow instructions, which highlighted the need for structured guidance, giving birth to prompt engineering. The release of GPT-3 in 2020 marked a significant moment, as it demonstrated the potential of large language models to generate human-like text. This led to increased interest in prompt design as a means to guide these models to perform desired tasks. The introduction of reinforcement learning techniques, such as InstructGPT, further advanced the field by training models to better follow instructions and align with human intent.






Retrieval-Augmented Generation (RAG) is a relatively new technology that was first proposed in 2020. It was introduced in a paper titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis and a team at Facebook AI Research. RAG is a technique designed to enhance the accuracy and reliability of generative AI models by incorporating information from specific and relevant data sources. This approach allows AI systems to access and leverage a broader range of external knowledge, making them more informative and accurate.

RAG has been embraced by both academic and industry researchers as a way to significantly improve the value of generative AI systems. It has evolved over time, integrating advanced retrievers, large language models (LLMs), and other complementary technologies to tackle knowledge-intensive tasks. The concept of RAG has led to the development of modular frameworks that allow for reconfigurable systems, enhancing the flexibility and efficiency of AI applications.






Fine-tuning in machine learning refers to the process of taking a pre-trained model and further training it on a smaller, targeted dataset. This approach is particularly useful for adapting a model to a specific task or domain after it has been initially trained on a broader dataset. Fine-tuning allows the model to leverage the general knowledge it has already acquired and adjust it to perform well on a new, more specific task.

The concept of fine-tuning has been around for a while, but it gained significant attention with the rise of deep learning and transfer learning techniques. In the neural network era, fine-tuning became a common practice where a pre-trained model is adapted by adding a new layer for a specific task and training it on the new dataset. This method allows the base model to remain largely unchanged while the task-specific layer learns to classify, predict, or generate outputs.

Fine-tuning has been particularly impactful in fields like natural language processing and computer vision, where large pre-trained models can be adapted to specific tasks with relatively small amounts of additional data. The technique has evolved over time, with recent advancements including instruction fine-tuning and differentially private fine-tuning, which address specific challenges in model training and privacy.

The exact timeline of when fine-tuning "broke onto the scene" is not pinpointed in the sources, but it became more prominent with the development of deep learning frameworks and the availability of large pre-trained models, such as those used in NLP and image recognition tasks.






LLM-based agents, or Large Language Model-based agents, are systems that utilize large language models to perform complex tasks by integrating modules like planning, memory, and tool usage. These agents act as the "brain" to control operations needed to complete tasks or user requests. They can handle complex reasoning, create plans, and execute them using various tools.

The concept of LLM-based agents gained significant attention in 2022 with the popularization of OpenAI's ChatGPT. Since then, various methods and techniques have been developed to enhance their utilization and address their limitations. These agents have evolved from basic text-based models to advanced systems capable of handling complex tasks beyond simple question-answering, such as digital automation and financial inquiries.

For more detailed information, you can explore resources like [this article on Medium](https://medium.com/@razgaleh/brief-history-of-llm-agents-d7d22f82a539) or [NVIDIA's introduction to LLM agents](https://developer.nvidia.com/blog/introduction-to-llm-agents/).




