# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language to build coordinated multi-actor and stateful applications that include cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Because we have cycles rather than plain loops, we can also compose rather complex flows through our graph in a much more readable and natural fashion, effectively recreating application flowcharts in code in an almost 1-to-1 fashion.

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1:  Dependencies

We'll first install all our required libraries.

In [4]:
!pip install -qU langchain langchain_openai langchain-community langgraph arxiv duckduckgo_search==5.3.1b1

## Task 2: Environment Variables

We'll want to set both our OpenAI API key and our LangSmith environment variables.

In [1]:
import os
from dotenv import load_dotenv
import openai
from pathlib import Path
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Get the current working directory
current_dir = Path.cwd()

# Construct the path to the .env file
env_path = current_dir.parent.parent / '.env'

# Check if the .env file exists
if env_path.exists():
    logger.info(f".env file found at: {env_path}")
    # Load environment variables from .env file
    load_dotenv(dotenv_path=env_path)
else:
    logger.error(f".env file not found at: {env_path}")

# Retrieve the OpenAI API key from environment variables
openai.api_key = os.getenv("OPENAI_API_KEY")

# Verify that the API key is set
if openai.api_key:
    logger.info("OpenAI API key loaded successfully.")
else:
    logger.error("Failed to load OpenAI API key.")

INFO:__main__:.env file found at: c:\src\mlops\sb-aie4\.env
INFO:__main__:OpenAI API key loaded successfully.


In [24]:
import os
from uuid import uuid4
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE4 - LangGraph - {uuid4().hex[0:8]}"

# Try to get the API key from the environment variable
langsmith_api_key = os.getenv("LANGSMITH_API_KEY")



# Set the LANGCHAIN_API_KEY environment variable
if langsmith_api_key:
    os.environ["LANGCHAIN_API_KEY"] = langsmith_api_key
    print("LangSmith API key set successfully.")
else:
    print("No LangSmith API key provided. Some functionality may be limited.")

No LangSmith API key provided. Some functionality may be limited.


## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Duck Duck Go Web Search](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools/ddg_search)
- [Arxiv](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools/arxiv)

#### 🏗️ Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be a method.

In [7]:
from langchain_community.tools.ddg_search import DuckDuckGoSearchRun
from langchain_community.tools.arxiv.tool import ArxivQueryRun


tool_belt = [
    DuckDuckGoSearchRun(),
    ArxivQueryRun()
]

### Model

Now we can set up our model! We'll leverage the familiar OpenAI model suite for this example - but it's not *necessary* to use OpenAI with LangGraph. LangGraph supports all models - though you might not find success with smaller models - so the recommendation is to stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API - we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API).

In [8]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o", temperature=0)

Now that we have our model set up, let's "put on the tool belt", which is to say: we'll bind our LangChain formatted tools to the model in an OpenAI function calling format.

In [9]:
model = model.bind_tools(tool_belt)

#### ❓ Question #1:

How does the model determine which tool to use?

The model chooses a tool by comparing the request against the name, description, and argument schema of each tool in the `tool_belt` we bound to it. Binding sends those descriptions along with the conversation, and the model responds either with a normal message or with a tool call naming one of them. The choice is therefore limited to, and informed by, the specific set of tools we provided.
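
To make this more concrete, here is a rough sketch of what a single tool description looks like in the OpenAI function-calling format. The helper `describe_tool` and the `web_search` function are hypothetical names used only for illustration, and the sketch assumes every parameter is a required string; it is not what `bind_tools` runs internally.

```python
import inspect
import json


def web_search(query: str) -> str:
    """Search the web for current information about a topic."""
    return f"results for {query}"  # placeholder body; never called in this sketch


def describe_tool(func) -> dict:
    """Build an OpenAI-style function description from a Python function.

    Hypothetical helper: every parameter is treated as a required string.
    """
    params = inspect.signature(func).parameters
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func),
            "parameters": {
                "type": "object",
                "properties": {name: {"type": "string"} for name in params},
                "required": list(params),
            },
        },
    }


tool_schema = describe_tool(web_search)
print(json.dumps(tool_schema, indent=2))
```

The model only ever sees text like this, which is why clear tool names and descriptions matter so much for tool selection.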

## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StateGraph` which uses an `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below - but this `AgentState` object is a `TypedDict` with the key `messages`, whose value is a list of `BaseMessage` objects that is appended to whenever the state changes.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
  - `{"messages" : []}`
2. Our user submits a query to our application.
  - New State: `HumanMessage(#1)`
  - `{"messages" : [HumanMessage(#1)]}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
  - New State: `AgentMessage(#1, additional_kwargs {"function_call" : "WebSearchTool"})`
  - `{"messages" : [HumanMessage(#1), AgentMessage(#1, ...)]}`
4. We pass our state object to a "conditional node" (more on this later) which reads the last state to determine if we need to use a tool - which it can determine properly because of our provided object!
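
The steps above can be sketched in plain Python. This is a simplification under stated assumptions: tuples stand in for message objects, and `append_messages` is a hypothetical stand-in for a reducer that only concatenates (the real `add_messages` reducer does more than this).

```python
def append_messages(existing: list, new: list) -> list:
    """Simplified stand-in for a message reducer: concatenate, never overwrite."""
    return existing + new


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state using the reducer."""
    return {"messages": append_messages(state["messages"], update["messages"])}


# 1. Start with an empty state.
state = {"messages": []}
# 2. The user's query arrives.
state = apply_update(state, {"messages": [("human", "Who captains the Winnipeg Jets?")]})
# 3. The agent node returns only its new message; the reducer appends it.
state = apply_update(state, {"messages": [("ai", "tool_call: web_search")]})

for role, content in state["messages"]:
    print(f"{role}: {content}")
```

Note that each node returns only the new messages rather than the full history; appending is the reducer's job.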

In [10]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
import operator
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to each other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [11]:
from langgraph.prebuilt import ToolNode

def call_model(state):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt)

Now we have two total nodes. We have:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [12]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [13]:
uncompiled_graph.set_entry_point("agent")

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks whether the last message has any `tool_calls`.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or the END (finish the graph).

It's important to highlight that if a dictionary is passed in as the third parameter (the mapping), it should be created with the possible outputs of our conditional function in mind. In this case we pass no mapping, so `should_continue` returns the destination directly: either `"action"` (the name of our action node) or `END`.

In [14]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)
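
The cell above returns node names directly, so no mapping is needed. When a routing function returns labels instead, a mapping dictionary translates each label into a destination. Here is a minimal sketch of that idea in plain Python, with dictionaries standing in for messages and `END_SENTINEL` as a hypothetical placeholder for the `END` constant:

```python
END_SENTINEL = "__end__"  # hypothetical stand-in for END in this sketch


def route(state: dict) -> str:
    """Return a label describing what should happen next."""
    last_message = state["messages"][-1]
    return "continue" if last_message.get("tool_calls") else "end"


# The mapping translates router labels into destination nodes.
path_map = {"continue": "action", "end": END_SENTINEL}


def next_node(state: dict) -> str:
    """Look up the destination for whatever label the router returned."""
    return path_map[route(state)]


with_tool_call = {"messages": [{"content": "", "tool_calls": [{"name": "arxiv"}]}]}
plain_answer = {"messages": [{"content": "some answer", "tool_calls": []}]}

print(next_node(with_tool_call))
print(next_node(plain_answer))
```

Every label the router can return needs a key in the mapping; a label with no matching key has nowhere to go.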

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [15]:
uncompiled_graph.add_edge("action", "agent")

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)

All that's left to do now is to compile our workflow - and we're off!

In [16]:
compiled_graph = uncompiled_graph.compile()

#### ❓ Question #2:

**Is there any specific limit to how many times we can cycle?**

*In the current implementation, there is no explicit limit to the number of cycles. The graph will continue to cycle between the "agent" and "action" nodes as long as the should_continue function keeps returning "action". This means the conversation could potentially go on indefinitely if the model keeps requesting tool actions.*

**If not, how could we impose a limit to the number of cycles?**

*To impose a limit on the number of cycles, we could modify the code and introduce a MAX_CYCLES constant to set the maximum number of allowed cycles. By implementing this, you ensure that the conversation will end after a certain number of cycles, preventing potential infinite loops or excessively long conversations.*
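
One way the MAX_CYCLES idea might look is sketched below in plain Python. The constant's value, the dictionary-based messages, and the helper `ai_turn` are all assumptions made for illustration; this is not code from the notebook.

```python
MAX_CYCLES = 3  # hypothetical limit chosen for illustration


def should_continue_with_limit(state: dict) -> str:
    """Route to 'action' only while the agent is under its cycle budget."""
    messages = state["messages"]
    last_message = messages[-1]
    # Each agent turn appends one "ai" message, so counting them counts cycles.
    cycles_used = sum(1 for m in messages if m["role"] == "ai")

    if last_message.get("tool_calls") and cycles_used < MAX_CYCLES:
        return "action"
    return "end"


def ai_turn(with_tool_call: bool) -> dict:
    """Build a fake AI message, optionally carrying a tool call."""
    return {"role": "ai", "tool_calls": [{"name": "search"}] if with_tool_call else []}


state = {"messages": [{"role": "human"}, ai_turn(True)]}
first_decision = should_continue_with_limit(state)

# Simulate a model that keeps requesting tools until the budget runs out.
state["messages"] += [{"role": "tool"}, ai_turn(True), {"role": "tool"}, ai_turn(True)]
final_decision = should_continue_with_limit(state)

print(first_decision, final_decision)
```

LangGraph also exposes a `recursion_limit` setting in the run configuration, which is worth checking in the documentation for your installed version before writing a custom counter.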

## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fares:

In [17]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="Who is the current captain of the Winnipeg Jets?")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_argCvBDDWF0vQbEZOoS3ILyx', 'function': {'arguments': '{"query":"current captain of the Winnipeg Jets 2023"}', 'name': 'duckduckgo_search'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 25, 'prompt_tokens': 156, 'total_tokens': 181, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_5796ac6771', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-0b5e4c32-3e2a-4f71-974c-891555a13002-0', tool_calls=[{'name': 'duckduckgo_search', 'args': {'query': 'current captain of the Winnipeg Jets 2023'}, 'id': 'call_argCvBDDWF0vQbEZOoS3ILyx', 'type': 'tool_call'}], usage_metadata={'input_tokens': 156, 'output_tokens': 25, 'total_tokens': 181})]



Receiving update from node: 'action'
[ToolMessage(content="Prior to the start of the 2023-24 season, Adam Lowry was named the new c

Let's look at what happened:

1. Our state object was populated with our request
2. The state object was passed into our entry point (agent node) and the agent node added an `AIMessage` to the state object and passed it along the conditional edge
3. The conditional edge received the state object, found `tool_calls` on the last message, and sent the state object to the action node
4. The action node ran the requested tool, added the result to the state object as a `ToolMessage`, and passed it along the edge to the agent node
5. The agent node added a response to the state object and passed it along the conditional edge
6. The conditional edge received the state object, found no `tool_calls` on the last message, and passed the state object to END where we see it output in the cell above!
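
The flow above can be traced with a small plain-Python simulation. Everything here is a stand-in: `fake_agent` and `fake_action` are hypothetical functions that replace the model and the tool node, and the hand-written loop only imitates what the compiled graph does for us.

```python
def fake_agent(state: dict) -> dict:
    """Stand-in for the model: ask for a tool once, then answer."""
    has_tool_result = any(m["role"] == "tool" for m in state["messages"])
    if has_tool_result:
        return {"role": "ai", "content": "final answer", "tool_calls": []}
    return {"role": "ai", "content": "", "tool_calls": ["search"]}


def fake_action(state: dict) -> dict:
    """Stand-in for the tool node: run the requested tool and report back."""
    return {"role": "tool", "content": "search result text"}


def run(question: str) -> tuple:
    """Walk the agent/action cycle and record which nodes were visited."""
    state = {"messages": [{"role": "human", "content": question}]}
    visited = []
    node = "agent"  # entry point
    while node != "end":
        visited.append(node)
        if node == "agent":
            state["messages"].append(fake_agent(state))
            # Conditional edge: tool calls route to the action node.
            node = "action" if state["messages"][-1]["tool_calls"] else "end"
        else:
            state["messages"].append(fake_action(state))
            node = "agent"  # the action node always returns to the agent
    return visited, state


visited, final_state = run("Who is the current captain of the Winnipeg Jets?")
print(visited)
print([m["role"] for m in final_state["messages"]])
```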

Now let's look at an example that shows multiple tool usage - all with the same flow!

In [18]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using DuckDuckGo.")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_oD9sBdZbg96rO2gIq9a5cTpT', 'function': {'arguments': '{"query": "QLoRA"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_Ui8oEZ81xn4nw4Xer5UTgvIz', 'function': {'arguments': '{"query": "latest tweet"}', 'name': 'duckduckgo_search'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 50, 'prompt_tokens': 173, 'total_tokens': 223, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_5796ac6771', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-7c79795c-756d-4abb-96a9-18c850c20a25-0', tool_calls=[{'name': 'arxiv', 'args': {'query': 'QLoRA'}, 'id': 'call_oD9sBdZbg96rO2gIq9a5cTpT', 'type': 'tool_call'}, {'name': 'duckduckgo_search', 'args': {'query': 'latest tweet'}, 'id': 'call_Ui8oEZ81xn4nw4Xer5UTgvIz', 'type': 'tool_call'}], usage_metadata={'input_tokens': 173, '

#### 🏗️ Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

# Agent's Steps to Process the Query

1. **Initial Query Processing**
   - Received human message: "Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using DuckDuckGo."

2. **Arxiv Search**
   - Used Arxiv tool to search for the QLoRA paper
   - Found information about "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer

3. **Author Identification**
   - Identified authors of the QLoRA paper from Arxiv search results

4. **DuckDuckGo Searches**
   - Performed DuckDuckGo search for each author
   - Retrieved general information about authors and their work

5. **Result Compilation**
   - Compiled search results from both Arxiv and DuckDuckGo

6. **Attempt to Find Tweets**
   - Attempted to find tweets from authors using DuckDuckGo search results
   - Unsuccessful in retrieving actual tweet content

7. **Response Formulation**
   - Likely formulated a response indicating:
     - Found information about QLoRA paper
     - Retrieved general information about authors
     - Unable to locate specific tweets

8. **Output Generation**
   - Generated output for each step
   - Showed tools used (Arxiv and DuckDuckGo) and content retrieved

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [19]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["question"])]}

def parse_output(input_state):
  return input_state["messages"][-1].content

agent_chain = convert_inputs | compiled_graph | parse_output
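
As a rough mental model, the pipe operator in the chain above feeds each step's output into the next step. The sketch below imitates that with ordinary function composition; `compose`, `fake_graph`, and the `_sketch` functions are hypothetical stand-ins, not the LCEL implementation.

```python
from functools import reduce


def compose(*steps):
    """Chain callables left to right, similar in spirit to `a | b | c`."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def convert_inputs_sketch(input_object: dict) -> dict:
    """Turn a dataset-style input into a graph-style state."""
    return {"messages": [{"role": "human", "content": input_object["question"]}]}


def fake_graph(state: dict) -> dict:
    """Hypothetical stand-in for the compiled graph: append one AI message."""
    answer = {"role": "ai", "content": "stub answer"}
    return {"messages": state["messages"] + [answer]}


def parse_output_sketch(state: dict) -> str:
    """Keep only the text of the final message."""
    return state["messages"][-1]["content"]


chain_sketch = compose(convert_inputs_sketch, fake_graph, parse_output_sketch)
print(chain_sketch({"question": "What is RAG?"}))
```

The wrapper exists so that the evaluation dataset can use a simple `question` input and a plain string output, rather than full message lists.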

In [20]:
agent_chain.invoke({"question" : "What is RAG?"})

'RAG stands for Retrieval-Augmented Generation. It is a technique used in natural language processing (NLP) and machine learning to improve the performance of language models by combining retrieval-based methods with generative models. Here’s a brief overview of how it works:\n\n1. **Retrieval**: In the first step, the system retrieves relevant documents or pieces of information from a large corpus or database. This is typically done using a retrieval model, such as BM25 or a dense retrieval model like DPR (Dense Passage Retrieval).\n\n2. **Augmentation**: The retrieved documents are then used to augment the input to the generative model. This means that the generative model has access to additional context or information that can help it produce more accurate and relevant responses.\n\n3. **Generation**: Finally, the generative model, such as GPT-3 or BERT, uses the augmented input to generate a response. The additional context provided by the retrieved documents helps the model gener

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]
```

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions.

In [22]:
questions = [
    "What is the main contribution of the QLoRA paper?",
    "How does QLoRA reduce memory usage compared to full fine-tuning?",
    "What is NF4, as introduced in the QLoRA paper?",
    "How does QLoRA compare to other efficient fine-tuning methods in terms of performance?",
    "What is the significance of the Guanaco model mentioned in the QLoRA paper?"
]

answers = [
    {"must_mention": ["efficient finetuning", "quantized LLMs", "memory reduction"]},
    {"must_mention": ["4-bit quantization", "Low Rank Adapters", "frozen pretrained model"]},
    {"must_mention": ["4-bit", "NormalFloat", "optimal for normal distribution"]},
    {"must_mention": ["full 16-bit performance", "less memory", "faster training"]},
    {"must_mention": ["outperforms previous models", "Vicuna benchmark", "single GPU training"]}
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [25]:
from langsmith import Client

client = Client()
dataset_name = f"Retrieval Augmented Generation - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    outputs=answers,
    dataset_id=dataset.id,
)

#### ❓ Question #3:

How are the correct answers associated with the questions?

# How Questions and Answers are Associated

1. **Indexing Correspondence**
   - Questions and answers are stored in separate lists: `questions` and `answers`.
   - Each question corresponds to the answer at the same index in the `answers` list.

2. **Answer Format**
   - Answers are not direct responses, but dictionaries with a "must_mention" key.
   - The "must_mention" value is a list of key concepts or phrases.

3. **Evaluation Mechanism**
   - Correct answers are determined by checking if the agent's response includes the phrases listed in "must_mention".

4. **Flexibility in Evaluation**
   - This approach allows flexibility in wording, though as implemented with `all(...)` in the evaluator the score is all-or-nothing rather than partial credit.
   - The agent doesn't need to provide an exact match to a predefined answer.

5. **Potential Issues**
   - *Ambiguity*: Some key phrases might be too general or could appear in incorrect answers.
   - *Incompleteness*: The "must_mention" list might not capture all aspects of a correct answer.
   - *Lack of Context*: The evaluation doesn't consider the overall coherence or correctness of the response.
   - *Maintenance*: Updating questions requires careful adjustment of the corresponding answer criteria.

6. **Advantages**
   - *Simplicity*: Easy to implement and understand.
   - *Adaptability*: Can work with various types of questions and response styles.
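
The positional pairing described above can be sketched as follows. The two-item lists are shortened copies of the notebook's data, and the length check is an added safeguard that the original cell does not perform.

```python
questions_sketch = [
    "What is NF4, as introduced in the QLoRA paper?",
    "What is the main contribution of the QLoRA paper?",
]
answers_sketch = [
    {"must_mention": ["4-bit", "NormalFloat"]},
    {"must_mention": ["efficient finetuning", "quantized LLMs"]},
]

# Pairing is purely positional, so mismatched lengths would silently misalign.
if len(questions_sketch) != len(answers_sketch):
    raise ValueError("questions and answers must have the same length")

examples = [
    {"inputs": {"question": q}, "outputs": a}
    for q, a in zip(questions_sketch, answers_sketch)
]

for example in examples:
    print(example["inputs"]["question"], "->", example["outputs"]["must_mention"])
```

Because nothing ties a question to its answer except list position, reordering or deleting an entry in one list without the other would quietly attach the wrong criteria.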

> NOTE: Feel free to indicate if this is problematic or not

### Task 2: Adding Evaluators

Now we can add a custom evaluator to see if our responses contain the expected information.

We'll be using a fairly naive, case-sensitive substring match to determine if our response contains specific strings.

In [26]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return EvaluationResult(key="must_mention", score=score)

#### ❓ Question #4:

What are some ways you could improve this metric as-is?

# Improving the Must-Mention Metric

1. **Partial Credit Scoring**
   - Instead of binary scoring, implement a graduated scale based on the number of required phrases mentioned.
   - Example: Score = (number of phrases mentioned) / (total required phrases)

2. **Phrase Importance Weighting**
   - Assign different weights to required phrases based on their importance.
   - Calculate a weighted score instead of a simple all-or-nothing approach.

3. **Semantic Similarity**
   - Use word embeddings or language models to detect semantic equivalents of required phrases.
   - This allows for more flexible matching beyond exact string matching.

4. **Context Consideration**
   - Implement a method to evaluate if the required phrases are used in the correct context.
   - This could involve analyzing surrounding words or sentence structure.

5. **Negative Scoring**
   - Introduce penalties for mentioning incorrect or contradictory information.
   - This helps ensure the overall accuracy of the response, not just the presence of key phrases.

6. **Length Normalization**
   - Adjust scores based on the length of the response to avoid favoring overly verbose answers.

7. **Phrase Order Consideration**
   - For some questions, the order of mentioned phrases might be important. Implement a way to consider this.

# Gaps in the Current Method

1. **Lack of Nuance**
   - The current binary scoring doesn't capture partial knowledge or near-misses.

2. **Over-simplification**
   - Complex topics might be reduced to a few key phrases, missing important nuances.

3. **Potential for Gaming**
   - A model could theoretically score well by simply mentioning all required phrases without coherence.

4. **Ignoring Overall Correctness**
   - The method doesn't evaluate the overall correctness or coherence of the answer.

5. **Limited to Predefined Phrases**
   - It may not recognize novel but correct formulations of answers.

6. **No Consideration of Explanation Quality**
   - The depth or quality of explanations isn't factored into the score.

7. **Inability to Handle Open-ended Questions**
   - This method is less suitable for questions requiring creative or diverse answers.

> NOTE: Alternatively you can suggest where gaps exist in this method.
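
As one illustration of the partial-credit idea, the sketch below scores the fraction of required phrases found and ignores case. The function name `must_mention_fraction` and the sample response are hypothetical; wiring it into an evaluator is left out on purpose.

```python
def must_mention_fraction(prediction: str, required: list) -> float:
    """Return the fraction of required phrases found, ignoring case."""
    if not required:
        return 1.0  # nothing required, so nothing is missing
    text = prediction.lower()
    hits = sum(1 for phrase in required if phrase.lower() in text)
    return hits / len(required)


response = "NF4 is a 4-bit NormalFloat data type for quantizing weights."
required_phrases = ["4-bit", "normalfloat", "optimal for normal distribution"]

score = must_mention_fraction(response, required_phrases)
print(round(score, 2))
```

A response that covers two of three phrases now earns partial credit instead of a flat `False`, which makes near-misses visible in the results.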

Now that we have created our custom evaluator - let's initialize our `RunEvalConfig` with it!

In [27]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    custom_evaluators=[must_mention],
)

### Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [28]:
client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=agent_chain,
    evaluation=eval_config,
    verbose=True,
    project_name=f"RAG Pipeline - Evaluation - {uuid4().hex[0:8]}",
    project_metadata={"version": "1.0.0"},
)

View the evaluation results for project 'RAG Pipeline - Evaluation - b3f02c88' at:
https://smith.langchain.com/o/bdfb057b-3c54-550b-8976-4c06e5176a04/datasets/efeb7a27-2edd-4cac-932f-a2d0fc0a46c8/compare?selectedSessions=fd3dba37-27ef-48a9-a689-9e1c9c3a4716

View all tests for Dataset Retrieval Augmented Generation - Evaluation Dataset - 1aace49d at:
https://smith.langchain.com/o/bdfb057b-3c54-550b-8976-4c06e5176a04/datasets/efeb7a27-2edd-4cac-932f-a2d0fc0a46c8
[------------------------------------------------->] 5/5

| | feedback.must_mention | error | execution_time | run_id |
|---|---|---|---|---|
| count | 5 | 0.0 | 5.0 | 5 |
| unique | 1 | 0.0 | | 5 |
| top | False | | | 4b5cf97a-44f9-4e76-8196-cc62eac9bbcb |
| freq | 5 | | | 1 |
| mean | | | 5.172188 | |
| std | | | 1.816041 | |
| min | | | 3.906258 | |
| 25% | | | 4.0233 | |
| 50% | | | 4.101165 | |
| 75% | | | 5.680836 | |


{'project_name': 'RAG Pipeline - Evaluation - b3f02c88',
 'results': {'1c85c6c9-0e10-400d-a5df-89f10eb4d94b': {'input': {'question': 'What is the significance of the Guanaco model mentioned in the QLoRA paper?'},
   'feedback': [EvaluationResult(key='must_mention', score=False, value=None, comment=None, correction=None, evaluator_info={}, feedback_config=None, source_run_id=UUID('4fce4bce-ab32-477e-82a4-e0b8b4318e77'), target_run_id=None)],
   'execution_time': 5.680836,
   'run_id': '4b5cf97a-44f9-4e76-8196-cc62eac9bbcb',
   'output': 'The Guanaco model, as mentioned in the QLoRA paper, represents a significant advancement in the efficient fine-tuning of large language models (LLMs). Here are the key points about the Guanaco model and its significance:\n\n1. **Efficiency in Fine-Tuning**: The QLoRA approach allows for the fine-tuning of a 65 billion parameter model on a single 48GB GPU while maintaining the performance of full 16-bit fine-tuning tasks. This is achieved by backpropagat

## Part 2: LangGraph with Helpfulness

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've done evaluation - let's see if we can add an extra step where we review the content we've generated to confirm if it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again - we can check the length of the state object, so we don't need additional state for this.

In [29]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### 🏗️ Activity #5:

Please write markdown for the following cells to explain what each is doing.

##### YOUR MARKDOWN HERE

# Creating a State Graph with Helpfulness Check

This code cell is setting up a new graph structure for an AI agent with a helpfulness check. Let's break it down:

1. **Initializing the State Graph**
   ```python
   graph_with_helpfulness_check = StateGraph(AgentState)
   ```
   - Creates a new `StateGraph` object named `graph_with_helpfulness_check`
   - Uses `AgentState` as the state type for this graph

2. **Adding the Agent Node**
   ```python
   graph_with_helpfulness_check.add_node("agent", call_model)
   ```
   - Adds a node named "agent" to the graph
   - Associates this node with the `call_model` function
   - This node likely represents the main AI model's decision-making process

3. **Adding the Action Node**
   ```python
   graph_with_helpfulness_check.add_node("action", tool_node)
   ```
   - Adds another node named "action" to the graph
   - Associates this node with the `tool_node` object
   - This node probably handles the execution of specific tools or actions

## Purpose
- This graph structure allows for a more complex flow of operations in the AI agent.
- It separates the decision-making process (agent node) from the action execution (action node).
- The name `graph_with_helpfulness_check` suggests that this graph might include additional logic to evaluate the helpfulness of the agent's responses.

## Next Steps
- After setting up these nodes, the graph would typically need edges defined to connect these nodes and specify the flow of operations.
- Additional nodes or conditional logic might be added to implement the helpfulness check functionality.

In [30]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

##### YOUR MARKDOWN HERE

# Setting the Entry Point for the AI Graph

```python
graph_with_helpfulness_check.set_entry_point("agent")
```

## Purpose
- Defines the starting node for the graph's execution flow.
- Specifies "agent" as the initial processing point.

## Implications
- All interactions begin with the AI model's analysis.
- Allows the model to:
  1. Handle simple queries directly.
  2. Decide when to use additional tools/actions.

## Design Choice
- Prioritizes AI assessment before any other operations.
- Supports implementation of helpfulness checks early in the process.

## Significance
- Crucial for defining the operational flow of the AI system.
- Ensures the primary AI model is the first point of contact for inputs.

In [31]:
graph_with_helpfulness_check.set_entry_point("agent")

##### YOUR MARKDOWN HERE


## Purpose
- Determines next action based on AI's response
- Checks for tool calls or evaluates response helpfulness

## Key Components
1. **Tool Call Check**: Returns "action" if tool calls present
2. **Message Limit Check**: Ends process if over 10 messages
3. **Helpfulness Evaluation**: 
   - Uses GPT-4 to assess response helpfulness
   - Compares initial query and final response
4. **Decision Logic**:
   - "end" if helpful
   - "continue" if not helpful

## Significance
- Implements feedback loop for response quality
- Balances tool use and direct answers
- Prevents overly long conversations

This function is a crucial decision point in the AI's conversation flow, determining whether to end, refine, or execute a tool action.

In [33]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  """Route to "action", "end", or "continue" based on the latest message."""
  last_message = state["messages"][-1]

  # 1. The model requested a tool: hand off to the tool node.
  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

  # 2. Loop limit: stop once the conversation exceeds 10 messages.
  #    Return the lowercase key "end" so it matches the conditional-edge mapping.
  if len(state["messages"]) > 10:
    return "end"

  # 3. Ask a judge model whether the final response answers the initial query.
  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4")

  helpfulness_chain = prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  # 4. "Y" means the response is helpful enough to finish; otherwise loop back to the agent.
  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"

#### 🏗️ Activity #4:

Please write what is happening in our `tool_call_or_helpful` function!

##### YOUR MARKDOWN HERE

# Function: tool_call_or_helpful

This function is a key decision-making component in the conversation flow. Here's a breakdown of its operation:

1. **Initial Check for Tool Calls**
   - Checks if the last message contains any tool calls
   - If yes, returns "action", indicating a tool should be used

2. **Message Limit Check**
   - If the state holds more than 10 messages, stops the run to prevent overly long conversations

3. **Helpfulness Evaluation**
   - Proceeds only if neither of the above conditions is met

4. **Prompt Creation**
   - Defines a prompt template to assess the helpfulness of the final response

5. **Model Setup**
   - Uses ChatOpenAI with GPT-4 model for evaluation

6. **Evaluation Chain**
   - Creates a chain: prompt_template -> GPT-4 model -> String output parser

7. **Helpfulness Check**
   - Invokes the chain with initial query and final response
   - Expects 'Y' or 'N' response indicating helpfulness

8. **Final Decision**
   - Returns "end" if response contains 'Y' (conversation complete)
   - Returns "continue" otherwise (for a more helpful response)
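
Steps 7 and 8 rely on a substring test, `"Y" in helpfulness_response`, which also matches any reply that merely contains a capital Y somewhere (for example the word "You" in an explanation). Below is a minimal sketch of a stricter parse, assuming the judge puts its single-letter verdict first. `parse_verdict` is a hypothetical helper and is not part of the notebook.

```python
def parse_verdict(response: str) -> bool:
    """Return True only when the judge's reply starts with 'Y'.

    Hypothetical helper: assumes the judge answers 'Y' or 'N' first,
    optionally followed by an explanation.
    """
    # Drop surrounding whitespace and quote characters, then normalise case.
    cleaned = response.strip().strip("'\"").upper()
    return cleaned.startswith("Y")


reply = "N. You could add more detail."
print("Y" in reply)          # substring test used in the notebook
print(parse_verdict(reply))  # stricter prefix test
```

For this reply the substring test reports a match while the prefix test does not. That difference is the reason to consider the stricter form.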

This function serves as a sophisticated decision point, determining whether to:
- Use a tool
- End the conversation due to length
- End it due to a satisfactory response
- Continue for a more helpful answer

It uses a second model call (GPT-4) as a judge of helpfulness, so every pass through this branch costs one additional API request.
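
The four outcomes can be exercised without calling any API by injecting the judge as a plain function. This is a sketch under stated assumptions, not the notebook's implementation: `StubMessage`, `route`, and the two judge lambdas are hypothetical stand-ins. The length-limit branch returns the lowercase key `"end"`, chosen here so that every return value appears in the mapping passed to `add_conditional_edges`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StubMessage:
    """Hypothetical stand-in for a LangChain message (content + tool_calls)."""
    content: str
    tool_calls: List[dict] = field(default_factory=list)


def route(state: dict, judge: Callable[[str, str], str], max_messages: int = 10) -> str:
    """Mirror the branch order of tool_call_or_helpful with an injected judge."""
    messages = state["messages"]
    last_message = messages[-1]

    if last_message.tool_calls:        # 1. the model asked for a tool
        return "action"
    if len(messages) > max_messages:   # 2. loop limit reached
        return "end"
    verdict = judge(messages[0].content, last_message.content)  # 3. helpfulness
    return "end" if "Y" in verdict else "continue"


# Scripted judges replace the GPT-4 call (assumption for illustration only).
always_no = lambda query, answer: "N"
always_yes = lambda query, answer: "Y"

question = StubMessage("What is LoRA?")
tool_request = StubMessage("", tool_calls=[{"name": "duckduckgo_search"}])
answer = StubMessage("LoRA is a parameter-efficient fine-tuning method.")

print(route({"messages": [question, tool_request]}, always_no))    # tool branch
print(route({"messages": [question, answer]}, always_yes))         # helpful
print(route({"messages": [question, answer]}, always_no))          # not helpful
print(route({"messages": [question] + [answer] * 10}, always_no))  # 11 messages
```

Passing the judge as a parameter keeps the routing logic testable on its own, separate from model behaviour.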


In [34]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)
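
To see how this mapping drives the cycle, here is a minimal pure-Python sketch of the lookup a conditional edge performs. It does not use LangGraph. `run_cycle`, the scripted router outputs, and the `"__end__"` sentinel are illustrative assumptions, and the `action -> agent` hop mimics the plain edge added in the next cell.

```python
END_SENTINEL = "__end__"  # hypothetical stand-in for LangGraph's END

# Same shape as the mapping passed to add_conditional_edges.
ROUTES = {"continue": "agent", "action": "action", "end": END_SENTINEL}


def run_cycle(router_outputs):
    """Walk the graph using a scripted list of router return values."""
    visited = ["agent"]                  # the entry point
    for label in router_outputs:
        next_node = ROUTES[label]        # conditional edge leaving "agent"
        if next_node == END_SENTINEL:
            break
        visited.append(next_node)
        if next_node == "action":
            visited.append("agent")      # plain edge: action -> agent
    return visited


print(run_cycle(["action", "continue", "end"]))
```

In this sketch a label that is not a key of the mapping, such as `"END"`, raises a `KeyError`. The router's return values therefore have to match the mapping keys exactly, including case.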

##### YOUR MARKDOWN HERE

In [35]:
graph_with_helpfulness_check.add_edge("action", "agent")

##### YOUR MARKDOWN HERE

In [36]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

##### YOUR MARKDOWN HERE

In [37]:
inputs = {"messages" : [HumanMessage(content="Related to machine learning, what is LoRA? Also, who is Tim Dettmers? Also, what is Attention?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_anUH5b3fdzmbiNcufB8MkFRF', 'function': {'arguments': '{"query": "LoRA machine learning"}', 'name': 'duckduckgo_search'}, 'type': 'function'}, {'id': 'call_GPiErYOKrt7v47SnPjzENHx3', 'function': {'arguments': '{"query": "Tim Dettmers"}', 'name': 'duckduckgo_search'}, 'type': 'function'}, {'id': 'call_yJAATFIfvhBdoUyjo1ImMoxr', 'function': {'arguments': '{"query": "Attention in machine learning"}', 'name': 'duckduckgo_search'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 76, 'prompt_tokens': 171, 'total_tokens': 247, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_5796ac6771', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-35aece06-09d7-430a-8971-5201391055db-0', tool_calls=[{'name': 'duckduckgo_search', 'args': {'query': 'LoRA machine learning'}, 'id': '
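
The raw `values["messages"]` dump is hard to scan. Below is a minimal sketch of a summarizing helper, assuming each update has the `{node: {"messages": [...]}}` shape printed by the loop above. `summarize_update` and `FakeMessage` are hypothetical names; the stub message exists only so the snippet runs without an API key.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeMessage:
    """Hypothetical stand-in exposing the two attributes the helper reads."""
    content: str
    tool_calls: List[dict] = field(default_factory=list)


def summarize_update(chunk: dict, width: int = 60) -> List[str]:
    """Return one short line per message in a stream update."""
    lines = []
    for node, values in chunk.items():
        for message in values["messages"]:
            # Tool messages may not carry tool_calls, so fall back to an empty list.
            calls = getattr(message, "tool_calls", None) or []
            if calls:
                names = ", ".join(call["name"] for call in calls)
                lines.append(f"[{node}] tool calls: {names}")
            else:
                lines.append(f"[{node}] {message.content[:width]}")
    return lines


chunk = {
    "agent": {
        "messages": [
            FakeMessage("", [{"name": "duckduckgo_search",
                              "args": {"query": "LoRA machine learning"}}])
        ]
    }
}
for line in summarize_update(chunk):
    print(line)
```

Inside the streaming loop, `print("\n".join(summarize_update(chunk)))` could stand in for the per-node `print(values["messages"])` call.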

### Task 4: LangGraph for the "Patterns" of GenAI

Let's ask our system about the 4 patterns of Generative AI:

1. Prompt Engineering
2. RAG
3. Fine-tuning
4. Agents

In [38]:
patterns = ["prompt engineering", "RAG", "fine-tuning", "LLM-based agents"]

In [39]:
for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene??"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  print(messages["messages"][-1].content)
  print("\n\n")

Prompt engineering is a concept primarily associated with the field of artificial intelligence, particularly in the context of natural language processing (NLP) and large language models (LLMs) like GPT-3. It involves the design and crafting of prompts (input text) to elicit desired responses from AI models. The goal is to optimize the input to get the most accurate, relevant, or creative output from the model.

### Key Aspects of Prompt Engineering:
1. **Crafting Effective Prompts**: Designing prompts that are clear, specific, and tailored to the task at hand.
2. **Iterative Testing**: Continuously refining prompts based on the responses received to improve the quality of the output.
3. **Understanding Model Behavior**: Gaining insights into how the model interprets different types of prompts and adjusting accordingly.
4. **Use Cases**: Applications range from generating creative content, answering questions, summarizing text, to more complex tasks like code generation and data analys