# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language to build coordinated multi-actor and stateful applications that include cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Because we use cycles rather than plain loops, we can also compose rather complex flows through our graph in a much more readable and natural fashion, effectively recreating application flowcharts in code in an almost 1-to-1 fashion.

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1: Dependencies

We'll first install all our required libraries.

In [1]:
!pip install -qU langchain langchain_openai langchain-community langgraph arxiv duckduckgo_search==5.3.1b1

## Task 2: Environment Variables

We'll want to set both our OpenAI API key and our LangSmith environment variables.

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE4 - LangGraph - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

In [4]:
os.environ["LANGCHAIN_PROJECT"]

'AIE4 - LangGraph - d1b82ae5'

In [35]:
from IPython.display import display, Markdown

def pretty_print(message: str) -> None:
    display(Markdown(f"```markdown\n{message}\n```"))

## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Duck Duck Go Web Search](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools/ddg_search)
- [Arxiv](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools/arxiv)

#### 🏗️ Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be a method.

In [44]:
from langchain_community.tools.ddg_search import DuckDuckGoSearchRun
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tool_belt = [
    DuckDuckGoSearchRun(),
    ArxivQueryRun()
]

### Model

Now we can set up our model! We'll leverage the familiar OpenAI model suite for this example - but it's not *necessary* to use OpenAI with LangGraph. LangGraph supports all models - though you might not find success with smaller models - as such, the recommendation is to stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API - we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API).

In [45]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o", temperature=0)

Now that we have our model set up, let's "put on the tool belt", which is to say: we'll bind our LangChain formatted tools to the model in an OpenAI function calling format.

In [46]:
model = model.bind_tools(tool_belt)

#### ❓ Question #1:

How does the model determine which tool to use?

The LLM is given context about each tool: its name, a description of what it does and when to call it, and the schema of its arguments (the payload). Based on that context and the conversation so far, the LLM decides which tools, if any, to call.
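
As a rough illustration of what that context looks like, a tool exposed in the OpenAI function-calling format is described by a name, a natural-language description, and a JSON Schema for its arguments. The sketch below builds such a payload by hand in plain Python. The helper `describe_tool` and the description strings are hypothetical stand-ins, not the text LangChain actually sends when tools are bound.

```python
import json

def describe_tool(name: str, description: str, arg_name: str) -> dict:
    """Build an OpenAI-style function-calling entry for a one-argument tool.

    Hypothetical helper for illustration only; LangChain generates the
    real payload when tools are bound to the model.
    """
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {arg_name: {"type": "string"}},
                "required": [arg_name],
            },
        },
    }

# Illustrative descriptions (assumed wording, not copied from the library).
tools_payload = [
    describe_tool("duckduckgo_search", "Search the web for current information.", "query"),
    describe_tool("arxiv", "Search arXiv for scientific papers.", "query"),
]

print(json.dumps(tools_payload[0], indent=2))
```

The model never sees the tool's implementation, only this kind of description, which is why clear names and descriptions matter for tool selection.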

## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StateGraph` which uses an `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below - but this `AgentState` object is a `TypedDict` with the key `messages`, whose value is a list of `BaseMessage` objects that is appended to whenever the state changes.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
  - `{"messages" : []}`
2. Our user submits a query to our application.
  - New State: `HumanMessage(#1)`
  - `{"messages" : [HumanMessage(#1)]}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
  - New State: `AgentMessage(#1, additional_kwargs {"function_call" : "WebSearchTool"})`
  - `{"messages" : [HumanMessage(#1), AgentMessage(#1, ...)]}`
4. We pass our state object to a "conditional node" (more on this later) which reads the last state to determine if we need to use a tool - which it can determine properly because of our provided object!
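
The walk-through above can be sketched in plain Python. This is a simplified stand-in, not LangGraph itself: messages are plain dictionaries, and `append_messages` and `apply_update` are hypothetical helpers that only append, which is the behaviour the steps describe.

```python
def append_messages(existing: list, new: list) -> list:
    """Simplified reducer: return the old messages followed by the new ones."""
    return existing + new

def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state using the reducer."""
    return {"messages": append_messages(state["messages"], update["messages"])}

# 1. Initial state.
state = {"messages": []}

# 2. The user submits a query.
state = apply_update(state, {"messages": [{"role": "human", "content": "Who won?"}]})

# 3. The agent node reads the state and returns only its new message.
agent_update = {"messages": [{"role": "ai", "content": "", "tool_calls": ["web_search"]}]}
state = apply_update(state, agent_update)

# 4. A conditional check inspects the most recent message.
needs_tool = bool(state["messages"][-1].get("tool_calls"))

print(len(state["messages"]), needs_tool)
```

Note that each step returns only the new messages; the reducer is what turns those partial updates into a growing history.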

In [47]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
import operator
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
  messages: Annotated[list, add_messages]
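
The `Annotated[list, add_messages]` annotation attaches a function to the `messages` key as metadata. To make that less magical, here is a minimal standard-library sketch of how such metadata can be read back and used to combine an old value with an update. `DemoState` is a hypothetical class, and `operator.add` stands in for `add_messages` so the example runs without LangGraph installed.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints

class DemoState(TypedDict):
    # operator.add stands in for add_messages in this sketch.
    messages: Annotated[list, operator.add]

# include_extras=True keeps the Annotated metadata instead of stripping it.
hints = get_type_hints(DemoState, include_extras=True)
reducer = hints["messages"].__metadata__[0]

current = ["human: hello"]
update = ["ai: hi there"]
merged = reducer(current, update)

print(merged)
```

A key without such an annotation has no function attached, so the natural default is for an update to simply replace the old value.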

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to each other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [48]:
from langgraph.prebuilt import ToolNode

def call_model(state):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt)

Now we have two total nodes. We have:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [49]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [50]:
uncompiled_graph.set_entry_point("agent")

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks whether the last message contains any `tool_calls`.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or the END (finish the graph).

It's important to highlight that `add_conditional_edges` also accepts an optional third parameter: a dictionary (the mapping) from the possible outputs of our conditional function to node names. We don't need it here because `should_continue` returns its destination directly: either `"action"` (the name of our action node) or `END`.

In [51]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)
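
When a routing function returns labels rather than node names, a mapping dictionary translates those labels into destinations. The plain-Python sketch below mimics that resolution step. The names `route`, `path_map`, `next_node`, and the `END_NODE` sentinel are all illustrative assumptions, and messages are plain dictionaries rather than LangChain message objects.

```python
END_NODE = "END"  # hypothetical sentinel for this sketch

def route(state: dict) -> str:
    """Return a routing label based on the last message in the state."""
    last_message = state["messages"][-1]
    return "continue" if last_message.get("tool_calls") else "end"

# Mapping from routing labels to destination nodes.
path_map = {"continue": "action", "end": END_NODE}

def next_node(state: dict) -> str:
    """Resolve the routing label to the node that should run next."""
    return path_map[route(state)]

with_tool_call = {"messages": [{"content": "", "tool_calls": [{"name": "arxiv"}]}]}
final_answer = {"messages": [{"content": "Done.", "tool_calls": []}]}

print(next_node(with_tool_call), next_node(final_answer))
```

Keeping labels separate from node names lets us rename nodes without touching the routing logic, at the cost of one extra dictionary to keep in sync.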

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [52]:
uncompiled_graph.add_edge("action", "agent")

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)

All that's left to do now is to compile our workflow - and we're off!

In [53]:
compiled_graph = uncompiled_graph.compile()
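
To build intuition for what the compiled graph does when it runs, here is a minimal plain-Python sketch of the same agent/action cycle. It is not how LangGraph is implemented: `fake_agent`, `fake_action`, `route_after_agent`, and `run` are hypothetical stand-ins, and messages are plain dictionaries.

```python
END_NODE = "END"  # hypothetical sentinel for this sketch

def fake_agent(state: dict) -> dict:
    """Stand-in for call_model: ask for a tool once, then answer."""
    has_tool_result = any(m["role"] == "tool" for m in state["messages"])
    if has_tool_result:
        return {"messages": [{"role": "ai", "content": "Final answer", "tool_calls": []}]}
    return {"messages": [{"role": "ai", "content": "", "tool_calls": ["search"]}]}

def fake_action(state: dict) -> dict:
    """Stand-in for the tool node: return one result per requested tool."""
    calls = state["messages"][-1]["tool_calls"]
    return {"messages": [{"role": "tool", "content": f"result of {name}"} for name in calls]}

def route_after_agent(state: dict) -> str:
    """Conditional edge: go to the action node only if tools were requested."""
    return "action" if state["messages"][-1]["tool_calls"] else END_NODE

nodes = {"agent": fake_agent, "action": fake_action}

def run(state: dict, entry_point: str = "agent") -> tuple[dict, list]:
    """Walk the graph from the entry point until the END sentinel is reached."""
    visited = []
    current = entry_point
    while current != END_NODE:
        update = nodes[current](state)
        state = {"messages": state["messages"] + update["messages"]}
        visited.append(current)
        # "action" always returns to "agent"; "agent" uses the conditional edge.
        current = "agent" if current == "action" else route_after_agent(state)
    return state, visited

final_state, visited = run({"messages": [{"role": "human", "content": "Hi"}]})
print(visited)
```

The cycle ends only because the stand-in agent eventually stops requesting tools, which is exactly the property the next question asks about.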

#### ❓ Question #2:

Is there any specific limit to how many times we can cycle?

If not, how could we impose a limit to the number of cycles?

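By default, LangGraph stops a run once it reaches its `recursion_limit` (25 steps unless overridden in the run config), so a graph cannot cycle forever. To impose our own limit, we can set a counter in the state that tracks how many times the agent has cycled. Once the counter reaches a predefined limit, the `should_continue` function returns `END`.

Something like this: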

In [38]:
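# Pseudocode, commented out so it isn't run by accident
# class AgentState(TypedDict):
#   messages: Annotated[list, add_messages]
#   cycle_count: int

# CYCLE_LIMIT = some_number

# def call_model(state):
#     messages = state["messages"]
#     response = model.invoke(messages)
#     # Return the new count as an update rather than mutating state in place
#     return {"messages": [response], "cycle_count": state["cycle_count"] + 1}

# def should_continue(state):
#     # The inputs must then include an initial "cycle_count": 0
#     if state["cycle_count"] >= CYCLE_LIMIT:
#         return END
#     return "action" if state["messages"][-1].tool_calls else END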
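
For a version that can be run without touching the real graph, the sketch below applies the same counter idea to a plain-Python stand-in. Everything here is hypothetical: `fake_agent` always asks for another tool call so that only the limit can stop the loop, `CYCLE_LIMIT = 3` is an arbitrary demo value, and `merge` is a simplified substitute for LangGraph's state updates.

```python
END_NODE = "END"   # hypothetical sentinel for this sketch
CYCLE_LIMIT = 3    # assumed value, chosen only for the demo

def fake_agent(state: dict) -> dict:
    """Stand-in model call that always asks for another tool call."""
    message = {"role": "ai", "content": "", "tool_calls": ["search"]}
    # Return the incremented counter as an update instead of mutating state.
    return {"messages": [message], "cycle_count": state["cycle_count"] + 1}

def fake_action(state: dict) -> dict:
    """Stand-in tool node."""
    return {"messages": [{"role": "tool", "content": "result"}]}

def should_continue(state: dict) -> str:
    """Stop when the limit is hit, even if the model still wants tools."""
    if state["cycle_count"] >= CYCLE_LIMIT:
        return END_NODE
    return "action" if state["messages"][-1].get("tool_calls") else END_NODE

def merge(state: dict, update: dict) -> dict:
    """Append messages; overwrite every other key present in the update."""
    merged = {**state, **update}
    merged["messages"] = state["messages"] + update["messages"]
    return merged

state = {"messages": [{"role": "human", "content": "Hi"}], "cycle_count": 0}
current = "agent"
while current != END_NODE:
    if current == "agent":
        state = merge(state, fake_agent(state))
        current = should_continue(state)
    else:
        state = merge(state, fake_action(state))
        current = "agent"

print(state["cycle_count"])
```

One design consequence: when the limit triggers, the last message is still a tool request rather than an answer, so a real application would likely want a fallback response at that point.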

## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fares:

In [39]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="Who is the current captain of the Winnipeg Jets?")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        pretty_print(values["messages"])
        #pretty_print(values["messages"][0].content)
        print("\n\n")

Receiving update from node: 'agent'


```markdown
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_Ngdom7DqqVEeQw1k5rMHuVvy', 'function': {'arguments': '{"query":"current captain of the Winnipeg Jets 2023"}', 'name': 'duckduckgo_search'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 25, 'prompt_tokens': 156, 'total_tokens': 181}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_fde2829a40', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-869b4d15-a1fb-4140-a046-44105298b2de-0', tool_calls=[{'name': 'duckduckgo_search', 'args': {'query': 'current captain of the Winnipeg Jets 2023'}, 'id': 'call_Ngdom7DqqVEeQw1k5rMHuVvy', 'type': 'tool_call'}], usage_metadata={'input_tokens': 156, 'output_tokens': 25, 'total_tokens': 181})]
```




Receiving update from node: 'action'


```markdown
[ToolMessage(content="The Winnipeg Jets feel they are heading in the right direction despite an early exit from the 2024 Stanley Cup Playoffs. The Jets equaled the winningest season in their history in 2023-24 ... The Winnipeg Jets are battling with Cole Perfetti right now, who's still unsigned. Perfetti had a lot of ups and downs last year, most notably going through a 23-game goalless streak, prompting ... It has become a matter of fact that McGroarty wouldn't sign with Winnipeg because they refused to guarantee him an NHL roster spot. Reportedly, a few trades over the past six months have fallen through because of McGroarty's demand until Penguins president of hockey operations/GM Kyle Dubas ponied up his top prospect, 2023 first-round pick Brayden Yager. Owen Sound Attack GM Dale DeGray said Colby Barlow informed the team a few days before training camp that he's decided to stay and prepare in the Toronto area before the Winnipeg Jets meet for camp next month. The Jets selected Barlow in the first round and signed him to an entry-level contract in the summer of 2023. Rutger McGroarty, a new prospect for the Pittsburgh Penguins, explained why he requested a trade from the Winnipeg Jets.The 20-year-old forward, who was picked 14th overall by the Jets in 2022 ...", name='duckduckgo_search', tool_call_id='call_Ngdom7DqqVEeQw1k5rMHuVvy')]
```




Receiving update from node: 'agent'


```markdown
[AIMessage(content='As of now, the Winnipeg Jets do not have a designated captain. The team has been operating without a captain since Blake Wheeler was stripped of the captaincy before the 2022-2023 season.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 42, 'prompt_tokens': 466, 'total_tokens': 508}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_157b3831f5', 'finish_reason': 'stop', 'logprobs': None}, id='run-948a6f8d-850b-4865-89ee-13aa75428f24-0', usage_metadata={'input_tokens': 466, 'output_tokens': 42, 'total_tokens': 508})]
```






Let's look at what happened:

1. Our state object was populated with our request
2. The state object was passed into our entry point (agent node) and the agent node added an `AIMessage` to the state object and passed it along the conditional edge
3. The conditional edge received the state object, found the "tool_calls" `additional_kwarg`, and sent the state object to the action node
4. The action node ran the requested tool, added the result to the state object as a `ToolMessage`, and passed it along the edge to the agent node
5. The agent node added a response to the state object and passed it along the conditional edge
6. The conditional edge received the state object, could not find the "tool_calls" `additional_kwarg` and passed the state object to END where we see it output in the cell above!

Now let's look at an example that shows multiple tool calls - all with the same flow!

In [54]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using DuckDuckGo.")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        pretty_print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'


```markdown
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_pvRWYfQJjyqpgrIWiUKvScC2', 'function': {'arguments': '{"query": "QLoRA"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_oMmkApt0m9CIjC0cmikBHl7a', 'function': {'arguments': '{"query": "latest tweet"}', 'name': 'duckduckgo_search'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 50, 'prompt_tokens': 173, 'total_tokens': 223}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_157b3831f5', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-c5c51b2c-832a-4f3d-aef6-eac9274e6559-0', tool_calls=[{'name': 'arxiv', 'args': {'query': 'QLoRA'}, 'id': 'call_pvRWYfQJjyqpgrIWiUKvScC2', 'type': 'tool_call'}, {'name': 'duckduckgo_search', 'args': {'query': 'latest tweet'}, 'id': 'call_oMmkApt0m9CIjC0cmikBHl7a', 'type': 'tool_call'}], usage_metadata={'input_tokens': 173, 'output_tokens': 50, 'total_tokens': 223})]
```




Receiving update from node: 'action'
Tool Used: arxiv


```markdown
[ToolMessage(content='Published: 2023-05-23\nTitle: QLoRA: Efficient Finetuning of Quantized LLMs\nAuthors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer\nSummary: We present QLoRA, an efficient finetuning approach that reduces memory usage\nenough to finetune a 65B parameter model on a single 48GB GPU while preserving\nfull 16-bit finetuning task performance. QLoRA backpropagates gradients through\na frozen, 4-bit quantized pretrained language model into Low Rank\nAdapters~(LoRA). Our best model family, which we name Guanaco, outperforms all\nprevious openly released models on the Vicuna benchmark, reaching 99.3% of the\nperformance level of ChatGPT while only requiring 24 hours of finetuning on a\nsingle GPU. QLoRA introduces a number of innovations to save memory without\nsacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is\ninformation theoretically optimal for normally distributed weights (b) double\nquantization to reduce the average memory footprint by quantizing the\nquantization constants, and (c) paged optimziers to manage memory spikes. We\nuse QLoRA to finetune more than 1,000 models, providing a detailed analysis of\ninstruction following and chatbot performance across 8 instruction datasets,\nmultiple model types (LLaMA, T5), and model scales that would be infeasible to\nrun with regular finetuning (e.g. 33B and 65B parameter models). Our results\nshow that QLoRA finetuning on a small high-quality dataset leads to\nstate-of-the-art results, even when using smaller models than the previous\nSoTA. We provide a detailed analysis of chatbot performance based on both human\nand GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable\nalternative to human evaluation. Furthermore, we find that current chatbot\nbenchmarks are not trustworthy to accurately evaluate the performance levels of\nchatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to\nChatGPT. 
We release all of our models and code, including CUDA kernels for\n4-bit training.\n\nPublished: 2024-05-27\nTitle: Accurate LoRA-Finetuning Quantization of LLMs via Information Retention\nAuthors: Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno\nSummary: The LoRA-finetuning quantization of LLMs has been extensively studied to\nobtain accurate yet compact LLMs for deployment on resource-constrained\nhardware. However, existing methods cause the quantized LLM to severely degrade\nand even fail to benefit from the finetuning of LoRA. This paper proposes a\nnovel IR-QLoRA for pushing quantized LLMs with LoRA to be highly accurate\nthrough information retention. The proposed IR-QLoRA mainly relies on two\ntechnologies derived from the perspective of unified information: (1)\nstatistics-based Information Calibration Quantization allows the quantized\nparameters of LLM to retain original information accurately; (2)\nfinetuning-based Information Elastic Connection makes LoRA utilizes elastic\nrepresentation transformation with diverse information. Comprehensive\nexperiments show that IR-QLoRA can significantly improve accuracy across LLaMA\nand LLaMA2 families under 2-4 bit-widths, e.g., 4- bit LLaMA-7B achieves 1.4%\nimprovement on MMLU compared with the state-of-the-art methods. The significant\nperformance gain requires only a tiny 0.31% additional time consumption,\nrevealing the satisfactory efficiency of our IR-QLoRA. 
We highlight that\nIR-QLoRA enjoys excellent versatility, compatible with various frameworks\n(e.g., NormalFloat and Integer quantization) and brings general accuracy gains.\nThe code is available at https://github.com/htqin/ir-qlora.\n\nPublished: 2024-06-12\nTitle: Exploring Fact Memorization and Style Imitation in LLMs Using QLoRA: An Experimental Study and Quality Assessment Methods\nAuthors: Eugene Vyborov, Oleksiy Osypenko, Serge Sotnyk\nSummary: There are various methods for adapting LLMs to different domains. The most\ncommon methods are prompting, finetuning, and RAG. In this w', name='arxiv', tool_call_id='call_pvRWYfQJjyqpgrIWiUKvScC2'), ToolMessage(content="While Clark was the focal point after the win, a 2021 tweet by Reese resurfaced. Ostensibly, the tweet suggests LeBron wasn't helping the Lakers in Game 6 of their first-round postseason series ... Vice President Kamala Harris condemned former President Donald Trump and his campaign for their actions on Monday at the Arlington National Cemetery in a new post on X. Harris accused Trump of ... Twitter. The latest Twitter news and updates. Twitter is a social networking service, primarily microblogging but also a picture and video sharing service, founded by Jack Dorsey, Noah Glass, Biz ... Twitter Blue's Chaotic Revamp: The overhaul of the subscription service that lets users buy verified badges was the first big test for Musk as the platform's new owner. It didn't go well . Twitter has added a warning to one of President Donald Trump's tweets about protests in Minneapolis. The company says the tweet violated the platform's rules about glorifying violence. U.S ...", name='duckduckgo_search', tool_call_id='call_oMmkApt0m9CIjC0cmikBHl7a')]
```




Receiving update from node: 'agent'


```markdown
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_2D01nuepHj8X2xNnBjalxCMR', 'function': {'arguments': '{"query": "Tim Dettmers latest tweet"}', 'name': 'duckduckgo_search'}, 'type': 'function'}, {'id': 'call_S3iO58HSRIAAxSpssjKc7YSH', 'function': {'arguments': '{"query": "Artidoro Pagnoni latest tweet"}', 'name': 'duckduckgo_search'}, 'type': 'function'}, {'id': 'call_VOR108Zy9ZRHS7PmsQsNqEC2', 'function': {'arguments': '{"query": "Ari Holtzman latest tweet"}', 'name': 'duckduckgo_search'}, 'type': 'function'}, {'id': 'call_e7A9sfqKkPzVa01GZBqbleyz', 'function': {'arguments': '{"query": "Luke Zettlemoyer latest tweet"}', 'name': 'duckduckgo_search'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 108, 'prompt_tokens': 1395, 'total_tokens': 1503}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_157b3831f5', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-5edb56a5-0bbe-46b0-85e3-dc60313b52c2-0', tool_calls=[{'name': 'duckduckgo_search', 'args': {'query': 'Tim Dettmers latest tweet'}, 'id': 'call_2D01nuepHj8X2xNnBjalxCMR', 'type': 'tool_call'}, {'name': 'duckduckgo_search', 'args': {'query': 'Artidoro Pagnoni latest tweet'}, 'id': 'call_S3iO58HSRIAAxSpssjKc7YSH', 'type': 'tool_call'}, {'name': 'duckduckgo_search', 'args': {'query': 'Ari Holtzman latest tweet'}, 'id': 'call_VOR108Zy9ZRHS7PmsQsNqEC2', 'type': 'tool_call'}, {'name': 'duckduckgo_search', 'args': {'query': 'Luke Zettlemoyer latest tweet'}, 'id': 'call_e7A9sfqKkPzVa01GZBqbleyz', 'type': 'tool_call'}], usage_metadata={'input_tokens': 1395, 'output_tokens': 108, 'total_tokens': 1503})]
```




Receiving update from node: 'action'
Tool Used: duckduckgo_search


```markdown
[ToolMessage(content="Allen School Ph.D. student Tim Dettmers accepted the grand prize for QLoRA, a novel approach to finetuning pretrained models that significantly reduces the amount of GPU memory required — from over 780GB to less than 48GB — to finetune a 65B parameter model. With QLoRA, the largest publicly available models can be finetuned on a single ... They show Dettmers explaining to EleutherAI researchers that Meta lawyers had concerns with using The Pile, an AI training dataset created by EleutherAI that includes Books3. The plaintiffs' attorneys have since redacted the quotes from the amended complaint, but screenshots of the chats have continued to circulate on LinkedIn, according to ... Her new position hasn't been finalized. ... — Tim Dettmers is joining Ai2 as an AI researcher. Dettmers specializes in efficient deep learning at the intersection of machine learning, NLP, and ... Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, ... (SpQR), a new compressed format and quantization technique that enables for the first time \\emph{near-lossless} compression of LLMs across model scales while reaching similar compression levels to previous methods. SpQR works by identifying and isolating \\emph{outlier ... Essentially, a CPU is a latency-optimized device while GPUs are bandwidth-optimized devices. If a CPU is a race car, a GPU is a cargo truck. The main job in deep learning is to fetch and move ...", name='duckduckgo_search', tool_call_id='call_2D01nuepHj8X2xNnBjalxCMR'), ToolMessage(content="QLoRA introduces multiple innovations designed to reduce memory use without sacrificing performance: (1) 4-bit NormalFloat, an information theoretically optimal quantization data type for normally distributed data that yields better empirical results than 4-bit Integers and 4-bit Floats.(2) Double Quantization, a method that quantizes the quantization constants, saving an average of about 0.37 ... 
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all previous openly ... Artidoro Pagnoni. artidoro. Follow. mrm8488's profile picture Weyaxi's profile picture nezubn's profile picture. ... artidoro/model-tvergho. Updated Nov 18, 2023. artidoro/model-vinaic. Updated Nov 18, 2023. artidoro/model-vinaia. Updated Nov 18, 2023. datasets. None public yet. Company Recent studies have highlighted the promising performance of LLMs in various NLP tasks, particularly in the zero-shot and few-shot settings. Notably, GPT-3 has shown competitive performance and, in some cases, even outperformed state-of-the-art models in few-shot in-context learning (Brown et al., 2020b; Zhong et al., 2023).However, when it comes to sentiment analysis tasks, some studies have ... Pagnoni Celebrates 175th Year. Pagnoni has been a leading light in the world of HPL, short-cycle and compact presses for the value-added panel product industry. Stephen Powney interviews director Michele Pagnoni about the company's developments and his view on markets.", name='duckduckgo_search', tool_call_id='call_S3iO58HSRIAAxSpssjKc7YSH'), ToolMessage(content='We need to get back to where they can be punished for knowingly lying on air like that. He 100% said that and he knows it. — SaltyGoat (@SaltyGoat17) August 29, 2024 If his network were honest ... The heated on-air dispute last night between MSNBC\'s Ari Melber and Donald Trump campaign adviser Corey Lewandowski didn\'t end with Wednesday\'s segment: Today, Lewandowski tweeted a video in ... 
Ari Melber warns Trump campaign advisor Corey Lewandowski he \'will be potentially in a defamation situation\' on Wednesday\'s episode of The Beat (MSNBC) "I did not say that. That is a false ... MSNBC host Ari Melber, during an interview with Trump campaign adviser Corey Lewandowski on Wednesday, threatened him with a defamation lawsuit for quoting the anchor calling the former President ... While in Mumbai, Ari would stay at the Holtzman\'s home each evening. "Their house was like a second home to me," explains Ari, who has travelled to India forty times. The last time that Ari was at the Holtzman\'s Chabad House was two months ago, where he stayed for a week and participated in prayers and Torah lectures given by Rabbi ...', name='duckduckgo_search', tool_call_id='call_VOR108Zy9ZRHS7PmsQsNqEC2'), ToolMessage(content="Luke Zettlemoyer is a research manager and site lead for FAIR Seattle. He is also a Professor in the Allen School of Computer Science & Engineering at the University of Washington. His research is in empirical computational semantics, where the goal is to build models that recover representations of the meaning of natural language text. \ufeff Twitter \ufeff Reddit. Join our list for notifications and early access to events ... About this Episode. Today we're joined by Luke Zettlemoyer, professor at University of Washington and a research manager at Meta. In our conversation with Luke, we cover multimodal generative AI, the effect of data on models, and the significance of open ... Today we're joined by Luke Zettlemoyer, professor at University of Washington and a research manager at Meta. In our conversation with Luke, we cover multimodal generative AI, the effect of data on models, and the significance of open source and open science. 
We explore the grounding problem, the need for visual grounding and embodiment in Evaluating the factuality of long-form text generated by large language models (LMs) is nontrivial because (1) generations often contain a mixture of... Luke Zettlemoyer, Omer Levy. Abstract. We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to ...", name='duckduckgo_search', tool_call_id='call_e7A9sfqKkPzVa01GZBqbleyz')]
```




Receiving update from node: 'agent'


```markdown
[AIMessage(content='Here are the latest updates and tweets related to the authors of the QLoRA paper:\n\n### Tim Dettmers\n- **Latest Update**: Tim Dettmers is joining Ai2 as an AI researcher. He specializes in efficient deep learning at the intersection of machine learning, NLP, and more.\n- **Latest Tweet**: No specific tweet found, but recent discussions involve his work on QLoRA and efficient deep learning techniques.\n\n### Artidoro Pagnoni\n- **Latest Update**: Artidoro Pagnoni has been involved in various NLP tasks and has updated several models recently.\n- **Latest Tweet**: No specific tweet found, but his recent work includes updates to models like `model-tvergho` and `model-vinaic`.\n\n### Ari Holtzman\n- **Latest Update**: Ari Holtzman has been involved in discussions and interviews related to political and social issues.\n- **Latest Tweet**: No specific tweet found, but recent mentions include his involvement in interviews and discussions on MSNBC.\n\n### Luke Zettlemoyer\n- **Latest Update**: Luke Zettlemoyer is a research manager and site lead for FAIR Seattle, and a professor at the University of Washington. His research focuses on empirical computational semantics.\n- **Latest Tweet**: No specific tweet found, but recent discussions include his work on multimodal generative AI and the significance of open source and open science.\n\nFor more detailed and specific tweets, you might want to check their individual Twitter profiles directly.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 308, 'prompt_tokens': 2813, 'total_tokens': 3121}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_157b3831f5', 'finish_reason': 'stop', 'logprobs': None}, id='run-83aa0906-0bfc-462d-a045-305439bafcb2-0', usage_metadata={'input_tokens': 2813, 'output_tokens': 308, 'total_tokens': 3121})]
```






#### 🏗️ Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

1. Our state object was populated with our request.
2. The state object was passed into our entry point (the agent node). The agent node added an `AIMessage` to the state object and passed it along the conditional edge.
3. The conditional edge received the state object, found a list of "tool_calls" in the `additional_kwargs`, and sent the state object to the action node. In this case the "tool_calls" list contained both arxiv and duckduckgo_search; every tool call in the list is executed within the same cycle of the loop.
4. The arxiv call ran first, followed by the "latest tweet" query via duckduckgo_search (although the latter did not return anything useful for our case). The action node added the tool results to the state object as `ToolMessage`s and passed it along the edge to the agent node.
5. The agent node added a response to the state object and passed it along the conditional edge. Here the agent identified the authors of the paper and decided to use duckduckgo_search to look up each author's latest tweets.
6. The conditional edge received the state object, found a list of "tool_calls" in the `additional_kwargs`, and sent the state object to the action node. This time the list held one search query per author.
7. The duckduckgo_search calls were executed to search for each author's latest tweets. The action node added the tool results to the state object and passed it along the edge to the agent node.
8. The agent node added a response to the state object and passed it along the conditional edge.
9. The conditional edge received the state object, found no "tool_calls" in the `additional_kwargs`, and passed the state object to END, where we see it output in the cell above!
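
The cycle described in these steps can be sketched in plain Python. This is a toy simulation rather than LangGraph code: `FakeMessage`, `fake_agent`, `fake_action`, and `run` are hypothetical stand-ins for the model node, the tool node, and the compiled graph, and the scripted tool calls only mimic the shape of the trace above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a chat message (not a LangChain class)."""
    role: str
    content: str
    tool_calls: list = field(default_factory=list)


def fake_agent(state):
    """Scripted 'model': request tools on the first two turns, then answer."""
    agent_turns = sum(1 for m in state["messages"] if m.role == "ai")
    if agent_turns == 0:
        return FakeMessage("ai", "", tool_calls=["arxiv", "duckduckgo_search"])
    if agent_turns == 1:
        return FakeMessage("ai", "", tool_calls=["duckduckgo_search"] * 4)
    return FakeMessage("ai", "final answer")


def fake_action(state):
    """Produce one tool message per requested tool call."""
    last = state["messages"][-1]
    return [FakeMessage("tool", f"result from {name}") for name in last.tool_calls]


def should_continue(state):
    """Conditional edge: go to the action node only if tools were requested."""
    return "action" if state["messages"][-1].tool_calls else "end"


def run(question):
    state = {"messages": [FakeMessage("human", question)]}
    trace, node = [], "agent"
    while node != "end":
        trace.append(node)
        if node == "agent":
            state["messages"].append(fake_agent(state))
            node = should_continue(state)
        else:
            state["messages"].extend(fake_action(state))
            node = "agent"  # the action node always returns to the agent
    return trace, state


trace, state = run("example question")
print(" -> ".join(trace))
```

The printed trace alternates between the two nodes and stops only when the agent returns a message without tool calls, which is the same exit condition as step 9.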

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [55]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["question"])]}

def parse_output(input_state):
  return input_state["messages"][-1].content

agent_chain = convert_inputs | compiled_graph | parse_output
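
To see what the `|` operator is doing in the cell above, here is a minimal sketch in plain Python. It assumes only that piping composes steps left to right and that plain functions are wrapped automatically when they sit next to a runnable; `Step`, `toy_graph`, and `toy_chain` are hypothetical names, the messages are plain dicts rather than LangChain message objects, and nothing here calls LangChain or a model.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Step | something: run self first, then the right-hand side
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def __ror__(self, other):
        # plain_function | Step: Python falls back to the right operand's __ror__
        prev = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: self.invoke(prev.invoke(value)))


def convert_inputs(input_object):
    return {"messages": [{"role": "human", "content": input_object["question"]}]}


def parse_output(state):
    return state["messages"][-1]["content"]


# Toy "graph": appends an echo reply instead of calling a model
toy_graph = Step(
    lambda state: {
        "messages": state["messages"]
        + [{"role": "ai", "content": "echo: " + state["messages"][-1]["content"]}]
    }
)

toy_chain = convert_inputs | toy_graph | parse_output
print(toy_chain.invoke({"question": "What is RAG?"}))
```

The wrapper functions exist so that the evaluation harness can pass in a dict with a `question` key and get back a plain string, without knowing anything about the graph's message-based state.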

In [56]:
pretty_print(agent_chain.invoke({"question" : "What is RAG?"}))

```markdown
RAG stands for Retrieval-Augmented Generation. It is a technique used in natural language processing (NLP) and machine learning to improve the performance of language models by combining retrieval-based methods with generative models. Here's a brief overview of how it works:

1. **Retrieval**: In the first step, the system retrieves relevant documents or pieces of information from a large corpus based on the input query. This is typically done using a retrieval model, such as BM25 or a dense retrieval model like DPR (Dense Passage Retrieval).

2. **Augmentation**: The retrieved documents are then used to augment the input query. This can involve concatenating the retrieved information with the original query or using it to provide additional context.

3. **Generation**: Finally, a generative model, such as GPT-3 or BERT, is used to generate a response based on the augmented input. The generative model can produce more accurate and contextually relevant responses because it has access to the additional information retrieved in the first step.

RAG models are particularly useful in scenarios where the input query requires specific knowledge or context that may not be fully captured by the generative model alone. By incorporating relevant external information, RAG models can provide more accurate and informative responses.
```

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]
```

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions.

In [75]:
questions = [
    "What are the key components of the Transformer model?",
    "How does the Transformer model handle the order of sequences without recurrent layers?",
    "What are the advantages of the Transformer model over traditional RNN-based models?",
    "Who authored the Attention is All You Need paper?",
    "What are some limitations of the Transformer model mentioned in the paper?",
    "What is the main contribution of the Attention is All You Need paper?"
]

answers = [
    {"must_mention" : ["self-attention", "multi-head"]},
    {"must_mention" : ["positional", "encoding"]},
    {"must_mention" : ["parallelization", "long-range"]},
    {"must_mention" : ["vaswani", "ashish"]},
    {"must_mention" : ["quadratic", "recurrence"]},
    {"must_mention" : ["transformer", "architecture"]},
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [76]:
uuid4().hex[0:8]

'fa79f0a7'

In [77]:
from langsmith import Client

client = Client()
dataset_name = f"Retrieval Augmented Generation - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the Attention is All You Need Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    outputs=answers,
    dataset_id=dataset.id,
)

#### ❓ Question #3:

How are the correct answers associated with the questions?

> NOTE: Feel free to indicate if this is problematic or not

Please see the answer below, generated with `agent_chain`.

In [21]:
pretty_print(agent_chain.invoke({"question" : "How are the correct answers associated with the questions in LangSmith evaluation?"}))

CallbackManager.merge(): Parent run IDs do not match. Using the parent run ID of the first callback manager.


In LangSmith evaluations, correct answers are typically associated with questions through a process that involves defining a set of expected answers or criteria that the responses should meet. Here’s a general outline of how this process might work:

1. **Question Definition**: Clearly define the questions that need to be evaluated. Each question should be specific and unambiguous.

2. **Expected Answer Specification**: For each question, specify the correct or expected answers. This can be done in several ways:
   - **Exact Match**: The correct answer is a specific string or set of strings.
   - **Pattern Matching**: The correct answer follows a certain pattern or regular expression.
   - **Semantic Matching**: The correct answer is evaluated based on its meaning, which might involve natural language understanding.

3. **Evaluation Criteria**: Define the criteria for evaluating the answers. This can include:
   - **Accuracy**: How closely the answer matches the expected answer.
   - **Completeness**: Whether the answer fully addresses the question.
   - **Relevance**: Whether the answer is relevant to the question.
   - **Conciseness**: Whether the answer is concise and to the point.

4. **Automated Evaluation**: Use automated tools to compare the provided answers against the expected answers based on the defined criteria. This can involve:
   - **String Comparison**: For exact matches.
   - **Regular Expressions**: For pattern matching.
   - **Natural Language Processing (NLP)**: For semantic matching, using techniques like word embeddings, similarity measures, or pre-trained language models.

5. **Human Review**: In cases where automated evaluation is challenging, human reviewers might be involved to assess the answers based on the criteria.

6. **Scoring and Feedback**: Assign scores to the answers based on how well they meet the criteria and provide feedback if necessary.

By following these steps, LangSmith evaluations can systematically associate correct answers with questions, ensuring a fair and consistent assessment process.
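
In the dataset code above, nothing links a question to its answer except list position: the i-th entry of `inputs` is stored alongside the i-th entry of `outputs`. The sketch below illustrates that pairing with plain lists instead of the LangSmith client. The `pair_examples` helper is hypothetical, and the length check is an extra safeguard added for this example, not something the notebook does.

```python
def pair_examples(questions, answers):
    """Pair questions and answers by index, refusing mismatched lengths."""
    if len(questions) != len(answers):
        raise ValueError(
            f"{len(questions)} questions but {len(answers)} answers; "
            "positional pairing would be misaligned"
        )
    return [
        {"inputs": {"question": q}, "outputs": a}
        for q, a in zip(questions, answers)
    ]


questions = [
    "Who authored the Attention is All You Need paper?",
    "How does the Transformer model handle the order of sequences without recurrent layers?",
]
answers = [
    {"must_mention": ["vaswani", "ashish"]},
    {"must_mention": ["positional", "encoding"]},
]

examples = pair_examples(questions, answers)
for example in examples:
    print(example["inputs"]["question"], "->", example["outputs"]["must_mention"])
```

This is what makes the approach potentially problematic: reordering one list, or dropping an item from it, changes which answer a question is graded against without any visible error in the data itself.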

### Task 2: Adding Evaluators

Now we can add a custom evaluator to see if our responses contain the expected information.

We'll be using a fairly naive exact-match process to determine if our response contains specific strings.

In [78]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def must_mention(run, example) -> EvaluationResult:
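    # Guard against a missing output before lowercasing
    prediction = (run.outputs.get("output") or "").lower()
    required = example.outputs.get("must_mention") or []
    # Lowercase each phrase too, so the comparison is case-insensitive on both sides
    score = all(phrase.lower() in prediction for phrase in required)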
    return EvaluationResult(key="must_mention", score=score)

In [69]:
all(phrase in 'yes no' for phrase in ['yes','and'])

False

#### ❓ Question #4:

What are some ways you could improve this metric as-is?

> NOTE: Alternatively you can suggest where gaps exist in this method.

The metric is purely based on string matching, so many surface variations (casing, punctuation, spelling) can cause a miss. I modified the score logic above to lowercase the prediction before comparing. Beyond that, we could add fuzzy matching, and use regex to define patterns or strip non-alphanumeric characters, to make the metric more flexible.

We could also add more metrics such as 'must not mention', or replace `all(phrase in prediction for phrase in required)` with `any(phrase in prediction for phrase in required)` where a single match is enough.

Finally, we could use a language model such as BERT to evaluate whether `prediction` and `required` have the same semantic meaning, so that we no longer depend on a one-to-one string match.
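
A few of these ideas can be sketched with only the standard library. The function below is an illustrative variant, not part of LangSmith: `flexible_must_mention`, its `mode` and `threshold` parameters, and the choice of `difflib` for fuzzy matching are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def phrase_matches(phrase, prediction, threshold=0.85):
    """True if the phrase appears exactly, or approximately, in the prediction."""
    phrase_n, pred_n = normalize(phrase), normalize(prediction)
    if phrase_n in pred_n:
        return True
    # Fuzzy fallback: compare against each window with the same word count
    words = pred_n.split()
    size = len(phrase_n.split())
    for start in range(len(words) - size + 1):
        window = " ".join(words[start:start + size])
        if SequenceMatcher(None, phrase_n, window).ratio() >= threshold:
            return True
    return False


def flexible_must_mention(prediction, required, mode="all", threshold=0.85):
    """Score a prediction; mode='all' needs every phrase, mode='any' needs one."""
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    combine = all if mode == "all" else any
    return combine(phrase_matches(p, prediction, threshold) for p in required)


answer = "The Transformer relies on Self-Attention and Multi-Head attention."
print(flexible_must_mention(answer, ["self-attention", "multi-head"]))
print(flexible_must_mention("PyTorch is popular", ["PyTorch", "TensorFlow"], mode="any"))
```

Fuzzy matching tolerates small spelling differences in either the prediction or the required phrase, at the cost of occasional false positives, so the threshold would need tuning against real outputs.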

Now that we have created our custom evaluator - let's initialize our `RunEvalConfig` with it!

In [79]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    custom_evaluators=[must_mention],
)

### Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [80]:
client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=agent_chain,
    evaluation=eval_config,
    verbose=True,
    project_name=f"RAG Pipeline - Evaluation - {uuid4().hex[0:8]}",
    project_metadata={"version": "1.0.0"},
)

View the evaluation results for project 'RAG Pipeline - Evaluation - bd970406' at:
https://smith.langchain.com/o/e8fff0bf-5e2d-5959-8289-a4e1b2625f84/datasets/f375c3a9-4379-4a16-8109-06a8746e4b9a/compare?selectedSessions=29bcf60a-ca49-4215-ac57-19368576ca8b

View all tests for Dataset Retrieval Augmented Generation - Evaluation Dataset - df4b8743 at:
https://smith.langchain.com/o/e8fff0bf-5e2d-5959-8289-a4e1b2625f84/datasets/f375c3a9-4379-4a16-8109-06a8746e4b9a
[------------------------------------------------->] 6/6

| | feedback.must_mention | error | execution_time | run_id |
|---|---|---|---|---|
| count | 6 | 0.0 | 6.0 | 6 |
| unique | 2 | 0.0 | | 6 |
| top | True | | | 13cac096-8dd4-4f4d-a465-58bbc5fc9cbc |
| freq | 5 | | | 1 |
| mean | | | 6.840911 | |
| std | | | 1.694429 | |
| min | | | 5.687585 | |
| 25% | | | 5.734344 | |
| 50% | | | 5.922246 | |
| 75% | | | 7.525476 | |


{'project_name': 'RAG Pipeline - Evaluation - bd970406',
 'results': {'42cd6405-6c47-448f-afb6-e9d3af67ca23': {'input': {'question': 'What are the key components of the Transformer model?'},
   'feedback': [EvaluationResult(key='must_mention', score=True, value=None, comment=None, correction=None, evaluator_info={}, feedback_config=None, source_run_id=UUID('7ed2a47b-bdef-4a46-90db-b8c0cb8e7b97'), target_run_id=None)],
   'execution_time': 5.687585,
   'run_id': '13cac096-8dd4-4f4d-a465-58bbc5fc9cbc',
   'output': 'The Transformer model, introduced in the paper "Attention is All You Need" by Vaswani et al., is a deep learning model primarily used for natural language processing tasks. The key components of the Transformer model are:\n\n1. **Encoder-Decoder Architecture**:\n   - The Transformer model consists of an encoder and a decoder, each composed of multiple layers.\n\n2. **Encoder**:\n   - **Input Embedding**: Converts input tokens into dense vectors.\n   - **Positional Encoding**:

## Part 2: LangGraph with Helpfulness:

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've done evaluation - let's see if we can add an extra step where we review the content we've generated to confirm if it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again - we can check the length of the state object, so we don't need additional state for this.

In [81]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### 🏗️ Activity #5:

Please write markdown for the following cells to explain what each is doing.

##### YOUR MARKDOWN HERE
Create an empty graph over `AgentState`, then add the agent node and the action node.

In [82]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

##### YOUR MARKDOWN HERE
Add our entry point. The entry point indicates which node is called first (here, 'agent').

In [83]:
graph_with_helpfulness_check.set_entry_point("agent")

##### YOUR MARKDOWN HERE
Define the conditional edge.
It checks whether the most recent message contains tool calls, how many messages are in the state, and whether the answer is judged helpful, to determine which path to follow.


In [84]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

  if len(state["messages"]) > 10:
    return "end"

  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4")

  helpfulness_chain = prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"

#### 🏗️ Activity #4:

Please write what is happening in our `tool_call_or_helpful` function!

If the most recent message in the state contains "tool_calls", return 'action'. Otherwise, if the total number of messages in the state exceeds 10, return 'end'. Otherwise, given the initial user query and the final response message, ask the LLM whether the response is helpful: if it answers 'Y', return 'end'; if not, return 'continue' (go back to the agent node to reason further and decide what to do next).
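
The three-way decision can be separated from the LLM call to make it easier to reason about. In this sketch the helpfulness judge is passed in as a plain callable, so the routing can be exercised without an API key. `route`, `Msg`, `judge`, and `max_messages` are hypothetical names for this example; the real function above builds its judge from a prompt and `ChatOpenAI`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Msg:
    """Hypothetical minimal message: text plus any requested tool calls."""
    content: str
    tool_calls: List[dict] = field(default_factory=list)


def route(messages: List[Msg], judge: Callable[[str, str], str],
          max_messages: int = 10) -> str:
    """Return 'action', 'end', or 'continue', checked in that order."""
    last = messages[-1]
    if last.tool_calls:
        return "action"      # the model asked for tools
    if len(messages) > max_messages:
        return "end"         # loop limit reached, judge is never consulted
    verdict = judge(messages[0].content, last.content)
    return "end" if "Y" in verdict else "continue"


always_yes = lambda query, response: "Y"
always_no = lambda query, response: "N"

history = [Msg("What is LoRA?"), Msg("LoRA stands for Low-Rank Adaptation.")]
print(route(history, always_yes))  # helpful answer, so the graph ends
print(route(history, always_no))   # unhelpful answer, so the agent tries again
```

Note that the `"Y" in verdict` test accepts any reply containing a capital Y, so a stricter comparison may be worth considering if the judge model returns more than a single letter.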

##### YOUR MARKDOWN HERE
Connect the agent to the correct path based on the return value of the `tool_call_or_helpful` function. If the function returns 'action', call the tools; if it returns 'end', end the flow; if it returns 'continue', go back to the agent.

In [85]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)

##### YOUR MARKDOWN HERE
Connect the action node to the agent node so that the action node (which calls our tools) can return its output to our agent.

In [86]:
graph_with_helpfulness_check.add_edge("action", "agent")

##### YOUR MARKDOWN HERE
Compile the whole graph.

In [87]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

##### YOUR MARKDOWN HERE
Execute the state graph asynchronously, streaming updates and printing each node's output as it is generated.

In [91]:
inputs = {"messages" : [HumanMessage(content="Related to machine learning, what is LoRA? Also, who is Tim Dettmers? Also, what is Attention?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        pretty_print(values["messages"])
        pretty_print(values["messages"][0].content)
        print("\n\n")

Receiving update from node: 'agent'


```markdown
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_jw9bg0EbXbQKqk4QcrGxKmlA', 'function': {'arguments': '{"query": "LoRA machine learning"}', 'name': 'duckduckgo_search'}, 'type': 'function'}, {'id': 'call_txI7KrjFJnp6V22k0qBo3jCZ', 'function': {'arguments': '{"query": "Tim Dettmers"}', 'name': 'duckduckgo_search'}, 'type': 'function'}, {'id': 'call_jJ3dMLa69eBoEgwk4CXYDFgm', 'function': {'arguments': '{"query": "Attention in machine learning"}', 'name': 'duckduckgo_search'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 76, 'prompt_tokens': 171, 'total_tokens': 247}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_157b3831f5', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-81eaca01-ada3-466e-8745-493aa3570426-0', tool_calls=[{'name': 'duckduckgo_search', 'args': {'query': 'LoRA machine learning'}, 'id': 'call_jw9bg0EbXbQKqk4QcrGxKmlA', 'type': 'tool_call'}, {'name': 'duckduckgo_search', 'args': {'query': 'Tim Dettmers'}, 'id': 'call_txI7KrjFJnp6V22k0qBo3jCZ', 'type': 'tool_call'}, {'name': 'duckduckgo_search', 'args': {'query': 'Attention in machine learning'}, 'id': 'call_jJ3dMLa69eBoEgwk4CXYDFgm', 'type': 'tool_call'}], usage_metadata={'input_tokens': 171, 'output_tokens': 76, 'total_tokens': 247})]
```

```markdown

```




Receiving update from node: 'action'


```markdown
[ToolMessage(content='Let\'s jump on LoRA. Low-Rank Adaptation of LLMs (LoRA) So, in usual fine-tuning, we. Take a pretrained model. Do Transfer Learning over new training data to slightly adjust these pre-trained weights LoRA\'s approach to decomposing ( Δ W ) into a product of lower rank matrices effectively balances the need to adapt large pre-trained models to new tasks while maintaining computational efficiency. The intrinsic rank concept is key to this balance, ensuring that the essence of the model\'s learning capability is preserved with significantly ... "Lora The Tuner" By Daniel Warfield using MidJourney. All images by the author unless otherwise specified. Fine tuning is the process of tailoring a machine learning model to a specific application, which can be vital in achieving consistent and high quality performance. Feb 18, 2024. Share. Ever since the introduction of BERT in 2019, fine-tuning has been the standard approach to adapt large language models (LLMs) to downstream tasks. This changed with the introduction of LoRA (Hu et al 2021) which showed for the first time that the weight update matrix during fine-tuning can be drastically simplified using ... Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning ($\\\\approx$100K prompt-response ...', name='duckduckgo_search', tool_call_id='call_jw9bg0EbXbQKqk4QcrGxKmlA'), ToolMessage(content='Allen School Ph.D. student Tim Dettmers accepted the grand prize for QLoRA, a novel approach to finetuning pretrained models that significantly reduces the amount of GPU memory required — from over 780GB to less than 48GB — to finetune a 65B parameter model. 
With QLoRA, the largest publicly available models can be finetuned on a single ... — Tim Dettmers is joining Ai2 as an AI researcher. Dettmers specializes in efficient deep learning at the intersection of machine learning, NLP, and computer systems with a focus on quantization ... Its purpose is to make cutting-edge research by Tim Dettmers, a leading academic expert on quantization and the use of deep learning hardware accelerators, accessible to the general public. QLoRA: One of the core contributions of bitsandbytes towards the democratization of AI. If you have a curiosity about how fancy graphics cards actually work, and why they are so well-suited to AI-type applications, then take a few minutes to read [Tim Dettmers] explain why this is so.… Tim Dettmers. Video. Tech Moves: AI researcher Yejin Choi leaves Univ. of Washington and Allen Institute for AI. by Todd Bishop & Taylor Soper on August 2, 2024 August 2, 2024 at 11:59 am.', name='duckduckgo_search', tool_call_id='call_txI7KrjFJnp6V22k0qBo3jCZ'), ToolMessage(content='Learn how attention mechanisms in deep learning enable models to focus on relevant information and improve performance in tasks such as machine translation, image captioning, and speech recognition. Understand the steps and components of attention mechanism architecture and see examples of its applications. Attention mechanism is a fundamental invention in artificial intelligence and machine learning, redefining the capabilities of deep learning models. This mechanism, inspired by the human mental process of selective focus, has emerged as a pillar in a variety of applications, accelerating developments in natural language processing, computer vision, and beyond. There are several types of attention mechanisms, each designed to cater to specific use cases. Here are a few notable ones: 1. Self-Attention Mechanism. Self-attention, also known as intra ... 
They all use transformer architecture with attention mechanisms at their core to solve problems across domains. In the Transformer series, we go over the ingredients that have made Transformers a universal recipe for machine learning. First up, we take a visual dive to understand the attention mechanism: Why transformers and attention took over. The concept of "attention" in deep learning has its roots in the effort to improve Recurrent Neural Networks (RNNs) for handling longer sequences or sentences. For instance, consider translating a sentence from one language to another. Translating a sentence word-by-word is usually not an option because it ignores the complex grammatical ...', name='duckduckgo_search', tool_call_id='call_jJ3dMLa69eBoEgwk4CXYDFgm')]
```

```markdown
Let's jump on LoRA. Low-Rank Adaptation of LLMs (LoRA) So, in usual fine-tuning, we. Take a pretrained model. Do Transfer Learning over new training data to slightly adjust these pre-trained weights LoRA's approach to decomposing ( Δ W ) into a product of lower rank matrices effectively balances the need to adapt large pre-trained models to new tasks while maintaining computational efficiency. The intrinsic rank concept is key to this balance, ensuring that the essence of the model's learning capability is preserved with significantly ... "Lora The Tuner" By Daniel Warfield using MidJourney. All images by the author unless otherwise specified. Fine tuning is the process of tailoring a machine learning model to a specific application, which can be vital in achieving consistent and high quality performance. Feb 18, 2024. Share. Ever since the introduction of BERT in 2019, fine-tuning has been the standard approach to adapt large language models (LLMs) to downstream tasks. This changed with the introduction of LoRA (Hu et al 2021) which showed for the first time that the weight update matrix during fine-tuning can be drastically simplified using ... Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning ($\\approx$100K prompt-response ...
```




Receiving update from node: 'agent'


```markdown
[AIMessage(content="### LoRA in Machine Learning\n**LoRA** stands for **Low-Rank Adaptation**. It is a parameter-efficient fine-tuning method for large language models (LLMs). Traditional fine-tuning involves adjusting the pre-trained weights of a model to adapt it to new tasks, which can be computationally expensive. LoRA simplifies this process by decomposing the weight update matrix into a product of lower-rank matrices. This approach significantly reduces the memory and computational requirements while maintaining the model's performance. LoRA is particularly useful for adapting large pre-trained models to new tasks efficiently.\n\n### Tim Dettmers\n**Tim Dettmers** is a researcher specializing in efficient deep learning, particularly at the intersection of machine learning, natural language processing (NLP), and computer systems. He is known for his work on quantization and the use of deep learning hardware accelerators. One of his notable contributions is **QLoRA**, a method that drastically reduces the GPU memory required to fine-tune large models, making it possible to fine-tune a 65 billion parameter model on a single GPU. Tim Dettmers is associated with the Allen School and has joined the Allen Institute for AI (Ai2) as an AI researcher.\n\n### Attention in Machine Learning\n**Attention mechanisms** are a fundamental component in modern deep learning models, particularly in natural language processing (NLP) and computer vision. The concept of attention allows models to focus on relevant parts of the input data, improving performance in tasks such as machine translation, image captioning, and speech recognition. \n\nThere are several types of attention mechanisms:\n1. **Self-Attention**: Also known as intra-attention, it allows a model to consider different positions of a single sequence to compute a representation of the sequence.\n2. **Global Attention**: Considers all parts of the input sequence when generating each part of the output sequence.\n3. 
**Local Attention**: Focuses on a specific part of the input sequence, which is particularly useful for long sequences.\n\nAttention mechanisms are a core component of the Transformer architecture, which has become the foundation for many state-of-the-art models in NLP and beyond. The ability to focus on relevant information and ignore irrelevant details makes attention mechanisms powerful tools for improving model performance across various tasks.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 463, 'prompt_tokens': 1154, 'total_tokens': 1617}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_157b3831f5', 'finish_reason': 'stop', 'logprobs': None}, id='run-3872aa98-382c-49f3-b1af-17e9ab39258e-0', usage_metadata={'input_tokens': 1154, 'output_tokens': 463, 'total_tokens': 1617})]
```

```markdown
### LoRA in Machine Learning
**LoRA** stands for **Low-Rank Adaptation**. It is a parameter-efficient fine-tuning method for large language models (LLMs). Traditional fine-tuning involves adjusting the pre-trained weights of a model to adapt it to new tasks, which can be computationally expensive. LoRA simplifies this process by decomposing the weight update matrix into a product of lower-rank matrices. This approach significantly reduces the memory and computational requirements while maintaining the model's performance. LoRA is particularly useful for adapting large pre-trained models to new tasks efficiently.

### Tim Dettmers
**Tim Dettmers** is a researcher specializing in efficient deep learning, particularly at the intersection of machine learning, natural language processing (NLP), and computer systems. He is known for his work on quantization and the use of deep learning hardware accelerators. One of his notable contributions is **QLoRA**, a method that drastically reduces the GPU memory required to fine-tune large models, making it possible to fine-tune a 65 billion parameter model on a single GPU. Tim Dettmers is associated with the Allen School and has joined the Allen Institute for AI (Ai2) as an AI researcher.

### Attention in Machine Learning
**Attention mechanisms** are a fundamental component in modern deep learning models, particularly in natural language processing (NLP) and computer vision. The concept of attention allows models to focus on relevant parts of the input data, improving performance in tasks such as machine translation, image captioning, and speech recognition. 

There are several types of attention mechanisms:
1. **Self-Attention**: Also known as intra-attention, it allows a model to consider different positions of a single sequence to compute a representation of the sequence.
2. **Global Attention**: Considers all parts of the input sequence when generating each part of the output sequence.
3. **Local Attention**: Focuses on a specific part of the input sequence, which is particularly useful for long sequences.

Attention mechanisms are a core component of the Transformer architecture, which has become the foundation for many state-of-the-art models in NLP and beyond. The ability to focus on relevant information and ignore irrelevant details makes attention mechanisms powerful tools for improving model performance across various tasks.
```






### Task 4: LangGraph for the "Patterns" of GenAI

Let's ask our system about the 4 patterns of Generative AI:

1. Prompt Engineering
2. RAG
3. Fine-tuning
4. Agents

In [92]:
patterns = ["prompt engineering", "RAG", "fine-tuning", "LLM-based agents"]

In [93]:
for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene??"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  print(messages["messages"][-1].content)
  print("\n\n")

Prompt engineering is a concept primarily associated with the field of artificial intelligence, particularly in the context of natural language processing (NLP) and large language models (LLMs) like GPT-3. It involves the design and optimization of prompts (input queries or instructions) to elicit the most accurate, relevant, and useful responses from AI models. The goal is to frame questions or tasks in a way that maximizes the performance of the AI in generating desired outputs.

### Key Aspects of Prompt Engineering:
1. **Prompt Design**: Crafting the input text to guide the AI model towards producing a specific type of response.
2. **Prompt Optimization**: Iteratively refining prompts based on the quality of the responses received.
3. **Contextualization**: Providing sufficient context within the prompt to help the AI understand the task or question.
4. **Task Specification**: Clearly defining the task or question to avoid ambiguity and ensure the AI understands what is being asked