# Extraction Chain with LangChain and ChatGroq

In this notebook, we build an extraction chain with LangChain and ChatGroq that pulls structured information out of unstructured text.

## The Schema
First, we need to describe what information we want to extract from the text.

We'll use Pydantic to define an example schema to extract personal information.

In [75]:
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

Here are two important tips for creating schemas:

### 1. Add Descriptions for Attributes and the Schema:

Clearly document each attribute and the overall schema. This helps the LLM understand what information to extract, making the results more accurate.

### 2. Avoid Forcing the LLM to Guess: 

Use Optional for attributes, so the LLM can return None if it doesn't have the answer. This prevents it from making up information.
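
To see how these two tips show up in practice, one can inspect the JSON Schema that Pydantic generates from the model. The sketch below assumes Pydantic v2 (where `model_json_schema()` is available) and repeats the `Person` definition from above so that it runs on its own. Exactly how a provider integration forwards this schema to the LLM is not shown here.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


schema = Person.model_json_schema()

# Tip 1: the class docstring becomes the schema-level description,
# and each Field description is attached to its property.
print(schema["description"])
print(schema["properties"]["hair_color"]["description"])

# Tip 2: fields with a default are not listed as required, so an
# instance with nothing extracted is still a valid Person.
print(schema.get("required", []))
print(Person())
```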

## The Extractor
Let's create an information extractor using the schema we defined above.

In [76]:
from typing import Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

We need to use a model that supports function/tool calling.

In [77]:
import getpass
import os

if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API key for Groq: ")

from langchain_groq import ChatGroq

llm = ChatGroq(model="llama3-8b-8192")

In [78]:
structured_llm = llm.with_structured_output(schema=Person)

In [79]:
text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')
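
Note that the model converted the height itself: the text says 6 feet, while the schema asks for meters. Because that arithmetic is done by the LLM, a quick check can be worthwhile. The sketch below is a hypothetical sanity check (the helper name `feet_to_meters` and the tolerance are our own choices), using the definition 1 ft = 0.3048 m.

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot


def feet_to_meters(feet: float, ndigits: int = 2) -> float:
    """Convert feet to meters, rounded to `ndigits` decimal places."""
    return round(feet * FEET_TO_METERS, ndigits)


extracted_height = "1.83"  # the string value shown in the output above
expected = feet_to_meters(6)

# Compare numerically with a small tolerance rather than as strings,
# since the model may format the number differently (e.g. "1.8288").
assert abs(float(extracted_height) - expected) < 0.01
print(expected)
```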

## Multiple Entities
In most cases, you should be extracting a list of entities rather than a single entity.

This can be easily achieved using Pydantic by nesting models inside one another.

In [80]:
from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

In [81]:
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

In [84]:
structured_llm2 = llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
structured_llm2.invoke(prompt)

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)])

In [85]:
messages = [
    {"role": "user", "content": "2 🦜 2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "2 🦜 3"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "3 🦜 4"},
]

response = llm.invoke(messages)
print(response.content)

7


The messages list follows a few-shot pattern:

* The user turns contain expressions that use the symbol 🦜, an operator that is never defined anywhere in the conversation.

* The assistant turns supply the expected results, so the model has to infer the operation from the examples alone.


Taken together, the two worked examples are consistent with 🦜 meaning addition, and the model continues that pattern for the final input.

#### User's Input: "2 🦜 2"
- The example assistant reply is **"4"**, which fits addition (2 + 2).

#### User's Input: "2 🦜 3"
- The example assistant reply is **"5"**, which again fits addition (2 + 3) and rules out multiplication.

#### User's Input: "3 🦜 4"
- The model responds with **"7"**, i.e. 3 + 4.
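
The same few-shot layout can be produced programmatically. The sketch below is a minimal, standard-library-only helper (the name `build_few_shot_messages` is ours, not part of LangChain) that interleaves input/output pairs as user and assistant turns and appends the new query last.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    examples: List[Tuple[str, str]], query: str
) -> List[Dict[str, str]]:
    """Interleave (input, output) pairs as user/assistant turns, then add the query."""
    messages: List[Dict[str, str]] = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user turn is the one the model is expected to answer.
    messages.append({"role": "user", "content": query})
    return messages


few_shot = build_few_shot_messages([("2 🦜 2", "4"), ("2 🦜 3", "5")], "3 🦜 4")
for turn in few_shot:
    print(turn["role"], "->", turn["content"])
```

The resulting list has the same shape as the one passed to `llm.invoke` above, so it could be used in its place.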


LangChain includes a utility function tool_example_to_messages that will generate a valid sequence for most model providers. It simplifies the generation of structured few-shot examples by just requiring Pydantic representations of the corresponding tool calls.

Let's try this out. We can convert pairs of input strings and desired Pydantic objects to a sequence of messages that can be provided to a chat model. Under the hood, LangChain will format the tool calls to each provider's required format.

In [86]:
from langchain_core.utils.function_calling import tool_example_to_messages

examples = [
    (
        "The oceana is vast and blue. It's more than 20,000 feet deep.",
        Data(people=[]),
    ),
    (
        "Fairy traveled far from France to Spain.",
        Data(people=[Person(name="Fairy", height_in_meters=None, hair_color=None)]),
    ),
]


messages = []

for txt, tool_call in examples:
    if tool_call.people:
        # This final message is optional for some providers
        ai_response = "Detected people."
    else:
        ai_response = "Detected no people."
    messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))

In [87]:
for message in messages:
    message.pretty_print()


The oceana is vast and blue. It's more than 20,000 feet deep.
Tool Calls:
  Data (1061fe6d-955b-469b-bd20-2f277987fc32)
 Call ID: 1061fe6d-955b-469b-bd20-2f277987fc32
  Args:
    people: []

You have correctly called this tool.

Detected no people.

Fairy traveled far from France to Spain.
Tool Calls:
  Data (711902ce-b346-4373-aaa2-57985add6ed2)
 Call ID: 711902ce-b346-4373-aaa2-57985add6ed2
  Args:
    people: [{'name': 'Fairy', 'hair_color': None, 'height_in_meters': None}]

You have correctly called this tool.

Detected people.


Let's compare performance with and without these messages. For example, let's pass a message for which we intend no people to be extracted:

In [88]:
message_no_extraction = {
    "role": "user",
    "content": "The solar system is large, but earth has only 1 moon.",
}

structured_llm = llm.with_structured_output(schema=Data)
structured_llm.invoke([message_no_extraction])

Data(people=[Person(name="The Earth's Moon", hair_color=None, height_in_meters=None)])

In this example, the model is liable to erroneously generate records of people.

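Because our few-shot examples include "negatives", they steer the model toward the correct behavior in this case. In the run below the effect is partial: the model no longer invents a name, but it still returns one empty `Person` record instead of an empty list: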

In [89]:
structured_llm.invoke(messages + [message_no_extraction])

Data(people=[Person(name=None, hair_color=None, height_in_meters=None)])
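
The output above still contains a record whose fields are all `None`. One possible way to handle such records is to filter them out after extraction. The sketch below assumes Pydantic v2 (`model_dump()`); the helper `drop_empty_records` and the `SketchPerson` stand-in class are hypothetical names introduced only for this illustration.

```python
from typing import List, Optional

from pydantic import BaseModel


class SketchPerson(BaseModel):
    """Stand-in for the notebook's Person model, kept minimal for the sketch."""

    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


def drop_empty_records(records: List[BaseModel]) -> List[BaseModel]:
    """Keep only records that have at least one non-None field."""
    return [
        record
        for record in records
        if any(value is not None for value in record.model_dump().values())
    ]


people = [SketchPerson(), SketchPerson(name="Jeff", hair_color="black")]
print(drop_empty_records(people))
```

Post-filtering like this complements few-shot examples rather than replacing them: it removes empty records but cannot catch a wrongly filled one such as a "person" named after the Moon.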

## Using reference examples when doing extraction
The quality of extractions can often be improved by providing reference examples to the LLM.

Data extraction attempts to generate structured representations of information found in text and other unstructured or semi-structured formats. Tool-calling LLM features are often used in this context. Here we will see how to build few-shot examples of tool calls to help steer the behavior of extraction and similar applications.

LangChain implements a tool-call attribute on messages from LLMs that include tool calls.

To build reference examples for data extraction, we build a chat history containing a sequence of:

* HumanMessage containing example inputs;
* AIMessage containing example tool calls;
* ToolMessage containing example tool outputs.

LangChain adopts this convention for structuring tool calls into conversation across LLM model providers.

First we build a prompt template that includes a placeholder for these messages:

In [90]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked "
            "to extract, return null for the attribute's value.",
        ),
        # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
        MessagesPlaceholder("examples"),  # <-- EXAMPLES!
        # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
        ("human", "{text}"),
    ]
)

Test out the template:

In [91]:
from langchain_core.messages import (
    HumanMessage,
)

prompt.invoke(
    {"text": "this is a sample text to test template with message placeholder", "examples": [HumanMessage(content="testing 1 2 3")]}
)

ChatPromptValue(messages=[SystemMessage(content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.", additional_kwargs={}, response_metadata={}), HumanMessage(content='testing 1 2 3', additional_kwargs={}, response_metadata={}), HumanMessage(content='this is a sample text to test template with message placeholder', additional_kwargs={}, response_metadata={})])

## Define reference examples
Examples can be defined as a list of input-output pairs.

Each example contains an example input text and an example output showing what should be extracted from the text.

In [92]:
import uuid
from typing import Dict, List, TypedDict

from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    SystemMessage,
    ToolMessage,
)
from pydantic import BaseModel, Field


class Example(TypedDict):
    """A representation of an example consisting of text input and expected tool calls.

    For extraction, the tool calls are represented as instances of pydantic model.
    """

    input: str  # This is the example text
    tool_calls: List[BaseModel]  # Instances of pydantic model that should be extracted


def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    """Convert an example into a list of messages that can be fed into an LLM.

    This code is an adapter that converts our example to a list of messages
    that can be fed into a chat model.

    The list of messages per example corresponds to:

    1) HumanMessage: contains the text from which content should be extracted.
    2) AIMessage: contains the extracted information from the model
    3) ToolMessage: contains confirmation to the model that the model requested a tool correctly.

    The ToolMessage is required because some of the chat models are hyper-optimized for agents
    rather than for an extraction use case.
    """
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    tool_calls = []
    for tool_call in example["tool_calls"]:
        tool_calls.append(
            {
                "id": str(uuid.uuid4()),
                # Pydantic v2: `model_dump()` replaces the deprecated `dict()`.
                "args": tool_call.model_dump(),
                # The name of the function right now corresponds
                # to the name of the pydantic model
                # This is implicit in the API right now,
                # and will be improved over time.
                "name": tool_call.__class__.__name__,
            },
        )
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    tool_outputs = example.get("tool_outputs") or [
        "You have correctly called this tool."
    ] * len(tool_calls)
    for output, tool_call in zip(tool_outputs, tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages

Next let's define our examples and then convert them into message format.

In [93]:
examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]


messages = []

for text, tool_call in examples:
    messages.extend(
        tool_example_to_messages({"input": text, "tool_calls": [tool_call]})
    )



In [94]:
example_prompt = prompt.invoke({"text": "this is some text", "examples": messages})

for message in example_prompt.messages:
    print(f"{message.type}: {message}")

system: content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value." additional_kwargs={} response_metadata={}
human: content="The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it." additional_kwargs={} response_metadata={}
ai: content='' additional_kwargs={} response_metadata={} tool_calls=[{'name': 'Data', 'args': {'people': []}, 'id': 'b11ea080-b17b-4cec-9eb2-d6cc030b6179', 'type': 'tool_call'}]
tool: content='You have correctly called this tool.' tool_call_id='b11ea080-b17b-4cec-9eb2-d6cc030b6179'
human: content='Fiona traveled far from France to Spain.' additional_kwargs={} response_metadata={}
ai: content='' additional_kwargs={} response_metadata={} tool_calls=[{'name': 'Data', 'args': {'people': [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]}, 'id': '5cea15d4-f229-43d7-b96a-6fe9ee8d828d', 

In [95]:
runnable = prompt | llm.with_structured_output(
    schema=Data,
    method="function_calling",
    include_raw=False,
)

## Without examples 😿

In [96]:
for _ in range(5):
    text = "The solar system is large, but earth has only 1 moon."
    print(runnable.invoke({"text": text, "examples": []}))

people=[Person(name=None, hair_color=None, height_in_meters=None)]
people=[Person(name=None, hair_color=None, height_in_meters=None)]
people=[Person(name=None, hair_color=None, height_in_meters=None)]
people=[Person(name=None, hair_color=None, height_in_meters=None)]
people=[Person(name='Earth', hair_color=None, height_in_meters=None)]


## With examples 😻
Reference examples help fix the failure!

In [97]:
for _ in range(5):
    text = "The solar system is large, but earth has only 1 moon."
    print(runnable.invoke({"text": text, "examples": messages}))

people=[]
people=[]
people=[]
people=[]
people=[]
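
Because model output can vary between runs, it can help to summarize repeated runs as a single number instead of reading the printed results by eye. The sketch below is a standard-library-only helper (the name `non_empty_rate` is ours) that reports the fraction of runs returning at least one record. The two stub extractors are hypothetical stand-ins so that the example runs without an API key.

```python
from typing import Callable, Sequence


def non_empty_rate(
    extract: Callable[[str], Sequence[object]], text: str, runs: int = 5
) -> float:
    """Return the fraction of `runs` in which `extract(text)` yields any records."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    hits = sum(1 for _ in range(runs) if len(extract(text)) > 0)
    return hits / runs


# Hypothetical stand-ins for a real extraction call.
def always_empty(text: str) -> list:
    return []


def always_one(text: str) -> list:
    return ["placeholder record"]


sample = "The solar system is large, but earth has only 1 moon."
print(non_empty_rate(always_empty, sample))
print(non_empty_rate(always_one, sample))
```

In this notebook, `extract` could be something like `lambda t: runnable.invoke({"text": t, "examples": messages}).people`. For a text that mentions no people, a lower rate is better.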


In [98]:
runnable.invoke(
    {
        "text": "My name is Harrison. My hair is black.",
        "examples": messages,
    }
)

Data(people=[Person(name='Harrison', hair_color='black', height_in_meters=None)])

## Second Basic Example: a single-entity Pydantic model without few-shot example messages

In [99]:
from typing import Optional

from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from pydantic import BaseModel, Field


# Define the Product schema using Pydantic
class Product(BaseModel):
    """Information about a product."""

    # Doc-string helps the LLM understand the purpose of extraction
    product_name: Optional[str] = Field(
        default=None, description="The name of the product"
    )
    price: Optional[str] = Field(
        default=None, description="The price of the product in USD"
    )
    availability: Optional[str] = Field(
        default=None,
        description="The availability status of the product (e.g., 'In stock', 'Out of stock')",
    )

In [100]:
# Define the prompt template for extraction
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("human", "{text}"),
    ]
)

In [101]:
# Set up the LLM
llm = ChatGroq(model="llama3-8b-8192")  # Replace with your LLM provider and model


In [102]:
# Sample text to extract from
input_text = """
The latest gadget, the SmartWidget 3000, is now available for $299.99. It is currently in stock and ready for delivery.
"""
# Build the structured-output model and run the extraction.
# Any failure (API error or schema validation error) is reported below.
try:
    structured_llm = llm.with_structured_output(schema=Product)
    prompt = prompt_template.invoke({"text": input_text})
    response = structured_llm.invoke(prompt)
    print(response)
except Exception as e:
    print("Extraction failed:", str(e))

product_name='SmartWidget 3000' price='299.99' availability='In stock'
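
Since `price` is declared as `Optional[str]`, the extracted value is a string and may or may not carry a currency symbol. For downstream arithmetic it can be converted to a `Decimal`. The sketch below is one possible normalizer (the function name `parse_price` and its cleaning rules are assumptions for this illustration, covering only plain USD-style strings).

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Turn strings like '299.99' or '$1,299.00' into a Decimal, or None on failure."""
    if raw is None:
        return None
    # Strip whitespace, a leading dollar sign, and thousands separators.
    cleaned = raw.strip().lstrip("$").replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject special values such as NaN or Infinity.
    return value if value.is_finite() else None


print(parse_price("299.99"))
print(parse_price("$1,299.00"))
print(parse_price("n/a"))
```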


## Difference: Classification vs Extraction

### Classification
#### Goal: 
Assign predefined categories or labels to an input based on its content.

#### Key Characteristics:
- **Input**: Typically unstructured text or data.
- **Output**: One or more predefined categories (labels).
- **Prompt**: Focuses on guiding the model to identify and assign a label (e.g., sentiment, topic, spam vs. not spam).
- **Schema**: No complex schema is required; the output is often a simple string or list of labels.

#### Use Case: 
Classifying customer feedback into sentiment categories like "Positive," "Negative," or "Neutral."
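
Although classification does not need a complex schema, a one-field schema can still be used to restrict the output to the allowed labels. The sketch below assumes Pydantic v2 and uses `typing.Literal`; the class name `FeedbackSentiment` is hypothetical, and the labels are the three from the use case above.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class FeedbackSentiment(BaseModel):
    """Sentiment label for a piece of customer feedback."""

    sentiment: Literal["Positive", "Negative", "Neutral"] = Field(
        description="Overall sentiment of the feedback"
    )


# A label from the predefined set is accepted.
ok = FeedbackSentiment(sentiment="Positive")
print(ok)

# Anything outside the set is rejected at validation time.
try:
    FeedbackSentiment(sentiment="Angry")
except ValidationError:
    print("rejected: 'Angry' is not one of the predefined labels")
```

Such a schema could be passed to `with_structured_output` in the same way as the extraction schemas earlier in the notebook.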

---



### Extraction
#### Goal: 
Extract specific structured information or attributes from unstructured data.

#### Key Characteristics:
- **Input**: Typically unstructured text containing detailed information.
- **Output**: A structured schema with extracted attributes, validated using tools like Pydantic.
- **Prompt**: Focuses on instructing the model to extract specific fields or entities.
- **Schema**: Requires a predefined schema (e.g., using Pydantic) to validate and structure the output.

#### Use Case: 
Extracting details like name, email, and phone number from a customer query.
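
For contrast with the classification case, an extraction schema for this use case has one optional field per attribute. The sketch below assumes Pydantic v2; the class name `CustomerContact` and the sample dictionary are hypothetical, and the dictionary is a hand-written stand-in rather than real model output.

```python
from typing import Optional

from pydantic import BaseModel, Field


class CustomerContact(BaseModel):
    """Contact details mentioned in a customer query."""

    name: Optional[str] = Field(default=None, description="The customer's full name")
    email: Optional[str] = Field(
        default=None, description="The customer's email address"
    )
    phone: Optional[str] = Field(
        default=None, description="The customer's phone number"
    )


# Hand-written stand-in for extracted data; the phone number is missing.
raw = {"name": "Ada Example", "email": "ada@example.com"}
contact = CustomerContact.model_validate(raw)

# Missing attributes fall back to None instead of failing validation.
print(contact)
```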
