## Set up
from: https://python.langchain.com/docs/use_cases/extraction/quickstart/

We will use the structured output method available on LLMs that are capable of function/tool calling.

Select a model, install the dependencies for it and set up API keys!

In [1]:
# pip install langchain

# Install a model capable of tool calling
# pip install langchain-openai
# pip install langchain-mistralai
# pip install langchain-fireworks

# Set env vars for the relevant model or load from a .env file:
# import dotenv
# dotenv.load_dotenv()
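
Before calling a model, it can help to confirm that the API key is actually visible to the Python process. The following is a minimal standard-library sketch; `missing_env_vars` is a hypothetical helper, and `OPENAI_API_KEY` is assumed to be the variable used by the OpenAI integration (other providers use their own names).

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Assumed variable name for the OpenAI integration; adjust for your provider.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```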

from: https://python.langchain.com/docs/use_cases/extraction/quickstart/#the-extractor
## The Schema
First, we need to describe what information we want to extract from the text.

We’ll use Pydantic to define an example schema to extract personal information.

In [52]:
from typing import List, Optional

from langchain_core.pydantic_v1 import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person, not other object")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

There are two best practices when defining a schema:

1. Document the attributes and the schema itself: This information is sent to the LLM and is used to improve the quality of information extraction.
2. Do not force the LLM to make up information! Above we used Optional for the attributes allowing the LLM to output None if it doesn’t know the answer.

> For best performance, document the schema well and make sure the model isn’t forced to return results if there’s no information to be extracted in the text.
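
To see why optional fields matter, here is a small standard-library sketch (an illustration only, not how LangChain parses tool output). A hypothetical `PersonRecord` dataclass with `None` defaults accepts a response that omits or nulls attributes, so nothing has to be invented to fill the gaps.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PersonRecord:
    """Hypothetical stand-in for the Person schema, for illustration only."""

    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


def parse_person(arguments: str) -> PersonRecord:
    """Build a record from a JSON string, ignoring unknown keys."""
    data = json.loads(arguments)
    known = {f.name for f in fields(PersonRecord)}
    return PersonRecord(**{k: v for k, v in data.items() if k in known})


# The response only contains a name; the other attributes stay None.
record = parse_person('{"name": "Fiona", "hair_color": null}')
print(record)
```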

In [3]:
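import os

# Plain assignments would only create Python variables, so set the environment instead.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "langchain-demo"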

## The Extractor
Let’s create an information extractor using the schema we defined above.

In [53]:
from typing import Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

In [54]:
# from langchain_mistralai import ChatMistralAI
# from langsmith import traceable

#llm = ChatMistralAI(model="mistral-large-latest", temperature=0)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

#@traceable(name="get runnable chain used extracting structured output")
#def get_runnable():
#    return prompt | llm.with_structured_output(schema=Person)

runnable = prompt | llm.with_structured_output(schema=Person)

In [32]:
text = "Alan Smith is 6 feet tall and has blond hair."
runnable.invoke({"text": text, "examples": []})

Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')
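
The input gives the height in feet, and the model did the unit conversion itself. As a sanity check on the arithmetic: one foot is exactly 0.3048 m, so 6 ft is 1.8288 m, which rounds to the 1.83 returned above. `feet_to_meters` is a hypothetical helper, not part of LangChain.

```python
METERS_PER_FOOT = 0.3048  # exact, by definition of the international foot


def feet_to_meters(feet: float, ndigits: int = 2) -> float:
    """Convert feet to meters, rounded to `ndigits` decimal places."""
    return round(feet * METERS_PER_FOOT, ndigits)


print(feet_to_meters(6))  # 1.83
```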

In [33]:
text = "The solar system is large, but earth has only 1 moon."
print(runnable.invoke({"text": text, "examples": []}))

name='earth' hair_color=None height_in_meters=None


## Extracting multiple Person entities

In [56]:
runnable = prompt | llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
runnable.invoke({"text": text, "examples": []})

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)])
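
Because the schema declares `height_in_meters` as a string, downstream code may want numbers instead. A minimal sketch, assuming the result has first been turned into plain dictionaries (for example with the Pydantic v1 `.dict()` method); `heights_by_name` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def heights_by_name(people: List[dict]) -> Dict[str, Optional[float]]:
    """Map each named person to a float height, or None if missing or unparsable."""
    result: Dict[str, Optional[float]] = {}
    for person in people:
        name = person.get("name")
        if name is None:
            continue  # skip entries where no name was extracted
        raw = person.get("height_in_meters")
        try:
            result[name] = float(raw) if raw is not None else None
        except ValueError:
            result[name] = None  # the value was not a plain number
    return result


people = [
    {"name": "Jeff", "hair_color": "black", "height_in_meters": "1.83"},
    {"name": "Anna", "hair_color": "black", "height_in_meters": None},
]
print(heights_by_name(people))
```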

## Other experiments

In [39]:
prompt.invoke({"text": text, "examples": []}).to_messages()

[SystemMessage(content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value."), HumanMessage(content='My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me.')]

In [40]:
prompt

ChatPromptTemplate(input_variables=['examples', 'text'], input_types={'examples': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.")), MessagesPlaceholder(variable_name='examples'), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['text'], template='{text}'))])

## Define reference examples
Examples can be defined as a list of input-output pairs.

Each example contains an example input text and an example output showing what should be extracted from the text.

> INFO
>
> This is a bit in the weeds, so feel free to ignore if you don’t get it!
>
> The format of the example needs to match the API used (e.g., tool calling or JSON mode etc.).
>
> Here, the formatted examples will match the format expected for the tool calling API since that’s what we’re using.
>

In [21]:
import uuid
from typing import Dict, List, TypedDict

from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    SystemMessage,
    ToolMessage,
)
from langchain_core.pydantic_v1 import BaseModel, Field


class Example(TypedDict):
    """A representation of an example consisting of text input and expected tool calls.

    For extraction, the tool calls are represented as instances of pydantic model.
    """

    input: str  # This is the example text
    tool_calls: List[BaseModel]  # Instances of pydantic model that should be extracted


def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    """Convert an example into a list of messages that can be fed into an LLM.

    This code is an adapter that converts our example to a list of messages
    that can be fed into a chat model.

    The list of messages per example corresponds to:

    1) HumanMessage: contains the content from which content should be extracted.
    2) AIMessage: contains the extracted information from the model
    3) ToolMessage: contains confirmation to the model that the model requested a tool correctly.

    The ToolMessage is required because some of the chat models are hyper-optimized for agents
    rather than for an extraction use case.
    """
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    openai_tool_calls = []
    for tool_call in example["tool_calls"]:
        openai_tool_calls.append(
            {
                "id": str(uuid.uuid4()),
                "type": "function",
                "function": {
                    # The name of the function right now corresponds
                    # to the name of the pydantic model
                    # This is implicit in the API right now,
                    # and will be improved over time.
                    "name": tool_call.__class__.__name__,
                    "arguments": tool_call.json(),
                },
            }
        )
    messages.append(
        AIMessage(content="", additional_kwargs={"tool_calls": openai_tool_calls})
    )
    tool_outputs = example.get("tool_outputs") or [
        "You have correctly called this tool."
    ] * len(openai_tool_calls)
    for output, tool_call in zip(tool_outputs, openai_tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages

## Using the examples

In [47]:
examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
        Person(name=None, height_in_meters=None, hair_color=None),
    ),
    (
        "The basketball court has two basketball hoops",
        Person(name=None, height_in_meters=None, hair_color=None),
    ),
    (
        "The basketball court has two basketball hoops",
        Person(name=None, height_in_meters=None, hair_color=None),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Person(name="Fiona", height_in_meters=None, hair_color=None),
    ),
]


messages = []

for text, tool_call in examples:
    messages.extend(
        tool_example_to_messages({"input": text, "tool_calls": [tool_call]})
    )
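
Each example should expand into one human message, one AI message carrying the tool calls, and one tool message per call. The sketch below checks that every requested tool call id has a matching reply. It is duck-typed on the attribute names `additional_kwargs` and `tool_call_id`, which is an assumption about the message objects; `unanswered_tool_call_ids` is a hypothetical helper, and the demo uses stand-in objects so it runs without LangChain.

```python
from types import SimpleNamespace
from typing import Iterable, List


def unanswered_tool_call_ids(messages: Iterable[object]) -> List[str]:
    """Return ids of requested tool calls that have no matching tool reply."""
    requested: List[str] = []
    answered = set()
    for message in messages:
        kwargs = getattr(message, "additional_kwargs", None) or {}
        for call in kwargs.get("tool_calls", []):
            requested.append(call["id"])
        reply_id = getattr(message, "tool_call_id", None)
        if reply_id is not None:
            answered.add(reply_id)
    return [call_id for call_id in requested if call_id not in answered]


# Stand-in objects with the same attribute names as the real messages.
demo = [
    SimpleNamespace(content="some text", additional_kwargs={}),
    SimpleNamespace(content="", additional_kwargs={"tool_calls": [{"id": "call-1"}]}),
    SimpleNamespace(content="You have correctly called this tool.", tool_call_id="call-1"),
]
print(unanswered_tool_call_ids(demo))  # []
```

On the list built above, `unanswered_tool_call_ids(messages)` would be expected to return an empty list, and `len(messages)` should be three per example, since each example carries a single tool call.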

In [48]:
prompt.invoke({"text": "this is some text", "examples": messages})

ChatPromptValue(messages=[SystemMessage(content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value."), HumanMessage(content="The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it."), AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'a6b586aa-8e46-4d09-ae79-df6840ebe177', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{"name": null, "hair_color": null, "height_in_meters": null}'}}]}, tool_calls=[{'name': 'Person', 'args': {'name': None, 'hair_color': None, 'height_in_meters': None}, 'id': 'a6b586aa-8e46-4d09-ae79-df6840ebe177'}]), ToolMessage(content='You have correctly called this tool.', tool_call_id='a6b586aa-8e46-4d09-ae79-df6840ebe177'), HumanMessage(content='The basketball court has two basketball hoops'), AIMessage(content='', additional_kwargs={'tool_calls': [{'id': '6dadf131-5

In [49]:
print(messages)

[HumanMessage(content="The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it."), AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'a6b586aa-8e46-4d09-ae79-df6840ebe177', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{"name": null, "hair_color": null, "height_in_meters": null}'}}]}, tool_calls=[{'name': 'Person', 'args': {'name': None, 'hair_color': None, 'height_in_meters': None}, 'id': 'a6b586aa-8e46-4d09-ae79-df6840ebe177'}]), ToolMessage(content='You have correctly called this tool.', tool_call_id='a6b586aa-8e46-4d09-ae79-df6840ebe177'), HumanMessage(content='The basketball court has two basketball hoops'), AIMessage(content='', additional_kwargs={'tool_calls': [{'id': '6dadf131-5d33-46ce-847d-2f62bc24d718', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{"name": null, "hair_color": null, "height_in_meters": null}'}}]}, tool_calls=[{'name': 'Person', 'args': {'name': None, 'hair_color': None, 'height

In [55]:
for _ in range(5):
    text = "The solar system is large, but earth has only 1 moon."
    print(runnable.invoke({"text": text, "examples": messages}))

name=None hair_color=None height_in_meters=None
name=None hair_color=None height_in_meters=None
name=None hair_color=None height_in_meters=None
name=None hair_color=None height_in_meters=None
name=None hair_color=None height_in_meters=None
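
Repeating the call is a quick way to see whether the result is stable. Rather than reading printed lines, the outcomes can be counted. This is a minimal sketch in which `tally_names` and `fake_extract` are hypothetical; the stand-in extractor replaces the real chain so the code runs offline.

```python
from collections import Counter
from typing import Callable, Optional


def tally_names(
    extract: Callable[[str], Optional[str]], text: str, runs: int = 5
) -> Counter:
    """Call `extract` repeatedly on the same text and count each returned name."""
    return Counter(extract(text) for _ in range(runs))


# Stand-in extractor so the sketch runs offline; it never finds a name.
def fake_extract(text: str) -> Optional[str]:
    return None


counts = tally_names(
    fake_extract, "The solar system is large, but earth has only 1 moon."
)
print(counts)
```

With the real chain, one could pass something like `lambda t: runnable.invoke({"text": t, "examples": messages}).name` as `extract`, assuming the chain returns a `Person` as it does above.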


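> Observation: even with the examples added, the model still extracted Earth as a name, although hair_color and height_in_meters were left empty.
> Adding "not other object" to the description of the `name` field in the Person schema excluded Earth.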