This notebook uses LangChain, Claude 3.5 Sonnet on Amazon Bedrock, and the tool-calling features of chat models to extract structured information from unstructured text, following the LangChain extraction tutorial.

[LangChain Extraction](https://python.langchain.com/docs/tutorials/extraction/#the-schema)

The use case is automated resume screening for recruiters. Recruiters can use a chat model with tool calling to extract structured data (such as name, skills, and work experience) from unstructured resumes. This helps them quickly process and analyze candidate information for Applicant Tracking Systems (ATS).

Let's install the necessary packages and configure the AWS credentials.

In [None]:
# `langchain` is needed as well: `init_chat_model` is imported from it below.
%pip install --upgrade langchain langchain-core langchain-aws

First, we need to describe what information we want to extract from the text.
We'll use Pydantic to define an example schema to extract personal information.

In [5]:
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is `Optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

There are two best practices when defining a schema:

1. Document the attributes and the schema itself. This information is sent to the LLM and is used to improve the quality of information extraction.
2. Do not force the LLM to make up information. Above we used `Optional` for the attributes, allowing the LLM to output `None` if it doesn't know the answer.
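Both practices can be checked locally, without calling a model. The sketch below assumes Pydantic v2 (for `model_json_schema` and `model_validate`) and uses a trimmed copy of the `Person` schema. It shows that the docstring and field descriptions end up in the JSON schema, and that omitted fields fall back to `None`.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )


# Practice 1: the docstring and the field descriptions are part of the JSON schema.
schema = Person.model_json_schema()
print(schema["description"])
print(schema["properties"]["name"]["description"])

# Practice 2: no field is required, so a partial payload validates
# and the missing attribute is simply None.
partial = Person.model_validate({"name": "Alan Smith"})
print(partial.hair_color)
```

If a field were declared without a default, a payload missing it would fail validation, which is the situation that pushes a model toward inventing a value.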

Let's create an information extractor using the schema we defined above.

In [6]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

We need to use a model that supports function/tool calling.

In [7]:
# Ensure your AWS credentials are configured

from langchain.chat_models import init_chat_model

llm = init_chat_model("anthropic.claude-3-5-sonnet-20240620-v1:0", model_provider="bedrock_converse")

In [8]:
structured_llm = llm.with_structured_output(schema=Person)

In [9]:
structured_llm

RunnableBinding(bound=ChatBedrockConverse(client=<botocore.client.BedrockRuntime object at 0x7f03c9d20f10>, model_id='anthropic.claude-3-5-sonnet-20240620-v1:0', aws_access_key_id=SecretStr('**********'), aws_secret_access_key=SecretStr('**********'), aws_session_token=SecretStr('**********'), provider='anthropic', supports_tool_choice_values=('auto', 'any', 'tool')), kwargs={'tools': [{'type': 'function', 'function': {'name': 'Person', 'description': 'Information about a person.', 'parameters': {'properties': {'name': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': 'The name of the person'}, 'hair_color': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': "The color of the person's hair if known"}, 'height_in_meters': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': 'Height measured in meters'}}, 'type': 'object'}}}], 'ls_structured_output_format': {'kwargs': {'method': 'function_calling'

Let's test it out:

In [10]:
text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')
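The model converted 6 feet to meters on its own, and the schema stores the result as a string. As a minimal sketch of how that value could be checked and parsed downstream, the helpers below (`feet_to_meters` and `parse_height` are hypothetical names, not part of LangChain) reproduce the conversion and turn the extracted string into a float.

```python
from typing import Optional

FEET_TO_METERS = 0.3048  # the international foot is defined as exactly 0.3048 m


def feet_to_meters(feet: float) -> float:
    """Convert a length in feet to meters."""
    return feet * FEET_TO_METERS


def parse_height(value: Optional[str]) -> Optional[float]:
    """Parse the extracted height string; return None if missing or not numeric."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


print(round(feet_to_meters(6), 2))  # 6 ft is about 1.83 m, matching the output above
print(parse_height("1.83"))
```

Because the field is typed as a string, the model is free to return text such as a value with units, so treating an unparseable string as missing is one possible design choice.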

In most cases, you should be extracting a list of entities rather than a single entity.

This can be achieved with Pydantic by nesting models inside one another.

In [11]:
from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is `Optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

Let's see what happens!

In [12]:
structured_llm = llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)])
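For the recruiting use case, the nested result usually needs to be flattened before it is loaded into another system. The following is a small sketch assuming Pydantic v2. The `raw` dictionary is a hand-written stand-in for the model output above, so no LLM call is made.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


class Data(BaseModel):
    """Extracted data about people."""

    people: List[Person]


# Hand-written stand-in for the structured output shown above.
raw = {
    "people": [
        {"name": "Jeff", "hair_color": "black", "height_in_meters": "1.83"},
        {"name": "Anna", "hair_color": "black"},
    ]
}

data = Data.model_validate(raw)  # nested dicts are validated into Person objects
rows = [person.model_dump() for person in data.people]  # one plain dict per person
for row in rows:
    print(row)
```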

The behavior of LLM applications can be steered using few-shot prompting. For chat models, this can take the form of a sequence of pairs of input and response messages demonstrating desired behaviors.

For example, we can convey the meaning of a symbol with alternating user and assistant messages:

In [13]:
messages = [
    {"role": "user", "content": "2 🦜 2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "2 🦜 3"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "3 🦜 4"},
]

response = llm.invoke(messages)
print(response.content)

7


The answer is 7 because the parrot emoji between each pair of numbers acts like a + sign: if 2 🦜 2 = 4 and 2 🦜 3 = 5, then 3 🦜 4 must be 7.
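The alternating pattern can also be generated from a list of (input, output) pairs instead of being written by hand. This is a minimal sketch. `few_shot_messages` is a hypothetical helper, and the dictionaries use the same `role` and `content` keys as the cell above.

```python
from typing import Dict, List, Tuple


def few_shot_messages(
    examples: List[Tuple[str, str]], question: str
) -> List[Dict[str, str]]:
    """Build alternating user/assistant messages, ending with the new question."""
    messages: List[Dict[str, str]] = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user message is the one the model is expected to answer.
    messages.append({"role": "user", "content": question})
    return messages


messages = few_shot_messages([("2 🦜 2", "4"), ("2 🦜 3", "5")], "3 🦜 4")
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same shape as the hand-written one, so it can be passed to `llm.invoke` in the same way.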

Structured output often uses tool calling under the hood. This typically involves the generation of AI messages containing tool calls, as well as tool messages containing the results of tool calls. What should a sequence of messages look like in this case?

Different chat model providers impose different requirements for valid message sequences. Some will accept a (repeating) message sequence of the form:


* User message
* AI message with tool call
* Tool message with result

Others require a final AI message containing some sort of response.
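To make the two shapes concrete, here is a provider-neutral sketch that builds one example as plain dictionaries. The key names (`tool_calls`, `tool_call_id`, and so on) and the `example_sequence` helper are illustrative assumptions, not the wire format of any particular provider or of LangChain.

```python
import uuid
from typing import Any, Dict, List, Optional


def example_sequence(
    text: str,
    tool_name: str,
    args: Dict[str, Any],
    final_response: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build one few-shot example: user message, AI tool call, tool result."""
    call_id = str(uuid.uuid4())  # links the tool result back to the tool call
    sequence: List[Dict[str, Any]] = [
        {"role": "user", "content": text},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [{"id": call_id, "name": tool_name, "args": args}],
        },
        {
            "role": "tool",
            "tool_call_id": call_id,
            "content": "You have correctly called this tool.",
        },
    ]
    # Some providers additionally expect a closing AI message.
    if final_response is not None:
        sequence.append({"role": "assistant", "content": final_response})
    return sequence


short = example_sequence("The ocean is vast and blue.", "Data", {"people": []})
full = example_sequence(
    "The ocean is vast and blue.", "Data", {"people": []}, "Detected no people."
)
print([m["role"] for m in short])
print([m["role"] for m in full])
```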

LangChain includes a utility function, `tool_example_to_messages`, that will generate a valid sequence for most model providers. It simplifies the generation of structured few-shot examples by requiring only Pydantic representations of the corresponding tool calls.

Let's try this out. We can convert pairs of input strings and desired Pydantic objects to a sequence of messages that can be provided to a chat model. Under the hood, LangChain will format the tool calls to each provider's required format.

In [17]:
from langchain_core.utils.function_calling import tool_example_to_messages

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]


messages = []

for txt, tool_call in examples:
    if tool_call.people:
        # This final message is optional for some providers
        ai_response = "Detected people."
    else:
        ai_response = "Detected no people."
    messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))

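Inspecting the result, we see that these two examples generated eight messages, four per example: the user message, an AI message with the tool call, a tool message, and the final AI response: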

In [18]:
for message in messages:
    message.pretty_print()


The ocean is vast and blue. It's more than 20,000 feet deep.
Tool Calls:
  Data (fab79d47-b259-476c-bf65-35c12ad1f8a2)
 Call ID: fab79d47-b259-476c-bf65-35c12ad1f8a2
  Args:
    people: []

You have correctly called this tool.

Detected no people.

Fiona traveled far from France to Spain.
Tool Calls:
  Data (25434004-2a14-4e41-aecc-e11ffbbb7561)
 Call ID: 25434004-2a14-4e41-aecc-e11ffbbb7561
  Args:
    people: [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]

You have correctly called this tool.

Detected people.
