# Extraction Chain

Topics covered:

* tool calling
* few-shot prompting
* chaining


In [36]:
from typing import Optional
from pydantic import BaseModel, Field, EmailStr

# Schema

This is the schema for the **structured output** we want the LLM to return.

`description`: the text given in `Field(description=...)` is used by the LLM.

A good description can help improve extraction results.

## pydantic

Pydantic is a robust way of defining data structures.

It offers validation and JSON/dict serialization.
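
To make the validation and serialization claims concrete, here is a minimal sketch, assuming Pydantic v2 (`model_dump`, `model_dump_json`, and `model_validate_json` are v2 method names). The `Measurement` model and its fields are hypothetical and exist only for this illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Measurement(BaseModel):
    """Hypothetical model, used only to illustrate validation and serialization."""

    label: str = Field(description="What was measured")
    value: Optional[float] = Field(default=None, description="Measured value")


# Validation: in Pydantic's default (lax) mode a numeric string is coerced to float.
m = Measurement(label="height", value="1.83")

# Serialization to a plain dict and to a JSON string.
as_dict = m.model_dump()
as_json = m.model_dump_json()

# The JSON string can be parsed back into an equal model instance.
round_trip = Measurement.model_validate_json(as_json)

# A value that cannot be interpreted as a float is rejected.
try:
    Measurement(label="height", value="tall")
    validation_failed = False
except ValidationError:
    validation_failed = True

print(as_dict)
print(as_json)
print(validation_failed)
```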

## Best practices for schema

1. Document attributes and schema: the `description` of each attribute is used directly by the LLM, so a clear description can help improve the quality of the outputs.
2. Prevent hallucination: keep the outputs optional so the LLM is not pushed to make up information in case it is absent. (Use `Optional` with a default of `None`.)

In [37]:
class Person(BaseModel):
    name: Optional[str] = Field(default=None, description="Name of the person")
    hair_color: Optional[str] = Field(default=None, description="Color of the person's hair")
    height_in_meters: Optional[float] = Field(default=None, description="Height measured in meters")
    email: Optional[EmailStr] = Field(default=None, description="Email address of the person")
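
The two best practices can be checked without calling an LLM. The sketch below assumes Pydantic v2 and uses `PersonSketch`, a hypothetical two-field copy of the schema (the `EmailStr` field is left out so the example does not need the extra `email-validator` package). It shows that omitted fields fall back to `None`, and that each `description` appears in the generated JSON schema, which is presumably the form in which the schema reaches the model.

```python
from typing import Optional

from pydantic import BaseModel, Field


class PersonSketch(BaseModel):
    """Hypothetical trimmed-down copy of the Person schema, for illustration."""

    name: Optional[str] = Field(default=None, description="Name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="Color of the person's hair"
    )


# Fields that are not supplied default to None instead of raising an error.
partial = PersonSketch(name="Anna")
empty = PersonSketch()

# The descriptions are carried into the JSON schema generated from the model.
schema = PersonSketch.model_json_schema()
name_description = schema["properties"]["name"]["description"]

print(partial)
print(empty.model_dump())
print(name_description)
```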

# Extractor

The extractor is a prompt template that contains the appropriate context and instructions.

Use it to optionally extract metadata about the document.

In [38]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

In [39]:
prompt_template = ChatPromptTemplate.from_messages(
    messages=[
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("user", "{text}"),
    ]
)

In [40]:
from langchain_ollama import ChatOllama

In [41]:
llm = ChatOllama(
    temperature=0,
    model="llama3.2"
).with_structured_output(schema=Person)

In [42]:
text = "I am Ruchir Attri. You can recognize me from my black hair and 6 feet height. Feel free to reach me out on notmy@email.com"

prompt = prompt_template.invoke(
    input={"text": text}
)

In [43]:
response = llm.invoke(
    input=prompt
)

response

Person(name='Ruchir Attri', hair_color='black', height_in_meters=1.83, email='notmy@email.com')

# Multiple entities

Use a nested schema to extract multiple occurrences of a `Person` from the same text.

In [44]:
from typing import List

In [45]:
class Data(BaseModel):
    people: List[Person]

In [46]:
llm = ChatOllama(
    temperature=0,
    model="llama3.2"
).with_structured_output(
    schema=Data
)

In [47]:
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
llm.invoke(prompt)

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters=None, email=None), Person(name='Anna', hair_color='black', height_in_meters=None, email=None)])

## Few-shot prompting

Few-shot prompting is a fancy way of saying: give the LLM a few example inputs and outputs in your prompt.

Few-shot prompting can improve output quality.

**Q.** What does it look like?

**A.** It is a sequence of message pairs: an example input (`user`) followed by the expected response (`assistant`).
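
The pairing described above can be assembled programmatically. This is a minimal sketch in plain Python; `build_few_shot_messages` is a hypothetical helper, not part of LangChain, and it only produces the list of role/content dictionaries that a chat model accepts as input.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    examples: List[Tuple[str, str]], query: str
) -> List[Dict[str, str]]:
    """Turn (input, expected output) pairs plus a final query into chat messages."""
    messages: List[Dict[str, str]] = []
    for user_input, expected_output in examples:
        # Each example becomes a user turn followed by the assistant's ideal reply.
        messages.append({"role": "user", "content": user_input})
        messages.append({"role": "assistant", "content": expected_output})
    # The real question goes last, so the model continues the demonstrated pattern.
    messages.append({"role": "user", "content": query})
    return messages


examples = [("2 $ 2", "4"), ("2 $ 3", "5")]
messages = build_few_shot_messages(examples, "3 $ 4")

for message in messages:
    print(message)
```

Keeping the examples as data separate from the final query makes it easy to add, remove, or swap demonstrations without rewriting the message list by hand.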

In [49]:
llm = ChatOllama(
    temperature=0,
    model="llama3.2"
)

In [51]:
messages = [
    {"role": "user", "content": "2 $ 2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "2 $ 3"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "3 $ 4"},
]

llm.invoke(messages)

AIMessage(content='7', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-03-18T05:41:40.240551645Z', 'done': True, 'done_reason': 'stop', 'total_duration': 292844111, 'load_duration': 22285193, 'prompt_eval_count': 59, 'prompt_eval_duration': 46000000, 'eval_count': 2, 'eval_duration': 221000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-de14ef04-42be-4a7b-bfb7-a70f991673cb-0', usage_metadata={'input_tokens': 59, 'output_tokens': 2, 'total_tokens': 61})