<a href="https://colab.research.google.com/github/ramahasiba/NLP/blob/LangChain/Build_an_Extraction_Chain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Build an Extraction Chain](https://python.langchain.com/docs/tutorials/extraction/)

In this notebook, I used tool-calling features of chat models to extract structured information from unstructured text.

## Installation

In [3]:
!pip install -q --upgrade langchain-core

In [4]:
!pip install -q python-dotenv

In [5]:
!pip install -qU "langchain[groq]"


## Setup

In [7]:
import os
from pprint import pprint
from dotenv import load_dotenv
import getpass

# load_dotenv returns False instead of raising when it loads nothing
if not load_dotenv('.env'):
  print('No .env file found')

# Set up LangSmith tracing so we can inspect what happens inside the chain or agent
os.environ["LANGSMITH_TRACING"] = "true"
if "LANGSMITH_API_KEY" not in os.environ:
  os.environ["LANGSMITH_API_KEY"] = getpass.getpass(
      prompt = "Enter the Langsmith api key:"
  )

if "LANGSMITH_PROJECT" not in os.environ:
  os.environ["LANGSMITH_PROJECT"] = getpass.getpass(
      prompt = "Enter langsmith project name: "
  )
  if not os.environ.get("LANGSMITH_PROJECT"):
    os.environ["LANGSMITH_PROJECT"] = "default"

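# GROQ_API_KEY (and HF_TOKEN, if set) are already in os.environ after load_dotenv;
# re-assigning os.getenv(...) would raise a TypeError when a key is missing.
if not os.environ.get("GROQ_API_KEY"):
  os.environ["GROQ_API_KEY"] = getpass.getpass(prompt="Enter the Groq API key: ")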

Enter langsmith project name: ··········


## The Schema
The schema describes the information we want to extract from the text.

In [8]:
from typing import Optional
from pydantic import BaseModel, Field

class Person(BaseModel):
  """Information about a person."""
  name: Optional[str] = Field(default=None, description="The name of the person")
  hair_color: Optional[str] = Field(
      default=None, description="The color of the person's hair if known"
  )
  height_in_meters: Optional[str] = Field(
      default=None, description="Height measured in meters"
  )

Declaring each field as `Optional` with a default of `None` lets the model return `None` when it doesn't know the answer. For best results, we should not force the model to return a value when there is no information to extract.
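
A minimal sketch of what the `Optional` defaults buy us, using only Pydantic (v2 is assumed for `model_json_schema`): every field can be omitted, and none of them is marked as required in the JSON schema derived from the model.

```python
from typing import Optional
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

# Every field has a default, so an "empty" person is a valid result.
empty = Person()
print(empty)

# A partially known person is valid as well; unknown fields stay None.
partial = Person(name="Fiona")
print(partial.hair_color)

# No field is listed as required in the generated JSON schema.
schema = Person.model_json_schema()
print(schema.get("required", []))
```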

## The Extractor
Here we create an information extractor using the schema defined above. We define a custom prompt to provide instructions and any additional context. We can also add examples to the prompt template to improve extraction quality.


In [16]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value."
        ),
        ("human", "{text}"),
    ]
)
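
One detail worth checking in a multi-line system prompt: Python joins adjacent string literals with no separator, so each fragment needs its own trailing space. The sketch below uses only the standard library, and the variable names are illustrative.

```python
# Adjacent string literals are concatenated exactly as written.
without_spaces = (
    "You are an expert extraction algorithm."
    "Only extract relevant information from the text."
)
with_spaces = (
    "You are an expert extraction algorithm. "
    "Only extract relevant information from the text."
)

print(without_spaces)  # the two sentences run together
print(with_spaces)     # the sentences are separated by a space
```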

In [10]:
from langchain.chat_models import init_chat_model  # chat models implement the Runnable interface
model_name = "llama3-70b-8192"

llm = init_chat_model(model_name, model_provider="groq")

In [11]:
structured_llm = llm.with_structured_output(schema=Person)

In [12]:
text = "Alan Smith is 6 feet tall and has blond hair"
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)



Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')
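
The input states the height in feet, while the schema asks for meters, so the model performs the unit conversion itself. As a quick sanity check of the value above (a standalone sketch; `feet_to_meters` is a hypothetical helper, not part of LangChain):

```python
FEET_TO_METERS = 0.3048  # exact by definition of the international foot

def feet_to_meters(feet: float) -> float:
    """Convert a length in feet to meters."""
    return feet * FEET_TO_METERS

height = feet_to_meters(6)
print(f"{height:.2f}")  # 1.83
```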

## Multiple Entities
In most cases, we want to extract a list of entities rather than a single one. With `Pydantic`, this is done by nesting models inside one another.

The docstring of the schema is sent to the LLM as the description of the `Person` schema, and the `description` of each `Field` is passed along as well.

Good descriptions can help improve extraction results.

In [13]:
from typing import List, Optional
from pydantic import BaseModel, Field

class Person(BaseModel):
  """Information about a person."""
  name: Optional[str] = Field(default=None, description="The name of the person")
  hair_color: Optional[str] = Field(
      default=None, description="The color of the person's hair if known"
  )
  height_in_meters: Optional[str] = Field(
      default=None, description="Height measured in meters"
  )

class Data(BaseModel):
  """Extracted data about people"""
  # Creates a model so we can extract multiple entities

  people: List[Person]
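
To see where the docstrings and field descriptions end up, we can inspect the JSON schema that Pydantic derives from the classes. This is a sketch on a trimmed copy of the schema, assuming Pydantic v2 (`model_json_schema`). How LangChain wraps this schema into a tool definition depends on the provider and is not shown here.

```python
from typing import List, Optional
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )

class Data(BaseModel):
    """Extracted data about people."""
    people: List[Person]

schema = Data.model_json_schema()

# The class docstring becomes the schema-level description.
print(schema["description"])

# Nested models are placed under "$defs", with their own docstring and field descriptions.
person_schema = schema["$defs"]["Person"]
print(person_schema["description"])
print(person_schema["properties"]["name"]["description"])

# `people` has no default, so it is the only required key at the top level.
print(schema["required"])
```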

In [14]:
structured_llm = llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)])
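
The result is an ordinary Pydantic object, so it can be consumed like any other Python data. A small sketch, reconstructing the output above by hand rather than calling the model. Converting the height string to a float is an extra step added here for illustration, and `model_dump` assumes Pydantic v2.

```python
from typing import List, Optional
from pydantic import BaseModel

class Person(BaseModel):
    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None

class Data(BaseModel):
    people: List[Person]

# Hand-built copy of the result shown above (no model call involved).
result = Data(
    people=[
        Person(name="Jeff", hair_color="black", height_in_meters="1.83"),
        Person(name="Anna", hair_color="black", height_in_meters=None),
    ]
)

for person in result.people:
    # The schema stores height as a string, so convert it only when present.
    height = float(person.height_in_meters) if person.height_in_meters else None
    print(person.name, person.hair_color, height)

# model_dump() turns the nested models into plain dicts, e.g. for JSON serialization.
rows = result.model_dump()["people"]
```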

## Steering the Model with Few-Shot Prompting

In [15]:
messages = [
    {"role": "user", "content": "2 🦜 2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "2 🦜 3"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "3 🦜 4"},
]

response = llm.invoke(messages)
print(response.content)

7
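
The example pairs teach the model that 🦜 stands for addition without ever stating it. The same message list can be built from data; `build_few_shot_messages` below is a hypothetical helper written for this sketch, not a LangChain function.

```python
from typing import Dict, List, Tuple

def build_few_shot_messages(
    examples: List[Tuple[str, str]], query: str
) -> List[Dict[str, str]]:
    """Alternate user/assistant turns for each example, then append the real query."""
    messages: List[Dict[str, str]] = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

few_shot = build_few_shot_messages([("2 🦜 2", "4"), ("2 🦜 3", "5")], "3 🦜 4")
for message in few_shot:
    print(message["role"], "->", message["content"])
```

The resulting list has the same shape as the hand-written `messages` above, so it can be passed to `llm.invoke(...)` in the same way.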


## Converting Tool-Call Examples into Chat Messages

In [18]:
from langchain_core.utils.function_calling import tool_example_to_messages

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep.",
        Data(people=[])
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)])
    )
]

messages = []

for txt, tool_call in examples:
  if tool_call.people:
    # This final message is optional for some providers
    ai_response = "Detected people."
  else:
    ai_response = "Detected no people."
  messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))


In [19]:
for message in messages:
  message.pretty_print()


The ocean is vast and blue. It's more than 20,000 feet deep.
Tool Calls:
  Data (27f27dab-63b6-40ce-afbd-45a3f4e934ff)
 Call ID: 27f27dab-63b6-40ce-afbd-45a3f4e934ff
  Args:
    people: []

You have correctly called this tool.

Detected no people.

Fiona traveled far from France to Spain.
Tool Calls:
  Data (a40db5de-90f5-46fd-970a-6864e9dafbe3)
 Call ID: a40db5de-90f5-46fd-970a-6864e9dafbe3
  Args:
    people: [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]

You have correctly called this tool.

Detected people.
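
As the printed output shows, each example expands into four messages: the human text, an AI message carrying the tool call, a tool message acknowledging the call, and an optional final AI response. The sketch below mimics that layout with plain dictionaries. `example_to_dicts` is a hypothetical stand-in for illustration, not the implementation of `tool_example_to_messages`, and the dictionary keys are assumptions.

```python
import uuid
from typing import Any, Dict, List, Optional

def example_to_dicts(
    text: str, tool_name: str, args: Dict[str, Any], ai_response: Optional[str] = None
) -> List[Dict[str, Any]]:
    """Mimic the human / AI tool call / tool result / AI response layout."""
    call_id = str(uuid.uuid4())  # links the tool result back to the tool call
    messages: List[Dict[str, Any]] = [
        {"role": "human", "content": text},
        {
            "role": "ai",
            "content": "",
            "tool_calls": [{"name": tool_name, "args": args, "id": call_id}],
        },
        {
            "role": "tool",
            "content": "You have correctly called this tool.",
            "tool_call_id": call_id,
        },
    ]
    if ai_response is not None:
        messages.append({"role": "ai", "content": ai_response})
    return messages

msgs = example_to_dicts(
    "Fiona traveled far from France to Spain.",
    "Data",
    {"people": [{"name": "Fiona", "hair_color": None, "height_in_meters": None}]},
    ai_response="Detected people.",
)
for m in msgs:
    # The AI tool-call message has empty content, so show its tool calls instead.
    print(m["role"], "|", m["content"] or m["tool_calls"])
```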


Without the example messages, the model may invent a person even though the text mentions none:

In [22]:
message_no_extraction = {
    "role": "user",
    "content": "The solar system is large, but earth has only 1 moon"
}

structured_llm = llm.with_structured_output(schema=Data)
structured_llm.invoke([message_no_extraction])



Data(people=[Person(name='Alice', hair_color=None, height_in_meters='1.75')])

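With the example messages prepended, the model no longer invents a name, although in this run it still returned a single empty `Person` instead of an empty list:

In [23]: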
structured_llm.invoke(messages+[message_no_extraction])

Data(people=[Person(name=None, hair_color=None, height_in_meters=None)])
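
When an all-`None` entry like the one above slips through, it can be dropped in a post-processing step. A minimal sketch, assuming Pydantic v2 (`model_dump`); `drop_empty_people` is a hypothetical helper, not part of LangChain.

```python
from typing import List, Optional
from pydantic import BaseModel

class Person(BaseModel):
    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None

class Data(BaseModel):
    people: List[Person]

def drop_empty_people(data: Data) -> Data:
    """Keep only people with at least one non-None attribute."""
    kept = [
        person
        for person in data.people
        if any(value is not None for value in person.model_dump().values())
    ]
    return Data(people=kept)

raw = Data(people=[Person(), Person(name="Fiona")])
cleaned = drop_empty_people(raw)
print(cleaned)
```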