Main use cases of OpenAI functions: Tagging and Extraction

OpenAI functions allow us to extract structured data from unstructured text.


Tagging

We have seen that the LLM, given a function description, selects arguments from the input text and generates a structured output that forms a function call.

More generally, the LLM can evaluate the input text and generate structured output.


Text ---> LLM ----> {
           ^             Sentiment: Positive
           |             language: spanish
           |         }
      Structured
      Description


Here we pass in unstructured text along with a structured description. The LLM reasons over the input text and creates a response in the format of the structured description that we passed in.

In the example above, the generated object has a tag for the sentiment of the text and a tag for its language.

So we pass in a piece of text together with a structured description that says, in effect, "extract the sentiment, extract the language", and the LLM reasons over that text and responds with an object that has a sentiment tag and a language tag.
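
To make the idea of a structured description concrete, here is a minimal sketch in plain Python. The names structured_description, conforms and llm_output are illustrative assumptions, and the output is a hand-written stand-in that mirrors the diagram above, not a real model response.

```python
# Illustrative structured description: field name -> expected Python type.
structured_description = {"sentiment": str, "language": str}


def conforms(output: dict, description: dict) -> bool:
    """Return True if output has exactly the described fields with the described types."""
    if set(output) != set(description):
        return False
    return all(isinstance(output[key], expected) for key, expected in description.items())


# Hand-written stand-in for an LLM response, matching the diagram above.
llm_output = {"sentiment": "Positive", "language": "spanish"}

print(conforms(llm_output, structured_description))
print(conforms({"sentiment": "Positive"}, structured_description))
```

The point of the check is only that the response has the same shape as the description; the real description used later in these notes is a JSON schema generated from a pydantic model.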



Extraction 

1. When given an input JSON schema, the LLM has been fine-tuned to find and fill in the parameters of that schema.

2. The capability is not limited to function schemas.

3. This can be used for general purpose extraction.


Text ----> LLM ----> [{ ...
            ^            first_name: lang,
            |            last_name: chain,
            |            language: python
            |          },
            |         ]
        Structured 
        Description


In extraction, we extract specific entities from the text.

These entities are also represented by a structured description.

But rather than using the LLM to reason over the text and respond with a single output in the shape of that description, we use the LLM to look over the text and extract a list of these elements.

For example, we can look over an article and extract the list of papers mentioned in it.
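
The difference from tagging is that the description is wrapped in an array. The sketch below shows one way this could look for the papers example, using hand-written JSON-schema-style dictionaries. The helper as_list_schema and the fields title and author are hypothetical and chosen only for illustration.

```python
def as_list_schema(item_schema: dict, key: str) -> dict:
    """Wrap a single-entity schema in an object that holds an array of that entity."""
    return {
        "type": "object",
        "properties": {key: {"type": "array", "items": item_schema}},
        "required": [key],
    }


# Hypothetical description of one paper mentioned in an article.
paper_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
    },
    "required": ["title"],
}

papers_schema = as_list_schema(paper_schema, "papers")
print(papers_schema["properties"]["papers"]["type"])
```

The same wrapping pattern appears later in these notes, where a list of Person objects is placed inside an Information model.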
        

In [1]:
import os
from dotenv import load_dotenv


In [2]:
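load_dotenv()

# load_dotenv() already copies OPENAI_API_KEY from .env into os.environ;
# fail early with a clear message if it is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")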

In [3]:
from typing import List   # to help us with type hints
from pydantic import BaseModel, Field
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

In [4]:
# Create pydantic model

class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    sentiment: str = Field(description = "Sentiment of text, should be `pos`, `neg` or `neutral`")
    language: str = Field(description = "Language of the text (should be ISO 639-1 code)")

In [5]:
convert_pydantic_to_openai_function(Tagging)


{'name': 'Tagging',
 'description': 'Tag the piece of text with particular info.',
 'parameters': {'properties': {'sentiment': {'description': 'Sentiment of text, should be `pos`, `neg` or `neutral`',
    'type': 'string'},
   'language': {'description': 'Language of the text (should be ISO 639-1 code)',
    'type': 'string'}},
  'required': ['sentiment', 'language'],
  'type': 'object'}}

This gives us the function definition as a JSON-style dictionary, which is what we pass to OpenAI, since the OpenAI API accepts function definitions only in JSON format.
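
To see that this dictionary really is plain JSON, the sketch below serializes a copy of it with the standard json module and reads it back. The dictionary is copied from the output above; the variable names tagging_function, as_json and round_tripped are illustrative.

```python
import json

# Copied from the output of convert_pydantic_to_openai_function(Tagging) above.
tagging_function = {
    "name": "Tagging",
    "description": "Tag the piece of text with particular info.",
    "parameters": {
        "properties": {
            "sentiment": {
                "description": "Sentiment of text, should be `pos`, `neg` or `neutral`",
                "type": "string",
            },
            "language": {
                "description": "Language of the text (should be ISO 639-1 code)",
                "type": "string",
            },
        },
        "required": ["sentiment", "language"],
        "type": "object",
    },
}

# Serialize to JSON text and decode it again to confirm nothing is lost.
as_json = json.dumps(tagging_function, indent=2)
round_tripped = json.loads(as_json)
print(as_json)
```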

In [6]:
# Tagging

from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

In [7]:
model = ChatOpenAI(temperature = 0)


In [8]:
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]

In [11]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}"),
])


In [None]:
model_with_functions = model.bind(
    functions = tagging_functions,
    function_call = {"name":"Tagging"}  # force the model to call Tagging
)

In [13]:
tagging_chain = prompt | model_with_functions

In [17]:
import json

In [19]:
# print(json.dumps(response.model_dump(),indent=2))
# json.dumps(tagging_chain.invoke({"input":"I love langchain"}).model_dump(),indent=2)
tagging_chain.invoke({"input":"I love langchain"})


AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"sentiment":"pos","language":"en"}', 'name': 'Tagging'}}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 101, 'total_tokens': 120, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'function_call', 'logprobs': None}, id='run--d7b134b6-8546-4577-8a23-953e42d4b0cd-0')

Here we can see the function call and the arguments that were passed in: the sentiment is positive and the language is English.

In [20]:
tagging_chain.invoke({"input":"non mi piace questo cibo"})

AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"sentiment":"neg","language":"it"}', 'name': 'Tagging'}}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 104, 'total_tokens': 123, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'function_call', 'logprobs': None}, id='run--0c4506bc-e597-4701-835d-6a03eda437ec-0')

Here we can see the function call and the arguments that were passed in: the sentiment is negative and the language is Italian.

The output format is rather nested, and we know that we are always going to extract the same structure. What we really want is to add an output parser that takes in this AI message and parses out the JSON, because that is the only interesting part.

The fact that content is empty is not interesting to us.

The fact that a function call is made is not interesting to us either, because we are forcing the model to make it.

The fact that it calls the Tagging function is also not interesting, because we forced it to call that function.

What we really want is just the value of arguments, which is a JSON string. It would be convenient to have it parsed into a Python dictionary so that we can use the individual elements.

An output parser in the chain does this for us.
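
As a rough illustration of what such a parser has to do, the sketch below pulls the arguments string out of a dictionary shaped like the additional_kwargs shown above and decodes it with the standard json module. The helper name parse_function_call_arguments is hypothetical, and this is not LangChain's implementation, only the same idea in plain Python.

```python
import json


def parse_function_call_arguments(additional_kwargs: dict) -> dict:
    """Return the function-call arguments as a Python dict.

    Hypothetical helper that mirrors the idea of an output parser.
    """
    function_call = additional_kwargs.get("function_call")
    if function_call is None:
        raise ValueError("message does not contain a function call")
    # The arguments arrive as a JSON-encoded string, so they must be decoded.
    return json.loads(function_call["arguments"])


# Shape copied from the AIMessage output above.
additional_kwargs = {
    "function_call": {
        "arguments": '{"sentiment":"neg","language":"it"}',
        "name": "Tagging",
    }
}

parsed = parse_function_call_arguments(additional_kwargs)
print(parsed["sentiment"], parsed["language"])
```

After decoding, the individual fields are ordinary dictionary entries, which is the form we want at the end of the chain.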


In [22]:
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

In [23]:
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()

In [24]:
tagging_chain.invoke({"input":"non mi piace questo cibo"})

{'sentiment': 'neg', 'language': 'it'}

Now we get only the parsed arguments.

Extraction

Extraction is similar to tagging, but used for extracting multiple pieces of information.

In [25]:
from typing import Optional
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description = "Person's name")
    age: Optional[int] = Field(description = "Person's age")

We want to extract a list of Person objects, so we wrap Person in another model:

In [26]:
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description= "List of info about people")

This Information class is what will be passed to OpenAI as the function.

In [27]:
convert_pydantic_to_openai_function(Information)

{'name': 'Information',
 'description': 'Information to extract.',
 'parameters': {'properties': {'people': {'description': 'List of info about people',
    'items': {'description': 'Information about a person.',
     'properties': {'name': {'description': "Person's name", 'type': 'string'},
      'age': {'anyOf': [{'type': 'integer'}, {'type': 'null'}],
       'description': "Person's age"}},
     'required': ['name', 'age'],
     'type': 'object'},
    'type': 'array'}},
  'required': ['people'],
  'type': 'object'}}

In [29]:
extraction_functions = [convert_pydantic_to_openai_function(Information)]
extraction_model = model.bind(functions= extraction_functions, function_call={"name":"Information"})

In [31]:
extraction_model.invoke("Joe is 30, his mom is martha")

AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"people":[{"name":"Joe","age":30},{"name":"Martha","age":null}]}', 'name': 'Information'}}, response_metadata={'token_usage': {'completion_tokens': 21, 'prompt_tokens': 96, 'total_tokens': 117, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--66c89739-8567-48b3-9b4c-891eb5b22684-0')

Here we get all the parameters for each person. Martha's age is not given in the text, so it comes back as null. We can make the model's handling of missing information more deliberate by adding a prompt that tells it what to do in that case.

In [41]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"),
    ("human", "{input}")      
])

This tells the language model not to make up a value, such as an age of 0, when the age is not given.

In [42]:
extraction_chain = prompt | extraction_model

In [44]:
extraction_chain.invoke({"input":"Joe is 30, his mom is martha"})

AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"people":[{"name":"Joe","age":30},{"name":"Martha","age":null}]}', 'name': 'Information'}}, response_metadata={'token_usage': {'completion_tokens': 21, 'prompt_tokens': 113, 'total_tokens': 134, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--2ff58208-27f5-42a2-88a2-234308d75f76-0')

We can parse this AI message into a plain structure with the same output parser.

In [47]:
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()

In [None]:
extraction_chain.invoke({"input":"Joe is 30, his mom is martha"})


{'people': [{'name': 'Joe', 'age': 30}, {'name': 'Martha', 'age': None}]}
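
Once the result is a plain dictionary, ordinary Python is enough to work with it. The sketch below copies the dictionary shown above into a hypothetical dataclass, ExtractedPerson, and separates people with a known age from those without one. The class and function names are illustrative and not part of LangChain.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractedPerson:
    """Illustrative container for one extracted entry (not a LangChain type)."""
    name: str
    age: Optional[int] = None


def to_people(result: dict) -> List[ExtractedPerson]:
    """Convert the parsed extraction result into a list of ExtractedPerson."""
    return [ExtractedPerson(name=p["name"], age=p.get("age")) for p in result["people"]]


# Parsed output copied from the cell above.
result = {"people": [{"name": "Joe", "age": 30}, {"name": "Martha", "age": None}]}

people = to_people(result)
known_age = [p.name for p in people if p.age is not None]
unknown_age = [p.name for p in people if p.age is None]
print(known_age, unknown_age)
```

Keeping a missing age as None, rather than a made-up number, is what lets downstream code tell "not stated" apart from a real value.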