# Tagging and Extraction Using OpenAI functions

Tagging and extraction using OpenAI functions refer to the automatic identification and retrieval of relevant information from text. 

**Tagging** involves assigning labels or categories to specific pieces of data, making it easier to classify and organize information. For example, in a customer support chat, messages might be tagged with categories like "billing issue" or "technical support."

**Extraction** focuses on pulling specific entities or details, such as dates, names, or key phrases, from the text. For instance, extracting a user's email or product name from a conversation. 

Using OpenAI models, developers can build functions that understand natural language and automatically tag or extract data, simplifying data organization, content analysis, or knowledge retrieval without manual input. This process is valuable in automating tasks like summarization, content classification, or data-driven decision-making.

In [1]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
from typing import List
from pydantic import BaseModel, Field
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

In [4]:
class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    sentiment: str = Field(description="sentiment of text, should be `pos`, `neg`, or `neutral`")
    language: str = Field(description="language of text (should be ISO 639-1 code)")
        
convert_pydantic_to_openai_function(Tagging)

{'name': 'Tagging',
 'description': 'Tag the piece of text with particular info.',
 'parameters': {'title': 'Tagging',
  'description': 'Tag the piece of text with particular info.',
  'type': 'object',
  'properties': {'sentiment': {'title': 'Sentiment',
    'description': 'sentiment of text, should be `pos`, `neg`, or `neutral`',
    'type': 'string'},
   'language': {'title': 'Language',
    'description': 'language of text (should be ISO 639-1 code)',
    'type': 'string'}},
  'required': ['sentiment', 'language']}}

In [6]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

model = ChatOpenAI(temperature=0)

tagging_functions = [convert_pydantic_to_openai_function(Tagging)]
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])
model_with_functions = model.bind(
    functions=tagging_functions,
    function_call={"name": "Tagging"}  # force the model to always call the Tagging function
)

tagging_chain = prompt | model_with_functions

In [7]:
tagging_chain.invoke({"input": "I love langchain"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{"sentiment":"pos","language":"en"}'}})

In [8]:
tagging_chain.invoke({"input": "non mi piace questo cibo"})  # test with Italian input

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{"sentiment":"neg","language":"it"}'}})
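
Note that `arguments` in the output above is a JSON string, not a dictionary. As a minimal sketch of what reading it by hand involves, the snippet below uses a plain dict as a stand-in for the `AIMessage` (the stand-in and the helper name `parse_function_arguments` are assumptions for illustration, not part of LangChain):

```python
import json

# Plain-dict stand-in for the AIMessage shown above (illustrative only).
message = {
    "content": "",
    "additional_kwargs": {
        "function_call": {
            "name": "Tagging",
            "arguments": '{"sentiment":"neg","language":"it"}',
        }
    },
}

def parse_function_arguments(msg: dict) -> dict:
    """Decode the JSON string stored under function_call['arguments']."""
    function_call = msg["additional_kwargs"]["function_call"]
    return json.loads(function_call["arguments"])

print(parse_function_arguments(message))  # {'sentiment': 'neg', 'language': 'it'}
```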

### Parse the AIMessage into JSON

A plain dictionary is usually easier for downstream code to consume than an `AIMessage`, so we add an output parser to the chain.

In [10]:
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()
tagging_chain.invoke({"input": "non mi piace questo cibo"}) 

{'sentiment': 'neg', 'language': 'it'}
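
The function schema declares `sentiment` and `language` as plain strings, so nothing in the schema itself forces the model to stay within `pos`, `neg`, `neutral` or to return a two-letter code. A minimal sketch of a downstream check on the parsed dictionary could look like this (the helper `validate_tags` is hypothetical, and it checks only the shape of the language code, not whether the code is actually assigned):

```python
ALLOWED_SENTIMENTS = {"pos", "neg", "neutral"}

def validate_tags(tags: dict) -> dict:
    """Return tags unchanged if both fields look sane, otherwise raise ValueError."""
    sentiment = tags.get("sentiment")
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")
    language = tags.get("language")
    # ISO 639-1 codes are two lowercase ASCII letters; this is a shape check only.
    looks_like_code = (
        isinstance(language, str)
        and len(language) == 2
        and language.isascii()
        and language.isalpha()
        and language.islower()
    )
    if not looks_like_code:
        raise ValueError(f"unexpected language code: {language!r}")
    return tags

print(validate_tags({"sentiment": "neg", "language": "it"}))
```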

## Extraction

Extraction is similar to tagging, but it pulls out multiple pieces of information (here, a list of people) rather than a fixed set of labels.

In [11]:
from typing import Optional
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description="person's name")
    age: Optional[int] = Field(description="person's age") # Optional

In [12]:
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description="List of info about people")

In [13]:
convert_pydantic_to_openai_function(Information)

{'name': 'Information',
 'description': 'Information to extract.',
 'parameters': {'title': 'Information',
  'description': 'Information to extract.',
  'type': 'object',
  'properties': {'people': {'title': 'People',
    'description': 'List of info about people',
    'type': 'array',
    'items': {'title': 'Person',
     'description': 'Information about a person.',
     'type': 'object',
     'properties': {'name': {'title': 'Name',
       'description': "person's name",
       'type': 'string'},
      'age': {'title': 'Age',
       'description': "person's age",
       'type': 'integer'}},
     'required': ['name']}}},
  'required': ['people']}}

In [23]:
extraction_functions = [convert_pydantic_to_openai_function(Information)]
extraction_model = model.bind(functions=extraction_functions, function_call={"name": "Information"})

extraction_model.invoke("Joe is 30, his mom is Martha")

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{"people":[{"name":"Joe","age":30},{"name":"Martha"}]}'}})

In [17]:
# Use a prompt to control how the LLM behaves
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"),
    ("human", "{input}")
])

In [18]:
extraction_chain = prompt | extraction_model
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{"people":[{"name":"Joe","age":30},{"name":"Martha"}]}'}})

In [19]:
# Use Parser
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

{'people': [{'name': 'Joe', 'age': 30}, {'name': 'Martha'}]}
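
Because `age` is optional, the key is simply absent for Martha in the parsed output. If downstream code prefers typed records over raw dictionaries, one possible approach is sketched below with standard-library dataclasses (`PersonRecord` and `to_records` are hypothetical names, not part of the chain above):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PersonRecord:
    name: str
    age: Optional[int] = None  # stays None when the text did not state an age

def to_records(parsed: dict) -> List[PersonRecord]:
    """Convert the parser output into typed records, tolerating a missing 'age' key."""
    return [
        PersonRecord(name=person["name"], age=person.get("age"))
        for person in parsed.get("people", [])
    ]

records = to_records({"people": [{"name": "Joe", "age": 30}, {"name": "Martha"}]})
print(records)
```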

In [20]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

[{'name': 'Joe', 'age': 30}, {'name': 'Martha'}]

## Doing it for real

We can apply tagging to a larger body of text.

For example, let's load a blog post and tag a subset of its text.

In [25]:
# Mentioned in the course - LangChain for LLM Application Development
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
documents = loader.load()

In [27]:
doc = documents[0]
page_content = doc.page_content[:10000]
print(page_content[:1000])

LLM Powered Autonomous Agents | Lil'Log

Lil'Log

Posts
Archive
Search
Tags
FAQ
emojisearch.app

LLM Powered Autonomous Agents
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

Table of Contents

Agent System Overview
Component One: Planning
Task Decomposition
Self-Reflection
Component Two: Memory
Types of Memory
Maximum Inner Product Search (MIPS)
Component Three: Tool Use
Case Studies
Scientific Discovery Agent
Generative Agents Simulation
Proof-of-Concept Examples
Challenges
Citation
References

Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general

### Summary

In [28]:
class Overview(BaseModel):
    """Overview of a section of text."""
    summary: str = Field(description="Provide a concise summary of the content.")
    language: str = Field(description="Provide the language that the content is written in.")
    keywords: str = Field(description="Provide keywords related to the content.")

In [29]:
overview_tagging_function = [
    convert_pydantic_to_openai_function(Overview)
]
tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}
)
tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()

In [30]:
tagging_chain.invoke({"input": page_content})

{'summary': 'The text discusses building autonomous agents powered by LLM (large language model) as the core controller. It covers components like planning, memory, and tool use, along with techniques such as task decomposition and self-reflection.',
 'language': 'English',
 'keywords': 'LLM, autonomous agents, planning, memory, tool use, task decomposition, self-reflection'}
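
Since `keywords` is declared as a single string, the model returns one comma-separated value. If a list is more convenient downstream, a small post-processing step is one option (a sketch; `split_keywords` is a hypothetical helper and assumes commas are the separator, as in the output above):

```python
def split_keywords(keywords: str) -> list:
    """Split a comma-separated keyword string and drop empty entries."""
    return [part.strip() for part in keywords.split(",") if part.strip()]

raw = "LLM, autonomous agents, planning, memory, tool use, task decomposition, self-reflection"
print(split_keywords(raw))
```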

### Extraction

In [31]:
class Paper(BaseModel):
    """Information about papers mentioned."""
    title: str
    author: Optional[str]


class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper]

In [38]:
paper_extraction_function = [
    convert_pydantic_to_openai_function(Info)
]
extraction_model = model.bind(
    functions=paper_extraction_function, 
    function_call={"name":"Info"}
)
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")
extraction_chain.invoke({"input": page_content})

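# With the earlier, generic extraction prompt, the result was:
#   [{'title': 'LLM Powered Autonomous Agents', 'author': 'Lilian Weng'}]
# Lilian Weng is the author of the blog post, not of the papers it mentions,
# and the title is the blog post's title, not a paper title.
# Next, we'll write a more specific prompt to improve this.
# (The output captured below already lists papers, so it appears to come from a
# re-run after that improved prompt had been defined.)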

[{'title': 'Chain of thought (CoT; Wei et al. 2022)'},
 {'title': 'Tree of Thoughts (Yao et al. 2023)'},
 {'title': 'LLM+P (Liu et al. 2023)'},
 {'title': 'ReAct (Yao et al. 2023)'},
 {'title': 'Reflexion (Shinn & Labash 2023)'},
 {'title': 'Chain of Hindsight (CoH; Liu et al. 2023)'},
 {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)'}]

In [39]:
# Create a prompt

template = """An article will be passed to you. Extract from it all papers that are mentioned by this article.

Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.

Do not make up or guess ANY extra information. Only extract what exactly is in the text."""

prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", "{input}")
])

extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")

extraction_chain.invoke({"input": page_content})

[{'title': 'Chain of thought (CoT; Wei et al. 2022)'},
 {'title': 'Tree of Thoughts (Yao et al. 2023)'},
 {'title': 'LLM+P (Liu et al. 2023)'},
 {'title': 'ReAct (Yao et al. 2023)'},
 {'title': 'Reflexion (Shinn & Labash 2023)'},
 {'title': 'Chain of Hindsight (CoH; Liu et al. 2023)'},
 {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)'}]

In [49]:
# A random message to see what it will return
extraction_chain.invoke({"input": "This is a test"}) 

[]

### Split the data into small chunks

In [51]:
# Mentioned in the course - LangChain for LLM Application Development
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)
splits = text_splitter.split_text(doc.page_content)
len(splits)

15
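
To make the idea of chunking concrete, here is a deliberately naive sketch that cuts text into fixed-size pieces with no overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `split_fixed` and the chunk size used here are assumptions for illustration only:

```python
def split_fixed(text: str, chunk_size: int) -> list:
    """Cut text into consecutive chunks of at most chunk_size characters, no overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[start:start + chunk_size] for start in range(0, len(text), chunk_size)]

chunks = split_fixed("abcdefghij", 4)
print(chunks)       # ['abcd', 'efgh', 'ij']
print(len(chunks))  # 3
```

With zero overlap, joining the chunks reproduces the original text exactly, which is convenient for extraction because no sentence is processed twice.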

In [52]:
# Useful because we'll extract a list of items from each split, then merge all the lists together
def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list

In [53]:
flatten([[1, 2], [3, 4]])

[1, 2, 3, 4]
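
An equivalent way to write the same helper uses `itertools.chain.from_iterable` from the standard library. This is only an alternative sketch (the name `flatten_iter` is ours); the loop version above works just as well:

```python
from itertools import chain

def flatten_iter(matrix):
    """Flatten one level of nesting: a list of lists becomes a single list."""
    return list(chain.from_iterable(matrix))

print(flatten_iter([[1, 2], [3, 4]]))  # [1, 2, 3, 4]
```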

In [54]:
print(splits[0])

LLM Powered Autonomous Agents | Lil'Log

Lil'Log

Posts
Archive
Search
Tags
FAQ
emojisearch.app

LLM Powered Autonomous Agents
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

Table of Contents

Agent System Overview
Component One: Planning
Task Decomposition
Self-Reflection
Component Two: Memory
Types of Memory
Maximum Inner Product Search (MIPS)
Component Three: Tool Use
Case Studies
Scientific Discovery Agent
Generative Agents Simulation
Proof-of-Concept Examples
Challenges
Citation
References

Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general probl

In [55]:
# Split the text and wrap each chunk as {"input": chunk} for the chain
from langchain.schema.runnable import RunnableLambda
prep = RunnableLambda(
    lambda x: [{"input": chunk} for chunk in text_splitter.split_text(x)]
)
prep.invoke("hi")

[{'input': 'hi'}]
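
The overall pattern is split, then map an extractor over every chunk, then flatten the per-chunk lists. The sketch below shows that flow in plain Python with no LLM involved. The paragraph splitter and the regex-based `stub_extract` are hypothetical stand-ins for the text splitter and the extraction chain, chosen only so the example runs offline:

```python
import re

def split_paragraphs(text: str) -> list:
    """Stand-in splitter: one chunk per non-empty paragraph."""
    return [part for part in text.split("\n\n") if part.strip()]

def stub_extract(chunk: str) -> list:
    """Stand-in extractor: finds citations shaped like 'Name et al. YEAR'."""
    return [{"title": match} for match in re.findall(r"[A-Z][a-z]+ et al\. \d{4}", chunk)]

def extract_all(text: str) -> list:
    per_chunk = [stub_extract(chunk) for chunk in split_paragraphs(text)]  # map step
    return [item for row in per_chunk for item in row]                     # flatten step

sample = "CoT (Wei et al. 2022) helps.\n\nReAct (Yao et al. 2023) and Toolformer (Schick et al. 2023)."
print(extract_all(sample))
```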

In [60]:
chain = prep | extraction_chain.map()
flattenChain = chain | flatten

chain.invoke(doc.page_content)

[[{'title': 'AutoGPT'}, {'title': 'GPT-Engineer'}, {'title': 'BabyAGI'}],
 [{'title': 'Chain of thought'},
  {'title': 'Tree of Thoughts'},
  {'title': 'LLM+P'},
  {'title': 'ReAct'},
  {'title': 'Reflexion'}],
 [{'title': 'Chain of Hindsight (CoH; Liu et al. 2023)'},
  {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)'}],
 [{'title': 'Laskin et al. 2023'},
  {'title': 'Miller 1956'},
  {'title': 'Duan et al. 2017'}],
 [{'title': 'Google Blog'}, {'title': 'ann-benchmarks.com'}],
 [{'title': 'MRKL (Karpas et al. 2022)'},
  {'title': 'TALM (Tool Augmented Language Models; Parisi et al. 2022)'},
  {'title': 'Toolformer (Schick et al. 2023)'},
  {'title': 'HuggingGPT (Shen et al. 2023)'}],
 [{'title': 'API-Bank', 'author': 'Li et al. 2023'},
  {'title': 'ChemCrow', 'author': 'Bran et al. 2023'}],
 [{'title': 'Boiko et al. (2023)'},
  {'title': 'Generative Agents Simulation (Park, et al. 2023)'}],
 [{'title': 'Park et al. 2023'}],
 [{'title': 'Sample Paper 1', 'author': 'Author A'}

In [61]:
flattenChain.invoke(doc.page_content)

[{'title': 'AutoGPT'},
 {'title': 'GPT-Engineer'},
 {'title': 'BabyAGI'},
 {'title': 'Chain of thought'},
 {'title': 'Tree of Thoughts'},
 {'title': 'LLM+P'},
 {'title': 'ReAct'},
 {'title': 'Reflexion'},
 {'title': 'Chain of Hindsight (CoH; Liu et al. 2023)'},
 {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)'},
 {'title': 'Laskin et al. 2023'},
 {'title': 'Miller 1956'},
 {'title': 'Duan et al. 2017'},
 {'title': 'Google Blog'},
 {'title': 'ann-benchmarks.com'},
 {'title': 'MRKL (Karpas et al. 2022)'},
 {'title': 'TALM (Tool Augmented Language Models; Parisi et al. 2022)'},
 {'title': 'Toolformer (Schick et al. 2023)'},
 {'title': 'HuggingGPT (Shen et al. 2023)'},
 {'title': 'API-Bank', 'author': 'Li et al. 2023'},
 {'title': 'ChemCrow', 'author': 'Bran et al. 2023'},
 {'title': 'Boiko et al. (2023)'},
 {'title': 'Generative Agents Simulation (Park, et al. 2023)'},
 {'title': 'Park et al. 2023'},
 {'title': 'Sample Paper 1', 'author': 'John Doe'},
 {'title': 'Sample Paper 2

We can see here that the last entries look made up, for example "Sample Paper 1" with "Author A". This appears to be incorrect, but the article being processed is itself about prompting and covers, among other things, extraction, retrieval-augmented generation, and citing sources.

So the article actually contains example text that imitates fake papers and has a response cite them, and the chain is picking that text up correctly.
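
Separately, because each chunk is processed on its own, the flattened list can contain the same paper more than once. One possible post-processing step is an exact-match de-duplication on a normalized title, sketched below (`dedupe_by_title` is a hypothetical helper; near-duplicates such as `'Laskin et al. 2023'` versus `'Algorithm Distillation (AD; Laskin et al. 2023)'` would need fuzzier matching than this):

```python
def dedupe_by_title(papers: list) -> list:
    """Keep the first occurrence of each title, comparing case-insensitively."""
    seen = set()
    unique = []
    for paper in papers:
        key = paper["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique

papers = [{"title": "ReAct"}, {"title": "react "}, {"title": "Reflexion"}]
print(dedupe_by_title(papers))  # [{'title': 'ReAct'}, {'title': 'Reflexion'}]
```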

### Beware of inputs that contain prompts when doing Q&A or extraction

As a side note, when you run extraction or question answering over articles that discuss prompting or language models and include example prompts, the model can sometimes confuse those examples with real content and return them as results.