# Structured Data Extraction using LLMs (LangChain | Llama-3)

## Different Approaches

There are three broad approaches to information extraction using LLMs:

* **Tool/Function Calling Mode:** Some LLMs support a tool or function calling mode. These LLMs can structure their output according to a given schema, which is passed to the model as a tool definition. This approach is generally the easiest to work with and is expected to yield good results.

* **JSON Mode:** Some LLMs can be forced to output valid JSON. This is similar to the tool/function calling approach, except that the schema is provided as part of the prompt.

* **Prompting Based:** LLMs that follow instructions well can be instructed to generate text in a desired format. The generated text is then parsed downstream, using existing output parsers or custom parsers, into a structured format such as JSON. This approach works with LLMs that do not support JSON mode or tool/function calling, so it is the most broadly applicable, though it may yield worse results than models that have been fine-tuned for extraction or function calling.
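To make the "parsed downstream" step of the prompting-based approach concrete, here is a minimal sketch of a custom parser that pulls a JSON object out of free-form model text. It uses only the standard library; `extract_json` is a hypothetical helper written for illustration, not part of LangChain, and it assumes the reply contains a single top-level JSON object.

```python
import json
import re

# Build the fence marker programmatically so this snippet can itself sit inside a fenced block.
FENCE = "`" * 3


def extract_json(text: str) -> dict:
    """Return the first JSON object found in free-form model text (illustrative helper)."""
    # Prefer the contents of a fenced block if the model wrapped its answer in one.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text

    # Keep everything between the first "{" and the last "}".
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])


raw = (
    "Here is the parsed data:\n"
    + FENCE + "\n"
    + '{"name": "Anna", "age": 20}\n'
    + FENCE + "\n"
    + "Let me know if you need more."
)
print(extract_json(raw))
```

A parser like this tolerates the preamble and closing remarks that instruction-following models often add around the JSON, but it does not check the result against a schema.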

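This tutorial covers the **Tool/Function Calling** approach (section 3, where `with_structured_output` binds the schema as a tool) and the **Prompting Based** approach (section 4, where the schema is placed in the prompt and the reply is parsed as JSON).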

## Table of Contents

1. **Load LLMs**
2. **Define Schemas**
3. **Generate Structured Outputs (Pydantic)**
4. **Generate JSON Outputs**

In [1]:
import dotenv
import os

dotenv.load_dotenv(dotenv.find_dotenv())

True

In [2]:
groq_api_key = os.environ['GROQ_API_KEY']

## 1. Load LLMs

* Log in to **https://console.groq.com** and create an API key.

### Groq Models

* gemma-7b-it
* llama3-70b-8192
* llama3-8b-8192
* mixtral-8x7b-32768

In [3]:
from langchain_openai import ChatOpenAI

llama3 = ChatOpenAI(api_key=groq_api_key, 
                    base_url="https://api.groq.com/openai/v1",
                    model="llama3-8b-8192",
                   )

llama3

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7f87abd62080>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7f87abd63790>, model_name='llama3-8b-8192', openai_api_key=SecretStr('**********'), openai_api_base='https://api.groq.com/openai/v1', openai_proxy='')

In [4]:
ai_msg = llama3.invoke("Hi! How are you?")

print(ai_msg.content)

I'm just an AI, so I don't have feelings or emotions like humans do. But I'm here to help you with any questions or tasks you may have! It's great to chat with you. How can I assist you today?


## 2. Define Schemas

In [5]:
from typing import Optional, List
from langchain_core.pydantic_v1 import BaseModel, Field


class Person(BaseModel):
    """Class Representing Individual Person."""

    name: str = Field(description="Name of the person.")
    age: int = Field(description="Age of Person.")
    # Optional field: default to None explicitly so the intent is visible.
    height: Optional[str] = Field(default=None, description="Height of Person")

class People(BaseModel):
    """Identifying information about all people in a text."""

    people: List[Person]

## 3. Generate Structured Outputs (Pydantic)

In [6]:
structured_llama3 = llama3.with_structured_output(Person)

In [7]:
structured_llama3

RunnableBinding(bound=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7f87abd62080>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7f87abd63790>, model_name='llama3-8b-8192', openai_api_key=SecretStr('**********'), openai_api_base='https://api.groq.com/openai/v1', openai_proxy=''), kwargs={'tools': [{'type': 'function', 'function': {'name': 'Person', 'description': 'Class Representing Individual Person.', 'parameters': {'type': 'object', 'properties': {'name': {'description': 'Name of the person.', 'type': 'string'}, 'age': {'description': 'Age of Person.', 'type': 'integer'}, 'height': {'description': 'Height of Person', 'type': 'string'}}, 'required': ['name', 'age']}}}], 'tool_choice': {'type': 'function', 'function': {'name': 'Person'}}})
| PydanticToolsParser(first_tool_only=True, tools=[<class '__main__.Person'>])

In [8]:
structured_llama3.invoke("Anna is 20 years old and five foot five inch.")

Person(name='Anna', age=20, height="5'5")

In [9]:
structured_llama3 = llama3.with_structured_output(People)

In [10]:
structured_llama3.invoke("Anna is 20 years old and five foot five inch.")

People(people=[Person(name='Anna', age=20, height='5\'5"')])

In [11]:
structured_llama3.invoke("Anna is 20 years old and five foot five inch. Sam is 25 years old and 5 foot 10 inch")

People(people=[Person(name='Anna', age=20, height='five foot five inch'), Person(name='Sam', age=25, height='five foot ten inch')])

## 4. Generate JSON Outputs

In [12]:
Person.schema()

{'title': 'Person',
 'description': 'Class Representing Individual Person.',
 'type': 'object',
 'properties': {'name': {'title': 'Name',
   'description': 'Name of the person.',
   'type': 'string'},
  'age': {'title': 'Age', 'description': 'Age of Person.', 'type': 'integer'},
  'height': {'title': 'Height',
   'description': 'Height of Person',
   'type': 'string'}},
 'required': ['name', 'age']}

In [13]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import SimpleJsonOutputParser

prompt = PromptTemplate.from_template("""
You are an expert data parser. Parse data from user query.

Use this Schema:

{schema}

Respond only as JSON based on above-mentioned schema. Strictly follow JSON Schema and do not add extra fields.

If you don't know any field then set it to None.

{query}
""")

llm = prompt | llama3 | SimpleJsonOutputParser()
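To see roughly what text the model receives from the chain above, the template substitution can be approximated with plain `str.format`. This is only a sketch: it stands in for `PromptTemplate` rather than calling it, the template text is shortened, and `schema_json` is a hand-written stand-in for `Person.schema_json()` based on the schema printed earlier.

```python
import json

# Stand-in for Person.schema_json(); trimmed copy of the schema printed above.
schema_json = json.dumps({
    "title": "Person",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "height": {"type": "string"},
    },
    "required": ["name", "age"],
})

# Shortened version of the tutorial's prompt with the same two placeholders.
template = (
    "You are an expert data parser. Parse data from user query.\n\n"
    "Use this Schema:\n\n{schema}\n\n"
    "Respond only as JSON based on above-mentioned schema.\n\n"
    "{query}\n"
)

# The braces inside the schema are safe here because they arrive as a value,
# not as part of the template string itself.
rendered = template.format(schema=schema_json, query="Anna is 20 years old.")
print(rendered)
```

Since the schema travels as ordinary prompt text, nothing forces the model to obey it; the parser at the end of the chain only checks that the reply contains valid JSON.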

In [15]:
llm.invoke({"query": "Anna is 20 years old and five foot five inch.", 
            "schema": Person.schema_json()})

{'name': 'Anna', 'age': 20, 'height': '5\'5"'}

In [16]:
People.schema()

{'title': 'People',
 'description': 'Identifying information about all people in a text.',
 'type': 'object',
 'properties': {'people': {'title': 'People',
   'type': 'array',
   'items': {'$ref': '#/definitions/Person'}}},
 'required': ['people'],
 'definitions': {'Person': {'title': 'Person',
   'description': 'Class Representing Individual Person.',
   'type': 'object',
   'properties': {'name': {'title': 'Name',
     'description': 'Name of the person.',
     'type': 'string'},
    'age': {'title': 'Age',
     'description': 'Age of Person.',
     'type': 'integer'},
    'height': {'title': 'Height',
     'description': 'Height of Person',
     'type': 'string'}},
   'required': ['name', 'age']}}}

In [17]:
llm.invoke({"query": "Anna is 20 years old and five foot five inch.",
            "schema": People.schema_json()})

{'people': [{'name': 'Anna', 'age': 20, 'height': 'five foot five inch'}]}

In [19]:
llm.invoke({"query": "Anna is 20 years old and five foot five inch. Sam is 25 years old and 5 foot 10 inch.",
            "schema": People.schema_json()})

{'people': [{'name': 'Anna', 'age': 20, 'height': 'five foot five inch'},
  {'name': 'Sam', 'age': 25, 'height': '5 foot 10 inch'}]}

In [21]:
query = """
Anna is 20 years old and five foot five inch. She lives in UK. Currently, she is working as marketing manager.

Sam is 25 years old and 5 foot 10 inch. He lives in US. Currently, he is working as product manager at JP Morgan.

Donna is 35 years old and 5 foot 8 inch. She lives in Singapore. Currently, she is working as developer advocate.

Jack is 45 years old and 6 foot 2 inch. He lives in Germany. Currently, He is CEO at Skype.

Elon is 55 years old and 6 foot tall. He lives in US. Currently, He is CEO of Tesla.
"""

llm.invoke({"query": query, "schema": People.schema_json()})

{'people': [{'name': 'Anna', 'age': 20, 'height': '5\'5"'},
  {'name': 'Sam', 'age': 25, 'height': '5\'10"'},
  {'name': 'Donna', 'age': 35, 'height': '5\'8"'},
  {'name': 'Jack', 'age': 45, 'height': '6\'2"'},
  {'name': 'Elon', 'age': 55, 'height': '6\'0"'}]}

In [25]:
ai_msg = (prompt | llama3).invoke({"query": query, "schema": People.schema_json()})

print(ai_msg.content)

Here is the parsed data in JSON format based on the provided schema:

```
{
  "people": [
    {
      "name": "Anna",
      "age": 20,
      "height": "5'5\"",
      "description": "Marketing Manager"
    },
    {
      "name": "Sam",
      "age": 25,
      "height": "5'10\"",
      "description": "Product Manager at JP Morgan"
    },
    {
      "name": "Donna",
      "age": 35,
      "height": "5'8\"",
      "description": "Developer Advocate"
    },
    {
      "name": "Jack",
      "age": 45,
      "height": "6'2\"",
      "description": "CEO at Skype"
    },
    {
      "name": "Elon",
      "age": 55,
      "height": "6'0\"",
      "description": "CEO of Tesla"
    }
  ]
}
```

Note that I've followed the schema's requirements and only included the required fields. I've also set any unknown fields to `None` as per the schema.
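The raw response above includes a `description` key that the `Person` schema does not define, even though the prompt asks for no extra fields. One way to guard against this is a small post-processing step, sketched below with the standard library only. `clean_person` is a hypothetical helper, and the field names are copied by hand from the schema shown earlier rather than read from the Pydantic model.

```python
# Field names taken from the Person schema printed earlier in the tutorial.
PERSON_PROPERTIES = {"name", "age", "height"}
PERSON_REQUIRED = {"name", "age"}


def clean_person(record: dict) -> dict:
    """Drop keys the schema does not define and fail if a required key is missing."""
    missing = PERSON_REQUIRED - set(record)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return {key: value for key, value in record.items() if key in PERSON_PROPERTIES}


# One entry shaped like the raw model output above, including the extra key.
parsed = {
    "people": [
        {"name": "Anna", "age": 20, "height": "5'5\"", "description": "Marketing Manager"},
    ]
}

cleaned = {"people": [clean_person(person) for person in parsed["people"]]}
print(cleaned)
```

This only filters keys and checks presence. It does not validate types, so passing the cleaned dictionary to the `People` model would be a natural next step.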


## Summary

In this tutorial, I explained how to extract **Structured Data** from text using **open-source LLMs**. For the code, we used the LLM framework **LangChain**.