# Extract data in a structured format

## Different Approaches
There are three broad approaches to information extraction using LLMs.

1. **Tool/Function Calling Mode**: Some LLMs support a tool or function calling mode. These LLMs can structure their output according to a given schema. Generally, this approach is the easiest to work with and is expected to yield good results.
2. **JSON Mode**: Some LLMs can be forced to output valid JSON. This is similar to the tool/function calling approach, except that the schema is provided as part of the prompt.
3. **Prompting Based**: LLMs that follow instructions well can be instructed to generate text in a desired format. The generated text can then be parsed downstream into a structured format such as JSON, using existing output parsers or custom parsers. This approach works with LLMs that do not support JSON mode or tool/function calling.
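
The prompting-based approach depends on a parser that sits downstream of the model. As a minimal sketch, assuming the model wraps its JSON in a Markdown fence or surrounds it with extra chatter, a custom parser might look like the following. The helper name `extract_json` and the sample reply are hypothetical and not part of any library.

```python
import json
import re

# Markdown fence marker, built indirectly so this listing stays easy to embed.
FENCE = "`" * 3


def extract_json(text: str) -> dict:
    """Pull the first JSON object out of free-form model output (hypothetical helper)."""
    # Prefer the body of a fenced block if the model produced one.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    fenced = re.search(pattern, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Keep only the span between the first "{" and the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start : end + 1])


# Made-up reply standing in for real LLM output.
reply = "Sure! Here is the data:\n" + FENCE + 'json\n{"name": "Jason", "age": 25}\n' + FENCE
parsed = extract_json(reply)
print(parsed)
```

A parser like this is deliberately forgiving about what surrounds the JSON, but it still raises if the object itself is malformed, which is usually the point where a retry is triggered.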

### Parse using Instructor and Pydantic Model
1. Load LLMs
2. Define Schemas
3. Generate Structured (Pydantic) Outputs
4. Generate JSON Outputs

In [37]:
import instructor
from openai import OpenAI
from pydantic import BaseModel
from dotenv import load_dotenv
load_dotenv()

True

In [38]:
# Define a Pydantic class that describes the desired structured output
class User(BaseModel):
    name: str
    age: int    
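
A class like this is what the model's reply gets validated against. The following is a minimal offline sketch of that validation step, assuming Pydantic v2 and using made-up JSON strings as stand-ins for LLM output; no API call is involved.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    age: int


# A well-formed reply parses into a typed object.
user = User.model_validate_json('{"name": "Jason", "age": 25}')
print(user.name, user.age)

# A reply with the wrong type is rejected rather than silently passed through.
try:
    User.model_validate_json('{"name": "Jason", "age": "twenty-five"}')
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc.error_count(), "validation error(s)")
```

Because validation fails loudly, a caller can decide whether to retry the request or surface the error.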

In [39]:
# Create the OpenAI client and patch it so that `create` accepts `response_model`
client = instructor.patch(OpenAI())

In [40]:
# Extract structured data from natural language
user_info = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model = User,
    messages=[
        {
            "role": "user",
            "content": "Extract Jason is 25 years old",
        }
    ],    
)
print(user_info)

name='Jason' age=25


In [41]:
# Create a new user based on the user model

new_user = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    response_model = User,
    messages = [
        {
            "role": "user",
            "content": "Generate a user"
        },
    ]
)
# Print as a JSON response
print(new_user.model_dump_json(indent=2))

{
  "name": "Alice",
  "age": 30
}


In [42]:
new_users = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    response_model = list[User],
    messages = [
        {
            "role": "user",
            "content": "Generate a list of 10 users with tamil names and age between 30 to 60"
        },
    ]
)
# Print each generated user
for user in new_users:
    print(user)


name='கலைவன்' age=40
name='சிதீசா' age=45
name='தமிழ்மணி' age=35
name='சந்திரன்' age=50
name='அரிவண்ணன்' age=55
name='பவித்ரா' age=32
name='தீபான்' age=38
name='சரவணன்' age=42
name='ஜோதிகா' age=48
name='உன்னிதன்' age=58


## Load LLMs

In [43]:
from dotenv import load_dotenv, find_dotenv
import os
load_dotenv(find_dotenv())

True

### Log in to https://console.groq.com and create an API key


In [44]:
groq_api_key = os.getenv("GROQ_API_KEY")
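
`os.getenv` returns `None` when the variable is missing, and the failure then only shows up later as an authentication error. A small guard, sketched here with a hypothetical helper named `require_env` and a throwaway variable so that it runs without a real key, makes the problem visible earlier.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable or fail with a clear message (hypothetical helper)."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Demo with a placeholder variable; in the notebook this would be "GROQ_API_KEY".
os.environ["DEMO_API_KEY"] = "not-a-real-key"
demo_key = require_env("DEMO_API_KEY")
print("key loaded:", bool(demo_key))
```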

In [45]:
from langchain_openai import ChatOpenAI
llama3 = ChatOpenAI(api_key = groq_api_key,
                    model="llama3-70b-8192",
                    base_url="https://api.groq.com/openai/v1",
                    temperature=0.1)
llama3

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x000001B9325FB170>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x000001B9310E5070>, root_client=<openai.OpenAI object at 0x000001B9325ECD70>, root_async_client=<openai.AsyncOpenAI object at 0x000001B9325FB230>, model_name='llama3-70b-8192', temperature=0.1, model_kwargs={}, openai_api_key=SecretStr('**********'), openai_api_base='https://api.groq.com/openai/v1')

In [46]:
# Test the model
ai_msg = llama3.invoke("What is the capital of France?")
ai_msg.content

'The capital of France is Paris.'

## Define the Schemas

In [47]:
from typing import Optional, List
from pydantic import BaseModel, Field

class Person(BaseModel):
    """
    A class representing a person with a name, age, and a list of hobbies.

    Attributes:
        name (str): The name of the person.
        age (int): The age of the person.
        hobbies (List[str]): A list of hobbies of the person.
    """
    name: str = Field(description="name of the person")
    age: int = Field(description="age of the person")
    hobbies: List[str] = Field(description="hobbies of the person")

class People(BaseModel):
    """
    A class representing a list of people.

    Attributes:
        people (List[Person]): A list of Person objects.
    """
    people: List[Person] = Field(description="list of people")

### Generate Structured outputs (Pydantic) 

In [48]:
structured_llama3 = llama3.with_structured_output(Person)
structured_llama3

RunnableBinding(bound=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x000001B9325FB170>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x000001B9310E5070>, root_client=<openai.OpenAI object at 0x000001B9325ECD70>, root_async_client=<openai.AsyncOpenAI object at 0x000001B9325FB230>, model_name='llama3-70b-8192', temperature=0.1, model_kwargs={}, openai_api_key=SecretStr('**********'), openai_api_base='https://api.groq.com/openai/v1'), kwargs={'tools': [{'type': 'function', 'function': {'name': 'Person', 'description': 'A class representing a person with a name, age, and a list of hobbies.\n\nAttributes:\n    name (str): The name of the person.\n    age (int): The age of the person.\n    hobbies (List[str]): A list of hobbies of the person.', 'parameters': {'properties': {'name': {'description': 'name of the person', 'type': 'string'}, 'age': {'description': 'age of the person', 'type': 'integer'}, 'hobbies': {'description': 'hobb

In [49]:
structured_llama3.invoke("Ram is 45 years old and he loves to blogging, running, learning")

Person(name='Ram', age=45, hobbies=['blogging', 'running', 'learning'])

In [50]:
structured_llama3 = llama3.with_structured_output(People)
structured_llama3.invoke("Shan is 15 years old and he is intreasted in biking, gaming and learning new stuff. Sam is 20 years old and loves to swim and dancing")

People(people=[Person(name='Shan', age=15, hobbies=['biking', 'gaming', 'learning new stuff']), Person(name='Sam', age=20, hobbies=['swimming', 'dancing'])])
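
The `kwargs` in the `RunnableBinding` repr above show that `with_structured_output` bound the schema as a tool definition. As a rough, hand-built approximation of that payload, it can be reconstructed from the Pydantic model. This is not LangChain's actual conversion code, which may drop or rename some keys, and the helper `to_tool_spec` is hypothetical.

```python
from typing import List

from pydantic import BaseModel, Field


class Person(BaseModel):
    """A person with a name, age, and a list of hobbies."""

    name: str = Field(description="name of the person")
    age: int = Field(description="age of the person")
    hobbies: List[str] = Field(description="hobbies of the person")


def to_tool_spec(model: type[BaseModel]) -> dict:
    """Build an OpenAI-style tool definition from a flat Pydantic model (hypothetical helper)."""
    schema = model.model_json_schema()
    return {
        "type": "function",
        "function": {
            "name": schema["title"],
            "description": schema.get("description", ""),
            "parameters": {
                "type": "object",
                "properties": schema["properties"],
                "required": schema.get("required", []),
            },
        },
    }


tool = to_tool_spec(Person)
print(tool["function"]["name"], tool["function"]["parameters"]["required"])
```

The sketch only handles flat models. A nested model such as `People` places `Person` under `$defs` in its schema, which this helper would drop.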

## Generate JSON Outputs

In [51]:
Person.model_json_schema()

{'description': 'A class representing a person with a name, age, and a list of hobbies.\n\nAttributes:\n    name (str): The name of the person.\n    age (int): The age of the person.\n    hobbies (List[str]): A list of hobbies of the person.',
 'properties': {'name': {'description': 'name of the person',
   'title': 'Name',
   'type': 'string'},
  'age': {'description': 'age of the person',
   'title': 'Age',
   'type': 'integer'},
  'hobbies': {'description': 'hobbies of the person',
   'items': {'type': 'string'},
   'title': 'Hobbies',
   'type': 'array'}},
 'required': ['name', 'age', 'hobbies'],
 'title': 'Person',
 'type': 'object'}

In [52]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser, SimpleJsonOutputParser

In [53]:
prompt = PromptTemplate.from_template(""" 
                                      You are an expert data parser. Parse data from user query.
                                      Use this schema:
                                      {schema}
                                      Respond only with JSON that follows the schema above.
                                      Strictly follow the JSON schema and do not add extra fields.
                                      If you don't know the value of a field, set it to null.

                                      {query}
                                      """)
llm = prompt | llama3 | SimpleJsonOutputParser()

In [54]:
llm.invoke({"query":"Raju is 21 years old and loves doing vlog, training",
            "schema":Person.model_json_schema()})

{'name': 'Raju', 'age': 21, 'hobbies': ['vlog', 'training']}
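
One detail worth noting: `Person.model_json_schema()` returns a Python dict, so the prompt template most likely inserts its string form, which uses single quotes and is not itself valid JSON. The model coped with that here, but the schema can be rendered as real JSON text first. The sketch below uses a trimmed `schema` dict as a stand-in for the full output shown earlier.

```python
import json

# Trimmed stand-in for Person.model_json_schema().
schema = {
    "properties": {
        "name": {"description": "name of the person", "type": "string"},
        "age": {"description": "age of the person", "type": "integer"},
    },
    "required": ["name", "age"],
    "title": "Person",
    "type": "object",
}

as_repr = str(schema)                   # string form of a dict: single-quoted keys
as_json = json.dumps(schema, indent=2)  # valid JSON text

# The dict's string form does not parse as JSON; the dumped text does.
try:
    json.loads(as_repr)
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

print("repr parses as JSON:", repr_is_json)
print("dump round-trips:", json.loads(as_json) == schema)
```

In the chain above this would mean passing `json.dumps(Person.model_json_schema(), indent=2)` as the `schema` value.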

In [55]:
People.model_json_schema()

{'$defs': {'Person': {'description': 'A class representing a person with a name, age, and a list of hobbies.\n\nAttributes:\n    name (str): The name of the person.\n    age (int): The age of the person.\n    hobbies (List[str]): A list of hobbies of the person.',
   'properties': {'name': {'description': 'name of the person',
     'title': 'Name',
     'type': 'string'},
    'age': {'description': 'age of the person',
     'title': 'Age',
     'type': 'integer'},
    'hobbies': {'description': 'hobbies of the person',
     'items': {'type': 'string'},
     'title': 'Hobbies',
     'type': 'array'}},
   'required': ['name', 'age', 'hobbies'],
   'title': 'Person',
   'type': 'object'}},
 'description': 'A class representing a list of people.\n\nAttributes:\n    people (List[Person]): A list of Person objects.',
 'properties': {'people': {'description': 'list of people',
   'items': {'$ref': '#/$defs/Person'},
   'title': 'People',
   'type': 'array'}},
 'required': ['people'],
 'title'

In [56]:
query = """ 
Anna is 20 years old and is interested to do gardening, walking during her free time.
Seema is 18 years old and is interested to do painting, reading during her free time.
Ria is 21 years old and is interested to do singing, painting, walking during her free time.
Rohan is 22 years old and is interested to do painting, walking during her free time.
"""
llm.invoke({"query":query,
           "schema":People.model_json_schema()})

{'people': [{'name': 'Anna', 'age': 20, 'hobbies': ['gardening', 'walking']},
  {'name': 'Seema', 'age': 18, 'hobbies': ['painting', 'reading']},
  {'name': 'Ria', 'age': 21, 'hobbies': ['singing', 'painting', 'walking']},
  {'name': 'Rohan', 'age': 22, 'hobbies': ['painting', 'walking']}]}
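
The JSON parser returns plain dictionaries, so nothing has checked the result against the schema yet. As a sketch, assuming Pydantic v2, the dictionary can be validated back into the `People` model to get typed objects. The `raw` value below copies the first two entries of the output above.

```python
from typing import List

from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str = Field(description="name of the person")
    age: int = Field(description="age of the person")
    hobbies: List[str] = Field(description="hobbies of the person")


class People(BaseModel):
    people: List[Person] = Field(description="list of people")


# First two entries of the chain output shown above.
raw = {
    "people": [
        {"name": "Anna", "age": 20, "hobbies": ["gardening", "walking"]},
        {"name": "Seema", "age": 18, "hobbies": ["painting", "reading"]},
    ]
}

# Raises pydantic.ValidationError if a field is missing or has the wrong type.
people = People.model_validate(raw)
for person in people.people:
    print(person.name, person.age, person.hobbies)
```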