## *Entity Extraction from descripiton related to a book using Granite-8B*
LLMs have demonstrated remarkable accuracy in the task of entity extraction. This cookbook focuses on extracting key entities from descriptions related to books

### Install dependencies

In [1]:
!pip install git+https://github.com/ibm-granite-community/utils langchain_community pydantic

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /private/var/folders/yq/mg65c_l16hv64plnb99z5dx40000gq/T/pip-req-build-4kuap8v3
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /private/var/folders/yq/mg65c_l16hv64plnb99z5dx40000gq/T/pip-req-build-4kuap8v3
  Resolved https://github.com/ibm-granite-community/utils to commit 5d67648927240b208a164d2466f0dc77200450e5
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Building wheels for collected packages: ibm-granite-community-utils
  Building wheel for ibm-granite-community-utils (pyproject.toml) ... [?25ldone
[?25h  Created wheel for ibm-granite-community-utils: filename=ibm_granite_community_utils-0.1.dev49-py3-none-any.whl size=8536 sha256=e10e9e7c4bf8e4fe0448b2b751d8640c3717efe2066b186f29232d06e

### Instantiate the model client

In [2]:
import json
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from ibm_granite_community.notebook_utils import get_env_var

model_name: str = "granite3.1-dense:8b"

model =  OllamaLLM(
        model=model_name,
        temperature=0
    )

### 1 - Entity Extraction by defining entities in the prompt

The first approach is straightforward and involves explicitly defining the entities within the prompt itself. In this method, we specify the entities to be extracted along with their descriptions directly in the prompt. This includes:  

<u>**Entity Definitions:**</u> Each entity, such as title, author, price, and rating, is clearly outlined with a concise description of what it represents.  

<u>**Prompt Structure:**</u> The prompt is structured to guide the LLM in understanding exactly what information is needed. By providing detailed instructions, we aim to ensure that the model focuses on extracting only the relevant data.  

<u>**Output Format:**</u> The output is required to be in JSON format, which enforces a consistent structure for the extracted data. If any entity is not found, the model is instructed to return "Data not available," preventing ambiguity.  

Provide some text with information for a book. In this case, we use generated commentary on 'The Hunger Games' by Suzanne Collins.

In [3]:
book_info = """Notice of Representation

Budget Mutual Insurance Company 9876 Infinity Ave Springfield, MI 65541

Georgia Collan Parker LLP 9816 51st Ave SW Auburn, Washington(WA), 98092

Our Client: Courtney Sosa Date of death: 6/12/2020

To Whom It May Concern,

I have been retained by Courtney Sosa to handle the estate of Lukas Juarez. My understanding is that they had a life insurance policy (#951033310) with your company. If this is correct, please send a letter to my office indicating you have received our letter of representation. Additionally, please do not contact our client going forward.

We are requesting that you forward the full policy amount of $50,000. Please forward an acknowledgement of our demand and please forward the umbrella policy information if one is applicable. Please send my secretary any information regarding liens on his policy.

Please contact my office if you have any questions.

Sincerely,

Angela Berry, Attorney
"""

All the entities that need to be fetched are defined in the prompt itself along with the entity's description.

In [4]:
entity_prompt = f"""
<|start_of_role|>user<|end_of_role|>
    -You are an AI Entity Extractor. You help extract entities from the given information about a book. Here is the book information:
    {book_info}

    - Extract the following entities:

    1) `Insurance Company` : This is the name of the company.
    2) `Insurance Company Address`: This is the address of the company.
    3) `Law Firm`: Name of the Law Firm.
    4) `Law Office Address`: This is the address of the law firm.

    -Your output should strictly be in a json format, which only contains the key and value. The key here is the entity to be extracted and the value is the entity which you extracted.
    -Do not generate random entities on your own. If it is not present or you are unable to find any specified entity, you strictly have to output it as `Data not available`.
    -Only do what is asked to you. Do not give any explanations to your output and do not hallucinate.
    <|end_of_text|>
    <|start_of_role|>assistant<|end_of_role|>
"""

Invoking the model to get the results

In [5]:
response = model.invoke(entity_prompt)
print(response)

{
  "title": "The Hunger Games",
  "author": "Suzanne Collins",
  "price": "5 dollars and 9 cents",
  "rating": "4.33/5"
}


In [6]:
book_info = json.loads(response)
book_info

{'title': 'The Hunger Games',
 'author': 'Suzanne Collins',
 'price': '5 dollars and 9 cents',
 'rating': '4.33/5'}

---

### 2 - Pydantic Class-Based Entity Definition

The second approach takes advantage of object-oriented programming principles by defining entities within a class structure. This method involves several key steps:  

<u>**Class Definition:**</u> We create a class that encapsulates all the relevant entities as members. Each member corresponds to an entity such as title, author, etc., and can include type annotations for better validation and clarity.  

<u>**Pydantic Integration:**</u> Utilizing Pydantic, a data validation library, we convert this class into a Pydantic model. This model not only defines the structure of our data but also provides built-in validation features, ensuring that any extracted data adheres to specified formats and types.  

<u>**Dynamic Prompting:**</u> The Pydantic model can then be integrated with the prompt sent to the LLM. This allows for a more dynamic interaction where the model can adapt based on the defined structure of entities. If new entities are added or existing ones modified, changes can be made at the class level without needing to rewrite the entire prompt.  

<u>**Enhanced Validation:**</u> By leveraging Pydantic's capabilities, we can ensure that any data extracted by the LLM meets our predefined criteria, enhancing data integrity and reliability.  

This class-based approach offers greater flexibility and scalability compared to the first method. It allows for easier modifications and expansions as new requirements arise, making it particularly suitable for larger projects or those requiring frequent updates.

In [7]:
from pydantic import BaseModel, Field
from typing import List
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

Here we use commentaries for two different books.

In [8]:
books_info = f"""{book_info}

Our next book is titled Magic of Lands. Even if some of you have read it before, I believe giving it another read would be worthwhile --
it actually gets more captivating the second time around. The author, John Williams, who has several other books to his name,
received a 3 out of 5 rating for this particular one. Considering the ratings we've seen for other books like Endurance, that's a fair score.
This French drama is 330 pages long and was published on September 11, 2010. It's currently priced at $3.22.
However, if you're interested, you can contact Mr. Shakespeare after the session -- he's offering it at a discounted price of $2.
Don't miss the opportunity to grab such an intriguing read!
"""

We define all of the entities in a Python class along with the descripiton.

In [9]:
class Book(BaseModel):
    "This contains information about a book including its title, author, price, rating, and so on."
    title: str = Field(description="The title of the book")
    price: str = Field(description="Total cost of this book")
    author: str = Field(description="The person who wrote this book")
    rating: str = Field(description="Total rating for this book")

In [10]:
class BooksInformation(BaseModel):
    "This contains information about multiple books."
    books: List[Book] = Field(description = "Information on multiple books. ")

In [11]:
book_function = convert_pydantic_to_openai_function(BooksInformation)
print(book_function)

{'name': 'BooksInformation', 'description': 'This contains information about multiple books.', 'parameters': {'properties': {'books': {'description': 'Information on multiple books. ', 'items': {'description': 'This contains information about a book including its title, author, price, rating, and so on.', 'properties': {'title': {'description': 'The title of the book', 'type': 'string'}, 'price': {'description': 'Total cost of this book', 'type': 'string'}, 'author': {'description': 'The person who wrote this book', 'type': 'string'}, 'rating': {'description': 'Total rating for this book', 'type': 'string'}}, 'required': ['title', 'price', 'author', 'rating'], 'type': 'object'}, 'type': 'array'}}, 'required': ['books'], 'type': 'object'}}


  book_function = convert_pydantic_to_openai_function(BooksInformation)


Same prompt as before, but here, the pydantic function is passed here instead of defining each entity in the prompt.

In [12]:
entity_prompt_with_pydantic = f"""
<|start_of_role|>user<|end_of_role|>
-You are an AI Entity Extractor. You help extract entities from the the given information about books: Here is the information about 2 books:

{json.dumps(book_info)}

-Analyze this information and extract the following entities as per this function definition:

{json.dumps(book_function)}

-Generate output as a json representation of a BooksInformation object. Include only the json.
-Your output should strictly be in a json format, which only contains the key and value. The key here is the entity to be extracted and the value is the entity which you extracted.
-Do not generate random entities on your own. If it is not present or you are unable to find any specified entity, you strictly have to output it as `Data not available`.
-Only do what is asked to you. Do not give any explanations to your output and do not hallucinate.
<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
"""


Invoking the model to get the results

In [13]:
response = model.invoke(entity_prompt_with_pydantic)
print(response)

{
  "BooksInformation": {
    "books": [
      {
        "title": "The Hunger Games",
        "price": "5 dollars and 9 cents",
        "author": "Suzanne Collins",
        "rating": "4.33/5"
      }
    ]
  }
}


We can now instantiate the `Book` and `BooksInformation` classes with the extracted information. We'll need error handling in case we get an improperly-formatted response.

In [14]:
try:
    # Parse the json response.
    books_dict = json.loads(response)
    print(books_dict)
    try:
        # Construct a list of Book objects from the response.
        books_info = BooksInformation(books=[Book(**book) for book in books_dict['books']])
        print(books_info)
    except KeyError as e:
        print(f"The response does not contain the expected key '{e.args[0]}'")
except ValueError as e:
    print(f"Error while parsing response: {e}")

{'BooksInformation': {'books': [{'title': 'The Hunger Games', 'price': '5 dollars and 9 cents', 'author': 'Suzanne Collins', 'rating': '4.33/5'}]}}
The response does not contain the expected key 'books'
