## *Entity Extraction from a description related to a book using Granite-8B*
LLMs can extract entities from unstructured text with high accuracy. This cookbook focuses on extracting key entities from descriptions related to books.

In [1]:
!ollama pull granite3.1-dense:8b
!ollama pull nomic-embed-text

pulling manifest
pulling 44d19d212d76... 100% ▕████████████████▏ 5.0 GB
pulling f76a906816c4... 100% ▕████████████████▏ 1.4 KB
pulling f7b956e70ca3... 100% ▕████████████████▏   69 B
pulling 492069a62c25... 100% ▕████████████████▏  11 KB
pulling f9cd69f4077d... 100% ▕████████████████▏  491 B
verifying sha256 digest
writing manifest
success
pulling manifest
pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
pulling c71d239df917... 100% ▕████████████████▏  11 KB
pulling ce4a164fc046... 100% ▕████████████████▏   17 B
pulling 31df23ea7daa... 100% ▕████████████████▏  420 B

### Install dependencies

In [4]:
# Install required packages
!pip install -q "langchain>=0.1.0" "langchain-community>=0.0.13" "langchain-core>=0.1.17" \
    "langchain-ollama>=0.0.1" "pdfminer.six>=20221105" "markdown>=3.5.2" "docling>=2.0.0" \
    "beautifulsoup4>=4.12.0" "unstructured>=0.12.0" "chromadb>=0.4.22" "faiss-cpu>=1.7.4" \
    "requests>=2.32.0"
!pip install git+https://github.com/ibm-granite-community/utils langchain_community pydantic

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /private/var/folders/yq/mg65c_l16hv64plnb99z5dx40000gq/T/pip-req-build-eskz29gw
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /private/var/folders/yq/mg65c_l16hv64plnb99z5dx40000gq/T/pip-req-build-eskz29gw
  Resolved https://github.com/ibm-granite-community/utils to commit 5d67648927240b208a164d2466f0dc77200450e5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

### Instantiate the model client

In [5]:
# Required imports
import json
import os
import shutil
import tempfile
from pathlib import Path

from IPython.display import Markdown, display
from ibm_granite_community.notebook_utils import get_env_var
from langchain_ollama import OllamaEmbeddings, OllamaLLM

# Docling imports
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption, WordFormatOption, SimplePipeline

# LangChain imports
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

model_name: str = "granite3.1-dense:8b"

# temperature=0 keeps the output as repeatable as possible, which suits extraction
model = OllamaLLM(
    model=model_name,
    temperature=0,
)

In [6]:
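from typing import Optional

def get_document_format(file_path) -> Optional[InputFormat]:
    """Determine the document format based on file extension.

    Returns None when the extension is not recognized or the lookup fails.
    """
    try:
        file_path = str(file_path)
        extension = os.path.splitext(file_path)[1].lower()

        format_map = {
            '.pdf': InputFormat.PDF,
            '.docx': InputFormat.DOCX,
            # Legacy .doc is mapped to DOCX here; binary .doc files may still fail to convert
            '.doc': InputFormat.DOCX,
            '.pptx': InputFormat.PPTX,
            '.html': InputFormat.HTML,
            '.htm': InputFormat.HTML
        }
        return format_map.get(extension)
    except Exception as e:
        # Report the problem and signal "unknown format" instead of returning a string
        print(f"Error in get_document_format: {e}")
        return None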

In [11]:
def convert_document_to_text(doc_path) -> str:
    """Convert document to markdown using simplified pipeline"""
    try:
        # Convert to absolute path string
        input_path = os.path.abspath(str(doc_path))
        print(f"Converting document: {doc_path}")

        # Create temporary directory for processing
        with tempfile.TemporaryDirectory() as temp_dir:
            # Copy input file to temp directory
            temp_input = os.path.join(temp_dir, os.path.basename(input_path))
            shutil.copy2(input_path, temp_input)

            # Configure pipeline options
            pipeline_options = PdfPipelineOptions()
            pipeline_options.do_ocr = False  # Disable OCR temporarily
            pipeline_options.do_table_structure = True

            # Create converter with minimal options
            converter = DocumentConverter(
                allowed_formats=[
                    InputFormat.PDF,
                    InputFormat.DOCX,
                    InputFormat.HTML,
                    InputFormat.PPTX,
                ],
                format_options={
                    InputFormat.PDF: PdfFormatOption(
                        pipeline_options=pipeline_options,
                    ),
                    InputFormat.DOCX: WordFormatOption(
                        pipeline_cls=SimplePipeline
                    )
                }
            )

            # Convert document
            print("Starting conversion...")
            conv_result = converter.convert(temp_input)

            if not conv_result or not conv_result.document:
                raise ValueError(f"Failed to convert document: {doc_path}")

            # Export to markdown
            print("Exporting to markdown...")
            md = conv_result.document.export_to_markdown()

            #
            # print("MD= " + md)

            return md
    except Exception as e:
        # Include the underlying error so a failed conversion can be diagnosed
        return f"Error converting document: {doc_path} ({e})"
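
The next cell computes `doc_format` but does not check it before converting. A minimal sketch of such a guard, using only the standard library, is shown here; `SUPPORTED_EXTENSIONS` and `is_supported_document` are hypothetical names that are not part of the notebook, and the extension list is assumed to mirror the one in `get_document_format`.

```python
from pathlib import Path

# Hypothetical allow-list, assumed to mirror the extensions in get_document_format
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".html", ".htm"}


def is_supported_document(path) -> bool:
    """Return True when the file extension is in the allow-list (case-insensitive)."""
    return Path(str(path)).suffix.lower() in SUPPORTED_EXTENSIONS


# Decide per file whether conversion should be attempted at all
for candidate in ["l0.pdf", "notes.TXT", "slides.pptx"]:
    action = "convert" if is_supported_document(candidate) else "skip"
    print(f"{candidate}: {action}")
```

In the notebook itself, the equivalent check would be to call `convert_document_to_text` only when `get_document_format(doc_path)` returns a format rather than `None`.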

In [12]:
doc_path = Path("l0.pdf")  # Replace with your document path

# Check format and process
doc_format = get_document_format(doc_path)
res = convert_document_to_text(doc_path)

print("RES= " + res)

Converting document: l0.pdf
Starting conversion...
Exporting to markdown...
RES= ## Notice of Representation

Budget Mutual Insurance Company 9876 Infinity Ave Springfield, MI 65541

Georgia Collan Parker LLP 9816 51st Ave SW Auburn, Washington(WA), 98092

Our Client: Courtney Sosa Date of death: 6/12/2020

To Whom It May Concern,

I have been retained by Courtney Sosa to handle the estate of Lukas Juarez. My understanding is that they had a life insurance policy (#951033310) with your company. If this is correct, please send a letter to my office indicating you have received our letter of representation. Additionally, please do not contact our client going forward.

We are requesting that you forward the full policy amount of $50,000. Please forward an acknowledgement of our demand and please forward the umbrella policy information if one is applicable. Please send my secretary any information regarding liens on his policy.

Please contact my office if you have any questions.

Sincere

### 1 - Entity Extraction by defining entities in the prompt

The first approach defines the entities explicitly in the prompt: each entity to be extracted is listed together with its description. This includes:  

<u>**Entity Definitions:**</u> Each entity (in this notebook: `Insurance Company`, `Insurance Company Address`, `Law Firm`, and `Law Office Address`) is clearly outlined with a concise description of what it represents.  

<u>**Prompt Structure:**</u> The prompt is structured to guide the LLM in understanding exactly what information is needed. By providing detailed instructions, we aim to ensure that the model focuses on extracting only the relevant data.  

<u>**Output Format:**</u> The output is required to be in JSON format, which enforces a consistent structure for the extracted data. If an entity is not found, the model is instructed to return `Data not available`, which prevents ambiguity.  

Provide the text to extract from. In this run, the input is `res`, the markdown converted from `l0.pdf` above (a notice of representation addressed to an insurance company).

All the entities that need to be fetched are defined in the prompt itself, along with each entity's description.
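
One way to keep the entity definitions separate from the prompt wording is to hold them in a dictionary and render the numbered list from it. The sketch below illustrates that idea under stated assumptions: `ENTITIES` and `build_entity_prompt` are hypothetical names that do not appear in the notebook, the instruction wording is shortened, and the entity names and role tags are copied from the prompt cell that follows.

```python
# Entity names and descriptions, copied from the prompt used in this notebook
ENTITIES = {
    "Insurance Company": "This is the name of the company.",
    "Insurance Company Address": "This is the address of the company.",
    "Law Firm": "Name of the Law Firm.",
    "Law Office Address": "This is the address of the law firm.",
}


def build_entity_prompt(text: str, entities: dict[str, str]) -> str:
    """Render an extraction prompt with one numbered line per entity definition."""
    entity_lines = "\n".join(
        f"    {i}) `{name}`: {description}"
        for i, (name, description) in enumerate(entities.items(), start=1)
    )
    return (
        "<|start_of_role|>user<|end_of_role|>\n"
        "You are an AI Entity Extractor. Extract entities from the text below.\n"
        f"{text}\n\n"
        "Extract the following entities:\n"
        f"{entity_lines}\n\n"
        "Return only JSON whose keys are the entity names. "
        "If an entity is not present, use `Data not available` as its value.\n"
        "<|end_of_text|>\n"
        "<|start_of_role|>assistant<|end_of_role|>\n"
    )


print(build_entity_prompt("Sample letter text", ENTITIES))
```

Adding or removing an entity then only means editing the dictionary, and the same dictionary keys can be reused later to validate the model's JSON output.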

In [13]:
entity_prompt = f"""
<|start_of_role|>user<|end_of_role|>
    -You are an AI Entity Extractor. You help extract entities from the given information about a book. Here is the book information:
    {res}

    - Extract the following entities:

    1) `Insurance Company` : This is the name of the company.
    2) `Insurance Company Address`: This is the address of the company.
    3) `Law Firm`: Name of the Law Firm.
    4) `Law Office Address`: This is the address of the law firm.

    -Your output should strictly be in a json format, which only contains the key and value. The key here is the entity to be extracted and the value is the entity which you extracted.
    -Do not generate random entities on your own. If it is not present or you are unable to find any specified entity, you strictly have to output it as `Data not available`.
    -Only do what is asked to you. Do not give any explanations to your output and do not hallucinate.
    <|end_of_text|>
    <|start_of_role|>assistant<|end_of_role|>
"""

Invoke the model to get the results.

In [14]:
response = model.invoke(entity_prompt)
print(response)

{
  "Insurance Company": "Budget Mutual Insurance Company",
  "Insurance Company Address": "9876 Infinity Ave Springfield, MI 65541",
  "Law Firm": "Georgia Collan Parker LLP",
  "Law Office Address": "9816 51st Ave SW Auburn, Washington(WA), 98092"
}


In [15]:
dl_info = json.loads(response)
dl_info

{'Insurance Company': 'Budget Mutual Insurance Company',
 'Insurance Company Address': '9876 Infinity Ave Springfield, MI 65541',
 'Law Firm': 'Georgia Collan Parker LLP',
 'Law Office Address': '9816 51st Ave SW Auburn, Washington(WA), 98092'}
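
`json.loads(response)` works here because the model returned bare JSON. Replies sometimes carry extra text around the JSON or omit a key, so a more tolerant parser can help. The following is a sketch under assumptions, not part of the notebook: `parse_entities`, `EXPECTED_KEYS`, and `MISSING` are hypothetical names, and the fallback value reuses the `Data not available` convention from the prompt.

```python
import json

# Keys requested in the prompt; MISSING is the fallback value the prompt asks for
EXPECTED_KEYS = [
    "Insurance Company",
    "Insurance Company Address",
    "Law Firm",
    "Law Office Address",
]
MISSING = "Data not available"


def parse_entities(raw: str, expected_keys: list[str]) -> dict[str, str]:
    """Parse a model reply into a dict that contains exactly the expected keys."""
    fallback = {key: MISSING for key in expected_keys}

    # Keep only the outermost {...} span in case the reply has text around the JSON
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback

    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return fallback

    # Fill in missing or empty values and drop keys that were not requested
    return {key: str(data.get(key) or MISSING) for key in expected_keys}


sample_reply = 'Here is the result:\n{"Insurance Company": "Budget Mutual Insurance Company"}\nDone.'
print(parse_entities(sample_reply, EXPECTED_KEYS))
```

With the notebook's variables, the call would be `parse_entities(response, EXPECTED_KEYS)` in place of `json.loads(response)`.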

---