### Install the Indexify Extractor SDK and the Indexify Client

In [2]:
%%capture
!pip install indexify-extractor-sdk indexify virtualenv

### Try Out the Available Extractors

There are several PDF and invoice extractors available. Here are a few that worked well for pulling various fields out of my HOA receipt.

First, try these extractors locally to get a feel for them.

#### PDFExtractor
First, we will try PDFExtractor. It extracts both the text and the tables in one pass, and the output can then be used for question answering.

Download the extractor:

In [None]:
!indexify-extractor download hub://pdf/pdf-extractor

Load the extractor and the file:

In [19]:
from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("/Users/rishiraj/tensorlakeai/experiments/forms/Statement_HOA.pdf")

Extract the data:

In [None]:
result = extractor.extract(content)

Find the content with content_type 'text/plain':

In [21]:
text_content = next(c.data.decode('utf-8') for c in result if c.content_type == 'text/plain')
text_content

'Axis\nSTATEMENTInvoice No. 20240501-336593\nDate: 4/19/2024\nAccount Number:\nOwner:\nProperty:922000203826\nJohn Doe\n200 Park Avenue, Manhattan\nJohn Doe\n200 Park Avenue Manhattan\nNew York 10166SUMMARY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:$653.03\n$653.03\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nRegister your Resident\nPortal account now!\nRegistration Key/ID:\nFLOWR2U\nresident.actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Actio

Find the content with content_type 'application/json':

In [22]:
import json
import pandas as pd

json_content = next(c.data for c in result if c.content_type == 'application/json')

# Convert the JSON string to a Python dictionary
data_dict = json.loads(json_content)

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame.from_dict(data_dict, orient='index')

# Print the DataFrame
print(df)

             0                           1                                2  \
0         Date                 Description                       Assessment   
1   02/01/2024                      Charge                 Storage Fee 2024   
2   02/01/2024   Monthly Assessment- A12H2  Monthly Assessment- A12 H2 2024   
3   02/06/2024                      Charge               EV Charge Fee 2024   
4   02/11/2024                      eCheck                           eCheck   
5   03/01/2024  Monthly Assessment- A12 H2  Monthly Assessment- A12 H2 2024   
6   03/01/2024                      Charge                  StorageFee 2024   
7   03/11/2024                      eCheck                           eCheck   
8   04/01/2024                      Charge                 Storage Fee 2024   
9   04/01/2024  Monthly Assessment- A12 H2  Monthly Assessment- A12 H2 2024   
10  04/11/2024                      eCheck                           eCheck   
11  05/01/2024   Monthly Assessment- A12H2  Monthly 
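
The first row of the printed table holds the column names (`Date`, `Description`, `Assessment`, and so on) rather than data. A minimal sketch of promoting that row to a header, assuming `data_dict` maps a row index to a list of cell values; that shape is a guess from the printed DataFrame, `rows_to_records` is an illustrative name, and the `sample` rows are abridged from the output above (first three columns only):

```python
def rows_to_records(table):
    """Turn {row_index: [cells...]} into a list of dicts keyed by the header row."""
    # Sort by numeric row index so "10" comes after "9" even if the keys are strings.
    ordered = [table[key] for key in sorted(table, key=int)]
    header, *body = ordered
    return [dict(zip(header, row)) for row in body]

# Abridged sample: first three columns of rows 0, 1 and 3 from the printed table.
sample = {
    "0": ["Date", "Description", "Assessment"],
    "1": ["02/01/2024", "Charge", "Storage Fee 2024"],
    "3": ["02/06/2024", "Charge", "EV Charge Fee 2024"],
}

records = rows_to_records(sample)
# Look up a specific line item by name instead of scanning the raw table.
ev_rows = [r for r in records if "EV Charge" in r["Assessment"]]
print(ev_rows)
```

The resulting list can be passed to `pd.DataFrame(records)` to get a frame with named columns.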

Extract JSON from text content:

In [26]:
prompt = """Extract information according to this schema and return json in this format {"Invoice No.": "", "Date": "", "Account Number": "", "Owner": "", "Property": "", "Address": "", "Registration Key": "", "Last Month Balance": "", "Current Amount Due": "", "Due Date": ""}:
Axis\nSTATEMENTInvoice No. "Invoice No."\nDate: 4/19/2024\nAccount Number:\nOwner:\nProperty:"Account Number"\n"Owner"\n"Property"\n"Owner"\n"Property"\n"Address"SUMMARY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:"Last Month Balance"\n"Current Amount Due"\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nRegister your Resident\nPortal account now!\nRegistration Key/ID:\n"Registration Key"\nresident.actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Action Property Management."\nMake checks payable to:\nAxisAccount Number: "Account Number"\nOwner: "Owner"\nPLEASE REMIT PAYMENT TO:\n** AUTOPAY SCHEDULED **\n** NO REMITTANCE NECESSARY **CURRENT AMOUNT DUE\n"Current Amount Due"\nDUE DATE\n"Due Date"\n0049 00008330 0000922000203826 7 00065303 00000000 9"""

In [29]:
from openai import OpenAI
client = OpenAI(api_key="")  # supply your OpenAI API key here

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": prompt},
    {"role": "user", "content": text_content}
  ]
)

text_dict = response.choices[0].message.content
text_dict

'{\n    "Invoice No.": "20240501-336593",\n    "Date": "4/19/2024",\n    "Account Number": "922000203826",\n    "Owner": "John Doe",\n    "Property": "200 Park Avenue, Manhattan",\n    "Address": "200 Park Avenue Manhattan, New York 10166",\n    "Registration Key": "FLOWR2U",\n    "Last Month Balance": "$653.03",\n    "Current Amount Due": "$653.03",\n    "Due Date": "5/1/2024"\n}'
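
The model returns a JSON string, not a dictionary, so it has to be parsed before individual fields can be used. A minimal sketch, assuming the reply is plain JSON with no surrounding text (models sometimes add commentary or code fences, which this does not handle); `parse_invoice` and `REQUIRED_KEYS` are illustrative names, and the sample string is abridged from the output above:

```python
import json
from datetime import datetime
from decimal import Decimal

# Fields this sketch insists on; extend as needed.
REQUIRED_KEYS = {"Invoice No.", "Current Amount Due", "Due Date"}

def parse_invoice(raw):
    """Parse the model's JSON reply and convert the amount and date fields."""
    fields = json.loads(raw)
    missing = REQUIRED_KEYS - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # "$653.03" -> Decimal("653.03"); strip the currency sign and thousands separators.
    fields["Current Amount Due"] = Decimal(
        fields["Current Amount Due"].replace("$", "").replace(",", "")
    )
    # Dates on the statement are written month/day/year.
    fields["Due Date"] = datetime.strptime(fields["Due Date"], "%m/%d/%Y").date()
    return fields

sample = '{"Invoice No.": "20240501-336593", "Current Amount Due": "$653.03", "Due Date": "5/1/2024"}'
invoice = parse_invoice(sample)
print(invoice["Current Amount Due"], invoice["Due Date"])
```

In the notebook, the equivalent call would be `parse_invoice(text_dict)` right after the cell above.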

Question answering with extracted content:

In [32]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": str(data_dict) + str(text_dict)},
    {"role": "user", "content": "by when do I have to make the payment and what amount? also what was the EV charge amount?"}
  ]
)

# Store the answer under its own name so the parsed fields in text_dict are not overwritten
answer = response.choices[0].message.content
answer

'You have to make the payment of $653.03 by the due date of 5/1/2024. \n\nThe EV charge amount was $8599.55.'

#### LayoutLMDocumentQA
Next, we try LayoutLMDocumentQA. It can't extract all the values in one shot, but it can answer one question at a time.

Download the extractor:

In [None]:
!indexify-extractor download hub://pdf/layoutlm_document_qa

Load the extractor and the file:

In [8]:
from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("layoutlm_document_qa.document_qa:LayoutLMDocumentQA")
content = Content.from_file("/Users/diptanuc/Downloads/Statement_HOA.pdf")

Ask the extractor a question:

In [9]:
config = config_cls(query="What's the due date?")
result = extractor.extract(content, config)
result

[Feature(feature_type='metadata', name='metadata', value={'query': "What's the due date?", 'answer': '5/1/2024', 'page': 0, 'score': 0.9999791383743286}, comment=None)]

### Start the Indexify Server

To make this extractor run continuously:
1. Download the Indexify Server
2. Start it in development mode on your laptop
3. Create extraction policies with questions that extract the fields from the PDF
4. Finally, get all the extracted values for a document by making an API call

#### Download the Server
```bash
curl https://tensorlake.ai | sh
```

In [None]:
!./indexify server -d

### Create the Extraction Policies


In [2]:
from indexify import IndexifyClient
client = IndexifyClient()

In [3]:
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-due-date", input_params={"query": "Whats the due date?"})
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-outstanding", input_params={"query": "Whats the outstanding amount?"})
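
When there are more than a couple of fields, the policies can be generated from a mapping instead of written out one call at a time. A sketch using the same `add_extraction_policy` arguments as above; `QUESTIONS`, `register_policies`, and the `DryRunClient` stand-in are hypothetical names, and the stand-in only records calls so the loop can be checked without a running server:

```python
EXTRACTOR = "tensorlake/layoutlm-document-qa-extractor"

# Policy name -> question asked of every uploaded document.
QUESTIONS = {
    "hoa-fees-due-date": "Whats the due date?",
    "hoa-fees-outstanding": "Whats the outstanding amount?",
}

def register_policies(client, questions, extractor=EXTRACTOR):
    """Create one extraction policy per question."""
    for name, query in questions.items():
        client.add_extraction_policy(
            extractor=extractor, name=name, input_params={"query": query}
        )

class DryRunClient:
    """Hypothetical stand-in for IndexifyClient that only records calls."""

    def __init__(self):
        self.calls = []

    def add_extraction_policy(self, **kwargs):
        self.calls.append(kwargs)

dry_run = DryRunClient()
register_policies(dry_run, QUESTIONS)
print([call["name"] for call in dry_run.calls])
```

With the server running, pass the real `client` from the previous cell instead of `dry_run`.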

### Upload Files

In [4]:
content_id = client.upload_file("/Users/diptanuc/Downloads/Statement_HOA.pdf")

In [6]:
client.get_structured_data(content_id)

[{'id': '3Ie8VXVxfNTPAL5L',
  'content_id': 'efcf0931508836d3',
  'metadata': {'answer': '5/1/2024',
   'page': 0,
   'query': 'Whats the due date?',
   'score': 0.9999799728393556},
  'extractor_name': 'tensorlake/layoutlm-document-qa-extractor'},
 {'id': 'VmCTqMFR-m7IG0nn',
  'content_id': 'efcf0931508836d3',
  'metadata': {'answer': '$603.03',
   'page': 1,
   'query': 'Whats the outstanding amount?',
   'score': 0.9992976188659668},
  'extractor_name': 'tensorlake/layoutlm-document-qa-extractor'}]
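
Each entry pairs a query with its answer and a confidence score. A small sketch that collapses the list into a query-to-answer mapping and drops low-confidence answers; `answers_by_query` is an illustrative name, the `min_score` threshold of 0.9 is an arbitrary assumption, and the sample is trimmed from the output above:

```python
def answers_by_query(results, min_score=0.9):
    """Map each query to its answer, skipping entries scored below min_score."""
    answers = {}
    for item in results:
        meta = item["metadata"]
        if meta["score"] >= min_score:
            answers[meta["query"]] = meta["answer"]
    return answers

# Trimmed copy of the structured data shown above.
sample = [
    {"metadata": {"answer": "5/1/2024", "page": 0,
                  "query": "Whats the due date?", "score": 0.9999799728393556}},
    {"metadata": {"answer": "$603.03", "page": 1,
                  "query": "Whats the outstanding amount?", "score": 0.9992976188659668}},
]

print(answers_by_query(sample))
```

In the notebook this would be `answers_by_query(client.get_structured_data(content_id))`. The extracted amount here ($603.03) differs from the $653.03 that appears in the statement text, so a high score alone does not guarantee a correct answer.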

In [36]:
client.sql_query("select * from ingestion;")

SqlQueryResult(result=[{'answer': '$603.03', 'content_id': 'd8ec685dd9cc3505', 'page': 1, 'query': 'Whats the outstanding amount?', 'score': 0.9992976188659668}, {'answer': '5/1/2024', 'content_id': 'd8ec685dd9cc3505', 'page': 0, 'query': 'Whats the due date?', 'score': 0.9999799728393556}])