<a href="https://colab.research.google.com/github/sushantagarwal29/ragpoc/blob/main/llamaparse_RAGPOC_structuredoutput.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Parsing Complex PDFs with LlamaParse

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

> [KDB.AI](https://kdb.ai/) is a powerful knowledge-based vector database and search engine that allows you to build scalable, reliable AI applications, using real-time data, by providing advanced search, recommendation and personalization.

PDFs and other complex document types are notoriously difficult to work with, yet they are among the most common formats for publishing important business-related information. Because these file types are so common, it is key to be able to parse and ingest them quickly and accurately while cleanly extracting embedded entities such as images, tables, and graphs. If extracted correctly, all of the data held in a complex document like a PDF can be ingested into a RAG workflow to generate accurate, contextual responses for users and the business.

This sample illustrates how to use LlamaParse, a generative-AI-enabled parsing platform created by LlamaIndex, to parse and represent complex files in a way that enables effective retrieval. We will use LlamaIndex to orchestrate a RAG pipeline in which LlamaParse parses a complex academic article and extracts text and tables from it, and KDB.AI serves as the retrieval mechanism that passes relevant information about the article to an LLM.

LlamaParse transforms complex documents like PDFs into markdown or text formats, which are easily ingestible. This parsing also extracts embedded entities like tables and images.

Agenda:
1. Dependencies, Imports & Setup
2. Set API Keys for LlamaCloud, OpenAI, Cohere
3. Define KDB.AI Session
4. Create Schema and KDB.AI Table
5. Download ARXIV Article: '[LLM In-Context Recall is Prompt Dependent](https://arxiv.org/pdf/2404.08865)' by Daniel Machlab and Rick Battle
6. LlamaParse & LlamaIndex Setup
7. Parse the Document with LlamaParse into Markdown Format
8. Extract Text and Table nodes from Markdown Document
9. Create the RAG Pipeline with LlamaIndex and KDB.AI
10. Query the RAG Pipeline!

## 1. Dependencies, Imports & Setup

In order to successfully run this sample, note the following steps depending on where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

In [1]:
!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-parse
!pip install llama-index-vector-stores-kdbai
!pip install pandas
#!pip install llama-index-postprocessor-cohere-rerank
!pip install kdbai_client

Collecting llama-index
  Downloading llama_index-0.12.10-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.1-py3-none-any.whl.metadata (726 bytes)
Collecting llama-index-cli<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_cli-0.4.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.13.0,>=0.12.10 (from llama-index)
  Downloading llama_index_core-0.12.10.post1-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.1-py3-none-any.whl.metadata (684 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.6.3-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-llms-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_llms_openai-0.3.13-py3-none-any.whl.metadata (3.3 kB)


In [2]:
!pip install -U llama-index-llms-azure-inference
!pip install -U llama-index-embeddings-azure-inference

Collecting llama-index-llms-azure-inference
  Downloading llama_index_llms_azure_inference-0.3.0-py3-none-any.whl.metadata (1.7 kB)
Collecting azure-ai-inference>=1.0.0b5 (from llama-index-llms-azure-inference)
  Downloading azure_ai_inference-1.0.0b6-py3-none-any.whl.metadata (31 kB)
Collecting azure-identity<2.0.0,>=1.15.0 (from llama-index-llms-azure-inference)
  Downloading azure_identity-1.19.0-py3-none-any.whl.metadata (80 kB)
Collecting isodate>=0.6.1 (from azure-ai-inference>=1.0.0b5->llama-index-llms-azure-inference)
  Downloading isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Collecting azure-core>=1.30.0 (from azure-ai-inference>=1.0.0b5->llama-index-llms-azure-inference)
  Downloading azure_core-1.32.0-py3-none-any.whl.metadata (39 kB)
Collecting msal>=1.30.0 (from azure-identity<2.0.0,>=1.15.0->llama-index-llms-azure-inference)
  Downloading msal-1.31.1-

In [3]:
from llama_index.llms.azure_inference import AzureAICompletionsModel

In [4]:
from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel

In [5]:
from llama_parse import LlamaParse
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.kdbai import KDBAIVectorStore
#from llama_index.postprocessor.cohere_rerank import CohereRerank
from getpass import getpass
import os
import kdbai_client as kdbai


## 2. Set API Keys for LlamaCloud, OpenAI, Cohere
Get API keys here:
- [LlamaCloud](https://cloud.llamaindex.ai/)
- [OpenAI](https://platform.openai.com/api-keys)
- [Cohere](https://dashboard.cohere.com/welcome/register)

In [6]:
# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio
import nest_asyncio
nest_asyncio.apply()
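
Jupyter already runs an event loop, and the standard library refuses to start a second one inside it, which is why `nest_asyncio.apply()` is needed here. A minimal standard-library sketch of the failure it works around (the function names `parse_job` and `notebook_cell` are illustrative only, not part of LlamaParse):

```python
import asyncio


async def parse_job():
    # Stand-in for an async parsing call
    return "parsed"


async def notebook_cell():
    # Inside a running loop (as in Jupyter), asyncio.run() is rejected
    coro = parse_job()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(notebook_cell())
print(message)
```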

In [7]:
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = (
    os.environ["LLAMA_CLOUD_API_KEY"]
    if "LLAMA_CLOUD_API_KEY" in os.environ
    else getpass("LLAMA CLOUD API key: ")
)

LLAMA CLOUD API key: ··········
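
The same "environment variable, else prompt" pattern is repeated for every credential in this notebook. One way to factor it out is a small helper such as the sketch below; `get_secret` and `EXAMPLE_API_KEY` are hypothetical names introduced for illustration, and unlike the cells in this notebook the helper treats an empty value as missing.

```python
import os
from getpass import getpass


def get_secret(name, prompt=None, secret=True):
    """Return os.environ[name] if set; otherwise prompt once and cache it."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value; input echoes it (fine for endpoints)
        ask = getpass if secret else input
        value = ask(prompt or f"{name}: ")
        os.environ[name] = value
    return value


# Example with a made-up variable that is already set, so no prompt appears
os.environ["EXAMPLE_API_KEY"] = "abc123"
print(get_secret("EXAMPLE_API_KEY"))
```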


## 3. Define KDB.AI Session
KDB.AI comes in two offerings:

- KDB.AI Cloud - For experimenting with smaller generative AI projects with a vector database in our cloud.
- KDB.AI Server - For evaluating large scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use, there will be different setup steps and connection details required.

### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details: a URL endpoint and an API key. To get these, you can sign up for free [here](https://kdb.ai/get-started).

You can connect to a KDB.AI Cloud session using `kdbai.Session`, passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, they will be used automatically to connect. If they do not exist, the notebook will prompt you to enter your KDB.AI Cloud portal session URL endpoint and API key.

In [8]:
#Set up KDB.AI endpoint and API key
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

KDB.AI endpoint: https://cloud.kdb.ai/instance/5mpjbggrkg
KDB.AI API key: ··········


In [9]:
#connect to KDB.AI
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

## 4. Create Schema and KDB.AI Table

In [10]:
schema = [
        dict(name="document_id", type="str"),
        dict(name="text", type="str"),
        dict(name="embeddings", type="float32s"),
    ]

indexFlat = {
        "name": "flat",
        "type": "flat",
        "column": "embeddings",
        "params": {'dims': 1536, 'metric': 'L2'},
    }

#schema = [
#        dict(name="document_id", type="bytes"),
#        dict(name="text", type="bytes"),
#        dict(name="embeddings", type="float32s"),
#    ]
#
#indexFlat = {
#        "name": "flat",
#        "type": "flat",
#        "column": "embeddings",
#        "params": {'dims': 1536, 'metric': 'L2'},
#    }
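
The `dims` value of 1536 matches the default output size of `text-embedding-3-small`, the embedding model used later in this notebook. A `flat` index with the `L2` metric amounts to an exhaustive nearest-neighbour search by Euclidean distance. The sketch below illustrates that idea in plain Python; it is not KDB.AI's implementation, and `flat_search` is a hypothetical name.

```python
import math


def flat_search(rows, query, n=5):
    """Exhaustive nearest-neighbour search by L2 (Euclidean) distance."""
    scored = [(math.dist(vec, query), doc_id) for doc_id, vec in rows.items()]
    return sorted(scored)[:n]


# Toy 3-dimensional vectors; real rows in this notebook have 1536 dimensions
rows = {"a": [0.0, 0.0, 0.0], "b": [1.0, 0.0, 0.0], "c": [3.0, 4.0, 0.0]}
nearest = flat_search(rows, [0.9, 0.0, 0.0], n=2)
print([doc_id for _, doc_id in nearest])
```

Because every stored vector is compared with the query, a flat index returns exact results, with a cost that grows linearly with the number of rows.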

In [11]:
# Connect with kdbai database
db = session.database("default")

In [12]:
KDBAI_TABLE_NAME = "LlamaParse_Table"

# First ensure the table does not already exist
try:
    db.table(KDBAI_TABLE_NAME).drop()
except kdbai.KDBAIException:
    pass

#Create the table
table = db.create_table(KDBAI_TABLE_NAME, schema, indexes=[indexFlat])

## 6. LlamaParse & LlamaIndex Setup
We define which LLM and embedding model should be used, define the file path of the complex document, and create parsing instructions.

This notebook uses OpenAI LLM and embedding models served through Azure AI Foundry.

In [13]:
# Set up the Azure OpenAI endpoint and API key
AZURE_OPENAI_ENDPOINT = (
    os.environ["AZURE_OPENAI_ENDPOINT"]
    if "AZURE_OPENAI_ENDPOINT" in os.environ
    else input("AZURE_OPENAI_ENDPOINT: ")
)
AZURE_OPENAI_API_KEY = (
    os.environ["AZURE_OPENAI_API_KEY"]
    if "AZURE_OPENAI_API_KEY" in os.environ
    else getpass("AZURE_OPENAI_API_KEY: ")
)

AZURE_OPENAI_ENDPOINT: https://ai-depoc1aihub1643128651037.openai.azure.com/openai/deployments/
AZURE_OPENAI_API_KEY: ··········


In [14]:
#pdf_file_name = './LLM_recall.pdf'
pdf_file_name = '/content/Combined - BJG_Whitmoor Electronics PA_CW2323438_WS3894893582_0.pdf'

In [15]:
parsing_instructions = '''The attached documents are legal and long-term agreement contracts. Answer questions using the information in these documents and be precise.'''

## 7. Parse the document with LlamaParse into markdown format

In [16]:
# LlamaParse's keyword argument is `parsing_instruction` (singular)
documents = LlamaParse(result_type="markdown", parsing_instruction=parsing_instructions).load_data(pdf_file_name)

Started parsing the file under job_id 1e7edc6a-bd53-47ae-9b1f-e80cbf5577c8
.........

In [None]:
print(documents[0].text[:1000])

# Collins Aerospace

# PURCHASE AGREEMENT NO.: CW2323438

This Purchase Agreement (this “Purchase Agreement”), effective as of April 21, 2023 (“Purchase Agreement Effective Date”), is entered into by and between ROSEMOUNT AEROSPACE, INC., a part of Collins Aerospace, a Delaware Corporation, having an office and place of business at 14300 Judicial Road Burnsville, MN, 55306, KIDDE TECHNOLOGIES INC., a part of Collins Aerospace, having a place of business at 4200 Airport Drive NW, Wilson NC, 27896, SIMMONDS PRECISION PRODUCTS, INC. a part of Collins Aerospace, having a place of business at 100 Panton Road, Vergennes, VT, 05491, ROCKWELL COLLINS, INC. a part of Collins Aerospace, a Delaware Corporation, having a place of business at 400 Collins Roads NE, Cedar Rapids, IA, 52498, (“Collective Buyers”); and BJG ELECTRONICS, a New York Corporation, having an office and place of business at 141 Remington Blvd., Ronkonkoma, NY., 11779 (“Seller or BJG”). Buyer and Seller may hereinafter be indi

## 8. Extract Text and Table nodes from Markdown Document

In [17]:
llm = AzureAICompletionsModel(
    endpoint=AZURE_OPENAI_ENDPOINT+"gpt-4o",
    credential=AZURE_OPENAI_API_KEY,
    api_version="2024-08-01-preview",
)

embed_model = AzureAIEmbeddingsModel(
    endpoint=AZURE_OPENAI_ENDPOINT+"text-embedding-3-small",
    credential=AZURE_OPENAI_API_KEY,
    model_name="text-embedding-3-small",
)

Settings.llm = llm
Settings.embed_model = embed_model

In [18]:
# Parse the documents using MarkdownElementNodeParser
node_parser = MarkdownElementNodeParser(llm=llm, num_workers=8)

In [19]:
# Retrieve nodes (text) and objects (table)
nodes = node_parser.get_nodes_from_documents(documents)

0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
3it [00:00, 2877.41it/s]
1it [00:00, 628.64it/s]
0it [00:00, ?it/s]


#### Split nodes into base_nodes (text nodes) and objects (table nodes)

In [20]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
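
To give a feel for what separating text from tables involves, here is a deliberately simplified sketch that groups markdown lines into text blocks and table blocks. It is not how `MarkdownElementNodeParser` is implemented (the real parser also summarizes each table with the LLM); `split_markdown` and the sample string are assumptions made for illustration.

```python
def split_markdown(md):
    """Group consecutive lines into ('table', ...) or ('text', ...) blocks."""
    blocks = []
    for line in md.splitlines():
        if not line.strip():
            continue  # blank lines carry no content
        kind = "table" if line.lstrip().startswith("|") else "text"
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(line)
        else:
            blocks.append((kind, [line]))
    return [(kind, "\n".join(lines)) for kind, lines in blocks]


sample = """Lead Time and Firm Zone are in Calendar days

|Part|Lead Time|
|---|---|
|X-1|49|

Exhibit A"""

blocks = split_markdown(sample)
text_blocks = [body for kind, body in blocks if kind == "text"]
table_blocks = [body for kind, body in blocks if kind == "table"]
print(len(text_blocks), "text blocks,", len(table_blocks), "table block")
```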

#### Explore these extracted nodes

In [None]:
print(base_nodes[6].text[:])

Notes:

* A = Denotes that the listed Part Number is hereby added to Exhibit A of Agreement No. 961

B = Denotes that the listed Part Number is hereby deleted from Exhibit A of Agreement No. 961

C = Denotes that the listed Part Number is hereby changed on Exhibit A of Agreement No. 961

Lead Time and Firm Zone are in Calendar days

** Target quantity is the estimated volume for that part during the term of the contract, and by no means is a commitment to purchase by Collins Aerospace

*** Rounding value is the packaging size, panel quantity, or reel size, etc.

**** NCNR = Non Cancelable Non Returnable

***** OCM / OEM: Original Component Manufacturer / Original Equipment Manufacturer

 Exhibit A of Agreement No. CW2323438

 Collins Aerospace between BJG Electronics (Supplier) and Rockwell Collins (Buyer)


In [None]:
print(objects)

[IndexNode(id_='219050ab-c33c-4f3a-81fb-c2dfee830252', embedding=None, metadata={'col_schema': 'Column: Part Number\nType: string\nSummary: Unique identifier for each part.\n\nColumn: Description\nType: string\nSummary: Brief description of the part.\n\nColumn: Lead Time\nType: integer\nSummary: Time required to deliver the part (in days).\n\nColumn: Capacity\nType: integer\nSummary: Maximum number of parts that can be produced.\n\nColumn: Quantity\nType: integer\nSummary: Number of parts ordered.\n\nColumn: Unit Price\nType: string\nSummary: Price per unit of the part.\n\nColumn: Extended Price\nType: string\nSummary: Total price for the ordered quantity.\n\nColumn: Item\nType: string\nSummary: Indicates if the item is included in the order.\n\nColumn: Award\nType: string\nSummary: Indicates if the item has been awarded.\n\nColumn: Truthful Cost Data Applies\nType: string\nSummary: Indicates if truthful cost data applies to the item.'}, excluded_embed_metadata_keys=['col_schema'], exc

In [None]:
# insert the table markdown into the text of each table object
for table_node in objects:
    table_node.text = table_node.obj.text

In [23]:
objects[3].text

'This table provides details about various parts including their descriptions, unit prices, lead times, and order quantities.,\nwith the following columns:\n- A/B/C*: None\n- Collins Part Number: None\n- Part Description: None\n- UoM (USD): None\n- Lead Time (Calendar Days): None\n- Minimum Order Quantity: None\n- Minimum Ship Quantity: None\n- Rounding Value***: None\n- Consignment Agreement (Yes/No): None\n- Schedule Agreement (Yes/No): None\n- Firm Zone (Calendar Days): None\n- NCNR****: None\n'

In [21]:
print(objects[3].obj.text[:])

This table provides details about various parts including their descriptions, unit prices, lead times, and order quantities.,
with the following columns:
- A/B/C*: None
- Collins Part Number: None
- Part Description: None
- UoM (USD): None
- Lead Time (Calendar Days): None
- Minimum Order Quantity: None
- Minimum Ship Quantity: None
- Rounding Value***: None
- Consignment Agreement (Yes/No): None
- Schedule Agreement (Yes/No): None
- Firm Zone (Calendar Days): None
- NCNR****: None

|A/B/C*|Collins Part Number|Part Description|UoM (USD)|Lead Time (Calendar Days)|Minimum Order Quantity|Minimum Ship Quantity|Rounding Value***|Consignment Agreement (Yes/No)|Schedule Agreement (Yes/No)|Firm Zone (Calendar Days)|NCNR****|
|---|---|---|---|---|---|---|---|---|---|---|---|
|A|M85528/2-10-A-01|MTG DEVICE, CONNECTOR|$ 2.86|49|100|100|1|N|N|30|N|
|A|M23053/13-002-0|SLVG,SHRINK,250(125)ID,035THK,FLELAS,BK|$ 1.48|168|100|100|1|N|N|30|N|
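
Once a table is available as markdown, it can be turned into records for downstream checks. The sketch below parses a pipe-delimited table into a list of dicts, using a four-column subset of the rows shown above. `parse_markdown_table` is a hypothetical helper, and it assumes cells contain no escaped pipe characters.

```python
def split_row(line):
    """Split one markdown table row into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(md):
    """Turn a pipe-delimited markdown table into a list of row dicts."""
    lines = [ln for ln in md.strip().splitlines() if ln.strip()]
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at lines[2]
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]


table_md = """
|Collins Part Number|Part Description|UoM (USD)|Lead Time (Calendar Days)|
|---|---|---|---|
|M85528/2-10-A-01|MTG DEVICE, CONNECTOR|$ 2.86|49|
|M23053/13-002-0|SLVG,SHRINK,250(125)ID,035THK,FLELAS,BK|$ 1.48|168|
"""

rows = parse_markdown_table(table_md)
print(rows[0]["Collins Part Number"], rows[0]["Lead Time (Calendar Days)"])
```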



## 9. Create the RAG Pipeline with LlamaIndex and KDB.AI

Use KDB.AI as the vector store, insert base_nodes and objects into KDB.AI, and create a query_engine that uses Cohere for reranking.

In [24]:
vector_store = KDBAIVectorStore(table)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [25]:
#Create the index, inserts base_nodes and objects into KDB.AI
recursive_index = VectorStoreIndex(
    nodes= base_nodes + objects, storage_context=storage_context
    #,store_nodes_override=FALSE
)

In [None]:
# Query KDB.AI to ensure the nodes were inserted
table.query()

Unnamed: 0,document_id,text,embeddings
0,c775eede-38b9-4189-b222-d2545023976f,"FRAME CONTRACT\n\nBETWEEN\n\nSONACA S.A, organ...","[0.016634772, 0.0052407854, 0.023204163, -0.02..."
1,cd940be8-d324-4e5d-a75f-bb5b20d8b635,Confidential\n\n Issue 8\n\n PREAMBLE\n\n 1. D...,"[0.011731884, 0.024188302, 0.067214504, -0.018..."
2,aede5586-971e-4ebf-a603-2ba3a7ec9146,TERMINATION OF THE PRIME CONTRACT\n\n Issue\n\...,"[0.0038611756, 0.027351115, 0.06129448, -0.026..."
3,54e2b497-fa29-4e23-9206-5e6b634425d4,Confrac? Rer SONIAERO/BL89\n\nconfideNTiAl\n\n...,"[-0.012501534, -0.022194654, 0.06440267, -0.01..."
4,03150c8b-fc67-447a-9a72-6f40d6343347,Contract Ref. SONIAEROIBL89\n\nConfidential\n\...,"[-0.009399878, 0.019608628, 0.04198308, -0.034..."
5,ab3b2bf2-5597-40f6-b2e3-d6c0fcb47b50,Requirements\n\nin drawings and related docume...,"[-0.003455646, 0.031686407, 0.07114641, -0.000..."
6,466ba365-5ad0-445d-b63f-9b9390b7563e,CONFIDENTIAL\n\n The specific Issue\n\n1. Spec...,"[-0.018013405, 0.0034069584, 0.032839824, -0.0..."
7,8edce628-89b0-459c-8513-5bfa33cc152e,LOGISTICS REQUIREMENTS\n\nIssue 8\n\n1. Logist...,"[-0.013366911, 0.07309987, 0.044199917, -0.001..."
8,de440177-dd53-4fe9-8c90-ab5bd2ab9e30,FINANCIAL HEALTH\n\n 7. PARENT COMPANY GUARANT...,"[-0.030599749, 0.05298548, 0.049993955, 0.0178..."
9,4e8acde8-305a-40fe-9e6c-6777302f3206,8.1. Specific Tooling\n\nSpecific Tooling can ...,"[-0.035280775, 0.05995904, 0.061578143, -0.003..."


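New logic using structured outputs: query KDB.AI directly, then ask the chat model for a reply that follows a JSON schema (`response_format`).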

In [26]:
AZURE_OPENAI_ENDPOINT = (
    os.environ["AZURE_OPENAI_ENDPOINT"]
    if "AZURE_OPENAI_ENDPOINT" in os.environ
    else input("AZURE_OPENAI_ENDPOINT: ")
)
AZURE_OPENAI_API_KEY = (
    os.environ["AZURE_OPENAI_API_KEY"]
    if "AZURE_OPENAI_API_KEY" in os.environ
    else getpass("AZURE_OPENAI_API_KEY: ")
)

AZURE_OPENAI_ENDPOINT: https://ai-depoc1aihub1643128651037.openai.azure.com
AZURE_OPENAI_API_KEY: ··········


In [79]:
from openai import AzureOpenAI

# Azure OpenAI client used for both query embeddings and chat completions
client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2024-08-01-preview",
)


def embed_query(query):
    """Embed the query with the same model used to index the documents."""
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    )
    return query_embedding.data[0].embedding


def retrieve_data(query):
    """Return the text of the 5 rows nearest to the query in the KDB.AI table."""
    query_embedding = embed_query(query)
    # 'flat' is the index name defined in indexFlat above
    results = table.search(vectors={"flat": [query_embedding]}, n=5)
    # results[0] holds the matches for the first (and only) query vector
    return [row["text"] for _, row in results[0].iterrows()]


def RAG(query):
    """Answer `query` from the retrieved context, as JSON matching the schema below."""
    question = "You will answer this question based on the provided reference material: " + query
    # Concatenate the retrieved chunks into a single context string
    context = "Here is the provided context: \n"
    for data in retrieve_data(query):
        context += data + "\n"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": question},
            {"role": "user", "content": [{"type": "text", "text": context}]},
        ],
        # Structured output: constrain the reply to this JSON schema
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "contract_details_extraction",
                "schema": {
                    "type": "object",
                    "properties": {
                        "contract_start_date": {"type": "string"},
                        "contract_end_date": {"type": "string"},
                    },
                },
            },
        },
        max_tokens=1200,
    )
    return response.choices[0].message.content

In [80]:
print(RAG(query= """You are an AI assistant specialized in analyzing legal contracts and long term agreements.
Your task is to extract relevant information from a given contract document.
Your output must be a structured JSON object.

Instructions:
1. Carefully read the entire contract documents
2. Extract the relevant information.
3. Present your findings in JSON format as specified below.

Important Notes:
- Extract only relevant information.
- Consider the context of the entire contract when determining relevance.
- Do not be verbose, only respond with the correct format and information.
- Some docs may have multiple relevant excerpts -- include all that apply.
- Some questions may have no relevant excerpts -- just return ["N/A"].
- Do not include additional JSON keys beyond the ones listed here.
- Do not include the same key multiple times in the JSON.

Expected JSON keys and explanation of what they are:
- 'contract_start_date': The start date or effective date of the contract in mm/dd/yyyy format
- 'contract_end_date': The end date of the contract in mm/dd/yyyy format
"""))

{"contract_start_date":"04/21/2023","contract_end_date":"03/31/2025"}
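
The schema passed to `response_format` lists no `required` keys, so, going by the schema alone, a reply could omit a field, and a date could come back as `"N/A"`. A small post-check is therefore worth considering before the values are used downstream. The sketch below is one possible check; `check_contract_dates` is a hypothetical helper, and the input is the reply printed above.

```python
import json
from datetime import datetime

EXPECTED_KEYS = {"contract_start_date", "contract_end_date"}


def check_contract_dates(raw):
    """Parse the model's reply and verify the keys and mm/dd/yyyy dates."""
    data = json.loads(raw)
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # strptime raises ValueError if a value is not in mm/dd/yyyy form
    return {key: datetime.strptime(data[key], "%m/%d/%Y").date() for key in EXPECTED_KEYS}


dates = check_contract_dates(
    '{"contract_start_date":"04/21/2023","contract_end_date":"03/31/2025"}'
)
print(dates["contract_end_date"] > dates["contract_start_date"])
```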


In [None]:
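### Define reranker
# Requires `pip install llama-index-postprocessor-cohere-rerank` (commented out in section 1)
# and a Cohere API key, for example in the COHERE_API_KEY environment variable
from llama_index.postprocessor.cohere_rerank import CohereRerank
cohere_rerank = CohereRerank(top_n=10)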

### Create the query_engine to execute RAG pipeline using LlamaIndex, KDB.AI, and Cohere reranker
query_engine = recursive_index.as_query_engine(similarity_top_k=20,
                                               node_postprocessors=[cohere_rerank],
                                               vector_store_kwargs={
                                                    "index" : "flat",
                                                },
                                            )
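
The query engine above follows a two-stage pattern: fetch a generous shortlist by vector similarity (`similarity_top_k=20`), then let a reranker reorder it and keep the best few (`top_n=10`). The sketch below shows that control flow with a toy keyword scorer standing in for the Cohere reranker. The function names, the scorer, and the candidate data are all assumptions made for illustration.

```python
def retrieve_then_rerank(candidates, rerank_score, top_k=20, top_n=10):
    """Keep the top_k candidates by vector similarity, then reorder by rerank_score."""
    shortlist = sorted(candidates, key=lambda c: c["similarity"], reverse=True)[:top_k]
    return sorted(shortlist, key=rerank_score, reverse=True)[:top_n]


# Toy scorer (an assumption): count query words that appear in the chunk text
query_words = {"contract", "end", "date"}


def keyword_overlap(chunk):
    return len(query_words & set(chunk["text"].lower().split()))


candidates = [
    {"text": "Lead time and firm zone", "similarity": 0.82},
    {"text": "The contract end date is stated in the term clause", "similarity": 0.80},
    {"text": "Packaging and rounding value", "similarity": 0.10},
]

top = retrieve_then_rerank(candidates, keyword_overlap, top_k=2, top_n=1)
print(top[0]["text"])
```

The chunk with the slightly lower vector similarity wins after reranking, which is the situation a reranker is meant to catch.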

## 10. Query the RAG Pipeline!
All the work is complete! Now we can ask questions about the article, whether the information is contained in text or in tables.

In [None]:
query_1 = """You are an AI assistant specialized in analyzing legal contracts and long term agreements.
Your task is to extract relevant information from a given contract document.
Your output must be a structured JSON object.

Instructions:
1. Carefully read the entire contract documents
2. Extract the relevant information.
3. Present your findings in JSON format as specified below.

Important Notes:
- Extract only relevant information.
- Consider the context of the entire contract when determining relevance.
- Do not be verbose, only respond with the correct format and information.
- Some docs may have multiple relevant excerpts -- include all that apply.
- Some questions may have no relevant excerpts -- just return ["N/A"].
- Do not include additional JSON keys beyond the ones listed here.
- Do not include the same key multiple times in the JSON.

Expected JSON keys and explanation of what they are:
- 'contract_end_date': The end date of the contract.
- 'item_identifier': Comma-separated list of the items in the contract
- 'Party1': First Party name
- 'Party1_address': First Party address
- 'Party2': Second Party name
- 'Party2_address': Second Party address
- 'signing_date': The date the contract was signed.
- 'contract_start_date': The start date of the contract.
- 'term_of_payment': Description of the payment terms.
- 'contract_value': Value of contract if mentioned.
- 'contract_number': ID of contract.
- 'contract_type': Type of contract.
"""

response_1 = query_engine.query(query_1)

print(str(response_1))


```json
{
  "contract_end_date": "September 30, 2025",
  "item_identifier": "N/A",
  "Party1": "Honeywell International Inc.",
  "Party1_address": "1300 W Warner Rd, Tempe, AZ 85284",
  "Party2": "United Avionics Inc",
  "Party2_address": "38 Great Hill Rd, Naugatuck, CT 06770",
  "signing_date": "March 1, 2021",
  "contract_start_date": "October 1, 2020",
  "term_of_payment": "N/A",
  "contract_value": "N/A",
  "contract_number": "W56HZV-20-D-0062",
  "contract_type": "Stand-Alone Government Program Contract"
}
```
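
The query engine's reply arrives wrapped in a markdown code fence, so it cannot be passed to `json.loads` directly. A small sketch of one way to unwrap it follows; `extract_json` is a hypothetical helper, and the sample reply reuses two fields from the output above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the markdown code-fence marker


def extract_json(reply):
    """Parse JSON from a reply that may be wrapped in a markdown code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    return json.loads(payload)


reply = (
    FENCE
    + 'json\n{"contract_number": "W56HZV-20-D-0062", "term_of_payment": "N/A"}\n'
    + FENCE
)
record = extract_json(reply)
print(record["contract_number"])
```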


In [None]:
query_1 = """You are an AI assistant specialized in analyzing legal contracts and long term agreements.
Your task is to extract goods/part numbers/product and their related information from a given contract document.
Your output must be a structured JSON object.

Instructions:
1. Carefully read the entire contract documents
2. Extract the relevant information most probably represented in one or multiple tables.
3. Present your findings in JSON format as specified below.

Important Notes:
- Extract only relevant information.
- Consider the context of the entire contract when determining relevance.
- Do not be verbose, only respond with the correct format and information.
- Some docs may have multiple relevant excerpts -- include all that apply.
- Some questions may have no relevant excerpts -- just return ["N/A"].
- Do not include additional JSON keys beyond the ones listed here.
- Do not include the same key multiple times in the JSON.

Expected JSON keys and explanation of what they are:
- Part Number: Unique identifier for each component.
- Description: Brief description of the component.
- Lead Time: Time in days required to deliver the component.
- Capacity: Maximum production capacity for the component.
- Quantity: Number of units available.
- Unit Price: Price per individual unit of the component.
- Extended Price: Total price for the quantity available.
- Item: Indicates if the item is relevant.
- Award: Indicates if the item has been awarded.
- Truthful Cost Data Applies: Indicates if truthful cost data is applicable.
"""

response_1 = query_engine.query(query_1)

print(str(response_1))


```json
[
    {
        "Part Number": "N/A",
        "Description": "N/A",
        "Lead Time": "N/A",
        "Capacity": "N/A",
        "Quantity": "N/A",
        "Unit Price": "N/A",
        "Extended Price": "N/A",
        "Item": "N/A",
        "Award": "N/A",
        "Truthful Cost Data Applies": "N/A"
    }
]
```


In [55]:
print(RAG(query= """You are an AI assistant specialized in analyzing legal contracts and long term agreements.
Your task is to union all tables and display as a single table.
"""))

NotFoundError: Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}

In [None]:
print(RAG(query= """You are an AI assistant specialized in analyzing legal contracts and long term agreements.
Your task is to extract all goods/part numbers/product and their related information from a given contract document.
Combine all tables into single output.
Your output must be a structured JSON object.

Instructions:
1. Carefully read the entire contract documents
2. Extract the relevant information most probably represented in one or multiple tables.
3. If multiple tables then combine all tables into a single table
4. Present your findings in JSON format as specified below.

Important Notes:
- Extract only relevant information.
- Consider the context of the entire contract when determining relevance.
- Do not be verbose, only respond with the correct format and information.
- Some docs may have multiple relevant excerpts -- include all that apply.
- Some questions may have no relevant excerpts -- just return ["N/A"].
- Do not include additional JSON keys beyond the ones listed here.
- Do not include the same key multiple times in the JSON.

Expected JSON keys and explanation of what they are:
- Part Number: Unique identifier for each component.
- Description: Brief description of the component.
- Lead Time: Time in days required to deliver the component.
- Capacity: Maximum production capacity for the component.
- Quantity: Number of units available.
- Unit Price: Price per individual unit of the component.
- Extended Price: Total price for the quantity available.
- Item: Indicates if the item is relevant.
- Award: Indicates if the item has been awarded.
- Truthful Cost Data Applies: Indicates if truthful cost data is applicable.
"""))

```json
[
    {
        "Part Number": "M85528/2-10-A-01",
        "Description": "MTG DEVICE, CONNECTOR",
        "Lead Time": "49",
        "Capacity": "N/A",
        "Quantity": "N/A",
        "Unit Price": "2.86",
        "Extended Price": "N/A",
        "Item": "Yes",
        "Award": "No",
        "Truthful Cost Data Applies": "N/A"
    },
    {
        "Part Number": "M23053/13-002-0",
        "Description": "SLVG,SHRINK,250(125)ID,035THK,FLELAS,BK",
        "Lead Time": "168",
        "Capacity": "N/A",
        "Quantity": "N/A",
        "Unit Price": "1.48",
        "Extended Price": "N/A",
        "Item": "Yes",
        "Award": "No",
        "Truthful Cost Data Applies": "N/A"
    }
]
```


## Delete the KDB.AI Table

Once finished with the table, it is best practice to drop it.

In [50]:
print(RAG(query= """You are an AI assistant specialized in analyzing legal contracts and long term agreements.
Your task is to extract relevant information from a given contract document.
Your output must be a structured JSON object.

Instructions:
1. Carefully read the entire contract documents
2. Extract the relevant information.
3. Present your findings in JSON format as specified below.

Important Notes:
- Extract only relevant information.
- Consider the context of the entire contract when determining relevance.
- Do not be verbose, only respond with the correct format and information.
- Some docs may have multiple relevant excerpts -- include all that apply.
- Some questions may have no relevant excerpts -- just return ["N/A"].
- Do not include additional JSON keys beyond the ones listed here.
- Do not include the same key multiple times in the JSON.

Expected JSON keys and explanation of what they are:
- 'contract_start_date': The start date or effective date of the contract in mm/dd/yyyy format
- 'contract_end_date': The end date of the contract in mm/dd/yyyy format
- 'Buyer': Pipe (|) separated list of the names of buyers who are buying products or services from the supplier
- 'Buyer_addresses': First Party address
- 'Supplier': Pipe (|) separated list of the names of suppliers who are providing products or services to the buyer
- 'Supplier_address': Second Party address
- 'signing_date': The date the contract was signed.
- 'contract_start_date': The start date of the contract.
- 'term_of_payment': Description of the payment terms.
- 'contract_value': Value of contract if mentioned.
- 'contract_number': ID of contract.
- 'contract_type': Type of contract.
"""))

TypeError: Completions.create() got an unexpected keyword argument 'ResponseFormatJsonSchema'

In [40]:
print(RAG(query= """You are an AI assistant specialized in analyzing legal contracts and long term agreements.
Your task is to extract relevant information from a given contract document.
Your output must be a valid csv file

Instructions:
1. Carefully read the entire contract documents
2. Extract the relevant information.
3. Present your findings in CSV format as specified below.
4. Do not prefix or suffix any text in your reply
5. Output only csv data and no verbose information

Important Notes:
- Extract only relevant information.
- Consider the context of the entire contract when determining relevance.
- Do not be verbose, only respond with the correct format and information.
- Some docs may have multiple relevant excerpts -- include all that apply.
- Some questions may have no relevant excerpts -- just return "N/A".
- Do not include additional fields beyond the ones listed here.
- Do not include the same key multiple times

Expected fields in csv and explanation of what they are:
- 'contract_start_date': The start date or effective date of the contract in mm/dd/yyyy format
- 'contract_end_date': The end date of the contract in mm/dd/yyyy format
- 'Buyer': Pipe (|) separated list of the names of buyers who are buying products or services from the supplier
- 'Buyer_addresses': First Party address
- 'Supplier': Pipe (|) separated list of the names of suppliers who are providing products or services to the buyer
- 'Supplier_address': Second Party address
- 'signing_date': The date the contract was last signed in mm/dd/yyyy format
- 'contract_number': ID of contract.
- 'contract_type': Type of contract.
"""))

```csv
contract_start_date,contract_end_date,Buyer,Buyer_addresses,Supplier,Supplier_address,signing_date,contract_number,contract_type
04/21/2023,03/31/2025,ROSEMOUNT AEROSPACE, INC. | KIDDE TECHNOLOGIES, INC. | SIMMONDS PRECISION PRODUCTS, INC. | ROCKWELL COLLINS, INC.,14300 Judicial Road Burnsville, MN, 55306 | 4200 Airport Drive NW, Wilson NC, 27896 | 100 Panton Road, Vergennes, VT, 05491 | 400 Collins Roads NE, Cedar Rapids, IA, 52498,BJG ELECTRONICS,141 Remington Blvd. Ronkonkoma, NY., 11779,"04/13/2023 | 04/17/2023",CW2323438,PURCHASE AGREEMENT
```
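
Note that the CSV above is not well formed: values such as company names and addresses contain commas but are not quoted, so a CSV reader will split them into extra fields. The sketch below demonstrates the problem on two values taken from that row, and shows that Python's `csv.writer` quotes such values automatically.

```python
import csv
import io

header = ["Buyer", "Buyer_addresses"]
unquoted_row = "ROSEMOUNT AEROSPACE, INC.,14300 Judicial Road Burnsville, MN, 55306"

# Unquoted commas split each value into several fields
fields = next(csv.reader([unquoted_row]))
print(len(fields), "fields parsed for", len(header), "columns")

# csv.writer quotes any value that contains the delimiter
buffer = io.StringIO()
csv.writer(buffer).writerow(
    ["ROSEMOUNT AEROSPACE, INC.", "14300 Judicial Road Burnsville, MN, 55306"]
)
round_trip = next(csv.reader([buffer.getvalue()]))
print(round_trip)
```

Requesting JSON and converting it to CSV in code is a more dependable route than asking the model to emit CSV text directly.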


In [None]:
print(RAG(query= """You are an AI assistant specialized in analyzing legal contracts and long term agreements.
Your task is to extract relevant information from a given contract document.
Your output must be a structured JSON object.

Instructions:
1. Carefully read the entire contract documents
2. Extract the relevant information.
3. Present your findings in JSON format as specified below for each identified product or part number.

Important Notes:
- Extract only relevant information.
- Consider the context of the entire contract when determining relevance.
- Do not be verbose, only respond with the correct format and information.
- Some docs may have multiple relevant excerpts -- include all that apply.
- Some questions may have no relevant excerpts -- just return ["N/A"].
- Do not include additional JSON keys beyond the ones listed here.
- Do not include the same key multiple times in the JSON.

Expected JSON keys and explanation of what they are:
- 'contract_end_date': The end date of the contract.
- 'item_identifier': Comma-separated list of the items in the contract
- 'Party1': First Party name
- 'Party1_address': First Party address
- 'Party2': Second Party name
- 'Party2_address': Second Party address
- 'signing_date': The date the contract was signed.
- 'contract_start_date': The start date of the contract.
- 'term_of_payment': Description of the payment terms.
- 'contract_value': Value of contract if mentioned.
- 'contract_number': ID of contract.
- 'contract_type': Type of contract.
"""))

```json
{
  "contract_end_date": "March 31, 2025",
  "item_identifier": "M85528/2-10-A-01, M23053/13-002-0",
  "Party1": "BJG Electronics",
  "Party1_address": "141 Remington Blvd., Ronkonkoma, NY., 11779",
  "Party2": "Rosemount Aerospace, Inc., a part of Collins Aerospace, Kidde Technologies, Inc., a part of Collins Aerospace",
  "Party2_address": "14300 Judicial Road, Burnsville, MN. 55306, 4200 Airport Drive NW, Wilson, NC. 27896",
  "signing_date": "17 April 2023",
  "contract_start_date": "13 April 2023",
  "term_of_payment": "N/A",
  "contract_value": "N/A",
  "contract_number": "CW2323438",
  "contract_type": "Purchase Agreement"
}
```


In [None]:
table.drop()