# Large Scale Document Analysis and Processing

This notebook demonstrates how to use [LangChain](https://www.langchain.com/) to develop a Retrieval Augmented Generation (RAG) pattern. It uses Azure AI Document Intelligence as the document loader, which can extract tables, paragraphs, and layout information from PDF, image, Office, and HTML files. The output markdown can be passed to LangChain's markdown header splitter, which enables semantic chunking of the documents. The chunked documents are then indexed into an Azure AI Search vector store. Given a user query, the notebook uses Azure AI Search to retrieve the relevant chunks, then feeds that context into the prompt together with the query to generate the answer.


![Semantic chunking in RAG](https://github.com/microsoft/Form-Recognizer-Toolkit/blob/main/SampleCode/media/semantic-chunking-rag.png?raw=true)


## Prerequisites
- An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe**. Follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have one.
- An Azure AI Search resource. Follow [this document](https://learn.microsoft.com/azure/search/search-create-service-portal) to create one if you don't have one.
- An Azure OpenAI resource with deployments for an embeddings model and a chat model. Follow [this document](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) to create one if you don't have one.

## Setup

In [None]:
! pip install python-dotenv langchain langchain-core langchain-community langchain-openai langchainhub openai tiktoken azure-ai-documentintelligence azure-identity azure-search-documents==11.6.0b3

In [1]:
"""
This code loads environment variables using the `dotenv` library and sets the necessary environment variables for Azure services.
The environment variables are loaded from the `.env` file in the same directory as this notebook.
"""
import os
from dotenv import load_dotenv

load_dotenv()

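# load_dotenv() has already copied the .env values into os.environ; fail early with a clear message if a required one is missing.
for required_var in ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"):
    if not os.getenv(required_var):
        raise EnvironmentError(f"Missing required environment variable: {required_var}")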
os.environ["OPENAI_API_VERSION"] = os.getenv("OPENAI_API_VERSION") or "2024-05-01-preview"
doc_intelligence_endpoint = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
doc_intelligence_key = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")

In [2]:
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_openai import AzureOpenAIEmbeddings
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_community.vectorstores.azuresearch import AzureSearch

## Load a document and split it into semantic chunks

In [None]:
# Initialize the Azure AI Document Intelligence loader. Pass either file_path (local file) or url_path (URL) to point at the document.
project2_sas_token = r"...." # Full SAS URL of the document (the document URL including the SAS token)
loader = AzureAIDocumentIntelligenceLoader(url_path=project2_sas_token, api_key=doc_intelligence_key, api_endpoint=doc_intelligence_endpoint, api_model="prebuilt-layout")
docs = loader.load()

# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

docs_string = docs[0].page_content
splits = text_splitter.split_text(docs_string)

print("Length of splits: " + str(len(splits)))
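
To make the header-based splitting above more concrete, here is a minimal plain-Python sketch of the idea. It is not LangChain's implementation: `split_on_headers` and the sample text are hypothetical, and the sketch only assumes that each chunk should carry the headers it sits under as metadata, in the same `Header 1` / `Header 2` shape used in this notebook.

```python
import re

def split_on_headers(markdown_text, levels=(1, 2)):
    """Split markdown into chunks, tracking the active header at each level."""
    header_re = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
    chunks = []
    active = {}   # header level -> current header text
    buffer = []   # body lines collected under the active headers

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            metadata = {f"Header {lvl}": active[lvl] for lvl in sorted(active)}
            chunks.append({"page_content": text, "metadata": metadata})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) in levels:
            flush()
            level = len(match.group(1))
            # A new header invalidates any same-level or deeper headers seen earlier.
            for stale in [k for k in active if k >= level]:
                del active[stale]
            active[level] = match.group(2)
        else:
            # Headers at levels we do not split on stay in the chunk body.
            buffer.append(line)
    flush()
    return chunks

# Made-up sample text, for illustration only.
sample = "# Scope\nIntro text.\n## Warranty\n24 months.\n# Attachments\nPlot plan."
sample_chunks = split_on_headers(sample)
for chunk in sample_chunks:
    print(chunk["metadata"], "->", chunk["page_content"])
```

Each chunk ends where the next tracked header begins, which is why the chunks follow the document's own sections rather than a fixed character count.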

## Inspect the semantically split chunks

In [None]:
import pandas as pd
df = pd.DataFrame([vars(split) for split in splits])
pd.set_option('display.max_colwidth', None)
df.head(5)

## Embed and index the chunks

Indexing the chunks is a great opportunity to:
- Extract additional metadata from the chunks for better deterministic search
- Embed the chunks for semantic search and hybrid search
- Design a schema in a way that allows for fast retrieval using filters

Small Language Models (SLMs) are well suited to document structure analysis. For this purpose, a locally hosted SLM could be a more cost-effective alternative at scale while preserving quality.

The first metadata items to add here are:
- Project ID
- Project Name
- Product ID
- Product Name

Semantic chunking via Markdown headers also captures the document's hierarchy: each chunk's metadata records the headers it sits under. This helps with context expansion and with establishing parent-child relationships between chunks.
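
As a rough illustration of the parent-child idea, the sketch below derives two extra fields from a chunk's header metadata. The function `add_hierarchy` and the field names `section_path` and `parent_path` are hypothetical and not part of the index used in this notebook; the header values are taken from the example document shown later.

```python
def add_hierarchy(metadata, header_keys=("Header 1", "Header 2")):
    """Return hypothetical `section_path` / `parent_path` fields derived from header metadata."""
    path = []
    for key in header_keys:
        if key not in metadata:
            break  # stop at the first missing level so the path stays contiguous
        path.append(metadata[key])
    return {
        "section_path": " > ".join(path),
        "parent_path": " > ".join(path[:-1]),
    }

child = add_hierarchy({"Header 1": "Table of contents", "Header 2": "11 Attachments"})
print(child)
```

Chunks that share the same `parent_path` are siblings, so a retriever could expand the context around a hit by also fetching its siblings, assuming such a field were stored in the index and made filterable.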

In [None]:
df['project_id'] = '2'
df['project_name'] = 'Product A'
df['product_id'] = '1'
df['product_name'] = 'Cooler for Product A'
df.head(5)

## Now let's do the same thing for Document 1

In [None]:
# Initialize the Azure AI Document Intelligence loader. Pass either file_path (local file) or url_path (URL) to point at the document.
project1_sas_token = r"...." # Full SAS URL of the document (the document URL including the SAS token)
loader = AzureAIDocumentIntelligenceLoader(url_path=project1_sas_token, api_key = doc_intelligence_key, api_endpoint = doc_intelligence_endpoint, api_model="prebuilt-layout")
docs = loader.load()
docs_string = docs[0].page_content
splits_1 = text_splitter.split_text(docs_string)


In [None]:
df1 = pd.DataFrame([vars(split) for split in splits_1])
df1['project_id'] = '1'
df1['project_name'] = 'Product B'
df1['product_id'] = '1'
df1['product_name'] = 'Rotary Kiln for drying Lithium Hydroxide'
df_combined = pd.concat([df, df1], ignore_index=True)
df_combined = df_combined.drop(columns=['type'])
df_combined


In [None]:
# Embed the split documents and prepare them for insertion into the Azure AI Search vector store

import json
import uuid

aoai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment="....",  # e.g., "text-embedding-3-large"
    openai_api_version="...",  # e.g., "2023-12-01-preview"
)

vector_store_address: str = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password: str = os.getenv("AZURE_SEARCH_ADMIN_KEY")

index_name: str = "..." # name of your search index
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=aoai_embeddings.embed_query,
)

df_combined["metadata"] = df_combined["metadata"].apply(json.dumps)
df_combined['id'] = [str(uuid.uuid4()) for _ in range(len(df_combined))]
df_combined["page_content_vector"] = df_combined["page_content"].apply(aoai_embeddings.embed_query)
df_combined_dict = df_combined.to_dict(orient='records')

## Now store the chunked and embedded documents in the index

In [None]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
search_client: SearchClient = SearchClient(endpoint=vector_store_address, index_name=index_name, credential=AzureKeyCredential(vector_store_password))
search_client.upload_documents(documents=df_combined_dict)
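
For larger document sets, a single `upload_documents` call may not be enough, because Azure AI Search caps the number of documents per indexing request (1,000 at the time of writing, to the best of my knowledge). A minimal sketch of splitting the records into batches follows; the helper `batched` and the demo records are hypothetical.

```python
def batched(records, batch_size=1000):
    """Yield consecutive slices of `records` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Stand-in records; in the notebook this would be df_combined_dict.
demo_records = [{"id": str(i)} for i in range(2500)]
batch_sizes = [len(batch) for batch in batched(demo_records)]
print(batch_sizes)  # [1000, 1000, 500]
```

In the notebook, each batch could then be sent with `search_client.upload_documents(documents=batch)` inside a loop over `batched(df_combined_dict)`.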

## Retrieve relevant chunks based on a question

In [None]:
from azure.search.documents.models import VectorizedQuery

# Retrieve relevant chunks based on the question
query = "equipment warranty conditions for project 1 rotary kiln"
query_embedding = aoai_embeddings.embed_query(query)
vector_query = VectorizedQuery(vector=query_embedding, k_nearest_neighbors=3, fields="page_content_vector")
results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    select=["page_content", "project_id", "project_name", "product_id", "product_name"],
)
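
The query above mentions "project 1", but the search relies on text and vector similarity alone. If the metadata fields are filterable in your index, a deterministic filter can narrow the candidates first. Below is a minimal sketch of building such an OData filter string; the helper `odata_eq_filter` is hypothetical and only handles string equality.

```python
def odata_eq_filter(**conditions):
    """Build an OData filter that ANDs together `field eq 'value'` clauses."""
    clauses = []
    for field, value in conditions.items():
        # OData string literals escape a single quote by doubling it.
        escaped = str(value).replace("'", "''")
        clauses.append(f"{field} eq '{escaped}'")
    return " and ".join(clauses)

filter_expression = odata_eq_filter(project_id="1", product_id="1")
print(filter_expression)  # project_id eq '1' and product_id eq '1'
```

The resulting string can be passed to the search call through its `filter` argument, for example `search_client.search(search_text=query, vector_queries=[vector_query], filter=filter_expression)`.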

## Using GPT-4o to create optimized JSON queries for Azure AI Search
LLMs are strong code generators, so we can ask one to write the search query, including filters, from the index definition and an example document.

In [None]:
llm = AzureChatOpenAI(
    api_version="...", # e.g., "2024-05-01-preview"
    azure_deployment="...", # e.g., "gpt-4o"
    temperature=0
)

# Let's try to check if the LLM can create filters for us
index_definition = r"""{
  "@odata.context": "https://sa-cognitivesearch-1.search.windows.net/$metadata#indexes/$entity",
  "@odata.etag": "\"0x8DC90792DAD43E3\"",
  "name": "ge-lcp-docindex",
  "defaultScoringProfile": null,
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "page_content",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "project_id",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "project_name",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "product_id",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "product_name",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "page_content_vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": 3072,
      "vectorSearchProfile": "vector-profile-1718812341384",
      "vectorEncoding": null,
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": {
    "defaultConfiguration": null,
    "configurations": [
      {
        "name": "semantic-ge-lcp-docindex",
        "prioritizedFields": {
          "titleField": {
            "fieldName": "product_name"
          },
          "prioritizedContentFields": [
            {
              "fieldName": "page_content"
            },
            {
              "fieldName": "metadata"
            }
          ],
          "prioritizedKeywordsFields": [
            {
              "fieldName": "project_name"
            }
          ]
        }
      }
    ]
  },
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-config-1718812342680",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        },
        "exhaustiveKnnParameters": null
      }
    ],
    "profiles": [
      {
        "name": "vector-profile-1718812341384",
        "algorithm": "vector-config-1718812342680",
        "vectorizer": "vectorizer-1718812345799",
        "compression": "vector-1718812082632-compressor"
      }
    ],
    "vectorizers": [
      {
        "name": "vectorizer-1718812345799",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "https://salekh-openai-swedenc.openai.azure.com",
          "deploymentId": "salekh-swedenc-text-embedding-3-large",
          "apiKey": "<redacted>",
          "modelName": "text-embedding-3-large",
          "authIdentity": null
        },
        "customWebApiParameters": null,
        "aiServicesVisionParameters": null,
        "amlParameters": null
      }
    ],
    "compressions": [
      {
        "name": "vector-1718812082632-compressor",
        "kind": "scalarQuantization",
        "rerankWithOriginalVectors": true,
        "defaultOversampling": 4,
        "scalarQuantizationParameters": {
          "quantizedDataType": "int8"
        }
      }
    ]
  }
}"""
example_document = r"""{
      "id": "dcaf9c27-3e65-4022-8166-addd9e4a20c6",
      "page_content": "||||\n| - | - | - |\n| :selected: X | Attachment | 1 Plot plan Product A Cooler System and Maintenance Arrangement |\n| :selected: X | Attachment | 2 P&ID Product A Cooler Package |\n| :selected: X | Attachment | 3 Sample Structure Sequences |\n| :selected: X | Attachment | 4 Sample list of electrical consumers |\n| :selected: X | Attachment | 5 Sample Mechanical Equipment List |\n| :selected: X | Attachment | 6 BASF Technical Rule E-P-MC 911, Pressure Vessels Fabricated from Metallic Materials; Technical Standards to be met by the Manufacturer (Issue Oct 2021) |\n| X :selected: :unselected: | Attachment | 6A BASF Technical Rule E-P-MC 911, Annex 8.2 Submission of Certification Documents in Electronic Form; Creation of Documents for Digital Certification (Issue Oct 2021) |\n| :selected: X | Attachment | 6B BASF Technical Rule E-P-MC 911, Sample Form 100 EN (replaced by Annex 8.3)G-R-MC 100 M - Minimum Safety and Health Requirements - Machinery |  \n<!-- PageFooter=\"CONFIDENTIAL\" -->",
      "project_id": "2",
      "project_name": "Product A",
      "product_id": "1",
      "product_name": "Cooler for Product A",
      "metadata": "\"\\\"\\\\\\\"\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"{\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"Header 1\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"Table of contents\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\", \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"Header 2\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"11 Attachments\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"}\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\"\\\\\\\"\\\"\""
    }"""
prompt = f"""You are an AI chatbot that specializes in creating expert JSON queries for Azure AI Search. Create an optimized query based
           on the following index definition and the example document. The query should retrieve the most relevant documents based on the
           given example document. The query should be optimized for the given index definition. The query should be in JSON format. Try to
           use filters whenever possible.
           
           Only provide a JSON answer. No other text output should be included. When I parse the incoming answer from the AI using json.loads,
           I should get a valid Python dictionary. No exceptions are permitted.
           Index Definition: {index_definition} \n Example Document: {example_document} \n Text Query: {query}"""

from langchain_core.messages import HumanMessage
message = HumanMessage(content=prompt)
print(llm.invoke([message]).content)
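
The prompt asks for output that `json.loads` can parse, but models sometimes wrap JSON in a Markdown code fence anyway. The sketch below shows one defensive way to parse the reply. `parse_llm_json` is a hypothetical helper and the sample reply is made up for illustration; it is not actual model output.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to avoid nesting fences here

def parse_llm_json(raw_text):
    """Parse model output as a JSON object, tolerating an optional Markdown code fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line (e.g. the one tagged json)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]         # drop the closing fence line
        text = "\n".join(lines)
    parsed = json.loads(text)          # raises json.JSONDecodeError on invalid JSON
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed

# Made-up reply in the shape the prompt requests.
sample_body = json.dumps({"search": "warranty", "filter": "project_id eq '1'", "top": 3})
sample_reply = f"{FENCE}json\n{sample_body}\n{FENCE}"
parsed_query = parse_llm_json(sample_reply)
print(parsed_query["filter"])
```

Validating the reply before sending it to the search service keeps a malformed or unexpected answer from turning into a failed or misleading query.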

In [None]:
# RAG prompt adapted from rlm/rag-prompt on the LangChain prompt hub (https://smith.langchain.com/hub/rlm/rag-prompt?organizationId=989ad331-949f-4bac-9694-660074a208a7)
context = "\n\n".join(result["page_content"] for result in results)

rag_prompt = f"""You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. 
                Answer in a detailed, factually correct and intelligent manner.\n
                Question: {query} \n
                Context: {context} \n
                Answer:"""
rag_message = HumanMessage(content=rag_prompt)
print(llm.invoke([rag_message]).content)
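
The context above is a plain concatenation of chunk text. When chunks come from several projects, labelling each chunk with its source and capping the total size can make the answer easier to trace. A minimal sketch follows, assuming each hit behaves like a dict with the fields selected in the search call; `build_context`, the `max_chars` budget, and the demo hits are hypothetical.

```python
def build_context(hits, max_chars=8000):
    """Join retrieved chunks into one context string, labelling each with its source."""
    parts = []
    used = 0
    for number, hit in enumerate(hits, start=1):
        label = f"[{number}] {hit.get('project_name', '?')} / {hit.get('product_name', '?')}"
        block = f"{label}\n{hit['page_content']}"
        if used + len(block) > max_chars:
            break  # stop before exceeding the character budget
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)

# Made-up hits for illustration; the page_content values are not from the real documents.
demo_hits = [
    {"page_content": "Warranty period: 24 months.", "project_name": "Product B",
     "product_name": "Rotary Kiln for drying Lithium Hydroxide"},
    {"page_content": "Attachments list.", "project_name": "Product A",
     "product_name": "Cooler for Product A"},
]
demo_context = build_context(demo_hits)
print(demo_context)
```

With numbered labels in the context, the prompt could also ask the model to cite the chunk numbers it relied on.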