# Figure understanding & hierarchical document structure analysis

This notebook demonstrates an example of using Azure AI Document Intelligence to ouptut detected figures and hierarchical document structure (in markdown). It will then crop the figures and send figure content (with its caption) to Azure Open AI GPT-4V model to understand the semantics. The figure description will be used to update the markdown output, which can be further used for [semantic chunking](sample_rag_langchain.ipynb).

![Advanced document insights with figure understanding and hierarchical document structure](https://github.com/microsoft/Form-Recognizer-Toolkit/blob/main/SampleCode/media/figure-understanding.png?raw=true)

## Prerequisites
- An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe** - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have.
- An Azure AI Search resource - follow [this document](https://learn.microsoft.com/azure/search/search-create-service-portal) to create one if you don't have.
- An Azure OpenAI resource and deployments for embeddings model and chat model - follow [this document](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) to create one if you don't have.

## Setup

In [1]:
! pip install python-dotenv openai azure-ai-documentintelligence azure-identity pillow PyMuPDF

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting openai
  Downloading openai-1.35.14-py3-none-any.whl.metadata (21 kB)
Collecting azure-ai-documentintelligence
  Using cached azure_ai_documentintelligence-1.0.0b3-py3-none-any.whl.metadata (40 kB)
Collecting azure-identity
  Downloading azure_identity-1.17.1-py3-none-any.whl.metadata (79 kB)
     ---------------------------------------- 0.0/79.4 kB ? eta -:--:--
     ----- ---------------------------------- 10.2/79.4 kB ? eta -:--:--
     ----- ---------------------------------- 10.2/79.4 kB ? eta -:--:--
     -------------- ----------------------- 30.7/79.4 kB 435.7 kB/s eta 0:00:01
     ------------------- ------------------ 41.0/79.4 kB 281.8 kB/s eta 0:00:01
     ----------------------------- -------- 61.4/79.4 kB 297.7 kB/s eta 0:00:01
     ---------------------------------- --- 71.7/79.4 kB 302.7 kB/s eta 0:00:01
     ---------------------------------- --- 71.7/79.4 kB 302.7 

In [2]:
! pip install python-dotenv pillow PyMuPDF langchain langchain-community langchain-openai langchainhub openai tiktoken azure-ai-documentintelligence azure-identity azure-search-documents==11.6.0b3

Collecting langchain-openai
  Downloading langchain_openai-0.1.16-py3-none-any.whl.metadata (2.5 kB)
Collecting langchainhub
  Using cached langchainhub-0.1.20-py3-none-any.whl.metadata (659 bytes)
Collecting tiktoken
  Using cached tiktoken-0.7.0-cp312-cp312-win_amd64.whl.metadata (6.8 kB)
Collecting azure-search-documents==11.6.0b3
  Using cached azure_search_documents-11.6.0b3-py3-none-any.whl.metadata (23 kB)
Collecting azure-common>=1.1 (from azure-search-documents==11.6.0b3)
  Using cached azure_common-1.1.28-py2.py3-none-any.whl.metadata (5.0 kB)
Collecting regex>=2022.1.18 (from tiktoken)
  Using cached regex-2024.5.15-cp312-cp312-win_amd64.whl.metadata (41 kB)
Using cached azure_search_documents-11.6.0b3-py3-none-any.whl (317 kB)
Downloading langchain_openai-0.1.16-py3-none-any.whl (46 kB)
   ---------------------------------------- 0.0/46.1 kB ? eta -:--:--
   -------- ------------------------------- 10.2/46.1 kB ? eta -:--:--
   -------- ------------------------------- 10.2/

In [3]:
!pip install openai



In [4]:
"""
This code loads environment variables using the `dotenv` library and sets the necessary environment variables for Azure services.
The environment variables are loaded from the `.env` file in the same directory as this notebook.
"""

import os
import re
import openai
import uuid
# from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import ContentFormat
from openai import AzureOpenAI

#RAG
from langchain import hub
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_openai import AzureOpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch
from openai import OpenAI

In [5]:
doc_intelligence_endpoint = "https://classificationextrctionresource.cognitiveservices.azure.com/"#os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
doc_intelligence_key = "949b74af6e5641f2b684053aab8c52d9"#os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")

# aoai_api_base = #os.getenv("AZURE_OPENAI_ENDPOINT")
# aoai_api_key= os.getenv("AZURE_OPENAI_API_KEY")

# client = OpenAI(api_key="sk-proj-8zeBScFVgVhpceskr02sT3BlbkFJ0lvp27e5sdNhQW6TPrPV")

# openai_model = OpenAI(api_key=aoai_api_key)
# aoai_deployment_name = 'gpt-4v' # your model deployment name for GPT-4V
# aoai_api_version = '2024-02-15-preview' # this might change in the future


# vector_store_address: str = "https://gencognisearchv1.search.windows.net"#os.getenv("AZURE_SEARCH_ENDPOINT")
# vector_store_password: str = "Nc3bnAvxKv753K1jDTVNh4iWq9d4lhOr1ok5czqN5VAzSeCvrkZ7"#os.getenv("AZURE_SEARCH_ADMIN_KEY")

## Crop figure from the document (pdf or image) based on the bounding box

In [6]:
from PIL import Image
import fitz  # PyMuPDF
import mimetypes

def crop_image_from_image(image_path, page_number, bounding_box):
    """
    Crops an image based on a bounding box.

    :param image_path: Path to the image file.
    :param page_number: The page number of the image to crop (for TIFF format).
    :param bounding_box: A tuple of (left, upper, right, lower) coordinates for the bounding box.
    :return: A cropped image.
    :rtype: PIL.Image.Image
    """
    with Image.open(image_path) as img:
        if img.format == "TIFF":
            # Open the TIFF image
            img.seek(page_number)
            img = img.copy()
            
        # The bounding box is expected to be in the format (left, upper, right, lower).
        cropped_image = img.crop(bounding_box)
        return cropped_image
def crop_image_from_pdf_page(pdf_path, page_number, bounding_box):
    """
    Crops a region from a given page in a PDF and returns it as an image.

    :param pdf_path: Path to the PDF file.
    :param page_number: The page number to crop from (0-indexed).
    :param bounding_box: A tuple of (x0, y0, x1, y1) coordinates for the bounding box.
    :return: A PIL Image of the cropped area.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
    # Cropping the page. The rect requires the coordinates in the format (x0, y0, x1, y1).
    bbx = [x * 72 for x in bounding_box]
    rect = fitz.Rect(bbx)
    pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), clip=rect)
    
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    
    doc.close()

    return img
def crop_image_from_file(file_path, page_number, bounding_box):
    """
    Crop an image from a file.

    Args:
        file_path (str): The path to the file.
        page_number (int): The page number (for PDF and TIFF files, 0-indexed).
        bounding_box (tuple): The bounding box coordinates in the format (x0, y0, x1, y1).

    Returns:
        A PIL Image of the cropped area.
    """
    mime_type = mimetypes.guess_type(file_path)[0]
    
    if mime_type == "application/pdf":
        return crop_image_from_pdf_page(file_path, page_number, bounding_box)
    else:
        return crop_image_from_image(file_path, page_number, bounding_box)


## Use Azure OpenAI (GPT-4V model) to understand the semantics of the figure content

In [7]:
import openai
import base64
from mimetypes import guess_type

# Function to encode a local image into data URL 
def local_image_to_data_url(image_path):
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"

In [8]:
MAX_TOKENS = 2000

def understand_image_with_gptv(image_path, caption=""):
    """
    Generates a description for an image using the GPT-4V model.

    Parameters:
    - image_path (str): The path to the image file.
    - caption (str): The caption for the image.

    Returns:
    - img_description (str): The generated description for the image.
    """
    # Convert local image path to data URL
    data_url = local_image_to_data_url(image_path)

    # Construct message based on whether caption is provided
    if caption:
        prompt = f"Describe this image (note: it has image caption: {caption}):"
    else:
        prompt = "Describe this image:"

    messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}}
                ]}
            ]
    # print(messages)
    try:
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=messages,
            max_tokens=MAX_TOKENS
        )
        # print(response)
        img_description = response.choices[0].message.content

        return img_description

    except Exception as e:
        # Handle API errors or connection issues
        print(f"Error occurred: {str(e)}")
        return None

## Update markdown figure content section with the description from GPT-4V model

In [9]:
def update_figure_description(md_content, img_description, idx, img_location):
    """
    Updates the figure description in the Markdown content and appends the image location.

    Args:
        md_content (str): The original Markdown content.
        img_description (str): The new description for the image.
        idx (int): The index of the figure.
        img_location (str): The location of the image.

    Returns:
        str: The updated Markdown content with the new figure description.
    """

    # The substring you're looking for
    start_substring = f"![](figures/{idx})"
    end_substring = "</figure>"
    new_string = f"<!-- FigureContent=\"{img_description}\" -->"
    img_location_string = f"<!-- FigureLocation=\"{img_location}\" -->"
    
    new_md_content = md_content
    # Find the start and end indices of the part to replace
    start_index = md_content.find(start_substring)
    if start_index != -1:  # if start_substring is found
        start_index += len(start_substring)  # move the index to the end of start_substring
        end_index = md_content.find(end_substring, start_index)
        if end_index != -1:  # if end_substring is found
            # Insert the new string and image location string right after the start_substring
            new_md_content = (
                md_content[:start_index] +
                new_string +
                img_location_string +
                md_content[start_index:]
            )
    
    return new_md_content

# Spliting the scrapped data in to sections

In [10]:
import re
import json
from tqdm import tqdm

class Section:
    def __init__(self, content, section_type, metadata=None):
        self.content = content
        self.type = section_type
        self.metadata = metadata if metadata else {}
        self.subsections = []
        self.content_items = []

    def add_subsection(self, subsection):
        self.subsections.append(subsection)

    def add_content_item(self, item):
        self.content_items.append(item)

    def to_dict(self):
        return {
            "type": self.type,
            "content": self.content,
            "metadata": self.metadata,
            "subsections": [subsection.to_dict() for subsection in self.subsections],
            "content_items": self.content_items
        }

    def __repr__(self):
        return f"Section(type={self.type}, content={self.content[:30]}..., metadata={self.metadata}, subsections={len(self.subsections)}, content_items={len(self.content_items)})"

def parse_markdown(markdown_text):
    parsed_data = []

    # Regex patterns
    header_pattern = re.compile(r'<!-- PageHeader="(.*?)" -->')
    subheading_pattern = re.compile(r'## (.*?)\n')
    figure_location_pattern = re.compile(r'<!-- FigureLocation="(.*?)" -->')
    figure_pattern = re.compile(r'<figure>(.*?)<figcaption>(.*?)<\/figcaption>(.*?)<\/figure>', flags=re.DOTALL)
    markdown_table_pattern = re.compile(r'\|.*?\|.*?\|\s*\n(\|\s*-+\s*\|)+\s*\n((\|\s*.*?\s*\|.*?\|\s*\n)+)', flags=re.DOTALL)
    file_location_pattern = re.compile(r'<!-- FileLocation="(.*?)" -->')
    file_name_pattern = re.compile(r'<!-- FileName="(.*?)" -->')
    file_title_pattern = re.compile(r'<!-- FileTitle="(.*?)" -->')
    note_pattern = re.compile(r'<!-- Note="(.*?)" -->')

    # Initialize sections
    sections = []
    current_section = None
    current_subheading = None

    # Extract metadata
    file_location = file_location_pattern.findall(markdown_text)
    file_name = file_name_pattern.findall(markdown_text)
    file_title = file_title_pattern.findall(markdown_text)
    notes = note_pattern.findall(markdown_text)

    metadata = {
        "file_location": file_location[0] if file_location else "",
        "file_name": file_name[0] if file_name else "",
        "file_title": file_title[0] if file_title else "",
        "note": notes[0] if notes else ""
    }

    # Find headers
    headers = header_pattern.findall(markdown_text)
    for header in tqdm(headers, desc="Processing headers"):
        header_section = Section(content=header.strip(), section_type="header", metadata=metadata)
        sections.append(header_section)
        current_section = header_section

    # Find subheadings
    subheadings = subheading_pattern.findall(markdown_text)
    for subheading in tqdm(subheadings, desc="Processing subheadings"):
        subheading_section = Section(content=subheading.strip(), section_type="subheading", metadata=metadata)
        if current_section:
            current_section.add_subsection(subheading_section)
        else:
            sections.append(subheading_section)
        current_subheading = subheading_section

    # Find figure locations
    figure_locations = figure_location_pattern.findall(markdown_text)

    # Find figures and their captions
    figures = figure_pattern.findall(markdown_text)
    for i, (fig_content, fig_caption, fig_tail) in enumerate(tqdm(figures, desc="Processing figures")):
        figure_full_content = f'<figure>{fig_content}<figcaption>{fig_caption}</figcaption>{fig_tail}</figure>'.strip()
        figure_location = figure_locations[i] if i < len(figure_locations) else "Unknown"
        figure_item = {
            "type": "figure",
            "content": figure_full_content,
            "caption": fig_caption.strip(),
            "location": figure_location
        }
        if current_subheading:
            current_subheading.add_content_item(figure_item)
        elif current_section:
            current_section.add_content_item(figure_item)
        else:
            parsed_data.append(figure_item)

    # Find tables in markdown format
    markdown_tables = markdown_table_pattern.findall(markdown_text)
    for header, _, body in tqdm(markdown_tables, desc="Processing tables"):
        table_full_content = f'{header.strip()}\n{body.strip()}'
        table_item = {
            "type": "table",
            "content": table_full_content.strip()
        }
        if current_subheading:
            current_subheading.add_content_item(table_item)
        elif current_section:
            current_section.add_content_item(table_item)
        else:
            parsed_data.append(table_item)



    # Remove extracted elements to isolate remaining content
    remaining_content = header_pattern.sub('', markdown_text)
    remaining_content = figure_pattern.sub('', remaining_content)
    remaining_content = figure_location_pattern.sub('', remaining_content)
    remaining_content = markdown_table_pattern.sub('', remaining_content)
    remaining_content = subheading_pattern.sub('', remaining_content).strip()
    remaining_content = file_location_pattern.sub('', remaining_content)
    remaining_content = file_name_pattern.sub('', remaining_content)
    remaining_content = file_title_pattern.sub('', remaining_content)
    remaining_content = note_pattern.sub('', remaining_content)

    # Split remaining content into paragraphs based on line breaks
    paragraphs = [p.strip() for p in remaining_content.split('\n\n') if p.strip()]
    for paragraph in tqdm(paragraphs, desc="Processing paragraphs"):
        paragraph_item = {
            "type": "paragraph",
            "content": paragraph
        }
        if current_subheading:
            current_subheading.add_content_item(paragraph_item)
        elif current_section:
            current_section.add_content_item(paragraph_item)
        else:
            parsed_data.append(paragraph_item)

    return sections

# Generate embeddings using OpenAI

In [11]:
def generate_embeddings_openai(data):
    embeddings = []  # Initialize an empty list to store all embeddings
    for section in data:
        try:
            # Generate embedding using OpenAI's GPT-3 model
            response = client.embeddings.create(
                input=section["content"],
                model="text-embedding-3-small"
            )
            embedding_vector = response.data[0].embedding
            embeddings.append(embedding_vector)  # Append each embedding to the list
        except Exception as e:
            print(f"Error generating embedding for section '{section['type']}': {e}")

    return embeddings  # Return the list of all embeddings

## Analyze a document with Azure AI Document Intelligence Layout model and update figure description in the markdown output

In [20]:
document_intelligence_client = DocumentIntelligenceClient(
endpoint=doc_intelligence_endpoint, 
credential=AzureKeyCredential(doc_intelligence_key),
headers={"x-ms-useragent":"sample-code-figure-understanding/1.0.0"},
)

with open(input_file_path, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout", analyze_request=f, content_type="application/octet-stream", output_content_format=ContentFormat.TEXT 
    )

result = poller.result()
md_content = result.content

In [23]:
print(result)

{'apiVersion': '2024-02-29-preview', 'modelId': 'prebuilt-layout', 'stringIndexType': 'textElements', 'content': 'GIAR Creating tomorrow today\nCase Study On Tackling Bird shit menace in MRO, RGIA (GATL)\nCONTRIBUTED BY:\nKalyan Reddy Gudimetla\nGATL\nJuly 2019\nExecutive Summary\nGMR Aero Technic Ltd. (GATL) has its facilities spread over an area of 28 acres at Hyderabad airport. It is a 100% subsidiary of GMR Hyderabad International Airport Ltd. (GHIAL), and offers world class third party independent airframe Maintenance, Repair & Overhaul (MRO) services from its facility at Rajiv Gandhi International Airport (RGIA), Hyderabad.\nSince the hangars or aircraft maintenance bays are high and wide structures, birds often enter through various openings and roost in the available flat areas of the superstructure. The accumulation of bird droppings, feathers and nesting material can pose a huge problems like equipment damage through metal corrosion and structural weakness, damaging ongoing p

In [25]:
result.keys()

dict_keys(['apiVersion', 'modelId', 'stringIndexType', 'content', 'pages', 'tables', 'paragraphs', 'styles', 'contentFormat', 'sections', 'figures'])

In [31]:
len(result['paragraphs'])

95

In [37]:
set([para['role'] for para in result['paragraphs'] if 'role' in para])

{'pageFooter', 'pageNumber', 'sectionHeading', 'title'}

In [21]:
print(md_content)

GIAR Creating tomorrow today
Case Study On Tackling Bird shit menace in MRO, RGIA (GATL)
CONTRIBUTED BY:
Kalyan Reddy Gudimetla
GATL
July 2019
Executive Summary
GMR Aero Technic Ltd. (GATL) has its facilities spread over an area of 28 acres at Hyderabad airport. It is a 100% subsidiary of GMR Hyderabad International Airport Ltd. (GHIAL), and offers world class third party independent airframe Maintenance, Repair & Overhaul (MRO) services from its facility at Rajiv Gandhi International Airport (RGIA), Hyderabad.
Since the hangars or aircraft maintenance bays are high and wide structures, birds often enter through various openings and roost in the available flat areas of the superstructure. The accumulation of bird droppings, feathers and nesting material can pose a huge problems like equipment damage through metal corrosion and structural weakness, damaging ongoing paint jobs and if lodged inside open engines can lead to engine failures mid-flight. It can also pose a danger to personnel

In [12]:
def analyze_layout(input_file_path, output_folder):
    """
    Analyzes the layout of a document and extracts figures along with their descriptions, then update the markdown output with the new description.

    Args:
        input_file_path (str): The path to the input document file.
        output_folder (str): The path to the output folder where the cropped images will be saved.

    Returns:
        str: The updated Markdown content with figure descriptions.

    """
    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=doc_intelligence_endpoint, 
        credential=AzureKeyCredential(doc_intelligence_key),
        headers={"x-ms-useragent":"sample-code-figure-understanding/1.0.0"},
    )

    with open(input_file_path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=f, content_type="application/octet-stream", output_content_format=ContentFormat.MARKDOWN 
        )

    result = poller.result()
    md_content = result.content
    
    
    if result.figures:
        print("Figures:")
        for idx, figure in enumerate(result.figures):
            figure_content = ""
            img_description = ""
            img_location = ""
            print(f"Figure #{idx} has the following spans: {figure.spans}")
            for i, span in enumerate(figure.spans):
                print(f"Span #{i}: {span}")
                figure_content += md_content[span.offset:span.offset + span.length]
            print(f"Original figure content in markdown: {figure_content}")

            # Note: figure bounding regions currently contain both the bounding region of figure caption and figure body
            if figure.caption:
                caption_region = figure.caption.bounding_regions
                print(f"\tCaption: {figure.caption.content}")
                print(f"\tCaption bounding region: {caption_region}")
                for region in figure.bounding_regions:
                    if region not in caption_region:
                        print(f"\tFigure body bounding regions: {region}")
                        # To learn more about bounding regions, see https://aka.ms/bounding-region
                        boundingbox = (
                                region.polygon[0],  # x0 (left)
                                region.polygon[1],  # y0 (top)
                                region.polygon[4],  # x1 (right)
                                region.polygon[5]   # y1 (bottom)
                            )
                        print(f"\tFigure body bounding box in (x0, y0, x1, y1): {boundingbox}")
                        cropped_image = crop_image_from_file(input_file_path, region.page_number - 1, boundingbox) # page_number is 1-indexed

                        # Get the base name of the file
                        base_name = os.path.basename(input_file_path)
                        # Remove the file extension
                        file_name_without_extension = os.path.splitext(base_name)[0]

                        output_file = f"{file_name_without_extension}_cropped_image_{idx}.png"
                        cropped_image_filename = os.path.join(output_folder, output_file)
                        img_location += cropped_image_filename
                        cropped_image.save(cropped_image_filename)
                        # img_description += understand_image_with_gptv(cropped_image_filename, figure.caption.content)
            else:
                print("\tNo caption found for this figure.")
                for region in figure.bounding_regions:
                    print(f"\tFigure body bounding regions: {region}")
                    # To learn more about bounding regions, see https://aka.ms/bounding-region
                    boundingbox = (
                            region.polygon[0],  # x0 (left)
                            region.polygon[1],  # y0 (top
                            region.polygon[4],  # x1 (right)
                            region.polygon[5]   # y1 (bottom)
                        )
                    print(f"\tFigure body bounding box in (x0, y0, x1, y1): {boundingbox}")

                    cropped_image = crop_image_from_file(input_file_path, region.page_number - 1, boundingbox) # page_number is 1-indexed

                    # Get the base name of the file
                    base_name = os.path.basename(input_file_path)
                    # Remove the file extension
                    file_name_without_extension = os.path.splitext(base_name)[0]

                    output_file = f"{file_name_without_extension}_cropped_image_{idx}.png"
                    cropped_image_filename = os.path.join(output_folder, output_file)
                    # cropped_image_filename = f"data/cropped/image_{idx}.png"
                    cropped_image.save(cropped_image_filename)
                    img_location += cropped_image_filename
                    print(f"\tFigure {idx} cropped and saved as {cropped_image_filename}")
                    # img_description += understand_image_with_gptv(cropped_image_filename, "")
                    print(f"\tDescription of figure {idx}: {img_description}")
            
            # replace_figure_description(figure_content, img_description, idx)
            md_content = update_figure_description(md_content, img_description, idx,img_location)
            

    return md_content

In [13]:
input_file_path = "C:/Users/sampath.emandi/Downloads/Congnis_workspace/file1.pdf"
output_folder = "C:/Users/sampath.emandi/Downloads/Congnis_workspace/recogimgs/"

In [14]:
updated_md_with_figure_understanding = analyze_layout(input_file_path, output_folder)

Figures:
Figure #0 has the following spans: [{'offset': 0, 'length': 92}]
Span #0: {'offset': 0, 'length': 92}
Original figure content in markdown: <figure>

![](figures/0)

<!-- FigureContent="GIAR Creating tomorrow today" -->

</figure>


	No caption found for this figure.
	Figure body bounding regions: {'pageNumber': 1, 'polygon': [6.7621, 1.0789, 7.7756, 1.078, 7.7761, 1.5793, 6.7626, 1.5801]}
	Figure body bounding box in (x0, y0, x1, y1): (6.7621, 1.0789, 7.7761, 1.5793)
	Figure 0 cropped and saved as C:/Users/sampath.emandi/Downloads/Congnis_workspace/recogimgs/file1_cropped_image_0.png
	Description of figure 0: 
Figure #1 has the following spans: [{'offset': 10978, 'length': 71}]
Span #0: {'offset': 10978, 'length': 71}
Original figure content in markdown: -60Hz Current: 100mA Classification: Class 3R Laser color: Red+Green Si
	No caption found for this figure.
	Figure body bounding regions: {'pageNumber': 7, 'polygon': [1.0153, 5.785, 7.4604, 5.7804, 7.4652, 7.6793, 1.0206, 7.6

In [16]:
print(updated_md_with_figure_understanding)

<figure>

![](figures/0)<!-- FigureContent="" --><!-- FigureLocation="C:/Users/sampath.emandi/Downloads/Congnis_workspace/recogimgs/file1_cropped_image_0.png" -->

<!-- FigureContent="GIAR Creating tomorrow today" -->

</figure>


Case Study On Tackling Bird shit menace in MRO, RGIA (GATL)

CONTRIBUTED BY:

Kalyan Reddy Gudimetla

GATL

July 2019

Executive Summary
===

GMR Aero Technic Ltd. (GATL) has its facilities spread over an area of 28 acres at Hyderabad airport. It is a 100% subsidiary of GMR Hyderabad International Airport Ltd. (GHIAL), and offers world class third party independent airframe Maintenance, Repair & Overhaul (MRO) services from its facility at Rajiv Gandhi International Airport (RGIA), Hyderabad.

Since the hangars or aircraft maintenance bays are high and wide structures, birds often enter through various openings and roost in the available flat areas of the superstructure. The accumulation of bird droppings, feathers and nesting material can pose a huge problem

In [2]:
pip install markdown

Collecting markdownNote: you may need to restart the kernel to use updated packages.

  Using cached Markdown-3.6-py3-none-any.whl.metadata (7.0 kB)
Using cached Markdown-3.6-py3-none-any.whl (105 kB)
Installing collected packages: markdown
Successfully installed markdown-3.6


In [3]:
pip install beautifulsoup4


Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install mistune beautifulsoup4


Collecting mistune
  Using cached mistune-3.0.2-py3-none-any.whl.metadata (1.7 kB)
Downloading mistune-3.0.2-py3-none-any.whl (47 kB)
   ---------------------------------------- 0.0/48.0 kB ? eta -:--:--
   -------- ------------------------------- 10.2/48.0 kB ? eta -:--:--
   ----------------- ---------------------- 20.5/48.0 kB 162.5 kB/s eta 0:00:01
   ----------------- ---------------------- 20.5/48.0 kB 162.5 kB/s eta 0:00:01
   ------------------------- -------------- 30.7/48.0 kB 145.2 kB/s eta 0:00:01
   ------------------------- -------------- 30.7/48.0 kB 145.2 kB/s eta 0:00:01
   ------------------------- -------------- 30.7/48.0 kB 145.2 kB/s eta 0:00:01
   ---------------------------------- ----- 41.0/48.0 kB 115.5 kB/s eta 0:00:01
   ---------------------------------- ----- 41.0/48.0 kB 115.5 kB/s eta 0:00:01
   ---------------------------------------- 48.0/48.0 kB 100.6 kB/s eta 0:00:00
Installing collected packages: mistune
Successfully installed mistune-3.0.2
Note: you

In [4]:
import markdown
from bs4 import BeautifulSoup



In [18]:
import mistune
from bs4 import BeautifulSoup, Tag

def markdown_to_html_tree(markdown_text):
    # Convert Markdown to HTML using mistune
    markdown = mistune.create_markdown()
    html = markdown(markdown_text)

    # Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # Helper function to convert the HTML tree to a hierarchical format
    def element_to_dict(element):
        if isinstance(element, Tag):
            children = [element_to_dict(child) for child in element.children if isinstance(child, Tag) or child.strip()]
            return {
                'name': element.name,
                'attrs': dict(element.attrs),
                'text': element.get_text(strip=True) if element.name not in ['figure', 'figcaption'] else '',
                'children': children
            }
        return {'text': element.strip()}

    # Convert the parsed HTML to a hierarchical dictionary
    body_element = soup.body if soup.body else soup
    html_tree = element_to_dict(body_element)

    return html_tree

def print_tree(node, level=0):
    indent = '  ' * level
    if 'name' in node:
        print(f"{indent}{node['name']} (Attributes: {node['attrs']}): {node['text']}")
        for child in node['children']:
            print_tree(child, level + 1)
    else:
        print(f"{indent}Text: {node['text']}")

In [26]:
pip install markdown-analysis==0.0.5


Collecting markdown-analysis==0.0.5
  Downloading markdown_analysis-0.0.5.tar.gz (7.1 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: markdown-analysis
  Building wheel for markdown-analysis (setup.py): started
  Building wheel for markdown-analysis (setup.py): finished with status 'done'
  Created wheel for markdown-analysis: filename=markdown_analysis-0.0.5-py3-none-any.whl size=7193 sha256=69bc4685e95db3805e431342b6f11663a7b1f7dba3b0d32295a00ae021fc160e
  Stored in directory: c:\users\sampath.emandi\appdata\local\pip\cache\wheels\eb\b9\ca\0c823581986971ae5e545429d0eeb905bda2388eb4a61b4921
Successfully built markdown-analysis
Installing collected packages: markdown-analysis
Successfully installed markdown-analysis-0.0.5
Note: you may need to restart the kernel to use updated packages.


In [25]:
from markdown_analysis import MarkdownAnalyzer, to_json

# Helper function to print the hierarchical tree
def print_tree(node, level=0):
    indent = '  ' * level
    if isinstance(node, dict):
        name = node.get('name', 'Text')
        attrs = node.get('attrs', {})
        text = node.get('text', '')
        children = node.get('children', [])
        print(f"{indent}{name} (Attributes: {attrs}): {text}")
        for child in children:
            print_tree(child, level + 1)
    elif isinstance(node, list):
        for item in node:
            print_tree(item, level)

ModuleNotFoundError: No module named 'markdown_analysis'

In [22]:
# Example Markdown text
markdown_text = """

<figure>

![](figures/0)<!-- FigureContent="The image displays a logo on the left, which is the emblem of India composed of four lions standing back to back, signifying power, courage, pride, and confidence. Below the emblem, text in Devanagari script likely represents the government body associated with the emblem. To the right is text in English that reads "Data from: https://upag-gov.in/ ES&E Division, Department of Agriculture & Farmer Welfare." The URL provided suggests that the data is sourced from a website related to agriculture and farmer welfare, specifically from a division called ES&E (which might stand for Economics, Statistics & Extension) within the department, and this department is likely associated with the government of Uttar Pradesh, India considering the "up" in the URL. There is also a stylized graphic in green and yellow depicting what appears to be a sprouting plant or agricultural symbol, representing the theme of agriculture." -->

<!-- FigureContent="Data from: https://upag.gov.in/ ES&E Division, Department of Agriculture & Farmer Welfare. अपनेल रूपले" -->

</figure>


Commodity Report: Rapeseed & Mustard
===

Key parameters on commodity situation are discussed below
===

R&M OILSEED
===


## CROP CALENDER:

R&M oilseed is grown during the Rabi season. The sowing is done mainly between October and November and the crop is harvested mainly between February and April.


## Figure 1: Crop Calendar
 :selected:
<figure>

![](figures/1)<!-- FigureContent="The image shows two tabs or buttons side by side. On the left is a tab with a green colored background labeled "Harvesting," and on the right is a tab with a blue colored background labeled "Sowing." The "Harvesting" tab appears to be the active or selected option, as indicated by its brighter color and lack of a border compared to the "Sowing" tab. The "Sowing" tab seems unselected, with a darker background and an outline." -->

<!-- FigureContent="Harvesting Sowing" -->

</figure>


| SEASON | JUN | JUL | AUG | SEP | OCT | NOV | DEC | JAN | FEB | MAR | APR | MAY |
| - | - | - | - | - | - | - | - | - | - | - | - | - |
| Rabi | | | | | | | | | | | | |

Source: ICAR


## PRODUCTION:

As per GOI's Third Advance Estimates, 2023-24, 
R&M production in the current year is estimated to be 131.61 LMTs, 
slightly higher than last year's production of 126.43 LMT. 
All India production increased by 23.71% when compared to the last 5-year average. 
Rajasthan accounted for 45.41% of the total production in 2023-24.
India's CAGR for R&M production for the last 5 years was 6.44%. 
The conversion ratio for R&M oil is 33% and for meal is 67%.

<figure>

<figcaption>

Figure 2: Statewise Rapeseed & Mustard Production Share of Top 5 States (2023-24)\*

</figcaption>

![](figures/2)<!-- FigureContent="The image shows a horizontal bar graph titled "Figure 2: Statewise Rapeseed & Mustard Production Share of Top 5 States (2023-24)*". 
The graph displays the percentage share of production across five different states, 
which are listed on the horizontal axis.

The bars represent the following states and their respective production share percentages:

- Rajasthan: 45.41%
- Uttar Pradesh: 14.24%
- Madhya Pradesh: 13.28%
- Haryana: 10.78%
- West Bengal: 5.99%

The production share is illustrated on the vertical axis that ranges from 0% to 40%, in increments of 10. The bars for each state are color-coded in shades of blue-green, with the color intensity apparently corresponding to the size of the production share. Darker shades indicate larger shares, as suggested by the 'Color Scale' legend on the right side of the graph, which shows a gradient from light to dark with values from 10 to 40, even though such numbers do not overtly correlate with the production shares displayed in the graph.

Please note that there is an asterisk (*) next to the year (2023-24), which typically indicates a footnote or additional information somewhere in the associated document, which is not visible in the image provided." -->

<!-- FigureContent="45.41% Color Scale 40 Production share (%) 40 ¥30 30 320 20 14.24% 10 13.28% 10.78% 5.99% 10 0 Rajasthan Uttar Pradesh Madhya Pradesh Haryana West Bengal" -->

</figure>


\*Values from 2023-24 are from Third Advance Estimates, Rabi

Source: DA&FW

Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\* (Production in LMTs)

| State | Current Year (2023-2024) | Previous Year (2022-2023) | Year (2021-2022) | Last 5 years Average |
| - | - | - | - | - |
| Rajasthan | 59.77 | 58.32 | 57.76 | 48.83 |
| Uttar Pradesh | 18.75 | 16.20 | 10.33 | 11.47 |
| Madhya Pradesh | 17.48 | 15.51 | 16.91 | 13.25 |
| Haryana | 14.19 | 13.02 | 13.66 | 12.77 |
| West Bengal | 7.88 | 8.16 | 7.43 | 7.49 |
| All India | 131.61 | 126.43 | 119.63 | 106.39 |

\*Values from 2023-24 are from Third Advance Estimates, Rabi Source: DA&FW
:selected:<figure>

![](figures/3)<!-- FigureContent="The image is a small banner or footer likely taken from a website or document. On the left side of the image, there is an emblem, which appears to be the state emblem of India, featuring the Lion Capital of Ashoka. Below the emblem, there is text in Hindi, which is not clearly legible, but it may be a motto or a tagline.

In the middle, there is text that says "Data from: https://upag-gov.in/" followed by "ES&E Division, Department of Agriculture & Farmer Welfare." This indicates that the information or data related to the banner has been sourced from the mentioned website, which appears to be associated with the Agricultural Department of a region or state in India, possibly Uttar Pradesh given the "upag" in the URL.

On the right side, next to this text, there is a stylized graphic of what looks to be a plant or crop with two leaves and a sheaf of grain in yellow and green colors, which are often used to represent agriculture.

The overall image seems informational and suggests a connection with agricultural data and farmer welfare from an official government source in India." -->

<!-- FigureContent="Data from: https://upag.gov.in/ ES&E Division, Department of Agriculture & Farmer Welfare. अपनेव अपने" -->

</figure>



## ARRIVALS & PRICES:

Arrivals: In May 2024, the all-India mandi arrivals for R&M were 3.05 LMT. The arrivals in Apr 2024, were 7.28 LMT, while in May 2023 3.18 LMT arrived in the mandis. Cumulative arrivals for Jan to May in 2024 are 21.34 LMT, while in 2023 the arrivals were 22.40 LMT

Mandi Prices: All-India wholesale mandi prices of R&M oilseed was Rs. 5527.95 per quintal in May 2024. In Apr 2024 mandi prices were Rs. 5205.99 per quintal. Compared to same month of last year prices are higher by 6.29% when prices were Rs. 5201.06 per quintal. These prices are lower than the MSP of Rs. 5650 per quintal.

<figure>

![](figures/4)<!-- FigureContent="This image is a horizontal bar chart titled "Figure 5: All India K&H Arrivals Month wise Comparison (LMT)." It shows the quantity of arrivals (in Lakh Metric Tonnes, LMT) for certain months spanning from May 2021 to May 2024. Each bar represents the quantity of arrivals for a specific month, and above each bar is a numerical value indicating the precise quantity in LMT.

Starting from May 2021 with 0.95 LMT and continuing to May 2024 with 7.28 LMT, the chart displays a fluctuating pattern with some months having noticeably larger amounts of arrivals than others. For example, April 2022 shows the highest quantity at 12.94 LMT, which significantly stands out on the chart. Conversely, several months such as December 2023 and January 2024 show lower quantities, around 1.22 and 1.12 LMT, respectively.

The chart is sourced from Agmarknet, as indicated by the image caption, which suggests the data pertains to agricultural market arrivals in India." -->

<!-- FigureContent="Figure 3: All India R&M arrivals month wise comparison (LMT) 12.94 12 10 8 8.25 7.4 7.28 6 6.77 5.45 4 4.67 4.28 3.54 3.18 3.26 2 2.62 3.0 17.32 1.75 1.95 1.94 2.05 1,51 1.49 0.95 1.14 1.11 1.41 1.35 1.63 1.7 1.75 1.31 1.31 1.2 1.1 0 0.69 0.7. 0.73 1.05 0.97 May 2021 Jun 2021 Jul 2021 Aug 2021 Sep 2021 Oct 2021 Nov 2021 Dec 2021 Jan 2022 Feb 2022 Mar 2022 Apr 2022 May 2022 Jun 2022 Jul 2022 Aug 2022 Sep 2022 Oct 2022 Nov 2022 Dec 2022 Jan 2023 Feb 2023 Mar 2023 Apr 2023 May 2023 Jun 2023 Jul 2023 Aug 2023 Sep 2023 Oct 2023 Nov 2023 Jan 2024 Dec 2023 Feb 2024 Mar 2024 Apr 2024 May 2024 :selected: Arrivals(Tonnes)" -->

<figcaption>

Source: Agmarknet

</figcaption>

</figure>


Figure 4: All India Mandi prices & MSP of R&M in Rs./Qtl

<figure>

![](figures/5)<!-- FigureContent="The image is a line graph with two lines representing price trends over time from May 2022 to May 2024. Each line uses a different color: one is orange, and the other is blue, to represent different datasets or categories. 

The vertical axis on the left is labeled "Prices (Rs/Qtl)," showing the price range from 5000 to 6500 with increments, suggesting these prices are in Indian Rupees per Quintal. 

The horizontal axis along the bottom shows the months from May 2022 to May 2024, spaced three months apart. This indicates that data points may be quarterly.

The orange line starts at the top with a value near 6471.11 in May 2022 and then mostly declines with some fluctuations until reaching a low point around February 2023. It then increases and has some variability before it ends at a value of 5300.60 in May 2024.

The blue line remains constant at a value of 5050.00 from May 2022 until about February 2023. After that, it shows an upward trend with some small fluctuations, ending at a value of 5650.00 in May 2024. 

The graph may represent the price movements of two different commodities, products, or markets over the specified time period. Without more context, it's challenging to provide a specific interpretation of what these lines represent." -->

<!-- FigureContent="6500 6471.11 6000 Prices (Rs/Qtl) 5650.00 35500 5300.60 5050.00 5000 May - 2022 Aug - 2022 Nov - 2022 Feb - 2023 May - 2023 Aug - 2023 Nov - 2023 Feb - 2024 May - 2024" -->

</figure>


\-MSP (DA&FW) - Mandi Modal Price (AgMarknet)

Source: Agmarknet & DA&FW
"""


  markdown_text = """


In [23]:

# Convert and parse
html_tree = markdown_to_html_tree(markdown_text)

In [24]:
html_tree

{'name': '[document]',
 'attrs': {},
 'text': '<figure><!-- FigureContent="The image displays a logo on the left, which is the emblem of India composed of four lions standing back to back, signifying power, courage, pride, and confidence. Below the emblem, text in Devanagari script likely represents the government body associated with the emblem. To the right is text in English that reads "Data from: https://upag-gov.in/ ES&E Division, Department of Agriculture & Farmer Welfare." The URL provided suggests that the data is sourced from a website related to agriculture and farmer welfare, specifically from a division called ES&E (which might stand for Economics, Statistics & Extension) within the department, and this department is likely associated with the government of Uttar Pradesh, India considering the "up" in the URL. There is also a stylized graphic in green and yellow depicting what appears to be a sprouting plant or agricultural symbol, representing the theme of agriculture." --

In [17]:
print(html_tree)

None


In [63]:
parsed_sections = parse_markdown(updated_md_with_figure_understanding)

Processing headers: 100%|██████████| 119/119 [00:00<00:00, 98601.77it/s]
Processing subheadings: 100%|██████████| 109/109 [00:00<00:00, 234691.55it/s]
Processing figures: 100%|██████████| 11/11 [00:00<00:00, 65536.00it/s]
Processing tables: 0it [00:00, ?it/s]
Processing paragraphs: 100%|██████████| 505/505 [00:00<00:00, 925753.29it/s]


In [64]:
parsed_sections

[Section(type=header, content=Annual Report 2022-23..., metadata={'file_location': '', 'file_name': '', 'file_title': '', 'note': ''}, subsections=0, content_items=0),
 Section(type=header, content=Annual Report 2022-23..., metadata={'file_location': '', 'file_name': '', 'file_title': '', 'note': ''}, subsections=0, content_items=0),
 Section(type=header, content=Annual Report 2022-23..., metadata={'file_location': '', 'file_name': '', 'file_title': '', 'note': ''}, subsections=0, content_items=0),
 Section(type=header, content=Annual Report 2022-23..., metadata={'file_location': '', 'file_name': '', 'file_title': '', 'note': ''}, subsections=0, content_items=0),
 Section(type=header, content=Annual Report 2022-23..., metadata={'file_location': '', 'file_name': '', 'file_title': '', 'note': ''}, subsections=0, content_items=0),
 Section(type=header, content=Annual Report 2022-23..., metadata={'file_location': '', 'file_name': '', 'file_title': '', 'note': ''}, subsections=0, content_it

In [65]:
# Convert to a dictionary for NoSQL-like structure
def sections_to_dict(sections):
    return [section.to_dict() for section in sections]

In [66]:
# Convert the parsed sections to a NoSQL-like structure
nosql_structure = sections_to_dict(parsed_sections)

In [68]:
nosql_structure

[{'type': 'header',
  'content': 'Annual Report 2022-23',
  'metadata': {'file_location': '',
   'file_name': '',
   'file_title': '',
   'note': ''},
  'subsections': [],
  'content_items': []},
 {'type': 'header',
  'content': 'Annual Report 2022-23',
  'metadata': {'file_location': '',
   'file_name': '',
   'file_title': '',
   'note': ''},
  'subsections': [],
  'content_items': []},
 {'type': 'header',
  'content': 'Annual Report 2022-23',
  'metadata': {'file_location': '',
   'file_name': '',
   'file_title': '',
   'note': ''},
  'subsections': [],
  'content_items': []},
 {'type': 'header',
  'content': 'Annual Report 2022-23',
  'metadata': {'file_location': '',
   'file_name': '',
   'file_title': '',
   'note': ''},
  'subsections': [],
  'content_items': []},
 {'type': 'header',
  'content': 'Annual Report 2022-23',
  'metadata': {'file_location': '',
   'file_name': '',
   'file_title': '',
   'note': ''},
  'subsections': [],
  'content_items': []},
 {'type': 'header',


In [67]:
# Print the JSON structure
print(json.dumps(nosql_structure, indent=4))

[
    {
        "type": "header",
        "content": "Annual Report 2022-23",
        "metadata": {
            "file_location": "",
            "file_name": "",
            "file_title": "",
            "note": ""
        },
        "subsections": [],
        "content_items": []
    },
    {
        "type": "header",
        "content": "Annual Report 2022-23",
        "metadata": {
            "file_location": "",
            "file_name": "",
            "file_title": "",
            "note": ""
        },
        "subsections": [],
        "content_items": []
    },
    {
        "type": "header",
        "content": "Annual Report 2022-23",
        "metadata": {
            "file_location": "",
            "file_name": "",
            "file_title": "",
            "note": ""
        },
        "subsections": [],
        "content_items": []
    },
    {
        "type": "header",
        "content": "Annual Report 2022-23",
        "metadata": {
            "file_location": "",
         

In [29]:

print("----------------------------------------------------------------------------------------------------------------------------------------")
# Parse Markdown text with tags
parsed_splits_data = parse_markdown_with_tags(markdown_text=updated_md_with_figure_understanding)

print("Length of splits: " + str(len(parsed_splits_data)))


# print(f"Updated markdown content with figure understanding:\n\n {updated_md_with_figure_understanding}")


----------------------------------------------------------------------------------------------------------------------------------------
Length of splits: 3
Length of openai embeddings : 3


In [30]:
parsed_splits_data

[{'type': 'header',
  'content': 'GLOBAL TRADE OUTLOOK AND STATISTICS - APRIL 2024'},
 {'type': 'figure',
  'content': '![](figures/0)<!-- FigureContent="This image shows a pie chart divided into four segments, each representing a different category of service delivery with corresponding percentages:\n\n1. Commercial presence (mode 3) takes up the majority of the chart with 57.0%.\n2. Digital delivery (mode 1) is the next largest segment with 20.7%.\n3. Other cross-border transactions (mode 1) account for 13.5%.\n4. Consumption abroad (mode 2) fills a smaller segment with 7.8%.\n5. Presence of natural persons (mode 4) is the smallest portion, constituting just 1.0%.\n\nEach segment of the chart is color-coded, and the percentages are clearly marked next to each category. This chart could be used to analyze the composition of service delivery methods for a particular market or business sector, reflecting how services are being consumed or provided in different modes." -->\n\n<!-- Figure

In [None]:
# Generate embeddings using OpenAI
embeddings_openai = generate_embeddings_openai(parsed_splits_data)

print("Length of openai embeddings : "+ str(len(embeddings_openai)))


# RAG

In [15]:
import weaviate
# Create the client
weviate_client = weaviate.Client(
    url="http://98.70.77.203:4090"
)

print(weviate_client.is_ready())

            your code to use Python client v4 `weaviate.WeaviateClient` connections and methods.

            For Python Client v4 usage, see: https://weaviate.io/developers/weaviate/client-libraries/python
            For code migration, see: https://weaviate.io/developers/weaviate/client-libraries/python/v3_v4_migration
            


True


# To know existing class names

In [22]:
def existingClassNames(weaviate_url:str):
    import requests

    weaviate_url = weaviate_url#
    schema_endpoint = f"{weaviate_url}/v1/schema"

    try:
        # Send a GET request to retrieve the schema
        response = requests.get(schema_endpoint)
        response.raise_for_status()  # Raise an exception for 4xx/5xx responses

        # Extract class names from the response
        schema_data = response.json()
        class_names = [class_obj["class"] for class_obj in schema_data.get("classes", [])]

        # Print the existing class names
        print("Existing class names in Weaviate:")
        for class_name in class_names:
            print(f"- {class_name}")

    except requests.exceptions.HTTPError as err:
        print(f"Failed to retrieve schema: {err}")
    except Exception as e:
        print(f"An error occurred: {e}")
        

existingClassNames(weaviate_url="http://98.70.77.203:4090")

Existing class names in Weaviate:
- Kizen
- Test2
- Article
- Product
- Test3
- Order
- Electionamigos
- Test1
- Customer
- Test4
- Document


# Collections in documents

In [17]:
def collectionsInClasses(weaviate_url:str):
    import requests

    weaviate_url = weaviate_url#
    schema_endpoint = f"{weaviate_url}/v1/schema"

    try:
        # Send a GET request to retrieve the schema
        response = requests.get(schema_endpoint)
        response.raise_for_status()  # Raise an exception for 4xx/5xx responses

        # Extract class names and their properties from the response
        schema_data = response.json()
        classes = schema_data.get("classes", [])

        # Print details of each class
        print("Existing classes (collections) in Weaviate:")
        for class_obj in classes:
            class_name = class_obj["class"]
            description = class_obj.get("description", "No description provided")
            properties = class_obj.get("properties", [])
            
            print(f"\nClass Name: {class_name}")
            print(f"Description: {description}")
            print("Properties:")
            for prop in properties:
                prop_name = prop["name"]
                data_type = prop["dataType"][0] if prop.get("dataType") else "Unknown"
                print(f"- {prop_name}: {data_type}")

    except requests.exceptions.HTTPError as err:
        print(f"Failed to retrieve schema: {err}")
    except Exception as e:
        print(f"An error occurred: {e}")

collectionsInClasses(weaviate_url="http://98.70.77.203:4090")  

Existing classes (collections) in Weaviate:

Class Name: Document
Description: Represents a document with embeddings.
Properties:
- title: text
- content: text
- embeddings: number[]

Class Name: Kizen
Description: Represents a kizen reports sample with embeddings.
Properties:
- type: text
- content: text

Class Name: Test2
Description: Represents a mir reports sample with embeddings.
Properties:
- type: text
- heading: text
- content: text
- vector: number[]

Class Name: Product
Description: No description provided
Properties:
- name: text
- description: text
- price: number
- categories: text[]
- in_stock: boolean

Class Name: Test3
Description: Represents a mir reports sample with embeddings.
Properties:
- type: text
- heading: text
- content: text

Class Name: Order
Description: No description provided
Properties:
- customer_id: text
- order_date: date
- total: number
- items: text[]

Class Name: Electionamigos
Description: Sample election amigos data
Properties:
- type: text
- hea

In [18]:
weviate_client.schema.get(class_name='Test4').keys()

dict_keys(['class', 'description', 'invertedIndexConfig', 'multiTenancyConfig', 'properties', 'replicationConfig', 'shardingConfig', 'vectorIndexConfig', 'vectorIndexType', 'vectorizer'])

# Delete Collections

In [25]:
# weviate_client.schema.delete_class("Article") 

# Function to update documents with objects

Creating schema of collection

In [26]:
parsed_splits_data

[{'type': 'figure',
  'content': '![](figures/0)<!-- FigureContent="This image shows a pie chart divided into four segments, each representing a different category of service delivery with corresponding percentages:\n\n1. Commercial presence (mode 3) takes up the majority of the chart with 57.0%.\n2. Digital delivery (mode 1) is the next largest segment with 20.7%.\n3. Other cross-border transactions (mode 1) account for 13.5%.\n4. Consumption abroad (mode 2) fills a smaller segment with 7.8%.\n5. Presence of natural persons (mode 4) is the smallest portion, constituting just 1.0%.\n\nEach segment of the chart is color-coded, and the percentages are clearly marked next to each category. This chart could be used to analyze the composition of service delivery methods for a particular market or business sector, reflecting how services are being consumed or provided in different modes." -->\n\n<!-- FigureContent="Other cross-border transactions (mode 1) Digital delivery (mode 1) 13.5% 20.7

In [21]:
class_obj = {
    "class": "Article",
    "description": "An Author class to store the author information",
    "vectorizer": "text2vec-huggingface",  # this could be any vectorizer
    "properties":[
        {
            "name": "title",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-huggingface": {  # this must match the vectorizer used
                    "vectorizePropertyName": True,
                    "tokenization": "lowercase",
                    "indexFilterable": True,
                    "indexSearchable": True,
                },
        "vectorIndexConfig": {
        "distance": "cosine",
        "bq": {
            "enabled": True,  # Enable BQ compression. Default: False
            "rescoreLimit": 200,  # The minimum number of candidates to fetch before rescoring. Default: -1 (No limit)
            "cache": True,  # Enable use of vector cache. Default: False
        },
        "vectorCacheMaxObjects": 100000,  # Cache size if `cache` enabled. Default: 1000000000000
    },
    },
}
        ],
    "invertedIndexConfig": {
    "bm25": {
        "b": 0.7,
        "k1": 1.25
    },
    "indexTimestamps": True,
    "indexNullState": True,
    "indexPropertyLength": True
}
}

weviate_client.schema.create_class(class_obj)

# Function to retrieve documents based on a query

In [23]:

def query_documents(class_name, properties, where_filter=None):
    query = weviate_client.query.get(class_name, properties)
    if where_filter:
        query = query.with_where(where_filter)
    results = query.do()
    return results

In [None]:
# Example usage
class_name = "Article"
properties = ["type", "heading","content"]  # Specify the properties to be returned
# where_filter = {
#     "path": ["name"],
#     "operator": "Equal",
#     "valueString": "Laptop"
# }

documents = query_documents(class_name, properties)
print(documents)

{'data': {'Get': {'Electionamigos': []}}}


In [None]:
  # Configure batch
# with weviate_client.batch as batch:
for i,data_obj in enumerate(data_object):
        dt_obj = {"type": data_obj['type'],
            "content":data_obj['content']
            }
        weviate_client.batch.configure(batch_size=100).add_data_object(
            dt_obj,
            "test4",
            vector=embeddings_openai[i]
            # tenant="tenantA"  # If multi-tenancy is enabled, specify the tenant to which the object will be added.
        )

In [None]:
collection = weviate_client.collections.get("test4")

for item in collection.iterator(
    include_vector=True  # If using named vectors, you can specify ones to include e.g. ['title', 'body'], or True to include all
):
    print(item.properties)
    print(item.vector)

AttributeError: 'Client' object has no attribute 'collections'

In [None]:
# STEP 1 - Prepare a helper function to iterate through data in batches
def get_batch_with_cursor(collection_name, batch_size, cursor=None):
    # First prepare the query to run through data
    query = (
        weviate_client.query.get(
            collection_name,         # update with your collection name
            ["type", "content"] # update with the required properties
        )
        .with_additional(["id vector"])
        .with_limit(batch_size)
    )

    # Fetch the next set of results
    if cursor is not None:
        result = query.with_after(cursor).do()
    # Fetch the first set of results
    else:
        result = query.do()

    return result["data"]["Get"][collection_name]

In [None]:


# STEP 2 - Iterate through the data
cursor = None
while True:
    # Get the next batch of objects
    next_batch = get_batch_with_cursor("Test4", 100, cursor)

    # Break the loop if empty – we are done
    if len(next_batch) == 0:
        break

    # Here is your next batch of objects
    print(next_batch)

    # Move the cursor to the last returned uuid
    cursor=next_batch[-1]["_additional"]["id"]

[{'_additional': {'id': '032a26ab-ff2c-4dbc-bc74-d344aa64963f', 'vector': [-0.01339822, 0.042832207, 0.05714469, -0.01766697, 0.023691894, 0.0008460216, 0.012589196, -0.010267364, 0.0058539105, 0.033176545, 0.008399375, 0.0013574166, -0.027572576, -0.007860025, 0.026165007, 0.0055776583, -0.0031588096, -0.03978028, -0.07656126, 0.04433186, 0.030387715, 0.048620343, 0.017009227, -0.0029466874, -0.007524577, -0.037570264, -0.029256398, 0.00051673915, 0.018008996, -0.029519495, 0.015035999, -0.03643895, 0.042306013, 0.013759978, -0.00039279574, 0.010504152, -0.011451301, -0.00531785, -0.037622884, -0.06819477, -0.006080832, -0.0018860773, -0.022994686, 0.041595653, -0.039701354, 0.022889448, -0.024652198, -0.0886111, 0.040438022, 0.017324943, -0.019324481, -0.043332092, 0.0021409527, -0.034597266, 0.041437794, -0.006919454, 0.017456492, 0.04346364, 0.058407556, -0.061354242, 0.053303473, -0.013523191, -0.0054263775, 0.0005673031, 0.002903934, 0.015627967, -0.022560576, 0.042200774, 0.0003

In [None]:
# Function to retrieve documents based on a query
def query_documents(class_name, properties, where_filter=None):
    query = weviate_client.query.get(class_name, properties)
    if where_filter:
        query = query.with_where(where_filter)
    results = query.do()
    return results

# Example usage
class_name = "Test4"
properties = ["type", "content"]  # Specify the properties to be returned
# where_filter = {
#     "path": ["name"],
#     "operator": "Equal",
#     "valueString": "Laptop"
# }

documents = query_documents(class_name, properties)
print(documents)

NameError: name 'weviate_client' is not defined

# creating embeding to the question and retriving the data

In [3]:
import numpy as np

In [8]:
from openai import OpenAI
client = OpenAI(api_key="sk-proj-8zeBScFVgVhpceskr02sT3BlbkFJ0lvp27e5sdNhQW6TPrPV")


In [9]:
def generate_embeddings_openai(question):
    embeddings = []  # Initialize an empty list to store all embeddings
    try:
        # Generate embedding using OpenAI's GPT-3 model
        response = client.embeddings.create(
            input=question,
            model="text-embedding-3-small"
        )
        embedding_vector = response.data[0].embedding
        embeddings.append(embedding_vector)  # Append each embedding to the list
    except Exception as e:
        print(f"Error generating embedding for question : {e}")

    return embeddings  # Return the list of all embeddings

In [10]:
# Define a function to get vectors for a given question and from the collection
def get_vector_for_question(question):
    # Encode the question using a pre-trained model (you need to replace this with actual model inference)
    # For this example, let's assume 'encode_question' is a function that encodes a question to a vector
    question_vector = generate_embeddings_openai(question)
    return question_vector


def get_vectors_from_collection():
    # Fetch vectors from Weaviate collection
    result = weviate_client.query.get("Test4", ["_additional { vector }"]).do()
    vectors = []
    for obj in result['data']['Get']['Test4']:
        vectors.append(obj['_additional']['vector'])
    return np.array(vectors)

# Compute cosine similarity
def compute_cosine_similarity(question_vector, collection_vectors):
    return weviate_client.cosine_similarity(question_vector, collection_vectors)



In [32]:
# Example usage
question = "What is the  Top 5 Rapeseed & Mustard Producing States?"
question_vector = get_vector_for_question(question)


In [33]:
len(question_vector)

1

In [34]:
import json

In [36]:
response = (
    weviate_client.query
    .get("Test4", ['type','content'])
    .with_hybrid(
        query=question,
        alpha=0.25,
        vector=question_vector[0],
    )
    
    # .with_additional(["score", "explainScore"])
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=1))


{
 "data": {
  "Get": {
   "Test4": [
    {
     "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
     "type": "table"
    },
    {
     "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
     "type": "table"
    },
    {
     "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
     "type": "table"
    }
   ]
  }
 }
}


In [48]:
result = (
  weviate_client.query
  .get("Test4", ['type','content'])
  .with_generate(grouped_task=question)
  .with_near_vector({
    "vector": question_vector[0]
  })
  .with_limit(5)
).do()

print(result)

{'errors': [{'locations': [{'column': 34218, 'line': 1}], 'message': 'Cannot query field "generate" on type "Test4Additional".', 'path': None}]}


In [56]:
from weaviate.gql.get import HybridFusion

response = (
    weviate_client.query
    .get("Test4", ['type','content'])
    .with_hybrid(
        query=question,
        vector=question_vector[0],
        fusion_type=HybridFusion.RELATIVE_SCORE
    )
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "Test4": [
        {
          "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
          "type": "table"
        },
        {
          "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
          "type": "table"
        },
        {
          "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
          "type": "table"
        }
      ]
    }
  }
}


In [59]:
response = (
    weviate_client.query
    .get("Test4", ['type','content'])
    .with_hybrid(
        query="food",
        vector=question_vector[0],
        properties=["content"],
        alpha=0.25
    )
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "Test4": [
        {
          "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
          "type": "table"
        },
        {
          "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
          "type": "table"
        },
        {
          "content": "Table 1: Top 5 Rapeseed & Mustard Producing States, 2023-24\\* (Production in LMTs)",
          "type": "table"
        }
      ]
    }
  }
}


In [12]:
collection_vectors = get_vectors_from_collection()
similarities = compute_cosine_similarity(question_vector, collection_vectors)

# Print the similarities
print("Cosine similarities:", similarities)

AttributeError: 'Client' object has no attribute 'cosine_similarity'