# **Unstructured RAG**
Unstructured (or semi-structured) RAG is a method for handling documents that combine text, tables, and images. It addresses two common problems: tables that are broken apart by text splitting, and the difficulty of embedding tables for semantic search.

This notebook uses unstructured.io to parse a document and separate its text, tables, and images.

Tool Reference: [Unstructured](https://unstructured.io/)
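
To make the "broken tables" problem concrete, here is a minimal sketch in plain Python. The toy document and its values are invented for illustration, and `split_fixed` and `split_by_element` are hypothetical helpers, not part of unstructured.io. It only shows why cutting text at fixed character offsets can split a table, while grouping by element keeps it whole.

```python
# Toy document: a paragraph followed by a small table (made-up values).
doc = (
    "Results are summarized below.\n"
    "| model | score |\n"
    "| a     | 0.41  |\n"
    "| b     | 0.47  |\n"
)


def split_fixed(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_element(text: str) -> list[str]:
    """Element-aware splitter: keep consecutive table rows in one chunk."""
    chunks, table = [], []
    for line in text.splitlines():
        if line.startswith("|"):
            table.append(line)
            continue
        if table:  # a non-table line ends the current table
            chunks.append("\n".join(table))
            table = []
        chunks.append(line)
    if table:
        chunks.append("\n".join(table))
    return chunks


naive_chunks = split_fixed(doc, 40)     # the first cut lands inside the header row
element_chunks = split_by_element(doc)  # the table stays in a single chunk
```

Printing `naive_chunks` shows the table header cut in half, which is the failure mode that element-based parsing is meant to avoid.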

## **Initial Setup**

In [1]:
# faiss-gpu has no matching distribution for this environment, so faiss-cpu is used instead
! pip install -q athina faiss-cpu pytesseract unstructured-client "unstructured[all-docs]"


In [2]:
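# These commands assume a Debian/Ubuntu environment (for example Colab).
# On Windows, apt-get is unavailable: install Poppler and Tesseract separately and add them to PATH.
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev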

'apt-get' is not recognized as an internal or external command,
operable program or batch file.
'apt-get' is not recognized as an internal or external command,
operable program or batch file.
'apt-get' is not recognized as an internal or external command,
operable program or batch file.


In [3]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file (load_dotenv populates os.environ)
load_dotenv()

# Verify that the keys are loaded. Assigning os.getenv(...) back into os.environ
# would raise a TypeError when a key is missing, so check instead of reassigning.
for key in ("OPENAI_API_KEY", "ATHINA_API_KEY"):
    if not os.getenv(key):
        print(f"Warning: {key} not loaded from .env file")

## **Indexing**

In [5]:
! pip install --q athina faiss-cpu pytesseract unstructured-client "unstructured[all-docs]" python-dotenv langchain-openai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradio 5.16.1 requires aiofiles<24.0,>=22.0, but you have aiofiles 24.1.0 which is incompatible.
llama-index-llms-huggingface 0.4.2 requires llama-index-core<0.13.0,>=0.12.0, but you have llama-index-core 0.11.23 which is incompatible.
llama-index-readers-smart-pdf-loader 0.3.0 requires llama-index-core<0.13.0,>=0.12.0, but you have llama-index-core 0.11.23 which is incompatible.
spacy 3.8.4 requires thinc<8.4.0,>=8.3.4, but you have thinc 8.1.12 which is incompatible.



In [13]:
!pip install -U langchain langchain-openai


Collecting langchain
  Using cached langchain-0.3.25-py3-none-any.whl.metadata (7.8 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.8 (from langchain)
  Using cached langchain_text_splitters-0.3.8-py3-none-any.whl.metadata (1.9 kB)
Using cached langchain-0.3.25-py3-none-any.whl (1.0 MB)
Using cached langchain_text_splitters-0.3.8-py3-none-any.whl (32 kB)
Installing collected packages: langchain-text-splitters, langchain
  Attempting uninstall: langchain-text-splitters
    Found existing installation: langchain-text-splitters 0.3.5
    Uninstalling langchain-text-splitters-0.3.5:
      Successfully uninstalled langchain-text-splitters-0.3.5
  Attempting uninstall: langchain
    Found existing installation: langchain 0.3.15
    Uninstalling langchain-0.3.15:
      Successfully uninstalled langchain-0.3.15
Successfully installed langchain-0.3.25 langchain-text-splitters-0.3.8





In [1]:
# load embedding model
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()


In [4]:
!pip uninstall -y onnxruntime onnxruntime-gpu unstructured unstructured-inference

Found existing installation: onnxruntime 1.20.1
Uninstalling onnxruntime-1.20.1:
  Successfully uninstalled onnxruntime-1.20.1
Found existing installation: unstructured 0.16.8
Uninstalling unstructured-0.16.8:
  Successfully uninstalled unstructured-0.16.8
Found existing installation: unstructured-inference 0.8.1
Uninstalling unstructured-inference-0.8.1:
  Successfully uninstalled unstructured-inference-0.8.1




In [5]:
!pip install onnxruntime==1.15.1
!pip install "unstructured[all-docs]" --no-deps
!pip install unstructured-inference

Collecting onnxruntime==1.15.1
  Downloading onnxruntime-1.15.1-cp310-cp310-win_amd64.whl.metadata (4.1 kB)
Downloading onnxruntime-1.15.1-cp310-cp310-win_amd64.whl (6.7 MB)
Installing collected packages: onnxruntime
Successfully installed onnxruntime-1.15.1





Collecting unstructured[all-docs]
  Downloading unstructured-0.17.2-py3-none-any.whl.metadata (24 kB)
Downloading unstructured-0.17.2-py3-none-any.whl (1.8 MB)
Installing collected packages: unstructured
Successfully installed unstructured-0.17.2





Collecting unstructured-inference
  Downloading unstructured_inference-1.0.2-py3-none-any.whl.metadata (5.3 kB)
Collecting onnxruntime>=1.18.0 (from unstructured-inference)
  Downloading onnxruntime-1.22.0-cp310-cp310-win_amd64.whl.metadata (5.0 kB)
Downloading unstructured_inference-1.0.2-py3-none-any.whl (47 kB)
Downloading onnxruntime-1.22.0-cp310-cp310-win_amd64.whl (12.7 MB)




In [None]:
# load and extract images, tables, and chunk text
from unstructured.partition.pdf import partition_pdf

filename = "./content/sample.pdf"

pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=True,
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=3000,
    combine_text_under_n_chars=200,
)
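
The `chunking_strategy="by_title"`, `max_characters`, and `combine_text_under_n_chars` arguments above control how parsed elements are grouped into chunks. The sketch below is a simplified illustration of that idea in plain Python, not the unstructured implementation: `chunk_by_title` is a hypothetical helper, the `(kind, text)` tuples are invented stand-ins for parsed elements, and details such as splitting oversized elements are ignored. It assumes a new chunk starts at a title unless the current chunk is still shorter than `combine_text_under_n_chars`, that no chunk grows beyond `max_characters`, and that tables are kept in their own chunk.

```python
def chunk_by_title(elements, max_characters=3000, combine_text_under_n_chars=200):
    """Group (kind, text) pairs into chunks, breaking at titles and tables."""
    chunks = []   # finished (kind, text) chunks
    buffer = ""   # text of the chunk currently being built

    def flush():
        nonlocal buffer
        if buffer:
            chunks.append(("CompositeElement", buffer))
            buffer = ""

    for kind, text in elements:
        if kind == "Table":
            # Tables are never merged with surrounding text.
            flush()
            chunks.append(("Table", text))
            continue
        candidate = f"{buffer}\n\n{text}" if buffer else text
        # A title starts a new chunk only if the current one is already big enough.
        new_section = kind == "Title" and len(buffer) >= combine_text_under_n_chars
        if buffer and (new_section or len(candidate) > max_characters):
            flush()
            candidate = text
        buffer = candidate
    flush()
    return chunks


# Invented sample elements, with small limits so the effect is visible.
elements = [
    ("Title", "Intro"), ("Text", "Short."),
    ("Title", "Methods"), ("Text", "x" * 50),
    ("Title", "Results"), ("Text", "Done."),
    ("Table", "a|b"),
]
chunks = chunk_by_title(elements, max_characters=100, combine_text_under_n_chars=20)
```

In this toy run the short "Intro" section is merged into "Methods", "Results" starts a new chunk, and the table ends up alone. That is consistent with the mix of `CompositeElement` and table chunks reported by the category counts in this notebook.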

In [7]:
# check unique categories
from collections import Counter
category_counts = Counter(str(type(element)) for element in pdf_elements)
unique_categories = set(category_counts)
category_counts

Counter({"<class 'unstructured.documents.elements.CompositeElement'>": 14,
         "<class 'unstructured.documents.elements.TableChunk'>": 2})

In [8]:
# extract unique types
unique_types = {el.to_dict()['type'] for el in pdf_elements}
unique_types

{'CompositeElement', 'Table'}
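
Once the element types are known, tables and text can be routed separately. The sketch below works on plain dictionaries shaped like the output of `el.to_dict()`. The sample records are invented, `group_by_type` and `table_payload` are hypothetical helpers, and the presence of a `text_as_html` entry in the table metadata is an assumption about what `infer_table_structure=True` produces.

```python
from collections import defaultdict

# Stand-in records shaped like el.to_dict(); the values are invented.
element_dicts = [
    {"type": "CompositeElement", "text": "Intro paragraph.", "metadata": {}},
    {
        "type": "Table",
        "text": "model score a 0.41",
        # Assumed metadata key holding an HTML rendering of the table.
        "metadata": {"text_as_html": "<table><tr><td>a</td><td>0.41</td></tr></table>"},
    },
]


def group_by_type(records):
    """Bucket element dictionaries by their 'type' field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["type"]].append(record)
    return dict(grouped)


def table_payload(record):
    """Prefer the HTML rendering of a table when present, else the plain text."""
    return record.get("metadata", {}).get("text_as_html") or record["text"]


grouped = group_by_type(element_dicts)
table_texts = [table_payload(record) for record in grouped.get("Table", [])]
```

Keeping the HTML form is one possible design choice: it preserves row and column structure that the flattened `text` field loses, which may help the LLM read the table.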

In [9]:
# # display images from pdf
# from IPython.display import Image, display
# image_files = os.listdir('/content/figures')
# image_files = [os.path.join('/content/figures', image_file) for image_file in image_files]

# for image_file in image_files:
#     display(Image(filename=image_file))

In [10]:
# convert pdf_elements to langchain documents
from langchain.schema import Document
documents = [Document(page_content=el.text, metadata={"source": filename}) for el in pdf_elements]

## **Vector Store**

In [11]:
# create vectorstore
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(documents, embeddings)
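
Conceptually, the vector store embeds every document and later returns the documents whose vectors lie closest to the query vector. The sketch below illustrates that lookup in plain Python with hand-made 3-dimensional vectors. The texts and vectors are invented, `l2_distance` and `top_k` are hypothetical helpers, and Euclidean (L2) distance is assumed as the metric.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hand-made "embeddings"; real ones come from the embedding model.
indexed = [
    ("Table 6: MATH test set results", [0.9, 0.1, 0.0]),
    ("Related work on reasoning", [0.1, 0.8, 0.2]),
    ("Acknowledgements", [0.0, 0.1, 0.9]),
]
hits = top_k([1.0, 0.0, 0.0], indexed, k=2)
```

A real index performs the same ranking over high-dimensional vectors, using data structures that avoid comparing the query against every stored vector one by one in Python.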

## **Retriever**

In [12]:
# create retriever
retriever = vectorstore.as_retriever()

## **RAG Chain**

In [13]:
# load llm
from langchain_openai import ChatOpenAI
llm = ChatOpenAI()

In [14]:
# create document chain
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

template = """
You are a helpful assistant that answers questions based on the provided context, which can include text and tables.
Use the provided context to answer the question.
Question: {input}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

# Setup RAG pipeline
rag_chain = (
    {"context": retriever,  "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
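
In the chain above, the retriever output fills `{context}` and the raw question fills `{input}`. The sketch below shows that prompt-assembly step in plain Python. `Doc` is a hypothetical stand-in for a LangChain `Document`, and `format_docs` and `build_prompt` are illustrative helpers, not part of the chain defined in this notebook. The sample texts are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical helper)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


TEMPLATE = (
    "Use the provided context to answer the question.\n"
    "Question: {input}\n"
    "Context: {context}\n"
    "Answer:"
)


def format_docs(docs):
    """Join only the text of each document, dropping metadata and reprs."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question, docs):
    """Fill the template the same way the chain fills {input} and {context}."""
    return TEMPLATE.format(input=question, context=format_docs(docs))


docs = [
    Doc("Table 6 reports results on the MATH test set."),
    Doc("Journey Learning is compared with Shortcut Learning."),
]
prompt_text = build_prompt("Compare the training results", docs)
```

A common variation is to pipe the retriever through such a formatting function, for example `{"context": retriever | format_docs, ...}`, so that the prompt receives clean text instead of the string representation of a list of documents.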

In [15]:
# response
response = rag_chain.invoke("Compare all the Training Results on MATH Test Set")
response

'To compare all the Training Results on the MATH Test Set, we can look at the results from Table 6 in the provided context. The results are as follows:\n\n- deepseek-sft-abel:\n   - SFT-phase1: 0.372\n   - SFT-phase2-shortcutLearning: 0.386\n   - SFT-phase2-journeyLearining: 0.470\n   - DPO: 0.472\n\n- deepseek-sft-prm800k:\n   - SFT-phase1: 0.290\n   - SFT-phase2-shortcutLearning: 0.348\n   - SFT-phase2-journeyLearining: 0.428\n   - DPO: 0.440\n\nBased on these results, we can see that Journey Learning led to significant improvements compared to Shortcut Learning on both models, with gains of +8.4 and +8.0 on deepseek-sft-abel and deepseek-sft-prm800k, respectively. The DPO results were also provided for comparison.'

## **Preparing Data for Evaluation**

In [16]:
# create dataset
question = ["Compare all the Training Results on MATH Test Set"]
response = []
contexts = []

# Inference
for query in question:
    response.append(rag_chain.invoke(query))
    # retriever.invoke replaces the deprecated get_relevant_documents
    contexts.append([doc.page_content for doc in retriever.invoke(query)])

# To dict
data = {
    "query": question,
    "response": response,
    "context": contexts,
}


In [17]:
# create dataset
from datasets import Dataset
dataset = Dataset.from_dict(data)

In [18]:
# create dataframe
import pandas as pd
df = pd.DataFrame(dataset)

In [19]:
df

| | query | response | context |
|---|---|---|---|
| 0 | Compare all the Training Results on MATH Test Set | To compare all the Training Results on the MAT... | [The results of our experiments are shown in T... |


In [20]:
# Convert to dictionary
df_dict = df.to_dict(orient='records')

# Convert context to list
for record in df_dict:
    if not isinstance(record.get('context'), list):
        if record.get('context') is None:
            record['context'] = []
        else:
            record['context'] = [record['context']]
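
Before handing the records to an evaluator, it can help to check their shape. The sketch below is a minimal validation pass in plain Python. `validate_records` is a hypothetical helper, the sample records are invented, and the required keys simply mirror the `query`, `response`, and `context` fields built above. Whether an evaluator needs exactly these fields is an assumption.

```python
REQUIRED_KEYS = ("query", "response", "context")


def validate_records(records):
    """Return a list of human-readable problems; empty means the records look fine."""
    problems = []
    for i, record in enumerate(records):
        for key in REQUIRED_KEYS:
            if key not in record:
                problems.append(f"record {i}: missing '{key}'")
        context = record.get("context")
        is_str_list = isinstance(context, list) and all(
            isinstance(item, str) for item in context
        )
        if context is not None and not is_str_list:
            problems.append(f"record {i}: 'context' should be a list of strings")
    return problems


# Invented sample: the second record is deliberately malformed.
sample = [
    {"query": "q1", "response": "r1", "context": ["c1", "c2"]},
    {"query": "q2", "context": "not a list"},
]
issues = validate_records(sample)
```

Running `validate_records(df_dict)` after the conversion cell would then flag any record whose context is still not a list of strings.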

## **Evaluation in Athina AI**

We will use the **Does Response Answer Query** eval here. It checks whether the response answers the user's query. For more details, please refer to our [documentation](https://docs.athina.ai/api-reference/evals/preset-evals/overview).

In [21]:
# set api keys for Athina evals
from athina.keys import AthinaApiKey, OpenAiApiKey
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))


In [22]:
# load dataset
from athina.loaders import Loader
dataset = Loader().load_dict(df_dict)

In [23]:
# evaluate
from athina.evals import DoesResponseAnswerQuery
DoesResponseAnswerQuery(model="gpt-4o").run_batch(data=dataset).to_df()

You can view your dataset at: https://app.athina.ai/develop/e5dec38c-c58c-412d-b910-588d97ccd090


| | query | context | response | expected_response | display_name | failed | grade_reason | runtime | model | passed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Compare all the Training Results on MATH Test Set | [The results of our experiments are shown in Table 6. All results are tested on the MATH test set, using a re-divided subset from PRM800K, which includes 500 examples. The results show that Journey Learning led to significant improvements compared to Shortcut Learning, with gains of +8.4 and +8.0 on the deepseek-sft-abel and deepseek-sft-prm800k models, respectively, demonstrating the effectiveness of our proposed Journey Learning method. However, the improvement from DPO was more modest, an... | To compare all the Training Results on the MATH Test Set, we can look at the results provided in Table 6 from the context. The results for the different models on the MATH test set are as follows:\n\n- deepseek-sft-abel: SFT-phase1 = 0.372, SFT-phase2-shortcutLearning = 0.386, SFT-phase2-journeyLearining = 0.470, DPO = 0.472\n- deepseek-sft-prm800k: SFT-phase1 = 0.290, SFT-phase2-shortcutLearning = 0.348, SFT-phase2-journeyLearining = 0.428, DPO = 0.440\n\nFrom these results, we can see that... | | Does Response Answer Query | False | The response provides a detailed comparison of the training results on the MATH test set for two models, deepseek-sft-abel and deepseek-sft-prm800k. It includes specific performance metrics for different phases and methods, such as SFT-phase1, SFT-phase2-shortcutLearning, SFT-phase2-journeyLearning, and DPO. Additionally, it highlights the improvements observed with the Journey Learning method compared to Shortcut Learning, which directly addresses the user's query about comparing training r... | 1910 | gpt-4o | 1.0 |
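
With more than one query, a single summary number is easier to read than the full results table. The sketch below computes a pass rate from rows shaped like the results above, for example those obtained via `to_dict(orient="records")` on the results dataframe. `pass_rate` is a hypothetical helper, and only the `passed` column is used. The first row copies the value from the table, while the extra failing row is invented to show the arithmetic.

```python
def pass_rate(rows):
    """Fraction of rows whose 'passed' value equals 1.0; None for no rows."""
    if not rows:
        return None
    return sum(1 for row in rows if row["passed"] == 1.0) / len(rows)


# The single row reported in the results above.
rows = [{"query": "Compare all the Training Results on MATH Test Set", "passed": 1.0}]
observed = pass_rate(rows)

# Invented extra row, only to illustrate a mixed outcome.
mixed = pass_rate(rows + [{"query": "hypothetical query", "passed": 0.0}])
```

For the single query evaluated here, `observed` is `1.0`.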
