# Preprocessing Unstructured Data for LLM Applications

This notebook is a full walkthrough of using the Unstructured open source library and API to preprocess unstructured data (PDFs, images, PowerPoint files, emails, and more) for use in LLM applications (RAG, CAG, agentic AI, and so on).

This project was built by **Zakaria Gouttel** while following Matt Robinson & Andrew Ng's ***"Preprocessing Unstructured Data for LLM Applications"*** course on **DeepLearning.AI**.

## Main Imports
The cells below include all the main imports used in this project.

In [2]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [3]:
from IPython.display import JSON, display_json

from dotenv import load_dotenv, find_dotenv

import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations
from unstructured_client.models.errors import SDKError

from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json

In [4]:
_ = load_dotenv(find_dotenv())

import os

DLAI_API_KEY = os.getenv("U_API_KEY")
DLAI_API_URL = os.getenv("U_API_URL")

s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)
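If either environment variable is missing, `os.getenv` returns `None` and the problem only surfaces later, when the first request is sent. Below is a minimal sketch of an early check. `require_env` is a hypothetical helper (not part of the Unstructured SDK), and the demo uses a placeholder variable name and URL so it runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value


# Demo with a placeholder variable (hypothetical name and value, for illustration only).
os.environ["EXAMPLE_API_URL"] = "https://example.invalid/api"
example_url = require_env("EXAMPLE_API_URL")
print(example_url)
```

In this notebook, the same helper could be used as `DLAI_API_KEY = require_env("U_API_KEY")`.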

## Reading from HTML Files

This section uses the partition_html function from the Unstructured open source library to read data from HTML files. The example file comes from one of my previous HTML training projects. Other examples are available in the repository.

In [1]:
filename = "example_files/8.4 Web Design Agency Project/MyResult.html"
elements = partition_html(filename=filename)


In [None]:
element_dict = [el.to_dict() for el in elements]
example_output = json.dumps(element_dict[1:6], indent=2)
print(example_output)

In [15]:
JSON(example_output)

<IPython.core.display.JSON object>
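Once elements are converted to dictionaries, plain Python is enough to explore them. The sketch below groups element texts by their `type` field. The sample dictionaries are hand-written to mimic the shape of `el.to_dict()`; their contents are illustrative, not real output from the file above.

```python
from collections import defaultdict

# Hand-written sample shaped like the dictionaries returned by el.to_dict();
# the texts and types are made up for illustration.
sample_elements = [
    {"type": "Title", "text": "Web Design Agency", "metadata": {}},
    {"type": "NarrativeText", "text": "We build websites for small businesses.", "metadata": {}},
    {"type": "ListItem", "text": "Design", "metadata": {}},
    {"type": "ListItem", "text": "Development", "metadata": {}},
]

# Group the text of each element under its element type.
texts_by_type = defaultdict(list)
for el in sample_elements:
    texts_by_type[el["type"]].append(el["text"])

for element_type, texts in texts_by_type.items():
    print(f"{element_type}: {texts}")
```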

## Reading from PPT Files

This section utilizes the partition_pptx function from the Unstructured open source library to read data from a PowerPoint file.

In [32]:
import nltk
nltk.download('averaged_perceptron_tagger')

filename = "example_files/msft_openai.pptx"
elements = partition_pptx(filename=filename)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [29]:
element_dict = [el.to_dict() for el in elements]
example_output_ppt = json.dumps(element_dict[:], indent=2)
JSON(example_output_ppt).data

[{'type': 'Title',
  'element_id': '0272e93746a3e4b9d812abd463338715',
  'text': 'Physical diagram ',
  'metadata': {'category_depth': 1,
   'file_directory': 'example_files',
   'filename': 'example.pptx',
   'last_modified': '2024-11-04T13:18:27',
   'page_number': 1,
   'languages': ['eng'],
   'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}},
 {'type': 'Title',
  'element_id': '37bb97fc3af881a0c223c1d703e5979c',
  'text': 'Web Client ',
  'metadata': {'category_depth': 1,
   'file_directory': 'example_files',
   'filename': 'example.pptx',
   'last_modified': '2024-11-04T13:18:27',
   'page_number': 1,
   'languages': ['eng'],
   'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}},
 {'type': 'Title',
  'element_id': '3ff6b3047ef90bbfb1fa79fd21bfe051',
  'text': 'User',
  'metadata': {'category_depth': 1,
   'file_directory': 'example_files',
   'filename': 'example.pptx',
   'last_modified': '2024-11-04T

## Reading from PDF

This section utilizes the partition_pdf function from the Unstructured open source library to read data from a PDF file.

In [5]:
from unstructured.partition.pdf import partition_pdf

filename = "example_files/CoT.pdf"
resp = partition_pdf(
    filename=filename,
    strategy="hi_res",
    extract_images_in_pdf=False,
)
element_dict_pdf = [el.to_dict() for el in resp]

example_output = json.dumps(element_dict_pdf[1:6], indent=2)
print(example_output)

INFO: Reading PDF for file: example_files/CoT.pdf ...


[
  {
    "type": "NarrativeText",
    "element_id": "055f2fa97fbdee35766495a3452ebd9d",
    "text": "This section contains tables for experimental results for varying models and model sizes, on all benchmarks, for standard prompting vs. chain-of-thought prompting.",
    "metadata": {
      "detection_class_prob": 0.940491259098053,
      "coordinates": {
        "points": [
          [
            295.1437683105469,
            271.98025166666656
          ],
          [
            295.1437683105469,
            329.9569183333334
          ],
          [
            1409.8411865234375,
            329.9569183333334
          ],
          [
            1409.8411865234375,
            271.98025166666656
          ]
        ],
        "system": "PixelSpace",
        "layout_width": 1700,
        "layout_height": 2200
      },
      "last_modified": "2025-03-03T14:56:23",
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "paren

## Metadata Extraction and Chunking

In this section, we extract metadata from the parsed elements, load it into a vector database so it can support hybrid search, and then chunk the content.

### Main Imports

In [23]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

In [24]:
from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements

import chromadb

### Example

We will work with the Winter Sports EPUB file in the example files folder. To process this file, we use the Platform API rather than the open source library.

In [25]:
# Prepare the request input.

filename = "./example_files/winter-sports.epub"

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

params=shared.PartitionParameters(files=files)

req = operations.PartitionRequest(partition_parameters=params)

In [30]:
# Send the request.

try:
    resp = s.general.partition(request=req)
except SDKError as e:
    print(e)

JSON(json.dumps(resp.elements[0:3], indent=2)).data

[{'type': 'Title',
  'element_id': 'c2994d1baf27b22f630c8f044664879d',
  'text': 'The Project Gutenberg eBook of Winter Sports in Switzerland, by E. F. Benson',
  'metadata': {'category_depth': 1,
   'languages': ['eng'],
   'filename': 'winter-sports.epub',
   'filetype': 'application/epub'}},
 {'type': 'NarrativeText',
  'element_id': '5cea770359a3b1342a867bac7831ebcd',
  'text': 'This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you’ll have to check the laws of the country where you are located before using this eBook.',
  'metadata': {'link_texts': ['www.gutenberg.org'],
   'link_urls': ['https://www.gutenberg.org'],
   'languages': ['eng'],
   'filename': 'winter-sports.epub',
   'filety

In [32]:
# Search the book for NarrativeText elements that contain the word "hockey".

[x for x in resp.elements if x['type'] == 'NarrativeText' and 'hockey' in x['text'].lower()]

[{'type': 'NarrativeText',
  'element_id': '80e426d3de10c91ca6ff1b7880f388c8',
  'text': 'But at every Swiss resort there are rinks made, which render the skater independent of natural surfaces of ice, and those, at all well-conducted places, are “new every morning,” because every evening they are swept and sprinkled with water, which by the ensuing day has frozen, and presents a fresh surface to the zealot. In fact, an artificial skating-rink is as necessary an equipment in the Swiss winter resort as is the hotel itself. The construction and renovation of these rinks is most interesting, and ranks among the fine arts, just as does the architecture of a fine golf-links or the preparation of good wickets. These rinks are used for two purposes: skating, including bandy or ice hockey, and curling. I do not count ice-gymkhanas or ice-carnivals, because anything is good enough for them. You can play the shovel-game or crawl through barrels among the jungles of frost-flowers. I do not imply 

In [33]:
# Mapping Chapters to their IDs as parsed by the ETL.

chapters = [
    "THE SUN-SEEKER",
    "RINKS AND SKATERS",
    "TEES AND CRAMPITS",
    "ICE-HOCKEY",
    "SKI-ING",
    "NOTES ON WINTER RESORTS",
    "FOR PARENTS AND GUARDIANS",
]

chapter_ids = {}
for element in resp.elements:
    for chapter in chapters:
        if chapter in element["text"] and element["type"] == "Title":
            chapter_ids[element["element_id"]] = chapter
            break

In [34]:
chapter_ids

{'65af5da00154a2f526b43177bbad3189': 'THE SUN-SEEKER',
 '4e3f02ce1525ca1a197b5d5e9cd1953d': 'RINKS AND SKATERS',
 'c2f9b5a30cb07d9adaf5f390dc8e7564': 'TEES AND CRAMPITS',
 '67d58ab3aae6e41e9ce429dc4cbe5501': 'ICE-HOCKEY',
 'ea7faf4689009a3b72bf19dc10f29037': 'SKI-ING',
 'ea3feb77c48592d21a05386e68fde88e': 'NOTES ON WINTER RESORTS',
 '56c4ba79d33a505bb9b4480d7a5a4ce7': 'FOR PARENTS AND GUARDIANS'}

In [35]:
# Build a reverse lookup that returns the element ID for a given chapter name.
chapter_to_id = {v: k for k, v in chapter_ids.items()}

# Fetch the first paragraph of the "ICE-HOCKEY" chapter.
[x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

{'type': 'NarrativeText',
 'element_id': '24a39efc765a60db62dd23405133a957',
 'text': 'Many of the Swiss winter-resorts can put into the field a very strong ice-hockey team, and fine teams from other countries often make winter tours there; but the ice-hockey which the ordinary winter visitor will be apt to join in will probably be of the most elementary and unscientific kind indulged in, when the skating day is drawing to a close, by picked-up sides. As will be readily understood, the ice over which a hockey match has been played is perfectly useless for skaters any more that day until it has been swept, scraped, and sprinkled or flooded; and in consequence, at all Swiss resorts, with the exception of St. Moritz, where there is a rink that has been made for the hockey-player, or when an important match is being played, this sport is supplementary to such others as I have spoken of. Nobody, that is, plays hockey and nothing else, since he cannot play hockey at all till the greedy skate
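The `parent_id` lookup generalizes from one chapter to all of them. Below is a minimal sketch that groups paragraph texts under their chapter name. It runs on a tiny hand-made list in the same shape as `resp.elements`; the IDs and texts are invented for illustration.

```python
from collections import defaultdict

# Illustrative elements in the same shape as resp.elements (IDs and text are made up).
sample_elements = [
    {"type": "Title", "element_id": "t1", "text": "ICE-HOCKEY", "metadata": {}},
    {"type": "NarrativeText", "element_id": "p1", "text": "Hockey paragraph one.", "metadata": {"parent_id": "t1"}},
    {"type": "NarrativeText", "element_id": "p2", "text": "Hockey paragraph two.", "metadata": {"parent_id": "t1"}},
    {"type": "Title", "element_id": "t2", "text": "SKI-ING", "metadata": {}},
    {"type": "NarrativeText", "element_id": "p3", "text": "Skiing paragraph.", "metadata": {"parent_id": "t2"}},
]
sample_chapter_ids = {"t1": "ICE-HOCKEY", "t2": "SKI-ING"}

# Collect each element's text under the chapter its parent title belongs to.
paragraphs_by_chapter = defaultdict(list)
for element in sample_elements:
    parent_id = element["metadata"].get("parent_id")
    chapter = sample_chapter_ids.get(parent_id)
    if chapter is not None:
        paragraphs_by_chapter[chapter].append(element["text"])

print(dict(paragraphs_by_chapter))
```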

### Creating a vector database and filling it with the parsed information from our EPUB

Below we create a vector database using ChromaDB. We first create a collection, and then fill it with the data and metadata we've extracted.

In [36]:
client = chromadb.PersistentClient(path="./chroma_tmp", settings=chromadb.Settings(allow_reset=True))
client.reset()

True

In [37]:
collection = client.create_collection(
    name="winter_sports",
    metadata={"hnsw:space": "cosine"}
)

In [38]:
for element in resp.elements:
    parent_id = element["metadata"].get("parent_id")
    chapter = chapter_ids.get(parent_id, "")
    collection.add(
        documents=[element["text"]],
        ids=[element["element_id"]],
        metadatas=[{"chapter": chapter}]
    )

C:\Users\LENOVO\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:07<00:00, 11.0MiB/s]


Here we use the peek function to preview the first few items stored in our freshly created vector database.

In [39]:
results = collection.peek()
print(results["documents"])

['The Project Gutenberg eBook of Winter Sports in Switzerland, by E. F. Benson', 'This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you’ll have to check the laws of the country where you are located before using this eBook.', 'Title: Winter Sports in Switzerland', 'Author: E. F. Benson', 'Illustrator: C. Fleming Williams', 'Photographer: Mrs. Aubrey Le Blond', 'Release date: August 23, 2019 [EBook #60153] Most recently updated: January 30, 2020', 'Language: English', 'Credits: Produced by Chuck Greif and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images available at The Internet Archive)', '*** START OF THE PROJECT GUTENBERG EBOOK WINTER SPORT

Here we query the database with a specific question. The hybrid part of the search is the `where` clause, which restricts the semantic search to elements whose chapter metadata matches. Setting `n_results=2` limits the output to the two closest matches.

In [None]:
result = collection.query(
    query_texts=["What's by the fireside?"],
    n_results=2,
    where={"chapter": "THE SUN-SEEKER"},
)
print(json.dumps(result, indent=2))
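Conceptually, this kind of query does two things: it keeps only the records whose metadata matches the filter, then ranks what is left by vector similarity. The sketch below shows that idea in plain Python. It is not how Chroma is implemented internally, and the 2-D vectors are toy stand-ins for real embeddings; `filtered_search` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(records, query_vector, where, n_results):
    """Keep records whose metadata matches `where`, then rank them by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["vector"], query_vector),
        reverse=True,
    )
    return ranked[:n_results]


# Toy 2-D vectors standing in for real embeddings (made-up values).
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"chapter": "THE SUN-SEEKER"}},
    {"id": "b", "vector": [0.6, 0.8], "metadata": {"chapter": "THE SUN-SEEKER"}},
    {"id": "c", "vector": [1.0, 0.1], "metadata": {"chapter": "ICE-HOCKEY"}},
]

hits = filtered_search(
    records,
    query_vector=[1.0, 0.0],
    where={"chapter": "THE SUN-SEEKER"},
    n_results=2,
)
print([h["id"] for h in hits])
```

Record `c` is close to the query vector but never competes, because the metadata filter removes it before ranking.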

### Chunking Content

In this section, we cover how to chunk content using the Unstructured library.

In [42]:
# Chunk by title: sections shorter than 100 characters are combined with neighboring sections, and no chunk exceeds 3000 characters.

elements = dict_to_elements(resp.elements)
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=100,
    max_characters=3000,
)

len(elements)

JSON(json.dumps(chunks[0].to_dict(), indent=2))


<IPython.core.display.JSON object>

In [44]:
len(chunks)

165
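To build intuition for what title-based chunking does, here is a deliberately simplified sketch in plain Python. It is not the library's algorithm: `simple_chunk_by_title` is a hypothetical function that starts a new section at every `Title`, merges sections that are still too short into the next one, and closes a chunk early when adding text would exceed the maximum. It does not split a single oversized element.

```python
def simple_chunk_by_title(elements, combine_under_n_chars=100, max_characters=3000):
    """Group element texts into chunks that start at each Title (simplified)."""
    # Step 1: split the element stream into sections, one per Title.
    sections, current = [], []
    for el in elements:
        if el["type"] == "Title" and current:
            sections.append(current)
            current = []
        current.append(el["text"])
    if current:
        sections.append(current)

    # Step 2: turn sections into chunks, honoring both size limits.
    chunks, pending = [], ""
    for section in sections:
        for text in section:
            candidate = f"{pending}\n\n{text}" if pending else text
            if len(candidate) > max_characters and pending:
                chunks.append(pending)  # close the chunk before it grows too large
                pending = text
            else:
                pending = candidate
        # Close the chunk at the section boundary unless it is still too small.
        if len(pending) >= combine_under_n_chars:
            chunks.append(pending)
            pending = ""
    if pending:
        chunks.append(pending)
    return chunks


# Made-up elements: a very short first section followed by two longer ones.
sample_elements = [
    {"type": "Title", "text": "Intro"},
    {"type": "NarrativeText", "text": "Short."},
    {"type": "Title", "text": "Chapter One"},
    {"type": "NarrativeText", "text": "x" * 120},
    {"type": "Title", "text": "Chapter Two"},
    {"type": "NarrativeText", "text": "y" * 150},
]

sample_chunks = simple_chunk_by_title(sample_elements)
print(len(sample_chunks))
```

The short "Intro" section is folded into the first chunk instead of becoming a tiny chunk of its own, which is the effect `combine_text_under_n_chars` is meant to achieve.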

## Preprocessing PDFs & Images

This section covers the preprocessing of PDFs and images in more detail, using both the Unstructured open source library and the Platform API.

### Main Imports

In [6]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations
from unstructured_client.models.errors import SDKError

from unstructured.partition.html import partition_html
from unstructured.partition.pdf import partition_pdf

from unstructured.staging.base import dict_to_elements

### Process the document as HTML, to use as a benchmark for our PDF processing

In [7]:
filename = "./example_files/el_nino.html"
html_elements = partition_html(filename=filename)

for element in html_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

UNCATEGORIZEDTEXT: CNN 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
UNCATEGORIZEDTEXT: By Mary Gilbert, CNN Meteorologist
UNCATEGORIZEDTEXT: Updated: 3:49 PM EST, Tue January 30, 2024
UNCATEGORIZEDTEXT: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moistur

### Processing with DLD (Document Layout Detection)

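First, we run partition_pdf with the "fast" strategy for a quick overview. This strategy reads the embedded text directly and does not run a layout detection model.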

In [8]:
filename = "./example_files/el_nino.pdf"
pdf_elements = partition_pdf(filename=filename, strategy="fast")

for element in pdf_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

INFO: pikepdf C++ to Python logger bridge initialized


HEADER: 1/30/24, 5:11 PM
HEADER: Pineapple express: California to get drenched by back-to-back storms fueling a serious ﬂood threat | CNN
HEADER: CNN 1/30/2024
NARRATIVETEXT: A potent pair of atmospheric rivers will drench California as El Niño makes its ﬁrst mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
TITLE: Updated: 3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the ﬁrst clear sign of the inﬂuence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the ﬂood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Paciﬁc that inﬂuences weather around the globe – causes changes in the jet stream that can po

Next, we use DLD with YOLOX as the object detection model. The document we're using does not need OCR, so this is a single model call.

In [11]:
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

params = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
)

req=operations.PartitionRequest(partition_parameters=params)

try:
    resp = s.general.partition(request=req)
    dld_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)


for element in dld_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"


HEADER: 1/30/24, 5:11 PM
HEADER: CNN 1/30/2024
HEADER: Pineapple express: California to get drenched by back-to-back storms fueling a serious ﬂood threat | CNN
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its ﬁrst mark on winter
NARRATIVETEXT: By Mary Gilbert, CNN Meteorologist
NARRATIVETEXT: Updated: 3:49 PM EST, Tue January 30, 2024
NARRATIVETEXT: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the ﬁrst clear sign of the inﬂuence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the ﬂood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Paciﬁc that inﬂuences weather around the globe – causes changes in the jet st

Now let's compare the DLD output to our HTML output to see how well the PDF was processed.

In [12]:
import collections

len(html_elements)

html_categories = [el.category for el in html_elements]
collections.Counter(html_categories).most_common()

[('NarrativeText', 23), ('UncategorizedText', 6), ('Title', 2)]

In [13]:
len(dld_elements)

dld_categories = [el.category for el in dld_elements]
collections.Counter(dld_categories).most_common()

[('NarrativeText', 28), ('Header', 6), ('Title', 4), ('Footer', 1)]
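To make the comparison easier to read, the two sets of counts can be lined up per category. The sketch below re-enters the counts printed by the two cells above as `Counter` objects and computes the difference for each category.

```python
from collections import Counter

# Category counts reported by the two cells above.
html_counts = Counter({"NarrativeText": 23, "UncategorizedText": 6, "Title": 2})
dld_counts = Counter({"NarrativeText": 28, "Header": 6, "Title": 4, "Footer": 1})

# A Counter returns 0 for missing keys, so categories seen on only one side are safe.
all_categories = sorted(set(html_counts) | set(dld_counts))
differences = {cat: dld_counts[cat] - html_counts[cat] for cat in all_categories}

for cat in all_categories:
    print(f"{cat:<18} html={html_counts[cat]:>2}  dld={dld_counts[cat]:>2}  diff={differences[cat]:+d}")
```

The PDF output has extra `Header` and `Footer` elements that do not exist in the HTML version, and it has no `UncategorizedText` elements at all.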

## Extracting Tables

In this part we'll cover table extraction using the Unstructured Platform API.


### Main Imports

In [22]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations
from unstructured_client.models.errors import SDKError

from unstructured.staging.base import dict_to_elements


_ = load_dotenv(find_dotenv())

import os

DLAI_API_KEY = os.getenv("U_API_KEY")
DLAI_API_URL = os.getenv("U_API_URL")

s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)




### Extracting tables from a PDF file

In [15]:
filename = "./example_files/embedded-images-tables.pdf"

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

params = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
    skip_infer_table_types=[],
    pdf_infer_table_structure=True,
)

req=operations.PartitionRequest(partition_parameters=params)

try:
    resp = s.general.partition(request=req)
    elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"


In [16]:
# Keep only the Table elements.
tables = [el for el in elements if el.category == "Table"]
tables[0].text

'Inhibitor Polarization Corrosion be (V/dec) ba (V/dec) Ecorr (V) icorr (Acm?) concentration (g) resistance (Q) rate (mmj/year) 0.0335 0.0409 —0.9393 0.0003 24.0910 2.8163 1.9460 0.0596 .8276 0.0002 121.440 1.5054 0.0163 0.2369 .8825 0.0001 42121 0.9476 NO 0233 0.0540 —0.8027 5.39E-05 373.180 0.4318 0.1240 0.0556 .5896 5.46E-05 305.650 0.3772 = 5 0.0382 0.0086 .5356 1.24E-05 246.080 0.0919'

In [18]:
# Convert the table output to HTML and pretty-print it.

table_html = tables[0].metadata.text_as_html

from io import StringIO 
from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())

<table>
  <thead>
    <tr>
      <th>Inhibitor concentration (g)</th>
      <th>be (V/dec)</th>
      <th>ba (V/dec)</th>
      <th>Ecorr (V)</th>
      <th>icorr (Acm?)</th>
      <th>Polarization resistance (Q)</th>
      <th>Corrosion rate (mmj/year)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td/>
      <td>0.0335</td>
      <td>0.0409</td>
      <td>&#8212;0.9393</td>
      <td>0.0003</td>
      <td>24.0910</td>
      <td>2.8163</td>
    </tr>
    <tr>
      <td>NO</td>
      <td>1.9460</td>
      <td>0.0596</td>
      <td>&#8212;0.8276</td>
      <td>0.0002</td>
      <td>121.440</td>
      <td>1.5054</td>
    </tr>
    <tr>
      <td/>
      <td>0.0163</td>
      <td>0.2369</td>
      <td>&#8212;0.8825</td>
      <td>0.0001</td>
      <td>42121</td>
      <td>0.9476</td>
    </tr>
    <tr>
      <td/>
      <td>0233</td>
      <td>0.0540</td>
      <td>&#8212;0.8027</td>
      <td>5.39E-05</td>
      <td>373.180</td>
      <td>0.4318</td>
    </tr>
    <tr>
      <td/>
 

In [19]:
# Display the table.
from IPython.core.display import HTML
HTML(table_html)

| Inhibitor concentration (g) | be (V/dec) | ba (V/dec) | Ecorr (V) | icorr (Acm?) | Polarization resistance (Q) | Corrosion rate (mmj/year) |
| --- | --- | --- | --- | --- | --- | --- |
|  | 0.0335 | 0.0409 | —0.9393 | 0.0003 | 24.091 | 2.8163 |
| NO | 1.946 | 0.0596 | —0.8276 | 0.0002 | 121.44 | 1.5054 |
|  | 0.0163 | 0.2369 | —0.8825 | 0.0001 | 42121.0 | 0.9476 |
|  | 233.0 | 0.054 | —0.8027 | 5.39e-05 | 373.18 | 0.4318 |
|  | 0.124 | 0.0556 | —0.5896 | 5.46e-05 | 305.65 | 0.3772 |
| = 5 | 0.0382 | 0.0086 | —0.5356 | 1.24e-05 | 246.08 | 0.0919 |
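Once a table is available as HTML, it can also be turned into row records for downstream use. Below is a minimal sketch using only the standard library. `TableRowParser` is a hypothetical helper, and `sample_html` is a cut-down copy of the table output (three of its columns and its first two rows).

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Cut-down copy of the extracted table: three columns, two body rows.
sample_html = (
    "<table><thead><tr><th>be (V/dec)</th><th>ba (V/dec)</th>"
    "<th>Polarization resistance (Q)</th></tr></thead>"
    "<tbody><tr><td>0.0335</td><td>0.0409</td><td>24.0910</td></tr>"
    "<tr><td>1.9460</td><td>0.0596</td><td>121.440</td></tr></tbody></table>"
)

row_parser = TableRowParser()
row_parser.feed(sample_html)
header, *body = row_parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Cell values stay as strings here. Given the OCR noise visible in the extracted table, converting them to numbers is best done as a separate, validated step.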


### Utilizing Tables in RAG Applications (Example with Langchain)

In [20]:
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain.chains.summarize import load_summarize_chain

In [23]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="stuff")
chain.invoke([Document(page_content=table_html)])

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


{'input_documents': [Document(metadata={}, page_content='<table><thead><tr><th>Inhibitor concentration (g)</th><th>be (V/dec)</th><th>ba (V/dec)</th><th>Ecorr (V)</th><th>icorr (Acm?)</th><th>Polarization resistance (Q)</th><th>Corrosion rate (mmj/year)</th></tr></thead><tbody><tr><td></td><td>0.0335</td><td>0.0409</td><td>—0.9393</td><td>0.0003</td><td>24.0910</td><td>2.8163</td></tr><tr><td>NO</td><td>1.9460</td><td>0.0596</td><td>—0.8276</td><td>0.0002</td><td>121.440</td><td>1.5054</td></tr><tr><td></td><td>0.0163</td><td>0.2369</td><td>—0.8825</td><td>0.0001</td><td>42121</td><td>0.9476</td></tr><tr><td></td><td>0233</td><td>0.0540</td><td>—0.8027</td><td>5.39E-05</td><td>373.180</td><td>0.4318</td></tr><tr><td></td><td>0.1240</td><td>0.0556</td><td>—0.5896</td><td>5.46E-05</td><td>305.650</td><td>0.3772</td></tr><tr><td>= 5</td><td>0.0382</td><td>0.0086</td><td>—0.5356</td><td>1.24E-05</td><td>246.080</td><td>0.0919</td></tr></tbody></table>')],
 'output_text': 'The table provide

## Building our own RAG Bot

In this section, we combine the techniques from the previous sections to build our own RAG bot, which answers questions from the unstructured documents we provide as its context.

### Main Imports

In [3]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations
from unstructured_client.models.errors import SDKError

from unstructured.chunking.title import chunk_by_title
from unstructured.partition.md import partition_md
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements

from dotenv import load_dotenv, find_dotenv


import chromadb

_ = load_dotenv(find_dotenv())

import os

DLAI_API_KEY = os.getenv("U_API_KEY")
DLAI_API_URL = os.getenv("U_API_URL")

s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)

### Reading our context files

Below, we read the PDF file using the Unstructured API and preprocess it for insertion into our ChromaDB vector database.

In [4]:

# Parsing the pdf file using the unstructured API.

filename = "./rag_context/donut_paper.pdf"

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

params = shared.PartitionParameters(
    files=files,
    strategy="hi_res",  # Document Layout Detection
    hi_res_model_name="yolox",
    pdf_infer_table_structure=True,
    skip_infer_table_types=[],
)

req=operations.PartitionRequest(partition_parameters=params)

try:
    resp = s.general.partition(request=req)
    pdf_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

INFO: HTTP Request: GET https://api.unstructuredapp.io/general/docs "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"


In [7]:
# Extract the tables from the parsed PDF elements.

tables = [el for el in pdf_elements if el.category == "Table"]
table_html = tables[0].metadata.text_as_html

In [8]:
# Pretty-print the HTML of the first table.

from io import StringIO 
from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())

<table>
  <tbody/>
</table>



In [9]:
# Find the "References" title in the parsed PDF so we can remove the elements listed under it.
reference_title = [
    el for el in pdf_elements
    if el.text == "References"
    and el.category == "Title"
][0]

reference_title.to_dict()
references_id = reference_title.id

for element in pdf_elements:
    if element.metadata.parent_id == references_id:
        print(element)
        break

pdf_elements = [el for el in pdf_elements if el.metadata.parent_id != references_id]

1. Afzal, M.Z., Capobianco, S., Malik, M.I., Marinai, S., Breuel, T.M., Dengel, A., Liwicki, M.: Deepdocclassifier: Document classification with deep convolutional neural network. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR). pp. 1111–1115 (2015). https://doi.org/10.1109/ICDAR.2015.7333933 1, 4, 14


In [11]:
# Remove the PDF headers so they don't pollute the context we index.

headers = [el for el in pdf_elements if el.category == "Header"]
headers[1].to_dict()
pdf_elements = [el for el in pdf_elements if el.category != "Header"]
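The two cleanup steps above (dropping children of the "References" title and dropping headers) follow the same filtering pattern. The sketch below combines them into one hypothetical helper, `drop_noise`. It operates on plain dictionaries rather than Unstructured element objects so that it runs on its own, and the sample data is invented.

```python
def drop_noise(elements, drop_categories=("Header",), drop_parent_ids=()):
    """Return elements that are neither in a dropped category nor children of a dropped parent."""
    return [
        el for el in elements
        if el["type"] not in drop_categories
        and el["metadata"].get("parent_id") not in drop_parent_ids
    ]


# Made-up elements in dictionary form; "r1" plays the role of the References title.
sample_elements = [
    {"type": "Header", "element_id": "h1", "text": "Page header", "metadata": {}},
    {"type": "Title", "element_id": "t1", "text": "Introduction", "metadata": {}},
    {"type": "NarrativeText", "element_id": "p1", "text": "Body text.", "metadata": {"parent_id": "t1"}},
    {"type": "Title", "element_id": "r1", "text": "References", "metadata": {}},
    {"type": "ListItem", "element_id": "r2", "text": "1. Some citation.", "metadata": {"parent_id": "r1"}},
]

kept = drop_noise(sample_elements, drop_parent_ids=("r1",))
print([el["element_id"] for el in kept])
```

As in the notebook cells, the "References" title itself is kept. Only the entries listed under it and the headers are removed.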

#### PowerPoint file preprocessing

In [12]:
filename = "./rag_context/donut_slide.pptx"
pptx_elements = partition_pptx(filename=filename)

#### Markdown file preprocessing

In [13]:
filename = "./rag_context/donut_readme.md"
md_elements = partition_md(filename=filename)

### Loading the documents into the Vector DB

Below, we load the documents into our vector DB. First we chunk them, then we create embeddings for the chunks and load both the chunks and their embeddings into the database.


In [16]:
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.utils import filter_complex_metadata


In [17]:
# Chunk the elements by title.
elements = chunk_by_title(pdf_elements + pptx_elements + md_elements)

# Wrap each chunk in a LangChain Document, using the filename as its source.
documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    metadata.pop("languages", None)  # list-valued field; skipped safely if absent
    metadata["source"] = metadata["filename"]
    documents.append(Document(page_content=element.text, metadata=metadata))

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(filter_complex_metadata(documents), embeddings)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6}
)

INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
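Vector stores such as Chroma generally expect metadata values to be simple scalars, which is why the metadata is cleaned before loading. The sketch below illustrates the idea behind that cleaning step. It is not the implementation of `filter_complex_metadata`; `keep_scalar_metadata` is a hypothetical helper and the sample metadata is invented.

```python
ALLOWED_TYPES = (str, int, float, bool)


def keep_scalar_metadata(metadata):
    """Drop metadata entries whose values are not plain scalars."""
    return {k: v for k, v in metadata.items() if isinstance(v, ALLOWED_TYPES)}


# Made-up metadata resembling what an element's metadata.to_dict() can contain.
sample_metadata = {
    "filename": "donut_readme.md",
    "page_number": 1,
    "languages": ["eng"],           # list: dropped
    "coordinates": {"points": []},  # dict: dropped
}

clean_metadata = keep_scalar_metadata(sample_metadata)
print(clean_metadata)
```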


### Preparing the RAG Bot

In [18]:
from langchain.prompts.prompt import PromptTemplate
from langchain_openai import OpenAI
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

template = """You are an AI assistant for answering questions about the Donut document understanding model.
You are given the following extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
If the question is not about Donut, politely inform them that you are tuned to only answer questions about Donut.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""
prompt = PromptTemplate(template=template, input_variables=["question", "context"])
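To see what filling the two placeholders looks like, the sketch below does it with plain `str.format`, which is roughly what happens when a template like this is rendered. `example_template` is a shortened stand-in for the template above, and the retrieved chunks are made up for illustration.

```python
# Shortened stand-in for the template above; the placeholders are the same.
example_template = """Question: {question}
=========
{context}
=========
Answer in Markdown:"""

# Made-up chunks standing in for what a retriever would return.
retrieved_chunks = [
    "Donut does not require OCR.",
    "It consists of a visual encoder and a textual decoder.",
]

filled_prompt = example_template.format(
    question="What is Donut?",
    context="\n\n".join(retrieved_chunks),
)
print(filled_prompt)
```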

In [19]:
llm = OpenAI(temperature=0)

doc_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")
question_generator_chain = LLMChain(llm=llm, prompt=prompt)
qa_chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)

See also the following migration guides for replacements based on `chain_type`:
stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain



In [20]:
qa_chain.invoke({
    "question": "How does Donut compare to other document understanding models?",
    "chat_history": []
})["answer"]

INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


' Donut is a state-of-the-art document understanding model that does not require OCR and can be trained in an end-to-end fashion. It uses a simple architecture consisting of a visual encoder and textual decoder. \nSOURCES: donut_readme.md, donut_paper.pdf, donut_slide.pptx'

In [21]:
filter_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "filter": {"source": "donut_readme.md"}}
)

In [22]:
filter_chain = ConversationalRetrievalChain(
    retriever=filter_retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)

filter_chain.invoke({
    "question": "How do I classify documents with DONUT?",
    "chat_history": [],
})["answer"]

INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


' DONUT can classify documents through visual document classification or information extraction. \nSOURCES: donut_readme.md'

# And that's it. I hope you learned as much as I did!

Peace, Zack