# Game Plan for Unstructured Data

When you have a lot of data, it is hard to find the crucial information, and you end up wishing you could summarize the key points. That's the challenge of unstructured data.

Imagine having an assistant that could not only find what you are looking for but also give you a nice summary of the context. This is what I am going to build in this notebook.

## Introduction

Making sense out of chaos. I am about to tackle the challenge of organizing unstructured data that doesn't fit neatly into spreadsheets. We will be able to extract, process, and generate insights from unstructured data and make it manageable!

Why should you and I focus on unstructured data?
Because it is everywhere. We have lots of unstructured data, but extracting value from it is challenging.
Traditional data tools struggle with unstructured data, so mastering this skill gives you an edge.

## What will be covered?

1. Learn about LangChain - the main tool to handle unstructured data. This library will help us prepare different document types for embedding and retrieval.
2. Work with different data types (Excel, Word, PPT, EPUB, PDF).
3. Build retrieval and generation functions - retrieve relevant information from the unstructured data and generate answers from it.

## Putting it all together

By the end of this notebook, you will have the tools to handle unstructured data, with real-world applications for your projects.


## Introduction to the LangChain library

Earlier we explored how OpenAI works, using the GPT model for computer vision to transcribe pages. But with that approach we potentially miss the connections between pages. Imagine doing this with spreadsheets. That's why we need something better.

LangChain is a very powerful library for building apps with complex workflows, particularly when dealing with unstructured data and LLMs.

### Why LangChain?

It is designed to simplify the process of working with language models and other AI tools. It allows you to focus on what is really important, which is extracting insights from data.

With LangChain you can easily combine loading, splitting, embedding, and querying in a unified framework.

### Key features of LangChain

1. Modular Design

> Allows you to mix different components

2. Integration with Language Models
3. Support for Unstructured data

## Getting Started with LangChain

Some of the core components:

1. Document loaders
> Load and process documents in different formats (PDF, Excel files)

2. Text Splitters
> Help you break down large texts into manageable chunks. This matters because language models have token limits.

3. Embeddings
> Make it easy to generate embeddings for our text, which we can then use for retrieval, generation, etc.

4. Vector Stores
> Efficiently search and retrieve documents based on their embeddings, using libraries like FAISS
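
To make the last two components concrete, here is a minimal sketch in plain Python. It is not how LangChain or FAISS work internally: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the list `store` stands in for a vector store, and the two sample sentences are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a real model returns a dense float vector)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(store, query, k=1):
    """Return the k stored texts whose vectors are most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]

# Hypothetical sample documents, for illustration only
documents = [
    "The course on time series was clear and practical.",
    "The pitch deck describes a digital menu for restaurants.",
]

# The "vector store": each text kept next to its vector
store = [(doc, embed(doc)) for doc in documents]

top = search(store, "digital menu", k=1)
print(top[0])
```

The real pipeline follows the same shape: embed every chunk once, keep the vectors in a store, then embed the query and return the closest chunks.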

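## Introducing LangChain-OpenAI

A key feature of LangChain is its integration with OpenAI models, which lets you run all of these tasks with models such as GPT-3 or GPT-4.

**Explanation of Parameters**

* openai_api_key: the key that authenticates your requests
* model: the name of the model to call
* temperature: how random the output is (0 is the most deterministic)
* max_tokens: the maximum length of the output
* n: how many completions to generate per prompt
* stop: sequences at which generation stops
* presence_penalty and frequency_penalty: discourage the model from repeating topics or tokens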



Conclusion:

LangChain with OpenAI models provides a powerful toolkit for managing unstructured data and generating insights, offering flexibility and efficiency for retrieval systems, text generation, and conversational AI.

**Working with Excel Data Using LangChain**

How do you sift through data without missing insights?

This lecture covers using LangChain to handle Excel files:

**Step 1:** Loading the Excel data

mode = 'elements': breaks the Excel file into smaller parts (elements). The loader's other documented mode, mode = 'single', loads the whole file as a single document.

Initial data check: print a sample (for example, docs[:5]) to confirm correct loading and preview the data's structure.

**Step 2:** Splitting the Document into Chunks

This is important because language models have token limits, and processing the data in smaller chunks ensures that no information is lost.
> chunk_size = 2000: Determines the chunk size. Larger chunks keep more context but may exceed token limits. Adjust as needed.

> chunk_overlap = 200: The content at the end of one chunk flows into the start of the next.

> Smaller chunk_size: For highly granular data or lower token capacity, consider reducing the chunk size to 1000 or 500.

> Larger chunk_overlap: Increase from 200 to 300+ for smoother transitions in context-sensitive documents. A larger chunk_overlap also increases the number of chunks, because more content is repeated across chunks.
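
The trade-off between chunk size, overlap, and chunk count can be checked with a small sketch. This is a simplified fixed-window splitter written for illustration, not the algorithm RecursiveCharacterTextSplitter uses (that one also splits on separators such as paragraphs and lines), and the 20,000-character text is a made-up input.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows; each window repeats
    the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

# Hypothetical document of 20,000 characters
text = "x" * 20_000

counts = {}
for size, overlap in [(2000, 200), (2000, 300), (1000, 200), (500, 50)]:
    counts[(size, overlap)] = len(split_with_overlap(text, size, overlap))
    print(f"chunk_size={size}, chunk_overlap={overlap}: {counts[(size, overlap)]} chunks")
```

With this toy splitter, raising the overlap from 200 to 300 at a chunk size of 2000 moves the count from 11 to 12 chunks, while halving the chunk size to 1000 gives 25.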

**Step 3:** Generating embeddings

model = 'text-embedding-3-large'

API key security: load the key from a secret store (in Colab, userdata.get) instead of hard-coding it in the notebook.

Alternative options:

* Smaller model: for faster processing with large datasets, consider using a smaller model like "text-embedding-3-small".
Python - Initial Setup for Data Processing

Let's set everything up

# Libraries and OpenAI API

In [2]:
from google.colab import userdata
api_key = userdata.get('genai_course')

In [3]:
# Change directory
%cd /content/drive/MyDrive/Ideas/GenAI/RAG/Unstructured Data

/content/drive/MyDrive/Ideas/GenAI/RAG/Unstructured Data


In [3]:
!pip install langchain-community unstructured langchain-openai openai faiss-cpu #msoffcrypto-tool

Collecting langchain-community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting unstructured
  Downloading unstructured-0.18.15-py3-none-any.whl.metadata (24 kB)
Collecting langchain-openai
  Downloading langchain_openai-1.0.1-py3-none-any.whl.metadata (1.8 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting langchain-core<2.0.0,>=1.0.1 (from langchain-community)
  Downloading langchain_core-1.0.1-py3-none-any.whl.metadata (3.5 kB)
Collecting langchain-classic<2.0.0,>=1.0.0 (from langchain-community)
  Downloading langchain_classic-1.0.0-py3-none-any.whl.metadata (3.9 kB)
Collecting requests<3.0.0,>=2.32.5 (from langchain-community)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting filetype (fro

In [6]:
!pip install msoffcrypto-tool

Collecting msoffcrypto-tool
  Downloading msoffcrypto_tool-5.4.2-py3-none-any.whl.metadata (10 kB)
Downloading msoffcrypto_tool-5.4.2-py3-none-any.whl (48 kB)
Installing collected packages: msoffcrypto-tool
Successfully installed msoffcrypto-tool-5.4.2


# Excel

In [10]:
# import functions / modules
from langchain_community.document_loaders import UnstructuredExcelLoader
# Import from the split-out packages, which match the versions installed above
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from IPython.display import display, Markdown

Loading Data and Implementing Chunking Strategies

In [7]:
# Load the Excel data
# mode = 'elements' parses the data into individual components
loader = UnstructuredExcelLoader('Reviews.xlsx', mode = 'elements')
docs = loader.load()

# display the first 5 elements
docs[:5]

[Document(metadata={'source': 'Reviews.xlsx', 'filename': 'Reviews.xlsx', 'last_modified': '2025-05-15T12:40:49', 'page_name': 'Udemy_Reviews_Export_2024-08-22', 'page_number': 1, 'text_as_html': '<table><tr><td>Course Name</td><td>Student Name</td><td>Timestamp</td><td>Rating</td><td>Comment</td></tr><tr><td>Master Python for Data Analysis and Business Analytics 2024</td><td>Gaurav Mehra</td><td>2024-08-21 06:46:55+00:00</td><td>4</td><td/></tr><tr><td>Master Python for Data Analysis and Business Analytics 2024</td><td>Harigovind S</td><td>2024-08-21 04:35:13+00:00</td><td>5</td><td/></tr><tr><td>Data Literacy and Business Analytics for Business Leaders</td><td>Celine Jayme</td><td>2024-08-21 01:42:37+00:00</td><td>4</td><td/></tr><tr><td>Decision Making with Problem Solving &amp; Critical Thinking</td><td>Donovan Smith</td><td>2024-08-20 20:02:59+00:00</td><td>4</td><td/></tr><tr><td>Econometrics and Statistics for Business in R &amp; Python</td><td>Mark Stent</td><td>2024-08-20 16:5

In [8]:
# Split the document into chunks
# It is much easier to deal with parts of your data at once
# rather than all of it
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000, # max characters per chunk (the default length function counts characters, not tokens)
    chunk_overlap = 200
)
chunks = text_splitter.split_documents(docs)

# Display the first 5 chunks
chunks[:5]

Output hidden; open in https://colab.research.google.com to view.

Developing a Retrieval System for Unstructured Data

In [9]:
# Embeddings
embeddings = OpenAIEmbeddings(
    openai_api_key = api_key,
    model = 'text-embedding-3-large'
)

# Create the FAISS index
db_faiss = FAISS.from_documents(chunks, embeddings)
db_faiss

<langchain_community.vectorstores.faiss.FAISS at 0x7f7899a44230>

In [10]:
# Try the retrieval system
query = 'give me my worst reviews with comments'

# Retrieve the context -> LangChain's FAISS store defaults to L2 (Euclidean) distance,
# so a lower score means a closer match
docs_faiss = db_faiss.similarity_search_with_score(query, k = 5)
docs_faiss # a list of (chunk, score) tuples

Output hidden; open in https://colab.research.google.com to view.

Now we will use the GenAI model to pick out the most relevant information from these chunks.

Building a Generation System for Dynamic Content

We should merge all the outputs from the retrieval system.

In [11]:
# Inspect docs_faiss
len(docs_faiss) # 5
len(docs_faiss[0]) # 2: each item is a (Document, score) tuple
docs_faiss[0][0].page_content # the chunk text (up to 2000 characters)
docs_faiss[0][1] # the retrieval score (a distance: lower is closer)

np.float32(1.3821557)
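
The score above is a distance, not a similarity. As a sketch, assuming the FAISS store uses its default flat L2 index (which reports squared Euclidean distance) and that the embeddings are unit length, the score relates to cosine similarity by distance = 2 - 2 * cosine. The vectors below are made-up stand-ins, not real embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Dot product; equals cosine similarity because a and b are unit length."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical vectors standing in for a query and a chunk embedding
a = normalize([1.0, 2.0, 3.0])
b = normalize([3.0, 1.0, 0.5])

cos = cosine_similarity(a, b)
dist = squared_l2(a, b)

# Both printed values agree: distance = 2 - 2 * cosine
print(round(dist, 6), round(2 - 2 * cos, 6))
```

Under those assumptions, a score of about 1.38 would correspond to a cosine similarity of roughly 0.31, and identical vectors would score 0.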

In [13]:
# Merge the docs to use in the Gen System
context_text = '\n\n'.join([doc.page_content for doc, _score in docs_faiss])
context_text

"User 2022-06-21 20:02:37+00:00 5 Econometrics and Statistics for Business in R & Python Anonymized User 2022-06-21 18:08:29+00:00 5 Data Mining for Business Analytics & Data Analysis in Python Kiran Godbole 2022-06-21 14:53:24+00:00 5 THANKS YOU SIR, THIS COURSE REALLY HELP ME A LOT. Econometrics and Statistics for Business in R & Python Brian Turaki 2022-06-20 19:56:07+00:00 5 Master Time Series Analysis and Forecasting with Python 2024 Anish H S 2022-06-20 04:33:44+00:00 3.5 Forecasting Models & Time Series Analysis for Business in R Anonymized User 2022-06-19 20:48:45+00:00 3.5 Master Time Series Analysis and Forecasting with Python 2024 Arthur Gonsales 2022-06-19 20:33:59+00:00 5 This course is amazing, it made a lot of things clear to me in the regard of time series modeling. It goes from the very basic algorithms to the most advanced models with a perfect explanation and a perfect balance between theory and practice. :) Econometrics and Statistics for Business in R & Python Yean

In [14]:
# Create a simple prompt for RAG system
prompt = f"""
Based on this context {context_text} please answer this question {query}.
If you don't know the answer just say don't know.
"""
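
The merge-and-prompt step can also be wrapped in a small helper. This is a sketch of one possible variation, not the notebook's method: `Doc` is a hypothetical stand-in for a LangChain Document, and the labelled prompt layout and the `max_chars` guard are assumptions added for illustration.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str

def build_prompt(results, query, max_chars=6000):
    """Join the retrieved chunks and wrap them, with the question, in a prompt."""
    context = "\n\n".join(doc.page_content for doc, _score in results)
    context = context[:max_chars]  # crude guard against an oversized prompt
    return (
        "Answer the question using only the context below.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Made-up retrieval results in the same (document, score) shape as docs_faiss
results = [(Doc("Review A: rating 3"), 1.1), (Doc("Review B: rating 5"), 1.3)]
prompt = build_prompt(results, "Which review has the lowest rating?")
print(prompt)
```

Labelling the context and the question separately makes it easier to see where the retrieved text ends when you inspect a prompt by hand.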

In [16]:
# Call the OpenAI API with the LangChain
model = ChatOpenAI(
    openai_api_key = api_key,
    model = 'gpt-4o-mini',
    temperature = 0 # 0 creativity
)
response_text = model.invoke(prompt)

In [17]:
# Display the answer
display(Markdown(response_text.content))

Based on the provided context, here are the reviews with the lowest ratings and their comments:

1. **Rating: 3.0**
   - **Course:** Master Time Series Analysis and Forecasting with Python 2024
   - **Comment:** "content too simple and basic"

2. **Rating: 3.0**
   - **Course:** Forecasting Models & Time Series Analysis for Business in R
   - **Comment:** "Section 4: too many examples with different subjects. May be will be stay on one topic and don't jump from one to another."

3. **Rating: 3.0**
   - **Course:** Decision Making with Problem Solving & Critical Thinking
   - **Comment:** No specific comment provided.

4. **Rating: 3.5**
   - **Course:** Forecasting Models & Time Series Analysis for Business in R
   - **Comment:** No specific comment provided.

5. **Rating: 3.5**
   - **Course:** Econometrics and Statistics for Business in R & Python
   - **Comment:** No specific comment provided.

These reviews reflect the lowest ratings along with any comments provided by the users.

Building Retrieval and Generation Functions

Let's build a couple of functions

In [25]:
# Preparing the unstructured data
def prepare_excel(file_path):
  # Loading the data
  loader = UnstructuredExcelLoader(file_path, mode = 'elements')
  docs = loader.load()

  # Split the text into chunks
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size = 2000,
      chunk_overlap = 200
  )
  chunks = text_splitter.split_documents(docs)

  # Prepare the embeddings
  embeddings = OpenAIEmbeddings(
      openai_api_key = api_key,
      model = 'text-embedding-3-large'
  )

  # FAISS index
  db_faiss = FAISS.from_documents(chunks, embeddings)

  return db_faiss

In [12]:
# Prepare a function to retrieve and generate (RAG)
def ask(db, query, k):
  # Getting the context
  docs_faiss = db.similarity_search_with_score(query, k = k)
  context_text = '\n\n'.join([doc.page_content for doc, _score in docs_faiss])

  # Define the prompt
  prompt = f"""
  based on this context {context_text}
  please answer this question {query}
  if the information is not in the context, say that you don't have information
  """

  # Call the LLM
  model = ChatOpenAI(
      openai_api_key = api_key,
      model = 'gpt-4o-mini',
      temperature = 0
  )
  response_text = model.invoke(prompt)

  return display(Markdown(response_text.content))

In [26]:
# Preparing the excel data
db_excel = prepare_excel('Reviews.xlsx')

In [27]:
# Define the query
query = """
Analyse the reviews, choose the ones with the worst comments and transcribe them
"""

In [32]:
# Ask the question
ask(db_excel, query, 5)

Based on the provided context, here are the reviews with the worst comments:

1. **Loïc Legros** (2023-11-08):
   - "Not a good course :\n1- some python code are not up to date and doesn't work.\n2- the course doesn't add any value compare to the simple reading of either wikipedia article or documentation of python module used. Considered this course as a audio version of those.\n3- instructor doesn't really know python and code could easily be improved."

2. **Alka Rachel John** (2024-05-18):
   - "This is one of the worst courses on Udemy. Please don't."

These reviews highlight significant dissatisfaction with the courses, focusing on outdated content and perceived lack of value.

Working with Word Documents

Analyzing a lengthy Word document can be tedious and time-consuming when extracting key information.

What if you could automate document breakdown and retrieval?

By the end of this section you will be able to **load documents**, **split them into chunks**, and **store them** in a **vector database** for efficient retrieval.

**Step 1: Importing Necessary Libraries**

* import nltk (important for text processing)

nltk.download('punkt'): Essential for splitting text into sentences or words in natural language processing. Ensure it's installed first.

* Library imports: import libraries near their use to keep the script organized and load dependencies as needed.

**Step 2: Loading the word document**

* mode = 'elements': breaks the document into smaller parts like paragraphs or sections, essential for processing specific parts.

**Step 3: Creating a function to Process word documents**

* chunk_size = 500: Sets the chunk size so that content fits within the model's token limits.

* chunk_overlap = 50: Maintains context across chunks, which is essential for understanding continuous text.

* Reusable Function: Wrapping the logic in a function makes the code reusable and keeps it clean.

* Adjust Chunk Size: For shorter or context-sensitive documents, reduce the chunk size to 300-400 characters.

* Different Embedding Models: Choose a smaller model like 'text-embedding-3-small' for faster processing.

Conclusion

In this lecture, we explored loading, processing, and storing Word documents with Langchain. By structuring the workflow you can make unstructured data searchable. This method is ideal for large documents or collections, saving time compared to manual analysis.

Setting Up Word Documents for RAG

# Word

In [1]:
# install libraries
!pip install python-docx
!pip install --upgrade nltk



In [4]:
import nltk
nltk.download('punkt')
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [6]:
# Load the word document with data parsed as individual elements
loader = UnstructuredWordDocumentLoader(
    'Declaration of independence.docx',
    mode = 'elements'
)
docs = loader.load()
docs[:5]

[Document(metadata={'source': 'Declaration of independence.docx', 'category_depth': 0, 'filename': 'Declaration of independence.docx', 'last_modified': '2025-05-15T12:40:49', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'UncategorizedText', 'element_id': 'bb562112a268f9b61e327bb112658e40'}, page_content='In Congress, July 4, 1776'),
 Document(metadata={'source': 'Declaration of independence.docx', 'category_depth': 0, 'filename': 'Declaration of independence.docx', 'last_modified': '2025-05-15T12:40:49', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'NarrativeText', 'element_id': '37dfdfb7ee7a49a6bd0e5a59bf4d33a4'}, page_content="The unanimous Declaration of the thirteen united States of America,\xa0When in the Course of human events, it becomes necessary for one people to dissolve the political bands 

In [7]:
# Create a function to prepare the word document
def prepare_word(file_path: str):
  loader = UnstructuredWordDocumentLoader(
      file_path,
      mode = 'elements'
  )
  docs = loader.load()

  # Split into chunks, Embeddings and FAISS
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size = 500,
      chunk_overlap = 50
  )
  chunks = text_splitter.split_documents(docs)
  embeddings = OpenAIEmbeddings(
      openai_api_key = api_key,
      model = 'text-embedding-3-large'
  )
  db_faiss = FAISS.from_documents(chunks, embeddings)

  return db_faiss

Implementing RAG for Word Documents

In [8]:
# Define a couple of queries
query1 = 'Tell me about the document by giving me the 3 main points'
query2 = 'Get me the best chocolate cake recipe'

In [13]:
# Prepare the word data
db_doc = prepare_word('Declaration of independence.docx')
ask(db_doc, query1, k = 5)

The document you provided appears to be an excerpt from the Declaration of Independence of the United States. Here are three main points derived from the context:

1. **Right to Alter or Abolish Government**: The document asserts that when a government becomes destructive to the rights of the people, it is their right and duty to alter or abolish it and establish a new government that ensures their safety and happiness.

2. **Unalienable Rights**: It emphasizes the belief that all men are created equal and are endowed with certain unalienable rights, including life, liberty, and the pursuit of happiness. Governments are established to secure these rights, deriving their powers from the consent of the governed.

3. **Justification for Separation**: The document states that when it becomes necessary for one people to dissolve the political connections with another, they must declare the reasons for their separation, demonstrating a respect for the opinions of mankind and the principles of natural law.

In [14]:
# Test with query2
ask(db_doc, query2, k = 5)

I don't have information on chocolate cake recipes based on the provided context.

Working with PowerPoint Presentations

PPT slides contain valuable information, but extracting insights from them can be challenging. We will automate this process to make slides easily searchable. In this session, we will master these tasks, simplifying data analysis and querying.

**Step 1: Loading the PPT Presentation**

from langchain_community.document_loaders import UnstructuredPowerPointLoader

**Step 2: Creating a Function to Process PowerPoint Presentations**

We already know most of the functions involved.

**Best Practices**

* chunk_size = 200: PowerPoint slides have short text segments, so a 200-character chunk size still captures meaningful content.

* chunk_overlap = 20: A small overlap maintains context between chunks, essential for presentations with ideas spread across slides.

* Adjusting Size: For detailed text, a chunk size of 300 or more may be better. The prepare_ppt function below uses 500 with an overlap of 50.

I have just shown you how to process PPT presentations with LangChain, turning static slides into searchable data. This method benefits businesses, educators, and others who frequently work with presentations.

# Powerpoint

In [15]:
# Install a library
!pip install python-pptx

Collecting python-pptx
  Downloading python_pptx-1.0.2-py3-none-any.whl.metadata (2.5 kB)
Collecting XlsxWriter>=0.5.7 (from python-pptx)
  Downloading xlsxwriter-3.2.9-py3-none-any.whl.metadata (2.7 kB)
Downloading python_pptx-1.0.2-py3-none-any.whl (472 kB)
Downloading xlsxwriter-3.2.9-py3-none-any.whl (175 kB)
Installing collected packages: XlsxWriter, python-pptx
Successfully installed XlsxWriter-3.2.9 python-pptx-1.0.2


In [16]:
# Import the PPT class
from langchain_community.document_loaders import UnstructuredPowerPointLoader

In [22]:
# Getting the data -> Splitting -> Embedding -> FAISS
# Build a function to prepare the PPT for retrieval
def prepare_ppt(file_path):
  # Loading the data
  loader = UnstructuredPowerPointLoader(file_path, mode = 'elements')
  docs = loader.load()

  # Split into chunks, embeddings and FAISS
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size = 500,
      chunk_overlap = 50
  )
  chunks = text_splitter.split_documents(docs)
  embeddings = OpenAIEmbeddings(
      openai_api_key = api_key,
      model = 'text-embedding-3-large'
  )
  db_faiss = FAISS.from_documents(chunks, embeddings)

  return db_faiss

In [23]:
# Prepare the presentation data
db_pp = prepare_ppt('Bitte pitch deck EN.pptx')

RAG Implementation for PowerPoint

In [24]:
# Define a couple of queries to test the presentation
query1 = "What is Bitte's competitive advantage?"
query2 = "What is independence"

In [25]:
# Ask the questions
ask(db_pp, query1, k = 5)

Bitte's competitive advantage lies in its ability to provide a rich and interactive dining experience through its digital menu. This includes features such as presenting not only photographs and nutritional information but also offering suggestions for dish combinations, which can promote increased sales. Additionally, Bitte offers customizable layouts that align with the restaurant's brand image, enhancing the overall customer experience. However, specific details about payment commissions or transaction advantages are not provided in the context.

In [26]:
# Ask the second query
ask(db_pp, query2, 5)

The context provided does not contain information about the concept of independence. Therefore, I don't have information on that topic.

Working with EPUB Files

Working with EPUB Files using LangChain

Finding specific passages in eBooks can be difficult. In this session we are going to explore how to quickly extract and embed content for easier analysis.

EPUB files contain text, metadata, and images. We will show how to load, chunk, and embed them for easy retrieval.

By the end of the session, you will be equipped to efficiently process and analyze eBooks with LangChain.

**Step 1: Setting Up Your Environment:**

The Unstructured EPUB loader relies on **pypandoc**.

**Step 2: Creating a Function to process EPUB files**

Loader, chunk size, etc.

* Adjusting Chunk Size: For simpler or less dense EPUB text, consider increasing the chunk size to 300 or 400 characters.

* Embedding Models: For quicker tasks, use 'text-embedding-3-small'. For detailed analysis, use 'text-embedding-3-large'.

**Tips and tricks for working with EPUB files**

* Handling Metadata (store it separately, in case you have pictures and text, for easier reference and indexing)

* Managing Non-text elements (if you have an EPUB with a lot of images, this won't work properly)

* Dealing with Nested Structures (chapters)

* Optimizing storage
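
The metadata and non-text tips above can be sketched as a filtering step before chunking. This is an illustration under assumptions: `Doc` is a hypothetical stand-in for a LangChain Document, the sample elements are made up, and it assumes each element carries a 'category' key in its metadata, as the elements-mode output for the Word document in this notebook does.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def group_by_category(docs):
    """Group element texts by their 'category' metadata value."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("category", "Unknown")].append(doc.page_content)
    return dict(groups)

def keep_text_elements(docs, wanted=("Title", "NarrativeText")):
    """Keep only elements whose category is in `wanted` and that contain text."""
    return [
        doc for doc in docs
        if doc.metadata.get("category") in wanted and doc.page_content.strip()
    ]

# Made-up elements, for illustration only
docs = [
    Doc("CHAPTER I. Down the Rabbit-Hole", {"category": "Title"}),
    Doc("Alice was beginning to get very tired of sitting by her sister.", {"category": "NarrativeText"}),
    Doc("", {"category": "Image"}),
]

groups = group_by_category(docs)
text_docs = keep_text_elements(docs)
print(sorted(groups), len(text_docs))
```

Chapter titles kept this way can also be stored alongside each chunk, which helps when a book has nested chapters and sections.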


# Epub

An EPUB is not just text; it carries quite a bit of other information as well.

In [27]:
# Install libraries
!pip install pypandoc

Collecting pypandoc
  Downloading pypandoc-1.15-py3-none-any.whl.metadata (16 kB)
Downloading pypandoc-1.15-py3-none-any.whl (21 kB)
Installing collected packages: pypandoc
Successfully installed pypandoc-1.15


In [28]:
# Import the libraries
from langchain_community.document_loaders import UnstructuredEPubLoader
import pypandoc
pypandoc.download_pandoc()

In [29]:
# Build a function to prepare EPUB docs
def prepare_epub(file_path):
  # Loading the data
  loader= UnstructuredEPubLoader(file_path, mode = 'elements')
  docs = loader.load()

  # Split into chunks, Embeddings and FAISS
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size = 200,
      chunk_overlap = 20
  )
  chunks = text_splitter.split_documents(docs)
  embeddings = OpenAIEmbeddings(
      openai_api_key = api_key,
      model = 'text-embedding-3-large'
  )
  db_faiss = FAISS.from_documents(chunks, embeddings)

  return db_faiss

In [31]:
# Prepare the EPUB
db_epub = prepare_epub("Alice’s Adventures in Wonderland.epub")

In [32]:
# Prepare a couple of queries
query1 = "What is the main point of the story?"
query2= "Does Alice like Bitte's digital menus?"

In [33]:
# Answer query 1
ask(db_epub, query1, 5)

The main point of the story, as suggested by the Duchess, is that everything has a moral, and in this case, the moral is that love is what makes the world go round.

In [34]:
# Answer query 2
ask(db_epub, query2, 5)

I don't have information.

# PDF (Enhanced)

PDFs are important but, as you know, pulling out specific details from lengthy PDFs can be very frustrating.

Wouldn't it be awesome if you could automate the retrieval process?

PDFs are versatile but complex. This lecture covers the process for efficient information retrieval.

**Step 1: Creating a Function to Process PDFs**

def prepare_pdf(file_path):

1. Load the PDF document with data parsed as individual elements

2. Split the loaded PDF into chunks

3. Generate embeddings for chunks using OpenAI's embedding model

4. Store the chunks and their embeddings in a FAISS vector database

5. Return the vector database for further use
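
The five steps above are the same for every file type in this notebook; only the loader and the chunk settings change. As a sketch of that idea, the function below takes each step as a callable. The name `prepare_document`, the fake helpers, and the file name 'recipes.pdf' are hypothetical and exist only so the example runs without any API key.

```python
from typing import Any, Callable, List

def prepare_document(
    file_path: str,
    load: Callable[[str], List[Any]],
    split: Callable[[List[Any]], List[Any]],
    index: Callable[[List[Any]], Any],
) -> Any:
    """Run the load -> split -> embed/index pipeline with pluggable steps."""
    docs = load(file_path)   # step 1: parse the file into elements
    chunks = split(docs)     # step 2: split the elements into chunks
    return index(chunks)     # steps 3-5: embed, store, and return the database

# Fake steps standing in for the loader, the splitter, and the FAISS index
def fake_load(path):
    return [f"text of {path}, page 1", f"text of {path}, page 2"]

def fake_split(docs):
    return [piece for doc in docs for piece in doc.split(", ")]

def fake_index(chunks):
    return {"n_chunks": len(chunks), "chunks": chunks}

db = prepare_document("recipes.pdf", fake_load, fake_split, fake_index)
print(db["n_chunks"])

# In the notebook, the real steps could be passed in the same way, e.g.:
#   load  = lambda p: UnstructuredPDFLoader(p, mode='elements').load()
#   split = text_splitter.split_documents
#   index = lambda chunks: FAISS.from_documents(chunks, embeddings)
```

Passing the steps in keeps one pipeline for Excel, Word, PowerPoint, EPUB, and PDF instead of five near-identical functions.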

**Tips and Tricks for Working with PDFs**

* Handling Scanned PDFs (instead of just text you might have images, for instance; in that case we want to use OCR)

* Dealing with Multi-Column Layouts (two columns in one page)

* Managing Large PDFs

* Preserving Non-Textual Elements (you might want some computer vision there to improve the retrieval system)

* Optimizing Chunk Size
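
One way to manage a large PDF is to embed its chunks in batches instead of all in one call. Below is a minimal sketch of the batching helper in plain Python; the batch size of 3 and the integers standing in for chunks are illustrative assumptions.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Integers standing in for document chunks
chunks = list(range(7))
grouped = list(batches(chunks, 3))
print(grouped)
```

In the notebook, the first batch could go to FAISS.from_documents and each later batch to the resulting index's add_documents method, so no single request has to carry the whole file.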



PDF Setup for RAG

# PDF

In [38]:
!pip install pymupdf pdfminer.six pillow_heif unstructured_inference unstructured_pytesseract
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr # OCR engine: converts images of text into machine-readable text

Collecting pymupdf
  Using cached pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting pdfminer.six
  Using cached pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pillow_heif
  Downloading pillow_heif-1.1.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (9.6 kB)
Collecting unstructured_inference
  Downloading unstructured_inference-1.0.5-py3-none-any.whl.metadata (5.3 kB)
Collecting unstructured_pytesseract
  Downloading unstructured.pytesseract-0.3.15-py3-none-any.whl.metadata (11 kB)
Collecting onnx (from unstructured_inference)
  Downloading onnx-1.19.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (7.0 kB)
Collecting onnxruntime>=1.18.0 (from unstructured_inference)
  Downloading onnxruntime-1.23.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting pypdfium2 (from unstructured_inference)
  Downloading pypdfium2-5.0.0-py3-none-manylinux_2_17_x86_64.manylin

In [39]:
# Import the libraries
from langchain_community.document_loaders import UnstructuredPDFLoader
from pdfminer import psparser

In [40]:
# Build a function to prepare PDF docs
def prepare_pdf(file_path):
  # Loading the data
  loader = UnstructuredPDFLoader(file_path, mode = 'elements')
  docs = loader.load()

  # Split into chunks, embeddings and FAISS
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size = 500,
      chunk_overlap = 50
  )
  chunks = text_splitter.split_documents(docs)
  embeddings = OpenAIEmbeddings(
      openai_api_key = api_key,
      model = 'text-embedding-3-large'
  )
  db_faiss = FAISS.from_documents(chunks, embeddings)

  return db_faiss

In [43]:
!pip install pi_heif

Collecting pi_heif
  Downloading pi_heif-1.1.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.5 kB)
Downloading pi_heif-1.1.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.4 MB)
Installing collected packages: pi_heif
Successfully installed pi_heif-1.1.1


In [45]:
!apt-get install poppler-utils -y

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.11).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


In [46]:
!pip install pdf2image

Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0


In [47]:
# Apply the function
db_pdf = prepare_pdf('Famous old receipts.pdf')



RAG Implementation for PDF Files

In [48]:
# Query inspirations
query1 = 'What are the most unusual recipes? Tell me which and tell me how to make them'
query2 = 'Which recipes would impress my friends?'

In [49]:
# Try query 1
ask(db_pdf, query1, 5)

Based on the provided context, the most unusual recipe appears to be the one for **Chicken Saute Bellevue**. While the specific details for this recipe are not included in the text, the combination of ingredients and the method of preparation suggest it may be unique.

Another unusual recipe is the one that includes **fritters made of brains** along with eggs, chopped parsley, and pepper, which is not a common ingredient in many modern recipes.

Unfortunately, the context does not provide detailed instructions on how to make Chicken Saute Bellevue or the fritters made of brains. Therefore, I cannot provide the specific steps for these recipes. If you have any other questions or need information on different recipes, feel free to ask!

In [50]:
# Try query 2
ask(db_pdf, query2, 5)

Based on the context provided, the recipes that would likely impress your friends are the chili sauce, stuffed ripe tomatoes (Southern style), and the lemon puff dessert. The chili sauce offers a unique and flavorful combination of ingredients, while the stuffed tomatoes and lemon puff are classic dishes that can showcase your cooking skills. The citron pudding mentioned at the beginning also seems to be a favorite, but specific details about its preparation are not included in the context.

# Wrapping up this massive section

With this, we are probably a few lines of code away from turning messy files into brilliant insights.

## Recap

* We tackled Excel spreadsheets, turned data into actionable chunks, and mastered loading, processing, and embedding with LangChain.

* We tackled Word documents, breaking them into manageable pieces, and created functions to automate handling large text volumes.

* We looked into PowerPoint presentations, learning to parse, chunk, and store them in a vector database for easy retrieval.

* We also looked at eBooks. They were tricky because of their structured content and metadata, but we built a workflow to handle both the text and its context.

* In the previous section we turned PDFs into images, but in this one we worked with PDFs directly, without having to convert them into images.

## Key takeaways

* Consistency is key
* Context matters
* Adaptability