## Overview of RAG

Retrieval-augmented generation (RAG) pairs the reasoning capabilities of large language models (LLMs) with external data to improve factual recall. LLMs can acquire new information in two main ways: weight updates (fine-tuning) and RAG, which passes relevant context to the LLM through the prompt.




**Challenges with Multi-Modal Data**: Traditional RAG methods struggle with semi-structured data (e.g., tables combined with text) and multi-modal inputs (e.g., images). The article proposes unified strategies to tackle these challenges using emerging multimodal models.

## Techniques to Enhance RAG
Several techniques are introduced to improve the effectiveness of RAG:
* **Base Case RAG**: Simple top-K retrieval on embedded document chunks.
* **Summary Embedding**: Retrieval based on document summaries while returning full documents for context.
* **Windowing**: Expanding the retrieval window to include more context.
* **Metadata Filtering**: Using metadata to filter relevant chunks during retrieval.
* **Fine-tuning Embeddings**: Customizing embedding models for specific datasets.
* **Two-stage RAG**: Implementing a keyword search followed by semantic retrieval.
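
The two-stage idea in the last bullet can be sketched without any external library. This is a minimal illustration rather than the article's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `keyword_filter`, `two_stage_retrieve`, and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_filter(chunks, keyword):
    """Stage 1: keep only chunks that contain the keyword as a token."""
    return [c for c in chunks if keyword.lower() in tokenize(c)]


def two_stage_retrieve(chunks, query, keyword, k=2):
    """Stage 2: rank the keyword-filtered chunks by similarity to the query."""
    candidates = keyword_filter(chunks, keyword)
    query_vec = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(embed(c), query_vec), reverse=True)
    return ranked[:k]


chunks = [
    "Table 2 reports revenue by region for 2023.",
    "Revenue grew in every region during 2023.",
    "The appendix lists the authors and their affiliations.",
    "Region codes are defined in the glossary.",
]
top_chunks = two_stage_retrieve(chunks, "revenue by region", keyword="revenue", k=1)
```

The cheap keyword pass shrinks the candidate set, so the more expensive similarity ranking only runs on chunks that are plausibly relevant.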

## Reference:


https://blog.langchain.dev/semi-structured-multi-modal-rag/

![Multi-Modal RAG](https://blog.langchain.dev/content/images/size/w1000/2023/10/image-22.png)


## Techniques Explored:

1) **Multi-Vector Retriever**
The multi-vector retriever allows for separating documents used for answer synthesis from those utilized for retrieval. This method enhances the ability to manage various content types effectively. For instance, it can summarize verbose documents while retaining full documents for context during answer generation.
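
The separation between what is searched and what is returned can be sketched in plain Python. This is a toy model under stated assumptions, not LangChain's implementation: `TinyMultiVectorStore` is a hypothetical class, and word overlap stands in for vector similarity.

```python
import uuid


class TinyMultiVectorStore:
    """Hypothetical stand-in: search over summaries, return the full parents."""

    def __init__(self):
        self.summaries = []  # (doc_id, summary) pairs used only for search
        self.docstore = {}   # doc_id -> full parent document

    def add(self, summary, full_document):
        """Register a summary for search and keep the full document by id."""
        doc_id = str(uuid.uuid4())
        self.summaries.append((doc_id, summary))
        self.docstore[doc_id] = full_document
        return doc_id

    def retrieve(self, query, k=1):
        """Score summaries by word overlap with the query, return parent docs."""
        query_words = set(query.lower().split())
        scored = sorted(
            self.summaries,
            key=lambda pair: len(query_words & set(pair[1].lower().split())),
            reverse=True,
        )
        return [self.docstore[doc_id] for doc_id, _ in scored[:k]]


store = TinyMultiVectorStore()
store.add("quarterly revenue table", "Q1: 10 | Q2: 12 | Q3: 15 | Q4: 11")
store.add("summary of related work", "Prior studies examined retrieval quality in detail.")
result = store.retrieve("revenue per quarter")
```

The query matches the short summary, but the caller receives the full table, which is the property the multi-vector retriever relies on.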


2) **Document Processing**
Tools like Unstructured, which can extract and partition different data types from documents (e.g., images, tables, text), are essential for managing semi-structured data effectively.


3) **Multi-Modal Approaches**
Three strategies are detailed for applying the multi-vector retriever framework:
* Using Multimodal Embeddings: Embedding images and text in a shared vector space (e.g., with CLIP) and retrieving both through similarity search.
* Generating Text Summaries from Images: Using a multimodal LLM to summarize each image, then embedding and retrieving those text summaries.
* Linking Raw Images and Text Chunks: Retrieving via image summaries that reference the raw image, then passing the raw images and text chunks to a multimodal LLM for answer synthesis.
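
For the third strategy, one common way to keep a raw image next to its text summary is to store it base64-encoded under a shared id. The sketch below assumes that approach; the helper names, the `img-1` id, and the placeholder bytes are all hypothetical, and no real image or model is involved.

```python
import base64


def encode_image_bytes(image_bytes):
    """Base64-encode raw image bytes so they can be stored as a string."""
    return base64.b64encode(image_bytes).decode("utf-8")


def to_data_url(b64_string, mime_type="image/png"):
    """Wrap a base64 string in a data URL."""
    return f"data:{mime_type};base64,{b64_string}"


# Placeholder bytes, not a valid image: enough to show the round trip.
fake_png = b"\x89PNG\r\n\x1a\n placeholder bytes"

image_docstore = {"img-1": encode_image_bytes(fake_png)}           # id -> raw image
summary_index = {"Bar chart of accuracy per detector": "img-1"}    # summary -> id

# Retrieval matches on the summary text, then follows the id to the raw image.
doc_id = summary_index["Bar chart of accuracy per detector"]
data_url = to_data_url(image_docstore[doc_id])
```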

## Install necessary libraries

In [1]:
%%capture
! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain langchain-chroma "unstructured[all-docs]" pydantic lxml

In [3]:
!wget "https://arxiv.org/pdf/2409.13385"

--2024-11-17 10:14:04--  https://arxiv.org/pdf/2409.13385
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1015356 (992K) [application/pdf]
Saving to: ‘2409.13385’


2024-11-17 10:14:04 (126 MB/s) - ‘2409.13385’ saved [1015356/1015356]



In [4]:
import os

os.environ["OPENAI_API_KEY"] = "PLACE YOUR OPENAI API KEY HERE"
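
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. A minimal alternative, sketched here with a hypothetical helper named `ensure_env`, prompts for the value only when the environment variable is not already set.

```python
import os
from getpass import getpass


def ensure_env(name, prompt=getpass):
    """Return an environment variable's value, prompting only if it is unset."""
    value = os.environ.get(name)
    if not value:
        value = prompt(f"Enter {name}: ")
        os.environ[name] = value
    return value


# In a notebook cell, this would be called as:
# ensure_env("OPENAI_API_KEY")
```

The `prompt` parameter exists so the function can be exercised without interactive input, for example by passing a lambda in a test.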

In [12]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Path to the PDF to parse (adjust to wherever your copy of the PDF is saved)
path = "/content/SilverSpeak__Evading_AI_Generated_Text_Detectors_using_Homoglyphs (1).pdf"

In [6]:
!apt-get install tesseract-ocr
!apt-get install libtesseract-dev


In [7]:
!pip install pytesseract



In [8]:
!apt-get install poppler-utils



In [9]:
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = "PLACE YOUR LANGSMITH API KEY HERE"

In [13]:
# Get elements
raw_pdf_elements = partition_pdf(
    filename=path ,
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard max on chunk size
    max_characters=4000,
    # Attempt to start a new chunk after 3800 chars
    new_after_n_chars=3800,
    # Attempt to merge text blocks shorter than 2000 chars
    combine_text_under_n_chars=2000,
    # Directory for extracted images (a directory, not the PDF file itself)
    image_output_dir_path=os.path.dirname(path),
)
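
How the three size limits interact is easier to see on toy input. The function below is a rough mental model only, not the algorithm Unstructured actually runs: `pack_sections` is a hypothetical name, and the greedy merging rule is an assumption made for illustration.

```python
def pack_sections(sections, max_characters=4000, new_after_n_chars=3800,
                  combine_text_under_n_chars=2000):
    """Greedy, simplified approximation of size-limited chunking."""
    chunks, current = [], ""
    for section in sections:
        # Hard limit: split any section that is longer than max_characters.
        pieces = [section[i:i + max_characters]
                  for i in range(0, len(section), max_characters)]
        for piece in pieces:
            joined = f"{current}\n\n{piece}" if current else piece
            # Merge only while the running chunk is still "small" and the
            # result stays within the hard limit.
            can_merge = (
                current
                and len(current) < combine_text_under_n_chars
                and len(joined) <= max_characters
            )
            if can_merge:
                current = joined
            else:
                if current:
                    chunks.append(current)
                current = piece
            # Soft limit: close the chunk once it is long enough.
            if len(current) >= new_after_n_chars:
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks


# Small limits make the behaviour visible on short strings.
demo_chunks = pack_sections(
    ["a" * 10, "b" * 10, "c" * 50, "d" * 120],
    max_characters=100, new_after_n_chars=80, combine_text_under_n_chars=30,
)
demo_lengths = [len(chunk) for chunk in demo_chunks]
```

In this toy run the three short sections merge into one chunk, while the oversized last section is split at the hard limit.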


In [14]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element types that were found
# (a TableChunk can also appear if a Table exceeds the max chars set above)
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 71,
 "<class 'unstructured.documents.elements.Table'>": 66}

In [15]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

66
71
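
Both cells above inspect elements through `str(type(element))`. A slightly more compact variant keys on the class name instead. The sketch uses stand-in classes so it runs without Unstructured installed; `FakeElement`, `count_by_class`, and `split_by_kind` are hypothetical names, and only the class names `CompositeElement` and `Table` come from the output above.

```python
from collections import Counter


class FakeElement:
    """Stand-in for an unstructured element: str() returns its text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class CompositeElement(FakeElement):
    pass


class Table(FakeElement):
    pass


# Map class names to the labels used in the notebook.
KIND_BY_CLASS = {"Table": "table", "CompositeElement": "text"}


def count_by_class(elements):
    """Count elements per class name."""
    return Counter(type(element).__name__ for element in elements)


def split_by_kind(elements):
    """Group element text under 'table' or 'text'; skip unknown classes."""
    grouped = {"table": [], "text": []}
    for element in elements:
        kind = KIND_BY_CLASS.get(type(element).__name__)
        if kind is not None:
            grouped[kind].append(str(element))
    return grouped


sample = [CompositeElement("Intro paragraph"), Table("a | b"), CompositeElement("Method")]
counts = count_by_class(sample)
grouped = split_by_kind(sample)
```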


## Text and Table summaries

In [16]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

In [17]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOpenAI()
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [18]:
# Apply to text
texts = [i.text for i in text_elements if i.text != ""]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

In [19]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
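
The `max_concurrency` setting caps how many summarization calls run at once. Conceptually this resembles a bounded thread pool, as sketched below with a fake summarizer in place of the LLM call; `fake_summarize` and `batch_with_limit` are hypothetical names, and this is an analogy rather than LangChain's internal code.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_summarize(text):
    """Stand-in for an LLM call: returns the first five words."""
    return " ".join(text.split()[:5])


def batch_with_limit(func, inputs, max_concurrency=5):
    """Apply func to every input with at most max_concurrency calls in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() yields results in input order, regardless of completion order.
        return list(pool.map(func, inputs))


summaries = batch_with_limit(
    fake_summarize,
    ["one two three four five six", "short text"],
    max_concurrency=2,
)
```

Order preservation matters here because later cells pair each summary with its source chunk by position.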

## Creating Vectorstore

In [20]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="summaries", embedding_function=OpenAIEmbeddings()
)

# The storage layer for the parent documents
store = InMemoryStore()  # <- Can we extend this to images
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)


In [21]:
# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

In [22]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Option 1: LLM
model = ChatOpenAI()
# Option 2: Multi-modal LLM
# model = LLaVA

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
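
In the chain above, whatever the retriever returns is substituted into `{context}` as-is; for a list of chunks, that typically means the list's Python string representation ends up in the prompt. A small formatting step is a common refinement. The helper below is a hypothetical sketch, and the separator is an arbitrary choice.

```python
def format_context(docs, separator="\n\n---\n\n"):
    """Join retrieved chunks into one prompt-ready string.

    Accepts plain strings or objects with a `page_content` attribute.
    """
    parts = []
    for doc in docs:
        text = getattr(doc, "page_content", doc)
        parts.append(str(text).strip())
    # Drop empty chunks so the prompt does not contain blank sections.
    return separator.join(part for part in parts if part)


context = format_context(["  First chunk. ", "", "Second chunk."])
```

Assuming LCEL's usual coercion of plain functions into runnables, this could be composed as `retriever | format_context` in place of the bare `retriever`.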

In [23]:
chain.invoke(
    "What is the  Homoglyph-basedattack?"
)

'Homoglyph-based attacks are a technique used to evade AI-generated text detectors by rewriting some characters in the original text to resemble other characters.'

In [24]:
chain.invoke("Tell me about  Figure1:Homoglyph-basedattack")

'Figure 1: Homoglyph-based attack shows the original text on the left box, adapted from a source referenced as (Hans et al., 2024), and the text after rewriting some of its characters on the right box. The bottom boxes display the tokenized versions from a source referred to as (OpenAI, 2024b), with differences highlighted in red.'

In [25]:
chain.invoke("Explain about Figure 2: Experimental process")

'Based on the provided context, there is no mention of Figure 2 or an experimental process. The text provided includes information about released datasets, the confirmation of scores across multiple executions, approximate requirements for time and space for experiments, details about detectors used in experiments, and results for Fast-DetectGPT and Ghostbuster on the essay dataset.'

In [26]:
chain.invoke("Describe about MatthewsCorrelationCoefficient in Table-2")

'The Matthews Correlation Coefficient (MCC) in Table 2 represents the performance of all detectors on all datasets for different attack configurations. The color of the cell in the table represents the MCC value, ranging from 0 (red) to 1 (green). The MCC values are used as the main metric for assessing the efficacy of detectors, with a focus on class balance. The MCC values in the table show a consistent decline in performance across all detectors and datasets, particularly in the greedy replacement setting where the attack renders detectors ineffective.'
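
For readers unfamiliar with the metric, the standard definition of MCC can be computed from the four confusion-matrix counts. This sketch uses the textbook formula and is not taken from the paper; note that MCC itself ranges from -1 to 1, even though the answer above describes a 0 to 1 colour scale.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)     # every prediction correct
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)    # no better than guessing
inverted = matthews_corrcoef(tp=0, tn=0, fp=50, fn=50)    # every prediction wrong
```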

In [27]:
chain.invoke("What is the Average of each detector in Table-2")

'The average of each detector in Table-2 is as follows:\n- ArguGPT: 0.64\n- Binoculars: 0.17\n- DetectGPT: 0.1\n- Fast-DetectGPT: 0.05\n- Ghostbuster: 0.01\n- OpenAI: -0.01'

In [28]:
chain.invoke("What is the abstract of this paper")

'The abstract of this paper is not provided in the context given.'

In [29]:
chain.invoke("How is perplexity computed?")

'Perplexity is computed based on Equation 1, where N represents the number of tokens in the text and p(ti) is the probability of token ti given t1, . . . , ti−1.'
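
The paper's Equation 1 is not reproduced in the answer above, so the sketch below uses the standard definition of perplexity, which matches the symbols the answer names: the exponential of the average negative log-probability over the N tokens. The function name and sample probabilities are illustrative.

```python
import math


def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities p(t_i | t_1..t_{i-1})."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_likelihood)


uniform = perplexity([0.5, 0.5, 0.5, 0.5])   # every token has probability 0.5
less_likely = perplexity([0.5, 0.1, 0.5, 0.1])
```

Lower token probabilities push perplexity up, which is consistent with the later answer about attacked text shifting log likelihoods toward more negative values.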

In [30]:
chain.invoke("Describe about Differences in log likelihood pertoken")

'The text discusses how homoglyph-based attacks on tokens can result in differences in log likelihood per token. When 10% of the characters in the text are modified, their tokenization changes 70% of the time. This leads to a distribution of log likelihoods that is shifted towards more negative values in the attacked text compared to the original text. As a result, the attacked text may appear "more likely to be human" when evaluated with a Language Model (LLM) in terms of perplexity, while maintaining the same appearance. This shift in log likelihood values towards more negative values can help evade detection by classification models.'

In [31]:
chain.invoke("Tell me about Distribution of Original and attacked in Figure:3")

'Based on the provided context, it is not possible to determine the distribution of Original and attacked texts in Figure 3 as the content does not include any specific information or data related to Figure 3.'

In [32]:
chain.invoke("Describe how Embeddings from ArguGPT look like")

'The embeddings from ArguGPT show that the original texts are well-separated, while the embeddings of the attacked texts are mixed and placed in a different subspace. Three clusters can be observed, with two corresponding to the original texts where AI and human texts are clearly separated.'

In [33]:
chain.invoke("Give Results for DetectGPT on the CHEAT dataset in tabular form")

'Table 5: Results for DetectGPT on the CHEAT dataset.\n---------------------------------------------------------\n| Metric    | Score    |\n|-----------|----------|\n| Accuracy  | 0.85     |\n| Precision | 0.78     |\n| Recall    | 0.92     |\n| F1 Score  | 0.84     |\n---------------------------------------------------------'

In [34]:
chain.invoke("Describe Confusion matrices for the Binoculars detector on the CHEAT dataset.")

'The confusion matrices for the Binoculars detector on the CHEAT dataset show the true labels and predicted labels for different types of attacks, including random attacks at 10%, 15%, and 20%, as well as a greedy attack. The matrices display the number of true positive, true negative, false positive, and false negative predictions for each attack scenario.'