# RAG Langchain with Unstructured PDF
Name      : Willy Santoso<br>
Subject   : RAG Langchain with Unstructured PDF
<br><br>
This notebook can be run on Google Colab (upload the .ipynb file to Colab).<br>
If you use Google Colab, run the cells below to verify the NVIDIA GPU driver and install the libraries:

In [None]:
!nvidia-smi

#### Install pciutils for Ollama GPU support, plus the Poppler and Tesseract OCR libraries for reading PDFs that contain images

In [None]:
!sudo apt-get update
!sudo apt-get install -qq -y pciutils
!sudo apt-get install -qq -y libxml2 libxslt1-dev libmagic-dev
!sudo apt-get install -qq -y libnss3 libnss3-dev
!sudo apt-get install -qq -y libcairo2-dev libjpeg-dev libgif-dev
!sudo apt-get install -qq -y cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev
!sudo apt-get install -qq -y tesseract-ocr
!sudo apt-get install -qq -y libpoppler-dev poppler-utils

In [None]:
!pip install -q pyngrok

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh
!ollama serve > server.log 2>&1 &
# Give the Ollama server a few seconds to start before pulling models
!sleep 5
!ollama pull nomic-embed-text
!ollama pull llama3.1

In [None]:
!pip install -q numpy==1.26.4
!pip install -q protobuf==4.25.4
!pip install -q chromadb==0.4.24
!pip install -q onnx==1.16.1
!pip install -q onnxruntime==1.17.1 onnxruntime-gpu==1.17.1
!pip install -q rapidocr-onnxruntime
!pip install -q datasets==2.18.0
!pip install -q pytesseract
!pip install -U -q nltk
!pip install -U -q langchain langchain-core langchain-community
!pip install -U -q langchain-chroma langchain-ollama langchain-huggingface langchainhub langserve langsmith
!pip install -U -q langchain-unstructured unstructured-client unstructured "unstructured[all-docs]" python-magic pydantic lxml pypdf pymupdf
!pip install -U -q ragas

In [None]:
!wget -O Attention-is-all-you-need.pdf "https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

### Table of Contents
1. Data Preprocessing
2. Retrieval Strategy
3. Model Selection
4. Evaluation Dataset Creation
5. Evaluation
6. Recommendation

### 1. Data Preprocessing
Here we import the required libraries and load the PDF with `UnstructuredPDFLoader`.

<b>Import the required libraries</b><br>
- Langchain: the main library for building the RAG pipeline.
- Chroma: the vector database that stores the vectors generated by the embedding model.
- RecursiveCharacterTextSplitter: the recursive text splitter used for splitting documents.
- UnstructuredPDFLoader: for loading unstructured documents like PDFs.
- ChatOllama: the open-source Ollama integration for chat LLMs.
- OllamaEmbeddings: the open-source Ollama integration for embeddings.
- ChatPromptTemplate and PromptTemplate: templates for building the prompts.
- StrOutputParser: parses the LLM output into a plain string.
- RunnablePassthrough: a runnable that passes the user's question through unchanged.
- MultiQueryRetriever: the main retrieval method that we want to use.

In [None]:
from langchain_core.messages import (
    AIMessage,
    HumanMessage,
    SystemMessage
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import PyPDFLoader

from langchain_chroma import Chroma
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

<b>Load the document</b><br>
Let's load the paper

In [None]:
loader = UnstructuredPDFLoader('./Attention-is-all-you-need.pdf')
data = loader.load()

Display the first 1,000 characters of the paper

In [None]:
print(data[0].page_content[:1000])

<b>Splitting document</b>

For splitting documents, there are two well-known methods in Langchain:
- `CharacterTextSplitter`: This is the simpler method. It splits the text on a single specified separator, such as a space or a newline.
- `RecursiveCharacterTextSplitter`: This method is more advanced and versatile. It tries a series of separators in order, from coarse to fine. For example, it first tries to split the text at paragraph breaks (\n\n); if the resulting chunks are still too large, it splits them at single newlines (\n), then at spaces, and finally at individual characters.

In [None]:
# Split and chunk
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(data)
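
To make the recursive strategy concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration and not Langchain's actual implementation: `recursive_split` and the sample text are hypothetical, and chunk overlap is deliberately left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators before fine ones (simplified illustration)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur: fall through to the next, finer one.
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows.\n\nEnd."
demo_chunks = recursive_split(sample, chunk_size=30)
for chunk in demo_chunks:
    print(repr(chunk))
```

In this toy run the two short paragraphs stay whole, while the long one is split at spaces because it contains no newline to split on first.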

### 2. Retrieval Strategy
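For the retrieval strategy, we use dense retrieval (embedding vectors) together with `MultiQueryRetriever`. Given one user question, the retriever asks the LLM to generate several alternative phrasings, runs a search for each of them, and merges the retrieved documents.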

<b>Adding Vector Databases to ChromaDB</b><br>
We will use `OllamaEmbeddings` with the `nomic-embed-text` model<br>(an open-source embedding model that is claimed to surpass OpenAI `text-embedding-ada-002` and `text-embedding-3-small`, <a href="https://ollama.com/library/nomic-embed-text">[source]</a>)
<br>
Then we store the embedding vectors in a ChromaDB vector database

In [None]:
# Add to vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text",show_progress=True),
    collection_name="attention-paper-rag",
    persist_directory="./local-rag-attention"
)
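
As an illustration of what dense retrieval does with these vectors, the sketch below ranks a few toy chunks by cosine similarity to a toy query vector. Everything in it is an assumption made for illustration: the three-dimensional vectors are invented (real embeddings have far more dimensions), `top_k` is a hypothetical helper, and the distance function Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Invented toy embeddings, one per chunk.
toy_chunks = {
    "attention": [0.9, 0.1, 0.0],
    "training": [0.1, 0.9, 0.1],
    "results": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.3, 0.0]

print(top_k(toy_query, toy_chunks))
```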

### 3. Model Selection
Load the chat LLM. We will be using Llama 3.1, an open-source LLM that improves on Llama 3.<br>
For detailed benchmarks and comparisons, see: https://blog.gopenai.com/llama-3-1-vs-llama-3-differences-d3d23e09607f

Why use open-source LLMs? There are several reasons:
- Deployment can be local or anywhere else
- No need to pay for subscription services
- Privacy matters: open-source LLMs run locally, so the data stays on the local machine
- Open-source models receive contributions from developers worldwide, whereas contributions to closed-source models stay private
- Open-source models have massive support from community projects such as HuggingFace, llama.cpp, Ollama, Langchain, etc.

In [None]:
llm = ChatOllama(model='llama3.1')

Define the Query Prompt Template

In [None]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

<b>Still about Retrieval Strategy</b>

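We will be using `MultiQueryRetriever` as the main strategy. It asks the LLM to rewrite the user question into several variants (five, according to the prompt above), searches the vector database with each variant, and returns the unique union of the retrieved documents.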

In [None]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
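
The multi-query idea can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the Langchain implementation: `multi_query_retrieve`, `fake_variants`, `fake_search`, and `FAKE_INDEX` are hypothetical stand-ins for the retriever, the LLM that rewrites the question, and the vector store.

```python
def multi_query_retrieve(question, generate_variants, search):
    """Search with several phrasings of one question and merge the results."""
    seen, merged = set(), []
    for query in generate_variants(question):
        for doc in search(query):
            if doc not in seen:  # drop duplicates, keep first-seen order
                seen.add(doc)
                merged.append(doc)
    return merged


def fake_variants(question):
    # Hypothetical stand-in for the LLM call that rewrites the question.
    return [
        "What does the Transformer introduce?",
        "How is attention computed?",
        "What replaces recurrence?",
    ]


# Hypothetical stand-in for the vector store: query -> retrieved chunk ids.
FAKE_INDEX = {
    "What does the Transformer introduce?": ["chunk-1", "chunk-2"],
    "How is attention computed?": ["chunk-2", "chunk-3"],
    "What replaces recurrence?": ["chunk-1", "chunk-4"],
}


def fake_search(query):
    return FAKE_INDEX.get(query, [])


merged_docs = multi_query_retrieve(
    "What is the main contribution of the Transformer model?",
    fake_variants,
    fake_search,
)
print(merged_docs)
```

Each variant can surface chunks that the others miss, while chunks found more than once appear only once in the merged result.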

Create the retrieval chain

In [None]:
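# Join the retrieved chunks into plain text so the prompt receives readable
# context instead of the repr of a list of Document objects.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)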

### 4. Evaluation Dataset Creation
Here we build an evaluation dataset that reflects realistic queries researchers might have. It includes both simple and complex queries (multi-hop, comparisons between two things, multiple questions in one prompt) that test the system's ability to retrieve and generate accurate information. Around 20 questions should be enough.

In [None]:
queries = [
    "What is the main contribution of the Transformer model?",
    "What does the Scaled Dot-Product Attention mechanism do?",
    "How many layers are there in the encoder stack of the Transformer?",
    "Explain the architecture of the Transformer model?",
    "What tasks were used to evaluate the Transformer model?",
    "What are the differences between the encoder and decoder stacks in the Transformer model?",
    "What are the advantages of using self-attention in the Transformer model?",
    "How does the Transformer model handle positional information in sequences?",
    "Explain the regularization techniques used in training the Transformer",
    "How do the training schedules differ between the base and big Transformer models?",
    "Compare the training efficiency of the Transformer with RNN-based models",
    "How does the positional encoding in the Transformer differ from other models?",
    "How does the Transformer handle long-range dependencies compared to convolutional networks?",
    "What is the purpose of the positional encoding in the Transformer?",
    "What hardware was used to train the Transformer models?",
    "How many attention heads are used in the Transformer?",
    "What is the main challenge that self-attention addresses in sequence modeling?",
    "What is the effect of using multi-head attention as shown in the paper?",
    "Describe the key components of the Transformer's encoder and decoder stacks",
    "What training data was used for the Transformer model?",
]
ground_truths = [
    "The Transformer achieves state-of-the-art results without relying on recurrent neural networks (RNNs) or convolutional layers, relying only on attention mechanisms",
    "The Scaled Dot-Product Attention maps a query and a set of key-value pairs (K, V) to an output, computed as a weighted sum of the values.",
    "The encoder stack consists of 6 identical layers.",
    "The Transformer model consists of an encoder-decoder architecture, with each having a stack of 6 identical layers utilizing self-attention and fully connected feed-forward network.",
    "The Transformer model was evaluated on two machine translation tasks: WMT 2014 English-to-German and WMT 2014 English-to-French.",
    "Each encoder layer consists of two sub-layers: self-attention and a feed-forward network. Each decoder layer includes an additional third sub-layer for encoder-decoder attention.",
    "Self-attention allows for parallelization, reduces path lengths, and enables the model to learn long-range dependencies effectively.",
    "The Transformer uses positional encodings based on sine and cosine functions to inject positional information into the sequence.",
    "Regularization techniques include residual dropout, label smoothing, and weight sharing across embedding layers.",
    "The base model is trained for 12 hours, and the big model for 3.5 days, with corresponding increases in training steps.",
    "The Transformer is significantly more parallelizable and trains faster than recurrent-based models, achieving better and faster performance with less computational resources.",
    "The Transformer uses fixed sine and cosine functions, whereas other models might use learned positional embeddings.",
    "The Transformer’s self-attention mechanism allows it to capture long-range dependencies with a constant number of sequential operations, unlike convolutional networks that are limited to a local window and require stacking multiple layers.",
    "Positional encoding is used to inject information about the relative or absolute position of the tokens in the sequence.",
    "The Transformer models were trained using 8 NVIDIA P100 GPUs.",
    "The Transformer employs 8 parallel attention heads in its multi-head attention mechanism.",
    "Self-attention addresses the challenge of modeling dependencies without regard to their distance in the input or output sequences.",
    "Multi-head attention allows the model to focus on different positions within the sequence, providing a more nuanced understanding by attending to different subspaces.",
    "Each encoder layer consists of two sub-layers: self-attention and position-wise feed-forward networks, while each decoder layer adds encoder-decoder attention to these components.",
    "The WMT 2014 English-to-German and English-to-French datasets were used for training the Transformer."
]
contexts = []
answers = []

<b>Question no.1</b><br>
Simple Queries with MultiQueryRetriever: "What is the main contribution of the Transformer model?"

In [None]:
content = []
retrieval = retriever.invoke("What is the main contribution of the Transformer model?")

for item in retrieval:
    content.append(item.page_content)
contexts.append(content)

In [None]:
answer = chain.invoke("What is the main contribution of the Transformer model?")
answers.append(answer)
answer
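
The two cells above handle a single question. Below is a minimal sketch of how the same steps could be repeated for every entry in `queries`, so that `contexts` and `answers` end up with one item per question. `collect_rag_outputs` is a hypothetical helper, and the stub functions only stand in for `retriever.invoke` and `chain.invoke` so that the example runs without a model.

```python
from types import SimpleNamespace


def collect_rag_outputs(queries, retrieve, answer):
    """Run every query and return aligned (contexts, answers) lists."""
    contexts, answers = [], []
    for query in queries:
        docs = retrieve(query)
        # Keep only the text of each retrieved document.
        contexts.append([doc.page_content for doc in docs])
        answers.append(answer(query))
    return contexts, answers


# Stubs that mimic the shape of the real retriever and chain outputs.
def stub_retrieve(query):
    return [SimpleNamespace(page_content=f"context for: {query}")]


def stub_answer(query):
    return f"answer to: {query}"


demo_contexts, demo_answers = collect_rag_outputs(
    ["q1", "q2"], stub_retrieve, stub_answer
)
print(demo_contexts)
print(demo_answers)
```

In this notebook the call would presumably look like `contexts, answers = collect_rag_outputs(queries, retriever.invoke, chain.invoke)`, which replaces the two lists instead of appending to them.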

### 5. Evaluation
Here we use the RAGAS evaluation metrics to evaluate the RAG results.<br>
There are several RAGAS metrics:<br>
1. Faithfulness: This metric measures how factually consistent the generated answer is with the provided context. It ensures that the claims made in the answer can be directly inferred from the given context. A faithful answer means that all statements in the answer are supported by the context.<br>(Example: If the context states that "Einstein was born on March 14, 1879," an answer saying he was born on March 20, 1879, would have low faithfulness.)<br>
2. Answer Relevancy: This metric assesses how relevant the generated answer is to the original question. It focuses on whether the answer addresses the query directly and avoids unnecessary or unrelated information. Relevance is often measured by generating variations of the question based on the answer and comparing them to the original question using cosine similarity.<br>(Example: For a question asking "What is the capital of France?", an answer stating "Paris is the capital of France" is highly relevant, while an answer discussing French cuisine would score lower.)<br>
3. Context Precision: This metric evaluates how accurately the retrieved context provides the necessary information to answer the question. It measures whether the relevant information is ranked higher among the retrieved documents, focusing on the signal-to-noise ratio in the retrieval process.<br>(Example: If the retrieved context is directly related to the query without much irrelevant information, the precision is high.)<br>
4. Context Recall: This metric measures how much of the relevant information from the ground truth is included in the retrieved context. It compares the retrieved context with the expected answer to see if all necessary details are captured.<br>(Example: If the ground truth for a question includes multiple details, and the retrieved context covers most of these, the recall would be high.)<br>
5. Context Entity Recall: This metric is similar to context recall but focuses specifically on entities mentioned in the context. It checks if the retrieved context includes all the entities that are crucial to answering the question.<br>(Example: If a query is about "Albert Einstein," the context entity recall checks if the retrieved documents mention Einstein.)<br>
6. Answer Similarity: This metric assesses how close the generated answer is to the ground truth by comparing the semantic similarity between the two. It ensures that even if different words are used, the meaning conveyed is similar.<br>(Example: If the ground truth answer is "Paris is the capital of France," and the generated answer is "The capital of France is Paris," the similarity would be high.)<br>
7. Answer Correctness: This metric evaluates the factual correctness of the generated answer by comparing it to the ground truth. It considers both the factuality and the semantic similarity of the answer to ensure it is correct.<br>(Example: If the ground truth states "Paris is the capital of France," and the answer says "London is the capital of France," the correctness would be low.)
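
To show how two of these scores turn judgments into numbers, here is a small sketch that assumes the definitions given in the RAGAS documentation: faithfulness as the share of answer claims supported by the context, and context precision as the average of precision@k over the ranks that hold a relevant chunk. In RAGAS the true/false verdicts come from an LLM judge; here they are hard-coded, and both function names are hypothetical.

```python
def faithfulness_score(claim_supported):
    """Share of the answer's claims that the context supports."""
    return sum(claim_supported) / len(claim_supported)


def context_precision_at_k(relevance):
    """Average precision@k over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Four claims in the answer, three of them supported by the context.
print(faithfulness_score([True, True, False, True]))

# Same two relevant chunks, ranked well versus ranked poorly.
print(context_precision_at_k([True, False, True]))
print(context_precision_at_k([False, True, True]))
```

The last two calls use the same number of relevant chunks, but the ranking that puts a relevant chunk first scores higher, which is the ranking sensitivity described in item 3.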

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_entity_recall,
    answer_similarity,
    answer_correctness,
)
from datasets import Dataset

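# Every column must have the same length, so keep only the questions that
# already have a generated answer and retrieved contexts (this assumes the
# answers were generated in the same order as `queries`).
n = min(len(answers), len(contexts))
d = {
    "question": queries[:n],
    "answer": answers[:n],
    "contexts": contexts[:n],
    "ground_truth": ground_truths[:n]
}
dataset = Dataset.from_dict(d)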

In [None]:
score = evaluate(dataset,
                 metrics=[faithfulness, answer_relevancy, context_precision, context_recall, context_entity_recall, answer_similarity, answer_correctness],
                 llm=ChatOllama(model='llama3.1'),
                 embeddings=OllamaEmbeddings(model='nomic-embed-text'))
score_df = score.to_pandas()
score_df

The evaluation scores can be exported to CSV format if needed

In [None]:
score_df.to_csv("EvaluationScores.csv", encoding="utf-8", index=False)

### 6. Recommendation
Judging from the answers generated for the evaluation dataset, the strategy needs to be improved: the LLM's answers contain correct information, but also some hallucination and overly long responses. The PDF loader needs to be improved too, as another approach (Semi-Structured PDF Loader <a href="https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb">from here</a>) would preprocess the documents better (I ran into issues installing the Semi-Structured PDF Loader on my machine, so I used `UnstructuredPDFLoader` instead).<br>
There is also a wide range of retrieval strategies (<a href="https://medium.com/@abhinavkimothi/rag-value-chain-retrieval-strategies-in-information-augmentation-for-large-language-models-3a44845e1e26">as shown in this Medium blog</a>) beyond Multi-Query Retrieval alone, and it is worth exploring them to compare and test RAG results. Finally, the RAGAS evaluation with open-source LLMs also needs to be improved (I ran into issues generating the evaluation scores).