### RAG with PDFs 📄: Extracting Data to Give Context to an LLM 🧠

<img src="./Images/RAGs.png" width="800" height="700" style="display: block; margin: auto;">

In [1]:
!pip install pypdf

Collecting pypdf
  Using cached pypdf-5.3.0-py3-none-any.whl.metadata (7.2 kB)
Using cached pypdf-5.3.0-py3-none-any.whl (300 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.3.0


In [1]:
from dotenv import load_dotenv
load = load_dotenv('./../.env')


In [2]:
from langchain_ollama import ChatOllama

llm = ChatOllama(
    base_url="http://localhost:11434",  # local Ollama server
    model="qwen2.5:latest",
    temperature=0.5,
    num_predict=250,  # cap on generated tokens (ChatOllama uses num_predict, not max_tokens)
)

### 1. Extracting the PDF files

In [3]:
from langchain_community.document_loaders import PyPDFLoader

pdf1 = "./attention.pdf"
pdf2 = "./LLMForgetting.pdf"
pdf3 = "./TestingAndEvaluatingLLM.pdf"

pdfFiles = [pdf1, pdf2, pdf3]

documents = []

for pdf in pdfFiles:
    loader = PyPDFLoader(pdf)
    documents.extend(loader.load())

print(len(documents))

253


In [9]:
print(documents[:1])

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukas

### 2. Text Splitting into Chunks

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,
)

all_splits = text_splitter.split_documents(documents)

len(all_splits)

640
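
To see what `chunk_size` and `chunk_overlap` mean in isolation, here is a minimal sketch of a fixed-size sliding-window splitter. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also prefers to break on separators such as paragraph and line breaks). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs; neighbouring chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the last 4 characters of one chunk reappear as the first 4 of the next. The recorded start offset plays the same role as the `start_index` metadata requested with `add_start_index=True`.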

### 3. Embedding

In [5]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="llama3.2:latest")

vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
len(vector_1), len(vector_2)

(3072, 3072)
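
Embeddings are useful because vectors of related texts point in similar directions. As a rough illustration, the sketch below computes cosine similarity in plain Python on toy 3-dimensional vectors that stand in for the 3072-dimensional ones above. The helper name `cosine_similarity` is an assumption for this example, not part of `langchain_ollama`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

In this notebook the same function could be called as `cosine_similarity(vector_1, vector_2)` to compare the first two chunks.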

### 4. Vector Stores

In [13]:
#!pip install -qU langchain-chroma



In [None]:
from langchain_chroma import Chroma

vector_store = Chroma.from_documents(
    documents = all_splits,
    embedding=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

### 5. Retrieving from the Persistent Vector Store

In [6]:
from langchain_chroma import Chroma


vector_store = Chroma(persist_directory='./chroma_langchain_db', embedding_function=embeddings)

result = vector_store.similarity_search("What is Bias testing", k=3)

for doc in result:
    print(doc.page_content)

60 CHAPTER 4. LOGICAL REASONING CORRECTNESS
the following challenges: 1) If an LLM concludes correctly, it is unclear
whether the response stems from reasoning or merely relies on simple
heuristics such as memorization or word correlations (e.g., “dry floor”
is more likely to correlate with “playing football”). 2) If an LLM
fails to reason correctly, it is not clear which part of the reasoning
process it failed (i.e., inferring not raining from floor being dry or
inferring playing football from not raining). 3) There is a lack of
a system that can organize such test cases to cover all other formal
reasoning scenarios besides implication, such as logical equivalence
(e.g., If A then B, if B then A; therefore, A if and only if B). 4)
Furthermore, understanding an LLM’s performance on such test cases
provides little guidance on improving the reasoning ability of the
LLM. To better handle these challenges, a well-performing testing
Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang,
“

In [11]:
result = vector_store.similarity_search_with_score("What are the types of LLM Testing")

result[0]

(Document(id='16eeba20-c1be-42a2-8243-a14c71563837', metadata={'author': '', 'creationdate': '2025-01-07T01:36:50+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2025-01-07T01:36:50+00:00', 'page': 9, 'page_label': '10', 'producer': 'pdfTeX-1.40.25', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'source': './LLMForgetting.pdf', 'start_index': 0, 'subject': '', 'title': '', 'total_pages': 15, 'trapped': '/False'}, page_content='Under review\nFigure 6: The performance of general knowledge of the BLOOMZ-7.1b and LLAMA-7b\nmodel trained on the instruction data and the mixed data. The dashed lines refers to the\nperformance of BLOOMZ-7.1b and LLAMA-7B and the solid ones refer to those of mixed-\ninstruction trained models.\nincreases to 3b, BLOOMZ-3b suffers less forgetting compared to mT0-3.7B. For example, the\nFG value of BLOOMZ-3b is 11.09 which is 5.64 lower than that of mT0-3.7b. These results\nsugges
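
`similarity_search_with_score` returns `(Document, score)` pairs. With Chroma the score is, to my understanding, a distance, so a lower value means a closer match; treat that as an assumption and check it against your configuration. The sketch below shows the underlying idea as a brute-force top-k search by squared Euclidean distance over toy vectors. The names `squared_l2`, `top_k_by_distance`, and `toy_store` are hypothetical, and a real vector store would normally use an index rather than scanning every vector.

```python
import heapq


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_by_distance(query: list[float], store: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (text, distance) pairs closest to the query, smallest distance first."""
    scored = ((text, squared_l2(query, vector)) for text, vector in store)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Made-up 2-dimensional "embeddings" for illustration only.
toy_store = [
    ("bias testing", [0.9, 0.1]),
    ("attention heads", [0.1, 0.9]),
    ("fairness metrics", [0.8, 0.3]),
]

for text, distance in top_k_by_distance([1.0, 0.0], toy_store, k=2):
    print(f"{distance:.2f}  {text}")
```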

### 6. Retrievers in LangChain

In [7]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs = {"k": 1}
)

retriever.batch(
    [
        "What is the Bias Measurement",
        "How to test human safety against LLM",
        "How LLM forgets the context"
    ]
)


[[Document(id='16eeba20-c1be-42a2-8243-a14c71563837', metadata={'author': '', 'creationdate': '2025-01-07T01:36:50+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2025-01-07T01:36:50+00:00', 'page': 9, 'page_label': '10', 'producer': 'pdfTeX-1.40.25', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'source': './LLMForgetting.pdf', 'start_index': 0, 'subject': '', 'title': '', 'total_pages': 15, 'trapped': '/False'}, page_content='Under review\nFigure 6: The performance of general knowledge of the BLOOMZ-7.1b and LLAMA-7b\nmodel trained on the instruction data and the mixed data. The dashed lines refers to the\nperformance of BLOOMZ-7.1b and LLAMA-7B and the solid ones refer to those of mixed-\ninstruction trained models.\nincreases to 3b, BLOOMZ-3b suffers less forgetting compared to mT0-3.7B. For example, the\nFG value of BLOOMZ-3b is 11.09 which is 5.64 lower than that of mT0-3.7b. These results\nsugge

### Manual Document Retrieval

In [8]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

query = "What exactly does Testing the Factual Correctness of LLM tells"

# invoke() replaces the deprecated get_relevant_documents()
retrieved_docs = retriever.invoke(query)

# Join the retrieved chunks into a single context string
context_text = "\n\n".join([doc.page_content for doc in retrieved_docs])

prompt_template = ChatPromptTemplate.from_template(
    """
    You are an AI assistant. Use the following context to answer the question correctly.
    If you don't know the answer, just say "I don't know."

    Also, summarize the response in Markdown format.

    context: {context}

    question: {question}

    AI answer:
    """
)

chain = prompt_template | llm | StrOutputParser()

response = chain.invoke({"context": context_text, "question": query})

print(response)


**Testing the Factual Correctness of LLMs involves assessing their performance on multi-hop questions. Specifically:**

- **Multi-hop Questions**: These are more challenging for LLMs compared to single-hop questions.
- **FactChecker Tool**: Used to generate 600 multi-hop questions and evaluate LLM responses.
- **Evaluation Outcome**: All LLMs showed a higher incidence of factual errors in responding to 2-hop questions, indicating increased difficulty with complex queries.
- **Findings**:
  - FactChecker can identify significant factual errors in both commercial and research LLMs.
  - It provides an evaluation on the factual accuracy of these models.

This testing approach highlights the limitations of current LLMs in handling complex, multi-step reasoning tasks.
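
The cell above joins only `page_content`, so the model never sees where each chunk came from. One possible variation, sketched below, labels every chunk with its source and page before it goes into the prompt. `Doc` is a hypothetical stand-in that mirrors just the `page_content` and `metadata` attributes of a LangChain `Document`, `format_docs` is an assumed helper name, and the sample texts and paths are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: only the two attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join chunks into one context string, each prefixed with a numbered source label."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "Unknown")
        page = doc.metadata.get("page_label", "?")
        blocks.append(f"[{i}] {source} (page {page})\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [
    Doc("first chunk text", {"source": "./a.pdf", "page_label": "3"}),
    Doc("second chunk text"),  # no metadata: falls back to the defaults
]
print(format_docs(docs))
```

In the notebook, `context_text = format_docs(retrieved_docs)` could replace the plain join, since the retrieved documents expose the same two attributes.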


### Using LangChain Hub for the Prompt

In [35]:
from langchain_core.output_parsers import StrOutputParser
from langchain import hub

query = "How to test Translation in LLM?"

retrieved_docs = retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()

context_text = "\n\n".join([doc.page_content for doc in retrieved_docs])

prompt = hub.pull("rlm/rag-prompt")

chain = prompt | llm | StrOutputParser()

response = chain.invoke({"context": context_text, "question": query})

print(response)

To test translation in LLMs like ChatGPT and GPT-4, you can use human evaluations on randomly selected responses where the models have differing judgments. Translate responses from other languages to English using tools like Google Translate before feeding them into the model for evaluation. This helps ensure consistency across language outputs.


### Retrieving data using RetrievalQA

In [45]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever, return_source_documents=True)

question = "What exactly does Testing the Factual Correctness of LLM tells"

response = qa_chain.invoke(question)

sources = set(doc.metadata.get("source", "Unknown") for doc in response["source_documents"])

print(response['result'])
print("\n📕 Sources Used:")
for source in sources:
    print(f"- {source}")

Testing the factual correctness of large language models (LLMs) involves assessing their ability to provide accurate information. This is done by generating multi-hop questions, which are more complex and require the model to connect multiple pieces of information to arrive at a correct answer. The study uses FactChecker to create 600 multi-hop questions and queries various LLMs with these questions. The results show that all tested LLMs make more factual errors when answering multi-hop questions compared to single-hop ones, indicating that multi-hop questions present a greater challenge for the models.

This testing process helps identify substantial factual errors in both commercial and research LLMs and provides an evaluation of their factual accuracy. It highlights the need for improved capabilities in handling complex, interconnected information to enhance the reliability of these models.

📕 Sources Used:
- ./TestingAndEvaluatingLLM.pdf
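
The source list above shows only file names. Because each chunk's metadata also carries a `page` value (0-based, as in the outputs earlier where `page: 9` has `page_label: '10'`), the pages can be reported as well. The following is a small sketch of that grouping; `pages_by_source` is a hypothetical helper and the sample metadata is invented for illustration.

```python
from collections import defaultdict


def pages_by_source(metadatas) -> dict[str, list[int]]:
    """Map each source to its sorted, de-duplicated 1-based page numbers."""
    grouped = defaultdict(set)
    for meta in metadatas:
        pages = grouped[meta.get("source", "Unknown")]  # creates the entry if missing
        if "page" in meta:
            pages.add(meta["page"] + 1)  # convert 0-based page index to 1-based
    return {source: sorted(pages) for source, pages in sorted(grouped.items())}


# Invented metadata dictionaries standing in for doc.metadata values.
sample_metadata = [
    {"source": "b.pdf", "page": 9},
    {"source": "a.pdf", "page": 0},
    {"source": "b.pdf", "page": 1},
    {"source": "b.pdf", "page": 9},
]
for source, pages in pages_by_source(sample_metadata).items():
    print(f"- {source}: pages {pages}")
```

With the chain above, the input would be `[doc.metadata for doc in response["source_documents"]]`.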
