# **Title: Knowledge Navigator: A RAG-Powered Research Assistant for Cutting-Edge AI Papers**

# Project Overview: Combating LLM Hallucination with Retrieval-Augmented Generation (RAG)
**The Challenge:**
In the fast-paced world of AI research, keeping up with new, jargon-heavy papers is difficult. Manually sifting through dense documents to find specific information is time-consuming and quickly leads to information overload.

Furthermore, relying on large language models (LLMs) to answer questions on this new, private data introduces three key problems:

**Hallucination:** 
LLMs may confidently generate incorrect or fabricated information when they lack specific knowledge from the provided documents.

**Lack of Specificity:**
Questions that require a deep understanding of concepts spread across multiple papers are often difficult for a standard LLM to answer accurately.

**No Verifiable Sources:**
Without direct access to the source material, there is no way to verify the LLM's claims, undermining its trustworthiness.

**The Solution: A RAG System**
This project tackles these challenges by building a Retrieval-Augmented Generation (RAG) pipeline. The system answers questions over a private dataset of research papers and shows the passages each answer is based on, so the answer can be checked against its sources.

**Input:** 
A collection of specialized research papers in PDF format.

**Output:** 
A concise, fact-based answer, accompanied by the retrieved source chunks from the research papers that were used to generate it. Grounding the answer in the provided context reduces the risk of hallucination and lets the user verify each claim.

# **Installing all the necessary libraries and APIs**

In [1]:
!pip install PyPDF2
# Remove unused, conflicting packages from the Kaggle base environment.
!pip uninstall -qqy jupyterlab kfp 2>/dev/null
!pip install -U -q "google-genai==1.7.0"
!pip install faiss-cpu

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.0 kB)
Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.3 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0.post1


In [2]:
import os
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [3]:
from google import genai
from google.genai import types

genai.__version__

'1.7.0'

# **Data Acquisition & Preparation - Document Loading**

**Identify Document Paths**

In [4]:
# code to get all the files in data directory
import os

list_file_names = os.listdir('/kaggle/input/capstone-project-dataset/')
base_path = '/kaggle/input/capstone-project-dataset/'
full_file_path = []

for file in list_file_names:
    full_file_path.append(base_path + file)

print(full_file_path)

['/kaggle/input/capstone-project-dataset/paper 1.pdf', '/kaggle/input/capstone-project-dataset/paper 3.pdf', '/kaggle/input/capstone-project-dataset/paper 5.pdf', '/kaggle/input/capstone-project-dataset/paper 4.pdf', '/kaggle/input/capstone-project-dataset/paper 2.pdf']
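
The listing above comes back in arbitrary order (paper 1, 3, 5, 4, 2), because `os.listdir` makes no ordering guarantee. A minimal sketch of a deterministic alternative follows; `list_pdfs` is a hypothetical helper, not part of the notebook, and it also skips any non-PDF files in the folder.

```python
import tempfile
from pathlib import Path

def list_pdfs(folder: str) -> list[str]:
    """Return the paths of the PDF files in `folder`, sorted by file name."""
    return sorted(str(p) for p in Path(folder).glob("*.pdf"))

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["paper 2.pdf", "paper 1.pdf", "notes.txt"]:
        (Path(tmp) / name).touch()
    names = [Path(p).name for p in list_pdfs(tmp)]

print(names)
```

In the notebook this would be called as `list_pdfs('/kaggle/input/capstone-project-dataset/')`. Note that changing the file order also changes the chunk indices printed during retrieval later on.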


# **Iterate and extract text**

In [5]:
import PyPDF2

# Extract the text of every page of every paper
count = 0  # total number of pages read across all papers
all_texts_of_paper = []

for file in full_file_path:
    with open(file, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        file_count_all_text = ''
        
        for page in reader.pages:
            count += 1
            # guard against an empty or missing result, e.g. for image-only pages
            file_count_all_text = file_count_all_text + (page.extract_text() or '')

    # one entry per paper, holding that paper's complete text
    all_texts_of_paper.append(file_count_all_text)
            


In [6]:
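# function for cleaning the raw text: drops '/uniXXXX' residue and turns punctuation and line breaks into spaces
import re

def clean_pdf_text(text):
    # remove '/uniXXXX' glyph-name residue left by PDF extraction
    text = re.sub(r"/uni[A-Za-z\d]+", " ", text)

    # replace every run of non-alphanumeric characters (punctuation, hyphens, newlines)
    # with one space; a word split as 'optimiza-\ntion' therefore becomes 'optimiza tion'
    return re.sub(r"[^0-9a-zA-Z]+"," ", text)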
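
The second substitution in `clean_pdf_text` removes all punctuation and leaves words that were hyphenated across a line break split in two. A sketch of a gentler variant is shown below, assuming punctuation should be kept and hyphenated line breaks rejoined; `clean_pdf_text_keep_words` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re

def clean_pdf_text_keep_words(text: str) -> str:
    """Hypothetical variant of clean_pdf_text that rejoins hyphenated line breaks."""
    # Drop '/uniXXXX' glyph-name residue left by PDF extraction.
    text = re.sub(r"/uni[A-Za-z\d]+", " ", text)
    # Rejoin words split across a line break: 'optimiza-\ntion' -> 'optimization'.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse all remaining whitespace (including newlines) to single spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = "Gradient optimiza-\ntion is /uni00A0 stable,\nsee Sec. 3."
cleaned = clean_pdf_text_keep_words(sample)
print(cleaned)
```

The trade-off is that a genuine compound broken at a line end (such as `long-` / `term`) is also joined into one word. Swapping this in would change the chunk contents and therefore the outputs shown later.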
    

In [7]:
# will contain the all the cleaned texts of all papers
all_clean_raw_text = []

In [8]:

for raw_text in all_texts_of_paper:
    all_clean_raw_text.append(clean_pdf_text(raw_text))

**Chunking the cleaned texts in all_clean_raw_text**

In [9]:
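list_all_chunks = []

# chunking function: fixed-size character windows; consecutive chunks share overlap_size characters
def chunking_text(input_str, overlap_size, chunk_size):
    
    step = chunk_size - overlap_size
    if step <= 0:
        raise ValueError("chunk_size must be larger than overlap_size")
    
    for start in range(0, len(input_str), step):
        if start + chunk_size <= len(input_str):
            yield(input_str[start:start + chunk_size])
        else:
            # final chunk: anchor it at the end of the text; max() keeps the start
            # index non-negative when the text is shorter than chunk_size
            yield(input_str[max(0, len(input_str) - chunk_size):])
            break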

for text in all_clean_raw_text:
    for chunks in chunking_text(text, overlap_size = 200, chunk_size = 1000):
        list_all_chunks.append(chunks)
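
With `chunk_size = 1000` and `overlap_size = 200`, the window advances 800 characters at a time, and the last chunk is re-anchored so that it ends exactly at the end of the text. The sketch below reproduces only the start offsets to make that visible; `chunk_starts` is a hypothetical helper for illustration.

```python
def chunk_starts(text_len: int, chunk_size: int, overlap_size: int) -> list[int]:
    """Start offsets of the chunks a sliding window with overlap would produce."""
    step = chunk_size - overlap_size
    starts = []
    for start in range(0, text_len, step):
        if start + chunk_size <= text_len:
            starts.append(start)
        else:
            # Last window: re-anchor so it ends exactly at the end of the text.
            starts.append(max(0, text_len - chunk_size))
            break
    return starts

starts = chunk_starts(text_len=2500, chunk_size=1000, overlap_size=200)
print(starts)
```

For a 2500-character text the last chunk starts at 1500 rather than 1600, so it overlaps its predecessor by 300 characters instead of 200. The final chunk can therefore share more than `overlap_size` characters with the one before it.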
    

In [10]:
# print(list_all_chunks)

**Phase 3: Embedding & Vector Store**

In this phase, each text chunk is converted into a numerical vector (an embedding) and stored in an index that supports similarity search.

In [10]:
from google import genai
from google.api_core import retry

client = genai.Client(api_key=GOOGLE_API_KEY)

# Retry when the API reports rate limiting (429) or temporary unavailability (503)
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

@retry.Retry(predicate=is_retriable, timeout=300.0)
def embed_fn(text: str) -> list[float]:
    # embed a single string and return its vector
    result = client.models.embed_content(
            model="gemini-embedding-001",
            contents=text)
    
    return result.embeddings[0].values
    

In [11]:
import numpy as np

numerical_embeddings_list = []

for chunks in list_all_chunks:
    numerical_embeddings_list.append(embed_fn(chunks))

# FAISS stores vectors as float32, so build the array in that dtype
numerical_list_numpy_arr = np.array(numerical_embeddings_list, dtype="float32")

In [12]:
print(numerical_list_numpy_arr.shape)

(314, 3072)


**Using the FAISS library for embedding storage**

In [13]:
# build an exact (brute-force) L2 index over the chunk embeddings
import faiss

d = numerical_list_numpy_arr.shape[1]
index = faiss.IndexFlatL2(d)
index.add(numerical_list_numpy_arr)
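
`IndexFlatL2` performs an exact, brute-force search and reports squared L2 distances. For intuition, the same computation can be sketched in plain NumPy; `flat_l2_search` is a hypothetical stand-in written for this illustration, not a FAISS function, and the vectors are made-up 2-dimensional toy data.

```python
import numpy as np

def flat_l2_search(vectors: np.ndarray, queries: np.ndarray, k: int):
    """Brute-force top-k search by squared L2 distance (NumPy stand-in)."""
    # Pairwise squared distances, shape (num_queries, num_vectors).
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Indices of the k smallest distances per query, nearest first.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")
D, I = flat_l2_search(vectors, queries, k=2)
print(I[0])
```

The returned pair mirrors the `D, I` convention of `index.search`: one row per query, with distances in `D` and the matching row numbers in `I`.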


**Construct the Prompt for the LLM:**

In [14]:

def prompt(user_query, list_all_chunks):
    return f"""You are a highly intelligent AI research assistant. Your task is to answer the user's question truthfully and concisely,
    *only* using the information provided in the 'Context' sections below.
    If the answer cannot be found in the provided context, state that you do not have enough information to answer from the given documents.
    Do not use any outside knowledge.
    --- Context ---
    Retrieved Document 1
    {list_all_chunks[0]}
    Retrieved Document 2
    {list_all_chunks[1]}
    Retrieved Document 3
    {list_all_chunks[2]}
    Retrieved Document 4
    {list_all_chunks[3]}
    Retrieved Document 5
    {list_all_chunks[4]}
    --- END Context ---
    
    User Question: {user_query}
    
    """

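The `prompt` function above indexes exactly five chunks, so it raises an `IndexError` if fewer are passed in. A sketch of a variant that accepts any number of chunks follows; `build_prompt` is a hypothetical name, and the instruction wording is lightly condensed from the original.

```python
def build_prompt(user_query: str, retrieved_chunks: list[str]) -> str:
    """Hypothetical variant of prompt() that accepts any number of chunks."""
    context = "\n".join(
        f"Retrieved Document {n}\n{chunk}"
        for n, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "You are a highly intelligent AI research assistant. Answer the user's "
        "question truthfully and concisely, *only* using the information in the "
        "'Context' section below.\n"
        "If the answer cannot be found in the context, state that you do not have "
        "enough information to answer from the given documents.\n"
        "Do not use any outside knowledge.\n"
        f"--- Context ---\n{context}\n--- END Context ---\n\n"
        f"User Question: {user_query}\n"
    )

example = build_prompt("What is long-term memory?", ["chunk A", "chunk B"])
print(example)
```
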
In [15]:
# method to print the AI response

from IPython.display import Markdown, display

def print_AI_response(list_chunk, res):
    display(Markdown(f"""
**AI Assistant's Answer:**  
{res.text}

---

**Source Used:**  
**Chunk 1**  
{list_chunk[0]}

**Chunk 2**  
{list_chunk[1]}

**Chunk 3**  
{list_chunk[2]}

**Chunk 4**  
{list_chunk[3]}

**Chunk 5**  
{list_chunk[4]}
"""))


**Phase 4: Retrieval Augmented Generation**

In [16]:
def rag_function(list_all_chunk, user_query_text):
    # embed the query with the same model that embedded the chunks
    user_query_embedding = embed_fn(user_query_text)

    # print(user_query_embedding)
    
    user_embed_np_arr = np.array(user_query_embedding)
    
    user_embed_np_arr_2d = np.reshape(user_embed_np_arr, (1, -1))
    
    #print the shape of 2D numpy array
    
    print(user_embed_np_arr_2d.shape)
    
    # this is the number of similar chunks we want from our papers
    k = 5
    D, I = index.search(user_embed_np_arr_2d, k)

    print(I[0])
    
    #loop through I[0] and get the chunks from list_all_chunks
    list_of_chunks_userqr = []
    for i in I[0]:
        list_of_chunks_userqr.append(list_all_chunk[i])

    #printing the list of chunks to see the data
    # print(list_of_chunks_userqr)

    #generating response from model based on prompt

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt(user_query_text, list_of_chunks_userqr)
    )

    return list_of_chunks_userqr, response
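
`index.search` returns both a distance array and an index array, but `rag_function` uses only the indices. The sketch below pairs each retrieved chunk with its distance and skips the `-1` placeholders that FAISS returns when fewer than `k` vectors are indexed. It accepts any object with a `search(x, k)` method; `retrieve_with_distances` and `_ToyIndex` are hypothetical names, and the toy index exists only so the example runs without FAISS.

```python
import numpy as np

def retrieve_with_distances(index, chunks, query_vec, k=5):
    """Return (distance, chunk) pairs for the k nearest chunks, nearest first."""
    query_2d = np.asarray(query_vec, dtype="float32").reshape(1, -1)
    distances, ids = index.search(query_2d, k)
    # FAISS pads the id array with -1 when fewer than k vectors are available.
    return [(float(d), chunks[i]) for d, i in zip(distances[0], ids[0]) if i != -1]

class _ToyIndex:
    """Stand-in with the same search(x, k) shape contract as a flat L2 index."""
    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype="float32")

    def search(self, x, k):
        sq = ((x[:, None, :] - self.vectors[None, :, :]) ** 2).sum(axis=2)
        order = np.argsort(sq, axis=1)[:, :k]
        dists = np.take_along_axis(sq, order, axis=1)
        # Pad ids to k columns with -1; the distance padding value is arbitrary here.
        pad = k - order.shape[1]
        if pad > 0:
            order = np.pad(order, ((0, 0), (0, pad)), constant_values=-1)
            dists = np.pad(dists, ((0, 0), (0, pad)), constant_values=np.inf)
        return dists, order

toy_chunks = ["memory systems", "vehicle control"]
toy_index = _ToyIndex([[1.0, 0.0], [0.0, 1.0]])
hits = retrieve_with_distances(toy_index, toy_chunks, [0.9, 0.1], k=5)
print(hits)
```

Keeping the distances makes it possible to inspect how close the retrieved chunks really are, which is useful for queries the papers do not cover.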
    
    

In [17]:
# output response from the LLM
list_chunks, response = rag_function(list_all_chunks, input("What do you want to ask from the papers?"))
print_AI_response(list_chunks, response)

What do you want to ask from the papers? long term memory


(1, 3072)
[ 64 234 224  65  82]



**AI Assistant's Answer:**  
Some researchers view conversational memory mechanisms as a form of self-evolution for large language models (LLMs). Early methods involved concatenating full dialogue histories into prompts to preserve context, but these were limited by the model's context window and unsuitable for long-term interactions.

More efficient memory systems developed to address this include:
*   **ReadAgent** (Lee et al. 2024): Summarizes long texts for on-demand retrieval, improving context utilization.
*   **Memory Bank** (Zhong et al. 2024): Uses vector-based similarity search to enhance storage and access, though scalability remains an issue.
*   **A MEM** (Xu et al. 2025): Constructs an evolving knowledge graph, improving organization at the cost of structural complexity.

These methods enhance user alignment through few-shot prompts that combine past interactions with current inputs. However, they operate externally and do not improve the LLM's internal cognition, which remains fixed and non-evolving. "Memory focused baselines" are also used in evaluations on long-term dialogue tasks.

---

**Source Used:**  
**Chunk 1**  
a Dong et al 2023 It improves the model s understanding of professional terminology and context boosting accuracy and robustness in specialized tasks Chen et al 2020 How ever SFT requires extensive manual data collection and in tervention during fine tuning leading to high resource costs Thus an automated framework for efficient data acquisition and fine tuning is urgently needed to reduce these costs Long Term Memory Some researchers view conversa tional memory mechanisms as a form of self evolution for large language models LLMs Yi et al 2024 Early meth ods concatenated full dialogue histories into prompts to pre serve context but were limited by the model s context win dow and unsuitable for long term interactions To address this more efficient memory systems have been developed ReadAgent Lee et al 2024 summarizes long texts for on demand retrieval improving context utilization Memory Bank Zhong et al 2024 uses vector based similarity search to enhance storage and access though scal

**Chunk 2**  
reasoning VLM UDMC retrieves dialogue examples D Hi Ai Nd i 1from memory storage where Hi H1 i H2 i is a historical human message and Ai A1 i A2 i is the corresponding response from the foundation models These examples are selected via similarity matching to the current scenario with their specific embeddings evand etgenerated by CLIP a model trained on a vast amount of image text pairs from the internet 40 Therefore incorporating the two steps of foundation mod els the complete context generation process can be formalized as C tm S H tm Retrieve M H tm 23 where Retrieve is a function that selects relevant historical examples from memory Mbased on similarity evaluation to the current human message H t The VLM processes this context to generate an updated attention dictionary A t 9 TABLE II COMPARISON OF DRIVING PERFORMANCE IN URBAN DRIVING SCENARIOS Method Scenario Comp Time ms Col TRV IB TTC Alarm Duration s Travel Time s Autopilot 12 rule based ML ACC 10 61 3 57 0 0 0 0 10 0 4 26 6 R

**Chunk 3**  
regulate the final decision format before updating P the slow sub system retrieves relevant memory itemsM m1 m2 m M which contains similar scene embeddings and the relevant responses and generating a new OCP formulation via foundation model inference P tm FM C tm Retrieve M C tm 17 where C tm is the multimodal context including the RGB images from the cameras mounted on the autonomous vehicle and the supplementary text describing the task and the re sponse format of the VLM More detailed descriptions about the prompt can be found in Section IV B Between updates at a time step tk the fast sub system reuses the insights from P tm to solve the OCP ensuring continuity P tk P tm tk t tm t t m 1 t R 18 3 Adaptive OCP Evolution The key insight is that the OCP s cost function and constraints evolve only when the slow Multimodal Info RGB Image Textual Clues T1 T2 AI Response A1 A2 Image Embedding ev Texts Embedding etExpert Memory P Alive Memory QContext Construction C1 t S1 H1 D1 VLM Analysis 

**Chunk 4**  
nt Lee et al 2024 summarizes long texts for on demand retrieval improving context utilization Memory Bank Zhong et al 2024 uses vector based similarity search to enhance storage and access though scalability remains an issue A MEM Xu et al 2025 constructs an evolv ing knowledge graph improving organization at the cost of structural complexity While differing in implementa tion these methods all enhance user alignment via few shot prompts that combine past interactions with current inputs Yao et al 2024 However they operate externally and do not improve the LLM s internal cognition which remains fixed and non evolving Methodology This section details the architecture and mechanisms of DPSE including the Censor module for satisfaction esti mation and topic classification the dual phase data aug mentation strategies and the two phase fine tuning process Figure 2 provides an overview of the framework Addi tionally DPSE mitigates hallucinations during topic classi fication and dataset expan

**Chunk 5**  
emoryBank MB and A MEM AM Both SFT and PO optimize model weights post training SFT improves domain knowledge while PO aligns with user preferences DPSE unifies these through dual phase self evolution We conduct unified comparisons against SFT PO to assess joint optimization and separate comparisons with memory based methods which rely on external retrieval rather than parameter updates to contrast endogenous parameter tuning and exogenous memory re trieval evolution Dataset and Evaluation Metrics Following prior work we evaluate the DPSE framework against SFT and prefer ence optimization PO baselines on general NLP tasks and against memory focused baselines on long term dialogue tasks For general NLP we use AlpacaEval 2 0 Dubois et al 2025 and MT Bench Zheng et al 2023 and for long termdialogue the LoCoMo dataset Maharana et al 2024 Al pacaEval 2 0 contains 805 instruction following questions from five sources and uses GPT 4 Turbo as an automatic judge to compare model responses agains


**Generate content using Gemini**

# **Output & Demonstration (And Verification)**

**Test with Different Queries**

**Test Case 1 (Answerable)**

In [26]:
# a query whose answer is contained in the indexed papers
list_chunks, response = rag_function(list_all_chunks, input("What do you want to ask from the papers?"))
print_AI_response(list_chunks, response)

What do you want to ask from the papers? what is llm


(1, 3072)
[110 111 135 116 128]



**AI Assistant's Answer:**  
An LLM (Language Model) is a type of artificial intelligence model designed to understand and generate human language with remarkable proficiency. They are trained on massive datasets of text, enabling them to grasp linguistic patterns, structure, and subtle nuances, and typically consist of billions of parameters. Modern LLMs are primarily built on the Transformer architecture and aim to predict the next token in a sequence given the preceding tokens. Most LLMs today are decoder-based, designed for text generation tasks.

---

**Source Used:**  
**Chunk 1**  
e Language Model LLM is a type of artificial intelligence model designed to understand and generate hu man language with remarkable proficiency These models are trained on massive datasets of text enabling them to grasp linguistic patterns structure and even subtle nuances Typi cally LLMs consist of billions of parameters Modern LLMs are primarily built on the Transformer architecture a deep learning framework that uses the attention mechanism This architecture has become especially prominent since Google introduced BERT in 2018 There are three main types of Transformer models 1 Encoders These process input data like text and produce dense representations or embeddings 2 Decoders These generate new tokens sequentially predicting one token at a time to form coherent output 3 Encoder Decoder Seq2Seq This setup first uses an encoder to process the input se quence into a contextual representation which a decoder then uses to generate an output sequence Although Transformers come in various

**Chunk 2**  
 Seq2Seq This setup first uses an encoder to process the input se quence into a contextual representation which a decoder then uses to generate an output sequence Although Transformers come in various forms most LLMs today are decoder based designed for text generation tasks and composed of billions of parameters The core concept behind LLMs is straightforward yet powerful they aim to predict the next token in a sequence given the preceding tokens A token is the fundamental unit of information an LLM operates on While it s similar to a word LLMs typically use sub word tokens for efficiency Text generation remains the most common application of LLMs Given an initial input or prompt the model predicts one token at a time continuing the sequence until it reaches a specified length or encounters an end of sequence EOS token Considering their robust summarization and generation capabilties in automotive LLMs are adopted for various tasks within software development toolchain including re qu

**Chunk 3**  
dge LLM will determine the winning side of the debate Chen et al introduce ReConcile 39 a multiagent group discussion with confidence estimation In the course of their argument each debater defends their position and presents their argument in support of it They then offer their own estimation of their confidence As the participants continue to engage with each other s arguments the confidence scoreFeature Chat2Scenario 27 TARGET 28 LEADE 29 LeGEND 30 Source of Input Naturalistic driving datasetsTraffic rules natural language Real world traffic videosAccident reports natu ral language Scenario Understanding LLM extracts scenario elements and converts them into structured representations LLM parses rule based descriptions and maps them to structured DSL syntax LMM interprets vehicle behaviors from videos using optical flow anal ysis LLM1 extracts acci dent events into In teractive Pattern Se quences IPS Scenario Generation Method Get scenarios from datasets and converts to OpenSCENARIO 

**Chunk 4**  
uery b generate LLM generates an answer by using a prompt that combines the question and the retrieved data In the context of automotive software development RAG plays a pivotal role in addressing the increasing complexity ofregulatory compliance and RFQ Request for Quotation pro cessing As OEMs face mounting pressure to align technical specifications and manufacturing plans with frequently evolv ing standards such as Euro 7 emissions regulations manual tracking and analysis become error prone and time consuming By combining RAG with AI agent networks companies can automate the retrieval of relevant regulatory documents extract design related constraints and assess compliance dynamically 10 11 This approach not only reduces the risk of over looking critical requirements but also enables faster and more accurate responses to RFQs ultimately supporting more agile and regulation aware development cycles in the automotive domain D Vision Language Models VLMs Vision Language models are mult

**Chunk 5**  
standing of how LLMs and RAG systems contribute to the end to end process of regulation compliant scenario generation Chat2Scenario 27 proposes a pipeline that extracts concrete driving scenarios from natural istic datasets using GPT 4 focusing on criticality thresholds and translating the outputs into simulation ready formats like OpenSCENARIO TARGET 28 leverages GPT 4 to parse traffic rules written in natural language and converts them into a formal domain specific language DSL for structured test scenario generation LEADE 29 employs vision language models and traffic videos to reconstruct safety critical sce narios through behavior comparison between human drivers and autonomous systems Lastly LeGEND 30 introduces a top down approach for transforming textual accident reports into structured functional logical and concrete scenarios usinga two phase LLM based transformation 1 is a synthesized general pipeline that abstracts the key steps common across these solutions VII LLM BASED CO


**Test Case 2 (Unanswerable)**

In [27]:
# a query whose answer is not contained in the indexed papers
list_chunks, response = rag_function(list_all_chunks, input("What do you want to ask from the papers?"))
print_AI_response(list_chunks, response)

What do you want to ask from the papers? who is the president of germany?


(1, 3072)
[168  55 266 105 110]



**AI Assistant's Answer:**  
I do not have enough information to answer from the given documents.

---

**Source Used:**  
**Chunk 1**  
 systems Technical University of Munich Technical Report 2024 Online Available https mediatum ub tum de doc 1738462 1738462 pdf 17 A Abdalla H Pandey B Shomali J Schaub A M uller M Eisen barth and J Andert Generative artificial intelligence for model based graphical programming in automotive function development Novem ber 10 2024 available at SSRN https ssrn com abstract 5153452 or http dx doi org 10 2139 ssrn 5153452 18 A Schamschurko N Petrovic and A C Knoll RECSIP REpeated Clustering of Scores Improving the Precision 2025 conference paper accepted for IntelliSys2025 Online Available https arxiv org abs 2503 12108 19 V Zolfaghari N Petrovic F Pan K Lebioda and A Knoll Adopting rag for llm aided future vehicle design in 2024 2nd International Conference on Foundation and Large Language Models FLLM 2024 pp 437 442 20 Y Uygun and V Momodu Local large language models to simplify requirement engineering documents in the automotive industry Production Manufacturing Research vol 12 no 1 202

**Chunk 2**  
41 no 115 pp 64 68 2007 25 M Mondal C K Roy and K A Schneider A survey on clone refactoring and tracking Journal of Systems and Software vol 159 p 110429 2020 26 S Thakur B Ahmad Z Fan H Pearce B Tan R Karri B Dolan Gavitt and S Garg Benchmarking large lan guage models for automated verilog rtl code generation in2023 Design Automation Test in Europe Conference Exhibition DATE IEEE 2023 pp 1 6 27 K Chen J Li K Wang Y Du J Yu J Lu L Li J Qiu J Pan Y Huang et al Chemist X large language model empowered agent for reaction condition recommendation in chemical synthesis arXiv preprint arXiv 2311 10776 2023 28 P Khosla P Teterwak C Wang A Sarna Y Tian P Isola A Maschinot C Liu and D Krishnan Su pervised contrastive learning Advances in neural infor mation processing systems vol 33 pp 18 661 18 673 2020 29 J Aneja A Schwing J Kautz and A Vahdat A con trastive learning approach for training variational autoen coder priors Advances in neural information processing systems vol 34 pp 480 493 2021 

**Chunk 3**  
oceedings of the National Academy of Sciences vol 122 no 21 p e2401626122 2025 34 H Liu K Chen and J Ma Incremental learning based real time trajectory prediction for autonomous driving via sparse Gaussian process regression in 2024 IEEE Intelligent Vehicles Symposium pp 1 7 2024 35 A Zeng M Chen L Zhang and Q Xu Are Transformers effective for time series forecasting in Proceedings of the AAAI conference on artificial intelligence pp 11121 11128 2023 36 Q Ge Q Sun S E Li S Zheng W Wu and X Chen Numeri cally stable dynamic bicycle model for discrete time control in IEEE Intelligent Vehicles Symposium pp 128 134 2021 37 A Marafioti O Zohar M Farr e M Noyan E Bakouch P Cuenca C Zakka L B Allal A Lozhkov N Tazi V Srivastav J Lochner H Larcher M Morlon L Tunstall L von Werra and T Wolf SmolVLM Redefining small and efficient multimodal models arXiv preprint arXiv 2504 05299 2025 38 A Yang A Li B Yang B Zhang B Hui B Zheng B Yu C Gao C Huang C Lv et al Qwen3 technical report arXiv preprint ar

**Chunk 4**  
a survey outcome which was conducted among our automotive industry partners regarding the type of GenAI tools used for their daily work activities I I NTRODUCTION The breakthrough of Generative Artificial Intelligence GenAI especially Large Language Models LLMs in the last three years has a significant impact on many areas from everyday routines to industry and manufacturing LLMs exhibit strong capabilities when it comes to text summarization and generation and were found suitable for the automation of many human tasks This affects many industrial domains 1 among them the automotive industry 2 3 4 Usually the adoption of GenAI in the automotive and other industry domains target either to eliminate human intervention needed for repetitive tasks speed up complex processes and activities or even introduce newly added value by enabling novel use cases 2 The automotive industry is known for strict design devel opment testing and manufacturing procedures which need to be compliant with numer

**Chunk 5**  
e Language Model LLM is a type of artificial intelligence model designed to understand and generate hu man language with remarkable proficiency These models are trained on massive datasets of text enabling them to grasp linguistic patterns structure and even subtle nuances Typi cally LLMs consist of billions of parameters Modern LLMs are primarily built on the Transformer architecture a deep learning framework that uses the attention mechanism This architecture has become especially prominent since Google introduced BERT in 2018 There are three main types of Transformer models 1 Encoders These process input data like text and produce dense representations or embeddings 2 Decoders These generate new tokens sequentially predicting one token at a time to form coherent output 3 Encoder Decoder Seq2Seq This setup first uses an encoder to process the input se quence into a contextual representation which a decoder then uses to generate an output sequence Although Transformers come in various


# Capstone Project Summary: Retrieval-Augmented Generation (RAG) System
This project demonstrates a Retrieval-Augmented Generation (RAG) system for question-answering on a private dataset of academic papers. The primary goal was to address the limitations of large language models (LLMs) on specialized, domain-specific material by grounding each answer in retrieved text, which reduces the risk of "hallucination."

**This project successfully integrates several key GenAI and data science capabilities:**

**PDF Text Extraction:** 
We began by extracting raw text from a collection of PDF documents, forming the foundational knowledge base. This step highlights the importance of data preprocessing in handling real-world, unstructured data.

**Text Cleaning:**
A regex-based cleaning step removes glyph-name residue such as '/uniXXXX' and replaces punctuation and line breaks with spaces. This normalizes the extracted text before chunking, but it also strips all punctuation and leaves words that were hyphenated across line breaks split in two (for example, "hu man" in the retrieved chunks above).

**Embedding and Vectorization:**
We used the gemini-embedding-001 model to convert each of the 314 text chunks into a 3072-dimensional vector. These vectors represent the semantic content of the text, so chunks with similar meaning lie close together, which is what makes similarity search possible.

**Vector Database (FAISS):**
The chunk vectors were added to a FAISS (Facebook AI Similarity Search) `IndexFlatL2` index, an in-memory index that performs exact nearest-neighbor search by L2 distance. It serves as the system's retrieval store, returning the five closest chunks for any given query.

**Retrieval-Augmented Generation (RAG) Pipeline:** 
The core of the project is the RAG pipeline. For each user query, the system embeds the query, searches the index for the five most similar chunks, and passes them to the Gemini model (gemini-2.5-flash) as context, with an instruction to answer only from that context. In the tests above, the model answered the questions covered by the papers and declined to answer the one that was not.