# Retrieval-Augmented Generation (RAG) Demo Project

## Overview

This project demonstrates a **Retrieval-Augmented Generation (RAG)** pipeline that uses **LlamaIndex** and **OpenAI's GPT models** to index and search multiple PDF files. Given a query, the system retrieves the most relevant passages from the index and answers based on that content.

## Key Features

- **PDF Parsing**: Extracts text content from PDF files.
- **Indexing**: Creates a vector-based index using LlamaIndex for efficient search and retrieval.
- **Query Processing**: Leverages OpenAI's GPT models to respond to user queries in natural language.
- **Customizable**: Can be extended to other file types and to custom prompts.
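
To make the query-processing feature more concrete, here is a minimal sketch of how retrieved passages might be combined with a question before being sent to a language model. The function name `build_prompt` and the template wording are assumptions made for illustration; LlamaIndex ships its own prompt templates, which this does not reproduce.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and the user question into one prompt string."""
    # Number each passage so the answer can refer back to its source.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical passages standing in for what a retriever would return.
passages = [
    "YOLO frames object detection as a regression problem.",
    "A single neural network predicts bounding boxes and class probabilities.",
]
prompt = build_prompt("What is YOLO?", passages)
print(prompt)
```

Restricting the model to the supplied context is one common choice; a looser template would let it draw on its own background knowledge.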

## Workflow

1. **Text Extraction**: Use `PyPDF2` to extract text from PDF files.
2. **Index Creation**: Create a vector-based index of the extracted text using LlamaIndex.
3. **Query and Retrieval**: Input a natural language query; the system searches the index for the most relevant passages and generates a response grounded in them.
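
The retrieval step in the workflow above can be sketched with the standard library alone. This is a toy illustration, not how LlamaIndex or OpenAI embeddings work internally: the "embedding" here is a plain word count, and the names `embed`, `cosine`, and `top_k` are hypothetical helpers introduced for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "YOLO frames object detection as a regression problem.",
    "The network predicts bounding boxes and class probabilities.",
    "Install the dependencies with pip before running the notebook.",
]
results = top_k("How does YOLO predict bounding boxes?", chunks, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

A real pipeline replaces `embed` with a learned embedding model, so that related wording scores highly even without exact word overlap.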

## Technologies Used

- **LlamaIndex (formerly GPT Index)**: For building and querying the document index.
- **OpenAI GPT Models**: For natural language understanding and response generation.
- **PyPDF2**: For extracting text from PDF files.
- **Python**: Primary programming language for implementation.

## Project Structure




## How to Run

1. Clone the repository and navigate to the project directory.
2. Install the required dependencies:
   ```bash
   pip install -r requirements.txt
   ```


In [1]:
import os
from dotenv import load_dotenv
load_dotenv()


True

In [2]:
# Retrieve the API key
api_key = os.getenv('OPENAI_API_KEY')

# Verify if the key is loaded
if not api_key:
    raise ValueError("API key not found. Ensure it is set in the .env file.")

# Set it explicitly as an environment variable (optional)
os.environ['OPENAI_API_KEY'] = api_key

In [3]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()

In [4]:
index = VectorStoreIndex.from_documents(documents, show_progress=True)


Parsing nodes: 100%|██████████| 21/21 [00:00<00:00, 364.48it/s]
Generating embeddings: 100%|██████████| 32/32 [00:01<00:00, 18.23it/s]
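
The progress output above reports 21 items parsed but 32 embeddings generated, which is consistent with some loaded documents being split into more than one chunk before embedding. The sketch below shows the general idea with a simple word-window splitter. The function `chunk_words` and its sizes are assumptions chosen for illustration; LlamaIndex's own splitter is more sophisticated and is not reproduced here.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Two hypothetical "pages": a long one that is split and a short one that is not.
pages = [
    "one two three four five six seven eight nine ten eleven twelve",
    "alpha beta gamma",
]
nodes = [chunk for page in pages for chunk in chunk_words(page)]
print(len(pages), "pages ->", len(nodes), "chunks")
```

The overlap repeats a few words across neighbouring chunks so that a sentence cut at a boundary still appears whole in at least one of them.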


In [5]:
index

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x24fbd88e5c0>

In [6]:
query_engine = index.as_query_engine()

In [25]:
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
# Imported for optional score-based filtering; not used in this cell.
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.response.pprint_utils import pprint_response

# Retrieve the 4 most similar nodes for each query.
retriever = VectorIndexRetriever(index=index, similarity_top_k=4)
query_engine = RetrieverQueryEngine(retriever=retriever)

response = query_engine.query('What is YOLO?')
# Print the answer followed by the source nodes it was built from.
pprint_response(response, show_source=True)



Final Response: YOLO is a new approach to object detection that frames
object detection as a regression problem to spatially separated
bounding boxes and associated class probabilities. It uses a single
neural network to predict bounding boxes and class probabilities
directly from full images in one evaluation. YOLO is known for its
speed, processing images in real-time and achieving high mean average
precision compared to other real-time detectors.
______________________________________________________________________
Source Node 1/4
Node ID: c58eb4b7-896f-4af3-9ea8-07b7653f1e8b
Similarity: 0.8273606121232412
Text: You Only Look Once: Uniﬁed, Real-Time Object Detection Joseph
Redmon∗, Santosh Divvala∗†, Ross Girshick¶, Ali Farhadi∗† University
of Washington∗, Allen Institute for AI†, Facebook AI Research¶
http://pjreddie.com/yolo/ Abstract We present YOLO, a new approach to
object detection. Prior work on object detection repurposes classiﬁers
to per- form...
_________________________
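
The cell above imports `SimilarityPostprocessor` but does not use it. The general idea behind a similarity cutoff is to drop retrieved chunks whose score falls below a threshold before they reach the model. The sketch below illustrates this in plain Python. `ScoredChunk`, `apply_cutoff`, and the scores are made up for this example and are not the LlamaIndex class or its output.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float  # similarity between the chunk and the query


def apply_cutoff(chunks: list[ScoredChunk], cutoff: float) -> list[ScoredChunk]:
    """Keep only chunks whose similarity score is at least `cutoff`."""
    return [c for c in chunks if c.score >= cutoff]


# Illustrative scores only; they do not come from the notebook run.
retrieved = [
    ScoredChunk("chunk A", 0.83),
    ScoredChunk("chunk B", 0.81),
    ScoredChunk("chunk C", 0.74),
    ScoredChunk("chunk D", 0.61),
]
kept = apply_cutoff(retrieved, cutoff=0.80)
print([c.text for c in kept])
```

A cutoff trades recall for precision: a higher threshold passes fewer, more relevant chunks, and can leave the model with no context at all if set too high.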

In [23]:
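# The execution count (23) is lower than that of the cell above (25), so this
# cell presumably ran with the default engine from `index.as_query_engine()`;
# its output lists 2 source nodes instead of 4.
from llama_index.core.response.pprint_utils import pprint_response

response = query_engine.query('What is YOLO?')
pprint_response(response, show_source=True)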


Final Response: YOLO is a new approach to object detection that frames
object detection as a regression problem to spatially separated
bounding boxes and associated class probabilities. It uses a single
neural network to predict bounding boxes and class probabilities
directly from full images in one evaluation, optimizing the whole
detection pipeline end-to-end for improved performance.
______________________________________________________________________
Source Node 1/2
Node ID: c58eb4b7-896f-4af3-9ea8-07b7653f1e8b
Similarity: 0.8273606121232412
Text: You Only Look Once: Uniﬁed, Real-Time Object Detection Joseph
Redmon∗, Santosh Divvala∗†, Ross Girshick¶, Ali Farhadi∗† University
of Washington∗, Allen Institute for AI†, Facebook AI Research¶
http://pjreddie.com/yolo/ Abstract We present YOLO, a new approach to
object detection. Prior work on object detection repurposes classiﬁers
to per- form...
______________________________________________________________________
Source Node 2/2
No