# RAG Application - POC

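This notebook demonstrates how a Retrieval-Augmented Generation (RAG) application works: it loads a PDF into a vector database, retrieves the chunks most relevant to a query, and passes them to a local LLM as context.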

### Enter your query

In [1]:
query_text = input("Ask anything about our final year project: ")
query_text

'WHat is this project about ?'

### Import LangChain libraries

In [2]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama

### Import utility functions

In [3]:
from rag.keyword_generator import extract_keywords
from rag.db import get_db_collection, add_to_collection, query_collection

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\impostor\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\impostor\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\impostor\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


Chroma DB connected


Embedding function loaded


## Load the PDF document into the vector database

In [4]:
file_path = "docs/project-report.pdf"
loader = PyPDFLoader(file_path)
document = loader.load()
print("No. of pages in the document:", len(document))

No. of pages in the document: 23


#### Split pages into chunks of text

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunked_documents = text_splitter.split_documents(document)
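
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping character windows. It is a simplified illustration, not how `RecursiveCharacterTextSplitter` is implemented: the real splitter also tries to break on separators such as paragraphs and sentences. The function name `sliding_chunks` and the sample text are assumptions made for this example.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.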

#### Prepare data for indexing
- Generate a unique ID for each chunk
- Generate keywords for the metadata using NLP

In [6]:
contents = []
ids = []
keywords = []

page_no = None  # page of the previous chunk
c_index = 0     # position of the chunk within its page
for doc in chunked_documents:
    metadata = doc.metadata
    # Make the file path safe to embed in an ID, e.g. "docs-project-report-pdf".
    source = metadata['source'].replace('/', '-').replace('.', '-')

    # Restart the chunk counter whenever the page changes.
    if metadata['page'] != page_no:
        c_index = 0
    else:
        c_index += 1

    page_no = metadata['page']

    chunk_id = f"{source}-p{page_no}-c{c_index}"

    contents.append(doc.page_content)
    ids.append(chunk_id)
    keywords.append(extract_keywords(doc.page_content))
    print("Processed chunk:", chunk_id)

Processed chunk: docs-project-report-pdf-p0-c0
Processed chunk: docs-project-report-pdf-p1-c0
Processed chunk: docs-project-report-pdf-p1-c1
Processed chunk: docs-project-report-pdf-p2-c0
Processed chunk: docs-project-report-pdf-p2-c1
Processed chunk: docs-project-report-pdf-p3-c0
Processed chunk: docs-project-report-pdf-p4-c0
Processed chunk: docs-project-report-pdf-p5-c0
Processed chunk: docs-project-report-pdf-p6-c0
Processed chunk: docs-project-report-pdf-p6-c1
Processed chunk: docs-project-report-pdf-p6-c2
Processed chunk: docs-project-report-pdf-p7-c0
Processed chunk: docs-project-report-pdf-p7-c1
Processed chunk: docs-project-report-pdf-p7-c2
Processed chunk: docs-project-report-pdf-p8-c0
Processed chunk: docs-project-report-pdf-p8-c1
Processed chunk: docs-project-report-pdf-p9-c0
Processed chunk: docs-project-report-pdf-p10-c0
Processed chunk: docs-project-report-pdf-p10-c1
Processed chunk: docs-project-report-pdf-p10-c2
Processed chunk: docs-project-report-pdf-p11-c0
Processed
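
The `extract_keywords` helper imported from `rag.keyword_generator` is not shown in this notebook. As a rough idea of what such a helper could do, the sketch below picks the most frequent non-stopword tokens in a chunk. Everything here is an assumption: the name `extract_keywords_sketch`, the tiny stopword set, and the frequency-based ranking are illustrative and may differ from the actual helper, which appears to rely on NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real helper would likely use NLTK's list.
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "on", "for", "to",
    "is", "are", "with", "by", "this", "that", "it",
}


def extract_keywords_sketch(text: str, top_n: int = 5) -> list[str]:
    """Return the top_n most frequent lowercase tokens that are not stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    candidates = [t for t in tokens if t not in STOPWORDS and len(t) > 1]
    return [word for word, _ in Counter(candidates).most_common(top_n)]


sample = "The machine dispenses the product and checks the expiry date of the product."
sample_keywords = extract_keywords_sketch(sample, top_n=3)
print(sample_keywords)
```

Keywords produced this way can be joined into a single string and stored as chunk metadata, which is how the `tags` field is built in the next cell.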

### Create a collection in Chroma DB

In [7]:
COLLECTION_NAME = "my_project"
collection = get_db_collection(COLLECTION_NAME)

metadata = [{"tags": ", ".join(i) } for i in keywords]
add_to_collection(collection, contents, ids, metadata)

Add of existing embedding ID: docs-project-report-pdf-p0-c0
Add of existing embedding ID: docs-project-report-pdf-p1-c0
Add of existing embedding ID: docs-project-report-pdf-p1-c1
Add of existing embedding ID: docs-project-report-pdf-p2-c0
Add of existing embedding ID: docs-project-report-pdf-p2-c1
Add of existing embedding ID: docs-project-report-pdf-p3-c0
Add of existing embedding ID: docs-project-report-pdf-p4-c0
Add of existing embedding ID: docs-project-report-pdf-p5-c0
Add of existing embedding ID: docs-project-report-pdf-p6-c0
Add of existing embedding ID: docs-project-report-pdf-p6-c1
Add of existing embedding ID: docs-project-report-pdf-p6-c2
Add of existing embedding ID: docs-project-report-pdf-p7-c0
Add of existing embedding ID: docs-project-report-pdf-p7-c1
Add of existing embedding ID: docs-project-report-pdf-p7-c2
Add of existing embedding ID: docs-project-report-pdf-p8-c0
Add of existing embedding ID: docs-project-report-pdf-p8-c1
Add of existing embedding ID: docs-proje

Documents loaded to DB
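
The `Add of existing embedding ID` messages suggest that the collection already held these chunks from an earlier run. One way to avoid re-adding them is to filter the batch before calling `add_to_collection`. The sketch below is an assumption, not part of the `rag.db` module: `drop_existing` is a hypothetical helper, and the set of stored IDs is passed in explicitly (with Chroma it could plausibly be obtained from `collection.get(ids=ids)["ids"]`).

```python
def drop_existing(existing_ids, ids, contents, metadata):
    """Return only the (id, content, metadata) entries whose id is new."""
    seen = set(existing_ids)
    new_ids, new_contents, new_metadata = [], [], []
    for chunk_id, content, meta in zip(ids, contents, metadata):
        if chunk_id in seen:
            continue  # already stored, or duplicated within this batch
        seen.add(chunk_id)
        new_ids.append(chunk_id)
        new_contents.append(content)
        new_metadata.append(meta)
    return new_ids, new_contents, new_metadata


# Hypothetical data shaped like the lists built earlier in the notebook.
existing = ["docs-project-report-pdf-p0-c0"]
batch_ids = ["docs-project-report-pdf-p0-c0", "docs-project-report-pdf-p1-c0"]
batch_contents = ["title page text", "abstract text"]
batch_metadata = [{"tags": "engineering"}, {"tags": "machine"}]

new_ids, new_contents, new_metadata = drop_existing(
    existing, batch_ids, batch_contents, batch_metadata
)
print(new_ids)
```

If the chunk text may change between runs, overwriting by ID (Chroma collections offer an `upsert` method for this) is likely a better fit than skipping existing IDs.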


### Chunks retrieved from the DB

In [8]:
query_result = query_collection(collection, query_text)
query_result

{'ids': [['docs-project-report-pdf-p0-c0',
   'docs-project-report-pdf-p10-c1',
   'docs-project-report-pdf-p6-c1']],
 'distances': [[0.4867134690284729, 0.49514859914779663, 0.5069766640663147]],
 'metadatas': [[{'tags': 'engineering, mechatronics, mt, department, mt16mt'},
   {'tags': 'machine, dispense, product, date, expiry'},
   {'tags': 'areas, available, drug, especially, first'}]],
 'embeddings': None,
 'documents': [['VISVESVARAYA TECHNOLOGICAL UNIVERSITY  \nBELAGAVI  \n  \n  \n  \nA  \nProject Report on  \n  \n  \nAUTOMATIC MEDICINE VENDING MACHINE  \n  \nIn partial fulfillment of the requirement for the award of the  \nBachelor Degree  \n  \nIn  \nMechatronics Engineering  \n  \nSubmitted  \n  \nby \n  \n  \nSanketh S Raj Jain  \n  \n  \nRakshith Shetty  \n  \n  \nRoyson Pais  \n  \nPrajwal Poojary  \n  \n  \n4 \nMT16MT  \n037 \n  \n  \n4 \nMT \n16 \nMT033  \n  \n  \n4 \nMT \n17 \nMT \n408 \n  \n  \n030 \n 4 \nMT16MT  \n  \n  \nUnder the Guidance  \n  \nof \n  \n     \nMr. S
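
The result is a dictionary of nested lists: the outer list has one entry per query text, and the inner lists hold the matching chunks in parallel (`ids`, `distances`, `metadatas`, `documents`). A smaller distance means a closer match. The sketch below flattens the first query's hits into a ranked list; `rank_hits` is a hypothetical helper and the sample data only mimics the shape of the output above, with placeholder document text.

```python
def rank_hits(result: dict) -> list[dict]:
    """Flatten the first query's hits into a list sorted by distance."""
    hits = [
        {"id": i, "distance": d, "tags": m.get("tags", ""), "text": t}
        for i, d, m, t in zip(
            result["ids"][0],
            result["distances"][0],
            result["metadatas"][0],
            result["documents"][0],
        )
    ]
    return sorted(hits, key=lambda h: h["distance"])


# Sample with the same shape as the output above; the text is a placeholder.
sample_result = {
    "ids": [["docs-project-report-pdf-p10-c1", "docs-project-report-pdf-p0-c0"]],
    "distances": [[0.4951, 0.4867]],
    "metadatas": [[{"tags": "machine, dispense"}, {"tags": "engineering, mechatronics"}]],
    "documents": [["<chunk text>", "<chunk text>"]],
}

ranked = rank_hits(sample_result)
for hit in ranked:
    print(f'{hit["distance"]:.3f}  {hit["id"]}  [{hit["tags"]}]')
```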

### Prepare the final prompt for the LLM

In [9]:
# Join the retrieved chunks into one context string, separated by blank lines.
text = "\n\n".join(
    chunk for chunks in query_result['documents'] for chunk in chunks
)

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you can't find the answer in the provided context, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
# Pass the context as a template variable so that any braces in the
# retrieved text are not parsed as placeholders.
final_prompt = prompt.format(context=text, input=query_text)
final_prompt

"System: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you can't find the answer in the provided context, say that you don't know. Use three sentences maximum and keep the answer concise.\n\nVISVESVARAYA TECHNOLOGICAL UNIVERSITY  \nBELAGAVI  \n  \n  \n  \nA  \nProject Report on  \n  \n  \nAUTOMATIC MEDICINE VENDING MACHINE  \n  \nIn partial fulfillment of the requirement for the award of the  \nBachelor Degree  \n  \nIn  \nMechatronics Engineering  \n  \nSubmitted  \n  \nby \n  \n  \nSanketh S Raj Jain  \n  \n  \nRakshith Shetty  \n  \n  \nRoyson Pais  \n  \nPrajwal Poojary  \n  \n  \n4 \nMT16MT  \n037 \n  \n  \n4 \nMT \n16 \nMT033  \n  \n  \n4 \nMT \n17 \nMT \n408 \n  \n  \n030 \n 4 \nMT16MT  \n  \n  \nUnder the Guidance  \n  \nof \n  \n     \nMr. Sathyanarayana  \n  \nHead of Mechatronics Department  \n  \n      \n   \nDEPARTMENT OF  \nMECHATRONICS ENGINEERING  \n  \nMangalore  \n  \nInstitute of Techn

### Connect to a local LLM (I'm using Phi-3 from Microsoft, served by Ollama)

In [10]:
llm = Ollama(
    model="phi3",
    keep_alive=-1,   # keep the model loaded in memory between calls
    format="json"    # ask Ollama to return the response as JSON
)

### Final output from the LLM using the context

In [11]:
llm.invoke(final_prompt)

'{"summary": "This project proposes an Automatic Medicine Vending Machine with expiry date features using a Finite State Machine (FSM) model. It contains four medicines available as first aid and without prescription, aiming to improve accessibility in developing countries like India."}'
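
Because the model was created with `format="json"`, the response is a JSON string, not plain text. A minimal sketch for pulling the answer out of it follows. It assumes nothing about the key name, since the model chooses it (here it happened to be `summary`); `extract_answer` is a hypothetical helper and the sample string is a shortened version of the output above.

```python
import json


def extract_answer(raw: str) -> str:
    """Return a plain-text answer from a JSON-formatted model response."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # not valid JSON; fall back to the raw text
    if isinstance(payload, dict):
        # The key name is chosen by the model, so take the first string value.
        for value in payload.values():
            if isinstance(value, str):
                return value
    return raw


raw_response = '{"summary": "This project proposes an Automatic Medicine Vending Machine."}'
answer = extract_answer(raw_response)
print(answer)
```

If a plain-text answer is all that is needed, dropping `format="json"` from the `Ollama` constructor is probably the simpler option.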

### Note: Output quality depends on the model as well as on the data. For higher-quality output, choose a stronger model such as [llama3.1](https://ollama.com/library/llama3.1); phi3 is a decent choice if your hardware is limited.