## Problem Statement

You have to build a Retrieval-Augmented Generation (RAG) model for a QA bot.

Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA) bot for a business. Use a vector database like Pinecone DB or Chroma DB and a generative model like Cohere API or OpenAI. The QA bot should be able to retrieve relevant information from a dataset and generate coherent answers.

# Install all the required dependencies

In [1]:
!pip install -q -U langchain-community langchain openai chromadb langchain-openai pypdf

# Import the necessary classes from the LangChain framework

In [2]:
# Loaders, vector stores, and OpenAI wrappers are imported from the split
# packages (langchain-community, langchain-openai, langchain-core) to avoid deprecation warnings.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_openai import OpenAIEmbeddings, OpenAI, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


import os
from getpass import getpass

## Workflow

![Alt text](image.png)

# Loading and reading the PDF documents from the PDF_Docs directory

In [4]:
loader = PyPDFDirectoryLoader("PDF_Docs")
documents = loader.load()




## Splitting the documents into small chunks so that the LLM can take them as context within its input token limit

In [5]:
text_splitter  = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
text_chunks = text_splitter.split_documents(documents)
len(text_chunks)

166

In [6]:
type(text_chunks[0].page_content)

str

In [7]:
text_chunks[0].page_content

'SQL in DS  DB Schema It represent how the data is organised & provides informa4on about the rela4onships between the tables in a given database. Table Schema: Represent metadata of a table. 1. A=ributes (columns) of table 2. Data type of a=ributes: i) Numeric (like student_id, age, salary, weight, height, etc.) ii) String Char/Var char (Name) iii) Date (Date/4mestamp) Char is ﬁxed length string and var char is a variable length string.  Primary Key : A column that can be used to uniquely iden4fy each row in the table. Constraint: Unique + Not Null \n Foreign Key: A column in a table that refers to the primary key in another table. Foreign key link together tables in a rela4onal database.'

In [8]:
len(text_chunks[0].page_content)

695

In [9]:
type(text_chunks[0])
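
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and sentences, which is why the first chunk above is 695 characters rather than exactly 1000. The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Slide a window of chunk_size characters over text.

    Consecutive windows share chunk_overlap characters, so a sentence cut at
    a chunk boundary still appears in full in one of the two chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical sample: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
print(chunks[0][-150:] == chunks[1][:150])
```

A larger overlap reduces the chance of losing context at chunk boundaries, at the cost of storing and embedding more duplicated text.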

## Using OpenAI embeddings to embed text, authenticated with an OpenAI API key

In [10]:
# Use getpass to securely input your OpenAI API key
api_key = getpass('Enter your OpenAI API key: ')

# Set it as an environment variable
os.environ['OPENAI_API_KEY'] = api_key

# Initialize OpenAI embeddings using the API key from environment variable
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv('OPENAI_API_KEY'))

# Example query
result = embeddings.embed_query("How are you!")

# Check the length of the result (embedding vector)
print(len(result))

Enter your OpenAI API key: ··········



1536
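
Each text is mapped to a 1536-dimensional vector, and texts with similar meaning should end up with vectors pointing in similar directions. A common way to compare two such vectors is cosine similarity. The sketch below uses tiny hand-written 3-dimensional vectors instead of real embeddings, so it runs without an API key; the vector values and names are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for embedding vectors (real ones have 1536 dimensions)
v_query = [1.0, 0.0, 1.0]
v_same_direction = [2.0, 0.0, 2.0]   # same direction, different length
v_orthogonal = [0.0, 3.0, 0.0]       # shares no component with the query

print(cosine_similarity(v_query, v_same_direction))
print(cosine_similarity(v_query, v_orthogonal))
```

Because cosine similarity ignores vector length, the first pair scores close to 1 even though the magnitudes differ, while the orthogonal pair scores 0.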


## Using the Chroma vector database to store the text chunks as embeddings

In [11]:
vectordb = Chroma.from_documents(documents=text_chunks, embedding=embeddings)

In [12]:
len(vectordb)

166

## Finding the top 2 most similar text chunks in the vector database

In [14]:
query = vectordb.similarity_search(query="What is the difference between positional encoding and layer normalization?", k=2)

In [15]:
query

[Document(metadata={'page': 25, 'source': 'PDF_Docs/Document1.pdf'}, page_content='When using batch normaliza7on on sequences with padding, the calculated mean and variance can be skewed because padding values distort the true representa7on of the data. Padding is added to make input sequences equal in length, but it introduces ar7ﬁcial values that aﬀect the accuracy of the mean and variance. To address this issue, layer normaliza7on is used instead. Layer normalisa7on normalizes across the features (or rows) within each individual sequence, rather than across the batch (or columns), ensuring that padding doesn’t interfere with the normalisa7on process and providing a more accurate representa7on of the data. Conclusion: Posi3onal Encoding The technique of posi7onal encoding is employed in transformer models to provide the model with informa7on regarding the word order in a sequence. Because self-a>en7on does not naturally follow a sequence, posi7onal encodings are appended to the word 

## Top 3 most similar documents/text from the database

In [18]:
query1 = "How much probability and statistic require to crack data science interview?"

results = vectordb.similarity_search(query1, k=3)

# Display results
for i, result in enumerate(results):
    print(f"Result {i+1}: {result.page_content}")
    print(f"Metadata: {result.metadata}")

Result 1: Top Interview Ques3ons for DS  In this ar4cle, we’ll explore the top ques4ons commonly asked in data science interviews, breaking them down into simple, easy-to-understand explana4ons. Q.1 Explain the bias-variance tradeoﬀ. 
Model Complexity Bias and variance are inversely propor4onal to each other. When bias increases variance decreases. Therefore, we need to ﬁnd a balance trade between the two. Q2. What is overﬁTng, and how can you prevent it?
Metadata: {'page': 9, 'source': 'PDF_Docs/Document2.pdf'}
Result 2: Sta$s$cs and Probability needed in DS  In this blog, we’ll provide a concise overview of the essen7al sta7s7cs and probability concepts needed for a data science role. Central limit Theorem : The distribu7on of sample means is Gaussian, no ma>er what the shape of the original distribu7on is. Assump7ons: popula7on mean and standard devia7on should be ﬁnite and sample size >=30  How to prove CLT in python: sample_30 = [df['income'].sample(30).mean() for i in range(10000
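
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force search over a small in-memory list, ranking by squared Euclidean distance. This is an assumption for illustration: Chroma uses an approximate nearest-neighbour index, and its distance metric depends on how the collection is configured. The names `top_k` and `store` and all vector values are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(squared_l2(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0])  # smaller distance = more similar
    return scored[:k]


# Toy vector store: (chunk text, made-up 3-dimensional embedding)
store = [
    ("chunk about SQL schemas", [0.9, 0.1, 0.0]),
    ("chunk about layer normalization", [0.1, 0.9, 0.2]),
    ("chunk about positional encoding", [0.0, 0.8, 0.6]),
]
query_vec = [0.0, 1.0, 0.3]  # made-up embedding of the user's question

results = top_k(query_vec, store, k=2)
for rank, (distance, text) in enumerate(results, start=1):
    print(f"Result {rank}: {text} (distance={distance:.2f})")
```

A brute-force scan like this costs one distance computation per stored chunk, which is fine for 166 chunks but is the reason vector databases use specialised indexes for large collections.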

## Creating a Retrieval-Augmented Generation (RAG) pipeline that answers questions using a large language model (LLM) and a vector database to retrieve relevant text chunks

In [17]:
rag_chain = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectordb.as_retriever())
rag_chain.run(query1)



' This document seems to be discussing the use of LSTM encoder and decoder architecture for text summarization, specifically focusing on the self-attention mechanism for representing words and finding similarity between them using dot product. It also mentions the use of linear transformation with different matrices during the training process.'

The pipeline above combines the following components:

- LLM (OpenAI): Generates the final response.
- Vector Database (vectordb): Stores the text embeddings and retrieves them by similarity.
- Retriever: Fetches the chunks or documents most relevant to the question for the LLM to base its answer on.
- Chain Type ("stuff"): Determines how the retrieved documents are handled before generating an answer. Here, it simply concatenates all the retrieved documents into a single prompt.
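
The "stuff" strategy can be sketched in a few lines of plain Python: join every retrieved chunk into one context string and place it in a single prompt ahead of the question. The prompt wording below is a hypothetical template written for illustration, not the exact prompt LangChain sends, and `build_stuff_prompt` is not a LangChain function.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one prompt for a single LLM call."""
    context = "\n\n".join(retrieved_chunks)  # "stuff" everything into the context
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for what the retriever would return
retrieved = [
    "Layer normalization normalizes across the features within each sequence.",
    "Positional encoding gives the model information about word order.",
]
prompt = build_stuff_prompt(
    "What is the difference between positional encoding and layer normalization?",
    retrieved,
)
print(prompt)
```

Since every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason the number of retrieved chunks and the chunk size are kept small.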