### Retrieval-Augmented Generation (RAG) Tool
- Learn about RAG
- Learn about embeddings
- Learn about vector stores: FAISS and ChromaDB
- Learn about chunks, overlaps, and text splitters
- Learn about `max_new_tokens` and `min_length`
- Learn how to orchestrate a seamless workflow using RAG + LangChain
- Learn about similarity search
- Learn about prompt templates
- Learn how to control the generated output from the LangChain framework

In [1]:
!git clone https://github.com/Jaish19/GenAI---RAG-using-LangChain.git

Cloning into 'GenAI---RAG-using-LangChain'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 41 (delta 18), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (41/41), 1.70 MiB | 4.57 MiB/s, done.
Resolving deltas: 100% (18/18), done.


In [2]:
!pip install langchain
!pip install huggingface_hub
!pip install sentence_transformers
!pip install -U langchain-community
!pip install unstructured[local-inference]

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.11.0->sentence_transformers)
 

### Get HUGGINGFACEHUB_API_TOKEN

In [3]:
from google.colab import userdata
sec_key = userdata.get("HUGGINGFACEHUB_API_TOKEN")
# Do not print the token itself: notebook outputs are saved and shared with the file
print("Token loaded:", bool(sec_key))


In [4]:
import os
os.environ["HUG_FACE_TOKEN"] = sec_key
os.environ["HUGGINGFACEHUB_API_TOKEN"] = sec_key

### Download Text File - optional

In [None]:
# Use this snippet if your data lives at a remote location and can be fetched with a GET request
import requests

url = "https://raw.githubusercontent.com/hwchase17/langchain/master/docs/modules/state_of_the_union.txt"
res = requests.get(url, timeout=30)
res.raise_for_status()  # Stop early if the download failed
with open("state_of_the_union.txt", "w", encoding="utf-8") as f:
    f.write(res.text)

### Embeddings

### Working with PDF Files

In [2]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceHub  # Connects to models hosted on the Hugging Face Hub via its inference API



# IMPORT THE PDF FILE

In [9]:
# LOAD THE DOCUMENT

import os
file_name = "neural_network.pdf"  # Replace with your actual PDF file name
loaders = UnstructuredPDFLoader(os.path.join("/content/GenAI---RAG-using-LangChain/", file_name))


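# VECTOR STORE DB

In [10]:
# VECTOR STORE - SPLITS, CHUNKS, EMBEDS, AND STORES THE VECTORS
# NOTE: no vectorstore_cls is passed here, so the creator uses its default store, which depends on
# the installed LangChain version (faiss-cpu is only installed later in this notebook).
# Pass vectorstore_cls=FAISS to VectorstoreIndexCreator to use FAISS explicitly.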
from langchain.text_splitter import CharacterTextSplitter

index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=500, chunk_overlap=0)).from_loaders([loaders])
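
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding-window splitter in plain Python. It is a simplification and not LangChain's implementation: `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ. The function name `split_with_overlap` and the sample text are assumptions made for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # Hypothetical sample text
no_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=0)
with_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(no_overlap)    # ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(with_overlap)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

With `chunk_overlap=0`, as used in this notebook, a sentence that straddles a boundary is cut in two. A non-zero overlap repeats the tail of one chunk at the head of the next, at the cost of storing more vectors.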



In [11]:
# TRY SIMILARITY SEARCH HERE

# Get the internal vector store from the index
vectorstore = index.vectorstore

# Now do similarity search on it
results = vectorstore.similarity_search("What is vanishing gradient problem?")

print(results)

[Document(id='089e6f80-21d3-4f02-8ac3-b48062dec358', metadata={'source': '/content/GenAI---RAG-using-LangChain/neural_network.pdf'}, page_content='5.2 What’s causing the vanishing gradient problem? Unstable\n\ngradients in deep neural nets\n\nTo get insight into why the vanishing gradient problem occurs, let’s consider the simplest deep neural network: one with just a single neuron in each layer. Here’s a network with three hidden layers:'), Document(id='7d7a6b11-e657-412a-bdae-c9eba0fd3836', metadata={'source': '/content/GenAI---RAG-using-LangChain/neural_network.pdf'}, page_content='You’re welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There’s no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there’s also a simple explanation of why the expression is true, and so it’s fun (and perhaps enlightening) to take a look at that explanation.'), Document(
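
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity on hand-made three-dimensional vectors. The vectors, the function names, and the choice of cosine similarity are assumptions for illustration; the actual distance metric depends on the vector store, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" for four chunks (hypothetical values)
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [1.0, 0.0, 0.0]

best = top_k(toy_query_vector, toy_doc_vectors, k=2)
print(best)  # [0, 1]
```

`similarity_search` in LangChain returns four documents by default (`k=4`); pass `k` explicitly to retrieve fewer or more chunks.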

# LANGCHAIN + RAG - WORKFLOW

LLM AGENT + PROMPT AGENT + RETRIEVAL AGENT (RAG) - ACCOMMODATING THEM IN LANGCHAIN

In [12]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate


# CREATING THE LLM AGENT TOOL HERE

llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-base",  # Small instruction-tuned text-to-text model
    task="text2text-generation",
    device=0,  # Use GPU
    pipeline_kwargs={
        "max_new_tokens": 200,    # Limit the output to at most 200 new tokens
        "min_length": 100         # Optional: force at least 100 tokens; a high minimum can make a small model repeat itself
        # "max_length": 200       # Optional: cap the total length of the output sequence
    }
)


# CREATING THE PROMPT AGENT TOOL

template = """
You are a helpful assistant. Use the following context to answer the user's question.
If the answer cannot be found in the context, say "I don't know".

Context:
{context}

Question: {question}
Answer:
"""
prompt = PromptTemplate(input_variables=["context", "question"], template=template)


# USING RAG - RETRIEVAL TOOL TO EXTRACT THE CONTENT AND APPENDING THEM IN LANGCHAIN FOR WORKFLOW

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
    input_key="question"
)


Device set to use cuda:0
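
The `"stuff"` chain type places all retrieved chunks into the `{context}` slot of the prompt and sends the result to the LLM in a single call. The sketch below reproduces that assembly step in plain Python, using the same template text as above. The helper name `build_stuff_prompt`, the example chunks, and the blank-line separator are assumptions for illustration, not LangChain's internal code.

```python
TEMPLATE = """
You are a helpful assistant. Use the following context to answer the user's question.
If the answer cannot be found in the context, say "I don't know".

Context:
{context}

Question: {question}
Answer:
"""


def build_stuff_prompt(retrieved_chunks, question, separator="\n\n"):
    """Join every retrieved chunk into one context string and fill the template."""
    context = separator.join(retrieved_chunks)
    return TEMPLATE.format(context=context, question=question)


# Hypothetical chunks standing in for the retriever's results
example_chunks = ["Chunk one text.", "Chunk two text."]
final_prompt = build_stuff_prompt(example_chunks, "What is vanishing gradient problem?")
print(final_prompt)
```

Because every chunk is pasted into one prompt, the prompt length grows with both `chunk_size` and the number of retrieved chunks, which matters for models with a short context window.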


# QUERY THE WORKFLOW

In [13]:
print(chain.run("What is vanishing gradient problem?"))


Token indices sequence length is longer than the specified maximum sequence length for this model (724 > 512). Running this sequence through the model will result in indexing errors


Unstable gradients in deep neural nets. (5.6) Why the vanishing gradient problem occurs. (cid:48)(z) | To avoid the vanishing gradient problem we need 1. (cid:48)(wa + b)  C  a4 . (5.6) Why the vanishing gradient problem occurs. (cid:48)(z) | To avoid the vanishing gradient problem we need 1. (cid:48)(wa + b)  C  a4 . (5.6) Why the vanishing gradient problem occurs. (cid:48)(z) | To avoid the vanishing gradient problem we need 1.


In [14]:
print(chain.run('What is sigmoid neurons?'))

Just like a perceptron, the sigmoid neuron has inputs, x1, x2,.... But instead of being just 0 or 1, these inputs can also take on any values between 0 and 1. So, for instance, 0.638... is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, w1,w2,..., and an overall bias, b. But the output is not 0 or 1. Instead, it’s (wx + b), where  is called the sigmoid function1, and is defined by: (z)  1 1 + e  z . To put it all a little more explicitly, the output of a sigmoid neuron with inputs x1,x2,


In [15]:
chain.run('Tell about two caveats')

'2.2 The two assumptions we need about the cost function The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that doesn\'t make it a good model. It may just mean there\'s enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn\'t been exposed to before. There are three morals to draw from these stories. First, it can be quite a subtle business deciding which of two explanations is truly "simpler". Second, even if we can make such a judgment, simplicity is '

## CHROMA DB

To use ChromaDB instead of the vector store created by `VectorstoreIndexCreator` above, build the store with the code below and use it in place of `index.vectorstore`.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import UnstructuredPDFLoader
import os

# Load every PDF in the folder (requires the `chromadb` package: pip install chromadb)
pdf_folder_path = "path_to_your_pdfs"
loaders = [
    UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn))
    for fn in os.listdir(pdf_folder_path)
    if fn.lower().endswith(".pdf")
]

# Load the documents from every loader
docs = []
for loader in loaders:
    docs.extend(loader.load())

# Split the documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
chunks = text_splitter.split_documents(docs)

# Use HuggingFace embeddings to convert text to vectors
embeddings = HuggingFaceEmbeddings()

# Build the Chroma vector store from the chunks (not the unsplit docs);
# Chroma calls the embedding model itself, so no separate embed_documents() step is needed
chroma_db = Chroma.from_documents(chunks, embeddings)

# Now you can use this Chroma DB for similarity search or QA tasks
query = "What is version control?"
result = chroma_db.similarity_search(query)


# FAISS DB - SIMPLE LANGCHAIN PROCESS

In [16]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.3/31.3 MB 68.4 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0


In [18]:
from langchain_community.document_loaders import TextLoader
from langchain.chains.question_answering import load_qa_chain
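from langchain import HuggingFaceHub  # Only needed for the commented-out Hub alternative below
from langchain.llms import HuggingFacePipeline  # Used to build the local GPT-2 pipeline below
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS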


loader = TextLoader('/content/GenAI---RAG-using-LangChain/GIT Commands.txt')
documents = loader.load()

# Embeddings
embeddings = HuggingFaceEmbeddings()

# Text Splitter
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)


db = FAISS.from_documents(docs, embeddings)


# llm=HuggingFaceHub(repo_id="5CD-AI/visocial-T5-base", model_kwargs={"temperature":0, "max_length":512})
## Use HuggingfacePipelines With Gpu
llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device=0,  # replace with device_map="auto" to use the accelerate library.
    pipeline_kwargs={"max_new_tokens": 100},
)

chain = load_qa_chain(llm, chain_type="stuff")

query = "GIT command to avoid the git push failure"
results_from_similarity = db.similarity_search(query)
chain.run(input_documents=results_from_similarity, question=query)



Device set to use cuda:0
stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/docs/how_to/#qa-with-rag
Token indices sequence length is longer than the specified maximum sequence length for this model (1079 > 1024). Running this sequence through the model will result in indexing errors


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
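
The warning above reports a 1079-token prompt against GPT-2's 1024-token limit, and that overflow is the most likely cause of the CUDA assert: four retrieved chunks of up to 1000 characters each, plus the question, no longer fit. One way to avoid this is to keep only as many retrieved chunks as fit a token budget before calling the chain. The sketch below illustrates the idea. `count_tokens` is a crude whitespace-based stand-in and `fit_to_budget` is a hypothetical helper; in practice, count with the model's own tokenizer, for example `len(tokenizer.encode(text))`.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words.
    # Real tokenizers usually produce more tokens than this, so leave extra headroom.
    return len(text.split())


def fit_to_budget(chunks, question, max_context_tokens=1024,
                  reserved_for_output=100, prompt_overhead=50):
    """Keep the leading chunks that fit within the model's context window.

    reserved_for_output mirrors max_new_tokens; prompt_overhead is an assumed
    allowance for the instruction text wrapped around the context.
    """
    budget = (max_context_tokens - reserved_for_output
              - prompt_overhead - count_tokens(question))
    kept, used = [], 0
    for chunk in chunks:  # Assumed to be ordered most relevant first
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


# Three hypothetical chunks of 400 "tokens" each
toy_chunks = ["word " * 400, "word " * 400, "word " * 400]
kept = fit_to_budget(toy_chunks, "GIT command to avoid the git push failure")
print(len(kept))  # 2
```

Simpler alternatives are to lower `chunk_size` in the text splitter or to retrieve fewer chunks with `db.similarity_search(query, k=2)`. After a device-side assert, the Colab runtime usually has to be restarted before the GPU can be used again.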
