<a href="https://colab.research.google.com/github/tamoghna21/RAG_LLM/blob/main/1c_RAG_QA_pdf_full_with_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Full implementation of a RAG pipeline for answering questions from private PDF documents with an LLM (Mistral-7B-Instruct-v0.2)

### Retrieval-augmented generation on local PDF documents (Federal Open Market Committee (FOMC) meeting documents for 2020-2023)

#### Select Runtime > GPU
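Before installing anything, it can help to confirm that the runtime actually exposes a GPU. The following is a rough sketch that relies only on the `nvidia-smi` command-line tool being on the path; `gpu_visible` is a hypothetical helper name, not part of any library used in this notebook.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Rough check: True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tooling on the path, so assume no GPU runtime is attached
        return False
    result = subprocess.run([exe, "-L"], capture_output=True, text=True)
    # `nvidia-smi -L` prints one line per device, each starting with "GPU"
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the later cells that place the embedding model on `cuda` are likely to fail.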

#### Install Packages

In [None]:
!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-gpu
!pip install -q ragatouille
!pip install -q langchain_community
!pip install -q python-dotenv


In [None]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document as LangchainDocument
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from transformers import AutoTokenizer, pipeline
from ragatouille import RAGPretrainedModel  # ColBERT-based reranker
from transformers import Pipeline
from typing import Optional, List, Tuple


#### Path to the vector database (already built from the PDF documents)

In [None]:
import os
from google.colab import drive
drive.mount('/content/drive')

os.chdir("/content/drive/My Drive/")

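from dotenv import load_dotenv
load_dotenv(".env")  # Reads HUGGINGFACE_TOKEN from the .env file in the current directory into os.environ
if not os.getenv("HUGGINGFACE_TOKEN"):
    raise RuntimeError("HUGGINGFACE_TOKEN is not set; add it to the .env file")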

# Folder where the FAISS Index is stored
os.chdir("/content/drive/My Drive/FOMC_docs_2023_2020")

Mounted at /content/drive


#### Load the vector database, load the LLM, and set up the prompt

In [None]:
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.llms import HuggingFacePipeline
#from langchain.chains import LLMChain

EMBEDDING_MODEL_NAME = "thenlper/gte-small"
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
)
# The index is unpickled on load, so only load an index you created or trust
db_VECTOR = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

from huggingface_hub import login
login(token=os.environ["HUGGINGFACE_TOKEN"])

READER_MODEL_NAME = 'mistralai/Mistral-7B-Instruct-v0.2' # The LLM Model

tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

use_4bit = True  # Load the base model in 4-bit precision
compute_dtype = torch.float16  # Compute dtype for the 4-bit base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_use_double_quant=False,  # Set True to enable nested (double) quantization
    bnb_4bit_quant_type="nf4",  # Quantization type: "fp4" or "nf4"
    bnb_4bit_compute_dtype=compute_dtype,  # torch.bfloat16 is an alternative on supported GPUs
)

# If the GPU supports bfloat16 (compute capability >= 8), suggest it as the compute dtype
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: consider compute_dtype = torch.bfloat16")
        print("=" * 80)

model = AutoModelForCausalLM.from_pretrained(READER_MODEL_NAME, quantization_config=bnb_config)


READER_LLM = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=1000,
)

langchain_llm = HuggingFacePipeline(pipeline=READER_LLM)

# Create prompt template
prompt_template = """
### [INST] Instruction: Answer the question based on your knowledge. Here is context to help:

{context}

### QUESTION:
{question} [/INST]
"""

# Create prompt from prompt template
prompt = PromptTemplate(
  input_variables=["context", "question"],
  template=prompt_template,
)

# Create llm chain
llm_chain = prompt | langchain_llm

retriever = db_VECTOR.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 30})

RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
from langchain.retrievers import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
    base_compressor=RERANKER.as_langchain_document_compressor(), base_retriever=retriever
)

rag_chain = (
  {"context": compression_retriever, "question": RunnablePassthrough()}
      | llm_chain
)
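The chain above retrieves in two stages: a similarity search returns 30 candidate chunks (`k=30`), and the ColBERT reranker then reorders them before they reach the prompt. The sketch below illustrates that pattern in plain Python. It is not how FAISS or ColBERT work internally: `word_overlap` is a toy stand-in for the reranker, the 2-d vectors stand in for real embeddings, and `k_final` is an assumed cutoff chosen for illustration.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query_vec: Vector,
    query_text: str,
    corpus: List[Tuple[str, Vector]],
    rerank_score: Callable[[str, str], float],
    k_first: int = 30,
    k_final: int = 3,
) -> List[str]:
    """Stage 1: top-k by cosine similarity. Stage 2: reorder the candidates with a second scorer."""
    q = normalize(query_vec)
    by_similarity = sorted(corpus, key=lambda item: dot(q, normalize(item[1])), reverse=True)
    candidates = [text for text, _ in by_similarity[:k_first]]
    reranked = sorted(candidates, key=lambda text: rerank_score(query_text, text), reverse=True)
    return reranked[:k_final]


def word_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a reranker: fraction of query words present in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)


# Toy 2-d "embeddings"; real ones would come from the embedding model
corpus = [
    ("inflation is projected to decline", [0.9, 0.1]),
    ("the committee raised the target range", [0.7, 0.3]),
    ("labor markets remain tight", [0.1, 0.9]),
]
top = retrieve_then_rerank([1.0, 0.0], "target range", corpus, word_overlap, k_first=2, k_final=1)
print(top)
```

The first stage is cheap and recall-oriented, which is why it can afford to return many candidates; the second stage scores each query and passage pair more carefully, so it runs on the short list only.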



#### Ask Questions

In [None]:
# A question unrelated to the knowledge base; the LLM can answer it without retrieval
question = "What is the capital of USA?"

# Asking without RAG
output = llm_chain.invoke({"context":"",
                  "question": question})
print(output)

The capital city of the United States is Washington, D.C. (District of Columbia). This is a factual statement and does not require any specific knowledge beyond basic geographical information.
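To see what the model receives when the context is empty, the prompt can be rendered by hand. This is a minimal sketch that assumes the template is filled by plain `str.format`-style substitution, which matches the default f-string format of `PromptTemplate`; `render_prompt` is a hypothetical helper name.

```python
PROMPT_TEMPLATE = """
### [INST] Instruction: Answer the question based on your knowledge. Here is context to help:

{context}

### QUESTION:
{question} [/INST]
"""


def render_prompt(context: str, question: str) -> str:
    """Fill the two placeholders of the template with plain strings."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


# With an empty context the model sees only the instruction and the question
no_context = render_prompt("", "What is the capital of USA?")
print(no_context)
```

With no context the model can draw only on what it learned during training, which is enough for general-knowledge questions like this one.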


In [None]:
# Another question unrelated to the knowledge base; the LLM can answer it without retrieval
question = "are meter and pounds comparable?"

# Asking without RAG
output = llm_chain.invoke({"context":"",
                  "question": question})
print(output)

Meter and pounds are units of different physical quantities. A meter is a unit of length, while a pound is a unit of mass or weight. They cannot be directly compared as they measure different things. However, if you want to relate mass and length, you can use the concept of density. Density is defined as mass per unit volume, so if you know the density of an object and its volume (in cubic meters), you can find its mass (in pounds) using the conversion factor between kilograms (the base unit for mass in the International System of Units, SI) and pounds. One kilogram is equal to approximately 2.20462 pounds. So, if you have the density (in kg/m³) and volume (in m³), you can calculate the mass (in kg) and then convert it to pounds.


In [None]:
# A question that depends on the knowledge base; without retrieval the LLM cannot answer it
question = "How is the inflation trend in 2023?"

# Asking without RAG
output = llm_chain.invoke({"context":"",
                  "question": question})
print(output)

I cannot provide an answer to that specific question as I don't have real-time access to current economic data or the ability to predict future trends. Inflation rates can be influenced by various factors such as monetary policy, supply and demand conditions, oil prices, exchange rates, and other economic indicators. To get an accurate understanding of the inflation trend in 2023, it would be best to refer to reliable economic forecasts and reports from reputable sources such as central banks, financial institutions, and research organizations.


In [None]:
# Ask the same question through the RAG chain
output = rag_chain.invoke(question)
print(output)



According to the documents provided, the inflation trend in 2023 is expected to decline further. Specifically, on a four-quarter change basis, total PCE price inflation is projected to be 2.8%, and core inflation is expected to be 3.2%. With the effects of supply-demand imbalances in goods markets expected to further unwind and labor and product markets projected to become less tight, inflation is forecasted to decline further over 2024 and 2025. Additionally, core goods inflation is projected to move down further this year and then remain subdued, housing services inflation is expected to peak later this year and then move down, and core non-housing services inflation is forecasted to slow as nominal wage growth eases. With steep declines in consumer energy prices and a substantial moderation in food price inflation expected for this year, total inflation is projected to step down markedly this year and then track core inflation over the following two years. In 2025, both total and co
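In `rag_chain`, the retriever output is passed directly into the `{context}` slot, so the prompt appears to receive the string form of a list of document objects rather than clean passage text. A common alternative is to join only the passage text first. The sketch below uses `Doc` as a stand-in for LangChain's `Document` (only `page_content` and `metadata` are modeled); `format_docs` and the file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in for a retrieved document: passage text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_docs(docs: List[Doc], separator: str = "\n\n") -> str:
    """Join passage texts into one context string, skipping empty passages."""
    return separator.join(d.page_content.strip() for d in docs if d.page_content.strip())


# Passage texts taken from the answer above; the file names are made up for illustration
docs = [
    Doc("Total PCE price inflation is projected to be 2.8%.", {"source": "example_a.pdf"}),
    Doc("   "),
    Doc("Core inflation is expected to be 3.2%.", {"source": "example_b.pdf"}),
]
context = format_docs(docs)
print(context)
```

In a LangChain expression, a function like this could be piped after the retriever, for example `compression_retriever | format_docs`, so that the prompt sees passage text only.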

In [None]:
# Another question that depends on the knowledge base; without retrieval the LLM cannot answer it
question = "What is the set federal funds rate in February 2023?"
output = llm_chain.invoke({"context":"",
                  "question": question})
print(output)

I'm unable to provide an exact answer as I don't have real-time access to current or future economic data, including the federal funds rate for February 2023. The federal funds rate is determined by the Federal Open Market Committee (FOMC) of the Federal Reserve System and is typically announced after each FOMC meeting. To find out the most recent or upcoming federal funds rate, you would need to check with reputable financial news sources or the Federal Reserve itself.


In [None]:
# Ask the same question through the RAG chain
output = rag_chain.invoke(question)
print(output)


The federal fund rate in February 2023 was directed by the Federal Open Market Committee to be maintained in a target range of 4-1/2 to 4-3/4 percent. The interest rate paid on reserve balances was raised to 4.65 percent, and the primary credit rate was raised to 4.75 percent, both effective February 2, 2023.


#### References:
https://huggingface.co/learn/cookbook/en/advanced_rag

https://medium.com/@akriti.upadhyay/implementing-rag-with-langchain-and-hugging-face-28e3ea66c5f7

https://medium.com/@s.rashwand/how-to-build-a-chatbot-smarter-than-chatgpt-quickly-using-langchain-and-weaviate-f6309cc86e09

https://medium.com/@thakermadhav/build-your-own-rag-with-mistral-7b-and-langchain-97d0c92fa146