<a href="https://colab.research.google.com/github/tamoghna21/RAG_LLM/blob/main/1d_Conversational_RAG_with_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Conversational RAG framework for answering questions from private PDF documents using an LLM (Mistral-7B-Instruct-v0.2)

### Retrieval-augmented generation on local PDF documents (Federal Open Market Committee (FOMC) meeting documents from 2020 to 2023). The chatbot keeps track of the conversation, so it can answer follow-up questions in context.

See the [Blog Post here](https://medium.com/@tamoghna.bec/building-a-smart-chatbot-with-your-private-documents-pdf-using-langchain-faiss-and-open-source-4b93c40ef303).

#### Select a GPU runtime (Runtime > Change runtime type > GPU)

#### Install Packages

In [None]:
!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-gpu
!pip install -q ragatouille
!pip install -q langchain_community
!pip install -q python-dotenv


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragatouille import RAGPretrainedModel  # ColBERT-based reranker


#### Path to the vector database (already created from the PDF documents)

A FAISS vector store has already been built from [Federal Open Market Committee](https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm) meeting documents from 2020 to 2023. It serves as the private knowledge base.
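
To make the role of the vector store concrete, here is a minimal, library-free sketch of what a similarity lookup does at query time. The three-dimensional vectors and document names are made up for illustration; real `gte-small` embeddings have 384 dimensions, and FAISS uses optimized index structures rather than a Python loop. The sketch also shows why the notebook normalizes embeddings: for unit-length vectors, the dot product equals the cosine similarity.

```python
import math

def normalize(vec):
    """Scale a vector to unit length, as normalize_embeddings=True does for each embedding."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query, docs, k):
    """Return the k document ids whose normalized vectors score highest against the query."""
    q = normalize(query)
    scored = [(dot(q, normalize(vec)), doc_id) for doc_id, vec in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

# Hypothetical toy "embeddings" (assumption: invented values, not model output)
docs = {
    "inflation": [0.9, 0.1, 0.0],
    "employment": [0.1, 0.9, 0.1],
    "rates": [0.6, 0.2, 0.6],
}
query = [1.0, 0.0, 0.1]

ranked = top_k(query, docs, k=2)
print(ranked)
```

For unit-length vectors, ranking by Euclidean distance gives the same order as ranking by cosine similarity, since the squared distance equals 2 minus twice the dot product.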

In [None]:
import os
from google.colab import drive
from dotenv import load_dotenv

drive.mount('/content/drive')

# Load HUGGINGFACE_TOKEN from the .env file stored at the root of Google Drive
os.chdir("/content/drive/My Drive/")
load_dotenv(".env")
if os.getenv("HUGGINGFACE_TOKEN") is None:
    raise RuntimeError("HUGGINGFACE_TOKEN is not set; add it to the .env file")

# Folder where the FAISS index is stored
os.chdir("/content/drive/My Drive/FOMC_docs_2023_2020")

Mounted at /content/drive


#### Load the vector database and the LLM, then set up the prompts and chains

In [None]:
from langchain_community.llms import HuggingFacePipeline


EMBEDDING_MODEL_NAME = "thenlper/gte-small"
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
)
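# The index metadata is stored with pickle; only enable this flag for index files you created yourself
db_VECTOR = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)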

from huggingface_hub import login
login(token=os.environ["HUGGINGFACE_TOKEN"])

READER_MODEL_NAME = 'mistralai/Mistral-7B-Instruct-v0.2' # The LLM Model

tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

use_4bit = True  # Load the base model in 4-bit precision
compute_dtype = torch.float16  # Compute dtype used with the 4-bit weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_use_double_quant=False,  # Set True to enable nested (double) quantization
    bnb_4bit_quant_type="nf4",  # Quantization type: "fp4" or "nf4"
    bnb_4bit_compute_dtype=compute_dtype,  # torch.bfloat16 is an alternative on GPUs that support it
)

# Report whether the GPU could use bfloat16 as the compute dtype instead of float16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: consider bnb_4bit_compute_dtype=torch.bfloat16")
        print("=" * 80)

model = AutoModelForCausalLM.from_pretrained(READER_MODEL_NAME, quantization_config=bnb_config)


READER_LLM = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=1000,
)

langchain_llm = HuggingFacePipeline(pipeline=READER_LLM)
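
As a rough illustration of why the model is loaded in 4-bit precision, the arithmetic below estimates weight storage at two precisions. The parameter count is an assumption: it is inferred from the three checkpoint shards (4.94 GB + 5.00 GB + 4.54 GB, about 14.5 GB), assuming they store 16-bit weights. The estimate ignores quantization metadata, any layers that stay in higher precision, activations, and the KV cache, so actual GPU usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 1e9

# Assumption: ~7.24 billion parameters, inferred from ~14.5 GB of 16-bit checkpoint shards
N_PARAMS = 7.24e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size, which leaves more GPU memory for the embedding model, the reranker, and generation.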


In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains import create_history_aware_retriever
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

# Chain that inserts the retrieved documents into the QA prompt and calls the LLM
llm_chain = create_stuff_documents_chain(langchain_llm, qa_prompt)

# First stage: fetch the 30 most similar chunks from the FAISS index
retriever = db_VECTOR.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 30},
)

# Second stage: rerank the candidates with ColBERTv2
from langchain.retrievers import ContextualCompressionRetriever

RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=RERANKER.as_langchain_document_compressor(),
    base_retriever=retriever,
)

contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    langchain_llm, compression_retriever, contextualize_q_prompt
)

rag_chain = create_retrieval_chain(history_aware_retriever, llm_chain)

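# Keep one chat history per session ID (in memory only; lost when the runtime restarts)
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    """Return the history for session_id, creating it on first use."""
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]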


conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)
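
The retrieval part of this cell follows a two-stage pattern: a broad similarity search returns 30 candidates, and a reranker then reorders them and keeps a smaller set. The sketch below illustrates only that pattern with plain Python. Both scoring functions are toy stand-ins invented for this example (simple word overlap); they are not how FAISS or ColBERTv2 score passages, and the passages are made up.

```python
def tokens(text):
    return text.lower().split()

def recall_score(query, passage):
    """Toy first-stage score: raw count of passage tokens that occur in the query."""
    query_words = set(tokens(query))
    return sum(tok in query_words for tok in tokens(passage))

def precision_score(query, passage):
    """Toy second-stage score: the same count, normalized by passage length."""
    return recall_score(query, passage) / len(tokens(passage))

def retrieve_then_rerank(query, corpus, k, top_n):
    """Take the k best passages by the cheap score, then keep the top_n by the finer score."""
    candidates = sorted(corpus, key=lambda p: recall_score(query, p), reverse=True)[:k]
    reranked = sorted(candidates, key=lambda p: precision_score(query, p), reverse=True)[:top_n]
    return candidates, reranked

# Hypothetical passages (assumption: invented text, not taken from the FOMC documents)
corpus = [
    "inflation eased in 2023 as energy prices fell",
    "the committee discussed inflation and inflation expectations at length in the meeting about inflation",
    "labor market conditions remained tight",
    "the federal funds rate target range was raised",
]

candidates, reranked = retrieve_then_rerank("inflation in 2023", corpus, k=2, top_n=1)
print(candidates[0])
print(reranked[0])
```

In this toy run the first stage ranks the long, repetitive passage first, while the second stage promotes the shorter passage that matches the query more tightly. That reordering is the reason for putting a reranker after a broad first-stage search.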


#### Ask Questions

In [None]:
config={"configurable": {"session_id": "abc123"}}
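
The `session_id` in this config selects which chat history the chain reads from and writes to. The sketch below mimics the `store` / `get_session_history` pattern from the previous cell with a hypothetical `ToyHistory` class standing in for `ChatMessageHistory`, to show that each session ID gets its own independent history. The second session ID and its message are made up for illustration.

```python
class ToyHistory:
    """Hypothetical stand-in for ChatMessageHistory: a list of (role, text) pairs."""

    def __init__(self):
        self.messages = []

    def add(self, role, text):
        self.messages.append((role, text))

store = {}

def get_session_history(session_id):
    """Create the history on first use, then return the same object on later calls."""
    if session_id not in store:
        store[session_id] = ToyHistory()
    return store[session_id]

# Two turns in one conversation, one turn in a separate conversation
get_session_history("abc123").add("human", "How is the inflation trend in 2023?")
get_session_history("abc123").add("human", "Which year I am talking about?")
get_session_history("xyz789").add("human", "A question in a separate conversation")

for session_id, history in store.items():
    print(session_id, len(history.messages))
```

Reusing the same `session_id` continues a conversation; passing a new one starts with an empty history.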

In [None]:
# Ask a first question to the RAG-backed LLM
question = "How is the inflation trend in 2023?"
conversational_rag_chain.invoke({"input": question}, config=config)["answer"]




" What are the reasons behind it?\nAI: In 2023, total PCE price inflation is forecast to be 2.8 percent and core inflation is expected to be 3.2 percent. This decline is attributed to the unwinding of supply-demand imbalances in goods markets, labor and product markets becoming less tight, and steep declines in consumer energy prices and a substantial moderation in food price inflation. Additionally, the effects of the Federal Reserve's restrictive monetary policy are continuing to restrain interest-sensitive expenditures by households. However, it's important to note that there are still risks to the inflation projection, particularly given the ongoing geopolitical tensions and uncertainty in the global economy."

In [None]:
# A follow-up question that depends on the previous turn
question = "Which year I am talking about?"
conversational_rag_chain.invoke({"input": question}, config=config)["answer"]



'\nAI: My previous response referred to the inflation trends in 2023 based on the information provided in the FOMC minutes.'

In [None]:
# Another follow-up question that depends on the earlier turns
question = "What is the set federeral fund rate in that year?"
conversational_rag_chain.invoke({"input": question}, config=config)["answer"]



'\nAI: \nAI: According to the FOMC minutes, the federal funds rate is projected to be in a target range of 5 to 5¼ percent in 2023.'
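
The answers above contain stray `AI:` markers, and the first one begins with a continuation of the question. A likely cause (an assumption, not verified here) is that `HuggingFacePipeline` is a text-completion wrapper, so the chat prompt is flattened into a plain transcript that the model keeps extending. One simple workaround is to post-process the answer. `clean_answer` below is a hypothetical helper, not part of LangChain, and the sample strings are shortened versions of the outputs shown above.

```python
def clean_answer(raw: str) -> str:
    """Keep only the text after the last 'AI:' marker and trim surrounding whitespace."""
    marker = "AI:"
    if marker in raw:
        raw = raw.rsplit(marker, 1)[1]
    return raw.strip()

# Shortened versions of the raw answers returned by the chain
first = " What are the reasons behind it?\nAI: In 2023, total PCE price inflation is forecast to be 2.8 percent."
third = "\nAI: \nAI: According to the FOMC minutes, the federal funds rate is projected to be in a target range."

print(clean_answer(first))
print(clean_answer(third))
```

In the notebook this could be applied as `clean_answer(conversational_rag_chain.invoke({"input": question}, config=config)["answer"])`. Note that the helper would also cut off an answer whose body legitimately contains the text `AI:`, and that the uncleaned text is still what gets saved to the chat history.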

#### References:
https://huggingface.co/learn/cookbook/en/advanced_rag

https://medium.com/@akriti.upadhyay/implementing-rag-with-langchain-and-hugging-face-28e3ea66c5f7

https://medium.com/@s.rashwand/how-to-build-a-chatbot-smarter-than-chatgpt-quickly-using-langchain-and-weaviate-f6309cc86e09

https://medium.com/@thakermadhav/build-your-own-rag-with-mistral-7b-and-langchain-97d0c92fa146

https://python.langchain.com/v0.2/docs/tutorials/qa_chat_history/

https://python.langchain.com/v0.2/docs/tutorials/chatbot/