<a href="https://colab.research.google.com/github/shivvor2/RAG-Demo-colab/blob/main/Multiround_conversation_RAG_and_llama3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up the environment

Installing dependencies

In [1]:
!apt-get install -qq poppler-utils libleptonica-dev tesseract-ocr libtesseract-dev python3-pil tesseract-ocr-eng tesseract-ocr-script-latn lshw


In [2]:
!pip install --quiet -r requirements.txt

Setting up environment variables

In [3]:
import os
from dotenv import load_dotenv

load_dotenv()

GROQ_API_KEY = os.getenv("GROQ_API_KEY") # Load using .env file
HF_TOKEN = os.getenv("HF_TOKEN")
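
Because `os.getenv` silently returns `None` for a missing key, a quick check before continuing can save a confusing failure later. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper (not part of `python-dotenv`), and the dictionary passed as `env` is demo data standing in for the real environment.

```python
import os

def missing_env_vars(names, env=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]

# Demo with a fake environment: only one of the two keys is defined.
demo_missing = missing_env_vars(["GROQ_API_KEY", "HF_TOKEN"], env={"GROQ_API_KEY": "abc"})
print(demo_missing)
```

In the notebook itself, calling `missing_env_vars(["GROQ_API_KEY", "HF_TOKEN"])` without the `env` argument would check the real environment after `load_dotenv()` has run.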

Letting Colab register Tesseract correctly

In [4]:
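import shutil
import pytesseract

# tesseract_cmd must point at the Tesseract binary installed by apt, not at the pytesseract wrapper script
pytesseract.pytesseract.tesseract_cmd = shutil.which("tesseract") or "/usr/bin/tesseract"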

Hugging Face login

In [5]:
from huggingface_hub import login
login(token = HF_TOKEN)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Control Panel (All variables)

In [54]:
# Vectorization
# Loading
path = "/content/documents"
# Chunking (should use larger chunk sizes if we are using a larger model instead)
chunk_size = 256
chunk_overlap = 64

# Inference

top_k = 5 # Number of chunks to retrieve per query
context_count = 2 # Keep chunks from at most this many retrievals
max_shown_message_rounds_count = 2 # Number of most recent user/assistant rounds to display

# Model details (for Groq):
model = "llama3-8b-8192" # Can also choose "llama3-70b-8192"
max_tokens = 8192
temperature = 0.3
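
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding token window. It treats the items of a plain list as tokens, which is an assumption: `TokenTextSplitter` works on real tokenizer output. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window start advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

# Small demo values so the overlap is easy to see.
demo_tokens = [f"t{i}" for i in range(10)]
demo_chunks = sliding_chunks(demo_tokens, chunk_size=4, chunk_overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

With the notebook's settings (256 and 64), each chunk would start 192 tokens after the previous one under this scheme.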


# Load, Chunk and Store Documents

Initialize the document vectorization objects

Embedding model: we use the [highest-ranked model](https://huggingface.co/spaces/mteb/leaderboard) that fits within our VRAM constraints (16 GB on a T4 GPU)

In [8]:
from langchain_community.document_loaders import DirectoryLoader, UnstructuredFileLoader
from langchain.text_splitter import TokenTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Recursively loads every supported document under the file directory
loader = DirectoryLoader(path = path,
                         recursive = True) # use_multithreading = True

# loader = UnstructuredFileLoader("/content/documents/An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks.pdf")

splitter = TokenTextSplitter(chunk_size = chunk_size,
                             chunk_overlap = chunk_overlap)

embeddings = HuggingFaceEmbeddings(
    model_name = "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    model_kwargs = {"device": "cuda"},
)

# Processing Documents
# documents = loader.load()
# docs_chunked = splitter.split_documents(documents)
# vectorstore = Chroma.from_documents(docs_chunked, embeddings)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [9]:
documents = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [55]:
docs_chunked = splitter.split_documents(documents)

In [56]:
vectorstore = Chroma.from_documents(docs_chunked, embeddings)
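
Conceptually, the vector store embeds every chunk and, at query time, returns the chunks whose vectors lie closest to the query vector. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. Both are assumptions made for illustration: the real embeddings come from the model above, and Chroma uses its own index and distance metric. `toy_similarity_search` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_similarity_search(query_vec, store, k):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up (text, vector) pairs standing in for embedded chunks.
toy_store = [
    ("chunk about forgetting", [0.9, 0.1, 0.0]),
    ("chunk about dropout", [0.1, 0.9, 0.0]),
    ("chunk about references", [0.0, 0.1, 0.9]),
]
toy_hits = toy_similarity_search([1.0, 0.2, 0.0], toy_store, k=2)
print(toy_hits)
```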

# Inference

## LLM message creation process

System Prompts

In [12]:
# To-Do
prompt_query = """
You are an AI assistant helping the user find information from a knowledge base to answer questions in a conversation. Your role is to generate a query statement based on the user's most recent message, which will be used to search the knowledge base for relevant information.

Here are the key points to keep in mind:

1. Carefully analyze the user's last message in the conversation. Use the earlier messages only as context to help you understand what the user is asking.

2. If the user's last message is asking for new information or posing a question that needs an informative response, generate a clear, concise query statement that captures the key information needed to answer their question.

3. The query statement you generate should be suitable for searching a knowledge base, so focus on keywords and phrases that are likely to match relevant documents. Avoid including conversational language or filler words in the query.

4. However, if the user's last message does not require new information to be retrieved - for example, if they are simply asking for clarification about something already discussed, or if they are just making small talk - then do not generate a query. Instead, respond with only the exact phrase "N/A" (without quotes) to indicate that no knowledge base search is needed.

5. Your response should consist of either the query statement or "N/A", and nothing else. Do not include any other text, explanations, or phrases in your response.

Remember, your goal is to help the user retrieve the most relevant information to respond helpfully to their needs. Analyze their latest message carefully and use your best judgment to decide if a knowledge base query is needed, and if so, what the query should be.
"""

prompt = """
You are an AI assistant designed to provide helpful responses to user queries. Your task is to generate accurate and relevant responses based on the information available to you.
Key guidelines:

Analyze the user's latest message carefully, considering the conversation history for context.
Provide a helpful and informative response based on the available information. Be as specific and detailed as possible when answering questions or addressing the user's needs.
If you don't have sufficient information to fully answer the user's query:
a. Provide a partial answer based on the information you do have.
b. Clearly state that you don't have enough information to provide a complete answer.
c. If appropriate, suggest what additional information might be needed to give a more comprehensive response.
Maintain a professional and helpful tone throughout the conversation. Be concise when possible, but provide detailed explanations when necessary.
If the user asks about your capabilities or limitations, be honest about what you can and cannot do. However, do not mention anything about context retrieval or how you access information unless explicitly asked by the user.
If the user's message doesn't require a substantive response (e.g., simple acknowledgments or pleasantries), respond appropriately but briefly.
Always strive to provide the most helpful and accurate response possible based on the information available to you. If you're unsure about something, it's better to acknowledge that uncertainty rather than making unfounded assumptions.
Do not reference or mention the existence of retrieved context or any internal information retrieval process unless the user specifically asks about it.

Remember, your primary goal is to assist the user effectively while seamlessly integrating any available information into your responses.
"""

system_prompt = [
        {
            "role": "system",
            "content": prompt,
        }]



system_prompt_query = [
        {
            "role": "system",
            "content": prompt_query,
        }]
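
The query prompt asks the model to reply with either a search query or the literal `N/A`. In practice a model may wrap its reply in whitespace or quotation marks, so the reply is worth normalizing before it is compared. The following is a minimal sketch of that normalization; `parse_query_reply` is a hypothetical helper and not part of the Groq client.

```python
def parse_query_reply(reply):
    """Return a search query, or None when the model signals that no search is needed."""
    cleaned = reply.strip().strip('"').strip()  # drop surrounding whitespace and double quotes
    if not cleaned or cleaned.upper() == "N/A":
        return None
    return cleaned

print(parse_query_reply(' "N/A" \n'))
print(parse_query_reply('"definition of catastrophic forgetting"'))
```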



Message creation process

In [13]:
def current_message_with_context(usr_msg, current_context):
  current_message = [
      {
          "role": "user",
          "content": f"{usr_msg}. \n (Respond to the user given the following context: {current_context})",
      }]

  return current_message

def current_message_for_query(usr_msg):
  current_message = [
      {
          "role": "user",
          "content": f"{usr_msg}. \n (Respond with either a query statement or N/A and nothing else)",
      }]
  return current_message

def new_message(msg, role):
  current_message = [
      {
          "role": role,
          "content": msg,
      }]

  return current_message
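
Each helper above returns a one-element list, so a full request payload can be built by concatenating lists with `+`. The sketch below shows the resulting shape. `make_message` is a hypothetical stand-in that mirrors `new_message` so the example runs on its own, and the message texts are made-up demo data.

```python
# Hypothetical stand-in mirroring new_message, so this sketch is self-contained.
def make_message(role, content):
    return [{"role": role, "content": content}]

demo_system = make_message("system", "You are a helpful assistant.")
demo_history = (make_message("user", "What is dropout?")
                + make_message("assistant", "Dropout randomly disables units during training."))
demo_current = make_message("user", "Does it help with forgetting?")

# System prompt first, then earlier turns, then the newest user message.
demo_payload = demo_system + demo_history + demo_current
for message in demo_payload:
    print(message["role"], "->", message["content"])
```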

## Define workflow

In [32]:
from typing import Optional
from groq import Groq

client = Groq(api_key = GROQ_API_KEY)
msg_hist = []
context = []

# Abstracting away the message creation process

def update_context(query: Optional[str]) -> None:
  global context # A module-level global is not ideal; this is only a proof of concept
  if query is not None:
    top_k_docs = vectorstore.similarity_search(query, k = top_k)
    context = context + [{"text": doc.page_content} for doc in top_k_docs]
    if len(context) > top_k * context_count:
      context = context[top_k:] # Drop the top_k oldest chunks (the earliest retrieval)

def get_query(usr_msg: str) -> Optional[str]:
  messages = system_prompt_query + msg_hist + current_message_for_query(usr_msg = usr_msg)
  response = client.chat.completions.create(
    messages = messages,
    max_tokens = max_tokens,
    temperature = temperature,
    model = model)
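  response_msg = response.choices[0].message.content
  # The model may wrap its reply in whitespace or quotes, so normalize before comparing
  if response_msg.strip().strip('"').strip() == "N/A":
    response_msg = None
  return response_msg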

def answer_question(usr_msg: str) -> str:
  global msg_hist
  query = get_query(usr_msg)
  print(query)
  # If user queries a new thing, add that into the context
  update_context(query)

  messages = system_prompt + msg_hist + current_message_with_context(usr_msg, context)
  response = client.chat.completions.create(
    messages = messages,
    max_tokens = max_tokens,
    temperature = temperature,
    model = model)
  response_msg = response.choices[0].message.content
  msg_hist = msg_hist + new_message(usr_msg, "user") + new_message(response_msg, "assistant")

  return response_msg
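
The context buffer in `update_context` behaves like a rolling window: new chunks are appended, and once more than `top_k * context_count` chunks are held, the oldest `top_k` are dropped. The sketch below expresses the same rule as a pure function, which avoids the global variable. `roll_context` is a hypothetical helper, and the chunk labels are placeholder strings rather than retrieved text.

```python
def roll_context(context, new_chunks, top_k, context_count):
    """Append new chunks, then drop the oldest top_k if the buffer exceeds its capacity."""
    context = context + new_chunks
    if len(context) > top_k * context_count:
        context = context[top_k:]
    return context

# Three retrieval rounds of two chunks each, with room for two rounds.
demo_context = []
for round_chunks in (["a1", "a2"], ["b1", "b2"], ["c1", "c2"]):
    demo_context = roll_context(demo_context, round_chunks, top_k=2, context_count=2)
    print(demo_context)
```

After the third round the chunks from the first retrieval are gone, so the model only ever sees the most recent retrievals.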

## Inference (Run the next 2 cells to send a message)

In [57]:
# Reset Chat
msg_hist = []
context = []

In [58]:
# Type your message here
last_message = """
What is the definition of catastrophic forgetting?
"""

In [59]:
answer_question(last_message)
for msg in msg_hist[-2 * max_shown_message_rounds_count:]: # each round is one user message plus one assistant reply
  for key in msg.keys():
    print(msg[key])

"definition of catastrophic forgetting in machine learning"
user

What is the definition of catastrophic forgetting?

assistant
Catastrophic forgetting is a phenomenon in machine learning where a neural network trained on a specific task forgets previously learned information when it is trained on a new task. This can occur when the network is trained on a sequence of tasks, and each new task requires the network to learn new patterns and relationships.

The term "catastrophic" was coined because the forgetting of previously learned information can be sudden and dramatic, often resulting in a significant drop in performance on the original task. This can be particularly problematic in applications where the network is expected to retain knowledge across multiple tasks, such as in lifelong learning or multi-task learning scenarios.

Catastrophic forgetting is often attributed to the fact that neural networks are trained using a single objective function, which encourages the network to 

# Objectifying (TBD)

In [60]:
print(context)

[{'text': ' 2010. Oral Presentation.\n\nTheano:\n\nBlitzer, John, Dredze, Mark, and Pereira, Fernando. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classiﬁcation. In ACL ’07, pp. 440–447, 2007.\n\nGlorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. In JMLR Deep sparse rectiﬁer neural networks. W&CP: Proceedings of the Fourteenth International Conference on Artiﬁcial Intelligence and Statistics (AISTATS 2011), April 2011a.\n\nGlorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Domain adaptation for large-scale sentiment classi- ﬁcation: A deep learning approach. In Proceedings of theTwenty-eight International Conference on Ma- chine Learning (ICML’11), volume 27, pp. 97–110, June 2011b.\n\nWhen computational resources are too limited to ex- periment with multiple activation functions, we rec- ommend using the maxout activation function trained with dropout. This is the only method that appears on the lower-left frontier of the performance tradeoﬀ pl