**Simple RAG (Retrieval-Augmented Generation) system**

***Overview***

This code implements a simple Retrieval-Augmented Generation (RAG) system for processing and querying text documents. Together, the modules aim to give an understanding of a naive implementation that does not use a vector database.

***Components***



1.   Installing required libraries - Installs the packages needed to build the RAG pipeline: text splitters, openai, runnables, transformers, prompts, etc.
2.   Importing relevant libraries - Imports those components into the notebook.
3.   Import tokens - Load the Hugging Face and OpenAI API tokens from environment variables.
4.   Chunker - Reads the documents/files from the directory we want to retrieve responses from and splits them into chunks. Every chunk gets a chunk ID stored under its document ID, and the chunk size is configurable.
5.   Embedding - Use an OpenAI model to create embeddings of the chunks created above.
6.   Similarity score - The retriever class contains a cosine similarity function that scores each chunk embedding against the prompt embedding. The retriever returns the top 3 chunks with the highest similarity scores.
7.   LLM - The OpenAI GPT model takes the retriever output as context and the prompt as the user query, and returns the response.


***Benefits of the code***
1.   Simple implementation of text chunking and retrieval
2.   Modularized structure
3.   Scalable
4.   Processes history so that the LLM can continue the conversation

***Further improvements***
1.   Implement advanced text splitters
2.   Vector DB approach for efficient storage of embeddings
3.   Auto processing of history



**Installing libraries**

In [None]:
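!pip install transformers==4.35.2
!pip install torch
# openai>=1.0 is required for the `from openai import OpenAI` client used below
!pip install "openai>=1.0"
!pip install langchain_community
!pip install langchain_openai
!pip install -qU langchain-text-splitters
# Provides `load_dotenv`
!pip install python-dotenv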


**Importing Libraries**

In [None]:
import os
import re
import json, ast
import random
import uuid

import numpy as np
import requests
import torch
import openai
from openai import OpenAI
from dotenv import load_dotenv
from transformers import AutoTokenizer, AutoModel
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate,MessagesPlaceholder
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.messages import HumanMessage, AIMessage
from langchain_text_splitters import RecursiveCharacterTextSplitter



**Environment variables**

In [None]:
# Load environment variables from a .env file
load_dotenv()

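# Set HF_TOKEN only if it is defined; assigning None to os.environ raises a TypeError
hf_token = os.getenv('HF_TOKEN')
if hf_token:
    os.environ['HF_TOKEN'] = hf_token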
openai_key = os.getenv('OPENAI_API_KEY')
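
Because `os.getenv` returns `None` for a missing variable, a missing key would otherwise only surface later as an authentication error. A minimal sketch of an early check follows; `require_env` and the `DEMO_API_KEY` variable are hypothetical names introduced for illustration and are not part of the notebook.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error.

    `require_env` is a hypothetical helper introduced for illustration.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; add it to your .env file")
    return value

# Throwaway variable so the snippet runs without real credentials
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))  # sk-demo
```

In the notebook, the same check could be applied to `OPENAI_API_KEY` right after `load_dotenv()`.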

**Splitting the documents into multiple chunks**

In [None]:
def document_chunker(directory_path,
                     model_name,
                     paragraph_separator='\n\n',
                     chunk_size=1024,
                     separator=' ',
                     secondary_chunking_regex=r'\S+?[\.,;!?]',
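                     chunk_overlap=0):
    '''
    Split every file in a directory into chunks of at most chunk_size tokens.
    Arguments:
    directory_path: folder containing the text files
    model_name: Hugging Face model whose tokenizer is used to count tokens
    paragraph_separator: regex used to split the text into paragraphs
    chunk_size: maximum number of tokens per chunk
    separator: string used to split a paragraph into words
    secondary_chunking_regex: regex used to split a chunk that is still too large
    chunk_overlap: number of characters shared between neighbouring chunks
    Returns:
    dictionary mapping doc id -> {chunk id -> {"text", "metadata"}}
    '''
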
    tokenizer = AutoTokenizer.from_pretrained(model_name)  # Load tokenizer for the specified model
    documents = {}  # Initialize dictionary to store results

    # Read each file in the specified directory
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        base = os.path.basename(file_path)
        sku = os.path.splitext(base)[0]
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()

            # Generate a unique identifier for the document
            doc_id = str(uuid.uuid4())

            # Process each file using the existing chunking logic
            paragraphs = re.split(paragraph_separator, text)
            all_chunks = {}
            for paragraph in paragraphs:
                words = paragraph.split(separator)
                current_chunk = ""
                chunks = []

                for word in words:
                    new_chunk = current_chunk + (separator if current_chunk else '') + word
                    if len(tokenizer.tokenize(new_chunk)) <= chunk_size:
                        current_chunk = new_chunk
                    else:
                        if current_chunk:
                            chunks.append(current_chunk)
                        current_chunk = word

                if current_chunk:
                    chunks.append(current_chunk)

                refined_chunks = []
                for chunk in chunks:
                    if len(tokenizer.tokenize(chunk)) > chunk_size:
                        sub_chunks = re.split(secondary_chunking_regex, chunk)
                        sub_chunk_accum = ""
                        for sub_chunk in sub_chunks:
                            if sub_chunk_accum and len(tokenizer.tokenize(sub_chunk_accum + sub_chunk + ' ')) > chunk_size:
                                refined_chunks.append(sub_chunk_accum.strip())
                                sub_chunk_accum = sub_chunk
                            else:
                                sub_chunk_accum += (sub_chunk + ' ')
                        if sub_chunk_accum:
                            refined_chunks.append(sub_chunk_accum.strip())
                    else:
                        refined_chunks.append(chunk)

                final_chunks = []
                if chunk_overlap > 0 and len(refined_chunks) > 1:
                    for i in range(len(refined_chunks) - 1):
                        final_chunks.append(refined_chunks[i])
                        overlap_start = max(0, len(refined_chunks[i]) - chunk_overlap)
                        overlap_end = min(chunk_overlap, len(refined_chunks[i+1]))
                        overlap_chunk = refined_chunks[i][overlap_start:] + ' ' + refined_chunks[i+1][:overlap_end]
                        final_chunks.append(overlap_chunk)
                    final_chunks.append(refined_chunks[-1])
                else:
                    final_chunks = refined_chunks

                # Assign a UUID for each chunk and structure it with text and metadata
                for chunk in final_chunks:
                    chunk_id = str(uuid.uuid4())
                    all_chunks[chunk_id] = {"text": chunk, "metadata": {"file_name":sku}}  # Store the source file name as metadata

            # Map the document UUID to its chunk dictionary
            documents[doc_id] = all_chunks

    return documents
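
The core of the chunker is the greedy word-packing loop: keep adding words to the current chunk until one more word would exceed the token budget, then start a new chunk. The sketch below isolates just that step. It assumes a whitespace word count as a stand-in for the Hugging Face tokenizer so that it runs without downloading a model; `greedy_chunks` and `count_tokens` are hypothetical names.

```python
def greedy_chunks(text, chunk_size, count_tokens=lambda s: len(s.split())):
    """Pack words into chunks of at most `chunk_size` tokens.

    `count_tokens` is a stand-in (whitespace word count) for the tokenizer
    used in the notebook; it is an assumption made for this sketch.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if count_tokens(candidate) <= chunk_size:
            current = candidate          # the word still fits in the current chunk
        else:
            if current:
                chunks.append(current)   # close the full chunk
            current = word               # start a new chunk with this word
    if current:
        chunks.append(current)           # keep the last, possibly shorter, chunk
    return chunks

print(greedy_chunks("one two three four five six seven", chunk_size=3))
# ['one two three', 'four five six', 'seven']
```

With a real tokenizer the chunk boundaries fall differently, since one word can map to several tokens.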

**Create embeddings for the chunks**

In [None]:
class embeddings:
  '''
  Creates embedding vectors for the chunks using an OpenAI embedding model
  '''
  def __init__(self,model_name,openai_key):
    self.model_name = model_name
    self.openai_key = openai_key
    # Create the client once and reuse it for every request
    self.client = OpenAI(api_key = self.openai_key)

  def embeddings_wrapper(self,text):
    # Embed a single string and return its vector as a list of floats
    response = self.client.embeddings.create( model = self.model_name, input=text )
    embedding = response.data[0].embedding
    return embedding

  def get_embeddings(self,dict_chunks):
    '''
    Creates an embedding vector for every chunk
    Arguments:
    dict_chunks: dictionary mapping doc id -> {chunk id -> {"text", "metadata"}}
    Returns:
    dictionary mapping doc id -> {chunk id -> embedding vector}
    '''
    embed = { doc_id: {chunk_id: self.embeddings_wrapper(chunk_dict.get("text"))
                        for chunk_id, chunk_dict in chunks.items()
                      }
              for doc_id, chunks in dict_chunks.items()
            }
    return embed

# vector_store = embeddings("text-embedding-ada-002",openai_key)
# embedding_vector = vector_store.get_embeddings(doc_chunks)
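
To exercise the downstream steps without calling the OpenAI API, a stub with the same two methods can be swapped in. The sketch below is an assumption for offline testing only: `FakeEmbeddings` and its word-hashing scheme are hypothetical, and the vectors it produces carry no semantic meaning.

```python
import hashlib

class FakeEmbeddings:
    """Hypothetical offline stand-in for the `embeddings` class.

    Each word is hashed into one of `dim` buckets, giving a deterministic
    bag-of-words vector. This is NOT a semantic embedding.
    """
    def __init__(self, dim=16):
        self.dim = dim

    def embeddings_wrapper(self, text):
        vector = [0.0] * self.dim
        for word in text.lower().split():
            bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % self.dim
            vector[bucket] += 1.0
        return vector

    def get_embeddings(self, dict_chunks):
        # Same nested shape as the notebook: doc id -> chunk id -> vector
        return {doc_id: {chunk_id: self.embeddings_wrapper(chunk["text"])
                         for chunk_id, chunk in chunks.items()}
                for doc_id, chunks in dict_chunks.items()}

toy_chunks = {"doc-1": {"chunk-a": {"text": "Bangalore is a technology hub", "metadata": {}}}}
toy_vectors = FakeEmbeddings().get_embeddings(toy_chunks)
print(len(toy_vectors["doc-1"]["chunk-a"]))  # 16
```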

**Retrieving relevant documents from the chunks**

In [None]:
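class retriever:
  def __init__(self,embed_dict,query):
    self.embed_dict = embed_dict
    # Relies on the global `vector_store` (an `embeddings` instance) being defined before this class is instantiated
    self.query_embedding = vector_store.embeddings_wrapper(query)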

  def cosine_similarity(self,doc_embedding):
    vec1 = np.array(doc_embedding)
    vec2 = np.array(self.query_embedding)
    norm_doc = np.linalg.norm(vec1)
    norm_query = np.linalg.norm(vec2)
    if norm_doc == 0 or norm_query == 0:
      score = 0
    else:
      score = np.dot(vec1, vec2) / (norm_doc * norm_query)
    return score

  def similarity_score(self,top_k):
    # Score every chunk against the query, then keep the top_k highest scores (best first)
    cosine_score = {(doc_id , chunk_id) : self.cosine_similarity(embedding)
                      for doc_id, chunks in self.embed_dict.items()
                      for chunk_id, embedding in chunks.items()}
    cosine_score = dict(sorted(cosine_score.items(), key=lambda item: item[1], reverse=True)[:top_k])
    return cosine_score

# retriever_class = retriever(embedding_vector,prompt)
# retrieved_ids = retriever_class.similarity_score(3)
# retrieved_ids
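
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The sketch below reproduces the scoring and top-k selection on toy 2-dimensional vectors using only the standard library; the function names and the toy data are assumptions for illustration.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def top_k(embed_dict, query_vec, k):
    """Return the k best (doc id, chunk id) keys with their scores, best first."""
    scores = {(doc_id, chunk_id): cosine_similarity(vec, query_vec)
              for doc_id, chunks in embed_dict.items()
              for chunk_id, vec in chunks.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

toy = {"doc-1": {"a": [1.0, 0.0], "b": [0.0, 1.0]},
       "doc-2": {"c": [1.0, 1.0]}}
for key, score in top_k(toy, [1.0, 0.0], k=2):
    print(key, round(score, 3))
# ('doc-1', 'a') 1.0
# ('doc-2', 'c') 0.707
```

Scoring every chunk on each query is linear in the number of chunks, which is fine for a small corpus and is the part a vector database would replace.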

**Initializing LLM**

In [None]:
class LLMmanager:
  @staticmethod
  def get_llm():
    # Deterministic output (temperature 0), capped at 126 tokens
    chat_llm = ChatOpenAI(
                  openai_api_key=openai_key,
                  model_name='gpt-3.5-turbo',
                  temperature=0.0,
                  max_tokens=126 )
    return chat_llm

**Final output generation**

In [None]:
class output_generation:
  def __init__(self,user_query,retrieved_docs,process_history = True):
    self.user_query = user_query
    self.retrieved_docs = retrieved_docs
    self.process_history = process_history

  def construct_prompt(self,inputs):
    return {
        "retrieved_docs": inputs["retrieved_docs"],
        "user_query": inputs["user_query"],
        "history": inputs["history"]
            }

  def update_history(self,output):
    # Update the global history; start from an empty history when process_history is False
    global history
    if not self.process_history:
      history = []
    history.append(HumanMessage(content = self.user_query))
    history.append(AIMessage(content=output))

  @staticmethod
  def get_session_history():
    # Example: Replace this with actual logic to get the session history
    return history

  def llm_call(self):

    prompt_template = ChatPromptTemplate.from_messages([
                  ("system", """You're a helpful assistant.
Use the following information as context to answer the user's question.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer based on the context, just say that you don't know, don't try to make up an answer.
If you think the user is asking a follow-up question, you can take some information from the history.

Context: {retrieved_docs}

Please generate answers from the context provided above only and return the output in a pretty format."""),
                  MessagesPlaceholder(variable_name="history"),
                  ("user", "{user_query}")
              ])

    chat_llm = LLMmanager.get_llm()

    # Chain: pass the inputs through, fill the prompt, call the LLM, parse the reply into a string.
    # The history is supplied explicitly in invoke(), so no RunnableWithMessageHistory wrapper is used.
    custom_chain = RunnableLambda(self.construct_prompt) | prompt_template | chat_llm | StrOutputParser()

    output = custom_chain.invoke({
              "retrieved_docs": self.retrieved_docs,
              "user_query": self.user_query,
              "history": history
          })

    self.update_history(output)
    return output

In [None]:
# Initialise the list that stores the conversation history
history = []
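
The history is a flat list that grows by one question and one answer per turn. A common ordering when building the prompt is system message first, then earlier turns, then the current question. The sketch below shows that bookkeeping with plain `(role, content)` tuples standing in for `HumanMessage` and `AIMessage`; `build_messages` and `record_turn` are hypothetical helpers, not LangChain APIs.

```python
def build_messages(system_text, history, user_query):
    """Assemble messages in the order: system, earlier turns, current question."""
    return [("system", system_text)] + list(history) + [("user", user_query)]

def record_turn(history, user_query, answer):
    """Append one question/answer pair to the running history."""
    history.append(("user", user_query))
    history.append(("assistant", answer))

toy_history = []
record_turn(toy_history, "first question", "first answer")
messages = build_messages("You are a helpful assistant.", toy_history, "follow-up question")
print([role for role, _ in messages])
# ['system', 'user', 'assistant', 'user']
```

Because the list grows without bound, a longer conversation would eventually need truncation or summarisation to stay within the model's context window.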

In [None]:
prompt = 'Why is bangalore silicon valley.explain in one sentence'
prompt2 = 'which question'

In [None]:
doc_chunks = document_chunker(directory_path='text_data',
                        model_name='BAAI/bge-small-en-v1.5',
                        chunk_size=256)
# doc_chunks.keys()

vector_store = embeddings("text-embedding-ada-002",openai_key)
embedding_vector = vector_store.get_embeddings(doc_chunks)

retriever_class = retriever(embedding_vector,prompt2)
retrieved_ids = retriever_class.similarity_score(3)
# retrieved_ids

# Join the retrieved chunk texts into one context string
retrieved_docs = '\n\n'.join(doc_chunks[doc_id][chunk_id]['text']
                             for doc_id, chunk_id in retrieved_ids.keys())
# retrieved_docs

# Generate the answer; the LLM itself is initialised inside llm_call()
print(output_generation(prompt2,retrieved_docs,True).llm_call())


I'm sorry, but it seems like you haven't provided a specific question for me to answer. Could you please ask a question related to the context provided so that I can assist you better?


In [None]:
history

[HumanMessage(content='which question'),
 AIMessage(content="I'm sorry, but it seems like you haven't provided a specific question for me to answer. Could you please ask a question related to the context provided so that I can assist you better?")]

In [None]:
print(f'initial history:{history}')
print(output_generation(prompt,retrieved_docs,True).llm_call())
print(f"updated_history{history}")

initial history:[HumanMessage(content='which question'), AIMessage(content="I'm sorry, but it seems like you haven't provided a specific question for me to answer. Could you please ask a question related to the context provided so that I can assist you better?")]
Bangalore is known as the "Silicon Valley of India" because it is the nation's leading software exporter and a major semiconductor hub.
updated_history[HumanMessage(content='which question'), AIMessage(content="I'm sorry, but it seems like you haven't provided a specific question for me to answer. Could you please ask a question related to the context provided so that I can assist you better?"), HumanMessage(content='Why is bangalore silicon valley.explain in one sentence'), AIMessage(content='Bangalore is known as the "Silicon Valley of India" because it is the nation\'s leading software exporter and a major semiconductor hub.')]


In [None]:
history

[HumanMessage(content='which question'),
 AIMessage(content="I'm sorry, but it seems like you haven't provided a specific question for me to answer. Could you please ask a question related to the context provided so that I can assist you better?"),
 HumanMessage(content='Why is bangalore silicon valley.explain in one sentence'),
 AIMessage(content='Bangalore is known as the "Silicon Valley of India" because it is the nation\'s leading software exporter and a major semiconductor hub.')]