# Retrieval Augmented Generation (RAG) using a FAISS vector database and Gemma for never-empty structured data retrieval

This repo owes a great deal to [jaydeepthik/Gemma-RAG](https://github.com/jaydeepthik/Gemma-RAG/blob/main/Gemma_rag_movies.ipynb).

First, pip install the required packages.

In [None]:
!pip install accelerate bitsandbytes transformers sentence-transformers datasets langchain langchain-community matplotlib

For the faiss-gpu installation, please see the official FAISS documentation, which recommends installing via conda.

You can also look at the `environment.yml` in the repo, which is the conda export of my own environment. Feel free to use it.

Now let's begin. We use a JSON file as our dataset. Note that it has been anonymized by the `anonymizer.py` in the repo.

In [1]:
from datasets import load_dataset
import pandas as pd
from tqdm.notebook import tqdm


dataset_df = pd.read_json("result_anonymized.json")
main_dataset_df = dataset_df[["id", "OrderID", "CardID", "CardNo", "OrderCode", "RouteNo", "RouteName", "RatedHour", "DeviceCode", "MachiningDevice", "MachiningDesc", "ClampingMethod", "UsedTools", "SpindleSpeed", "CuttingSpeed", "FeedRate", "CuttingDepth", "CustName", "ProductName"]] # keep only the columns we need, which makes testing easier
main_dataset_df.head(15)


Unnamed: 0,id,OrderID,CardID,CardNo,OrderCode,RouteNo,RouteName,RatedHour,DeviceCode,MachiningDevice,MachiningDesc,ClampingMethod,UsedTools,SpindleSpeed,CuttingSpeed,FeedRate,CuttingDepth,CustName,ProductName
0,746138.173538,SmxnOBuIrjkfrlemkIiDtIQPCbLWUcZhmnOo,OpJKVEyqyBvygjSqlLwjMURdagbXeGMmTpNo,iuhJsPKPQbsjbXUmLjI,gRHBfICsHFSdZAIv,-28.111525,sG,71.337674,,BEeVUDiRIgdumoomXVh,trDkjXoVjfi,oE,FLaFJ,,,,,KKjturzCJvTzaziuem,PGawvqFpJMaPjneBaaf
1,746263.573424,LlCplfwkQROTRBaZfMpcNINAfYNQqkSrwCaC,EmlGBBueHyBkXhyCabUbAxIKvmAMpCRfeRig,losOLMpiBHlntZlmKZr,eIvApiopWxnEyLxT,110.864345,bzT,69.35478,,YrO,KQqdqhQa,,,,,,,UqEQtsVLpDGSZYaAZg,bNiCsBsmmOedzFpTUKw
2,749229.348099,wjesXVdTbsDOqoqiUvzjlpEHSYBnstIYSJEX,wAhUjaczojBFyXusshRqHExXcJRmAEOISsnt,tnfmyycnoOlXbXPZfZK,PzySElyTFXcYRyHy,47.288207,mB,264.360936,,iUtjhhOsF,CIzDLZvI,,,,,,,YXNyFScWWZYQGjvSev,LNmvWClnHKgodduYWoa
3,749244.781658,hqcFLozynguilkkNucGifbDeRmYkivjrlSiB,DhHQDkOhWMAivrRBqgQiQglPNgpFWfuzgPCF,ctSXvszYPSUPkdMYldB,WfyUoFZtkSnwSjdf,-47.295644,bCp,119.25254,,yte,zDsLALRQTUThEOV,,,,,,,shQtBMmyDMQgukDEys,BmHtZRJARJrthZRKvue
4,761728.06586,HmkZgTUkOwPZtvIOmvQeRNIPIayTIaYwFzPE,NhawkDwhUAqHSxCCRGdXusKvBYeBydKgVPRz,NVQVjVqqgYGqMCGwmIj,MJRlVaCTUfxmMyVi,12.089209,zVq,192.831837,,EyVflUAyzwfluzzuNPY,CWzi,,,,,,,pKzeRGZTovpncYMRxt,tZLXSLCjsJwOSEUpeMk
5,761644.267951,SRyTInPGkbWQlcolJuGOiAvwsrcabxOzXmte,PDIkyuqDPuffSNOtwhVezClRarRvAmucCKRp,gDenXgcpFtuDuUshAdZ,VlvxilbmvzuXgNEA,88.389511,VPk,26.979096,,shRHaWXdfVAotLGvqVp,VsUq,,,,,,,pIfminwSMCVMwcGtWY,RYBtUkHzzJZfBDvKlUy
6,740701.729192,YQClnKxeekcIRTsdlsuUGtCItgUccvVRunAz,szoXxrqVRjWmjQegEoAbUMrXSorrzuwmfjoO,VZXhimnNeAXCxMzsNoU,dvjTRcijdIitOgIv,23.033465,tP,146.12714,,uJRXO,nZVvxnVVfYlBulTCWzWjP,,,,,,,APERpWRnqePsn,tNDiU
7,740582.839113,GSCBbpGXGuuoFWfuOsilbTOSNVUBOZlofIkI,uEdVwfQMEUPAYODSykUfokXAekNDgtiVvYXN,xfCmSlHULdTmHcaARyv,RDTkIdtRjIdStjmg,38.800646,WM,66.222113,,qeFw,gKbjcRHpxFlTMSbElxnEtQ,,,,,,,cZMeoOejBsJgo,nxpcd
8,740635.727077,UtJeiUpwgyopkKaMEbNNvCvSLjLRZLTYaJca,khxnlZfyJZxgDGBvXmhVTANvGjhqyzxeFFHQ,jFZwiJHqxItUkztCDum,xGUeRruJnPPDrIZV,113.134469,gTnH,4334.099771,,hrqq,GGlQtWsKegAQgbSdPJYWCKov,,,,,,,aNRsmFWDWOGKr,Pqskb
9,740556.268447,cGFxrofcOhAPxKJTBptpNHCBsqCEkGDSOiAv,RRjIZuhIqIAyubxQhGXPaHHSyvcxZIDzBAiB,kOUbNyJrpKCxqpPwwjY,GWAgGhPWreUuffiA,62.431049,ko,-11.707966,,uSBKO,DeUofICjwftJnubjbclQU,,,,,,,ZdEpHOGUOXogd,kYzGI


The following three columns are the ones that usually appear in a user's input command. For example: `give me an order like this, but with a customer name like that`, and so on.

In [2]:
faiss_dataset_df = main_dataset_df[["OrderCode", "CustName", "ProductName"]]
faiss_dataset_df.head(8)


Unnamed: 0,OrderCode,CustName,ProductName
0,gRHBfICsHFSdZAIv,KKjturzCJvTzaziuem,PGawvqFpJMaPjneBaaf
1,eIvApiopWxnEyLxT,UqEQtsVLpDGSZYaAZg,bNiCsBsmmOedzFpTUKw
2,PzySElyTFXcYRyHy,YXNyFScWWZYQGjvSev,LNmvWClnHKgodduYWoa
3,WfyUoFZtkSnwSjdf,shQtBMmyDMQgukDEys,BmHtZRJARJrthZRKvue
4,MJRlVaCTUfxmMyVi,pKzeRGZTovpncYMRxt,tZLXSLCjsJwOSEUpeMk
5,VlvxilbmvzuXgNEA,pIfminwSMCVMwcGtWY,RYBtUkHzzJZfBDvKlUy
6,dvjTRcijdIitOgIv,APERpWRnqePsn,tNDiU
7,RDTkIdtRjIdStjmg,cZMeoOejBsJgo,nxpcd


In [3]:
from langchain_community.document_loaders import DataFrameLoader

#convert the DataFrame into LangChain Document format for further processing
#OrderCode is the main index in our case, so it becomes page_content and the remaining columns go into metadata.
#This way, if an OrderCode is picked in the final search, all rows bearing that OrderCode can come up.
loader = DataFrameLoader(main_dataset_df, page_content_column="OrderCode")
dataset_docs = loader.load()

#chunking, with chunk sizes measured by the embedding model's tokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

#hierarchy of separators to be used by the text splitter
MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]


#"thenlper/gte-small" (384-dimensional embeddings, 512-token maximum input) is used as the embedding model
EMB_MODEL_CKP = "thenlper/gte-small"
#get the embedding tokenizer
embedding_tokenizer = AutoTokenizer.from_pretrained(EMB_MODEL_CKP)


def split_documents(chunk_size, KB, tokenizer=embedding_tokenizer):
  """
    Split documents into chunks of maximum size `chunk_size` tokens and return a list of unique documents.
  """
  text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer, #tokenizer used to count the number of tokens
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True, # If `True`, includes chunk's start index in metadata
        strip_whitespace=True, # If `True`, strips whitespace from the start and end of every document
        separators=MARKDOWN_SEPARATORS, #use separators for chunking
    )

  docs_processed = []
  for doc in KB:
      docs_processed += text_splitter.split_documents([doc])

  #remove duplicates, keeping the first chunk seen for each distinct text
  unique_texts = set()
  docs_processed_unique = []
  for doc in tqdm(docs_processed):
    if doc.page_content not in unique_texts:
      unique_texts.add(doc.page_content)
      docs_processed_unique.append(doc)

  return docs_processed_unique

#split documents (the tokenizer defaults to the embedding model's tokenizer, not the checkpoint name)
docs_processed_tok = split_documents(512, dataset_docs)

  0%|          | 0/33738 [00:00<?, ?it/s]
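
The duplicate-removal step in `split_documents` keeps the first chunk for each distinct `page_content` and preserves the original order. Here is a minimal standalone sketch of that idea, with plain strings standing in for the LangChain documents. The helper name `dedupe_by_key` and the sample values are hypothetical, not part of the repo.

```python
def dedupe_by_key(items, key=lambda item: item):
    """Return items in their original order, keeping the first item per key."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


# Hypothetical chunk texts standing in for `doc.page_content`.
chunks = ["A1", "B2", "A1", "C3", "B2"]
unique_chunks = dedupe_by_key(chunks)
print(unique_chunks)  # ['A1', 'B2', 'C3']
```

Since `page_content` is the `OrderCode` in this notebook, rows that share an order code would presumably collapse to a single entry in `docs_processed_tok`.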

## Building the Vector Database using FAISS


In [5]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy


embedding_model = HuggingFaceEmbeddings(
    model_name = EMB_MODEL_CKP,
    multi_process = True,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # set True for cosine similarity
)

#create the FAISS index for nearest-neighbour search (built from the unsplit dataset_docs, one vector per row)
KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(
    dataset_docs, embedding_model, distance_strategy=DistanceStrategy.COSINE
)
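
The `normalize_embeddings=True` setting matters because, for unit-length vectors, the plain dot product equals the cosine similarity. The sketch below checks this on two toy vectors in pure Python. The helper names and the vectors are made up for illustration and have nothing to do with the real gte-small embeddings.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
cos_ab = cosine_similarity(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(round(cos_ab, 6), round(dot_normalized, 6))  # 0.96 0.96
```

For unit vectors the squared L2 distance is 2 - 2*cos, so ranking by L2 distance or by cosine similarity gives the same order.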

In [6]:
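def get_search_result(query, vector_db, k=50):
  """
  Given a `query`, search for the top `k` most similar OrderCodes embedded in the vector database.
  Returns the formatted results as one string, plus the list of retrieved OrderCodes.
  """
  #get the top k documents most similar to `query`
  retrieved_docs = vector_db.similarity_search(query=query, k=k)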
  search_result = ""
  search_result_OrderCodes = []
  for result in retrieved_docs:
      retrieved_OrderCode = result.page_content if result.page_content else "N/A"
      retrieved_CustName = result.metadata['CustName'] if result.metadata['CustName'] else "N/A"
      retrieved_ProductName = result.metadata['ProductName'] if result.metadata['ProductName'] else "N/A"
      search_result += f"OrderCode: {retrieved_OrderCode}, CustName: {retrieved_CustName}, ProductName: {retrieved_ProductName}\n" 
      search_result_OrderCodes.append(retrieved_OrderCode)

  return search_result, search_result_OrderCodes

## Load Gemma

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")


# CPU Enabled uncomment below 👇🏽
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# GPU Enabled use below 👇🏽
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")


## Use Case

In [9]:
# Conduct query with retrieval of sources
user_query = "Show me all documents with  'CustName' containing 'asdffdsa' and 'ProductName' containing '7jj7of'?"
retrieved_results, retreived_result_OrderCodes = get_search_result(user_query, KNOWLEDGE_VECTOR_DATABASE)
print('RETRIEVED RESULTS')
retrieved_results


RETRIEVED RESULTS


'OrderCode: VfdNjTmNsQzwpFB, CustName: EaERvxboSS, ProductName: FkYU\nOrderCode: rmjYDDQitXDsFqJ, CustName: kXuAFfxQuvVqZ, ProductName: oeRUXlgYYBR\nOrderCode: ttDSoFJfxFQkWfQ, CustName: ZMONpIgFGFgWW, ProductName: szqgnb\nOrderCode: VKNfoPDFsnJGVfA, CustName: sCwydzeajT, ProductName: sOaK\nOrderCode: cUYfSATGwqDlgXj, CustName: ipexUKImYPncD, ProductName: zkaXfICdizk\nOrderCode: FdOfnsFfCQuIcDj, CustName: IIYgONfEKLGyk, ProductName: bpbT\nOrderCode: JCWuQHXdslJdKfJ, CustName: WkqnNWSMVPtFY, ProductName: fI\nOrderCode: rmFeqTkqhYfpgjF, CustName: dNjACoAdalBJs, ProductName: KBYFiOirQTMEOx\nOrderCode: ccJfDXAbeCySodB, CustName: SXKQHuleowYct, ProductName: NoBviU\nOrderCode: OFdjFVckAGHKdBp, CustName: jdhLwlRQnV, ProductName: LtKF\nOrderCode: JFqJxjPSTkbCkDm, CustName: YUWwCZYDFhKMa, ProductName: Yw\nOrderCode: KMJyCufBfcCxohd, CustName: KnzCAyIKKwvfi, ProductName: KAyy\nOrderCode: NfdDFJdCjqnwpSV, CustName: hfyUMFmqlMcvp, ProductName: qyqCN\nOrderCode: rmkMFJwNcqKRvjQ, CustName: RpddJkmwr
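
The query above asks for values that appear in none of the rows shown, yet results still come back. That is how nearest-neighbour search behaves: it ranks the stored vectors by similarity and returns the top `k`, rather than filtering by a condition. It is presumably what "never-empty" in the title refers to. The toy sketch below illustrates this with a brute-force search over hypothetical 2-D vectors. It is not how FAISS is implemented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k):
    """Return the k (label, score) pairs most similar to query_vec."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-D "embeddings"; real ones come from the embedding model.
toy_index = {"order_a": [1.0, 0.0], "order_b": [0.0, 1.0], "order_c": [0.7, 0.7]}
unrelated_query = [-1.0, -0.2]  # points away from every stored vector

hits = top_k(unrelated_query, toy_index, k=2)
for label, score in hits:
    print(label, round(score, 3))
```

Even though every score here is negative, two results are returned. If you need to tell good matches from merely-closest ones, LangChain's `similarity_search_with_score` exposes the scores so you can apply a threshold of your own.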

In [10]:

combined_information = f"Answer the following query using the provided context. \n <CONTEXT>:\n{retrieved_results}. \n <QUERY>: {user_query}"

#chat template for gemma model conversation
chat = [
    { "role": "user", "content": combined_information },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Tokenize and move the tensors to the GPU (not used below; the pipeline tokenizes `prompt` itself)
input_ids = tokenizer(combined_information, return_tensors="pt").to("cuda")

pipe = pipeline(model=model,
                tokenizer = tokenizer,
                task="text-generation",
                # return_full_text=False,
                return_full_text=True,
                # max_new_tokens=500,
                max_new_tokens=5000,
                do_sample=True,
                temperature=2.1,
                top_k=50,
                top_p=1,
                #repetition_penalty=1.1,
                num_return_sequences=1,
                #add_special_tokens=True,
                )
print(f"Query: {user_query}\n")
print(pipe(prompt)[0]["generated_text"])

Query: Show me all documents with  'CustName' containing 'asdffdsa' and 'ProductName' containing '7jj7of'?

<bos><start_of_turn>user
Answer the following query using the provided context. 
 <CONTEXT>:
OrderCode: VfdNjTmNsQzwpFB, CustName: EaERvxboSS, ProductName: FkYU
OrderCode: rmjYDDQitXDsFqJ, CustName: kXuAFfxQuvVqZ, ProductName: oeRUXlgYYBR
OrderCode: ttDSoFJfxFQkWfQ, CustName: ZMONpIgFGFgWW, ProductName: szqgnb
OrderCode: VKNfoPDFsnJGVfA, CustName: sCwydzeajT, ProductName: sOaK
OrderCode: cUYfSATGwqDlgXj, CustName: ipexUKImYPncD, ProductName: zkaXfICdizk
OrderCode: FdOfnsFfCQuIcDj, CustName: IIYgONfEKLGyk, ProductName: bpbT
OrderCode: JCWuQHXdslJdKfJ, CustName: WkqnNWSMVPtFY, ProductName: fI
OrderCode: rmFeqTkqhYfpgjF, CustName: dNjACoAdalBJs, ProductName: KBYFiOirQTMEOx
OrderCode: ccJfDXAbeCySodB, CustName: SXKQHuleowYct, ProductName: NoBviU
OrderCode: OFdjFVckAGHKdBp, CustName: jdhLwlRQnV, ProductName: LtKF
OrderCode: JFqJxjPSTkbCkDm, CustName: YUWwCZYDFhKMa, ProductName: Yw
Ord
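
The output above begins with `<bos><start_of_turn>user` because `return_full_text=True` echoes the prompt, and the prompt was wrapped by Gemma's chat template. The sketch below approximates that layout for a single user turn, in plain Python. The function names are hypothetical, and `tokenizer.apply_chat_template` remains the authoritative way to build the prompt.

```python
def build_rag_message(context, query):
    """Combine retrieved context and the user query, as in the notebook cell."""
    return (
        "Answer the following query using the provided context. \n"
        f" <CONTEXT>:\n{context}. \n <QUERY>: {query}"
    )


def gemma_single_turn_prompt(user_message):
    """Approximate the Gemma chat layout for one user turn plus generation prompt."""
    return (
        "<bos><start_of_turn>user\n"
        f"{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


# Hypothetical context line and query, for illustration only.
toy_message = build_rag_message("OrderCode: A1, CustName: X, ProductName: Y\n", "find X")
toy_prompt = gemma_single_turn_prompt(toy_message)
print(toy_prompt)
```

With `return_full_text=False` the pipeline would return only the text generated after the final `<start_of_turn>model` line.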

In [11]:
combined_information

"Answer the following query using the provided context. \n <CONTEXT>:\nOrderCode: VfdNjTmNsQzwpFB, CustName: EaERvxboSS, ProductName: FkYU\nOrderCode: rmjYDDQitXDsFqJ, CustName: kXuAFfxQuvVqZ, ProductName: oeRUXlgYYBR\nOrderCode: ttDSoFJfxFQkWfQ, CustName: ZMONpIgFGFgWW, ProductName: szqgnb\nOrderCode: VKNfoPDFsnJGVfA, CustName: sCwydzeajT, ProductName: sOaK\nOrderCode: cUYfSATGwqDlgXj, CustName: ipexUKImYPncD, ProductName: zkaXfICdizk\nOrderCode: FdOfnsFfCQuIcDj, CustName: IIYgONfEKLGyk, ProductName: bpbT\nOrderCode: JCWuQHXdslJdKfJ, CustName: WkqnNWSMVPtFY, ProductName: fI\nOrderCode: rmFeqTkqhYfpgjF, CustName: dNjACoAdalBJs, ProductName: KBYFiOirQTMEOx\nOrderCode: ccJfDXAbeCySodB, CustName: SXKQHuleowYct, ProductName: NoBviU\nOrderCode: OFdjFVckAGHKdBp, CustName: jdhLwlRQnV, ProductName: LtKF\nOrderCode: JFqJxjPSTkbCkDm, CustName: YUWwCZYDFhKMa, ProductName: Yw\nOrderCode: KMJyCufBfcCxohd, CustName: KnzCAyIKKwvfi, ProductName: KAyy\nOrderCode: NfdDFJdCjqnwpSV, CustName: hfyUMFmqlMcv

In [12]:
most_frequent_OrderCode = max(set(retreived_result_OrderCodes), key=retreived_result_OrderCodes.count)
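
The `max(set(...), key=list.count)` idiom works, but it rescans the list once per distinct value, and ties are broken by set iteration order. A possible alternative is `collections.Counter`, sketched below with made-up order codes. The helper name `most_frequent` is hypothetical.

```python
from collections import Counter


def most_frequent(values):
    """Return the most common value; ties go to the value seen first."""
    if not values:
        raise ValueError("values must not be empty")
    return Counter(values).most_common(1)[0][0]


# Hypothetical retrieved order codes.
codes = ["B2", "A1", "B2", "C3", "A1", "B2"]
print(most_frequent(codes))  # B2
```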

In [13]:
most_similar_OrderCode = retreived_result_OrderCodes[0]
wanted_filtered_df = main_dataset_df[main_dataset_df['OrderCode'] == most_similar_OrderCode]
wanted_filtered_df

Unnamed: 0,id,OrderID,CardID,CardNo,OrderCode,RouteNo,RouteName,RatedHour,DeviceCode,MachiningDevice,MachiningDesc,ClampingMethod,UsedTools,SpindleSpeed,CuttingSpeed,FeedRate,CuttingDepth,CustName,ProductName
2822,671432.928126,cKUlORAWQFZHQNfqEfoaeyZGhlhyGuYqLMrs,nQfplJxzhnmuXihkgoOamYDeuHLLIsGbERZR,CnciRUTImJlspKUAleL,VfdNjTmNsQzwpFB,174.823396,bm,110.628347,,,ExpFLksePRGR,,,,,,,EaERvxboSS,FkYU


With a real, non-anonymized JSON file, `wanted_filtered_df` usually contains multiple rows.
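
To make the "multiple rows per `OrderCode`" point concrete, here is a small stdlib-only sketch that groups row dictionaries by a column. The rows are invented. In the notebook, something like `main_dataset_df.to_dict("records")` would produce dictionaries of this shape. The helper name `group_rows_by` is hypothetical.

```python
from collections import defaultdict


def group_rows_by(rows, column):
    """Group row dictionaries by the value found in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


# Hypothetical rows: one order code spanning two routing steps.
rows = [
    {"OrderCode": "A1", "RouteNo": 10},
    {"OrderCode": "A1", "RouteNo": 20},
    {"OrderCode": "B2", "RouteNo": 10},
]
by_code = group_rows_by(rows, "OrderCode")
print(len(by_code["A1"]))  # 2
```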
