<center><h1>RAG using Gemma, Langchain and ChromaDB</h1></center>
<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>


# Introduction

This notebook demonstrates how to build a retrieval augmented generation (RAG) system using Gemma as the large language model (LLM), Langchain for the tools that process input files, and ChromaDB as the vector database.

## What is RAG?

Retrieval augmented generation (RAG) is a system that improves the response generated by an LLM in two ways:
- First, the information is retrieved from a dataset that is stored in a vector database; the query is used to perform a similarity search over the documents stored in the vector database.
- Second, by restricting the context provided to the LLM to content from the vector database that is similar to the initial query, we can significantly reduce the LLM's hallucinations, since the answer is drawn from the context of the stored documents.
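
As a rough illustration of the retrieval step, the sketch below ranks a few text chunks against a query using cosine similarity. It is not the notebook's implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model, and `embed`, `cosine`, `top_k`, and the sample chunks are hypothetical names introduced only for this example.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "total assets and total liabilities on the balance sheet",
    "net cash provided by operating activities",
    "legal proceedings and risk factors",
]
best = top_k("cash provided by operating activities", chunks, k=1)
print(best)  # ['net cash provided by operating activities']
```

In the notebook itself this ranking is delegated to ChromaDB, with embeddings computed by a sentence-transformers model.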

An important advantage of this approach is that we do not need to fine-tune the LLM with our custom data; instead, the data is ingested (cleaned, transformed, chunked, and indexed in the vector database).
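
To make the chunking step concrete, here is a minimal sketch of splitting text into fixed-size chunks that overlap. It assumes a plain sliding window over characters, which is simpler than what `RecursiveCharacterTextSplitter` does (that splitter also tries to break on separators such as newlines). The function name `split_with_overlap` is hypothetical; the sizes mirror the `chunk_size=800` and `chunk_overlap=100` used later in this notebook.

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=100):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(2000))  # 2000 characters of sample text
chunks = split_with_overlap(text)
print([len(c) for c in chunks])  # [800, 800, 600]
```

The overlap means that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of indexing some text twice.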

## Procedure

We create two classes:
* AIAgent - an AI agent that queries the Gemma LLM using a custom prompt. The prompt instructs Gemma to answer the query by referring to the context that is provided with it; the generate function of the agent then returns the prompt and the answer.
* RAGSystem - initialized with an AIAgent object. In the init function of this class, we ingest the data from the dataset into the vector database. This class also has a query member function. In this function, we first perform a similarity search with the query against the vector database. Then, we call the generate function of the AI agent object. Before returning the answer, we use a predefined template to compose the overall response from the question, the prompt, the answer, and the retrieved context.
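
The way the two classes fit together can be sketched without a GPU or a vector database. The example below is an assumption-laden miniature, not the classes defined later: `StubAgent` and `TinyRAG` are hypothetical stand-ins, the agent returns a canned answer instead of calling Gemma, and retrieval uses simple word overlap.

```python
class StubAgent:
    """Hypothetical stand-in for AIAgent: builds a prompt and returns a canned answer."""

    def generate(self, query, retrieved_info):
        prompt = f"Question: {query}\nContext: {retrieved_info}\nAnswer:"
        answer = f"(stub) answered from {len(retrieved_info)} characters of context"
        return prompt, answer


class TinyRAG:
    """Hypothetical stand-in for RAGSystem, using word overlap instead of a vector database."""

    template = "\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\nContext:\n{context}"

    def __init__(self, ai_agent, chunks, num_retrieved_docs=1):
        self.ai_agent = ai_agent
        self.chunks = chunks
        self.num_docs = num_retrieved_docs

    def retrieve(self, query):
        # rank chunks by the number of words they share with the query
        words = set(query.lower().split())
        ranked = sorted(
            self.chunks,
            key=lambda chunk: len(words & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[: self.num_docs]

    def query(self, query):
        context = self.retrieve(query)
        prompt, answer = self.ai_agent.generate(query, "\n".join(context))
        return self.template.format(question=query, answer=answer, context=context)


rag = TinyRAG(StubAgent(), ["Gemma is the LLM", "ChromaDB stores the vectors"])
result = rag.query("Which component stores the vectors?")
print(result)
```

Passing the agent into the constructor keeps the retrieval logic independent of the model, so a stub like this can be swapped in when testing the pipeline.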


# Packages installation and configuration

In [None]:
# install required libraries
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install langchain
!pip install sentence-transformers
!pip install chromadb

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

from langchain.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

from IPython.display import display, Markdown


# AI Agent class

In [12]:
import transformers
import torch

class AIAgent:
    """
    Gemma 1.1 7b-it assistant.
    It loads the Gemma transformers 1.1-7b-it/1 checkpoint with 4-bit quantization.
    """
    def __init__(self, max_length=256):
        model_path = "/kaggle/input/gemma/transformers/1.1-7b-it/1"
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        # 4-bit NF4 quantization to reduce GPU memory use
        self.bnb_config = transformers.BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        # device_map is an argument of from_pretrained, not of BitsAndBytesConfig
        self.gemma_lm = AutoModelForCausalLM.from_pretrained(
            model_path,
            quantization_config=self.bnb_config,
            device_map='cuda',
        )

    def create_prompt(self, query, context):
        # prompt template
        prompt = f"""
        You are an assistant for question-answering tasks for Retrieval Augmented Generation system. 
        Use the following pieces of retrieved context to answer the question. 
        If you don't know the answer, just say that you don't know. 
        Use two sentences maximum and keep the answer concise.
        Question: {query}
        Context: {context}
        Answer:
        """
        return prompt

    def generate(self, query, retrieved_info):
        prompt = self.create_prompt(query, retrieved_info)
        # tokenize the prompt and move it to the same device as the model
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(self.gemma_lm.device)
        # Answer generation: produce at most max_length new tokens
        answer = self.gemma_lm.generate(
            input_ids,
            max_new_tokens=self.max_length
        )
        # Decode and return the answer; generate returns the prompt tokens followed
        # by the new tokens, so the decoded text starts with the prompt
        answer = self.tokenizer.decode(answer[0], skip_special_tokens=True)
        return prompt, answer
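
The recorded outputs later in this notebook repeat the whole prompt under the Answer heading. For decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so decoding the full sequence yields both. A minimal sketch of one way to keep only the new tokens is shown below. It uses plain Python lists and a hypothetical helper `strip_prompt_tokens` so that it runs without loading the model; in the class above, the equivalent slice would presumably be `answer[0][input_ids.shape[-1]:]` before decoding.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the token ids generated after the prompt."""
    return output_ids[prompt_length:]

# Made-up token ids, for illustration only
prompt_ids = [2, 101, 102, 103]
output_ids = prompt_ids + [201, 202, 1]  # generate() output: prompt + new tokens

new_tokens = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_tokens)  # [201, 202, 1]
```

Note that applying this change would alter the Answer sections of the outputs recorded below, which were produced with the full decoded sequence.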

## Test the AIAgent

In [11]:
ai_agent = AIAgent()

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Let's use the Tesla Form 10-Q quarterly report for the quarter ended September 30, 2023 as the context.

In [13]:
from langchain.document_loaders import PyPDFLoader
# create a loader
loader = PyPDFLoader("/kaggle/input/tesla10q/tsla-20230930.pdf")

# load your data
data = loader.load()
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

You have 55 document(s) in your data
There are 2593 characters in your document

# RAG System class


In [15]:
class RAGSystem:
    """Sentence-embedding based retrieval augmented generation.

    Given a PDF file, the retriever finds the num_retrieved_docs most relevant chunks.
    """
    def __init__(self, ai_agent, num_retrieved_docs=4):
        self.num_docs = num_retrieved_docs
        self.ai_agent = ai_agent
        # load the data (one document per PDF page)
        loader = PyPDFLoader("/kaggle/input/tesla10q/tsla-20230930.pdf")
        documents = loader.load()
        self.template = "\n\nQuestion:\n{question}\n\nPrompt:\n{prompt}\n\nAnswer:\n{answer}\n\nContext:\n{context}"

        # split the pages into overlapping chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=800,
            chunk_overlap=100)
        all_splits = text_splitter.split_documents(documents)
        # create a vectorstore database
        embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        self.vector_db = Chroma.from_documents(documents=all_splits,
                                               embedding=embeddings,
                                               persist_directory="chroma_db")
        # pass k explicitly; the default of 4 matches the four documents
        # shown in the recorded outputs below
        self.retriever = self.vector_db.as_retriever(search_kwargs={"k": self.num_docs})

    def retrieve(self, query):
        # retrieve top k similar documents to query
        docs = self.retriever.get_relevant_documents(query)
        return docs

    def query(self, query):
        # retrieve the context and concatenate the text of the chunks
        context = self.retrieve(query)
        data = "".join(item.page_content for item in context)
        # only the first 500 characters are passed to the model;
        # anything beyond that is not seen when generating the answer
        data = data[:500]

        # generate the answer
        prompt, answer = self.ai_agent.generate(query, data)

        return self.template.format(question=query,
                                    prompt=prompt,
                                    answer=answer,
                                    context=context)
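
The 500-character limit applied in `query` matters for the results shown below. The following sketch reuses the text of the first chunk retrieved for the "total assets" question (copied from the recorded output, shortened after the `Total assets` line) and checks what survives the cut. The helper `build_context` is a hypothetical name that mirrors the concatenate-then-slice logic of `query`.

```python
# Text of the first retrieved chunk, copied from the recorded output below
chunk = (
    "September 30,\n2023December 31,\n2022\nAssets\nCurrent assets\n"
    "Cash and cash equivalents $ 15,932  $ 16,253  \n"
    "Short-term investments 10,145  5,932  \n"
    "Accounts receivable, net 2,520  2,952  \n"
    "Inventory 13,721  12,839  \n"
    "Prepaid expenses and other current assets 2,708  2,941  \n"
    "Total current assets 45,026  40,917  \n"
    "Operating lease vehicles, net 6,119 5,035  \n"
    "Solar ener gy systems, net 5,293  5,489  \n"
    "Property , plant and equipment, net 27,744  23,548  \n"
    "Operating lease right-of-use assets 3,637  2,563  \n"
    "Digital assets, net 184 184 \n"
    "Intangible assets, net 191 215 \n"
    "Goodwill 250 194 \n"
    "Other non-current assets 5,497  4,193  \n"
    "Total assets $ 93,941  $ 82,338  \n"
)

def build_context(chunks, max_chars):
    """Concatenate chunk texts and keep only the first max_chars characters."""
    return "".join(chunks)[:max_chars]

short = build_context([chunk], 500)
longer = build_context([chunk], 800)
print("Total assets $ 93,941" in short)   # False
print("Total assets $ 93,941" in longer)  # True
```

With a 500-character limit, the `Total assets` line never reaches the model, while the `Total current assets 45,026` line does, which is the figure quoted in the recorded answer. One possible adjustment, not tested here, is to raise the limit, keeping prompt length and GPU memory in mind.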
        
        

In [16]:
def colorize_text(text):
    for word, color in zip(["Question", "Prompt", "Answer", "Context"], ["blue", "magenta", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Test the RAG system

In [18]:
rag_system = RAGSystem(ai_agent)

Let's first try a few questions about the data we ingested into the retrieval system.

In [19]:
answer = rag_system.query("What's the total assets?")
display(Markdown(colorize_text(answer)))





**<font color='blue'>Question:</font>**
What's the total assets?

**<font color='magenta'>Prompt:</font>**

        You are an assistant for question-answering tasks for Retrieval Augmented Generation system. 
        Use the following pieces of retrieved context to answer the question. 
        If you don't know the answer, just say that you don't know. 
        Use two sentences maximum and keep the answer concise.
        Question: What's the total assets?
        Context: September 30,
2023December 31,
2022
Assets
Current assets
Cash and cash equivalents $ 15,932  $ 16,253  
Short-term investments 10,145  5,932  
Accounts receivable, net 2,520  2,952  
Inventory 13,721  12,839  
Prepaid expenses and other current assets 2,708  2,941  
Total current assets 45,026  40,917  
Operating lease vehicles, net 6,119 5,035  
Solar ener gy systems, net 5,293  5,489  
Property , plant and equipment, net 27,744  23,548  
Operating lease right-of-use assets 3,637  2,563  
Digi
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an assistant for question-answering tasks for Retrieval Augmented Generation system. 
        Use the following pieces of retrieved context to answer the question. 
        If you don't know the answer, just say that you don't know. 
        Use two sentences maximum and keep the answer concise.
        Question: What's the total assets?
        Context: September 30,
2023December 31,
2022
Assets
Current assets
Cash and cash equivalents $ 15,932  $ 16,253  
Short-term investments 10,145  5,932  
Accounts receivable, net 2,520  2,952  
Inventory 13,721  12,839  
Prepaid expenses and other current assets 2,708  2,941  
Total current assets 45,026  40,917  
Operating lease vehicles, net 6,119 5,035  
Solar ener gy systems, net 5,293  5,489  
Property , plant and equipment, net 27,744  23,548  
Operating lease right-of-use assets 3,637  2,563  
Digi
        Answer:
        The total assets as of December 31, 2022, was $45,026 million.

**<font color='green'>Context:</font>**
[Document(page_content='September 30,\n2023December 31,\n2022\nAssets\nCurrent assets\nCash and cash equivalents $ 15,932  $ 16,253  \nShort-term investments 10,145  5,932  \nAccounts receivable, net 2,520  2,952  \nInventory 13,721  12,839  \nPrepaid expenses and other current assets 2,708  2,941  \nTotal current assets 45,026  40,917  \nOperating lease vehicles, net 6,119 5,035  \nSolar ener gy systems, net 5,293  5,489  \nProperty , plant and equipment, net 27,744  23,548  \nOperating lease right-of-use assets 3,637  2,563  \nDigital assets, net 184 184 \nIntangible assets, net 191 215 \nGoodwill 250 194 \nOther non-current assets 5,497  4,193  \nTotal assets $ 93,941  $ 82,338  \nLiabilities\nCurrent liabilities\nAccounts payable $ 13,937  $ 15,255  \nAccrued liabilities and other 7,636  7,142  \nDeferred revenue 2,206  1,747', metadata={'page': 5, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='Table of Contents\nNote 10 – Variable Interest Entity Arrangements\nThe aggregate carrying values of the variable interest entities’ assets and liabilities, after elimination of any intercompany\ntransactions and balances, in the consolidated balance sheets were as follows (in millions):\nSeptember 30,\n2023December 31,\n2022\nAssets   \nCurrent assets   \nCash and cash equivalents $ 78 $ 68 \nAccounts receivable, net 30 22 \nPrepaid expenses and other current assets 341 274 \nTotal current assets 449 364 \nSolar energy systems, net 3,921 4,060 \nOther non-current assets 390 404 \nTotal assets $ 4,760 $ 4,828 \nLiabilities   \nCurrent liabilities   \nAccrued liabilities and other $ 73 $ 69 \nDeferred revenue 10 10 \nCurrent portion of debt and finance leases 1,512 1,013 \nTotal current liabilities 1,595 1,092', metadata={'page': 32, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='Operating lease vehicles (1,858) (1,136)\nPrepaid expenses and other current assets 322 (865)\nOther non-current assets (2,655) 
(1,580)\nAccounts payable and accrued liabilities (24) 4,659 \nDeferred revenue 774 856 \nCustomer deposits (95) 251 \nOther long-term liabilities 2,066 1,016 \nNet cash provided by operating activities 8,886 11,446 \nCash Flows from Investing Activities\nPurchases of property and equipment excluding finance leases, net of sales (6,592) (5,300)\nPurchases of solar energy systems, net of sales — (5)\nProceeds from sales of digital assets — 936 \nPurchase of intangible assets — (9)\nPurchases of investments (13,221) (1,467)\nProceeds from maturities of investments 8,959 3 \nProceeds from sales of investments 138 — \nBusiness combinations, net of cash acquired (64) —', metadata={'page': 10, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='Table of Contents\nThe following table presents long-lived assets by geographic area (in millions):\nSeptember 30,\n2023December 31,\n2022\nUnited States $ 25,162 $ 21,667 \nGermany 4,008 3,547 \nChina 2,786 2,978 \nOther international 1,081 845 \nTotal $ 33,037 $ 29,037 \nThe following table presents inventory by reportable segment (in millions):\nSeptember 30,\n2023December 31,\n2022\nAutomotive $ 11,398 $ 10,996 \nEnergy generation and storage 2,323 1,843 \nTotal $ 13,721 $ 12,839 \n26', metadata={'page': 34, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'})]

Let's also try some "fresh" questions.

In [20]:
answer = rag_system.query("How much cash was provided by or used in operating activities during the quarter?")
display(Markdown(colorize_text(answer)))





**<font color='blue'>Question:</font>**
How much cash was provided by or used in operating activities during the quarter?

**<font color='magenta'>Prompt:</font>**

        You are an assistant for question-answering tasks for Retrieval Augmented Generation system. 
        Use the following pieces of retrieved context to answer the question. 
        If you don't know the answer, just say that you don't know. 
        Use two sentences maximum and keep the answer concise.
        Question: How much cash was provided by or used in operating activities during the quarter?
        Context: Table of Contents
See Note 1, Summary of Significant Accounting Policies, to the consolidated financial statements included elsewhere in this
Quarterly Report on Form 10-Q for further details.
Liquidity and Capital Resources
We expect to continue to generate net positive operating cash flow as we have done in the last five fiscal years. The cash we
generate from our core operations enables us to fund ongoing operations and production, our research and development projects for
new products and te
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an assistant for question-answering tasks for Retrieval Augmented Generation system. 
        Use the following pieces of retrieved context to answer the question. 
        If you don't know the answer, just say that you don't know. 
        Use two sentences maximum and keep the answer concise.
        Question: How much cash was provided by or used in operating activities during the quarter?
        Context: Table of Contents
See Note 1, Summary of Significant Accounting Policies, to the consolidated financial statements included elsewhere in this
Quarterly Report on Form 10-Q for further details.
Liquidity and Capital Resources
We expect to continue to generate net positive operating cash flow as we have done in the last five fiscal years. The cash we
generate from our core operations enables us to fund ongoing operations and production, our research and development projects for
new products and te
        Answer:
        The provided text does not contain any information regarding the amount of cash provided or used in operating activities during the quarter, so I am unable to answer this question from the given context.

**<font color='green'>Context:</font>**
[Document(page_content='Table of Contents\nSee Note 1, Summary of Significant Accounting Policies, to the consolidated financial statements included elsewhere in this\nQuarterly Report on Form 10-Q for further details.\nLiquidity and Capital Resources\nWe expect to continue to generate net positive operating cash flow as we have done in the last five fiscal years. The cash we\ngenerate from our core operations enables us to fund ongoing operations and production, our research and development projects for\nnew products and technologies including our proprietary battery cells, additional manufacturing ramps at existing manufacturing\nfacilities, the construction of future factories, and the continued expansion of our retail and service locations, body shops, Mobile', metadata={'page': 47, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='Table of Contents\nWe continue adapting our strategy to meet our liquidity and risk objectives, such as investing in U.S. government and other\ninvestments, to do more vertical integration, expand our product roadmap and provide financing options to our customers.\nSummary of Cash Flows\n Nine Months Ended September 30,\n(Dollars in millions) 2023 2022\nNet cash provided by operating activities $ 8,886 $ 11,446 \nNet cash used in investing activities $ (10,780)$ (5,842)\nNet cash provided by (used in) financing activities $ 1,702 $ (3,032)\nCash Flows from Operating Activities\nNet cash provided by operating activities decreased by $2.56 billion to $8.89 billion during the nine months ended', metadata={'page': 49, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='During the three and nine months ended September 30, 2023, our net income attributable to common stockholders was $1.85\nbillion and $7.07 billion, respectively, representing unfavorable changes of $1.44 billion and $1.80 billion, respectively, over the same\nperiods ended September 30, 2022. 
We continue to focus on further cost reductions and operational efficiencies while maximizing\ndelivery volumes.\nWe ended the third quarter of 2023 with $26.08 billion in cash and cash equivalents and investments, representing an increase\nof $3.89 billion from the end of 2022. Our cash flows provided by operating activities during the nine months ended September 30,\n2023 and 2022 were $8.89 billion and $11.45 billion, respectively, representing a decrease of $2.56 billion. Capital expenditures', metadata={'page': 35, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='Table of Contents\nTesla, Inc.\nConsolidated Statements of Cash Flows\n(in millions)\n(unaudited)\n Nine Months Ended September 30,\n 2023 2022\nCash Flows from Operating Activities\nNet income $ 7,031 $ 8,880 \nAdjustments to reconcile net income to net cash provided by operating activities:\nDepreciation, amortization and impairment 3,435 2,758 \nStock-based compensation 1,328 1,141 \nInventory and purchase commitments write-downs 361 118 \nForeign currency transaction net unrealized (gain) loss (317) 1 \nNon-cash interest and other operating activities 94 159 \nDigital assets loss, net — 106 \nChanges in operating assets and liabilities:\nAccounts receivable 377 (426)\nInventory (1,953) (4,492)\nOperating lease vehicles (1,858) (1,136)\nPrepaid expenses and other current assets 322 (865)', metadata={'page': 10, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'})]

In [21]:
answer = rag_system.query("What are the biggest risks for Tesla as a business?")
display(Markdown(colorize_text(answer)))





**<font color='blue'>Question:</font>**
What are the biggest risks for Tesla as a business?

**<font color='magenta'>Prompt:</font>**

        You are an assistant for question-answering tasks for Retrieval Augmented Generation system. 
        Use the following pieces of retrieved context to answer the question. 
        If you don't know the answer, just say that you don't know. 
        Use two sentences maximum and keep the answer concise.
        Question: What are the biggest risks for Tesla as a business?
        Context: TESLA, INC.
FORM 10-Q FOR THE QUARTER ENDED SEPTEMBER 30, 2023
INDEX
  Page
PART I. FINANCIAL INFORMATION
Item 1. Financial Statements 4
Consolidated Balance Sheets 4
Consolidated Statements of Operations 5
Consolidated Statements of Comprehensive Income 6
Consolidated Statements of Redeemable Noncontrolling Interests and Equity 7
Consolidated Statements of Cash Flows 9
Notes to Consolidated Financial Statements 10
Item 2. Management's Discussion and Analysis of Financial Condition and Results o
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an assistant for question-answering tasks for Retrieval Augmented Generation system. 
        Use the following pieces of retrieved context to answer the question. 
        If you don't know the answer, just say that you don't know. 
        Use two sentences maximum and keep the answer concise.
        Question: What are the biggest risks for Tesla as a business?
        Context: TESLA, INC.
FORM 10-Q FOR THE QUARTER ENDED SEPTEMBER 30, 2023
INDEX
  Page
PART I. FINANCIAL INFORMATION
Item 1. Financial Statements 4
Consolidated Balance Sheets 4
Consolidated Statements of Operations 5
Consolidated Statements of Comprehensive Income 6
Consolidated Statements of Redeemable Noncontrolling Interests and Equity 7
Consolidated Statements of Cash Flows 9
Notes to Consolidated Financial Statements 10
Item 2. Management's Discussion and Analysis of Financial Condition and Results o
        Answer:
        Tesla faces significant risks related to manufacturing and supply chain disruptions, as well as competition from established automakers.

**<font color='green'>Context:</font>**
[Document(page_content="TESLA, INC.\nFORM 10-Q FOR THE QUARTER ENDED SEPTEMBER 30, 2023\nINDEX\n  Page\nPART I. FINANCIAL INFORMATION\nItem 1. Financial Statements 4\nConsolidated Balance Sheets 4\nConsolidated Statements of Operations 5\nConsolidated Statements of Comprehensive Income 6\nConsolidated Statements of Redeemable Noncontrolling Interests and Equity 7\nConsolidated Statements of Cash Flows 9\nNotes to Consolidated Financial Statements 10\nItem 2. Management's Discussion and Analysis of Financial Condition and Results of Operations 27\nItem 3. Quantitative and Qualitative Disclosures about Market Risk 36\nItem 4. Controls and Procedures 36\nPART II. OTHER INFORMATION\nItem 1. Legal Proceedings 36\nItem 1A. Risk Factors 37\nItem 2. Unregistered Sales of Equity Securities and Use of Proceeds 37", metadata={'page': 2, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='occurred. We cannot predict the outcome or impact of any ongoing matters. Should the government decide to pursue an enforcement\naction, there exists the possibility of a material adverse impact on our business, results of operation, prospects, cash flows, financial\nposition or brand.\nWe are also subject to various other legal proceedings, risks and claims that arise from the normal course of business\nactivities. For example, during the second quarter of 2023, a foreign news outlet reported that it obtained certain misappropriated data\nincluding, purportedly, among other things, non-public Tesla business and personal information. 
Tesla has made notifications to\npotentially affected individuals (current and former employees) and regulatory authorities and we are working with certain law', metadata={'page': 31, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='Table of Contents\nTesla, Inc.\nConsolidated Statements of Operations\n(in millions, except per share data)\n(unaudited)\n Three Months Ended September 30, Nine Months Ended September 30,\n 2023 2022 2023 2022\nRevenues\nAutomotive sales $ 18,582  $ 17,785  $ 57,879  $ 46,969  \nAutomotive regulatory credits 554 286 1,357  1,309  \nAutomotive leasing 489 621 1,620  1,877  \nTotal automotive revenues 19,625  18,692  60,856  50,155  \nEnergy generation and storage 1,559  1,117 4,597  2,599  \nServices and other 2,166  1,645  6,153  4,390  \nTotal revenues 23,350  21,454  71,606  57,144  \nCost of r evenues\nAutomotive sales 15,656  13,099  47,919  34,166  \nAutomotive leasing 301 381 972 1,157  \nTotal automotive cost of revenues 15,957  13,480  48,891  35,323', metadata={'page': 6, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'}), Document(page_content='Cal. Health & Saf. Code § 25100 et seq. and Cal. Civil Code § 1798.80. Tesla has implemented various remedial measures, including\nconducting training and audits, and enhancements to its site waste management programs, and settlement discussions are ongoing.\nWhile the outcome of this matter cannot be determined at this time, it is not currently expected to have a material adverse impact on\nour business.', metadata={'page': 50, 'source': '/kaggle/input/tesla10q/tsla-20230930.pdf'})]

# Conclusions

We tested a RAG system built with Gemma as the LLM, Langchain for the data loading utilities, and ChromaDB as the vector database. 
The RAG system is initialized with a dataset, which is used to populate the vector database, and with an AI Agent, which queries Gemma with the initial query and the retrieved context.
To verify that the result is composed from the context provided, we also include the context in the exported result.
