# RAG with ChromaDB, Langchain, Mistral

First, we'll install the dependencies and set up the model.

In [2]:
!pip install --force-reinstall -Uq 'torch==2.2.2' datasets accelerate peft bitsandbytes transformers trl 'numpy<2.0' langchain chromadb openai tiktoken sentence-transformers pypdf langchain-community langchain-huggingface

In [4]:
!pip install -Uq pipreqsnb

In [8]:
!pip freeze

accelerate==0.31.0
aiohttp==3.9.5
aiosignal==1.3.1
alembic @ file:///home/conda/feedstock_root/build_artifacts/alembic_1705179948871/work
altair @ file:///home/conda/feedstock_root/build_artifacts/altair-split_1711824856061/work
annotated-types==0.7.0
anyio==4.4.0
archspec @ file:///home/conda/feedstock_root/build_artifacts/archspec_1708969572489/work
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1692818318753/work
argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1695386553988/work
arrow @ file:///home/conda/feedstock_root/build_artifacts/arrow_1696128962909/work
asgiref==3.8.1
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
async-generator==1.10
async-lru @ file:///home/conda/feedstock_root/build_artifacts/async-lru_1690563019058/work
attrs==23.2.0
Babel @ file:///home/conda/feedstock_root/build_artifacts/babel_1702422572539/work
backcall==0.2.0
backoff==2.2.1
bcrypt==

In [6]:
import transformers
import torch
import datasets
import accelerate
import peft
import bitsandbytes
import trl
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import AsyncChromiumLoader
# from langchain.document_loaders import CSVLoader
from langchain.vectorstores import FAISS
import nest_asyncio
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough, RunnableSequence
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFacePipeline
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.output_parsers.json import SimpleJsonOutputParser
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoModel, MistralForCausalLM
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
# from langchain.chains import LLMChain
from langchain.chains import RetrievalQA
from time import time

This sets up the tokeniser. This breaks the text up into tokens (chunks) which can be individual words or fragments of words.

I'm going to use Mistral 7B as it offers a good performance at a low overhead of processing and memory.

In [7]:
model_name='../models/Mistral-7B-Instruct-v0.1'

model_config = transformers.AutoConfig.from_pretrained(
    model_name,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#### Quantization of the Model

I'm going to quantize the model to 4 bits. This lowers the precision of the data types (int4 vs fp16 or fp32), which reduces the overheads even further.

In [5]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [6]:
#################################################################
# Set up quantization config
#################################################################
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

Your GPU supports bfloat16: accelerate training with bf16=True


In [7]:
torch.cuda.get_device_capability()

(8, 6)

In [8]:
#################################################################
# Load pre-trained config
#################################################################
# model = AutoModelForCausalLM.from_pretrained(
# model = AutoModel.from_pretrained(
model = MistralForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

#### Let's test it..

This query asks the model a question. We haven't loaded any of our data into it yet, this is all information held within the model from its training data set.

In [9]:
inputs_not_chat = tokenizer.encode_plus("[INST] What is Designed4Devops? [/INST]", return_tensors="pt")['input_ids'].to('cuda')

generated_ids = model.generate(inputs_not_chat, 
                               max_new_tokens=1000, 
                               do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['<s> [INST] What is Designed4Devops? [/INST] Designed4DevOps is an open-source, automated testing framework for DevOps pipeline management for Windows, Mac, and Linux operating systems. It is designed to be easy to use and offers a wide range of features, including continuous integration and continuous delivery support, automated testing for both traditional Microsoft technologies and modern open source technologies, and built-in debugging and reporting capabilities. Designed4DevOps was originally developed by Microsoft, and it is released under the MIT license.</s>']


#### Create the ChromaDB vector database

In [12]:
nest_asyncio.apply()
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the book
loader = PyPDFLoader("/home/jovyan/docker-shared-data/rag-data/d4do_paperback.pdf")
documents = loader.load()

# Chunk text
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunked_documents = text_splitter.split_documents(documents)

# Load chunked documents into the Chroma index
db = Chroma.from_documents(chunked_documents, embedding_function)

# Connect query to Chroma index using a retriever
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 4}
)

  warn_deprecated(


#### Test the vector store

This tests that the data exists within the vector store.

In [13]:
query = "What can designed4devops do for my organisation?"
docs = db.similarity_search(query)
print(docs[0].page_content)

designed4devops


#### Create the LLM chain

To create a symantically aware search, we need to store the context of the question, and engineer a prompt that focuses the model on answering questions using the data from our vector store instead of making it up (hallucinating). Prompt engineering is a way to coach the model into giving the sort of answers that you want return and filter those that you don't.

In [None]:
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
    do_sample=True
)

prompt_template = """
### [INST] 
Instruction: Answer the question based on your knowledge from the files in the vector database only.
Don't use expletives or bad language.
If you can't answer the appoligise and say you don't know.
Don't make up answers. Please limit your answer to 500 words or less. Here is context to help:

{context}

### QUESTION:
{question} 

[/INST]
 """

# mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Create llm chain 
# llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
llm_chain = prompt | llm | StrOutputParser()

In [37]:
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    # result = qa.run(query)
    result = rag_chain.invoke(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

#### Create RAG Chain

This chains the prompt with the question to _hopefully_ get a strong answer.

In [38]:
query = "Summarize the text in the vector database." 

test_rag(qa, query)

Query: Summarize the text in the vector database.

Inference time: 28.068 sec.

Result:  
### [INST] 
Instruction: Answer the question based on your knowledge from the files in the vector database only.
Don't use expletives or bad language.
If you can't answer the appoligise and say you don't know.
Don't make up answers. Please limit your answer to 200 words or less. Here is context to help:

[Document(page_content="Part III - WHAT - A Model Value Stream- Phase 1 - Design \n \n \n100 They will interview users, conduct workshops, and complete surveys using analytical methods and their \nexperience to understand the requirements and their context better. They will turn the requirements into \nuser stories. \nA user story is a user-centric definition of what the target user tries to achieve with the requirement. It \nwill be unambiguous and allow a developer to code the functionality without a back-and-forth exchange \nof information that will slow down cadence. \nA user story might look 

## The Result:

In [39]:
query = "How can I start a digital transformation programme on my portfolio of digital products using designed4devops?"

test_rag(qa, user_input)

Query: How do I make the transition of novemes more efficient?

Inference time: 28.967 sec.

Result:  
### [INST] 
Instruction: Answer the question based on your knowledge from the files in the vector database only.
Don't use expletives or bad language.
If you can't answer the appoligise and say you don't know.
Don't make up answers. Please limit your answer to 200 words or less. Here is context to help:

[Document(page_content='Designed4: Selecting Novemes \n \n \n95 \uf0b7 Security vulnerabilities - any critical security vulnerabilities should be able to jump the queue! \nAnother selection criterion that might be important is the cost/benefit relationship identified at the \nestimation stage. Many organizations expect a return on investments within a specified period. Not all \ninvestments will be financial. For example, you may have a security bug that requires fixing. You may be \nmorally or contractually obliged to remedy the known bug within a defined period of its discovery or \

Let's ask a very specific question.

In [40]:
query = "My developers produce working digital software products and them hand them to the operations team to manually install them in production systems. How do I reduce my lead time?"

test_rag(qa, user_input)

Query: How do I make the transition of novemes more efficient?

Inference time: 30.03 sec.

Result:  
### [INST] 
Instruction: Answer the question based on your knowledge from the files in the vector database only.
Don't use expletives or bad language.
If you can't answer the appoligise and say you don't know.
Don't make up answers. Please limit your answer to 200 words or less. Here is context to help:

[Document(page_content='Designed4: Selecting Novemes \n \n \n95 \uf0b7 Security vulnerabilities - any critical security vulnerabilities should be able to jump the queue! \nAnother selection criterion that might be important is the cost/benefit relationship identified at the \nestimation stage. Many organizations expect a return on investments within a specified period. Not all \ninvestments will be financial. For example, you may have a security bug that requires fixing. You may be \nmorally or contractually obliged to remedy the known bug within a defined period of its discovery or \n

# Conclusion

This was quite simple to set up on a local laptop. It demonstrates that generative AI is achievable with modest resources and in short time periods. Before you jump in, be sure to check out my blog on [Generative AI and RAG Security](https://).

I'll be taking this project further and blogging along the way. I'll be talking about the environment I used to build this demo, how I productionise the system, package it and host it, and adding a front end so that you can interact with the book yourselves!

You can download this blog as a Jupyter notebook file [here](https://github.com/tudor-james/ai-playground/blob/main/mistral-rag-langchain-chromadb.ipynb). As ever, if you need help with AI projects you can get in touch with Methods or contact us via LinkedIn.

In [35]:
user_input = input('What is your question?')
# rag_chain.invoke(user_input)
test_rag(qa, user_input)

What is your question? How do I make the transition of novemes more efficient?


Query: How do I make the transition of novemes more efficient?

Inference time: 26.657 sec.

Result:  
### [INST] 
Instruction: Answer the question based on your knowledge from the files in the vector database only.
Don't use expletives or bad language.
If you can't answer the appoligise and say you don't know.
Don't make up answers. Please limit your answer to 200 words or less. Here is context to help:

[Document(page_content='Designed4: Selecting Novemes \n \n \n95 \uf0b7 Security vulnerabilities - any critical security vulnerabilities should be able to jump the queue! \nAnother selection criterion that might be important is the cost/benefit relationship identified at the \nestimation stage. Many organizations expect a return on investments within a specified period. Not all \ninvestments will be financial. For example, you may have a security bug that requires fixing. You may be \nmorally or contractually obliged to remedy the known bug within a defined period of its discovery or \

### Show information sources for a query

In [1]:
doc_search = input('What would you like to look for?')
docs = db.similarity_search(doc_search)
print(f"Query: {doc_search}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    doc_details = doc.to_json()['kwargs']
    print("Source: ", doc_details['metadata']['source'])
    print("Text: ", doc_details['page_content'], "\n")

What would you like to look for? What page is security on?


NameError: name 'db' is not defined