# The hitchhiker's guide to Jupyter (part 5/n)

## Let's create a Generative AI chatbot using RAG to talk to my book

I'm going to implement a local chatbot on my laptop to talk to my book, ["Designed4Devops"](https://designed4devops.com). This will allow a user to ask questions of the book and summarise its contents. My book is self-published and copyrighted, so it shouldn't appear in models' training data. To achieve this I'm going to use RAG, or _Retrieval Augmented Generation_.

## RAG

RAG is a technique that lets you give an LLM access to data that was not available when the model was trained, without retraining or fine-tuning it. Training models requires access to large, and often numerous, high-end GPUs, which can be expensive. It also has the downside that whenever you want to update the data, you need to retrain the model.

RAG overcomes this by taking the data (e.g., PDF, CSV, HTML) and vectorising it. Remember that models work by matrix multiplication of numbers, not text. We use an embedding model to convert the text into numeric vectors and keep them in a vector store. This allows the data to be queried with semantic search: the passages whose meaning is closest to the question are retrieved and passed to the LLM, which then answers based on that context.
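
To make the idea of semantic search concrete, here is a minimal sketch that ranks stored passages by cosine similarity to a query vector. The three-dimensional vectors and the passages are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and a vector store does this ranking for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", hand-written for this example.
toy_store = {
    "Release small changes more often.": [0.9, 0.1, 0.0],
    "Map the value stream to find waste.": [0.2, 0.9, 0.1],
    "Stir the risotto so the rice does not burn.": [0.0, 0.1, 0.9],
}

def search(query_vector, store, k=1):
    """Return the k stored texts whose vectors are closest to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]

# Pretend this is the embedding of "How do I ship changes faster?"
query_vector = [0.8, 0.2, 0.1]
print(search(query_vector, toy_store))
```

The query vector points in almost the same direction as the first passage's vector, so that passage is returned even though the two texts share no exact wording.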

Let's set up the model.

### First, we'll install the dependencies and set up the model.

In [None]:
!pip install --force-reinstall -Uq torch datasets accelerate peft bitsandbytes transformers trl

In [1]:
import transformers
import torch
import datasets
import accelerate
import peft
import bitsandbytes
import trl

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

Next we set up the tokeniser. It breaks the text up into tokens (chunks), which can be individual words or fragments of words.
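
As a rough illustration of "words or fragments of words", the sketch below splits a word into the longest pieces found in a tiny, hand-written vocabulary. This is a toy: the vocabulary and the `toy_tokenize` helper are hypothetical, and the real Mistral tokeniser uses a learned vocabulary and a different algorithm.

```python
def toy_tokenize(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical vocabulary, chosen only for this example.
toy_vocab = {"design", "ed", "dev", "ops", "4", "book"}

print(toy_tokenize("book", toy_vocab))             # ['book']
print(toy_tokenize("designed4devops", toy_vocab))  # ['design', 'ed', '4', 'dev', 'ops']
```

A common word maps to a single token, while an unusual word is built from several smaller fragments.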

I'm going to use Mistral 7B as it offers good performance with a low processing and memory overhead.
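
To put the memory overhead into numbers, here is a back-of-the-envelope estimate of the space needed just to hold the weights. It assumes roughly 7 billion parameters (taken from the "7B" in the model name) and ignores activations and other runtime overhead. The 4-bit row is included because the notebook quantizes the model to 4 bits later on.

```python
def weight_memory_gib(num_parameters, bits_per_weight):
    """Approximate memory needed to hold the weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

# Assumption: about 7 billion parameters, as the "7B" in the name suggests.
ASSUMED_PARAMETERS = 7_000_000_000

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(ASSUMED_PARAMETERS, bits):.1f} GiB")
```

Under these assumptions the weights need about 26 GiB at fp32, 13 GiB at fp16 and 3.3 GiB at 4 bits, which is the difference between needing a data-centre GPU and fitting on a laptop GPU.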

### Let's load the model

In [3]:
model_name='../models/Mistral-7B-Instruct-v0.1'

model_config = transformers.AutoConfig.from_pretrained(model_name, low_cpu_mem_usage=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#### Quantization of the Model

I'm going to quantize the model to 4 bits. This lowers the precision used to store the weights (4-bit values instead of fp16 or fp32), which reduces the memory overhead even further.
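
The sketch below shows the basic idea of trading precision for space, using simple uniform "absmax" quantization in plain Python. It is not what `bitsandbytes` does internally: the `nf4` type selected in this notebook spaces its 16 levels differently, so treat this purely as an illustration. The sample weights are made up.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    max_level = 2 ** (bits - 1) - 1  # 7 for 4 bits
    largest = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = largest / max_level if largest else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]

# Hypothetical weights, chosen only for this example.
weights = [0.70, -0.33, 0.12, 0.01, -0.62]

q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)  # [7, -3, 1, 0, -6]
for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

Each weight is now stored as a small integer plus one shared scale. The restored values are close to the originals but not identical, and that loss of precision is the price paid for the smaller footprint.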

In [4]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [5]:
#################################################################
# Set up quantization config
#################################################################
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

Your GPU supports bfloat16: accelerate training with bf16=True


In [6]:
torch.cuda.get_device_capability()

(8, 6)

In [7]:
#################################################################
# Load pre-trained config
#################################################################
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

#### Let's test it

This query asks the model a question. We haven't loaded any of our data into it yet, so the answer comes entirely from information held within the model from its training data set.

In [8]:
messages = [{
    "role":"user",
    "content": "Can you tell us 3 reasons why Eryri is a good place to visit?"
}]

tokenizer.apply_chat_template(messages, tokenize=True)

model_inputs = tokenizer.apply_chat_template(messages, return_tensors = "pt").to('cuda:0')

In [9]:
model_inputs

tensor([[    1,   733, 16289, 28793,  2418,   368,  1912,   592, 28705, 28770,
          6494,  2079,   413,   643,   373,   349,   264,  1179,  1633,   298,
          3251, 28804,   733, 28748, 16289, 28793]], device='cuda:0')

In [12]:
generated_ids = model.generate(
    model_inputs,
    max_new_tokens = 1000,
    do_sample = True,
    pad_token_id=tokenizer.eos_token_id,
)

decoded = tokenizer.batch_decode(generated_ids)

print(decoded[0])

<s> [INST] Can you tell us 3 reasons why Eryri is a good place to visit? [/INST] 1. Scenic Beauty: Eryri is known for its stunning natural landscapes and picturesque views. From the majestic mountains to the serene lakes and rolling hills, the region offers a range of breathtaking sights to behold. The area is also home to several national parks and nature reserves, providing visitors with opportunities to hike, bike, and explore the great outdoors.

2. Rich Cultural Heritage: Eryri has a rich and diverse cultural heritage, with a number of ancient sites, historic buildings, and museums to explore. The region is home to several castles, including the famous Conwy Castle, which dates back to the 13th century. Visitors can also learn about the region's Celtic and medieval history at the National Museum of Wales in Conwy.

3. Delicious Food and Drink: Eryri is known for its delicious cuisine, which combines traditional Welsh flavors with modern twists. The region is famous for its lamb, b

___The model is working!___

#### Create the vector database

I'm going to use ChromaDB, which is a lightweight local vector store, to hold the embeddings of the book's text that will come from the PDF.

In [None]:
!pip install --force-reinstall -Uq langchain chromadb openai tiktoken sentence-transformers pypdf fastembed

In [13]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import AsyncChromiumLoader
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.document_loaders import PyPDFLoader

In [14]:
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

# Load the book
loader = PyPDFLoader("/tf/docker-shared-data/rag-data/d4do_paperback.pdf")
documents = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150,
    chunk_overlap  = 50,
    length_function = len,
    is_separator_regex = False,
)

chunks = text_splitter.split_documents(documents)
chunks[0]

store = Chroma.from_documents(
    chunks,
    embeddings,
    ids = [f"{item.metadata['source']}-{index}" for index, item in enumerate(chunks)],
    collection_name="D4DO-Embeddings"
)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]
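
The `chunk_size` and `chunk_overlap` settings above control how the book is cut up before embedding. As a simplified sketch of what overlap means, the function below cuts text into fixed-size windows that share characters with their neighbours. The `sliding_chunks` helper is hypothetical, and `RecursiveCharacterTextSplitter` is smarter than this because it prefers to break on separators such as paragraphs, lines and spaces.

```python
def sliding_chunks(text, chunk_size=150, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # This window already reaches the end of the text.
            break
    return chunks

# 400 characters of dummy text, so the chunk boundaries are easy to check.
sample = "".join(str(i % 10) for i in range(400))
pieces = sliding_chunks(sample)

print([len(p) for p in pieces])            # [150, 150, 150, 100]
print(pieces[0][-50:] == pieces[1][:50])   # True
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk. With a chunk size as small as 150 characters, each retrieved passage is roughly one line of the book.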

#### Test the vector store

This tests that the data exists within the vector store.

In [15]:
query = "What did Conway say?"
docs = store.similarity_search(query)
print(docs[0].page_content)

context of discussions that happened at the time. For example, an architect, a UI designer, and a developer


#### Test the model with a prompt

In [16]:
messages = [{
    "role": "user", 
    "content": "Act as a consultant. I have a client who needing to make his software product company more efficient. \
    I want to impress my client by providing advice from the book Designed4Devops. \
    What do you recommend? \
    Give me two options, along with how to go about it for each"
}]

model_inputs = tokenizer.apply_chat_template(messages,return_tensors = "pt",padding=True).to('cuda:0')

generated_ids = model.generate(
    model_inputs,
    max_new_tokens = 1000,
    do_sample = True,
    pad_token_id=tokenizer.eos_token_id,
)

decoded = tokenizer.batch_decode(generated_ids)

print(decoded[0])

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.


<s> [INST] Act as a consultant. I have a client who needing to make his software product company more efficient.     I want to impress my client by providing advice from the book Designed4Devops.     What do you recommend?     Give me two options, along with how to go about it for each [/INST] Option 1: Automation and Continuous Integration

One way to improve efficiency in a software product company is to implement automation and continuous integration (CI) practices. This can help streamline the development process, reduce manual errors, and improve the speed and reliability of software releases.

To do this, the first step is to identify areas of the development process where automation and continuous integration can be implemented. This may include building automated tests, automating the deployment process, and implementing a CI tool, such as Jenkins or Travis CI, to orchestrate these processes.

To ensure a successful implementation, it's important to involve the entire developme

#### Create the LLM chain

To create a semantically aware search, we need to supply the context for the question and engineer a prompt that focuses the model on answering questions using the data from our vector store instead of making things up (hallucinating). Prompt engineering is a way to coach the model into giving the sort of answers that you want and filtering out those that you don't. This block sets up the chain and the template that brings the context and question together in the prompt.
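
Before wiring this into LangChain, it can help to see what "bringing the context and question together" amounts to. The sketch below fills a template with plain string formatting. The template wording and the `build_prompt` helper are illustrative assumptions, not the template used in this notebook, and the passage is one of the chunks from the book.

```python
# Illustrative template: the wording here is an assumption for this sketch.
TEMPLATE = (
    "[INST]\n"
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n"
    "[/INST]"
)

def build_prompt(question, passages):
    """Join the retrieved passages and slot them into the template."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "How do I improve the speed of changes to my product?",
    ["making minor changes and integrating, testing, and releasing them more often."],
)
print(prompt)
```

The model only ever sees the final string, so everything we want it to know about the book has to arrive through the `{context}` slot.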

In [17]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.chains import LLMChain

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
)

prompt_template = """
### [INST] 
Instruction: Answer the question based on your 
designed4devops knowledge. Don't make up answers, just say there is no answer in the book. Here is context to help:

{context}

### QUESTION:
{question} 

[/INST]
 """

mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Create llm chain 
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)



#### Create RAG Chain

This chain allows us to engineer our prompt by adding context to the question to _hopefully_ get a stronger answer. The context will come from our vector store, where we stored the book's text as embeddings.
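
Stripped of the framework, the data flow is: retrieve passages for the question, build a prompt from them, and hand the prompt to the model. Here is a plain-Python sketch of that flow with stand-ins for both ends. `keyword_retriever` and `echo_llm` are hypothetical placeholders for the vector store retriever and the Mistral pipeline, and the passages are shortened chunks from the book.

```python
PASSAGES = [
    "making minor changes and integrating, testing, and releasing them more often.",
    "lower the risk of waste creeping into the pipelines, products, and processes.",
]

def keyword_retriever(question, k=1):
    """Stand-in retriever: rank passages by words shared with the question."""
    question_words = set(question.lower().split())

    def overlap(passage):
        return len(question_words & set(passage.lower().split()))

    return sorted(PASSAGES, key=overlap, reverse=True)[:k]

def echo_llm(prompt):
    """Stand-in model: reports what it was given instead of generating text."""
    return f"(model saw {len(prompt)} characters of prompt)"

def rag_answer(question, retrieve, generate):
    """Retrieve context, build the prompt, then call the model."""
    passages = retrieve(question)
    context = "\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return {"context": passages, "question": question, "text": generate(prompt)}

result = rag_answer("How do I reduce waste in my pipelines?", keyword_retriever, echo_llm)
print(result["context"])
print(result["text"])
```

The returned dictionary has the same three keys (`context`, `question`, `text`) as the output of the real chain, which makes it easy to inspect which passages the answer was based on.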

In [18]:
retriever = store.as_retriever()

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

## The Result:

In [19]:
query = "How do I improve the speed of changes that we are making to my product?"
rag_chain.invoke(query)

{'context': [Document(page_content='we process changes to our product. Digital marketplaces move quickly, so we need to introduce change', metadata={'page': 32, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='process and within the product to create short feedback loops that allow us to keep improving our product', metadata={'page': 230, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='concern yet. Improving them comes later. If you can identify separate workflows within your product, you', metadata={'page': 57, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='making minor changes and integrating, testing, and releasing them more often.', metadata={'page': 125, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})],
 'question': 'How do I improve the speed of changes that we are making to my product?',
 'text': "\n### [INST] \nInstruction: Answer the

In [20]:
query = "How do I speed up the transfer of workflow tickets for new releases, from the development team to the operations team?"
rag_chain.invoke(query)

{'context': [Document(page_content='delivery before the physical installation. Anything we can do to streamline this workflow will significantly', metadata={'page': 158, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='continuous flow of minor changes. You might charge on a per-transaction for APIs or per-user for', metadata={'page': 100, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='existing pipelines to increase flow. Mapping a value stream from an idea to release or a backlog entry to', metadata={'page': 54, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='difficult to automate with continuous integration and deployment tools, which slows the flow. \nMicroservices', metadata={'page': 127, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})],
 'question': 'How do I speed up the transfer of workflow tickets for new releases, from the development

In [21]:
query = "How do I reduce the waste in my delivery pipelines?"
rag_chain.invoke(query)

{'context': [Document(page_content='negotiation from your pipeline and remove the dependency. It removes waste. You do need to exercise', metadata={'page': 100, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='lower the risk of waste creeping into the pipelines, products, and processes. When we get down to the flow', metadata={'page': 53, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='methodologies to optimize production lines, reduce waste, and increase the agility they deliver physical', metadata={'page': 25, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='our pipeline and retest the build from scratch. It is adding waste to the system and potentially', metadata={'page': 138, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})],
 'question': 'How do I reduce the waste in my delivery pipelines?',
 'text': "\n### [INST] \nInstruction: Answer the 

In [22]:
query = "How will Designed4Devops help me cook a risotto withour burning the rice?"
rag_chain.invoke(query)

{'context': [Document(page_content='designed4devops', metadata={'page': 0, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='how designed4devops  helps to make this easier to implement. It is a more technical discussion of', metadata={'page': 17, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='designed4devops  sets out to increase the flow of novemes through this lifecycle while improving', metadata={'page': 50, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='in the Designed4: Analysis section. \n \nApplication Development', metadata={'page': 163, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})],
 'question': 'How will Designed4Devops help me cook a risotto withour burning the rice?',
 'text': "\n### [INST] \nInstruction: Answer the question based on your \ndesigned4devops knowledge. Don't make up answers, just say there is no answer in the boo

## Conclusion

This demonstrates that the barrier to entry for prototyping generative AI services is surprisingly low. This was achieved on relatively modest compute resources in a few hours.

Before you jump in, be sure to check out my blog on [Generative AI and RAG Security](https://www.linkedin.com/posts/methods_the-hitchhikers-guide-to-jupyter-activity-7195772472221224961-IBzo?utm_source=share&utm_medium=member_desktop).

I'll be taking this project further and blogging along the way. I'll be talking about the environment I used to build this demo, how I productionise, package, and host the system, and how I add a front end so that you can interact with the book yourselves!

You can download this blog as a Jupyter notebook file [here](https://github.com/tudor-james/ai-playground/blob/main/mistral-rag-langchain-chromadb.ipynb). As ever, if you need help with AI projects you can get in touch with Methods or contact us via LinkedIn.