# The hitchhiker's guide to Jupyter (part 5/n)

## Let's create a Generative AI chatbot using RAG to talk to my book

I'm going to implement a local chatbot on my laptop to talk to my book, ["Designed4Devops"](https://designed4devops.com). This will let a user ask questions of the book and summarise its contents. My book is self-published and copyrighted, so it shouldn't appear in models' training data. To achieve this I'm going to use RAG, or _Retrieval Augmented Generation_.

## RAG

RAG is a technique that lets you add data to an LLM after the model was trained, without retraining or fine-tuning it. Training models requires access to large and often numerous high-end GPUs, which can be expensive. It also has the downside that if you want to update the data, you need to retrain the model.

RAG overcomes this by taking the data (e.g., PDF, CSV, HTML) and vectorising it. Remember that models work by matrix multiplications of numbers, not text. We use an embedding model to turn the text into numbers and keep them in a vector store. This lets us query the data with semantic search: the passages closest in meaning to the question are retrieved and passed to the LLM as context. The model then answers based on the context it was given.
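To make the idea of semantic search concrete, here is a minimal sketch using cosine similarity over hand-written vectors. The three-dimensional "embeddings", the `toy_store` dictionary and the `search` helper are all made up for illustration; a real embedding model produces far longer vectors, and a vector store such as ChromaDB does this lookup for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; a real embedding model
# produces hundreds of dimensions per chunk of text.
toy_store = {
    "Automate your deployment pipeline": [0.9, 0.1, 0.0],
    "Snowdon is the highest peak in Wales": [0.0, 0.2, 0.9],
    "Reduce lead time with small batches": [0.8, 0.3, 0.1],
}

def search(query_vector, store, k=2):
    """Return the k stored texts whose vectors are closest to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]

# Pretend this vector is the embedding of "How do I ship software faster?"
query_vector = [1.0, 0.1, 0.0]
top_chunks = search(query_vector, toy_store)
print(top_chunks)
```

The two delivery-related sentences rank above the one about Snowdon because their vectors point in nearly the same direction as the query, even though they share no words with it.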

Let's set up the model.

### First, we'll install the dependencies and set up the model.

In [None]:
!pip install --force-reinstall -Uq torch datasets accelerate peft bitsandbytes transformers trl

In [1]:
import transformers
import torch
import datasets
import accelerate
import peft
import bitsandbytes
import trl

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

Next we set up the tokeniser, which breaks the text up into tokens (chunks) that can be individual words or fragments of words.
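As a rough illustration of words being split into fragments, here is a toy tokeniser. It is not how Mistral's tokeniser works internally (that one uses a learned byte-pair-encoding vocabulary of about 32,000 entries); this sketch swaps in a simple greedy longest-match over a tiny, hypothetical vocabulary called `toy_vocab`.

```python
# Hypothetical vocabulary for illustration only. A real tokeniser learns
# its vocabulary from data and does not use this greedy matching.
toy_vocab = {"token": 0, "is": 1, "er": 2, "s": 3, "un": 4, "believ": 5, "able": 6, " ": 7}

def toy_tokenise(text, vocab):
    """Split text into the longest vocabulary pieces, left to right."""
    pieces = []
    position = 0
    while position < len(text):
        # Try the longest possible piece first, then shrink.
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in vocab:
                pieces.append(piece)
                position = end
                break
        else:
            raise ValueError(f"No vocabulary entry covers {text[position:]!r}")
    return pieces

pieces = toy_tokenise("tokenisers unbelievable", toy_vocab)
ids = [toy_vocab[piece] for piece in pieces]
print(pieces)
print(ids)
```

The model never sees the text itself, only the list of integer IDs.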

I'm going to use Mistral 7B as it offers good performance with a low processing and memory overhead.

#### Let's load the model

In [3]:
model_name='../models/Mistral-7B-Instruct-v0.1'

model_config = transformers.AutoConfig.from_pretrained(
    model_name,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#### Let's create the template

In [11]:
messages = [{
    "role":"user",
    "content": "Can you tell us 3 reasons why Eryri is a good place to visit?"
}]

tokenizer.apply_chat_template(messages, tokenize=True)

model_inputs = tokenizer.apply_chat_template(messages, return_tensors = "pt")

In [12]:
model_inputs

tensor([[    1,   733, 16289, 28793,  2418,   368,  1912,   592, 28705, 28770,
          6494,  2079,   413,   643,   373,   349,   264,  1179,  1633,   298,
          3251, 28804,   733, 28748, 16289, 28793]])
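The tensor above is the tokenised form of the templated prompt; the leading `1` is the begin-of-sequence token. To see the template as text, you can call `tokenizer.apply_chat_template(messages, tokenize=False)`. The sketch below approximates that layout in plain Python so the structure is visible without loading the tokeniser. The function `format_mistral_prompt` is my own illustration, not part of the library, and the exact whitespace is an assumption.

```python
def format_mistral_prompt(messages, bos_token="<s>", eos_token="</s>"):
    """Approximate the [INST] ... [/INST] layout used by Mistral Instruct."""
    prompt = bos_token
    for message in messages:
        if message["role"] == "user":
            prompt += f"[INST] {message['content']} [/INST]"
        elif message["role"] == "assistant":
            # Assistant turns are closed with the end-of-sequence token.
            prompt += f"{message['content']}{eos_token}"
        else:
            raise ValueError(f"Unsupported role: {message['role']}")
    return prompt

messages = [{
    "role": "user",
    "content": "Can you tell us 3 reasons why Eryri is a good place to visit?"
}]
prompt_text = format_mistral_prompt(messages)
print(prompt_text)
```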

#### Quantization of the Model

I'm going to quantize the model to 4 bits. This lowers the precision of the weights (4-bit values instead of fp16 or fp32), which reduces the memory overhead even further.
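A quick back-of-the-envelope calculation shows why this matters on a laptop. This is a sketch under stated assumptions: I take the parameter count to be roughly 7.2 billion, and I count the weights only, ignoring quantization constants, any layers kept at higher precision, activations and the KV cache.

```python
# Assumption: Mistral 7B has roughly 7.2 billion parameters.
PARAMETER_COUNT = 7.2e9

BITS_PER_PARAMETER = {"fp32": 32, "fp16": 16, "4-bit": 4}

def weight_memory_gb(parameter_count, bits_per_parameter):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return parameter_count * bits_per_parameter / 8 / 1e9

for name, bits in BITS_PER_PARAMETER.items():
    print(f"{name}: {weight_memory_gb(PARAMETER_COUNT, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 14.4 GB at fp16 to about 3.6 GB at 4 bits, which is what brings the model within reach of a consumer GPU.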

In [6]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [7]:
#################################################################
# Set up quantization config
#################################################################
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

Your GPU supports bfloat16: accelerate training with bf16=True


In [8]:
torch.cuda.get_device_capability()

(8, 6)

In [9]:
#################################################################
# Load pre-trained config
#################################################################
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



#### Let's test it

This query asks the model a question. We haven't loaded any of our data into it yet, so the answer comes entirely from information held within the model from its training data set.

In [13]:
generated_ids = model.generate(
    model_inputs.to(model.device),  # keep the inputs on the same device as the model
    max_new_tokens = 1000,
    do_sample = True,
)

decoded = tokenizer.batch_decode(generated_ids)

print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] Can you tell us 3 reasons why Eryri is a good place to visit? [/INST] Eryri, also known as Snowdonia, is a magnificent and historic region located in North Wales that offers a wealth of reasons to visit. Here are three good reasons why Eryri is a great destination:

1. Natural Beauty: Eryri boasts some of the most stunning natural landscapes in the UK, with its breathtaking mountains, lush valleys, and rolling hills. The region is home to Snowdon, the highest peak in Wales and England, which offers incredible views of the surrounding area. Visitors can also explore popular attractions such as the Llanberis Pass, the Rhinogydd Mountains, and the Anglesey Coastal Path.

2. Rich History and Culture: Eryri has a rich history and is home to many fascinating attractions that showcase its cultural heritage. The region is steeped in ancient Welsh history, and visitors can explore castles, abbeys, and settlements that date back to the Iron Age. The town of Betws-y-Coed, for instance,

___The model is working!___

#### Create the vector database

I'm going to use ChromaDB, which is a lightweight local vector store, to hold the embeddings of the book's text that will come from the PDF.

In [14]:
!pip install --force-reinstall -Uq langchain chromadb openai tiktoken sentence-transformers pypdf fastembed

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.19.1 requires huggingface-hub>=0.21.2, but you have huggingface-hub 0.20.3 which is incompatible.

In [15]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import AsyncChromiumLoader
# from langchain.document_loaders import CSVLoader
# from langchain.vectorstores import FAISS
# import nest_asyncio
# from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.document_loaders import PyPDFLoader

In [18]:
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

# Load the book
loader = PyPDFLoader("/tf/docker-shared-data/rag-data/d4do_paperback.pdf")
documents = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150,
    chunk_overlap  = 50,
    length_function = len,
    is_separator_regex = False,
)

chunks = text_splitter.split_documents(documents)
chunks[0]

store = Chroma.from_documents(
    chunks,
    embeddings,
    ids = [f"{item.metadata['source']}-{index}" for index, item in enumerate(chunks)],
    collection_name="D4DO-Embeddings"
)
store.persist()
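The `chunk_size` and `chunk_overlap` settings in the cell above control how the book is cut up before embedding. As a simplified stand-in (not the actual `RecursiveCharacterTextSplitter` algorithm, which prefers to break on separators such as paragraphs and sentences), this sketch cuts fixed-size windows so you can see what a 50-character overlap means. The function name `sliding_window_chunks` and the sample text are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size=150, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 400 placeholder characters standing in for a page of the book.
sample_text = "".join(str(i % 10) for i in range(400))
sample_chunks = sliding_window_chunks(sample_text)
for chunk in sample_chunks:
    print(len(chunk), chunk[:10])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.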


#### Test the vector store

This tests that the data exists within the vector store.

In [None]:
query = "What is Designed4Devops?"
docs = store.similarity_search(query)
print(docs[0].page_content)

#### Test the model with the store

In [23]:
messages = [{
    "role": "user", 
    "content": "Act as a consultant. I have a client who needing to make his software product company more efficient. \
    I want to impress my client by providing advice from the book Designed4Devops. \
    What do you recommend? \
    Give me two options, along with how to go about it for each"
}]

model_inputs = tokenizer.apply_chat_template(messages,return_tensors = "pt").to('cuda:0')

generated_ids = model.generate(
    model_inputs,
    max_new_tokens = 1000,
    do_sample = True,
)

decoded = tokenizer.batch_decode(generated_ids)

print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] Act as a consultant. I have a client who needing to make his software product company more efficient.     I want to impress my client by providing advice from the book Designed4Devops.     What do you recommend?     Give me two options, along with how to go about it for each [/INST] The book "Design for DevOps" offers a lot of valuable advice on how to improve the efficiency of a software product company. Based on that, I recommend the following two options:

Option 1: Improve the development process
Option 2: Improve the operations process

Option 1: Improve the development process
Improving the development process can help make your company more efficient by reducing the time it takes to develop and deliver software. Here is how you can do it:

1. Use continuous integration and continuous delivery (CI/CD) pipelines: Automate the process of building, testing, and delivering software. This will help to catch and fix issues early, reduce the time between builds and deployment

#### Create the LLM chain

To create a semantically aware search, we need to supply context alongside the question, and engineer a prompt that focuses the model on answering from the data in our vector store instead of making things up (hallucinating). Prompt engineering is a way to coach the model into giving the sort of answers that you want returned and filtering out those that you don't.
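Before wiring this up with LangChain, it may help to see what "stuffing" the context into a prompt amounts to. This is a plain-Python sketch of the idea, not LangChain's implementation: the template text, the `build_prompt` helper, the blank-line separator and the sample chunks are all assumptions for illustration.

```python
PROMPT_TEMPLATE = """Answer the question using only the context provided.
If the context does not contain the answer, say you don't know.

{context}

Question: {input}"""

def build_prompt(retrieved_chunks, question, separator="\n\n"):
    """Join the retrieved chunks and drop them into the template."""
    context = separator.join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, input=question)

# Hypothetical chunks standing in for what the vector store would return.
retrieved_chunks = [
    "Automating deployment removes manual hand-offs.",
    "Smaller batches shorten lead time.",
]
final_prompt = build_prompt(retrieved_chunks, "How do I reduce my lead time?")
print(final_prompt)
```

The model only ever sees this final string, so everything it knows about the book at answer time has to fit inside the retrieved chunks.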

In [None]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.chains import LLMChain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

template = """You are a bot that answers user questions about designed4devops using only the context provided. Don't use expletives or bad language.
If you can't answer, then apologise and say you don't know. Don't make up answers. Please limit your answer to 200 words or fewer. Here is context to help:

{context}

Question: {input}"""

prompt = PromptTemplate(
    template=template, input_variables=["context", "input"]
)

retriever = store.as_retriever(search_kwargs={
      'k': 10
})
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, combine_docs_chain)

#### Create RAG Chain

This chains the prompt with the question to _hopefully_ get a strong answer.

## The Result:

In [None]:
query = "How can I start a digital transformation programme on my portfolio of digital products using designed4devops?"

# The retrieval chain expects a dict with an "input" key and returns a dict
# containing "input", "context" and "answer".
result = chain.invoke({"input": query})
print(result["answer"])

Not bad:

### QUESTION:\nHow can I start a digital transformation programme on my portfolio of digital products using designed4devops? \n\n[/INST]\n \nTo start a digital transformation program on your portfolio of digital products using designed4devOps, you should follow these steps:\n\n1. Understand the concept of digital transformation and its benefits. This will help you understand why you need to transform your digital products and what the expected outcomes are.\n2. Identify the areas where your digital products need improvement. This could include improving product stability, security, agility, innovation, and reducing your organization's contribution to climate change.\n3. Develop a structured approach to digital transformation. Designed4devOps provides a framework for designing and delivering digital transformation within an organization. You can use this framework to develop a structured approach to digital transformation that is repeatable, measurable, and testable.\n4. Integrate digital transformation into broader business frameworks. Designed4devOps emphasizes the importance of integrating digital transformation into broader business frameworks such as security and service management. This will ensure that your digital transformation efforts align with your overall business goals.\n5. Implement the changes. Once you have developed a structured approach to digital transformation and integrated it into broader business frameworks, you can begin implementing the changes. This may involve reorganizing your teams, updating processes and procedures, and investing in new technologies.\n6. Monitor and measure progress. To ensure that your digital transformation program is successful, you need to monitor and measure progress regularly. This will help you identify areas where you need to make adjustments

Let's ask a very specific question.

In [None]:
query = "My developers produce working digital software products and them hand them to the operations team to manually install them in production systems. How do I reduce my lead time?"

result = chain.invoke({"input": query})
print(result["answer"])

### QUESTION:\nMy developers produce working digital software products and them hand them to the operations team to manually install them in production systems. How do I reduce my lead time? \n\n[/INST]\n \nTo reduce lead time, you can consider implementing agile practices such as documenting bugs globally, using short daily meetings to flag items up and collaborate further, and breaking down deliverables into smaller, prioritized chunks. Additionally, you can focus on automating processes and reducing external dependencies to increase overall flow and reduce lead time. Encouraging collaboration between teams and fostering effective communication can also help to reduce lead time and improve overall productivity.

## Conclusion

This was quite simple to set up on a local laptop. It demonstrates that generative AI is achievable with modest resources and in a short time. Before you jump in, be sure to check out my blog on [Generative AI and RAG Security](https://).

I'll be taking this project further and blogging along the way. I'll be talking about the environment I used to build this demo, how I productionise, package and host the system, and how I add a front end so that you can interact with the book yourselves!

You can download this blog as a Jupyter notebook file [here](https://github.com/tudor-james/ai-playground/blob/main/mistral-rag-langchain-chromadb.ipynb). As ever, if you need help with AI projects you can get in touch with Methods or contact us via LinkedIn.

## Addendum - Let's add Voila

In [None]:
!pip install voila