# The hitchhiker's guide to Jupyter (part 5/n)

## Let's create a Generative AI chatbot using RAG to talk to my book

I'm going to implement a local chatbot on my laptop to talk to my book, ["Designed4Devops"](https://designed4devops.com). This will allow a user to ask questions of the book and summarise its contents. My book is self-published and copyrighted, so it shouldn't appear in models' training data. To achieve this I'm going to use RAG, or _Retrieval Augmented Generation_.

## RAG

RAG is a technique that allows you to add data to an LLM after the model was trained, without retraining or fine-tuning it. Training models requires access to high-end GPUs, often many of them, which can be expensive. Retraining also has the downside that whenever you want to update the data, you need to train the model again.

RAG overcomes this by taking the data (e.g., PDF, CSV, HTML) and vectorising it. Remember that models work by matrix multiplications of numbers, not text. We use an embedding model to convert the text into numeric vectors and hold them in a vector store. This allows us to query the data with semantic search: the chunks closest in meaning to the question are retrieved and passed to the LLM as context, and the model then answers based on that context.
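
To make the retrieval step concrete, here is a minimal sketch of semantic search over vectors. It assumes a toy bag-of-words `embed` function and made-up sample chunks (both are illustrative and not part of this notebook); a real system uses a trained embedding model, as we do later on.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy embedding (assumption): count how often each vocabulary word appears."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    vocabulary = sorted({w for text in chunks + [query] for w in text.lower().split()})
    query_vec = embed(query, vocabulary)
    scored = [(cosine_similarity(query_vec, embed(c, vocabulary)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Made-up sample chunks, standing in for pieces of a document.
chunks = [
    "devops shortens the lead time of software releases",
    "the cat sat on the mat",
    "value streams describe how work flows to the customer",
]
print(search("how do I reduce lead time", chunks, k=1))
```

Word counts only match exact words, which is why real embeddings matter: a trained model places "reduce" and "shorten" close together even though they share no letters.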

Let's set up the model.

### First, we'll install the dependencies and set up the model.

In [3]:
!pip install --force-reinstall -Uq 'torch==2.2.2' datasets accelerate peft bitsandbytes transformers trl 'numpy<2.0'

In [4]:
import transformers
import torch
import datasets
import accelerate
import peft
import bitsandbytes
import trl

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

Next we set up the tokeniser, which breaks the text up into tokens (chunks) that can be individual words or fragments of words.
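
As a rough illustration of tokens being whole words or word fragments, here is a toy greedy tokeniser. The tiny `toy_vocab` and the longest-match rule are assumptions for the sake of the example; this is not how the Mistral tokeniser is implemented, which uses a learned vocabulary of tens of thousands of entries.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of each word into fragments found in vocab."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest fragment first, shrinking until one is in the vocabulary.
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                # No known fragment starts here: emit the single character as a token.
                tokens.append(word[start])
                start += 1
    return tokens


# Hypothetical vocabulary, chosen so the book title splits into fragments.
toy_vocab = {"design", "ed", "4", "dev", "ops", "what", "is"}
tokens = tokenize("What is designed4devops", toy_vocab)
print(tokens)

# Models never see the strings themselves, only the integer ID of each token.
token_to_id = {tok: i for i, tok in enumerate(sorted(toy_vocab))}
print([token_to_id[t] for t in tokens])
```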

I'm going to use Mistral 7B as it offers good performance with a low processing and memory overhead.

In [6]:
model_name='../models/Mistral-7B-Instruct-v0.1'

model_config = transformers.AutoConfig.from_pretrained(
    model_name,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#### Quantization of the Model

I'm going to quantize the model to 4 bits. This lowers the precision of the weights' data type (4-bit instead of fp16 or fp32), which reduces the memory overhead even further.
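
To see why this matters on a laptop, here is a back-of-the-envelope estimate of the memory needed just to hold the weights. It assumes a round 7 billion parameters and ignores everything else (quantization constants, activations, the KV cache), so treat the numbers as a rough lower bound rather than a measurement.

```python
# Bits used to store one weight at each precision.
BITS_PER_PARAMETER = {"fp32": 32, "fp16": 16, "4-bit": 4}

# Assumption: a round 7 billion parameters for a "7B" model.
NUM_PARAMETERS = 7_000_000_000


def weight_memory_gb(num_parameters, bits):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits / 8 / 1e9


for name, bits in BITS_PER_PARAMETER.items():
    print(f"{name}: {weight_memory_gb(NUM_PARAMETERS, bits):.1f} GB")
```

Under these assumptions the weights shrink from roughly 14 GB at fp16 to roughly 3.5 GB at 4 bits, which is the difference between needing a data-centre card and fitting on a laptop GPU.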

In [7]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [8]:
#################################################################
# Set up quantization config
#################################################################
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

Your GPU supports bfloat16: accelerate training with bf16=True


In [9]:
torch.cuda.get_device_capability()

(8, 6)

In [10]:
#################################################################
# Load pre-trained config
#################################################################
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

#### Let's test it

This query asks the model a question. We haven't loaded any of our data into it yet, so the answer comes entirely from information held within the model from its training data set.

In [11]:
inputs_not_chat = tokenizer.encode_plus("[INST] What is Designed4Devops? [/INST]", return_tensors="pt")['input_ids'].to('cuda')

generated_ids = model.generate(inputs_not_chat, 
                               max_new_tokens=1000, 
                               do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['<s> [INST] What is Designed4Devops? [/INST] Designed4DevOps is a collection of tools, processes, and methodologies that are designed specifically for software development teams to collaborate and work effectively with operations teams and other stakeholders to build, deploy, and manage software applications at scale.\n\nThe Designed4DevOps approach is focused on improving communication, collaboration, automation, and continuous improvement throughout the entire software development lifecycle (SDLC). It emphasizes the importance of DevOps best practices, such as continuous integration and delivery (CI/CD), continuous monitoring and testing, and infrastructure-as-code (IAC) to help organizations release software quickly and reliably.</s>']


___Not bad, we must have made it into the web scrapes!___

#### Create the vector database

I'm going to use ChromaDB, a lightweight local vector store, to hold the embeddings of the book's text, which will come from the PDF.

In [24]:
!pip install -Uq langchain chromadb openai tiktoken sentence-transformers pypdf langchain-community langchain-huggingface

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [25]:
# Only the imports this notebook actually uses
import nest_asyncio
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFacePipeline

In [21]:
nest_asyncio.apply()
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the book
loader = PyPDFLoader("/home/jovyan/docker-shared-data/rag-data/d4do_paperback.pdf")
documents = loader.load()

# Chunk text
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunked_documents = text_splitter.split_documents(documents)

# Load chunked documents into the Chroma index
db = Chroma.from_documents(chunked_documents, embedding_function)

# Connect query to Chroma index using a retriever
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 4}
)
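
The chunking step above is worth a closer look. The sketch below shows the idea of overlapping chunks with a plain fixed-size sliding window. This is a simplification for illustration: `chunk_text` is a hypothetical helper, and LangChain's `CharacterTextSplitter` splits on a separator and merges the pieces rather than cutting at fixed offsets.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.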

#### Test the vector store

This tests that the data exists within the vector store.

In [22]:
query = "What can designed4devops do for my organisation?"
docs = db.similarity_search(query)
print(docs[0].page_content)

designed4devops


#### Create the LLM chain

To create a semantically aware search, we need to pass the retrieved context along with the question, and engineer a prompt that focuses the model on answering questions using the data from our vector store instead of making things up (hallucinating). Prompt engineering is a way to coach the model into giving the sort of answers that you want returned and filtering out those that you don't.
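
Stripped of the framework, putting context into a prompt is just string formatting. The sketch below assumes a simplified template and two made-up chunks (neither is taken from the notebook) to show how retrieved text and the question end up in one prompt string.

```python
# Simplified, hypothetical template using the [INST] ... [/INST] instruction markers.
TEMPLATE = (
    "[INST] Answer using only the context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question} [/INST]"
)


def build_prompt(question, retrieved_chunks):
    """Join the retrieved chunks and slot them into the template with the question."""
    context = "\n---\n".join(retrieved_chunks)
    return TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for what the retriever would return.
prompt_text = build_prompt(
    "What is a user story?",
    [
        "Sample chunk one about user stories.",
        "Sample chunk two about business analysis.",
    ],
)
print(prompt_text)
```

Everything the model knows about the book at answer time is what lands in `{context}`, which is why retrieval quality matters as much as the wording of the instructions.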

In [30]:
# HuggingFacePipeline was already imported from langchain_huggingface above
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
    do_sample=True
)

prompt_template = """
### [INST] 
Instruction: Answer the question based on your knowledge from the files in the vector database only.
Don't use expletives or bad language.
If you can't answer then apologise and say you don't know.
Don't make up answers. Please limit your answer to 200 words or less. Here is context to help:

{context}

### QUESTION:
{question} 

[/INST]
 """

mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Create llm chain 
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)

#### Create RAG Chain

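This chains the retriever, the prompt and the model together: the retriever fetches the context for the question, and both are passed to the LLM chain to _hopefully_ get a strong answer.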
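
Conceptually, the chain does three things in order: send the question to the retriever, fill the prompt, and call the model. Here is a plain-Python sketch of that flow. `fake_retriever` and `fake_llm` are hypothetical stand-ins so the example runs without a GPU, and the returned dictionary only approximates the shape of the output you will see from the real chain.

```python
def fake_retriever(question):
    """Stand-in for the vector store retriever (hypothetical fixed results)."""
    return ["chunk about lead time", "chunk about automation"]


def fake_llm(prompt_text):
    """Stand-in for the model: it only reports how much text it was given."""
    return f"(answer generated from {len(prompt_text)} prompt characters)"


def rag_pipeline(question):
    # Step 1: the same question feeds both branches, retrieval and pass-through.
    inputs = {"context": fake_retriever(question), "question": question}
    # Step 2: fill the prompt with the retrieved context and the question.
    prompt_text = "Context: {context}\nQuestion: {question}".format(
        context=" | ".join(inputs["context"]),
        question=inputs["question"],
    )
    # Step 3: generate, returning the inputs alongside the generated text.
    return {**inputs, "text": fake_llm(prompt_text)}


result = rag_pipeline("How do I reduce my lead time?")
print(result["text"])
```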

In [31]:
query = "Summarize the text in the vector database." 

retriever = db.as_retriever()

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

rag_chain.invoke(query)

{'context': [Document(page_content="Part III - WHAT - A Model Value Stream- Phase 1 - Design \n \n \n100 They will interview users, conduct workshops, and complete surveys using analytical methods and their \nexperience to understand the requirements and their context better. They will turn the requirements into \nuser stories. \nA user story is a user-centric definition of what the target user tries to achieve with the requirement. It \nwill be unambiguous and allow a developer to code the functionality without a back-and-forth exchange \nof information that will slow down cadence. \nA user story might look like, “I want to add details to a user ticket by editing the record when I have \nselected it from the list of tickets available.” It should have enough information so the developer can \nunderstand the user's intent, picture themselves in their position, and walk through their actions in their \nmind. \nBusiness analysis is critical to agile development in many ways. Its purpose i

## The Result:

In [32]:
query = "How can I start a digital transformation programme on my portfolio of digital products using designed4devops?"

rag_chain.invoke(query)

{'context': [Document(page_content='Background \n \niii \n With designed4devops  and its framework, I aim to give you a structure to use DevOps to design and \ndeliver Digital Transformation within your organization.', metadata={'page': 12, 'source': '/home/jovyan/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='designed4devops \nDigital Transformation the Lean and Easy way \n \n \n \n \n \n \n \n \n \n \n \n \n \nAJ James', metadata={'page': 2, 'source': '/home/jovyan/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='About this book \n \nvii \n About this book \nThis book is for anyone considering or working in the digital transformation of organizations that \ncreate digital products. It uses DevOps and the optimization of delivering change as its core but shows you \nhow to approach it structurally and repeatedly. It also shows you how to design your product delivery and \nintegrate it into broader business frameworks such as secu

Not bad:

### QUESTION:\nHow can I start a digital transformation programme on my portfolio of digital products using designed4devops? \n\n[/INST]\n \nTo start a digital transformation program on your portfolio of digital products using designed4devOps, you should follow these steps:\n\n1. Understand the concept of digital transformation and its benefits. This will help you understand why you need to transform your digital products and what the expected outcomes are.\n2. Identify the areas where your digital products need improvement. This could include improving product stability, security, agility, innovation, and reducing your organization's contribution to climate change.\n3. Develop a structured approach to digital transformation. Designed4devOps provides a framework for designing and delivering digital transformation within an organization. You can use this framework to develop a structured approach to digital transformation that is repeatable, measurable, and testable.\n4. Integrate digital transformation into broader business frameworks. Designed4devOps emphasizes the importance of integrating digital transformation into broader business frameworks such as security and service management. This will ensure that your digital transformation efforts align with your overall business goals.\n5. Implement the changes. Once you have developed a structured approach to digital transformation and integrated it into broader business frameworks, you can begin implementing the changes. This may involve reorganizing your teams, updating processes and procedures, and investing in new technologies.\n6. Monitor and measure progress. To ensure that your digital transformation program is successful, you need to monitor and measure progress regularly. This will help you identify areas where you need to make adjustments

Let's ask a very specific question.

In [33]:
query = "My developers produce working digital software products and them hand them to the operations team to manually install them in production systems. How do I reduce my lead time?"

rag_chain.invoke(query)

{'context': [Document(page_content='213 Epilogue \nHopefully, we can draw enough parallels between the improvements made in the Lean physical product \nproduction world and apply them to our digital pipelines. We can increase the efficiency of releasing \nnovemes into our products by changing how we approach the introduction of changes. \nWe aim to decrease the size of novemes to get close to single-piece flow to allow us to increase our \nrelease frequency and decrease our release complexity. We achieve this by having developers check code \nin more frequently and automate the testing of releases to ensure that they won’t cause problems and \ncapture issues quickly. \nTo make this easier for our developers, we look at the whole lifecycle of our product from the outset in \nany change we make to it. We structure everyone involved in introducing novemes into sequential, \nfunctional cells that self-organize. Each cell treats its downstream cell as its primary customer to ensure a \nsmoo

### QUESTION:\nMy developers produce working digital software products and them hand them to the operations team to manually install them in production systems. How do I reduce my lead time? \n\n[/INST]\n \nTo reduce lead time, you can consider implementing agile practices such as documenting bugs globally, using short daily meetings to flag items up and collaborate further, and breaking down deliverables into smaller, prioritized chunks. Additionally, you can focus on automating processes and reducing external dependencies to increase overall flow and reduce lead time. Encouraging collaboration between teams and fostering effective communication can also help to reduce lead time and improve overall productivity.

## Conclusion

This was quite simple to set up on a local laptop. It demonstrates that generative AI is achievable with modest resources and in a short time. Before you jump in, be sure to check out my blog on [Generative AI and RAG Security](https://).

I'll be taking this project further and blogging along the way. I'll be talking about the environment I used to build this demo, how I productionise, package and host the system, and how I add a front end so that you can interact with the book yourselves!

You can download this blog as a Jupyter notebook file [here](https://github.com/tudor-james/ai-playground/blob/main/mistral-rag-langchain-chromadb.ipynb). As ever, if you need help with AI projects you can get in touch with Methods or contact us via LinkedIn.

In [34]:
user_input = input('What is your question? ')
rag_chain.invoke(user_input)

What is your question? what name does designed4devops give to the concept of an idea or change as it flows through the value stream?


{'context': [Document(page_content='Observe \n \n \n33 My background is as an engineer and technical architect. The critical skill in both roles is breaking \nproblems into solvable chunks that I can work on separately. An organization, and the value streams \nwithin it, are complex systems. By logically isolating processes and subprocesses, we can \nredesign our organization’s value streams to optimize the flow of novemes. \ndesigned4devops uses this approach to implement digital transformation . \nI have also created a hierarchy of value streams within the products’ four phases. It allows us to develop \na level of abstraction between functions of the value streams. An abstraction is a valuable tool for \ndecoupling complicated dependencies. It reduces the level of complexity and simplifies the challenges \nahead. \nWe can divide our value stream into four phases through a product’s life. Considering all four phases \nallows us to optimize the delivery flow of our product and integra

