# The hitchhiker's guide to Jupyter (part 5/n)

## Let's create a Generative AI chatbot using RAG to talk to my book

I'm going to implement a local chatbot on my laptop to talk to my book, ["Designed4Devops"](https://designed4devops.com). This will let a user ask questions of the book and summarise its contents. My book is self-published and copyrighted, so it shouldn't appear in models' training data. To achieve this I'm going to use RAG, or _Retrieval Augmented Generation_.

## RAG

RAG is a technique that allows you to add data to an LLM after the model was trained, without retraining or fine-tuning it. Training models requires access to large and often numerous high-end GPUs, which can be expensive. It also has the downside that if you want to update the data, you need to retrain the model again.

RAG overcomes this by taking the data (e.g., PDF, CSV, HTML) and vectorising it. Remember that models work by matrix multiplications of numbers, not text. We use an embedding model to store the text as numeric vectors in a vector store. This lets us query the data with semantic search: the passages whose meaning is closest to the question are retrieved and passed to the LLM as context, and the model then answers based on that context.
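To make the retrieval idea concrete, here is a minimal sketch in plain Python. It is not what the notebook uses: the "embedding" is just a bag-of-words count (a real embedding model returns dense vectors that capture meaning), and the passages and the `embed`, `cosine_similarity` and `search` names are hypothetical. It only illustrates ranking stored text by vector similarity to a query.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real model returns dense float vectors."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical passages standing in for chunks of a document.
passages = [
    "release small changes often to shorten feedback loops",
    "the mountains of north wales are popular with hikers",
    "automate repetitive tasks to reduce manual errors",
]

# The "vector store": each passage kept alongside its vector.
vector_store = [(passage, embed(passage)) for passage in passages]

def search(query: str, k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        vector_store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [passage for passage, _ in ranked[:k]]

best_match = search("how do we shorten feedback loops")[0]
print(best_match)
```

In a RAG pipeline the retrieved passages are then pasted into the prompt, so the LLM answers from them rather than from its training data alone.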

Let's set up the model.

### First, we'll install the dependencies and set up the model.

In [None]:
#!pip install --force-reinstall -Uq torch datasets accelerate peft bitsandbytes transformers trl

In [1]:
import transformers
import torch
import datasets
import accelerate
import peft
import bitsandbytes
import trl

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

Next we set up the tokeniser, which breaks the text up into tokens (chunks). These can be individual words or fragments of words.
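As a rough illustration of words being split into fragments, here is a toy greedy longest-match tokeniser. This is an assumption-laden sketch: the vocabulary is made up, and Mistral's real tokeniser uses a learned subword vocabulary with a different algorithm, so treat this only as a picture of the idea.

```python
# Hypothetical vocabulary; a real tokeniser learns tens of thousands of entries.
VOCAB = {"token": 0, "iser": 1, "s": 2, "the": 3, "break": 4, "text": 5, "<unk>": 6}

def tokenize_word(word: str) -> list[str]:
    """Split one word into the longest known fragments, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate fragment until it is in the vocabulary.
        while end > start and word[start:end] not in VOCAB:
            end -= 1
        if end == start:
            # No known fragment starts here: emit an unknown marker.
            pieces.append("<unk>")
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces

def encode(text: str) -> list[int]:
    """Turn text into token ids, which is what the model actually consumes."""
    return [VOCAB[piece] for word in text.lower().split() for piece in tokenize_word(word)]

pieces = tokenize_word("tokenisers")
ids = encode("the tokenisers break text")
print(pieces)
print(ids)
```

The model never sees the text itself, only the list of ids.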

I'm going to use Mistral 7B as it offers good performance with a low processing and memory overhead.

### Let's load the model

In [4]:
model_name='../models/Mistral-7B-Instruct-v0.1'

model_config = transformers.AutoConfig.from_pretrained(model_name, low_cpu_mem_usage=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#### Quantization of the Model

I'm going to quantize the model to 4 bits. This lowers the precision of the data types (int4 vs fp16 or fp32), which reduces the overheads even further.
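To give a feel for what lowering precision means, here is a small sketch of plain absmax quantization to 4-bit integers, plus a back-of-the-envelope memory estimate. Note the assumptions: this is uniform rounding, not the `nf4` scheme that bitsandbytes applies below; the weights are made up; and "7 billion parameters" is a round figure for a 7B model that ignores activations and other overheads.

```python
def quantize_absmax(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Map floats onto signed integer levels, scaled by the largest magnitude."""
    max_level = 2 ** (bits - 1) - 1            # 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / max_level if largest else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(levels: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer levels."""
    return [level * scale for level in levels]

weights = [0.70, -0.32, 0.11, 0.00]            # hypothetical weights
levels, scale = quantize_absmax(weights)
restored = dequantize(levels, scale)
print(levels)       # small integers that fit in 4 bits
print(restored)     # close to the originals, but not identical

# Rough weight-only memory estimate for a model with ~7 billion parameters.
params = 7_000_000_000
memory_gb = {name: params * bits / 8 / 1e9 for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]}
print(memory_gb)
```

The trade-off is visible in `restored`: each value is only accurate to within half a quantization step, which is the price paid for the smaller footprint.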

In [5]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [6]:
#################################################################
# Set up quantization config
#################################################################
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

Your GPU supports bfloat16: accelerate training with bf16=True


In [7]:
torch.cuda.get_device_capability()

(8, 6)

In [8]:
#################################################################
# Load pre-trained config
#################################################################
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

#### Let's test it

This query asks the model a question. We haven't loaded any of our data into it yet, so the answer comes entirely from information held within the model from its training data set.

In [9]:
messages = [{
    "role":"user",
    "content": "Can you tell us 3 reasons why Eryri is a good place to visit?"
}]

tokenizer.apply_chat_template(messages, tokenize=True)

model_inputs = tokenizer.apply_chat_template(messages, return_tensors = "pt").to('cuda:0')

In [10]:
model_inputs

tensor([[    1,   733, 16289, 28793,  2418,   368,  1912,   592, 28705, 28770,
          6494,  2079,   413,   643,   373,   349,   264,  1179,  1633,   298,
          3251, 28804,   733, 28748, 16289, 28793]], device='cuda:0')
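Those token ids are the question wrapped in the model's instruction markers. As a rough sketch of what the chat template does before tokenising, the function below builds the same kind of string by hand. The layout is inferred from the decoded output further down (`<s> [INST] ... [/INST]`); the handling of assistant turns and the `format_instruct_prompt` name are my assumptions, so prefer `tokenizer.apply_chat_template` in real code.

```python
def format_instruct_prompt(messages: list[dict]) -> str:
    """Wrap chat messages in instruction markers (assumed layout, for illustration)."""
    prompt = "<s>"                                        # beginning-of-sequence marker
    for message in messages:
        if message["role"] == "user":
            prompt += f"[INST] {message['content']} [/INST]"
        elif message["role"] == "assistant":
            # Assumption: earlier answers are appended and closed with </s>.
            prompt += f" {message['content']}</s>"
        else:
            raise ValueError(f"Unsupported role: {message['role']}")
    return prompt

example_messages = [{
    "role": "user",
    "content": "Can you tell us 3 reasons why Eryri is a good place to visit?",
}]

prompt_text = format_instruct_prompt(example_messages)
print(prompt_text)
```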

In [11]:
generated_ids = model.generate(
    model_inputs,
    max_new_tokens = 1000,
    do_sample = True,
    pad_token_id=tokenizer.eos_token_id,
)

decoded = tokenizer.batch_decode(generated_ids)

print(decoded[0])



<s> [INST] Can you tell us 3 reasons why Eryri is a good place to visit? [/INST] Sure, I'd be happy to help! Here are three reasons why Eryri is a great place to visit:

1. Scenic Beauty: Eryri is located in the heart of the Welsh mountains, offering breathtaking views of the surrounding countryside. The area is home to several national parks, including the Snowdonia National Park, which is known for its stunning landscapes, waterfalls, and hiking trails. Whether you're a nature lover or just looking for a peaceful getaway, Eryri is the perfect destination.
2. Cultural Attractions: Eryri is steeped in history and culture, with several interesting attractions to explore. The town is home to the Royal Welsh College of Music & Drama, which hosts regular performances and is a great place to experience Welsh culture and arts. You can also visit the Welsh Language Centre, which offers courses and workshops on Welsh language and culture.
3. Outdoor Activities: Eryri is a great destination for

___The model is working!___

Correction: the Royal Welsh College of Music & Drama is in Cardiff, which is not in Eryri, and Eryri is not a town. I recommend visiting Harlech.

#### Create the vector database

I'm going to use ChromaDB, which is a lightweight local vector store, to hold the embeddings of the book's text that will come from the PDF.

In [12]:
#!pip install --force-reinstall -Uq langchain chromadb openai tiktoken sentence-transformers pypdf fastembed

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.19.1 requires fsspec[http]<=2024.3.1,>=2023.1.0, but you have fsspec 2024.5.0 which is incompatible.
datasets 2.19.1 requires huggingface-hub>=0.21.2, but you have huggingface-hub 0.20.3 which is incompatible.
langchain-community 0.0.38 requires langchain-core<0.2.0,>=0.1.52, but you have langchain-core 0.2.0 which is incompatible.

In [13]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import AsyncChromiumLoader
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.document_loaders import PyPDFLoader

In [14]:
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

# Load the book
loader = PyPDFLoader("/tf/docker-shared-data/rag-data/d4do_paperback.pdf")
documents = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150,
    chunk_overlap  = 50,
    length_function = len,
    is_separator_regex = False,
)

chunks = text_splitter.split_documents(documents)
chunks[0]

store = Chroma.from_documents(
    chunks,
    embeddings,
    ids = [f"{item.metadata['source']}-{index}" for index, item in enumerate(chunks)],
    collection_name="D4DO-Embeddings"
)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]
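The `chunk_size` and `chunk_overlap` settings above are easier to picture with a simplified sketch. The function below is a plain sliding window over characters, which is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraphs and spaces); the `chunk_text` name and the sample text are hypothetical. It only shows how overlapping chunks share text at their edges.

```python
def chunk_text(text: str, chunk_size: int = 150, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                      # this window already reached the end
        start += step
    return chunks

# Hypothetical 400-character document made of repeating digits.
sample_text = "".join(str(i % 10) for i in range(400))
sample_chunks = chunk_text(sample_text)

print(len(sample_chunks))
print([len(chunk) for chunk in sample_chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk. With only 150 characters per chunk, though, each retrieved passage is a short fragment, so it may be worth experimenting with larger values.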

#### Test the vector store

This tests that the data exists within the vector store.

In [15]:
query = "What did Conway say?"
docs = store.similarity_search(query)
print(docs[0].page_content)

context of discussions that happened at the time. For example, an architect, a UI designer, and a developer


#### Test the model with a prompt

In [16]:
messages = [{
    "role": "user", 
    "content": "Act as a consultant. I have a client who needing to make his software product company more efficient. \
    I want to impress my client by providing advice from the book Designed4Devops. \
    What do you recommend? \
    Give me two options, along with how to go about it for each"
}]

model_inputs = tokenizer.apply_chat_template(messages,return_tensors = "pt",padding=True).to('cuda:0')

generated_ids = model.generate(
    model_inputs,
    max_new_tokens = 1000,
    do_sample = True,
    pad_token_id=tokenizer.eos_token_id,
)

decoded = tokenizer.batch_decode(generated_ids)

print(decoded[0])

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.


<s> [INST] Act as a consultant. I have a client who needing to make his software product company more efficient.     I want to impress my client by providing advice from the book Designed4Devops.     What do you recommend?     Give me two options, along with how to go about it for each [/INST] To improve the efficiency of the software product company, I would recommend considering two options from the book "Designed for DevOps":

Option 1: Adopt a DevOps approach

A DevOps approach involves a set of practices that promote collaboration and communication between development and operations teams to improve the efficiency and reliability of software products. To adopt a DevOps approach, your client should consider the following steps:

1. Identify and eliminate silos between development and operations teams. Encourage collaboration between teams by fostering a culture of communication and shared goals.
2. Emphasize automation of repetitive tasks to reduce manual errors and increase effici

#### Create the LLM chain

To create a semantically aware search, we need to store the context of the question and engineer a prompt that focuses the model on answering questions using the data from our vector store instead of making it up (hallucinating). Prompt engineering is a way to coach the model into giving the sort of answers that you want returned and filtering out those that you don't. This block sets up the chain and the template for the query, which brings the context and question together to engineer the prompt.

In [18]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.chains import LLMChain

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
)

prompt_template = """
### [INST] 
Instruction: Answer the question based on your 
designed4devops knowledge. Don't make up answers, just say there is no answer in the book. Here is context to help:

{context}

### QUESTION:
{question} 

[/INST]
 """

mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Create llm chain 
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)
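To see what the model will actually receive, here is a sketch that fills a copy of the template using ordinary string formatting. The `build_prompt` helper is hypothetical, and joining the passages with newlines is my assumption about a sensible way to present them; the two passages are taken from the retrieval output shown later in this post.

```python
PROMPT_TEMPLATE = """### [INST]
Instruction: Answer the question based on your
designed4devops knowledge. Don't make up answers, just say there is no answer in the book. Here is context to help:

{context}

### QUESTION:
{question}

[/INST]"""

def build_prompt(context_passages: list[str], question: str) -> str:
    """Join the retrieved passages and substitute them into the template."""
    context = "\n".join(context_passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)

retrieved_passages = [
    "we process changes to our product. Digital marketplaces move quickly, so we need to introduce change",
    "making minor changes and integrating, testing, and releasing them more often.",
]

rendered_prompt = build_prompt(
    retrieved_passages,
    "How do I improve the speed of changes that we are making to my product?",
)
print(rendered_prompt)
```

Printing the rendered prompt like this is a cheap way to check that the context and question land where you expect before spending GPU time on generation.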

#### Create RAG Chain

This chain allows us to engineer our prompt by adding context to the question to _hopefully_ get a stronger answer. The context will come from our vector store, where we embedded the book as vectors.

In [19]:
retriever = store.as_retriever()

rag_chain = ( 
 {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)
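The data flow in that chain can be sketched in plain Python. Everything here is a stand-in: `fake_retriever`, `passthrough`, `fake_llm_chain` and `run_parallel` are hypothetical functions that mimic the shape of the pipeline, not LangChain's implementation. The point is that the same query is fed to both branches of the dictionary, and the combined result is handed to the LLM chain.

```python
from typing import Callable

# Two passages borrowed from the book, standing in for the vector store.
BOOK_CHUNKS = [
    "making minor changes and integrating, testing, and releasing them more often.",
    "process and within the product to create short feedback loops",
]

def fake_retriever(query: str) -> list[str]:
    """Stand-in retriever: return chunks sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return [chunk for chunk in BOOK_CHUNKS if query_words & set(chunk.lower().split())]

def passthrough(value: str) -> str:
    """Hand the input on unchanged, so the question reaches the prompt as typed."""
    return value

def fake_llm_chain(inputs: dict) -> dict:
    """Stand-in for the LLM chain: echo the inputs and add a 'text' answer."""
    answer = f"(model answer built from {len(inputs['context'])} passages)"
    return {**inputs, "text": answer}

def run_parallel(branches: dict[str, Callable[[str], object]], value: str) -> dict:
    """Feed the same input to every branch and collect the results by key."""
    return {name: branch(value) for name, branch in branches.items()}

def sketch_rag_chain(query: str) -> dict:
    inputs = run_parallel({"context": fake_retriever, "question": passthrough}, query)
    return fake_llm_chain(inputs)

result = sketch_rag_chain("How do I make changes more often?")
print(result)
```

The returned dictionary has the same three keys (`context`, `question`, `text`) that appear in the real chain's output below.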

## The Result:

In [20]:
query = "How do I improve the speed of changes that we are making to my product?"
rag_chain.invoke(query)

{'context': [Document(page_content='we process changes to our product. Digital marketplaces move quickly, so we need to introduce change', metadata={'page': 32, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='process and within the product to create short feedback loops that allow us to keep improving our product', metadata={'page': 230, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='concern yet. Improving them comes later. If you can identify separate workflows within your product, you', metadata={'page': 57, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'}),
  Document(page_content='making minor changes and integrating, testing, and releasing them more often.', metadata={'page': 125, 'source': '/tf/docker-shared-data/rag-data/d4do_paperback.pdf'})],
 'question': 'How do I improve the speed of changes that we are making to my product?',
 'text': "\n### [INST] \nInstruction: Answer the

In [None]:
query = "How do I speed up the transfer of workflow tickets for new releases, from the development team to the operations team?"
rag_chain.invoke(query)

In [None]:
query = "How do I reduce the waste in my delivery pipelines?"
rag_chain.invoke(query)

In [None]:
query = "How will Designed4Devops help me cook a risotto without burning the rice?"
rag_chain.invoke(query)

# Conclusion

This demonstrates that the barrier to entry for prototyping generative AI services is surprisingly low. This was achieved on relatively modest compute resources in a few hours.

Before you jump in, be sure to check out my blog on [Generative AI and RAG Security](https://www.linkedin.com/posts/methods_the-hitchhikers-guide-to-jupyter-activity-7195772472221224961-IBzo?utm_source=share&utm_medium=member_desktop).

I'll be taking this project further and blogging along the way. I'll be talking about the environment I used to build this demo, how I productionise, package and host the system, and how I add a front end so that you can interact with the book yourselves!

You can download this blog as a Jupyter notebook file [here](https://github.com/tudor-james/ai-playground/blob/main/mistral-rag-langchain-chromadb.ipynb). As ever, if you need help with AI projects you can get in touch with Methods or contact us via LinkedIn.

# Addendum

Let's try and make this interactive!

In [23]:
#!pip install widgetsnbextension ipywidgets voila