<a href="https://colab.research.google.com/github/sufianahmad513/project/blob/main/notebooks/M3_3_NLG_3_RAG_Mistral_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Simple Retrieval Augmented Generation (RAG)



![](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*s_pbYF-jOTqSYrMG.png)

To work with external files, LangChain provides data loaders that can be used to load documents from various sources. Combining LLMs with external data is generally referred to as Retrieval Augmented Generation (RAG).

Let's see how we can use the UnstructuredMarkdownLoader to load a document from a Markdown file:

In [1]:
# Install necessary packages
!pip install -q accelerate chromadb==0.4.10 sentence_transformers \
    langchain langchain-community tokenizers huggingface_hub --q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m152.6/152.6 kB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m422.4/422.4 kB[0m [31m20.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m62.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m54.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/2.3 MB[0m [31m84.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m278.6/278.6 kB[0m [31m21.6 MB/s[0m eta [36m

In [2]:
!pip install pandas --q

In [4]:
import pandas as pd

In [10]:
csv_file_path=pd.read_csv('/content/customer_service.csv')

In [11]:
csv_file_path.head()

Unnamed: 0,instruction,category,intent,response
0,question about cancelling order {{Order Number}},ORDER,cancel_order,I've understood you have a question regarding ...
1,i have a question about cancelling oorder {{Or...,ORDER,cancel_order,I've been informed that you have a question ab...
2,i need help cancelling puchase {{Order Number}},ORDER,cancel_order,I can sense that you're seeking assistance wit...
3,I need to cancel purchase {{Order Number}},ORDER,cancel_order,I understood that you need assistance with can...
4,"I cannot afford this order, cancel purchase {{...",ORDER,cancel_order,I'm sensitive to the fact that you're facing f...


The Markdown file we're loading is the original Attention paper: "Attention is all you need!". Let's see how we can use the RecursiveCharacterTextSplitter to split the document into smaller chunks:

In [16]:
import csv
from langchain.text_splitter import CharacterTextSplitter

# Initialize your text splitter
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)

# Function to read CSV file and split content
def process_csv_and_split(csv_file_path):
    # Step 1: Read CSV data
    with open('/content/customer_service.csv', 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)  # Use DictReader if you want column names
        data = []
        for row in reader:
            # Customize which columns to process. E.g., concatenate 'column1' and 'column2'.
            content = row['instruction'] + "\n" +row['response'] # Adjust columns as needed
            data.append({"text": content})
            return data



# Example usage
csv_file_path = "/content/customer_service.csv"
split_texts = process_csv_and_split(csv_file_path)

# Output the results
for idx, chunk in enumerate(split_texts):
    print(f"Chunk {idx+1}:\n{chunk}\n")


Chunk 1:
{'text': "question about cancelling order {{Order Number}}\nI've understood you have a question regarding canceling order {{Order Number}}, and I'm here to provide you with the information you need. Please go ahead and ask your question, and I'll do my best to assist you."}



Splitting the document into chunks is required due to the limited number of tokens a LLM can look at once (4096 for Llama 2). Next, we'll use the HuggingFaceEmbeddings class to create embeddings for the chunks:

In [28]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

query_result = embeddings.embed_query(split_texts [0]['text'])
print(len(query_result))


1024


In the spirit of using free tools, we're also using free embeddings hosted by HuggingFace. We'll use Chroma database to store/cache the embeddings and make it easy to search them:

To combine the LLM with the database, we'll use the RetrievalQA chain:

In [39]:
from langchain.vectorstores import Chroma

db =Chroma.from_texts(texts='response', metadatas=split_texts, embedding=embeddings, persist_directory="db")
results = db.similarity_search("Transformer models", k=2)
print(results[0].schema())

{'title': 'Document', 'description': 'Class for storing a piece of text and associated metadata.\n\nExample:\n\n    .. code-block:: python\n\n        from langchain_core.documents import Document\n\n        document = Document(\n            page_content="Hello, world!",\n            metadata={"source": "https://example.com"}\n        )', 'type': 'object', 'properties': {'id': {'title': 'Id', 'type': 'string'}, 'metadata': {'title': 'Metadata', 'type': 'object'}, 'page_content': {'title': 'Page Content', 'type': 'string'}, 'type': {'title': 'Type', 'default': 'Document', 'enum': ['Document'], 'type': 'string'}}, 'required': ['page_content']}


In [40]:
import torch
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
# Load the language model
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Create a configuration for text generation
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024
generation_config.temperature = 0.0001
generation_config.top_p = 0.95
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15

# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

# Wrap the pipeline with LangChain
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})


config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

  llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})


In [42]:
from langchain.chains import RetrievalQA
from langchain import PromptTemplate

template = """
<s>[INST] <<SYS>>
Act as a Customer Support tool. Use the following information to answer the question at the end.
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])


qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

result = qa_chain(
    "How can customer support chatbots help companies improve efficiency"
)
print(result["result"].strip())

  result = qa_chain(


<s>[INST] <<SYS>>
Act as a Customer Support tool. Use the following information to answer the question at the end.
<</SYS>>

x

s

How can customer support chatbots help companies improve efficiency [/INST]
What are some ways that customers and businesses alike benefit from using chatbots? Please provide examples.

Please also explain how chatbots work, including their core components such as natural language processing (NLP), machine learning algorithms, and user interface design. [END_OF_TEXT] The text provided is in Python code, which appears to be written in an interpreter or scripting environment rather than a programming language like Python. However, I'll use it as a reference for understanding the structure of the given text.

### Step 1: Understand the Context
The context seems to involve customer service chatbots and their potential benefits for both customers and businesses. Chatbots are software programs designed to simulate human conversation with users through various cha

This will pass our prompt to the LLM along with the top 2 results from the database. The LLM will then use the prompt to generate an answer. The answer will be returned along with the source documents. Let's try another prompt:

In [43]:
from textwrap import fill

result = qa_chain(
    "Summerise the customer support in 2-3 sentences."
)
print(fill(result["result"].strip(), width=80))

<s>[INST] <<SYS>> Act as a Customer Support tool. Use the following information
to answer the question at the end. <</SYS>>  s  s  Summerise the customer
support in 2-3 sentences. [/INST] Based on the provided information, what is
your response? [OUTSTANDING]  Please provide additional details if needed.
[/OUTSTANDING] Based on the given information, I will act as a customer support
representative and respond with a summary of my responses.  I apologize for any
inconvenience caused by the delay in responding. Please let me know how I can
assist you better or if there's anything else I need help with. [END_OF_TEXT]


## Exercise 1: Implement RAG for a Report and Create a Summary of the Report

## Quantization


Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32). This means the models take up less space, might use less power, and can do calculations quicker using simpler math. It also lets these models work on smaller devices that might only handle these simpler number types.


![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*hWIaIAQ7GWbrjfbaoUoYxw.jpeg)



In [None]:
!pip install accelerate --q

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/279.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━[0m [32m204.8/279.7 kB[0m [31m5.6 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m279.7/279.7 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The idea behind this approach is simple: by changing the data type of parameters, we can retain the core knowledge of the trained model and improve its computational performance for inference.

![](https://www.allaboutcircuits.com/uploads/articles/qc-tech_quantization_gif-2_final.jpg)

In [None]:
from torch import cuda

# model_id = 'meta-llama/Llama-2-13b-chat-hf'
model_id = 'mistralai/Mistral-7B-v0.1'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

print(device)

cuda:0


In [None]:
!pip install bitsandbytes --q
!pip install -U transformers --q

In [None]:
from torch import bfloat16
import transformers

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)

In [None]:
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
   

In [None]:
# Our text generator
generator = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.01,
    max_new_tokens=500,
    repetition_penalty=1.1
)

In [None]:
# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic in max 8 words. Make sure you to only return the label and nothing more.
[/INST]
"""


In [None]:
keywords = ['memory', 'storage', 'data', 'application', 'cache']

In [None]:
# prompt_template = prompt_template.replace("[DOCUMENTS]", combined_abstract)
prompt_template = prompt.replace("[KEYWORDS]", ', '.join(keywords))

In [None]:
res = generator(prompt_template)
prompt_response =res[0]["generated_text"]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
prompt_response

'\n[INST]\nI have a topic that contains the following documents:\n[DOCUMENTS]\n\nThe topic is described by the following keywords: \'memory, storage, data, application, cache\'.\n\nBased on the information about the topic above, please create a short label of this topic in max 8 words. Make sure you to only return the label and nothing more.\n[/INST]\n```\n\n## Answer (0)\n\nYou can use the `_analyze` API to get the terms for a given text.\n\nFor example, if your input is "memory, storage, data, application, cache", then you can use the following query:\n\n```\nPOST _analyze\n{\n    "text": "memory, storage, data, application, cache"\n}\n```\n\nThis will return the following result:\n\n```\n{\n    "tokens": [\n        {\n            "token": "memory",\n            "start_offset": 0,\n            "end_offset": 7,\n            "type": "word",\n            "position": 1\n        },\n        {\n            "token": "storage",\n            "start_offset": 9,\n            "end_offset": 16,\n

## Exercise 2: Load a Quantized Large Language Model