<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/RAG_for_Mistral_7B_on_Consumer_Hardware_with_LlamaIndex_and_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to set up a RAG system based on Mistral 7B. It uses LlamaIndex for retrieval and for the vector database, and Hugging Face Transformers for inference.

You need a 16 GB GPU to run this notebook (the free T4 or the V100 of Google Colab is enough). It may work with a 12 GB GPU if you use a smaller embedding model.
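To check how much GPU memory the current runtime actually offers, a quick probe like the one below can help. It is only a sketch: `gpu_memory_gb` is a hypothetical helper name, and it assumes PyTorch is already installed (it returns `None` otherwise, or when no CUDA device is visible).

```python
def gpu_memory_gb():
    """Return the total memory of the first CUDA GPU in GB, or None if unavailable."""
    try:
        import torch  # assumed to be preinstalled, as it is on Colab
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # total_memory is reported in bytes
    return torch.cuda.get_device_properties(0).total_memory / 1e9


total = gpu_memory_gb()
if total is None:
    print("No CUDA GPU detected.")
else:
    print(f"GPU memory: {total:.1f} GB")
```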

First, install the following:

In [None]:
!pip install --upgrade llama-index llama-index-embeddings-huggingface peft transformers auto-gptq accelerate optimum bitsandbytes

Collecting llama-index
  Downloading llama_index-0.10.20-py3-none-any.whl (5.6 kB)
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.1.4-py3-none-any.whl (7.7 kB)
Collecting peft
  Downloading peft-0.9.0-py3-none-any.whl (190 kB)
Collecting transformers
  Downloading transformers-4.39.0-py3-none-any.whl (8.8 MB)
Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)
Collecting accelerate
  Downlo

Import everything we need for the RAG part:

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SimilarityPostprocessor

Set the embedding model and the main configuration settings. Use BAAI/bge-small-en-v1.5 instead if you don't have enough memory.

In [None]:

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

Settings.llm = None
Settings.chunk_size = 128
Settings.chunk_overlap = 10
top_k = 6

LLM is explicitly disabled. Using MockLLM.
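To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified sliding-window sketch. It is not LlamaIndex's splitter (which works on tokens and also tries to respect sentence boundaries); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share `chunk_overlap` items.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy "document" of 300 integer tokens, split with the notebook's settings
tokens = list(range(300))
chunks = sliding_chunks(tokens, chunk_size=128, chunk_overlap=10)
for i, chunk in enumerate(chunks):
    print(i, chunk[0], chunk[-1], len(chunk))
```

With small chunks such as 128, each retrieved passage is short, which is why several of them (`top_k = 6`) are retrieved and concatenated later.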


For this demonstration, I use a few articles from The Kaitchup that will be indexed and used as the knowledge source. You can get these articles here:

In [None]:
!wget https://about.benjaminmarie.com/data/kaitchup_samples/samples.zip
!unzip samples.zip -d samples

--2024-03-24 07:43:31--  https://about.benjaminmarie.com/data/kaitchup_samples/samples.zip
Resolving about.benjaminmarie.com (about.benjaminmarie.com)... 192.95.30.6
Connecting to about.benjaminmarie.com (about.benjaminmarie.com)|192.95.30.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4992687 (4.8M) [application/zip]
Saving to: ‘samples.zip’


2024-03-24 07:43:35 (2.25 MB/s) - ‘samples.zip’ saved [4992687/4992687]

Archive:  samples.zip
  inflating: samples/a0.pdf          
  inflating: samples/a1.pdf          
  inflating: samples/a2.pdf          
  inflating: samples/a3.pdf          
  inflating: samples/a4.pdf          


Load the PDFs and extract their text. Then, index the text into the vector database.

Also set up the two main components of RAG: the retriever and the query engine.

In [None]:
documents = SimpleDirectoryReader("samples").load_data()
index = VectorStoreIndex.from_documents(documents)

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
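Conceptually, the retriever ranks chunks by embedding similarity and keeps the `top_k` best, and the postprocessor then drops anything below the cutoff. The toy sketch below illustrates that idea with hand-written 2-D vectors and cosine similarity. It is an assumption-laden illustration, not LlamaIndex's internal code, and `retrieve` is a hypothetical function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunk_vecs, top_k, cutoff):
    """Return (chunk_index, score) for the top_k chunks whose score is >= cutoff."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:top_k] if score >= cutoff]


# Toy embeddings: chunk 0 is identical to the query, chunk 3 points the opposite way
query_vec = [1.0, 0.0]
chunk_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
print(retrieve(query_vec, chunk_vecs, top_k=3, cutoff=0.5))
```

Note that the result can contain fewer than `top_k` entries when the cutoff filters some of them out.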

Retrieve the chunks for a given query, i.e., the prompt entered by the user.

In [None]:
query = "How can I run Mixtral-8x7B on my 24 GB GPU?"
response = query_engine.query(query)
# The similarity cutoff can leave fewer than top_k chunks, so iterate over what was returned.
context = ""
for node in response.source_nodes:
    context = context + node.text + "\n\n"
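Before trusting the context, it can be useful to look at each retrieved chunk next to its similarity score. The sketch below assumes each source node exposes `.text` and `.score`; it uses stand-in objects so that it runs on its own, and `summarize_nodes` is a hypothetical helper. In the notebook you would pass `response.source_nodes` instead.

```python
from types import SimpleNamespace


def summarize_nodes(source_nodes, preview_chars=60):
    """Build one line per retrieved chunk: rank, score, and a short text preview."""
    lines = []
    for rank, node in enumerate(source_nodes, start=1):
        # Collapse line breaks and repeated spaces left over from PDF extraction
        preview = " ".join(node.text.split())[:preview_chars]
        score = "n/a" if node.score is None else f"{node.score:.3f}"
        lines.append(f"{rank}. score={score} | {preview}")
    return lines


# Stand-in nodes with made-up scores (illustrative only)
fake_nodes = [
    SimpleNamespace(text="Mixtral-8x7B is a mixture\nof experts (MoE).", score=0.71),
    SimpleNamespace(text="Reducing the Memory Consumption", score=None),
]
for line in summarize_nodes(fake_nodes):
    print(line)
```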

In [None]:
print(context)

T o receive new posts and
support my work, consider becoming a free or
paid subscriber.
In this article, I show how to fine-tune Mixtral-8x7B quantized with AQLM using only 16
GB of GPU RAM. In other words, we only need a $500 GPU  to fine-tune Mixtral. I also
discuss how to optimize the fine-tuning hyperparameters to further reduce memory
consumption while maintaining a good performance.

If you only have a GPU with 16 GB of VRAM, fine-tuning Mixtral is still possible but it will
be slower and won’t perform as well.
Try the following changes of configuration (from best to worst to maintain the model
performance):
Decrease per_device_train_batch_size and proportionally increase
gradient_accumulation_steps. The minimal value for the training batch size is 1. Iftrainer.train()
Reducing the Memory Consumption

Even when quantized to 4-bit, the model can’t be
fully loaded on a consumer GPU (e.g., an R TX 3090 with 24 GB of VRAM is not enough).
Mixtral-8x7B is a mixture of experts (MoE). It

Load Mistral 7B Instruct. We use the GPTQ-quantized version to reduce memory consumption.

In [None]:
from transformers import AutoTokenizer
import transformers
import torch

model = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="cuda",
)
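A back-of-the-envelope calculation shows why the quantized checkpoint matters. The sketch below only counts the weights and assumes roughly 7.24 billion parameters and 4-bit quantization (check the model card for the branch you load). It ignores quantization metadata, activations, the KV cache, and the embedding model, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


MISTRAL_7B_PARAMS = 7.24e9  # approximate parameter count (assumption)

for label, bits in [("fp16", 16), ("4-bit GPTQ", 4)]:
    print(f"{label}: ~{weight_memory_gb(MISTRAL_7B_PARAMS, bits):.1f} GB")
```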




First, let's see how the model answers the user's query without RAG:

In [None]:
messages = [{"role": "user", "content": query}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
outputs = pipeline(prompt, max_new_tokens=512, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST] How can I run Mixtral-8x7B on my 24 GB GPU? [/INST] Mixtal-8x7B is a deep learning model that requires a significant amount of computational resources, including a powerful GPU. However, the exact GPU requirements for running Mixtal-8x7B depend on the specific configuration of the model and the size of the data you plan to process.

With a 24 GB GPU, you should be able to run Mixtal-8x7B on larger datasets than what would be feasible on a smaller GPU. Here are some general steps you can follow to get started:

1. Install the required software and dependencies: You will need to have a compatible version of TensorFlow or PyTorch installed on your system, along with any necessary dependencies. You can find the installation instructions for these deep learning frameworks on their official websites.

2. Download the Mixtal-8x7B model: You will need to download the pre-trained Mixtal-8x7B model weights and configuration files. You can usually find these files on the model's officia

Mistral 7B Instruct doesn't have a "system" role in its chat template. For RAG, a system role is useful: it gives us a place to pass the chunks retrieved from the database, which the assistant then uses to answer the user's query.

I use the following custom chat template:


In [None]:
!git clone https://github.com/chujiezheng/chat_templates.git

Cloning into 'chat_templates'...
remote: Enumerating objects: 166, done.
remote: Counting objects: 100% (166/166), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 166 (delta 99), reused 112 (delta 51), pack-reused 0
Receiving objects: 100% (166/166), 27.54 KiB | 3.93 MiB/s, done.
Resolving deltas: 100% (99/99), done.


Override the original chat template of Mistral 7B:

In [None]:
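# Read the Jinja template, then strip indentation and newlines so it becomes a single-line template.
with open('./chat_templates/chat_templates/mistral-instruct.jinja') as f:
    chat_template = f.read()
chat_template = chat_template.replace('    ', '').replace('\n', '')
tokenizer.chat_template = chat_template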
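To give an intuition for what a system-aware template can do, the sketch below folds the system message into the first `[INST]` block in plain Python. It is a simplified stand-in, not the repository's actual Jinja template: `render_mistral_prompt` is a hypothetical function, and the exact whitespace and separators of the real template may differ.

```python
def render_mistral_prompt(messages, bos_token="<s>"):
    """Render chat messages in a Mistral-instruct-like format (illustrative only)."""
    system = ""
    if messages and messages[0]["role"] == "system":
        # Assumption: the system text is prepended to the first user turn
        system = messages[0]["content"].strip() + "\n\n"
        messages = messages[1:]
    prompt = bos_token
    for i, message in enumerate(messages):
        content = message["content"].strip()
        if message["role"] == "user":
            if i == 0:
                content = system + content
            prompt += f"[INST] {content} [/INST]"
        elif message["role"] == "assistant":
            prompt += f" {content}</s>"
        else:
            raise ValueError(f"Unsupported role: {message['role']}")
    return prompt


demo = [
    {"role": "system", "content": "Context:\nsome retrieved chunk"},
    {"role": "user", "content": "How can I run Mixtral-8x7B on my 24 GB GPU?"},
]
print(render_mistral_prompt(demo))
```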

Then, provide the chunks retrieved from the database in the system message. Mistral 7B should now have all the information it needs to answer the question.
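One way to keep this prompt assembly readable is a small helper that builds the message list from the context and the query. This is only a sketch: `build_rag_messages` and its default instruction text are hypothetical, and the notebook itself uses a longer system prompt.

```python
def build_rag_messages(context, query,
                       instructions="Answer using the context below. "
                                    "If the context is insufficient, say so."):
    """Return a [system, user] message list with the retrieved context in the system message."""
    system_content = f"{instructions}\n\nContext:\n{context.strip()}\n\n"
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": query},
    ]


# Example with a placeholder context string
example = build_rag_messages("chunk one\n\nchunk two\n\n", "What do the chunks say?")
print(example[0]["content"])
print(example[1]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` in the same way as the hand-written `messages` list.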

In [None]:
messages = [{"role": "system", "content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n\nContext:\n"+context+"\n\n"},
 {"role": "user", "content": query}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
outputs = pipeline(prompt, max_new_tokens=512, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

Context:
T o receive new posts and
support my work, consider becoming a free or
paid subscriber.
In this article, I show how to fine-tune Mixtral-8x7B quantized with AQLM using only 16
GB of GPU RAM. In other words, we only need a $500 GPU  to fine-tune Mixtral. I also
discuss how to optimize the fine-tuning hyperparameters to further reduce memory
consumption while maintaining a good performance.

If you only have a GPU with 16 GB of VRAM, fine-tuning Mixtral is still possibl