<a href="https://colab.research.google.com/github/vasarajasekhar/Fine-Tuning-LLMs/blob/main/Run_mistral_on_a_single_GPU_ATL%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running Mistral-7B on a Single GPU with Google Colab
Welcome to this notebook, which shows how to load and run Mistral-7B in 4-bit precision using the quantization scheme introduced with QLoRA, a technique designed to cut memory use sharply with little loss in output quality.

In this notebook, we will learn together how to load a model in 4-bit, understand the available variants, and run them for inference.

Note that this could be used for any model that supports device_map (i.e. loading the model with accelerate).

## Step 0 - Enable text wrapping so we don't have to scroll horizontally


In [1]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)


## Step 1 - Install necessary packages
First, install the dependencies below to get started. bitsandbytes is installed from PyPI; as the 4-bit features are available on the main branches only, transformers, peft, and accelerate are installed from source.

In [2]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git



## Step 2 - Define quantization parameters through the BitsAndBytesConfig from transformers


* load_in_4bit=True: specify that we want to convert and load the model in 4-bit precision.
* bnb_4bit_use_double_quant=True: Use nested quantization for more memory efficient inference and training.
* bnb_4bit_quant_type="nf4": The 4-bit integration comes with two quantization types, FP4 and NF4. NF4 stands for Normal Float 4 and was introduced in the QLoRA paper. FP4 is the default.
* bnb_4bit_compute_dtype=torch.bfloat16: The compute dtype sets the dtype used during computation. The default is float32, but computation can be set to bf16 for speedups.
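To make the idea of 4-bit storage concrete, here is a minimal sketch of block-wise absmax quantization to 16 evenly spaced levels. It is a deliberate simplification and not the NF4 scheme itself: NF4 spaces its 16 levels according to a normal distribution, and bitsandbytes implements the real thing in optimized GPU kernels. The function names `quantize_block` and `dequantize_block` are hypothetical and exist only for this illustration.

```python
def quantize_block(values, levels=16):
    """Map one block of floats to integer codes in [0, levels - 1]."""
    # One scaling constant per block: the largest absolute value.
    absmax = max(abs(v) for v in values) or 1.0
    half = (levels - 1) / 2  # 7.5 when there are 16 levels (4 bits)
    # Scale to [-1, 1], shift to [0, 2], then stretch to [0, levels - 1].
    codes = [round((v / absmax + 1) * half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the block constant."""
    half = (levels - 1) / 2
    return [(c / half - 1) * absmax for c in codes]


# Illustrative weights (made up for this example).
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight is stored as a 4-bit code plus one shared constant per block, and the rounding error stays within half a quantization step. Nested quantization (`bnb_4bit_use_double_quant=True`) goes one step further and compresses those per-block constants as well.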



In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)



## Step 3 - Load the Model with quantization

Now we specify the model ID and then we load it with our previously defined quantization configuration.

In [4]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]
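As a rough sanity check on why 4-bit loading matters for a single GPU, the sketch below estimates weight memory from the two shard sizes listed in the download log above (9.94 GB and 4.54 GB). It assumes the checkpoint stores 16-bit weights and that every parameter is quantized, which is not quite true: embeddings, norms, and the output head stay in higher precision, so treat the 4-bit figure as a lower bound. The 0.127 bits per parameter for quantization constants is the figure the QLoRA paper gives for double quantization; `estimate_weight_memory_gb` is a hypothetical helper.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Shard sizes taken from the download log; 2 bytes per parameter is an assumption.
checkpoint_bytes = 9.94e9 + 4.54e9
num_params = checkpoint_bytes / 2

print(f"{estimate_weight_memory_gb(num_params, 16):.2f}")     # 14.48
print(f"{estimate_weight_memory_gb(num_params, 4):.2f}")      # 3.62
print(f"{estimate_weight_memory_gb(num_params, 4.127):.2f}")  # 3.73
```

Under these assumptions the quantized weights need roughly a quarter of the 16-bit footprint, which leaves room for activations and the generation cache on a typical Colab GPU.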

If you print the model, you will see that most of the nn.Linear layers are replaced by bnb.nn.Linear4bit layers!

In [5]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
   

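Let's make sure the whole model was loaded on the GPU. A device map of `{'': 0}` means the root module, and therefore every submodule, sits on GPU 0.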

In [6]:
model.hf_device_map

{'': 0}
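For larger models, `device_map="auto"` may spread modules across several devices, and the map then has many entries. The sketch below shows one way to check such a map programmatically. It assumes the usual accelerate convention where values are integer GPU indices or strings such as `"cuda:0"`, `"cpu"`, or `"disk"`; `is_fully_on_gpu` is a hypothetical helper, not part of any library.

```python
def is_fully_on_gpu(device_map):
    """Return True when every entry of an accelerate-style device map is a GPU."""
    def is_gpu(device):
        # Integer values are GPU indices; strings may be "cuda:N", "cpu", or "disk".
        return isinstance(device, int) or str(device).startswith("cuda")

    # An empty map carries no placement information, so report False.
    return bool(device_map) and all(is_gpu(d) for d in device_map.values())


print(is_fully_on_gpu({"": 0}))                                 # True
print(is_fully_on_gpu({"model.layers.0": 0, "lm_head": "cpu"}))  # False
```

In this notebook the check could be applied as `is_fully_on_gpu(model.hf_device_map)`. If it returns False, part of the model was offloaded and generation will be noticeably slower.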

## Step 4 - Once loaded, run a generation!

Test 1: ask for a mayonnaise recipe

In [7]:
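device = "cuda:0"

# A single-turn conversation in the chat-message format expected by the tokenizer.
messages = [
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# Wrap the message in the model's instruction template and tokenize it.
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

# Move the token IDs to the GPU that holds the model.
model_inputs = encodeds.to(device)

# Sample up to 1000 new tokens, then decode the full sequence (prompt + answer).
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])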

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] Do you have mayonnaise recipes? [/INST] I'd be happy to share a few simple mayonnaise recipes with you!

Here's a basic recipe:

Ingredients:

* 1 egg yolk
* 1 tablespoon Dijon mustard
* 1 teaspoon salt
* ¼ teaspoon black pepper
* 1 cup vegetable oil (such as canola or sunflower)
* ¼ cup champagne vinegar or lemon juice

Instructions:

1. In a blender or food processor, blend the egg yolk, Dijon mustard, salt, and black pepper until well combined.
2. Slowly add the vegetable oil in a steady stream while the blender is running.
3. Continue blending until well combined and the mixture has thickened slightly.
4. Add the champagne vinegar or lemon juice and blend again until well combined.
5. Serve immediately or store in an airtight container in the fridge for up to one week.

Here are a few variations you may enjoy:

* Honey mustard mayonnaise: Replace the Dijon mustard with a tablespoon of honey and use apple cider vinegar in place of champagne vinegar.
* Garlic mayonnaise: A
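The warning printed before this output appears because `generate` received only the token IDs, with no attention mask or pad token. One possible way to avoid it, sketched below, is to ask `apply_chat_template` for a dictionary that also contains the attention mask and to pass `pad_token_id` explicitly. `generate_reply` is a hypothetical helper name, and the `return_dict=True` argument assumes a reasonably recent transformers release.

```python
def generate_reply(model, tokenizer, messages, device="cuda:0", max_new_tokens=1000):
    """Generate a sampled completion while passing the attention mask explicitly."""
    # return_dict=True yields both input_ids and attention_mask.
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        return_dict=True,
    ).to(device)

    generated_ids = model.generate(
        **inputs,  # forwards input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # same choice the warning falls back to
    )
    # Decode the full sequence, prompt included, as the cell above does.
    return tokenizer.batch_decode(generated_ids)[0]
```

With the objects loaded in Step 3, `print(generate_reply(model, tokenizer, messages))` should behave like the cell above, minus the warning. Because sampling is enabled, the text will differ from run to run.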

Test 2: ask the model to generate Python code from a natural-language instruction

In [8]:
messages = [
    {"role": "user", "content": "write a python function to generate a list of random 1000 numbers between 1 and 10000?"}
    ]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] write a python function to generate a list of random 1000 numbers between 1 and 10000? [/INST] Here is a python function that will generate a list of 1000 random numbers between 1 and 10000:
```
import random

def generate_random_numbers(n):
    random_numbers = []
    for i in range(n):
        random_numbers.append(random.randint(1, 10000))
    return random_numbers
```
The function takes one argument, `n`, which is the number of random numbers you want to generate. It uses the built-in `random` module to generate random integers between 1 and 10000 using the `randint()` function. It then appends each random number to a list called `random_numbers`. Finally, it returns the list.

Here's an example of how you could use the function to generate a list of 1000 random numbers:
```
random_numbers = generate_random_numbers(1000)
print(random_numbers)
```</s>
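The decoded text above still contains the prompt, the `[INST]` markers, and the `</s>` end-of-sequence token. If only the answer is needed, a small post-processing step can strip them. The sketch below works on the decoded string and assumes the markers appear exactly as in the outputs shown in this notebook; `extract_answer` is a hypothetical helper.

```python
def extract_answer(decoded, end_of_prompt="[/INST]", eos="</s>"):
    """Return only the model's answer from a decoded prompt-plus-answer string."""
    # Keep the text after the last prompt marker (the whole string if none is found).
    answer = decoded.rsplit(end_of_prompt, 1)[-1].strip()
    # Drop a trailing end-of-sequence token if the model emitted one.
    if answer.endswith(eos):
        answer = answer[: -len(eos)]
    return answer.strip()


print(extract_answer("<s> [INST] hi [/INST] Hello!</s>"))  # Hello!
```

An alternative that avoids string handling is to pass `skip_special_tokens=True` to `tokenizer.batch_decode`, which removes `<s>` and `</s>` but leaves the prompt text in place.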


Test 3: ask the model to explain in simple words what a Large Language Model is

In [9]:
PROMPT= """ ### Instruction: Act as a data science expert.
### Question:
Explain to me what is Large Language Model. Assume that I am a 5-year-old child.

### Answer:
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])



<s>  ### Instruction: Act as a data science expert.
### Question:
Explain to me what is Large Language Model. Assume that I am a 5-year-old child.

### Answer:
A Large Language Model, also called an AI language model, is a special type of computer program that's very good at understanding and talking like a person. It's like having a super smart robot that can understand what you're saying and respond in a way that makes sense. It's called "large" because it's really big and can understand a lot of different things. The model is trained using a large amount of data, so it can understand and respond to many different types of text and speech. This means that it's really good at things like understanding what you're saying, helping you to find information online, and even helping you with homework!</s>
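Test 3 passes a raw string to the tokenizer, so the prompt does not carry the `[INST]` markers that the chat template added in Tests 1 and 2. For comparison, the sketch below reproduces the wrapper visible in those decoded outputs for a single user turn. `build_inst_prompt` is a hypothetical helper that only approximates the format; the authoritative template is the one shipped with the tokenizer and applied by `apply_chat_template`.

```python
def build_inst_prompt(user_message):
    """Wrap one user message in the [INST] ... [/INST] markers seen in the outputs."""
    # The beginning-of-sequence token <s> is left to the tokenizer
    # (add_special_tokens=True), so it is not added here.
    return f"[INST] {user_message.strip()} [/INST]"


print(build_inst_prompt("Explain to me what is Large Language Model."))
# [INST] Explain to me what is Large Language Model. [/INST]
```

The resulting string could be passed to `tokenizer(...)` in place of `PROMPT`. Whether the instruction format or the `### Question:` style gives better answers for a given task is something to check empirically.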


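Test 4: ask a question in Chinese ("I want to travel to China. Which places do you recommend visiting?")

In [10]: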
device = "cuda:0"
PROMPT= """
### Question:
我想去中国旅游。 你推荐参观哪些地方

### Answer:
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s> 
### Question:
我想去中国旅游。 你推荐参观哪些地方

### Answer:
中国是一个很大的国家，有很多美好的景点要参观。以下是一些建议:

* 北京: 参观京城、故宫、海长台和等。
* 长沙: 参观永登寺、鹤凤寺、永登山和等。
* 天안 Mountain: 参观天主寺、天寺、大乞草海滩 и等
*  Suzhou: 参观 Master of the Nets Garden， Humble Administrator's Garden, Pingjiang Road and  Pagoda with Fishing Peons
*  Shanghai: 参观 Bund, Yu Garden and the People's Park.
*  Xi'an: 参观 Terracotta Warriors, Horses, and Policemen, Dynasty Square and Great Wild Goose Pagoda

这都是我认为值得参观的地方。但你还需要根据自己的兴趣和时间计划来决定。对于其他地方的建议，请告诉我更多 Details.</s>


Test 5: ask the model to brainstorm tinkering activities

In [16]:
device = "cuda:0"

messages = [
    {"role": "user", "content": "High school student retrieved small Pyrite stones by breaking the stones he found in his house. Generate 10 tinkering activities that he can do with these small Pyrite stones"}]


encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)


generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] High school student retrieved small Pyrite stones by breaking the stones he found in his house. Generate 10 tinkering activities that he can do with these small Pyrite stones [/INST] 1. Create a necklace or bracelet by threading the Pyrite stones onto a length of string or wire.
2. Use the Pyrite stones to construct a small obstacle course or maze in your home.
3. Make a small sculpture or statue using the Pyrite stones as building blocks.
4. Art project: use Pyrite stones to create a mosaic or pattern on a piece of paper or clay.
5. Attach the Pyrite stones to a piece of paper or cardboard and use them as marbles in a game of skill.
6. Create a small kinetic sculpture using the Pyrite stones as the base.
7. Use the Pyrite stones to create a small decoration, such as a wall hanging or a plant marker.
8. Create a birdhouse or bird feeder using Pyrite stones as the base.
9. Use the Pyrite stones to create a small jewelry box or organizer.
10. Create a small sculpture or statue