# Parameter-Efficient Fine Tuning tutorial
Reference:
- [Generation with LLMs](https://huggingface.co/docs/transformers/v4.41.2/en/llm_tutorial)
- [Prompting tutorial](https://huggingface.co/docs/transformers/v4.41.2/en/tasks/prompting)
- [Text generation strategy](https://huggingface.co/docs/transformers/v4.41.2/en/generation_strategies)
- [Different text generation mode](https://huggingface.co/blog/how-to-generate)

In [2]:
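# Quote the version specifiers so the shell does not parse > and < as redirections.
# tokenizers is installed as a dependency of transformers, so it needs no separate pin.
!pip install "bitsandbytes>=0.39.0" "transformers==4.41.2"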


# Load an LLM for next-word prediction

In [3]:
# The Hub repo needs to contain an adapter_config.json file and the adapter weights, which store the adapter parameters
from transformers import AutoModelForCausalLM, AutoTokenizer



In [4]:
model_id = "mistralai/Mistral-7B-v0.1"
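The cell that follows loads this model in 4-bit mode. A rough back-of-the-envelope estimate shows why that helps. This is only a sketch: it assumes about 7 billion parameters (read off the "7B" in the model name) and counts the weights alone, ignoring activations, the KV cache, and quantization overhead. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7e9 parameters, taken from the "7B" in the model name.
NUM_PARAMS = 7e9

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# float32: ~28.0 GB
# float16: ~14.0 GB
#   4-bit: ~3.5 GB
```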

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_id,
    device_map="auto",   # Move to GPUs
    load_in_4bit=True,   # Load the model in 4-bit mode
)



[2024-10-20 20:24:07,830] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00,  6.68s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
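The deprecation warning above asks for a `BitsAndBytesConfig` object passed through `quantization_config`. A minimal sketch of that form is shown here. The wrapper function `load_model_4bit` is a hypothetical helper introduced only for illustration, and it has not been run against the exact library versions used in this notebook.

```python
def load_model_4bit(model_id: str):
    """Load a causal LM with 4-bit weights via `quantization_config`."""
    # Imported lazily so that defining this helper does not require a GPU stack.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(load_in_4bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",  # let accelerate place the layers on available devices
        quantization_config=quantization_config,
    )
```

Calling `load_model_4bit(model_id)` should be equivalent to the cell above, minus the deprecation warning.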


In [6]:
from transformers import AutoTokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Set the default padding token
tokenizer.pad_token = tokenizer.eos_token
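A padding token only matters once a batch contains prompts of different lengths. For decoder-only models the padding usually goes on the left, so that generation continues from real tokens rather than from pad tokens. The toy sketch below shows the idea in plain Python. The token ids are made up, and `left_pad` is a hypothetical helper, not a tokenizer API. With a real tokenizer, the same effect would come from `padding_side="left"` together with `padding=True`.

```python
def left_pad(batch, pad_id):
    """Left-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append([pad_id] * n_pad + list(ids))
        # 0 marks padding positions the model should ignore, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


# Made-up token ids; 2 stands in for the eos id that is reused as the pad id.
PAD_ID = 2
ids, mask = left_pad([[5, 6, 7], [8]], PAD_ID)
print(ids)   # [[5, 6, 7], [2, 2, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```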

In [8]:
model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")

By default, and unless specified in the `GenerationConfig` file, `generate` selects the most likely token at each iteration (greedy decoding).

Depending on your task, this may be undesirable: creative tasks like chatbots or writing an essay benefit from sampling. On the other hand, input-grounded tasks like audio transcription or translation benefit from greedy decoding.

Enable sampling with `do_sample=True`. You can learn more about this topic in [this blog post](https://huggingface.co/blog/how-to-generate).
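The difference between the two modes can be illustrated on a single decoding step. The sketch below is a toy model in plain Python, not what `generate` does internally: the candidate tokens and their scores are invented, and `greedy_pick` and `sample_pick` are hypothetical helper names.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(tokens, logits):
    """Greedy decoding: always take the highest-scoring token."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return tokens[best]


def sample_pick(tokens, logits, rng):
    """Sampling: draw a token at random, weighted by its probability."""
    return rng.choices(tokens, weights=softmax(logits), k=1)[0]


# Invented candidates and scores for the next token after "red, blue,".
tokens = ["green", "yellow", "orange", "purple"]
logits = [2.0, 1.5, 1.0, 0.5]

rng = random.Random(0)
print(greedy_pick(tokens, logits))  # always "green"
print([sample_pick(tokens, logits, rng) for _ in range(5)])  # varies with the seed
```

Greedy decoding returns the same token on every call, while sampling can return any candidate, with likelier tokens appearing more often.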

In [11]:
# Compute the model output
# Generate with greedy decoding
generated_ids = model.generate(**model_inputs, max_new_tokens=50, do_sample=False)
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(decoded_outputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


A list of colors: red, blue, green, yellow, orange, purple, pink, brown, black, white, gray, silver, gold, tan, beige, cream, ivory, tan, and more.

A list of colors: red, blue, green


In [14]:
# More creative generation
generated_ids = model.generate(**model_inputs, max_new_tokens=50, do_sample=True)
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(decoded_outputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


A list of colors: red, blue, yellow, orange, grey, black, and white.

A list of colors and their hue:

- red, saturated red, saturated maroon, saturated red-purple, purple, grey
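How adventurous the sampled text is can be tuned. One common knob is temperature, which `generate` accepts as the `temperature` argument when `do_sample=True`. The toy sketch below shows the effect on a made-up score vector: dividing the scores by a temperature below 1 concentrates probability on the top token, while a temperature above 1 flattens the distribution. The helper name and the numbers are assumptions for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented scores for four candidate tokens.
logits = [2.0, 1.5, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```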


# Text generation with a properly formatted prompt
See a tutorial [here](https://huggingface.co/docs/transformers/v4.41.2/en/tasks/prompting)

In [16]:
alpha_tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

In [17]:
alpha_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-alpha",
    device_map="auto",   # Move to GPUs
    load_in_4bit=True,   # Load the model in 4-bit mode
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Downloading shards: 100%|██████████| 8/8 [00:28<00:00,  3.57s/it]
Loading checkpoint shards: 100%|██████████| 8/8 [00:14<00:00,  1.84s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


In [18]:
from transformers import set_seed
set_seed(0)
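`set_seed` fixes the random number generators (Python's `random`, NumPy, and PyTorch) so that sampled generations can be reproduced. The same principle is shown below with the standard library alone. `draw` is a hypothetical helper used only to make the point.

```python
import random


def draw(seed, n=5):
    """Draw n pseudo-random digits from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(n)]


# The same seed reproduces the same sequence of draws.
print(draw(0) == draw(0))  # True
```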

In [60]:
prompt = "How many helicopters can a human eat in one sitting? Reply as a thug."

In [61]:
# Wrong prompt format: the raw string is passed without the model's chat template
model_inputs = alpha_tokenizer([prompt], return_tensors="pt").to("cuda")

In [62]:
sequence_length = model_inputs.input_ids.shape[1]

In [63]:
generated_ids = alpha_model.generate(**model_inputs, max_new_tokens=20)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [64]:
generated_ids[:, sequence_length:]

tensor([[   13,    13, 28737, 28742, 28719,   459,   264,   306,   786, 28725,
           562,   613,   541,  1912,   368,   369,   264,  2930,  3573,  5310]],
       device='cuda:1')

In [65]:
print(alpha_tokenizer.batch_decode(generated_ids[:, sequence_length:], skip_special_tokens=True)[0])



I'm not a thug, but i can tell you that a human cannot eat
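The slice `generated_ids[:, sequence_length:]` works because `generate` returns the prompt tokens followed by the newly generated ones. A plain-Python sketch of the same slicing is shown below. The prompt ids are made up, the three new ids are the first three from the tensor printed above, and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(generated_ids, prompt_length):
    """Drop the first `prompt_length` ids of every row, keeping only new tokens."""
    return [row[prompt_length:] for row in generated_ids]


prompt_ids = [1, 1602, 1287]   # made-up prompt token ids
new_ids = [13, 13, 28737]      # first three generated ids from the output above
generated = [prompt_ids + new_ids]

print(strip_prompt(generated, len(prompt_ids)))  # [[13, 13, 28737]]
```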


In [92]:
chat_messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always reponse in the style of an old lady.",
    },
    {
        "role": "user",
        "content": "How many helicopters can a human eat in one sitting?"
    },
]

In [93]:
model_inputs_chat_format = alpha_tokenizer.apply_chat_template(
    chat_messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

In [94]:
# Print the prompt fed to the model
print(alpha_tokenizer.decode(alpha_tokenizer.apply_chat_template(
    chat_messages, add_generation_prompt=True
)))

<|system|>
You are a friendly chatbot who always reponse in the style of an old lady.</s> 
<|user|>
How many helicopters can a human eat in one sitting?</s> 
<|assistant|>



In [95]:
input_len = model_inputs_chat_format.shape[-1]
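The prompt printed above shows the layout the chat template produces: a role tag, the message content, an end-of-sequence marker, and finally an open `<|assistant|>` tag for the model to complete. The function below is a hand-written approximation of that layout, inferred only from the printed output. It is not the tokenizer's actual template, and `format_zephyr_style` is a hypothetical name, so `apply_chat_template` remains the thing to use in practice.

```python
def format_zephyr_style(messages, eos_token="</s>", add_generation_prompt=True):
    """Approximate the chat layout shown in the printed prompt (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn: role tag on its own line, then the content and an eos marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to fill in.
        parts.append("<|assistant|>\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(format_zephyr_style(example))
```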

In [100]:
generated_ids_chat_template = alpha_model.generate(model_inputs_chat_format, max_new_tokens=500)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
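The warning above appears because only the input ids were passed to `generate`. One way to avoid it, sketched here under the assumption that the installed `transformers` version supports `return_dict=True` in `apply_chat_template`, is to request a dictionary that also contains the attention mask and to set `pad_token_id` explicitly. `chat_generate` is a hypothetical wrapper written for this illustration.

```python
def chat_generate(model, tokenizer, messages, max_new_tokens=500, device="cuda"):
    """Generate a reply for chat messages, passing the attention mask explicitly."""
    # return_dict=True yields both input_ids and attention_mask.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # silences the pad token notice
    )
    # Keep only the tokens produced after the prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=True)
```

With the objects from this notebook, the call would be `chat_generate(alpha_model, alpha_tokenizer, chat_messages)`.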


In [101]:
print(alpha_tokenizer.batch_decode(generated_ids_chat_template[:, input_len:], skip_special_tokens=True)[0])

Dear, my dear child, I must say that I'm afraid no human can eat a helicopter in one sitting. Helicopters are not edible, and they are not meant to be consumed as food. They are designed to transport people and goods, and they are not a part of our diet. I hope that clears up any confusion you may have had.
