# Install Libraries

In [None]:
!pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install transformers==4.31 #temporary fix required owing to breaking changes on Aug 9th 2023
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

# Notebook Login

In [None]:
from huggingface_hub import notebook_login
notebook_login()


# Load Model

## Model Weights

The Llama 2 family is a group of large language models (LLMs) that generate natural-language text for tasks such as writing, chatting, and coding. Llama 2 comes in three sizes, which differ in their number of parameters:

* **7b model** has **7 billion** parameters and needs about **28 GB** of memory for its weights at 32-bit precision
* **13b model** has **13 billion** parameters and needs about **52 GB** of memory for its weights at 32-bit precision
* **70b model** has **70 billion** parameters and needs about **280 GB** of memory for its weights at 32-bit precision

If each `weight` is represented by `32 bits`, and `1 byte = 8 bits`, then for the 70b model:
* 70b * 32 bits / (8 bits per byte) = 70b * 4 bytes = `280 GB` of weights.

Similarly, for the 7b model:
* 7b * 32 bits / (8 bits per byte) = 7b * 4 bytes = `28 GB` of weights.
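
The arithmetic above can be sketched as a small helper. This is only an illustration: the function name `weight_memory_gb` is made up for this example, it uses decimal gigabytes (1 GB = 1e9 bytes), and it counts the weights alone, not activations or optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    total_bytes = num_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 1e9


# The three Llama 2 sizes at 32 bits per weight
for name, params in [("7b", 7e9), ("13b", 13e9), ("70b", 70e9)]:
    print(f"{name}: {weight_memory_gb(params, 32):.0f} GB")
```

Changing `bits_per_weight` to 16, 8, or 4 shows how the requirement shrinks as fewer bits are used per weight.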

High-end data-center GPUs
* Nvidia A100 - 40GB, 80GB

The memory requirements of these models are very high, and they **cannot be loaded on most computers** or platforms. For example, even a high-end data-center GPU such as the Nvidia A100 has only 40 GB or 80 GB of memory. The free tier of Google Colab, a platform that provides GPUs for training models, offers a GPU with only about 15 GB of memory. To solve this problem, we use **Quantization**.



## Quantization

**Quantization** is a technique that reduces the size of the model by using fewer bits to represent a weight. A weight is a numerical value that determines how the model processes the input and produces the output. The more bits are used to represent a weight, the more precise and accurate the weight is, but also the more memory and computation it requires.

Quantization has the following benefits and drawbacks:

* **Benefits:**
    * Reduces the memory usage of the model by a factor of the bit reduction. For example, reducing from 32 bits to 4 bits reduces the memory usage by 8 times.

    * Can improve the speed and efficiency of the model, because fewer resources and less power are needed to store and move the weights.

    * Enables the model to run on platforms or devices that have limited memory or computation capabilities, such as Google Colab, microcontrollers, or embedded systems.

* **Drawbacks:**
    * Introduces quantization error, which is the difference between the original weight and the quantized weight. This error can affect the quality and accuracy of the model, especially for sensitive or complex tasks.

    * Requires careful tuning and calibration of the quantization parameters, such as the scale and zero-point, to minimize the quantization error and preserve the model performance.
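
To make the scale, zero-point, and quantization error concrete, here is a minimal sketch of affine (asymmetric) quantization in plain Python. The function names and the sample weights are assumptions made for this illustration; real libraries use more elaborate, block-wise schemes.

```python
def quantize_affine(values, num_bits=4):
    """Map floats to integers in [0, 2**num_bits - 1] using a scale and a zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for constant input
    zero_point = round(qmin - lo / scale)     # integer code that represents 0.0
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_affine(q, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-1.0, -0.5, 0.0, 0.25, 2.0]  # illustrative values only
q, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(q, scale, zero_point)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print("codes:", q)
print("max quantization error:", max(errors))
```

With 4 bits there are only 16 levels, so every weight is snapped to the nearest level; the gap between the original and restored values is the quantization error described above.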

**Example**

Let's see an example of how quantization works. Suppose we have a model with 7 billion weights, and each weight is represented by 32 bits. The memory usage of this model is:

- 7b * 32 bits / (8 bits per byte) = 28 GB

This model is too large to fit in Google Colab, which has only 15 GB of memory. To reduce the size of the model, we can quantize the weights from 32 bits to 4 bits. The memory usage of the quantized model is:

- 7b * 4 bits / (8 bits per byte) = 3.5 GB

This model is much smaller and can be easily stored in Google Colab for further manipulation or fine-tuning. Here is a table that compares the memory usage of different models before and after quantization:

| Model | Number of weights | Bits per weight | Memory usage |
|-------|-------------------|-----------------|--------------|
| Original | 7 billion | 32 | 28 GB |
| Quantized | 7 billion | 4 | 3.5 GB |
| Reduced | 0.875 billion | 32 | 3.5 GB |

As you can see, quantizing the weights from 32 bits to 4 bits reduces the memory usage by 8 times, from 28 GB to 3.5 GB. In memory terms, this is equivalent to reducing the number of weights by 8 times, from 7 billion to 0.875 billion, while keeping the same bit size of 32.
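
One way to picture 4 bits per weight is that two 4-bit codes fit in a single byte. The sketch below packs and unpacks such codes in plain Python. It is a generic illustration with made-up helper names, not the storage layout of any particular library.

```python
def pack_nibbles(values):
    """Pack 4-bit integers (0..15) two per byte; pads with 0 if the count is odd."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in 4 bits (0..15)")
    padded = list(values) + [0] * (len(values) % 2)
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; `count` drops the padding value, if any."""
    out = []
    for byte in packed:
        out.extend((byte >> 4, byte & 0x0F))
    return out[:count]


codes = [0, 3, 5, 6, 15]
packed = pack_nibbles(codes)
print(len(codes), "codes stored in", len(packed), "bytes")
print(unpack_nibbles(packed, len(codes)))
```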

**Quantization Types**

There are different types of quantization techniques that can be applied to models, depending on the stage and the method of quantization.

* **Post-training quantization:** This type of quantization is `applied after the model is trained, without retraining or fine-tuning the model`. It is the simplest and fastest way to quantize a model, but it may result in more accuracy loss than other types. There are different methods of post-training quantization, such as **static quantization**, **dynamic quantization**, and **weight-only quantization**.

* **Quantization-aware training:** This type of quantization is applied `during the model training process`, by simulating the effects of quantization on the model weights and activations. It is more complex and time-consuming than post-training quantization, but it can achieve higher accuracy and lower quantization error than post-training quantization. There are different methods of quantization-aware training, such as **fake quantization**, **quantization noise injection**, and **stochastic rounding**.

* **Mixed-precision quantization:** This type of quantization is applied by using `different precisions` for different parts of the model, such as weights, activations, gradients, and accumulators. It is a flexible and efficient way to quantize.
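
As a small illustration of one technique named above, stochastic rounding rounds a value up or down at random so that the result is correct on average. The sketch below is a plain-Python approximation with a made-up function name; it is not taken from any training framework.

```python
import math
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x down or up at random so that the expected result equals x."""
    floor = math.floor(x)
    frac = x - floor  # probability of rounding up
    return floor + (1 if rng.random() < frac else 0)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [stochastic_round(0.3, rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)
print("values seen:", sorted(set(samples)))
print("average:", mean)
```

Ordinary rounding would map 0.3 to 0 every time, while the average of the stochastic results stays close to 0.3, which is why the technique can help when many small updates would otherwise be rounded away.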

## Parameter Efficient Fine Tuning (PEFT)
> PEFT means "we only train some weights".

**LoRA** and **QLoRA** are powerful tools for **adapting pre-trained language models to specific tasks** while **optimizing memory usage**.

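**LoRA** - **L**ow **R**ank **A**daptation means that instead of training all 7 billion parameters, we freeze the pretrained weights and train (update) only small low-rank matrices that are added to selected layers.

**QLoRA** - Quantized LoRA means Parameter Efficient Fine Tuning with LoRA on top of a base model whose weights are quantized (to 4 bits here).

**BitsAndBytes** - A quantization library integrated with Hugging Face Transformers; it is used here to load the base model in 4-bit precision for QLoRA.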
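
To see why training low-rank matrices is cheap, the sketch below compares the parameter count of one full weight matrix with the count of its two LoRA factors. The 4096 x 4096 shape is an assumption chosen to resemble an attention projection in a 7b model, and `r=8` matches the rank used later in this notebook.

```python
def lora_param_counts(d_out: int, d_in: int, r: int):
    """Return (full, lora) parameter counts for one weight matrix."""
    full = d_out * d_in        # frozen pretrained matrix W (d_out x d_in)
    lora = r * (d_out + d_in)  # trainable factors B (d_out x r) and A (r x d_in)
    return full, lora


full, lora = lora_param_counts(4096, 4096, r=8)
print(f"full: {full:,} | LoRA: {lora:,} | trainable share: {100 * lora / full:.2f}%")
```

The adapted layer computes with `W + (lora_alpha / r) * B @ A`, so only the small factors `A` and `B` receive gradient updates.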


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "Trelis/Llama-2-7b-chat-hf-sharded-bf16"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

* **iamsubrata/Llama-2-7b-chat-hf-sharded-bf16-fine-tuned**: Fine-tuned version of **Llama-2-7b-chat-hf-sharded-bf16** (optimized for dialogue use cases), trained on the `English Quotes Dataset`.

* **BitsAndBytesConfig** configures **model quantization**, which is a technique to reduce the memory and computation requirements of an LLM by using lower-precision data types.
    * **load_in_4bit**: Load the model weights in 4-bit precision, which reduces the model size by **75%** compared to **16-bit precision**.
    
    * **bnb_4bit_use_double_quant**: Enables **double quantization**, which compresses the model further by quantizing the `quantization constants` themselves (the per-block scaling factors) a second time. This saves an additional fraction of a bit per parameter.
    
    * **bnb_4bit_quant_type**: Quantization type: **nf4**, which stands for **4-bit NormalFloat**, a 4-bit data type introduced with QLoRA whose levels are chosen for weights that are roughly normally distributed.
    
    * **bnb_4bit_compute_dtype**: The data type used for the model computation, **torch.bfloat16**, which is a `16-bit floating-point` format that keeps the same exponent range as 32-bit floats and therefore covers a wider dynamic range than float16, at the cost of some precision.

* A **causal language model** is a type of generative model that **predicts the next token in a sequence given the previous tokens**.
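
As a toy illustration of next-token prediction, the sketch below builds a bigram model from a tiny made-up corpus and generates greedily. It is a deliberate simplification: it conditions only on the single previous token, whereas a model like Llama 2 conditions on all previous tokens, and every name here is hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_new_tokens=5):
    """Greedy generation: always append the most frequent follower."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this token
        out.append(followers.most_common(1)[0][0])
    return out


corpus = "the cat sat on the mat and the cat sat".split()
bigram_counts = train_bigram(corpus)
print(" ".join(generate(bigram_counts, "the", max_new_tokens=3)))
```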

# Training Setup (PEFT)

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    # target_modules=["query_key_value"],
    target_modules=["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj"], #specific to Llama models.
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

# Data Setup

The **english_quotes dataset** is a collection of quotes from **goodreads.com**, a website that allows users to rate and review books. The dataset contains **3,910 quotes**, each with the following attributes:

* **quote:** The content of the quote in English
* **author:** The name of the author who said or wrote the quote
* **tags:** A list of keywords that describe the theme or topic of the quote

The dataset can be used for multi-label text classification and text generation tasks. For example, you can train a model to predict the tags of a given quote, or to generate a quote based on a given tag or author.
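
As a rough sketch of the text generation use case, a record with these three attributes could be turned into a prompt and target pair. The record below is a placeholder, not a row from the dataset, and the helper name and prompt wording are assumptions made for illustration.

```python
def to_generation_example(record: dict) -> dict:
    """Build a (prompt, target) pair that asks for a quote matching the tags."""
    tags = ", ".join(record["tags"])
    prompt = f"Write a quote about: {tags}\n"
    target = f"{record['quote']} ({record['author']})"
    return {"prompt": prompt, "target": target}


# Placeholder record with the same fields as the dataset (not real data)
record = {
    "quote": "An example quote.",
    "author": "Example Author",
    "tags": ["example", "placeholder"],
}
example = to_generation_example(record)
print(example["prompt"] + example["target"])
```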

In [None]:
from datasets import load_dataset
data_path="Abirate/english_quotes"
data = load_dataset(data_path)
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

In [None]:
## Uncomment if the above block doesn't work, and make sure to add the dataset from kaggle
# from datasets import load_dataset
# data_path="/kaggle/input/english-quotes/quotes.jsonl"
# data = load_dataset("json",data_files={
#     "train":data_path
# })
# data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

# Training

In [None]:
import transformers

# needed for Llama tokenizer
tokenizer.pad_token = tokenizer.eos_token # </s>

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings.
trainer.train()
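
A quick sanity check on the training arguments above: with gradient accumulation, each optimizer step consumes several small batches. The sketch below works out the effective batch size and the number of examples seen, assuming a single GPU (`num_devices = 1` is an assumption, not a value from the notebook).

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
num_devices = 1  # assumption: a single GPU

# Examples that contribute to one optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Examples processed over the whole run
examples_seen = effective_batch_size * max_steps

print("effective batch size:", effective_batch_size)
print("examples seen in training:", examples_seen)
```

Under these settings the run touches only a small part of the dataset, so it is best read as a demonstration of the setup rather than a full fine-tune; raising `max_steps` trains for longer.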

# Evaluation

In [None]:
from tensorboard import notebook
log_dir = "/kaggle/working/outputs/runs" # make sure this path is correct
notebook.start("--logdir {} --port 4000".format(log_dir))

# Inference

* **TextStreamer** class enables streaming of the generated text, which means that the text is printed to the standard output as soon as each token is generated, instead of waiting for the whole response to be generated.
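
The idea of streaming can be sketched without a model: write each piece of text as soon as it is produced instead of collecting everything first. The token source below is a stand-in that splits a string on whitespace, and both function names are made up for this illustration.

```python
import sys
from typing import Iterable, Iterator


def fake_token_source(text: str) -> Iterator[str]:
    """Stand-in for a model: yields one whitespace-delimited 'token' at a time."""
    for word in text.split():
        yield word + " "


def stream_to(out, tokens: Iterable[str]) -> str:
    """Write each token to `out` as it arrives and return the full text."""
    pieces = []
    for tok in tokens:
        out.write(tok)
        out.flush()  # make the token visible immediately
        pieces.append(tok)
    return "".join(pieces)


full_text = stream_to(sys.stdout, fake_token_source("tokens appear one at a time"))
```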

In [None]:
from transformers import TextStreamer
model.config.use_cache = True
model.eval()

In [None]:
# Define a stream *without* function calling capabilities
def stream(user_prompt,model):
    runtimeFlag = "cuda:0"
    system_prompt = 'You are a helpful assistant that provides accurate and concise responses'

    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)

In [None]:
stream('Provide a very brief comparison of salsa and bachata.',model)
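
The `stream` helper wraps the user prompt in the Llama 2 chat format before tokenizing it. The sketch below isolates that string construction so the result can be inspected without loading a model; the function name `build_llama2_prompt` and the sample inputs are assumptions made for this example.

```python
def build_llama2_prompt(user_prompt: str, system_prompt: str) -> str:
    """Wrap a system prompt and a user prompt in the [INST] / <<SYS>> markers."""
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    return f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"


# repr() makes the newlines in the template visible
print(repr(build_llama2_prompt("  Hello  ", "Be brief.")))
```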

# Save and Push Fine Tuned Model to Hub

In [None]:
## Uncomment if you want to push to your Hub.
## The later cells that use `adapter_model` and `new_model` need these names defined.
# # Extract the last portion of the base_model
# base_model_name = model_id.split("/")[-1]

# # Replace "iamsubrata" with your Hugging Face username
# adapter_model = f"iamsubrata/{base_model_name}-fine-tuned-adapters"
# new_model = f"iamsubrata/{base_model_name}-fine-tuned"

* **Note** The adapter model does not contain a `config.json` file. To get one, we have to merge the adapter model with the pretrained model. Without `config.json`, the model cannot be loaded from the Hub as a regular model, and loading fails with a `config.json` **file not found error**.
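
Conceptually, merging folds the low-rank update into the base weights, so the result is an ordinary weight matrix again. The sketch below shows that arithmetic on tiny made-up matrices in plain Python; it assumes the usual `lora_alpha / r` scaling and is not the library's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weights (illustrative)
B = [[1.0], [2.0]]            # d_out x r, with r = 1
A = [[0.5, -0.5]]             # r x d_in
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and loaded like any other checkpoint.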

In [None]:
# # Save the adapter model (doesn't have config.json)
# model.save_pretrained(adapter_model, push_to_hub=True, use_auth_token=True)

# # Push the adapter model to the hub
# model.push_to_hub(adapter_model, use_auth_token=True)

* **Reload the base model without quantization.** This will most likely crash a free-tier notebook, because it loads the whole model into memory.

In [None]:
# Reload the base model without quantization
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='cpu', trust_remote_code=True, torch_dtype=torch.float16, cache_dir="cache")

In [None]:
from peft import PeftModel

# load the PEFT model with the new adapters
model = PeftModel.from_pretrained(
    model,
    adapter_model,
)

In [None]:
model = model.merge_and_unload() # merge adapters with the base model.

In [None]:
model.push_to_hub(new_model, use_auth_token=True, max_shard_size="5GB")

In [None]:
# Push the tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.push_to_hub(new_model, use_auth_token=True)

* **Now, you can use the [inference notebook]() to use it for inference or you can use the `pipeline` from `transformers`.**

# Inference

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

finetuned_model_checkpoint = "iamsubrata/Llama-2-7b-chat-hf-sharded-bf16-fine-tuned"

bitsandbytes_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(finetuned_model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(finetuned_model_checkpoint, quantization_config=bitsandbytes_config, device_map={"":0})

In [None]:
from transformers import TextStreamer

# Define a stream *without* function calling capabilities
def stream(user_prompt,model):
    runtimeFlag = "cuda:0"
    system_prompt = 'You are a helpful assistant that provides accurate and concise responses'

    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)

In [None]:
stream('Can you tell me quotes related to the universe?',model)

# Advanced Fine Tuning
* Prompt Masking
* End of sequence token