## INSTALLING REQUIRED LIBRARIES
1. bitsandbytes: A library of CUDA kernels with PyTorch wrappers for low-precision computation. It provides 8-bit optimizers and 8-bit and 4-bit quantization of model weights, which lets large models fit into much less GPU memory.

2. transformers: Developed by Hugging Face, it is a comprehensive library for natural language processing (NLP). It provides access to various pre-trained transformer-based models, like BERT and GPT-2, and offers functionalities for tokenization, training, and inference with NLP tasks.

3. peft: Short for Parameter-Efficient Fine-Tuning, the "peft" library is also from Hugging Face. It adapts large pre-trained models by training only a small number of additional parameters (for example, LoRA adapters) while the base weights stay frozen, which greatly reduces the compute and memory needed for fine-tuning.

4. accelerate: Another library by Hugging Face, "accelerate" focuses on speeding up the model training process through distributed training and hardware optimization. It is particularly useful for large-scale deep learning tasks.

5. datasets: This library provides access to various publicly available datasets commonly used in machine learning and NLP tasks. It simplifies the process of loading, processing, and using datasets, making it convenient for researchers and practitioners.
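
Because the install cell below pulls several of these packages from their development branches, it can help to record which versions were actually installed once it has run. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["bitsandbytes", "transformers", "peft", "accelerate", "datasets"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

Keeping this output next to the results makes it easier to reproduce the run later, since a development branch can change from one day to the next.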

In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

In this code, we use the "transformers" library by Hugging Face to work with a GPT-like model called "EleutherAI/gpt-neox-20b". We also use the "BitsAndBytesConfig" class, which is imported from "transformers" and configures quantization through the "bitsandbytes" library.

1. First, we define the model ID as "EleutherAI/gpt-neox-20b," which points to the pre-trained GPT-like model we want to use.

2. Next, we create a configuration called "bnb_config" from the "BitsAndBytesConfig" class. We enable quantization by setting "load_in_4bit" to True, which allows us to load the model in a compressed 4-bit format, optimizing memory usage.

3. We further configure the quantization settings by setting "bnb_4bit_use_double_quant" to True. Double quantization quantizes the quantization constants produced by the first quantization step, which saves a little more memory per parameter.

4. The "bnb_4bit_quant_type" is set to "nf4," which stands for 4-bit NormalFloat. This data type, introduced with QLoRA, is designed for values that are approximately normally distributed, as neural network weights typically are.

5. Additionally, we set the "bnb_4bit_compute_dtype" to "torch.bfloat16." The weights are stored in 4 bits but are dequantized to this datatype for the actual computations, which is faster than computing in 32-bit floats.

6. Using the AutoTokenizer, we load the tokenizer corresponding to the specified "model_id."

7. Lastly, we load the pre-trained model using AutoModelForCausalLM. We pass the "bnb_config" to enable the quantization settings we defined earlier. We also specify "device_map" as {"": 0}, which indicates that the model should be loaded on device 0 (usually a GPU) for efficient computation.
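
To make the idea of 4-bit blockwise quantization more concrete, here is a simplified sketch in plain Python. It is not the bitsandbytes implementation: the codebook is built from evenly spaced standard-normal quantiles as a stand-in for the real NF4 table (which, among other differences, contains an exact zero), and all function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_codebook(bits=4):
    """Return 2**bits levels taken from standard-normal quantiles, scaled to [-1, 1]."""
    levels = 2 ** bits
    normal = NormalDist()
    # Use bin midpoints so the 0 and 1 quantiles (which are infinite) are never requested.
    quantiles = [normal.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantize_block(block, codebook):
    """Scale a block by its absolute maximum and store the index of the nearest level."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - x / absmax))
        for x in block
    ]
    return codes, absmax


def dequantize_block(codes, absmax, codebook):
    """Recover approximate weights from the 4-bit codes and the block's scale."""
    return [codebook[c] * absmax for c in codes]


weights = [0.02, -0.15, 0.07, 0.31, -0.28, 0.0, 0.11, -0.05]
codebook = normal_quantile_codebook()
codes, absmax = quantize_block(weights, codebook)
restored = dequantize_block(codes, absmax, codebook)
print(codes)
print([round(w, 3) for w in restored])
```

Each weight is stored as an index between 0 and 15, and each block keeps one extra number (`absmax`) as its quantization constant. Those per-block constants are what double quantization compresses further.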

Load the model we are going to use: GPT-NeoX-20B. Note that the model itself is around 40 GB in half precision.
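
The 40 GB figure follows from simple arithmetic, and the same arithmetic gives a rough idea of what 4-bit loading saves. The sketch below assumes a round 20 billion parameters, quantization blocks of 64 weights with one 32-bit constant each, and, for double quantization, 8-bit constants that are themselves grouped in blocks of 256. These block sizes are taken from the QLoRA description and are assumptions here, not values read from this notebook.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 20e9  # assumption: "20B" taken as a round 20 billion parameters
BLOCK = 64         # assumption: weights are quantized in blocks of 64
BLOCK2 = 256       # assumption: constants are re-quantized in blocks of 256

# Without double quantization: one 32-bit constant per block of weights.
overhead_plain = 32 / BLOCK
# With double quantization: one 8-bit constant per block,
# plus one 32-bit constant per BLOCK2 of those 8-bit constants.
overhead_double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)

print(f"16-bit weights:        {weight_memory_gb(NUM_PARAMS, 16):.1f} GB")
print(f"4-bit, plain constants:  {weight_memory_gb(NUM_PARAMS, 4 + overhead_plain):.2f} GB")
print(f"4-bit, double quantized: {weight_memory_gb(NUM_PARAMS, 4 + overhead_double):.2f} GB")
```

This is only an estimate for the weights. Layers that are not quantized, activations, and optimizer state all add to the real footprint.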

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.

In this code snippet, we use the "peft" library (Parameter-Efficient Fine-Tuning) to prepare the quantized model for k-bit training. Let's break down the steps:

1. `model.gradient_checkpointing_enable()`: This enables gradient checkpointing for the model. Gradient checkpointing trades computation time for memory: instead of storing every intermediate activation for the backward pass, the model keeps only some of them and recomputes the rest when they are needed. Training becomes somewhat slower, but large models fit into limited GPU memory.

2. `model = prepare_model_for_kbit_training(model)`: This line prepares the already quantized model for training with the "peft" library. "k-bit" refers to the number of bits used to store the base model weights (for example, 4 or 8). The function does not quantize anything itself. It freezes the base model parameters and adjusts a few settings, such as upcasting the remaining non-quantized parameters, so that training on top of the quantized weights is stable.

Overall, the code enables gradient checkpointing and then prepares the quantized model for fine-tuning with "peft", so that a very large model can be trained on a GPU with limited memory.
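
The memory-for-compute trade-off behind gradient checkpointing can be illustrated with a toy accounting model. The sketch below is the textbook segment scheme for a chain of equally sized layers, not the way "transformers" implements it, and the function names and the layer count are illustrative assumptions.

```python
import math


def plain_cost(num_layers):
    """Ordinary training: store every layer's activation, recompute nothing."""
    return num_layers, 0


def checkpointed_cost(num_layers, segment_size):
    """Return (peak stored activations, layers run forward a second time).

    Only the input of each segment is kept during the forward pass. During the
    backward pass one segment at a time is recomputed from its checkpoint.
    """
    checkpoints = math.ceil(num_layers / segment_size)
    peak_stored = checkpoints + segment_size
    recomputed = num_layers  # upper bound: every segment is replayed once
    return peak_stored, recomputed


NUM_LAYERS = 44  # assumption: illustrative depth
print(plain_cost(NUM_LAYERS))
print(checkpointed_cost(NUM_LAYERS, segment_size=7))
```

In this simplified model, choosing a segment size near the square root of the number of layers keeps the stored activations small, at the price of roughly one extra forward pass.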


## WHAT IS K-BIT TRAINING?
K-bit training, as the term is used by "peft" and "bitsandbytes", means fine-tuning a model whose base weights are stored in a reduced-precision k-bit format (commonly 4 or 8 bits) instead of standard 16-bit or 32-bit floating point. It should not be confused with quantization-aware training, in which quantization is simulated while all of the weights are being trained.

During k-bit training, the quantized base weights stay frozen. When a layer is used, its weights are dequantized on the fly to the compute datatype (here bfloat16), and gradients flow through them to a small set of trainable higher-precision parameters, such as the LoRA adapters added later in this notebook. This significantly reduces the memory needed to hold the model, which makes it possible to fine-tune very large models on a single GPU.

However, reducing the precision of the weights introduces quantization error, which can cost some accuracy. Data types such as NF4 are designed to keep this error small, and the trainable adapters can compensate for part of it during fine-tuning.

K-bit training therefore strikes a balance between memory use and model quality, enabling large models to be fine-tuned in resource-constrained environments without sacrificing too much accuracy.

In [3]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

The function `print_trainable_parameters` calculates and prints the number of trainable parameters in a given PyTorch model. It iterates through all the named parameters of the model and checks if each parameter requires gradients (i.e., is trainable). For each trainable parameter, it counts the number of elements (parameters) it contains and accumulates the count.

The function provides the following information:

1. `trainable params`: The total number of trainable parameters in the model.
2. `all params`: The total number of parameters in the model (both trainable and non-trainable).
3. `trainable%`: The percentage of trainable parameters relative to all parameters, indicating how much of the model's parameters are trainable.

By calling this function and passing a model as an argument, you can get insights into the model's parameter count and the proportion of parameters that are being updated during training (trainable parameters). This information can be useful for understanding model size, memory requirements, and potential computational costs during training and inference.

In [4]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In the provided code, the following steps are being executed:

1. A `LoraConfig` object is created, defining the configuration for LoRA (Low-Rank Adaptation). LoRA is a technique for efficient fine-tuning: the pre-trained weights are frozen, and a pair of small low-rank matrices is trained alongside each targeted layer. The configuration specifies `r` (the rank of the low-rank matrices), `lora_alpha` (a scaling factor applied to the LoRA update), `target_modules` (the modules to which LoRA is applied, here the `query_key_value` attention projections), `lora_dropout` (the dropout rate on the LoRA path), `bias` (which bias terms are trained; "none" trains no biases), and `task_type` (the type of task, here set to "CAUSAL_LM" for causal language modeling).

2. The function `get_peft_model` is called to wrap the model with the LoRA adapters described by the configuration. The `model` variable holds the quantized pre-trained model prepared in the previous steps.

3. The function `print_trainable_parameters(model)` is called to print the number of trainable parameters in the LoRA-wrapped model. This function was described earlier and reports the trainable parameter count and the proportion of trainable parameters in the model.

In summary, the code configures LoRA using `LoraConfig`, applies it to the pre-trained model, and then prints the number of trainable parameters using the `print_trainable_parameters` function. This shows how small the trainable fraction of the model becomes once LoRA is applied.
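
As a rough illustration of what a LoRA layer computes, the sketch below evaluates `W x + (lora_alpha / r) * B (A x)` for tiny hand-written matrices in plain Python. The matrices, sizes, and helper names are made up for the example, and dropout is left out; this is not how "peft" implements the layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 3 inputs, 2 outputs, rank 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]   # frozen pre-trained weight, shape (2, 3)
A = [[0.5, -0.5, 1.0]]   # trainable, shape (r, 3)
B_init = [[0.0], [0.0]]  # trainable, shape (2, r), starts at zero
x = [1.0, 2.0, 3.0]

# With B at zero, the adapter leaves the base model's output unchanged.
print(lora_forward(W, A, B_init, x, lora_alpha=4, r=1))

# After training, B is non-zero and shifts the output.
B_trained = [[0.1], [-0.2]]
print(lora_forward(W, A, B_trained, x, lora_alpha=4, r=1))
```

Only `A` and `B` would receive gradients. `W` stays fixed, which is why it can remain in its quantized form during fine-tuning.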

In [5]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)


trainable params: 8650752 || all params: 10597552128 || trainable%: 0.08162971878329976
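
The trainable parameter count can be checked by hand. The sketch below assumes the published GPT-NeoX-20B shape (hidden size 6144, 44 transformer layers) and that each `query_key_value` projection maps the hidden size to three times the hidden size; these values are assumptions, not read from the notebook.

```python
def lora_param_count(in_features, out_features, r, num_layers):
    """Parameters in the A (r x in) and B (out x r) matrices across all layers."""
    per_layer = r * in_features + out_features * r
    return per_layer * num_layers


HIDDEN = 6144    # assumption: GPT-NeoX-20B hidden size
NUM_LAYERS = 44  # assumption: GPT-NeoX-20B number of transformer layers

trainable = lora_param_count(HIDDEN, 3 * HIDDEN, r=8, num_layers=NUM_LAYERS)
print(trainable)
print(100 * trainable / 10597552128)  # "all params" as printed above
```

Under these assumptions the result agrees with the printed count. The "all params" figure is roughly half of 20 billion, probably because the 4-bit weights are packed two to a byte, so `numel()` counts storage elements and not logical weights.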


This code loads the "english_quotes" dataset using the "datasets" library and tokenizes the "quote" text of every sample with the tokenizer loaded earlier. Because `batched=True` is set, the tokenizer is called on batches of samples instead of on one sample at a time. This prepares the dataset for training token-based models like Transformers.
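
To show what a batched mapping function receives and returns, here is a toy imitation in plain Python. The whitespace tokenizer and the `batched_map` helper are hypothetical stand-ins for the real tokenizer and for `Dataset.map`; the point is only that the function sees a dictionary of lists and returns new columns as lists of the same length.

```python
VOCAB = {}  # hypothetical vocabulary, grown as new words appear


def toy_tokenizer(texts):
    """Whitespace tokenizer returning one list per output column."""
    input_ids = [[VOCAB.setdefault(word, len(VOCAB)) for word in text.split()] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def batched_map(dataset, fn, batch_size=2):
    """Apply fn to slices of a column-oriented dataset and add the columns it returns."""
    num_rows = len(next(iter(dataset.values())))
    result = {name: list(values) for name, values in dataset.items()}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in dataset.items()}
        for name, values in fn(batch).items():
            result.setdefault(name, []).extend(values)
    return result


data = {"quote": ["be yourself", "be kind", "stay kind always"]}
tokenized = batched_map(data, lambda samples: toy_tokenizer(samples["quote"]))
print(tokenized["input_ids"])
print(tokenized["attention_mask"])
```

The original "quote" column is kept, and the token columns are added next to it, which mirrors what the mapped dataset looks like after tokenization.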

In [6]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

The tokenizer's pad_token is set to the eos_token. This is done to ensure that the model can correctly handle padding.

The Trainer class is initialized with the following arguments:

1. `model`: The GPT-NeoX model, wrapped with LoRA adapters, that will be trained.
2. `train_dataset`: The training dataset. It was loaded and tokenized earlier and is available as `data["train"]`.
3. `args`: The training arguments, which define training parameters such as batch size, learning rate, and optimizer. Setting `optim` to "paged_adamw_8bit" selects a paged 8-bit AdamW optimizer, which keeps the optimizer state in 8 bits to save memory. This setting concerns the optimizer only; the model weights were quantized when the model was loaded.
4. `data_collator`: The data collator used during training. It is responsible for batching and preparing the data for the model. In this case, `DataCollatorForLanguageModeling` is used, and `mlm=False` selects the causal language modeling objective instead of masked language modeling.

The `trainer.train()` method is called to start the training process. The model is trained for the specified `max_steps` (in this case, 10) using the training dataset. During training, only the trainable LoRA parameters are updated, based on the loss computed with the language modeling objective.

The model's cache is disabled (`use_cache=False`) to silence warnings during training, since caching is incompatible with gradient checkpointing. It should be re-enabled for inference.

Run the cell below to run the training! For the sake of the demo, we run it for just a few steps to showcase how to use this integration with existing tools in the HF ecosystem.
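
The role of the pad token and of `mlm=False` can be sketched in a few lines of plain Python. The function below is a simplified, hypothetical imitation of what a causal language modeling collator does: it pads every sequence in a batch to the same length (on the right, which is an assumption here) and builds labels that copy the inputs, with pad-token positions replaced by -100 so that the loss ignores them.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def collate_for_causal_lm(sequences, pad_token_id):
    """Pad token-id lists to equal length and build labels for next-token prediction."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        padding = max_len - len(seq)
        ids = seq + [pad_token_id] * padding
        input_ids.append(ids)
        attention_mask.append([1] * len(seq) + [0] * padding)
        # Labels copy the inputs; every pad-token id is excluded from the loss.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in ids])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_for_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

The labels are not shifted here. For causal language models in "transformers", the shift by one position happens inside the model when the loss is computed.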

In [7]:
import transformers

# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 1 | 2.1359 |
| 2 | 2.4431 |
| 3 | 2.5278 |
| 4 | 3.2956 |
| 5 | 2.4872 |
| 6 | 1.5659 |
| 7 | 2.1599 |
| 8 | 2.5875 |
| 9 | 1.7405 |
| 10 | 1.9281 |


TrainOutput(global_step=10, training_loss=2.2871545791625976, metrics={'train_runtime': 167.4082, 'train_samples_per_second': 0.239, 'train_steps_per_second': 0.06, 'total_flos': 100490233282560.0, 'train_loss': 2.2871545791625976, 'epoch': 0.02})
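
The numbers in this output can be reproduced from the training arguments. The sketch below assumes a single GPU and a train split of 2508 quotes (an assumption about the dataset size); the runtime and the per-step losses are copied from the output above.

```python
per_device_batch = 1
grad_accum_steps = 4
max_steps = 10
train_rows = 2508         # assumption: number of examples in the train split
train_runtime = 167.4082  # seconds, from TrainOutput

# Sequences consumed per optimizer step, and over the whole run (single GPU assumed).
sequences_per_step = per_device_batch * grad_accum_steps
total_sequences = sequences_per_step * max_steps

epoch_fraction = total_sequences / train_rows
samples_per_second = total_sequences / train_runtime
steps_per_second = max_steps / train_runtime

losses = [2.1359, 2.4431, 2.5278, 3.2956, 2.4872, 1.5659, 2.1599, 2.5875, 1.7405, 1.9281]
mean_loss = sum(losses) / len(losses)

print(round(epoch_fraction, 2), round(samples_per_second, 3), round(steps_per_second, 2))
print(round(mean_loss, 4))
```

With only 40 sequences seen, the run covers a small fraction of one epoch, so the loss values are far too noisy to say anything about convergence.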

## MODEL INFERENCE

In [8]:
# Directory in which the trained LoRA adapter weights are saved
model_id = 'lora_based'
model.save_pretrained(model_id)
model.config.use_cache = True  # re-enable the cache for inference


def infer_model(model, text):
    """Generate up to 20 new tokens that continue `text`; print and return the result."""
    device = 'cuda:0'
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(decoded)
    return decoded


text1 = "Ask not what is your country can do for you"
text2 = "I'm selfish, impatient and a little insecure"
infer_model(model, text1)


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

Ask not what is your country can do for you, ask what you can do for your country.

-John F. Kennedy

The


'Ask not what is your country can do for you, ask what you can do for your country.\n\n-John F. Kennedy\n\nThe'

In [9]:
infer_model(model,text2)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you


"I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you"