In [1]:
import pkgutil


In [2]:
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git 
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q datasets

package = "datasets"

if pkgutil.find_loader(package) is None:
    print(package,"is not installed in python environment")
else:
    print(package,"is installed")


datasets is installed


Accelerate is a user-friendly tool designed to provide developers with the flexibility of writing their own training Loops for pytorch models while avoiding the hassle of dealing with the extra code required
for Multi-Device setups 

Think of accelerate as your personal assistant taking care of all the tedious parts of your code associated with running your model on different kinds of devices whether there are multiple gpus tpus or even enabling mixed Precision in practice 

This means that you can add just a handful of lines to your typical pytorch training script and then your model can run smoothly on a variety of setups ranging from a single CPU or GPU to multiple CPUs or  gpus and even tpus

Even better you can switch between different environments without modifying your code which is super handy for debugging on your local machine before moving to larger scale training setup

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


In [5]:
model_id = "EleutherAI/pythia-410m"


Configuration for the bits and bytes quantization Library

- Quantization is a technique to compress neural networks by reducing the amount of bits that represent the weights of the model 
- Here it's configured to load the model in a 4-bit representation using the normal float
optimized quantization type and the B float 16 data type for computation
- Double quantization is also enabled - a technique to further reduce the storage requirements 

In [6]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)


Causal language modeling is a type of language modeling task where the model generates the next token in a sequence based on the pre previous tokens 

- It's called causal because the model can't see future tokens when predicting current token 
- It's as if the model is reading the text from left to right just like a human 
- This kind of model is frequently used for text generation 
- GPT2 is a well-known example of a causal language model 

In [7]:
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                             quantization_config=bnb_config, 
                                             device_map={"":0})


# device_count = torch.cuda.device_count()
# if device_count > 0:
#     device = torch.device("cuda")
# else:
#     device = torch.device("cpu")

# model.to(device)


### Pre-process the model - Apply PEFT (QLoRA)

We have to apply some preprocessing to the model to prepare it for training. For that use the prepare_model_for_kbit_training method from PEFT.

- PEFT (parameter efficient fine tuning). 
- It involves a new wave of machine learning that allow us to tweak pre-trained language models for different applications without adjusting all the model parameters 
- This approach is handy because fine-tuning large llms can be extremely resource intensive 
- Therefore by fine-tuning only a fraction of the model's parameters PEFT methods significantly reduce the computational costs and storage costs 
- Some of the popular peft methods include Lora, prefix tuning, p-tuning, adalora and QLoRA

- Quantization is a process that involves reducing the amount of data required to represent an input 
- This is typically achieved by converting a data type with a high bit count into a data type with a low bit count 
- For instance a 32-bit floating Point number might be converted into an 8-bit integer
- This process helps manage the range of the lower bit data type more effectively by rescaling the input data type into the target data types range using normalization 
- However a problem that arises with this technique is that it can be skewed by outliers or large magnitude values in the input data 
- These outliers can cause the quantization bins which are specific big combinations to be used inefficiently 
- This can lead to some bins being populated with few or no numbers essentially wasting computational resources 
- To address this issue a common approach is to split the input tensor into several smaller chunks or blocks which are independently quantized 
- Each of these blocks would have its own quantization constant ensuring a more efficient usage of the quantization bins 
- This method known as blockwise k-bit quantization is more resistant to outliers and can lead to more efficient computations and better performance when processing large data sets 

In [5]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)


- By looking at the number of trainable parameters you can see how many parameters we're actually training 
- Since the goal is to achieve parameter efficient fine tuning, you should expect to see fewer trainable parameters in the Lora model in comparison to the original model 

In [6]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


LoRA config allows you to control how LoRA is applied to the base model by manipulating the hyper
parameters

r - is the rank of the update matrices expressed in integers 
A lower number for this would represent a smaller update Matrix with fewer trainable parameters

alpha - LoRA scaling parameter 

r and alpha together control the total number of final trainable parameters when using Lora giving you the flexibility to balance and performance with compute efficiency 

If you have more parameters you're going to have a better performing model but if you have less parameters you're going to be much more computationally efficient

target module - specifies the modules for example, attention blocks to apply to the Lora update matrices 

In this case we're targeting query key and value but it's possible to Target just the query in the key or some other combination

dropout - likelihood of co-adaptation where multiple neurons extract identical or very similar features from the input data. this phenomenon can occur when different neurons share nearly identical connection weights and this co-adaptation not only wastes computational resources but can also lead to overfitting 

To address this we use dropout in the n/w during training. dropout involves randomly disabling a fraction of neurons in a layer at each training step by setting their values to zero. this fraction is often referred to as the dropout rate

bias - specifies if the bias parameter should be trained this can be none, all or LoRA only 



In [7]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["query_key_value"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)


When we do the get peft model, we're wrapping the base model so that the update matrices which are the low rank decomposition matrices of Lora are added to their respective places 

So this is actually the part where we inject the parameters that we're going to be tuning into the model

In [8]:
model = get_peft_model(model, config)
print_trainable_parameters(model)

# total parameters- 255 M 

trainable params: 786432 || all params: 255125504 || trainable%: 0.3082529922214284


### Fine-tune the model

#### 1. Dataset

Let's load a common dataset, english quotes, to fine tune our model on famous quotes.

In the map function we feed the data through our tokenizer in order to get the machine readable tokens

In [19]:
from datasets import load_dataset

data = load_dataset("kotzeje/lamini_docs.jsonl", split="train")


In [12]:
# data = load_dataset("kotzeje/lamini_docs.jsonl")

# data.keys()


dict_keys(['train'])

In [20]:
data['question'][0:2]


['How can I evaluate the performance and quality of the generated text from Lamini models?',
 "Can I find information about the code's approach to handling long-running tasks and background jobs?"]

In [21]:
data['answer'][:2]


["There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generated text based on factors such as coherence, fluency, and relevance. It is recommended to use a combination of these metrics for a comprehensive evaluation of the model's performance.",
 'Yes, the code includes methods for submitting jobs, checking job status, and retrieving job results. It also includes a method for canceling jobs. Additionally, there is a method for sampling multiple outputs from a model, which could be useful for long-running tasks.']

Tokenizer is responsible for converting your input text into a format that the model can understand typically a sequence of integer token IDs

In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize_function(examples):
    if "question" in examples and "answer" in examples:
      text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
      text = examples["input"][0] + examples["output"][0]
    else:
      text = examples["text"][0]

    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(text,
                                 return_tensors="np",
                                 padding=True)

    max_length = min(tokenized_inputs["input_ids"].shape[1], 2048)
    
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(text,
                                 return_tensors="np",
                                 truncation=True,
                                 max_length=max_length)

    return tokenized_inputs


In [22]:
tokenized_dataset = data.map(tokenize_function,
                             batched=True,
                             batch_size=1,
                             drop_last_batch=True)


Map:   0%|          | 0/1400 [00:00<?, ? examples/s]

In [23]:
print(tokenized_dataset)


Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 1400
})


In [24]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, shuffle=True, seed=123)

print(split_dataset)


DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask'],
        num_rows: 1120
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask'],
        num_rows: 280
    })
})


#### 2. Training

Run the cell below to run the training! For the sake of the demo, we just ran it for few steps just to showcase how to use this integration with existing tools on the HF ecosystem.

- This is the part where we actually train the QLora parameters. Here we are defining the hyper-parameters in the training arguments. You can change the values of most of the parameters, however if you prefer the only hyper parameter that's really required is the output directory which specifies where to save your model at the end of each epoch 
the trainer will evaluate the metric and then save the training checkpoint 
- Note that this is just a walk through but if you want the best performing model you can even perform hyper parameter tuning to achieve optimal results. choosing the right hyper parameters can significantly affect the performance of the model.
- Also note that we're using the paged atomw 8-bit Optimizer. This is taking advantage of the paged optimizer concept that was also introduced in that Q Laura paper. This tool behaves as a mechanism to control the memory traffic. In instances where GPU is reaching memory capacity during data training, this feature intervenes. It automatically transfers data between the GPU and the CPU effectively averting memory related issues. This mechanism resembles the process in which a computer shuffles data between RAM and its disk storage when facing low memory scenarios. Paged optimizers harness this feature. So when GPU memory reaches its limit optimizer states are temporarily relocated to the CPU Ram freeing up space on the GPU. These states are then reloaded back onto the GPU memory as needed in the optimizer update step.


Now let's break down some of these hyper parameters 
- The per device train batch size - batch size for training
- Gradient accumulation steps is the number of steps to accumulate the gradients before performing an update
- The warm-up steps is the number of steps for learning rate warm-up 
- The max steps is the total number of steps for training 
- fp16 equals true is whether to use the 16-bit half Precision floating Point numbers for the training as opposed to the default 32-bit ones
- Logging steps is the number of steps between each logging event


In [27]:
import transformers
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

epochs = 3
trained_model_name = f"/home/raghu/DL/topics/LLM/Finetune-LLM/outputs/lamini_docs_{epochs}_epochs"
output_dir = trained_model_name

training_args = transformers.TrainingArguments(per_device_train_batch_size=8,
                                               gradient_accumulation_steps=4,
                                               warmup_steps=2,
                                               evaluation_strategy="epoch",
                                               learning_rate=1.0e-5,
                                               fp16=True,
                                               #logging_steps=1,
                                               output_dir=output_dir,
                                               num_train_epochs=epochs,
                                               optim="paged_adamw_8bit")

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    args=training_args,
    data_collator=data_collator)


Also we set model.config.use_cache = False

This line disables caching in the model configuration

Caching can speed up training by storing the model's past computations but it may produce warnings in some cases so it's being disabled here but it's recommended to enable it again when you're doing inferencing

The last line here trainer.train is the command that starts the training process


In [28]:
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()


You are using 8-bit optimizers with a version of `bitsandbytes` < 0.41.1. It is recommended to update your version as a major bug has been fixed in 8-bit optimizers.


  0%|          | 0/105 [00:00<?, ?it/s]

  0%|          | 0/35 [00:00<?, ?it/s]

{'eval_loss': 2.5501582622528076, 'eval_runtime': 3.8888, 'eval_samples_per_second': 72.002, 'eval_steps_per_second': 9.0, 'epoch': 1.0}


  0%|          | 0/35 [00:00<?, ?it/s]

{'eval_loss': 2.3736860752105713, 'eval_runtime': 3.9372, 'eval_samples_per_second': 71.116, 'eval_steps_per_second': 8.889, 'epoch': 2.0}


  0%|          | 0/35 [00:00<?, ?it/s]

{'eval_loss': 2.3215787410736084, 'eval_runtime': 3.9323, 'eval_samples_per_second': 71.206, 'eval_steps_per_second': 8.901, 'epoch': 3.0}
{'train_runtime': 159.8752, 'train_samples_per_second': 21.016, 'train_steps_per_second': 0.657, 'train_loss': 2.464066569010417, 'epoch': 3.0}


TrainOutput(global_step=105, training_loss=2.464066569010417, metrics={'train_runtime': 159.8752, 'train_samples_per_second': 21.016, 'train_steps_per_second': 0.657, 'train_loss': 2.464066569010417, 'epoch': 3.0})

#### 3. Save the model

Now you can use the save pre-trained method of the Lora model to save the Model and inference Laura only parameters locally

Alternatively you can use the push to HUB method to upload these parameters directly to the hugging face Hub 

In [29]:
# Take care of distributed/parallel training
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model
model_to_save.save_pretrained("/home/raghu/DL/topics/LLM/Finetune-LLM/outputs")


In [None]:
# # ===== Save the Model ===============
# save_dir = f'{output_dir}/final'
# trainer.save_model(save_dir)
# print("Saved model to:", save_dir)


# # ===== Load and test the model =======
# finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)
# finetuned_slightly_model.to(device)

#### 4. Load the model

In [30]:
lora_config = LoraConfig.from_pretrained('/home/raghu/DL/topics/LLM/Finetune-LLM/outputs')
model = get_peft_model(model, lora_config)


### Inference

To do inference, you can take some text, tokenize it, put it onto the GPU then feed it through the LLM to get the machine readable outputs

Then just convert those machine readable outputs to human readable outputs

In [31]:
model.config.use_cache = True

def inference(text, 
              model, 
              tokenizer, 
              max_input_tokens=1000, 
              max_output_tokens=100, 
              temperature=1.0):

    # Tokenize
    input_ids = tokenizer.encode(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=max_input_tokens
    )

    # Generate
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens,
        temperature = temperature,
        do_sample=True,
        top_p = 0.95
        
    )

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, 
                                                        skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer


In [32]:
test_question = split_dataset["test"][0]['question']
print("****Question input (test)****:", test_question)

print("****Finetuned slightly model's answer****: ")
print(inference(test_question, model, tokenizer))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


****Question input (test)****: Is it possible to fine-tune Lamini on a specific dataset for text generation in legal documents?
****Finetuned slightly model's answer****: 


I am trying to write code to detect if the code section of a document is using legal terminology. I am trying to do this in the context of a specific case and I don't know how to proceed to this. I will be using Lamini-Text and Lamini-Rng. I know that this is possible with Rng but I am not sure if


In [33]:
print("****Finetuned slightly model's answer*****: ")
print(inference("How can I evaluate the performance and quality of the generated text??", 
                model, tokenizer))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


****Finetuned slightly model's answer*****: 


A:

I'm just not sure what you're seeing, from the JFSE.




In [34]:
print("Finetuned slightly model's answer: ")
print(inference("Tell me about the Keras API?", model, tokenizer))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Finetuned slightly model's answer: 


A:

I've found the answer to my query problem.
This was a problem with this error.
In this line I was trying to type "for" with "for" that would not work for me because my code (Python 3.5) didn't match.
I changed the code as below.

I changed the code, and it works, it is just that I did not like it.


